izabera's comments | Hacker News

they're generally pretty but they should really hide the cursor, it looks off-putting in basically all cases


Agreed! Known bug that will get squashed...


the secret is to keep things ˢᵐᵒˡ


> Unlike other apps or cookbooks, we’ll teach you the hows and whys of cooking.

this is literally like every other app or cookbook


atomics aren't free even without contention. the slogan of the language is "you don't pay for what you don't use", and it's really not great that there's no non-atomic refcount in the standard. the fact that it is atomic by default has also led people to assume guarantees that it doesn't provide, which was trivially predictable when the standard first introduced it.
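for illustration, the kind of thing people end up hand rolling instead (a bare-bones hypothetical sketch, not anything from a real library):

    // hypothetical single-threaded refcounted pointer: plain counter, no atomics
    template <class T>
    struct local_shared_ptr {
        T*    ptr   = nullptr;
        long* count = nullptr;                    // plain long, not std::atomic<long>

        explicit local_shared_ptr(T* p) : ptr(p), count(new long(1)) {}
        local_shared_ptr(const local_shared_ptr& o) : ptr(o.ptr), count(o.count) {
            ++*count;                             // plain increment, no lock prefix
        }
        local_shared_ptr& operator=(const local_shared_ptr&) = delete; // kept minimal
        ~local_shared_ptr() {
            if (--*count == 0) { delete ptr; delete count; }
        }
        T& operator*() const { return *ptr; }
    };

obviously only safe as long as all copies stay on one thread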


OP specifically mentioned contention, though -- not marginally higher cost of atomic inc/dec vs plain inc/dec.

> For our use case, we in fact do not use std::shared_ptr in our implementation, but instead a single-threaded shared_ptr-like class that has no atomics (to avoid cross-core contention).

A single-threaded program will not have cross-core contention whether it uses std::atomic<> refcounts or plain integer refcounts, period. You're right that non-atomic refcounts can be anywhere from somewhat cheaper to a lot cheaper than atomic refcounts, depending on the platform. But that is orthogonal to cross-core contention.


> not marginally higher cost of atomic inc/dec vs plain inc/dec.

Note that the difference is not so marginal, and it is not just in the hardware instructions: the non-atomic operations generally allow for more optimizations by the compiler.


The actual intrinsic is like 8-9 cycles on Zen4 or Ice Lake (vs 1 for plain add). It's something if you're banging on it in a hot loop, but otherwise not a ton. (If refcounting is hot in your design, your design is bad.)

It's comparable to like, two integer multiplies, or a single integer division. Yes, there is some effect on program order.
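For concreteness, a toy comparison of the two (illustrative only, not taken from any real refcount implementation):

    #include <atomic>

    long plain_count = 0;
    std::atomic<long> atomic_count{0};

    // typically a single add/inc instruction on x86-64; the compiler is also
    // free to coalesce or hoist several of these
    void bump_plain()  { ++plain_count; }

    // typically a lock add/xadd on x86-64; compilers are much more conservative
    // about combining or reordering around it
    void bump_atomic() { atomic_count.fetch_add(1, std::memory_order_relaxed); }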


Can’t you have cross core contention just purely because of other processes doing atomics that happen to have a cache line address collision in the lock broadcast?


Related to this, GNU's libstdc++ shared_ptr implementation actually opts not to use atomic arithmetic when it infers that the program is not using threads.


I had never heard of this, so I went to check the source, and it really does exist: https://codebrowser.dev/llvm/include/c++/11/ext/concurrence....


The code you linked is a compile-time configuration option, which doesn't quite match "infer" IMO. I think GP is thinking of the way that libstdc++ basically relies on the linker to tell it whether libpthread is linked in and skips atomic operations if it isn't [0].

[0]: https://snf.github.io/2019/02/13/shared-ptr-optimization/
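If I'm reading that post right, the trick is roughly the following (a hand-written sketch, not the actual libstdc++ code; the exact symbol probed varies by platform and glibc version):

    /* hypothetical sketch: declare a pthread function as a weak symbol; if
       libpthread was never linked in, the reference resolves to null, so the
       runtime check below reports "single-threaded" and the atomic ops can be
       skipped (signature assumes pthread_t is unsigned long, as on Linux) */
    extern "C" int pthread_cancel(unsigned long) __attribute__((weak));

    static bool probably_single_threaded()
    {
        return pthread_cancel == nullptr;
    }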


It's a compile-time flag which is defined when libpthread is linked into the binary.


Sure, but I think that's independent of what eMSF was describing. From libgcc/gthr.h:

    /* If this file is compiled with threads support, it must
           #define __GTHREADS 1
       to indicate that threads support is present.  Also it has define
       function
         int __gthread_active_p ()
       that returns 1 if thread system is active, 0 if not.
I think the mechanism eMSF was describing (and the mechanism in the blogpost I linked) corresponds to __gthread_active_p().

I think the distinction between the two should be visible in some cases - for example, what happens for shared libraries that use std::shared_ptr and don't link libpthread, but are later used with a binary that does link libpthread?


Hm, not sure. I can see that shared_ptr::_M_release [0] is implemented in terms of __exchange_and_add_dispatch [1], which in turn is implemented in terms of __is_single_threaded [2]. __is_single_threaded will use __gthread_active_p iff __GTHREADS is not defined and the <sys/single_threaded.h> header is not included.

The implementation of __gthread_active_p is indeed a runtime check [3], which AFAICS applies only to single-threaded programs. Perhaps the shared-library use case also fits here?

Strange optimization IMHO, so I wonder what the motivation behind it was. The cost function being optimized in this case depends on WORD being atomic [4] without actually using the atomics [5].

[0] https://codebrowser.dev/llvm/include/c++/11/bits/shared_ptr_...

[1] https://codebrowser.dev/llvm/include/c++/11/ext/atomicity.h....

[2] https://codebrowser.dev/llvm/include/c++/11/ext/atomicity.h....

[3] https://codebrowser.dev/kde/include/x86_64-linux-gnu/c++/11/...

[4] https://codebrowser.dev/llvm/include/c++/11/ext/atomicity.h....

[5] https://codebrowser.dev/llvm/include/c++/11/ext/atomicity.h....


> Implementation of __gthread_active_p is indeed a runtime check [3] which AFAICS applies only to single-threaded programs. Perhaps the shared-library use-case also fits here?

The line you linked is for some FreeBSD/Solaris versions which appear to have some quirks with the way pthreads functions are exposed in their libc. I think the "normal" implementation of __gthread_active_p is on line 248 [0], and that is a pretty straightforward check against a weak symbol.

> Strange optimization IMHO so I wonder what was the motivation behind it.

I believe the motivation is to avoid needing to pay the cost of atomics when there is no parallelism going on.

> The cost function being optimized in this case is depending on WORD being atomic [4] without actually using the atomics [5].

Not entirely sure what you're getting at here? The former is used for single-threaded programs so there's ostensibly no need for atomics, whereas the latter is used for non-single-threaded programs.

[0]: https://codebrowser.dev/kde/include/x86_64-linux-gnu/c++/11/...


> Not entirely sure what you're getting at here?

> I believe the motivation is to avoid needing to pay the cost of atomics when there is no parallelism going on.

Obviously yes. What I am wondering is what benefit it brings in practice. A single-threaded program with shared_ptrs using atomics vs shared_ptrs using plain WORDs seems like a non-problem to me - e.g. I doubt it has a measurable performance impact. Atomics only slow the program down when there is contention, and single-threaded programs can't have contention.


> What I am wondering is what benefit it brings in practice. A single-threaded program with shared_ptrs using atomics vs shared_ptrs using plain WORDs seems like a non-problem to me - e.g. I doubt it has a measurable performance impact.

I mean, the blog post basically starts with an example where the performance impact is noticeable:

> I found that my Rust port of an immutable RB tree insertion was significantly slower than the C++ one.

And:

> I just referenced pthread_create in the program and the reference count became atomic again.

> Although uninteresting to the topic of the blog post, after the modifications, both programs performed very similarly in the benchmarks.

So in principle an insert-heavy workload for that data structure could see a noticeable performance impact.

> Atomics only slow the program down when there is contention, and single-threaded programs can't have contention.

Not entirely sure I'd agree? My impression is that while uncontended atomics are not too expensive, they aren't exactly free compared to the corresponding non-atomic instructions. For example, Agner Fog's instruction tables [0] state:

> Instructions with a LOCK prefix have a long latency that depends on cache organization and possibly RAM speed. If there are multiple processors or cores or direct memory access (DMA) devices, then all locked instructions will lock a cache line for exclusive access, which may involve RAM access. A LOCK prefix typically costs more than a hundred clock cycles, even on single-processor systems. This also applies to the XCHG instruction with a memory operand.

And there's this blog post [1], which compares the performance of various concurrency mechanisms/implementations including uncontended atomics and "plain" code and shows that uncontended atomics are still slower than non-atomic operations (~3.5x if I'm reading the raw data table correctly).

So if the atomic instruction is in a hot loop then I think it's quite plausible that it'll be noticeable.

[0]: https://www.agner.org/optimize/instruction_tables.pdf

[1]: https://travisdowns.github.io/blog/2020/07/06/concurrency-co...


Thanks, I'll revisit your comment. Some interesting things you shared.


People assume non-existent guarantees such as?


"is shared_ptr thread safe?" is a classic question asked thousands of times. the answer by the way is "it's as thread safe as a regular pointer"


maybe some car pool services for the most frequent routes could run at regular intervals. there could even be some predefined stops to efficiently batch many people getting on and off the car pool at once


How does the old-school bus know when nobody needs a ride?


You could even have special lanes dedicated to them so you could move people even more efficiently


i'm struggling to imagine many negative effects on society caused by the specific papers in this list


Public policies were made (or justified) based on some of this research. People used this "settled science" to make consequential decisions.

Stereotype threat for example was widely used to explain test score gaps as purely environmental, which contributed to the public seeing gaps as a moral emergency that needed to be fixed, leading to affirmative action policies.


To be honest, whether they had a "study" proving it or not I think those things would have happened anyway.

It's just a question of power in the end. And even if you could question the legitimacy of the "studies" the people in power use to justify their rule, they would produce a dozen more flawed justifications before you could even produce one serious debunking. And they wouldn't even have to give much visibility to your rebuttal, so you would need broad cultural and political support.

Psychology exists mostly as a new religion; it serves as a tool of justification for people in power, and it is used in much the same way as the Bible.

It should not be surprising to anyone that much of it isn't replicable (nor falsifiable in the first place) and when it is, the effects are so close to randomness that you can't even be sure of what it means. This is all by design, you need to keep people confused to rule over them. If they start asking questions you can't answer, you lose authority and legitimacy. Psychology is the tool that serves the dominant ideology that is used to "answer" those questions.


random example, may or may not come from a real situation that just happened:

    - other team opens jira ticket requesting a new type of encabulator
    - random guy who doesn't know anything about how the current encabulator system works picks up the ticket for whatever reason
    - 10 minutes later opens a 2000 line vibe coded pr with the new encabulator type and plenty of unit tests
    - assigns ticket to the person who designed the current encabulator system for review
    - encabulator system designer finds out about this ticket for the first time this way
    - "wait wtf what is this?  why are we doing any of this?  the current system is completely generic and it takes arbitrary parameters?"
    - waste an hour talking to the guy and to the team that requested this
    - explain they could just use the parameters, close pr and ticket
it's so incredibly productive


deepseek is from china and all their papers have been very well received


tested it on this https://github.com/izabera/cube.bash/tree/timep

i ran `source /path/to/timep.bash; timep ./cube.bash "R U R' U' R' F R2 U' R' U' R U R' F'"`

without profiling https://i.imgur.com/JE93ony.png

with profiling there's a bunch of errors, and overall it's ~300x slower https://i.imgur.com/qif7Qp3.png so i'm sceptical of all the efficiency claims


It took some time, but I figured out what was causing the errors when profiling cube.bash - the code assigns huge (some >400,000 elements) associative arrays in a single command, and timep was taking the full (several MB) $BASH_COMMAND from those and trying to do stuff with it, which bash couldn't handle.

Try profiling cube.bash with the timep version in the "timep_testing" branch (https://github.com/jkool702/timep/blob/timep_testing/timep.b...). This is my "in development" branch that contains a handful of improvements, one of which is that timep will truncate the BASH_COMMAND at 16kb. On my machine at least that timep version successfully profiles cube.bash.

Now - regarding efficiency: timep's overhead is more-or-less constant per command (or, more specifically, per DEBUG trap fire). What "percent overhead" this equates to depends entirely on how long the average command being profiled takes to run. And for things like cube.bash, which is virtually all builtin commands that don't fork, that time is low - really pretty remarkably low.

Looking at the "full" profile and stack trace that timep generated, running cube.bash involved running around 7150 commands. I also profiled a modified version that stops at the ` echo scramble: "$@"` line - that one was about 3350 commands. Meaning the timed part of cube.bash (where it actually solves the cube) represents about 3800 bash commands. This part of the code (when run by timep) took 870 ms or so on my system, which puts the per-command overhead at under 1/4 of 1 ms (about 230 microseconds, to be precise).

230 microseconds per command is best-in-class for a trap-based profiler - many take an order of magnitude (or more) longer and don't collect CPU times or the code-structure metadata needed to reconstruct the full call stack. To put it in perspective, bash (on my system at least) has 1-2 ms of overhead every time it runs an external binary. Your cube.bash is just impressively, stupidly fast, so much so that the 230 microseconds still introduces considerable overhead.
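For anyone wondering what "trap-based" means here: the core mechanism is a DEBUG trap that fires before each simple command. Something like this stripped-down sketch (illustration only - timep itself does far more bookkeeping, this just logs raw timestamps):

    # minimal DEBUG-trap timing hook (needs bash 5.0+ for $EPOCHREALTIME);
    # deltas between consecutive timestamps approximate per-command runtime
    # plus the trap's own overhead
    exec {__plog}>/tmp/cmdlog.$$
    trap 'printf "%s %s\n" "$EPOCHREALTIME" "$BASH_COMMAND" >&"${__plog}"' DEBUG

    # ... code being profiled runs here, one log line per command ...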


thanks for taking the time to make it work on my code :)


no problem. thanks for helping me discover and fix a bug in timep that all my test cases missed.

Sorry the overhead is too high (relative to cube.bash's insanely low avg command runtime of something like 1 microsecond) for it to be really useful as a profiler here... hopefully it'll still prove useful strictly for mapping the code execution/structure and seeing how many times a given function got called and things of that nature.


not trying to be a hater but how is 100mb/s high performance in 2025? that's as performant as a 20 year old hdd


Honestly, the system is tuned for storage efficiency rather than speed, but these configurations are tunable and you can use the benchmarks as a reference for tuning. https://github.com/trvon/yams/blob/main/docs/benchmarks/perf...

