Binary programs are executed on the CPU, but the program file is an archive with sections, and usually only one of them is the program; the others are all metadata. The CPU isn't capable of understanding the program file at all. Linux has to establish the conditions under which the program runs, which means, at a minimum, establishing the address space in which the program counter lies and then jumping to that address. The instructions for how to do that are in the metadata sections of the ELF executable.
not too bad an explanation, though the 'usually' might be clarified: an ELF file 'can have sections marked as executable' (tho ofc i get not wanting to get into segment flags :p), and also a program is potentially cobbled together from many of these ELF files. in most cases the single file is useless ('most cases' as in binaries provided by a standard linux distro, not 'producible binaries').
The volumetric rate for electricity is almost totally irrelevant. California's bills are dominated by the fixed cost of the grid, and we use very little grid power compared to other states, so the volumetric rate has to be really high as a consequence. Electric power bills in California are in the middle of the pack compared to the other states, almost exactly the same as Texas and less than ten other states.
This isn’t convincing to me - our electricity rates in 2002 were 8 cents per kWh and none of those other facts were different back then. Inflation hasn’t done 5x since then, but our electricity rates have. We are being ripped off by the three companies that CPUC allows to do it.
They should have imposed a ruinous penalty on the utilities for their wildfire liability, and seized their assets for non-payment of it. And I say that as not a liberal and not a socialist.
People who live in cities that have municipal power are paying HALF the rates people with PG&E and SCE are.
> none of those other facts were different back then.
That is not correct.
Small-scale solar capacity in California today stands at 19GW. In 2002, it was 45MW. That is a more than 400-fold increase. Title 20 and 24 rules have dramatically cut residential electric demand, especially lighting, which has fallen by over 90%. Population grew 13% but peak grid load has not grown at all. But all that grid stuff to serve the new residents had to be installed and amortized. Many, many things have changed since 2002.
California has high volumetric rates, but mostly that is because it has much more distributed generation than any other state, uses far less grid power, so the grid rates are dominated by fixed grid costs. Actual monthly electric bills in California are not remarkable at all. According to the EIA the typical residential electric bill in California is almost exactly the same as Texas: $174.59 vs. $173.94.
I don't see how that could possibly be true. Sounds like a low-ball estimate.
Also i wish to point out that the "tcmalloc" being used as a baseline in these performance claims is Ye Olde tcmalloc, the abandoned and now community-maintained version of the project. The current version of tcmalloc is a completely different thing that the mimalloc-bench project doesn't support (correctly; I just checked).
Fair points on both - the 5ns is the L2 hit case. I should have stated the range (30-60ns?) instead of the best case. And yes, fixing the tcmalloc case is on my list - thanks for pointing that out. And also to be clear, the goal was never to beat jemalloc or tcmalloc on raw throughput. I wanted to show that one doesn't have to give up competitive performance to get explicit heaps, hard caps and teardown semantics.
That makes sense. I have a long-standing beef with the mimalloc-bench people because they made a bunch of claims in their paper but as recently as 2022 they were apparently not aware of the distinction, and the way they tried to shoehorn tcmalloc into their harness is plain broken. That is not a problem caused by your fine project.
I keep an eye on jaq, but there are some holes in the story. jaq 3.0 is faster than Linux distro builds of jq, but jq built correctly is faster than jaq. As far as I can tell the performance reputation of jq is caused by bad distro packaging.
I guess A/V pros are used to getting screwed constantly, but it must be really irritating to face the prospect of eventually having to move PCI add-in cards to TB5 enclosures that cost $1000 per slot.
I see a retail 3-slot enclosure for $1800, so a lot cheaper than you think. They can move to a Studio and buy a box for less than a Mac Pro replacement would cost.
It's lock-free because it uses ordered loads and stores, which is also how you implement locks. I find the semantic distinction unconvincing. The post is really about how slow the default STL mutex implementation is.
That's what "lock-free" means. You still need to use the hardware mechanisms provided for atomicity.
The whole point of lock-free data structures and algorithms is that sometimes you can do better by using these atomic operations inside your own code, rather than using a one-size-fits-all mutex based on those same atomic operations.
(Note that I say "sometimes". Too many people believe that lock-free structures are always faster; as always, your mileage may vary. In this case it's a huge win, to the point where I would bet it almost always moves the bottleneck to the code actually using the ring buffer.)
My point is that the "huge win" is expressed in terms of a bogus and misleading baseline. The article moves immediately from the worst possible lock-based implementation to a pretty bad atomics-based implementation. The final punchline of the article is expressed as a ratio of the bad baseline. To make an honest conclusion, the article should also explore better ways of using the locks.
It's precisely the way we teach people how to build thread-safe systems. And we teach them to do it that way because we've learned from experience that letting them code up their own custom synchronization primitives leads to immense woe and suffering.
(and it's not slow because of the C++ mutex implementation, either - I tested a C/pthreads version, and it was the same speed as the C++ version)
The GNU libstdc++ STL mutex is nothing but pthread_lock, so that's not a surprise.
I really don't understand what you are saying about not using custom primitives. The whole article is "YOLO your own synchronization" and it fails to grapple with the subtleties. An example of the unaddressed complexity: use of acquire-release semantics for head_ and tail_ atomics imposes no ordering whatsoever between observations of head_ and tail_. The final solution has four atomics that use acquire-release and does not discuss the fact that threads may observe the values of these four things in very surprising order. The issue is so complex that I consider this 50-page academic paper to be the bare minimum survey of the problem that a programmer should thoroughly understand before they even consider using atomics.
"An example of the unaddressed complexity: use of acquire-release semantics for head_ and tail_ atomics imposes no ordering whatsoever between observations of head_ and tail_."
The Acquire / Release in version 4 looks right to me, but I'd like to know if I'm missing something.
Also, while your linked paper is good background for what the C++11 memory model is intended to abstract over, it's almost entirely its own thing with a mountain of complexity.
Somebody else in this comment section brought atomics knowledge to an Acquire/Release fight and it didn't go well.
As a starting introduction I'd probably recommend this:
I think he's complaining that because the head_ and tail_ loads in push/pop are relaxed, rather than also being acquire, they can be reordered relative to the acquire tail_ and head_ loads respectively. I don't believe this impacts the correctness of the logic, but I could be missing something.
You dismissed the standard lock-guarded data structure as a "bogus comparison", despite it being the way every programmer is taught to write multi-threaded code.
Now the more you write, the more you seem to make the case that (a) normal programmers shouldn't be writing code like this, and (b) there are significant speedups possible if someone who knows what they're doing *does* write a highly tuned lock-free library.
The easy speedup is to use 2 mutexes, one that protects head and tail_cached, and the other that protects tail and head_cached, and align so they don't interfere. In other words, take the RingBufferV5 from the article and define the class like this:
Then change the code to forget the atomics and just use the locks. On my system this is more than ten times faster than the baseline naïve thread-safe RingBufferV2. That's what I mean about using a bogus baseline.
There are real practical implications of both the producer and consumer mutating the same cache line to take a lock that is fundamentally avoided by this "lock-free" design. It isn't meaningless.
That only explains the last stage. In order to steelman the mutex alternative, everything before "further optimization" should have used 2 critical sections. That would give a realistic baseline.
There are some microarchitectural resources that are either statically divided between running threads or "cooperatively" fought over. If you don't need to hide cache miss latency (which is the only thing hyperthreading is really good at), you're probably better off disabling the supernumerary threads.