My point is that the "huge win" is expressed in terms of a bogus and misleading baseline. The article moves immediately from the worst possible lock-based implementation to a pretty bad atomics-based implementation. The final punchline of the article is expressed as a ratio of the bad baseline. To make an honest conclusion, the article should also explore better ways of using the locks.



It's not a "bogus and misleading baseline".

It's precisely the way we teach people how to build thread-safe systems. And we teach them to do it that way because we've learned from experience that letting them code up their own custom synchronization primitives leads to immense woe and suffering.

(and it's not slow because of the C++ mutex implementation, either - I tested a C/pthreads version, and it was the same speed as the C++ version)


The GNU libstdc++ std::mutex is little more than a thin wrapper around pthread_mutex_lock, so that's not a surprise.

I really don't understand what you are saying about not using custom primitives. The whole article is "YOLO your own synchronization" and it fails to grapple with the subtleties. An example of the unaddressed complexity: use of acquire-release semantics for head_ and tail_ atomics imposes no ordering whatsoever between observations of head_ and tail_. The final solution has four atomics that use acquire-release and does not discuss the fact that threads may observe the values of these four things in very surprising order. The issue is so complex that I consider this 50-page academic paper to be the bare minimum survey of the problem that a programmer should thoroughly understand before they even consider using atomics.

https://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf
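The "no ordering between observations" point is the classic store-buffering (SB) litmus test. This is a generic sketch (not code from the article): each thread does a release store to one variable and an acquire load of the other, and the C++11 model still permits both loads to return 0 in the same run; forbidding that outcome requires seq_cst.

```cpp
#include <atomic>
#include <thread>
#include <utility>

// Store-buffering litmus test. With acquire/release only, there is no
// single total order over the two unrelated atomics, so both threads
// may observe the other's store as not-yet-happened.
std::pair<int, int> store_buffering_once() {
  std::atomic<int> x{0}, y{0};
  int r1 = -1, r2 = -1;
  std::thread t1([&] {
    x.store(1, std::memory_order_release);
    r1 = y.load(std::memory_order_acquire);  // may still read 0
  });
  std::thread t2([&] {
    y.store(1, std::memory_order_release);
    r2 = x.load(std::memory_order_acquire);  // may still read 0
  });
  t1.join();
  t2.join();
  // All four outcomes are allowed, including (0, 0). Making all four
  // accesses memory_order_seq_cst would forbid (0, 0).
  return {r1, r2};
}
```

Run it in a loop on a weakly ordered machine (ARM, POWER) and the (0, 0) outcome shows up; the same surprise applies to independent observations of head_ and tail_.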


Could you macroexpand your claims a little here?

"An example of the unaddressed complexity: use of acquire-release semantics for head_ and tail_ atomics imposes no ordering whatsoever between observations of head_ and tail_."

The Acquire / Release in version 4 looks right to me, but I'd like to know if I'm missing something.

Also, while your linked paper is good background for what the C++11 memory model is intended to abstract over, it's almost entirely its own thing with a mountain of complexity.

Somebody else in this comment section brought atomics knowledge to an Acquire/Release fight and it didn't go well.

As a starting introduction I'd probably recommend this:

https://www.amazon.com.au/C-Concurrency-Action-Practical-Mul...


I think he's complaining that because the head_ and tail_ loads in push/pop are relaxed, rather than also being acquire, they can be reordered relative to the acquire tail_ and head_ loads respectively. I don't believe this impacts the correctness of the logic, but I could be missing something.
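For concreteness, here is a minimal SPSC ring buffer with the ordering split being discussed. This is a reconstruction for illustration, not the article's exact code: each side loads its own index relaxed (only it ever stores to that index) and loads the other side's index with acquire, pairing with the other side's release store.

```cpp
#include <array>
#include <atomic>
#include <cstddef>

// Single-producer/single-consumer ring buffer sketch. Capacity is N-1:
// one slot is sacrificed to distinguish "full" from "empty".
template <typename T, std::size_t N>
class RingBuffer {
 public:
  bool push(const T& v) {  // called only by the producer thread
    // Relaxed is enough for our own index: only this thread stores it.
    const std::size_t head = head_.load(std::memory_order_relaxed);
    const std::size_t next = (head + 1) % N;
    // Acquire pairs with the consumer's release store to tail_, so the
    // consumer's read of the slot happens-before we overwrite it.
    if (next == tail_.load(std::memory_order_acquire)) return false;  // full
    buffer_[head] = v;
    // Release publishes the buffer_[head] write to the consumer.
    head_.store(next, std::memory_order_release);
    return true;
  }

  bool pop(T& out) {  // called only by the consumer thread
    const std::size_t tail = tail_.load(std::memory_order_relaxed);
    if (tail == head_.load(std::memory_order_acquire)) return false;  // empty
    out = buffer_[tail];
    tail_.store((tail + 1) % N, std::memory_order_release);
    return true;
  }

 private:
  std::array<T, N> buffer_{};
  std::atomic<std::size_t> head_{0};
  std::atomic<std::size_t> tail_{0};
};
```

The relaxed load is safe here only because each index has a single writer; the cross-thread handoff rides entirely on the acquire/release pair.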

I don't actually get your point.

You dismissed the standard lock-guarded data structure as a "bogus comparison", despite it being the way every programmer is taught to write multi-threaded code.

Now the more you write, the more you seem to make the case that (a) normal programmers shouldn't be writing code like this, and (b) there are significant speedups possible if someone who knows what they're doing *does* write a highly tuned lock-free library.


The easy speedup is to use 2 mutexes, one that protects head and tail_cached, and the other that protects tail and head_cached, and align them so the two groups don't share a cache line (avoiding false sharing). In other words, take the RingBufferV5 from the article and define the class like this:

  std::array<T, N> buffer_;
  alignas(64) absl::Mutex hmu_;  // protects head_ and tail_cached_
  std::size_t head_{0};
  std::size_t tail_cached_{0};

  alignas(64) absl::Mutex tmu_;  // protects tail_ and head_cached_
  std::size_t tail_{0};
  std::size_t head_cached_{0};
Then change the code to forget the atomics and just use the locks. On my system this is more than ten times faster than the baseline naïve thread-safe RingBufferV2. That's what I mean about using a bogus baseline.
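A sketch of what "just use the locks" could look like, with std::mutex standing in for absl::Mutex and assuming a single producer and single consumer as in the article. The cached copies keep the fast path on one lock; the lock order is consistently hmu_ before tmu_ (push nests them, pop's slow path never holds both), so there is no deadlock.

```cpp
#include <array>
#include <cstddef>
#include <mutex>

template <typename T, std::size_t N>
class TwoLockRingBuffer {
 public:
  bool push(const T& v) {
    std::lock_guard<std::mutex> hl(hmu_);
    const std::size_t next = (head_ + 1) % N;
    if (next == tail_cached_) {
      // Slow path: refresh the cached tail_ under its own lock.
      std::lock_guard<std::mutex> tl(tmu_);
      tail_cached_ = tail_;
      if (next == tail_cached_) return false;  // really full
    }
    buffer_[head_] = v;
    head_ = next;
    return true;
  }

  bool pop(T& out) {
    std::unique_lock<std::mutex> tl(tmu_);
    if (tail_ == head_cached_) {
      // Slow path: drop tmu_, snapshot head_ under hmu_, then retake
      // tmu_, preserving the hmu_-before-tmu_ lock order.
      tl.unlock();
      std::size_t h;
      {
        std::lock_guard<std::mutex> hl(hmu_);
        h = head_;
      }
      tl.lock();
      head_cached_ = h;
      if (tail_ == head_cached_) return false;  // really empty
    }
    out = buffer_[tail_];
    tail_ = (tail_ + 1) % N;
    return true;
  }

 private:
  std::array<T, N> buffer_{};

  alignas(64) std::mutex hmu_;  // protects head_ and tail_cached_
  std::size_t head_{0};
  std::size_t tail_cached_{0};

  alignas(64) std::mutex tmu_;  // protects tail_ and head_cached_
  std::size_t tail_{0};
  std::size_t head_cached_{0};
};
```

The point is that each side usually touches only its own lock and cache line, which is where most of the contention in the naïve single-mutex version comes from.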


