It depends on the architecture. On ARM64, SHA-256 tends to be faster than BLAKE3. The reason is that most modern ARM64 CPUs have native SHA-256 instructions but lack an equivalent of AVX-512.
Furthermore, if your input files are large enough that parallelizing across multiple cores makes sense, then it's generally better to change your data model to eliminate the existence of the large inputs altogether.
For example, Git is somewhat primitive in that every file is a single object. In retrospect it would have been smarter to decompose large files into chunks using a Content Defined Chunking (CDC) algorithm, and model large files as a manifest of chunks. That way you get better deduplication. The resulting chunks can then be hashed in parallel, using a single-threaded algorithm.
As far as I know, most CDC schemes require a single-threaded pass over the whole file to find the chunk boundaries? (You can try to "jump to the middle", but usually there's an upper bound on chunk length, so you might need to backtrack depending on what you learn later about the last chunk you skipped?) The more cores you have, the more of a bottleneck that becomes.
You can always use a divide-and-conquer strategy to compute the chunks: chunk both halves of the file independently. Once that’s done, you redo the chunking from the midpoint of the file forward until it resynchronizes with the chunks obtained previously.
That’s great! People who do that are often inconsiderate of how it affects others. First of all, it generates unnecessary noise, which is annoying for neighbors who are still trying to sleep. Pedestrians/cyclists also need to breathe those exhaust gases.
Pretty much, given that any decent pthreads implementation will offer an adaptive mutex. Unless you really need a mutex the size of a single bit or byte (which likely implies false sharing), there's little reason to ever use a pure spinlock, since a mutex with adaptive spinning (up to context switch latency) gives you the same performance for short critical sections without the disastrous worst-case behavior.
Some people don't want to block for a microsecond when their lock goes 1ns over your adaptive mutex's spin deadline. That kind of jitter is unacceptable.
General heuristics only get you so far, and at the limit they carry their own overhead compared to a tailored solution built with knowledge of your usage and data access patterns. The cases where this makes a practical difference for higher-level apps are rare, but they exist.
I’ve also been doing lots of experimenting with Content Defined Chunking since last year (for https://bonanza.build/). One of the things I discovered is that the most commonly used algorithm FastCDC (also used by this project) can be improved significantly by looking ahead. An implementation of that can be found here:
Did you compare it to Buzhash? I assume gearhash is faster given the simpler per-iteration structure. (Also, rand/v2's seeded generators might be better for gear init than mt19937.)
Yeah, GEAR hashing is simple enough that I haven't considered using anything else.
Regarding the RNG used to seed the GEAR table: I don't think it actually makes that much of a difference. You only use it once to generate 2 KB of data (256 64-bit constants). My suspicion is that using some nothing-up-my-sleeve numbers (e.g., the first 2048 binary digits of π) would work as well.
In my case I observed a ~2% reduction in data storage when attempting to store and deduplicate various versions of the Linux kernel source tree (see link above). But that also includes the space needed to store the original version.
If we take that out of the equation and only measure the size of the additional chunks being transferred, it's a reduction of about 3.4%. So it's not an order of magnitude difference, but not bad for a relatively small change.
Yeah, that's true. Having some kind of chunking algorithm that's content/file format aware could make it work even better. For example, it makes a lot of sense to chunk source files at function/scope boundaries.
In my case I need to ensure that all producers of data use exactly the same algorithm, as I need to look up build cache results based on Merkle tree hashes. That's why I'm intentionally focusing on having algorithms that are not only easy to implement, but also easy to implement consistently. I think that MaxCDC implementation that I shared strikes a good balance in that regard.
There are plenty of larger ones and plenty of ones that used the date as the version, but I was mainly curious about packages that followed semver.
Any package version that didn't follow the x.y.z format was excluded, and any package that had fewer published versions than its largest version number was excluded (e.g. a package at version 1.123.0 should have at least 123 published versions).
Well, we are looking at npm packages, where every package is supposed to follow semantic versioning. The fact that we don't see dates as version numbers means everyone is being a good citizen.
Off the top of my head, Cloudflare uses a somewhat date-based versioning scheme for their Workers types package, but it makes sense in context: you define compatibility dates for a Worker when you set it up, which automatically enables/disables potentially breaking features in the API.
Exactly! At the same time, you also don't want to call into the kernel's internal malloc() to allocate the data structures needed to track the queue of blocked threads every time a thread blocks on a contended lock.
To prevent that, many operating systems allocate these 'queue objects' whenever threads are created and will attach a pointer to it from the thread object. Whenever a thread then stumbles upon a contended lock, it will effectively 'donate' this queue object to that lock, meaning that every lock having one or more waiters will have a linked list of 'queue objects' attached to it. When threads are woken up, they will each take one of those objects with them on the way out. But there's no guarantee that they will get their own queue object back; they may get shuffled! So by the time a thread terminates, it will free one of those objects, but that may not necessarily be the one it created.
I think the first operating system to use this method was Solaris. There they called these 'queue objects' turnstiles. The BSDs adopted the same approach, and kept the same name.
This is a really dumb question from someone unfamiliar with the kernel's futex implementation, so bear with me. In userspace locks (e.g. in parking_lot), wait queues can be threaded through nodes allocated on the stack of each waiting thread, so no static or dynamic allocation of wait queue nodes is necessary. Is it possible to allocate wait queue nodes on the kernel stacks of waiting threads as well?
> Is it possible to allocate wait queue nodes on the kernel stacks of waiting threads as well?
Yes, this is exactly what's done. The queue node is declared as a local variable in kernel code, i.e. on the kernel stack, and its address is passed to the wait function, which links it into the waitqueue and then blocks in the scheduler.
> Calls to bind(), connect(), listen(), and accept() can be used to initiate and accept connections in much the same way as with TCP, but then things diverge a bit. [...] The sendmsg() and recvmsg() system calls are used to carry out that setup
I wish the article explained why this approach was chosen, as opposed to adding a dedicated system call API that matches the semantics of QUIC.
The point is that the laws are not about efficiency. They merely serve to send California pollution to the poor parts of Nevada and charge a premium for it.