It depends on the architecture. On ARM64, SHA-256 tends to be faster than BLAKE3. The reason is that most modern ARM64 CPUs have native SHA-256 instructions but lack an equivalent of AVX-512.
Furthermore, if your input files are large enough that parallelizing across multiple cores makes sense, then it's generally better to change your data model to eliminate the existence of the large inputs altogether.
For example, Git is somewhat primitive in that every file is a single object. In retrospect it would have been smarter to decompose large files into chunks using a Content Defined Chunking (CDC) algorithm, and model large files as a manifest of chunks. That way you get better deduplication. The resulting chunks can then be hashed in parallel, using a single-threaded algorithm.
As far as I know, most CDC schemes require a single-threaded pass over the whole file to find the chunk boundaries? (You can try to "jump to the middle", but usually there's an upper bound on chunk length, so you might need to backtrack depending on what you learn later about the last chunk you skipped?) The more cores you have, the more of a bottleneck that becomes.
You can always use a divide-and-conquer strategy to compute the chunks: chunk both halves of the file independently. Once that’s done, you redo the chunking from the midpoint of the file forward until it resynchronizes with the chunks obtained previously.
That’s great! People who do that are often inconsiderate of how it affects others. First of all, it generates unnecessary noise, which is annoying for neighbors who are still trying to sleep. Pedestrians/cyclists also need to breathe those exhaust gases.
Pretty much, given that any decent pthreads implementation will offer an adaptive mutex. Unless you really need a mutex the size of a single bit or byte (which likely implies false sharing), there's little reason to ever use a pure spinlock, since a mutex with adaptive spinning (up to context switch latency) gives you the same performance for short critical sections without the disastrous worst-case behavior.
Some people don't want to block for a microsecond when their lock goes 1ns over your adaptive mutex's spin deadline. That kind of jitter is unacceptable.
General heuristics only get you so far, and at the limit they carry their own overhead compared to a tailored solution built with knowledge of your usage and data access patterns. The cases where this makes a practical difference for higher-level apps are rare, but they exist.
I’ve also been doing lots of experimenting with Content Defined Chunking since last year (for https://bonanza.build/). One of the things I discovered is that the most commonly used algorithm FastCDC (also used by this project) can be improved significantly by looking ahead. An implementation of that can be found here:
Did you compare it to Buzhash? I assume gearhash is faster given the simpler per-iteration structure. (Also, rand/v2's seeded generators might be better for gear init than mt19937.)
Yeah, GEAR hashing is simple enough that I haven't considered using anything else.
Regarding the RNG used to seed the GEAR table: I don't think it actually makes that much of a difference. You only use it once to generate 2 KB of data (256 64-bit constants). My suspicion is that using some nothing-up-my-sleeve numbers (e.g., the first 2048 binary digits of π) would work as well.
In my case I observed a ~2% reduction in data storage when attempting to store and deduplicate various versions of the Linux kernel source tree (see link above). But that also includes the space needed to store the original version.
If we take that out of the equation and only measure the size of the additional chunks being transferred, it's a reduction of about 3.4%. So it's not an order of magnitude difference, but not bad for a relatively small change.
Yeah, that's true. Having some kind of chunking algorithm that's content/file format aware could make it work even better. For example, it makes a lot of sense to chunk source files at function/scope boundaries.
In my case I need to ensure that all producers of data use exactly the same algorithm, as I need to look up build cache results based on Merkle tree hashes. That's why I'm intentionally focusing on having algorithms that are not only easy to implement, but also easy to implement consistently. I think that MaxCDC implementation that I shared strikes a good balance in that regard.
There are plenty of larger ones and plenty of ones that used the date as the version, but I was mainly curious about packages that followed semver.
Any package version that didn't follow the x.y.z format was excluded, and any package that had fewer published versions than its largest version number was excluded (e.g. a package at version 1.123.0 should have at least 123 published versions).
Well, we are looking at npm packages, where every package is supposed to follow semantic versioning. The fact that we don't see dates as version numbers means everyone is being a good citizen.
Off the top of my head, Cloudflare uses a somewhat date-based versioning scheme for their Workers types package, but it makes sense in context: you define compatibility dates for a Worker when you set it up, which automatically enables/disables potentially breaking features in the API.
Exactly! At the same time, you also don't want to call into the kernel's internal malloc() to allocate the data structures needed to track the queue of blocked threads every time a thread blocks on a contended lock.
To prevent that, many operating systems allocate these 'queue objects' whenever threads are created and will attach a pointer to it from the thread object. Whenever a thread then stumbles upon a contended lock, it will effectively 'donate' this queue object to that lock, meaning that every lock having one or more waiters will have a linked list of 'queue objects' attached to it. When threads are woken up, they will each take one of those objects with them on the way out. But there's no guarantee that they will get their own queue object back; they may get shuffled! So by the time a thread terminates, it will free one of those objects, but that may not necessarily be the one it created.
I think the first operating system to use this method was Solaris. There they called these 'queue objects' turnstiles. The BSDs adopted the same approach, and kept the same name.
This is a really dumb question from someone unfamiliar with the kernel's futex implementation, so bear with me. In userspace locks (e.g. in parking_lot), wait queues can be threaded through nodes allocated on the stack of each waiting thread, so no static or dynamic allocation of wait queue nodes is necessary. Is it possible to allocate wait queue nodes on the kernel stacks of waiting threads as well?
> Is it possible to allocate wait queue nodes on the kernel stacks of waiting threads as well?
Yes, this is exactly what's done. The queue node is declared as a local variable in kernel code, i.e. on the kernel stack, and its address is passed to the wait function, which links it into the waitqueue and then blocks in the scheduler.
> Calls to bind(), connect(), listen(), and accept() can be used to initiate and accept connections in much the same way as with TCP, but then things diverge a bit. [...] The sendmsg() and recvmsg() system calls are used to carry out that setup
I wish the article explained why this approach was chosen, as opposed to adding a dedicated system call API that matches the semantics of QUIC.
The point is that the laws are not about efficiency. They merely serve to send California pollution to the poor parts of Nevada and charge a premium for it.