
Not a fan of Windows either, but playing devil’s advocate here: Apple’s Finder has steadily gotten worse over the last ~16 years, at least in my experience. It increasingly struggles with basic functionality.

There seems to be a pattern where higher market cap correlates with worse ~~tech~~ fundamentals.


Why would a company be incentivized to improve the user experience in ways that aren't profitable? Especially after watching the number one tech company literally worsen UX to increase profitability.

Thanks a lot for the correction! I'll adjust the references in a bit.


I was just about to ask some friends about it. If I’m not mistaken, Postgres began using ICU for collation, but not string matching yet. Curious if someone here is working in that direction?


Levenshtein distance calculations are a pretty generic string operation; genomics just happens to be one of the domains where they are most used... and a passion of mine :)
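
For anyone unfamiliar, the textbook recurrence is tiny. Here's a minimal C++ sketch (real libraries, mine included, block and vectorize this heavily, but the underlying logic is the same):

  #include <algorithm>
  #include <string>
  #include <vector>

  // Classic O(m*n) Levenshtein distance with two rolling rows.
  std::size_t levenshtein(std::string const &a, std::string const &b) {
      std::vector<std::size_t> prev(b.size() + 1), curr(b.size() + 1);
      for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = j;
      for (std::size_t i = 1; i <= a.size(); ++i) {
          curr[0] = i;
          for (std::size_t j = 1; j <= b.size(); ++j) {
              std::size_t substitution = prev[j - 1] + (a[i - 1] != b[j - 1]);
              curr[j] = std::min({prev[j] + 1, curr[j - 1] + 1, substitution});
          }
          std::swap(prev, curr);
      }
      return prev[b.size()];
  }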


This is a very good example! Still, “correct” needs context. You can be 100% “correct with respect to ICU”. It’s definitely not perfect, but it’s the best standard we have. And luckily for me, it also defines the locale-independent rules. I can expand to support locale-specific adjustments in the future, but I’m waiting for adoption to grow before investing even more engineering effort into this feature. Maybe worth opening a GitHub issue for that :)


Right, nothing wrong with delegating the decision to a bunch of people who have thought long and hard about the best compromise, as long as it’s understood that it’s not perfect.


This article is about the ugliest — but arguably the most important — piece of open-source software I’ve written this year. The write-up ended up long and dense, so here’s a short TL;DR:

I grouped all Unicode 17 case-folding rules and built ~3K lines of AVX-512 kernels around them to enable fully standards-compliant, case-insensitive substring search across the entire 1M+ Unicode range, operating directly on UTF-8 bytes. In practice, this is often ~50× faster than ICU, and also less wrong than most tools people rely on today—from grep-style utilities to products like Google Docs, Microsoft Excel, and VS Code.
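
To illustrate the "less wrong" part, here's a hypothetical naive matcher of the kind many tools effectively implement, folding bytes with ASCII `tolower`. It can never match 'ß' against "SS", even though Unicode case folding says it should. All names here are mine, purely for illustration:

  #include <algorithm>
  #include <cassert>
  #include <cctype>
  #include <string>

  // Naive "case-insensitive" search: folds bytes with ASCII tolower.
  bool naive_icontains(std::string haystack, std::string needle) {
      auto fold = [](std::string &s) {
          std::transform(s.begin(), s.end(), s.begin(),
                         [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
      };
      fold(haystack);
      fold(needle);
      return haystack.find(needle) != std::string::npos;
  }

  int main() {
      assert(naive_icontains("Strasse", "STRASSE")); // pure ASCII: works
      // UTF-8 "Straße": 'ß' (0xC3 0x9F) folds to "ss" under Unicode rules,
      // but byte-wise ASCII tolower can't know that, so the lookup fails:
      assert(!naive_icontains("Stra\xC3\x9F" "e", "STRASSE"));
  }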

StringZilla v4.5 is available for C99, C++11, Python 3, Rust, Swift, Go, and JavaScript. The article covers the algorithmic tradeoffs, benchmarks across 20+ Wikipedia dumps in different languages, and quick starts for each binding.

Thanks to everyone for feature requests and bug reports. I'll do my best to port this to Arm as well — but first, I'm trying to ship one more thing before year's end.


This is exactly the kind of thankless software the world operates on. It’s unfortunate that such fundamental code hasn’t already been vectorized to the gills, but thank you for doing so! It’s excellent work.


Thank you for this, and congrats on your achievement!


> I grouped all Unicode 17 case-folding rules

But why are you using the case-folding rules and not the collation rules?


Yes, CaseFolding.txt. I'm considering using the collation rules for sorting. For now, the sorting kernels only target lexicographic comparisons and seem to be about 4x faster than Rust's standard quicksort implementation, but few people use them: https://github.com/ashvardanian/StringWars?tab=readme-ov-fil...
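
For context, "lexicographic" here means plain byte order, which already diverges from dictionary order; a minimal illustration:

  #include <algorithm>
  #include <cstdio>
  #include <string>
  #include <vector>

  int main() {
      std::vector<std::string> words {"apple", "Zebra", "banana"};
      std::sort(words.begin(), words.end()); // compares raw bytes
      // Prints "Zebra, apple, banana": all ASCII uppercase sorts before
      // lowercase. Collation rules (e.g., ICU's UCA) would instead yield
      // the dictionary order: apple, banana, Zebra.
      for (auto const &w : words) std::printf("%s\n", w.c_str());
  }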


This is a truly amazing accomplishment. Reading these kernels is a joy!


Thank you

Do the Go bindings require cgo?


The Go bindings – yes, they are based on cgo. I realize it's suboptimal, but it seems like the only practical option at this point.


In a normal world, the Go C FFI wouldn't have insane overhead, but what can we do; the language is perfect, and it will stay that way until morale improves.

Thanks for the work you do


There are undoubtedly still some optimizations lying around, but the biggest source of Go's FFI overhead is goroutines.

There are only two "easy" solutions I can see: switch to an N:N threading model or make the C code goroutine-aware. The former would speed up C calls at the expense of slowing down lots of ordinary Go code. Personally, I can still see some scenarios where that's beneficial, but it's pretty niche. The latter would greatly complicate the use of cgo and defeat one of its core purposes, namely having access to large, hard-to-translate C codebases without requiring extensive modifications to them.

A lot of people compare Go's FFI overhead to that of other natively compiled languages, like Zig or Rust, or to managed runtime languages like Java (JVM) or C# (.NET), but those alternatives don't use green threads (the general concept behind goroutines) as extensively. If you really want to compare apples-to-apples, you should compare against Erlang (BEAM). As far as I can tell, Erlang NIFs [1] are broadly similar to purego [2] calls, and their runtime performance [3] has more or less the same issues as CGo [4].

[1]: https://www.erlang.org/doc/system/nif.html

[2]: https://pkg.go.dev/github.com/ebitengine/purego

[3]: https://erlang.org/documentation/doc-10.1/doc/efficiency_gui...

[4]: https://www.reddit.com/r/golang/comments/12nt2le/when_dealin...


Java has green threads, and C#/.NET has logical threads


Yes, I have cleaned up the wording a bit. Also, the common implementation of Rust's async is comparable to green threads, and I think Zig is adopting something like it too.

However, the "normal" execution model on all of them is using heavyweight native threads, not green threads. As far as I can tell, FFI is either unsupported entirely or has the same kind of overhead as Go and Erlang do, when used from those languages' green threads.


Genuine question: you make it seem as if this is a limitation and they're all in the same bucket, but how was Java, for example, able to scale all those enterprise workloads while having multithreading and good FFI? Same with .NET.

My impression is that Go's FFI has its high overhead because of specific choices made not to care about FFI, since that would benefit the Go code more?

My point was that there are other GC'd languages/environments that have good FFI and were somehow able, all these decades, to create scalable multithreaded applications.


I would suggest gaining a better understanding of the M:N threading model versus the N:N threading model. I do not know that I can do it justice here.

Both Java and Rust flirted with green threads in their early days. Java abandoned them because the hardware wasn't ready yet, and Rust abandoned them because they require a heavyweight runtime that wasn't appropriate for many applications Rust was targeting. And yet, both languages (and others besides) ended up adding something like them in later anyway, albeit sitting beside, rather than replacing, the traditional N:N threading they primarily support.

Your question might just be misdirected; one could view it as operating systems, and not programming languages per se, that screwed it all up. Their threads, which were conservatively designed to be as compatible as possible with existing code, have too much overhead for many tasks. They were good enough for a while, especially as multicore systems started to enter the scene, but their limitations became apparent after e.g. nginx could handle 10x the requests of Apache httpd on the same hardware. This gap would eventually be narrowed, to some extent, but it required a significant amount of rework in Apache.

If you can answer the question of why ThreadPoolExecutor exists in Java, then you are about halfway to answering the question about why M:N threading exists. The other half is mostly ergonomics; ThreadPoolExecutor is great for fanning out pieces of a single, subdividable task, but it isn't great for handling a perpetual stream of unrelated tasks that ebb and flow over time. EDIT: See the Project Loom proposal for green threads in Java today, which also brings up the ForkJoinPool, another approach to M:N threading: https://cr.openjdk.org/~rpressler/loom/Loom-Proposal.html
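
To make the ThreadPoolExecutor half concrete, here's a minimal C++ sketch of the same idea: many short-lived tasks multiplexed over a fixed set of OS threads. Green threads add suspendable stacks on top of this scheduling core:

  #include <condition_variable>
  #include <cstddef>
  #include <functional>
  #include <mutex>
  #include <queue>
  #include <thread>
  #include <vector>

  // Minimal fixed-size pool: M queued tasks multiplexed over N OS
  // threads -- the scheduling half of the M:N threading idea.
  class thread_pool {
      std::vector<std::thread> workers_;
      std::queue<std::function<void()>> tasks_;
      std::mutex mutex_;
      std::condition_variable cv_;
      bool done_ = false;

    public:
      explicit thread_pool(std::size_t n) {
          for (std::size_t i = 0; i != n; ++i)
              workers_.emplace_back([this] {
                  for (;;) {
                      std::function<void()> task;
                      {
                          std::unique_lock<std::mutex> lock(mutex_);
                          cv_.wait(lock, [this] { return done_ || !tasks_.empty(); });
                          if (done_ && tasks_.empty()) return;
                          task = std::move(tasks_.front());
                          tasks_.pop();
                      }
                      task(); // runs on whichever worker grabbed it first
                  }
              });
      }
      void submit(std::function<void()> task) {
          { std::lock_guard<std::mutex> lock(mutex_); tasks_.push(std::move(task)); }
          cv_.notify_one();
      }
      ~thread_pool() {
          { std::lock_guard<std::mutex> lock(mutex_); done_ = true; }
          cv_.notify_all();
          for (auto &w : workers_) w.join();
      }
  };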


In a real (not "normal") world, trade-offs exist, and Go chose a specific set of design points that are consequential.


Yes — fuzzy and phonetic matching across languages is part of the roadmap. That space is still poorly standardized, so I wanted to start with something widely understood and well-defined (ICU-style transforms) before layering on more advanced behavior.

Also, as shown in the later tables, the Armenian and Georgian fast paths still have room for improvement. Before introducing higher-level APIs, I need to tighten the existing Armenian kernel and add a dedicated one for Georgian. It’s not a true bicameral script, but some of its characters are fold targets for older scripts, which currently forces too many fallbacks to the serial path.


Even when transliteration is somewhat de-facto standardized, it usually depends on the target/host language. Arabic and Russian, for example, are transliterated differently into English, French, German, Dutch, etc.


I get why it sounds that way, but it’s not actually true.

StringZilla added full Unicode case folding in an earlier release and has had state-of-the-art exact (case-sensitive) substring search for years. However, doing a full fold of the entire haystack is significantly slower than the new case-insensitive search path.

The key point is that you don’t need to fully normalize the haystack to correctly answer most substring queries. The search algorithm can rule out the vast majority of positions using cheap, SIMD-friendly probes and only apply fold logic on a very small subset of candidates.

I go into the details in the “Ideation & Challenges in Substring Search” section of the article.
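
For intuition, here's a scalar caricature of that filter-then-verify structure. The `folded_equal` helper and all names are hypothetical stand-ins (with ASCII-only folding just so the sketch is self-contained); the real kernels do the probing on 64-byte AVX-512 registers, not scalar loops:

  #include <cstddef>
  #include <string_view>

  // Stand-in for the real Unicode folding comparison; ASCII-only here,
  // purely so the sketch compiles. The real one walks UTF-8 runes.
  bool folded_equal(std::string_view haystack, std::size_t offset, std::string_view needle) {
      for (std::size_t j = 0; j != needle.size(); ++j) {
          unsigned char a = static_cast<unsigned char>(haystack[offset + j]) | 0x20;
          unsigned char b = static_cast<unsigned char>(needle[j]) | 0x20;
          if (a != b) return false;
      }
      return true;
  }

  // Filter-then-verify: cheaply reject most positions with a one-byte probe,
  // and only run the expensive folding comparison on surviving candidates.
  std::size_t find_case_insensitive(std::string_view haystack, std::string_view needle,
                                    unsigned char needle_first_folded) {
      for (std::size_t i = 0; i + needle.size() <= haystack.size(); ++i) {
          unsigned char c = static_cast<unsigned char>(haystack[i]);
          if ((c | 0x20) != needle_first_folded) continue; // cheap filter
          if (folded_equal(haystack, i, needle)) return i;  // rare, slow path
      }
      return std::string_view::npos;
  }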


Here's my favorite practically applicable cache-related fact: even on x86, on recent server CPUs, cache-coherency protocols may operate at a different granularity than the cache line size. A typical case with new Intel server CPUs is a granularity of 2 consecutive cache lines. Some thread-pool implementations, like crossbeam in Rust and my ForkUnion in Rust and C++, explicitly document that and align objects to 128 bytes [1]:

  /**
   *  @brief Defines variable alignment to avoid false sharing.
   *  @see https://en.cppreference.com/w/cpp/thread/hardware_destructive_interference_size
   *  @see https://docs.rs/crossbeam-utils/latest/crossbeam_utils/struct.CachePadded.html
   *
   *  The C++ STL way to do it is to use `std::hardware_destructive_interference_size` if available:
   *
   *  @code{.cpp}
   *  #if defined(__cpp_lib_hardware_interference_size)
   *  static constexpr std::size_t default_alignment_k = std::hardware_destructive_interference_size;
   *  #else
   *  static constexpr std::size_t default_alignment_k = alignof(std::max_align_t);
   *  #endif
   *  @endcode
   *
   *  That however results into all kinds of ABI warnings with GCC, and suboptimal alignment choice,
   *  unless you hard-code `--param hardware_destructive_interference_size=64` or disable the warning
   *  with `-Wno-interference-size`.
   */
  static constexpr std::size_t default_alignment_k = 128;
As mentioned in the docstring above, using STL's `std::hardware_destructive_interference_size` won't help you. On ARM, this issue becomes even more pronounced, so concurrency-heavy code should ideally be compiled multiple times for different coherence protocols and leverage "dynamic dispatch", similar to how I & others handle SIMD instructions in libraries that need to run on a very diverse set of platforms.
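
In practice, the constant gets used roughly like this: pad each per-thread slot so adjacent slots can never land on the same coherence granule. A minimal sketch, not ForkUnion's actual layout:

  #include <atomic>
  #include <cstddef>

  static constexpr std::size_t default_alignment_k = 128;

  // One counter per worker thread; the alignment over-pads each slot to
  // 128 bytes, so neighboring slots can't share a coherence granule.
  struct alignas(default_alignment_k) padded_counter {
      std::atomic<std::size_t> value {0};
  };

  padded_counter counters[16]; // e.g., one slot per hardware thread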

[1] https://github.com/ashvardanian/ForkUnion/blob/46666f6347ece...


> even on x86 on recent server CPUs, cache-coherency protocols may be operating at a different granularity than the cache line size. A typical case with new Intel server CPUs is operating at the granularity of 2 consecutive cache lines

I don’t think it is accurate that Intel CPUs use 2 cache lines / 128 bytes as the coherency protocol granule.

Yes, there can be additional destructive interference effects at that granularity, but that’s due to prefetching (of two cache lines with coherency managed independently) rather than coherency operating on one 128-byte granule.

AFAIK 64 bytes is still the correct granule for avoiding false sharing, with two cores modifying two consecutive cachelines having way less destructive interference than two cores modifying one cacheline.


This makes attempts at cargo-culting __attribute__((aligned(64))) without benchmarking even more hilarious. :-)


It’s not a cargo cult if the actions directly cause cargo to arrive based on well understood mechanics.

Regardless of whether it would be better in some situations to align to 128 bytes, 64 bytes really is the cache line size on all common x86 cpus and it is a good idea to avoid threads modifying the same cacheline.


It indeed isn't, but I've seen my share of systems where nobody checked if cargo arrived. (The code was checked in without any benchmarks done, and after many years, it was found that the macros used were effectively no-ops :-) )


Storage, strings, sorting, counting, bioinformatics... I got nerd-sniped! Can't resist a shameless plug here :)

Looking at the code, there are a few things I would consider optimizing. I'd start by trying (my) StringZilla for hashing and sorting.

HashBrown collections under the hood use aHash, which is an excellent hash function, but on both short and long inputs, on new CPUs, StringZilla seems faster [0]:

                               short               long
  aHash::hash_one         1.23 GiB/s         8.61 GiB/s
  stringzilla::hash       1.84 GiB/s        11.38 GiB/s

A similar story with sorting strings. Inner loops of arbitrary-length string comparisons often dominate such workloads. Doing it in a more radix-style fashion can 4x your performance [1]:

                                                    short                  long
  std::sort_unstable_by_key           ~54.35 M compares/s    57.70 M compares/s
  stringzilla::argsort_permutation   ~213.73 M compares/s    74.64 M compares/s
Bear in mind that "compares/s" is a made-up metric here; in reality, I'm deriving it from the total sort duration.
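
For intuition on the radix-style angle, here's a toy MSD-style first pass: bucket strings by their leading byte, so most of the ordering is established without full string comparisons. The real implementation is considerably more involved:

  #include <algorithm>
  #include <array>
  #include <string>
  #include <vector>

  // Toy MSD radix step: scatter strings into 256 buckets by first byte,
  // then finish each (now much smaller) bucket with a comparison sort.
  void radix_sort_strings(std::vector<std::string> &strings) {
      std::array<std::vector<std::string>, 256> buckets;
      for (auto &s : strings) {
          unsigned char lead = s.empty() ? 0 : static_cast<unsigned char>(s[0]);
          buckets[lead].push_back(std::move(s));
      }
      strings.clear();
      for (auto &bucket : buckets) {
          std::sort(bucket.begin(), bucket.end()); // a real MSD sort recurses on byte 1
          for (auto &s : bucket) strings.push_back(std::move(s));
      }
  }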

[0] https://github.com/ashvardanian/StringWars?tab=readme-ov-fil...

[1] https://github.com/ashvardanian/StringWars?tab=readme-ov-fil...


Cool suggestions! I'd definitely be interested in exploring other hash functions for this (and other bioinformatics work), so I'll take a look at your stringzilla lib.

