Hacker News | janwas's comments

> performance of general solutions without using SIMD, is good enough too, since all of which will eventually dump right down to the uops anyway, with deep pipeline, branch predictor, superscalar and speculative execution doing their magics altogether

A quick comment on this one point (personal opinion): from a hyperscaler's perspective, scalar code is most certainly not enough. The energy cost of scheduling a MUL instruction is something like 10x that of the actual operation it performs. It is important to amortize that cost over many elements (i.e. SIMD).


Wow, that number requires STRONG caveats, lest it be called out as completely false. Take away the tensor cores (unless you only do matmuls?), and an H100 has roughly 2x as many f32 flops as a Zen5 CPU, which is considerably cheaper. I suspect brute force HW/algorithms are not going to age well: https://www.sigarch.org/dont-put-all-your-tensors-in-one-bas... (/personal opinion)


CPU-time would over-emphasize regions where many threads are running, right? I find wall-time useful for finding serial regions that aren't yet parallelized.

More detail here: https://github.com/dvyukov/perf-load. We recently implemented the same idea without requiring context-switch events: https://github.com/google/highway/blob/master/hwy/profiler.h...


You also get automatic support for newer instructions (and multiversioning) with a wrapper library such as our Highway :)


Highway author here :) I'm curious what you disagree with, because it all sounds very sensible to me?


There's a lot to discuss.

First off, a number of statements are nonsense. Take, for example

> you shouldn't be writing SIMD instructions directly unless you're writing a SIMD library or an optimizing compiler.

Why would writing an optimizing compiler qualify as territory for directly writing SIMD code, but anything else is off the table? That makes no sense at all.

Furthermore, I was writing a library. It's just embedded in my game engine.

> Instead you should reach for one of the many available libraries

This blanket statement is only true in a narrow set of circumstances. In my mind, it requires that you ship on multiple architectures and probably multiple compilers. If you have narrower constraints, it's extremely easy to write your own wrappers (like I did) and not take a dependency. A good trade IMO. Furthermore, someone's got to write the libraries, so doing it yourself as a learning exercise has value.

> There are loads of libraries like this [...] and provide targeting for a vast trove of SIMD options without hand-writing for every option.

The original commenter seems to be under the impression that using a SIMD library would somehow have produced a better result. The fact is, the library code is super fucking boring. I barely mentioned it in the article because it's basically just boilerplate an LLM could probably spit out, first try. The interesting part of the series is the observation that you can precompute a matrix of intermediates and look them up, instead of recomputing them in the hot loop, effectively trading memory bandwidth for fewer instructions. A good trade for this algorithm, which saturates the instruction pipelines.

The thing the original commenter does get right is the notion that thinking about data layout is important. But that has nothing to do with the library you're using... you just have to do it. They seem to be conflating the use of a library with the act of writing wide code, as if you can't do one without the other, which is obviously false.

> I was going to quickly rewrite the example in Highway ..

Right. I'll believe this when I see it.

I could pick it apart more, but.. I think you get my drift.


Thanks for expanding on your viewpoint.

> Why would writing an optimizing compiler qualify as territory for directly writing SIMD code, but anything else is off the table?

I understood "directly writing" to mean assembly or even intrinsics. In general, I would advise not touching intrinsics directly, because the intrinsic definitions themselves have in several cases had bugs. Here's one AVX-512 example: https://github.com/google/highway/commit/7da2b760c012db04103....

When using a wrapper library, these can be fixed in one spot, but every direct user of intrinsics has to deal with it themselves.

> it's extremely easy to write your own wrappers (like I did) and not take a dependency. A good trade IMO

I understand wanting to reduce dependencies. The tradeoff is a bit more complex: for example many readers would be familiar with Highway terminology. We have also made efforts to be a lightweight dependency :)

> doing it yourself as a learning exercise has value.

Understandable :) Though it's a bit regrettable to tie your user code to the library prototype - if used elsewhere, it would probably have to be ported.

> The fact is, the library code is super fucking boring.

True for many ops. However, emulating AES or other complex ops is nontrivial. And it is easy to underestimate the sheer toil of keeping things working across compiler versions and their many bugs. We recently hit the 3000 commit mark in Highway :)


Generally agree, especially with the sentiment that it's a huge PITA maintaining something like this across multiple compilers & platforms.

Out of curiosity, does highway implement integer divide?



Nice. Looks like it handles quite a bit. I just supported a single div op, which was enough for my needs.

https://github.com/scallyw4g/bonsai_stdlib/blob/71fadd0f1fce...


>First off, a number of statements are nonsense.

100% of my original comment is absolutely and completely correct. Indisputably correct.

>Furthermore, I was writing a library.

Little misunderstandings like this pervade your take.

>seems to be under the impression that using a SIMD library would somehow have produced a better result.

To be clear, I wasn't speaking to you or for your benefit, or specifically to your exercise. You'll notice I didn't email a list of recommendations to you, because I do not care what you do or how you do it. I didn't address my comment to you.

I -- and I was abundantly clear on this -- was speaking to the random reader who might be considering optimizing their code with some hand-crafted SIMD: following the path in this (and an endless chain of similar) submission(s) is usually ill advised. That's a general point, not about this specific project, but about the average "I want to take advantage of SIMD in my code" consideration.

HN has a fetish for SIMD code recently and there is almost always a better approach than hand-crafting some SSE3 calls in one's random project.

>The original commenter seems to be under the impression that using a SIMD library would somehow have produced a better result.

Again, I could not care less about your project. But the average developer does care that their code runs on a wide variety of platforms optimally. You don't, but again, you and your project were tangential to my comment, which was general.

>The thing the original commenter does get right is the notion that thinking about data layout is important.

Aside from the entirety of my comment being correct, the point was that many SIMD tools and libraries force you down a path where you are coerced into such structures, versus relying upon the compiler to make the best of suboptimal ones. We've seen many times that people complain their compiler isn't vectorizing things they think it should; there is a middle ground between endlessly fighting with the compiler and hand-rolling SSE calls -- one that not only supports much more hardware, but leads you down the path of best practices.

Which is of course why C++ 26 is getting std::simd.

Again, you are irrelevant to my comment. Your project is irrelevant to it. I know this is tough to stomach.

>Right. I'll believe this when I see it.

I actually cloned the project, but then this submission fell off the front page and it seemed not worth my time. Not to mention that it can't be built on macOS, which happened to be the machine I was on at the moment.

Because again, I don't care about you or your project, and my commentary was to the SIMD sideliners considering how to approach it.

>I could pick it apart more, but.. I think you get my drift.

None of your retorts are valid, and my comment stands as completely correct. The drift is that you feel defensive about a general comment because you did something different, which....eh.


I appreciate your efforts to nudge readers towards SoA data structures and varying SIMD widths. FWIW I have observed that advice is more effective if communicated with some kindness.


lol, alright dude. Good luck with C++26


Delicious snark. Humorously I only mentioned C++26 because the approach is being formalized right into the standard -- it is so painfully obvious and necessary -- but of course I mentioned a number of existing excellent solutions like Highway already, so again you either have no idea what you're reading, or choose not to.

Cheers!


I made the same argument a while ago but a coworker changed my mind.

Can you afford to write and maintain a codepath per ISA (knowing that more keep coming, including RVV, LASX and HVX), to squeeze out the last X%? Is there no higher-impact use of developer time? If so, great.

If not, what's the alternative - scalar code? I'd think decent portable SIMD code is still better than nothing, and nothing (scalar) is all we have for new ISAs which have not yet been hand-optimized. So it seems we should anyway have a generic SIMD path, in addition to any hand-optimized specializations.

BTW, Highway indeed provides decent emulations of LD2..4, and at least 2-table lookups. Note that some Arm uarchs are anyway slow with LD3 and LD4.


For now, at work, it's some parts with AVX-512, some parts where we can't really use AVX-512 and so should use AVX2, and some parts with NEON and SVE. So the SSE implementations are basically a courtesy to outside users of the libraries, and there are no RVV implementations.

If we were already depending on highway or eve, I would think it's great to ship the generic SIMD version instead of the SSE version, which probably compiles down to the same thing on the relevant targets. This way, if future maintainers need to make changes and don't want to deal with the several implementations I have left behind, the presence of the generic implementation would allow them to delete them rather than making the same changes a bunch of times.


Makes sense :) Generic or fallback versions are also useful for correctness testing and benchmarking.


Mature such wrappers exist, for example our Highway library :)


If SIMT is so obviously the right path, why have just about all GPU vendors and standards reinvented SIMD, calling it subgroups (Vulkan), __shfl_sync (CUDA), work group/sub-group (OpenCL), wave intrinsics (HLSL), I think also simdgroup (Metal)?


I don't understand why it helps to "avoid them" entirely. For the (in my experience) >90% of shared code, we can gain the convenience of the wrapper library. For the rest, Highway allows target-specific specializations amidst your otherwise portable code: `#if HWY_TARGET == HWY_AVX2 ...`


I know dzaima is aware, but for all the other posters who might not be, our Highway library provides all these missing instructions, via emulation if required.

I do not understand why folks are still making do with direct use of intrinsics or compiler builtins. Having a library centralize workarounds (such as an MSAN compiler change which hit us last week) seems like an obvious win.

