Three Fundamental Flaws of SIMD (bitsnbites.eu)
56 points by mbitsnbites on Aug 10, 2021 | 20 comments


These are not flaws, but tradeoffs made by deliberate choice.

I’ve worked on x64 and graphics compilers and this article should be skipped.


Agreed.

The fundamental flaw of SIMD that _SHOULD_ be discussed is branch divergence. Because of the way SIMD is designed, it's probably hopeless for branch divergence to ever be solved.

The wider the SIMD, the more branch divergence messes up your performance. The narrower the SIMD, the less it matters.

CPUs have a form of branch-divergence slowdown when branches are hard to predict (CPUs speculatively execute the predicted path in parallel with the current code). So I guess branch divergence affects all code. But GPUs are especially harmed by branch divergence, even more so than any CPU would be.
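To make the cost concrete, here's a rough 4-wide SSE sketch (my own illustration, assuming SSE4.1, not anything from the article): both sides of a data-dependent branch get computed for every lane, and a mask picks each lane's surviving result, so divergent lanes pay for both paths.

    #include <immintrin.h>

    // On a 4-wide SIMD unit a data-dependent branch can't be "taken" per lane.
    // Both sides are computed for all lanes and the results are blended with a
    // per-lane mask, so divergent lanes pay for both paths.
    __m128 step(__m128 x) {
        __m128 mask     = _mm_cmpgt_ps(x, _mm_setzero_ps()); // per-lane condition: x > 0
        __m128 if_true  = _mm_mul_ps(x, _mm_set1_ps(2.0f));  // "then" side, computed for every lane
        __m128 if_false = _mm_sub_ps(_mm_setzero_ps(), x);   // "else" side, also computed for every lane
        return _mm_blendv_ps(if_false, if_true, mask);       // keep whichever result each lane wanted
    }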

---------

This is different from the "fixed width" SIMD discussed in the blog post. Any chosen width will have branch divergence.

GPUs don't really have a fixed width though. Through the magic of thread barrier commands, you can have anything from the native wavefront / warp width (32 on NVidia), all the way to 1024-wide thread groups.

But the advantage of very-wide groups is that 1024-at-a-time is sometimes easier to think about than 64-at-a-time. You really should just choose the width that makes the most sense for your problem. Ex: a 32x32 block of pixels is 1024-wide, while an 8x8 block of pixels is handled 64-wide.


There's also the matter of branch divergence costing not only compute time, but programmer time, if the programming model is bad.

Using SIMD primitives that force me to pack my own vectors and handle all the divergence edge cases manually makes me want to stab my eyes out. Trying to get "CPU-style" auto-vectorization engines to infer vector semantics from a fully scalar program makes me want to stab my eyes out. Using "GPU-style" (NVidia calls it SIMT) auto-vectorization, which infers vector semantics by sweeping a kernel input parameter, is a breath of fresh air.

I get that hardware people want to focus on the hardware, not the programming interface, but the amount of good hardware that sank for want of a good programming interface is truly mind-blowing. Normally I wouldn't have expected 90% of an industry to repeatedly shoot itself in the foot for decades, but from an outsider's perspective that seems to be exactly what happened.


I wish C++AMP actually got the investment and attention it deserved.

Microsoft was about 5 years too early on that one. No one understood its relevance when it first came out.


I just read that SYCL is inspired by C++AMP.

https://en.wikipedia.org/wiki/SYCL


> it's probably hopeless for branch divergence to ever be solved

Here's a partial solution, at least: Huihui Sun, Florian Fey, Jie Zhao, Sergei Gorlatch, "WCCV: improving the vectorization of IF-statements with warp-coherent conditions", 2019, https://www.di.ens.fr/~zhaojie/ics2019.pdf

> WCCV uses two different methods to detect warp-coherent conditions. The first method detects boolean-step conditions based on static affine analysis. Affine analysis is usually used for analyzing memory access patterns [15, 23], while we use affine analysis to analyze the variables and expressions used in conditions in order to detect a boolean-step condition. If the static affine analysis fails, we use the second method based on the branch probability estimation to identify high-probability conditions. We develop a cost model based on the estimated branch probability and branch cost: if a certain branch is more likely to be executed and the corresponding branch cost is greater than a threshold, we treat the corresponding condition as warp-coherent. We use auto-tuning to determine the optimal thresholds for various target platforms and applications.

Specifically, the detection of "boolean-step conditions" looks interesting. The fallback heuristic sounds like one of those things that doesn't extrapolate well in the wider software ecosystem.
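Roughly, the idea looks something like this in 4-wide SSE (my paraphrase of the concept, not the paper's implementation): test once whether the lanes agree on the condition, take a fast unmasked path when they do, and only fall back to computing both sides when they actually diverge.

    #include <immintrin.h>

    // Sketch of exploiting a "warp-coherent" condition: if all lanes agree,
    // skip the masked both-sides path entirely.
    __m128 coherent_step(__m128 x) {
        __m128 cond = _mm_cmpgt_ps(x, _mm_setzero_ps());
        int bits = _mm_movemask_ps(cond);         // one bit per lane
        if (bits == 0xF)                          // every lane takes the branch
            return _mm_mul_ps(x, _mm_set1_ps(2.0f));
        if (bits == 0x0)                          // no lane takes the branch
            return _mm_sub_ps(_mm_setzero_ps(), x);
        // Divergent case: fall back to computing both sides and blending.
        __m128 t = _mm_mul_ps(x, _mm_set1_ps(2.0f));
        __m128 f = _mm_sub_ps(_mm_setzero_ps(), x);
        return _mm_blendv_ps(f, t, cond);
    }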


I like DirectX ray tracing's take on branch divergence: when a program processing a vector of rays hits a branch, the vector is split into two subvectors according to which branch was taken. Then each group is rescheduled separately. The process repeats, so the SIMD unit can stay fully utilized until one of the subgroups becomes too small.

I often wonder if this approach could be used for general programming. For example, I could imagine a C parser that you feed 200 files: it would use a SIMD unit of width 20 to separate them into 100 files that start with a function declaration and 100 files that start with a variable declaration. Then the second group would be scheduled to descend (as in recursive descent) and split into 20/80 int/bool variables, and so on, all with the SIMD unit fully utilized.

On the same note, I'm curious whether it would be easy enough to write a code generator that uses this pattern when translating regular programs.
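As a toy sketch of the rescheduling idea (all names here are made up for illustration), you can picture it as partitioning work items into per-branch queues and then running each queue at full occupancy:

    #include <vector>

    // Work items that diverge at a branch are sorted into per-branch queues;
    // each queue is later processed with full SIMD/warp occupancy instead of
    // running half-empty with masked-off lanes.
    struct WorkItem { int id; bool takes_branch_a; };

    void reschedule(const std::vector<WorkItem>& items,
                    std::vector<WorkItem>& queue_a,
                    std::vector<WorkItem>& queue_b) {
        for (const WorkItem& w : items)
            (w.takes_branch_a ? queue_a : queue_b).push_back(w);
        // Each queue now contains items that follow the same control path, so a
        // SIMD kernel run over queue_a (then queue_b) never diverges at this
        // branch. Repeat the split at the next divergent branch.
    }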


The complaint about churning through instruction-encoding space like x86 does each time it wants to change the SIMD width seems reasonable, at least.

Do all the ISAs do that?


RISC-V vector instructions are not tied to register size.


My biggest complaint is that SIMD opens the door to doing proper vector math (dot product, cross product, vector addition, etc.) in a single instruction. I'd like to see languages like C++ and Rust adopt 2-, 3-, and 4-element vectors as first-class types and use these registers and instructions for them.

I don't want to use the language features to define my own vector types and figure out how to minimize any overhead in the implementation. I tried that years ago in C++ and ended up not using some of my overloaded operators because they were too slow. Some of that can be overcome with C++11 (I think) features, or others. Point is, vector math is so common that I think these types deserve first-class support in languages.
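For reference, here's a minimal sketch of the kind of thing a first-class type would wrap (assuming SSE4.1; real libraries like glm or DirectXMath handle far more cases and do it better):

    #include <immintrin.h>

    // A tiny 4-float vector type backed by an SSE register.
    struct Vec4 {
        __m128 v;
        Vec4(float x, float y, float z, float w) : v(_mm_set_ps(w, z, y, x)) {}
        explicit Vec4(__m128 r) : v(r) {}
    };

    inline Vec4 operator+(Vec4 a, Vec4 b)  { return Vec4(_mm_add_ps(a.v, b.v)); }
    inline Vec4 operator*(Vec4 a, float s) { return Vec4(_mm_mul_ps(a.v, _mm_set1_ps(s))); }

    inline float dot(Vec4 a, Vec4 b) {
        // _mm_dp_ps (SSE4.1) multiplies lanes and horizontally sums them in one
        // instruction; mask 0xF1 means "use all four lanes, write lane 0".
        return _mm_cvtss_f32(_mm_dp_ps(a.v, b.v, 0xF1));
    }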


I'm not convinced that the "4-vector" is a very useful C++ concept. Sure, it easily maps to 4-wide SIMD registers, but is that really what you want?

It's clear to anyone who has tried it... that NVidia's CUDA / OpenCL / Intel ISPC approach is superior. Seeing the SIMD lanes as threads is easier to understand than expected.

NVidia CUDA and AMD ROCm/HIP are your C++ languages that compile into SIMD code. OpenCL isn't really C++ but is loosely associated with it. Intel is doing the OneAPI thing, but I don't know much about it yet.

Python, Julia, and other high-level languages are also moving toward the "SIMD-lanes as threads" approach. It's just fundamentally easier to think about.


> SIMD opens the door to doing proper vector math (dot product

A dot product turns vectors into a scalar value; doing that for a single pair of vectors is a bad fit for SIMD. You can write SIMD code that efficiently computes multiple dot products in parallel, but that needs a completely different vector layout.
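To illustrate the layout point (my sketch, not anything the parent posted): with a structure-of-arrays layout (all x's together, all y's together, ...), four 3-component dot products are just lane-wise multiply-adds, with no horizontal reduction at all.

    #include <immintrin.h>

    // Four independent 3-component dot products, one per lane, given
    // structure-of-arrays inputs.
    __m128 dot4(__m128 ax, __m128 ay, __m128 az,
                __m128 bx, __m128 by, __m128 bz) {
        __m128 r = _mm_mul_ps(ax, bx);
        r = _mm_add_ps(r, _mm_mul_ps(ay, by));
        r = _mm_add_ps(r, _mm_mul_ps(az, bz));
        return r;   // lane i holds dot(a_i, b_i)
    }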


Well, Rust is getting support for portable vectors (I'm one of the ones working on that): https://github.com/rust-lang/portable-simd


How about geometric algebra libraries like Klein and g3? Klein has operator overloading for the basis elements of the algebra implemented using Intel's SSE, while g3 does the same but uses Rust's portable stdsimd crate.

Klein: https://www.jeremyong.com/klein/
g3: https://github.com/wrnrlr/g3


Making Julia viable for your use case is probably a quicker path towards this goal than trying to persuade C++ and Rust to cater to the math audience.


Alignment requirements for loading/storing wider registers make this not worth the effort. SIMD instruction sets also generally don't have anything like dot or cross products.

Shading languages do of course have first-class vector types, but all modern GPUs implement them with plain scalar code.


That's what I was thinking. I don't do a lot of work in that space, but I know for a fact that "fixed-width SIMD buses" are no more of a fundamental flaw than our CPUs having a limited number of cores.


Tail handling is not a problem if you generate custom SIMD code for your particular computation.

For example, https://NN-512.com (but you can imagine similar code generators for other domains)

Vector overhang (masking), unrolling, full/partial iterations are all known when the code is generated. This allows a very tight fit between the SIMD processor and the work that needs to be done.
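A rough sketch of what I mean (AVX-512 syntax; the element count is invented for the example): when the generator knows the trip count, the tail becomes a single masked iteration with a constant mask rather than a runtime remainder loop.

    #include <immintrin.h>

    // Generated for a known size of 100 floats with 16-wide vectors:
    // six full iterations plus one masked iteration for the last 4 elements.
    void scale_100(float* dst, const float* src, float s) {
        const __m512 vs = _mm512_set1_ps(s);
        for (int i = 0; i < 96; i += 16)
            _mm512_storeu_ps(dst + i, _mm512_mul_ps(_mm512_loadu_ps(src + i), vs));
        const __mmask16 tail = 0x000F;                       // constant mask for elements 96..99
        __m512 t = _mm512_maskz_loadu_ps(tail, src + 96);
        _mm512_mask_storeu_ps(dst + 96, tail, _mm512_mul_ps(t, vs));
    }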


Really, you should at least mention GPU architectures and predicated SIMD. They address all of your points and are much easier to program for. x86 is also moving in this direction with AVX-512.


We can call these things "flaws" or "traits"; it doesn't really matter. The reality is that SIMD hardware is ubiquitous, while the alternatives the article mentions are niche. Maybe except ARM SVE, which only reached the market this year.



