I'm not convinced that the "4-vector" is a very useful C++ concept. Sure, it easily maps to 4-wide SIMD registers, but is that really what you want?

It's clear to anyone who has tried it that the approach taken by NVIDIA's CUDA, OpenCL, and Intel's ISPC is superior. Treating each SIMD lane as a thread is easier to understand than you'd expect.
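To make the contrast concrete, here's a rough sketch in plain C++ (not real CUDA or ISPC). `Vec4`, `saxpy_kernel`, `launch`, and `saxpy` are all names I made up for illustration, and `launch` is just a sequential loop standing in for what a CUDA grid launch or an ISPC `foreach` would spread across SIMD lanes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Style 1: an explicit 4-vector type. The width (4) is baked into the type,
// so the calling code has to chunk its data and handle leftovers itself.
struct Vec4 {
    float v[4];
};

inline Vec4 operator+(Vec4 a, Vec4 b) {
    return {{a.v[0] + b.v[0], a.v[1] + b.v[1], a.v[2] + b.v[2], a.v[3] + b.v[3]}};
}

inline Vec4 operator*(float s, Vec4 a) {
    return {{s * a.v[0], s * a.v[1], s * a.v[2], s * a.v[3]}};
}

// Style 2: "lane as thread". The kernel is ordinary scalar code describing
// what ONE lane does; nothing in it mentions the hardware width.
inline void saxpy_kernel(std::size_t lane, float a,
                         const float* x, const float* y, float* out) {
    out[lane] = a * x[lane] + y[lane];
}

// Hypothetical launcher: runs the kernel once per lane index.
// Here it is a plain sequential loop; a real SPMD compiler or GPU runtime
// would map these lane indices onto SIMD lanes instead.
template <typename Kernel, typename... Args>
void launch(std::size_t n, Kernel kernel, Args... args) {
    for (std::size_t lane = 0; lane < n; ++lane) {
        kernel(lane, args...);
    }
}

// Convenience wrapper: out[i] = a * x[i] + y[i] for every element.
inline std::vector<float> saxpy(float a, const std::vector<float>& x,
                                const std::vector<float>& y) {
    assert(x.size() == y.size());
    std::vector<float> out(x.size());
    launch(x.size(), saxpy_kernel, a, x.data(), y.data(), out.data());
    return out;
}
```

With the `Vec4` style, `2.0f * Vec4{{1, 2, 3, 4}} + Vec4{{10, 20, 30, 40}}` handles exactly four elements and no more. With the lane style, `saxpy(2.0f, x, y)` works on five elements or five million without the kernel changing, which is the part I find easier to reason about.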

NVIDIA CUDA and AMD ROCm/HIP are your C++ dialects that compile into SIMD code. OpenCL isn't really C++, but it's kind of associated with it. Intel is doing the oneAPI thing, but I don't know much about it yet.

Python, Julia, and other high-level languages are also moving toward the "SIMD lanes as threads" approach. It's just fundamentally easier to think about.