I've been delving deep into SIMD for wasm and have had a few findings:
Clang/LLVM doesn't use the wat format. Its assembly syntax is similar to GAS with the preprocessor and supports relocations, so you can write your wasm assembly much as you would for ARM, and it integrates into your Makefile/C code as normal. This also makes it easier to port SIMD code from other platforms to wasm SIMD.
The SIMD spec is quite lacking. Many algorithms take advantage of the chip's ability to do multiple things in a single instruction. Rounding, saturation, and truncation are the main ones I see that wasm does not support out of the box, so you have to emulate them. The exception is add_sat, which does saturate.
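To make the emulation cost concrete, here is a minimal scalar sketch in C of one 16-bit lane of a rounding shift right, which AArch64 does in one instruction (SRSHR). The function name `rounding_shr_i16` is made up for illustration, and the sketch assumes the compiler does an arithmetic right shift on negative signed values, which is technically implementation-defined in C. Each commented step would be its own SIMD operation when emulated.

```c
#include <stdint.h>

/* Scalar model of a single 16-bit lane (hypothetical helper, not a real API).
 * Valid for shift amounts n in 1..15. */
static int16_t rounding_shr_i16(int16_t x, unsigned n) {
    /* Step 1: add the rounding bias 2^(n-1). Done in 32 bits so the
     * intermediate sum cannot overflow the 16-bit lane. */
    int32_t biased = (int32_t)x + ((int32_t)1 << (n - 1));

    /* Step 2: shift right. Assumes an arithmetic (sign-preserving) shift. */
    return (int16_t)(biased >> n);
}
```

The widening in step 1 is the awkward part in a real vector version: the bias add can overflow a 16-bit lane, so an emulation either has to widen the lanes or restructure the arithmetic, on top of the extra add.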
So far aarch64 is lovely to port, but x86 is terrible. As noted in OP's blog here, a single f64.max instruction ends up being 5-6 instructions on x86, while aarch64 and armhf package it in a single instruction.
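As far as I understand it, the mismatch comes from NaN and signed-zero handling: wasm's max must return NaN if either input is NaN and must treat -0 as less than +0, while x86's MAXPD just returns its second operand in both of those cases. A rough scalar sketch of one lane of each, with made-up function names (`x86_maxpd_lane`, `wasm_f64_max_lane`) and ignoring NaN payload details:

```c
#include <math.h>

/* One lane of x86 MAXPD(a, b): if the comparison is false for any reason
 * (NaN involved, or both operands are zeros), the second operand wins. */
static double x86_maxpd_lane(double a, double b) {
    return a > b ? a : b;
}

/* One lane of wasm max semantics (hypothetical model, not an engine's code). */
static double wasm_f64_max_lane(double a, double b) {
    /* Any NaN input produces NaN. */
    if (isnan(a) || isnan(b)) {
        return NAN;
    }
    /* +0 beats -0: if a is -0 take b, otherwise a is already +0. */
    if (a == 0.0 && b == 0.0) {
        return signbit(a) ? b : a;
    }
    return a > b ? a : b;
}
```

The extra branches in the second function are what an engine has to turn into additional vector instructions around the plain max on x86.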
I hope more work is being done to remedy this so we can bring more algorithms to the browser at great performance. Compiler work is above my abilities currently.
If you are interested in bringing more SIMD instructions, you can participate in the SIMD subgroup. What's needed is interest in suggesting new instructions, working on a proposal (following the process), and pushing it through. Compiler expertise not required :)