i did a quick scroll and was happy to see a long-ish article on XLA and TPUs. then i realized it was literally just "using vmap for parallel loops is better than fori", but in massively wordy claudisms.
That's a bit trite tbh. We all know of these techniques, but actually implementing them on GPUs in a low-overhead manner that maintains the model's fidelity is challenging. It's much more than just breaking out the old CS book and picking the next idea from there.
Thats reasonably accurate, we're fusing both pre-defined operations as well as codegenned operations. Block-level operations live inside the search space, as do kernel, warp and thread level operations. Since it's a unified search space, we can look through tons of combinations of kernel, block, warp, and thread level ops. When we go to compile them to runnable code, thread ops get compiled to warp ops, warp ops get compiled to block ops, block ops get compiled to kernel ops (megakernels live here!), so at the end of the day everything that gets ran is a kernel.
In other words, very complimentary to our search-based approach.
mcts / rl isn't really a heuristic. but yes heuristics can be used temporarily to keep the search space small, and removed over time as the search algorithm improves.
yup! we build a search space by iteratively applying rewrite rules in every possible order (using e-graphs to do this efficiently). the rewrites alter stuff like looping / tiling structures, as well as algebraic rewrites like softmax to online softmax (and then flash attention).
yes optimized kernels for one system will work on other systems with the same hardware. its fine to take a long time compiling if you just compile once and run a lot.
we're working on techniques like mcts and RL (e.g. AlphaGo) to manage the search space, but you'd be suprised how far you can get if you carefully design the search space to prevent explosions.
very similar to superoptimisation, but most superoptimisers try to tackle turing-complete code. by just doing a very limited space of computation (linear algebra with 12 primitive ops) the search remains tractable.
the search space is designed to remain logically equivalent at all times, by virtue of how its built (applying rewrite rules we know dont change the logical equivalence).
If the search space never leaves the programs that are equivalent to the original specification, that will probably limit the optimisations you can discover. (E.g. if you start out with standard matmul, you will not discover Strassen's algorithm.) This is not a criticism, I'm just trying to understand your algorithm.