Hacker Newsnew | past | comments | ask | show | jobs | submit | jafioti's commentslogin

i did a quick scroll and was happy to see a long-ish article on XLA and TPUs. then i realized it was literally just "using vmap for parallel loops is better than fori", but in massively wordy claudisms.


That's a bit trite tbh. We all know of these techniques, but actually implementing them on GPUs in a low-overhead manner that maintains the model's fidelity is challenging. It's much more than just breaking out the old CS book and picking the next idea from there.


Thats reasonably accurate, we're fusing both pre-defined operations as well as codegenned operations. Block-level operations live inside the search space, as do kernel, warp and thread level operations. Since it's a unified search space, we can look through tons of combinations of kernel, block, warp, and thread level ops. When we go to compile them to runnable code, thread ops get compiled to warp ops, warp ops get compiled to block ops, block ops get compiled to kernel ops (megakernels live here!), so at the end of the day everything that gets ran is a kernel.

In other words, very complimentary to our search-based approach.


mcts / rl isn't really a heuristic. but yes heuristics can be used temporarily to keep the search space small, and removed over time as the search algorithm improves.


yep, parallelized profiling across many devices is definitely something i want to add.


hopefully! i dont know the exact trick they used, but the idea is to design the search space such that that trick is discoverable.


yup! we build a search space by iteratively applying rewrite rules in every possible order (using e-graphs to do this efficiently). the rewrites alter stuff like looping / tiling structures, as well as algebraic rewrites like softmax to online softmax (and then flash attention).

yes optimized kernels for one system will work on other systems with the same hardware. its fine to take a long time compiling if you just compile once and run a lot.


Is/will it be possible to just write a model component with Luminal and then use that as a building block in e.g. Torch or JAX?


> take a long time compiling

Lol np-hard is still np-hard no matter how you slice it (especially given vague objective functions).


np-hard is still solveable with constraints. look at go.


What about it?


e-graphs are awesome! none of this would be possible without them.


we're working on techniques like mcts and RL (e.g. AlphaGo) to manage the search space, but you'd be suprised how far you can get if you carefully design the search space to prevent explosions.


very similar to superoptimisation, but most superoptimisers try to tackle turing-complete code. by just doing a very limited space of computation (linear algebra with 12 primitive ops) the search remains tractable.

the search space is designed to remain logically equivalent at all times, by virtue of how its built (applying rewrite rules we know dont change the logical equivalence).


If the search space never leaves the programs that are equivalent to the original specification, that will probably limit the optimisations you can discover. (E.g. if you start out with standard matmul, you will not discover Strassen's algorithm.) This is not a criticism, I'm just trying to understand your algorithm.


could be...im not opposed to looking into this to see if there's no possible trajectory from naive to strassen's without leaving logical equivalency.

all the optimizations for matmul so far have been straightforward trajectories from naive (tiling, smem caching, tensor core offload, etc.)


There is an old CACM post that explains how to use a bit of randomness to avoid only doing semantics preserving program changes.

https://cacm.acm.org/research/stochastic-program-optimizatio...


Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: