More

jafioti · 2026-03-13T04:45:44 1773377144

i did a quick scroll and was happy to see a long-ish article on XLA and TPUs. then i realized it was literally just "using vmap for parallel loops is better than fori", but in massively wordy claudisms.

jafioti · 2026-01-26T17:37:41 1769449061

That's a bit trite tbh. We all know of these techniques, but actually implementing them on GPUs in a low-overhead manner that maintains the model's fidelity is challenging. It's much more than just breaking out the old CS book and picking the next idea from there.

jafioti · 2026-01-26T17:34:18 1769448858

Thats reasonably accurate, we're fusing both pre-defined operations as well as codegenned operations. Block-level operations live inside the search space, as do kernel, warp and thread level operations. Since it's a unified search space, we can look through tons of combinations of kernel, block, warp, and thread level ops. When we go to compile them to runnable code, thread ops get compiled to warp ops, warp ops get compiled to block ops, block ops get compiled to kernel ops (megakernels live here!), so at the end of the day everything that gets ran is a kernel.

In other words, very complimentary to our search-based approach.

jafioti · 2025-08-21T16:30:25 1755793825

mcts / rl isn't really a heuristic. but yes heuristics can be used temporarily to keep the search space small, and removed over time as the search algorithm improves.

jafioti · 2025-08-21T00:19:51 1755735591

yep, parallelized profiling across many devices is definitely something i want to add.

jafioti · 2025-08-20T22:47:17 1755730037

hopefully! i dont know the exact trick they used, but the idea is to design the search space such that that trick is discoverable.

jafioti · 2025-08-20T22:24:09 1755728649

yup! we build a search space by iteratively applying rewrite rules in every possible order (using e-graphs to do this efficiently). the rewrites alter stuff like looping / tiling structures, as well as algebraic rewrites like softmax to online softmax (and then flash attention).

yes optimized kernels for one system will work on other systems with the same hardware. its fine to take a long time compiling if you just compile once and run a lot.

_0ffh · 2025-08-20T23:09:22 1755731362

Is/will it be possible to just write a model component with Luminal and then use that as a building block in e.g. Torch or JAX?

almostgotcaught · 2025-08-21T02:12:23 1755742343

> take a long time compiling

Lol np-hard is still np-hard no matter how you slice it (especially given vague objective functions).

jafioti · 2025-08-21T06:06:52 1755756412

np-hard is still solveable with constraints. look at go.

gregorygoc · 2025-08-21T12:52:51 1755780771

What about it?

jafioti · 2025-08-20T22:20:27 1755728427

e-graphs are awesome! none of this would be possible without them.

jafioti · 2025-08-20T22:20:05 1755728405

we're working on techniques like mcts and RL (e.g. AlphaGo) to manage the search space, but you'd be suprised how far you can get if you carefully design the search space to prevent explosions.

jafioti · 2025-08-20T22:15:42 1755728142

very similar to superoptimisation, but most superoptimisers try to tackle turing-complete code. by just doing a very limited space of computation (linear algebra with 12 primitive ops) the search remains tractable.

the search space is designed to remain logically equivalent at all times, by virtue of how its built (applying rewrite rules we know dont change the logical equivalence).

GregarianChild · 2025-08-21T06:49:38 1755758978

If the search space never leaves the programs that are equivalent to the original specification, that will probably limit the optimisations you can discover. (E.g. if you start out with standard matmul, you will not discover Strassen's algorithm.) This is not a criticism, I'm just trying to understand your algorithm.

jafioti · 2025-08-21T16:32:45 1755793965

could be...im not opposed to looking into this to see if there's no possible trajectory from naive to strassen's without leaving logical equivalency.

all the optimizations for matmul so far have been straightforward trajectories from naive (tiling, smem caching, tensor core offload, etc.)

GregarianChild · 2025-08-21T17:55:48 1755798948

There is an old CACM post that explains how to use a bit of randomness to avoid only doing semantics preserving program changes.

https://cacm.acm.org/research/stochastic-program-optimizatio...