Hacker News

> For a quick check, I measured 1.82 instructions per cycle to compile ~0.3MB of source in BQN (the compiler's a very array-oriented program for those unaware) and 1.38 ipc on ~1.7MB where caching's worse

It's not really clear to me what this is measuring, or whether that thing is interesting. I care about throughput as measured in widgets per unit time, not instructions per unit time; a compiler can contribute not only to instructions per unit time, but also to instructions per widget.

> Some of the problems you mention, especially special combinations, can be addressed by a bytecode compiler that ultimately defers to an interpreted VM.

Sure. But aside from implementation effort, that seems monotonically worse than a compiler targeting machine code.

> if the binary search loop needs many iterations, then only part of it can overlap with other code (barring some compiler intervention that I'd categorize as fanciful if the loop length isn't known)

What interventions? FWIW I think the main limitation is (architectural) registers (plus ROB/rename, of course), but you can balance around that by keeping multiple _dependent_ search iterations in flight and leaving enough time for them all to finish.
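To make the overlap concrete, here's a rough sketch (plain Python, so it only shows the structure, not actual ILP): a batch of independent lower-bound searches stepped in lockstep, so each pass issues one data-dependent load per lane, and in native code the lanes' load latencies could overlap. The function name and `lanes` parameter are illustrative, not anything from the thread.

```python
import bisect  # only used here to describe the semantics we match

def batched_lower_bound(a, queries, lanes=4):
    """Run `lanes` independent binary searches over sorted list `a`
    in lockstep, returning bisect_left-style indices for `queries`.
    Within a lane, iterations are dependent (each load's address
    depends on the previous comparison); across lanes they are
    independent, which is what lets hardware overlap the misses."""
    n = len(a)
    out = [0] * len(queries)
    for base in range(0, len(queries), lanes):
        batch = queries[base:base + lanes]
        lo = [0] * len(batch)
        hi = [n] * len(batch)
        # One pass = one iteration advanced in every still-active lane.
        while any(lo[i] < hi[i] for i in range(len(batch))):
            for i in range(len(batch)):
                if lo[i] < hi[i]:
                    mid = (lo[i] + hi[i]) // 2
                    if a[mid] < batch[i]:
                        lo[i] = mid + 1
                    else:
                        hi[i] = mid
        for i, v in enumerate(lo):
            out[base + i] = v
    return out
```

In a real implementation the lane count would be tuned against the register and ROB budget discussed above.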

Also: I think it's probably good to have at least a rough idea of how many iterations things are going to take, and I'm vaguely planning to specialise on the order of magnitude of each dimension (plus the exact value when small). Knowing how big things are also gives you a fairly good idea of what is going to pollute cache (conservatively assume that random accesses are uniformly distributed) and by how much.

> Mind if I email about this?

Feel free! elronnd@elronnd.net. FWIW I think the gather hammer, where supported, is probably ideal. I benchmarked it and it lost, but I was on AMD hardware at the time, which is quite abysmal in that respect; Intel seems to be much better, especially with AVX-512. It also leads to much more straightforward code.
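The shape of the gather version can be mocked up with NumPy fancy indexing standing in for the hardware gather instruction (a structural sketch, not the intrinsics code; names are mine): every iteration does one "gather" of pivots that services the whole batch of queries at once, branchlessly.

```python
import numpy as np

def gather_lower_bound(a, q):
    """Batched branchless lower_bound: all queries descend together,
    and a[mid] is one vectorized gather per iteration."""
    a = np.asarray(a)
    q = np.asarray(q)
    lo = np.zeros(len(q), dtype=np.int64)
    hi = np.full(len(q), len(a), dtype=np.int64)
    while True:
        active = lo < hi
        if not active.any():
            break
        mid = (lo + hi) >> 1
        # The gather: one indexed load for the whole batch.
        # (Clip only to keep finished lanes' dummy loads in bounds.)
        pivot = a[np.minimum(mid, len(a) - 1)]
        go_right = active & (pivot < q)
        lo = np.where(go_right, mid + 1, lo)
        hi = np.where(active & ~go_right, mid, hi)
    return lo
```

With a real gather (e.g. AVX-512), the masked lanes and the blend at the end map directly onto mask registers.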

> I implemented a different solution to high latencies for large searches in Dyalog, which is to partition the array y (searched-for values) by a single pivot from x and recurse.

I figured it was doing something like that. I was thinking about implementing something similar, but vaguely hoped that if the buffers were large enough, and I could keep them filled, it would be possible to avoid the overhead of partitioning and of scattering results. There's also the option, depending on relative sizes, of preprocessing x, producing e.g. a B-tree or an Eytzinger layout. Fairly low priority, so I've not yet looked too closely.
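For reference, the Eytzinger option looks something like this (a minimal sketch; the bit trick at the end is the standard `k >>= ffs(~k)` that maps the final heap index back to the lower-bound position):

```python
def eytzinger(sorted_vals):
    """Lay out a sorted array in BFS (heap) order, 1-indexed:
    node k has children 2k and 2k+1. Filling slots via in-order
    traversal preserves the sorted order."""
    n = len(sorted_vals)
    t = [None] * (n + 1)
    it = iter(sorted_vals)
    def fill(k):
        if k <= n:
            fill(2 * k)
            t[k] = next(it)
            fill(2 * k + 1)
    fill(1)
    return t

def eytzinger_lower_bound(t, x):
    """Return index k such that t[k] is the first element >= x,
    or 0 if every element is < x. The descent is branchless, and
    a node's descendants sit at predictable indices, which makes
    explicit prefetching a few levels ahead easy."""
    n = len(t) - 1
    k = 1
    while k <= n:
        k = 2 * k + (t[k] < x)
    # Strip trailing 1-bits plus one: Python spelling of ffs(~k).
    k >>= ((~k) & (k + 1)).bit_length()
    return k
```

The layout trades one-time preprocessing of x for cache- and prefetch-friendly lookups, which is the relative-sizes trade-off mentioned above.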


