Hacker Newsnew | past | comments | ask | show | jobs | submit | purplesyringa's commentslogin

There's just so many "but"s to this that I can't in good faith recommend people to treat floats as deterministic, even though I'd very much love to do so (and I make such assumptions myself, caveat emptor):

- NaN bits are non-deterministic. x86 and ARM generate different sign bits for NaNs. Wasm says NaN payloads are completely unpredictable.

- GPUs don't give a shit about IEEE-754 and apply optimizations raging from DAZ to -ffast-math.

- sin, rsqrt, etc. behave differently when implemented by different libraries. If you're linking libm for sin, you can get different implementations depending on the libc in use. Or you can get different results on different hardware.

- C compilers are allowed to "optimize" a * b + c to FMA when they wish to. The standard only technically allows this merge within one expression, but GCC enables this in all cases by default on some `-std`s.

You're technically correct that floats can be used right, but it's just impossible to explain to a layman that, yes, floats are fine on CPUs, but not on GPUs; fine if you're doing normal arithmetic and sqrt, but not sin or rsqrt; fine on modern compilers, but not old ones; fine on x86, but not i686; fine if you're writing code yourself, but not if you're relying on linear algebra libraries, unless of course you write `a * b + c` and compile with the wrong options; fine if you rely on float equality, but not bitwise equality; etc. Everything is broken and the entire thing is a mess.


Yes, there are a large number of ways to fall into traps that cause you to do a different series of operations when you didn't realise that you did. But that's still ultimately what all your examples are. (Except the NaN thing!)

I still think it's important to fight the misinformation.

Programmers have been conditioned to be so afraid of floats that many believe that doing a + b has an essentially random outcome when it doesn't work that way at all. It leads people to spend a bunch of effort on things that they don't need to be doing.


> Very tangential, but I could swear QBasic included an on-disk documentation system accessible from the editor. Maybe only later versions?

Perhaps my installation didn't include it, or maybe you're confusing it with QuickBASIC, a more feature-complete IDE with a compiler (instead of just an interpreter). I don't exactly remember.


I don't think it's possible to apply this trick to 64-bit floats on 64-bit architecture, which OP mentions in the last sentence. You need a 52 x 52 -> 104 product. Modular 64 x 64 -> 64 multiplication gives you the 64 bottom bits exactly, widening 32 x 32 -> 64 multiplication approximately gives you the top 32 bits. That leaves 104 - 64 - 32 = 8 bits that are not accounted for at all. Compare with the 32-bit case, where the same arithmetic gives 46 - 32 - 16 = -2, i.e. a 2-bit overlap the method relies on.

Yeah, it feels intuitively like it should work, but even with full 64-bit modular multiply the bounds don't overlap. I should probably delete that sentence, but on the other hand maybe the wild goose chase will lead to someone finding a better trick. For now my double-precision multiply is just direct expansion.

I think it fails because it seems like the difference between 32-bit and 64-bit floats is 2x, but in reality we should look at the mantissa, and the increase from 23 bits to 52 bits is much greater.

Although I managed to tweak this method to work with 3 multiplications.

ETA: I just realized you wanted to use 32x32 -> 64 products, while my approach assumes the existence of 64x64 -> 64 products; basically it's just a scaled-up version of the original question and likely not what you're looking for. Hopefully it's still useful though.

First, remove the bottom 8 bits of the two inputs and compute the 44x44->88 product. This can be done with the approach in the post. Then apply the algorithm again, combining that product together with the product of the bottom half of the input to get the full 52x52->104 output. The bounds are a bit tight, but it should work. Here's a numeric example:

    a = 98a67ee86f8cf
    b = da19d2c9dfe71

    (a >> 20) * (b >> 20)         = 820d2e04637bf428
    (a >> 8) * (b >> 8) % 2**64   =       0547f8cdb2100210
    ->
    (a >> 8) * (b >> 8)           = 820d2e0547f8cdb2100210

    (a >> 8) * (b >> 8)           = 820d2e0547f8cdb2100210
    (a * b) % 2**64               =           080978075f64355f
    ->
    a * b                         = 820d2e0548080978075f64355f
And my attempt at implementation: https://play.rust-lang.org/?version=stable&mode=release&edit...

And here are the constants for the ostensibly more useful 54x54->108 product: https://play.rust-lang.org/?version=stable&mode=release&edit...

I tried to go even higher, but the bounds seems to break at 55 bits.


You can still write code without LLMs, much like you can write code without modern IDEs, or use C and assembly instead of higher-level languages. But there are significant differences between the skills you learn in the process, which I believe inhibits upward mobility.

The formulas are provided in the supplementary information file, as mentioned in the paper. https://arxiv.org/src/2603.21852v2/anc/SupplementaryInformat... You want page 9.

I must admit I'm surprised to see this -- Lemire offhandedly mentioned in the famous remainder blog post (https://lemire.me/blog/2019/02/08/faster-remainders-when-the...) that 64-bit constants can be used for 32-bit division, and even provided a short example to compute the remainder that way (though not the quotient). Looking a bit more, it seems like libdivide didn't integrate this optimization either.

I guess everyone just assumed that this is so well-known now, that compilers have certainly integrated it, but no one actually bothered to submit a patch until now, when it was reinvented?


The paper doesn't require a bitshift after multiplication -- it directly uses the high half of the product as the quotient, so it saves at least one tick over the solution you mentioned. And on x86, saturating addition can't be done in a tick and 32->64 zero-extension is implicit, so the distinction is even wider.

> And on x86, saturating addition can't be done in a tick

Perhaps I misunderstand your point, but I am rather sure that in SSE.../AVX... there do exist instructions for saturating addition:

* (V)PADDSB, (V)PADDSW, (V)PADDUSB, (V)PADDUSW

* (V)PHADDSW, (V)PHSUBSW


Unfortunately, that's only vector, and ≤16-bit ints at that, no 32-bit ints; and as the other reply says, nearly non-existent multiply-high which generally makes vectorized div-by-const its own mini-hell (but doing a 2x-width multiply with fixups is still better than the OP 4x-width method).

(...though, x86 does have (v)pmulhw for 16-bit input, so for 16-bit div-by-const the saturating option works out quite well.)

(And, for what it's worth, the lack of 8-bit multiplies on x86 means that the OP method of high-half-of-4x-width-multiply works out nicely for vectorizing dividing 8-bit ints too)


On x86, there is no vector instruction to get the upper half of integer product (64-bits x 64-bits). ARM SVE2 and RISC-V RVV have one, x86 unfortunately does not (and probably wont for a long time as AVX10 does not add it, either).

There is one for the f64 FMA recycling IFMA from AVX512 they have for bignum libraries;it's a 52 bit unsigned multiply and accumulates either the low or the high output halves into a 64bit accumulator.

It's surely no 64 bit but it's much more than 32 bit. And it's giving you access to the high halves so you can use it to compute 32x32->64 on vector even if only half as packed as that could be.


Honorary mention: byte swapping instructions (originally added to CPUs for endianness conversion) can also be used to redistribute entropy, but they're slightly slower than rotations on Intel, which is why I think they aren't utilized much.

I think the reason real-world implementations don't do this is to speed up access when the key is a small integer. Say, if your IDs are spread uniformly between 1 and 1000, taking the bottom 7 bits is a great hash, while the top 7 bits would just be zeros. So it's optimizing for a trivial hash rather than a general-purpose fast hash.

And since most languages require each data type to provide its own hash function, you kind of have to assume that the hash is half-assed and bottom bits are better. I think only Rust could make decisions differently here, since it's parametric over hashers, but I haven't seen that done.


I work on the machine code level, so the only characteristic I'm interested in is how many ticks it takes to compute the result, not how many transistors it requires or anything like that. All modern CPUs take 1 tick to compute both XOR, addition, and many other simple arithmetic operations, so even though addition is technically more complicated in CPU designs, it never surfaces in software. In the context of this post, I preferred addition instead of XOR to reduce cancel-out and propagate entropy between bits.

Consider applying for YC's Summer 2026 batch! Applications are open till May 4

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: