
Sorta related -- whenever I'm doing something with embeddings, I just normalize them to length one, at which point cosine similarity becomes a simple dot product. Is there ever a reason not to normalize embedding length? An application where that length matters?
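For anyone following along, here's the equivalence as a minimal numpy sketch (the 768-dim random vectors are just stand-ins for real embeddings):

    import numpy as np

    def normalize(v):
        # Scale to unit length; after this, cosine similarity is just a dot product.
        return v / np.linalg.norm(v)

    a = np.random.randn(768)
    b = np.random.randn(768)

    cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    dot_normalized = np.dot(normalize(a), normalize(b))

    assert np.isclose(cosine, dot_normalized)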


For the LLM itself, length matters. For example, the final logits are computed as un-normalized dot products, making them a function of both direction and magnitude. This means that if you embed a token and then immediately un-embed it (using the same embedding matrix for both), you might get back a different token. In models such as GPT-2, embedding vector magnitude is loosely correlated with token frequency.
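A contrived numpy sketch of that round-trip failure (the toy matrix and the deliberately scaled row are made up for illustration; in real models the row norms just vary naturally):

    import numpy as np

    rng = np.random.default_rng(0)
    vocab, dim = 1000, 64
    E = rng.normal(size=(vocab, dim))   # toy tied embedding / un-embedding matrix

    token = 42
    # Contrived: give another row the same direction but 10x the magnitude.
    E[7] = 10.0 * E[token]

    v = E[token]                        # embed
    logits = E @ v                      # un-embed: un-normalized dot products
    print(int(np.argmax(logits)))       # prints 7, not 42 -- magnitude wins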


On the practical side, dot products are great, but they break down in mixed-precision and integer representations, where accurately normalizing to unit length isn't feasible.
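To make the integer case concrete (the symmetric int8 scheme with a fixed scale of 127 is an assumption, just one common choice):

    import numpy as np

    rng = np.random.default_rng(1)
    a = rng.normal(size=256).astype(np.float32)
    a /= np.linalg.norm(a)              # unit length... in float32

    # Symmetric int8 quantization with scale 127.
    q = np.clip(np.round(a * 127), -127, 127).astype(np.int8)
    deq = q.astype(np.float32) / 127

    print(np.linalg.norm(deq))          # not exactly 1.0: unit norm doesn't survive rounding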

In other cases, people prefer L2 distance for embeddings, where magnitude can have a serious impact on the distance between a pair of points.
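E.g., two points with the same direction but different magnitudes are identical under cosine similarity but far apart under L2:

    import numpy as np

    a = np.array([1.0, 0.0])
    b = np.array([10.0, 0.0])           # same direction, 10x the magnitude

    cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    print(cosine)                       # 1.0: identical under cosine similarity
    print(np.linalg.norm(a - b))        # 9.0: far apart under L2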


If you’re feeling guilty about it, you can usually store the un-normalized lengths separately.
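Something like this (a minimal sketch; the helper name is made up):

    import numpy as np

    def split_norm(v):
        # Keep the unit vector for dot-product search, and the norm for
        # anything that needs the original magnitude. Nothing is lost.
        n = np.linalg.norm(v)
        return v / n, n

    v = np.random.randn(384)
    unit, n = split_norm(v)
    assert np.allclose(unit * n, v)     # exact reconstruction (up to float error)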



