
Sorta related -- whenever I'm doing something with embeddings, I just normalize them to length one, at which point cosine similarity becomes a simple dot product. Is there ever a reason not to normalize embedding length? An application where that length matters?
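For anyone following along, here's the equivalence as a minimal numpy sketch (the 768-dim random vectors are just stand-ins for real embeddings):

    import numpy as np

    def normalize(v):
        # Scale to unit length; after this, cosine similarity is just a dot product.
        return v / np.linalg.norm(v)

    a = np.random.randn(768)
    b = np.random.randn(768)

    cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    dot_normalized = np.dot(normalize(a), normalize(b))

    assert np.isclose(cosine, dot_normalized)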


For the LLM itself, length matters. For example, the final logits are computed as un-normalized dot products, making them a function of both direction and magnitude. This means that if you embed a token and then immediately un-embed it (using the same embedding matrix for both), you might get back a different token. In models such as GPT-2, embedding vector magnitude is loosely correlated with token frequency.
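A contrived numpy sketch of that round-trip failure (the toy matrix and the deliberately scaled row are made up for illustration; in real models the row norms just vary naturally):

    import numpy as np

    rng = np.random.default_rng(0)
    vocab, dim = 1000, 64
    E = rng.normal(size=(vocab, dim))   # toy tied embedding / un-embedding matrix

    token = 42
    # Contrived: give another row the same direction but 10x the magnitude.
    E[7] = 10.0 * E[token]

    v = E[token]                        # embed
    logits = E @ v                      # un-embed: un-normalized dot products
    print(int(np.argmax(logits)))       # prints 7, not 42 -- magnitude wins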


On the practical side, dot products are great, but they break down in mixed-precision and integer representations, where accurately normalizing to unit length isn't feasible.
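To make the integer case concrete (the symmetric int8 scheme with a fixed scale of 127 is an assumption, just one common choice):

    import numpy as np

    rng = np.random.default_rng(1)
    a = rng.normal(size=256).astype(np.float32)
    a /= np.linalg.norm(a)              # unit length... in float32

    # Symmetric int8 quantization with scale 127.
    q = np.clip(np.round(a * 127), -127, 127).astype(np.int8)
    deq = q.astype(np.float32) / 127

    print(np.linalg.norm(deq))          # not exactly 1.0: unit norm doesn't survive rounding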

In other cases, people prefer L2 distance for embeddings, where magnitude can have a serious impact on the distance between a pair of points.
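E.g., two points with the same direction but different magnitudes are identical under cosine similarity but far apart under L2:

    import numpy as np

    a = np.array([1.0, 0.0])
    b = np.array([10.0, 0.0])           # same direction, 10x the magnitude

    cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    print(cosine)                       # 1.0: identical under cosine similarity
    print(np.linalg.norm(a - b))        # 9.0: far apart under L2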


If you’re feeling guilty about it, you can usually store the un-normalized lengths separately.
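Something like this (a minimal sketch; the helper name is made up):

    import numpy as np

    def split_norm(v):
        # Keep the unit vector for dot-product search, and the norm for
        # anything that needs the original magnitude. Nothing is lost.
        n = np.linalg.norm(v)
        return v / n, n

    v = np.random.randn(384)
    unit, n = split_norm(v)
    assert np.allclose(unit * n, v)     # exact reconstruction (up to float error)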



