My understanding is that your levers are roughly: better / more diverse embeddings, or computing more embeddings (embed chunks / groups / etc.) and aggregating more cosine similarities / scores. More flops = better search, with steep diminishing returns.
ColBERT is a good google-able application of using more embeddings.
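For anyone who hasn't seen it, here's a minimal sketch of ColBERT-style late interaction ("MaxSim") scoring: one embedding per token instead of per text, then aggregate many similarities. The random vectors are placeholders; a real system gets them from a trained encoder.

    import numpy as np

    def maxsim_score(query_embs: np.ndarray, doc_embs: np.ndarray) -> float:
        """Sum over query tokens of the max cosine similarity to any doc token."""
        q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
        d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
        sims = q @ d.T                         # (num_query_tokens, num_doc_tokens)
        return float(sims.max(axis=1).sum())   # best doc token per query token

    rng = np.random.default_rng(0)
    query = rng.normal(size=(8, 128))          # 8 query tokens, 128 dims each
    docs = [rng.normal(size=(200, 128)) for _ in range(3)]
    print([maxsim_score(query, d) for d in docs])

More embeddings plus more similarity aggregation = more flops per query, which is exactly the lever described above.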
Search often ends up being a funnel of techniques: cheap and high recall for phase 1, then ratchet up the flops and precision in subsequent passes over the previous result set.
Exactly! A neat property of matryoshka embeddings is that you can compute a low-dimension embedding similarity really fast and then refine afterwards.
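A sketch of that coarse-to-fine funnel, assuming matryoshka-style embeddings where a dimension prefix is itself a usable embedding (the random vectors and the 64/768 dims are stand-ins, not from a real model):

    import numpy as np

    rng = np.random.default_rng(0)
    full_dim, coarse_dim, n_docs, top_k = 768, 64, 20_000, 50

    docs = rng.normal(size=(n_docs, full_dim))
    docs /= np.linalg.norm(docs, axis=1, keepdims=True)
    query = rng.normal(size=full_dim)
    query /= np.linalg.norm(query)

    # Phase 1: cheap, high-recall pass using only the first 64 dims
    # (renormalized so scores are true cosines on the prefix).
    doc_prefix = docs[:, :coarse_dim]
    doc_prefix = doc_prefix / np.linalg.norm(doc_prefix, axis=1, keepdims=True)
    q_prefix = query[:coarse_dim] / np.linalg.norm(query[:coarse_dim])
    coarse_scores = doc_prefix @ q_prefix
    candidates = np.argpartition(coarse_scores, -top_k)[-top_k:]

    # Phase 2: spend the flops only on the shortlist, at full dimension.
    fine_scores = docs[candidates] @ query
    ranked = candidates[np.argsort(fine_scores)[::-1]]
    print(ranked[:10])

Phase 1 touches every doc at 1/12th the cost; phase 2 pays full price on 50 survivors instead of 20,000.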
It's expensive in this field to verify other people's work. There are a few other papers in the last 3 years that have the same high-level idea but call the anchor tokens something different -- Gist tokens being the only one I personally remember, but you can follow the citation chains back.
Those other papers sounded like a godsend, but they have deficits you only find out about when you try them against non-cherry-picked use cases. I think they're getting better with time, on average, though.
They call out their limitations at the bottom of the paper. For these kinds of models, it would be nice to see them probing & measuring the weaknesses of compressive memory -> producing exact outputs. That would be things like retrieving multiple items out of the context exactly, arithmetic, or copy-pasting high-entropy bits (i.e. where a basic n-gram model can't bias you out of the blurry pieces).
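A hypothetical sketch of that kind of probe: hide several high-entropy strings (which no n-gram prior can reconstruct) in a long context and check whether the model copies them back exactly. The prompt format and scoring here are my assumptions, not from any paper.

    import secrets

    def make_probe(n_needles: int = 3, filler_paragraphs: int = 50) -> tuple[str, list[str]]:
        needles = [secrets.token_hex(16) for _ in range(n_needles)]  # 32 random hex chars each
        filler = "The quick brown fox jumps over the lazy dog. " * 20
        parts = [filler] * filler_paragraphs
        # Spread the needles across the context.
        step = filler_paragraphs // (n_needles + 1)
        for i, needle in enumerate(needles, start=1):
            parts.insert(i * step, f"The secret code #{i} is {needle}.")
        question = f"List secret codes 1..{n_needles} exactly, in order."
        return "\n\n".join(parts) + "\n\n" + question, needles

    def exact_recall(model_output: str, needles: list[str]) -> float:
        return sum(n in model_output for n in needles) / len(needles)

    prompt, needles = make_probe()
    # score = exact_recall(call_model(prompt), needles)  # call_model is hypothetical

A compressive memory that only keeps a lossy gist should degrade hard on this, while exact-KV attention shouldn't.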
The other side of it is that there's often some difficulty reproducing training for some of these architectures -- the training can be highly unstable and both difficult and expensive to dial in on a real-world model. We see their best training run, not the 500 runs where they changed hyperparameters b/c the loss kept exploding randomly (compare this to text-only llama-esque architectures, which are wildly stable at training time / predictable / easy to invest in, and whose hyperparams are easy to find from prior art).
I think we are still many papers away from something ready-for-prod on this concept, but I am personally optimistic.
Gradient.ai | SF Bay Area | Onsite/hybrid | Staff SWE | Senior SWE | Enterprise Account Executive
Our vision is to power the future of enterprise automation. Gradient is a full-stack AI platform that enables businesses to automate operational workflows via AI/agent-driven data transformation & processing.
We are training on top of Llama 3. The 256k reasoning benchmarks are on the Open LLM Leaderboard.
And re: token count: our copy was wrong -- it was pre-prepped copy for a model run that didn't pan out. Updating it to the correct number, which is already present in the training grid further down in the model card: a bit over 830M tokens for this stage, and >1B for all extension stages combined.
Your point re: token counts still stands. We wanted to get something out ASAP and finetune more later. I believe the giant vocab size of Llama 3 is actually adversarial for finetunes: you need a beefy dataset just to hit every vocab token a single time in a forward and backward pass.
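A rough back-of-envelope on that, assuming token frequencies follow a Zipfian distribution with i.i.d. sampling (the exponents are illustrative assumptions, not measurements of any real corpus):

    import numpy as np

    V = 128_256                              # Llama 3 vocab size
    ranks = np.arange(1, V + 1, dtype=np.float64)

    for s in (1.0, 1.5):                     # Zipf exponent; real tails vary
        p = ranks ** -s
        p /= p.sum()
        for n in (1e7, 1e8, 1e9):
            # Expected distinct vocab entries after n tokens: E[seen] = sum_i 1 - (1 - p_i)^n
            seen = np.sum(1.0 - np.exp(n * np.log1p(-p)))
            print(f"s={s}  {n:>13,.0f} tokens -> ~{V - seen:,.0f} vocab entries never seen")

With a heavier tail (s=1.5), tens of thousands of vocab entries are still unseen at 100M tokens and you need on the order of a billion before the tail is covered -- which is roughly the scale of dataset this kind of finetune wants.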