My understanding is that your levers are roughly: better / more diverse embeddings, or computing more embeddings (embed chunks / groups / etc.) and aggregating more cosine similarities / scores. More flops = better search, with steep diminishing returns.
ColBERT is a good google-able application of using more embeddings.
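For anyone who hasn't seen it, here's a minimal sketch of ColBERT-style late interaction ("MaxSim") scoring: one embedding per token instead of per text, then aggregate many similarities. The random vectors are placeholders; a real system gets them from a trained encoder.

    import numpy as np

    def maxsim_score(query_embs: np.ndarray, doc_embs: np.ndarray) -> float:
        """Sum over query tokens of the max cosine similarity to any doc token."""
        q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
        d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
        sims = q @ d.T                         # (num_query_tokens, num_doc_tokens)
        return float(sims.max(axis=1).sum())   # best doc token per query token

    rng = np.random.default_rng(0)
    query = rng.normal(size=(8, 128))          # 8 query tokens, 128 dims each
    docs = [rng.normal(size=(200, 128)) for _ in range(3)]
    print([maxsim_score(query, d) for d in docs])

More embeddings plus more similarity aggregation = more flops per query, which is exactly the lever described above.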
Search often ends up being a funnel of techniques: cheap and high recall for phase 1, then ratchet up the flops and precision in subsequent passes over the previous result set.
Exactly! A neat property of matryoshka embeddings is that you can compute a low-dimension embedding similarity really fast and then refine afterwards.
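A sketch of that coarse-to-fine funnel, assuming matryoshka-style embeddings where a dimension prefix is itself a usable embedding (the random vectors and the 64/768 dims are stand-ins, not from a real model):

    import numpy as np

    rng = np.random.default_rng(0)
    full_dim, coarse_dim, n_docs, top_k = 768, 64, 20_000, 50

    docs = rng.normal(size=(n_docs, full_dim))
    docs /= np.linalg.norm(docs, axis=1, keepdims=True)
    query = rng.normal(size=full_dim)
    query /= np.linalg.norm(query)

    # Phase 1: cheap, high-recall pass using only the first 64 dims
    # (renormalized so scores are true cosines on the prefix).
    doc_prefix = docs[:, :coarse_dim]
    doc_prefix = doc_prefix / np.linalg.norm(doc_prefix, axis=1, keepdims=True)
    q_prefix = query[:coarse_dim] / np.linalg.norm(query[:coarse_dim])
    coarse_scores = doc_prefix @ q_prefix
    candidates = np.argpartition(coarse_scores, -top_k)[-top_k:]

    # Phase 2: spend the flops only on the shortlist, at full dimension.
    fine_scores = docs[candidates] @ query
    ranked = candidates[np.argsort(fine_scores)[::-1]]
    print(ranked[:10])

Phase 1 touches every doc at 1/12th the cost; phase 2 pays full price on 50 survivors instead of 20,000.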
It's expensive in this field to verify other people's work. There are a few other papers in the last 3 years that have the same high-level idea but call the anchor tokens something different -- Gist tokens being the only one I personally remember, but you can follow the citation chains back.
Those other papers sounded like a godsend, but they have deficits you only find out about when you try them against non-cherry-picked use cases. I think they're getting better with time, on average, though.
They call out their limitations at the bottom of the paper. For these kinds of models, it would be nice to see them probing & measuring the weaknesses of compressive memory -> producing exact outputs. That would be things like retrieving multiple items out of the context exactly, arithmetic, or copy-pasting high-entropy bits (i.e. where a basic n-gram model can't bias you out of the blurry pieces).
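A hypothetical sketch of that kind of probe: hide several high-entropy strings (which no n-gram prior can reconstruct) in a long context and check whether the model copies them back exactly. The prompt format and scoring here are my assumptions, not from any paper.

    import secrets

    def make_probe(n_needles: int = 3, filler_paragraphs: int = 50) -> tuple[str, list[str]]:
        needles = [secrets.token_hex(16) for _ in range(n_needles)]  # 32 random hex chars each
        filler = "The quick brown fox jumps over the lazy dog. " * 20
        parts = [filler] * filler_paragraphs
        # Spread the needles across the context.
        step = filler_paragraphs // (n_needles + 1)
        for i, needle in enumerate(needles, start=1):
            parts.insert(i * step, f"The secret code #{i} is {needle}.")
        question = f"List secret codes 1..{n_needles} exactly, in order."
        return "\n\n".join(parts) + "\n\n" + question, needles

    def exact_recall(model_output: str, needles: list[str]) -> float:
        return sum(n in model_output for n in needles) / len(needles)

    prompt, needles = make_probe()
    # score = exact_recall(call_model(prompt), needles)  # call_model is hypothetical

A compressive memory that only keeps a lossy gist should degrade hard on this, while exact-KV attention shouldn't.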
The other side of it is that there's often some difficulty reproducing training for some of these architectures -- the training can be highly unstable and both difficult and expensive to dial in on a real-world model. We see their best training run, not the 500 runs where they changed hyperparameters b/c the loss kept exploding randomly (compare this to text-only llama-esque architectures, which are wildly stable at training time / predictable / easy to invest in, and whose hyperparams are easy to find from prior art).
I think we are still many papers away from something ready-for-prod on this concept, but I am personally optimistic.
Gradient.ai | SF Bay Area | Onsite/hybrid | Staff SWE | Senior SWE | Enterprise Account Executive
Our vision is to power the future of enterprise automation. Gradient is a full-stack AI platform that enables businesses to automate operational workflows via AI/agent-driven data transformation & processing.
We are training on top of Llama 3. The 256k reasoning benchmarks are on the Open LLM Leaderboard.
And re: token count: our copy was wrong -- it was pre-prepped copy for a model run that didn't pan out. Updating it to the correct number, which is already present in the training grid further down in the model card: a bit over 830M tokens for this stage, and >1B for all extension stages combined.
Your point re: token counts still stands. We wanted to get something out ASAP and finetune more later. I believe the giant vocab size of Llama 3 is actually adversarial for finetunes: you need a beefy dataset just to hit every vocab token a single time in a forward and backward pass.
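A rough back-of-envelope on that, assuming token frequencies follow a Zipfian distribution with i.i.d. sampling (the exponents are illustrative assumptions, not measurements of any real corpus):

    import numpy as np

    V = 128_256                              # Llama 3 vocab size
    ranks = np.arange(1, V + 1, dtype=np.float64)

    for s in (1.0, 1.5):                     # Zipf exponent; real tails vary
        p = ranks ** -s
        p /= p.sum()
        for n in (1e7, 1e8, 1e9):
            # Expected distinct vocab entries after n tokens: E[seen] = sum_i 1 - (1 - p_i)^n
            seen = np.sum(1.0 - np.exp(n * np.log1p(-p)))
            print(f"s={s}  {n:>13,.0f} tokens -> ~{V - seen:,.0f} vocab entries never seen")

With a heavier tail (s=1.5), tens of thousands of vocab entries are still unseen at 100M tokens and you need on the order of a billion before the tail is covered -- which is roughly the scale of dataset this kind of finetune wants.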