It really depends on how it scales. If it can scale to LLM sizes via this training method (which is a big if), then in the most optimistic case it could mean fundamentally overturning the transformer architecture and replacing it with RNNs.
But if not, it could mean as little as some LLM-adjacent tools like vec2text get reworked into RNNs. Or some interesting fine-tuning at least.
We're actually in a great place to reverse aging already with off-the-shelf stuff you can get from the grey market. We have countless interventions that improve aging-related metrics in humans and show verified longevity benefits in animal models. That's about as good as you can hope for without waiting a lifetime to actually confirm the human actuarial benefit, which is inherently a losing strategy.
Because I'm not sure exactly what you're looking for when you say 'compares to' -- whether accuracy, speed, or architecture -- I'll hit all 3, but sorry if it's a bit much.
1. Accuracy: For simple tasks (like sentiment analysis on straightforward examples), it won't be much more accurate than a classical linear classifier, if at all.
1a. Accuracy on more diverse or challenging tasks: Because a linear classifier is just so damned simplistic, it simply cannot handle anything even resembling a reasoning task. Meanwhile, this architecture (when specifically trained) managed to get 8/10 on textual entailment tasks, which are generally considered the entry-level gold standard for reasoning ability.
2. Speed: It's slower than a classical classifier...in light of the ~1B params it's pushing. They're both still pretty much blazing fast, but the tiny classical classifier will definitely be faster.
3. Architecture:
Here's where it gets interesting.
The architecture of the core model here differs significantly from a classical linear classifier:
Classical Classifier:
- Input: BGE embedding (in this hypothetical)
- Output: Class labels through softmax
- Internal architecture: No nonlinearity, no hidden layers, direct projection

General Classifier:
- Input: BGE embedding
- Output: Class labels through a nearest-neighbor cosine similarity search over the vocabulary
- Internal architecture: A sparse input projection layer, a layer that combines the 3 inputs after their upward projection, and 14 hidden layers with nonlinearity (GELU), layernorms, and skip connections -- all of the standard stuff you'd expect in an LLM, but...not in an LLM.
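If it helps to see that contrast in code, here's a rough PyTorch sketch of the two shapes. The hidden width, the exact residual-block layout, and the dense stand-in for the sparse input projection are all my placeholders, not the actual implementation - only the "3 embeddings in, 1 embedding out, 14 nonlinear hidden layers" shape is taken from the description above.

```python
# Rough sketch only - hidden width is a made-up number, and the sparse input
# projection is simplified to a dense nn.Linear.
import torch
import torch.nn as nn

EMB = 1024  # bge-large embedding dimension


class ClassicalLinearClassifier(nn.Module):
    """Direct projection: no hidden layers, no nonlinearity."""

    def __init__(self, num_classes: int):
        super().__init__()
        self.proj = nn.Linear(EMB, num_classes)

    def forward(self, x):                # x: (batch, EMB)
        return self.proj(x).softmax(-1)  # class probabilities


class ResidualBlock(nn.Module):
    """Hidden layer with GELU, layernorm, and a skip connection."""

    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.ff = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, x):
        return x + self.ff(self.norm(x))


class GeneralClassifierCore(nn.Module):
    """Three embeddings in (task, input, vocabulary), one answer embedding out."""

    def __init__(self, hidden: int = 4096, depth: int = 14):
        super().__init__()
        self.up = nn.ModuleList(nn.Linear(EMB, hidden) for _ in range(3))  # per-input up-projection
        self.combine = nn.Linear(3 * hidden, hidden)                       # merge the 3 projected inputs
        self.blocks = nn.Sequential(*(ResidualBlock(hidden) for _ in range(depth)))
        self.out = nn.Linear(hidden, EMB)                                  # back to embedding space

    def forward(self, task, text, vocab):
        h = torch.cat([proj(v) for proj, v in zip(self.up, (task, text, vocab))], dim=-1)
        return self.out(self.blocks(self.combine(h)))  # decoded later by nearest-neighbor search
```

The point is just the contrast: a single projection versus a deep nonlinear trunk that stays entirely in embedding space.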
I hope that clears up your questions! If not, I'm happy to tell you more.
I would say that ease of use and deployment is actually a good reason to have a single model.
We don't train 20 LLMs for different purposes - we train one (or, I guess 3-4 in practice, each with their own broad specialization), and then prompt it for different tasks.
This simplifies deployment, integration, upgrading, etc.
This model is basically the same idea - instead of being restricted to single-task classification, it can be prompted. This means that a user can complete new tasks using a new prompt, not a new model.
While I agree with the general reasoning, isn't it harder for the user to prompt the model correctly as opposed to selecting a specialized model that they wish to use?
That's the feeling I have when I try to use LLMs for more general language processing.
Have you run into cases where the model "forgets" the task at hand and switches to another mid-stream?
Regardless of all of the above, it looks to me like your choice of reasoning and problem solving in latent space is a great one, and it's where we should be collectively focusing our efforts. Keep up the good work.
Ah, that's the beauty of it! It's not an LLM. It's a new class of model:
A DSRU / Direct Semantic Reasoning Unit.
It's a vec2vec architecture - it takes in 3 bge-large embeddings of the task, the input data, and the vocabulary. It outputs 1 bge-large embedding of the answer.
That's the DSRU part.
What makes it a classifier is that later, outside of the model, we do a nearest neighbor search for our vocabulary items using our answer vector. So it will output something from the labels no matter what - the nearest neighbor search will always have something closest, even if the model went a little crazy internally.
The prompts here tend to be very straightforward. Things like:
"Is this book review positive or negative?"
"Is this person sharing something happy or venting?"
"Determine the logical relationship between the premise and hypothesis. Answer with: entailment, neutral, or contradiction."
It has limited use cases, but where it's good, it should be very, very good - the insane speed, deterministic output, and forced label output make it great for a lot of common, cheap tasks.
I haven't seen an LLM stay on task anywhere near that long, like...ever. The only thing that works better left running overnight that has anything to do with ML, in my experience, is training.
Overall, I agree - it would take a far more sophisticated, deterministic or 'logical' AI, one better capable of tracking constraints, knowing what to check and double-check, etc. Right now, AI is far too scattered to pull that off (or, for the stuff that isn't scattered, it's largely just incapable), but a lot of smart people are thinking about it.
This is such a convenient tool for a casual user, and a great application of an LLM to a narrow task that probably couldn't be handled quite so easily everywhere. Also a great example of the emerging 'chat driven' UX trend, which I'm really liking.
I was trying to solve AGI at the time; this was just a side study I did to better understand how models forget. The effect was not what I was looking for.
It could be expanded to better understand alignment, but the resolution makes that cost-prohibitive.
I did ~100 runs on different sizes, but running inference hundreds of thousands of times made it computationally prohibitive. The key random statement is what allowed accurate measurements of the model.
The equivalent would be, for every piece of fine-tuning data you train on, running the entire evaluation dataset through the model.
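Roughly, the loop would look something like the sketch below - every name here is a placeholder, but it shows why the cost blows up (train examples × eval examples forward passes):

```python
# Hypothetical sketch: evaluate the full eval set after every single fine-tuning
# example, to track forgetting at per-example resolution. Model, loaders, and
# loss_fn are placeholders, not code from the actual study.
import torch


def finetune_with_per_step_eval(model, optimizer, loss_fn, train_examples, eval_loader):
    history = []
    for step, (x, y) in enumerate(train_examples):   # one fine-tuning example at a time
        model.train()
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()

        # Full evaluation pass after every single training example -- this is
        # what makes the measurement fine-grained but computationally prohibitive.
        model.eval()
        correct = total = 0
        with torch.no_grad():
            for ex, ey in eval_loader:
                correct += (model(ex).argmax(-1) == ey).sum().item()
                total += ey.numel()
        history.append((step, correct / total))
    return history
```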
I want to share what I've been working on for the last few weeks: O(1) inference across whole tasks through direct vector transformation. A few facts upfront to give you an idea of how it goes:
1. Implemented as part of a PoC of what I call the Promptable General Classifier (a classifier which can be prompted for general tasks, including (some, limited) reasoning tasks, and has inference-time hot-swappable vocabulary/classes), and the 1.09B implementation:
1. Runs 93x faster than Zephyr 7B (and this is being generous to Zephyr, as I had to add post-processing to extract labels from malformed LLM output, and I didn't count the time needed for that post-processing in Zephyr's benchmarks).
2. Matches Zephyr 7B's batched accuracy across 13 tasks at 77.7% (the unbatched run with Zephyr gets one more correct, so it's 80%; the DSRU is much more deterministic, and it receives no accuracy boost from running unbatched). Note that I did prompt engineering on 2-3 of these to help the DSRU. The prompt engineering seemed to have no impact on Zephyr's performance, which I'm assuming is due to its robustness as a professionally built LLM rather than a PoC of a new architecture made by a lone amateur researcher.
3. ~19x lower latency than Zephyr 7B
2. Separately trained on entailment tasks, it scored 80% (~2.4x better than chance) on a 3-label text entailment task (entails, contradicts, neutral), and 50% on a 3-label multiple-choice entailment task ('1', '2', '3') - there are notes in the white paper on why the difference.
3. The core model has an inference time at 1.09B of around 1ms per batch, but this is purely in post-attention latent space. The model has generalization capabilities, but lacks the full flexibility of an LLM. In exchange for giving that up, it gains extreme inference speeds, determinism, and extremely straightforward training with smooth loss landscapes.

I was a bit hesitant to put this out so early - I kept thinking about edge cases, ways I could add just a bit more rigor, etc. - but I decided the perfect was the enemy of the good, and put together this white paper over the course of a couple of weekends with some midweek refinements.
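On the ~1ms-per-batch figure in point 3: here's roughly the kind of timing harness you'd use to measure it for a pure vec2vec core. This isn't my actual benchmark code - batch size, embedding width, device, and the model handle are placeholders - and note it times only the core model, not the upstream bge-large encoding.

```python
# Rough sketch of measuring per-batch latency for a vec2vec core in latent space.
# Inputs are random embeddings; only the core forward pass is timed.
import time
import torch


def time_per_batch(model, batch_size=256, emb_dim=1024, iters=100, device="cuda"):
    model = model.to(device).eval()
    task = torch.randn(batch_size, emb_dim, device=device)
    text = torch.randn(batch_size, emb_dim, device=device)
    vocab = torch.randn(batch_size, emb_dim, device=device)

    with torch.no_grad():
        for _ in range(10):                      # warmup
            model(task, text, vocab)
        if device == "cuda":
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            model(task, text, vocab)
        if device == "cuda":
            torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters * 1000  # milliseconds per batch
```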
I'll be releasing a full reference implementation of the training pipeline that can run on midrange consumer hardware with default settings on GitHub in…I'm thinking 4 weeks, probably, depending on how busy I end up being - doing this with a day job has been...a lot, to say the least.
I'd release it now, but frankly, it's an embarrassing ball of mud that I hacked my way through haphazardly while chasing positive signal. Now that I've gotten this far, I can implement it more thoughtfully - and try a specific new model architecture that I think will work a lot better for a lot of comparative reasoning tasks.
It is patent pending, but I'm permitting personal experimentation and thesis work without restriction. This includes grad students using it for their degrees! You can share results and discuss your work, but distribution of trained models or derivatives is not permitted. For funded research, institutional use, or anything commercial, usage is not permitted for now.