Because I'm not sure exactly what you're looking for when you say 'compares to' -- whether accuracy, speed, or architecture -- I'll hit all 3, but sorry if it's a bit much.
1. Accuracy: For simple tasks (like sentiment analysis on straightforward examples), it won't be much more accurate than a classical linear classifier, if at all.
1a. Accuracy on more diverse or challenging tasks: Because a linear classifier is just so damned simplistic, it simply cannot handle anything even resembling a reasoning task. Meanwhile, when specifically trained, this architecture managed 8/10 on textual entailment tasks, which are generally considered the entry-level gold standard for reasoning ability.
2. Speed: It's slower than a classical classifier, unsurprisingly, given the ~1B params it's pushing. They're both still pretty much blazing fast, but the tiny classical classifier will definitely be faster.
3. Architecture:
Here's where it gets interesting.
The architecture of the core model here differs significantly from a classical linear classifier:
Classical Classifier:
Input: BGE embedding (in this hypothetical)
Output: Class labels through softmax
Internal Architecture: No nonlinearity, no hidden layers, direct projection
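To make "direct projection" concrete, here's a minimal sketch of that classical classifier in numpy. The dimensions are hypothetical (a 768-dim BGE-style embedding, 3 classes) -- the point is just that the entire model is one weight matrix and a softmax:

```python
import numpy as np

rng = np.random.default_rng(0)
EMB_DIM, NUM_CLASSES = 768, 3  # hypothetical sizes, not from the actual setup

# The whole model: one weight matrix and a bias.
W = rng.normal(scale=0.02, size=(NUM_CLASSES, EMB_DIM))
b = np.zeros(NUM_CLASSES)

def classify(embedding: np.ndarray) -> np.ndarray:
    """Direct projection to class logits: no hidden layers, no nonlinearity."""
    logits = W @ embedding + b
    exp = np.exp(logits - logits.max())  # numerically stable softmax
    return exp / exp.sum()

probs = classify(rng.normal(size=EMB_DIM))
```

That's the whole thing -- which is exactly why it's fast and why it can't do anything resembling reasoning.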
General Classifier:
Input: BGE Embedding
Output: Class labels through nearest neighbor cosine similarity search of vocabulary
Internal architecture: A sparse input-projection layer, a layer that combines the 3 inputs after their upward projection, and 14 hidden layers with nonlinearity (GELU), layernorms, skip connections -- all of the standard stuff you'd expect in an LLM, but...not in an LLM.
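For contrast with the linear case, here's a rough numpy sketch of the general classifier's shape: project the 3 inputs up, combine them, run them through a residual GELU/layernorm stack, then pick the class by cosine similarity against a label-vocabulary matrix. All sizes (hidden width, vocabulary size, using a dense rather than sparse input projection) are stand-ins I've made up for illustration, not the actual model:

```python
import numpy as np

rng = np.random.default_rng(0)
EMB_DIM, HID, N_LAYERS, VOCAB = 768, 1024, 14, 5  # hypothetical sizes

def layernorm(x: np.ndarray) -> np.ndarray:
    return (x - x.mean()) / (x.std() + 1e-5)

def gelu(x: np.ndarray) -> np.ndarray:
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

# Upward projection for each of the 3 inputs, a combining layer,
# 14 hidden layers, and a vocabulary of label embeddings to search.
W_up = [rng.normal(scale=0.02, size=(HID, EMB_DIM)) for _ in range(3)]
W_combine = rng.normal(scale=0.02, size=(HID, 3 * HID))
W_hidden = [rng.normal(scale=0.02, size=(HID, HID)) for _ in range(N_LAYERS)]
label_vocab = rng.normal(size=(VOCAB, HID))

def classify(inputs: list[np.ndarray]) -> int:
    # 1. project each input up, then combine the three
    h = W_combine @ np.concatenate([W @ x for W, x in zip(W_up, inputs)])
    # 2. hidden stack: layernorm -> GELU MLP -> skip connection
    for W in W_hidden:
        h = h + gelu(W @ layernorm(h))
    # 3. nearest neighbor by cosine similarity over the label vocabulary
    sims = (label_vocab @ h) / (np.linalg.norm(label_vocab, axis=1) * np.linalg.norm(h))
    return int(np.argmax(sims))

pred = classify([rng.normal(size=EMB_DIM) for _ in range(3)])
```

The key design difference from the linear classifier is that the output space isn't a fixed softmax head -- it's whatever lives in the label vocabulary, so you can swap labels without retraining the projection.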
I hope that clears up your questions! If not, I'm happy to tell you more.