If you do a bit of digging into most of the popular benchmarks that the big labs report on, you'll see pretty quickly that they have almost zero correlation with real-world tasks.
The approach that they're taking here of working backwards from an open-source repo pull request and reverse-engineering a question is unusually well thought out for a benchmark.
I haven't dug into more of the dataset questions yet, but the example they give in the blog post, a question generated from the Hugging Face Transformers repo, gives me hope that this could actually be a solid benchmark:
> How do the fast image and video processor base classes prevent shared mutable state when instantiating multiple instances?
I particularly like their use of LLM-as-a-judge. They don't go "hey chatgpt, sort these from best to worst based on vibes"; instead, they extract a set of ground truths and check how the answer compares against them, a task that SOTA LLMs can do fairly reliably. It's a very smart way to circumvent the problems introduced by pure LLM-as-a-judge methods.
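For anyone curious, the pattern is roughly this. Everything below (the prompt wording, the OpenAI client, the placeholder claims) is my own stand-in to illustrate the idea, not the benchmark's actual pipeline:

```python
# Rubric-style LLM-as-a-judge sketch: instead of asking for a holistic "vibes"
# ranking, extract ground-truth claims once, then ask a narrow yes/no question
# per claim and score the answer by coverage. (Illustrative only.)
from openai import OpenAI

client = OpenAI()

def judge(candidate_answer: str, ground_truths: list[str], model: str = "gpt-4o") -> float:
    """Return the fraction of ground-truth claims the candidate answer covers."""
    hits = 0
    for claim in ground_truths:
        resp = client.chat.completions.create(
            model=model,
            messages=[{
                "role": "user",
                "content": (
                    "Does the following answer state or clearly imply this fact?\n"
                    f"Fact: {claim}\n\nAnswer:\n{candidate_answer}\n\n"
                    "Reply with exactly YES or NO."
                ),
            }],
            temperature=0,
        )
        if resp.choices[0].message.content.strip().upper().startswith("YES"):
            hits += 1
    return hits / len(ground_truths)

# Usage sketch: ground truths would be extracted from the PR / reference answer
# ahead of time; the strings here are placeholders, not real claims.
# score = judge(model_answer, ["first extracted claim", "second extracted claim"])
```

Scoring each extracted claim as a narrow yes/no check is a much easier job for the judge model than producing an overall ranking.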
so if i understand this correctly — you want the speech recognition model to identify a vocabulary of specific terms that it wasn't trained on. instead of fine-tuning with training data that includes the new vocabulary, you input the full vocabulary at test time as a list of words and the model is able to generate transcripts that include words from the vocabulary.
seems like it could be very useful but it really comes down to the specifics.
you can already prompt whisper with context (quick sketch below) — how does this compare?
how large of a vocabulary can it work with? if it's a few dozen words it's only gonna help for niche use cases. if it can handle 100s-1000s with good performance, that could completely replace fine-tuning for many uses.
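for reference, the whisper trick i'm thinking of is just stuffing domain terms into the decoder's `initial_prompt`, roughly like this (file name and vocab list are made up):

```python
# Baseline comparison: biasing vanilla Whisper by passing domain terms as the
# initial prompt (openai-whisper package). The prompt window only keeps roughly
# the last 224 tokens, so this fits a few dozen terms at best -- which is why a
# real test-time vocabulary mechanism would be interesting.
import whisper

model = whisper.load_model("base")

medical_terms = ["metoprolol", "tachycardia", "echocardiogram", "atorvastatin"]
result = model.transcribe(
    "visit_recording.wav",  # made-up example file
    initial_prompt="Vocabulary: " + ", ".join(medical_terms),
)
print(result["text"])
```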
I haven't really dug in yet but from a quick skim, it looks promising. They show a big improvement over Whisper on a medical dataset (F1 increased from 80.5% to 96.58%).
The inference time for the keyword detection is about 10ms. If that cost scales linearly with the number of keywords, you could potentially push it to hundreds or thousands of keywords, but it really depends on how latency-sensitive you are. For real-time use with large vocabularies my guess is you'd still want to fine-tune.
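Back-of-envelope under that linear-scaling assumption (the keyword count behind the reported ~10ms is a guess on my part; the real system may amortize much better):

```python
# Assumed: ~10 ms covers a small keyword list and cost grows linearly with size.
base_ms = 10          # reported detection latency
base_keywords = 10    # assumed size of the list behind that number

for n in (100, 1_000, 10_000):
    est = base_ms * n / base_keywords
    print(f"{n} keywords -> ~{est:.0f} ms per detection pass")
```

At hundreds of milliseconds per pass the overhead starts to matter for real-time transcription, which is where fine-tuning keeps its edge.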
yeah — sounds about right. retraining the whole model just to add one jargon-y term isn’t super efficient. this approach lets you plug in a vocab list at runtime instead, which feels a lot more scalable.
fully agree that LangChain is a meaningless abstraction, but I've found that the graph structure LangGraph uses is a very useful mental model for thinking about an agentic flow.
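For anyone who hasn't tried it, the mental model is roughly this: nodes are steps that read and update a shared state, and edges (including conditional ones) define the control flow. A minimal sketch using LangGraph's StateGraph follows; exact API details may differ a bit across versions, and the node bodies are placeholders for real LLM/tool calls:

```python
# Minimal sketch of the graph mental model: nodes update shared state, edges
# (including a conditional one) define the agent's control flow.
from typing import TypedDict
from langgraph.graph import StateGraph, END

class AgentState(TypedDict):
    task: str
    steps: list[str]
    done: bool

def plan(state: AgentState) -> dict:
    # decide the next step (would normally call an LLM here)
    return {"steps": state["steps"] + [f"plan for: {state['task']}"]}

def act(state: AgentState) -> dict:
    # execute the step (tool call, code run, etc.) and decide if we're finished
    return {"done": len(state["steps"]) >= 3}

def should_continue(state: AgentState) -> str:
    return "end" if state["done"] else "plan"

graph = StateGraph(AgentState)
graph.add_node("plan", plan)
graph.add_node("act", act)
graph.set_entry_point("plan")
graph.add_edge("plan", "act")
graph.add_conditional_edges("act", should_continue, {"plan": "plan", "end": END})

app = graph.compile()
print(app.invoke({"task": "triage a bug report", "steps": [], "done": False}))
```

Even if you never ship LangGraph, drawing the agent as a small state machine like this makes the loop and its exit conditions much easier to reason about.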