Still doesn't look like it can do real-time, unfortunately.
Edit: I understand that you can use small samples and approximate something like streaming, but the limitation is that you wind up without context for the samples, which increases WER. It would be nice if there were a streaming option.
It's inefficient, but even older gaming GPUs are fast enough for real-time performance, and accuracy is good. If you were going to train a model from scratch for real-time use you could do something more efficient, but it works as is.
Edit: I'm not sure what you mean by "you wind up without context for the samples". You can supply context to Whisper.
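For example, with the openai-whisper Python package (a sketch assuming its current transcribe() API; the prompt text is just a placeholder), context goes in via initial_prompt:

    import whisper

    model = whisper.load_model("base")
    # initial_prompt conditions the decoder on prior text, e.g. the
    # transcript of the preceding chunk, so this sample isn't decoded
    # in a vacuum
    result = model.transcribe(
        "chunk.wav",
        initial_prompt="...transcript of the previous chunk...",
    )
    print(result["text"])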
Unfortunately Microsoft's store doesn't allow requesting more than 6GB of VRAM in a store listing, so that's why it says 6 and 12 in the same listing. The lowest I've tested so far is a 3060 12GB. I wouldn't expect 6GB to work, but if you're willing to give it a try I'd be interested to know what happens.
I'd like to support less VRAM. Maybe a future version will offload some of the processing to the cloud.
whisper-cpp can do about 6x-10x realtime with the older Whisper model on my 2021 M1 Macbook. I use this to transcribe multiple-hour-long podcasts. The tiny model can easily do 30x realtime on this hardware.
A few providers have done a variant of live transcription that's similar to how old-school providers do it, where they transcribe a short window (e.g. XXXms at a time), and this is definitely the easiest path. One such provider is Gladia: https://www.gladia.io/
There are other ways too, with different trade-offs; you can e-mail me at the link in my profile if you'd like to talk about how.
I've been looking for this. Thanks for the recommendation.
I like startups where you can sign up, use it in seconds, integrate it in minutes.
(I am definitely inspired as we don't currently provide such a straightforward experience in our own signup flow)
I tried it in English, French, and my broken Spanish, and all 3 came out great. One surprising thing is that if you switch languages mid-transcription with the "single language" model, it will transcribe the second language and translate it at the same time, so the entire transcription is in a single language, but the meaning is preserved.
So... you have to operate in chunks, generally 30s at a time, although you get reduced time-per-chunk if the chunk is mostly zeroes, and there are faster model variants.
The zeroes + faster model is how Gladia (mentioned in my other comment here) achieves live transcription, I believe: simply transcribing really short chunks one after the other (rough sketch below).
For more advanced stuff you kinda have to get your hands dirty, which I've done for my own product (not linked).
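For the curious, the naive chunked version looks roughly like this with the openai-whisper Python package (a sketch, not production code; CHUNK_SECONDS and the file name are made up, and a real setup would read from a live audio feed):

    import whisper

    CHUNK_SECONDS = 2       # short window, hypothetical value
    SAMPLE_RATE = 16000     # whisper operates on 16 kHz mono audio

    model = whisper.load_model("base")
    audio = whisper.load_audio("stream.wav")  # stand-in for a live feed

    step = CHUNK_SECONDS * SAMPLE_RATE
    for start in range(0, len(audio), step):
        # pad_or_trim zero-pads the chunk out to whisper's 30 s window;
        # the decoder emits fewer tokens for a mostly-silent window,
        # so each call returns faster than a full 30 s of speech would
        chunk = whisper.pad_or_trim(audio[start:start + step])
        result = model.transcribe(chunk, fp16=False)
        print(result["text"], flush=True)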
Thanks for mentioning Gladia. That's not exactly how it works, however: our version of Whisper is modified from the original to avoid hallucinations, and we are releasing a new model in a few days that is even better in this regard. It's also worth mentioning the three main problems that occur with real-time: endpointing, context reinjection (while avoiding hallucinations, which is a major issue with Whisper, since prompt injection tends to generate a lot of hallucinations in general), and finally alignment. Timestamps are extremely important in real time if you want to realign with the original streamed audio. Whisper tends to be hard to handle in all of these areas.
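On the alignment point, for anyone experimenting with vanilla Whisper: recent versions of the openai-whisper package can emit word-level timestamps, which is the raw material for realigning against the streamed audio. A minimal sketch, assuming that word_timestamps flag:

    import whisper

    model = whisper.load_model("base")
    result = model.transcribe("chunk.wav", word_timestamps=True)
    for segment in result["segments"]:
        for word in segment["words"]:
            # start/end are seconds relative to this chunk; add the
            # chunk's offset in the stream to realign with the source
            print(f"{word['start']:6.2f} {word['end']:6.2f} {word['word']}")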
Interesting teaser. I thought there must be some way to better optimize the model for real time, but haven't dug in because it's decently fast as is and there's so much other stuff to work on. So many models, so little time!
As another commenter pointed out, you can give context to the decoder, so you can feed the text of previous chunks into the model as context. This is how we do it for streaming, at least.
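A minimal sketch of that rolling-context loop with the openai-whisper package (the chunk files are stand-ins for a live feed, and truncating the prompt is one crude way to keep hallucinations down):

    import whisper

    model = whisper.load_model("base")
    context = ""  # rolling transcript used to prime the decoder

    for path in ["chunk_000.wav", "chunk_001.wav", "chunk_002.wav"]:
        result = model.transcribe(
            path,
            # prime the decoder with the transcript so far; keeping the
            # prompt short limits the hallucinations mentioned upthread
            initial_prompt=context[-200:],
        )
        context += result["text"]
        print(result["text"], flush=True)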
I've built a Google Docs-like doc editor that has real-time Whisper transcription built in. It's not released yet, but message me if you'd like to try it out!