Still doesn't look like it can do real-time, unfortunately.
Edit: I understand that you can use small samples and approximate something like streaming, but the limitation is that you wind up without context for the samples, which increases WER. It would be nice if there were a streaming option.
It's inefficient, but even older gaming GPUs are fast enough for real-time performance, and accuracy is good. If you were going to train a model from scratch for real-time use you could do something more efficient, but it works as is.
Edit: I'm not sure what you mean by "you wind up without context for the samples". You can supply context to Whisper.
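For example, with the openai-whisper Python package (a sketch assuming its current transcribe() API; the prompt text is just a placeholder), context goes in via initial_prompt:

    import whisper

    model = whisper.load_model("base")
    # initial_prompt conditions the decoder on prior text, e.g. the
    # transcript of the preceding chunk, so this sample isn't decoded
    # in a vacuum
    result = model.transcribe(
        "chunk.wav",
        initial_prompt="...transcript of the previous chunk...",
    )
    print(result["text"])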
Unfortunately Microsoft's store doesn't allow requesting more than 6GB of VRAM in a store listing, so that's why it says 6 and 12 in the same listing. The lowest I've tested so far is a 3060 12GB. I wouldn't expect 6GB to work, but if you're willing to give it a try I'd be interested to know what happens.
I'd like to support less VRAM. Maybe a future version will offload some of the processing to the cloud.
whisper-cpp can do about 6x-10x realtime with the older Whisper model on my 2021 M1 Macbook. I use this to transcribe multiple-hour-long podcasts. The tiny model can easily do 30x realtime on this hardware.
A few providers have done a variant of live transcription that's similar to how old-school providers do it, where they transcribe a short window (e.g. XXXms at a time), and this is definitely the easiest path. One such provider is Gladia: https://www.gladia.io/
There are other ways too, with different trade-offs; you can e-mail me at the link in my profile if you'd like to talk about how.
I've been looking for this. Thanks for the recommendation.
I like startups where you can sign up, use it in seconds, integrate it in minutes.
(I am definitely inspired as we don't currently provide such a straightforward experience in our own signup flow)
I tried it in English, French, and my broken Spanish, and all 3 came out great. One surprising thing is that if you switch languages mid-transcription with the "single language" model, it will transcribe the second language and translate it at the same time, so the entire transcription is in a single language, but the meaning is preserved.
So... you have to operate in chunks, generally 30s at a time, although you get reduced time-per-chunk if the chunk is mostly zeroes, and there are faster model variants.
The zeroes + faster model is how Gladia (mentioned in my other comment here) achieves live transcription, I believe: simply transcribing really short chunks one after the other (rough sketch below).
For more advanced stuff you kinda have to get your hands dirty, which I've done for my own product (not linked).
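For the curious, the naive chunked version looks roughly like this with the openai-whisper Python package (a sketch, not production code; CHUNK_SECONDS and the file name are made up, and a real setup would read from a live audio feed):

    import whisper

    CHUNK_SECONDS = 2       # short window, hypothetical value
    SAMPLE_RATE = 16000     # whisper operates on 16 kHz mono audio

    model = whisper.load_model("base")
    audio = whisper.load_audio("stream.wav")  # stand-in for a live feed

    step = CHUNK_SECONDS * SAMPLE_RATE
    for start in range(0, len(audio), step):
        # pad_or_trim zero-pads the chunk out to whisper's 30 s window;
        # the decoder emits fewer tokens for a mostly-silent window,
        # so each call returns faster than a full 30 s of speech would
        chunk = whisper.pad_or_trim(audio[start:start + step])
        result = model.transcribe(chunk, fp16=False)
        print(result["text"], flush=True)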
Thanks for mentioning Gladia. That's not exactly how it works, however: our version of Whisper is modified from the original to avoid hallucinations, and we are releasing a new model in a few days that is even better in this regard. It's also worth mentioning the three main problems that occur with real-time: endpointing, context reinjection (while avoiding hallucinations, which is a major issue with Whisper, since prompt injection tends to generate a lot of hallucinations in general), and finally alignment. Timestamps are extremely important in real time if you want to realign with the original streamed audio. Whisper tends to be hard to handle in all of these areas.
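On the alignment point, for anyone experimenting with vanilla Whisper: recent versions of the openai-whisper package can emit word-level timestamps, which is the raw material for realigning against the streamed audio. A minimal sketch, assuming that word_timestamps flag:

    import whisper

    model = whisper.load_model("base")
    result = model.transcribe("chunk.wav", word_timestamps=True)
    for segment in result["segments"]:
        for word in segment["words"]:
            # start/end are seconds relative to this chunk; add the
            # chunk's offset in the stream to realign with the source
            print(f"{word['start']:6.2f} {word['end']:6.2f} {word['word']}")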
Interesting teaser. I thought there must be some way to better optimize the model for real time, but haven't dug in because it's decently fast as is and there's so much other stuff to work on. So many models, so little time!
As another commenter pointed out, you can give context to the decoder, so you can feed the text of previous chunks into the model as context. This is how we do it for streaming, at least.
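A minimal sketch of that rolling-context loop with the openai-whisper package (the chunk files are stand-ins for a live feed, and truncating the prompt is one crude way to keep hallucinations down):

    import whisper

    model = whisper.load_model("base")
    context = ""  # rolling transcript used to prime the decoder

    for path in ["chunk_000.wav", "chunk_001.wav", "chunk_002.wav"]:
        result = model.transcribe(
            path,
            # prime the decoder with the transcript so far; keeping the
            # prompt short limits the hallucinations mentioned upthread
            initial_prompt=context[-200:],
        )
        context += result["text"]
        print(result["text"], flush=True)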
I've built a Google Docs-like doc editor that has real-time Whisper transcription built in. It's not released yet, but message me if you'd like to try it out!