I enjoyed reading this article. I was doing research on speech recognition in the 2005 to 2010 period, and the article sums up the feeling. Everything was based on HMMs, but there was no clear path to getting HMM performance up from where it capped out, or any idea of what models could work better. Deep learning came along and quickly supplanted all the old stuff. It seems like speech recognition has again plateaued, but at a much higher accuracy, though it's not inconceivable that something better than transformers comes along and improves things some more. It seems like the next big leap may be combining the text understanding of GPT with waveform data.
I was at a conference yesterday where they had live transcription of the speakers onto big screens by the stage.
On this occasion it was a human stenographer in the back-end, but what was remarkable is that if you didn't already know that, you'd guess from the error rate (not perfect, with difficulties that made sense in context) and the delay (bounced to another continent and back) that it was something like Whisper running locally. All the more so if you happened to know that one of the organisers has been running local AI meetups.
It seems to me like the only way up from here, now, is to aim for better than the average human stenographer. You'd have to have a system capable of reasoning "ah, what I heard was Y but they must have really meant X" when X involves knowledge that couldn't possibly be predicted from the context, like a foreign name coming up for the first time. I can believe that if the stenographer had more domain-specific knowledge they wouldn't have made some of the errors they did, but we're already off into capabilities that are as good as an extremely rare type of human today.
When you say "plateaued" again, do you have an example? I follow speech recognition like an average engineer and I mostly encounter "success" stories, so I'm afraid I see things through rose-tinted glasses. Do you have some examples where it fails in a meaningful way (that is, where it fails on apparently simple, or usual, scenarios)?
Google's SR running on our xiaomi spy devices in my house consistently gets things way wrong.
Pretty much any SR I have ever used on any device seems incredibly bad.
Obviously these devices don't use Whisper etc., but even so, these are sold to consumers and 50% of the time they don't get the prompt right.
We have "smart bulbs" that are connected to this spy network, which means that if I want to turn off a light, and I don't want to be shat on by my partner, I can't just flip the switch, but instead have to stand there pointlessly trying to turn the lights off and on by yelling at a fucking box in my lounge. I hate this so much more than you could ever begin to imagine.
And somehow, all these "success" stories have not yet been enough for him to open his eyes. People will buy any shit and then convince themselves that they have "success" stories.
Let's not even broach the topic of the fact that I have been using computer vision for years to do cool shit, but no, let's buy shit xiaomi cameras that have no API to interface with, so the absolute best-case scenario for me knowing what's on the camera is opening their piece-of-shit app and waiting 10-30 seconds for them to figure out how to connect over our network using their own fucking API.
Since I'm on this tangent now, I would love to pay special mention to their PIECE OF SHIT KETTLE, which is supposed to be a smart kettle, but the one fucking thing you would want to be able to do remotely is not supported. Oh, you wanted to boil your kettle from your desk? Well, fuck you!
I was going to go on a rant about how smart isn't useful unless it's easier than not-smart - my lights are either on a motion sensor, timer or based on device usage so we never actually need to do anything with them. The bedroom, bathroom and kitchen are on manual switches.
Then I looked at the kettle... I have a temperature controlled kettle, it has three buttons and does everything theirs does with an app. Plus it's 2.4kW and holds 1.7L so is better than theirs all around!
I really think we should have a consumer organization that grants companies the right to mass-produce items (so it would hit only large companies). That way we get a kind of democratic way to say "No" to products that work against the consumer (goodbye SmartTV) and end up in a landfill within a few years.
I mean, I need a permit to add a window to my home. Can we please require permits for mass-dumping semi-useless products onto the market?
What would be interesting is if Google took the ChatGPT approach to train itself better.
For instance, every time you speak to Google, it wouldn't just 'guess' what you're saying; rather, it would present you with the three closest-sounding options for you to pick from. And by picking, you train the model.
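As a rough illustration (purely hypothetical, not any real Google API), the option a user picks from such an n-best list is exactly the kind of preference data that feedback-based fine-tuning consumes:

    # Hypothetical sketch: turn the user's pick from an n-best list into
    # preference data; `recognizer` is a stand-in, not a real API.
    def collect_feedback(recognizer, audio, log):
        hypotheses = recognizer.transcribe_nbest(audio, n=3)  # top-3 guesses
        for i, hyp in enumerate(hypotheses, start=1):
            print(f"{i}. {hyp}")
        choice = int(input("Which one did you say? ")) - 1
        # Record (preferred, rejected) pairs for later fine-tuning.
        for i, hyp in enumerate(hypotheses):
            if i != choice:
                log.append({"audio": audio,
                            "preferred": hypotheses[choice],
                            "rejected": hyp})
        return hypotheses[choice]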
Reinforcement learning from human feedback (RLHF) as you have described here is pretty common in voice recognition.[1] I don't have any inside scoop but I'd be very surprised if there isn't any RLHF going on in these applications.
edit to add: the Whisper paper mentions using RLHF as a possible path to improve accuracy.[2]
Whisper is pretty good, honestly. It gets my French accent in English, and French slang, without my having to tell it what language I'm going to speak first. That's impressive.
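For anyone who hasn't tried it, a minimal sketch using the open-source whisper Python package (the file name is just a placeholder); note that no language hint is passed, Whisper detects it from the audio itself:

    import whisper

    model = whisper.load_model("base")        # small model, runs on CPU
    result = model.transcribe("audio.mp3")    # no language specified
    print(result["language"])                 # e.g. "fr" or "en"
    print(result["text"])                     # the transcription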
Speech recognition won't do much for my language... there's no official spoken standard (and that's not wanted either), but hundreds of very different dialects. Anything existing right now is pretty much useless, but on the other hand I've never really found speech to be a particularly useful input method.
Only if you happen to search exclusively for English titles. Try searching for a foreign song in English (say, La Marseillaise), and the results quickly become next to useless.
I agree that speech as an input method is overrated, but automatic captioning is a really nice accessibility feature (not only for the deaf, but also for non-native speakers, or when it's not convenient to have sound, ...).
I think OP is referring to Swiss German --- dialects are very local (I heard it even differs from valley to valley in the Alps) and often difficult to understand (Wallisertiitsch -- one of the mountain dialects -- is often brought up as an example of an unintelligible flavor of German), but official and government matters are conducted in standard German (with only a few differences from the one spoken in Germany).
“Recognise speech” / “wreck a nice beach” can sound similar (if you allow for some imprecision, slurring, accents, etc., like noisy real-world data has).
The relevant term here is _phonemes_, the individual chunks of sound that make the phrases up; these two phrases have shared or similar phonemes in typical English diction, e.g. the first two sounds of each are “reh”, “cuh”.
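To make that concrete, here's a toy comparison of the two phrases' ARPAbet phonemes (hand-transcribed approximations, not pulled from a real pronunciation lexicon):

    from difflib import SequenceMatcher

    recognise_speech   = ["R", "EH", "K", "AH", "G", "N", "AY", "Z",
                          "S", "P", "IY", "CH"]
    wreck_a_nice_beach = ["R", "EH", "K", "AH", "N", "AY", "S",
                          "B", "IY", "CH"]

    # Fraction of the two phoneme sequences that lines up.
    ratio = SequenceMatcher(None, recognise_speech, wreck_a_nice_beach).ratio()
    print(f"phoneme overlap: {ratio:.0%}")

Most of the two sequences match, which is why a recognizer working from sound alone can plausibly land on either phrase.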
In English it's a near homophone of (sounds almost the same as) “I helped Apple recognize speech.” It's a joke about working on language recognition and still getting it wrong.
It would definitely be hard, but a human listener would just interrupt and say “pies park?” at which point the speaker would be obligated to clarify it (heck, even a confused facial expression might be enough.) Most dictation and speech appliances don’t have that interactivity and so instead they let small mistakes waste (relatively) large amounts of our time.
You could of course search for yourself, but it's a Python library[1] for interfacing with "Spark"[2], the Apache large-scale data processing framework.
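If you've never touched it, here's a minimal usage sketch (the file path and column name are placeholders), assuming a local Spark install with pyspark available:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("example").getOrCreate()

    # Read a CSV and run a simple distributed aggregation.
    df = spark.read.csv("data.csv", header=True, inferSchema=True)
    df.groupBy("category").count().show()

    spark.stop()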
I wonder what the author would think nowadays. I've used voice transcription software recently, and it's incredibly accurate. For Anglosphere accents, at least.