I enjoyed reading this article. I was doing research on speech recognition in the 2005 to 2010 period, and the article sums up the feeling. Everything was based on HMMs, but there was no clear path to getting HMM performance up from where it capped out, or any idea of what models could work better. Deep learning came along and quickly supplanted all the old stuff. It seems like speech recognition has again plateaued, but at a much higher accuracy, though it's not inconceivable that something better than transformers comes along and improves things some more. It seems like the next big leap may be combining the text understanding of GPT with waveform data.
I was at a conference yesterday where they had live transcription of the speakers onto big screens by the stage.
On this occasion it was a human stenographer in the back-end, but what was remarkable is that if you didn't already know that, you'd guess from the error rate (not perfect, with difficulties that made sense in context) and the delay (bounced to another continent and back) that it was something like Whisper running locally. All the more so if you happened to know that one of the organisers has been running local AI meetups.
It seems to me like the only way up from here, now, is to aim for better than the average human stenographer. You'd have to have a system capable of reasoning "ah, what I heard was Y but they must have really meant X" when X involves knowledge that couldn't possibly be predicted from the context, like a foreign name coming up for the first time. I can believe that if the stenographer had more domain-specific knowledge they wouldn't have made some of the errors they did, but we're already off into capabilities that are as good as an extremely rare type of human today.
When you say "plateaued" again, do you have an example? I follow speech recognition like an average engineer and I mostly encounter "success" stories, so I'm afraid I see things through rose-tinted glasses. Do you have some examples where it fails in a meaningful way (that is, where it fails on apparently simple, or usual, scenarios)?
Google's SR running on our xiaomi spy devices in my house consistently gets things way wrong.
Pretty much any SR I have ever used on any device seems incredibly bad.
Obviously these devices don't use Whisper etc., but even so, these are sold to consumers and 50% of the time they don't get the prompt right.
We have "smart bulbs" that are connected to this spy network, which means that if I want to turn off a light, and I don't want to be shat on by my partner, I can't just flip the switch, but instead have to stand there pointlessly trying to turn the lights off and on by yelling at a fucking box in my lounge. I hate this so much more than you could ever begin to imagine.
And somehow, all these "success" stories have not yet been enough for him to open his eyes. People will buy any shit and then convince themselves that they have "success" stories.
Let's not even broach the topic of the fact that I have been using computer vision for years to do cool shit, but no, let's buy shit xiaomi cameras that have no API to interface with, so the absolute best-case scenario for me knowing what's on the camera is opening their piece-of-shit app and waiting 10-30 seconds for them to figure out how to connect over our network using their own fucking API.
Since I'm on this tangent now, I would love to pay special mention to their PIECE OF SHIT KETTLE, which is supposed to be a smart kettle, but the one fucking thing you would want to be able to do remotely is not supported. Oh, you wanted to boil your kettle from your desk? Well, fuck you!
I was going to go on a rant about how smart isn't useful unless it's easier than not-smart - my lights are either on a motion sensor, timer or based on device usage so we never actually need to do anything with them. The bedroom, bathroom and kitchen are on manual switches.
Then I looked at the kettle... I have a temperature controlled kettle, it has three buttons and does everything theirs does with an app. Plus it's 2.4kW and holds 1.7L so is better than theirs all around!
I really think we should have a consumer organization that grants companies the right to mass-produce items (so it would hit only large companies). That way we get a kind of democratic way to say "No" to products that work against the consumer (goodbye SmartTV) and end up in a landfill within a few years.
I mean, I need a permit to add a window to my home. Can we please require permits for mass-dumping semi-useless products onto the market?
What would be interesting is if Google took the ChatGPT approach to train itself better.
For instance, every time you speak to Google, it wouldn't just 'guess' what you're saying; rather, it would present you with the three closest-sounding options for you to pick from. And by picking, you train the model.
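As a rough illustration (purely hypothetical, not any real Google API), the option a user picks from such an n-best list is exactly the kind of preference data that feedback-based fine-tuning consumes:

    # Hypothetical sketch: turn the user's pick from an n-best list into
    # preference data; `recognizer` is a stand-in, not a real API.
    def collect_feedback(recognizer, audio, log):
        hypotheses = recognizer.transcribe_nbest(audio, n=3)  # top-3 guesses
        for i, hyp in enumerate(hypotheses, start=1):
            print(f"{i}. {hyp}")
        choice = int(input("Which one did you say? ")) - 1
        # Record (preferred, rejected) pairs for later fine-tuning.
        for i, hyp in enumerate(hypotheses):
            if i != choice:
                log.append({"audio": audio,
                            "preferred": hypotheses[choice],
                            "rejected": hyp})
        return hypotheses[choice]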
Reinforcement learning from human feedback (RLHF) as you have described here is pretty common in voice recognition.[1] I don't have any inside scoop but I'd be very surprised if there isn't any RLHF going on in these applications.
edit to add: the Whisper paper mentions using RLHF as a possible path to improve accuracy.[2]
Whisper is pretty good, honestly. It gets my French accent in English, and French slang, without my having to tell it what language I'm going to speak first. That's impressive.
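For anyone who hasn't tried it, a minimal sketch using the open-source whisper Python package (the file name is just a placeholder); note that no language hint is passed, Whisper detects it from the audio itself:

    import whisper

    model = whisper.load_model("base")        # small model, runs on CPU
    result = model.transcribe("audio.mp3")    # no language specified
    print(result["language"])                 # e.g. "fr" or "en"
    print(result["text"])                     # the transcription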
Speech recognition won't do much for my language... there's no official spoken standard (and that's not wanted either), but hundreds of very different dialects. Anything existing right now is pretty much useless, but on the other hand I've never really found speech to be a particularly useful input method.
Only if you happen to search exclusively for English titles. Try searching for a foreign song in English (say, La Marseillaise), and the results quickly become next to useless.
I agree that speech as an input method is overrated, but automatic captioning is a really nice accessibility feature (not only for the deaf, but also for non-native speakers, or when it's not convenient to have sound, ...).
I think OP is referring to Swiss German --- dialects are very local (I heard it even differs from valley to valley in the Alps) and often difficult to understand (Wallisertiitsch -- one of the mountain dialects -- is often brought up as an example of an unintelligible flavor of German), but official and government matters are conducted in standard German (with only a few differences from the one spoken in Germany).
“Recognise speech” / “wreck a nice beach” can sound similar (if you allow for some imprecision, slurring, accents, etc., like noisy real-world data has).
The relevant term here is _phonemes_, the individual chunks of sound that make the phrases up; these two phrases have shared or similar phonemes in typical English diction, e.g. the first two sounds of each are “reh”, “cuh”.
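To make that concrete, here's a toy comparison of the two phrases' ARPAbet phonemes (hand-transcribed approximations, not pulled from a real pronunciation lexicon):

    from difflib import SequenceMatcher

    recognise_speech   = ["R", "EH", "K", "AH", "G", "N", "AY", "Z",
                          "S", "P", "IY", "CH"]
    wreck_a_nice_beach = ["R", "EH", "K", "AH", "N", "AY", "S",
                          "B", "IY", "CH"]

    # Fraction of the two phoneme sequences that lines up.
    ratio = SequenceMatcher(None, recognise_speech, wreck_a_nice_beach).ratio()
    print(f"phoneme overlap: {ratio:.0%}")

Most of the two sequences match, which is why a recognizer working from sound alone can plausibly land on either phrase.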
In English it's a near homophone of (sounds almost the same as) “I helped Apple recognize speech.” It's a joke about working on language recognition and still getting it wrong.
It would definitely be hard, but a human listener would just interrupt and say “pies park?” at which point the speaker would be obligated to clarify it (heck, even a confused facial expression might be enough.) Most dictation and speech appliances don’t have that interactivity and so instead they let small mistakes waste (relatively) large amounts of our time.
You could of course search for yourself, but it's a Python library[1] for interfacing with "Spark"[2], the Apache large-scale data processing framework.
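If you've never touched it, here's a minimal usage sketch (the file path and column name are placeholders), assuming a local Spark install with pyspark available:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("example").getOrCreate()

    # Read a CSV and run a simple distributed aggregation.
    df = spark.read.csv("data.csv", header=True, inferSchema=True)
    df.groupBy("category").count().show()

    spark.stop()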
I wonder what the author would think nowadays. I've used voice transcription software recently, and it's incredibly accurate. For Anglosphere accents, at least.