Rest in Peas: The Unrecognized Death of Speech Recognition (2010) (robertfortner.posthaven.com)
43 points by jawns on May 5, 2023 | 49 comments


I enjoyed reading this article. I was doing research on speech recognition in the 2005 to 2010 period, and the article sums up the feeling. Everything was based on HMMs, but there was no clear path to getting HMM performance up from where it capped out, or any idea of what models could work better. Deep learning came along and quickly supplanted all the old stuff. It seems like speech recognition has again plateaued, but at a much higher accuracy, though it's not inconceivable that something better than transformers comes along and improves things some more. It seems like the next big leap may be combining the text understanding of GPT with waveform data.


I was at a conference yesterday where they had live transcription of the speakers onto big screens by the stage.

On this occasion it was a human stenographer in the back-end, but what was remarkable is that if you didn't already know that, you'd guess from the error rate (not perfect, with difficulties that made sense in context) and the delay (bounced to another continent and back) that it was something like Whisper running locally. All the more so if you happened to know that one of the organisers has been running local AI meetups.

It seems to me like the only way up from here, now, is to aim for better than the average human stenographer. You'd have to have a system capable of reasoning "ah, what I heard was Y but they must have really meant X" when X involves knowledge that couldn't possibly be predicted from the context, like a foreign name coming up for the first time. I can believe that if the stenographer had more domain-specific knowledge they wouldn't have made some of the errors they did, but at that point we're already into capabilities that only an extremely rare type of human has today.


When you say it has "plateaued" again, do you have an example? I follow speech recognition like an average engineer and I mostly encounter "success" stories, so I'm afraid I see things through rosy glasses. Do you have some examples where it fails in a meaningful way (that is, where it fails on apparently simple, or usual, scenarios)?


Google's SR running on the Xiaomi spy devices in my house consistently gets things way wrong. Pretty much any SR I have ever used on any device seems incredibly bad. Obviously these devices don't use Whisper etc., but even so, these are sold to consumers and 50% of the time they don't get the prompt right.

We have "smart bulbs" that are connected to this spy network, which means that if I want to turn off a light and don't want to be shat on by my partner, I can't just turn off the switch, but rather have to stand there pointlessly trying to turn the lights off/on by yelling at a fucking box in my lounge. I hate this so much more than you could ever begin to imagine. And somehow, all these "success" stories have not yet been enough for him to open his eyes. People will buy any shit and then convince themselves that they have "success" stories.

Let's not even broach the topic of the fact that I have been using computer vision for years to do cool shit, but no, let's buy shit Xiaomi cameras that have no API to interface with, so the absolute best-case scenario for me knowing what's on the camera is opening their piece of shit app and waiting 10-30 seconds for them to figure out how to connect over our network using their own fucking API.

Since I'm on this tangent now, I would love to pay special mention to their PIECE OF SHIT KETTLE, which is supposed to be a smart kettle, but the one fucking thing you would want to be able to do remotely is not supported. Oh, you wanted to boil your kettle from your desk? Well, fuck you!


I was going to go on a rant about how "smart" isn't useful unless it's easier than not-smart - my lights are either on a motion sensor, a timer, or triggered by device usage, so we never actually need to do anything with them. The bedroom, bathroom and kitchen are on manual switches.

Then I looked at the kettle... I have a temperature-controlled kettle; it has three buttons and does everything theirs does with an app. Plus it's 2.4 kW and holds 1.7 L, so it's better than theirs all around!


> Oh, you wanted to boil your kettle from your desk?

Besides a right to repair, we should also have a right to API.


Maybe it could be enforced via stricter recycling and warranty laws for devices that don't allow you to control them / update them / root them.


I really think we should have a consumer organization that grants companies the right to mass-produce items (so it would only hit large companies). That way we'd have a kind of democratic way to say "No" to products that work against the consumer (goodbye, SmartTV) and end up in a landfill within a few years.

I mean, I need a permit to add a window to my home. Can we please require permits for mass-dumping semi-useless products onto the market?


What would be interesting is if Google took the ChatGPT approach to better train itself.

For instance, every time you speak to Google, it wouldn't just "guess" what you're saying; rather, it would present you with the 3 closest-sounding options for you to pick from. And by picking, you train the model.
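
Something like this sketch, roughly (the `recognize_top_k` backend here is a hypothetical stand-in, not a real Google API; the logged picks would become labeled examples for later fine-tuning):

    # Hypothetical sketch: show the top-3 ASR hypotheses, record which one
    # the user picks, and save (audio, chosen transcript) pairs for training.
    import json
    from typing import List

    def recognize_top_k(audio_path: str, k: int = 3) -> List[str]:
        # Placeholder: a real implementation would return the k best-scoring
        # transcripts from the ASR model's beam search.
        return ["turn on the lounge light",
                "turn on the lounge lights",
                "turn off the lounge light"][:k]

    def collect_feedback(audio_path: str, log_path: str = "asr_feedback.jsonl") -> str:
        hypotheses = recognize_top_k(audio_path)
        for i, h in enumerate(hypotheses, 1):
            print(f"{i}. {h}")
        choice = int(input("Which one did you mean? ")) - 1
        chosen = hypotheses[choice]
        with open(log_path, "a") as f:
            f.write(json.dumps({"audio": audio_path, "transcript": chosen}) + "\n")
        return chosen

    if __name__ == "__main__":
        collect_feedback("utterance.wav")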


Reinforcement learning from human feedback (RLHF) as you have described here is pretty common in voice recognition.[1] I don't have any inside scoop but I'd be very surprised if there isn't any RLHF going on in these applications.

edit to add: The Whisper paper mentions using RLHF as a possible path to improve accuracy.[2]

[1] https://link.springer.com/article/10.1007/s10462-022-10224-2 gives many examples

[2] https://cdn.openai.com/papers/whisper.pdf


Whisper is pretty good, honestly. It gets my French accent in English and French slang, without having to tell it what language I'm going to speak first. That's impressive.
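
For reference, a minimal sketch of that kind of local use with the open-source openai-whisper package (pip install openai-whisper); the model size and file name are just placeholders. Whisper auto-detects the spoken language unless you pass one explicitly:

    # Minimal local transcription sketch; "meeting.mp3" is a placeholder.
    import whisper

    model = whisper.load_model("small")       # larger models are more accurate
    result = model.transcribe("meeting.mp3")  # language is auto-detected by default

    print(result["language"])                 # e.g. "fr" or "en"
    print(result["text"])                     # full transcript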


Exactly, the accent bump is behind us. I get good transcripts of my meetings even with my thick accent.


Speech recognition won't do much for my language... it has no official spoken form (and that's not wanted either), but hundreds of very different dialects. Anything existing right now is pretty much useless, but on the other hand I've never really found speech to be a particularly useful input method.


Speech isn't a particularly useful input method ... until you lose the use of your hands. Then it suddenly is a lot more attractive.


Even with my hands, voice-search on Apple TV, including the YouTube app, is very capable and accurate.


Only if you happen to only search for English titles. Try to search for a foreign song in English (say, La Marseillaise), and the results quickly become next to useless.


I agree that speech as an input method is overrated, but automatic captioning is a really nice accessibility feature (not only for the deaf, but also for non-native speakers, or when it's not convenient to have sound, ...)


oh that's pretty interesting, would you like to share more on the specifics of that?


I think OP is referring to Swiss German --- dialects are very local (I heard it even differs from valley to valley in the Alps) and often difficult to understand (Wallisertiitsch -- one of the mountain dialects -- is often brought up as an example of an unintelligible flavor of German), but official and government matters are conducted in standard German (with only a few differences from the one spoken in Germany).


What language is that?


Somewhere I have an old T-shirt from when I was consulting for Apple.

It had a sketch of a trashed-out beach with all kinds of garbage and debris.

The caption was:

I helped Apple wreck a nice beach


Hmmm... Maybe because I'm not a native speaker, but, well, I don't get it :-) Could you explain?


“Recognise speech” / “wreck a nice beach” can sound similar (if you allow for some imprecision, slurring, accents, etc., like noisy real-world data has).

The relevant term here is _phonemes_, the individual chunks of sound that make up the phrases; the two phrases have shared or similar phonemes in typical English diction, e.g. the first two sounds in each are “reh” and “cuh”.
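
You can see this with a small illustrative sketch, assuming the `pronouncing` package (pip install pronouncing), which looks words up in the CMU Pronouncing Dictionary:

    # Compare the phoneme sequences of the two phrases.
    import pronouncing

    for phrase in ["recognize speech", "wreck a nice beach"]:
        phones = [pronouncing.phones_for_word(w)[0] for w in phrase.split()]
        print(phrase, "->", " | ".join(phones))
    # The phone sequences are nearly the same; mostly the word boundaries move,
    # which is why the phrases can sound alike in fluent speech.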


In English it's a near homophone of (sounds the same as) “I helped Apple recognize speech.” It's a joke about working on speech recognition and still getting it wrong.


Try saying it out loud several times without thinking about the words at all, but only the sounds your voice is making as you say them:

I helped Apple wreck a nice beach

I helped Apple reck a nize beach

I helped Apple reck ug nize beach

I helped Apple reckug nize peach

I helped Apple recognize speech!


It is mentioned in the top article:

Saying “recognize speech” makes a sound that can be indistinguishable from “wreck a nice beach.”


I have (had?) that one somewhere!


What has happened since then? I know Common Voice has come and gone https://en.wikipedia.org/wiki/Common_Voice https://github.com/coqui-ai/STT

And I've seen some neural approaches too

No idea where to look for comparisons though.


>I know Common Voice has come and gone

From https://commonvoice.mozilla.org/en: Today's Progress- 22/2400, help us get to our goal!

Ouch. I remember contributing to it for a while, quite some time ago. Did it just not pan out to be as useful as expected?


weird; there are more hours of Catalan than any language other than English...


It’s due to politics: there was a campaign, tied to nationalist sentiment, that leveraged soccer fans for it.


Whisper is the best out there right now: https://openai.com/research/whisper


The best open-source one out there


It is better than any commercially available offering too.



Thanks!


Despite deep learning, Google still suggested "pies park" when I asked for "pyspark".

To be fair, there is a Pies Park in Cincinnati. I don't live in Cincinnati, though.


Without the context it would be difficult for a human too, despite our biological net.


It would definitely be hard, but a human listener would just interrupt and say “pies park?” at which point the speaker would be obligated to clarify it (heck, even a confused facial expression might be enough.) Most dictation and speech appliances don’t have that interactivity and so instead they let small mistakes waste (relatively) large amounts of our time.


Google has decades of context on me


I doubt they have trained a model just for you ahah


yet


Yeah, I wouldn't understand what you said either!


just like iOS autocomplete is seeded with your address book, usually but not always to your detriment.


wtf is pyspark


You could of course search for yourself, but it's a Python library[1] for interfacing with "Spark"[2], the Apache large-scale data processing framework (see the sketch after the links below).

[1] https://pypi.org/project/pyspark/

[2] https://spark.apache.org/
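
A minimal usage sketch, assuming pyspark is installed (pip install pyspark); the column names and data are just illustrative:

    # Create a local Spark session and run a simple distributed query.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("example").getOrCreate()

    df = spark.createDataFrame(
        [("alice", 3), ("bob", 5)],
        ["name", "count"],
    )
    df.filter(df["count"] > 3).show()  # filtering runs on Spark's engine

    spark.stop()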


Something that anyone, or more importantly any model, with an iota of familiarity with machine learning should be familiar with.


I wonder what the author would think nowadays. I've used voice transcription software recently, and it's incredibly accurate. For Anglosphere accents, at least.


Did not age well



