Alice offers a few different STT options.
While you can stay completely offline, the results won't be quite as good as the cloud ones, depending on the language and your background noise. But Coqui STT (formerly DeepSpeech) in particular does a really good job!
Another option, if you are a bit more open with "sharing" data and don't want to miss out on the best STT, is enabling the Google or Azure cloud services - with the big difference that Alice only sends the sound right after detecting your hotword (only while flashing her LED and asking you "yes?" before the recording starts). Nothing else would be shared.
Does it have an option to cache the whole "Hey Alice, what's gonna be the weather at 11 tomorrow?" utterance, and then, if the wake word is detected, send what's been cached after it?
That's the main gripe I had with Mycroft, as there was no going around this:
"Hey Mycroft."
long pause
"What's gonna be the weather at 11 tomorrow?"
even longer pause
<answer>
Which is frankly so unnatural and just too annoying to be practical. And it feels like something that should be very straightforward to implement in terms of logic.
For the moment, no, nothing is cached until the hotword is recognized.
We thought about it though, but it would mean we have to store past sound input for a few seconds. While this wouldn't be a problem for the main device, satellites aren't powerful enough to run ASR themselves (Raspberry Pi Zero), so the sound is streamed to the main device after the hotword detection. That streaming flow wouldn't match perfectly with caching the data.
Another thing to keep in mind is that we use intermediate results for the ASR. That means the input is already being parsed while you are speaking. Only a few ms after you go silent, the parsing is finalized and NLU/TTS starts right away.
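For illustration, that intermediate-results flow could look roughly like this - a toy Python sketch with a dummy recognizer standing in for the real streaming ASR engine (all names, frame sizes, and the silence threshold here are my own assumptions, not Alice's actual code):

```python
# Sketch of the intermediate-results flow: the recognizer updates a
# partial transcript on every audio frame, and a short run of silent
# frames finalizes the utterance so NLU can start right away.

SILENCE_FRAMES = 10  # ~200 ms of silence at 20 ms per frame (assumed)

class DummyRecognizer:
    """Stand-in for a streaming ASR engine; a real one decodes audio.
    Here a "frame" is just a word, to keep the sketch self-contained."""
    def __init__(self):
        self._words = []

    def feed(self, frame):
        if frame:
            self._words.append(frame)
        return " ".join(self._words)  # intermediate (partial) result

    def heard_speech(self, frame):
        return bool(frame)  # empty frame = silence

    def finalize(self):
        return " ".join(self._words)

def listen(recognizer, audio_frames):
    silent = 0
    for frame in audio_frames:
        recognizer.feed(frame)  # transcript updates while the user speaks
        if recognizer.heard_speech(frame):
            silent = 0
        else:
            silent += 1
            if silent >= SILENCE_FRAMES:
                break  # user went quiet: finalize without further waiting
    return recognizer.finalize()  # handed straight to NLU
```

Because the parsing runs continuously, the only added latency after you stop talking is the silence threshold itself.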
Of course with a bit of bias, I'd say it is more like:
"Hey Alice"
"Yes?.."
"What's gonna be the weather at 11 tomorrow?"
short pause (0.2 seconds?)
<answer>
This is entirely true, and I have a few solutions I could deploy with some work. The problem is that caching like this consumes power: you literally listen all the time, as for a wakeword, to cache the audio data in memory and use it ONLY if a wakeword is detected. The big companies do it in the cloud; we could do it locally, as an option. The path I chose to mitigate that unnatural feeling is to use a human answer - a bit like at home: you're in the kitchen, your wife or kids further away, not communicating. At some point you'd call "Alice?" and wait for her to reply "yes?" before talking, as you're unaware whether she's focused on you at that moment or playing with the kids or whatever.
I haven't looked into the details, but when listening to a wakeword, surely it has to literally listen all the time anyway?
I mean, would it really consume that much extra power to just have a second sink that's just an N-second circular buffer, so you have the samples after the wakeword ready for speech recognition when the wakeword is detected?
Yeah, that's what I said - "as for the wakewords", we listen all the time, looking for a specific wave pattern in the audio and not for words. The audio is literally always flowing in, on all your satellites and the main unit. The problem with prewarming is that, beyond analysing a wave pattern in the audio stream, we need to keep a much longer audio dump in memory in some kind of FIFO pool. Don't get me wrong, it's easily doable; I just haven't taken the time to do something polished that doesn't overconsume on the device running it. Technically, we just need to pool the audio data, say 3-5 seconds depending on the hardware used (a Pi 3 is slow), trim the beginning by the length of the detected wakeword, and append the rest of the incoming data while already streaming to the ASR, be it local or cloud based.
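That FIFO pool could be sketched roughly like this - a minimal, hypothetical Python example (frame sizes, durations, and all names are my own assumptions for illustration, not Alice's actual code). The idea: a fixed-size deque holds the last few seconds of audio; when the wakeword fires, the frames that arrived after the wakeword ended (detection always lags slightly) are drained and prepended to the live stream going to the ASR:

```python
from collections import deque

# Assumed audio format: 20 ms frames; FRAMES_PER_SECOND derives from that.
FRAME_MS = 20
FRAMES_PER_SECOND = 1000 // FRAME_MS  # 50

class PrerollBuffer:
    """Rolling pre-roll pool of recent audio frames (hypothetical sketch)."""

    def __init__(self, seconds=4):
        # A deque with maxlen silently drops the oldest frame once full,
        # which gives us a fixed-size FIFO "circular buffer".
        self._frames = deque(maxlen=seconds * FRAMES_PER_SECOND)

    def push(self, frame: bytes):
        self._frames.append(frame)

    def drain_after_wakeword(self, frames_since_end: int) -> bytes:
        # The detector fires slightly after the wakeword actually ended,
        # so the last `frames_since_end` frames already contain the start
        # of the command. Return them to be prepended to the live stream
        # sent to the ASR; drop everything older (wakeword and before).
        frames = list(self._frames)
        self._frames.clear()
        start = max(0, len(frames) - frames_since_end)
        return b"".join(frames[start:])
```

The memory cost is modest: at 16 kHz / 16-bit mono, a 4-second pool is only about 128 KB, so the real cost is keeping the capture pipeline hot rather than the buffer itself.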
My guess would be no. After all, if you're going through the trouble of setting up a home assistant it'll be mains powered anyway, and the Pis don't actually use that much more power at max load than when idle.
I think the ballpark figures for the Pi 4 are 0.5A when doing nothing, 1A when doing something intensive on a single core, and 1.2A at full multicore load.