Alice offers a few different STT options.
While you can stay completely offline, the results won't be quite as good as the cloud ones, depending on the language and your background noise. But Coqui STT (formerly DeepSpeech) in particular does a really good job!
Another option, if you are a bit more open with "sharing" data and don't want to miss out on the best STT, is enabling the Google or Azure cloud services - with the big difference that Alice only sends the sound right after detecting your hotword (only while flashing her LED and asking you "yes?" before the recording starts). Nothing else would be shared.
Does it have an option to cache the whole "Hey Alice, what's gonna be the weather at 11 tomorrow?" utterance, and then, if the wake word is detected, send what's been cached after it?
That's the main gripe I had with Mycroft, as there was no going around this:
"Hey Mycroft."
long pause
"What's gonna be the weather at 11 tomorrow?"
even longer pause
<answer>
Which is frankly so unnatural and just too annoying to be practical. And it feels like something that should be very straightforward to implement in terms of logic.
For the moment, no, nothing is cached until the hotword is recognized.
We thought about it though, but it would mean we have to store past sound input for a few seconds. While this wouldn't be a problem for the main device, satellites aren't powerful enough to run ASR themselves (Raspberry Pi Zero), so the sound is streamed to the main device after the hotword detection. That streaming flow wouldn't match perfectly with caching the data.
Another thing to keep in mind is that we use intermediate results for the ASR. That means the input is already being parsed while you are speaking. Only a few ms after you go silent, the parsing is finalized and NLU/TTS starts right away.
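For illustration, that intermediate-results flow could look roughly like this - a toy Python sketch with a dummy recognizer standing in for the real streaming ASR engine (all names, frame sizes, and the silence threshold here are my own assumptions, not Alice's actual code):

```python
# Sketch of the intermediate-results flow: the recognizer updates a
# partial transcript on every audio frame, and a short run of silent
# frames finalizes the utterance so NLU can start right away.

SILENCE_FRAMES = 10  # ~200 ms of silence at 20 ms per frame (assumed)

class DummyRecognizer:
    """Stand-in for a streaming ASR engine; a real one decodes audio.
    Here a "frame" is just a word, to keep the sketch self-contained."""
    def __init__(self):
        self._words = []

    def feed(self, frame):
        if frame:
            self._words.append(frame)
        return " ".join(self._words)  # intermediate (partial) result

    def heard_speech(self, frame):
        return bool(frame)  # empty frame = silence

    def finalize(self):
        return " ".join(self._words)

def listen(recognizer, audio_frames):
    silent = 0
    for frame in audio_frames:
        recognizer.feed(frame)  # transcript updates while the user speaks
        if recognizer.heard_speech(frame):
            silent = 0
        else:
            silent += 1
            if silent >= SILENCE_FRAMES:
                break  # user went quiet: finalize without further waiting
    return recognizer.finalize()  # handed straight to NLU
```

Because the parsing runs continuously, the only added latency after you stop talking is the silence threshold itself.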
Of course with a bit of bias, I'd say it is more like:
"Hey Alice"
"Yes?.."
"What's gonna be the weather at 11 tomorrow?"
short pause (0.2 seconds?)
<answer>
This is entirely true, and I have a few solutions I could deploy with some work. The problem is that caching like this consumes power: you literally listen all the time, as for a wakeword, to cache the audio data in memory and use it ONLY if a wakeword is detected. The big companies do it in the cloud; we could do it locally, as an option. The path I chose to mitigate that unnatural feeling is to use a human answer - a bit like at home: you're in the kitchen, your wife or kids further away, not communicating. At some point you'd call "Alice?" and wait for her to reply "yes?" before talking, as you're unaware whether she's focused on you at that moment or playing with the kids or whatever.
I haven't looked into the details, but when listening to a wakeword, surely it has to literally listen all the time anyway?
I mean, would it really consume that much extra power to just have a second sink that's just an N-second circular buffer, so you have the samples after the wakeword ready for speech recognition when the wakeword is detected?
Yeah, that's what I said - "as for the wakewords", we listen all the time, looking for a specific wave pattern in the audio and not for words. The audio is literally always flowing in, on all your satellites and the main unit. The problem with prewarming is that, beyond analysing a wave pattern in the audio stream, we need to keep a much longer audio dump in memory in some kind of FIFO pool. Don't get me wrong, it's easily doable; I just haven't taken the time to do something polished that doesn't overconsume on the device running it. Technically, we just need to pool the audio data, say 3-5 seconds depending on the hardware used (a Pi 3 is slow), trim the beginning by the length of the detected wakeword, and append the rest of the incoming data while already streaming to the ASR, be it local or cloud based.
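That FIFO pool could be sketched roughly like this - a minimal, hypothetical Python example (frame sizes, durations, and all names are my own assumptions for illustration, not Alice's actual code). The idea: a fixed-size deque holds the last few seconds of audio; when the wakeword fires, the frames that arrived after the wakeword ended (detection always lags slightly) are drained and prepended to the live stream going to the ASR:

```python
from collections import deque

# Assumed audio format: 20 ms frames; FRAMES_PER_SECOND derives from that.
FRAME_MS = 20
FRAMES_PER_SECOND = 1000 // FRAME_MS  # 50

class PrerollBuffer:
    """Rolling pre-roll pool of recent audio frames (hypothetical sketch)."""

    def __init__(self, seconds=4):
        # A deque with maxlen silently drops the oldest frame once full,
        # which gives us a fixed-size FIFO "circular buffer".
        self._frames = deque(maxlen=seconds * FRAMES_PER_SECOND)

    def push(self, frame: bytes):
        self._frames.append(frame)

    def drain_after_wakeword(self, frames_since_end: int) -> bytes:
        # The detector fires slightly after the wakeword actually ended,
        # so the last `frames_since_end` frames already contain the start
        # of the command. Return them to be prepended to the live stream
        # sent to the ASR; drop everything older (wakeword and before).
        frames = list(self._frames)
        self._frames.clear()
        start = max(0, len(frames) - frames_since_end)
        return b"".join(frames[start:])
```

The memory cost is modest: at 16 kHz / 16-bit mono, a 4-second pool is only about 128 KB, so the real cost is keeping the capture pipeline hot rather than the buffer itself.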
My guess would be no. After all, if you're going through the trouble of setting up a home assistant it'll be mains powered anyway, and the Pis don't actually use that much more power at max load than when idle.
I think the ballpark figures for the Pi 4 are 0.5A when doing nothing, 1A when doing something intensive on a single core, and 1.2A at full multicore load.