More

cootsnuck · 2026-04-06T23:50:51 1775519451

Handy has Windows support. https://handy.computer/

cootsnuck · 2026-04-06T23:48:44 1775519324

Yup, Handy is the one that made me stop looking for local open source alternatives to Wispr Flow.

I'll give a shoutout as well to Glimpse: https://github.com/LegendarySpy/Glimpse

cootsnuck · 2026-04-06T23:45:37 1775519137

Interesting. My Pixel 7 transcription is barely usable for me. Makes way too many mistakes and defeats the purpose of me not having to type, but maybe that's just my experience.

The latest open source local STT models people are running on devices are significantly more robust (e.g. whisper models, parakeet models, etc.). So background noise, mumbling, and/or just not having a perfect audio environment doesn't trip up the SoTA models as much (all of them still do get tripped up).

I work in voice AI and am using these models (both proprietary and local open source) every day. Night and day different for me.

taffydavid · 2026-04-07T08:04:28 1775549068

I've built my own tts apps testing whisper and while it's good it does hallucinate quite a bit if there's noise, or just sometimes when the audio is perfectly clear.

It often gives the illusion of being very good but I could record a half hour of me speaking and discover some very random stuff in the middle that I did not say

cootsnuck · 2026-04-07T16:29:41 1775579381

Yup, you're absolutely right. The open source models do have their rough edges. I use NVIDIA's Parakeet v3 model a lot locally, and it will occasionally do this thing where it just repeats a word like a dozen times.

cootsnuck · 2026-03-22T01:35:00 1774143300

How many businesses have the capabilities and expertise to train their own models?

timschmidt · 2026-03-22T01:42:05 1774143725

No idea. Probably more every day.

cootsnuck · 2026-03-10T15:50:07 1773157807

Super cool. Love seeing these writeups of hobbyists getting their hands dirty, breaking things, and then coming out on the other side of it with something interesting.

cootsnuck · 2026-03-09T19:58:11 1773086291

Yea, it has been a little shocking to me that the rising narratives around "AI agents everywhere" and "enable the web for AI agents" requires what we've all been wanting for awhile on the web (openness and interoperability) but that the same big players in tech have been clearly against for a long time. Like the fact that Google recently released that Google Workspace CLI (https://github.com/googleworkspace/cli) is a perfect example.

They could've released something like that years ago (the discovery service it's built on has existed for over a decade) but creating a simple, accessible, unified CLI for general integration apparently wasn't worth it until agents became the hot thing.

I wonder when / if there will be a rug pull on all of this. Because I really don't see what the long-term incentives are for incumbent tech platforms to make it easy for automated systems to essentially pull users away from the actual platform. I guess they're focused on the short term incentives. And once they decide the party's over, promising upstarts and competition can get absorbed and it'll be business as usual. Idk, we'll see.

DANmode · 2026-03-10T17:47:39 1773164859

FYI from your project link:

Note

This is not an officially supported Google product.

cootsnuck · 2026-03-03T06:15:17 1772518517

I've been working solely on voice agents for the past couple years (and have worked at one of the frontier voice AI companies).

The cascading model (STT -> LLM -> TTS), is unlikely to go away anytime soon for a whole lot of reasons. A big one is observability. The people paying for voice agents are enterprises. Enterprises care about reliability and liability. The cascading model approach is much more amenable to specialization (rather than raw flexibility / generality) and auditability.

Organizations in regulated industries (e.g. healthcare, finance, education) need to be able to see what a voice agent "heard" before it tries to "act" on transcribed text, and same goes for seeing what LLM output text is going to be "said" before it's actually synthesized and played back.

Speech-to-Speech (end-to-end) models definitely have a place for more "narrative" use cases (think interviewing, conducting surveys / polls, etc.).

But from my experience from working with clients, they are clamoring for systems and orchestration that actually use some good ol' fashioned engineering and that don't solely rely on the latest-and-greatest SoTA ML models.

cootsnuck · 2026-03-03T05:55:28 1772517328

Yea, Deepgram Flux is the secret sauce. Doesn't get talked about much.

For anyone curious: https://flux.deepgram.com/

totetsu · 2026-03-03T06:08:24 1772518104

What is the difference between Flux’s end-of-turn detection and Openai's Automatic turn detection Semantic mode?

cootsnuck · 2026-03-03T06:34:12 1772519652

In OpenAI's own words about semantic_vad:

> Chunks the audio when the model believes based on the words said by the user that they have completed their utterance.

Source: https://developers.openai.com/api/docs/guides/realtime-vad

OpenAI's Semantic mode is looking at the semantic meaning of the transcribed text to make an educated guess about where the user's end of utterance is.

According to Deepgram, Flux's end-of-turn detection is not just a semantic VAD (which inherently is a separate model from the STT model that's doing the transcribing). Deepgram describes Flux as:

> the same model that produces transcripts is also responsible for modeling conversational flow and turn detection.

[...]

> With complete semantic, acoustic, and full-turn context in a fused model, Flux is able to very accurately detect turn ends and avoid the premature interruptions common with traditional approaches.

Source: https://deepgram.com/learn/introducing-flux-conversational-s...

So according to them, end-of-turn detection isn't just based on semantic content of the transcript (which makes sense given the latency), but rather the the characteristics of the actual audio waveform itself as well.

Which Pipecat (open source voice AI orchestration platform) actually does as well seemingly with their smart-turn native turn detection model as well: https://github.com/pipecat-ai/smart-turn (minus the built-in transcription)

totetsu · 2026-03-03T13:59:28 1772546368

Thanks. Then maybe it’s similar to Moshi https://github.com/kyutai-labs/moshi?tab=readme-ov-file

cootsnuck · 2026-03-01T17:06:10 1772384770

Except an LLM actually is a piece of software. And the brain is not what you said.

philipswood · 2026-03-01T18:43:26 1772390606

Which part of what he said is wrong?

> A brain is a collection of cells that transmit electrical signals and sodium. ...

That it is a collection of cells? Or that they transmit electrical signals and sodium?

Or do you feel that he's leaving out something important about how it works (like generated electrical fields or neural quantum effects)?

cootsnuck · 2026-03-01T07:33:39 1772350419

It was: https://news.ycombinator.com/item?id=47000263