That's not what I meant. "Thinking internally" referred to the user experience only, where the user is waiting for a reply from the model. And they are definitely optimised to limit that time.
There’s no separate “waiting for a reply” phase; there’s only the delay between output tokens, which is roughly fixed and mostly depends on hardware and model size. Inference is slower on larger models, but so is training, and training is more of a bottleneck than user experience.
The model cannot think before it starts emitting tokens; the only way for it to "think" privately is for the interface to hide some of its output from the user, which is what happens in "think longer" and "search the web" modes.
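To make that concrete, here's a minimal sketch of the idea, with entirely hypothetical names (stream_tokens, render) rather than any real provider's API: the model generates the whole time, and the apparent "thinking" pause is just output the frontend never displays.

```python
# Hypothetical sketch, not any real API: a chat frontend hiding part of the
# token stream so the model appears to "think" before replying.

import time
from typing import Iterator, Tuple

def stream_tokens() -> Iterator[Tuple[str, str]]:
    """Stand-in for a model's token stream. Each token is tagged with a
    channel: 'reasoning' tokens come first, then 'answer' tokens."""
    hidden = ["Let's", " break", " the", " question", " down", "..."]
    visible = ["Here", " is", " the", " answer", "."]
    for tok in hidden:
        yield ("reasoning", tok)
    for tok in visible:
        yield ("answer", tok)

def render(stream: Iterator[Tuple[str, str]]) -> None:
    """What the user sees: reasoning tokens are collapsed into a
    'Thinking...' indicator, so the wait is really just hidden output."""
    shown_thinking = False
    for channel, tok in stream:
        time.sleep(0.05)  # stand-in for the per-token generation delay
        if channel == "reasoning":
            if not shown_thinking:
                print("[Thinking...]")
                shown_thinking = True
            continue  # the token is generated but never displayed
        print(tok, end="", flush=True)
    print()

if __name__ == "__main__":
    render(stream_tokens())
```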
If an online LLM doesn’t begin emitting a reply immediately, it’s more likely that the service is waiting for available GPU time or something like that, and/or prioritizing paying customers. Lag between tokens is also more likely caused by heavy demand or throttling.
Of course, there are many ways to optimize model speed that also make it less smart, and maybe even SOTA models have such optimizations these days. It’s difficult to know, because they’re black boxes.