This idea comes around every few months, but nobody can document it in tests (apart from the actually broken services that get fixed in a couple of days). Have you got repeatable cases where it can be shown?
I've had a 100% benchable case in the past, though it wasn't really a degradation in output quality per se; it was an undocumented and unacknowledged permanent degradation in maximum output length. From one day to the next, 100% of prompts where we'd ask for e.g. 15 sections would only do 10 sections and then ask "Would you like me to continue?". Which is in a way a quality issue, but generally not something that shows up on coding benches and the like. This was Anthropic.
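For reference, the kind of check that makes this benchable is trivial to script. Below is a minimal sketch, not the exact harness we used: `call_model` is a placeholder for whatever provider client you're on, and the 15-section prompt and run count are just illustrative.

```python
import re

# Placeholder for whatever API client you use (Anthropic SDK, an
# OpenAI-compatible endpoint, etc.); should return the model's text response.
def call_model(prompt: str) -> str:
    raise NotImplementedError("wire this up to your provider's client")

REQUESTED_SECTIONS = 15  # illustrative, matches the "15 sections" example above
RUNS = 20                # enough repeats to tell a flake from a regression

prompt = f"Write a report with exactly {REQUESTED_SECTIONS} numbered sections on <topic>."

truncated = 0
for _ in range(RUNS):
    text = call_model(prompt)
    # Count top-level numbered headings like "1.", "2.", ... at line starts.
    sections = len(re.findall(r"(?m)^\s*\d+[.)]\s", text))
    # Flag the tell-tale early stop: fewer sections plus a "shall I continue?" ask.
    if sections < REQUESTED_SECTIONS and re.search(r"continue\?", text, re.I):
        truncated += 1

print(f"{truncated}/{RUNS} runs stopped early and asked to continue")
```

When that counter goes from 0/20 to 20/20 overnight with no model or prompt change on your side, you have your repeatable case.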
Also seen a two-week, 10x+ latency increase on fine-tuned (= enterprise) Gemini 2.5 Flash model endpoints, again undocumented and unacknowledged, because they shifted their GPU capacity towards going "viral" on people generating slop artwork around the Nano Banana Pro release.
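That one is also easy to catch with a dumb probe run on a schedule; a sketch below, again with a placeholder `call_endpoint` rather than the actual Vertex client code we run.

```python
import statistics
import time

# Placeholder for a single request against the fine-tuned endpoint;
# substitute your actual provider/Vertex AI client call here.
def call_endpoint(prompt: str) -> str:
    raise NotImplementedError

PROBE_PROMPT = "Return the word OK."  # tiny, fixed prompt so only latency varies
SAMPLES = 30

latencies = []
for _ in range(SAMPLES):
    start = time.perf_counter()
    call_endpoint(PROBE_PROMPT)
    latencies.append(time.perf_counter() - start)

latencies.sort()
p50 = statistics.median(latencies)
p95 = latencies[int(0.95 * (len(latencies) - 1))]
print(f"p50={p50:.2f}s p95={p95:.2f}s")
# Log these on a schedule; a sustained 10x jump in p50/p95 on a fixed probe
# is hard to explain away as client-side noise.
```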
So plenty of silent shenanigans do happen, including by the Big 3 on API endpoints. At the same time I agree with you that all those rumors of "degradation in code quality" are very much unproven.
> prompts where we'd ask for e.g. 15 sections would only do 10 sections and then ask "Would you like me to continue?".
I can’t speak to any kind of identifiable pattern, but man, that behavior drives me up the wall when it happens.
When I run into a specific task that starts triggering that behavior, even a clean session with explicit instructions directing it to complete ALL sub-steps isn’t enough to push it through to the end of the request.