A bug has recently appeared in OpenAI's API that is breaking customer integrations and workflows.
When generating long outputs with slow models, OpenAI servers have started silently terminating the connection after exactly 300 seconds.
This affects both standard and streamed responses. API customers are still being charged for the full output, despite receiving truncated responses.
When streaming, the final chunk arrives with a null finish_reason, as if the model were still generating, when in fact the connection has been dropped.
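For anyone who wants to check whether their own streams are affected, here's a minimal sketch. The chunk dicts below mirror the shape of the chat-completion stream format as I understand it (an assumption for illustration, not a confirmed repro); a healthy stream should end with finish_reason set to "stop" or "length", while the bug leaves it null.

```python
# Sketch: detect a silently truncated stream by inspecting the final
# chunk's finish_reason. Chunks are modeled as plain dicts mirroring
# the streamed chat-completion format (an assumption for illustration).

def stream_was_truncated(chunks):
    """Return True if the stream ended without a finish_reason."""
    if not chunks:
        return True  # no data at all: definitely truncated
    last = chunks[-1]
    return last["choices"][0].get("finish_reason") is None

# A stream cut off mid-generation (the bug described above):
cut_off = [
    {"choices": [{"delta": {"content": "Hello"}, "finish_reason": None}]},
    {"choices": [{"delta": {"content": " world"}, "finish_reason": None}]},
]
# A stream that completed normally:
complete = cut_off + [
    {"choices": [{"delta": {}, "finish_reason": "stop"}]},
]
```

Collect the chunks as they arrive and run the check once the connection closes; a null finish_reason on the last chunk is the signature of this bug.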
I first noticed this last week, but others reported it before then too. It's seemingly gone unnoticed by OpenAI so far (or at least, remains unacknowledged and undocumented).
This renders GPT-4's 8K/32K context window effectively useless in many cases: at the output speeds I'm seeing, 300 seconds is only enough time to generate approx. 1.5K tokens.
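The back-of-the-envelope numbers, assuming the roughly 5 tokens/second throughput I've been observing for long GPT-4 outputs (my measurement, not an official figure):

```python
# Rough budget: how many tokens fit inside a hard 300-second cutoff.
TIMEOUT_SECONDS = 300
TOKENS_PER_SECOND = 5  # observed rate for long GPT-4 outputs (assumption)

max_tokens_before_cutoff = TIMEOUT_SECONDS * TOKENS_PER_SECOND
print(max_tokens_before_cutoff)  # 1500 -- nowhere near 8K, let alone 32K
```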
I'm guessing this is some kind of server-side misconfiguration, such as a proxy or load-balancer timeout. I appreciate that scaling at this level must be very challenging, but support tickets have gone unanswered, and this is a breaking change.
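Until this is fixed, the workaround I've settled on is capping max_tokens so each request finishes well inside the cutoff, then continuing the generation across follow-up requests. A sketch of the control flow with a stubbed-out completion call (`call_api`, `SAFE_MAX_TOKENS`, and the naive prompt-plus-output continuation are all my own illustrative choices, not an official pattern):

```python
# Workaround sketch: keep each request comfortably under the 300s
# cutoff by capping max_tokens, and resume generation across requests.

SAFE_MAX_TOKENS = 1200  # below the ~1.5K that fits in 300s (assumption)

def call_api(prompt, max_tokens):
    # Placeholder: a real implementation would call the chat
    # completions endpoint with this max_tokens and return
    # (generated_text, finish_reason).
    ...

def generate_long(prompt, call=call_api, max_rounds=10):
    """Accumulate output across capped requests until the model stops."""
    output = ""
    for _ in range(max_rounds):
        text, finish_reason = call(prompt + output, SAFE_MAX_TOKENS)
        output += text
        if finish_reason == "stop":
            break  # model finished on its own
        # finish_reason == "length": hit the cap, continue from here
    return output
```

It's clunky and costs extra prompt tokens on each continuation, but it at least avoids paying for output that never arrives.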
Hopefully someone from OpenAI sees this and can pass it on, or may know something I don't. Thanks!