Streaming returns the response token-by-token as Server-Sent Events (SSE) — a long-lived HTTP response where the server pushes a sequence of data: events as the model generates them. For human-facing UIs, it dramatically reduces perceived latency: users see the first words within a few hundred milliseconds instead of waiting for the full response.
#Why stream
- Chat interfaces — show tokens as they arrive so the user starts reading immediately.
- Long-form generation (summaries, articles, drafts) — partial output is useful even before the model is done.
- Cost-sensitive ops — you can inspect the early tokens and stop mid-response if the output is heading the wrong way.
When not to stream: batch processing, structured-output extraction where you need the full JSON before parsing, and any machine-to-machine call where perceived latency doesn't apply. Non-streaming responses are simpler to handle — use them by default unless a human is watching the output appear.
#SSE format
Each event is data: followed by a JSON object identical in shape to a non-streaming response, except delta replaces message and contains only the new content for that chunk. The stream ends with the literal data: [DONE].
The first chunk usually carries the role; subsequent chunks carry content fragments. Concatenate the delta.content values in order to reconstruct the full message.
Side-by-side, the differences from a non-streaming response:
| Field | Non-streaming | Streaming |
|---|---|---|
choices[N].message | { role, content } | (absent) |
choices[N].delta | (absent) | { role?, content? } (partial) |
choices[N].finish_reason | populated on the response | populated on the last delta |
| Top-level wire format | single JSON object | data: <json>\n\n ... data: [DONE]\n\n |
#Python
The openai SDK exposes streaming as an iterator. Set stream=True on the create call and loop with for chunk in stream.
end="" keeps tokens flowing on a single line; flush=True forces the terminal to render each chunk as it arrives instead of buffering.
#TypeScript
The Node SDK exposes streaming as an async iterable. Use for await and write each fragment to process.stdout.
In Node, process.stdout.write() doesn't buffer like print — each call goes straight to the terminal. In browsers, you'll typically write into a DOM node directly (innerHTML / textContent / a streaming React component) — no flush semantics, but be aware of re-render costs if you're appending to a long string on each chunk.
#cURL
Pass "stream": true in the JSON body and -N (alias for --no-buffer) to curl so it doesn't hold output back. Raw SSE prints to the terminal; pipe it into whatever your shell can do.
#Error handling mid-stream
If a chunk arrives with an error field set, the stream is terminating with an error — the model did not finish. Stop accumulating output, surface the error to the caller, and do not treat the partial text as a valid response.
Common mid-stream errors include upstream capacity loss, content-filter trips after partial output, and timeout on the model side. Treat partial output as diagnostic, never as the answer. See the error reference for status-code semantics — in particular, the 503 fallback pattern for upstream-loss recovery.
#Backpressure
If you break out of the loop or close the stream client-side, the server-side request continues to run for a short while before being interrupted. Don't rely on client disconnect as a cost-control mechanism — by the time the server notices and aborts, you may have already paid for most of the tokens. Set max_tokens to bound generation up front instead.
Some corporate proxies and load balancers buffer SSE responses, collapsing the token-by-token experience into a single large chunk at the end. If you see the response arrive all at once instead of streaming, the proxy is the culprit. Workarounds: route the request directly (bypass the proxy), configure the proxy to flush SSE (proxy_buffering off in nginx, equivalent in your gateway), or reproduce outside the proxy to confirm the cause before debugging the client. Buffered streams that never see data: [DONE] show up as Missing completion on the Observability dashboard — useful signal when debugging proxy issues at scale.