Streaming · Token Factory Docs

Streaming returns the response token-by-token as Server-Sent Events (SSE) — a long-lived HTTP response where the server pushes a sequence of data: events as the model generates them. For human-facing UIs, it dramatically reduces perceived latency: users see the first words within a few hundred milliseconds instead of waiting for the full response.

#Why stream

Chat interfaces — show tokens as they arrive so the user starts reading immediately.
Long-form generation (summaries, articles, drafts) — partial output is useful even before the model is done.
Cost-sensitive ops — you can inspect the early tokens and stop mid-response if the output is heading the wrong way.

When not to stream: batch processing, structured-output extraction where you need the full JSON before parsing, and any machine-to-machine call where perceived latency doesn't apply. Non-streaming responses are simpler to handle — use them by default unless a human is watching the output appear.

#SSE format

Each event is data: followed by a JSON object identical in shape to a non-streaming response, except delta replaces message and contains only the new content for that chunk. The stream ends with the literal data: [DONE].

The first chunk usually carries the role; subsequent chunks carry content fragments. Concatenate the delta.content values in order to reconstruct the full message.

Side-by-side, the differences from a non-streaming response:

Field	Non-streaming	Streaming
`choices[N].message`	`{ role, content }`	(absent)
`choices[N].delta`	(absent)	`{ role?, content? }` (partial)
`choices[N].finish_reason`	populated on the response	populated on the last delta
Top-level wire format	single JSON object	`data: <json>\n\n` ... `data: [DONE]\n\n`

#Python

The openai SDK exposes streaming as an iterator. Set stream=True on the create call and loop with for chunk in stream.

end="" keeps tokens flowing on a single line; flush=True forces the terminal to render each chunk as it arrives instead of buffering.

#TypeScript

The Node SDK exposes streaming as an async iterable. Use for await and write each fragment to process.stdout.

In Node, process.stdout.write() doesn't buffer like print — each call goes straight to the terminal. In browsers, you'll typically write into a DOM node directly (innerHTML / textContent / a streaming React component) — no flush semantics, but be aware of re-render costs if you're appending to a long string on each chunk.

#cURL

Pass "stream": true in the JSON body and -N (alias for --no-buffer) to curl so it doesn't hold output back. Raw SSE prints to the terminal; pipe it into whatever your shell can do.

#Error handling mid-stream

If a chunk arrives with an error field set, the stream is terminating with an error — the model did not finish. Stop accumulating output, surface the error to the caller, and do not treat the partial text as a valid response.

Common mid-stream errors include upstream capacity loss, content-filter trips after partial output, and timeout on the model side. Treat partial output as diagnostic, never as the answer. See the error reference for status-code semantics — in particular, the 503 fallback pattern for upstream-loss recovery.

#Backpressure

If you break out of the loop or close the stream client-side, the server-side request continues to run for a short while before being interrupted. Don't rely on client disconnect as a cost-control mechanism — by the time the server notices and aborts, you may have already paid for most of the tokens. Set max_tokens to bound generation up front instead.

Reverse proxies may buffer SSE

Some corporate proxies and load balancers buffer SSE responses, collapsing the token-by-token experience into a single large chunk at the end. If you see the response arrive all at once instead of streaming, the proxy is the culprit. Workarounds: route the request directly (bypass the proxy), configure the proxy to flush SSE (proxy_buffering off in nginx, equivalent in your gateway), or reproduce outside the proxy to confirm the cause before debugging the client. Buffered streams that never see data: [DONE] show up as Missing completion on the Observability dashboard — useful signal when debugging proxy issues at scale.

#What next

Chat completions

The base request shape, parameters, and stop reasons.

API reference

Full parameter and response schemas for /v1/chat/completions.

Errors & status codes

Troubleshoot 401, 403, 429, and 503 — including proxy-buffered SSE.