LLM streaming over Server-Sent Events is a surprisingly load-bearing surface: coding CLIs parse tool-call deltas incrementally, UX layers render token-by-token, and any gateway that reformats the stream breaks something. The LangWatch AI Gateway is explicit about what it may and may not mutate.
## The streaming contract
### Before the first chunk — mutations allowed
- Gateway may inject response headers (`X-LangWatch-Request-Id`, `X-LangWatch-Provider`, `X-LangWatch-Model`).
- Pre-call guardrails may rewrite the request payload before dispatch.
- Fallback chain may transparently switch providers if the primary fails before any chunk arrives.
### After the first chunk — strict passthrough
- SSE events are forwarded to the client byte-for-byte, in order, without merging, splitting, or reordering.
- No header injection (except the ones already emitted on the first frame).
- No mid-stream provider switch — even if the primary dies after chunk N, the gateway will NOT fall back to a secondary provider silently. Instead it terminates the stream.
- Coding CLIs depend on exact `content_block_delta` / `input_json_delta` ordering; any reformat breaks their tool-call parser.
## Terminal error shapes
Once the first chunk has been sent to the client, the gateway never silently switches providers. If something goes wrong mid-stream, the stream ends with a terminal `event: error` frame. Two shapes exist in v1, distinguishable by `error.code`:
Upstream connection failure:
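A sketch of the frame: `error.code` is the normative discriminator per the contract above; the surrounding field layout and `message` strings are illustrative assumptions, not a guaranteed schema:

```text
event: error
data: {"error":{"code":"provider_error","message":"upstream connection lost mid-stream"}}
```

Guardrail block (policy fired), same caveat; the `type` and `code` values are the ones documented under stream_chunk guardrails:

```text
event: error
data: {"error":{"type":"guardrail_blocked","code":"stream_chunk_blocked","message":"chunk blocked by stream_chunk guardrail"}}
```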
Clients consuming `error` frames should key off `error.code` to distinguish provider failures (retryable after backoff) from policy blocks (retrying won't help unless the input changes).
Post-emission sequence for both shapes:
- Gateway emits the terminal `event: error` frame (shown above).
- Gateway closes the client connection.
- For `provider_error`: the failed provider's circuit-breaker counter increments. For `guardrail_blocked`: a verdict metric (`gateway_guardrail_verdicts_total{verdict=block}`) increments instead — the provider is healthy; policy fired.
## Pre-connection fallback (transparent)
When the primary returns a retryable error before any chunk has been sent to the client (connection failures, pre-first-chunk 5xx / 429 / timeout / network), the gateway walks the VK's `fallback.chain` the same way it does for non-streaming calls. The caller sees exactly one stream, from whichever credential accepted the request.
`X-LangWatch-Fallback-Count` reports how many chain slots were skipped before the successful attempt: 0 means the primary worked; N means N providers failed before one accepted.
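For illustration, reading the header from a streaming response with the `requests` library; the gateway URL, virtual key, and request body are placeholder assumptions:

```python
import requests

# Placeholder gateway URL and virtual key: substitute your deployment's values.
resp = requests.post(
    "https://gateway.example.com/v1/chat/completions",
    headers={"Authorization": "Bearer lw-vk-..."},
    json={
        "model": "gpt-4o",
        "stream": True,
        "messages": [{"role": "user", "content": "hi"}],
    },
    stream=True,  # keep the connection open; headers arrive before the first chunk
)
skipped = int(resp.headers.get("X-LangWatch-Fallback-Count", "0"))
if skipped:
    print(f"primary failed; {skipped} chain slot(s) skipped before success")
```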
## stream_chunk guardrails
Guardrails with `direction: stream_chunk` run on chunks with visible delta text before emission. Frames without visible text — role-only (`{"delta":{"role":"assistant"}}`), tool-call deltas, terminal usage — skip the guardrail call entirely, so ~95% of stream frames pass with near-zero overhead. See Guardrails → stream_chunk.
- Budget: ≤50 ms per chunk. Slow or errored guardrails fail open — the chunk passes through, `gateway_guardrail_verdicts_total{direction=stream_chunk,verdict=fail_open}` increments, and `langwatch.guardrail.stream_chunk_fail_open=<reason>` is set on the trace. Blocking the user's stream on a slow policy service is worse than occasional pass-through — but the metric surfaces degraded services. (See the sketch after this list.)
- Decision `block` → terminal `event: error` frame with `type=guardrail_blocked`, `code=stream_chunk_blocked` (see Terminal error shapes). Channel closed immediately.
- Decision `modify` is not implemented in v1 — chunk-level rewriting is provider-shape-specific (OpenAI delta JSON vs Anthropic SSE events differ). For v1, "redact on stream" means block + client retry. A future iteration can add provider-aware chunk rewriting.
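A minimal sketch of the per-chunk decision flow described above, assuming hypothetical `guardrail`, `metrics`, `emit`, and `close_stream` stand-ins (these are not gateway APIs):

```python
import asyncio

CHUNK_BUDGET_S = 0.050  # the ≤50 ms per-chunk budget

async def screen_chunk(chunk_text, guardrail, metrics, emit, close_stream):
    """Run a stream_chunk guardrail over one visible-text delta."""
    try:
        verdict = await asyncio.wait_for(guardrail.evaluate(chunk_text), CHUNK_BUDGET_S)
    except Exception:
        # Timeout or policy-service error: fail open. Pass the chunk through and
        # record the degraded verdict rather than stalling the user's stream.
        metrics.incr("gateway_guardrail_verdicts_total",
                     direction="stream_chunk", verdict="fail_open")
        await emit(chunk_text)
        return
    if verdict == "block":
        # Terminal event: error frame, then the channel closes immediately.
        await emit('event: error\n'
                   'data: {"error":{"type":"guardrail_blocked",'
                   '"code":"stream_chunk_blocked"}}\n\n')
        await close_stream()
    else:
        await emit(chunk_text)  # "pass" (v1 has no "modify" decision)
```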
## Starting a streaming request
### OpenAI-compatible
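A minimal sketch with the official `openai` Python SDK pointed at the gateway; the base URL and key are placeholder assumptions:

```python
from openai import OpenAI

# Placeholder base URL and virtual key: substitute your deployment's values.
client = OpenAI(base_url="https://gateway.example.com/v1", api_key="lw-vk-...")

stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Stream me a haiku"}],
    stream=True,
)
for chunk in stream:
    # The final usage-bearing chunk has an empty choices list, so guard it.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```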
### Anthropic-compatible
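The same request through the Anthropic-shaped endpoint, using the `anthropic` SDK's streaming helper; base URL, key, and model name are placeholders:

```python
import anthropic

# Placeholder base URL and virtual key: substitute your deployment's values.
client = anthropic.Anthropic(base_url="https://gateway.example.com", api_key="lw-vk-...")

with client.messages.stream(
    model="claude-sonnet-4-20250514",
    max_tokens=256,
    messages=[{"role": "user", "content": "Stream me a haiku"}],
) as stream:
    for text in stream.text_stream:  # yields visible text deltas as they arrive
        print(text, end="", flush=True)
```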
## Tool-call deltas — what to look for
Anthropic emits this sequence for a tool call: `content_block_start` (with `type: tool_use`), then a run of `content_block_delta` events whose `input_json_delta` payloads carry `partial_json` fragments, then `content_block_stop`. The client accumulates the `partial_json` chunks into a full JSON object. Any reordering or re-splitting breaks this. The gateway preserves the exact chunk boundaries.
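An abridged illustration of that sequence (ids and values are placeholders; the event shapes follow Anthropic's published streaming format):

```text
event: content_block_start
data: {"type":"content_block_start","index":1,"content_block":{"type":"tool_use","id":"toolu_01...","name":"get_weather","input":{}}}

event: content_block_delta
data: {"type":"content_block_delta","index":1,"delta":{"type":"input_json_delta","partial_json":"{\"location\":"}}

event: content_block_delta
data: {"type":"content_block_delta","index":1,"delta":{"type":"input_json_delta","partial_json":"\"Berlin\"}"}}

event: content_block_stop
data: {"type":"content_block_stop","index":1}
```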
OpenAI emits tool-call deltas as:
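An abridged illustration (placeholder values; the shape follows OpenAI's chat-completions streaming format). Note that `function.arguments` arrives as string fragments the client concatenates:

```text
data: {"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"id":"call_abc","type":"function","function":{"name":"get_weather","arguments":""}}]}}]}

data: {"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"{\"location\":"}}]}}]}

data: {"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"\"Berlin\"}"}}]}}]}

data: {"choices":[{"index":0,"delta":{},"finish_reason":"tool_calls"}]}
```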
## Observability
- `X-LangWatch-Request-Id` on the first frame.
- A trace span per chunk would be too expensive — the gateway records a single span per streaming request with `langwatch.streaming=true` and the normal per-request attrs (`langwatch.model`, `langwatch.usage.*`, `langwatch.duration_ms`, `langwatch.cost_usd`).
- Mid-stream failures are visible via `gateway_http_requests_total{status="500"}` + the single span's error state + request-id correlation to the gateway log line.
Per-stream telemetry attrs (`langwatch.stream.chunk_count` / `.first_chunk_ms` / `.total_ms` / `.terminated_at_chunk`) are a v1.1 observability follow-up. In v1, the single-span-per-request model plus log-line correlation is the operator signal.

## Usage extraction (critical for streaming budgets)
Streaming responses are just as expensive as non-streaming, but the usage block (token counts, cost inputs) arrives asynchronously — typically in the terminal chunk, but providers differ:

| Provider | How usage is reported on a stream |
|---|---|
| Anthropic | `message_delta` event with final `usage.output_tokens` (+ `usage.cache_read_input_tokens` when applicable) |
| OpenAI / Azure OpenAI | The final data-only chunk carries `usage` when the request body has `stream_options: {include_usage: true}`. The gateway auto-injects this flag on OpenAI/Azure streaming requests when the caller hasn't set it — see below |
| Bedrock | Usage on the final `message_delta` event, same as Anthropic |
| Vertex / Gemini | Usage on the terminal `candidates[].finishReason` event |
- Empty deltas don’t zero-stomp real numbers. OpenAI’s pre-terminal “empty usage” deltas are ignored; the real numbers arrive on the last chunk.
- The gateway auto-injects `stream_options: {include_usage: true}` for OpenAI/Azure. Most OpenAI SDKs don't set this flag by default — without it, OpenAI never emits usage on the SSE stream, and the gateway would have to debit $0. The gateway's `DispatchStream` path detects OpenAI/Azure + `stream: true` + a missing flag and injects the option via `sjson.SetBytes` before the body reaches Bifrost. Other body fields round-trip byte-for-byte (so OpenAI's prompt-prefix auto-cache still hits on identical prefixes). Callers who explicitly set `stream_options: {include_usage: false}` are honoured (never overwritten).
- The `success_no_usage` warning is now rare. With the auto-injection above, OpenAI streams should always report usage. If you still see `span.status = success_no_usage` or the `X-LangWatch-Usage-Warning: provider_did_not_report_usage_on_stream` header, it means either (a) the caller opted out of usage explicitly via `include_usage: false`, or (b) the provider returned a stream without a final usage frame (very rare — file a bug with the trace id).
### Setting `stream_options.include_usage` in your client
Auto-injection means you don’t have to. But for clarity and robustness when you change gateways, the explicit form is identical to direct-OpenAI usage:
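A sketch with the `openai` Python SDK; base URL and key are placeholders:

```python
from openai import OpenAI

client = OpenAI(base_url="https://gateway.example.com/v1", api_key="lw-vk-...")

stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "hello"}],
    stream=True,
    # Explicit opt-in; the gateway would auto-inject this if omitted.
    stream_options={"include_usage": True},
)
usage = None
for chunk in stream:
    if chunk.usage:  # only the final chunk carries usage (its choices list is empty)
        usage = chunk.usage
print(usage)  # prompt/completion token counts for budget accounting
```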
### Alerting on missing usage (residual cases)
In Prometheus / Grafana, alert on the residual no-usage streams (a hedged sketch follows). If the alert fires, enable `LW_LOG_MESSAGE_BODY=1` (see Self-hosting → Config → Debug knobs) and check whether the caller is opting out of usage explicitly.
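The metric name below is a hypothetical placeholder: substitute whichever counter your deployment exposes for streams that end in `success_no_usage`:

```promql
# Hypothetical metric name; replace with your deployment's no-usage-stream counter.
sum(rate(gateway_stream_missing_usage_total[15m])) > 0
```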
## Non-streaming clients
Requests without `stream: true` are dispatched synchronously, accumulated upstream, and returned as a single JSON response. Pre-call guardrails run on the request; post-call guardrails run on the full response and can block/modify the final body.
The “no mid-stream switch” rule doesn’t apply — non-streaming calls can fall back freely because the client only ever sees the final successful response.