A fallback chain is an ordered list of provider credentials the gateway tries in sequence when the primary fails. It’s per-virtual-key configuration; each VK owns its own chain.

Configuring a chain

On the VK edit screen:
  1. Select a primary provider credential.
  2. Add one or more fallback credentials in the order they should be tried.
  3. Pick which conditions trigger fallback (by default: 5xx, timeout, rate_limit).
  4. Set timeout_ms (default 30000) and max_attempts (default 3).
{
  "fallback": {
    "chain":        ["pc_primary_openai", "pc_anthropic", "pc_bedrock_us_east"],
    "on":           ["5xx", "timeout", "rate_limit", "network_error"],
    "timeout_ms":   30000,
    "max_attempts": 3
  }
}
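The chain semantics above can be sketched in a few lines. This is a minimal illustration, not the gateway's implementation: `walk_chain`, `Attempt`, and the `call` callback are hypothetical names, and the sketch assumes that `max_attempts` counts the primary as the first attempt and that each attempt goes to the next slot in the chain.

```python
from dataclasses import dataclass
from typing import Callable, List, Set


@dataclass
class Attempt:
    credential: str
    outcome: str  # "ok", or a failure condition such as "5xx" or "timeout"


def walk_chain(
    chain: List[str],
    on: Set[str],
    max_attempts: int,
    call: Callable[[str], str],
) -> List[Attempt]:
    """Try credentials in order until one succeeds or a non-trigger failure occurs."""
    attempts: List[Attempt] = []
    for credential in chain[:max_attempts]:  # assumption: primary counts as attempt 1
        outcome = call(credential)
        attempts.append(Attempt(credential, outcome))
        if outcome == "ok" or outcome not in on:
            break  # success, or an error that must be surfaced rather than masked
    return attempts


# Usage: the primary returns a 5xx, the first fallback succeeds.
simulated = {"pc_primary_openai": "5xx", "pc_anthropic": "ok", "pc_bedrock_us_east": "ok"}
attempts = walk_chain(
    ["pc_primary_openai", "pc_anthropic", "pc_bedrock_us_east"],
    {"5xx", "timeout", "rate_limit", "network_error"},
    3,
    simulated.__getitem__,
)
```

In this run the third credential is never contacted, because the walk stops at the first success.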

When fallback triggers

| Condition | Fires? | Why |
| --- | --- | --- |
| Upstream 5xx | Yes | Provider’s fault; next provider may work. |
| timeout (> timeout_ms) | Yes | Provider degraded. |
| 429 rate_limit_exceeded | Yes | Primary is throttled; secondary may have headroom. |
| network_error (DNS/TCP/TLS) | Yes | Connectivity issue to primary. |
| circuit_breaker open | Yes | Preemptive, gateway knows primary has been failing recently. |
| Upstream 400 Bad Request | No | Client-fault. Surfacing the error is correct. |
| Upstream 401 Unauthorized | No | Provider credential bad. Needs human fix, not masking. |
| Upstream 403 Forbidden | No | Authorization issue. Silent switch would hide a real problem. |
| Upstream 404 Not Found | No | Requested model doesn’t exist. |
| LangWatch-internal invalid_api_key etc. | No | Never reaches the fallback layer. |
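The HTTP-status rows of the table reduce to a small classifier. The sketch below is an assumption about how such a mapping could look; `classify_status` and `fires` are hypothetical helper names, and only the condition strings (`5xx`, `rate_limit`) come from the configuration shown earlier.

```python
from typing import Optional, Set


def classify_status(status: int) -> Optional[str]:
    """Map an upstream HTTP status to a fallback condition, or None if it never fires."""
    if status == 429:
        return "rate_limit"
    if 500 <= status <= 599:
        return "5xx"
    # 400, 401, 403, 404 and other client errors are surfaced to the caller as-is.
    return None


def fires(status: int, on: Set[str]) -> bool:
    """True when the status maps to a condition that the VK has enabled in `on`."""
    condition = classify_status(status)
    return condition is not None and condition in on
```

A status only triggers fallback when both things hold: it maps to a retryable condition, and that condition is listed in the VK's `on` array.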

Model translation across providers

A single VK may have mixed providers (OpenAI + Anthropic + Bedrock). The gateway uses Bifrost’s provider-dispatch library to translate payloads: same messages schema, different wire formats. If the client requests gpt-5-mini and fails over to Anthropic, the gateway applies the VK’s model_aliases to pick the Anthropic equivalent (e.g. claude-haiku-4-5-20251001). Configure this per VK:
{
  "model_aliases": {
    "gpt-5-mini":              "openai/gpt-5-mini",
    "gpt-5-mini:fallback":     "anthropic/claude-haiku-4-5-20251001",
    "claude-haiku-4-5-20251001":        "anthropic/claude-haiku-4-5-20251001",
    "claude-haiku-4-5-20251001:fallback": "bedrock/anthropic.claude-haiku-4-5-20251001"
  }
}
The :fallback suffix is optional. If it is absent, the gateway uses the same model name against the next provider and expects it to exist there.
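The lookup rule can be sketched as follows. `resolve_model` is a hypothetical helper, and the pass-through for names with no alias entry at all is an assumption of this sketch; the source only specifies the behaviour when the `:fallback` entry is missing.

```python
from typing import Dict


def resolve_model(requested: str, aliases: Dict[str, str], is_fallback: bool) -> str:
    """Pick the provider-qualified model for the primary or for a fallback slot."""
    if is_fallback:
        # Prefer the explicit "<name>:fallback" entry; otherwise reuse the same
        # model name and rely on the next provider knowing it.
        return aliases.get(f"{requested}:fallback", requested)
    # Assumption: names with no alias entry pass through unchanged.
    return aliases.get(requested, requested)


aliases = {
    "gpt-5-mini": "openai/gpt-5-mini",
    "gpt-5-mini:fallback": "anthropic/claude-haiku-4-5-20251001",
}
primary_model = resolve_model("gpt-5-mini", aliases, is_fallback=False)
fallback_model = resolve_model("gpt-5-mini", aliases, is_fallback=True)
```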

Streaming

Fallback behaviour differs based on when failure occurs:
  • Before first chunk emits → transparent fallback. The stream-setup call (bifrost.ChatCompletionStreamRequest) walks the chain the same way non-streaming dispatch does; the client sees a single stream from whichever slot accepted the request. X-LangWatch-Fallback-Count reports the skipped slot count.
  • After first chunk has streamed → no mid-stream fallback. The gateway emits a terminal event: error frame (with code: upstream_mid_stream_failure) and closes the connection. The client may retry; a fresh request would then re-walk the chain.
This split is deliberate: splicing chunks from two providers would produce an inconsistent response with mismatched tool-call ids and accumulated-content replay. It’s a hard “no” per contract, enforced by a byte-exact assertion on the SSE error frame shape so future refactors can’t accidentally break it. See Streaming → pre-connection fallback, mid-stream failure for the exact frame bytes and a worked example.
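The two phases can be modelled with a generator. This is a simplified sketch, not the gateway's code: `UpstreamError`, `stream_with_fallback`, and the tuple-shaped events are hypothetical. It also assumes that a failure while pulling the first chunk still counts as "before first chunk", and it emits `provider_error` when every slot fails at setup.

```python
from typing import Callable, Iterable, Iterator, List, Tuple


class UpstreamError(Exception):
    """Hypothetical stand-in for any upstream failure."""


_EMPTY = object()


def stream_with_fallback(
    openers: List[Callable[[], Iterable[str]]],
) -> Iterator[Tuple[str, object]]:
    for skipped, open_stream in enumerate(openers):
        try:
            stream = iter(open_stream())
            first = next(stream, _EMPTY)
        except UpstreamError:
            continue  # nothing has reached the client yet: try the next slot
        yield ("fallback_count", skipped)  # models X-LangWatch-Fallback-Count
        if first is _EMPTY:
            return
        yield ("chunk", first)
        try:
            for chunk in stream:
                yield ("chunk", chunk)
        except UpstreamError:
            # Chunks were already sent: emit a terminal error, never switch provider.
            yield ("error", "upstream_mid_stream_failure")
        return
    yield ("error", "provider_error")  # every slot failed before its first chunk


def failing_setup():
    raise UpstreamError("503 at stream setup")


def healthy():
    return ["Hel", "lo"]


def breaks_midway():
    yield "Hel"
    raise UpstreamError("connection reset")
```

Running `list(stream_with_fallback([failing_setup, healthy]))` yields a single clean stream with a fallback count of 1. Running it with `[breaks_midway, healthy]` ends in the error event and never touches the second slot.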

Observing the chain in traces

Fallback attribution today:
  • Prometheus counter gateway_provider_attempts_total{outcome} increments once per attempt, with outcome in primary_success | fallback_success | retryable_5xx | rate_limit | timeout | network | circuit_open | non_retryable.
  • Response header X-LangWatch-Fallback-Count: N reports how many fallbacks were attempted before success.
  • Request-id correlation via X-LangWatch-Request-Id: use it to join the metric and log line back to the specific trace.
Per-attempt nested spans (langwatch.fallback.attempt, .reason attrs) are a v1.1 observability follow-up. In v1, the counter + header give you the aggregate picture; per-attempt reasoning lives in the gateway log line for that request-id.
If all attempts in the chain fail, the gateway returns the last provider’s error envelope mapped to the OpenAI-compatible shape (provider_error or upstream_timeout as the type).
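On the client side, the two headers can be pulled out of a response for logging. The helper below is a hypothetical sketch that takes a plain mapping of header names to values; treating a missing X-LangWatch-Fallback-Count as 0 is an assumption, not documented behaviour.

```python
from typing import Dict, Mapping, Optional, Union


def fallback_attribution(headers: Mapping[str, str]) -> Dict[str, Union[Optional[str], int]]:
    """Extract the request id and fallback count from response headers."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    return {
        "request_id": normalized.get("x-langwatch-request-id"),
        # Assumption: an absent header means no fallback happened.
        "fallback_count": int(normalized.get("x-langwatch-fallback-count", "0")),
    }


info = fallback_attribution(
    {"X-LangWatch-Request-Id": "req_123", "X-LangWatch-Fallback-Count": "2"}
)
```

The `req_123` value is a made-up placeholder; real request ids may follow a different format.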

Circuit breaker

Each provider has an independent circuit breaker with a sliding window:
| Default | Meaning |
| --- | --- |
| Window: 30 s | Failure events within the last 30 s count toward the open threshold |
| Threshold: 10 failures | 10 failures in the window open the circuit |
| Open cooldown: 60 s | Circuit stays open for 60 s; skipped on new requests regardless of fallback order |
| Half-open probe: 1 request | After cooldown, a single probe is let through; if it succeeds, circuit closes; if it fails, another 60 s of open |
Override at the service level:
| Env var | Default |
| --- | --- |
| LW_GATEWAY_CIRCUIT_WINDOW_S | 30 |
| LW_GATEWAY_CIRCUIT_THRESHOLD | 10 |
| LW_GATEWAY_CIRCUIT_COOLDOWN_S | 60 |
Breaker state is per-replica, not shared: each gateway replica maintains its own. This is deliberate. Under a large-scale outage, N replicas rediscovering the recovered provider independently is resilient; depending on Redis for breaker consensus is not.
Per-provider circuit state is emitted as a Prometheus metric gateway_circuit_state{provider,state} where state ∈ {closed, half_open, open}. Alert on “primary circuit has been open for > 5 minutes” to catch real provider outages (as distinct from transient blips).
When a request hits a circuit that’s currently open, the gateway skips to the next entry in the fallback chain immediately, with no wasted round-trip to a provider it already knows is down.
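The state machine described in this section can be sketched as a small in-memory class. It is an illustration under stated assumptions, not the gateway's implementation: the class and method names are hypothetical, time is passed in explicitly to keep the example deterministic, and only the defaults (30 s window, 10 failures, 60 s cooldown, single probe) come from the tables above.

```python
from collections import deque
from typing import Deque, Optional


class CircuitBreaker:
    def __init__(self, window_s: float = 30.0, threshold: int = 10, cooldown_s: float = 60.0):
        self.window_s = window_s
        self.threshold = threshold
        self.cooldown_s = cooldown_s
        self.failures: Deque[float] = deque()  # timestamps of recent failures
        self.opened_at: Optional[float] = None
        self.probe_in_flight = False

    def state(self, now: float) -> str:
        if self.opened_at is None:
            return "closed"
        return "open" if now - self.opened_at < self.cooldown_s else "half_open"

    def allow(self, now: float) -> bool:
        """Decide whether a request may be sent to this provider."""
        current = self.state(now)
        if current == "closed":
            return True
        if current == "open" or self.probe_in_flight:
            return False  # caller skips to the next chain entry
        self.probe_in_flight = True  # half-open: let exactly one probe through
        return True

    def record_success(self, now: float) -> None:
        if self.probe_in_flight:  # successful probe closes the circuit
            self.opened_at = None
            self.probe_in_flight = False
            self.failures.clear()

    def record_failure(self, now: float) -> None:
        if self.opened_at is not None:
            if self.probe_in_flight:  # failed probe: another full cooldown
                self.opened_at = now
                self.probe_in_flight = False
            return
        self.failures.append(now)
        while self.failures and now - self.failures[0] > self.window_s:
            self.failures.popleft()  # drop failures that left the sliding window
        if len(self.failures) >= self.threshold:
            self.opened_at = now
```

Because each instance holds its own state, running one per replica matches the per-replica design described above without any shared store.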

Sizing the chain

Returns diminish after 3 entries. With the default timeout_ms=30000 and max_attempts=3, the worst-case wall-clock time to exhaust the chain is ~90 s (3 attempts × 30 s each); the latency budget of the original call is gone well before then and the client has probably given up. Use chains of 2-3 for latency-sensitive traffic and lower the per-entry timeout_ms if you need a tighter total budget; longer chains are fine for batch/offline.
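The budget arithmetic can be made explicit. Both helpers below are hypothetical, and they assume attempts run strictly one after another, each bounded by timeout_ms, with no extra delay between them.

```python
def worst_case_seconds(chain_len: int, timeout_ms: int, max_attempts: int) -> float:
    """Upper bound on wall-clock time if every attempt runs to its timeout."""
    attempts = min(chain_len, max_attempts)
    return attempts * timeout_ms / 1000


def timeout_for_budget(budget_ms: int, chain_len: int, max_attempts: int) -> int:
    """Largest per-entry timeout_ms that keeps the whole chain inside budget_ms."""
    attempts = min(chain_len, max_attempts)
    return budget_ms // attempts


default_worst_case = worst_case_seconds(chain_len=3, timeout_ms=30000, max_attempts=3)  # 90.0
tight_timeout = timeout_for_budget(budget_ms=20000, chain_len=2, max_attempts=3)  # 10000
```

With a two-entry chain and a 20 s total budget, each entry gets at most 10 s.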