Modern LLM providers offer prompt caching with different shapes: Anthropic uses explicit
`cache_control` blocks, OpenAI auto-caches prompts over 1024 tokens, Gemini has an explicit `cachedContent` API, Azure OpenAI inherits OpenAI’s behavior, and Bedrock Claude inherits Anthropic’s. The LangWatch AI Gateway surfaces a single, unified control plane over all of them.
This is a load-bearing feature: prompt caching cuts input-token costs by up to 90% on hot prefixes, and a gateway that reformats or reorders the payload silently invalidates the cache. The gateway is engineered to preserve provider-native caching byte-for-byte by default, and to let operators compose overrides without breaking the primitives underneath.
The control surface — four layers (v1)
- Gateway default — `respect` (byte-for-byte passthrough). Always safe, never surprises.
- Per-VK default — `respect`/`disable`/`force`, configurable in the VK drawer under “Cache control”. `ttl=NNN` is deferred to v1.1 at the per-VK level.
- Cache rules — simple match/action overrides evaluated at auth-resolve time (bundled with VK config → near-zero hot-path cost). See Cache rules.
- Per-request header — `X-LangWatch-Cache: respect|disable|force` on an individual request. Highest precedence when set. A `ttl=NNN` header value is parsed but still deferred to v1.1 (returns `400 cache_override_not_implemented`, with a `cache_rule_mode_not_implemented` WARN when set via rules).
Precedence is header > rule > VK default > gateway default; if nothing is set at any layer, the effective mode is `respect`.
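For instance, pinning one request to a cold path needs nothing but the header. A minimal sketch in Python; the gateway URL and virtual key below are placeholders, not real deployment values:

```python
import requests

# Placeholder deployment values -- substitute your own gateway URL and VK.
GATEWAY_URL = "https://gateway.example.com/v1/chat/completions"
VIRTUAL_KEY = "lw_vk_live_..."

resp = requests.post(
    GATEWAY_URL,
    headers={
        "Authorization": f"Bearer {VIRTUAL_KEY}",
        # Highest-precedence layer: overrides any cache rule or VK default
        # for this one request.
        "X-LangWatch-Cache": "disable",
    },
    json={
        "model": "claude-sonnet-4-5",
        "messages": [{"role": "user", "content": "Hello"}],
    },
)
print(resp.headers.get("X-LangWatch-Cache-Mode"))  # expect "disable"
```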
Provider matrix
Not every mode makes sense for every provider. The gateway does the right thing per provider automatically — you don’t have to special-case in your client.

| Provider | Native scheme | respect behavior | disable behavior | force behavior | Notes |
|---|---|---|---|---|---|
| Anthropic (direct + Bedrock + Vertex) | Explicit cache_control: {type: "ephemeral"|"persistent"} on content/system/tool blocks | Bytes forwarded unchanged; provider honours markers (5 min / 1 h TTL) | Recursive JSON-walk drops every cache_control key at any nesting depth before dispatch | v1 — gateway injects cache_control: {type: "ephemeral"} on system[-1] + messages[-1].content[-1] when the caller hasn’t already set it (no-double-inject invariant) | cache_read_input_tokens and cache_creation_input_tokens separately reported in the trace. Sonnet 4.5 / Opus 4.5 caching is GA. Haiku 4.5 caching is in beta — may return cache_read=0 even with correct markers on accounts not enrolled in the beta. |
| OpenAI (direct + Azure) | Automatic — prefixes ≥1024 tokens cached, no client flag, 50% discount on cached tokens | Bytes forwarded unchanged (nothing to preserve — provider decides) | No-op on wire; gateway can optionally set a salt via a request-id suffix to break the cache key (future) | v1 — no-op on wire (caching is out of the client’s control), but the rule still fires + attribution emits + counter bumps for per-rule observability | usage.prompt_tokens_details.cached_tokens reported in response. Reasoning models (gpt-5 family with reasoning_effort set) may not return cached_tokens even on identical prefixes — verified directly against api.openai.com on test accounts. Use a non-reasoning model (gpt-4o-mini, gpt-4o, etc.) when prompt caching is load-bearing. |
| Gemini (Vertex + direct) | Explicit cachedContent resource — client POSTs to /v1beta/cachedContents then passes cachedContent: <name> on generation. Implicit prefix-caching exists for paid-tier accounts only. | Bytes forwarded unchanged; client holds the cache name | Strips cachedContent field from the request → forces cold generation | v1 shipped — gateway exposes POST /v1beta/cachedContents via the Gemini-native passthrough route + lifts cached_content field from /v1/chat/completions body onto Bifrost.ChatParameters.ExtraParams so cross-shape callers can also reference an existing cache. | cachedContentTokenCount reported in usageMetadata. Implicit prefix-cache (where Google caches identical >1024-token prefixes automatically) requires a paid-tier Google account; explicit cachedContents works on every tier. |
| Azure OpenAI | Inherits OpenAI automatic prefix cache | Same as OpenAI | Same as OpenAI | Same as OpenAI | Configured via Azure’s cache-at-deployment toggle — independent of our force |
| Bedrock Claude | Inherits Anthropic’s cache_control shape via converse API | Same as Anthropic | Same as Anthropic | Same as Anthropic (v1) | Bedrock region matters — not all regions support Anthropic cache yet. /v1/messages → bedrock returns Unexpected field type 400 (the Bifrost bedrock adapter is converse-API only); use /v1/chat/completions toward a bedrock VK for cache_control round-trip |
| Vertex Claude | Inherits Anthropic’s shape | Same as Anthropic | Same as Anthropic | Same as Anthropic (v1) | — |
`respect` is always safe; `disable` always strips cache-related fields (where applicable); `force` is provider-specific injection — shipped in v1 for Anthropic + OpenAI + Bedrock Claude + Vertex Claude + Gemini.
Modes
respect (default)
Gateway forwards every cache-related field (`cache_control` on Anthropic, `cachedContent` on Gemini, etc.) byte-for-byte. The gateway’s own semantic cache (if enabled, v1.1 spec) still runs on top — `respect` does not disable it.
This mode is the hard invariant protected by the gateway’s byte-level integration test: a request with cache_control: ephemeral on a system block must arrive at the provider with identical bytes in identical order. The test diff-compares the forwarded payload against the client’s payload — any non-trivial drift fails the build.
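To make the invariant concrete, here is the kind of payload the test exercises: an Anthropic-shape body with an explicit marker that must reach the provider byte-identical. A sketch only; the prompt text is a stand-in:

```python
# Anthropic /v1/messages body with an explicit cache marker. Under `respect`,
# the gateway forwards these bytes unchanged: same keys, same order, same nesting.
payload = {
    "model": "claude-sonnet-4-5",
    "max_tokens": 256,
    "system": [
        {
            "type": "text",
            "text": "<large shared system prompt>",
            "cache_control": {"type": "ephemeral"},  # must survive untouched
        }
    ],
    "messages": [{"role": "user", "content": "Summarize the doc."}],
}
```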
disable
Strip every cache-related field at any nesting depth before dispatch (a minimal sketch of the walk follows the list):

- Anthropic: recursive JSON-walk drops `cache_control` from `messages[].content[]`, `system[]`, and `tools[]`.
- Gemini: drops `cachedContent` from the top-level body.
- OpenAI: no-op on wire (the provider’s automatic cache can’t be disabled client-side).
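A minimal sketch of that recursive walk (the production implementation lives in the Go gateway; this Python version only illustrates the any-depth behavior):

```python
def strip_cache_control(node):
    """Recursively remove every `cache_control` key from a JSON-like value."""
    if isinstance(node, dict):
        node.pop("cache_control", None)
        for value in node.values():
            strip_cache_control(value)
    elif isinstance(node, list):
        for item in node:
            strip_cache_control(item)
    return node
```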
Typical reasons to run cold:

- Benchmarking cold-cost in A/B before deciding if a caching tier is worth turning on.
- Reproducing a cache-dependent bug on a cold call.
- A specific request must not hit cache for correctness.

In all of these, set the per-request header `X-LangWatch-Cache: disable`.
force
For providers that support client-side cache declarations:
- Anthropic (direct + Bedrock Claude + Vertex Claude) — v1 shipped. Gateway injects `cache_control: {type: "ephemeral"}` on `system[-1]` + `messages[-1].content[-1]` when the caller hasn’t already set it. Implemented via a recursive JSON walk that targets the last element of each block array; the no-double-inject invariant means client-set `cache_control` is preserved byte-for-byte (sketched after this list).
- OpenAI / Azure — v1 shipped (as a no-op on wire). Their cache is automatic, so the request body is byte-identical to passthrough, but the rule still fires: `langwatch.cache.rule_id` + `mode_applied=FORCE` span attrs emit and the `gateway_cache_rule_hits_total{mode_applied="FORCE"}` counter bumps. Operators get per-rule observability even on providers where `force` is semantically a no-op.
- Gemini / Vertex Gemini — v1 shipped. The gateway’s `/v1beta/*` passthrough route (see the Gemini CLI integration) accepts `POST /v1beta/cachedContents` for context-cache creation and forwards verbatim to `generativelanguage.googleapis.com`. Subsequent `/v1beta/models/{model}:generateContent` calls reference the cache by name via the `cachedContent` field. The same `cached_content` field is also lifted from `/v1/chat/completions` request bodies onto `Bifrost.ChatParameters.ExtraParams` so cross-shape OpenAI-SDK callers can reference an existing cache. Implicit prefix caching (auto-cache for >1024-token shared prefixes) requires a paid-tier Google account.
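A sketch of the Anthropic injection path under the assumptions stated above; `setdefault` is what gives the no-double-inject behavior (illustrative Python, not the gateway’s actual Go code):

```python
def inject_anthropic_cache_markers(body: dict) -> dict:
    """Illustrative force-mode injection: mark system[-1] and
    messages[-1].content[-1] as ephemeral, never overwriting a
    caller-set marker (the no-double-inject invariant)."""
    def mark_last(blocks):
        # Only touch well-formed block arrays; setdefault skips blocks
        # that already carry a client-set cache_control.
        if isinstance(blocks, list) and blocks and isinstance(blocks[-1], dict):
            blocks[-1].setdefault("cache_control", {"type": "ephemeral"})

    mark_last(body.get("system"))
    messages = body.get("messages")
    if isinstance(messages, list) and messages and isinstance(messages[-1], dict):
        mark_last(messages[-1].get("content"))
    return body
```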
ttl=NNN
Same shape as `force` but carries an explicit TTL in seconds. Header-level parse succeeds, but Apply is still `ErrNotImplemented` in v1 — operators opting into a rule-level `ttl` currently see the passthrough behavior above with the same WARN. v1.1 work will wire per-provider TTL mapping:
- Anthropic: chooses `ephemeral` (5 min) or `persistent` (1 h) based on the TTL range; anything above 3600 clamps to `persistent` (see the sketch after this list).
- Gemini: sets the `ttl` field on `cachedContent` (depends on the v1.1 Gemini force work above).
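A sketch of that planned mapping. The exact ephemeral/persistent boundary is an assumption, since only the 3600 s clamp is specified above:

```python
def anthropic_cache_type_for_ttl(ttl_seconds: int) -> str:
    """Planned v1.1 TTL mapping for Anthropic (illustrative).
    Assumption: TTLs within the 5-minute tier map to ephemeral;
    everything longer, including values above 3600 s, clamps to
    the 1-hour persistent tier."""
    return "ephemeral" if ttl_seconds <= 300 else "persistent"
```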
Response headers
Every successful response carries:

- `X-LangWatch-Cache-Mode` — the mode that was applied (`respect`/`disable`/`force`). Matches the internal `Kind` enum the Go gateway uses (`services/gateway/internal/cacheoverride`). Useful for tracing and A/B analysis: a client that sets `X-LangWatch-Cache: disable` should see `X-LangWatch-Cache-Mode: disable` on the response, independent of whether the provider reported a cache outcome.
- `X-LangWatch-Cache` — the cache outcome:
  - `hit` — provider served part or all of the prefix from cache.
  - `miss` — no cache hit.
  - `bypass` — mode was `disable`.
  - `force` — mode was `force` (v1.1 — deferred with `400 cache_override_not_implemented` in v1).
Cache-aware token accounting distinguishes three buckets:

- `cache_read_input_tokens` — served from cache (Anthropic: ~10% of regular price).
- `cache_creation_input_tokens` — writing to cache (Anthropic: 125% of regular).
- `input_tokens` — regular cold tokens.

The `/internal/gateway/budget/debit` call forwards all three so budget ledgers reflect accurate cache economics.
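Taken together, a client can attribute cache behavior per call. A hedged sketch (the header names are from this page; the usage fields follow the Anthropic response shape):

```python
def summarize_cache_outcome(resp) -> dict:
    """Pull cache attribution off a gateway response (illustrative helper)."""
    usage = resp.json().get("usage", {})
    return {
        "mode_applied": resp.headers.get("X-LangWatch-Cache-Mode"),
        "outcome": resp.headers.get("X-LangWatch-Cache"),  # hit / miss / bypass
        "cached_reads": usage.get("cache_read_input_tokens", 0),
        "cache_writes": usage.get("cache_creation_input_tokens", 0),
        "cold_tokens": usage.get("input_tokens", 0),
    }
```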
Cache rules
Simple match/action rules that let operators set cache behavior for groups of requests without touching client code. Deliberately narrower than Cloudflare Page Rules — enough to demonstrate enterprise cache governance (force 5-min cache on prod internal-api, disable cache for benchmark VKs, etc.) without the matcher combinatorics that would slow the hot path. Latency budget: rules are resolved once per VK bundle refresh (every 30 s or on changes) and baked into the bundle as a pre-sorted array. At request time, the gateway walks the sorted array until a match fires — for the expected low single-digit rule count per VK, this is sub-microsecond. Hot-path total overhead stays at ~700 ns.
Shape
Matchers
Each matcher is optional; all non-null matchers must match for the rule to apply (AND semantics).

- `vk_id` — exact match on virtual-key id.
- `vk_tags` — every tag in the list must be on the VK (e.g. `["env=prod", "team=ml"]`).
- `vk_prefix` — match on VK display prefix (`lw_vk_live_01HZX9`).
- `principal_id` — the user/service account making the call.
- `model` — exact match on the resolved model (post-alias).
- `request_metadata` — match on arbitrary client-set metadata headers (`X-Request-Source`, `X-Customer-Tier`, etc.).
- `time_window` — optional time-of-day / day-of-week constraint.
Actions
- `mode: respect | disable | force` — as above.
- `ttl: NNN` — seconds (only meaningful with `force`).
- `salt: <string>` — cache-key salt for A/B or version invalidation (for the gateway’s semantic cache, v1.1).
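Putting matchers and actions together, a rule might look like the following. The concrete syntax is illustrative; only the field names come from the lists above:

```python
# Illustrative rule object: every matcher is optional; non-null ones are ANDed.
rule = {
    "priority": 100,                    # higher wins; first match stops the walk
    "match": {
        "vk_tags": ["env=prod", "team=ml"],
        "model": "claude-sonnet-4-5",
        "request_metadata": {"X-Customer-Tier": "enterprise"},
    },
    "action": {
        "mode": "force",                # respect | disable | force
        "ttl": 300,                     # seconds; only meaningful with force
    },
}
```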
Evaluation
- First-match wins, sorted by `priority` descending (sketched below).
- Rules evaluated once per request, at auth-resolve time (bundled with VK config), so zero additional upstream latency.
- Rule hits exposed in observability via the `langwatch.cache.rule_id` span attribute + `gateway_cache_rule_hits_total{rule_id, mode, provider}` counter.
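The evaluation walk itself is deliberately trivial. A sketch, assuming the bundle already holds the rules sorted by priority; real matchers such as `vk_tags` use subset semantics rather than the exact-equality shortcut here:

```python
def rule_matches(match: dict, ctx: dict) -> bool:
    """AND semantics: every non-null matcher must hold. Simplified to
    exact equality; e.g. vk_tags really checks list containment."""
    return all(ctx.get(key) == want for key, want in match.items())

def resolve_cache_mode(rules: list, ctx: dict, vk_default: str = "respect") -> str:
    """First-match-wins walk over the pre-sorted rule array (illustrative)."""
    for rule in rules:  # sorted by priority, descending, at bundle build time
        if rule_matches(rule["match"], ctx):
            return rule["action"]["mode"]
    return vk_default
```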
Concrete example — “force 5-min cache on all prod internal-api calls”
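Reusing the illustrative shape from above, such a rule could look like this (tag and metadata values invented for the example):

```python
# "Force 5-min cache on all prod internal-api calls" -- values are examples.
rule = {
    "priority": 50,
    "match": {
        "vk_tags": ["env=prod"],
        "request_metadata": {"X-Request-Source": "internal-api"},
    },
    "action": {"mode": "force", "ttl": 300},  # ttl honored from v1.1
}
```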
Result: `cache_control: ephemeral` injected on the system + last-user blocks of Anthropic-shape traffic; OpenAI requests pass through (no-op). No client change required.
Enforcement ordering
The cache-override hook runs before blocked-pattern enforcement on `/v1/messages`. A disable-mode request has its `cache_control` blocks stripped first, then regex checks fire on the post-strip body. This keeps policy evaluation deterministic regardless of the caller’s caching choice.
Order on the request pipeline:
1. Auth resolution (VK → bundle)
2. Rate limit (per-VK)
3. Cache override mode resolution (header > rule > VK default)
4. Cache override apply (strip / inject)
5. Blocked patterns (tools / MCP / URLs / models)
6. Body size cap
7. Guardrails (pre)
8. Upstream dispatch
9. Guardrails (post)
10. Budget debit
Why this is load-bearing
From a real observed pattern: a 40k-token system prompt with 100 calls a day against an Anthropic model.

- Without caching: 40k × 100 = 4M input tokens/day at full price. Claude Opus: ~$60/day.
- With Anthropic ephemeral cache (5-min TTL, 1 write + ~99 reads): 1 write at $18.75/M + 99 reads at $1.50/M ≈ $6.70/day, roughly 89% cheaper.
- With a gateway that reformats the payload and loses the cache: back to $60/day.
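As a quick check on the arithmetic, assuming the publicly listed ~$15/M Opus input price and the 125%/10% write/read multipliers quoted in the token-accounting list above:

```python
PRICE_PER_M = 15.00                 # Claude Opus input, USD per 1M tokens
prefix_tokens, calls = 40_000, 100

cold = prefix_tokens * calls / 1e6 * PRICE_PER_M                 # 60.00 USD/day
write = prefix_tokens / 1e6 * PRICE_PER_M * 1.25                 # one cache write
reads = prefix_tokens * (calls - 1) / 1e6 * PRICE_PER_M * 0.10   # ~99 cache reads
print(f"{cold:.2f} vs {write + reads:.2f}")                      # 60.00 vs 6.69
```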
See also
- Caching Passthrough — the lower-level byte-for-byte invariant that underpins `respect` mode.
- Virtual Keys — per-VK default `cache` config.
- API Reference → Errors — `cache_override_invalid` / `cache_override_not_implemented` error shapes.
- Observability — the trace attributes for cache attribution.