Skip to main content
Modern LLM providers offer prompt caching with different shapes: Anthropic uses explicit cache_control blocks, OpenAI auto-caches prompts over 1024 tokens, Gemini has an explicit cachedContent API, Azure OpenAI inherits OpenAI’s behavior, Bedrock Claude inherits Anthropic’s. The LangWatch AI Gateway surfaces a single, unified control plane over all of them. This is a load-bearing feature: prompt caching cuts input-token costs by up to 90% on hot prefixes, and a gateway that reformats or reorders the payload silently invalidates the cache. The gateway is engineered to preserve provider-native caching byte-for-byte by default, and to let operators compose overrides without breaking the primitives underneath.

The control surface: four layers (v1)

  1. Gateway default: respect (byte-for-byte passthrough). Always safe, never surprises.
  2. Per-VK default: respect, disable, force configurable in the VK drawer under “Cache control”. ttl=NNN is deferred to v1.1 at the per-VK level.
  3. Cache rules: simple match/action overrides evaluated at auth-resolve time (bundled with VK config → near-zero hot-path cost). See Cache rules.
  4. Per-request header: X-LangWatch-Cache: respect|disable|force on an individual request. Highest precedence when set. ttl=NNN header value is parsed but still deferred to v1.1 (returns 400 cache_override_not_implemented with a cache_rule_mode_not_implemented WARN on rules).
Effective-mode precedence: per-request header > matched rule > per-VK default > gateway default respect.

Provider matrix

Not every mode makes sense for every provider. The gateway does the right thing per provider automatically, you don’t have to special-case in your client.
ProviderNative schemerespect behaviordisable behaviorforce behaviorNotes
Anthropic (direct + Bedrock + Vertex)Explicit cache_control: {type: "ephemeral"|"persistent"} on content/system/tool blocksBytes forwarded unchanged; provider honours markers (5 min, 1 h TTL)Recursive JSON-walk drops every cache_control key at any nesting depth before dispatchv1: gateway injects cache_control: {type: "ephemeral"} on system[-1] + messages[-1].content[-1] when the caller hasn’t already set it (no-double-inject invariant)cache_read_input_tokens and cache_creation_input_tokens separately reported in the trace. Sonnet 4.5, Opus 4.5 caching is GA. Haiku 4.5 caching is in beta: may return cache_read=0 even with correct markers on accounts not enrolled in the beta.
OpenAI (direct + Azure)Automatic, prefixes ≥1024 tokens cached, no client flag, 50% discount on cached tokensBytes forwarded unchanged (nothing to preserve, provider decides)No-op on wire; gateway can optionally set a salt via a request-id suffix to break the cache key (future)v1: no-op on wire (caching is out of the client’s control), but the rule still fires + attribution emits + counter bumps for per-rule observabilityusage.prompt_tokens_details.cached_tokens reported in response. Reasoning models (gpt-5 family with reasoning_effort set) may not return cached_tokens even on identical prefixes, verified directly against api.openai.com on test accounts. Use a non-reasoning model (gpt-4o-mini, gpt-4o, etc.) when prompt caching is load-bearing.
Gemini (Vertex + direct)Explicit cachedContent resource, client POSTs to /v1beta/cachedContents then passes cachedContent: <name> on generation. Implicit prefix-caching exists for paid-tier accounts only.Bytes forwarded unchanged; client holds the cache nameStrips cachedContent field from the request → forces cold generationv1 shipped: gateway exposes POST /v1beta/cachedContents via the Gemini-native passthrough route + lifts cached_content field from /v1/chat/completions body onto Bifrost.ChatParameters.ExtraParams so cross-shape callers can also reference an existing cache.cachedContentTokenCount reported in usageMetadata. Implicit prefix-cache (where Google caches identical >1024-token prefixes automatically) requires a paid-tier Google account; explicit cachedContents works on every tier.
Azure OpenAIInherits OpenAI automatic prefix cacheSame as OpenAISame as OpenAISame as OpenAIConfigured via Azure’s cache-at-deployment toggle, independent of our force
Bedrock ClaudeInherits Anthropic’s cache_control shape via converse APISame as AnthropicSame as AnthropicSame as Anthropic (v1)Bedrock region matters, not all regions support Anthropic cache yet. /v1/messages → bedrock returns Unexpected field type 400 (the Bifrost bedrock adapter is converse-API only); use /v1/chat/completions toward a bedrock VK for cache_control round-trip
Vertex ClaudeInherits Anthropic’s shapeSame as AnthropicSame as AnthropicSame as Anthropic (v1),
The common shape: respect is always safe; disable always strips cache-related fields (where applicable); force is provider-specific injection, shipped in v1 for Anthropic + OpenAI + Bedrock Claude + Vertex Claude + Gemini.

Modes

respect (default)

Gateway forwards every cache-related field (cache_control on Anthropic, cachedContent on Gemini, etc.) byte-for-byte. The gateway’s own semantic cache (if enabled, v1.1 spec) still runs on top, respect does not disable it. This mode is the hard invariant protected by the gateway’s byte-level integration test: a request with cache_control: ephemeral on a system block must arrive at the provider with identical bytes in identical order. The test diff-compares the forwarded payload against the client’s payload, any non-trivial drift fails the build.

disable

Strip every cache-related field at any nesting depth before dispatch:
  • Anthropic: recursive JSON-walk drops cache_control from messages[].content[], system[], and tools[].
  • Gemini: drops cachedContent from the top-level body.
  • OpenAI: no-op on wire (the provider’s automatic cache can’t be disabled client-side).
Use when:
  • Benchmarking cold-cost in A/B before deciding if a caching tier is worth turning on.
  • Reproducing a cache-dependent bug on a cold call.
  • A specific request must not hit cache for correctness.
The gateway’s semantic cache (if enabled, v1.1) is also bypassed when mode is disable.

force

For providers that support client-side cache declarations:
  • Anthropic (direct + Bedrock Claude + Vertex Claude): v1 shipped. Gateway injects cache_control: {type: "ephemeral"} on system[-1] + messages[-1].content[-1] when the caller hasn’t already set it. Implemented via a recursive JSON walk that targets the last element of each block array; no-double-inject invariant means client-set cache_control is preserved byte-for-byte.
  • OpenAI, Azure: v1 shipped (as a no-op on wire). Their cache is automatic, so the request body is byte-identical to passthrough, but the rule still fires + langwatch.cache.rule_id + mode_applied=FORCE span attrs emit + gateway_cache_rule_hits_total{mode_applied="FORCE"} counter bumps. Operators get per-rule observability even on providers where force is semantically a no-op.
  • Gemini, Vertex Gemini: v1 shipped. The gateway’s /v1beta/* passthrough route (see the Gemini CLI integration) accepts POST /v1beta/cachedContents for context-cache creation and forwards verbatim to generativelanguage.googleapis.com. Subsequent /v1beta/models/{model}:generateContent calls reference the cache by name via the cachedContent field. The same cached_content field is also lifted from /v1/chat/completions request bodies onto Bifrost.ChatParameters.ExtraParams so cross-shape OpenAI-SDK callers can reference an existing cache. Implicit prefix caching (auto-cache for >1024-token shared prefixes) requires a paid-tier Google account.

ttl=NNN

Same shape as force but carries an explicit TTL in seconds. Header-level parse succeeds but Apply is still ErrNotImplemented in v1, operators opting into a rule-level ttl currently see the passthrough behaviour above with the same WARN. v1.1 work will wire per-provider TTL mapping:
  • Anthropic: chooses ephemeral (5 min) or persistent (1 h) based on the TTL range; anything above 3600 clamps to persistent.
  • Gemini: sets the ttl field on cachedContent (depends on the v1.1 Gemini force work above).

Response headers

Every successful response carries:
  • X-LangWatch-Cache-Mode, the mode that was applied (respect, disable, force). Matches the internal Kind enum the Go gateway uses (services/gateway/internal/cacheoverride). Useful for tracing and A/B analysis: a client that sets X-LangWatch-Cache: disable should see X-LangWatch-Cache-Mode: disable on the response, independent of whether the provider reported a cache outcome.
  • X-LangWatch-Cache, the cache outcome:
    • hit, provider served part or all of the prefix from cache.
    • miss, no cache hit.
    • bypass, mode was disable.
    • force, mode was force (v1.1, deferred with 400 cache_override_not_implemented in v1).
Usage reporting splits input tokens by origin:
  • cache_read_input_tokens, served from cache (Anthropic: ~10% of regular price).
  • cache_creation_input_tokens, writing to cache (Anthropic: 125% of regular).
  • input_tokens, regular cold tokens.
The /internal/gateway/budget/debit call forwards all three so budget ledgers reflect accurate cache economics.

Cache rules

Simple match/action rules that let operators set cache behavior for groups of requests without touching client code. Deliberately narrower than Cloudflare Page Rules, enough to demonstrate enterprise cache governance (force 5-min cache on prod internal-api, disable cache for benchmark VKs, etc.) without the matcher combinatorics that would slow the hot path. Latency budget: rules are resolved once per VK bundle refresh (every 30 s on /changes) and baked into the bundle as a pre-sorted array. At request time, the gateway walks the sorted array until a match fires, for the expected low single-digit rule count per VK, this is sub-microsecond. Hot-path total overhead stays at ~700 ns.

Shape

{
  "id": "rule_01HZX...",
  "priority": 100,
  "match": {
    "vk_tags": ["env=prod"],
    "model": "gpt-5-mini",
    "principal_id": null,
    "request_metadata": { "X-Request-Source": "internal-api" }
  },
  "action": {
    "mode": "force",
    "ttl": 300,
    "salt": "v2"
  },
  "enabled": true
}

Matchers

Each matcher is optional; all non-null matchers must match for the rule to apply (AND semantics).
  • vk_id, exact match on virtual-key id.
  • vk_tags, every tag in the list must be on the VK (e.g. ["env=prod", "team=ml"]).
  • vk_prefix, match on VK display prefix (vk-lw-01HZX9).
  • principal_id, the user/service account making the call.
  • model, exact match on the resolved model (post-alias).
  • request_metadata, match on arbitrary client-set metadata headers (X-Request-Source, X-Customer-Tier, etc.).
  • time_window, optional time-of-day, day-of-week constraint.

Actions

  • mode: respect | disable | force, as above.
  • ttl: NNN, seconds (only meaningful with force).
  • salt: <string>, cache-key salt for A/B or version invalidation (for the gateway’s semantic cache, v1.1).

Evaluation

  • First-match wins, sorted by priority descending.
  • Rules evaluated once per request, AT auth-resolve time (bundled with VK config), so zero additional upstream latency.
  • Rule hits exposed in observability via langwatch.cache.rule_id span attribute + gateway_cache_rule_hits_total{rule_id, mode, provider} counter.

Concrete example: “force 5-min cache on all prod internal-api calls”

{
  "priority": 100,
  "match": {
    "vk_tags": ["env=prod"],
    "request_metadata": { "X-Request-Source": "internal-api" }
  },
  "action": { "mode": "force", "ttl": 300 }
}
Every VK in prod serving an internal-api caller gets Anthropic cache_control: ephemeral injected on system + last-user blocks. OpenAI requests pass through (no-op). No client change required.

Enforcement ordering

The cache-override hook runs before blocked-pattern enforcement on /v1/messages. A disable-mode request has its cache_control blocks stripped first, then regex checks fire on the post-strip body. This keeps policy evaluation deterministic regardless of the caller’s caching choice. Order on the request pipeline:
  1. Auth resolution (VK → bundle)
  2. Rate limit (per-VK)
  3. Cache override mode resolution (header > rule > VK default)
  4. Cache override apply (strip, inject)
  5. Blocked patterns (tools, MCP, URLs, models)
  6. Body size cap
  7. Guardrails (pre)
  8. Upstream dispatch
  9. Guardrails (post)
  10. Budget debit

Why this is load-bearing

From a real observed pattern: a 40k-token system prompt with 100 calls a day against an Anthropic model.
  • Without caching: 40k × 100 = 4M input tokens/day at full price. Claude Opus: ~$60/day.
  • With Anthropic ephemeral cache (5-min TTL, 1 write + ~99 reads): 1 write at 75/M+99×40kreadsat75/M + 99 × 40k reads at 6/M ≈ $27/day. 56% cheaper.
  • With a gateway that reformats the payload and loses the cache: back to $60/day.
This is why the caching-passthrough invariant is tested at the byte level, not just “does the call succeed.”

See also