The hard invariant
Whencache.mode: respect (default):
The gateway MUST NOT strip, reorder, or rewrite any cache_control field or cached-prefix marker in the forwarded request.
This is asserted by integration tests that compare a request-hash roundtrip: we diff the payload the gateway forwards against the payload the client sent, and any non-trivial diff fails the test.
Override modes
Per-VK defaultcache.mode and per-request header X-LangWatch-Cache (highest precedence):
| Mode | Behaviour | Availability |
|---|---|---|
respect | Forward cache markers byte-identically. Provider handles caching. (default) | v1 |
disable | Gateway recursively JSON-walks the body and drops every cache_control object at any nesting depth, covers Anthropic messages[].content[].cache_control, system[].cache_control, and tools[].cache_control uniformly. Forces a cold call. Use for benchmarks, A/B cache-impact tests, or reproducing cache-dependent bugs. Gateway’s own semantic cache is also bypassed. | v1 |
force | Gateway injects cache_control: ephemeral on large stable prefixes (system message, tool definitions, first N messages) even if the client didn’t. | v1.1: currently returns 400 cache_override_not_implemented |
ttl=NNN | Same as force but with an explicit TTL hint. | v1.1: currently returns 400 cache_override_not_implemented |
X-LangWatch-Cache: respect, X-LangWatch-Cache: disable. (In v1.1: force and ttl=3600.)
Malformed/unknown mode → 400 cache_override_invalid with a reason in the envelope. Valid-but-unimplemented (force, ttl=N in v1) → 400 cache_override_not_implemented.
Enforcement ordering: the cache-override hook runs on
/v1/messages before blocked-pattern enforcement. A disable-mode request has its cache_control blocks stripped first, then the regex checks fire on the post-strip body. This keeps policy evaluation deterministic regardless of the caller’s caching choice.What’s reported back
Two response headers surface cache behaviour:X-LangWatch-Cache-Mode, echoes the mode that was applied to this request (respectordisable). Useful for tracing and A/B analysis: a request that setsX-LangWatch-Cache: disableshould always seeX-LangWatch-Cache-Mode: disableon the response, independent of whether the upstream provider even reported a cache outcome.X-LangWatch-Cache, the cache outcome as reported by the gateway:hit, provider reported serving from cache (large portion of input from cache).miss, no cache hit.bypass, mode wasdisable.force, mode wasforce(v1.1, deferred with 400 cache_override_not_implemented in v1).
cache_read_input_tokens, served from cache, priced low (~10% of regular).cache_creation_input_tokens, writing to cache, priced at 125% on Anthropic.input_tokens, regular cold tokens.
/internal/gateway/budget/debit call forwards all three so budget ledgers reflect accurate cache economics, not just “all input tokens at full price.”
When to override
Usedisable when:
- You’re benchmarking “cold cost” before deciding whether a caching tier is worth enabling, contrast a small sample of
respectcalls against an equal-sizedisablesample to measure real cache savings. - You’re reproducing a cache-dependent bug and need to confirm the failure exists on a cold call (removes a common source of non-determinism from repros).
- A specific request must not hit cache for correctness reasons (rare, typically the cache is key-aware enough not to return wrong answers).
force, for fronting a legacy codebase that doesn’t setcache_controlon stable prefixes (the gateway would insert them). Requires provider-specific body mutation that is scoped to v1.1.ttl=NNN, explicit TTL hint (e.g. “cache this prompt for 1 hour, no more”). Also v1.1.
X-LangWatch-Cache: force or X-LangWatch-Cache: ttl=3600 returns 400 cache_override_not_implemented with a pointer to this page.
Provider-specific notes
- Anthropic Messages API:
cache_control: {type: "ephemeral"}or{type: "persistent"}on content blocks. Gateway forwards as-is. 5-min, 1-hour TTL respectively. - OpenAI (GPT-4.1, o-series): no explicit cache_control field; OpenAI auto-caches prefixes ≥1024 tokens. Gateway simply forwards.
- Azure OpenAI: inherits OpenAI’s auto-caching. Same behaviour.
- Gemini (1.5, 2.5): context caches are a separate API (
caches.create). Gateway v1 does not orchestrate these on your behalf; the client calls them directly and the gateway forwards the cache reference in subsequent requests. - Bedrock: Anthropic-on-Bedrock supports prompt caching when using converse API. Gateway forwards cache_control blocks to Bedrock runtime as-is.
- Vertex AI: Anthropic-on-Vertex similar to Bedrock for prompt caching.
Why this is load-bearing
From a real observed pattern: a 40k-token system prompt with 100 calls a day.- Without caching: 40k × 100 = 4M input tokens/day at full price. Anthropic Opus: $60/day.
- With Anthropic ephemeral cache (5-min TTL, 1 write + ~99 reads): 1 write @ 6/M = 23.76 ≈ $27/day. 56% cheaper.
- With a gateway that reformats the payload and loses the cache: back to $60/day.