Modern LLM providers offer prompt caching: Anthropic (ephemeral and persistent blocks), OpenAI (automatic prefix caching), Gemini (implicit context caches). The discounts are real: up to 90% off input tokens on cache hits for Anthropic, and comparable for the others. If an AI gateway reformats or reorders a request before forwarding it, it invalidates the cache keys the provider uses internally and silently kills those discounts. The LangWatch AI Gateway is engineered to preserve caching byte-for-byte by default.
## The hard invariant
When `cache.mode: respect` (default):
The gateway MUST NOT strip, reorder, or rewrite any cache_control field or cached-prefix marker in the forwarded request.
This is asserted by integration tests that compare a request-hash roundtrip: we diff the payload the gateway forwards against the payload the client sent, and any non-trivial diff fails the test.
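As a sketch of what that roundtrip check looks like (assuming the test harness can capture the exact bytes the gateway forwards; `client_body` and `forwarded_body` are hypothetical fixtures):

```python
# Minimal sketch of the respect-mode invariant: the forwarded payload must be
# byte-identical to what the client sent, so a content-hash comparison is enough.
import hashlib


def assert_request_preserved(client_body: bytes, forwarded_body: bytes) -> None:
    client_hash = hashlib.sha256(client_body).hexdigest()
    forwarded_hash = hashlib.sha256(forwarded_body).hexdigest()
    # Any drift (stripped cache_control, reordered blocks, re-serialised JSON)
    # changes the hash and fails the test.
    assert client_hash == forwarded_hash, "gateway mutated the request body"
```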
## Override modes
The mode can be set as a per-VK default (`cache.mode`) or per request via the `X-LangWatch-Cache` header (highest precedence):
| Mode | Behaviour | Availability |
|---|---|---|
| `respect` | Forward cache markers byte-identically; the provider handles caching. (Default.) | v1 |
| `disable` | Gateway recursively JSON-walks the body and drops every `cache_control` object at any nesting depth, covering Anthropic `messages[].content[].cache_control`, `system[].cache_control`, and `tools[].cache_control` uniformly. Forces a cold call. Use for benchmarks, A/B cache-impact tests, or reproducing cache-dependent bugs. The gateway's own semantic cache is also bypassed. | v1 |
| `force` | Gateway injects `cache_control: ephemeral` on large stable prefixes (system message, tool definitions, first N messages) even if the client didn't. | v1.1; currently returns 400 `cache_override_not_implemented` |
| `ttl=NNN` | Same as `force`, but with an explicit TTL hint. | v1.1; currently returns 400 `cache_override_not_implemented` |
Accepted header values: `X-LangWatch-Cache: respect` and `X-LangWatch-Cache: disable`. (In v1.1: `force` and `ttl=3600`.)
A malformed or unknown mode → 400 `cache_override_invalid` with a reason in the error envelope. A valid-but-unimplemented mode (`force` / `ttl=N` in v1) → 400 `cache_override_not_implemented`.
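A client-side sketch of the per-request override, assuming an Anthropic-style `/v1/messages` endpoint behind the gateway; the gateway URL, auth header, and model name are placeholders:

```python
import os
import requests  # any HTTP client works; requests is used for brevity

resp = requests.post(
    "https://gateway.example.com/v1/messages",              # placeholder gateway URL
    headers={
        "x-api-key": os.environ["LANGWATCH_VIRTUAL_KEY"],   # placeholder VK auth header
        "anthropic-version": "2023-06-01",
        "X-LangWatch-Cache": "disable",                     # per-request override, highest precedence
    },
    json={
        "model": "claude-sonnet-4-20250514",                # placeholder model
        "max_tokens": 256,
        "messages": [{"role": "user", "content": "ping"}],
    },
    timeout=60,
)

if resp.status_code == 400:
    # Unknown mode -> cache_override_invalid; force / ttl=N in v1 -> cache_override_not_implemented.
    print(resp.json())
else:
    print(resp.headers.get("X-LangWatch-Cache-Mode"))       # should echo "disable"
```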
Enforcement ordering: the cache-override hook runs on `/v1/messages` before blocked-pattern enforcement. A disable-mode request has its `cache_control` blocks stripped first, then the regex checks fire on the post-strip body. This keeps policy evaluation deterministic regardless of the caller's caching choice.
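A sketch of the disable-mode transform described above (illustrative only, not the gateway's implementation):

```python
# Sketch of the disable-mode strip: walk the JSON body recursively and drop
# every cache_control key at any depth (message content blocks, system blocks,
# and tool definitions alike), forcing a cold call upstream.
from typing import Any


def strip_cache_control(node: Any) -> Any:
    if isinstance(node, dict):
        return {k: strip_cache_control(v) for k, v in node.items() if k != "cache_control"}
    if isinstance(node, list):
        return [strip_cache_control(item) for item in node]
    return node


# Blocked-pattern (regex) enforcement then runs on the post-strip body, so
# policy evaluation is identical whether or not the caller asked for caching.
```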
## What's reported back

Two response headers surface cache behaviour:

- `X-LangWatch-Cache-Mode`: echoes the mode that was applied to this request (`respect` or `disable`). Useful for tracing and A/B analysis: a request that sets `X-LangWatch-Cache: disable` should always see `X-LangWatch-Cache-Mode: disable` on the response, independent of whether the upstream provider even reported a cache outcome.
- `X-LangWatch-Cache`: the cache outcome as reported by the gateway:
  - `hit`: provider reported serving from cache (a large portion of input came from cache).
  - `miss`: no cache hit.
  - `bypass`: mode was `disable`.
  - `force`: mode was `force` (v1.1; deferred with 400 `cache_override_not_implemented` in v1).
Token usage comes back split three ways:

- `cache_read_input_tokens`: served from cache, priced low (~10% of the regular input rate).
- `cache_creation_input_tokens`: written to cache, priced at 125% on Anthropic.
- `input_tokens`: regular cold tokens.
The `/internal/gateway/budget/debit` call forwards all three, so budget ledgers reflect accurate cache economics, not just "all input tokens at full price."
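A sketch of how those three counters translate into effective input cost, using the multipliers quoted above; the per-million rate and the hypothetical `usage` dict are placeholders:

```python
def effective_input_cost(usage: dict, input_rate_per_mtok: float) -> float:
    regular = usage.get("input_tokens", 0)
    cache_reads = usage.get("cache_read_input_tokens", 0)        # ~10% of the regular rate
    cache_writes = usage.get("cache_creation_input_tokens", 0)   # 125% on Anthropic
    return (
        regular * input_rate_per_mtok
        + cache_reads * input_rate_per_mtok * 0.10
        + cache_writes * input_rate_per_mtok * 1.25
    ) / 1_000_000


# Example with the Anthropic Opus rate used later on this page ($15 / Mtok input):
usage = {"input_tokens": 200, "cache_read_input_tokens": 40_000, "cache_creation_input_tokens": 0}
print(f"${effective_input_cost(usage, 15.0):.4f}")  # -> $0.0630 for this call
```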
## When to override
Use `disable` when:
- You're benchmarking "cold cost" before deciding whether a caching tier is worth enabling: contrast a small sample of `respect` calls against an equal-size `disable` sample to measure real cache savings (see the sketch after this list).
- You're reproducing a cache-dependent bug and need to confirm the failure exists on a cold call (this removes a common source of non-determinism from repros).
- A specific request must not hit cache for correctness reasons (rare — typically the cache is key-aware enough not to return wrong answers).
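A sketch of that respect-vs-disable comparison; `send_probe` is a hypothetical helper that sends one fixed request through the gateway with the given `X-LangWatch-Cache` header and returns its computed cost:

```python
def measure_cache_savings(send_probe, n: int = 20) -> float:
    # Identical prompts; only the cache header differs between the two samples.
    cold_cost = sum(send_probe(cache_header="disable")["cost_usd"] for _ in range(n))
    warm_cost = sum(send_probe(cache_header="respect")["cost_usd"] for _ in range(n))
    return 1.0 - warm_cost / cold_cost  # fraction of spend saved by leaving caching on
```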
- `force`: for fronting a legacy codebase that doesn't set `cache_control` on stable prefixes (the gateway would insert them). Requires provider-specific body mutation that is scoped to v1.1.
- `ttl=NNN`: an explicit TTL hint (e.g. "cache this prompt for 1 hour, no more"). Also v1.1.
In v1, `X-LangWatch-Cache: force` or `X-LangWatch-Cache: ttl=3600` returns 400 `cache_override_not_implemented` with a pointer to this page.
## Provider-specific notes
- Anthropic Messages API: `cache_control: {type: "ephemeral"}` or `{type: "persistent"}` on content blocks. The gateway forwards them as-is. 5-minute / 1-hour TTL respectively.
- OpenAI (GPT-4.1, o-series): no explicit cache_control field; OpenAI auto-caches prefixes ≥1024 tokens. The gateway simply forwards.
- Azure OpenAI: inherits OpenAI's auto-caching. Same behaviour.
- Gemini (1.5 / 2.5): context caches are a separate API (`caches.create`). Gateway v1 does not orchestrate these on your behalf; the client calls them directly and the gateway forwards the cache reference in subsequent requests.
- Bedrock: Anthropic-on-Bedrock supports prompt caching when using the Converse API. The gateway forwards cache_control blocks to the Bedrock runtime as-is.
- Vertex AI: Anthropic-on-Vertex behaves like Bedrock for prompt caching.
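For the Anthropic case, an example request body with a cached system prefix that the gateway forwards untouched under `respect` mode (model name and prompt text are placeholders):

```python
# Anthropic Messages body with a cache_control marker on the stable system prefix.
payload = {
    "model": "claude-sonnet-4-20250514",
    "max_tokens": 512,
    "system": [
        {
            "type": "text",
            "text": "<40k-token stable system prompt>",
            "cache_control": {"type": "ephemeral"},   # 5-minute TTL prefix cache
        }
    ],
    "messages": [{"role": "user", "content": "What changed in the latest release?"}],
}
```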
## Why this is load-bearing
From a real observed pattern: a 40k-token system prompt with 100 calls a day.

- Without caching: 40k × 100 = 4M input tokens/day at full price. Anthropic Opus: $60/day.
- With Anthropic ephemeral cache (5-min TTL, 1 write + ~99 reads): 1 write at 125% of the input rate ≈ $0.75, plus ~3.96M cached read tokens at ~10% of the input rate ≈ $5.94, for roughly $6.70/day. About 89% cheaper.
- With a gateway that reformats the payload and loses the cache: back to $60/day.
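The arithmetic behind that example, using the rates stated on this page ($15/Mtok Opus input, cache reads at ~10%, cache writes at 125%):

```python
# Daily cost of a 40k-token stable prefix called 100 times, cached vs. uncached.
RATE = 15.0                        # $ per million input tokens (Anthropic Opus)
calls, prompt = 100, 40_000        # calls per day, tokens in the stable prefix

uncached = calls * prompt * RATE / 1e6                      # all tokens at full price
cached = (prompt * RATE * 1.25                              # 1 cache write at 125%
          + (calls - 1) * prompt * RATE * 0.10) / 1e6       # ~99 cache reads at ~10%
print(uncached, round(cached, 2), f"{1 - cached / uncached:.0%} cheaper")
# 60.0 6.69 89% cheaper
```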