Modern LLM providers offer prompt caching — Anthropic (ephemeral + persistent blocks), OpenAI (automatic prefix caching), Gemini (implicit context caches). The discounts are real: up to 90% off input tokens on cache hits for Anthropic, comparable for others. If an AI gateway reformats or reorders a request before forwarding, it invalidates the cache keys the provider uses internally and silently kills those discounts. The LangWatch AI Gateway is engineered to preserve caching byte-for-byte by default.

The hard invariant

When cache.mode: respect (default):
The gateway MUST NOT strip, reorder, or rewrite any cache_control field or cached-prefix marker in the forwarded request.
This is asserted by integration tests that perform a request-hash roundtrip: the payload the gateway forwards is diffed against the payload the client sent, and any non-trivial diff fails the test.
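A minimal sketch of the idea behind that check, with a placeholder standing in for the real forwarding path (the payload and helper below are illustrative, not the actual test suite):

```python
import hashlib
import json

def forward_through_gateway(body: bytes) -> bytes:
    # Placeholder for the real forwarding path; in respect mode the
    # gateway must hand these bytes to the provider untouched.
    return body

# Hypothetical client payload carrying an Anthropic-style cache marker.
client_body = json.dumps({
    "model": "claude-sonnet-4",  # placeholder model name
    "system": [{"type": "text", "text": "LONG_STABLE_PROMPT",
                "cache_control": {"type": "ephemeral"}}],
    "messages": [{"role": "user", "content": "Hello"}],
}).encode("utf-8")

forwarded_body = forward_through_gateway(client_body)

# Byte-level invariant: hashing the raw bytes catches any reordering,
# re-serialization, or stripped cache_control field.
assert hashlib.sha256(forwarded_body).hexdigest() == \
       hashlib.sha256(client_body).hexdigest()
```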

Override modes

The default cache.mode is set per VK and can be overridden per request with the X-LangWatch-Cache header (highest precedence). Available modes:
  • respect (default; v1): forward cache markers byte-identically; the provider handles caching.
  • disable (v1): the gateway recursively JSON-walks the body and drops every cache_control object at any nesting depth, covering Anthropic messages[].content[].cache_control, system[].cache_control, and tools[].cache_control uniformly. Forces a cold call. Use it for benchmarks, A/B cache-impact tests, or reproducing cache-dependent bugs. The gateway’s own semantic cache is also bypassed.
  • force (v1.1; currently returns 400 cache_override_not_implemented): the gateway injects cache_control: ephemeral on large stable prefixes (system message, tool definitions, first N messages) even if the client didn’t.
  • ttl=NNN (v1.1; currently returns 400 cache_override_not_implemented): same as force but with an explicit TTL hint.
Override per-request via header: X-LangWatch-Cache: respect, X-LangWatch-Cache: disable. (In v1.1: force and ttl=3600.) Malformed/unknown mode → 400 cache_override_invalid with a reason in the envelope. Valid-but-unimplemented (force / ttl=N in v1) → 400 cache_override_not_implemented.
Enforcement ordering: the cache-override hook runs on /v1/messages before blocked-pattern enforcement. A disable-mode request has its cache_control blocks stripped first, then the regex checks fire on the post-strip body. This keeps policy evaluation deterministic regardless of the caller’s caching choice.
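A minimal sketch of that strip, assuming a plain JSON body (an illustration of the behaviour, not the gateway's actual code):

```python
from typing import Any

def strip_cache_control(node: Any) -> Any:
    """Recursively remove every cache_control key, at any nesting depth."""
    if isinstance(node, dict):
        return {
            key: strip_cache_control(value)
            for key, value in node.items()
            if key != "cache_control"
        }
    if isinstance(node, list):
        return [strip_cache_control(item) for item in node]
    return node

# Covers system[].cache_control, messages[].content[].cache_control,
# and tools[].cache_control without provider-specific casing.
body = {
    "system": [{"type": "text", "text": "stable prompt",
                "cache_control": {"type": "ephemeral"}}],
    "messages": [{"role": "user", "content": [
        {"type": "text", "text": "hi", "cache_control": {"type": "ephemeral"}}]}],
}
assert "cache_control" not in str(strip_cache_control(body))
```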

What’s reported back

Two response headers surface cache behaviour:
  • X-LangWatch-Cache-Mode — echoes the mode that was applied to this request (respect or disable). Useful for tracing and A/B analysis: a request that sets X-LangWatch-Cache: disable should always see X-LangWatch-Cache-Mode: disable on the response, independent of whether the upstream provider even reported a cache outcome.
  • X-LangWatch-Cache — the cache outcome as reported by the gateway:
    • hit — provider reported serving from cache (large portion of input from cache).
    • miss — no cache hit.
    • bypass — mode was disable.
    • force — mode was force (v1.1 — deferred with 400 cache_override_not_implemented in v1).
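A hedged end-to-end illustration of both headers (the gateway URL, auth header, and model name are placeholders, not documented values):

```python
import requests

GATEWAY_URL = "https://gateway.example.com/v1/messages"  # placeholder endpoint

response = requests.post(
    GATEWAY_URL,
    headers={
        "x-api-key": "vk-...",                # placeholder virtual key / auth header
        "anthropic-version": "2023-06-01",
        "X-LangWatch-Cache": "disable",       # per-request override: force a cold call
    },
    json={
        "model": "claude-sonnet-4",           # placeholder model name
        "max_tokens": 128,
        "messages": [{"role": "user", "content": "Hello"}],
    },
)

# The applied mode echoes the override, independent of the provider outcome...
assert response.headers.get("X-LangWatch-Cache-Mode") == "disable"
# ...and the outcome header reports the bypass.
print(response.headers.get("X-LangWatch-Cache"))  # expected: bypass
```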
Usage reporting (in the response body and in the trace) separates input tokens by origin:
  • cache_read_input_tokens — served from cache, priced low (~10% of regular).
  • cache_creation_input_tokens — writing to cache, priced at 125% on Anthropic.
  • input_tokens — regular cold tokens.
The /internal/gateway/budget/debit call forwards all three so budget ledgers reflect accurate cache economics, not just “all input tokens at full price.”
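For example, a small pricing helper that applies per-origin rates to those three fields (the 125% and ~10% multipliers follow the Anthropic figures above; the base rate and function name are illustrative):

```python
def input_cost_usd(usage: dict, base_per_m: float,
                   write_multiplier: float = 1.25, read_multiplier: float = 0.10) -> float:
    """Price input tokens by origin: cold, cache write, cache read."""
    cold = usage.get("input_tokens", 0) * base_per_m
    write = usage.get("cache_creation_input_tokens", 0) * base_per_m * write_multiplier
    read = usage.get("cache_read_input_tokens", 0) * base_per_m * read_multiplier
    return (cold + write + read) / 1_000_000

# Example: a follow-up call whose 40k-token prefix was served from cache.
usage = {"input_tokens": 1_200, "cache_creation_input_tokens": 0,
         "cache_read_input_tokens": 40_000}
print(round(input_cost_usd(usage, base_per_m=15.0), 4))  # 0.078
```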

When to override

Use disable when:
  • You’re benchmarking “cold cost” before deciding whether a caching tier is worth enabling — contrast a small sample of respect calls against an equal-size disable sample to measure real cache savings (see the sketch after this list).
  • You’re reproducing a cache-dependent bug and need to confirm the failure exists on a cold call (removes a common source of non-determinism from repros).
  • A specific request must not hit cache for correctness reasons (rare — typically the cache is key-aware enough not to return wrong answers).
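A sketch of that comparison at the token level, using the usage fields described in the previous section (the sample usage blocks are made-up illustrations):

```python
def cached_fraction(usages: list[dict]) -> float:
    """Share of input tokens served from cache across a sample of responses."""
    read = sum(u.get("cache_read_input_tokens", 0) for u in usages)
    total = sum(u.get("input_tokens", 0)
                + u.get("cache_creation_input_tokens", 0)
                + u.get("cache_read_input_tokens", 0) for u in usages)
    return read / total if total else 0.0

# Made-up usage blocks standing in for two equal-size runs: one sent with
# X-LangWatch-Cache: respect (first call writes, the rest read), one with
# X-LangWatch-Cache: disable (every call cold).
respect_sample = ([{"input_tokens": 200, "cache_creation_input_tokens": 40_000}]
                  + [{"input_tokens": 200, "cache_read_input_tokens": 40_000}] * 19)
disable_sample = [{"input_tokens": 40_200}] * 20

print(f"respect: {cached_fraction(respect_sample):.0%} of input tokens from cache")
print(f"disable: {cached_fraction(disable_sample):.0%} (every call cold)")
```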
v1.1 roadmap:
  • force — for fronting a legacy codebase that doesn’t set cache_control on stable prefixes (the gateway would insert them). Requires provider-specific body mutation that is scoped to v1.1.
  • ttl=NNN — explicit TTL hint (e.g. “cache this prompt for 1 hour, no more”). Also v1.1.
In v1, sending X-LangWatch-Cache: force or X-LangWatch-Cache: ttl=3600 returns 400 cache_override_not_implemented with a pointer to this page.

Provider-specific notes

  • Anthropic Messages API — cache_control: {type: "ephemeral"} or {type: "persistent"} on content blocks (5-min / 1-hour TTL respectively). Gateway forwards them as-is; see the sketch after this list.
  • OpenAI (GPT-4.1, o-series) — no explicit cache_control field; OpenAI auto-caches prefixes ≥1024 tokens. Gateway simply forwards.
  • Azure OpenAI — inherits OpenAI’s auto-caching. Same behaviour.
  • Gemini (1.5 / 2.5) — context caches are a separate API (caches.create). Gateway v1 does not orchestrate these on your behalf; the client calls them directly and the gateway forwards the cache reference in subsequent requests.
  • Bedrock — Anthropic-on-Bedrock supports prompt caching when using the Converse API. Gateway forwards cache_control blocks to the Bedrock runtime as-is.
  • Vertex AI — Anthropic-on-Vertex similar to Bedrock for prompt caching.
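For the Anthropic case, a minimal request body carrying a cache marker, which a respect-mode gateway forwards unchanged (model name and prompt text are placeholders):

```python
import json

# Placeholder Anthropic Messages payload; in respect mode the gateway
# forwards it to the provider without touching the cache_control block.
body = {
    "model": "claude-sonnet-4",   # placeholder model name
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": "LONG_STABLE_SYSTEM_PROMPT",
            "cache_control": {"type": "ephemeral"},  # 5-minute-TTL cache marker
        }
    ],
    "messages": [{"role": "user", "content": "What changed since yesterday?"}],
}

print(json.dumps(body, indent=2))
```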

Why this is load-bearing

From a real observed pattern: a 40k-token system prompt with 100 calls a day.
  • Without caching: 40k × 100 = 4M input tokens/day at full price. Anthropic Opus: $60/day.
  • With Anthropic ephemeral cache (5-min TTL, 1 write + ~99 reads): 1 write of 40k @ $18.75/M (125% of base) + 99 × 40k reads @ $1.50/M (~10% of base) ≈ $0.75 + $5.94 ≈ $6.70/day. Roughly 89% cheaper.
  • With a gateway that reformats the payload and loses the cache: back to $60/day.
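Recomputing that arithmetic with the rates above (Opus input at $15/M and the 125% / ~10% cache multipliers):

```python
BASE = 15.00          # Opus input rate, $ per million tokens
WRITE = BASE * 1.25   # cache write (125% of base)
READ = BASE * 0.10    # cache read (~10% of base)

tokens = 40_000       # stable system prompt
calls = 100           # calls per day

cold_day = tokens * calls * BASE / 1e6                             # ≈ $60.00
cached_day = (tokens * WRITE + tokens * (calls - 1) * READ) / 1e6  # ≈ $6.69

print(f"cold: ${cold_day:.2f}/day, cached: ${cached_day:.2f}/day, "
      f"savings: {1 - cached_day / cold_day:.0%}")
```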
This is why the caching-passthrough invariant is tested at the byte level, not just “does the call succeed.”