Modern LLM providers offer prompt caching — Anthropic (ephemeral + persistent blocks), OpenAI (automatic prefix caching), Gemini (implicit context caches). The discounts are real: up to 90% off input tokens on cache hits for Anthropic, comparable for others. If an AI gateway reformats or reorders a request before forwarding, it invalidates the cache keys the provider uses internally and silently kills those discounts. The LangWatch AI Gateway is engineered to preserve caching byte-for-byte by default.

The hard invariant

When cache.mode: respect (default):
The gateway MUST NOT strip, reorder, or rewrite any cache_control field or cached-prefix marker in the forwarded request.
This is asserted by integration tests that perform a request-hash roundtrip: the payload the gateway forwards is diffed against the payload the client sent, and any non-trivial diff fails the test.
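A minimal sketch of the idea behind that check, with a placeholder standing in for the real forwarding path (the payload and helper below are illustrative, not the actual test suite):

```python
import hashlib
import json

def forward_through_gateway(body: bytes) -> bytes:
    # Placeholder for the real forwarding path; in respect mode the
    # gateway must hand these bytes to the provider untouched.
    return body

# Hypothetical client payload carrying an Anthropic-style cache marker.
client_body = json.dumps({
    "model": "claude-sonnet-4",  # placeholder model name
    "system": [{"type": "text", "text": "LONG_STABLE_PROMPT",
                "cache_control": {"type": "ephemeral"}}],
    "messages": [{"role": "user", "content": "Hello"}],
}).encode("utf-8")

forwarded_body = forward_through_gateway(client_body)

# Byte-level invariant: hashing the raw bytes catches any reordering,
# re-serialization, or stripped cache_control field.
assert hashlib.sha256(forwarded_body).hexdigest() == \
       hashlib.sha256(client_body).hexdigest()
```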

Override modes

The default cache.mode is set per VK and can be overridden per request with the X-LangWatch-Cache header (highest precedence). Available modes:
  • respect (default; v1): forward cache markers byte-identically; the provider handles caching.
  • disable (v1): the gateway recursively JSON-walks the body and drops every cache_control object at any nesting depth, covering Anthropic messages[].content[].cache_control, system[].cache_control, and tools[].cache_control uniformly. Forces a cold call. Use it for benchmarks, A/B cache-impact tests, or reproducing cache-dependent bugs. The gateway’s own semantic cache is also bypassed.
  • force (v1.1; currently returns 400 cache_override_not_implemented): the gateway injects cache_control: ephemeral on large stable prefixes (system message, tool definitions, first N messages) even if the client didn’t.
  • ttl=NNN (v1.1; currently returns 400 cache_override_not_implemented): same as force but with an explicit TTL hint.
Override per-request via header: X-LangWatch-Cache: respect, X-LangWatch-Cache: disable. (In v1.1: force and ttl=3600.) Malformed/unknown mode → 400 cache_override_invalid with a reason in the envelope. Valid-but-unimplemented (force / ttl=N in v1) → 400 cache_override_not_implemented.
Enforcement ordering: the cache-override hook runs on /v1/messages before blocked-pattern enforcement. A disable-mode request has its cache_control blocks stripped first, then the regex checks fire on the post-strip body. This keeps policy evaluation deterministic regardless of the caller’s caching choice.
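A minimal sketch of that strip, assuming a plain JSON body (an illustration of the behaviour, not the gateway's actual code):

```python
from typing import Any

def strip_cache_control(node: Any) -> Any:
    """Recursively remove every cache_control key, at any nesting depth."""
    if isinstance(node, dict):
        return {
            key: strip_cache_control(value)
            for key, value in node.items()
            if key != "cache_control"
        }
    if isinstance(node, list):
        return [strip_cache_control(item) for item in node]
    return node

# Covers system[].cache_control, messages[].content[].cache_control,
# and tools[].cache_control without provider-specific casing.
body = {
    "system": [{"type": "text", "text": "stable prompt",
                "cache_control": {"type": "ephemeral"}}],
    "messages": [{"role": "user", "content": [
        {"type": "text", "text": "hi", "cache_control": {"type": "ephemeral"}}]}],
}
assert "cache_control" not in str(strip_cache_control(body))
```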

What’s reported back

Two response headers surface cache behaviour:
  • X-LangWatch-Cache-Mode — echoes the mode that was applied to this request (respect or disable). Useful for tracing and A/B analysis: a request that sets X-LangWatch-Cache: disable should always see X-LangWatch-Cache-Mode: disable on the response, independent of whether the upstream provider even reported a cache outcome.
  • X-LangWatch-Cache — the cache outcome as reported by the gateway:
    • hit — provider reported serving from cache (large portion of input from cache).
    • miss — no cache hit.
    • bypass — mode was disable.
    • force — mode was force (v1.1 — deferred with 400 cache_override_not_implemented in v1).
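A hedged end-to-end illustration of both headers (the gateway URL, auth header, and model name are placeholders, not documented values):

```python
import requests

GATEWAY_URL = "https://gateway.example.com/v1/messages"  # placeholder endpoint

response = requests.post(
    GATEWAY_URL,
    headers={
        "x-api-key": "vk-...",                # placeholder virtual key / auth header
        "anthropic-version": "2023-06-01",
        "X-LangWatch-Cache": "disable",       # per-request override: force a cold call
    },
    json={
        "model": "claude-sonnet-4",           # placeholder model name
        "max_tokens": 128,
        "messages": [{"role": "user", "content": "Hello"}],
    },
)

# The applied mode echoes the override, independent of the provider outcome...
assert response.headers.get("X-LangWatch-Cache-Mode") == "disable"
# ...and the outcome header reports the bypass.
print(response.headers.get("X-LangWatch-Cache"))  # expected: bypass
```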
Usage reporting (in the response body and in the trace) separates input tokens by origin:
  • cache_read_input_tokens — served from cache, priced low (~10% of regular).
  • cache_creation_input_tokens — writing to cache, priced at 125% on Anthropic.
  • input_tokens — regular cold tokens.
The /internal/gateway/budget/debit call forwards all three so budget ledgers reflect accurate cache economics, not just “all input tokens at full price.”
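For example, a small pricing helper that applies per-origin rates to those three fields (the 125% and ~10% multipliers follow the Anthropic figures above; the base rate and function name are illustrative):

```python
def input_cost_usd(usage: dict, base_per_m: float,
                   write_multiplier: float = 1.25, read_multiplier: float = 0.10) -> float:
    """Price input tokens by origin: cold, cache write, cache read."""
    cold = usage.get("input_tokens", 0) * base_per_m
    write = usage.get("cache_creation_input_tokens", 0) * base_per_m * write_multiplier
    read = usage.get("cache_read_input_tokens", 0) * base_per_m * read_multiplier
    return (cold + write + read) / 1_000_000

# Example: a follow-up call whose 40k-token prefix was served from cache.
usage = {"input_tokens": 1_200, "cache_creation_input_tokens": 0,
         "cache_read_input_tokens": 40_000}
print(round(input_cost_usd(usage, base_per_m=15.0), 4))  # 0.078
```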

When to override

Use disable when:
  • You’re benchmarking “cold cost” before deciding whether a caching tier is worth enabling — contrast a small sample of respect calls against an equal-size disable sample to measure real cache savings (see the sketch after this list).
  • You’re reproducing a cache-dependent bug and need to confirm the failure exists on a cold call (removes a common source of non-determinism from repros).
  • A specific request must not hit cache for correctness reasons (rare — typically the cache is key-aware enough not to return wrong answers).
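A sketch of that comparison at the token level, using the usage fields described in the previous section (the sample usage blocks are made-up illustrations):

```python
def cached_fraction(usages: list[dict]) -> float:
    """Share of input tokens served from cache across a sample of responses."""
    read = sum(u.get("cache_read_input_tokens", 0) for u in usages)
    total = sum(u.get("input_tokens", 0)
                + u.get("cache_creation_input_tokens", 0)
                + u.get("cache_read_input_tokens", 0) for u in usages)
    return read / total if total else 0.0

# Made-up usage blocks standing in for two equal-size runs: one sent with
# X-LangWatch-Cache: respect (first call writes, the rest read), one with
# X-LangWatch-Cache: disable (every call cold).
respect_sample = ([{"input_tokens": 200, "cache_creation_input_tokens": 40_000}]
                  + [{"input_tokens": 200, "cache_read_input_tokens": 40_000}] * 19)
disable_sample = [{"input_tokens": 40_200}] * 20

print(f"respect: {cached_fraction(respect_sample):.0%} of input tokens from cache")
print(f"disable: {cached_fraction(disable_sample):.0%} (every call cold)")
```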
v1.1 roadmap:
  • force — for fronting a legacy codebase that doesn’t set cache_control on stable prefixes (the gateway would insert them). Requires provider-specific body mutation that is scoped to v1.1.
  • ttl=NNN — explicit TTL hint (e.g. “cache this prompt for 1 hour, no more”). Also v1.1.
In v1, sending X-LangWatch-Cache: force or X-LangWatch-Cache: ttl=3600 returns 400 cache_override_not_implemented with a pointer to this page.

Provider-specific notes

  • Anthropic Messages API — cache_control: {type: "ephemeral"} or {type: "persistent"} on content blocks (5-min / 1-hour TTL respectively). Gateway forwards them as-is; see the sketch after this list.
  • OpenAI (GPT-4.1, o-series) — no explicit cache_control field; OpenAI auto-caches prefixes ≥1024 tokens. Gateway simply forwards.
  • Azure OpenAI — inherits OpenAI’s auto-caching. Same behaviour.
  • Gemini (1.5 / 2.5) — context caches are a separate API (caches.create). Gateway v1 does not orchestrate these on your behalf; the client calls them directly and the gateway forwards the cache reference in subsequent requests.
  • Bedrock — Anthropic-on-Bedrock supports prompt caching when using the Converse API. Gateway forwards cache_control blocks to the Bedrock runtime as-is.
  • Vertex AI — Anthropic-on-Vertex similar to Bedrock for prompt caching.
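For the Anthropic case, a minimal request body carrying a cache marker, which a respect-mode gateway forwards unchanged (model name and prompt text are placeholders):

```python
import json

# Placeholder Anthropic Messages payload; in respect mode the gateway
# forwards it to the provider without touching the cache_control block.
body = {
    "model": "claude-sonnet-4",   # placeholder model name
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": "LONG_STABLE_SYSTEM_PROMPT",
            "cache_control": {"type": "ephemeral"},  # 5-minute-TTL cache marker
        }
    ],
    "messages": [{"role": "user", "content": "What changed since yesterday?"}],
}

print(json.dumps(body, indent=2))
```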

Why this is load-bearing

From a real observed pattern: a 40k-token system prompt with 100 calls a day.
  • Without caching: 40k × 100 = 4M input tokens/day at full price. Anthropic Opus: $60/day.
  • With Anthropic ephemeral cache (5-min TTL, 1 write + ~99 reads): 1 write of 40k @ $18.75/M (125% of base) + 99 × 40k reads @ $1.50/M (~10% of base) ≈ $0.75 + $5.94 ≈ $6.70/day. Roughly 89% cheaper.
  • With a gateway that reformats the payload and loses the cache: back to $60/day.
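Recomputing that arithmetic with the rates above (Opus input at $15/M and the 125% / ~10% cache multipliers):

```python
BASE = 15.00          # Opus input rate, $ per million tokens
WRITE = BASE * 1.25   # cache write (125% of base)
READ = BASE * 0.10    # cache read (~10% of base)

tokens = 40_000       # stable system prompt
calls = 100           # calls per day

cold_day = tokens * calls * BASE / 1e6                             # ≈ $60.00
cached_day = (tokens * WRITE + tokens * (calls - 1) * READ) / 1e6  # ≈ $6.69

print(f"cold: ${cold_day:.2f}/day, cached: ${cached_day:.2f}/day, "
      f"savings: {1 - cached_day / cold_day:.0%}")
```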
This is why the caching-passthrough invariant is tested at the byte level, not just “does the call succeed.”