
Modern LLM providers offer prompt caching with different shapes: Anthropic uses explicit cache_control blocks, OpenAI auto-caches prompts over 1024 tokens, Gemini has an explicit cachedContent API, Azure OpenAI inherits OpenAI’s behavior, Bedrock Claude inherits Anthropic’s. The LangWatch AI Gateway surfaces a single, unified control plane over all of them. This is a load-bearing feature: prompt caching cuts input-token costs by up to 90% on hot prefixes, and a gateway that reformats or reorders the payload silently invalidates the cache. The gateway is engineered to preserve provider-native caching byte-for-byte by default, and to let operators compose overrides without breaking the primitives underneath.

The control surface — four layers (v1)

  1. Gateway default: respect (byte-for-byte passthrough). Always safe, never surprises.
  2. Per-VK default: respect / disable / force, configurable in the VK drawer under “Cache control”. ttl=NNN is deferred to v1.1 at the per-VK level.
  3. Cache rules: simple match/action overrides evaluated at auth-resolve time (bundled with VK config → near-zero hot-path cost). See Cache rules.
  4. Per-request header: X-LangWatch-Cache: respect|disable|force on an individual request. Highest precedence when set. The ttl=NNN header value is parsed but still deferred to v1.1 (returns 400 cache_override_not_implemented, with a cache_rule_mode_not_implemented WARN on rules).
Effective-mode precedence: per-request header > matched rule > per-VK default > gateway default respect.
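
For example, a minimal sketch of the per-request override from a TypeScript client; the gateway URL and virtual key are placeholders, and the response headers are described under Response headers below.

const res = await fetch("https://gateway.example.com/v1/chat/completions", {
  method: "POST",
  headers: {
    Authorization: "Bearer lw_vk_live_...",   // your virtual key
    "Content-Type": "application/json",
    "X-LangWatch-Cache": "disable",           // highest-precedence override
  },
  body: JSON.stringify({
    model: "gpt-4o-mini",
    messages: [{ role: "user", content: "Hello" }],
  }),
});

// The applied mode is echoed back regardless of the provider's cache outcome.
console.log(res.headers.get("X-LangWatch-Cache-Mode")); // expect "disable"
console.log(res.headers.get("X-LangWatch-Cache"));      // expect "bypass"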

Provider matrix

Not every mode makes sense for every provider. The gateway does the right thing per provider automatically — you don’t have to special-case in your client.
Anthropic (direct + Bedrock + Vertex)
  • Native scheme: explicit cache_control: {type: "ephemeral"|"persistent"} on content/system/tool blocks.
  • respect: bytes forwarded unchanged; the provider honours the markers (5 min / 1 h TTL).
  • disable: a recursive JSON-walk drops every cache_control key at any nesting depth before dispatch.
  • force (v1): gateway injects cache_control: {type: "ephemeral"} on system[-1] + messages[-1].content[-1] when the caller hasn't already set it (no-double-inject invariant).
  • Notes: cache_read_input_tokens and cache_creation_input_tokens are reported separately in the trace. Sonnet 4.5 / Opus 4.5 caching is GA; Haiku 4.5 caching is in beta and may return cache_read=0 even with correct markers on accounts not enrolled in the beta.

OpenAI (direct + Azure)
  • Native scheme: automatic; prefixes ≥1024 tokens are cached with no client flag, at a 50% discount on cached tokens.
  • respect: bytes forwarded unchanged (nothing to preserve; the provider decides).
  • disable: no-op on wire; the gateway can optionally set a salt via a request-id suffix to break the cache key (future).
  • force (v1): no-op on wire (caching is out of the client's control), but the rule still fires, attribution emits, and the counter bumps for per-rule observability.
  • Notes: usage.prompt_tokens_details.cached_tokens is reported in the response. Reasoning models (the gpt-5 family with reasoning_effort set) may not return cached_tokens even on identical prefixes (verified directly against api.openai.com on test accounts); use a non-reasoning model (gpt-4o-mini, gpt-4o, etc.) when prompt caching is load-bearing.

Gemini (Vertex + direct)
  • Native scheme: explicit cachedContent resource; the client POSTs to /v1beta/cachedContents, then passes cachedContent: <name> on generation. Implicit prefix-caching exists for paid-tier accounts only.
  • respect: bytes forwarded unchanged; the client holds the cache name.
  • disable: strips the cachedContent field from the request, forcing cold generation.
  • force (v1 shipped): gateway exposes POST /v1beta/cachedContents via the Gemini-native passthrough route and lifts the cached_content field from the /v1/chat/completions body onto Bifrost.ChatParameters.ExtraParams, so cross-shape callers can also reference an existing cache.
  • Notes: cachedContentTokenCount is reported in usageMetadata. Implicit prefix-cache (where Google caches identical >1024-token prefixes automatically) requires a paid-tier Google account; explicit cachedContents works on every tier.

Azure OpenAI
  • Native scheme: inherits OpenAI's automatic prefix cache.
  • respect / disable / force: same as OpenAI.
  • Notes: configured via Azure's cache-at-deployment toggle, independent of our force.

Bedrock Claude
  • Native scheme: inherits Anthropic's cache_control shape via the converse API.
  • respect / disable / force: same as Anthropic (force shipped in v1).
  • Notes: Bedrock region matters; not all regions support Anthropic cache yet. /v1/messages → bedrock returns Unexpected field type 400 (the Bifrost bedrock adapter is converse-API only); use /v1/chat/completions toward a Bedrock VK for the cache_control round-trip.

Vertex Claude
  • Native scheme: inherits Anthropic's shape.
  • respect / disable / force: same as Anthropic (force shipped in v1).
The common shape: respect is always safe; disable always strips cache-related fields (where applicable); force is provider-specific injection — shipped in v1 for Anthropic + OpenAI + Bedrock Claude + Vertex Claude + Gemini.

Modes

respect (default)

Gateway forwards every cache-related field (cache_control on Anthropic, cachedContent on Gemini, etc.) byte-for-byte. The gateway’s own semantic cache (if enabled, v1.1 spec) still runs on top — respect does not disable it. This mode is the hard invariant protected by the gateway’s byte-level integration test: a request with cache_control: ephemeral on a system block must arrive at the provider with identical bytes in identical order. The test diff-compares the forwarded payload against the client’s payload — any non-trivial drift fails the build.
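
A minimal sketch of a respect-mode call through the gateway, assuming an Anthropic-style /v1/messages body; the gateway URL, virtual key, model name, and LONG_SYSTEM_PROMPT are placeholders.

const LONG_SYSTEM_PROMPT = "...";   // the hot prefix you want cached

const res = await fetch("https://gateway.example.com/v1/messages", {
  method: "POST",
  headers: {
    Authorization: "Bearer lw_vk_live_...",
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    model: "claude-sonnet-4-5",
    max_tokens: 1024,
    system: [
      // forwarded to the provider byte-for-byte under respect
      { type: "text", text: LONG_SYSTEM_PROMPT, cache_control: { type: "ephemeral" } },
    ],
    messages: [{ role: "user", content: "Summarize the report." }],
  }),
});

const { usage } = await res.json();
// First call pays cache_creation_input_tokens; later calls report cache_read_input_tokens.
console.log(usage.cache_creation_input_tokens, usage.cache_read_input_tokens);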

disable

Strip every cache-related field at any nesting depth before dispatch:
  • Anthropic: recursive JSON-walk drops cache_control from messages[].content[], system[], and tools[].
  • Gemini: drops cachedContent from the top-level body.
  • OpenAI: no-op on wire (the provider’s automatic cache can’t be disabled client-side).
Use when:
  • Benchmarking cold-cost in A/B before deciding if a caching tier is worth turning on.
  • Reproducing a cache-dependent bug on a cold call.
  • A specific request must not hit cache for correctness.
The gateway’s semantic cache (if enabled, v1.1) is also bypassed when mode is disable.
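
An illustrative sketch of the strip described above (the gateway's own implementation is in Go); it drops cache_control wherever it appears in the JSON tree, which covers messages[].content[], system[], and tools[] for Anthropic-shaped bodies.

// Recursively remove every cache_control key at any nesting depth.
function stripCacheControl(node: unknown): unknown {
  if (Array.isArray(node)) return node.map(stripCacheControl);
  if (node !== null && typeof node === "object") {
    const out: Record<string, unknown> = {};
    for (const [key, value] of Object.entries(node as Record<string, unknown>)) {
      if (key === "cache_control") continue;   // drop the marker wherever it appears
      out[key] = stripCacheControl(value);
    }
    return out;
  }
  return node;   // primitives pass through untouched
}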

force

For providers that support client-side cache declarations:
  • Anthropic (direct + Bedrock Claude + Vertex Claude): v1 shipped. Gateway injects cache_control: {type: "ephemeral"} on system[-1] + messages[-1].content[-1] when the caller hasn’t already set it. Implemented via a recursive JSON walk that targets the last element of each block array; the no-double-inject invariant means client-set cache_control is preserved byte-for-byte (see the sketch after this list).
  • OpenAI / Azure: v1 shipped (as a no-op on wire). Their cache is automatic, so the request body is byte-identical to passthrough, but the rule still fires, the langwatch.cache.rule_id and mode_applied=FORCE span attributes are emitted, and the gateway_cache_rule_hits_total{mode_applied="FORCE"} counter bumps. Operators get per-rule observability even on providers where force is semantically a no-op.
  • Gemini / Vertex Gemini: v1 shipped. The gateway’s /v1beta/* passthrough route (see the Gemini CLI integration) accepts POST /v1beta/cachedContents for context-cache creation and forwards verbatim to generativelanguage.googleapis.com. Subsequent /v1beta/models/{model}:generateContent calls reference the cache by name via the cachedContent field. The same cached_content field is also lifted from /v1/chat/completions request bodies onto Bifrost.ChatParameters.ExtraParams so cross-shape OpenAI-SDK callers can reference an existing cache. Implicit prefix caching (auto-cache for >1024-token shared prefixes) requires a paid-tier Google account.
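
A minimal sketch of the Anthropic-shaped injection from the first bullet, assuming block-array content (the real gateway implements this in Go); injectEphemeral, Block, and MessagesBody are illustrative names.

type Block = { cache_control?: { type: string }; [key: string]: unknown };
interface MessagesBody {
  system?: Block[];
  messages?: { role: string; content: Block[] }[];
}

function injectEphemeral(body: MessagesBody): MessagesBody {
  const mark = (block?: Block) => {
    // no-double-inject: caller-set markers are left untouched
    if (block && block.cache_control === undefined) {
      block.cache_control = { type: "ephemeral" };
    }
  };
  mark(body.system?.at(-1));                    // last system block
  mark(body.messages?.at(-1)?.content.at(-1));  // last content block of last message
  return body;
}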

ttl=NNN

Same shape as force but carries an explicit TTL in seconds. Header-level parse succeeds but Apply is still ErrNotImplemented in v1 — operators opting into a rule-level ttl currently see the passthrough behaviour above with the same WARN. v1.1 work will wire per-provider TTL mapping:
  • Anthropic: chooses ephemeral (5 min) or persistent (1 h) based on the TTL range; anything above 3600 clamps to persistent.
  • Gemini: sets the ttl field on cachedContent (depends on the v1.1 Gemini force work above).
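
A sketch of how the planned Anthropic TTL mapping might look; the 300-second boundary between the tiers is an assumption (only the clamp above 3600 s is stated above).

function anthropicCacheType(ttlSeconds: number): "ephemeral" | "persistent" {
  // Assumed boundary: TTLs up to five minutes map to the 5-min ephemeral tier;
  // anything longer, including values above 3600 s (which clamp), maps to the 1 h persistent tier.
  return ttlSeconds <= 300 ? "ephemeral" : "persistent";
}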

Response headers

Every successful response carries:
  • X-LangWatch-Cache-Mode — the mode that was applied (respect / disable / force). Matches the internal Kind enum the Go gateway uses (services/gateway/internal/cacheoverride). Useful for tracing and A/B analysis: a client that sets X-LangWatch-Cache: disable should see X-LangWatch-Cache-Mode: disable on the response, independent of whether the provider reported a cache outcome.
  • X-LangWatch-Cache — the cache outcome:
    • hit — provider served part or all of the prefix from cache.
    • miss — no cache hit.
    • bypass — mode was disable.
    • force — mode was force (v1.1 — deferred with 400 cache_override_not_implemented in v1).
Usage reporting splits input tokens by origin:
  • cache_read_input_tokens — served from cache (Anthropic: ~10% of regular price).
  • cache_creation_input_tokens — writing to cache (Anthropic: 125% of regular).
  • input_tokens — regular cold tokens.
The /internal/gateway/budget/debit call forwards all three so budget ledgers reflect accurate cache economics.
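
To make the split concrete, a hedged sketch of how a ledger could price the three buckets using the ~10% read and 125% write multipliers above; inputCost and basePricePerMTok are illustrative names, not the debit API.

function inputCost(
  usage: { input_tokens: number; cache_read_input_tokens?: number; cache_creation_input_tokens?: number },
  basePricePerMTok: number,
): number {
  const cold = usage.input_tokens;
  const read = usage.cache_read_input_tokens ?? 0;
  const write = usage.cache_creation_input_tokens ?? 0;
  // cold tokens at 100%, cache reads at ~10%, cache writes at 125% of the base input price
  return ((cold + read * 0.1 + write * 1.25) * basePricePerMTok) / 1_000_000;
}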

Cache rules

Simple match/action rules that let operators set cache behavior for groups of requests without touching client code. Deliberately narrower than Cloudflare Page Rules — enough to demonstrate enterprise cache governance (force 5-min cache on prod internal-api, disable cache for benchmark VKs, etc.) without the matcher combinatorics that would slow the hot path. Latency budget: rules are resolved once per VK bundle refresh (every 30 s on /changes) and baked into the bundle as a pre-sorted array. At request time, the gateway walks the sorted array until a match fires — for the expected low single-digit rule count per VK, this is sub-microsecond. Hot-path total overhead stays at ~700 ns.

Shape

{
  "id": "rule_01HZX...",
  "priority": 100,
  "match": {
    "vk_tags": ["env=prod"],
    "model": "gpt-5-mini",
    "principal_id": null,
    "request_metadata": { "X-Request-Source": "internal-api" }
  },
  "action": {
    "mode": "force",
    "ttl": 300,
    "salt": "v2"
  },
  "enabled": true
}

Matchers

Each matcher is optional; all non-null matchers must match for the rule to apply (AND semantics).
  • vk_id — exact match on virtual-key id.
  • vk_tags — every tag in the list must be on the VK (e.g. ["env=prod", "team=ml"]).
  • vk_prefix — match on VK display prefix (lw_vk_live_01HZX9).
  • principal_id — the user/service account making the call.
  • model — exact match on the resolved model (post-alias).
  • request_metadata — match on arbitrary client-set metadata headers (X-Request-Source, X-Customer-Tier, etc.).
  • time_window — optional time-of-day / day-of-week constraint.

Actions

  • mode: respect | disable | force — as above.
  • ttl: NNN — seconds (only meaningful with force).
  • salt: <string> — cache-key salt for A/B or version invalidation (for the gateway’s semantic cache, v1.1).

Evaluation

  • First-match wins, sorted by priority descending.
  • Rules evaluated once per request, at auth-resolve time (bundled with VK config), so zero additional upstream latency.
  • Rule hits exposed in observability via langwatch.cache.rule_id span attribute + gateway_cache_rule_hits_total{rule_id, mode, provider} counter.
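
A simplified sketch of this evaluation (the real implementation lives in the Go gateway); it covers only the vk_tags, model, and request_metadata matchers, and all type and function names are illustrative.

interface RuleMatch { vk_tags?: string[]; model?: string; request_metadata?: Record<string, string>; }
interface CacheRule {
  id: string;
  priority: number;
  enabled: boolean;
  match: RuleMatch;
  action: { mode: "respect" | "disable" | "force"; ttl?: number; salt?: string };
}
interface RequestCtx { vkTags: string[]; model: string; metadata: Record<string, string>; }

function ruleMatches(m: RuleMatch, ctx: RequestCtx): boolean {
  // AND semantics: every non-null matcher must pass
  if (m.vk_tags && !m.vk_tags.every((t) => ctx.vkTags.includes(t))) return false;
  if (m.model && m.model !== ctx.model) return false;
  if (m.request_metadata &&
      !Object.entries(m.request_metadata).every(([k, v]) => ctx.metadata[k] === v)) return false;
  return true;
}

// rules arrive pre-sorted by priority descending in the VK bundle; first match wins
function resolveRule(rules: CacheRule[], ctx: RequestCtx): CacheRule | undefined {
  return rules.find((r) => r.enabled && ruleMatches(r.match, ctx));
}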

Concrete example — “force 5-min cache on all prod internal-api calls”

{
  "priority": 100,
  "match": {
    "vk_tags": ["env=prod"],
    "request_metadata": { "X-Request-Source": "internal-api" }
  },
  "action": { "mode": "force", "ttl": 300 }
}
Every VK in prod serving an internal-api caller gets Anthropic cache_control: ephemeral injected on system + last-user blocks. OpenAI requests pass through (no-op). No client change required.
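
A hedged sketch of a request this rule would match, assuming the virtual key carries the env=prod tag; the gateway URL, key, and model are placeholders. The caller only adds the metadata header; no cache-specific fields appear in the body.

const res = await fetch("https://gateway.example.com/v1/messages", {
  method: "POST",
  headers: {
    Authorization: "Bearer lw_vk_live_...",   // a VK tagged env=prod
    "Content-Type": "application/json",
    "X-Request-Source": "internal-api",       // matches the rule's request_metadata
  },
  body: JSON.stringify({
    model: "claude-sonnet-4-5",
    max_tokens: 512,
    system: [{ type: "text", text: "..." }],
    messages: [{ role: "user", content: "ping" }],
  }),
});
// The applied mode is echoed in the X-LangWatch-Cache-Mode response header.
console.log(res.headers.get("X-LangWatch-Cache-Mode"));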

Enforcement ordering

The cache-override hook runs before blocked-pattern enforcement on /v1/messages. A disable-mode request has its cache_control blocks stripped first, then regex checks fire on the post-strip body. This keeps policy evaluation deterministic regardless of the caller’s caching choice. Order on the request pipeline:
  1. Auth resolution (VK → bundle)
  2. Rate limit (per-VK)
  3. Cache override mode resolution (header > rule > VK default)
  4. Cache override apply (strip / inject)
  5. Blocked patterns (tools / MCP / URLs / models)
  6. Body size cap
  7. Guardrails (pre)
  8. Upstream dispatch
  9. Guardrails (post)
  10. Budget debit
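
A one-line sketch of the precedence resolution in step 3, matching the ordering given earlier; the function name is illustrative.

type Mode = "respect" | "disable" | "force";

// per-request header > matched rule > per-VK default > gateway default (respect)
function effectiveMode(header?: Mode, matchedRule?: Mode, vkDefault?: Mode): Mode {
  return header ?? matchedRule ?? vkDefault ?? "respect";
}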

Why this is load-bearing

From a real observed pattern: a 40k-token system prompt with 100 calls a day against an Anthropic model.
  • Without caching: 40k × 100 = 4M input tokens/day at full price. Claude Opus: ~$60/day.
  • With Anthropic ephemeral cache (5-min TTL, 1 write + ~99 reads): 1 write of 40k at the 125% cache-write rate (~$18.75/M) plus 99 × 40k reads at the ~10% cache-read rate (~$1.50/M) ≈ $7/day. Roughly 90% cheaper.
  • With a gateway that reformats the payload and loses the cache: back to $60/day.
This is why the caching-passthrough invariant is tested at the byte level, not just “does the call succeed.”

See also