When the gateway is slow, stuck, or chewing memory — and the LangWatch trace for a specific bad request has already been checked — this runbook is the next stop. Every recipe here relies on the pprof admin listener, which is bound to 127.0.0.1:6060 by default and therefore never reachable from outside the pod.
Don’t expose the admin port without a token. GATEWAY_ADMIN_ADDR binds to loopback by default. If you genuinely need direct (non-port-forward) access — e.g. non-k8s deploys, or from a corporate VPN — you MUST also set GATEWAY_ADMIN_AUTH_TOKEN. The gateway refuses to start otherwise. See Helm → Admin listener for the three deployment postures. kubectl port-forward remains the simplest option for k8s — it tunnels through the API server and is auditable in Kubernetes audit logs.
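For the direct-access posture on a non-k8s deploy, a minimal sketch of what the pairing looks like (the hostname is a placeholder and the exact Authorization scheme should be checked against Helm → Admin listener; only the two env var names come from this page):
# Bind the admin listener beyond loopback ONLY together with a token
export GATEWAY_ADMIN_ADDR="0.0.0.0:6060"
export GATEWAY_ADMIN_AUTH_TOKEN="$(openssl rand -hex 32)"

# Every pprof request then has to present the token, e.g.
curl -s -H "Authorization: Bearer $GATEWAY_ADMIN_AUTH_TOKEN" \
  "http://gateway.internal.example.com:6060/debug/pprof/" | head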

Prerequisites

One-time setup on your operator laptop:
# Go toolchain with pprof (matches gateway's Go version)
go version  # expect 1.26+

# Pick one pod to focus on
POD=$(kubectl get pod -n langwatch -l app=langwatch-gateway -o jsonpath='{.items[0].metadata.name}')

# Open the tunnel in one terminal and leave it running
kubectl port-forward -n langwatch "$POD" 6060:6060
All subsequent go tool pprof commands target http://localhost:6060.
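Before reaching for a profile, a cheap sanity check that the tunnel actually reaches the admin listener is to fetch the pprof index:
# Should print the pprof index page; silence means the tunnel or admin listener is down
curl -sf http://localhost:6060/debug/pprof/ | head -n 5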

Enabling pprof via Helm

The LangWatch Helm chart exposes the admin listener via a top-level admin stanza on the gateway sub-chart. The loopback-bound default is what you want in production — if you widen it to 0.0.0.0, also set admin.existingAuthSecretName so the built-in bearer-token guard protects pprof. The gateway refuses to start in the bind-non-loopback-without-token configuration.
gateway:
  admin:
    addr: "127.0.0.1:6060"   # default — reachable only via kubectl port-forward
    # addr: ""               # disable pprof entirely (advised for compliance-regulated envs)
Shipped in current chart versions. Older chart versions pre-date the field — upgrade the chart before troubleshooting, or set the env var directly via gateway.env.GATEWAY_ADMIN_ADDR.
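For an older chart, the env-var route looks roughly like this (release and repo names are placeholders; gateway.env as a plain key/value map is taken from the note above, so verify it against your chart's values schema):
helm upgrade langwatch langwatch/langwatch -n langwatch --reuse-values \
  --set gateway.env.GATEWAY_ADMIN_ADDR="127.0.0.1:6060"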

Recipe 1 — p99 latency spike

Symptom: gateway_http_request_duration_seconds{quantile="0.99"} jumps from ~300 ms to several seconds. Traces show no single slow upstream. Diagnose:
# 30-second CPU profile while the spike is live
go tool pprof "http://localhost:6060/debug/pprof/profile?seconds=30"

# Inside pprof:
(pprof) top20
(pprof) web     # renders the profile as an SVG call graph in your browser
What to look for:
  • A single function using > 50% of CPU that’s not one of: tls.conn.Handshake, net/http.(*conn).serve, json.Decoder.Decode. Those are expected under load.
  • Lock contention on internal/auth or internal/fallback — hot in sync.(*Mutex).Lock. Usually means the L1 cache is evicting faster than it’s filling; consider raising LW_GATEWAY_AUTH_CACHE_L1_SIZE.
  • RE2 compilation in internal/blocked on every request — means the bundle isn’t caching compiled regexes. Check for frequent /changes churn (revision bumps on every request ≠ normal).
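To confirm the contention and regex-compilation hypotheses above without reading the whole graph, pprof's -focus filter narrows the same 30-second profile to the suspect call paths; the regexes below are illustrative, not the only spellings that work:
# Share of CPU spent contending on mutexes
go tool pprof -top -focus 'sync\..*Mutex.*Lock' \
  "http://localhost:6060/debug/pprof/profile?seconds=30"

# Share of CPU spent compiling regexes
go tool pprof -top -focus 'regexp\.(Compile|MustCompile)' \
  "http://localhost:6060/debug/pprof/profile?seconds=30"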

Recipe 2 — Goroutine leak

Symptom: go_goroutines climbs monotonically over hours, never GCs. Memory follows. Diagnose:
# Live goroutine dump as SVG
go tool pprof -svg http://localhost:6060/debug/pprof/goroutine > /tmp/goroutines.svg
open /tmp/goroutines.svg

# Or aggregate counts by creation site
curl -s "http://localhost:6060/debug/pprof/goroutine?debug=1" | head -100
What to look for:
  • Hundreds of goroutines parked in chan receive inside internal/fallback.Walk — means a fallback attempt is hanging on a context that never cancels. Check LW_GATEWAY_UPSTREAM_TIMEOUT_MS.
  • Goroutines stuck in internal/guardrails.CheckChunk — likely a guardrail evaluator that never returns and exceeds the 50 ms budget. Check the evaluator service logs.
  • Streaming goroutines (internal/dispatch.streamSSE) that outlive their request context — client disconnect without ctx.Done() firing usually points at a missing close somewhere.
Fix options: restart the offending pod (workaround), or file a bug with the /tmp/goroutines.svg attached.
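When the leak is slow, a before/after diff of two goroutine snapshots pins down the creation site that keeps accumulating. A sketch (the 10-minute gap is arbitrary; size it to how fast go_goroutines climbs):
# Snapshot now, and again once the count has visibly grown
curl -so /tmp/goroutines-1.pb.gz http://localhost:6060/debug/pprof/goroutine
sleep 600
curl -so /tmp/goroutines-2.pb.gz http://localhost:6060/debug/pprof/goroutine

# Show only the delta; the leaking creation site dominates the top entries
go tool pprof -top -base /tmp/goroutines-1.pb.gz /tmp/goroutines-2.pb.gz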

Recipe 3 — Memory growth

Symptom: RSS climbs from 200 MB to > 1 GB over a day. OOM eventually follows. Diagnose:
# Live heap snapshot
go tool pprof http://localhost:6060/debug/pprof/heap

# Inside pprof:
(pprof) top20 -cum
(pprof) list internal/auth    # source-level attribution
What to look for:
  • internal/auth.(*Cache).Put holding more than ~1 MB per cached bundle — unusual, a bundle should be ≤ 50 KB. Oversized policy_rules.urls.allow with thousands of entries can trigger this.
  • internal/dispatch buffered responses — if streaming responses are being accumulated instead of flushed per-chunk, every request consumes full response size. Check for bufio.NewWriter wrapping a streaming writer anywhere.
  • Outbox shelving (LW_GATEWAY_BUDGET_DEBIT_SHELF_BYTES) — a long control-plane outage can push debits to disk. Check the outbox directory size.
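As with goroutines, a diff between two heap snapshots separates steady-state allocation from real growth. A sketch, assuming you can wait long enough for RSS to move between snapshots:
# Heap snapshot now, and another after RSS has visibly grown
curl -so /tmp/heap-1.pb.gz http://localhost:6060/debug/pprof/heap
# ...wait for growth...
curl -so /tmp/heap-2.pb.gz http://localhost:6060/debug/pprof/heap

# In-use memory that appeared between the two snapshots
go tool pprof -top -diff_base /tmp/heap-1.pb.gz /tmp/heap-2.pb.gz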

Recipe 4 — Mutex or block profiling for contention

Symptom: CPU is low, request rate is low, but latency is up. Suggests blocking, not computation. Enable on a specific pod (requires a restart with LW_GATEWAY_PPROF_BLOCK_RATE=1 and LW_GATEWAY_PPROF_MUTEX_FRACTION=1 in the env — off by default because they have measurable overhead).
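One way to flip both profilers on, assuming the standard Deployment name used elsewhere in this runbook; note this rolls every replica, not just one pod, and should be reverted once the investigation is done:
# Enable block + mutex profiling on the next rollout
kubectl set env deployment/langwatch-gateway -n langwatch \
  LW_GATEWAY_PPROF_BLOCK_RATE=1 LW_GATEWAY_PPROF_MUTEX_FRACTION=1

# Revert when finished; the overhead is measurable
kubectl set env deployment/langwatch-gateway -n langwatch \
  LW_GATEWAY_PPROF_BLOCK_RATE- LW_GATEWAY_PPROF_MUTEX_FRACTION-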
# Top mutex holders
go tool pprof http://localhost:6060/debug/pprof/mutex

# Top blockers (I/O, channels, locks)
go tool pprof http://localhost:6060/debug/pprof/block
What to look for:
  • Contention on internal/ratelimit.(*Bucket).Allow — means a VK’s RPM is bursting past the token-bucket refill rate and every request is waiting. Raise RPM or investigate the caller.
  • Contention on the L1 auth cache — see Recipe 1.

Recipe 5 — Allocation churn (GC pressure)

Symptom: go_gc_pause_seconds_sum is growing too fast; p99 spikes correlate with GC. Diagnose:
# Alloc-only heap (who's creating the most garbage)
go tool pprof http://localhost:6060/debug/pprof/allocs

(pprof) top20 -cum
What to look for:
  • JSON encoding of large /v1/messages requests — expected, but if it dominates consider enabling LW_GATEWAY_MAX_BODY_BYTES enforcement to reject pathologically large bodies earlier.
  • Per-request compilation of the same regex — should never happen; if it does, Lane A has a caching regression.
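Switching the sample index on the same allocs profile separates "many small objects" from "few huge buffers", which is usually enough to tell JSON churn apart from body buffering:
# Garbage ranked by object count (churn) vs. by bytes (large buffers)
go tool pprof -top -sample_index=alloc_objects http://localhost:6060/debug/pprof/allocs
go tool pprof -top -sample_index=alloc_space   http://localhost:6060/debug/pprof/allocs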

Recipe 6 — Debit outbox backlog

Symptom: gateway_budget_debit_outbox_depth climbing, or _4xx_drops_total increasing, or _flush_failures_total non-zero. Budgets may drift from actual spend. The outbox is the gateway’s async spend-accounting path: every request produces a debit that batches into a ring buffer and flushes to the control plane. Four failure classes exist, each with its own signal:
Signal | Metric | Severity | Meaning
Depth rising | gateway_budget_debit_outbox_depth / _capacity | warn | Normal under burst; watch fill-pct
Flush failures | gateway_budget_debit_outbox_flush_failures_total | warn | Control-plane is slow/unreachable; events re-enqueued, depth climbs slowly
4xx drops | gateway_budget_debit_outbox_4xx_drops_total | page | Silent data loss — signing or payload bug, terminally rejected
Capacity drops | gateway_budget_debit_outbox_dropped_total | warn | Ring is full, newest events being evicted — fill-pct was already at 100%
Diagnose, in order:
# 1. Fill percentage right now
kubectl exec -n langwatch deploy/langwatch-gateway -- \
  wget -qO- http://127.0.0.1:5590/metrics | \
  grep -E '^gateway_budget_debit_outbox_(depth|capacity)'

# 2. Last 15 min of flush failures (delta)
# Graph this in Grafana with:
#   rate(gateway_budget_debit_outbox_flush_failures_total[5m])

# 3. 4xx drops — any non-zero delta is a page
#   increase(gateway_budget_debit_outbox_4xx_drops_total[15m])
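To collapse the two gauges into a single fill percentage on the spot, a small awk pass over the same scrape works; this assumes both gauges are exported without labels:
kubectl exec -n langwatch deploy/langwatch-gateway -- \
  wget -qO- http://127.0.0.1:5590/metrics | \
  awk '/^gateway_budget_debit_outbox_depth /{d=$2} /^gateway_budget_debit_outbox_capacity /{c=$2} END{if (c>0) printf "fill: %.1f%%\n", 100*d/c}'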
Interpretation ladder:
  • Only depth is rising, flush_failures is 0: normal burst. Fill-pct will self-heal when traffic subsides. Only page if sustained > 80% for > 10m.
  • flush_failures > 0 and depth is rising: control plane is unavailable or slow. Check control-plane /api/internal/gateway/budget handler latency and error rate. Debits are safe (re-enqueued); they’ll drain once the control plane recovers.
  • 4xx_drops > 0: immediate page. Likely causes, in order of probability:
    1. HMAC secret drift between gateway and control plane (LW_GATEWAY_INTERNAL_SECRET mismatch after rotation).
    2. Payload schema drift (gateway and control plane deployed with incompatible contract versions).
    3. Rare: control plane returning 4xx for a specific tenant (archived organization, deleted project).
  • dropped_total > 0: the ring filled up and the oldest pending events were evicted before they could flush. Always follows a prolonged control-plane outage. Lost debits are unrecoverable — document the outage window for post-hoc spend reconciliation.
Fix:
Cause | Action
HMAC drift | Check both env vars point at the same secret; rotate via the dual-secret overlap window described in Config → Secrets
Control-plane slow | Scale the control plane; ratchet up LW_GATEWAY_BUDGET_DEBIT_RETRY_MAX as a temporary buffer
Payload drift | Pin gateway + control-plane to matching versions per the compatibility matrix
Alerting (wire these directly into Alertmanager, see Prometheus alerts):
  • rate(gateway_budget_debit_outbox_4xx_drops_total[5m]) > 0 → page (silent data loss).
  • rate(gateway_budget_debit_outbox_flush_failures_total[5m]) > 0 for 5m → warn.
  • (max(gateway_budget_debit_outbox_depth) / max(gateway_budget_debit_outbox_capacity)) > 0.5 for 5m → warn.
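A minimal sketch of those three expressions as a Prometheus rule file, written and syntax-checked from the shell; group and alert names are illustrative, and the severity labels should match your own Alertmanager routing:
cat > /tmp/gateway-outbox-alerts.yaml <<'EOF'
groups:
  - name: gateway-debit-outbox
    rules:
      - alert: GatewayDebitOutbox4xxDrops
        expr: rate(gateway_budget_debit_outbox_4xx_drops_total[5m]) > 0
        labels: {severity: page}
      - alert: GatewayDebitOutboxFlushFailures
        expr: rate(gateway_budget_debit_outbox_flush_failures_total[5m]) > 0
        for: 5m
        labels: {severity: warn}
      - alert: GatewayDebitOutboxFillHigh
        expr: (max(gateway_budget_debit_outbox_depth) / max(gateway_budget_debit_outbox_capacity)) > 0.5
        for: 5m
        labels: {severity: warn}
EOF
promtool check rules /tmp/gateway-outbox-alerts.yaml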

Recipe 7 — Stuck drain

Symptom: A pod stays in Terminating for the full terminationGracePeriodSeconds, then gets SIGKILLed. In-flight requests ended abruptly. Alertmanager fires on the gateway_draining gauge being 1 for > grace. The pod received SIGTERM but at least one request handler never returned before shutdown.timeout expired. The drain pipeline exposes this via two gauges:
Metric | During healthy drain | During stuck drain
gateway_draining{pod} | 1 → 0 at shutdown | 1 until SIGKILL
gateway_in_flight_requests{pod} | monotonically → 0 | flat at N > 0 for > grace
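To watch both gauges live on a pod that is draining, a simple loop against the metrics port (the same 5590 used in Recipe 6) is enough; the loop exits once the container is gone:
POD=<terminating pod name>
# In a healthy drain, in_flight_requests falls to 0 well inside the grace period
while kubectl exec -n langwatch "$POD" -- \
      wget -qO- http://127.0.0.1:5590/metrics 2>/dev/null | \
      grep -E '^gateway_(draining|in_flight_requests)'; do
  sleep 5
done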
Diagnose:
# 1. Which pod is stuck, and how many handlers are hanging?
kubectl get pod -n langwatch -l app=langwatch-gateway \
  --field-selector=status.phase=Running -o wide | grep Terminating

# 2. Goroutine dump — stuck handlers will show up in Go runtime state
POD=<terminating pod name>
kubectl port-forward -n langwatch "$POD" 6060:6060 &
go tool pprof -top http://localhost:6060/debug/pprof/goroutine

# 3. Find the handler that's not returning
curl -s "http://localhost:6060/debug/pprof/goroutine?debug=1" | \
  grep -A 20 'internal/dispatch.*streamSSE\|internal/auth\|internal/guardrails'
Common causes, in order of likelihood:
  1. Upstream dial hanging without a deadline. A streaming fallback into a dead region that never completes the TLS handshake. LW_GATEWAY_UPSTREAM_TIMEOUT_MS should be < shutdown.timeout; if it is, the handler should cancel on its own.
  2. Guardrail evaluator hanging past its budget. pre/post have a guardrail.preTimeout / postTimeout of 1500 ms, but a misconfigured evaluator can still hang if it doesn’t respect context cancellation. Check the evaluator service’s own SLO.
  3. A breaker with open state but no surrounding deadline. Rare; closed previously, but worth ruling out if the stack shows internal/circuit waiting.
  4. Slow custom middleware. If you’ve forked the gateway and added middleware that does I/O without context propagation, that’s where to look first.
Fix:
Cause | Action
Upstream hang | Lower LW_GATEWAY_UPSTREAM_TIMEOUT_MS below shutdown.timeout. Default 60 s — bump shutdown.timeout to 65 s + terminationGracePeriodSeconds to 80 s if you genuinely need 60 s upstream calls
Guardrail hang | Raise evaluator SLO or set guardrails.request_fail_open: true on the VK to fall through on timeout
Custom middleware | Thread r.Context() through every I/O call
Temporary workaround: bump shutdown.timeout + terminationGracePeriodSeconds to give the hanging request time to complete. Only appropriate while you diagnose the root cause — long grace periods slow down rolling deploys and make HPA scale-downs feel sluggish.
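On the Kubernetes side, the grace-period half of that workaround is a one-line patch; it only affects pods created from the next rollout onward, and the matching shutdown.timeout bump still has to happen in the gateway config / Helm values:
# Give hanging handlers 80 s before SIGKILL (pair with a shutdown.timeout of ~65 s)
kubectl patch deployment langwatch-gateway -n langwatch --type merge -p \
  '{"spec":{"template":{"spec":{"terminationGracePeriodSeconds":80}}}}'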

Recipe 8 — Control-plane outage / stale-while-error

Symptom: The LangWatch control plane is unreachable (deployment incident, DNS hiccup, network partition). The gateway’s L1 auth cache is full of valid resolved-key bundles, but /api/internal/gateway/resolve-key is returning errors. The operator wants to know: are customers being rejected, or is the gateway riding through?
The gateway’s auth resolver runs stale-while-error by default: when the cached entry’s JWT crosses its natural expiry AND the control-plane refresh fails for transport-class reasons (network error, dial timeout, 5xx, connection refused, malformed/unparseable response, JWT verify failure), it bumps the soft expiry by LW_GATEWAY_AUTH_CACHE_SOFT_BUMP (default 5m) and serves the cached bundle. This repeats on every refresh attempt, up to the hard cap of LW_GATEWAY_AUTH_CACHE_HARD_GRACE past the JWT exp (default 6h). The hard cap is deliberately generous — the soft-bump path runs on every refresh attempt without a successful response, so the hard cap is the true outage backstop, not a steady-state knob.
Auth-class rejections — explicit 401 / 403 / 404 from /resolve-key — bypass the grace window entirely and evict immediately. A revoked credential never gets stale-served.
Diagnose:
# 1. Are stale-serve INFO logs firing? (= grace is active and serving traffic)
kubectl logs -n langwatch -l app=langwatch-gateway --tail=200 | \
  grep auth_cache_serve_stale

# 2. Are transport-class refresh failures spewing? (= control plane unreachable)
kubectl logs -n langwatch -l app=langwatch-gateway --tail=200 | \
  grep auth_cache_refresh_transport_failure

# 3. Are hard evictions firing? (= grace exhausted, rejection mode)
kubectl logs -n langwatch -l app=langwatch-gateway --tail=500 | \
  grep auth_cache_hard_evict
The three log lines form a ladder operators read in order:
Log line | Level | Meaning | Operator action
auth_cache_serve_stale | INFO | Grace is active; cached bundle is being served past JWT exp | None — this is expected behaviour during outage. Includes vk_id, stale_for, hard_grace_remaining, refresh_error_class
auth_cache_refresh_transport_failure | WARN | Each refresh attempt is failing; soft expiry bumped. Includes error, error_class, new_soft_expires_at | Investigate control plane (this is the actual outage signal). Stop bumping once CP returns
auth_cache_hard_evict | ERROR | Grace cap exceeded OR auth rejection. reason field disambiguates: hard_cap_exceeded (outage too long) vs auth_rejection (real bad-credential evict) vs auth_rejection_async / hard_cap_exceeded_on_lookup | If hard_cap_exceeded, customers start seeing 401s. Bump LW_GATEWAY_AUTH_CACHE_HARD_GRACE if outage is ongoing and you’d rather extend than reject
Customer-facing behaviour during the grace window:
  • Requests against any VK that resolved successfully before the outage continue to work transparently.
  • Requests against any VK never seen by this pod (cold) still fail — the gateway has no bundle to fall back to. Today’s mitigation is Redis L2 (GATEWAY_REDIS_URL): HPA-scaled pods inherit the warm set from L2 even while the control plane is unreachable. (GATEWAY_CACHE_BOOTSTRAP_ALL_KEYS=true is a planned v1.1 enhancement to also pre-warm L1 from a /bootstrap snapshot on startup; the flag is reserved in env wiring but has no Go-side implementation today, so setting it is currently a no-op.)
  • Requests against any VK whose JWT was revoked just before the outage but the revocation /changes event hadn’t propagated yet — these stay served until the cache entry crosses its hard cap. Acceptable trade-off for the grace; auth rejections from a healthy CP still evict instantly via auth_cache_hard_evict reason=auth_rejection.
Tune:
Knob | When to lower | When to raise
LW_GATEWAY_AUTH_CACHE_SOFT_BUMP | If you want refresh attempts to back off less aggressively (e.g. 1m for tighter retry density) | If your control plane is consistently slow under load and you want to reduce refresh-attempt churn during partial degradation
LW_GATEWAY_AUTH_CACHE_HARD_GRACE | If you have strict revocation-latency SLOs (compliance / regulated deployments) and would rather hard-fail than serve a long-stale bundle | If your control plane has known multi-hour planned-maintenance windows and you’d rather ride through
LW_GATEWAY_AUTH_CACHE_HARD_GRACE=0s | Restores legacy behaviour: any refresh failure past JWT exp evicts immediately. Pick this if you operate under a security regime where stale-while-error is unacceptable | n/a (zero is the disable signal)
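Extending the hard grace mid-outage is an env change, so it triggers a rollout; with Redis L2 in place the replacement pods inherit the warm set (see the mitigation note above), but without L2 a restart throws away the warm L1 entries, so weigh it first. A sketch, with 12h as an arbitrary example value:
# Raise the hard cap while the control-plane outage is ongoing
kubectl set env deployment/langwatch-gateway -n langwatch \
  LW_GATEWAY_AUTH_CACHE_HARD_GRACE=12h

# Revert to the default once the control plane is healthy again
kubectl set env deployment/langwatch-gateway -n langwatch \
  LW_GATEWAY_AUTH_CACHE_HARD_GRACE-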
Alert pattern (log-based, no metric infra required):
# Fire if hard_evict reason=hard_cap_exceeded appears more than N times in 5 min
# Indicates the outage exceeded the configured grace and customers are now being rejected.
kubectl logs -n langwatch -l app=langwatch-gateway --since=5m | \
  grep -c 'auth_cache_hard_evict.*reason=hard_cap_exceeded'
A non-zero count is the page-worthy signal; the WARN spew alone (auth_cache_refresh_transport_failure) is informational while customers are still being served. See Config → Auth cache for the env-var contract.

Graceful degradation — what survives what

The gateway is a cache of the control plane, so a surprising amount continues to work when pieces go down. Quick reference:
Component down | Customer impact | How long gateway stays up | Notes
LangWatch control plane | None for cached VKs | ~15 min JWT exp + LW_GATEWAY_AUTH_CACHE_HARD_GRACE (default 6 h) of stale-while-error; bootstrap cache + Redis L2 extend coverage to cold-for-pod VKs | Outbox re-enqueues debits safely; they flush on recovery. New VK creation is blocked until control plane returns. See Recipe 8 for the operator runbook
One upstream provider | None for VKs with fallback; 502/504 for VKs without | Indefinite — circuit breaker + fallback chain absorb it | gateway_circuit_state{credential_id} shows which provider is down
Redis L2 cache | +~30 ms on cold-for-pod VKs | Indefinite — L1 + /resolve-key cover the miss | Fail-open by design (poison entries DEL, network errors log and miss)
Gateway pod (crash / eviction) | None — HPA replaces, PDB holds ≥ 2 replicas | N/A | Stateless; new pod warms from Redis L2 + /bootstrap
For multi-region deployments with shared control plane: a gateway region down is handled by Route53 latency-based failover. See Scaling → Regional placement.

Writing findings back

When you find something worth filing:
  1. Grab the .pb.gz with go tool pprof -symbolize=remote -proto http://localhost:6060/debug/pprof/heap > /tmp/heap.pb.gz.
  2. Attach it to the issue along with:
    • Pod name + image digest (kubectl get pod -n langwatch -o yaml | grep image).
    • X-LangWatch-Request-Id from one exemplar bad request.
    • kubectl top pod output around the spike.
  3. For urgent escalation, post in #ai-gateway-support with the request id.
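A small collection script that gathers the attachments in one pass; paths and file names are illustrative, and it assumes the port-forward from the Prerequisites section is still running:
#!/usr/bin/env bash
set -euo pipefail
POD=$(kubectl get pod -n langwatch -l app=langwatch-gateway -o jsonpath='{.items[0].metadata.name}')
OUT=/tmp/gateway-findings-$(date +%Y%m%dT%H%M%S)
mkdir -p "$OUT"

# Heap profile for the issue attachment
go tool pprof -symbolize=remote -proto http://localhost:6060/debug/pprof/heap > "$OUT/heap.pb.gz"

# Pod identity and image digest
kubectl get pod -n langwatch "$POD" -o yaml | grep 'image:' > "$OUT/images.txt"

# Resource usage at collection time
kubectl top pod -n langwatch "$POD" > "$OUT/top.txt"

echo "Attach the contents of $OUT to the issue"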

See also