When the gateway is slow, stuck, or chewing memory — and the LangWatch trace for a specific bad request has already been checked — this runbook is the next stop. Every recipe here relies on the pprof admin listener, which is bound to 127.0.0.1:6060 by default and therefore never reachable from outside the pod.
Don’t expose the admin port without a token. GATEWAY_ADMIN_ADDR binds to loopback by default. If you genuinely need direct (non-port-forward) access — e.g. non-k8s deploys, or from a corporate VPN — you MUST also set GATEWAY_ADMIN_AUTH_TOKEN; the gateway refuses to start otherwise. See Helm → Admin listener for the three deployment postures.
kubectl port-forward remains the simplest option for k8s — it tunnels through the API server and is auditable in Kubernetes audit logs.
Prerequisites
One-time setup on your operator laptop:
# Go toolchain with pprof (matches gateway's Go version)
go version # expect 1.26+
# Pick one pod to focus on
POD=$(kubectl get pod -n langwatch -l app=langwatch-gateway -o jsonpath='{.items[0].metadata.name}')
# Open the tunnel in one terminal and leave it running
kubectl port-forward -n langwatch "$POD" 6060:6060
All subsequent go tool pprof commands target http://localhost:6060.
Enabling pprof via Helm
The LangWatch Helm chart exposes the admin listener via a top-level admin stanza on the gateway sub-chart. The loopback-bound default is what you want in production — if you widen it to 0.0.0.0, also set admin.existingAuthSecretName so the built-in bearer-token guard protects pprof. The gateway refuses to start in the bind-non-loopback-without-token configuration.
gateway:
  admin:
    addr: "127.0.0.1:6060"  # default — reachable only via kubectl port-forward
    # addr: ""              # disable pprof entirely (advised for compliance-regulated envs)
Shipped in current chart versions. Older chart versions pre-date the field — upgrade the chart before troubleshooting, or set the env var directly via gateway.env.GATEWAY_ADMIN_ADDR.
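If you do widen the bind, the posture described above looks roughly like this in values — a sketch only; the secret name pprof-admin-token is an example, not a shipped default:

```yaml
gateway:
  admin:
    addr: "0.0.0.0:6060"                       # non-loopback bind: the token guard becomes mandatory
    existingAuthSecretName: pprof-admin-token  # hypothetical Secret holding GATEWAY_ADMIN_AUTH_TOKEN
```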
Recipe 1 — p99 latency spike
Symptom: gateway_http_request_duration_seconds{quantile="0.99"} jumps from ~300 ms to several seconds. Traces show no single slow upstream.
Diagnose:
# 30-second CPU profile while the spike is live
go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30
# Inside pprof:
(pprof) top20
(pprof) web # opens an SVG flame graph in your browser
What to look for:
- A single function using > 50% of CPU that is not one of tls.conn.Handshake, net/http.(*conn).serve, or json.Decoder.Decode — those are expected under load.
- Lock contention on internal/auth or internal/fallback — hot in sync.(*Mutex).Lock. Usually means the L1 cache is evicting faster than it’s filling; consider raising LW_GATEWAY_AUTH_CACHE_L1_SIZE.
- RE2 compilation in internal/blocked on every request — means the bundle isn’t caching compiled regexes. Check for frequent /changes churn (revision bumps on every request ≠ normal).
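The last item, per-request regex compilation, is cheap to avoid with a compile cache. A minimal sketch (illustrative only; this is not the gateway’s internal/blocked implementation):

```go
package main

import (
	"fmt"
	"regexp"
	"sync"
)

// compileCache memoizes compiled patterns so each expression is compiled
// once, not once per request. Sketch, not the gateway's actual code.
var compileCache sync.Map // pattern string -> *regexp.Regexp

func cachedCompile(pattern string) (*regexp.Regexp, error) {
	if re, ok := compileCache.Load(pattern); ok {
		return re.(*regexp.Regexp), nil
	}
	re, err := regexp.Compile(pattern)
	if err != nil {
		return nil, err
	}
	// LoadOrStore resolves the race if two requests compile concurrently.
	actual, _ := compileCache.LoadOrStore(pattern, re)
	return actual.(*regexp.Regexp), nil
}

func main() {
	re1, _ := cachedCompile(`^/v1/messages$`)
	re2, _ := cachedCompile(`^/v1/messages$`)
	fmt.Println(re1 == re2)                      // true — same *Regexp reused
	fmt.Println(re1.MatchString("/v1/messages")) // true
}
```

If a CPU profile shows regexp.Compile inside the request path despite a cache like this, the cache key is churning — which is what the /changes revision-bump check above is looking for.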
Recipe 2 — Goroutine leak
Symptom: go_goroutines climbs monotonically over hours and never comes back down. Memory follows.
Diagnose:
# Live goroutine dump as SVG
go tool pprof -svg http://localhost:6060/debug/pprof/goroutine > /tmp/goroutines.svg
open /tmp/goroutines.svg
# Or aggregate counts by creation site
curl -s http://localhost:6060/debug/pprof/goroutine?debug=1 | head -100
What to look for:
- Hundreds of goroutines parked in chan receive inside internal/fallback.Walk — means a fallback attempt is hanging on a context that never cancels. Check LW_GATEWAY_UPSTREAM_TIMEOUT_MS.
- Goroutines stuck in internal/guardrails.CheckChunk — likely a guardrail evaluator that never returns and exceeds the 50 ms budget. Check the evaluator service logs.
- Streaming goroutines (internal/dispatch.streamSSE) that outlive their request context — a client disconnect without ctx.Done() firing usually points at a missing close somewhere.
Fix options: restart the offending pod (workaround), or file a bug with the /tmp/goroutines.svg attached.
Recipe 3 — Memory growth
Symptom: RSS climbs from 200 MB to > 1 GB over a day. OOM eventually follows.
Diagnose:
# Live heap snapshot
go tool pprof http://localhost:6060/debug/pprof/heap
# Inside pprof:
(pprof) top20 -cum
(pprof) list internal/auth # source-level attribution
What to look for:
- internal/auth.(*Cache).Put holding more than ~1 MB per cached bundle — unusual; a bundle should be ≤ 50 KB. An oversized policy_rules.urls.allow with thousands of entries can trigger this.
- internal/dispatch buffered responses — if streaming responses are being accumulated instead of flushed per chunk, every request consumes its full response size in memory. Check for bufio.NewWriter wrapping a streaming writer anywhere.
- Outbox shelving (LW_GATEWAY_BUDGET_DEBIT_SHELF_BYTES) — a long control-plane outage can push debits to disk. Check the outbox directory size.
Recipe 4 — Mutex or block profiling for contention
Symptom: CPU is low, request rate is low, but latency is up. Suggests blocking, not computation.
Enable on a specific pod (requires a restart with LW_GATEWAY_PPROF_BLOCK_RATE=1 and LW_GATEWAY_PPROF_MUTEX_FRACTION=1 in the env — off by default because they have measurable overhead).
# Top mutex holders
go tool pprof http://localhost:6060/debug/pprof/mutex
# Top blockers (I/O, channels, locks)
go tool pprof http://localhost:6060/debug/pprof/block
What to look for:
- Contention on
internal/ratelimit.(*Bucket).Allow — means a VK’s RPM is bursting past the token-bucket refill rate and every request is waiting. Raise RPM or investigate the caller.
- Contention on the L1 auth cache — see Recipe 1.
Recipe 5 — Allocation churn (GC pressure)
Symptom: go_gc_duration_seconds_sum is growing too fast; p99 spikes correlate with GC pauses.
Diagnose:
# Alloc-only heap (who's creating the most garbage)
go tool pprof http://localhost:6060/debug/pprof/allocs
(pprof) top20 -cum
What to look for:
- JSON encoding of large /v1/messages requests — expected, but if it dominates, consider enabling LW_GATEWAY_MAX_BODY_BYTES enforcement to reject pathologically large bodies earlier.
- Per-request compilation of the same regex — should never happen; if it does, Lane A has a caching regression.
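The standard fix for allocation churn of this kind is buffer reuse via sync.Pool. A sketch that also shows how to measure the difference in allocated bytes (illustrative, not the gateway’s code):

```go
package main

import (
	"bytes"
	"fmt"
	"runtime"
	"sync"
)

// bufPool reuses encode buffers across requests instead of allocating a
// fresh one per request.
var bufPool = sync.Pool{New: func() any { return new(bytes.Buffer) }}

func encodePooled(payload []byte) {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Write(payload)
	buf.Reset()
	bufPool.Put(buf)
}

func encodeNaive(payload []byte) {
	var buf bytes.Buffer // fresh 64 KB-backed buffer per call: GC pressure
	buf.Write(payload)
}

// allocBytes measures heap bytes allocated across n calls of f.
func allocBytes(n int, f func()) uint64 {
	var m1, m2 runtime.MemStats
	runtime.GC()
	runtime.ReadMemStats(&m1)
	for i := 0; i < n; i++ {
		f()
	}
	runtime.ReadMemStats(&m2)
	return m2.TotalAlloc - m1.TotalAlloc
}

func main() {
	payload := make([]byte, 64<<10) // 64 KB request body
	naive := allocBytes(1000, func() { encodeNaive(payload) })
	pooled := allocBytes(1000, func() { encodePooled(payload) })
	fmt.Println(pooled*10 < naive) // true — pooling cuts allocated bytes by orders of magnitude
}
```

In the allocs profile, the pooled version disappears from top20 because the buffer is allocated once and recycled, not created per request.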
Recipe 6 — Debit outbox backlog
Symptom: gateway_budget_debit_outbox_depth climbing, or _4xx_drops_total increasing, or _flush_failures_total non-zero. Budgets may drift from actual spend.
The outbox is the gateway’s async spend-accounting path: every request produces a debit that batches into a ring buffer and flushes to the control plane. Four failure classes exist, each with its own signal:
| Signal | Metric | Severity | Meaning |
|---|---|---|---|
| Depth rising | gateway_budget_debit_outbox_depth / _capacity | warn | Normal under burst; watch fill-pct |
| Flush failures | gateway_budget_debit_outbox_flush_failures_total | warn | Control-plane is slow/unreachable; events re-enqueued, depth climbs slowly |
| 4xx drops | gateway_budget_debit_outbox_4xx_drops_total | page | Silent data loss — signing or payload bug, terminally rejected |
| Capacity drops | gateway_budget_debit_outbox_dropped_total | warn | Ring is full, newest events being evicted — fill-pct was already at 100% |
Diagnose, in order:
# 1. Fill percentage right now
kubectl exec -n langwatch deploy/langwatch-gateway -- \
wget -qO- http://127.0.0.1:5590/metrics | \
grep -E '^gateway_budget_debit_outbox_(depth|capacity)'
# 2. Last 15 min of flush failures (delta)
# Graph this in Grafana with:
# rate(gateway_budget_debit_outbox_flush_failures_total[5m])
# 3. 4xx drops — any non-zero delta is a page
# increase(gateway_budget_debit_outbox_4xx_drops_total[15m])
Interpretation ladder:
- Only depth is rising, flush_failures is 0: normal burst. Fill-pct will self-heal when traffic subsides. Only page if sustained > 80% for > 10m.
- flush_failures > 0 and depth is rising: the control plane is unavailable or slow. Check the control-plane /api/internal/gateway/budget handler latency and error rate. Debits are safe (re-enqueued); they’ll drain once the control plane recovers.
- 4xx_drops > 0: immediate page. Likely causes, in order of probability:
  - HMAC secret drift between gateway and control plane (LW_GATEWAY_INTERNAL_SECRET mismatch after rotation).
  - Payload schema drift (gateway and control plane deployed with incompatible contract versions).
  - Rare: control plane returning 4xx for a specific tenant (archived organization, deleted project).
- dropped_total > 0: the ring filled up and pending debits were dropped before they could flush. Always follows a prolonged control-plane outage. Lost debits are unrecoverable — document the outage window for post-hoc spend reconciliation.
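The interplay of depth, capacity, and drops can be sketched as a fixed-capacity ring. This is an illustration of the metric semantics only, not the gateway’s implementation (whether the real ring rejects the incoming event or evicts a pending one is internal):

```go
package main

import "fmt"

// debitRing is a minimal fixed-capacity buffer that counts drops when full,
// sketching the outbox's dropped_total semantics.
type debitRing struct {
	buf     []string
	dropped int // gateway_budget_debit_outbox_dropped_total analogue
}

func newDebitRing(capacity int) *debitRing {
	return &debitRing{buf: make([]string, 0, capacity)}
}

func (r *debitRing) enqueue(ev string) {
	if len(r.buf) == cap(r.buf) {
		r.dropped++ // fill-pct was already 100%: the debit is lost
		return
	}
	r.buf = append(r.buf, ev)
}

// flush drains the ring. On a control-plane failure the caller re-enqueues
// the batch, which is why flush_failures makes depth climb rather than lose
// data, while capacity drops are unrecoverable.
func (r *debitRing) flush() []string {
	out := r.buf
	r.buf = r.buf[:0]
	return out
}

func main() {
	r := newDebitRing(2)
	r.enqueue("debit-1")
	r.enqueue("debit-2")
	r.enqueue("debit-3") // ring full: dropped
	fmt.Println(len(r.buf), r.dropped) // 2 1
	fmt.Println(len(r.flush()))        // 2
}
```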
Fix:
| Cause | Action |
|---|---|
| HMAC drift | Check both env vars point at the same secret; rotate via the dual-secret overlap window described in Config → Secrets |
| Control-plane slow | Scale the control plane; ratchet up LW_GATEWAY_BUDGET_DEBIT_RETRY_MAX as a temporary buffer |
| Payload drift | Pin gateway + control-plane to matching versions per the compatibility matrix |
Alerting (wire these directly into Alertmanager, see Prometheus alerts):
rate(gateway_budget_debit_outbox_4xx_drops_total[5m]) > 0 → page (silent data loss).
rate(gateway_budget_debit_outbox_flush_failures_total[5m]) > 0 for 5m → warn.
(max(gateway_budget_debit_outbox_depth) / max(gateway_budget_debit_outbox_capacity)) > 0.5 for 5m → warn.
Recipe 7 — Stuck drain
Symptom: A pod stays in Terminating for the full terminationGracePeriodSeconds, then gets SIGKILLed. In-flight requests ended abruptly. Alertmanager fires on the gateway_draining gauge being 1 for > grace.
The pod received SIGTERM but at least one request handler never returned before shutdown.timeout expired. The drain pipeline exposes this via two gauges:
| Metric | During healthy drain | During stuck drain |
|---|---|---|
| gateway_draining{pod} | 1 → 0 at shutdown | 1 until SIGKILL |
| gateway_in_flight_requests{pod} | monotonically → 0 | flat at N > 0 for > grace |
Diagnose:
# 1. Which pod is stuck, and how many handlers are hanging?
kubectl get pod -n langwatch -l app=langwatch-gateway \
--field-selector=status.phase=Running -o wide | grep Terminating
# 2. Goroutine dump — stuck handlers will show up in Go runtime state
POD=<terminating pod name>
kubectl port-forward -n langwatch "$POD" 6060:6060 &
go tool pprof -top http://localhost:6060/debug/pprof/goroutine
# 3. Find the handler that's not returning
curl -s http://localhost:6060/debug/pprof/goroutine?debug=1 | \
grep -A 20 'internal/dispatch.*streamSSE\|internal/auth\|internal/guardrails'
Common causes, in order of likelihood:
- Upstream dial hanging without a deadline — e.g. a streaming fallback into a dead region that never completes its TLS handshake. LW_GATEWAY_UPSTREAM_TIMEOUT_MS should be < shutdown.timeout; if it is, the handler should cancel on its own.
- Guardrail evaluator hanging past its budget. pre/post have a guardrail.preTimeout / postTimeout of 1500 ms, but a misconfigured evaluator can still hang if it doesn’t respect context cancellation. Check the evaluator service’s own SLO.
- A breaker in the open state with no surrounding deadline. Rare (a previously closed bug), but worth ruling out if the stack shows internal/circuit waiting.
- Slow custom middleware. If you’ve forked the gateway and added middleware that does I/O without context propagation, that’s where to look first.
Fix:
| Cause | Action |
|---|---|
| Upstream hang | Lower LW_GATEWAY_UPSTREAM_TIMEOUT_MS below shutdown.timeout. Default 60 s — bump shutdown.timeout to 65 s + terminationGracePeriodSeconds to 80 s if you genuinely need 60 s upstream calls |
| Guardrail hang | Raise evaluator SLO or set guardrails.request_fail_open: true on the VK to fall through on timeout |
| Custom middleware | Thread r.Context() through every I/O call |
Temporary workaround: bump shutdown.timeout + terminationGracePeriodSeconds to give the hanging request time to complete. Only appropriate while you diagnose the root cause — long grace periods slow down rolling deploys and make HPA scale-downs feel sluggish.
Recipe 8 — Control-plane outage / stale-while-error
Symptom: The LangWatch control plane is unreachable (deployment incident, DNS hiccup, network partition). The gateway’s L1 auth cache is full of valid resolved-key bundles, but /api/internal/gateway/resolve-key is returning errors. Operator wants to know: are customers being rejected, or is the gateway riding through?
The gateway’s auth resolver runs stale-while-error by default: when the cached entry’s JWT crosses its natural expiry AND the control-plane refresh fails for transport-class reasons (network error, dial timeout, 5xx, connection refused, malformed/unparseable response, JWT verify failure), it bumps the soft expiry by LW_GATEWAY_AUTH_CACHE_SOFT_BUMP (default 5m) and serves the cached bundle. This continues every refresh attempt up to the hard cap of LW_GATEWAY_AUTH_CACHE_HARD_GRACE past the JWT exp (default 6h). The hard cap is deliberately generous — the soft-bump path runs on every refresh attempt without a successful response, so the hard cap is the true outage backstop, not a steady-state knob.
Auth-class rejections — explicit 401 / 403 / 404 from /resolve-key — bypass the grace window entirely and evict immediately. A revoked credential never gets stale-served.
Diagnose:
# 1. Are stale-serve INFO logs firing? (= grace is active and serving traffic)
kubectl logs -n langwatch -l app=langwatch-gateway --tail=200 | \
grep auth_cache_serve_stale
# 2. Are transport-class refresh failures spewing? (= control plane unreachable)
kubectl logs -n langwatch -l app=langwatch-gateway --tail=200 | \
grep auth_cache_refresh_transport_failure
# 3. Are hard evictions firing? (= grace exhausted, rejection mode)
kubectl logs -n langwatch -l app=langwatch-gateway --tail=500 | \
grep auth_cache_hard_evict
The three log lines form a ladder operators read in order:
| Log line | Level | Meaning | Operator action |
|---|---|---|---|
| auth_cache_serve_stale | INFO | Grace is active; the cached bundle is being served past JWT exp. Includes vk_id, stale_for, hard_grace_remaining, refresh_error_class | None — this is expected behaviour during an outage |
| auth_cache_refresh_transport_failure | WARN | Each refresh attempt is failing; soft expiry bumped. Includes error, error_class, new_soft_expires_at | Investigate the control plane (this is the actual outage signal). Bumping stops once the CP returns |
| auth_cache_hard_evict | ERROR | Grace cap exceeded OR auth rejection. The reason field disambiguates: hard_cap_exceeded (outage too long) vs auth_rejection (real bad-credential evict) vs auth_rejection_async / hard_cap_exceeded_on_lookup | If hard_cap_exceeded, customers start seeing 401s. Bump LW_GATEWAY_AUTH_CACHE_HARD_GRACE if the outage is ongoing and you’d rather extend than reject |
Customer-facing behaviour during the grace window:
- Requests against any VK that resolved successfully before the outage continue to work transparently.
- Requests against any VK never seen by this pod (cold) still fail — the gateway has no bundle to fall back to. Today’s mitigation is Redis L2 (GATEWAY_REDIS_URL): HPA-scaled pods inherit the warm set from L2 even while the control plane is unreachable. (GATEWAY_CACHE_BOOTSTRAP_ALL_KEYS=true is a planned v1.1 enhancement to also pre-warm L1 from a /bootstrap snapshot on startup; the flag is reserved in env wiring but has no Go-side implementation today, so setting it is currently a no-op.)
- Requests against any VK whose JWT was revoked just before the outage, where the revocation’s /changes event hadn’t propagated yet — these stay served until the cache entry crosses its hard cap. An acceptable trade-off for the grace; auth rejections from a healthy CP still evict instantly via auth_cache_hard_evict reason=auth_rejection.
Tune:
| Knob | When to lower | When to raise |
|---|---|---|
| LW_GATEWAY_AUTH_CACHE_SOFT_BUMP | If you want refresh attempts to back off less aggressively (e.g. 1m for tighter retry density) | If your control plane is consistently slow under load and you want to reduce refresh-attempt churn during partial degradation |
| LW_GATEWAY_AUTH_CACHE_HARD_GRACE | If you have strict revocation-latency SLOs (compliance / regulated deployments) and would rather hard-fail than serve a long-stale bundle | If your control plane has known multi-hour planned-maintenance windows and you’d rather ride through |
| LW_GATEWAY_AUTH_CACHE_HARD_GRACE=0s | Restores legacy behaviour: any refresh failure past JWT exp evicts immediately. Pick this if you operate under a security regime where stale-while-error is unacceptable | n/a (zero is the disable signal) |
Alert pattern (log-based, no metric infra required):
# Fire if hard_evict reason=hard_cap_exceeded appears more than N times in 5 min
# Indicates the outage exceeded the configured grace and customers are now being rejected.
kubectl logs -n langwatch -l app=langwatch-gateway --since=5m | \
grep -c 'auth_cache_hard_evict.*reason=hard_cap_exceeded'
A non-zero count is the page-worthy signal; the WARN spew alone (auth_cache_refresh_transport_failure) is informational while customers are still being served.
See Config → Auth cache for the env-var contract.
Graceful degradation — what survives what
The gateway is a cache of the control plane, so a surprising amount continues to work when pieces go down. Quick reference:
| Component down | Customer impact | How long gateway stays up | Notes |
|---|---|---|---|
| LangWatch control plane | None for cached VKs | ~15 min JWT exp + LW_GATEWAY_AUTH_CACHE_HARD_GRACE (default 6 h) of stale-while-error; bootstrap cache + Redis L2 extend coverage to cold-for-pod VKs | Outbox re-enqueues debits safely; they flush on recovery. New VK creation is blocked until control plane returns. See Recipe 8 for the operator runbook |
| One upstream provider | None for VKs with fallback; 502/504 for VKs without | Indefinite — circuit breaker + fallback chain absorb it | gateway_circuit_state{credential_id} shows which provider is down |
| Redis L2 cache | +~30 ms on cold-for-pod VKs | Indefinite — L1 + /resolve-key cover the miss | Fail-open by design (poison entries DEL, network errors log and miss) |
| Gateway pod (crash / eviction) | None — HPA replaces, PDB holds ≥ 2 replicas | N/A | Stateless; new pod warms from Redis L2 + /bootstrap |
For multi-region deployments with shared control plane: a gateway region down is handled by Route53 latency-based failover. See Scaling → Regional placement.
Writing findings back
When you find something worth filing:
- Grab the .pb.gz with go tool pprof -symbolize=remote -proto http://localhost:6060/debug/pprof/heap > /tmp/heap.pb.gz.
- Attach it to the issue along with:
  - Pod name + image digest (kubectl get pod -n langwatch -o yaml | grep image).
  - X-LangWatch-Request-Id from one exemplar bad request.
  - kubectl top pod output around the spike.
- For urgent escalation, post in #ai-gateway-support with the request id.
See also