127.0.0.1:6060 by default and therefore never reachable from outside the pod.
Don’t expose the admin port without a token.
GATEWAY_ADMIN_ADDR binds to loopback by default. If you genuinely need direct (non-port-forward) access, e.g. for non-k8s deploys or from a corporate VPN, you MUST also set GATEWAY_ADMIN_AUTH_TOKEN. The gateway refuses to start otherwise. See Helm → Admin listener for the three deployment postures. kubectl port-forward remains the simplest option for k8s: it tunnels through the API server and is auditable in Kubernetes audit logs.
Prerequisites
One-time setup on your operator laptop:
go tool pprof commands target http://localhost:6060.
Enabling pprof via Helm
The LangWatch Helm chart exposes the admin listener via a top-level admin stanza on the gateway sub-chart. The loopback-bound default is what you want in production. If you widen it to 0.0.0.0, also set admin.existingAuthSecretName so the built-in bearer-token guard protects pprof. The gateway refuses to start when it is bound to a non-loopback address without a token.
gateway.env.GATEWAY_ADMIN_ADDR.
Recipe 1: p99 latency spike
Symptom: gateway_http_request_duration_seconds{quantile="0.99"} jumps from ~300 ms to several seconds. Traces show no single slow upstream.
Diagnose:
- A single function using > 50% of CPU that’s not one of: tls.conn.Handshake, net/http.(*conn).serve, json.Decoder.Decode. Those are expected under load.
- Lock contention on internal/auth or internal/fallback, hot in sync.(*Mutex).Lock. Usually means the L1 cache is evicting faster than it’s filling; consider raising LW_GATEWAY_AUTH_CACHE_L1_SIZE.
- RE2 compilation in internal/blocked on every request means the bundle isn’t caching compiled regexes. Check for frequent /changes churn (revision bumps on every request ≠ normal).
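To make the third item concrete, here is a minimal sketch of what caching compiled regexes can look like in Go. The regexCache type and its compiles counter are hypothetical names for illustration only; the gateway's actual caching in internal/blocked is not shown in this runbook and may differ.

```go
package main

import (
	"fmt"
	"regexp"
	"sync"
)

// regexCache is a hypothetical helper, not the gateway's real type.
// It compiles each distinct pattern once and reuses the result.
type regexCache struct {
	mu       sync.RWMutex
	compiled map[string]*regexp.Regexp
	compiles int // number of real compilations, kept only for illustration
}

func newRegexCache() *regexCache {
	return &regexCache{compiled: make(map[string]*regexp.Regexp)}
}

// get returns the cached regex for pattern, compiling it on first use.
func (c *regexCache) get(pattern string) (*regexp.Regexp, error) {
	c.mu.RLock()
	re, ok := c.compiled[pattern]
	c.mu.RUnlock()
	if ok {
		return re, nil
	}

	c.mu.Lock()
	defer c.mu.Unlock()
	// Re-check under the write lock: another goroutine may have compiled it.
	if re, ok := c.compiled[pattern]; ok {
		return re, nil
	}
	re, err := regexp.Compile(pattern)
	if err != nil {
		return nil, err
	}
	c.compiled[pattern] = re
	c.compiles++
	return re, nil
}

func main() {
	cache := newRegexCache()
	for i := 0; i < 1000; i++ {
		re, err := cache.get(`(?i)secret-[a-z0-9]+`)
		if err != nil {
			panic(err)
		}
		_ = re.MatchString("Secret-abc123")
	}
	fmt.Println("compilations:", cache.compiles) // prints "compilations: 1"
}
```

If a CPU profile shows regexp.Compile on the request path despite a cache like this, the cache is probably being rebuilt per request, which is consistent with the revision-churn hint above.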
Recipe 2: Goroutine leak
Symptom: go_goroutines climbs monotonically over hours and never comes back down. Memory follows.
Diagnose:
- Hundreds of goroutines parked in chan receive inside internal/fallback.Walk means a fallback attempt is hanging on a context that never cancels. Check LW_GATEWAY_UPSTREAM_TIMEOUT_MS.
- Goroutines stuck in internal/guardrails.CheckChunk: likely a guardrail evaluator that never returns and exceeds the 50 ms budget. Check the evaluator service logs.
- Streaming goroutines (internal/dispatch.streamSSE) that outlive their request context: a client disconnect without ctx.Done() firing usually points at a missing close somewhere.
/tmp/goroutines.svg attached.
Recipe 3: Memory growth
Symptom: RSS climbs from 200 MB to > 1 GB over a day. OOM eventually follows.
Diagnose:
- internal/auth.(*Cache).Put holding more than ~1 MB per cached bundle: unusual, a bundle should be ≤ 50 KB. Oversized policy_rules.urls.allow with thousands of entries can trigger this.
- internal/dispatch buffered responses: if streaming responses are being accumulated instead of flushed per-chunk, every request consumes full response size. Check for bufio.NewWriter wrapping a streaming writer anywhere.
- Outbox shelving (LW_GATEWAY_BUDGET_DEBIT_SHELF_BYTES): a long control-plane outage can push debits to disk. Check the outbox directory size.
Recipe 4: Mutex or block profiling for contention
Symptom: CPU is low, request rate is low, but latency is up. Suggests blocking, not computation.
Enable on a specific pod (requires a restart with LW_GATEWAY_PPROF_BLOCK_RATE=1 and LW_GATEWAY_PPROF_MUTEX_FRACTION=1 in the env; off by default because they have measurable overhead).
- Contention on internal/ratelimit.(*Bucket).Allow means a VK’s RPM is bursting past the token-bucket refill rate and every request is waiting. Raise RPM or investigate the caller.
- Contention on the L1 auth cache: see Recipe 1.
Recipe 5: Allocation churn (GC pressure)
Symptom: go_gc_pause_seconds_sum is growing too fast; p99 spikes correlate with GC.
Diagnose:
- JSON encoding of large /v1/messages requests: expected, but if it dominates consider enabling LW_GATEWAY_MAX_BODY_BYTES enforcement to reject pathologically large bodies earlier.
- Per-request compilation of the same regex: should never happen; if it does, Lane A has a caching regression.
Recipe 6: Debit outbox backlog
Symptom: gateway_budget_debit_outbox_depth climbing, or _4xx_drops_total increasing, or _flush_failures_total non-zero. Budgets may drift from actual spend.
The outbox is the gateway’s async spend-accounting path: every request produces a debit that batches into a ring buffer and flushes to the control plane. Four failure classes exist, each with its own signal:
| Signal | Metric | Severity | Meaning |
|---|---|---|---|
| Depth rising | gateway_budget_debit_outbox_depth, _capacity | warn | Normal under burst; watch fill-pct |
| Flush failures | gateway_budget_debit_outbox_flush_failures_total | warn | Control-plane is slow/unreachable; events re-enqueued, depth climbs slowly |
| 4xx drops | gateway_budget_debit_outbox_4xx_drops_total | page | Silent data loss, signing or payload bug, terminally rejected |
| Capacity drops | gateway_budget_debit_outbox_dropped_total | warn | Ring is full, newest events being evicted, fill-pct was already at 100% |
- Only depth is rising, flush_failures is 0: normal burst. Fill-pct will self-heal when traffic subsides. Only page if sustained > 80% for > 10m.
- flush_failures > 0 and depth is rising: control plane is unavailable or slow. Check control-plane /api/internal/gateway/budget handler latency and error rate. Debits are safe (re-enqueued); they’ll drain once the control plane recovers.
- 4xx_drops > 0: immediate page. Likely causes, in order of probability:
  - HMAC secret drift between gateway and control plane (LW_GATEWAY_INTERNAL_SECRET mismatch after rotation).
  - Payload schema drift (gateway and control plane deployed with incompatible contract versions).
  - Rare: control plane returning 4xx for a specific tenant (archived organization, deleted project).
- dropped_total > 0: the ring filled up and the oldest pending events were evicted before they could flush. Always follows a prolonged control-plane outage. Lost debits are unrecoverable; document the outage window for post-hoc spend reconciliation.
| Cause | Action |
|---|---|
| HMAC drift | Check both env vars point at the same secret; rotate via the dual-secret overlap window described in Config → Secrets |
| Control-plane slow | Scale the control plane; ratchet up LW_GATEWAY_BUDGET_DEBIT_RETRY_MAX as a temporary buffer |
| Payload drift | Pin gateway + control-plane to matching versions per the compatibility matrix |
- rate(gateway_budget_debit_outbox_4xx_drops_total[5m]) > 0 → page (silent data loss).
- rate(gateway_budget_debit_outbox_flush_failures_total[5m]) > 0 for 5m → warn.
- (max(gateway_budget_debit_outbox_depth) / max(gateway_budget_debit_outbox_capacity)) > 0.5 for 5m → warn.
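The triage order in this recipe can be summarised as a small decision function. This is only a sketch for reasoning about precedence: the outboxSignals struct, its field names, and the returned strings are hypothetical, and the "for 5m" / "sustained > 10m" duration conditions are deliberately left out.

```go
package main

import "fmt"

// outboxSignals is a hypothetical snapshot of the four outbox signals.
// The *Rate fields stand for the 5-minute rate of the matching counter.
type outboxSignals struct {
	Depth            float64
	Capacity         float64
	FlushFailureRate float64
	Drop4xxRate      float64
	CapacityDropRate float64
}

// triage returns the most severe applicable finding, checking the
// paging condition (4xx drops) before any of the warn-level ones.
func triage(s outboxSignals) string {
	switch {
	case s.Drop4xxRate > 0:
		return "page: 4xx drops, silent data loss"
	case s.CapacityDropRate > 0:
		return "warn: ring full, debits lost"
	case s.FlushFailureRate > 0:
		return "warn: control plane slow or unreachable"
	case s.Capacity > 0 && s.Depth/s.Capacity > 0.5:
		return "warn: outbox more than half full"
	default:
		return "ok"
	}
}

func main() {
	fmt.Println(triage(outboxSignals{Depth: 100, Capacity: 1000}))
	fmt.Println(triage(outboxSignals{Depth: 600, Capacity: 1000, FlushFailureRate: 0.2}))
	fmt.Println(triage(outboxSignals{Depth: 10, Capacity: 1000, Drop4xxRate: 0.1}))
}
```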
Recipe 7: Stuck drain
Symptom: A pod stays in Terminating for the full terminationGracePeriodSeconds, then gets SIGKILLed. In-flight requests ended abruptly. Alertmanager fires on the gateway_draining gauge being 1 for > grace.
The pod received SIGTERM but at least one request handler never returned before shutdown.timeout expired. The drain pipeline exposes this via two gauges:
| Metric | During healthy drain | During stuck drain |
|---|---|---|
| gateway_draining{pod} | 1 → 0 at shutdown | 1 until SIGKILL |
| gateway_in_flight_requests{pod} | monotonically → 0 | flat at N > 0 for > grace |
- Upstream dial hanging without a deadline. Streaming fallback into a dead region that never completes the TLS handshake. LW_GATEWAY_UPSTREAM_TIMEOUT_MS should be < shutdown.timeout; if it is, the handler should cancel on its own.
- Guardrail evaluator hanging past its budget. pre/post have a guardrail.preTimeout and postTimeout of 1500 ms, but a misconfigured evaluator can still hang if it doesn’t respect context cancellation. Check the evaluator service’s own SLO.
- A breaker with open state but no surrounding deadline. Rare; closed previously, but worth ruling out if the stack shows internal/circuit waiting.
- Slow custom middleware. If you’ve forked the gateway and added middleware that does I/O without context propagation, that’s where to look first.
| Cause | Action |
|---|---|
| Upstream hang | Lower LW_GATEWAY_UPSTREAM_TIMEOUT_MS below shutdown.timeout. The default is 60 s; if you genuinely need 60 s upstream calls, bump shutdown.timeout to 65 s and terminationGracePeriodSeconds to 80 s |
| Guardrail hang | Raise evaluator SLO or set guardrails.request_fail_open: true on the VK to fall through on timeout |
| Custom middleware | Thread r.Context() through every I/O call |
Raise shutdown.timeout + terminationGracePeriodSeconds to give the hanging request time to complete. Only appropriate while you diagnose the root cause: long grace periods slow down rolling deploys and make HPA scale-downs feel sluggish.
Recipe 8: Control-plane outage, stale-while-error
Symptom: The LangWatch control plane is unreachable (deployment incident, DNS hiccup, network partition). The gateway’s L1 auth cache is full of valid resolved-key bundles, but /api/internal/gateway/resolve-key is returning errors. Operator wants to know: are customers being rejected, or is the gateway riding through?
The gateway’s auth resolver runs stale-while-error by default: when the cached entry’s JWT crosses its natural expiry AND the control-plane refresh fails for transport-class reasons (network error, dial timeout, 5xx, connection refused, malformed/unparseable response, JWT verify failure), it bumps the soft expiry by LW_GATEWAY_AUTH_CACHE_SOFT_BUMP (default 5m) and serves the cached bundle. This continues every refresh attempt up to the hard cap of LW_GATEWAY_AUTH_CACHE_HARD_GRACE past the JWT exp (default 6h). The hard cap is deliberately generous: the soft-bump path runs on every refresh attempt without a successful response, so the hard cap is the true outage backstop, not a steady-state knob.
Auth-class rejections (explicit 401, 403, 404 from /resolve-key) bypass the grace window entirely and evict immediately. A revoked credential never gets stale-served.
Diagnose:
| Log line | Level | Meaning | Operator action |
|---|---|---|---|
| auth_cache_serve_stale | INFO | Grace is active; cached bundle is being served past JWT exp | None; this is expected behaviour during outage. Includes vk_id, stale_for, hard_grace_remaining, refresh_error_class |
| auth_cache_refresh_transport_failure | WARN | Each refresh attempt is failing; soft expiry bumped. Includes error, error_class, new_soft_expires_at | Investigate control plane (this is the actual outage signal). Stop bumping once CP returns |
| auth_cache_hard_evict | ERROR | Grace cap exceeded OR auth rejection. reason field disambiguates: hard_cap_exceeded (outage too long) vs auth_rejection (real bad-credential evict) vs auth_rejection_async, hard_cap_exceeded_on_lookup | If hard_cap_exceeded, customers start seeing 401s. Bump LW_GATEWAY_AUTH_CACHE_HARD_GRACE if outage is ongoing and you’d rather extend than reject |
- Requests against any VK that resolved successfully before the outage continue to work transparently.
- Requests against any VK never seen by this pod (cold) still fail: the gateway has no bundle to fall back to. Today’s mitigation is Redis L2 (GATEWAY_REDIS_URL): HPA-scaled pods inherit the warm set from L2 even while the control plane is unreachable. (GATEWAY_CACHE_BOOTSTRAP_ALL_KEYS=true is a planned v1.1 enhancement to also pre-warm L1 from a /bootstrap snapshot on startup; the flag is reserved in env wiring but has no Go-side implementation today, so setting it is currently a no-op.)
- Requests against any VK whose JWT was revoked just before the outage, where the revocation /changes event hadn’t propagated yet, stay served until the cache entry crosses its hard cap. Acceptable trade-off for the grace; auth rejections from a healthy CP still evict instantly via auth_cache_hard_evict reason=auth_rejection.
| Knob | When to lower | When to raise |
|---|---|---|
| LW_GATEWAY_AUTH_CACHE_SOFT_BUMP | If you want refresh attempts to back off less aggressively (e.g. 1m for tighter retry density) | If your control plane is consistently slow under load and you want to reduce refresh-attempt churn during partial degradation |
| LW_GATEWAY_AUTH_CACHE_HARD_GRACE | If you have strict revocation-latency SLOs (compliance, regulated deployments) and would rather hard-fail than serve a long-stale bundle | If your control plane has known multi-hour planned-maintenance windows and you’d rather ride through |
| LW_GATEWAY_AUTH_CACHE_HARD_GRACE=0s | Restores legacy behaviour: any refresh failure past JWT exp evicts immediately. Pick this if you operate under a security regime where stale-while-error is unacceptable | n/a (zero is the disable signal) |
auth_cache_refresh_transport_failure) is informational while customers are still being served.
See Config → Auth cache for the env-var contract.
Graceful degradation: what survives what
The gateway is a cache of the control plane, so a surprising amount continues to work when pieces go down. Quick reference:
| Component down | Customer impact | How long gateway stays up | Notes |
|---|---|---|---|
| LangWatch control plane | None for cached VKs | ~15 min JWT exp + LW_GATEWAY_AUTH_CACHE_HARD_GRACE (default 6 h) of stale-while-error; bootstrap cache + Redis L2 extend coverage to cold-for-pod VKs | Outbox re-enqueues debits safely; they flush on recovery. New VK creation is blocked until control plane returns. See Recipe 8 for the operator runbook |
| One upstream provider | None for VKs with fallback; 502/504 for VKs without | Indefinite, circuit breaker + fallback chain absorb it | gateway_circuit_state{credential_id} shows which provider is down |
| Redis L2 cache | +~30 ms on cold-for-pod VKs | Indefinite, L1 + /resolve-key cover the miss | Fail-open by design (poison entries DEL, network errors log and miss) |
| Gateway pod (crash, eviction) | None, HPA replaces, PDB holds ≥ 2 replicas | N/A | Stateless; new pod warms from Redis L2 + /bootstrap |
Writing findings back
When you find something worth filing:
- Grab the .pb.gz with go tool pprof -symbolize=remote -proto http://localhost:6060/debug/pprof/heap > /tmp/heap.pb.gz.
- Attach it to the issue along with:
  - Pod name + image digest (kubectl get pod -n langwatch -o yaml | grep image).
  - X-LangWatch-Request-Id from one exemplar bad request.
  - kubectl top pod output around the spike.
- For urgent escalation, post in #ai-gateway-support with the request id.
See also
- Config → Admin, operator endpoints: env-var reference.
- Health Checks: /readyz output to interpret alongside pprof.
- Prometheus alerts: alert rules that tell you when to reach for this runbook.