groups:
- name: langwatch-gateway
interval: 30s
rules:
# ─── Reliability ─────────────────────────────────────────────────────
- alert: GatewayHighErrorRate
expr: |
sum(rate(gateway_http_requests_total{status=~"5.."}[5m]))
/ sum(rate(gateway_http_requests_total[5m])) > 0.05
for: 5m
labels: { severity: page, team: ai-platform }
annotations:
summary: "Gateway 5xx rate > 5% over 5m"
description: |
End-user requests to the gateway are returning 5xx at {{ $value | humanizePercentage }}.
Check /readyz on pods, then upstream provider status.
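# Triage sketch (an ad-hoc query, not a rule): breaking the 5xx rate down by
# status code may help separate gateway faults (500) from upstream ones
# (502/504). Only the `status` label is used, since the alert expr above
# already relies on it; the 500-vs-502/504 reading is a general HTTP
# convention and an assumption here, not confirmed behaviour of this gateway.
#   sum by (status) (rate(gateway_http_requests_total{status=~"5.."}[5m]))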
- alert: GatewayReadinessFlapping
expr: |
changes(kube_pod_status_ready{
condition="true",
namespace="langwatch",
pod=~"langwatch-gateway-.*"
}[10m]) > 4
for: 5m
labels: { severity: warn, team: ai-platform }
annotations:
summary: "Gateway pod {{ $labels.pod }} readiness flapped > 4 times in 10m"
description: "Pod is flipping ready/not-ready. Investigate /readyz output."
# ─── Provider health ─────────────────────────────────────────────────
- alert: CircuitOpenTooLong
expr: gateway_circuit_state == 2  # 2 = open, per the alert name
for: 5m
labels: { severity: page, team: ai-platform }
annotations:
summary: "Circuit open for credential {{ $labels.credential_id }} > 5m"
description: |
Gateway has tripped the breaker and is skipping this provider.
Verify the provider is actually having issues (check
X-LangWatch-Provider headers on recent failures + provider status page).
- alert: ExcessiveFallback
expr: |
sum(rate(gateway_provider_attempts_total{outcome="fallback"}[10m]))
/ sum(rate(gateway_provider_attempts_total[10m])) > 0.1
for: 10m
labels: { severity: warn, team: ai-platform }
annotations:
summary: "> 10% of requests falling back over 10m"
description: |
Primary provider is degraded enough that > 10% of requests are
walking the fallback chain. Not paging because fallback is
working as designed, but the primary needs investigation.
# ─── Budget governance ───────────────────────────────────────────────
- alert: StreamingUsageMissing
expr: |
sum(rate(gateway_streaming_usage_missing_total[10m])) > 0
for: 10m
labels: { severity: page, team: ai-platform }
annotations:
summary: "Streaming requests without usage reported"
description: |
A streaming request that completes without token counts debits $0
to the budget, so budgets are silently bypassed.
OpenAI requires stream_options.include_usage=true on the client.
See /ai-gateway/streaming#usage-extraction-critical-for-streaming-budgets.
- alert: BudgetDebitOutboxBacklog
expr: gateway_budget_debit_outbox_depth > 1000
for: 5m
labels: { severity: page, team: ai-platform }
annotations:
summary: "Budget debit outbox depth > 1000 for 5m"
description: |
Gateway can't reach the control plane /budget/debit endpoint fast
enough. Customer budgets are NOT being debited in real-time;
near-limit customers may be over-consuming. Check control-plane
health + gateway_budget_debit_outbox_dropped_total rate.
# --- Iter 21 outbox leading-indicator rules
# Fill-pct (depth/capacity) leads the absolute-depth rule above and
# catches pods with different capacities consistently.
- alert: BudgetOutboxFillPctHigh
expr: |
max by (pod) (gateway_budget_debit_outbox_depth)
/ max by (pod) (gateway_budget_debit_outbox_capacity) > 0.5
for: 5m
labels: { severity: warn, team: ai-platform }
annotations:
summary: "Outbox fill-pct > 50% on {{ $labels.pod }} for 5m"
description: |
Approaching capacity. See production runbook Recipe 6. Normal
bursts self-heal; a sustained climb is usually flush failure.
# Flush-failure rate leads depth climb, catches slow/unreachable
# control plane BEFORE events back up enough to alert on depth.
- alert: BudgetOutboxFlushFailures
expr: rate(gateway_budget_debit_outbox_flush_failures_total[5m]) > 0
for: 5m
labels: { severity: warn, team: ai-platform }
annotations:
summary: "Outbox flush failures for 5m on {{ $labels.pod }}"
description: |
Control plane is slow/unreachable; debits are re-enqueued and
safe, but latency to reconcile is growing. Investigate
control-plane /budget/debit latency before this turns into a depth alert.
# 4xx drops = silent data loss. Any non-zero rate is immediately
# actionable (signing drift / payload drift / control-plane bug).
- alert: BudgetOutbox4xxDrops
expr: increase(gateway_budget_debit_outbox_4xx_drops_total[15m]) > 0
labels: { severity: page, team: ai-platform }
annotations:
summary: "Outbox debits are being terminally dropped with 4xx"
description: |
Silent data loss. Debits are being permanently rejected by the
control plane. Most common: LW_GATEWAY_INTERNAL_SECRET drift
after rotation. See production runbook Recipe 6.
- alert: BudgetCheckLiveFailRate
expr: |
sum(rate(gateway_budget_check_live_total{outcome="transport_error"}[5m]))
/ sum(rate(gateway_budget_check_live_total[5m])) > 0.2
for: 5m
labels: { severity: warn, team: ai-platform }
annotations:
summary: "Live /budget/check failing open > 20%"
description: |
Live reconciliation for near-limit scopes is timing out or erroring
and falling back to the cached snapshot. Near-limit customers may
briefly over-consume. Check control-plane /budget/check latency.
# ─── Auth cache ──────────────────────────────────────────────────────
- alert: AuthCacheHitRateDropped
expr: |
sum(rate(gateway_auth_cache_hits_total{tier="l1"}[5m]))
/ ( sum(rate(gateway_auth_cache_hits_total{tier="l1"}[5m]))
+ sum(rate(gateway_auth_cache_misses_total{tier="l1"}[5m])) ) < 0.9
for: 15m
labels: { severity: warn, team: ai-platform }
annotations:
summary: "L1 auth cache hit rate < 90% for 15m"
description: |
Sustained cache-miss rate means every miss costs a /resolve-key
round-trip to the control plane. Investigate: recent deploy churn,
/changes feed reliability, or LRU eviction under load.
# ─── Cost & anomaly ──────────────────────────────────────────────────
# Note: per-request cost anomaly detection requires a cost metric
# (e.g. gateway_cost_usd_total) that isn't in the current collector set.
# It's easier to derive cost anomalies from the GatewayBudgetLedger
# table in the control-plane warehouse than from Prometheus, since
# cost attribution happens on the control-plane side after debit.
# See /gateway/usage UI for the visual equivalent.
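#
# A minimal sketch of what a Prometheus-side rule could look like if a cost
# counter were ever exported. Everything here is hypothetical: the metric
# name gateway_cost_usd_total is only the example from the note above, and
# the alert name, the 2x day-over-day threshold, the windows, and the
# severity are illustrative assumptions. It stays commented out because the
# metric is not in the current collector set.
#
# - alert: GatewayCostSpike
#   expr: |
#     sum(rate(gateway_cost_usd_total[1h]))
#       > 2 * sum(rate(gateway_cost_usd_total[1h] offset 1d))
#   for: 30m
#   labels: { severity: warn, team: ai-platform }
#   annotations:
#     summary: "Gateway spend rate > 2x the same hour yesterday"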
# ─── Blocked-by-policy noise ────────────────────────────────────────
- alert: GuardrailBlockSpike
expr: |
sum by (direction) (
rate(gateway_guardrail_verdicts_total{verdict="block"}[5m])
) > 10
for: 5m
labels: { severity: warn, team: ai-platform }
annotations:
summary: "> 10 req/s blocked by policy direction={{ $labels.direction }}"
description: |
Legit block storm (a customer's runtime is trying a banned tool
repeatedly) or a regression in the policy_rules config.
Check LangWatch traces filtered on
attr.langwatch.policy.blocked != "".