LangWatch exposes Prometheus metrics and health check endpoints for monitoring your self-hosted deployment.

Prometheus

The Helm chart includes an optional Prometheus instance that scrapes metrics from LangWatch components.

Enable Prometheus

# In your values.yaml
app:
  telemetry:
    metrics:
      enabled: true
      apiKey:
        value: "your-metrics-api-key"  # Authenticates scrape requests

prometheus:
  chartManaged: true
  server:
    retention: 30d
    persistentVolume:
      size: 20Gi

What Gets Scraped

| Component | Port | Endpoint | Metrics |
|---|---|---|---|
| App | 5560 | /metrics | HTTP request latency, error rates, active connections |
| Workers | 2999 | /metrics | Queue depth, job processing time, job success/failure rates |

Access Prometheus

Port-forward to the Prometheus UI:
kubectl -n langwatch port-forward svc/langwatch-prometheus-server 9090:9090
# Open http://localhost:9090

External Prometheus

To use an existing Prometheus instance instead of the chart-managed one:
prometheus:
  chartManaged: false
  external:
    existingSecret: prometheus-credentials
    secretKeys:
      host: "host"
      port: "port"
      username: "username"
      password: "password"
You’ll need to configure your external Prometheus to scrape the LangWatch pods. Pods are annotated with:
prometheus.io/scrape: "true"
prometheus.io/port: "5560"    # or 2999 for workers
prometheus.io/path: "/metrics"
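
With those annotations in place, a scrape job on the external Prometheus can discover the pods via Kubernetes service discovery. The sketch below uses the standard annotation-driven relabeling pattern; the bearer-token authorization is an assumption — verify how LangWatch expects the metrics API key to be presented before relying on it:

```yaml
# prometheus.yml on the external instance — a sketch, not chart-managed config.
scrape_configs:
  - job_name: langwatch
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names: [langwatch]
    authorization:
      type: Bearer
      credentials: your-metrics-api-key   # assumption: key sent as a bearer token
    relabel_configs:
      # Keep only pods that opt in via prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Honor the per-pod metrics path annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      # Honor the per-pod port annotation (5560 for App, 2999 for Workers)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
```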

Grafana

Connect Grafana to your Prometheus instance to visualize LangWatch metrics.

Key Dashboards

Set up dashboards for:
  • Trace throughput — traces ingested per minute
  • Worker queue depth — BullMQ queue backlog (indicates processing bottleneck)
  • ClickHouse query latency — p50/p95/p99 query times
  • Error rates — HTTP 5xx responses from App and Workers
  • Resource utilization — CPU and memory per component
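
To point Grafana at the chart-managed Prometheus, a provisioned datasource is the usual approach. A minimal sketch, assuming the default service name and namespace from the port-forward example above:

```yaml
# grafana/provisioning/datasources/langwatch.yaml
apiVersion: 1
datasources:
  - name: LangWatch Prometheus
    type: prometheus
    access: proxy
    # Assumes the langwatch-prometheus-server service in the langwatch namespace
    url: http://langwatch-prometheus-server.langwatch.svc.cluster.local:9090
    isDefault: true
```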

Example Queries

# Trace ingestion rate (per minute)
rate(langwatch_traces_ingested_total[5m]) * 60

# Worker queue depth
langwatch_worker_queue_depth

# HTTP error rate
rate(http_requests_total{status=~"5.."}[5m])
  / rate(http_requests_total[5m])

# ClickHouse query p95 latency
histogram_quantile(0.95, rate(clickhouse_query_duration_seconds_bucket[5m]))
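
If dashboards evaluate these expressions on every refresh, Prometheus recording rules can precompute them. A sketch reusing the queries above (the rule names are illustrative, not part of the chart):

```yaml
# recording-rules.yaml — illustrative names, same expressions as above
groups:
  - name: langwatch-recording
    rules:
      - record: langwatch:traces_ingested:per_minute
        expr: rate(langwatch_traces_ingested_total[5m]) * 60
      - record: langwatch:http_error_ratio:rate5m
        expr: |
          rate(http_requests_total{status=~"5.."}[5m])
            / rate(http_requests_total[5m])
```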

Health & Readiness Checks

Endpoints

| Component | Endpoint | Method | Healthy Response |
|---|---|---|---|
| App | /api/health | GET | 200 OK |
| Workers | /healthz | GET | 200 OK |

Kubernetes Probes

The Helm chart configures probes automatically. Default configuration:
# Startup probe (allows time for migrations)
startupProbe:
  httpGet:
    path: /api/health
    port: 5560
  periodSeconds: 5
  failureThreshold: 30    # Up to 150s for startup

# Liveness probe (restarts unhealthy pods)
livenessProbe:
  httpGet:
    path: /api/health
    port: 5560
  periodSeconds: 10
  failureThreshold: 5

# Readiness probe (removes from service if unhealthy)
readinessProbe:
  httpGet:
    path: /api/health
    port: 5560
  periodSeconds: 5
  failureThreshold: 3
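
If startup regularly exceeds the 150s budget (for example, long migrations on large databases), the probe settings can typically be overridden in values.yaml. The key path below is an assumption — confirm it against the chart's own values.yaml before using it:

```yaml
# Hypothetical override — verify the key path in the chart's values.yaml
app:
  startupProbe:
    periodSeconds: 5
    failureThreshold: 60    # Up to 300s for startup
```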

Manual Health Check

# Check app health
kubectl -n langwatch exec deploy/langwatch-app -- \
  curl -s http://localhost:5560/api/health

# Check worker health
kubectl -n langwatch exec deploy/langwatch-workers -- \
  curl -s http://localhost:2999/healthz

Alerting Recommendations

Set up alerts for these critical conditions:
| Alert | Condition | Severity |
|---|---|---|
| Worker queue backlog | Queue depth > 10,000 for 5 min | Warning |
| Worker queue backlog (critical) | Queue depth > 100,000 for 5 min | Critical |
| ClickHouse memory | Memory usage > 80% of limit | Warning |
| ClickHouse disk | Hot storage > 85% full | Critical |
| PostgreSQL connections | Active connections > 80% of max | Warning |
| App error rate | HTTP 5xx rate > 5% for 5 min | Critical |
| Pod restarts | Pod restart count > 3 in 15 min | Warning |

Example Prometheus Alerting Rule

groups:
  - name: langwatch
    rules:
      - alert: WorkerQueueBacklog
        expr: langwatch_worker_queue_depth > 10000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Worker queue backlog is high"
          description: "Queue depth is {{ $value }} — workers may need scaling."
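
The rule above only fires the alert; Alertmanager decides where it is delivered. A minimal routing sketch, assuming standard Alertmanager configuration (the receiver names, Slack channel, and routing key are placeholders):

```yaml
# alertmanager.yml — a minimal sketch; receivers and keys are placeholders
route:
  receiver: default
  routes:
    - matchers:
        - severity = critical
      receiver: oncall
receivers:
  - name: default
    slack_configs:
      - channel: "#langwatch-alerts"
  - name: oncall
    pagerduty_configs:
      - routing_key: your-pagerduty-routing-key
```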

Prometheus Configuration Reference

Full Prometheus configuration in the Helm chart:
| Value | Description | Default |
|---|---|---|
| prometheus.chartManaged | Manage Prometheus via this chart | true |
| prometheus.server.retention | Data retention period | 60d |
| prometheus.server.persistentVolume.size | Storage size | 6Gi |
| prometheus.server.persistentVolume.storageClass | Storage class | "" (default) |
| prometheus.server.resources.requests.cpu | CPU request | 200m |
| prometheus.server.resources.requests.memory | Memory request | 512Mi |
| prometheus.server.resources.limits.cpu | CPU limit | 500m |
| prometheus.server.resources.limits.memory | Memory limit | 2Gi |
| prometheus.server.global.scrape_interval | Scrape interval | 15s |