Prometheus
The Helm chart includes an optional Prometheus instance that scrapes metrics from LangWatch components.
Enable Prometheus
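A minimal values fragment for turning on the bundled instance might look like the following. It assumes that the `prometheus.chartManaged` key listed in the configuration reference is the switch that controls the chart-managed Prometheus; check the `values.yaml` of your chart version before relying on it.

```yaml
# values.yaml (sketch): enable the chart-managed Prometheus instance.
# Assumption: prometheus.chartManaged is the enabling switch.
prometheus:
  chartManaged: true
```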
What Gets Scraped
| Component | Port | Endpoint | Metrics |
|---|---|---|---|
| App | 5560 | /metrics | HTTP request latency, error rates, active connections |
| Workers | 2999 | /metrics | Queue depth, job processing time, job success/failure rates |
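If you scrape these endpoints from a Prometheus instance that you manage yourself, the table translates into a scrape configuration roughly like the sketch below. The ports and paths come from the table; the hostnames `langwatch-app` and `langwatch-workers` are hypothetical placeholders for the Service names in your release.

```yaml
# prometheus.yml fragment (sketch) for a self-managed Prometheus.
# Hostnames are hypothetical; substitute the Service names from your release.
scrape_configs:
  - job_name: langwatch-app
    metrics_path: /metrics
    static_configs:
      - targets: ["langwatch-app:5560"]      # App metrics port
  - job_name: langwatch-workers
    metrics_path: /metrics
    static_configs:
      - targets: ["langwatch-workers:2999"]  # Workers metrics port
```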
Access Prometheus
Port-forward to the Prometheus UI with `kubectl port-forward`, then open the forwarded port in your browser.
External Prometheus
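To use an existing Prometheus instance instead of the chart-managed one, set `prometheus.chartManaged` to `false` and configure your instance to scrape the endpoints listed under What Gets Scraped.
Grafana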
Connect Grafana to your Prometheus instance to visualize LangWatch metrics.
Key Dashboards
Set up dashboards for:
- Trace throughput: traces ingested per minute
- Worker queue depth: BullMQ queue backlog (a growing backlog indicates a processing bottleneck)
- ClickHouse query latency: p50/p95/p99 query times
- Error rates: HTTP 5xx responses from App and Workers
- Resource utilization: CPU and memory per component
Example Queries
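The sketch below expresses a few queries for the dashboards above as Prometheus recording rules, so the PromQL sits in a loadable YAML file. The metric names `http_requests_total`, `http_request_duration_seconds_bucket`, and `bullmq_queue_waiting` are hypothetical, as is the `status` label. Replace them with the names your App and Workers actually expose on `/metrics`.

```yaml
# Recording rules (sketch). All metric and label names are hypothetical.
groups:
  - name: langwatch-example-queries
    rules:
      # Share of HTTP requests answered with a 5xx status over 5 minutes.
      - record: langwatch:http_5xx:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))
      # 95th percentile request latency, computed from a histogram.
      - record: langwatch:http_request_duration_seconds:p95_5m
        expr: |
          histogram_quantile(
            0.95,
            sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
          )
      # Deepest worker queue backlog across all queues.
      - record: langwatch:queue_depth:max
        expr: max(bullmq_queue_waiting)
```

Each `expr` can also be pasted directly into the Prometheus UI or a Grafana panel.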
Health & Readiness Checks
Endpoints
| Component | Endpoint | Method | Healthy Response |
|---|---|---|---|
| App | /api/health | GET | 200 OK |
| Workers | /healthz | GET | 200 OK |
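To illustrate how these endpoints can back Kubernetes probes, here is a sketch of a container spec fragment for the App. It does not reproduce the chart's actual defaults. The port is an assumption (that health is served on the same port as `/metrics`), and the timing values are illustrative.

```yaml
# Container spec fragment (sketch), not the chart's actual defaults.
# Assumption: the App serves /api/health on port 5560.
livenessProbe:
  httpGet:
    path: /api/health
    port: 5560
  initialDelaySeconds: 30   # illustrative: allow time for startup
  periodSeconds: 10
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /api/health
    port: 5560
  periodSeconds: 5
  failureThreshold: 3
```

For Workers, the same shape would apply with `/healthz` as the path.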
Kubernetes Probes
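The Helm chart configures probes automatically with a default configuration.
Manual Health Check
To check health manually, send a GET request to the endpoints listed above and expect a 200 OK response.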
Alerting Recommendations
Set up alerts for these critical conditions:
| Alert | Condition | Severity |
|---|---|---|
| Worker queue backlog | Queue depth > 10,000 for 5 min | Warning |
| Worker queue backlog (critical) | Queue depth > 100,000 for 5 min | Critical |
| ClickHouse memory | Memory usage > 80% of limit | Warning |
| ClickHouse disk | Hot storage > 85% full | Critical |
| PostgreSQL connections | Active connections > 80% of max | Warning |
| App error rate | HTTP 5xx rate > 5% for 5 min | Critical |
| Pod restarts | Pod restart count > 3 in 15 min | Warning |
Example Alertmanager Rule
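Alerting rules are evaluated by Prometheus, and the alerts they fire are routed through Alertmanager. The sketch below encodes two rows of the table above as a Prometheus rule file. The queue metric `bullmq_queue_waiting` is a hypothetical name. `kube_pod_container_status_restarts_total` comes from kube-state-metrics, so the second rule assumes that exporter is being scraped.

```yaml
# Prometheus alerting rules (sketch).
groups:
  - name: langwatch-alerts
    rules:
      - alert: WorkerQueueBacklog
        # Hypothetical metric name; use the queue-depth metric Workers expose.
        expr: max(bullmq_queue_waiting) > 10000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: Worker queue depth above 10,000 for 5 minutes
      - alert: PodRestarts
        # Assumes kube-state-metrics is scraped by this Prometheus.
        expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.pod }} restarted more than 3 times in 15 minutes"
```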
Prometheus Configuration Reference
Full Prometheus configuration in the Helm chart:
| Value | Description | Default |
|---|---|---|
| `prometheus.chartManaged` | Manage Prometheus via this chart | `true` |
| `prometheus.server.retention` | Data retention period | `60d` |
| `prometheus.server.persistentVolume.size` | Storage size | `6Gi` |
| `prometheus.server.persistentVolume.storageClass` | Storage class | `""` (default) |
| `prometheus.server.resources.requests.cpu` | CPU request | `200m` |
| `prometheus.server.resources.requests.memory` | Memory request | `512Mi` |
| `prometheus.server.resources.limits.cpu` | CPU limit | `500m` |
| `prometheus.server.resources.limits.memory` | Memory limit | `2Gi` |
| `prometheus.server.global.scrape_interval` | Scrape interval | `15s` |
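Written out as a values file, the defaults above would nest as in the sketch below. It assumes the usual Helm convention that each dot in a value path is one level of YAML nesting.

```yaml
# values.yaml (sketch): the documented defaults as nested keys.
prometheus:
  chartManaged: true
  server:
    retention: 60d
    persistentVolume:
      size: 6Gi
      storageClass: ""   # empty string selects the cluster's default class
    resources:
      requests:
        cpu: 200m
        memory: 512Mi
      limits:
        cpu: 500m
        memory: 2Gi
    global:
      scrape_interval: 15s
```

Override only the keys you need to change; Helm merges them over the chart defaults.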