Queue Management is handled through three sections of the Ops Dashboard: Blocked Groups, Dead Letter Queue (DLQ), and the Groups table. Together, they provide the tools to diagnose stuck processing, recover failed jobs, or discard unrecoverable work.

Error Groups & Blocked Queues

When a job fails repeatedly, its group becomes blocked — no new jobs in that group are processed until the error is resolved. The Blocked section clusters these failures by normalized error message, so you can see patterns at a glance. Each error cluster shows:
  • Count — how many groups are affected
  • Error message — normalized sample
  • Pipeline — which pipeline stage produced the error
  • Sample group IDs — for quick identification
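The clustering described above groups failures by normalized error message. A minimal sketch of how such clustering could work (the normalization rules, field names, and sample limit of 3 here are assumptions for illustration, not the dashboard's actual implementation):

```python
import re
from collections import defaultdict

def normalize_error(message: str) -> str:
    """Collapse variable parts (hex addresses, numbers) so similar errors
    cluster together. These rules are illustrative assumptions."""
    msg = re.sub(r"0x[0-9a-fA-F]+", "<hex>", message)
    msg = re.sub(r"\d+", "<n>", msg)
    return msg

def cluster_blocked_groups(failures):
    """failures: list of (group_id, pipeline, error_message) tuples.
    Returns clusters keyed by normalized error message."""
    clusters = defaultdict(lambda: {"count": 0, "pipelines": set(),
                                    "sample_group_ids": []})
    for group_id, pipeline, error in failures:
        cluster = clusters[normalize_error(error)]
        cluster["count"] += 1
        cluster["pipelines"].add(pipeline)
        if len(cluster["sample_group_ids"]) < 3:
            cluster["sample_group_ids"].append(group_id)
    return clusters
```

With this normalization, "timeout after 30s" and "timeout after 45s" land in the same cluster, which is what makes the at-a-glance pattern view possible.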

Actions on Blocked Groups

All write actions require the ops:manage permission.
  • Unblock All — Retries all blocked groups in the queue immediately
  • Canary Unblock — Retries a small random sample (default 5, configurable up to 100) to test whether the underlying issue is resolved before unblocking everything
  • Move to DLQ — Moves all blocked groups to the Dead Letter Queue for later inspection or reprocessing
  • Drain — Permanently discards all jobs in an error cluster; use when jobs are unrecoverable
Drain is irreversible. Drained jobs cannot be recovered. Always prefer moving to DLQ first if there’s any chance the jobs can be reprocessed later.
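The canary selection itself is simple to picture. A minimal sketch, assuming the blocked group IDs are available as a list (the function name is hypothetical; only the default sample size of 5 and the cap of 100 come from the table above):

```python
import random

def canary_unblock(blocked_group_ids, sample_size=5, max_sample=100):
    """Pick a random sample of blocked groups to retry before unblocking
    everything. Returns (canary, remaining). Illustrative sketch only."""
    size = min(max(1, sample_size), max_sample, len(blocked_group_ids))
    canary = random.sample(blocked_group_ids, size)
    canary_set = set(canary)
    remaining = [g for g in blocked_group_ids if g not in canary_set]
    return canary, remaining
```

If the canary groups process cleanly, the remaining groups can then be unblocked in bulk.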

Dead Letter Queue (DLQ)

The DLQ holds groups that were explicitly moved there — either automatically after exceeding retry limits or manually via the “Move to DLQ” action. Items in the DLQ are not processed until an operator takes action. Each DLQ entry shows:
  • Queue name — which queue the group came from
  • Group ID and Pipeline — for identification
  • Error message — the error that caused the failure
  • Job count — how many jobs are in the group
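The fields above map naturally onto a small record. A hypothetical sketch of one DLQ entry (field names are assumptions based on the list above):

```python
from dataclasses import dataclass

@dataclass
class DlqEntry:
    """One row in the DLQ view. Field names are illustrative."""
    queue_name: str      # which queue the group came from
    group_id: str        # group identifier
    pipeline: str        # pipeline stage
    error_message: str   # the error that caused the failure
    job_count: int       # how many jobs are in the group
```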

Redriving from the DLQ

  • Replay All — Moves all DLQ groups in a queue back to the main queue for reprocessing from the beginning
  • Replay (single) — Moves a single group back to the main queue
  • Canary Redrive — Test-replays a small random sample (default 5, configurable up to 100) before committing to a full redrive
Use canary redrives after deploying a fix. If the fix works for the canary batch, replay the rest. If it doesn’t, the canary groups return to the DLQ and you haven’t made the problem worse.
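The fail-safe behavior described above (canary groups that still fail return to the DLQ) can be sketched as follows; `reprocess` is a hypothetical callback standing in for the real pipeline, returning True when a group reprocesses cleanly:

```python
import random

def canary_redrive(dlq_group_ids, reprocess, sample_size=5):
    """Replay a random sample from the DLQ. Groups whose reprocessing still
    fails go back to the DLQ. Illustrative sketch, not the real API."""
    canary = random.sample(dlq_group_ids, min(sample_size, len(dlq_group_ids)))
    succeeded, returned_to_dlq = [], []
    for gid in canary:
        (succeeded if reprocess(gid) else returned_to_dlq).append(gid)
    return succeeded, returned_to_dlq
```

Because only the sample is touched, a failed canary leaves the bulk of the DLQ exactly as it was.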

Groups Table

The Groups table provides a detailed per-group view of all processing groups across queues. Each row shows:
  • Group ID — the logical partition key
  • Pipeline — which pipeline stage this group belongs to
  • Pending — number of jobs waiting to be processed
  • Retries — retry count (orange if > 0)
  • Oldest job age — with a warning indicator if overdue
  • Status — OK, Active, Blocked, or Stale

Filtering

Filter groups by status to focus on problems:
  • All — every group
  • Blocked — groups stuck due to errors
  • Stale — blocked groups that have been waiting too long
  • Active — groups currently being processed
  • OK — healthy groups
A search box lets you filter by Group ID, Pipeline name, or error message.
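The filter and search behavior combine into a single predicate per row. A minimal sketch, assuming each group is a record with the fields listed above (the dict shape and case-insensitive matching are assumptions):

```python
def matches(group, status_filter="All", query=""):
    """Filter predicate mirroring the Groups table: a status tab plus a
    free-text search over Group ID, Pipeline, and error message."""
    if status_filter != "All" and group["status"] != status_filter:
        return False
    q = query.lower()
    return (q in group["group_id"].lower()
            or q in group["pipeline"].lower()
            or q in (group.get("error") or "").lower())
```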

Group Detail

Click any row to open the Group Detail dialog, which shows:
  • Full status and pipeline information
  • Error message and stack trace
  • Active job ID (if currently processing)
  • Paginated list of all jobs in the group with their scores and raw data

Common Workflows

Recovering from a bad deployment

  1. Check the Blocked section — a spike in blocked groups after a deploy usually means the new code is crashing
  2. Roll back the deployment
  3. Use Canary Unblock to test that the rollback fixes the issue
  4. If the canary succeeds, Unblock All to resume processing

Clearing stale data after a schema change

  1. Identify affected groups in the Groups table using status filters
  2. If the data can be reprocessed: Move to DLQ, fix the schema, then Replay All
  3. If the data is obsolete: Drain the affected error cluster

Testing a fix before full redrive

  1. Deploy the fix
  2. Go to the DLQ section
  3. Use Canary Redrive with a count of 5-10
  4. Monitor the dashboard for new failures
  5. If clean, Replay All to redrive the remaining items