Queue Management is handled through three sections of the Ops Dashboard: Blocked Groups, Dead Letter Queue (DLQ), and the Groups table. Together, they provide the tools to diagnose stuck processing, recover failed jobs, or discard unrecoverable work.

Error Groups & Blocked Queues

When a job fails repeatedly, its group becomes blocked — no new jobs in that group are processed until the error is resolved. The Blocked section clusters these failures by normalized error message, so you can see patterns at a glance. Each error cluster shows:
  • Count — how many groups are affected
  • Error message — normalized sample
  • Pipeline — which pipeline stage produced the error
  • Sample group IDs — for quick identification
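The clustering described above groups failures by normalized error message. A minimal sketch of how such clustering could work (the normalization rules, field names, and sample limit of 3 here are assumptions for illustration, not the dashboard's actual implementation):

```python
import re
from collections import defaultdict

def normalize_error(message: str) -> str:
    """Collapse variable parts (hex addresses, numbers) so similar errors
    cluster together. These rules are illustrative assumptions."""
    msg = re.sub(r"0x[0-9a-fA-F]+", "<hex>", message)
    msg = re.sub(r"\d+", "<n>", msg)
    return msg

def cluster_blocked_groups(failures):
    """failures: list of (group_id, pipeline, error_message) tuples.
    Returns clusters keyed by normalized error message."""
    clusters = defaultdict(lambda: {"count": 0, "pipelines": set(),
                                    "sample_group_ids": []})
    for group_id, pipeline, error in failures:
        cluster = clusters[normalize_error(error)]
        cluster["count"] += 1
        cluster["pipelines"].add(pipeline)
        if len(cluster["sample_group_ids"]) < 3:
            cluster["sample_group_ids"].append(group_id)
    return clusters
```

With this normalization, "timeout after 30s" and "timeout after 45s" land in the same cluster, which is what makes the at-a-glance pattern view possible.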

Actions on Blocked Groups

All write actions require the ops:manage permission.
  • Unblock All — Retries all blocked groups in the queue immediately
  • Canary Unblock — Retries a small random sample (default 5, configurable up to 100) to test whether the underlying issue is resolved before unblocking everything
  • Move to DLQ — Moves all blocked groups to the Dead Letter Queue for later inspection or reprocessing
  • Drain — Permanently discards all jobs in an error cluster; use when jobs are unrecoverable
Drain is irreversible. Drained jobs cannot be recovered. Always prefer moving to DLQ first if there’s any chance the jobs can be reprocessed later.
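The canary selection itself is simple to picture. A minimal sketch, assuming the blocked group IDs are available as a list (the function name is hypothetical; only the default sample size of 5 and the cap of 100 come from the table above):

```python
import random

def canary_unblock(blocked_group_ids, sample_size=5, max_sample=100):
    """Pick a random sample of blocked groups to retry before unblocking
    everything. Returns (canary, remaining). Illustrative sketch only."""
    size = min(max(1, sample_size), max_sample, len(blocked_group_ids))
    canary = random.sample(blocked_group_ids, size)
    canary_set = set(canary)
    remaining = [g for g in blocked_group_ids if g not in canary_set]
    return canary, remaining
```

If the canary groups process cleanly, the remaining groups can then be unblocked in bulk.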

Dead Letter Queue (DLQ)

The DLQ holds groups that were explicitly moved there — either automatically after exceeding retry limits or manually via the “Move to DLQ” action. Items in the DLQ are not processed until an operator takes action. Each DLQ entry shows:
  • Queue name — which queue the group came from
  • Group ID and Pipeline — for identification
  • Error message — the error that caused the failure
  • Job count — how many jobs are in the group
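The fields above map naturally onto a small record. A hypothetical sketch of one DLQ entry (field names are assumptions based on the list above):

```python
from dataclasses import dataclass

@dataclass
class DlqEntry:
    """One row in the DLQ view. Field names are illustrative."""
    queue_name: str      # which queue the group came from
    group_id: str        # group identifier
    pipeline: str        # pipeline stage
    error_message: str   # the error that caused the failure
    job_count: int       # how many jobs are in the group
```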

Redriving from the DLQ

  • Replay All — Moves all DLQ groups in a queue back to the main queue for reprocessing from the beginning
  • Replay (single) — Moves a single group back to the main queue
  • Canary Redrive — Test-replays a small random sample (default 5, configurable up to 100) before committing to a full redrive
Use canary redrives after deploying a fix. If the fix works for the canary batch, replay the rest. If it doesn’t, the canary groups return to the DLQ and you haven’t made the problem worse.
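The fail-safe behavior described above (canary groups that still fail return to the DLQ) can be sketched as follows; `reprocess` is a hypothetical callback standing in for the real pipeline, returning True when a group reprocesses cleanly:

```python
import random

def canary_redrive(dlq_group_ids, reprocess, sample_size=5):
    """Replay a random sample from the DLQ. Groups whose reprocessing still
    fails go back to the DLQ. Illustrative sketch, not the real API."""
    canary = random.sample(dlq_group_ids, min(sample_size, len(dlq_group_ids)))
    succeeded, returned_to_dlq = [], []
    for gid in canary:
        (succeeded if reprocess(gid) else returned_to_dlq).append(gid)
    return succeeded, returned_to_dlq
```

Because only the sample is touched, a failed canary leaves the bulk of the DLQ exactly as it was.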

Groups Table

The Groups table provides a detailed per-group view of all processing groups across queues. Each row shows:
  • Group ID — the logical partition key
  • Pipeline — which pipeline stage this group belongs to
  • Pending — number of jobs waiting to be processed
  • Retries — retry count (orange if > 0)
  • Oldest job age — with a warning indicator if overdue
  • Status — OK, Active, Blocked, or Stale

Filtering

Filter groups by status to focus on problems:
  • All — every group
  • Blocked — groups stuck due to errors
  • Stale — blocked groups that have been waiting too long
  • Active — groups currently being processed
  • OK — healthy groups
A search box lets you filter by Group ID, Pipeline name, or error message.
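The filter and search behavior combine into a single predicate per row. A minimal sketch, assuming each group is a record with the fields listed above (the dict shape and case-insensitive matching are assumptions):

```python
def matches(group, status_filter="All", query=""):
    """Filter predicate mirroring the Groups table: a status tab plus a
    free-text search over Group ID, Pipeline, and error message."""
    if status_filter != "All" and group["status"] != status_filter:
        return False
    q = query.lower()
    return (q in group["group_id"].lower()
            or q in group["pipeline"].lower()
            or q in (group.get("error") or "").lower())
```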

Group Detail

Click any row to open the Group Detail dialog, which shows:
  • Full status and pipeline information
  • Error message and stack trace
  • Active job ID (if currently processing)
  • Paginated list of all jobs in the group with their scores and raw data

Common Workflows

Recovering from a bad deployment

  1. Check the Blocked section — a spike in blocked groups after a deploy usually means the new code is crashing
  2. Roll back the deployment
  3. Use Canary Unblock to test that the rollback fixes the issue
  4. If the canary succeeds, Unblock All to resume processing

Clearing stale data after a schema change

  1. Identify affected groups in the Groups table using status filters
  2. If the data can be reprocessed: Move to DLQ, fix the schema, then Replay All
  3. If the data is obsolete: Drain the affected error cluster

Testing a fix before full redrive

  1. Deploy the fix
  2. Go to the DLQ section
  3. Use Canary Redrive with a count of 5-10
  4. Monitor the dashboard for new failures
  5. If clean, Replay All to redrive the remaining items