
# Queue Management

> Manage error groups, blocked queues, dead letter queue redriving, and draining

Queue Management is handled through three sections of the [Ops Dashboard](/self-hosting/ops/dashboard): **Blocked Groups**, **Dead Letter Queue (DLQ)**, and the **Groups** table. Together, they provide the tools to diagnose stuck processing, recover failed jobs, or discard unrecoverable work.

## Error Groups & Blocked Queues

When a job fails repeatedly, its group becomes **blocked**: no new jobs in that group are processed until the error is resolved. The **Blocked Groups** section clusters these failures by normalized error message, so recurring failure patterns are visible at a glance.

Each error cluster shows:

* **Count** — how many groups are affected
* **Error message** — normalized sample
* **Pipeline** — which pipeline stage produced the error
* **Sample group IDs** — for quick identification
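To make the clustering idea concrete, here is a minimal in-memory sketch of grouping blocked groups by a normalized error message. It is not the dashboard's implementation: the `ErrorCluster` type, the `normalize_error` rules (replacing UUIDs and numbers with placeholders), and the choice to key clusters on pipeline plus message are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass, field

UUID_RE = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)


@dataclass
class ErrorCluster:
    """Hypothetical shape of one row in the blocked-errors view."""
    message: str                 # normalized error message
    pipeline: str                # pipeline stage that produced the error
    group_ids: list[str] = field(default_factory=list)

    @property
    def count(self) -> int:
        return len(self.group_ids)


def normalize_error(message: str) -> str:
    """Replace variable parts so similar errors share one key (assumed rules)."""
    message = UUID_RE.sub("<uuid>", message)
    message = re.sub(r"\d+", "<n>", message)
    return message.strip()


def cluster_blocked_groups(blocked):
    """Cluster (group_id, pipeline, error_message) tuples, largest cluster first."""
    clusters = {}
    for group_id, pipeline, error in blocked:
        key = (pipeline, normalize_error(error))
        if key not in clusters:
            clusters[key] = ErrorCluster(message=key[1], pipeline=pipeline)
        clusters[key].group_ids.append(group_id)
    return sorted(clusters.values(), key=lambda c: c.count, reverse=True)
```

With this sketch, two errors such as `Timeout after 3000 ms for trace 42` and `Timeout after 5000 ms for trace 7` collapse into a single cluster, which is the effect the Count column reflects.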

### Actions on Blocked Groups

<Note>
  All write actions require the `ops:manage` permission.
</Note>

| Action             | Effect                                                                                                                                          |
| ------------------ | ----------------------------------------------------------------------------------------------------------------------------------------------- |
| **Unblock All**    | Retries all blocked groups in the queue immediately                                                                                             |
| **Canary Unblock** | Retries a small random sample (default 5, configurable up to 100) to test whether the underlying issue is resolved before unblocking everything |
| **Move to DLQ**    | Moves all blocked groups to the Dead Letter Queue for later inspection or reprocessing                                                          |
| **Drain**          | Permanently discards all jobs in an error cluster — use when jobs are unrecoverable                                                             |

<Warning>
  **Drain is irreversible.** Drained jobs cannot be recovered. Always prefer moving to DLQ first if there's any chance the jobs can be reprocessed later.
</Warning>
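The sampling behind **Canary Unblock** can be pictured with a short sketch. It only models the documented defaults (a random sample of 5, configurable up to 100). The function name `pick_canary` is hypothetical, and clamping an oversized count to the limit (rather than rejecting it) is an assumption.

```python
import random

DEFAULT_CANARY_COUNT = 5     # documented default sample size
MAX_CANARY_COUNT = 100       # documented upper limit


def pick_canary(group_ids, count=DEFAULT_CANARY_COUNT, rng=None):
    """Split blocked groups into (canary sample, remaining groups).

    Assumption: counts above the limit are clamped, and a queue with
    fewer groups than `count` is sampled in full.
    """
    if count < 1:
        raise ValueError("count must be at least 1")
    group_ids = list(group_ids)
    count = min(count, MAX_CANARY_COUNT, len(group_ids))
    rng = rng or random.Random()
    canary = rng.sample(group_ids, count)
    chosen = set(canary)
    remaining = [g for g in group_ids if g not in chosen]
    return canary, remaining
```

Only the canary sample is retried. The remaining groups stay blocked until the sample has been observed, at which point an operator can decide whether to unblock everything.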

## Dead Letter Queue (DLQ)

The DLQ holds groups that were moved out of normal processing, either automatically after exceeding retry limits or manually via the **Move to DLQ** action. Items in the DLQ are not processed until an operator takes action.

Each DLQ entry shows:

* **Queue name** — which queue the group came from
* **Group ID** and **Pipeline** — for identification
* **Error message** — the error that caused the failure
* **Job count** — how many jobs are in the group

### Redriving from the DLQ

| Action              | Effect                                                                                                     |
| ------------------- | ---------------------------------------------------------------------------------------------------------- |
| **Replay All**      | Moves all DLQ groups in a queue back to the main queue for reprocessing from the beginning                 |
| **Replay** (single) | Moves a single group back to the main queue                                                                |
| **Canary Redrive**  | Test-replays a small random sample (default 5, configurable up to 100) before committing to a full redrive |

<Tip>
  **Use canary redrives after deploying a fix.** If the fix works for the canary batch, replay the rest. If it doesn't, the canary groups return to the DLQ and you haven't made the problem worse.
</Tip>
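The canary-then-replay reasoning in the tip can be sketched as a small simulation. This is an illustrative model, not the dashboard's logic: `canary_then_replay`, `RedriveResult`, and the `process` callback (which returns `True` when a group reprocesses cleanly) are hypothetical names. In the dashboard the decision to replay the rest is made by an operator, whereas this sketch automates it.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class RedriveResult:
    replayed: list[str] = field(default_factory=list)      # processed cleanly
    still_in_dlq: list[str] = field(default_factory=list)  # left in or returned to the DLQ
    full_redrive: bool = False                              # True if the rest was replayed


def canary_then_replay(dlq, canary, process: Callable[[str], bool]) -> RedriveResult:
    """Replay the canary groups first; replay the rest only if all of them succeed."""
    result = RedriveResult()
    canary_set = set(canary)
    rest = [g for g in dlq if g not in canary_set]

    failed = []
    for group_id in canary:
        if process(group_id):
            result.replayed.append(group_id)
        else:
            failed.append(group_id)

    if failed:
        # The fix did not hold: failed canaries return to the DLQ,
        # and the untouched groups never left it.
        result.still_in_dlq = failed + rest
        return result

    # Canary batch was clean, so redrive the remaining groups.
    result.full_redrive = True
    for group_id in rest:
        if process(group_id):
            result.replayed.append(group_id)
        else:
            result.still_in_dlq.append(group_id)
    return result
```

The point of the design is that a failed canary touches at most the sampled groups, so a fix that does not work costs a handful of retries instead of a full redrive.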

## Groups Table

The Groups table provides a detailed per-group view of all processing groups across queues.

Each row shows:

* **Group ID** — the logical partition key
* **Pipeline** — which pipeline stage this group belongs to
* **Pending** — number of jobs waiting to be processed
* **Retries** — retry count (orange if > 0)
* **Oldest job age** — with a warning indicator if overdue
* **Status** — `OK`, `Active`, `Blocked`, or `Stale`

### Filtering

Filter groups by status to focus on problems:

* **All** — every group
* **Blocked** — groups stuck due to errors
* **Stale** — blocked groups that have been waiting too long
* **Active** — groups currently being processed
* **OK** — healthy groups

A search box lets you filter by Group ID, Pipeline name, or error message.
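As a rough mental model of how the status filter and search box combine, the following sketch filters an in-memory list of rows. The `GroupRow` fields mirror the columns described above, but the type itself, the `filter_groups` function, exact-status matching, and case-insensitive substring search are assumptions rather than documented behavior.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GroupRow:
    """Hypothetical in-memory representation of one Groups table row."""
    group_id: str
    pipeline: str
    pending: int
    retries: int
    status: str                  # "OK", "Active", "Blocked", or "Stale"
    error: Optional[str] = None


def filter_groups(rows, status="All", query=""):
    """Return rows matching the status filter and the search text."""
    needle = query.strip().lower()
    matches = []
    for row in rows:
        if status != "All" and row.status != status:
            continue
        # Search covers Group ID, Pipeline name, and error message.
        haystack = " ".join([row.group_id, row.pipeline, row.error or ""]).lower()
        if needle and needle not in haystack:
            continue
        matches.append(row)
    return matches
```

For example, `filter_groups(rows, status="Blocked", query="timeout")` would narrow the view to blocked groups whose ID, pipeline, or error text mentions a timeout.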

### Group Detail

Click any row to open the **Group Detail** dialog, which shows:

* Full status and pipeline information
* Error message and stack trace
* Active job ID (if currently processing)
* Paginated list of all jobs in the group with their scores and raw data

## Common Workflows

### Recovering from a bad deployment

1. Check the **Blocked Groups** section. A spike in blocked groups after a deploy usually means the new code is crashing
2. Roll back the deployment
3. Use **Canary Unblock** to test that the rollback fixes the issue
4. If the canary succeeds, use **Unblock All** to resume processing

### Clearing stale data after a schema change

1. Identify affected groups in the **Groups** table using status filters
2. If the data can be reprocessed: **Move to DLQ**, fix the schema, then **Replay All**
3. If the data is obsolete: **Drain** the affected error cluster

### Testing a fix before full redrive

1. Deploy the fix
2. Go to the DLQ section
3. Use **Canary Redrive** with a count of 5 to 10
4. Monitor the dashboard for new failures
5. If no new failures appear, use **Replay All** to redrive the remaining items
