# Queue Management

> Manage error groups, blocked queues, dead letter queue redriving, and draining

Queue management is handled through three sections of the [Ops Dashboard](/docs/self-hosting/ops/dashboard): **Blocked Groups**, **Dead Letter Queue (DLQ)**, and the **Groups** table. Together, they provide the tools to diagnose stuck processing, recover failed jobs, and discard unrecoverable work.

## Error Groups & Blocked Queues

When a job fails repeatedly, its group becomes **blocked**: no further jobs in that group are processed until the error is resolved. The **Blocked Groups** section clusters these failures by normalized error message, so that groups failing for the same reason appear together and patterns are visible at a glance.

Each error cluster shows:

* **Count**: how many groups are affected
* **Error message**: normalized sample
* **Pipeline**: which pipeline stage produced the error
* **Sample group IDs**: for quick identification

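The normalization rules themselves are not documented on this page. As a rough illustration of the idea only, the sketch below clusters blocked groups by a normalized error message so that messages differing only in IDs or numbers fall into one cluster. The `BlockedGroup` shape, the `normalize_error` rules, and the pipeline names are assumptions made for this example, not LangWatch's implementation.

```python
import re
from collections import defaultdict
from dataclasses import dataclass


# Hypothetical shape of a blocked group; field names are illustrative only.
@dataclass
class BlockedGroup:
    group_id: str
    pipeline: str
    error: str


_UUID = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}", re.I
)
_NUMBER = re.compile(r"\d+")


def normalize_error(message: str) -> str:
    """Replace volatile tokens (UUIDs, numbers) with placeholders (assumed rules)."""
    message = _UUID.sub("<uuid>", message)
    message = _NUMBER.sub("<n>", message)
    return message.strip()


def cluster_errors(groups, sample_size=3):
    """Cluster groups by (pipeline, normalized error), largest cluster first."""
    buckets = defaultdict(list)
    for g in groups:
        buckets[(g.pipeline, normalize_error(g.error))].append(g.group_id)
    clusters = [
        {
            "count": len(ids),
            "error": error,
            "pipeline": pipeline,
            "sample_group_ids": ids[:sample_size],
        }
        for (pipeline, error), ids in buckets.items()
    ]
    return sorted(clusters, key=lambda c: c["count"], reverse=True)


groups = [
    BlockedGroup("g1", "ingest", "Timeout after 3000 ms for trace 17"),
    BlockedGroup("g2", "ingest", "Timeout after 4500 ms for trace 42"),
    BlockedGroup("g3", "evaluate", "Missing field 'input'"),
]
clusters = cluster_errors(groups)
```

Here the two timeout errors collapse into a single cluster with a count of 2, which mirrors the **Count**, **Error message**, **Pipeline**, and **Sample group IDs** columns described above.
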
### Actions on Blocked Groups

<Note>
  All write actions require the `ops:manage` permission.
</Note>

| Action             | Effect                                                                                                                                          |
| ------------------ | ----------------------------------------------------------------------------------------------------------------------------------------------- |
| **Unblock All**    | Retries all blocked groups in the queue immediately                                                                                             |
| **Canary Unblock** | Retries a small random sample (default 5, configurable up to 100) to test whether the underlying issue is resolved before unblocking everything |
| **Move to DLQ**    | Moves all blocked groups to the Dead Letter Queue for later inspection or reprocessing                                                          |
| **Drain**          | Permanently discards all jobs in an error cluster; use when jobs are unrecoverable                                                              |

<Warning>
  **Drain is irreversible.** Drained jobs cannot be recovered. Always prefer moving to DLQ first if there's any chance the jobs can be reprocessed later.
</Warning>

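To make the **Canary Unblock** sampling concrete, here is a minimal sketch of picking a random canary batch with the default of 5 and the upper bound of 100 from the table above. The function name `pick_canary` is hypothetical, and clamping an oversized count to 100 (rather than rejecting it) is an assumption; the dashboard may behave differently.

```python
import random

DEFAULT_CANARY_COUNT = 5  # default sample size stated in the table above
MAX_CANARY_COUNT = 100    # upper bound stated in the table above


def pick_canary(group_ids, count=DEFAULT_CANARY_COUNT, rng=None):
    """Return (canary, rest): a random sample and the groups left blocked.

    Assumption: counts above the maximum are clamped rather than rejected.
    """
    if count < 1:
        raise ValueError("count must be at least 1")
    if rng is None:
        rng = random.Random()
    size = min(count, MAX_CANARY_COUNT, len(group_ids))
    canary = rng.sample(list(group_ids), size)
    chosen = set(canary)
    rest = [g for g in group_ids if g not in chosen]
    return canary, rest


blocked = [f"group-{i}" for i in range(250)]
canary, rest = pick_canary(blocked, rng=random.Random(0))
```

Only the canary batch is retried; the remaining groups stay blocked until you decide whether to **Unblock All**.
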
## Dead Letter Queue (DLQ)

The DLQ holds groups that were moved there either automatically, after exceeding retry limits, or manually, via the "Move to DLQ" action. Items in the DLQ are not processed until an operator takes action.

Each DLQ entry shows:

* **Queue name**: which queue the group came from
* **Group ID** and **Pipeline**: for identification
* **Error message**: the error that caused the failure
* **Job count**: how many jobs are in the group

### Redriving from the DLQ

| Action              | Effect                                                                                                     |
| ------------------- | ---------------------------------------------------------------------------------------------------------- |
| **Replay All**      | Moves all DLQ groups in a queue back to the main queue for reprocessing from the beginning                 |
| **Replay** (single) | Moves a single group back to the main queue                                                                |
| **Canary Redrive**  | Test-replays a small random sample (default 5, configurable up to 100) before committing to a full redrive |

<Tip>
  **Use canary redrives after deploying a fix.** If the fix works for the canary batch, replay the rest. If it doesn't, the canary groups return to the DLQ and you haven't made the problem worse.
</Tip>

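The canary-then-replay decision can be modeled in a few lines. The sketch below is an illustration under stated assumptions, not the dashboard's code: `canary_redrive` and the `process` callback are hypothetical names, the DLQ is reduced to a plain list of group IDs, and the follow-up replay is automated here even though in the dashboard it is a manual **Replay All** click.

```python
import random


def canary_redrive(dlq, process, count=5, rng=None):
    """Replay a random sample from the DLQ, then the rest only if it is clean.

    `dlq` is a list of group IDs. `process` is a hypothetical callable that
    returns True when a replayed group succeeds.
    Returns (succeeded, still_in_dlq).
    """
    if rng is None:
        rng = random.Random()
    size = min(count, 100, len(dlq))
    canary = rng.sample(dlq, size)
    canary_set = set(canary)
    rest = [g for g in dlq if g not in canary_set]

    succeeded, failed = [], []
    for g in canary:
        (succeeded if process(g) else failed).append(g)

    if failed:
        # Failed canary groups return to the DLQ; the rest never left it.
        return succeeded, failed + rest

    # Clean canary: replay everything else (the "Replay All" step).
    for g in rest:
        (succeeded if process(g) else failed).append(g)
    return succeeded, failed


dlq = [f"group-{i}" for i in range(20)]
ok, remaining = canary_redrive(dlq, process=lambda g: True)
```

With a `process` that always fails, only the canary groups are attempted and all 20 groups end up back in the DLQ, which is the "you haven't made the problem worse" property the tip describes.
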
## Groups Table

The Groups table provides a detailed per-group view of all processing groups across queues.

Each row shows:

* **Group ID**: the logical partition key
* **Pipeline**: which pipeline stage this group belongs to
* **Pending**: number of jobs waiting to be processed
* **Retries**: retry count (orange if > 0)
* **Oldest job age**: with a warning indicator if overdue
* **Status**: `OK`, `Active`, `Blocked`, or `Stale`

### Filtering

Filter groups by status to focus on problems:

* **All**: every group
* **Blocked**: groups stuck due to errors
* **Stale**: blocked groups that have been waiting too long
* **Active**: groups currently being processed
* **OK**: healthy groups

A search box lets you filter by Group ID, Pipeline name, or error message.

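As a rough model of how the status filter and search box combine, the sketch below filters rows by status and then by a search term. The `GroupRow` fields and `filter_groups` are hypothetical names, and two behaviors are assumptions: the search is treated as a case-insensitive substring match, and each group carries exactly one status (so the **Blocked** filter does not include **Stale** groups here).

```python
from dataclasses import dataclass
from typing import Optional


# Hypothetical row shape mirroring the columns listed above.
@dataclass
class GroupRow:
    group_id: str
    pipeline: str
    pending: int
    retries: int
    status: str  # "OK", "Active", "Blocked", or "Stale"
    error: Optional[str] = None


def filter_groups(rows, status="All", query=""):
    """Keep rows that match the status filter and the search term."""
    needle = query.strip().lower()
    result = []
    for row in rows:
        if status != "All" and row.status != status:
            continue
        fields = (row.group_id, row.pipeline, row.error or "")
        if needle and not any(needle in f.lower() for f in fields):
            continue
        result.append(row)
    return result


rows = [
    GroupRow("g1", "ingest", pending=0, retries=0, status="OK"),
    GroupRow("g2", "ingest", pending=12, retries=3, status="Blocked",
             error="Timeout calling upstream"),
    GroupRow("g3", "evaluate", pending=40, retries=7, status="Stale",
             error="Timeout calling upstream"),
]
blocked_only = filter_groups(rows, status="Blocked")
timeouts = filter_groups(rows, query="timeout")
```
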
### Group Detail

Click any row to open the **Group Detail** dialog, which shows:

* Full status and pipeline information
* Error message and stack trace
* Active job ID (if currently processing)
* Paginated list of all jobs in the group with their scores and raw data

## Common Workflows

### Recovering from a bad deployment

1. Check the **Blocked Groups** section: a spike in blocked groups after a deploy usually means the new code is crashing
2. Roll back the deployment
3. Use **Canary Unblock** to test that the rollback fixes the issue
4. If the canary succeeds, **Unblock All** to resume processing

### Clearing stale data after a schema change

1. Identify affected groups in the **Groups** table using status filters
2. If the data can be reprocessed: **Move to DLQ**, fix the schema, then **Replay All**
3. If the data is obsolete: **Drain** the affected error cluster

### Testing a fix before full redrive

1. Deploy the fix
2. Go to the DLQ section
3. Use **Canary Redrive** with a count of 5-10
4. Monitor the dashboard for new failures
5. If clean, **Replay All** to redrive the remaining items