Interpret dead letters¶

The dead-letter queue (heddle.tasks.dead_letter) is the router's escape hatch for tasks that can't be routed normally. When it fills, the right question is why, not how to drain it — the reasons are distinct and the fixes are different.

Symptom¶

Dead-letter consumer logs are growing: dead_letter.received lines with a non-empty reason field.
Workshop /dead-letters view shows a non-empty table.
Downstream callers experience missing results (often paired with Debug missing worker results).

Diagnosis¶

Every dead-letter envelope carries a reason field set by the router. There are three reasons today, each fixable independently:

Reason prefix	Meaning	Where it comes from
`invalid_task_message: ...`	Payload failed Pydantic validation as a `TaskMessage`	External sender violated the message contract
`unknown_tier: ...`	`model_tier` did not resolve via the router rules	Worker config drift, or sender set a wrong tier
`rate_limited: tier 'X' has no available capacity`	The token-bucket rate limiter for tier `X` exhausted	Sustained burst beyond `configs/router_rules.yaml`

To inspect:

# CLI consumer (subscribes and logs each arriving entry)
uv run heddle dead-letter monitor --nats-url $NATS_URL

# Or the Workshop UI
# (HEDDLE_WORKSHOP_TOKEN gates auth — see deploy-workshop-safely.md)
open http://localhost:8000/dead-letters

Reading the table:

task_id and worker_type come from the original payload (or None if the parse failed before they could be extracted — invalid_task_message).
reason is the human-readable explanation.
The full original data is preserved in the envelope so a replay reconstructs the exact same task.

Mitigation¶

`invalid_task_message`¶

The sender is wrong, not Heddle. Locate it:

# Trace where the bad task came from
rg "task_id\":.*<task_id_from_dead_letter>" --type log

External integrations (MCP gateway clients, CLI scripts, third-party publishers) are the usual culprits. Fix the sender; Heddle's role here is to fail loud, not to coerce malformed input.

`unknown_tier`¶

Either the router rules dropped a tier mapping, or the sender set a non-standard model_tier. Compare:

# What the sender requested
rg "task_id.*<id>" --type log | rg "model_tier"

# What the router knows about
cat configs/router_rules.yaml

Standard values are local, standard, frontier — anything else needs an explicit tier_overrides mapping in router_rules.yaml.

`rate_limited`¶

The tier's token bucket is empty. Three options:

Raise the rate limit in configs/router_rules.yaml and reload via nats pub heddle.control.reload '{}'.
Add replicas so each replica's share of the bucket goes further (queue groups load-balance, but the rate limiter is shared via NATS — confirm by reading RateLimiter docstring).
Replay later when the bucket has refilled. The replay flow re-injects the same task into heddle.tasks.incoming:

# Workshop UI
curl -X POST http://localhost:8000/dead-letters/{index}/replay

The router will re-evaluate the task with current rate-limit state.

Verify¶

After mitigation:

dead_letter.received log lines stop accruing.
The consumer's replay_count matches the entries you intentionally re-injected (Workshop view shows this).
Downstream callers no longer report missing results for the affected task_ids.

Followup¶

A persistently non-empty dead-letter queue is a signal, not a configuration. If rate_limited reasons dominate, the capacity model is wrong; if invalid_task_message dominates, an upstream integration is broken; if unknown_tier dominates, the router_rules.yaml is out of date with shipped worker configs.

Add an alert on dead_letter.received rate (e.g. > N/min sustained) so the queue is observed, not just retained.