Operator Runbooks¶
Incident-style runbooks for the most common failure modes operators hit while running Heddle in production. Each follows the same shape:
- Symptom — what an operator actually sees.
- Diagnosis — concrete steps to localise the cause (log greps, CLI commands, code references).
- Mitigation — what to do once you know.
- Verify — how to confirm the system is healthy again.
- Followup — root-cause work to prevent recurrence.
These complement, rather than replace, the Design Invariants (the why behind the constraints) and the Troubleshooting page (quick-fix recipes for developer-facing setup issues). The runbooks here assume a deployed system and an operator under time pressure.
Catalogue¶
| Runbook | When to reach for it |
|---|---|
| Debug missing worker results | Caller times out but a worker logs success |
| Interpret dead letters | heddle.tasks.dead_letter is filling up |
| Verify NATS connectivity | Actor startup is failing or flaky |
| Deploy Workshop safely | Standing up Workshop outside loopback |
| Recover from stuck eval runs | Eval Runner row sits at running forever |
| Recover from a stuck orchestrator goal | Goal never publishes a final result |
Conventions¶
Log keys quoted in these runbooks are exact structlog event names.
Greppable with no escaping: rg dispatch.parse_error lands directly
on the call site that emits them. Where a runbook references a fix
commit (e.g. b453298), the commit message carries the full context
of the bug and the test pin that prevents regression.