Operator Runbooks

Incident-style runbooks for the most common failure modes operators hit while running Heddle in production. Each follows the same shape:

  • Symptom — what an operator actually sees.
  • Diagnosis — concrete steps to localise the cause (log greps, CLI commands, code references).
  • Mitigation — what to do once you know.
  • Verify — how to confirm the system is healthy again.
  • Followup — root-cause work to prevent recurrence.

These complement, rather than replace, the Design Invariants (the why behind the constraints) and the Troubleshooting page (quick-fix recipes for developer-facing setup issues). The runbooks here assume a deployed system and an operator under time pressure.

Catalogue

| Runbook | When to reach for it |
| Debug missing worker results | Caller times out but a worker logs success |
| Interpret dead letters | heddle.tasks.dead_letter is filling up |
| Verify NATS connectivity | Actor startup is failing or flaky |
| Deploy Workshop safely | Standing up Workshop outside loopback |
| Recover from stuck eval runs | Eval Runner row sits at running forever |
| Recover from a stuck orchestrator goal | Goal never publishes a final result |

Conventions

Log keys quoted in these runbooks are exact structlog event names, so they are greppable with no escaping: rg dispatch.parse_error lands directly on the call site that emits the event. Where a runbook references a fix commit (e.g. b453298), the commit message carries the full context of the bug and names the test that pins the fix against regression.
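
When the logs in question are captured process output rather than source code, the same exact-name convention makes filtering straightforward. The sketch below assumes the deployment renders structlog output as one JSON object per line with the event name under the "event" key (structlog's default key, but the JSON rendering itself is an assumption about your setup). The helper name filter_events, the task_id field, and the dispatch.ok event are hypothetical and exist only for illustration.

```python
import json


def filter_events(lines, event_name):
    """Yield parsed log records whose structlog event name matches exactly.

    Assumes one JSON object per line with the event name under "event".
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            # Skip anything that is not JSON (tracebacks, plain-text output).
            continue
        if isinstance(record, dict) and record.get("event") == event_name:
            yield record


# Hypothetical sample lines; field names other than "event" are illustrative.
sample = [
    '{"event": "dispatch.parse_error", "level": "error", "task_id": "t-1"}',
    '{"event": "dispatch.ok", "level": "info", "task_id": "t-2"}',
    "not json at all",
]

matches = list(filter_events(sample, "dispatch.parse_error"))
for record in matches:
    print(record)
```

Matching on the whole event name with equality, rather than a substring or regex, avoids accidental hits on events that merely share a prefix.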