ADR-012: Drop message-level retry fields
Status: Accepted.
Supersedes: the documented-but-unimplemented retry contract on
TaskMessage and the RETRY value on TaskStatus.
Pairs with: Invariant 1
(stateless workers reset between tasks — unchanged by this decision).
Context
TaskMessage carried two fields and TaskStatus carried one value
that together described a worker-side retry contract:
- TaskMessage.max_retries: int = 2, with the docstring "Max retry attempts before permanent failure".
- TaskMessage.retry_count: int = 0, with the docstring "Incremented on each retry by TaskWorker".
- TaskStatus.RETRY, with the state-transition note FAILED -> RETRY -> PROCESSING (handled by TaskWorker).
The 2026-05-10 repository review (§8.8) flagged the entire contract
as dead: no code path reads task.max_retries, no code path
increments task.retry_count, and no code path emits or consumes
the RETRY status. The fields and the enum value were aspirational
shims for a worker-side retry feature that was never built.
The 2026-05-11 wire-protocol externalisation (schemas/v1/ published
in bf13827) made the dead surface load-bearing for the first time:
foreign-language SDKs would generate types for retry_count,
max_retries, and RETRY from the canonical JSON Schemas, treat
them as a stable contract, and bake them into their wire-handling
code. Once that happens, the fields can't be removed without a
breaking schemas/v2/ migration.
The decision was forced by the schema publication: either implement
the retry contract before downstream SDKs lock to it, or remove it
while the schema is still v1 and uncoupled from external consumers.
Decision
Drop the dead surface. Specifically:
- Remove max_retries from TaskMessage.
- Remove retry_count from TaskMessage.
- Remove RETRY from TaskStatus.
- Regenerate schemas/v1/task_message.schema.json and schemas/v1/task_result.schema.json to reflect the trimmed shape. CI's schema-drift gate enforces synchronisation.
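The kind of check that confirms the regenerated schemas no longer carry the dead surface can be sketched roughly as follows. This is not the project's drift gate: find_dead_surface, the sample schema, and its $defs layout are assumptions for illustration. Only the removed names (max_retries, retry_count, RETRY) come from this ADR.

```python
from typing import Any, Iterator

# Names this ADR removes from the wire contract.
DEAD_PROPERTIES = {"max_retries", "retry_count"}
DEAD_ENUM_VALUE = "RETRY"


def find_dead_surface(node: Any, path: str = "$") -> Iterator[str]:
    """Yield path-like locations where a removed name still appears in a schema."""
    if isinstance(node, dict):
        props = node.get("properties")
        if isinstance(props, dict):
            for name in props:
                if name in DEAD_PROPERTIES:
                    yield f"{path}.properties.{name}"
        enum = node.get("enum")
        if isinstance(enum, list):
            for value in enum:
                # Case-insensitive, because the wire spelling of the enum is assumed.
                if isinstance(value, str) and value.upper() == DEAD_ENUM_VALUE:
                    yield f"{path}.enum[{value}]"
        for key, child in node.items():
            yield from find_dead_surface(child, f"{path}.{key}")
    elif isinstance(node, list):
        for index, child in enumerate(node):
            yield from find_dead_surface(child, f"{path}[{index}]")


# Illustrative pre-removal schema; the real files are not reproduced here.
stale_schema = {
    "properties": {"task_id": {"type": "string"}, "max_retries": {"type": "integer"}},
    "$defs": {"TaskStatus": {"enum": ["PROCESSING", "FAILED", "RETRY"]}},
}
leftovers = list(find_dead_surface(stale_schema))
```

A non-empty leftovers list would mean the schema on disk still describes the removed retry contract.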
Retry semantics that DO exist in Heddle remain unchanged, and the documentation now points to them explicitly:
- Stage-level retry lives in PipelineOrchestrator via the per-stage max_retries: int YAML field on each pipeline stage. Validated at point-of-use (ADR-007 / G7 is unrelated; the validation lives in src/heddle/orchestrator/pipeline.py:340-348).
- Bus-level redelivery is handled by NATS queue-group semantics: when a worker disconnects mid-task without acknowledging, NATS re-routes the message to another consumer in the same queue group. This is the framework's transparent retry path for worker-process failures.
Neither of these requires fields on TaskMessage. The first lives in
the pipeline's YAML config; the second is invisible to the
application layer.
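A minimal sketch of what stage-level retry amounts to, assuming a stage is a plain callable and that max_retries counts extra attempts after the first. The run_stage helper, StageError, and the stage dict are hypothetical and do not reproduce PipelineOrchestrator. The point is only that the retry budget is read from stage config, so the message envelope carries no retry state.

```python
from typing import Any, Callable, Dict, Optional


class StageError(Exception):
    """Hypothetical error raised when a stage exhausts its retry budget."""


def run_stage(
    stage: Dict[str, Any],
    call: Callable[[Dict[str, Any]], Dict[str, Any]],
    payload: Dict[str, Any],
) -> Dict[str, Any]:
    # Assumption: max_retries is the number of extra attempts after the first.
    max_retries = stage.get("max_retries", 0)
    if not isinstance(max_retries, int) or max_retries < 0:
        raise ValueError(f"stage {stage.get('name')!r}: max_retries must be a non-negative int")

    last_error: Optional[Exception] = None
    for _attempt in range(max_retries + 1):
        try:
            # Each attempt is a fresh dispatch; nothing is tracked on the message.
            return call(payload)
        except Exception as exc:  # deliberately broad in this sketch
            last_error = exc
    raise StageError(
        f"stage {stage.get('name')!r} failed after {max_retries + 1} attempts"
    ) from last_error


# Usage: a stage that fails once, then succeeds on the retry.
calls = {"n": 0}


def flaky(payload: Dict[str, Any]) -> Dict[str, Any]:
    calls["n"] += 1
    if calls["n"] == 1:
        raise RuntimeError("transient LLM failure")
    return {"ok": True, **payload}


result = run_stage({"name": "summarise", "max_retries": 1}, flaky, {"doc": "a"})
```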
Alternatives considered
Implement worker-side retry (rejected)
Wire the dead fields up: have TaskWorker.handle_message on
failure increment task.retry_count and republish the task if
retry_count < max_retries. This was the "make the docstring true"
option.
- Rejected because it would evolve Invariant 1 from "stateless workers reset between tasks" to "stateless workers reset between tasks; retry handled at the message level." That's a meaningful change to the framework's safety contract: the worker's reset() would have to coordinate with whatever cached the in-flight task pending republish, and the queue-group dispatch guarantee ("one consumer per task") would gain an exception ("... unless the task is being retried, in which case the same consumer re-publishes").
- The pipeline-level max_retries already covers the documented use case ("the LLM failed transiently; try again"). The worker-level retry would only add value for the narrower case "the LLM succeeded but the network round-trip to NATS failed during publish," which the existing F1 publish-failure regression test pins as a fail-the-task path; the orchestrator timeout + per-stage retry recovers cleanly from there.
- No concrete workload has asked for worker-side retry. Adding Invariant-changing complexity for a hypothetical user violates the project's minimum-code discipline.
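For contrast, the rejected design can be sketched as follows. This is a hypothetical reconstruction from the description above, not code that ever existed in the repository: LegacyTask, RejectedWorker, and the in-memory outbox are stand-ins. It shows where the extra state appears, namely a failed task that the worker holds until the republish goes out.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Optional


@dataclass(frozen=True)
class LegacyTask:
    # Stand-in for the old TaskMessage shape; defaults match the removed fields.
    task_id: str
    max_retries: int = 2
    retry_count: int = 0


class RejectedWorker:
    """Hypothetical worker with the message-level retry that was NOT built."""

    def __init__(self, process: Callable[[LegacyTask], str]) -> None:
        self.process = process
        self.outbox: List[LegacyTask] = []         # stands in for republishing to the bus
        self.pending: Optional[LegacyTask] = None  # state that outlives the task

    def handle_message(self, task: LegacyTask) -> Optional[str]:
        try:
            return self.process(task)
        except Exception:
            if task.retry_count < task.max_retries:
                # The worker now owns an in-flight task across the failure boundary.
                self.pending = replace(task, retry_count=task.retry_count + 1)
                self.outbox.append(self.pending)
            return None

    def reset(self) -> None:
        # This sketch simply drops the pending task. A real implementation would
        # first have to confirm the republish went out, which is the coupling
        # this ADR rejects.
        self.pending = None


def always_fail(task: LegacyTask) -> str:
    raise RuntimeError("transient failure")


worker = RejectedWorker(always_fail)
worker.handle_message(LegacyTask("t1"))                 # republished with retry_count=1
worker.handle_message(LegacyTask("t2", retry_count=2))  # budget exhausted, not republished
```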
Keep the fields, document them as reserved (rejected)
Leave the fields in TaskMessage and the value in TaskStatus,
but document them as "reserved for future use."
- Rejected because "reserved" fields on a wire contract still appear in generated SDK types, still occupy schema-versioning semantics, and still confuse readers who expect them to do something. A future implementation that "fills in" the reserved fields with different semantics than the original docstring implied would be a behavioural change masquerading as a feature-add.
- The cost of removal now is one commit; the cost of removal later (after the foreign-language SDKs are generated) is a schemas/v2/ break.
Defer the decision; ship the release with the dead fields (rejected)
Mark schemas/v1/ as "experimental — wire shape may change before
v1.0" and decide K2 separately.
- Rejected because "experimental" is doc cover for "we haven't decided yet," and the decision is cheap. Deferral has a real cost (the dead surface keeps confusing every reader) and no real benefit (no concrete user is asking for worker-side retry that the deferral preserves the option for).
Consequences
Enables:
- The wire contract documents only behaviours the framework actually implements. A foreign-language SDK author reading schemas/v1/task_message.schema.json sees the truth.
- Invariant 1 keeps its tight shape. The "stateless" claim is no longer caveated by docstrings that describe a feature that doesn't exist.
- Future contributors who want retries find the pipeline-stage path documented at the right level (stage YAML, not message envelope) and don't get steered toward the abandoned worker-side design.
Costs:
- This is a breaking change to the TaskMessage and TaskStatus Python types. External code that constructed TaskMessage with max_retries= or retry_count=, or pattern-matched on TaskStatus.RETRY, will need to be updated.
- Mitigation: no in-tree consumer outside test code did any of these things (verified by grep). The dead surface was genuinely dead.
- The schemas/v1/*.schema.json files change shape. CI's drift gate enforces the regeneration. Foreign-language SDKs that were already generated against the previous v1 shape (none today) would need to regenerate.
- If a concrete worker-side retry workload appears later, the reintroduction will need its own ADR + invariant evolution. That's the right time to design it: when the actual use case is concrete and the constraints can be honest, rather than pre-emptively shimmed.
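For external callers, the breakage and one possible shim can be sketched as below. TaskMessage here is a stand-in dataclass with hypothetical task_id and payload fields, and strip_dead_fields is an illustrative helper, not something shipped with the framework. The sketch assumes the real type rejects unknown keyword arguments, as a plain dataclass does.

```python
from dataclasses import dataclass, field
from typing import Any, Dict

# Keys removed from the message envelope by this ADR.
REMOVED_KEYS = ("max_retries", "retry_count")


@dataclass
class TaskMessage:
    # Stand-in for the trimmed type; these field names are hypothetical.
    task_id: str
    payload: Dict[str, Any] = field(default_factory=dict)


def strip_dead_fields(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Drop the removed retry keys from a legacy payload before construction."""
    return {key: value for key, value in raw.items() if key not in REMOVED_KEYS}


legacy = {"task_id": "t-1", "payload": {"doc": "a"}, "max_retries": 2, "retry_count": 0}

try:
    TaskMessage(**legacy)  # old call sites fail loudly instead of silently
    rejected = False
except TypeError:
    rejected = True

msg = TaskMessage(**strip_dead_fields(legacy))
```

A shim like this only suits callers replaying stored legacy payloads. New call sites are better served by dropping the two keyword arguments outright.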