ADR-004: Skip-and-log on malformed messages, not crash¶
Status: Accepted. Pairs with: Invariant 8.
Context¶
Heddle has two layers at which a message can be malformed:
- Bus deserialization — the NATS payload isn't valid JSON,
or the UTF-8 decoding fails.
NATSBus._handle_messagecatchesjson.JSONDecodeErrorandUnicodeDecodeError, logs, and keeps the subscription alive. - Pydantic parsing — the dict is valid JSON but doesn't fit
the
TaskMessage/TaskResultschema. The orchestrator'sdispatch_and_wait_for_resultandResultStreamboth route the parse attempt throughparse_task_result(heddle.core.messages), which logs and returnsNoneon failure.
The decision to make: when a malformed message arrives, should the consumer crash and force a restart, or skip the message and continue?
Decision¶
Malformed messages are skip-and-logged at every layer. The subscription stays live, the consumer task keeps running, and the outer timeout is the only way a malformed message causes caller-visible failure.
Specifically:
NATSBuscatches deserialization errors and logsnats.malformed_message_skipped. Subscription loop continues.parse_task_resultcatches PydanticValidationError, logs via the caller-suppliedlog_event(one ofdispatch.parse_error/result_stream.parse_error), and returnsNone. Caller-supplied skip-and-continue logic.
Alternatives considered¶
Crash and let the supervisor restart (rejected)¶
Treat a malformed message as fatal. The consumer crashes, the process manager (systemd, Kubernetes) restarts the actor, and processing resumes from the new subscription.
- Rejected because the bad message stays in NATS (NATS is at-most-once; once delivered it's gone, but a corrupted payload from a buggy publisher may keep arriving). The worker crashes, restarts, subscribes, receives the same bad message, crashes again — a crash loop that masks valid messages behind it.
- Restart latency is much longer than message latency. A worker processing a high-volume stream spends 10s rebooting per bad message, during which valid messages pile up and time out at the caller.
- A single buggy publisher upstream can take down the entire fleet of consumers via this mechanism. The blast radius is out of proportion to the local fault.
Dead-letter malformed messages (partially adopted)¶
The router already dead-letters tasks that fail TaskMessage
parsing (reason="invalid_task_message: ..." —
Invariant 8 covers the bus layer; the router covers the
TaskMessage layer specifically).
- Adopted for the routing layer — the router is the natural triage point and the dead-letter queue gives operators visibility into upstream sender bugs.
- Not adopted for the result-collection layer — a malformed result shouldn't be moved to a dead-letter queue, because the goal is still in progress and the caller is still waiting. The skip-and-log path lets the next (valid) matching result complete the wait. b453298 fixed the bug where the result-collection layer was treating ValidationError as fatal; the current shape is the alternative we chose.
Pause subscription, drain, then resume (rejected)¶
When a bad message arrives, stop the subscription, drain in-flight handlers, restart the subscription pointer past the bad message, and continue.
- Rejected because NATS doesn't expose a per-message acknowledgement pointer for non-JetStream subjects (Heddle runs on core NATS by default for low latency). The drain- and-resume primitive doesn't exist at the protocol level.
- Even if it did, the complexity isn't justified. The state machine for "currently draining due to bad message at offset N" has more failure modes than the simple skip-and-log path.
Consequences¶
Enables:
- Single buggy producer cannot take down a consumer fleet.
- High-volume streams remain processable through transient corruption events.
- Operator triage is straightforward: grep
*.parse_errorto find the bad messages, fix the upstream cause, no consumer restart needed. - The two parse-error log keys (
dispatch.parse_error,result_stream.parse_error) are parallel so an operator can grep both modules with one query — the centralization inparse_task_resultmakes them inseparable from the helper itself.
Costs:
- A consistently malformed message produces a steady stream of log warnings. Operators must monitor the warning rate themselves; the framework doesn't escalate to fatal.
- Silent bug class: a publisher emits valid JSON that almost
matches the schema (missing one required field). The consumer
skips every one of its messages. The publisher author sees no
feedback at the publish call. Mitigated by the dead-letter
queue at the router layer for
TaskMessages, but result- layer parse errors only surface via log greps. - The bad message stays in NATS for as long as NATS retains it. Operators expecting "drain it from the queue" semantics need to know that's not how core NATS works.