Debug missing worker results¶
The caller (orchestrator, pipeline, MCP bridge, or CLI) times out waiting for a result, but the worker logs show the task processed successfully. This is the most pernicious class of failure in Heddle's request-reply path because the worker side looks healthy.
Symptom¶
- Caller log:
result_stream.timeout, or a genericasyncio.TimeoutErrorpropagated up fromdispatch_and_wait_for_result(the helper returnsNonesilently on its own timeout — there is nodispatch.timeoutlog key; you'll see the caller's wrapper log it). - Worker log:
worker.completedwith the sametask_id, well within the timeout window. - Sometimes intermittent — the same payload succeeds on a retry.
Diagnosis¶
Walk the request-reply path from worker → caller:
-
Did the worker publish to the right subject?
The worker publishes to
heddle.results.{parent_task_id}(orheddle.results.defaultif no parent). If the worker computed the subject from a stale or wrongparent_task_id, the caller's subscription will not see it.Cross-check
parent_task_idbetween the dispatched TaskMessage and the published TaskResult. -
Did the caller's subscription see anything?
Look for
result_stream.ignored(wrong task_id),result_stream. duplicate(already collected), orresult_stream.parse_error(malformed payload).A
*.parse_errorline means a matching result arrived but was Pydantic-rejected — and the consumer correctly kept waiting (commitb453298, Invariant 17 supporting fix). The error string in the log identifies which field was malformed. -
Did the caller subscribe before publishing?
Subscribe-before-publish is Design Invariant 17. NATS is at-most-once. If a fast worker published its result before the caller's subscription went live, the result was delivered to nobody — and the caller times out.
Both
dispatch_and_wait_for_result(inheddle.orchestrator.dispatch) andResultStream(inheddle.orchestrator.stream) enforce the ordering. Custom callers that publish directly are the usual place this regresses; check the call site. -
Did the caller's timeout fire before the worker started?
A late result arriving after the orchestrator's per-task timeout is dropped by design (commit
cc49783). The orchestrator's final result will includemetadata.timeoutwithpending_task_ids,expected_count,collected_count, andtimeout_seconds:# Pull the final result JSON and look at the timeout block rg "orchestrator.goal_completed" caller.log # then look at the corresponding TaskResult.output.metadata.timeoutIf
pending_task_idslists yourtask_id, the worker simply didn't reply in time — not a delivery loss.
Mitigation¶
| Diagnosis | Action |
|---|---|
*.parse_error for a matching task_id |
Fix the malformed payload at the worker; the framework already keeps waiting for the next valid one |
| Subscribe-before-publish reversed | Route the call site through dispatch_and_wait_for_result or ResultStream (do not roll your own) |
| Late result after timeout | Increase timeout_seconds on the orchestrator config, or split the goal so individual tasks are cheaper |
Wrong parent_task_id |
Audit the dispatch path; the orchestrator wires this in _dispatch_subtasks |
If you must replay the task manually:
# Inspect the dead-letter queue first — the task may already be there
uv run heddle dead-letter monitor --nats-url $NATS_URL
See also: Interpret dead letters.
Verify¶
Re-dispatch the same payload and confirm:
- A single
worker.completedline. - A single
result_stream.collectedline on the caller (or, for single-result waits, a returnedTaskResultnotNone). - No
*.parse_errorlines.
Followup¶
If this fires repeatedly on the same worker, treat the worker output
schema as suspect — the parse-error resilience is a safety net, not a
fix. Add the failing payload as a fixture under
tests/test_contracts.py (or the worker-specific test file) and
tighten the worker's output validation so the schema rejects the bad
shape at the worker side instead of at the caller.