Recover from stuck eval runs¶
The Workshop Eval Runner persists each suite invocation as an
eval_runs row with a status column. The row should transition
running → completed or running → failed on every invocation.
A row pinned at running indicates the runner crashed in a way
that escaped the status-update path.
Symptom¶
- The Workshop /evals page shows a run that has been running for longer than timeout_seconds * len(cases).
- New eval submissions for the same suite collide with the stale row (depending on suite-level locking).
- Worker logs show the cases completed; nothing wrote the summary back.
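The first symptom is a simple time-budget comparison. A minimal sketch, assuming the run's start time is available as a datetime and that the worst case is every case using its full timeout in sequence; is_stale is a hypothetical helper, not part of the runner:

```python
from datetime import datetime, timedelta


def is_stale(started_at: datetime, now: datetime,
             timeout_seconds: float, case_count: int) -> bool:
    """Return True when a run has been 'running' past its worst-case budget."""
    # Assumed worst case: each case consumes its full timeout, one after another.
    budget = timedelta(seconds=timeout_seconds * case_count)
    return now - started_at > budget
```

A run with 10 cases and a 30-second timeout would be flagged once it has been running for more than 300 seconds.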
Diagnosis¶
There are three places a run can get stuck, and each has its own fingerprint in the log:
- A case raised: this should not leak, but if it did, look for eval.case_save_failed. The return_exceptions=True gather pattern means one bad case can't cancel its siblings, but a bug in the per-case persistence path could leave the case row in a partial state. The suite-level row still terminates; only the case row is affected.
- The suite-level finally bailed: look for eval.suite_status_write_failed. The commit history (see the git log for eval_runner.py) shows the try/finally that wraps the status='failed' update. If the DB write itself failed (locked database, disk full, schema drift), the runner logs this event and re-raises. The row is still stuck, but you have the exception traceback in the logs.
- The runner process died: there is no eval.suite_completed or eval.suite_failed event. Look for SIGKILL/OOM in the process manager, a container restart in K8s, or actor.shutdown_requested if a graceful shutdown landed mid-run.
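The three fingerprints can be checked mechanically. A minimal sketch, assuming each event name appears verbatim somewhere in a log line (the actual log format is not shown here); classify_stuck_run and its return labels are hypothetical, not part of the runner:

```python
def classify_stuck_run(log_lines):
    """Map the log lines of one run to the stuck-run causes they suggest."""
    text = "\n".join(log_lines)
    findings = []

    # Case 1: a per-case persistence failure.
    if "eval.case_save_failed" in text:
        findings.append("case_save_failed")

    # Case 2: the suite-level status write failed inside the finally block.
    if "eval.suite_status_write_failed" in text:
        findings.append("suite_status_write_failed")

    # Case 3: no terminal event at all, and no write failure to explain it.
    terminated = ("eval.suite_completed" in text
                  or "eval.suite_failed" in text)
    if not terminated and "suite_status_write_failed" not in findings:
        findings.append("process_death")
        if "actor.shutdown_requested" in text:
            findings.append("graceful_shutdown_mid_run")

    return findings
```

The function returns a list rather than a single label because a case-level save failure can coexist with either of the other two causes.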
Mitigation¶
Case 1 — case-level save failure¶
Inspect the failing case row directly:
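A minimal sketch of that inspection, using only the eval_cases columns that appear in the recovery update (id, status, error) and the case id taken from the log line. Python's sqlite3 module stands in for the real database connection here, and inspect_case is a hypothetical helper:

```python
import sqlite3


def inspect_case(conn: sqlite3.Connection, case_id: str):
    """Return the case row as a dict, or None if the id does not exist."""
    row = conn.execute(
        "SELECT id, status, error FROM eval_cases WHERE id = ?",
        (case_id,),
    ).fetchone()
    if row is None:
        return None
    return dict(zip(("id", "status", "error"), row))
```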
If a case row is running while siblings are completed/failed,
manually update it to failed with the error from the log line:
UPDATE eval_cases
SET status = 'failed',
error = 'Manual recovery: see eval.case_save_failed at <timestamp>'
WHERE id = '<case_id>';
Then terminate the suite row the same way:
UPDATE eval_runs
SET status = 'failed',
completed_at = CURRENT_TIMESTAMP,
error = 'Manual recovery'
WHERE id = '<stuck_run_id>';
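If the recovery is scripted rather than typed by hand, the two updates can run in one transaction so the case row and the suite row never disagree. A minimal sketch, with sqlite3 standing in for the real database connection; mark_run_failed is a hypothetical helper, and the extra status = 'running' guard on the suite row is an addition that is not in the SQL above:

```python
import sqlite3


def mark_run_failed(conn: sqlite3.Connection, run_id: str,
                    case_id: str, note: str) -> None:
    """Fail one stuck case row and its suite row atomically."""
    # The connection context manager commits if both statements succeed
    # and rolls back if either one raises.
    with conn:
        conn.execute(
            "UPDATE eval_cases SET status = 'failed', error = ? WHERE id = ?",
            (note, case_id),
        )
        conn.execute(
            "UPDATE eval_runs "
            "SET status = 'failed', completed_at = CURRENT_TIMESTAMP, error = ? "
            # Guard (assumption): never overwrite a row that already terminated.
            "WHERE id = ? AND status = 'running'",
            (note, run_id),
        )
```

Bound parameters keep the ids and the note out of the SQL string itself, which avoids quoting mistakes when the note is copied from a log line.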
Case 2 — suite-level finally failed to write¶
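The runner already attempts the failure write inside a try/except so that a write error does not mask the original error. If even that write failed, the database itself is the problem. Fix the database first: the traceback logged with eval.suite_status_write_failed should point to a locked database, a full disk, or schema drift.
Then run the suite-row SQL update from Case 1 to terminate the stale row.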
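To make the failure mode concrete, the pattern this case describes looks roughly like the following. This is a sketch of the general shape under stated assumptions, not the code in eval_runner.py: run_suite, write_status, and the choice to mark the suite completed once every case has returned are all hypothetical.

```python
import asyncio
import logging

logger = logging.getLogger("eval_runner_sketch")


async def run_suite(cases, write_status):
    """cases: zero-argument coroutine functions; write_status: callable(str)."""
    status = "failed"  # pessimistic default until the gather returns
    try:
        # return_exceptions=True turns a raising case into a result value,
        # so one bad case cannot cancel its siblings.
        results = await asyncio.gather(
            *(case() for case in cases), return_exceptions=True
        )
        status = "completed"
        return results
    finally:
        try:
            write_status(status)
        except Exception:
            # The row stays at 'running'; the logged traceback is the
            # fingerprint to look for.
            logger.exception("eval.suite_status_write_failed")
            raise
```

If the write fails while an earlier exception is still propagating, Python attaches that earlier exception to the new one as __context__, so the original error remains visible in the traceback.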
Case 3 — process death¶
There is no in-process recovery. Restart the runner, then run the suite-row SQL update to terminate the stale row.
If process death recurs, the run is likely exceeding what your process manager tolerates (for example, an OOM kill or a container restart policy). Either shrink the suite (fewer cases, a smaller timeout) or move the runner to a more permissive deployment target.
Verify¶
-- No rows should be 'running' from before the current process started
SELECT id, status, created_at, completed_at
FROM eval_runs
WHERE status = 'running'
ORDER BY created_at DESC;
New eval submissions for the same suite should now run without
collision. Re-run a quick suite end-to-end and confirm
eval.suite_completed lands.
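The query above lists every running row and leaves the "before the current process started" comparison to the reader. A minimal sketch that applies the cutoff explicitly, with sqlite3 standing in for the real database connection; stale_running_rows is a hypothetical helper, and it assumes created_at and the cutoff share one sortable text format such as YYYY-MM-DD HH:MM:SS:

```python
import sqlite3


def stale_running_rows(conn: sqlite3.Connection, process_started_at: str):
    """Return (id, created_at) for running rows older than the cutoff."""
    return conn.execute(
        "SELECT id, created_at FROM eval_runs "
        "WHERE status = 'running' AND created_at < ? "
        "ORDER BY created_at DESC",
        (process_started_at,),
    ).fetchall()
```

An empty list means no stale rows remain; rows created after the cutoff belong to runs that are legitimately in flight.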
Followup¶
Persistent stuck runs almost always trace back to one of:
- A suite that's too large for the timeout, hitting per-case cancellation but leaving the suite-row update non-atomic with the case rows.
- A DuckDB file shared across multiple processes — DuckDB's single-writer model means a holder elsewhere can starve the runner indefinitely.
The framework guarantees the row will terminate if the process survives long enough to run the finally block. Operator effort should focus on that survival, not on rewriting the finally block.