Runner Crash Recovery

When the Generation Runner crashes mid-pass — process killed, system rebooted, G-Labs died, network dropped — you need to recover without losing the work that already succeeded.

What you have after a crash

When the runner crashes mid-pass, here's the state:

  • The source .nbflow (input): unchanged, intact.
  • Local generated PNGs / MP4s: whatever was produced before the crash, saved to Assets/{workflow}/generated-images/v0-N/ and generated-videos/v0-N/.
  • R2 uploads: each successful upload completed atomically; R2 has whatever finished.
  • The -generated.nbflow (output): either not written at all (crash too early), partial / corrupted, or written successfully (crash after export).
  • Cache state in the source .nbflow: updated for every node that succeeded, untouched for incomplete ones.
  • errors.json: may or may not exist, depending on when the crash happened.

The good news: what succeeded is preserved. The bad news: you have to figure out what to redo.
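
You can take that inventory in one pass from the shell. A minimal sketch using the example paths from this page (the errors.json location is an assumption; check wherever your runner actually writes it):

# One-pass survey of the crash state.
PROJ="projects/may/xyz-wellness"
ls "$PROJ/Assets/XYZG3/generated-images/"v0-*/
ls "$PROJ/Assets/XYZG3/generated-videos/"v0-*/
ls "$PROJ/Generations/"
ls "$PROJ/errors.json" 2>/dev/null || echo "no errors.json"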

Diagnostic flow

flowchart TD
    A[Runner crashed mid-pass]
    A --> B{Does -generated.nbflow exist?}
    B -->|yes, complete| C[Lucky — restart with full output]
    B -->|no or partial| D{Check local Assets folders}
    D --> E[Count generated PNGs/MP4s by scene]
    E --> F{All scenes covered?}
    F -->|yes| G[Re-run to finish exporting]
    F -->|no| H[Re-run with cache enabled]
    H --> I[Cached successful nodes reused;<br/>unfinished nodes regenerate]

Step 1: Check what got produced

Look at the local Assets folders:

ls "projects/may/xyz-wellness/Assets/XYZG3/generated-images/v0-1/"
# scene01.png, scene02.png, scene03.png, scene04.png, scene05.png

ls "projects/may/xyz-wellness/Assets/XYZG3/generated-videos/v0-1/"
# (empty or partial)

This tells you: images were done through Scene 05 before the crash. Video gen hadn't started (or hadn't finished any).
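
To check coverage mechanically instead of by eye, a small sketch, assuming the sceneNN.png naming above and a six-scene workflow:

# Report any scene image missing from the local output folder.
for i in 01 02 03 04 05 06; do
  f="projects/may/xyz-wellness/Assets/XYZG3/generated-images/v0-1/scene$i.png"
  [ -f "$f" ] || echo "missing: scene$i.png"
done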

Step 2: Check -generated.nbflow

ls "projects/may/xyz-wellness/Generations/"
# XYZG3-V0-1-generated-20260512-143021.nbflow

If a -generated.nbflow exists, open it in PatchWork or inspect the JSON directly. Look at the _savedImages and _savedClips fields on each gen node:

  • Populated = that node completed successfully; its R2 URL is baked in
  • Empty / null = that node didn't complete
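
If you'd rather check from the shell than open PatchWork, a jq sketch. It assumes the .nbflow JSON has a top-level nodes array with an id field per node and _savedImages as an array; adjust to your actual schema:

# List gen nodes whose _savedImages is empty or null, i.e. nodes that didn't complete.
jq -r '.nodes[] | select(has("_savedImages") and ((._savedImages // []) | length == 0)) | .id' \
  "projects/may/xyz-wellness/Generations/XYZG3-V0-1-generated-20260512-143021.nbflow"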

Step 3: Re-run with cache enabled

The Generation Runner's cache should handle most recovery automatically:

You: re-run XYZG3-V0-1.nbflow. The previous run crashed mid-way; use
     the cache to avoid redoing what succeeded.

Claude: [invokes Generation Runner with normal cache behavior]
        [the runner reads the source .nbflow, sees which nodes have
         _savedImages populated, skips those, runs the rest]

What happens:

  • Nodes that completed before the crash → cached output is reused, no regen
  • Nodes that didn't complete → fresh generation
  • Total cost: only the cost of the unfinished work

You usually get a clean -generated.nbflow on the second pass with minimal re-cost.
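
To estimate the re-cost before kicking off the second pass, the same schema assumptions give a quick count of cached vs. total gen nodes in the source file:

# Cached vs. total gen nodes in the SOURCE .nbflow (run from wherever it lives).
jq '[.nodes[] | select(has("_savedImages"))]
    | {total: length, cached: (map(select((._savedImages // []) | length > 0)) | length)}' \
  "XYZG3-V0-1.nbflow"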

Step 4: If cache state is corrupted

In rare cases, the crash leaves the source .nbflow in an inconsistent state — some nodes have _savedImages set to URLs that don't exist on R2, or partial data.

Symptoms:

  • The runner tries to "reuse" output but PatchWork shows a broken image
  • R2 URLs in the file return 404 (spot-check sketch after this list)
  • The runner finishes but the output is mixed valid / invalid
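
To spot-check those URLs from the shell, a sketch with the same schema assumptions as above (_savedImages as an array of URLs; run it from wherever the source file lives):

# HEAD-check every cached R2 URL; anything that isn't a 200 gets printed.
jq -r '.nodes[]._savedImages[]?' "XYZG3-V0-1.nbflow" | while read -r url; do
  code=$(curl -s -o /dev/null -w '%{http_code}' -I "$url")
  [ "$code" = "200" ] || echo "$code  $url"
done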

Recovery:

  1. Restore the source .nbflow from the backups/ folder, i.e. the version from before this run (copy sketch below)
  2. The restored file has no cache; everything regenerates from scratch
  3. Cost is back to full, but you're guaranteed clean output

When the runner's cache is suspect, restoring and re-running is faster than diagnosing the corruption.
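
The restore itself is just a copy. Illustrative paths only; use whatever your backups/ folder actually holds for this version:

# Overwrite the suspect source file with its pre-run backup.
cp "backups/XYZG3-V0-1.nbflow" "XYZG3-V0-1.nbflow"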

When G-Labs dies mid-pass

A specific case: G-Labs (the local backend) crashes or you lose the tunnel during a run.

Symptom
The runner errors out with connection refused or a 522 timeout partway through, and may or may not have produced errors.json.
Diagnosis
Check the G-Labs health endpoint (http://localhost:8765/health). If the request fails outright (connection refused or timeout), G-Labs is down.
Recovery
  1. Restart G-Labs (npm start or whatever your installation requires)
  2. Verify health endpoint is OK
  3. Restart cloudflared tunnel (the URL will be different)
  4. Update PatchWork's settings with the new tunnel URL
  5. Re-run the workflow — cache will reuse anything that succeeded before the crash

The whole sequence is 5-10 minutes if everything goes smoothly.
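
As shell commands, the sequence looks roughly like this (npm start is the step-1 assumption above; the cloudflared command is the one from the next section):

curl -s http://localhost:8765/health            # confirm G-Labs is actually down
npm start &                                     # restart G-Labs, from its directory (your install may differ)
sleep 5; curl -s http://localhost:8765/health   # verify it came back
cloudflared tunnel --url http://localhost:8765  # restart the tunnel; copy the new URL it prints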

When the cloudflared tunnel drops

Even more specific: the tunnel URL stops resolving mid-pass.

Symptom
Generations error with 522 or connection refused. The local G-Labs is healthy. The tunnel URL just doesn't work.
Recovery
  1. Restart cloudflared: cloudflared tunnel --url http://localhost:8765
  2. Copy the new tunnel URL
  3. Update wherever the URL is referenced (Generation Runner --server arg or PatchWork settings)
  4. Re-run the workflow

The runner will reuse cached output and only redo the unfinished work.
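
Before re-running, it's worth confirming the new URL reaches G-Labs end to end. Hypothetical tunnel URL; assumes /health is exposed through the tunnel:

# Should return the same healthy response as hitting localhost:8765 directly.
curl -s https://new-subdomain.trycloudflare.com/health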

When the runner produces invalid output

A different kind of failure: the runner completed, but the output is garbage:

  • All candidates have AI tells the auto-QA didn't catch
  • Output URLs don't match the prompts
  • Avatar references didn't lock the face
  • All scenes look like a different account's avatar

This usually means a structural issue with the input .nbflow that the runner's sanity check didn't catch (or you skipped the sanity check).

Recovery:

  1. Restore the .nbflow to the pre-run backup
  2. Run the pre-generation sanity check on the restored file
  3. Fix anything the check surfaces
  4. Re-run

Common findings in this case:

  • Avatar reference URL was wrong (per-node link refs stale, see PatchWork Import Bugs)
  • Wrong model field on the gen nodes (defaulted to NB1 instead of NB2)
  • Dynamic Prompt variableName doesn't match the template placeholder
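
The model-field finding is mechanically checkable with the same jq assumptions as earlier (the node layout and the model field name are assumptions; adjust to your schema):

# Flag gen nodes whose model fell back to NB1.
jq -r '.nodes[] | select(.model == "NB1") | .id' "XYZG3-V0-1.nbflow"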

After recovery

Always do these:

  1. Bump V0-N if the recovery involved any changes (cache wipe, sanity-check fixes, structural patches)
  2. Document what happened in the version registry's history note ("V0-1: runner crashed at Scene 06, recovered with cache reuse, full output now")
  3. Sync the tracker so the team sees the recovered state

Don't pretend the crash didn't happen. The audit trail matters.

When to give up and start over

Sometimes recovery isn't worth it. Signs:

  • You've tried two recovery passes and each produced different errors
  • The source .nbflow itself seems corrupted (won't import into PatchWork cleanly)
  • The cost of recovery (diagnosis time + re-runs) exceeds the cost of building from V0-1 again

Restart from the source brief. Skip the trauma of trying to salvage. You'll be faster.

When you're ready

Next: Cache Wipe / Force Re-roll — explicitly clearing the cache when it's getting in your way.