perf: reuse ProcessPoolExecutor + as_completed + drop per-miner Simulation rebuild in sync_forward_multiprocess #254

@Thykof

Context

Follow-up to #253 (ticket 05), which bounded each miner call at timeout seconds of wall clock. Today's dendrite forward sits at ~148 s under a 120 s timeout — the extra ~28 s is main-process overhead around the I/O phase. Three structural changes in synth/base/dendrite_multiprocess.py:sync_forward_multiprocess can squeeze that overhead.

Proposed changes (land together)

1. Reuse the ProcessPoolExecutor across cycles.

Currently (line 338): with concurrent.futures.ProcessPoolExecutor(nprocs) as executor: forks nprocs fresh children every cycle, each re-importing bittensor/httpx/uvloop, and tears them down at the end. synth/validator/reward.py:147 already uses the module-level singleton pattern for the CRPS workers; apply the same here.
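A minimal sketch of the module-level singleton idea, assuming Python 3.9+. The names `_pool`, `_pool_lock` and `get_pool` are hypothetical and not taken from the repository; the actual pattern in synth/validator/reward.py may differ.

```python
import concurrent.futures
import threading

# Hypothetical module-level state (illustrative names, not from the repo).
_pool = None
_pool_lock = threading.Lock()


def get_pool(nprocs):
    """Return the shared ProcessPoolExecutor, creating it on first use.

    Later calls reuse the existing pool, so `nprocs` only takes effect
    the first time the pool is built.
    """
    global _pool
    with _pool_lock:
        if _pool is None:
            _pool = concurrent.futures.ProcessPoolExecutor(max_workers=nprocs)
        return _pool
```

With something like this in place, the per-cycle `with ... as executor:` block would become `executor = get_pool(nprocs)`, and the workers (and their imports) would survive between cycles.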

2. Replace executor.map(...) with as_completed(...).

executor.map yields chunks in submission order, so a chunk that has already finished cannot be processed until every chunk submitted before it has been yielded. In the worst case the main-process CPU work (unpickling ~90 MB of floats per chunk, rebuilding the synapse, validation) is serialised after the slowest chunk finishes. as_completed lets the main thread start processing each chunk the moment it lands, overlapping with remaining I/O. This requires carrying the chunk index / original axon index into the work item so final ordering is preserved.
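One way to keep the original ordering while consuming results out of order is sketched below. It is written against the generic `concurrent.futures.Executor` interface, so it should work with a ProcessPoolExecutor as long as the worker function and chunks are picklable. `process_as_completed`, `fn` and `post_process` are hypothetical names for illustration.

```python
import concurrent.futures


def process_as_completed(executor, fn, chunks, post_process):
    """Submit every chunk, post-process each one as soon as it finishes,
    and return the post-processed results in submission order."""
    # Remember which submission index each future belongs to.
    future_to_index = {
        executor.submit(fn, chunk): index for index, chunk in enumerate(chunks)
    }
    results = [None] * len(future_to_index)
    # as_completed yields futures in completion order, not submission order.
    for future in concurrent.futures.as_completed(future_to_index):
        index = future_to_index[future]
        # future.result() re-raises any exception raised inside the worker.
        results[index] = post_process(future.result())
    return results
```

The main-process work in `post_process` runs while slower chunks are still in flight, and the index lookup puts each result back in its original slot.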

3. Drop the per-miner Simulation(...).from_headers(...) + model_copy() rebuild.

For every miner (~240 per cycle) the current code constructs a fresh Simulation Pydantic model, calls from_headers(), assigns the ~289k-float simulation_output, and calls model_copy(). The caller (synth/validator/forward.py:231) only uses three fields. Return plain (response, process_time) tuples (or a dict keyed by miner_uid) and let the caller index by uid directly.
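A rough sketch of the lighter return shape, assuming each worker hands back `(index, response, process_time)` triples. `MinerResult` and `collect_by_uid` are hypothetical names, and the real payload types may differ.

```python
from typing import Any, Dict, Iterable, NamedTuple, Optional, Sequence, Tuple


class MinerResult(NamedTuple):
    """Plain container: no Pydantic validation, no copy of the payload."""
    response: Optional[Any]          # raw miner payload, or None on failure
    process_time: Optional[float]    # seconds, or None on failure


def collect_by_uid(
    miner_uids: Sequence[int],
    raw_results: Iterable[Tuple[int, Optional[Any], Optional[float]]],
) -> Dict[int, MinerResult]:
    """Key each result by miner uid using the index carried with the work item."""
    by_uid = {}
    for index, response, process_time in raw_results:
        # The payload is stored by reference; nothing is rebuilt or copied.
        by_uid[miner_uids[index]] = MinerResult(response, process_time)
    return by_uid
```

The caller can then look up `by_uid[uid].response` directly instead of relying on positional alignment with `miner_uids`.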

Why this will help

  • Change 1 eliminates per-cycle fork + import cost (several seconds on cold fork, worse under memory pressure).
  • Change 2 pipelines main-process CPU work with remaining network I/O; the more main-process CPU there is to do, the bigger the overlap win.
  • Change 3 removes ~240 Pydantic model constructions, 240 from_headers calls, and 240 model_copy() operations per cycle — each touches the large simulation_output.

Combined upper-bound reclaim: ~23 s off forward (148 s → ~125 s) with current miner behaviour.

Priority

Low relative to:

  • batched save_responses inserts (#): ~85 s reclaim from save.
  • vectorised validate_responses_v2 (#): ~30 s reclaim between forward and save.

Pick up after those land, or sooner if as_completed overlap starts being valuable (miners hitting full timeout consistently again).

Risk / review notes

  • sync_forward_multiprocess return type changes. Caller in query_available_miners_and_save_responses needs the matching update (positional miner_uids[i] → keyed by uid, or carry index through).
  • Singleton pool lifecycle: register a shutdown hook so workers are cleaned up on SIGTERM, but never shut the pool down inside the normal cycle path.
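  • Workers that crash are not replaced by ProcessPoolExecutor: when a worker dies abruptly the pool is marked broken, and pending and later submissions raise BrokenProcessPool. The per-cycle pool hid this because the next cycle built a fresh one; the singleton has to catch BrokenProcessPool and recreate the pool.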
  • Individual workers can't hang past timeout because of the asyncio.wait_for added in #253 (fix: enforce total per-miner timeout in dendrite forward).

Verification plan

  • Timing logs around pool acquisition, per-chunk completion, per-miner post-processing. Before/after on staging against a replay.
  • Success criteria: forward completes within timeout + ~10 s of overhead; no regression in miner_predictions row count or format_validation distribution; no leaked workers after SIGTERM.
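For the timing logs, a small context manager along these lines may be enough. The logger name, the `timed` helper and the optional `sink` dict are assumptions made for illustration, not existing code.

```python
import logging
import time
from contextlib import contextmanager

# Hypothetical logger name.
logger = logging.getLogger("dendrite.timing")


@contextmanager
def timed(phase, sink=None):
    """Log how long the wrapped block took; optionally accumulate into `sink`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        logger.info("%s took %.3f s", phase, elapsed)
        if sink is not None:
            # Sum repeated phases (e.g. one entry per chunk) under one key.
            sink[phase] = sink.get(phase, 0.0) + elapsed
```

Wrapping pool acquisition, each chunk's completion handling and per-miner post-processing in `with timed("..."):` gives per-phase numbers that can be compared before and after on the staging replay.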
