Context
Follow-up to #253 (ticket 05), which bounded each miner call at `timeout` seconds of wall clock. Today's dendrite forward sits at ~148 s under a 120 s timeout; the extra ~28 s is main-process overhead around the I/O phase. Three structural changes in `synth/base/dendrite_multiprocess.py:sync_forward_multiprocess` can squeeze that overhead.
Proposed changes (land together)
1. Reuse the ProcessPoolExecutor across cycles.
Currently (line 338), `with concurrent.futures.ProcessPoolExecutor(nprocs) as executor:` forks `nprocs` fresh children every cycle, each re-importing `bittensor`/`httpx`/`uvloop`, and tears them down at the end. `synth/validator/reward.py:147` already uses the module-level singleton pattern for the CRPS workers; apply the same here.
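A minimal sketch of the module-level singleton pattern, assuming hypothetical names `_POOL`, `get_pool`, and `shutdown_pool` (none of them are taken from `reward.py` or `dendrite_multiprocess.py`). The pool is created on first use and handed back unchanged on every later cycle:

```python
import atexit
import concurrent.futures
import threading
from typing import Optional

# Hypothetical module-level state; the real module may name these differently.
_POOL: Optional[concurrent.futures.ProcessPoolExecutor] = None
_POOL_LOCK = threading.Lock()


def get_pool(nprocs: int) -> concurrent.futures.ProcessPoolExecutor:
    """Return the shared pool, creating it on the first call only."""
    global _POOL
    with _POOL_LOCK:
        if _POOL is None:
            _POOL = concurrent.futures.ProcessPoolExecutor(max_workers=nprocs)
            # Clean up workers at normal interpreter exit.
            atexit.register(shutdown_pool)
        return _POOL


def shutdown_pool() -> None:
    """Tear the pool down; meant for process exit, not for the cycle path."""
    global _POOL
    with _POOL_LOCK:
        pool, _POOL = _POOL, None
    if pool is not None:
        pool.shutdown(wait=True)
```

Inside the cycle, `executor = get_pool(nprocs)` would stand in for the `with` block. Note that in this sketch later calls ignore `nprocs`, because the pool already exists.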
2. Replace executor.map(...) with as_completed(...).
`executor.map` yields chunks in submission order, so main-process CPU work (unpickling ~90 MB of floats per chunk, rebuilding the synapse, validation) on a chunk that has already finished waits behind every earlier-submitted chunk, and in the worst case is serialised after the slowest chunk finishes. `as_completed` lets the main thread start processing each chunk the moment it lands, overlapping with remaining I/O. This requires carrying the chunk index / original axon index into the work item so final ordering is preserved.
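The index-carrying idea can be sketched as follows. The names `run_chunks_as_completed`, `worker`, and `post_process` are hypothetical, and the function accepts any `concurrent.futures.Executor`, so it is not tied to the existing pool code:

```python
import concurrent.futures
from typing import Any, Callable, List, Sequence


def run_chunks_as_completed(
    executor: concurrent.futures.Executor,
    worker: Callable[[Any], Any],
    chunks: Sequence[Any],
    post_process: Callable[[Any], Any] = lambda raw: raw,
) -> List[Any]:
    """Post-process each chunk as soon as it finishes, keep submission order."""
    # Remember which chunk each future belongs to.
    future_to_index = {
        executor.submit(worker, chunk): index
        for index, chunk in enumerate(chunks)
    }
    results: List[Any] = [None] * len(chunks)
    # Futures are yielded in completion order, not submission order.
    for future in concurrent.futures.as_completed(future_to_index):
        index = future_to_index[future]
        # Main-process work runs here while slower chunks are still in flight.
        # future.result() re-raises any exception raised inside the worker.
        results[index] = post_process(future.result())
    return results
```

Because each result is written to its original slot, the returned list lines up with `chunks` even though processing happened in completion order.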
3. Drop the per-miner Simulation(...).from_headers(...) + model_copy() rebuild.
For every miner (~240 per cycle) the current code constructs a fresh Simulation Pydantic model, calls from_headers(), assigns the ~289k-float simulation_output, and calls model_copy(). The caller (synth/validator/forward.py:231) only uses three fields. Return plain (response, process_time) tuples (or a dict keyed by miner_uid) and let the caller index by uid directly.
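To illustrate the proposed return shape, here is a small sketch that assumes each worker hands back hypothetical `(miner_uid, response, process_time)` triples. The names `key_by_uid` and `results_in_uid_order`, and the `(None, None)` placeholder for a miner with no result, are assumptions rather than the existing API:

```python
from typing import Any, Dict, Iterable, List, Optional, Tuple

# (response, process_time); both None when a miner returned nothing.
MinerResult = Tuple[Optional[Any], Optional[float]]


def key_by_uid(
    triples: Iterable[Tuple[int, Any, float]],
) -> Dict[int, MinerResult]:
    """Build a uid -> (response, process_time) lookup with no model rebuild."""
    return {
        uid: (response, process_time)
        for uid, response, process_time in triples
    }


def results_in_uid_order(
    miner_uids: List[int], by_uid: Dict[int, MinerResult]
) -> List[MinerResult]:
    """Caller-side view: one entry per requested uid, in the caller's order."""
    return [by_uid.get(uid, (None, None)) for uid in miner_uids]
```

Keying by uid means the caller no longer depends on the worker output arriving in the same positional order as `miner_uids`.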
Why this will help
Change 1 eliminates per-cycle fork + import cost (several seconds on cold fork, worse under memory pressure).
Change 2 pipelines main-process CPU work with remaining network I/O; the more main-process CPU there is to do, the bigger the overlap win.
Change 3 removes ~240 Pydantic model constructions, 240 from_headers calls, and 240 model_copy() operations per cycle — each touches the large simulation_output.
Combined upper-bound reclaim: ~23 s off forward (148 s → ~125 s) with current miner behaviour.
Priority
Low relative to:
batched save_responses inserts (#): ~85 s reclaim from save.
vectorised validate_responses_v2 (#): ~30 s reclaim between forward and save.
Pick up after those land, or sooner if as_completed overlap starts being valuable (miners hitting full timeout consistently again).
Risk / review notes
sync_forward_multiprocess return type changes. Caller in query_available_miners_and_save_responses needs the matching update (positional miner_uids[i] → keyed by uid, or carry index through).
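Singleton pool lifecycle: register a shutdown hook so workers are cleaned up on SIGTERM, but never shut the pool down inside the normal cycle path.
Worker crashes: `ProcessPoolExecutor` does not replace a worker that dies abruptly. The pool is marked broken and later `submit` calls raise `BrokenProcessPool`. A per-cycle pool hides this because the next cycle builds a fresh one; a singleton must detect the broken pool and recreate it, otherwise a single crash fails every later cycle.
[…] `timeout` because of the `asyncio.wait_for` from #253 (fix: enforce total per-miner timeout in dendrite forward).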
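The lifecycle notes above can be sketched as a small holder object. Everything here (`SharedPool`, `install_shutdown_hooks`, the injectable `factory` argument) is a hypothetical illustration, not code from the repository. It covers two things: rebuilding the pool when `submit` reports it as broken, and turning SIGTERM into a normal interpreter exit so the `atexit` cleanup runs.

```python
import atexit
import concurrent.futures
import signal
import sys
import threading
from concurrent.futures.process import BrokenProcessPool


class SharedPool:
    """Hypothetical holder for a long-lived executor."""

    def __init__(self, nprocs, factory=concurrent.futures.ProcessPoolExecutor):
        self._nprocs = nprocs
        self._factory = factory  # injectable so the logic can be exercised
        self._pool = None
        self._lock = threading.Lock()

    def get(self):
        with self._lock:
            if self._pool is None:
                self._pool = self._factory(max_workers=self._nprocs)
            return self._pool

    def discard(self):
        with self._lock:
            pool, self._pool = self._pool, None
        if pool is not None:
            # Do not block on workers that may already be dead.
            pool.shutdown(wait=False)

    def submit(self, fn, *args):
        try:
            return self.get().submit(fn, *args)
        except BrokenProcessPool:
            # A worker died earlier: rebuild the pool and retry once.
            self.discard()
            return self.get().submit(fn, *args)


def install_shutdown_hooks(shared: SharedPool) -> None:
    """Call once from the main thread at start-up."""
    atexit.register(shared.discard)
    # The default SIGTERM action skips atexit handlers; converting the
    # signal into SystemExit lets them run.
    signal.signal(signal.SIGTERM, lambda signum, frame: sys.exit(0))
```

A crash in the middle of a cycle would still surface as `BrokenProcessPool` from `future.result()` for that cycle's in-flight chunks; this sketch only restores the pool for the next `submit`.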
Verification plan
Timing logs around pool acquisition, per-chunk completion, and per-miner post-processing. Before/after comparison on staging against a replay.
Success criteria: forward under timeout + ~10 s overhead; no regression in miner_predictions row count or format_validation distribution; no leaked workers after SIGTERM.
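One way the timing logs and the overhead criterion could be wired up, as a sketch only: the logger name, the `timed` helper, and `within_budget` are hypothetical, and the 10 s default simply mirrors the "~10 s overhead" figure above.

```python
import logging
import time
from contextlib import contextmanager
from typing import Dict, Iterator, Optional

logger = logging.getLogger("dendrite.timing")  # hypothetical logger name


@contextmanager
def timed(phase: str, totals: Optional[Dict[str, float]] = None) -> Iterator[None]:
    """Log the wall-clock time of a phase and optionally accumulate it."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        logger.info("%s took %.3f s", phase, elapsed)
        if totals is not None:
            totals[phase] = totals.get(phase, 0.0) + elapsed


def within_budget(forward_seconds: float, timeout: float, overhead: float = 10.0) -> bool:
    """Check the criterion: forward finishes within timeout + overhead."""
    return forward_seconds <= timeout + overhead
```

Wrapping pool acquisition, each chunk's post-processing, and the per-miner loop in `with timed("...", totals):` would give comparable before/after numbers from the same replay.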