perf: batch MinerPrediction inserts in save_responses #255

@Thykof

Context

Follow-up to #253 (ticket 05). With the dendrite forward bounded, save_responses is now the single largest phase of the low-frequency cycle, steady at ~105–118 s against a 300 s slot budget. That leaves ~180–200 s for everything else, so normal run-to-run variance can push the cycle into "Calculated delay is non-positive" territory. Batching the insert is the biggest single fix available.

Symptom

INFO Execution time for save_responses: 117.6267 seconds
INFO Execution time for save_responses: 104.1159 seconds
INFO Execution time for save_responses: 102.3338 seconds
...

Each call is one attempt and one transaction, with no retries triggered, so the 100+ s is the raw INSERT path.

Current shape

synth/validator/miner_data_handler.py:116 save_responses:

with connection.begin():
    result = connection.execute(insert_stmt_validator)             # ValidatorRequest
    ...
    miner_prediction_records = [...]                               # ~240 dicts
    connection.execute(insert(MinerPrediction).values(miner_prediction_records))

240 rows × ~2–3 MB JSON per row = ~600–700 MB of JSON in one INSERT — one payload SQLAlchemy has to JSON-encode up front, one statement Postgres has to parse server-side, one WAL flush at commit.
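
The payload estimate above could be checked against real data before tuning anything. The following is a minimal sketch, assuming the records are plain JSON-serialisable dicts; `encoded_size_mb`, `per_batch_sizes_mb`, and the synthetic `sample_records` are hypothetical names for illustration, and `json.dumps` only approximates whatever serialiser the driver actually uses.

```python
import json


def encoded_size_mb(records):
    """Total UTF-8 size of the JSON-encoded records, in MB (10^6 bytes)."""
    total_bytes = sum(len(json.dumps(record).encode("utf-8")) for record in records)
    return total_bytes / 1_000_000


def per_batch_sizes_mb(records, batch_size):
    """Encoded size of each consecutive slice of `batch_size` records."""
    return [
        encoded_size_mb(records[start:start + batch_size])
        for start in range(0, len(records), batch_size)
    ]


# Synthetic stand-in for miner_prediction_records (assumption: the real
# payloads are far larger, this only exercises the helpers).
sample_records = [
    {"miner_uid": uid, "prediction": [0.0] * 1000} for uid in range(240)
]

total_mb = encoded_size_mb(sample_records)
batch_mb = per_batch_sizes_mb(sample_records, 25)
print(f"total: {total_mb:.2f} MB, largest batch: {max(batch_mb):.2f} MB")
```

Running the same helpers on a captured production payload would show whether the per-row figure really sits in the 2–3 MB range.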

Proposed change

Same transaction, same retry wrapper, same rows written — split the bulk insert into batches of N rows (start N = 25):

BATCH_SIZE = 25
for start in range(0, len(miner_prediction_records), BATCH_SIZE):
    batch = miner_prediction_records[start:start + BATCH_SIZE]
    connection.execute(insert(MinerPrediction).values(batch))
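
The slicing in the loop above can also be factored into a small generator, which makes the batching easy to unit-test without a database. This is a sketch only; `iter_batches` is a hypothetical helper name, not something that exists in the codebase.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def iter_batches(records: Sequence[T], batch_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most `batch_size` records, in order."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]


# 240 placeholder records, matching the row count quoted in this issue.
records = [{"miner_uid": uid} for uid in range(240)]
batch_lengths = [len(batch) for batch in iter_batches(records, 25)]
print(batch_lengths)  # nine batches of 25 followed by one of 15
```

In save_responses, each yielded batch would be passed to connection.execute(insert(MinerPrediction).values(batch)) inside the existing transaction.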

Why this will help

  1. SQLAlchemy parameter serialisation is amortised: each batch is encoded and sent on its own, so Postgres starts receiving work after the first 25 rows are encoded rather than after all 240.
  2. Postgres parses ~60–75 MB per statement instead of 700 MB. Its protocol + planner handle the batched size gracefully.
  3. WAL flushes evenly across the transaction instead of one giant flush at commit.
  4. Validator-side memory pressure drops — each batch can be freed after its insert.
  5. Partial-failure diagnostics get better — 25-row batch error vs 240-row error.

Do NOT split the transaction. ValidatorRequest + its predictions must be all-or-nothing; downstream scoring reads by id and assumes the prediction rows are complete.
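
The all-or-nothing requirement can be illustrated with a small self-contained sketch. It uses the standard-library sqlite3 module as a stand-in for the SQLAlchemy/Postgres connection (an assumption made purely so the example runs anywhere); the table layout and the `save_all_or_nothing` helper are hypothetical. A failure in a late batch rolls back the request row and every earlier batch.

```python
import sqlite3

BATCH_SIZE = 25


def save_all_or_nothing(conn, request_id, rows, batch_size=BATCH_SIZE):
    """Insert the request row and all prediction batches in ONE transaction."""
    # `with conn:` commits if the block succeeds and rolls back if it raises.
    with conn:
        conn.execute("INSERT INTO validator_request (id) VALUES (?)", (request_id,))
        for start in range(0, len(rows), batch_size):
            batch = rows[start:start + batch_size]
            conn.executemany(
                "INSERT INTO miner_prediction (request_id, miner_uid, prediction) "
                "VALUES (?, ?, ?)",
                [(request_id, uid, payload) for uid, payload in batch],
            )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE validator_request (id INTEGER PRIMARY KEY)")
conn.execute(
    "CREATE TABLE miner_prediction ("
    "request_id INTEGER NOT NULL, miner_uid INTEGER NOT NULL, prediction TEXT NOT NULL)"
)
conn.commit()

# Request 1: every row is valid, so all 240 predictions are committed.
save_all_or_nothing(conn, 1, [(uid, "{}") for uid in range(240)])

# Request 2: row 200 violates NOT NULL, so the ninth batch fails and the
# whole transaction (request row plus the first eight batches) is rolled back.
bad_rows = [(uid, "{}") for uid in range(240)]
bad_rows[200] = (200, None)
try:
    save_all_or_nothing(conn, 2, bad_rows)
except sqlite3.IntegrityError:
    pass

request_count = conn.execute("SELECT COUNT(*) FROM validator_request").fetchone()[0]
prediction_count = conn.execute("SELECT COUNT(*) FROM miner_prediction").fetchone()[0]
print(request_count, prediction_count)
```

Only request 1 and its 240 predictions survive, which is the invariant downstream scoring relies on.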

Expected post-change: save_responses ~25–40 s. Buys ~70–90 s of slot margin per cycle.

Out of scope

  • Changing the prediction column type to bytea / msgpack (separate schema-migration ticket; only pursue if batching isn't enough).
  • Removing the @retry wrapper.
  • Parallelising inserts across connections (breaks atomicity).

Verification plan

  • Staging: time save_responses before/after with realistic payload. Target < 40 s.
  • Sweep BATCH_SIZE (10, 25, 50, 100), keep the value minimising wall time without pinning DB CPU.
  • Assert row count matches len(miner_predictions) across batch sizes.
  • Confirm downstream reward.py reads predictions unchanged.
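
The sweep and row-count checks above could be scripted along these lines. This is a rough harness sketch, not the staging procedure itself: it uses an in-memory sqlite3 database and small synthetic rows as stand-ins (both assumptions), `timed_batched_insert` is a hypothetical name, and it does not measure DB CPU, which would need to be observed separately on the real Postgres instance.

```python
import sqlite3
import time


def timed_batched_insert(rows, batch_size):
    """Insert `rows` in batches inside one transaction; return (seconds, row_count)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE miner_prediction (miner_uid INTEGER, prediction TEXT)")
    conn.commit()

    started = time.perf_counter()
    with conn:  # single transaction around every batch
        for start in range(0, len(rows), batch_size):
            conn.executemany(
                "INSERT INTO miner_prediction (miner_uid, prediction) VALUES (?, ?)",
                rows[start:start + batch_size],
            )
    elapsed = time.perf_counter() - started

    row_count = conn.execute("SELECT COUNT(*) FROM miner_prediction").fetchone()[0]
    conn.close()
    return elapsed, row_count


# Synthetic payload: 240 rows of 1 kB each (the real rows are ~2-3 MB).
rows = [(uid, "x" * 1000) for uid in range(240)]

results = {}
for batch_size in (10, 25, 50, 100):
    elapsed, row_count = timed_batched_insert(rows, batch_size)
    # The same number of rows must land regardless of batch size.
    assert row_count == len(rows)
    results[batch_size] = elapsed

best_batch_size = min(results, key=results.get)
print(f"fastest batch size in this run: {best_batch_size}")
```

Timings from a stand-in like this say nothing about Postgres; the value of the harness is the shape of the loop and the row-count assertion, which carry over to staging.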
