perf: vectorise validate_responses_v2 #256

@Thykof

Context

Follow-up to #253 (ticket 05). There is a ~30–36 s gap between the dendrite forward returning and save_responses starting. In the latest run the log shows "Forwarding took ... 147s" at 07:38:28 and "Starting call to ... save_responses" at 07:39:04, i.e. 36 s. That gap is the synchronous validation loop in query_available_miners_and_save_responses (synth/validator/forward.py:231):

for i, synapse in enumerate(synapses):
    response = synapse.deserialize()
    process_time = synapse.dendrite.process_time
    format_validation = validate_responses_v2(response, simulation_input, process_time)

For each of ~240 miners this iterates every path and every point in Python.

Why the current implementation is slow

synth/validator/response_validation_v2.py:94:

for path in all_paths:                               # 1000 paths / miner
    error_message = validate_path(path, expected_time_points)
    ...

def validate_path(path, expected_time_points):
    if not isinstance(path, list): ...
    if len(path) != expected_time_points: ...
    for point in path:                               # 289 points / path
        if not isinstance(point, (int, float)): ...
        if len(str(point).replace(".", "")) > 8: ...
    return None

Per miner: 1000 × 289 = 289,000 Python-level iterations, each with one isinstance call (checked against two types) and one str(point) allocation. Across 240 miners that is ~69 M iterations per cycle, all on the main thread after the forward finishes. At a few hundred nanoseconds per iteration, ~30 s matches expected CPython throughput. The str(point).replace('.', '') digit-count check in particular allocates a temporary string per point.
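
The arithmetic above can be checked with a few lines. This is only a back-of-envelope sketch: the constants are the figures quoted in this issue (1000 paths, 289 points, ~240 miners, ~30 s gap), and the variable names are illustrative.

```python
# Figures quoted in the issue text; adjust if the real cycle differs.
PATHS_PER_MINER = 1000
POINTS_PER_PATH = 289
MINERS = 240
GAP_SECONDS = 30.0

points_per_miner = PATHS_PER_MINER * POINTS_PER_PATH
points_per_cycle = points_per_miner * MINERS

# Time budget implied by the observed gap.
ns_per_point = GAP_SECONDS / points_per_cycle * 1e9
ms_per_miner = GAP_SECONDS / MINERS * 1e3

print(f"{points_per_miner:,} points per miner")   # 289,000
print(f"{points_per_cycle:,} points per cycle")   # 69,360,000
print(f"{ns_per_point:.0f} ns per point")         # 433
print(f"{ms_per_miner:.0f} ms per miner")         # 125
```

Roughly 430 ns per point is plausible for an isinstance call plus a float-to-string conversion in CPython, which is consistent with the loop being the whole gap.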

Proposed change

Operate on a NumPy array per path, so the inner per-point loop becomes a single C-level pass:

import numpy as np


def validate_path(path, expected_time_points):
    if not isinstance(path, list):
        return f"Path format is incorrect: expected list, got {type(path)}"
    if len(path) != expected_time_points:
        return f"Number of time points is incorrect: expected {expected_time_points}, got {len(path)}"
    try:
        arr = np.asarray(path, dtype=np.float64)
    except (TypeError, ValueError):
        return "Price format is incorrect: non-numeric value in path"
    if not np.isfinite(arr).all():
        return "Price format is incorrect: non-finite value in path"
    # Significant-digit rule: replace the per-float `len(str(x).replace('.', '')) > 8` check
    # with a vectorised equivalent. See "Risk" below — must be behaviour-preserving.
    return None

Paths can additionally be stacked into a 2-D array and checked once per miner.
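
A minimal sketch of the stacked variant, assuming every miner response is a list of equal-length lists. The function name validate_paths_stacked and the error strings are hypothetical, not taken from the codebase, and the significant-digit rule is deliberately left out because its replacement is still open.

```python
import numpy as np


def validate_paths_stacked(all_paths, expected_time_points):
    """Hypothetical per-miner check: one array conversion for all paths."""
    # Keep the container checks in Python; they are O(paths), not O(points).
    if not isinstance(all_paths, list) or not all(isinstance(p, list) for p in all_paths):
        return "Path format is incorrect: expected list of lists"
    try:
        arr = np.asarray(all_paths, dtype=np.float64)
    except (TypeError, ValueError):
        # Raised for non-numeric values and for ragged rows (paths of unequal length).
        return "Price format is incorrect: non-numeric value or ragged paths"
    if arr.ndim != 2 or arr.shape[1] != expected_time_points:
        return (
            "Number of time points is incorrect: "
            f"expected {expected_time_points}, got shape {arr.shape}"
        )
    if not np.isfinite(arr).all():
        return "Price format is incorrect: non-finite value in path"
    # Significant-digit rule intentionally omitted (see "Risk" in the issue).
    return None


print(validate_paths_stacked([[1.0, 2.0], [3.0, 4.0]], 2))
print(validate_paths_stacked([[1.0, 2.0], [3.0]], 2))
```

One trade-off of stacking: a ragged response is reported once for the whole miner, so the error no longer names the offending path unless the code falls back to the per-path check on failure.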

Why this will help

  1. np.asarray(path, dtype=np.float64) fails fast on non-numeric input in a single C pass, replacing the per-point Python isinstance loop (estimated ~100× speed-up on 289-float paths, to be confirmed by the micro-benchmark).
  2. No per-float string allocation. The digit-count check becomes numeric rather than string-based, cutting allocator pressure on the main thread — which matters because the same thread is about to serialise ~700 MB for save_responses.
  3. Pairs with future as_completed work (#): if validation runs in parallel with the remaining chunks arriving, keeping the loop fast-enough-to-not-dominate is a prerequisite.
  4. Memory layout matches what reward.py already needs — np.array(pred.prediction) is how scoring reads these back.
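
To make point 1 concrete, the sketch below shows what np.asarray with dtype=np.float64 rejects and what it silently coerces. The helper try_convert is hypothetical and exists only for this demonstration. Note that numeric strings and None are coerced rather than rejected, whereas the old isinstance check rejects both.

```python
import numpy as np


def try_convert(path):
    """Return the converted array, or the name of the exception raised."""
    try:
        return np.asarray(path, dtype=np.float64)
    except (TypeError, ValueError) as exc:
        return type(exc).__name__


print(try_convert([1.0, 2, 3.5]))    # ints and floats convert
print(try_convert([1.0, "abc"]))     # non-numeric string: ValueError
print(try_convert([1.0, [2.0]]))     # nested list: ValueError
print(try_convert([1.0, "2.5"]))     # numeric string is parsed to 2.5
print(try_convert([1.0, None]))      # None becomes nan
print(try_convert([1.0, True]))      # bool becomes 1.0 (the old check accepts bools too)
```

If strings must stay rejected, the conversion alone is not enough; the nan produced by None is at least caught by the np.isfinite check in the proposed validate_path.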

Expected reclaim: ~30 s → ~2 s on the validation gap.

Risk / behaviour-preservation

The existing len(str(point).replace('.', '')) > 8 is a string-based significant-digit cap and is subtle:

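  • 1e-05 stringifies to 5 non-dot chars ('1e-05', scientific notation), so the old rule accepts it; a naive magnitude check can reject it incorrectly.
  • Integers stringify without a dot, so len(str(int_value)) equals the digit count (plus one for a minus sign). The same value as a float gains a trailing ".0": 12345678 passes the rule but 12345678.0 does not, and np.asarray erases that distinction.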
  • np.asarray(path, dtype=np.float64) is more permissive than the isinstance check: numeric strings such as "2.5" are parsed and None becomes nan. Conversely, the old rule accepts nan and inf (str gives 'nan' / 'inf', 3 chars), so the isfinite check is a tightening unless the replay shows such values never occur.

Safe path: derive the replacement from samples of real miner_predictions.prediction rows (a few hundred miners × a week). Require 100 % agreement with the old rule on the replay before deploying.
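
A replay harness for that agreement check could look like the sketch below. old_rule_ok is the existing expression from this issue. Everything else is an assumption for illustration: naive_sig_digits_ok is a deliberately naive candidate (not the proposed replacement), replay_disagreements is a hypothetical helper, and the five samples are hand-picked rather than real miner data.

```python
import numpy as np


def old_rule_ok(point):
    """The existing string-based rule, copied from the issue text."""
    return len(str(point).replace(".", "")) <= 8


def naive_sig_digits_ok(arr, max_digits=8):
    """Hypothetical candidate: accept values with at most max_digits significant digits."""
    arr = np.asarray(arr, dtype=np.float64)
    with np.errstate(divide="ignore"):
        exponent = np.floor(np.log10(np.abs(arr)))
    exponent = np.where(np.isfinite(exponent), exponent, 0.0)  # treat 0.0 as 10**0
    scaled = arr * np.power(10.0, max_digits - 1 - exponent)
    return np.abs(scaled - np.round(scaled)) < 1e-6


def replay_disagreements(samples, candidate):
    """Return indices where the candidate's verdict differs from the old rule."""
    # The old rule must see the original Python objects: int vs float matters.
    old = np.array([old_rule_ok(p) for p in samples], dtype=bool)
    new = np.asarray(candidate(np.asarray(samples, dtype=np.float64)), dtype=bool)
    return np.flatnonzero(old != new)


samples = [67890.12, 1e-05, 12345678, 12345678.0, 0.00012345]
for i in replay_disagreements(samples, naive_sig_digits_ok):
    print(f"disagreement at {samples[i]!r}: old rule says {old_rule_ok(samples[i])}")
```

The naive candidate disagrees on 12345678.0 (the trailing ".0" counts as a digit in the old rule) and on 0.00012345 (leading zeros count too). A real candidate would be run the same way over the sampled production rows, with an empty result as the deployment gate.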

Verification plan

  • Unit tests pinning old vs new against a fixture of real production responses (valid + invalid). Every row must produce the same format_validation label.
  • Micro-benchmark: single-miner response target < 2 ms (down from ~250 ms). Re-measure the baseline first: a ~30 s gap across ~240 miners implies roughly 125 ms per miner.
  • Cycle-level on staging: confirm no shift in format_validation distribution vs main.
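
The micro-benchmark could start from something like the sketch below. It is not a like-for-like comparison: numpy_type_check omits the significant-digit rule, so it only bounds the cost of the conversion and the finite check. The synthetic prices, the function names, and the short error labels are all assumptions for illustration.

```python
import random
import timeit

import numpy as np

PATHS, POINTS = 1000, 289  # per-miner sizes quoted in the issue


def old_style_check(path):
    """Per-point Python loop, mirroring the checks quoted from response_validation_v2.py."""
    for point in path:
        if not isinstance(point, (int, float)):
            return "non-numeric"
        if len(str(point).replace(".", "")) > 8:
            return "too many digits"
    return None


def numpy_type_check(path):
    """Type and finiteness checks only; the digit rule is intentionally missing."""
    try:
        arr = np.asarray(path, dtype=np.float64)
    except (TypeError, ValueError):
        return "non-numeric"
    if not np.isfinite(arr).all():
        return "non-finite"
    return None


def best_ms(check, response, repeat=3):
    """Best wall-clock time in milliseconds to validate one miner response."""
    run = lambda: [check(path) for path in response]
    return min(timeit.repeat(run, number=1, repeat=repeat)) * 1e3


# Synthetic response: prices with two decimals, all valid under the old rule.
rng = random.Random(0)
response = [[round(rng.uniform(100.0, 999.0), 2) for _ in range(POINTS)] for _ in range(PATHS)]

print(f"old-style loop: {best_ms(old_style_check, response):.1f} ms per miner")
print(f"numpy checks:   {best_ms(numpy_type_check, response):.1f} ms per miner")
```

Timings depend on the machine and on the real price distribution, so the numbers are only meaningful relative to each other and should be re-run against recorded production responses.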

Out of scope

  • Moving validation into the miner workers (would require re-pickling errors with large payloads — counterproductive).
  • Tightening / loosening the validation rules themselves.
