RFC: VDiff3 - a checksum-based, streaming diff #19735

@timvaillancourt

Description

History

The original VDiff (v1) ran on vtctld, synchronously pulling full row data from all source and target shards into a single process for comparison.

VDiff2 moved execution to the target tablets themselves — running asynchronously and per-shard — to eliminate the vtctld bottleneck and add resumability, but retained the same full-row-data comparison algorithm.

Problem

VDiff2 streams full row data from source and target tablets to the controller (target tablet), where rows are compared column-by-column. For tables with large rows (BLOBs, TEXT, JSON), this can cause the controller tablet to OOM because:

  1. Full row data from every source shard is held in memory simultaneously in the merge sort heap
  2. Each VStreamRowsResponse is deep-cloned (CloneVT()) due to gRPC proto buffer pooling, doubling memory per batch
  3. There are multiple buffered channels in the pipeline, each holding a batch per shard
  4. Report sample collection stores full row data for mismatches

For a 10 MB row across 10 source shards, peak memory on the controller is 300+ MB just for one row in flight. Multiple concurrent VDiffs or tables compound this.

A secondary risk: if the controller tablet OOMs mid-VDiff, vreplication streams that were stopped for consistent snapshotting can be left in Stopped state.

Finally, OOMs aside, transferring full rows over gRPC carries a steady resource cost (network bandwidth, garbage collection, etc.)

Proposal: VDiff3 - a checksum-based, streaming diff

Replace full row data streaming with per-row checksum streaming and parallel chunked reads. Only fetch full row data on mismatch for reporting.

+-----------------------------+   +-----------------------------+
|  Source Tablet - Shard -80  |   |  Source Tablet - Shard 80-  |
|  +-------+                  |   |  +-------+                  |
|  | MySQL |--> N chunk reads |   |  | MySQL |--> N chunk reads |
|  +-------+       |         |   |  +-------+       |         |
|          SHA-256 per row    |   |          SHA-256 per row    |
+--------------+--------------+   +--------------+--------------+
               |                                  |
               |  (PK, hash) via gRPC             |
               |  32 bytes/row                    |
               v                                  v
+----------------------------------------------------------------+
|  Target Tablet - Controller                                    |
|  +-------+                                                     |
|  | MySQL |--> N chunk reads --> SHA-256 per row                |
|  +-------+                           |                         |
|                                      v                         |
|              +-------------------------------------+           |
|              |  Diff Engine                        |           |
|              |  Merge sort by PK - compare hashes  |           |
|              +----------------+--------------------+           |
|                               v                                |
|                          DiffReport                            |
+----------------------------------------------------------------+
                                |
                    mismatched PKs only
                    point SELECT full row
                                |
                      v         v         v
                Source MySQL  Source MySQL  Target MySQL

Design goals (in priority order)

  1. Correctness of the diff
  2. Avoid OOMs on the controller tablet
  3. Performance — at least on par with VDiff2, ideally faster via parallelism
  4. Efficiency — use as few system resources as possible

Design constraints

  • Every column value must be checksummed — nothing can be skipped
  • Mismatch reporting must show exactly which rows differ, same as VDiff2
  • Must support replicas as sources (same GTID-based snapshotting as VDiff2)

What stays the same from VDiff2

  • Execution model — runs on target tablets, asynchronously, per-shard
  • Resumability — can resume from a lastpk after interruption
  • GTID-based snapshotting — stops vreplication, waits for source replicas to catch up, takes consistent snapshots
  • Merge sort by PK — same algorithm to globally order rows across shards
  • Mismatch reporting — same DiffReport/RowDiff structures, same user-facing output
  • Replica support — sources can be replicas, with the same lag-handling via WaitForPosition
  • Throttler integration — subscribes to the lag throttler via the standard vstreamer API
  • Database schema — same _vt.vdiff, _vt.vdiff_table, _vt.vdiff_log tables for state tracking
  • CLI/API surface — same vtctldclient VDiff commands

What changes

Source/target tablets (rowstreamer):

  • Split table into N PK-range chunks
  • Open N connections with coordinated consistent snapshots (same GTID)
  • Each connection reads its chunk, computes SHA-256 per row, streams (PK values, checksum) tuples
  • Full row data never leaves the tablet over gRPC
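The per-row hashing step can be sketched as follows. This is a simplified illustration, not the actual rowstreamer code: rows are modeled as `[][]byte` column values, and a length prefix per column keeps the encoding unambiguous (the real implementation hashes the raw `row.Values` byte buffer, which carries its own length metadata).

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

// rowChecksum hashes every column value of a row into a fixed 32-byte
// digest. A length prefix per column keeps the encoding unambiguous,
// so ["ab","c"] and ["a","bc"] produce different checksums.
func rowChecksum(columns [][]byte) [32]byte {
	h := sha256.New()
	var lenBuf [8]byte
	for _, col := range columns {
		binary.BigEndian.PutUint64(lenBuf[:], uint64(len(col)))
		h.Write(lenBuf[:])
		h.Write(col)
	}
	var sum [32]byte
	copy(sum[:], h.Sum(nil))
	return sum
}

func main() {
	a := rowChecksum([][]byte{[]byte("ab"), []byte("c")})
	b := rowChecksum([][]byte{[]byte("a"), []byte("bc")})
	fmt.Println(a == b) // false: length prefixes disambiguate column boundaries
}
```

Only the 32-byte digest (plus PK values) crosses gRPC; the column bytes stay on the tablet.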

Controller (target tablet):

  • Receives (PK, hash) streams from all source shards + own target streams
  • Compares 32-byte hashes instead of full column values
  • Records mismatched PKs
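The comparison itself reduces to a merge over PK-sorted checksum streams. The sketch below shows the two-stream case with hypothetical `int64` PKs; the real controller does an N-way merge sort across all source shards using `sqltypes` PK values.

```go
package main

import (
	"bytes"
	"fmt"
)

type rowSum struct {
	pk       int64
	checksum []byte
}

// diffChecksums walks two PK-sorted checksum streams in lockstep and
// classifies each PK as mismatched, missing on target, or extra on target.
func diffChecksums(source, target []rowSum) (mismatched, missing, extra []int64) {
	i, j := 0, 0
	for i < len(source) && j < len(target) {
		switch {
		case source[i].pk < target[j].pk:
			missing = append(missing, source[i].pk) // on source only
			i++
		case source[i].pk > target[j].pk:
			extra = append(extra, target[j].pk) // on target only
			j++
		default:
			if !bytes.Equal(source[i].checksum, target[j].checksum) {
				mismatched = append(mismatched, source[i].pk)
			}
			i++
			j++
		}
	}
	for ; i < len(source); i++ {
		missing = append(missing, source[i].pk)
	}
	for ; j < len(target); j++ {
		extra = append(extra, target[j].pk)
	}
	return
}

func main() {
	src := []rowSum{{1, []byte{0xaa}}, {2, []byte{0xbb}}, {4, []byte{0xcc}}}
	tgt := []rowSum{{1, []byte{0xaa}}, {2, []byte{0xff}}, {3, []byte{0xdd}}}
	m, miss, ex := diffChecksums(src, tgt)
	fmt.Println(m, miss, ex) // [2] [4] [3]
}
```

Note that the per-PK comparison is a single `bytes.Equal` on 32 bytes, regardless of how wide the row is.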

Mismatch reporting:

  • For each mismatched PK, fetches the full row via point SELECT from source and target
  • Same DiffReport/RowDiff output as VDiff2
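The shape of the point fetch can be sketched as below. The function name and string formatting are illustrative only; real code would build the query with the parser and bind variables rather than `fmt.Sprintf`.

```go
package main

import (
	"fmt"
	"strings"
)

// pointSelect builds the per-PK row fetch issued only for mismatched PKs.
// This is a sketch: a production implementation would use bind variables,
// not string interpolation.
func pointSelect(table string, pkCols []string, pkVals []string) string {
	preds := make([]string, len(pkCols))
	for i, col := range pkCols {
		preds[i] = fmt.Sprintf("%s = %s", col, pkVals[i])
	}
	return fmt.Sprintf("select * from %s where %s", table, strings.Join(preds, " and "))
}

func main() {
	fmt.Println(pointSelect("customer", []string{"tenant_id", "id"}, []string{"42", "7"}))
	// select * from customer where tenant_id = 42 and id = 7
}
```

Because these fetches happen only for mismatched PKs, full row data crosses the wire in the (expectedly rare) failure case rather than for every row.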

Memory profile

For a 10 MB row across 10 source shards:

  • VDiff2: ~300+ MB on the controller
  • VDiff3: ~320 bytes per row in flight (PK + 32-byte hash) × 10 shards ≈ 3 KB

Implementation

Phases

Phase 1: Checksum-based VDiff
  1. New VStreamRowChecksums RPC on tablet server — chunked parallel reads with per-row SHA-256
  2. VDiff3 diff engine — new package, shares report types with VDiff2
  3. Mismatch reporting — point-fetch full rows for mismatched PKs, same report format
  4. Integration — wire into VDiff command with version selection
Phase 2: Coordinated parallel snapshots
  1. Coordinated snapshot infrastructure — parallel snapshot creation with table lock coordination, chunked PK-range reads

Why per-row hashing, not per-chunk

With resharding, source and target shards have different row distributions. The controller merge-sorts by PK across shards. Per-chunk hashing doesn't align cleanly across differently-sharded sources since you can't guarantee the same rows end up in the same chunk on both sides. Per-row hashing works naturally with the existing merge sort architecture.

Why SHA-256

Correctness is the top priority. We're checksumming production data to verify correctness of a migration — collision resistance is non-negotiable. SHA-256 on modern hardware (with acceleration) is fast, so the performance cost is minimal relative to MySQL I/O and network overhead.

Early benchmarks on Apple M4 Pro (ARM64):

| Row size | Time per row | Throughput | Allocations |
| --- | --- | --- | --- |
| 10 cols × 100 B (1 KB) | 0.5 µs | ~1.8 GB/s | 2 allocs, 56 B |
| 10 cols × 10 KB (100 KB) | 31 µs | ~3.2 GB/s | 2 allocs, 56 B |
| 5 cols × 1 MB (5 MB) | 1.6 ms | ~3.2 GB/s | 2 allocs, 56 B |

At 3.2 GB/s throughput and constant 56-byte allocation overhead regardless of row size, SHA-256 is not a bottleneck. MySQL disk I/O and network will dominate.

Why hash in Go, not MySQL

  • Deterministic across MySQL versions
  • Full control over serialization (NULL handling, binary data, type coercion)
  • The hash input is the raw row.Values byte buffer — the same bytes VDiff2 currently compares

Collation normalization before hashing

VDiff2 handles collations at compare time: for each text column, it passes the column's collation ID to NullsafeCompare(), which internally calls coll.Collate() to do a collation-aware comparison. This works because VDiff2 has both the source and target row values side-by-side on the controller.

VDiff3 hashes rows independently on each tablet. If we hash the raw byte values without considering collation, two strings that are equal under their collation (e.g., "Foo" and "foo" under utf8mb4_general_ci) would produce different SHA-256 hashes — causing false mismatches.

To solve this, both source and target tablets normalize text column values according to their collation before hashing. For each text column with a non-binary collation:

  1. Look up the column's collation from information_schema (same query VDiff2 already uses in getPKColumnCollations(), extended to all columns)
  2. Before feeding the column value into SHA-256, produce a collation weight string using colldata.Lookup(collationID).WeightString(dst, src, 0) — the canonical byte representation where collation-equivalent strings produce identical bytes
  3. Hash the weight string instead of the raw bytes

Binary collation columns and non-text types are unaffected — raw bytes are already canonical. The overhead is negligible relative to SHA-256 and MySQL I/O.
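The normalization step can be illustrated with a toy stand-in for the weight-string transform. Here `weightString` simply lowercases, mimicking a case-insensitive collation; the real implementation calls `colldata.Lookup(collationID).WeightString()`, which applies the collation's full weight tables (accents, expansions, padding, etc.).

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"strings"
)

// weightString is a stand-in for Vitess's collation weight-string
// transform: it maps collation-equivalent strings to identical bytes.
// Lowercasing mimics a simple case-insensitive collation only.
func weightString(val []byte) []byte {
	return []byte(strings.ToLower(string(val)))
}

// normalizedHash hashes the weight string, not the raw bytes, so
// collation-equivalent values produce identical checksums.
func normalizedHash(val []byte) [32]byte {
	return sha256.Sum256(weightString(val))
}

func main() {
	a := normalizedHash([]byte("Foo"))
	b := normalizedHash([]byte("foo"))
	fmt.Println(a == b) // true: equal under the ci collation, equal hashes
}
```

Without this step, "Foo" on the source and "foo" on the target would hash differently despite comparing equal under utf8mb4_general_ci, producing the false mismatches described above.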

| | VDiff2 | VDiff3 |
| --- | --- | --- |
| When | At compare time on controller | Before hashing on each tablet |
| How | NullsafeCompare() with collation ID | Weight string normalization before SHA-256 |
| Where | Controller tablet | Source and target tablets independently |

Coordinated parallel snapshots

To read the same table in parallel chunks at the same GTID:

  1. Acquire LOCK TABLES ... READ (waits for in-flight writes to drain)
  2. Capture GTID position
  3. Start N transactions with START TRANSACTION WITH CONSISTENT SNAPSHOT, READ ONLY
  4. Release lock
            Conn 0          Conn 1        Conn 2  ...  Conn N
              |               |             |            |
              |  LOCK TABLES ... READ       |            |
              |  (waits for in-flight       |            |
              |   writes to drain)          |            |
              |               |             |            |
         ,----|--- lock held -|-------------|------------|----.
         |    |               |             |            |    |
         |    |  Capture GTID position      |            |    |
         |    |               |             |            |    |
         |    |  START TRANSACTION WITH CONSISTENT SNAPSHOT   |
         |    |           START TRANSACTION WITH ...          |
         |    |                     START TRANSACTION WITH ...|
         |    |               |             |            |    |
         |    |  UNLOCK TABLES|             |            |    |
         `----|---------------|-------------|------------|----'
              |               |             |            |
              x          chunk 1 read   chunk 2 read  chunk N read
                         WHERE pk >=    WHERE pk >=   WHERE pk >=
                           range_1        range_2       range_N
                              |             |            |
                         SHA-256/row    SHA-256/row  SHA-256/row
                              |             |            |
                         stream (PK, hash) to controller

The lock is held only for steps 2-4: capturing a GTID and starting N InnoDB snapshot transactions. Each START TRANSACTION is a local InnoDB operation (microseconds). For N=4-8 connections, the lock duration is sub-millisecond after in-flight writes drain.

If the tablet crashes while holding the lock, MySQL closes the connections and releases the lock automatically. The dangerous crash window (stopped vreplication) is the same as VDiff2 and is unrelated to this mechanism.

New protobuf

message VStreamRowChecksumsRequest {
    Target target = 1;
    string query = 2;
    QueryResult lastpk = 3;
    VStreamOptions options = 4;
    int32 parallelism = 5; // number of parallel chunk readers
}

message VStreamRowChecksumsResponse {
    repeated RowChecksum rows = 1;
    string gtid = 2;
    repeated Field pk_fields = 3;
}

message RowChecksum {
    Row pk_values = 1;
    bytes checksum = 2;  // SHA-256, 32 bytes
}

Performance expectations

Should perform at least as well as VDiff2, likely faster:

  • Same MySQL queries on each tablet, but parallelized across PK-range chunks
  • Same merge sort on controller
  • Less data over gRPC (32 bytes vs full row per row)
  • Hash comparison (bytes.Equal) cheaper than per-column NullsafeCompare with collation
  • SHA-256 computation is fast relative to MySQL I/O (1+ GB/s on modern CPUs)
  • Parallel chunk reads scale tablet-side throughput with connection count

A benchmark comparing VDiff2 and VDiff3 end-to-end performance across a range of row sizes and shard counts is required before merging. This should measure wall-clock time, peak memory, and throughput (rows/sec) to validate that VDiff3 is not a regression.

POC benchmark results (MoveTables, 2 target shards, Apple M4 Pro)

Small rows (~89 bytes/row, 2M rows):

| Metric | VDiff2 | VDiff3 | Difference |
| --- | --- | --- | --- |
| Rows compared | 2,000,003 | 2,000,003 | |
| Duration | 2s | 2s | Same |
| Throughput | 1,000,002 rows/sec | 1,000,002 rows/sec | Same |
| Data sent (source→controller) | 169.8 MB | 82.0 MB | 52% less |
| Bytes per row | 89 | 43 | 52% less |

Large rows (1 MB/row with LONGBLOB, 253 rows, 250 MB total):

| Metric | VDiff2 | VDiff3 | Difference |
| --- | --- | --- | --- |
| Rows compared | 253 | 253 | |
| Duration | 1s | 1s | Same |
| Data sent (source→controller) | 250.0 MB | 0.01 MB | 99.99% less |
| Bytes per row | 1,036,224 | 43 | 24,098x less |

VDiff3 matches VDiff2 throughput in all cases. For small rows, data transfer is halved. For large rows (the OOM scenario), data transfer drops by 4 orders of magnitude — from ~1 MB/row to a fixed 43 bytes/row regardless of row size. This directly eliminates the OOM risk on the controller tablet.

Use Case(s)

Users of VDiff2
