History
The original VDiff (v1) ran on vtctld, synchronously pulling full row data from all source and target shards into a single process for comparison.
VDiff2 moved execution to the target tablets themselves — running asynchronously and per-shard — to eliminate the vtctld bottleneck and add resumability, but retained the same full-row-data comparison algorithm.
Problem
VDiff2 streams full row data from source and target tablets to the controller (target tablet), where rows are compared column-by-column. For tables with large rows (BLOBs, TEXT, JSON), this can cause the controller tablet to OOM because:
- Full row data from every source shard is held in memory simultaneously in the merge sort heap
- Each `VStreamRowsResponse` is deep-cloned (`CloneVT()`) due to gRPC proto buffer pooling, doubling memory per batch
- There are multiple buffered channels in the pipeline, each holding a batch per shard
- Report sample collection stores full row data for mismatches
For a 10 MB row across 10 source shards, peak memory on the controller is 300+ MB just for one row in flight. Multiple concurrent VDiffs or tables compound this.
A secondary risk: if the controller tablet OOMs mid-VDiff, vreplication streams that were stopped for consistent snapshotting can be left in Stopped state.
Finally, OOMs aside, transferring full rows over gRPC is costly even when memory holds: it consumes network bandwidth and creates garbage-collection pressure on both ends.
Proposal: VDiff3 - a checksum-based, streaming diff
Replace full row data streaming with per-row checksum streaming and parallel chunked reads. Only fetch full row data on mismatch for reporting.
```
+-----------------------------+      +-----------------------------+
| Source Tablet - Shard -80   |      | Source Tablet - Shard 80-   |
|  +-------+                  |      |  +-------+                  |
|  | MySQL |--> N chunk reads |      |  | MySQL |--> N chunk reads |
|  +-------+        |         |      |  +-------+        |         |
|          SHA-256 per row    |      |          SHA-256 per row    |
+--------------+--------------+      +--------------+--------------+
               |                                    |
               |        (PK, hash) via gRPC         |
               |          32 bytes/row              |
               v                                    v
+----------------------------------------------------------------+
|                   Target Tablet - Controller                    |
|  +-------+                                                      |
|  | MySQL |--> N chunk reads --> SHA-256 per row                 |
|  +-------+                          |                           |
|                                     v                           |
|        +-------------------------------------+                  |
|        |             Diff Engine             |                  |
|        |  Merge sort by PK - compare hashes  |                  |
|        +----------------+--------------------+                  |
|                         v                                       |
|                     DiffReport                                  |
+----------------------------------------------------------------+
                          |
                 mismatched PKs only
                point SELECT full row
                          |
           v              v              v
     Source MySQL   Source MySQL   Target MySQL
```
Design goals (in priority order)
- Correctness of the diff
- Avoid OOMs on the controller tablet
- Performance — at least on par with VDiff2, ideally faster via parallelism
- Efficiency — use as few system resources as possible
Design constraints
- Every column value must be checksummed — nothing can be skipped
- Mismatch reporting must show exactly which rows differ, same as VDiff2
- Must support replicas as sources (same GTID-based snapshotting as VDiff2)
What stays the same from VDiff2
- Execution model — runs on target tablets, asynchronously, per-shard
- Resumability — can resume from a `lastpk` after interruption
- GTID-based snapshotting — stops vreplication, waits for source replicas to catch up, takes consistent snapshots
- Merge sort by PK — same algorithm to globally order rows across shards
- Mismatch reporting — same `DiffReport`/`RowDiff` structures, same user-facing output
- Replica support — sources can be replicas, with the same lag-handling via `WaitForPosition`
- Throttler integration — subscribes to the lag throttler via the standard vstreamer API
- Database schema — same `_vt.vdiff`, `_vt.vdiff_table`, `_vt.vdiff_log` tables for state tracking
- CLI/API surface — same `vtctldclient VDiff` commands
What changes
Source/target tablets (rowstreamer):
- Split table into N PK-range chunks
- Open N connections with coordinated consistent snapshots (same GTID)
- Each connection reads its chunk, computes SHA-256 per row, streams `(PK values, checksum)` tuples
- Full row data never leaves the tablet over gRPC
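The chunk-splitting step can be sketched for a simple integer PK. This is illustrative only — real tables may have composite or sparse PKs, so the actual implementation would derive ranges differently, and `chunkRanges` is a hypothetical helper, not Vitess code:

```go
package main

import "fmt"

// chunkRanges splits an integer PK range [lo, hi] into n contiguous
// [start, end] chunks for parallel readers. The first (total % n)
// chunks absorb the remainder so sizes differ by at most one.
func chunkRanges(lo, hi, n int64) [][2]int64 {
	total := hi - lo + 1
	size := total / n
	rem := total % n
	out := make([][2]int64, 0, n)
	start := lo
	for i := int64(0); i < n; i++ {
		end := start + size - 1
		if i < rem { // spread the remainder over the first chunks
			end++
		}
		out = append(out, [2]int64{start, end})
		start = end + 1
	}
	return out
}

func main() {
	fmt.Println(chunkRanges(1, 10, 3)) // [[1 4] [5 7] [8 10]]
}
```

Each range then becomes a `WHERE pk >= start AND pk <= end` predicate on its own connection.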
Controller (target tablet):
- Receives `(PK, hash)` streams from all source shards + own target streams
- Compares 32-byte hashes instead of full column values
- Records mismatched PKs
Mismatch reporting:
- For each mismatched PK, fetches the full row via point SELECT from source and target
- Same `DiffReport`/`RowDiff` output as VDiff2
Memory profile
For a 10 MB row across 10 source shards:
- VDiff2: ~300+ MB on the controller
- VDiff3: ~320 bytes per row in flight (PK + 32-byte hash) × 10 shards ≈ 3 KB
Implementation
Phases
Phase 1: Checksum-based VDiff
- New `VStreamRowChecksums` RPC on tablet server — chunked parallel reads with per-row SHA-256
- VDiff3 diff engine — new package, shares report types with VDiff2
- Mismatch reporting — point-fetch full rows for mismatched PKs, same report format
- Integration — wire into VDiff command with version selection
Phase 2: Coordinated parallel snapshots
- Coordinated snapshot infrastructure — parallel snapshot creation with table lock coordination, chunked PK-range reads
Why per-row hashing, not per-chunk
With resharding, source and target shards have different row distributions. The controller merge-sorts by PK across shards. Per-chunk hashing doesn't align cleanly across differently-sharded sources since you can't guarantee the same rows end up in the same chunk on both sides. Per-row hashing works naturally with the existing merge sort architecture.
Why SHA-256
Correctness is the top priority. We're checksumming production data to verify correctness of a migration — collision resistance is non-negotiable. SHA-256 on modern hardware (with acceleration) is fast, so the performance cost is minimal relative to MySQL I/O and network overhead.
Early benchmarks on Apple M4 Pro (ARM64):
| Row size | Time per row | Throughput | Allocations |
| --- | --- | --- | --- |
| 10 cols × 100 B (1 KB) | 0.5 µs | ~1.8 GB/s | 2 allocs, 56 B |
| 10 cols × 10 KB (100 KB) | 31 µs | ~3.2 GB/s | 2 allocs, 56 B |
| 5 cols × 1 MB (5 MB) | 1.6 ms | ~3.2 GB/s | 2 allocs, 56 B |
At 3.2 GB/s throughput and constant 56-byte allocation overhead regardless of row size, SHA-256 is not a bottleneck. MySQL disk I/O and network will dominate.
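A rough way to sanity-check these numbers on your own hardware is to benchmark SHA-256 over a large buffer; the sketch below uses the standard library's `testing.Benchmark` (the helper name and the 1 MB size are illustrative, and results are machine-dependent):

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"testing"
)

// hashThroughputGBps benchmarks sha256.Sum256 over a buffer of the
// given size and returns the measured throughput in GB/s (decimal).
func hashThroughputGBps(size int) float64 {
	buf := make([]byte, size)
	res := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			sha256.Sum256(buf)
		}
	})
	// bytes per nanosecond == 1e9 bytes per second == GB/s
	return float64(size) / float64(res.NsPerOp())
}

func main() {
	fmt.Printf("1 MB values: %.2f GB/s\n", hashThroughputGBps(1<<20))
}
```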
Why hash in Go, not MySQL
- Deterministic across MySQL versions
- Full control over serialization (NULL handling, binary data, type coercion)
- The hash input is the raw `row.Values` byte buffer — the same bytes VDiff2 currently compares
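A minimal sketch of the per-row hashing, assuming column values arrive as raw byte slices with `nil` standing for SQL NULL. The framing details here (a NULL marker byte plus a length prefix per value) are an assumption about how one might serialize deterministically, not the actual wire format:

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

// hashRow computes a SHA-256 over a row's raw column bytes.
// Each value is length-prefixed so ("ab","c") and ("a","bc") hash
// differently, and NULL (nil) is framed distinctly from "".
func hashRow(values [][]byte) [32]byte {
	h := sha256.New()
	var lenBuf [8]byte
	for _, v := range values {
		if v == nil {
			h.Write([]byte{0}) // NULL marker
			continue
		}
		h.Write([]byte{1}) // present marker
		binary.BigEndian.PutUint64(lenBuf[:], uint64(len(v)))
		h.Write(lenBuf[:])
		h.Write(v)
	}
	var out [32]byte
	copy(out[:], h.Sum(nil))
	return out
}

func main() {
	a := hashRow([][]byte{[]byte("ab"), []byte("c")})
	b := hashRow([][]byte{[]byte("a"), []byte("bc")})
	fmt.Println(a != b) // prints true: length prefixes keep value boundaries distinct
}
```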
Collation normalization before hashing
VDiff2 handles collations at compare time: for each text column, it passes the column's collation ID to NullsafeCompare(), which internally calls coll.Collate() to do a collation-aware comparison. This works because VDiff2 has both the source and target row values side-by-side on the controller.
VDiff3 hashes rows independently on each tablet. If we hash the raw byte values without considering collation, two strings that are equal under their collation (e.g., "Foo" and "foo" under utf8mb4_general_ci) would produce different SHA-256 hashes — causing false mismatches.
To solve this, both source and target tablets normalize text column values according to their collation before hashing. For each text column with a non-binary collation:
- Look up the column's collation from `information_schema` (same query VDiff2 already uses in `getPKColumnCollations()`, extended to all columns)
- Before feeding the column value into SHA-256, produce a collation weight string using `colldata.Lookup(collationID).WeightString(dst, src, 0)` — the canonical byte representation where collation-equivalent strings produce identical bytes
- Hash the weight string instead of the raw bytes
Binary collation columns and non-text types are unaffected — raw bytes are already canonical. The overhead is negligible relative to SHA-256 and MySQL I/O.
| | VDiff2 | VDiff3 |
| --- | --- | --- |
| When | At compare time on controller | Before hashing on each tablet |
| How | `NullsafeCompare()` with collation ID | Weight string normalization before SHA-256 |
| Where | Controller tablet | Source and target tablets independently |
Coordinated parallel snapshots
To read the same table in parallel chunks at the same GTID:
1. Acquire `LOCK TABLES ... READ` (waits for in-flight writes to drain)
2. Capture GTID position
3. Start N transactions with `START TRANSACTION WITH CONSISTENT SNAPSHOT, READ ONLY`
4. Release lock
```
 Conn 0          Conn 1        Conn 2   ...  Conn N
   |               |             |            |
   | LOCK TABLES ... READ        |            |
   | (waits for in-flight        |            |
   |  writes to drain)           |            |
   |               |             |            |
,--|--- lock held -|-------------|------------|----.
|  |               |             |            |    |
|  | Capture GTID position       |            |    |
|  |               |             |            |    |
|  |               | START TRANSACTION WITH CONSISTENT SNAPSHOT
|  |               |             | START TRANSACTION WITH ...
|  |               |             |            | START TRANSACTION WITH ...
|  |               |             |            |    |
|  | UNLOCK TABLES |             |            |    |
`--|---------------|-------------|------------|----'
   |               |             |            |
   x          chunk 1 read   chunk 2 read  chunk N read
              WHERE pk >=    WHERE pk >=   WHERE pk >=
                range_1        range_2       range_N
                  |               |             |
              SHA-256/row     SHA-256/row   SHA-256/row
                  |               |             |
              stream (PK, hash) to controller
```
The lock is held only for steps 2-4: capturing a GTID and starting N InnoDB snapshot transactions. Each `START TRANSACTION` is a local InnoDB operation (microseconds). For N=4-8 connections, the lock duration is sub-millisecond after in-flight writes drain.
If the tablet crashes while holding the lock, MySQL closes the connections and releases the lock automatically. The dangerous crash window (stopped vreplication) is the same as VDiff2 and is unrelated to this mechanism.
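The statement sequence can be sketched as a plan builder. This is a sketch under stated assumptions — the table name and function are illustrative, and the real coordinator issues each `START TRANSACTION` on its own pooled connection rather than returning a flat list:

```go
package main

import "fmt"

// snapshotPlan returns the ordered SQL statements a coordinator would
// issue to open n read connections pinned to one consistent GTID.
// Everything between the GTID capture and UNLOCK TABLES runs while
// the table read lock is held.
func snapshotPlan(table string, n int) []string {
	stmts := []string{
		fmt.Sprintf("LOCK TABLES %s READ", table), // drains in-flight writes
		"SELECT @@global.gtid_executed",           // capture the snapshot GTID
	}
	for i := 0; i < n; i++ {
		// one per connection: pins an InnoDB read view at this instant
		stmts = append(stmts, "START TRANSACTION WITH CONSISTENT SNAPSHOT, READ ONLY")
	}
	return append(stmts, "UNLOCK TABLES")
}

func main() {
	for _, s := range snapshotPlan("t1", 4) {
		fmt.Println(s)
	}
}
```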
New protobuf
```proto
message VStreamRowChecksumsRequest {
  Target target = 1;
  string query = 2;
  QueryResult lastpk = 3;
  VStreamOptions options = 4;
  int32 parallelism = 5; // number of parallel chunk readers
}

message VStreamRowChecksumsResponse {
  repeated RowChecksum rows = 1;
  string gtid = 2;
  repeated Field pk_fields = 3;
}

message RowChecksum {
  Row pk_values = 1;
  bytes checksum = 2; // SHA-256, 32 bytes
}
```
Performance expectations
Should perform at least as well as VDiff2, likely faster:
- Same MySQL queries on each tablet, but parallelized across PK-range chunks
- Same merge sort on controller
- Less data over gRPC (32 bytes vs full row per row)
- Hash comparison (`bytes.Equal`) cheaper than per-column `NullsafeCompare` with collation
- SHA-256 computation is fast relative to MySQL I/O (1+ GB/s on modern CPUs)
- Parallel chunk reads scale tablet-side throughput with connection count
A benchmark comparing VDiff2 and VDiff3 end-to-end performance across a range of row sizes and shard counts is required before merging. This should measure wall-clock time, peak memory, and throughput (rows/sec) to validate that VDiff3 is not a regression.
POC benchmark results (MoveTables, 2 target shards, Apple M4 Pro)
Small rows (~89 bytes/row, 2M rows):
| Metric | VDiff2 | VDiff3 | Difference |
| --- | --- | --- | --- |
| Rows compared | 2,000,003 | 2,000,003 | — |
| Duration | 2s | 2s | Same |
| Throughput | 1,000,002 rows/sec | 1,000,002 rows/sec | Same |
| Data sent (source→controller) | 169.8 MB | 82.0 MB | 52% less |
| Bytes per row | 89 | 43 | 52% less |
Large rows (1 MB/row with LONGBLOB, 253 rows, 250 MB total):
| Metric | VDiff2 | VDiff3 | Difference |
| --- | --- | --- | --- |
| Rows compared | 253 | 253 | — |
| Duration | 1s | 1s | Same |
| Data sent (source→controller) | 250.0 MB | 0.01 MB | 99.99% less |
| Bytes per row | 1,036,224 | 43 | 24,098x less |
VDiff3 matches VDiff2 throughput in all cases. For small rows, data transfer is halved. For large rows (the OOM scenario), data transfer drops by 4 orders of magnitude — from ~1 MB/row to a fixed 43 bytes/row regardless of row size. This directly eliminates the OOM risk on the controller tablet.
Use Case(s)
Users of VDiff2