History
The original VDiff (v1) ran on vtctld, synchronously pulling full row data from all source and target shards into a single process for comparison.
VDiff2 moved execution to the target tablets themselves — running asynchronously and per-shard — to eliminate the vtctld bottleneck and add resumability, but retained the same full-row-data comparison algorithm.
Problem
VDiff2 streams full row data from source and target tablets to the controller (target tablet), where rows are compared column-by-column. For tables with large rows (BLOBs, TEXT, JSON), this can cause the controller tablet to OOM because:
- Full row data from every source shard is held in memory simultaneously in the merge sort heap
- Each `VStreamRowsResponse` is deep-cloned (`CloneVT()`) due to gRPC proto buffer pooling, doubling memory per batch
- There are multiple buffered channels in the pipeline, each holding a batch per shard
- Report sample collection stores full row data for mismatches
For a 10 MB row across 10 source shards, peak memory on the controller is 300+ MB just for one row in flight. Multiple concurrent VDiffs or tables compound this.
A secondary risk: if the controller tablet OOMs mid-VDiff, vreplication streams that were stopped for consistent snapshotting can be left in Stopped state.
Finally, OOMs aside, transferring full rows over gRPC is costly even when memory holds: it consumes network bandwidth and creates garbage-collection pressure on both ends.
Proposal: VDiff3 - a checksum-based, streaming diff
Replace full row data streaming with per-row checksum streaming and parallel chunked reads. Only fetch full row data on mismatch for reporting.
```
+-----------------------------+      +-----------------------------+
| Source Tablet - Shard -80   |      | Source Tablet - Shard 80-   |
|  +-------+                  |      |  +-------+                  |
|  | MySQL |--> N chunk reads |      |  | MySQL |--> N chunk reads |
|  +-------+        |         |      |  +-------+        |         |
|          SHA-256 per row    |      |          SHA-256 per row    |
+--------------+--------------+      +--------------+--------------+
               |                                    |
               |        (PK, hash) via gRPC         |
               |          32 bytes/row              |
               v                                    v
+----------------------------------------------------------------+
|                   Target Tablet - Controller                    |
|  +-------+                                                      |
|  | MySQL |--> N chunk reads --> SHA-256 per row                 |
|  +-------+                          |                           |
|                                     v                           |
|        +-------------------------------------+                  |
|        |             Diff Engine             |                  |
|        |  Merge sort by PK - compare hashes  |                  |
|        +----------------+--------------------+                  |
|                         v                                       |
|                     DiffReport                                  |
+----------------------------------------------------------------+
                          |
                 mismatched PKs only
                point SELECT full row
                          |
           v              v              v
     Source MySQL   Source MySQL   Target MySQL
```
Design goals (in priority order)
- Correctness of the diff
- Avoid OOMs on the controller tablet
- Performance — at least on par with VDiff2, ideally faster via parallelism
- Efficiency — use as few system resources as possible
Design constraints
- Every column value must be checksummed — nothing can be skipped
- Mismatch reporting must show exactly which rows differ, same as VDiff2
- Must support replicas as sources (same GTID-based snapshotting as VDiff2)
What stays the same from VDiff2
- Execution model — runs on target tablets, asynchronously, per-shard
- Resumability — can resume from a `lastpk` after interruption
- GTID-based snapshotting — stops vreplication, waits for source replicas to catch up, takes consistent snapshots
- Merge sort by PK — same algorithm to globally order rows across shards
- Mismatch reporting — same `DiffReport`/`RowDiff` structures, same user-facing output
- Replica support — sources can be replicas, with the same lag-handling via `WaitForPosition`
- Throttler integration — subscribes to the lag throttler via the standard vstreamer API
- Database schema — same `_vt.vdiff`, `_vt.vdiff_table`, `_vt.vdiff_log` tables for state tracking
- CLI/API surface — same `vtctldclient VDiff` commands
What changes
Source/target tablets (rowstreamer):
- Split table into N PK-range chunks
- Open N connections with coordinated consistent snapshots (same GTID)
- Each connection reads its chunk, computes SHA-256 per row, streams `(PK values, checksum)` tuples
- Full row data never leaves the tablet over gRPC
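The chunk-splitting step can be sketched for a simple integer PK. This is illustrative only — real tables may have composite or sparse PKs, so the actual implementation would derive ranges differently, and `chunkRanges` is a hypothetical helper, not Vitess code:

```go
package main

import "fmt"

// chunkRanges splits an integer PK range [lo, hi] into n contiguous
// [start, end] chunks for parallel readers. The first (total % n)
// chunks absorb the remainder so sizes differ by at most one.
func chunkRanges(lo, hi, n int64) [][2]int64 {
	total := hi - lo + 1
	size := total / n
	rem := total % n
	out := make([][2]int64, 0, n)
	start := lo
	for i := int64(0); i < n; i++ {
		end := start + size - 1
		if i < rem { // spread the remainder over the first chunks
			end++
		}
		out = append(out, [2]int64{start, end})
		start = end + 1
	}
	return out
}

func main() {
	fmt.Println(chunkRanges(1, 10, 3)) // [[1 4] [5 7] [8 10]]
}
```

Each range then becomes a `WHERE pk >= start AND pk <= end` predicate on its own connection.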
Controller (target tablet):
- Receives `(PK, hash)` streams from all source shards + own target streams
- Compares 32-byte hashes instead of full column values
- Records mismatched PKs
Mismatch reporting:
- For each mismatched PK, fetches the full row via point SELECT from source and target
- Same `DiffReport`/`RowDiff` output as VDiff2
Memory profile
For a 10 MB row across 10 source shards:
- VDiff2: ~300+ MB on the controller
- VDiff3: ~320 bytes per row in flight (PK + 32-byte hash) × 10 shards ≈ 3 KB
Implementation
Phases
Phase 1: Checksum-based VDiff
- New `VStreamRowChecksums` RPC on tablet server — chunked parallel reads with per-row SHA-256
- VDiff3 diff engine — new package, shares report types with VDiff2
- Mismatch reporting — point-fetch full rows for mismatched PKs, same report format
- Integration — wire into VDiff command with version selection
Phase 2: Coordinated parallel snapshots
- Coordinated snapshot infrastructure — parallel snapshot creation with table lock coordination, chunked PK-range reads
Why per-row hashing, not per-chunk
With resharding, source and target shards have different row distributions. The controller merge-sorts by PK across shards. Per-chunk hashing doesn't align cleanly across differently-sharded sources since you can't guarantee the same rows end up in the same chunk on both sides. Per-row hashing works naturally with the existing merge sort architecture.
Why SHA-256
Correctness is the top priority. We're checksumming production data to verify correctness of a migration — collision resistance is non-negotiable. SHA-256 on modern hardware (with acceleration) is fast, so the performance cost is minimal relative to MySQL I/O and network overhead.
Early benchmarks on Apple M4 Pro (ARM64):
| Row size | Time per row | Throughput | Allocations |
| --- | --- | --- | --- |
| 10 cols × 100 B (1 KB) | 0.5 µs | ~1.8 GB/s | 2 allocs, 56 B |
| 10 cols × 10 KB (100 KB) | 31 µs | ~3.2 GB/s | 2 allocs, 56 B |
| 5 cols × 1 MB (5 MB) | 1.6 ms | ~3.2 GB/s | 2 allocs, 56 B |
At 3.2 GB/s throughput and constant 56-byte allocation overhead regardless of row size, SHA-256 is not a bottleneck. MySQL disk I/O and network will dominate.
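A rough way to sanity-check these numbers on your own hardware is to benchmark SHA-256 over a large buffer; the sketch below uses the standard library's `testing.Benchmark` (the helper name and the 1 MB size are illustrative, and results are machine-dependent):

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"testing"
)

// hashThroughputGBps benchmarks sha256.Sum256 over a buffer of the
// given size and returns the measured throughput in GB/s (decimal).
func hashThroughputGBps(size int) float64 {
	buf := make([]byte, size)
	res := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			sha256.Sum256(buf)
		}
	})
	// bytes per nanosecond == 1e9 bytes per second == GB/s
	return float64(size) / float64(res.NsPerOp())
}

func main() {
	fmt.Printf("1 MB values: %.2f GB/s\n", hashThroughputGBps(1<<20))
}
```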
Why hash in Go, not MySQL
- Deterministic across MySQL versions
- Full control over serialization (NULL handling, binary data, type coercion)
- The hash input is the raw `row.Values` byte buffer — the same bytes VDiff2 currently compares
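A minimal sketch of the per-row hashing, assuming column values arrive as raw byte slices with `nil` standing for SQL NULL. The framing details here (a NULL marker byte plus a length prefix per value) are an assumption about how one might serialize deterministically, not the actual wire format:

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

// hashRow computes a SHA-256 over a row's raw column bytes.
// Each value is length-prefixed so ("ab","c") and ("a","bc") hash
// differently, and NULL (nil) is framed distinctly from "".
func hashRow(values [][]byte) [32]byte {
	h := sha256.New()
	var lenBuf [8]byte
	for _, v := range values {
		if v == nil {
			h.Write([]byte{0}) // NULL marker
			continue
		}
		h.Write([]byte{1}) // present marker
		binary.BigEndian.PutUint64(lenBuf[:], uint64(len(v)))
		h.Write(lenBuf[:])
		h.Write(v)
	}
	var out [32]byte
	copy(out[:], h.Sum(nil))
	return out
}

func main() {
	a := hashRow([][]byte{[]byte("ab"), []byte("c")})
	b := hashRow([][]byte{[]byte("a"), []byte("bc")})
	fmt.Println(a != b) // prints true: length prefixes keep value boundaries distinct
}
```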
Collation normalization before hashing
VDiff2 handles collations at compare time: for each text column, it passes the column's collation ID to NullsafeCompare(), which internally calls coll.Collate() to do a collation-aware comparison. This works because VDiff2 has both the source and target row values side-by-side on the controller.
VDiff3 hashes rows independently on each tablet. If we hash the raw byte values without considering collation, two strings that are equal under their collation (e.g., "Foo" and "foo" under utf8mb4_general_ci) would produce different SHA-256 hashes — causing false mismatches.
To solve this, both source and target tablets normalize text column values according to their collation before hashing. For each text column with a non-binary collation:
- Look up the column's collation from `information_schema` (same query VDiff2 already uses in `getPKColumnCollations()`, extended to all columns)
- Before feeding the column value into SHA-256, produce a collation weight string using `colldata.Lookup(collationID).WeightString(dst, src, 0)` — the canonical byte representation where collation-equivalent strings produce identical bytes
- Hash the weight string instead of the raw bytes
Binary collation columns and non-text types are unaffected — raw bytes are already canonical. The overhead is negligible relative to SHA-256 and MySQL I/O.
| | VDiff2 | VDiff3 |
| --- | --- | --- |
| When | At compare time on controller | Before hashing on each tablet |
| How | `NullsafeCompare()` with collation ID | Weight string normalization before SHA-256 |
| Where | Controller tablet | Source and target tablets independently |
Coordinated parallel snapshots
To read the same table in parallel chunks at the same GTID:
1. Acquire `LOCK TABLES ... READ` (waits for in-flight writes to drain)
2. Capture GTID position
3. Start N transactions with `START TRANSACTION WITH CONSISTENT SNAPSHOT, READ ONLY`
4. Release lock
```
 Conn 0          Conn 1        Conn 2   ...  Conn N
   |               |             |            |
   | LOCK TABLES ... READ        |            |
   | (waits for in-flight        |            |
   |  writes to drain)           |            |
   |               |             |            |
,--|--- lock held -|-------------|------------|----.
|  |               |             |            |    |
|  | Capture GTID position       |            |    |
|  |               |             |            |    |
|  |               | START TRANSACTION WITH CONSISTENT SNAPSHOT
|  |               |             | START TRANSACTION WITH ...
|  |               |             |            | START TRANSACTION WITH ...
|  |               |             |            |    |
|  | UNLOCK TABLES |             |            |    |
`--|---------------|-------------|------------|----'
   |               |             |            |
   x          chunk 1 read   chunk 2 read  chunk N read
              WHERE pk >=    WHERE pk >=   WHERE pk >=
                range_1        range_2       range_N
                  |               |             |
              SHA-256/row     SHA-256/row   SHA-256/row
                  |               |             |
              stream (PK, hash) to controller
```
The lock is held only for steps 2-4: capturing a GTID and starting N InnoDB snapshot transactions. Each `START TRANSACTION` is a local InnoDB operation (microseconds). For N=4-8 connections, the lock duration is sub-millisecond after in-flight writes drain.
If the tablet crashes while holding the lock, MySQL closes the connections and releases the lock automatically. The dangerous crash window (stopped vreplication) is the same as VDiff2 and is unrelated to this mechanism.
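The statement sequence can be sketched as a plan builder. This is a sketch under stated assumptions — the table name and function are illustrative, and the real coordinator issues each `START TRANSACTION` on its own pooled connection rather than returning a flat list:

```go
package main

import "fmt"

// snapshotPlan returns the ordered SQL statements a coordinator would
// issue to open n read connections pinned to one consistent GTID.
// Everything between the GTID capture and UNLOCK TABLES runs while
// the table read lock is held.
func snapshotPlan(table string, n int) []string {
	stmts := []string{
		fmt.Sprintf("LOCK TABLES %s READ", table), // drains in-flight writes
		"SELECT @@global.gtid_executed",           // capture the snapshot GTID
	}
	for i := 0; i < n; i++ {
		// one per connection: pins an InnoDB read view at this instant
		stmts = append(stmts, "START TRANSACTION WITH CONSISTENT SNAPSHOT, READ ONLY")
	}
	return append(stmts, "UNLOCK TABLES")
}

func main() {
	for _, s := range snapshotPlan("t1", 4) {
		fmt.Println(s)
	}
}
```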
New protobuf
```proto
message VStreamRowChecksumsRequest {
  Target target = 1;
  string query = 2;
  QueryResult lastpk = 3;
  VStreamOptions options = 4;
  int32 parallelism = 5; // number of parallel chunk readers
}

message VStreamRowChecksumsResponse {
  repeated RowChecksum rows = 1;
  string gtid = 2;
  repeated Field pk_fields = 3;
}

message RowChecksum {
  Row pk_values = 1;
  bytes checksum = 2; // SHA-256, 32 bytes
}
```
Performance expectations
Should perform at least as well as VDiff2, likely faster:
- Same MySQL queries on each tablet, but parallelized across PK-range chunks
- Same merge sort on controller
- Less data over gRPC (32 bytes vs full row per row)
- Hash comparison (`bytes.Equal`) cheaper than per-column `NullsafeCompare` with collation
- SHA-256 computation is fast relative to MySQL I/O (1+ GB/s on modern CPUs)
- Parallel chunk reads scale tablet-side throughput with connection count
A benchmark comparing VDiff2 and VDiff3 end-to-end performance across a range of row sizes and shard counts is required before merging. This should measure wall-clock time, peak memory, and throughput (rows/sec) to validate that VDiff3 is not a regression.
POC benchmark results (MoveTables, 2 target shards, Apple M4 Pro)
Small rows (~89 bytes/row, 2M rows):
| Metric | VDiff2 | VDiff3 | Difference |
| --- | --- | --- | --- |
| Rows compared | 2,000,003 | 2,000,003 | — |
| Duration | 2s | 2s | Same |
| Throughput | 1,000,002 rows/sec | 1,000,002 rows/sec | Same |
| Data sent (source→controller) | 169.8 MB | 82.0 MB | 52% less |
| Bytes per row | 89 | 43 | 52% less |
Large rows (1 MB/row with LONGBLOB, 253 rows, 250 MB total):
| Metric | VDiff2 | VDiff3 | Difference |
| --- | --- | --- | --- |
| Rows compared | 253 | 253 | — |
| Duration | 1s | 1s | Same |
| Data sent (source→controller) | 250.0 MB | 0.01 MB | 99.99% less |
| Bytes per row | 1,036,224 | 43 | 24,098x less |
VDiff3 matches VDiff2 throughput in all cases. For small rows, data transfer is halved. For large rows (the OOM scenario), data transfer drops by 4 orders of magnitude — from ~1 MB/row to a fixed 43 bytes/row regardless of row size. This directly eliminates the OOM risk on the controller tablet.
Use Case(s)
Users of VDiff2