
fix(data-branch): unify diff/merge collect-range over arbitrary DAG depth #24371

Merged
mergify[bot] merged 10 commits into matrixorigin:main from gouhongshen:fix/data-branch-diff-chain-walk
May 15, 2026

Conversation

@gouhongshen
Contributor

What type of PR is this?

  • API-change
  • BUG
  • Improvement
  • Documentation
  • Feature
  • Test and CI
  • Code Refactoring

Which issue(s) this PR fixes:

issue #21979

What this PR does / why we need it:

TL;DR

data branch diff and data branch merge previously dropped rows
silently whenever the LCA → endpoint chain in the branch DAG had any
intermediate node that performed inserts/updates/deletes before the
endpoint was forked. Symptom: rows that the endpoint inherits via
multi-level cloning were never compared on its side of the diff, so
downstream merges failed to migrate them too. The fix replaces the
historical case-by-case logic with a single ancestor-walking model
that works uniformly for any DAG depth and shape.

The bug

Concretely, with a chain t0 → t1 → t2 where t1 inserts (4, 40)
before t2 is branched:

create table t0 (a int primary key, b int);
insert into t0 values (1,10), (2,20), (3,30);

data branch create table t1 from t0;
insert into t1 values (4, 40);
data branch create table t2 from t1;
insert into t2 values (7, 70);

data branch diff t2 against t0;

Expected (state-diff):

t2 INSERT 4 40
t2 INSERT 7 70

Before this PR:

t2 INSERT 7 70                ← (4,40) is silently dropped

Root cause: decideCollectRange only walked the two endpoint
relations. Mutations performed on intermediate ancestors (here, t1's
insert (4,40)) live in those intermediates' change streams, not in
either endpoint's. The previous case 2 / case 3 / case 4 sub-branches
each handled at most one extra LCA relation, never the full chain.
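
The failure mode above can be reproduced in miniature. The following is a toy sketch, not the MatrixOne implementation: the `table` type, `endpointOnly`, and `chainWalk` are hypothetical names, each table stores only the rows written directly to it, and timestamps are ignored because every write in the example happens before the next fork.

```go
package main

import "fmt"

// row is one (a, b) tuple from the example tables.
type row struct{ a, b int }

// table is a hypothetical stand-in for a branch relation: it records only
// the rows written directly to it (its own change stream) and its clone parent.
type table struct {
	name    string
	parent  *table
	inserts []row
}

// endpointOnly mimics the pre-fix behaviour: read just the endpoint's stream.
func endpointOnly(endpoint *table) []row {
	return append([]row(nil), endpoint.inserts...)
}

// chainWalk reads the stream of every table strictly below lca on the way
// down to endpoint, oldest ancestor first.
func chainWalk(lca, endpoint *table) []row {
	var chain []*table
	for n := endpoint; n != nil && n != lca; n = n.parent {
		chain = append(chain, n)
	}
	var out []row
	for i := len(chain) - 1; i >= 0; i-- {
		out = append(out, chain[i].inserts...)
	}
	return out
}

func main() {
	t0 := &table{name: "t0", inserts: []row{{1, 10}, {2, 20}, {3, 30}}}
	t1 := &table{name: "t1", parent: t0, inserts: []row{{4, 40}}}
	t2 := &table{name: "t2", parent: t1, inserts: []row{{7, 70}}}

	fmt.Println("endpoint only:", endpointOnly(t2))
	fmt.Println("chain walk:", chainWalk(t0, t2))
}
```

Reading only t2's stream yields (7,70) alone, while walking t1 and then t2 also picks up (4,40), which matches the expected state-diff.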

The fix — unified DAG-walk model

For each endpoint X we know the path [LCA, …, X] from the DAG. The
new mutations distinguishing the two endpoints live entirely on
path[LCA, endpoint] of each side; everything strictly above LCA is
shared and reconciles to no diff. For each ancestor A on that path:

windowFrom_A = A.CTS + 1                          (A's own writes)
windowEnd_A  = nextChildOnPath.CloneTS            (A is intermediate)
             | endpointSP                         (A is the endpoint)

Plus one prune at the LCA: clamp windowFrom_LCA up to
otherFork.Next() so that the prefix the other side already shares via
clone is skipped (scanning it would only produce changes that
reconcile to nothing).

The single buildSideCollectRange helper applies this to either
side. The historical case 2 / case 3 / case 4 branching is gone.
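
The window rule can be sketched as a single loop over the path. This is an illustration under stated assumptions rather than the actual `buildSideCollectRange`: the `node` and `window` types and the `sideWindows` function are hypothetical, timestamps are plain integers, and the caller is assumed to pass the already-computed `otherFork.Next()` value as `otherForkNext`.

```go
package main

import "fmt"

// node is a hypothetical stand-in for one table on the LCA -> endpoint path.
type node struct {
	name    string
	cts     int64 // creation timestamp of the table (A.CTS)
	cloneTS int64 // timestamp at which the table was cloned from its parent
}

// window is the closed range [from, end], i.e. (A.CTS, end] in the
// notation of the description.
type window struct {
	name      string
	from, end int64
}

// empty reports whether the window contains no timestamps at all.
func (w window) empty() bool { return w.from > w.end }

// sideWindows computes one window per node on the path. path[0] is the LCA,
// the last element is the endpoint, and endpointSP is the snapshot the
// endpoint is read at. otherForkNext is the lower bound used to prune the
// LCA window; a value of 0 leaves the LCA window unpruned.
func sideWindows(path []node, endpointSP, otherForkNext int64) []window {
	out := make([]window, 0, len(path))
	for i, a := range path {
		w := window{name: a.name, from: a.cts + 1}
		if i+1 < len(path) {
			w.end = path[i+1].cloneTS // A is intermediate
		} else {
			w.end = endpointSP // A is the endpoint
		}
		if i == 0 && otherForkNext > w.from {
			w.from = otherForkNext // skip the prefix shared via clone
		}
		out = append(out, w)
	}
	return out
}

func main() {
	// Made-up timestamps for the chain t0 -> t1 -> t2; a clone's CTS is
	// assumed to equal its CloneTS here.
	path := []node{
		{name: "t0", cts: 10},
		{name: "t1", cts: 20, cloneTS: 20},
		{name: "t2", cts: 30, cloneTS: 30},
	}
	for _, w := range sideWindows(path, 40, 41) {
		fmt.Printf("%s: [%d, %d] empty=%v\n", w.name, w.from, w.end, w.empty())
	}
}
```

With these numbers the pruned LCA window becomes empty, t1 contributes its writes up to the moment t2 was cloned, and t2 contributes its writes up to the endpoint snapshot.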

Implementation

  • DataBranchDAG.PathFromAncestor (new): top-down chain from LCA to
    any descendant.
  • branchMetaInfo (refactored): now carries
    pathFromLCAToTar / pathFromLCAToBase as the single source of
    truth. lcaType / lcaLeft / lcaRight / lcaOther / tarBranchTS /
    baseBranchTS are gone; their information is derived on demand
    via hasLCA() / tarLCASnapshot() / baseLCASnapshot() /
    lcaProbeSnapshot().
  • decideCollectRange (rewritten): three branches kept (case 0
    same-table, case 1 unrelated tables, case 2 unified
    has-LCA chain walk) — the entire pre-existing case 2/3/4 logic
    collapses into one loop driven by the path.
  • Dead lcaType int parameters removed from diffDataHelper,
    hashDiffIfNoLCA, and buildHashmapForTable (they were forwarded
    through three layers and never read).
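
As an illustration of the top-down chain that `PathFromAncestor` is described as returning, here is a minimal sketch. The `dag` type, its parent map, and the `pathFromAncestor` signature are assumptions made for this example and do not reflect the real `DataBranchDAG` API.

```go
package main

import "fmt"

// dag is a hypothetical, minimal branch DAG: each table ID maps to the ID
// of the table it was cloned from. Root tables have no entry.
type dag struct {
	parent map[uint64]uint64
}

// pathFromAncestor returns the top-down chain [ancestor, ..., descendant].
// The boolean is false when ancestor is not on descendant's parent chain.
// The parent map is assumed to be acyclic.
func (d *dag) pathFromAncestor(ancestor, descendant uint64) ([]uint64, bool) {
	path := []uint64{descendant}
	for cur := descendant; cur != ancestor; {
		p, ok := d.parent[cur]
		if !ok {
			return nil, false // reached a root without meeting ancestor
		}
		path = append(path, p)
		cur = p
	}
	// The walk collected the chain bottom-up; reverse it in place.
	for i, j := 0, len(path)-1; i < j; i, j = i+1, j-1 {
		path[i], path[j] = path[j], path[i]
	}
	return path, true
}

func main() {
	// IDs 0, 1, 2 model the chain t0 -> t1 -> t2; ID 3 is a sibling of t1.
	d := &dag{parent: map[uint64]uint64{1: 0, 2: 1, 3: 0}}
	fmt.Println(d.pathFromAncestor(0, 2))
	fmt.Println(d.pathFromAncestor(3, 2))
}
```

Returning the path top-down lets the caller pair each node with its next child on the path directly, which is what the window rule needs.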

The running example used throughout the new comments is a 19-node
multi-fork tree of depth 4 with 10 leaves (t0 root, fork to t1/t2/t3,
each forks again twice, leaves t9..t18). Every diff/merge example in
the doc traces against this tree so reviewers can sanity-check the
window arithmetic against a concrete shape.

BVT additions

  • git4data/branch/diff/diff_12.sql — 12 cases covering linear chains
    with IUD on every chain level, mid-chain post-branch writes,
    snapshot mixes, no-PK chains, double-INSERT-same-PK reconciliation,
    alternating multi-round IUD, and 1000-row scale.
  • git4data/branch/diff/diff_13.sql — one giant tree case (19 nodes,
    depth 4, 10 leaves) with 14 diff pairs spanning leaf↔root,
    cross-subtree, sibling, cousin, intermediate↔intermediate variants.
  • git4data/branch/merge/merge_8.sql — 9 cases for chain merge:
    cascade up the chain, direct grandchild → grandparent merge,
    conflict skip/accept across chain depths, snapshot-scoped, no-PK,
    alternating-round merges, order-of-merges regression.

Verification

  • Whole test/distributed/cases/git4data/branch/diff/ BVT —
    1317/1317 PASS, 100%.
  • Whole test/distributed/cases/git4data/branch/merge/ BVT —
    466/466 PASS, 100%.
  • pkg/frontend unit tests pass.

Special notes for your reviewer:

  • The state-diff observation in the bug section is the simplest
    reproduction. The same shape underlies every chain failure
    (deeper trees just compound it).
  • During development the deep cleanup (commit e4dc4d3099)
    initially regressed merge_6 / merge_7 (no-PK cases). Root
    cause was a swapped path lookup in the LCA-snapshot helpers,
    fixed in the final commit dd0205ebba. The unified design
    made the bug surface as one helper to fix, not five
    scattered branches — which is the point of the refactor.
  • etc/launch/cn.toml contains a local port override on my
    workspace and is intentionally not in this PR.

gouhongshen and others added 4 commits May 13, 2026 11:19
`data branch diff` previously walked only the two endpoint relations
when computing change streams. Mutations performed on intermediate
ancestors (e.g. t1 in a chain t0->t1->t2) that were inherited into the
endpoint via cloning never reached the diff, causing rows to silently
disappear from `diff t2 against t0`.

Replace the old case-by-case branches (LCA == base / LCA == tar / LCA
elsewhere with three sub-conditions each) with a single ancestor walk
driven by the LCA->endpoint paths captured from the branch DAG. For
each ancestor A on path[LCA, endpoint] we collect A's own mutations
within (A.CTS, nextChildOnPath.CloneTS] (or (A.CTS, endpointSP] when
A is the endpoint). The LCA's prefix that the other side inherits
identically is pruned to skip redundant IO. The model generalizes to
any tree depth and any pair of nodes.

Add `DataBranchDAG.PathFromAncestor` and the corresponding chain
fields on `branchMetaInfo` to thread the path information from the
DAG resolver to the collect-range builder. The new
`buildSideCollectRange` helper materializes one side of the diff
uniformly; the historical case-2/3/4 logic is gone.

BVT additions exercise the new code path:
* test/distributed/cases/git4data/branch/diff/diff_12.sql:
  12 cases covering linear chains, mid-chain writes, snapshot mixes,
  no-PK chains, and post-fork ancestor mutations.
* test/distributed/cases/git4data/branch/diff/diff_13.sql:
  one giant tree case (19 nodes, depth 4, 10 leaves) with 10 diff
  pairs across non-adjacent levels.

Whole `git4data/branch/diff/` BVT now passes 1317/1317.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…esolver

Follow-up to the unified collect-range fix. No semantic change; BVT
still passes 1317/1317.

* Drop stale comments left over from the pre-unified design:
  - "1. has no lca / 2. ..." ASCII case list in
    decideLCABranchTSFromBranchDAG that described the old branch_t1_ts
    / branch_t2_ts model.
  - Lone "Boundary semantics:" header in decideCollectRange whose
    body was already absorbed into the new running example block.
  - "// has no lca" trailing comment in diffOnBase that contradicted
    the lca-handling block right above it.
* Refactor decideLCABranchTSFromBranchDAG so the LCA path information
  is computed in a single linear flow (FindLCA → PathFromAncestor →
  switch on lcaTableID) instead of a `var (...)` block + defer that
  re-wired branchInfo from outside. Adds a small `buildTSs` helper
  so the int64-to-types.TS lift no longer repeats four times. The
  function-level doc comment now states what each output field is
  for and who consumes it downstream.
* Drop an unused error-wrap path: the previous code re-wrapped a
  `dag.GetCloneTS` not-found result into an internal error even though
  the DAG always populates CloneTS for any node it knows about (FindLCA
  has already validated descent). Simply ignoring the not-found result
  of `dag.GetCloneTS` matches the intent.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Eliminate the last vestiges of the pre-unified design where
branchMetaInfo carried three pieces of information that all derive
from the LCA->endpoint paths:

  * lcaType (lcaEmpty / lcaLeft / lcaRight / lcaOther) — was set in
    five places but only ever read at one decision point ("is the
    side empty?"); the other reads forwarded it through diffDataHelper
    and buildHashmapForTable as a *dead parameter*. Replace the gate
    with branchMetaInfo.hasLCA() and remove the dead parameter
    entirely.
  * tarBranchTS / baseBranchTS — per-side LCA snapshot used by
    findDeleteAndUpdateBat and diffOnBase. Both values are exactly
    pathFromLCAToOther[1].cloneTS (or otherSP when the other endpoint
    is the LCA). Expose three small helpers on branchMetaInfo —
    tarLCASnapshot, baseLCASnapshot, lcaProbeSnapshot — and let the
    consumers compute on demand.

Other touch-ups in the same spirit:

  * decideLCABranchTSFromBranchDAG now also handles the
    same-table-id-no-DAG case directly with synthetic single-node
    paths, instead of leaking the special-case setup into
    decideCollectRange. The function is once again single-purpose.
  * The lcaEmpty / lcaLeft / lcaRight / lcaOther constants are gone.
  * data_branch_hashdiff_test.go updated for the new signatures.

Whole `git4data/branch/diff/` BVT still passes 1317/1317. Frontend
unit tests pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The previous refactor swapped the meaning of pathFromLCAToTar vs
pathFromLCAToBase in the LCA-snapshot helpers, which broke
no-PK / fake-PK merge cases (merge_6, merge_7) where tombstones on
the descendant side were resolved against the wrong moment of the
LCA. The right semantics:

  tarLCASnapshot returns the moment tar's view of the LCA was
  anchored — tar's first-child CloneTS when tar forked off the LCA,
  otherwise base's first-child CloneTS, otherwise baseSP. The
  baseLCASnapshot mirror works the same way.

This matches the original `tarBranchTS = CloneTS(tarFirstChild)`
behavior. After the fix the whole merge/ BVT (merge_1..7) goes back
to 100%, and the chain merge_8.sql (172 statements over 9 cases)
also passes 100%.
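
The fallback order described for `tarLCASnapshot` can be sketched as follows. This is one reading of the commit message, not the actual helper: the `pathNode` and `meta` types are hypothetical, timestamps are plain integers, and `baseLCASnapshot` is written as the literal mirror (base path first, then tar path, then `tarSP`).

```go
package main

import "fmt"

// pathNode is a hypothetical element of an LCA -> endpoint path.
type pathNode struct {
	name    string
	cloneTS int64
}

// meta mirrors only the fields of branchMetaInfo needed for this sketch.
type meta struct {
	pathFromLCAToTar  []pathNode
	pathFromLCAToBase []pathNode
	tarSP, baseSP     int64
}

// tarLCASnapshot prefers tar's first child below the LCA, then base's first
// child below the LCA, and finally falls back to the base snapshot.
func (m meta) tarLCASnapshot() int64 {
	if len(m.pathFromLCAToTar) > 1 {
		return m.pathFromLCAToTar[1].cloneTS
	}
	if len(m.pathFromLCAToBase) > 1 {
		return m.pathFromLCAToBase[1].cloneTS
	}
	return m.baseSP
}

// baseLCASnapshot is the mirror image with tar and base swapped.
func (m meta) baseLCASnapshot() int64 {
	if len(m.pathFromLCAToBase) > 1 {
		return m.pathFromLCAToBase[1].cloneTS
	}
	if len(m.pathFromLCAToTar) > 1 {
		return m.pathFromLCAToTar[1].cloneTS
	}
	return m.tarSP
}

func main() {
	// diff t2 against t0 on the chain t0 -> t1 -> t2, with made-up timestamps.
	m := meta{
		pathFromLCAToTar:  []pathNode{{"t0", 0}, {"t1", 20}, {"t2", 30}},
		pathFromLCAToBase: []pathNode{{"t0", 0}},
		tarSP:             40,
		baseSP:            50,
	}
	fmt.Println(m.tarLCASnapshot(), m.baseLCASnapshot())
}
```

In this example base is the LCA itself, so both helpers resolve to the moment t1 was cloned from t0.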

Includes:
* test/distributed/cases/git4data/branch/merge/merge_8.{sql,result}
  — chain merge BVT covering cascade up the chain, direct grandchild
  -> grandparent merge, conflict skip/accept across multiple chain
  depths, snapshot-scoped, no-PK chain, alternating-round merges,
  and order-of-merges regression.
* Regenerated diff_12.result / diff_13.result against the corrected
  helpers (no semantic change, only the LCA probe TS that drives
  tombstone resolution).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

mergify Bot pushed a commit that referenced this pull request May 14, 2026
…ry DAG depth (cherry-pick #24371 to 4.0-dev) (#24378)

Cherry-picks the 5 substantive commits from #24371 (skips the
intermediate `Merge branch 'main'` merge commit, which only resolves
context against `main`). All commits applied cleanly with no conflicts.

| commit | summary |
| --- | --- |
| `0b566bbe75` | fix(data-branch): unify diff collect-range over arbitrary DAG depth |
| `b0f44f14d5` | chore(data-branch): clean up dead comments and tangled defer in DAG resolver |
| `e844aaf7ff` | refactor(data-branch): drop redundant lcaType / branchTS fields |
| `b63cdac409` | fix(data-branch): correct LCA-snapshot helpers for chain merge |
| `e193420425` | fix(data-branch): address review comments and fix gofmt |

The final tree state of all 11 modified files (`pkg/frontend/data_branch*.go`,
`pkg/frontend/databranchutils/branch_dag.go`, and the new BVT cases under
`test/distributed/cases/git4data/branch/{diff,merge}/`) is byte-identical
to the head of #24371 (verified via per-file shasum diff).

### TL;DR (from #24371)

`data branch diff` and `data branch merge` previously dropped rows
silently whenever the LCA → endpoint chain in the branch DAG had any
intermediate node that performed inserts/updates/deletes before the
endpoint was forked. Symptom: rows that the endpoint inherits via
multi-level cloning were never compared on its side of the diff, so
downstream merges failed to migrate them too. The fix replaces the
historical case-by-case logic with a single ancestor-walking model
that works uniformly for any DAG depth and shape.

Refer to the original PR for the full bug analysis, fix design, and
running 19-node tree example used throughout the new code comments.

### Verification on 4.0-dev

* `pkg/frontend` build + `go vet` clean.
* `pkg/frontend` unit tests pass (including
`TestHashDiff_NoLCAWithStubHandles`, `TestDiffDataHelper_*`,
`TestRunLCAProbeWithReaderFallback_*`, `TestHandleDelsOnLCA_*`,
`TestLCAProbeJoinCastType`).
* `gofmt -l` clean on all five modified Go files.
* New BVT cases (`diff_12`, `diff_13`, `merge_8`) included with their
corresponding `.result` files.

Approved by: @XuPeng-SH, @heni02
@mergify
Contributor

mergify Bot commented May 15, 2026

Merge Queue Status

  • Entered queue 2026-05-15 11:54 UTC · Rule: main
  • Checks skipped · PR is already up-to-date
  • Merged 2026-05-15 11:55 UTC · at 3fbc5fbf9a33f61675fd0ef3ca0a68a567984e01 · squash

This pull request spent 47 seconds in the queue, including 7 seconds running CI.

Required conditions to merge
  • #approved-reviews-by >= 1 [🛡 GitHub branch protection]
  • #changes-requested-reviews-by = 0 [🛡 GitHub branch protection]
  • #review-threads-unresolved = 0 [🛡 GitHub branch protection]
  • github-review-decision = APPROVED [🛡 GitHub branch protection]
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone Compose CI / multi cn e2e bvt test docker compose(PESSIMISTIC)
    • check-neutral = Matrixone Compose CI / multi cn e2e bvt test docker compose(PESSIMISTIC)
    • check-skipped = Matrixone Compose CI / multi cn e2e bvt test docker compose(PESSIMISTIC)
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone Standlone CI / Multi-CN e2e BVT Test on Linux/x64(LAUNCH, PROXY)
    • check-neutral = Matrixone Standlone CI / Multi-CN e2e BVT Test on Linux/x64(LAUNCH, PROXY)
    • check-skipped = Matrixone Standlone CI / Multi-CN e2e BVT Test on Linux/x64(LAUNCH, PROXY)
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone Standlone CI / e2e BVT Test on Linux/x64(LAUNCH, PESSIMISTIC)
    • check-neutral = Matrixone Standlone CI / e2e BVT Test on Linux/x64(LAUNCH, PESSIMISTIC)
    • check-skipped = Matrixone Standlone CI / e2e BVT Test on Linux/x64(LAUNCH, PESSIMISTIC)
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone CI / SCA Test on Ubuntu/x86
    • check-neutral = Matrixone CI / SCA Test on Ubuntu/x86
    • check-skipped = Matrixone CI / SCA Test on Ubuntu/x86
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone CI / UT Test on Ubuntu/x86
    • check-neutral = Matrixone CI / UT Test on Ubuntu/x86
    • check-skipped = Matrixone CI / UT Test on Ubuntu/x86
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone Compose CI / multi cn e2e bvt test docker compose(Optimistic/PUSH)
    • check-neutral = Matrixone Compose CI / multi cn e2e bvt test docker compose(Optimistic/PUSH)
    • check-skipped = Matrixone Compose CI / multi cn e2e bvt test docker compose(Optimistic/PUSH)
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone Standlone CI / e2e BVT Test on Linux/x64(LAUNCH,Optimistic)
    • check-neutral = Matrixone Standlone CI / e2e BVT Test on Linux/x64(LAUNCH,Optimistic)
    • check-skipped = Matrixone Standlone CI / e2e BVT Test on Linux/x64(LAUNCH,Optimistic)
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone Upgrade CI / Compatibility Test With Target on Linux/x64(LAUNCH)
    • check-neutral = Matrixone Upgrade CI / Compatibility Test With Target on Linux/x64(LAUNCH)
    • check-skipped = Matrixone Upgrade CI / Compatibility Test With Target on Linux/x64(LAUNCH)
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone Utils CI / Coverage
    • check-neutral = Matrixone Utils CI / Coverage
    • check-skipped = Matrixone Utils CI / Coverage


Labels

  • kind/bug: Something isn't working
  • kind/refactor: Code refactor
  • kind/test-ci
  • size/XXL: Denotes a PR that changes 2000+ lines


4 participants