[SPARK-57150][SDP] SCD1 Out-of-order Convergence Suite#56214

Open
AnishMahto wants to merge 2 commits into apache:master from AnishMahto:SPARK-57150-SCD1-OOO-convergence-suite

@AnishMahto AnishMahto commented May 29, 2026

What changes were proposed in this pull request?

A key feature of SDP's AutoCDC implementation is that it supports reconciling out-of-order (by sequence) events. This support adds significant complexity to the reconciliation logic, since it requires cross-microbatch stateful tracking in the auxiliary table, and that logic is prone to breaking as the implementation evolves over time.

Introduce an A/B style test suite to execute the implementation on both a sequence-sorted single-microbatch event stream and the same events on a shuffled multi-microbatch event stream. If out-of-order processing is correct, then the SCD1 implementation should produce the same target tables for both runs.

Data is randomly generated, but with a constant seed for reproducibility.
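
To illustrate the convergence property being tested, here is a minimal sketch in plain Scala. It is not the SDP implementation: `Event`, `applyBatch`, and the in-memory `Map` target are hypothetical stand-ins, deletes and the auxiliary table are left out, and sequence numbers are assumed unique so that there are no ties to break.

```scala
import scala.util.Random

// Hypothetical event shape: an upsert for `key` tagged with a sequence number.
final case class Event(key: Int, seq: Long, value: String)

object Scd1ConvergenceSketch {
  // Toy target table: key -> (highest sequence applied so far, its value).
  type Target = Map[Int, (Long, String)]

  // Apply one micro-batch. An event only wins if its sequence is newer than
  // what the target already holds for that key, so arrival order is irrelevant.
  def applyBatch(target: Target, batch: Seq[Event]): Target =
    batch.foldLeft(target) { (acc, e) =>
      acc.get(e.key) match {
        case Some((seq, _)) if seq >= e.seq => acc
        case _ => acc.updated(e.key, (e.seq, e.value))
      }
    }

  // Run a whole stream, one micro-batch at a time, from an empty target.
  def run(batches: Seq[Seq[Event]]): Target =
    batches.foldLeft(Map.empty[Int, (Long, String)])(applyBatch)

  // Seeded generator: same seed, same events. Sequences 1..n are unique.
  def generate(seed: Long, n: Int, keys: Int): Seq[Event] = {
    val rng = new Random(seed)
    (1 to n).map(i => Event(rng.nextInt(keys), i.toLong, s"v$i"))
  }

  // A/B comparison: sorted single batch vs. shuffled multi-batch.
  def converges(seed: Long): Boolean = {
    val events = generate(seed, n = 200, keys = 10)
    val sortedRun = run(Seq(events.sortBy(_.seq)))
    val shuffledBatches = new Random(seed + 1).shuffle(events).grouped(7).toList
    sortedRun == run(shuffledBatches)
  }
}
```

If `applyBatch` simply overwrote the target with whichever event arrived last, the two runs would generally diverge, which is the kind of regression this style of suite is meant to catch.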

Why are the changes needed?

Test correctness of the AutoCDC SCD1 out-of-order reconciliation algorithm. Provides signal for the existing implementation's correctness and helps prevent regressions in future iterations.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Test-only change.

Was this patch authored or co-authored using generative AI tooling?

Co-authored with Claude Opus 4.7

@AnishMahto

@jose-torres Short test-only PR for A/B testing the out-of-order correctness of the AutoCDC SCD1 implementation.

@AnishMahto AnishMahto changed the title [SPARK-57150] SCD1 Out-of-order Convergence Suite [SPARK-57150][SDP] SCD1 Out-of-order Convergence Suite May 29, 2026
}
}

gridTest("SCD1 merge converges across micro-batch shuffling for randomly generated " +

Not sure this is quite right as a test structure. I'd expect to see a single test with a single seed that's randomized by default or uses a passed in command line argument. Right now there's no way to actually perform a reproduction without editing and rebuilding the test code.
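
A rough sketch of that shape, assuming a hypothetical system property named `test.scd1.seed` (not an existing Spark or SDP setting): the seed comes from the property when set, and is random otherwise.

```scala
import scala.util.Random

object SeedSketch {
  // Hypothetical property name, used here for illustration only.
  val SeedProperty = "test.scd1.seed"

  // Use the supplied seed if present, otherwise draw a fresh random one.
  def resolveSeed(props: Map[String, String]): Long =
    props.get(SeedProperty).map(_.toLong).getOrElse(Random.nextLong())
}
```

The test would call `SeedSketch.resolveSeed(sys.props.toMap)` once and include the resolved seed in its failure message, so a failing run can be reproduced by passing `-Dtest.scd1.seed=<value>` without editing or rebuilding the test code.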
