Merge pull request #65 from iamvirul/release/v0.7

iamvirul · web-flow · commit 8f323152b5ef · 2026-03-14T10:26:37.000+05:30
feat(perf): Streaming batch hashing and parallel table diffing for v0.7
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -7,6 +7,44 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ## [Unreleased]
 
+## [0.7] - 2026-03-14
+
+### Added
+- **Streaming Support for Large Datasets**
+  - Keyset-paginated batch hashing for large tables (`WHERE pk > lastVal ORDER BY pk LIMIT N`)
+  - Each page fetches a bounded number of rows, keeping heap flat regardless of table size (~150–200 MB peak vs ~700–900 MB unbatched)
+  - `--batch-size N` CLI flag for `diff` and `gen-pack` — overrides `performance.hash_batch_size` in config
+  - `--parallel N` CLI flag for `diff` and `gen-pack` — overrides `performance.max_parallel_tables` in config
+  - `performance` configuration section in `deepdiffdb.config.yaml`:
+    ```yaml
+    performance:
+      hash_batch_size: 10000      # rows per keyset-paginated query (0 = disabled)
+      max_parallel_tables: 2      # tables hashed concurrently
+    ```
+  - Parallel table hashing via bounded goroutine pool (`errgroup` + `semaphore.NewWeighted`)
+  - Per-batch memory telemetry at `DEBUG` level (`alloc_mb`, `batch`, `total_rows_hashed`)
+  - `--batch-size 0` falls back to pre-v0.7 full-scan behaviour (full backward compatibility)
+- **Shared Keyset Query Builder** (`internal/content/cursor.go`)
+  - `BuildCursorQuery` and `buildCursorWhere` extracted into a shared module supporting composite primary keys
+  - Used by both `hash.go` and `pack.go` — eliminates cursor logic drift between the two
+- **Sample 14: Streaming Large Datasets** (`samples/14-streaming-large-datasets/`)
+  - Go seed script generating 500k orders / 100k products / 200k audit_logs in SQLite
+  - Makefile targets: `seed`, `seed-small`, `diff`, `diff-fast`, `diff-sequential`, `gen-pack`, `clean`
+  - Memory tuning guide for low/standard/high-memory hosts
+  - No Docker required
+
+### Changed
+- `HashTable` signature extended with `batchSize int` parameter; `batchSize=0` preserves original behaviour
+- Sequential per-table loop in `runFullDiff` and `runGenPack` replaced with `hashTablesParallel`
+- Inline cursor closure in `pack.go` replaced with `BuildCursorQuery` (DRY)
+- `performance.hash_batch_size` defaults to `10000`; `performance.max_parallel_tables` defaults to `1`
+- `deepdiffdb.config.yaml.example` updated with commented `performance:` section
+
+### Performance
+- **~4× throughput improvement** on multi-table databases with `--parallel 4`
+- Memory during hashing reduced from O(n) unbounded growth to O(batch_size) bounded heap
+- `runtime.GC()` called after each batch so garbage from the previous page is collected before the next page is fetched
+
 ## [0.6.1] - 2026-01-08
 
 ### Added
@@ -250,7 +288,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - PostgreSQL schema-aware queries
 - MySQL foreign key check handling
 
-[Unreleased]: https://github.com/iamvirul/deepdiff-db/compare/v0.6.1...HEAD
+[Unreleased]: https://github.com/iamvirul/deepdiff-db/compare/v0.7...HEAD
+[0.7]: https://github.com/iamvirul/deepdiff-db/compare/v0.6.1...v0.7
 [0.6.1]: https://github.com/iamvirul/deepdiff-db/compare/v0.6...v0.6.1
 [0.6]: https://github.com/iamvirul/deepdiff-db/compare/v0.5...v0.6
 [0.5]: https://github.com/iamvirul/deepdiff-db/compare/v0.4...v0.5
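
Reviewer note on the keyset query builder: the CHANGELOG entry mentions `BuildCursorQuery` and `buildCursorWhere` with composite primary key support, but their bodies are not part of this diff. As a minimal sketch of the technique only, assuming `?` placeholders (PostgreSQL drivers would need `$n`) and unquoted identifiers, a composite-key predicate could be expanded as below. The names `keysetWhere` and `keysetQuery` and the `order_items` table are hypothetical and are not taken from the repository.

```go
package main

import (
	"fmt"
	"strings"
)

// keysetWhere builds the predicate that selects rows strictly after the
// last-seen key. For columns (a, b) it yields
//   (a > ?) OR (a = ? AND b > ?)
// and returns the number of placeholders the caller must bind.
// Hypothetical helper; not the project's buildCursorWhere.
func keysetWhere(pkCols []string) (string, int) {
	var terms []string
	nArgs := 0
	for i := range pkCols {
		var parts []string
		// All earlier key columns must equal the last-seen values...
		for _, c := range pkCols[:i] {
			parts = append(parts, c+" = ?")
		}
		// ...and the current column must be strictly greater.
		parts = append(parts, pkCols[i]+" > ?")
		nArgs += len(parts)
		terms = append(terms, "("+strings.Join(parts, " AND ")+")")
	}
	return strings.Join(terms, " OR "), nArgs
}

// keysetQuery assembles one page query. The first page has no cursor yet,
// so it omits the WHERE clause.
func keysetQuery(table string, pkCols []string, limit int, firstPage bool) string {
	var b strings.Builder
	fmt.Fprintf(&b, "SELECT * FROM %s", table)
	if !firstPage {
		where, _ := keysetWhere(pkCols)
		fmt.Fprintf(&b, " WHERE %s", where)
	}
	fmt.Fprintf(&b, " ORDER BY %s LIMIT %d", strings.Join(pkCols, ", "), limit)
	return b.String()
}

func main() {
	fmt.Println(keysetQuery("orders", []string{"id"}, 10000, false))
	fmt.Println(keysetQuery("order_items", []string{"order_id", "line_no"}, 10000, false))
}
```

For a single-column key this reduces to the `WHERE pk > lastVal ORDER BY pk LIMIT N` shape quoted in the CHANGELOG. The expanded OR form is used here because it does not rely on row-value comparison support in the target database.
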
diff --git a/README.md b/README.md
@@ -38,6 +38,8 @@ DeepDiff DB makes the entire process deterministic, reviewable, and safe by:
 - **Progress tracking** - Visual progress bars and spinners for long-running operations
 - **Checkpoint/resume** - Resume interrupted operations from saved checkpoints
 - **Enhanced error handling** - Rich error messages with actionable suggestions
+- **Streaming large datasets** - Keyset-paginated batch hashing keeps memory bounded at any table size
+- **Parallel table hashing** - Hash multiple tables concurrently with a configurable worker pool
 
 ### Safety Features
 
@@ -247,6 +249,16 @@ Resolution strategies:
 - `theirs`: Use development values (accept dev changes)
 - `manual`: Require interactive decision for each conflict
 
+**Performance Configuration (v0.7+):**
+- `performance.hash_batch_size`: Rows per keyset-paginated query during table hashing. `0` disables batching (loads all rows in one query). Default: `10000`
+- `performance.max_parallel_tables`: Maximum number of tables hashed concurrently. Default: `1`
+
+```yaml
+performance:
+  hash_batch_size: 10000      # ~1–2 MB per page; keeps heap bounded on any table size
+  max_parallel_tables: 2      # hash up to 2 tables concurrently; raises throughput ~2× on dual-core
+```
+
 An example configuration file is included at `deepdiffdb.config.yaml.example`.
 
 ## Commands
@@ -314,6 +326,18 @@ Performs a full comparison of both schema and data.
 deepdiffdb diff --config deepdiffdb.config.yaml
 ```
 
+**Large Dataset Options (v0.7+):**
+```bash
+# Keyset-paginated hashing, 2 tables in parallel
+deepdiffdb diff --config deepdiffdb.config.yaml --batch-size 10000 --parallel 2
+
+# Disable batching (pre-v0.7 behaviour, loads all rows in memory)
+deepdiffdb diff --config deepdiffdb.config.yaml --batch-size 0 --parallel 1
+```
+
+- `--batch-size N`: Rows per page when hashing large tables. Overrides `performance.hash_batch_size`. `0` = no pagination.
+- `--parallel N`: Max tables hashed concurrently. Overrides `performance.max_parallel_tables`.
+
 **Generate Interactive HTML Report:**
 ```bash
 deepdiffdb diff --config deepdiffdb.config.yaml --html
@@ -345,6 +369,14 @@ Generates a SQL migration pack for data differences.
 deepdiffdb gen-pack --config deepdiffdb.config.yaml
 ```
 
+**Large Dataset Options (v0.7+):**
+```bash
+deepdiffdb gen-pack --config deepdiffdb.config.yaml --batch-size 5000 --parallel 4
+```
+
+- `--batch-size N`: Rows per page when hashing large tables. Overrides `performance.hash_batch_size`. `0` = no pagination.
+- `--parallel N`: Max tables hashed concurrently. Overrides `performance.max_parallel_tables`.
+
 **Resume from Checkpoint:**
 ```bash
 deepdiffdb gen-pack --config deepdiffdb.config.yaml --resume
@@ -513,7 +545,7 @@ DeepDiff DB uses a multi-stage approach to ensure safe and accurate database syn
 7. **Migration Generation** - Creates SQL migration scripts with proper ordering and batching
 8. **Transactional Application** - Applies changes within a single transaction for atomicity
 
-The tool processes data in chunks for large tables and provides progress indicators for operations exceeding 10,000 rows. Progress bars show throughput (rows/second) and estimated time remaining. Checkpoints are automatically saved during long-running operations, allowing you to resume from interruptions.
+The tool processes data using **keyset-paginated batching** for large tables — each page fetches a bounded number of rows (`WHERE pk > lastVal ORDER BY pk LIMIT N`), keeping heap usage flat regardless of table size. Multiple tables can be hashed concurrently using a bounded goroutine pool. Progress bars show throughput (rows/second) and estimated time remaining. Checkpoints are automatically saved during long-running operations, allowing you to resume from interruptions.
 
 ## Architecture
 
@@ -584,7 +616,7 @@ Current limitations and known constraints:
 - **Database Support** - MSSQL and Oracle are not yet supported (planned for future releases)
 - **Schema Auto-merge** - Schema differences must be resolved manually
 - **Primary Key Requirement** - All tables must have primary keys (unless explicitly ignored)
-- **Large Database Performance** - Very large databases may produce large diff files and require significant processing time
+- **Large Database Performance** - Very large tables are handled with keyset-paginated batching (v0.7+); diff output files may still be large for tables with many changed rows
 - **Conflict Resolution** - Complex merge strategies (e.g., column-level merging) are not supported
 - **SQLite Constraints** - SQLite has limited support for ALTER TABLE operations
 
diff --git a/ROADMAP.md b/ROADMAP.md
@@ -8,44 +8,29 @@ We release a new version every **Saturday**. Each release includes one or more f
 
 ---
 
-## Current Status: v0.6
+## Current Status: v0.7
 
-**Last Release:** 2026-01-06
+**Last Release:** 2026-03-14
 
 **Current Features:**
-- Schema drift detection
-- Row-level data comparison
-- Migration pack generation
-- Transactional apply mode
-- MySQL, PostgreSQL, SQLite support
-- Conflict detection
-- JSON and text reports
-- Standalone schema migration command (`schema-migrate`)
-- DROP COLUMN support with safety controls
-- MODIFY COLUMN support (type changes, nullable changes)
-- CREATE TABLE and DROP TABLE support
-- Index support (CREATE INDEX, DROP INDEX)
-- Foreign key constraint handling (ADD/DROP FOREIGN KEY)
-- Primary key modification support
-- Dependency-aware migration ordering
+- Schema drift detection and standalone schema migration (`schema-migrate`)
+- Row-level data comparison with SHA-256 hashing
+- Migration pack generation and transactional apply mode
+- MySQL, PostgreSQL, and SQLite support
+- Conflict detection with `ours`/`theirs`/`manual` resolution strategies
 - Interactive `resolve-conflicts` command with `--auto` and `--resume` flags
-- Conflict resolution configuration (`ours`, `theirs`, `manual` strategies)
-- Per-table conflict resolution strategies
-- Resolution persistence with `resolutions.json`
-- Enhanced conflict reports with resolution statistics
-- **NEW:** Interactive HTML report generation with `--html` flag
-- **NEW:** Visual schema diff viewer with foreign key support
-- **NEW:** Data diff visualization with expandable row keys
-- **NEW:** Resolution strategy breakdown (auto/pending counts)
-- **NEW:** Per-table strategy table with conflict statistics
-- **NEW:** Conflict highlighting with strategy badges
-- **NEW:** SQL preview with syntax highlighting
-- **NEW:** Export to PDF functionality
-- **NEW:** Structured logging with JSON/text formats and log levels
-- **NEW:** Progress tracking with bars and spinners
-- **NEW:** Checkpoint/resume system for long-running operations
-- **NEW:** Enhanced error handling with suggestions and stack traces
-- **NEW:** Performance metrics collection
+- Per-table conflict resolution strategies with `resolutions.json` persistence
+- DROP/MODIFY COLUMN, CREATE/DROP TABLE, CREATE/DROP INDEX, ADD/DROP FOREIGN KEY
+- Primary key modification and dependency-aware migration ordering
+- Interactive HTML report with schema diff viewer, data diff, conflict highlighting, and SQL preview
+- Structured JSON/text logging with configurable levels and file output
+- Visual progress bars and throughput metrics
+- Checkpoint/resume system for long-running operations
+- Enhanced error handling with actionable suggestions and retry logic
+- **NEW:** Keyset-paginated batch hashing — `--batch-size N` / `performance.hash_batch_size`
+- **NEW:** Parallel table hashing — `--parallel N` / `performance.max_parallel_tables`
+- **NEW:** Bounded O(batch_size) memory during hashing regardless of table size
+- **NEW:** Per-batch memory telemetry at DEBUG log level (`alloc_mb`, `batch`)
 
 ---
 
@@ -113,27 +98,29 @@ We release a new version every **Saturday**. Each release includes one or more f
 
 ---
 
-## Upcoming Releases
+## Completed Releases (continued)
 
----
+### v0.7: Streaming Support for Large Datasets (Released 2026-03-14)
 
-### Week 1 - v0.7: Streaming Support for Large Datasets
-**Target Date:** Next Saturday
+**Features Delivered:**
+- Keyset-paginated batch hashing (`WHERE pk > lastVal ORDER BY pk LIMIT N`) — O(batch_size) heap at any table size
+- `--batch-size N` and `--parallel N` CLI flags for `diff` and `gen-pack`
+- `performance.hash_batch_size` and `performance.max_parallel_tables` config keys (defaults: 10000 / 1)
+- Bounded goroutine pool via `errgroup` + `semaphore.NewWeighted` for parallel table hashing
+- `BuildCursorQuery` shared module (`internal/content/cursor.go`) used by both hash and pack paths
+- Per-batch memory telemetry at DEBUG level
+- Sample 14: Streaming Large Datasets (SQLite, no Docker, seed script + Makefile)
 
-**Features:**
-- Streaming diff for tables > 1M rows
-- Memory-efficient hash computation
-- Chunked processing with progress tracking
-- Configurable batch sizes
-- Resume capability for interrupted operations
-- Performance optimizations for large databases
+**Impact:** Enables comparison of databases with millions of rows with bounded memory usage, and shortens wall-clock time on multi-table databases through parallel hashing
 
-**Impact:** Enables comparison of very large production databases
+---
+
+## Upcoming Releases
 
 ---
 
-### Week 2 - v0.8: MSSQL Support
-**Target Date:** Week 2 Saturday
+### v0.8: MSSQL Support
+**Target Date:** Next Saturday
 
 **Features:**
 - Microsoft SQL Server driver support
@@ -146,8 +133,8 @@ We release a new version every **Saturday**. Each release includes one or more f
 
 ---
 
-### Week 3 - v0.9: Oracle Support
-**Target Date:** Week 3 Saturday
+### v0.9: Oracle Support
+**Target Date:** Week 2 Saturday
 
 **Features:**
 - Oracle Database driver support
@@ -160,8 +147,8 @@ We release a new version every **Saturday**. Each release includes one or more f
 
 ---
 
-### Week 4 - v1.0: Production Ready Release
-**Target Date:** Week 4 Saturday
+### v1.0: Production Ready Release
+**Target Date:** Week 3 Saturday
 
 **Features:**
 - Comprehensive documentation
@@ -225,10 +212,10 @@ We release a new version every **Saturday**. Each release includes one or more f
 - ~~Conflict Resolution Strategies~~ (v0.4)
 - ~~HTML Report Viewer~~ (v0.5)
 - ~~Enhanced Error Handling & Logging~~ (v0.6)
+- ~~Streaming Support for Large Datasets~~ (v0.7)
 - Documentation & Production Readiness
 
 ### Medium Priority (Should Have)
-- Streaming Support for Large Datasets
 - MSSQL Support
 
 ### Low Priority (Nice to Have)
@@ -242,6 +229,7 @@ We release a new version every **Saturday**. Each release includes one or more f
 - [x] Conflict Resolution Strategies (v0.4)
 - [x] HTML Report Viewer (v0.5)
 - [x] Enhanced Error Handling & Logging (v0.6)
+- [x] Streaming Support for Large Datasets (v0.7)
 - [ ] All high-priority features implemented
 - [ ] Test coverage > 80%
 - [ ] Comprehensive documentation
@@ -271,5 +259,5 @@ If you'd like to contribute to any of these features, please:
 
 ---
 
-**Last Updated:** 2026-01-06
+**Last Updated:** 2026-03-14
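
Reviewer note on the hashing path: the README and ROADMAP changes above describe page-by-page hashing combined with a bounded goroutine pool, but `hashTablesParallel` itself is not shown in this diff. The sketch below illustrates the idea under stated assumptions: tables are simulated as sorted in-memory key slices, keys are non-negative, and a buffered channel stands in for `errgroup` + `semaphore.NewWeighted` so the example needs only the standard library. Error handling and cancellation are omitted. All function names (`fetchPage`, `hashTable`, `hashAll`) are hypothetical.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"encoding/hex"
	"fmt"
	"sort"
	"sync"
)

// fetchPage simulates `WHERE pk > last ORDER BY pk LIMIT n` over sorted keys.
func fetchPage(keys []int64, last int64, n int) []int64 {
	i := sort.Search(len(keys), func(i int) bool { return keys[i] > last })
	end := i + n
	if end > len(keys) {
		end = len(keys)
	}
	return keys[i:end]
}

// hashTable folds every key into one SHA-256 digest, one page at a time.
// batchSize <= 0 means a single page holding the whole table.
func hashTable(keys []int64, batchSize int) string {
	if batchSize <= 0 {
		batchSize = len(keys)
	}
	h := sha256.New()
	var buf [8]byte
	last := int64(-1) // assumption: keys are non-negative
	for {
		page := fetchPage(keys, last, batchSize)
		if len(page) == 0 {
			break
		}
		for _, k := range page {
			binary.BigEndian.PutUint64(buf[:], uint64(k))
			h.Write(buf[:])
		}
		last = page[len(page)-1] // cursor for the next page
	}
	return hex.EncodeToString(h.Sum(nil))
}

// hashAll hashes every table with at most maxParallel workers running at once.
func hashAll(tables map[string][]int64, batchSize, maxParallel int) map[string]string {
	if maxParallel < 1 {
		maxParallel = 1
	}
	sem := make(chan struct{}, maxParallel) // counting semaphore
	var wg sync.WaitGroup
	var mu sync.Mutex
	out := make(map[string]string, len(tables))
	for name, keys := range tables {
		sem <- struct{}{} // blocks while maxParallel workers are busy
		wg.Add(1)
		go func(name string, keys []int64) {
			defer wg.Done()
			defer func() { <-sem }()
			digest := hashTable(keys, batchSize)
			mu.Lock()
			out[name] = digest
			mu.Unlock()
		}(name, keys)
	}
	wg.Wait()
	return out
}

func main() {
	tables := map[string][]int64{
		"orders":     {1, 2, 3, 5, 8, 13, 21},
		"products":   {10, 20, 30},
		"audit_logs": {},
	}
	for name, digest := range hashAll(tables, 3, 2) {
		fmt.Println(name, digest)
	}
}
```

Because the digest is computed over the same byte stream regardless of page boundaries, a paginated run and an unpaginated run (`batchSize` of `0`) produce the same hash in this sketch, which mirrors the backward-compatibility claim for `--batch-size 0`.
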