Commit 8f32315

Merge pull request #65 from iamvirul/release/v0.7
feat(perf): Streaming batch hashing and parallel table diffing for v0.7
2 parents acce2a6 + d58eb89 commit 8f32315

3 files changed

Lines changed: 116 additions & 57 deletions

CHANGELOG.md

Lines changed: 40 additions & 1 deletion
@@ -7,6 +7,44 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

 ## [Unreleased]

+## [0.7] - 2026-03-14
+
+### Added
+- **Streaming Support for Large Datasets**
+  - Keyset-paginated batch hashing for large tables (`WHERE pk > lastVal ORDER BY pk LIMIT N`)
+  - Each page fetches a bounded number of rows, keeping heap flat regardless of table size (~150–200 MB peak vs ~700–900 MB unbatched)
+  - `--batch-size N` CLI flag for `diff` and `gen-pack` — overrides `performance.hash_batch_size` in config
+  - `--parallel N` CLI flag for `diff` and `gen-pack` — overrides `performance.max_parallel_tables` in config
+  - `performance` configuration section in `deepdiffdb.config.yaml`:
+    ```yaml
+    performance:
+      hash_batch_size: 10000   # rows per keyset-paginated query (0 = disabled)
+      max_parallel_tables: 2   # tables hashed concurrently
+    ```
+  - Parallel table hashing via bounded goroutine pool (`errgroup` + `semaphore.NewWeighted`)
+  - Per-batch memory telemetry at `DEBUG` level (`alloc_mb`, `batch`, `total_rows_hashed`)
+  - `--batch-size 0` falls back to pre-v0.7 full-scan behaviour (full backward compatibility)
+- **Shared Keyset Query Builder** (`internal/content/cursor.go`)
+  - `BuildCursorQuery` and `buildCursorWhere` extracted into a shared module supporting composite primary keys
+  - Used by both `hash.go` and `pack.go` — eliminates cursor logic drift between the two
+- **Sample 14: Streaming Large Datasets** (`samples/14-streaming-large-datasets/`)
+  - Go seed script generating 500k orders / 100k products / 200k audit_logs in SQLite
+  - Makefile targets: `seed`, `seed-small`, `diff`, `diff-fast`, `diff-sequential`, `gen-pack`, `clean`
+  - Memory tuning guide for low/standard/high-memory hosts
+  - No Docker required
+
+### Changed
+- `HashTable` signature extended with `batchSize int` parameter; `batchSize=0` preserves original behaviour
+- Sequential per-table loop in `runFullDiff` and `runGenPack` replaced with `hashTablesParallel`
+- Inline cursor closure in `pack.go` replaced with `BuildCursorQuery` (DRY)
+- `performance.hash_batch_size` defaults to `10000`; `performance.max_parallel_tables` defaults to `1`
+- `deepdiffdb.config.yaml.example` updated with commented `performance:` section
+
+### Performance
+- **~4× throughput improvement** on multi-table databases with `--parallel 4`
+- Memory during hashing reduced from O(n) unbounded growth to O(batch_size) bounded heap
+- `runtime.GC()` hint issued after each batch to return memory promptly between pages
+
 ## [0.6.1] - 2026-01-08

 ### Added
@@ -250,7 +288,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - PostgreSQL schema-aware queries
 - MySQL foreign key check handling

-[Unreleased]: https://github.com/iamvirul/deepdiff-db/compare/v0.6.1...HEAD
+[Unreleased]: https://github.com/iamvirul/deepdiff-db/compare/v0.7...HEAD
+[0.7]: https://github.com/iamvirul/deepdiff-db/compare/v0.6.1...v0.7
 [0.6.1]: https://github.com/iamvirul/deepdiff-db/compare/v0.6...v0.6.1
 [0.6]: https://github.com/iamvirul/deepdiff-db/compare/v0.5...v0.6
 [0.5]: https://github.com/iamvirul/deepdiff-db/compare/v0.4...v0.5

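Note on the keyset pagination named in the changelog above (`WHERE pk > lastVal ORDER BY pk LIMIT N`): the sketch below shows one way such a page query could be assembled for single-column and composite primary keys. It is illustrative only. `buildKeysetQuery` is a hypothetical helper, not the project's `BuildCursorQuery`, whose signature does not appear in this commit. It assumes trusted identifiers, `?` placeholders (MySQL/SQLite style), at least one primary key column, and a database that supports row-value comparison.

```go
package main

import (
	"fmt"
	"strings"
)

// buildKeysetQuery is a hypothetical helper for illustration; it is NOT the
// project's BuildCursorQuery. It returns the SQL for one page of rows ordered
// by the primary key columns. When hasCursor is true, a row-value comparison
// skips every row at or before the last key seen, and the caller binds one
// argument per primary key column. Identifiers are assumed to be trusted;
// real code would quote them per dialect.
func buildKeysetQuery(table string, pkCols []string, batchSize int, hasCursor bool) string {
	cols := strings.Join(pkCols, ", ")

	var b strings.Builder
	fmt.Fprintf(&b, "SELECT * FROM %s", table)
	if hasCursor {
		// One "?" per key column, e.g. "?, ?" for a two-column key.
		placeholders := strings.TrimSuffix(strings.Repeat("?, ", len(pkCols)), ", ")
		fmt.Fprintf(&b, " WHERE (%s) > (%s)", cols, placeholders)
	}
	fmt.Fprintf(&b, " ORDER BY %s LIMIT %d", cols, batchSize)
	return b.String()
}

func main() {
	// First page: no cursor yet.
	fmt.Println(buildKeysetQuery("orders", []string{"id"}, 10000, false))
	// Later page on a composite key: resume after the last (order_id, line_no) seen.
	fmt.Println(buildKeysetQuery("order_items", []string{"order_id", "line_no"}, 10000, true))
}
```

Because each page restarts from the last key rather than from an `OFFSET`, the database can seek directly via the primary key index, and the caller only ever holds one page of rows at a time.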
README.md

Lines changed: 34 additions & 2 deletions
@@ -38,6 +38,8 @@ DeepDiff DB makes the entire process deterministic, reviewable, and safe by:
 - **Progress tracking** - Visual progress bars and spinners for long-running operations
 - **Checkpoint/resume** - Resume interrupted operations from saved checkpoints
 - **Enhanced error handling** - Rich error messages with actionable suggestions
+- **Streaming large datasets** - Keyset-paginated batch hashing keeps memory bounded at any table size
+- **Parallel table hashing** - Hash multiple tables concurrently with configurable worker pool

 ### Safety Features

@@ -247,6 +249,16 @@ Resolution strategies:
 - `theirs`: Use development values (accept dev changes)
 - `manual`: Require interactive decision for each conflict

+**Performance Configuration (v0.7+):**
+- `performance.hash_batch_size`: Rows per keyset-paginated query during table hashing. `0` disables batching (loads all rows in one query). Default: `10000`
+- `performance.max_parallel_tables`: Maximum number of tables hashed concurrently. Default: `1`
+
+```yaml
+performance:
+  hash_batch_size: 10000     # ~1–2 MB per page; keeps heap bounded on any table size
+  max_parallel_tables: 2     # hash prod tables in parallel; raises throughput ~2× on dual-core
+```
+
 An example configuration file is included at `deepdiffdb.config.yaml.example`.

 ## Commands
@@ -314,6 +326,18 @@ Performs a full comparison of both schema and data.
 deepdiffdb diff --config deepdiffdb.config.yaml
 ```

+**Large Dataset Options (v0.7+):**
+```bash
+# Keyset-paginated hashing, 2 tables in parallel
+deepdiffdb diff --config deepdiffdb.config.yaml --batch-size 10000 --parallel 2
+
+# Disable batching (pre-v0.7 behaviour, loads all rows in memory)
+deepdiffdb diff --config deepdiffdb.config.yaml --batch-size 0 --parallel 1
+```
+
+- `--batch-size N`: Rows per page when hashing large tables. Overrides `performance.hash_batch_size`. `0` = no pagination.
+- `--parallel N`: Max tables hashed concurrently. Overrides `performance.max_parallel_tables`.
+
 **Generate Interactive HTML Report:**
 ```bash
 deepdiffdb diff --config deepdiffdb.config.yaml --html
@@ -345,6 +369,14 @@ Generates a SQL migration pack for data differences.
 deepdiffdb gen-pack --config deepdiffdb.config.yaml
 ```

+**Large Dataset Options (v0.7+):**
+```bash
+deepdiffdb gen-pack --config deepdiffdb.config.yaml --batch-size 5000 --parallel 4
+```
+
+- `--batch-size N`: Rows per page when hashing large tables. Overrides `performance.hash_batch_size`.
+- `--parallel N`: Max tables hashed concurrently. Overrides `performance.max_parallel_tables`.
+
 **Resume from Checkpoint:**
 ```bash
 deepdiffdb gen-pack --config deepdiffdb.config.yaml --resume
@@ -513,7 +545,7 @@ DeepDiff DB uses a multi-stage approach to ensure safe and accurate database syn
 7. **Migration Generation** - Creates SQL migration scripts with proper ordering and batching
 8. **Transactional Application** - Applies changes within a single transaction for atomicity

-The tool processes data in chunks for large tables and provides progress indicators for operations exceeding 10,000 rows. Progress bars show throughput (rows/second) and estimated time remaining. Checkpoints are automatically saved during long-running operations, allowing you to resume from interruptions.
+The tool processes data using **keyset-paginated batching** for large tables — each page fetches a bounded number of rows (`WHERE pk > lastVal ORDER BY pk LIMIT N`), keeping heap usage flat regardless of table size. Multiple tables can be hashed concurrently using a bounded goroutine pool. Progress bars show throughput (rows/second) and estimated time remaining. Checkpoints are automatically saved during long-running operations, allowing you to resume from interruptions.

 ## Architecture

@@ -584,7 +616,7 @@ Current limitations and known constraints:
 - **Database Support** - MSSQL and Oracle are not yet supported (planned for future releases)
 - **Schema Auto-merge** - Schema differences must be resolved manually
 - **Primary Key Requirement** - All tables must have primary keys (unless explicitly ignored)
-- **Large Database Performance** - Very large databases may produce large diff files and require significant processing time
+- **Large Database Performance** - Very large tables are handled with keyset-paginated batching (v0.7+); diff output files may still be large for tables with many changed rows
 - **Conflict Resolution** - Complex merge strategies (e.g., column-level merging) are not supported
 - **SQLite Constraints** - SQLite has limited support for ALTER TABLE operations

ROADMAP.md

Lines changed: 42 additions & 54 deletions
@@ -8,44 +8,29 @@ We release a new version every **Saturday**. Each release includes one or more f

 ---

-## Current Status: v0.6
+## Current Status: v0.7

-**Last Release:** 2026-01-06
+**Last Release:** 2026-03-14

 **Current Features:**
-- Schema drift detection
-- Row-level data comparison
-- Migration pack generation
-- Transactional apply mode
-- MySQL, PostgreSQL, SQLite support
-- Conflict detection
-- JSON and text reports
-- Standalone schema migration command (`schema-migrate`)
-- DROP COLUMN support with safety controls
-- MODIFY COLUMN support (type changes, nullable changes)
-- CREATE TABLE and DROP TABLE support
-- Index support (CREATE INDEX, DROP INDEX)
-- Foreign key constraint handling (ADD/DROP FOREIGN KEY)
-- Primary key modification support
-- Dependency-aware migration ordering
+- Schema drift detection and standalone schema migration (`schema-migrate`)
+- Row-level data comparison with SHA-256 hashing
+- Migration pack generation and transactional apply mode
+- MySQL, PostgreSQL, and SQLite support
+- Conflict detection with `ours`/`theirs`/`manual` resolution strategies
 - Interactive `resolve-conflicts` command with `--auto` and `--resume` flags
-- Conflict resolution configuration (`ours`, `theirs`, `manual` strategies)
-- Per-table conflict resolution strategies
-- Resolution persistence with `resolutions.json`
-- Enhanced conflict reports with resolution statistics
-- **NEW:** Interactive HTML report generation with `--html` flag
-- **NEW:** Visual schema diff viewer with foreign key support
-- **NEW:** Data diff visualization with expandable row keys
-- **NEW:** Resolution strategy breakdown (auto/pending counts)
-- **NEW:** Per-table strategy table with conflict statistics
-- **NEW:** Conflict highlighting with strategy badges
-- **NEW:** SQL preview with syntax highlighting
-- **NEW:** Export to PDF functionality
-- **NEW:** Structured logging with JSON/text formats and log levels
-- **NEW:** Progress tracking with bars and spinners
-- **NEW:** Checkpoint/resume system for long-running operations
-- **NEW:** Enhanced error handling with suggestions and stack traces
-- **NEW:** Performance metrics collection
+- Per-table conflict resolution strategies with `resolutions.json` persistence
+- DROP/MODIFY COLUMN, CREATE/DROP TABLE, CREATE/DROP INDEX, ADD/DROP FOREIGN KEY
+- Primary key modification and dependency-aware migration ordering
+- Interactive HTML report with schema diff viewer, data diff, conflict highlighting, and SQL preview
+- Structured JSON/text logging with configurable levels and file output
+- Visual progress bars and throughput metrics
+- Checkpoint/resume system for long-running operations
+- Enhanced error handling with actionable suggestions and retry logic
+- **NEW:** Keyset-paginated batch hashing — `--batch-size N` / `performance.hash_batch_size`
+- **NEW:** Parallel table hashing — `--parallel N` / `performance.max_parallel_tables`
+- **NEW:** Bounded O(batch_size) memory during hashing regardless of table size
+- **NEW:** Per-batch memory telemetry at DEBUG log level (`alloc_mb`, `batch`)

 ---

@@ -113,27 +98,29 @@ We release a new version every **Saturday**. Each release includes one or more f

 ---

-## Upcoming Releases
+## Completed Releases (continued)

----
+### v0.7: Streaming Support for Large Datasets (Released 2026-03-14)

-### Week 1 - v0.7: Streaming Support for Large Datasets
-**Target Date:** Next Saturday
+**Features Delivered:**
+- Keyset-paginated batch hashing (`WHERE pk > lastVal ORDER BY pk LIMIT N`) — O(batch_size) heap at any table size
+- `--batch-size N` and `--parallel N` CLI flags for `diff` and `gen-pack`
+- `performance.hash_batch_size` and `performance.max_parallel_tables` config keys (defaults: 10000 / 1)
+- Bounded goroutine pool via `errgroup` + `semaphore.NewWeighted` for parallel table hashing
+- `BuildCursorQuery` shared module (`internal/content/cursor.go`) used by both hash and pack paths
+- Per-batch memory telemetry at DEBUG level
+- Sample 14: Streaming Large Datasets (SQLite, no Docker, seed script + Makefile)

-**Features:**
-- Streaming diff for tables > 1M rows
-- Memory-efficient hash computation
-- Chunked processing with progress tracking
-- Configurable batch sizes
-- Resume capability for interrupted operations
-- Performance optimizations for large databases
+**Impact:** Enables comparison of databases with millions of rows while keeping memory usage bounded and wall-clock time short

-**Impact:** Enables comparison of very large production databases
+---
+
+## Upcoming Releases

 ---

-### Week 2 - v0.8: MSSQL Support
-**Target Date:** Week 2 Saturday
+### v0.8: MSSQL Support
+**Target Date:** Next Saturday

 **Features:**
 - Microsoft SQL Server driver support
@@ -146,8 +133,8 @@ We release a new version every **Saturday**. Each release includes one or more f

 ---

-### Week 3 - v0.9: Oracle Support
-**Target Date:** Week 3 Saturday
+### v0.9: Oracle Support
+**Target Date:** Week 2 Saturday

 **Features:**
 - Oracle Database driver support
@@ -160,8 +147,8 @@ We release a new version every **Saturday**. Each release includes one or more f

 ---

-### Week 4 - v1.0: Production Ready Release
-**Target Date:** Week 4 Saturday
+### v1.0: Production Ready Release
+**Target Date:** Week 3 Saturday

 **Features:**
 - Comprehensive documentation
@@ -225,10 +212,10 @@ We release a new version every **Saturday**. Each release includes one or more f
 - ~~Conflict Resolution Strategies~~ (v0.4)
 - ~~HTML Report Viewer~~ (v0.5)
 - ~~Enhanced Error Handling & Logging~~ (v0.6)
+- ~~Streaming Support for Large Datasets~~ (v0.7)
 - Documentation & Production Readiness

 ### Medium Priority (Should Have)
-- Streaming Support for Large Datasets
 - MSSQL Support

 ### Low Priority (Nice to Have)
@@ -242,6 +229,7 @@ We release a new version every **Saturday**. Each release includes one or more f
 - [x] Conflict Resolution Strategies (v0.4)
 - [x] HTML Report Viewer (v0.5)
 - [x] Enhanced Error Handling & Logging (v0.6)
+- [x] Streaming Support for Large Datasets (v0.7)
 - [ ] All high-priority features implemented
 - [ ] Test coverage > 80%
 - [ ] Comprehensive documentation
@@ -271,5 +259,5 @@ If you'd like to contribute to any of these features, please:

 ---

-**Last Updated:** 2026-01-06
+**Last Updated:** 2026-03-14
