Out-of-band iceberg export compaction (icebird 0.8.10) + fix broken export smoke fixtures#96
Open
philcunliffe wants to merge 1 commit into
Open
Out-of-band iceberg export compaction (icebird 0.8.10) + fix broken export smoke fixtures#96philcunliffe wants to merge 1 commit into
philcunliffe wants to merge 1 commit into
Conversation
…smoke fixtures Two non-graph pieces, bundled because the compaction smoke needs the fixture fix to run at all: 1. Export-table compaction (LLP 0022#compaction). icebird 0.8.9 shipped icebergRewrite and 0.8.10 made it safe for formatVersion-3 tables (row lineage preserved across rewrites), so the "compactionSupported: false" stub in format-iceberg maintenance becomes a real rewrite — but strictly out-of-band per LLP 0022: `hyp sink maintain --compact` is the only path that runs it; the default maintain, the daemon loop, and the sink tick never compact. Gated on a `compact_file_count` threshold (default 32), single commit attempt on conflict (a blind retry could drop concurrently appended rows). Pin bumped 0.8.9 -> 0.8.10. LLP 0022 updated in the same change. 2. Fix the iceberg export smoke fixtures. Both iceberg_export_local_fs and iceberg_export_s3_fixture were failing before this change: their inline fixture plugins pointed partition discovery at the legacy `datasets/<ds>/all` path, but the storage spool routes flushed rows to `datasets/<ds>/source=<client>/table/`, so exportBatch exported an empty partition (bytesWritten=0). The fixtures now discover via discoverCachePartitions, and the flows force-settle the spool (flushAll) instead of racing the fire-and-forget size-threshold flush. iceberg_export_partitioned_local_fs is unaffected (it deliberately writes past the spool). New unit coverage in test/plugins/iceberg-maintenance.test.js: config normalization, threshold gating, dry-run, and a real v3 rewrite consolidating two data files into one. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Contributor
Author
Dual-agent review —
|
| Source | Finding (severity, evidence) | Intersects |
|---|---|---|
| Codex | Error Handling & Resilience: CLI reports any rewrite failure as below-threshold skip (major, maintenance.js:172-176, core_commands.js:2341-2342) | Unobservable failure path; Concurrency surface (conflict path) |
| Claude | Failed rewrite misreported as "compaction_skipped (below compact_file_count)" (major, maintenance.js:172-194, core_commands.js:2342) | Unobservable failure path |
| Claude | Conflict silently leaks orphaned table copy, no reclamation pass (major, maintenance.js:92-94) | Conflict-orphan leak; Lifecycle gap |
| Claude | Rewrite re-introduces unbounded-memory compaction class from #82/#90 (major, maintenance.js:90) | Unbounded rewrite memory |
| Claude | Rewrite data preservation never precisely asserted (major, iceberg-maintenance.test.js:77-83, iceberg_export_local_fs.js:320-326) | Coverage gap |
| Claude | LLP 0022 References section stale pointer (minor, llp/0022:278) | (doc hygiene only — no risk surface) |
Codex review
Fix Validations
Broken export smoke fixtures
- Status: correct
- Evidence:
hypaware-core/smoke/flows/iceberg_export_local_fs.js:269,hypaware-core/smoke/flows/iceberg_export_local_fs.js:464,hypaware-core/smoke/flows/iceberg_export_s3_fixture.js:269,hypaware-core/smoke/flows/iceberg_export_s3_fixture.js:424,src/core/cache/partition.js:161 - Assessment: The fixtures now force-flush the spool before discovery and use the existing partition discovery helper instead of hard-coding
datasets/<ds>/all. Existing storage handling supports this because discoveredp.pathis a logical table path, and storage resolves source-table layouts internally.
Out-of-band compaction fence
- Status: correct
- Evidence:
hypaware-core/plugins-workspace/format-iceberg/src/maintenance.js:217,hypaware-core/plugins-workspace/format-iceberg/src/maintenance.js:248,src/core/cli/core_commands.js:2273,src/core/cli/core_commands.js:2330 - Assessment:
maintainExportTablesdefaultscompactto false and only callscompactExportTablewhen the CLI passes--compact. The traced production caller ishyp sink maintain; the smoke calls are explicit.
Findings
5) Error Handling & Resilience
- Severity: major
- Confidence: high
- Evidence:
hypaware-core/plugins-workspace/format-iceberg/src/maintenance.js:172,hypaware-core/plugins-workspace/format-iceberg/src/maintenance.js:176,src/core/cli/core_commands.js:2341,src/core/cli/core_commands.js:2342 - Why it matters: Any rewrite failure, including the intentional concurrent-commit conflict path or an IO/auth error, is reported by the CLI as
compaction_skipped (below compact_file_count), which gives operators the wrong remediation signal. - Suggested fix: Return a skip/failure reason from
compactExportTablesuch asbelow_threshold,commit_conflict, orrewrite_error; only printbelow compact_file_countwhen the threshold check actually failed, and consider a nonzero exit for unexpected rewrite errors.
No Finding
-
- Behavioral Correctness
-
- Contract & Interface Fidelity
-
- Change Impact / Blast Radius
-
- Concurrency, Ordering & State Safety
-
- Security Surface
-
- Resource Lifecycle & Cleanup
-
- Release Safety
-
- Test Evidence Quality
-
- Architectural Consistency
-
- Debuggability & Operability
Evidence Bundle
- Changed hot paths:
compactExportTable,maintainExportTables,hyp sink maintain --compact, iceberg export smoke fixture discovery,icebirdpin to0.8.10. - Impacted callers:
src/core/cli/core_commands.js:2330,hypaware-core/smoke/flows/iceberg_export_local_fs.js:271,hypaware-core/smoke/flows/iceberg_export_local_fs.js:295,test/plugins/iceberg-maintenance.test.js:39. - Impacted tests:
test/plugins/iceberg-maintenance.test.js:22,test/plugins/iceberg-maintenance.test.js:49,hypaware-core/smoke/flows/iceberg_export_local_fs.js:295,hypaware-core/smoke/flows/iceberg_export_s3_fixture.js:269. - Unresolved uncertainty: I did not run the suite; this is a static diff review. No explicit test covers the rewrite failure/conflict reporting path.
Claude review
Claude review
Failed rewrite misreported as "compaction_skipped (below compact_file_count)"
- Severity: major
- Confidence: 85
- Evidence: hypaware-core/plugins-workspace/format-iceberg/src/maintenance.js:172-194 and src/core/cli/core_commands.js:2342
- Why it matters:
compactExportTableswallows every rewrite failure (including the concurrent-commit conflict the PR explicitly designs for, and e.g. S3 permission errors) into the samecompacted: falseshape as the below-threshold skip, sorunSinkMaintainprints the affirmatively false reasoncompaction_skipped (below compact_file_count)with exit code 0 — the one manual compaction tool misdiagnoses its own failures, contrary to CLAUDE.md's log-driven-development requirement that failures identify the broken step. (Flagged independently by 4 of 5 review agents.) - Suggested fix: Add a discriminant to the
compactExportTableresult (e.g.reason: 'below-threshold' | 'conflict' | 'error'pluserror_kind), propagate it throughExportMaintenanceDatasetReport, print the real reason in the per-dataset CLI line, and emit a structured log/span around the rewrite per LLP 0022's Observability section.
Rewrite commit conflict silently leaks a full table copy of orphaned data files, with no reclamation pass
- Severity: major
- Confidence: 85
- Evidence: hypaware-core/plugins-workspace/format-iceberg/src/maintenance.js:92-94 (bare
catch→compacted: false) - Why it matters: icebird's
icebergStageRewritewrites all consolidated data files and the manifest to the blob store before the single-attempt commit, and its own docs say orphan cleanup "is left to a separate maintenance pass" — whichmaintainExportTablesdoes not have (icebergExpireSnapshotsonly deletes files referenced by expired snapshots); since the design point of out-of-band compaction is that the daemon keeps appending concurrently, the deliberately no-retry rewrite is expected to lose races, and each lost race leaves a full rewritten copy of the table orphaned in S3 forever — the same leak class commit 085af71 (fix: make cache compaction memory-safe and reclaim orphaned generations #82) documented (9.5 GB) and fixed for the local cache. - Suggested fix: On commit failure, best-effort delete the staged
writtenFiles, or add an orphan-file sweep to export maintenance mirroring fix: make cache compaction memory-safe and reclaim orphaned generations #82's orphan-generation cleanup; at minimum surface the conflict so the operator knows a leak occurred.
Out-of-band rewrite re-introduces the unbounded-memory compaction class fixed in #82/#90
- Severity: major
- Confidence: 85
- Evidence: hypaware-core/plugins-workspace/format-iceberg/src/maintenance.js:90 (the
icebergRewritecall); node_modules/icebird/src/write/rewrite.js:91-92 - Why it matters: icebird's rewrite materializes the entire table's live rows in memory (
await icebergRead(...)then full-array sort and per-column copies) with notargetFileRowsand no byte gate — the exact failure mode fix: make cache compaction memory-safe and reclaim orphaned generations #82 fixed for local-cache compaction with a 32 MBcompact_batch_bytesbudget and Stop parquet exports bloating & OOMing on wide repeated columns #90 fixed for parquet exports; the 32-file default threshold fires precisely on large long-lived tables (the historically documented ai_gateway export was ~1.3 GB), so the first productionhyp sink maintain --compactplausibly OOM-crashes under a default heap — running out-of-band protects the daemon, not the rewrite. - Suggested fix: Read
total-files-sizefrom the current snapshot summary (already available next tototal-data-files) and skip-with-warning above a configurable byte cap, passtargetFileRows; longer-term request a streamed/per-partition rewrite from icebird mirroringcompact_batch_bytes.
Rewrite data preservation is never precisely asserted
- Severity: major
- Confidence: 85
- Evidence: test/plugins/iceberg-maintenance.test.js:77-83 and hypaware-core/smoke/flows/iceberg_export_local_fs.js:320-326
- Why it matters: The PR's own JSDoc and LLP 0022 name row loss as the rewrite's central risk, yet the unit test asserts only file count and format-version after compaction (never reads rows back), and the smoke's post-compaction check is
rows.length >= ROW_COUNT(>= 5) against a table holding 10 rows — a rewrite that silently dropped half the data would pass every tier. - Suggested fix: In the unit test,
icebergReadafter the successfulcompactExportTableand assert the appended records survive (sorted deep-equal onid/value); tighten the smoke's post-compaction assertion to the exact expected row count.
LLP 0022 References section still points at stale maintenance.js:120
- Severity: minor
- Confidence: 82
- Evidence: llp/0022-iceberg-export-partitioning.spec.md:278
- Why it matters: The PR fixed the identical stale
maintenance.js:120pointer in the Compaction section body but missed the References list, which still cites "compaction framing" at a line that is now the tail ofdiscoverExportDatasets; CLAUDE.md requires LLP docs not to carry stale guidance. - Suggested fix: Update the References entry to drop the line number (matching the body's fix) or point it at
compactExportTable/maintainExportTables.
Reports: /Users/phil/workspace/hypaware/.git/worktrees/dual-review-pr-96/dual-review/pr-96
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
Two non-graph pieces carved out of the context-graph work, bundled because the compaction smoke can't run without the fixture fix:
1. Export-table compaction, out-of-band per LLP 0022
icebergRewrite; 0.8.10 makes it safe for our formatVersion-3 tables (row lineage preserved across rewrites). Pin moves0.8.9 -> 0.8.10.compactionSupported: falsestub informat-iceberg/src/maintenance.jsbecomes a realcompactExportTable, but it honors LLP 0022 §Compaction's constraint: rewrites are out-of-band only.hyp sink maintain --compactis the single path that runs one — defaultmaintain, the daemon loop, and the sink tick never compact (the in-daemon read-rewrite memory landmine stays fenced off).compact_file_countthreshold (default 32, configurable via the sink'smaintenanceconfig). Single commit attempt on conflict: a rewrite only saw the rows it read, so a blind retry could drop concurrently appended rows; the next manual run starts from fresh metadata.2. Fix the two broken iceberg export smokes
iceberg_export_local_fsandiceberg_export_s3_fixturewere failing on master before this change (export: bytesWritten > 0, value=0): their inline fixture plugins pointed partition discovery at the legacydatasets/<ds>/allpath, but the storage spool routes flushed rows todatasets/<ds>/source=<client>/table/— soexportBatchexported an empty partition and nothing ever landed in the destination. The fixtures now discover viadiscoverCachePartitions, and the flows force-settle the spool (flushAll({ force: true })) instead of racing the fire-and-forget size-threshold flush. (iceberg_export_partitioned_local_fsis unaffected — it deliberately writes past the spool.)Testing
npm test: 834/834 passing, including newtest/plugins/iceberg-maintenance.test.js(config normalization, threshold gating, dry-run, and a real v3 rewrite consolidating two data files into one withformat-versionpreserved)npm run typecheck/npm run lintcleaniceberg_export_local_fs(now also exercises a realicebergRewriteviacompact_file_count: 2and asserts the default run does not compact),iceberg_export_s3_fixture,iceberg_export_partitioned_local_fs— all ok🤖 Generated with Claude Code