Skip to content

Out-of-band iceberg export compaction (icebird 0.8.10) + fix broken export smoke fixtures#96

Open
philcunliffe wants to merge 1 commit into
masterfrom
iceberg-export-compaction-and-smoke-fixes
Open

Out-of-band iceberg export compaction (icebird 0.8.10) + fix broken export smoke fixtures#96
philcunliffe wants to merge 1 commit into
masterfrom
iceberg-export-compaction-and-smoke-fixes

Conversation

@philcunliffe

Copy link
Copy Markdown
Contributor

What

Two non-graph pieces carved out of the context-graph work, bundled because the compaction smoke can't run without the fixture fix:

1. Export-table compaction, out-of-band per LLP 0022

  • icebird 0.8.9 shipped icebergRewrite; 0.8.10 makes it safe for our formatVersion-3 tables (row lineage preserved across rewrites). Pin moves 0.8.9 -> 0.8.10.
  • The compactionSupported: false stub in format-iceberg/src/maintenance.js becomes a real compactExportTable, but it honors LLP 0022 §Compaction's constraint: rewrites are out-of-band only. hyp sink maintain --compact is the single path that runs one — default maintain, the daemon loop, and the sink tick never compact (the in-daemon read-rewrite memory landmine stays fenced off).
  • Gated on a compact_file_count threshold (default 32, configurable via the sink's maintenance config). Single commit attempt on conflict: a rewrite only saw the rows it read, so a blind retry could drop concurrently appended rows; the next manual run starts from fresh metadata.
  • LLP 0022 updated in the same change.

2. Fix the two broken iceberg export smokes

iceberg_export_local_fs and iceberg_export_s3_fixture were failing on master before this change (export: bytesWritten > 0, value=0): their inline fixture plugins pointed partition discovery at the legacy datasets/<ds>/all path, but the storage spool routes flushed rows to datasets/<ds>/source=<client>/table/ — so exportBatch exported an empty partition and nothing ever landed in the destination. The fixtures now discover via discoverCachePartitions, and the flows force-settle the spool (flushAll({ force: true })) instead of racing the fire-and-forget size-threshold flush. (iceberg_export_partitioned_local_fs is unaffected — it deliberately writes past the spool.)

Testing

  • npm test: 834/834 passing, including new test/plugins/iceberg-maintenance.test.js (config normalization, threshold gating, dry-run, and a real v3 rewrite consolidating two data files into one with format-version preserved)
  • npm run typecheck / npm run lint clean
  • Smokes: iceberg_export_local_fs (now also exercises a real icebergRewrite via compact_file_count: 2 and asserts the default run does not compact), iceberg_export_s3_fixture, iceberg_export_partitioned_local_fs — all ok

🤖 Generated with Claude Code

…smoke fixtures

Two non-graph pieces, bundled because the compaction smoke needs the
fixture fix to run at all:

1. Export-table compaction (LLP 0022#compaction). icebird 0.8.9 shipped
   icebergRewrite and 0.8.10 made it safe for formatVersion-3 tables
   (row lineage preserved across rewrites), so the
   "compactionSupported: false" stub in format-iceberg maintenance
   becomes a real rewrite — but strictly out-of-band per LLP 0022:
   `hyp sink maintain --compact` is the only path that runs it; the
   default maintain, the daemon loop, and the sink tick never compact.
   Gated on a `compact_file_count` threshold (default 32), single
   commit attempt on conflict (a blind retry could drop concurrently
   appended rows). Pin bumped 0.8.9 -> 0.8.10. LLP 0022 updated in the
   same change.

2. Fix the iceberg export smoke fixtures. Both iceberg_export_local_fs
   and iceberg_export_s3_fixture were failing before this change: their
   inline fixture plugins pointed partition discovery at the legacy
   `datasets/<ds>/all` path, but the storage spool routes flushed rows
   to `datasets/<ds>/source=<client>/table/`, so exportBatch exported an
   empty partition (bytesWritten=0). The fixtures now discover via
   discoverCachePartitions, and the flows force-settle the spool
   (flushAll) instead of racing the fire-and-forget size-threshold
   flush. iceberg_export_partitioned_local_fs is unaffected (it
   deliberately writes past the spool).

New unit coverage in test/plugins/iceberg-maintenance.test.js: config
normalization, threshold gating, dry-run, and a real v3 rewrite
consolidating two data files into one.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@philcunliffe

philcunliffe commented Jun 11, 2026

Copy link
Copy Markdown
Contributor Author

Dual-agent review — request_changes

  • Verdict: request_changes
  • Risk class: medium
  • Auto-merge advisory: 👎 thumbs down — verdict is request_changes; needs human-gated follow-up

Advisory only: no merge was attempted.

Risk capstone

Cross-reference: reviewer findings vs high-risk surfaces

Source Finding (severity, evidence) Intersects
Codex Error Handling & Resilience: CLI reports any rewrite failure as below-threshold skip (major, maintenance.js:172-176, core_commands.js:2341-2342) Unobservable failure path; Concurrency surface (conflict path)
Claude Failed rewrite misreported as "compaction_skipped (below compact_file_count)" (major, maintenance.js:172-194, core_commands.js:2342) Unobservable failure path
Claude Conflict silently leaks orphaned table copy, no reclamation pass (major, maintenance.js:92-94) Conflict-orphan leak; Lifecycle gap
Claude Rewrite re-introduces unbounded-memory compaction class from #82/#90 (major, maintenance.js:90) Unbounded rewrite memory
Claude Rewrite data preservation never precisely asserted (major, iceberg-maintenance.test.js:77-83, iceberg_export_local_fs.js:320-326) Coverage gap
Claude LLP 0022 References section stale pointer (minor, llp/0022:278) (doc hygiene only — no risk surface)
Codex review

Fix Validations

Broken export smoke fixtures

  • Status: correct
  • Evidence: hypaware-core/smoke/flows/iceberg_export_local_fs.js:269, hypaware-core/smoke/flows/iceberg_export_local_fs.js:464, hypaware-core/smoke/flows/iceberg_export_s3_fixture.js:269, hypaware-core/smoke/flows/iceberg_export_s3_fixture.js:424, src/core/cache/partition.js:161
  • Assessment: The fixtures now force-flush the spool before discovery and use the existing partition discovery helper instead of hard-coding datasets/<ds>/all. Existing storage handling supports this because discovered p.path is a logical table path, and storage resolves source-table layouts internally.

Out-of-band compaction fence

  • Status: correct
  • Evidence: hypaware-core/plugins-workspace/format-iceberg/src/maintenance.js:217, hypaware-core/plugins-workspace/format-iceberg/src/maintenance.js:248, src/core/cli/core_commands.js:2273, src/core/cli/core_commands.js:2330
  • Assessment: maintainExportTables defaults compact to false and only calls compactExportTable when the CLI passes --compact. The traced production caller is hyp sink maintain; the smoke calls are explicit.

Findings

5) Error Handling & Resilience

  • Severity: major
  • Confidence: high
  • Evidence: hypaware-core/plugins-workspace/format-iceberg/src/maintenance.js:172, hypaware-core/plugins-workspace/format-iceberg/src/maintenance.js:176, src/core/cli/core_commands.js:2341, src/core/cli/core_commands.js:2342
  • Why it matters: Any rewrite failure, including the intentional concurrent-commit conflict path or an IO/auth error, is reported by the CLI as compaction_skipped (below compact_file_count), which gives operators the wrong remediation signal.
  • Suggested fix: Return a skip/failure reason from compactExportTable such as below_threshold, commit_conflict, or rewrite_error; only print below compact_file_count when the threshold check actually failed, and consider a nonzero exit for unexpected rewrite errors.

No Finding

    1. Behavioral Correctness
    1. Contract & Interface Fidelity
    1. Change Impact / Blast Radius
    1. Concurrency, Ordering & State Safety
    1. Security Surface
    1. Resource Lifecycle & Cleanup
    1. Release Safety
    1. Test Evidence Quality
    1. Architectural Consistency
    1. Debuggability & Operability

Evidence Bundle

  • Changed hot paths: compactExportTable, maintainExportTables, hyp sink maintain --compact, iceberg export smoke fixture discovery, icebird pin to 0.8.10.
  • Impacted callers: src/core/cli/core_commands.js:2330, hypaware-core/smoke/flows/iceberg_export_local_fs.js:271, hypaware-core/smoke/flows/iceberg_export_local_fs.js:295, test/plugins/iceberg-maintenance.test.js:39.
  • Impacted tests: test/plugins/iceberg-maintenance.test.js:22, test/plugins/iceberg-maintenance.test.js:49, hypaware-core/smoke/flows/iceberg_export_local_fs.js:295, hypaware-core/smoke/flows/iceberg_export_s3_fixture.js:269.
  • Unresolved uncertainty: I did not run the suite; this is a static diff review. No explicit test covers the rewrite failure/conflict reporting path.
Claude review

Claude review

Failed rewrite misreported as "compaction_skipped (below compact_file_count)"

  • Severity: major
  • Confidence: 85
  • Evidence: hypaware-core/plugins-workspace/format-iceberg/src/maintenance.js:172-194 and src/core/cli/core_commands.js:2342
  • Why it matters: compactExportTable swallows every rewrite failure (including the concurrent-commit conflict the PR explicitly designs for, and e.g. S3 permission errors) into the same compacted: false shape as the below-threshold skip, so runSinkMaintain prints the affirmatively false reason compaction_skipped (below compact_file_count) with exit code 0 — the one manual compaction tool misdiagnoses its own failures, contrary to CLAUDE.md's log-driven-development requirement that failures identify the broken step. (Flagged independently by 4 of 5 review agents.)
  • Suggested fix: Add a discriminant to the compactExportTable result (e.g. reason: 'below-threshold' | 'conflict' | 'error' plus error_kind), propagate it through ExportMaintenanceDatasetReport, print the real reason in the per-dataset CLI line, and emit a structured log/span around the rewrite per LLP 0022's Observability section.

Rewrite commit conflict silently leaks a full table copy of orphaned data files, with no reclamation pass

  • Severity: major
  • Confidence: 85
  • Evidence: hypaware-core/plugins-workspace/format-iceberg/src/maintenance.js:92-94 (bare catchcompacted: false)
  • Why it matters: icebird's icebergStageRewrite writes all consolidated data files and the manifest to the blob store before the single-attempt commit, and its own docs say orphan cleanup "is left to a separate maintenance pass" — which maintainExportTables does not have (icebergExpireSnapshots only deletes files referenced by expired snapshots); since the design point of out-of-band compaction is that the daemon keeps appending concurrently, the deliberately no-retry rewrite is expected to lose races, and each lost race leaves a full rewritten copy of the table orphaned in S3 forever — the same leak class commit 085af71 (fix: make cache compaction memory-safe and reclaim orphaned generations #82) documented (9.5 GB) and fixed for the local cache.
  • Suggested fix: On commit failure, best-effort delete the staged writtenFiles, or add an orphan-file sweep to export maintenance mirroring fix: make cache compaction memory-safe and reclaim orphaned generations #82's orphan-generation cleanup; at minimum surface the conflict so the operator knows a leak occurred.

Out-of-band rewrite re-introduces the unbounded-memory compaction class fixed in #82/#90

  • Severity: major
  • Confidence: 85
  • Evidence: hypaware-core/plugins-workspace/format-iceberg/src/maintenance.js:90 (the icebergRewrite call); node_modules/icebird/src/write/rewrite.js:91-92
  • Why it matters: icebird's rewrite materializes the entire table's live rows in memory (await icebergRead(...) then full-array sort and per-column copies) with no targetFileRows and no byte gate — the exact failure mode fix: make cache compaction memory-safe and reclaim orphaned generations #82 fixed for local-cache compaction with a 32 MB compact_batch_bytes budget and Stop parquet exports bloating & OOMing on wide repeated columns #90 fixed for parquet exports; the 32-file default threshold fires precisely on large long-lived tables (the historically documented ai_gateway export was ~1.3 GB), so the first production hyp sink maintain --compact plausibly OOM-crashes under a default heap — running out-of-band protects the daemon, not the rewrite.
  • Suggested fix: Read total-files-size from the current snapshot summary (already available next to total-data-files) and skip-with-warning above a configurable byte cap, pass targetFileRows; longer-term request a streamed/per-partition rewrite from icebird mirroring compact_batch_bytes.

Rewrite data preservation is never precisely asserted

  • Severity: major
  • Confidence: 85
  • Evidence: test/plugins/iceberg-maintenance.test.js:77-83 and hypaware-core/smoke/flows/iceberg_export_local_fs.js:320-326
  • Why it matters: The PR's own JSDoc and LLP 0022 name row loss as the rewrite's central risk, yet the unit test asserts only file count and format-version after compaction (never reads rows back), and the smoke's post-compaction check is rows.length >= ROW_COUNT (>= 5) against a table holding 10 rows — a rewrite that silently dropped half the data would pass every tier.
  • Suggested fix: In the unit test, icebergRead after the successful compactExportTable and assert the appended records survive (sorted deep-equal on id/value); tighten the smoke's post-compaction assertion to the exact expected row count.

LLP 0022 References section still points at stale maintenance.js:120

  • Severity: minor
  • Confidence: 82
  • Evidence: llp/0022-iceberg-export-partitioning.spec.md:278
  • Why it matters: The PR fixed the identical stale maintenance.js:120 pointer in the Compaction section body but missed the References list, which still cites "compaction framing" at a line that is now the tail of discoverExportDatasets; CLAUDE.md requires LLP docs not to carry stale guidance.
  • Suggested fix: Update the References entry to drop the line number (matching the body's fix) or point it at compactExportTable / maintainExportTables.

Reports: /Users/phil/workspace/hypaware/.git/worktrees/dual-review-pr-96/dual-review/pr-96

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant