gix,gix-odb,gix-pack: stream blob contents from the ODB #2526
Malcom Gilbert (mjgil) wants to merge 6 commits into
Conversation
Add streaming blob lookup support for loose and packed objects, avoid eagerly materializing packed results into memory, and expose repository helpers with regression coverage for loose, packed, empty, and wrong-type cases. Co-authored-by: Claude Co-authored-by: Codex
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f918257b18
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "Codex (@codex) review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "Codex (@codex) address that feedback".
```rust
zlib::Status::Ok | zlib::Status::BufError if eof => {
    self.done = true;
    return Ok(total_written);
```
Reject truncated loose streams instead of returning EOF
The streaming loose-object reader currently treats eof with Status::Ok|BufError as a normal end-of-stream and returns Ok(...), but it never verifies that the deflate stream actually reached StreamEnd or that the advertised loose-object size was fully produced. For a truncated/corrupt loose object this can silently return fewer bytes than Stream::size() instead of surfacing an error (the non-streaming path reports size/decompression mismatches), which can propagate corrupted blob contents to callers.
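The missing check can be sketched as a reader wrapper (a minimal std-only illustration; `SizeCheckedReader` is a hypothetical name, not gix's actual type) that turns a premature end-of-stream into an error instead of a short read:

```rust
use std::io::{self, Read};

// Hypothetical sketch: enforce the advertised loose-object size so that
// EOF before `expected` bytes surfaces as corruption, not as Ok(0).
struct SizeCheckedReader<R> {
    inner: R,
    expected: u64,
    produced: u64,
}

impl<R: Read> Read for SizeCheckedReader<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        let n = self.inner.read(buf)?;
        if n == 0 && self.produced != self.expected {
            // The stream ended early: report it instead of a silent short read.
            return Err(io::Error::new(
                io::ErrorKind::UnexpectedEof,
                format!(
                    "loose object truncated: got {} of {} bytes",
                    self.produced, self.expected
                ),
            ));
        }
        self.produced += n as u64;
        Ok(n)
    }
}

fn main() {
    // Pretend the payload is 5 bytes but the loose-object header advertised 8.
    let data = b"hello";
    let mut r = SizeCheckedReader { inner: &data[..], expected: 8, produced: 0 };
    let mut out = Vec::new();
    let err = r.read_to_end(&mut out).unwrap_err();
    assert_eq!(err.kind(), io::ErrorKind::UnexpectedEof);
    println!("{err}");
}
```

A fuller fix would additionally require the inflater to have reached `StreamEnd`, as the comment notes.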
```rust
}
assert_eq!(i, data.len());
assert_eq!(target.len(), 0);
debug_assert_eq!(i, data.len());
```
Preserve delta output-length validation in apply()
After generalizing delta::apply() to write into any Write, the function no longer checks that the expected output length was fully produced. In decode paths that pass a fixed-size slice target, malformed delta instructions that produce fewer bytes than declared result_size now return Ok(()) and leave trailing zeroed bytes, silently corrupting decoded objects instead of failing fast.
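The dropped invariant can be restored with an explicit post-write check. A minimal sketch (the `apply_checked` name and the flat chunk input are illustrative, not the real delta machinery):

```rust
use std::io::{self, Write};

// Hypothetical sketch: after applying delta output into any `Write`,
// verify that exactly `result_size` bytes were produced, instead of
// returning Ok(()) on a short result.
fn apply_checked(out: &mut impl Write, chunks: &[&[u8]], result_size: u64) -> io::Result<()> {
    let mut written = 0u64;
    for c in chunks {
        out.write_all(c)?;
        written += c.len() as u64;
    }
    if written != result_size {
        // Malformed delta: fail fast rather than leave trailing garbage.
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            format!("delta produced {written} bytes, expected {result_size}"),
        ));
    }
    Ok(())
}

fn main() {
    let mut buf = Vec::new();
    // Well-formed delta: declared and produced sizes match.
    assert!(apply_checked(&mut buf, &[b"abc", b"de"], 5).is_ok());
    // Malformed delta: declared 6 bytes but only produced 5.
    let mut buf2 = Vec::new();
    assert!(apply_checked(&mut buf2, &[b"abc", b"de"], 6).is_err());
    println!("ok");
}
```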
Thanks for taking a stab at this. If this would work, that would be a huge step forward for Gitoxide.

For the PR to have a chance of being merged, I'd need proof of "it's working". For that, you could create a test with a custom 'measuring' allocator that shows how an optimally compressed and thus huge base object and deltified object are decompressed with less memory usage compared to the current eager decompression that stores the whole object in memory.

I may also add that it will take considerable time to get to it, but if you stay on top of it and provide the requested information, it should happen eventually.
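A measuring allocator of the kind suggested here can be sketched with std's `GlobalAlloc` alone (a minimal illustration of the technique, not the actual test the reviewer asked for):

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

// Hypothetical sketch: wrap the system allocator and track current and
// peak live bytes, so a test can compare peak memory of streaming vs.
// eager decoding.
struct Measuring;

static LIVE: AtomicUsize = AtomicUsize::new(0);
static PEAK: AtomicUsize = AtomicUsize::new(0);

unsafe impl GlobalAlloc for Measuring {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        let p = unsafe { System.alloc(layout) };
        if !p.is_null() {
            let live = LIVE.fetch_add(layout.size(), Ordering::Relaxed) + layout.size();
            PEAK.fetch_max(live, Ordering::Relaxed);
        }
        p
    }
    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        LIVE.fetch_sub(layout.size(), Ordering::Relaxed);
        unsafe { System.dealloc(ptr, layout) };
    }
}

#[global_allocator]
static A: Measuring = Measuring;

fn main() {
    // Simulate the eager path: the whole "object" lives in memory at once.
    let big = vec![0u8; 1 << 20];
    drop(big);
    let peak = PEAK.load(Ordering::Relaxed);
    // The 1 MiB allocation must be visible in the recorded peak.
    assert!(peak >= 1 << 20);
    println!("peak live bytes: {peak}");
}
```

A real comparison test would run the eager and streaming lookups under this allocator and assert that the streaming peak stays well below the decompressed object size.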
Sebastian Thiel (Byron)
left a comment
Let's also add benchmark comparisons to see if this affects the performance of packed object decompression in the non-streaming case. The gix CLI has subcommands for that.
Reject truncated loose-object streams, restore delta output-length validation, and add allocator-backed proof that packed delta blob streaming lowers peak memory usage compared to eager blob lookup. Co-authored-by: Codex
Addressed the review items in
Changes:
Validation:
Memory proof:
Non-streaming benchmark:
So on this fixture the non-streaming packed-object path is slightly slower, about 2%, but not by a large margin.
Bring the gix_object::Write trait into scope for write_buf() and remove the unused Read import so workspace clippy and test builds pass again. Co-authored-by: Codex
Accept both UnexpectedEof and InvalidData for truncated loose-object stream corruption, as the stream reader can surface either depending on where zlib detects the truncated deflate payload. Co-authored-by: Codex
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4b7bf16549
```rust
pack_id: resolved_pack_id,
pack_offset,
entry_size: outcome.compressed_size + header_size,
```
Recompute pack location after delta-base retry
When decode_entry_to_write() falls into the DeltaBaseUnresolved retry path, the object is looked up and decoded again after recursive base resolution, and that recursive lookup can reorder or refresh snapshot.indices. In that case, the successful retry may come from a different pack entry than the initial lookup, but the returned location still uses the originally captured pack_id/header_size, so callers can receive a mismatched Location (pack_id, pack_offset, entry_size) for the bytes they actually streamed.
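The hazard is easiest to see in miniature (all names below are hypothetical stand-ins, not the real pack-resolution code): metadata captured before the retry can describe a different entry than the one actually streamed, so the fix is to recompute the location from the successful retry.

```rust
// Hypothetical sketch of the stale-metadata hazard after a retry.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Location {
    pack_id: u32,
    pack_offset: u64,
}

// Stand-in for the lookup: after a snapshot refresh, the same object may
// resolve to a different pack entry.
fn lookup(refreshed: bool) -> Location {
    if refreshed {
        Location { pack_id: 2, pack_offset: 64 }
    } else {
        Location { pack_id: 1, pack_offset: 128 }
    }
}

fn main() {
    let first = lookup(false); // initial attempt (fails with DeltaBaseUnresolved)
    let retried = lookup(true); // retry after recursive base resolution
    // Returning `first` here would hand callers a Location that does not
    // match the bytes they streamed; the retry's location must be used.
    assert_ne!(first, retried);
    println!("stale: {first:?}, actual: {retried:?}");
}
```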
Thanks! Please note that I am a bit allergic to PRs that appear fully automated. This translates to a lot of time for me and possibly very little for the author. What's the story related to
In the issue here: #1595 Sebastian Thiel (@Byron) you mentioned that you'd prefer streaming to fail if it wasn't possible rather than use something like

I suppose the context is different. If the goal is for gitoxide to incrementally improve its memory usage by reducing how often entire blobs must be loaded into memory, then I prefer this invisible

I would be happy to help review and extend this PR; it's still been in the back of my mind since I made that issue. My inclination would be to make this API surface available similar to its current state, and then smaller PRs/improvements can be made to consume the streaming API surface within gix - for working directory stuff - as well as the underlying streaming mechanics - to stream more efficiently in more scenarios. Does that sound reasonable?

Will offer an actual review later.
Add streaming blob lookup support for loose and packed objects, avoid eagerly materializing packed results into memory, and expose repository helpers with regression coverage for loose, packed, empty, and wrong-type cases.
Addresses #1595.
Summary
This adds blob-only streaming reads from the object database so callers can obtain blob contents without first materializing the full object in memory.
At the repository level, this introduces blob streaming helpers that:
Implementation
To support this, the change adds a streaming object path through the ODB stack.
Loose objects are exposed as a reader over incremental decompression.
Packed objects are decoded into a temporary file and then returned as a readable stream. This avoids holding the fully resolved blob in RAM, including for delta-resolved packed blobs, while still fitting the existing pack decoding model.
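The spill-to-disk pattern described above can be sketched with std alone (an assumed shape, not gix's actual code; the fixed temp path and `decode_to_temp` name are for illustration only):

```rust
use std::fs::File;
use std::io::{self, Read, Seek, SeekFrom, Write};

// Hypothetical sketch: decode the resolved blob into a temporary file,
// rewind it, and hand it back so callers can read it as a stream instead
// of holding the whole blob in RAM.
fn decode_to_temp(decoded: &[u8]) -> io::Result<File> {
    let path = std::env::temp_dir().join("packed-blob-spill.tmp");
    let mut f = File::options()
        .read(true)
        .write(true)
        .create(true)
        .truncate(true)
        .open(&path)?;
    f.write_all(decoded)?; // real delta resolution would write incrementally
    f.seek(SeekFrom::Start(0))?; // rewind so reads start at the beginning
    Ok(f)
}

fn main() -> io::Result<()> {
    let mut stream = decode_to_temp(b"resolved blob bytes")?;
    let mut out = String::new();
    stream.read_to_string(&mut out)?;
    assert_eq!(out, "resolved blob bytes");
    println!("{out}");
    Ok(())
}
```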
The repository-facing API is intentionally blob-only for now, which keeps the surface area aligned with the use case described in this issue.
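From the caller's side, the intended shape is roughly the following (the `find_blob_stream` helper name is invented for illustration; the PR's actual helper names may differ):

```rust
use std::io::{self, Read, Write};

// Hypothetical helper: a blob-streaming lookup returns `impl Read`.
// Here a Cursor stands in for a loose or packed blob stream.
fn find_blob_stream(_id: &str) -> io::Result<impl Read> {
    Ok(io::Cursor::new(b"blob contents".to_vec()))
}

fn main() -> io::Result<()> {
    let mut stream = find_blob_stream("deadbeef")?;
    let mut sink = Vec::new(); // in real use: a file, socket, or filter pipeline
    // io::copy moves data through a small fixed buffer, so the full blob
    // never has to be materialized in memory at once.
    let copied = io::copy(&mut stream, &mut sink)?;
    assert_eq!(copied, 13);
    io::stdout().write_all(&sink)?;
    println!();
    Ok(())
}
```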
Tests
Added coverage for:
Also validated with:
cargo test -p gix
cargo test -p gix-odb
cargo test -p gix-pack
Notes
Co-authored-by: Claude
Co-authored-by: Codex