
gix,gix-odb,gix-pack: stream blob contents from the ODB#2526

Open
Malcom Gilbert (mjgil) wants to merge 6 commits into GitoxideLabs:main from mjgil:pr-1595

Conversation

@mjgil

Add streaming blob lookup support for loose and packed objects, avoid eagerly materializing packed results into memory, and expose repository helpers with regression coverage for loose, packed, empty, and wrong-type cases.

Addresses #1595.

Summary

This adds blob-only streaming reads from the object database so callers can obtain blob contents without first materializing the full object in memory.

At the repository level, this introduces blob streaming helpers that:

  • stream loose blobs directly
  • stream packed blobs through a tempfile-backed reader
  • reject non-blob object IDs
  • preserve the existing empty-blob fast path
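The contract these helpers aim for can be sketched as a chunked copy: the caller pulls blob bytes from a reader instead of receiving a fully materialized buffer, so peak memory stays bounded by the chunk size. This is an illustrative std-only sketch; `copy_blob_stream` and its signature are invented here and are not the PR's actual API.

```rust
use std::io::{self, Read, Write};

/// Copy a blob stream into `out` in fixed-size chunks. Peak memory for the
/// copy is the 8 KiB buffer, regardless of how large the blob is — the
/// property the streaming helpers above are meant to provide.
pub fn copy_blob_stream<R: Read, W: Write>(mut src: R, mut out: W) -> io::Result<u64> {
    let mut buf = [0u8; 8 * 1024];
    let mut total = 0u64;
    loop {
        let n = src.read(&mut buf)?;
        if n == 0 {
            return Ok(total);
        }
        out.write_all(&buf[..n])?;
        total += n as u64;
    }
}
```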

Implementation

To support this, the change adds a streaming object path through the ODB stack.

Loose objects are exposed as a reader over incremental decompression.

Packed objects are decoded into a temporary file and then returned as a readable stream. This avoids holding the fully resolved blob in RAM, including for delta-resolved packed blobs, while still fitting the existing pack decoding model.
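The tempfile-spill pattern described above can be sketched with the standard library alone: fully decode into a temporary file, then hand back a reader positioned at the start, so the decoded bytes never need to sit in RAM at once. The `spill_to_tempfile` helper and the fixed temp-file name are illustrative only; the real implementation plugs into gix-pack's decoding and would use properly unique temp files.

```rust
use std::fs::File;
use std::io::{self, Read, Seek, SeekFrom};

/// Spill a decoded stream into a temporary file and return a reader over it.
/// Peak memory is bounded by io::copy's internal buffer, not the blob size.
pub fn spill_to_tempfile(mut decoded: impl Read) -> io::Result<File> {
    // Illustrative fixed name; a real implementation needs a unique tempfile.
    let path = std::env::temp_dir().join("gix-blob-spill-sketch.tmp");
    let mut file = File::options()
        .read(true)
        .write(true)
        .create(true)
        .truncate(true)
        .open(&path)?;
    io::copy(&mut decoded, &mut file)?;
    // Rewind so the caller reads the decoded bytes from the beginning.
    file.seek(SeekFrom::Start(0))?;
    Ok(file)
}
```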

The repository-facing API is intentionally blob-only for now, which keeps the surface area aligned with the use case described in #1595.

Tests

Added coverage for:

  • loose blob streaming
  • packed blob streaming
  • empty blob streaming
  • rejection of non-blob IDs

Also validated with:

  • cargo test -p gix
  • cargo test -p gix-odb
  • cargo test -p gix-pack

Notes

Co-authored-by: Claude
Co-authored-by: Codex

Contributor

@chatgpt-codex-connector (Bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f918257b18

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "Codex (@codex) review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "Codex (@codex) address that feedback".

Comment thread on gix-odb/src/store_impls/loose/find.rs (Outdated)
Comment on lines +188 to +190
zlib::Status::Ok | zlib::Status::BufError if eof => {
self.done = true;
return Ok(total_written);

P1 Badge Reject truncated loose streams instead of returning EOF

The streaming loose-object reader currently treats eof with Status::Ok|BufError as a normal end-of-stream and returns Ok(...), but it never verifies that the deflate stream actually reached StreamEnd or that the advertised loose-object size was fully produced. For a truncated/corrupt loose object this can silently return fewer bytes than Stream::size() instead of surfacing an error (the non-streaming path reports size/decompression mismatches), which can propagate corrupted blob contents to callers.

Useful? React with 👍 / 👎.
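The fix this review asks for amounts to tracking how many bytes the stream has produced and refusing to report a clean EOF early. A hedged sketch, with a plain byte reader standing in for the zlib-backed stream (the real reader would additionally require the inflate state to reach StreamEnd):

```rust
use std::io::{self, Read};

/// Wraps a decompressed stream and fails at EOF if fewer bytes were
/// produced than the size advertised in the loose-object header,
/// instead of silently returning a short read.
pub struct CheckedStream<R> {
    inner: R,
    produced: u64,
    expected: u64,
}

impl<R: Read> CheckedStream<R> {
    pub fn new(inner: R, expected: u64) -> Self {
        Self { inner, produced: 0, expected }
    }
}

impl<R: Read> Read for CheckedStream<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        let n = self.inner.read(buf)?;
        if n == 0 && self.produced != self.expected {
            // Truncated or corrupt loose object: surface an error rather
            // than propagating partial blob contents to the caller.
            return Err(io::Error::new(
                io::ErrorKind::UnexpectedEof,
                "loose object stream ended before the advertised size was produced",
            ));
        }
        self.produced += n as u64;
        Ok(n)
    }
}
```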

}
assert_eq!(i, data.len());
assert_eq!(target.len(), 0);
debug_assert_eq!(i, data.len());

P1 Badge Preserve delta output-length validation in apply()

After generalizing delta::apply() to write into any Write, the function no longer checks that the expected output length was fully produced. In decode paths that pass a fixed-size slice target, malformed delta instructions that produce fewer bytes than declared result_size now return Ok(()) and leave trailing zeroed bytes, silently corrupting decoded objects instead of failing fast.

Useful? React with 👍 / 👎.
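The restored check can be sketched as a wrapper around the instruction loop: count the bytes actually written and fail fast when they fall short of the delta's declared result size. `apply_checked` and its closure parameter are illustrative stand-ins for the real `delta::apply()` internals:

```rust
use std::io::{self, Write};

/// Run the delta instruction loop (represented by `apply_instructions`,
/// which reports bytes written) and verify the declared result size was
/// fully produced, instead of leaving a silently zero-padded tail.
pub fn apply_checked<W: Write>(
    result_size: u64,
    out: &mut W,
    apply_instructions: impl FnOnce(&mut W) -> io::Result<u64>,
) -> io::Result<()> {
    let written = apply_instructions(out)?;
    if written != result_size {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "delta application did not produce the declared result size",
        ));
    }
    Ok(())
}
```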

@Byron
Member

Thanks for taking a stab at this. If this works, it would be a huge step forward for Gitoxide.

For the PR to have a chance of being merged, I'd need proof of "it's working". For that, you could create a test with a custom 'measuring' allocator that shows how an optimally compressed and thus huge base object and deltified object are decompressed with less memory usage compared to the current eager decompression that stores the whole object in memory.

I may also add that it will take considerable time to get to it, but if you stay on top of it and provide the requested information, it should happen eventually.

Sebastian Thiel (@Byron) marked this pull request as draft on April 20, 2026 at 02:57
Member

Sebastian Thiel (@Byron) left a comment


Let's also add benchmark comparisons to see if this affects the performance of packed object decompression in the non-streaming case. The gix CLI has subcommands for that.

Reject truncated loose-object streams, restore delta output-length validation, and add allocator-backed proof that packed delta blob streaming lowers peak memory usage compared to eager blob lookup.

Co-authored-by: Codex
@mjgil
Author

Addressed the review items in 84931e7c3.

Changes:

  • reject truncated loose-object streams instead of returning short reads
  • restore delta output-size validation after the Write-based delta::apply() refactor
  • add an allocator-backed proof test for packed delta blob streaming memory behavior

Validation:

  • cargo test -p gix-pack
  • cargo test -p gix-odb
  • cargo test -p gix

Memory proof:

  • added cargo test -p gix --test blob_stream_memory
  • this creates a packed delta blob and compares peak allocations between find_blob() and find_blob_stream() on fresh repo instances
  • the streaming path shows a lower peak allocation than the eager path
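The measuring-allocator idea behind this proof is typically a `#[global_allocator]` wrapper that tracks live and peak bytes, so two code paths can be compared by their peak allocation. This is a generic sketch of that technique; the PR's actual blob_stream_memory test may be structured differently.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

/// Wraps the system allocator and records the peak number of live bytes,
/// letting a test compare eager vs. streaming lookups by memory high-water mark.
struct Counting;

static LIVE: AtomicUsize = AtomicUsize::new(0);
static PEAK: AtomicUsize = AtomicUsize::new(0);

unsafe impl GlobalAlloc for Counting {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        let live = LIVE.fetch_add(layout.size(), Ordering::SeqCst) + layout.size();
        PEAK.fetch_max(live, Ordering::SeqCst);
        unsafe { System.alloc(layout) }
    }
    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        LIVE.fetch_sub(layout.size(), Ordering::SeqCst);
        unsafe { System.dealloc(ptr, layout) }
    }
}

#[global_allocator]
static COUNTING: Counting = Counting;

/// Highest number of bytes that were live at any point so far.
pub fn peak_bytes() -> usize {
    PEAK.load(Ordering::SeqCst)
}
```

A test would then run each lookup path, read `peak_bytes()` after each, and assert the streaming peak stays below the eager one.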

Non-streaming benchmark:

  • command: gix free pack verify -s <idx>
  • main: 111.2 ms ± 1.0 ms
  • this branch: 113.6 ms ± 2.1 ms

So on this fixture the non-streaming packed-object path is about 2% slower, a small regression.

Bring the gix_object::Write trait into scope for write_buf() and remove the unused Read import so workspace clippy and test builds pass again.

Co-authored-by: Codex
Accept both UnexpectedEof and InvalidData for truncated loose-object stream corruption, as the stream reader can surface either depending on where zlib detects the truncated deflate payload.

Co-authored-by: Codex
Malcom Gilbert (mjgil) marked this pull request as ready for review on April 20, 2026 at 04:55
Contributor

@chatgpt-codex-connector (Bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4b7bf16549


Comment on lines +180 to +182
pack_id: resolved_pack_id,
pack_offset,
entry_size: outcome.compressed_size + header_size,

P2 Badge Recompute pack location after delta-base retry

When decode_entry_to_write() falls into the DeltaBaseUnresolved retry path, the object is looked up and decoded again after recursive base resolution, and that recursive lookup can reorder or refresh snapshot.indices. In that case, the successful retry may come from a different pack entry than the initial lookup, but the returned location still uses the originally captured pack_id/header_size, so callers can receive a mismatched Location (pack_id, pack_offset, entry_size) for the bytes they actually streamed.

Useful? React with 👍 / 👎.

@Byron
Member

Thanks! Please note that I am a bit allergic to PRs that appear fully automated. They translate to a lot of time for me and possibly very little for the author.

What's the story related to gix-filter? Right now that also isn't streaming, and to check out or check in/track a file one will have to run filters. That is, any worktree interaction, both from and to Git objects, may require a filter invocation.

@willstott101
Contributor

In issue #1595, Sebastian Thiel (@Byron), you mentioned that you'd prefer streaming to fail if it wasn't possible rather than fall back to something like Cursor<Vec<u8>>, but it seems like that's a code path in this PR.

I suppose the context is different. If the goal is for gitoxide to incrementally improve its memory usage by reducing how often entire blobs must be loaded into memory, then I prefer this invisible Vec approach. But if the goal is for users of the library to request a file stream for server streaming, like in my old issue, then knowing whether the stream was possible would be helpful, and actually just being given the Vec is more useful.

I would be happy to help review and extend this PR, it's still been in the back of my mind since I made that issue.

My inclination would be to make this API surface available in something like its current state, and then smaller PRs/improvements can be made to consume the streaming API surface within gix (for working-directory stuff) as well as to extend the underlying streaming mechanics, to stream more efficiently in more scenarios.

Does that sound reasonable? Will offer an actual review later.
