
Integrate cbq #79

Merged
noamteyssier merged 62 commits into binseq-0.9.0 from integrate-cbq
Jan 15, 2026

@noamteyssier
Collaborator

No description provided.

@gemini-code-assist

Summary of Changes

Hello @noamteyssier, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly expands the binseq library by integrating a new CBQ (Columnar Variable-Length Records) format. This new format is designed to enhance data storage efficiency and optimize for parallel processing by organizing sequence data into columnar blocks, each compressed independently. The changes involve extensive new module implementations for CBQ's core logic, readers, and writers, alongside necessary updates to project dependencies, error handling, and overall library integration to support this new data structure.

Highlights

  • New CBQ File Format: Introduced a new binary file format variant, CBQ (Columnar Variable-Length Records), designed for high-performance, efficient storage, and parallel processing of DNA sequences. This format stores data in separate, compressed columns within blocks, addressing limitations of the existing VBQ format.
  • Dependency Updates: Updated several Rust dependencies including bitnuc to 0.4.0, bytemuck to 1.24.0 with new features, and itoa to 1.0.17. A new dependency, sucds (for Elias-Fano encoding), and memchr were also added.
  • Rust Edition Upgrade: The project's Rust edition has been upgraded from 2021 to 2024 in Cargo.toml.
  • Error Handling Enhancements: Added a new CbqError enum to src/error.rs to handle various error conditions specific to the new CBQ format, such as block full, configuration mismatches, and invalid magic numbers.
  • Parallel Processing Support: Extended the BinseqReader enum and its parallel processing methods to support the new CBQ format, allowing for efficient parallel reading of CBQ files.
  • Documentation Update: The README.md file has been updated to include CBQ as a third variant of the BINSEQ format, providing an overview of its benefits and file structure.
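
To make the "separate columns within blocks" idea concrete, here is a minimal sketch of a columnar block in Rust. It is an illustration only: `ColumnarBlock`, its fields, and its methods are hypothetical names, and this is not the on-disk CBQ layout or the API added in this pull request. The point is that each field lives in its own contiguous buffer, which is what would let each column be compressed independently.

```rust
/// Hypothetical columnar block: one buffer per field instead of one struct per record.
/// Illustration only; not the actual CBQ layout.
#[derive(Default)]
pub struct ColumnarBlock {
    /// Column of record lengths, one entry per record.
    pub lengths: Vec<u32>,
    /// Column of sequence bytes, all records concatenated back to back.
    pub sequences: Vec<u8>,
}

impl ColumnarBlock {
    /// Append one record by pushing onto each column.
    pub fn push(&mut self, seq: &[u8]) {
        self.lengths.push(seq.len() as u32);
        self.sequences.extend_from_slice(seq);
    }

    /// Number of records in the block.
    pub fn len(&self) -> usize {
        self.lengths.len()
    }

    /// Rebuild record `i` by summing the preceding lengths to find its start.
    pub fn get(&self, i: usize) -> Option<&[u8]> {
        let len = *self.lengths.get(i)? as usize;
        let start: usize = self.lengths[..i].iter().map(|&l| l as usize).sum();
        self.sequences.get(start..start + len)
    }
}
```

A real format would likely store cumulative offsets rather than re-summing lengths on every lookup; the linear scan here just keeps the sketch short.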

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This is a substantial pull request that introduces the new cbq columnar binary format. The implementation is extensive, covering reading (streaming and memory-mapped), writing, and parallel processing capabilities, and is well-integrated with the existing binseq abstractions. The code is well-structured and makes good use of modern Rust features and external crates for performance.

I've identified one critical issue where an implementation of a core trait is incomplete, which will lead to runtime panics. I've also found a high-severity bug in a utility function that can cause a division by zero. Addressing these will be important for the stability and correctness of the new format's implementation.

Comment thread src/cbq/core/block.rs
Comment on lines +798 to +804
fn sbuf(&self) -> &[u64] {
    unimplemented!("sbuf is not implemented for cbq")
}

fn xbuf(&self) -> &[u64] {
    unimplemented!("xbuf is not implemented for cbq")
}

critical

The BinseqRecord trait is not fully implemented for RefRecord. The methods sbuf() and xbuf() are left as unimplemented!, which will cause a runtime panic if a cbq::RefRecord is used with generic consumers that expect these methods to be available, such as SeqCtx used in the examples.

This breaks the contract of the BinseqRecord trait and prevents cbq records from being used polymorphically with other record types like bq and vbq.

Given that the cbq format is columnar and does not store per-record u64 buffers, you might need to reconsider if RefRecord should implement BinseqRecord, or if the trait itself needs refactoring to better support different storage layouts. At a minimum, these methods should not panic. If they cannot be implemented, perhaps they could return an empty slice &[] and consumers could be adapted, but this would be a significant breaking change for the trait.
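
One way the trait refactoring mentioned above could look is sketched here, under loud assumptions: `PackedRecord`, `RowRecord`, `ColumnarRecord`, and `packed_word_count` are hypothetical names invented for illustration and are not the real `BinseqRecord` trait or any type in binseq. The sketch makes the buffer accessor return `Option`, so a layout without per-record `u64` buffers reports absence instead of panicking.

```rust
/// Hypothetical stand-in for a record trait; not the real `BinseqRecord` definition.
pub trait PackedRecord {
    /// Packed sequence words, or `None` when the layout has no per-record buffer.
    fn sbuf(&self) -> Option<&[u64]>;
}

/// Row-oriented record that owns a packed buffer (illustrative).
pub struct RowRecord {
    pub words: Vec<u64>,
}

/// Columnar record view with no per-record `u64` buffer (illustrative).
pub struct ColumnarRecord;

impl PackedRecord for RowRecord {
    fn sbuf(&self) -> Option<&[u64]> {
        Some(&self.words)
    }
}

impl PackedRecord for ColumnarRecord {
    fn sbuf(&self) -> Option<&[u64]> {
        None // reported as absent instead of panicking
    }
}

/// Generic consumer that works with both layouts and never panics.
pub fn packed_word_count<R: PackedRecord>(record: &R) -> usize {
    record.sbuf().map_or(0, |words| words.len())
}
```

Changing a return type this way forces every existing consumer to handle the `None` case, which is the breaking change the comment warns about.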

Comment thread src/cbq/core/index.rs
Comment on lines +155 to +169
pub fn average_block_size(&self) -> f64 {
    let mut block_iter = self.iter_blocks();
    let Some(mut last_block) = block_iter.next() else {
        return 0.0;
    };
    let mut total_size = 0.0;
    let mut count = 0;
    for block in block_iter {
        let last_block_size = block.offset - last_block.offset;
        total_size += last_block_size as f64;
        count += 1;
        last_block = block;
    }
    total_size / f64::from(count)
}

high

The logic in average_block_size is flawed: it can divide by zero, and it silently returns NaN rather than raising an error.

  1. If there is only one block in the index, block_iter.next() will succeed, but the for loop will not execute. This leaves both total_size and count at 0, so the final expression evaluates 0.0 / 0.0. Floating-point division does not panic in Rust, so the function returns NaN.
  2. The calculation does not include the size of the last block, so the average covers only the first N-1 blocks.

I suggest refactoring this function to handle these edge cases correctly and to make the calculation more robust.
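
The single-block case can be reproduced in isolation. The sketch below mirrors the arithmetic of the flagged function over a plain slice of offsets; `average_gap` and its `offsets` parameter are hypothetical stand-ins for the index and its block ranges, not code from the pull request.

```rust
/// Mirrors the arithmetic of the flagged function over a plain slice of block offsets.
/// `offsets` is a hypothetical stand-in for the index's block ranges.
pub fn average_gap(offsets: &[u64]) -> f64 {
    let mut iter = offsets.iter();
    let Some(mut last) = iter.next() else {
        return 0.0;
    };
    let mut total = 0.0;
    let mut count: u32 = 0;
    for offset in iter {
        total += (offset - last) as f64;
        count += 1;
        last = offset;
    }
    // With a single offset the loop never runs, so this is 0.0 / 0.0.
    total / f64::from(count)
}
```

Calling `average_gap(&[100])` yields NaN without any panic, so the problem would only surface later, wherever the value is consumed.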

Suggested change
-pub fn average_block_size(&self) -> f64 {
-    let mut block_iter = self.iter_blocks();
-    let Some(mut last_block) = block_iter.next() else {
-        return 0.0;
-    };
-    let mut total_size = 0.0;
-    let mut count = 0;
-    for block in block_iter {
-        let last_block_size = block.offset - last_block.offset;
-        total_size += last_block_size as f64;
-        count += 1;
-        last_block = block;
-    }
-    total_size / f64::from(count)
-}
+#[must_use]
+pub fn average_block_size(&self) -> f64 {
+    if self.ranges.len() < 2 {
+        return 0.0;
+    }
+    let mut total_size = 0.0;
+    for i in 0..(self.ranges.len() - 1) {
+        let block_size = self.ranges[i + 1].offset - self.ranges[i].offset;
+        total_size += block_size as f64;
+    }
+    // Note: This calculates the average size of all blocks except the last one.
+    // To include the last block, the total file size would be needed.
+    total_size / (self.ranges.len() - 1) as f64
+}
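
The note in the suggestion says the last block can only be counted if the total file size is known. A minimal sketch of that variant follows, assuming the caller can supply the byte offset just past the final block. The free function, the `offsets` slice, and the `end` parameter are hypothetical and do not reflect the index type's real fields. Because consecutive block sizes telescope, the sum of all sizes is simply `end - first`.

```rust
/// Average block size over all blocks, including the last one.
/// `offsets` holds each block's starting byte offset in ascending order and
/// `end` is the byte offset just past the final block; both are hypothetical inputs.
pub fn average_block_size(offsets: &[u64], end: u64) -> Option<f64> {
    // No blocks: there is no meaningful average.
    let first = *offsets.first()?;
    // The per-block sizes telescope to `end - first`; reject an `end` before the first block.
    let total = end.checked_sub(first)?;
    Some(total as f64 / offsets.len() as f64)
}
```

Returning `Option` keeps the empty-index case distinct from a genuine average of zero, at the cost of a signature change for callers.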

@noamteyssier noamteyssier changed the base branch from main to binseq-0.9.0 January 15, 2026 22:28
@noamteyssier noamteyssier merged commit 002d706 into binseq-0.9.0 Jan 15, 2026
20 of 21 checks passed
@noamteyssier noamteyssier deleted the integrate-cbq branch January 15, 2026 22:28