Integrate cbq #79
Conversation
Summary of Changes

Hello @noamteyssier, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly expands the …
Code Review
This is a substantial pull request that introduces the new cbq columnar binary format. The implementation is extensive, covering reading (streaming and memory-mapped), writing, and parallel processing capabilities, and is well-integrated with the existing binseq abstractions. The code is well-structured and makes good use of modern Rust features and external crates for performance.
I've identified one critical issue where an implementation of a core trait is incomplete, which will lead to runtime panics. I've also found a high-severity bug in a utility function that can cause a division by zero. Addressing these will be important for the stability and correctness of the new format's implementation.
```rust
fn sbuf(&self) -> &[u64] {
    unimplemented!("sbuf is not implemented for cbq")
}

fn xbuf(&self) -> &[u64] {
    unimplemented!("xbuf is not implemented for cbq")
}
```
The `BinseqRecord` trait is not fully implemented for `RefRecord`. The methods `sbuf()` and `xbuf()` are left as `unimplemented!`, which will cause a runtime panic if a `cbq::RefRecord` is used with generic consumers that expect these methods to be available, such as the `SeqCtx` used in the examples.

This breaks the contract of the `BinseqRecord` trait and prevents cbq records from being used polymorphically with other record types like `bq` and `vbq`.

Given that the cbq format is columnar and does not store per-record u64 buffers, you might need to reconsider whether `RefRecord` should implement `BinseqRecord`, or whether the trait itself needs refactoring to better support different storage layouts. At a minimum, these methods should not panic. If they cannot be implemented, perhaps they could return an empty slice `&[]` and consumers could be adapted, but this would be a significant breaking change for the trait.
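If the empty-slice route is taken, the change could look like the following standalone sketch. Note that the `BinseqRecord` trait shown here is a reduced, hypothetical stand-in for illustration only; the real trait in the crate has more methods and different signatures:

```rust
// Reduced, hypothetical stand-in for the real BinseqRecord trait
// (illustration only -- the actual trait in binseq is larger).
trait BinseqRecord {
    fn sbuf(&self) -> &[u64];
    fn xbuf(&self) -> &[u64];
}

// A cbq-style record: columnar storage, no per-record u64 buffers.
struct RefRecord;

impl BinseqRecord for RefRecord {
    // Instead of panicking with unimplemented!(), return an empty
    // slice so generic consumers can still call these methods safely.
    fn sbuf(&self) -> &[u64] {
        &[]
    }

    fn xbuf(&self) -> &[u64] {
        &[]
    }
}
```

Consumers would then need to treat an empty buffer as "not available for this format" rather than assuming data is always present.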
```rust
pub fn average_block_size(&self) -> f64 {
    let mut block_iter = self.iter_blocks();
    let Some(mut last_block) = block_iter.next() else {
        return 0.0;
    };
    let mut total_size = 0.0;
    let mut count = 0;
    for block in block_iter {
        let last_block_size = block.offset - last_block.offset;
        total_size += last_block_size as f64;
        count += 1;
        last_block = block;
    }
    total_size / f64::from(count)
}
```
The logic in `average_block_size` is flawed and can lead to a division-by-zero error.

- If there is only one block in the index, `block_iter.next()` will succeed, but the `for` loop will not execute. This leaves `count` at 0, causing a division by zero (`total_size / 0.0`), which results in `NaN`.
- The calculation does not include the size of the last block, so the average is only over the first `N-1` blocks.

I suggest refactoring this function to handle these edge cases correctly and to make the calculation more robust.
Suggested change:

```diff
-pub fn average_block_size(&self) -> f64 {
-    let mut block_iter = self.iter_blocks();
-    let Some(mut last_block) = block_iter.next() else {
-        return 0.0;
-    };
-    let mut total_size = 0.0;
-    let mut count = 0;
-    for block in block_iter {
-        let last_block_size = block.offset - last_block.offset;
-        total_size += last_block_size as f64;
-        count += 1;
-        last_block = block;
-    }
-    total_size / f64::from(count)
-}
+#[must_use]
+pub fn average_block_size(&self) -> f64 {
+    if self.ranges.len() < 2 {
+        return 0.0;
+    }
+    let mut total_size = 0.0;
+    for i in 0..(self.ranges.len() - 1) {
+        let block_size = self.ranges[i + 1].offset - self.ranges[i].offset;
+        total_size += block_size as f64;
+    }
+    // Note: This calculates the average size of all blocks except the last one.
+    // To include the last block, the total file size would be needed.
+    total_size / (self.ranges.len() - 1) as f64
+}
```