Integrate cbq by noamteyssier · Pull Request #79 · ArcInstitute/binseq

noamteyssier · 2026-01-15T22:18:03Z

No description provided.

gemini-code-assist · 2026-01-15T22:18:39Z

Summary of Changes

Hello @noamteyssier, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly expands the binseq library by integrating a new CBQ (Columnar Variable-Length Records) format. This new format is designed to enhance data storage efficiency and optimize for parallel processing by organizing sequence data into columnar blocks, each compressed independently. The changes involve extensive new module implementations for CBQ's core logic, readers, and writers, alongside necessary updates to project dependencies, error handling, and overall library integration to support this new data structure.

Highlights

New CBQ File Format: Introduced a new binary file format variant, CBQ (Columnar Variable-Length Records), designed for high-performance, efficient storage, and parallel processing of DNA sequences. This format stores data in separate, compressed columns within blocks, addressing limitations of the existing VBQ format.
Dependency Updates: Updated several Rust dependencies, including bitnuc to 0.4.0, bytemuck to 1.24.0 with new features, and itoa to 1.0.17. Two new dependencies, sucds (for Elias-Fano encoding) and memchr, were also added.
Rust Edition Upgrade: The project's Rust edition has been upgraded from 2021 to 2024 in Cargo.toml.
Error Handling Enhancements: Added a new CbqError enum to src/error.rs to handle various error conditions specific to the new CBQ format, such as block full, configuration mismatches, and invalid magic numbers.
Parallel Processing Support: Extended the BinseqReader enum and its parallel processing methods to support the new CBQ format, allowing for efficient parallel reading of CBQ files.
Documentation Update: The README.md file has been updated to include CBQ as a third variant of the BINSEQ format, providing an overview of its benefits and file structure.
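
To make the "separate columns within blocks" idea in the highlights concrete, here is a minimal sketch of a columnar block. It is not the CBQ on-disk layout: `ColumnarBlock`, its fields, and its methods are hypothetical names chosen for illustration, and compression is left out entirely.

```rust
/// Hypothetical columnar block: each field of every record lives in its own column.
/// This is an illustration only, not the CBQ format defined in this pull request.
#[derive(Default)]
struct ColumnarBlock {
    /// Length of each record's sequence, one entry per record.
    seq_lens: Vec<u32>,
    /// All sequence bytes, concatenated in record order.
    seqs: Vec<u8>,
    /// All quality bytes, concatenated in record order.
    quals: Vec<u8>,
}

impl ColumnarBlock {
    /// Appends one record by extending every column.
    fn push(&mut self, seq: &[u8], qual: &[u8]) {
        assert_eq!(seq.len(), qual.len(), "sequence and quality lengths must match");
        self.seq_lens.push(seq.len() as u32);
        self.seqs.extend_from_slice(seq);
        self.quals.extend_from_slice(qual);
    }

    /// Number of records stored in the block.
    fn len(&self) -> usize {
        self.seq_lens.len()
    }

    /// Returns (sequence, quality) for record `idx`, locating it by summing
    /// the lengths of all preceding records.
    fn get(&self, idx: usize) -> Option<(&[u8], &[u8])> {
        let len = *self.seq_lens.get(idx)? as usize;
        let start: usize = self.seq_lens[..idx].iter().map(|&l| l as usize).sum();
        Some((&self.seqs[start..start + len], &self.quals[start..start + len]))
    }
}
```

Because each column is a single contiguous buffer, a writer could compress `seq_lens`, `seqs`, and `quals` independently, which is the property the summary attributes to the format.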

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist

Code Review

This is a substantial pull request that introduces the new cbq columnar binary format. The implementation is extensive, covering reading (streaming and memory-mapped), writing, and parallel processing capabilities, and is well-integrated with the existing binseq abstractions. The code is well-structured and makes good use of modern Rust features and external crates for performance.

I've identified one critical issue where an implementation of a core trait is incomplete, which will lead to runtime panics. I've also found a high-severity bug in a utility function that can cause a division by zero. Addressing these will be important for the stability and correctness of the new format's implementation.

gemini-code-assist · 2026-01-15T22:21:06Z

+    fn sbuf(&self) -> &[u64] {
+        unimplemented!("sbuf is not implemented for cbq")
+    }
+
+    fn xbuf(&self) -> &[u64] {
+        unimplemented!("xbuf is not implemented for cbq")
+    }


The BinseqRecord trait is not fully implemented for RefRecord. The methods sbuf() and xbuf() are left as unimplemented!, which will cause a runtime panic if a cbq::RefRecord is used with generic consumers that expect these methods to be available, such as SeqCtx used in the examples.

This breaks the contract of the BinseqRecord trait and prevents cbq records from being used polymorphically with other record types like bq and vbq.

Given that the cbq format is columnar and does not store per-record u64 buffers, you might need to reconsider if RefRecord should implement BinseqRecord, or if the trait itself needs refactoring to better support different storage layouts. At a minimum, these methods should not panic. If they cannot be implemented, perhaps they could return an empty slice &[] and consumers could be adapted, but this would be a significant breaking change for the trait.
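
One possible shape for the trait refactoring mentioned above, as a sketch only: the buffer accessors return `Option` so that a columnar record can report "not available" instead of panicking. The names `Record`, `PackedRecord`, `ColumnarRecord`, and `packed_words` are hypothetical stand-ins and are not part of the binseq API; the real `BinseqRecord` trait has more methods than shown here.

```rust
/// Hypothetical stand-in for a record trait; not the real `BinseqRecord`.
trait Record {
    /// Per-record packed buffer, if the format stores one.
    fn sbuf(&self) -> Option<&[u64]>;
    /// Second per-record packed buffer, if the format stores one.
    fn xbuf(&self) -> Option<&[u64]>;
}

/// A record type that owns per-record `u64` buffers.
struct PackedRecord {
    sbuf: Vec<u64>,
    xbuf: Vec<u64>,
}

/// A record type backed by columnar storage, with no per-record `u64` buffers.
struct ColumnarRecord;

impl Record for PackedRecord {
    fn sbuf(&self) -> Option<&[u64]> {
        Some(&self.sbuf)
    }
    fn xbuf(&self) -> Option<&[u64]> {
        Some(&self.xbuf)
    }
}

impl Record for ColumnarRecord {
    // Returning `None` states the absence explicitly instead of panicking.
    fn sbuf(&self) -> Option<&[u64]> {
        None
    }
    fn xbuf(&self) -> Option<&[u64]> {
        None
    }
}

/// Generic consumer that works with both layouts and never panics.
fn packed_words<R: Record>(rec: &R) -> usize {
    rec.sbuf().map_or(0, |b| b.len()) + rec.xbuf().map_or(0, |b| b.len())
}
```

Compared with returning an empty slice, `Option` lets a consumer distinguish "this format has no such buffer" from "the buffer exists but is empty". Either way it is a breaking change for existing implementors.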

gemini-code-assist · 2026-01-15T22:21:06Z

+    pub fn average_block_size(&self) -> f64 {
+        let mut block_iter = self.iter_blocks();
+        let Some(mut last_block) = block_iter.next() else {
+            return 0.0;
+        };
+        let mut total_size = 0.0;
+        let mut count = 0;
+        for block in block_iter {
+            let last_block_size = block.offset - last_block.offset;
+            total_size += last_block_size as f64;
+            count += 1;
+            last_block = block;
+        }
+        total_size / f64::from(count)
+    }


The logic in average_block_size is flawed and can lead to a division-by-zero error.

If there is only one block in the index, block_iter.next() will succeed, but the for loop will not execute. This leaves count at 0, causing a division by zero (total_size / 0.0), which results in NaN.

The calculation does not include the size of the last block, so the average is only over the first N-1 blocks.

I suggest refactoring this function to handle these edge cases correctly and to make the calculation more robust.

Suggested change

-    pub fn average_block_size(&self) -> f64 {
-        let mut block_iter = self.iter_blocks();
-        let Some(mut last_block) = block_iter.next() else {
-            return 0.0;
-        };
-        let mut total_size = 0.0;
-        let mut count = 0;
-        for block in block_iter {
-            let last_block_size = block.offset - last_block.offset;
-            total_size += last_block_size as f64;
-            count += 1;
-            last_block = block;
-        }
-        total_size / f64::from(count)
-    }
+    #[must_use]
+    pub fn average_block_size(&self) -> f64 {
+        if self.ranges.len() < 2 {
+            return 0.0;
+        }
+        let mut total_size = 0.0;
+        for i in 0..(self.ranges.len() - 1) {
+            let block_size = self.ranges[i + 1].offset - self.ranges[i].offset;
+            total_size += block_size as f64;
+        }
+        // Note: This calculates the average size of all blocks except the last one.
+        // To include the last block, the total file size would be needed.
+        total_size / (self.ranges.len() - 1) as f64
+    }
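
The suggestion's closing note says the last block can only be counted if the end of the data is known. A minimal sketch of that variant follows, assuming the caller can supply the byte offset where the final block ends. `BlockRange` and the `data_end` parameter are hypothetical; only the `offset` field mirrors the snippet under review, and the sketch assumes offsets are sorted and blocks are contiguous.

```rust
/// Hypothetical stand-in for an index entry; only `offset` mirrors the reviewed code.
struct BlockRange {
    /// Byte offset at which this block starts.
    offset: u64,
}

/// Average block size in bytes, including the final block.
///
/// `data_end` is an assumed input: the byte offset at which the last block ends.
/// Returns 0.0 for an empty index, so the division never sees a zero count.
fn average_block_size(ranges: &[BlockRange], data_end: u64) -> f64 {
    let Some(first) = ranges.first() else {
        return 0.0;
    };
    // With sorted, contiguous blocks the per-block differences telescope,
    // so the total size is just the end offset minus the first start offset.
    let total = data_end.saturating_sub(first.offset);
    total as f64 / ranges.len() as f64
}
```

With a single block starting at offset 0 and `data_end` of 100, this returns 100.0 rather than NaN or 0.0, which covers the one-block case raised in the review.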

noamteyssier added 30 commits December 16, 2025 13:28

initial commit

cb32965

working implementation of writer

7e576e2

working implementation with packed sequences

36ef124

store less redundant data

097f9f8

working implementation with npos tracking

777bcef

wip: reader

e7d7148

refactor: rename block writer

ad03e26

refactor: rework block into separate data structure

03d41b4

feat: working decoder

10bde64

chore: remove ebuf_len since its deterministically sized from the nuclen

ee92403

feat: added final index for cbq

c050874

feat: added a global file header

5f5f39c

fix: ensure ranges account for the size of the block header

d0c7dd2

feat: working implementation of parallel mmap reader

065ef3f

refactor: decompress direct from mmap

453be85

refactor: reuse offset calculations instead of allocating

ee9c9ce

refactor: reuse a dctx for all decoders

9880bba

fix: ensure that the footer is aligned before transmuting

71edfea

refactor: improve throughput and reduce function call overhead

5fc2506

refactor: return a block header for the stream reader

4891e15

refactor: move into separate submodules for cleaner organization

2976440

feat: added a sequencing record builder

f39cc54

feat: working ingest feature

12e65b1

feat: examples of reading and writing cbq

9fecd51

feat: more complete examples

4841590

fix: remove secondary overwrite

fbdb2fb

fix: finish paired impl

33016e7

refactor: simplify impl

ffb5093

chore: print number of records to stderr

c967fd2

feat: improve encoding throughput with shared cctx

0aa51dc

noamteyssier added 25 commits December 18, 2025 12:10

feat: added in cbq to core library

3e66adf

feat: integrated cbq into binseq reader

987efef

feat: working implementation of cbq integration

ea85b6c

feat: reduce record size constraints based to reduced sequenced size

22a42a6

refactor: use zstd compress2 and ensure compression level

9684676

feat: set options for the compressor

c13159f

feat: use EF encoding for n-positions instead of just storing u64 indices

32a8fef

dep(bitnuc): update

d71ba7a

chore: remove old cbq impl

0b64b49

refactor: remove all anyhow errors and use crate errors

c65ccf1

style(fmt): run

c73a8fa

style(clippy): fix

72c45f0

refactor: clean up implementation of ref record iter with more correct global index

06b93a4

refactor: clean up internal api and track number of records and number of sequences separately

ed684db

feat: added convenience functions for accessing index metadata

25b1638

style(clippy): fix

ddc5f00

fix: ensure size is div by 4 not by 32

1655908

feat: added an iterator over block headers

3219c18

docs: added mention of cbq

714b4a0

docs: added documentation about cbq

ed380d0

docs: added documentation about cbq

8aa75be

docs: added documentation about cbq

1147407

style(clippy): fix

b00db99

chore: remove debug code

afa596b

dep: remove redundant dependencies

b3cb076

gemini-code-assist Bot reviewed Jan 15, 2026

noamteyssier changed the base branch from main to binseq-0.9.0 January 15, 2026 22:28

noamteyssier merged commit 002d706 into binseq-0.9.0 Jan 15, 2026
20 of 21 checks passed

noamteyssier deleted the integrate-cbq branch January 15, 2026 22:28

noamteyssier merged 62 commits into binseq-0.9.0 from integrate-cbq
