Skip to content

Latest commit

 

History

History
103 lines (68 loc) · 5.4 KB

File metadata and controls

103 lines (68 loc) · 5.4 KB

CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

bqtools is a Rust CLI for working with BINSEQ files — a binary format family for high-performance DNA sequence processing. It encodes, decodes, greps, concatenates, samples, and pipes BINSEQ files (.bq, .vbq, .cbq). CBQ is the recommended format for most applications.

Build & Test Commands

cargo build                    # Debug build
cargo build --release          # Optimized build (uses LTO, slow)
cargo install --path .         # Install binary locally

cargo test --verbose           # Run all tests
cargo test --verbose -F fuzzy  # Run tests including fuzzy feature
cargo test <test_name>         # Run a single test by name

cargo fmt --check              # Check formatting
cargo clippy --verbose         # Lint (pedantic clippy enabled)

Logging is controlled via BQTOOLS_LOG environment variable (uses env_logger).

Feature Flags

  • htslib (default): SAM/BAM/CRAM support via rust-htslib
  • gcs (default): Google Cloud Storage file reading
  • fuzzy (optional): Fuzzy matching via sassy — requires RUSTFLAGS="-C target-cpu=native"

Build without defaults: cargo build --no-default-features -F fuzzy,gcs

Architecture

Module Layout

  • src/cli/ — Clap derive-based argument definitions. cli.rs has the top-level Commands enum. input.rs and output.rs handle complex input/output argument parsing (file formats, compression, paired-end, spans).
  • src/commands/ — Command implementations, each in its own subdirectory. utils.rs has shared compression helpers.
  • src/types.rs — Type aliases (BoxedReader, BoxedWriter).
  • src/main.rs — CLI dispatch and SIGPIPE handling.

Key Patterns

Parallel processing: Commands use the paraseq crate's ParallelProcessor trait for embarrassingly parallel batch processing. Each command has a processor.rs implementing this trait with thread-local buffers and Arc<Mutex<T>> for shared global state.

Grep backends: The grep command uses a PatternMatcher enum dispatching to three backends — regex, aho-corasick (fixed-string, multi-pattern), and sassy (fuzzy, feature-gated). The same pattern applies to PatternCounter for the -P pattern-count mode. All backends accept PatternCollection which carries optional pattern names (from FASTA headers).

Pattern types: patterns.rs defines Pattern (name + sequence) and PatternCollection (newtype over Vec<Pattern>) with methods .bytes(), .regexes(), .names(). Pattern files (--file, --sfile, --xfile) auto-detect FASTA vs plain text. FASTA headers become pattern names; plain text patterns have no name and fall back to the pattern string in output.

Encode modes: Encoding dispatches across atomic (single/paired files), recursive (directory walk via walkdir), manifest (file list), and batch (multi-file thread distribution) modes.

Writer abstraction: SplitWriter supports interleaved (single file) and split (separate R1/R2) output modes with polymorphic writers (file, stdout, compressed, chunked).

Core Dependencies

Crate Role
binseq BINSEQ format read/write
bitnuc 2-bit/4-bit nucleotide encoding
paraseq Parallel FASTX/BINSEQ processing
clap CLI argument parsing (derive)
anyhow Error handling throughout

Testing

Integration tests live in tests/. tests/common.rs provides a builder (write_fastx()) for generating random FASTQ/FASTA test data with configurable compression (none, gzip, zstd). Tests use cartesian products over format/compression/mode combinations. Dev dependencies: bon (builder macro), nucgen (random sequences), tempfile, itertools.

Generating Test Data

Random FASTQ/FASTA test data can be created on the CLI with nucgen (cargo install nucgen if not already installed).

# generates 10,000 reads of length 150
nucgen -n 10000 -l 150 some.fq
# generates 30,000 paired-reads of length 50 and 200
nucgen -n 30000 -l 50 -L 200 some_R1.fq some_R2.fq

These can then be ingested with bqtools encode:

bqtools encode some.fq -o some.cbq
bqtools encode some_R1.fq some_R2.fq -o some.cbq

Benchmarking Changes

Make use of hyperfine (cargo install hyperfine if not already installed) to measure performance of binaries after changes.

# Measures decoding performance
hyperfine --warmup 3 --runs 10 "bqtools decode some.cbq > /dev/null"

Contribution Guide

When making changes, keep the following documentation in sync:

  1. CLAUDE.md — Update this file when adding new commands, changing architecture, or modifying build/test workflows.
  2. README.md — Update usage examples and feature descriptions when adding or changing user-facing functionality (new commands, flags, behavior changes).
  3. Clap doc comments — All CLI arguments, flags, and subcommands use clap derive macros with /// doc comments and #[clap(long_about)] attributes. When adding or modifying flags, write clear help text directly on the struct fields in src/cli/. These doc comments are the --help output users see.
  4. New feature flags — If adding a Cargo feature flag, document it in both CLAUDE.md (Feature Flags section) and README.md (Feature Flags / Installation section).