From cc8afa5f516fc09b2b7afb25ad7750fa3c50ccdd Mon Sep 17 00:00:00 2001 From: Sam Park Date: Mon, 13 Apr 2026 19:04:50 -0400 Subject: [PATCH 1/2] Sync gtars docs: genomicdist, lola, core updates, bindings MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Brings docs in line with the current gtars workspace, focused on the genomicdist + lola crates and the RegionSet/RegionSetList surface that the feat/genomicdist-spatial-stats branch introduced across all bindings. ## New pages Rust module references (2): - gtars/genomicdist.md — port of R GenomicDistributions plus new spatial-arrangement summaries (inter-peak spacing, peak clusters, density vector, density homogeneity). Two-bucket framing separating per-region quantities from whole-set scalar summaries. - gtars/lola.md — LOLA enrichment built on IGD + genomicdist. End-to-end workflow, universe diagnostics, Fisher's exact test semantics, CMLE odds ratio. Python bindings (3): - gtars/python/models.md — full RegionSet / RegionSetList method surface (~50 methods), plus statistics result types, reference genomes, gene models, partition lists, signal matrices. - gtars/python/genomic_distributions.md — free functions that need a reference genome, partition list, or signal matrix. - gtars/python/lola.md — RegionDB, run_lola with pandas-friendly return schema, universe helpers. Flags the tuple-based API quirk. Wasm bindings (4): - gtars/wasm/regionset.md — consolidated page for RegionSet, RegionSetList, ConsensusBuilder, and result types. - gtars/wasm/partitions.md — GeneModel, PartitionList, calcPartitions, calcExpectedPartitions. Notes no filesystem loader in Wasm. - gtars/wasm/signal.md — SignalMatrix fromBin / fromJSArrays, calcSummarySignal. - gtars/wasm/lola.md — LolaRegionDB, runLOLA, checkUniverseAppropriateness. R bindings (4): - gtars/r/regionset.md — RegionSet and RegionSetList S4 classes, polymorphic inputs, overridden Bioconductor generics. - gtars/r/genomicdist.md — drop-in replacements for R GenomicDistributions with a porting table. - gtars/r/lola.md — drop-in replacement for R LOLA. Returns data.table. - gtars/r/igd.md — low-level igd_create / igd_search. ## Updated pages - gtars/core.md — full rewrite. Documents RegionSetList, CoordinateMode, Fragment, Interval, RegionSetError, to_polars, bigbed/http/dataframe feature flags. Fixes pre-existing Region::new() example (method does not exist — uses struct-literal syntax). - gtars/regionSet.md — appended RegionSetList section with Python/Rust side-by-side. Cross-links to core.md, genomicdist.md, lola.md. - gtars/modules.md — added "Distribution Analysis & Enrichment" section. - gtars/python-overview.md — rewrote. Old version had incorrect examples (rs.coverage() no args, from_bed classmethod that doesn't exist). - gtars/wasm.md — rewrote. Corrects npm package name to @databio/gtars-js and links out to the new submodule pages. - gtars/r.md — rewrote. Added submodule table, polymorphic-inputs explanation, quick starts for core/LOLA/refget/IGD. - mkdocs.yml — added nav entries under Modules and Bindings → {Python, Wasm, R}. Renamed "gtars-models" label to "Core models". All new nav labels sized to match the widest pre-existing entry in each section (no sidebar wraparound). ## Coverage of feat/genomicdist-spatial-stats All four new spatial-arrangement summary methods (calc_inter_peak_spacing, calc_peak_clusters with min_cluster_size, calc_density_vector, calc_density_homogeneity), the reworked gaps(chrom_sizes) signature, and the "n_bins is a target for the longest chromosome, not a total" semantic are documented across all four layers (Rust, Python, Wasm, R). ## Verification - mkdocs build --strict passes (one pre-existing WARNING unrelated to this work: refget-store.md → ../refget-store-format.md, from upstream commit b85d462 which removed the target file without updating the link). - mkdocs serve smoketest: all 13 new pages return 200 OK. - All cross-page anchor links verified against generated HTML slugs. - Every example snippet's API signatures were audited against the actual Rust/Python/R/Wasm source under repos/gtars rather than guessed. Co-Authored-By: Claude Opus 4.6 (1M context) --- docs/gtars/core.md | 261 +++++++-- docs/gtars/genomicdist.md | 609 +++++++++++++++++++++ docs/gtars/lola.md | 358 ++++++++++++ docs/gtars/modules.md | 7 +- docs/gtars/python-overview.md | 105 ++-- docs/gtars/python/genomic_distributions.md | 225 ++++++++ docs/gtars/python/lola.md | 269 +++++++++ docs/gtars/python/models.md | 404 ++++++++++++++ docs/gtars/r.md | 131 ++++- docs/gtars/r/genomicdist.md | 224 ++++++++ docs/gtars/r/igd.md | 101 ++++ docs/gtars/r/lola.md | 228 ++++++++ docs/gtars/r/regionset.md | 208 +++++++ docs/gtars/regionSet.md | 59 ++ docs/gtars/wasm.md | 90 +-- docs/gtars/wasm/lola.md | 190 +++++++ docs/gtars/wasm/partitions.md | 152 +++++ docs/gtars/wasm/regionset.md | 280 ++++++++++ docs/gtars/wasm/signal.md | 140 +++++ mkdocs.yml | 15 +- 20 files changed, 3939 insertions(+), 117 deletions(-) create mode 100644 docs/gtars/genomicdist.md create mode 100644 docs/gtars/lola.md create mode 100644 docs/gtars/python/genomic_distributions.md create mode 100644 docs/gtars/python/lola.md create mode 100644 docs/gtars/python/models.md create mode 100644 docs/gtars/r/genomicdist.md create mode 100644 docs/gtars/r/igd.md create mode 100644 docs/gtars/r/lola.md create mode 100644 docs/gtars/r/regionset.md create mode 100644 docs/gtars/wasm/lola.md create mode 100644 docs/gtars/wasm/partitions.md create mode 100644 docs/gtars/wasm/regionset.md create mode 100644 docs/gtars/wasm/signal.md diff --git a/docs/gtars/core.md b/docs/gtars/core.md index 5d2a2612..71aef461 100644 --- a/docs/gtars/core.md +++ b/docs/gtars/core.md @@ -1,60 +1,257 @@ # gtars-core -Core library providing fundamental data structures and utilities for genomic interval operations. This is the foundation that all other gtars modules build upon. +Core library providing the fundamental data structures and utilities that every other gtars crate builds on. If you're working directly with genomic regions in Rust, everything starts here. -## Features +!!! info "What's new" + Recent additions to `gtars-core` that pre-existing doc snippets don't cover: -- Common genomic data structures (Region, RegionSet) -- BED file parsing utilities -- Shared constants and helper functions -- Foundation for all gtars modules + - **`RegionSetList`** — a `GRangesList`-style collection of named `RegionSet`s with `concat`, `iter`, and set-level identifier computation. Consumed by `gtars-genomicdist` and `gtars-lola`. + - **`CoordinateMode`** enum — selects BED (0-based half-open) vs. GRanges (1-based closed) conventions for midpoint calculations. + - **`Fragment`** — fragment-file record type for single-cell ATAC-seq workflows. + - **`Interval`** — generic interval with payload, used internally by overlap indexes. + - **Typed error enum** `RegionSetError` — replaces the previous panicking parse paths. + - **`to_polars`** + `dataframe` feature flag — zero-copy conversion to a Polars DataFrame. + - **`bigbed`** and **`http`** feature flags — optional bigBed writing and URL-backed `RegionSet::try_from`. -## Core Data Types +## Core data types + +The `gtars_core::models` module re-exports all six core types at the top level, so you typically import them directly: + +```rust +use gtars_core::models::{ + Region, RegionSet, RegionSetList, Interval, Fragment, CoordinateMode, +}; +``` + +### `Region` + +A single genomic interval. `start` and `end` are 0-based half-open by BED convention; `rest` carries any trailing tab-separated fields from the original line verbatim. -### Region -Represents a genomic interval with chromosome, start, and end coordinates: ```rust use gtars_core::models::Region; -// Create a region -let region = Region::new("chr1", 1000, 2000); +let region = Region { + chr: "chr1".to_string(), + start: 1000, + end: 2000, + rest: None, +}; -// Access properties -println!("Chr: {}", region.chr); -println!("Start: {}", region.start); -println!("End: {}", region.end); +assert_eq!(region.width(), 1000); +println!("{}", region.as_string()); // chr1\t1000\t2000 ``` -### RegionSet -Collection of genomic regions: +Methods: + +| method | returns | notes | +|---|---|---| +| `width()` | `u32` | `end - start` | +| `as_string()` | `String` | tab-separated BED line | +| `digest()` | `String` | MD5 digest of `"chr,start,end"` | +| `mid_point()` | `u32` | `start + width() / 2` (BED/floor) | +| `mid_point_with_mode(mode)` | `u32` | BED or GRanges convention — see `CoordinateMode` below | + +`Region` implements `Display` (as tab-separated text), `Clone`, `Debug`, `Eq`, `Hash`, and — under the `serde` feature — `Serialize`/`Deserialize`. + +### `RegionSet` + +An ordered collection of `Region`s. `RegionSet::try_from` accepts `&Path`, `&str`, `String`, `PathBuf`, or `Vec`, auto-detects gzip by extension, and with the `http` feature will fetch from URLs. Construction always sorts by `(chr, start)`. + ```rust use gtars_core::models::RegionSet; use std::path::Path; -// Load from BED file +// From a local BED (or BED.gz) file let rs = RegionSet::try_from(Path::new("peaks.bed"))?; -// Access regions -println!("Number of regions: {}", rs.regions.len()); +assert!(!rs.is_empty()); +println!("{} regions, {} bp total", rs.len(), rs.nucleotides_length()); -// Iterate over regions -for region in &rs.regions { +for region in &rs { println!("{}: {}-{}", region.chr, region.start, region.end); } +# Ok::<(), gtars_core::errors::RegionSetError>(()) +``` + +Key methods: + +**Construction** + +- `RegionSet::try_from(path)` — accepts `&Path`, `&str`, `String`, `PathBuf`. Sorts on load. +- `RegionSet::from(regions: Vec)` — in-memory construction. +- `RegionSet::from(bytes: &[u8])` — parse from an in-memory byte slice (no gzip handling). + +**Iteration** + +- `for region in &rs { ... }` — `IntoIterator` is implemented for `&RegionSet`. +- `iter_chroms()` → unique chromosomes in insertion order (post-sort). +- `iter_chr_regions(chr)` → all regions on a specific chromosome. + +**Summaries** + +- `len()`, `is_empty()`, `nucleotides_length()` — count and total bp. +- `region_widths()` → `Vec`. +- `mean_region_width()` → `f64`, rounded to 2 decimals. +- `get_max_end_per_chr()` → `HashMap`. +- `calc_mid_points()` → `HashMap>` (BED convention). +- `calc_mid_points_with_mode(CoordinateMode)` → same, with mode control. + +**Identifiers** + +- `identifier()` — MD5 digest over the first-three-column layout; the canonical BEDbase identifier. +- `file_digest()` — MD5 digest over the full serialized file content. + +**I/O** + +- `to_bed(path)` / `to_bed_gz(path)` — write plain or gzipped BED. +- `to_bigbed(path, chrom_sizes)` — under the `bigbed` feature, write a bigBed file. +- `to_polars()` — under the `dataframe` feature, return a `PolarsResult`. + +**Mutation** + +- `sort()` — in-place sort by `(chr, start)`. + +### `RegionSetList` + +A collection of `RegionSet`s — the gtars equivalent of Bioconductor's `GRangesList`. This is the type that downstream crates (genomicdist, lola) use to pass multiple region sets across FFI boundaries without paying N × clone costs. + +```rust +use gtars_core::models::{RegionSet, RegionSetList}; + +let peaks1 = RegionSet::try_from("peaks_rep1.bed")?; +let peaks2 = RegionSet::try_from("peaks_rep2.bed")?; +let peaks3 = RegionSet::try_from("peaks_rep3.bed")?; + +// Unnamed — names are optional +let rsl = RegionSetList::new(vec![peaks1, peaks2, peaks3]); + +// Or with explicit names +let rsl = RegionSetList::with_names( + rsl.region_sets, + vec!["rep1".into(), "rep2".into(), "rep3".into()], +); + +// Iterate +for rs in &rsl { + println!("{} regions", rs.len()); +} + +// Flatten all regions into a single RegionSet (no merge/dedup) +let combined: RegionSet = rsl.concat(); + +// Stable identifier over the full set (order-independent) +let id = rsl.identifier(); +# Ok::<(), gtars_core::errors::RegionSetError>(()) +``` + +`RegionSetList::try_from` also accepts: + +- A path to a **bedset manifest file** — a text file listing one BED path per line (`read_bedset_file` under the hood). +- A `Vec<&Path>`, `Vec<&str>`, `Vec`, or `Vec` — each is loaded as its own `RegionSet`. + +`concat()` flattens without merging; if you need a reduced union, call `.reduce()` on the result (the `reduce` method lives in `gtars-genomicdist` via the `IntervalRanges` trait). + +Key methods: `new`, `with_names`, `add`, `get(i)`, `iter`, `len`, `is_empty`, `concat`, `identifier`. + +### `CoordinateMode` + +Switches between BED (0-based half-open, floor division) and GRanges (1-based closed, banker's rounding) conventions for midpoint calculation. BED is the default and matches Python/numpy conventions; GRanges exists for exact bit-compatibility with R GenomicDistributions output. + +```rust +use gtars_core::models::{Region, CoordinateMode}; + +let r = Region { chr: "chr1".into(), start: 100, end: 206, rest: None }; + +assert_eq!(r.mid_point_with_mode(CoordinateMode::Bed), 153); // 100 + 53 +assert_eq!(r.mid_point_with_mode(CoordinateMode::GRanges), 152); // banker's rounding +``` + +For widths `w` where `w % 4 == 2`, BED and GRanges midpoints differ by 1 bp — this affects approximately 2.6% of typical peak distance calculations. Use `GRanges` mode only when you need byte-for-byte parity with R output. + +### `Fragment` + +A fragment-file record (chromatin accessibility / scATAC-seq). Implements `FromStr` for parsing lines from 10x-style `fragments.tsv` files and `From for Region` for dropping the barcode/support metadata when you only care about coordinates. + +```rust +use gtars_core::models::{Fragment, Region}; +use std::str::FromStr; + +let f = Fragment::from_str("chr1\t1000\t1200\tAAACCTGAGAAACCAT-1\t3")?; +assert_eq!(f.barcode, "AAACCTGAGAAACCAT-1"); +assert_eq!(f.read_support, 3); + +let as_region: Region = f.into(); +# Ok::<(), anyhow::Error>(()) +``` + +### `Interval` + +A generic `[start, end)` range with a payload `T`, parameterized over any unsigned integer type. This is primarily a building block for overlap indexes (consumed by `gtars-overlaprs` and `gtars-igd`); most user code should prefer `Region`/`RegionSet`. + +```rust +use gtars_core::models::Interval; + +let a: Interval = Interval { start: 10, end: 50, val: 0 }; +let b: Interval = Interval { start: 40, end: 80, val: 1 }; + +assert!(a.overlap(b.start, b.end)); +assert_eq!(a.intersect(&b), 10); // overlap width in bp +``` + +## Error handling + +Parse and I/O errors from `RegionSet::try_from` come back as the typed `RegionSetError` enum (no panics on malformed input): + +```rust +use gtars_core::errors::RegionSetError; +use gtars_core::models::RegionSet; + +match RegionSet::try_from("not_a_real_file.bed") { + Ok(rs) => println!("loaded {} regions", rs.len()), + Err(RegionSetError::FileReadError(msg)) => eprintln!("read failed: {msg}"), + Err(RegionSetError::RegionParseError(msg)) => eprintln!("parse failed: {msg}"), + Err(RegionSetError::EmptyRegionSet(path)) => eprintln!("no regions in {path}"), + Err(e) => eprintln!("other: {e}"), +} ``` -## Available Modules +Variants: `FileReadError`, `InvalidPathOrUrl`, `InvalidBedbaseIdentifier`, `BedbaseFetchError`, `RegionParseError`, `EmptyRegionSet`, `HttpFeatureDisabled`, `BigBedError`, and a transparent `Io` wrapper around `std::io::Error`. -- `models` - Core data structures (Region, RegionSet) -- `utils` - Utility functions for file handling and parsing -- `consts` - Shared constants +## Feature flags + +| flag | effect | +|---|---| +| *(default)* | Pure in-memory BED reading/writing, no optional dependencies. | +| `serde` | Derives `Serialize`/`Deserialize` on `Region`, `RegionSet`. | +| `http` | Enables `RegionSet::try_from(&Path)` to fetch from HTTP(S) URLs via `ureq`. Without this feature, non-file paths return `HttpFeatureDisabled`. | +| `dataframe` | Enables `RegionSet::to_polars()` (pulls in `polars`). Required transitively by `gtars-genomicdist`'s `bedclassifier` feature. | +| `bigbed` | Enables `RegionSet::to_bigbed()` via `bigtools`. | + +Enable them in `Cargo.toml` like so: + +```toml +[dependencies] +gtars-core = { version = "0.5", features = ["serde", "dataframe"] } +``` -## Dependencies +## Available modules -Minimal external dependencies: +- **`models`** — all core data types (`Region`, `RegionSet`, `RegionSetList`, `Interval`, `Fragment`, `CoordinateMode`). Re-exported at the crate root. +- **`errors`** — `RegionSetError` enum. +- **`utils`** — readers, file-type detection, chromosome-sizes parsing, and `Region` ↔ id hash-map helpers: + - `get_dynamic_reader(&Path)` / `get_dynamic_reader_w_stdin(&str)` — transparent gzip/stdin handling. + - `get_dynamic_reader_from_url(&Path)` — under the `http` feature. + - `get_file_info(&Path) -> FileInfo` — detect type (BED, BAM, NARROWPEAK, UNKNOWN) and gzip. + - `parse_bedlike_file(line)` → `(chr, start, end)` tuple from a single line. + - `get_chrom_sizes(path)` → `HashMap`. + - `read_bedset_file(path)` → `Vec` of BED paths from a bedset manifest. + - `generate_region_to_id_map` / `generate_id_to_region_map` and string variants — stable id assignment for tokenizer vocabularies. + - `remove_all_extensions(&Path)` → stem with *all* extensions stripped (handles `.bed.gz`). +- **`consts`** — column-name constants (`CHR_COL_NAME`, `START_COL_NAME`, `END_COL_NAME`, `DELIMITER`) and file-extension constants (`BED_FILE_EXTENSION`, `BAM_FILE_EXTENSION`, `GZ_FILE_EXTENSION`, `IGD_FILE_EXTENSION`, `GTOK_EXT`). -- `anyhow` - Error handling -- `flate2` - Gzip compression support -- Other standard bioinformatics libraries +## Where to go next -This module serves as the foundation for all other gtars modules and maintains backward compatibility within major versions. \ No newline at end of file +- **[Core models tour](regionSet.md)** — a cross-language (Python + Rust) walkthrough of `Region`, `RegionSet`, and friends. +- **[gtars-overlaprs](overlaprs.md)** — high-performance overlap queries that operate on `RegionSet`. +- **[gtars-genomicdist](genomicdist.md)** — the `IntervalRanges` and `GenomicIntervalSetStatistics` traits extend `RegionSet` with R GenomicDistributions–style set algebra and summary stats. +- **[gtars-lola](lola.md)** — LOLA enrichment built on top of the IGD index and `RegionSetList`. diff --git a/docs/gtars/genomicdist.md b/docs/gtars/genomicdist.md new file mode 100644 index 00000000..73aee0f4 --- /dev/null +++ b/docs/gtars/genomicdist.md @@ -0,0 +1,609 @@ +# gtars-genomicdist + +Rust port of the R [GenomicDistributions](https://code.databio.org/GenomicDistributions/) package — plus a handful of extra calculations and binary serialization formats used by BEDbase. The crate provides: + +- **Summary statistics** and per-chromosome distributions (`GenomicIntervalSetStatistics` trait). +- **Interval set algebra** in the GRanges/IRanges idiom (`IntervalRanges` trait, 26+ methods). +- **Genomic partitioning** using a gene model (promoter / UTR / exon / intron / intergenic). +- **Signal matrix overlap and summary** (`calc_summary_signal`, TSV + packed-binary I/O). +- **Consensus region calling** across multiple input sets. +- **Nearest-TSS / nearest-feature distance** calculations (`TssIndex`). +- **GC content** and **dinucleotide frequency** against a reference genome. +- **BED file classification** (optional `bedclassifier` feature, polars-backed). +- **Density and clustering** statistics used for ML feature vectors and peak-clustering summaries. +- The **GDA** binary gene-model format and the **.fab** binary FASTA format for fast mmap-backed sequence lookup. + +All operations take a `&RegionSet` (from `gtars-core`) as input and either extend it via trait implementations or run as free functions. If you're new to the crate, start with the `IntervalRanges` and `GenomicIntervalSetStatistics` traits — those cover the GRanges/GenomicDistributions surface area most users need. + +!!! tip "Coming from R GenomicDistributions?" + Most function names follow the pattern `calc_*` where R used `calc*`, and behavior is matched where practical. A few deliberate divergences are documented inline — the most notable is that midpoint calculations default to BED (floor) conventions rather than GRanges banker's rounding, and spacing/nearest-neighbor calculations *exclude* single-region chromosomes rather than emitting sentinel values. + +## Installation + +```toml +[dependencies] +gtars-genomicdist = "0.7" +``` + +### Feature flags + +| flag | effect | +|---|---| +| *(default)* | All traits, statistics, partitions, signal, consensus, and TSS calculations. | +| `bedclassifier` | Enables `classify_bed()` and pulls in `polars`. Transitively enables `gtars-core/dataframe`. | + +Enable the classifier feature: + +```toml +[dependencies] +gtars-genomicdist = { version = "0.7", features = ["bedclassifier"] } +``` + +## Core types + +All of these are re-exported at the crate root, so you typically import them directly: + +```rust +use gtars_genomicdist::{ + // Statistics + GenomicIntervalSetStatistics, + ClusterStats, DensityHomogeneity, DensityVector, SpacingStats, + // Interval algebra + IntervalRanges, RegionSetListOps, pairwise_jaccard, + // Strand-aware wrappers + SortedRegionSet, Strand, StrandedRegionSet, + // Partitions + GeneModel, PartitionList, PartitionResult, + ExpectedPartitionResult, ExpectedPartitionRow, + calc_expected_partitions, calc_partitions, genome_partition_list, + // Signal + SignalMatrix, SignalSummaryResult, ConditionStats, + calc_summary_signal, + // Consensus + ConsensusRegion, consensus, + // Sequence stats + calc_gc_content, calc_dinucl_freq, DINUCL_ORDER, + // Utilities + chrom_karyotype_key, median_abs_distance, + // Gene model serialization + GenomicDistAnnotation, + // BED classifier (feature-gated) + #[cfg(feature = "bedclassifier")] classify_bed, + // Re-export from gtars-core + CoordinateMode, +}; +``` + +The submodules (`statistics`, `interval_ranges`, `partitions`, `signal`, `consensus`, `models`, `asset`, `bed_classifier`, `utils`) are all `pub` as well, so both `gtars_genomicdist::calc_partitions` and `gtars_genomicdist::partitions::calc_partitions` resolve to the same symbol. + +## Statistics + +The `GenomicIntervalSetStatistics` trait extends `RegionSet` with two kinds of summaries: + +1. **Per-region and per-pair quantities** — direct ports of the R GenomicDistributions functions (`calcWidth`, `calcNeighborDist`, `calcNearestNeighbors`, `calcChromBins`). These return vectors or per-chromosome tables, one number per peak or per gap. Use them when you want to *see* each value — plot a histogram, feed them to a downstream test, render a per-chromosome heatmap. + +2. **Whole-set scalar summaries** — new in gtars. These collapse an entire peak set into a handful of numbers that answer questions like "how regularly are my peaks spaced?", "how clumpy is this peak set?", or "how evenly is it spread across the genome?". Each one wraps a per-region calculation and reduces it to a scalar / small struct. Use them when you want to *compare* peak sets against each other or a baseline, build ML feature vectors, or run peak-set QC. + +Bring the trait into scope and both kinds of methods appear on any `RegionSet`. + +### Per-region and per-pair quantities + +Direct ports of R GenomicDistributions. Each returns a vector or per-chromosome table. + +```rust +use gtars_core::models::RegionSet; +use gtars_genomicdist::GenomicIntervalSetStatistics; + +let peaks = RegionSet::try_from("peaks.bed")?; + +// Per-chromosome summary: count, min/max/mean/median width, chr bounds +for (chr, stats) in peaks.chromosome_statistics() { + println!( + "{chr}: {} regions, median width {}", + stats.number_of_regions, + stats.median_region_length + ); +} + +// Region-width distribution (one value per region) +let widths: Vec = peaks.calc_widths(); + +// Signed inter-region gaps on each chromosome (only positive gaps are returned) +let gaps: Vec = peaks.calc_neighbor_distances()?; + +// Nearest-neighbor distance per region (multi-region chromosomes only) +let nn: Vec = peaks.calc_nearest_neighbors()?; +# Ok::<(), Box>(()) +``` + +| method | question it answers | returns | +|---|---|---| +| `chromosome_statistics()` | "What do the per-chromosome counts and width distributions look like?" | `HashMap` | +| `region_distribution_with_bins(n_bins)` | "How many peaks fall in each of `n_bins` bins sized by the longest chromosome?" | `HashMap` | +| `region_distribution_with_chrom_sizes(n_bins, chrom_sizes)` | Same, but bins are sized per-chromosome by actual chromosome length — matches R `getGenomeBins` | `HashMap` | +| `calc_widths()` | "How wide is each peak?" — port of R `calcWidth()` | `Vec` | +| `calc_neighbor_distances()` | "What's the bp gap between every pair of consecutive peaks on each chromosome?" | `Result>` | +| `calc_nearest_neighbors()` | "How far is each peak from its closest neighbor?" — port of R `calcNearestNeighbors()` | `Result>` | + +### Whole-set scalar summaries + +These collapse the peak set to a handful of numbers, answering questions about **spatial arrangement** across the genome. They're complementary to the per-region quantities above — each one internally calls one of the per-region methods and then summarizes the result. + +```rust +use std::collections::HashMap; +use gtars_core::models::RegionSet; +use gtars_genomicdist::GenomicIntervalSetStatistics; + +let peaks = RegionSet::try_from("peaks.bed")?; +let chrom_sizes: HashMap = + [("chr1".into(), 248_956_422), ("chr2".into(), 242_193_529)].into(); + +// 1. How regularly are peaks spaced? +// Returns SpacingStats: mean/median/std/IQR plus log-space stats that +// handle heavy-tailed gap distributions. Small log_std => regular arrays +// (CTCF-style); large log_std => bursty pileups. +let spacing = peaks.calc_inter_peak_spacing(); +println!("n_gaps={}, median={:.0} bp, log_std={:.2}", + spacing.n_gaps, spacing.median, spacing.log_std); + +// 2. How much does the peak set cluster at a 5 kb stitching radius? +// Single-linkage clustering — two peaks link if the gap between them is +// ≤ radius_bp. The default `min_cluster_size = 2` answers "fraction of +// peaks that have at least one neighbor within 5 kb", matching typical +// super-enhancer-stitching / enhancer-clustering analyses. +let clusters = peaks.calc_peak_clusters(5_000, 2); +println!( + "{} clusters, {:.1}% of peaks clustered, biggest={}", + clusters.n_clusters, + clusters.fraction_clustered * 100.0, + clusters.max_cluster_size, +); + +// 3. Dense per-window count vector across the whole genome — an ML-ready +// feature vector. Stable karyotypic ordering means vectors from different +// BED files are aligned column-for-column, so you can stack them into a +// matrix and feed to downstream models. +let density = peaks.calc_density_vector(&chrom_sizes, 250); +println!("density vector: {} windows, bin_width={} bp", + density.counts.len(), density.bin_width); + +// 4. How evenly is the peak set spread across the genome, as a scalar? +// Mean count per window + variance + coefficient of variation (Poisson +// ≈ 1, clustered >> 1, even << 1) + Gini + nonzero-window count. The +// canonical "how uniform is this peak set" measure. +let homog = peaks.calc_density_homogeneity(&chrom_sizes, 250); +println!( + "mean={:.2}, CV={:.2}, gini={:.3}, {}/{} windows nonzero", + homog.mean_count, homog.cv, homog.gini, + homog.n_nonzero_windows, homog.n_windows, +); +# Ok::<(), Box>(()) +``` + +| method | question it answers | wraps | returns | +|---|---|---|---| +| `calc_inter_peak_spacing()` | "How regularly are my peaks spaced?" — mean/median/std/IQR and log-space stats of inter-peak gaps | `calc_neighbor_distances` | `SpacingStats` | +| `calc_peak_clusters(radius_bp, min_cluster_size)` | "How clumpy is my peak set at a given stitching radius?" — cluster counts, sizes, and the fraction of peaks in multi-peak clusters | `cluster(radius_bp)` from `IntervalRanges` | `ClusterStats` | +| `calc_density_vector(chrom_sizes, n_bins)` | "Give me a stable, ML-ready per-window count vector across the whole genome" — dense, zero-padded, karyotypically ordered | per-region midpoint binning | `DensityVector` | +| `calc_density_homogeneity(chrom_sizes, n_bins)` | "How evenly are my peaks spread across the genome, as a scalar?" — mean/variance/CV/Gini of the density vector | `calc_density_vector` | `DensityHomogeneity` | + +### Output structs + +```rust +pub struct ChromosomeStatistics { + pub chromosome: String, + pub number_of_regions: u32, + pub start_nucleotide_position: u32, // leftmost start + pub end_nucleotide_position: u32, // rightmost end + pub minimum_region_length: u32, + pub maximum_region_length: u32, + pub mean_region_length: f64, + pub median_region_length: f64, +} + +pub struct RegionBin { pub chr: String, pub start: u32, pub end: u32, pub n: u32, pub rid: u32 } + +pub struct SpacingStats { + pub n_gaps: usize, + pub mean: f64, pub median: f64, pub std: f64, pub iqr: f64, + pub log_mean: f64, pub log_std: f64, +} + +pub struct ClusterStats { + pub radius_bp: u32, + pub n_clusters: usize, + pub n_clustered_peaks: usize, + pub mean_cluster_size: f64, + pub max_cluster_size: usize, + pub fraction_clustered: f64, +} + +pub struct DensityVector { + pub n_bins: u32, // target bins for the longest chromosome + pub bin_width: u32, // derived as max_chrom_len / n_bins, floored, min 1 + pub counts: Vec, // dense, row-major across chromosomes in karyotype order + pub chrom_offsets: Vec<(String, usize)>, +} + +pub struct DensityHomogeneity { + pub bin_width: u32, + pub n_windows: usize, + pub n_nonzero_windows: usize, + pub mean_count: f64, + pub variance: f64, + pub cv: f64, + pub gini: f64, +} +``` + +!!! warning "Output length for spacing/nearest calculations" + `calc_neighbor_distances` and `calc_nearest_neighbors` **skip single-region chromosomes** (matching R GenomicDistributions behavior). The output length is therefore **not 1:1 with the input region count** — it's the number of multi-region chromosomes' regions. No sentinel values are emitted. If you need 1:1 alignment, filter your input to multi-region chromosomes first. + +!!! warning "`n_bins` is a target, not a total" + In `calc_density_vector`, `calc_density_homogeneity`, and `region_distribution_with_chrom_sizes`, `n_bins` is the target bin count for the **longest** chromosome in `chrom_sizes`. Bin width is derived from that, and every chromosome is tiled at the same bp width — so shorter chromosomes get fewer bins and the total bin count is `sum(ceil(chrom_size / bin_width))`, which can exceed `n_bins` substantially. To target a specific bin width in bp, set `n_bins = max_chrom_len / desired_bp`. + +## Interval set algebra + +`IntervalRanges` is a second trait on `RegionSet` that provides GRanges/IRanges-style set algebra. All operations return new `RegionSet`s (immutable pattern) and are strand-unaware by default — use `StrandedRegionSet` if you need strand-aware promoters, reduce, or setdiff. + +```rust +use gtars_core::models::RegionSet; +use gtars_genomicdist::IntervalRanges; + +let a = RegionSet::try_from("peaks_a.bed")?; +let b = RegionSet::try_from("peaks_b.bed")?; + +let merged = a.reduce(); // merge overlapping intervals +let common = a.intersect(&b); // range-level intersection +let a_only = a.setdiff(&b); // remove b from a +let combined = a.union(&b); // union + reduce +let jac = a.jaccard(&b); // bp-Jaccard similarity +let closest = a.closest(&b); // Vec<(a_idx, b_idx, dist)> +let clustered = a.cluster(1000); // per-region cluster id +# Ok::<(), Box>(()) +``` + +### Method reference + +| method | description | +|---|---| +| `trim(chrom_sizes)` | clamp regions to `[0, chrom_size)`; drop regions on unknown chromosomes | +| `promoters(upstream, downstream)` | `[start - upstream, start + downstream)` per region | +| `reduce()` | merge overlapping/adjacent intervals per chromosome | +| `setdiff(other)` / `subtract(other)` | remove `other` from self | +| `pintersect(other)` | *pairwise* (by index) intersection | +| `intersect(other)` | range-level intersection | +| `intersect_all(other)` | all-vs-all pairwise intersection fragments (AIList-backed) | +| `concat(other)` | concatenate without merging | +| `union(other)` | `concat(other).reduce()` | +| `jaccard(other)` | bp-level Jaccard `|A ∩ B| / |A ∪ B|` | +| `coverage(other)` | fraction of `self` bp covered by `other` | +| `overlap_coefficient(other)` | `|A ∩ B| / min(|A|, |B|)` | +| `shift(offset)` | translate by signed bp offset (saturating at 0) | +| `flank(width, use_start, both)` | upstream/downstream/both-side flanks | +| `resize(width, fix)` | fixed width anchored at `"start"`, `"end"`, or `"center"` | +| `narrow(start, end, width)` | GRanges-style narrow with 1-based relative coords | +| `disjoin()` | tile into non-overlapping pieces at every boundary | +| `gaps(chrom_sizes)` | peak-free regions, including leading/trailing/whole-chrom gaps | +| `closest(other)` | `Vec<(self_idx, other_idx, signed_dist)>` | +| `cluster(max_gap)` | `Vec` cluster ids in original order | + +!!! note "`rest` fields are dropped" + Operations that merge or synthesize new intervals (reduce, setdiff, promoters, etc.) produce regions with `rest: None`. There is no unambiguous way to carry the original metadata through a merge, so the contract is: use `IntervalRanges` methods for coordinate-only work. + +### `pairwise_jaccard` + +A standalone helper for computing the full N×N Jaccard matrix over a slice of region sets, optimized to pre-reduce each set once and walk pairs linearly: + +```rust +use gtars_genomicdist::pairwise_jaccard; + +let sets = vec![rs1, rs2, rs3]; +let jac = pairwise_jaccard(&sets); // Vec of length 9 (row-major) + +for i in 0..3 { + for j in 0..3 { + print!("{:.3} ", jac[i * 3 + j]); + } + println!(); +} +``` + +### RegionSetListOps + +`RegionSetListOps` implements the same set-algebra operations on a `RegionSetList` by index — useful for bindings (wasm/Python/R) that want to operate on pairs without copying whole `RegionSet`s across an FFI boundary. + +```rust +use gtars_core::models::RegionSetList; +use gtars_genomicdist::RegionSetListOps; + +let rsl = RegionSetList::try_from(vec!["a.bed", "b.bed", "c.bed"])?; + +let jac_01 = rsl.jaccard_at(0, 1); // Option +let union_02 = rsl.union_at(0, 2); // Option +let drop_one = rsl.union_except(1); // union of all but index 1 +let (full_union, loo) = rsl.bulk_union_except() // all leave-one-out unions in O(n) + .ok_or("empty list")?; +# Ok::<(), Box>(()) +``` + +Methods: `pintersect_at`, `pintersect_count`, `jaccard_at`, `union_at`, `setdiff_at`, `region_count`, `union_except`, `bulk_union_except`, `union_all`, `intersect_all`. + +## Partitions + +Ports of `genomePartitionList()`, `calcPartitions()`, and `calcExpectedPartitions()` from R GenomicDistributions. You load a gene model from BED files or GTF, build an ordered `PartitionList` of promoter / UTR / exon / intron categories, and classify your query regions into mutually exclusive buckets (`intergenic` is added as the implicit remainder). + +```rust +use gtars_core::models::RegionSet; +use gtars_genomicdist::{ + GeneModel, calc_partitions, calc_expected_partitions, genome_partition_list, +}; +use std::collections::HashMap; + +// 1. Load the gene model (choose one) +let model = GeneModel::from_bed_files( + "genes.bed", + "exons.bed", + Some("three_utr.bed"), // optional + Some("five_utr.bed"), // optional +)?; + +// Or, from a GTF: +// let model = GeneModel::from_gtf("gencode.v47.gtf.gz", true, true)?; + +// 2. Build partition list (core/prox promoter sizes in bp; chrom_sizes optional) +let partitions = genome_partition_list(&model, 100, 2000, None); + +// 3. Classify query regions +let query = RegionSet::try_from("peaks.bed")?; + +let result = calc_partitions(&query, &partitions, /* bp_proportion = */ false); +for (name, count) in &result.counts { + let pct = *count as f64 / result.total as f64 * 100.0; + println!("{:<15} {:>6} ({:.1}%)", name, count, pct); +} + +// 4. Observed vs expected, with chi-square p-value +let chrom_sizes: HashMap = [("chr1".into(), 248_956_422)].into_iter().collect(); +let expected = calc_expected_partitions(&query, &partitions, &chrom_sizes, false); +for row in &expected.rows { + println!( + "{:<15} obs={:>6} exp={:>10.1} log10(O/E)={:+.2} p={:.2e}", + row.partition, row.observed, row.expected, row.log10_oe, row.chi_sq_pval + ); +} +# Ok::<(), Box>(()) +``` + +### Priority order + +`genome_partition_list` produces partitions in this order, which is also the priority for classification: + +1. **`promoterCore`** — `core_prom_size` bp upstream of each gene start (strand-aware). +2. **`promoterProx`** — `prox_prom_size` bp upstream, minus core. +3. **`threeUTR`** — if UTR files are provided. +4. **`fiveUTR`** — minus 3'UTR (R gives 3'UTR priority). +5. **`exon`** — minus UTRs. +6. **`intron`** — gene bodies minus UTRs and exons. +7. **`intergenic`** — implicit remainder, emitted by `calc_partitions`. + +`calc_partitions` has two modes: + +- **Priority mode (`bp_proportion = false`)** — each query region is assigned to the first partition it overlaps; remainders are counted as `intergenic`. Mutually exclusive. +- **BP proportion mode (`bp_proportion = true`)** — for each partition, computes total overlapping base pairs of query. Not mutually exclusive: a query region overlapping multiple partitions contributes bp to each (matching R behavior). + +!!! note "Chi-square p-values differ from R" + `calc_expected_partitions` uses a goodness-of-fit `(O-E)²/E` formula. R's `chisq.test()` computes a 2×2 test of independence with optional Yates correction, so p-values will not match R GenomicDistributions output byte-for-byte. The `log10_oe` column is directly comparable. + +### `GeneModel::from_gtf` + +GTF loading handles GENCODE-style files that use an undifferentiated `UTR` feature type by parsing `CDS` records to infer which UTRs are 5' vs 3'. Two flags: + +- `filter_protein_coding` — keep only features with `gene_biotype "protein_coding"` or `gene_type "protein_coding"` in the attributes column. +- `convert_ensembl_ucsc` — prepend `chr` to chromosome names that don't already have it (so Ensembl `1` becomes UCSC `chr1`). + +## Signal matrix overlap + +`SignalMatrix` holds a matrix of signal values across genomic regions × conditions (e.g. a peak × cell-type matrix of ChIP-seq intensities). `calc_summary_signal` overlaps a query region set against the matrix, aggregates by MAX per query region, and returns Tukey boxplot statistics per condition. + +```rust +use gtars_core::models::{RegionSet, CoordinateMode}; +use gtars_genomicdist::{SignalMatrix, calc_summary_signal}; + +// Load from a TSV where col 0 is "chr_start_end", remaining cols are condition values +let sm = SignalMatrix::from_tsv("signal_matrix.tsv")?; + +// Or load from the packed binary format (produced by sm.save_bin()) +let sm = SignalMatrix::load_bin("signal_matrix.sigm")?; +// ...or from an in-memory byte slice (for wasm): +// let sm = SignalMatrix::load_bin_from_bytes(&bytes)?; + +let query = RegionSet::try_from("peaks.bed")?; +let summary = calc_summary_signal(&query, &sm, CoordinateMode::Bed)?; + +for stats in &summary.matrix_stats { + println!( + "{}: median={:.2} [{:.2}–{:.2}]", + stats.condition, stats.median, stats.lower_hinge, stats.upper_hinge + ); +} +# Ok::<(), Box>(()) +``` + +`SignalSummaryResult` contains per-query-region max signal vectors (`signal_matrix`), per-condition Tukey stats (`matrix_stats`), and `condition_names`. The packed binary format (`.sigm`, magic `SIGM`) is produced by `sm.save_bin()` — substantially faster to load than TSV for large matrices. + +## Consensus regions + +Given N region sets, `consensus` produces the reduced union annotated with how many input sets overlap each union region. Useful for filtering by replicate support threshold: + +```rust +use gtars_core::models::RegionSet; +use gtars_genomicdist::{consensus, ConsensusRegion}; + +let reps = [ + RegionSet::try_from("rep1.bed")?, + RegionSet::try_from("rep2.bed")?, + RegionSet::try_from("rep3.bed")?, +]; + +let cons: Vec = consensus(&reps); + +// Keep regions present in ≥ 2/3 replicates +let robust: Vec<_> = cons.into_iter().filter(|c| c.count >= 2).collect(); +println!("{} robust regions", robust.len()); +# Ok::<(), Box>(()) +``` + +`ConsensusRegion` has `chr`, `start`, `end`, and `count` — the number of input sets overlapping that union region. + +## TSS / nearest-feature distances + +`TssIndex` builds a per-chromosome sorted vector of TSS midpoints from a `RegionSet`, enabling O(R · log M) distance queries instead of the naive O(R · M): + +```rust +use gtars_core::models::{RegionSet, CoordinateMode}; +use gtars_genomicdist::models::TssIndex; + +let tss = TssIndex::try_from("gencode_tss.bed")?; + +let peaks = RegionSet::try_from("peaks.bed")?; + +// Unsigned nearest-feature distance per query region +let dists: Vec = tss.calc_tss_distances(&peaks, CoordinateMode::Bed)?; + +// Signed distances (positive = feature downstream, negative = upstream) +let signed: Vec = tss.calc_feature_distances(&peaks, CoordinateMode::Bed)?; +# Ok::<(), Box>(()) +``` + +Regions on chromosomes missing from the TSS index are padded with `u32::MAX` / `i64::MAX` sentinels (one per query region). Use `median_abs_distance(&signed)` from the `utils` module to summarize while filtering those sentinels. + +## GC content and dinucleotide frequencies + +Both functions take anything implementing `SequenceAccess` — currently `GenomeAssembly` (in-memory HashMap, built from a FASTA) or `BinaryGenomeAssembly` (mmap-backed `.fab` binary). + +```rust +use gtars_core::models::RegionSet; +use gtars_genomicdist::{calc_gc_content, calc_dinucl_freq, DINUCL_ORDER}; +use gtars_genomicdist::models::{GenomeAssembly, BinaryGenomeAssembly}; + +// In-memory loader — ~2s to build for hg38 but then gives zero-copy slices +let genome = GenomeAssembly::try_from("hg38.fa")?; + +// Or the mmap .fab loader — instant construction, zero-copy per region +let genome = BinaryGenomeAssembly::from_file("hg38.fab".as_ref())?; + +let peaks = RegionSet::try_from("peaks.bed")?; + +// GC content per region, 0.0–1.0 +let gc: Vec = calc_gc_content(&peaks, &genome, /* ignore_unk_chroms = */ true)?; + +// Dinucleotide frequencies: 16 columns per region in DINUCL_ORDER +// raw_counts=false → percentages (matches R default) +let (labels, matrix) = calc_dinucl_freq(&peaks, &genome, false, true)?; +# Ok::<(), Box>(()) +``` + +Create a `.fab` file from a regular FASTA once up-front: + +```rust +use gtars_genomicdist::models::BinaryGenomeAssembly; + +BinaryGenomeAssembly::write_from_fasta( + "hg38.fa".as_ref(), + "hg38.fab".as_ref(), +)?; +# Ok::<(), Box>(()) +``` + +Or via the CLI: `gtars prep --fasta hg38.fa` (see the [gtars-cli](cli.md) page). + +## BED classification + +Under the `bedclassifier` feature, `classify_bed` inspects the column layout of a BED file and assigns one of several UCSC / ENCODE format classifications (BED3, BED6+, narrowPeak, broadPeak, gappedPeak, RNA elements, etc.). + +```rust +#[cfg(feature = "bedclassifier")] +{ + use gtars_core::models::RegionSet; + use gtars_genomicdist::bed_classifier::classify_bed; + + let rs = RegionSet::try_from("unknown_format.bed").unwrap(); + let c = classify_bed(&rs).unwrap(); + println!( + "format={} bed_compliance={} compliant_cols={}", + c.data_format, c.bed_compliance, c.compliant_columns + ); +} +``` + +## GDA binary format + +`GenomicDistAnnotation` is a serializable wrapper around `GeneModel` in the GDA (Genomic Dist Annotation) binary format. Chrom sizes are **not** stored in the GDA file — they come from a separate source (refgenie, `.chrom.sizes` file, or an API). + +```rust +use gtars_genomicdist::GenomicDistAnnotation; + +// Build from GTF once, save as GDA +let gda = GenomicDistAnnotation::from_gtf("gencode.v47.gtf.gz")?; +gda.save_bin("gencode.v47.gda")?; + +// Load from disk +let gda = GenomicDistAnnotation::load_bin("gencode.v47.gda")?; + +// Or from memory (wasm / API) +// let gda = GenomicDistAnnotation::load_bin_from_bytes(&bytes)?; + +let partitions = gtars_genomicdist::genome_partition_list(&gda.gene_model, 100, 2000, None); +# Ok::<(), Box>(()) +``` + +## Utilities + +From `gtars_genomicdist::utils`: + +- **`partition_genome_into_bins(chrom_sizes, n_bins)`** — tile all chromosomes with fixed-width bins (bin width = longest chrom / n_bins, floored, min 1 bp). Returns a `RegionSet`. +- **`median_abs_distance(&[i64])`** — median absolute value, filtering `i64::MAX` sentinels produced by `TssIndex::calc_feature_distances`. +- **`chrom_karyotype_key(chr)`** — sort key producing the standard karyotype order: `chr1, chr2, …, chr22, chrX, chrY, chrM, chrUn_*`. Works with or without the `chr` prefix. + +## Strand-aware wrappers + +Two wrappers extend `RegionSet` with stronger invariants: + +- **`SortedRegionSet`** — a newtype guaranteeing `(chr, start)` sort order. Constructed via `SortedRegionSet::new(rs)`, which sorts in place (move, no clone). Downstream code that requires sorted input can accept `&SortedRegionSet` to avoid re-sorting on every call. +- **`StrandedRegionSet`** — pairs a `RegionSet` with a parallel `Vec`. Strand-aware `promoters_stranded`, `reduce`, `setdiff`, and `trim` are methods on this type and are used internally by `genome_partition_list` to produce correct partitions for minus-strand genes. + +```rust +use gtars_core::models::RegionSet; +use gtars_genomicdist::{SortedRegionSet, StrandedRegionSet, Strand}; + +let rs = RegionSet::try_from("peaks.bed")?; +let sorted = SortedRegionSet::new(rs); // in-place sort + +let stranded = StrandedRegionSet::new( + sorted.0.clone(), + vec![Strand::Plus; sorted.0.regions.len()], +); +# Ok::<(), Box>(()) +``` + +`Strand::from_char('+' | '-' | _)` converts a single character to `Plus` / `Minus` / `Unstranded`. + +## Errors + +All operations that can fail return `Result`. Variants: + +- `CustomError(String)` — general I/O and parsing wrapper. +- `GCContentError(chr, start, end, msg)` — sequence lookup failed during GC calculation. +- `TSSContentError(msg)` — no TSS features found for the region set. +- `SignalMatrixError(msg)` — signal-matrix parse failure. +- `Io(std::io::Error)` — transparent I/O error. + +The `bedclassifier` feature has its own `BedClassifierError` enum for format-classification failures. + +## Where to go next + +- **[gtars-core](core.md)** — `RegionSet`, `RegionSetList`, and `CoordinateMode`, which this crate consumes. +- **[gtars-lola](lola.md)** — LOLA enrichment is built on `IntervalRanges` (for universe construction) and the IGD index. +- **[gtars-overlaprs](overlaprs.md)** — the overlap-detection engine used internally by `calc_partitions` and `consensus`. +- **[gtars CLI](cli.md)** — `gtars genomicdist` subcommands for running these analyses from the command line. diff --git a/docs/gtars/lola.md b/docs/gtars/lola.md new file mode 100644 index 00000000..88e7809c --- /dev/null +++ b/docs/gtars/lola.md @@ -0,0 +1,358 @@ +# gtars-lola + +Rust port of the R [LOLA](http://lola.databio.org/) (Locus Overlap Analysis) package: region-set enrichment testing against a database of reference region sets, using Fisher's exact test on 2×2 contingency tables. + +Given a **user set** of regions (your peaks), a **universe** (the background you consider sampleable — typically a union of publicly available peak sets), and a **region database** (LOLA core or custom), `gtars-lola` computes for every database set the probability that your user set's overlap with that database set is larger (or smaller) than expected under the null. + +The crate is built on top of: + +- **[gtars-igd](igd.md)** — the interval overlap index. A LOLA database is exposed as a single `Igd` index over all its files, so one call can score a user set against thousands of reference sets simultaneously. +- **[gtars-genomicdist](genomicdist.md)** — the `IntervalRanges` trait (for `concat`, `disjoin`, `union`, etc.) is used to build restricted universes and redefine user sets in terms of universe regions. +- **[gtars-core](core.md)** — all region types come from here; `RegionSetList` is used as the FFI-friendly return type for extracting region sets from a database. + +## Installation + +```toml +[dependencies] +gtars-lola = "0.2" +``` + +No feature flags — the crate has a single default configuration. + +## Module layout + +`gtars-lola` does not re-export symbols at the crate root; import from the submodules directly: + +- **`gtars_lola::database`** — `RegionDB`, `CollectionAnno`, `RegionSetAnno` +- **`gtars_lola::enrichment`** — `run_lola`, `ContingencyTable` impls +- **`gtars_lola::universe`** — `check_universe_appropriateness`, `redefine_user_sets`, `build_restricted_universe`, `UniverseReport`, `UserSetReport` +- **`gtars_lola::output`** — `annotate_results`, `apply_fdr_correction`, `results_to_columns`, `write_results_tsv`, `LolaColumnar` +- **`gtars_lola::models`** — `Direction`, `LolaConfig`, `ContingencyTable`, `LolaResult` +- **`gtars_lola::errors`** — `LolaError` + +## End-to-end workflow + +```rust +use std::fs::File; +use std::io::BufWriter; +use std::path::Path; + +use gtars_core::models::RegionSet; +use gtars_lola::database::RegionDB; +use gtars_lola::enrichment::run_lola; +use gtars_lola::models::LolaConfig; +use gtars_lola::output::{annotate_results, apply_fdr_correction, write_results_tsv}; + +// 1. Load the region database from a LOLA-format folder +let db = RegionDB::from_lola_folder( + Path::new("LOLACore/hg38"), + None, // collections filter (None = all) + None, // per-collection file limit (None = all) +)?; + +// 2. Load the user set(s) and universe +let user_sets = vec![ + RegionSet::try_from("peaks_condition_a.bed")?, + RegionSet::try_from("peaks_condition_b.bed")?, +]; +let universe = RegionSet::try_from("universe.bed")?; + +// 3. Run Fisher's exact test for every (user_set, db_set) pair +let mut results = run_lola( + &db.igd, + &user_sets, + &universe, + &LolaConfig::default(), +)?; + +// 4. Attach database metadata (collection, cell type, tissue, …) +annotate_results(&mut results, &db); + +// 5. Benjamini-Hochberg FDR correction, per user set +apply_fdr_correction(&mut results); + +// 6. Write results to TSV (format matches R LOLA's writeCombinedEnrichment) +let mut out = BufWriter::new(File::create("lola_results.tsv")?); +write_results_tsv(&mut out, &results)?; +# Ok::<(), Box>(()) +``` + +The steps correspond 1:1 to the R LOLA workflow: `loadRegionDB` → `runLOLA` → `getLolaResults` → `writeCombinedEnrichment`. + +## Loading a region database + +`RegionDB::from_lola_folder` expects a directory laid out in the standard R LOLA format: + +```text +db_path/ +├── collection1/ +│ ├── collection.txt # collector, date, source, description (TSV) +│ ├── index.txt # per-file annotations (TSV or CSV) +│ └── regions/ +│ ├── file1.bed +│ └── file2.bed +├── collection2/ +│ ├── collection.txt +│ ├── index.txt +│ └── regions/ +│ └── … +``` + +`collection.txt` and `index.txt` are both auto-detected TSV or CSV (R's `fread` convention). Recognized columns in `index.txt`: `filename`, `cellType`, `description`, `tissue`, `dataSource`, `antibody`, `treatment`. Unknown columns are ignored; missing optional columns are preserved as `None`. + +### Filters and limits + +```rust +// Load only specific collections +let db = RegionDB::from_lola_folder( + path, + Some(&["encode_segmentation", "roadmap_epigenomics"]), + None, +)?; + +// Limit files per collection (useful for smoke tests) +let db = RegionDB::from_lola_folder(path, None, Some(10))?; +# Ok::<(), Box>(()) +``` + +### In-memory construction (for API integrations) + +If you're serving region sets from an API rather than a folder, build the `RegionDB` directly from a pre-constructed `Igd`: + +```rust +use gtars_lola::database::{RegionDB, RegionSetAnno}; + +let db = RegionDB::from_igd_with_regions(igd, region_sets, region_annos); +``` + +This is the path BEDbase uses to expose LOLA over the web — the IGD is built server-side from the bedset collection and shipped to the client. + +### Convenience accessors + +- **`db.num_region_sets()`** — total region sets in the database. +- **`db.list_region_sets(collections)`** → `Vec` — filenames, optionally filtered by collection. +- **`db.get_region_set(filenames, collections)`** → `Vec<&RegionSet>` — look up by filename. +- **`db.get_region_set_list(&[usize])`** → `RegionSetList` — extract indexed region sets as a named `RegionSetList` (for FFI, plotting, or downstream analysis). Out-of-bounds indices are silently skipped. +- **`RegionDB::merge(a, b)`** — combine two databases into one; rebuilds the IGD. + +## Checking and shaping the universe + +LOLA results are only meaningful if your universe **contains** your user set: every user region should overlap at least one universe region, with no many-to-many mappings. The `universe` module provides three diagnostics/fixers: + +### `check_universe_appropriateness` + +```rust +use gtars_lola::universe::check_universe_appropriateness; +use gtars_igd::igd::Igd; + +// Pre-build the universe IGD once — you can reuse it across calls +let universe_igd = Igd::from_single_region_set(&universe); + +let report = check_universe_appropriateness(&user_sets, &universe_igd); + +for r in &report.user_set_reports { + println!( + "user set {}: {:.1}% coverage ({} of {} regions), {} many-to-many", + r.user_set_index, + r.coverage * 100.0, + r.regions_in_universe, + r.total_regions, + r.many_to_many_count, + ); + for w in &r.warnings { + eprintln!(" ⚠ {w}"); + } +} +``` + +Warnings fire when coverage is under 50% (severe) or under 90% (moderate), or when any user region overlaps more than one universe region (many-to-many). + +### `redefine_user_sets` + +Rewrite each user set in terms of universe regions: for every user region, find the universe regions it overlaps and emit those as the new user set. This eliminates many-to-many mapping artifacts and is the Rust equivalent of R LOLA's `redefineUserSets()`. + +```rust +use gtars_lola::universe::redefine_user_sets; + +let universe_igd = Igd::from_single_region_set(&universe); +let redefined: Vec = + redefine_user_sets(&user_sets, &universe, &universe_igd); + +// Now run enrichment on `redefined` instead of the raw user_sets +let results = run_lola(&db.igd, &redefined, &universe, &LolaConfig::default())?; +# Ok::<(), Box>(()) +``` + +### `build_restricted_universe` + +For differential enrichment analysis — build a universe that is exactly the union of all user sets, disjoined into non-overlapping pieces. This is R LOLA's `disjoin(unlist(userSets))`. + +```rust +use gtars_lola::universe::build_restricted_universe; + +let restricted = build_restricted_universe(&user_sets); +// Use `restricted` as the universe in a differential run +``` + +## Running enrichment + +`run_lola(igd, user_sets, universe, config)` is the core engine. For every `(user_set, db_set)` pair it builds a 2×2 contingency table and runs a one-sided Fisher exact test: + +```text + In DB set Not in DB set +In user set a c +Not in user set b d +``` + +- **a** = overlap count between user set and DB set (the *support* — shows up as `support` in results). +- **b** = universe-DB overlap − user-DB overlap. +- **c** = user set size − user-DB overlap. +- **d** = universe size − a − b − c. + +### `LolaConfig` + +```rust +pub struct LolaConfig { + pub min_overlap: i32, // default 1 + pub direction: Direction, // default Enrichment +} + +pub enum Direction { + Enrichment, // P(X ≥ a), alternative = "greater" + Depletion, // P(X ≤ a), alternative = "less" +} +``` + +```rust +use gtars_lola::models::{LolaConfig, Direction}; + +let config = LolaConfig { + min_overlap: 1, + direction: Direction::Enrichment, +}; +``` + +### p-values and odds ratios + +The p-value is computed from the hypergeometric distribution using the survival function (`sf`) rather than `1 - cdf` to avoid catastrophic cancellation when the tail probability is very small. It is reported as `-log10(p)` (capped at ~322 via a `1e-322` floor, matching R LOLA). + +The odds ratio is the **conditional maximum likelihood estimate** — the noncentrality parameter of Fisher's noncentral hypergeometric distribution — solved by Brent's method. This matches `fisher.test()$estimate` in R exactly, not the simple `(a·d)/(b·c)` point estimate. + +### Negative contingency values + +If your user set contains regions *outside* the universe, the contingency table can produce negative `b`, `c`, or `d`. `run_lola` matches R LOLA's behavior: it prints a warning to stderr, stores the signed values on the `LolaResult` row, and returns `p_value_log = 0.0`, `odds_ratio = NaN` for that row. To avoid this entirely, pre-process your inputs with `redefine_user_sets`. + +## Results + +Each `(user_set, db_set)` pair produces a `LolaResult`: + +```rust +pub struct LolaResult { + pub user_set: usize, // 0-based user set index + pub db_set: usize, // 0-based db set index + pub p_value_log: f64, // -log10(p), ≥ 0 + pub odds_ratio: f64, // CMLE, matches R fisher.test()$estimate + pub support: u64, // contingency table cell 'a' + pub rnk_pv: usize, // rank by p-value (1-based, ascending) + pub rnk_or: usize, // rank by odds ratio (1-based, descending) + pub rnk_sup: usize, // rank by support (1-based, descending) + pub max_rnk: usize, // max(rnk_pv, rnk_or, rnk_sup) + pub mean_rnk: f64, // mean of the three ranks + pub b: i64, pub c: i64, pub d: i64, // signed contingency values + pub q_value: Option, // BH-adjusted p (None until apply_fdr_correction) + pub filename: String, + pub collection: Option, + pub description: Option, + pub cell_type: Option, + pub tissue: Option, + pub antibody: Option, + pub treatment: Option, + pub data_source: Option, + pub db_set_size: u64, +} +``` + +`run_lola` returns results sorted by descending `p_value_log`, then ascending `mean_rnk` — identical to R LOLA's output order. Ranks are assigned per user set so the rank columns within a user set are independent. + +## Annotation and FDR + +Two post-processing steps, both in `gtars_lola::output`: + +- **`annotate_results(&mut results, &db)`** — fills in `collection`, `description`, `cell_type`, `tissue`, `antibody`, `treatment`, `data_source`, and `db_set_size` from the database. `description` is truncated to 80 chars to match R LOLA. +- **`apply_fdr_correction(&mut results)`** — Benjamini-Hochberg q-value computation, applied **independently per user set**. Writes to the `q_value: Option` field. + +## Output + +### TSV (R LOLA–compatible) + +```rust +use gtars_lola::output::write_results_tsv; +use std::fs::File; +use std::io::BufWriter; + +let mut out = BufWriter::new(File::create("lola_results.tsv")?); +write_results_tsv(&mut out, &results)?; +# Ok::<(), Box>(()) +``` + +The header exactly matches R LOLA's `writeCombinedEnrichment`: + +```text +userSet dbSet collection pValueLog oddsRatio support +rnkPV rnkOR rnkSup maxRnk meanRnk b c d +description cellType tissue antibody treatment dataSource +filename qValue size +``` + +`userSet` and `dbSet` are 1-based in the TSV (R convention), even though they are 0-based in memory. + +### Columnar (for FFI bindings) + +`results_to_columns(&results) -> LolaColumnar` pivots the row-oriented `Vec` into parallel column vectors. This is what the Python, WASM, and R bindings consume to build native data frames without reimplementing the row→column transpose: + +```rust +use gtars_lola::output::{results_to_columns, LolaColumnar}; + +let cols: LolaColumnar = results_to_columns(&results); +println!("{} rows", cols.p_value_log.len()); +// cols.p_value_log[i], cols.odds_ratio[i], cols.cell_type[i], … all align on row i +``` + +Every field of `LolaResult` has a corresponding `Vec` on `LolaColumnar`; the ordering is preserved from the input slice. + +## Direct contingency-table use + +If you want to compute a single Fisher test outside of the full `run_lola` pipeline: + +```rust +use gtars_lola::models::{ContingencyTable, Direction}; + +let ct = ContingencyTable { a: 45, b: 155, c: 155, d: 9645 }; + +let p = ct.fisher_pvalue(Direction::Enrichment); +let log_p = ct.p_value_log(Direction::Enrichment); +let or = ct.odds_ratio(); + +println!("p = {p:.2e} (-log10 = {log_p:.2}) odds = {or:.2}"); +``` + +`fisher_pvalue`, `p_value_log`, and `odds_ratio` are `impl ContingencyTable` methods — the same functions `run_lola` calls internally. + +## Errors + +```rust +pub enum LolaError { + EmptyUniverse, // universe has 0 regions + EmptyDatabase, // IGD has 0 files + Other(anyhow::Error), // wrapped I/O etc. +} +``` + +`run_lola` returns early with `EmptyUniverse` / `EmptyDatabase` so you can distinguish configuration problems from per-row negative-contingency warnings (which print to stderr and continue). + +## Where to go next + +- **[gtars-igd](igd.md)** — the overlap index that backs `RegionDB.igd`. Worth reading if you're tuning universe construction or wrapping your own database. +- **[gtars-genomicdist](genomicdist.md)** — `IntervalRanges` operations used during `build_restricted_universe` and `redefine_user_sets`. +- **[gtars-core](core.md)** — `RegionSet` and `RegionSetList`, the input/output types throughout this page. +- **R LOLA** ([docs](http://lola.databio.org/)) — the reference implementation this port targets. diff --git a/docs/gtars/modules.md b/docs/gtars/modules.md index 47ce2f9a..c30ea86c 100644 --- a/docs/gtars/modules.md +++ b/docs/gtars/modules.md @@ -6,7 +6,7 @@ gtars is organized as a workspace of independent Rust crates, each providing spe ## Core Infrastructure -- **[gtars-core](core.md)** - Fundamental data structures and utilities +- **[gtars-core](core.md)** - Fundamental data structures (`Region`, `RegionSet`, `RegionSetList`, `Interval`, `Fragment`, `CoordinateMode`) and utilities - **[gtars-io](io.md)** - I/O operations and file format parsers ## Genomic Analysis @@ -17,6 +17,11 @@ gtars is organized as a workspace of independent Rust crates, each providing spe - **[gtars-scoring](scoring.md)** - Fragment overlap scoring - **[gtars-fragsplit](fragsplit.md)** - Fragment file splitting for pseudobulk +## Distribution Analysis & Enrichment + +- **[gtars-genomicdist](genomicdist.md)** - Rust port of R GenomicDistributions: summary stats, interval set algebra, partitions, signal matrices, consensus +- **[gtars-lola](lola.md)** - LOLA (Locus Overlap Analysis) enrichment testing built on IGD + genomicdist + ## Machine Learning - **[gtars-tokenizers](tokenizers.md)** - Genomic region tokenizers for ML diff --git a/docs/gtars/python-overview.md b/docs/gtars/python-overview.md index cad45506..244a70ea 100644 --- a/docs/gtars/python-overview.md +++ b/docs/gtars/python-overview.md @@ -1,6 +1,6 @@ # gtars-python -Python bindings for gtars functionality, providing native Python API for genomic interval analysis. +Python bindings for gtars — native Python API for genomic interval analysis, built on the Rust core. ## Installation @@ -8,52 +8,93 @@ Python bindings for gtars functionality, providing native Python API for genomic pip install gtars ``` -## Available Modules +For development installs from source, `uv pip install -e ./gtars-python` from the repo root. -### tokenizers -```python -from gtars.tokenizers import Tokenizer +## Submodules -# Create tokenizer from BED file -tokenizer = Tokenizer.from_bed("regions.bed") +The `gtars` Python package exposes several submodules, each wrapping a gtars Rust crate: -# Or from configuration -tokenizer = Tokenizer.from_config("config.toml") +| submodule | Rust crate | purpose | dedicated doc | +|---|---|---|---| +| `gtars.models` | `gtars-core` + `gtars-genomicdist::models` | `Region`, `RegionSet`, `RegionSetList`, reference genomes, gene models, partition lists, signal matrices — plus all the per-`RegionSet` statistics and interval algebra methods | [models](python/models.md) | +| `gtars.genomic_distributions` | `gtars-genomicdist` | GC content, dinucleotide freq, partition classification, signal summaries, consensus regions (free functions that need a reference genome or a partition list) | [genomic_distributions](python/genomic_distributions.md) | +| `gtars.lola` | `gtars-lola` | LOLA enrichment testing: `RegionDB`, `run_lola`, universe helpers | [lola](python/lola.md) | +| `gtars.tokenizers` | `gtars-tokenizers` | Genomic region tokenizers for ML | *(see in-package docs)* | +| `gtars.refget` | `gtars-refget` | GA4GH refget protocol client + local store | [digests](python/digests.md), [refget API](python/refget-api.md), [RefgetStore](python/refget-store.md) | +| `gtars.utils` | `gtars-core::utils` | File-reading and parsing helpers | *(see in-package docs)* | -# Tokenize regions -tokens = tokenizer.tokenize(["chr1:1000-2000", "chr2:3000-4000"]) -``` +## Quick start -### models (RegionSet) ```python from gtars.models import RegionSet -# Create from BED file -rs = RegionSet.from_bed("peaks.bed") +# Load a BED file (local path, or URL if the http feature is enabled) +rs = RegionSet("peaks.bed") -# Access properties -print(f"Number of regions: {len(rs)}") -print(f"Total coverage: {rs.coverage()}") +print(len(rs)) # region count +print(rs.identifier) # canonical bedbase MD5 id +print(rs.mean_region_width()) # average width -# Operations -rs.sort() -merged = rs.merge() +# Iterate +for region in rs: + print(region.chr, region.start, region.end) + +# Set algebra +merged = rs.reduce() # merge overlapping +promoters = rs.promoters(2000, 0) # 2kb upstream ``` -### refget +See [`gtars.models`](python/models.md) for the full `RegionSet` surface — including `reduce`, `setdiff`, `union`, `intersect_all`, `jaccard`, `coverage`, `neighbor_distances`, `peak_clusters`, `density_vector`, and more. + +## Extended example — genomic distributions + ```python -from gtars.refget import RefgetStore, compute_digest +from gtars.models import RegionSet, BinaryGenomeAssembly, PartitionList +from gtars.genomic_distributions import calc_gc_content, calc_expected_partitions + +peaks = RegionSet("peaks.bed") +genome = BinaryGenomeAssembly("hg38.fab") +pl = PartitionList.from_gtf("gencode.v47.gtf.gz", core_prom=100, prox_prom=2000) +chrom_sizes = {"chr1": 248_956_422, "chr2": 242_193_529} + +# GC content per region +gc = calc_gc_content(peaks, genome, ignore_unk_chroms=True) +print(f"Mean GC: {sum(gc) / len(gc):.3f}") + +# Observed vs expected partition enrichment +result = calc_expected_partitions(peaks, pl, chrom_sizes) +for p, o, e, lo, pv in zip( + result["partition"], result["observed"], + result["expected"], result["log10OE"], result["pvalue"], +): + print(f" {p:<15} obs={o:>6.0f} exp={e:>10.1f} log10(O/E)={lo:+.2f} p={pv:.2e}") +``` -# Compute sequence digest -digest = compute_digest(b"ACGTACGT") +## Extended example — LOLA enrichment -# Use RefgetStore for sequence management -store = RefgetStore() +```python +import pandas as pd +from gtars.models import RegionSet +from gtars.lola import RegionDB, run_lola + +def to_tuples(rs): + return [(r.chr, r.start, r.end) for r in rs] + +db = RegionDB.from_folder("LOLACore/hg38") + +user_sets = [to_tuples(RegionSet("peaks.bed"))] +universe = to_tuples(RegionSet("universe.bed")) + +results = run_lola(user_sets, universe, db) +df = pd.DataFrame(results).sort_values("pValueLog", ascending=False) +print(df.head(20)) ``` -## Performance +See [`gtars.lola`](python/lola.md) for universe preparation helpers, `RegionDB` accessors, and the full result schema. + +## Performance notes -- Native Rust performance -- Zero-copy data transfer where possible -- NumPy array integration -- Parallel processing support +- `RegionSet` construction sorts on load; repeated operations don't re-sort. +- Overlap queries (`count_overlaps`, `find_overlaps`, `subset_by_overlaps`) go through an AIList index for O(log n) per-query lookups. +- `BinaryGenomeAssembly` uses `mmap` for zero-copy sequence access — use it over `GenomeAssembly` for large genomes unless you specifically need the in-memory version. +- Most numeric-result methods return Python lists; for very large outputs, consider iterating or using numpy wrappers at the call site. diff --git a/docs/gtars/python/genomic_distributions.md b/docs/gtars/python/genomic_distributions.md new file mode 100644 index 00000000..ab892057 --- /dev/null +++ b/docs/gtars/python/genomic_distributions.md @@ -0,0 +1,225 @@ +# `gtars.genomic_distributions` + +Python bindings for [`gtars-genomicdist`](../genomicdist.md) — the Rust port of R GenomicDistributions plus extras. This submodule contains the **free functions** that operate on regions-plus-something-else (a reference genome, a partition list, a signal matrix). Methods that operate only on a `RegionSet` (summary stats, interval algebra, peak clustering) live directly on `RegionSet` in [`gtars.models`](models.md). + +!!! info "Where to look for what" + - Per-region summaries, nearest-neighbor distances, density vectors, set algebra → `RegionSet` methods in [`gtars.models`](models.md). + - GC content / dinucleotide frequencies → **this module** (need a reference genome). + - Partition classification → **this module** (need a `PartitionList`). + - Signal matrix overlap → **this module** (need a `SignalMatrix`). + - Consensus regions across replicates → **this module**. + - LOLA enrichment → [`gtars.lola`](lola.md). + +The Rust reference page, [gtars-genomicdist](../genomicdist.md), has the algorithmic detail, caveats about bin-width semantics, divergences from R GenomicDistributions, and so on. This page is a Python-focused reference. + +## Sequence statistics + +These both take a `RegionSet` and a reference genome (`GenomeAssembly` for in-memory or `BinaryGenomeAssembly` for mmap-backed `.fab` files — see [`gtars.models`](models.md#genomeassembly)). + +### `calc_gc_content` + +```python +from gtars.models import RegionSet, BinaryGenomeAssembly +from gtars.genomic_distributions import calc_gc_content + +peaks = RegionSet("peaks.bed") +genome = BinaryGenomeAssembly("hg38.fab") + +gc = calc_gc_content(peaks, genome, ignore_unk_chroms=True) +# gc: list[float] — one value per region, 0.0–1.0 +``` + +Setting `ignore_unk_chroms=False` raises if any region lives on a chromosome missing from the assembly. Set to `True` (default) to skip them silently. + +### `calc_dinucl_freq` + +```python +from gtars.genomic_distributions import calc_dinucl_freq + +result = calc_dinucl_freq( + peaks, genome, + raw_counts=False, # False (default) → percentages 0–100; True → integer counts + ignore_unk_chroms=True, +) + +# result is a dict: +result["region_labels"] # list[str] — "chr_start_end" per region +result["dinucleotides"] # list[str] — 16 dinucleotide names in canonical order +result["frequencies"] # list[list[float]] — one row per region, 16 values per row +``` + +The 16 columns in `frequencies` correspond one-for-one with the 16 names in `dinucleotides`. For pooled global counts across all regions, run with `raw_counts=True` and sum each column of the `frequencies` matrix. + +This matches R GenomicDistributions `calcDinuclFreq` column-for-column, including the canonical order. + +## Partition classification + +Builds on the `PartitionList` type from [`gtars.models`](models.md#partitionlist). + +### `calc_partitions` + +```python +from gtars.models import RegionSet, PartitionList +from gtars.genomic_distributions import calc_partitions + +peaks = RegionSet("peaks.bed") +pl = PartitionList.from_gtf("gencode.v47.gtf.gz", core_prom=100, prox_prom=2000) + +result = calc_partitions(peaks, pl, bp_proportion=False) + +# result is a dict with: +result["partition"] # list[str] — partition names + "intergenic" +result["count"] # list[int] — region (or bp) count per partition +result["total"] # int — sum of all counts +``` + +- `bp_proportion=False` (priority mode): each query region is assigned to the first partition it overlaps. Mutually exclusive. The `intergenic` bucket collects everything not matched. +- `bp_proportion=True`: for each partition, sum the overlapping base pairs from the query set. Not mutually exclusive — a region that straddles two partitions contributes bp to each (matching R behavior). + +### `calc_expected_partitions` + +Observed vs expected enrichment with chi-square test, given chromosome sizes: + +```python +from gtars.genomic_distributions import calc_expected_partitions + +chrom_sizes = {"chr1": 248_956_422, "chr2": 242_193_529, ...} + +result = calc_expected_partitions( + peaks, pl, chrom_sizes, bp_proportion=False, +) + +# result is a dict with parallel lists: +result["partition"] # list[str] +result["observed"] # list[float] +result["expected"] # list[float] +result["log10OE"] # list[float] — log10(observed / expected) +result["pvalue"] # list[float] — chi-square goodness-of-fit p-value +``` + +!!! warning "p-values will differ slightly from R" + `calc_expected_partitions` uses a goodness-of-fit `(O-E)²/E` formula. R's `chisq.test()` computes a 2×2 test of independence with optional Yates correction, so p-values will not match R GenomicDistributions output byte-for-byte. The `log10OE` column is directly comparable. + +## Signal matrix overlap + +### `calc_summary_signal` + +Overlap a query region set against a pre-loaded `SignalMatrix` and compute Tukey boxplot statistics per condition: + +```python +from gtars.models import RegionSet, SignalMatrix +from gtars.genomic_distributions import calc_summary_signal + +peaks = RegionSet("peaks.bed") +sm = SignalMatrix.from_tsv("signal_matrix.tsv") +# or: sm = SignalMatrix.load_bin("signal_matrix.sigm") + +result = calc_summary_signal(peaks, sm) + +result["condition_names"] # list[str] — columns of the signal matrix +result["region_labels"] # list[str] — "chr_start_end" per matched query +result["signal_matrix"] # list[list[float]] — row per matched query, col per condition (MAX) +result["matrix_stats"] # list of dicts — Tukey boxplot stats per condition +``` + +Each entry in `matrix_stats` has: `condition`, `lower_whisker`, `lower_hinge`, `median`, `upper_hinge`, `upper_whisker`. + +For each query region, the function finds all overlapping rows in the signal matrix and takes the **max** per condition — answering "what's the peak signal at this region in each cell type?". + +## Consensus regions + +### `consensus` + +Given N region sets (e.g. replicates), produce the reduced union with per-region replicate support count: + +```python +from gtars.models import RegionSet +from gtars.genomic_distributions import consensus + +reps = [ + RegionSet("rep1.bed"), + RegionSet("rep2.bed"), + RegionSet("rep3.bed"), +] + +cons = consensus(reps) +# cons: list of dicts with keys: chr, start, end, count + +# Keep regions present in ≥ 2/3 replicates +robust = [c for c in cons if c["count"] >= 2] +print(f"{len(robust)} robust regions") +``` + +The `count` field is how many input sets overlap each union region — not the number of replicate *regions*, but the number of input *sets*. + +## Utilities + +### `median_abs_distance` + +Median of absolute values, filtering sentinels. Pairs well with `TssIndex.feature_distances`: + +```python +from gtars.models import TssIndex, RegionSet +from gtars.genomic_distributions import median_abs_distance + +tss = TssIndex("gencode.v47.tss.bed") +peaks = RegionSet("peaks.bed") + +signed = tss.feature_distances(peaks) # list[float | None] +# Drop None values (regions on chromosomes not in the TSS index) +clean = [d for d in signed if d is not None] + +median = median_abs_distance(clean) # Optional[float] +``` + +Returns `None` if the input is empty or contains only missing/sentinel values. + +## End-to-end example + +```python +from gtars.models import ( + RegionSet, BinaryGenomeAssembly, PartitionList, SignalMatrix, TssIndex, +) +from gtars.genomic_distributions import ( + calc_gc_content, calc_partitions, calc_expected_partitions, + calc_summary_signal, consensus, +) + +# 1. Load inputs once +peaks = RegionSet("peaks.bed") +genome = BinaryGenomeAssembly("hg38.fab") +pl = PartitionList.from_gtf("gencode.v47.gtf.gz", core_prom=100, prox_prom=2000) +chrom_sizes = {"chr1": 248_956_422, "chr2": 242_193_529} + +# 2. Peak-level statistics — live on RegionSet itself +print(f"{len(peaks)} regions, mean width {peaks.mean_region_width():.1f}") +print(f"Per-chromosome: {peaks.chromosome_statistics()}") + +# 3. GC content per region +gc = calc_gc_content(peaks, genome, ignore_unk_chroms=True) +print(f"Mean GC: {sum(gc) / len(gc):.3f}") + +# 4. Partition enrichment (observed vs expected) +exp = calc_expected_partitions(peaks, pl, chrom_sizes) +for p, o, e, lo, pv in zip( + exp["partition"], exp["observed"], exp["expected"], exp["log10OE"], exp["pvalue"] +): + print(f" {p:<15} obs={o:>6.0f} exp={e:>10.1f} log10(O/E)={lo:+.2f} p={pv:.2e}") + +# 5. Nearest-TSS distances +tss = TssIndex("gencode.v47.tss.bed") +tss_dists = tss.calc_tss_distances(peaks) +print(f"Median distance to nearest TSS: {sorted(tss_dists)[len(tss_dists)//2]:,}") + +# 6. Consensus across replicates +reps = [RegionSet(f"rep{i}.bed") for i in (1, 2, 3)] +cons = consensus(reps) +robust = [c for c in cons if c["count"] >= 2] +print(f"{len(robust)} of {len(cons)} union regions present in ≥2 replicates") +``` + +## See also + +- **[`gtars.models`](models.md)** — `RegionSet`, `GenomeAssembly`, `PartitionList`, `SignalMatrix`, and the peak-stats methods attached to `RegionSet`. +- **[`gtars.lola`](lola.md)** — LOLA enrichment testing. +- **[gtars-genomicdist](../genomicdist.md)** — full Rust API reference, algorithmic notes, and caveats that apply identically in Python. diff --git a/docs/gtars/python/lola.md b/docs/gtars/python/lola.md new file mode 100644 index 00000000..4f269732 --- /dev/null +++ b/docs/gtars/python/lola.md @@ -0,0 +1,269 @@ +# `gtars.lola` + +Python bindings for [`gtars-lola`](../lola.md) — LOLA (Locus Overlap Analysis) enrichment testing. Given a user set of regions, a universe (background), and a region database, computes Fisher's exact test enrichment of the user set against every database region set. + +See the [Rust gtars-lola page](../lola.md) for the full statistical reference (Fisher's exact test, CMLE odds ratio, contingency table semantics, R LOLA compatibility notes). This page is a Python-focused reference. + +!!! warning "LOLA uses tuples, not `RegionSet`" + The Python `gtars.lola` API takes **`list[tuple[str, int, int]]`** — lists of `(chr, start, end)` tuples — rather than `RegionSet` objects. This is a short-term quirk of the binding. If you have a `RegionSet` in hand, convert with: + + ```python + def to_tuples(rs): + return [(r.chr, r.start, r.end) for r in rs] + ``` + +## End-to-end workflow + +```python +from gtars.models import RegionSet +from gtars.lola import RegionDB, run_lola + +# 1. Load the region database from a LOLA-format folder +db = RegionDB.from_folder( + "LOLACore/hg38", + collections=None, # None = all; or pass a list of collection names + limit=None, # None = all files; or pass a per-collection cap +) + +# 2. Prepare user sets and universe as lists of (chr, start, end) tuples +peaks = RegionSet("peaks.bed") +universe = RegionSet("universe.bed") + +def to_tuples(rs): + return [(r.chr, r.start, r.end) for r in rs] + +user_sets = [to_tuples(peaks)] # list of user sets; each is a list of tuples +universe_tuples = to_tuples(universe) + +# 3. Run the enrichment +results = run_lola( + user_sets, + universe_tuples, + db, + min_overlap=1, + direction="enrichment", # or "depletion" +) + +# results is a column-oriented dict — DataFrame-friendly +import pandas as pd +df = pd.DataFrame(results) +df = df.sort_values("pValueLog", ascending=False) +print(df.head(20)) +``` + +The 4-step workflow mirrors R LOLA: load database → convert inputs → run enrichment → pandas-ify. + +## `RegionDB` + +Wraps a `gtars-lola` `RegionDB` — an IGD overlap index plus original region sets plus per-file annotations, loaded from the standard LOLA directory layout: + +```text +db_path/ +├── collection1/ +│ ├── collection.txt # collector, date, source, description (TSV/CSV) +│ ├── index.txt # per-file annotations (TSV/CSV) +│ └── regions/ +│ ├── file1.bed +│ └── file2.bed +├── collection2/ +│ └── … +``` + +### `RegionDB.from_folder` + +```python +RegionDB.from_folder(db_path, collections=None, limit=None) -> RegionDB +``` + +- `db_path` — path to the LOLA database root. +- `collections` — optional list of collection names to include (others are skipped). +- `limit` — optional per-collection file cap (useful for smoke tests). + +`collection.txt` and `index.txt` are auto-detected TSV or CSV. Recognized `index.txt` columns: `filename`, `cellType`, `description`, `tissue`, `dataSource`, `antibody`, `treatment`. Unknown columns are ignored; missing optional columns become `None`. + +### `RegionDB.from_bed_files` + +Build a database from a list of BED file paths without the full LOLA layout: + +```python +db = RegionDB.from_bed_files( + ["file1.bed", "file2.bed", "file3.bed"], + filenames=None, # optional; defaults to basenames +) +``` + +Useful for ad-hoc analyses where you have a list of peak files but don't want to set up a collection directory. + +### Accessors + +```python +db.num_region_sets # int — total files in the database + +db.list_region_sets() # list[str] — all filenames +db.list_region_sets(collections=["enc3"]) # filtered + +# Extract region sets as a gtars.models.RegionSetList (names populated from filenames) +rsl = db.get_region_sets() # all sets +rsl = db.get_region_sets(indices=[0, 5, 12]) + +# Annotations as lists of dicts +db.region_anno # per-file dicts with filename/cellType/description/tissue/... +db.collection_anno # per-collection dicts with collector/date/source/description +``` + +`get_region_sets()` returns a [`gtars.models.RegionSetList`](models.md#regionsetlist), which has `names` populated from the database filenames. This is the one path in Python where `RegionSetList.names` is non-None — normal `RegionSetList(...)` construction can't set names. + +## Running enrichment + +### `run_lola` + +```python +run_lola( + user_sets: list[list[tuple[str, int, int]]], + universe: list[tuple[str, int, int]], + region_db: RegionDB, + min_overlap: int = 1, + direction: str = "enrichment", +) -> dict +``` + +**Parameters:** + +- `user_sets` — a **list of user sets**, where each user set is a list of `(chr, start, end)` tuples. Pass one-element list `[peaks_tuples]` for a single-user-set analysis. +- `universe` — a single list of `(chr, start, end)` tuples representing the background. +- `region_db` — a `RegionDB`. +- `min_overlap` — minimum bp overlap to count as overlapping (default 1). +- `direction` — `"enrichment"` (default, P(X ≥ a), alternative "greater") or `"depletion"` (P(X ≤ a), alternative "less"). The strings `"greater"` / `"less"` are accepted as aliases. + +**Returns** a column-oriented dict with these parallel lists (one entry per `(user_set, db_set)` pair): + +| key | type | meaning | +|---|---|---| +| `userSet` | `list[int]` | 0-based user-set index | +| `dbSet` | `list[int]` | 0-based db-set index | +| `collection` | `list[str | None]` | from `index.txt` | +| `pValueLog` | `list[float]` | `-log10(p)` from Fisher's exact test, capped at ~322 | +| `oddsRatio` | `list[float]` | CMLE odds ratio (matches R `fisher.test()$estimate`) | +| `support` | `list[int]` | overlap count between user set and db set — the contingency `a` | +| `rnkPV`, `rnkOR`, `rnkSup` | `list[int]` | 1-based per-metric ranks within the user set | +| `maxRnk` | `list[int]` | max of the three ranks | +| `meanRnk` | `list[float]` | mean of the three ranks | +| `b`, `c`, `d` | `list[int]` | signed contingency values (can be negative if user set extends outside universe) | +| `qValue` | `list[float | None]` | BH-adjusted p-value (applied automatically inside `run_lola`) | +| `description`, `cellType`, `tissue`, `antibody`, `treatment`, `dataSource` | `list[str | None]` | from `index.txt` | +| `filename` | `list[str]` | db set file name | +| `size` | `list[int]` | number of regions in the db set | + +The Python `run_lola` already calls `annotate_results` + `apply_fdr_correction` internally, so `qValue` and the annotation columns come back populated — you don't need to post-process. + +Rows are sorted by descending `pValueLog`, then ascending `meanRnk` (matching R LOLA output order). + +!!! note "Negative contingency values" + If your user set contains regions *outside* the universe, the contingency table produces negative `b`/`c`/`d`. The binding prints a warning to stderr, stores the signed values, and returns `pValueLog = 0.0`, `oddsRatio = NaN` for that row. To avoid this, preprocess with `redefine_user_sets` (below). + +## Universe preparation + +LOLA results are only meaningful when your universe contains your user sets. Three helpers for checking and shaping the universe: + +### `check_universe` + +```python +from gtars.lola import check_universe + +report = check_universe( + user_sets, # list[list[tuple[str, int, int]]] + universe_tuples, # list[tuple[str, int, int]] +) + +# report is a dict with parallel lists: +report["userSet"] # list[int] +report["totalRegions"] # list[int] +report["regionsInUniverse"] # list[int] +report["coverage"] # list[float] — 0.0 to 1.0 +report["manyToMany"] # list[int] — number of user regions overlapping >1 universe region +report["warnings"] # list[str] — free-form warning messages +``` + +Warnings fire when coverage is under 50% (severe) or under 90% (moderate), and when any user region overlaps more than one universe region. + +### `redefine_user_sets` + +Rewrite each user set in terms of universe regions — for every user region, find the universe regions it overlaps and emit *those* as the new user set. Eliminates many-to-many mapping artifacts. This is the Python equivalent of R LOLA's `redefineUserSets()`. + +```python +from gtars.lola import redefine_user_sets + +redefined = redefine_user_sets(user_sets, universe_tuples) +# redefined is list[list[tuple[str, int, int]]] — same shape as input, but rewritten + +# Now run enrichment on the redefined version +results = run_lola(redefined, universe_tuples, db) +``` + +### `build_restricted_universe` + +For differential enrichment analysis — build a universe that is exactly the union of all user sets, disjoined into non-overlapping pieces. This is R LOLA's `disjoin(unlist(userSets))`. + +```python +from gtars.lola import build_restricted_universe + +restricted = build_restricted_universe(user_sets) +# restricted: list[tuple[str, int, int]] + +# Use it as the universe in a differential run +results = run_lola(user_sets, restricted, db) +``` + +## Complete example with pandas + +```python +import pandas as pd +from gtars.models import RegionSet +from gtars.lola import RegionDB, run_lola, check_universe, redefine_user_sets + +# Convert a RegionSet to the tuple form lola expects +def to_tuples(rs): + return [(r.chr, r.start, r.end) for r in rs] + +# Load inputs +db = RegionDB.from_folder("LOLACore/hg38") +peaks_a = to_tuples(RegionSet("condition_a.bed")) +peaks_b = to_tuples(RegionSet("condition_b.bed")) +universe = to_tuples(RegionSet("universe.bed")) + +user_sets = [peaks_a, peaks_b] + +# Universe sanity check +report = check_universe(user_sets, universe) +for us, cov, mm in zip( + report["userSet"], report["coverage"], report["manyToMany"] +): + print(f"user set {us}: {cov:.1%} coverage, {mm} many-to-many") +for w in report["warnings"]: + print(f" ⚠ {w}") + +# Optional: eliminate many-to-many artifacts +user_sets = redefine_user_sets(user_sets, universe) + +# Run enrichment +results = run_lola(user_sets, universe, db, direction="enrichment") + +df = pd.DataFrame(results) +print(df.shape) # (n_user_sets * n_db_sets, 23) + +# Top 10 per user set +for us in df["userSet"].unique(): + top = ( + df[df["userSet"] == us] + .sort_values("pValueLog", ascending=False) + .head(10) + ) + print(f"\n=== User set {us} — top 10 ===") + print(top[["filename", "cellType", "pValueLog", "oddsRatio", "support", "qValue"]]) +``` + +## See also + +- **[`gtars.models`](models.md)** — `RegionSet`, `RegionSetList`, and the types returned by `RegionDB.get_region_sets()`. +- **[`gtars.genomic_distributions`](genomic_distributions.md)** — set algebra on `RegionSet` (which internally powers universe redefinition). +- **[gtars-lola](../lola.md)** — full Rust API reference, including contingency table math, p-value computation via hypergeometric survival function, CMLE odds ratio, and the R LOLA TSV output format. diff --git a/docs/gtars/python/models.md b/docs/gtars/python/models.md new file mode 100644 index 00000000..cdc52156 --- /dev/null +++ b/docs/gtars/python/models.md @@ -0,0 +1,404 @@ +# `gtars.models` + +The `gtars.models` submodule exposes the core genomic data types and — unlike the Rust layout, where statistics and set algebra live in separate crates — attaches almost the entire `gtars-genomicdist` method surface directly to `RegionSet`. In Python, you generally interact with `RegionSet` and the rest is built up on top of it. + +!!! info "Python ↔ Rust correspondence" + `gtars.models` combines types from three Rust crates: + + - `gtars-core::models` → `Region`, `Interval`, `RegionSet`, `RegionSetList` + - `gtars-genomicdist::models` → `ChromosomeStatistics`, `SpacingStats`, `ClusterStats`, `DensityVector`, `DensityHomogeneity`, `GenomeAssembly`, `BinaryGenomeAssembly`, `TssIndex`, `GeneModel`, `PartitionList`, `SignalMatrix`, `GenomicDistAnnotation` + - `gtars-genomicdist::{IntervalRanges, GenomicIntervalSetStatistics}` → methods on `RegionSet` + + Free functions that need a reference genome or a partition list live in [`gtars.genomic_distributions`](genomic_distributions.md). + +## Classes + +### `Region` + +A single genomic interval. + +```python +from gtars.models import Region + +r = Region(chr="chr1", start=100, end=200, rest="peak1") +print(r) # chr1\t100\t200\tpeak1 +print(len(r)) # 100 (width) +print(r.chr, r.start, r.end, r.rest) +``` + +Constructor: `Region(chr, start, end, rest=None)`. +Attributes: `chr: str`, `start: int`, `end: int`, `rest: str | None`. +Supports `==`, `!=`, `len()`, `str()`, `repr()`. + +### `RegionSet` + +The main workhorse — a collection of regions loaded from a BED file (or a URL, via the `http` feature on the underlying crate), and the attachment point for ~50 methods covering load/save, iteration, summaries, interval set algebra, overlap queries, and peak statistics. + +#### Construction + +```python +from gtars.models import Region, RegionSet + +# From a local file (or URL if the http feature is compiled in) +rs = RegionSet("peaks.bed") +rs = RegionSet("peaks.bed.gz") +rs = RegionSet("https://example.org/peaks.bed.gz") + +# From a list of Region objects +regions = [ + Region("chr1", 100, 200), + Region("chr1", 300, 400), +] +rs = RegionSet.from_regions(regions) + +# From parallel column vectors (useful when coming from pandas / numpy) +rs = RegionSet.from_vectors( + chrs=["chr1", "chr1", "chr2"], + starts=[100, 300, 500], + ends=[200, 400, 600], +) + +# Optional strand tracking +rs = RegionSet.from_regions(regions, strands=["+", "-"]) +``` + +Regions are auto-sorted by `(chr, start)` on construction. + +#### Properties and iteration + +```python +len(rs) # number of regions +list(rs) # iterate; yields Region objects +rs[0] # indexed access (supports negatives) +rs.is_empty() + +rs.identifier # MD5 of first three columns — canonical bedbase id +rs.file_digest # MD5 of full serialized content +rs.header # original BED header (may be None) +rs.path # source path (str, raises if constructed in-memory) +rs.strands # parallel list of strand strings +``` + +#### Summaries + +```python +rs.widths() # list[int], one width per region +rs.region_widths() # alias for widths() +rs.mean_region_width() # float, rounded to 2 decimals +rs.get_nucleotide_length() # int, total bp across all regions +rs.get_max_end_per_chr() # dict[str, int] +rs.chromosome_statistics() # dict[str, ChromosomeStatistics] +rs.distribution(n_bins=250, chrom_sizes=None) + # list of dicts: chr/start/end/n/rid +``` + +!!! warning "`chrom_sizes` matters" + Without `chrom_sizes`, `distribution()` derives bin size from the BED file's own observed max end — so outputs are **not comparable across files**. Pass `chrom_sizes` to get reference-genome-aligned bins that are comparable. + +#### I/O + +```python +rs.to_bed("out.bed") +rs.to_bed_gz("out.bed.gz") +rs.to_bigbed("out.bb", "chrom.sizes") # requires the bigbed feature +rs.sort() # in-place +``` + +#### Interval set algebra + +Methods that mirror R GRanges/IRanges — all return a new `RegionSet`: + +```python +rs.trim(chrom_sizes) # clamp to [0, chrom_size) +rs.promoters(upstream=2000, downstream=0) +rs.reduce() # merge overlapping/adjacent +rs.disjoin() # tile at every boundary into disjoint pieces +rs.setdiff(other) # remove other from self +rs.subtract(other) # alias for setdiff +rs.pintersect(other) # pairwise (index-aligned) intersect +rs.intersect_all(other) # all-vs-all intersection fragments +rs.concat(other) # combine without merging +rs.union(other) # concat + reduce +rs.gaps(chrom_sizes) # peak-free regions bounded by chrom sizes +``` + +And similarity / coverage metrics: + +```python +rs.jaccard(other) # nucleotide Jaccard |A∩B| / |A∪B| +rs.coverage(other) # fraction of self bp covered by other +rs.overlap_coefficient(other) # |A∩B| / min(|A|, |B|) +``` + +!!! note "`rest` metadata is dropped" + Operations that merge or synthesize new intervals (`reduce`, `setdiff`, `promoters`, etc.) produce regions with `rest = None`. There's no unambiguous way to carry metadata through a merge. + +#### Overlap queries + +These route through the `gtars-overlaprs` AIList index under the hood: + +```python +rs.subset_by_overlaps(other) # RegionSet containing only self-regions that overlap other +rs.count_overlaps(other) # list[int], per-region overlap count +rs.any_overlaps(other) # list[bool], per-region "overlaps anything?" +rs.find_overlaps(other) # list[list[int]], indices into other per self-region +rs.closest(other) # list[(self_idx, other_idx, distance)] +rs.cluster(max_gap=0) # list[int], cluster id per region +``` + +#### Peak statistics + +Ports of the R GenomicDistributions functions, exposed via the Rust `GenomicIntervalSetStatistics` trait as `RegionSet` methods in Python: + +```python +rs.neighbor_distances() # list[int], signed gaps per chromosome (multi-region chroms only) +rs.nearest_neighbors() # list[int], per-region min neighbor distance +rs.inter_peak_spacing() # SpacingStats +rs.peak_clusters(radius_bp=5000, min_cluster_size=2) # ClusterStats +rs.density_vector(chrom_sizes, n_bins) # DensityVector — ML-ready dense count vector +rs.density_homogeneity(chrom_sizes, n_bins) # DensityHomogeneity — CV, Gini, etc. +``` + +!!! warning "Output length for `neighbor_distances` / `nearest_neighbors`" + Both methods **skip chromosomes with only one region** (matching R GenomicDistributions). The returned list is therefore **not aligned 1:1 with input regions** — it's shorter than `len(rs)` whenever any chromosome has exactly one peak. No sentinel values are emitted. + +!!! warning "`n_bins` is a target, not a total" + In `density_vector`, `density_homogeneity`, and `distribution(chrom_sizes=...)`, `n_bins` is the target bin count for the **longest** chromosome. Bin width is derived from that, and every chromosome is tiled at the same bp width — so the total window count is `sum(ceil(size / bin_width))` across chromosomes, often much larger than `n_bins`. See `DensityVector` for details. + +### `RegionSetList` + +A collection of `RegionSet`s — the Python wrapper for `gtars-core::models::RegionSetList`. Substantially thinner than the Rust type: only construction from a list of `RegionSet`s, iteration/indexing, `concat`, and pairwise Jaccard. + +```python +from gtars.models import RegionSet, RegionSetList + +rsl = RegionSetList([ + RegionSet("rep1.bed"), + RegionSet("rep2.bed"), + RegionSet("rep3.bed"), +]) + +len(rsl) # 3 +rsl[0] # RegionSet +list(rsl) # iterate + +combined = rsl.concat() # flatten into a single RegionSet (no merge/dedup) +rsl.names # None unless produced by RegionDB.get_region_sets() + +# Full N×N Jaccard similarity matrix +jac = rsl.pairwise_jaccard() # list[list[float]], symmetric, 1.0 on diagonal +``` + +!!! note "Names come from `RegionDB`" + The Python `RegionSetList.__new__` does not accept a `names=` parameter — names are only populated when a `RegionSetList` is produced by `RegionDB.get_region_sets()` (see [`gtars.lola`](lola.md)). If you need named sets from scratch, that's currently only available in the Rust API. + +### `Interval` + +A generic integer interval with a payload. Primarily a helper for overlap indexes — most user code should use `Region` / `RegionSet` instead. + +### Statistics result types + +These are returned by `RegionSet` methods. All are read-only dataclass-like types with named fields. + +#### `ChromosomeStatistics` + +Returned by `rs.chromosome_statistics()` → `dict[str, ChromosomeStatistics]`. + +| field | type | meaning | +|---|---|---| +| `chromosome` | `str` | chromosome name | +| `number_of_regions` | `int` | region count on this chromosome | +| `start_nucleotide_position` | `int` | leftmost start | +| `end_nucleotide_position` | `int` | rightmost end | +| `minimum_region_length` | `int` | | +| `maximum_region_length` | `int` | | +| `mean_region_length` | `float` | | +| `median_region_length` | `float` | | + +#### `SpacingStats` + +Returned by `rs.inter_peak_spacing()`. All float fields are NaN when `n_gaps == 0`. + +| field | type | +|---|---| +| `n_gaps` | `int` — count of *positive* inter-region gaps | +| `mean`, `median`, `std`, `iqr` | `float` | +| `log_mean`, `log_std` | `float` — `log10(gap + 1)` stats, more robust for heavy-tailed gap distributions | + +#### `ClusterStats` + +Returned by `rs.peak_clusters(radius_bp, min_cluster_size)`. + +| field | type | +|---|---| +| `radius_bp` | `int` — pass-through | +| `n_clusters` | `int` — clusters with size ≥ `min_cluster_size` | +| `n_clustered_peaks` | `int` — peaks belonging to those clusters | +| `mean_cluster_size` | `float` — NaN if no clusters meet threshold | +| `max_cluster_size` | `int` — always over all clusters, not filtered | +| `fraction_clustered` | `float` — `n_clustered_peaks / total_peaks` | + +The identity `n_clusters * mean_cluster_size == n_clustered_peaks` holds exactly at any threshold. With `min_cluster_size=2` (the default), `fraction_clustered` is the fraction of peaks with at least one neighbor within `radius_bp` — the typical meaning in super-enhancer-stitching analyses. + +#### `DensityVector` + +Returned by `rs.density_vector(chrom_sizes, n_bins)`. + +| field | type | +|---|---| +| `n_bins` | `int` — target bin count for the longest chromosome (**not** the length of `counts`) | +| `bin_width` | `int` — `max(chrom_sizes.values()) // n_bins`, min 1 | +| `counts` | `list[int]` — dense, row-major, karyotypic chromosome order | +| `chrom_offsets` | `list[(str, int)]` — `(chr_name, start_index_in_counts)` per chromosome | + +Supports `len()`. Suitable as an ML feature vector — stable ordering across files when you pass the same `chrom_sizes`. + +#### `DensityHomogeneity` + +Returned by `rs.density_homogeneity(chrom_sizes, n_bins)`. + +| field | type | +|---|---| +| `bin_width`, `n_windows`, `n_nonzero_windows` | `int` | +| `mean_count`, `variance`, `cv`, `gini` | `float` | + +Poisson-distributed peaks give `cv ≈ 1`; clustered sets give `cv >> 1`; evenly-spread sets give `cv << 1`. **Caveat:** Gini is biased high for sparse count distributions; check `n_nonzero_windows` before interpreting it. + +## Reference genome and annotation types + +These types are constructed from files on disk and passed into the functions in [`gtars.genomic_distributions`](genomic_distributions.md). + +### `GenomeAssembly` + +In-memory FASTA loader. Slow to construct (~2s for hg38) but fast for per-region lookups. + +```python +from gtars.models import GenomeAssembly + +genome = GenomeAssembly("hg38.fa") +``` + +### `BinaryGenomeAssembly` + +mmap-backed `.fab` binary FASTA — instant construction, zero-copy per-region access. + +```python +from gtars.models import BinaryGenomeAssembly + +genome = BinaryGenomeAssembly("hg38.fab") +``` + +Create a `.fab` file once up front with the `gtars prep --fasta hg38.fa` CLI command (or the Rust `BinaryGenomeAssembly.write_from_fasta`). + +### `TssIndex` + +Sorted-per-chromosome TSS midpoint index for fast nearest-TSS queries. + +```python +from gtars.models import TssIndex, RegionSet + +tss = TssIndex("gencode.v47.tss.bed") +# or: +# tss = TssIndex.from_regionset(RegionSet("gencode.v47.tss.bed")) + +peaks = RegionSet("peaks.bed") + +dists = tss.calc_tss_distances(peaks) # list[int] +signed = tss.feature_distances(peaks) # list[float | None] +``` + +`feature_distances` returns `None` for regions on chromosomes that have no features in the index. + +### `GeneModel` + +Loaded from a GTF file. Used to build a `PartitionList`. + +```python +from gtars.models import GeneModel + +model = GeneModel.from_gtf( + "gencode.v47.annotation.gtf.gz", + filter_protein_coding=True, + convert_ensembl_ucsc=True, +) +print(model.n_genes, model.n_exons) +``` + +- `filter_protein_coding` — keep only features with `gene_biotype "protein_coding"` (or `gene_type`). +- `convert_ensembl_ucsc` — prepend `chr` to chromosome names that don't already have it. + +### `PartitionList` + +Ordered, priority-based genomic partition list: `promoterCore` → `promoterProx` → `threeUTR` → `fiveUTR` → `exon` → `intron` → `intergenic`. + +```python +from gtars.models import GeneModel, PartitionList + +model = GeneModel.from_gtf("gencode.v47.gtf.gz") + +# From an existing GeneModel +pl = PartitionList.from_gene_model( + model, + core_prom=100, + prox_prom=2000, + chrom_sizes=None, # optional; used for trimming to chrom boundaries +) + +# Or directly from GTF in one call +pl = PartitionList.from_gtf( + "gencode.v47.gtf.gz", + core_prom=100, + prox_prom=2000, + filter_protein_coding=True, + convert_ensembl_ucsc=True, +) + +print(pl.partition_names()) # ['promoterCore', 'promoterProx', 'exon', 'intron', ...] +print(len(pl)) # number of partitions +``` + +Pass the resulting `PartitionList` to `calc_partitions` / `calc_expected_partitions` in [`gtars.genomic_distributions`](genomic_distributions.md). + +### `SignalMatrix` + +A region × condition matrix of signal values (e.g. a peak × cell-type ChIP intensity matrix). Load from TSV or the packed binary `.sigm` format: + +```python +from gtars.models import SignalMatrix + +sm = SignalMatrix.from_tsv("signal_matrix.tsv") +sm = SignalMatrix.load_bin("signal_matrix.sigm") + +print(sm.condition_names) # list[str] +print(sm.n_conditions) # int +print(sm.n_regions) # int +print(len(sm)) # n_regions +``` + +Pass it to `calc_summary_signal` in [`gtars.genomic_distributions`](genomic_distributions.md). + +### `GenomicDistAnnotation` + +A serializable wrapper around `GeneModel` in the GDA binary format. Chrom sizes are not stored in a GDA file — they come from a separate source. + +```python +from gtars.models import GenomicDistAnnotation + +# Build once from GTF, save as GDA (save via Rust / CLI) +gda = GenomicDistAnnotation.from_gtf("gencode.v47.gtf.gz") + +# Load from disk (fast) +gda = GenomicDistAnnotation.load_bin("gencode.v47.gda") + +# Derived views +model = gda.gene_model() +pl = gda.partition_list(core_prom=100, prox_prom=2000, chrom_sizes=None) +tss = gda.tss_index() +``` + +## See also + +- **[`gtars.genomic_distributions`](genomic_distributions.md)** — free functions for GC content, dinucleotide frequencies, partition classification, signal summaries, consensus regions. +- **[`gtars.lola`](lola.md)** — LOLA enrichment testing. +- **[gtars-core](../core.md)** — the Rust crate `gtars.models` wraps. Useful if you want to know which methods are "free" vs. which come from `gtars-genomicdist`. +- **[Core models tour](../regionSet.md)** — cross-language walkthrough of `Region` / `RegionSet` / `RegionSetList`. diff --git a/docs/gtars/r.md b/docs/gtars/r.md index 36bdefe9..8efafc66 100644 --- a/docs/gtars/r.md +++ b/docs/gtars/r.md @@ -1,44 +1,135 @@ # gtars-r -R bindings for gtars functionality, providing an R API for genomic interval analysis. +R bindings for gtars — high-performance genomic interval analysis in R, backed by Rust via `extendr`. The package is `gtars`. ## Installation -To install the development version, use the `remotes` package: - -``` R +```r install.packages("remotes") -remotes::install_github("databio/gtars", ref = "dev", subdir = "gtars-r") +remotes::install_github("databio/gtars", subdir = "gtars-r") ``` -You can also install the R package locally from the repo directory: +Or install from a local clone of the repo: -``` console +```console R CMD INSTALL gtars-r ``` -## Available Modules +Building the package requires a Rust toolchain (`cargo`, `rustc`). See the [gtars-r README](https://github.com/databio/gtars/blob/dev/gtars-r/README.md) for detailed setup notes. + +## Loading + +```r +library(gtars) +``` + +## Submodule pages + +| page | covers | +|---|---| +| [**RegionSet & RegionSetList**](r/regionset.md) | S4 classes for genomic regions, interval set algebra, overlap queries, conversion to/from `GRanges`, pairwise Jaccard. | +| [**GenomicDistributions**](r/genomicdist.md) | Drop-in replacements for R GenomicDistributions — `calcWidth`, `calcGCContent`, `calcDinuclFreq`, `genomePartitionList`, `calcPartitions`, `calcExpectedPartitions`, `calcSummarySignal`, `calcFeatureDist`, etc. | +| [**LOLA enrichment**](r/lola.md) | Drop-in replacement for R LOLA — `loadRegionDB`, `runLOLA`, `checkUniverseAppropriateness`, `redefineUserSets`, `buildRestrictedUniverse`. | +| [**IGD indexing**](r/igd.md) | Low-level `igd_create` and `igd_search` for building and querying IGD databases directly. | +| [**RefgetStore**](r/refgetstore.md) | GA4GH refget protocol client. | + +## Polymorphic inputs + +A defining feature of the R gtars API: nearly every function that takes a query accepts **any** of these input types interchangeably: + +- a `RegionSet` S4 object, +- a `GRanges` object (from `GenomicRanges`), +- a file path to a BED / BED.gz / narrowPeak file, +- a `data.frame` with `chr` / `start` / `end` columns (and optional `strand`). + +This means you can drop gtars into an existing Bioconductor workflow without explicit conversion: + +```r +library(GenomicRanges) +library(gtars) + +gr <- GRanges(...) + +widths <- calcWidth(gr) # GRanges works directly +merged <- reduce(gr) # overridden S4 method +jac <- jaccard(gr, "other_peaks.bed") # mix GRanges with file path +``` + +## Quick start + +```r +library(gtars) + +# 1. Load regions +peaks <- RegionSet("peaks.bed") +length(peaks) +show(peaks) + +# 2. Summary statistics +calcWidth(peaks) |> summary() +interPeakSpacing(peaks) + +# 3. Interval set algebra +merged <- reduce(peaks) +proms <- promoters(peaks, upstream = 2000L, downstream = 200L) + +# 4. Convert back to GRanges for downstream Bioconductor work +# (GenomicRanges only needed here — it's a bridge, not a core dep) +library(GenomicRanges) +gr <- as_granges(peaks) +``` + +## LOLA enrichment quick start -### refget -``` R +```r library(gtars) -# Compute sequence digest -readable <- 'ACGTACGT' +regionDB <- loadRegionDB("LOLACore/hg38") + +results <- runLOLA( + userSets = list(RegionSet("peaks_a.bed"), RegionSet("peaks_b.bed")), + userUniverse = "universe.bed", + regionDB = regionDB, + redefineUserSets = TRUE, +) + +head(results[order(-pValueLog)], 20) +``` + +## refget quick start + +```r +library(gtars) + +readable <- "ACGTACGT" gtars::sha512t24u_digest(readable) gtars::md5_digest(readable) -# Use RefgetStore for sequence management -store <- global_refget_store('raw') +store <- global_refget_store("raw") ``` -### IGD -``` R +## IGD quick start + +```r library(gtars) -### Building an Index -igd_create(output_path = '/home/igd_output/', filelist = '/home/my_bedfiles/') +# Build an IGD index from a directory of BED files +igd_create(output_path = "./igd_out", filelist = "./my_beds/", db_name = "my_peaks") -### Querying with a single bed file -igd_search(database_path = 'my_igd_database.igd', query_path = 'my_query.bed') +# Query the index +hits <- igd_search( + database_path = "./igd_out/my_peaks.igd", + query_path = "query.bed", +) ``` + +## Coordinate-system notes + +The R package uses **0-based half-open** coordinates internally (BED convention) to match the Rust core. When you pass a `GRanges` object to a gtars function, coordinates are converted from 1-based closed to 0-based half-open automatically. Converting back via `as_granges(rs)` does the reverse. + +If you're building a `RegionSet` from a `data.frame` directly, the `start` and `end` columns are expected to already be in 0-based half-open form. + +## See also + +- **[gtars-core](core.md)** and **[gtars-genomicdist](genomicdist.md)** — Rust reference pages with the algorithmic detail and caveats that apply identically in R. +- **[R GenomicDistributions](https://code.databio.org/GenomicDistributions/)** and **[R LOLA](https://www.bioconductor.org/packages/release/bioc/html/LOLA.html)** — the original packages gtars R is a port of. diff --git a/docs/gtars/r/genomicdist.md b/docs/gtars/r/genomicdist.md new file mode 100644 index 00000000..d7db80c9 --- /dev/null +++ b/docs/gtars/r/genomicdist.md @@ -0,0 +1,224 @@ +# GenomicDistributions in R (gtars) + +The `gtars` R package ships **drop-in replacements** for the core functions in the [R GenomicDistributions](https://code.databio.org/GenomicDistributions/) package, backed by the Rust [`gtars-genomicdist`](../genomicdist.md) crate. The function names, signatures, and return shapes match R GenomicDistributions closely enough that most analyses should port over with one-line changes. + +Every function accepts polymorphic query input — `RegionSet`, `GRanges`, file path, or `data.frame` — so you can mix gtars with existing Bioconductor workflows without explicit conversion. + +For algorithmic detail and caveats that apply identically in R, see [gtars-genomicdist](../genomicdist.md). + +## Drop-in replacements for GenomicDistributions + +| gtars R function | R GenomicDistributions equivalent | notes | +|---|---|---| +| `calcWidth(query)` | `calcWidth()` | delegates to `widths()` S4 method | +| `calcNeighborDist(query)` | `calcNeighborDist()` | signed inter-region gaps per chromosome | +| `calcNearestNeighbors(query)` | `calcNearestNeighbors()` | per-region min neighbor distance | +| `regionDistribution(query, nBins, chromSizes)` | `calcChromBins()` | bin counts; pass `chromSizes` for reference-aligned bins | +| `calcGCContent(query, ref, ignoreUnkChroms)` | `calcGCContent()` | requires a `GenomeAssembly` pointer, not BSgenome | +| `calcDinuclFreq(query, ref, rawCounts)` | `calcDinuclFreq()` | returns a data.frame with region + 16 dinucleotide columns | +| `genomePartitionList(...)` | `genomePartitionList()` | build partition list from gene model components | +| `partitionListFromGTF(path, ...)` | *(not in original)* | convenience — loads gene model from GTF in one step | +| `calcPartitions(query, partitionList, bpProportion)` | `calcPartitions()` | priority-based (or bp-proportion) partition classification | +| `calcExpectedPartitions(query, partitionList, genomeSize)` | `calcExpectedPartitions()` | observed vs. expected with chi-square p-values | +| `calcSummarySignal(query, signalMatrix)` | `calcSummarySignal()` | overlap query regions with a signal matrix | +| `calcFeatureDist(query, features)` | `calcFeatureDist()` | signed distance to nearest feature | +| `calcTSSDist(query, features)` | *(not in original)* | absolute distance variant | +| `loadGenomeAssembly(fasta_path)` | — | replaces BSgenome loading (FASTA directly) | + +## Porting an analysis + +Most GenomicDistributions scripts port by adding `library(gtars)` after `library(GenomicDistributions)` — gtars masks the functions with faster implementations: + +```r +library(GenomicDistributions) # optional, for plotting helpers +library(gtars) # masks calcWidth, calcGCContent, etc. +library(GenomicRanges) + +query <- GRanges(...) # or a file path, or a data.frame +widths <- calcWidth(query) # now backed by Rust + +# Use existing plotGenomicDist* helpers on the results +plotChromBins(regionDistribution(query, chromSizes = hg38)) +``` + +## Basic statistics + +```r +library(gtars) + +# Works with any query input: RegionSet, GRanges, file path, or data.frame +query <- "peaks.bed" + +w <- calcWidth(query) # numeric vector of widths +nd <- calcNeighborDist(query) # signed gaps per chromosome +nn <- calcNearestNeighbors(query) # min neighbor distance per region +``` + +!!! warning "`calcNeighborDist` / `calcNearestNeighbors` output length" + Both functions **skip chromosomes with only one region** (matching R GenomicDistributions). The returned vector is shorter than `length(query)` whenever any chromosome has a single peak, and is **not aligned 1:1** with input regions. + +## Region distribution + +```r +# Without chromSizes: bin size derived from the query itself +# (not comparable across files) +rd <- regionDistribution(query, nBins = 250L) + +# With chromSizes: reference-aligned bins (comparable across files, +# aligned with reference genome positions) +chromSizes <- c(chr1 = 248956422L, chr2 = 242193529L) +rd <- regionDistribution(query, nBins = 250L, chromSizes = chromSizes) +``` + +Returns a data.table compatible with R GenomicDistributions' `plotChromBins`. + +!!! warning "`nBins` is a target, not a total" + When `chromSizes` is provided, `nBins` is the target bin count for the **longest** chromosome in `chromSizes`. Bin width is derived as `max(chromSizes) %/% nBins` (floored, minimum 1 bp), and every chromosome is tiled at the same bp width — so shorter chromosomes get proportionally fewer bins. The total bin count returned is `sum(ceiling(chrom_size / bin_width))`, which can substantially exceed `nBins` when `chromSizes` has many entries. To target a specific bin width in bp instead, pass `nBins = max_chrom_len %/% desired_bp`. This applies identically to `densityVector` and `densityHomogeneity` from [`r/regionset.md`](regionset.md#statistics-methods). + +## GC content and dinucleotides + +Unlike the original R GenomicDistributions (which uses BSgenome), gtars loads a reference genome directly from a FASTA file: + +```r +genome <- loadGenomeAssembly("hg38.fa") + +gc <- calcGCContent(query, genome, ignoreUnkChroms = TRUE) +# numeric vector of GC content per region, 0.0 to 1.0 + +dinucl <- calcDinuclFreq(query, genome, rawCounts = FALSE) +# data.frame with columns: region + 16 dinucleotide columns +# rawCounts = FALSE → percentages (each row sums to 100) +# rawCounts = TRUE → integer counts + +# Pooled global counts across all regions +pooled <- colSums(calcDinuclFreq(query, genome, rawCounts = TRUE)[, -1]) +``` + +## Partitions + +Build a `PartitionList` of promoter / UTR / exon / intron / intergenic categories from gene model components, then classify your query. You can either supply each component explicitly or load them in one call from a GTF. + +### From gene model components + +```r +# From GRanges, file paths, or data.frames — any combination works +pl <- genomePartitionList( + genesGR = "genes.bed", + exonsGR = "exons.bed", + threeUTRGR = "three_utr.bed", + fiveUTRGR = "five_utr.bed", + corePromSize = 100L, + proxPromSize = 2000L, + chromSizes = c(chr1 = 248956422L, chr2 = 242193529L), # optional +) +``` + +Strand information from the inputs (GRanges strand column, or the `strand` column of a data.frame) is used for strand-aware promoter and reduce operations — critical for getting correct promoters on minus-strand genes. + +### From a GTF + +```r +# Single-call convenience — loads gene model and builds partitions +pl <- partitionListFromGTF( + "gencode.v47.annotation.gtf.gz", + filterProteinCoding = TRUE, + convertEnsemblUCSC = FALSE, + corePromSize = 100L, + proxPromSize = 2000L, + chromSizes = chromSizes, +) +``` + +### Classifying query regions + +```r +# Priority mode — each query region assigned to first matching partition +counts <- calcPartitions(query, pl, bpProportion = FALSE) +# data.frame with partition, Freq columns + attr(counts, "total") + +total <- attr(counts, "total") +counts$pct <- counts$Freq / total * 100 +print(counts) + +# Observed vs expected with chi-square p-values +chromSizes <- c(chr1 = 248956422L, chr2 = 242193529L) +expected <- calcExpectedPartitions(query, pl, chromSizes, bpProportion = FALSE) +# data.frame with partition, observed, expected, log10OE, pvalue columns +``` + +!!! warning "p-values differ slightly from R GenomicDistributions" + `calcExpectedPartitions` uses a goodness-of-fit `(O-E)²/E` formula. R `chisq.test()` computes a 2×2 test of independence with optional Yates correction, so p-values will not match R GenomicDistributions output byte-for-byte. The `log10OE` column is directly comparable. + +## Signal matrix overlap + +```r +# signalMatrix is a data.frame / data.table where column 1 is the region ID +# in "chr_start_end" format and the remaining columns are condition names +# with numeric signal values. +signalMatrix <- data.table::fread("signal_matrix.tsv") + +result <- calcSummarySignal(query, signalMatrix) +# result is a list compatible with GenomicDistributions' plotSummarySignal: +# $signalSummaryMatrix — data.table with queryPeak + one column per condition +# $matrixStats — data.frame with 5 rows (Tukey stats) and one column per condition + +# Row names of matrixStats are the Tukey boxplot quantities: +# lowerWhisker, lowerHinge, median, upperHinge, upperWhisker +print(result$matrixStats) +``` + +Plug directly into R GenomicDistributions' `plotSummarySignal()` without further conversion. + +## Distances to TSS / nearest features + +```r +tss <- "gencode.v47.tss.bed" + +# Signed distance (positive = feature downstream, negative = upstream) +d_signed <- calcFeatureDist(query, tss) + +# Absolute distance +d_abs <- calcTSSDist(query, tss) +``` + +Both accept RegionSet, GRanges, file path, or data.frame for either argument. + +## End-to-end example + +```r +library(gtars) + +# 1. Load inputs once +peaks <- RegionSet("peaks.bed") +genome <- loadGenomeAssembly("hg38.fa") +chromSizes <- c(chr1 = 248956422L, chr2 = 242193529L) +pl <- partitionListFromGTF("gencode.v47.gtf.gz", corePromSize = 100L, + proxPromSize = 2000L, chromSizes = chromSizes) + +# 2. Widths and GC content +cat(sprintf("%d peaks, median width %.0f\n", + length(peaks), median(calcWidth(peaks)))) + +gc <- calcGCContent(peaks, genome, ignoreUnkChroms = TRUE) +cat(sprintf("Mean GC content: %.3f\n", mean(gc))) + +# 3. Partition enrichment +expected <- calcExpectedPartitions(peaks, pl, chromSizes, bpProportion = FALSE) +print(expected) + +# 4. Distance to nearest TSS +tssDist <- calcTSSDist(peaks, "gencode.v47.tss.bed") +cat(sprintf("Median distance to nearest TSS: %d bp\n", median(tssDist))) + +# 5. Convert to GRanges for downstream Bioconductor work +# (only step that needs GenomicRanges — it's a bridge, not a core dep) +library(GenomicRanges) +gr <- as_granges(peaks) +``` + +## See also + +- **[RegionSet & RegionSetList (R)](regionset.md)** — S4 class reference and method list. +- **[R LOLA interface](lola.md)** — enrichment testing built on top of these types. +- **[gtars-genomicdist](../genomicdist.md)** — Rust reference with all the algorithmic detail. +- **[R GenomicDistributions](https://code.databio.org/GenomicDistributions/)** — the original package gtars is a port of. Plotting helpers from GenomicDistributions still work on gtars return values. diff --git a/docs/gtars/r/igd.md b/docs/gtars/r/igd.md new file mode 100644 index 00000000..948653c8 --- /dev/null +++ b/docs/gtars/r/igd.md @@ -0,0 +1,101 @@ +# IGD in R (gtars) + +Low-level R bindings for IGD ([Indexed Genomic Data](../igd.md)) — a fast binary index over collections of BED files, used internally by gtars LOLA and by the BEDbase retrieval pipeline. The R API exposes two functions: `igd_create` (build an index from BED files on disk) and `igd_search` (query an existing index with a BED file). + +If you're running LOLA enrichment, you usually don't need to touch IGD directly — `loadRegionDB()` in [`r/lola.md`](lola.md) builds an IGD index internally from the region database. Use this page if you want to index a custom collection of BED files and query it directly. + +## Creating an IGD database + +```r +library(gtars) + +igd_create( + output_path = "path/to/output/", + filelist = "path/to/bed/files/", + db_name = "igd_database", # default +) +``` + +**Arguments:** + +- `output_path` — directory where the IGD database files will be written. Must exist. +- `filelist` — one of: + - a path to a text file listing BED file paths (one per line), + - a path to a directory containing BED files (all `.bed` / `.bed.gz` files will be indexed), + - `"-"` or `"stdin"` to read paths from standard input. +- `db_name` — prefix for the output filenames (default `"igd_database"`). The function produces two files: `.igd` (the index) and `_index.tsv` (file metadata). + +Returns `NULL` invisibly on success. Errors are raised via `stop()` on invalid input. + +### Example + +```r +# From a directory of BED files +igd_create( + output_path = "./igd_out", + filelist = "./my_bed_files/", + db_name = "my_peaks", +) + +# Produces: +# ./igd_out/my_peaks.igd +# ./igd_out/my_peaks_index.tsv + +# From an explicit file list +writeLines( + c("./rep1.bed", "./rep2.bed", "./rep3.bed"), + "bed_list.txt", +) +igd_create( + output_path = "./igd_out", + filelist = "bed_list.txt", + db_name = "replicates", +) +``` + +## Searching an IGD database + +```r +hits <- igd_search( + database_path = "path/to/my_peaks.igd", + query_path = "path/to/query.bed", +) +``` + +**Arguments:** + +- `database_path` — path to an existing `.igd` file (produced by `igd_create` or any other IGD-compatible tool). +- `query_path` — path to a BED file containing the query regions. + +**Returns** a `data.frame` of overlap hits. The exact columns depend on the query result schema — typically `file_id`, `filename`, `overlap_count`, etc. Use `colnames(hits)` to discover the structure for your version. + +### Example + +```r +hits <- igd_search( + database_path = "./igd_out/my_peaks.igd", + query_path = "query.bed", +) + +# Inspect the schema +colnames(hits) +head(hits) + +# Typical analysis: count overlaps per database file +aggregate(overlap_count ~ filename, data = hits, sum) +``` + +## Relationship to other gtars R functions + +- **`loadRegionDB()` in `lola.R`** — builds an IGD index internally for a LOLA region database. The IGD is wrapped inside a `RegionDB` pointer and used by `runLOLA()`. Users running enrichment typically don't need to call `igd_create` / `igd_search` separately. +- **`countOverlaps()` / `findOverlaps()` on `RegionSet`** — ad-hoc overlap queries between two `RegionSet` objects use the AIList index from [`gtars-overlaprs`](../overlaprs.md), not IGD. Use those methods when you have both the query and subject in memory. +- **IGD is preferred when** you have a large static collection of BED files that you'll query many times against different user sets, and you want the index persisted on disk. + +!!! note "Naming convention" + Unlike the rest of the R gtars API, which uses camelCase (`runLOLA`, `calcPartitions`, etc.), the IGD functions use snake_case (`igd_create`, `igd_search`). This is a legacy from the direct CLI mapping and may be aligned in a future release. + +## See also + +- **[gtars-igd](../igd.md)** — the underlying Rust crate, with the binary format reference and performance characteristics. +- **[R LOLA interface](lola.md)** — `loadRegionDB()` and `runLOLA()`, which use IGD internally. +- **[gtars CLI](../cli.md)** — `gtars igd create` and `gtars igd search` commands, equivalent to these R functions. diff --git a/docs/gtars/r/lola.md b/docs/gtars/r/lola.md new file mode 100644 index 00000000..7fdab4a0 --- /dev/null +++ b/docs/gtars/r/lola.md @@ -0,0 +1,228 @@ +# LOLA enrichment in R (gtars) + +The `gtars` R package ships a **drop-in replacement** for the [LOLA](https://www.bioconductor.org/packages/release/bioc/html/LOLA.html) R/Bioconductor package, powered by the Rust [`gtars-lola`](../lola.md) crate. Function names and signatures match LOLA closely enough that existing scripts port over with one-line changes. + +For the underlying statistical reference see [gtars-lola](../lola.md) — Fisher's exact test on a hypergeometric survival function, CMLE odds ratio via Brent's method, per-user-set Benjamini-Hochberg FDR correction, and 1-based TSV output compatible with R LOLA's `writeCombinedEnrichment`. + +## Loading the database + +`loadRegionDB()` reads a standard LOLA-format folder (one directory per collection, each with `collection.txt`, `index.txt`, and a `regions/` subdirectory of BED files): + +```r +library(gtars) + +# Load the full database +regionDB <- loadRegionDB("LOLACore/hg38") + +# Load only specific collections +regionDB <- loadRegionDB("LOLACore/hg38", collections = c("encode_tfbs", "roadmap_epigenomics")) + +# Per-collection file cap (useful for smoke tests) +regionDB <- loadRegionDB("LOLACore/hg38", limit = 10L) +``` + +### From a list of BED files + +If you don't have the full LOLA folder layout, `loadRegionDBFromBeds()` builds a minimal region database from an arbitrary list of BED paths: + +```r +regionDB <- loadRegionDBFromBeds( + bedFiles = c("file1.bed", "file2.bed", "file3.bed"), + filenames = c("Condition A", "Condition B", "Condition C"), # optional display names +) +``` + +### Accessing annotations + +```r +# Per-file annotation table (matches R LOLA's regionDB$regionAnno) +regionAnno <- regionDBAnno(regionDB) +# data.table with columns: filename, cellType, description, tissue, +# dataSource, antibody, treatment, collection + +# Collection-level annotation +collectionAnno <- regionDBCollectionAnno(regionDB) +# data.table with columns: collectionname, collector, date, source, description + +# List region set filenames +listRegionSets(regionDB) +listRegionSets(regionDB, collections = "encode_tfbs") # filtered +``` + +### Extracting region sets + +```r +# Get all region sets as a RegionSetList +all_sets <- getRegionSets(regionDB) + +# Or a subset by 1-based indices +subset <- getRegionSets(regionDB, indices = c(1, 5, 12)) + +# RegionSetList is the same S4 class described in the RegionSet page +length(subset) +names(subset) +rs1 <- subset[[1]] # individual RegionSet +``` + +## Running enrichment + +```r +runLOLA( + userSets, # RegionSet, or list of RegionSets (or GRanges / file paths — polymorphic) + userUniverse, # RegionSet or GRanges representing the background + regionDB, # from loadRegionDB + minOverlap = 1, # minimum bp overlap to count as overlapping + cores = 1, # reserved for future parallelism — currently ignored + redefineUserSets = FALSE, # automatically rewrite user sets against universe + direction = "enrichment", # "enrichment" or "depletion" +) +``` + +**Returns a `data.table`** with one row per `(user_set, db_set)` pair, matching the R LOLA schema: `userSet`, `dbSet`, `collection`, `pValueLog`, `oddsRatio`, `support`, `rnkPV`, `rnkOR`, `rnkSup`, `maxRnk`, `meanRnk`, `b`, `c`, `d`, `description`, `cellType`, `tissue`, `antibody`, `treatment`, `dataSource`, `filename`, `qValue`, `size`. + +- `userSet` and `dbSet` are **1-based** in the output data.table (R convention), even though they are 0-based internally. +- `pValueLog` is `-log10(p)` from Fisher's exact test, capped at ~322. +- `oddsRatio` is the CMLE odds ratio — matches `fisher.test()$estimate`, not the simple `(a·d)/(b·c)` point estimate. +- `qValue` is Benjamini-Hochberg adjusted, computed **per user set independently**. +- Rows are sorted by descending `pValueLog`, then ascending `meanRnk` — identical to R LOLA output order. + +### Basic example + +```r +library(gtars) + +regionDB <- loadRegionDB("LOLACore/hg38") + +# Polymorphic inputs — RegionSet, GRanges, file path all accepted +userSets <- list( + RegionSet("condition_a.bed"), + RegionSet("condition_b.bed"), +) +universe <- "universe.bed" # file path works directly + +results <- runLOLA(userSets, universe, regionDB, + minOverlap = 1, + direction = "enrichment") + +# Top 20 hits for user set 1 +head(results[userSet == 1], 20) +``` + +### Two-condition comparison + +```r +# Differential enrichment — use the union of both conditions as the universe +restricted <- buildRestrictedUniverse(list(peaks_a, peaks_b)) + +results <- runLOLA(list(peaks_a, peaks_b), restricted, regionDB) + +# Top hits unique to condition A +head(results[userSet == 1 & qValue < 0.05][order(-pValueLog)], 20) +``` + +## Universe diagnostics + +LOLA results are only meaningful if your universe **contains** your user sets: every user region should overlap at least one universe region. + +### `checkUniverseAppropriateness` + +```r +report <- checkUniverseAppropriateness(userSets, userUniverse) +# data.frame with columns: +# user_set, total_regions, regions_in_universe, coverage, many_to_many +# Warnings are emitted via warning() for low coverage or many-to-many mappings. +``` + +Warnings fire when coverage is below 50% (severe) or 90% (moderate), or when any user region overlaps more than one universe region. + +### `redefineUserSets` + +Rewrite each user set in terms of universe regions — eliminates many-to-many mapping artifacts. + +```r +redefined <- redefineUserSets(userSets, userUniverse) +# Returns a list of RegionSet objects. + +# Use directly +results <- runLOLA(redefined, userUniverse, regionDB) + +# Or pass redefineUserSets = TRUE to runLOLA to do this automatically +results <- runLOLA(userSets, userUniverse, regionDB, redefineUserSets = TRUE) +``` + +### `buildRestrictedUniverse` + +For differential enrichment: build a universe that is exactly the union of all user sets, disjoined into non-overlapping pieces (R LOLA's `disjoin(unlist(userSets))`): + +```r +restricted <- buildRestrictedUniverse(list(peaks_a, peaks_b, peaks_c)) +# Returns a RegionSet +``` + +## Porting from R LOLA + +Most LOLA scripts port by changing one line — replacing `library(LOLA)` with `library(gtars)`. The core API is compatible: + +| R LOLA | gtars R | notes | +|---|---|---| +| `loadRegionDB()` | `loadRegionDB()` | same signature | +| `runLOLA()` | `runLOLA()` | same signature; returns `data.table` | +| `checkUniverseAppropriateness()` | `checkUniverseAppropriateness()` | same signature | +| `redefineUserSets()` | `redefineUserSets()` | same signature; returns list of `RegionSet` | +| `writeCombinedEnrichment()` | *(use `data.table::fwrite`)* | output table is already in the right format | +| `extractEnrichmentOverlaps()` | *(not implemented)* | file an issue if needed | +| `getRegionFile()` | `getRegionSets(regionDB, index)` | returns `RegionSet`, not `GRanges` — call `as_granges()` to convert | + +The notable differences: + +- **Return type is `data.table`, not `data.frame`.** R LOLA returns a plain data.frame; gtars returns a `data.table` by default. Convert with `as.data.frame()` if you prefer. +- **Faster p-value and odds ratio.** The p-value uses the survival function (no cancellation at small tails), and the odds ratio is the CMLE (not the point estimate). Expect tail significance to be reported more precisely. + +## End-to-end example + +```r +library(gtars) +library(data.table) + +# 1. Load the database once +regionDB <- loadRegionDB("LOLACore/hg38") +cat(sprintf("Loaded %d region sets\n", length(listRegionSets(regionDB)))) + +# 2. Load user sets and universe — polymorphic inputs +peaks_a <- "peaks_condition_a.bed" +peaks_b <- "peaks_condition_b.bed" +universe <- "universe.bed" + +# 3. Universe sanity check +diag <- checkUniverseAppropriateness(list(peaks_a, peaks_b), universe) +print(diag) + +# 4. Run enrichment with auto-redefinition +results <- runLOLA( + list(peaks_a, peaks_b), universe, regionDB, + minOverlap = 1, + redefineUserSets = TRUE, + direction = "enrichment", +) + +# 5. Top hits per user set +for (us in unique(results$userSet)) { + cat(sprintf("\n=== User set %d — top 10 ===\n", us)) + top <- head( + results[userSet == us][order(-pValueLog)], + 10, + ) + print(top[, .(filename, cellType, pValueLog, oddsRatio, support, qValue)]) +} + +# 6. Write to TSV matching R LOLA's writeCombinedEnrichment format +fwrite(results, "lola_results.tsv", sep = "\t") +``` + +## See also + +- **[RegionSet & RegionSetList (R)](regionset.md)** — the S4 classes used for user sets and extracted database sets. +- **[R GenomicDistributions wrappers](genomicdist.md)** — statistics and partition analysis. +- **[R IGD interface](igd.md)** — IGD is the overlap index that backs a `RegionDB`. +- **[gtars-lola](../lola.md)** — Rust reference with the full statistical detail. +- **[R LOLA package](https://www.bioconductor.org/packages/release/bioc/html/LOLA.html)** — the original Bioconductor package this is a port of. diff --git a/docs/gtars/r/regionset.md b/docs/gtars/r/regionset.md new file mode 100644 index 00000000..3303caa2 --- /dev/null +++ b/docs/gtars/r/regionset.md @@ -0,0 +1,208 @@ +# RegionSet & RegionSetList (R) + +The `gtars` R package exposes S4 classes `RegionSet` and `RegionSetList` — the R-idiomatic equivalents of Bioconductor's `GRanges` and `GRangesList`, backed by a high-performance Rust implementation via `externalptr`. + +Unlike most gtars bindings, the R API is **polymorphic on input type**: nearly every function that takes a query will accept a `RegionSet`, a `GRanges` object, a file path, or a `data.frame` interchangeably. This lets you mix existing Bioconductor workflows with gtars without converting explicitly. + +For the underlying semantics see [gtars-core](../core.md) (core types) and [gtars-genomicdist](../genomicdist.md) (statistics, set algebra). + +## Loading the package + +```r +library(gtars) +``` + +## `RegionSet` + +An S4 class representing a set of genomic regions. Internally wraps a Rust `RegionSet` pointer plus an R-side strand vector. + +### Construction + +`RegionSet()` accepts any of these inputs: + +```r +# 1. File path (BED, bed.gz, narrowPeak, etc.) +rs <- RegionSet("peaks.bed") + +# 2. GRanges object (requires GenomicRanges) +library(GenomicRanges) +gr <- GRanges(seqnames = c("chr1", "chr1"), + ranges = IRanges(start = c(100, 500), end = c(200, 600)), + strand = c("+", "-")) +rs <- RegionSet(gr) # coords converted from 1-based closed to 0-based half-open + +# 3. data.frame with chr/start/end (and optional strand) +df <- data.frame(chr = c("chr1", "chr1"), + start = c(100, 500), + end = c(200, 600), + strand = c("+", "-")) +rs <- RegionSet(df) + +# 4. An existing RegionSet (returned as-is) +rs2 <- RegionSet(rs) +``` + +`as_regionset()` is an alias for `RegionSet()`. + +!!! warning "0-based half-open, BED convention" + R gtars uses **0-based half-open** coordinates internally, matching BED convention (not GRanges' 1-based closed convention). Constructor/accessor functions convert automatically when bridging to `GRanges`; if you're building a RegionSet from a data.frame directly, the `start` and `end` columns are expected in 0-based half-open form. + +### Basic S4 methods + +```r +length(rs) # number of regions +show(rs) # pretty-print summary (first 5 regions) +rs # same — calls show() + +as.data.frame(rs) # convert to a data.frame with chr/start/end/strand +rs[1:10] # subset by index (numeric, logical, or character) +rs[rs$strand == "+"] # Note: use as.data.frame(rs)$strand for filtering +``` + +### Converting to `GRanges` + +```r +gr <- as_granges(rs) +# Coordinates converted from 0-based half-open to 1-based closed. +# Strand information is preserved. +``` + +Requires the `GenomicRanges` package to be installed. + +### Statistics methods + +All of these accept `RegionSet`, `GRanges`, file path, or data.frame as input: + +```r +widths(rs) # numeric vector of region widths (end - start) +neighborDistances(rs) # signed gaps between consecutive regions per chromosome +nearestNeighbors(rs) # per-region min neighbor distance +chromosomeStatistics(rs) # named list of per-chromosome stats + +distribution(rs, nBins = 250) # bin counts across the genome +distribution(rs, nBins = 250, chromSizes = hg38) # with reference chrom sizes + +clusterRegions(rs, maxGap = 5000L) # cluster id per region +interPeakSpacing(rs) # spacing stats list +peakClusters(rs, radius_bp = 5000L, min_cluster_size = 2L) +densityVector(rs, chrom_sizes = hg38, nBins = 250L) +densityHomogeneity(rs, chrom_sizes = hg38, nBins = 250L) +``` + +!!! warning "Output length for `neighborDistances` / `nearestNeighbors`" + Both skip chromosomes with only one region (matching R GenomicDistributions). The returned vector is **not aligned 1:1** with the input — it's shorter than `length(rs)` whenever any chromosome has a single peak. + +!!! warning "`nBins` is a target, not a total" + In `distribution(chromSizes = ...)`, `densityVector`, and `densityHomogeneity`, `nBins` is the target bin count for the **longest** chromosome in `chromSizes`. Bin width is derived as `max(chromSizes) %/% nBins` (floored, minimum 1 bp), and every chromosome is tiled at the same bp width — so shorter chromosomes get proportionally fewer bins. The total bin count returned is `sum(ceiling(chrom_size / bin_width))`, which can substantially exceed `nBins` when `chromSizes` has many entries. To target a specific bin width in bp instead, pass `nBins = max_chrom_len %/% desired_bp`. + +### Interval set algebra + +R gtars overrides the standard Bioconductor generics (`union`, `intersect`, `setdiff`, `reduce`, `promoters`, `shift`, `flank`, `resize`, `narrow`, `disjoin`, `gaps`, `findOverlaps`, `countOverlaps`) plus adds gtars-specific ones (`trim`, `pintersect`, `concat`, `jaccard`). + +```r +merged <- reduce(rs) +trimmed <- trim(rs, chromSizes = hg38) +proms <- promoters(rs, upstream = 2000L, downstream = 200L) +shifted <- shift(rs, shift = 100L) +resized <- resize(rs, width = 500L, fix = "center") +disjoint <- disjoin(rs) +gapped <- gaps(rs, chrom_sizes = hg38) + +# Binary operations +u <- union(rs1, rs2) +i <- intersect(rs1, rs2) +d <- setdiff(rs1, rs2) +pi <- pintersect(rs1, rs2) # pairwise (by index) +c <- concat(rs1, rs2) + +j <- jaccard(rs1, rs2) # scalar Jaccard similarity + +# Overlap queries +hits <- findOverlaps(rs1, rs2) +counts <- countOverlaps(rs1, rs2) +``` + +Methods dispatch on `("RegionSet", "RegionSet")`, `("RegionSet", "ANY")`, and `("ANY", "RegionSet")` so you can mix in `GRanges`, file paths, or data.frames on either side: + +```r +# Works — second argument is a file path +j <- jaccard(rs, "other_peaks.bed") + +# Works — first argument is a GRanges +merged <- union(gr, rs) +``` + +### Consensus regions + +```r +cons <- consensus(list(rep1_rs, rep2_rs, rep3_rs)) +# Returns a data.frame with chr, start, end, count columns. +# 'count' is the number of input sets overlapping each union region. + +# Keep regions present in ≥ 2/3 replicates +robust <- cons[cons$count >= 2, ] +``` + +## `RegionSetList` + +An S4 class for collections of `RegionSet`s — the gtars equivalent of `GRangesList`. Provides the single most efficient path for operating on many region sets at once, since pointers are passed between R and Rust without copying. + +### Construction + +```r +# Variadic: any mix of RegionSet / file path / GRanges / data.frame +rsl <- RegionSetList( + RegionSet("rep1.bed"), + RegionSet("rep2.bed"), + RegionSet("rep3.bed"), +) + +# Equivalent: single list argument +rsl <- RegionSetList(list(rs1, rs2, rs3)) + +# Or directly from file paths — each is auto-wrapped via RegionSet() +rsl <- RegionSetList("rep1.bed", "rep2.bed", "rep3.bed") + +# Empty list +empty <- RegionSetList() +``` + +### S4 methods + +```r +length(rsl) # number of region sets +show(rsl) # pretty-print with sizes +rsl[[1]] # extract a single RegionSet by index (1-based) +names(rsl) # character vector of names, or NULL +``` + +### Operations + +```r +# Flatten into a single RegionSet (no merge/dedup) +flat <- concat(rsl) +# Apply reduce() on the result if you want the union +merged_all <- reduce(concat(rsl)) + +# Full N x N Jaccard similarity matrix +jac <- pairwise_jaccard(rsl) +# Returns a symmetric numeric matrix with 1.0 on the diagonal. +``` + +## Coordinate-system notes + +| system | representation | gtars handling | +|---|---|---| +| BED / gtars | 0-based half-open | **internal** — always used by `RegionSet` | +| GRanges / IRanges | 1-based closed | converted in both directions by `RegionSet(gr)` / `as_granges(rs)` | +| data.frame input | 0-based half-open | passed through as-is — make sure your `start`/`end` columns are BED-style | + +If you're working entirely in GRanges-land, use `as_granges()` to convert back before handing off to Bioconductor workflows. If you're reading BED files and writing results back to BED, stay in `RegionSet` to avoid double conversion. + +## See also + +- **[R GenomicDistributions wrappers](genomicdist.md)** — `calcWidth`, `calcGCContent`, `calcPartitions`, etc. — the drop-in API for users migrating from R GenomicDistributions. +- **[R LOLA interface](lola.md)** — `loadRegionDB`, `runLOLA`, `checkUniverseAppropriateness`, etc. +- **[R IGD interface](igd.md)** — `igd_create` / `igd_search` for low-level IGD access. +- **[gtars-core](../core.md)** — the underlying Rust types. +- **[Core models tour](../regionSet.md)** — Python + Rust walkthrough of the same concepts. diff --git a/docs/gtars/regionSet.md b/docs/gtars/regionSet.md index 3c392227..62af0ab6 100644 --- a/docs/gtars/regionSet.md +++ b/docs/gtars/regionSet.md @@ -135,3 +135,62 @@ rs.to_bigbed("path/to/save/bedfile.bb", chrom_sizes="path/to/chrom.sizes") !!! info - Detailed documentation for RegionSet is available in the [API reference](https://docs.rs/gtars-core/latest/gtars_core/models/region_set/). + +### 🟢 RegionSetList + +`RegionSetList` is the gtars equivalent of Bioconductor's `GRangesList` — an ordered collection of `RegionSet`s with optional names. It's the type downstream crates (genomicdist, lola) use to pass multiple region sets across FFI boundaries without paying N×clone costs. + +#### Example + +=== "Python" + ```python + from gtars.models import RegionSet, RegionSetList + + rs1 = RegionSet("rep1.bed") + rs2 = RegionSet("rep2.bed") + rs3 = RegionSet("rep3.bed") + + # Unnamed + rsl = RegionSetList([rs1, rs2, rs3]) + + # Or with names + rsl = RegionSetList([rs1, rs2, rs3], names=["rep1", "rep2", "rep3"]) + + print(len(rsl)) # number of sets + combined = rsl.concat() # flatten into a single RegionSet (no merge) + set_id = rsl.identifier() # stable order-independent identifier + ``` + +=== "Rust" + ```rust + use gtars_core::models::{RegionSet, RegionSetList}; + + let rs1 = RegionSet::try_from("rep1.bed")?; + let rs2 = RegionSet::try_from("rep2.bed")?; + let rs3 = RegionSet::try_from("rep3.bed")?; + + let rsl = RegionSetList::with_names( + vec![rs1, rs2, rs3], + vec!["rep1".into(), "rep2".into(), "rep3".into()], + ); + + // Iterate + for rs in &rsl { + println!("{} regions", rs.len()); + } + + // Flatten all regions into a single RegionSet (no merge/dedup) + let combined = rsl.concat(); + let id = rsl.identifier(); + # Ok::<(), gtars_core::errors::RegionSetError>(()) + ``` + +`RegionSetList::try_from` in Rust also accepts a **bedset manifest file** (text file listing one BED path per line) or a `Vec<&Path>` / `Vec<&str>` / `Vec` / `Vec`. + +`concat()` flattens without merging; if you need a reduced union, call `.reduce()` on the result — that method comes from the `IntervalRanges` trait in [gtars-genomicdist](genomicdist.md). + +## See also + +- **[gtars-core](core.md)** — the canonical Rust API reference for `Region`, `RegionSet`, `RegionSetList`, `Interval`, `Fragment`, `CoordinateMode`, and `RegionSetError`. +- **[gtars-genomicdist](genomicdist.md)** — the `IntervalRanges` and `GenomicIntervalSetStatistics` traits that extend `RegionSet` with set-algebra and summary stats. +- **[gtars-lola](lola.md)** — LOLA enrichment, which consumes `RegionSetList` for user-set and database-set inputs. diff --git a/docs/gtars/wasm.md b/docs/gtars/wasm.md index 26caaca9..9da2b0b3 100644 --- a/docs/gtars/wasm.md +++ b/docs/gtars/wasm.md @@ -1,61 +1,89 @@ # gtars Wasm/JS -We provide WebAssembly (Wasm) bindings for gtars functionality, enabling genomic interval analysis in JavaScript/TypeScript environments. This allows for client-side processing of genomic data without the need for server-side computation. +WebAssembly bindings for gtars, enabling genomic interval analysis directly in browsers and Node — no server round-trip required. The package is `@databio/gtars-js`; the underlying Rust crate is `gtars-wasm`. ## Features -- Zero-installation genomic tools -- Client-side processing (no server required) -- JavaScript/TypeScript interoperability +- Zero-installation genomic tools running client-side. +- Full `RegionSet` interval algebra and summary statistics. +- Genomic distribution analysis: partitions, signal matrices, density vectors, consensus regions. +- LOLA enrichment testing with in-memory region databases. +- TypeScript-friendly — the published package ships `.d.ts` declarations generated from the Rust source. ## Installation -You can install the gtars Wasm package via npm: ```bash npm install @databio/gtars-js ``` -## Usage -We don't currently provide full feature parity bindings, but some functionality is available. +## Quick start -```typescript -import init, { Overlapper } from 'gtars-js'; +Every Wasm entry point needs the module to be initialized once before use. In ESM environments with top-level await, that's a single line: + +```ts +import init, { RegionSet, Overlapper } from '@databio/gtars-js'; await init(); -import { Overlapper } from '@databio/gtars-js'; +const peaks = new RegionSet([ + ['chr1', 100, 200, 'peak1'], + ['chr2', 150, 250, 'peak2'], + ['chr3', 300, 400, 'peak3'], +]); + +console.log(`${peaks.numberOfRegions} regions, mean width ${peaks.meanRegionWidth}`); +``` -const universe = [ - ['chr1', 100, 200], - ['chr1', 150, 250], - ['chr2', 300, 400], -] +## Submodule pages -const overlapper = new Overlapper(universe, 'ailist'); -console.log(`Using backend: ${overlapper.get_backend()}`); +| page | covers | +|---|---| +| [**Overlappers**](wasm/overlappers.md) | `Overlapper` — low-level interval overlap engine with AIList / BITS backends. | +| [**RegionSet & models**](wasm/regionset.md) | `RegionSet`, `RegionSetList`, `ConsensusBuilder`, `ChromosomeStatistics`, `RegionDistribution`, `BedClassificationOutput`, and all the per-`RegionSet` statistics and interval algebra methods. | +| [**Partitions**](wasm/partitions.md) | `GeneModel`, `PartitionList`, `calcPartitions`, `calcExpectedPartitions` — classify regions into promoter / UTR / exon / intron / intergenic. | +| [**Signal matrix**](wasm/signal.md) | `SignalMatrix` and `calcSummarySignal` — overlay region sets on region × condition signal matrices. | +| [**LOLA enrichment**](wasm/lola.md) | `LolaRegionDB`, `runLOLA`, `checkUniverseAppropriateness` — Fisher's exact test enrichment against a database of reference region sets. | -const query = ['chr1', 180, 220]; -const overlaps = overlapper.find(query); +## Limitations vs. the Rust/Python APIs -console.log(`Overlaps for ${query}:`, overlaps); -``` +Wasm is a constrained runtime — a few crate features are not exposed: -## Integration +- **No filesystem loaders.** `GeneModel::from_gtf`, `GeneModel::from_bed_files`, `SignalMatrix::from_tsv`, and `RegionDB::from_lola_folder` are not available. Everything is constructed from in-memory JS data or from packed binary buffers (e.g. `.sigm`, `.fab`, `.gda`) fetched over the network. +- **No `redefine_user_sets` / `build_restricted_universe`.** These LOLA universe helpers are Rust-only at the moment; use `checkUniverseAppropriateness` for diagnostics and replicate the set-algebra logic client-side with `RegionSetList` operations if you need the rewriting behavior. +- **Refget is exposed as a separate module** — see the refget binding docs (currently being written). -### An example React component -```jsx +## Integration example — React + +```tsx import { useEffect, useState } from 'react'; -import init, { TreeTokenizer } from 'gtars-js'; +import init, { RegionSet, ConsensusBuilder } from '@databio/gtars-js'; -function TokenizerComponent() { - const [tokenizer, setTokenizer] = useState(null); +function ConsensusComponent({ replicates }: { replicates: Array<[string, number, number, string][]> }) { + const [ready, setReady] = useState(false); + const [robust, setRobust] = useState(null); useEffect(() => { - init().then(() => { - setTokenizer(new TreeTokenizer()); - }); + init().then(() => setReady(true)); }, []); - // Use tokenizer... + useEffect(() => { + if (!ready) return; + const cb = new ConsensusBuilder(); + for (const rep of replicates) { + cb.add(new RegionSet(rep)); + } + const cons = cb.compute(); + setRobust(cons.filter((r) => r.count >= 2).length); + }, [ready, replicates]); + + if (!ready) return

Loading gtars-js…

; + return

{robust} regions present in ≥ 2 replicates

; } ``` + +## Performance notes + +- `RegionSet` construction sorts on load. Repeated operations don't re-sort. +- Overlap queries (`jaccard`, `setdiff`, `intersect`, etc.) route through the AIList index under the hood, giving O(log n) per-query lookups. +- Statistics methods like `densityVector` and `peakClusters` are implemented in Rust and called via zero-copy boundary — they're substantially faster than pure-JS equivalents for typical peak-set sizes. +- For very large numeric outputs, the Wasm glue converts between Rust `Vec` and JS arrays — if you're hitting GC pressure, consider batching calls or keeping `RegionSet` handles around instead of returning raw arrays. diff --git a/docs/gtars/wasm/lola.md b/docs/gtars/wasm/lola.md new file mode 100644 index 00000000..88c87b30 --- /dev/null +++ b/docs/gtars/wasm/lola.md @@ -0,0 +1,190 @@ +# LOLA enrichment (Wasm) + +Wasm bindings for [gtars-lola](../lola.md) — LOLA (Locus Overlap Analysis) enrichment testing, runnable entirely in the browser. Given a user set of regions, a universe, and a region database (built in memory from API-served BED data), computes Fisher's exact test enrichment of the user set against every database region set. + +See the [Rust gtars-lola reference](../lola.md) for the full statistical details — Fisher's exact test via hypergeometric survival function, CMLE odds ratio, contingency table semantics, and compatibility notes with R LOLA. This page is a Wasm-focused binding reference. + +!!! note "In-memory only — no folder loading" + The Wasm binding's `LolaRegionDB` is built entirely from in-memory data (`{regions, name}` entries passed in at construction). The Rust `RegionDB::from_lola_folder` loader is not available in Wasm — it requires a filesystem. In practice you'd fetch the region database from an API and pass the regions directly into the constructor, or build it from a pre-indexed IGD on the server side. + +## Import + +```ts +import init, { + LolaRegionDB, + runLOLA, + checkUniverseAppropriateness, +} from '@databio/gtars-js'; + +await init(); +``` + +## `LolaRegionDB` + +### Construction + +Construct from an array of `{ regions, name }` objects, where `regions` is an array of `[chr, start, end]` tuples and `name` is the filename-like identifier: + +```ts +const dbEntries = [ + { + name: 'encode_k562_ctcf.bed', + regions: [ + ['chr1', 100, 200], + ['chr1', 400, 500], + ['chr2', 1000, 1100], + ], + }, + { + name: 'encode_hela_ctcf.bed', + regions: [ + ['chr1', 150, 250], + ['chr2', 1050, 1150], + ], + }, +]; + +const db = new LolaRegionDB(dbEntries); +``` + +The constructor builds an IGD overlap index internally and wraps it with minimal annotation (filename only). To attach richer per-file metadata — cell type, tissue, antibody, etc. — build the annotated `RegionDB` on the Rust side and serve it through a different transport (e.g. a custom binary format), or expose a method that accepts `RegionSetAnno` objects alongside the regions. + +### Accessors + +```ts +db.numRegionSets // number +db.listRegionSets() // string[] — filenames +db.collectionAnno // Array of { collectionname, collector, date, source, description } + +// Extract region sets by 0-based index as a RegionSetList +// (pass null for "all sets"; names are populated from filenames) +const rsl = db.getRegionSets(null); +const rsl2 = db.getRegionSets([0, 5, 12]); +``` + +`getRegionSets()` returns a [`RegionSetList`](regionset.md#regionsetlist) with `.names` populated from the database filenames — the one path in Wasm where a `RegionSetList` has non-null names. + +## `runLOLA` + +Runs Fisher's exact test for every `(user_set, db_set)` pair and returns a column-oriented object suitable for dropping directly into a DataFrame or table. + +```ts +runLOLA( + userSets: Array>, + universe: Array<[string, number, number]>, + regionDb: LolaRegionDB, + minOverlap?: number, // default 1 + direction?: string, // "enrichment" (default) | "depletion" +): LolaResultColumnar +``` + +**Parameters:** +- `userSets` — **array of user sets**, where each user set is an array of `[chr, start, end]` tuples. Pass a single-element outer array `[peaks]` for a one-user-set analysis. +- `universe` — a single array of `[chr, start, end]` tuples representing the background. +- `regionDb` — a `LolaRegionDB`. +- `minOverlap` — minimum bp overlap to count as overlapping (default 1). +- `direction` — `"enrichment"` (default, P(X ≥ a), alternative "greater") or `"depletion"` (P(X ≤ a), alternative "less"). The strings `"greater"` / `"less"` are accepted as aliases on the Python side but Wasm only recognizes `"enrichment"` / `"depletion"` explicitly; anything else falls back to enrichment. + +**Returns** an object with parallel arrays (one entry per `(user_set, db_set)` pair) using camelCase keys: + +| key | type | meaning | +|---|---|---| +| `userSet` | `number[]` | 0-based user-set index | +| `dbSet` | `number[]` | 0-based db-set index | +| `collection` | `(string \| null)[]` | per-file annotation, usually null in Wasm | +| `pValueLog` | `number[]` | `-log10(p)` from Fisher's exact test, capped at ~322 | +| `oddsRatio` | `number[]` | CMLE odds ratio (matches R `fisher.test()$estimate`) | +| `support` | `number[]` | overlap count between user set and db set | +| `rnkPv`, `rnkOr`, `rnkSup` | `number[]` | 1-based per-metric ranks within the user set | +| `maxRnk` | `number[]` | max of the three ranks | +| `meanRnk` | `number[]` | mean of the three ranks | +| `b`, `c`, `d` | `number[]` | signed contingency values (can be negative) | +| `qValue` | `(number \| null)[]` | BH-adjusted p-value (applied automatically inside `runLOLA`) | +| `description`, `cellType`, `tissue`, `antibody`, `treatment`, `dataSource` | `(string \| null)[]` | per-file metadata | +| `filename` | `string[]` | db set file name | +| `size` | `number[]` | number of regions in the db set | + +Like the Python binding, Wasm `runLOLA` **auto-applies** `annotate_results` + `apply_fdr_correction` internally — you don't need to call them separately. Rows come back sorted by descending `pValueLog`, then ascending `meanRnk` (matching R LOLA output order). + +### Usage example + +```ts +import init, { + LolaRegionDB, + runLOLA, + RegionSet, +} from '@databio/gtars-js'; + +await init(); + +// Helper to convert a RegionSet to the tuple form LOLA expects +function toTuples(rs: RegionSet): Array<[string, number, number]> { + // Iterate rs and pull chr/start/end — or keep the original BedEntries + // if you still have them in scope from constructing the RegionSet. + // ... +} + +// 1. Build the region database from API-fetched data +const dbEntries = await fetchDatabaseRegions(); // [{ name, regions }, ...] +const db = new LolaRegionDB(dbEntries); + +// 2. Prepare user sets and universe as tuple arrays +const userSets = [userPeakTuples]; +const universe = universeTuples; + +// 3. Run enrichment +const results = runLOLA(userSets, universe, db, 1, 'enrichment'); + +// 4. Pivot to row-oriented view for rendering +const n = results.pValueLog.length; +for (let i = 0; i < n; i++) { + console.log( + `${results.filename[i]} ` + + `p=${results.pValueLog[i].toFixed(2)} ` + + `OR=${results.oddsRatio[i].toFixed(2)} ` + + `support=${results.support[i]}` + ); +} +``` + +!!! note "Negative contingency values" + If your user set contains regions *outside* the universe, the contingency table produces negative `b`/`c`/`d`. The binding matches R LOLA's behavior — it logs a warning, stores the signed values, and returns `pValueLog = 0.0`, `oddsRatio = NaN` for that row. Pre-process with `checkUniverseAppropriateness` (below) to detect this before running enrichment. + +## `checkUniverseAppropriateness` + +Diagnostic check — for each user set, reports what fraction of user-set regions overlap the universe and whether there are many-to-many mappings. + +```ts +const report = checkUniverseAppropriateness(userSets, universe); +// Returns a column-oriented object: +// { +// userSet: number[], +// totalRegions: number[], +// regionsInUniverse: number[], +// coverage: number[], // 0.0 to 1.0 per user set +// manyToMany: number[], // count of user regions overlapping >1 universe region +// warnings: string[] // free-form warning messages, pooled across user sets +// } + +const n = report.userSet.length; +for (let i = 0; i < n; i++) { + console.log( + `user set ${report.userSet[i]}: ` + + `${(report.coverage[i] * 100).toFixed(1)}% coverage, ` + + `${report.manyToMany[i]} many-to-many` + ); +} +for (const w of report.warnings) { + console.warn('⚠', w); +} +``` + +Warnings fire when coverage drops below 50% (severe) or 90% (moderate), or when any user region overlaps more than one universe region. + +!!! note "Rust-only helpers not exposed in Wasm" + The Rust universe module has two additional helpers — `redefine_user_sets` (rewrite user sets in terms of universe regions) and `build_restricted_universe` (disjoint union of all user sets, for differential analysis). Neither is currently exposed in the Wasm binding. If you need them, you can replicate the logic client-side using `RegionSetList` operations from [wasm/regionset](regionset.md), or request a binding. + +## See also + +- **[wasm/regionset](regionset.md)** — `RegionSet`, `RegionSetList`, and the `ConsensusBuilder` used upstream of LOLA analyses. +- **[gtars-lola](../lola.md)** — full Rust API reference, including the contingency table math, p-value computation, and CMLE odds ratio details. diff --git a/docs/gtars/wasm/partitions.md b/docs/gtars/wasm/partitions.md new file mode 100644 index 00000000..d86074aa --- /dev/null +++ b/docs/gtars/wasm/partitions.md @@ -0,0 +1,152 @@ +# Partitions (Wasm) + +Wasm bindings for the gtars-genomicdist partition system — classify query regions into promoter / UTR / exon / intron / intergenic buckets given a gene model. Everything is in-memory; there is no GTF loader on the JS side (because GTF loading requires filesystem access), so you either build a `GeneModel` from pre-parsed region arrays or lift one out of a `GenomicDistAnnotation` served from the API. + +See [gtars-genomicdist → Partitions](../genomicdist.md#partitions) for the full semantics — priority order, mutually-exclusive vs. bp-proportion modes, R GenomicDistributions divergences, and the chi-square p-value caveat. + +## Import + +```ts +import init, { + GeneModel, + PartitionList, + calcPartitions, + calcExpectedPartitions, +} from '@databio/gtars-js'; + +await init(); +``` + +## `GeneModel` + +Constructed from four JS arrays of region tuples: `genes`, `exons`, and optional `three_utr` / `five_utr`. Each region tuple is `[chr, start, end]` or `[chr, start, end, strand]`. When strand is provided, `+`/`-` become stranded; any other character becomes `Unstranded`. + +```ts +const genes = [ + ['chr1', 1000, 5000, '+'], + ['chr1', 8000, 12000, '-'], +]; +const exons = [ + ['chr1', 1000, 2000, '+'], + ['chr1', 4500, 5000, '+'], + ['chr1', 11500, 12000, '-'], +]; +const threeUtr = [['chr1', 4800, 5000, '+']]; +const fiveUtr = [['chr1', 1000, 1100, '+']]; + +// Any of three_utr / five_utr can be null if you don't have UTR annotations +const model = new GeneModel(genes, exons, threeUtr, fiveUtr); +``` + +!!! note "No GTF loader in Wasm" + The Rust `GeneModel::from_gtf` and `GeneModel::from_bed_files` methods are **not** available in Wasm — they require filesystem access. In the browser, typically: + + - Fetch a pre-built `.gda` (Genomic Dist Annotation) binary from the BEDbase API and decode it, **or** + - Parse regions yourself from JSON / TSV over the network and pass them into the `GeneModel` constructor. + +## `PartitionList` + +Build an ordered, priority-based partition list from a `GeneModel`. Partition priority order is hard-coded: `promoterCore` → `promoterProx` → `threeUTR` → `fiveUTR` → `exon` → `intron` (with `intergenic` added implicitly by `calcPartitions`). + +```ts +const chromSizes = { chr1: 248956422 }; + +const pl = PartitionList.fromGeneModel( + model, + 100, // core promoter size (bp upstream of gene start) + 2000, // proximal promoter size + chromSizes, // optional — pass null to skip boundary trimming +); +``` + +Pass `chromSizes: null` if you don't need promoter trimming at chromosome boundaries. + +## `calcPartitions` + +Classify query regions into partition categories. Two modes: + +- **Priority mode** (`bpProportion = false`): each query region is assigned to the first partition it overlaps; everything else goes to `intergenic`. Mutually exclusive, one region → one partition. +- **BP proportion mode** (`bpProportion = true`): for each partition, computes the total overlapping base pairs of the query set. **Not** mutually exclusive — a region that straddles two partitions contributes bp to each. + +```ts +const peaks: RegionSet = /* ... */; + +// Priority mode (region counts) +const result = calcPartitions(peaks, pl, false); +// { +// partitions: [{ name: 'promoterCore', count: 42 }, ...], +// total: 523 +// } + +for (const { name, count } of result.partitions) { + const pct = (count / result.total) * 100; + console.log(`${name.padEnd(15)} ${String(count).padStart(6)} (${pct.toFixed(1)}%)`); +} +``` + +## `calcExpectedPartitions` + +Observed vs. expected partition enrichment with a chi-square test. Requires chromosome sizes to compute the expected fraction based on each partition's share of the genome. + +```ts +const chromSizes = { chr1: 248956422, chr2: 242193529 }; + +const expected = calcExpectedPartitions(peaks, pl, chromSizes, false); +// Array of { partition, observed, expected, log10Oe, pvalue } + +for (const row of expected) { + console.log( + `${row.partition.padEnd(15)} ` + + `obs=${row.observed.toFixed(0).padStart(6)} ` + + `exp=${row.expected.toFixed(1).padStart(10)} ` + + `log10(O/E)=${row.log10Oe >= 0 ? '+' : ''}${row.log10Oe.toFixed(2)} ` + + `p=${row.pvalue.toExponential(2)}` + ); +} +``` + +Result fields use camelCase: `partition`, `observed`, `expected`, `log10Oe`, `pvalue`. + +!!! warning "p-values will differ slightly from R GenomicDistributions" + `calcExpectedPartitions` uses a goodness-of-fit `(O-E)²/E` formula. R's `chisq.test()` computes a 2×2 test of independence with optional Yates correction, so p-values will not match R output byte-for-byte. The `log10Oe` column is directly comparable. + +## End-to-end example + +```ts +import init, { + RegionSet, + GeneModel, + PartitionList, + calcExpectedPartitions, +} from '@databio/gtars-js'; + +await init(); + +// 1. Build query RegionSet from BED-like entries +const peaks = new RegionSet([ + ['chr1', 1500, 1800, ''], + ['chr1', 4600, 4900, ''], + ['chr1', 11700, 12000, ''], +]); + +// 2. Build GeneModel — in a real app you'd fetch annotations from an API +const model = new GeneModel( + [['chr1', 1000, 5000, '+'], ['chr1', 8000, 12000, '-']], + [['chr1', 1000, 2000, '+'], ['chr1', 4500, 5000, '+'], ['chr1', 11500, 12000, '-']], + null, + null, +); + +const chromSizes = { chr1: 248956422 }; +const pl = PartitionList.fromGeneModel(model, 100, 2000, chromSizes); + +// 3. Compute observed vs expected partition enrichment +const enriched = calcExpectedPartitions(peaks, pl, chromSizes, false); +console.table(enriched); +``` + +## See also + +- **[wasm/regionset](regionset.md)** — the `RegionSet` type used as the query input. +- **[wasm/signal](signal.md)** — signal matrix overlap. +- **[gtars-genomicdist → Partitions](../genomicdist.md#partitions)** — Rust reference with the priority-order and mutually-exclusive-vs-bp-proportion semantics. diff --git a/docs/gtars/wasm/regionset.md b/docs/gtars/wasm/regionset.md new file mode 100644 index 00000000..2531d1b1 --- /dev/null +++ b/docs/gtars/wasm/regionset.md @@ -0,0 +1,280 @@ +# RegionSet & models (Wasm) + +The `@databio/gtars-js` Wasm bindings expose gtars' core interval-analysis types directly to JavaScript and TypeScript, with no server round-trip. This page covers everything in the `regionset` Wasm module — `RegionSet`, `RegionSetList`, `ConsensusBuilder`, and the result types for statistics and BED classification. + +See the Rust reference pages for the underlying semantics: + +- [gtars-core](../core.md) — `Region`, `RegionSet`, `RegionSetList` +- [gtars-genomicdist](../genomicdist.md) — the stats, interval algebra, and consensus/classifier methods that `RegionSet` gains in Wasm + +## Import + +```ts +import init, { + RegionSet, + RegionSetList, + ConsensusBuilder, +} from '@databio/gtars-js'; + +await init(); // initializes the wasm module; top-level await in ESM environments +``` + +## `RegionSet` + +A sorted collection of genomic regions constructed from an array of `[chr, start, end, rest]` tuples. Construction sorts by `(chr, start)`. + +### Construction + +```ts +// BED-like entries: [chr, start, end, rest] +// `rest` is any trailing BED metadata — pass "" if you have BED3 data +const entries: [string, number, number, string][] = [ + ['chr1', 100, 200, 'peak1'], + ['chr1', 300, 400, 'peak2'], + ['chr2', 500, 600, 'peak3'], +]; + +const rs = new RegionSet(entries); +``` + +### Properties + +```ts +rs.numberOfRegions // number +rs.meanRegionWidth // number +rs.nucleotidesLength // number (total bp) +rs.identifier // string — canonical MD5 id over first 3 columns +rs.firstRegion // string — debug repr of the first region +rs.classify // BedClassificationOutput (see below) +``` + +### Statistics + +```ts +// Per-chromosome summary stats — returns { [chr]: ChromosomeStatistics } +const stats = rs.chromosomeStatistics(); + +// Region widths (end - start) +const widths: number[] = rs.calcWidths(); + +// Signed gaps between consecutive regions (positive gaps only) +const gaps: number[] = rs.calcNeighborDistances(); + +// Per-region min-neighbor distance +const nn: number[] = rs.calcNearestNeighbors(); + +// Descriptive summary of inter-peak spacing +const spacing = rs.interPeakSpacing(); +// { n_gaps, mean, median, std, iqr, log_mean, log_std } + +// Single-linkage clusters at a stitching radius; returns a cluster id per region +const clusterIds: number[] = rs.cluster(5_000); + +// Cluster-level summary at a stitching radius (default min_cluster_size = 2) +const clusters = rs.peakClusters(5_000, 2); +// { radius_bp, n_clusters, n_clustered_peaks, mean_cluster_size, max_cluster_size, fraction_clustered } +``` + +!!! warning "Output length for `calcNeighborDistances` / `calcNearestNeighbors`" + Both methods **skip chromosomes with only one region** (matching R GenomicDistributions). The returned array is therefore **not aligned 1:1 with input regions** — it's shorter than `rs.numberOfRegions` whenever any chromosome has a single peak. No sentinel values are emitted. + +### Region distribution and density + +```ts +// Bin counts for plotting; chrom_sizes is optional +const chromSizes = { chr1: 248956422, chr2: 242193529 }; + +// With chrom_sizes → reference-aligned bins (comparable across files) +const dist = rs.regionDistribution(250, chromSizes); + +// Without chrom_sizes → bins derived from observed max end (NOT comparable) +const distLocal = rs.regionDistribution(250, null); +// Array of { chr, start, end, n, rid } + +// Dense zero-padded per-window count vector (ML-ready feature vector) +const density = rs.densityVector(chromSizes, 250); +// { n_bins, bin_width, counts: number[], chrom_offsets: [chr, start][] } + +// Summary of density vector +const homogeneity = rs.densityHomogeneity(chromSizes, 250); +// { bin_width, n_windows, n_nonzero_windows, mean_count, variance, cv, gini } +``` + +!!! warning "`n_bins` is a target, not a total" + `n_bins` is the target bin count for the **longest** chromosome in `chrom_sizes`. Bin width is derived from that, and every chromosome is tiled at the same bp width — so the total window count `counts.length` equals `sum(ceil(size / bin_width))` over `chrom_sizes`, which can substantially exceed `n_bins`. + +### Interval set algebra + +All operations return a new `RegionSet`. Metadata (`rest` fields) is dropped by operations that merge or synthesize new intervals. + +```ts +const chromSizes = { chr1: 248956422 }; + +const trimmed = rs.trim(chromSizes); +const merged = rs.reduce(); +const disjoint = rs.disjoin(); +const proms = rs.promoters(2000, 0); +const shifted = rs.shift(100); +const resized = rs.resize(500, 'center'); +const flanked = rs.flank(1000, true, false); +const narrowed = rs.narrow(100, 200, null); +const gapped = rs.gaps(chromSizes); + +const diff = rs.setdiff(other); +const pInter = rs.pintersect(other); +const inter = rs.intersect(other); +const combined = rs.concat(other); +const unioned = rs.union(other); + +const jac: number = rs.jaccard(other); +``` + +## `RegionSetList` + +A collection of named `RegionSet`s — the gtars equivalent of Bioconductor's `GRangesList`. Provides indexed pairwise operations without copying whole `RegionSet`s on every call. + +### Construction + +```ts +// Empty builder + add() +const rsl = new RegionSetList(); +rsl.add(rs1, 'rep1'); +rsl.add(rs2, 'rep2'); +rsl.add(rs3, 'rep3'); + +// Or build directly from BED entries for multiple sets at once +const rsl2 = RegionSetList.fromEntries( + [ + [['chr1', 100, 200, ''], ['chr1', 300, 400, '']], // set 1 + [['chr1', 150, 250, '']], // set 2 + [['chr2', 500, 600, '']], // set 3 + ], + ['rep1', 'rep2', 'rep3'], // names; pass null to leave unnamed +); +``` + +### Accessors + +```ts +rsl.length // number of sets +const rs = rsl.get(0); // RegionSet at index (throws on out-of-range) +rsl.names // string[] or null +const flat = rsl.concat(); // flatten into a single RegionSet (no merge) +``` + +### Indexed pair operations + +These let you compute operations on pairs by index without shuttling full `RegionSet`s across the JS/Wasm boundary — useful for building N×N analysis grids: + +```ts +const n = rsl.regionCount(0); // regions in set 0 +const pCount = rsl.pintersectCount(0, 1); // pairwise intersect count +const jac = rsl.jaccardAt(0, 1); // Jaccard between sets 0 and 1 +const un = rsl.unionAt(0, 1); // union as a new RegionSet +const diff = rsl.setdiffAt(0, 1); // set 0 minus set 1 + +const except = rsl.unionExcept(2); // union of everything but index 2 +``` + +### Bulk operations + +O(n) N-way operations via prefix/suffix arrays. + +```ts +// Union of all sets +const unionAll = rsl.unionAll(); + +// Intersection of all sets +const interAll = rsl.intersectAll(); + +// Prefix/suffix-based leave-one-out: all N "union except i" results in O(n) unions +const bulk = rsl.bulkUnionExcept(); +// { union_regions, union_nucleotides, except_unique: number[] } +// except_unique[i] = number of regions unique to set i vs. the union of all others +``` + +### Pairwise Jaccard matrix + +```ts +const result = rsl.pairwiseJaccard(); +// { matrix: number[][], names: string[] | null } +// +// matrix is symmetric with 1.0 on the diagonal — one row per set, one col per set. +``` + +## `ConsensusBuilder` + +Builder pattern for consensus region analysis — given N input region sets, compute the reduced union annotated with per-region replicate support. + +```ts +import { ConsensusBuilder } from '@databio/gtars-js'; + +const cb = new ConsensusBuilder(); +cb.add(rep1); +cb.add(rep2); +cb.add(rep3); + +const consensus = cb.compute(); +// Array of { chr, start, end, count } + +// Keep regions present in ≥ 2/3 replicates +const robust = consensus.filter(r => r.count >= 2); +console.log(`${robust.length} robust regions`); +``` + +`count` is the number of input **sets** (not regions) that overlap each union region. + +## Result types + +### `ChromosomeStatistics` + +Returned by `rs.chromosomeStatistics()` → `{ [chr: string]: ChromosomeStatistics }`. + +| field | type | +|---|---| +| `chromosome` | `string` | +| `number_of_regions` | `number` | +| `start_nucleotide_position` | `number` — leftmost start | +| `end_nucleotide_position` | `number` — rightmost end | +| `minimum_region_length`, `maximum_region_length` | `number` | +| `mean_region_length`, `median_region_length` | `number` | + +Fields are exposed as getters on a Wasm-bound class. + +### `RegionDistribution` + +Entries in the array returned by `rs.regionDistribution(n_bins, chrom_sizes)`. + +| field | type | +|---|---| +| `chr` | `string` | +| `start`, `end`, `n`, `rid` | `number` | + +`n` is the count of regions whose midpoint falls in the bin; `rid` is the bin's row index within its chromosome. + +### `BedClassificationOutput` + +Returned by the `rs.classify` getter. Identifies the BED/ENCODE subtype based on column analysis (uses the `bedclassifier` feature, enabled by default in the Wasm build). + +| field | type | +|---|---| +| `bed_compliance` | `string` | +| `data_format` | `string` — e.g. `"UcscBed"`, `"EncodeNarrowPeak"`, etc. | +| `compliant_columns`, `non_compliant_columns` | `number` | + +### `SpacingStats`, `ClusterStats`, `DensityVector`, `DensityHomogeneity` + +Returned as plain JS objects (via `serde_wasm_bindgen`). Field semantics are identical to the Rust types in [gtars-genomicdist](../genomicdist.md#statistics). Full reference: + +- **`SpacingStats`**: `n_gaps`, `mean`, `median`, `std`, `iqr`, `log_mean`, `log_std` — all NaN when `n_gaps == 0`. +- **`ClusterStats`**: `radius_bp`, `n_clusters`, `n_clustered_peaks`, `mean_cluster_size`, `max_cluster_size`, `fraction_clustered`. `min_cluster_size=2` is the typical call. +- **`DensityVector`**: `n_bins` (target for longest chrom), `bin_width`, `counts: number[]`, `chrom_offsets: [chr, start][]`. Ordered by karyotype. +- **`DensityHomogeneity`**: `bin_width`, `n_windows`, `n_nonzero_windows`, `mean_count`, `variance`, `cv`, `gini`. Gini is biased high on sparse inputs; check `n_nonzero_windows`. + +## See also + +- **[wasm/partitions](partitions.md)** — `GeneModel`, `PartitionList`, `calcPartitions` / `calcExpectedPartitions`. +- **[wasm/signal](signal.md)** — `SignalMatrix` and `calcSummarySignal`. +- **[wasm/lola](lola.md)** — `LolaRegionDB` and `runLOLA`. +- **[wasm/overlappers](overlappers.md)** — low-level overlap engine. +- **[gtars-core](../core.md)** and **[gtars-genomicdist](../genomicdist.md)** — Rust reference with algorithmic details and caveats that apply identically in Wasm. diff --git a/docs/gtars/wasm/signal.md b/docs/gtars/wasm/signal.md new file mode 100644 index 00000000..75a3f225 --- /dev/null +++ b/docs/gtars/wasm/signal.md @@ -0,0 +1,140 @@ +# Signal matrix (Wasm) + +Wasm bindings for the gtars-genomicdist signal matrix overlap — overlay query regions on a region × condition matrix of signal values (e.g. a peak × cell-type ChIP intensity matrix), aggregate by MAX per query region, and compute Tukey boxplot statistics per condition. + +See [gtars-genomicdist → Signal matrix overlap](../genomicdist.md#signal-matrix-overlap) for the full algorithmic detail. + +## Import + +```ts +import init, { + SignalMatrix, + calcSummarySignal, +} from '@databio/gtars-js'; + +await init(); +``` + +## `SignalMatrix` + +A region × condition matrix of signal values, loaded from either the packed binary format or built directly from JS data. + +### `SignalMatrix.fromBin` + +Load from a packed binary buffer (the `.sigm` format produced by the Rust `SignalMatrix::save_bin` or the CLI). This is the fast path — drop the bytes of a `.sigm` file served from your API into the constructor. + +```ts +const response = await fetch('/api/signal-matrix/my_matrix.sigm'); +const bytes = new Uint8Array(await response.arrayBuffer()); + +const sm = SignalMatrix.fromBin(bytes); +``` + +### From JS arrays + +For the rare case where you have the signal data in memory already (e.g. parsed from a TSV fetched separately), you can build a `SignalMatrix` directly via the `new SignalMatrix(...)` constructor: + +```ts +const regionIds = [ + 'chr1_100_200', + 'chr1_300_400', + 'chr2_500_600', +]; +const conditionNames = ['K562', 'HeLa', 'GM12878']; + +// Flat row-major array: row i, condition j = values[i * nConditions + j] +const values = new Float64Array([ + 1.2, 0.8, 2.3, // region 0 + 0.4, 1.5, 0.9, // region 1 + 3.1, 2.7, 1.8, // region 2 +]); + +const sm = new SignalMatrix( + regionIds, + conditionNames, + values, + 3, // nRegions + 3, // nConditions +); +``` + +!!! note "Region IDs must be `chr_start_end`" + The constructor parses each `regionIds` entry by splitting on `_` and expects exactly three parts: chromosome, start, end. IDs with more or fewer underscores error out. This matches the R GenomicDistributions convention used by `signal_matrix.tsv` files. + +## `calcSummarySignal` + +Overlap a query region set against a `SignalMatrix`, take the MAX signal per query region per condition, and compute Tukey boxplot statistics per condition. + +```ts +const peaks = new RegionSet([ + ['chr1', 150, 250, ''], + ['chr2', 550, 580, ''], +]); + +const result = calcSummarySignal(peaks, sm); + +console.log(result.conditionNames); +// ['K562', 'HeLa', 'GM12878'] + +console.log(result.signalMatrix); +// Array of { region, values } — per-query-region max signal vector +// [ +// { region: "chr1_150_250", values: [1.2, 0.8, 2.3] }, +// { region: "chr2_550_580", values: [3.1, 2.7, 1.8] }, +// ] + +console.log(result.matrixStats); +// Per-condition Tukey stats +// [ +// { condition: 'K562', lowerWhisker, lowerHinge, median, upperHinge, upperWhisker }, +// { condition: 'HeLa', ... }, +// { condition: 'GM12878', ... }, +// ] +``` + +### Result schema + +**Top level:** +- `signalMatrix: { region: string, values: number[] }[]` — one entry per query region that matched at least one signal row. `region` is the query region label in `chr_start_end` form; `values` are the per-condition max signals. +- `matrixStats: ConditionStats[]` — one entry per condition, in the order of `conditionNames`. +- `conditionNames: string[]` — column labels, same as the input `SignalMatrix.conditionNames`. + +**`ConditionStats`** (per condition, standard Tukey 5-number summary): + +| field | type | +|---|---| +| `condition` | `string` | +| `lowerWhisker` | `number` | +| `lowerHinge` | `number` — Q1 | +| `median` | `number` | +| `upperHinge` | `number` — Q3 | +| `upperWhisker` | `number` | + +## End-to-end example + +```ts +import init, { RegionSet, SignalMatrix, calcSummarySignal } from '@databio/gtars-js'; + +await init(); + +// 1. Fetch a packed binary signal matrix from the API +const response = await fetch('/api/signal/encode_k562_hela.sigm'); +const sm = SignalMatrix.fromBin(new Uint8Array(await response.arrayBuffer())); + +// 2. Build a query RegionSet from user-provided peaks +const peaks = new RegionSet(userPeakEntries); + +// 3. Compute summary signal +const result = calcSummarySignal(peaks, sm); + +// 4. Render per-condition boxplot +for (const stats of result.matrixStats) { + renderBoxplot(stats.condition, stats); +} +``` + +## See also + +- **[wasm/regionset](regionset.md)** — the `RegionSet` type used as the query input. +- **[wasm/partitions](partitions.md)** — partition classification. +- **[gtars-genomicdist → Signal matrix overlap](../genomicdist.md#signal-matrix-overlap)** — Rust reference for the underlying algorithm and the packed `.sigm` binary format. diff --git a/mkdocs.yml b/mkdocs.yml index 1711b382..ebb40d42 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -175,7 +175,7 @@ nav: - Overview: gtars/modules.md - gtars-core: gtars/core.md # - gtars-io: gtars/io.md - - gtars-models: gtars/regionSet.md + - Core models: gtars/regionSet.md - gtars-overlaprs: gtars/overlaprs.md - gtars-uniwig: gtars/uniwig.md - gtars-igd: gtars/igd.md @@ -183,11 +183,16 @@ nav: - gtars-fragsplit: gtars/fragsplit.md - gtars-tokenizers: gtars/tokenizers.md - gtars-scatrs: gtars/scatrs.md + - gtars-genomicdist: gtars/genomicdist.md + - gtars-lola: gtars/lola.md - gtars-refget: gtars/refget.md - gtars-bbcache: gtars/bbcache.md - Bindings: - Python: - Overview: gtars/python-overview.md + - Models: gtars/python/models.md + - GenomicDist: gtars/python/genomic_distributions.md + - LOLA: gtars/python/lola.md - Digests: gtars/python/digests.md - RefgetStore Tutorial: gtars/python/refgetstore.ipynb - RefgetStore Python Reference: gtars/python/refget-store.md @@ -195,9 +200,17 @@ nav: - Wasm: - Overview: gtars/wasm.md - Overlappers: gtars/wasm/overlappers.md + - RegionSet: gtars/wasm/regionset.md + - Partitions: gtars/wasm/partitions.md + - Signal: gtars/wasm/signal.md + - LOLA: gtars/wasm/lola.md - CLI: gtars/cli.md - R: - R: gtars/r.md + - RegionSet: gtars/r/regionset.md + - GenomicDist: gtars/r/genomicdist.md + - LOLA: gtars/r/lola.md + - IGD: gtars/r/igd.md - RefgetStore: gtars/r/refgetstore.md - Reference: - Changelog: gtars/changelog.md From 692d8b2926c481ae5d8a23c89ef95b142d251b3a Mon Sep 17 00:00:00 2001 From: Sam Park Date: Sun, 19 Apr 2026 17:08:18 -0400 Subject: [PATCH 2/2] Remove references to reverted genomicdist APIs MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Four per-file spatial-arrangement stats were removed from gtars in databio/gtars#257 (`inter_peak_spacing`, `peak_clusters`, `density_vector`, `density_homogeneity`, plus `--cluster-radii` / `--cluster-min-size` CLI flags). This commit removes the corresponding docs. Docs for `gaps()`, `cluster()` / `clusterRegions()`, and `calcDinuclFreq` are unchanged (those APIs were kept). ## Files touched (8; net −182 lines) - `docs/gtars/genomicdist.md` — removed imports, feature bullet, the "Whole-set scalar summaries" section and its example, 4-row method table, and 4 struct defs. Collapsed two-bucket framing to one. Rewrote `n_bins` warning to only cite the kept method. - `docs/gtars/python/models.md` — removed 4 type names from the submodule-map admonition, 4 example lines, and 4 result-type subsections. Rewrote `n_bins` warning. - `docs/gtars/wasm/regionset.md` — removed 3 example usages and 1 struct-reference subsection. Renamed section heading. Rewrote `n_bins` warning. - `docs/gtars/r/regionset.md` — removed 4 example lines; trimmed `nBins` warning. - `docs/gtars/wasm.md` — dropped "density vectors" from features bullet; rewrote perf bullet. - `docs/gtars/python-overview.md` — replaced removed methods in the cross-ref sentence with `nearest_neighbors`. - `docs/gtars/r.md` — swapped `interPeakSpacing(peaks)` in the quickstart for `chromosomeStatistics(peaks)`. - `docs/gtars/r/genomicdist.md` — trimmed trailing sentence of `nBins` warning. Co-Authored-By: Claude Opus 4.7 (1M context) --- docs/gtars/genomicdist.md | 106 +--------------------------------- docs/gtars/python-overview.md | 2 +- docs/gtars/python/models.md | 57 +----------------- docs/gtars/r.md | 2 +- docs/gtars/r/genomicdist.md | 2 +- docs/gtars/r/regionset.md | 6 +- docs/gtars/wasm.md | 4 +- docs/gtars/wasm/regionset.md | 29 +--------- 8 files changed, 13 insertions(+), 195 deletions(-) diff --git a/docs/gtars/genomicdist.md b/docs/gtars/genomicdist.md index 73aee0f4..e57c66f6 100644 --- a/docs/gtars/genomicdist.md +++ b/docs/gtars/genomicdist.md @@ -10,7 +10,6 @@ Rust port of the R [GenomicDistributions](https://code.databio.org/GenomicDistri - **Nearest-TSS / nearest-feature distance** calculations (`TssIndex`). - **GC content** and **dinucleotide frequency** against a reference genome. - **BED file classification** (optional `bedclassifier` feature, polars-backed). -- **Density and clustering** statistics used for ML feature vectors and peak-clustering summaries. - The **GDA** binary gene-model format and the **.fab** binary FASTA format for fast mmap-backed sequence lookup. All operations take a `&RegionSet` (from `gtars-core`) as input and either extend it via trait implementations or run as free functions. If you're new to the crate, start with the `IntervalRanges` and `GenomicIntervalSetStatistics` traits — those cover the GRanges/GenomicDistributions surface area most users need. @@ -47,7 +46,6 @@ All of these are re-exported at the crate root, so you typically import them dir use gtars_genomicdist::{ // Statistics GenomicIntervalSetStatistics, - ClusterStats, DensityHomogeneity, DensityVector, SpacingStats, // Interval algebra IntervalRanges, RegionSetListOps, pairwise_jaccard, // Strand-aware wrappers @@ -78,13 +76,9 @@ The submodules (`statistics`, `interval_ranges`, `partitions`, `signal`, `consen ## Statistics -The `GenomicIntervalSetStatistics` trait extends `RegionSet` with two kinds of summaries: +The `GenomicIntervalSetStatistics` trait extends `RegionSet` with per-region and per-pair quantities — direct ports of the R GenomicDistributions functions (`calcWidth`, `calcNeighborDist`, `calcNearestNeighbors`, `calcChromBins`). These return vectors or per-chromosome tables, one number per peak or per gap. Use them when you want to *see* each value — plot a histogram, feed them to a downstream test, render a per-chromosome heatmap. -1. **Per-region and per-pair quantities** — direct ports of the R GenomicDistributions functions (`calcWidth`, `calcNeighborDist`, `calcNearestNeighbors`, `calcChromBins`). These return vectors or per-chromosome tables, one number per peak or per gap. Use them when you want to *see* each value — plot a histogram, feed them to a downstream test, render a per-chromosome heatmap. - -2. **Whole-set scalar summaries** — new in gtars. These collapse an entire peak set into a handful of numbers that answer questions like "how regularly are my peaks spaced?", "how clumpy is this peak set?", or "how evenly is it spread across the genome?". Each one wraps a per-region calculation and reduces it to a scalar / small struct. Use them when you want to *compare* peak sets against each other or a baseline, build ML feature vectors, or run peak-set QC. - -Bring the trait into scope and both kinds of methods appear on any `RegionSet`. +Bring the trait into scope and the methods appear on any `RegionSet`. ### Per-region and per-pair quantities @@ -125,68 +119,6 @@ let nn: Vec = peaks.calc_nearest_neighbors()?; | `calc_neighbor_distances()` | "What's the bp gap between every pair of consecutive peaks on each chromosome?" | `Result>` | | `calc_nearest_neighbors()` | "How far is each peak from its closest neighbor?" — port of R `calcNearestNeighbors()` | `Result>` | -### Whole-set scalar summaries - -These collapse the peak set to a handful of numbers, answering questions about **spatial arrangement** across the genome. They're complementary to the per-region quantities above — each one internally calls one of the per-region methods and then summarizes the result. - -```rust -use std::collections::HashMap; -use gtars_core::models::RegionSet; -use gtars_genomicdist::GenomicIntervalSetStatistics; - -let peaks = RegionSet::try_from("peaks.bed")?; -let chrom_sizes: HashMap = - [("chr1".into(), 248_956_422), ("chr2".into(), 242_193_529)].into(); - -// 1. How regularly are peaks spaced? -// Returns SpacingStats: mean/median/std/IQR plus log-space stats that -// handle heavy-tailed gap distributions. Small log_std => regular arrays -// (CTCF-style); large log_std => bursty pileups. -let spacing = peaks.calc_inter_peak_spacing(); -println!("n_gaps={}, median={:.0} bp, log_std={:.2}", - spacing.n_gaps, spacing.median, spacing.log_std); - -// 2. How much does the peak set cluster at a 5 kb stitching radius? -// Single-linkage clustering — two peaks link if the gap between them is -// ≤ radius_bp. The default `min_cluster_size = 2` answers "fraction of -// peaks that have at least one neighbor within 5 kb", matching typical -// super-enhancer-stitching / enhancer-clustering analyses. -let clusters = peaks.calc_peak_clusters(5_000, 2); -println!( - "{} clusters, {:.1}% of peaks clustered, biggest={}", - clusters.n_clusters, - clusters.fraction_clustered * 100.0, - clusters.max_cluster_size, -); - -// 3. Dense per-window count vector across the whole genome — an ML-ready -// feature vector. Stable karyotypic ordering means vectors from different -// BED files are aligned column-for-column, so you can stack them into a -// matrix and feed to downstream models. -let density = peaks.calc_density_vector(&chrom_sizes, 250); -println!("density vector: {} windows, bin_width={} bp", - density.counts.len(), density.bin_width); - -// 4. How evenly is the peak set spread across the genome, as a scalar? -// Mean count per window + variance + coefficient of variation (Poisson -// ≈ 1, clustered >> 1, even << 1) + Gini + nonzero-window count. The -// canonical "how uniform is this peak set" measure. -let homog = peaks.calc_density_homogeneity(&chrom_sizes, 250); -println!( - "mean={:.2}, CV={:.2}, gini={:.3}, {}/{} windows nonzero", - homog.mean_count, homog.cv, homog.gini, - homog.n_nonzero_windows, homog.n_windows, -); -# Ok::<(), Box>(()) -``` - -| method | question it answers | wraps | returns | -|---|---|---|---| -| `calc_inter_peak_spacing()` | "How regularly are my peaks spaced?" — mean/median/std/IQR and log-space stats of inter-peak gaps | `calc_neighbor_distances` | `SpacingStats` | -| `calc_peak_clusters(radius_bp, min_cluster_size)` | "How clumpy is my peak set at a given stitching radius?" — cluster counts, sizes, and the fraction of peaks in multi-peak clusters | `cluster(radius_bp)` from `IntervalRanges` | `ClusterStats` | -| `calc_density_vector(chrom_sizes, n_bins)` | "Give me a stable, ML-ready per-window count vector across the whole genome" — dense, zero-padded, karyotypically ordered | per-region midpoint binning | `DensityVector` | -| `calc_density_homogeneity(chrom_sizes, n_bins)` | "How evenly are my peaks spread across the genome, as a scalar?" — mean/variance/CV/Gini of the density vector | `calc_density_vector` | `DensityHomogeneity` | - ### Output structs ```rust @@ -202,45 +134,13 @@ pub struct ChromosomeStatistics { } pub struct RegionBin { pub chr: String, pub start: u32, pub end: u32, pub n: u32, pub rid: u32 } - -pub struct SpacingStats { - pub n_gaps: usize, - pub mean: f64, pub median: f64, pub std: f64, pub iqr: f64, - pub log_mean: f64, pub log_std: f64, -} - -pub struct ClusterStats { - pub radius_bp: u32, - pub n_clusters: usize, - pub n_clustered_peaks: usize, - pub mean_cluster_size: f64, - pub max_cluster_size: usize, - pub fraction_clustered: f64, -} - -pub struct DensityVector { - pub n_bins: u32, // target bins for the longest chromosome - pub bin_width: u32, // derived as max_chrom_len / n_bins, floored, min 1 - pub counts: Vec, // dense, row-major across chromosomes in karyotype order - pub chrom_offsets: Vec<(String, usize)>, -} - -pub struct DensityHomogeneity { - pub bin_width: u32, - pub n_windows: usize, - pub n_nonzero_windows: usize, - pub mean_count: f64, - pub variance: f64, - pub cv: f64, - pub gini: f64, -} ``` !!! warning "Output length for spacing/nearest calculations" `calc_neighbor_distances` and `calc_nearest_neighbors` **skip single-region chromosomes** (matching R GenomicDistributions behavior). The output length is therefore **not 1:1 with the input region count** — it's the number of multi-region chromosomes' regions. No sentinel values are emitted. If you need 1:1 alignment, filter your input to multi-region chromosomes first. !!! warning "`n_bins` is a target, not a total" - In `calc_density_vector`, `calc_density_homogeneity`, and `region_distribution_with_chrom_sizes`, `n_bins` is the target bin count for the **longest** chromosome in `chrom_sizes`. Bin width is derived from that, and every chromosome is tiled at the same bp width — so shorter chromosomes get fewer bins and the total bin count is `sum(ceil(chrom_size / bin_width))`, which can exceed `n_bins` substantially. To target a specific bin width in bp, set `n_bins = max_chrom_len / desired_bp`. + In `region_distribution_with_chrom_sizes`, `n_bins` is the target bin count for the **longest** chromosome in `chrom_sizes`. Bin width is derived from that, and every chromosome is tiled at the same bp width — so shorter chromosomes get fewer bins and the total bin count is `sum(ceil(chrom_size / bin_width))`, which can exceed `n_bins` substantially. To target a specific bin width in bp, set `n_bins = max_chrom_len / desired_bp`. ## Interval set algebra diff --git a/docs/gtars/python-overview.md b/docs/gtars/python-overview.md index 244a70ea..2ebb2e31 100644 --- a/docs/gtars/python-overview.md +++ b/docs/gtars/python-overview.md @@ -44,7 +44,7 @@ merged = rs.reduce() # merge overlapping promoters = rs.promoters(2000, 0) # 2kb upstream ``` -See [`gtars.models`](python/models.md) for the full `RegionSet` surface — including `reduce`, `setdiff`, `union`, `intersect_all`, `jaccard`, `coverage`, `neighbor_distances`, `peak_clusters`, `density_vector`, and more. +See [`gtars.models`](python/models.md) for the full `RegionSet` surface — including `reduce`, `setdiff`, `union`, `intersect_all`, `jaccard`, `coverage`, `neighbor_distances`, `nearest_neighbors`, and more. ## Extended example — genomic distributions diff --git a/docs/gtars/python/models.md b/docs/gtars/python/models.md index cdc52156..c7a5c791 100644 --- a/docs/gtars/python/models.md +++ b/docs/gtars/python/models.md @@ -6,7 +6,7 @@ The `gtars.models` submodule exposes the core genomic data types and — unlike `gtars.models` combines types from three Rust crates: - `gtars-core::models` → `Region`, `Interval`, `RegionSet`, `RegionSetList` - - `gtars-genomicdist::models` → `ChromosomeStatistics`, `SpacingStats`, `ClusterStats`, `DensityVector`, `DensityHomogeneity`, `GenomeAssembly`, `BinaryGenomeAssembly`, `TssIndex`, `GeneModel`, `PartitionList`, `SignalMatrix`, `GenomicDistAnnotation` + - `gtars-genomicdist::models` → `ChromosomeStatistics`, `GenomeAssembly`, `BinaryGenomeAssembly`, `TssIndex`, `GeneModel`, `PartitionList`, `SignalMatrix`, `GenomicDistAnnotation` - `gtars-genomicdist::{IntervalRanges, GenomicIntervalSetStatistics}` → methods on `RegionSet` Free functions that need a reference genome or a partition list live in [`gtars.genomic_distributions`](genomic_distributions.md). @@ -153,17 +153,13 @@ Ports of the R GenomicDistributions functions, exposed via the Rust `GenomicInte ```python rs.neighbor_distances() # list[int], signed gaps per chromosome (multi-region chroms only) rs.nearest_neighbors() # list[int], per-region min neighbor distance -rs.inter_peak_spacing() # SpacingStats -rs.peak_clusters(radius_bp=5000, min_cluster_size=2) # ClusterStats -rs.density_vector(chrom_sizes, n_bins) # DensityVector — ML-ready dense count vector -rs.density_homogeneity(chrom_sizes, n_bins) # DensityHomogeneity — CV, Gini, etc. ``` !!! warning "Output length for `neighbor_distances` / `nearest_neighbors`" Both methods **skip chromosomes with only one region** (matching R GenomicDistributions). The returned list is therefore **not aligned 1:1 with input regions** — it's shorter than `len(rs)` whenever any chromosome has exactly one peak. No sentinel values are emitted. !!! warning "`n_bins` is a target, not a total" - In `density_vector`, `density_homogeneity`, and `distribution(chrom_sizes=...)`, `n_bins` is the target bin count for the **longest** chromosome. Bin width is derived from that, and every chromosome is tiled at the same bp width — so the total window count is `sum(ceil(size / bin_width))` across chromosomes, often much larger than `n_bins`. See `DensityVector` for details. + In `distribution(chrom_sizes=...)`, `n_bins` is the target bin count for the **longest** chromosome. Bin width is derived from that, and every chromosome is tiled at the same bp width — so the total window count is `sum(ceil(size / bin_width))` across chromosomes, often much larger than `n_bins`. ### `RegionSetList` @@ -215,55 +211,6 @@ Returned by `rs.chromosome_statistics()` → `dict[str, ChromosomeStatistics]`. | `mean_region_length` | `float` | | | `median_region_length` | `float` | | -#### `SpacingStats` - -Returned by `rs.inter_peak_spacing()`. All float fields are NaN when `n_gaps == 0`. - -| field | type | -|---|---| -| `n_gaps` | `int` — count of *positive* inter-region gaps | -| `mean`, `median`, `std`, `iqr` | `float` | -| `log_mean`, `log_std` | `float` — `log10(gap + 1)` stats, more robust for heavy-tailed gap distributions | - -#### `ClusterStats` - -Returned by `rs.peak_clusters(radius_bp, min_cluster_size)`. - -| field | type | -|---|---| -| `radius_bp` | `int` — pass-through | -| `n_clusters` | `int` — clusters with size ≥ `min_cluster_size` | -| `n_clustered_peaks` | `int` — peaks belonging to those clusters | -| `mean_cluster_size` | `float` — NaN if no clusters meet threshold | -| `max_cluster_size` | `int` — always over all clusters, not filtered | -| `fraction_clustered` | `float` — `n_clustered_peaks / total_peaks` | - -The identity `n_clusters * mean_cluster_size == n_clustered_peaks` holds exactly at any threshold. With `min_cluster_size=2` (the default), `fraction_clustered` is the fraction of peaks with at least one neighbor within `radius_bp` — the typical meaning in super-enhancer-stitching analyses. - -#### `DensityVector` - -Returned by `rs.density_vector(chrom_sizes, n_bins)`. - -| field | type | -|---|---| -| `n_bins` | `int` — target bin count for the longest chromosome (**not** the length of `counts`) | -| `bin_width` | `int` — `max(chrom_sizes.values()) // n_bins`, min 1 | -| `counts` | `list[int]` — dense, row-major, karyotypic chromosome order | -| `chrom_offsets` | `list[(str, int)]` — `(chr_name, start_index_in_counts)` per chromosome | - -Supports `len()`. Suitable as an ML feature vector — stable ordering across files when you pass the same `chrom_sizes`. - -#### `DensityHomogeneity` - -Returned by `rs.density_homogeneity(chrom_sizes, n_bins)`. - -| field | type | -|---|---| -| `bin_width`, `n_windows`, `n_nonzero_windows` | `int` | -| `mean_count`, `variance`, `cv`, `gini` | `float` | - -Poisson-distributed peaks give `cv ≈ 1`; clustered sets give `cv >> 1`; evenly-spread sets give `cv << 1`. **Caveat:** Gini is biased high for sparse count distributions; check `n_nonzero_windows` before interpreting it. - ## Reference genome and annotation types These types are constructed from files on disk and passed into the functions in [`gtars.genomic_distributions`](genomic_distributions.md). diff --git a/docs/gtars/r.md b/docs/gtars/r.md index 8efafc66..9b1c8edf 100644 --- a/docs/gtars/r.md +++ b/docs/gtars/r.md @@ -67,7 +67,7 @@ show(peaks) # 2. Summary statistics calcWidth(peaks) |> summary() -interPeakSpacing(peaks) +chromosomeStatistics(peaks) # 3. Interval set algebra merged <- reduce(peaks) diff --git a/docs/gtars/r/genomicdist.md b/docs/gtars/r/genomicdist.md index d7db80c9..bd678746 100644 --- a/docs/gtars/r/genomicdist.md +++ b/docs/gtars/r/genomicdist.md @@ -73,7 +73,7 @@ rd <- regionDistribution(query, nBins = 250L, chromSizes = chromSizes) Returns a data.table compatible with R GenomicDistributions' `plotChromBins`. !!! warning "`nBins` is a target, not a total" - When `chromSizes` is provided, `nBins` is the target bin count for the **longest** chromosome in `chromSizes`. Bin width is derived as `max(chromSizes) %/% nBins` (floored, minimum 1 bp), and every chromosome is tiled at the same bp width — so shorter chromosomes get proportionally fewer bins. The total bin count returned is `sum(ceiling(chrom_size / bin_width))`, which can substantially exceed `nBins` when `chromSizes` has many entries. To target a specific bin width in bp instead, pass `nBins = max_chrom_len %/% desired_bp`. This applies identically to `densityVector` and `densityHomogeneity` from [`r/regionset.md`](regionset.md#statistics-methods). + When `chromSizes` is provided, `nBins` is the target bin count for the **longest** chromosome in `chromSizes`. Bin width is derived as `max(chromSizes) %/% nBins` (floored, minimum 1 bp), and every chromosome is tiled at the same bp width — so shorter chromosomes get proportionally fewer bins. The total bin count returned is `sum(ceiling(chrom_size / bin_width))`, which can substantially exceed `nBins` when `chromSizes` has many entries. To target a specific bin width in bp instead, pass `nBins = max_chrom_len %/% desired_bp`. ## GC content and dinucleotides diff --git a/docs/gtars/r/regionset.md b/docs/gtars/r/regionset.md index 3303caa2..0cc62d40 100644 --- a/docs/gtars/r/regionset.md +++ b/docs/gtars/r/regionset.md @@ -83,17 +83,13 @@ distribution(rs, nBins = 250) # bin counts across the genome distribution(rs, nBins = 250, chromSizes = hg38) # with reference chrom sizes clusterRegions(rs, maxGap = 5000L) # cluster id per region -interPeakSpacing(rs) # spacing stats list -peakClusters(rs, radius_bp = 5000L, min_cluster_size = 2L) -densityVector(rs, chrom_sizes = hg38, nBins = 250L) -densityHomogeneity(rs, chrom_sizes = hg38, nBins = 250L) ``` !!! warning "Output length for `neighborDistances` / `nearestNeighbors`" Both skip chromosomes with only one region (matching R GenomicDistributions). The returned vector is **not aligned 1:1** with the input — it's shorter than `length(rs)` whenever any chromosome has a single peak. !!! warning "`nBins` is a target, not a total" - In `distribution(chromSizes = ...)`, `densityVector`, and `densityHomogeneity`, `nBins` is the target bin count for the **longest** chromosome in `chromSizes`. Bin width is derived as `max(chromSizes) %/% nBins` (floored, minimum 1 bp), and every chromosome is tiled at the same bp width — so shorter chromosomes get proportionally fewer bins. The total bin count returned is `sum(ceiling(chrom_size / bin_width))`, which can substantially exceed `nBins` when `chromSizes` has many entries. To target a specific bin width in bp instead, pass `nBins = max_chrom_len %/% desired_bp`. + In `distribution(chromSizes = ...)`, `nBins` is the target bin count for the **longest** chromosome in `chromSizes`. Bin width is derived as `max(chromSizes) %/% nBins` (floored, minimum 1 bp), and every chromosome is tiled at the same bp width — so shorter chromosomes get proportionally fewer bins. The total bin count returned is `sum(ceiling(chrom_size / bin_width))`, which can substantially exceed `nBins` when `chromSizes` has many entries. To target a specific bin width in bp instead, pass `nBins = max_chrom_len %/% desired_bp`. ### Interval set algebra diff --git a/docs/gtars/wasm.md b/docs/gtars/wasm.md index 9da2b0b3..b01e4b1c 100644 --- a/docs/gtars/wasm.md +++ b/docs/gtars/wasm.md @@ -6,7 +6,7 @@ WebAssembly bindings for gtars, enabling genomic interval analysis directly in b - Zero-installation genomic tools running client-side. - Full `RegionSet` interval algebra and summary statistics. -- Genomic distribution analysis: partitions, signal matrices, density vectors, consensus regions. +- Genomic distribution analysis: partitions, signal matrices, consensus regions. - LOLA enrichment testing with in-memory region databases. - TypeScript-friendly — the published package ships `.d.ts` declarations generated from the Rust source. @@ -85,5 +85,5 @@ function ConsensusComponent({ replicates }: { replicates: Array<[string, number, - `RegionSet` construction sorts on load. Repeated operations don't re-sort. - Overlap queries (`jaccard`, `setdiff`, `intersect`, etc.) route through the AIList index under the hood, giving O(log n) per-query lookups. -- Statistics methods like `densityVector` and `peakClusters` are implemented in Rust and called via zero-copy boundary — they're substantially faster than pure-JS equivalents for typical peak-set sizes. +- Statistics methods are implemented in Rust and called via zero-copy boundary — they're substantially faster than pure-JS equivalents for typical peak-set sizes. - For very large numeric outputs, the Wasm glue converts between Rust `Vec` and JS arrays — if you're hitting GC pressure, consider batching calls or keeping `RegionSet` handles around instead of returning raw arrays. diff --git a/docs/gtars/wasm/regionset.md b/docs/gtars/wasm/regionset.md index 2531d1b1..bef2a9e4 100644 --- a/docs/gtars/wasm/regionset.md +++ b/docs/gtars/wasm/regionset.md @@ -63,22 +63,14 @@ const gaps: number[] = rs.calcNeighborDistances(); // Per-region min-neighbor distance const nn: number[] = rs.calcNearestNeighbors(); -// Descriptive summary of inter-peak spacing -const spacing = rs.interPeakSpacing(); -// { n_gaps, mean, median, std, iqr, log_mean, log_std } - // Single-linkage clusters at a stitching radius; returns a cluster id per region const clusterIds: number[] = rs.cluster(5_000); - -// Cluster-level summary at a stitching radius (default min_cluster_size = 2) -const clusters = rs.peakClusters(5_000, 2); -// { radius_bp, n_clusters, n_clustered_peaks, mean_cluster_size, max_cluster_size, fraction_clustered } ``` !!! warning "Output length for `calcNeighborDistances` / `calcNearestNeighbors`" Both methods **skip chromosomes with only one region** (matching R GenomicDistributions). The returned array is therefore **not aligned 1:1 with input regions** — it's shorter than `rs.numberOfRegions` whenever any chromosome has a single peak. No sentinel values are emitted. -### Region distribution and density +### Region distribution ```ts // Bin counts for plotting; chrom_sizes is optional @@ -90,18 +82,10 @@ const dist = rs.regionDistribution(250, chromSizes); // Without chrom_sizes → bins derived from observed max end (NOT comparable) const distLocal = rs.regionDistribution(250, null); // Array of { chr, start, end, n, rid } - -// Dense zero-padded per-window count vector (ML-ready feature vector) -const density = rs.densityVector(chromSizes, 250); -// { n_bins, bin_width, counts: number[], chrom_offsets: [chr, start][] } - -// Summary of density vector -const homogeneity = rs.densityHomogeneity(chromSizes, 250); -// { bin_width, n_windows, n_nonzero_windows, mean_count, variance, cv, gini } ``` !!! warning "`n_bins` is a target, not a total" - `n_bins` is the target bin count for the **longest** chromosome in `chrom_sizes`. Bin width is derived from that, and every chromosome is tiled at the same bp width — so the total window count `counts.length` equals `sum(ceil(size / bin_width))` over `chrom_sizes`, which can substantially exceed `n_bins`. + When `chrom_sizes` is provided, `n_bins` is the target bin count for the **longest** chromosome in `chrom_sizes`. Bin width is derived from that, and every chromosome is tiled at the same bp width — so the total bin count across chromosomes is `sum(ceil(size / bin_width))`, which can substantially exceed `n_bins`. ### Interval set algebra @@ -262,15 +246,6 @@ Returned by the `rs.classify` getter. Identifies the BED/ENCODE subtype based on | `data_format` | `string` — e.g. `"UcscBed"`, `"EncodeNarrowPeak"`, etc. | | `compliant_columns`, `non_compliant_columns` | `number` | -### `SpacingStats`, `ClusterStats`, `DensityVector`, `DensityHomogeneity` - -Returned as plain JS objects (via `serde_wasm_bindgen`). Field semantics are identical to the Rust types in [gtars-genomicdist](../genomicdist.md#statistics). Full reference: - -- **`SpacingStats`**: `n_gaps`, `mean`, `median`, `std`, `iqr`, `log_mean`, `log_std` — all NaN when `n_gaps == 0`. -- **`ClusterStats`**: `radius_bp`, `n_clusters`, `n_clustered_peaks`, `mean_cluster_size`, `max_cluster_size`, `fraction_clustered`. `min_cluster_size=2` is the typical call. -- **`DensityVector`**: `n_bins` (target for longest chrom), `bin_width`, `counts: number[]`, `chrom_offsets: [chr, start][]`. Ordered by karyotype. -- **`DensityHomogeneity`**: `bin_width`, `n_windows`, `n_nonzero_windows`, `mean_count`, `variance`, `cv`, `gini`. Gini is biased high on sparse inputs; check `n_nonzero_windows`. - ## See also - **[wasm/partitions](partitions.md)** — `GeneModel`, `PartitionList`, `calcPartitions` / `calcExpectedPartitions`.