Skip to content

Number of reads mapped to too many loci always 0; reads exceeding --outFilterMultimapNmax dropped from accounting #53

@pinin4fjords

Description

@pinin4fjords

Summary

Reads exceeding --outFilterMultimapNmax were dropped silently from the per-bucket stats in Log.final.out. The too_many_loci counter stays at 0 no matter how many reads exceed the cap, and the per-bucket sum is total_reads - <dropped> instead of equalling Number of input reads.

On the yeast PE test profile (50 000 pairs) with STAR's default --outFilterMultimapNmax 10:

Log.final.out field rustar STAR
Number of input reads 50 000 50 000
Uniquely mapped reads number 41 670 41 679
Number of reads mapped to multiple loci 958 974
Number of reads mapped to too many loci 0 3
Number of reads unmapped: too many mismatches 0 0
Number of reads unmapped: too short 4 193 3 786
Number of reads unmapped: other 0 (3 592 with #49) 3 558
Sum of buckets 49 997 50 000

The 3 missing reads are the 3 reads STAR classifies as too_many_loci. At larger cap values (e.g. --outFilterMultimapNmax 20) no read on this input exceeds the cap so both aligners report 0 — the bug only surfaces when the cap is small enough that some reads exceed it.

STAR reference behaviour

STAR's Stats.cpp::nUnmapped[5] tracks Reads mapped to too many loci separately from the unmapped buckets. The classification fires in ReadAlign_assignAlignToWindow.cpp when the count of valid alignments exceeds outFilterMultimapNmax; those reads are recorded in too_many_loci and (with default outSAMmultNmax=-1) suppressed from BAM emission. Per-bucket counts always sum to the input read count.

rustar's AlignmentStats::record_alignment(n_alignments, max_multimaps) already has the correct routing arm:

_ => {
    self.too_many_loci.fetch_add(1, Ordering::Relaxed);
}

The bug is at the upstream filter site: on the PE path, align_paired_read's too-many-loci filter returns n_for_mapq = 0 rather than the actual loci count, so the caller hits record_alignment(0, max_multimaps)n == 0 arm (unmapped) instead of the _ => arm.

Reproducer (uses --outFilterMultimapNmax 10 so reads actually exceed the cap)

#!/usr/bin/env bash
set -euo pipefail
mkdir -p /tmp/rustar-mre-tml && cd /tmp/rustar-mre-tml

BASE=https://raw.githubusercontent.com/nf-core/test-datasets/626c8fab639062eade4b10747e919341cbf9b41a
curl -fsLO $BASE/reference/genome.fasta
curl -fsL  $BASE/reference/genes_with_empty_tid.gtf.gz | gunzip -c > genes.gtf
curl -fsLO $BASE/testdata/GSE110004/SRR6357072_1.fastq.gz
curl -fsLO $BASE/testdata/GSE110004/SRR6357072_2.fastq.gz

RUSTAR=ghcr.io/scverse/rustar-aligner:dev
STAR=community.wave.seqera.io/library/htslib_samtools_star_gawk:ae438e9a604351a4

mkdir -p idx-rustar idx-star
docker run --rm -v $PWD:/w -w /w $RUSTAR rustar-aligner --runMode genomeGenerate \
    --genomeDir idx-rustar --genomeFastaFiles genome.fasta --sjdbGTFfile genes.gtf \
    --sjdbOverhang 100 --genomeSAindexNbases 7
docker run --rm -v $PWD:/w -w /w $STAR STAR --runMode genomeGenerate \
    --genomeDir idx-star --genomeFastaFiles genome.fasta --sjdbGTFfile genes.gtf \
    --sjdbOverhang 100 --genomeSAindexNbases 7

COMMON=(--readFilesIn SRR6357072_1.fastq.gz SRR6357072_2.fastq.gz --readFilesCommand zcat
        --runThreadN 4 --sjdbGTFfile genes.gtf --twopassMode Basic --runRNGseed 0
        --outSAMtype BAM Unsorted --outSAMattributes NH HI AS NM MD
        --outFilterMultimapNmax 10
        --outSAMattrRGline ID:WT_REP2 SM:WT_REP2)

docker run --rm -v $PWD:/w -w /w $RUSTAR rustar-aligner \
    --genomeDir idx-rustar "${COMMON[@]}" --outFileNamePrefix RUS.
docker run --rm -v $PWD:/w -w /w $STAR STAR \
    --genomeDir idx-star "${COMMON[@]}" --outFileNamePrefix STAR.

for tool in RUS STAR; do
    if [ $tool = RUS ]; then LOG=RUS./Log.final.out; else LOG=STAR.Log.final.out; fi
    echo "=== $tool ==="
    grep -E 'Number of input reads|Uniquely mapped reads number|Number of reads mapped|Number of reads unmapped' $LOG
done

Suggested fix

At the PE too-many-loci filter site in src/align/read_align.rs::align_paired_read, return joint_pairs.len() as n_for_mapq (rather than 0) when the cap is exceeded. The caller in src/lib.rs::align_reads_paired_end then passes that count into record_alignment, which routes to the existing _ => arm. Same shape as the SE path which already records the loci count correctly.

Add a conservation assertion in tests: total_reads == uniquely_mapped + multi_mapped + too_many_loci + unmapped on a run with reads that exceed the cap.

Severity

Low-medium. Doesn't affect alignment correctness on the reads that ARE accounted for. Affects Log.final.out totals and any downstream tool (MultiQC's STAR module) that sanity-checks Number of input reads == sum(buckets). Scales with how many reads in the input hit the outFilterMultimapNmax ceiling — yeast at cap=10 has 0.006 %; mammalian repetitive regions can reach single digits.


Filed during nf-core/rnaseq integration testing (nf-core/rnaseq#1855).

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions