Summary
Reads exceeding --outFilterMultimapNmax were dropped silently from the per-bucket stats in Log.final.out. The too_many_loci counter stays at 0 no matter how many reads exceed the cap, and the per-bucket sum is total_reads - <dropped> instead of equalling Number of input reads.
On the yeast PE test profile (50 000 pairs) with STAR's default --outFilterMultimapNmax 10:
Log.final.out field |
rustar |
STAR |
Number of input reads |
50 000 |
50 000 |
Uniquely mapped reads number |
41 670 |
41 679 |
Number of reads mapped to multiple loci |
958 |
974 |
Number of reads mapped to too many loci |
0 |
3 |
Number of reads unmapped: too many mismatches |
0 |
0 |
Number of reads unmapped: too short |
4 193 |
3 786 |
Number of reads unmapped: other |
0 (3 592 with #49) |
3 558 |
| Sum of buckets |
49 997 |
50 000 |
The 3 missing reads are the 3 reads STAR classifies as too_many_loci. At larger cap values (e.g. --outFilterMultimapNmax 20) no read on this input exceeds the cap so both aligners report 0 — the bug only surfaces when the cap is small enough that some reads exceed it.
STAR reference behaviour
STAR's Stats.cpp::nUnmapped[5] tracks Reads mapped to too many loci separately from the unmapped buckets. The classification fires in ReadAlign_assignAlignToWindow.cpp when the count of valid alignments exceeds outFilterMultimapNmax; those reads are recorded in too_many_loci and (with default outSAMmultNmax=-1) suppressed from BAM emission. Per-bucket counts always sum to the input read count.
rustar's AlignmentStats::record_alignment(n_alignments, max_multimaps) already has the correct routing arm:
_ => {
self.too_many_loci.fetch_add(1, Ordering::Relaxed);
}
The bug is at the upstream filter site: on the PE path, align_paired_read's too-many-loci filter returns n_for_mapq = 0 rather than the actual loci count, so the caller hits record_alignment(0, max_multimaps) → n == 0 arm (unmapped) instead of the _ => arm.
Reproducer (uses --outFilterMultimapNmax 10 so reads actually exceed the cap)
#!/usr/bin/env bash
set -euo pipefail
mkdir -p /tmp/rustar-mre-tml && cd /tmp/rustar-mre-tml
BASE=https://raw.githubusercontent.com/nf-core/test-datasets/626c8fab639062eade4b10747e919341cbf9b41a
curl -fsLO $BASE/reference/genome.fasta
curl -fsL $BASE/reference/genes_with_empty_tid.gtf.gz | gunzip -c > genes.gtf
curl -fsLO $BASE/testdata/GSE110004/SRR6357072_1.fastq.gz
curl -fsLO $BASE/testdata/GSE110004/SRR6357072_2.fastq.gz
RUSTAR=ghcr.io/scverse/rustar-aligner:dev
STAR=community.wave.seqera.io/library/htslib_samtools_star_gawk:ae438e9a604351a4
mkdir -p idx-rustar idx-star
docker run --rm -v $PWD:/w -w /w $RUSTAR rustar-aligner --runMode genomeGenerate \
--genomeDir idx-rustar --genomeFastaFiles genome.fasta --sjdbGTFfile genes.gtf \
--sjdbOverhang 100 --genomeSAindexNbases 7
docker run --rm -v $PWD:/w -w /w $STAR STAR --runMode genomeGenerate \
--genomeDir idx-star --genomeFastaFiles genome.fasta --sjdbGTFfile genes.gtf \
--sjdbOverhang 100 --genomeSAindexNbases 7
COMMON=(--readFilesIn SRR6357072_1.fastq.gz SRR6357072_2.fastq.gz --readFilesCommand zcat
--runThreadN 4 --sjdbGTFfile genes.gtf --twopassMode Basic --runRNGseed 0
--outSAMtype BAM Unsorted --outSAMattributes NH HI AS NM MD
--outFilterMultimapNmax 10
--outSAMattrRGline ID:WT_REP2 SM:WT_REP2)
docker run --rm -v $PWD:/w -w /w $RUSTAR rustar-aligner \
--genomeDir idx-rustar "${COMMON[@]}" --outFileNamePrefix RUS.
docker run --rm -v $PWD:/w -w /w $STAR STAR \
--genomeDir idx-star "${COMMON[@]}" --outFileNamePrefix STAR.
for tool in RUS STAR; do
if [ $tool = RUS ]; then LOG=RUS./Log.final.out; else LOG=STAR.Log.final.out; fi
echo "=== $tool ==="
grep -E 'Number of input reads|Uniquely mapped reads number|Number of reads mapped|Number of reads unmapped' $LOG
done
Suggested fix
At the PE too-many-loci filter site in src/align/read_align.rs::align_paired_read, return joint_pairs.len() as n_for_mapq (rather than 0) when the cap is exceeded. The caller in src/lib.rs::align_reads_paired_end then passes that count into record_alignment, which routes to the existing _ => arm. Same shape as the SE path which already records the loci count correctly.
Add a conservation assertion in tests: total_reads == uniquely_mapped + multi_mapped + too_many_loci + unmapped on a run with reads that exceed the cap.
Severity
Low-medium. Doesn't affect alignment correctness on the reads that ARE accounted for. Affects Log.final.out totals and any downstream tool (MultiQC's STAR module) that sanity-checks Number of input reads == sum(buckets). Scales with how many reads in the input hit the outFilterMultimapNmax ceiling — yeast at cap=10 has 0.006 %; mammalian repetitive regions can reach single digits.
Filed during nf-core/rnaseq integration testing (nf-core/rnaseq#1855).
Summary
Reads exceeding
--outFilterMultimapNmaxwere dropped silently from the per-bucket stats inLog.final.out. Thetoo_many_locicounter stays at 0 no matter how many reads exceed the cap, and the per-bucket sum istotal_reads - <dropped>instead of equallingNumber of input reads.On the yeast PE test profile (50 000 pairs) with STAR's default
--outFilterMultimapNmax 10:Log.final.outfieldNumber of input readsUniquely mapped reads numberNumber of reads mapped to multiple lociNumber of reads mapped to too many lociNumber of reads unmapped: too many mismatchesNumber of reads unmapped: too shortNumber of reads unmapped: otherThe 3 missing reads are the 3 reads STAR classifies as
too_many_loci. At larger cap values (e.g.--outFilterMultimapNmax 20) no read on this input exceeds the cap so both aligners report 0 — the bug only surfaces when the cap is small enough that some reads exceed it.STAR reference behaviour
STAR's
Stats.cpp::nUnmapped[5]tracksReads mapped to too many lociseparately from the unmapped buckets. The classification fires inReadAlign_assignAlignToWindow.cppwhen the count of valid alignments exceedsoutFilterMultimapNmax; those reads are recorded intoo_many_lociand (with defaultoutSAMmultNmax=-1) suppressed from BAM emission. Per-bucket counts always sum to the input read count.rustar's
AlignmentStats::record_alignment(n_alignments, max_multimaps)already has the correct routing arm:The bug is at the upstream filter site: on the PE path,
align_paired_read's too-many-loci filter returnsn_for_mapq = 0rather than the actual loci count, so the caller hitsrecord_alignment(0, max_multimaps)→n == 0arm (unmapped) instead of the_ =>arm.Reproducer (uses
--outFilterMultimapNmax 10so reads actually exceed the cap)Suggested fix
At the PE too-many-loci filter site in
src/align/read_align.rs::align_paired_read, returnjoint_pairs.len()asn_for_mapq(rather than 0) when the cap is exceeded. The caller insrc/lib.rs::align_reads_paired_endthen passes that count intorecord_alignment, which routes to the existing_ =>arm. Same shape as the SE path which already records the loci count correctly.Add a conservation assertion in tests:
total_reads == uniquely_mapped + multi_mapped + too_many_loci + unmappedon a run with reads that exceed the cap.Severity
Low-medium. Doesn't affect alignment correctness on the reads that ARE accounted for. Affects
Log.final.outtotals and any downstream tool (MultiQC's STAR module) that sanity-checksNumber of input reads == sum(buckets). Scales with how many reads in the input hit theoutFilterMultimapNmaxceiling — yeast at cap=10 has 0.006 %; mammalian repetitive regions can reach single digits.Filed during nf-core/rnaseq integration testing (nf-core/rnaseq#1855).