Skip to content

Log.final.out folds all unmapped reads into too short; other bucket always 0 #48

@pinin4fjords

Description

@pinin4fjords

Summary

Log.final.out reports two categories of unmapped reads — Number of reads unmapped: too short and Number of reads unmapped: other — using STAR's exact field labels. rustar folds both buckets into too short and always reports other = 0. Total unmapped count is preserved, but a downstream tool that parses the two fields separately (MultiQC's STAR module, custom QC dashboards) gets an inflated too short percentage and a zero other percentage.

On WT_REP2 (nf-core/rnaseq test profile):

Field STAR 2.7.11b rustar 0.1.0
Number of reads unmapped: too many mismatches 0 0
Number of reads unmapped: too short 1 540 (3.11 %) 4 193 (8.46 %)
Number of reads unmapped: other 2 656 (5.36 %) 0 (0.00 %)
Total unmapped 4 196 4 193

The totals match within RNG / tie-breaking tolerance — rustar isn't dropping reads. It's the bucketing that's wrong: every read STAR would label other (no seeds / no clusters / failed-stitch reason) is being mislabelled as too short (post-stitch length / score / match filter).

STAR reference behaviour

STAR's bucketing (per source/ReadAlign_assignAlignToWindow.cpp and the nUnmapped arrays in source/Stats.cpp):

  • too short: a transcript was generated but filtered out by score / match-count / read-length-coverage filters (outFilterScoreMin*, outFilterMatchNmin*).
  • other: no seeds, no clusters, no transcript — read couldn't be aligned at all.
  • too many mismatches: filtered by outFilterMismatchN*.

rustar's src/stats.rs::UnmappedReason enum has all four variants:

pub enum UnmappedReason {
    Other,
    TooShort,
    TooManyMismatches,
    TooManyLoci,
}

So the categorization exists in the code. The bug is at the classification call sites — wherever an unmapped read is recorded, the code is calling record_unmapped_reason(UnmappedReason::TooShort) instead of UnmappedReason::Other for reads that have no alignments at all.

Reproducer

#!/usr/bin/env bash
set -euo pipefail
mkdir -p /tmp/rustar-mre-unmapped && cd /tmp/rustar-mre-unmapped

BASE=https://raw.githubusercontent.com/nf-core/test-datasets/626c8fab639062eade4b10747e919341cbf9b41a
curl -fsLO $BASE/reference/genome.fasta
curl -fsL  $BASE/reference/genes_with_empty_tid.gtf.gz | gunzip -c > genes.gtf
curl -fsLO $BASE/testdata/GSE110004/SRR6357072_1.fastq.gz
curl -fsLO $BASE/testdata/GSE110004/SRR6357072_2.fastq.gz

RUSTAR=ghcr.io/scverse/rustar-aligner:dev
STAR=community.wave.seqera.io/library/htslib_samtools_star_gawk:ae438e9a604351a4

mkdir -p idx-rustar idx-star
docker run --rm -v $PWD:/w -w /w $RUSTAR rustar-aligner --runMode genomeGenerate \
    --genomeDir idx-rustar --genomeFastaFiles genome.fasta --sjdbGTFfile genes.gtf \
    --sjdbOverhang 100 --genomeSAindexNbases 7
docker run --rm -v $PWD:/w -w /w $STAR STAR --runMode genomeGenerate \
    --genomeDir idx-star --genomeFastaFiles genome.fasta --sjdbGTFfile genes.gtf \
    --sjdbOverhang 100 --genomeSAindexNbases 7

COMMON=(--readFilesIn SRR6357072_1.fastq.gz SRR6357072_2.fastq.gz --readFilesCommand zcat
        --runThreadN 4 --sjdbGTFfile genes.gtf --twopassMode Basic --runRNGseed 0
        --outSAMtype BAM Unsorted --outSAMattributes NH HI AS NM MD
        --outSAMattrRGline ID:WT_REP2 SM:WT_REP2)

docker run --rm -v $PWD:/w -w /w $RUSTAR rustar-aligner \
    --genomeDir idx-rustar "${COMMON[@]}" --outFileNamePrefix RUS.
docker run --rm -v $PWD:/w -w /w $STAR STAR \
    --genomeDir idx-star "${COMMON[@]}" --outFileNamePrefix STAR.

echo "=== rustar unmapped buckets ==="
grep -E 'unmapped' RUS./Log.final.out
echo
echo "=== STAR unmapped buckets ==="
grep -E 'unmapped' STAR.Log.final.out

Suggested investigation

Find every site that records an unmapped read in rustar — likely a record_unmapped_reason(UnmappedReason::...) call. Trace which one fires for reads that fail at different pipeline stages:

  • No seeds found → should be Other
  • Seeds found but no clusters / no extension → should be Other
  • Transcripts generated but filtered by outFilterScoreMin* / outFilterMatchNmin* → should be TooShort
  • Filtered by outFilterMismatchN* → should be TooManyMismatches

The fact that rustar's Other is always 0 strongly suggests record_unmapped_reason is never being called with UnmappedReason::Other — either the no-seeds / no-clusters path is silently falling through to the TooShort bucket, or the Other arm of the enum is unreachable in the current code.

Severity

Low-to-medium. Doesn't affect mapping quality or alignment correctness — totals match STAR. Affects downstream QC reading (MultiQC's STAR module pulls both fields; the inflated too short bar and zero other bar would look anomalous to a user comparing samples processed by different aligners). Anything that summarises "fraction of reads unmappable for non-trivial reasons" (STAR's other bucket captures harder-to-rescue reads) loses that signal entirely.


Filed during nf-core/rnaseq integration testing (nf-core/rnaseq#1855). All sibling issues from this exercise: author:pinin4fjords or grep for nf-core/rnaseq#1855.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions