Summary
Log.final.out always reports Number of splices: Annotated (sjdb) = 0 even when --sjdbGTFfile is provided. Read-derived splices that match GTF junctions never get credited as annotated; the sjdbScore bonus never fires; the stricter unannotated overhang gate is used for every splice. ~50 % of GT/AG splices are dropped from pass 1, non-canonical rate roughly triples.
Smoking gun: the same paired-end run produces SJ.out.tab with 2 rows annotated=1 but Log.final.out reports Annotated (sjdb) = 0. Two internal counters reading the same DB give different answers.
Root cause
The GTF-derived SpliceJunctionDb is keyed in chromosome-local 1-based coordinates (src/junction/gtf.rs:207-209) but consulted in genome-absolute coordinates at two call sites that disagree with each other and with the DB:
let donor_fwd =
index.sa_pos_to_forward(junc_donor_sa, cluster.is_reverse, del as usize);
let acceptor_fwd = donor_fwd + del as u64;
db.is_annotated(cluster.chr_idx, donor_fwd, acceptor_fwd, ...)
donor_fwd is genome-absolute 0-based. On chr_start[0] = 0 (yeast test data) the key is 1 less than the DB's chr-local 1-based store. On chr N+ the offset is chr_start[N]. Every lookup misses. Additionally acceptor_fwd = donor_fwd + del lands one base past the last intron base, so the acceptor is off-by-one even in the right coordinate space.
let intron_start = genome_pos + 1;
let intron_end = genome_pos + intron_len as u64;
let annotated = index.junction_db.is_annotated(transcript.chr_idx, intron_start, intron_end, strand);
genome_pos is genome-absolute 0-based, so this is genome-absolute 1-based-equivalent. On chr 0 it accidentally matches the DB; on chr N+ it misses by chr_start[N].
Reproducer (paired STAR + rustar)
#!/usr/bin/env bash
set -euo pipefail
mkdir -p /tmp/rustar-mre-27 && cd /tmp/rustar-mre-27
BASE=https://raw.githubusercontent.com/nf-core/test-datasets/626c8fab639062eade4b10747e919341cbf9b41a
curl -fsLO $BASE/reference/genome.fasta
curl -fsL $BASE/reference/genes_with_empty_tid.gtf.gz | gunzip -c > genes.gtf
curl -fsLO $BASE/testdata/GSE110004/SRR6357072_1.fastq.gz
curl -fsLO $BASE/testdata/GSE110004/SRR6357072_2.fastq.gz
RUSTAR=ghcr.io/scverse/rustar-aligner:dev
STAR=community.wave.seqera.io/library/htslib_samtools_star_gawk:ae438e9a604351a4
mkdir -p idx-rustar idx-star
docker run --rm -v $PWD:/w -w /w $RUSTAR rustar-aligner --runMode genomeGenerate \
--genomeDir idx-rustar --genomeFastaFiles genome.fasta --sjdbGTFfile genes.gtf \
--sjdbOverhang 100 --genomeSAindexNbases 7
docker run --rm -v $PWD:/w -w /w $STAR STAR --runMode genomeGenerate \
--genomeDir idx-star --genomeFastaFiles genome.fasta --sjdbGTFfile genes.gtf \
--sjdbOverhang 100 --genomeSAindexNbases 7
COMMON=(--readFilesIn SRR6357072_1.fastq.gz SRR6357072_2.fastq.gz --readFilesCommand zcat
--runThreadN 4 --sjdbGTFfile genes.gtf --twopassMode Basic --runRNGseed 0
--outSAMtype BAM Unsorted --outSAMattributes NH HI AS NM MD
--outSAMattrRGline ID:WT_REP2 SM:WT_REP2)
docker run --rm -v $PWD:/w -w /w $RUSTAR rustar-aligner \
--genomeDir idx-rustar "${COMMON[@]}" --outFileNamePrefix RUS.
docker run --rm -v $PWD:/w -w /w $STAR STAR \
--genomeDir idx-star "${COMMON[@]}" --outFileNamePrefix STAR.
echo "=== rustar ===" ; grep -E 'splices|sjdb' RUS./Log.final.out
echo "=== STAR ===" ; grep -E 'splices|sjdb' STAR.Log.final.out
echo "=== rustar SJ.out.tab annotated col ===" ; awk '{print $6}' RUS./SJ.out.tab | sort | uniq -c
Observed
=== rustar ===
Number of splices: Total | 366
Number of splices: Annotated (sjdb) | 0
Number of splices: GT/AG | 266
Number of splices: Non-canonical | 93
=== STAR ===
Number of splices: Total | 720
Number of splices: Annotated (sjdb) | 620
Number of splices: GT/AG | 688
Number of splices: Non-canonical | 25
=== rustar SJ.out.tab annotated col ===
14 0
2 1
Note: Log.final.out Annotated = 0 while SJ.out.tab has 2 annotated rows on the same BAM — the two counters disagree.
Suggested fix
Normalise the DB to genome-absolute 0-based at construction (matching the convention prepared_junctions in src/index/mod.rs:92-105 and SpliceJunctionStats already use):
- In
SpliceJunctionDb::from_* constructors, store keys as (chr_idx, chr_start[chr_idx] + intron_start_local_1b - 1, chr_start[chr_idx] + intron_end_local_1b - 1).
- Update the stats-time call site at
src/lib.rs:1860-1894 to query (genome_pos, genome_pos + intron_len - 1) (drop the +1 / +intron_len offsets).
- Subtract 1 from the stitch-time
acceptor_fwd so it lands on the last intron base.
- Adjust
src/junction/sj_output.rs:262-264 if needed so the chr-local 1-based output in SJ.out.tab is unchanged.
Related
- #47: pass-1 doesn't seed alignment candidates from the GTF junctions either — separate bug, accounts for the residual splice-count gap once this issue is fixed.
- #50: the
sjdbScore bonus is added to motif_score rather than replacing it; visible once this issue lets the bonus path actually fire.
Severity
Medium. Functional impact (50 % of GT/AG splices dropped, 3× non-canonical rate) scales with how many reads cross annotated junctions — small on the yeast test profile, much larger on mammalian data.
Filed during nf-core/rnaseq integration testing (nf-core/rnaseq#1855).
Summary
Log.final.outalways reportsNumber of splices: Annotated (sjdb) = 0even when--sjdbGTFfileis provided. Read-derived splices that match GTF junctions never get credited as annotated; thesjdbScorebonus never fires; the stricter unannotated overhang gate is used for every splice. ~50 % of GT/AG splices are dropped from pass 1, non-canonical rate roughly triples.Smoking gun: the same paired-end run produces
SJ.out.tabwith 2 rowsannotated=1butLog.final.outreportsAnnotated (sjdb) = 0. Two internal counters reading the same DB give different answers.Root cause
The GTF-derived
SpliceJunctionDbis keyed in chromosome-local 1-based coordinates (src/junction/gtf.rs:207-209) but consulted in genome-absolute coordinates at two call sites that disagree with each other and with the DB:Stitch-time (
src/align/stitch.rs:1305-1314)donor_fwdis genome-absolute 0-based. Onchr_start[0] = 0(yeast test data) the key is 1 less than the DB's chr-local 1-based store. On chr N+ the offset ischr_start[N]. Every lookup misses. Additionallyacceptor_fwd = donor_fwd + dellands one base past the last intron base, so the acceptor is off-by-one even in the right coordinate space.Stats-time (
src/lib.rs:1860-1894)genome_posis genome-absolute 0-based, so this is genome-absolute 1-based-equivalent. On chr 0 it accidentally matches the DB; on chr N+ it misses bychr_start[N].Reproducer (paired STAR + rustar)
Observed
Note:
Log.final.outAnnotated = 0 whileSJ.out.tabhas 2 annotated rows on the same BAM — the two counters disagree.Suggested fix
Normalise the DB to genome-absolute 0-based at construction (matching the convention
prepared_junctionsinsrc/index/mod.rs:92-105andSpliceJunctionStatsalready use):SpliceJunctionDb::from_*constructors, store keys as(chr_idx, chr_start[chr_idx] + intron_start_local_1b - 1, chr_start[chr_idx] + intron_end_local_1b - 1).src/lib.rs:1860-1894to query(genome_pos, genome_pos + intron_len - 1)(drop the+1/+intron_lenoffsets).acceptor_fwdso it lands on the last intron base.src/junction/sj_output.rs:262-264if needed so the chr-local 1-based output inSJ.out.tabis unchanged.Related
sjdbScorebonus is added tomotif_scorerather than replacing it; visible once this issue lets the bonus path actually fire.Severity
Medium. Functional impact (50 % of GT/AG splices dropped, 3× non-canonical rate) scales with how many reads cross annotated junctions — small on the yeast test profile, much larger on mammalian data.
Filed during nf-core/rnaseq integration testing (nf-core/rnaseq#1855).