Skip to content

Log.final.out always reports Number of splices: Annotated (sjdb) = 0 despite --sjdbGTFfile #27

@pinin4fjords

Description

@pinin4fjords

Summary

Log.final.out always reports Number of splices: Annotated (sjdb) = 0 even when --sjdbGTFfile is provided. Read-derived splices that match GTF junctions never get credited as annotated; the sjdbScore bonus never fires; the stricter unannotated overhang gate is used for every splice. ~50 % of GT/AG splices are dropped from pass 1, non-canonical rate roughly triples.

Smoking gun: the same paired-end run produces SJ.out.tab with 2 rows annotated=1 but Log.final.out reports Annotated (sjdb) = 0. Two internal counters reading the same DB give different answers.

Root cause

The GTF-derived SpliceJunctionDb is keyed in chromosome-local 1-based coordinates (src/junction/gtf.rs:207-209) but consulted in genome-absolute coordinates at two call sites that disagree with each other and with the DB:

Stitch-time (src/align/stitch.rs:1305-1314)

let donor_fwd =
    index.sa_pos_to_forward(junc_donor_sa, cluster.is_reverse, del as usize);
let acceptor_fwd = donor_fwd + del as u64;
db.is_annotated(cluster.chr_idx, donor_fwd, acceptor_fwd, ...)

donor_fwd is genome-absolute 0-based. On chr_start[0] = 0 (yeast test data) the key is 1 less than the DB's chr-local 1-based store. On chr N+ the offset is chr_start[N]. Every lookup misses. Additionally acceptor_fwd = donor_fwd + del lands one base past the last intron base, so the acceptor is off-by-one even in the right coordinate space.

Stats-time (src/lib.rs:1860-1894)

let intron_start = genome_pos + 1;
let intron_end = genome_pos + intron_len as u64;
let annotated = index.junction_db.is_annotated(transcript.chr_idx, intron_start, intron_end, strand);

genome_pos is genome-absolute 0-based, so this is genome-absolute 1-based-equivalent. On chr 0 it accidentally matches the DB; on chr N+ it misses by chr_start[N].

Reproducer (paired STAR + rustar)

#!/usr/bin/env bash
set -euo pipefail
mkdir -p /tmp/rustar-mre-27 && cd /tmp/rustar-mre-27

BASE=https://raw.githubusercontent.com/nf-core/test-datasets/626c8fab639062eade4b10747e919341cbf9b41a
curl -fsLO $BASE/reference/genome.fasta
curl -fsL  $BASE/reference/genes_with_empty_tid.gtf.gz | gunzip -c > genes.gtf
curl -fsLO $BASE/testdata/GSE110004/SRR6357072_1.fastq.gz
curl -fsLO $BASE/testdata/GSE110004/SRR6357072_2.fastq.gz

RUSTAR=ghcr.io/scverse/rustar-aligner:dev
STAR=community.wave.seqera.io/library/htslib_samtools_star_gawk:ae438e9a604351a4

mkdir -p idx-rustar idx-star
docker run --rm -v $PWD:/w -w /w $RUSTAR rustar-aligner --runMode genomeGenerate \
    --genomeDir idx-rustar --genomeFastaFiles genome.fasta --sjdbGTFfile genes.gtf \
    --sjdbOverhang 100 --genomeSAindexNbases 7
docker run --rm -v $PWD:/w -w /w $STAR STAR --runMode genomeGenerate \
    --genomeDir idx-star --genomeFastaFiles genome.fasta --sjdbGTFfile genes.gtf \
    --sjdbOverhang 100 --genomeSAindexNbases 7

COMMON=(--readFilesIn SRR6357072_1.fastq.gz SRR6357072_2.fastq.gz --readFilesCommand zcat
        --runThreadN 4 --sjdbGTFfile genes.gtf --twopassMode Basic --runRNGseed 0
        --outSAMtype BAM Unsorted --outSAMattributes NH HI AS NM MD
        --outSAMattrRGline ID:WT_REP2 SM:WT_REP2)

docker run --rm -v $PWD:/w -w /w $RUSTAR rustar-aligner \
    --genomeDir idx-rustar "${COMMON[@]}" --outFileNamePrefix RUS.
docker run --rm -v $PWD:/w -w /w $STAR STAR \
    --genomeDir idx-star "${COMMON[@]}" --outFileNamePrefix STAR.

echo "=== rustar ===" ; grep -E 'splices|sjdb' RUS./Log.final.out
echo "=== STAR ==="   ; grep -E 'splices|sjdb' STAR.Log.final.out
echo "=== rustar SJ.out.tab annotated col ===" ; awk '{print $6}' RUS./SJ.out.tab | sort | uniq -c

Observed

=== rustar ===
Number of splices: Total            | 366
Number of splices: Annotated (sjdb) | 0
Number of splices: GT/AG            | 266
Number of splices: Non-canonical    | 93

=== STAR ===
Number of splices: Total            | 720
Number of splices: Annotated (sjdb) | 620
Number of splices: GT/AG            | 688
Number of splices: Non-canonical    | 25

=== rustar SJ.out.tab annotated col ===
   14 0
    2 1

Note: Log.final.out Annotated = 0 while SJ.out.tab has 2 annotated rows on the same BAM — the two counters disagree.

Suggested fix

Normalise the DB to genome-absolute 0-based at construction (matching the convention prepared_junctions in src/index/mod.rs:92-105 and SpliceJunctionStats already use):

  1. In SpliceJunctionDb::from_* constructors, store keys as (chr_idx, chr_start[chr_idx] + intron_start_local_1b - 1, chr_start[chr_idx] + intron_end_local_1b - 1).
  2. Update the stats-time call site at src/lib.rs:1860-1894 to query (genome_pos, genome_pos + intron_len - 1) (drop the +1 / +intron_len offsets).
  3. Subtract 1 from the stitch-time acceptor_fwd so it lands on the last intron base.
  4. Adjust src/junction/sj_output.rs:262-264 if needed so the chr-local 1-based output in SJ.out.tab is unchanged.

Related

  • #47: pass-1 doesn't seed alignment candidates from the GTF junctions either — separate bug, accounts for the residual splice-count gap once this issue is fixed.
  • #50: the sjdbScore bonus is added to motif_score rather than replacing it; visible once this issue lets the bonus path actually fire.

Severity

Medium. Functional impact (50 % of GT/AG splices dropped, 3× non-canonical rate) scales with how many reads cross annotated junctions — small on the yeast test profile, much larger on mammalian data.


Filed during nf-core/rnaseq integration testing (nf-core/rnaseq#1855).

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions