Commit 7f13ae6

update ENCODE-eGFP with browser scripts
Figure 3 inserted into the manuscript of browser shots for pileups of mislabelled ID3 and NR4A1

scripts added:
- job/03_MakeBrowserData_genomes.sh -- makes synthetic genomes (ID3-eGFP and NR4A1-eGFP) to align against
- job/04_MakeBrowserData_BAM.sh -- filter fastq files and align to each synthetic genome
- README.md -- describe results directory structure
- results/annotations -- annotations for marking relevant features in synthetic genomes of browser figure
- .gitignore -- update with BrowserData directories
1 parent 8ed21d9 commit 7f13ae6

6 files changed

Lines changed: 164 additions & 6 deletions

paper/.gitignore

Lines changed: 7 additions & 6 deletions
@@ -2,21 +2,22 @@ run.setup.err
 run.setup.out
 input/hg19.fa*
 input/sacCer3.fa*
+input/*.fai
 db/
+SyntheticEpitope/results/hg19/*/*/*/*/*.*
+SyntheticEpitope/results/sacCer3/*/*/*/*/*.*
 SyntheticEpitope/synthetic_genome/
 SyntheticEpitope/logs/*.err-*
 SyntheticEpitope/logs/*.out-*
-SyntheticEpitope/results/sacCer3*
-SyntheticEpitope/results/hg19*
-SyntheticEpitope/results/mix_*
+ENCODEdata-eGFP/results/BrowserData/BAM/*
+ENCODEdata-eGFP/results/BrowserData/Genomes/*
+ENCODEdata-eGFP/results/FASTQ
+ENCODEdata-eGFP/results/ID/*.tab
 ENCODEdata-eGFP/logs/*.out-*
 ENCODEdata-eGFP/logs/*.err-*
-ENCODEdata-eGFP/results/FASTQ
-ENCODEdata-eGFP/results/ID
 HIV_samples/logs/*.err-*
 HIV_samples/logs/*.out-*
 HIV_samples/results/FASTQ
-HIV_samples/results/ID
 SyntheticDeletion/synthetic_genome/
 SyntheticDeletion/logs/*.err-*
 SyntheticDeletion/logs/*.out-*
Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
+#!/bin/bash
+
+# seqkit
+# bowtie
+
+module load gcc
+module load samtools
+module load anaconda3
+source activate my-genopipe-env
+
+INSERT=../SyntheticEpitope/scripts/insert_FASTA_into_Genome.pl
+RTAG=../db/hg19_EpiID/FASTA_tag/Tag_DB/LAP-tag.fa
+
+GDIR=results/BrowserData/Genomes
+[ -d $GDIR ] || mkdir -p $GDIR
+
+# Make two synthetic genomes to align to (opposite strand for rev complement)
+
+# ID3|chr1:23885453-XXX:-
+SGENOME=$GDIR/hg19_ID3-Cterm_LAP-tag.fa
+perl $INSERT ../input/hg19.fa chr1:23885453:+ $RTAG $SGENOME
+bowtie2-build $SGENOME $SGENOME
+
+# NR4A1|chr12:XXXXX-52452725:+
+SGENOME=$GDIR/hg19_NR4A1-Cterm_LAP-tag.fa
+perl $INSERT ../input/hg19.fa chr12:52452725:- $RTAG $SGENOME
+bowtie2-build $SGENOME $SGENOME
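A quick sanity check for the genomes built above, sketched under the assumption that `insert_FASTA_into_Genome.pl` inserts the tag rather than overwriting bases: the target chromosome in each synthetic FASTA should be longer than the hg19 original by the tag length (915 bp, going by the LAP-tag intervals in the annotation BED files). `fasta_lengths` is a hypothetical helper, not part of the repository.

```shell
# Hypothetical helper: print "name<TAB>length" for every record in a FASTA file.
# Works on plain (uncompressed) FASTA; sequence may be wrapped over many lines.
fasta_lengths() {
    awk '
        /^>/ { if (name != "") print name "\t" len   # flush the previous record
               name = substr($1, 2); len = 0; next } # ID = header up to first space
             { len += length($0) }                   # accumulate sequence length
        END  { if (name != "") print name "\t" len }
    ' "$1"
}

# Usage (paths as in the script above; chr1 carries the ID3 insertion):
#   fasta_lengths ../input/hg19.fa | awk '$1 == "chr1"'
#   fasta_lengths results/BrowserData/Genomes/hg19_ID3-Cterm_LAP-tag.fa | awk '$1 == "chr1"'
```

The two reported chr1 lengths are expected to differ by the tag length if the insertion assumption holds.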
Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@
+
+# Before running this script, run EpitopeID with "Clean-up" removal of intermediate files commented out
+
+# Create directory for FASTQ subset results
+BDIR=results/BrowserData/BAM
+FDIR=results/BrowserData/FASTQ
+GDIR=results/BrowserData/Genomes
+[ -d $BDIR ] || mkdir -p $BDIR
+[ -d $FDIR ] || mkdir -p $FDIR
+
+# Align reads for each ENCODE sample with ID3-eGFP and NR4A1-eGFP genomes
+for ENCFF in "ENCFF548RTA" "ENCFF671VDI";
+do
+    FQ=results/FASTQ/$ENCFF
+    SFQ=$FDIR/$ENCFF
+
+    # Subset both mates based on read IDs from intermediate files of EpitopeID
+    seqkit grep -f <(cat results/ID/$ENCFF\_R1/reads*) $FQ\_R1.fastq.gz > $SFQ\_R1.fastq
+    seqkit grep -f <(cat results/ID/$ENCFF\_R1/reads*) $FQ\_R2.fastq.gz > $SFQ\_R2.fastq
+
+    # Align to ID3 and index
+    SGENOME=$GDIR/hg19_ID3-Cterm_LAP-tag.fa
+    BAM=$BDIR/ID3-Nterm-LAP_$ENCFF
+    bowtie2 -x $SGENOME -1 $SFQ\_R1.fastq -2 $SFQ\_R2.fastq | samtools sort -o $BAM.bam
+    samtools index $BAM.bam
+
+    # Align to NR4A1 and index
+    SGENOME=$GDIR/hg19_NR4A1-Cterm_LAP-tag.fa
+    BAM=$BDIR/NR4A1-Nterm-LAP_$ENCFF
+    bowtie2 -x $SGENOME -1 $SFQ\_R1.fastq -2 $SFQ\_R2.fastq | samtools sort -o $BAM.bam
+    samtools index $BAM.bam
+done
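The subsetting step in the script above keeps only FASTQ records whose read ID appears in a list, using the same list for both mates so the R1 and R2 files stay in sync for paired alignment. A minimal awk sketch of that idea follows. `filter_fastq_by_id` is a hypothetical stand-in for `seqkit grep -f`, and it assumes uncompressed, strictly 4-line FASTQ records, one ID per line in the ID file, and mate headers that share the same ID token.

```shell
# Hypothetical stand-in for `seqkit grep -f IDS`: keep FASTQ records whose ID
# (header up to the first whitespace, without the leading "@") is listed in IDS.
filter_fastq_by_id() {
    local ids=$1 fastq=$2
    awk -v ids="$ids" '
        # Load the ID list once, before any FASTQ line is read.
        BEGIN { while ((getline line < ids) > 0) keep[line] = 1 }
        # Header line of each 4-line record decides whether the record is kept.
        NR % 4 == 1 { hit = (substr($1, 2) in keep) }
        # Print all four lines of a kept record.
        hit
    ' "$fastq"
}

# Usage (file names are placeholders):
#   filter_fastq_by_id read_ids.txt sample_R1.fastq > subset_R1.fastq
#   filter_fastq_by_id read_ids.txt sample_R2.fastq > subset_R2.fastq
```

Deciding on the header line by record position rather than by a leading `@` avoids misreading quality lines that happen to start with `@`.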
Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+chr1 23885452 23886367 LAP-tag 0 -
+chr1 23886367 23886832 ID3 0 -
+chr12 52432493 52452725 NR4A1 0 +
Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+chr1 23885452 23885917 ID3 0 -
+chr12 52432493 52452725 NR4A1 0 +
+chr12 52452725 52453640 LAP-tag 0 +
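The coordinates in the two annotation files above are consistent with a simple rule: any feature on the tagged chromosome that starts at or after the insertion point moves downstream by the tag length (915 bp), and the tag itself occupies the inserted interval. The sketch below reproduces that arithmetic. `shift_bed` is a hypothetical helper and the 0-based insertion positions are read off the BED files, so this illustrates the shift rather than documenting how the files were actually produced.

```shell
# Hypothetical helper: shift BED features on CHROM that start at or after POS
# by LEN bases, then append a feature for the inserted tag itself.
# Reads whitespace-separated BED on stdin, writes tab-separated BED on stdout.
shift_bed() {
    local chrom=$1 pos=$2 len=$3 strand=$4
    awk -v c="$chrom" -v p="$pos" -v n="$len" -v s="$strand" '
        BEGIN { OFS = "\t" }
        $1 == c && $2 >= p { $2 += n; $3 += n }   # feature lies downstream of the insert
        { $1 = $1; print }                        # rebuild the line with tab separators
        END { print c, p, p + n, "LAP-tag", 0, s }
    '
}

# Unshifted hg19 ORFs, as listed in the README.
orfs='chr1 23885452 23885917 ID3 0 -
chr12 52432493 52452725 NR4A1 0 +'

# ID3-tagged genome: 915 bp inserted at chr1:23885452.
echo "$orfs" | shift_bed chr1 23885452 915 -
# NR4A1-tagged genome: 915 bp inserted at chr12:52452725.
echo "$orfs" | shift_bed chr12 52452725 915 +
```

The output lists the tag last rather than in coordinate order; piping through `sort -k1,1 -k2,2n` would order it by position.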
Lines changed: 92 additions & 0 deletions
@@ -1 +1,93 @@
 # Downloaded FASTQ files and EpitopeID results go here
+
+
+## Run EpitopeID on all ENCODE tagged samples
+
+
+### Update script and download ENCODE samples
+```
+qsub 00_download_data.pbs
+```
+
+### Update script and run EpitopeID on ENCODE samples
+```
+qsub 01_indexed_runEID.pbs
+```
+
+### Compile results into summary report
+```
+bash 02_tally_results.sh
+```
+
+
+## Make Browser screenshots (Figure 3)
+
+### BrowserData/Genomes
+Two synthetic genomes (ID3-eGFP and NR4A1-eGFP) are built by running:
+```
+bash job/03_MakeBrowserData_genomes.sh
+#hg19_ID3-Cterm_LAP-tag.fa
+#hg19_NR4A1-Cterm_LAP-tag.fa
+```
+
+### BrowserData/FASTQ and BrowserData/BAM
+Reads were filtered to keep only read pairs with at least one read mapping to the eGFP tag, then aligned to each of the synthetic genomes. To obtain the set of read IDs, EpitopeID was run on the samples with the "clean-up" step (which removes the directory storing intermediate files) commented out. The raw FASTQ files were then filtered to these read IDs and aligned.
+
+```
+bash job/04_MakeBrowserData_BAM.sh
+#hg19_ID3-Cterm_LAP-tag_ENCFF548RTA.bam
+#hg19_ID3-Cterm_LAP-tag_ENCFF671VDI.bam
+#hg19_NR4A1-Cterm_LAP-tag_ENCFF548RTA.bam
+#hg19_NR4A1-Cterm_LAP-tag_ENCFF671VDI.bam
+
+```
+
+### BrowserData/annotations
+Annotations of the new ID3-eGFP and NR4A1-eGFP genomes, covering the two ORFs and the LAP-tag, are built as described below and saved to `results/BrowserData/annotations/hg19_ID3-Cterm_LAP-tag.bed` and `results/BrowserData/annotations/hg19_NR4A1-Cterm_LAP-tag.bed`.
+
+Gene ORFs were obtained from GENCODE:
+
+```
+wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz
+```
+
+Parse out all start/stop codon info for genes of interest (ID3 and NR4A1).
+```
+grep '\"ID3\"' ANNOTATIONS/gencode.v19.annotation.gtf_withproteinids |grep 'stop_codon' > ANNOTATIONS/hg19_NoGenotype_features.gtf
+grep '\"ID3\"' ANNOTATIONS/gencode.v19.annotation.gtf_withproteinids |grep 'start_codon' >> ANNOTATIONS/hg19_NoGenotype_features.gtf
+grep '\"NR4A1\"' ANNOTATIONS/gencode.v19.annotation.gtf_withproteinids |grep 'stop_codon' >> ANNOTATIONS/hg19_NoGenotype_features.gtf
+grep '\"NR4A1\"' ANNOTATIONS/gencode.v19.annotation.gtf_withproteinids |grep 'start_codon' >> ANNOTATIONS/hg19_NoGenotype_features.gtf
+```
+
+Identify the ORF for each gene from the GENCODE entries (selected based on IGV hg19 RefSeq annotations).
+```
+chr1 23885452 23885917 ID3 0 -
+chr12 52432493 52452725 NR4A1 0 +
+```
+
+Write new BED coordinates, shifted as appropriate for the eGFP (LAP) tag:
+* `hg19_ID3-Cterm_LAP-tag.bed`
+* `hg19_NR4A1-Cterm_LAP-tag.bed`
+
+
+### Screenshots taken
+
+IGV window screenshot coordinate range:
+
+* `hg19_ID3-Cterm_LAP-tag.fa`
+  * ID3-locus
+    * center -- chr1:(23885452+915)=23886367
+    * window -- `chr1:23885367-23887367`
+    * modwindow(2kb) -- `chr1:23885001-23886999`
+  * NR4A1-locus
+    * center -- chr12:52452725
+    * window -- `chr12:52451726-52453724`
+
+* `hg19_NR4A1-Cterm_LAP-tag.fa`
+  * ID3-locus
+    * center -- chr1:23885452
+    * window -- `chr1:23884453-23886451`
+  * NR4A1-locus
+    * center -- chr12:52452725
+    * window -- `chr12:52451725-52453725`
+    * modwindow(2kb) -- `chr12:52452201-52454199`
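The `window` entries above are roughly 2 kb ranges around each `center`. A small sketch of the arithmetic, assuming a symmetric half-width: the listed windows correspond to a half-width of either 1000 or 999 depending on the locus, and the `modwindow` ranges look manually re-centered, so they are not reproduced here. `igv_window` is a hypothetical helper.

```shell
# Hypothetical helper: print an IGV-style locus string centered on CENTER.
# HALF is the half-width in bp (default 1000).
igv_window() {
    local chrom=$1 center=$2 half=${3:-1000}
    echo "${chrom}:$((center - half))-$((center + half))"
}

# ID3 locus in the ID3-tagged genome: center is the ORF start plus the 915 bp tag.
igv_window chr1 $((23885452 + 915))     # chr1:23885367-23887367
# NR4A1 locus in the ID3-tagged genome, using a 999 bp half-width.
igv_window chr12 52452725 999           # chr12:52451726-52453724
```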