update ENCODE-eGFP with browser scripts

owlang · owlang · commit 7f13ae6bbb1d · 2023-03-06T14:13:17.000-05:00
Figure 3 (browser shots of pileups for mislabelled ID3 and NR4A1) inserted into the manuscript. Scripts and files added:
- job/03_MakeBrowserData_genomes.sh -- make synthetic genomes (ID3-eGFP and NR4A1-eGFP) to align against
- job/04_MakeBrowserData_BAM.sh -- filter FASTQ files and align to each synthetic genome
- README.md -- describe results directory structure
- results/BrowserData/annotations -- annotations marking relevant features in the synthetic genomes for the browser figure
- .gitignore -- update with BrowserData directories
diff --git a/paper/.gitignore b/paper/.gitignore
@@ -2,21 +2,22 @@ run.setup.err
 run.setup.out
 input/hg19.fa*
 input/sacCer3.fa*
+input/*.fai
 db/
+SyntheticEpitope/results/hg19/*/*/*/*/*.*
+SyntheticEpitope/results/sacCer3/*/*/*/*/*.*
 SyntheticEpitope/synthetic_genome/
 SyntheticEpitope/logs/*.err-*
 SyntheticEpitope/logs/*.out-*
-SyntheticEpitope/results/sacCer3*
-SyntheticEpitope/results/hg19*
-SyntheticEpitope/results/mix_*
+ENCODEdata-eGFP/results/BrowserData/BAM/*
+ENCODEdata-eGFP/results/BrowserData/Genomes/*
+ENCODEdata-eGFP/results/FASTQ
+ENCODEdata-eGFP/results/ID/*.tab
 ENCODEdata-eGFP/logs/*.out-*
 ENCODEdata-eGFP/logs/*.err-*
-ENCODEdata-eGFP/results/FASTQ
-ENCODEdata-eGFP/results/ID
 HIV_samples/logs/*.err-*
 HIV_samples/logs/*.out-*
 HIV_samples/results/FASTQ
-HIV_samples/results/ID
 SyntheticDeletion/synthetic_genome/
 SyntheticDeletion/logs/*.err-*
 SyntheticDeletion/logs/*.out-*
diff --git a/paper/ENCODEdata-eGFP/job/03_MakeBrowserData_genomes.sh b/paper/ENCODEdata-eGFP/job/03_MakeBrowserData_genomes.sh
@@ -0,0 +1,27 @@
+#!/bin/bash
+
+# seqkit
+# bowtie
+
+module load gcc
+module load samtools
+module load anaconda3
+source activate my-genopipe-env
+
+INSERT=../SyntheticEpitope/scripts/insert_FASTA_into_Genome.pl
+RTAG=../db/hg19_EpiID/FASTA_tag/Tag_DB/LAP-tag.fa
+
+GDIR=results/BrowserData/Genomes
+[ -d $GDIR ] || mkdir -p $GDIR
+
+# Make two synthetic genomes to align to (opposite strand for rev complement)
+
+# ID3|chr1:23885453-XXX:-
+SGENOME=$GDIR/hg19_ID3-Cterm_LAP-tag.fa
+perl $INSERT ../input/hg19.fa chr1:23885453:+ $RTAG $SGENOME
+bowtie2-build $SGENOME $SGENOME
+
+# NR4A1|chr12:XXXXX-52452725:+
+SGENOME=$GDIR/hg19_NR4A1-Cterm_LAP-tag.fa
+perl $INSERT ../input/hg19.fa chr12:52452725:- $RTAG $SGENOME
+bowtie2-build $SGENOME $SGENOME
diff --git a/paper/ENCODEdata-eGFP/job/04_MakeBrowserData_BAM.sh b/paper/ENCODEdata-eGFP/job/04_MakeBrowserData_BAM.sh
@@ -0,0 +1,32 @@
+
+# Before running this script, run EpitopeID with "Clean-up" removal of intermediate files commented out
+
+# Create directory for FASTQ subset results
+BDIR=results/BrowserData/BAM
+FDIR=results/BrowserData/FASTQ
+GDIR=results/BrowserData/Genomes
+[ -d $BDIR ] || mkdir -p $BDIR
+[ -d $FDIR ] || mkdir -p $FDIR
+
+# Align reads for each ENCODE sample with ID3-eGFP and NR4A1-eGFP genomes
+for ENCFF in "ENCFF548RTA" "ENCFF671VDI";
+do
+	FQ=results/FASTQ/$ENCFF
+	SFQ=$FDIR/$ENCFF
+
+	# Subset based on read IDs from intermediate files of EpitopeID
+	seqkit grep -f <(cat results/ID/$ENCFF\_R1/reads*) $FQ\_R1.fastq.gz > $SFQ\_R1.fastq
+	seqkit grep -f <(cat results/ID/$ENCFF\_R1/reads*) $FQ\_R2.fastq.gz > $SFQ\_R2.fastq
+
+	# Align to ID3 and index
+	SGENOME=$GDIR/hg19_ID3-Cterm_LAP-tag.fa
+	BAM=$BDIR/ID3-Nterm-LAP_$ENCFF
+	bowtie2 -x $SGENOME -1 $SFQ\_R1.fastq -2 $SFQ\_R2.fastq | samtools sort -o $BAM.bam
+	samtools index $BAM.bam
+
+	# Align to NR4A1 and index
+	SGENOME=$GDIR/hg19_NR4A1-Cterm_LAP-tag.fa
+	BAM=$BDIR/NR4A1-Nterm-LAP_$ENCFF
+	bowtie2 -x $SGENOME -1 $SFQ\_R1.fastq -2 $SFQ\_R2.fastq | samtools sort -o $BAM.bam
+	samtools index $BAM.bam
+done
diff --git a/paper/ENCODEdata-eGFP/results/BrowserData/annotations/hg19_ID3-Cterm_LAP-tag.bed b/paper/ENCODEdata-eGFP/results/BrowserData/annotations/hg19_ID3-Cterm_LAP-tag.bed
@@ -0,0 +1,3 @@
+chr1	23885452	23886367	LAP-tag	0	-
+chr1	23886367	23886832	ID3	0	-
+chr12	52432493	52452725	NR4A1	0	+
diff --git a/paper/ENCODEdata-eGFP/results/BrowserData/annotations/hg19_NR4A1-Cterm_LAP-tag.bed b/paper/ENCODEdata-eGFP/results/BrowserData/annotations/hg19_NR4A1-Cterm_LAP-tag.bed
@@ -0,0 +1,3 @@
+chr1	23885452	23885917	ID3	0	-
+chr12	52432493	52452725	NR4A1	0	+
+chr12	52452725	52453640	LAP-tag	0	+
diff --git a/paper/ENCODEdata-eGFP/results/README.md b/paper/ENCODEdata-eGFP/results/README.md
@@ -1 +1,93 @@
 # Downloaded FASTQ files and EpitopeID results go here
+
+
+## Run EpitopeID on all ENCODE tagged samples
+
+
+### Update script and download ENCODE samples
+```
+qsub 00_download_data.pbs
+```
+
+### Update script and run EpitopeID on ENCODE samples
+```
+qsub 01_indexed_runEID.pbs
+```
+
+### Compile results into summary report
+```
+bash 02_tally_results.sh
+```
+
+
+## Make Browser screenshots (Figure 3)
+
+### BrowserData/Genomes
+Two synthetic genomes (ID3-eGFP and NR4A1-eGFP) are built by running:
+```
+bash job/03_MakeBrowserData_genomes.sh
+#hg19_ID3-Cterm_LAP-tag.fa
+#hg19_NR4A1-Cterm_LAP-tag.fa
+```
+
+### BrowserData/FASTQ and BrowserData/BAM
+Reads were filtered to keep only read pairs with at least one read mapping to the eGFP tag, then aligned to each of the synthetic genomes. To obtain the set of read IDs, EpitopeID was run on the samples with the "clean-up" removal of the intermediate-file directory commented out. The raw FASTQ files were then filtered to these read IDs and aligned.
+
+```
+bash job/04_MakeBrowserData_BAM.sh
+#ID3-Nterm-LAP_ENCFF548RTA.bam
+#ID3-Nterm-LAP_ENCFF671VDI.bam
+#NR4A1-Nterm-LAP_ENCFF548RTA.bam
+#NR4A1-Nterm-LAP_ENCFF671VDI.bam
+
+```
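The read-ID filtering step can be illustrated on toy data without any external tools. The script itself uses `seqkit grep -f`, which matches on the read ID; the awk helper below is only a stand-in sketch of that idea, not the method used in this commit. It assumes strict 4-line FASTQ records, and the helper name `filter_fastq_by_id` and all file names are hypothetical.

```shell
#!/bin/bash
# Illustrative only: a pure-awk stand-in for filtering a FASTQ file by read ID.
# Assumes strict 4-line FASTQ records; helper and file names are hypothetical.
set -euo pipefail

WORK=$(mktemp -d)

# Toy read-ID list (one ID per line, without the leading "@")
printf 'read2\n' > "$WORK/ids.txt"

# Toy FASTQ with three records
printf '@read1\nACGT\n+\nIIII\n@read2 extra\nTTGA\n+\nIIII\n@read3\nGGCC\n+\nIIII\n' > "$WORK/toy_R1.fastq"

# Keep a record when the first token of its header line (minus "@") is in the ID list
filter_fastq_by_id() {
	awk 'NR==FNR { keep[$1]=1; next }
	     FNR%4==1 { id=substr($1,2); hit=(id in keep) }
	     hit' "$1" "$2"
}

SUBSET=$(filter_fastq_by_id "$WORK/ids.txt" "$WORK/toy_R1.fastq")
echo "$SUBSET"
rm -r "$WORK"
```

Applying the same ID list to both the R1 and R2 files keeps the mates of each selected read together, which is what paired-end alignment needs.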
+
+### BrowserData/annotations
+Annotations of the new ID3-eGFP and NR4A1-eGFP genomes, covering each of the two ORFs and the LAP-tag, are built using the methodology below and saved to `results/BrowserData/annotations/hg19_ID3-Cterm_LAP-tag.bed` and `results/BrowserData/annotations/hg19_NR4A1-Cterm_LAP-tag.bed`.
+
+Gene ORFs were obtained from GENCODE:
+
+```
+wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz
+```
+
+Parse out all start/stop codon info for genes of interest (ID3 and NR4A1).
+```
+grep '"ID3"' ANNOTATIONS/gencode.v19.annotation.gtf_withproteinids | grep 'stop_codon' > ANNOTATIONS/hg19_NoGenotype_features.gtf
+grep '"ID3"' ANNOTATIONS/gencode.v19.annotation.gtf_withproteinids | grep 'start_codon' >> ANNOTATIONS/hg19_NoGenotype_features.gtf
+grep '"NR4A1"' ANNOTATIONS/gencode.v19.annotation.gtf_withproteinids | grep 'stop_codon' >> ANNOTATIONS/hg19_NoGenotype_features.gtf
+grep '"NR4A1"' ANNOTATIONS/gencode.v19.annotation.gtf_withproteinids | grep 'start_codon' >> ANNOTATIONS/hg19_NoGenotype_features.gtf
+```
+
+Identify the ORF for each gene from the GENCODE entries (selected based on IGV hg19 RefSeq annotations):
+```
+chr1	23885452	23885917	ID3	0	-
+chr12	52432493	52452725	NR4A1	0	+
+```
+
+Write new BED coordinates, shifted as appropriate for the eGFP (LAP) tag, to:
+* `hg19_ID3-Cterm_LAP-tag.bed`
+* `hg19_NR4A1-Cterm_LAP-tag.bed`
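The coordinate shift can be sketched as follows. This is a reconstruction under assumptions, not necessarily how the BED files were produced: the tag length of 915 bp is inferred from the committed BED files (23886367 - 23885452), insertion points are taken as 0-based positions, and `shift_bed` is a hypothetical helper. Features on the insertion chromosome that start at or after the insertion point move downstream by the tag length; everything else is left unchanged. Features spanning the insertion point are not handled.

```shell
#!/bin/bash
# Sketch: shift ORF coordinates for a tag insertion and emit the tag as its own feature.
# The 915 bp tag length is inferred from the committed BED files; shift_bed is hypothetical.
set -euo pipefail

# ORF coordinates from the README (hg19, BED 0-based half-open)
ORFS=$(printf 'chr1\t23885452\t23885917\tID3\t0\t-\nchr12\t52432493\t52452725\tNR4A1\t0\t+\n')

# shift_bed CHROM POS LEN STRAND: read BED on stdin, write shifted BED plus a LAP-tag record
shift_bed() {
	awk -v chr="$1" -v pos="$2" -v len="$3" -v strand="$4" 'BEGIN { FS=OFS="\t" }
		$1==chr && $2>=pos { $2+=len; $3+=len }
		{ print }
		END { print chr, pos, pos+len, "LAP-tag", 0, strand }'
}

ID3_BED=$(echo "$ORFS" | shift_bed chr1 23885452 915 - | sort -t $'\t' -k1,1 -k2,2n)
NR4A1_BED=$(echo "$ORFS" | shift_bed chr12 52452725 915 + | sort -t $'\t' -k1,1 -k2,2n)

echo "$ID3_BED"
echo "$NR4A1_BED"
```

With these inputs the two results match the contents of `hg19_ID3-Cterm_LAP-tag.bed` and `hg19_NR4A1-Cterm_LAP-tag.bed` in this commit.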
+
+
+### Screenshots taken
+
+IGV window screenshot coordinate range:
+
+* `hg19_ID3-Cterm_LAP-tag.fa`
+  * ID3-locus
+    * center -- chr1:(23885452+915)=23886367
+    * window -- `chr1:23885367-23887367`
+    * modwindow(2kb) -- `chr1:23885001-23886999`
+  * NR4A1-locus
+    * center -- chr12:52452725
+    * window -- `chr12:52451726-52453724`
+
+* `hg19_NR4A1-Cterm_LAP-tag.fa`
+  * ID3-locus
+    * center -- chr1:23885452
+    * window -- `chr1:23884453-23886451`
+  * NR4A1-locus
+    * center -- chr12:52452725
+    * window -- `chr12:52451725-52453725`
+    * modwindow(2kb) -- `chr12:52452201-52454199`
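The window arithmetic above can be sketched in a few lines of bash. This assumes a symmetric window of 1000 bp on either side of the center, which reproduces the ID3-locus window for `hg19_ID3-Cterm_LAP-tag.fa` and the NR4A1-locus window for `hg19_NR4A1-Cterm_LAP-tag.fa`. The other listed windows differ by 1 bp, and the `modwindow` ranges are not reproduced by this formula. `igv_window` is a hypothetical helper, and the 915 bp tag length is inferred from the BED annotations.

```shell
#!/bin/bash
# Sketch: derive an IGV coordinate range from a center position and a half-width.
set -euo pipefail

TAG_LEN=915   # LAP-tag length, inferred from the BED annotations (23886367 - 23885452)

# igv_window CHROM CENTER HALFWIDTH (hypothetical helper)
igv_window() {
	echo "$1:$(( $2 - $3 ))-$(( $2 + $3 ))"
}

# In the ID3-tagged genome the ID3-locus center sits one tag length past the insertion point
ID3_CENTER=$(( 23885452 + TAG_LEN ))
ID3_WINDOW=$(igv_window chr1 "$ID3_CENTER" 1000)

# In the NR4A1-tagged genome the NR4A1-locus center is the insertion point itself
NR4A1_WINDOW=$(igv_window chr12 52452725 1000)

echo "$ID3_WINDOW"
echo "$NR4A1_WINDOW"
```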
