Commit 9713fa7

update docusaurus eid DB customization
Add to the epitopeid.md content around database customization, clarify language, and fix typos.
1 parent 270355e commit 9713fa7

1 file changed

Lines changed: 71 additions & 29 deletions

docusaurus/docs/epitopeid.md

@@ -22,28 +22,34 @@ Specific Dependencies
2222

2323
## Input (`-i`)
2424

25-
EpitopeID takes [gzipped][gzip-man] FASTQ files from single-end(SE) or paired-end(PE) datasets to run. These should all be added to the same directory and the path to this directory will be used as the input for `identify-Epitope.sh`. If your FASTQ files are not already gzipped and you have `gzip` installed, you can simply gzip them in your terminal:
25+
EpitopeID takes [gzipped][gzip-man] FASTQ files from single-end (SE) or paired-end (PE) datasets as input. Place them all in the same directory; the path to this directory is the input for `identify-Epitope.sh`. If your FASTQ files are not already compressed, you can compress them yourself if `gzip` is installed:
2626

2727
```bash
2828
gzip XXXXX_R1.fastq
2929
```
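
To compress every FASTQ file in a directory at once, a small loop may be more convenient. This is only a sketch and not part of EpitopeID: the function name `gzip_fastqs` is hypothetical, and it assumes the uncompressed files end in `.fastq`.

```bash
# Hypothetical helper (not part of GenoPipe): gzip every *.fastq file
# found directly inside the given directory.
gzip_fastqs() {
  local dir="$1" f
  for f in "$dir"/*.fastq; do
    [ -e "$f" ] || continue   # skip if the glob matched no files
    gzip "$f"                 # replaces XXXXX.fastq with XXXXX.fastq.gz
  done
}
```

Call it as `gzip_fastqs /path/to/samples`. Files that are already gzipped are left alone because they no longer match `*.fastq`.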
3030

31-
It is expected that at least one file for each sample(even SE) follows the naming convention of `_R1*.fastq.gz` and a second file if the data is paired-end following `_R2*.fastq.gz`. This is based on the Illumina naming standard.
31+
Each sample is expected to have at least one file (even for single-end data) named `XXXXX_R1*.fastq.gz`, plus a second file named `XXXXX_R2*.fastq.gz` if the data is paired-end. This is based on the Illumina naming standard.
3232

33-
:::caution
34-
Make sure none of the sample names used in the filenames have an occurrence of `_R1` to avoid errors from EpitopeID about being unable to find files or attempting to use a `_R2` file as a `_R1` file.
35-
:::
33+
Make sure no sample name used in the filenames contains `_R1` outside of the read-specifying term; otherwise EpitopeID may be confused when determining which samples have a read 2 file.
34+
35+
For example, samples named like this will cause confusion:<br/>
36+
`Sample_R13A_R1.fastq.gz`
37+
38+
Alternatively try:<br/>
39+
`Sample-R13A_R1.fastq.gz`<br/>
40+
`SampleR13A_R1.fastq.gz`
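
A quick pre-flight check can catch such names before running EpitopeID. The sketch below is not part of GenoPipe: the function name `check_r1_names` is hypothetical, and it relies only on the `_R1*.fastq.gz` naming convention described above. It prints every read 1 file whose name contains `_R1` more than once.

```bash
# Hypothetical helper (not part of GenoPipe): report R1 files whose names
# contain "_R1" more than once, e.g. Sample_R13A_R1.fastq.gz.
check_r1_names() {
  local dir="$1" f name stripped count
  for f in "$dir"/*_R1*.fastq.gz; do
    [ -e "$f" ] || continue            # skip if the glob matched no files
    name=$(basename "$f")
    stripped=${name//_R1/}             # remove every "_R1" occurrence
    # Each removed occurrence shortens the name by 3 characters
    count=$(( (${#name} - ${#stripped}) / 3 ))
    if [ "$count" -gt 1 ]; then
      echo "$name"
    fi
  done
}
```

Running `check_r1_names /path/to/samples` prints nothing when all names are unambiguous. Any file it lists should be renamed along the lines of the alternatives above.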
3641

3742

3843

3944
## Reference Files (`-d`)
4045

4146
:::note
42-
The provided database files are missing the genomic reference file. You will need to follow the directions below to download `genome.fa` before running EpitopeID if you are planning to use the provided references.
47+
The provided database files are missing the genomic reference file for storage reasons. You will need to follow the directions below to download `genome.fa` before running EpitopeID if you are planning to use the provided references.
4348
:::
4449

4550
For EpitopeID, this is the "database" or directory with four types of reference files used by `identify-Epitope.sh`. You will notice that EpitopeID provides reference files for both yeast (`sacCer3_EpiID`) and human (`hg19_EpiID`) so you can quickly get started without building up the database from scratch. However, you are free to customize and build your own set of files (e.g. add different epitope tags to check for, use a different genome build).
4651

52+
### Database structure
4753
Below is a list of the hardcoded filenames that EpitopeID looks for during execution and some information on the provided yeast and human defaults.
4854

4955
* The `FASTA_tag/ALL_TAG.fa` is the FASTA formatted collection of all epitope sequences to search for. The yeast tag database includes the [AID, Extended-Tap, FRB, HA_v1, HA_v3, MNase, ProtA, CBP, FLAG-3x, GFP, HA_v2, HaloTag, and Myc(3x)][tag-ref] sequences. The human tag database only includes the [LAP][lap-ref] tag but it is easy to customize the list to include other epitopes for EpitopeID to look for.
@@ -113,40 +119,75 @@ mv ALL_TAG.fa* /path/to/hg19_EpiID/FASTA_tag/
113119

114120

115121

116-
### Customizing annotation
122+
### Customizing annotations
123+
GenoPipe provides utility scripts for recreating the precomputed reference annotation files. The scripts download (yeast and human) annotations and format the gene annotation files for EpitopeID by tiling the genome around, and including, gene intervals.
117124

125+
The precomputed files should work for most (yeast and human) use cases but if you need to compute these reference files yourself, use the available `utility_scripts` as follows:
118126

119127
#### Make `genome_annotation.gff.gz` with a different bin size
128+
To use a different tile (bin) size than the reference files GenoPipe already provides, rerun our scripts with a different value for the `-b` flag. The specific commands are as follows.
120129

130+
```bash
131+
# sacCer3
132+
cd /path/to/GenoPipe/EpitopeID/utility_scripts/annotation_data
133+
bash generate_sacCer3_GenomeAnnotation.sh -g /path/to/genome/sacCer3.fa -b <BIN_SIZE>
134+
```
121135

122-
:::warning
123-
write up command series for yeast
124136
```bash
125-
# bedtools intersect -wb -abam $OUTPUT/$SAMPLE/orf_filter.bam -b $DATABASE/annotation/genome_annotation.gff.gz -bed > $OUTPUT/$SAMPLE/align-pe.out
137+
# hg19
126138
cd /path/to/GenoPipe/EpitopeID/utility_scripts/annotation_data
127-
bash
139+
bash generate_hg19_GenomeAnnotation.sh -g /path/to/genome/hg19.fa -b <BIN_SIZE>
140+
```
128141

142+
```bash
143+
# hg38
144+
cd /path/to/GenoPipe/EpitopeID/utility_scripts/annotation_data
145+
bash generate_hg38_GenomeAnnotation.sh -g /path/to/genome/hg38.fa -b <BIN_SIZE>
129146
```
130-
:::
131147

132148

149+
#### Make `genome_annotation.gff.gz` with custom annotations
150+
There may be a few reasons to create a custom annotation reference set:
151+
* Working with non-yeast and non-human data
152+
* The inserted sequence localizes to a non-ORF genomic location (e.g. an insertion in an enhancer region)
153+
* The inserted sequence/epitope localizes to an ORF not included in the precomputed annotations (rare; applies to genes that were not in the official set when we created the precomputed files)
133154

134-
:::warning
135-
write up command series for human
155+
:::note
156+
EpitopeID can still detect and localize insertions in non-ORF regions, but the report will only include the nearest ORF or genomic tile and may be more difficult to interpret. Creating a customized annotation reference file improves the clarity of the output report but is not *necessary*.
136157
:::
137158

138-
#### Make `genome_annotation.gff.gz` with a different set of annotations
159+
1. Create a custom `.gff` file including the feature with the expected localization.
160+
- It may be a good idea to include other potential off-target annotations for better readability of the report. Otherwise off-target localizations will be reported with only the genomic coordinate information.
161+
- If you are working with a custom genome build, you will need the gene annotations in `.gff` format for the genome build. [Ensembl][ensembl-ftp] and [UCSC][ucsc-download] can be good resources for finding gene annotations associated with official genome builds across a variety of organisms. Please make sure they are in `.gff` file format for compatibility with the utility scripts.
139162

163+
```
164+
# Example .gff entry like an enhancer or something
165+
# myunicorn_annotations.gff
140166
141-
:::warning
142-
write up command series for unicorn
143-
:::
167+
```
144168

145-
```bash
146-
# bedtools intersect -wb -abam $OUTPUT/$SAMPLE/orf_filter.bam -b $DATABASE/annotation/genome_annotation.gff.gz -bed > $OUTPUT/$SAMPLE/align-pe.out
147-
cd /path/to/GenoPipe/EpitopeID/utility_scripts/annotation_data
148-
bash
169+
2. Execute the following commands to tile the genome and merge the annotations with the tiled regions.
149170

171+
```bash
172+
MYGFF=/path/to/myunicorn_annotations.gff
173+
GENOME=/path/to/genome/unicorn1.fa
174+
# Choose a bin size (consider size of genome)
175+
BIN=250
176+
177+
# Add flanking sequence to the custom annotations
178+
perl parsers/parse_Generic_GFF.pl $MYGFF $BIN temp.gff
179+
# Tile genome
180+
perl genome_bin/bin_genome.pl $GENOME $BIN unicorn1_BIN.gff
181+
# Remove the parts of each bin that overlap the annotations
182+
bedtools subtract -a unicorn1_BIN.gff -b temp.gff > unicorn1_BIN_temp.gff
183+
# Merge annotations and bin file
184+
perl genome_bin/rename_BIN_GFF.pl unicorn1_BIN_temp.gff unicorn1_BIN_filter.gff
185+
cat unicorn1_BIN_filter.gff temp.gff > unicorn1_ALL.gff
186+
sort -k 1,1 -k4,4n unicorn1_ALL.gff > genome_annotation.gff
187+
# Compress
188+
gzip -f genome_annotation.gff
189+
# Clean-up
190+
rm temp.gff unicorn1_BIN.gff unicorn1_BIN_temp.gff unicorn1_BIN_filter.gff unicorn1_ALL.gff
150191
```
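
Before copying the new file into your database, it may be worth a quick sanity check. The sketch below is not part of GenoPipe: the function name `check_gff` is hypothetical, and it assumes the standard nine-column, tab-delimited GFF layout. It flags lines with the wrong number of columns or with a start coordinate greater than the end coordinate.

```bash
# Hypothetical helper (not part of GenoPipe): basic checks on a gzipped GFF.
# Assumes the standard 9-column, tab-delimited GFF layout.
check_gff() {
  gzip -dc "$1" | awk -F'\t' '
    /^#/ { next }                                  # skip comment lines
    NF != 9 {
      printf "line %d: expected 9 columns, found %d\n", NR, NF
      bad = 1
      next
    }
    $4 + 0 > $5 + 0 {                              # start must not exceed end
      printf "line %d: start is greater than end\n", NR
      bad = 1
    }
    END { exit bad + 0 }                           # non-zero status on any problem
  '
}
```

For example, `check_gff genome_annotation.gff.gz && echo "looks OK"` prints the message only when no problems were found.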
151192

152193

@@ -203,19 +244,15 @@ bash identify_Epitope.sh -i ../samples/ -o ../output/ -d sacCer3_EpiID -t 4
203244
```
204245

205246

206-
207-
208-
209-
210247
## Threading (`-t`)
211248

212-
This optional input is used to specify the number of threads to used for the BWA alignment commands.
249+
This optional input specifies the number of threads to use for the BWA alignment commands. Defaults to 1.
213250

214251

215252

216253
## Output Report (`-o`)
217254

218-
The output report is saved to the user-provided output directory in a file named based on the input FASTQ files (`/path/to/output/samplename_R1-ID.tab`). Below is a sample report based on the results from running EpitopeID on the ENCODE ENCFF415CJF sample.
255+
The output report is saved to the user-provided output directory in a file named based on the input FASTQ files (`/path/to/output/XXXXX_R1-ID.tab`). Below is a sample report based on the results from running EpitopeID on the ENCODE ENCFF415CJF sample.
219256

220257
```
221258
EpitopeID EpitopeCount
@@ -227,12 +264,14 @@ NR4A1|chr12:52416616-52453291 LAP-tag C-term 9 3.580493355965414e-24
227264

228265
The first part of the report shows which epitopes in `Tag_DB` were identified in the sample (**EpitopeID column**) and how many reads mapped to this epitope (**EpitopeCount**) to help quantify the coverage of the epitopes which relates to the confidence of the call.
229266

230-
The second part of the report shows which epitopes localized to which regions/tiles of the genome significantly (sorted by pvalue if multiple hits). The columns specify the cooridinate interval (**GeneID**), which epitope maps to this locus (**EpitopeID**), if this occurs on the N or C-terminus (**EpitopeLocation**), the number of reads mapping to this tile (**EpitopeCount**), and the poisson-calculated associated p-value to indicate confidence of the site (**pVal**).
267+
The second part of the report shows which epitopes localized significantly to which regions/tiles of the genome (sorted by p-value if there are multiple hits). The columns specify the coordinate interval (**GeneID**), which epitope maps to this locus (**EpitopeID**), whether the epitope is at the N- or C-terminus (**EpitopeLocation**), the number of reads mapping to this tile (**EpitopeCount**), and the associated Poisson-calculated p-value indicating confidence in the site (**pVal**).
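
If you have many reports, the localization rows can be pulled out with a short `awk` filter. This is a sketch under assumptions rather than a GenoPipe feature: the function name `significant_hits` is hypothetical, the report is assumed to be tab-delimited with the five columns described above, and the default cutoff of `0.05` is an arbitrary placeholder.

```bash
# Hypothetical helper (not part of GenoPipe): print GeneID, EpitopeID and pVal
# for localization rows whose p-value is below a cutoff.
# Assumes a tab-delimited report with five columns in the second section.
significant_hits() {
  local report="$1" cutoff="${2:-0.05}"   # 0.05 is an arbitrary default
  awk -F'\t' -v cutoff="$cutoff" '
    # Keep 5-column rows whose last field is numeric (this skips the header)
    NF == 5 && $5 ~ /^[0-9]/ && $5 + 0 < cutoff + 0 {
      print $1 "\t" $2 "\t" $5
    }
  ' "$report"
}
```

For example, `significant_hits /path/to/output/XXXXX_R1-ID.tab 1e-5` lists only the hits with a p-value below `1e-5`.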
231268

232269

233270

234271
## FAQs
235272

273+
* Q: My epitope sequence isn't part of the sequences in the default provided reference files (either `sacCer3_EpiID` or `hg19_EpiID`). Can I still use EpitopeID for checking my samples?
274+
* Yes! Please scroll up to the **Customizing epitopes** section for directions on how to add your epitope sequence to the database.
236275
* Q: I added my own custom tag sequences to the `TagDB` directory but when I run EpitopeID, none of my samples are getting significant hits to the new tags.
237276
* There are a few things you should check before concluding that the epitope is not present in your sample:
238277
* Did you recreate the `ALL_TAG.fa` file? Open it up to make sure your sequences are there. If they aren't there, follow the commands in the "How to add your own epitope tag sequences" section above.
@@ -244,6 +283,9 @@ The second part of the report shows which epitopes localized to which regions/ti
244283

245284

246285
[gzip-man]:https://www.gnu.org/software/gzip/manual/gzip.html
286+
[ensembl-ftp]:https://useast.ensembl.org/info/data/ftp/index.html
287+
[ucsc-download]:https://hgdownload.soe.ucsc.edu/downloads.html
288+
247289
[AID-ref]:https://www.google.com
248290
[Extended-Tap-ref]:https://www.google.com
249291
[FRB-ref]:https://www.google.com
