docusaurus/docs/epitopeid.md
Lines changed: 71 additions & 29 deletions
@@ -22,28 +22,34 @@ Specific Dependencies
22
22
23
23
## Input (`-i`)
24
24
25
-
EpitopeID takes [gzipped][gzip-man] FASTQ files from single-end(SE) or paired-end(PE) datasets to run. These should all be added to the same directory and the path to this directory will be used as the input for `identify-Epitope.sh`. If your FASTQ files are not already gzipped and you have `gzip` installed, you can simply gzip them in your terminal:
25
+
EpitopeID takes [gzipped][gzip-man] FASTQ files from single-end (SE) or paired-end (PE) datasets as input. These should all be placed in the same directory, and the path to this directory is used as the input for `identify-Epitope.sh`. If your FASTQ files are not already compressed, you can compress them yourself if `gzip` is installed:
26
26
27
27
```bash
28
28
gzip XXXXX_R1.fastq
29
29
```
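If a directory holds many uncompressed files, the single command above can be wrapped in a loop. The following is a minimal sketch, not part of EpitopeID itself; the function name `gzip_fastq_dir` and the directory path are hypothetical placeholders.

```shell
# Hypothetical helper (not shipped with EpitopeID):
# gzip every uncompressed .fastq file found directly inside one directory.
gzip_fastq_dir() {
  local dir="$1"
  local f
  for f in "$dir"/*.fastq; do
    # When nothing matches, the glob stays literal; skip it.
    [ -e "$f" ] || continue
    gzip "$f"
  done
}

# Example call (placeholder path):
# gzip_fastq_dir /path/to/fastq_dir
```

Files that already end in `.fastq.gz` do not match the `*.fastq` pattern, so they are left untouched.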
30
30
31
-
It is expected that at least one file for each sample(even SE) follows the naming convention of `_R1*.fastq.gz` and a second file if the data is paired-end following `_R2*.fastq.gz`. This is based on the Illumina naming standard.
31
+
It is expected that at least one file for each sample (even for single-end data) follows the naming convention `XXXXX_R1*.fastq.gz`, with a second file following `XXXXX_R2*.fastq.gz` if the data is paired-end. This is based on the Illumina naming standard.
32
32
33
-
:::caution
34
-
Make sure none of the sample names used in the filenames have an occurrence of `_R1` to avoid errors from EpitopeID about being unable to find files or attempting to use a `_R2` file as a `_R1` file.
35
-
:::
33
+
Make sure none of the sample names used in the filenames contain `_R1` outside of the read-specifying term; otherwise EpitopeID may be confused when determining which samples have a read 2 (`_R2`) file.
34
+
35
+
For example, samples named like this will cause confusion:<br/>
36
+
❌ `Sample_R13A_R1.fastq.gz`
37
+
38
+
Alternatively try:<br/>
39
+
✅ `Sample-R13A_R1.fastq.gz`<br/>
40
+
✅ `SampleR13A_R1.fastq.gz`
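
Filenames like the ones above can be screened before running EpitopeID. The sketch below is only a heuristic and does not reproduce EpitopeID's own file-matching logic: it assumes a name is ambiguous when `_R1` or `_R2` appears more than once, and the function name `has_ambiguous_read_token` is hypothetical.

```shell
# Hypothetical check (an assumption, not EpitopeID's internal logic):
# succeed (exit 0) when a filename contains more than one `_R1`/`_R2` token.
has_ambiguous_read_token() {
  local name count
  name="$(basename "$1")"
  # grep -o prints one line per match; `|| true` tolerates zero matches.
  count="$(printf '%s\n' "$name" | { grep -o '_R[12]' || true; } | wc -l)"
  [ "$count" -gt 1 ]
}

# Example scan of an input directory (placeholder path):
for f in /path/to/fastq_dir/*.fastq.gz; do
  [ -e "$f" ] || continue
  if has_ambiguous_read_token "$f"; then
    echo "Ambiguous read token in: $f"
  fi
done
```

With this heuristic, `Sample_R13A_R1.fastq.gz` is flagged, while `Sample-R13A_R1.fastq.gz` and `SampleR13A_R1.fastq.gz` pass.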
36
41
37
42
38
43
39
44
## Reference Files (`-d`)
40
45
41
46
:::note
42
-
The provided database files are missing the genomic reference file. You will need to follow the directions below to download `genome.fa` before running EpitopeID if you are planning to use the provided references.
47
+
The provided database files are missing the genomic reference file for storage reasons. You will need to follow the directions below to download `genome.fa` before running EpitopeID if you are planning to use the provided references.
43
48
:::
44
49
45
50
For EpitopeID, this is the "database" or directory with four types of reference files used by `identify-Epitope.sh`. You will notice that EpitopeID provides reference files for both yeast (`sacCer3_EpiID`) and human (`hg19_EpiID`) so you can quickly get started without building up the database from scratch. However, you are free to customize and build your own set of files (e.g. add different epitope tags to check for, use a different genome build).
46
51
52
+
### Database structure
47
53
Below is a list of the hardcoded filenames that EpitopeID looks for during execution and some information on the provided yeast and human defaults.
48
54
49
55
* The `FASTA_tag/ALL_TAG.fa` is the FASTA formatted collection of all epitope sequences to search for. The yeast tag database includes the [AID, Extended-Tap, FRB, HA_v1, HA_v3, MNase, ProtA, CBP, FLAG-3x, GFP, HA_v2, HaloTag, and Myc(3x)][tag-ref] sequences. The human tag database only includes the [LAP][lap-ref] tag but it is easy to customize the list to include other epitopes for EpitopeID to look for.
GenoPipe provides utility scripts for recreating the precomputed reference annotation files. The scripts download (yeast and human) annotations and format the gene annotation files to the EpitopeID format by tiling the genome around and including gene intervals.
117
124
125
+
The precomputed files should work for most (yeast and human) use cases but if you need to compute these reference files yourself, use the available `utility_scripts` as follows:
118
126
119
127
#### Make `genome_annotation.gff.gz` with a different bin size
128
+
If you wish to change the bin size of the tiles from the reference files GenoPipe already provides, you can rerun our scripts with a different value in the `-b` flag. The following are the specific commands you can execute to do so.
120
129
130
+
```bash
131
+
# sacCer3
132
+
cd /path/to/GenoPipe/EpitopeID/utility_scripts/annotation_data
#### Make `genome_annotation.gff.gz` with custom annotations
150
+
There may be a few reasons to create a custom annotation reference set:
151
+
* Working with non-yeast and non-human data
152
+
* Inserted sequence localized to non-ORF genomic location (e.g. insertions in enhancer region)
153
+
* Inserted sequence/epitope localized to ORF not included in precomputed annotations (rare, for genes not included in official set at the time we created the precomputed files)
133
154
134
-
:::warning
135
-
write up command series for human
155
+
:::note
156
+
EpitopeID can still detect and localize insertions in non-ORF regions, but the report will only include the nearest ORF or genomic tile and may be more difficult to read and interpret. Creating a customized annotation reference file would improve clarity in the output report but is not *necessary*.
136
157
:::
137
158
138
-
#### Make `genome_annotation.gff.gz` with a different set of annotations
159
+
1. Create a custom `.gff` file including the feature with the expected localization.
160
+
- It may be a good idea to include other potential off-target annotations for better readability of the report. Otherwise off-target localizations will be reported with only the genomic coordinate information.
161
+
- If you are working with a custom genome build, you will need the gene annotations in `.gff` format for the genome build. [Ensembl][ensembl-ftp] and [UCSC][ucsc-download] can be good resources for finding gene annotations associated with official genome builds across a variety of organisms. Please make sure they are in `.gff` file format for compatibility with the utility scripts.
139
162
163
+
```
164
+
# Example .gff entry for a non-ORF feature such as an enhancer
This optional input is used to specify the number of threads to use for the BWA alignment commands.
249
+
This optional input is used to specify the number of threads to use for the BWA alignment commands. Defaults to 1.
213
250
214
251
215
252
216
253
## Output Report (`-o`)
217
254
218
-
The output report is saved to the user-provided output directory in a file named based on the input FASTQ files (`/path/to/output/samplename_R1-ID.tab`). Below is a sample report based on the results from running EpitopeID on the ENCODE ENCFF415CJF sample.
255
+
The output report is saved to the user-provided output directory in a file named based on the input FASTQ files (`/path/to/output/XXXXX_R1-ID.tab`). Below is a sample report based on the results from running EpitopeID on the ENCODE ENCFF415CJF sample.
The first part of the report shows which epitopes in `Tag_DB` were identified in the sample (**EpitopeID column**) and how many reads mapped to this epitope (**EpitopeCount**) to help quantify the coverage of the epitopes which relates to the confidence of the call.
229
266
230
-
The second part of the report shows which epitopes localized to which regions/tiles of the genome significantly (sorted by pvalue if multiple hits). The columns specify the cooridinate interval (**GeneID**), which epitope maps to this locus (**EpitopeID**), if this occurs on the N or C-terminus (**EpitopeLocation**), the number of reads mapping to this tile (**EpitopeCount**), and the poisson-calculated associated p-value to indicate confidence of the site (**pVal**).
267
+
The second part of the report shows which epitopes localized significantly to which regions/tiles of the genome (sorted by p-value if there are multiple hits). The columns specify the coordinate interval (**GeneID**), which epitope maps to this locus (**EpitopeID**), whether this occurs at the N- or C-terminus (**EpitopeLocation**), the number of reads mapping to this tile (**EpitopeCount**), and the associated Poisson-calculated p-value indicating confidence in the site (**pVal**).
231
268
232
269
233
270
234
271
## FAQs
235
272
273
+
* Q: My epitope sequence isn't part of the sequences in the default provided reference files (either `sacCer3_EpiID` or `hg19_EpiID`). Can I still use EpitopeID for checking my samples?
274
+
* Yes! Please scroll up to the **Customizing epitopes** section for directions on how to add your epitope sequence to the database.
236
275
* Q: I added my own custom tag sequences to the `TagDB` directory but when I run EpitopeID, none of my samples are getting significant hits to the new tags.
237
276
* There are a few things you should check before concluding that the epitope is not present in your sample:
238
277
* Did you recreate the `ALL_TAG.fa` file? Open it up to make sure your sequences are there. If they aren't there, follow the commands in the "How to add your own epitope tag sequences" section above.
@@ -244,6 +283,9 @@ The second part of the report shows which epitopes localized to which regions/ti