docusaurus/docs/epitopeid.md
Lines changed: 30 additions & 32 deletions
@@ -50,14 +50,6 @@ The provided database files are missing the genomic reference file for storage r
 For EpitopeID, this is the "database" or directory with four types of reference files used by `identify-Epitope.sh`. EpitopeID provides reference files for both yeast (`sacCer3_EpiID`) and human (`hg19_EpiID`) so you can get started quickly without building the database from scratch. However, you are free to customize and build your own set of files (e.g. add different epitope tags to check for, or use a different genome build).
 
 ### Database structure
-Below is a list of the hardcoded filenames that EpitopeID looks for during execution and some information on the provided yeast and human defaults.
-
-* The `FASTA_tag/ALL_TAG.fa` is the FASTA-formatted collection of all epitope sequences to search for. The yeast tag database includes the [AID, Extended-Tap, FRB, HA_v1, HA_v3, MNase, ProtA, CBP, FLAG-3x, GFP, HA_v2, HaloTag, and Myc(3x)][tag-ref] sequences. The human tag database includes only the [LAP][lap-ref] tag, but it is easy to customize the list to include other epitopes for EpitopeID to look for.
-* The `FASTA_genome/genome.fa` is the reference genome used for the genomic alignments that the other annotations are based on. Even if you use the provided databases from GitHub, the genomic reference file still needs to be downloaded and moved to `FASTA_genome/genome.fa`. (The genome was not included for data storage reasons.)
-* The `annotation/genome_annotation.gff.gz` file defines the bin coordinates to use when localizing the epitope insertion in PE datasets. The yeast default uses SGD gene annotation coordinates to define one bin for the length of each gene, 250 bp bins flanking each set of gene coordinates, and 250 bp bins breaking up the remaining intergenic regions. The human default similarly bins out the genome using 1000 bp windows on the NCBI RefSeq annotations.
-* The `blacklist_filter/blacklist.bed`
-
-
 Whether you use the provided reference files or create your own, the database should use the following directory structure both to ensure that EpitopeID can find the correct reference files and for organization, clarity, and consistency.
 ```
 /name/of/epiDB
@@ -81,6 +73,13 @@ Whether you use the provided reference files or create your own, the database sh
 | |--blacklist.bed
 ```
 
+Below is a list of the hardcoded filenames that EpitopeID looks for during execution and some information on the provided yeast and human defaults.
+
+* The `FASTA_tag/ALL_TAG.fa` is the FASTA-formatted collection of all epitope sequences to search for. The yeast tag database includes the [AID, Extended-Tap, FRB, HA_v1, HA_v3, MNase, ProtA, CBP, FLAG-3x, GFP, HA_v2, HaloTag, and Myc(3x)][tag-ref] sequences. The human tag database includes only the [LAP][lap-ref] tag, but it is easy to customize the list to include other epitopes for EpitopeID to look for.
+* The `FASTA_genome/genome.fa` is the reference genome used for the genomic alignments that the other annotations are based on. Even if you use the provided databases from GitHub, the genomic reference file still needs to be downloaded and moved to `FASTA_genome/genome.fa`. (The genome was not included for data storage reasons.)
+* The `annotation/genome_annotation.gff.gz` file defines the bin coordinates to use when localizing the epitope insertion in PE datasets. The yeast default uses SGD gene annotation coordinates to define one bin for the length of each gene, 250 bp bins flanking each set of gene coordinates, and 250 bp bins breaking up the remaining intergenic regions. The human default similarly bins out the genome using 1000 bp windows on the NCBI RefSeq annotations.
+* The `blacklist_filter/blacklist.bed`
+
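The four hardcoded paths listed above can be checked before a run. The following is a minimal sketch and not part of EpitopeID: `check_epidb` is a hypothetical helper name, and it assumes the four paths are relative to the database root (e.g. `/name/of/epiDB`).

```shell
# Hypothetical helper (not shipped with EpitopeID): print each of the four
# hardcoded reference files that is missing from a database directory.
# Returns 0 when all four files exist and 1 otherwise.
check_epidb() {
  local db="$1"
  local missing=0
  local rel
  for rel in \
    "FASTA_tag/ALL_TAG.fa" \
    "FASTA_genome/genome.fa" \
    "annotation/genome_annotation.gff.gz" \
    "blacklist_filter/blacklist.bed"
  do
    if [ ! -f "$db/$rel" ]; then
      echo "missing: $rel"
      missing=1
    fi
  done
  return "$missing"
}
```

For example, `check_epidb /name/of/epiDB && echo "database looks complete"` prints the confirmation only when all four files are present. With the provided databases, `FASTA_genome/genome.fa` is the file most likely to be reported, since the genome is not included in the download.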
 Below is more information on how to use the utility scripts to download and customize your reference files.
 
 
@@ -220,6 +219,29 @@ cd $EPITOPEID/utility_scripts
 
 ``` -->
 
+
+## Output Report (`-o`)
+
+The output report is saved to the user-provided output directory in a file named after the input FASTQ files (`/path/to/output/XXXXX_R1-ID.tab`). Below is a sample report based on the results from running EpitopeID on the ENCODE ENCFF415CJF sample.
+The first part of the report shows which epitopes in `Tag_DB` were identified in the sample (**EpitopeID** column) and how many reads mapped to each epitope (**EpitopeCount**). The read count quantifies epitope coverage, which relates to the confidence of the call.
+
+The second part of the report shows which epitopes localized significantly to which regions/tiles of the genome (sorted by p-value if there are multiple hits). The columns specify the coordinate interval (**GeneID**), which epitope maps to this locus (**EpitopeID**), whether the insertion is at the N- or C-terminus (**EpitopeLocation**), the number of reads mapping to this tile (**EpitopeCount**), and the associated Poisson-calculated p-value indicating confidence in the site (**pVal**).
+
+
+## Threading (`-t`)
+
+This optional input specifies the number of threads to use for the BWA alignment commands. Defaults to 1.
+
+
 ## Example: Set-up EpitopeID and run on yeast example
 This optional input specifies the number of threads to use for the BWA alignment commands. Defaults to 1.
-
-
-
-## Output Report (`-o`)
-
-The output report is saved to the user-provided output directory in a file named after the input FASTQ files (`/path/to/output/XXXXX_R1-ID.tab`). Below is a sample report based on the results from running EpitopeID on the ENCODE ENCFF415CJF sample.
-The first part of the report shows which epitopes in `Tag_DB` were identified in the sample (**EpitopeID** column) and how many reads mapped to each epitope (**EpitopeCount**). The read count quantifies epitope coverage, which relates to the confidence of the call.
-
-The second part of the report shows which epitopes localized significantly to which regions/tiles of the genome (sorted by p-value if there are multiple hits). The columns specify the coordinate interval (**GeneID**), which epitope maps to this locus (**EpitopeID**), whether the insertion is at the N- or C-terminus (**EpitopeLocation**), the number of reads mapping to this tile (**EpitopeCount**), and the associated Poisson-calculated p-value indicating confidence in the site (**pVal**).
-
-
-
 ## FAQs
 
 * Q: My epitope sequence isn't part of the sequences in the default provided reference files (either `sacCer3_EpiID` or `hg19_EpiID`). Can I still use EpitopeID to check my samples?