update docusaurus eid DB customization

owlang · owlang · commit 9713fa78096c · 2022-07-26T17:07:26.000-04:00
Add to the epitopeid.md content around database customization, clarify language, and fix typos.
diff --git a/docusaurus/docs/epitopeid.md b/docusaurus/docs/epitopeid.md
@@ -22,28 +22,34 @@ Specific Dependencies
 
 ## Input (`-i`)
 
-EpitopeID takes [gzipped][gzip-man] FASTQ files from single-end(SE) or paired-end(PE) datasets to run. These should all be added to the same directory and the path to this directory will be used as the input for `identify-Epitope.sh`. If your FASTQ files are not already gzipped and you have `gzip` installed, you can simply gzip them in your terminal:
+EpitopeID takes [gzipped][gzip-man] FASTQ files from single-end (SE) or paired-end (PE) datasets as input. Place all of them in the same directory; the path to this directory is used as the input for `identify-Epitope.sh`. If your FASTQ files are not already compressed and `gzip` is installed, you can compress them yourself:
 
 ```bash
 gzip XXXXX_R1.fastq
 ```
 
-It is expected that at least one file for each sample(even SE) follows the naming convention of `_R1*.fastq.gz` and a second file if the data is paired-end following `_R2*.fastq.gz`. This is based on the Illumina naming standard.
+Each sample (including single-end data) is expected to have at least one file that follows the naming convention `XXXXX_R1*.fastq.gz` and, if the data is paired-end, a second file that follows `XXXXX_R2*.fastq.gz`. This is based on the Illumina naming standard.
 
-:::caution
-Make sure none of the sample names used in the filenames have an occurrence of `_R1` to avoid errors from EpitopeID about being unable to find files or attempting to use a `_R2` file as a `_R1` file.
-:::
+Make sure that no sample name used in the filenames contains `_R1` outside of the read-specifying term. Otherwise, EpitopeID may be confused when determining which samples have a read 2 file.
+
+For example, samples named like this will cause confusion:<br/>
+❌ `Sample_R13A_R1.fastq.gz`
+
+Alternatively try:<br/>
+✅ `Sample-R13A_R1.fastq.gz`<br/>
+✅ `SampleR13A_R1.fastq.gz`
 
 
 
 ## Reference Files (`-d`)
 
 :::note
-The provided database files are missing the genomic reference file. You will need to follow the directions below to download `genome.fa` before running EpitopeID if you are planning to use the provided references.
+The provided database files are missing the genomic reference file for storage reasons. You will need to follow the directions below to download `genome.fa` before running EpitopeID if you are planning to use the provided references.
 :::
 
 For EpitopeID, this is the "database" or directory with four types of reference files used by `identify-Epitope.sh`. You will notice that EpitopeID provides reference files for both yeast (`sacCer3_EpiID`) and human (`hg19_EpiID`) so you can quickly get started without building up the database from scratch. However, you are free to customize and build your own set of files (e.g. add different epitope tags to check for, use a different genome build).
 
+### Database structure
 Below is a list of the the hardcoded filenames that EpitopeID looks for during execution and some information on the provided yeast and human defaults.
 
 * The `FASTA_tag/ALL_TAG.fa` is the FASTA formatted collection of all epitope sequences to search for. The yeast tag database includes the [AID, Extended-Tap,  FRB, HA_v1, HA_v3, MNase, ProtA, CBP, FLAG-3x, GFP, HA_v2, HaloTag, and Myc(3x)][tag-ref] sequences. The human tag database only includes the [LAP][lap-ref] tag but it is easy to customize the list to include other epitopes for EpitopeID to look for.
@@ -113,40 +119,75 @@ mv ALL_TAG.fa* /path/to/hg19_EpiID/FASTA_tag/
 
 
 
-### Customizing annotation
+### Customizing annotations
+GenoPipe provides utility scripts for recreating the precomputed reference annotation files. The scripts download the (yeast and human) annotations and convert the gene annotation files to the EpitopeID format by tiling the genome around, and including, the gene intervals.
 
+The precomputed files should work for most (yeast and human) use cases, but if you need to compute these reference files yourself, use the available `utility_scripts` as follows:
 
 #### Make `genome_annotation.gff.gz` with a different bin size
+If you wish to change the bin size of the tiles used in the reference files GenoPipe already provides, you can rerun our scripts with a different value for the `-b` flag. The specific commands for each genome build are listed below.
 
+```bash
+# sacCer3
+cd /path/to/GenoPipe/EpitopeID/utility_scripts/annotation_data
+bash generate_sacCer3_GenomeAnnotation.sh -g /path/to/genome/sacCer3.fa -b <BIN_SIZE>
+```
 
-:::warning
-write up command series for yeast
 ```bash
-# bedtools intersect -wb -abam $OUTPUT/$SAMPLE/orf_filter.bam -b $DATABASE/annotation/genome_annotation.gff.gz -bed > $OUTPUT/$SAMPLE/align-pe.out
+# hg19
 cd /path/to/GenoPipe/EpitopeID/utility_scripts/annotation_data
-bash
+bash generate_hg19_GenomeAnnotation.sh -g /path/to/genome/hg19.fa -b <BIN_SIZE>
+```
 
+```bash
+# hg38
+cd /path/to/GenoPipe/EpitopeID/utility_scripts/annotation_data
+bash generate_hg38_GenomeAnnotation.sh -g /path/to/genome/hg38.fa -b <BIN_SIZE>
 ```
-:::
 
 
+#### Make `genome_annotation.gff.gz` with custom annotations
+There may be a few reasons to create a custom annotation reference set:
+  * Working with non-yeast and non-human data
+  * Inserted sequence localized to non-ORF genomic location (e.g. insertions in enhancer region)
+  * Inserted sequence/epitope localized to ORF not included in precomputed annotations (rare, for genes not included in official set at the time we created the precomputed files)
 
-:::warning
-write up command series for human
+:::note
+EpitopeID can still detect and localize insertions in non-ORF regions, but the report will only include the nearest ORF or genomic tile and may be more difficult to interpret. Creating a customized annotation reference file improves the clarity of the output report but is not *necessary*.
 :::
 
-#### Make `genome_annotation.gff.gz` with a different set of annotations
+1. Create a custom `.gff` file including the feature with the expected localization.
+  - It may be a good idea to include other potential off-target annotations for better readability of the report. Otherwise off-target localizations will be reported with only the genomic coordinate information.
+  - If you are working with a custom genome build, you will need the gene annotations in `.gff` format for the genome build. [Ensembl][ensembl-ftp] and [UCSC][ucsc-download] can be good resources for finding gene annotations associated with official genome builds across a variety of organisms. Please make sure they are in `.gff` file format for compatibility with the utility scripts.
 
+```
+# Example .gff entry (e.g. for an enhancer feature)
+# myunicorn_annotations.gff
 
-:::warning
-write up command series for unicorn
-:::
+```
 
-```bash
-# bedtools intersect -wb -abam $OUTPUT/$SAMPLE/orf_filter.bam -b $DATABASE/annotation/genome_annotation.gff.gz -bed > $OUTPUT/$SAMPLE/align-pe.out
-cd /path/to/GenoPipe/EpitopeID/utility_scripts/annotation_data
-bash
+2. Execute the following commands to tile the genome and merge the annotations with the tiled regions.
 
+```bash
+MYGFF=/path/to/myunicorn_annotations.gff
+GENOME=/path/to/genome/unicorn1.fa
+# Choose a bin size (consider size of genome)
+BIN=250
+
+# Add flanking sequence to the custom annotations
+perl parsers/parse_Generic_GFF.pl $MYGFF $BIN temp.gff
+# Tile genome
+perl genome_bin/bin_genome.pl $GENOME $BIN unicorn1_BIN.gff
+# Subtract the annotated intervals from the genome tiles
+bedtools subtract -a unicorn1_BIN.gff -b temp.gff > unicorn1_BIN_temp.gff
+# Merge annotations and bin file
+perl genome_bin/rename_BIN_GFF.pl unicorn1_BIN_temp.gff unicorn1_BIN_filter.gff
+cat unicorn1_BIN_filter.gff temp.gff > unicorn1_ALL.gff
+sort -k 1,1 -k4,4n unicorn1_ALL.gff > genome_annotation.gff
+# Compress
+gzip -f genome_annotation.gff
+# Clean-up
+rm temp.gff unicorn1_BIN.gff unicorn1_BIN_temp.gff unicorn1_BIN_filter.gff unicorn1_ALL.gff
 ```
 
 
@@ -203,19 +244,15 @@ bash identify_Epitope.sh -i ../samples/ -o ../output/ -d sacCer3_EpiID -t 4
 ```
 
 
-
-
-
-
 ## Threading (`-t`)
 
-This optional input is used to specify the number of threads to used for the BWA alignment commands.
+This optional input specifies the number of threads to use for the BWA alignment commands. Defaults to 1.
 
 
 
 ## Output Report (`-o`)
 
-The output report is saved to the user-provided output directory in a file named based on the input FASTQ files (`/path/to/output/samplename_R1-ID.tab`). Below is a sample report based on the results from running EpitopeID on the ENCODE ENCFF415CJF sample.
+The output report is saved to the user-provided output directory in a file named based on the input FASTQ files (`/path/to/output/XXXXX_R1-ID.tab`). Below is a sample report based on the results from running EpitopeID on the ENCODE ENCFF415CJF sample.
 
 ```
 EpitopeID	EpitopeCount
@@ -227,12 +264,14 @@ NR4A1|chr12:52416616-52453291	LAP-tag	C-term	9	3.580493355965414e-24
 
 The first part of the report shows which epitopes in `Tag_DB` were identified in the sample (**EpitopeID column**) and how many reads mapped to this epitope (**EpitopeCount**) to help quantify the coverage of the epitopes which relates to the confidence of the call.
 
-The second part of the report shows which epitopes localized to which regions/tiles of the genome significantly (sorted by pvalue if multiple hits). The columns specify the cooridinate interval (**GeneID**), which epitope maps to this locus (**EpitopeID**), if this occurs on the N or C-terminus (**EpitopeLocation**), the number of reads mapping to this tile (**EpitopeCount**), and the poisson-calculated associated p-value to indicate confidence of the site (**pVal**).
+The second part of the report shows which epitopes localized significantly to which regions/tiles of the genome (sorted by p-value if there are multiple hits). The columns specify the coordinate interval (**GeneID**), which epitope maps to this locus (**EpitopeID**), whether this occurs on the N- or C-terminus (**EpitopeLocation**), the number of reads mapping to this tile (**EpitopeCount**), and the associated Poisson-calculated p-value indicating confidence in the site (**pVal**).
 
 
 
 ## FAQs
 
+* Q: My epitope sequence isn't part of the sequences in the default provided reference files (either `sacCer3_EpiID` or `hg19_EpiID`). Can I still use EpitopeID for checking my samples?
+  * Yes! Please scroll up to the **Customizing epitopes** section for directions on how to add your epitope sequence to the database.
 * Q: I added my own custom tag sequences to the `TagDB` directory but when I run EpitopeID, none of my samples are getting significant hits to the new tags.
   * There are a few things you should check before concluding that the epitope is not present in your sample:
   * Did you recreate the `ALL_TAG.fa` file? Open it up to make sure your sequences are there. If they aren't there, follow the commands in the "How to add your own epitope tag sequences" section above.
@@ -244,6 +283,9 @@ The second part of the report shows which epitopes localized to which regions/ti
 
 
 [gzip-man]:https://www.gnu.org/software/gzip/manual/gzip.html
+[ensembl-ftp]:https://useast.ensembl.org/info/data/ftp/index.html
+[ucsc-download]:https://hgdownload.soe.ucsc.edu/downloads.html
+
 [AID-ref]:https://www.google.com
 [Extended-Tap-ref]:https://www.google.com
 [FRB-ref]:https://www.google.com