|
1 | 1 | # GenoPipe |
2 | 2 |
|
3 | | -Expanded Documentation at https://pughlab.mbg.cornell.edu/GenoPipe-docs/ |
| 3 | +Confidence in experimental results is critical for discovery. As the scale of data generation in genomics has grown exponentially, experimental error has likely kept pace despite the best efforts of many laboratories. Technical mistakes can and do occur at nearly every stage of a genomics assay (e.g., cell line contamination, reagent swapping, or tube mislabelling) and are often difficult to identify post-execution. However, the DNA sequenced in genomic experiments contains certain markers (e.g., indels) that can often be ascertained forensically from experimental datasets. We developed the Genotype validation Pipeline (GenoPipe), a suite of heuristic tools that operate together directly on raw and aligned sequencing data from individual high-throughput sequencing experiments to characterize the underlying genome of the source material. We demonstrate how GenoPipe validates and rescues erroneously annotated experiments by identifying unique markers inherent to an organism’s genome (i.e., epitope insertions, gene deletions, and SNPs). |
4 | 4 |
|
5 | | -## Toolkit for characterizing the genotype of NGS datasets |
6 | | - |
7 | | -There are 3 primary modules for genotype identification: |
8 | | - |
9 | | -### EpitopeID |
10 | | - |
11 | | -- Identify and determine the genomic location of epitopes relative to genomic loci |
12 | | - |
13 | | -- Some epitope sequences are provided in the default tag database |
14 | | - |
15 | | -| sacCer3(yeast) | hg19(human) | |
16 | | -| -------------- | ----------- | |
17 | | -| AID | LAP-tag | |
18 | | -| CBP | | |
19 | | -| Extended-Tap | | |
20 | | -| FLAG-3x | | |
21 | | -| FRB | | |
22 | | -| GFP | | |
23 | | -| HA_v1 | | |
24 | | -| HA_v2 | | |
25 | | -| HA_v3 | | |
26 | | -| HaloTag | | |
27 | | -| MNase_v2 | | |
28 | | -| Myc-3x | | |
29 | | -| ProteinA | | |
30 | | - |
31 | | -### DeletionID |
32 | | - |
33 | | -- Identify significant depletion of aligned NGS tags in the genome relative to a background model. This module is useful for confirming gene knockouts. |
34 | | - |
35 | | -- Default database includes reference files that direct the search for depleted reads within gene annotation intervals from the sacCer3(yeast) genome build. |
36 | | - |
37 | | -### StrainID |
38 | | - |
39 | | -- Compare a database of VCF files against an aligned BAM file to check for the presence of SNPs in order to determine a likely cell line/strain used in the experiment |
40 | | - |
41 | | -- Default database includes reference VCF files for the following strains: |
42 | | - |
43 | | -| sacCer3(yeast) | hg19(human) | |
44 | | -| -------------- | ----------- | |
45 | | -| BY4741 | A549 | |
46 | | -| BY4742 | HCT116 | |
47 | | -| CEN.PK2-1Ca | HELA | |
48 | | -| D273-10B | HepG2 | |
49 | | -| FL100 | K562 | |
50 | | -| JK9-3d | LnCap | |
51 | | -| RM11-1A | MCF7 | |
52 | | -| RedStar | SKnSH | |
53 | | -| SEY6210 | | |
54 | | -| Sigma1278b-10560-6B | | |
55 | | -| W303 | | |
56 | | -| Y55 | | |
57 | | - |
58 | | - |
59 | | -[Figure 1] |
60 | | - |
61 | | - |
62 | | -## Quickstart |
63 | | - |
64 | | -This guide is for how to run each of the three GenoPipe modules on data from yeast(sacCer3) and human(hg19) samples. See the full documentation for how to modify and generate reference files for other genome builds. |
65 | | - |
66 | | -### Dependencies |
67 | | - |
68 | | -You will need the following software to run all GenoPipe modules: |
69 | | - |
70 | | -[Samtools v1.5+](http://www.htslib.org/) |
71 | | - |
72 | | -[Bedtools v2.27+](https://bedtools.readthedocs.io/en/latest/) |
73 | | - |
74 | | -[BWA v0.7.15+](http://bio-bwa.sourceforge.net/bwa.shtml) |
75 | | - |
76 | | -[Python v3.6.8+](https://www.python.org/) |
77 | | - |
78 | | -- [scipy v1.5.4+](https://www.scipy.org/) |
79 | | - |
80 | | -- [pysam v0.16.0.1+](https://pysam.readthedocs.io/en/latest/api.html) |
81 | | - |
82 | | -[Perl](https://www.perl.org/) |
83 | | - |
84 | | -[wget](https://www.gnu.org/software/wget/) |
85 | | - |
86 | | -conda install: |
87 | | - |
88 | | -``` |
89 | | -conda create -n genopipe -c conda-forge -c bioconda python perl bwa bedtools samtools pysam scipy wget |
90 | | -``` |
91 | | - |
92 | | -### Download |
93 | | - |
94 | | -To download GenoPipe, you can clone the repository. No builds needed. |
95 | | - |
96 | | -``` |
97 | | -git clone https://github.com/CEGRcode/GenoPipe |
98 | | -cd GenoPipe |
99 | | -``` |
100 | | - |
101 | | - |
102 | | -### EpitopeID |
103 | | -genomes and epitope sequences |
104 | | - |
105 | | -- yeast epitope tags |
106 | | - |
107 | | -Saccer33xMyc |
108 | | - |
109 | | - |
110 | | - |
111 | | -1. Check FASTQ filenames |
112 | | - |
113 | | -EpitopeID takes gzipped FASTQ files as input. The file name should end with a `_R1` or `_R2` and use the extension `fastq.gz` (the standard naming convention of Illumina libraries). |
114 | | - |
115 | | -Example: |
116 | | - |
117 | | -The following would be valid file names for EpitopeID input files, where SampleA is single-end data and SampleB is paired-end data. |
118 | | - |
119 | | -``` |
120 | | -SampleA_R1.fastq.gz |
121 | | -SampleB_R1.fastq.gz |
122 | | -SampleB_R2.fastq.gz |
123 | | -``` |
124 | | - |
125 | | -2. Set up the database |
126 | | - |
127 | | -The following instructions are for setting up the database of reference files used by EpitopeID using the provided genome builds and epitope tag sequences. To customize your database, see the full documentation. |
128 | | - |
129 | | -For downloading yeast genome... |
130 | | - |
131 | | -``` |
132 | | -cd EpitopeID/utility_scripts/genome_data/ |
133 | | -bash download_sacCer3_Genome.sh |
134 | | -mv genome.fa* ../../sacCer3_EpiID/FASTA_genome/ |
135 | | -``` |
136 | | - |
137 | | -For downloading human genome... |
138 | | - |
139 | | -``` |
140 | | -cd EpitopeID/utility_scripts/genome_data/ |
141 | | -bash download_hg19_Genome.sh |
142 | | -mv genome.fa* ../../hg19_EpiID/FASTA_genome/ |
143 | | -``` |
144 | | - |
145 | | - |
146 | | -3. Run EpitopeID |
147 | | - |
148 | | -When providing path locations, it is important that you provide **absolute paths** (i.e. path should start with `/` or `~/`). |
149 | | - |
150 | | -For yeast (sacCer3) samples... |
151 | | -``` |
152 | | -cd GenoPipe/EpitopeID |
153 | | -bash identify-Epitope.sh -i /path/to/FASTQ -o /path/to/output -d /path/to/GenoPipe/EpitopeID/sacCer3_EpiID |
154 | | -``` |
155 | | - |
156 | | -For human (hg19) samples... |
157 | | -``` |
158 | | -cd GenoPipe/EpitopeID |
159 | | -bash identify-Epitope.sh -i /path/to/FASTQ -o /path/to/output -d /path/to/GenoPipe/EpitopeID/hg19_EpiID |
160 | | -``` |
161 | | - |
162 | | - |
163 | | -Joe Schmoe Example: |
164 | | - |
165 | | -In the following example, GenoPipe, the directory including all the input yeast FASTQ files, and the new directory for storing EpitopeID reports are stored on the Desktop of Joe Schmoe. Filepaths would need to be changed according to a user's preferred directory structure. |
166 | | - |
167 | | -``` |
168 | | -# Download GenoPipe |
169 | | -cd /User/joeschmoe/Desktop/ |
170 | | -git clone https://github.com/CEGRcode/GenoPipe |
171 | | -# Download Genomic FASTA and move to appropriate directory |
172 | | -cd /User/joeschmoe/Desktop/GenoPipe/EpitopeID/utility_scripts/genome_data/ |
173 | | -bash download_sacCer3_Genome.sh |
174 | | -mv genome.fa* ../../sacCer3_EpiID/FASTA_genome/ |
175 | | -cd ../../ |
176 | | -# Run EpitopeID |
177 | | -bash identify-Epitope.sh -i /User/joeschmoe/Desktop/myfastq -o /User/joeschmoe/Desktop/myreports_EID -d /User/joeschmoe/Desktop/GenoPipe/EpitopeID/sacCer3_EpiID |
178 | | -``` |
179 | | - |
180 | | - |
181 | | - |
182 | | - |
183 | | -### DeletionID |
184 | | - |
185 | | -1. Align FASTQ input files |
186 | | - |
187 | | -DeletionID uses BAM files as its input. Make sure that the reads are aligned to sacCer3 if you are using the default interval database. Any aligner that outputs standard BAM format can be used to generate the BAM input. DeletionID was tested on [BWA-MEM](http://bio-bwa.sourceforge.net/bwa.shtml). |
188 | | - |
189 | | -2. Run DeletionID |
190 | | - |
191 | | - |
192 | | -For yeast (sacCer3) samples... |
193 | | - |
194 | | -``` |
195 | | -cd GenoPipe/DeletionID |
196 | | -bash identify-Deletion.sh -i /path/to/BAM -o /path/to/output -d /path/to/GenoPipe/DeletionID/sacCer3_Del |
197 | | -``` |
198 | | - |
199 | | -Joe Schmoe Example: |
200 | | - |
201 | | -In the following example, GenoPipe, the directory including all the input yeast BAM files, and the new directory for storing DeletionID reports are stored on the Desktop of Joe Schmoe. Filepaths would need to be changed according to a user's preferred directory structure. |
202 | | - |
203 | | -``` |
204 | | -cd /User/joeschmoe/Desktop/GenoPipe/DeletionID |
205 | | -# Run DeletionID |
206 | | -bash identify-Deletion.sh -i /User/joeschmoe/Desktop/mybam -o /User/joeschmoe/Desktop/myreports_DID -d /User/joeschmoe/Desktop/GenoPipe/DeletionID/sacCer3_Del |
207 | | -``` |
208 | | - |
209 | | -### StrainID |
210 | | - |
211 | | -1. Align FASTQ input files |
212 | | - |
213 | | -StrainID uses BAM files as its input. Make sure that the reads are aligned to the appropriate sacCer3 or hg19 genome build if you are using the default interval database. Any aligner that outputs standard BAM format can be used to generate the BAM input. StrainID was tested on [BWA-MEM](http://bio-bwa.sourceforge.net/bwa.shtml). |
214 | | - |
215 | | -2. Run StrainID |
216 | | - |
217 | | -For yeast (sacCer3) samples... |
218 | | - |
219 | | -``` |
220 | | -cd GenoPipe/StrainID |
221 | | -bash identify-Strain.sh -i /path/to/BAM -o /path/to/output -g /path/to/sacCer3.fa -v /path/to/GenoPipe/StrainID/sacCer3_VCF |
222 | | -``` |
223 | | - |
224 | | -For human (hg19) samples... |
225 | | - |
226 | | -``` |
227 | | -cd GenoPipe/StrainID |
228 | | -bash identify-Strain.sh -i /path/to/BAM -o /path/to/output -g /path/to/hg19.fa -v /path/to/GenoPipe/StrainID/hg19_VCF |
229 | | -``` |
230 | | - |
231 | | -Joe Schmoe Example: |
232 | | - |
233 | | -In the following example, GenoPipe, the directory including all the input yeast BAM files, and the new directory for storing StrainID reports are stored on the Desktop of Joe Schmoe. Filepaths would need to be changed according to a user's preferred directory structure. |
234 | | - |
235 | | -``` |
236 | | -cd /User/joeschmoe/Desktop/GenoPipe/ |
237 | | -cd EpitopeID/utility_scripts/genome_data |
238 | | -bash download_sacCer3_Genome.sh |
239 | | -mv genome.fa /User/joeschmoe/Desktop/GenoPipe/sacCer3.fa |
240 | | -# Run StrainID |
241 | | -cd ../../../StrainID |
242 | | -bash identify-Strain.sh -i /User/joeschmoe/Desktop/mybam -o /User/joeschmoe/Desktop/myreports_SID -g /User/joeschmoe/Desktop/GenoPipe/sacCer3.fa -v /User/joeschmoe/Desktop/GenoPipe/StrainID/sacCer3_VCF |
243 | | -``` |
244 | | - |
245 | | - |
246 | | -Full Joe Schmoe examples' directory structure: |
247 | | - |
248 | | -``` |
249 | | -/User/joeschmoe/Desktop |
250 | | - |--GenoPipe |
251 | | - | |--EpitopeID |
252 | | - | |--DeletionID |
253 | | - | |--StrainID |
254 | | - |--myfastq |
255 | | - | |--SampleA_R1.fastq.gz |
256 | | - | |--SampleB_R1.fastq.gz |
257 | | - | |--SampleB_R2.fastq.gz |
258 | | - |--mybam |
259 | | - | |--SampleA.bam |
260 | | - | |--SampleB.bam |
261 | | - |--myreports_EID |
262 | | - | |--SampleA_R1-ID.tab |
263 | | - | |--SampleB_R1-ID.tab |
264 | | - |--myreports_DID |
265 | | - | |--SampleA_deletion.tab |
266 | | - | |--SampleB_deletion.tab |
267 | | - |--myreports_SID |
268 | | - |--SampleA_strain.tab |
269 | | - |--SampleB_strain.tab |
270 | | -``` |
| 5 | +### [:house: GenoPipe Website Homepage :house:](https://pughlab.mbg.cornell.edu/GenoPipe-docs/) |
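
The removed Quickstart text states that EpitopeID expects gzipped FASTQ files whose names end in `_R1` or `_R2` with the extension `fastq.gz`. A minimal sketch of a pre-flight check for that naming rule is shown here; the function names `classify_fastq_name` and `check_fastq_dir` are hypothetical helpers for illustration and are not part of GenoPipe.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not shipped with GenoPipe): check file names against
# the naming rule quoted in the removed Quickstart text.

# Print R1, R2, or invalid for a single file name.
classify_fastq_name() {
    local name
    name=$(basename "$1")
    case "$name" in
        ?*_R1.fastq.gz) echo "R1" ;;
        ?*_R2.fastq.gz) echo "R2" ;;
        *)              echo "invalid" ;;
    esac
}

# Report each sample in a directory as single-end or paired-end and flag
# files that do not follow the convention.
check_fastq_dir() {
    local dir="$1"
    local f sample
    for f in "$dir"/*; do
        # Skip the literal pattern left behind when the directory is empty.
        [ -e "$f" ] || continue
        case "$(classify_fastq_name "$f")" in
            R1)
                sample=${f%_R1.fastq.gz}
                if [ -e "${sample}_R2.fastq.gz" ]; then
                    echo "$(basename "$sample"): paired-end"
                else
                    echo "$(basename "$sample"): single-end"
                fi
                ;;
            R2)
                # An R2 file is only reported when its R1 mate is missing.
                sample=${f%_R2.fastq.gz}
                [ -e "${sample}_R1.fastq.gz" ] || echo "$(basename "$sample"): R2 without R1"
                ;;
            *)
                echo "$(basename "$f"): does not match *_R1.fastq.gz or *_R2.fastq.gz"
                ;;
        esac
    done
}
```

Calling `check_fastq_dir /path/to/FASTQ` on the input directory before running `identify-Epitope.sh` would list one line per sample. The check only inspects file names; it does not verify that the files are valid gzipped FASTQ.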