Commit 73e17f5

update README with docs link
Reformat README to match scriptmanager. Usage/quickstart should be referred to in the docs.
1 parent 4e9b2e8 commit 73e17f5

1 file changed

Lines changed: 2 additions & 267 deletions

README.md

@@ -1,270 +1,5 @@
11
# GenoPipe
22

3-
Expanded Documentation at https://pughlab.mbg.cornell.edu/GenoPipe-docs/
3+
Confidence in experimental results is critical for discovery. As the scale of data generation in genomics has grown exponentially, experimental error has likely kept pace despite the best efforts of many laboratories. Technical mistakes can and do occur at nearly every stage of a genomics assay (e.g., cell line contamination, reagent swapping, tube mislabelling) and are often difficult to identify post-execution. However, the DNA sequenced in genomic experiments contains certain markers (e.g., indels) that can often be ascertained forensically from experimental datasets. We developed the Genotype validation Pipeline (GenoPipe), a suite of heuristic tools that operate together directly on raw and aligned sequencing data from individual high-throughput sequencing experiments to characterize the underlying genome of the source material. We demonstrate how GenoPipe validates and rescues erroneously annotated experiments by identifying unique markers inherent to an organism’s genome (i.e., epitope insertions, gene deletions, and SNPs).
44

5-
## Toolkit for characterizing the genotype of NGS datasets
6-
7-
There are 3 primary modules for genotype identification:
8-
9-
### EpitopeID
10-
11-
- Identify and determine the genomic location of epitopes relative to genomic loci
12-
13-
- Some epitope sequences are provided in the default tag database
14-
15-
| sacCer3(yeast) | hg19(human) |
16-
| -------------- | ----------- |
17-
| AID | LAP-tag |
18-
| CBP | |
19-
| Extended-Tap | |
20-
| FLAG-3x | |
21-
| FRB | |
22-
| GFP | |
23-
| HA_v1 | |
24-
| HA_v2 | |
25-
| HA_v3 | |
26-
| HaloTag | |
27-
| MNase_v2 | |
28-
| Myc-3x | |
29-
| ProteinA | |
30-
31-
### DeletionID
32-
33-
- Identify significant depletion of aligned NGS tags in the genome relative to a background model. This module is useful for confirming gene knockouts.
34-
35-
- Default database includes reference files that direct the search for depleted reads within gene annotation intervals from the sacCer3(yeast) genome build.
36-
37-
### StrainID
38-
39-
- Compare a database of VCF files against an aligned BAM file to check for the presence of SNPs in order to determine a likely cell line/strain used in the experiment
40-
41-
- Default database includes reference VCF files for the following strains:
42-
43-
| sacCer3(yeast) | hg19(human) |
44-
| -------------- | ----------- |
45-
| BY4741 | A549 |
46-
| BY4742 | HCT116 |
47-
| CEN.PK2-1Ca | HELA |
48-
| D273-10B | HepG2 |
49-
| FL100 | K562 |
50-
| JK9-3d | LnCap |
51-
| RM11-1A | MCF7 |
52-
| RedStar | SKnSH |
53-
| SEY6210 | |
54-
| Sigma1278b-10560-6B | |
55-
| W303 | |
56-
| Y55 | |
57-
58-
59-
[Figure 1]
60-
61-
62-
## Quickstart
63-
64-
This guide is for how to run each of the three GenoPipe modules on data from yeast(sacCer3) and human(hg19) samples. See the full documentation for how to modify and generate reference files for other genome builds.
65-
66-
### Dependencies
67-
68-
You will need the following software to run all GenoPipe modules:
69-
70-
[Samtools v1.5+](http://www.htslib.org/)
71-
72-
[Bedtools v2.27+](https://bedtools.readthedocs.io/en/latest/)
73-
74-
[BWA v0.7.15+](http://bio-bwa.sourceforge.net/bwa.shtml)
75-
76-
[Python v3.6.8+](https://www.python.org/)
77-
78-
- [scipy v1.5.4+](https://www.scipy.org/)
79-
80-
- [pysam v0.16.0.1+](https://pysam.readthedocs.io/en/latest/api.html)
81-
82-
[Perl](https://www.perl.org/)
83-
84-
[wget](https://www.gnu.org/software/wget/)
85-
86-
conda install:
87-
88-
```
89-
conda create -n genopipe -c conda-forge -c bioconda python perl bwa bedtools samtools pysam scipy wget
90-
```
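
Before running any module, it can help to confirm that the command-line tools listed above are actually on `PATH`. The following is a minimal sketch, not part of GenoPipe: the function name `missing_tools` is hypothetical, `python3` is assumed to be the Python executable name, and the check does not verify minimum versions or the `scipy`/`pysam` Python packages.

```shell
# Print the name of every requested tool that cannot be found on PATH.
# Prints nothing when all tools are available.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Tool names taken from the dependency list above (python3 is an assumed executable name).
missing_tools samtools bedtools bwa python3 perl wget
```

Any name printed by the last line still needs to be installed, for example by activating the conda environment created above.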
91-
92-
### Download
93-
94-
To download GenoPipe, you can clone the repository. No builds needed.
95-
96-
```
97-
git clone https://github.com/CEGRcode/GenoPipe
98-
cd GenoPipe
99-
```
100-
101-
102-
### EpitopeID
103-
genomes and epitope sequences
104-
105-
- yeast epitope tags
106-
107-
Saccer33xMyc
108-
109-
110-
111-
1. Check FASTQ filenames
112-
113-
EpitopeID takes gzipped FASTQ files as input. The file name should end with `_R1` or `_R2` and use the extension `fastq.gz` (the standard naming convention of Illumina libraries).
114-
115-
Example:
116-
117-
The following would be valid file names for EpitopeID input files, where “SampleA” is single-end data and “SampleB” is paired-end data.
118-
119-
```
120-
SampleA_R1.fastq.gz
121-
SampleB_R1.fastq.gz
122-
SampleB_R2.fastq.gz
123-
```
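
The naming convention described above can be checked before launching a run. This is a small sketch under stated assumptions: `is_valid_fastq_name` is a hypothetical helper, not a GenoPipe script, and it encodes only the rule stated here (a sample name followed by `_R1` or `_R2` and the `.fastq.gz` extension), which may be stricter or looser than what EpitopeID itself enforces.

```shell
# Succeed (exit 0) when the file name looks like <sample>_R1.fastq.gz or <sample>_R2.fastq.gz.
is_valid_fastq_name() {
  case "$(basename "$1")" in
    ?*_R[12].fastq.gz) return 0 ;;
    *) return 1 ;;
  esac
}

# Example: flag files that would need renaming.
for f in SampleA_R1.fastq.gz SampleB_R2.fastq.gz SampleC.fastq.gz; do
  if is_valid_fastq_name "$f"; then
    echo "ok: $f"
  else
    echo "rename: $f"
  fi
done
```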
124-
125-
2. Set up the database
126-
127-
The following instructions are for setting up the database of reference files used by EpitopeID using the provided genome builds and epitope tag sequences. To customize your database, see the full documentation.
128-
129-
For downloading yeast genome...
130-
131-
```
132-
cd EpitopeID/utility_scripts/genome_data/
133-
bash download_sacCer3_Genome.sh
134-
mv genome.fa* ../../sacCer3_EpiID/FASTA_genome/
135-
```
136-
137-
For downloading human genome...
138-
139-
```
140-
cd EpitopeID/utility_scripts/genome_data/
141-
bash download_hg19_Genome.sh
142-
mv genome.fa* ../../hg19_EpiID/FASTA_genome/
143-
```
144-
145-
146-
3. Run EpitopeID
147-
148-
When providing path locations, it is important that you provide **absolute paths** (i.e. path should start with `/` or `~/`).
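
If a path is known only relative to the current directory, it can be converted before being passed to the script. The helper below is a hypothetical sketch (`to_absolute` is not part of GenoPipe). It prefixes relative paths with `$PWD` and leaves paths that already start with `/` untouched; an unquoted `~/` is expanded by the shell before the function sees it.

```shell
# Print an absolute version of the given path without requiring the path to exist.
to_absolute() {
  case "$1" in
    /*) printf '%s\n' "$1" ;;            # already absolute
    *)  printf '%s/%s\n' "$PWD" "$1" ;;  # relative to the current directory
  esac
}

to_absolute myreports_EID
```

The printed value could then be used for the `-i`, `-o`, or `-d` arguments.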
149-
150-
For yeast (sacCer3) samples...
151-
```
152-
cd GenoPipe/EpitopeID
153-
bash identify-Epitope.sh -i /path/to/FASTQ -o /path/to/output -d /path/to/GenoPipe/EpitopeID/sacCer3_EpiID
154-
```
155-
156-
For human (hg19) samples...
157-
```
158-
cd GenoPipe/EpitopeID
159-
bash identify-Epitope.sh -i /path/to/FASTQ -o /path/to/output -d /path/to/GenoPipe/EpitopeID/hg19_EpiID
160-
```
161-
162-
163-
Joe Schmoe Example:
164-
165-
In the following example, GenoPipe, the directory including all the input yeast FASTQ files, and the new directory for storing EpitopeID reports are stored on the Desktop of Joe Schmoe. Filepaths would need to be changed according to a user's preferred directory structure.
166-
167-
```
168-
# Download GenoPipe
169-
cd /User/joeschmoe/Desktop/
170-
git clone https://github.com/CEGRcode/GenoPipe
171-
# Download Genomic FASTA and move to appropriate directory
172-
cd /User/joeschmoe/Desktop/GenoPipe/EpitopeID/utility_scripts/genome_data/
173-
bash download_sacCer3_Genome.sh
174-
mv genome.fa* ../../sacCer3_EpiID/FASTA_genome/
175-
cd ../../
176-
# Run EpitopeID
177-
bash identify-Epitope.sh -i /User/joeschmoe/Desktop/myfastq -o /User/joeschmoe/Desktop/myreports_EID -d /User/joeschmoe/Desktop/GenoPipe/EpitopeID/sacCer3_EpiID
178-
```
179-
180-
181-
182-
183-
### DeletionID
184-
185-
1. Align FASTQ input files
186-
187-
DeletionID uses BAM files as its input. Make sure that the reads are aligned to sacCer3 if you are using the default interval database. Any aligner that outputs standard BAM format can be used to generate the BAM input. DeletionID was tested on [BWA-MEM](http://bio-bwa.sourceforge.net/bwa.shtml).
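
One possible way to produce such a BAM file with BWA-MEM and Samtools is sketched below. This is an assumption-laden illustration rather than the authors' documented procedure: `alignment_commands` is a hypothetical helper that only prints the commands for review, all paths are placeholders, and paths containing spaces are not handled.

```shell
# Print (do not run) the commands for aligning one sample.
# Usage: alignment_commands <reference.fa> <output.bam> <R1.fastq.gz> [R2.fastq.gz]
alignment_commands() {
  ref=$1
  out=$2
  shift 2
  echo "bwa index $ref"                               # build the BWA index (needed once per reference)
  echo "bwa mem $ref $* | samtools sort -o $out -"    # align reads, then coordinate-sort into a BAM
  echo "samtools index $out"                          # index the sorted BAM
}

alignment_commands /path/to/sacCer3.fa /path/to/BAM/SampleB.bam \
  /path/to/FASTQ/SampleB_R1.fastq.gz /path/to/FASTQ/SampleB_R2.fastq.gz
```

After reviewing the printed commands, they can be run by hand or piped to `sh`.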
188-
189-
2. Run DeletionID
190-
191-
192-
For yeast (sacCer3) samples...
193-
194-
```
195-
cd GenoPipe/DeletionID
196-
bash identify-Deletion.sh -i /path/to/BAM -o /path/to/output -d /path/to/GenoPipe/DeletionID/sacCer3_Del
197-
```
198-
199-
Joe Schmoe Example:
200-
201-
In the following example, GenoPipe, the directory including all the input yeast BAM files, and the new directory for storing DeletionID reports are stored on the Desktop of Joe Schmoe. Filepaths would need to be changed according to a user's preferred directory structure.
202-
203-
```
204-
cd /User/joeschmoe/Desktop/GenoPipe/DeletionID
205-
# Run DeletionID
206-
bash identify-Deletion.sh -i /User/joeschmoe/Desktop/mybam -o /User/joeschmoe/Desktop/myreports_DID -d /User/joeschmoe/Desktop/GenoPipe/DeletionID/sacCer3_Del
207-
```
208-
209-
### StrainID
210-
211-
1. Align FASTQ input files
212-
213-
StrainID uses BAM files as its input. Make sure that the reads are aligned to the appropriate sacCer3 or hg19 genome build if you are using the default VCF database. Any aligner that outputs standard BAM format can be used to generate the BAM input. StrainID was tested on [BWA-MEM](http://bio-bwa.sourceforge.net/bwa.shtml).
214-
215-
2. Run StrainID
216-
217-
For yeast (sacCer3) samples...
218-
219-
```
220-
cd GenoPipe/StrainID
221-
bash identify-Strain.sh -i /path/to/BAM -o /path/to/output -g /path/to/sacCer3.fa -v /path/to/GenoPipe/StrainID/sacCer3_VCF
222-
```
223-
224-
For human (hg19) samples...
225-
226-
```
227-
cd GenoPipe/StrainID
228-
bash identify-Strain.sh -i /path/to/BAM -o /path/to/output -g /path/to/hg19.fa -v /path/to/GenoPipe/StrainID/hg19_VCF
229-
```
230-
231-
Joe Schmoe Example:
232-
233-
In the following example, GenoPipe, the directory including all the input yeast BAM files, and the new directory for storing StrainID reports are stored on the Desktop of Joe Schmoe. Filepaths would need to be changed according to a user's preferred directory structure.
234-
235-
```
236-
cd /User/joeschmoe/Desktop/GenoPipe/
237-
cd EpitopeID/utility_scripts/genome_data
238-
bash download_sacCer3_Genome.sh
239-
mv genome.fa /User/joeschmoe/Desktop/GenoPipe/sacCer3.fa
240-
# Run StrainID
241-
cd ../../../StrainID
242-
bash identify-Strain.sh -i /User/joeschmoe/Desktop/mybam -o /User/joeschmoe/Desktop/myreports_SID -g /User/joeschmoe/Desktop/GenoPipe/sacCer3.fa -v /User/joeschmoe/Desktop/GenoPipe/StrainID/sacCer3_VCF
243-
```
244-
245-
246-
Full Joe Schmoe examples' directory structure:
247-
248-
```
249-
/User/joeschmoe/Desktop
250-
|--GenoPipe
251-
| |--EpitopeID
252-
| |--DeletionID
253-
| |--StrainID
254-
|--myfastq
255-
| |--SampleA_R1.fastq.gz
256-
| |--SampleB_R1.fastq.gz
257-
| |--SampleB_R2.fastq.gz
258-
|--mybam
259-
| |--SampleA.bam
260-
| |--SampleB.bam
261-
|--myreports_EID
262-
| |--SampleA_R1-ID.tab
263-
| |--SampleB_R1-ID.tab
264-
|--myreports_DID
265-
| |--SampleA_deletion.tab
266-
| |--SampleB_deletion.tab
267-
|--myreports_SID
268-
|--SampleA_strain.tab
269-
|--SampleB_strain.tab
270-
```
5+
### [:house: GenoPipe Website Homepage :house:](https://pughlab.mbg.cornell.edu/GenoPipe-docs/)
