Commit 236cd49

update READMEs
update three different README files:
- paper/
  - capitalize sentence descriptions and add some detail
- paper/SE/
  - add instructions for how to execute all scripts within SE and remove simulations in progress table
- paper/SE/results/
  - embed results images in README file with descriptions of summary output files
1 parent 44ae489 commit 236cd49

3 files changed

Lines changed: 94 additions & 38 deletions

paper/README.md

Lines changed: 20 additions & 16 deletions
@@ -21,44 +21,48 @@ paper
 |--CENPK-chipseq
 ```
 
+These scripts were built to run on a linux server with a PBS job scheduler set up and some of the dependencies installed using some environmental modules and a conda environment for remaining dependencies. You may need to modify these scripts to account for different server setup and configurations.
+
+See the [GenoPipe documentation](https://pughlab.mbg.cornell.edu/GenoPipe-docs/) for a list of dependencies needed to run these publication-associated scripts. In addition to these dependencies, you will also need to instal [seqtk](https://github.com/lh3/seqtk).
+
 ## setup.sh
-runs the scripts to download and format the yeast and human genomes and other reference files for aligning the data
-also indexes the genomes for BWA
+Runs the scripts to download and format the yeast and human genomes and other reference files for aligning the data
+also indexes the genomes for BWA. This must be run before all other scripts to reproduce the publication figures.
 
 ## scripts
-contains the general scripts for setting up the simulations like downloading and parsing the reference genomes
-also contains the general scripts that several of the higher directory scripts call
+Contains the general scripts for setting up the simulations like downloading and parsing the reference genomes
+also contains the general scripts that several of the higher directory scripts call.
 
 ## input
-where setup.sh puts the reference genome and the aligner indexes
-also where other reference FASTA files are stored (i.e. R500.fa, 3xFLAG.fa, and the HIV genome sequence)
+Where `setup.sh` puts the reference genome and the aligner indexes. Also where other reference FASTA files are stored
+(i.e. R500.fa, 3xFLAG.fa, and the HIV genome sequence).
 
 ## db
-where the input database directories are built by setup.sh with variation as appropriate for GenoPipe module, species, and epitope set
+Where the input database directories are built by `setup.sh` with variation as appropriate for GenoPipe module, species, and epitope set.
 
 ## SyntheticEpitope
-contains the scripts and houses the results of simulations testing EpitopeID
+Contains the scripts and the results of simulations testing EpitopeID. This also includes the mixed contamination simulation tests.
 
 ## SyntheticStrain
-contains the scripts and houses the results of simulations testing StrainID
+Contains the scripts and the results of simulations testing StrainID.
 
 ## SyntheticDeletion
-contains the scripts and houses the results of simulations testing DeletionID
+Contains the scripts and the results of simulations testing DeletionID.
 
 ## ENCODEdata-eGFP
-contains the scripts and information for downloading ENCODE eGFP data to test EpitopeID
+Contains the scripts and information for downloading ENCODE eGFP data to test EpitopeID.
 
 ## ENCODEdata-CellLines
-contains the scripts and information for downloading ENCODE transcription factor ChIP-seq data to test StrainID
+Contains the scripts and information for downloading ENCODE transcription factor ChIP-seq data to test StrainID.
 
 ## HIV_samples
-contains the scripts and information for downloading, processing, and running EpitopeID on the Bosque et al, 2017 dataset for localizing HIV genome insertions
+Contains the scripts and information for downloading, processing, and running EpitopeID on the Bosque et al, 2017 dataset for localizing HIV genome insertions.
 
 ## YKOC-wgs
-contains the scripts and information for downloading, processing, and running DeletionID on the Puddu et al, 2019 dataset for identifying deletions
+Contains the scripts and information for downloading, processing, and running DeletionID on the Puddu et al, 2019 dataset for identifying deletions.
 
 ## BY4742-chipseq
-contains the scripts and information for downloading, processing, and running StrainID on the BAM files
+Contains the scripts and information for downloading, processing, and running StrainID on the BAM files.
 
 ## CENPK-chipseq
-contains the scripts and information for downloading, processing, and running StrainID on the BAM files
+cContains the scripts and information for downloading, processing, and running StrainID on the BAM files.

paper/SyntheticEpitope/README.md

Lines changed: 36 additions & 22 deletions
@@ -1,27 +1,5 @@
 # Simulate Paired-End datasets for EpitopeID and evaluate performance
 
-Simulations left todo:
-
-🍏 = 1000 Simulated FASTQ files completed
-
-🍎 = 1000 EpitopeID results generated and committed
-
-Yeast (across 10M, 1M, 100K, 10K PE reads)
-| | Reb1 | Rap1 | Sua7 |
-| ---- | --------- | --------- | --------- |
-| R500 | 🍎 | 🍎 | 🍎 |
-| R100 | 🍎 | 🍎 | 🍎 |
-| R50 | 🍎 | 🍎 | 🍎 |
-| R20 | 🍎 | 🍎 | 🍎 |
-
-Human (across 50M, 10M, 20M, 1M, 100K PE reads)
-| | CTCF | POLR2H | YY1 |
-| ---- | --------- | --------- | --------- |
-| R500 | 🍎 | 🍎 | 🍎 |
-| R100 | 🍎 | 🍎 | 🍎 |
-| R50 | 🍎 | 🍎 | 🍎 |
-| R20 | 🍎 | | 🍎 |
-
 
 ## Generate Synthetic Genomes
 Create synthetic genomes to simulate from by generating a random 500bp "epitope" sequence
@@ -46,3 +24,39 @@ For each organism, subsample the datasets generated above and mix them in variou
 -each pair of sets should be mixed in the following ratios: (90-10%, 80-20%, .., 10-90%)
 Run the new "contaminated" datasets through EpitopeID and evaluate how often EpitopeID
 correctly identifies the location of the inserted sequence of each population.
+
+
+## Job execution order
+```
+# Create synthetic genomes
+bash job/generate_synthetic_genomes.sh
+
+# Set-up job scripts for depth simulations
+bash job/build_jobs.sh # will create a bunch of PBS scripts in the job/ directory. Based on depth_template.pbs and epitopeid_template.pbs
+
+# Run depth simulations to create FASTQ input files (yeast & human)
+qsub run_depth_1_Reb1-Cterm_R500.pbs
+qsub run_depth_2_Rap1-Nterm_R500.pbs
+qsub run_depth_X_....
+...
+
+# Run EpitopeID on depth simulations to create the reports (yeast & human)
+qsub run_EpitopeID_1_Reb1-Cterm_R500.pbs
+qsub run_EpitopeID_2_Rap1-Nterm_R500.pbs
+qsub run_EpitopeID_X_...
+...
+
+# Compile results from depth simulations
+bash job/compile_results.sh
+
+# Run mixture simulations to create FASTQ input files
+qsub job/run_mix_yeast.pbs
+qsub job/run_mix_human.pbs
+
+# Run EpitopeID on mixture simulations to create the reports
+qsub job/run_EpitopeID_on_mix_yeast.pbs
+qsub job/run_EpitopeID_on_mix_human.pbs
+
+# Compile results from mixture simulations
+bash job/compile_mix_results.sh
+```
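
The job execution order above submits the generated PBS scripts one at a time. A minimal sketch of batching one stage follows, assuming the scripts written by `job/build_jobs.sh` can be matched with a glob such as `job/run_depth_*.pbs` (an assumption; the README lists them without the `job/` prefix) and that jobs within a stage may be submitted in any order. `submit_jobs` is a hypothetical helper and is not part of the repository.

```python
import glob
import subprocess


def submit_jobs(pattern, dry_run=True):
    """Build one `qsub <script>` command per PBS script matching `pattern`.

    With dry_run=True (the default) the commands are only printed, so the
    list can be checked before anything reaches the scheduler.
    """
    # Sorting is lexicographic, so run_depth_10_* comes before run_depth_2_*.
    # This sketch assumes order within a stage does not matter.
    commands = [["qsub", path] for path in sorted(glob.glob(pattern))]
    for command in commands:
        if dry_run:
            print(" ".join(command))
        else:
            # check=True raises CalledProcessError if qsub exits non-zero.
            subprocess.run(command, check=True)
    return commands
```

Calling `submit_jobs("job/run_depth_*.pbs")` prints the planned submissions, and passing `dry_run=False` hands them to `qsub`. The EpitopeID stage should only be submitted after the depth jobs have finished writing their FASTQ files.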
Lines changed: 38 additions & 0 deletions
@@ -1 +1,39 @@
 # Simulation files and EpitopeID results go here
+
+## sacCer3
+This directory contains the raw EpitopeID reports and runtimes for each of the 1000 simulations across each type of yeast simulation.
+
+## hg19
+This directory contains the raw EpitopeID reports and runtimes for each of the 1000 simulations across each type of human simulation.
+
+## Summary reports
+The EpitopeID results are parsed out into a summary report using the `scripts/analyze_eid_results.py` where...
+- **First Column:** is the filepath for each simulation's EpitopeID report (filepath encodes experiment parameters)
+- **Second Column:** includes the read count of reads that map to the expected epitope sequence
+- **Third Column:** includes the number of bins relating to the expected target region (>0 means successfully localized)
+
+
+The following files contain the results from variable sequencing depth and epitope tag length experiments in yeast and human:
+```
+SummaryReport_sacCer3.txt
+SummaryReport_hg19.txt
+```
+![sacCer3-id-tally](results/ID-tally_sacCer3.png)
+![hg19-id-tally](results/ID-tally_hg19.png)
+
+The following files contain the results from the read mixture contamination titration experiments in yeast and human:
+```
+MixSummaryReport_sacCer3.txt
+MixSummaryReport_hg19.txt
+```
+![mix-sacCer3-id-tally](results/ID-Mix-tally_sacCer3_1M.png)
+![mix-hg19-id-tally](results/ID-Mix-tally_hg19_50M.png)
+
+## Runtime Summary reports
+The following files contain the results from runtime benchmarking the read mixture contamination titration experiments in yeast and human:
+```
+RuntimeSummaryReport_sacCer3.txt
+RuntimeSummaryReport_hg19.txt
+```
+![sacCer3-runtimes](results/Runtimes_sacCer3.png)
+![hg19-runtimes](results/Runtimes_hg19.png)
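
The three-column summary layout described under "Summary reports" can be tallied with a short script. This is a sketch only: it assumes the summary files are tab-delimited, which the README does not state, and `tally_localized` is a hypothetical helper rather than part of `scripts/analyze_eid_results.py`.

```python
import csv


def tally_localized(lines, delimiter="\t"):
    """Return (total, localized) for rows of a three-column summary report.

    Columns follow the README: report filepath, read count on the expected
    epitope, and number of bins at the expected target region. A row counts
    as localized when the third column is greater than zero.
    """
    total = 0
    localized = 0
    for row in csv.reader(lines, delimiter=delimiter):
        if len(row) < 3:
            continue  # skip blank or truncated rows
        try:
            bins = float(row[2])
        except ValueError:
            continue  # skip a header or any non-numeric row
        total += 1
        if bins > 0:
            localized += 1
    return total, localized
```

Passing an open file handle for `SummaryReport_sacCer3.txt` (or any iterable of lines) returns the number of simulations read and how many were localized. If the real files use a different delimiter, the `delimiter` argument would need to change accordingly.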
