update READMEs

owlang · owlang · commit 236cd495f5f9 · 2023-03-06T14:32:55.000-05:00
update three different README files:
- paper/ - capitalize sentence descriptions and add some detail
- paper/SE/ - add instructions for how to execute all scripts within SE and remove simulations in progress table
- paper/SE/results/ - embed results images in README file with descriptions of summary output files
diff --git a/paper/README.md b/paper/README.md
@@ -21,44 +21,48 @@ paper
 |--CENPK-chipseq
 ```
 
+These scripts were built to run on a Linux server with a PBS job scheduler. Some dependencies were installed through environment modules and the rest through a conda environment. You may need to modify the scripts to match your server's setup and configuration.
+
+See the [GenoPipe documentation](https://pughlab.mbg.cornell.edu/GenoPipe-docs/) for a list of dependencies needed to run these publication-associated scripts. In addition to these dependencies, you will also need to install [seqtk](https://github.com/lh3/seqtk).
+
 ## setup.sh
-runs the scripts to download and format the yeast and human genomes and other reference files for aligning the data
-also indexes the genomes for BWA
+Runs the scripts that download and format the yeast and human genomes and other reference files for aligning the data,
+and indexes the genomes for BWA. This must be run before all other scripts to reproduce the publication figures.
 
 ## scripts
-contains the general scripts for setting up the simulations like downloading and parsing the reference genomes
-also contains the general scripts that several of the higher directory scripts call
+Contains the general scripts for setting up the simulations, such as downloading and parsing the reference genomes.
+Also contains the shared scripts that several scripts in the higher-level directories call.
 
 ## input
-where setup.sh puts the reference genome and the aligner indexes
-also where other reference FASTA files are stored (i.e. R500.fa, 3xFLAG.fa, and the HIV genome sequence)
+Where `setup.sh` puts the reference genome and the aligner indexes. Also where other reference FASTA files are stored
+(i.e. R500.fa, 3xFLAG.fa, and the HIV genome sequence).
 
 ## db
-where the input database directories are built by setup.sh with variation as appropriate for GenoPipe module, species, and epitope set
+Where the input database directories are built by `setup.sh` with variation as appropriate for GenoPipe module, species, and epitope set.
 
 ## SyntheticEpitope
-contains the scripts and houses the results of simulations testing EpitopeID
+Contains the scripts and the results of simulations testing EpitopeID. This also includes the mixed contamination simulation tests.
 
 ## SyntheticStrain
-contains the scripts and houses the results of simulations testing StrainID
+Contains the scripts and the results of simulations testing StrainID.
 
 ## SyntheticDeletion
-contains the scripts and houses the results of simulations testing DeletionID
+Contains the scripts and the results of simulations testing DeletionID.
 
 ## ENCODEdata-eGFP
-contains the scripts and information for downloading ENCODE eGFP data to test EpitopeID
+Contains the scripts and information for downloading ENCODE eGFP data to test EpitopeID.
 
 ## ENCODEdata-CellLines
-contains the scripts and information for downloading ENCODE transcription factor ChIP-seq data to test StrainID
+Contains the scripts and information for downloading ENCODE transcription factor ChIP-seq data to test StrainID.
 
 ## HIV_samples
-contains the scripts and information for downloading, processing, and running EpitopeID on the Bosque et al, 2017 dataset for localizing HIV genome insertions
+Contains the scripts and information for downloading, processing, and running EpitopeID on the Bosque et al, 2017 dataset for localizing HIV genome insertions.
 
 ## YKOC-wgs
-contains the scripts and information for downloading, processing, and running DeletionID on the Puddu et al, 2019 dataset for identifying deletions
+Contains the scripts and information for downloading, processing, and running DeletionID on the Puddu et al, 2019 dataset for identifying deletions.
 
 ## BY4742-chipseq
-contains the scripts and information for downloading, processing, and running StrainID on the BAM files
+Contains the scripts and information for downloading, processing, and running StrainID on the BAM files.
 
 ## CENPK-chipseq
-contains the scripts and information for downloading, processing, and running StrainID on the BAM files
+Contains the scripts and information for downloading, processing, and running StrainID on the BAM files.
diff --git a/paper/SyntheticEpitope/README.md b/paper/SyntheticEpitope/README.md
@@ -1,27 +1,5 @@
 # Simulate Paired-End datasets for EpitopeID and evaluate performance
 
-Simulations left todo:
-
-&#x1F34F; = 1000 Simulated FASTQ files completed
-
-&#x1F34E; = 1000 EpitopeID results generated and committed
-
-Yeast (across 10M, 1M, 100K, 10K PE reads)
-|      |    Reb1   |    Rap1   |    Sua7   |
-| ---- | --------- | --------- | --------- |
-| R500 | &#x1F34E; | &#x1F34E; | &#x1F34E; |
-| R100 | &#x1F34E; | &#x1F34E; | &#x1F34E; |
-| R50  | &#x1F34E; | &#x1F34E; | &#x1F34E; |
-| R20  | &#x1F34E; | &#x1F34E; | &#x1F34E; |
-
-Human (across 50M, 10M, 20M, 1M, 100K PE reads)
-|      |    CTCF   |   POLR2H  |    YY1    |
-| ---- | --------- | --------- | --------- |
-| R500 | &#x1F34E; | &#x1F34E; | &#x1F34E; |
-| R100 | &#x1F34E; | &#x1F34E; | &#x1F34E; |
-| R50  | &#x1F34E; | &#x1F34E; | &#x1F34E; |
-| R20  | &#x1F34E; |           | &#x1F34E; |
-
 
 ## Generate Synthetic Genomes
 Create synthetic genomes to simulate from by generating a random 500bp "epitope" sequence
@@ -46,3 +24,39 @@ For each organism, subsample the datasets generated above and mix them in variou
 -each pair of sets should be mixed in the following ratios: (90-10%, 80-20%, .., 10-90%)
 Run the new "contaminated" datasets through EpitopeID and evaluate how often EpitopeID
 correctly identifies the location of the inserted sequence of each population.
+
+
+## Job execution order
+```
+# Create synthetic genomes
+bash job/generate_synthetic_genomes.sh
+
+# Set-up job scripts for depth simulations
+bash job/build_jobs.sh   # creates PBS scripts in the job/ directory from depth_template.pbs and epitopeid_template.pbs
+
+# Run depth simulations to create FASTQ input files (yeast & human)
+qsub run_depth_1_Reb1-Cterm_R500.pbs
+qsub run_depth_2_Rap1-Nterm_R500.pbs
+qsub run_depth_X_....
+...
+
+# Run EpitopeID on depth simulations to create the reports (yeast & human)
+qsub run_EpitopeID_1_Reb1-Cterm_R500.pbs
+qsub run_EpitopeID_2_Rap1-Nterm_R500.pbs
+qsub run_EpitopeID_X_...
+...
+
+# Compile results from depth simulations
+bash job/compile_results.sh
+
+# Run mixture simulations to create FASTQ input files
+qsub job/run_mix_yeast.pbs
+qsub job/run_mix_human.pbs
+
+# Run EpitopeID on mixture simulations to create the reports
+qsub job/run_EpitopeID_on_mix_yeast.pbs
+qsub job/run_EpitopeID_on_mix_human.pbs
+
+# Compile results from mixture simulations
+bash job/compile_mix_results.sh
+```
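The `qsub run_depth_X_....` placeholders in the job order above stand for one submission per generated script. A minimal sketch of how those commands could be listed, assuming the generated scripts sit in `job/` and follow the `run_depth_<N>_<label>.pbs` and `run_EpitopeID_<N>_<label>.pbs` naming seen in the examples. The helper `qsub_commands` is hypothetical and not part of the repository, and it only prints commands rather than submitting anything:

```python
import re
from pathlib import Path


def qsub_commands(job_dir, prefix):
    """Return one `qsub` command per generated PBS script, ordered by job index.

    Assumes file names look like `<prefix>_<N>_<label>.pbs`, for example
    `run_depth_1_Reb1-Cterm_R500.pbs` (naming inferred from the README examples).
    """
    pattern = re.compile(rf"^{re.escape(prefix)}_(\d+)_.+\.pbs$")
    indexed = []
    for path in Path(job_dir).glob(f"{prefix}_*.pbs"):
        match = pattern.match(path.name)
        if match:  # ignore files that lack a numeric job index
            indexed.append((int(match.group(1)), path.name))
    # Sort numerically so job 10 comes after job 2, not before it.
    return [f"qsub {name}" for _, name in sorted(indexed)]


if __name__ == "__main__":
    # Dry run: print the commands instead of submitting them.
    for prefix in ("run_depth", "run_EpitopeID"):
        for command in qsub_commands("job", prefix):
            print(command)
```

Printing the commands first makes it easy to check the list before piping it to a shell on the PBS server. All depth jobs should finish before the matching EpitopeID jobs are submitted, as the order above implies.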
diff --git a/paper/SyntheticEpitope/results/README.md b/paper/SyntheticEpitope/results/README.md
@@ -1 +1,39 @@
 # Simulation files and EpitopeID results go here
+
+## sacCer3
+This directory contains the raw EpitopeID reports and runtimes for each of the 1000 simulations across each type of yeast simulation.
+
+## hg19
+This directory contains the raw EpitopeID reports and runtimes for each of the 1000 simulations across each type of human simulation.
+
+## Summary reports
+The EpitopeID results are parsed into a summary report by `scripts/analyze_eid_results.py`, where:
+- **First Column:** the filepath of each simulation's EpitopeID report (the filepath encodes the experiment parameters)
+- **Second Column:** the number of reads that map to the expected epitope sequence
+- **Third Column:** the number of bins relating to the expected target region (>0 means successfully localized)
+
+
+The following files contain the results from variable sequencing depth and epitope tag length experiments in yeast and human:
+```
+SummaryReport_sacCer3.txt
+SummaryReport_hg19.txt
+```
+![sacCer3-id-tally](results/ID-tally_sacCer3.png)
+![hg19-id-tally](results/ID-tally_hg19.png)
+
+The following files contain the results from the read mixture contamination titration experiments in yeast and human:
+```
+MixSummaryReport_sacCer3.txt
+MixSummaryReport_hg19.txt
+```
+![mix-sacCer3-id-tally](results/ID-Mix-tally_sacCer3_1M.png)
+![mix-hg19-id-tally](results/ID-Mix-tally_hg19_50M.png)
+
+## Runtime Summary reports
+The following files contain the runtime benchmarking results for the read mixture contamination titration experiments in yeast and human:
+```
+RuntimeSummaryReport_sacCer3.txt
+RuntimeSummaryReport_hg19.txt
+```
+![sacCer3-runtimes](results/Runtimes_sacCer3.png)
+![hg19-runtimes](results/Runtimes_hg19.png)
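The three-column summary reports described above can be reduced to a localization rate with a few lines of Python. This is a sketch only: it assumes the columns are tab-separated with no header row, which the README does not state, and `localization_rate` is a hypothetical helper rather than something taken from `scripts/analyze_eid_results.py`.

```python
import csv
import io


def localization_rate(lines):
    """Fraction of simulations whose third column (bin count) is > 0.

    `lines` is any iterable of text lines, such as an open file handle.
    Assumed layout (tab-separated): report path, read count, bin count.
    """
    total = 0
    localized = 0
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) < 3:
            continue  # skip blank or malformed lines
        total += 1
        if int(row[2]) > 0:  # >0 bins means the insert was localized
            localized += 1
    return localized / total if total else 0.0


if __name__ == "__main__":
    # Made-up rows for illustration; real paths and counts will differ.
    example = io.StringIO(
        "sim_1/report.tab\t12\t1\n"
        "sim_2/report.tab\t0\t0\n"
    )
    print(localization_rate(example))
```

To apply it to a real report, pass an open file instead, for example `with open("SummaryReport_sacCer3.txt") as fh: localization_rate(fh)`. Adjust the delimiter if the files turn out to use a different separator.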