Commit 2be5b68

Update README.md
1 parent 199b801 commit 2be5b68

1 file changed

Lines changed: 37 additions & 40 deletions

README.md

@@ -1,12 +1,10 @@
11
## Bichrom: A bimodal neural network to predict TF binding using sequence and pre-existing chromatin track data
2-
Transcription factor (TF) binding specificity is determined via a complex interplay between the TF’s DNA binding preference and cell type-specific chromatin environments. The chromatin features that correlate with TF binding in a given cell type have been well characterized. For instance, the binding sites for a majority of TFs display concurrent chromatin accessibility. However, concurrent chromatin features reflect the binding activities of the TF itself, and thus provide limited insight into how genome-wide TF-DNA binding patterns became established in the first place. To understand the determinants of TF binding specificity, we therefore need to examine how newly activated TFs interact with sequence and preexisting chromatin landscapes.
2+
Bichrom provides a framework for modeling, interpreting, and visualizing the joint sequence and chromatin landscapes that determine TF-DNA binding dynamics.
33

4-
Here, we investigate the sequence and preexisting chromatin predictors of TF-DNA binding by examining the genome-wide occupancy of TFs that have been induced in well-characterized chromatin environments. We develop Bichrom, a bimodal neural network that jointly models sequence and preexisting chromatin data to interpret the genome-wide binding patterns of induced TFs. We find that the preexisting chromatin landscape is a differential global predictor of TF-DNA binding; incorporating preexisting chromatin features improves our ability to explain the binding specificity of some TFs substantially, but not others. Furthermore, by analyzing site-level predictors, we show that TF binding in previously inaccessible chromatin tends to correspond to the presence of more favorable cognate DNA sequences. Bichrom thus provides a framework for modeling, interpreting, and visualizing the joint sequence and chromatin landscapes that determine TF-DNA binding dynamics.
5-
6-
## Citation
4+
### Citation
75
Srivastava, D., Aydin, B., Mazzoni, E.O. and Mahony, S., 2020. An interpretable bimodal neural network characterizes the sequence and preexisting chromatin predictors of induced TF binding. bioRxiv, p.672790.
86

9-
## Installation
7+
### Installation
108
**Requirements**:
119

1210
python >= 3.5
@@ -15,12 +13,12 @@ We suggest using anaconda to create a virtual environment using the provided YAM
1513
Alternatively, to install requirements using pip:
1614
`pip install -r requirements.txt`
1715

18-
## Usage
16+
### Usage
1917
```
2018
# Clone and navigate to the iTF repository.
2119
cd trainNN
2220
To view help:
23-
python train.py --help
21+
python run_bichrom.py --help
2422
usage: run_bichrom.py [-h] training_schema_yaml window_size bin_size outdir
2523
2624
Train and compare BichromSEQ and Bichrom
@@ -36,51 +34,50 @@ optional arguments:
3634
3735
```
3836

39-
## Input Files: Description
40-
iTF trains and evaluates two models:
37+
### Input Files: Description
38+
run_bichrom.py trains and evaluates two models:
4139
* A sequence based classifier for TF binding prediction (Bichrom<sub>SEQ</sub>)
4240
* A sequence + pre-existing chromatin based classifier for TF binding prediction (Bichrom)
4341

44-
**Inputs:**
45-
46-
Required arguments:
47-
* training_schema_yaml: This is a YAML file containing containing paths to the training data (sequence, preexisting chromatin and labels), validation data and test data. A sample YAML file can be found in trainNN/sample.yaml. The structure of the training_schema_yaml file should be as follows:
42+
**Required arguments**:
4843

49-
<pre>
50-
train:
51-
seq: '/path/to/train/seq.txt'
52-
labels: '/path/to/train/labels.txt'
53-
chromatin_tracks: ['/path/to/train/atacseq.txt', ..., '/path/to/train/h3k27ac.txt']
54-
val:
55-
seq: '/path/to/val/seq.txt'
56-
labels: '/path/to/val/labels.txt'
57-
chromatin_tracks: ['/path/to/val/atacseq.txt', ..., '/path/to/val/h3k27ac.txt']
58-
test:
59-
seq: '/path/to/test/seq.txt'
60-
labels: '/path/to/test/labels.txt'
61-
chromatin_tracks: ['/path/to/val/atacseq.txt', ..., '/path/to/test/h3k27ac.txt']
62-
</pre>
44+
* **training_schema_yaml**:
45+
This is a YAML file containing paths to the training data (sequence, preexisting chromatin, and labels), validation data, and test data. A sample YAML file can be found in trainNN/sample.yaml. The training_schema_yaml file should be structured as follows:
6346

64-
* window_size: The size of genomic windows used for training, validation and testing. (For example: 500)
65-
* bin_size: Binning applied to the chromatin data. (For example, if window_size=500 and bin_size=10, each line in a chromatin_track file will contain 500/10=50 tab separated values)
66-
* outdir: This is the output directory, where all Bichrom output files will be placed.
47+
<pre>
48+
train:
49+
seq: '/path/to/train/seq.txt'
50+
labels: '/path/to/train/labels.txt'
51+
chromatin_tracks: ['/path/to/train/atacseq.txt', ..., '/path/to/train/h3k27ac.txt']
6752

68-
**Input file formats:**
53+
val:
54+
seq: '/path/to/val/seq.txt'
55+
labels: '/path/to/val/labels.txt'
56+
chromatin_tracks: ['/path/to/val/atacseq.txt', ..., '/path/to/val/h3k27ac.txt']
6957

70-
The training, validation and test files are provided to Bichrom using the argument **training_schema_yaml**. Each data set: train, test and validation, corresponds to 500 base pair windows on the genome. The "seq", "labels" and "chromatin_tracks" files for the train, test and validation sets contain features associated with these 500 base pair windows.
58+
test:
59+
seq: '/path/to/test/seq.txt'
60+
labels: '/path/to/test/labels.txt'
61+
chromatin_tracks: ['/path/to/test/atacseq.txt', ..., '/path/to/test/h3k27ac.txt']
62+
</pre>
7163

72-
* **seq**: The seq file contains one sequence per line. For example, if your training set has 25,100 genomic windows, the seq file will contain 25,100 lines.
64+
**Description for the input files provided in the YAML configuration**:
65+
Each data set (train, validation, and test) corresponds to 500 base pair windows on the genome. The "seq", "labels" and "chromatin_tracks" files for each set contain features associated with these 500 base pair windows.
7366

74-
* **labels**: The labels file contains a binary label that has been assigned each training window. (1 or 0)
75-
76-
* **chromatin_tracks**: Multiple chromatin files can be passed to to the program through the YAML file. (The YAML field chromatin_tracks accepts a list of file locations.) Each line in a chromatin track file contains tab separated binned chromatin data. The data can be binned at any resolution. For example, if the genomic windows used for train, test and validation are 500 base pair long, then:
77-
* If bins=50 base pairs, then each line in the chromatin file will contain 10 (500/50) tab separated values.
78-
* If bins=1 base pair, then each line in the chromatin file will contain 500 values. Note that all chromatin feature files that are passed to this argument must be binned at the same resolution.
67+
- **seq**: The seq file contains one sequence per line. For example, if your training set has 25,100 genomic windows, the seq file will contain 25,100 lines.
7968

69+
- **labels**: The labels file contains the binary label (1 or 0) assigned to each training window, one label per line.
70+
71+
- **chromatin_tracks**: Multiple chromatin files can be passed to the program through the YAML file. (The YAML field chromatin_tracks accepts a list of file locations.) Each line in a chromatin track file contains tab separated binned chromatin data. The data can be binned at any resolution. For example, if the genomic windows used for train, test and validation are 500 base pairs long, then:
72+
* If bins=50 base pairs, then each line in the chromatin file will contain 10 (500/50) tab separated values.
73+
* If bins=1 base pair, then each line in the chromatin file will contain 500 values. Note that all chromatin feature files that are passed to this argument must be binned at the same resolution.
8074

81-
**Outputs:**
75+
Other required arguments:
8276

83-
iTF outputs the validation and test metrics (auROC and auPRC) for both a sequence-only network (Bichrom<sub>SEQ</sub>) and a sequence + preexisting chromatin bimodal network (Bichrom). It additionally plots the test Precision Recall curves for both models; as well as test recall at a false positive rate=0.01.
77+
* **window_size**: The size of genomic windows used for training, validation and testing. (For example: 500)
78+
* **bin_size**: Binning applied to the chromatin data. (For example, if window_size=500 and bin_size=10, each line in a chromatin_track file must contain 500/10 tab separated values)
79+
* **outdir**: Output directory where all Bichrom output files will be stored.
80+
* Bichrom outputs the validation and test metrics (auROC and auPRC) for both a sequence-only network (Bichrom<sub>SEQ</sub>) and a sequence + preexisting chromatin bimodal network (Bichrom). It additionally plots the test precision-recall curves for both models, as well as test recall at a false positive rate of 0.01.
8481

8582

8683
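
The input rules described in the updated README (one sequence per line, one binary label per window, and window_size/bin_size tab separated values per chromatin track line) can be checked before training. The following is a minimal sketch, not part of Bichrom: the helper names `expected_bins`, `read_lines` and `check_dataset` are hypothetical, and it assumes that window_size is an exact multiple of bin_size and that labels are written literally as `0` or `1`.

```python
from pathlib import Path


def expected_bins(window_size, bin_size):
    """Number of values each chromatin track line should hold."""
    # Assumption: window_size is an exact multiple of bin_size.
    if bin_size <= 0 or window_size % bin_size != 0:
        raise ValueError("window_size must be a positive multiple of bin_size")
    return window_size // bin_size


def read_lines(path):
    """Return the non-empty lines of a text file, without newlines."""
    return [ln for ln in Path(path).read_text().splitlines() if ln.strip()]


def check_dataset(seq, labels, chromatin_tracks, window_size, bin_size):
    """Return a list of problems found in one data set (empty if none)."""
    n_bins = expected_bins(window_size, bin_size)
    problems = []

    # One sequence per genomic window, and one label per sequence.
    n_windows = len(read_lines(seq))
    label_lines = read_lines(labels)
    if len(label_lines) != n_windows:
        problems.append(
            f"{labels}: {len(label_lines)} labels for {n_windows} sequences"
        )

    # Assumption: labels are the literal strings "0" or "1".
    bad = [i for i, lab in enumerate(label_lines, 1)
           if lab.strip() not in {"0", "1"}]
    if bad:
        problems.append(f"{labels}: non-binary label on line {bad[0]}")

    # Every track needs one row per window, binned at the same resolution.
    for track in chromatin_tracks:
        rows = read_lines(track)
        if len(rows) != n_windows:
            problems.append(
                f"{track}: {len(rows)} rows for {n_windows} sequences"
            )
        for i, row in enumerate(rows, 1):
            n_values = len(row.strip().split("\t"))
            if n_values != n_bins:
                problems.append(
                    f"{track}: line {i} has {n_values} values, expected {n_bins}"
                )
                break  # report only the first bad line per track
    return problems
```

With the README's example of window_size=500 and bin_size=10, `expected_bins(500, 10)` gives the 50 values per line that each chromatin track file should contain. `check_dataset` would be called once per data set (train, val, test) with the paths listed in the YAML file.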
