README.md: 5 additions & 9 deletions
@@ -35,9 +35,6 @@ optional arguments:
 ```

 ### Input Files: Description
-run_bichrom.py trains and evaluates two models:
-* A sequence based classifier for TF binding prediction (Bichrom<sub>SEQ</sub>)
-* A sequence + pre-existing chromatin based classifier for TF binding prediction (Bichrom)

 **Required arguments**:

@@ -62,22 +59,21 @@ This is a YAML file containing containing paths to the training data (sequence,
 </pre>

 **Description for the input files provided in the YAML configuration**:
-Each data set: train, test and validation, corresponds to 500 base pair windows on the genome. The "seq", "labels" and "chromatin_tracks" files for the train, test and validation sets contain features associated with these 500 base pair windows.
+Each input data point (train, test or validation) corresponds to a 500 base pair window on the genome. The "seq", "labels" and "chromatin_tracks" files contain genomic features associated with these input 500 base pair windows.

-- **seq**: The seq file contains one sequence per line. For example, if your training set has 25,100 genomic windows, the seq file will contain 25,100 lines.
+- **seq**: The sequence input file contains one sequence per line. For example, if your training set has 25,100 genomic windows, the seq file will contain 25,100 lines. (Permitted bases: A, T, G, C, N).

-- **labels**: The labels file contains a binary label that has been assigned each training window. (1 or 0)
+- **labels**: This file contains a binary label that has been assigned to each training, validation and test input data point. (Must be 0/1).

-- **chromatin_tracks**: Multiple chromatin files can be passed to to the program through the YAML file. (The YAML field chromatin_tracks accepts a list of file locations.) Each line in a chromatin track file contains tab separated binned chromatin data. The data can be binned at any resolution. For example, if the genomic windows used for train, test and validation are 500 base pair long, then:
+- **chromatin_tracks**: Multiple chromatin files can be passed to Bichrom through the YAML file. (The YAML field chromatin_tracks accepts a list of file locations.) Each line in a chromatin track file contains tab separated binned chromatin data. The data can be binned at any resolution. For example, if the genomic windows used to train Bichrom are 500 base pairs, then:
 * If bins=50 base pairs, then each line in the chromatin file will contain 10 (500/50) tab separated values.
-* If bins=1 base pair, then each line in the chromatin file will contain 500 values. Note that all chromatin feature files that are passed to this argument must be binned at the same resolution.
+* If bins=1 base pair, then each line in the chromatin file will contain 500 values. Note that all chromatin feature files that are passed to Bichrom must be binned at the same resolution.

 Other required arguments:

 * **window_size**: The size of genomic windows used for training, validation and testing. (For example: 500)
 * **bin_size**: Binning applied to the chromatin data. (For example, if window_size=500 and bin_size=10, each line in a chromatin_track file must contain 500/10 tab separated values)
 * **outdir**: Output directory where all Bichrom output files will be stored.
 * Bichrom outputs the validation and test metrics (auROC and auPRC) for both a sequence-only network (Bichrom<sub>SEQ</sub>) and a sequence + preexisting chromatin bimodal network (Bichrom). It additionally plots the test Precision Recall curves for both models; as well as test recall at a false positive rate=0.01.
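
The input-file rules described in this README change can be checked before training with a small script. The following is only a minimal sketch and is not part of Bichrom: every function name here (`check_seq_line`, `check_label_line`, `check_chromatin_line`, `validate_dataset`) is hypothetical. It also assumes things the README only implies: that each sequence is exactly `window_size` bases long, that bases are upper case, that `window_size` is a multiple of `bin_size`, and that every file has one line per genomic window.

```python
from typing import List, Sequence

# Permitted bases as listed in the README (upper case only is an assumption).
ALLOWED_BASES = set("ATGCN")


def check_seq_line(line: str, window_size: int) -> None:
    """Check one line of a seq file: expected length and permitted bases."""
    seq = line.strip()
    if len(seq) != window_size:
        raise ValueError(f"expected {window_size} bases, found {len(seq)}")
    unexpected = set(seq) - ALLOWED_BASES
    if unexpected:
        raise ValueError(f"unexpected bases: {sorted(unexpected)}")


def check_label_line(line: str) -> None:
    """Check one line of a labels file: the label must be 0 or 1."""
    if line.strip() not in {"0", "1"}:
        raise ValueError(f"label must be 0 or 1, found {line.strip()!r}")


def check_chromatin_line(line: str, window_size: int, bin_size: int) -> List[float]:
    """Check one line of a chromatin track: window_size / bin_size tab separated values."""
    if window_size % bin_size != 0:
        raise ValueError("window_size is assumed to be a multiple of bin_size")
    expected = window_size // bin_size
    fields = line.rstrip("\n").split("\t")
    if len(fields) != expected:
        raise ValueError(f"expected {expected} values, found {len(fields)}")
    return [float(field) for field in fields]


def validate_dataset(
    seq_lines: Sequence[str],
    label_lines: Sequence[str],
    chromatin_tracks: Sequence[Sequence[str]],
    window_size: int,
    bin_size: int,
) -> int:
    """Validate one data set (train, test or validation); return the window count."""
    n_windows = len(seq_lines)
    if len(label_lines) != n_windows:
        raise ValueError("labels and seq files differ in line count")
    for track in chromatin_tracks:
        if len(track) != n_windows:
            raise ValueError("chromatin track and seq file differ in line count")
    for line in seq_lines:
        check_seq_line(line, window_size)
    for line in label_lines:
        check_label_line(line)
    for track in chromatin_tracks:
        for line in track:
            check_chromatin_line(line, window_size, bin_size)
    return n_windows
```

To use it, read each file with `open(path).read().splitlines()` and pass the resulting lists together with the `window_size` and `bin_size` values from the configuration. With `window_size=500` and `bin_size=50`, for example, each chromatin line must hold 10 values.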