
Commit 3ff0e8a

Author: akmorrow13
Message: Merge branch 'master' of github.com:YosefLab/epitome into test
Parents: 5f8311a + 58931c8

13 files changed: 359 additions & 92 deletions

.github/workflows/main.yml

Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
+# This workflow will install Python dependencies, run tests and lint with a variety of Python versions
+# For more information see: https://help.github.com/actions/language-and-framework-guides/using-python-with-github-actions
+
+name: epitome
+
+on:
+  push:
+    branches: [master]
+  pull_request:
+    branches: [master]
+
+jobs:
+  build:
+
+    runs-on: ubuntu-latest
+    strategy:
+      matrix:
+        python-version: [3.6, 3.7, 3.8]
+
+    steps:
+    - uses: actions/checkout@v2
+    - name: Set up Python ${{ matrix.python-version }}
+      uses: actions/setup-python@v2
+      with:
+        python-version: ${{ matrix.python-version }}
+    - name: Cache pip
+      uses: actions/cache@v2
+      with:
+        path: ~/.cache/pip
+        key: ${{ runner.os }}-pip-${{ hashFiles('**/requirements.txt') }}
+        restore-keys: |
+          ${{ runner.os }}-pip-
+    - name: Install dependencies
+      run: |
+        pip install pytest-cov
+        make develop
+    - name: Test with pytest
+      run: |
+        make test

README.md

Lines changed: 10 additions & 2 deletions
@@ -1,3 +1,9 @@
+[![pypi](https://img.shields.io/pypi/v/epitome.svg)](https://pypi.org/project/epitome/)
+[![docs](https://readthedocs.org/projects/epitome/badge/?version=latest)](https://epitome.readthedocs.io/en/latest/)
+![Build status](https://github.com/YosefLab/epitome/workflows/epitome/badge.svg)
+[![Codacy Badge](https://api.codacy.com/project/badge/Grade/6c2cef0a2eae45399c9caed2d8c81965)](https://app.codacy.com/gh/YosefLab/epitome?utm_source=github.com&utm_medium=referral&utm_content=YosefLab/epitome&utm_campaign=Badge_Grade)
+
+
 # Epitome

 Pipeline for predicting ChIP-seq peaks in novel cell types using chromatin accessibility.
@@ -6,6 +12,10 @@ Pipeline for predicting ChIP-seq peaks in novel cell types using chromatin accessibility.

 Epitome leverages chromatin accessibility (either DNase-seq or ATAC-seq) to predict epigenetic events in a novel cell type of interest. Such epigenetic events include transcription factor binding sites and histone modifications. Epitome computes chromatin accessibility similarity between ENCODE cell types and the novel cell type, and uses this information to transfer known epigenetic signal to the novel cell type of interest.

+# Documentation
+
+Epitome documentation is hosted at [readthedocs](https://epitome.readthedocs.io/en/latest/). Documentation for Epitome includes tutorials for creating Epitome datasets, and for training, testing, and evaluating models.
+

 ## Requirements
 * [conda](https://docs.conda.io/en/latest/miniconda.html)
@@ -25,8 +35,6 @@ pip install epitome

 ## Training a Model

-TODO: link to documentation
-
 First, create an Epitome dataset that defines the cell types and ChIP-seq
 targets you want to train on,
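The README above describes computing chromatin accessibility similarity between ENCODE cell types and a novel cell type. As a toy illustration only — this is not Epitome's actual metric, and `accessibility_similarity` is a hypothetical helper — one simple way to score similarity between two binarized accessibility tracks is the Jaccard index:

```python
def accessibility_similarity(a, b):
    """Jaccard similarity between two binary accessibility vectors:
    |intersection| / |union| of accessible positions."""
    inter = sum(1 for x, y in zip(a, b) if x and y)
    union = sum(1 for x, y in zip(a, b) if x or y)
    return inter / union if union else 0.0

# A reference cell type and a novel cell type, binarized over 6 regions
reference = [1, 1, 0, 0, 1, 0]
novel     = [1, 0, 0, 0, 1, 1]
print(accessibility_similarity(reference, novel))  # 2 shared / 4 accessible
```

In Epitome itself the comparison is made over many genomic regions and feeds the model as a similarity feature; see the readthedocs tutorials for the supported similarity assays.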

docs/requirements.txt

Lines changed: 2 additions & 2 deletions
@@ -1,4 +1,4 @@
-sphinx==1.7.7
+sphinx==2.1.1
+sphinx_rtd_theme==0.4.3
 nbsphinx==0.3.4
-sphinx_rtd_theme==0.4.2
 mock

docs/usage/train.rst

Lines changed: 61 additions & 8 deletions
@@ -3,7 +3,7 @@ Training an Epitome Model

 Once you have `installed Epitome <../installation/source.html>`__, you are ready to train a model.

-Training a Model
+Create a Dataset
 ----------------

 First, import Epitome:
@@ -13,10 +13,7 @@ First, import Epitome:
     from epitome.dataset import *
     from epitome.models import *

-Create an Epitome Dataset
--------------------------
-
-First, create an Epitome Dataset. In the dataset, you will define the
+Next, create an Epitome Dataset. In the dataset, you will need to define the
 ChIP-seq targets you want to predict, the cell types you want to train from,
 and the assays you want to use to compute cell type similarity. For more information
 on creating an Epitome dataset, see `Configuring data <./dataset.html>`__.
@@ -28,24 +25,80 @@ on creating an Epitome dataset, see `Configuring data <./dataset.html>`__.

     dataset = EpitomeDataset(targets=targets, cells=celltypes)

+Train a Model
+----------------
 Now, you can create a model:

 .. code:: python

     model = EpitomeModel(dataset, test_celltypes = ["K562"]) # cell line reserved for testing

-Next, train the model. Here, we train the model for 5000 iterations:
+Next, train the model. Here, we train the model for 5000 batches:

 .. code:: python

     model.train(5000)

-You can then evaluate model performance on held out test cell lines specified in the model declaration. In this case, we will evaluate on K562 on the first 10,000 points.
+Train a Model that Stops Early
+-------------------------------
+If you are not sure how many batches your model should train for, or are concerned
+about your model overfitting, you can specify the max_valid_batches parameter when
+initializing the model. This creates a train-validation dataset of size
+max_valid_batches, forcing the model to validate on the train-validation dataset
+and compute the train-validation loss every 200 training batches. The model may
+stop training early (before max_train_batches) if its train-validation
+losses stop improving during training. Otherwise, the model will continue to train
+until max_train_batches.
+
+First, create a model with a train-validation set size of 1000:
+
+.. code:: python
+
+    model = EpitomeModel(dataset,
+            test_celltypes = ["K562"], # cell line reserved for testing
+            max_valid_batches = 1000) # train-validation set size reserved while training
+
+Next, we train the model for a maximum of 5000 batches. If the train-validation
+loss stops improving, the model will stop training early:
+
+.. code:: python
+
+    best_model_batches, total_trained_batches, train_valid_losses = model.train(5000)
+
+If you are concerned that the model above is overtraining because it continues
+to improve by minuscule amounts, you can specify min_delta, the minimum
+change in the train-validation loss required to qualify as an improvement. In the
+model below, an improvement of at least 0.1 is required for the model to
+qualify as improving.
+
+If you are concerned that the model above is under-fitting (stopping training too
+early because the train-validation loss might worsen slightly before reaching its
+highest accuracy), you can specify the patience. In the model below, a
+patience of 3 allows the model to train for up to 3 train-validation iterations
+(200 batches each) with no improvement before stopping training.
+
+You can read the in-depth explanation of these hyper-parameters in
+`this section <https://www.overleaf.com/project/5cd315cb8028bd409596bdff>`__ of the
+paper. Detailed documentation of the train() function can also
+be found in the `Github repo <https://github.com/YosefLab/epitome>`__.
+
+.. code:: python
+
+    best_model_batches, total_trained_batches, train_valid_losses = model.train(5000,
+            patience = 3,
+            min_delta = 0.1)
+
+Test the Model
+----------------
+Finally, you can evaluate model performance on held-out test cell lines specified
+in the model declaration. In this case, we will evaluate on K562 on the first 10,000 points.

 .. code:: python

     results = model.test(10000,
                          mode = Dataset.TEST,
                          calculate_metrics=True)

-The output of `results` will contain the predictions and truth values, a dictionary of assay specific performance metrics, and the average auROC and auPRC across all evaluated assays.
+The output of `results` will contain the predictions and truth values, a dictionary
+of assay-specific performance metrics, and the average auROC and auPRC across all
+evaluated assays.
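The early-stopping semantics documented in train.rst above (an improvement must exceed min_delta; training stops after `patience` validation rounds without improvement, with a validation round every 200 batches) can be sketched in plain Python. This is an illustrative re-implementation of the described behavior, not Epitome's actual train() internals:

```python
def early_stop_batch(valid_losses, patience=3, min_delta=0.0, batches_per_round=200):
    """Given train-validation losses recorded every `batches_per_round` batches,
    return the number of batches after which training would stop."""
    best = float('inf')
    rounds_without_improvement = 0
    for i, loss in enumerate(valid_losses, start=1):
        if best - loss > min_delta:           # improved by more than min_delta
            best = loss
            rounds_without_improvement = 0
        else:
            rounds_without_improvement += 1
            if rounds_without_improvement >= patience:
                return i * batches_per_round  # stop early
    return len(valid_losses) * batches_per_round  # trained to the end

# Losses stop improving by more than 0.1 after round 2, so with patience=3
# training halts after round 5 (batch 1000)
print(early_stop_batch([1.0, 0.8, 0.79, 0.78, 0.77], patience=3, min_delta=0.1))
```

Note how min_delta and patience interact: a larger min_delta makes small gains count as "no improvement", while a larger patience tolerates more consecutive non-improving rounds before stopping.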

epitome/__init__.py

Lines changed: 0 additions & 29 deletions
@@ -22,35 +22,6 @@

 __path__ = __import__('pkgutil').extend_path(__path__, __name__)

-S3_DATA_PATH = 'https://epitome-data.s3-us-west-1.amazonaws.com/hg19.zip'
-
-# os env that should be set by user to explicitly set the data path
-EPITOME_DATA_PATH_ENV="EPITOME_DATA_PATH"
-
-# data files required by epitome
-# data.h5 contains data, row information (celltypes and targets) and
-# column information (chr, start, binSize)
-EPITOME_H5_FILE = "data.h5"
-REQUIRED_FILES = [EPITOME_H5_FILE]
-# required keys in h5 file
-REQUIRED_KEYS = ['/',
-                 '/columns',
-                 '/columns/binSize',
-                 '/columns/chr',
-                 '/columns/index',
-                 '/columns/index/TEST',
-                 '/columns/index/TRAIN',
-                 '/columns/index/VALID',
-                 '/columns/index/test_chrs',
-                 '/columns/index/valid_chrs',
-                 '/columns/start',
-                 '/data',
-                 '/meta',
-                 '/meta/assembly',
-                 '/meta/source',
-                 '/rows',
-                 '/rows/celltypes',
-                 '/rows/targets']

 def GET_EPITOME_USER_PATH():
     return os.path.join(os.path.expanduser('~'), '.epitome')

epitome/constants.py

Lines changed: 5 additions & 0 deletions
@@ -60,3 +60,8 @@ class Dataset(Enum):
     r"""
     All mode: Specifies that data should not be divided by train, valid, and test.
     """
+
+    TRAIN_VALID = 6 # For early stopping criteria; pulls the train/valid chromosome.
+    r"""
+    TRAIN_VALID mode: Specifies that only a validation chr from train should be used.
+    """

epitome/dataset.py

Lines changed: 61 additions & 2 deletions
@@ -22,10 +22,42 @@

 # local imports
 from epitome import *
-from .constants import *
+from .constants import Dataset
 from .functions import download_and_unzip
 from .viz import plot_assay_heatmap

+################### File accession constants #######################
+S3_DATA_PATH = 'https://epitome-data.s3-us-west-1.amazonaws.com/hg19.zip'
+
+# os env that should be set by user to explicitly set the data path
+EPITOME_DATA_PATH_ENV="EPITOME_DATA_PATH"
+
+# data files required by epitome
+# data.h5 contains data, row information (celltypes and targets) and
+# column information (chr, start, binSize)
+EPITOME_H5_FILE = "data.h5"
+REQUIRED_FILES = [EPITOME_H5_FILE]
+# required keys in h5 file
+REQUIRED_KEYS = ['/',
+                 '/columns',
+                 '/columns/binSize',
+                 '/columns/chr',
+                 '/columns/index',
+                 '/columns/index/TEST',
+                 '/columns/index/TRAIN',
+                 '/columns/index/VALID',
+                 '/columns/index/test_chrs',
+                 '/columns/index/valid_chrs',
+                 '/columns/start',
+                 '/data',
+                 '/meta',
+                 '/meta/assembly',
+                 '/meta/source',
+                 '/rows',
+                 '/rows/celltypes',
+                 '/rows/targets']
+
+
 class EpitomeDataset:
     '''
     Dataset for holding Epitome data.
@@ -118,6 +150,7 @@ def __init__(self,
         self.indices[Dataset.TRAIN] = dataset['columns']['index'][Dataset.TRAIN.name][:]
         self.indices[Dataset.VALID] = dataset['columns']['index'][Dataset.VALID.name][:]
         self.indices[Dataset.TEST] = dataset['columns']['index'][Dataset.TEST.name][:]
+        self.indices[Dataset.TRAIN_VALID] = [] # placeholder for if early stop is used
         self.valid_chrs = [i.decode() for i in dataset['columns']['index']['valid_chrs'][:]]
         self.test_chrs = [i.decode() for i in dataset['columns']['index']['test_chrs'][:]]

@@ -127,8 +160,34 @@ def __init__(self,

         dataset.close()

+    def set_train_validation_indices(self, chrom):
+        '''
+        Removes and reserves a given chromosome from the TRAIN dataset into
+        its own TRAIN_VALID dataset.
+
+        :param str chrom: string representation of chromosome in 'chr{int}' format (Ex: 'chr22').
+        '''
+        assert chrom in self.regions.chromosomes, "%s must be part of the genome assembly. Not found in regions." % chrom
+        assert chrom not in self.valid_chrs and chrom not in self.test_chrs, "%s cannot be a valid or test chromosome." % chrom
+
+        # load in original training indices
+        dataset = h5py.File(self.h5_path, 'r')
+        train_indices = dataset['columns']['index'][Dataset.TRAIN.name][:]
+        dataset.close()
+
+        chr_indices = self.regions[self.regions.Chromosome == chrom].idx
+
+        # make sure this chromosome is in the train set
+        assert len(np.setdiff1d(chr_indices, train_indices)) == 0, "chr_indices must be a subset of train_indices"
+
+        # remove the reserved chromosome's indices from TRAIN
+        self.indices[Dataset.TRAIN] = np.setdiff1d(train_indices, chr_indices)
+        self.indices[Dataset.TRAIN_VALID] = chr_indices
+
+
     def get_parameter_dict(self):
-        ''' Returns dict of all parameters required to reconstruct this dataset
+        '''
+        Returns dict of all parameters required to reconstruct this dataset

         :return: dict containing all parameters to reconstruct dataset.
         :rtype: dict
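The new set_train_validation_indices method above reserves one chromosome's rows out of the training indices with np.setdiff1d. The same bookkeeping can be sketched in pure Python — illustrative only; the real method reads indices from the HDF5 file and works on NumPy arrays, and `reserve_validation_chrom` is a hypothetical helper:

```python
def reserve_validation_chrom(train_indices, chrom_indices):
    """Split train_indices into (new_train, train_valid), pulling out the
    rows that belong to the reserved chromosome."""
    chrom_set = set(chrom_indices)
    # the reserved chromosome must already be part of the training set
    assert chrom_set <= set(train_indices), "chrom_indices must be a subset of train_indices"
    new_train = [i for i in train_indices if i not in chrom_set]  # like np.setdiff1d
    return new_train, sorted(chrom_set)

# rows 20 and 21 (the reserved chromosome) move to the train-validation set
train, train_valid = reserve_validation_chrom([10, 11, 12, 20, 21], [20, 21])
print(train, train_valid)
```

The subset assertion mirrors the method's own check that the chosen chromosome is not already held out for validation or testing.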

epitome/generators.py

Lines changed: 0 additions & 2 deletions
@@ -122,7 +122,6 @@ def load_data(data,

     # sites where TF is bound in at least 2 cell lines
     positive_indices = np.where(np.sum(data[TF_indices,:], axis=0) > 1)[0]
-
     indices_probs = np.ones([data.shape[1]])
     indices_probs[positive_indices] = 0
     indices_probs = indices_probs/np.sum(indices_probs, keepdims=1)
@@ -143,7 +142,6 @@ def load_data(data,
     else:
         indices = range(0, data.shape[-1]) # not training mode, set to all points

-
     if (mode == Dataset.RUNTIME):
         label_cell_types = ["PLACEHOLDER_CELL"]
         if similarity_matrix is None:
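The generators.py context above builds a sampling distribution that skips sites already bound in two or more cell lines: positive sites get weight 0 and the remaining sites are sampled uniformly. A pure-Python sketch of that logic (the real code operates with NumPy on a cell types × genomic positions matrix):

```python
def negative_sampling_probs(bound_counts):
    """Uniform sampling probabilities over sites bound in fewer than
    2 cell lines; sites bound in >= 2 cell lines get probability 0."""
    weights = [0.0 if count > 1 else 1.0 for count in bound_counts]
    total = sum(weights)
    return [w / total for w in weights]

# Sites 1 and 3 are bound in >= 2 cell lines, so they are never sampled
print(negative_sampling_probs([0, 2, 1, 3]))
```

Zeroing the positives before normalizing is what keeps the sampled background disjoint from the positive set.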
