Commit bacb6f0

Update idpet/ from main
1 parent ad97eb0 commit bacb6f0

5 files changed

Lines changed: 159 additions & 140 deletions

idpet/comparison.py

Lines changed: 65 additions & 72 deletions
@@ -442,9 +442,9 @@ def get_adaJSD_matrix(
         Two Ensemble objects storing the ensemble data to compare.
     return_bins : bool, optional
         If True, also return the histogram bin edges used in the comparison.
-    **remaining**
-        Additional arguments passed to `idpet.comparison.score_adaJSD`.
-
+    **remaining
+        Additional arguments passed to `dpet.comparison.score_adaJSD`.
+
     Output
     ------
     score : float
@@ -824,95 +824,88 @@ def all_vs_all_comparison(
         verbose: bool = False
     ) -> dict:
     """
-    Compare all pairs of ensembles using divergence scores.
-
-    Implemented scores are approximate average Jensen–Shannon divergences
+    Compare all pair of ensembles using divergence scores.
+    Implemented scores are approximate average Jensen–Shannon divergence
     (JSD) over several kinds of molecular features. The lower these scores
-    are, the higher the similarity between the probability distributions of
+    are, the higher the similarity between the probability distribution of
     the features of the ensembles. JSD scores here range from a minimum of 0
-    to a maximum of log(2) 0.6931.
+    to a maximum of log(2) ~= 0.6931.

     Parameters
     ----------
-    ensembles : List[Ensemble]
-        Ensemble objects to analyze.
-    score : str
+    ensembles: List[Ensemble]
+        Ensemble objectes to analyze.
+    score: str
         Type of score used to compare ensembles. Choices: `adaJSD` (carbon
-        Alpha Distance Average JSD), `ramaJSD` (RAMAchandran Average JSD), and
+        Alfa Distance Average JSD), `ramaJSD` (RAMAchandran average JSD) and
         `ataJSD` (Alpha Torsion Average JSD). `adaJSD` scores the average
-        JSD over all Cα–Cα distance distributions of residue pairs with
+        JSD over all Ca-Ca distance distributions of residue pairs with
         sequence separation > 1. `ramaJSD` scores the average JSD over the
-        φ–ψ angle distributions of all residues. `ataJSD` scores the average
-        JSD over all alpha torsion angles, which are the angles formed by four
-        consecutive atoms in a protein.
-    featurization_params : dict, optional
+        phi-psi angle distributions of all residues. `ataJSD` scores the
+        average JSD over all alpha torsion angles, which are the angles
+        formed by four consecutive Ca atoms in a protein.
+    featurization_params: dict, optional
         Optional dictionary to customize the featurization process for the
         above features.
-    bootstrap_iters : int, optional
-        Number of bootstrap iterations. By default, its value is ``None``. In
-        this case, IDPET will directly compare each pair of ensembles
-        :math:`i` and :math:`j` by using all of their conformers and perform
-        the comparison only once. On the other hand, if an integer value is
-        provided for this argument, each pair of ensembles :math:`i` and
-        :math:`j` will be compared ``bootstrap_iters`` times by randomly
-        selecting (bootstrapping) conformations from them. Additionally, each
-        ensemble will be auto-compared with itself by subsampling conformers
-        via bootstrapping. Then, IDPET will perform a statistical test to
-        determine whether the inter-ensemble (:math:`i \\neq j`) scores are
-        significantly different from the intra-ensemble (:math:`i = j`)
-        scores.
-
-        The tests work as follows: for each ensemble pair :math:`i \\neq j`,
-        IDPET obtains their inter-ensemble comparison scores from
-        bootstrapping. Then, it retrieves the bootstrapping scores from
-        auto-comparisons of ensembles :math:`i` and :math:`j`, and the scores
-        with the higher mean are selected as reference intra-ensemble scores.
-        Finally, the inter-ensemble and intra-ensemble scores are compared via
-        a one-sided Mann–Whitney U test with the alternative hypothesis that
-        inter-ensemble scores are stochastically greater than intra-ensemble
-        scores. The p-values obtained from these tests will additionally be
-        returned.
-
-        For small protein structural ensembles (fewer than 500 conformations),
-        most comparison scores in IDPET are not robust estimators of
-        divergence or distance. Performing bootstrapping provides an estimate
-        of how ensemble size affects the comparison. Use values ≥ 50 when
-        comparing ensembles with very few conformations (less than 100). When
-        comparing large ensembles (more than 1,000–5,000 conformations), you
-        can safely avoid bootstrapping.
-    bootstrap_frac : float, optional
+    bootstrap_iters: int, optional
+        Number of bootstrap iterations. By default its value is None. In
+        this case, IDPET will directly compare each pair of ensemble $i$ and
+        $j$ by using all of their conformers and perform the comparison only
+        once. On the other hand, if providing an integer value to this
+        argument, each pair of ensembles $i$ and $j$ will be compared
+        `bootstrap_iters` times by randomly selecting (bootstrapping)
+        conformations from them. Additionally, each ensemble will be
+        auto-compared with itself by subsampling conformers via
+        bootstrapping. Then IDPET will perform a statistical test to
+        establish if the inter-ensemble ($i != j$) scores are significantly
+        different from the intra-ensemble ($i == j$) scores. The tests work
+        as follows: for each ensemble pair $i != j$ IDPET will get their
+        inter-ensemble comparison scores obtained in bootstrapping. Then, it
+        will get the bootstrapping scores from auto-comparisons of ensemble
+        $i$ and $j$ and the scores with the higher mean here are selected as
+        reference intra-ensemble scores. Finally, the inter-ensemble and
+        intra-ensemble scores are compared via a one-sided Mann-Whitney U
+        test with the alternative hypothesis being: inter-ensemble scores
+        are stochastically greater than intra-ensemble scores. The p-values
+        obtained in these tests will additionally be returned. For small
+        protein structural ensembles (less than 500 conformations) most
+        comparison scores in IDPET are not robust estimators of
+        divergence/distance. By performing bootstrapping, you can have an
+        idea of how the size of your ensembles impacts the comparison. Use
+        values >= 50 when comparing ensembles with very few conformations
+        (less than 100). When comparing large ensembles (more than
+        1,000-5,000 conformations) you can safely avoid bootstrapping.
+    bootstrap_frac: float, optional
         Fraction of the total conformations to sample when bootstrapping.
-        Default value is 1.0, which results in bootstrap samples with the same
-        number of conformations as the original ensemble.
-    bootstrap_replace : bool, optional
-        If ``True``, bootstrap will sample with replacement. Default is
-        ``True``.
-    bins : Union[int, str], optional
+        Default value is 1.0, which results in bootstrap samples with the
+        same number of conformations of the original ensemble.
+    bootstrap_replace: bool, optional
+        If `True`, bootstrap will sample with replacement. Default is `True`.
+    bins: Union[int, str], optional
         Number of bins or bin assignment rule for JSD comparisons. See the
-        documentation of ``dpet.comparison.get_num_comparison_bins`` for
+        documentation of `dpet.comparison.get_num_comparison_bins` for
         more information.
-    random_seed : int, optional
+    random_seed: int, optional
         Random seed used when performing bootstrapping.
-    verbose : bool, optional
-        If ``True``, prints additional information about the comparisons to
+    verbose: bool, optional
+        If `True`, some information about the comparisons will be printed to
         stdout.

     Returns
     -------
-    results : dict
-        A dictionary containing the following keyvalue pairs:
-
-        - ``scores``: a (M, M, B) NumPy array storing the comparison
-          scores, where M is the number of ensembles being compared and
-          B is the number of bootstrap iterations (B = 1 if bootstrapping
-          was not performed).
-        - ``p_values``: a (M, M) NumPy array storing the p-values
-          obtained from the statistical tests performed when using
-          a bootstrapping strategy (see the ``bootstrap_iters`` parameter).
-          Returned only when performing a bootstrapping strategy.
+    results: dict
+        A dictionary containing the following key-value pairs:
+            `scores`: a (M, M, B) NumPy array storing the comparison
+                scores, where M is the number of ensembles being
+                compared and B is the number of bootstrap iterations (B
+                will be 1 if bootstrapping was not performed).
+            `p_values`: a (M, M) NumPy array storing the p-values
+                obtained in the statistical test performed when using
+                a bootstrapping strategy (see the `bootstrap_iters`)
+                method. Returned only when performing a bootstrapping
+                strategy.
     """

-
     score_type, feature = scores_data[score]

     ### Check arguments.
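
Note on the JSD range stated in the docstring: the bounds of 0 and log(2) can be checked with a few lines of standard-library Python. This is a minimal sketch, not IDPET's implementation: the helper name `jsd` is hypothetical, the inputs are hand-written discrete distributions rather than histograms of molecular features, and the natural logarithm is assumed (which is what gives the log(2) upper bound).

```python
import math


def jsd(p, q):
    """Jensen-Shannon divergence (natural log) of two discrete distributions.

    Hypothetical helper for illustration only; not part of idpet.
    """
    # Mixture distribution M = (P + Q) / 2.
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Kullback-Leibler divergence; terms with x_k = 0 contribute 0.
        return sum(a * math.log(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Identical distributions sit at the lower bound.
identical = jsd([0.5, 0.5], [0.5, 0.5])

# Distributions with no overlapping support sit at the upper bound, log(2).
disjoint = jsd([1.0, 0.0], [0.0, 1.0])

print(identical, disjoint, math.log(2))
```

Because the mixture is never zero where either input is non-zero, the divergence stays finite even for non-overlapping distributions, which is one reason JSD is convenient for comparing histograms.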

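Note on the statistical test described under `bootstrap_iters`: the sketch below walks through the same steps with standard-library Python. It makes several assumptions: the score lists are synthetic numbers drawn from a random generator (not IDPET output), `mann_whitney_greater` is a hypothetical helper that uses a normal approximation with continuity correction and no tie correction, and IDPET itself may compute the test differently.

```python
import math
import random


def mann_whitney_greater(x, y):
    """One-sided Mann-Whitney U test (alternative: x stochastically > y).

    Hypothetical helper: normal approximation with continuity correction,
    no tie correction. Returns (U statistic, p-value).
    """
    n1, n2 = len(x), len(y)
    # U counts the pairs where x beats y (ties count as half).
    u = sum((a > b) + 0.5 * (a == b) for a in x for b in y)
    mean = n1 * n2 / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean - 0.5) / sd
    p = 0.5 * math.erfc(z / math.sqrt(2))  # P(Z >= z)
    return u, p


def mean_of(scores):
    return sum(scores) / len(scores)


# Synthetic bootstrap scores for illustration only (50 iterations each).
rng = random.Random(0)
inter_ij = [rng.gauss(0.10, 0.01) for _ in range(50)]  # ensemble i vs j
intra_i = [rng.gauss(0.02, 0.005) for _ in range(50)]  # i vs itself
intra_j = [rng.gauss(0.03, 0.005) for _ in range(50)]  # j vs itself

# The auto-comparison with the higher mean is the intra-ensemble reference.
reference = max(intra_i, intra_j, key=mean_of)

u, p_value = mann_whitney_greater(inter_ij, reference)
print(f"U = {u}, p = {p_value:.3g}")
```

Picking the auto-comparison with the higher mean as the reference makes the test conservative: the inter-ensemble scores have to exceed the noisier of the two self-comparisons to reach significance.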
idpet/data/io_utils.py

Lines changed: 13 additions & 0 deletions
@@ -1,9 +1,20 @@
 import os
 import tarfile
+from pathlib import Path

+def get_output_dir(output_dir: str):
+    if output_dir is None:
+        return os.getenv(
+            "IDPET_OUTPUT_DIR", # If defined, gets an environmental variable.
+            str(Path.home() / ".idpet" / "data") # Else, uses a default path.
+        )
+    else:
+        return output_dir

 def setup_data_dir(data_dir: str):
+    data_dir = get_output_dir(data_dir)
     os.makedirs(data_dir, exist_ok=True)
+    return data_dir

 def extract_tar_gz(tar_gz_file:str, output_dir:str, new_name:str):
     # Extract the .pdb file with renaming
@@ -13,3 +24,5 @@ def extract_tar_gz(tar_gz_file:str, output_dir:str, new_name:str):
                 member.name = new_name
                 tar.extract(member, path=output_dir)
                 break # Only rename and extract the first .pdb file
+
+trajectory_extensions = ('.dcd', '.xtc')
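
Note on the new `get_output_dir` helper: its resolution order is explicit argument first, then the `IDPET_OUTPUT_DIR` environment variable, then `~/.idpet/data`. The sketch below copies the function's logic from this diff so that it runs standalone. The `Optional[str]` annotation is a suggested adjustment (the committed signature says `str` but accepts `None`), and the paths `/tmp/idpet_cache` and `./my_data` are placeholder values.

```python
import os
from pathlib import Path
from typing import Optional


def get_output_dir(output_dir: Optional[str]) -> str:
    # Same logic as the function added in this commit, copied here so the
    # example is self-contained.
    if output_dir is None:
        return os.getenv(
            "IDPET_OUTPUT_DIR",                    # Used if defined.
            str(Path.home() / ".idpet" / "data"),  # Otherwise the default.
        )
    return output_dir


# 1. No argument and no environment variable: fall back to the default path.
os.environ.pop("IDPET_OUTPUT_DIR", None)
default_dir = get_output_dir(None)

# 2. No argument, environment variable set: the variable wins.
os.environ["IDPET_OUTPUT_DIR"] = "/tmp/idpet_cache"  # placeholder path
env_dir = get_output_dir(None)

# 3. An explicit argument always takes precedence.
explicit_dir = get_output_dir("./my_data")  # placeholder path

print(default_dir, env_dir, explicit_dir, sep="\n")
```

Since `setup_data_dir` now returns the resolved path, callers that pass `None` should use the return value rather than the argument they passed in.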
