Commit 11e2e1e

Merge pull request #3 from DocMinus/joblib: Joblib

2 parents: 54ccbed + a443bc2

5 files changed: 51 additions & 34 deletions

README.md

Lines changed: 9 additions & 3 deletions

```diff
@@ -1,5 +1,6 @@
 [![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
 ![GitHub](https://img.shields.io/github/license/DocMinus/RxnTransformDescriptors)
+![GitHub release (latest by date)](https://img.shields.io/github/v/release/DocMinus/RxnTransformDescriptors)
 
 # Reaction Transform descriptors
 Python code to calculate reaction transform descriptors as described in [CHEMRXIV](https://chemrxiv.org/engage/chemrxiv/article-details/649888d41dcbb92a5e8e3475), by [@DocMinus](https://github.com/docminus) and [@DrAlatriste](https://github.com/DrAlatriste). <br>
@@ -42,6 +43,11 @@ Python testing has been added instead of the previous test.py, see the README.md
 <br>
 
 ### Acknowledgments
-We would like to thank [@eryl](https://github.com/eryl) for suggestions and help regarding multiprocessing. This allowed processing of large datasets within minutes or even seconds on a standard system, versus previously hours.
-
-
+We would like to thank [@eryl](https://github.com/eryl) for suggestions and help regarding multiprocessing in the original build. This allowed processing of large datasets within minutes or even seconds on a standard system, versus previously hours.<br>
+Currently this has been changed to joblib instead, which seems a bit more stable and faster in this particular context.
+
+### Updates
+* setup.py for install as package
+* testing added
+* switch from multiparallel to joblib
+* releases introduced; version number reflects version number of tool.
```
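The multiprocessing-to-joblib switch noted in the updates above boils down to replacing `Pool.map` with joblib's `Parallel`/`delayed` pair. A minimal sketch with a hypothetical worker function (`clean` is a stand-in, not the repo's own `rd_clean`):

```python
from joblib import Parallel, delayed


def clean(smiles: str) -> str:
    """Toy stand-in worker; the repo dispatches rd_clean here instead."""
    return smiles.strip().upper()


data = ["  c1ccccc1 ", "cco"]
# n_jobs=-1 uses every available core, much like multiprocessing.Pool() did;
# the default backend preserves input order in the results
results = Parallel(n_jobs=-1)(delayed(clean)(s) for s in data)
print(results)  # ['C1CCCCC1', 'CCO']
```

Setting `n_jobs=1` runs the same code sequentially, which is handy for debugging.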

environment/README.md

Lines changed: 5 additions & 2 deletions

````diff
@@ -27,8 +27,11 @@ Modules required are sort of standard for chemistry scripting, rdkit, pandas & n
 pip install -r ./environment/requirements.txt
 pip install .
 ```
-
-The latter installs the rxn_tools into the environment. The example script would work without that, but testing requires that.
+The latter installs the rxn_tools into the environment. The example script would work without that, but testing requires that.<br>
+If you don't want to download this repo, you can install it directly with:
+```shell
+pip install git+https://github.com/DocMinus/RxnTransformDescriptors.git
+```
 
 ## Running Tests
 `pytest` is available for testing. See the README.md in /tests.
````

environment/requirements.txt

Lines changed: 2 additions & 1 deletion

```diff
@@ -1,3 +1,4 @@
 rdkit
 pandas
-numpy
+numpy
+joblib
```

setup.py

Lines changed: 2 additions & 1 deletion

```diff
@@ -2,7 +2,7 @@
 
 setup(
     name="td_tools",
-    version="2.1.3",
+    version="2.2.0",
     pythonrequires=">=3.9",
     packages=find_packages(),
     package_data={
@@ -12,4 +12,5 @@
     author="DocMinus",
     author_email="alexander.minidis@gmail.com",
     url="https://github.com/DocMinus/RxnTransformDescriptors",
+    license="CC0-1.0",
 )
```

td_tools/rxntools.py

Lines changed: 33 additions & 27 deletions
```diff
@@ -1,28 +1,25 @@
 """
 Initial Version: Nov, 2022
-Version 2.1.3
-Update: 2023-06-22
+Version 2.2.0
+Update: 2024-02-24
 @author: Alexander Minidis (DocMinus)
 
-Copyright (c) 2022-2023 DocMinus
+Copyright (c) 2022-2024 DocMinus
 Changelog:
-V1 -> V2: multiprocessing with help by @eryl (github)
+V1 -> V2: multiprocessing with help by @eryl (github) / changed to joblib for better compatibility
 Finalization w. further optimization and removal of V1 code. Black-ened
 Also added csv reader to pd (copy from chemtools)
 Added transfrom_descriptors() as wrapper for all descriptor functions, reducing number of lines in main.py
 """
 
-import multiprocessing
+import multiprocessing  # remains for reference, but not used
 from collections import defaultdict
 
-# general imports
 import pandas as pd
-
-# RDkit stuff
+from joblib import Parallel, delayed
 from rdkit import Chem, RDLogger
 from rdkit.Chem.MolStandardize import rdMolStandardize
 
-# my imports
 from .ids import (
     ELEMENTS_DICT,
     RDKIT_SMARTS,
```
```diff
@@ -135,42 +132,50 @@ def rd_clean(workpackage: str) -> list:
     return _id, Chem.MolToSmiles(mol)
 
 
-def clean_smiles_multi(schmiles: list) -> tuple:
+def clean_smiles_multiprocessing(schmiles: list) -> tuple:
     """parallel cleanining of a list of smiles.\n
+    Leaving this here for reference; obsolete but still functional if so desired.
+    """
+    print("...Cleaning compounds...")
+    work = [(i, smiles) for i, smiles in enumerate(schmiles)]
+    with multiprocessing.Pool() as pool:
+        processed_smiles = pool.map(rd_clean, work)
+
+    return tuple(smi[1] for smi in processed_smiles)
+
+
+def clean_smiles_multi(schmiles: list) -> tuple:
+    """parallel cleaning of a list of smiles.\n
     Includes an ID, not really necessary, but helps in later analysis downstream.
 
     in: list of smiles
     out: list of cleaned smiles (for now, without ID)
     """
     print("...Cleaning compounds...")
     work = [(i, smiles) for i, smiles in enumerate(schmiles)]
-    with multiprocessing.Pool() as pool:
-        processed_smiles = pool.map(rd_clean, work)
+    processed_smiles = Parallel(n_jobs=-1)(delayed(rd_clean)(w) for w in work)
 
-    # tuple or list, doesn't really matter(?)
     return tuple(smi[1] for smi in processed_smiles)
 
 
-def element_count(smiles: str) -> dict:
+def element_count(schmiles: str) -> dict:
     """
     Creates a "sum formula" as dictionary, no hydrogens though.\n
     Doesn't consider salts, ignores bondtypes. But: doesn't need RDKIT.
     Input:
-        string containing smiles
+        string containing schmiles
     Output:
         dictionary containing sumformula, here: sorted, including 0 atoms, useful for EA descriptors
     """
 
-    elements = [token for token in REGEX.findall(smiles)]
+    elements = [token for token in REGEX.findall(schmiles)]
     for i in range(len(elements)):
         if REGEX_LOW.findall(elements[i]):
             elements[i] = elements[i].upper()
 
     element_count = [elements.count(ele) for ele in elements]
     formula = dict(zip(elements, element_count))
-    # final_dict = {"smiles": smiles}
-    # final_dict.update(ELEMENTS_DICT)
-    # return {key: formula.get(key, final_dict[key]) for key in final_dict}
+
     return {key: formula.get(key, ELEMENTS_DICT[key]) for key in ELEMENTS_DICT}
 
 
```
```diff
@@ -183,10 +188,10 @@ def elemental_tds_multi(schmiles: list) -> pd.DataFrame:
     """
     print("...Calculation EA based descriptors...")
 
-    with multiprocessing.Pool() as pool:
-        processed_smiles = pool.map(element_count, schmiles)
+    processed_smiles = Parallel(n_jobs=-1)(
+        delayed(element_count)(smiles) for smiles in schmiles
+    )
 
-    # return processed_smiles
     return pd.DataFrame(data=processed_smiles).astype("int16")
 
 
```
```diff
@@ -212,8 +217,9 @@ def rdkit_descriptors_multi(schmiles: list) -> pd.DataFrame:
     :return: pandas df containing calculated properties (no structures, but now with ID)
     """
     print("...Calculating RDKit descriptors...")
-    with multiprocessing.Pool() as pool:
-        processed_smiles = pool.map(properties_calc, schmiles)
+    processed_smiles = Parallel(n_jobs=-1)(
+        delayed(properties_calc)(schmiles) for schmiles in schmiles
+    )
 
     named_properties = dict(zip(RDKIT_TD_HEADERS, zip(*processed_smiles)))
     return pd.DataFrame(data=named_properties).astype("int16")
```
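The hunk above turns a list of per-molecule result tuples into named columns via `zip(*...)`. A sketch with hypothetical headers (the real `RDKIT_TD_HEADERS` comes from `td_tools/ids.py`; the values here are made up):

```python
import pandas as pd

# Hypothetical descriptor headers standing in for RDKIT_TD_HEADERS
HEADERS = ["n_rings", "n_aromatic", "n_hetero"]

# one tuple of descriptor values per molecule, as the workers return them
rows = [(1, 1, 0), (2, 2, 1), (0, 0, 0)]

# zip(*rows) transposes row tuples into one column per descriptor
named = dict(zip(HEADERS, zip(*rows)))
df = pd.DataFrame(data=named).astype("int16")
print(df.to_dict("list"))  # {'n_rings': [1, 2, 0], 'n_aromatic': [1, 2, 0], 'n_hetero': [0, 1, 0]}
```

Casting to `int16` keeps the descriptor matrix compact, which matters once the row count reaches the large datasets mentioned in the README.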
```diff
@@ -248,9 +254,9 @@ def smarts_descriptors_multi(schmiles: list) -> pd.DataFrame:
     """
     print("...Calculating smarts based descriptors...")
 
-    with multiprocessing.Pool() as pool:
-        processed_smiles = pool.map(smarts_count, schmiles)
-        # processed_smiles = list(pool.imap_unordered(match_smarts, mollist, chunksize=chunksize))
+    processed_smiles = Parallel(n_jobs=-1)(
+        delayed(smarts_count)(smiles) for smiles in schmiles
+    )
 
     columns = defaultdict(list)
     for smi in schmiles:
```
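The `columns = defaultdict(list)` line closing the hunk above suggests per-pattern column accumulation; a toy sketch of that idiom, with made-up pattern names and hit counts standing in for `smarts_count` output:

```python
from collections import defaultdict

# Made-up per-SMILES SMARTS hit counts; real ones come from smarts_count
hits = {
    "CCO": {"alcohol": 1, "amine": 0},
    "CCN": {"alcohol": 0, "amine": 1},
}

# append each molecule's count to its pattern's column, in input order
columns = defaultdict(list)
for smi in ["CCO", "CCN"]:
    for pattern, count in hits[smi].items():
        columns[pattern].append(count)

print(dict(columns))  # {'alcohol': [1, 0], 'amine': [0, 1]}
```

`defaultdict(list)` avoids having to pre-create one empty list per pattern before the loop.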
