Commit feb1d95

Merge pull request mala-project#666 from mala-project/develop
v1.4.0 - Mixology Certification
2 parents 3d8ecd0 + 2ca9b6e commit feb1d95

144 files changed

Lines changed: 11908 additions & 1891 deletions


.dockerignore

Lines changed: 1 addition & 1 deletion
@@ -1,2 +1,2 @@
 *
-!install/*
+!pipeline/*

.github/workflows/cpu-tests.yml

Lines changed: 3 additions & 3 deletions
@@ -118,7 +118,7 @@ jobs:
 
   cpu-tests:
     needs: build-docker-image-cpu
-    runs-on: ubuntu-20.04
+    runs-on: ubuntu-22.04
     env:
       IMAGE_REPO: ${{ needs.build-docker-image-cpu.outputs.image-repo }}
       DOCKER_TAG: ${{ needs.build-docker-image-cpu.outputs.docker-tag }}
@@ -183,7 +183,7 @@ jobs:
         # be there before it has been installed.
         sed -i '/materials-learning-algorithms/d' ./env_after.yml
 
-        # if comparison fails, `install/mala_cpu_[base]_environment.yml` needs to be aligned with
+        # if comparison fails, `pipeline/mala_cpu_[base]_environment.yml` needs to be aligned with
         # `requirements.txt` and/or extra dependencies are missing in the Docker Conda environment
 
         if diff --brief env_before.yml env_after.yml
@@ -230,7 +230,7 @@ jobs:
 
       - name: Test mala
         shell: 'bash -c "docker exec -i mala-cpu bash < {0}"'
-        run: MALA_DATA_REPO=$(pwd)/mala_data pytest --cov=mala --cov-fail-under=60 -m "not examples" --disable-warnings
+        run: MALA_DATA_REPO=$(pwd)/mala_data pytest --cov=mala --cov-fail-under=50 -m "not examples" --disable-warnings
 
   retag-docker-image-cpu:
     needs: [cpu-tests, build-docker-image-cpu]

.gitignore

Lines changed: 6 additions & 0 deletions
@@ -1,5 +1,6 @@
 # Byte-compiled / optimized / DLL files
 __pycache__/
+__pycache__
 *.py[cod]
 *$py.class
 
@@ -152,6 +153,7 @@ cython_debug/
 *.out
 *.npy
 *.pkl
+*.pk
 *.pth
 *.json
 
@@ -186,3 +188,7 @@ wandb/
 
 *.zip
 *~
+
+# ACE files & libraries
+*.pkl
+

Dockerfile

Lines changed: 2 additions & 2 deletions
@@ -14,14 +14,14 @@ RUN apt-get --allow-releaseinfo-change update && apt-get upgrade -y && \
 
 # Choose 'cpu' or 'gpu'
 ARG DEVICE=cpu
-COPY install/mala_${DEVICE}_environment.yml .
+COPY pipeline/mala_${DEVICE}_environment.yml .
 RUN conda env create -f mala_${DEVICE}_environment.yml && rm -rf /opt/conda/pkgs/*
 
 # Install optional MALA dependencies into Conda environment with pip
 RUN /opt/conda/envs/mala-${DEVICE}/bin/pip install --no-input --no-cache-dir \
     pytest \
     pytest-cov \
-    oapackage==2.6.8 \
+    oapackage==2.7.14 \
     pqkmeans
 
 RUN echo "source activate mala-${DEVICE}" > ~/.bashrc

docs/source/CONTRIBUTE.md

Lines changed: 5 additions & 2 deletions
@@ -16,11 +16,14 @@ nature of your contribution:
 
 - Bartosz Brzoza (Bugfixes, GNN implementation)
 - Timothy Callow (Grid-size transferability)
+- Petr Cagas (Sample data management and data generation)
+- Matthew Campbell (Active learning)
 - Attila Cangi (Scientific supervision)
 - Austin Ellis (General code infrastructure)
 - Omar Faruk (Training parallelization via horovod)
 - Lenz Fiedler (General code development and maintenance)
 - James Fox (GNN implementation)
+- James Goff (ACE descriptors and forces)
 - Nils Hoffmann (NASWOT method)
 - Kyle Miller (Data shuffling)
 - Daniel Kotik (Documentation and CI)
@@ -33,7 +36,7 @@ nature of your contribution:
 - Siva Rajamanickam (Scientific supervision)
 - Josh Romero (GPU usage improvement for model tuning)
 - Steve Schmerler (Uncertainty quantification)
-- Adam Stephens (Uncertainty quantification work)
+- Adam Stephens (Uncertainty quantification)
 - Hossein Tahmasbi (Minterpy descriptors)
 - Aidan Thompson (Descriptor calculation)
 - Sneha Verma (Tensorboard interface)
@@ -113,7 +116,7 @@ If you add additional dependencies, make sure to add them to `requirements.txt`
 if they are required or to `setup.py` under the appropriate `extras` tag if
 they are not.
 Further, in order for them to be available during the CI tests, make sure to
-add _required_ dependencies to the appropriate environment files in folder `install/` and _extra_ requirements directly in the `Dockerfile` for the `conda` environment build.
+add _required_ dependencies to the appropriate environment files in folder `pipeline/` and _extra_ requirements directly in the `Dockerfile` for the `conda` environment build.
 
 ## Pull Requests
 We actively welcome pull requests.

docs/source/advanced_usage/descriptors.rst

Lines changed: 48 additions & 10 deletions
@@ -1,7 +1,7 @@
 .. _tuning descriptors:
 
-Improved data conversion
-========================
+Advanced descriptor options
+===========================
 
 As a general remark please be reminded that if you have not used LAMMPS
 for your first steps in MALA, and instead used the python-based descriptor
@@ -76,23 +76,20 @@ An example would be this:
 
 .. code-block:: python
 
-    hyperoptimizer.add_snapshot("espresso-out", os.path.join(data_path, "Be_snapshot1.out"),
-                                "numpy", os.path.join(data_path, "Be_snapshot1.out.npy"),
+    hyperoptimizer.add_snapshot("espresso-out", os.path.join(data_path_be, "Be_snapshot1.out"),
+                                "numpy", os.path.join(data_path_be, "Be_snapshot1.out.npy"),
                                 target_units="1/(Ry*Bohr^3)")
-    hyperoptimizer.add_snapshot("espresso-out", os.path.join(data_path, "Be_snapshot2.out"),
-                                "numpy", os.path.join(data_path, "Be_snapshot2.out.npy"),
+    hyperoptimizer.add_snapshot("espresso-out", os.path.join(data_path_be, "Be_snapshot2.out"),
+                                "numpy", os.path.join(data_path_be, "Be_snapshot2.out.npy"),
                                 target_units="1/(Ry*Bohr^3)")
 
 Once this is done, you can start the optimization via
 
 .. code-block:: python
 
-    hyperoptimizer.perform_study(return_plotting=False)
+    hyperoptimizer.perform_study()
     hyperoptimizer.set_optimal_parameters()
 
-If ``return_plotting`` is set to ``True``, relevant plotting data for the
-analysis are returned. This is useful for exploratory searches.
-
 Since the ACSD re-calculates the bispectrum descriptors for each combination
 of hyperparameters, it is useful to use parallel descriptor calculation.
 To do so, you can enable the `MPI <https://www.mpi-forum.org/>`_ capabilites
@@ -118,3 +115,44 @@ Parallelization may also generally be used for data conversion via the
 prior to using the ``DataConverter`` class. Then, all processing will
 be done in parallel - both the descriptor calculation as well as the LDOS
 parsing.
+
+ACE Descriptors
+***************
+
+.. note::
+
+    To use ACE descriptors with MALA, you need to install LAMMPS from source
+    using the ACE descriptor development branch, since the ACE descriptors
+    are not yet part of the descriptor calculation code the MALA team has
+    integrated into mainline LAMMPS. You can find the code here:
+    https://github.com/jmgoff/lammps_compute_PACE/tree/mala-ace-grid.
+
+Recently, and as described in the
+`MALA technical paper <https://arxiv.org/abs/2411.19617>`_, ACE descriptors
+have been implemented as an alternative to bispectrum descriptors. They
+follow the Atomic Cluster Expansion (ACE) formalism, introduced by
+the `eponymous publication <https://journals.aps.org/prb/abstract/10.1103/PhysRevB.99.014104>`_
+by Ralf Drautz. ACE descriptors hold the promise of being more descriptive and
+accurate than bispectrum descriptors and are currently being investigated by
+the MALA team. MALA already implements most functionalities of bispectrum
+descriptors for ACE descriptors. You can use them in the same fashion as
+the bispectrum descriptors, with the only difference being the hyperparameters
+you need to set.
+
+Specifically, by replacing all bispectrum hyperparameters in your script
+with code such as this
+
+.. code-block:: python
+
+    parameters.descriptors.descriptor_type = "ACE"
+    parameters.descriptors.ace_cutoff = 5.8
+    parameters.descriptors.ace_included_expansion_ranks = [1, 2, 3]
+    parameters.descriptors.ace_maximum_l_per_rank = [0, 1, 1]
+    parameters.descriptors.ace_maximum_n_per_rank = [1, 1, 1]
+    parameters.descriptors.ace_minimum_l_per_rank = [0, 0, 0]
+
+ACE descriptors will be used in your processing/training/testing scripts.
+ACE_DOCS_MISSING: Describe what the parameters mean/how to best tune them.
+
+A known current limitation is that ACE descriptors can only be run on CPU.
+A GPU version is currently being developed.
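In the added example, each per-rank hyperparameter list carries one entry per included expansion rank (three ranks, three entries each). A small sanity check one could run before launching a calculation is sketched below; the helper is hypothetical and the one-entry-per-rank rule is an assumption inferred from the example's shapes, not MALA's own validation code.

```python
def check_ace_rank_lists(ranks, max_l, max_n, min_l):
    """Verify that each per-rank ACE hyperparameter list has exactly one
    entry per included expansion rank. Illustrative helper based on the
    shapes in the example above; not a MALA function."""
    named_lists = (("ace_maximum_l_per_rank", max_l),
                   ("ace_maximum_n_per_rank", max_n),
                   ("ace_minimum_l_per_rank", min_l))
    for name, values in named_lists:
        if len(values) != len(ranks):
            raise ValueError(
                f"{name} needs {len(ranks)} entries, got {len(values)}")
    return True

# Matches the example: ranks [1, 2, 3] with three entries per list.
print(check_ace_rank_lists([1, 2, 3], [0, 1, 1], [1, 1, 1], [0, 0, 0]))  # → True
```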

docs/source/advanced_usage/hyperparameters.rst

Lines changed: 29 additions & 1 deletion
@@ -96,6 +96,34 @@ are started with ``wait_time`` time interval in between (to avoid race
 conditions when accessing the same data base) and further only use the data
 base, not MPI, for communication.
 
+The batch job on your HPC cluster will get killed after the designated runtime.
+Then unfinished trials will remain in the Optuna database in state RUNNING.
+
+The current workflow for resuming the study which makes use of MALA's own
+resume tooling
+(see ``examples/advanced/ex05_checkpoint_hyperparameter_optimization.py``) is
+this: before submitting the batch job again and letting the script do the
+resume work, a user needs to modify the database like so:
+
+.. code-block:: bash
+
+    python3 -c "import mala; mala.HyperOptOptuna.requeue_zombie_trials('hyperopt01', 'sqlite:///hyperopt.db')"
+
+which will set the RUNNING trials to state WAITING.
+When Optuna resumes, it will pick up and re-run those, before carrying on
+running the resumed study.
+
+Common questions related to this feature:
+
+- "Does 'injecting' jobs like this disturb Optuna's operation in any way?":
+  No, the study object takes all of its information directly from the
+  data base, which in this case now has "WAITING" trials.
+- "Do those trials have to be run?": Technically not. One could simply ignore
+  them and re-run without them. The problem is that in this case, the study
+  will have missing data points from trials that have been suggested for a
+  reason, so even if Optuna resumed fine, we would still want to re-run them
+  from an optimization point of view.
+
 If you do distributed hyperparameter optimization, another useful option
 is
 
@@ -114,7 +142,7 @@ a physical validation metric such as
 
 .. code-block:: python
 
-    parameters.running.after_training_metric = "band_energy"
+    parameters.running.final_validation_metric = "band_energy"
 
 Advanced optimization algorithms
 ********************************
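The requeue step documented above amounts to a single state flip in the study database. As an illustration only, the sketch below performs the equivalent operation on an in-memory SQLite table; the table and column names are hypothetical, not Optuna's actual schema, and ``requeue_zombie_trials`` remains the supported way to do this in MALA.

```python
import sqlite3

# Illustrative stand-in for an Optuna study storage: a table holding one
# row per trial with its current state. Schema is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trials (trial_id INTEGER PRIMARY KEY, state TEXT)")
conn.executemany(
    "INSERT INTO trials (state) VALUES (?)",
    [("COMPLETE",), ("RUNNING",), ("RUNNING",), ("COMPLETE",)],
)

# A batch job killed by the scheduler leaves trials stuck in RUNNING
# ("zombies"). Requeueing flips them to WAITING so the resumed study
# picks them up and re-runs them.
conn.execute("UPDATE trials SET state = 'WAITING' WHERE state = 'RUNNING'")

states = [row[0] for row in
          conn.execute("SELECT state FROM trials ORDER BY trial_id")]
print(states)  # → ['COMPLETE', 'WAITING', 'WAITING', 'COMPLETE']
```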

docs/source/advanced_usage/openpmd.rst

Lines changed: 4 additions & 4 deletions
@@ -33,16 +33,16 @@ be left untouched. Specifically, set
     ...
     # Changes for DataHandler
     data_handler = mala.DataHandler(parameters)
-    data_handler.add_snapshot("Be_snapshot0.in.h5", data_path,
-                              "Be_snapshot0.out.h5", data_path, "tr",
+    data_handler.add_snapshot("Be_snapshot0.in.h5", data_path_be,
+                              "Be_snapshot0.out.h5", data_path_be, "tr",
                               snapshot_type="openpmd")
     ...
     # Changes for DataShuffler
     data_shuffler = mala.DataShuffler(parameters)
     # Data can be shuffle FROM and TO openPMD - but also from
     # numpy to openPMD.
-    data_shuffler.add_snapshot("Be_snapshot0.in.h5", data_path,
-                               "Be_snapshot0.out.h5", data_path,
+    data_shuffler.add_snapshot("Be_snapshot0.in.h5", data_path_be,
+                               "Be_snapshot0.out.h5", data_path_be,
                                snapshot_type="openpmd")
     data_shuffler.shuffle_snapshots(...,
                                     save_name="Be_shuffled*.h5")

docs/source/advanced_usage/predictions.rst

Lines changed: 20 additions & 1 deletion
@@ -105,7 +105,26 @@ CPU or GPU. To do so, simply enable MPI usage in MALA
 
     parameters.use_mpi = True
 
 Once MPI is activated, you can start the MPI aware Python script using
-``mpirun``, ``srun`` or whichever MPI wrapper is used on your machine.
+``mpirun``, ``srun`` or whichever MPI wrapper is used on your machine, for
+example with
+
+.. code-block:: bash
+
+    #!/bin/bash
+    #SBATCH --nodes=NUMBER_OF_NODES
+    #SBATCH --ntasks-per-node=NUMBER_OF_TASKS_PER_NODE
+    #SBATCH --gres=gpu:NUMBER_OF_TASKS_PER_NODE
+    # Add more arguments as needed
+    ...
+
+    # Load more modules as needed
+    ...
+
+    # Depending on your cluster setup, you may need to use srun here
+    # rather than mpirun.
+    # Note that
+    # NUMBER_OF_RANKS = NUMBER_OF_NODES * NUMBER_OF_TASKS_PER_NODE
+    mpirun -np NUMBER_OF_RANKS python3 -u prediction.py
 
 By default, MALA can only operate with a number of processes by which the
 z-dimension of the inference grid can be evenly divided, since the Quantum
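The divisibility constraint in this context line can be checked up front when choosing a rank count for the batch script. A minimal sketch, using a hypothetical helper that is not part of MALA's API:

```python
def valid_rank_counts(nz, max_ranks):
    """Return the MPI rank counts up to max_ranks that evenly divide the
    z-dimension of the inference grid, per the constraint described in
    the docs. Illustrative helper, not a MALA function."""
    return [ranks for ranks in range(1, max_ranks + 1) if nz % ranks == 0]

# e.g. an inference grid with 90 points along z, at most 16 ranks:
print(valid_rank_counts(90, 16))  # → [1, 2, 3, 5, 6, 9, 10, 15]
```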

docs/source/advanced_usage/trainingmodel.rst

Lines changed: 23 additions & 14 deletions
@@ -71,13 +71,13 @@ is directly outputted by MALA. By default, this validation loss gives the
 mean squared error between LDOS prediction and actual value. From a purely
 ML point of view, this is fine; however, the correctness of the LDOS itself
 does not hold much physical virtue. Thus, MALA implements physical validation
-metrics to be accessed before and after the training routine.
+metrics which can be evaluated, for example, after the training.
 
 Specifically, when setting
 
 .. code-block:: python
 
-    parameters.running.after_training_metric = "band_energy"
+    parameters.running.final_validation_metric = "band_energy"
 
 the error in the band energy between actual and predicted LDOS will be
 calculated and printed before and after network training (in meV/atom).
@@ -170,23 +170,32 @@ data sets have to be saved - in-memory implementations are currently developed.
 To use the data shuffling (also shown in example
 ``advanced/ex02_shuffle_data.py``), you can use the ``DataShuffler`` class.
 
-The syntax is very easy, you create a ``DataShufller`` object,
+The syntax is very easy: you create a ``DataShuffler`` object,
 which provides the same ``add_snapshot`` functionalities as the ``DataHandler``
-object, and shuffle the data once you have added all snapshots in question,
-i.e.,
+object, and shuffle the data once you have added all snapshots in question.
+Just as with the ``DataHandler`` class, on-the-fly calculation of bispectrum
+descriptors is supported.
 
 .. code-block:: python
 
     parameters.data.shuffling_seed = 1234
 
     data_shuffler = mala.DataShuffler(parameters)
-    data_shuffler.add_snapshot("Be_snapshot0.in.npy", data_path,
-                               "Be_snapshot0.out.npy", data_path)
-    data_shuffler.add_snapshot("Be_snapshot1.in.npy", data_path,
-                               "Be_snapshot1.out.npy", data_path)
+    data_shuffler.add_snapshot("Be_snapshot0.in.npy", data_path_be,
+                               "Be_snapshot0.out.npy", data_path_be)
+    data_shuffler.add_snapshot("Be_snapshot1.in.npy", data_path_be,
+                               "Be_snapshot1.out.npy", data_path_be)
     data_shuffler.shuffle_snapshots(complete_save_path="../",
                                     save_name="Be_shuffled*")
 
+By using the ``shuffle_to_temporary`` keyword, you can shuffle the data to
+temporary files, which can be deleted after the training run. This is useful
+if you want to shuffle the data right before training and do not plan to re-use
+shuffled data files for multiple training runs. As detailed in
+``advanced/ex02_shuffle_data.py``, access to temporary files is provided via
+``data_shuffler.temporary_shuffled_snapshots[...]``, which is a list containing
+``mala.Snapshot`` objects.
+
 The seed ``parameters.data.shuffling_seed`` ensures reproducibility of data
 sets. The ``shuffle_snapshots`` function has a path handling ability akin to
 the ``DataConverter`` class. Further, via the ``number_of_shuffled_snapshots``
@@ -203,7 +212,7 @@ in the file ``advanced/ex03_tensor_board``. Simply select a logger prior to trai
 .. code-block:: python
 
     parameters.running.logger = "tensorboard"
-    parameters.running.logging_dir = "mala_vis"
+    parameters.running.logging_dir = "mala_logs"
 
 or
 
@@ -215,14 +224,14 @@ or
         entity="your_wandb_entity"
     )
     parameters.running.logger = "wandb"
-    parameters.running.logging_dir = "mala_vis"
+    parameters.running.logging_dir = "mala_logs"
 
 where ``logging_dir`` specifies some directory in which to save the
 MALA logging data. You can also select which metrics to record via
 
 .. code-block:: python
 
-    parameters.validation_metrics = ["ldos", "dos", "density", "total_energy"]
+    parameters.logging_metrics = ["ldos", "dos", "density", "total_energy"]
 
 Full list of available metrics:
 - "ldos": MSE of the LDOS.
@@ -240,14 +249,14 @@ To save time and resources you can specify the logging interval via
 
 .. code-block:: python
 
-    parameters.running.validate_every_n_epochs = 10
+    parameters.running.logging_metrics_interval = 10
 
 If you want to monitor the degree to which the model overfits to the training data,
 you can use the option
 
 .. code-block:: python
 
-    parameters.running.validate_on_training_data = True
+    parameters.running.log_metrics_on_train_set = True
 
 MALA will evaluate the validation metrics on the training set as well as the validation set.
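The role of the ``shuffling_seed`` documented above can be illustrated with a small, self-contained sketch: pooling data from several snapshots and redistributing it reproducibly. The helper below is a hypothetical stand-in for ``mala.DataShuffler``, not its actual implementation.

```python
import random

def shuffle_snapshots_sketch(snapshots, n_out, seed):
    """Pool samples from all snapshots and redistribute them into n_out
    shuffled snapshots. Reproducible for a fixed seed. Illustrative
    stand-in for mala.DataShuffler, not MALA code."""
    pool = [sample for snapshot in snapshots for sample in snapshot]
    random.Random(seed).shuffle(pool)  # deterministic for a given seed
    chunk = len(pool) // n_out
    return [pool[i * chunk:(i + 1) * chunk] for i in range(n_out)]

snap0 = [("snapshot0", i) for i in range(4)]
snap1 = [("snapshot1", i) for i in range(4)]

first = shuffle_snapshots_sketch([snap0, snap1], n_out=2, seed=1234)
second = shuffle_snapshots_sketch([snap0, snap1], n_out=2, seed=1234)
print(first == second)  # → True: same seed, identical shuffled data sets
```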
