
Commit 65ebdb8

Merge branch 'develop_lenz' into multielement_multidos

# Conflicts:
#	examples/basic/ex01_train_network.py
#	examples/basic/ex05_run_predictions.py

2 parents: b4dcdc7 + 9425dd6

30 files changed: 575 additions & 600 deletions

.dockerignore

Lines changed: 1 addition & 1 deletion
```diff
@@ -1,2 +1,2 @@
 *
-!install/*
+!pipeline/*
```

.github/workflows/cpu-tests.yml

Lines changed: 1 addition & 1 deletion
```diff
@@ -183,7 +183,7 @@ jobs:
 # be there before it has been installed.
 sed -i '/materials-learning-algorithms/d' ./env_after.yml

-# if comparison fails, `install/mala_cpu_[base]_environment.yml` needs to be aligned with
+# if comparison fails, `pipeline/mala_cpu_[base]_environment.yml` needs to be aligned with
 # `requirements.txt` and/or extra dependencies are missing in the Docker Conda environment

 if diff --brief env_before.yml env_after.yml
```
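The check this workflow performs can be sketched in Python. The file contents below are hypothetical; the real job compares Conda environment exports taken before and after installing MALA.

```python
# Sketch of the CI drift check (hypothetical file contents): the package's
# own entry is stripped from the post-install export, and the result must
# then match the committed environment file exactly.
committed = ["name: mala-cpu", "dependencies:", "  - numpy", "  - scipy"]
exported = [
    "name: mala-cpu",
    "dependencies:",
    "  - numpy",
    "  - materials-learning-algorithms",  # MALA itself, removed before diffing
    "  - scipy",
]

# Equivalent of: sed -i '/materials-learning-algorithms/d' ./env_after.yml
exported = [line for line in exported if "materials-learning-algorithms" not in line]

# Equivalent of: diff --brief env_before.yml env_after.yml
environments_match = exported == committed
print(environments_match)  # True
```

If the comparison fails, either the environment file or `requirements.txt` has drifted and must be realigned, as the workflow comment says.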

.gitignore

Lines changed: 1 addition & 0 deletions
```diff
@@ -153,6 +153,7 @@ cython_debug/
 *.out
 *.npy
 *.pkl
+*.pk
 *.pth
 *.json
```

Dockerfile

Lines changed: 1 addition & 1 deletion
```diff
@@ -14,7 +14,7 @@ RUN apt-get --allow-releaseinfo-change update && apt-get upgrade -y && \

 # Choose 'cpu' or 'gpu'
 ARG DEVICE=cpu
-COPY install/mala_${DEVICE}_environment.yml .
+COPY pipeline/mala_${DEVICE}_environment.yml .
 RUN conda env create -f mala_${DEVICE}_environment.yml && rm -rf /opt/conda/pkgs/*

 # Install optional MALA dependencies into Conda environment with pip
```

docs/source/CONTRIBUTE.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -116,7 +116,7 @@ If you add additional dependencies, make sure to add them to `requirements.txt`
 if they are required or to `setup.py` under the appropriate `extras` tag if
 they are not.
 Further, in order for them to be available during the CI tests, make sure to
-add _required_ dependencies to the appropriate environment files in folder `install/` and _extra_ requirements directly in the `Dockerfile` for the `conda` environment build.
+add _required_ dependencies to the appropriate environment files in folder `pipeline/` and _extra_ requirements directly in the `Dockerfile` for the `conda` environment build.

 ## Pull Requests
 We actively welcome pull requests.
```

docs/source/advanced_usage/hyperparameters.rst

Lines changed: 29 additions & 1 deletion
```diff
@@ -96,6 +96,34 @@ are started with ``wait_time`` time interval in between (to avoid race
 conditions when accessing the same data base) and further only use the data
 base, not MPI, for communication.

+The batch job on your HPC cluster will get killed after the designated
+runtime, and unfinished trials will then remain in the Optuna database in
+state RUNNING.
+
+The current workflow for resuming the study, which makes use of MALA's own
+resume tooling
+(see ``examples/advanced/ex05_checkpoint_hyperparameter_optimization.py``), is
+this: before submitting the batch job again and letting the script do the
+resume work, a user needs to modify the database like so:
+
+.. code-block:: bash
+
+   python3 -c "import mala; mala.HyperOptOptuna.requeue_zombie_trials('hyperopt01', 'sqlite:///hyperopt.db')"
+
+which will set the RUNNING trials to state WAITING.
+When Optuna resumes, it will pick up and re-run those before carrying on
+with the resumed study.
+
+Common questions related to this feature:
+
+- "Does 'injecting' jobs like this disturb Optuna's operation in any way?":
+  No, the study object takes all of its information directly from the
+  database, which in this case now contains WAITING trials.
+- "Do those trials have to be run?": Technically not. One could simply ignore
+  them and re-run without them. The problem is that in this case, the study
+  will have missing data points from trials that were suggested for a
+  reason, so even if Optuna would resume fine, we still want to re-run them
+  from an optimization point of view.
+
 If you do distributed hyperparameter optimization, another useful option
 is

@@ -114,7 +142,7 @@ a physical validation metric such as

 .. code-block:: python

-   parameters.running.after_training_metric = "band_energy"
+   parameters.running.final_validation_metric = "band_energy"

 Advanced optimization algorithms
 ********************************
```
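To illustrate what requeuing zombie trials does to the study database, here is a self-contained sketch against an in-memory SQLite table. The table layout and data are hypothetical, much-simplified stand-ins for Optuna's storage schema; only the RUNNING-to-WAITING transition mirrors the real ``requeue_zombie_trials`` call.

```python
import sqlite3

# Hypothetical, much-simplified stand-in for Optuna's trials table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE trials (trial_id INTEGER PRIMARY KEY, state TEXT)")
con.executemany(
    "INSERT INTO trials (state) VALUES (?)",
    [("COMPLETE",), ("RUNNING",), ("RUNNING",)],  # two "zombie" trials
)

# The essence of requeue_zombie_trials: flip trials that were killed
# mid-run back to WAITING so the resumed study picks them up again.
con.execute("UPDATE trials SET state = 'WAITING' WHERE state = 'RUNNING'")

states = [s for (s,) in con.execute("SELECT state FROM trials ORDER BY trial_id")]
print(states)  # ['COMPLETE', 'WAITING', 'WAITING']
```

Completed trials are untouched, so the optimizer's history stays intact while only the interrupted work is rescheduled.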

docs/source/advanced_usage/trainingmodel.rst

Lines changed: 7 additions & 7 deletions
```diff
@@ -71,13 +71,13 @@ is directly outputted by MALA. By default, this validation loss gives the
 mean squared error between LDOS prediction and actual value. From a purely
 ML point of view, this is fine; however, the correctness of the LDOS itself
 does not hold much physical virtue. Thus, MALA implements physical validation
-metrics to be accessed before and after the training routine.
+metrics which can be evaluated for example after the training.

 Specifically, when setting

 .. code-block:: python

-   parameters.running.after_training_metric = "band_energy"
+   parameters.running.final_validation_metric = "band_energy"

 the error in the band energy between actual and predicted LDOS will be
 calculated and printed before and after network training (in meV/atom).

@@ -212,7 +212,7 @@ in the file ``advanced/ex03_tensor_board``. Simply select a logger prior to trai
 .. code-block:: python

    parameters.running.logger = "tensorboard"
-   parameters.running.logging_dir = "mala_vis"
+   parameters.running.logging_dir = "mala_logs"

 or

@@ -224,14 +224,14 @@ or
       entity="your_wandb_entity"
    )
    parameters.running.logger = "wandb"
-   parameters.running.logging_dir = "mala_vis"
+   parameters.running.logging_dir = "mala_logs"

 where ``logging_dir`` specifies some directory in which to save the
 MALA logging data. You can also select which metrics to record via

 .. code-block:: python

-   parameters.validation_metrics = ["ldos", "dos", "density", "total_energy"]
+   parameters.logging_metrics = ["ldos", "dos", "density", "total_energy"]

 Full list of available metrics:
 - "ldos": MSE of the LDOS.

@@ -249,14 +249,14 @@ To save time and resources you can specify the logging interval via

 .. code-block:: python

-   parameters.running.validate_every_n_epochs = 10
+   parameters.running.logging_metrics_interval = 10

 If you want to monitor the degree to which the model overfits to the training data,
 you can use the option

 .. code-block:: python

-   parameters.running.validate_on_training_data = True
+   parameters.running.log_metrics_on_train_set = True

 MALA will evaluate the validation metrics on the training set as well as the validation set.
```
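Collecting the renamed logging options from the hunks above in one place: this sketch uses a plain namespace as a stand-in for ``mala.Parameters`` so that it runs without MALA installed. The attribute names follow the renamed API in this commit; the values are illustrative only.

```python
from types import SimpleNamespace

# Stand-in for mala.Parameters(); the real object exposes the same
# attribute names after this commit's renames.
parameters = SimpleNamespace(running=SimpleNamespace())

parameters.running.logger = "tensorboard"             # or "wandb"
parameters.running.logging_dir = "mala_logs"          # where logging data is saved
parameters.running.logging_metrics = ["ldos", "band_energy"]
parameters.running.logging_metrics_interval = 5       # evaluate metrics every 5 epochs
parameters.running.log_metrics_on_train_set = True    # also check for overfitting
```

This is the post-rename vocabulary: ``validation_metrics`` became ``logging_metrics``, ``validate_every_n_epochs`` became ``logging_metrics_interval``, and ``validate_on_training_data`` became ``log_metrics_on_train_set``.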

docs/source/install/installing_mala.rst

Lines changed: 1 addition & 1 deletion
```diff
@@ -45,7 +45,7 @@ itself is subject to ongoing development as well.

    git clone https://github.com/mala-project/test-data ~/path/to/data/repo
    cd ~/path/to/data/repo
-   git checkout v1.8.1
+   git checkout 2.0.0

 * Export the path to that repo by ``export MALA_DATA_REPO=~/path/to/data/repo``
```

examples/advanced/ex01_checkpoint_training.py

Lines changed: 3 additions & 3 deletions
```diff
@@ -57,10 +57,10 @@ def initial_setup():
         data_handler.output_dimension,
     ]

-    test_network = mala.Network(parameters)
-    test_trainer = mala.Trainer(parameters, test_network, data_handler)
+    network = mala.Network(parameters)
+    trainer = mala.Trainer(parameters, network, data_handler)

-    return parameters, test_network, data_handler, test_trainer
+    return parameters, network, data_handler, trainer


 if mala.Trainer.run_exists("ex01_checkpoint", path="./"):
```
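The surrounding example follows a resume-or-start pattern. A minimal sketch with a hypothetical stand-in for ``mala.Trainer.run_exists`` (the checkpoint file name used here is an assumption, not MALA's actual naming scheme):

```python
import os

def run_exists(name, path="./"):
    # Hypothetical stand-in: the real mala.Trainer.run_exists checks for
    # the checkpoint files a previous run with this name left behind.
    return os.path.exists(os.path.join(path, name + ".params.json"))

if run_exists("ex01_checkpoint"):
    print("checkpoint found, resuming training")
else:
    print("no checkpoint, running initial setup")
```

The branch on ``run_exists`` lets the same script be resubmitted repeatedly: it resumes when a checkpoint is present and falls back to ``initial_setup()`` otherwise.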

examples/advanced/ex03_tensor_board.py

Lines changed: 2 additions & 2 deletions
```diff
@@ -27,8 +27,8 @@
 # files into.
 parameters.running.logger = "tensorboard"
 parameters.running.logging_dir = "mala_vis"
-parameters.running.validation_metrics = ["ldos", "band_energy"]
-parameters.running.validate_every_n_epochs = 5
+parameters.running.logging_metrics = ["ldos", "band_energy"]
+parameters.running.logging_metrics_interval = 5

 data_handler = mala.DataHandler(parameters)
 data_handler.add_snapshot(
```
