
Commit 2ed1f2f

Merge pull request mala-project#664 from RandomDefaultUser/optuna_resume_workflow_docs
Added Optuna resume related docs
2 parents d4f4419 + 4f02cf5 commit 2ed1f2f

1 file changed: docs/source/advanced_usage/hyperparameters.rst (28 additions & 0 deletions)

@@ -96,6 +96,34 @@ are started with ``wait_time`` time interval in between (to avoid race
conditions when accessing the same database) and further only use the
database, not MPI, for communication.

The batch job on your HPC cluster will be killed after the designated
runtime; any unfinished trials then remain in the Optuna database in the
RUNNING state.
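
To check which trials are affected, you can inspect the database directly
with plain Optuna (a minimal sketch, assuming the study name ``hyperopt01``
and the storage ``sqlite:///hyperopt.db`` used in the command below):

.. code-block:: python

   import optuna
   from optuna.trial import TrialState

   # Load the existing study from the shared database.
   study = optuna.load_study(
       study_name="hyperopt01", storage="sqlite:///hyperopt.db"
   )

   # Trials whose batch job was killed are stuck in the RUNNING state.
   for state in (TrialState.COMPLETE, TrialState.RUNNING, TrialState.WAITING):
       n = sum(1 for t in study.get_trials(deepcopy=False) if t.state == state)
       print(state.name, n)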

The current workflow for resuming the study, which makes use of MALA's own
resume tooling
(see ``examples/advanced/ex05_checkpoint_hyperparameter_optimization.py``),
is as follows: before resubmitting the batch job and letting the script do
the resume work, a user needs to modify the database like so:

.. code-block:: bash

   python3 -c "import mala; mala.HyperOptOptuna.requeue_zombie_trials('hyperopt01', 'sqlite:///hyperopt.db')"
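
Equivalently, from a regular Python script (the same call as the one-liner
above, just spelled out):

.. code-block:: python

   import mala

   # Move zombie trials (still marked RUNNING after the job was killed)
   # back to the WAITING state so the resumed study re-runs them.
   mala.HyperOptOptuna.requeue_zombie_trials(
       "hyperopt01", "sqlite:///hyperopt.db"
   )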

This will set the RUNNING trials to the WAITING state. When Optuna resumes,
it will pick up and re-run those trials before carrying on with the rest of
the resumed study.
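
The re-run happens because Optuna dequeues WAITING trials before asking the
sampler for new suggestions. A toy illustration of that resume behavior (the
objective here is a stand-in; in MALA, the resumed ``HyperOptOptuna``
instance supplies the real one):

.. code-block:: python

   import optuna

   study = optuna.load_study(
       study_name="hyperopt01", storage="sqlite:///hyperopt.db"
   )

   def objective(trial):
       # Placeholder; the real study optimizes MALA hyperparameters.
       x = trial.suggest_float("x", -10.0, 10.0)
       return x ** 2

   # WAITING trials are evaluated first, then the sampler resumes
   # suggesting fresh trials for the remaining budget.
   study.optimize(objective, n_trials=10)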

Common questions related to this feature:

- "Does 'injecting' jobs like this disturb Optuna's operation in any way?":
  No, the study object takes all of its information directly from the
  database, which in this case now contains "WAITING" trials.
- "Do those trials have to be run?": Technically not. One could simply
  ignore them and resume without them. The problem is that the study would
  then be missing data points from trials that were suggested for a reason,
  so even though Optuna would resume fine, we still want to re-run them
  from an optimization point of view.

If you do distributed hyperparameter optimization, another useful option
is
