are started with a ``wait_time`` time interval in between (to avoid race
conditions when accessing the same database) and further only use the
database, not MPI, for communication.

The batch job on your HPC cluster will get killed after the designated
runtime. Unfinished trials will then remain in the Optuna database in the
state RUNNING.

The current workflow for resuming the study, which makes use of MALA's own
resume tooling
(see ``examples/advanced/ex05_checkpoint_hyperparameter_optimization.py``),
is as follows: before submitting the batch job again and letting the script
do the resume work, a user needs to modify the database like so:

.. code-block:: bash

    python3 -c "import mala; mala.HyperOptOptuna.requeue_zombie_trials('hyperopt01', 'sqlite:///hyperopt.db')"

This will set the RUNNING trials to the state WAITING. When Optuna resumes,
it will pick up and re-run those trials before carrying on with the resumed
study.
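
To verify what this does, you can inspect the trial states in the database
directly with Optuna's public API. The following is a minimal sketch (not
part of MALA), using the study name and SQLite storage URL from the command
above:

.. code-block:: python

    from collections import Counter

    import optuna

    # Load the study that MALA created in the Optuna database.
    study = optuna.load_study(
        study_name="hyperopt01", storage="sqlite:///hyperopt.db"
    )

    # Count trials per state. After a killed batch job you would typically
    # see a few RUNNING ("zombie") trials here; after requeue_zombie_trials
    # they show up as WAITING instead.
    print(Counter(str(trial.state) for trial in study.trials))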

Common questions related to this feature:

- "Does 'injecting' trials like this disturb Optuna's operation in any way?":
  No, the study object takes all of its information directly from the
  database, which in this case now contains WAITING trials.
- "Do those trials have to be run?": Technically not. One could simply ignore
  them and re-run without them. The problem is that in this case, the study
  will be missing data points from trials that were suggested for a reason, so
  even though Optuna would resume fine, we still want to re-run them from an
  optimization point of view.

If you do distributed hyperparameter optimization, another useful option
is