Commit 8e734c5

Slurm job submission from EasyBuild a bit reworked.
1 parent 492683e commit 8e734c5

1 file changed

Lines changed: 69 additions & 40 deletions

docs/2022-CSC_and_LO/3_03_slurm_jobs.md

@@ -6,12 +6,30 @@
66

77
EasyBuild can submit jobs to different backends including Slurm to install software,
88
to *distribute* the often time-consuming installation of a set of software applications and
9-
the dependencies they require to a cluster.
9+
the dependencies they require to a cluster. Each individual package is installed in a separate
10+
job and job dependencies are used to manage the dependencies between packages so that no build
11+
is started before the dependencies are in place.
1012

1113
This is done via the ``--job`` command line option.
1214

1315
It is important to be aware of some details before you start using this, which we'll cover here.
1416

17+
!!! Warning "This section is not supported on LUMI, use at your own risk"
18+
19+
EasyBuild on LUMI is currently not fully configured to support job submission via Slurm. Several
20+
changes would be needed to the configuration of EasyBuild, including the location of the
21+
temporary files and build directory. Those have to be made by hand.
22+
23+
Due to the setup of the central software stack, this feature is currently useless to install
24+
the central stack. For user installations, there are also limitations as the environment
25+
on the compute nodes is different from the login nodes so, e.g., different locations for
26+
temporary files are being used. These would only be refreshed if the EasyBuild configuration
27+
modules are reloaded on the compute nodes which cannot be done currently in the way Slurm
28+
job submission is set up in EasyBuild.
29+
30+
Use material in this section with care; it has not been completely tested.
31+
32+
1533
## Configuration
1634

1735
The EasyBuild configuration that is active at the time that ``eb --job`` is used
@@ -25,6 +43,8 @@ that are specified via an [EasyBuild configuration file](configuration.md#config
2543
This implies that any EasyBuild configuration files or ``$EASYBUILD_*`` environment variables
2644
that are in place in the job environment are most likely *irrelevant*, since configuration settings
2745
they specify will most likely be overruled by the corresponding command line options.
46+
It also implies, however, that the EasyBuild configuration that is in place when ``eb --job`` is used
47+
must also work on the compute nodes to which the job is submitted.
2848

2949

3050
## Using ``eb --job``
@@ -39,6 +59,9 @@ to ``Slurm``, for example by setting the corresponding environment variable:
3959
export EASYBUILD_JOB_BACKEND='Slurm'
4060
```
4161

62+
On LUMI this is taken care of in the EasyBuild configuration modules such as ``EasyBuild-user``.
63+
64+
4265
### Job resources
4366

4467
To submit an installation as a job, simply use ``eb --job``:
@@ -73,13 +96,13 @@ For example, to specify a particular account that should be used for the jobs su
7396
(equivalent with using the ``-A`` or ``--account`` command line option for ``sbatch``):
7497

7598
```shell
76-
export SBATCH_ACCOUNT='example_project'
99+
export SBATCH_ACCOUNT='project_XXXXXXXXX'
77100
```
78101

79102
Or to submit to a particular Slurm partition (equivalent with the ``-p`` or ``--partition`` option for ``sbatch``):
80103

81104
```shell
82-
export SBATCH_PARTITION='example_partition'
105+
export SBATCH_PARTITION='small'
83106
```
84107

85108
For more information about supported ``$SBATCH_*`` environment variables,
@@ -113,24 +136,29 @@ as jobs, to avoid that they fail almost instantly due to a lack of disk space.
113136
Keep in mind that the active EasyBuild configuration is passed down into the submitted jobs,
114137
so any configuration that is present on the workernodes may not have any effect.
115138

116-
For example, if you commonly use `/tmp/$USER` for build directories on a login node,
117-
you may need to tweak that when submitting jobs to use a different location:
139+
For example, on LUMI it is possible to use ``$XDG_RUNTIME_DIR`` on the login nodes which has
140+
the advantage that any leftovers of failed builds will be cleaned up when the user ends their last
141+
login session on that node, but it is not possible to do so on the compute nodes.
118142

119143
```shell
120144
# EasyBuild is configured to use /tmp/$USER on the login node
121-
login01 $ eb --show-config | grep buildpath
122-
buildpath (E) = /tmp/example
145+
uan01 $ eb --show-config | grep buildpath
146+
buildpath (E) = /run/user/XXXXXXXX/easybuild/build
123147

124-
# use /localdisk/$USER for build directories when submitting installations as jobs
125-
login01 $ eb --job --buildpath /localdisk/$USER example.eb --robot
148+
# use /dev/shm/$USER for build directories when submitting installations as jobs
149+
login01 $ eb --job --buildpath /dev/shm/$USER/easybuild example.eb --robot
126150
```
127151

152+
128153
### Temporary log files and build directories
129154

130-
The temporary log file that EasyBuild creates is most likely going to end up on the local disk
131-
of the workernode on which the job was started (by default in `$TMPDIR` or `/tmp`).
132-
If an installation fails, the job will finish and temporary files will likely be cleaned up instantly,
133-
which may leave you wondering about the actual cause of the failing installation...
155+
The problems for the temporary log files are twofold. First, they may end up in a place
156+
that is not available on the compute nodes. E.g., for the same reasons as for the build
157+
path, the LUMI EasyBuild configuration will place the temporary files in a subdirectory of
158+
``$XDG_RUNTIME_DIR`` on the login nodes but a subdirectory of ``/dev/shm/$USER`` on the
159+
compute nodes. The second problem however is that if an installation fails, those log files are
160+
not even accessible anymore which may leave you wondering about the actual cause of the failing
161+
installation...
134162

135163
To remedy this, there are a couple of EasyBuild configuration options you can use:
136164

@@ -139,18 +167,21 @@ To remedy this, there are a couple of EasyBuild configuration options you can us
139167
```shell
140168
$ eb --job example.eb --tmp-logdir $HOME/eb_tmplogs
141169
```
170+
This will move at least the log file to a suitable place.
142171

143172
* If you prefer having the entire log file stored in the Slurm job output files,
144173
you can use ``--logtostdout`` when submitting the jobs. This will result in extensive logging
145174
to your terminal window when submitting the jobs, but it will also make EasyBuild
146175
log to ``stdout`` when the installation is running in the job, and hence the log messages will be
147176
captured in the job output files.
148177

149-
The same remark applies to build directories: they should be on a local filesystem (to avoid problems
150-
that often occur when building software on a parallel filesystem like GPFS or Lustre),
151-
which will probably be cleaned up automatically when a job fails. Here it is less easy to provide
152-
general advice on how to deal with this, but one thing you can consider is retrying the installation
153-
in an interactive job, so you can inspect the build directory after the installation fails.
178+
The build directory of course also suffers from the problem of being no longer accessible if the
179+
installation fails, but there it is not so easy to find a solution. Building on a shared file system
180+
is not only much slower, but in particular on parallel file systems like GPFS/SpectrumScale, Lustre
181+
or BeeGFS building sometimes fails in strange ways. One thing you can consider if you cannot do the
182+
build on a login node (e.g., because the code is not suitable for cross-compiling or the configure
183+
system does tests that would fail on the login node), is to retry the installation in an
184+
interactive job, so you can inspect the build directory after the installation fails.
154185

155186
### Lock files
156187

@@ -171,37 +202,37 @@ subdirectory of ``installpath``) manually, or re-submit the job with ``eb --job
171202

172203
As an example, we will let EasyBuild submit jobs to install ``AUGUSTUS`` with the ``foss/2020b`` toolchain.
173204

205+
!!! Warning "This example does not work on LUMI"
206+
207+
Note that this is an example using the FOSS common toolchain. For this reason it does not work on
208+
LUMI.
209+
174210
### Configuration
175211

176212
Before using ``--job``, let's make sure that EasyBuild is properly configured:
177213

178214
```shell
179-
# use $HOME/easybuild for software, modules, sources, etc.
180-
export EASYBUILD_PREFIX=$HOME/easybuild
215+
# Load the EasyBuild-user module (central installations will not work at all
216+
# using job submission)
217+
module load LUMI/21.12
218+
module load partition/C
219+
module load EasyBuild-user
181220

182221
# use ramdisk for build directories
183-
export EASYBUILD_BUILDPATH=/dev/shm/$USER
222+
export EASYBUILD_BUILDPATH=/dev/shm/$USER/build
223+
export EASYBUILD_TMPDIR=/dev/shm/$USER/tmp
184224

185225
# use Slurm as job backend
186226
export EASYBUILD_JOB_BACKEND=Slurm
187227
```
188228

189-
In addition, add the path to the centrally installed software to ``$MODULEPATH`` via ``module use``:
190229

191-
```shell
192-
module use /easybuild/modules/all
193-
```
194-
195-
Load the EasyBuild module:
196-
197-
```shell
198-
module load EasyBuild
199-
```
200-
201-
Let's assume that we also need to inform Slurm that jobs should be submitted into a particular account:
230+
We will also need to inform Slurm that jobs should be submitted into a particular account, and
231+
in a particular partition:
202232

203233
```shell
204-
export SBATCH_ACCOUNT=example_project
234+
export SBATCH_ACCOUNT=project_XXXXXXXXX
235+
export SBATCH_PARTITION='small'
205236
```
206237

207238
This will be picked up by the ``sbatch`` commands that EasyBuild will run to submit the software installation jobs.
@@ -234,14 +265,14 @@ $ eb AUGUSTUS-3.4.0-foss-2020b.eb --missing
234265
Several dependencies are not installed yet, so we will need to use ``--robot`` to ensure that
235266
EasyBuild also submits jobs to install these first.
236267

237-
To speed up the installations a bit, we will request 10 cores for each submitted job (via ``--job-cores``).
268+
To speed up the installations a bit, we will request 8 cores for each submitted job (via ``--job-cores``).
238269
That should be sufficient to let each installation finish in (well) under 1 hour,
239270
so we only request 1 hour of walltime per job (via ``--job-max-walltime``).
240271

241272
In order to have some meaningful job output files, we also enable trace mode (via ``--trace``).
242273

243274
```
244-
$ eb AUGUSTUS-3.4.0-foss-2020b.eb --job --job-cores 10 --job-max-walltime 1 --robot --trace
275+
$ eb AUGUSTUS-3.4.0-foss-2020b.eb --job --job-cores 8 --job-max-walltime 1 --robot --trace
245276
...
246277
== resolving dependencies ...
247278
...
@@ -278,7 +309,7 @@ these jobs will be able to start.
278309
After about 20 minutes, AUGUSTUS and all missing dependencies should be installed:
279310

280311
```
281-
$ ls -lrt $HOME/easybuild/modules/all/*/*.lua | tail -11
312+
$ ls -lrt $HOME/EasyBuild/modules/.../*.lua | tail -11
282313
-rw-rw----. 1 example example 1634 Mar 29 10:13 /users/example/easybuild/modules/all/HTSlib/1.11-GCC-10.2.0.lua
283314
-rw-rw----. 1 example example 1792 Mar 29 10:13 /users/example/easybuild/modules/all/SAMtools/1.11-GCC-10.2.0.lua
284315
-rw-rw----. 1 example example 1147 Mar 29 10:13 /users/example/easybuild/modules/all/BamTools/2.5.1-GCC-10.2.0.lua
@@ -291,11 +322,9 @@ $ ls -lrt $HOME/easybuild/modules/all/*/*.lua | tail -11
291322
-rw-rw----. 1 example example 1365 Mar 29 10:28 /users/example/easybuild/modules/all/SuiteSparse/5.8.1-foss-2020b-METIS-5.1.0.lua
292323
-rw-rw----. 1 example example 2233 Mar 29 10:30 /users/example/easybuild/modules/all/AUGUSTUS/3.4.0-foss-2020b.lua
293324
294-
$ module use $HOME/easybuild/modules/all
295-
296325
$ module avail AUGUSTUS
297326
298-
-------- /users/hkenneth/easybuild/modules/all --------
327+
-- EasyBuild managed user software for software stack ... --
299328
AUGUSTUS/3.4.0-foss-2020b
300329
```
301330

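As a recap of the reworked example, the settings it uses can be collected in one small script. This is only a sketch under stated assumptions: `eb_job_command` is a hypothetical helper that is not part of EasyBuild, the account name is the placeholder from the diff, and the script merely prints the `eb --job` command line rather than submitting anything, so it can be reviewed before use.

```shell
#!/usr/bin/env bash
# Sketch: gather the settings from the changed section and print the
# resulting `eb --job` command line without running it.

# Hypothetical helper (not part of EasyBuild): builds the command line used
# in the example from an easyconfig name, a core count and a walltime in hours.
eb_job_command() {
    local easyconfig="$1"
    local cores="${2:-8}"            # default taken from the updated example
    local walltime_hours="${3:-1}"   # default taken from the updated example
    printf 'eb %s --job --job-cores %s --job-max-walltime %s --robot --trace\n' \
        "$easyconfig" "$cores" "$walltime_hours"
}

# Fall back to `id -un` in case $USER is not set in the current environment.
user_name="${USER:-$(id -un)}"

# Values as they appear in the diff; the account name is a placeholder.
export EASYBUILD_JOB_BACKEND='Slurm'
export EASYBUILD_BUILDPATH="/dev/shm/${user_name}/build"
export EASYBUILD_TMPDIR="/dev/shm/${user_name}/tmp"
export SBATCH_ACCOUNT='project_XXXXXXXXX'
export SBATCH_PARTITION='small'

# Print the command for the easyconfig used in the example.
eb_job_command AUGUSTUS-3.4.0-foss-2020b.eb
```

Printing instead of executing keeps the sketch harmless on systems where job submission is not fully supported, which per the warning added in this commit includes LUMI.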