Commit 6d0e17a: "test&finish the slurm MWE" (1 parent d8f94ab)

1 file changed: docs/src/slurm.md (74 additions & 4 deletions)
Use of DistributedData with [Slurm](https://slurm.schedmd.com/overview.html) is
similar to its use with many other distributed computing systems:

1. You install Julia and the required packages.
2. You submit a batch (or interactive) task, which runs your Julia script on a
   node and gives it some information about where to find the other worker nodes.
3. In the Julia script, you use the
   [`ClusterManagers`](https://github.com/JuliaParallel/ClusterManagers.jl)
   function `addprocs_slurm` to add the processes, just as with the normal
   `addprocs`. Similar functions exist for many other task schedulers,
   including the popular PBS and LSF.
4. The rest of the workflow is unchanged; all functions from `DistributedData`,
   such as `save_at` and `dmapreduce`, will work in the cluster just as they
   worked locally. Performance will vary, though -- you may want to optimize
   your algorithm to use as much parallelism as possible (to get lots of
   [...] the communication overhead, transferring only the minimal required
   amount of data as seldom as possible.

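The handoff in step 2 happens through Slurm's environment variables, which the job inherits. A plain-shell illustration (the values below are faked; inside a real job Slurm exports them itself, and a common pattern is to pass `ENV["SLURM_NTASKS"]` to `addprocs_slurm`):

```shell
# Faked values for illustration only; inside a real Slurm job these
# variables are exported by Slurm itself.
SLURM_NTASKS=4
SLURM_JOB_NODELIST="node[001-002]"

# A Julia script can read these (via ENV) to decide how many workers
# addprocs_slurm should start, and on which nodes.
echo "allocation: $SLURM_NTASKS tasks on $SLURM_JOB_NODELIST"
```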
### Preparing the packages

The easiest way to install the packages is using a single-machine interactive
job. On the access node of your HPC, run:
```sh
srun --pty -n1 -c1 -t60 --mem 1G /bin/bash
```

When the shell opens (the prompt should change), you can load the Julia module,
usually with a command such as this:

```sh
module load lang/Julia/1.3.0
```

(You may consult `module avail` for other available Julia versions.)

After that, start `julia` and press `]` to open the packaging prompt (you
should see `(v1.3) pkg>` instead of `julia>`). There you can download and
install the required packages:
```
add DistributedData, ClusterManagers
```

You may also want to load the packages to precompile them, which saves precious
time later in the workflows. Press backspace to return to the "normal" Julia
shell, and type:
```julia
using DistributedData, ClusterManagers
```

Depending on the package, this may take a while, but it should finish in under
a minute for most packages. Finally, press `Ctrl+D` twice to exit both Julia
and the interactive Slurm job shell.

### Batch script

An example Slurm batch script is below; save it as `run-analysis.batch` on your
Slurm access node, in a directory that is shared with the workers (usually
a subdirectory of `/scratch`):
```sh
#!/bin/bash -l
# [...]
```
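As an illustration only, a batch script for this kind of job is typically shaped like the following sketch; every directive value here (job name, task count, time limit, memory, module version) is a hypothetical placeholder, not the author's configuration:

```shell
#!/bin/bash -l
#SBATCH -J analysis          # job name (placeholder)
#SBATCH -n 16                # number of tasks, i.e. Julia workers (placeholder)
#SBATCH -c 1                 # CPU cores per task
#SBATCH -t 60                # time limit in minutes (placeholder)
#SBATCH --mem-per-cpu 1G     # memory per core (placeholder)

# Load the same Julia module as in the interactive session, then run the script.
module load lang/Julia/1.3.0
julia run-analysis.jl
```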
The parameters are, in order:
- [...] administrators)
- finally, it will run the Julia script `run-analysis.jl`

### Julia script

The `run-analysis.jl` may look as follows:
```julia
using Distributed, ClusterManagers, DistributedData
# [...]
```
The batch can be submitted with:
```sh
sbatch run-analysis.batch
```

After your task gets queued, executed, and finished successfully, you may see
the result in `result.txt`. In the meantime, you can entertain yourself by
watching `squeue` to see, e.g., the expected execution time of your batch.
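Instead of watching `squeue` by hand, a tiny polling loop can tell you when the result file appears. A self-contained sketch (the file is pre-created in a temporary directory so the loop exits immediately when run outside a cluster; in real use you would point it at your job's `result.txt`):

```shell
# Poll for the result file; pre-created here only so the sketch terminates
# right away when there is no Slurm job producing it.
dir=$(mktemp -d)
touch "$dir/result.txt"        # in real use, the Slurm job creates this

while [ ! -f "$dir/result.txt" ]; do
    sleep 10                   # check every 10 seconds
done
echo "result is ready in $dir/result.txt"
```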

Note that you may want to run the analysis in a separate directory, because the
logs from all workers are collected in the current path by default. The
resulting files may look like this:
```
0 [user@access1 test]$ ls
job0000.out job0019.out job0038.out job0057.out job0076.out job0095.out job0114.out
job0001.out job0020.out job0039.out job0058.out job0077.out job0096.out job0115.out
job0002.out job0021.out job0040.out job0059.out job0078.out job0097.out job0116.out
job0003.out job0022.out job0041.out job0060.out job0079.out job0098.out job0117.out
job0004.out job0023.out job0042.out job0061.out job0080.out job0099.out job0118.out
job0005.out job0024.out job0043.out job0062.out job0081.out job0100.out job0119.out
job0006.out job0025.out job0044.out job0063.out job0082.out job0101.out job0120.out
job0007.out job0026.out job0045.out job0064.out job0083.out job0102.out job0121.out
job0008.out job0027.out job0046.out job0065.out job0084.out job0103.out job0122.out
job0009.out job0028.out job0047.out job0066.out job0085.out job0104.out job0123.out
job0010.out job0029.out job0048.out job0067.out job0086.out job0105.out job0124.out
job0011.out job0030.out job0049.out job0068.out job0087.out job0106.out job0125.out
job0012.out job0031.out job0050.out job0069.out job0088.out job0107.out job0126.out
job0013.out job0032.out job0051.out job0070.out job0089.out job0108.out job0127.out
job0014.out job0033.out job0052.out job0071.out job0090.out job0109.out result.txt <-- here is the result!
job0015.out job0034.out job0053.out job0072.out job0091.out job0110.out run-analysis.jl
job0016.out job0035.out job0054.out job0073.out job0092.out job0111.out run-analysis.sbatch
job0017.out job0036.out job0055.out job0074.out job0093.out job0112.out slurm-2237171.out
job0018.out job0037.out job0056.out job0075.out job0094.out job0113.out
```

The files `jobXXXX.out` contain the information collected from individual
workers' standard outputs, such as the output of `println` or `@info`. For
complicated programs, this is the easiest way to get debugging information
out, and a simple but informative way to collect benchmarking output (using
e.g. `@time`).
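Because each worker's standard output lands in its own `jobXXXX.out`, benchmarking lines can be pulled together with ordinary shell tools. A self-contained sketch with fabricated log contents (the `seconds` pattern matches the typical shape of `@time` output):

```shell
# Fabricated worker logs, stand-ins for real jobXXXX.out files.
dir=$(mktemp -d)
printf '  1.234567 seconds (1.00 k allocations)\n' > "$dir/job0000.out"
printf '  0.987654 seconds (2.00 k allocations)\n' > "$dir/job0001.out"

# Collect the @time-style lines from all worker logs at once.
grep -h "seconds" "$dir"/job*.out
```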
