
Commit d8f94ab

add Slurm MWE
1 parent 5d92b48 commit d8f94ab

3 files changed

Lines changed: 101 additions & 1 deletion


README.md

Lines changed: 10 additions & 0 deletions
@@ -132,3 +132,13 @@ julia> gather_array(dataset) # download the data from workers to a sing
  0.610183 1.12165 0.722438
 
 ```
+
+## Using DistributedData.jl in HPC environments
+
+You can use the
+[`ClusterManagers`](https://github.com/JuliaParallel/ClusterManagers.jl)
+package to add distributed workers from many different workload managers and
+task scheduling environments, such as Slurm, PBS, LSF, and others.
+
+See the documentation for an [example of using Slurm to run
+DistributedData](https://lcsb-biocore.github.io/DistributedData.jl/stable/slurm/).

docs/src/index.md

Lines changed: 1 addition & 1 deletion
@@ -19,7 +19,7 @@ on worker-local storage (e.g. to prevent memory exhaustion) and others.
 To start quickly, you can read the tutorial:
 
 ```@contents
-Pages=["tutorial.md"]
+Pages=["tutorial.md", "slurm.md"]
 ```
 
 ### Functions

docs/src/slurm.md

Lines changed: 90 additions & 0 deletions
@@ -0,0 +1,90 @@

# DistributedData.jl on Slurm clusters

DistributedData scales well to moderately large computer clusters. As a
practical example, you can see the script that was used to process relatively
large datasets for GigaSOM, documented in the Supplementary file of the
[GigaSOM article](https://academic.oup.com/gigascience/article/9/11/giaa127/5987271)
(click the `giaa127_Supplemental_File` link below on the page, and find
*Listing S1* at the end of the PDF).

Using DistributedData with [Slurm](https://slurm.schedmd.com/overview.html) is
similar to using it with many other distributed computing systems:

1. You submit a batch (or interactive) task, which runs your Julia script on a
   node and gives it some information about where to find the other worker
   nodes.
2. In the Julia script, you use the
   [`ClusterManagers`](https://github.com/JuliaParallel/ClusterManagers.jl)
   function `addprocs_slurm` to add the processes, just as with the normal
   `addprocs`. Similar functions exist for many other task schedulers,
   including the popular PBS and LSF (see the short sketch after this list).
3. The rest of the workflow is unchanged; all functions from `DistributedData`,
   such as `save_at` and `dmapreduce`, will work in the cluster just as they
   worked locally. Performance will vary, though: you may want to optimize
   your algorithm to use as much parallelism as possible, and to load more
   data into memory (clusters usually offer much more total memory than a
   single computer), while keeping an eye on the communication overhead,
   transferring only the minimal required amount of data as seldom as
   possible.
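
Regarding step 2, the scheduler-specific entry points differ mainly in name. A
minimal sketch follows; the `addprocs_pbs` and `addprocs_lsf` calls are the
assumed ClusterManagers counterparts for PBS and LSF, so verify the exact
names and keyword arguments against your installed ClusterManagers version:

```julia
using Distributed, ClusterManagers

# Slurm, as used in the rest of this guide:
addprocs_slurm(16, topology = :master_worker)

# assumed counterparts for other schedulers (check your
# ClusterManagers version before relying on these):
# addprocs_pbs(16)   # PBS/Torque
# addprocs_lsf(16)   # LSF
```
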
An example Slurm batch script follows; save it as `run-analysis.batch` on your
Slurm gateway machine, in a directory that is shared with the workers (usually
a subdirectory of `/scratch`):
```sh
#!/bin/bash -l
#SBATCH -n 128
#SBATCH -c 1
#SBATCH -t 60
#SBATCH --mem-per-cpu 4G
#SBATCH -J MyDistributedJob

module load lang/Julia/1.3.0

julia run-analysis.jl
```

The parameters are, in order:
- run 128 "tasks" (i.e., spawn 128 separate processes)
- each process uses 1 CPU (you may want more CPUs if you work with actual
  threads and shared memory; see the sketch after this list)
- the whole batch may run for at most 60 minutes
- each CPU (in our case, each process) will be allocated 4 gigabytes of RAM
- the job will be visible in the queue as `MyDistributedJob`
- the script loads the Julia 1.3.0 module on the workers, so that the `julia`
  executable is available (you may want to check the available versions with
  your HPC administrators)
- finally, it runs the Julia script `run-analysis.jl`
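
If you do opt for several CPUs per task to use shared-memory threading, the
relevant batch-script lines could look like the sketch below. This is an
addition to the original example: it assumes your Julia version honors the
`JULIA_NUM_THREADS` environment variable, and uses `SLURM_CPUS_PER_TASK`,
the variable Slurm sets to the value of `-c`:

```sh
#SBATCH -n 16
#SBATCH -c 8

# let each Julia process use all CPUs that Slurm allocated to its task
export JULIA_NUM_THREADS=$SLURM_CPUS_PER_TASK

julia run-analysis.jl
```
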
The script `run-analysis.jl` may look as follows:
```julia
using Distributed, ClusterManagers, DistributedData

# read the number of available workers from the environment
# and start the worker processes
n_workers = parse(Int, ENV["SLURM_NTASKS"])
addprocs_slurm(n_workers, topology = :master_worker)

# load the required packages on all workers
@everywhere using DistributedData

# generate a random dataset on all workers
dataset = dtransform((), _ -> randn(10000, 10000), workers(), :myData)

# for demonstration, sum the whole dataset
totalResult = dmapreduce(dataset, sum, +)

# do not forget to save the results!
f = open("result.txt", "w")
println(f, totalResult)
close(f)
```
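
As a final hygiene step (an addition, not part of the committed script), you
may release the workers once the computation is done, using `rmprocs` from
`Distributed`:

```julia
# release the worker processes so that their resources are freed
# before the batch job ends
rmprocs(workers())
```
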

Finally, you can execute the whole thing with `sbatch`:
```sh
sbatch run-analysis.batch
```

After your task is queued, executed, and finished successfully, you should see
the result in `result.txt`. In the meantime, you can entertain yourself by
watching `squeue` to see e.g. the expected execution time of your batch.
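
For example (standard `squeue` options; the exact output columns may vary with
your site's configuration):

```sh
squeue -u $USER   # list only your jobs, with their state and elapsed time
squeue --start    # show the scheduler's estimated start times for pending jobs
```
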
