Use of DistributedData with [Slurm](https://slurm.schedmd.com/overview.html) is
similar to its use with many other distributed computing systems:

1. You install Julia and the required packages.
2. You submit a batch (or interactive) task, which runs your Julia script on a
   node and gives it some information about where to find other worker nodes.
3. In the Julia script, you use the
   [`ClusterManagers`](https://github.com/JuliaParallel/ClusterManagers.jl)
   function `addprocs_slurm` to add the processes, just as with the normal
   `addprocs` (see the short sketch after this list). Similar functions exist
   for many other task schedulers, including the popular PBS and LSF.
4. The rest of the workflow is unchanged; all functions from `DistributedData`
   such as `save_at` and `dmapreduce` will work in the cluster just as they
   worked locally. Performance will vary, though -- you may want to optimize
   your algorithm to use as much parallelism as possible (to get lots of
   speedup from the many available workers) while minimizing
   the communication overhead, transferring only the minimal required amount of
   data as seldom as possible.

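For illustration, a minimal sketch of step 3 inside a Slurm job may look like
this. `SLURM_NTASKS` is set by Slurm within the job; the commented-out local
fallback and any resource counts are assumptions that depend on what you
requested:

```julia
using Distributed, ClusterManagers

# Spawn one Julia worker per task allocated by Slurm.
# SLURM_NTASKS is exported by Slurm inside the job allocation.
addprocs_slurm(parse(Int, ENV["SLURM_NTASKS"]))

# Outside of a cluster, the equivalent local call would simply be e.g.:
# addprocs(4)
```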
### Preparing the packages

The easiest way to install the packages is to use a single-machine interactive
job. On the access node of your HPC, run:
```sh
srun --pty -n1 -c1 -t60 --mem 1G /bin/bash
```

When the shell opens (the prompt should change), you can load the Julia module,
usually with a command such as this:

```sh
module load lang/Julia/1.3.0
```

(You may consult `module avail` for other available Julia versions.)

After that, start `julia` and press `]` to open the package manager prompt (you
should see `(v1.3) pkg>` instead of `julia>`). There you can download and
install the required packages:
```
add DistributedData, ClusterManagers
```
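If you prefer to avoid the interactive prompt, the same installation can be done
non-interactively (after loading the Julia module as above); this is just an
equivalent alternative that uses the standard `Pkg` API:

```sh
julia -e 'using Pkg; Pkg.add(["DistributedData", "ClusterManagers"])'
```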

You may also want to load the packages to precompile them, which saves precious
time later in the workflows. Press backspace to return to the "normal" Julia
shell, and type:
```julia
using DistributedData, ClusterManagers
```

Depending on the package, this may take a while, but it should finish in under a
minute for most packages. Finally, press `Ctrl+D` twice to exit both
Julia and the interactive Slurm job shell.

### Batch script

An example Slurm batch script follows; save it as `run-analysis.batch` on your
Slurm access node, in a directory that is shared with the workers (usually
a subdirectory of `/scratch`):
```sh
#!/bin/bash -l
...
```

The parameters are, in order:

- ...
- ... administrators)
- finally, it will run the Julia script `run-analysis.jl`

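As a rough illustration of what goes into such a batch script: the `#SBATCH`
values, the module name, and the resource numbers below are assumptions that you
will need to adapt to your cluster and your analysis; only the overall shape of
the script matters here:

```sh
#!/bin/bash -l
#SBATCH -J run-analysis       # job name
#SBATCH -n 16                 # number of Slurm tasks (= Julia workers to start)
#SBATCH -c 1                  # CPU cores per task
#SBATCH -t 30                 # time limit in minutes
#SBATCH --mem-per-cpu=4G      # memory per core

# Load the same Julia module as in the interactive session.
module load lang/Julia/1.3.0

# Run the driver script; it connects the workers itself via addprocs_slurm.
julia run-analysis.jl
```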
### Julia script
The `run-analysis.jl` may look as follows:
```julia
using Distributed, ClusterManagers, DistributedData
...
```
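A fuller version of such a script could be structured as shown below. The
dataset, the computation, and the exact call shapes of `save_at` and
`dmapreduce` (modeled on the package README) are assumptions; treat this as a
template rather than the definitive script:

```julia
using Distributed, ClusterManagers, DistributedData

# Connect the workers that Slurm allocated for this job.
n_workers = parse(Int, ENV["SLURM_NTASKS"])
addprocs_slurm(n_workers)

# Make the package available on all workers.
@everywhere using DistributedData

# Store a chunk of data under the name :data on each worker; the quoted
# expression is evaluated remotely, so the data never passes through the master.
for w in workers()
    save_at(w, :data, :(rand(10_000)))
end

# Map-reduce over the distributed data: sum each worker's chunk, add the sums.
total = dmapreduce(:data, sum, +)

# Write the result where the batch section expects it.
write("result.txt", string(total, "\n"))
```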
The analysis is then submitted to the queue with `sbatch run-analysis.batch`.

After your task gets queued, executed, and finished successfully, you may see
the result in `result.txt`. In the meantime, you can entertain yourself by
watching `squeue`, to see e.g. the expected execution time of your batch.

Note that you may want to run the analysis in a separate directory, because the
logs from all workers are collected in the current path by default. The
resulting files may look like this:
```
0 [user@access1 test]$ ls
job0000.out job0019.out job0038.out job0057.out job0076.out job0095.out job0114.out
job0001.out job0020.out job0039.out job0058.out job0077.out job0096.out job0115.out
job0002.out job0021.out job0040.out job0059.out job0078.out job0097.out job0116.out
job0003.out job0022.out job0041.out job0060.out job0079.out job0098.out job0117.out
job0004.out job0023.out job0042.out job0061.out job0080.out job0099.out job0118.out
job0005.out job0024.out job0043.out job0062.out job0081.out job0100.out job0119.out
job0006.out job0025.out job0044.out job0063.out job0082.out job0101.out job0120.out
job0007.out job0026.out job0045.out job0064.out job0083.out job0102.out job0121.out
job0008.out job0027.out job0046.out job0065.out job0084.out job0103.out job0122.out
job0009.out job0028.out job0047.out job0066.out job0085.out job0104.out job0123.out
job0010.out job0029.out job0048.out job0067.out job0086.out job0105.out job0124.out
job0011.out job0030.out job0049.out job0068.out job0087.out job0106.out job0125.out
job0012.out job0031.out job0050.out job0069.out job0088.out job0107.out job0126.out
job0013.out job0032.out job0051.out job0070.out job0089.out job0108.out job0127.out
job0014.out job0033.out job0052.out job0071.out job0090.out job0109.out result.txt <-- here is the result!
job0015.out job0034.out job0053.out job0072.out job0091.out job0110.out run-analysis.jl
job0016.out job0035.out job0054.out job0073.out job0092.out job0111.out run-analysis.sbatch
job0017.out job0036.out job0055.out job0074.out job0093.out job0112.out slurm-2237171.out
job0018.out job0037.out job0056.out job0075.out job0094.out job0113.out
```

The files `jobXXXX.out` contain the information collected from the individual
workers' standard outputs, such as the output of `println` or `@info`. For
complicated programs, this is the easiest way to get debugging information out
of the workers, and a simple but informative way to collect benchmarking output
(using e.g. `@time`).
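
For example, wrapping the per-worker part of a computation in `@time` (the
dataset name `:data` and the `dmapreduce` call shape are the same assumptions as
in the script sketch above) makes each worker print its own timing, which then
shows up in that worker's `jobXXXX.out`:

```julia
# Each worker times its own chunk; the printed timing lands in its job*.out log.
dmapreduce(:data, d -> @time(sum(d)), +)
```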