
Commit 931c7b1

Merge pull request #3 from LCSB-BioCore/docs
Add reasonable docs
2 parents bd34c68 + 0c3671d commit 931c7b1

8 files changed

Lines changed: 774 additions & 4 deletions

File tree

.github/workflows/docs.yml

Lines changed: 26 additions & 0 deletions
@@ -0,0 +1,26 @@
```yaml
# ref: https://juliadocs.github.io/Documenter.jl/stable/man/hosting/#GitHub-Actions-1
name: Documentation

on:
  push:
    branches:
      - develop
    tags: '*'
  pull_request:
  release:
    types: [published, created]

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: julia-actions/setup-julia@latest
        with:
          version: 1.5
      - name: Install dependencies
        run: julia --project=docs/ -e 'using Pkg; Pkg.develop(PackageSpec(path=pwd())); Pkg.instantiate()'
      - name: Build and deploy
        env:
          DOCUMENTER_KEY: ${{ secrets.DOCUMENTER_KEY }} # For authentication with SSH deploy key
        run: julia --project=docs/ docs/make.jl
```

README.md

Lines changed: 130 additions & 4 deletions
@@ -1,9 +1,135 @@

# DiDa.jl

Simple distributed data manipulation and processing routines for Julia.

This was originally developed for
[`GigaSOM.jl`](https://github.com/LCSB-BioCore/GigaSOM.jl); the DiDa.jl package
contains the separated-out lightweight distributed-processing framework that
was used in `GigaSOM.jl`.

## Why?

DiDa.jl provides a very simple, imperative and straightforward way to move your
data around a cluster of Julia processes created by the
[`Distributed`](https://docs.julialang.org/en/v1/stdlib/Distributed/) package,
and to run computations on the distributed pieces of data. The main aim of the
package is to avoid anything complicated: the first version used in
[GigaSOM](https://github.com/LCSB-BioCore/GigaSOM.jl) had just under 500 lines
of relatively straightforward code (including the doc-comments).

Compared to the plain `Distributed` API, you get more straightforward data
manipulation primitives, some extra control over the precise place where code
is executed, and a few high-level functions. These include a distributed
version of `mapreduce`, a simpler work-alike of the
[DistributedArrays](https://github.com/JuliaParallel/DistributedArrays.jl)
functionality, and easy-to-use distributed dataset saving and loading.

Most importantly, the main motivation behind the package is that distributed
processing should be simple and accessible.

## Brief how-to

The package provides a few very basic primitives that lightly wrap the
`Distributed` package functions `remotecall` and `fetch`. The most basic one is
`save_at`, which takes a worker ID, a variable name and the variable content, and
saves the content into the variable on the selected worker. `get_from` works the
same way, but takes the data back from the worker.

You can thus send some random array to a few distributed workers:

```julia
julia> using Distributed, DiDa

julia> addprocs(2)
2-element Array{Int64,1}:
 2
 3

julia> @everywhere using DiDa

julia> save_at(2, :x, randn(10,10))
Future(2, 1, 4, nothing)
```

The `Future` returned from `save_at` is a normal Julia future from
`Distributed`; you can even `fetch` it to wait until the operation has really
finished on the other side. Fetching the data back works the same way:

```julia
julia> get_from(2,:x)
Future(2, 1, 15, nothing)

julia> get_val_from(2,:x) # auto-fetch()ing variant
10×10 Array{Float64,2}:
 -0.850788   0.946637   1.78006  …
 -0.49596    0.497829  -2.03013  …
 ⋮
```

All commands support full quoting, which allows you to easily distinguish
between code parts that are executed locally and remotely:

```julia
julia> save_at(3, :x, randn(1000,1000)) # generates a matrix locally and sends it to the remote worker

julia> save_at(3, :x, :(randn(1000,1000))) # generates a matrix right on the remote worker and saves it there

julia> get_val_from(3, :x) # retrieves the generated matrix and fetches it

julia> get_val_from(3, :(randn(1000,1000))) # generates the matrix on the worker and fetches the data
```

Notably, this is different from the approach taken by `DistributedArrays` and
similar packages: all data manipulation is explicit, and any data type is
supported as long as it can be moved among workers by the `Distributed`
package. This helps with various highly non-array-ish data, such as large text
corpora and graphs.
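Since any serializable value can be moved this way, the same primitives work for non-array data too. A minimal sketch (the `Dict` payload and the remote `length` call are illustrative assumptions, not taken from the package docs):

```julia
using Distributed, DiDa

addprocs(1)
@everywhere using DiDa

w = workers()[1]

# store a dictionary (not an array!) under the name `d` on the worker...
fetch(save_at(w, :d, Dict("name" => "DiDa", "kind" => "package")))

# ...and run a quoted computation on it remotely, fetching only the small result
n = get_val_from(w, :(length(d)))
```

Only the two-element result travels back over the wire; the dictionary itself stays on the worker until explicitly fetched.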
There are various goodies for easy work with matrix-style data, namely
scattering, gathering and running distributed algorithms:

```julia
julia> x = randn(1000,3)
1000×3 Array{Float64,2}:
 -0.992481   0.551064   1.67424
 -0.751304  -0.845055   0.105311
 -0.712687   0.165619  -0.469055
 ⋮

julia> dataset = scatter_array(:myDataset, x, workers()) # sends slices of the array to workers
Dinfo(:myDataset, [2, 3]) # a helper for holding the variable name and the used workers together

julia> get_val_from(3, :(size(myDataset)))
(500, 3) # there's really only half of the data

julia> dmapreduce(dataset, sum, +) # MapReduce-style sum of all data
-51.64369103751014

julia> dstat(dataset, [1,2,3]) # get means and sdevs in individual columns
([-0.030724038974465212, 0.007300925745200863, -0.028220577808245786],
 [0.9917470012495775, 0.9975120525455358, 1.000243845434252])

julia> dmedian(dataset, [1,2,3]) # distributed iterative median in columns
3-element Array{Float64,1}:
  0.004742259615849834
  0.039043266340824986
 -0.05367799062404967

julia> dtransform(dataset, x -> 2 .^ x) # exponentiate all data (medians should now be around 1)
Dinfo(:myDataset, [2, 3])

julia> gather_array(dataset) # download the data from workers into a single local array
1000×3 Array{Float64,2}:
 0.502613  1.46517   3.1915
 0.594066  0.55669   1.07573
 0.610183  1.12165   0.722438
 ⋮
```

## What does the name `DiDa` mean?

**Di**stributed **Da**ta.

There is no consensus on how to pronounce the shortcut.

docs/Project.toml

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
```toml
[deps]
Documenter = "e30172f5-a6a5-5a46-863b-614d45cd2de4"
DocumenterTools = "35a29f4d-8980-5a13-9543-d66fff28ecb8"
```

docs/make.jl

Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
```julia
using Documenter, DiDa

makedocs(modules = [DiDa],
    clean = false,
    format = Documenter.HTML(prettyurls = !("local" in ARGS)),
    sitename = "DiDa.jl",
    authors = "The developers of DiDa.jl",
    linkcheck = !("skiplinks" in ARGS),
    pages = [
        "Home" => "index.md",
        "Tutorial" => "tutorial.md",
        "Functions" => "functions.md",
    ],
)

deploydocs(
    repo = "github.com/LCSB-BioCore/DiDa.jl.git",
    target = "build",
    branch = "gh-pages",
    devbranch = "develop",
    versions = "stable" => "v^",
)
```

docs/src/assets/logo.svg

Lines changed: 126 additions & 0 deletions

docs/src/functions.md

Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@

# Functions

## Data structures

```@autodocs
Modules = [DiDa]
Pages = ["structs.jl"]
```

## Base functions

```@autodocs
Modules = [DiDa]
Pages = ["base.jl"]
```

## Higher-level array operations

```@autodocs
Modules = [DiDa]
Pages = ["tools.jl"]
```

## Input/Output

```@autodocs
Modules = [DiDa]
Pages = ["io.jl"]
```

docs/src/index.md

Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@

# DiDa.jl — simple work with distributed data

This package provides relatively simple distributed data manipulation and
processing routines for Julia.

The design of the package and its data manipulation approach is deliberately
"imperative" and "hands-on", to allow as much user influence on the actual way
the data are moved and stored in the cluster as possible. It uses the
`Distributed` package and its infrastructure of workers, and provides a few
very basic primitives that lightly wrap the `Distributed` package functions
`remotecall` and `fetch`.

There are also various extra functions to easily run distributed data
transformations and MapReduce-style algorithms, to store and load the data on
worker-local storage (e.g. to prevent memory exhaustion), and others.
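Such a workflow can be sketched end to end. This is a minimal illustration that only uses functions shown in the package README (`scatter_array`, `dtransform`, `dmapreduce`, `gather_array`); the data and variable name are made up:

```julia
using Distributed, DiDa

addprocs(2)
@everywhere using DiDa

# scatter a local matrix across the workers, slice by slice
d = scatter_array(:vals, randn(1000, 3), workers())

# transform the distributed pieces in place, without moving them
dtransform(d, x -> abs.(x))

# MapReduce-style aggregation: per-worker sums combined with +
total = dmapreduce(d, sum, +)

# finally pull the whole (transformed) matrix back to the local process
y = gather_array(d)
```

Note that only `total` and the final `gather_array` result cross the network; the transformation itself runs entirely on the workers.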
To start quickly, you can read the tutorial:

```@contents
Pages=["tutorial.md"]
```

### Functions

A full reference to all functions is given here:

```@contents
Pages = ["functions.md"]
```
