Introduction
Spike delivery is a major cost centre in large networks. There have been several -- partially
successful -- attempts at improving performance. Yet, there is still room for improvement.
⚠️ This is an internal project, please contact us before starting work ⚠️
Steps
Measure
Measure performance and hardware utilisation using the tiled-busyring example.
Recommended tools: the internal profiler, and perf for finer detail. This input file
{
"name": "cortex",
"num-cells": 1024,
"num-tiles": 32768,
"synapses": 10000,
"min-delay": 5,
"duration": 200,
"ring-size": 4,
"event-weight": 0.2,
"record": false,
"spikes": false,
"dt": 0.025,
"depth": 2,
"complex": false,
"branch-probs": [
1,
0.5
],
"compartments": [
2,
1
],
"lengths": [
20,
2
]
}
emulates cortex-level connectivity (10k synapses/cell) on a large cluster (32k tiles) but runs
comfortably on a single node, as this example profile (M1 MacBook) shows:
gpu: no
threads: 8
mpi: no
ranks: 32768
running simulation
100% |--------------------------------------------------| 200ms
310378496 spikes generated at rate of 6.44375e-07 ms between spikes
---- meters -------------------------------------------------------------------------------
meter time(s) memory(MB)
-------------------------------------------------------------------------------------------
model-init 5.023 2135.179
model-run 7.487 382.992
meter-total 12.510 2518.172
REGION CALLS WALL THREAD %
root - 2.691 21.526 100.0
communication - 1.215 9.719 45.2
walkspikes 80 0.580 4.640 21.6
enqueue - 0.550 4.401 20.4
sort 81920 0.337 2.695 12.5
tree 81920 0.208 1.662 7.7
setup 81920 0.005 0.040 0.2
generators 81920 0.001 0.005 0.0
exchange - 0.085 0.677 3.1
gather 80 0.085 0.677 3.1
sort 80 0.000 0.000 0.0
gatherlocal 80 0.000 0.000 0.0
gather - 0.000 0.000 0.0
init - 0.788 6.304 29.3
simulation - 0.589 4.712 21.9
sources 1 0.320 2.558 11.9
group - 0.269 2.153 10.0
factory 1024 0.269 2.153 10.0
communicator - 0.199 1.592 7.4
update - 0.199 1.592 7.4
connections - 0.185 1.482 6.9
local 1 0.185 1.482 6.9
advance - 0.688 5.502 25.6
integrate - 0.684 5.473 25.4
current - 0.174 1.389 6.5
zero 8192000 0.048 0.383 1.8
hh 8192000 0.045 0.358 1.7
expsyn 8192000 0.044 0.355 1.6
pas 8192000 0.037 0.293 1.4
state - 0.137 1.095 5.1
hh 8192000 0.070 0.559 2.6
expsyn 8192000 0.035 0.279 1.3
pas 8192000 0.032 0.257 1.2
setup 81920 0.078 0.622 2.9
cable 8192000 0.076 0.611 2.8
event - 0.049 0.390 1.8
expsyn 7926239 0.049 0.390 1.8
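As a sanity check on the table (an inference from the input file, not documented behaviour): the 8192000 calls on the mechanism rows appear to be num-cells integration steps, i.e. 1024 × (200 / 0.025), and the 81920-call rows num-cells delivery epochs, assuming one exchange every min-delay/2 = 2.5 ms.

```cpp
#include <cmath>

// Expected call counts derived from the input file; purely arithmetic.
long mechanism_calls(double duration, double dt, long num_cells) {
    return std::lround(duration / dt) * num_cells;               // steps * cells
}

long delivery_calls(double duration, double min_delay, long num_cells) {
    return std::lround(duration / (min_delay / 2)) * num_cells;  // epochs * cells
}
```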
Reduce data movement
Currently we bin incoming spikes into one lane per cell, sort each queue, and then pass the
queues to cell groups. However, the cable_cell_group then re-organises the queues into
one per target; targets are formed by merging synapses of one type inside a cell group. These
are then sorted again. This costs time, space, and bandwidth.
Proposed solution: cell groups expose a list of targets and a lookup structure to bin events, eliminating
the extra steps.
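A minimal sketch of the proposed layout; `event` and `target_binner` are hypothetical names, not the project's actual types. The cell group owns one queue per merged target plus a lookup table, so an incoming spike is binned exactly once, skipping the per-cell lane and the second sort.

```cpp
#include <cstddef>
#include <unordered_map>
#include <vector>

// Hypothetical event type; field names are illustrative only.
struct event {
    std::size_t target;  // global target id carried by the spike
    double time;
    float weight;
};

// The cell group exposes its merged targets and a lookup structure, so the
// communicator can deliver an event directly into the right per-target queue.
struct target_binner {
    std::unordered_map<std::size_t, std::size_t> index;  // global id -> queue slot
    std::vector<std::vector<event>> queues;              // one queue per target

    void push(const event& e) {
        queues[index.at(e.target)].push_back(e);         // single binning step
    }
};
```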
Parallelise spike delivery
This has been attempted before, but is tricky to get right. The main problem is that an event
might be pushed into multiple queues when a single source has connections targeting multiple
sinks in the same cell group. This becomes increasingly unlikely as connectivity gets sparser,
but must still be handled. Thus, traversing incoming events in parallel must be synchronised,
e.g. by using an atomic queue per target. However, as noted above, clashes become unlikely only
as network sizes grow, so synchronisation must be cheap under both high and low contention.
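One cheap realisation of the per-target atomic queue (a sketch, all names hypothetical): a fixed-capacity buffer with an atomic write cursor costs a single relaxed `fetch_add` per push, which stays inexpensive under low contention yet remains correct under high contention.

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

// Hypothetical fixed-capacity queue; the caller must size it for the worst
// case, which is known here (total incoming events for one delivery epoch).
struct atomic_queue {
    std::vector<double> slots;           // event payload; event times here
    std::atomic<std::size_t> head{0};

    explicit atomic_queue(std::size_t capacity) : slots(capacity) {}

    void push(double t) {
        // Claim a unique slot; relaxed ordering suffices because slots never
        // alias and delivery is separated from consumption by a barrier.
        auto i = head.fetch_add(1, std::memory_order_relaxed);
        slots[i] = t;
    }
};
```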
Another option is to traverse local connections instead and bin them into target queues. The idea
would be to identify groups of connections terminating in the same queue and sort the connection
list accordingly, i.e. such that ranges of connections targeting the same queue are stored contiguously. Threads
would then work in parallel on these ranges without contention. Work balancing might be an issue.
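A sketch of this connection-side variant (`connection` and `make_ranges` are hypothetical names): sort the connection list by target queue once, then emit the contiguous ranges; each range can be handed to one thread, so no two threads ever write the same queue.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical connection record; only the target queue matters here.
struct connection {
    std::size_t target_queue;
    std::size_t source;
};

// Sort so all connections into one queue are contiguous, then emit the
// [begin, end) index ranges. Balancing uneven range sizes remains open.
std::vector<std::pair<std::size_t, std::size_t>>
make_ranges(std::vector<connection>& cons) {
    std::sort(cons.begin(), cons.end(),
              [](const connection& a, const connection& b) {
                  return a.target_queue < b.target_queue;
              });
    std::vector<std::pair<std::size_t, std::size_t>> ranges;
    for (std::size_t b = 0; b < cons.size();) {
        std::size_t e = b + 1;
        while (e < cons.size() && cons[e].target_queue == cons[b].target_queue) ++e;
        ranges.emplace_back(b, e);
        b = e;
    }
    return ranges;
}
```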
Further options and ideas welcome.
Optimise single-threaded spike delivery
Based on measurements, take steps to improve spike binning into target queues.
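One candidate worth measuring (a sketch, not the current implementation): counting-sort style binning. A first pass histograms events per queue and a prefix sum turns counts into offsets, so events can be placed into one flat buffer without any per-queue reallocation.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Compute the slice boundaries of each target queue inside one flat buffer:
// after the prefix sum, queue q owns indices [off[q], off[q + 1]).
std::vector<std::size_t>
bin_offsets(const std::vector<std::size_t>& targets, std::size_t n_queues) {
    std::vector<std::size_t> off(n_queues + 1, 0);
    for (auto t : targets) ++off[t + 1];                    // histogram
    std::partial_sum(off.begin(), off.end(), off.begin());  // offsets
    return off;
}
```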
Constraints
All changes should be done without consuming more memory -- ideally less, given
hardware trends -- and must not allocate new memory of size $O(N_\mathrm{cell})$ or larger.
Existing examples and tests must not break, and results must be consistent across changes.