Introduction
Spike delivery is a major cost centre in large networks. There have been several -- partially
successful -- attempts at improving performance. Yet, there is still room for improvement.
⚠️ This is an internal project, please contact us before starting work ⚠️
Steps
Measure
Measure performance and hardware utilisation using the tiled-busyring example.
Recommended tools: the internal profiler, and perf for finer detail. This input file
{
"name": "cortex",
"num-cells": 1024,
"num-tiles": 32768,
"synapses": 10000,
"min-delay": 5,
"duration": 200,
"ring-size": 4,
"event-weight": 0.2,
"record": false,
"spikes": false,
"dt": 0.025,
"depth": 2,
"complex": false,
"branch-probs": [
1,
0.5
],
"compartments": [
2,
1
],
"lengths": [
20,
2
]
}
emulates cortex-level connectivity (10k synapses/cell) on a large cluster (32k tiles) but runs
comfortably on a single node, as this example profile (M1 MacBook) shows:
gpu: no
threads: 8
mpi: no
ranks: 32768
running simulation
100% |--------------------------------------------------| 200ms
310378496 spikes generated at rate of 6.44375e-07 ms between spikes
---- meters -------------------------------------------------------------------------------
meter time(s) memory(MB)
-------------------------------------------------------------------------------------------
model-init 5.023 2135.179
model-run 7.487 382.992
meter-total 12.510 2518.172
REGION CALLS WALL THREAD %
root - 2.691 21.526 100.0
communication - 1.215 9.719 45.2
walkspikes 80 0.580 4.640 21.6
enqueue - 0.550 4.401 20.4
sort 81920 0.337 2.695 12.5
tree 81920 0.208 1.662 7.7
setup 81920 0.005 0.040 0.2
generators 81920 0.001 0.005 0.0
exchange - 0.085 0.677 3.1
gather 80 0.085 0.677 3.1
sort 80 0.000 0.000 0.0
gatherlocal 80 0.000 0.000 0.0
gather - 0.000 0.000 0.0
init - 0.788 6.304 29.3
simulation - 0.589 4.712 21.9
sources 1 0.320 2.558 11.9
group - 0.269 2.153 10.0
factory 1024 0.269 2.153 10.0
communicator - 0.199 1.592 7.4
update - 0.199 1.592 7.4
connections - 0.185 1.482 6.9
local 1 0.185 1.482 6.9
advance - 0.688 5.502 25.6
integrate - 0.684 5.473 25.4
current - 0.174 1.389 6.5
zero 8192000 0.048 0.383 1.8
hh 8192000 0.045 0.358 1.7
expsyn 8192000 0.044 0.355 1.6
pas 8192000 0.037 0.293 1.4
state - 0.137 1.095 5.1
hh 8192000 0.070 0.559 2.6
expsyn 8192000 0.035 0.279 1.3
pas 8192000 0.032 0.257 1.2
setup 81920 0.078 0.622 2.9
cable 8192000 0.076 0.611 2.8
event - 0.049 0.390 1.8
expsyn 7926239 0.049 0.390 1.8
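As a sanity check on the table (an inference from the input file, not documented behaviour): the 8192000 calls on the mechanism rows appear to be num-cells integration steps, i.e. 1024 × (200 / 0.025), and the 81920-call rows num-cells delivery epochs, assuming one exchange every min-delay/2 = 2.5 ms.

```cpp
#include <cmath>

// Expected call counts derived from the input file; purely arithmetic.
long mechanism_calls(double duration, double dt, long num_cells) {
    return std::lround(duration / dt) * num_cells;               // steps * cells
}

long delivery_calls(double duration, double min_delay, long num_cells) {
    return std::lround(duration / (min_delay / 2)) * num_cells;  // epochs * cells
}
```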
Reduce data movement
Currently we bin incoming spikes into one lane per cell, sort each queue, and then pass the
queues to cell groups. However, the cable_cell_group then re-organises the queues into
one per target; targets are formed by merging synapses of one type inside a cell group. These
are then sorted again. This costs time, space, and bandwidth.
Proposed solution: cell groups expose a list of targets and a lookup structure to bin events, eliminating
the extra steps.
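A minimal sketch of the proposed layout; `event` and `target_binner` are hypothetical names, not the project's actual types. The cell group owns one queue per merged target plus a lookup table, so an incoming spike is binned exactly once, skipping the per-cell lane and the second sort.

```cpp
#include <cstddef>
#include <unordered_map>
#include <vector>

// Hypothetical event type; field names are illustrative only.
struct event {
    std::size_t target;  // global target id carried by the spike
    double time;
    float weight;
};

// The cell group exposes its merged targets and a lookup structure, so the
// communicator can deliver an event directly into the right per-target queue.
struct target_binner {
    std::unordered_map<std::size_t, std::size_t> index;  // global id -> queue slot
    std::vector<std::vector<event>> queues;              // one queue per target

    void push(const event& e) {
        queues[index.at(e.target)].push_back(e);         // single binning step
    }
};
```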
Parallelise spike delivery
This has been attempted before, but is tricky to get right. The main problem is that an event
might be pushed into multiple queues when a single source has connections targeting multiple
sinks in the same cell group. This becomes increasingly unlikely as connectivity gets sparser,
but must still be handled. Thus, traversing incoming events in parallel must be synchronised,
e.g. by using an atomic queue per target. However, as noted above, clashes become unlikely only
as network sizes grow, so synchronisation must be cheap under both high and low contention.
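One cheap realisation of the per-target atomic queue (a sketch, all names hypothetical): a fixed-capacity buffer with an atomic write cursor costs a single relaxed `fetch_add` per push, which stays inexpensive under low contention yet remains correct under high contention.

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

// Hypothetical fixed-capacity queue; the caller must size it for the worst
// case, which is known here (total incoming events for one delivery epoch).
struct atomic_queue {
    std::vector<double> slots;           // event payload; event times here
    std::atomic<std::size_t> head{0};

    explicit atomic_queue(std::size_t capacity) : slots(capacity) {}

    void push(double t) {
        // Claim a unique slot; relaxed ordering suffices because slots never
        // alias and delivery is separated from consumption by a barrier.
        auto i = head.fetch_add(1, std::memory_order_relaxed);
        slots[i] = t;
    }
};
```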
Another option is to traverse local connections instead and bin them into target queues. The idea
would be to identify groups of connections terminating in the same queue and sort the connection
list accordingly, i.e. such that ranges of connections targeting the same queue are stored contiguously. Threads
would then work in parallel on these ranges without contention. Work balancing might be an issue.
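A sketch of this connection-side variant (`connection` and `make_ranges` are hypothetical names): sort the connection list by target queue once, then emit the contiguous ranges; each range can be handed to one thread, so no two threads ever write the same queue.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical connection record; only the target queue matters here.
struct connection {
    std::size_t target_queue;
    std::size_t source;
};

// Sort so all connections into one queue are contiguous, then emit the
// [begin, end) index ranges. Balancing uneven range sizes remains open.
std::vector<std::pair<std::size_t, std::size_t>>
make_ranges(std::vector<connection>& cons) {
    std::sort(cons.begin(), cons.end(),
              [](const connection& a, const connection& b) {
                  return a.target_queue < b.target_queue;
              });
    std::vector<std::pair<std::size_t, std::size_t>> ranges;
    for (std::size_t b = 0; b < cons.size();) {
        std::size_t e = b + 1;
        while (e < cons.size() && cons[e].target_queue == cons[b].target_queue) ++e;
        ranges.emplace_back(b, e);
        b = e;
    }
    return ranges;
}
```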
Further options and ideas welcome.
Optimise single-threaded spike delivery
Based on measurements, take steps to improve spike binning into target queues.
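One candidate worth measuring (a sketch, not the current implementation): counting-sort style binning. A first pass histograms events per queue and a prefix sum turns counts into offsets, so events can be placed into one flat buffer without any per-queue reallocation.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Compute the slice boundaries of each target queue inside one flat buffer:
// after the prefix sum, queue q owns indices [off[q], off[q + 1]).
std::vector<std::size_t>
bin_offsets(const std::vector<std::size_t>& targets, std::size_t n_queues) {
    std::vector<std::size_t> off(n_queues + 1, 0);
    for (auto t : targets) ++off[t + 1];                    // histogram
    std::partial_sum(off.begin(), off.end(), off.begin());  // offsets
    return off;
}
```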
Constraints
All changes should be done without consuming more memory -- ideally less, given
hardware trends -- and must not allocate new memory of size $O(N_\mathrm{cell})$ or larger.
Existing examples and tests must not break, and results must be consistent across changes.