Improving Performance of Spike Delivery #2480

@thorstenhater

Introduction

Spike delivery is a major cost centre in large networks. There have been several, partially successful,
attempts at improving its performance, yet there is still room for improvement.

⚠️ This is an internal project, please contact us before starting work ⚠️

Steps

Measure

Measure performance and hardware utilisation using the tiled-busyring example.
Recommended tools: the internal profiler and perf for details. This input file

{
    "name": "cortex",
    "num-cells": 1024,
    "num-tiles": 32768,
    "synapses": 10000,
    "min-delay": 5,
    "duration": 200,
    "ring-size": 4,
    "event-weight": 0.2,
    "record": false,
    "spikes": false,
    "dt": 0.025,
    "depth": 2,
    "complex": false,
    "branch-probs": [
        1,
        0.5
    ],
    "compartments": [
        2,
        1
    ],
    "lengths": [
        20,
        2
    ]
}

emulates cortex-level connectivity (10k synapses/cell) on a large cluster (32k tiles) but runs
comfortably on a single node, with this example profile (M1 MacBook):

gpu:      no
threads:  8
mpi:      no
ranks:    32768

running simulation

100% |--------------------------------------------------|           200ms

310378496 spikes generated at rate of 6.44375e-07 ms between spikes

---- meters -------------------------------------------------------------------------------
meter                         time(s)      memory(MB)
-------------------------------------------------------------------------------------------
model-init                      5.023        2135.179
model-run                       7.487         382.992
meter-total                    12.510        2518.172

REGION                           CALLS     WALL    THREAD        %
root                                 -    2.691    21.526    100.0
  communication                      -    1.215     9.719     45.2
    walkspikes                      80    0.580     4.640     21.6
    enqueue                          -    0.550     4.401     20.4
      sort                       81920    0.337     2.695     12.5
      tree                       81920    0.208     1.662      7.7
      setup                      81920    0.005     0.040      0.2
      generators                 81920    0.001     0.005      0.0
    exchange                         -    0.085     0.677      3.1
      gather                        80    0.085     0.677      3.1
      sort                          80    0.000     0.000      0.0
      gatherlocal                   80    0.000     0.000      0.0
      gather                         -    0.000     0.000      0.0
  init                               -    0.788     6.304     29.3
    simulation                       -    0.589     4.712     21.9
      sources                        1    0.320     2.558     11.9
      group                          -    0.269     2.153     10.0
        factory                   1024    0.269     2.153     10.0
    communicator                     -    0.199     1.592      7.4
      update                         -    0.199     1.592      7.4
        connections                  -    0.185     1.482      6.9
          local                      1    0.185     1.482      6.9
  advance                            -    0.688     5.502     25.6
    integrate                        -    0.684     5.473     25.4
      current                        -    0.174     1.389      6.5
        zero                   8192000    0.048     0.383      1.8
        hh                     8192000    0.045     0.358      1.7
        expsyn                 8192000    0.044     0.355      1.6
        pas                    8192000    0.037     0.293      1.4
      state                          -    0.137     1.095      5.1
        hh                     8192000    0.070     0.559      2.6
        expsyn                 8192000    0.035     0.279      1.3
        pas                    8192000    0.032     0.257      1.2
      setup                      81920    0.078     0.622      2.9
      cable                    8192000    0.076     0.611      2.8
      event                          -    0.049     0.390      1.8
        expsyn                 7926239    0.049     0.390      1.8

Reduce data movement

Currently we bin incoming spikes into one queue (lane) per cell, sort each queue, and then pass the
queues to cell groups. However, the cable_cell_group then re-organises the queues into
one per target; targets are formed by merging synapses of one type inside a cell group. These
per-target queues are then sorted again. This costs time, space, and bandwidth.

Proposed solution: cell groups expose a list of targets and a lookup structure to bin events, eliminating
the extra steps.
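A minimal sketch of the proposed single-pass binning, in Python for brevity. It is not Arbor code: the types `Spike`, `Connection`, and `Event`, the functions `build_lookup` and `bin_events`, and the assumption that each connection already knows its target queue index are all hypothetical and only illustrate the idea.

```python
from collections import defaultdict
from typing import NamedTuple

class Spike(NamedTuple):
    source: int   # id of the spiking source
    time: float   # spike time in ms

class Connection(NamedTuple):
    source: int   # id of the source this connection listens to
    target: int   # index of the target queue (assumed: one per merged synapse type)
    weight: float
    delay: float

class Event(NamedTuple):
    time: float   # delivery time = spike time + delay
    weight: float

def build_lookup(connections):
    """Map each source to the connections it drives; built once at setup."""
    lookup = defaultdict(list)
    for c in connections:
        lookup[c.source].append(c)
    return lookup

def bin_events(spikes, lookup, num_targets):
    """Bin spikes directly into one queue per target and sort each queue once."""
    queues = [[] for _ in range(num_targets)]
    for s in spikes:
        for c in lookup.get(s.source, ()):
            queues[c.target].append(Event(s.time + c.delay, c.weight))
    for q in queues:
        q.sort()
    return queues

connections = [Connection(0, 0, 0.2, 5.0), Connection(0, 1, 0.2, 6.0), Connection(1, 0, 0.1, 5.0)]
spikes = [Spike(1, 2.0), Spike(0, 1.0)]
queues = bin_events(spikes, build_lookup(connections), num_targets=2)
```

Compared with the current scheme, events are written once and sorted once; whether the lookup structure fits within the memory constraints below would need to be measured.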

Parallelise spike delivery

This has been attempted before, but is tricky to get right. The main problem is that each event
might be pushed into multiple queues when a single source has connections targeting multiple
sinks in the same cell group. This gets increasingly unlikely as connectivity gets sparser, but still
must be handled. Thus, traversing incoming events in parallel must be synchronised, e.g. by using
an atomic queue per target. However, clashes only become unlikely as network sizes grow, so
synchronisation must be cheap under both high and low contention.
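The synchronised traversal could look roughly like the sketch below. Note the substitution: it uses one `threading.Lock` per target queue instead of an atomic queue, because Python has no lock-free queue primitive; it shows the structure of the approach and says nothing about its speed. The function `deliver_parallel` and the layout of `lookup` (source id mapped to a list of `(target, weight, delay)` tuples) are assumptions made for illustration.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

def deliver_parallel(spikes, lookup, num_targets, num_threads=4):
    """Traverse spikes in parallel; every push into a target queue is guarded."""
    queues = [[] for _ in range(num_targets)]
    locks = [threading.Lock() for _ in range(num_targets)]  # stand-in for atomic queues

    def work(chunk):
        for source, time in chunk:
            # One spike may fan out into several target queues.
            for target, weight, delay in lookup.get(source, ()):
                with locks[target]:
                    queues[target].append((time + delay, weight))

    # Strided split of the spike list, one chunk per thread.
    chunks = [spikes[i::num_threads] for i in range(num_threads)]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        list(pool.map(work, chunks))  # consuming the iterator re-raises worker errors

    for q in queues:
        q.sort()  # arrival order depends on scheduling, so sort afterwards
    return queues

lookup = {0: [(0, 0.2, 5.0), (1, 0.2, 5.0)], 1: [(0, 0.1, 5.0)]}
spikes = [(s % 2, float(s)) for s in range(100)]
queues = deliver_parallel(spikes, lookup, num_targets=2)
```

In this toy input both sources feed queue 0, so every thread contends for the same lock; that is the high-contention case the synchronisation primitive has to handle cheaply.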

Another option is to traverse local connections instead and bin into target queues. The idea could
be to identify groups of connections terminating in the same queue and sort the connection list
accordingly, i.e. such that ranges of connections targeting the same queue are stored contiguously. Threads
would then work in parallel on these ranges without contention. Work balancing might be an issue.
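A sketch of the connection-traversal variant under the same caveats: plain Python, hypothetical names (`target_ranges`, `deliver_by_range`), and connections represented as `(target, source, weight, delay)` tuples. The per-source spike index built here is an auxiliary structure added for the illustration; a real implementation would need a lookup that respects the memory constraints.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import groupby

def target_ranges(connections):
    """Sort connections by target; return the sorted list and (target, begin, end) ranges."""
    conns = sorted(connections, key=lambda c: (c[0], c[1]))
    ranges, begin = [], 0
    for target, group in groupby(conns, key=lambda c: c[0]):
        end = begin + sum(1 for _ in group)
        ranges.append((target, begin, end))
        begin = end
    return conns, ranges

def deliver_by_range(spikes, conns, ranges, num_threads=4):
    """Each task owns one contiguous range, hence one queue: no locking required."""
    spikes_by_source = {}  # auxiliary index, scales with the number of spikes
    for source, time in spikes:
        spikes_by_source.setdefault(source, []).append(time)

    def work(r):
        target, begin, end = r
        queue = [(t + delay, weight)
                 for _, source, weight, delay in conns[begin:end]
                 for t in spikes_by_source.get(source, ())]
        queue.sort()
        return target, queue

    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return dict(pool.map(work, ranges))

connections = [(1, 0, 0.2, 5.0), (0, 0, 0.2, 5.0), (0, 1, 0.1, 6.0)]
conns, ranges = target_ranges(connections)
spikes = [(0, 1.0), (1, 1.0), (0, 2.0)]
queues = deliver_by_range(spikes, conns, ranges)
```

The ranges differ in length (two connections for target 0, one for target 1), which is the work-balancing concern in miniature: one task per range is only balanced if the ranges are.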

Further options and ideas welcome.

Optimise single-threaded spike delivery

Based on measurements, take steps to improve spike binning into target queues.

Constraints

All changes should be done without consuming more memory -- ideally less, given
hardware trends -- and must not allocate new memory $\geq O(N_\mathrm{cell})$.
Existing examples and tests must not break, and results must be consistent across changes.
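One way to check consistency across changes is to compare the spikes recorded before and after a change. The helper below is a hypothetical sketch, not part of the existing test suite; it assumes spikes are available as `(gid, time)` pairs and that recording order may differ between runs.

```python
import math

def spikes_consistent(reference, candidate, time_tol=0.0):
    """Compare two spike records [(gid, time), ...] irrespective of recording order."""
    if len(reference) != len(candidate):
        return False
    for (gid_a, t_a), (gid_b, t_b) in zip(sorted(reference), sorted(candidate)):
        # With time_tol == 0.0 this demands bit-identical spike times.
        if gid_a != gid_b or not math.isclose(t_a, t_b, rel_tol=0.0, abs_tol=time_tol):
            return False
    return True

same = spikes_consistent([(0, 1.0), (1, 2.0)], [(1, 2.0), (0, 1.0)])
shifted = spikes_consistent([(0, 1.0)], [(0, 1.5)])
```

The default tolerance of zero is the strictest reading of "consistent"; a non-zero `time_tol` would be a deliberate relaxation and should be justified per change.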

Metadata

Labels: AEP (Arbor Enhancement Proposal), enhancement, hpc (relevant primarily to HPC environments), optimization
