The group identified three distinct benchmarking classes:
1) Behavioural benchmarks for whole networks;
2) Dynamical reproduction benchmarks for specific neuron and synapse models;
3) Benchmarks for comparison against biological datasets.
With respect to the first, it was quickly acknowledged that any metrics developed
will be highly task-specific. Most people expressed a general dissatisfaction
with present benchmarks, which tend to use ad-hoc metrics open to mathematical
criticism. Two examples of more formally justifiable metrics were presented. The
first considers a synfire chain. The dynamics of a synfire chain can be
expressed in terms of initial spike activity versus synchrony, a state space in
which there is a clear separatrix between networks whose long-term dynamics
dissipate (i.e. the synfire waves disappear) and networks whose long-term
dynamics remain stable and convergent (i.e. the synfire waves propagate
indefinitely with increasing synchronisation). Since the network can be
characterised in this way, it was proposed that the degree to which the
implemented network reproduces this theoretical separatrix could serve as a
benchmark. (The 'degree of match' remains something that must be formally
defined.)
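One possible sketch of such a separatrix-match measure, assuming two hypothetical
helpers that stand in for the analytic model and the platform under test
(theoretical_separatrix giving the critical activity for a given synchrony, and
simulate_survives running the implemented network from one initial condition), is:

    def separatrix_agreement(simulate_survives, theoretical_separatrix,
                             a_values, sigma_values):
        """Fraction of initial conditions (activity a, synchrony spread sigma)
        for which the implemented network and the theoretical separatrix agree
        on whether synfire activity survives.  Both callables are placeholders
        for whatever the platform under test and the analytic model provide."""
        agree = 0
        total = 0
        for sigma in sigma_values:
            a_crit = theoretical_separatrix(sigma)      # predicted critical activity
            for a in a_values:
                predicted = a >= a_crit                 # theory: wave survives above the separatrix
                observed = simulate_survives(a, sigma)  # run the network under test
                agree += int(predicted == observed)
                total += 1
        return agree / total

This is only a sketch under the stated assumptions; it reduces the 'degree of
match' to the fraction of a grid of initial conditions classified identically by
theory and implementation, which is one of several ways it could be formalised.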
The second example considers object tracking. If one has a ball following a
ballistic trajectory with air resistance neglected (as can be achieved, for
example, by simulating a ball on a computer screen), tracking performance can be
measured as the degree of match, in both time and position, to that trajectory.
Again, the physics of the problem allows the analytic solution, and hence the
degree of match, to be specified.
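A minimal sketch of the corresponding degree-of-match computation, assuming the
tracker under test reports positions at known times (the names track_times and
track_positions are illustrative, not an agreed interface):

    import numpy as np

    def ballistic_trajectory(t, p0, v0, g=9.81):
        """Analytic position of a ball with no air resistance at times t."""
        t = np.asarray(t, float)
        pos = np.asarray(p0, float) + np.outer(t, np.asarray(v0, float))
        pos[:, 1] -= 0.5 * g * t**2      # gravity assumed to act on the second axis
        return pos

    def tracking_error(track_times, track_positions, p0, v0, g=9.81):
        """RMS distance between the tracker's reported positions and the
        analytic trajectory evaluated at the same times."""
        ref = ballistic_trajectory(track_times, p0, v0, g)
        err = np.linalg.norm(np.asarray(track_positions, float) - ref, axis=1)
        return np.sqrt(np.mean(err**2))

A temporal offset term (how late each position is reported) could be added in the
same spirit, since the analytic trajectory is available at every instant.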
One member raised the idea that separate benchmark figures on a variety of tasks
could be combined into a single Quality of Service figure; however, most of the
participants noted that, since there is no way to normalise the individual
metrics given the radically different nature of the tasks and measurements, such
an approach must be considered dubious. In the end the group concluded that, in
essence, reasonable metrics are in large part a matter of good experimental
design: it must be considered essential that the expected behaviour can be
defined by some closed-form mathematical expression, so that the actual network
can be compared against an absolute reference.
The second type of benchmark was felt to be the easiest, because hardware and
other systems are necessarily reproducing a model that can be defined
mathematically. Platforms can then be compared relative to a reference simulator
which is considered to give the definitive solution. There was some question as
to what the reference simulator should be: certain participants had worked with
Mathematica to produce high-quality results, but the exact nature of the tests
for similarity still needed to be confirmed. In general, however, the group
agreed that model matching could in principle be benchmarked in this way,
provided the representational precision of the reference simulator is high
enough.
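As an illustration of the kind of comparison involved (a sketch only, with
placeholder names rather than any agreed test), a platform trace and a
high-precision reference trace sampled on the same time grid could be compared
as follows:

    import numpy as np

    def trace_rms_error(v_platform, v_reference):
        """RMS deviation between a state-variable trace from the platform under
        test and the same trace from the high-precision reference, assuming an
        identical sampling grid."""
        v_platform = np.asarray(v_platform, float)
        v_reference = np.asarray(v_reference, float)
        return np.sqrt(np.mean((v_platform - v_reference)**2))

    def spike_time_deviation(spikes_platform, spikes_reference):
        """Mean absolute difference in spike times, assuming both runs produced
        the same number of spikes so that spikes can be paired in order."""
        sp = np.sort(np.asarray(spikes_platform, float))
        sr = np.sort(np.asarray(spikes_reference, float))
        if len(sp) != len(sr):
            raise ValueError("spike counts differ; pairing by order is undefined")
        return float(np.mean(np.abs(sp - sr))) if len(sp) else 0.0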
With the third type of metric the group had more difficulty. There is no
absolute reference for comparison: data is simply data. Furthermore, there are
issues with spike matching: over an interval in which the modelled network and
the data produce exactly the same number of spikes, some sort of pairing could
possibly be made, but once the spike counts diverge, identifying a given spike
with a given expected spike becomes much more problematic. Various
sliding-window comparisons could be made, but these introduce a significant
ad-hoc component in the size and shape of the window: for example, all spikes
could be convolved with a Gaussian kernel, but what should the width and gain of
that kernel be? The outlook here was less definite. Various metrics were
proposed, with the overall idea that there ought to be some sort of distance
metric between the dataset and the model, but what that distance metric ought to
be remains the subject of further work.
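One candidate of the kind discussed, sketched here for concreteness only, is an
L2 distance between Gaussian-filtered spike trains; the kernel width sigma is
precisely the ad-hoc parameter noted above, and the function names are
illustrative rather than agreed:

    import numpy as np

    def gaussian_filtered(spike_times, t_grid, sigma):
        """Convolve a spike train (a sum of deltas) with a Gaussian kernel of
        width sigma, evaluated on a common time grid."""
        t_grid = np.asarray(t_grid, float)
        spikes = np.asarray(spike_times, float)[:, None]
        return np.exp(-0.5 * ((t_grid[None, :] - spikes) / sigma)**2).sum(axis=0)

    def spike_train_distance(data_spikes, model_spikes, t_grid, sigma):
        """L2 distance between the filtered data and model spike trains,
        scaled by the grid spacing so the value does not grow trivially with
        the sampling resolution."""
        t_grid = np.asarray(t_grid, float)
        f_data = gaussian_filtered(data_spikes, t_grid, sigma)
        f_model = gaussian_filtered(model_spikes, t_grid, sigma)
        dt = t_grid[1] - t_grid[0]
        return float(np.sqrt(np.sum((f_data - f_model)**2) * dt))

Such a measure handles unequal spike counts without explicit pairing, but it
does not remove the arbitrariness of the kernel choice, which is exactly the
open question the group left for further work.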