{"title": "Prediction on Spike Data Using Kernel Algorithms", "book": "Advances in Neural Information Processing Systems", "page_first": 1367, "page_last": 1374, "abstract": "", "full_text": "Prediction on Spike Data\nUsing Kernel Algorithms\n\nJan Eichhorn, Andreas Tolias, Alexander Zien, Malte Kuss,\n\nCarl Edward Rasmussen, Jason Weston, Nikos Logothetis and Bernhard Sch\u00a8olkopf\n\nMax Planck Institute for Biological Cybernetics\n\n72076 T\u00a8ubingen, Germany\n\nfirst.last@tuebingen.mpg.de\n\nAbstract\n\nWe report and compare the performance of different learning algorithms\nbased on data from cortical recordings. The task is to predict the orienta-\ntion of visual stimuli from the activity of a population of simultaneously\nrecorded neurons. We compare several ways of improving the coding of\nthe input (i.e., the spike data) as well as of the output (i.e., the orienta-\ntion), and report the results obtained using different kernel algorithms.\n\n1\n\nIntroduction\n\nRecently, there has been a great deal of interest in using the activity from a population\nof neurons to predict or reconstruct the sensory input [1, 2], motor output [3, 4] or the\ntrajectory of movement of an animal in space [5]. This analysis is of importance since it\nmay lead to a better understanding of the coding schemes utilised by networks of neurons\nin the brain. In addition, ef\ufb01cient algorithms to interpret the activity of brain circuits in\nreal time are essential for the development of successful brain computer interfaces such as\nmotor prosthetic devices.\n\nThe goal of reconstruction is to predict variables which can be of rather different nature\nand are determined by the speci\ufb01c experimental setup in which the data is collected. They\nmight be for example arm movement trajectories or variables representing sensory stimuli,\nsuch as orientation, contrast or direction of motion. 
From a data analysis perspective, these problems are challenging for a number of reasons, to be discussed in the remainder of this article.\n\nWe will exemplify our reasoning using data from an experiment described in Sect. 3. The task is to reconstruct the angle of a visual stimulus, which can take eight discrete values, from the activity of simultaneously recorded neurons.\n\nInput coding. In order to effectively apply machine learning algorithms, it is essential to adequately encode prior knowledge about the problem. A clever encoding of the input data might reflect, for example, known invariances of the problem, or assumptions about the similarity structure of the data motivated by scientific insights. An algorithmic approach which currently enjoys great popularity in the machine learning community, called kernel machines, makes these assumptions explicit by the choice of a kernel function. The kernel can be thought of as a mathematical formalisation of a similarity measure that ideally captures much of this prior knowledge about the data domain. Note that unlike many traditional machine learning methods, kernel machines can readily handle data that is not in the form of vectors of numbers, but also complex data types, such as strings, graphs, or spike trains. Recently, a kernel for spike trains was proposed whose design is based on a number of biologically motivated assumptions about the structure of spike data [6].\n\nOutput coding. Just like the inputs, the stimuli perceived or the actions carried out by an animal are in general not given to us in vectorial form. Moreover, biologically meaningful similarity measures and loss functions may be very different from those used traditionally in pattern recognition. 
Hence, once again, there is a need for methods that are sufficiently general such that they can cope with these issues. In the problem at hand, the outputs are orientations of a stimulus, and thus it would be desirable to use a method which takes their circular structure into account. In this paper, we will utilise the recently proposed kernel dependency estimation technique [7], which can cope with general sets of outputs and a large class of loss functions in a principled manner. Besides, we also apply Gaussian process regression to the given task.\n\nInference and generalisation. The dimensionality of the spike data can be very high, in particular if the data stem from multicellular recordings and if the temporal resolution is high. In addition, the problems are not necessarily stationary, the distributions can change over time, and depend heavily on the individual animal. These aspects make it hard for a learning machine to generalise from the training data to previously unseen test data. It is thus important to use methods which are state of the art and to assay them using carefully designed numerical experiments. In our work, we have attempted to evaluate several such methods, including certain developments for the present task that shall be described below.\n\n2 Learning algorithms, kernels and output coding\n\nIn supervised machine learning, we basically attempt to discover dependencies between variables based on a finite set of observations (called the training set) {(x_i, y_i) | i = 1, ..., n}. The x_i ∈ X are referred to as inputs and are taken from a domain X; likewise, the y_i ∈ Y are called outputs, and the objective is to approximate the mapping X → Y between the domains from the samples. If Y is a discrete set of class labels, e.g. 
{−1, 1}, the problem is referred to as classification; if Y = R^N, it is called regression.\n\nKernel machines, a term which refers to a group of learning algorithms, are based on the notion of a feature space mapping Φ. The input points get mapped to a possibly high-dimensional dot product space (called the feature space) using Φ, and in that space the learning problem is tackled using simple linear geometric methods (see [8] for details). All geometric methods that are based on distances and angles can be performed in terms of the dot product. The "kernel trick" is to calculate the inner product of feature space mapped points using a kernel function\n\nk(x_i, x_j) = ⟨Φ(x_i), Φ(x_j)⟩   (1)\n\nwhile avoiding explicit mappings Φ. In order for k to be interpretable as a dot product in some feature space it has to be a positive definite function.\n\n2.1 Support Vector Classification and Gaussian Process Regression\n\nA simple geometric classification method which is based on dot products and which is the basis of support vector machines is linear classification via separating hyperplanes. One can show that the so-called optimal separating hyperplane (the one that leads to the largest margin of separation between the classes) can be written in feature space as ⟨w, Φ(x)⟩ + b = 0, where the hyperplane normal vector can be expanded in terms of the training points as w = Σ_{i=1}^m λ_i Φ(x_i). The points for which λ_i ≠ 0 are called support vectors. Taken together, this leads to the decision function\n\nf(x) = sign( Σ_{i=1}^m λ_i ⟨Φ(x), Φ(x_i)⟩ + b ) = sign( Σ_{i=1}^m λ_i k(x, x_i) + b ).   (2)\n\nThe coefficients λ_i, b ∈ R are found by solving a quadratic optimisation problem, for which standard methods exist. 
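As a concrete illustration of the decision function (2), the following sketch (our own, not the authors' implementation; the coefficients λ_i and the offset b are assumed to come from a standard QP solver) evaluates f(x) with a Gaussian kernel:

```python
import math

def gaussian_kernel(xi, xj, sigma=1.0):
    # k(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 sigma^2))
    sq_dist = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def svm_decision(x, support_vectors, lambdas, b, sigma=1.0):
    # f(x) = sign(sum_i lambda_i k(x, x_i) + b), eq. (2);
    # lambdas and b are assumed to come from a QP solver.
    s = sum(lam * gaussian_kernel(x, xi, sigma)
            for lam, xi in zip(lambdas, support_vectors))
    return 1 if s + b > 0 else -1
```

With two support vectors of opposite sign, a test point is assigned to the class of the nearer one, since the Gaussian kernel decays with distance.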
The central idea of support vector machines is thus that we can perform linear classification in a high-dimensional feature space using a kernel which can be seen as a (nonlinear) similarity measure for the input data. A popular nonlinear kernel function is the Gaussian kernel k(x_i, x_j) = exp(−||x_i − x_j||^2 / (2σ^2)). This kernel has been successfully used to predict stimulus parameters using spikes from simultaneously recorded data [2].\n\nIn Gaussian process regression [9], the model specifies a random distribution over functions. This distribution is conditioned on the observations (the training set) and predictions may be obtained in closed form as Gaussian distributions for any desired test inputs. The characteristics (such as smoothness, amplitude, etc.) of the functions are given by the covariance function or covariance kernel; it controls how the outputs covary as a function of the inputs. In the experiments below (assuming x ∈ R^D) we use a Gaussian kernel of the form\n\nCov(y_i, y_j) = k(x_i, x_j) = v^2 exp( −(1/2) Σ_{d=1}^D (x_i^d − x_j^d)^2 / w_d^2 )   (3)\n\nwith parameters v and w = (w_1, ..., w_D). This covariance function expresses that outputs whose inputs are nearby have large covariance, and outputs that belong to inputs far apart have smaller covariance. In fact, it is possible to show that the functions generated by this covariance function are all smooth. The w parameters determine exactly how important different input coordinates are (and can be seen as a generalisation of the above kernel). The parameters are fit by optimising the likelihood.\n\n2.2 Similarity measures for spike data\n\nTo take advantage of the strength of kernel machines in the analysis of cortical recordings we will explore the usefulness of different kernel functions. 
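A minimal sketch of Gaussian process prediction with the covariance function (3) (our own illustration, not the authors' code; in practice v and w are fit by optimising the likelihood, the linear system is solved with a numerical library, and a small noise term is assumed here for stability):

```python
import math

def ard_kernel(xi, xj, v=1.0, w=(1.0,)):
    # Cov(y_i, y_j) = v^2 exp(-(1/2) sum_d (x_i^d - x_j^d)^2 / w_d^2), eq. (3)
    s = sum((a - b) ** 2 / (wd ** 2) for a, b, wd in zip(xi, xj, w))
    return v ** 2 * math.exp(-0.5 * s)

def gp_predict_mean(X, y, x_star, v=1.0, w=(1.0,), noise=1e-6):
    # GP posterior mean at x_star: k_*^T (K + noise*I)^{-1} y.
    n = len(X)
    K = [[ard_kernel(X[i], X[j], v, w) + (noise if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    # Solve K alpha = y by Gauss-Jordan elimination (fine for a sketch).
    A = [row[:] + [y[i]] for i, row in enumerate(K)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        piv = A[c][c]
        A[c] = [a / piv for a in A[c]]
        for r in range(n):
            if r != c:
                f = A[r][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    alpha = [A[i][n] for i in range(n)]
    k_star = [ard_kernel(x_star, X[i], v, w) for i in range(n)]
    return sum(ks * a for ks, a in zip(k_star, alpha))
```

With negligible noise the posterior mean interpolates the training outputs, which is a quick sanity check on the algebra.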
We describe the spikernel introduced in [6] and present a novel use of alignment-type scores typically used in bioinformatics.\n\nAlthough we are far from understanding the neuronal code, there exist some reasonable assumptions about the structure of spike data one has to take into account when comparing spike patterns and designing kernels.\n\n- Most fundamental is the assumption that frequency and temporal coding play central roles. Information related to a certain variable of the stimulus may be coded in highly specific temporal patterns contained in the spike trains of a cortical population.\n\n- These firing patterns may be misaligned in time. To compare spike trains it might be necessary to realign them by introducing a certain time shift. We want the similarity score to be the higher the smaller this time shift is.\n\nSpikernel. In [6] Shpigelman et al. proposed a kernel for spike trains that was designed with respect to the assumptions above and some extra assumptions related to the special task to be solved. To understand their ideas it is most instructive to look at the feature map Φ rather than at the kernel itself.\n\nLet s be a sequence of firing rates of length |s|. The feature map maps this sequence into a high-dimensional space where the coordinates u represent a possible spike train prototype of fixed length n ≤ |s|. The value of the feature map of s, Φ_u(s), represents the similarity of s to the prototype u. The u component of the feature vector Φ(s) is defined as:\n\nΦ_u(s) = C^{n/2} Σ_{i ∈ I_{n,|s|}} μ^{d(s_i, u)} λ^{|s| − i_1}   (4)\n\nHere i is an index vector that indexes a length-n ordered subsequence of s, and the sum runs over all possible subsequences. λ, μ ∈ [0, 1] are parameters of the kernel. 
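For illustration, the feature map (4) can be evaluated by brute force for very short sequences (our own sketch, not the dynamic program of [6]; C is the normalisation constant, and d is taken to be the summed squared l2-distance between aligned rate vectors, as described in the text):

```python
from itertools import combinations

def phi_u(s, u, lam=0.5, mu=0.5, C=1.0):
    # Naive evaluation of the spikernel feature map, eq. (4):
    #   Phi_u(s) = C^(n/2) * sum over ordered index vectors i of length n of
    #              mu^d(s_i, u) * lam^(|s| - i_1)
    # s: list of firing-rate vectors; u: prototype of length n <= |s|.
    n = len(u)
    total = 0.0
    for idx in combinations(range(len(s)), n):  # all ordered subsequences
        d = sum(sum((a - b) ** 2 for a, b in zip(s[i], uk))
                for i, uk in zip(idx, u))
        total += mu ** d * lam ** (len(s) - idx[0])
    return C ** (n / 2) * total

def naive_spikernel(s, t, u_grid, lam=0.5, mu=0.5):
    # k_n(s, t) = <Phi(s), Phi(t)> approximated over a finite prototype grid
    # (the exact kernel integrates over all prototypes and is computed in
    # O(|s||t|n) by dynamic programming; this version is illustration only).
    return sum(phi_u(s, u, lam, mu) * phi_u(t, u, lam, mu) for u in u_grid)
```

The λ-exponent |s| − i_1 shows the weighting towards the end of the sequence: subsequences starting later (larger i_1) are damped less.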
The μ-part of the sum reflects the weighting according to the similarity of s to the coordinate u (expressed in the distance measure d(s_i, u) = Σ_{k=1}^n d(s_{i,k}, u_k)), whereas the λ-part emphasises the concentration towards a "time of interest" at the end of the sequence s (i_1 is the first index of the subsequence). Following the authors we chose the distance measure d(s_{i,k}, u_k), determining how two firing rate vectors are compared, to be the squared l2-norm: d(s_{i,k}, u_k) = ||s_{i,k} − u_k||_2^2. Note that each entry s_k of the sequence (matrix) s is meant to be a vector containing the firing rates of all simultaneously recorded neurons in the same time interval (bin).\n\nThe kernel k_n(s, t) induced by this feature map can be computed in time O(|s||t|n) using dynamic programming. The kernel used in our experiments is a sum of kernels for different pattern lengths n weighted with another parameter p, i.e., k(s, t) = Σ_{i=1}^N p^i k_i(s, t).\n\nAlignment score. In addition to methods developed specifically for neural spike train data, we also train on pairwise similarities derived from global alignments. Aligning sequences is a standard method in bioinformatics; there, the sequences usually describe DNA, RNA or protein molecules. Here, the sequences are time-binned representations of the spike trains, as described above.\n\nIn a global alignment of two sequences s = s_1 ... s_|s| and t = t_1 ... t_|t|, each sequence may be elongated by inserting copies of a special symbol (the dash, "-") at any position, yielding two stuffed sequences s' and t'. The first requirement is that the stuffed sequences must have the same length. This allows us to write them on top of each other, so that each symbol of s is either mapped to a symbol of t (match/mismatch) or mapped to a dash (gap), and vice versa. 
The second requirement for a valid alignment is that no dash is mapped to a dash, which restricts the length of any alignment to a maximum of |s| + |t|.\n\nOnce costs are assigned to the matches and gaps, the cost of an alignment is defined as the sum of costs in the alignment. The distance of s and t can now be defined as the cost of an optimal global alignment of s and t, where optimal means minimising the cost. Although there are exponentially many possible global alignments, the optimal cost (and an optimal alignment) can be computed in time O(|s||t|) using dynamic programming [10]. Let c(a, b) denote the cost of a match/mismatch (a = s_i, b = t_j) or of a gap (either a = \"-\" or b = \"-\"). We parameterise the costs with γ and μ as follows:\n\nc(a, b) = c(b, a) := |a − b|\nc(a, -) = c(-, a) := γ|a − μ|\n\nThe matrix of pairwise distances as defined above will, in general, not be a proper kernel (i.e., it will not be positive definite). Therefore, we use it to build a new representation of the data (see below). A related but different distance measure has previously been proposed by Victor and Purpura [11].\n\nWe use the alignment score to compute explicit feature vectors of the data points via an empirical kernel map [8, p. 42]. Consider as prototypes the overall data set^1 {x_i}_{i=1,...,m} of m trials x_i = [n_{1,i} n_{2,i} ... n_{20,i}] as defined in Sect. 3. Since our alignment score k_align(n, n') applies to single spike trains only^2, we compute the empirical kernel map for each neuron separately and then concatenate these vectors. 
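Both steps, the alignment distance and the per-neuron empirical kernel map, can be sketched as follows (our own illustration; function names and parameter defaults are ours):

```python
def alignment_cost(s, t, gamma=1.0, mu=0.0):
    # Optimal global alignment cost in O(|s||t|) by dynamic programming;
    # match/mismatch costs |a - b|, a gap of symbol a costs gamma*|a - mu|.
    gap = lambda a: gamma * abs(a - mu)
    n, m = len(s), len(t)
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + gap(s[i - 1])
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + gap(t[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j - 1] + abs(s[i - 1] - t[j - 1]),  # (mis)match
                          D[i - 1][j] + gap(s[i - 1]),                 # gap in t
                          D[i][j - 1] + gap(t[j - 1]))                 # gap in s
    return D[n][m]

def empirical_kernel_map(trials, new_trial, score=alignment_cost):
    # Represent a trial (a list of per-neuron spike sequences) by its
    # alignment scores against all m prototype trials, computed separately
    # for each neuron and concatenated.
    features = []
    for k in range(len(new_trial)):            # loop over neurons
        features.extend(score(x[k], new_trial[k]) for x in trials)
    return features
```

Note that choosing μ near a typical bin value makes gaps cheap for that value, so stretches of "background" activity can be inserted at low cost.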
Hence, the feature map is defined as:\n\nΦ_{x_1,...,x_m}(x') = Φ_{x_1,...,x_m}([n'_1 n'_2 ... n'_{20}]) = [ {k_align(n_{1,i}, n'_1)}_{i=1..m} {k_align(n_{2,i}, n'_2)}_{i=1..m} ... {k_align(n_{20,i}, n'_{20})}_{i=1..m} ]\n\nThus, each trial is represented by a vector of its alignment scores with respect to all other trials, where alignments are computed separately for all 20 neurons.\n\nWe can now train kernel machines using any standard kernel on top of this representation, but we already achieve very good performance using the simple linear kernel (see results section). Although we give results obtained with this technique of constructing a feature map only for the alignment score, it can easily be applied with the spikernel and other kernels.\n\n2.3 Coding structure in output space\n\nOur objective is to use various machine learning algorithms to predict the orientation of a stimulus used in the experiment described below. Since we use discrete orientations we can model this as a multi-class classification problem or transform it into a regression task.\n\nCombining Support Vector Machines. Above, we explained how to do binary classification using SVMs by estimating a normal vector w and offset b of a hyperplane ⟨w, Φ(x)⟩ + b = 0 in the feature space. A given point x will then be assigned to class 1 if ⟨w, Φ(x)⟩ + b > 0 (and to class -1 otherwise). If we have M > 2 classes, we can train M classifiers, each one separating one specific class from the union of all other ones (hence the name \"one-versus-rest\"). When classifying a new point x, we simply assign it to the class whose classifier leads to the largest value of ⟨w, Φ(x)⟩ + b.\n\nA more sophisticated and more expensive method is to train one classifier for each possible combination of two classes and then use a voting scheme to classify a point. It is referred to as \"one-versus-one\".\n\nKernel Dependency Estimation. 
Note that the above approach treats all classes the same. In our situation, however, certain classes are \"closer\" to each other since the corresponding stimulus angles are closer than others. To take this into account, we use the kernel dependency estimation (KDE) algorithm [7] with an output similarity measure corresponding to a loss function of the angles taking the form L(α, β) = cos(2α − 2β).^3 The modification respects the symmetry that 0° and 180°, say, are equivalent.\n\nLack of space does not permit us to explain the KDE algorithm in detail. In a nutshell, it estimates a linear mapping between two feature spaces. One feature space corresponds to the kernel used on the inputs (in our case, the spike trains), and the other one to a second kernel which encodes the similarity measure to be used on the outputs (the orientation of the lines).\n\n^1 Note that this means that we are considering a transductive setting [12], where we have access to all input data (but not the test outputs) during training.\n^2 It is straightforward to extend this idea to synchronous alignments of the whole population vector, but we achieved worse results.\n^3 Note that L(α, β) needs to be an admissible kernel, i.e. positive definite, and therefore we cannot use the linear loss function (5).\n\nGaussian Process Regression. When we use Gaussian processes to predict the stimulus angle α we consider the task as a regression problem on sin 2α and cos 2α separately. To do prediction we take the means of the predicted distributions of sin 2α and cos 2α as point estimates respectively, which are then projected onto the unit circle. 
Finally we assign the averaged predicted angle to the nearest orientation which could have been shown.\n\n3 Experiments\n\nWe will now apply the ideas from the reasoning above and see how well these different concepts perform in practice on a dataset of cortical recordings.\n\nData collection. The dataset we used was collected in an experiment performed in our neurophysiology department. All experiments were conducted in full compliance with the guidelines of the European Community (EUVD/86/609/EEC) for the care and use of laboratory animals and were approved by the local authorities (Regierungspräsidium). The spike data were recorded using tetrodes inserted in area V1 of a behaving macaque (Macaca mulatta). The spike waveforms were sampled at 32 kHz. The animal's task was to fixate a small square spot on the monitor while gratings of eight different orientations (0°, 22°, 45°, 67°, 90°, 112°, 135°, 158°) and two contrasts (2% and 30%) were presented on a monitor. The stimuli were positioned on the monitor so as to cover the classical receptive fields of the neurons. A single stimulus of fixed orientation and contrast was presented for a period of 500 ms, i.e., during the epoch of a single behavioural trial. All 8 stimuli appeared 30 times each and in random order, resulting in 240 observed trials.\n\nSpiking activity from neural recordings usually comes as a time series of action potentials from one or more neurons recorded from the brain. It is commonly believed that in most circumstances most of the information in the spiking activity is present in the times of occurrence of spikes and not in the exact shape of the individual spikes. Therefore we can abstract the spike series as a series of zeros and ones.\n\nFrom a single trial we have recordings of 500 ms from 20 neurons. 
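For illustration, such a 0/1 spike series can be reduced to binned spike counts, and the bins of all neurons concatenated into one vector per trial (a sketch under our own naming, mirroring the preprocessing described in the surrounding text):

```python
def bin_spike_train(spikes, n_bins):
    # spikes: 0/1 series at a fixed temporal resolution (one trial, one neuron).
    # Returns n_bins spike counts; len(spikes) is assumed divisible by n_bins.
    width = len(spikes) // n_bins
    return [sum(spikes[b * width:(b + 1) * width]) for b in range(n_bins)]

def trial_vector(neurons, n_bins):
    # Concatenate the binned counts of all simultaneously recorded neurons
    # into one data point x = [n_1 n_2 ... n_20] per trial.
    x = []
    for spikes in neurons:
        x.extend(bin_spike_train(spikes, n_bins))
    return x
```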
We compute the firing rates from the high resolution data for each neuron in 1, 5 or 10 bins of length 500, 100 or 50 ms respectively, resulting in three different data representations for different temporal resolutions. By concatenating the vectors n_r (r = 1, ..., 20) containing the bins of each neuron we obtain one data point x = [n_1 n_2 ... n_20] per trial.\n\nComparing the algorithms. Below we validate our reasoning on input and output coding with several experiments. We will compare the kernel algorithms KDE, SVM and Gaussian Processes (GP) and a simple k-nearest neighbour approach (k-NN) that we applied with different kernels and different data representations. As reference values, we give the performance of a standard Bayesian reconstruction method (assuming independent neurons with Poisson characteristics), a Template Matching method and the standard Population Vector method as they are described e.g. in [5] and [3].\n\nIn all our experiments we compute the test error over a five-fold cross-validation using always the same data split, balanced with respect to the classes.^4 We use four out of the five folds of the data to choose the parameters of the kernel and the method. This choice itself is done via another level of five-fold cross-validation (this time unbalanced). Finally we train the best model on these four folds and compute an independent test error on the remaining fold.\n\nSince simple zero-one loss is not very informative about the error in multi-class problems, we report the linear loss of the predicted angles, while taking into account the circular structure of the problem. 
Hence the loss function takes the form\n\nL(α, β) = min{ |α − β|, 180° − |α − β| }.   (5)\n\n^4 I.e., in every fold we have the same number of points per class.\n\nThe parameters of the KDE algorithm (ridge parameter) and the SVM (C) are taken from a logarithmic grid (ridge = 10^-5, 10^-4, ..., 10^1; C = 10^-1, 1, ..., 10^5). After we knew its order of magnitude, we chose the σ-parameter of the Gaussian kernel from a linear grid (σ = 1, 2, ..., 10). The spikernel has four parameters: λ, μ, N and p. The stimulus in our experiment was perceived over the whole period of recording. Therefore we do not want any increasing weight of the similarity score towards the beginning or the end of the spike sequence and we fix λ = 1. Further we chose N = 10 to be the length of our sequence, and thereby consider patterns of all possible lengths. The parameters μ and p are chosen from the following (partly linear) grids: μ = 0.01, 0.05, 0.1, 0.2, 0.3, 0.4, ..., 0.8, 0.9, 0.99 and p = 0.05, 0.1, 0.3, 0.5, ..., 2.5, 2.7.\n\nTable 1: Mean test error and standard error on the low contrast dataset (columns: Gaussian Kernel | Spikernel | Alignment score)\n\nKDE, 10 bins:             16.8° ± 1.6° | 11.5° ± 1.3°    | 11.2° ± 1.3°\nKDE, 1 bin:               12.8° ± 1.7° | (13.6° ± 1.8°)† |\nSVM (1-vs-rest), 10 bins: 16.8° ± 2.0° | 13.1° ± 1.4°    | 12.3° ± 1.5°\nSVM (1-vs-rest), 1 bin:   13.3° ± 1.6° |                 |\nSVM (1-vs-1), 10 bins:    16.4° ± 1.6° | 13.8° ± 1.3°    | 12.1° ± 1.4°\nSVM (1-vs-1), 1 bin:      12.2° ± 1.7° |                 |\nk-NN, 10 bins:            18.7° ± 1.5° | 12.8° ± 0.9°    | 13.0° ± 2.0°\nk-NN, 1 bin:              14.0° ± 1.7° |                 |\nGP, 2 bins ‡:             16.2° ± 1.1° | n/a *           | n/a *\nGP, 1 bin:                15.6° ± 1.7° |                 |\n\nBayesian rec.: 14.4° ± 2.1°, Template Matching: 17.7° ± 0.6°, Pop. Vect.: 28.8° ± 1.0°\n\nTable 2: Mean test error and standard error on the high contrast dataset (columns: Gaussian Kernel | Spikernel | Alignment score)\n\nKDE, 10 bins:             1.9° ± 0.5° | 1.7° ± 0.4°    | 1.4° ± 0.5°\nKDE, 1 bin:               1.4° ± 0.5° | (1.6° ± 0.4°)† |\nSVM (1-vs-rest), 10 bins: 1.5° ± 0.5° | 1.4° ± 0.6°    | 0.8° ± 0.3°\nSVM (1-vs-rest), 1 bin:   1.4° ± 0.4° |                |\nSVM (1-vs-1), 10 bins:    1.2° ± 0.4° | 2.1° ± 0.4°    | 1.0° ± 0.4°\nSVM (1-vs-1), 1 bin:      1.1° ± 0.4° |                |\nk-NN, 10 bins:            4.7° ± 1.2° | 1.0° ± 0.5°    | 1.0° ± 0.3°\nk-NN, 1 bin:              1.7° ± 0.6° |                |\nGP, 2 bins ‡:             1.4° ± 0.4° | n/a *          | n/a *\nGP, 1 bin:                2.0° ± 0.5° |                |\n\nBayesian rec.: 3.8° ± 0.6°, Template Matching: 7.2° ± 1.0°, Pop. Vect.: 11.6° ± 0.7°\n\n† We report this number only for comparison, since the spikernel relies on temporal patterns and it makes no sense to use only one bin.\n‡ A 10 bin resolution would require determining 200 parameters w_d of the covariance function (3) from only 192 samples.\n* We did not compute these results. Both kernels are not analytical functions of their parameters and we would lose much of the convenience of Gaussian Processes. 
Using cross-validation instead resembles very much kernel ridge regression on sin 2α and cos 2α, which is almost exactly what KDE is doing when applied with the loss function (5).\n\nThe results for the low contrast dataset are given in Table 1, and Table 2 presents the results for high contrast (five best results in boldface). The relatively large standard error (± σ/√n) is due to the fact that we used only five folds to compute the test error.\n\n4 Discussion\n\nIn our experiments, we have shown that using modern machine learning techniques, it is possible to use tetrode recordings in area V1 to reconstruct the orientation of a stimulus presented to a macaque monkey rather accurately: depending on the contrast of the stimulus, we obtained error rates in the range of 1° to 20°. We can observe that standard techniques for decoding, namely Population Vector, Template Matching and a particular Bayesian reconstruction method, can be outperformed by state-of-the-art kernel methods when applied with an appropriate kernel and suitable data representation. We found that the accuracy of kernel methods can in most cases be improved by utilising task-specific similarity measures for spike trains, such as the spikernel or the introduced alignment distances from bioinformatics. Due to the (by machine learning standards) relatively small size of the analysed datasets, it is hard to draw conclusions regarding which of the applied kernel methods performs best.\n\nRather than focusing too much on the differences in performance, we want to emphasise the capability of kernel machines to assay different decoding hypotheses by choosing appropriate kernel functions. Analysing their respective performance may provide insight about how spike trains carry information and thus about the nature of neural coding.\n\nAcknowledgements. 
For useful help, we thank Gökhan Bakır, Olivier Bousquet and Gunnar Rätsch. J.E. was supported by a grant from the Studienstiftung des deutschen Volkes.\n\nReferences\n\n[1] P. Földiák. The \"ideal homunculus\": statistical inference from neural population responses. In F. Eeckman and J. Bower, editors, Computation and Neural Systems 1992, Norwell, MA, 1993. Kluwer.\n[2] A. S. Tolias, A. G. Siapas, S. M. Smirnakis and N. K. Logothetis. Coding visual information at the level of populations of neurons. Soc. Neurosci. Abst. 28, 2002.\n[3] A. P. Georgopoulos, A. B. Schwartz and R. E. Kettner. Neuronal population coding of movement direction. Science, 233(4771):1416-1419, 1986.\n[4] T. D. Sanger. Probability density estimation for the interpretation of neural population codes. J. Neurophysiol., 76(4):2790-2793, 1996.\n[5] K. Zhang, I. Ginzburg, B. L. McNaughton and T. J. Sejnowski. Interpreting neuronal population activity by reconstruction: unified framework with application to hippocampal place cells. J. Neurophysiol., 79(2):1017-1044, 1998.\n[6] L. Shpigelman, Y. Singer, R. Paz and E. Vaadia. Spikernels: embedding spiking neurons in inner-product spaces. In S. Becker, S. Thrun and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, 2003.\n[7] J. Weston, O. Chapelle, A. Elisseeff, B. Schölkopf and V. Vapnik. Kernel dependency estimation. In S. Becker, S. Thrun and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, 2003.\n[8] B. Schölkopf and A. J. Smola. Learning with Kernels. The MIT Press, Cambridge, Massachusetts, 2002.\n[9] C. K. I. Williams and C. E. Rasmussen. Gaussian processes for regression. In D. S. Touretzky, M. C. Mozer and M. E. Hasselmo, editors, Advances in Neural Information Processing Systems 8, 1996.\n[10] S. B. Needleman and C. D. Wunsch. 
A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48:443-453, 1970.\n[11] J. D. Victor and K. P. Purpura. Nature and precision of temporal coding in visual cortex: a metric-space analysis. J. Neurophysiol., 76(2):1310-1326, 1996.\n[12] V. N. Vapnik. Statistical Learning Theory. John Wiley & Sons, New York, 1998.\n", "award": [], "sourceid": 2357, "authors": [{"given_name": "Jan", "family_name": "Eichhorn", "institution": null}, {"given_name": "Andreas", "family_name": "Tolias", "institution": null}, {"given_name": "Alexander", "family_name": "Zien", "institution": null}, {"given_name": "Malte", "family_name": "Kuss", "institution": null}, {"given_name": "Jason", "family_name": "Weston", "institution": null}, {"given_name": "Nikos", "family_name": "Logothetis", "institution": null}, {"given_name": "Bernhard", "family_name": "Sch\u00f6lkopf", "institution": null}, {"given_name": "Carl", "family_name": "Rasmussen", "institution": null}]}