{"title": "STDP enables spiking neurons to detect hidden causes of their inputs", "book": "Advances in Neural Information Processing Systems", "page_first": 1357, "page_last": 1365, "abstract": "The principles by which spiking neurons contribute to the astounding computational power of generic cortical microcircuits, and how spike-timing-dependent plasticity (STDP) of synaptic weights could generate and maintain this computational function, are unknown. We show here that STDP, in conjunction with a stochastic soft winner-take-all (WTA) circuit, induces spiking neurons to generate through their synaptic weights implicit internal models for subclasses (or \"causes\") of the high-dimensional spike patterns of hundreds of pre-synaptic neurons. Hence these neurons will fire after learning whenever the current input best matches their internal model. The resulting computational function of soft WTA circuits, a common network motif of cortical microcircuits, could therefore be a drastic dimensionality reduction of information streams, together with the autonomous creation of internal models for the probability distributions of their input patterns. We show that the autonomous generation and maintenance of this computational function can be explained on the basis of rigorous mathematical principles. In particular, we show that STDP is able to approximate a stochastic online Expectation-Maximization (EM) algorithm for modeling the input data. 
A corresponding result is shown for Hebbian learning in artificial neural networks.\"", "full_text": "STDP enables spiking neurons to\ndetect hidden causes of their inputs\n\nBernhard Nessler, Michael Pfeiffer, and Wolfgang Maass\n\nInstitute for Theoretical Computer Science, Graz University of Technology\n\nA-8010 Graz, Austria\n\n{nessler,pfeiffer,maass}@igi.tugraz.at\n\nAbstract\n\nThe principles by which spiking neurons contribute to the astounding computa-\ntional power of generic cortical microcircuits, and how spike-timing-dependent\nplasticity (STDP) of synaptic weights could generate and maintain this compu-\ntational function, are unknown. We show here that STDP, in conjunction with\na stochastic soft winner-take-all (WTA) circuit, induces spiking neurons to gen-\nerate through their synaptic weights implicit internal models for subclasses (or\n\u201ccauses\u201d) of the high-dimensional spike patterns of hundreds of pre-synaptic neu-\nrons. Hence these neurons will \ufb01re after learning whenever the current input best\nmatches their internal model. The resulting computational function of soft WTA\ncircuits, a common network motif of cortical microcircuits, could therefore be\na drastic dimensionality reduction of information streams, together with the au-\ntonomous creation of internal models for the probability distributions of their in-\nput patterns. We show that the autonomous generation and maintenance of this\ncomputational function can be explained on the basis of rigorous mathematical\nprinciples. In particular, we show that STDP is able to approximate a stochastic\nonline Expectation-Maximization (EM) algorithm for modeling the input data. 
A\ncorresponding result is shown for Hebbian learning in arti\ufb01cial neural networks.\n\n1\n\nIntroduction\n\nIt is well-known that synapses change their synaptic ef\ufb01cacy (\u201cweight\u201d) w in dependence of the\ndifference tpost \u2212 tpre of the \ufb01ring times of the post- and presynaptic neuron according to variations\nof a generic STDP rule (see [1] for a recent review). However, the computational bene\ufb01t of this\nlearning rule is largely unknown [2, 3]. It has also been observed that local WTA-circuits form a\ncommon network-motif in cortical microcircuits [4]. However, it is not clear how this network-motif\ncontributes to the computational power and adaptive capabilities of laminar cortical microcircuits,\nout of which the cortex is composed. Finally, it has been conjectured for quite some while, on the\nbasis of theoretical considerations, that the discovery and representation of hidden causes of their\nhigh-dimensional afferent spike inputs is a generic computational operation of cortical networks of\nneurons [5]. One reason for this belief is that the underlying mathematical framework, Expectation-\nMaximization (EM), arguably provides the most powerful approach to unsupervised learning that\nwe know of. But one has so far not been able to combine these three potential pieces (STDP, WTA-\ncircuits, EM) of the puzzle into a theory that could help us to unravel the organization of computation\nand learning in cortical networks of neurons.\n\nWe show in this extended abstract that STDP in WTA-circuits approximates EM for discovering\nhidden causes of large numbers of input spike trains. We \ufb01rst demonstrate this in section 2 in an\napplication to a standard benchmark dataset for the discovery of hidden causes. 
In section 3 we show that the functioning of this demonstration can be explained on the basis of EM for simpler non-spiking approximations to the spiking network considered in section 2.\n\n1\n\n\f2 Discovery of hidden causes for a benchmark dataset\n\nWe applied the network architecture shown in Fig. 1A to handwritten digits from the MNIST dataset [6].1 This dataset consists of 70,000 28 × 28-pixel images of handwritten digits2, from which we picked the subset of 20,868 images containing only the digits 0, 3 and 4. Training examples were randomly sampled from this subset with a uniform distribution of digit classes.\n\n[Figure 1 (panels A, B); legend: "Simple STDP curve", "Complex STDP curve"; panel B plots the weight change ∆wki against tpost - tpre, with potentiation value c · e^{-wki} - 1 inside the STDP window and depression value -1 outside it.]\n\nFigure 1: A) Architecture for learning with STDP in a WTA-network of spiking neurons. B) Learning curve for the two STDP rules that were used (with σ = 10ms). The synaptic weight wki is changed in dependence of the firing times tpre of the presynaptic neuron yi and tpost of the postsynaptic neuron zk. If zk fires at time t without a firing of yi in the interval [t - σ, t + 2σ], wki is reduced by 1. The resulting weight change is in any case multiplied with the current learning rate η, which was chosen in the simulations according to the variance tracking rule7.\n\nPixel values xj were encoded through population coding by binary variables yi (spikes were produced for each variable yi by a Poisson process with a rate of 40 Hz for yi = 1, and 0 Hz for yi = 0, at a simulation time step of 1ms, see Fig. 2A). Every training example x was presented for 50ms. Every neuron yi was connected to all K = 10 output neurons z1, . . . , z10. 
A Poisson process caused firing of one of the neurons zk on average every 5ms (see [8] for a more realistic firing mechanism). The WTA-mechanism ensured that only one of the output neurons could fire at any time step. The winning neuron at time step t was chosen from the soft-max distribution\n\np(zk fires at time t | y) = e^{uk(t)} / sum_{l=1}^{K} e^{ul(t)} ,    (1)\n\nwhere uk(t) = sum_{i=1}^{n} wki ỹi(t) + wk0 represents the current membrane potential of neuron zk (with ỹi(t) = 1 if yi fired within the time interval [t - 10ms, t], else ỹi(t) = 0).3\n\nSTDP with the learning curves shown in Fig. 1B was applied to all synapses wki for an input consisting of a continuous sequence of spike encodings of handwritten digits, each presented for 50ms (see\n\n1A similar network of spiking neurons had been applied successfully in [7] to learn with STDP the classification of symbolic (i.e., not handwritten) characters. Possibly our theoretical analysis could also be used to explain their simulation result.\n\n2Pixels were binarized to black/white. All pixels that were black in less than 5% of the training examples were removed, leaving m = 429 external variables xj, that were encoded by n = 858 spiking neurons yi. Our approach works just as well for external variables xj that assume any finite number of values, provided that they are presented to the network through population coding with one variable yi for every possible value of xj. In fact, the approach appears to work also for the commonly considered population coding of continuous external variables.\n\n3This amounts to a representation of the EPSP caused by a firing of neuron yi by a step function, which facilitates the theoretical analysis in section 3. 
Learning with the spiking network works just as well for biologically realistic EPSP forms.\n\n2\n\n\f[Figure 2: three raster plots over time (0-150 ms); panel A "Input Spike Trains" (858 input neurons), panel B "Output before Learning" and panel C "Output after Learning" (output neurons 1-10).]\n\nFigure 2: Unsupervised classification learning and sparsification of firing of output neurons after training. For testing we presented three examples from an independent test set of handwritten digits 0, 3, 4 from the MNIST dataset, and compared the firing of the output-neurons before and after learning. A) Representation of the three handwritten digits 0, 3, 4 for 50ms each by 858 spiking neurons yi. B) Response of the output neurons before training. C) Response of the output neurons after STDP (according to Fig. 1B) was applied to their weights wki for a continuous sequence of spike encodings of 4000 randomly drawn examples of handwritten digits 0, 3, 4, each represented for 50ms (like in panel A). The three output neurons z4, z9, z6 that respond have generated internal models for the three shown handwritten digits according to Fig. 3C.\n\nFig. 2A).4 The learning rate η was chosen locally according to the variance tracking rule7. Fig. 2C shows that for subsequent representations of new handwritten samples of the same digits only one neuron responds during each of the 50ms while a handwritten digit is shown. The implicit internal models which the output neurons z1, . . . , z10 had created in their weights after applying STDP are made explicit in Fig. 3B and C. 
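The input encoding used in these experiments (population-coded binary variables yi, each active variable driving a 40 Hz Poisson process at a 1 ms time step for the 50 ms presentation of an example) can be sketched in a few lines; this is our own minimal illustration, and all function and variable names are assumptions, not from the paper:

```python
import numpy as np

def encode_example(x, rate_hz=40.0, duration_ms=50, dt_ms=1.0, rng=None):
    """Population-code a binary vector x (one value per pixel) into spike trains.

    Each pixel x_j is represented by two binary variables (one-hot for the
    values 1 and 0), and every active variable y_i emits spikes from a
    Poisson process at `rate_hz`; inactive variables stay silent.
    """
    rng = rng or np.random.default_rng(0)
    x = np.asarray(x, dtype=int)
    y = np.zeros(2 * len(x), dtype=int)
    y[2 * np.arange(len(x)) + (1 - x)] = 1    # x_j = 1 -> first unit of the pair
    n_steps = int(duration_ms / dt_ms)
    p_spike = rate_hz * dt_ms / 1000.0        # spike probability per 1 ms step
    spikes = (rng.random((n_steps, y.size)) < p_spike) & (y == 1)
    return y, spikes                          # spikes[t, i]: y_i fires in step t

y, spikes = encode_example([1, 0, 1])         # three pixels -> 6 input variables
print(y)                                      # -> [1 0 0 1 1 0]
print(spikes.shape)                           # -> (50, 6)
```

In the paper's setting the same construction is applied to the m = 429 retained MNIST pixels, yielding n = 858 input neurons.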
Since there were more output neurons than digits, several output neurons created internal models for different ways of writing the same digit. When, after applying STDP to 2000 random examples of handwritten digits 0 and 3, examples of handwritten digit 4 were also included in the next 2000 examples, the internal models of the 10 output neurons reorganized autonomously, now also including two internal models for different ways of writing the digit 4. The adaptation of the spiking network to the examples shown so far is measured in Fig. 3A by the normalized conditional entropy H(L|Z)/H(L, Z), where L denotes the correct classification of each handwritten digit y, and Z is the random variable which denotes the cluster assignment with p(Z = k|y) = p(zk = 1|y), the firing probabilities at the presentation of digit y, see (1).\n\nSince after training by STDP each of the output neurons fires preferentially for one digit, we can measure the emergent classification capability of the network. The resulting weight-settings achieve a classification error of 2.19% on the digits 0 and 3 after 2000 training steps and 3.68% on all three digits after 4000 training steps on independent test sets of 10,000 new samples each.\n\n3 Underlying theoretical principles\n\nWe show in this section that one can analyze the learning dynamics of the spiking network considered in the preceding section (with the simple STDP curve of Fig. 1B) with the help of Hebbian learning (using rule (12)) in a corresponding non-spiking neural network Nw. Nw is a stochastic artificial neural network with the architecture shown in Fig. 1A, and with a parameter vector w consisting of thresholds wk0 (k = 1, . . . , K) for the K output units z1, . . . , zK and weights wki for the connection from the ith input node yi (i = 1, . . . , n) to the kth output unit zk. 
We assume that this network receives at each discrete time step a binary input vector y ∈ {0, 1}^n and outputs a binary vector z ∈ {0, 1}^K with sum_{k=1}^K zk = 1, where the k such that zk = 1 is drawn from the distribution\n\n4Whereas the weights in the theoretical analysis of section 3 will approximate logs of probabilities (see (6)), one can easily make all weights non-negative by restricting the range of these log-probabilities to [-5, 0], and then adding a constant 5 to all weight values. This transformation gives rise to the factor c = e^5 in Fig. 1B.\n\n3\n\n\f[Figure 3, panel A: normalized conditional entropy (y-axis, 0 to 0.5) against the number of training examples (x-axis, 0 to 4000) for "Spiking Network (simple STDP curve)", "Spiking Network (complex STDP curve)", "Non-spiking Network (no missing attributes)", and "Non-spiking Network (35% missing attributes)"; panels B, C show internal models.]\n\nFigure 3: Analysis of the learning progress of the spiking network for the MNIST dataset. A) Normalized conditional entropy (see text) for the spiking network with the two variants of STDP learning rules illustrated in Fig. 1B (red solid and blue dashed lines), as well as two non-spiking approximations of the network with learning rule (12) that are analyzed in section 3. According to this analysis the non-spiking network with 35% missing attributes (dash-dotted line) is expected to have a very similar learning behavior to the spiking network. 2000 random examples of handwritten digits 0 and 3 were presented (for 50ms each) to the spiking network as the first 2000 examples. Then for the next 2000 examples also samples of handwritten digit 4 were included. 
B) The implicit internal models created by the neurons after 2000 training examples are made explicit by drawing for each pixel the difference wki - wk(i+1) of the weights for input yi and yi+1 that encode the two possible values (black/white) of the variable xj that encodes this pixel value. One can clearly see that neurons created separate internal models for different ways of writing the two digits 0 and 3. C) Re-organized internal models after 2000 further training examples that included digit 4. Two output neurons had created internal models for the newly introduced digit 4.\n\nover {1, . . . , K} defined by\n\np(zk = 1|y, w) = e^{uk} / sum_{l=1}^{K} e^{ul}   with   uk = sum_{i=1}^{n} wki yi + wk0 .    (2)\n\nWe consider the case where there are arbitrary discrete external variables x1, . . . , xm, each ranging over {1, . . . , M} (we had M = 2 in section 2), and assume that these are encoded through binary variables y1, . . . , yn for n = m · M with sum_{i=1}^{n} yi = m according to the rule\n\ny(j-1)·M+r = 1  <=>  xj = r ,   for j = 1, . . . , m and r = 1, . . . , M .    (3)\n\nIn other words: the group Gj of variables y(j-1)·M+1, . . . , y(j-1)·M+M provides a population coding for the discrete variable xj.\n\nWe now consider a class of probability distributions that is particularly relevant for our analysis: mixtures of multinomial distributions [9], a generalization of mixtures of Bernoulli distributions (see section 9.3.3 of [10]). This is a standard model for latent class analysis [11] in the case of discrete variables. Mixtures of multinomial distributions are arbitrary mixtures of K distributions p1(x), . . . , pK(x) that factorize, i.e.,\n\npk(x) = prod_{j=1}^{m} pkj(xj)\n\nfor arbitrary distributions pkj(xj) over the range {1, . . . , M} of possible values for xj. 
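To make this generative model concrete, a small sketch (our own; the function name and toy parameters are assumptions) of drawing one sample from such a mixture of multinomials: first a mixture component k is drawn, then each xj is drawn independently from pkj:

```python
import numpy as np

def sample_mixture_of_multinomials(prior, p, rng=None):
    """Draw one x from a mixture of multinomial distributions.

    prior : length-K vector of mixture weights
    p     : array of shape (K, m, M) with p[k, j, r] = p_kj(x_j = r + 1),
            i.e. each mixture component k factorizes over the variables x_j.
    Returns the chosen component k and the sampled vector x (values in 1..M).
    """
    rng = rng or np.random.default_rng()
    K, m, M = p.shape
    k = rng.choice(K, p=prior)                  # pick the mixture component
    x = np.array([rng.choice(M, p=p[k, j]) + 1  # then x_j ~ p_kj, independently
                  for j in range(m)])
    return k, x

# toy model with K = 2 components, m = 3 variables, M = 2 values (as in section 2)
p = np.array([[[0.9, 0.1]] * 3,
              [[0.1, 0.9]] * 3])
k, x = sample_mixture_of_multinomials(np.array([0.5, 0.5]), p)
```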
In other words: there exists some distribution over hidden binary variables zk with sum_{k=1}^{K} zk = 1, where the k with zk = 1 is usually referred to as a hidden "cause" in the generation of x, such that\n\np(x) = sum_{k=1}^{K} p(zk = 1) · pk(x).    (4)\n\n4\n\n\fWe first observe that any such distribution p(x) can be represented with some suitable weight vector w by the neural network Nw, after recoding of the multinomial variables xj by binary variables yi as defined before:\n\np(y|w) = sum_{k=1}^{K} e^{u*k}   with   u*k := sum_{i=1}^{n} w*ki yi + w*k0 ,    (5)\n\nfor\n\nw*ki := log p(yi = 1|zk = 1)   and   w*k0 := log p(zk = 1) .    (6)\n\nIn addition, Nw defines for any weight vector w whose components are normalized, i.e.\n\nsum_{k=1}^{K} e^{wk0} = 1   and   sum_{i ∈ Gj} e^{wki} = 1 ,   for j = 1, . . . , m; k = 1, . . . , K,    (7)\n\na mixture of multinomials of the type (4).\n\nThe problem of learning a generative model for some arbitrarily given input distribution p*(x) (or p*(y) after recoding according to (3)), by the neural network Nw is to find a weight vector w such that p(y|w) defined by (5) models p*(y) as accurately as possible. As usual, we quantify this goal by demanding that\n\nEp*[log p(y|w)]    (8)\n\nis maximized.\n\nNote that the architecture Nw is very useful from a functional point of view, because if (7) holds, then the weighted sum uk at its unit zk has according to (2) the value log p(zk = 1|y, w), and the stochastic WTA rule of Nw picks the "winner" k with zk = 1 from this internally generated model p(zk = 1|y, w) for the actual distribution p*(zk = 1|y) of hidden causes. 
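This functional point can be checked numerically: if the weights are set according to (6), so that the normalization (7) holds, the soft-max distribution (2) of Nw coincides with the Bayes posterior over hidden causes of the mixture model. A minimal sketch (our own code and toy numbers, not from the paper):

```python
import numpy as np

# toy mixture: K = 2 hidden causes, m = 2 variables, M = 2 values,
# population-coded into n = m*M = 4 binary inputs y_i
prior = np.array([0.4, 0.6])                  # p(z_k = 1)
pkj = np.array([[[0.8, 0.2], [0.7, 0.3]],     # pkj[k, j, r] = p_kj(x_j = r + 1)
                [[0.3, 0.7], [0.1, 0.9]]])

# weights chosen according to (6); note they satisfy the normalization (7)
w = np.log(pkj.reshape(2, -1))                # w[k, i] = log p(y_i = 1 | z_k = 1)
w0 = np.log(prior)                            # thresholds w_k0 = log p(z_k = 1)

def wta_posterior(y):
    """Soft-max distribution (2): p(z_k = 1 | y, w) over the K output units."""
    u = w @ y + w0                            # membrane potentials u_k
    e = np.exp(u - u.max())                   # numerically stable soft-max
    return e / e.sum()

x = (1, 2)                                    # external variables x_1 = 1, x_2 = 2
y = np.array([1, 0, 0, 1])                    # their population coding (3)

# direct Bayes posterior of the mixture model, for comparison
joint = prior * pkj[:, 0, x[0] - 1] * pkj[:, 1, x[1] - 1]
print(wta_posterior(y))                       # equals joint / joint.sum()
```

The agreement is exact here because uk = log p(zk = 1) + sum_j log pkj(xj), so the soft-max just renormalizes the joint probabilities.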
We will not enforce the normalization (7) explicitly during the subsequently considered learning process, but rather use a learning rule (12) that turns out to automatically approximate such normalization in the limit.\n\nExpectation Maximization (EM) is the standard method for maximizing Ep*[log p(y|w)]. We will show that the simple STDP-rule of Fig. 1B for the spiking network of section 2 can be viewed as an approximation to an online version of this EM method. We will first consider in section 3.1 the standard EM-approach, and show that the Hebbian learning rule (12) provides a stochastic approximation to the maximization step.\n\n3.1 Reduction to EM\n\nThe standard method for maximizing the expected log-likelihood Ep*[log p(y|w)] with a distribution p of the form p(y|w) = sum_z p(y, z|w) with hidden variables z, is to observe that Ep*[log p(y|w)] can be written for arbitrary distributions q(z|y) in the form\n\nEp*[log p(y|w)] = L(q, w) + Ep*[KL(q(z|y) || p(z|y, w))]    (9)\n\nL(q, w) = Ep*[ sum_z q(z|y) log ( p(y, z|w) / q(z|y) ) ] ,    (10)\n\nwhere KL(.) denotes the Kullback-Leibler divergence. In the E-step one sets q(z|y) = p(z|y, w_old) for the current parameter values w = w_old, thereby achieving Ep*[KL(q(z|y) || p(z|y, w_old))] = 0. In the M-step one replaces w_old by new parameters w that maximize L(q, w) for this distribution q(z|y). One can easily show that this is achieved by setting\n\nw*ki = log p*(yi = 1|zk = 1)   and   w*k0 = log p*(zk = 1),    (11)\n\nwith values for the variables zk generated by q(z|y) = p(z|y, w_old), while the values for the variables y are generated by the external distribution p*. Note that this distribution of z is exactly the distribution (2) of the output of the neural network Nw for inputs y generated by p*.5 In the following section we will show that this M-step can be approximated by applying iteratively a simple Hebbian learning rule to the weights w of the neural network Nw.\n\n5Hence one can extend p*(y) for each fixed w to a joint distribution p*(y, z), where the z are generated for each y by Nw.\n\n5\n\n\f3.2 A Hebbian learning rule for the M-step\n\nWe show here that the target weight values (11) are the only equilibrium points of the following Hebbian learning rule:\n\n∆wki = η (e^{-wki} - 1) if yi = 1 and zk = 1;  ∆wki = -η if yi = 0 and zk = 1;  ∆wki = 0 if zk = 0;\n∆wk0 = η (e^{-wk0} - 1) if zk = 1;  ∆wk0 = -η if zk = 0.    (12)\n\nIt is obvious (using for the second equivalence the fact that yi is a binary variable) that\n\nE[∆wki] = 0 <=> p*(yi=1|zk=1) η (e^{-wki} - 1) - p*(yi=0|zk=1) η = 0\n<=> p*(yi=1|zk=1) (e^{-wki} - 1) + p*(yi=1|zk=1) - 1 = 0\n<=> p*(yi=1|zk=1) e^{-wki} = 1\n<=> wki = log p*(yi=1|zk=1) .    (13)\n\nAnalogously one can show that E[∆wk0] = 0 <=> wk0 = log p*(zk=1). With similar elementary calculations one can show that E[∆wki] has for any w a value that moves wki in the direction of w*ki (in fact, exponentially fast).\n\nOne can actually show that one single step of (12) is a linear approximation of the ideal incremental update of wki = log(aki/Nk), with aki and Nk representing the values of the corresponding sufficient statistics, as log((aki + 1)/(Nk + 1)) = wki + log(1 + η e^{-wki}) - log(1 + η) for η = 1/Nk. This also reveals the role of the learning rate η as the reciprocal of the equivalent sample size6.\n\nIn order to guarantee the stochastic convergence (see [12]) of the learning rule one has to use a decaying learning rate η(t) such that sum_{t=1}^{∞} η(t) = ∞ and sum_{t=1}^{∞} (η(t))^2 < ∞.7\n\nThe learning rule (12) is similar to a rule that had been introduced in [13] in the context of supervised learning and reinforcement learning. That rule had satisfied an equilibrium condition similar to (13). But to the best of our knowledge, such type of rule has so far not been considered in the context of unsupervised learning.\n\nOne can easily see the correspondence between the update of wki in (12) and in the simple STDP rule of Fig. 1B. In fact, if each time where neuron zk fires in the spiking network, each presynaptic neuron yi that currently has a high firing rate has fired within the last σ = 10ms before the firing of zk, the two learning rules become equivalent. However since the latter condition could only be achieved with biologically unrealistic high firing rates, we need to consider in section 3.4 the case for the non-spiking network where some attributes are missing (i.e., yi = 0 for all i ∈ Gj, for some group Gj that encodes an external variable xj via population coding).\n\nWe first show that the Hebbian learning rule (12) is also meaningful in the case of online learning of Nw, which better matches the online learning process for the spiking network.\n\n3.3 Stochastic online EM\n\nThe preceding arguments justify an application of learning rule (12) for a number of steps within each M-step of a batch EM approach for maximizing Ep*[log p(y|w)]. We now show that it is also meaningful to apply the same rule (12) in an online stochastic EM approach (similarly as in [14]), where at each combined EM-step only one example y is generated by p*, and the learning rule (12) is applied just once (for zk resulting from p(z|y, w) for the current weights w, or simpler: for the zk that is output by Nw for the current input y).\n\n6The equilibrium condition (13) only sets a necessary constraint for the quotient of the two directions of the update in (12). The actual formulation of (12) is motivated by the goal of updating a sufficient statistics.\n\n7In our experiments we used an adaptation of the variance tracking heuristic from [13]. If we assume that the consecutive values of the weights represent independent samples of their true stochastic distribution at the current learning rate, then this observed distribution is the log of a beta-distribution of the above mentioned parameters of the sufficient statistics. Analytically this distribution has the first and second moments E[wki] ≈ log(aki/Ni) and E[wki^2] ≈ E[wki]^2 + 1/aki + 1/Ni, leading to the estimate ηki^new = 1/Ni = (E[wki^2] - E[wki]^2) / (e^{-E[wki]} + 1). The empirical estimates of these first two moments can be gathered online by exponentially decaying averages using the same learning rate ηki.\n\n6\n\n\fOur strategy for showing that a single application of learning rule (12) is expected to provide progress in an online EM-setting is the following. We consider the Lagrangian F for maximizing Ep*[log p(y|w)] under the constraints (7), and show that an application of rule (12) is expected to increase the value of F. We set\n\nF(w, λ) = Ep*[log p(y|w)] - λ0 (1 - sum_{k=1}^{K} e^{wk0}) - sum_{k=1}^{K} sum_{j=1}^{m} λkj (1 - sum_{i ∈ Gj} e^{wki}) .    (14)\n\nAccording to (5) one can write p(y|w) = sum_{k=1}^{K} e^{uk} for uk = sum_{i=1}^{n} wki yi + wk0. Hence one arrives at the following conditions for the Lagrange multipliers λ:\n\n∂F/∂wk0 = Ep*[ e^{uk} / sum_{l=1}^{K} e^{ul} ] - λ0 e^{wk0} = 0    (15)\n\n∂F/∂wki = Ep*[ yi e^{uk} / sum_{l=1}^{K} e^{ul} ] - λkj e^{wki} = 0   for i ∈ Gj,    (16)\n\nwhich yield λ0 = 1 and λkj = Ep*[ e^{uk} / sum_{l=1}^{K} e^{ul} ].\n\nPlugging these values for λ into ∇wF · Ep*[∆w] with ∆w defined by (12) shows that this vector product is always positive. Hence even a single application of learning rule (12) to a single new example y, drawn according to p*, is expected to increase Ep*[log p(y|w)] under the constraints (7).\n\n3.4 Impact of missing attributes\n\nWe had shown at the end of 3.2 that learning in the spiking network corresponds to learning in the non-spiking network Nw with missing attributes. A profound analysis of the correct handling of missing attribute values in EM can be found in [15]. Their analysis implies that the correct learning action is then not to change the weights wki for i ∈ Gj. However the STDP rule of Fig. 1B, as well as (12), reduce also these weights by η if zk fires. This yields a modification of the equilibrium analysis (13):\n\nE[∆wki] = 0 <=> (1 - r) ( p*(yi=1|zk=1) η (e^{-wki} - 1) - p*(yi=0|zk=1) η ) - r η = 0\n<=> wki = log p*(yi=1|zk=1) + log(1 - r) ,    (17)\n\nwhere r is the probability that i belongs to a group Gj where the value of xj is missing. Since this probability r is independent of the neuron zk and also independent of the current value of the external variable xj, this offset of log(1 - r) is expected to be the same for all weights. 
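The shifted equilibrium (17) can be reproduced in a few lines of simulation (our own sketch; p*, r and the constant learning rate are toy values): rule (12) is applied at every firing of zk, the attribute is missing with probability r, and the weight settles near log p*(yi=1|zk=1) + log(1 - r). Setting r = 0 recovers the plain equilibrium (13).

```python
import numpy as np

rng = np.random.default_rng(1)
p_star, r = 0.3, 0.35     # p*(y_i = 1 | z_k = 1) and missing-attribute probability
w, eta = -1.0, 0.01       # initial weight, small constant learning rate
trace = []

for step in range(300000):                # each step: one firing of z_k
    if rng.random() < r:                  # attribute missing -> y_i = 0
        w -= eta
    elif rng.random() < p_star:           # attribute present, y_i = 1
        w += eta * (np.exp(-w) - 1)
    else:                                 # attribute present, y_i = 0
        w -= eta
    if step >= 150000:
        trace.append(w)

# empirical stationary mean vs. the prediction of (17)
print(np.mean(trace), np.log(p_star) + np.log(1 - r))
```

With a constant η the weight fluctuates around the equilibrium instead of converging, which is why the average over the second half of the run is compared to the prediction.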
It can easily be verified that such an offset does not change the resulting probabilities of the competition in the E-step according to (2).\n\n3.5 Relationship between the spiking and the non-spiking network\n\nAs indicated at the end of section 3.2, the learning process for the spiking network from section 2 with the simple STDP curve from Fig. 1B (and external variables xj encoded by input spike trains from neurons yi) is equivalent to a somewhat modified learning process of the non-spiking network Nw with the Hebbian learning rule (12) and external variables xj encoded by binary variables yi. Each firing of a neuron zk at some time t corresponds to a discrete time step in Nw with an application of the Hebbian learning rule (12). Each neuron yi that had fired during the time interval [t - 10ms, t] contributes a value ỹi(t) = 1 to the membrane potential uk(t) of the neuron zk at time t, and a value ỹi(t) = 0 if it did not fire during [t - 10ms, t]. Hence the weight updates at time t according to the simple STDP curve are exactly equal to those of (12) in the non-spiking network. However (12) will in general be applied to a corresponding input y where it may occur that for some\n\n7\n\n\fj ∈ {1, . . . , m} one has yi = 0 for all i ∈ Gj (since none of the neurons yi with i ∈ Gj fired in the spiking network during [t - 10ms, t]). 
Hence we arrive at an application of (12) to an input y with\nmissing attributes, as discussed in section 3.4.\nSince several neurons zk are likely to \ufb01re during the presentation of an external input x (each hand-\nwritten digit was presented for 50ms in section 2; but a much shorter presentation time of 10ms also\nworks quite well), this external input x gives in general rise to several applications of the STDP rule.\nThis corresponds to several applications of rule (12) to the same input (but with different choices\nof missing attributes) in the non-spiking network. In the experiments in section 2, every example\nin the non-spiking network with missing attributes was therefore presented for 10 steps, such that\nthe average number of learning steps is the same as in the spiking case. The learning process of\nthe spiking network corresponds to a slight variation of the stochastic online EM algorithm that is\nimplemented through (12) according to the analysis of section 3.3.\n\n4 Discussion\n\nThe model for discovering hidden causes of inputs that is proposed in this extended abstract presents\nan interesting shortcut for implementing and learning generative models for input data in networks\nof neurons. Rather than building and adapting an explicit model for re-generating internally the dis-\ntribution of input data, our approach creates an implicit model of the input distribution (see Fig. 3B)\nthat is encoded in the weights of neurons in a simple WTA-circuit. One might call it a Vapnik-style\n[16] approach towards generative modeling, since it focuses directly on the task to represent the\nmost likely hidden causes of the inputs through neuronal \ufb01ring. As the theoretical analysis via non-\nspiking networks in section 3 has shown, this approach also offers a new perspective for generating\nself-adapting networks on the basis of traditional arti\ufb01cial neural networks. 
One just needs to add the stochastic and non-feedforward parts required for implementing stochastic WTA circuits to a 1-layer feedforward network, and apply the Hebbian learning rule (12) to the feedforward weights. One interesting aspect of the "implicit generative learning" approach that we consider in this extended abstract is that it retains important advantages of the generative learning approach, faster learning and better generalization [17], while keeping the algorithmic simplicity of the discriminative learning approach.\n\nOur approach also provides a new method for analyzing details of STDP learning rules. The simulation results of section 2 show that a simplified STDP rule that can be understood clearly from the perspective of stochastic online EM with a suitable Hebbian learning rule provides good performance in discovering hidden causes for a standard benchmark dataset. A more complex STDP rule, whose learning curve better matches experimentally recorded average changes of synaptic weights, provides almost the same performance. For a comparison of the STDP curves in Fig. 1B with experimentally observed STDP curves one should keep in mind that most experimental data on STDP curves are for very low firing rates. The STDP curve of Fig. 7C in [18] for a firing rate of 20Hz has, similarly to the STDP curves in Fig. 1B of this extended abstract, no pronounced negative dip, and instead an almost constant negative part.\n\nIn our upcoming paper [8] we will provide full proofs for the results announced in this extended abstract, as well as further applications and extensions of the learning result. We will also demonstrate that the learning rules that we have proposed are robust to noise, and that they are matched quite well by experimental data.\n\nAcknowledgments\n\nWe would like to thank the anonymous reviewer for a hint in the notational formalism. 
Written under\npartial support by the Austrian Science Fund FWF, project # P17229-N04, project # S9102-N04,\nand project # FP6-015879 (FACETS) as well as # FP7-216593 (SECO) of the European Union.\n\n8\n\n\fReferences\n\n[1] Y. Dan and M. Poo. Spike timing-dependent plasticity of neural circuits. Neuron, 44:23\u201330, 2004.\n[2] L. F. Abbott and S. B. Nelson. Synaptic plasticity: taming the beast. Nature Neuroscience, 3:1178\u20131183,\n\n2000.\n\n[3] A. Morrison, A. Aertsen, and M. Diesmann. Spike-timing-dependent plasticity in balanced random net-\n\nworks. Neural Computation, 19:1437\u20131467, 2007.\n\n[4] R. J. Douglas and K. A. Martin. Neuronal circuits of the neocortex. Annu Rev Neurosci, 27:419\u2013451,\n\n2004.\n\n[5] G. E. Hinton and Z. Ghahramani. Generative models for discovering sparse distributed representations.\n\nPhilos Trans R Soc Lond B Biol Sci., 352(1358):1177\u20131190, 1997.\n\n[6] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition.\n\nProceedings of the IEEE, 86(11):2278\u20132324, 1998.\n\n[7] A. Gupta and L. N. Long. Character recognition using spiking neural networks. IJCNN, pages 53\u201358,\n\n2007.\n\n[8] B. Nessler, M. Pfeiffer, and W. Maass. Spike-timing dependent plasticity performs stochastic expectation\n\nmaximization to reveal the hidden causes of complex spike inputs. (in preparation).\n\n[9] M. Meil\u02d8a and D. Heckerman. An experimental comparison of model-based clustering methods. Machine\n\nLearning, 42(1):9\u201329, 2001.\n\n[10] C. M. Bishop. Pattern Recognition and Machine Learning. Springer, New York, 2006.\n[11] G. McLachlan and D. Peel. Finite mixture models. Wiley, 2000.\n[12] J.H. Kushner and G.G. Yin. Stochastic approximation algorithms and applications. Springer, 1997.\n[13] B. Nessler, M. Pfeiffer, and W. Maass. Hebbian learning of bayes optimal decisions. In Advances in\n\nNeural Information Processing Systems 21, pages 1169\u20131176. 
MIT Press, 2009.\n\n[14] M. Sato. Fast learning of on-line EM algorithm. Rapport Technique, ATR Human Information Processing\n\nResearch Laboratories, 1999.\n\n[15] Z. Ghahramani and M.I. Jordan. Mixture models for learning from incomplete data. Computational\n\nLearning Theory and Natural Learning Systems, 4:67\u201385, 1997.\n\n[16] V. Vapnik. Universal learning technology: Support vector machines. NEC Journal of Advanced Technol-\n\nogy, 2:137\u2013144, 2005.\n\n[17] A. Y. Ng and M. I. Jordan. On discriminative vs. generative classi\ufb01ers: A comparison of logistic regression\n\nand naive Bayes. Advances in Neural Information Processing Systems (NIPS), 14:841\u2013848, 2002.\n\n[18] P. J. Sj\u00a8ostr\u00a8om, G. G. Turrigiano, and S. B. Nelson. Rate, timing, and cooperativity jointly determine\n\ncortical synaptic plasticity. Neuron, 32:1149\u20131164, 2001.\n\n9\n\n\f", "award": [], "sourceid": 737, "authors": [{"given_name": "Bernhard", "family_name": "Nessler", "institution": null}, {"given_name": "Michael", "family_name": "Pfeiffer", "institution": null}, {"given_name": "Wolfgang", "family_name": "Maass", "institution": null}]}