{"title": "On Herding and the Perceptron Cycling Theorem", "book": "Advances in Neural Information Processing Systems", "page_first": 694, "page_last": 702, "abstract": "The paper develops a connection between traditional perceptron algorithms and recently introduced herding algorithms. It is shown that both algorithms can be viewed as an application of the perceptron cycling theorem. This connection strengthens some herding results and suggests new (supervised) herding algorithms that, like CRFs or discriminative RBMs, make predictions by conditioning on the input attributes. We develop and investigate variants of conditional herding, and show that conditional herding leads to practical algorithms that perform better than or on par with related classifiers such as the voted perceptron and the discriminative RBM.", "full_text": "On Herding and the Perceptron Cycling Theorem\n\nAndrew E. Gelfand, Yutian Chen, Max Welling\n\nDepartment of Computer Science\nUniversity of California, Irvine\n\n{agelfand,yutianc,welling}@ics.uci.edu\n\nLaurens van der Maaten\n\nDepartment of CSE, UC San Diego\nPRB Lab, Delft University of Tech.\n\nlvdmaaten@gmail.com\n\nAbstract\n\nThe paper develops a connection between traditional perceptron algorithms and\nrecently introduced herding algorithms. It is shown that both algorithms can be\nviewed as an application of the perceptron cycling theorem. This connection\nstrengthens some herding results and suggests new (supervised) herding algo-\nrithms that, like CRFs or discriminative RBMs, make predictions by conditioning\non the input attributes. 
We develop and investigate variants of conditional herding,\nand show that conditional herding leads to practical algorithms that perform\nbetter than or on par with related classifiers such as the voted perceptron and the\ndiscriminative RBM.\n\n1 Introduction\n\nThe invention of the perceptron [12] goes back to the very beginning of AI more than half a century\nago. Rosenblatt\u2019s very simple, neurally plausible learning rule made it an attractive algorithm for\nlearning relations in data: for every input $x_i$, make a linear prediction about its label, $y_i^* = w^T x_i$,\nand update the weights as\n\n$$w \\leftarrow w + x_i (y_i - y_i^*) \\qquad (1)$$\n\nA critical evaluation by Minsky and Papert [11] revealed the perceptron\u2019s limited representational\npower. This fact is reflected in the behavior of Rosenblatt\u2019s learning rule: if the data is linearly\nseparable, then the learning rule converges to the correct solution in a number of iterations that can\nbe bounded by $(R/\\gamma)^2$, where $R$ represents the norm of the largest input vector and $\\gamma$ represents the\nmargin between the decision boundary and the closest data-case. However, \u2018for data sets that are\nnot linearly separable, the perceptron learning algorithm will never converge\u2019 (quoted from [1]).\nWhile the above result is true, the theorem in question has something much more powerful to say.\nThe \u2018perceptron cycling theorem\u2019 (PCT) [2, 11] states that for the inseparable case the weights\nremain bounded and do not diverge to infinity. In this paper, we show that the implication of this\ntheorem is that certain moments are conserved on average. 
Denoting the data-case selected at iteration\n$t$ by $i_t$ (note that the same data-case can be picked multiple times), the corresponding attribute\nvector and label by $(x_{i_t}, y_{i_t})$ with $x_i \\in \\mathcal{X}$, and the label predicted by the perceptron at iteration $t$\nfor data-case $i_t$ by $y_{i_t}^*$, we obtain the following result:\n\n$$\\left\\| \\frac{1}{T} \\sum_{t=1}^{T} x_{i_t} y_{i_t} - \\frac{1}{T} \\sum_{t=1}^{T} x_{i_t} y_{i_t}^* \\right\\| \\sim O(1/T) \\qquad (2)$$\n\nThis result implies that, even though the perceptron learning algorithm does not converge in the\ninseparable case, it generates predictions that correlate with the attributes in the same way as the true\nlabels do. More importantly, the correlations converge to the sample mean with a rate $1/T$, which is\nmuch faster than sampling based algorithms that converge at a rate $1/\\sqrt{T}$. By using general features\n$\\phi(x)$, the above result can be extended to the matching of arbitrarily complicated statistics between\ndata and predictions.\nIn the inseparable case, we can interpret the perceptron as a bagging procedure and average predictions\ninstead of picking the single best (or last) weights found during training. Although not directly\nmotivated by the PCT and Eqn. 2, this is exactly what the voted perceptron (VP) [5] does. Interesting\ngeneralization bounds for the voted perceptron have been derived in [5]. Extensions of VP to\nchain models have been explored in, e.g., [4].\nHerding is a seemingly unrelated family of algorithms for unsupervised learning [15, 14, 16, 3]. In\ntraditional methods for learning Markov Random Field (MRF) models, the goal is to converge to\na single parameter estimate and then perform (approximate) inference in the resulting model. In\ncontrast, herding combines the learning and inference phases by treating the weights as dynamic\nquantities and defining a deterministic set of updates such that averaging predictions preserves certain\nmoments of the training data. 
The herding algorithm generates a weakly chaotic sequence of\nweights and a sequence of states of both hidden and visible variables of the MRF model. The in-\ntermediate states produced by herding are really \u2018representative points\u2019 of an implicit model that\ninterpolates between data cases. We can view these states as pseudo-samples, which analogously to\nEqn. 2, satisfy certain constraints on their average suf\ufb01cient statistics. However, unlike in perceptron\nlearning, the non-convergence of the weights is needed to generate long, non-periodic trajectories of\nstates that can be averaged over.\nIn this paper, we show that supervised perceptron algorithms and unsupervised herding algorithms\ncan all be derived from the PCT. This connection allows us to strengthen existing herding results.\nFor instance, we prove fast convergence rates of sample averages when we use small mini-batches\nfor making updates, or when we use incomplete optimization algorithms to run herding. Moreover,\nthe connection suggests new algorithms that lie between supervised perceptron and unsupervised\nherding algorithms. We refer to these algorithms as \u201cconditional herding\u201d (CH) because, like con-\nditional random \ufb01elds, they condition on the input features. From the perceptron perspective, condi-\ntional herding can be understood as \u201cvoted perceptrons with hidden units\u201d. Conditional herding can\nalso be interpreted as the zero temperature limit of discriminative RBMs (dRBMs) [10].\n\n2 Perceptrons, Herding and the Perceptron Cycling Theorem\n\nWe \ufb01rst review the perceptron cycling theorem that was initially introduced in [11] with a gap in the\nproof that was \ufb01xed in [2]. A sequence of vectors {wt}, wt \u2208 RD, t = 0, 1, . . . 
is generated by the\nfollowing iterative procedure: $w_{t+1} = w_t + v_t$, where $v_t$ is an element of a finite set, $\\mathcal{V}$, and the\nnorm of $v_t$ is bounded: $\\max_i \\|v_i\\| = R < \\infty$.\nPerceptron Cycling Theorem (PCT). $\\forall t \\geq 0$: If $w_t^T v_t \\leq 0$, then there exists a constant $M > 0$\nsuch that $\\|w_t - w_0\\| < M$.\nThe theorem still holds when $\\mathcal{V}$ is a finite set in a Hilbert space. The PCT immediately leads to the\nfollowing result:\nConvergence Theorem. If PCT holds, then: $\\| \\frac{1}{T} \\sum_{t=1}^{T} v_t \\| \\sim O(1/T)$.\nThis result is easily shown by observing that $\\|w_{T+1} - w_0\\| = \\| \\sum_{t=1}^{T} \\Delta w_t \\| = \\| \\sum_{t=1}^{T} v_t \\| < M$,\nand dividing all terms by $T$.\n\n2.1 Voted Perceptron and Moment Matching\n\nThe voted perceptron (VP) algorithm [5] repeatedly applies the update rule in Eqn. 1. Predictions\nof test labels are made after each update and final label predictions are taken as an average of all\nintermediate predictions. The PCT convergence theorem leads to the result of Eqn. 2, where we\nidentify $\\mathcal{V} = \\{x_i (y_i - y_i^*) \\mid y_i = \\pm 1, y_i^* = \\pm 1, i = 1, \\ldots, N\\}$. For the VP algorithm, the PCT\nthus guarantees that the moments $\\langle xy \\rangle_{\\tilde{p}(x,y)}$ (with $\\tilde{p}$ the empirical distribution) are matched with\n$\\langle xy^* \\rangle_{p(y^*|x) \\tilde{p}(x)}$ where $p(y^*|x)$ is the model distribution implied by how VP generates $y^*$.\nIn maximum entropy models, one seeks a model that satisfies a set of expectation constraints (moments)\nfrom the training data, while maximizing the entropy of the remaining degrees of freedom [9].\nIn contrast, a single perceptron strives to learn a deterministic mapping $p(y^*|x) = \\delta[y^* - \\arg\\max_y (y w^T x)]$\nthat has zero entropy and gets every prediction on every training case\ncorrect (where $\\delta$ is the delta function). Entropy is created in $p(y^*|x)$ only when the weights $w_t$ do\nnot converge (i.e. 
for inseparable data sets). Thus, VP and maximum entropy methods are related,\nbut differ in how they handle the degrees of freedom that are unconstrained by moment matching.\n\n2.2 Herding\n\nA new class of unsupervised learning algorithms, known as \u201cherding\u201d, was introduced in [15].\nRather than learning a single \u2018best\u2019 MRF model that can be sampled from to estimate quantities\nof interest, herding combines learning and inference into a single process. In particular, herding\nproduces a trajectory of weights and states that reproduce the moments of the training data.\nConsider a fully observed MRF with features $\\phi(x)$, $x \\in \\mathcal{X} = [1, \\ldots, K]^m$ with $K$ the number of\nstates for each variable $x_j$ ($j = 1, \\ldots, m$) and with an energy function $E(x)$ given by:\n\n$$E(x) = -w^T \\phi(x). \\qquad (3)$$\n\nIn herding [15], the parameters $w$ are updated as:\n\n$$w_{t+1} = w_t + \\bar{\\phi} - \\phi(x_t^*), \\qquad (4)$$\n\nwhere $\\bar{\\phi} = \\frac{1}{N} \\sum_i \\phi(x_i)$ and $x_t^* = \\arg\\max_x w_t^T \\phi(x)$. Eqn. 4 looks like a maximum likelihood\n(ML) gradient update, with constant learning rate and maximization in place of expectation in\nthe right-hand side. This follows from taking the zero temperature limit of the ML objective (see\nSection 2.5). The maximization prevents the herding sequence from converging to a single point\nestimate on this alternative objective.\nLet $\\{w_t\\}$ denote the sequence of weights and $\\{x_t^*\\}$ denote the sequence of states (pseudo-samples)\nproduced by herding. We can apply the PCT to herding by identifying $\\mathcal{V} = \\{\\bar{\\phi} - \\phi(x^*) \\mid x^* \\in \\mathcal{X}\\}$.\nIt is now easy to see that, in general, herding does not converge because under very mild conditions\nwe can always find an $x_t^*$ such that $w_t^T v_t < 0$. From the PCT convergence theorem, we also see\nthat $\\| \\bar{\\phi} - \\frac{1}{T} \\sum_{t=1}^{T} \\phi(x_t^*) \\| \\sim O(1/T)$, i.e. the pseudo-sample averages of the features converge\nto the data averages $\\bar{\\phi}$ at a rate $1/T$ (footnote 1). This is considerably faster than i.i.d. sampling from the\ncorresponding MRF model, which would converge at a rate of $1/\\sqrt{T}$.\nSince the cardinality of the set $\\mathcal{V}$ is exponentially large (i.e. $|\\mathcal{V}| = K^m$), finding the maximizing\nstate $x_t^*$ at each update may be hard. However, the PCT only requires us to find some state $x_t^*$\nsuch that $w_t^T v_t \\leq 0$ and in most cases this can easily be verified. Hence, the PCT provides a\ntheoretical justification for using a local search algorithm that performs partial energy maximization.\nFor example, we may start the local search from the state we ended up in during the previous\niteration (a so-called persistent chain [13, 17]). Or, one may consider contrastive divergence-like\nalgorithms [8], in which the sampling or mean field approximation is replaced by a maximization.\nIn this case, maximizations are initialized on all data-cases and the weights are updated by the\ndifference between the average over the data-cases and the average over the $\\{x_i^*\\}$ found after\n(partial) maximization. In this case, the set $\\mathcal{V}$ is given by: $\\mathcal{V} = \\{\\bar{\\phi} - \\frac{1}{N} \\sum_i \\phi(x_i^*) \\mid x_i^* \\in \\mathcal{X} \\; \\forall i\\}$.\nFor obvious reasons, it is now guaranteed that $w_t^T v_t \\leq 0$.\nIn practice, we often use mini-batches of size $n < N$ instead of the full data set. In this case, the\ncardinality of the set $\\mathcal{V}$ is enlarged to $|\\mathcal{V}| = C(n, N) K^m$, with $C(n, N)$ representing the \u2018N choose\nn\u2019 ways to compute the sample mean $\\bar{\\phi}_{(n)}$ based on a subset of $n$ data-cases. The negative term\nremains unaltered. Since the PCT still applies: $\\| \\frac{1}{T} \\sum_{t=1}^{T} \\bar{\\phi}_{(n),t} - \\frac{1}{T} \\sum_{t=1}^{T} \\phi(x_t^*) \\| \\sim O(1/T)$.\nDepending on how the mini-batches are picked, convergence onto the overall mean $\\bar{\\phi}$ can be either\n$O(1/\\sqrt{T})$ (random sampling with replacement) or $O(1/T)$ (sampling without replacement which\nhas picked all data-cases after $\\lceil N/n \\rceil$ rounds).\n\nFootnote 1: Similar convergence could also be achieved (without concern for generalization performance) by sampling\ndirectly from the training data. However, herding converges with rate $1/T$ and is regularized by the weights to\nprevent overfitting.\n\n2.3 Hidden Variables\n\nThe discussion so far has considered only constant features: $\\phi(x, y) = xy$ for VP and $\\phi(x)$ for\nherding. However, the PCT allows us to consider more general features that depend on the weights\n$w$, as long as the image of this feature mapping (and therefore, the update vector $v$) is a set of finite\ncardinality. In [14], such features took the form of \u2018hidden units\u2019:\n\n$$\\phi(x, z), \\qquad z(x, w) = \\arg\\max_{z'} w^T \\phi(x, z') \\qquad (5)$$\n\nIn this case, we identify the vector $v$ as $v = \\bar{\\phi}(x, z) - \\phi(x^*, z^*)$. In the left-hand term of this\nexpression, $x$ is clamped to the data-cases and $z$ is found as in Eqn. 5 by maximizing every data-case\nseparately; in the right-hand (or negative) term, $x^*, z^*$ are found by jointly maximizing $w^T \\phi(x, z)$.\nThe quantity $\\bar{\\phi}(x, z)$ denotes a sample average over the training cases. We note that $\\phi(x, z)$ indeed\nmaps to a finite domain because it depends on the real parameter $w$ only through the discrete state\n$z$. We also notice again that $w^T v \\leq 0$ because of the definition of $(x^*, z^*)$. From the convergence\ntheorem we find that $\\| \\frac{1}{T} \\sum_{t=1}^{T} \\bar{\\phi}(x, z_t) - \\frac{1}{T} \\sum_{t=1}^{T} \\phi(x_t^*, z_t^*) \\| \\sim O(1/T)$. This result can be\nextended to mini-batches as well.\n\n2.4 Conditional Herding\n\nWe are now ready to propose our new algorithm: conditional herding (CH). Like the VP algorithm,\nCH is concerned with discriminative learning and, therefore, it conditions on the input attributes\n$\\{x_i\\}$. CH differs from VP in that it uses hidden variables, similar to the herder described in the\nprevious subsection. In the most general setting, CH uses features:\n\n$$\\phi(x, y, z), \\qquad z(x, y, w) = \\arg\\max_{z'} w^T \\phi(x, y, z'). \\qquad (6)$$\n\nIn the experiments in Section 3, we use the explicit form:\n\n$$w^T \\phi(x, y, z) = x^T W z + y^T B z + \\theta^T z + \\alpha^T y, \\qquad (7)$$\n\nwhere $W$, $B$, $\\theta$ and $\\alpha$ are the weights, $z$ is a binary vector and $y$ is a binary vector in a 1-of-K\nscheme (see Figure 1). At each iteration $t$, CH randomly samples a subset of the data-cases and their\nlabels $\\mathcal{D}_t = \\{x_{i_t}, y_{i_t}\\} \\subseteq \\mathcal{D}$. For every member of this mini-batch it computes a hidden variable $z_{i_t}$\nusing Eqn. 6. The parameters are then updated as:\n\n$$w_{t+1} = w_t + \\frac{\\eta}{|\\mathcal{D}_t|} \\sum_{i_t \\in \\mathcal{D}_t} \\left( \\phi(x_{i_t}, y_{i_t}, z_{i_t}) - \\phi(x_{i_t}, y_{i_t}^*, z_{i_t}^*) \\right) \\qquad (8)$$\n\nIn the positive term, $z_{i_t}$ is found as in Eqn. 6. The negative term is obtained (similar to the perceptron)\nby making a prediction for the labels, keeping the input attributes fixed:\n\n$$(y_{i_t}^*, z_{i_t}^*) = \\arg\\max_{y', z'} w^T \\phi(x_{i_t}, y', z'), \\qquad \\forall i_t \\in \\mathcal{D}_t. \\qquad (9)$$\n\nFor the PCT to apply to CH, the set $\\mathcal{V}$ of update vectors must be finite. The inputs $x$ can be real-valued\nbecause we condition on the inputs and there will be at most $N$ distinct values (one for each\ndata-case). However, since we maximize over $y$ and $z$ these states must be discrete for the PCT to\napply.\nEqn. 8 includes a potentially vector-valued stepsize $\\eta$. 
Notice however that scaling $w \\leftarrow \\lambda w$ will\nhave no effect on the values of $z$, $z^*$ or $y^*$ and hence on $v$. Therefore, if we also scale $\\eta \\leftarrow \\lambda \\eta$, then\nthe sequence of discrete states $z_t, z_t^*, y_t^*$ will not be affected either. Since $w_t = \\eta \\sum_{t'=0}^{t-1} v_{t'} + w_0$,\nthe only scale that matters is the relative scale between $w_0$ and $\\eta$. If there were just a single\nattractor set for the dynamics of $w$, the initialization $w_0$ would only have a transient effect.\nHowever, in practice the scale of $w_0$ relative to that of $\\eta$ does play an important role, indicating that\nmany different attractor sets exist for this system.\nIrrespective of the attractor we end up in, the PCT guarantees that:\n\n$$\\left\\| \\frac{1}{T} \\sum_{t=1}^{T} \\frac{1}{|\\mathcal{D}_t|} \\sum_{i_t} \\phi(x_{i_t}, y_{i_t}, z_{i_t}) - \\frac{1}{T} \\sum_{t=1}^{T} \\frac{1}{|\\mathcal{D}_t|} \\sum_{i_t} \\phi(x_{i_t}, y_{i_t}^*, z_{i_t}^*) \\right\\| \\sim O(1/T). \\qquad (10)$$\n\nIn general, herding systems perform better when we use normalized features: $\\|\\phi(x, z, y)\\| = R, \\; \\forall (x, z, y)$.\nThe reason is that herding selects states by maximizing the inner product $w^T \\phi$\nand features with large norms will therefore become more likely to be selected. In fact, one can\nshow that states inside the convex hull of the $\\phi(x, y, z)$ are never selected. For binary ($\\pm 1$) variables\nall states live on the convex hull, but this need not be true in general, especially when we use\ncontinuous attributes $x$. 
To remedy this, one can either normalize features or add one additional feature\n(footnote 2) $\\phi_0(x, y, z) = \\sqrt{R_{\\max}^2 - \\|\\phi(x, y, z)\\|^2}$, where $R_{\\max} = \\max_{x,y,z} \\|\\phi(x, y, z)\\|$ and $x$ is only\nallowed to vary over the data-cases.\nFinally, predictions on unseen test data are made by:\n\n$$(y_{tst,t}^*, z_{tst,t}^*) = \\arg\\max_{y', z'} w_t^T \\phi(x_{tst}, y', z'). \\qquad (11)$$\n\nThe algorithm is summarized in the algorithm-box below.\n\nConditional Herding (CH)\n\n1. Initialize $w_0$ (with finite norm) and $y_{avg,j} = 0$ for all test cases $j$.\n2. For $t \\geq 0$:\n\n(a) Choose a subset $\\{x_{i_t}, y_{i_t}\\} = \\mathcal{D}_t \\subseteq \\mathcal{D}$. For each $(x_{i_t}, y_{i_t})$, choose a hidden state $z_{i_t}$.\n(b) Choose a set of \u201cnegative states\u201d $\\{(x_{i_t}^* = x_{i_t}, y_{i_t}^*, z_{i_t}^*)\\}$, such that:\n\n$$\\frac{1}{|\\mathcal{D}_t|} \\sum_{i_t} w_{t-1}^T \\phi(x_{i_t}, y_{i_t}, z_{i_t}) \\leq \\frac{1}{|\\mathcal{D}_t|} \\sum_{i_t} w_{t-1}^T \\phi(x_{i_t}, y_{i_t}^*, z_{i_t}^*). \\qquad (12)$$\n\n3. Update $w_t$ according to Eqn. 8.\n4. Predict on test data as follows:\n\n(a) For every test case $x_{tst,j}$ at every iteration, choose negative states $(y_{tst,jt}^*, z_{tst,jt}^*)$ in the\nsame way as for training data.\n(b) Update online average over predictions, $y_{avg,j}$, for all test cases $j$.\n\n2.5 Zero Temperature Limit of Discriminative MRF Learning\n\nRegular herding can be understood as gradient descent on the zero temperature limit of an MRF\nmodel. In this limit, gradient updates with constant step size never lead to convergence, irrespective\nof how small the step size is. 
Analogously, CH can be viewed as constant step size gradient updates\non the zero temperature limit of discriminative MRFs (see [10] for the corresponding RBM model).\nThe finite temperature model is given by:\n\n$$p(y|x) = \\frac{\\sum_{z} \\exp\\left[w^T \\phi(y, z, x)\\right]}{\\sum_{z', y'} \\exp\\left[w^T \\phi(y', z', x)\\right]}. \\qquad (13)$$\n\nSimilar to herding [14], conditional herding introduces a temperature by replacing $w$ by $w/T$ and\ntakes the limit $T \\to 0$ of $\\ell_T \\triangleq T \\ell$, where $\\ell = \\sum_i \\log p(y_i|x_i)$.\n\n3 Experiments\n\nWe studied the behavior of conditional herding on two artificial and four real-world data sets, comparing\nits performance to that of the voted perceptron [5] and that of discriminative RBMs [10]. The\nexperiments on artificial and real-world data are discussed separately in Sections 3.1 and 3.2.\nWe studied conditional herding in the discriminative RBM architecture illustrated in Figure 1 (i.e.,\nwe use the energy function in Eqn. 7). Per the discussion in Section 2.4, we added an additional\nfeature $\\phi_0(x) = \\sqrt{R_{\\max}^2 - \\|x\\|^2}$ with $R_{\\max} = \\max_i \\|x_i\\|$ in all experiments.\n\nFootnote 2: If in test data this extra feature becomes imaginary we simply set it to zero.\n\nFigure 1: Discriminative Restricted Boltzmann Machine model of distribution $p(y, z|x)$.\n\nFigure 2: Decision boundaries of VP, CH, and dRBMs on two artificial data sets. (a) Banana data set. (b) Lithuanian data set.\n\n3.1 Artificial Data\n\nTo investigate the characteristics of VP, dRBMs and CH, we used the techniques to construct decision\nboundaries on two artificial data sets: (1) the banana data set; and (2) the Lithuanian data\nset. We ran VP and CH for 1,000 epochs using mini-batches of size 100. The decision boundary\nfor VP and CH is located where the sign of the prediction $y_{tst}^*$ changes. 
We\nused conditional herders with 20 hidden units. The dRBMs also had 20 hidden units and were\ntrained by running conjugate gradients until convergence. The weights of the dRBMs were initialized\nby sampling from a Gaussian distribution with a variance of $10^{-4}$. The decision boundary\nfor the dRBMs is located at the point where both class posteriors are equal, i.e., where\n$p(y_{tst}^* = -1 | \\tilde{x}_{tst}) = p(y_{tst}^* = +1 | \\tilde{x}_{tst}) = 0.5$.\nPlots of the decision boundary for the artificial data sets are shown in Figure 2. The results on the\nbanana data set illustrate the representational advantages of hidden units. Since VP selects data\npoints at random to update the weights, on the banana data set, the weight vector of VP tends to\noscillate back and forth yielding a nearly linear decision boundary (footnote 3). This happens because VP can\nregress on only 2 + 1 = 3 fixed features. In contrast, for CH the simple predictor in the top layer can\nregress onto $M = 20$ hidden features. This prevents the same oscillatory behavior from occurring.\n\n3.2 Real-World Data\n\nIn addition to the experiments on synthetic data, we also performed experiments on four real-world\ndata sets, namely (1) the USPS data set, (2) the MNIST data set, (3) the UCI Pendigits data set, and\n(4) the 20-Newsgroups data set. The USPS data set consists of 11,000 grayscale images of size 16 \u00d7 16 of\nhandwritten digits (1,100 images of each digit 0 through 9) with no fixed division. The MNIST data\nset contains 70,000 grayscale images of size 28 \u00d7 28 of digits, with a fixed division into 60,000 training\nand 10,000 test instances. The UCI Pendigits data set consists of 16 (integer-valued) features extracted from\nthe movement of a stylus. It contains 10,992 instances, with a fixed division into 7,494 training\nand 3,498 test instances. The 20-Newsgroups data set contains bag-of-words representations of\n18,774 documents gathered from 20 different newsgroups. 
Since the bag-of-words representation\ncomprises over 60,000 words, we identified the 5,000 most frequently occurring words. From this\nset, we created a data set of 4,900 binary word-presence features by binarizing the word counts and\nremoving the 100 most frequently occurring words. The 20-Newsgroups data has a fixed division\ninto 11,269 training and 7,505 test instances. On all data sets with real-valued input attributes we\nused the \u2018normalizing\u2019 feature described above.\n\nFootnote 3: On the Lithuanian data set, VP constructs a good boundary by exploiting the added \u2018normalizing\u2019 feature.\n\nThe data sets used in the experiments are multi-class. We adopted a 1-of-K encoding, where if $y_i$\nis the label for data point $x_i$, then $y_i = \\{y_{i,1}, \\ldots, y_{i,K}\\}$ is a binary vector such that $y_{i,k} = 1$ if the\nlabel of the $i$th data point is $k$ and $y_{i,k} = -1$ otherwise. Performing the maximization in Eqn. 9 is\ndifficult when $K > 2$. We investigated two different procedures for doing so. In the first procedure,\nwe reduce the multi-class problem to a series of binary decision problems using a one-versus-all\nscheme. The prediction on a test point is taken as the label with the largest online average. In the\nsecond procedure, we make predictions on all $K$ labels jointly. To perform the maximization in\nEqn. 9, we explore all states of $y$ in a one-of-K encoding, i.e. one unit is activated and all others\nare inactive. This partial maximization is not a problem as long as the ensuing configuration satisfies\n$w_t^T v_t \\leq 0$ (footnote 4). The main difference between the two procedures is that in the second procedure the\nweights $W$ are shared amongst the $K$ classifiers. 
The primary advantage of the latter procedure is\nthat it is less computationally demanding than the one-versus-all scheme.\nWe trained the dRBMs by performing iterations of conjugate gradients (using 3 linesearches) on\nmini-batches of size 100 until the error on a small held-out validation set started increasing (i.e.,\nwe employed early stopping) or until the negative conditional log-likelihood on the training data\nstopped decreasing. Following [10], we use L2-regularization on the weights of the dRBMs;\nthe regularization parameter was determined based on the generalization error on the same held-out\nvalidation set. The weights of the dRBMs were initialized from a Gaussian distribution with\nvariance of $10^{-4}$.\nCH used mini-batches of size 100. For the USPS and Pendigits data sets CH used a burn-in period\nof 1,000 updates; on MNIST it was 5,000 updates; and on 20 Newsgroups it was 20,000 updates.\nHerding was stopped when the error on the training set became zero (footnote 5).\nThe parameters of the conditional herders were initialized by sampling from a Gaussian distribution.\nIdeally, we would like each of the terms in the energy function in Eqn. 7 to contribute equally during\nupdating. However, since the dimension of the data is typically much greater than the number of\nclasses, the dynamics of the conditional herding system will be largely driven by $W$. To negate this\neffect, we rescaled the standard deviation of the Gaussian by a factor $1/M$ with $M$ the total number\nof elements of the parameter involved (e.g. $\\sigma_W = \\sigma / (\\dim(x) \\dim(z))$ etc.). We also scale the step\nsizes $\\eta$ by the same factor so the updates will retain this scale during herding. The relative scale\nbetween $\\eta$ and $\\sigma$ was chosen by cross-validation. 
Recall that the absolute scale is unimportant (see\nSection 2.4 for details).\nIn addition, during the early stages of herding, we adapted the parameter update for the bias on the\nhidden units $\\theta$ in such a way that the marginal distribution over the hidden units was nearly uniform.\nThis has the advantage that it encourages high entropy in the hidden units, leading to more useful\ndynamics of the system. In practice, we update $\\theta$ as $\\theta_{t+1} = \\theta_t + \\frac{\\eta}{|\\mathcal{D}_t|} \\sum_{i_t} \\left( (1 - \\lambda) \\langle z_{i_t} \\rangle - z_{i_t}^* \\right)$,\nwhere $\\langle z_{i_t} \\rangle$ is the batch mean. $\\lambda$ is initialized to 1 and we gradually halve its value every 500 updates,\nslowly moving from an entropy-encouraging update to the standard update for the biases of the\nhidden units.\nVP was also run on mini-batches of size 100 (with step size of 1). VP was run until the predictor\nstarted overfitting on a validation set. No burn-in was considered for VP.\nThe results of our experiments are shown in Table 1. In the table, the best performance on each\ndata set using each procedure is typeset in boldface. 
The results reveal that the addition of hidden\nunits to the voted perceptron leads to significant improvements in terms of generalization error.\nFurthermore, the results of our experiments indicate that conditional herding performs on par with\ndiscriminative RBMs on the MNIST and USPS data sets and better on the 20 Newsgroups data set.\nThe 20 Newsgroups data is high dimensional and sparse and both VP and CH appear to perform\nquite well in this regime. Techniques to promote sparsity in the hidden layer when training dRBMs\nexist (see [10]), but we did not investigate them here. It is also worth noting that CH is rather resilient\nto overfitting. This is particularly evident in the low-dimensional UCI Pendigits data set, where the\ndRBMs start to badly overfit with 500 hidden units, while the test error for CH remains level. This\nphenomenon is the benefit of averaging over many different predictors.\n\nOne-Versus-All Procedure\nTechnique       VP              dRBM            dRBM            CH              CH\nHidden units    -               100             200             100             200\nMNIST           7.69%           3.57%           3.58%           3.97%           3.99%\nUSPS            5.03% (0.4%)    3.97% (0.38%)   4.02% (0.68%)   3.49% (0.45%)   3.35% (0.48%)\nUCI Pendigits   10.92%          5.32%           5.00%           3.37%           3.00%\n20 Newsgroups   27.75%          34.78%          34.36%          29.78%          25.96%\n\nJoint Procedure\nTechnique       VP              dRBM            dRBM            dRBM            CH              CH              CH\nHidden units    -               50              100             500             50              100             500\nMNIST           8.84%           3.88%           2.93%           1.98%           2.89%           2.09%           2.09%\nUSPS            4.86% (0.52%)   3.13% (0.73%)   2.84% (0.59%)   4.06% (1.09%)   3.36% (0.48%)   3.07% (0.52%)   2.81% (0.50%)\nUCI Pendigits   6.78%           3.80%           3.23%           8.89%           3.14%           2.57%           2.86%\n20 Newsgroups   24.89%          -               30.57%          30.07%          -               25.76%          24.93%\n\nTable 1: Generalization errors of VP, dRBMs, and CH on 4 real-world data sets. dRBMs and CH\nresults are shown for various numbers of hidden units. The best performance on each data set is\ntypeset in boldface; missing values are shown as \u2018-\u2019. The std. dev. of the error on the 10-fold cross\nvalidation of the USPS data set is reported in parentheses.\n\nFootnote 4: Local maxima can also be found by iterating over $y_{tst}^{*,k}, z_{tst,j}^{*,k}$, but the proposed procedure is more efficient.\nFootnote 5: We use a fixed order of the mini-batches, so that if there are $N$ data cases and the batch size is $K$, if the\ntraining error is 0 for $\\lceil N/K \\rceil$ iterations, the error for the whole training set is 0.\n\n4 Concluding Remarks\n\nThe main contribution of this paper is to expose a relationship between the PCT and herding algorithms.\nThis has allowed us to strengthen certain results for herding, namely, theoretically validating\nherding with mini-batches and partial optimization. It also directly leads to the insight that\nnon-convergent VPs and herding match moments between data and generated predictions at a rate\nmuch faster than random sampling ($O(1/T)$ vs. $O(1/\\sqrt{T})$). From these insights, we have proposed\na new conditional herding algorithm that is the zero-temperature limit of dRBMs [10].\nThe herding perspective provides a new way of looking at learning as a dynamical system. In fact,\nthe PCT precisely specifies the conditions that need to hold for a herding system (in batch mode)\nto be a piecewise isometry [7]. A piecewise isometry is a weakly chaotic dynamical system that\ndivides parameter space into cells and applies a different isometry in each cell. For herding, the\nisometry is given by a translation and the cells are labeled by the states $\\{x^*, y^*, z, z^*\\}$, whichever\ncombination applies. Therefore, the requirement of the PCT that the space $\\mathcal{V}$ must be of finite\ncardinality translates into the division of parameter space in a finite number of cells, each with\nits own isometry. 
Many interesting results about piecewise isometries have been proven in the\nmathematics literature, such as the fact that the number of possible sequences of sampled states grows algebraically with\nT and not exponentially as in systems with random or chaotic components [6]. We envision a fruitful\ncross-fertilization between the relevant research areas in mathematics and learning theory.\n\nAcknowledgments\n\nThis work is supported by NSF grants 0447903, 0914783, 0928427 and 1018433 as well as\nONR/MURI grant 00014-06-1-073. LvdM acknowledges support by the Netherlands Organisation\nfor Scientific Research (grant no. 680.50.0908) and by EU-FP7 NoE on Social Signal Processing\n(SSPNet).\n\nReferences\n\n[1] C.M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.\n\n[2] H.D. Block and S.A. Levin. On the boundedness of an iterative procedure for solving a system of linear inequalities. Proceedings of the American Mathematical Society, 26(2):229\u2013235, 1970.\n\n[3] Y. Chen and M. Welling. Parametric herding. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 2010.\n\n[4] M. Collins. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing, Volume 10, page 8. Association for Computational Linguistics, 2002.\n\n[5] Y. Freund and R.E. Schapire. Large margin classification using the perceptron algorithm. Machine Learning, 37(3):277\u2013296, 1999.\n\n[6] A. Goetz. Perturbations of 8-attractors and births of satellite systems. Internat. J. Bifur. Chaos, Appl. Sci. Engrg., 8(10):1937\u20131956, 1998.\n\n[7] A. Goetz. Global properties of a family of piecewise isometries. Ergodic Theory Dynam. Systems, 29(2):545\u2013568, 2009.\n\n[8] G.E. Hinton. Training products of experts by minimizing contrastive divergence. 
Neural Computation, 14:1771\u20131800, 2002.\n\n[9] E.T. Jaynes. Information theory and statistical mechanics. Physical Review Series II, 106(4):620\u2013663, 1957.\n\n[10] H. Larochelle and Y. Bengio. Classification using discriminative Restricted Boltzmann Machines. In Proceedings of the 25th International Conference on Machine Learning, pages 536\u2013543. ACM, 2008.\n\n[11] M.L. Minsky and S. Papert. Perceptrons: An Introduction to Computational Geometry. Cambridge, Mass.: MIT Press, 1969.\n\n[12] F. Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386\u2013408, 1958.\n\n[13] T. Tieleman. Training Restricted Boltzmann Machines using approximations to the likelihood gradient. In Proceedings of the 25th International Conference on Machine Learning, volume 25, pages 1064\u20131071, 2008.\n\n[14] M. Welling. Herding dynamic weights for partially observed random field models. In Proc. of the Conf. on Uncertainty in Artificial Intelligence, Montreal, Quebec, CAN, 2009.\n\n[15] M. Welling. Herding dynamical weights to learn. In Proceedings of the 26th International Conference on Machine Learning, Montreal, Quebec, CAN, 2009.\n\n[16] M. Welling and Y. Chen. Statistical inference using weak chaos and infinite memory. In Proceedings of the Int\u2019l Workshop on Statistical-Mechanical Informatics (IW-SMI 2010), pages 185\u2013199, 2010.\n\n[17] L. Younes. Parametric inference for imperfectly observed Gibbsian fields. Probability Theory and Related Fields, 82:625\u2013645, 1989.\n", "award": [], "sourceid": 433, "authors": [{"given_name": "Andrew", "family_name": "Gelfand", "institution": null}, {"given_name": "Yutian", "family_name": "Chen", "institution": null}, {"given_name": "Laurens", "family_name": "Maaten", "institution": null}, {"given_name": "Max", "family_name": "Welling", "institution": null}]}