{"title": "Learning via Gaussian Herding", "book": "Advances in Neural Information Processing Systems", "page_first": 451, "page_last": 459, "abstract": "We introduce a new family of online learning algorithms based upon constraining the velocity flow over a distribution of weight vectors. In particular, we show how to effectively herd a Gaussian weight vector distribution by trading off velocity constraints with a loss function. By uniformly bounding this loss function, we demonstrate how to solve the resulting optimization analytically. We compare the resulting algorithms on a variety of real world datasets, and demonstrate how these algorithms achieve state-of-the-art robust performance, especially with high label noise in the training data.", "full_text": "Learning via Gaussian Herding\n\nKoby Crammer\n\nDepartment of Electrical Engineering\n\nThe Technion\n\nHaifa, 32000 Israel\n\nkoby@ee.technion.ac.il\n\nDaniel D. Lee\n\nDept. of Electrical and Systems Engineering\n\nUniversity of Pennsylvania\n\nPhiladelphia, PA 19104\n\nddlee@seas.upenn.edu\n\nAbstract\n\nWe introduce a new family of online learning algorithms based upon constraining the velocity flow over a distribution of weight vectors. In particular, we show how to effectively herd a Gaussian weight vector distribution by trading off velocity constraints with a loss function. By uniformly bounding this loss function, we demonstrate how to solve the resulting optimization analytically. We compare the resulting algorithms on a variety of real world datasets, and demonstrate how these algorithms achieve state-of-the-art robust performance, especially with high label noise in the training data.\n\n1 Introduction\n\nOnline learning algorithms are simple, fast, and require less memory compared to batch learning algorithms. 
Recent work has shown that they can also perform nearly as well as batch algorithms in many settings, making them quite attractive for a number of large scale learning problems [3]. The success of an online learning algorithm depends critically upon a tradeoff between fitting the current data example and regularizing the solution based upon some memory of prior hypotheses. In this work, we show how to incorporate regularization in an online learning algorithm by constraining the motion of weight vectors in the hypothesis space. In particular, we demonstrate how to use simple constraints on the velocity flow field of Gaussian-distributed weight vectors to regularize online learning algorithms. This process results in herding the motion of the Gaussian weight vectors to yield algorithms that are particularly robust to noisy input data.\n\nRecent work has demonstrated how parametric information about the weight vector distribution can be used to guide online learning [1]. For example, confidence weighted (CW) learning maintains a Gaussian distribution over linear classifier hypotheses and uses it to control the direction and scale of parameter updates [9]. CW learning has formal guarantees in the mistake-bound model [7]; however, it can over-fit in certain situations due to its aggressive update rules based upon a separable data assumption. A newer online algorithm, Adaptive Regularization of Weights (AROW), relaxes this separable assumption, resulting in an adaptive regularization for each training example based upon its current confidence [8]. This regularization comes in the form of minimizing a bound on the Kullback-Leibler divergence between Gaussian distributed weight vectors.\n\nHere we take a different microscopic view of the online learning process. Instead of reweighting and diffusing the weight vectors in hypothesis space, we model them as flowing under a velocity field given by each data observation. 
We show that for linear velocity fields, a Gaussian weight vector distribution will maintain its Gaussianity, with corresponding updates for its mean and covariance. The advantage of this approach is that we can incorporate different constraints and regularization on the resulting velocity fields to yield more robust online learning algorithms. In the remainder of this paper, we elucidate the details of our approach and compare its performance on a variety of experimental data.\n\nThese algorithms maintain a Gaussian distribution over possible weight vectors in hypothesis space. In traditional stochastic filtering, weight vectors are first reweighted according to how accurately they describe the current data observation. The remaining distribution is then subjected to random diffusion, resulting in a new distribution. When the reweighting factor depends linearly upon the weight vector in combination with a Gaussian diffusion model, a weight vector distribution will maintain its Gaussianity under such a transformation. The Kalman filter equations then yield the resulting change in the mean and covariance of the new distribution. Our approach, on the other hand, updates the weight vector distribution with each observation by herding the weight vectors using a velocity field. The differences between these two processes are shown in Fig. 1.\n\n2 Background\n\nConsider the following online binary classification problem, that proceeds in rounds. On the ith round the online algorithm receives an input $x_i \\in \\mathbb{R}^d$ and applies its current prediction rule to make a prediction $\\hat{y}_i \\in \\mathcal{Y}$, for the binary set $\\mathcal{Y} = \\{-1, +1\\}$. It then receives the correct label $y_i \\in \\mathcal{Y}$ and suffers a loss $\\ell(y_i, \\hat{y}_i)$. At this point, the algorithm updates its prediction rule with the pair $(x_i, y_i)$ and proceeds to the next round. 
A summary of online algorithms can be found in [2].\n\nAn initial description for possible online algorithms is provided by the family of passive-aggressive (PA) algorithms for linear classifiers [5]. The weight vector $w_i$ at each round is updated with the current input $x_i$ and label $y_i$, by optimizing:\n\n$$w_{i+1} = \\arg\\min_{w} \\frac{1}{2} \\|w - w_i\\|^2 + C \\ell((x_i, y_i), w) , \\quad (1)$$\n\nwhere $\\ell((x_i, y_i), w)$ is the squared- or hinge-loss function and $C > 0$ controls the tradeoff between optimizing the current loss and being close to the old weight vector. Eq. (1) can also be expressed in dual form, yielding the PA-II update equation:\n\n$$w_{i+1} = w_i + \\alpha_i y_i x_i , \\quad \\alpha_i = \\max\\{0, 1 - y_i (w_i^\\top x_i)\\} / (\\|x_i\\|^2 + 1/C) . \\quad (2)$$\n\nThe theoretical properties of this algorithm were analyzed in [5], and it was demonstrated on a variety of tasks (e.g. [3]).\n\nFigure 1: (a) Traditional stochastic filtering: weight vectors in the hypothesis space are reweighted according to the new observation and undergo diffusion resulting in a new weight vector distribution. (b) Herding via a velocity field: weight vectors flow in hypothesis space according to a constrained velocity field, resulting in a new weight vector distribution.\n\nOnline confidence-weighted (CW) learning [9, 7] generalized the PA update principle to multivariate Gaussian distributions over the weight vectors $\\mathcal{N}(\\mu, \\Sigma)$ for binary classification. The mean $\\mu \\in \\mathbb{R}^d$ contains the current estimate for the best weight vector, whereas the Gaussian covariance matrix $\\Sigma \\in \\mathbb{R}^{d \\times d}$ captures the confidence in this estimate.\n\nCW classifiers are trained according to a PA rule that is modified to track differences in Gaussian distributions. At each round, the new mean and covariance of the weight vector distribution are chosen by optimizing:\n\n$$(\\mu_{i+1}, \\Sigma_{i+1}) = \\arg\\min_{\\mu, \\Sigma} D_{KL}(\\mathcal{N}(\\mu, \\Sigma) \\| \\mathcal{N}(\\mu_i, \\Sigma_i)) \\quad \\text{such that} \\quad \\Pr_{w \\sim \\mathcal{N}(\\mu, \\Sigma)} [y_i (w \\cdot x_i) \\geq 0] \\geq \\eta .$$\n\nThis particular CW rule may over-fit since it guarantees a correct prediction with likelihood $\\eta > 0.5$ at every round. A more recent alternative scheme called AROW (adaptive regularization of weight-vectors) [8] replaces the guaranteed prediction at each round with the following loss function:\n\n$$(\\mu_{i+1}, \\Sigma_{i+1}) = \\arg\\min_{\\mu, \\Sigma} D_{KL}(\\mathcal{N}(\\mu, \\Sigma) \\| \\mathcal{N}(\\mu_i, \\Sigma_i)) + \\lambda_1 \\ell_{h^2}(y_i, \\mu \\cdot x_i) + \\lambda_2 x_i^\\top \\Sigma x_i ,$$\n\nwhere $\\ell_{h^2}(y_i, \\mu \\cdot x_i) = (\\max\\{0, 1 - y_i(\\mu \\cdot x_i)\\})^2$ is the squared-hinge loss suffered using the weight vector $\\mu$ and $\\lambda_1, \\lambda_2 \\geq 0$ are two tradeoff hyperparameters. AROW [8] has been shown to perform well in practice, especially for noisy data where CW severely overfits.\n\nIn this work, we take the view that the Gaussian distribution over weight vectors is modified by herding according to a velocity flow field. First we show that any change in a Gaussian distributed random variable can be related to a linear velocity field:\n\nTheorem 1 Assume that the random variable (r.v.) $W$ is distributed according to a Gaussian distribution, $W \\sim \\mathcal{N}(\\mu, \\Sigma)$.\n\n1. The r.v. $U = AW + b$ also has a Gaussian distribution, $U \\sim \\mathcal{N}(b + A\\mu, A \\Sigma A^\\top)$.\n\n2. Assume that a r.v. $U$ is distributed according to a Gaussian distribution, $U \\sim \\mathcal{N}(\\tilde{\\mu}, \\tilde{\\Sigma})$. Then there exist $A$ and $b$ such that the following linear relation holds, $U = AW + b$.\n\n3. Let $\\Upsilon$ be any orthogonal matrix, $\\Upsilon^\\top = \\Upsilon^{-1}$, and define $U = \\Sigma^{1/2} \\Upsilon \\Sigma^{-1/2} (W - \\mu) + \\mu$; then both $U$ and $W$ have the same distribution.\n\nProof: The first property follows easily from linear systems theory. The second property is easily shown by taking $A = \\tilde{\\Sigma}^{1/2} \\Sigma^{-1/2}$ and $b = \\tilde{\\mu} - \\tilde{\\Sigma}^{1/2} \\Sigma^{-1/2} \\mu$. Similarly, for the third property, it suffices to show that $E[U] = \\Sigma^{1/2} \\Upsilon \\Sigma^{-1/2} (E[W] - \\mu) + \\mu = \\mu$, and\n\n$$\\mathrm{Cov}(U) = E[(U - \\mu)(U - \\mu)^\\top] = \\Sigma^{1/2} \\Upsilon \\Sigma^{-1/2} E[(W - \\mu)(W - \\mu)^\\top] \\Sigma^{-1/2} \\Upsilon^\\top \\Sigma^{1/2} = \\Sigma^{1/2} \\Upsilon \\Sigma^{-1/2} \\Sigma \\Sigma^{-1/2} \\Upsilon^\\top \\Sigma^{1/2} = \\Sigma^{1/2} \\Upsilon \\Upsilon^\\top \\Sigma^{1/2} = \\Sigma .$$\n\nThus, the transformation $U = AW + b$ can be viewed as a velocity flow resulting in a change of the underlying Gaussian distribution of weight vectors. On the other hand, this microscopic view of the underlying velocity field contains more information than merely tracking the mean and covariance of the Gaussian. This can be seen since many different velocity fields result in the same overall mean and covariance. In the next section, we show how we can define new online learning algorithms by considering various constraints on the overall velocity field. These new algorithms optimize a loss function by constraining the parameters of this velocity field.\n\n3 Algorithms\n\nOur algorithms maintain a distribution, or infinite collection of weight vectors $\\{W_i\\}$ for each round $i$. Given an instance $x_i$, the algorithm outputs a prediction based upon the majority of these weight vectors. 
Each weight vector $W_i$ is then individually updated to $W_{i+1}$ according to a generalized PA rule,\n\n$$W_{i+1} = \\arg\\min_{W} C_i(W) \\quad \\text{where} \\quad C_i(W) = \\frac{1}{2} (W - W_i)^\\top \\Sigma_i^{-1} (W - W_i) + C \\ell((x_i, y_i), W) , \\quad (3)$$\n\nand $\\Sigma_i$ is a PSD matrix that will be defined shortly. In fact, we assume that $\\Sigma_i$ is invertible and thus PD.\n\nClearly, it is impossible to maintain and update an infinite set of vectors, and thus we employ a parametric density $f_i(W_i; \\theta_i)$ to weight each vector. In general, updating each individual weight-vector using some rule (such as the PA update) will modify the parametric family. We thus employ a Gaussian parametric density with $W \\sim \\mathcal{N}(\\mu_i, \\Sigma_i)$, and update the distribution collectively,\n\n$$W_{i+1} = A_i W_i + b_i ,$$\n\nwhere $A_i \\in \\mathbb{R}^{d \\times d}$ represents stretching and rotating the distribution, and $b_i \\in \\mathbb{R}^d$ is an overall translation. Incorporating this linear transformation, we minimize the average of Eq. (3) with respect to the current distribution,\n\n$$(A_i, b_i) = \\arg\\min_{A, b} E_{W_i \\sim \\mathcal{N}(\\mu_i, \\Sigma_i)} [C_i(A W_i + b)] . \\quad (4)$$\n\nWe derive the algorithm by computing the expectation Eq. (4) starting with the first regularization term of Eq. (3). After some algebraic manipulations and using the first property of Theorem 1 to write $\\mu = A \\mu_i + b$ we get the expected value for the first term of Eq. (3) in terms of $\\mu$ and $A$,\n\n$$\\frac{1}{2} (\\mu - \\mu_i)^\\top \\Sigma_i^{-1} (\\mu - \\mu_i) + \\frac{1}{2} \\mathrm{Tr}((A - I)^\\top \\Sigma_i^{-1} (A - I) \\Sigma_i) . \\quad (5)$$\n\nNext, we focus on the expectation of the loss function in the second term of Eq. 
(3).\n\n3.1 Expectation of the Loss Function\n\nWe consider the expectation,\n\n$$E_{W_i \\sim \\mathcal{N}(\\mu_i, \\Sigma_i)} [\\ell((x_i, y_i), A W_i + b)] . \\quad (6)$$\n\nIn general, there is no closed form solution for this expectation, and instead we seek an appropriate approximation or bound. For simplicity we consider binary classification, denote the signed margin by $M = y (W^\\top x)$ and write $\\ell((x, y), W) = \\ell(M)$.\n\nIf the loss is relatively concentrated about its mean, then the loss of the expected weight-vector $\\mu$ is a good proxy for Eq. (6). Formally, we can define\n\nDefinition 1 Let $F = \\{f(M; \\theta) : \\theta \\in \\Theta\\}$ be a family of density functions. A loss function is uniformly $\\lambda$-bounded in expectation with respect to $F$ if there exists $\\lambda > 0$ such that for all $\\theta \\in \\Theta$ we have that, $E[\\ell(M)] \\leq \\ell(E[M]) + \\frac{\\lambda}{2} E[(M - E[M])^2]$, where all expectations are with respect to $M \\sim f(M; \\theta)$.\n\nWe note in passing that if the loss function $\\ell$ is convex with respect to $W$ we always have that, $E[\\ell(M)] \\geq \\ell(E[M])$. For Gaussian distributions we have that $\\Theta = \\{\\mu, \\Sigma\\}$ and a loss function $\\ell$ is uniformly $\\lambda$-bounded in expectation if there exists a $\\lambda$ such that, $E_{\\mathcal{N}(\\mu, \\Sigma)}[\\ell((x, y), W)] \\leq \\ell((x, y), E[W]) + \\frac{\\lambda}{2} x^\\top \\Sigma x$. We now enumerate some particular cases where losses are uniformly $\\lambda$-bounded.\n\nProposition 2 Assume that the loss function $\\ell(M)$ has a bounded second derivative, $\\ell''(M) \\leq \\lambda$; then $\\ell$ is uniformly $\\lambda$-bounded in expectation.\n\nProof: Applying the Taylor expansion about $M = E[M]$ we get, $\\ell(M) = \\ell(E[M]) + (M - E[M]) \\ell'(E[M]) + \\frac{1}{2} (M - E[M])^2 \\ell''(\\xi)$, for some $\\xi$ between $M$ and $E[M]$. Taking the expectation of both sides, the first-order term vanishes, and bounding $\\ell''(\\xi) \\leq \\lambda$ concludes the proof.\n\nFor example, the squared loss $\\frac{1}{2}(y - W^\\top x)^2$ is uniformly ($\\lambda =$) 1-bounded in expectation since its second derivative is bounded by unity (1). Another example is the log-loss, $\\log(1 + \\exp(-M))$, being uniformly 1/4-bounded in expectation. Note that the popular hinge and squared-hinge losses are not twice differentiable at $M = 1$ (the hinge loss is not even differentiable there). Nevertheless, we can show explicitly that indeed both are uniformly $\\lambda$-bounded, though the proof is omitted here due to space considerations. To conclude, for uniformly $\\lambda$-bounded loss functions, we bound Eq. (6) with $\\ell((x_i, y_i), \\mu) + \\frac{\\lambda}{2} x_i^\\top A \\Sigma_i A^\\top x_i$. Thus, our online algorithm minimizes the following bound on Eq. (4), with a change of variables from the pair $(A, b)$ to the pair $(A, \\mu)$, where $\\mu$ is the mean of the new distribution,\n\n$$(A_i, \\mu_{i+1}) = \\arg\\min_{A, \\mu} \\; \\frac{1}{2} (\\mu - \\mu_i)^\\top \\Sigma_i^{-1} (\\mu - \\mu_i) + C \\ell((x_i, y_i), \\mu) \\quad (7)$$\n\n$$\\qquad + \\frac{1}{2} \\mathrm{Tr}((A - I)^\\top \\Sigma_i^{-1} (A - I) \\Sigma_i) + \\frac{C \\lambda}{2} x_i^\\top A \\Sigma_i A^\\top x_i . \\quad (8)$$\n\nIn the next section we derive an analytic solution for the last problem. We note that, similar to AROW, it is decomposed into two additive terms: Eq. (7) which depends only on $\\mu$ and Eq. (8) which depends only on $A$.\n\n4 Solving the Optimization Problem\n\nWe consider here the squared-hinge loss, $\\ell((x, y), \\mu) = (\\max\\{0, 1 - y(\\mu^\\top x)\\})^2$, reducing Eq. (7) to a generalization of PA-II in Mahalanobis distances (see Eq. 
(2)),\n\n$$\\mu_{i+1} = \\mu_i + \\alpha_i y_i \\Sigma_i x_i , \\quad \\alpha_i = \\max\\{0, 1 - y_i(\\mu_i^\\top x_i)\\} / (x_i^\\top \\Sigma_i x_i + 1/C) . \\quad (9)$$\n\nWe now focus on minimizing the second term (Eq. (8)) which depends solely on $A_i$. For simplicity we assume $\\lambda = 1$ and consider two cases.\n\n4.1 Diagonal Covariance Matrix\n\nWe first assume that both $\\Sigma_i$ and $A$ are diagonal, and thus $\\Sigma_{i+1}$ is also diagonal, so that $\\Sigma_i$, $\\Sigma_{i+1}$ and $A$ commute with each other. Eq. (8) then becomes, $\\frac{1}{2} \\mathrm{Tr}((A - I)^\\top (A - I)) + \\frac{C}{2} x_i^\\top A \\Sigma_i A^\\top x_i$. Denote the rth diagonal element of $\\Sigma_i$ by $(\\Sigma_i)_{r,r}$ and the rth diagonal element of $A$ by $(A)_{r,r}$. The last equation becomes, $\\sum_r \\frac{1}{2} ((A)_{r,r} - 1)^2 + \\frac{C}{2} \\sum_r x_{i,r}^2 (A)_{r,r}^2 (\\Sigma_i)_{r,r}$. Taking the derivative with respect to $(A)_{r,r}$ and setting it to zero we get,\n\n$$(A_i)_{r,r} = 1 / (1 + C x_{i,r}^2 (\\Sigma_i)_{r,r}) \\;\\Rightarrow\\; (\\Sigma_{i+1})_{r,r} = (\\Sigma_i)_{r,r} / (1 + C x_{i,r}^2 (\\Sigma_i)_{r,r})^2 . \\quad (10)$$\n\nThe last equation is well-defined since the denominator is always greater than or equal to 1.\n\n4.2 Full Covariance Matrix\n\nExpanding Eq. (8) we get\n\n$$\\frac{1}{2} ( \\mathrm{Tr}(A^\\top \\Sigma_i^{-1} A \\Sigma_i) - \\mathrm{Tr}(\\Sigma_i^{-1} A \\Sigma_i) - \\mathrm{Tr}(A^\\top \\Sigma_i^{-1} \\Sigma_i) + \\mathrm{Tr}(\\Sigma_i^{-1} \\Sigma_i) ) + \\frac{C}{2} x_i^\\top A \\Sigma_i A^\\top x_i .$$\n\nSetting the derivative of the last equation with respect to $A$ to zero we get, $\\Sigma_i^{-1} A \\Sigma_i - I + C x_i x_i^\\top A \\Sigma_i = 0$. We multiply both terms by $\\Sigma_i^{-1}$ (right) and combine terms, $(\\Sigma_i^{-1} + C x_i x_i^\\top) A = \\Sigma_i^{-1}$, yielding,\n\n$$A_i = (\\Sigma_i^{-1} + C x_i x_i^\\top)^{-1} \\Sigma_i^{-1} . \\quad (11)$$\n\nTo get $\\Sigma_{i+1}$ we first compute its inverse, $\\Sigma_{i+1}^{-1} = (A \\Sigma_i A^\\top)^{-1}$. Substituting Eq. (11) in the last equation we get,\n\n$$\\Sigma_{i+1}^{-1} = (A \\Sigma_i A^\\top)^{-1} = \\Sigma_i^{-1} + (2C + C^2 x_i^\\top \\Sigma_i x_i) x_i x_i^\\top . \\quad (12)$$\n\nFinally, we use the Woodbury identity [12] to compute the updated covariance matrix,\n\n$$\\Sigma_{i+1} = \\Sigma_i - \\Sigma_i x_i x_i^\\top \\Sigma_i (C^2 x_i^\\top \\Sigma_i x_i + 2C) / (1 + C x_i^\\top \\Sigma_i x_i)^2 . \\quad (13)$$\n\nWe call the above algorithms NHERD for Normal (Gaussian) Herd. Pseudocode of the algorithm appears in Fig. 3.\n\n4.3 Discussion\n\nBoth our update of $\\Sigma_{i+1}$ in Eq. (12) and the update of AROW (see eq. (8) of [8]) have the same structure of adding $\\gamma_i x_i x_i^\\top$ to $\\Sigma_i^{-1}$. AROW sets $\\gamma_i = C$ while our update sets $\\gamma_i = 2C + C^2 x_i^\\top \\Sigma_i x_i$. In this aspect, the NHERD update is more aggressive as it increases the eigenvalues of $\\Sigma_i^{-1}$ at a faster rate. Furthermore, its update rate is not constant and depends linearly on the current variance of the margin $x_i^\\top \\Sigma_i x_i$; the higher the variance, the faster the eigenvalues of $\\Sigma_i$ decrease. Lastly, we note that the update matrix $A_i$ can be written as a product of two terms, one depending on the covariance matrix before the update and the other on the covariance matrix after an AROW update. 
Formally, let $\\tilde{\\Sigma}_{i+1}$ be the covariance matrix after an update using the AROW rule, that is, $\\tilde{\\Sigma}_{i+1} = (\\Sigma_i^{-1} + C x_i x_i^\\top)^{-1}$ (see eq. (8) of [8]). From Eq. (11) we observe that $A_i = \\tilde{\\Sigma}_{i+1} \\Sigma_i^{-1}$, which means that NHERD modifies $\\Sigma_i$ if and only if AROW modifies $\\Sigma_i$.\n\nThe diagonal updates of AROW and NHERD share similar properties. [8] did not specify the update for this case, yet using a derivation similar to that of Sec. 4.1 we get that the AROW update for diagonal matrices is $(\\tilde{\\Sigma}_{i+1})_{r,r} = (\\Sigma_i)_{r,r} / (1 + C x_{i,r}^2 (\\Sigma_i)_{r,r})$. Taking the ratio between the rth element of Eq. (10) and the last equation we get, $(\\tilde{\\Sigma}_{i+1})_{r,r} / (\\Sigma_{i+1})_{r,r} = 1 + C x_{i,r}^2 (\\Sigma_i)_{r,r} \\geq 1$. To conclude, the update of NHERD for diagonal covariance matrices is also more aggressive than AROW as it increases the (diagonal) elements of its inverse faster than AROW.\n\nFigure 2: Top and center panels: an illustration of the algorithm\u2019s update (see text). Bottom panel: an illustration of a single update for the five algorithms. The cyan ellipse represents the weight vector distribution before the example is observed. The red square represents the mean of the updated distribution and the five ellipses represent the covariance of each of the algorithms after being given the data example ((1, 2), +1). The ordering of the area of the five ellipses correlates well with the performance of the algorithms.\n\nFigure 3: Normal Herd (NHERD)\n\nParameter: $C > 0$\nInitialize: $\\mu_1 = 0$, $\\Sigma_1 = I$\nfor $i = 1, \\ldots, m$ do\n  Get input example $x_i \\in \\mathbb{R}^d$\n  Predict $\\hat{y}_i = \\mathrm{sign}(\\mu_i^\\top x_i)$\n  Get true label $y_i$ and suffer loss 1 if $\\hat{y}_i \\neq y_i$\n  if $y_i(\\mu_i^\\top x_i) \\leq 1$ then\n    Set $\\mu_{i+1} = \\mu_i + y_i \\frac{\\max\\{0, 1 - y_i(\\mu_i^\\top x_i)\\}}{x_i^\\top \\Sigma_i x_i + 1/C} \\Sigma_i x_i$ (Eq. (9))\n    Full Covariance: Set $\\Sigma_{i+1} = \\Sigma_i - \\Sigma_i x_i x_i^\\top \\Sigma_i \\frac{C^2 x_i^\\top \\Sigma_i x_i + 2C}{(1 + C x_i^\\top \\Sigma_i x_i)^2}$ (Eq. (13))\n    Diagonal Covariance: Set $(\\Sigma_{i+1})_{r,r}$ for $r = 1 \\ldots d$ using Eq. (14)\n  end if\nend for\nReturn: $\\mu_{m+1}$, $\\Sigma_{m+1}$\n\nAn illustration of the two updates appears in Fig. 2 for a problem in a planar 2-dimensional space. The Gaussian distribution before the update is isotropic with mean $\\mu = (0, 0)$ and $\\Sigma = I_2$. Given the input example $x = (1, 2)$, $y = 1$ we computed both $A$ and $b$ for both the full (top panel) and diagonal (center panel) update. The plot illustrates the update of the mean vector (red square), weight vectors with unit norm $\\|w\\| = 1$ (blue), and weight vectors with norm of 2, $\\|w\\| = 2$ (green). The ellipses with dashed lines illustrate the weights before the update, and ellipses with solid lines illustrate the weight-vectors after the update. All the weight vectors above the black dotted line classify the example correctly and the ones above the dashed lines classify the example with margin of at least 1. The arrows connecting weight-vectors from the dashed ellipses to solid ellipses illustrate the update of individual weight-vectors with the linear transformation $w \\leftarrow A_i (w - \\mu_i) + \\mu_{i+1}$.\n\nIn both updates the current mean $\\mu_i$ is mapped to the next mean $\\mu_{i+1}$. The full update \u201cshrinks\u201d the covariance in the direction orthogonal to the example $y_i x_i$; vectors close to the margin of 1 are modified less than vectors far from this margin; vectors with smaller margin are updated more aggressively than vectors with higher margin; even vectors that classify the example correctly with large margin of at least one are updated, such that their margin is shrunk. This is a consequence of the linear transformation that ties the update between all weight-vectors. The diagonal update, as designed, maintains a diagonal matrix, yet shrinks the matrix more in the directions that are more \u201corthogonal\u201d to the example.\n\nWe note in passing that for all previous CW algorithms [7] and AROW [8], a closed form solution for diagonal matrices was not provided. Instead these papers proposed to diagonalize either $\\Sigma_{i+1}$ (called drop) or $\\Sigma_{i+1}^{-1}$ (called project) which was then inverted. Together with the exact solution of Eq. (10) we get the following three alternative solutions for diagonal matrices,\n\n$$(\\Sigma_{i+1})_{r,r} = \\begin{cases} (\\Sigma_i)_{r,r} / (1 + C x_{i,r}^2 (\\Sigma_i)_{r,r})^2 & \\text{exact} \\\\\\\\ 1 / ( 1/(\\Sigma_i)_{r,r} + (2C + C^2 x_i^\\top \\Sigma_i x_i) x_{i,r}^2 ) & \\text{project} \\\\\\\\ (\\Sigma_i)_{r,r} - ((\\Sigma_i)_{r,r} x_{i,r})^2 (C^2 x_i^\\top \\Sigma_i x_i + 2C) / (1 + C x_i^\\top \\Sigma_i x_i)^2 & \\text{drop} \\end{cases} \\quad (14)$$\n\nWe investigate these formulations in the next section. Finally, we note that similarly to CW and AROW, algorithms that employ full matrices can be incorporated with Mercer kernels [11, 14], while to the best of our knowledge, the diagonal versions cannot.\n\n5 Empirical Evaluation\n\nWe evaluate NHERD on several popular datasets for document classification, optical character recognition (OCR), phoneme recognition, as well as on action recognition in video. We compare our new algorithm NHERD with the AROW [8] algorithm, which was found to outperform other baselines [8]: the perceptron algorithm [13], Passive-Aggressive (PA) [5], confidence weighted learning (CW) [9, 7] and second order perceptron [1] on these datasets. For both NHERD and AROW we used the three diagonalization schemes, as mentioned in Eq. (14) in Sec. 4.3. Since AROW Project and AROW Exact are equivalent we omit the latter, yielding a total of five algorithms: NHERD_{P, D, E} for Project, Drop, Exact and similarly AROW_{P, D}.\n\nFigure 4: Performance comparison between algorithms. Each algorithm is represented by a vertex. The weight of an edge between two algorithms is the fraction of datasets in which the top algorithm achieves lower test error than the bottom algorithm. An edge with no head indicates a fraction lower than 60% and a bold edge indicates a fraction greater than 80%. Graphs (left to right) are for noise levels of 0%, 10%, and 30%.\n\nAlthough NHERD and AROW are designed primarily for binary classification, we can modify them for use on multi-class problems as follows. Following [4], we generalize binary classification and assume a feature function $f(x, y) \\in \\mathbb{R}^d$ mapping instances $x \\in \\mathcal{X}$ and labels $y \\in \\mathcal{Y}$ into a common space. Given a new example, the algorithm predicts $\\hat{y} = \\arg\\max_z \\mu \\cdot f(x, z)$, and suffers a loss if $y \\neq \\hat{y}$. 
It then computes the difference vector $\\Delta = f(x, y) - f(x, y')$ for $y' = \\arg\\max_{z \\neq y} \\mu \\cdot f(x, z)$, which replaces $yx$ in NHERD (Fig. 3).\n\nWe conducted an empirical study using the following datasets. First are datasets from [8]: 36 binary document classification datasets, and 100 binary OCR datasets (45 all-pairs of both USPS and MNIST and 1-vs-rest of MNIST). Second, we used the nine multi-category document classification datasets used by [6]. Third, we conducted experiments on a TIMIT phoneme classification task. Here we used an experimental setup similar to [10] and mapped the 61 phonetic labels into 48 classes. We then picked 10 pairs of classes to construct binary classification tasks. We focused mainly on unvoiced phonemes where there is no underlying harmonic source and whose instantiations are noisy. The ten binary classification problems are identified by a pair of phoneme symbols (one or two Roman letters). For each of the ten pairs we picked 1,000 random examples from both classes for training and 4,000 random examples for a test set. These signals were then preprocessed by computing mel-frequency cepstral coefficients (MFCCs) together with first and second derivatives and second order interactions, yielding a feature vector of 902 dimensions. Lastly, we also evaluated our algorithm on an action recognition problem in video under four different conditions. There are about 100 samples for each of 6 actions. Each sample is represented using a set of 575 positive real localized spectral content filters from the videos. This yields a total of 156 datasets.\n\nEach result for the text datasets was averaged over 10-fold cross-validation, otherwise a fixed split into training and test sets was used. Hyperparameters (C for NHERD and r for AROW) and the number of online iterations (up to 20) were optimized using a single randomized run. 
In order to observe each algorithm\u2019s ability to handle non-separable data, we performed each experiment using various levels of artificial label noise, generated by independently flipping binary labels.\n\nResults: We first summarize the results on all datasets excluding the video recognition dataset in Fig. 4, where we computed the number of datasets for which one algorithm achieved a lower test error than another algorithm. The results of this tournament between algorithms are presented as a winning percentage. An edge between two algorithms shows the fraction of the 155 datasets for which the algorithm on top had lower test error than the other algorithm. The three panels correspond to three noise levels: 0%, 10% and 30%.\n\nWe observe from the figure that Project generally outperforms Exact which in turn outperforms Drop. Furthermore, NHERD outperforms AROW, in particular NHERD_P outperforms AROW_P and NHERD_D outperforms AROW_D. These relations become more prominent when labeling noise is increased in the training data. The bottom panel of Fig. 2 illustrates a single update of each of the five algorithms: AROW_P, AROW_D, NHERD_D, NHERD_E, NHERD_P. Each of the five ellipses represents the Gaussian weight vector distribution after a single update on an example by each of the five algorithms. Interestingly, the resulting volume (area) of the different ellipses roughly corresponds to the overall performance of the algorithms. 
The best update, NHERD_P, has the smallest ellipse (with the lowest entropy), and the update with the worst performance, AROW_D, has the largest, highest-entropy ellipse.\n\nMore detailed results for NHERD_P and AROW_P, the overall best performing algorithms, are compared in Fig. 5. NHERD_P and AROW_P are comparable when there is no added noise, with NHERD_P winning a majority of the time. As label noise increases (moving top-to-bottom in Fig. 5) NHERD_P holds up remarkably well. In almost every high noise evaluation, NHERD_P improves over AROW_P (as well as all other baselines, not shown). The bottom-left panel of Fig. 5 shows the relative improvement in accuracy of NHERD_P over AROW_P on the ten phoneme recognition tasks with additional 30% label noise. The ten tasks are ordered by their statistical significance according to McNemar\u2019s test. The results for the seven right tasks are statistically significant with a p-value less than 0.001. NHERD_P outperforms AROW_P five times and underperforms twice on these seven significant tests. Finally, the bottom-right panel shows the 10-fold accuracy of the five algorithms over the video data, where clearly NHERD_P outperforms all other algorithms by a wide margin.\n\nConclusions: We have seen how to incorporate velocity constraints in an online learning algorithm. In addition to tracking the mean and covariance of a Gaussian weight vector distribution, regularization of the linear velocity terms is used to herd the normal distribution in the learning process. By bounding the loss function with a quadratic term, the resulting optimization can be solved analytically, resulting in the NHERD algorithm. 
We empirically evaluated the performance of NHERD on a variety of experimental datasets and showed\nthat the projected NHERD algorithm generally outperforms all other online learning algorithms on\nthese datasets. In particular, NHERD is very robust when random labeling noise is present during\ntraining.\n\nAcknowledgments: KC is a Horev Fellow, supported by the Taub Foundations. This work was also\nsupported by German-Israeli Foundation grant GIF-2209-1912.\n\nFigure 5: Three top rows: Accuracy on OCR (left) and text and phoneme (right) classification. Plots\ncompare performance between NHERD P and AROW P. Markers above the line indicate superior NHERD P\nperformance and below the line superior AROW P performance. Label noise increases from top to\nbottom: 0%, 10% and 30%. NHERD P improves relative to AROW P as noise increases. Bottom left:\nrelative accuracy improvement of NHERD P over AROW P on the ten phoneme classification tasks.\nBottom right: accuracy of five algorithms on the video data. In both cases NHERD P is superior.\n\n[Figure 5 graphic. Recovered labels: axes AROW_P and NHERD_P; legends usps, mnist (left) and binary text, mc text, phoneme (right); phoneme pairs b-p, d-t, f-th, g-k, jh-ch, m-n, m-ng, n-ng, s-sh, v-dh against Relative Increase in Accuracy; AROW_P, AROW_D, NHERD_E, NHERD_P, NHERD_D against Accuracy.]\n\nReferences\n\n[1] Nicol\u00f2 Cesa-Bianchi, Alex Conconi, and Claudio Gentile. A second-order perceptron algorithm. SIAM Journal on Computing, 34(3):640\u2013668, 2005.\n\n[2] Nicol\u00f2 Cesa-Bianchi and G\u00e1bor Lugosi. Prediction, Learning, and Games. Cambridge University Press, New York, NY, USA, 2006.\n\n[3] G. Chechik, V. Sharma, U. 
Shalit, and S. Bengio. An online algorithm for large scale image similarity learning. In NIPS, 2009.\n\n[4] Michael Collins. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In EMNLP, 2002.\n\n[5] K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer. Online passive-aggressive algorithms. JMLR, 7:551\u2013585, 2006.\n\n[6] K. Crammer, M. Dredze, and A. Kulesza. Multi-class confidence weighted algorithms. In EMNLP, 2009.\n\n[7] K. Crammer, M. Dredze, and F. Pereira. Exact confidence-weighted learning. In NIPS 22, 2008.\n\n[8] K. Crammer, A. Kulesza, and M. Dredze. Adaptive regularization of weight vectors. In Advances in Neural Information Processing Systems 23, 2009.\n\n[9] M. Dredze, K. Crammer, and F. Pereira. Confidence-weighted linear classification. In ICML, 2008.\n\n[10] A. Gunawardana, M. Mahajan, A. Acero, and J. C. Platt. Hidden conditional random fields for phone classification. In Proceedings of ICSCT, 2005.\n\n[11] J. Mercer. Functions of positive and negative type and their connection with the theory of integral equations. Philos. Trans. Roy. Soc. London A, 209:415\u2013446, 1909.\n\n[12] K. B. Petersen and M. S. Pedersen. The matrix cookbook, 2007.\n\n[13] F. Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65:386\u2013407, 1958.\n\n[14] Bernhard Sch\u00f6lkopf and Alexander J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA, USA, 2001.\n", "award": [], "sourceid": 309, "authors": [{"given_name": "Koby", "family_name": "Crammer", "institution": null}, {"given_name": "Daniel", "family_name": "Lee", "institution": null}]}