{"title": "On higher-order perceptron algorithms", "book": "Advances in Neural Information Processing Systems", "page_first": 521, "page_last": 528, "abstract": "A new algorithm for on-line learning linear-threshold functions is proposed which efficiently combines second-order statistics about the data with the logarithmic behavior\" of multiplicative/dual-norm algorithms. An initial theoretical analysis is provided suggesting that our algorithm might be viewed as a standard Perceptron algorithm operating on a transformed sequence of examples with improved margin properties. We also report on experiments carried out on datasets from diverse domains, with the goal of comparing to known Perceptron algorithms (first-order, second-order, additive, multiplicative). Our learning procedure seems to generalize quite well, and converges faster than the corresponding multiplicative baseline algorithms.\"", "full_text": "On Higher-Order Perceptron Algorithms \u2217\n\nCristian Brotto\n\nDICOM, Universit`a dell\u2019Insubria\ncristian.brotto@gmail.com\n\nClaudio Gentile\n\nDICOM, Universit`a dell\u2019Insubria\n\nclaudio.gentile@uninsubria.it\n\nFabio Vitale\n\nDICOM, Universit`a dell\u2019Insubria\n\nfabiovdk@yahoo.com\n\nAbstract\n\nA new algorithm for on-line learning linear-threshold functions is proposed which\nef\ufb01ciently combines second-order statistics about the data with the \u201dlogarithmic\nbehavior\u201d of multiplicative/dual-norm algorithms. An initial theoretical analysis is\nprovided suggesting that our algorithm might be viewed as a standard Perceptron\nalgorithm operating on a transformed sequence of examples with improved mar-\ngin properties. We also report on experiments carried out on datasets from diverse\ndomains, with the goal of comparing to known Perceptron algorithms (\ufb01rst-order,\nsecond-order, additive, multiplicative). 
Our learning procedure seems to generalize quite well, and converges faster than the corresponding multiplicative baseline algorithms.

1 Introduction and preliminaries

The problem of on-line learning linear-threshold functions from labeled data is one which has spurred a substantial amount of research in Machine Learning. The relevance of this task from both the theoretical and the practical point of view is widely recognized: On the one hand, linear functions combine flexibility with analytical and computational tractability; on the other hand, on-line algorithms provide efficient methods for processing massive amounts of data. Moreover, the widespread use of kernel methods in Machine Learning (e.g., [24]) has greatly improved the scope of this learning technology, thereby increasing even further the general attention towards the specific task of incrementally learning (generalized) linear functions. Many models/algorithms have been proposed in the literature (stochastic, adversarial, noisy, etc.): any list of references would not do justice to the existing work on this subject. In this paper, we are interested in the problem of on-line learning linear-threshold functions from adversarially generated examples. We introduce a new family of algorithms, collectively called the Higher-order Perceptron algorithm (where "higher" means here "higher than one", i.e., "higher than first-order" descent algorithms, such as gradient descent or standard Perceptron-like algorithms). Contrary to other higher-order algorithms, such as the ridge-regression-like algorithms considered in, e.g., [4, 7], Higher-order Perceptron has the ability to put together in a principled and flexible manner second-order statistics about the data with the "logarithmic behavior" of multiplicative/dual-norm algorithms (e.g., [18, 19, 6, 13, 15, 20]).
Our algorithm exploits a simplified form of the inverse data matrix, lending itself to be easily combined with the dual norms machinery introduced by [13] (see also [12, 23]). As we will see, this also has computational advantages, allowing us to formulate an efficient (subquadratic) implementation.

Our contribution is twofold. First, we provide an initial theoretical analysis suggesting that our algorithm might be seen as a standard Perceptron algorithm [21] operating on a transformed sequence of examples with improved margin properties. The same analysis also suggests a simple (but principled) way of switching on the fly between higher-order and first-order updates. This is especially convenient when we deal with kernel functions, a major concern being the sparsity of the computed solution. The second contribution of this paper is an experimental investigation of our algorithm on artificial and real-world datasets from various domains: We compared Higher-order Perceptron to baseline Perceptron algorithms, like the Second-order Perceptron algorithm defined in [7] and the standard (p-norm) Perceptron algorithm, as in [13, 12]. We found in our experiments that Higher-order Perceptron generalizes quite well. Among our experimental findings are the following: 1) Higher-order Perceptron always outperforms the corresponding multiplicative (p-norm) baseline (thus the stored data matrix is always beneficial in terms of convergence speed); 2) When dealing with Euclidean norms (p = 2), the comparison to Second-order Perceptron is less clear and depends on the specific task at hand.

∗ The authors gratefully acknowledge partial support by the PASCAL Network of Excellence under EC grant n. 506778. This publication only reflects the authors' views.

Learning protocol and notation. Our algorithm works in the well-known mistake bound model of on-line learning, as introduced in [18, 2], and further investigated by many authors (e.g., [19, 6, 13, 15, 7, 20, 23] and references therein). Prediction proceeds in a sequence of trials. In each trial t = 1, 2, . . . the prediction algorithm is given an instance vector in R^n (for simplicity, all vectors are normalized, i.e., ||x_t|| = 1, where || · || is the Euclidean norm unless otherwise specified), and then guesses the binary label y_t ∈ {−1, 1} associated with x_t. We denote the algorithm's prediction by ŷ_t ∈ {−1, 1}. Then the true label y_t is disclosed. In the case when ŷ_t ≠ y_t we say that the algorithm has made a prediction mistake. We call an example a pair (x_t, y_t), and a sequence of examples S any sequence S = (x_1, y_1), (x_2, y_2), . . . , (x_T, y_T). In this paper, we are competing against the class of linear-threshold predictors, parametrized by normal vectors u ∈ {v ∈ R^n : ||v|| = 1}. In this case, a common way of measuring the (relative) prediction performance of an algorithm A is to compare the total number of mistakes of A on S to some measure of the linear separability of S. One such measure (e.g., [24]) is the cumulative hinge loss (or soft margin) D_γ(u; S) of S w.r.t. a linear classifier u at a given margin value γ > 0: D_γ(u; S) = Σ_{t=1}^T max{0, γ − y_t u^T x_t} (observe that D_γ(u; S) vanishes if and only if u separates S with margin at least γ).

A mistake-driven algorithm A is one which updates its internal state only upon mistakes. One can therefore associate with the run of A on S a subsequence M = M(S, A) ⊆ {1, . . . , T} of mistaken trials. Now, the standard analysis of these algorithms allows us to restrict the behavior of the comparison class to mistaken trials only and, as a consequence, to refine D_γ(u; S) so as to include only trials in M: D_γ(u; S) = Σ_{t∈M} max{0, γ − y_t u^T x_t}. This gives bounds on A's performance relative to the best u over a sequence of examples produced (or, actually, selected) by A during its on-line functioning. Our analysis in Section 3 goes one step further: the number of mistakes of A on S is contrasted to the cumulative hinge loss of the best u on a transformed sequence S̃ = ((x̃_{i_1}, y_{i_1}), (x̃_{i_2}, y_{i_2}), . . . , (x̃_{i_m}, y_{i_m})), where each instance x_{i_k} gets transformed into x̃_{i_k} through a mapping depending only on the past behavior of the algorithm (i.e., only on examples up to trial t = i_{k−1}). As we will see in Section 3, this new sequence S̃ tends to be "more separable" than the original sequence, in the sense that if S is linearly separable with some margin, then the transformed sequence S̃ is likely to be separable with a larger margin.

2 The Higher-order Perceptron algorithm

The algorithm (described in Figure 1) takes as input a sequence of nonnegative parameters ρ_1, ρ_2, . . . , and maintains a product matrix B_k (initialized to the identity matrix I) and a sum vector v_k (initialized to 0). Both of them are indexed by k, a counter storing the current number of mistakes (plus one). Upon receiving the t-th normalized instance vector x_t ∈ R^n, the algorithm computes its binary prediction value ŷ_t as the sign of the inner product between vector B_{k−1} v_{k−1} and vector B_{k−1} x_t. If ŷ_t ≠ y_t then matrix B_{k−1} is updated multiplicatively as B_k = B_{k−1} (I − ρ_k x_t x_t^T), while vector v_{k−1} is updated additively through the standard Perceptron rule v_k = v_{k−1} + y_t x_t. The new matrix B_k and the new vector v_k will be used in the next trial. If ŷ_t = y_t no update is performed (hence the algorithm is mistake driven). Observe that ρ_k = 0 for any k makes this algorithm degenerate into the standard Perceptron algorithm [21]. Moreover, one can easily see that, in order to let this algorithm exploit the information collected in the matrix B (and let the algorithm's behavior be substantially different from Perceptron's), we need to ensure Σ_{k=1}^∞ ρ_k = ∞. In the sequel, our standard choice will be ρ_k = c/k, with c ∈ (0, 1). See Sections 3 and 4.

Parameters: ρ_1, ρ_2, . . . ∈ [0, 1).
Initialization: B_0 = I; v_0 = 0; k = 1.
Repeat for t = 1, 2, . . . , T:
  1. Get instance x_t ∈ R^n, ||x_t|| = 1;
  2. Predict ŷ_t = SGN(w_{k−1}^T x_t) ∈ {−1, +1}, where w_{k−1} = B_{k−1}^T B_{k−1} v_{k−1};
  3. Get label y_t ∈ {−1, +1};
  4. If ŷ_t ≠ y_t then:
       v_k = v_{k−1} + y_t x_t
       B_k = B_{k−1} (I − ρ_k x_t x_t^T)
       k ← k + 1.

Figure 1: The Higher-order Perceptron algorithm (for p = 2).

Implementing Higher-Order Perceptron can be done in many ways. Below, we quickly describe three of them, each one having its own merits.

1) Primal version. We store and update an n × n matrix A_k = B_k^T B_k and an n-dimensional column vector v_k. Matrix A_k is updated as A_k = A_{k−1} − ρ A_{k−1} x x^T − ρ x x^T A_{k−1} + ρ^2 (x^T A_{k−1} x) x x^T, taking O(n^2) operations, while v_k is updated as in Figure 1.
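Putting the loop of Figure 1 together with the primal recurrence for A_k, the p = 2 procedure can be sketched as follows. This is our illustrative Python, not the authors' implementation; the function name, toy parameters, and the choice ρ_k = c/k (the paper's standard choice) are spelled out in the comments.

```python
import numpy as np

def higher_order_perceptron(X, y, c=0.4, epochs=5):
    """Sketch of the p = 2 Higher-order Perceptron of Figure 1.

    Illustrative code (names ours). Rows of X are assumed to have unit
    Euclidean norm; labels y are in {-1, +1}; rho_k = c/k as in the paper.
    c = 0 makes every rho_k vanish, degenerating into standard Perceptron.
    """
    n = X.shape[1]
    B = np.eye(n)      # product matrix B_{k-1}, with B_0 = I
    A = np.eye(n)      # primal matrix A_{k-1} = B^T B, kept via the O(n^2) recurrence
    v = np.zeros(n)    # sum vector v_{k-1}, with v_0 = 0
    k = 1              # counter storing the current number of mistakes (plus one)
    mistakes = 0
    for _ in range(epochs):
        for x, yt in zip(X, y):
            w = A @ x * 0 + A @ v              # w_{k-1} = B_{k-1}^T B_{k-1} v_{k-1}
            y_hat = 1.0 if w @ x >= 0 else -1.0
            if y_hat != yt:                    # mistake-driven updates only
                rho = c / k
                v = v + yt * x                 # additive Perceptron step
                B = B @ (np.eye(n) - rho * np.outer(x, x))   # multiplicative step
                Ax = A @ x                     # primal update of A = B^T B:
                A = (A - rho * np.outer(Ax, x) - rho * np.outer(x, Ax)
                     + rho**2 * (x @ Ax) * np.outer(x, x))
                k += 1
                mistakes += 1
    return v, B, A, mistakes
```

The final matrix A maintained through the rank-one recurrence coincides with B^T B, which is exactly the invariant the primal version relies on when computing the margin v^T A x.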
Computing the algorithm's margin v^T A x can then be carried out in time quadratic in the dimension n of the input space.

2) Dual version. This implementation allows us the use of kernel functions (e.g., [24]). Let us denote by X_k the n × k matrix whose columns are the n-dimensional instance vectors x_1, . . . , x_k where a mistake occurred so far, and by y_k the k-dimensional column vector of the corresponding labels. We store and update the k × k matrix D_k = [d^{(k)}_{i,j}]_{i,j=1}^k, the k × k diagonal matrix H_k = DIAG{h_k}, h_k = (h^{(k)}_1, . . . , h^{(k)}_k)^T = X_k^T X_k y_k, and the k-dimensional column vector g_k = y_k + D_k H_k 1_k, being 1_k a vector of k ones. If we interpret the primal matrix A_k above as A_k = I + Σ_{i,j=1}^k d^{(k)}_{i,j} x_i x_j^T, it is not hard to show that the margin value w_{k−1}^T x is equal to g_{k−1}^T X_{k−1}^T x, and can be computed through O(k) extra inner products. Now, on the k-th mistake, vector g can be updated with O(k^2) extra inner products by updating D and H in the following way. We let D_0 and H_0 be empty matrices. Then, given D_{k−1} and H_{k−1} = DIAG{h_{k−1}}, we have^1

D_k = [ D_{k−1}      −ρ_k b_k
        −ρ_k b_k^T   d^{(k)}_{k,k} ],

where b_k = D_{k−1} X_{k−1}^T x_k, and d^{(k)}_{k,k} = ρ_k^2 x_k^T X_{k−1} b_k − 2ρ_k + ρ_k^2. On the other hand, H_k = DIAG{h_{k−1} + y_k X_{k−1}^T x_k, h^{(k)}_k}, with h^{(k)}_k = y_k^T X_k^T x_k. Observe that on trials when ρ_k = 0 matrix D_{k−1} is padded with a zero row and a zero column. This amounts to saying that matrix A_k = I + Σ_{i,j=1}^k d^{(k)}_{i,j} x_i x_j^T is not updated, i.e., A_k = A_{k−1}. A closer look at the above update mechanism allows us to conclude that the overall number of extra inner products needed to compute g_k is actually quadratic only in the number of past mistaken trials having ρ_k > 0. This turns out to be especially important when using a sparse version of our algorithm which, on a mistaken trial, decides whether to update both B and v or just v (see Section 4).

3) Implicit primal version and the dual norms algorithm. This is based on the simple observation that for any vector z we can compute B_k z by unwrapping B_k as in B_k z = B_{k−1}(I − ρ x x^T) z = B_{k−1} z', where vector z' = (z − ρ x x^T z) can be calculated in time O(n). Thus computing the margin v^T B_{k−1}^T B_{k−1} x actually takes O(nk). Maintaining this implicit representation for the product matrix B can be convenient when an efficient dual version is likely to be unavailable, as is the case for the multiplicative (or, more generally, dual norms) extension of our algorithm. We recall that a multiplicative algorithm is useful when learning sparse target hyperplanes (e.g., [18, 15, 3, 12, 11, 20]). We obtain a dual norms algorithm by introducing a norm parameter p ≥ 2, and the associated gradient mapping^2 g : θ ∈ R^n → ∇_θ ||θ||_p^2 / 2 ∈ R^n. Then, in Figure 1, we normalize instance vectors x_t w.r.t. the p-norm, we define w_{k−1} = B_{k−1}^T g(B_{k−1} v_{k−1}), and generalize the matrix update as B_k = B_{k−1}(I − ρ_k x_t g(x_t)^T). As we will see, the resulting algorithm combines the multiplicative behavior of the p-norm algorithms with the "second-order" information contained in the matrix B_k. One can easily see that the above-mentioned argument for computing the margin g(B_{k−1} v_{k−1})^T B_{k−1} x in time O(nk) still holds.

^1 Observe that, by construction, D_k is a symmetric matrix.
^2 This mapping has also been used in [12, 11]. Recall that setting p = O(log n) yields an algorithm similar to Winnow [18]. Also, notice that p = 2 yields g = identity.

3 Analysis

We express the performance of the Higher-order Perceptron algorithm in terms of the hinge-loss behavior of the best linear classifier over the transformed sequence

S̃ = (B_0 x_{t(1)}, y_{t(1)}), (B_1 x_{t(2)}, y_{t(2)}), (B_2 x_{t(3)}, y_{t(3)}), . . .
(1)

where t(k) is the trial at which the k-th mistake occurs, and B_k the k-th matrix produced by the algorithm. Observe that each feature vector x_{t(k)} gets transformed by a matrix B_{k−1} depending on past examples only. This is relevant to the argument that S̃ tends to have a larger margin than the original sequence (see the discussion at the end of this section). This neat "on-line structure" does not seem to be shared by other competing higher-order algorithms, such as the "ridge-regression-like" algorithms considered, e.g., in [25, 4, 7, 23]. For the sake of simplicity, we state the theorem below only in the case p = 2. A more general statement holds when p ≥ 2.

Theorem 1 Let the Higher-order Perceptron algorithm in Figure 1 be run on a sequence of examples S = (x_1, y_1), (x_2, y_2), . . . , (x_T, y_T). Let the sequence of parameters ρ_k satisfy 0 ≤ ρ_k ≤ (1 − c)/(1 + |v_{k−1}^T x_t|), where x_t is the k-th mistaken instance vector, and c ∈ (0, 1]. Then the total number m of mistakes satisfies^3

m ≤ α D_γ(u; S̃_c)/γ + α^2/(2γ^2) + (α/γ) √( α D_γ(u; S̃_c)/γ + α^2/(4γ^2) ),    (2)

holding for any γ > 0 and any unit norm vector u ∈ R^n, where α = α(c) = (2 − c)/c.

Proof. The analysis deliberately mimics the standard Perceptron convergence analysis [21]. We fix an arbitrary sequence S = (x_1, y_1), (x_2, y_2), . . . , (x_T, y_T) and let M ⊆ {1, 2, . . . , T} be the set of trials where the algorithm in Figure 1 made a mistake. Let t = t(k) be the trial where the k-th mistake occurred. We study the evolution of ||B_k v_k||^2 over mistaken trials. Notice that the matrix B_k^T B_k is positive semidefinite for any k.
We can write

||B_k v_k||^2 = ||B_{k−1} (I − ρ_k x_t x_t^T) (v_{k−1} + y_t x_t)||^2
    (from the update rules v_k = v_{k−1} + y_t x_t and B_k = B_{k−1} (I − ρ_k x_t x_t^T))
= ||B_{k−1} v_{k−1} + y_t (1 − ρ_k y_t v_{k−1}^T x_t − ρ_k) B_{k−1} x_t||^2    (using ||x_t|| = 1)
= ||B_{k−1} v_{k−1}||^2 + 2 y_t r_k v_{k−1}^T B_{k−1}^T B_{k−1} x_t + r_k^2 ||B_{k−1} x_t||^2,

where we set for brevity r_k = 1 − ρ_k y_t v_{k−1}^T x_t − ρ_k. We proceed by upper and lower bounding the above chain of equalities. To this end, we need to ensure r_k ≥ 0. Observe that y_t v_{k−1}^T x_t ≥ 0 implies r_k ≥ 0 if and only if ρ_k ≤ 1/(1 + y_t v_{k−1}^T x_t). On the other hand, if y_t v_{k−1}^T x_t < 0 then, in order for r_k to be nonnegative, it suffices to pick ρ_k ≤ 1. In both cases ρ_k ≤ (1 − c)/(1 + |v_{k−1}^T x_t|) implies r_k ≥ c > 0, and also r_k^2 ≤ (1 + ρ_k |v_{k−1}^T x_t| − ρ_k)^2 ≤ (2 − c)^2. Now, using y_t v_{k−1}^T B_{k−1}^T B_{k−1} x_t ≤ 0 (combined with r_k ≥ 0), we conclude that

||B_k v_k||^2 − ||B_{k−1} v_{k−1}||^2 ≤ (2 − c)^2 ||B_{k−1} x_t||^2 = (2 − c)^2 x_t^T A_{k−1} x_t,

where we set A_k = B_k^T B_k. A simple^4 (and crude) upper bound on the last term follows by observing that ||x_t|| = 1 implies x_t^T A_{k−1} x_t ≤ ||A_{k−1}||, the spectral norm (largest eigenvalue) of A_{k−1}. Since a factor matrix of the form (I − ρ x x^T) with ρ ≤ 1 and ||x|| = 1 has spectral norm one, we have x_t^T A_{k−1} x_t ≤ ||A_{k−1}|| ≤ Π_{i=1}^{k−1} ||I − ρ_i x_{t(i)} x_{t(i)}^T||^2 ≤ 1. Therefore, summing over k = 1, . . . , m = |M| (or, equivalently, over t ∈ M) and using v_0 = 0 yields the upper bound

||B_m v_m||^2 ≤ (2 − c)^2 m.    (3)

To find a lower bound on the left-hand side of (3), we first pick any unit norm vector u ∈ R^n, and apply the standard Cauchy-Schwarz inequality: ||B_m v_m|| ≥ u^T B_m v_m.
Then, we observe that for a generic trial t = t(k) the update rule of our algorithm allows us to write

u^T B_k v_k − u^T B_{k−1} v_{k−1} = r_k y_t u^T B_{k−1} x_t ≥ r_k (γ − max{0, γ − y_t u^T B_{k−1} x_t}),

where the last inequality follows from r_k ≥ 0 and holds for any margin value γ > 0. We sum the above over k = 1, . . . , m and exploit c ≤ r_k ≤ 2 − c after rearranging terms. This gets ||B_m v_m|| ≥ u^T B_m v_m ≥ c γ m − (2 − c) D_γ(u; S̃_c). Combining with (3) and solving for m gives the claimed bound.  □

^3 The subscript c in S̃_c emphasizes the dependence of the transformed sequence on the choice of c. Note that in the special case c = 1 we have ρ_k = 0 for any k and α = 1, thereby recovering the standard Perceptron bound for nonseparable sequences (see, e.g., [12]).
^4 A slightly more refined bound can be derived which depends on the trace of matrices I − A_k. Details will be given in the full version of this paper.

From the above result one can see that our algorithm might be viewed as a standard Perceptron algorithm operating on the transformed sequence S̃_c in (1). We now give a qualitative argument, which is suggestive of the improved margin properties of S̃_c. Assume for simplicity that all examples (x_t, y_t) in the original sequence are correctly classified by hyperplane u with the same margin γ = y_t u^T x_t > 0, where t = t(k). According to Theorem 1, the parameters ρ_1, ρ_2, . . . should be small positive numbers. Assume, again for simplicity, that all ρ_k are set to the same small enough value ρ > 0. Then, up to first order, matrix B_k = Π_{i=1}^k (I − ρ x_{t(i)} x_{t(i)}^T) can be approximated as B_k ≈ I − ρ Σ_{i=1}^k x_{t(i)} x_{t(i)}^T. Then, to the extent that the above approximation holds, we can write:^5

y_t u^T B_{k−1} x_t ≈ y_t u^T (I − ρ Σ_{i=1}^{k−1} x_{t(i)} x_{t(i)}^T) x_t = y_t u^T (I − ρ Σ_{i=1}^{k−1} y_{t(i)} x_{t(i)} y_{t(i)} x_{t(i)}^T) x_t
= y_t u^T x_t − ρ y_t (Σ_{i=1}^{k−1} y_{t(i)} u^T x_{t(i)} y_{t(i)} x_{t(i)}^T) x_t = γ − ρ γ y_t v_{k−1}^T x_t.

Now, y_t v_{k−1}^T x_t is the margin of the (first-order) Perceptron vector v_{k−1} over a mistaken trial for the Higher-order Perceptron vector w_{k−1}. Since the two vectors v_{k−1} and w_{k−1} are correlated (recall that v_{k−1}^T w_{k−1} = v_{k−1}^T B_{k−1}^T B_{k−1} v_{k−1} = ||B_{k−1} v_{k−1}||^2 ≥ 0) the mistaken condition y_t w_{k−1}^T x_t ≤ 0 is more likely to imply y_t v_{k−1}^T x_t ≤ 0 than the opposite. This tends to yield a margin larger than the original margin γ. As we mentioned in Section 2, this is also advantageous from a computational standpoint, since in those cases the matrix update B_{k−1} → B_k might be skipped (this is equivalent to setting ρ_k = 0), and Theorem 1 would still hold.

Though the above might be the starting point of a more thorough theoretical understanding of the margin properties of our algorithm, in this paper we prefer to stop early and leave any further investigation to collecting experimental evidence.

4 Experiments

We tested the empirical performance of our algorithm by conducting a number of experiments on a collection of datasets, both artificial and real-world, from diverse domains (Optical Character Recognition, text categorization, DNA microarrays).
The main goal of these experiments was to compare Higher-order Perceptron (with both p = 2 and p > 2) to known Perceptron-like algorithms, such as first-order [21] and second-order Perceptron [7], in terms of training accuracy (i.e., convergence speed) and test set accuracy. The results are contained in Tables 1, 2, 3, and in Figure 2.

Task 1: DNA microarrays and artificial data. The goal here was to test the convergence properties of our algorithms on sparse target learning tasks. We first tested on a couple of well-known DNA microarray datasets. For each dataset, we first generated a number of random training/test splits (our random splits also included random permutations of the training set). The reported results are averaged over these random splits. The two DNA datasets are: i. The ER+/ER− dataset from [14]. Here the task is to analyze expression profiles of breast cancer and classify breast tumors according to ER (Estrogen Receptor) status. This dataset (which we call the "Breast" dataset) contains 58 expression profiles concerning 3389 genes. We randomly split 1000 times into a training set of size 47 and a test set of size 11. ii. The "Lymphoma" dataset [1]. Here the goal is to separate cancerous and normal tissues in a large B-cell lymphoma problem. The dataset contains 96 expression profiles concerning 4026 genes. We randomly split the dataset into a training set of size 60 and a test set of size 36. Again, the random split was performed 1000 times. On both datasets, the tested algorithms have been run by cycling 5 times over the current training set. No kernel functions have been used. We also artificially generated two (moderately) sparse learning problems with margin γ ≥ 0.005 at labeling noise levels η = 0.0 (linearly separable) and η = 0.1, respectively.
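The generation protocol for these sparse-target problems is detailed next; it can be sketched as follows. This is our illustrative Python (the function name and default sizes are ours): a sparse ±1 target on the first k of n coordinates, margin-γ labels obtained by rejection sampling, and labels flipped with probability η.

```python
import numpy as np

def make_sparse_target_problem(n=500, k=50, gamma=0.005, eta=0.1,
                               num_examples=1000, seed=0):
    """Illustrative sketch of the sparse-target generation protocol.

    Target u: k components drawn from {-1, +1}, the rest 0, then normalized.
    Instances: drawn from [-1, +1]^n, normalized, and rejected when
    |u . x| < gamma. Labels: sign of the margin, flipped w.p. eta.
    """
    rng = np.random.default_rng(seed)
    u = np.zeros(n)
    u[:k] = rng.choice([-1.0, 1.0], size=k)
    u /= np.linalg.norm(u)                  # normalized target vector
    X, y = [], []
    while len(X) < num_examples:
        x = rng.uniform(-1.0, 1.0, size=n)
        x /= np.linalg.norm(x)              # normalized instance vector
        margin = u @ x
        if abs(margin) < gamma:             # reject small-margin instances
            continue
        label = 1.0 if margin >= gamma else -1.0
        if rng.random() < eta:              # labeling noise
            label = -label
        X.append(x)
        y.append(label)
    return np.array(X), np.array(y), u
```

With η = 0 the resulting sequence is linearly separable by u with margin at least γ, matching the "Artificial 0.0" setting; η = 0.1 matches "Artificial 0.1".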
The datasets have been generated at random by first generating two (normalized) target vectors u ∈ {−1, 0, +1}^500, where the first 50 components are selected independently at random in {−1, +1} and the remaining 450 components are 0. Then we set η = 0.0 for the first target and η = 0.1 for the second one and, corresponding to each of the two settings, we randomly generated 1000 training examples and 1000 test examples. The instance vectors are chosen at random from [−1, +1]^500 and then normalized. If u · x_t ≥ γ then a +1 label is associated with x_t. If u · x_t ≤ −γ then a −1 label is associated with x_t. The labels so obtained are flipped with probability η. If |u · x_t| < γ then x_t is rejected and a new vector x_t is drawn. We call the two datasets "Artificial 0.0" and "Artificial 0.1". We tested our algorithms by training over an increasing number of epochs and checking the evolution of the corresponding test set accuracy. Again, no kernel functions have been used.

^5 Again, a similar argument holds in the more general setting p ≥ 2. The reader should notice how important the dependence of B_k on the past is to this argument.

Task 2: Text categorization. The text categorization datasets are derived from the first 20,000 newswire stories in the Reuters Corpus Volume 1 (RCV1, [22]). A standard TF-IDF bag-of-words encoding was used to transform each news story into a normalized vector of real attributes. We built four binary classification problems by "binarizing" consecutive news stories against the four target categories 70, 101, 4, and 59. These are the 2nd, 3rd, 4th, and 5th most frequent^6 categories, respectively, within the first 20,000 news stories of RCV1. We call these datasets RCV1x, where x = 70, 101, 4, 59.
Each dataset was split into a training set of size 10,000 and a test set of the same size. All algorithms have been trained for a single epoch. We initially tried polynomial kernels, then realized that kernel functions did not significantly alter our conclusions on this task. Thus the reported results refer to algorithms with no kernel functions.

Task 3: Optical character recognition (OCR). We used two well-known OCR benchmarks, the USPS dataset and the MNIST dataset [16], and followed standard experimental setups, such as the one in [9], including the one-versus-rest scheme for reducing a multiclass problem to a set of binary tasks. We used for each algorithm the standard Gaussian and polynomial kernels, with parameters chosen via 5-fold cross validation on the training set across standard ranges. Again, all algorithms have been trained for a single epoch over the training set. The results in Table 3 only refer to the best parameter settings for each kernel.

Algorithms. We implemented the standard Perceptron algorithm (with and without kernels), the Second-order Perceptron algorithm, as described in [7] (with and without kernels), and our Higher-order Perceptron algorithm. The implementation of the latter algorithm (for both p = 2 and p > 2) was "implicit primal" when tested on the sparse learning tasks, and in dual variables for the other two tasks. When using Second-order Perceptron, we set its parameter a (see [7] for details) by testing on a generous range of values. For brevity, only the settings achieving the best results are reported. On the sparse learning tasks we tried Higher-order Perceptron with norm p = 2, 4, 7, 10, while on the other two tasks we set p = 2. In any case, for each value of p, we set^7 ρ_k = c/k, with c = 0, 0.2, 0.4, 0.6, 0.8. Since c = 0 corresponds to a standard p-norm Perceptron algorithm [13, 12], we tried to emphasize the comparison c = 0 vs. c > 0.
Finally, when using kernels on the OCR tasks, we also compared to a sparse dual version of Higher-order Perceptron. On a mistaken round t = t(k), this algorithm sets ρ_k = c/k if y_t v_{k−1}^T x_t ≥ 0, and ρ_k = 0 otherwise (thus, when y_t v_{k−1}^T x_t < 0 the matrix B_{k−1} is not updated). For the sake of brevity, the standard Perceptron algorithm is called FO ("First Order"), the Second-order algorithm is denoted by SO ("Second Order"), while the Higher-order algorithm with norm parameter p and ρ_k = c/k is abbreviated as HOp(c). Thus, for instance, FO = HO2(0).

Results and conclusions. Our Higher-order Perceptron algorithm seems to deliver interesting results. In all our experiments HOp(c) with c > 0 outperforms HOp(0). On the other hand, the comparison HOp(c) vs. SO depends on the specific task. On the DNA datasets, HOp(c) with c > 0 is clearly superior on Breast. On Lymphoma, HOp(c) gets worse as p increases. This is a good indication that, in general, a multiplicative algorithm is not suitable for this dataset. In any case, HO2 turns out to be only slightly worse than SO. On the artificial datasets HOp(c) with c > 0 is always better than the corresponding p-norm Perceptron algorithm. On the text categorization tasks, HO2 tends to perform better than SO. On USPS, HO2 is superior to the other competitors, while on MNIST it performs similarly when combined with Gaussian kernels (though it turns out to be relatively sparser), while it is slightly inferior to SO when using polynomial kernels. The sparse version of HO2 cuts the matrix updates roughly by half, still maintaining a good performance.
In all cases HO2 (either sparse or not) significantly outperforms FO. In conclusion, the Higher-order Perceptron algorithm is an interesting tool for on-line binary classification, having the ability to combine multiplicative (or nonadditive) and second-order behavior into a single inference procedure. Like other algorithms, HOp can be extended (details omitted due to space limitations) in several ways through known worst-case learning technologies, such as large margin (e.g., [17, 11]), label-efficient/active learning (e.g., [5, 8]), and bounded memory (e.g., [10]).

^6 We did not use the most frequent category because of its significant overlap with the other ones.
^7 Notice that this setting fulfills the condition on ρ_k stated in Theorem 1.

Table 1: Training and test error on the two datasets "Breast" and "Lymphoma". Training error is the average total number of updates over 5 training epochs, while test error is the average fraction of misclassified patterns in the test set. The results refer to the same training/test splits. For each algorithm, only the best setting is shown (best training and best test setting coincided in these experiments). Thus, for instance, HO2 differs from FO because of the c parameter. We emphasized the comparison HO7(0) vs. HO7(c) with best c among the tested values. According to a Wilcoxon signed rank test, an error difference of 0.5% or larger might be considered significant. In bold are the smallest figures achieved on each row of the table.

                       FO      HO2     HO4     HO7(0)  HO7     HO10    SO
BREAST      TRAIN      45.2    21.7    24.5    47.4    24.5    32.4    29.6
            TEST       23.4%   16.4%   13.3%   15.7%   12.0%   13.5%   15.0%
LYMPHOMA    TRAIN      20.0    23.1    23.0    22.1    19.6    18.9    19.3
            TEST       11.8%   10.0%   10.0%   11.5%   11.5%   11.9%    9.6%

Figure 2: Experiments on the two artificial datasets (Artificial 0.0, on the left, and Artificial 0.1, on the right). The plots give training and test behavior as a function of the number of training epochs. Notice that the test set in Artificial 0.1 is affected by labelling noise of rate 10%. Hence, a visual comparison between the two plots at the bottom can only be made once we shift down the y-axis of the noisy plot by 10%. On the other hand, the two training plots (top) are not readily comparable. The reader might have difficulty telling apart the two kinds of algorithms HOp(0.0) and HOp(c) with c > 0. In practice, the latter turned out to be always slightly superior in performance to the former.

References
[1] A. Alizadeh, et al. (2000). Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature, 403, 503–511.
[2] D. Angluin (1988). Queries and concept learning. Machine Learning, 2(4), 319–342.
[3] P. Auer & M.K. Warmuth (1998). Tracking the best disjunction. Machine Learning, 32(2), 127–150.
[4] K.S. Azoury & M.K. Warmuth (2001). Relative loss bounds for on-line density estimation with the exponential family of distributions. Machine Learning, 43(3), 211–246.
[5] A. Bordes, S. Ertekin, J. Weston, & L. Bottou (2005). Fast kernel classifiers with on-line and active learning. JMLR, 6, 1579–1619.
[6] N. Cesa-Bianchi, Y. Freund, D. Haussler, D.P. Helmbold, R.E. Schapire, & M.K. Warmuth (1997). How to use expert advice. J.
ACM, 44(3), 427–485.

[Figure 2 plots: "Training updates vs training epochs" (top) and "Test error rates vs training epochs" (bottom) on Artificial0.0 (left) and Artificial0.1 (right), for FO = HO2(0.0), HO2(0.4), HO4(0.4), HO7(0.0), HO7(0.4), and SO (a = 0.2).]

Table 2: Experimental results on the four binary classification tasks derived from RCV1. "Train" denotes the number of training corrections, while "Test" gives the fraction of misclassified patterns in the test set. Only the results corresponding to the best test set accuracy are shown. In bold are the smallest figures achieved for each of the 8 combinations of dataset (RCV1x, x = 70, 101, 4, 59) and phase (training or test).

                    FO              HO2             SO
            TRAIN   TEST    TRAIN   TEST    TRAIN   TEST
RCV170      993     7.20%   941     6.83%   880     6.95%
RCV1101     673     6.39%   665     5.81%   677     5.48%
RCV14       803     6.14%   783     5.94%   819     6.05%
RCV159      767     6.45%   762     6.04%   760     6.84%

Table 3: Experimental results on the OCR tasks. "Train" denotes the total number of training corrections, summed over the 10 categories, while "Test" denotes the fraction of misclassified patterns in the test set. Only the results corresponding to the best test set accuracy are shown. For the sparse version of HO2 we also reported (in parentheses) the number of matrix updates during training.
In bold are the smallest figures achieved for each of the 8 combinations of dataset (USPS or MNIST), kernel type (Gaussian or Polynomial), and phase (training or test).

                       FO              HO2             Sparse HO2           SO
               TRAIN   TEST    TRAIN   TEST    TRAIN        TEST    TRAIN   TEST
USPS   GAUSS   1385    6.53%   945     4.76%   965 (440)    5.13%   1003    5.05%
       POLY    1609    7.37%   1090    5.71%   1081 (551)   5.52%   1054    5.53%
MNIST  GAUSS   5834    2.10%   5351    1.79%   5363 (2596)  1.81%   5684    1.82%
       POLY    8148    3.04%   6404    2.27%   6476 (3311)  2.28%   6440    2.03%

[7] N. Cesa-Bianchi, A. Conconi, & C. Gentile (2005). A second-order perceptron algorithm. SIAM Journal on Computing, 34(3), 640–668.
[8] N. Cesa-Bianchi, C. Gentile, & L. Zaniboni (2006). Worst-case analysis of selective sampling for linear-threshold algorithms. JMLR, 7, 1205–1230.
[9] C. Cortes & V. Vapnik (1995). Support-vector networks. Machine Learning, 20(3), 273–297.
[10] O. Dekel, S. Shalev-Shwartz, & Y. Singer (2006). The Forgetron: a kernel-based Perceptron on a fixed budget. NIPS 18, MIT Press, pp. 259–266.
[11] C. Gentile (2001). A new approximate maximal margin classification algorithm. JMLR, 2, 213–242.
[12] C. Gentile (2003). The robustness of the p-norm algorithms. Machine Learning, 53(3), 265–299.
[13] A.J. Grove, N. Littlestone, & D. Schuurmans (2001). General convergence results for linear discriminant updates. Machine Learning, 43(3), 173–210.
[14] S. Gruvberger, et al. (2001). Estrogen receptor status in breast cancer is associated with remarkably distinct gene expression patterns. Cancer Res., 61, 5979–5984.
[15] J. Kivinen, M.K. Warmuth, & P. Auer (1997). The perceptron algorithm vs. winnow: linear vs. logarithmic mistake bounds when few input variables are relevant. Artificial Intelligence, 97, 325–343.
[16] Y. Le Cun, et al. (1995).
Comparison of learning algorithms for handwritten digit recognition. ICANN 1995, pp. 53–60.
[17] Y. Li & P. Long (2002). The relaxed online maximum margin algorithm. Machine Learning, 46(1-3), 361–387.
[18] N. Littlestone (1988). Learning quickly when irrelevant attributes abound: a new linear-threshold algorithm. Machine Learning, 2(4), 285–318.
[19] N. Littlestone & M.K. Warmuth (1994). The weighted majority algorithm. Information and Computation, 108(2), 212–261.
[20] P. Long & X. Wu (2004). Mistake bounds for maximum entropy discrimination. NIPS 2004.
[21] A.B.J. Novikoff (1962). On convergence proofs on perceptrons. Proc. of the Symposium on the Mathematical Theory of Automata, vol. XII, pp. 615–622.
[22] Reuters: 2000. http://about.reuters.com/researchandstandards/corpus/.
[23] S. Shalev-Shwartz & Y. Singer (2006). Online learning meets optimization in the dual. COLT 2006, pp. 423–437.
[24] B. Schoelkopf & A. Smola (2002). Learning with kernels. MIT Press.
[25] V. Vovk (2001). Competitive on-line statistics. International Statistical Review, 69, 213–248.