{"title": "Invariant Pattern Recognition by Semi-Definite Programming Machines", "book": "Advances in Neural Information Processing Systems", "page_first": 33, "page_last": 40, "abstract": "", "full_text": "Invariant Pattern Recognition\n\nby Semide\ufb01nite Programming Machines\n\nThore Graepel\n\nMicrosoft Research Ltd.\n\nCambridge, UK\n\nthoreg@microsoft.com\n\nRalf Herbrich\n\nMicrosoft Research Ltd.\n\nCambridge, UK\n\nrherb@microsoft.com\n\nAbstract\n\ninvariances with respect to given pattern\nKnowledge about local\ntransformations can greatly improve the accuracy of classi\ufb01cation.\nPrevious approaches are either based on regularisation or on the gen-\neration of virtual (transformed) examples. We develop a new frame-\nwork for learning linear classi\ufb01ers under known transformations based\non semide\ufb01nite programming. We present a new learning algorithm\u2014\nthe Semide\ufb01nite Programming Machine (SDPM)\u2014which is able to\n\ufb01nd a maximum margin hyperplane when the training examples are\npolynomial trajectories instead of single points. The solution is found\nto be sparse in dual variables and allows to identify those points on\nthe trajectory with minimal real-valued output as virtual support vec-\ntors. Extensions to segments of trajectories, to more than one trans-\nformation parameter, and to learning with kernels are discussed. In\nexperiments we use a Taylor expansion to locally approximate rota-\ntional invariance in pixel images from USPS and \ufb01nd improvements\nover known methods.\n\n1 Introduction\n\nOne of the central problems of pattern recognition is the exploitation of known in-\nvariances in the pattern domain. In images these invariances may include rotation,\ntranslation, shearing, scaling, brightness, and lighting direction. In addition, speci\ufb01c\ndomains such as handwritten digit recognition may exhibit invariances such as line\nthinning/thickening and other non-uniform deformations [8]. 
The challenge is to combine the training sample with the knowledge of invariances to obtain a good classifier. Possibly the most straightforward way of incorporating invariances is to include virtual examples in the training sample, generated from actual examples by applying the invariance T : R × R^n → R^n at some fixed θ ∈ R, e.g., the method of virtual support vectors [7]. Images x subjected to the transformation T(θ, ·) describe highly non-linear trajectories or manifolds in pixel space. The tangent distance [8] approximates the distance between the trajectories (manifolds) by the distance between their tangent vectors (planes) at a given value θ = θ_0 and can be used with any kind of distance-based classifier. Another approach, tangent prop [8], incorporates the invariance T directly into the objective function for learning by penalising large values of the derivative of the classification function w.r.t. the given transformation parameter. A similar regulariser can be applied to support vector machines [1].

We take up the idea of considering the trajectory given by the combination of training vector and transformation. While data in machine learning are commonly represented as vectors x ∈ R^n, we instead consider more complex training examples, each of which is represented as a (usually infinite) set

    { T(θ, x_i) : θ ∈ R } ⊂ R^n ,   (1)

which constitutes a trajectory in R^n. Our goal is to learn a linear classifier that separates well the training trajectories belonging to different classes. In practice, we may be given a “standard” training example x together with a differentiable transformation T representing an invariance of the learning problem.
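To make the trajectory view concrete, here is a minimal numerical sketch with a made-up point in R² and the rotation transformation: truncating the Taylor series of T(θ, x) = R(θ)x at order r = 2 collects the coefficient rows into a matrix X so that X⊤(1, θ, θ²)⊤ tracks the true trajectory for small θ.

```python
import numpy as np

# Rotation of a made-up point x in R^2: T(theta, x) = R(theta) x.
def rotation(theta, x):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]]) @ x

x = np.array([1.0, 0.5])

# Rows of X: the Taylor coefficients (1/j!) d^j T / d theta^j at theta = 0.
# Here dR/dtheta|_0 = [[0, -1], [1, 0]] and d^2 R/dtheta^2|_0 = -I.
X = np.vstack([x,                                        # j = 0 term
               np.array([[0.0, -1.0], [1.0, 0.0]]) @ x,  # j = 1 term
               0.5 * (-np.eye(2)) @ x])                  # j = 2 term, incl. 1/2!

theta = 0.1
approx = X.T @ np.array([1.0, theta, theta ** 2])        # polynomial example
err = float(np.linalg.norm(approx - rotation(theta, x)))
```

For θ = 0.1 the quadratic approximation agrees with the exact rotation to well within 10⁻³ in Euclidean norm; the approximation degrades as θ grows, which motivates the segment restriction discussed later.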
The problem can be solved if the transformation T is approximated by a transformation T̃ polynomial in θ, e.g., a Taylor expansion of the form

    T̃(θ, x_i) ≈ Σ_{j=0}^{r} θ^j · ( (1/j!) · d^j T(θ, x_i)/dθ^j |_{θ=0} ) = Σ_{j=0}^{r} θ^j · (X_i)_{j,·} .   (2)

Our approach is based on a powerful theorem by Nesterov [5] which states that the set P⁺_{2l} of polynomials of degree 2l non-negative on the entire real line is a convex set representable by positive semidefinite (psd) constraints. Hence, optimisation over P⁺_{2l} can be formulated as a semidefinite program (SDP). Recall that an SDP [9] is given by a linear objective function minimised subject to a linear matrix inequality (LMI),

    minimise_{w ∈ R^n}  c⊤w   subject to   A(w) := Σ_{j=1}^{n} w_j A_j − B ⪰ 0 ,   (3)

with A_j ∈ R^{m×m} for all j ∈ {0, …, n}. The LMI A(w) ⪰ 0 means that A(w) is required to be positive semidefinite, i.e., that for all v ∈ R^m we have v⊤A(w)v = Σ_{j=1}^{n} w_j (v⊤A_j v) − v⊤Bv ≥ 0, which reveals that LMI constraints correspond to infinitely many linear constraints. This expressive power can be used to enforce constraints for training examples as given by (1), i.e., constraints required to hold for all values θ ∈ R.
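The “easy” direction of this representation, namely that any psd matrix P yields a polynomial p(θ) = (1, θ, …, θ^l) P (1, θ, …, θ^l)⊤ that is non-negative on the whole real line, can be sketched numerically with an arbitrary made-up psd matrix:

```python
import numpy as np

# A made-up psd matrix P (l = 2, so p has degree 4).
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
P = A @ A.T                                  # psd by construction

def p(theta):
    v = np.array([1.0, theta, theta ** 2])   # theta_vec = (1, theta, theta^2)
    return float(v @ P @ v)                  # equals ||A^T theta_vec||^2 >= 0

min_val = min(p(t) for t in np.linspace(-5.0, 5.0, 101))
```

Since p(θ) = ‖A⊤θ_vec‖², the sampled minimum is non-negative up to floating-point rounding; the non-trivial converse (every non-negative polynomial admits such a P) is what Nesterov's theorem supplies.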
Based on this representability theorem for non-negative polynomials we develop a learning algorithm—the Semidefinite Programming Machine (SDPM)—that maximises the margin on polynomial training samples, much like the support vector machine [2] for ordinary single-vector data.

2 Semidefinite Programming Machines

Linear Classifiers and Polynomial Examples   We consider binary classification problems and linear classifiers. Given a training sample ((x_1, y_1), …, (x_m, y_m)) ∈ (R^n × {−1, +1})^m we aim at learning a weight vector w ∈ R^n (we omit an explicit threshold to unclutter the presentation) to classify examples x by y(x) = sign(w⊤x). Assuming linear separability of the training sample, the principle of empirical risk minimisation recommends finding a weight vector w such that for all i ∈ {1, …, m} we have y_i w⊤x_i ≥ 0. As such this constitutes a linear feasibility problem and is easily solved by the perceptron algorithm [6]. Additionally requiring the solution to maximise the margin leads to the well-known quadratic program of support vector learning [2].

In order to be able to cope with known invariances T(θ, ·) we would like to generalise the above setting to the following feasibility problem:

    find w ∈ R^n such that ∀i ∈ {1, …, m} : ∀θ ∈ R :  y_i w⊤ x_i(θ) ≥ 0 ,   (4)

Figure 1: (Left) Approximated trajectories for rotated USPS images (2) for r = 1 (dashed line) and r = 2 (dotted line). The features are the mean pixel intensities in the top and bottom halves of the image. (Right) Sets of weight vectors w consistent with the six images (SVM version space, top) and with the six trajectories (SDPM version space, bottom). The SDPM version space is smaller and thus determines the weight vector more precisely.
The dot corresponds to the separating plane in the left plot.

That is, we would require the weight vector to classify correctly every transformed training example x_i(θ) := T(θ, x_i) for every value of the transformation parameter θ. The situation is illustrated in Figure 1. In general, such a set of constraints leads to a very complex and difficult-to-solve feasibility problem. As a consequence, we consider only transformations T̃(θ, x) of polynomial form, i.e., x̃_i(θ) := T̃(θ, x_i) = X_i⊤θ_vec, each polynomial example x̃_i(θ) being represented by a polynomial in the row vectors of X_i ∈ R^{(r+1)×n}, with θ_vec := (1, θ, …, θ^r)⊤. Then the problem (4) can be written as

    find w ∈ R^n such that ∀i ∈ {1, …, m} : ∀θ ∈ R :  y_i w⊤ X_i⊤ θ_vec ≥ 0 ,   (5)

which is equivalent to finding a weight vector w such that the polynomials p_i(θ) := y_i w⊤ X_i⊤ θ_vec are non-negative everywhere, i.e., p_i ∈ P⁺_r. The following proposition by Nesterov [5] paves the way for an SDP formulation of the above problem if r = 2l.

Proposition 1 (SD Representation of Non-Negative Polynomials [5]). The set P⁺_{2l} of polynomials non-negative everywhere on the real line is SD-representable:
1. For every P ⪰ 0 the polynomial p(θ) = θ_vec⊤ P θ_vec is non-negative everywhere.
2. For every polynomial p ∈ P⁺_{2l} there exists a P ⪰ 0 such that p(θ) = θ_vec⊤ P θ_vec.

Proof. Any polynomial p ∈ P_{2l} can be written as p(θ) = θ_vec⊤ P θ_vec, where P = P⊤ ∈ R^{(l+1)×(l+1)}. Statement 1: P ⪰ 0 implies ∀θ ∈ R : p(θ) = θ_vec⊤ P θ_vec = ‖P^{1/2} θ_vec‖² ≥ 0, hence p ∈ P⁺_{2l}. Statement 2: Every non-negative polynomial p ∈ P⁺_{2l} can be written as a sum of squared polynomials [4], hence ∃ q_i ∈ P_l : p(θ) = Σ_i q_i²(θ) = θ_vec⊤ (Σ_i q_i q_i⊤) θ_vec, where P := Σ_i q_i q_i⊤ ⪰ 0 and q_i is the coefficient vector of polynomial q_i.

Maximising Margins on Polynomial Samples   Here we develop an SDP formulation for learning a maximum-margin classifier given the polynomial constraints (5). It is well known that SDPs include quadratic programs as a special case [9]. The squared objective ‖w‖² is minimised by replacing it with an auxiliary variable t subject to a quadratic constraint t ≥ ‖w‖² that is written as an LMI using Schur's complement lemma,

    minimise_{(w,t)}  (1/2) t   subject to   F(w, t) := [ I_n  w ; w⊤  t ] ⪰ 0
    and  ∀i : G(w, X_i, y_i) := G_0 + Σ_{j=1}^{n} w_j G_j((X_i)_{·,j}, y_i) ⪰ 0 .   (6)

This constitutes an SDP as in (3) by the fact that a block-diagonal matrix is psd if and only if all its diagonal blocks are psd.

For the sake of illustration consider the case of l = 0 (the simplest non-trivial case). The matrix G(w, X_i, y_i) reduces to the scalar y_i w⊤x_i − 1, which translates into the standard SVM constraint y_i w⊤x_i ≥ 1, linear in w. For the case l = 1 we have G(w, X_i, y_i) ∈ R^{2×2} and

    G(w, X_i, y_i) = [ y_i w⊤(X_i)_{0,·} − 1 , (1/2) y_i w⊤(X_i)_{1,·} ; (1/2) y_i w⊤(X_i)_{1,·} , y_i w⊤(X_i)_{2,·} ] .   (7)

Although we require G(w, X_i, y_i) to be psd, the resulting optimisation problem can be formulated as a second-order cone program (SOCP) because the matrices involved are only 2 × 2. (The characteristic polynomial of a 2 × 2 matrix is quadratic and has at most two solutions. The condition that the lower eigenvalue be non-negative can be expressed as a second-order cone constraint. The SOCP formulation—if applicable—can be solved more efficiently than the SDP formulation.)

For the case l ≥ 2 the resulting program constitutes a genuine SDP. Again for the sake of illustration we consider the case l = 2 first. Since a polynomial p of degree four is fully determined by its five coefficients p_0, …, p_4, but the symmetric matrix P ∈ R^{3×3} in p(θ) = θ_vec⊤ P θ_vec has six degrees of freedom, we require one auxiliary variable u_i per training example,

    G(w, u_i, X_i, y_i) = (1/2) [ 2 y_i w⊤(X_i)_{0,·} − 2 , y_i w⊤(X_i)_{1,·} , y_i w⊤(X_i)_{2,·} − u_i ;
                                  y_i w⊤(X_i)_{1,·} , 2 u_i , y_i w⊤(X_i)_{3,·} ;
                                  y_i w⊤(X_i)_{2,·} − u_i , y_i w⊤(X_i)_{3,·} , 2 y_i w⊤(X_i)_{4,·} ] .

In general, since a polynomial of degree 2l has 2l + 1 coefficients and a symmetric (l + 1) × (l + 1) matrix has (l + 1)(l + 2)/2 degrees of freedom, we require (l − 1)l/2 auxiliary variables.

Dual Program and Complementarity   Let us consider the dual SDPs corresponding to the optimisation problems above. For the sake of clarity, we restrict the presentation to the case l = 1. The dual of the general SDP (3) is given by

    maximise_{Λ ∈ R^{m×m}}  tr(BΛ)   subject to   ∀j ∈ {1, …, n} : tr(A_j Λ) = c_j ;  Λ ⪰ 0 ,

where we introduced a matrix Λ of dual variables. The complementarity conditions for the optimal solution (w*, t*) read A((w*, t*)) Λ* = 0. The dual formulation of (6) with matrix (7), combined with the F(w, t) part of the complementarity conditions, reads

    maximise_{(α,β,γ) ∈ R^{3m}}  −(1/2) Σ_{i=1}^{m} Σ_{j=1}^{m} y_i y_j [x̃(α_i, β_i, γ_i, X_i)]⊤ [x̃(α_j, β_j, γ_j, X_j)] + Σ_{i=1}^{m} α_i
    subject to  ∀i ∈ {1, …, m} : M_i := [ α_i  β_i ; β_i  γ_i ] ⪰ 0 ,   (8)

where we define extrapolated training examples x̃(α_i, β_i, γ_i, X_i) := α_i (X_i)_{0,·} + β_i (X_i)_{1,·} + γ_i (X_i)_{2,·}. As before, this program with quadratic objective and psd constraints can be formulated as a standard SDP in the form (3) and is easily solved by a standard SDP solver. (We used the SDP solver SeDuMi together with the LMI parser Yalmip under MATLAB; see also http://www-user.tu-chemnitz.de/~helmberg/semidef.html.) In addition, the complementarity conditions reveal that the optimal weight vector w* can be expanded as

    w* = Σ_{i=1}^{m} y_i x̃(α_i, β_i, γ_i, X_i) ,   (9)

in analogy to the corresponding result for support vector machines [2].

It remains to analyse the complementarity conditions related to the example-related G(w, X_i, y_i) constraints in (6). Using (7) and assuming primal and dual feasibility, we obtain for all i ∈ {1, …, m} at the solution (w*, t*, M_i*),

    G(w*, X_i, y_i) · M_i* = 0 ,   (10)

the trace of which translates into

    y_i w*⊤ [ α_i* (X_i)_{0,·} + β_i* (X_i)_{1,·} + γ_i* (X_i)_{2,·} ] = α_i* .   (11)

These relations enable us to characterise the solution by the following propositions:

Proposition 2 (Sparse Expansion). The expansion (9) of w* in terms of X_i is sparse: only those examples X_i (“support vectors”) which lie on the margin, i.e., for which det(G(w*, X_i, y_i)) = 0, may have non-zero expansion coefficients α_i*. Furthermore, in this case α_i* = 0 implies β_i* = γ_i* = 0.

Proof. We assume α_i* ≠ 0 and derive a contradiction. From G(w*, X_i, y_i) ≻ 0 we conclude using Proposition 1 that for all θ ∈ R we have y_i w*⊤((X_i)_{0,·} + θ(X_i)_{1,·} + θ²(X_i)_{2,·}) > 1. Furthermore, we conclude from (10) that det(M_i*) = α_i* γ_i* − β_i*² = 0, which together with the assumption α_i* ≠ 0 implies that there exists θ̃ ∈ R such that β_i* = θ̃ α_i* and γ_i* = θ̃² α_i*. Inserting this into (11) leads to a contradiction, hence α_i* = 0. Then det(M_i*) = 0 implies β_i* = 0, and the fact that G(w*, X_i, y_i) ≻ 0 ⇒ y_i w*⊤(X_i)_{2,·} ≠ 0 ensures that γ_i* = 0 holds as well.

Proposition 3 (Truly Virtual Support Vectors). For all examples X_i lying on the margin, i.e., satisfying det(G(w*, X_i, y_i)) = 0 and det(M_i*) = 0, there exist θ_i ∈ R ∪ {∞} such that the optimal weight vector w* can be written as

    w* = Σ_{i=1}^{m} α_i* y_i x̃_i(θ_i*) = Σ_{i=1}^{m} y_i α_i* ( (X_i)_{0,·} + θ_i* (X_i)_{1,·} + (θ_i*)² (X_i)_{2,·} ) .

Proof (sketch). We have det(M_i*) = α_i* γ_i* − β_i*² = 0. We only need to consider α_i* ≠ 0, in which case there exists θ_i* such that β_i* = θ_i* α_i* and γ_i* = (θ_i*)² α_i*. The other cases are ruled out by the complementarity conditions (10).

Based on this proposition it is possible not only to identify which examples X_i are used in the expansion of the optimal weight vector w*, but also the corresponding values θ_i* of the transformation parameter θ. This extends the idea of virtual support vectors [7] in that Semidefinite Programming Machines are capable of finding truly virtual support vectors that were not explicitly provided in the training sample.

3 Extensions to SDPMs

Optimisation on a Segment   In many applications it may not be desirable to enforce correct classification on the entire trajectory given by the polynomial example x̃(θ). In particular, when the polynomial is used as a local approximation to a global invariance we would like to restrict the example to a segment of the trajectory. To this end consider the following corollary to Proposition 1.

Corollary 1 (SD-Representability on a Segment [5]). For any l ∈ N, the set P⁺_l(−τ, τ) of polynomials non-negative on a segment [−τ, τ] is SD-representable.

Proof (sketch). Consider a polynomial p ∈ P⁺_l(−τ, τ), where p := x ↦ Σ_{i=0}^{l} p_i x^i, and

    q := x ↦ (1 + x²)^l · p( τ(2x²(1 + x²)^{−1} − 1) ) .

If q ∈ P⁺_{2l} is non-negative everywhere then p is non-negative on [−τ, τ].

The corollary shows how we can restrict the examples x̃(θ) to a segment θ ∈ [−τ, τ] by effectively doubling the degree of the polynomial used.
This is the SDPM version used in the experiments in Section 4. Note that the matrix G(w, X_i, y_i) is sparse because the resulting polynomial contains only even powers of θ.

Multiple Transformation Parameters   In practice it would be desirable to treat more than one transformation at once. For example, in handwritten digit recognition transformations like rotation, scaling, translation, shearing, thinning/thickening etc. may all be relevant [8]. Unfortunately, Proposition 1 only holds for polynomials in one variable. However, its first statement may be generalised to polynomials of more than one variable: for every psd matrix P ⪰ 0 the polynomial p(ρ) = θ_ρ⊤ P θ_ρ is non-negative everywhere, even if the components of θ_ρ are arbitrary monomials in ρ_1, …, ρ_D. This means that optimisation is only over a subset of these polynomials. (There exist polynomials in more than one variable that are non-negative everywhere yet cannot be written as a sum of squares and are hence not SD-representable.) Considering polynomials of degree two and θ_ρ := (1, ρ_1, …, ρ_D)⊤ we have

    x̃_i(ρ) ≈ θ_ρ⊤ [ x_i(0) , ∇_ρ⊤ x_i(0) ; ∇_ρ x_i(0) , ∇_ρ∇_ρ⊤ x_i(0) ] θ_ρ ,

where ∇_ρ denotes the gradient and ∇_ρ∇_ρ⊤ the Hessian operator. Note that the scaling behaviour with regard to the number D of parameters is more benign than that of the naive method of adding virtual examples to the training sample on a grid. Such a procedure would incur an exponential growth in the number of examples, whereas the approximation above only exhibits a linear growth in the size of the matrices involved.

Learning with Kernels   Support vector machines derive much of their popularity from the flexibility added by the use of kernels [2, 7]. Due to space restrictions we cannot discuss kernels in detail. However, taking the dual SDPM (8) as a starting point and assuming the Taylor expansion (2), the crucial point is that in order to represent the polynomial trajectory in feature space we need to differentiate through the kernel function. Let us assume a feature map φ : R^n → F ⊆ R^N and let k : X × X → R be the kernel function corresponding to φ in the sense that ∀x, x̃ ∈ X : [φ(x)]⊤[φ(x̃)] = k(x, x̃).

Figure 2: (a) A linear classifier learned with the SDPM on 10 2D representations of the USPS digits “1” and “9” (see Figure 1 for details). Note that the “support” vector is truly virtual since it was never directly supplied to the algorithm (inset zoom-in). (b) Mean test errors of classifiers learned with the SVM vs. SDPM (see text) and (c) virtual SVM vs. SDPM algorithm on 50 independent training sets of size m = 20 for all 45 digit classification tasks.

The Taylor expansion (2) is now carried out in F.
Then an inner product expression between data points x_i and x_j, differentiated u and v times respectively, reads

    [φ^{(u)}(x_i)]⊤ [φ^{(v)}(x_j)] = Σ_{s=1}^{N} ( d^u φ_s(x(θ))/dθ^u |_{x=x_i, θ=0} ) · ( d^v φ_s(x̃(θ̃))/dθ̃^v |_{x̃=x_j, θ̃=0} ) =: k^{(u,v)}(x_i, x_j) .

The kernel trick may help avoid the sum over N feature space dimensions; however, it does so at the cost of additional terms generated by the product rule of differentiation. It turns out that for polynomials of degree r = 2 the exact calculation of elements of the kernel matrix is already O(n⁴) and needs to be approximated efficiently in practice.

4 Experimental Results

In order to test and illustrate the SDPM we used the well-known USPS data set of 16 × 16 pixel images in [0, 1] of handwritten digits. We considered the transformation rotation by an angle θ and calculated the first and second derivatives x_i′(θ = 0) and x_i″(θ = 0) based on an image representation smoothed by a Gaussian of variance 0.09. For the purpose of illustration we calculated two simple features, averaging the first and the second 128 pixel intensities, respectively. Figure 2 (a) shows a plot of 10 training examples of digits “1” and “9” together with the quadratically approximated trajectories for θ ∈ [−20°, 20°]. The examples are separated by the solution found with an SDPM restricted to the same segment of the trajectory.
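As a sanity check on the constraint matrices underlying such an SDPM, the l = 1 matrix G of (7) satisfies (1, θ) G (1, θ)⊤ = y_i w⊤((X_i)_{0,·} + θ(X_i)_{1,·} + θ²(X_i)_{2,·}) − 1 for every θ, which is exactly why a single psd constraint on G enforces the margin along the whole trajectory. A sketch with made-up data:

```python
import numpy as np

# Made-up weight vector and polynomial example for l = 1 (degree-2 trajectory).
rng = np.random.default_rng(1)
w = rng.standard_normal(2)
Xi = rng.standard_normal((3, 2))         # rows (X_i)_0, (X_i)_1, (X_i)_2
yi = 1.0

# Constraint matrix G from (7).
g01 = 0.5 * yi * w @ Xi[1]
G = np.array([[yi * w @ Xi[0] - 1.0, g01],
              [g01, yi * w @ Xi[2]]])

def margin_poly(theta):
    # y_i w^T x~_i(theta) - 1: required to be >= 0 for every theta
    return yi * w @ (Xi[0] + theta * Xi[1] + theta ** 2 * Xi[2]) - 1.0

# (1, theta) G (1, theta)^T reproduces the margin polynomial exactly.
max_gap = max(abs(np.array([1.0, t]) @ G @ np.array([1.0, t]) - margin_poly(t))
              for t in np.linspace(-2.0, 2.0, 9))
```

The identity holds for any w and X_i, so requiring G ⪰ 0 is equivalent to the margin constraint for all θ simultaneously by Proposition 1.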
Following Propositions 2 and 3, the weight vector found is expressed as a linear combination of truly virtual support vectors that had not been supplied in the training sample directly (see inset). In a second experiment, we probed the performance of the SDPM algorithm on the full feature set of 256 pixel intensities using 50 training sets of size m = 20 for each of the 45 one-versus-one classification tasks between all of the digits from “0” to “9” from the USPS data set. For each task, the digits in one class were rotated by −10° and the digits of the other class by +10°. We compared the performance of the SDPM algorithm to that of the original support vector machine (SVM) [2] and the virtual support vector machine (VSVM) [7], measured on independent test sets of size 250. The VSVM takes the support vectors of the ordinary SVM run and is trained on a sample that contains these support vectors together with transformed versions rotated by −10° and +10° in the quadratic approximation. The results are shown in the form of scatter plots of the errors for the 45 tasks in Figure 2 (b) and (c). Clearly, taking into account the invariance is useful and leads to SDPM performance superior to the ordinary SVM. The SDPM also performs slightly better than the VSVM; however, this could be attributed to the pre-selection of support vectors to which the transformation is applied.
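The recovery of truly virtual support vectors in Propositions 2 and 3 amounts to reading θ_i* = β_i*/α_i* off a rank-one dual block M_i*; a sketch with made-up dual values:

```python
import numpy as np

# Made-up dual values for a margin example: det(M*) = alpha*gamma - beta^2 = 0,
# so the dual block is rank one and the contribution
# y*(alpha*X0 + beta*X1 + gamma*X2) to w* equals alpha * y * x~(theta*)
# with theta* = beta/alpha (Proposition 3).
alpha, theta_star = 0.8, 0.3
beta = alpha * theta_star
gamma = alpha * theta_star ** 2          # det(M*) = 0 by construction

X = np.array([[1.0, 0.0],                # row (X_i)_0
              [0.0, 1.0],                # row (X_i)_1
              [0.5, -0.5]])              # row (X_i)_2
y = -1.0

contrib_dual = y * (alpha * X[0] + beta * X[1] + gamma * X[2])
contrib_virtual = alpha * y * (X[0] + theta_star * X[1] + theta_star ** 2 * X[2])
gap = float(np.linalg.norm(contrib_dual - contrib_virtual))
recovered_theta = beta / alpha
```

The recovered θ* locates the virtual support vector x̃(θ*) on the trajectory, even though no example at that transformation value was ever supplied to the learner.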
It is expected that for an increasing number D of transformations the performance improvement becomes more pronounced, because in high dimensions most volume is concentrated on the boundary of the convex hull of the polynomial manifold.

5 Conclusion

We introduced Semidefinite Programming Machines as a means of learning on infinite families of examples given in terms of polynomial trajectories or—more generally—manifolds in data space. The crucial insight lies in the SD-representability of non-negative polynomials, which allows us to replace the simple non-negativity constraints in algorithms such as support vector machines by positive semidefinite constraints. While we have demonstrated the performance of the SDPM only on very small data sets, it is expected that modern interior-point methods make it possible to scale SDPMs to problems of m ≈ 10⁵–10⁶ data points, in particular in primal space, where the number of variables is given by the number of features. This expectation is further supported by the following: (i) the resulting SDP is well structured in the sense that A(w, t) is block-diagonal with many small blocks; (ii) it may often be sufficient to satisfy the constraints—e.g., by a version of the perceptron algorithm for semidefinite feasibility problems [3]—without necessarily maximising the margin.

Open questions remain about training SDPMs with multiple parameters and about the efficient application of SDPMs with kernels. Finally, it would be interesting to obtain learning-theoretical results regarding the fact that SDPMs effectively make use of an infinite number of (non-IID) training examples.

References

[1] O. Chapelle and B. Schölkopf. Incorporating invariances in non-linear support vector machines. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14, pages 609–616, Cambridge, MA, 2002. MIT Press.

[2] C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 20:273–297, 1995.

[3] T. Graepel, R. Herbrich, A. Kharechko, and J. Shawe-Taylor. Semidefinite programming by perceptron learning. In S. Thrun, L. Saul, and B. Schölkopf, editors, Advances in Neural Information Processing Systems 16. MIT Press, 2004.

[4] A. Nemirovski. Five lectures on modern convex optimization, 2002. Lecture notes of the C.O.R.E. Summer School on Modern Convex Optimization.

[5] Y. Nesterov. Squared functional systems and optimization problems. In H. Frenk, K. Roos, T. Terlaky, and S. Zhang, editors, High Performance Optimization, pages 405–440. Kluwer Academic Press, 2000.

[6] F. Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386–408, 1958.

[7] B. Schölkopf. Support Vector Learning. R. Oldenbourg Verlag, München, 1997. Doktorarbeit, TU Berlin. Download: http://www.kernel-machines.org.

[8] P. Simard, Y. LeCun, J. Denker, and B. Victorri. Transformation invariance in pattern recognition: tangent distance and tangent propagation. In G. Orr and K.-R. Müller, editors, Neural Networks: Tricks of the Trade. Springer, 1998.

[9] L. Vandenberghe and S. Boyd. Semidefinite programming. SIAM Review, 38(1):49–95, 1996.
", "award": [], "sourceid": 2403, "authors": [{"given_name": "Thore", "family_name": "Graepel", "institution": null}, {"given_name": "Ralf", "family_name": "Herbrich", "institution": null}]}