{"title": "Learning with Invariance via Linear Functionals on Reproducing Kernel Hilbert Space", "book": "Advances in Neural Information Processing Systems", "page_first": 2031, "page_last": 2039, "abstract": "Incorporating invariance information is important for many learning problems. To exploit invariances, most existing methods resort to approximations that either lead to expensive optimization problems such as semi-definite programming, or rely on separation oracles to retain tractability. Some methods further limit the space of functions and settle for non-convex models. In this paper, we propose a framework for learning in reproducing kernel Hilbert spaces (RKHS) using local invariances that explicitly characterize the behavior of the target function around data instances. These invariances are \\emph{compactly} encoded as linear functionals whose value are penalized by some loss function. Based on a representer theorem that we establish, our formulation can be efficiently optimized via a convex program. For the representer theorem to hold, the linear functionals are required to be bounded in the RKHS, and we show that this is true for a variety of commonly used RKHS and invariances. Experiments on learning with unlabeled data and transform invariances show that the proposed method yields better or similar results compared with the state of the art.", "full_text": "Learning with Invariance via Linear Functionals on\n\nReproducing Kernel Hilbert Space\n\nXinhua Zhang\n\nMachine Learning Research Group\nNational ICT Australia and ANU\n\nxinhua.zhang@nicta.com.au\n\nWee Sun Lee\n\nDepartment of Computer Science\nNational University of Singapore\n\nleews@comp.nus.edu.sg\n\nYee Whye Teh\n\nDepartment of Statistics\nUniversity of Oxford\ny.w.teh@stats.ox.ac.uk\n\nAbstract\n\nIncorporating invariance information is important for many learning problems. 
To\nexploit invariances, most existing methods resort to approximations that either\nlead to expensive optimization problems such as semi-de\ufb01nite programming, or\nrely on separation oracles to retain tractability. Some methods further limit the\nspace of functions and settle for non-convex models. In this paper, we propose a\nframework for learning in reproducing kernel Hilbert spaces (RKHS) using local\ninvariances that explicitly characterize the behavior of the target function around\ndata instances. These invariances are compactly encoded as linear functionals\nwhose value are penalized by some loss function. Based on a representer theo-\nrem that we establish, our formulation can be ef\ufb01ciently optimized via a convex\nprogram. For the representer theorem to hold, the linear functionals are required\nto be bounded in the RKHS, and we show that this is true for a variety of com-\nmonly used RKHS and invariances. Experiments on learning with unlabeled data\nand transform invariances show that the proposed method yields better or similar\nresults compared with the state of the art.\n\nIntroduction\n\n1\nInvariances are among the most useful prior information used in machine learning [1]. In many\nvision problems such as handwritten digit recognition, detectors are often supposed to be invariant\nto certain local transformations, such as translation, rotation, and scaling [2, 3]. One way to utilize\nthis invariance is by assuming that the gradient of the function is small along the directions of trans-\nformation at each data instance. Another important scenario is semi-supervised learning [4], which\nrelies on reasonable priors over the relationship between the data distribution and the discriminant\nfunction [5, 6]. 
It is commonly assumed that the function does not change much in the proximity of\neach observed data instance, which re\ufb02ects the typical clustering structure in the data set: instances\nfrom the same class are clustered together and away from those of different classes [7\u20139]. Another\npopular assumption is that the function varies smoothly over the graph Laplacian [10\u201312].\nA number of existing works have established a mathematical framework for learning with invari-\nance. Suppose T (x, \u03b8) transforms a data point x by an operator T with a parameter \u03b8 (e.g. T for\nrotation and \u03b8 for the degree of rotation). Then to incorporate invariance, the target function f is\nassumed to be (almost) invariant over T (x, \u0398) := {T (x, \u03b8) : \u03b8 \u2208 \u0398}, where \u0398 controls the lo-\ncality of invariance. The consistency of this framework was shown by [13] in the context of robust\noptimization [14]. However, in practice it usually leads to a large or in\ufb01nite number of constraints,\nand hence tractable formulations inevitably rely on approximating or restricting the invariance under\nconsideration. Finally, this paradigm gets further complicated when f comes from a rich space of\nfunctions, e.g. the reproducing kernel Hilbert space (RKHS) induced by universal kernels.\nIn [15], all perturbations within the ellipsoids around instances are treated as invariances. This led to\na second order cone program, which is dif\ufb01cult to solve ef\ufb01ciently. In [16], a discrete set of \u0398 that\ncorresponds to feature deletions [17] are considered. The problem is reduced to a quadratic program,\nbut at the cost of blowing up the number of variables which makes scaling to large problems chal-\n\n1\n\n\flenging. A one step approximation of T (x, \u0398) via the even-order Taylor expansions around \u03b8 = 0\nis used in [18]. This results in a semi-de\ufb01nite programming, which is still hard to solve. 
A further simplification was introduced by [19], which performed sparse approximations of T(x, Θ) by finding (via an oracle) the most violating instance under the current solution. Besides yielding a cheap quadratic program at each iteration, it also improved upon the Virtual Support Vector approach in [20], which did not have a clear optimization objective despite a similar motivation of sparse approximation. However, tractable oracles are often unavailable, and so a simpler approximation can be performed by merely enforcing the invariance of f along some given directions, e.g. the tangent direction ∂/∂θ|_{θ=0} T(x, θ). This idea was used in [21] in a nonlinear RKHS, but their direction of perturbation was not in the original space but in the RKHS. By contrast, [3] did penalize the gradient of f in the original feature space, but their function space was limited to neural networks and only locally optimal solutions were found.
The goal of our paper, therefore, is to develop a new framework that: (1) allows a variety of invariances to be compactly encoded over a rich family of functions like RKHS, and (2) allows the search for the optimal function to be formulated as a convex program that is efficiently solvable. The key requirement of our approach is that the invariances can be characterized by linear functionals that are bounded (§ 2). Under this assumption, we are able to formulate our model as a standard regularized risk minimization problem, where the objective consists of the sum of loss functions on these linear functionals, the usual loss functions on the labeled training data, and a regularization penalty based on the RKHS norm of the function (§ 3). We give a representer theorem which guarantees that the cost can be minimized by linearly combining a finite number of basis functions¹. 
Using convex losses, the resulting optimization problem is a convex program, which can be efficiently solved in a batch or online fashion (§ 5). Note that [23] also proposed an operator based model for invariance, but did not derive a representer theorem and did not study the empirical performance.
We also show that a wide range of commonly used invariances can be encoded as bounded linear functionals. These include derivatives, transformation invariances, and local averages in commonly used RKHSs such as those defined by Gaussian and polynomial kernels (§ 4). Experiments show that the use of some of these invariances within our framework yields better or similar results compared to the state of the art.
Finally, we point out that our focus is to find a function in a given RKHS which respects the pre-specified invariances. We are not constructing kernels that instantiate the invariance, e.g. [24, 25].

2 Preliminaries
Suppose features of training examples lie in a domain X. A function k : X × X → R is called a positive semi-definite kernel (or simply kernel) if for all l ∈ N and all x1, . . . , xl ∈ X, the l×l Gram matrix K := (k(xi, xj))_{ij} is symmetric positive semi-definite. Example kernels on R^n × R^n include the polynomial kernel of degree r, defined as k(x1, x2) = (x1 · x2 + 1)^r,² as well as Gaussian kernels, defined as k(x1, x2) = κ_σ(x1, x2) where κ_σ(x1, x2) := exp(−‖x1 − x2‖² / (2σ²)). More comprehensive introductions to kernels are available in, e.g., [26–28].
Given k, let H0 be the set of all finite linear combinations of functions in {k(x, ·) : x ∈ X}, and endow H0 with the inner product ⟨f, g⟩ = Σ_{i=1}^{p} Σ_{j=1}^{q} α_i β_j k(x_i, y_j), where f(·) = Σ_{i=1}^{p} α_i k(x_i, ·) and g(·) = Σ_{j=1}^{q} β_j k(y_j, ·). Note ⟨f, g⟩ is invariant to the form of expansion of f and g [26, 27]. Using the positive semi-definite property of k, it is easy to show that H0 is an inner product space, and we call its completion under ⟨·, ·⟩ a reproducing kernel Hilbert space (RKHS) H induced by k. For any f ∈ H, the reproducing property implies f(x) = ⟨f, k(x, ·)⟩, and we denote ‖f‖² := ⟨f, f⟩.

2.1 Operators and Representers
Definition 1 (Bounded Linear Operator and Functional). A linear operator T is a mapping from a vector space V to a vector space W, such that T(x + y) = Tx + Ty and T(αx) = α · Tx for all x, y ∈ V and scalar α ∈ R. T is also called a functional if W = R. In the case that V and W are normed, T is called bounded if c := sup_{x∈V, ‖x‖_V=1} ‖Tx‖_W is finite, and we call c the norm of the operator, denoted by ‖T‖.

¹ A similar result was provided in [22], but not in the context of learning with invariance.
² We write a variable in boldface if it is a vector in a Euclidean space.

Example 1. Let H be an RKHS induced by a kernel k defined on X × X. Then for any x ∈ X, the linear functional T : H → R defined as T(f) := f(x) is bounded since |f(x)| = |⟨f, k(x, ·)⟩| ≤ ‖k(x, ·)‖ · ‖f‖ = k(x, x)^{1/2} ‖f‖ by the Cauchy-Schwarz inequality.
Boundedness of linear functionals is particularly useful thanks to the Riesz representation theorem, which establishes their one-to-one correspondence to V [29].
Theorem 1 (Riesz representation Theorem). 
Every bounded linear functional L on a Hilbert space\nV can be represented in terms of an inner product L(x) = (cid:104)x, z(cid:105) for all x \u2208 V, where the representer\nof the functional, z \u2208 V, has norm (cid:107)z(cid:107) = (cid:107)L(cid:107) and is uniquely determined by L.\nExample 2. Let H be the RKHS induced by a kernel k. For any functional L on H, the representer\nz can be constructed as z(x) = (cid:104)z, k(x,\u00b7)(cid:105) = L(k(x,\u00b7)) for all x \u2208 X. By Theorem 1, z \u2208 H.\nUsing Riesz\u2019s representer theorem, it is not hard to show that for any bounded linear operator T :\nV \u2192 V where V is Hilbertian, there exists a unique bounded linear operator T \u2217 : V \u2192 V such that\n(cid:104)T x, y(cid:105) = (cid:104)x, T \u2217y(cid:105) for all x, y \u2208 V. T \u2217 is called the adjoint operator. So continuing Example 2:\nExample 3. Suppose the functional L has the form L(f ) = T (f )(x0), where x0 \u2208 X and T : H \u2192\nH is a bounded linear operator on H. Then the representer of L is z = T \u2217(k(x0,\u00b7)) because\n\u2200 x \u2208 X, z(x) = L(k(x,\u00b7)) = T (k(x,\u00b7))(x0) = (cid:104)T (k(x,\u00b7)), k(x0,\u00b7)(cid:105) = (cid:104)k(x,\u00b7), T \u2217(k(x0,\u00b7))(cid:105). (1)\nRiesz\u2019s theorem will be useful for our framework since it allows us to compactly represent function-\nals related to local invariances as elements of the RKHS.\n\n3 Regularized Risk Minimization in RKHS with Invariances\n\nTo simplify the presentation, we \ufb01rst describe our framework in the settings of semi-supervised\nlearning [SSL, 4], and later show how to extend it to other learning scenarios in a straightforward\nway. In SSL, we wish to learn a target function f both from labeled data and from local invariances\nextracted from labeled and unlabeled data. Let (x1, y1), . . . , (xl, yl) be the labeled training data,\nand (cid:96)1(f (x), y) be the loss function on f when the training input x is labeled as y. 
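Before specifying the losses, note that the boundedness inequality of Example 1 is easy to check numerically. Below is a small sketch (our own illustration, not from the paper) with a Gaussian kernel and a function f = Σ_i α_i k(x_i, ·) in H0, for which ‖f‖² = α′Kα and f(x) = Σ_i α_i k(x_i, x); all variable names are ours.

```python
import numpy as np

def gauss_kernel(x1, x2, sigma=1.0):
    """Gaussian kernel kappa_sigma(x1, x2) = exp(-||x1 - x2||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x1 - x2) ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))          # 5 points in R^3 spanning f
alpha = rng.normal(size=5)           # coefficients of f = sum_i alpha_i k(x_i, .)

K = np.array([[gauss_kernel(a, b) for b in X] for a in X])
f_norm = np.sqrt(alpha @ K @ alpha)  # RKHS norm ||f|| = sqrt(alpha' K alpha)

x = rng.normal(size=3)               # an arbitrary evaluation point
f_x = sum(a * gauss_kernel(xi, x) for a, xi in zip(alpha, X))

# Example 1: |f(x)| <= k(x, x)^(1/2) * ||f||; here k(x, x) = 1 for the Gaussian kernel
bound = np.sqrt(gauss_kernel(x, x)) * f_norm
assert abs(f_x) <= bound + 1e-12
```

The assertion holds for any choice of points and coefficients, since it is exactly the Cauchy-Schwarz bound of Example 1.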
In this paper we restrict ℓ1 to be convex in its first argument, e.g. logistic loss, hinge loss, and squared loss.
We measure deviations from local invariances around each labeled or unlabeled input instance, and express them as bounded linear functionals L_{l+1}(f), . . . , L_{l+m}(f) on the RKHS H. The linear functionals are associated with another convex loss function ℓ2(L_i(f)) penalizing violations of the local invariances. As an example, the derivative of f with respect to an input feature at some training instance x is a linear functional in f, and the loss function penalizes large values of the derivative at x using, e.g., squared loss, absolute loss, and ε-insensitive loss. Section 4 describes other local invariances we can consider and shows that these can be expressed as bounded linear functionals.
Finally, we penalize the complexity of f via the squared RKHS norm ‖f‖². Putting together the loss and regularizer, we set out to minimize the regularized risk functional over f ∈ H:

    min_{f∈H}  (1/2)‖f‖² + λ Σ_{i=1}^{l} ℓ1(f(x_i), y_i) + ν Σ_{i=l+1}^{l+m} ℓ2(L_i(f)),    (2)

where λ, ν > 0. By the convexity of ℓ1 and ℓ2, (2) must be a convex optimization problem. However it is still in the function space and involves functionals. In order to derive an efficient optimization procedure, we now derive a representer theorem showing that the optimal solution lies in the span of a finite number of functions associated with the labeled data and the representers of the functionals L_i. Similar results are available in [22].
Theorem 2. Let H be the RKHS defined by the kernel k. Let L_i (i = l + 1, . . . , l + m) be bounded linear functionals on H with representers z_i. Then the optimal solution to (2) must be of the form

    g(·) = Σ_{i=1}^{l} α_i k(x_i, ·) + Σ_{i=l+1}^{l+m} α_i z_i(·).    (3)

Furthermore, the parameters α = (α_1, . . . , α_{l+m})′ (finite dimensional) can be found by minimizing

    λ Σ_{i=1}^{l} ℓ1(⟨k(x_i, ·), f⟩, y_i) + ν Σ_{i=l+1}^{l+m} ℓ2(⟨z_i, f⟩) + (1/2) α′Kα,  where  f = Σ_{i=1}^{l} α_i k(x_i, ·) + Σ_{i=l+1}^{l+m} α_i z_i.    (4)

Here K_{ij} = ⟨k̂_i, k̂_j⟩, where k̂_i = k(x_i, ·) if i ≤ l, and k̂_i = z_i(·) otherwise. (Proof is in Appendix A.)
Theorem 2 is similar to the results in [18, Proposition 3] and [19, Eq 8], where the optimal function lies in the span of a finite number of representers. However, our model is quite different in that it uses the representers of the linear functionals corresponding to the invariance, rather than virtual samples drawn from the invariant neighborhood. This could result in more compact models because the invariance (e.g. rotation) is enforced by a single representer, rather than multiple virtual examples (e.g. various degrees of rotation) drawn from the trajectory of invariant transforms. By the expansion of f in (4), the labeling of a new instance x depends not only on k(x, x_i), which often measures the similarity between x and the training examples, but also takes into account the extent to which k(x, ·), as a function, conforms to the prior invariances.
Computationally, ⟨k(x_i, ·), z_j⟩ is straightforward based on the definition of L_j. The efficient computation of ⟨z_i, z_j⟩ depends on the specific kernels and invariances, as we will show in Section 4. In the simplest case, consider the commonly used graph Laplacian regularizer [10, 11] which, given the similarity measure w_{ij} between x_i and x_j, can be written as Σ_{ij} w_{ij}(f(x_i) − f(x_j))² = Σ_{ij} w_{ij}(L_{ij}(f))², where L_{ij}(f) = ⟨f, k(x_i, ·) − k(x_j, ·)⟩ is linear and bounded. Then ⟨z_{ij}, z_{pq}⟩ = k(x_i, x_p) + k(x_j, x_q) − k(x_j, x_p) − k(x_i, x_q). Another generic approach is to use the assumption in Example 3 that ⟨z_s, f⟩ = L_s(f) = T_s(f)(x_s) for all s ∈ {i, j}. Then

    ⟨z_i, z_j⟩ = L_i(z_j) = T_i(z_j)(x_i) = ⟨T_i(z_j), k(x_i, ·)⟩ = ⟨z_j, T_i*(k(x_i, ·))⟩ = T_j(T_i*(k(x_i, ·)))(x_j).    (5)

In practice, classifiers such as the support vector machine often use an additional constant term (bias) that is not penalized in the optimization. This is equivalent to searching for f over F + H, where F is a finite set of basis functions. A similar representer theorem can be established (see Appendix A).

4 Local Invariances as Bounded Linear Functionals
Interestingly, many useful local invariances can be modeled as bounded linear functionals. If a functional can be expressed in terms of function values f(x), then it must be bounded as shown in Example 1. In general, boundedness hinges on both the functional and the RKHS H. When H is finite dimensional, such as that induced by linear or polynomial kernels, all linear functionals on H must be bounded:
Theorem 3 ([29, Thm 2.7-8]). Linear functionals on a finite dimensional normed space are bounded.
However, in most nonparametric statistics problems that are of interest, the RKHS is infinite dimensional. So the boundedness requires a more refined analysis depending on the specific functional.

4.1 Differentiation Functional
In semi-supervised learning, a common prior is that the discriminant function f does not change rapidly around sampled points. Therefore, we expect the norm of the gradient at these locations to be small. 
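To make Theorem 2 concrete: with Gaussian kernels, squared losses, and gradient (derivative) invariances, the finite-dimensional problem (4) can be solved by a single linear system. The sketch below is our own illustration, not the paper's implementation; the inner products use the Gaussian-kernel closed forms derived in § 4.1, and all function and variable names (`fit`, `ip_zz`, etc.) are ours.

```python
import numpy as np

def ip_kk(xi, xj, s):
    """<k(x_i,.), k(x_j,.)> = kappa_s(x_i, x_j) for the Gaussian kernel."""
    return np.exp(-np.sum((xi - xj) ** 2) / (2 * s ** 2))

def ip_kz(x, xj, d, s):
    """<k(x,.), z_{j,d}> = z_{j,d}(x) = (x^d - x_j^d) / s^2 * kappa_s(x_j, x)."""
    return (x[d] - xj[d]) / s ** 2 * ip_kk(x, xj, s)

def ip_zz(xi, d, xj, e, s):
    """<z_{i,d}, z_{j,e}>: the Gaussian-kernel closed form of Section 4.1."""
    return ip_kk(xi, xj, s) / s ** 4 * (
        s ** 2 * (d == e) - (xi[d] - xj[d]) * (xi[e] - xj[e]))

def fit(Xlab, y, Xall, s=0.5, lam=10.0, nu=1.0):
    """Minimize lam*sum_i (f(x_i)-y_i)^2 + nu*sum_{i,d} (L_{i,d} f)^2 + 0.5 a'Ka."""
    l, n = Xlab.shape
    basis = [("k", i, 0) for i in range(l)] + \
            [("z", i, d) for i in range(len(Xall)) for d in range(n)]
    m = len(basis)
    K = np.zeros((m, m))
    for a, (ta, ia, da) in enumerate(basis):
        for b, (tb, ib, db) in enumerate(basis):
            if ta == "k" and tb == "k":
                K[a, b] = ip_kk(Xlab[ia], Xlab[ib], s)
            elif ta == "k":
                K[a, b] = ip_kz(Xlab[ia], Xall[ib], db, s)
            elif tb == "k":
                K[a, b] = ip_kz(Xlab[ib], Xall[ia], da, s)
            else:
                K[a, b] = ip_zz(Xall[ia], da, Xall[ib], db, s)
    # With squared losses, <khat_i, f> = (K alpha)_i, and stationarity of the
    # convex objective (4) is implied by the linear system (D K + I) alpha = D ytil,
    # where D holds the loss weights and ytil pads y with zeros.
    D = np.diag([2 * lam] * l + [2 * nu] * (m - l))
    ytil = np.concatenate([y, np.zeros(m - l)])
    alpha = np.linalg.solve(D @ K + np.eye(m), D @ ytil)
    return alpha, K, ytil

def objective(alpha, K, ytil, l, lam=10.0, nu=1.0):
    v = K @ alpha
    return (lam * np.sum((v[:l] - ytil[:l]) ** 2)
            + nu * np.sum(v[l:] ** 2) + 0.5 * alpha @ K @ alpha)

# Toy usage: two labeled points, gradient penalties at the same points.
Xlab = np.array([[0.0, 0.0], [1.0, 1.0]])
y = np.array([1.0, -1.0])
alpha, K, ytil = fit(Xlab, y, Xlab)
```

Since the objective is convex in α, the solved α is a global minimizer, and in particular its objective value is no worse than that of α = 0.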
Suppose X ⊆ R^n is an open set, and k is continuously differentiable on X². Then we are interested in the linear functionals L_{x_i,d}(f) := ∂f(x)/∂x^d |_{x=x_i}, where x^d stands for the d-th component of the vector x. Then L_{x_i,d} must be bounded:
Theorem 4. L_{x_i,d} is bounded on H with respect to the RKHS norm.
Proof. This result is immediate from the inequality given by [28, Corollary 4.36]:

    | ∂f(x)/∂x^d |_{x=x_i} |  ≤  ‖f‖ · ( ∂²k(x, y)/∂x^d ∂y^d |_{x=y=x_i} )^{1/2}.    □

Let us denote the representer of L_{x_i,d} as z_{i,d}. Indeed, [28, Corollary 4.36] established the same result for higher order partial derivatives, which can be easily used in our framework as well.
The inner product between representers can be computed by definition:

    z_{i,d}(y) = ⟨z_{i,d}, k(y, ·)⟩ = ∂k(x, y)/∂x^d |_{x=x_i},    ⟨z_{i,d}, z_{j,d′}⟩ = ∂²k(x, y)/∂x^d ∂y^{d′} |_{x=x_i, y=x_j}.    (6)

If k is considered as a function on (x, y) ∈ R^{2n}, this implies that the inner product ⟨z_{i,d}, z_{j,d′}⟩ is the (x^d, y^{d′})-th element of the Hessian of k evaluated at (x_i, x_j), which could be interpreted as some sort of "covariance" between the two invariances with respect to the touchstone function k.
Applying (6) to the polynomial kernel k(x, y) = (⟨x, y⟩ + 1)^r, we derive

    ⟨z_{i,d}, z_{j,d′}⟩ = r(⟨x_i, x_j⟩ + 1)^{r−2} [ (r − 1) x_i^{d′} x_j^{d} + (⟨x_i, x_j⟩ + 1) δ_{d=d′} ],    (7)

where δ_{d=d′} = 1 if d = d′, and 0 otherwise. 
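The closed form (7) can be sanity-checked against a finite-difference approximation of the mixed second derivative in (6). A small sketch (our own, not from the paper; all names are ours):

```python
import numpy as np

r = 3                                            # polynomial degree

def k_poly(x, y):
    """Polynomial kernel of degree r: (<x, y> + 1)^r."""
    return (x @ y + 1.0) ** r

def ip_zz_poly(xi, d, xj, e):
    """Closed form (7): <z_{i,d}, z_{j,d'}> for the polynomial kernel."""
    s = xi @ xj + 1.0
    return r * s ** (r - 2) * ((r - 1) * xi[e] * xj[d] + s * (d == e))

rng = np.random.default_rng(1)
xi, xj = rng.normal(size=3), rng.normal(size=3)
h = 1e-4
for d in range(3):
    for e in range(3):
        ed, ee = np.eye(3)[d] * h, np.eye(3)[e] * h
        # central difference approximating d^2 k / (dx^d dy^e) at (x_i, x_j)
        num = (k_poly(xi + ed, xj + ee) - k_poly(xi + ed, xj - ee)
               - k_poly(xi - ed, xj + ee) + k_poly(xi - ed, xj - ee)) / (4 * h * h)
        assert abs(num - ip_zz_poly(xi, d, xj, e)) < 1e-3
```

The loop passes for every pair of coordinates; the same check can be repeated for the Gaussian-kernel expression below.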
For Gaussian kernels k(x, y) = κ_σ(x, y), we can take another path that is different from (6). Note that L_{x_i,d}(f) = T(f)(x_i) where T : f ↦ ∂f/∂x^d, and it is straightforward to verify that T is bounded with T* = −T for the Gaussian RKHS. So applying (5),

    ⟨z_{i,d}, z_{j,d′}⟩ = ∂/∂y^{d′} ( −∂k(x_i, y)/∂y^d ) |_{y=x_j} = (k(x_i, x_j)/σ⁴) [ σ² δ_{d=d′} − (x_i^d − x_j^d)(x_i^{d′} − x_j^{d′}) ].    (8)

By Theorem 1, it immediately follows that the norm of L_{x_i,d} is √⟨z_{i,d}, z_{i,d}⟩ = 1/σ.

4.2 Transformation Invariance
Invariance to known local transformations of the input has been used successfully in supervised learning [3]. Here we show that transformation invariance can be handled in our framework via representers in RKHS. In particular, gradients with respect to the transformations are bounded linear functionals.
Following [3], we first require a differentiable function g that maps points from a space S to R, where S lies in a Euclidean space.³ For example, an image can be considered as a function g that maps points in the plane S = R² to the intensity of the image at that point. Next, we consider a family of bijective transformations t_α : S ↦ S, which is differentiable in both the input and the parameter α. For instance, translation, rotation, and scaling can be represented as mappings t^(1)_α, t^(2)_α, and t^(3)_α respectively:

    (x, y) ↦ t^(1)_α: (x + α_x, y + α_y),    (x, y) ↦ t^(2)_α: (x cos α − y sin α, x sin α + y cos α),    (x, y) ↦ t^(3)_α: (x + αx, y + αy).

Based on t_α, we define a family of operators T_α : R^S → R^S as T_α(g) = g ∘ t_α^{−1}. The function T_α(g)(x, y) gives us the intensity at location (x, y) of the image translated by an offset (α_x, α_y), rotated by an angle α, or scaled by an amount α. Finally we sample from S a fixed number of locations S := {s_1, . . . , s_q}, and present to the learning algorithm a vector I(g, α; S) := (T_α(g)(s_1), . . . , T_α(g)(s_q))′. Digital images are discretizations of real images where we sample at fixed pixel locations of the function g to obtain a fixed sized vector. Clearly, for a fixed g, the sampled observation I(g, α; S) is a vector valued function of α.
The following result allows our framework to use derivatives with respect to the parameters in α.
Theorem 5. Let F be a normed vector space of functions that map from the range of I(g, α; S) to R. Suppose the linear functional that maps f ∈ F to ∂f(u)/∂u^j |_{u=u_0} is bounded for any u_0 and coordinate j, and its norm is denoted as C_j. Then the functional L_{g,d,S} : f ↦ ∂/∂α_d |_{α=0} f(I(g, α; S)), i.e. the derivative with respect to each of the components in α, must be a bounded linear functional on F.
Proof. Let I^j(g, α; S) be the j-th component of I(g, α; S). 
Using the chain rule, for any f ∈ F,

    |L_{g,d,S}(f)| = | ∂f(I(g, α; S))/∂α_d |_{α=0} | = | Σ_{j=1}^{q} ∂f(u)/∂u^j |_{u=I(g,0;S)} · ∂I^j(g, α; S)/∂α_d |_{α=0} |
                   ≤ Σ_{j=1}^{q} (C_j ‖f‖) · | ∂I^j(g, α; S)/∂α_d |_{α=0} | = ‖f‖ · Σ_{j=1}^{q} C_j | ∂I^j(g, α; S)/∂α_d |_{α=0} |.

The proof is completed by noting that the last summation is a finite constant independent of f. □
Corollary 1. The derivatives ∂f(I(g, α; S))/∂α_d with respect to each of the components in α are bounded linear functionals on the RKHS defined by the polynomial and Gaussian kernels.
To compute the inner product between representers, let z_{g,d,S} be the representer of L_{g,d,S} and denote v_{g,d,S} = ∂I(g, α; S)/∂α_d |_{α=0}. Then

    ⟨k(x, ·), z_{g,d,S}⟩ = ⟨ ∂k(x, y)/∂y |_{y=I(g,0;S)}, v_{g,d,S} ⟩,  and
    ⟨z_{g,d,S}, z_{g′,d′,S′}⟩ = ⟨ v_{g,d,S}, ∂/∂x |_{x=I(g,0;S)} ( ⟨ ∂k(x, y)/∂y |_{y=I(g′,0;S′)}, v_{g′,d′,S′} ⟩ ) ⟩.    (9)

³ In practice, S can be a discrete domain such as the pixel coordinates of an image. Then g can be extended to an interpolated continuous space via convolution with a Gaussian. See [3, § 2.3] for more details.

4.3 Local Averaging
Using gradients to enforce the local invariance that the target function does not change much around data instances increases the number of basis functions by a factor of n, where n is the number of gradient directions that we use. The optimization problem can become computationally expensive if n is large. When we do not have useful information about the invariant directions, it may be useful to have methods that do not increase the number of basis functions by much. Consider functionals

    L_{x_i}(f) = ∫_X f(τ) p(x_i − τ) dτ − f(x_i),    (10)

where p(·) is a probability density function centered at zero. Minimizing a loss with such linear functionals will favor functions whose local averages given by the integral are close to the function values at data instances. 
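The functional (10) is the average of f under a p-distributed perturbation around x_i, minus f(x_i); it is negative where f is locally concave (e.g. near a peak) and near zero where f is locally linear. A Monte-Carlo sketch of this reading (our own illustration, not from the paper; names and the example f are ours):

```python
import numpy as np

rng = np.random.default_rng(2)

def f(x):
    """An example smooth function; any f in the RKHS would do."""
    return np.sin(3 * x) + 0.5 * x

def L_local_avg(f, xi, theta=0.1, n_samples=200_000):
    """Monte-Carlo estimate of L_{x_i}(f) = int f(tau) p(x_i - tau) dtau - f(x_i)
    with p = N(0, theta^2), i.e. E_{eps ~ p}[f(x_i - eps)] - f(x_i)."""
    eps = rng.normal(scale=theta, size=n_samples)
    return f(xi - eps).mean() - f(xi)

# At x_i = pi/6, f has a local peak of its oscillating part, so the local
# average lies below f(x_i). Analytically for this f and Gaussian p:
# L = sin(3 x_i) (exp(-9 theta^2 / 2) - 1) ~= -0.044.
val = L_local_avg(f, np.pi / 6)
```

A squared loss on such functionals therefore pulls f towards its own local average around each data instance, which is the smoothing behavior described above.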
If p(·) is selected to be a low pass filter, the function should be smoother and less likely to change in regions with more data points, but is less constrained to be smooth in regions where the data points are sparse. Hence, such loss functions may be appropriate when we believe that data instances from the same class are clustered together.
To use the framework we have developed, we need to select the probability density p(·) and the kernel k such that L_{x_i}(f) is a bounded linear functional.
Theorem 6. Assume there is a constant C > 0 such that k(x, x) ≤ C² for all x ∈ X. Then the linear functional L_{x_i} in (10) is bounded in the RKHS defined by k for any probability density p(·).
See proof in Appendix B. As a result, radial kernels such as Gaussian and exponential kernels make L_{x_i} bounded. k(x, x) is not bounded for the polynomial kernel, but that case is covered by Theorem 3.
To allow for efficient implementation, the inner product between representers of invariances must be computed efficiently. Unlike differentiation, integration often does not result in a closed form expression. Fortunately, analytic evaluation is feasible for the Gaussian kernel κ_σ(x_i, x_j) together with the Gaussian density function p(x) = (θ√(2π))^{−n} κ_θ(x, 0), because the convolution of two Gaussian densities is still a Gaussian density (see derivation in Appendix B):

    ⟨z_{x_i}, z_{x_j}⟩ = κ_σ(x_i, x_j) + (1 + 2θ/σ)^{−n} κ_{σ+2θ}(x_i, x_j) − 2(1 + θ/σ)^{−n} κ_{σ+θ}(x_i, x_j).

5 Optimization
The objective (2) can be optimized by many algorithms, such as stochastic gradient [30], bundle methods [19, 31], and (randomized) coordinate descent in its dual (4) [32]. Since all computational strategies rely on kernel evaluation, we prefer the dual approach. 
In particular, (4) allows an unconstrained optimization, and the nonsmooth regions in ℓ1 or ℓ2 can be easily approximated by smooth surrogates [33]. So without loss of generality, we assume ℓ1 and ℓ2 are smooth. Our approach can work both in batch and coordinate-wise fashion, depending on the scale of the problem.
In the batch setting, the major challenge is the cost of computing the gradient when the number of invariances is large. For example, consider all derivatives at all labeled examples x_1, . . . , x_l and let N = (⟨z_{i,d}, z_{j,d′}⟩)_{(i,d),(j,d′)} ∈ R^{nl×nl} as in (8). Then given α = (α_1′, . . . , α_l′)′ ∈ R^{nl}, the bottleneck of computation is g := Nα, which costs O(l²n²) time and O(l²n²) space to store N in a vanilla implementation. However, since the kernel matrices often employ rich structure, a careful treatment can reduce the cost by an order of magnitude, e.g. to O(l²n) time and O(nl + l²) space in this case. Specifically, denote K = (k(x_i, x_j))_{ij} ∈ R^{l×l}, and three n×l matrices X = (x_1, . . . , x_l), A = (α_1, . . . , α_l), and G = (g_{d,i}). Then one can show

    G = σ^{−4} [ σ² A K + X Q − X ∘ (1_n ⊗ 1_l′ Q) ],  where Q_{ij} = K_{ij}(Λ_{ji} − Λ_{ii}),  Λ = X′A.    (11)

Here ∘ stands for the Hadamard product, and ⊗ is the Kronecker product. 1_n ∈ R^n is the vector of all ones. The computational cost is dominated by X′A, AK, and XQ, which are all O(l²n).
When the number of invariances is huge, a batch solver can be slow and coordinate-wise updates can be more efficient. In each iteration, this approach picks a coordinate in α and optimizes the objective over all the coordinates picked so far, leaving the remaining elements at zero. [32] selected the coordinate randomly, while another strategy is to choose the steepest descent coordinate. Clearly, the latter is useful only when this selection can be performed efficiently, which depends heavily on the structure of the problem.

6 Experimental Results
We compared our approach, which is henceforth referred to as InvSVM, with state-of-the-art methods in (semi-)supervised learning using invariance to differentiation and transformation.

Figure 1: Decision boundary for the two moon dataset with ν = 0, 0.01, 0.1, and 1 (left to right).

6.1 Invariance to Differentiation: Transduction on Two Moon Dataset
As a proof of concept, we experimented with the "two moon" dataset shown in Figure 1, with only l = 2 labeled data instances (red circle and black square). We used the Gaussian kernel with σ = 0.25 and the gradient invariances on both labeled and unlabeled data. The losses ℓ1 and ℓ2 are logistic and squared loss respectively, with λ = 1 and ν ∈ {0, 0.01, 0.1, 1}. (4) was minimized by an L-BFGS solver [34]. In Figure 1, from left to right our method lays more and more emphasis on placing the separation boundary in low density regions, which allows unlabeled data to improve the classification accuracy. We also tried hinge loss for ℓ1 and ε-insensitive loss for ℓ2 with similar classification results.

6.2 Invariance to Differentiation: Semi-supervised Learning on Real-world Data
Datasets. We used 9 datasets for binary classification from [4] and the UCI repository [35]. The number of features (n) and instances (t) are given in Table 1. All feature vectors were normalized to zero mean and unit length.
Algorithms. We trained InvSVM with hinge loss for ℓ1 and squared loss for ℓ2. The differentiation invariance was used over the whole dataset, i.e. m = nt in (4), and the gradient computation was accelerated by (11). 
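The identity (11) can be verified numerically by comparing the structured O(l²n) evaluation against a naive O(l²n²) computation that materializes N entrywise from (8). A small sketch (our own check, not the paper's code; names are ours):

```python
import numpy as np

rng = np.random.default_rng(3)
l, n, sigma = 6, 4, 0.7
X = rng.normal(size=(n, l))                      # columns are the inputs x_i
A = rng.normal(size=(n, l))                      # columns are the alpha_i
K = np.exp(-((X[:, :, None] - X[:, None, :]) ** 2).sum(0) / (2 * sigma ** 2))

# Naive O(l^2 n^2): materialize N with entries <z_{i,d}, z_{j,e}> from (8)
N = np.zeros((l * n, l * n))
for i in range(l):
    for j in range(l):
        for d in range(n):
            for e in range(n):
                N[i * n + d, j * n + e] = K[i, j] / sigma ** 4 * (
                    sigma ** 2 * (d == e) - (X[d, i] - X[d, j]) * (X[e, i] - X[e, j]))
alpha = A.T.reshape(-1)                          # alpha = (alpha_1', ..., alpha_l')'
G_naive = (N @ alpha).reshape(l, n).T            # back to the n x l layout of G

# Structured O(l^2 n) evaluation of (11)
Lam = X.T @ A
Q = K * (Lam.T - np.diag(Lam)[:, None])          # Q_ij = K_ij (Lam_ji - Lam_ii)
G = (sigma ** 2 * A @ K + X @ Q - X * Q.sum(axis=0)[None, :]) / sigma ** 4

assert np.allclose(G, G_naive)
```

The two computations agree to machine precision, while the structured path never forms the nl × nl matrix N.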
We compared InvSVM with the standard SVM, and with a state-of-the-art semi-supervised learning algorithm, LapSVM [36], which uses manifold regularization based on the graph Laplacian [11, 37]. All three algorithms used the Gaussian kernel, with the bandwidth σ set to the median of the pairwise distances among all instances.
Settings. We trained SVM, LapSVM, and InvSVM on a subset of l ∈ {30, 60, 90} labeled examples, and compared their test error on the other t − l examples in each dataset. We used 5-fold stratified cross validation (CV) to select the values of λ and ν for InvSVM. CV was also applied to LapSVM to choose the number of nearest neighbors for graph construction, the weight on the Laplacian regularizer, and the weight on the standard RKHS norm regularizer. For each fold of CV, InvSVM and LapSVM used the other 4 folds (4l/5 points) as labeled data, and the remaining t − 4l/5 points as unlabeled data. The error on the held-out l/5 points was then used for CV. Finally, the random selection of l labeled examples was repeated 10 times, and we report the mean test error and standard deviation.
Results. It is clear from Table 1 that in most cases InvSVM achieves lower or similar test error compared to SVM and LapSVM. Both LapSVM and InvSVM are implementations of the low-density prior: LapSVM enforces smoothness of the discriminant function over neighboring instances, while InvSVM directly penalizes the gradient at the instances and does not require a notion of neighborhood. The lower error of InvSVM suggests the superiority of using the gradient directly, which is enabled by our representer-based approach. Besides, SVM often performs quite well when the number of labeled examples is large, with error similar to LapSVM; but even then, InvSVM can attain lower error.
6.3 Transformation Invariance
Next we study the use of transformation invariance for supervised learning.
We used the handwritten digits from the MNIST dataset [38] and compared InvSVM with the virtual sample SVM (VirSVM), which constructs additional instances by applying the following transformations to the training data: 2-pixel shifts in 4 directions, rotation by ±10 degrees, scaling by ±0.1 unit, and shearing along the vertical or horizontal axis by ±0.1 unit. For InvSVM, the derivative ∂/∂α|_{α=0} I(g, α; S) was approximated by the difference resulting from the above transformations. We considered binary classification problems by choosing four pairs of digits (4-vs-9, 2-vs-3, and 6-vs-5 are hard, while 7-vs-1 is easier). As the real distribution of digits is imbalanced and invariance is more useful when the number of labeled data is low, we randomly chose n_+ = 50 labeled images for one class and n_- = 10 images for the other. Accordingly, the supervised loss was normalized within each class: n_+^{-1} Σ_{i: y_i = 1} ℓ1(f(x_i), 1) + n_-^{-1} Σ_{i: y_i = −1} ℓ1(f(x_i), −1). The logistic loss and ε-insensitive loss were used for ℓ1 and ℓ2 respectively. All parameters were set by 5-fold CV, and the test error was measured on the remaining images in the dataset. The whole process was repeated 20 times.

Table 1: Test error of SVM, LapSVM, and InvSVM for semi-supervised learning. The best result (including ties) that is statistically significant in each setting is highlighted in bold. No number is highlighted if there is no significant difference between the three methods.

Method  | heart (n=13, t=270)            | BCI (n=117, t=400)             | bupa (n=6, t=245)
        | l=30      l=60      l=90       | l=30      l=60      l=90       | l=30      l=60      l=90
SVM     | 23.5±2.08 21.4±0.23 20.0±1.11  | 44.6±1.89 39.0±3.41 34.6±3.25  | 36.2±5.53 37.9±5.80 36.9±1.80
LapSVM  | 23.2±1.68 22.7±1.92 20.7±2.10  | 44.8±2.72 45.4±2.25 36.1±3.92  | 36.6±5.98 40.7±4.05 37.1±1.58
InvSVM  | 22.3±1.27 20.2±1.01 19.6±2.79  | 45.3±3.59 38.4±4.41 32.7±2.27  | 38.2±4.46 35.4±1.85 35.2±0.45

Method  | Australian (n=14, t=690)       | g241n (n=241, t=1500)          | g241c (n=241, t=1500)
        | l=30      l=60      l=90       | l=30      l=60      l=90       | l=30      l=60      l=90
SVM     | 32.9±1.59 27.1±0.92 24.8±1.99  | 34.6±2.52 28.3±2.65 25.6±1.48  | 20.6±4.18 21.7±11.3 15.3±0.86
LapSVM  | 37.1±1.25 28.4±2.44 29.7±1.57  | 37.7±5.76 29.9±2.41 25.9±1.49  | 21.9±9.27 15.6±2.49 14.7±0.16
InvSVM  | 33.1±0.49 26.4±1.14 23.4±1.36  | 35.4±1.57 28.4±3.03 22.3±4.69  | 17.7±1.74 16.4±1.11 15.6±0.77

Method  | ionosphere (n=34, t=351)       | USPS (n=241, t=1500)           | sonar (n=60, t=208)
        | l=30      l=60      l=90       | l=30      l=60      l=90       | l=30      l=60      l=90
SVM     | 12.5±3.12 8.71±1.62 7.17±1.67  | 30.9±2.53 22.7±0.39 20.6±4.81  | 14.9±0.26 12.6±2.04 11.3±2.06
LapSVM  | 14.9±1.73 9.05±1.05 7.66±1.53  | 29.4±2.33 22.9±0.17 24.9±1.29  | 15.3±1.10 12.3±1.74 11.1±1.81
InvSVM  | 7.58±1.29 7.90±0.23 7.02±0.88  | 31.6±4.68 24.1±2.06 21.8±3.91  | 15.4±4.02 12.2±2.93 11.3±1.63

In Figure 2, InvSVM generally yields lower error than VirSVM, which suggests that, compared with drawing virtual samples from the invariance, it is more effective to directly enforce a flat gradient as in [3].

Figure 2: Test error (in percentage) of InvSVM versus VirSVM (SVM with virtual samples). Panels: (a) 4 (pos) vs 9 (neg); (b) 2 (pos) vs 3 (neg); (c) 6 (pos) vs 5 (neg); (d) 7 (pos) vs 1 (neg).

7 Conclusion and Discussion
We have shown how to model local invariances using representers in RKHS. This subsumes a wide range of invariances that are useful in practice, and the formulation can be optimized efficiently. For future work, it will be interesting to extend the framework to a broader range of learning tasks. A potential application is the problem of imbalanced data learning, where one wishes to keep the decision boundary further away from instances of the minority class and closer to instances of the majority class. It will also be interesting to generalize the framework to invariances that are not directly linear. For example, the total variation ℓ(f) := ∫_a^b |f'(x)| dx (a, b ∈ R) is not linear, but we can treat the integral of the absolute value as a loss function and take the derivative as the linear functional: L_x(f) = f'(x) and ℓ(f) = ∫_a^b |L_x(f)| dx. As a result, the optimization may have to resort to sparse approximation methods.
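To illustrate this idea of treating an integral of absolute values as a loss over linear functionals, consider a hypothetical one-dimensional Gaussian-kernel expansion (centers, coefficients, and grid size are illustrative assumptions) where the total variation is approximated by sampling L_x(f) = f'(x) on a grid:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, a, b = 0.5, -2.0, 2.0
centers = rng.uniform(a, b, size=8)      # hypothetical expansion centers c_i
beta = rng.standard_normal(8)            # expansion coefficients

# f(x) = sum_i beta_i k(c_i, x) with Gaussian kernel k(c, x) = exp(-(x-c)^2 / (2 sigma^2))
def f(x):
    return np.sum(beta * np.exp(-(x - centers) ** 2 / (2 * sigma ** 2)))

def fprime(x):                           # L_x(f) = f'(x), available in closed form
    kx = np.exp(-(x - centers) ** 2 / (2 * sigma ** 2))
    return np.sum(beta * kx * (centers - x) / sigma ** 2)

# Riemann-sum surrogate for the total variation l(f) = int_a^b |L_x(f)| dx
xs = np.linspace(a, b, 8001)
tv = np.sum(np.abs([fprime(x) for x in xs[:-1]])) * (xs[1] - xs[0])
```

Each grid point contributes one nonsmooth term |L_x(f)|, which is what makes sparse approximation methods attractive when the grid is fine.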
Finally, the norm of the linear functional does affect the optimization efficiency, and a detailed analysis will be useful for choosing the linear functionals.

References
[1] G. E. Hinton. Learning translation invariant recognition in massively parallel networks. In Proceedings of the Conference on Parallel Architectures and Languages Europe, pages 1-13. Springer, 1987.
[2] M. Ferraro and T. M. Caelli. Lie transformation groups, integral transforms, and invariant pattern recognition. Spatial Vision, 8:33-44, 1994.
[3] P. Simard, Y. LeCun, J. S. Denker, and B. Victorri. Transformation invariance in pattern recognition: tangent distance and tangent propagation. In Neural Networks: Tricks of the Trade, pages 239-274, 1996.
[4] O. Chapelle, B. Schölkopf, and A. Zien, editors. Semi-Supervised Learning. MIT Press, 2006.
[5] S. Ben-David, T. Lu, and D. Pal. Does unlabeled data provably help? Worst-case analysis of the sample complexity of semi-supervised learning. In COLT, 2008.
[6] T. Zhang and F. J. Oles. A probability analysis on the value of unlabeled data for classification problems. In ICML, 2000.
[7] O. Bousquet, O. Chapelle, and M. Hein. Measure based regularization. In NIPS, 2003.
[8] T. Joachims. Transductive inference for text classification using support vector machines. In ICML, 1999.
[9] A. Blum and S. Chawla. Learning from labeled and unlabeled data using graph mincuts. In ICML, 2001.
[10] M. Belkin, P. Niyogi, and V. Sindhwani. On manifold regularization. In AISTATS, 2005.
[11] X. Zhu, Z. Ghahramani, and J. D. Lafferty.
Semi-supervised learning using Gaussian fields and harmonic functions. In ICML, 2003.
[12] D. Zhou and B. Schölkopf. Discrete regularization. In Semi-Supervised Learning, pages 221-232. MIT Press, 2006.
[13] H. Xu, C. Caramanis, and S. Mannor. Robustness and regularization of support vector machines. Journal of Machine Learning Research, 10:3589-3646, 2009.
[14] A. Ben-Tal, L. El Ghaoui, and A. Nemirovski. Robust Optimization. Princeton University Press, 2008.
[15] C. Bhattacharyya, K. S. Pannagadatta, and A. J. Smola. A second order cone programming formulation for classifying missing data. In NIPS, 2005.
[16] A. Globerson and S. Roweis. Nightmare at test time: Robust learning by feature deletion. In ICML, 2006.
[17] N. Dalvi, P. Domingos, Mausam, S. Sanghai, and D. Verma. Adversarial classification. In KDD, 2004.
[18] T. Graepel and R. Herbrich. Invariant pattern recognition by semidefinite programming machines. In NIPS, 2004.
[19] C. H. Teo, A. Globerson, S. Roweis, and A. Smola. Convex learning with invariances. In NIPS, 2007.
[20] D. DeCoste and B. Schölkopf. Training invariant support vector machines. Machine Learning, 46:161-190, 2002.
[21] O. Chapelle and B. Schölkopf. Incorporating invariances in nonlinear support vector machines. In NIPS, 2001.
[22] G. Wahba. An introduction to model building with reproducing kernel Hilbert spaces. Technical Report TR 1020, University of Wisconsin-Madison, 2000.
[23] A. J. Smola and B. Schölkopf. On a kernel-based method for pattern recognition, regression, approximation and operator inversion. Algorithmica, 22:211-231, 1998.
[24] C. Walder and O. Chapelle. Learning with transformation invariant kernels. In NIPS, 2007.
[25] C. Burges. Geometry and invariance in kernel based methods. In B. Schölkopf, C. Burges, and A.
Smola, editors, Advances in Kernel Methods: Support Vector Learning, pages 89-116. MIT Press, 1999.
[26] B. Schölkopf and A. Smola. Learning with Kernels. MIT Press, 2001.
[27] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press, Cambridge, UK, 2000.
[28] I. Steinwart and A. Christmann. Support Vector Machines. Information Science and Statistics. Springer, 2008.
[29] E. Kreyszig. Introductory Functional Analysis with Applications. Wiley, 1989.
[30] S. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos: Primal estimated sub-gradient solver for SVM. In ICML, 2007.
[31] C. H. Teo, S. V. N. Vishwanathan, A. J. Smola, and Q. V. Le. Bundle methods for regularized risk minimization. Journal of Machine Learning Research, 11:311-365, January 2010.
[32] Y. Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341-362, 2012.
[33] O. Chapelle. Training a support vector machine in the primal. Neural Computation, 19(5):1155-1178, 2007.
[34] http://www.cs.ubc.ca/~pcarbo/lbfgsb-for-matlab.html
[35] K. Bache and M. Lichman. UCI machine learning repository, 2013. University of California, Irvine.
[36] http://www.dii.unisi.it/~melacci/lapsvmp
[37] V. Sindhwani, P. Niyogi, and M. Belkin. Beyond the point cloud: from transductive to semi-supervised learning. In ICML, 2005.
[38] http://www.cs.nyu.edu/~roweis/data.html