{"title": "Learning Anchor Planes for Classification", "book": "Advances in Neural Information Processing Systems", "page_first": 1611, "page_last": 1619, "abstract": "Local Coordinate Coding (LCC) [18] is a method for modeling functions of data lying on non-linear manifolds. It provides a set of anchor points which form a local coordinate system, such that each data point on the manifold can be approximated by a linear combination of its anchor points, and the linear weights become the local coordinate coding. In this paper we propose encoding data using orthogonal anchor planes, rather than anchor points. Our method needs only a few orthogonal anchor planes for coding, and it can linearize any (\\alpha,\\beta,p)-Lipschitz smooth nonlinear function with a fixed expected value of the upper-bound approximation error on any high dimensional data. In practice, the orthogonal coordinate system can be easily learned by minimizing this upper bound using singular value decomposition (SVD). We apply our method to model the coordinates locally in linear SVMs for classification tasks, and our experiment on MNIST shows that using only 50 anchor planes our method achieves 1.72% error rate, while LCC achieves 1.90% error rate using 4096 anchor points.", "full_text": "Learning Anchor Planes for Classi\ufb01cation\n\nZiming Zhang\u2020 L\u2019ubor Ladick\u00fd\u2021\n\nPhilip H.S. Torr\u2020 Amir Saffari\u2020\u00a7\n\n\u2020 Department of Computing, Oxford Brookes University, Wheatley, Oxford, OX33 1HX, U.K.\n\u2021 Department of Engineering Science, University of Oxford, Parks Road, Oxford, OX1 3PJ, U.K.\n\n\u00a7 Sony Computer Entertainment Europe, London, UK\n\n{ziming.zhang, philiptorr}@brookes.ac.uk\n\nlubor@robots.ox.ac.uk\n\namir@ymer.org\n\nAbstract\n\nLocal Coordinate Coding (LCC) [18] is a method for modeling functions of data\nlying on non-linear manifolds. 
It provides a set of anchor points which form a local\ncoordinate system, such that each data point on the manifold can be approximated\nby a linear combination of its anchor points, and the linear weights become the\nlocal coordinate coding. In this paper we propose encoding data using orthogonal\nanchor planes, rather than anchor points. Our method needs only a few orthogonal\nanchor planes for coding, and it can linearize any (\u03b1, \u03b2, p)-Lipschitz smooth non-linear\nfunction with a fixed expected value of the upper-bound approximation error\non any high dimensional data. In practice, the orthogonal coordinate system can be\neasily learned by minimizing this upper bound using singular value decomposition\n(SVD). We apply our method to model the coordinates locally in linear SVMs\nfor classification tasks, and our experiment on MNIST shows that using only 50\nanchor planes our method achieves a 1.72% error rate, while LCC achieves a 1.90%\nerror rate using 4096 anchor points.\n\n1 Introduction\n\nLocal Coordinate Coding (LCC) [18] is a coding scheme that encodes the data locally so that any\nnon-linear (\u03b1, \u03b2, p)-Lipschitz smooth function (see Definition 1 in Section 2 for details) over the data\nmanifold can be approximated using linear functions. There are two components in this method: (1)\na set of anchor points which decide the local coordinates, and (2) the coding for each data point based\non the local coordinates given the anchor points. Theoretically, [18] suggests that under certain\nassumptions, locality is more essential than sparsity for non-linear function approximation. LCC has\nbeen successfully applied to many applications such as object recognition (e.g. 
locality-constrained\nlinear coding (LLC) [16]) in the VOC 2009 challenge [7].\nOne big issue in LCC is that its classification performance is highly dependent on the number of\nanchor points, as observed in Yu and Zhang [19], because these points should be \u201clocal enough\u201d\nto encode surrounding data on the data manifold accurately, which sometimes means that in real\napplications the number of anchor points explodes to a surprisingly huge number. This has been\ndemonstrated in [18], where LCC was tested on the MNIST dataset: as the number of anchor\npoints learned from sparse coding increased from 512 to 4096, the error rate decreased from 2.64% to 1.90%. This situation\ncould become a serious problem when the distribution of the data points is sparse in the feature\nspace, i.e. there are many \u201choles\u201d between data points (e.g. regions of feature space that are sparsely\npopulated by data). As a result of this, many redundant anchor points will be distributed in the holes\nwith little information. By using many anchor points, the computational complexity of the classifier\nat both training and test time increases significantly, defeating the original purpose of using LCC.\nSo far several approaches have been proposed for problems closely related to anchor point learning\nsuch as dictionary learning or codebook learning. For instance, Lee et al. [12] proposed learning\nthe anchor points for sparse coding using the Lagrange dual. Mairal et al. [13] proposed an online\ndictionary learning algorithm using stochastic approximations. Wang et al. [16] proposed locality-constrained\nlinear coding (LLC), which is a fast implementation of LCC, and an online incremental\ncodebook learning algorithm using the coordinate descent method, whose performance is very close to\nthat using K-Means. 
However, none of these algorithms can deal with holes of sparse data as they\nneed many anchor points.\nIn this paper, we propose a method to approximate any non-linear (\u03b1, \u03b2, p)-Lipschitz smooth function\nusing an orthogonal coordinate coding (OCC) scheme on a set of orthogonal basis vectors. Each\nbasis vector $v \\in \\mathbb{R}^d$ defines a family of anchor planes, each of which can be considered as consisting\nof an infinite number of anchor points, and the nearest point on each anchor plane to a data point\n$x \\in \\mathbb{R}^d$ is used for coding, as illustrated in Figure 1. The data point x will be encoded based on\nthe margin $x^T v$, where $(\\cdot)^T$ denotes the matrix transpose operator, between x and an anchor plane\ndefined by v. The benefits of using anchor planes are:\n\n\u2022 A few anchor planes can replace many anchor points while preserving similar locality of\nanchor points. This sparsity may lead to a better generalization since many anchor points\nwill overfit the data easily. Therefore, it can deal with the hole problem in LCC.\n\n\u2022 The learned orthogonal basis vectors can fit naturally into locally linear SVMs (such as\n[9,10,11,19,21]), which we describe below.\n\nTheoretically we show that using OCC any (\u03b1, \u03b2, p)-Lipschitz smooth non-linear function can be\nlinearized with a fixed upper-bound approximation error. In practice, by minimizing this upper\nbound, the orthogonal basis vectors can be learned using singular value decomposition (SVD). In\nour experiments, we integrate OCC into LL-SVM for classification.\nLinear support vector machines have become popular for solving classification tasks due to their\nfast and simple online application to large scale data sets. However, many problems are not linearly\nseparable. For these problems kernel-based SVMs are often used, but unlike their linear variant they\nsuffer from various drawbacks in terms of computational and memory efficiency. 
Their response\ncan be represented only as a function of the set of support vectors, which has been experimentally\nshown to grow linearly with the size of the training set. A recent trend is to build a classifier\nlocally from a set of linear SVMs [9,10,11,19,21]. For instance, in [20] SVMs are trained only\nbased on the N nearest neighbors of each data point, and in [9] multiple kernel learning was applied\nlocally. In [10] Kecman and Brooks proved that the stability bounds for local SVMs are tighter than\nthe ones for traditional, global SVMs. Ladicky and Torr [11] proposed a novel locally linear SVM\nclassifier (LL-SVM) with a smooth decision boundary and bounded curvature. They show how the\nfunctions defining the classifier can be approximated using local codings and show how this model\ncan be optimized in an online fashion by performing stochastic gradient descent with the same\nconvergence guarantees as the standard gradient descent method for linear SVMs. Mathematically, LL-SVM\nis formulated as follows:\n\n$$\\arg\\min_{W,b} \\; \\frac{\\lambda}{2}\\|W\\|^2 + \\frac{1}{|S|}\\sum_{k \\in S} \\xi_k \\quad \\text{s.t.} \\;\\; \\forall k \\in S: \\; \\xi_k \\ge 1 - y_k\\left[\\gamma_{x_k}^T W x_k + \\gamma_{x_k}^T b\\right], \\;\\; \\xi_k \\ge 0 \\qquad (1)$$\n\nwhere, for all k, $x_k \\in \\mathbb{R}^d$ is a training vector, $y_k \\in \\{-1, 1\\}$ is its label, $\\gamma_{x_k} \\in \\mathbb{R}^N$ is its local coding,\n$\\lambda \\ge 0$ is a pre-defined scalar, and $W \\in \\mathbb{R}^{N \\times d}$ and $b \\in \\mathbb{R}^N$ are the model parameters. As\ndemonstrated in our experiments, the choice of the local coding method is very important for\nLL-SVM, and an improper choice will hurt its performance.\nThe rest of the paper is organized as follows. 
In Section 2 we first recall some definitions and lemmas\nin LCC, then introduce OCC for non-linear function approximation and its property on the upper\nbound of localization error, and compare OCC with LCC in terms of geometric interpretation\nand optimization. In Section 3, we explain how to fit OCC into LL-SVM to model the coordinates\nfor classification. We show our experimental results and comparison in Section 4, and conclude the\npaper in Section 5.\n\n2 Anchor Plane Learning\n\nIn this section, we introduce our Orthogonal Coordinate Coding (OCC) based on some orthogonal\nbasis vectors. For clarification, we summarize some notations in Table 1 which are used in LCC and\nOCC.\n\nTable 1: Some notations used in LCC and OCC.\n\nNotation | Definition\n$v \\in \\mathbb{R}^d$ | A d-dimensional anchor point in LCC; a d-dimensional basis vector which defines a family of anchor planes in OCC.\n$C \\subset \\mathbb{R}^d$ | A subset in d-dimensional space containing all the anchor points (\u2200v, v \u2208 C) in LCC; a subset in d-dimensional space containing all the basis vectors in OCC.\n$C \\in \\mathbb{R}^{d \\times |C|}$ | The anchor point (or basis vector) matrix with v \u2208 C as columns.\n$\\gamma_v(x) \\in \\mathbb{R}$ | The local coding of a data point $x \\in \\mathbb{R}^d$ using the anchor point (or basis vector) v.\n$\\gamma(x) \\in \\mathbb{R}^d$ | The physical approximation vector of a data point x.\n$\\gamma_x \\in \\mathbb{R}^{|C|}$ | The coding vector of data point x containing all $\\gamma_v(x)$ in order, $\\gamma_x = [\\gamma_v(x)]_{v \\in C}$.\n$\\gamma$ | A map of $x \\in \\mathbb{R}^d$ to $\\gamma_x$.\n$(\\gamma, C)$ | A coordinate coding.\n\n2.1 Preliminary\n\nWe first recall some definitions and lemmas in LCC based on which we develop our method. Notice\nthat in the following sections, \u2016\u00b7\u2016 denotes the \u21132-norm unless stated otherwise.\nDefinition 1 (Lipschitz Smoothness [18]). 
A function f(x) on $\\mathbb{R}^d$ is (\u03b1, \u03b2, p)-Lipschitz smooth with\nrespect to a norm \u2016\u00b7\u2016 if $|f(x') - f(x)| \\le \\alpha \\|x - x'\\|$ and $|f(x') - f(x) - \\nabla f(x)^T (x' - x)| \\le \\beta \\|x - x'\\|^{1+p}$,\nwhere we assume \u03b1, \u03b2 > 0 and p \u2208 (0, 1].\nDefinition 2 (Coordinate Coding [18]). A coordinate coding is a pair (\u03b3,C), where $C \\subset \\mathbb{R}^d$ is a set\nof anchor points, and \u03b3 is a map of $x \\in \\mathbb{R}^d$ to $[\\gamma_v(x)]_{v \\in C} \\in \\mathbb{R}^{|C|}$ such that $\\sum_{v} \\gamma_v(x) = 1$. It induces the\nfollowing physical approximation of x in $\\mathbb{R}^d$: $\\gamma(x) = \\sum_{v \\in C} \\gamma_v(x) v$. Moreover, for all $x \\in \\mathbb{R}^d$, we define the\ncorresponding coding norm as $\\|x\\|_\\gamma = \\left(\\sum_{v \\in C} \\gamma_v(x)^2\\right)^{1/2}$.\nLemma 1 (Linearization [18]). Let (\u03b3,C) be an arbitrary coordinate coding on $\\mathbb{R}^d$. Let f be an (\u03b1, \u03b2, p)-Lipschitz\nsmooth function. We have for all $x \\in \\mathbb{R}^d$:\n\n$$\\left| f(x) - \\sum_{v \\in C} \\gamma_v(x) f(v) \\right| \\le \\alpha \\|x - \\gamma(x)\\| + \\beta \\sum_{v \\in C} |\\gamma_v(x)| \\, \\|v - \\gamma(x)\\|^{1+p} \\qquad (2)$$\n\nAs explained in [18], a good coding scheme for non-linear function approximation should make x\nclose to its physical approximation \u03b3(x) (i.e. smaller data reconstruction error $\\|x - \\gamma(x)\\|$) and\nshould be localized (i.e. smaller localization error $\\sum_{v \\in C} |\\gamma_v(x)| \\, \\|v - \\gamma(x)\\|^{1+p}$). This is the basic\nidea of LCC.\nDefinition 3 (Localization Measure [18]). Given \u03b1, \u03b2, p, and coding (\u03b3,C), we define\n\n$$Q_{\\alpha,\\beta,p}(\\gamma, C) = E_x\\left[ \\alpha \\|x - \\gamma(x)\\| + \\beta \\sum_{v \\in C} |\\gamma_v(x)| \\, \\|v - \\gamma(x)\\|^{1+p} \\right] \\qquad (3)$$\n\nLocalization measure is equivalent to the expectation of the upper bound of the approximation error.\n\n2.2 Orthogonal Coordinate Coding\n\nIn the following sections, we will follow the notations in Table 1, and define our orthogonal coordinate\ncoding (OCC) as below.\nDefinition 4 (Orthogonal Coordinate Coding). An orthogonal coordinate coding is a pair (\u03b3,C),\nwhere $C \\subset \\mathbb{R}^d$ contains |C| orthogonal basis vectors, that is, \u2200u, v \u2208 C, if u \u2260 v, then $u^T v = 0$,\nand coding \u03b3 is a map of $x \\in \\mathbb{R}^d$ to $[\\gamma_v(x)]_{v \\in C} \\in \\mathbb{R}^{|C|}$ such that $\\gamma_v(x) \\propto \\frac{x^T v}{\\|v\\|^2}$ and $\\sum_{v \\in C} |\\gamma_v(x)| = 1$.\n\nFigure 1: Comparison of the geometric views on (a) LCC and (b) OCC, where the white and red dots denote\nthe data and anchor points, respectively. In LCC, the anchor points are distributed among the data space and\nseveral nearest neighbors around the data are selected for data reconstruction, while in OCC the anchor points\nare located on the anchor plane defined by the normal vector (i.e. coordinate, basis vector) v and only the\nclosest point to each data point on the anchor plane is selected for coding. The figures are borrowed from the\nslides of [17], and best viewed in color.\n\nCompared to Definition 2, there are two changes in OCC: (1) instead of anchor points we use a set\nof orthogonal basis vectors, which defines a set of anchor planes, and (2) the coding for each data\npoint is defined on the \u21131-norm unit ball, which removes the scaling factors in both x and v. 
Notice\nthat, given the data matrix, the maximum number of orthogonal basis vectors which can be\nused to represent all the data precisely is equal to the rank of the data matrix, so the maximum value of\n|C| is equal to the rank of the data matrix as well.\nFigure 1 illustrates the geometric views on LCC and OCC respectively. Intuitively, in both methods\nanchor points try to encode data locally. However, the ways of their arrangement are quite different.\nIn LCC anchor points are distributed among the whole data space such that each data point can be covered\nby certain anchor points in a local region, and their distribution cannot be described using regular\nshapes. On the contrary, the anchor points in OCC are located on the anchor plane defined by\na basis vector. In fact, each anchor plane can be considered as an infinite number of anchor points,\nand for each data point only its closest point on each anchor plane is utilized for reconstruction.\nTherefore, intuitively the number of anchor planes in OCC should be much smaller than the number\nof anchor points in LCC.\nTheorem 1 (Localization Error of OCC). Let (\u03b3,C) be an orthogonal coordinate coding on $\\mathbb{R}^d$\nwhere $C \\subset \\mathbb{R}^d$ with size |C| = M. Let f be an (\u03b1, \u03b2, p)-Lipschitz smooth function. Without\nloss of generality, assuming $\\forall x \\in \\mathbb{R}^d, \\|x\\| \\le 1$ and $\\forall v \\in C, 1 \\le \\|v\\| \\le h$ $(h \\ge 1)$, the\nlocalization error in Lemma 1 is bounded by:\n\n$$\\sum_{v \\in C} |\\gamma_v(x)| \\, \\|v - \\gamma(x)\\|^{1+p} \\le \\left[ (1 + M) h \\right]^{1+p} \\qquad (4)$$\n\nProof. Let $\\gamma_v(x) = \\frac{x^T v}{s_x \\|v\\|^2}$, where $s_x = \\sum_{v \\in C} \\frac{|x^T v|}{\\|v\\|^2}$, then\n\n$$\\sum_{v \\in C} |\\gamma_v(x)| \\, \\|v - \\gamma(x)\\|^{1+p} = \\sum_{v \\in C} |\\gamma_v(x)| \\left[ \\|v\\|^2 - 2 s_x \\gamma_v(x) \\|v\\|^2 + \\sum_{v \\in C} s_x^2 \\gamma_v(x)^2 \\|v\\|^2 \\right]^{\\frac{1+p}{2}}$$\n$$\\le \\sum_{v \\in C} |\\gamma_v(x)| \\left[ \\|v\\|^2 + 2 s_x \\|v\\|^2 |\\gamma_v(x)| + \\sum_{v \\in C} s_x^2 \\gamma_v(x)^2 \\|v\\|^2 \\right]^{\\frac{1+p}{2}}$$\n$$\\le \\sum_{v \\in C} |\\gamma_v(x)| \\left[ \\|v\\|^2 + 2 s_x \\|v\\|^2 |\\gamma_v(x)| + \\left( \\sum_{v \\in C} \\gamma_v(x)^2 \\right) \\left( \\max_{v \\in C} s_x^2 \\|v\\|^2 \\right) \\right]^{\\frac{1+p}{2}}. \\qquad (5)$$\n\nSince $\\forall x \\in \\mathbb{R}^d, \\|x\\| \\le 1$ and $\\forall v \\in C, 1 \\le \\|v\\| \\le h$ $(h \\ge 1)$, and $\\sum_{v \\in C} |\\gamma_v(x)| = 1$,\nwe have $\\forall v \\in C, |\\gamma_v(x)| \\le 1$, $\\sum_{v \\in C} \\gamma_v(x)^2 \\le 1$, and $s_x = \\sum_{v \\in C} \\frac{|x^T v|}{\\|v\\|^2} \\le \\sum_{v \\in C} \\frac{\\|x\\| \\|v\\|}{\\|v\\|^2} \\le M$.\nTherefore\n\n$$\\sum_{v \\in C} |\\gamma_v(x)| \\, \\|v - \\gamma(x)\\|^{1+p} \\le \\sum_{v \\in C} |\\gamma_v(x)| \\left[ h^2 + 2 M h^2 |\\gamma_v(x)| + M^2 h^2 \\right]^{\\frac{1+p}{2}}$$\n$$= h^{1+p} \\sum_{v \\in C} |\\gamma_v(x)| \\left[ 1 + 2 M |\\gamma_v(x)| + M^2 \\right]^{\\frac{1+p}{2}}$$\n$$\\le h^{1+p} \\left( \\sum_{v \\in C} |\\gamma_v(x)| \\right) \\left( \\max_{v \\in C} \\left\\{ \\left[ 1 + 2 M |\\gamma_v(x)| + M^2 \\right]^{\\frac{1+p}{2}} \\right\\} \\right)$$\n$$= h^{1+p} \\cdot \\max_{v \\in C} \\left\\{ \\left[ 1 + 2 M |\\gamma_v(x)| + M^2 \\right]^{\\frac{1+p}{2}} \\right\\}$$\n$$\\le h^{1+p} \\left[ 1 + 2 M + M^2 \\right]^{\\frac{1+p}{2}} = \\left[ (1 + M) h \\right]^{1+p}. \\qquad (6)$$\n\n2.3 Learning Orthogonal Basis Vectors\n\nInstead of optimizing Definition 3, LCC simplifies the localization error term by assuming \u03b3(x) = x\nand p = 1. Mathematically LCC solves the following optimization problem:\n\n$$\\min_{(\\gamma, C)} \\sum_{x \\in X} \\left\\{ \\frac{1}{2} \\|x - \\gamma(x)\\|^2 + \\mu \\sum_{v \\in C} |\\gamma_v(x)| \\, \\|v - x\\|^2 + \\lambda \\sum_{v \\in C} \\|v\\|^2 \\right\\} \\quad \\text{s.t.} \\;\\; \\forall x, \\; \\sum_{v \\in C} \\gamma_v(x) = 1. \\qquad (7)$$\n\nThey update C and \u03b3 via alternating optimization. The step of updating \u03b3 can be transformed into a\ncanonical LASSO problem, and the step of updating C is a least squares problem.\nFor OCC, given an (\u03b1, \u03b2, p)-Lipschitz smooth function f and a set of data $X \\subset \\mathbb{R}^d$, whose corresponding\ndata matrix and its rank are denoted as X and D, respectively, we would like to learn an\northogonal coordinate coding (\u03b3,C) where the number of basis vectors |C| = M \u2264 D such that the\nlocalization measure of this coding is minimized. Since Theorem 1 proves that the localization error\nper data point is bounded by a constant given an OCC, in practice we only need to minimize the data\nreconstruction error in order to minimize the upper bound of the localization measure. That is, we\nneed to solve the following problem:\n\n$$\\min_{(\\gamma, C)} \\sum_{x \\in X} \\|x - C \\gamma_x\\|^2 \\quad \\text{s.t.} \\;\\; \\forall u, v \\in C, \\; u \\ne v \\Rightarrow u^T v = 0, \\quad |C| = M, \\quad \\forall x, \\; \\|\\gamma_x\\|_1 = 1. \\qquad (8)$$\n\nThis optimization problem is quite similar to sparse coding [12], except that there exists the orthogonal\nconstraint on the basis vectors. In practice we relax this problem by removing the constraint\n$\\forall x, \\|\\gamma_x\\|_1 = 1$.\n(I) Solving for C. Eqn. 8 can be solved first using singular value decomposition (SVD). Let the\nSVD of X be X = V\u03a3U, where the singular values are positive and in descending order with respect to\n\u03a3. 
Then we set $C = V_{\\{d \\times M\\}} \\Sigma_{\\{M \\times M\\}}$, where $V_{\\{d \\times M\\}}$ denotes a sub-matrix of V containing the\nelements within rows from 1 to d and columns from 1 to M, similarly for $\\Sigma_{\\{M \\times M\\}}$. We need only\nto use a few top eigenvectors as our orthogonal basis vectors for coding, and the search space is far\nsmaller than generating anchor points.\n(II) Solving for $\\gamma_x$. Since we have the orthogonal basis vectors in C, we can easily derive the formulation\nfor calculating $\\tilde{\\gamma}_x$, the value of $\\gamma_x$ before normalization, that is, $\\tilde{\\gamma}_x = (C^T C)^{-1} C^T x$.\nLetting $\\{\\bar{v}\\}$ and $\\{\\sigma_v\\}$ be the corresponding singular vectors and singular values, based on the orthogonality\nof basis vectors we have $\\tilde{\\gamma}_v(x) = \\frac{\\bar{v}^T x}{\\sigma_v}$, which is a variant of the coding definition in\nDefinition 4. Finally, we can calculate $\\gamma_x$ by normalizing $\\tilde{\\gamma}_x$ as follows: $\\gamma_x = \\frac{\\tilde{\\gamma}_x}{\\|\\tilde{\\gamma}_x\\|_1}$.\n\n3 Modeling Classification Decision Boundary in SVM\n\nGiven a set of data $\\{(x_i, y_i)\\}$ where $y_i \\in \\{-1, 1\\}$ is the label of $x_i$, the decision boundary for binary\nclassification of a linear SVM is $f(x) = w^T x + b$ where w is the normal vector of the decision\nhyperplane (i.e. coefficients) of the SVM and b is a bias term. Here, we assume that the decision\nboundary is an (\u03b1, \u03b2, p)-Lipschitz smooth function. Since in LCC each data point is encoded by some\nanchor points on the data manifold, it can model the decision boundary of an SVM directly using\n$f(x) \\approx \\sum_{v \\in C} \\gamma_v(x) f(v)$. Then by taking $\\gamma_x$ as the input data of a linear SVM, the f(v)\u2019s can be\nlearned to approximate the decision boundary f.\nHowever, OCC learns a set of orthogonal basis vectors, rather than anchor points, and corresponding\ncoding for data. This makes OCC suitable to model the normal vectors of decision hyperplanes in\nSVMs locally with LL-SVM. 
Given data x and an orthogonal coordinate coding (\u03b3,C), the decision\nboundary in LL-SVM can be formulated as follows (see footnote 1):\n\n$$f(x) = w(x)^T x + b = \\sum_{v \\in C} \\gamma_v(x) w(v)^T x + b = \\gamma_x^T W x + b \\qquad (9)$$\n\nwhere $W \\in \\mathbb{R}^{M \\times d}$ is a matrix which needs to be learned for SVMs. In the view of kernel SVMs,\nwe actually define another kernel K based on x and $\\gamma_x$ as shown below.\n\n$$\\forall i, j, \\quad K(x_i, x_j) = \\langle \\gamma_{x_i} x_i^T, \\; \\gamma_{x_j} x_j^T \\rangle \\qquad (10)$$\n\nwhere $\\langle \\cdot, \\cdot \\rangle$ denotes the Frobenius inner product. Notice that our kernel involves the latent semantic kernel\n[6], which is defined based on a set of orthogonal basis vectors.\n\n4 Experiments\n\nIn our experiments, we test OCC with LL-SVM for classification on the benchmark datasets:\nMNIST, USPS and LETTER. The features we used are the raw features such that we can compare\nour results fairly with others.\nMNIST contains 40000 training and 10000 test gray-scale images with resolution 28\u00d728, which are\nnormalized directly into 784 dimensional vectors. The label of each image is one of the 10 digits\nfrom 0 to 9. USPS contains 7291 training and 2007 test gray-scale images with resolution 16\u00d716,\ndirectly stored as 256 dimensional vectors, and the label of each image still corresponds to one of\nthe 10 digits from 0 to 9. LETTER contains 16000 training and 4000 testing images, each of which\nis represented as a relatively short 16 dimensional vector, and the label of each image corresponds\nto one of the 26 letters from A to Z.\nWe re-implemented LL-SVM based on the C++ code of LIBLINEAR [8] (footnote 2) and PEGASOS [14] (footnote 3),\nrespectively, and performed multi-class classification using the one-vs-all strategy. This aims to\ntest the effect of either quadratic programming or stochastic gradient based SVM solver on both\naccuracy and computational time. We denote these two ways of LL-SVM as LIB-LLSVM and PEG-LLSVM\nfor short. 
We tried to learn our basis vectors in two ways: (1) SVD is applied directly to the\nentire training data matrix, or (2) SVD is applied separately to the data matrix consisting of all the\npositive training data. We denote these two types of OCC as G-OCC (i.e. Generic OCC) and C-OCC\n(i.e. Class-specific OCC), respectively. Then the coding for each data point is calculated as explained in\nSection 2.3. Next, all the training raw features and their coding vectors are taken as the input to train\nthe model (W, b) of LL-SVM. For each test data point x, we calculate its coding in the same way and\nclassify it based on its decision values, that is, $y(x) = \\arg\\max_y \\gamma_{x,y}^T W_y x + b_y$.\nFigure 2 shows the comparison of classification error rates among G-OCC + LIB-LLSVM, G-OCC +\nPEG-LLSVM, C-OCC + LIB-LLSVM, and C-OCC + PEG-LLSVM on MNIST (left), USPS (middle),\nand LETTER (right), respectively, using different numbers of orthogonal basis vectors. With the\nsame OCC, LIB-LLSVM performs slightly better than PEG-LLSVM in terms of accuracy, and both\nbehave similarly as the number of orthogonal basis vectors increases. It seems that in\ngeneral C-OCC is better than G-OCC.\n\nFootnote 1: Notice that Eqn. 9 is slightly different from the original formulation in [11] by ignoring the different bias\nterm for each orthogonal basis vector.\nFootnote 2: Using LIBLINEAR, we implemented LL-SVM based on Eqn. 9.\nFootnote 3: Using PEGASOS, we implemented LL-SVM based on the original formulation in [11].\n\nFigure 2: Performance comparison among the 4 different combinations of OCC + LL-SVM on MNIST (left),\nUSPS (middle), and LETTER (right) using different numbers of orthogonal basis vectors. This figure is best\nviewed in color.\n\nTable 2 summarizes our comparison results between our methods and some other SVM based approaches.\nThe parameters of the RBF kernel used in the kernel SVMs are the same as [2]. 
Since\nneither results of LCC on USPS and LETTER nor its code are available, we tested the published code of LLC\n[16] on these two datasets so that we can have a rough idea of how well LCC works. The anchor\npoints are found using K-Means. From Table 2, we can see that applying linear SVM directly on\nOCC works slightly better than on the raw features, and when OCC is working with LL-SVM, the\nperformance is boosted significantly while the numbers of anchor points that are needed in LL-SVM\nare reduced. On MNIST we can see that our non-linear function approximation is better than LCC,\nimproved LCC, LLC, and LL-SVM; on USPS ours is better than both LLC and LL-SVM; but on\nLETTER ours is worse than LLC (4096 anchor points) and LL-SVM (100 anchor points). The\nreason for this is that strictly speaking LETTER is not a high dimensional dataset (only 16 dimensions\nper data point), which limits the power of OCC. Compared with kernel based SVMs, our method\ncan achieve comparable or even better results (e.g. on USPS). All of these results demonstrate that\nOCC is quite suitable to model the non-linear normal vectors using linear SVMs for classification on\nhigh dimensional data. In summary, our encoding scheme uses far fewer basis vectors\nthan LCC uses anchor points while achieving better test accuracy, which translates to higher\nperformance both in terms of generalization and efficiency in computation.\nWe show our training and test time on these three datasets as well in Table 3 based on unoptimized\nMATLAB code on a single thread of a 2.67 GHz CPU. For training, the time includes calculating\nOCC and training LL-SVM. From this table, we can see that our methods are a little slower than the\noriginal LL-SVM, but still much faster than kernel SVMs. The main reason for this is that OCC is\nnon-sparse while in [11] the coefficients are sparse. 
However, for calculating coefficients, OCC is\nfaster than [11], because there is no distance calculation or K nearest neighbor search involved in\nOCC, just simple multiplication and normalization.\n\n5 Conclusion\n\nIn this paper, we propose orthogonal coordinate coding (OCC) to encode high dimensional data\nbased on a set of anchor planes defined by a set of orthogonal basis vectors. Theoretically we prove\nthat our OCC can guarantee a fixed upper bound of approximation error for any (\u03b1, \u03b2, p)-Lipschitz\nsmooth function, and we can easily learn the orthogonal basis vectors using SVD to minimize\nthe localization measure. Meanwhile, OCC can help locally linear SVM (LL-SVM) approximate\nthe kernel-based SVMs, and our experiments demonstrate that with a few orthogonal anchor\nplanes, LL-SVM can achieve comparable or better results than LCC and its variants (improved\nLCC and LLC) with linear SVMs, and on USPS even better than kernel-based SVMs. In future, we\nwould like to learn the orthogonal basis vectors using semi-definite programming to guarantee the\northogonality.\n\nAcknowledgements. We thank J. Valentin, P. Sturgess and S. Sengupta for useful discussions\non this paper. This work was supported by the IST Programme of the European Community, under\nthe PASCAL2 Network of Excellence, IST-2007-216886. P. H. S. Torr is in receipt of a Royal Society\nWolfson Research Merit Award.\n\nTable 2: Classification error rate comparison (%) between our methods and others on MNIST, USPS, and\nLETTER. The numbers of anchor planes in the brackets are the ones that return the best result on each\ndataset. All kernel methods [13, 14, 15, 16, 17] use the RBF kernel. In general, LIB-LLSVM + C-OCC\nperforms best.\n\nMethods | MNIST | USPS | LETTER\nLinear SVM + G-OCC (# basis vectors) | 9.25 (100) | 7.82 (95) | 30.52 (15)\nLinear SVM + C-OCC (# anchor planes) | 7.42 (100) | 5.98 (95) | 14.95 (16)\nLIB-LLSVM + G-OCC (# basis vectors) | 1.72 (50) | 4.14 (20) | 6.85 (15)\nPEG-LLSVM + G-OCC (# basis vectors) | 1.81 (40) | 4.38 (50) | 9.83 (14)\nLIB-LLSVM + C-OCC (# basis vectors) | 1.61 (90) | 3.94 (80) | 7.35 (16)\nPEG-LLSVM + C-OCC (# basis vectors) | 1.74 (90) | 4.09 (80) | 8.30 (16)\nLinear SVM (10 passes) [1] | 12.00 | 9.57 | 41.77\nLinear SVM + LCC (512 anchor points) [18] | 2.64 | - | -\nLinear SVM + LCC (4096 anchor points) [18] | 1.90 | - | -\nLinear SVM + improved LCC (512 anchor points) [19] | 1.95 | - | -\nLinear SVM + improved LCC (4096 anchor points) [19] | 1.64 | - | -\nLinear SVM + LLC (512 anchor points) [16] | 3.69 | 5.78 | 9.02\nLinear SVM + LLC (4096 anchor points) [16] | 2.28 | 4.38 | 4.12\nLibSVM [4] | 1.36 | - | -\nLA-SVM (1 pass) [3] | 1.42 | - | -\nLA-SVM (2 passes) [3] | 1.36 | - | -\nMCSVM [5] | 1.44 | 4.24 | 2.42\nSVMstruct [15] | 1.40 | 4.38 | 2.40\nLA-RANK (1 pass) [2] | 1.41 | 4.25 | 2.80\nLL-SVM (100 anchor points, 10 passes) [11] | 1.85 | 5.78 | 5.32\n\nTable 3: Computational time comparison between our methods and others on MNIST, USPS, and LETTER.\nThe numbers in Row 7-14 are copied from [11]. The training times of our methods include the calculation of\nOCC and training LL-SVM. All the numbers are corresponding to the methods shown in Table 2 with the same\nparameters. Notice that for PEG-LLSVM, 10^6 random data points are used for training.\n\nMethods | Training time (s), MNIST | Training time (s), USPS | Training time (s), LETTER | Test time (ms), MNIST | Test time (ms), USPS | Test time (ms), LETTER\nLIB-LLSVM + G-OCC | 113.38 | 5.78 | 4.14 | 5.51\u00d710^3 | 19.23 | 4.09\nPEG-LLSVM + G-OCC | 125.03 | 14.50 | 2.02 | 302.28 | 23.25 | 3.33\nLIB-LLSVM + C-OCC | 224.09 | 25.61 | 1.66 | 9.57\u00d710^3 | 547.60 | 63.13\nPEG-LLSVM + C-OCC | 273.70 | 23.31 | 0.85 | 503.18 | 50.63 | 28.94\nLinear SVM (10 passes) [1] | 1.5 | 0.26 | 0.18 | 8.75\u00d710^-3 | - | -\nLibSVM [4] | 1.75\u00d710^4 | - | - | 46 | - | -\nLA-SVM (1 pass) [3] | 4.9\u00d710^3 | - | - | 40.6 | - | -\nLA-SVM (2 passes) [3] | 1.22\u00d710^4 | - | - | 42.8 | - | -\nMCSVM [5] | 2.5\u00d710^4 | 60 | 1.2\u00d710^3 | - | - | -\nSVMstruct [15] | 2.65\u00d710^5 | 6.3\u00d710^3 | 2.4\u00d710^4 | - | - | -\nLA-RANK (1 pass) [2] | 3\u00d710^4 | 85 | 940 | - | - | -\nLL-SVM (100, 10 passes) [11] | 81.7 | 6.2 | 4.2 | 0.47 | - | -\n\nReferences\n\n[1] Bordes, A., Bottou, L. & Gallinari, P. (2009) SGD-QN: Careful quasi-Newton stochastic gradient\ndescent. Journal of Machine Learning Research (JMLR).\n[2] Bordes, A., Bottou, L., Gallinari, P., & Weston, J. (2007) Solving multiclass support vector\nmachines with LaRank. In Proceedings of International Conference on Machine Learning (ICML).\n[3] Bordes, A., Ertekin, S., Weston, J., & Bottou, L. (2005) Fast kernel classifiers with online and\nactive learning. Journal of Machine Learning Research (JMLR).\n[4] Chang, C. & Lin, C. (2011) LIBSVM: A Library for Support Vector Machines. ACM Transactions\non Intelligent Systems and Technology, vol. 2, issue 3, pp. 27:1-27:27.\n[5] Crammer, K. & Singer, Y. (2002) On the algorithmic implementation of multiclass kernel-based\nvector machines. Journal of Machine Learning Research (JMLR).\n[6] Cristianini, N., Shawe-Taylor, J. & Lodhi, H. 
(2002) Latent Semantic Kernels. Journal of Intelligent Information Systems, Vol. 18, No. 2-3, 127-152.\n[7] Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J. & Zisserman, A. The PASCAL Visual Object Classes Challenge 2009 (VOC2009). http://www.pascal-network.org/challenges/VOC/voc2009/workshop/index.html\n[8] Fan, R., Chang, K., Hsieh, C., Wang, X. & Lin, C. (2008) LIBLINEAR: A Library for Large Linear Classification. Journal of Machine Learning Research (JMLR), vol. 9, pp. 1871-1874.\n[9] G\u00f6nen, M. & Alpaydin, E. (2008) Localized Multiple Kernel Learning. In Proceedings of International Conference on Machine Learning (ICML).\n[10] Kecman, V. & Brooks, J.P. (2010) Locally Linear Support Vector Machines and Other Local Models. In Proceedings of IEEE World Congress on Computational Intelligence (WCCI), pp. 2615-2620.\n[11] Ladicky, L. & Torr, P.H.S. (2011) Locally Linear Support Vector Machines. In Proceedings of International Conference on Machine Learning (ICML).\n[12] Lee, H., Battle, A., Raina, R., & Ng, A.Y. (2007) Efficient Sparse Coding Algorithms. In Advances in Neural Information Processing Systems (NIPS).\n[13] Mairal, J., Bach, F., Ponce, J. & Sapiro, G. (2009) Online Dictionary Learning for Sparse Coding. In Proceedings of International Conference on Machine Learning (ICML).\n[14] Shalev-Shwartz, S., Singer, Y., & Srebro, N. (2007) Pegasos: Primal Estimated sub-GrAdient SOlver for SVM. In Proceedings of International Conference on Machine Learning (ICML).\n[15] Tsochantaridis, I., Joachims, T., Hofmann, T., & Altun, Y. (2005) Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research (JMLR).\n[16] Wang, J., Yang, J., Yu, K., Lv, F., Huang, T., & Gong, Y. (2010) Locality-constrained Linear Coding for Image Classification. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR).\n[17] Yu, K. & Ng, A. 
(2010) ECCV-2010 Tutorial: Feature Learning for Image Classification. http://ufldl.stanford.edu/eccv10-tutorial/.\n[18] Yu, K., Zhang, T., & Gong, Y. (2009) Nonlinear Learning using Local Coordinate Coding. In Advances in Neural Information Processing Systems (NIPS).\n[19] Yu, K. & Zhang, T. (2010) Improved Local Coordinate Coding using Local Tangents. In Proceedings of International Conference on Machine Learning (ICML).\n[20] Zhang, H., Berg, A., Maire, M. & Malik, J. (2006) SVM-KNN: Discriminative nearest neighbor classification for visual category recognition. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2126-2136.", "award": [], "sourceid": 916, "authors": [{"given_name": "Ziming", "family_name": "Zhang", "institution": null}, {"given_name": "Lubor", "family_name": "Ladicky", "institution": null}, {"given_name": "Philip", "family_name": "Torr", "institution": null}, {"given_name": "Amir", "family_name": "Saffari", "institution": null}]}