{"title": "Theory of matching pursuit", "book": "Advances in Neural Information Processing Systems", "page_first": 721, "page_last": 728, "abstract": "We analyse matching pursuit for kernel principal components analysis by proving that the sparse subspace it produces is a sample compression scheme. We show that this bound is tighter than the KPCA bound of Shawe-Taylor et al. [7] and highly predictive of the size of the subspace needed to capture most of the variance in the data. We analyse a second matching pursuit algorithm called kernel matching pursuit (KMP) which does not correspond to a sample compression scheme. However, we give a novel bound that views the choice of subspace of the KMP algorithm as a compression scheme and hence provides a VC bound to upper bound its future loss. Finally we describe how the same bound can be applied to other matching pursuit related algorithms.", "full_text": "Theory of matching pursuit

Zakria Hussain and John Shawe-Taylor

Department of Computer Science, University College London, UK
{z.hussain,j.shawe-taylor}@cs.ucl.ac.uk

Abstract

We analyse matching pursuit for kernel principal components analysis (KPCA) by proving that the sparse subspace it produces is a sample compression scheme. We show that this bound is tighter than the KPCA bound of Shawe-Taylor et al. [7] and highly predictive of the size of the subspace needed to capture most of the variance in the data. We analyse a second matching pursuit algorithm called kernel matching pursuit (KMP) which does not correspond to a sample compression scheme. However, we give a novel bound that views the choice of subspace of the KMP algorithm as a compression scheme and hence provides a VC bound to upper bound its future loss.
Finally we describe how the same bound can be applied to other matching pursuit related algorithms.

1 Introduction

Matching pursuit refers to a family of algorithms that generate a set of bases for learning in a greedy fashion. A good example of this approach is the matching pursuit algorithm [4]. Viewed from this angle, sparse kernel principal components analysis (PCA) looks for a small number of kernel basis vectors in order to maximise the Rayleigh quotient. The algorithm was proposed by [8]¹ and motivated by matching pursuit [4], but to our knowledge sparse PCA has not been analysed theoretically. In this paper we show that sparse kernel PCA (KPCA) is a sample compression scheme and can be bounded using the size of the compression set [3, 2], which is the set of training examples used in the construction of the KPCA subspace. We also derive a more general framework for this algorithm that uses the principle "maximise Rayleigh quotient and deflate". A related algorithm called kernel matching pursuit (KMP) [10] is a sparse version of least squares regression but without the property of being a compression scheme. However, we use the number of basis vectors constructed by KMP to help upper bound the loss of the KMP algorithm using the VC dimension. This bound is novel in that it is applied in an empirically chosen low dimensional hypothesis space and applies independently of the actual dimension of the ambient feature space (including one constructed from the Gaussian kernel). In both cases we illustrate the use of the bounds on real and/or simulated data. Finally we also show that the KMP bound can be applied to a sparse kernel canonical correlation analysis that uses a similar matching pursuit technique. We do not describe that algorithm here due to space constraints and concentrate only on the theoretical results.
We begin with preliminary definitions.

2 Preliminary definitions

Throughout the paper we consider learning from samples of data. For the regression section the data is a sample S = {(x_i, y_i)}_{i=1}^m of input-output pairs drawn from a joint space X × Y, where x ∈ R^n and y ∈ R. For the principal components analysis the data is a sample S = {x_i}_{i=1}^m of multivariate examples drawn from a space X. For simplicity we always assume that the examples are already projected into the kernel defined feature space, so that the kernel matrix K has entries K[i, j] = ⟨x_i, x_j⟩. The notation K[i, :] and K[:, i] will denote the ith row and ith column of the matrix K, respectively. When using a set of indices i = {i_1, . . . , i_k} (say), K[i, i] denotes the square matrix defined solely by the index set i. The transpose of a matrix X or vector x is denoted by X' or x', respectively. The input data matrix X will contain examples as row vectors.

For analysis purposes we assume that the training examples are generated i.i.d. according to an unknown but fixed probability distribution that also governs the generation of the test data. Expectation over the training examples (empirical average) is denoted by Ê[·], while expectation with respect to the underlying distribution is denoted E[·]. For the sample compression analysis the compression function Λ induced by a sample compression learning algorithm A on training set S is the map Λ : S ↦ Λ(S) such that the compression set Λ(S) ⊂ S is returned by A. A reconstruction function Φ is a mapping from a compression set Λ(S) to a set F of functions, Φ : Λ(S) ↦ F.

¹The algorithm was proposed as a low rank kernel approximation; however, the algorithm turns out to be a sparse kernel PCA (to be shown).
Let A(S) be the function output by learning algorithm A on training set S. Therefore, a sample compression scheme is a reconstruction function Φ mapping a compression set Λ(S) to some set of functions F such that A(S) = Φ(Λ(S)). If F is the set of Boolean-valued or real-valued functions then the sample compression scheme is said to be a classification or regression algorithm, respectively.

2.1 Sparse kernel principal components analysis

Principal components analysis [6] can be expressed as the following maximisation problem:

    max_w (w'X'Xw) / (w'w),    (1)

where w is the weight vector. In a sparse KPCA algorithm we would like to find a sparsely represented vector w = X[i, :]'α̃, that is, a linear combination of a small number of training examples indexed by the vector i. Making this substitution into Equation (1) we have the following sparse dual PCA maximisation problem,

    max_α̃ (α̃'X[i, :]X'XX[i, :]'α̃) / (α̃'X[i, :]X[i, :]'α̃),

which is equivalent to sparse kernel PCA (SKPCA) with sparse kernel matrix K[:, i]' = X[i, :]X',

    max_α̃ (α̃'K[:, i]'K[:, i]α̃) / (α̃'K[i, i]α̃),

where α̃ is a sparse vector of length k = |i|. Clearly maximising the quantity above will lead to the maximisation of the generalised eigenvalues corresponding to α̃, and hence a sparse subset of the original KPCA problem. We would like to find the optimal set of indices i. We proceed in a greedy manner (matching pursuit) in much the same way as [8]. The procedure involves choosing basis vectors that maximise the Rayleigh quotient without computing the set of eigenvectors, choosing basis vectors iteratively until some pre-specified number k of vectors have been chosen. An orthogonalisation of the kernel matrix at each step ensures future potential basis vectors will be orthogonal to those already chosen. The quotient to maximise is:

    max_i ρ_i = (e_i'K²e_i) / (e_i'Ke_i),    (2)

where e_i is the ith unit vector. After this maximisation we need to orthogonalise (deflate) the kernel matrix to create a projection into the space orthogonal to the basis vectors chosen, to ensure we find the maximum variance of the data in the projected space. The deflation step can be carried out as follows. Let τ = K[:, i] = XX'e_i where e_i is the ith unit vector. We know that primal PCA deflation can be carried out with respect to the features in the following way:

    X̂' = (I − uu'/(u'u))X',

where u is the projection direction defined by the chosen eigenvector and X̂ is the deflated matrix. However, in sparse KPCA, u = X'e_i because the projection directions are simply the examples in X. Therefore, for sparse KPCA we have:

    X̂X̂' = X(I − uu'/(u'u))(I − uu'/(u'u))X' = XX' − (XX'e_i e_i'XX')/(e_i'XX'e_i) = K − K[:, i]K[:, i]'/K[i, i].

Therefore, given a kernel matrix K the deflated kernel matrix K̂ can be computed as follows:

    K̂ = K − ττ'/K[i_k, i_k],    (3)

where τ = K[:, i_k] and i_k denotes the latest element in the vector i. The algorithm is presented below in Algorithm 1; we use the notation K.² to denote component-wise squaring.
Also, division of vectors is assumed to be component-wise.

Algorithm 1: A matching pursuit algorithm for kernel principal components analysis (i.e., sparse KPCA)
Input: kernel K, sparsity parameter k > 0.
1: initialise i = [ ]
2: for j = 1 to k do
3:   set i_j to the index of max{(K.²)'1 / diag{K}}
4:   set τ = K[:, i_j] and deflate the kernel matrix: K = K − ττ'/K[i_j, i_j]
5: end for
6: compute K̃ using i and Equation (5)
Output: sparse matrix approximation K̃

Algorithm 1 is equivalent to the algorithm proposed by [8]. However, their motivation comes from the stance of finding a low rank matrix approximation of the kernel matrix. They proceed by looking for an approximation K̃ = K[:, i]T for a set i such that the Frobenius norm between the trace residuals tr{K − K[:, i]T} = tr{K − K̃} is minimal. Their algorithm finds the set of indices i and the projection matrix T. However, the use of T in computing the low rank matrix approximation seems to imply the need for additional information from outside of the chosen basis vectors in order to construct this approximation. We show that a projection into the space defined solely by the chosen indices is enough to reconstruct the kernel matrix and does not require any extra information.² The projection is the well known Nyström method [11].

An orthogonal projection P_i(φ(x_j)) of a feature vector φ(x_j) into a subspace defined only by the set of indices i can be expressed as P_i(x_j) = X̃'(X̃X̃')⁻¹X̃φ(x_j), where X̃ = X[i, :] are the i training examples from the data matrix X. It follows that

    P_i(x_j)'P_i(x_j) = φ(x_j)'X̃'(X̃X̃')⁻¹X̃X̃'(X̃X̃')⁻¹X̃φ(x_j) = K[i, j]K[i, i]⁻¹K[j, i],    (4)

with K[i, j] denoting the kernel entries between the index set i and the feature vector φ(x_j). This gives us the following projection into the space defined by i:

    K̃ = K[:, i]K[i, i]⁻¹K[:, i]'.    (5)

Claim 1. The sparse kernel principal components analysis algorithm is a compression scheme.

Proof. We can reconstruct the projection from the set of chosen indices i using Equation (4). Hence, i forms a compression set.

We now prove that Smola and Schölkopf's low rank matrix approximation algorithm [8] (without sub-sampling)³ is equivalent to the sparse kernel principal components analysis presented in this paper (Algorithm 1).

Theorem 1. Without sub-sampling, Algorithm 1 is equivalent to Algorithm 2 of [8].

²In their book [5], Smola and Schölkopf redefine their kernel approximation in the same way as we have done; however, they do not make the connection that it is a compression scheme (see Claim 1).
³We do not use the "59-trick" in our algorithm, although its inclusion would be trivial and would result in the same algorithm as in [8].

Proof. Let K be the kernel matrix and let K[:, i] be the ith column of the kernel matrix. Assume X is the input matrix containing rows of vectors that have already been mapped into a higher dimensional feature space using φ, such that X = (φ(x_1), . . . , φ(x_m))'. Smola and Schölkopf [8] state in Section 4.2 of their paper that their Algorithm 2 finds a low rank approximation of the kernel matrix minimising the Frobenius norm ‖X − X̃‖²_Frob = tr{K − K̃}, where X̃ is the low rank approximation of X.
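To make the selection-and-deflation loop concrete, here is a minimal NumPy sketch of Algorithm 1 (a hypothetical implementation for illustration, not the authors' code): at each step it picks the column maximising the Rayleigh quotient e_j'K²e_j / e_j'Ke_j, deflates via Equation (3), and finally forms the Nyström reconstruction of Equation (5) from the original kernel.

```python
import numpy as np

def sparse_kpca(K, k):
    """Matching-pursuit sparse KPCA (sketch of Algorithm 1).

    K : (m, m) kernel matrix; k : number of basis vectors to select.
    Returns the chosen index set i and the approximation K_tilde of Eq. (5).
    """
    K_orig = K.copy()
    K = K.copy()
    idx = []
    for _ in range(k):
        # Rayleigh quotient for every candidate j: e_j' K^2 e_j / e_j' K e_j,
        # i.e. (K.^2)'1 / diag{K} as in step 3 of Algorithm 1.
        scores = (K ** 2).sum(axis=0) / np.maximum(np.diag(K), 1e-12)
        scores[idx] = -np.inf          # deflated columns are zero; never re-pick
        j = int(np.argmax(scores))
        idx.append(j)
        tau = K[:, j].copy()
        K = K - np.outer(tau, tau) / tau[j]   # deflation, Equation (3)
    i = np.array(idx)
    # Nystrom projection, Equation (5): K[:, i] K[i, i]^{-1} K[:, i]'
    K_tilde = K_orig[:, i] @ np.linalg.solve(K_orig[np.ix_(i, i)], K_orig[:, i].T)
    return i, K_tilde
```

On a kernel of rank r, selecting k = r columns reconstructs K exactly, in line with the claim that each greedy pick maximally reduces tr{K − K̃}.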
Therefore, we need to prove that Algorithm 1 also minimises this norm. We would like to show that the maximum reduction in the Frobenius norm between the kernel K and its projection K̃ comes exactly from the choice of basis vectors that maximise the Rayleigh quotient, deflating according to Equation (3). At each stage we deflate by

    K̂ = K − ττ'/K[i_k, i_k].

The trace tr{K} = Σ_{i=1}^m K[i, i] is the sum of the diagonal elements of the matrix K. Therefore,

    tr{K̂} = tr{K} − tr{ττ'}/K[i_k, i_k] = tr{K} − τ'τ/K[i_k, i_k] = tr{K} − K²[i_k, i_k]/K[i_k, i_k].

The last term of the final equation corresponds exactly to the Rayleigh quotient of Equation (2). Therefore the maximisation of the Rayleigh quotient does indeed correspond to the maximum reduction in the Frobenius norm between the approximated matrix X̃ and X.

2.2 A generalisation error bound for sparse kernel principal components analysis

We use the sample compression framework of [3] to bound the generalisation error of the sparse KPCA algorithm. Note that kernel PCA bounds [7] do not use sample compression in order to bound the true error. As pointed out above, we use the simple fact that this algorithm can be viewed as a compression scheme. No side information is needed in this setting and a simple application of [3] is all that is required. That said, the usual application of compression bounds has been to classification algorithms, while here we are considering a subspace method.

Theorem 2. Let A_k be any learning algorithm having a reconstruction function that maps compression sets to subspaces.
Let m be the size of the training set S, let k be the size of the compression set, and let Ê_{m−k}[ℓ(A_k(S))] be the residual loss between the m − k points outside of the compression set and their projections into a subspace. Then with probability 1 − δ, the expected loss E[ℓ(A_k(S))] of algorithm A_k given any training set S can be bounded by

    E[ℓ(A_k(S))] ≤ min_{1≤t≤k} [ Ê_{m−t}[ℓ(A_t(S))] + R √( (1/(2(m−t))) ( t ln(em/t) + ln(2m/δ) ) ) ],

where ℓ(·) ≥ 0 and R = sup ℓ(·).

Proof. Consider the case where we have a compression set of size k. Then there are (m choose k) ways of choosing the compression set. Given δ confidence we apply Hoeffding's bound to the m − k points not in the compression set, once for each choice, by setting the tail probability equal to δ/(m choose k). Solving for ε gives us the theorem when we further apply a factor 1/m to δ to ensure one application for each possible choice of k. The minimisation over t chooses the best value, making use of the fact that using more dimensions can only reduce the expected loss on test points.

We now consider the application of the above bound to sparse KPCA. Let the corresponding loss function be defined as

    ℓ(A_t(S))(x) = ‖x − P_{i_t}(x)‖²,

where x is a test point and P_{i_t}(x) its projection into the subspace determined by the set i_t of indices returned by A_t(S). Thus we can give a more specific loss bound in the case where we use a Gaussian kernel in the sparse kernel principal components analysis.

Corollary 1 (Sample compression bound for sparse KPCA).
Using a Gaussian kernel and all of the definitions from Theorem 2, we get the following bound:

    E[ℓ(A(S))] ≤ min_{1≤t≤k} [ (1/(m−t)) Σ_{i=1}^{m−t} ‖x_i − P_{i_t}(x_i)‖² + √( (1/(2(m−t))) ( t ln(em/t) + ln(2m/δ) ) ) ].

Note that R corresponds to the smallest radius of a ball that encloses all of the training points; hence, for the Gaussian kernel R equals 1. We now compare the sample compression bound proposed above for sparse KPCA with the kernel PCA bound introduced by [7]. The left hand side of Figure 1 shows plots of the test error residuals (for the Boston housing data set) together with the upper bounds computed using the bound of [7] and the sample compression bound of Corollary 1. The sample compression bound is much tighter than the KPCA bound and also non-trivial (unlike the KPCA bound).

The sample compression bound is at its lowest point after 43 basis vectors have been added. We speculate that at this point the "true" dimensions of the data have been found and that all other dimensions correspond to "noise". This corresponds to the point at which the plot of the residual becomes linear, suggesting dimensions with uniform noise. We carried out an extra toy experiment to help assess whether or not this is true and to show that the sample compression bound can help indicate when the principal components have captured most of the actual data. The right hand side plot of Figure 1 depicts the results of a toy experiment where we randomly sampled 1000 examples with 450 dimensions from a Gaussian distribution with zero mean and unit variance. We then ensured that 50 dimensions contained considerably larger eigenvalues than the remaining 400.
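The Corollary 1 curve is cheap to trace for every subspace size; the following hypothetical helper (a sketch whose log terms follow the reconstruction of the bound given above, with R = 1 for the Gaussian kernel) evaluates it and returns the minimising t.

```python
import numpy as np

def skpca_compression_bound(residuals, m, delta=0.05, R=1.0):
    """Corollary 1-style bound for t = 1, ..., len(residuals) (sketch).

    residuals[t-1] : mean squared residual (1/(m-t)) sum ||x - P_t(x)||^2
                     of the m - t points outside the compression set.
    Returns (bound values for each t, minimising t).
    """
    t = np.arange(1, len(residuals) + 1)
    slack = R * np.sqrt((t * np.log(np.e * m / t) + np.log(2 * m / delta))
                        / (2.0 * (m - t)))
    bound = np.asarray(residuals) + slack
    return bound, int(t[np.argmin(bound)])
```

The residual term falls with t while the square-root slack grows, so the minimum flags the dimension at which extra components stop paying for themselves, which is how the bound curves in Figure 1 are read.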
From the right plot of Figure 1 we see that the test residual keeps dropping at a constant rate after 50 basis vectors have been added. The compression bound picks 46 dimensions with the largest eigenvalues; however, the KPCA bound of [7] is much more optimistic and is at its lowest point after 30 basis vectors, suggesting erroneously that SKPCA has captured most of the data in 30 dimensions. Therefore, as well as being tighter and non-trivial, the compression bound is much better at predicting the best choice for the number of dimensions to use with sparse KPCA. Note that we carried out this experiment without randomly permuting the projections into a subspace because SKPCA is rotation invariant and will always choose the principal components with the largest eigenvalues.

Figure 1: Bound plots for sparse kernel PCA comparing the sample compression bound proposed in this paper and the already existing PCA bound. The plot on the left hand side is for the Boston housing data set and the plot on the right is for a toy experiment with 1000 training examples (and 450 dimensions) drawn randomly from a Gaussian distribution with zero mean and unit variance.

3 Kernel matching pursuit

Unfortunately, the theory of the last section, where we gave a sample compression bound for SKPCA, cannot be applied to KMP. This is because the algorithm needs information from outside of the compression set in order to construct its regressors and make predictions.
However, we can use a VC argument together with a sample compression trick in order to derive a bound for KMP in terms of the level of sparsity achieved, by viewing the sparsity achieved in the feature space as a compression scheme. Please note that we do not derive or reproduce the KMP algorithm here and advise the interested reader to read the manuscript of [10] for the algorithmic details.

3.1 A generalisation error bound for kernel matching pursuit

VC bounds have commonly been used to bound learning algorithms whose hypothesis spaces are infinite. One problem with these results is that the VC dimension can sometimes be infinite even in cases where learning is successful (e.g., the SVM). However, in this section we can avoid this issue by making use of the fact that the VC dimension of the set of linear threshold functions is simply the dimensionality of the function class. In the kernel matching pursuit algorithm this translates directly into the number of basis vectors chosen, and hence a standard VC argument applies.

The natural loss function for KMP is the regression loss; however, in order to use standard VC bounds we map the regression loss into a classification loss in the following way.

Definition 1. Let S ∼ D be a regression training sample generated i.i.d. from a fixed but unknown probability distribution D.
Given the error ℓ(f) = |f(x) − y| of a regression function f between a training example x and regression output y we can define, for some fixed positive scalar α ∈ R, the corresponding true classification loss (error) as

    ℓ_α(f) = Pr_{(x,y)∼D} {|f(x) − y| > α}.

Similarly, we can define the corresponding empirical classification loss as

    ℓ̂_α(f) = ℓ_α^S(f) = Pr_{(x,y)∼S} {|f(x) − y| > α} = E_{(x,y)∼S} {I(|f(x) − y| > α)},

where I is the indicator function and S is suppressed when clear from context.

Now that we have a loss function that is binary we can make a simple sample compression argument, counting the number of possible subspaces, together with a traditional VC style bound to upper bound the expected loss of KMP. To help keep the notation consistent with earlier definitions we will denote the indices of the chosen basis vectors by i. The indices of i are chosen from the training sample S and we denote by S_i those samples indexed by the vector i. Given these definitions and the bound of Vapnik and Chervonenkis [9] we can upper bound the true loss of KMP as follows.

Theorem 3. Fix α ∈ R, α > 0. Let A be the regression algorithm of KMP, m the size of the training set S and k the number of chosen basis vectors i. Let S be reordered so that the last m − k points are outside of the set i and let t = Σ_{i=m−k}^m I(|f(x_i) − y_i| > α) be the number of errors for those points in S \ S_i. Then with probability 1 − δ over the generation of the training set S, the expected loss E[ℓ(·)] of algorithm A can be bounded by

    E[ℓ(A(S))] ≤ (2/(m − k − t)) [ (k + 1) log( 4e(m − k − t)/(k + 1) ) + t log( e(m − k)/t ) + k log( em/k ) + log( 2m²/δ ) ].

Proof. First consider a fixed size k for the compression set and number of errors t. Let S_1 = {x_{i_1}, . . . , x_{i_k}} be the set of k training points chosen by the KMP regressor, S_2 = {x_{i_{k+1}}, . . . , x_{i_{k+t}}} the set of points erred on in training, and S̄ = S \ (S_1 ∪ S_2) the points outside of the compression set (S_1) and training error set (S_2). Suppose that the first k points form the compression set and the next t are the errors of the KMP regressor. Since the remaining m − k − t points S̄ are drawn independently we can apply the VC bound [9] to the ℓ_α loss to obtain the bound

    Pr{ S̄ : ℓ_α^S̄(f) = 0, ℓ_α(f) > ε } ≤ 2 (4e(m − k − t)/(k + 1))^{k+1} 2^{−ε(m−k−t)/2},

where we have made use of a bound on the number of dichotomies that can be generated by parallel hyperplanes [1], which is Σ_{i=0}^{k+d−1} (md−1 choose i) ≤ (e(md−1)/(k+d−1))^{k+d−1}, where d is the number of parallel hyperplanes and equals 2 in our case. We now need to consider all of the ways that the k basis vectors and t error points might have occurred and apply the union bound over all of these possibilities. This gives the bound

    Pr{ S : ∃ f ∈ span{S_1} s.t. ℓ_α^{S_2}(f) = 1, ℓ_α^S̄(f) = 0, ℓ_α(f) > ε } ≤ (m choose k)(m − k choose t) 2 (4e(m − k − t)/(k + 1))^{k+1} 2^{−ε(m−k−t)/2}.    (6)

Finally we need to consider all possible choices of the values of k and t. The number of these possibilities is clearly upper bounded by m². Setting m² times the right hand side of (6) equal to δ and solving for ε gives the result.

This is the first upper bound on the generalisation error of KMP that we are aware of, and as such we cannot compare the bound against any others. Figure 2 plots the KMP test error against the loss bound given by Theorem 3.

Figure 2: Plot of the KMP bound against its test error. We used 450 examples for training and 56 for testing. The bound was scaled down by a factor of 5.

The bound value has been scaled by 5 in order to get the correct pictorial representation of the two plots. Figure 2 shows that its minimum directly coincides with the lowest test error (after 17 basis vectors). This motivates a training algorithm for KMP that would use the bound as the minimisation criterion and stop once the bound fails to become smaller.
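A small hypothetical helper can evaluate the Theorem 3 bound for a given sparsity level (a sketch: it assumes base-2 logarithms, matching the 2^{−ε(m−k−t)/2} tail in the proof, and the log terms as reconstructed above):

```python
import numpy as np

def kmp_vc_bound(m, k, t, delta=0.05):
    """Theorem 3-style bound on the alpha-classification loss of KMP (sketch).

    m : training set size, k : number of chosen basis vectors,
    t : number of alpha-errors among the m - k points outside the basis.
    """
    # t * log2(e(m-k)/t) -> 0 as t -> 0, so define the t term accordingly.
    t_term = t * np.log2(np.e * (m - k) / t) if t > 0 else 0.0
    return (2.0 / (m - k - t)) * (
        (k + 1) * np.log2(4 * np.e * (m - k - t) / (k + 1))
        + t_term
        + k * np.log2(np.e * m / k)
        + np.log2(2 * m ** 2 / delta))
```

Sweeping k and stopping once kmp_vc_bound stops decreasing is one way to realise the automated stopping rule suggested by Figure 2.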
Hence, this yields a more automated training procedure.

4 Extensions

The same approach that we have used for bounding the performance of kernel matching pursuit can be used to bound a matching pursuit version of kernel canonical correlation analysis (KCCA) [6]. By choosing the basis vectors greedily to optimise the quotient

    max_i ρ_i = (e_i'K_x K_y e_i) / √(e_i'K_x² e_i · e_i'K_y² e_i),

and proceeding in the same manner as Algorithm 1 by deflating after each pair of basis vectors is chosen, we create a sparsely defined subspace within which we can run the standard CCA algorithm. This again means that the overall algorithm fails to be a compression scheme, as side information is required. However, we can use the same approach described for KMP to bound the expected fit of the projections from the two views. The resulting bound has the following form.

Theorem 4. Fix α ∈ R, α > 0. Let A be the SKCCA algorithm, m the size of the paired training sets S_{X×Y} and k the cardinality of the set i of chosen basis vectors. Let S_{X×Y} be reordered so that the last m − k paired data points are outside of the set i, and define t = Σ_{i=m−k}^m I(|f_x(x_i) − f_y(y_i)| > α) to be the number of errors for those points in S_{X×Y} \ S_{X×Y}^i, where f_x is the projection function of the X view and f_y the projection function of the Y view. Then with probability 1 − δ over the generation of the paired training sets S_{X×Y}, the expected loss E[ℓ(·)] of algorithm A can be bounded by

    E[ℓ(A(S))] ≤ (2/(m − k − t)) [ (k + 1) log( 4e(m − k − t)/(k + 1) ) + t log( e(m − k)/t ) + k log( em/k ) + log( 2m²/δ ) ].

5 Discussion

Matching pursuit is a meta-scheme for creating learning algorithms for a variety of tasks. We have presented novel techniques that make it possible to analyse this style of algorithm using a combination of compression scheme ideas and more traditional learning theory. We have shown that sparse KPCA is in fact a compression scheme and demonstrated bounds that are able to accurately guide dimension selection in some cases. We have also used the techniques to bound the performance of the kernel matching pursuit (KMP) algorithm and, to reinforce the generality of the approach, indicated how it can be extended to a matching pursuit version of KCCA.

The results in this paper imply that the performance of any learning algorithm from the matching pursuit family can be analysed using a combination of sparse and traditional learning bounds. The bounds give a general theoretical justification of the framework and suggest potential applications of matching pursuit methods to other learning tasks such as novelty detection, ranking and so on.

Acknowledgements

The work was sponsored by the PASCAL network of excellence and the SMART project.

References

[1] M. Anthony. Partitioning points by parallel planes. Discrete Mathematics, 282:17-21, 2004.
[2] S. Floyd and M. Warmuth. Sample compression, learnability, and the Vapnik-Chervonenkis dimension. Machine Learning, 21(3):269-304, 1995.
[3] N. Littlestone and M. K. Warmuth.
Relating data compression and learnability. Technical report, University of California Santa Cruz, Santa Cruz, CA, 1986.
[4] S. Mallat and Z. Zhang. Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing, 41(12):3397-3415, 1993.
[5] B. Schölkopf and A. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.
[6] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge, UK, 2004.
[7] J. Shawe-Taylor, C. K. I. Williams, N. Cristianini, and J. Kandola. On the eigenspectrum of the Gram matrix and the generalization error of kernel-PCA. IEEE Transactions on Information Theory, 51(7):2510-2522, 2005.
[8] A. J. Smola and B. Schölkopf. Sparse greedy matrix approximation for machine learning. In Proceedings of the 17th International Conference on Machine Learning, pages 911-918. Morgan Kaufmann, San Francisco, CA, 2000.
[9] V. N. Vapnik and A. Y. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16(2):264-280, 1971.
[10] P. Vincent and Y. Bengio. Kernel matching pursuit. Machine Learning, 48:165-187, 2002.
[11] C. K. I. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems, volume 13, pages 682-688. MIT Press, 2001.", "award": [], "sourceid": 314, "authors": [{"given_name": "Zakria", "family_name": "Hussain", "institution": null}, {"given_name": "John", "family_name": "Shawe-taylor", "institution": null}]}