{"title": "Sparse Greedy Minimax Probability Machine Classification", "book": "Advances in Neural Information Processing Systems", "page_first": 105, "page_last": 112, "abstract": "", "full_text": "Sparse Greedy Minimax Probability Machine Classification\n\nThomas R. Strohmann\nDepartment of Computer Science\nUniversity of Colorado, Boulder\nstrohman@cs.colorado.edu\n\nAndrei Belitski\nDepartment of Computer Science\nUniversity of Colorado, Boulder\nAndrei.Belitski@colorado.edu\n\nGregory Z. Grudic\nDepartment of Computer Science\nUniversity of Colorado, Boulder\ngrudic@cs.colorado.edu\n\nDennis DeCoste\nMachine Learning Systems Group\nNASA Jet Propulsion Laboratory\ndecoste@aig.jpl.nasa.gov\n\nAbstract\n\nThe Minimax Probability Machine Classification (MPMC) framework [Lanckriet et al., 2002] builds classifiers by minimizing the maximum probability of misclassification, and gives direct estimates of the probabilistic accuracy bound \u2126. The only assumption that MPMC makes is that good estimates of the means and covariance matrices of the classes exist. However, as with Support Vector Machines, MPMC is computationally expensive and requires extensive cross validation experiments to choose kernels and kernel parameters that give good performance. In this paper we address the computational cost of MPMC by proposing an algorithm that constructs nonlinear sparse MPMC (SMPMC) models by incrementally adding basis functions (i.e. kernels) one at a time \u2013 greedily selecting the next one that maximizes the accuracy bound \u2126. SMPMC automatically chooses both kernel parameters and feature weights without using computationally expensive cross validation. Therefore the SMPMC algorithm simultaneously addresses the problems of kernel selection and feature selection (i.e. feature weighting), based solely on maximizing the accuracy bound \u2126. 
Experimental results indicate that we can obtain reliable bounds \u2126, as well as test set accuracies that are comparable to state of the art classification algorithms.\n\n1 Introduction\n\nThe goal of a binary classifier is to maximize the probability that unseen test data will be classified correctly. Assuming that the test data is generated from the same probability distribution as the training data, it is possible to derive specific probability bounds for the case that the decision boundary is a hyperplane. The following result, due to Marshall and Olkin [1] and extended by Bertsimas and Popescu [2], provides the theoretical basis for assigning probability bounds to hyperplane classifiers:\n\nsup_{E[z]=\u00afz, Cov[z]=\u03a3_z} Pr{a^T z \u2265 b} = 1 / (1 + \u03c9\u00b2),  where \u03c9\u00b2 = inf_{a^T t \u2265 b} (t \u2212 \u00afz)^T \u03a3_z^{-1} (t \u2212 \u00afz)   (1)\n\nwhere a \u2208 R^d and b are the hyperplane parameters, z is a random vector, and t is an ordinary vector. Lanckriet et al. (see [3] and [4]) used the above result to build the Minimax Probability Machine for binary classification (MPMC). From (1) we note that the only relevant information required about the underlying probability distribution of each class is its mean and covariance matrix. No other estimates and/or assumptions are needed, which implies that the obtained bound (which we refer to as \u2126) is essentially distribution free, i.e. it holds for any distribution with a certain mean and covariance matrix.\n\nAs with other classification algorithms such as Support Vector Machines (SVM) (see [5]), the main disadvantage of current MPMC implementations is that they are computationally expensive (same complexity as SVM), and require extensive cross validation experiments to choose kernels and kernel parameters that give good performance on each data set. 
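As a quick numerical illustration of (1) (a sketch on synthetic data, not part of the paper): when a^T \u00afz < b, the infimum has the well known closed form \u03c9\u00b2 = (b \u2212 a^T \u00afz)\u00b2 / (a^T \u03a3_z a), so the worst-case probability 1/(1 + \u03c9\u00b2) can be computed directly and compared with the empirical tail frequency of one particular distribution that has the assumed moments. All names and numbers below are illustrative.

```python
import numpy as np

# Sketch illustrating bound (1); the hyperplane and moments are synthetic examples.
rng = np.random.default_rng(0)
a, b = np.array([1.0, 1.0]), 3.0
zbar, Sz = np.zeros(2), np.eye(2)          # assumed mean and covariance of z

# Closed form of the infimum in (1) when a @ zbar < b:
w2 = (b - a @ zbar) ** 2 / (a @ Sz @ a)
bound = 1.0 / (1.0 + w2)                   # worst-case Pr{a^T z >= b} over all
                                           # distributions with these moments
# Empirical tail frequency for one such distribution (here: a Gaussian):
z = rng.multivariate_normal(zbar, Sz, size=100_000)
freq = np.mean(z @ a >= b)
assert freq <= bound                       # the distribution-free bound holds
```

Any other distribution with the same mean and covariance would likewise stay below `bound`; that is exactly what makes the bound distribution free.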
The\ngoal of this paper is to propose a kernel based MPMC algorithm that directly addresses\nthese computational issues.\n\nTowards this end, we propose a sparse greedy MPMC (SMPMC) algorithm that ef\ufb01ciently\nbuilds classi\ufb01ers, while at the same time maintains the distribution free probability bound\nof MPM type algorithms. To achieve this goal, we propose to use an iterative algorithm\nwhich adds basis functions (i.e. kernels) one by one, to an initially \u201dempty\u201d model. We\nare considering basis functions that are induced by Mercer kernels, i.e. functions of the\nfollowing form f(z) = K\u03b3(z, zi) (where zi is an input vector of the training data). Bases\nare added in a greedy way: we select the particular zi that maximizes the MPMC objective\n\u2126. Furthermore, SMPMC chooses optimal kernel parameters that maximize this metric\n(hence the subscript \u03b3 in K\u03b3), including automatically weighting input features by \u03b3j \u2265\n0 for each kernel added, such that zi = (\u03b31z1, \u03b32z2, ..., \u03b3dzd) for d dimensional data.\nThe proposed SMPMC algorithm automatically selects kernels and re-weights features (i.e.\ndoes feature selection) for each new added basis function, by minimizing the error bound\n(i.e. maximizing \u2126). Thus the large computational cost of cross validation (typically used\nby SVM and MPMC) is avoided.\n\nThe paper is organized as follows: Section 2.1 reviews the standard MPMC; Sec-\ntion 2.2 describes the proposed sparse greedy MPMC algorithm (SMPMC); and Sec-\ntions 2.3-2.4 show how we can use sparse MPMC to determine optimal kernel pa-\nrameters.\nIn section 3 we compare our results to the ones described in the orig-\ninal MPMC paper (see [4]), showing the probability bounds and the test set ac-\ncuracies for different binary classi\ufb01cation problems.\nThe conclusion is presented\nin section 4. 
Matlab source code for the SMPMC algorithm is available online: http://nago.cs.colorado.edu/~strohman/papers.html\n\n2 Classification model\n\nIn this section we develop a sparse version of the Minimax Probability Machine for binary classification. We show that besides a significant reduction in computational cost, the SMPMC algorithm allows us to do automated kernel and feature selection.\n\n2.1 Minimax Probability Machine for binary classification\n\nWe will briefly describe the underlying concepts of the MPMC framework as developed by Lanckriet et al. (see [4]). The goal of MPMC is to find a decision boundary H(a, b) = {z | a^T z = b} such that the minimum probability \u2126_H of classifying future data correctly is maximized. If we assume that the two classes are generated from random vectors x and y, we can express this probability bound just in terms of the means and covariances of these random vectors:\n\n\u2126_H = inf_{x \u223c (\u00afx, \u03a3_x), y \u223c (\u00afy, \u03a3_y)} Pr{a^T x \u2265 b \u2227 a^T y \u2264 b}   (2)\n\nNote that we do not make any distributional assumptions other than that \u00afx, \u03a3_x, \u00afy, and \u03a3_y are bounded. Exploiting a theorem from Marshall and Olkin [1], it is possible to rewrite (2) as a closed form expression:\n\n\u2126_H = 1 / (1 + m\u00b2)   (3)\n\nwhere\n\nm = min_a \u221a(a^T \u03a3_x a) + \u221a(a^T \u03a3_y a)  s.t.  a^T (\u00afx \u2212 \u00afy) = 1   (4)\n\nThe optimal hyperplane parameter a_* is the vector that minimizes (4). 
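For concreteness, (4) can be solved numerically by eliminating the linear constraint. The following is a minimal numpy/scipy sketch on synthetic two-class data; all names are ours, and this is not the authors' implementation:

```python
import numpy as np
from scipy.optimize import minimize

# Synthetic two-class data; means and covariances estimated as in the MPMC setup.
rng = np.random.default_rng(0)
X = rng.normal(loc=1.0, scale=0.5, size=(200, 2))     # class x
Y = rng.normal(loc=-1.0, scale=0.5, size=(200, 2))    # class y
xbar, ybar = X.mean(axis=0), Y.mean(axis=0)
Sx, Sy = np.cov(X.T), np.cov(Y.T)

# Eliminate the constraint a^T (xbar - ybar) = 1 by writing a = a0 + F u,
# where a0 satisfies the constraint and the columns of F span its null space.
d = xbar - ybar
a0 = d / (d @ d)
F = np.linalg.svd(d.reshape(1, -1))[2][1:].T          # orthogonal complement of d

def m_of(u):
    a = a0 + F @ u
    return np.sqrt(a @ Sx @ a) + np.sqrt(a @ Sy @ a)  # objective of (4)

res = minimize(m_of, np.zeros(F.shape[1]))            # unconstrained problem
m = res.fun
a_star = a0 + F @ res.x
omega = 1.0 / (1.0 + m ** 2)   # eq. (3): lower bound on correct classification
```

By construction every candidate a satisfies the constraint exactly, so a generic unconstrained minimizer suffices.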
The hyperplane parameter b_* can then be computed as:\n\nb_* = a_*^T \u00afx \u2212 \u221a(a_*^T \u03a3_x a_*) / m   (5)\n\nA new data point z_new is classified according to sign(a_*^T z_new \u2212 b_*); if this yields +1, z_new is classified as belonging to class x, otherwise it is classified as belonging to class y.\n\n2.2 Sparse MPM classification\n\nOne of the appealing properties of Support Vector Machines is that their models typically rely only on a small fraction of the training examples, the so called support vectors. The models obtained from the kernelized MPMC, however, use all of the training examples (see [4]), i.e. the decision hyperplane will look like:\n\n\u2211_{i=1}^{N_x} a_i^(x) K(x_i, z) + \u2211_{i=1}^{N_y} a_i^(y) K(y_i, z) = b   (6)\n\nwhere in general all a_i^(x), a_i^(y) \u2260 0. This brings up the question whether one can construct sparse models for the MPMC where most of the coefficients a_i^(x) or a_i^(y) are zero. In this paper we propose to do this by starting with an initially \u201cempty\u201d model and then adding basis functions one by one. 
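For reference, a dense decision surface of the form (6) costs one kernel evaluation per training example at prediction time, which is exactly what sparsity avoids. A direct transcription (Gaussian kernel; coefficient values would come from a trained model, so everything here is a placeholder):

```python
import numpy as np

def rbf(u, v, gamma=1.0):
    # Gaussian kernel; the default width is an illustrative assumption.
    return np.exp(-gamma * np.sum((np.asarray(u) - np.asarray(v)) ** 2))

def mpmc_decision(z, Xs, Ys, ax, ay, b, gamma=1.0):
    """Evaluate decision rule (6): sign of the kernel expansion minus b."""
    s = sum(ai * rbf(xi, z, gamma) for ai, xi in zip(ax, Xs))
    s += sum(ai * rbf(yi, z, gamma) for ai, yi in zip(ay, Ys))
    return 1 if s - b >= 0 else -1     # +1 -> class x, -1 -> class y
```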
As we will see shortly, this approach speeds up both learning and evaluation time while still maintaining the distribution free probability bound of the MPMC.\n\nBefore we outline the algorithm we introduce some notation:\n\nN = N_x + N_y  the total number of training examples\n\u2113 = (\u2113_1, ..., \u2113_N)^T \u2208 {\u22121, 1}^N  the labels of the training data\n\u2113\u0302^(k) = (\u2113\u0302_1^(k), ..., \u2113\u0302_N^(k))^T \u2208 R^N  output of the model after adding the kth basis function\na^(k) = the MPMC hyperplane coefficients when adding the kth basis function\nb^(k) = the MPMC hyperplane offset when adding the kth basis function\nK_v = (K_v(v, x_1), ..., K_v(v, x_{N_x}), K_v(v, y_1), ..., K_v(v, y_{N_y}))^T  basis function evaluated on all training examples (empirical map)\nK_{x,v} = (K_v(v, x_1), ..., K_v(v, x_{N_x}))^T  evaluated only on positive examples\nK_{y,v} = (K_v(v, y_1), ..., K_v(v, y_{N_y}))^T  evaluated only on negative examples\n\nNote that \u2113\u0302^(k) is a vector of real numbers (the distances of the training data to the hyperplane before applying the sign function). v \u2208 R^d is the training vector generating the basis function K_v. We will simply write K^(k), K_x^(k), K_y^(k) for the kth basis function.\n\n(Footnote 1: Note that we use the same symbol K for both the empirical map and the induced function. It will always be clear from the context what K refers to.)\n\nFor the first basis we are solving the one dimensional MPMC:\n\nm = min_a \u221a(a \u03c3\u00b2_{K_x^(1)} a) + \u221a(a \u03c3\u00b2_{K_y^(1)} a)  s.t.  a (\u00afK_x^(1) \u2212 \u00afK_y^(1)) = 1   (7)\n\nwhere \u00afK_x^(1) and \u03c3\u00b2_{K_x^(1)} are the mean and variance of the vector K_x^(1) (which is the first basis function evaluated on all positive training examples). Because of the constraint the feasible region contains just one value for a^(1):\n\na^(1) = 1 / (\u00afK_x^(1) \u2212 \u00afK_y^(1))   (8)\n\nb^(1) = a^(1) \u00afK_x^(1) \u2212 \u03c3_{K_x^(1)} / (\u03c3_{K_x^(1)} + \u03c3_{K_y^(1)})   (9)\n\nThe first model then looks like: \u2113\u0302^(1) = a^(1) K^(1) \u2212 b^(1).\n\nAll of the subsequent models use the previous estimation \u2113\u0302^(k) as one input and the next basis K^(k+1) as the other input. We set up the two dimensional classification problem:\n\nx^(k+1) = [\u2113\u0302_x^(k), K_x^(k+1)] \u2208 R^{N_x \u00d7 2},  y^(k+1) = [\u2113\u0302_y^(k), K_y^(k+1)] \u2208 R^{N_y \u00d7 2}   (10)\n\nAnd solve the following optimization problem:\n\nm = min_a \u221a(a^T \u03a3_{x^(k+1)} a) + \u221a(a^T \u03a3_{y^(k+1)} a)  s.t.  a^T (\u00afx^(k+1) \u2212 \u00afy^(k+1)) = 1   (11)\n\nwhere \u00afx^(k+1) is the 2-dimensional mean vector (\u00af\u2113\u0302_x^(k), \u00afK_x^(k+1))^T and where \u03a3_{x^(k+1)} is the 2 \u00d7 2 sample covariance matrix of the vectors \u2113\u0302_x^(k) and K_x^(k+1). Let a^(k+1) = (a_1^(k+1), a_2^(k+1))^T be the optimal solution of (11). 
We set:\n\nb^(k+1) = a^(k+1)T \u00afx^(k+1) \u2212 \u221a(a^(k+1)T \u03a3_{x^(k+1)} a^(k+1)) / (\u221a(a^(k+1)T \u03a3_{x^(k+1)} a^(k+1)) + \u221a(a^(k+1)T \u03a3_{y^(k+1)} a^(k+1)))   (12)\n\nand obtain the next model as:\n\n\u2113\u0302^(k+1) = a_1^(k+1) \u2113\u0302^(k) + a_2^(k+1) K^(k+1) \u2212 b^(k+1)   (13)\n\nAs stated above, one computational advantage of SMPMC is that we typically use only a small number of training examples to obtain our final model (i.e. k << N). Another benefit is that we have to solve only one and two dimensional MPMC problems. As seen in (8) the one dimensional solution is trivial to compute. An analysis of the two dimensional problem shows that it can be reduced to the problem of finding the roots of a fourth order polynomial. Polynomials of degree 4 still have closed form solutions (see e.g. [6]) which can be computed efficiently. In the standard MPMC algorithm (see [4]), however, the solution a for equation (4) has N dimensions and can therefore only be found by expensive numerical methods.\n\nIt may seem that the values of \u2126 = 1/(1 + m\u00b2) which we obtain from (11) are not true for the whole model, since we are considering only two dimensional problems and not all of the k + 1 dimensions we have added so far through our basis functions. But it turns out that the \u201clocal\u201d bound (from the 2D MPMC) is indeed equal to the \u201cglobal\u201d bound (when considering all k + 1 dimensions). We state this fact more formally in the following theorem:\n\nTheorem 1: Let \u2113\u0302^(k) = c_0 + c_1 K^(1) + ... + c_k K^(k) be the sparse MPMC model at the kth iteration (k \u2265 1) and let a_1^(k+1), a_2^(k+1), b^(k+1) be the solution of the two dimensional MPMC: \u2113\u0302^(k+1) = a_1^(k+1) \u2113\u0302^(k) + a_2^(k+1) K^(k+1) \u2212 b^(k+1). Then the values of \u2126 for the two dimensional MPMC and for the k + 1 dimensional MPMC are the same.\n\nProof: see Appendix\n\n2.3 Selection of bases and Gaussian kernel widths\n\nIn our experiments we are using the Gaussian kernel:\n\nK_\u03c3(u, v) = exp(\u2212||u \u2212 v||\u00b2_2 / (2\u03c3\u00b2))   (14)\n\nwhere \u03c3 is the so called kernel width. As mentioned before, one typically has to choose \u03c3 manually or determine it by cross validation (see [4]). The SMPMC algorithm greedily selects a basis function \u2013 out of a randomly chosen candidate set \u2013 to maximize \u2126, which is equivalent to minimizing the value of m in (7) and (11). Before we state the optimization problem for the one and two dimensional MPMC we rewrite (14) so that we can get rid of the denominator:\n\nK_\u03b3(u, v) = exp(\u2212\u03b3 ||u \u2212 v||\u00b2_2),  \u03b3 \u2265 0   (15)\n\nThe optimization problem we solve for the first iteration is then:\n\nmin_\u03b3 m(\u03b3),  where m(\u03b3) = min_a \u221a(a \u03c3\u00b2_{K_x^(1)} a) + \u221a(a \u03c3\u00b2_{K_y^(1)} a)  s.t.  a (\u00afK_x^(1) \u2212 \u00afK_y^(1)) = 1   (16)\n\nNote that \u2013 even though we did not state it explicitly \u2013 the statistics \u03c3\u00b2_{K_x^(1)}, \u03c3\u00b2_{K_y^(1)}, \u00afK_x^(1), and \u00afK_y^(1) (and consequently the coefficient a) all depend on the kernel parameter \u03b3.\n\nThe two dimensional problem that has to be solved for all subsequent iterations k \u2265 2 turns into the following optimization problem for \u03b3:\n\nmin_\u03b3 m(\u03b3),  where m(\u03b3) = min_a \u221a(a^T \u03a3_{x^(k+1)} a) + \u221a(a^T \u03a3_{y^(k+1)} a)  s.t.  a^T (\u00afx^(k+1) \u2212 \u00afy^(k+1)) = 1   (17)\n\nAgain, \u00afx^(k+1), \u00afy^(k+1), \u03a3_{x^(k+1)}, and \u03a3_{y^(k+1)} all depend on the kernel parameter \u03b3, and from these four statistics we can compute the minimizer a \u2208 R\u00b2 analytically.\n\n2.4 Feature selection\n\nFor doing feature selection with Gaussian kernels one has to replace the uniform kernel width \u03b3 with a d dimensional vector \u03b3 of kernel weightings:\n\nK_\u03b3(u, v) = exp(\u2212\u2211_{l=1}^d \u03b3_l (u_l \u2212 v_l)\u00b2),  \u03b3_l \u2265 0, l = 1, ..., d   (18)\n\nNote that the optimization problems (16) and (17) for the one respectively two dimensional MPMC are now d dimensional instead of just one dimensional.\n\n3 Experiments\n\nIn this section we describe the results we obtained for SMPMC on various classification benchmarks. We used the same data sets as Lanckriet et al. in [4] for the standard MPMC. The data sets were randomly divided into 90% training data and 10% test data and the results were averaged over 50 runs for each of the five problems (see table 1). In all the experiments listed in table 1 we used the feature selection algorithm (with the exception of Sonar, where width selection was used) and had a candidate set of size 5, i.e. at each iteration the best basis out of 5 randomly chosen candidates was selected. The results we obtained are comparable to the ones reported by Lanckriet et al. [4]. Note that for all of the data sets SMPMC uses significantly fewer basis functions than MPMC does, which directly translates into an accordingly smaller evaluation cost. The differences in training cost are shown in table 2. The total training time for standard MPMC takes into account the 50-fold cross validation and 10 candidates for the kernel parameter. 
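The greedy step just described (a small random candidate set, with the kernel parameter \u03b3 of (15) optimized per candidate) is cheapest at the first iteration, where (7)\u2013(8) give m(\u03b3) in closed form. The following is an illustrative sketch of that first iteration on synthetic data, not the authors' Matlab code; all names are invented here:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Synthetic two-class data (sizes and separation are illustrative assumptions).
rng = np.random.default_rng(1)
X = rng.normal(1.0, 1.0, size=(100, 3))      # class x
Y = rng.normal(-1.0, 1.0, size=(100, 3))     # class y
Z = np.vstack([X, Y])

def m_1d(v, gamma):
    """m of eq. (7) for the basis K_gamma(., v), via the closed form (8)."""
    Kx = np.exp(-gamma * ((X - v) ** 2).sum(axis=1))   # basis on class x
    Ky = np.exp(-gamma * ((Y - v) ** 2).sum(axis=1))   # basis on class y
    a = 1.0 / (Kx.mean() - Ky.mean())                  # constraint fixes a
    return abs(a) * (Kx.std() + Ky.std())

# Candidate set of size 5: keep the basis (and width) with the smallest m.
best = None
for v in Z[rng.choice(len(Z), size=5, replace=False)]:
    res = minimize_scalar(lambda g: m_1d(v, g), bounds=(1e-3, 10.0),
                          method="bounded")            # width selection, eq. (16)
    if best is None or res.fun < best[0]:
        best = (res.fun, v, res.x)

m, v_star, gamma_star = best
omega = 1.0 / (1.0 + m ** 2)   # distribution-free bound for the one-basis model
```

Later iterations replace `m_1d` with the two dimensional problem (17), whose inner minimizer is still available analytically.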
We observe that for all of the five data sets the training cost of sparse MPMC is only a fraction of the one for standard MPMC.\n\nTable 1: Bound \u2126, test set accuracy (TSA), and number of bases (B) for sparse and standard MPMC\n\nDataset        | SMPMC \u2126      | SMPMC TSA    | B  | Std. MPMC \u2126  | Std. MPMC TSA | B\nTwonorm        | 86.4 \u00b1 0.1% | 98.3 \u00b1 0.4% | 25 | 91.3 \u00b1 0.1% | 95.7 \u00b1 0.5%  | 270\nBreast Cancer  | 90.9 \u00b1 0.1% | 96.8 \u00b1 0.3% | 50 | 89.1 \u00b1 0.1% | 96.9 \u00b1 0.3%  | 614\nIonosphere     | 77.7 \u00b1 0.2% | 91.6 \u00b1 0.5% | 25 | 89.3 \u00b1 0.2% | 91.5 \u00b1 0.7%  | 315\nPima Diabetes  | 38.2 \u00b1 0.1% | 75.4 \u00b1 0.7% | 50 | 32.5 \u00b1 0.2% | 76.2 \u00b1 0.6%  | 691\nSonar          | 78.5 \u00b1 0.2% | 86.4 \u00b1 1.0% | 80 | 99.9 \u00b1 0.1% | 87.5 \u00b1 0.9%  | 187\n\nTable 2: Training time (in seconds) for Matlab implementations of SMPMC and MPMC\n\nDataset        | # training examples | SMPMC training time | Std. MPMC one optimization | Std. MPMC total training time\nTwonorm        | 270 | 125.0 | 23.9  | 1199.2\nBreast Cancer  | 614 | 188.5 | 122.4 | 6123.2\nIonosphere     | 315 | 416.3 | 28.1  | 1404.3\nPima Diabetes  | 691 | 165.6 | 186.5 | 9324.2\nSonar          | 187 | 35.3  | 8.7   | 435.1\n\nThe two plots in figure 1 show what typical learning curves for sparse MPMC look like. As the number of basis functions increases, both the bound \u2126 and the test set accuracy start to go up and after a while stabilize. The stabilization point usually occurs earlier when one does full feature selection (a \u03b3 weight for each input dimension) instead of kernel width selection (one uniform \u03b3 for all dimensions). We also experimented with different sizes for the candidate set. 
The plots in figure 2 show what happens for 1, 5, and 10 candidates. The overall behavior is that the test set accuracy as well as the \u2126 value converge earlier for larger candidate sets (but note that a larger candidate set also increases the computational cost per iteration).\n\nAs seen in figure 1, feature selection usually gives better results in terms of both the bound \u2126 and the test set accuracy. Furthermore, a feature selection algorithm should indicate which features are relevant and which are not. We set up an experiment for the Twonorm data (which has 20 input features) where we added 20 additional noisy features that were not related to the output. The results are shown in figure 3 and demonstrate that the feature selection algorithm obtained from SMPMC is able to distinguish between relevant and irrelevant features.\n\n4 Conclusion & future work\n\nThis paper introduces a new algorithm (Sparse Minimax Probability Machine Classification - SMPMC) for building sparse classification models that provide a lower bound on the probability of classifying future data correctly. We have shown that the method of iteratively adding basis functions has significant computational advantages over the standard MPMC, while it still maintains the distribution free probability bound \u2126. Experimental results indicate that automated selection of kernel parameters, as well as automated feature selection (weighting), both key characteristics of the SMPMC algorithm, result in error rates that are competitive with those obtained by models where these parameters must be tuned by computationally expensive cross validation.\n\nFuture research on sparse greedy MPMC will focus on establishing a theoretical framework for a stopping criterion, determining when adding more basis functions (kernels) will not significantly reduce error rates and may lead to overfitting. 
Also, experiments have so far focused on using Gaussian kernels as basis functions. From the experience with other kernel algorithms, it is known that other types of kernels (polynomial, tanh) can yield better results for certain applications. Furthermore, our framework is not limited to Mercer kernels, and other types of basis functions are also worth investigating. Recent work by Crammer et al. [7] uses boosting to construct a suitable kernel matrix iteratively. An interesting open question is how this approach relates to sparse greedy MPMC.\n\nFigure 1: Bound \u2126 and test set accuracy (TSA) for width selection (WS) and feature selection (FS). Note that the accuracies are all higher than the corresponding bounds.\n\nFigure 2: Accuracy and bound for the Diabetes data set using 1, 5, or 10 basis candidates per iteration. Again, the \u2126 bound is a true lower bound on the test set accuracy.\n\nFigure 3: Average feature weighting for the Twonorm data set over 50 test runs. The first 20 features are the original inputs, the last 20 features are additional noisy inputs.\n\nReferences\n\n[1] A. W. Marshall and I. Olkin. Multivariate Chebyshev inequalities. Annals of Mathematical Statistics, 31(4):1001\u20131014, 1960.\n\n[2] I. Popescu and D. Bertsimas. Optimal inequalities in probability theory: A convex optimization approach. Technical Report TM62, INSEAD, Dept. Math. O.R., Cambridge, Mass, 2001.\n\n[3] G. R. G. Lanckriet, L. E. Ghaoui, C. Bhattacharyya, and M. I. 
Jordan. Minimax probability machine. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14, Cambridge, MA, 2002. MIT Press.\n\n[4] G. R. G. Lanckriet, L. E. Ghaoui, C. Bhattacharyya, and M. I. Jordan. A robust minimax approach to classification. Journal of Machine Learning Research, 3:555\u2013582, 2002.\n\n[5] B. Sch\u00f6lkopf and A. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.\n\n[6] William H. Beyer. CRC Standard Mathematical Tables, page 12. CRC Press Inc., Boca Raton, FL, 1987.\n\n[7] K. Crammer, J. Keshet, and Y. Singer. Kernel design using boosting. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 15, Cambridge, MA, 2003. MIT Press.\n\nAppendix: Proof of Theorem 1\n\nWe have to show that the values of m are equal for the two dimensional MPMC and the k + 1 dimensional MPMC. We will just show the equivalence for the first term \u221a(a^T \u03a3_x a); an analogous argumentation holds for the second term.\n\nFor the two dimensional MPMC we have the following for the term under the square root:\n\na^T \u03a3_{x^(k+1)} a = [a_1^(k+1)]\u00b2 \u03c3\u00b2_{\u2113\u0302_x^(k)} + 2 a_1^(k+1) a_2^(k+1) \u03c3_{\u2113\u0302_x^(k) K_x^(k+1)} + [a_2^(k+1)]\u00b2 \u03c3\u00b2_{K_x^(k+1)}   (19)\n\nNote that we can rewrite\n\n\u03c3\u00b2_{\u2113\u0302_x^(k)} = Cov(c_0 + c_1 K_x^(1) + ... + c_k K_x^(k), c_0 + c_1 K_x^(1) + ... + c_k K_x^(k)) = \u2211_{i=1}^k \u2211_{j=1}^k c_i c_j Cov(K_x^(i), K_x^(j))\n\n\u03c3_{\u2113\u0302_x^(k) K_x^(k+1)} = Cov(c_0 + c_1 K_x^(1) + ... + c_k K_x^(k), K_x^(k+1)) = \u2211_{i=1}^k c_i Cov(K_x^(i), K_x^(k+1))   (20)\n\nby using properties of the sample covariance (linearity, Cov(const, X) = 0).\n\nFor the k + 1 dimensional MPMC let us first determine the k + 1 coefficients:\n\n\u2113\u0302^(k+1) = a_1^(k+1) (c_0 + c_1 K^(1) + ... + c_k K^(k)) + a_2^(k+1) K^(k+1) \u2212 b^(k+1) = a_1^(k+1) c_1 K^(1) + ... + a_1^(k+1) c_k K^(k) + a_2^(k+1) K^(k+1) + a_1^(k+1) c_0 \u2212 b^(k+1)\n\nThe term under the square root then looks like:\n\n(a_1^(k+1) c_1, ..., a_1^(k+1) c_k, a_2^(k+1))^T [ \u03c3\u00b2_{K_x^(1)} ... \u03c3_{K_x^(1) K_x^(k)} \u03c3_{K_x^(1) K_x^(k+1)} ; ... ; \u03c3_{K_x^(k) K_x^(1)} ... \u03c3\u00b2_{K_x^(k)} \u03c3_{K_x^(k) K_x^(k+1)} ; \u03c3_{K_x^(k+1) K_x^(1)} ... \u03c3_{K_x^(k+1) K_x^(k)} \u03c3\u00b2_{K_x^(k+1)} ] (a_1^(k+1) c_1, ..., a_1^(k+1) c_k, a_2^(k+1))   (21)\n\nMultiplying out (21) and substituting according to the equations in (20) yields exactly expression (19) (which is the a^T \u03a3_x a term of the two dimensional MPMC). 
Since this equivalence holds likewise for the \u221a(a^T \u03a3_y a) term in m, we have shown that m (and therefore \u2126) is equal for the two dimensional and the k + 1 dimensional MPMC.\n", "award": [], "sourceid": 2470, "authors": [{"given_name": "Thomas", "family_name": "Strohmann", "institution": null}, {"given_name": "Andrei", "family_name": "Belitski", "institution": null}, {"given_name": "Gregory", "family_name": "Grudic", "institution": null}, {"given_name": "Dennis", "family_name": "DeCoste", "institution": null}]}