{"title": "Boosted Dyadic Kernel Discriminants", "book": "Advances in Neural Information Processing Systems", "page_first": 761, "page_last": 768, "abstract": null, "full_text": "Boosted Dyadic Kernel Discriminants\n\nBaback Moghaddam\n\nMitsubishi Electric Research Laboratory\n\n201 Broadway\n\nCambridge MA 02139 USA\n\nbaback@merl.com\n\nGregory Shakhnarovich\n\nMIT AI Laboratory\n\n200 Technology Square\n\nCambridge MA 02139 USA\n\ngregory@ai.mit.edu\n\nAbstract\n\nWe introduce a novel learning algorithm for binary classi(cid:12)cation\nwith hyperplane discriminants based on pairs of training points\nfrom opposite classes (dyadic hypercuts). This algorithm is further\nextended to nonlinear discriminants using kernel functions satisfy-\ning Mercer\u2019s conditions. An ensemble of simple dyadic hypercuts is\nlearned incrementally by means of a con(cid:12)dence-rated version of Ad-\naBoost, which provides a sound strategy for searching through the\n(cid:12)nite set of hypercut hypotheses. In experiments with real-world\ndatasets from the UCI repository, the generalization performance\nof the hypercut classi(cid:12)ers was found to be comparable to that of\nSVMs and k-NN classi(cid:12)ers. Furthermore, the computational cost\nof classi(cid:12)cation (at run time) was found to be similar to, or bet-\nter than, that of SVM. Similarly to SVMs, boosted dyadic kernel\ndiscriminants tend to maximize the margin (via AdaBoost).\nIn\ncontrast to SVMs, however, we o(cid:11)er an on-line and incremental\nlearning machine for building kernel discriminants whose complex-\nity (number of kernel evaluations) can be directly controlled (traded\no(cid:11) for accuracy).\n\n1\n\nIntroduction\n\nThis paper introduces a novel algorithm for learning complex binary classi(cid:12)ers by\nsuperposition of simpler hyperplane-type discriminants. 
In this algorithm, each of the simple discriminants is based on the projection of a test point onto a vector joining a dyad, defined as a pair of training data points with opposite labels. The learning algorithm itself is based on a real-valued variant of AdaBoost [7], and the hyperplane classifiers use kernels of the type used, e.g., by support vector machines (SVMs) [9] for mapping linearly non-separable problems to high-dimensional feature spaces.\n\nWhen the concept class consists of linear discriminants (hyperplanes), this amounts to using a hyperplane orthogonal to the vector connecting the points in a dyad. We shall refer to such a classifier as a hypercut. By applying the same notion of linear hypercuts to a nonlinearly transformed feature space obtained by Mercer-type kernels [3], we are able to implement nonlinear kernel discriminants similar in form to SVMs.\n\nIn each iteration of AdaBoost, the space of all dyadic hypercuts is searched. It can easily be shown that this hypothesis space spans the subspace of the data and that it must include the optimal hyperplane discriminant. This notion is readily extended to nonlinear classifiers obtained by kernel transformations, by noting that in the feature space the optimal discriminant resides in the span of the transformed data. Therefore, for both linear and nonlinear classification, searching the space of dyadic hypercuts forms an efficient strategy for exploring the space of all hypotheses.\n\n1.1 Related work\n\nThe most general framework to consider is the theory of potential functions for pattern classification [1], in which potential fields¹ of the form\n\nH(x) = Σ_i α_i y_i K(x, x_i)    (1)\n\nare thresholded to predict classification labels, ŷ = sign(H(x)). 
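As a concrete illustration of (1), such a potential-field classifier can be evaluated directly from stored coefficients. The sketch below is ours, not part of the original formulation; the Gaussian kernel and all names (gaussian_kernel, sigma, potential_classify) are illustrative assumptions.\n\n```python\nimport numpy as np\n\ndef gaussian_kernel(x, xi, sigma=1.0):\n    # K(x, x_i) = exp(-||x - x_i||^2 / (2 sigma^2)), a Mercer kernel\n    return np.exp(-np.sum((x - xi) ** 2) / (2.0 * sigma ** 2))\n\ndef potential_classify(x, alphas, ys, xs, kernel=gaussian_kernel):\n    # H(x) = sum_i alpha_i * y_i * K(x, x_i); the label is sign(H(x))\n    H = sum(a * y * kernel(x, xi) for a, y, xi in zip(alphas, ys, xs))\n    return 1 if H >= 0 else -1\n```\n\nAny kernel satisfying Mercer's conditions could be substituted for the Gaussian here. 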
In a probabilistic kernel regression framework recently proposed in [5], the coefficients α that minimize the classification error are obtained by maximizing\n\nJ(α) = -(1/2) Σ_{i,j} α_i α_j y_i y_j K(x_i, x_j) + Σ_i F(α_i),    (2)\n\nwhere the potential function F is concave and continuous (corresponding to positive semi-definite kernels). This framework subsumes SVMs, which correspond to the simplest case F(α) = α. Generalized linear models [6] can also be shown to be members of this class by considering logistic regression, where F(α) becomes the binary entropy function and K is related to the covariance function of a Gaussian process classifier for the GLM's intermediate variables.\n\nIn this paper we propose and design classifiers with dyadic discriminants, which have potential functions of the form\n\nH(x) = Σ_t α_t [K(x, x_t^p) - K(x, x_t^n)],    (3)\n\nwhere x^p and x^n are positively and negatively labeled data, respectively. The coefficients α_t are determined not by minimizing a convex quadratic function J(α) but rather by selecting an optimal classifier in the t-th iteration of AdaBoost. Thus the potential function is constrained to the form of a weighted sum of dyadic hypercuts, or differences of kernel functions. Another way to view this is to think of a pair of opposite-polarity \"basis vectors\" sharing the same coefficient α_t.\n\nThe most closely related potential function technique to ours is that of SVMs [9], where the classification margin (and thus the bound on generalization) is maximized by a simultaneous optimization with respect to all of the training points. However, there are important differences between SVMs and our iterative hypercut algorithm. 
In each step of the boosting process, we do not maximize the margin of the resulting strong classifier directly, which makes for a much simpler optimization task. Meanwhile, we are assured that with AdaBoost we tend to maximize (although in an asymptotic sense) the margin of the final classifier [7].\n\n¹The physical analogy here is to the linear superposition of electrostatic charges of strength α_i, polarity y_i and location x_i, with distance defined by the kernel K.\n\nThe most important difference that distinguishes our method from SVMs (and, by extension, from the general kernel discriminant family described above) is that the points in our dyads are not typically located near the decision boundary, as is the case with support vectors. As a result, the final set of \"basis vectors\" used by the boosted strong classifier can be viewed as a representative subset of the data (i.e., those points needed for classification), whereas with SVMs the support vectors are simply the minimal number of training points needed to build (support) the decision boundary and are almost certainly not \"typical\" or high-likelihood members of either class.²\n\nThe classification complexity of a kernel-based classifier, i.e. the cost of classifying a test point, depends on the number of kernel function evaluations on which the classifier is based. In the case of SVMs, there is (usually) no direct way of controlling this number (the quadratic programming solution will automatically determine all positive Lagrange multipliers). In our boosted hypercut algorithm, however, the number of dyadic \"basis vectors\", and therefore of the required kernel evaluations, is determined by the number of iterations of the boosting algorithm and can therefore be controlled. 
Note that we are not referring here to the complexity of training classifiers, only to their run-time computational cost.\n\n2 Methodology\n\nConsider a binary classification task where we are given a training set of vectors T = {x_1, ..., x_M} where x ∈ R^N, with corresponding labels {y_1, ..., y_M} where y ∈ {-1, +1}. Let there be M_p samples with label +1 and M_n samples with label -1, so that M = M_p + M_n. Consider a simple linear hyperplane classifier defined by a discriminant function of the form\n\nf(x) = w · x + b    (4)\n\nwhere sign(f(x)) ∈ {+1, -1} gives the binary classification.\n\nUnder certain assumptions, Gaussianity in particular, the optimal hyperplane, specified by the projection w* and bias b*, is easily computed using standard statistical techniques based on class means and sample covariances for linear classifiers. However, in the absence of such assumptions, one must resort to searching for the optimal hyperplane. When searching for w*, an efficient strategy is to consider only hyperplanes whose surface normal is parallel to the line joining a dyad (x_i, x_j):\n\nw_ij = (x_i - x_j)/c,   y_i ≠ y_j,   i < j    (5)\n\nwhere y_i ≠ y_j by definition, i < j for uniqueness, and c is a scale factor. The vector w_ij is parallel to the line segment connecting the points in the dyad. Setting c = ||x_i - x_j|| makes w_ij a unit-norm direction vector.\n\nThe hypothesis space to be searched consists of |{w_ij}| = M_p M_n hypercuts, each having a free bias parameter b_ij, which is typically determined by minimizing the weighted classification error (as we shall see in the next section). Each hypothesis is then given by the sign of the discriminant as in (4):\n\nh_ij(x) = sign(w_ij · x + b_ij)    (6)\n\nLet {h_ij} = {w_ij, b_ij} denote the complete set of hypercuts for a given training set. 
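For concreteness, the search over the M_p·M_n hypercuts of (5)-(6) can be sketched as follows. This is our illustration: choosing the bias among midpoints of the sorted projections is an assumption, since the text only requires that b_ij minimize the weighted error.\n\n```python\nimport numpy as np\n\ndef best_hypercut(X, y, D):\n    # X: (M, N) data, y: labels in {-1, +1}, D: sample weights summing to 1.\n    # For each dyad (i, j) with y_i != y_j, i < j, the hyperplane normal is\n    # w_ij = (x_i - x_j)/||x_i - x_j||; the bias is chosen to minimize the\n    # D-weighted error over candidate thresholds (midpoints of projections).\n    M = len(y)\n    best = None\n    for i in range(M):\n        for j in range(i + 1, M):\n            if y[i] == y[j]:\n                continue\n            w = X[i] - X[j]\n            w = w / np.linalg.norm(w)\n            proj = X @ w\n            s = np.sort(proj)\n            for t in (s[:-1] + s[1:]) / 2.0:  # candidate thresholds\n                pred = np.sign(proj - t)      # h(x) = sign(w.x - t)\n                err = np.sum(D[pred != y])\n                if best is None or err < best[0]:\n                    best = (err, w, -t)       # store b so h(x) = sign(w.x + b)\n    err, w, b = best\n    return w, b, err\n```\n\nA brute-force scan like this is what each boosting iteration performs over the finite hypothesis set. 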
Strictly speaking, this set is uncountable, since b_ij is continuous and arbitrary. However, since we always select one bias parameter for each hypercut w_ij, we do in fact end up with only M_p M_n classifiers.\n\n²Although unrelated to our technique, the Relevance Vector Machine [8] is another kernel learning algorithm that tends to produce \"prototypical\" basis vectors in the interior, as opposed to the boundary, of the distributions.\n\n2.1 AdaBoost\n\nThe AdaBoost algorithm [4] provides a practical framework for combining a number of weak classifiers into a strong final classifier by means of linear combination and thresholding. AdaBoost works by maintaining over the training set an iteratively evolving distribution (weights) D_t(i) based on the difficulty of classification (i.e., points which are harder to classify have greater weight). Consequently, a \"weak\" hypothesis h(x): x → {+1, -1} will have classification error ε_t weighted by D_t. In our case, in each iteration t, we select from the complete set of M_p M_n hypercuts {h_ij} the one which minimizes ε_t. The data are then re-weighted based on their (mis)classification to obtain an updated distribution D_{t+1}.\n\nThe final classifier is a linear combination of the selected weak classifiers h_t and has the form of a weighted \"voting\" scheme\n\nH(x) = sign( Σ_{t=1}^{T} α_t h_t(x) )    (7)\n\nwhere α_t = (1/2) ln((1 - ε_t)/ε_t). In [7] a framework was developed where h_t(x) can be real-valued (as opposed to binary) and is interpreted as a \"confidence-rated prediction.\" The sign of h_t(x) is the predicted label, while the magnitude |h_t(x)| is the confidence. 
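The selection-and-reweighting loop described above can be sketched as follows. This is our illustration; it assumes (as the original does not spell out in code) that the predictions of all candidate hypercuts on the training set have been precomputed into a matrix.\n\n```python\nimport numpy as np\n\ndef adaboost_select(P, y, T):\n    # P: (L, M) matrix with P[l, m] = h_l(x_m) in {-1, +1}, the predictions of\n    # the L candidate hypercuts on the M training points; y: true labels.\n    # Returns the indices of the selected weak classifiers and their weights.\n    L, M = P.shape\n    D = np.ones(M) / M\n    chosen, alphas = [], []\n    for _ in range(T):\n        # weighted error of every candidate under the current distribution D\n        errs = (P != y) @ D\n        l = int(np.argmin(errs))\n        eps = max(errs[l], 1e-12)          # guard against zero error\n        alpha = 0.5 * np.log((1 - eps) / eps)\n        chosen.append(l)\n        alphas.append(alpha)\n        # re-weight: misclassified points gain weight, then renormalize\n        D = D * np.exp(-alpha * y * P[l])\n        D = D / D.sum()\n    return chosen, alphas\n```\n\nThe strong classifier is then sign(Σ_t α_t h_t(x)) over the chosen hypercuts, as in (7). 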
For such real-valued classifiers we have\n\nα_t = (1/2) ln((1 + r_t)/(1 - r_t))    (8)\n\nwhere the \"correlation\" r_t = Σ_i D_t(i) y_i h_t(x_i) is inversely related to the error by ε_t = (1 - r_t)/2.\n\n2.2 Nonlinear Hypercuts\n\nThe logical extension beyond the boosted linear dyadic discriminants described in the previous section is that of nonlinear discriminants using positive definite kernels, as suggested in [3] for use with SVMs. In the resulting \"reproducing kernel Hilbert spaces\", dot products between high-dimensional mappings Φ(x): X → F are easily evaluated using Mercer kernels\n\nk(x, x′) = Φ(x) · Φ(x′).    (9)\n\nThis has the desirable property that any algorithm based on dot products, e.g. our linear hypercut classifier (6), can first nonlinearly transform its inputs (using kernels) and implicitly perform dot products in the transformed space. The pre-image of the linear hyperplane solution back in the input space is thus a nonlinear hypersurface.\n\nApplying the above kernel property to the hypercut concept (5), we can rewrite it in nonlinear form by considering the linear hypercut in the transformed space F, where the projection operator is\n\nw_ij = Φ(x_i) - Φ(x_j),   y_i ≠ y_j,   i < j    (10)\n\n(we have absorbed the scale constant c in (5) into w_ij for simplicity in this case).³ Due to the implicit nature of the nonlinear mapping, we cannot directly evaluate w_ij. However, we only need its dot product with the transformed input vectors Φ(x).\n\n³Since the optimal projection w*_ij must lie in the span of {Φ(x_i)}, we should restrict the search for an optimal hyperplane accordingly, e.g. by considering pair-wise hypercuts. 
Considering the linear discriminant (4) and substituting the above, we obtain\n\nf_ij(x) = (Φ(x_i) - Φ(x_j)) · Φ(x) + b_ij,    (11)\n\nwhich, by applying the kernel property (9), is equivalent to\n\nf_ij(x) = k(x, x_i) - k(x, x_j) + b_ij.    (12)\n\nNote that f_ij now represents a single dyadic term in the potential function introduced in (3). The binary-valued hypercut classifier is given by a simple thresholding\n\nh_ij(x) = sign(f_ij(x)).    (13)\n\nA \"confidence-rated\" classifier with output in the range [-1, +1] can be obtained by passing f_ij through a bipolar sigmoidal nonlinearity such as a hyperbolic tangent\n\nh_ij(x) = tanh(β f_ij(x))    (14)\n\nwhere β determines the \"slope\" of the sigmoid. We note that in order to obtain a continuous-valued hypercut classifier that suitably occupies the range [-1, +1], it may be necessary to experiment and adjust both constants c and β.\n\nThe final classifier constructed by AdaBoost, following (7), is given by\n\nH(x) = sign( Σ_{t=1}^{T} α_t tanh(β [k(x, x_i^t) - k(x, x_j^t) + b_ij^t]) ),    (15)\n\nwhere we have superscripted the elements of f_ij selected in iteration t of boosting. Note that, besides the monotonic sigmoid and offset transformation, this form is essentially a (nonlinear) equivalent of the dyadic potential function of (3).\n\nAssume, without loss of generality, that an equal number N/2 of d-dimensional training points is available from each class; this defines O(N²) hypercuts. The values of f_ij(x) for each hypercut and each training point (12) can be computed only once, typically in O(d), and used in every iteration of the algorithm, making the setup cost for the algorithm O(dN³). Each iteration requires examination of all f_ij(x_k) and takes O(N³). 
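Evaluating a single kernel-space hypercut of the form (12) and (14) thus needs only two kernel evaluations per dyad. A minimal sketch (ours; the Gaussian kernel and the function names are assumptions for illustration):\n\n```python\nimport numpy as np\n\ndef rbf(x, z, sigma=1.0):\n    # Gaussian RBF kernel k(x, z)\n    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))\n\ndef hypercut_confidence(x, xi, xj, b, beta=1.0, kernel=rbf):\n    # f_ij(x) = k(x, x_i) - k(x, x_j) + b_ij          (12)\n    # h_ij(x) = tanh(beta * f_ij(x)), in [-1, +1]     (14)\n    f = kernel(x, xi) - kernel(x, xj) + b\n    return np.tanh(beta * f)\n```\n\nWith b = 0, points nearer the positive endpoint x_i of the dyad receive positive confidence, points nearer x_j negative confidence. 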
To summarize, the cost of learning a classifier with K dyads is O((d + K)N³). It is important to note that both the setup step and the search for an optimal hypercut in each iteration are naturally parallelizable, leading to a reduction in time linear in the number of processors.\n\n3 Experiments\n\nBefore applying our algorithm to standard benchmarks, we illustrate nonlinear boosted dyadic hypercuts on a simple 2D \"toy\" problem. Consider a classification task on the dataset of 20 points (10 for each class) shown in Figure 1. The hypercuts algorithm (using Gaussian kernels) was able to separate the classes using two iterations (two cuts), as shown in Figure 1(a). Note how the dyads of training points (connected by dashed lines) define the discriminant boundary. For comparison, we used an SVM with Gaussian kernels on the same dataset, as shown in Figure 1(b). Although the SVM has a wider margin, the same would be expected from our algorithm with additional rounds of boosting.\n\nFigure 1: A toy problem: classification based on (a) hypercuts (2 dyads); (b) SVM (4 support vectors).\n\nThe computational cost of classifying a point can be directly compared in terms of the number of required kernel evaluations, which dominate the computation for high-dimensional data and kernels like Gaussians. For SVMs, this is the number of support vectors. For hypercuts, this is the number of distinct training points in the selected dyads. After n rounds of boosting this number is bounded by 2n, since a point can participate in multiple dyads. 
For instance, the SVM in Figure 1 requires 4 kernel evaluations, compared to 3 for the boosted hypercuts.\n\n3.1 Experiments with real data sets\n\nWe evaluated the performance of the dyadic hypercuts algorithm on a number of real-world data sets from the UCI repository [2], and compared it to that of two established classification methods: SVM with Gaussian RBF kernel and k-nearest-neighbor (k-NN). We chose sets large enough for reasonable training/validation/test partitioning, and that represent binary (or easily converted to binary) classification problems.\n\nDataset      N    d    k-NN         SVM          #SV       Hypercuts    #k.ev.\nHeart        90   13   .196 ±.042   .202 ±.038   62 ±10    .202 ±.030   50 ±12\nIonosphere   120  34   .168 ±.024   .064 ±.018   73 ±7     .083 ±.022   63 ±7\nWBC          200  9    .034 ±.011   .032 ±.008   50 ±26    .028 ±.007   30 ±12\nWPBC         65   32   .250 ±.024   .243 ±.006   63 ±3     .253 ±.025   41 ±5\nWDBC         190  30   .044 ±.015   .035 ±.013   67 ±15    .038 ±.014   47 ±12\nWine         60   13   .053 ±.030   .032 ±.022   40 ±9     .040 ±.026   23 ±4\nSpam         150  57   .159 ±.025   .123 ±.016   101 ±8    .116 ±.019   52 ±5\nSonar        70   60   .227 ±.041   .226 ±.037   66 ±3     .202 ±.045   110 ±16\nPima         200  8    .267 ±.024   .244 ±.014   129 ±7    .260 ±.017   73 ±15\n\nTable 1: The results of the experiments described in Section 3.1. N is the size of the training set, d the dimension, #SV the number of support vectors for the SVM, and #k.ev. the number of kernel evaluations required by a boosted hypercuts classifier. Means and standard deviations over 30 trials are reported for each data set. 
WBC, WPBC and WDBC are the Wisconsin Breast Cancer, Prognosis and Diagnosis data sets, respectively.\n\nIn each experiment, the data set was randomly partitioned into training, validation and test sets of similar sizes. The validation set was used to \"tune\" the parameters of each of the classifiers (k for k-NN, σ for the RBF kernels of SVMs and hypercuts) by choosing, from a suitable range, the parameter value with the lowest error on the validation set. Each of the three classifiers was then trained with the chosen parameter on the training set, and tested on the test set.\n\nFor each data set the above experiment was repeated 30 times. The columns of Table 1, left to right, show the following, with means and standard deviations over the 30 trials for each dataset: size of the training set, dimension, the test error of k-NN, the test error of SVM, the number of support vectors, the test error of hypercuts, and the number of kernel evaluations in the final hypercuts classifier.\n\nFigure 2: An example of the progress of training (dotted line) and test (solid line) error in a run of the hypercuts algorithm with an RBF kernel on the Spam data. The number of kernel evaluations in the combined classifier (27, 58, 72 and 78 k.ev.) is shown for indicated points in the run. The dashed line shows the test error of the SVM with an RBF kernel (96 support vectors).\n\nDataset   10%    25%    50%\nHeart     .202   .197   .200\nIon.      .178   .094   .113\nWBC       .028   .028   .028\nWPBC      .302   .266   .269\nWDBC      .365   .383   .384\nWine      .064   .043   .051\nSpam      .142   .117   .124\nSonar     .248   .214   .233\nPima      .263   .269   .268\n\nTable 2: Test error as a function of the number of kernel evaluations allowed by the user; the percentage values are relative to the number of SVs in each experiment. Averaged over 30 trials for each data set.\n\nThe size of the hypercuts classifier can be controlled via the number of AdaBoost iterations, thus affecting the accuracy of the classifier. In our experiments boosting was stopped after a prolonged plateau in the training error was observed; in some cases, further continuation of boosting could lead to better results.\n\n3.2 Discussion\n\nThe most important conclusion from these empirical results is that for all data sets, the RBF boosted dyadic hypercuts achieve test performance statistically equivalent to that of SVMs⁴, and usually better than that of k-NN classifiers, while the complexity of the trained classifier is typically lower (in some cases, shown in bold in Table 1, the difference in complexity is significant).\n\nIn addition, our experiments demonstrate the trade-off between the complexity and accuracy of the hypercuts. Figure 2 shows an example run of the hypercuts algorithm on the Spam data set, with 150 training points. After 24 iterations, the test error of the final classifier becomes consistently lower than that of an SVM trained on the same training set, which found 96 support vectors. At that point the classifier requires 27 kernel evaluations (about 28% of the number of SVs). The following 115 iterations achieve a further improvement of only 1.8% in test error, while increasing the required number of kernel evaluations to 78. Here the automatic criterion stopped AdaBoost after no significant improvement in training error was observed for 25 iterations. But the user can instead specify a desired bound on the complexity of the classifier. 
Table 2 shows the behavior of the test error as a function of the number of kernel evaluations used by the classifier, averaged over all 30 trials. For some data sets, e.g. Heart and WBC, the hypercuts classifier with only 10% of the number of kernel evaluations in an SVM already achieves comparable test error.\n\n⁴i.e., the difference of the means is within one standard deviation from both sides.\n\n4 Conclusions\n\nThe contribution of this paper is two-fold. First, we proposed a family of simple discriminants (hypercuts), based on pairs of training points from opposite classes (dyads), and extended this family using a nonlinear mapping with Mercer-type kernels. Second, we have designed a greedy selection algorithm based on boosting with confidence-rated (real-valued) hypercut classifiers with continuous output in the interval [-1, +1].\n\nThis is a new kernel-based approach to classification. We have shown that this approach performs on par with SVMs, without having to solve large QP problems. In contrast to SVMs, our algorithm allows the user to trade off the classifier's computational complexity for its accuracy, and benefits from AdaBoost's exponential error convergence and the assurance of asymptotic margin maximization.\n\nThe generalization performance of our algorithm was evaluated on a number of data sets from the UCI repository, and demonstrated to be comparable to that of established state-of-the-art algorithms (SVMs, k-NN), often with reduced classification time and reduced classifier size. 
We emphasize this performance advantage, since in practical applications it is often desirable to minimize complexity even at the cost of increased training time.\n\nWe are currently looking into optimal strategies for sampling the hypothesis space (M_p M_n possible hypercuts) based on the distribution D_t(i), and into forming hypercuts that are not necessarily based on training samples but rather, for example, on cluster centroids or other points derived from the input distribution. This has the potential to dramatically reduce the computational cost of learning in the boosted hypercuts algorithm, thus making it even more attractive for practitioners.\n\nReferences\n\n[1] M. A. Aizerman, E. M. Braverman, and L. I. Rozonoer. Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control, 25:821-837, 1964.\n\n[2] C. L. Blake and C. J. Merz. UCI repository of machine learning databases. [http://www.ics.uci.edu/~mlearn/MLRepository.html], 1998.\n\n[3] B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers. In D. Haussler, editor, Proc. 5th Annual ACM Workshop on Computational Learning Theory, pages 144-152. ACM Press, 1992.\n\n[4] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119-139, 1997.\n\n[5] T. Jaakkola and D. Haussler. Probabilistic kernel regression models. In D. Heckerman and J. Whittaker, editors, Proc. of 7th International Workshop on AI and Statistics. Morgan Kaufmann, 1999.\n\n[6] P. McCullagh and J. Nelder. Generalized Linear Models. Chapman and Hall, London, 1983.\n\n[7] R. E. Schapire and Y. Singer. Improved boosting algorithms using confidence-rated predictions. In Proc. of 11th Annual Conf. on Computational Learning Theory, pages 80-91, 1998.\n\n[8] M. E. Tipping. The Relevance Vector Machine. 
In Advances in Neural Information Processing Systems 12, pages 652-658. MIT Press, 2000.\n\n[9] V. Vapnik. The Nature of Statistical Learning Theory. Springer, 1995.", "award": [], "sourceid": 2221, "authors": [{"given_name": "Baback", "family_name": "Moghaddam", "institution": null}, {"given_name": "Gregory", "family_name": "Shakhnarovich", "institution": null}]}