{"title": "Large-Margin Convex Polytope Machine", "book": "Advances in Neural Information Processing Systems", "page_first": 3248, "page_last": 3256, "abstract": "We present the Convex Polytope Machine (CPM), a novel non-linear learning algorithm for large-scale binary classification tasks. The CPM finds a large margin convex polytope separator which encloses one class. We develop a stochastic gradient descent based algorithm that is amenable to massive datasets, and augment it with a heuristic procedure to avoid sub-optimal local minima. Our experimental evaluations of the CPM on large-scale datasets from distinct domains (MNIST handwritten digit recognition, text topic, and web security) demonstrate that the CPM trains models faster, sometimes several orders of magnitude, than state-of-the-art similar approaches and kernel-SVM methods while achieving comparable or better classification performance. Our empirical results suggest that, unlike prior similar approaches, we do not need to control the number of sub-classifiers (sides of the polytope) to avoid overfitting.", "full_text": "Large-Margin Convex Polytope Machine\n\nAlex Kantchelian Michael Carl Tschantz Ling Huang\u2020\n\nPeter L. Bartlett Anthony D. Joseph\n\nJ. D. Tygar\n\nUC Berkeley \u2013 {akant|mct|bartlett|adj|tygar}@cs.berkeley.edu\n\n\u2020Datavisor \u2013 ling.huang@datavisor.com\n\nAbstract\n\nWe present the Convex Polytope Machine (CPM), a novel non-linear learning al-\ngorithm for large-scale binary classi\ufb01cation tasks. The CPM \ufb01nds a large margin\nconvex polytope separator which encloses one class. We develop a stochastic gra-\ndient descent based algorithm that is amenable to massive data sets, and augment\nit with a heuristic procedure to avoid sub-optimal local minima. 
Our experimental evaluations of the CPM on large-scale data sets from distinct domains (MNIST handwritten digit recognition, text topic, and web security) demonstrate that the CPM trains models faster, sometimes by several orders of magnitude, than state-of-the-art similar approaches and kernel-SVM methods, while achieving comparable or better classification performance. Our empirical results suggest that, unlike prior similar approaches, we do not need to control the number of sub-classifiers (sides of the polytope) to avoid overfitting.

1 Introduction

Many application domains of machine learning use massive data sets in dense medium-dimensional or sparse high-dimensional spaces. Some domains also require near real-time responses in both the prediction and the model training phases. These applications often deal with inherent non-stationarity; thus, the models need to be constantly updated in order to keep up with drift. Today, the de facto algorithm for binary classification tasks at these scales is linear SVM. Indeed, since Shalev-Shwartz et al. demonstrated both theoretically and experimentally that large-margin linear classifiers can be efficiently trained at scale using stochastic gradient descent (SGD), the Pegasos [1] algorithm has become a standard building tool for the machine learning practitioner.

We propose a novel algorithm for Convex Polytope Machine (CPM) separation exhibiting superior empirical performance to existing algorithms, with running times on a large data set that are up to five orders of magnitude faster. We conjecture that worst-case generalization bounds are independent of the number K of faces of the convex polytope, and we state a theorem giving a loose upper bound in terms of √K.

In theory, as the VC dimension of d-dimensional linear separators is d + 1, a linear classifier in very high dimension d is expected to have considerable expressive power.
This argument is often understood as "everything is separable in high-dimensional spaces; hence, linear separation is good enough". However, in practice, deployed systems rarely use a single naked linear separator. One explanation for this gap between theory and practice is that while the probability of a single hyperplane perfectly separating both classes in very high dimensions is high, the resulting classifier margin might be very small. Since the classifier margin also accounts for the generalization power, we might experience poor future classification performance in this scenario.

Figure 1a provides a two-dimensional example of a data set that has a small margin when using a single separator (dashed line) despite being linearly separable and intuitively easily classified. The intuition that the data is easily classified comes from the data naturally separating into three clusters, with two of them in the positive class. Such clusters can form due to the positive instances being generated by a collection of different processes.

Figure 1: Positive (•) and negative (◦) instances in continuous two-dimensional feature space. (a) Instances are perfectly linearly separable (dashed line), although with small margin due to positive instances (A & B) having conflicting patterns. We can obtain a higher margin by separately training two linear sub-classifiers (solid lines) on the left and right clusters of positive instances, each against all the negative instances, yielding a prediction value of the maximum of the sub-classifiers. (b) The worst-case margin is insensitive to wiggling of sub-classifiers having non-minimal margin. Sub-classifier 2 has the smallest margin, and sub-classifier 1 is allowed to move freely without affecting δWC. For comparison, the largest-margin solution 1′ is shown (dashed lines).

As Figure 1a shows, a way of increasing the margins is to introduce two linear separators (solid lines), one for each positive cluster. We take advantage of this intuition to design a novel machine learning algorithm that provides larger margins than a single linear classifier while still enjoying much of the computational effectiveness of a simple linear separator. Our algorithm learns a bounded number of linear classifiers simultaneously. The global classifier aggregates all the sub-classifier decisions by taking the maximum sub-classifier score. The maximum aggregation has the effect of assigning a positive point to a unique sub-classifier. The model class we have intuitively described above corresponds to convex polytope separators.

In Section 2, we present related work on convex polytope classifiers, and in Section 3, we define the CPM optimization problem and derive loose upper bounds. In Section 4, we discuss a Stochastic Gradient Descent-based algorithm for the CPM, and we perform a comparative evaluation in Section 5.

2 Related Work

Fischer focuses on finding the polygon with the fewest misclassified points drawn independently from an unknown distribution, using an algorithm with a running time of more than O(n^12), where n is the number of sample points [2]. We instead focus on finding good, not optimal, polygons that generalize well in practice despite having fast running times. Our focus on generalization leads us to maximize the margin, unlike this work, which actually minimizes it to make its proofs easier.

Takacs proposes algorithms for training convex polytope classifiers based on a smooth approximation of the maximum function [3].
While his algorithms use the smooth approximation during training, they use the original formula during prediction, which introduces a gap that could degrade accuracy. The proposed algorithms achieve classification accuracy similar to several nonlinear classifiers, including KNN, decision trees, and kernel SVM. However, the training time of the algorithms is often much longer than that of those nonlinear classifiers (e.g., an order of magnitude longer than the ID3 algorithm and eight times longer than kernel SVM on the CHESS data set), diminishing the motivation to use the proposed algorithms in realistic settings.

Wang et al. propose an Adaptive Multi-hyperplane Machine (AMM) algorithm that is fast during both training and prediction, and capable of handling nonlinear classification problems [4]. They develop an iterative algorithm based on the SGD method to search for the number of hyperplanes and train the model. Their experiments on several large data sets show that AMM is nearly as fast as the state-of-the-art linear SVM solver and achieves classification accuracy between linear and kernel SVMs.

Manwani and Sastry propose two methods for learning polytope classifiers, one based on the logistic function [5] and another based on the perceptron method [6], and propose alternating optimization algorithms to train the classifiers. However, they only evaluate the proposed methods on a few small data sets (with no more than 1000 samples in each), and do not compare them to other widely used (nonlinear) classifiers (e.g., KNN, decision tree, SVM). It is unclear how applicable these algorithms are to large-scale data.
Our work makes three significant contributions over their work: (1) deriving the formulation from a large-margin argument and obtaining a regularization term that is missing in [6]; (2) safely restricting the choice of assignments to only positive instances, leading to a training-time optimization heuristic; and (3) demonstrating higher performance on non-synthetic, large-scale data sets when using two CPMs together.

The CPM is a special case of the more general Latent SVM [7] formulation. In particular, the latent variable represents the sub-classifier, or face of the polytope, used for classifying the given instance.

3 Large-Margin Convex Polytopes

In this section, we derive and discuss several alternative optimization problems for finding a large-margin convex polytope which separates binary labeled points of R^d.

3.1 Problem Setup and Model Space

Let D = {(xi, yi)}_{1≤i≤n} be a binary labeled data set of n instances, where x ∈ R^d and y ∈ {−1, 1}. For the sake of notational brevity, we assume that the xi include a constant unitary component corresponding to a bias term. Our prediction problem is to find a classifier c : R^d → {−1, 1} such that c(xi) is a good estimator of yi. To do so, we consider classifiers constructed from convex K-faced polytope separators for a fixed positive integer K. Let PK be the model space of convex K-faced polytope separators:

    PK = { f : R^d → R | f(x) = max_{1≤k≤K} (Wx)_k, W ∈ R^{K×d} }

For each such function f in PK, we can get a classifier cf such that cf(x) is 1 if f(x) > 0 and −1 otherwise. This model space corresponds to a shallow single-hidden-layer neural network with a max aggregator. Note that when K = 1, P1 is simply the space of all linear classifiers.
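To make the model space concrete, the following minimal numpy sketch (our own illustration, not the paper's implementation; the weight matrix and points are made up for the example) shows prediction with a K-faced polytope separator:

```python
import numpy as np

def cpm_score(W, x):
    """Score of a K-faced convex polytope separator: f(x) = max_k (W x)_k."""
    return np.max(W @ x)

def cpm_classify(W, x):
    """Predict +1 if the point falls outside the polytope, -1 if inside."""
    return 1 if cpm_score(W, x) > 0 else -1

# Toy example: K = 2 faces in d = 2 dimensions (bias component omitted for brevity).
W = np.array([[1.0, 0.0],    # sub-classifier 1
              [0.0, 1.0]])   # sub-classifier 2
inside = np.array([-1.0, -1.0])   # negative score on every face -> inside the polytope
outside = np.array([1.0, -2.0])   # positive score on face 1 -> outside

print(cpm_classify(W, inside), cpm_classify(W, outside))  # -1 1
```

A point is labeled −1 exactly when every face scores it negatively, i.e., when it lies inside the intersection of the K half-spaces.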
Importantly, when K ≥ 2, elements of PK are not guaranteed to have additive inverses in PK. Thus, the labels y = −1 and y = +1 are not interchangeable. Geometrically, the negative class remains enclosed within the convex polytope while the positive class lives outside of it, leading to the label asymmetry. To construct a classifier without label asymmetry, we can use two polytopes, one with the negative instances on the inside of the polytope to get a classification function f− and one with the positive instances on the inside to get f+. From these two polytopes, we construct the classifier cf−,f+ where cf−,f+(x) is 1 if f−(x) − f+(x) > 0 and −1 otherwise.

To better understand the nature of the faces of a single polytope, for a given polytope W and a data point x, we denote by zW(x) the index of the maximum sub-classifier for x:

    zW(x) = argmax_{1≤k≤K} (Wx)_k

We call zW(x) the assigned sub-classifier for instance x. When clear from context, we drop W from zW. We use the notation Wk to designate the kth row of W, which corresponds to the kth face of the polytope, or the kth sub-classifier. Hence, Wz(x) identifies the separator assigned to x.

We now pursue a geometric large-margin-based approach for formulating the concrete optimization problem. To simplify notation and without loss of generality, we suppose that W is row-normalized such that ||Wk|| = 1 for all k. We also initially suppose our data set is perfectly separable by a K-faced convex polytope.

3.2 Margins for Convex Polytopes

When K = 1, the problem reduces to finding a good linear classifier, and only a single natural margin δ of the separator exists [8]:

    δW = min_{1≤i≤n} yi W1 xi

Maximizing δW yields the well-known (linear) Support Vector Machine. However, multiple notions of margin exist for a K-faced convex polytope with K ≥ 2.
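For the K = 1 case, this margin can be computed directly from the data; a small numpy sketch on made-up toy data (w is assumed unit-norm, matching the row-normalization above):

```python
import numpy as np

def linear_margin(w, X, y):
    """delta_W = min_i y_i <w, x_i> for a unit-norm linear separator w."""
    assert abs(np.linalg.norm(w) - 1.0) < 1e-12
    return np.min(y * (X @ w))

# Toy data: two points on each side of the separator x_1 = 0.
w = np.array([1.0, 0.0])
X = np.array([[0.5, 1.0], [2.0, -1.0], [-1.0, 0.0], [-0.25, 2.0]])
y = np.array([1, 1, -1, -1])

print(linear_margin(w, X, y))  # 0.25
```

A strictly positive value certifies that w perfectly separates the toy sample, with the returned quantity being the distance of the closest point to the hyperplane.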
We consider two. Let the worst-case margin δWC_W be the smallest margin of any point to the polytope. Over all the K sub-classifiers, we find the one with the minimal margin to the closest point assigned to it:

    δWC_W = min_{1≤i≤n} yi W_{z(xi)} xi = min_{1≤k≤K} min_{i:z(xi)=k} yi Wk xi

The worst-case margin is very similar to the linear classifier margin but suffers from an important drawback. Maximizing δWC leaves K − 1 sub-classifiers wiggling while over-focusing on the sub-classifier with the smallest margin. See Figure 1b for a geometrical intuition.

Thus, we instead focus on the total margin, which measures each sub-classifier's margin with respect to just its assigned points. The total margin δT_W is the sum of the K sub-classifier margins:

    δT_W = Σ_{k=1}^K min_{i:z(xi)=k} yi Wk xi

The total margin gives the same importance to each of the K sub-classifier margins.

3.3 Maximizing the Margin

We now turn to the question of maximizing the margin. Here, we provide an overview of a smoothed but non-convex optimization problem for maximizing the total margin. We would like to optimize the margin by solving the optimization problem

    max_W δT_W   subject to   ||W1|| = · · · = ||WK|| = 1                                  (1)

Introducing one additional variable ζk per classifier, problem (1) is equivalent to

    max_{W,ζ} Σ_{k=1}^K ζk   subject to   ∀i, ζ_{z(xi)} ≤ yi W_{z(xi)} xi                  (2)
                                          ζ1 > 0, . . . , ζK > 0
                                          ||W1|| = · · · = ||WK|| = 1

Considering the unnormalized rows Wk/ζk, we obtain the following equivalent formulation:

    max_W Σ_{k=1}^K 1/||Wk||   subject to   ∀i, 1 ≤ yi W_{z(xi)} xi                        (3)

When yi = −1 and z(xi) satisfy the margin constraint in (3), we have that the constraint holds for every sub-classifier k, since yi Wk xi is minimal at k = z(xi). Thus, when yi = −1, we can enforce the constraint for all k. We can also smooth the objective into a convex, defined-everywhere one by minimizing the sum of the inverse squares of the terms instead of maximizing the sum of the terms. We obtain the following smoothed problem:

    min_W Σ_{k=1}^K ||Wk||²   subject to   ∀i : yi = −1, ∀k ∈ {1, . . . , K}, 1 + Wk xi ≤ 0   (4)
                                           ∀i : yi = +1, 1 − W_{z(xi)} xi ≤ 0                 (5)

The objective of the above program is now the familiar L2 regularization term ||W||². The constraints (4), on the negative instances, are convex (linear functions), but the positive terms (5) result in non-convex constraints because of the instance-dependent assignment z. As for the Support Vector Machine, we can introduce n slack variables ξi and a regularization factor C > 0 for the common case of noisy, non-separable data. Hence, the practical problem becomes

    min_{W,ξ} ||W||² + C Σ_{i=1}^n ξi   subject to   ∀i : yi = −1, ∀k ∈ {1, . . . , K}, 1 + Wk xi ≤ ξi, ξi ≥ 0   (6)
                                                     ∀i : yi = +1, 1 − W_{z(xi)} xi ≤ ξi, ξi ≥ 0

Following the same steps, we obtain the following problem for maximizing the worst-case margin. The only difference is the regularization term in the objective function, which becomes max_k ||Wk||² instead of ||W||².

Discussion. The goal of our relaxation is to demonstrate that our solution involves two intuitive steps: (i) assigning positive instances to sub-classifiers, and (ii) solving a collection of SVM-like sub-problems. While our solution taken as a whole remains non-convex, this decomposition isolates the non-convexity to a single intuitive assignment problem that is similar to clustering. This isolation enables us to use intuitive heuristics or clustering-like algorithms to handle the non-convexity. Indeed, in our final form (6), if the optimal assignment function z(xi) of positive instances to sub-classifiers were known and fixed, the problem would reduce to a collection of perfectly independent convex minimization problems. Each such sub-problem corresponds to a classical SVM defined on all negative instances and the subset of positive instances assigned by z(xi).

3.4 Choice of K, Generalization Bound for CPM

Assuming we can efficiently solve this optimization problem, we would need to adjust the number K of faces and the degree C of regularization. The following result gives a preliminary generalization bound for the CPM. For B1, . . . , BK ≥ 0, let FK,B be the following subset of the set PK of convex polytope separators:

    FK,B = { f : R^d → R | f(x) = max_{1≤k≤K} (Wx)_k, W ∈ R^{K×d}, ∀k, ||Wk|| ≤ Bk }

Theorem 1. There exists some constant A > 0 such that for all distributions P over X × {−1, 1}, K in {1, 2, 3, . . .}, B1, . . . , BK ≥ 0, and δ > 0, with probability at least 1 − δ over the training set (x1, y1), . . . , (xn, yn) ~ P, any f in FK,B is such that:

    P(yf(x) ≤ 0) ≤ (1/n) Σ_{i=1}^n max(0, 1 − yi f(xi)) + A (Σ_k Bk)/√n + √(ln(2/δ)/(2n))

This is a uniform bound on the 0-1 risk of classifiers in FK,B. It shows that with high probability, the risk is bounded by the empirical hinge loss plus a capacity term that decreases in n^{−1/2} and is proportional to the sum of the sub-classifier norms. Note that as we have Σ_k ||Wk|| ≤ √K ||W||, the capacity term is essentially equivalent to √K ||W||. As a comparison, the generalization error for AMM [4], a related piecewise-linear classification method, has previously been shown to be proportional to K||W|| in [4, Thm. 2]. In practice, this bound is very loose, as it does not explain the observed absence of overfitting as K gets large. We experimentally demonstrate this phenomenon in Section 5. We conjecture that there exists a bound that is independent of K altogether. The proof of Theorem 1 relies on a result due to Bartlett et al. on Rademacher complexities [9]. We first prove that the Rademacher complexity of FK,B is in O(Σ_k Bk/√n). We then invoke Theorem 7 of [9] to show our result. The appendix contains the full proof.

4 SGD-based Learning

In this section, we present a learning algorithm based on Stochastic Gradient Descent (SGD) for approximately solving the total margin maximization problem (6). The choice of SGD is motivated by two factors. First, we would like our learning technique to efficiently scale to several million instances of sparse high-dimensional space.
The sample-iterative nature of SGD makes it a very suitable candidate for this end [10]. Second, the optimization problem we are solving is non-convex. SGD has recently been shown to work well for such learning problems when near-optimum solutions are acceptable [11].

Problem (6) can be expressed as an unconstrained minimization problem as follows:

    min_W Σ_{i:yi=−1} Σ_{k=1}^K [1 + Wk xi]+ + Σ_{i:yi=+1} [1 − W_{z(xi)} xi]+ + λ||W||²

where [x]+ = max(0, x) and λ > 0. This form reveals the strong similarity with optimizing K unconstrained linear SVMs [1]. The difference is that although each sub-classifier is trained on all the negative instances, positive instances are associated with a unique sub-classifier. From the unconstrained form, we can derive the stochastic gradient descent Algorithm 1. For the positive instances, we isolate the task of finding the assigned sub-classifier z to a separate procedure ASSIGN. We use the Pegasos inverse schedule ηt = 1/(λt).

Algorithm 1 Stochastic gradient descent algorithm for solving problem (6).
  function SGDTRAIN(D, λ, T, (ηt), h)
    Initialize W ∈ R^{K×d}, W ← 0
    for t ← 1, . . . , T do
      Pick (x, y) ∈ D
      if y = −1 then
        for k ← 1, . . . , K do
          if Wk x > −1 then
            Wk ← Wk − ηt x
      else if y = +1 then
        z ← argmax_k Wk x
        if Wz x < 1 then
          z ← ASSIGN(W, x, h)
          Wz ← Wz + ηt x
      W ← (1 − ηt λ)W
    return W

Since the optimization problem (6) is non-convex, a pure SGD approach could get stuck in a low-quality local optimum, and we, indeed, found that this problem occurs in practice. These optima assign most of the positive instances to a small number of sub-classifiers. In this configuration, the remaining sub-classifiers serve no purpose. Intuitively, the algorithm clustered the data into large "super-clusters", ignoring the more subtle sub-clusters comprising the larger super-clusters. The large clusters represent an appealing local optimum, since breaking one down into sub-clusters often requires transitioning through a patch of lower accuracy as the sub-classifiers align themselves to the new cluster boundaries. We view the local optima as the algorithm underfitting the data by using too simple a model. In this case, the algorithm needs encouragement to explore more complex clusterings.

With this intuition in mind, we add a term encouraging the algorithm to explore higher-entropy configurations of the sub-classifiers. To do so, we use the entropy of the random variable Z = argmax_k Wk x where x ~ D+, a distribution defined on the set of all positive instances as follows. Let nk be the number of positive instances assigned to sub-classifier k, and n be the total number of positive instances. We define D+ as the empirical distribution on (n1/n, n2/n, . . . , nK/n). The entropy is zero when the same classifier fires for all positive instances, and maximal at log2 K when every classifier fires on a K^{−1} fraction of the positive instances. Thus, maximizing the entropy encourages the algorithm to break down large clusters into smaller clusters of near equal size.

We use this notion of entropy in our heuristic procedure for assignment, described in Algorithm 2. ASSIGN takes a predefined minimum entropy level h ≥ 0 and compensates for disparities in how positive instances are assigned to sub-classifiers, where the disparity is measured by entropy. When the entropy is above h, ASSIGN uses the natural argmax_k Wk x assignment. Conversely, if the current entropy is below h, then it picks an assignment that is guaranteed to increase the entropy. Thus, when h = 0, there is no adjustment made. It keeps a dictionary UNADJ mapping the previously encountered points to the unadjusted assignment that the natural argmax assignment would have made at the time of encountering the point. We write UNADJ + (x, k) to denote the new dictionary U such that U[v] is equal to k if v = x and to UNADJ[v] otherwise. The dictionary UNADJ keeps track of the assigned positives per sub-classifier, and serves to estimate the current entropy of the configuration without needing to recompute every prior point's assignment.

5 Evaluation

We use seven data sets¹ to evaluate the CPM: (1) an MNIST data set consisting of labeled handwritten digits encoded in 28×28 gray-scale pictures (60,000 training and 10,000 testing instances) [12]; (2) an MNIST8m data set consisting of 8,100,000 pictures obtained by applying various random deformations to MNIST training instances [13]; (3) a URL data set used for malicious URL detection (1.1 million training and 1.1 million testing instances with more than 2.3 million features) [14]; (4) the RCV1-bin data set corresponding to a binary classification task (separating corporate and economics categories from government and markets categories) defined over the RCV1 data set of news articles (20,242 training and 677,399 testing instances) [15]; (5) the IJCNN1 data set for detecting misfirings of a combustion engine (35,000 training and 91,701 testing instances with 22 features); (6) the A9A data set representing census data for predicting annual income (32,561 training and 16,281 testing instances with 123 features); (7) the KDDA data set for predicting student performance (8,407,752 training and 510,302 testing instances with more than 20 million features).

¹All data sets available at 
http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets

Since our main focus is on binary classification, for the two MNIST data sets we evaluate distinguishing 2's from any other digit, which we call MNIST-2 and MNIST8m-2.

5.1 Parameter Tuning

Algorithm 2 Heuristic maximum assignment algorithm. The input is the current weight matrix W, positive instance x, and the desired assignment entropy h ≥ 0.
  Initialize UNADJ ← {}
  function ASSIGN(W, x, h)
    kunadj ← argmax_k Wk x
    if ENTROPY(UNADJ + (x, kunadj)) ≥ h then
      kadj ← kunadj
    else
      hcur ← ENTROPY(UNADJ)
      Kinc ← {k : ENTROPY(UNADJ + (x, k)) > hcur}
      kadj ← argmax_{k∈Kinc} Wk x
    UNADJ ← UNADJ + (x, kunadj)
    return kadj

All seven data sets have well-defined training and testing subsets, and to tune each algorithm's meta-parameters (λ and h for the CPM, C and γ for RBF-SVM, and λ for AMM), we randomly select a fixed validation subset from the training set (1,000 to 10,000 instances depending on the size of the training set).

For the CPM, we use a double-sided CPM as described in Section 3.1, where both CPMs share the same meta-parameters. We start by fixing a number of iterations T and a number of hyperplanes K which will result in a reasonable execution time, effectively treating these parameters as a computational budget, and we experimentally demonstrate that increasing either K or T always results in a decrease of the testing error. Once these are selected, we let h = 0 and select the best λ in {T⁻¹, 10 × T⁻¹, . . . , 10⁴ × T⁻¹}. We then choose h from {0, log K/10, log 2K/10, . . . , log 9K/10}, effectively performing a one-round coordinate descent on λ, h. To test the effectiveness of our empirical entropy-driven assignment procedure, we mute the mechanism by also testing with h = 0.
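As an illustration of the entropy-driven assignment just described, here is a short Python sketch of the entropy computation and the ASSIGN heuristic (our own naming and a made-up toy example, not the released C++11 code):

```python
import numpy as np
from collections import Counter

def entropy(assignments):
    """Empirical entropy (in bits) of the distribution (n_1/n, ..., n_K/n)
    of positive instances over sub-classifiers."""
    counts = np.array(list(Counter(assignments.values()).values()), dtype=float)
    if counts.sum() == 0:
        return 0.0
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def assign(W, x, h, unadj, key):
    """Sketch of ASSIGN: use the natural argmax assignment unless it would leave
    the empirical assignment entropy below h; otherwise pick, among the
    entropy-increasing sub-classifiers, the one with the highest score.
    `unadj` maps instance keys to unadjusted assignments; `key` identifies x."""
    scores = W @ x
    k_unadj = int(np.argmax(scores))
    if entropy({**unadj, key: k_unadj}) >= h:
        k_adj = k_unadj
    else:
        h_cur = entropy(unadj)
        k_inc = [k for k in range(W.shape[0]) if entropy({**unadj, key: k}) > h_cur]
        # Fall back to the natural assignment if no face increases the entropy.
        k_adj = max(k_inc, key=lambda k: scores[k]) if k_inc else k_unadj
    unadj[key] = k_unadj
    return k_adj

# With all prior positives on sub-classifier 0 and h = 1 bit, the heuristic
# diverts the new point to an under-used face even though face 0 scores highest.
W = np.array([[2.0, 0.0], [1.0, 0.5]])
unadj = {0: 0, 1: 0, 2: 0}
print(assign(W, np.array([1.0, 0.0]), h=1.0, unadj=unadj, key=3))  # 1
```

With h = 0 the entropy test always succeeds, so the natural argmax assignment is used unchanged, matching the "no adjustment" behavior described above.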
We make our C++11 implementation available.²

The AMM [16] has three parameters to adjust (excluding T and the equivalent of K), two of which control the weight pruning mechanism and are left set at default values. We only adjust λ. Contrary to the CPM, we do not observe the AMM testing error to strictly decrease with the number of iterations T. We observe erratic behavior, and thus we manually select the smallest T for which the mean validation error appears to reach a minimum. For RBF-SVM, we use the LibSVM [17] implementation and perform the usual grid search on the parameter space.

5.2 Performance

Unless stated otherwise, we used one core of an Intel Xeon E5 (3.2GHz, 64GB RAM) for experiments. Table 1 presents the results of the experiments and shows that the CPM achieves comparable, and at times better, classification accuracy than the RBF-SVM, while working at a relatively small and constant computational budget. For the CPM, T was up to 32 million and K ranged from 10 to 100. For AMM, T ranged from 500,000 to 36 million. Across methods, the worst execution time is for the MNIST8m-2 task, where a 512-core parallel implementation of RBF-SVM runs in 2 days [18], and our sequential single-core algorithm runs in less than 5 minutes. The AMM has significantly larger errors and/or execution times. For small training sets such as MNIST-2 and RCV1-bin, we were not able to achieve consistent results, regardless of how we set T and λ, and we conjecture that this is a consequence of the weight pruning mechanism. The results show that our empirical entropy-driven assignment procedure for the CPM leads to better solutions for all tasks. In the RCV1-bin and MNIST-2 tasks, the improvement in accuracy from using a tuned entropy parameter is 31% and 21%, respectively.

We use the MNIST8m-2 task to study the effects of tuning T and K on the CPM.
We first choose a grid of values for T, K, and for a fixed regularization factor C and h = 0, we train a model for each point of the parameter grid and evaluate its performance on the testing set. Note that for C to remain constant, we adjust λ = 1/(CT). We run each experiment 5 times and only report the mean accuracy. Figure 2 shows how this mean error rate evolves as a function of both T and K. We observe two phenomena. First, for any value K > 1, the error rate decreases with T. Second, for large enough values of T, the error rate decreases when K increases. These two observations validate our treatment of both K and T as budgeting parameters.

²CPM implementation available at https://github.com/alkant/cpm

Table 1: Error rates and running times (including both training and testing periods) for binary tasks. Means and standard deviations over 5 runs with random shuffling of the training set.

(a) Large Data Sets

             MNIST8m-2             URL                   KDDA
             Error          Time   Error          Time   Error           Time
CPM          0.30 ± 0.023   4m     1.32 ± 0.012   3m     10.38 ± 0.027   6m
CPM h=0      0.35 ± 0.034   4m     1.35 ± 0.029   3m     10.40 ± 0.021   6m
RBF-SVM      0.43*          2d**   Timed out             Timed out
AMM          0.38 ± 0.024   1hr    2.20 ± 0.067   5m     18.25 ± 6.51    53m

* for unadjusted parameters [18]   ** running on 512 processors [18]

(b) Small Data Sets

             MNIST-2               IJCNN1               A9A                   RCV1-bin
             Error          Time   Error          Time  Error          Time   Error          Time
CPM          0.38 ± 0.028   2m     3.00 ± 0.114   2m    15.15 ± 0.062  15s    2.82 ± 0.059   2m
CPM h=0      0.46 ± 0.026   2m     Same as CPM          Same as CPM           3.69 ± 0.156   2m
RBF-SVM      0.35           7m     1.44           1s    14.96          1m     3.7            46m
AMM          2.83 ± 1.090   1m     2.84 ± 0.312   14s   15.29 ± 0.181  12s    15.40 ± 6.420  1m
The observation about K also provides empirical evidence for our conjecture that large values of K do not lead to overfitting.

Figure 2: Error rate on MNIST8m-2 as a function of K, T. C = 0.01 and h = 0 are fixed.

5.3 Multi-class Classification

We performed a preliminary multi-class classification experiment using the MNIST/MNIST8m data sets. There are several approaches for building a multi-class classifier from a binary classifier [19, 20, 21]. We used a one-vs-one approach where we train (10 choose 2) = 45 one-vs-one classifiers and classify by a majority vote rule with random tie breaking. While this approach is not optimal, it approximates achievable performance. For MNIST, the CPM's testing error is 1.61 ± 0.019 and RBF-SVM's is 1.47, with running times of 7m20s and 6m43s, respectively. On MNIST8m, the CPM's error is 1.03 ± 0.074 (2h3m) and RBF-SVM's is 0.67 (8 days) as reported by [13].

6 Conclusion

We propose a novel algorithm for Convex Polytope Machine (CPM) separation that provides larger margins than a single linear classifier, while still enjoying the computational effectiveness of a simple linear separator. Our algorithm learns a bounded number of linear classifiers simultaneously. On large data sets, the CPM outperforms RBF-SVM and AMM, both in terms of running times and error rates. Furthermore, by not pruning the number of sub-classifiers used, the CPM is algorithmically simpler than AMM. The CPM avoids such complications by having little tendency to overfit the data as the number K of sub-classifiers increases, as shown empirically in Section 5.2.

Acknowledgements. 
This research was supported in part by Intel's ISTC for Secure Computing, NSF grants 0424422 (TRUST) and 1139158, the Freedom 2 Connect Foundation, US State Dept. DRL, LBNL Award 7076018, DARPA XData Award FA8750-12-2-0331, and gifts from Amazon, Google, SAP, Apple, Cisco, Clearstory Data, Cloudera, Ericsson, Facebook, GameOnTalis, General Electric, Hortonworks, Huawei, Intel, Microsoft, NetApp, Oracle, Samsung, Splunk, VMware, WANdisco and Yahoo!. The opinions in this paper are those of the authors and do not necessarily reflect those of any funding sponsor or the United States Government.

References

[1] Shai Shalev-Shwartz, Yoram Singer, and Nathan Srebro. Pegasos: Primal Estimated sub-GrAdient SOlver for SVM. In International Conference on Machine Learning, ICML '07, pages 807–814, 2007.

[2] Paul Fischer. More or less efficient agnostic learning of convex polygons. In Conference on Computational Learning Theory, COLT '95, pages 337–344, New York, NY, USA, 1995. ACM.

[3] Gabor Takacs. Smooth maximum based algorithms for classification, regression, and collaborative filtering. Acta Technica Jaurinensis, 3(1), 2010.

[4] Zhuang Wang, Nemanja Djuric, Koby Crammer, and Slobodan Vucetic. Trading representability for scalability: adaptive multi-hyperplane machine for nonlinear classification. In International Conference on Knowledge Discovery and Data Mining (KDD 2011), 2011.

[5] Naresh Manwani and P. S. Sastry. Learning polyhedral classifiers using logistic function. In Proceedings of the 2nd Asian Conference on Machine Learning (ACML 2010), Tokyo, Japan, 2010.

[6] Naresh Manwani and P. S. Sastry. Polyceptron: A polyhedral learning algorithm. arXiv:1107.1564, 2013.

[7] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part based models.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9):1627–1645, 2010.

[8] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.

[9] Peter L. Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. J. Mach. Learn. Res., 3:463–482, March 2003.

[10] Léon Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT'2010, pages 177–186. 2010.

[11] Geoffrey E. Hinton. A practical guide to training restricted Boltzmann machines. In Neural Networks: Tricks of the Trade, pages 599–619. 2012.

[12] Yann LeCun, Corinna Cortes, and Christopher J.C. Burges. MNIST dataset, 1998.

[13] Stéphane Canu and Léon Bottou. Training invariant support vector machines using selective sampling. In Large Scale Kernel Machines, pages 301–320. MIT, 2007.

[14] Justin Ma, Lawrence K. Saul, Stefan Savage, and Geoffrey M. Voelker. Beyond blacklists: Learning to detect malicious web sites from suspicious URLs. In International Conference on Knowledge Discovery and Data Mining (KDD '09), pages 1245–1254, 2009.

[15] David D. Lewis, Yiming Yang, Tony G. Rose, and Fan Li. RCV1: A new benchmark collection for text categorization research. J. Mach. Learn. Res., 5:361–397, December 2004.

[16] Nemanja Djuric, Liang Lan, Slobodan Vucetic, and Zhuang Wang. BudgetedSVM: A toolbox for scalable SVM approximations. Journal of Machine Learning Research, 14:3813–3817, 2013.

[17] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Trans. Intell. Syst. Technol., 2(3):27:1–27:27, May 2011.

[18] Zeyuan Allen Zhu, Weizhu Chen, Gang Wang, Chenguang Zhu, and Zheng Chen. P-packSVM: Parallel primal gradient descent kernel SVM. In International Conference on Data Mining (ICDM '09), pages 677–686.
IEEE, 2009.

[19] Alina Beygelzimer, John Langford, Yuri Lifshits, Gregory Sorkin, and Alex Strehl. Conditional probability tree estimation analysis and algorithms. In Uncertainty in Artificial Intelligence, UAI '09, pages 51–58, 2009.

[20] Alina Beygelzimer, John Langford, and Bianca Zadrozny. Weighted one-against-all. In National Conference on Artificial Intelligence - Volume 2, AAAI'05, pages 720–725, 2005.

[21] Thomas G. Dietterich and Ghulum Bakiri. Solving multiclass learning problems via error-correcting output codes. J. Artif. Int. Res., 2(1):263–286, January 1995.