{"title": "Volume Regularization for Binary Classification", "book": "Advances in Neural Information Processing Systems", "page_first": 332, "page_last": 340, "abstract": "We introduce a large-volume box classification for binary prediction, which maintains a subset of weight vectors, and specifically axis-aligned boxes. Our learning algorithm seeks for a box of large volume that contains ``simple'' weight vectors which most of are accurate on the training set. Two versions of the learning process are cast as convex optimization problems, and it is shown how to solve them efficiently. The formulation yields a natural PAC-Bayesian performance bound and it is shown to minimize a quantity directly aligned with it. The algorithm outperforms SVM and the recently proposed AROW algorithm on a majority of $30$ NLP datasets and binarized USPS optical character recognition datasets.", "full_text": "Volume Regularization for Binary Classi\ufb01cation\n\nKoby Crammer\n\nDepartment of Electrical Enginering\n\nThe Technion - Israel Institute of Technology\n\nHaifa, 32000 Israel\n\nkoby@ee.technion.ac.il\n\nTal Wagner\u2217\n\nFaculty of Mathematics and Computer Science\n\nWeizmann Institute of Science\n\nRehovot, 76100, Israel\n\ntal.wagner@gmail.com\n\nAbstract\n\nWe introduce a large-volume box classi\ufb01cation for binary prediction, which main-\ntains a subset of weight vectors, and speci\ufb01cally axis-aligned boxes. Our learning\nalgorithm seeks for a box of large volume that contains \u201csimple\u201d weight vectors\nwhich most of are accurate on the training set. Two versions of the learning pro-\ncess are cast as convex optimization problems, and it is shown how to solve them\nef\ufb01ciently. The formulation yields a natural PAC-Bayesian performance bound\nand it is shown to minimize a quantity directly aligned with it. 
The algorithm outperforms SVM and the recently proposed AROW algorithm on a majority of 30 NLP datasets and binarized USPS optical character recognition datasets.\n\n1 Introduction\n\nLinear models are widely used for a variety of tasks including classification and regression. Support vector machines (SVMs) [3, 22] are considered a primary method to efficiently build linear classifiers from data, yielding state-of-the-art performance. SVMs and many other methods are often easy to implement and efficient, yet they return only a single weight-vector, with no additional information about alternative models or about confidence in prediction.\n\nAn alternative approach is taken by Bayesian methods [21, 13]. The primary object is a (posterior) distribution over models that is updated using Bayes' rule. Unfortunately, the posterior is very complicated even for simple models, such as Bayesian logistic regression [15]; it is not known how to perform the update analytically, and approximations are required.\n\nIn this work we integrate the advantages of both approaches. We propose to model uncertainty over weight-vectors by maintaining a (simple) set of possible weight-vectors, rather than a single weight-vector. Learning is motivated by principles of discriminative learning rather than Bayes' rule, and optimizes a combination of a hand-crafted regularization term and the empirical loss. Specifically, our algorithm maintains an axis-aligned box, which requires only twice as many parameters as maintaining a single weight-vector, the dominant model for many tasks.\n\nWe use conceptual reasoning similar to that of Bayes point machines (BPM) [13]. Both approaches maintain a set of possible weights, which can be thought of as a posterior. BPMs use the version space, the set of all consistent weight vectors, which is a convex polyhedron.
Since the size of the polyhedron's representation grows with the number of training examples, BPMs approximate the polyhedron with a single weight-vector, the Bayes point. Our algorithms model the set as a box, whose representation size is fixed in the dimension of the input, and find an optimal prediction box.\n\nWe cast learning as a convex optimization problem and propose methods to solve it efficiently. We further provide generalization bounds using PAC-Bayesian theory, and show that our algorithm minimizes a quantity directly related to the generalization bound. We give two versions of the algorithm: one is closely related to the bound, while the other is smooth.\n\nWe experiment with 30 binary text categorization datasets from various tasks: sentiment classification, predicting the domain of a product review, assigning topics to news items, tagging spam emails, and classifying posts to newsgroups. The results indicate that our algorithms outperform SVM and the recently proposed AROW [4] algorithm, which was shown to be the state of the art in numerous NLP tasks. Additional support for the superiority and robustness of our algorithms, especially in high-noise settings, is provided by experiments with 45 pairs of binarized USPS OCR problems.\n\nNotation: Given a vector x \u2208 R^d, we denote its kth element by x_k \u2208 R, and by |x| \u2208 R^d the vector with the component-wise absolute values of its elements, |x| = (|x_1|, . . . , |x_d|).\n\n2 Large-Volume Box Classifiers\n\nStandard linear classification learning algorithms maintain and return a single weight vector w\u22c6 \u2208 R^d used to predict the label of any test point. We study a generalization of these algorithms in which hypotheses are uncertainty (sub)sets of weight vectors w. Such a hypothesis can be seen as a randomized linear classifier or a voting process.\n\n\u2217The research was performed while TW was a student at the Technion.
To classify an instance x, a parameter vector w is drawn according to the hypothesis, and the label sign(w \u00b7 x) is predicted. Herbrich et al. [13, 12] argued in a similar context that such a randomization yields a more robust solution. PAC-Bayesian analysis and its generalization bounds give additional justification to this approach (see Sec 5).\n\nThe uncertainty subsets we study are axis-aligned boxes parametrized with two vectors u, v \u2208 R^d, where we assume u_k \u2264 v_k for all k = 1 . . . d. In words, u is the vertex with the lowest coordinates, and v is the vertex with the largest coordinates. The projection of the box onto the k-axis yields the interval [u_k, v_k]. The set of weight vectors contained in the box is denoted by Q = {w : u_k \u2264 w_k \u2264 v_k for k = 1 . . . d}. Given an instance x to be classified, a Gibbs classifier samples a weight vector w \u2208 Q uniformly at random from the box and returns sign(w \u00b7 x). A deterministic alternative we use in practice is to employ the center of mass, defined by \u00b5 = (1/2)(u + v), and return sign(\u00b5 \u00b7 x). For linear classifiers, the majority prediction with Gibbs sampling coincides with predicting using the center of mass. We also define the uncertainty intervals \u03c3 = (1/2)(v \u2212 u). Intuitively, the uncertainty in the weight associated with the kth feature is \u03c3_k. Clearly, v = \u00b5 + \u03c3 and u = \u00b5 \u2212 \u03c3.\n\n3 Learning as Optimization\n\nGiven a labeled sample S = {(x_i, y_i)}_{i=1}^n, a common practice in learning linear models w is to perform structural risk minimization (SRM) [25], which picks a weight-vector that is both \u201csimple\u201d (e.g., of small norm) and performs well on the training set.
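To make the two prediction rules concrete, here is a minimal sketch in Python (not code from the paper; the function names and the toy box are our own illustration):

```python
import random

def center_of_mass_predict(u, v, x):
    """Deterministic rule: predict with the box center mu = (u + v) / 2."""
    mu = [(uk + vk) / 2.0 for uk, vk in zip(u, v)]
    score = sum(mk * xk for mk, xk in zip(mu, x))
    return 1 if score >= 0 else -1

def gibbs_predict(u, v, x, rng=random):
    """Randomized (Gibbs) rule: draw w uniformly at random from the box,
    coordinate-wise in [u_k, v_k], and predict sign(w . x)."""
    w = [rng.uniform(uk, vk) for uk, vk in zip(u, v)]
    score = sum(wk * xk for wk, xk in zip(w, x))
    return 1 if score >= 0 else -1

# Toy box in R^2: u is the lowest vertex, v the largest.
u, v = [0.5, -1.0], [1.5, 0.0]
print(center_of_mass_predict(u, v, [1.0, 1.0]))  # mu = (1.0, -0.5), score 0.5 -> prints 1
```

For linear classifiers the majority vote under Gibbs sampling agrees with the center-of-mass rule, which is why the deterministic variant is the one used in practice.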
Learning is cast as an optimization problem,\n\nw\u22c6 = arg min_w (1/n) \u2211_i \u2113(w, (x_i, y_i)) + D\u00b7R(w) .  (1)\n\nThe first term is the empirical loss evaluated on the training set with some loss function \u2113(w, (x, y)), and the second term is a regularization term that penalizes weight-vectors according to their complexity. The parameter D > 0 is a trade-off parameter.\n\nLearning with uncertainty sets invites us to balance three desires, rather than two as when learning a single weight-vector. The first two desires are generalizations of the structural risk minimization principle [25] to boxes: we prefer boxes containing weight-vectors that attain both low loss \u2113(w, (x_i, y_i)) and are \u201csimple\u201d (e.g., of small norm). This alone, though, is not enough: if the loss and regularization functions are strictly convex, then the optimal box would in fact be a single weight-vector. The third desire is thus to prefer boxes of large volume. Intuitively, if during training an algorithm finds a box of large volume such that all weight-vectors belonging to it attain low training error and are simple, we expect the classifier based on the center of mass to be robust to noise or fluctuations. This is formally stated in the analysis described in Sec 5. We formalize this requirement by adding a term that is inversely proportional to the volume of the box Q.\n\nWe take a worst-case approach, and define the loss of the box Q given an example (x, y), denoted by \u2113(Q, (x, y)), to be the loss of the worst member w \u2208 Q.
Similarly, we define the complexity of the box Q to be the complexity of its most complex member w \u2208 Q; formally, \u2113(Q, (x, y)) = sup_{w\u2208Q} \u2113(w, (x, y)) and R(Q) = sup_{w\u2208Q} R(w).\n\nPutting it all together, we replace (1) with,\n\nQ\u22c6 = arg min_{Q\u2208\ud835\udcac} sup_{w\u2208Q} ( (1/n) \u2211_i \u2113(w, (x_i, y_i)) + D\u00b7R(w) ) ,  (2)\n\nwhere \ud835\udcac is a set of boxes with some minimal volume. In other words, the algorithm seeks a set of alternative weight-vectors, all of which perform well on the training data. We expect this formulation to be robust, as a box is evaluated by its worst-performing member.\n\nWe modify the problem by removing the constraint Q \u2208 \ud835\udcac and adding an equivalent penalty term to the objective, namely the log-volume of the box. We use the log-volume function for three reasons. First, it is a common barrier function in optimization [26], and in our case it keeps the box from shrinking to a zero-volume box. Second, this choice is supported by the analysis below. Third, it is additive in the dimension of the data d, like all other quantities of the objective. Additionally, we bound the supremum over w by a sum of supremum operators. To conclude, we cast the learning problem as the following optimization problem over boxes,\n\narg min_Q (1/n) \u2211_i sup_{w\u2208Q} \u2113(w, (x_i, y_i)) \u2212 C log vol(Q) + D sup_{w\u2208Q} R(w) ,  (3)\n\nwhere C, D > 0 are two trade-off parameters used to balance the three goals. (In the analysis below it is shown that D can also be interpreted as a Lagrange multiplier of a constrained optimization problem.) We further develop the last equation by making additional assumptions on the loss function and the regularization. We assume that the loss is a monotonically decreasing function of the product y(x\u22a4w), often called the margin (or the signed margin).
This is a property of many popular loss functions for binary classification, including the hinge loss and its square used by SVMs [3, 22], the exp-loss used by boosting [9], logistic regression [11], and the Huber loss [14]. Under this assumption we compute analytically the first term of the objective (3).\n\nLemma 1 If the loss function is monotonically decreasing in the margin, \u2113(w, (x, y)) = \u2113(y(x\u22a4w)), then sup_{w\u2208Q} \u2113(w, (x_i, y_i)) = \u2113( y(x\u22a4\u00b5) \u2212 |x|\u22a4\u03c3 ).\n\nProof: From the monotonicity of \u2113(\u00b7) we have sup_{w\u2208Q} \u2113(y(x\u22a4w)) = \u2113( inf_{w\u2208Q} y(x\u22a4w) ). Computing the infimum we get,\n\ninf_{w\u2208Q} y(x\u22a4w) = inf_{w_k\u2208[u_k,v_k] for k=1...d} \u2211_{k=1}^d (y x_k) w_k = \u2211_{k=1}^d inf_{w_k\u2208[u_k,v_k]} (y x_k) w_k = \u2211_{k=1}^d (y x_k) \u00b7 ( u_k if (y x_k) \u2265 0 ; v_k if (y x_k) < 0 ) = \u2211_{k=1}^d (y x_k)(\u00b5_k \u2212 sign(y x_k) \u03c3_k) = y(x\u22a4\u00b5) \u2212 |x|\u22a4\u03c3 ,\n\nusing u = \u00b5 \u2212 \u03c3 and v = \u00b5 + \u03c3 as stated above.\n\nThe lemma highlights the need to constrain the volume to be strictly larger than zero: due to monotonicity, and since \u03c3 \u2265 0 (component-wise), we have \u2113( y(x\u22a4\u00b5) \u2212 |x|\u22a4\u03c3 ) \u2265 \u2113( y(x\u22a4\u00b5) ), so the loss alone is always minimized by setting \u03c3 = 0. We next turn to analyze the third term of (3) with the following lemma.\n\nLemma 2 (1) Assuming R(w) is convex, sup_{w\u2208Q} R(w) is attained on a vertex of the box Q. (2) Additionally, if R(w) is strictly convex, then the supremum is attained only on vertices.\n\nProof: We use the fact that every point in the box can be represented as a convex combination of the vertices. Formally, given a point in the box w \u2208 Q, there exists a vector \u03b1 \u2208 R^{2^d} with non-negative elements and \u2211_t \u03b1_t = 1 such that w = \u2211_t \u03b1_t z_t, where the z_t are the vertices of the box. Convexity of R(\u00b7) yields R(w) \u2264 \u2211_t \u03b1_t R(z_t) \u2264 max_t {R(z_t)}. Thus, if w attains the supremum sup_{w\u2208Q} R(w), then so does at least one vertex. Additionally, if R(w) is a strictly convex function, then the first inequality in the last chain is strict, and thus a non-vertex cannot attain the supremum.\n\nCommon regularization functions are defined as sums over individual features, that is, R(w) = \u2211_k r(w_k). In this case the supremum is attained on each coordinate independently, as follows.\n\nCorollary 3 Assuming R(w) is a sum of scalar convex functions \u2211_k r(w_k), we have,\n\nsup_{w\u2208Q} R(w) = \u2211_k max{ r(u_k), r(v_k) } = \u2211_k max{ r(\u00b5_k \u2212 \u03c3_k), r(\u00b5_k + \u03c3_k) } .\n\nThe corollary follows from the lemma since the supremum of a scalar function over a box is equivalent to taking the supremum over the box projected onto a single coordinate. Finally, the volume of a box is given by the product of the lengths of its axes, that is, vol(Q) = \u220f_k (v_k \u2212 u_k) = \u220f_k (2\u03c3_k) = 2^d \u220f_k \u03c3_k .\n\nTo summarize, the learning problem of the large-volume box algorithm is cast as the following minimization problem, in terms of the center \u00b5 and the size (or dimensions) \u03c3,\n\nmin_{\u03c3\u22650, \u00b5} (1/n) \u2211_{i=1}^n \u2113( y_i(x_i\u22a4\u00b5) \u2212 |x_i|\u22a4\u03c3 ) \u2212 C \u2211_k log \u03c3_k + D \u2211_k max{ r(\u00b5_k \u2212 \u03c3_k), r(\u00b5_k + \u03c3_k) } ,  (4)\n\nwhere \u2113(\u00b7) is a monotonically decreasing function, r(\u00b7) is a convex function, and C, D > 0 are two trade-off parameters used to balance our three desires. We denote by\n\nz_{i,+} = y_i x_i + |x_i| \u2208 R^d ,  z_{i,\u2212} = y_i x_i \u2212 |x_i| \u2208 R^d .  (5)\n\nThe kth element of z_{i,+} (z_{i,\u2212}) is twice the kth element of |x_i| if the sign of the kth element of x_i agrees (disagrees) with y_i, and zero otherwise. The problem can equivalently be written in terms of the two \u201cextreme\u201d vertices u and v as follows,\n\nmin_{v\u2265u} (1/n) \u2211_{i=1}^n \u2113( (1/2)( v\u22a4z_{i,\u2212} + u\u22a4z_{i,+} ) ) \u2212 C \u2211_k log(v_k \u2212 u_k) + D \u2211_k max{ r(v_k), r(u_k) } ,  (6)\n\nby using the relation y_i x_i\u22a4(v+u) \u2212 |x_i|\u22a4(v\u2212u) = v\u22a4z_{i,\u2212} + u\u22a4z_{i,+} . Note that if the loss function \u2113(\u00b7) is convex, then both formulations (4) and (6) of the learning problem are convex in their arguments, as each is a sum of convex functions of linear combinations of the arguments, and a maximum of convex functions is convex.\n\nWe conclude this section with an additional alternative formulation, which, for convenience, we present in the notation of (6). Although the above problem is convex, the regularization term \u2211_k max{ r(v_k), r(u_k) } is not smooth because of the max operator. In this alternative, we replace it with a smooth term by changing the max to a sum, yielding \u2211_k ( r(v_k) + r(u_k) ) = R(u) + R(v). The problem then becomes,\n\nmin_{v\u2265u} (1/n) \u2211_{i=1}^n \u2113( (1/2)( v\u22a4z_{i,\u2212} + u\u22a4z_{i,+} ) ) \u2212 C \u2211_k log(v_k \u2212 u_k) + D ( R(u) + R(v) ) .  (7)\n\nThe two alternatives are related via the following chain of inequalities: 0.5 max{ r(v_k), r(u_k) } \u2264 0.5 ( r(v_k) + r(u_k) ) \u2264 max{ r(v_k), r(u_k) } \u2264 r(v_k) + r(u_k) . In other words, given either one of the problems (6) or (7), we can lower- and upper-bound it by the other problem with a proper choice of the trade-off parameter D.
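As a quick numerical sanity check of Lemma 1 and of the identity behind (5) and (6), one can compare the closed form y(x\u22a4\u00b5) \u2212 |x|\u22a4\u03c3, the expression (1/2)(v\u22a4z_{i,\u2212} + u\u22a4z_{i,+}), and a brute-force minimum over the 2^d vertices of the box (a sketch with made-up numbers, not the paper's code):

```python
from itertools import product

def worst_case_margin(u, v, x, y):
    """Closed form from Lemma 1: inf over the box of y(x.w) = y(x.mu) - |x|.sigma."""
    mu = [(a + b) / 2.0 for a, b in zip(u, v)]
    sigma = [(b - a) / 2.0 for a, b in zip(u, v)]
    return (y * sum(m * xk for m, xk in zip(mu, x))
            - sum(abs(xk) * s for xk, s in zip(x, sigma)))

def worst_case_margin_z(u, v, x, y):
    """Same value via the z vectors of Eq. (5): (1/2)(v.z_minus + u.z_plus)."""
    zp = [y * xk + abs(xk) for xk in x]
    zm = [y * xk - abs(xk) for xk in x]
    return 0.5 * (sum(vk * z for vk, z in zip(v, zm))
                  + sum(uk * z for uk, z in zip(u, zp)))

def worst_case_margin_brute(u, v, x, y):
    """Brute force: minimize y(x.w) over all 2^d vertices of the box."""
    return min(y * sum(wk * xk for wk, xk in zip(w, x))
               for w in product(*zip(u, v)))

u, v, x, y = [-1.0, 0.0], [1.0, 2.0], [2.0, -3.0], 1
print(worst_case_margin(u, v, x, y),
      worst_case_margin_z(u, v, x, y),
      worst_case_margin_brute(u, v, x, y))  # all three print -8.0
```

The agreement of the three values illustrates why the linear infimum is attained at a vertex and why (4) and (6) describe the same problem.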
We call the two versions BoW, for box-of-weights algorithm, and refer to them as BoW-M(ax) and BoW-S(um), respectively.\n\n4 Optimization Algorithm\n\nWe now present an algorithm to solve (6) for the special case r(x) = x\u00b2. The algorithm is based on COMID [8] and its convergence analysis follows directly from the analysis of COMID; it is omitted due to lack of space. The algorithm works in iterations. On each iteration, a (stochastic) gradient descent step is performed, followed by a regularization-optimization step. Formally, the algorithm picks a random example i and updates,\n\n(\u0169, \u1e7d) \u2190 (u, v) \u2212 \u03b1 (\u03b7/2) ( z_{i,+} , z_{i,\u2212} )  for  \u03b1 = \u2113\u2032( (1/2)( v\u22a4z_{i,\u2212} + u\u22a4z_{i,+} ) ) .\n\nThe algorithm then solves the following regularization-oriented optimization problem,\n\nmin_{u,v} (1/2)\u2016u \u2212 \u0169\u2016\u00b2 + (1/2)\u2016v \u2212 \u1e7d\u2016\u00b2 \u2212 C \u2211_k log(v_k \u2212 u_k) + D \u2211_k max{ v_k\u00b2, u_k\u00b2 } .\n\nThe objective of the last problem decomposes over the individual pairs u_k, v_k, so we reduce the optimization to d independent problems, each defined over two scalars u and v (omitting the index k),\n\nmin_{u,v} F(u, v) = (1/2)(u \u2212 \u0169)\u00b2 + (1/2)(v \u2212 \u1e7d)\u00b2 \u2212 C log(v \u2212 u) + D max{ v\u00b2, u\u00b2 } .  (8)\n\nWe denote the half-plane H = {(u, v) \u2208 R\u00b2 : v > u} and partition it into three subsets: G_1 = {(u, v) \u2208 H : v > \u2212u}, G_2 = {(u, v) \u2208 H : v < \u2212u}, and the line L = {(u, v) \u2208 R\u00b2 : v = \u2212u}. The following lemma describes the optimal solution of (8).\n\nLemma 4 Exactly one of the items below holds and describes the optimal solution of (8).\n\n1. If there exists (u, v) \u2208 G_1 such that v is a root of f(v) = \u03b1v\u00b2 + \u03b2v + \u03b3 and u = \u0169 \u2212 2Dv + (\u1e7d \u2212 v), where \u03b1 = 2(1 + D)(1 + 2D), \u03b2 = \u2212\u0169(1 + 2D) \u2212 \u1e7d(3 + 4D), and \u03b3 = \u1e7d\u00b2 + \u0169\u1e7d \u2212 C, then it is a global minimum of F.\n\n2. If there exists (u, v) \u2208 G_2 such that u is a root of f(u) = \u03b1u\u00b2 + \u03b2u + \u03b3 and v = \u1e7d \u2212 2Du + (\u0169 \u2212 u), where \u03b1 = 2(1 + D)(1 + 2D), \u03b2 = \u2212\u1e7d(1 + 2D) \u2212 \u0169(3 + 4D), and \u03b3 = \u0169\u00b2 + \u1e7d\u0169 \u2212 C, then it is a global minimum of F. Furthermore, such a point and a point described in 1 cannot exist simultaneously.\n\n3. If no point as described in 1 or 2 exists, then the global minimum of F is (u, \u2212u) such that u is a root of f(u) = \u03b1u\u00b2 + \u03b2u + \u03b3, where \u03b1 = 2 + 2D, \u03b2 = \u1e7d \u2212 \u0169, \u03b3 = \u2212C.\n\nProof sketch: By definition, the function F is smooth and convex on G_1. The condition in 1 is equivalent to satisfying \u2207F(u, v) = 0, and therefore any point that satisfies it is a minimum of F restricted to G_1. A similar argument applies to G_2 with 2. The convexity of F on the entire set H yields that any such point is also a global minimum of F, and that if no such point exists then F attains its global minimum on L (which is derived in 3). The latter is sure to exist since F restricted to L tends to \u221e both as v \u2192 0 and as v \u2192 \u221e. The algebraic derivation is omitted due to lack of space.\n\nSimilarly, we develop the update for solving (7). Here, after the gradient step, we need to solve the following problem per coordinate k (omitting the index k): min_{u,v} F(u, v) = (1/2)(u \u2212 \u0169)\u00b2 + (1/2)(v \u2212 \u1e7d)\u00b2 \u2212 C log(v \u2212 u) + D( v\u00b2 + u\u00b2 ) .
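For the smooth (BoW-S) variant, this per-coordinate problem can be solved in closed form by setting \u2207F = 0: adding the two stationarity conditions gives (1 + 2D)(u + v) = \u0169 + \u1e7d, and eliminating v leaves a quadratic in u. The sketch below re-derives the coefficients directly from those conditions (an illustrative derivation under these assumptions, not a transcription of the paper's lemma):

```python
import math

def bows_prox(u_t, v_t, C, D):
    """Per-coordinate prox step for the smooth variant:
    minimize 0.5*(u - u_t)**2 + 0.5*(v - v_t)**2 - C*log(v - u) + D*(u*u + v*v)
    over v > u, with C > 0, D >= 0.
    Stationarity: (u - u_t) + C/(v - u) + 2*D*u = 0
                  (v - v_t) - C/(v - u) + 2*D*v = 0
    Summing gives (1 + 2D)(u + v) = u_t + v_t; substituting back yields
    2*(1+2D)^2 * u^2 - (1+2D)*(3*u_t + v_t) * u + u_t^2 + u_t*v_t - C*(1+2D) = 0."""
    A = 1.0 + 2.0 * D
    s = (u_t + v_t) / A                 # value of u + v at the optimum
    a = 2.0 * A * A
    b = -A * (3.0 * u_t + v_t)
    c = u_t * u_t + u_t * v_t - C * A
    u = (-b - math.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)  # smaller root => v - u > 0
    return u, s - u
```

The discriminant is always positive for C > 0 (the quadratic is negative at u = s/2), and the smaller root is the one satisfying v \u2212 u > 0; plugging the result back into the two stationarity conditions verifies it.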
The following lemma characterizes the optimal solution.\n\nLemma 5 The optimal solution (u, v) \u2208 {(u, v) \u2208 R\u00b2 : v \u2212 u > 0} of the last problem is such that u is a root of the polynomial f(u) = \u03b1u\u00b2 + \u03b2u + \u03b3 (the root for which v \u2212 u > 0), where \u03b1 = 2(1 + 2D)\u00b2, \u03b2 = \u2212(1 + 2D)(3\u0169 + \u1e7d), \u03b3 = \u0169\u00b2 + \u0169\u1e7d \u2212 C(1 + 2D), and v = ( \u1e7d + \u0169 \u2212 u(1 + 2D) ) / (1 + 2D).\n\nIts proof is similar to the proof of Lemma 4, but simpler, and is omitted due to lack of space.\n\n5 Analysis\n\nPAC-Bayesian bounds were introduced by McAllester [19], were further refined later (e.g., [17, 23]), and were applied to analyze SVMs [18]. They have often been shown to be quite tight.\n\nWe first introduce some notation needed for the discussion of these bounds. Let \u2113\u0304(w, (x, y)) denote the zero-one loss, that is, \u2113\u0304(w, (x, y)) = 1 if sign(w \u00b7 x) \u2260 y and \u2113\u0304(w, (x, y)) = 0 otherwise. Let D be a distribution over labeled examples (x, y), and denote by \u2113\u0304(w, D) the expected zero-one loss of a linear classifier characterized by its weight vector w: \u2113\u0304(w, D) = Pr_{(x,y)\u223cD}[ sign(w \u00b7 x) \u2260 y ] = E_{(x,y)\u223cD}[ \u2113\u0304(w, (x, y)) ] . We abuse notation and denote by \u2113\u0304(w, S) the expected loss \u2113\u0304(w, D_S) for the empirical distribution D_S of a sample S.\n\nPAC-Bayesian analysis states generalization bounds in terms of two distributions, a prior and a posterior, over all hypotheses (i.e., over weight-vectors w). Below, we identify a compact set with the uniform distribution over that set; in particular, we identify a box Q with the uniform distribution over all weight vectors it contains (and zero mass otherwise). Similarly, we identify any compact body P with the uniform distribution over its elements.
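Identifying the box with a uniform distribution makes the Gibbs zero-one loss straightforward to estimate by sampling; a minimal Monte Carlo sketch (the sampler and toy data are our own, not the paper's):

```python
import random

def gibbs_zero_one_loss(u, v, S, n_samples=500, seed=0):
    """Monte Carlo estimate of the Gibbs classifier's expected zero-one loss:
    draw w uniformly from the box [u, v] and average the error rate on S."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        w = [rng.uniform(uk, vk) for uk, vk in zip(u, v)]
        errors = sum(
            1 for x, y in S
            if (1 if sum(wk * xk for wk, xk in zip(w, x)) >= 0 else -1) != y
        )
        total += errors / len(S)
    return total / n_samples

# Every w in the box [0.5, 1.5] classifies both toy points correctly.
print(gibbs_zero_one_loss([0.5], [1.5], [([1.0], 1), ([-1.0], -1)]))  # prints 0.0
```

This is exactly the quantity bounded on the left-hand side of the PAC-Bayesian corollary below.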
In other words, we refer to the prior P and the posterior Q both as two uniform distributions and as their supports (which are subsets). We also denote by \u2113(Q, D) the expectation of \u2113(w, D) over weight vectors w drawn according to the distribution Q. We quote Cor. 2.2 of Germain et al. [10].\n\nCorollary 6 ([10]): For any distribution D, for any set H of weight-vectors, for any distribution P with support H, for any \u03b4 \u2208 (0, 1], and for any positive number \u03b3, the following statement holds with probability \u2265 1 \u2212 \u03b4 over samples S of size n,\n\n\u2113\u0304(Q, D) \u2264 (1/(1 \u2212 e^{\u2212\u03b3})) { 1 \u2212 exp[ \u2212( \u03b3 \u2113\u0304(Q, S) + (1/n) D_KL(Q\u2016P) + (1/n) ln(1/\u03b4) ) ] } .  (9)\n\nThe corollary states that the expected number of mistakes, over examples drawn according to some fixed and unknown distribution D over inputs and over weight-vectors drawn uniformly from the box Q, is bounded by the right-hand term, which is a monotonic function of the following sum,\n\n\u2113\u0304(Q, S) + (1/(n\u03b3)) D_KL(Q\u2016P) .  (10)\n\nFor uniform distributions we have the following,\n\nD_KL(Q\u2016P) = log( vol(P) / vol(Q) ) if Q \u2286 P , and \u221e otherwise .  (11)\n\nAdditionally, we bound the empirical training error,\n\n\u2113\u0304(Q, S) = (1/n) \u2211_i (1/vol(Q)) \u222b_{w\u2208Q} \u2113\u0304(w, (x_i, y_i)) dw \u2264 (1/n) \u2211_i \u2113( inf_{w\u2208Q} y_i(x_i\u22a4w) ) ,  (12)\n\nwhere the equality is the definition of \u2113\u0304(Q, S), and the inequality follows by choosing a loss function \u2113(\u00b7) that upper-bounds the zero-one loss (e.g., the hinge loss), by bounding an expectation by the supremum value, and from Lemma 1.\n\nWe get that to minimize the generalization bound of (9) we can minimize a bound on (10), which is obtained by substituting (11) and (12) into (10). Omitting constants we get,\n\nmin_Q (1/n) \u2211_i \u2113( inf_{w\u2208Q} y_i w\u22a4x_i ) \u2212 (1/(n\u03b3)) log vol(Q)  s.t.  Q \u2286 P .  (13)\n\nNext, we set P to be a ball of radius R about the origin and, as in Sec 2, we set Q to be a box parametrized by the vectors u and v. We use the following lemma, whose proof is omitted due to lack of space.\n\nLemma 7 If P is a ball of radius R about the origin and Q is a box parametrized using u and v, then Q \u2286 P \u21d4 \u2211_k max{ v_k\u00b2, u_k\u00b2 } \u2264 R\u00b2 .\n\nFinally, plugging Lemma 7 and Lemma 1 into (13), we get the following problem, which is monotonically related to a bound on the generalization loss,\n\nmin_{v\u2265u} (1/n) \u2211_{i=1}^n \u2113( (1/2)( v\u22a4z_{i,\u2212} + u\u22a4z_{i,+} ) ) \u2212 (1/(n\u03b3)) \u2211_k log(v_k \u2212 u_k)  subject to  \u2211_k max{ r(v_k), r(u_k) } \u2264 R\u00b2 .\n\nTo solve the last problem we write its Lagrangian,\n\nmax_\u03b7 min_{v\u2265u} (1/n) \u2211_{i=1}^n \u2113( (1/2)( v\u22a4z_{i,\u2212} + u\u22a4z_{i,+} ) ) \u2212 (1/(n\u03b3)) \u2211_k log(v_k \u2212 u_k) + \u03b7 \u2211_k max{ r(v_k), r(u_k) } \u2212 \u03b7 R\u00b2 ,  (14)\n\nwhere \u03b7 is the Lagrange multiplier ensuring the constraint.
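The quantities in (11) and the containment test of Lemma 7 are simple to compute explicitly; a sketch using the standard volume formula for a d-dimensional ball (function names are ours):

```python
import math

def box_in_ball(u, v, R):
    """Lemma 7 style test: the box [u, v] lies inside the l2 ball of radius R
    iff its farthest vertex does, i.e. sum_k max(u_k^2, v_k^2) <= R^2."""
    return sum(max(uk * uk, vk * vk) for uk, vk in zip(u, v)) <= R * R

def kl_uniform_box_vs_ball(u, v, R):
    """KL(Q||P) for Q uniform on the box and P uniform on the ball:
    log(vol(P) / vol(Q)) when Q is contained in P, +infinity otherwise.
    vol(ball) = pi^(d/2) * R^d / Gamma(d/2 + 1); vol(box) = prod_k (v_k - u_k)."""
    if not box_in_ball(u, v, R):
        return math.inf
    d = len(u)
    log_vol_ball = ((d / 2.0) * math.log(math.pi) + d * math.log(R)
                    - math.lgamma(d / 2.0 + 1.0))
    log_vol_box = sum(math.log(vk - uk) for uk, vk in zip(u, v))
    return log_vol_ball - log_vol_box
```

For example, the square [\u22121, 1]\u00b2 inside the ball of radius 2 gives KL = log(4\u03c0/4) = log \u03c0 \u2248 1.1447, so growing the box shrinks the KL term of the bound, which is the large-volume intuition in miniature.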
Figure 1: Fraction of error on text classification datasets of BoW-M and BoW-S vs. SVM (two left plots); and BoW-M and BoW-S vs. AROW (two right plots). Markers above the line indicate superior BoW performance.\n\nComparing (14), whose objective is used in the PAC-Bayesian bound, and our learning algorithm in (6), we observe that the three terms in both objectives are the same by setting C = 1/(n\u03b3) and identifying the optimal value of the Lagrange multiplier with the trade-off constant \u03b7 = D. In fact, each value of the radius R yields a unique optimal value of the Lagrange multiplier \u03b7. Thus, we can interpret the role of the constant D as implicitly setting the effective radius of the prior ball P.\n\nA few comments are in order. First, the KL-divergence between distributions is minimized more effectively if both P and Q are of the same form, e.g., both boxes. However, we chose Q to be a box, as it has a nice interpretation of uncertainty over features, and P to be a ball, as it decomposes (as opposed to an \u2113\u221e ball), which allows simpler optimization algorithms. Second, as noted above, BoW-S is indeed smoother than BoW-M; yet, from (14) it follows that the latter is better motivated by the PAC-Bayesian bound, as we want Q \u2286 P. Third, the bound is small if the volume of the box Q is large, which motivates seeking large-volume boxes whose members perform well.\n\n6 Empirical Evaluation\n\nWe evaluated BoW-M and BoW-S on NLP tasks, experimenting with all 12 datasets used by Dredze et al. [6] (sentiment classification in 6 Amazon domains, 3 pairs of 20 Newsgroups, and 3 pairs of Reuters (RCV1)). We defined an additional task from the 6 Amazon domains (book, dvd, music, video, electronics, kitchen). Given reviews from two domains, the goal is to identify the domain identity. We used all 6\u00d75/2 = 15 unordered pairs of domains.
Additionally, we selected 3 users from task A of the 2006 ECML/PKDD Discovery Challenge spam dataset. The goal is to classify an email as either spam or not spam. This yielded a total of 30 datasets. For each problem we selected about 2,000 instances and represented them with vectors of uni/bi-gram counts. Feature extraction followed a previous protocol [6, 2]. Each dataset was randomly divided for 10-fold cross-validation. We also experimented with the USPS OCR data, which we binarized into 45 all-pairs problems, maintaining the standard split into training and test sets. Given an image of one of two digits, the goal is to detect which of the two digits is shown in the image.\n\nWe implemented BoW-M and BoW-S with both the hinge loss and the Huber loss. The performance of the latter was slightly worse than the former, so we report results only for the hinge loss. We also tried AdaGrad [7], but surprisingly it did not work as well as COMID. We compared BoW with support vector machines (SVM) [3] and AROW [4], which was shown to outperform many algorithms on NLP tasks. (Other algorithms we evaluated, including maximum entropy and SGD with the Huber loss, performed worse than either of these two algorithms and are thus omitted.) It is not clear at this point how to incorporate Mercer kernels into BoW, and thus we are restricted to evaluating all algorithms on data that can be classified well with linear models.\n\nClassifier parameters (C for SVM, r for AROW, and C, D for BoW) were tuned for each task on a single additional randomized run over the data, splitting it into 80% used for training, while the remaining 20% of examples were used to choose the parameters. Results are reported for the NLP tasks as the average error over the 10 folds per problem, while for USPS the standard test sets are used. The mean error over the 10 folds for the 30 NLP tasks of BoW-M and BoW-S vs. SVM is summarized in the two left panels of Fig. 1.
Markers above the line indicate superior BoW performance. Clearly, both BoW versions outperform SVM, obtaining lower test error on most (26) datasets and higher on only a few (at most 3). The two right panels compare the performance of both BoW versions with AROW. Here the trend remains, yet with a smaller gap: BoW-M outperforms AROW on 20 datasets and is outperformed on 9, while BoW-S outperforms AROW on 19 datasets and is outperformed on 12. Note, AROW was previously shown [4] to have superior performance on text data over other algorithms.\n\nFigure 2: No. of USPS 1-vs-1 datasets (out of 45) for which one algorithm is better than the other (see legend), shown for four levels of label noise during training: 0%, 10%, 20% and 30% (left to right). Higher values indicate better performance.\n\nThe results of the experiments with USPS are summarized in Fig. 2. Each panel shows the number of datasets (out of 45) for which one algorithm outperforms another, for four levels of label noise (i.e., the probability of flipping the correct label) during training: 0%, 10%, 20% and 30%. The four pairs compared are BoW vs. SVM (two left panels, BoW-M leftmost) and BoW vs. AROW (two right panels, BoW-S rightmost). A left bar higher than a middle bar (in each group in each panel) indicates superior BoW performance. With no label noise (left group in each panel) SVM outperforms both BoW algorithms (e.g., SVM attains lower test error than BoW-S on 20 datasets and higher on 12 datasets, with a tie on 13 datasets). The average test error of SVM is 1.81%, of AROW 1.98%, and of BoW-S 1.97%. When the level of noise increases, both BoW algorithms outperform AROW and SVM.
With the maximal level of 30% label noise, the average test error is 16.1% for SVM, 14.8% for AROW, and 6.1% for BoW-S. BoW-M achieves lower test error on 27 datasets (compared with both SVM and AROW), while BoW-S achieves lower test error than SVM on 38 datasets and than AROW on 40 datasets. Interestingly, while BoW-M in general achieved lower test error than BoW-S on the NLP problems, the situation is reversed on the USPS data, where BoW-S in general achieves lower test error.\n\n7 Related Work\n\nThere is much previous work on the related topic of incorporating additional constraints using prior knowledge of the problem. Shivaswamy and Jebara [24] use a geometric motivation to modify SVMs. Their work and other related works [16, 20, 1] first deduce some additional knowledge about the problem and keep it fixed while learning. In contrast, our method learns the classifier together with some additional information.\n\nAnother line of research concerns algorithms that maintain a Gaussian distribution over weights, as opposed to the uniform distribution in our case: AROW [4] and its predecessors in the online setting, and Gaussian Margin Machines (GMMs) [5] in the batch setting. Our motivation is similar to the motivation behind GMMs, yet it differs in a few important aspects. (1) BoW maintains only 2d parameters, while GMM employs d + d(d + 1)/2, as it maintains a full covariance matrix. (2) As a consequence, GMMs are not feasible to run on data with more than hundreds of features, which is further supported by the fact that GMMs were evaluated only on data of dimension 64 [5]. (3) We directly use a specialized PAC-Bayes bound for convex loss functions [10], while the analysis of GMMs uses a bound designed for the 0-1 loss which is then further bounded. (4)
(4) The optimization problems of both versions of BoW are convex, while the optimization problem of GMMs is not convex and is only approximated with a convex problem. (5) Therefore, we can and do employ COMID [8], which is theoretically justified and fast in practice, while GMMs are trained using another technique with no convergence (even to local minima) guarantees. (6) Conceptually, BoW maintains a compact set (a box), while the set of possible weights for a GMM is not compact. This allows us to extend our work to other types of sets (in progress), while it is not clear how to extend the GMM approach from Gaussian distributions to other objects.

8 Conclusion
We extend the commonly used linear classifiers to subsets of the class of possible classifiers, or in other words, to uniform distributions over weight vectors. Our learning algorithm is based on a worst-case margin minimization principle, and it benefits from strong theoretical guarantees based on tight PAC-Bayesian bounds. The empirical evaluation presented shows that our method performs favourably with respect to SVMs and AROW, and is more robust in the presence of label noise. We plan to study the integration of kernels, extend our framework to various shapes and problems, and develop specialized large-scale algorithms.

[Figure 2 panels omitted: four bar charts of win counts ("No wins") vs. training label error, comparing BoW-M/SVM, BoW-S/SVM, BoW-M/AROW, and BoW-S/AROW, with ties.]

Acknowledgments: The paper was partially supported by an Israeli Science Foundation grant ISF-1567/10 and by a Google research award.

References
[1] J. Bi and T. Zhang. Support vector classification with input data uncertainty. In NIPS, 2004.
[2] J. Blitzer, M. Dredze, and F. Pereira. Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification.
In ACL, 2007.
[3] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273-297, September 1995.
[4] K. Crammer, A. Kulesza, and M. Dredze. Adaptive regularization of weighted vectors. In NIPS, 2009.
[5] K. Crammer, M. Mohri, and F. Pereira. Gaussian margin machines. In AISTATS, 2009.
[6] M. Dredze, K. Crammer, and F. Pereira. Confidence-weighted linear classification. In ICML, 2008.
[7] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. In COLT, 2010.
[8] J. Duchi, S. Shalev-Shwartz, Y. Singer, and A. Tewari. Composite objective mirror descent. In COLT, pages 250-264, 2010.
[9] Y. Freund and R.E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In Euro-COLT, pages 23-37, 1995.
[10] P. Germain, A. Lacasse, F. Laviolette, and M. Marchand. PAC-Bayesian learning of linear classifiers. In ICML, 2009.
[11] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2001.
[12] R. Herbrich, T. Graepel, and C. Campbell. Robust Bayes point machines. In ESANN 2000, pages 49-54, 2000.
[13] R. Herbrich, T. Graepel, and C. Campbell. Bayes point machines. JMLR, 1:245-279, 2001.
[14] P.J. Huber. Robust estimation of a location parameter. Annals of Mathematical Statistics, 35(1):73-101, 1964.
[15] T. Jaakkola and M. Jordan. A variational approach to Bayesian logistic regression models and their extensions. In Workshop on Artificial Intelligence and Statistics, 1997.
[16] G. Lanckriet, L. Ghaoui, C. Bhattacharyya, and M. Jordan. A robust minimax approach to classification. JMLR, 3:555-582, 2002.
[17] J. Langford and M. Seeger. Bounds for averaging classifiers. Technical report, CMU-CS-01-102, 2002.
[18] J. Langford and J. Shawe-Taylor. PAC-Bayes and margins.
In NIPS, 2002.
[19] D. McAllester. PAC-Bayesian model averaging. In COLT, 1999.
[20] J. Nath, C. Bhattacharyya, and M. Murty. Clustering based large margin classification: A scalable approach using SOCP formulation. In KDD, 2006.
[21] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.
[22] B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization and Beyond. MIT Press, 2002.
[23] M. Seeger. PAC-Bayesian generalization bounds for Gaussian processes. JMLR, 3:233-269, 2002.
[24] P. Shivaswamy and T. Jebara. Ellipsoidal kernel machines. In AISTATS, 2007.
[25] V. N. Vapnik. Statistical Learning Theory. Wiley, 1998.
[26] M.H. Wright. The interior-point revolution in optimization: history, recent developments, and lasting consequences. Bull. Amer. Math. Soc., 42:39-56, 2005.