{"title": "Multiclass Learning from Contradictions", "book": "Advances in Neural Information Processing Systems", "page_first": 8400, "page_last": 8410, "abstract": "We introduce the notion of learning from contradictions, a.k.a. Universum learning, for multiclass problems and propose a novel formulation for multiclass universum SVM (MU-SVM). We show that learning from contradictions (using MU-SVM) incurs lower sample complexity compared to multiclass SVM (M-SVM) by deriving the Natarajan dimension and the sample complexity for PAC-learnability of MU-SVM. We also propose an analytic span bound for MU-SVM and demonstrate its utility for model selection, resulting in $\\sim 2-4 \\times$ faster computation times than standard resampling techniques. We empirically demonstrate the efficacy of MU-SVM on several real-world datasets, achieving $>$ 20\\% improvement in test accuracies compared to M-SVM. Insights into the underlying behavior of MU-SVM using a histograms-of-projections method are also provided.", "full_text": "Multiclass Learning from Contradictions

Sauptik Dhar
LG Electronics
Santa Clara, CA 95054
sauptik.dhar@lge.com

Vladimir Cherkassky
University of Minnesota
Minneapolis, MN 55455
cherk001@umn.edu

Mohak Shah
LG Electronics
Santa Clara, CA 95054
mohak.shah@lge.com

Abstract

We introduce the notion of learning from contradictions, a.k.a. Universum learning, for multiclass problems and propose a novel formulation for multiclass universum SVM (MU-SVM). We show that learning from contradictions (using MU-SVM) incurs lower sample complexity compared to multiclass SVM (M-SVM) by deriving the Natarajan dimension and the sample complexity for PAC-learnability of MU-SVM. We also propose an analytic span bound for MU-SVM and demonstrate its utility for model selection, resulting in ∼2-4× faster computation times than standard resampling techniques.
We empirically demonstrate the efficacy of MU-SVM on several real-world datasets, achieving >20% improvement in test accuracies compared to M-SVM. Insights into the underlying behavior of MU-SVM using a histograms-of-projections method are also provided.

1 Introduction

Many machine learning problems in domains such as healthcare, autonomous driving, and prognostics and health management involve learning from high-dimensional data with limited labeled samples. In such domains, labeling very large quantities of data is either extremely expensive or entirely prohibitive due to the manual effort required. Standard inductive learning methods, including data-intensive deep architectures [1], may not be sufficient for such high-dimensional, limited-labeled-data problems. The learning-from-contradictions paradigm (popularly known as Universum learning) has been shown to be particularly effective for binary classification problems of this nature [2-11]. In this paradigm, along with the labeled training data we are also given a set of unlabeled universum samples. These universum samples belong to the same application domain as the training data, but are known not to belong to either of the two classes. The rationale behind this setting comes from the fact that even though obtaining labels is very difficult, obtaining such additional unlabeled samples is relatively easy. These unlabeled universum samples act as contradictions and should not be explained by the binary decision rule. However, this paradigm has so far been mostly limited to binary classification problems, making it impractical for most real applications, which involve classification into more than two categories. Further, this limits the incorporation of a priori knowledge by discarding available universum data for such applications.

Previous works such as [12, 13] have hinted at adopting an Error Correcting Output Code (ECOC) based setting such as one-vs-one (OVO) and one-vs-all (OVA), where several binary Universum-SVM [12] classifiers are combined to solve the multiclass problem. However, such studies lack a complete formalization and analysis. An alternative is the adoption of a direct approach [14], where the entire multiclass problem is solved through a single larger optimization formulation by introducing universum learning under a probabilistic framework using a logistic loss.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

However, that work does not clarify how contradictions are captured through the proposed formulation. In this paper, we propose a formalization for multiclass learning with contradictions. Following [15] for multiclass SVMs, we introduce the new Multiclass Universum SVM (MU-SVM) formulation. The proposed MU-SVM provides a unified framework for multiclass learning under universum settings, with improved performance accuracies. The main contributions of this paper are as follows:

1. Formulation: We formalize the notion of universum learning for multiclass SVM (M-SVM), and propose a novel direct formulation called Multiclass Universum SVM (MU-SVM).

2. PAC Learnability: We derive the Natarajan dimension for the MU-SVM hypothesis class and analyze its sample complexity for PAC learnability (Theorem 2). Our analysis shows that MU-SVM incurs lower sample complexity compared to M-SVM.

3. 
Useful Properties: MU-SVM reduces to: i) the standard multiclass SVM in the absence of universum data, and ii) the binary U-SVM formulation of [16] for two-class problems (Proposition 2). In addition, the proposed MU-SVM is solvable through any state-of-the-art M-SVM solver (Proposition 3).

4. Model Selection: We provide a new span definition specific to MU-SVM and, following [17], derive a leave-one-out bound for MU-SVM (Theorem 3). Under additional assumptions, a computationally efficient version of the bound is also provided (Theorem 4).

5. Empirical validation: Empirical results demonstrate the efficacy of the proposed formulation. We also propose a histogram-of-projections approach to analyse the results (Section 4).

2 Multiclass SVM (M-SVM)

This section introduces multiclass learning under inductive settings and the popular Crammer and Singer (C&S) multiclass SVM (M-SVM) formulation [15] used in such settings. Although several other multiclass SVM formulations have been proposed in the literature [18-24], C&S's M-SVM is among the most widely used ones. Further, compared to the most popular multiclass formulations, C&S's M-SVM provides the smallest estimation error while ensuring a small approximation error (see [25], Table 1, for details). This makes the C&S M-SVM formulation highly desirable for limited-labeled-samples settings. In this paper we use the C&S M-SVM with equal misclassification costs for balanced data as an exemplar for multiclass SVM formulations under inductive settings, and refer to it as M-SVM throughout.

Definition 1. (Multiclass Learning under Inductive Setting) Given i.i.d. training samples T = (x_i, y_i), i = 1...n ∼ D_T^n, with x ∈ X ⊆ R^d and y ∈ Y = {1, ..., L}, estimate a hypothesis h* : X → Y from a hypothesis class H which minimizes

inf_{h ∈ H} E_{D_T}[1_{y ≠ h(x)}]    (1)

where D_T = the training distribution, 1(·) = the indicator function, and E_{D_T}(·) = expectation under the training distribution.

The M-SVM, by minimizing a margin-based loss function [15], estimates f = [f_1, ..., f_L] to construct the decision rule ĥ(x) = argmax_{l=1,...,L} f_l(x). The M-SVM hypothesis class is given as

H_M-SVM = { x → argmax_{l ∈ Y} (w_l^T x) : ∑_{l=1}^L ||w_l||^2 ≤ Λ^2 ; w_k^T x - max_{l ≠ k} w_l^T x ≥ 1 if y = k }    (2)

where Λ ≥ 0 is a user-defined parameter which controls the complexity of the hypothesis class. The form in (2) is also known as the hard-margin version of M-SVM. For practical purposes we solve the soft-margin version given below,

min_{w_1...w_L, ξ}  (1/2) ∑_{l=1}^L ||w_l||^2 + C ∑_{i=1}^n ξ_i
s.t.  (w_{y_i} - w_l)^T x_i ≥ e_il - ξ_i ;  e_il = 1 - δ_il ;  δ_il = 1 if y_i = l, 0 if y_i ≠ l ;  i = 1...n, l = 1...L    (3)

Figure 1: Loss function for M-SVM with f_k(x) = w_k^T x. For the soft-margin M-SVM (3), any sample (x, y = k) lying inside the margin is linearly penalized using the slack variable ξ.

In this formulation, the training samples falling inside the margin border ('+1') are linearly penalized using the slack variables ξ_i ≥ 0, i = 1...n (see Fig. 1), which contribute the margin error ∑_{i=1}^n ξ_i. Eq.
(3) balances between minimizing the margin error and the regularization term using the user-defined parameter C ≥ 0.

3 Multiclass Universum SVM (MU-SVM)

3.1 Multiclass U-SVM formulation

Learning from contradictions, or Universum learning, was introduced in [2] for binary classification problems to incorporate a priori knowledge about admissible data samples. For multiclass problems, in addition to the labeled training data we are also given unlabeled universum samples which are known not to belong to any of the classes in the training data. For example, if the goal of learning is to discriminate between handwritten digits (0, 1, 2, ..., 9), one can introduce additional 'knowledge' in the form of handwritten letters (A, B, C, ..., Z). These examples from the universum contain certain information (e.g., handwriting styles), but they cannot be assigned to any of the classes (0 to 9). Further, the universum samples do not have the same distribution as the labeled training samples. Learning under this setting can be formalized as below.

Definition 2. (Multiclass Learning under Universum Setting) Given i.i.d. training samples T = (x_i, y_i), i = 1...n ∼ D_T^n, with x ∈ X ⊆ R^d and y ∈ Y = {1, ..., L}, and additional m unlabeled universum samples U = (x*_i'), i' = 1...m ∼ D_U with x* ∈ X*_U ⊆ R^d, estimate h* : X → Y from hypothesis class H which, in addition to eq. (1), obtains maximum contradiction on the universum samples, i.e., maximizes the following probability for x* ∈ X*_U,

sup_{h ∈ H} P_{D_U}[x* ∉ any class] = sup_{h ∈ H} E_{D_U}[ 1_{ ∩_{k ∈ {1,...,L}} {h(x*) ≠ k} } ]    (4)

where D_U is the universum distribution, P_{D_U}(·) is probability under the universum distribution, E_{D_U}(·) is the expectation under the universum distribution, and X*_U is the domain of the universum data.

Learning under the universum setting has the dual goal of minimizing eq. (1) while maximizing the contradiction (in eq. (4)) on the universum samples. The following Proposition 1 provides guidance on how to address this for the M-SVM formulation in eq. (3).

Proposition 1. For the M-SVM formulation in (3), maximum contradiction on universum samples x* ∈ U can be achieved when

|w_k^T x* - max_{l=1...L} w_l^T x*| = 0 ;  ∀k ∈ {1, ..., L}    (5)

That is, learning under Definition 2 using M-SVM requires the universum samples to lie on the decision boundaries of all the classes {1...L}. Here, however, we relax this constraint (5) by requiring the universum samples to lie within a Δ-insensitive zone around the decision boundaries (see Fig. 2), as was done for the binary scenario in [16]. Different from [16], here the Δ-insensitive loss is introduced for the decision boundaries of all the classes, i.e., |w_k^T x* - max_{l=1...L} w_l^T x*| ≤ Δ, ∀k = 1...L. This reasoning motivates the new multiclass Universum-SVM (MU-SVM) formulation, where the training samples T := (x_i, y_i), i = 1...n, y_i ∈ {1, ..., L}, are penalized by the standard hinge loss (similar to M-SVM (3), shown in Fig. 1), and the universum samples U := (x*_i'), i' = 1...m, are penalized by a Δ-insensitive loss (see Fig. 2) for the decision functions of all the classes f = [f_1, ..., f_L]. The resulting hard-margin MU-SVM hypothesis class is given as

H_MU-SVM = { x → argmax_{l=1,...,L} (w_l^T x) : ∑_{l=1}^L ||w_l||^2 ≤ Λ^2 ; w_k^T x - max_{l ≠ k} w_l^T x ≥ 1 if y = k ; |w_k^T x* - max_{l=1...L} w_l^T x*| ≤ Δ, ∀k ∈ Y }    (6)

Figure 2: Loss function for universum samples x* for the k-th class decision boundary w_k^T x* - max_{l=1...L} w_l^T x* = 0. For the soft-margin MU-SVM formulation (7), a sample lying outside the Δ-insensitive zone is linearly penalized using the slack variable ζ_k.

We then relax the hard constraints on the universum samples by linearly penalizing the constraint violations through slack variables ζ (shown in Fig. 2), leading to the following soft-margin MU-SVM formulation (1):

min_{w_1...w_L, ξ, ζ}  (1/2) ∑_{l=1}^L ||w_l||^2 + C ∑_{i=1}^n ξ_i + C* ∑_{i'=1}^m ∑_{k=1}^L ζ_i'k
s.t.  (w_{y_i} - w_l)^T x_i ≥ e_il - ξ_i ;  e_il = 1 - δ_il ;  δ_il = 1 if y_i = l, 0 if y_i ≠ l ;  ∀i = 1...n, l = 1...L
      |w_k^T x*_i' - max_{l=1...L} w_l^T x*_i'| ≤ Δ + ζ_i'k ;  ζ_i'k ≥ 0 ;  i' = 1...m, k = 1...L    (7)

(1) Throughout this paper, we use indices i, j for training samples, i' for universum samples, and k, l for the class labels.

Here, for the k-th class decision boundary, the universum samples (x*_i'), i' = 1...m, that lie outside the Δ-insensitive zone are linearly penalized using the slack variables ζ_i'k ≥ 0, i' = 1...m. The user-defined parameters C, C* ≥ 0 control the trade-off between the margin size, the margin error on the training samples, and the contradictions (samples lying outside the ±Δ zone) on the universum samples. Note that for C* = 0, eq. (7) reduces to the M-SVM classifier.

Proposition 2. For binary classification, L = 2, (7) reduces to the standard U-SVM formulation in [16] with w = w_1 - w_2 and b = 0.

3.2 Sample Complexity for PAC Learnability

Next we derive the sample complexity for PAC-learnability of H_M-SVM and H_MU-SVM and provide a comparative analysis. First we provide the necessary definitions.

Definition 3. (Sample Complexity for PAC learnability [26, 27]) The sample complexity of an algorithm A : X × Y → H is defined as the smallest integer n_A(ε, δ) such that for any given ε, δ > 0, every n > n_A(ε, δ), and every distribution D on X × Y, we have, ∀ ĥ = A((x_i, y_i), i = 1...n),

P_{(x_i,y_i), i=1...n ∼ D} ( P_D[ĥ(x) ≠ y] > inf_{h ∈ H} P_D[h(x) ≠ y] + ε ) ≤ δ    (8)

The sample complexity of a hypothesis class H is n_H(ε, δ) = inf_A n_A(ε, δ).

The sample complexity for PAC learnability depends on the size (a.k.a. capacity measure) of a hypothesis class. Traditional capacity measures used in binary classification do not directly apply to multiclass problems. This has led to research on several newer capacity measures for multiclass problems [19, 28-30]. One of the most widely researched capacity measures for multiclass SVMs is the Natarajan dimension [31], defined next.

Definition 4. Shattering (Multiclass version) For any hypothesis class H ⊆ Y^X and any S ⊆ X, where Y = {1, . . .
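To make the roles of C, C* and Δ concrete, the soft-margin objective in (7) can be evaluated directly for a fixed set of class weight vectors. The following is a minimal numpy sketch (an illustration, not the authors' solver); the function name, the dense weight matrix W, and the sample values are assumptions made for the example, and classes are indexed 0...L-1.

```python
import numpy as np

def mu_svm_objective(W, X, y, X_u, C=1.0, C_star=1.0, delta=0.1):
    # W: (L, d) class weight vectors; X: (n, d) training samples, labels y in {0..L-1};
    # X_u: (m, d) universum samples.
    n = X.shape[0]
    L = W.shape[0]
    reg = 0.5 * np.sum(W * W)                 # (1/2) sum_l ||w_l||^2
    scores = X @ W.T                          # (n, L) decision values f_l(x_i)
    # hinge slack: xi_i = max_l [ e_il - (f_{y_i}(x_i) - f_l(x_i)) ]_+ , e_il = 1 - delta_il
    e = 1.0 - np.eye(L)[y]
    margins = scores[np.arange(n), y][:, None] - scores
    xi = np.maximum(0.0, e - margins).max(axis=1)
    # Delta-insensitive slack for every class boundary k:
    # zeta_{i'k} = [ |f_k(x*) - max_l f_l(x*)| - delta ]_+
    u_scores = X_u @ W.T                      # (m, L)
    dev = np.abs(u_scores - u_scores.max(axis=1, keepdims=True))
    zeta = np.maximum(0.0, dev - delta)
    return reg + C * xi.sum() + C_star * zeta.sum()
```

With C_star = 0 the value coincides with the M-SVM soft-margin objective (3), consistent with the remark that (7) reduces to the M-SVM classifier in that case.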
, L} and X := the training data domain, we say H shatters S if ∃ f_1, f_2 : S → Y with ∀x ∈ S, f_1(x) ≠ f_2(x), such that for every T ⊆ S there is a g ∈ H which satisfies

∀x ∈ T, g(x) = f_1(x), and ∀x ∈ S - T, g(x) = f_2(x).    (9)

The Natarajan dimension d_N(H) is the maximal cardinality of a set that is shattered by H. An advantage of the Natarajan dimension is that it provides a natural extension of the fundamental learning theorem to multiclass problems (see [26, 32]), discussed next.

Theorem 1. (Fundamental Learning Theorem) There exist absolute constants C_1, C_2 > 0 such that any hypothesis class H of functions from X → Y is PAC-learnable with sample complexity

C_1 (d_N(H) + log(1/δ)) / ε^2  ≤  n_H(ε, δ)  ≤  C_2 (d_N(H) log(|Y|) + log(1/δ)) / ε^2    (10)

Theorem 1 shows n_H(ε, δ) = O((d_N(H) log(|Y|) + log(1/δ)) / ε^2). Hence, for low sample complexity it is desirable for hypothesis classes to have a small d_N(H). With these definitions in place, we prove a new Theorem 2 to characterize d_N(H) for H_M-SVM and H_MU-SVM, as shown below.

Theorem 2. The Natarajan dimension for H_M-SVM and H_MU-SVM has the form d_N(H) = O(ϑ log(ϑ)). Assuming ||x||^2 ≤ R^2, ∀x ∈ X ⊆ R^d, gives

H_M-SVM:  ϑ = min(dL + 1, 2R^2 Λ^2)    (11)
H_MU-SVM:  ϑ = min(dL + 1, κ)    (12)

where d = data dimension, L = total number of classes,

κ ≤ min_{γ ∈ {γ ≥ 0 ; G(γ) ≥ 0}}  [ F(γ) R^2 + √G(γ) ] / 2    (13)
F(γ) = Λ^2 + γ (mL(L-1)/2) Δ^2 ;  H(γ) = (I + γ V V^T)^{-1} (V Z^T Z V^T)    (14)
G(γ) = [2 F(γ) R^2]^2 - 4 γ F(γ) trace[H(γ)]    (15)

and the transformations Z, V are obtained as follows.

(T1) Given: a maximally shattered set S = {x_1, ..., x_dN} using the functions f_1(x), f_2(x) (see Definition 4).
Define: a mapping φ : R^d → R^dL as

z = φ(x) = (0_{d×1}, ..., x [block f_1(x) = l], ..., -x [block f_2(x) = k], ..., 0_{d×1})_{dL×1} ;  ∀x ∈ T ⊆ S
z = φ(x) = (0_{d×1}, ..., -x [block f_1(x) = l], ..., x [block f_2(x) = k], ..., 0_{d×1})_{dL×1} ;  ∀x ∈ S - T

Obtain: Z = [(z_1)^T ; ... ; (z_dN)^T].

Basically, the transformation φ maps a sample x ∈ R^d from the shattered set S to a dL-dimensional vector z, where for any x ∈ T with f_1(x) = l and f_2(x) = k, we copy the x vector onto the (l-1)d+1 ... ld-th positions and the -x vector onto the (k-1)d+1 ... kd-th positions of z. The remaining elements are set to 0. We reverse the sign of the mapping for x ∈ S - T.

(T2) Given: the universum set U = [(x*_1)^T ; ... ; (x*_m)^T]_{m×d}.
Define: G as the L(L-1)/2 × L matrix whose rows enumerate all class pairs, written in block form as

G = [ 1_{(L-1)×1} -I_{(L-1)×(L-1)} ; 0 1_{(L-2)×1} -I_{(L-2)×(L-2)} ; ... ; 0 1_{(L-3)×1} -I_{(L-3)×(L-3)} ; ... ]_{L(L-1)/2 × L}

Obtain: V = (G ⊗ U), where ⊗ is the Kronecker product.

Due to space constraints, the proof of Theorem 2 is provided in the supplementary material. Theorem 2 provides a framework to analyze the sample complexity for PAC-learnability of H_M-SVM and H_MU-SVM. A direct observation from Theorem 2 is that MU-SVM is likely to have a smaller d_N, and hence a lower sample complexity, compared to M-SVM. This is seen from (14), where setting γ = 0 ⇒ F(γ) = Λ^2 ⇒ G(γ) = [2Λ^2 R^2]^2. Hence we always have κ ≤ 2R^2 Λ^2 from (13). This gives ϑ_{H_MU-SVM} ≤ ϑ_{H_M-SVM}. In fact, ϑ_{H_MU-SVM} can be significantly smaller than ϑ_{H_M-SVM} for a well-chosen γ ∈ {γ ≥ 0 ; G(γ) ≥ 0}, resulting in a much smaller d_N for MU-SVM compared to M-SVM. The trade-off between m, Δ and the universum data type further ensures low sample complexity for MU-SVM.

3.3 MU-SVM Implementation

Another desirable property of MU-SVM (7) is that it is solvable through state-of-the-art M-SVM solvers [33, 34]. For every universum sample x*_i' we create L artificial samples belonging to all the classes, i.e., (x*_i', y*_i'1 = 1), ..., (x*_i', y*_i'L = L), and add them to the training set as shown below,

(x_i, y_i, e_il, C_i, ξ_i) = { (x_i, y_i, e_il, C, ξ_i), i = 1...n ;  (x*_i', y*_i'l, -Δe_i'l, C*, ζ_i'), i = n+1...n+mL, i' = 1...m, l = 1...L }    (16)

Proposition 3. MU-SVM (7) after the transformation (16) can be solved in the dual form as

max_α  W(α) = -(1/2) ∑_{i,j} ∑_l α_il α_jl K(x_i, x_j) - ∑_{i,l} α_il e_il
s.t.  ∑_l α_il = 0 ;  α_il ≤ C_i if l = y_i ;  α_il ≤ 0 if l ≠ y_i ;  i, j = 1...n+mL, l = 1...L    (17)

Note that (17) has a similar form to the M-SVM dual (see [15, 24]), except that (17) has additional mL constraints for the universum samples. Hence, solving MU-SVM using (17) is the same as solving an M-SVM problem (3) with n + mL samples.

3.4 Model Selection

The MU-SVM (17) has four tunable parameters: C, C*, the kernel parameter, and Δ. Successful application of MU-SVM significantly depends on optimal model selection. In this paper we simplify the model selection using a two-step approach:

(Step a) First, perform optimal tuning of the C and kernel parameters for M-SVM (2). This equivalently tunes the parameters specific only to the training samples in the MU-SVM formulation.

(Step b) Tune Δ while keeping the C and kernel parameters fixed (from Step a). Also, C* = nC/(mL) is kept fixed, to ensure equal contribution of the training and universum samples in MU-SVM (7).

The model parameters in Steps (a) & (b) are typically selected through resampling techniques like leave-one-out (l.o.o.) or stratified cross-validation [35]. Of these approaches, l.o.o. provides an almost unbiased estimate of the test error [36]. However, on the downside, it can be computationally prohibitive.
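The index bookkeeping in the mapping φ is easy to get wrong, so a small sketch may help. The snippet below (an illustration, with 0-based class indices assumed) places x in the coordinate block of class l = f1(x) and -x in the block of class k = f2(x):

```python
import numpy as np

def phi(x, l, k, L):
    # Map x in R^d to z in R^(d*L): +x in the block for class l,
    # -x in the block for class k (l != k), zeros elsewhere.
    d = x.shape[0]
    z = np.zeros(d * L)
    z[l * d:(l + 1) * d] = x
    z[k * d:(k + 1) * d] = -x
    return z
```

For x in S - T the sign of the whole mapping is reversed, i.e., -phi(x, l, k, L).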
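The implementation route of Section 3.3, i.e., replicate every universum sample once per class label and feed the enlarged problem to an off-the-shelf M-SVM solver, can be sketched in a few lines. This is a hedged illustration of the augmentation step (16): the tuple layout and names are assumptions, and the Δ-dependent margin offsets -Δe_i'l are omitted for brevity.

```python
import numpy as np

def augment_with_universum(X, y, X_u, L, C, C_star):
    # Build the augmented sample list of (16): the n training samples kept
    # as-is, plus m*L artificial samples -- one copy of each universum point
    # per class label. Each entry is (x, label, cost), where the per-sample
    # cost C_i would be passed to the M-SVM solver.
    samples = [(X[i], int(y[i]), C) for i in range(len(X))]
    for x_star in X_u:
        for l in range(L):
            samples.append((x_star, l, C_star))
    return samples
```

The augmented problem has n + mL samples, matching the observation that solving (17) is the same as solving an M-SVM problem with n + mL samples.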
In this paper we provide a new span definition for MU-SVM in (19), and modify the technique in [17] to derive a new analytic l.o.o. bound for MU-SVM. Other span-based l.o.o. bounds have been derived for alternative versions of the M-SVM formulation [37]; however, they apply neither to C&S's M-SVM nor to the MU-SVM formulation proposed in this paper. Next, we show that our proposed bound can be successfully used for model selection in Steps (a) & (b), thereby avoiding the computationally prohibitive l.o.o. and expensive cross-validation. The necessary definitions are provided next.

Definition 5. (Leave-One-Out procedure) The l.o.o. procedure with the t-th training sample dropped involves solving (17) with the additional constraint α_tl = 0, ∀l. The obtained l.o.o. solution α^t = [α^t_11, ..., α^t_1L, ..., α^t_t1 = 0, ..., α^t_tL = 0, ...], with t-th sample prediction ŷ_t = argmax_l ∑_i α^t_il K(x_i, x_t), gives the leave-one-out error as R_l.o.o. = (1/n) ∑_{t=1}^n 1[y_t ≠ ŷ_t].

Definition 6. Support vectors obtained through solving (17) are categorized as: Type 1, SV1 = { i | 0 < α_iy_i < C_i }, and Type 2, SV2 = { i | α_iy_i = C_i }. The set of all support vectors is represented as SV = SV1 ∪ SV2. Similarly, the set of support vectors for the l.o.o. solution is given as SV^t. Under Definition 6 we prove the following,

Theorem 3. The leave-one-out error is upper bounded as:

R_l.o.o. ≤ (1/n) [ |{ t ∈ SV1 ∩ T : S_t max(2D√C, 1/√C) ≥ 1 }| + |{ t ∈ SV2 ∩ T }| ]    (18)

where |·| := cardinality of a set, and S_t := the span of a Type 1 support vector x_t, given by the new span definition specific to MU-SVM,

S_t^2 = min_β ∑_{i,j} ∑_l β_il β_jl K(x_i, x_j)
s.t.  β_tl = α_tl, ∀l = 1...L ;  ∑_l β_il = 0 ;
      α_il - β_il ≤ C_i, ∀{(i ≠ t, l) | 0 < α_il < C_i ; l = y_i} ;
      α_il - β_il ≤ 0, ∀{(i ≠ t, l) | α_il < 0 ; l ≠ y_i} ;
      β_il = 0, ∀i ∉ SV1 - {t}, ∀l = 1...L    (19)

and D is the diameter of the smallest hypersphere containing all training samples.

Please refer to the supplementary material for the proof of Theorem 3. The practical utility of (18) is limited due to the significant computational complexity of solving (19), which results in ∼O((n + mL)^4) (worst case) to compute (18). To alleviate this, we derive a computationally attractive alternative under the following assumptions.

Assumption 1. For the MU-SVM solution:

(A1) The sets SV1 and SV2 remain the same during the l.o.o. procedure.
(A2) The SV1 support vectors have only two active elements, i.e., ∀α_i ∈ SV1 ∃ k ≠ y_i s.t. α_ik = -α_iy_i.

Theorem 4. Under Assumption 1, the leave-one-out error is upper bounded as:

R_l.o.o. ≤ (1/n) |{ t ∈ SV ∩ T : S_t^2 ≥ α_t^T f(x_t) }|, with f(x_t) := [ ∑_{i ∈ SV} α_il K(x_i, x_t) ]_{l=1...L}

where

S_t^2 = { α_t^T [(H^{-1})_tt]^{-1} α_t,  t ∈ SV1 ∩ T ;  α_t^T [K(x_t, x_t) ⊗ I_L - K_t^T H^{-1} K_t] α_t,  t ∈ SV2 ∩ T }    (20)

H := [ K_SV1 ⊗ I_L, A^T ; A, 0 ] ; A := I_|SV1| ⊗ (1_L)^T ; 1_L = [1 1 ... 1] (L elements) ; K_SV1 := the kernel matrix of the SV1 support vectors ; (H^{-1})_tt := the sub-matrix of H^{-1} for the indices i = (t-1)L+1 ... tL ; K_t = [(k_t^T ⊗ 1_L) 0_{L×|SV1|}]^T ; k_t = the |SV1|×1 vector whose i-th element is K(x_i, x_t), ∀x_i ∈ SV1 ; and ⊗ is the Kronecker product.

Theorem 4 provides a good approximation of the l.o.o. error (also confirmed by the results in Table 3), even when Assumption 1 is violated, just as in [17]. Further, it provides two major advantages over Theorem 3. First, Eq. (20) is valid for both SV1 and SV2 training support vectors and results in a stricter bound. Second, the span computation in Theorem 4 requires only one matrix inversion, H^{-1}. This results in a significant speed-up, to ∼O((n + mL)^3) for computing the leave-one-out bound using (20), compared to ∼O((n + mL)^4) for (18).

4 Empirical Results

We use three real-life datasets, discussed next.

German Traffic Sign Recognition Benchmark (GTSRB) [38]: The goal is to identify the traffic signs for the speed zones '30', '70' and '80'. Here, the images are represented by their 1568 histogram-of-gradient (HOG 1) features. For this data we use three kinds of universum: (U1) Random Averaging (RA): synthetically created by first selecting a random traffic sign from each class ('30', '70' and '80') in the training set and averaging them. (U2) Others: all other non-speed traffic signs. 
(U3) 'Priority road' sign: an exhaustive search over several non-speed-zone traffic signs showed this universum to provide the best performance (see Appendix B.4).

Table 1: Real-life datasets.
DATASET | TRAIN / TEST SIZE
GTSRB | 300 / 1500 (100 / 500 per class)
ABCDETC | 600 / 400 (150 / 100 per class)
ISOLET | 500 / 500 (100 / 100 per class)

Handwritten characters (ABCDETC) [16]: The goal is to identify the handwritten digits '0'-'3' using their 10000 (100 × 100) pixel values. We use the characters other than digits as universum, i.e., (U1) 'A'-'Z' uppercase letters, (U2) 'a'-'z' lowercase letters, (U3) all other symbols, like: , . ; ! : = - + / ? ) $ % \" @, and (U4) Random Averaging (RA) generated as above.

Speech-based Isolated Letter recognition (ISOLET) [39]: This is a speech recognition dataset where 150 subjects pronounced each letter 'a'-'z' twice. The goal is to identify the spoken letters 'a'-'e' using the 617-dimensional spectral-coefficient, contour, sonorant, presonorant, and post-sonorant features. We use two different types of universum: (U1) 'Others', which contains all the other spoken letters, i.e., 'f'-'z', and (U2) Random Averaging (RA) as discussed above.

Due to space constraints, and to simplify our analyses in later sections, we used only a subset of the training classes. We see similar results using all the classes (results provided in supplementary material B.1). For the model parameters, our initial experiments showed a linear parameterization to be optimal for GTSRB; hence only a linear kernel has been used for it.
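The Random Averaging universum used for GTSRB and ISOLET can be generated with a few lines of numpy. This is a sketch of the described procedure (pick one random training sample from every class and average them); the function and variable names are illustrative.

```python
import numpy as np

def random_averaging_universum(X, y, m, rng=None):
    # Create m synthetic universum samples: each is the mean of one randomly
    # chosen training sample from every class present in y.
    rng = np.random.default_rng(rng)
    classes = np.unique(y)
    U = np.empty((m, X.shape[1]))
    for i in range(m):
        picks = [rng.choice(np.flatnonzero(y == c)) for c in classes]
        U[i] = X[picks].mean(axis=0)
    return U
```

By construction these samples lie between the class clusters, so they belong to the same domain as the training data without belonging to any single class.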
For ABCDETC and ISOLET, an RBF kernel K(x_i, x_j) = exp(-γ||x_i - x_j||^2) with γ = 2^{-7} provided optimal results for M-SVM. For all the experiments, model selection is done over the range of parameters C = [10^{-4}, ..., 10^3], C*/C = n/(mL), and Δ = [0, 0.01, 0.05, 0.1].

Effectiveness of the MU-SVM formulation (7): Table 2 provides the average test error for MU-SVM and several other baseline methods over 10 random training/test partitions of the data in the proportions shown in Table 1. Model selection within each partition is done using stratified 5-fold CV [35]. Here, SVM_OVA and SVM_OVO denote the popular ECOC-based multiclass extensions one-vs-all (OVA) and one-vs-one (OVO) using binary SVM [2] as the base classifier. Similarly, U-SVM_OVA and U-SVM_OVO use binary U-SVM [16] as the base classifier. Owing to space constraints, we only show the results for the best-performing universum for each dataset. Also, we fix the number of universum samples to m = 500; additional increases in the number of universum samples do not provide any significant gains (see Appendix B.3 for results). The complete set of results using all the universum types is provided in Appendix B.2. For reproducibility, we also provide the typical optimal parameters selected through model selection in Appendix B.2.

Table 2 shows that MU-SVM provides lower test errors compared to all the other baseline methods. Specifically, compared to M-SVM, the performance gains using MU-SVM reach up to ∼20-25%. For a sufficiently large universum set size, such significant improvements using MU-SVM depend mostly on the statistical characteristics of the universum data. To better understand these statistical characteristics, we adopt the technique of 'histogram of projections' (HOP), originally introduced for binary classification in [40].
Here, different from [40], for a given M-SVM / MU-SVM\nmodel we project the training samples onto the decision space of their respective classes i.e. \u2200(xi, yi =\nk) we obtain the projection values as w(cid:62)\nl xi. In addition we also project the universum\nsamples onto the decision spaces of all the classes i.e. \u2200(x\u2217\nl x\u2217\nw(cid:62)\ni(cid:48).\nFinally we generate the class speci\ufb01c histograms of these projection values. In addition to the\nhistograms, we also generate a frequency plot of the predicted labels for the universum samples using\nthe models. Using this HOP visualization we analyze the effectiveness of the universum U3 for\nGTSRB dataset (see Fig 3). As seen from Fig. 3, the optimal M-SVM model has high separability for\nthe training samples i.e. most of the training samples lie outside the margin borders (+1). In addition,\nthe universum samples U3 are widely spread about the margin-borders and biased towards the positive\nside of the decision boundary of the sign \u201830\u2019 (Fig. 3(a)); and hence predominantly gets classi\ufb01ed\nas sign \u201830\u2019(Fig.3(d)). As seen from Figs 3. (e)-(g), applying the MU-SVM model preserves the\nseparability of the training samples and additionally reduces the spread of the universum samples.\nFollowing proposition 1, such a model exhibits higher uncertainty on the universum samples\u2019 class\nmembership, and uniformly assigns them over all the classes i.e. signs \u201830\u2019,\u201870\u2019 and \u201880\u2019 (see Fig.\n3(h)). This shows that, the resulting MU-SVM model has higher contradiction (uncertainty) on the\nuniversum samples and hence provides better generalization compared to M-SVM. This behavior\nis consistently seen for the other datasets and universum choices (provided in the supplementary\nmaterial - Appendix B.6).\nModel Selection using Theorem 4: Table 3 provides the average \u00b1 std. 
dev of time taken (in seconds) for model selection using Theorem 4 vs. 5-fold CV for 10 runs over the entire range of parameters, as well as the respective average test errors. In each experimental run the data is partitioned as in Table 1. We use a desktop with a 12-core Intel Xeon @ 3.5 GHz and 32 GB RAM. The bound-based model selection is ∼ 2 − 4× faster than 5-fold CV and provides similar test errors. The advantage offered by Theorem 4 is even more pronounced against leave-one-out (l.o.o.). For instance, comparison with l.o.o. for the GTSRB dataset showed ∼ 100× improvement in speed using Theorem 4 with similar test accuracies (see Appendix B.5).

Table 3: Comparisons for model selection using 5-fold CV vs. Theorem 4. No. of universum samples m = 500. Model parameters used: $C^*/C = \frac{n}{mL}$, $\Delta = [0, 0.01, 0.05, 0.1]$.

Figure 3: GTSRB: Histograms of projections for training (in blue) and universum U3 (in red). M-SVM (C = 1): (a) sign '30'. (b) sign '70'. (c) sign '80'. (d) frequency plot of universum labels. MU-SVM (∆ = 0): (e) sign '30'. (f) sign '70'. (g) sign '80'. (h) frequency plot of universum labels.

Table 2: Mean (± standard deviation) of the test errors (in %) over 10 runs of the experimental setting in Table 1. No. of universum samples m = 500.

DATASET               SVM_OVA       SVM_OVO       M-SVM         U-SVM_OVA     U-SVM_OVO     MU-SVM
GTSRB (USING U3)      7.17 ± 1.08   7.16 ± 1.92   7.24 ± 1.16   6.05 ± 0.61   5.97 ± 0.63   5.53 ± 0.62
ABCDETC (USING U4)    28.1 ± 4.74   29.1 ± 4.16   27.5 ± 3.34   26.1 ± 4.93   26.9 ± 4.51   22.1 ± 3.24
ISOLET (USING U2)     3.72 ± 0.6    3.88 ± 0.44   3.6 ± 0.31    3.56 ± 0.55   3.88 ± 0.63   2.83 ± 0.32
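For a linear decision function, the HOP analysis reduces to computing $\mathbf{w}_k^\top \mathbf{x} - \max_{l \neq k} \mathbf{w}_l^\top \mathbf{x}$ per class and histogramming the results. The following is a minimal numpy sketch under our own naming (not the authors' code):

```python
import numpy as np

def hop_projections(W, X, y=None):
    """Per-class projection values for histogram-of-projections (HOP):
    f_k(x) = w_k^T x - max_{l != k} w_l^T x.

    W : (L, d) array of class weight vectors (linear model assumed).
    X : (n, d) array of samples.
    y : optional (n,) labels; if given, each training sample is
        projected only onto its own class's decision space."""
    scores = X @ W.T                      # (n, L) per-class scores
    n, L = scores.shape
    proj = np.empty((n, L))
    for k in range(L):
        rest = np.delete(scores, k, axis=1)
        proj[:, k] = scores[:, k] - rest.max(axis=1)
    if y is None:                         # universum: project onto all classes
        return proj
    return proj[np.arange(n), y]          # training: own class only

def universum_label_frequencies(W, X_univ, L):
    """Frequency of predicted labels for universum samples."""
    preds = np.argmax(X_univ @ W.T, axis=1)
    return np.bincount(preds, minlength=L)
```

Class-specific histograms then follow from `np.histogram` on each column of the projections; a near-uniform output of `universum_label_frequencies` corresponds to the high-contradiction behavior seen for MU-SVM in Fig. 3(h).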
Additional l.o.o. results could not be reported owing to its prohibitively slow speed.

                       5-FOLD CV                       THEOREM 4
DATASET               TEST ERROR (%)  TIME (×10⁴ s)   TEST ERROR (%)  TIME (×10⁴ s)
GTSRB       U1        6.9 ± 0.9       3.1 ± 0.5       6.9 ± 0.9       0.8 ± 0.2
            U2        7.1 ± 0.8       3.2 ± 0.9       7.4 ± 0.9       0.9 ± 0.3
            U3        5.2 ± 0.4       2.9 ± 0.3       5.5 ± 0.6       0.9 ± 0.1
ABCDETC     U1        26.1 ± 4.0      2.8 ± 0.1       26.1 ± 3.7      1.1 ± 0.1
            U2        24.2 ± 3.1      2.8 ± 0.1       24.4 ± 3.2      1.3 ± 0.1
            U3        23.3 ± 3.2      2.6 ± 0.2       24.1 ± 3.8      0.9 ± 0.09
            U4        22.1 ± 3.2      2.6 ± 0.1       22.0 ± 2.8      0.9 ± 0.1
ISOLET      U1        4.8 ± 0.9       3.3 ± 0.3       3.3 ± 0.3       2.1 ± 0.5
            U2        3.1 ± 0.6       2.6 ± 0.3       2.8 ± 0.3       1.9 ± 0.7

5 Conclusions

This paper proposes a new formulation for multiclass SVM with universum learning (MU-SVM). MU-SVM is shown to incur lower sample complexity for PAC learnability compared to M-SVM by deriving its Natarajan dimension. Further, the proposed MU-SVM embodies several useful mathematical properties amenable to: a) efficient implementation using existing M-SVM solvers, and b) deriving practical analytic bounds that can be used for model selection. We empirically show the effectiveness of both the formulation and the bound. Insights into the workings of MU-SVM using the HOP visualization are also provided.

Acknowledgments

We thank the anonymous reviewers for their comments, which helped improve the quality of the paper.

References

[1] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016, http://www.deeplearningbook.org.

[2] V. Vapnik, Estimation of Dependences Based on Empirical Data (Information Science and Statistics). Springer, Mar. 2006.

[3] F. Sinz, O. Chapelle, A. Agarwal, and B.
Sch\u00f6lkopf, \u201cAn analysis of inference with the universum,\u201d in\n\nAdvances in neural information processing systems 20. NY, USA: Curran, Sep. 2008, pp. 1369\u20131376.\n\n9\n\n-101(a)00.51 Sign 30-101(b)00.51 Sign 70-101(c)00.51 Sign 80307080(d)0500-101(e)00.51 Sign 30-101(f)00.51 Sign 70-101(g)00.51 Sign 80307080(h)0100200\f[4] S. Dhar and V. Cherkassky, \u201cDevelopment and evaluation of cost-sensitive universum-svm,\u201d Cybernetics,\n\nIEEE Transactions on, vol. 45, no. 4, pp. 806\u2013818, 2015.\n\n[5] S. Lu and L. Tong, \u201cWeighted twin support vector machine with universum,\u201d Advances in Computer\n\nScience: an International Journal, vol. 3, no. 2, pp. 17\u201323, 2014.\n\n[6] Z. Qi, Y. Tian, and Y. Shi, \u201cA nonparallel support vector machine for a classi\ufb01cation problem with\nuniversum learning,\u201d Journal of Computational and Applied Mathematics, vol. 263, pp. 288\u2013298, 2014.\n[7] C. Shen, P. Wang, F. Shen, and H. Wang, \u201cUboost: Boosting with the universum,\u201d Pattern Analysis and\n\nMachine Intelligence, IEEE Transactions on, vol. 34, no. 4, pp. 825\u2013832, 2012.\n\n[8] Z. Wang, Y. Zhu, W. Liu, Z. Chen, and D. Gao, \u201cMulti-view learning with universum,\u201d Knowledge-Based\n\nSystems, vol. 70, pp. 376\u2013391, 2014.\n\n[9] D. Zhang, J. Wang, F. Wang, and C. Zhang, \u201cSemi-supervised classi\ufb01cation with universum.\u201d in SDM.\n\nSIAM, 2008, pp. 323\u2013333.\n\n[10] Y. Xu, M. Chen, Z. Yang, and G. Li, \u201c\u03bd-twin support vector machine with universum data for classi\ufb01cation,\u201d\n\nApplied Intelligence, vol. 44, no. 4, pp. 956\u2013968, 2016.\n\n[11] C. Zhu, \u201cImproved multi-kernel classi\ufb01cation machine with nystr\u00f6m approximation technique and univer-\n\nsum data,\u201d Neurocomputing, vol. 175, pp. 610\u2013634, 2016.\n\n[12] F. Sinz, \u201cA priori knowledge from non-examples,\u201d Ph.D. dissertation, Mar 2007.\n[13] S. Chen and C. 
Zhang, \u201cSelecting informative universum sample for semi-supervised learning.\u201d in IJCAI,\n\n2009, pp. 1016\u20131021.\n\n[14] X. Zhang and Y. LeCun, \u201cUniversum prescription: Regularization using unlabeled data.\u201d in AAAI, 2017,\n\npp. 2907\u20132913.\n\n[15] K. Crammer and Y. Singer, \u201cOn the learnability and design of output codes for multiclass problems,\u201d\n\nMachine learning, vol. 47, no. 2-3, pp. 201\u2013233, 2002.\n\n[16] J. Weston, R. Collobert, F. Sinz, L. Bottou, and V. Vapnik, \u201cInference with the universum,\u201d in Proceedings\n\nof the 23rd international conference on Machine learning. ACM, 2006, pp. 1009\u20131016.\n\n[17] V. Vapnik and O. Chapelle, \u201cBounds on error expectation for support vector machines,\u201d Neural computation,\n\nvol. 12, no. 9, pp. 2013\u20132036, 2000.\n\n[18] J. Weston, C. Watkins et al., \u201cSupport vector machines for multi-class pattern recognition.\u201d in Esann,\n\nvol. 99, 1999, pp. 219\u2013224.\n\n[19] Y. Lei, U. Dogan, A. Binder, and M. Kloft, \u201cMulti-class svms: From tighter data-dependent generalization\n\nbounds to novel algorithms,\u201d in Advances in Neural Information Processing Systems 28, 2015.\n\n[20] S. Szedmak, J. Shawe-Taylor et al., \u201cLearning via linear operators: Maximum margin regression,\u201d in In\n\nProceedings of 2001 IEEE International Conference on Data Mining, 2005.\n\n[21] Y. Lee, Y. Lin, and G. Wahba, \u201cMulticategory support vector machines: Theory and application to the\nclassi\ufb01cation of microarray data and satellite radiance data,\u201d Journal of the American Statistical Association,\nvol. 99, no. 465, pp. 67\u201381, 2004.\n\n[22] E. J. Bredensteiner and K. P. Bennett, \u201cMulticategory classi\ufb01cation by support vector machines,\u201d in\n\nComputational Optimization. Springer, 1999, pp. 53\u201379.\n\n[23] Y. Guermeur and E. 
Monfrini, \u201cA quadratic loss multi-class svm for which a radius\u2013margin bound applies,\u201d\n\nInformatica, vol. 22, no. 1, pp. 73\u201396, 2011.\n\n[24] C. Hsu and C. Lin, \u201cA comparison of methods for multiclass support vector machines,\u201d Neural Networks,\n\nIEEE Transactions on, vol. 13, no. 2, pp. 415\u2013425, 2002.\n\n[25] A. Daniely et al., \u201cMulticlass learning approaches: A theoretical comparison with implications,\u201d in NIPS,\n\n2012.\n\n[26] S. Shalev-Shwartz and S. Ben-David, Understanding machine learning: From theory to algorithms.\n\nCambridge university press, 2014.\n\n[27] M. Mohri, A. Rostamizadeh, and A. Talwalkar, Foundations of machine learning. MIT press, 2018.\n[28] K. Musayeva, F. Lauer, and Y. Guermeur, \u201cRademacher complexity and generalization performance of\n\nmulti-category margin classi\ufb01ers,\u201d Neurocomputing, vol. 342, pp. 6\u201315, 2019.\n\n[29] A. Daniely and S. Shalev-Shwartz, \u201cOptimal learners for multiclass problems,\u201d in Proceedings of The 27th\n\nConference on Learning Theory, ser. Proceedings of Machine Learning Research, vol. 35, 2014.\n\n[30] Y. Lei, \u00dc. Dogan, D. Zhou, and M. Kloft, \u201cGeneralization error bounds for extreme multi-class classi\ufb01ca-\n\ntion,\u201d CoRR, abs/1706.09814, 2017.\n\n[31] B. K. Natarajan, \u201cOn learning sets and functions,\u201d Machine Learning, vol. 4, no. 1, pp. 67\u201397, 1989.\n\n10\n\n\f[32] A. Daniely, S. Sabato, S. Ben-David, and S. Shalev-Shwartz, \u201cMulticlass learnability and the erm principle,\u201d\n\nThe Journal of Machine Learning Research, vol. 16, no. 1, pp. 2377\u20132404, 2015.\n\n[33] F. Lauer and Y. Guermeur, \u201cMSVMpack: a multi-class support vector machine package,\u201d Journal of\n\nMachine Learning Research, vol. 12, pp. 2269\u20132272, 2011, http://www.loria.fr/~lauer/MSVMpack.\n\n[34] \u201clibsvmtools,\u201d https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/, accessed: 2019-05-17.\n[35] N. Japkowicz and M. 
Shah, Evaluating Learning Algorithms: A Classification Perspective. Cambridge University Press, 2011.

[36] A. Luntz, “On estimation of characters obtained in statistical procedure of recognition,” Technicheskaya Kibernetica, 1969.

[37] R. Bonidal, “Sélection de modèle par chemin de régularisation pour les machines à vecteurs support à coût quadratique,” Ph.D. dissertation, June 2013.

[38] J. Stallkamp, M. Schlipsing, J. Salmen, and C. Igel, “Man vs. computer: Benchmarking machine learning algorithms for traffic sign recognition,” Neural Networks, 2012.

[39] M. Fanty and R. Cole, “Spoken letter recognition,” in Advances in Neural Information Processing Systems, 1991, pp. 220–226.

[40] V. Cherkassky, S. Dhar, and W. Dai, “Practical conditions for effectiveness of the universum learning,” IEEE Transactions on Neural Networks, vol. 22, no. 8, pp. 1241–1255, 2011.