{"title": "ShareBoost: Efficient multiclass learning with feature sharing", "book": "Advances in Neural Information Processing Systems", "page_first": 1179, "page_last": 1187, "abstract": "Multiclass prediction is the problem of classifying an object into a relevant target class. We consider the problem of learning a multiclass predictor that uses only few features, and in particular, the number of used features should increase sub-linearly with the number of possible classes. This implies that features should be shared by several classes. We describe and analyze the ShareBoost algorithm for learning a multiclass predictor that uses few shared features. We prove that ShareBoost efficiently finds a predictor that uses few shared features (if such a predictor exists) and that it has a small generalization error. We also describe how to use ShareBoost for learning a non-linear predictor that has a fast evaluation time. In a series of experiments with natural data sets we demonstrate the benefits of ShareBoost and evaluate its success relatively to other state-of-the-art approaches.", "full_text": "ShareBoost: Ef\ufb01cient Multiclass Learning with\n\nFeature Sharing\n\nShai Shalev-Shwartz\u21e4\n\nYonatan Wexler\u2020\n\nAmnon Shashua\u2021\n\nAbstract\n\nMulticlass prediction is the problem of classifying an object into a relevant target\nclass. We consider the problem of learning a multiclass predictor that uses only\nfew features, and in particular, the number of used features should increase sub-\nlinearly with the number of possible classes. This implies that features should be\nshared by several classes. We describe and analyze the ShareBoost algorithm for\nlearning a multiclass predictor that uses few shared features. We prove that Share-\nBoost ef\ufb01ciently \ufb01nds a predictor that uses few shared features (if such a predictor\nexists) and that it has a small generalization error. 
We also describe how to use\nShareBoost for learning a non-linear predictor that has a fast evaluation time. In a\nseries of experiments with natural data sets we demonstrate the bene\ufb01ts of Share-\nBoost and evaluate its success relatively to other state-of-the-art approaches.\n\n1\n\nIntroduction\n\nLearning to classify an object into a relevant target class surfaces in many domains such as docu-\nment categorization, object recognition in computer vision, and web advertisement. In multiclass\nlearning problems we use training examples to learn a classi\ufb01er which will later be used for accu-\nrately classifying new objects. Typically, the classi\ufb01er \ufb01rst calculates several features from the input\nobject and then classi\ufb01es the object based on those features. In many cases, it is important that the\nruntime of the learned classi\ufb01er will be small. In particular, this requires that the learned classi\ufb01er\nwill only rely on the value of few features.\nWe start with predictors that are based on linear combinations of features. Later, in Section 3, we\nshow how our framework enables learning highly non-linear predictors by embedding non-linearity\nin the construction of the features. Requiring the classi\ufb01er to depend on few features is therefore\nequivalent to sparseness of the linear weights of features. In recent years, the problem of learning\nsparse vectors for linear classi\ufb01cation or regression has been given signi\ufb01cant attention. While, in\ngeneral, \ufb01nding the most accurate sparse predictor is known to be NP hard, two main approaches\nhave been proposed for overcoming the hardness result. The \ufb01rst approach uses `1 norm as a surro-\ngate for sparsity (e.g. the Lasso algorithm [33] and the compressed sensing literature [5, 11]). The\nsecond approach relies on forward greedy selection of features (e.g. 
Boosting [15] in the machine learning literature and orthogonal matching pursuit in the signal processing community [35]).\nA popular model for multiclass predictors maintains a weight vector for each one of the classes. In such a case, even if the weight vector associated with each class is sparse, the overall number of used features might grow with the number of classes. Since the number of classes can be rather large, and our goal is to learn a model with a small overall number of features, we would like the weight vectors to share the features with non-zero weights as much as possible. Organizing the weight vectors of all classes as rows of a single matrix, this is equivalent to requiring sparsity of the columns of the matrix.\n\n\u21e4School of Computer Science and Engineering, the Hebrew University of Jerusalem, Israel\n\u2020OrCam Ltd., Jerusalem, Israel\n\u2021OrCam Ltd., Jerusalem, Israel\n\nIn this paper we describe and analyze an efficient algorithm for learning a multiclass predictor whose corresponding matrix of weights has a small number of non-zero columns. We formally prove that if there exists an accurate matrix with a number of non-zero columns that grows sub-linearly with the number of classes, then our algorithm will also learn such a matrix. We apply our algorithm to natural multiclass learning problems and demonstrate its advantages over previously proposed state-of-the-art methods.\nOur algorithm is a generalization of the forward greedy selection approach to sparsity in columns. An alternative approach, which has recently been studied in [26, 12], generalizes the $\\ell_1$-norm-based approach and relies on mixed-norms. We discuss the advantages of the greedy approach over mixed-norms in Section 1.2.\n\n1.1 Formal problem statement\nLet $\\mathcal{V}$ be the set of objects we would like to classify. For example, $\\mathcal{V}$ can be the set of gray scale images of a certain size.\n
For each object $v \\in \\mathcal{V}$, we have a pool of $d$ predefined features, each of which is a real number in $[-1, 1]$. That is, we can represent each $v \\in \\mathcal{V}$ as a vector of features $x \\in [-1, 1]^d$. We note that the mapping from $v$ to $x$ can be non-linear and that $d$ can be very large. For example, we can define $x$ so that each element $x_i$ corresponds to some patch, $p \\in \\{\\pm 1\\}^{q \\times q}$, and a threshold $\\theta$, where $x_i$ equals 1 if there is a patch of $v$ whose inner product with $p$ is higher than $\\theta$. We discuss some generic methods for constructing features in Section 3. From this point onward we assume that $x$ is given.\nThe set of possible classes is denoted by $\\mathcal{Y} = \\{1, \\ldots, k\\}$. Our goal is to learn a multiclass predictor, which is a mapping from the features of an object into $\\mathcal{Y}$. We focus on the set of predictors parametrized by matrices $W \\in \\mathbb{R}^{k,d}$ that take the following form:\n\n$$h_W(x) = \\arg\\max_{y \\in \\mathcal{Y}} (W x)_y . \\qquad (1)$$\n\nThat is, the matrix $W$ maps each $d$-dimensional feature vector into a $k$-dimensional score vector, and the actual prediction is the index of the maximal element of the score vector. If the maximizer is not unique, we break ties arbitrarily.\nRecall that our goal is to find a matrix $W$ with few non-zero columns. We denote by $W_{\\cdot,i}$ the $i$-th column of $W$ and use the notation $\\|W\\|_{\\infty,0} = |\\{i : \\|W_{\\cdot,i}\\|_\\infty > 0\\}|$ to denote the number of columns of $W$ which are not identically the zero vector. More generally, given a matrix $W$ and a pair of norms $\\|\\cdot\\|_p, \\|\\cdot\\|_r$ we denote $\\|W\\|_{p,r} = \\|(\\|W_{\\cdot,1}\\|_p, \\ldots, \\|W_{\\cdot,d}\\|_p)\\|_r$, that is, we apply the $p$-norm on the columns of $W$ and the $r$-norm on the resulting $d$-dimensional vector.\nThe 0-1 loss of a multiclass predictor $h_W$ on an example $(x, y)$ is defined as $\\mathbf{1}[h_W(x) \\neq y]$. That is, the 0-1 loss equals 1 if $h_W(x) \\neq y$ and 0 otherwise.\n
Since this loss function is not convex with respect to $W$, we use a surrogate convex loss function based on the following easy to verify inequalities:\n\n$$\\mathbf{1}[h_W(x) \\neq y] \\le \\mathbf{1}[h_W(x) \\neq y] - (W x)_y + (W x)_{h_W(x)}$$\n$$\\le \\max_{y' \\in \\mathcal{Y}} \\mathbf{1}[y' \\neq y] - (W x)_y + (W x)_{y'} \\qquad (2)$$\n$$\\le \\ln \\sum_{y' \\in \\mathcal{Y}} e^{\\mathbf{1}[y' \\neq y] - (W x)_y + (W x)_{y'}} . \\qquad (3)$$\n\nWe use the notation $\\ell(W, (x, y))$ to denote the right-hand side (eqn. (3)) of the above. The loss given in eqn. (2) is the multi-class hinge loss [7] used in Support-Vector-Machines, whereas $\\ell(W, (x, y))$ is the result of performing a \u201csoft-max\u201d operation: $\\max_x f(x) \\le (1/p) \\ln \\sum_x e^{p f(x)}$, where equality holds for $p \\to \\infty$.\nThis logistic multiclass loss function $\\ell(W, (x, y))$ has several nice properties (see for example [39]). Besides being a convex upper bound on the 0-1 loss, it is smooth. The reason we need the loss function to be both convex and smooth is as follows. If a function is convex, then its first order approximation at any point gives us a lower bound on the function at any other point. When the function is also smooth, the first order approximation gives us both lower and upper bounds on the value of the function at any other point (footnote 1). ShareBoost uses the gradient of the loss function at the current solution (i.e. the first order approximation of the loss) to make a greedy choice of which column to update. To ensure that this greedy choice indeed yields a significant improvement we must know that the first order approximation is indeed close to the actual loss function, and for that we need both lower and upper bounds on the quality of the first order approximation.\nGiven a training set $S = (x_1, y_1), \\ldots, (x_m, y_m)$, the average training loss of a matrix $W$ is $L(W) = \\frac{1}{m} \\sum_{(x,y) \\in S} \\ell(W, (x, y))$. We aim at approximately solving the problem\n\n$$\\min_{W \\in \\mathbb{R}^{k,d}} L(W) \\quad \\text{s.t.} \\quad \\|W\\|_{\\infty,0} \\le s . \\qquad (4)$$\n\nThat is, find the matrix $W$ with minimal training loss among all matrices with column sparsity of at most $s$, where $s$ is a user-defined parameter. Since $\\ell(W, (x, y))$ is an upper bound on $\\mathbf{1}[h_W(x) \\neq y]$, by minimizing $L(W)$ we also decrease the average 0-1 error of $W$ over the training set. In Section 4 we show that for sparse models, a small training error is likely to yield a small error on unseen examples as well.\nRegrettably, the constraint $\\|W\\|_{\\infty,0} \\le s$ in eqn. (4) is non-convex, and solving the optimization problem in eqn. (4) is NP-hard [24, 9]. To overcome the hardness result, the ShareBoost algorithm follows the forward greedy selection approach. The algorithm comes with formal generalization and sparsity guarantees (described in Section 4) that make ShareBoost an attractive multiclass learning engine due to its efficiency (both during training and at test time) and accuracy.\n\n1.2 Related Work\n\nThe centrality of the multiclass learning problem has spurred the development of various approaches for tackling the task. Perhaps the most straightforward approach is a reduction from multiclass to binary, e.g. the one-vs-rest or all pairs constructions. The more direct approach we choose, in particular the multiclass predictors of the form given in eqn. (1), has been extensively studied and has shown great success in practice (see for example [13, 37, 7]).\nAn alternative construction, abbreviated as the single-vector model, shares a single weight vector, for all the classes, paired with class-specific feature mappings. This construction is common in generalized additive models [17] and multiclass versions of boosting [16, 28], and has been popularized lately due to its role in prediction with structured output, where the number of classes is exponentially large (see e.g. [31]).\n
While this approach can yield predictors with a rather mild dependency of the required features on $k$ (see for example the analysis in [39, 31, 14]), it relies on a-priori assumptions on the structure of $\\mathcal{X}$ and $\\mathcal{Y}$. In contrast, in this paper we tackle general multiclass prediction problems, like object recognition or document classification, where it is not straightforward, or even plausible, to construct good class-specific feature mappings a priori, and therefore the single-vector model is not adequate.\nThe class of predictors of the form given in eqn. (1) can be trained using Frobenius norm regularization (as done by multiclass SVM, see e.g. [7]) or using $\\ell_1$ regularization over all the entries of $W$. However, as pointed out in [26], these regularizers might yield a matrix with many non-zero columns, and hence will lead to a predictor that uses many features.\nThe alternative approach, and the most relevant to our work, is the use of mixed-norm regularizations like $\\|W\\|_{\\infty,1}$ or $\\|W\\|_{2,1}$ [21, 36, 2, 3, 26, 12, 19]. For example, [12] solves the following problem:\n\n$$\\min_{W \\in \\mathbb{R}^{k,d}} L(W) + \\lambda \\|W\\|_{\\infty,1} , \\qquad (5)$$\n\nwhich can be viewed as a convex approximation of our objective (eqn. (4)). This is advantageous from an optimization point of view, as one can find the global optimum of a convex problem, but it remains unclear how well the convex program approximates the original goal. For example, in Section C we show cases where mixed-norm regularization does not yield sparse solutions while ShareBoost does yield a sparse solution.\n
Despite the fact that ShareBoost tackles a non-convex program, and is thus limited to locally optimal solutions, we prove in Theorem 2 that under mild conditions ShareBoost is guaranteed to find an accurate sparse solution whenever such a solution exists and that the generalization error is bounded as shown in Theorem 1.\n\nFootnote 1: Smoothness guarantees that $|f(x) - f(x') - \\nabla f(x')(x - x')| \\le \\beta \\|x - x'\\|^2$ for some $\\beta$ and all $x, x'$. Therefore one can approximate $f(x)$ by $f(x') + \\nabla f(x')(x - x')$, and the approximation error is upper bounded by a constant times the squared distance between $x$ and $x'$.\n\nWe note that several recent papers (e.g. [19]) established exact recovery guarantees for mixed norms, which may seem to be stronger than our guarantee given in Theorem 2. However, the assumptions in [19] are much stronger than the assumptions of Theorem 2. In particular, they have strong noise assumptions and a group-RIP-like assumption (Assumptions 4.1-4.3 in their paper). In contrast, we impose no such restrictions. We would like to stress that in many generic practical cases, the assumptions of [19] will not hold. For example, when using decision stumps, features will be highly correlated, which will violate Assumption 4.3 of [19].\nAnother advantage of ShareBoost is that its only parameter is the desired number of non-zero columns of $W$. Furthermore, obtaining the whole regularization path of ShareBoost, that is, the curve of accuracy as a function of sparsity, can be performed by a single run of ShareBoost, which is much easier than obtaining the whole regularization path of the convex relaxation in eqn. (5). Last but not least, ShareBoost can work even when the initial number of features, $d$, is very large, as long as there is an efficient way to choose the next feature. For example, when the features are constructed using decision stumps, $d$ will be extremely large, but ShareBoost can still be implemented efficiently.\n
In contrast, when $d$ is extremely large, mixed-norm regularization techniques yield challenging optimization problems.\nAs mentioned before, ShareBoost follows the forward greedy selection approach for tackling the hardness of solving eqn. (4). The greedy approach has been widely studied in the context of learning sparse predictors for linear regression. However, in multiclass problems, one needs sparsity of groups of variables (columns of $W$). ShareBoost generalizes the fully corrective greedy selection procedure given in [29] to the case of selection of groups of variables, and our analysis follows similar techniques.\nObtaining group sparsity by greedy methods has also been studied recently in [20, 23], and indeed, ShareBoost shares similarities with these works. We differ from [20] in that our analysis does not impose strong assumptions (e.g. group-RIP) and so ShareBoost applies to a much wider array of applications. In addition, the specific criterion for choosing the next feature is different. In [20], a ratio between the difference in objective and the difference in costs is used. In ShareBoost, the $\\ell_1$ norm of the selected column of the gradient matrix is used. For the multiclass problem with log loss, the criterion of ShareBoost is much easier to compute, especially in large scale problems. [23] suggested many other selection rules that are geared toward the squared loss, which is far from being an optimal loss function for multiclass problems.\nAnother related method is the JointBoost algorithm [34]. While the original presentation in [34] seems rather different from the type of predictors we describe in eqn. (1), it is possible to show that JointBoost in fact learns a matrix $W$ with additional constraints. In particular, the features $x$ are assumed to be decision stumps and each column $W_{\\cdot,i}$ is constrained to be $\\alpha_i (\\mathbf{1}[1 \\in C_i], \\ldots, \\mathbf{1}[k \\in C_i])$, where $\\alpha_i \\in \\mathbb{R}$ and $C_i \\subset \\mathcal{Y}$. That is, the stump is shared by all classes in the subset $C_i$.\n
JointBoost chooses such shared decision stumps in a greedy manner by applying the GentleBoost algorithm on top of this presentation. A major disadvantage of JointBoost is that in its pure form, it should exhaustively search $C$ among all $2^k$ possible subsets of $\\mathcal{Y}$. In practice, [34] relies on heuristics for finding $C$ on each boosting step. In contrast, ShareBoost allows the entries of the columns of $W$ to be any real numbers, thus allowing \u201csoft\u201d sharing between classes. Therefore, ShareBoost has the same (or even richer) expressive power compared to JointBoost. Moreover, ShareBoost automatically identifies the relatedness between classes (corresponding to choosing the set $C$) without having to rely on exhaustive search. ShareBoost is also fully corrective, in the sense that it extracts all the information from the selected features before adding new ones. This leads to higher accuracy while using fewer features, as was shown in our experiments on image classification. Lastly, ShareBoost comes with theoretical guarantees.\nFinally, we mention that feature sharing is merely one way of transferring information across classes [32], and several alternative ways have been proposed in the literature, such as target embedding [18, 4], shared hidden structure [22, 1], shared prototypes [27], or sharing an underlying metric [38].\n\n2 The ShareBoost Algorithm\n\nShareBoost is a forward greedy selection approach for solving eqn. (4). Usually, in a greedy approach, we update the weight of one feature at a time. Here, we update one column of $W$ at a time (since the desired sparsity is over columns). We choose the column that maximizes the $\\ell_1$ norm of the corresponding column of the gradient of the loss at $W$. Since $W$ is a matrix we have that $\\nabla L(W)$ is a matrix of the partial derivatives of $L$. Denote by $\\nabla_r L(W)$ the $r$-th column of $\\nabla L(W)$, that is, the vector $\\left( \\frac{\\partial L(W)}{\\partial W_{1,r}}, \\ldots, \\frac{\\partial L(W)}{\\partial W_{k,r}} \\right)$. A standard calculation shows that\n\n$$\\frac{\\partial L(W)}{\\partial W_{q,r}} = \\frac{1}{m} \\sum_{(x,y) \\in S} \\sum_{c \\in \\mathcal{Y}} \\rho_c(x, y) \\, x_r \\, (\\mathbf{1}[q = c] - \\mathbf{1}[q = y]) ,$$\n\nwhere\n\n$$\\rho_c(x, y) = \\frac{e^{\\mathbf{1}[c \\neq y] - (W x)_y + (W x)_c}}{\\sum_{y' \\in \\mathcal{Y}} e^{\\mathbf{1}[y' \\neq y] - (W x)_y + (W x)_{y'}}} . \\qquad (6)$$\n\nNote that $\\sum_c \\rho_c(x, y) = 1$ for all $(x, y)$. Therefore, we can rewrite $\\frac{\\partial L(W)}{\\partial W_{q,r}} = \\frac{1}{m} \\sum_{(x,y)} x_r (\\rho_q(x, y) - \\mathbf{1}[q = y])$. Based on the above we have\n\n$$\\|\\nabla_r L(W)\\|_1 = \\frac{1}{m} \\sum_{q \\in \\mathcal{Y}} \\left| \\sum_{(x,y)} x_r (\\rho_q(x, y) - \\mathbf{1}[q = y]) \\right| . \\qquad (7)$$\n\nFinally, after choosing the column for which $\\|\\nabla_r L(W)\\|_1$ is maximized, we re-optimize all the columns of $W$ which were selected so far. The resulting algorithm is given in Algorithm 1.\n\nAlgorithm 1 ShareBoost\n1: Initialize: $W = 0$; $I = \\emptyset$\n2: for $t = 1, 2, \\ldots, T$ do\n3:   For each class $c$ and example $(x, y)$ define $\\rho_c(x, y)$ as in eqn. (6)\n4:   Choose feature $r$ that maximizes the right-hand side of eqn. (7)\n5:   $I \\leftarrow I \\cup \\{r\\}$\n6:   Set $W \\leftarrow \\arg\\min_W L(W)$ s.t. $W_{\\cdot,i} = 0$ for all $i \\notin I$\n7: end for\n\nThe runtime of ShareBoost is as follows. Steps 3-5 require $O(mdk)$. Step 6 is a convex optimization problem in $tk$ variables and can be performed using various methods. In our experiments, we used Nesterov's accelerated gradient method [25], whose runtime is $O(mtk/\\sqrt{\\epsilon})$ for a smooth objective, where $\\epsilon$ is the desired accuracy. Therefore, the overall runtime is $O(T m d k + T^2 m k/\\sqrt{\\epsilon})$. It is interesting to compare this runtime to the complexity of minimizing the mixed-norm regularization objective given in eqn. (5). Since the objective is no longer smooth, the runtime of using Nesterov's accelerated method would be $O(mdk/\\epsilon)$, which can be much larger than the runtime of ShareBoost when $d \\gg T$.\n\n2.1 Variants of ShareBoost\n\nWe now describe several variants of ShareBoost.\n
The analysis we present in Section 4 can be easily adapted for these variants as well.\n\nModifying the Greedy Choice Rule ShareBoost chooses the feature $r$ which maximizes the $\\ell_1$ norm of the $r$-th column of the gradient matrix. Our analysis shows that this choice leads to a sufficient decrease of the objective function. However, one can easily develop other ways for choosing a feature which may potentially lead to an even larger decrease of the objective. For example, we can choose a feature $r$ that minimizes $L(W)$ over matrices $W$ with support of $I \\cup \\{r\\}$. This will lead to the maximal possible decrease of the objective function at the current iteration. Of course, the runtime of choosing $r$ will now be much larger. Some intermediate options are to choose $r$ that minimizes $\\min_{\\alpha \\in \\mathbb{R}} L(W + \\alpha \\nabla_r L(W) e_r^\\dagger)$ or to choose $r$ that minimizes $\\min_{w \\in \\mathbb{R}^k} L(W + w e_r^\\dagger)$, where $e_r^\\dagger$ is the all-zero row vector except 1 in the $r$-th position.\n\nSelecting a Group of Features at a Time In some situations, features can be divided into groups where the runtime of calculating a single feature in each group is almost the same as the runtime of calculating all features in the group. In such cases, it makes sense to choose groups of features at each iteration of ShareBoost. This can be done by simply choosing the group of features $J$ that maximizes $\\sum_{j \\in J} \\|\\nabla_j L(W)\\|_1$.\n\nAdding Regularization Our analysis implies that when $|S|$ is significantly larger than $\\tilde{O}(T k)$, ShareBoost will not overfit. When this is not the case, we can incorporate regularization in the objective of ShareBoost in order to prevent overfitting. One simple way is to add to the objective function $L(W)$ a Frobenius norm regularization term of the form $\\lambda \\sum_{i,j} W_{i,j}^2$, where $\\lambda$ is a regularization parameter. It is easy to verify that this is a smooth and convex function and therefore we can easily adapt ShareBoost to deal with this regularized objective.\n
It is also possible to rely on other norms such as the $\\ell_1$ norm or the $\\ell_\\infty/\\ell_1$ mixed-norm. However, there is one technicality due to the fact that these norms are not smooth. We can overcome this problem by defining smooth approximations to these norms. The main idea is to first note that for a scalar $a$ we have $|a| = \\max\\{a, -a\\}$ and therefore we can rewrite the aforementioned norms using max and sum operations. Then, we can replace each max expression with its soft-max counterpart and obtain a smooth version of the overall norm function. For example, a smooth version of the $\\ell_\\infty/\\ell_1$ norm will be $\\|W\\|_{\\infty,1} \\approx \\frac{1}{\\beta} \\sum_{j=1}^{d} \\log\\left( \\sum_{i=1}^{k} (e^{\\beta W_{i,j}} + e^{-\\beta W_{i,j}}) \\right)$, where $\\beta \\ge 1$ controls the tradeoff between quality of approximation and smoothness.\n\n3 Non-Linear Prediction Rules\n\nWe now demonstrate how ShareBoost can be used for learning non-linear predictors. The main idea is similar to the approach taken by Boosting and SVM. That is, we construct a non-linear predictor by first mapping the original features into a higher dimensional space and then learning a linear predictor in that space, which corresponds to a non-linear predictor over the original feature space. To illustrate this idea we present two concrete mappings. The first is the decision stumps method, which is widely used by Boosting algorithms. The second approach shows how to use ShareBoost for learning piece-wise linear predictors and is inspired by the super-vectors construction recently described in [40].\n\n3.1 ShareBoost with Decision Stumps\nLet $v \\in \\mathbb{R}^p$ be the original feature vector representing an object. A decision stump is a binary feature of the form $\\mathbf{1}[v_i \\le \\theta]$, for some feature $i \\in \\{1, \\ldots, p\\}$ and threshold $\\theta \\in \\mathbb{R}$. To construct a non-linear predictor we can map each object $v$ into a feature-vector $x$ that contains all possible decision stumps.\n
Naturally, the dimensionality of $x$ is very large (in fact, it can even be infinite), and calculating Step 4 of ShareBoost may take forever. Luckily, a simple trick yields an efficient solution. First note that for each $i$, all stump features corresponding to $i$ can take at most $m + 1$ values on a training set of size $m$. Therefore, if we sort the values of $v_i$ over the $m$ examples in the training set, we can calculate the value of the right-hand side of eqn. (7) for all possible values of $\\theta$ in total time of $O(m)$. Thus, ShareBoost can be implemented efficiently with decision stumps.\n\n3.2 Learning Piece-wise Linear Predictors with ShareBoost\n\nTo motivate our next construction let us first consider a simple one dimensional function estimation problem. Given a sample $(x_1, y_1), \\ldots, (x_m, y_m)$ we would like to find a function $f : \\mathbb{R} \\to \\mathbb{R}$ such that $f(x_i) \\approx y_i$ for all $i$. The class of piece-wise linear functions can be a good candidate for the approximation function $f$. See for example the illustration in Fig. 1. In fact, it is easy to verify that all smooth functions can be approximated by piece-wise linear functions (see for example the discussion in [40]). In general, we can express piece-wise linear vector-valued functions as $f(v) = \\sum_{j=1}^{q} \\mathbf{1}[\\|v - v_j\\| < r_j] (\\langle u_j, v \\rangle + b_j)$, where $q$ is the number of pieces, $(u_j, b_j)$ represents the linear function corresponding to piece $j$, and $(v_j, r_j)$ represents the center and radius of piece $j$. This expression can also be written as a linear function over a different domain, $f(v) = \\langle w, \\psi(v) \\rangle$, where\n\n$$\\psi(v) = [\\, \\mathbf{1}[\\|v - v_1\\| < r_1] \\, [v, 1], \\ldots, \\mathbf{1}[\\|v - v_q\\| < r_q] \\, [v, 1] \\,] .$$\n\nFigure 1: Motivating super vectors.\n\nIn the case of learning a multiclass predictor, we shall learn a predictor $v \\mapsto W \\psi(v)$, where $W$ will be a $k$ by $\\dim(\\psi(v))$ matrix.\n
ShareBoost can be used for learning $W$. Furthermore, we can apply the group-selection variant of ShareBoost described in Section 2.1 to learn a piece-wise linear model with few pieces (that is, each group of features will correspond to one piece of the model). In practice, we first define a large set of candidate centers by applying some clustering method to the training examples, and second we define a set of possible radii by taking values of quantiles from the training examples. Then, we train ShareBoost so as to choose a multiclass predictor that uses only a few pairs $(v_j, r_j)$. The advantage of using ShareBoost here is that while it learns a non-linear model it will try to find a model with few linear \u201cpieces\u201d, which is advantageous both in terms of test runtime and in terms of generalization performance.\n\n4 Analysis\n\nIn this section we provide formal guarantees for the ShareBoost algorithm. The proofs are deferred to the appendix. We first show that if the algorithm has managed to find a matrix $W$ with a small number of non-zero columns and a small training error, then the generalization error of $W$ is also small. The bound below is in terms of the 0-1 loss. A related bound, which is given in terms of the convex loss function, is described in [39].\n\nTheorem 1 Suppose that the ShareBoost algorithm runs for $T$ iterations and let $W$ be its output matrix. Then, with probability of at least $1 - \\delta$ over the choice of the training set $S$ we have that\n\n$$\\Pr_{(x,y) \\sim \\mathcal{D}}[h_W(x) \\neq y] \\le \\Pr_{(x,y) \\sim S}[h_W(x) \\neq y] + O\\left( \\sqrt{\\frac{T k \\log(T k) \\log(k) + T \\log(d) + \\log(1/\\delta)}{|S|}} \\right) .$$\n\nNext, we analyze the sparsity guarantees of ShareBoost. As mentioned previously, exactly solving eqn. (4) is known to be NP hard. The following main theorem gives an interesting approximation guarantee.\n
It tells us that if there exists an accurate solution with small $\\ell_{\\infty,1}$ norm, then the ShareBoost algorithm will find a good sparse solution.\n\nTheorem 2 Let $\\epsilon > 0$ and let $W^\\star$ be an arbitrary matrix. Assume that we run the ShareBoost algorithm for $T = \\left\\lceil 4 \\, \\frac{1}{\\epsilon} \\, \\|W^\\star\\|_{\\infty,1}^2 \\right\\rceil$ iterations and let $W$ be the output matrix. Then, $\\|W\\|_{\\infty,0} \\le T$ and $L(W) \\le L(W^\\star) + \\epsilon$.\n\n5 Experiments\n\nIn this section we demonstrate the merits (and pitfalls) of ShareBoost by comparing it to alternative algorithms in different scenarios. The first experiment exemplifies the feature sharing property of ShareBoost. We perform experiments with an OCR data set and demonstrate a mild growth of the number of features as the number of classes grows from 2 to 36. The second experiment shows that ShareBoost can construct predictors with state-of-the-art accuracy while only requiring few features, which amounts to fast prediction runtime. The third experiment, which due to lack of space is deferred to Appendix A.3, compares ShareBoost to mixed-norm regularization and to the JointBoost algorithm of [34]. We follow the same experimental setup as in [12]. The main finding is that ShareBoost outperforms the mixed-norm regularization method when the output predictor needs to be very sparse, while mixed-norm regularization can be better in the regime of rather dense predictors. We also show that ShareBoost is both faster and more accurate than JointBoost.\n\nFeature Sharing The main motivation for deriving the ShareBoost algorithm is the need for a multiclass predictor that uses only few features, and in particular, the number of features should increase slowly with the number of classes. To demonstrate this property of ShareBoost we experimented with the Char74k data set, which consists of images of digits and\n
We trained ShareBoost with the number of classes varying from 2 classes to the\n36 classes corresponding to the 10 digits and 26 capital letters. We calculated how many\nfeatures were required to achieve a certain \ufb01xed accuracy as a function of the number of\nclasses. Due to lack of space, the description of the feature space is deferred to the appendix.\n\nWe compared ShareBoost to the 1-vs-rest approach, where in the\nlatter, we trained each binary classi\ufb01er using the same mechanism\nas used by ShareBoost. Namely, we minimize the binary logistic\nloss using a greedy algorithm. Both methods aim at construct-\ning sparse predictors using the same greedy approach. The dif-\nference between the methods is that ShareBoost selects features\nin a shared manner while the 1-vs-rest approach selects features\nfor each binary problem separately. In Fig. 2 we plot the overall\nnumber of features required by both methods to achieve a \ufb01xed\naccuracy on the test set as a function of the number of classes. As\ncan be easily seen, the increase in the number of required features\nis mild for ShareBoost but signi\ufb01cant for the 1-vs-rest approach.\n\ns\ne\nr\nu\n\nt\n\na\ne\n\nf\n \n\n#\n\n350\n\n300\n\n250\n\n200\n\n150\n\n100\n\n50\n\n0\n\n \n0\n\n \n\n5\n\n10\n\n15\n\n20\n\n25\n\n30\n\n35\n\n40\n\n# classes\n\nFigure 2: The number of features\nrequired to achieve a \ufb01xed accu-\nracy as a function of the number\nof classes for ShareBoost (dashed)\nand the 1-vs-rest\n(solid-circles).\nThe blue lines are for a target error\nof 20% and the green lines are for\n8%.\n\nConstructing fast and accurate predictors The goal of our\nthis experiment is to show that ShareBoost achieves state-of-\nthe-art performance while constructing very fast predictors. We\nexperimented with the MNIST digit dataset, which consists of\na training set of 60, 000 digits represented by centered size-\nnormalized 28 \u21e5 28 images, and a test set of 10, 000 digits. 
The MNIST dataset has been extensively\nstudied and is considered a standard test for multiclass classification of handwritten digits.\nThe SVM algorithm with Gaussian kernel achieves an error rate of 1.4% on the test set.\nThe error rates achieved by the most advanced algorithms are below 1% on the test set. See\nhttp://yann.lecun.com/exdb/mnist/. In particular, the top MNIST performer [6] uses\na feed-forward Neural-Net with 7.6 million connections which roughly translates to 7.6 million\nmultiply-accumulate (MAC) operations at run-time as well. During training, geometrically distorted\nversions of the original examples were generated in order to expand the training set following\n[30] who introduced a warping scheme for that purpose. The top performance error rate stands at\n0.35% at a run-time cost of 7.6 million MAC operations per test example.\n\nThe error rate of ShareBoost with T = 266 rounds stands at 0.71% using the original training set,\nand at 0.47% with T = 305 rounds using the expanded training set of 360,000 examples generated\nby adding five deformed instances per original example. Fig. 3 displays the convergence curve of\nthe error rate as a function of the number of rounds. Note that the training error\nis higher than the test error. This follows from the fact that the\ntraining set was expanded with 5 fairly strong deformed versions\nof each input, using the method in [30]. As can be seen, fewer than\n75 features suffice to obtain an error rate below 1%.\n\nIn terms of run-time on a test image, the system requires 305 convolutions\nof 7\u00d77 templates and 540 dot-product operations, which\ntotals roughly $3.3 \\cdot 10^6$ MAC operations, compared to around\n$7.5 \\cdot 10^6$ MAC operations for the top MNIST performer. The error\nrate of 0.47% is better than that reported by [10] who used a 1-vs-all SVM with a 9-degree polynomial\nkernel and with an expanded training set of 780,000 examples. 
The number of support vectors\n(accumulated over the ten separate binary classifiers) was 163,410, giving rise to a run-time\n21-fold that of ShareBoost. Moreover, due to the fast convergence of ShareBoost, 75 rounds are\nenough for achieving less than 1% error.\n\n[Figure 3 plot: error rate (Train and Test curves) versus Rounds.]\n\nFigure 3: The test error rate of ShareBoost on the MNIST dataset\nas a function of the number of rounds using patch based features.\n\nAcknowledgements: We would like to thank Itay Erlich and Zohar Bar-Yehuda for their contribution\nto the implementation of ShareBoost and Ronen Katz for helpful comments.\n\nReferences\n[1] Y. Amit, M. Fink, N. Srebro, and S. Ullman. Uncovering shared structures in multiclass classification. In International Conference on Machine Learning, 2007.\n[2] A. Argyriou, T. Evgeniou, and M. Pontil. Multi-task feature learning. In NIPS, pages 41\u201348, 2006.\n[3] F.R. Bach. Consistency of the group lasso and multiple kernel learning. J. of Machine Learning Research, 9:1179\u20131225, 2008.\n[4] S. Bengio, J. Weston, and D. Grangier. Label embedding trees for large multi-class tasks. In NIPS, 2011.\n[5] E.J. Candes and T. Tao. Decoding by linear programming. IEEE Trans. on Information Theory, 51:4203\u20134215, 2005.\n[6] D. C. Ciresan, U. Meier, L. Maria G., and J. Schmidhuber. Deep big simple neural nets excel on handwritten digit recognition. CoRR, 2010.\n[7] K. Crammer and Y. Singer. Ultraconservative online algorithms for multiclass problems. Journal of Machine Learning Research, 3:951\u2013991, 2003.\n[8] A. Daniely, S. Sabato, S. Ben-David, and S. Shalev-Shwartz. Multiclass learnability and the ERM principle. In COLT, 2011.\n[9] G. Davis, S. Mallat, and M. Avellaneda. Greedy adaptive approximation. Journal of Constructive Approximation, 13:57\u201398, 1997.\n[10] D. Decoste and S. 
Bernhard. Training invariant support vector machines. Mach. Learn., 46:161\u2013190, 2002.\n[11] D.L. Donoho. Compressed sensing. In Technical Report, Stanford University, 2006.\n[12] J. Duchi and Y. Singer. Boosting with structural sparsity. In Proc. ICML, pages 297\u2013304, 2009.\n[13] R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. Wiley, 1973.\n[14] M. Fink, S. Shalev-Shwartz, Y. Singer, and S. Ullman. Online multiclass learning by interclass hypothesis sharing. In International Conference on Machine Learning, 2006.\n[15] Y. Freund and R. E. Schapire. A short introduction to boosting. J. of Japanese Society for AI, pages 771\u2013780, 1999.\n[16] Y. Freund and R.E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, pages 119\u2013139, 1997.\n[17] T.J. Hastie and R.J. Tibshirani. Generalized additive models. Chapman & Hall, 1995.\n[18] D. Hsu, S.M. Kakade, J. Langford, and T. Zhang. Multi-label prediction via compressed sensing. In NIPS, 2010.\n[19] J. Huang and T. Zhang. The benefit of group sparsity. Annals of Statistics, 38(4), 2010.\n[20] J. Huang, T. Zhang, and D.N. Metaxas. Learning with structured sparsity. In ICML, 2009.\n[21] G.R.G. Lanckriet, N. Cristianini, P.L. Bartlett, L. El Ghaoui, and M.I. Jordan. Learning the kernel matrix with semidefinite programming. J. of Machine Learning Research, pages 27\u201372, 2004.\n[22] Y. L. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of IEEE, pages 2278\u20132324, 1998.\n[23] A. Majumdar and R.K. Ward. Fast group sparse classification. Electrical and Computer Engineering, Canadian Journal of, 34(4):136\u2013144, 2009.\n[24] B. Natarajan. Sparse approximate solutions to linear systems. SIAM J. Computing, pages 227\u2013234, 1995.\n[25] Y. Nesterov and I.U.E. Nesterov. 
Introductory lectures on convex optimization: A basic course, volume 87. Springer Netherlands, 2004.\n[26] A. Quattoni, X. Carreras, M. Collins, and T. Darrell. An efficient projection for l1,infinity regularization. In ICML, page 108, 2009.\n[27] A. Quattoni, M. Collins, and T. Darrell. Transfer learning for image classification with sparse prototype representations. In CVPR, 2008.\n[28] R. E. Schapire and Y. Singer. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37(3):1\u201340, 1999.\n[29] S. Shalev-Shwartz, T. Zhang, and N. Srebro. Trading accuracy for sparsity in optimization problems with sparsity constraints. SIAM Journal on Optimization, 20:2807\u20132832, 2010.\n[30] P. Y. Simard, Dave S., and John C. Platt. Best practices for convolutional neural networks applied to visual document analysis. Document Analysis and Recognition, 2003.\n[31] B. Taskar, C. Guestrin, and D. Koller. Max-margin markov networks. In NIPS, 2003.\n[32] S. Thrun. Learning to learn: Introduction. Kluwer Academic Publishers, 1996.\n[33] R. Tibshirani. Regression shrinkage and selection via the lasso. J. Royal. Statist. Soc B., 58(1):267\u2013288, 1996.\n[34] A. Torralba, K. P. Murphy, and W. T. Freeman. Sharing visual features for multiclass and multiview object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), pages 854\u2013869, 2007.\n[35] J.A. Tropp and A.C. Gilbert. Signal recovery from random measurements via orthogonal matching pursuit. Information Theory, IEEE Transactions on, 53(12):4655\u20134666, 2007.\n[36] B. A Turlach, W. N V., and Stephen J Wright. Simultaneous variable selection. Technometrics, 47, 2000.\n[37] V. N. Vapnik. Statistical Learning Theory. Wiley, 1998.\n[38] E. Xing, A.Y. Ng, M. Jordan, and S. Russell. Distance metric learning, with application to clustering with side-information. In NIPS, 2003.\n[39] T. Zhang. 
Class-size independent generalization analysis of some discriminative multi-category classification. In NIPS, 2004.\n[40] X. Zhou, K. Yu, T. Zhang, and T. Huang. Image classification using super-vector coding of local image descriptors. Computer Vision\u2013ECCV 2010, pages 141\u2013154, 2010.\n", "award": [], "sourceid": 684, "authors": [{"given_name": "Shai", "family_name": "Shalev-shwartz", "institution": null}, {"given_name": "Yonatan", "family_name": "Wexler", "institution": null}, {"given_name": "Amnon", "family_name": "Shashua", "institution": null}]}