{"title": "Learning Multiple Tasks using Shared Hypotheses", "book": "Advances in Neural Information Processing Systems", "page_first": 1475, "page_last": 1483, "abstract": "In this work we consider a setting where we have a very large number of related tasks with few examples from each individual task. Rather than either learning each task individually (and having a large generalization error) or learning all the tasks together using a single hypothesis (and suffering a potentially large inherent error), we consider learning a small pool of {\\em shared hypotheses}. Each task is then mapped to a single hypothesis in the pool (hard association). We derive VC dimension generalization bounds for our model, based on the number of tasks, shared hypothesis and the VC dimension of the hypotheses class. We conducted experiments with both synthetic problems and sentiment of reviews, which strongly support our approach.", "full_text": "Learning Multiple Tasks using Shared Hypotheses\n\nKoby Crammer\n\nDepartment of Electrical Enginering\n\nThe Technion - Israel Institute of Technology\n\nHaifa, 32000 Israel\n\nkoby@ee.technion.ac.il\n\nYishay Mansour\n\nSchool of Computer Science\n\nTel Aviv University\nTel - Aviv 69978\n\nmansour@tau.ac.il\n\nAbstract\n\nIn this work we consider a setting where we have a very large number of related\ntasks with few examples from each individual task. Rather than either learning\neach task individually (and having a large generalization error) or learning all the\ntasks together using a single hypothesis (and suffering a potentially large inherent\nerror), we consider learning a small pool of shared hypotheses. Each task is then\nmapped to a single hypothesis in the pool (hard association). We derive VC dimen-\nsion generalization bounds for our model, based on the number of tasks, shared\nhypothesis and the VC dimension of the hypotheses class. 
We conducted experiments with both synthetic problems and sentiment analysis of reviews, which strongly support our approach.\n\n1 Introduction\n\nConsider a sentiment analysis task for a set of reviews of different products. Each individual product has only very few reviews, which does not enable reliable learning. Furthermore, reviewers may use different amounts and levels of superlatives to describe the same sentiment level, or feel different sentiment levels yet describe the product with the same text. For example, one may use the sentence \u201cThe product is OK\u201d to describe the highest satisfaction, while another would use \u201cIt's a great product, but not amazing\u201d to describe some notion of disappointment. Should one build individual sentiment predictors, one per product, based on a small amount of data, or build a single sentiment predictor for all products, based on mixed input with potentially heterogeneous linguistic usage? One methodology is to cluster individual products into categories, and run the learning algorithm on the aggregated data. While in some cases the aggregation might be simple, in other cases it might be a challenge. (For example, one can cluster restaurants by cuisine, by price, by location, etc.) In addition, the different tasks might differ somewhat in both domain (text used) and predictions (sentiment associated with given text), which may raise the dilemma between clustering related tasks or related domains.\nIn this work we propose an alternative methodology. Rather than clustering the different tasks before the learning, we perform the clustering as part of the learning task. Specifically, we consider a very large number of tasks, with only a few examples from each domain. The goal is to output a small pool of classifiers, and map each task to a single classifier (or a convex combination of them). 
The idea is that we can control the complexity of the learning process by deciding on the size of the pool of shared classifiers. This is a very natural approach in such a setting.\nOur first objective is to study generalization bounds for this simple and natural setting. We start by computing upper and lower bounds on the VC dimension, showing that the VC dimension is at most O(T log k + kd log(T kd)), where T is the number of domains, k the number of shared hypotheses, and d the VC dimension of the basic hypothesis class. We also show a lower bound of max{kd, T min{d, log k}}. This shows that the dependency on the number of tasks (T) and the number of shared hypotheses (k) is very different; namely, increasing the number of shared hypotheses increases the VC dimension only logarithmically. This implies that if we have N examples per task, the generalization error is only \u02dcO(sqrt(log k / N + dk/(T N))), compared to O(sqrt(d/N)) when learning each task individually. So we have a significant gain when log k \u226a N and k \u226a T, which is a realistic case. We also derive a K-means-like algorithm to learn such classifiers, of both the models and the association of tasks to models.\nOur experimental results support the general theoretical framework introduced. We conduct experiments with both synthetic problems and sentiment prediction, with the number of tasks ranging between 30 and 370, some with as few as 18 examples in the training set. Our experimental results strongly support the benefits of the approach we propose here, which attains lower test error compared with learning individual models per task, or a single model for all tasks.\n\nRelated Work\n\nIn recent years there is an increasing body of work on domain adaptation and multi-task learning. 
In domain adaptation we often assume that the tasks to be performed are very similar to each other, yet the data comes from different distributions, and often there is only unlabeled data from the domain (or task) of interest. Mansour et al. [18] develop theory for the case where the distribution of the problem of interest (called the target) is a convex combination of other distributions, from each of which samples are given. Ben-David et al. [6] focused on classification and developed a distance between distributions, using it to derive new generalization bounds for when training and test examples do not come from the same distribution. Mansour et al. [19] built on that work and developed a new distance and theory for adaptation problems with arbitrary loss functions. See also a recent result of Blanchard et al. [7].\nAnother direction of research is to learn a few problems simultaneously, yet, unlike in domain adaptation, assuming examples are coming from the same distribution. Obozinski et al. [20] proposed to learn one model per task, yet find a small set of shared features using mixed-norm regularization. Argyriou et al. [4] took a similar approach, with the added complexity that the feature space can also be rotated before choosing this small shared set. Ando and Zhang [2], and Amit et al. [1], learn by first finding a linear transformation shared by all tasks, and then individual models per task. The first formulation is not convex, while the latter is. Evgeniou and Pontil [13] and Daume [15] proposed to combine two models, one individual per task and the other shared across all tasks, and combine them at test time, while later Evgeniou et al. [12] proposed to learn one model per task, and force all the models to be close to each other. 
Finally, there exists a large body of work on multi-task learning in the Bayesian setting, where a shared prior is used to connect or relate the various tasks [5, 22, 16], while other works [17, 21, 9] use Gaussian process predictors.\nThe work most similar to ours is that of Crammer et al. [11, 10], who developed theory for learning a model with few datasets from various tasks, assuming they are sampled from the same source. They assumed that the relative error (or a bound over it) is known, and proved a generalization bound for that setting; their bounds suggest using some of the datasets, but not all, when building a model for the main task. Yet, this selection was performed before seeing the data, under the strong assumption that the discrepancy between tasks is known. We do not assume this knowledge, and we learn a few tasks simultaneously.\n\n2 Model\n\nThere is a set T of T tasks, and with each task t there is an associated distribution Dt over inputs (x, y), where x \u2208 Rr and y \u2208 Y. We assume binary classification tasks, i.e., Y = {+1, \u22121}. Each task t \u2208 T has a sample of size nt denoted by St = {(xt,i, yt,i, t) : i = 1, . . . , nt} drawn from Dt, where xt,i \u2208 Rr is the i-th example in the t-th domain and yt,i \u2208 Y is the corresponding label. (Note that the name of the domain is part of the example, so there is no uncertainty regarding the domain from which the example originated.)\nA k-shared task classifier is a pair (Hk, g), where Hk = {h1, . . . , hk} \u2282 H is a set of k hypotheses from a class of functions H = {h : Rr \u2192 Y}. The function g maps each task t \u2208 T to the hypotheses pool Hk, where the mapping is to a single hypothesis (hard association). We denote by K = {1, . . . , k} the index set for Hk.\nIn the hard k-shared task classifier, g maps each task t \u2208 T to one hypothesis hi \u2208 Hk, i.e., g : T \u2192 K. 
Classifier (Hk, g), given an example (x, t), first computes the mapping from the domain name t to the hypothesis hi, where i = g(t), and then predicts using the corresponding function hi, i.e., the prediction is hg(t)(x). The class of hard k-shared task classifiers using hypothesis class H includes all such (Hk, g) classifiers, i.e., fHk,g : Rr \u00d7 T \u2192 Y, where fHk,g(x, t) = hg(t)(x), and the class is FH,k = {fHk,g : |Hk| = k, Hk \u2282 H, g : T \u2192 K}.\n\n3 Hard k-shared Task Classifiers: Generalization Bounds\n\nWe envision the following learning process. Given the training sets St, for t \u2208 T, the learner outputs at the end of the training phase both Hk and g, where Hk is composed of k hypotheses h1, . . . , hk \u2208 H and g : T \u2192 K. Naturally, this implies that there is potential overfitting in both the selection of Hk and the mapping g.\nOur main goal in this section is to bound the VC dimension of the resulting hypothesis class FH,k, assuming the VC dimension of H is d. We show that the VC dimension of FH,k is at most O(T log k + kd log(T kd)) and at least \u03a9(T log k + dk).\nTheorem 1. For any hypothesis class H of VC-dimension d, the class of hard k-shared task classifiers FH,k has VC dimension at most the minimum between dT and O(T log k + kd log(T kd)).\nProof: Our main goal is to derive an upper bound on the number of possible labelings using hard k-shared task classifiers FH,k. Once we establish this, we can use the Sauer lemma to derive an upper bound on the VC dimension [3]. Let \u03a6d(m) = \u2211_{j=0}^{d} C(m, j) be an upper bound on the number of labelings over m examples using a hypothesis class of VC dimension d, where C(m, j) is the binomial coefficient. Let m = \u2211_{t \u2208 T} nt be the total sample size.\nWe consider all mappings g of the T tasks to Hk; there are k^T such mappings. Fix a particular mapping g, where hypothesis hj has tasks Sj \u2282 T assigned to it. (At this point hj \u2208 H is not fixed yet; we are only fixing g and the tasks that are mapped to the j-th hypothesis in Hk.) There are mj = \u2211_{t \u2208 Sj} nt examples for the tasks in Sj, and therefore at most \u03a6d(mj) labelings. (Note that the labelings may use any h \u2208 H.) We can upper bound the number of labelings of any hypothesis pool Hk by \u220f_{j=1}^{k} \u03a6d(mj). Since m = \u2211_j mj, this bound is maximized when mj = m/k, and this implies that the number of labelings is upper bounded by k^T (em/dk)^{dk}.\nNow we would like to upper bound the VC dimension of FH,k. When m is equal to the VC dimension, the m points admit 2^m different labelings. Hence, it has to be the case that 2^m \u2264 k^T (em/dk)^{dk}. We need to find the largest m for which m \u2264 kd log(em/dk) + T log k \u2264 T log k + kd log(e/dk) + kd log m \u2264 T log k + kd log m, for dk \u2265 3. Note that for \u03b1 \u2265 2 and \u03b2 \u2265 1, we have that if m < \u03b1 + \u03b2 log(m) then m < \u03b1 + 16\u03b2 log(\u03b1\u03b2). This implies that m \u2264 T log k + 16kd log(T dk log k) = O(T log k + kd log(T kd)), which gives an upper bound on the number of points that can be shattered, and hence on the VC dimension.\nTo show the upper bound of dT, we simply let each domain select a separate hypothesis from H. Since H has VC dimension d, at most d examples can be shattered in each task, for a total of dT.\nAs an immediate corollary we can derive the following generalization bound, using the standard VC-dimension generalization bounds [3]. 
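The counting argument in the proof above can be checked with a small numerical sketch (ours, not from the paper): it evaluates the Sauer bound \u03a6d(m) and the resulting bound k^T \u00b7 \u03a6d(m/k)^k on the number of labelings realizable by hard k-shared classifiers, under the equal-split assumption mj = m/k used in the proof.

```python
import math

def phi(d, m):
    """Sauer bound: max number of labelings of m points by a class of VC dim d."""
    return sum(math.comb(m, j) for j in range(min(d, m) + 1))

def shared_labelings_bound(T, k, d, m):
    """Bound from the proof: k^T mappings of tasks to models, times at most
    Phi_d(m_j) labelings per model; with the equal split m_j = m/k this is
    k^T * Phi_d(m/k)^k."""
    return k ** T * phi(d, m // k) ** k

# With a single shared model (k = 1) the bound collapses to the plain
# Sauer bound on all m examples.
assert shared_labelings_bound(T=5, k=1, d=3, m=30) == phi(3, 30)
```

Comparing this quantity to 2^m for growing m reproduces the threshold argument that yields the O(T log k + kd log(T kd)) bound.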
For simplicity we assume that the distribution over the tasks is uniform. We define the true error as e(fHk,g) = Pr(x,y,t)[fHk,g(x, t) \u2260 y], and the empirical (or training) error as\n\n\u02c6e(fHk,g) = (1/m) \u2211_{t=1}^{T} \u2211_{i=1}^{nt} I[fHk,g(xt,i, t) \u2260 yt,i],   (1)\n\nwhere m = \u2211_t nt is the sample size, and I(a) = 1 iff the predicate a is true. We can now state the following corollary, which follows from standard generalization bounds.\n\nInput parameters: k - number of models to use, N - number of iterations, \u03b7 - fraction of data for split\nInitialize:\n\u2022 Set a random partition S1_t \u222a S2_t = St with S1_t \u2229 S2_t = \u2205 and |S1_t|/|St| = \u03b7, for t = 1 . . . T\n\u2022 Set g(t) = Jt where Jt is drawn uniformly from {1, . . . , k}\nFor i = 1, . . . , N:\n1. Set hj \u2190 learn(\u222a_{t \u2208 Ij} S1_t, H) where Ij = {t : g(t) = j}.\n2. Set g(t) = argmin_{j=1,...,k} (1/|S2_t|) \u2211_{(x,y) \u2208 S2_t} I[hj(x) \u2260 y].\nFinally, set hj \u2190 learn(\u222a_{t \u2208 Ij} St, H) where Ij = {t : g(t) = j}.\nOutput: fHk,g(x, t) where Hk = {h1, . . . , hk}\n\nFigure 1: The SHAMO algorithm for learning shared models.\n\nCorollary 2. Fix k. For any hypothesis class H of VC-dimension d, for any hard k-shared task classifier f = (Hk, g) we have that with probability 1 \u2212 \u03b4,\n\n|e(f) \u2212 \u02c6e(f)| = O( sqrt( [ (T log k + kd log(T kd)) log(m/T) + log(1/\u03b4) ] / m ) ).\n\nThe previous corollary holds for some fixed k known before observing the training data; we now state a bound where k is chosen after seeing the data, together with g and Hk. The proof follows from the previous corollary by performing a union bound on the different values of k.\nCorollary 3. 
For any hypothesis class H of VC-dimension d, for any k, and for any hard k-shared task classifier f = (Hk, g) we have that with probability 1 \u2212 \u03b4,\n\n|e(f) \u2212 \u02c6e(f)| = O( sqrt( [ (T log k + kd log(T kd)) log(m/T) + log(k/\u03b4) ] / m ) ).\n\nThe last two bounds state that the empirical error is close to the true error under two conditions: first, that T log k is small compared with m = \u2211_t nt, that is, the average number of examples per task should be large compared to the log of the number of models. Thus, even with a dozen models, only a few tens of examples suffice. Second, that kd is small compared with m. The main point is that if the VC dimension is large and the average number of examples m/T is low, it is possible to compensate if the number of models k is small relative to the number of tasks T. Hence, we expect to improve performance over individual models if there are many tasks, yet we predict with relatively few models.\nWe now show that our upper bound on the VC dimension is almost tight.\nTheorem 4. There is a hypothesis class H of VC-dimension d, such that the class of hard k-shared task classifiers FH,k has VC dimension at least max{kd, T min{d, log k}}.\nProof: To show the lower bound of kd, consider d points that H shatters, x1, . . . , xd. Consider the set of examples S = {(xi, j) : 1 \u2264 j \u2264 k, 1 \u2264 i \u2264 d}. For any labeling of S, we can select for each domain j a different hypothesis from H that agrees with the labeling. Since we have only k different j's, we can do it with k functions. Therefore we shatter S, and obtain a lower bound of kd.\nLet \u2113 = min{d, log k}; hence the second bound is T\u2113. Since the class H is of VC dimension d, there are points x1, . . . , x\u2113 and functions h1, . . . , hk \u2208 H, such that for any labeling of the xi's there is a hypothesis hj which is consistent with it. 
(Since k hypotheses can shatter at most log k points, we get the dependency on log k.) Let the sample be S = {(xi, t) : 1 \u2264 i \u2264 \u2113, t \u2208 T}. For any labeling of S, when we consider domain t \u2208 T, there is a function hi \u2208 Hk which is consistent with the labeling. Therefore the VC dimension is at least T\u2113.\n\n4 Learning with SHAred MOdels (SHAMO) Algorithm\n\nThe generalization bound states that we should find a pair (Hk, g) that performs well on the training data, and that k should be fixed to a small value a priori. We assume that there is a learning algorithm for H, called learn, which is given a training set S.\n\n(a) Data Synthetic I (b) Error I (c) Error II\nFigure 2: Left: Illustration of data used in the first experiment. The middle (experiment I) and right (experiment II) panels show the average error vs k for the three algorithms, and the \u201ceffective\u201d number of models vs k (right axis).\n\nFormally, we assume that the hypothesis returned by \u02c6h \u2190 learn(S, H) has the lowest training error, that is, the algorithm performs empirical risk minimization. We propose an iterative procedure, alternating between two stages, which intuitively is similar to K-means.\nIn the first stage, the algorithm fixes the assignment function g and finds the best k functions Hk. This can be performed easily by calling k times any algorithm that learns with the hypothesis class H. On each call, the union of the training sets that are assigned by g the same value is fed into the algorithm. Formally, for all j = 1 . . . k set hj \u2190 learn(\u222a_{t \u2208 Ij} St, H), where Ij = {t : g(t) = j}. In the second stage we learn the association g given Hk. 
Here we simply set g(t) to be the model which attains the lowest error evaluated on the training set, that is, g(t) = argmin_{j=1,...,k} (1/nt) \u2211_{i=1}^{nt} I[hj(xt,i) \u2260 yt,i]. This procedure can be repeated for a fixed number of iterations, or until a convergence criterion is met. Specifically, in the experiments below our algorithm alternated between the two steps exactly 10 times. Clearly, each stage reduces the training error of (1), but how far the resulting hypotheses are from the optimal ones is not clear.\nIn the description above the training sets St were used twice, once for finding Hk and once for finding g. We found in practice that this leads to overfitting; that is, in the second stage sub-optimal hypotheses are assigned by g when evaluated on the test set (which clearly is not known during training time). We thus modify the algorithm above, and use only part of the training set for each of the tasks, where these parts are not overlapping. Formally, before performing the iterations the algorithm partitions each training set into two parts, S1_t \u222a S2_t = St with S1_t \u2229 S2_t = \u2205. Then the first stage is performed by calling the learning procedure with the first set, and the second stage with the second set. Only after the iterations are concluded do we use the entire training set to build models, without modifying the association function g. We call this algorithm SHAMO, for learning with SHAred MOdels. The algorithm is summarized in Fig. 1.\n\n5 Empirical Study\n\nWe evaluated our algorithm on both synthetic and real-world sentiment classification tasks. Training was performed using the averaged-Perceptron [14] executed for 10 iterations. Three methods are evaluated: learning one model per task, called Individual below; learning one model for all tasks, called Shared below; and learning k models using our algorithm, SHAMO. 
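As a concrete illustration of Fig. 1, the following is a minimal Python sketch of the alternation (our reconstruction, not the authors' code): `learn` stands for any ERM routine, and the handling of empty model pools is an ad-hoc choice not specified in the paper.

```python
import random

def shamo(tasks, k, learn, n_iters=10, eta=0.5, seed=0):
    """Sketch of the SHAMO alternation (Fig. 1). `tasks` is a list of samples,
    each a list of (x, y) pairs; `learn` maps a list of (x, y) pairs to a
    classifier h with h(x) -> y (any empirical-risk-minimization routine)."""
    rng = random.Random(seed)
    # Split each task's sample: S1 for fitting models, S2 for association.
    split = []
    for S in tasks:
        S = list(S)
        rng.shuffle(S)
        cut = max(1, int(eta * len(S)))
        split.append((S[:cut], S[cut:] or S[:cut]))
    g = [rng.randrange(k) for _ in tasks]  # random initial association
    H = [None] * k
    for _ in range(n_iters):
        # Step 1: fit h_j on the union of S1_t over tasks assigned to model j.
        for j in range(k):
            pool = [ex for t, (S1, _) in enumerate(split) if g[t] == j for ex in S1]
            if pool:
                H[j] = learn(pool)
        # Step 2: re-assign each task to the fitted model with lowest S2 error.
        for t, (_, S2) in enumerate(split):
            errs = [sum(h(x) != y for x, y in S2) if h else float("inf") for h in H]
            g[t] = min(range(k), key=errs.__getitem__)
    # Final pass: refit each used model on all the data of its assigned tasks,
    # without modifying the association g.
    for j in range(k):
        pool = [ex for t, S in enumerate(tasks) if g[t] == j for ex in S]
        if pool:
            H[j] = learn(pool)
    return H, g
```

For instance, plugging in a trivial majority-vote `learn` and tasks whose labels are homogeneous groups the tasks that share a labeling function under a common model.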
We also implemented an online version of a batch algorithm for this setting [4]. SHAMO outperformed it in the majority of experiments. Full details will be included in a long version of this paper.\n\nSynthetic Data: We first report results using synthetic data. We generated 20-dimensional inputs x \u2208 R20. All features were drawn from a Gaussian with mean zero. The first two features of task t were drawn with a covariance specific to that task. The remaining 18 features were drawn with isotropic covariance. The label of input x = (x1, x2, ..., x20) was set to be sign(x2 \u00b7 st) where st \u2208 {\u22121, +1} with probability half. We generated T = 200 such tasks, each with 6 training examples (with at least one example from each class), and ran our algorithm for various values of k. Models were evaluated on test sets of size n = 1,000 for each task. The results below are averages over 50 random repetitions of the data generation process. A plot of the test set (with T = 9 for ease of presentation) appears in the left panel of Fig. 2; clearly two models are enough to classify all tasks correctly (depending on the value of st above), and furthermore, applying the wrong model yields a test error of 100%. All 6 examples were used both to build models and to associate models with tasks.\nThe results are summarized in the middle panel of Fig. 2, in which we plot the mean error of the three algorithms vs the number of models k, with error bars for 95% confidence intervals. Since both Individual and Shared are independent of k, their lines are flat. 
It is clear that Shared performs worst, with an average error of 50% (highest line), which is explained by the fact that the test error of half of the models over the other half of the datasets is about 100%. Individual performs second, with a test error of about 30% obtained with only 6 training examples. Our algorithm, SHAMO, performs the best, with an error of about 5% when allowing k = 2 models, and about 10% when allowing k = 14 models. The dotted black line indicates the number of \u201ceffective\u201d models per value of k, which is the smallest number of models such that at least 90% of the tasks are associated with (exactly) one of them. The corresponding scale is on the right axis. Indeed, as the number of possible models k is increased to 14, the number of effective models also increases, but only moderately, from an average of 2 to an average of 3.5. In other words, only a small number of models are used in practice, which avoids severe overfitting.\n\nFigure 3: Results for Data A (31 Tasks, 1 Thresh): (a) error of Individual and Shared vs. error of SHAMO; (b) error vs. k.\n\nThe next synthetic experiment was performed with 10 target models and more noise. Here, we generated 40-dimensional inputs x \u2208 R40. All features were drawn from a Gaussian with mean zero. The first ten features of task t were drawn with a covariance specific to that task. The remaining 30 features were drawn with isotropic covariance. The label of input x = (x1, x2, ..., x40) was set to be sign(ut \u00b7 (x1 . . . x10)) where ut \u2208 R10 is one of a set of 10 orthogonal vectors, chosen uniformly at random. As in the first experiment, we generated T = 200 such tasks, each with 25 training examples, and ran SHAMO with values of k ranging between 2 and 100. Models were evaluated on test sets of size n = 1,000 for each task. The results below are averages over 50 random repetitions of the data generation process. 
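The generation process for this second synthetic setting can be sketched as follows. This is our reconstruction under simplifying assumptions: we take the standard basis vectors of R10 as the orthogonal set (the paper draws a random orthogonal set), and we omit the task-specific covariance on the first ten features, drawing all features i.i.d. N(0, 1).

```python
import random

def make_tasks(T=200, n=25, dim=40, n_models=10, seed=0):
    """Sketch of the second synthetic setting: each task labels points by
    sign(u_t . (x1..x10)) for one of `n_models` orthogonal vectors u_t.
    Here u_t is a standard basis vector e_{j+1} and all features are
    i.i.d. N(0,1); both are simplifications of the paper's setup."""
    rng = random.Random(seed)
    tasks = []
    for _ in range(T):
        j = rng.randrange(n_models)      # hidden model index for this task
        sample = []
        for _ in range(n):
            x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
            y = 1 if x[j] > 0 else -1    # sign(u_t . x[:10]) with u_t = e_{j+1}
            sample.append((x, y))
        tasks.append(sample)
    return tasks
```

Under this construction a task labeled by u_t is classified perfectly by the matching model and at chance (50% error) by any of the other nine, matching the description in the text.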
In these experiments ten models are enough to classify all tasks correctly, yet in this experiment, applying the wrong model yields a test error of only 50%. Out of the 25 examples available for each task, 7 were used to build models, and the remaining 18 were used to associate models with tasks (\u03b7 = 7/25). Lower values cause overfitting, while higher values yield poor models.\nThe results are summarized in the right panel of Fig. 2, in which we plot the mean error of the three algorithms vs the number of models k, with error bars for 95% confidence intervals. The bottom line is similar to the previous experiment. As before, Shared performs worst; Individual performs second, with a test error of about 15% obtained with 25 training examples. Our algorithm, SHAMO, performs the best, with an error of about 11% when allowing k = 22 models, twice the number of real models. Additionally, it seems that the algorithm was not overfitting: even when the number of allowed models was set to 100, the performance was the same as for k = 25. One possible explanation is that the algorithm is not using all allowed models; indeed, the number of \u201ceffective\u201d models (which are associated with 90% of the tasks) grows moderately for numbers of models greater than 25 (from 14 to 16). In other words, if we allow the algorithm to remove about 10% of the tasks, then only 14-16 models are enough to obtain about 11% test error on average. It is not clear to us yet why overfitting occurred in the first experiment but not in the second.\nSentiment Data: We followed Blitzer et al. [8] and evaluated our algorithm also on product reviews from Amazon. We downloaded 2,000 reviews from each of 31 categories, such as books, dvd and so on; a total of 62,000 reviews altogether. 
All reviews were represented using bag-of-unigrams/bigrams, using only features that appeared at least 5 times in all training sets, yielding a dictionary of size 28,775. The reviews we used were originally labeled with 1, 2, 4, 5 stars, as reviews with 3 stars were very hard to predict, even with a very large amount of data.\n\n(a) Data B, 62 Tasks (b) Data C, 124 Tasks (c) Data D, 186 Tasks (d) Data E, 248 Tasks\n(e) Data B, 62 Tasks (f) Data C, 124 Tasks (g) Data D, 186 Tasks (h) Data E, 248 Tasks\nFigure 4: Top: test error of Individual and Shared algorithms vs test error of SHAMO for k = 14, for all datasets with 2 thresholds. Bottom: average error vs k for the three algorithms, and the \u201ceffective\u201d number of models vs k (right axis).\n\nData | Thresholds | No. Tasks | Training Size | Test Size\nA | 1 | 31 | 220 | 1,780\nB | 2 | 62 | 108 | 892\nC | 2 | 124 | 54 | 446\nD | 2 | 186 | 36 | 297\nE | 2 | 248 | 27 | 223\nF | 3 | 93 | 72 | 592\nG | 3 | 186 | 36 | 296\nH | 3 | 279 | 24 | 197\nI | 3 | 372 | 18 | 148\nTable 1: Summary of sentiment datasets used.\n\nWe generated three binary prediction datasets as follows. In the first dataset, the goal was to predict whether the number of stars associated with a review is above or below 3. Since we focus on the case of many tasks with a small amount of data each, we used about 1/9 of the data for training and the remainder for evaluation. Each set (training and test) contains an equal amount of reviews with 1, 2, 4, 5 stars. The outcome of this process is 31 tasks, each with 220 training examples and 1,780 test examples. This dataset is in row A of Table 1.\nFor the second dataset we partitioned all reviews from each category into two equal sets. 
The prediction problem for the first set was to predict whether the number of stars is 5 or not (that is, below 5). For the second set of problems the goal was to predict whether the number of stars is 1 or not. The outcome is 62 tasks with 108 training examples and 892 test examples. We refer to this problem as having 2 thresholds (5 and 1). This dataset is row B of Table 1. For the third dataset we partitioned the reviews into three sets, using one of the three goals above - is the number of stars above or below 1, is it above or below 3, and is it above or below 5 - ending up with 93 tasks with 72 training examples and 592 test examples each. We refer to this problem as having 3 thresholds (5, 3 and 1). This dataset is row F in Table 1. Finally, we took each of the last two problems and divided each task into 2, 3 or 4 parts - rows C, D, E (2 thresholds) and rows G, H, I (3 thresholds). Our setting with a few thresholds represents different language usages, from mild to strong, for the same level of sentiment.\nUnlike in the synthetic experiments, training data was either used for building models or for associating models with tasks. That is, we set |S1_t| = |S2_t| = 0.5|St| (\u03b7 = 0.5), and used one half of the examples to build models (set the weights of the prediction functions), and the remaining half to evaluate each of the k models on the T tasks and associate models with tasks. Only after this process ended did we fix this association and learn models using all training points to build final models.\nThe results for dataset A, with a single threshold, appear in Fig. 3. The top panel shows the error of Individual and Shared vs SHAMO for k = 14. Points above the line y = x indicate the superiority of SHAMO. Although we used reviews from 31 domains, there is essentially a single task, and thus it is best to combine the data. 
Indeed, all the red squares (corresponding to Individual) are above the blue circles (corresponding to Shared), indicating that the shared model outperforms individual models. Additionally, all points corresponding to Shared lie on the diagonal, indicating that SHAMO performs as well as Shared, with error \u223c 16%. The bottom panel shows the performance of SHAMO vs. k. As shown, the error is fixed and is not affected by k. This is explained by the black dashed line that, as before, shows the number of \u201ceffective\u201d models, which is 1. Even though the algorithm may choose up to 14 models, it effectively always uses one.\n\n(a) Data F, 93 tasks (b) Data G, 186 tasks (c) Data H, 279 tasks (d) Data I, 372 tasks\nFigure 5: Average error vs k for the three algorithms, and the \u201ceffective\u201d number of models vs k (right axis).\n\nThe results for datasets B-E, all with two thresholds, are summarized in Fig. 4. The top panels show the test error of the Individual and Shared algorithms vs the test error of SHAMO for k = 14, with the number of tasks increasing from left to right. First, as opposed to dataset A with a single threshold, in all cases the results for Shared are worse than those of Individual. 
This gap gets smaller with the number of tasks (the clouds overlap more as we go from the left panel to the right). Intuitively, Shared introduces (label) bias, as the two thresholds are treated as one, while Individual introduces variance, as smaller and smaller training sets are used; as we go from the left panel to the right one, the gap between bias and variance shrinks as the variance increases. SHAMO performs the best, as in all plots almost all the points (fewer in the right plot) are above the line y = x. Additionally, the spread of the clouds in the top panels gets larger, indicating a larger deviation in performance across different tasks.\nThe bottom panels of Fig. 4 show the average test error vs k. As Shared is affected by neither k nor T (the total number of training examples remains the same), its test error of 36% is fixed across panels. As we have more tasks, and fewer training examples per task, the test error of Individual increases from 25.6% to 28.9% (a gap of 3.3%). SHAMO performs the best, and is also affected by smaller datasets, with test error ranging between 21.8% and 24.3%, a smaller gap than Individual's (2.5%). In all four datasets the optimal number of models is k = 3, and there is minor overfitting when using larger values (at most 1%). As before, the effective number of models grows weakly with k.\nThe results for datasets F-I, all with three thresholds, are summarized in Fig. 5; the general trend remains the same, and we highlight only the main differences. First, the gap between Individual and Shared is much smaller: in some tasks one is better, and in other tasks the other is better. Additionally, for the smallest number of tasks (left) Individual is better, with a gap of \u223c 1.5%, while for the largest number of tasks Individual is worse, with a gap ranging between 1-4%. This is exactly where the effect of the variance of small datasets became stronger than the bias emerging from sharing. 
Second, in general these datasets are more heterogeneous, as indicated by the larger standard deviation (longer error bars than in Fig. 4). As before, SHAMO performs the best, with optimal performance when k = 3-4, and almost no overfitting for larger values of k, as the "effective" number of models grows slowly with k.

Summary
We described a theoretical framework for multitask learning using a small number of shared models. Our theory suggests that many tasks can be used to compensate for a small number of training examples per task, if one can partition the tasks into a few sets with a similar labeling function per set. We also derived a K-means-like algorithm to learn both the models and the association of tasks with models. Our experimental results on both hand-crafted problems and a real-world sentiment classification problem strongly support the benefits of the approach, even with very few examples per task. We plan to extend our theory to direct the optimal splitting of the training data by the algorithm, to analyze its convergence properties, and to perform extensive experiments. We also plan to derive theory and algorithms for soft association of tasks with classifiers.

Acknowledgements: The research is partially supported by grants from the ISF, the BSF, and European Union grant IRG-256479.
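The K-means-like alternation described in the Summary, hard assignment of tasks to models followed by refitting each model on the pooled data of its tasks, can be sketched on the synthetic threshold setting. This is our own minimal illustration, not the authors' SHAMO implementation; the helper names, the seeding of models from individual tasks, and the data sizes are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def predict(theta, x):
    return np.where(x > theta, 1.0, -1.0)

def task_error(theta, x, y):
    return float(np.mean(predict(theta, x) != y))

def best_threshold(x, y):
    """Threshold minimizing 0/1 training error, searched over midpoint candidates."""
    xs = np.sort(x)
    cands = np.concatenate(([xs[0] - 1.0],
                            (xs[:-1] + xs[1:]) / 2.0,
                            [xs[-1] + 1.0]))
    errs = [task_error(c, x, y) for c in cands]
    return float(cands[int(np.argmin(errs))])

def shamo_like(tasks, k, iters=10):
    """Alternate hard task-to-model assignment with per-model refitting."""
    # Seed each model from one task (an assumed initialization)
    thetas = [best_threshold(*tasks[j]) for j in range(k)]
    assign = [0] * len(tasks)
    for _ in range(iters):
        # Assignment step: map each task to its lowest-error model
        assign = [int(np.argmin([task_error(th, x, y) for th in thetas]))
                  for x, y in tasks]
        # Refit step: retrain each model on the pooled data of its tasks
        for j in range(k):
            idx = [t for t, a in enumerate(assign) if a == j]
            if idx:
                thetas[j] = best_threshold(
                    np.concatenate([tasks[t][0] for t in idx]),
                    np.concatenate([tasks[t][1] for t in idx]))
    return thetas, assign

# Synthetic demo: 20 tasks, 8 examples each, drawn from two true thresholds
tasks = []
for t in range(20):
    x = rng.uniform(-1, 1, 8)
    tasks.append((x, np.where(x > (-0.5 if t % 2 == 0 else 0.5), 1.0, -1.0)))

thetas, assign = shamo_like(tasks, k=4)
avg_err = float(np.mean([task_error(thetas[a], x, y)
                         for (x, y), a in zip(tasks, assign)]))
print(f"average training error: {avg_err:.3f}, models used: {len(set(assign))}")
```

Both steps can only decrease the total training error, so the alternation converges; with two well-separated underlying thresholds, the learned pool typically collapses onto roughly one effective model per task group, mirroring the "effective number of models" behavior observed in the experiments.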