{"title": "Learning Multiple Tasks in Parallel with a Shared Annotator", "book": "Advances in Neural Information Processing Systems", "page_first": 1170, "page_last": 1178, "abstract": "We introduce a new multi-task framework, in which $K$ online learners are sharing a single annotator with limited bandwidth. On each round, each of the $K$ learners receives an input, and makes a prediction about the label of that input. Then, a shared (stochastic) mechanism decides which of the $K$ inputs will be annotated. The learner that receives the feedback (label) may update its prediction rule, and we proceed to the next round. We develop an online algorithm for multi-task binary classification that learns in this setting, and bound its performance in the worst-case setting. Additionally, we show that our algorithm can be used to solve two bandits problems: contextual bandits, and dueling bandits with context, both allowed to decouple exploration and exploitation. Empirical study with OCR data, vowel prediction (VJ project) and document classification, shows that our algorithm outperforms other algorithms, one of which uses uniform allocation, and essentially makes more (accuracy) for the same labour of the annotator.", "full_text": "Learning Multiple Tasks in Parallel with\n\na Shared Annotator\n\nHaim Cohen\n\nKoby Crammer\n\nDepartment of Electrical Engeneering\n\nThe Technion \u2013 Israel Institute of Technology\n\nDepartment of Electrical Engeneering\n\nThe Technion \u2013 Israel Institute of Technology\n\nHaifa, 32000 Israel\n\nhcohen@tx.technion.ac.il\n\nHaifa, 32000 Israel\n\nkoby@ee.technion.ac.il\n\nAbstract\n\nWe introduce a new multi-task framework, in which K online learners are sharing\na single annotator with limited bandwidth. On each round, each of the K learners\nreceives an input, and makes a prediction about the label of that input. 
Then, a shared (stochastic) mechanism decides which of the K inputs will be annotated. The learner that receives the feedback (label) may update its prediction rule, and then we proceed to the next round. We develop an online algorithm for multi-task binary classification that learns in this setting, and bound its performance in the worst-case setting. Additionally, we show that our algorithm can be used to solve two bandit problems, contextual bandits and dueling bandits with context, both of which allow decoupling exploration from exploitation. An empirical study with OCR data, vowel prediction (VJ project) and document classification shows that our algorithm outperforms other algorithms, one of which uses uniform allocation, and essentially achieves more (accuracy) for the same labour of the annotator.

1 Introduction

A triumph of machine learning is the ability to predict many human aspects: is a certain mail spam or not, is a news item of interest or not, does a movie meet one's taste or not, and so on. The dominant paradigm is supervised learning, in which the main bottleneck is the need to annotate data. A common protocol is problem centric: first collect data or inputs automatically (with low cost), and then pass them to a user or an expert to be annotated. Annotation can be outsourced to the crowd by a service like Mechanical Turk, or performed by experts as in the Linguistic Data Consortium. This data may then be used to build models, either for a single task or for many tasks. This approach does not make optimal use of the main resource, the annotator, as some tasks are harder than others, yet the amount of data to be annotated for each task must be fixed a priori.
Another aspect of this problem is the need to adapt systems to individual users. To this end, such systems may query the user for the label of some input; yet if several systems do so independently, the user will be flooded with queries and will avoid interacting with those systems. For example, sometimes there is a need to annotate news items from a few agencies. One person cannot handle all of them, and only some items can be annotated. Which ones? Our setting is designed to handle exactly this problem, and specifically, how to make the best use of annotation time.

We propose a new framework of online multi-task learning with a shared annotator. Here, algorithms learn a few tasks simultaneously, yet they receive feedback using a central mechanism that trades off the amount of feedback (or labels) each task receives. We derive a specific algorithm based on the good-old Perceptron algorithm, called SHAMPO (SHared Annotator for Multiple PrOblems), for binary classification, and analyze it in the mistake bound model, showing that our algorithm may perform well compared with methods that observe all annotated data. We then show how to reduce a few contextual bandit problems to our framework, and provide specific bounds for such settings. We evaluate our algorithm with four different datasets for OCR, vowel prediction (VJ) and document classification, and show that it can improve performance either on average over all tasks, or even when their outputs are combined towards a single shared task, such as multi-class prediction. We conclude with a discussion of related work, and a few of the many routes to extend this work.

[Figure 1 appears here: inputs x_1, ..., x_K, the shared annotator, and the models w_1, ..., w_K with labeled pairs (x_1, y_1), ..., (x_K, y_K); see the caption below.]

2 Problem Setting

We study online multi-task learning with a shared annotator. There are K tasks to be learned simultaneously.
Learning is performed in rounds. On round t, there are K input-output pairs (x_{i,t}, y_{i,t}), where the inputs x_{i,t} ∈ R^{d_i} are vectors and the labels y_{i,t} ∈ {-1, +1} are binary. In the general case, the input spaces of the tasks may differ. We simplify the notation and assume d_i = d for all tasks. Since the proposed algorithm uses the margin, which is affected by the vector norm, all vectors need to be scaled into a ball. Furthermore, no dependency between tasks is assumed.

On round t, the learning algorithm receives K inputs x_{i,t} for i = 1, ..., K, and outputs K binary labels ŷ_{i,t}, where ŷ_{i,t} ∈ {-1, +1} is the label predicted for the input x_{i,t} of task i. The algorithm then chooses a task J_t ∈ {1, ..., K} and receives from the annotator the true label y_{J_t,t} for that task J_t. It does not observe any other label. Then, the algorithm updates its models and proceeds to the next round (and inputs). To ease calculations below, we denote by the K indicators Z_t = (Z_{1,t}, ..., Z_{K,t}) the identity of the task queried on round t, setting Z_{J_t,t} = 1 and Z_{i,t} = 0 for i ≠ J_t. Clearly, Σ_i Z_{i,t} = 1. Below, we define the notation E_{t-1}[x] to be the conditional expectation E[x | Z_1, ..., Z_{t-1}] given all previous choices.

An illustration of a single iteration of multi-task algorithms is shown in Fig. 1. The top panel shows the standard setting with a shared annotator that labels all inputs, which are fed to the corresponding algorithms to update the corresponding models. The bottom panel shows the SHAMPO algorithm, which couples the labeling annotation and the learning process, and synchronizes a single annotation per round. At most one task performs an update per round (the annotated one).

We focus on linear functions of the form f(x) = sign(p) for a quantity p = w^⊤x, w ∈ R^d, called the margin. Specifically, the algorithm maintains a set of K weight vectors.
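The per-round prediction step over the K linear models can be sketched as follows (a minimal sketch; representing the K weight vectors as rows of a matrix, and breaking sign ties toward +1, are our choices, not the paper's):

```python
import numpy as np

def predict_round(W, X):
    """Margins and binary predictions for K tasks on one round.
    W: K x d matrix whose row i is task i's weight vector w_i.
    X: K x d matrix whose row i is the round's input x_i for task i."""
    margins = np.einsum('kd,kd->k', W, X)   # p_i = w_i . x_i, one dot per task
    y_hat = np.where(margins >= 0, 1, -1)   # sign, with the tie sign(0) -> +1
    return margins, y_hat
```

Row i here plays the role of task i's input x_{i,t} after the common-dimension simplification d_i = d.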
On round t, the algorithm predicts ŷ_{i,t} = sign(p̂_{i,t}), where p̂_{i,t} = w_{i,t-1}^⊤ x_{i,t}. On rounds for which the label of some task J_t is queried, the algorithm does not update the models of the other tasks, that is, we have w_{i,t} = w_{i,t-1} for i ≠ J_t.

We say that the algorithm makes a prediction mistake in task i if y_{i,t} ≠ ŷ_{i,t}, and denote this event by M_{i,t} = 1; otherwise, if there is no mistake, we set M_{i,t} = 0. The goal of the algorithm is to minimize the cumulative number of mistakes, Σ_t Σ_i M_{i,t}. Models are also evaluated using the hinge loss. Specifically, let u_i ∈ R^d be some vector associated with task i. We denote its hinge loss with respect to some input-output pair by ℓ_{γ,i,t}(u_i) = (γ - y_{i,t} u_i^⊤ x_{i,t})_+, where (x)_+ = max{x, 0} and γ > 0 is some parameter. The cumulative loss over all tasks and a sequence of n inputs is L_{γ,n} = L_γ({u_i}) = Σ_{t=1}^n Σ_{i=1}^K ℓ_{γ,i,t}(u_i). We also use the following expected hinge loss over the random choices of the algorithm: L̄_{γ,n} = L̄_γ({u_i}) = E[Σ_{t=1}^n Σ_{i=1}^K M_{i,t} Z_{i,t} ℓ_{γ,i,t}(u_i)]. We proceed by describing our algorithm and specifying how to choose a task to query its label, and how to perform an update.

Figure 1: Illustration of a single iteration of multi-task algorithms: (a) standard setting, (b) SHAMPO.

3 SHAMPO: SHared Annotator for Multiple Problems

We turn to describe an algorithm for the multi-task learning with a shared annotator setting, which works with linear models. Two steps are yet to be specified: how to pick a task to be labeled, and how to perform an update once the true label for that task is given.

To select a task, the algorithm uses the absolute margin |p̂_{i,t}|. Intuitively, if |p̂_{i,t}| is small, then there is uncertainty about the labeling of x_{i,t}, and vice versa for large values of |p̂_{i,t}|.
A similar argument was used by Tong and Koller [22] for picking an example to be labeled in batch active learning. Yet, if the model w_{i,t-1} is not accurate enough, due to the small number of observed examples, this estimate may be rough and may lead to a wrong conclusion. We thus perform an exploration-exploitation strategy, and query tasks randomly, with a bias towards tasks with low |p̂_{i,t}|. To the best of our knowledge, the use of exploration-exploitation in this context of choosing an example to be labeled (e.g. in settings such as semi-supervised learning or selective sampling) is novel. We introduce b ≥ 0 as a tradeoff parameter between exploration and exploitation, and a_i ≥ 0 as a prior for the query distribution over tasks. Specifically, we induce a distribution over tasks,

Pr[J_t = j] = a_j (b + |p̂_{j,t}| - min_{m=1..K} |p̂_{m,t}|)^{-1} / D_t ,  for  D_t = Σ_{i=1}^K a_i (b + |p̂_{i,t}| - min_{m=1..K} |p̂_{m,t}|)^{-1} .  (1)

Figure 2 summarizes the algorithm:

  Parameters: b, λ, a_i ∈ R_+ for i = 1, ..., K
  Initialize: w_{i,0} = 0 for i = 1, ..., K
  for t = 1, 2, ..., n do
    1. Observe K instance vectors x_{i,t} (i = 1, ..., K).
    2. Compute margins p̂_{i,t} = w_{i,t-1}^⊤ x_{i,t}.
    3. Predict K labels, ŷ_{i,t} = sign(p̂_{i,t}).
    4. Draw a task J_t from the distribution of Eq. (1).
    5. Query the true label y_{J_t,t} ∈ {-1, +1}.
    6. Set the indicator M_{J_t,t} = 1 iff y_{J_t,t} p̂_{J_t,t} ≤ 0 (error).
    7. Set the indicator A_{J_t,t} = 1 iff 0 < y_{J_t,t} p̂_{J_t,t} ≤ λ (small margin).
    8. Update with the Perceptron rule:
         w_{J_t,t} = w_{J_t,t-1} + (A_{J_t,t} + M_{J_t,t}) y_{J_t,t} x_{J_t,t}
         w_{i,t} = w_{i,t-1} for i ≠ J_t
  end for
  Output: w_{i,n} for i = 1, ..., K.

Clearly, Pr[J_t = j] ≥ 0 and Σ_j Pr[J_t = j] = 1. For b = 0 we have Pr[J_t = j] = 1 for the task with minimal margin, J_t = arg min_{i=1..K} |p̂_{i,t}|, and for b → ∞ the distribution is proportional to the prior weights, Pr[J_t = j] = a_j / (Σ_i a_i).
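For concreteness, the query distribution of Eq. (1) can be sketched as follows (a minimal sketch; the function name and the NumPy conventions are ours, and b > 0 is assumed, since b = 0 degenerates to a point mass on the minimal-margin task):

```python
import numpy as np

def shampo_task_distribution(margins, a, b):
    """Pr[J_t = j] over K tasks from the margins p_hat, prior weights a_j,
    and exploration-exploitation parameter b > 0 (Eq. 1)."""
    gaps = np.abs(margins) - np.min(np.abs(margins))  # |p_j| - min_m |p_m| >= 0
    weights = a / (b + gaps)                          # a_j * (b + gap_j)^(-1)
    return weights / weights.sum()                    # normalize by D_t
```

As in the text, a small b concentrates the distribution on the minimal-margin task, while a large b approaches the prior a_j / Σ_i a_i.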
As noted above, we denote Z_{i,t} = 1 iff i = J_t. Since the distribution is invariant to a multiplicative factor of the a_i, we assume 1 ≤ a_i for all i.

The update of the algorithm is performed with the aggressive Perceptron rule, that is,

w_{J_t,t} = w_{J_t,t-1} + (A_{J_t,t} + M_{J_t,t}) y_{J_t,t} x_{J_t,t} ,  and  w_{i,t} = w_{i,t-1} for i ≠ J_t .  (2)

We define A_{i,t}, the aggressive update indicator, by introducing the aggressive update threshold λ ∈ R, λ > 0, such that A_{i,t} = 1 iff 0 < y_{i,t} p̂_{i,t} ≤ λ, i.e., there is no mistake but the margin is small, and A_{i,t} = 0 otherwise. An update is performed if either there is a mistake (M_{J_t,t} = 1) or the margin is low (A_{J_t,t} = 1). Note that these events are mutually exclusive. For simplicity of presentation we write this update as w_{i,t} = w_{i,t-1} + Z_{i,t}(A_{i,t} + M_{i,t}) y_{i,t} x_{i,t}. Although this notation uses the labels of all tasks, only the label of the task J_t is used in practice, as Z_{i,t} = 0 for the other tasks.

Figure 2: SHAMPO: SHared Annotator for Multiple PrOblems.

We call this algorithm SHAMPO, for SHared Annotator for Multiple PrOblems. The pseudo-code appears in Fig. 2. We conclude this section by noting that the algorithm can be incorporated with Mercer kernels, as all operations depend implicitly on inner products between inputs.

4 Analysis

The following theorem states that the expected cumulative number of mistakes the algorithm makes may not be much higher than that of an algorithm that observes the labels of all inputs.

Theorem 1 If the SHAMPO algorithm runs on K tasks with K parallel example-pair sequences (x_{i,1}, y_{i,1}), ..., (x_{i,n}, y_{i,n}) ∈ R^d × {-1, +1}, i = 1, ..., K, with input parameters 0 ≤ b, 0 ≤ λ ≤ b/2, and prior 1 ≤ a_i for all i, and we denote X = max_{i,t} ||x_{i,t}||, then, for all γ > 0, all u_i ∈ R^d and all n ≥ 1, there exists 0 < δ ≤ Σ_{i=1}^K a_i such that

E[ Σ_{i=1}^K Σ_{t=1}^n M_{i,t} ] ≤ δ (1 + X²/(2b)) L̄_{γ,n} + δ (2b + X²)² U² / (8bγ²) + (2λ/b - 1) E[ Σ_{i=1}^K Σ_{t=1}^n a_i A_{i,t} ] ,

where we denote U² = Σ_{i=1}^K ||u_i||². The expectation is over the random choices of the algorithm.

Due to lack of space, the proof appears in Appendix A.1 in the supplementary material. A few notes on the mistake bound: First, the right term of the bound equals zero either when λ = 0 (as then A_{i,t} = 0) or when λ = b/2. Any value in between may yield a strictly negative value of this term, which in turn results in a lower bound. Second, the quantity L̄_{γ,n} is non-increasing with the number of tasks. The first terms depend on the number of tasks only via δ ≤ Σ_i a_i. Thus, if a_i = 1 (uniform prior), the quantity δ ≤ K is bounded by the number of tasks. Yet, when the hardness of the tasks is not equal or balanced, one may expect δ to be closer to 1 than to K, which we found empirically to be true. Additionally, the prior a_i can be used to make the algorithm focus on the hard tasks, thereby improving the bound: while a larger a_i can increase the first terms, the second term can become lower. A task i which corresponds to a large value of a_i will be updated more in early rounds than tasks with low a_i.
If more of these updates are aggressive, the second term will be negative and far from zero.

One can use the bound to tune the algorithm for a good value of b in the non-aggressive case (λ = 0), by minimizing the bound over b. This may not be possible directly, since L̄_{γ,n} depends implicitly on the value of b.¹ Alternatively, we can take a loose estimate of L̄_{γ,n} and replace it with L_{γ,n} (which is roughly K times larger). The optimal value of b can now be calculated, b = (X²/2) √(1 + 4 L_{γ,n} γ² / (U²X²)). Substituting this value in the bound, with L_{γ,n}, leads to the following bound: E[ Σ_{i=1}^K Σ_{t=1}^n M_{i,t} ] ≤ δ [ L_{γ,n} + U²X²/(2γ²) + (U²X²/(2γ²)) √(1 + 4 L_{γ,n} γ²/(U²X²)) ], which has the same dependency on the number of inputs n as an algorithm that observes all of them.

We conclude this section by noting that the algorithm and analysis can be extended to the case in which more than a single query is allowed per round. The analysis and proof appear in Appendix A.2 in the supplementary material.

5 From Multi-task to Contextual Bandits

Although our algorithm is designed for many binary-classification tasks, it can also be applied in two settings of contextual bandits, when decoupling exploration and exploitation is allowed [23, 3]. In this setting, the goal is to predict a label Ŷ_t ∈ {1, ..., C} given an input x_t. As before, the algorithm works in rounds. On round t, the algorithm receives an input x_t and outputs a multiclass label Ŷ_t ∈ {1, ..., C}. Then, it queries for some information about the label via a single binary "yes-no" question, and uses the feedback to update its model. We consider two forms of questions. Note that our algorithm subsumes past methods, since it also allows the introduction of a bias (or prior knowledge) towards some tasks, which in turn may improve performance.

5.1 One-vs-Rest

The first setting is termed one-vs-rest.
The algorithm asks whether the true label is some label Ȳ_t ∈ {1, ..., C}, possibly not the predicted label, i.e., it may be the case that Ȳ_t ≠ Ŷ_t. Given the response whether Ȳ_t is the true label Y_t, the algorithm updates its models. The reduction we perform is by introducing K tasks, one per class. The problem of the learning algorithm for task i is to decide whether the true label is class i or not. Given the outputs of all (binary) classifiers, the algorithm generates a single multi-class prediction, namely the single label for which the output of the corresponding binary classifier is positive. If no such class exists, or there is more than one such class, a random prediction is used; i.e., given an input x_t we define Ŷ_t = arg max_i ŷ_{i,t}, where ties are broken arbitrarily. The label to be queried is Ȳ_t = J_t, i.e., the problem index that SHAMPO is querying. We analyze the performance of this reduction as a multiclass prediction algorithm.

¹A similar issue appears also after the discussion of Theorem 1 in a different context [7].

[Figure 3 panels: (a) 4 text classification tasks; (b) 8 text classification tasks; (c) error vs. b.]

Figure 3: Left and middle: Test error of aggressive SHAMPO on (a) four and (b) eight binary text classification tasks.
Three algorithms are evaluated: uniform, exploit, and aggressive SHAMPO. Right: mean test error over the USPS one-vs-one binary problems vs. b for aggressive SHAMPO with prior, aggressive SHAMPO with uniform prior, and non-aggressive SHAMPO with uniform prior.

Corollary 2 Assume the SHAMPO algorithm is executed as above, with K = C one-vs-rest problems, on a sequence (x_1, Y_1), ..., (x_n, Y_n) ∈ R^d × {1, ..., C}, with input parameter b > 0 and prior 1 ≤ a_i for all i. Then for all γ > 0 and all u_i ∈ R^d, there exists 0 < δ ≤ Σ_{i=1}^C a_i such that the expected number of multi-class errors is bounded as follows:

E[ Σ_t [[Y_t ≠ Ŷ_t]] ] ≤ δ (1 + X²/(2b)) L̄_{γ,n} + δ (2b + X²)² U² / (8bγ²) + (2λ/b - 1) E[ Σ_{i=1}^K Σ_{t=1}^n a_i A_{i,t} ] ,

where [[I]] = 1 if the predicate I is true, and zero otherwise.

The corollary follows directly from Thm. 1 by noting that [[Y_t ≠ Ŷ_t]] ≤ Σ_i M_{i,t}; that is, there is a multiclass mistake only if there is at least one prediction mistake in one of the one-vs-rest problems. The closest setting is contextual bandits, yet we allow decoupling of exploration and exploitation. Ignoring this decoupling, the Banditron algorithm [17] is the closest to ours, with a regret of O(T^{2/3}). Hazan et al. [16] proposed an algorithm with O(√T) regret, but designed for the log loss, with a coefficient that may be very large, and another algorithm [9] has O(√T) regret with respect to prediction mistakes, yet it assumes stochastic labeling rather than adversarial.

5.2 One-vs-One

In the second setting, termed one-vs-one, the algorithm picks two labels Ȳ_t⁺, Ȳ_t⁻ ∈ {1, ..., C}, possibly both not the predicted label.
The feedback for the learner is three-fold: y_{J_t,t} = +1 if the first alternative is the correct label, Ȳ_t⁺ = Y_t; y_{J_t,t} = -1 if the second alternative is the correct label, Ȳ_t⁻ = Y_t; and y_{J_t,t} = 0 otherwise (in this case there is no error and we set M_{J_t,t} = 0). The reduction we perform is by introducing K = C(C-1)/2 problems, one per pair of classes. The goal of the learning algorithm for a problem indexed by two labels (y_1, y_2) is to decide which of the two is the correct label, given that it is one of them. Given the outputs of all (binary) classifiers, the algorithm generates a single multi-class prediction using a tournament in a round-robin approach [15]. If there is no clear winner, a random prediction is used. We now analyze the performance of this reduction as a multiclass prediction algorithm.

Corollary 3 Assume the SHAMPO algorithm is executed as above, with K = C(C-1)/2 one-vs-one problems, on a sequence (x_1, Y_1), ..., (x_n, Y_n) ∈ R^d × {1, ..., C}, with input parameter b > 0 and prior 1 ≤ a_i for all i. Then for all γ > 0 and all u_i ∈ R^d, there exists 0 < δ ≤ Σ_{i=1}^K a_i such that the expected number of multi-class errors can be bounded as follows:

E[ Σ_t [[Y_t ≠ Ŷ_t]] ] ≤ 1/((K-1)/2 + 1) { δ (1 + X²/(2b)) L̄_{γ,n} + δ (2b + X²)² U² / (8bγ²) + (2λ/b - 1) E[ Σ_{i=1}^K Σ_{t=1}^n a_i A_{i,t} ] } .

The corollary follows directly from Thm. 1 by noting that [[Y_t ≠ Ŷ_t]] ≤ (1/((K-1)/2 + 1)) Σ_{i=1}^K M_{i,t}. Note that the bound is essentially independent of C, as the coefficient in the bound is upper bounded by 6 for C ≥ 3.

We conclude this section with two algorithmic modifications we employed in this setting. Currently, when the feedback is zero, there is no update of the weights, because there are no errors.
This causes the algorithm to effectively ignore such examples, as in these cases the algorithm does not modify any model; furthermore, if such an example is repeated, a problem with possibly "0" feedback may be queried again. We fix this issue with one of two modifications. In the first one, if the feedback is zero, we modify the model to reduce the chance that the chosen problem J_t would be chosen again for the same input (i.e., not to make the same wrong choice of an irrelevant problem again). To this end, we modify the weights a bit, to increase the confidence (absolute margin) of the model for the same input, and replace Eq. (2) with

w_{J_t,t} = w_{J_t,t-1} + [[y_{J_t,t} ≠ 0]] y_{J_t,t} x_{J_t,t} + [[y_{J_t,t} = 0]] η ŷ_{J_t,t} x_{J_t,t} ,

for some η > 0. In other words, if there is a possible error (i.e., y_{J_t,t} ≠ 0), the update follows the Perceptron's rule. Otherwise, the weights are updated such that the absolute margin increases, as |w_{J_t,t}^⊤ x_{J_t,t}| = |(w_{J_t,t-1} + η ŷ_{J_t,t} x_{J_t,t})^⊤ x_{J_t,t}| = |w_{J_t,t-1}^⊤ x_{J_t,t} + η sign(w_{J_t,t-1}^⊤ x_{J_t,t}) ||x_{J_t,t}||²| = |w_{J_t,t-1}^⊤ x_{J_t,t}| + η ||x_{J_t,t}||² > |w_{J_t,t-1}^⊤ x_{J_t,t}|. We call this method one-vs-one-weak, as it performs weak updates for zero feedback. The second alternative is not to allow a 0-valued feedback: in that case, we set the label to be either +1 or -1, at random. We call this method one-vs-one-random.

6 Experiments

[Figure 4 panels: (a) training mistakes vs. b; (b) test error vs. number of queries.]

Figure 4: Left: mean fraction of mistakes SHAMPO made during training on MNIST, over all examples and over queried examples only. Right: test error vs. no.
of queries, plotted for all MNIST one-vs-one problems.

We evaluated the SHAMPO algorithm using four datasets: USPS, MNIST (both OCR), Vocal Joystick (VJ, vowel recognition) and document classification. The USPS dataset contains 7,291 training examples and 2,007 test examples, each a 16 × 16 pixel gray-scale image converted to a 256-dimensional vector. The MNIST dataset, with 28 × 28 gray-scale images, contains 60,000 (10,000) training (test) examples. In both cases there are 10 possible labels, the digits. The VJ task is to predict a vowel from eight possible vowels. Each example is a frame of a spoken vowel, described with 13 MFCC coefficients transformed into 27 features. There are 572,911 training examples and 236,680 test examples. We created binary tasks from these multi-class datasets using two reductions: the one-vs-rest setting and the one-vs-one setting. For example, in both USPS and MNIST there are 10 binary one-vs-rest tasks and 45 binary one-vs-one tasks. The NLP document classification tasks include spam filtering, news items and news-group classification, sentiment classification, and product domain categorization: a total of 31 binary prediction tasks over all, with a total of 252,609 examples, and input dimension varying between 8,768 and 1,447,866. Details of the individual binary tasks can be found elsewhere [8]. We created an eighth collection, named MIXED, which consists of 40 tasks: 10 random tasks from each one of the four basic datasets (one-vs-one versions). This yielded eight collections (USPS, MNIST and VJ, each as one-vs-rest or one-vs-one; document classification; and MIXED). From each of these eight collections we generated between 6 and 10 combinations (or problems), each created by sampling between 2 and 8 tasks, which yielded a total of 64 multi-task problems.
We tried to diversify problem difficulty by including both hard and easy binary classification problems. The hardness of a binary problem is evaluated by the number of mistakes the Perceptron algorithm performs on it.

We evaluated two baselines as well as our algorithm. The algorithm uniform picks a random task to be queried and updated (corresponding to b → ∞); exploit picks the task with the lowest absolute margin (i.e., the "hardest instance"), corresponding to b ≈ 0 in SHAMPO. For SHAMPO we tried 13 values of b, equally spaced on a logarithmic scale. All algorithms made a single pass over the training data. We ran two versions of the algorithm: a plain version, without aggressiveness (updates on mistakes only, λ = 0), and an aggressive version with λ = b/2 (we tried lower values of λ, as in the bound, but found that λ = b/2 gives the best results), both with uniform prior (a_i = 1). We used a separate training set and test set to build a model and evaluate it.

Table 1: Test error percentages. Scores are shown in parentheses.

                     Aggressive λ = b/2                          Plain
  Dataset          exploit       SHAMPO       uniform       exploit       SHAMPO       uniform
  VJ 1 vs 1        5.22 (2.9)    4.57 (1.1)   5.67 (3.9)    5.21 (2.7)    6.93 (4.6)   6.26 (5.8)
  VJ 1 vs Rest     13.26 (3.5)   11.73 (1.2)  12.43 (2.5)   13.11 (3.0)   14.17 (5.0)  14.71 (5.8)
  USPS 1 vs 1      3.31 (2.5)    2.73 (1.0)   19.29 (6.0)   3.37 (2.5)    4.83 (4.0)   5.33 (5.0)
  USPS 1 vs Rest   5.45 (2.8)    4.93 (1.2)   10.12 (6.0)   5.31 (2.0)    6.51 (4.0)   7.06 (5.0)
  MNIST 1 vs 1     1.08 (2.3)    0.75 (1.0)   5.9 (6.0)     1.2 (2.7)     1.69 (4.1)   1.94 (4.9)
  MNIST 1 vs Rest  4.74 (2.8)    3.88 (1.0)   10.01 (6.0)   4.44 (2.8)    5.4 (3.8)    6.1 (5.0)
  NLP documents    19.43 (2.3)   16.5 (1.0)   23.21 (5.0)   19.46 (2.7)   21.54 (4.7)  21.74 (5.3)
  MIXED            2.75 (2.4)    2.06 (1.0)   13.59 (6.0)   2.78 (2.6)    4.2 (4.3)    4.45 (4.7)
  Mean score       (2.7)         (1.1)        (5.2)         (2.6)         (4.3)        (5.2)

Results are evaluated using two quantities.
First, the average test error (over all the dataset combinations), and second, the average score. For each combination we assigned a score of 1 to the algorithm with the lowest test error, a score of 2 to the second best, and so on, up to a score of 6 for the algorithm with the highest test error.

Multi-task Binary Classification: Fig. 3(a) and Fig. 3(b) show the test error of the three algorithms on two of the document classification combinations, with four and eight tasks. Clearly, not only does SHAMPO perform better, it does so on each task individually. (Our analysis above bounds the total number of mistakes over all tasks.) Fig. 3(c) shows the average test error vs. b on the one-vs-one binary USPS problems for the three variants of SHAMPO: non-aggressive (called plain), aggressive, and aggressive with prior. Clearly, the plain version does worse than both the aggressive version and the non-uniform prior version. For other combinations the prior did not always improve results. We hypothesise that this is because our heuristic may yield a bad prior, one that does not focus the algorithm on the right (hard) tasks.

Results are summarized in Tab. 1. In general, exploit is better than uniform, and aggressive is better than non-aggressive. Aggressive SHAMPO yields the best results, evaluated both as an average over tasks per combination and over combinations. Remarkably, even on the MIXED dataset (where tasks are of a different nature: images, audio and documents), aggressive SHAMPO improves over uniform (4.45% error) and the aggressive-exploit baseline (2.75%), and achieves a test error of 2.06%.

Next, we focus on the problems that the algorithm chooses to annotate on each iteration, for various values of b. Fig.
4(a) shows the total number of mistakes SHAMPO made during training on MNIST. We show two quantities: the fraction of mistakes over all training examples (denoted "total", blue) and the fraction of mistakes over only the queried examples (denoted "queried", dashed red). In pure exploration (large b) both quantities are the same, as the choice of the problem to be labeled is independent of the problem and example, and essentially the fraction of mistakes on queried examples is a good estimate of the fraction of mistakes over all examples. The other extreme is pure exploitation (low b): here the fraction of mistakes made on queried examples goes up, while the overall fraction of mistakes goes down. This indicates that the algorithm indeed focuses its queries on the harder inputs, which in turn improves the overall training mistakes. There is a sweet spot, b ≈ 0.01, for which SHAMPO still focuses on the harder examples, yet reduces the total fraction of training mistakes even more. The existence of such a tradeoff is predicted by Thm. 1.

Another perspective on the phenomenon, that for values of b ≪ 1 SHAMPO focuses on the harder examples, is illustrated in Fig. 4(b), where test error vs. the number of queries is plotted for each MNIST problem. We show three cases: uniform, exploit, and a mid-value of b ≈ 0.01 which trades off exploration and exploitation. A few comments: First, when performing uniform querying, all problems receive about the same number of queries (266), close to the number of examples per problem (12,000) divided by the number of problems (45). Second, with a tradeoff between exploration and exploitation, harder problems (as indicated by test error) get more queries than easier problems.
For example, the four problems with test error greater than 6% get at least 400 queries, about twice the number of queries received by each of the 12 problems with test error less than 1%. Third, as a consequence, SHAMPO performs equalization, giving the harder problems more labeled data, and thereby reduces the error of these problems; it does not, however, increase the error of the easier problems, which get fewer queries (in fact, it reduces the test error of all 45 problems!). The tradeoff mechanism of SHAMPO reduces the test error of each problem by more than 40% compared with full exploration. Fourth, exploit performs a similar equalization, yet on some hard tasks it performs worse than SHAMPO. This could be because it overfits the training data by focusing on hard examples too much, whereas SHAMPO has a randomness mechanism.

Indeed, Table 1 shows that aggressive SHAMPO outperforms the alternatives. Yet, we claim that a good prior may improve results further. We computed a prior over the 45 USPS tasks by running the Perceptron algorithm on 1,000 examples and counting the number of mistakes; we set the prior to be proportional to this number. We then reran aggressive SHAMPO with this prior, comparing it to aggressive SHAMPO with no prior (i.e., a_i = 1). Aggressive SHAMPO with prior achieves an average error of 1.47 (vs. 2.73 with no prior) on one-vs-one USPS, and 4.97 (vs. 4.93) on one-vs-rest USPS, with score ranks of 1.0 (vs. 2.9) and 1.7 (vs. 2.0), respectively. Fig. 3(c) shows the test error for all values of b we evaluated. A good prior is shown to outperform the case a_i = 1 for all values of b.

Reduction of Multi-task to Contextual Bandits: Next, we evaluated SHAMPO as a contextual bandit algorithm, by breaking a multi-class problem into a few binary tasks and integrating their outputs into a single multi-class prediction. We focus on the VJ data, as there are many examples and linear models perform relatively well on it [18].
We implemented all three reductions mentioned in Sec. 5.2, namely, one-vs-rest, one-vs-one-random (which picks a random label when the feedback is zero), and one-vs-one-weak (which performs updates to increase confidence when the feedback is zero), where we set η = 0.2, as well as the Banditron algorithm [17]. The one-vs-rest reduction and the Banditron have a test error of about 43.5%, and one-vs-one-random of about 42.5%. Finally, one-vs-one-weak achieves an error of 39.4%. This is slightly worse than PLM [18], with a test error of 38.4% (and higher than MLP with 32.8%), yet all of these algorithms observe only one bit of feedback per example, while both MLP and PLM observe 3 bits (as class identity can be coded with 3 bits for 8 classes). We claim that our setting can easily be used to adapt a system to an individual user, as we only need to assume the ability to recognize three words, such as three letters. Given an utterance of the user, the system may ask: "Did you say (a) 'a' like in 'bad', (b) 'o' like in 'book', or (c) none?". The user can communicate the correct answer with no need for another person to key in the answer.
7 Related Work and Conclusion
In the past few years there has been a large volume of work on multi-task learning, which we clearly cannot cover here; the reader is referred to a recent survey on the topic [20]. Most of this work focuses on exploring relations between tasks, i.e., finding similarities and dissimilarities between tasks, and using them to share data directly (e.g., [10]) or model parameters [14, 11, 2]. In the online setting there is only a handful of work on multi-task learning. Dekel et al. [13] consider the setting where all algorithms are evaluated using a global loss function, and all work towards the shared goal of minimizing it. Lugosi et al. [19] assume that there are constraints on the predictions of all learners, and focus on the expert setting.
Agarwal et al. [1] formalize the problem in the framework of stochastic convex programming with several matrix regularizations, each capturing some assumption about the relation between the models. Cavallanti et al. [4] and Cesa-Bianchi et al. [6] assume and exploit a known relation between tasks. Unlike these approaches, we assume the ability to share an annotator rather than data or parameters, so our methods can be applied to problems with no common input space. Our analysis is similar to that of Cesa-Bianchi et al. [7], yet they focus on selective sampling (see also [5, 12]), that is, making individual binary decisions of whether to query, while our algorithm always queries and needs to decide for which task. In addition, there has been recent work on contextual bandits [17, 16, 9], each with slightly different assumptions. To the best of our knowledge, we are the first to consider decoupled exploration and exploitation in this context. Finally, there is recent work on learning with relative or preference feedback in various settings [24, 25, 26, 21]. Unlike this work, ours again allows decoupled exploitation and exploration, as well as non-relative feedback.
We proposed a new framework for online multi-task learning, where learners share a single annotator. We presented an algorithm for this setting and analyzed it in the mistake-bound model. We also showed how learning in such a model can be used to learn in the contextual-bandits setting with several types of feedback. Empirical results show that our algorithm does better for the same price: it focuses the annotator on the harder instances, and thereby improves performance.
We plan to integrate other algorithms into our framework, extend it to other settings, investigate ways to generate good priors, and reduce multi-class to binary also via error-correcting output codes.

Acknowledgements: The research was funded by the Intel Collaborative Research Institute for Computational Intelligence (ICRI-CI) and by the Israeli Science Foundation grant ISF-1567/10.

References
[1] Alekh Agarwal, Alexander Rakhlin, and Peter Bartlett. Matrix regularization techniques for online multitask learning. Technical Report UCB/EECS-2008-138, EECS Department, University of California, Berkeley, Oct 2008.
[2] Andreas Argyriou, Theodoros Evgeniou, and Massimiliano Pontil. Convex multi-task feature learning. Machine Learning, 73(3):243-272, 2008.
[3] Orly Avner, Shie Mannor, and Ohad Shamir. Decoupling exploration and exploitation in multi-armed bandits. In ICML, 2012.
[4] Giovanni Cavallanti, Nicolò Cesa-Bianchi, and Claudio Gentile. Linear algorithms for online multitask classification. Journal of Machine Learning Research, 11:2901-2934, 2010.
[5] Nicolò Cesa-Bianchi, Claudio Gentile, and Francesco Orabona. Robust bounds for classification via selective sampling. In ICML 26, 2009.
[6] Nicolò Cesa-Bianchi, Claudio Gentile, and Luca Zaniboni. Incremental algorithms for hierarchical classification. Journal of Machine Learning Research, 7:31-54, 2006.
[7] Nicolò Cesa-Bianchi, Claudio Gentile, and Luca Zaniboni. Worst-case analysis of selective sampling for linear classification. Journal of Machine Learning Research, 7:1205-1230, 2006.
[8] Koby Crammer, Mark Dredze, and Fernando Pereira. Confidence-weighted linear classification for text categorization. Journal of Machine Learning Research, 13:1891-1926, June 2012.
[9] Koby Crammer and Claudio Gentile. Multiclass classification with bandit feedback using adaptive regularization.
Machine Learning, 90(3):347-383, 2013.
[10] Koby Crammer and Yishay Mansour. Learning multiple tasks using shared hypotheses. In Advances in Neural Information Processing Systems 25, 2012.
[11] Hal Daumé III, Abhishek Kumar, and Avishek Saha. Frustratingly easy semi-supervised domain adaptation. In DANLP 2010, 2010.
[12] Ofer Dekel, Claudio Gentile, and Karthik Sridharan. Robust selective sampling from single and multiple teachers. In COLT, pages 346-358, 2010.
[13] Ofer Dekel, Philip M. Long, and Yoram Singer. Online multitask learning. In COLT, pages 453-467, 2006.
[14] Theodoros Evgeniou and Massimiliano Pontil. Regularized multi-task learning. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '04, pages 109-117, New York, NY, USA, 2004. ACM.
[15] Johannes Fürnkranz. Round robin classification. Journal of Machine Learning Research, 2:721-747, 2002.
[16] Elad Hazan and Satyen Kale. Newtron: an efficient bandit algorithm for online multiclass prediction. In Advances in Neural Information Processing Systems (NIPS), 2011.
[17] Sham M. Kakade, Shai Shalev-Shwartz, and Ambuj Tewari. Efficient bandit algorithms for online multiclass prediction. In Proceedings of the 25th International Conference on Machine Learning, pages 440-447. ACM, 2008.
[18] Hui Lin, Jeff Bilmes, and Koby Crammer. How to lose confidence: Probabilistic linear machines for multiclass classification. In Tenth Annual Conference of the International Speech Communication Association, 2009.
[19] Gábor Lugosi, Omiros Papaspiliopoulos, and Gilles Stoltz. Online multi-task learning with hard constraints. In COLT, 2009.
[20] Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345-1359, 2010.
[21] Pannagadatta K.
Shivaswamy and Thorsten Joachims. Online learning with preference feedback. CoRR, abs/1111.0712, 2011.
[22] Simon Tong and Daphne Koller. Support vector machine active learning with applications to text classification. In ICML, pages 999-1006, 2000.
[23] Jia Yuan Yu and Shie Mannor. Piecewise-stationary bandit problems with side observations. In ICML, 2009.
[24] Yisong Yue, Josef Broder, Robert Kleinberg, and Thorsten Joachims. The k-armed dueling bandits problem. In COLT, 2009.
[25] Yisong Yue, Josef Broder, Robert Kleinberg, and Thorsten Joachims. The k-armed dueling bandits problem. J. Comput. Syst. Sci., 78(5):1538-1556, 2012.
[26] Yisong Yue and Thorsten Joachims. Beat the mean bandit. In ICML, pages 241-248, 2011.