{"title": "Preference Based Adaptation for Learning Objectives", "book": "Advances in Neural Information Processing Systems", "page_first": 7828, "page_last": 7837, "abstract": "In many real-world learning tasks, it is hard to directly optimize the true performance measures, while choosing the right surrogate objectives is also difficult. Under this situation, it is desirable to incorporate an objective optimization process into the learning loop based on weak modeling of the relationship between the true measure and the objective. In this work, we discuss the task of objective adaptation, in which the learner iteratively adapts the learning objective to the underlying true objective based on the preference feedback from an oracle. We show that when the objective can be linearly parameterized, this preference based learning problem can be solved by utilizing the dueling bandit model. A novel sampling based algorithm DL^2M is proposed to learn the optimal parameter, which enjoys strong theoretical guarantees and efficient empirical performance. To avoid learning a hypothesis from scratch after each objective function update, a boosting based hypothesis adaptation approach is proposed to efficiently adapt any pre-learned element hypothesis to the current objective. We apply the overall approach to multi-label learning, and show that the proposed approach achieves significant performance under various multi-label performance measures.", "full_text": "Preference Based Adaptation for Learning Objectives

Yao-Xiang Ding
Zhi-Hua Zhou

National Key Laboratory for Novel Software Technology,
Nanjing University, Nanjing, 210023, China
{dingyx, zhouzh}@lamda.nju.edu.cn

Abstract

In many real-world learning tasks, it is hard to directly optimize the true performance measures, while choosing the right surrogate objectives is also difficult.
Under this situation, it is desirable to incorporate an objective optimization process into the learning loop based on weak modeling of the relationship between the true measure and the objective. In this work, we discuss the task of objective adaptation, in which the learner iteratively adapts the learning objective to the underlying true objective based on the preference feedback from an oracle. We show that when the objective can be linearly parameterized, this preference based learning problem can be solved by utilizing the dueling bandit model. A novel sampling based algorithm DL2M is proposed to learn the optimal parameter, which enjoys strong theoretical guarantees and efficient empirical performance. To avoid learning a hypothesis from scratch after each objective function update, a boosting based hypothesis adaptation approach is proposed to efficiently adapt any pre-learned element hypotheses to the current objective. We apply the overall approach to multi-label learning, and show that the proposed approach achieves significant performance under various multi-label performance measures.

1 Introduction

Machine learning approaches have already been applied to many real-world tasks, in which the target is usually to optimize some task-specific performance measure. For complex problems, the performance measures are usually hard to optimize directly, such as the click-through rate in online advertisement and the profit gain in recommendation system design. Instead of directly optimizing these complex measures, surrogate objectives with better mathematical properties are designed to simplify optimization. Obviously, whether the objective is correctly designed essentially affects the application performance.
However, it also requires detailed knowledge of the relationship between the true measure and the objective, which is sometimes difficult to acquire. Under this situation, it is more desirable to learn both the objective and the hypothesis simultaneously.
Based on this motivation, we consider the novel scenario of learning with objective adaptation from preference feedback. Under this scenario, in each iteration of the objective adaptation process, the learner maintains a pair of objective functions, as well as the corresponding learned hypotheses, obtained from the latest two iterations. An oracle then provides a preference over the pair of hypotheses to the learner, according to the true task performance measure. Based on this preference information, the learner updates both the objective function and the corresponding hypothesis. In particular, this formulation even allows us to model complex scenarios in which the true performance measure is not quantified, such as subjective human preference. It is expected that the objective function converges to the optimal one, so that the learned hypothesis optimizes the true performance measure.
In this work, we focus on the following linearly parameterized objective function class. Denote the objective by L_w, w ∈ W, in which W is the parameter space, and w = [w_1 w_2 ··· w_K] is a K-dimensional real-valued vector. We assume that L_w can be represented as

L_w = Σ_{i=1}^K w_i l_i + w_0 Ω,  w_0 ∈ {0, 1},  (1)

in which l_1, l_2, ..., l_K are K convex element objectives, and w_0 is an additional indicator of Ω. When w_0 = 0, L_w is a linear combination of the K element objectives. When w_0 = 1, Ω can be utilized to represent an additional convex regularization term.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.
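For concreteness, the linear objective class of Equation 1 can be sketched in a few lines of code; the element losses and weights below are toy illustrations of ours, not taken from the paper:

```python
import numpy as np

def make_objective(element_losses, w, w0=0, reg=None):
    """Build L_w = sum_i w_i * l_i + w0 * Omega (cf. Equation 1).

    element_losses: K callables l_i(h) -> float, convex in h.
    w: length-K nonnegative weight vector; reg: optional convex
    regularizer Omega, used only when w0 = 1. Illustrative sketch.
    """
    def L_w(h):
        val = sum(wi * li(h) for wi, li in zip(w, element_losses))
        if w0 == 1 and reg is not None:
            val += reg(h)
        return val
    return L_w

# two convex element objectives of a scalar "hypothesis" h
l1 = lambda h: (h - 1.0) ** 2
l2 = lambda h: abs(h)
L = make_objective([l1, l2], w=np.array([0.8, 0.6]))  # ||w||_2 = 1
print(L(0.5))  # 0.8 * 0.25 + 0.6 * 0.5 = 0.5
```

Different choices of w trade off the element objectives against each other, which is exactly the quantity the adaptation procedure searches over.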
It is easy to see that this linear formulation covers a broad class of commonly used objectives in different learning tasks. By choosing different w, we are allowed to consider different trade-offs among the element objectives. The target of objective adaptation is then to learn the optimal w* which corresponds to the best trade-off, leading to the optimal task performance measure. To ensure that the problem of learning a hypothesis under any choice of w is solvable, we restrict w_i ≥ 0 for all i. When w_0 = 0, the scale of w does not matter, thus we restrict W to be the non-negative part of the K-dimensional unit sphere, i.e. ∥w∥_2 = 1, w_i ≥ 0 for all i. When w_0 = 1, i.e. when the scale of w is meaningful, we restrict W to be the K-dimensional ball ∥w − R·1_K∥_2 ≤ R, in which 1_K is the K-dimensional all-one vector and R is the radius. In this way, L_w is kept convex for all w ∈ W.
There are two main challenges under the above objective adaptation scenario. One is to learn the objective function based on preference feedback from the oracle, which requires proper modeling of the preference feedback. The other is how hypothesis learning can be done efficiently, without learning from scratch, when the objective is updated. In this work, we take the first step towards the above two challenges. First, we naturally formulate the objective adaptation process into the dueling bandit model [Yue et al., 2012], in which w is treated as the bandit arm and the oracle preference is treated as the reward. A novel sampling based algorithm DL2M, which stands for Dueling bandit Learning for Logit Model, is proposed for learning the optimal weight w*, which enjoys an Õ(K^{3/2}√T) regret bound and efficient empirical performance.
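The two parameter domains above admit simple projections, which is convenient when searching over w; a minimal numpy sketch (our illustration, not code from the paper):

```python
import numpy as np

def project_sphere_nonneg(w, eps=1e-12):
    """Map a vector to W = {||w||_2 = 1, w_i >= 0} (the w0 = 0 case):
    clip negative entries, then renormalize. Illustrative only."""
    w = np.maximum(w, 0.0)
    n = np.linalg.norm(w)
    return w / n if n > eps else np.ones_like(w) / np.sqrt(w.size)

def project_ball(w, R):
    """Project onto W = {||w - R*1_K||_2 <= R} (the w0 = 1 case)."""
    c = R * np.ones_like(w)
    d = w - c
    n = np.linalg.norm(d)
    return w if n <= R else c + R * d / n

w = project_sphere_nonneg(np.array([3.0, -1.0, 4.0]))
print(np.linalg.norm(w))  # ~1.0 (unit norm), all entries >= 0
```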
Second, by assuming that K element hypotheses f^i are learned beforehand, corresponding to the one-hot weight vectors w^i, i ∈ [K], with only one non-zero entry, a novel gradient boosting based approach named Adapt-Boost is proposed for adapting the element hypotheses to the hypothesis h_w corresponding to any w. We apply the proposed objective adaptation approach to multi-label learning, and the experimental results show that our approach achieves significant performance for various multi-label performance measures.

2 Related Work

Some similarities exist between the objective adaptation scenario and multi-objective optimization (MOO) [Deb, 2014]. Under both scenarios, multiple element objectives are considered, and the trade-offs among them should be properly dealt with. In MOO, however, the target is to figure out the Pareto solutions reflecting different trade-offs, instead of a single optimal solution defined by the oracle's preference. In fact, for our objective adaptation problem, it is also possible to utilize evolutionary algorithms instead of the proposed DL2M algorithm. However, evolutionary algorithms are usually heuristic, and theoretical guarantees are lacking. In [Agarwal et al., 2014], the multi-objective decision making problem is considered. The target of the learner is to optimize all objectives by observing the actions provided by a mentor. There is a significant difference between their setting and ours, since we focus on general learning tasks instead of decision making.
The proposed DL2M algorithm belongs to the family of continuous dueling bandit algorithms. In [Yue and Joachims, 2009], an online bandit gradient descent algorithm [Flaxman et al., 2005] was proposed, which achieves an O(√K T^{3/4}) regret bound for convex value functions. In [Kumagai, 2017], it was shown that when the value function is strongly convex and smooth, a stochastic mirror descent algorithm achieves a near optimal Õ(K√T) regret bound. Similar to DL2M, both of the above algorithms follow the online convex optimization framework [Zinkevich, 2003], while DL2M assumes the underlying value function follows a linear model. The major advantage of DL2M lies in the reduction of the total number of arms that need to be sampled during learning. For the above two algorithms, two arms need to be sampled for comparison in each iteration, while DL2M samples only one arm in iteration t and compares it with the arm sampled at t−1. Thus the total number of arms needed is halved for DL2M compared to the above two algorithms. For objective adaptation, choosing an arm incurs the cost of learning the corresponding hypothesis, thus it is important to reduce the total number of arms sampled.
Our boosting based hypothesis adaptation procedure is motivated by multi-task learning [Evgeniou and Pontil, 2004; Chapelle et al., 2010]. By regarding each element objective as a single task, the hypothesis adaptation procedure can be decomposed into the adaptation of the element hypothesis for each element objective, and the adaptation of a global hypothesis for the weighted total objective. On the other hand, since the target is to optimize the single total objective, directly utilizing multi-task learning approaches is not valid for our problem. The hypothesis adaptation task is also considered in [Li et al., 2013], in which an efficient adaptation approach is proposed under the assumption that the auxiliary hypothesis is a linear model. On the other hand, the linear model assumption also restricts the capacity of adaptation.
To address this issue, our approach utilizes a gradient boosting based learner, which can use any weak hypothesis for adaptation, thus the optimization procedure can be more flexible and efficient.

3 Dueling Bandit Learning for Objective Adaptation

In this section, a dueling bandit algorithm DL2M is proposed to learn the optimal weight vector w* from preference feedback, in order to solve the objective adaptation task. For convenience of optimization, we assume the arm space for DL2M is the full K-dimensional unit sphere W: ∥w∥_2 = 1. How to apply DL2M on the W defined in Section 1 is discussed in Remark 3 below. To model the preference of the oracle, we assume a total order ⪯ exists on W. For w* ∈ W, we have w ⪯ w* for all w ∈ W. Whenever the oracle is given an ordered pair (w, w′), the oracle gives the feedback r = 1 if w′ ⪯ w, and r = −1 otherwise. To precisely model the partial order and how the oracle provides the preference information, we assume that each arm can be evaluated by a value function v(w), such that v(w′) ≤ v(w) ⟺ w′ ⪯ w, and the preference feedback is generated by a probabilistic model considering the gap between v(w) and v(w′): Pr(r = 1) = μ(v(w) − v(w′)), in which μ(x) is a strictly increasing link function. In this paper, the logistic probability model μ(x) = 1/(1 + exp(−x)) is utilized, which is the common choice in related research. The generation of preferences is also assumed to be independent of the other parts of learning. In each iteration t out of the total T iterations, a pair (w_t, w′_t) is submitted to the oracle for feedback. The target is to minimize the total (pseudo) regret

Δ_T = Σ_{t=1}^T μ(v(w*) − v(w_t)) + μ(v(w*) − v(w′_t)).

If we further restrict w′_t to be w_{t−1}, then we only need to consider the summation over w_t. Furthermore, by observing that μ(v(w*) − v(w_t)) achieves its minimum 1/2 when w_t = w*, we can reformulate the regret as Δ_T = Σ_{t=1}^T μ(v(w*) − v(w_t)) − μ(v(w*) − v(w*)) = Σ_{t=1}^T μ(v(w*) − v(w_t)) − 1/2. From the above definition, the tasks of regret minimization and optimal weight vector estimation coincide. By L'Hôpital's rule, f(x) = 1/(1 + e^{−x}) − 1/2 has the same convergence rate as f(x) = x when x → 0. Thus we only need to consider Δ_T = Σ_{t=1}^T v(w*) − v(w_t). In this work, we adopt the commonly used linear value function v_LIN(w) = w^T θ*, in which θ* is an underlying optimal evaluation vector. This leads to the classical linear regret formulation

Δ_T^LIN = Σ_{t=1}^T w*^T θ* − w_t^T θ*,  (2)

which indicates that the objective is to maximize the linear value function. Since ∥w∥_2 = 1, w^T θ* is the projection of θ* onto w, which achieves its maximum when the directions of w and θ* coincide. Thus the direction of θ* can be interpreted as the direction of the optimal weight vector kept in the oracle's mind. To simplify optimization, we assume ∥θ*∥_2 ≤ 1 without loss of generality.
From the definition of regret, it is crucial to estimate θ* accurately.
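The logistic preference model and the instantaneous linear regret above are easy to simulate; a minimal numpy sketch with a hypothetical 2-dimensional θ* (our illustration, not code from the paper):

```python
import numpy as np

def preference(w, w_prev, theta_star, rng, rho=1.0):
    """Oracle feedback r in {+1, -1}: Pr(r = 1) = mu(v(w) - v(w_prev))
    with v(w) = w^T theta_star and the logistic link mu. Sketch only;
    rho scales how consistent the preferences are."""
    p = 1.0 / (1.0 + np.exp(-rho * ((w - w_prev) @ theta_star)))
    return 1 if rng.random() < p else -1

def instant_regret(w, theta_star):
    """w*^T theta* - w^T theta*; the optimal arm w* is the direction
    of theta*, so w*^T theta* = ||theta*||_2 (cf. Equation 2)."""
    return np.linalg.norm(theta_star) - w @ theta_star

rng = np.random.default_rng(0)
theta_star = np.array([0.6, 0.8])     # hidden in the oracle's mind
w = np.array([1.0, 0.0])
r = preference(w, np.array([0.0, 1.0]), theta_star, rng)
print(r, instant_regret(w, theta_star))  # regret = ||theta*|| - 0.6 ≈ 0.4
```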
Thus we first consider the procedure of estimating θ* in each iteration. Motivated by the logit one-bit bandit algorithm proposed in [Zhang et al., 2016], in each iteration t, we can utilize the online version of the maximum likelihood estimator, i.e. minimize the loss function

f_t(θ) = log(1 + exp(−r_t(w_t^T θ − w_{t−1}^T θ))),

which satisfies the exponentially concave property. As a result, the optimal update can be approximated by the analogue of the online Newton step [Hazan et al., 2007]:

θ_{t+1} = argmin_{∥θ∥_2 ≤ 1} ∥θ − θ_t∥²_{Z_{t+1}} / 2 + (θ − θ_t)^T ∇f_t(θ_t),  (3)

in which

Z_{t+1} = Z_t + (β/2)(w_t − w_{t−1})(w_t − w_{t−1})^T,  Z_1 = λI,  (4)

and β = 1/(2(e+1)). Next, we consider how to choose w_t in each round for better exploration. Different from the UCB based exploration strategy implemented in [Zhang et al., 2016], we extend the linear Thompson sampling technique proposed in [Abeille and Lazaric, 2017] to our dueling bandit setting, leading to the DL2M algorithm, which is illustrated in Algorithm 1.

Algorithm 1 Dueling bandit Learning for Logit Model (DL2M)
1: Input: Initialization θ_1 = 0, Z_1 = λI, w_0, number of iterations T.
2: for t = 1 to T do
3:   Sample η_t ∼ N(0, I_K).
4:   Choose κ_t according to Theorem 1.
5:   Compute θ̃_t as θ̃_t ← θ_t + κ_t Z_t^{−1/2} η_t.  (5)
6:   Compute w_t as w_t ← argmax_{∥w∥_2 = 1} w^T θ̃_t.  (6)
7:   Submit w_t and w_{t−1} and get r_t.
8:   Compute θ_{t+1} and Z_{t+1} as in Equations 3 and 4.
9: end for

We provide a regret guarantee for the proposed algorithm, whose proof will be presented in a longer version of the paper.
Theorem 1.
Assume that (cid:20)t in Algorithm 1 is set according to (cid:20)t =\n\n\u221a\n\n(cid:13)t( (cid:14)\n\n4T ), where\n\n8\n(cid:12)\n\n32\n3\n\n2\n(cid:12)\n\ndet(Zt+1)\ndet(Z1)\n\n:\n\n+\n\n) log\n\n+\n\nlog\n\n\u221a\n\n(cid:13)t+1((cid:14)) = (cid:21) + 16 + (\n\n(7)\nAfter running DL2M for T rounds, then for 8(cid:14) > 0, the following result holds with probability at\n)\nT\u2211\nleast 1 (cid:0) (cid:14):\n(\n\n[Abbasi-Yadkori et al., 2011], we have log(det(Zt+1)= det(Z1)) (cid:20)\n\nwT(cid:3) (cid:18)(cid:3) (cid:0) wT\n)\n\ndet(Zt+1)\ndet(Z1)\n\nt (cid:18)(cid:3) (cid:20)\n\n\u221a\n\n)KT log\n\n(cid:14)\n4T\n\n+ 128\n\n8KT\n\n(cid:13)T (\n\nlog\n\n:\n\n1\n(cid:21)\n\n391\n\nlog\n\n1\n(cid:12)\n\n4\n(cid:14)\n\nt=1\n\n(cid:14)\n\n2\u23082 log t\u2309t2\n\u221a\n\n(\n\np\n\n(\n\n(cid:14)\n\nBy Lemma 10 of\nK log\n\n1 + (cid:12)t\n2(cid:21)K\n\n. Thus Theorem 1 provides ~O(K 3=2\n\nT ) regret guarantee for DL2M .\n\n\u221a\n\nRemark 1 The well-known doubling trick [Shalev-Shwartz, 2012] can be utilized to make (cid:20)t in-\nIn practice, since (cid:20)t determines the step size\ndependent of the total number of iterations T .\nof exploration, it is desirable to further make it \ufb01ne-tunable.\nIn all the experiments, we set\nlog(det(Zt)= det(Z1))), in which c is a hyperparameter. The min operator is\n(cid:20)t = min(c=2; c\nintroduced to control the largest step size.\nRemark 2 Theorem 1 provides the guarantee of total regret. Our objective adaptation approach can\nbe utilized in many real-world tasks, in which the learned is already in application during the learning\nstage, and the preference feedback is generated from its true effectiveness. The total regret guarantee\nis natural under this situation. Meanwhile, it is also important to consider another kind of tasks, in\nwhich only the \ufb01nal estimation accuracy matters. Under this situation, it is better to consider simple\nregret instead of total regret since it is a pure exploration problem. 
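Putting Algorithm 1 together with the practical κ_t choice of Remark 1, a single DL2M iteration can be sketched as follows. This is our simplified illustration, not the paper's implementation: Z_1 = I is assumed, Z^{−1/2}η is drawn via a Cholesky factor, and the constrained online Newton update of Equation 3 is replaced by an unconstrained step clipped to the unit ball.

```python
import numpy as np

def dl2m_step(theta, Z, w_prev, oracle, c=0.05, rng=None):
    """One iteration of a simplified DL2M (cf. Algorithm 1). Sketch only."""
    rng = rng or np.random.default_rng()
    beta = 1.0 / (2.0 * (np.e + 1.0))
    # Remark 1 (with Z_1 = I): kappa_t = min(c/2, c * sqrt(log det(Z_t)))
    _, logdet = np.linalg.slogdet(Z)
    kappa = min(c / 2.0, c * np.sqrt(max(logdet, 0.0)))
    # Eq. (5): perturb theta; for Z = L L^T, L^{-1} eta has covariance Z^{-1}
    L = np.linalg.cholesky(Z)
    theta_tilde = theta + kappa * np.linalg.solve(L, rng.standard_normal(theta.size))
    # Eq. (6): the maximizer over the unit sphere is the normalized direction
    w = theta_tilde / (np.linalg.norm(theta_tilde) + 1e-12)
    r = oracle(w, w_prev)  # preference feedback in {+1, -1}
    # gradient of f_t(theta) = log(1 + exp(-r * (w - w_prev)^T theta))
    z = w - w_prev
    grad = -r * z / (1.0 + np.exp(r * (z @ theta)))
    Z_new = Z + 0.5 * beta * np.outer(z, z)           # Eq. (4)
    theta_new = theta - np.linalg.solve(Z_new, grad)  # Newton-style step, cf. Eq. (3)
    n = np.linalg.norm(theta_new)
    if n > 1.0:
        theta_new = theta_new / n  # crude stand-in for the unit-ball constraint
    return theta_new, Z_new, w

# toy run against a deterministic oracle with a hidden theta*
theta_star = np.array([0.6, 0.8])
oracle = lambda w, wp: 1 if (w - wp) @ theta_star >= 0 else -1
rng = np.random.default_rng(1)
theta, Z, w = np.zeros(2), np.eye(2), np.array([1.0, 0.0])
for _ in range(50):
    theta, Z, w = dl2m_step(theta, Z, w, oracle, rng=rng)
```

Note that only one new arm is sampled per iteration; the comparison reuses the previous arm, which is the halving of sampled arms discussed in Section 2.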
This is a particularly interesting and challenging task, since we assume a continuous arm space. The experimental results in Section 5 show that DL2M is efficient in finding the best arms, and we leave designing an optimal pure exploration algorithm for continuous dueling bandits as future work.
Remark 3 In the above discussion, we assume that the arm space for DL2M is W_0: ∥w∥_2 = 1. For objective learning, the parameter domains introduced in Section 1 are different. We now discuss how DL2M can be applied.
When w_0 = 0, the domain of w is W: ∥w∥_2 = 1, w_i ≥ 0 for all i, which is the nonnegative part of W_0. To apply DL2M, it is necessary to restrict each dimension of θ_t, θ̃_t to be nonnegative. For θ_t, we can simply change the domain of the update of θ to

θ_{t+1} = argmin_{∥θ∥_2 ≤ 1, θ^i ≥ 0, i ∈ [K]} ∥θ − θ_t∥²_{Z_{t+1}} / 2 + (θ − θ_t)^T ∇f_t(θ_t),

in which θ^i is the i-th entry of θ. Since the domain remains convex, the efficiency of optimization is unaffected. For θ̃_{t+1}, we can simply take a θ̃^i_{t+1} ← max(0, θ̃^i_{t+1}), i ∈ [K] operation to limit its value. Though this operation may affect the theoretical guarantee for w near the boundary, we observe that the performance is not affected in experiments.
When w_0 = 1, the domain of w is W: ∥w − R·1_K∥_2 ≤ R. The main idea is to establish a topologically identical mapping from the arm space to W; then we can perform DL2M in the arm space and map the result to W. First, it is easy to establish a bijective mapping g_1 from the K-dimensional ball W_1: ∥w∥_2 ≤ 1 to W with constant shifting and scaling.
Second, another simple bijective mapping g_2 exists that maps a point in the half of the (K+1)-dimensional sphere W_2: ∥w∥_2 = 1, w_{K+1} ≥ 0 to a point in W_1, by simply dropping the coordinate w_{K+1} (just imagine the mapping from the upper half of the three-dimensional sphere onto a two-dimensional circle). Thus we can utilize the composite mapping g_1(g_2) to map an arm in W_2 to a parameter in W. To apply DL2M on W_2, we can update θ by

θ_{t+1} = argmin_{∥θ∥_2 ≤ 1, θ^{K+1} ≥ 0} ∥θ − θ_t∥²_{Z_{t+1}} / 2 + (θ − θ_t)^T ∇f_t(θ_t),

and perform θ̃^{K+1}_{t+1} ← max(0, θ̃^{K+1}_{t+1}) to restrict both θ_{t+1} and θ̃_{t+1}, which is similar to the w_0 = 0 case.

4 Boosting Based Hypothesis Adaptation

After each objective adaptation step, w is updated, and a new L_w is obtained. To avoid learning the corresponding hypothesis F_w from scratch, a hypothesis adaptation procedure is considered. Recall the formulation of the objective function defined in Equation 1. Assume that before the whole objective adaptation process, we have learned K element hypotheses f^i under the (regularized) element objectives l_i + w_0 Ω, i ∈ [K]. To obtain the F_w corresponding to L_w, we can linearly combine the f^i together with a newly learned auxiliary hypothesis φ_w, i.e. set F_w = Σ_{i=1}^K α_i f^i + φ_w. As a result, the learning problem is transformed into

min_{α_i, i ∈ [K], φ_w}  L_w((Σ_{i=1}^K α_i f^i) + φ_w).  (8)

Under the above formulation, the learning target is to decide the weight α_i for each f^i, together with the auxiliary hypothesis φ_w. Intuitively, there should be a close relationship between α_i and w_i, the weight of l_i in L_w. When w_i is large, the corresponding l_i has a large impact on the global L_w.
Since f^i is learned under l_i + w_0 Ω, α_i should also be large to make the contribution of f^i to F_w more significant. For a similar reason, if w_i is small, then α_i should be small as well. As a result, to solve Equation 8 properly, establishing a close relationship between α_i and w_i is necessary. Based on this motivation, a boosting based hypothesis adaptation approach named Adapt-Boost is proposed.

Algorithm 2 Adapt-Boost
1: Input: Loss parameter w, loss function L_w, element hypotheses f^i, i ∈ [K], f^0 ≡ 0, number of iterations N, number of element losses K, step size ε.
2: w′^i ← w_i, i ∈ [K]; w′^0 ← 1; F_0 ← f^0.
3: for j = 1 to N do
4:   Calculate the current residual −∇L_w(F_{j−1}).
5:   for i = 0 to K do
6:     Fit the residual −∇L_w(F_{j−1}) with f^i + h^i_j to obtain a weak hypothesis h^i_j.
7:   end for
8:   Choose the optimal update i* as i* = argmax_i −w′^i [∇L_w(F_{j−1})](f^i + h^i_j).
9:   Update the current hypothesis as F_j = F_{j−1} + (w′^{i*} ε)(f^{i*} + h^{i*}_j).
10: end for
11: Output: The learned hypothesis F_N.

Assume that the learning procedure runs for N iterations. Under Adapt-Boost, one weak hypothesis h_j is learned in each iteration j. Denote by H = [h_1 h_2 ··· h_N]^T the vector of all learned weak hypotheses, and set w′^0 = 1, w′^i = w_i, i ∈ [K]. Adapt-Boost solves the following l1-regularized problem:

min_{β^i, i ∈ [K] ∪ {0}, H}  L_w((Σ_{k=1}^K β^{k,T}(1_N f^k + H)) + β^{0,T} H)
s.t.  Σ_{i=0}^K (1/w′^i) ∥β^i∥_1 ≤ μ,  (9)

in which the β^i, i ∈ [K] ∪ {0} are N-dimensional weight vectors and 1_N is the N-dimensional all-one vector. Compared to Equation 8, F_w is further restricted to (Σ_{k=1}^K β^{k,T}(1_N f^k + H)) + β^{0,T} H, and the auxiliary hypothesis φ_w is decomposed into K local terms β^{k,T} H corresponding to the f^k, together with a global term β^{0,T} H. For each f^k, the weight α_k, which represents the importance of f^k in learning F_w, is substituted by the weight vector β^k. Thus controlling the magnitude of α_k is equivalent to controlling the norm of β^k. This target is realized by introducing the sum of the 1/w′^i-weighted l1-norm constraints on β^i, i ∈ [K] ∪ {0} in Equation 9, with a hyperparameter μ controlling the global sparsity. Meanwhile, by controlling the local sparsity of each β^k using w′^k, k ∈ [K], we are able to relate the importance of f^k in F_w to the objective weights w.
The key advantage of employing Equation 9 is that this sparsity-constrained problem can be solved by the ε-boost algorithm [Rosset et al., 2004], which is briefly introduced below. To simplify notation, we use β to denote the concatenation of all β^i, i ∈ [K] ∪ {0}. Temporarily, we also assume that the weak hypotheses H are fixed, and only β needs to be optimized. Instead of explicitly setting the sparsity level μ, we decompose the sparsity constraint over all steps. In each iteration, a small increment Δβ is added to β, and an ε-sparsity constraint is applied to Δβ, leading to the following inside-iteration optimization problem:

min_{Δβ} L_w(β + Δβ),  s.t.  Σ_{i=0}^K (1/w′^i) ∥Δβ^i∥_1 ≤ ε,  (10)

in which Δβ^i is the part of Δβ added to β^i. The objective function can be approximated as

L_w(β + Δβ) ≈ L_w(β) + [∇L_w(β)]^T Δβ  (11)

by Taylor expansion.
Thus we turn to minimizing [∇L_w(β)]^T Δβ. Since the sparsity constraints are gradually added by ε over the learning process, and L_w is convex, the optimal solution of Equation 10 always satisfies Σ_{i=0}^K (1/w′^i) ∥Δβ^i∥_1 = ε. Let [∇L_w(β)]^i_j and [Δβ]^i_j, i ∈ [K] ∪ {0}, j ∈ [N], denote the (N i + j)-th dimension of ∇L_w(β) and Δβ, respectively. It is easy to see that the optimal Δβ in Equation 11 is a vector of all zeros except for [Δβ]^{i*}_{j*} = w′^{i*} ε, where (i*, j*) = argmin_{i,j} w′^i [∇L_w(β)]^i_j. Furthermore, we can explicitly write the (N i + j)-th component of ∇L_w(β) as ∂L_w/∂β^i_j = [∇L_w(F_w)](∂F_w/∂β^i_j) = [∇L_w(F_w)](f^i + h_j), in which h_j is the j-th weak hypothesis in H, and an additional f^0 ≡ 0 is introduced to simplify notation.
Now let us take choosing the weak hypotheses H into consideration. Based on the above discussion, in the j-th iteration, our target is to solve max_{i, h_j} −w′^i [∇L_w(F_w)](f^i + h_j). This formulation inspires us to utilize gradient boosting. To obtain the optimal update, a candidate weak hypothesis h^i_j is chosen for each i ∈ [K] ∪ {0} to let f^i + h^i_j fit the residual −∇L_w(F_w), and then we choose the optimal update f^{i*} + h^{i*}_j, which optimally fits the residual weighted by w′^{i*}. The optimal step size for the update is w′^{i*} ε, according to the previous discussion. The whole process of Adapt-Boost is illustrated in Algorithm 2. It can be seen that Adapt-Boost utilizes a boosting based process to gradually add the element and weak hypotheses into the learned hypothesis, instead of explicitly setting their weights. Thanks to the flexibility of choosing the weak learners and the efficiency of gradient boosting, we are able to solve complex hypothesis adaptation problems with low cost.

Figure 1: Instantaneous regret of DL2M. Panels: (a) K = 10, c = 0.1; (b) K = 10, c = 0.05; (c) K = 10, c = 0.01; (d) K = 100, c = 0.1; (e) K = 100, c = 0.05; (f) K = 100, c = 0.01.

5 Experiments

5.1 Testing DL2M on Synthetic Data

We present experimental results on synthetic data to verify the effectiveness of DL2M. In each experiment, a K-dimensional point is uniformly sampled from the unit ball as θ*. Once the learner submits the pair of arms (w_t, w_{t−1}), a preference feedback r_t ∈ {−1, 1} is randomly generated according to Pr(r_t = ±1 | (w_t, w_{t−1})) = 1/(1 + exp(−ρ r_t (w_t − w_{t−1})^T θ*)), in which ρ is a parameter controlling the randomness of the preferences. In all experiments, we use ρ = 100 to ensure the preferences are relatively consistent. We also set λ = 1 in all experiments. The performance is measured by the change of the instantaneous regret w*^T θ* − w_t^T θ* over time. We compare the results among different c and K, as illustrated in Figure 1. It can be observed that when the parameter c is properly set and K is not large, DL2M achieves very efficient performance and quickly converges within a limited number of iterations. As the dimension gets larger, the performance degenerates accordingly.
To verify the efficiency of Thompson sampling in dueling and one-bit bandit problems, it is interesting to compare the above results with those reported in [Zhang et al., 2016], which utilizes UCB based exploration. It can be seen that Thompson sampling achieves more efficient performance in practice, as in many other bandit problems.

5.2 Multi-Label Performance Measure Adaptation

The multi-label classification task is utilized to evaluate the effectiveness of DL2M and Adapt-Boost, both separately and jointly. It is well-known that for multi-label learning, various performance measures exist, and the choice of the performance measure largely affects the evaluation of the learned classifier. In [Wu and Zhou, 2017], two notions of multi-label margin, i.e. the label-wise and instance-wise margins, are proposed to characterize different multi-label performance measures. According to their work, one specific multi-label performance measure tends to be biased towards one of the two margins, such that optimizing the corresponding margin will also optimize the measure. Based on this finding, the stochastic gradient descent (SGD) based LIMO algorithm is proposed to jointly optimize both margins, in order to achieve good performance on different measures simultaneously. The high-level formulation of LIMO's objective is

L_LIMO = Ω + w_1 L_label + w_2 L_inst,  (12)

in which L_label and L_inst are two margin loss terms for maximizing the two margins, and Ω is a regularization term. For L_LIMO, the weights w_1, w_2 control the relative importance of the two loss terms, thus different choices of the weights can significantly affect the performance.
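Concretely, Equation 12 is an instance of the linear objective class of Equation 1 with K = 2, so the pair (w_1, w_2) becomes the arm searched by the adaptation procedure; a toy sketch with placeholder loss values (our illustration, not the LIMO implementation):

```python
def limo_objective(reg, l_label, l_inst, w):
    """L_LIMO = Omega + w1 * L_label + w2 * L_inst (cf. Equation 12).

    All arguments are scalar placeholders standing in for the actual
    regularizer and margin-loss values; hypothetical, for illustration.
    """
    w1, w2 = w
    return reg + w1 * l_label + w2 * l_inst

# the three fixed-weight baselines of Section 5.2:
for name, w in [("LIMO-label", (1, 0)), ("LIMO-inst", (0, 1)), ("LIMO", (1, 1))]:
    print(name, limo_objective(0.1, l_label=0.4, l_inst=0.2, w=w))
```

The adaptation based methods instead let DL2M move (w_1, w_2) over the nonnegative part of the unit circle, guided by validation-set preferences.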
We will show that by utilizing DL2M and Adapt-Boost, we can automatically find the proper weights between the two margin losses in LIMO's objective, and efficiently adapt to different performance measures.
The experiments are conducted on six benchmark multi-label datasets 1: emotions, CAL500, enron, Corel5k, medical and bibtex. On each dataset, four multi-label performance measures are adopted for evaluation, i.e. ranking loss, coverage, average precision and one-error. Three LIMO based comparison methods are adopted as baselines: (i) LIMO-label, which optimizes L_LIMO with w_1 = 1, w_2 = 0; (ii) LIMO-inst, which optimizes L_LIMO with w_1 = 0, w_2 = 1; (iii) LIMO, which optimizes L_LIMO with w_1 = 1, w_2 = 1 and corresponds to the recommended parameters in the original paper. To evaluate DL2M and Adapt-Boost both separately and jointly, three adaptation based approaches are tested: (i) ADAPT-hypo, which optimizes L_LIMO with w_1 = 1, w_2 = 1 using Adapt-Boost; (ii) ADAPT-obj, which utilizes DL2M with SGD training; (iii) ADAPT-both, which utilizes both DL2M and Adapt-Boost. To implement DL2M, each dataset is randomly split into training, validation and testing sets, with a size ratio of 3:1:1. During the learning process, the preference feedback is generated by testing the learned hypothesis on the validation set, and DL2M is utilized to update the objective for 20 iterations, with c = 0.05, λ = 1. For Adapt-Boost, to evaluate its efficiency, we use only half the number of training iterations of standard LIMO training. Furthermore, to make Adapt-Boost compatible with LIMO training, the SGD updates are utilized as the weak learners for adaptation.
The experimental results are illustrated in Table 1, and the average ranks over all experiments are illustrated in Table 2. It can be seen that the DL2M based method ADAPT-obj achieves better performance than LIMO, which assigns fixed weights to the two margin losses. This phenomenon reveals
This phenomenon reveals that DL2M can automatically identify the best trade-off among different element objectives. Furthermore, though running with much fewer training iterations, the Adapt-Boost based method ADAPT-hypo achieves even better performance than LIMO, which is based on standard SGD training. This verifies the efficiency of Adapt-Boost. ADAPT-both, which utilizes both adaptation methods, achieves the best overall performance. This shows that by utilizing DL2M and Adapt-Boost together, we can solve the objective and hypothesis adaptation problems both better and faster.

1http://mulan.sourceforge.net/datasets-mlc.html

Dataset   Algorithm    ranking loss ↓   coverage ↓         avg. precision ↑   one-error ↓
emotions  LIMO-inst    .420±.051 (6)    2.950±.134 (6)     .603±.028 (6)      .500±.047 (6)
          LIMO-label   .349±.028 (5)    2.745±.174 (5)     .619±.025 (5)      .509±.064 (5)
          LIMO         .299±.023 (4)    2.483±.070 (4)     .648±.028 (4)      .498±.057 (4)
          ADAPT-hypo   .279±.026 (3)    2.331±.090 (2)     .671±.032 (3)      .481±.048 (3)
          ADAPT-obj    .268±.033 (2)    2.377±.144 (3)     .673±.033 (2)      .478±.062 (2)
          ADAPT-both   .254±.020 (1)    2.298±.200 (1)     .678±.028 (1)      .465±.058 (1)
CAL500    LIMO-inst    .522±.026 (6)    162.950±2.417 (6)  .153±.010 (6)      .971±.019 (6)
          LIMO-label   .182±.005 (2)    131.439±1.764 (5)  .496±.006 (5)      .099±.026 (2)
          LIMO         .182±.004 (2)    131.020±1.697 (2)  .498±.006 (1)      .131±.053 (5)
          ADAPT-hypo   .182±.005 (2)    131.297±1.899 (4)  .497±.007 (2)      .128±.036 (4)
          ADAPT-obj    .182±.005 (2)    131.088±1.849 (3)  .497±.007 (2)      .098±.028 (1)
          ADAPT-both   .181±.004 (1)    131.008±2.072 (1)  .497±.008 (2)      .107±.024 (3)
enron     LIMO-inst    .229±.010 (6)    25.166±.957 (6)    .504±.017 (6)      .350±.043 (6)
          LIMO-label   .087±.009 (3)    12.362±.612 (5)    .672±.014 (4)      .235±.024 (1)
          LIMO         .089±.009 (5)    12.199±.625 (4)    .670±.014 (5)      .246±.029 (3)
          ADAPT-hypo   .087±.009 (3)    12.060±.648 (2)    .680±.013 (2)      .246±.022 (3)
          ADAPT-obj    .086±.009 (2)    12.049±.624 (1)    .675±.012 (3)      .251±.022 (5)
          ADAPT-both   .085±.008 (1)    12.066±.577 (3)    .683±.016 (1)      .242±.027 (2)
Corel5k   LIMO-inst    .302±.006 (6)    188.785±3.122 (6)  .101±.006 (6)      .893±.007 (6)
          LIMO-label   .121±.005 (3)    106.920±2.457 (5)  .281±.005 (1)      .718±.017 (1)
          LIMO         .130±.004 (5)    104.465±2.149 (4)  .222±.006 (5)      .793±.012 (5)
          ADAPT-hypo   .121±.005 (3)    101.668±2.141 (3)  .252±.006 (3)      .762±.014 (3)
          ADAPT-obj    .118±.005 (2)    100.478±3.376 (2)  .247±.010 (4)      .772±.013 (4)
          ADAPT-both   .114±.005 (1)    98.880±2.989 (1)   .280±.007 (2)      .719±.016 (2)
medical   LIMO-inst    .019±.005 (2)    1.781±.337 (5)     .857±.020 (5)      .192±.034 (5)
          LIMO-label   .028±.004 (6)    2.326±.489 (6)     .830±.020 (6)      .216±.037 (6)
          LIMO         .020±.005 (4)    1.563±.249 (3)     .869±.026 (2)      .163±.028 (1)
          ADAPT-hypo   .021±.004 (5)    1.621±.246 (4)     .863±.023 (4)      .181±.031 (4)
          ADAPT-obj    .019±.004 (2)    1.499±.340 (2)     .874±.021 (1)      .171±.034 (2)
          ADAPT-both   .018±.004 (1)    1.447±.288 (1)     .866±.025 (3)      .176±.036 (3)
bibtex    LIMO-inst    .120±.003 (6)    32.751±1.144 (6)   .488±.007 (6)      .486±.018 (6)
          LIMO-label   .072±.003 (5)    20.460±.515 (5)    .526±.007 (5)      .440±.018 (5)
          LIMO         .060±.002 (4)    17.648±.596 (4)    .567±.007 (4)      .395±.014 (4)
          ADAPT-hypo   .056±.002 (1)    16.708±.430 (1)    .579±.007 (2)      .384±.013 (2)
          ADAPT-obj    .057±.003 (3)    17.119±.832 (2)    .575±.006 (3)      .391±.013 (3)
          ADAPT-both   .056±.002 (1)    17.128±.590 (3)    .581±.006 (1)      .383±.013 (1)

Table 1: Experimental results for the adaptation based and LIMO based methods. For each measure, "↓" indicates "the smaller the better" and "↑" indicates "the larger the better". The results are shown in mean±std (rank) format, calculated from ten repeated experiments. The rank is calculated from the mean: the smaller the rank, the better the performance.

Algorithm   LIMO-inst   LIMO-label   LIMO   ADAPT-hypo   ADAPT-obj   ADAPT-both
avg. rank   5.71        4.21         3.67   2.83         2.42        1.58

Table 2: The average performance rank over all experiments.

6 Conclusion and Future Work

In this work, the preference based objective adaptation task is studied. The DL2M algorithm is proposed under this setting, which can efficiently solve the objective adaptation problem based on the dueling bandit model. For better hypothesis adaptation, the Adapt-Boost method is proposed to adapt the pre-learned element classifiers to a new objective at low cost.
To further investigate the objective adaptation problem, it is possible to relax the linear combination formulation of the objective function adopted in this work. We are also interested in applying the proposed approaches to other real-world problems, especially tasks in which human expert feedback can be utilized.
Furthermore, it is also interesting to investigate Adapt-Boost on larger-scale problems, as well as to study its theoretical guarantees.

Acknowledgement

This research is supported by National Key R&D Program of China (2018YFB1004300), NSFC (61751306) and Collaborative Innovation Center of Novel Software Technology and Industrialization. Yao-Xiang Ding is supported by the Outstanding PhD Candidate Program of Nanjing University. The authors would like to thank the anonymous reviewers for constructive suggestions, as well as Lijun Zhang, Ming Pang, Xi-Zhu Wu and Yichi Xiao for helpful discussions.

References

Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In NIPS, pages 2312–2320, 2011.

Marc Abeille and Alessandro Lazaric. Linear Thompson sampling revisited. In AISTATS, pages 176–184, 2017.

Alekh Agarwal, Ashwinkumar Badanidiyuru, Miroslav Dudik, Robert E. Schapire, and Aleksandrs Slivkins. Robust multi-objective learning with mentor feedback. In COLT, pages 726–741, 2014.

Olivier Chapelle, Pannagadatta Shivaswamy, Srinivas Vadrevu, Kilian Weinberger, Ya Zhang, and Belle Tseng. Multi-task learning for boosting with application to web search ranking. In KDD, pages 1189–1198, 2010.

Kalyanmoy Deb. Multi-objective optimization. In Search Methodologies, pages 403–449. Springer, 2014.

Theodoros Evgeniou and Massimiliano Pontil. Regularized multi-task learning. In KDD, pages 109–117, 2004.

Abraham D. Flaxman, Adam Tauman Kalai, and H. Brendan McMahan. Online convex optimization in the bandit setting: gradient descent without a gradient. In SODA, pages 385–394, 2005.

Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3):169–192, 2007.

Wataru Kumagai. Regret analysis for continuous dueling bandit. In NIPS, pages 1488–1497, 2017.

Nan Li, Ivor W. Tsang, and Zhi-Hua Zhou. Efficient optimization of performance measures by classifier adaptation. IEEE TPAMI, 35(6):1370–1382, 2013.

Saharon Rosset, Ji Zhu, and Trevor Hastie. Boosting as a regularized path to a maximum margin classifier. JMLR, 5(Aug):941–973, 2004.

Shai Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2012.

Xi-Zhu Wu and Zhi-Hua Zhou. A unified view of multi-label performance measures. In ICML, pages 3780–3788, 2017.

Yisong Yue and Thorsten Joachims. Interactively optimizing information retrieval systems as a dueling bandits problem. In ICML, pages 1201–1208, 2009.

Yisong Yue, Josef Broder, Robert Kleinberg, and Thorsten Joachims. The k-armed dueling bandits problem. JCSS, 78(5):1538–1556, 2012.

Lijun Zhang, Tianbao Yang, Rong Jin, Yichi Xiao, and Zhi-Hua Zhou. Online stochastic linear optimization under one-bit feedback. In ICML, pages 392–401, 2016.

Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In ICML, pages 928–936, 2003.