{"title": "A General Boosting Method and its Application to Learning Ranking Functions for Web Search", "book": "Advances in Neural Information Processing Systems", "page_first": 1697, "page_last": 1704, "abstract": "We present a general boosting method extending functional gradient boosting to optimize complex loss functions that are encountered in many machine learning problems. Our approach is based on optimization of quadratic upper bounds of the loss functions which allows us to present a rigorous convergence analysis of the algorithm. More importantly, this general framework enables us to use a standard regression base learner such as decision trees for fitting any loss function. We illustrate an application of the proposed method in learning ranking functions for Web search by combining both preference data and labeled data for training. We present experimental results for Web search using data from a commercial search engine that show significant improvements of our proposed methods over some existing methods.", "full_text": "A General Boosting Method and its Application to Learning Ranking Functions for Web Search\n\nZhaohui Zheng†  Hongyuan Zha‡  Tong Zhang†  Olivier Chapelle†  Keke Chen†  Gordon Sun†\n\n†Yahoo! Inc., 701 First Avenue, Sunnyvale, CA 94089\n{zhaohui,tzhang,chap,kchen,gzsun}@yahoo-inc.com\n\n‡College of Computing, Georgia Institute of Technology, Atlanta, GA 30032\nzha@cc.gatech.edu\n\nAbstract\n\nWe present a general boosting method extending functional gradient boosting to optimize complex loss functions that are encountered in many machine learning problems. Our approach is based on optimization of quadratic upper bounds of the loss functions, which allows us to present a rigorous convergence analysis of the algorithm. 
More importantly, this general framework enables us to use a standard regression base learner such as a single regression tree for fitting any loss function. We illustrate an application of the proposed method in learning ranking functions for Web search by combining both preference data and labeled data for training. We present experimental results for Web search using data from a commercial search engine that show significant improvements of our proposed methods over some existing methods.\n\n1 Introduction\n\nThere has been much interest in developing machine learning methods involving complex loss functions beyond those used in regression and classification problems [13]. Many methods have been proposed dealing with a wide range of problems including ranking problems, learning conditional random fields and other structured learning problems [1, 3, 4, 5, 6, 7, 11, 13]. In this paper we propose a boosting framework that can handle a wide variety of complex loss functions. The proposed method uses a regression black box to optimize a general loss function based on quadratic upper bounds, and it also allows us to present a rigorous convergence analysis of the method. Our approach extends the gradient boosting approach proposed in [8] but can handle substantially more complex loss functions arising from a variety of machine learning problems.\n\nAs an interesting and important application of the general boosting framework, we apply it to the problem of learning ranking functions for Web search. Specifically, we want to rank a set of documents according to their relevance to a given query. We adopt the following framework: we extract a set of features x for each query-document pair, and learn a function h(x) so that we can rank the documents using the values h(x); say, x with larger h(x) values are ranked higher. We call such a function h(x) a ranking function. 
In Web search, we can identify two types of training data for learning a ranking function: 1) preference data indicating that a document is more relevant than another with respect to a query [11, 12]; and 2) labeled data where documents are assigned ordinal labels representing their degree of relevance. In general, we will have both preference data and labeled data for training a ranking function for Web search, leading to a complex loss function that can be handled by our proposed general boosting method, which we now describe.\n\n2 A General Boosting Method\n\nWe consider the following general optimization problem:\n\nĥ = argmin_{h ∈ H} R(h),    (1)\n\nwhere h denotes a prediction function which we are interested in learning from the data, H is a pre-chosen function class, and R(h) is a risk functional with respect to h. We consider the following form of the risk functional R:\n\nR(h) = (1/n) ∑_{i=1}^n ℓ_i(h(x_{i,1}), ..., h(x_{i,m_i}); y_i),    (2)\n\nwhere ℓ_i(h_1, ..., h_{m_i}; y) is a loss function with respect to the first m_i arguments h_1, ..., h_{m_i}. For example, each function ℓ_i can be a single-variable function (m_i = 1) such as in regression: ℓ_i(h; y) = (h − y)²; or a two-variable function (m_i = 2), such as those in ranking based on pairwise comparisons: ℓ_i(h_1, h_2; y) = max(0, 1 − y(h_1 − h_2))², where y ∈ {±1} indicates whether h_1 is preferred to h_2 or not; or it can be a multi-variable function as used in some structured prediction problems: ℓ_i(h_1, ..., h_{m_i}; y) = sup_z [Δ(y, z) + ψ(h, z) − ψ(h, y)], where Δ is a loss function [13].\n\nAssume we do not have a general solver for the optimization problem (1), but we have a learning algorithm A which we refer to as a regression weak learner. 
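The weighted regression oracle A can be instantiated in many ways. As a minimal sketch (not the paper's tree learner), consider the class C of constant functions: the weighted least-squares fit over C is just the weighted mean of the targets, which attains the oracle inequality with tolerance zero.

```python
import numpy as np

def constant_weak_learner(w, r):
    """Weighted least-squares oracle over the class C of constant functions:
    min_c sum_j w_j * (c - r_j)^2 is attained at the weighted mean of r."""
    w = np.asarray(w, dtype=float)
    r = np.asarray(r, dtype=float)
    return float(np.sum(w * r) / np.sum(w))

weights = [1.0, 2.0, 1.0]
targets = [0.0, 3.0, 6.0]
c = constant_weak_learner(weights, targets)

# The returned constant is at least as good as any other constant,
# i.e. it satisfies the weak-learner requirement with tolerance 0.
loss = lambda g: sum(wi * (g - ri) ** 2 for wi, ri in zip(weights, targets))
assert all(loss(c) <= loss(g) + 1e-9 for g in np.linspace(-10.0, 10.0, 401))
print(c)  # -> 3.0
```

A richer class C (e.g. regression trees, as used later in the paper) is handled the same way: the oracle only needs to return a near-minimizer of the weighted squared error.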
Given any set of data points X = [x_1, ..., x_k], with corresponding target values R = [r_1, ..., r_k], weights W = [w_1, ..., w_k], and tolerance ε > 0, the regression weak learner A produces a function ĝ = A(W, X, R, ε) ∈ C such that\n\n∑_{j=1}^k w_j(ĝ(x_j) − r_j)² ≤ min_{g ∈ C} ∑_{j=1}^k w_j(g(x_j) − r_j)² + ε.    (3)\n\nOur goal is to use this weak learner A to solve the original optimization problem (1). Here H = span(C), i.e., h ∈ H can be expressed as h(x) = ∑_j a_j h_j(x) with h_j ∈ C.\n\nFriedman [8] proposed a solution when the loss function in (2) can be expressed as\n\nR(h) = ∑_{i=1}^n ℓ_i(h(x_i)),    (4)\n\nwhich he named gradient boosting. The idea is to estimate the gradient ∇ℓ_i(h(x_i)) using regression at each step with uniform weighting, and update. However, there is no convergence proof.\n\nFollowing his work, we consider an extension that is more principled, for which a convergence analysis can be obtained. We first rewrite (2) in the more general form:\n\nR(h) = R(h(x_1), ..., h(x_N)),    (5)\n\nwhere N ≤ ∑ m_i.¹ Note that R depends on h only through the function values h(x_i), and from now on we identify the function h with the vector [h(x_i)]. The function R is thus considered to be a function of N variables.\n\nOur main observation is that for a twice differentiable risk functional R, at each tentative solution h_k we can expand R(h) around h_k using a Taylor expansion as\n\nR(h_k + g) = R(h_k) + ∇R(h_k)ᵀ g + (1/2) gᵀ ∇²R(h′) g,\n\nwhere h′ lies between h_k and h_k + g. The right-hand side is almost quadratic, and we can then replace it by a quadratic upper bound\n\nR(h_k + g) ≤ R_k(g) = R(h_k) + ∇R(h_k)ᵀ g + (1/2) gᵀ W g,    (6)\n\nwhere W is a diagonal matrix upper bounding the Hessian between h_k and h_k + g.\n\n¹We consider that all x_i are different, but some of the x_{i,j} in (2) might have been identical, hence the inequality.\n\nIf we define 
If we de\u00a3ne\n\nrj = \u00a1[rR(hk)]j=wj, then 8g 2 C;Pj wj(g(xj) \u00a1 rj)2 is equal to the above quadratic form\n(up to a constant). So g can be found by calling the regression weak learner A. Since at each\nstep we try to minimize an upper bound Rk of R, if we let the minimum be gk, it is clear that\nR(hk + gk) \u2022 Rk(gk) \u2022 R(hk). This means that by optimizing with respect to the problem Rk\nthat can be handled by A, we also make progress with respect to optimizing R. The algorithm based\non this idea is listed in Algorithm 1 for the loss function in (5).\n\nConvergence analysis of this algorithm can be established using the idea summarized above; see\ndetails in appendix. However, in partice, instead of the quadratic upper bound (which has a theo-\nretical garantee easier to derive), one may also consider minimizing an approximation to the Taylor\nexpansion, which would be closer to a Newton type method.\n\nAlgorithm 1 Greedy Algorithm with Quadratic Approximation\n\nInput: X = [x\u2018]\u2018=1;:::;N\nlet h0 = 0\nfor k = 0; 1; 2; : : :\n\nlet W = [w\u2018]\u2018=1;:::;N , with either\nw\u2018 = @2R=@hk(x\u2018)2 or\nW global diagonal upper bound on the Hessian\nlet R = [r\u2018]\u2018=1;:::;N , where r\u2018 = w\u00a11\n\u2018 @R=@hk(x\u2018)\npick \u2020k \u201a 0\nlet gk = A(W; X; R; \u2020k)\npick step-size sk \u201a 0, typically by line search on R\nlet hk+1 = hk + skgk\nend\n\n% Newton-type method with diagonal Hessian\n% Upper-bound minimization\n\nThe main conceptual difference between our view and that of Friedman is that he views regression\nas a \u201creasonable\u201d approximation to the \u00a3rst order gradient rR, while our work views it as a natural\nconsequence of second order approximation of the objective function (in which the quadratic term\nserve as an upper bound of the Hessian either locally or globally). This leads to algorithmic differ-\nence. 
In our approach, a good choice of the second-order upper bound (leading to a tighter bound) may require non-uniform weights W. This is in line with earlier boosting work in which sample reweighting was a central idea. In our framework, the reweighting occurs naturally when we choose a tight second-order approximation. Different reweightings can affect the rate of convergence in our analysis. The other main difference with Friedman is that he only considered objective functions of the form (4); we propose a natural extension to those of the form (5).\n\n3 Learning Ranking Functions\n\nWe now apply Algorithm 1 to the problem of learning ranking functions. We use preference data as well as labeled data for training the ranking function. For preference data, we use x ≻ y to mean that x is preferred over y, or x should be ranked higher than y, where x and y are the feature vectors for the corresponding items to be ranked. We denote the set of available preferences as S = {x_i ≻ y_i, i = 1, ..., N}. In addition to the preference data, there are also labeled data, L = {(z_i, l_i), i = 1, ..., n}, where z_i is the feature vector of an item and l_i is the corresponding numerically coded label.² We formulate the ranking problem as computing a ranking function h ∈ H such that h satisfies as much as possible the set of preferences, i.e., h(x_i) ≥ h(y_i) if x_i ≻ y_i, i = 1, ..., N, while at the same time h(z_i) matches the label l_i in a sense to be detailed below.\n\n²Some may argue that absolute relevance judgments can also be converted to relative relevance judgments. For example, for a query, suppose we have three documents d_1, d_2 and d_3 labeled as perfect, good, and bad, respectively. We can obtain the following relative relevance judgments: d_1 is preferred over d_2, d_1 is preferred over d_3, and d_2 is preferred over d_3. 
However, it is often the case in Web search that for many queries there only exist documents with a single label, and for such queries no preference data can be constructed.\n\nTHE OBJECTIVE FUNCTION. We use the following objective function to measure the empirical risk of a ranking function h:\n\nR(h) = (w/2) ∑_{i=1}^N (max{0, h(y_i) − h(x_i) + τ})² + ((1 − w)/2) ∑_{i=1}^n (l_i − h(z_i))².\n\nThe objective function consists of two parts: 1) for the preference data part, we introduce a margin parameter τ and would like to enforce that h(x_i) ≥ h(y_i) + τ; if not, the difference is quadratically penalized; and 2) for the labeled data part, we simply minimize the squared errors. The parameter w is the relative weight for the preference data and could typically be found by cross-validation. The optimization problem we seek to solve is h* = argmin_{h ∈ H} R(h), where H is some given function class. Note that R depends only on the values h(x_i), h(y_i), h(z_i), and we can optimize it using the general boosting framework discussed in Section 2.\n\nQUADRATIC APPROXIMATION. To this end, consider the quadratic approximation (6) for R(h). For simplicity let us assume that each feature vector x_i, y_i and z_i appears in S and L only once; otherwise we need to compute appropriately formed averages. We consider\n\nh(x_i), h(y_i), i = 1, ..., N;    h(z_i), i = 1, ..., n\n\nas the unknowns, and compute the gradient of R(h) with respect to those unknowns. The component of the negative gradient corresponding to h(z_i) is just l_i − h(z_i). The components of the negative gradient corresponding to h(x_i) and h(y_i), respectively, are\n\nmax{0, h(y_i) − h(x_i) + τ},    −max{0, h(y_i) − h(x_i) + τ}.\n\nBoth of the above are equal to zero when h(x_i) − h(y_i) ≥ τ. 
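These negative-gradient components can be checked numerically. The sketch below (with hypothetical scalar scores and unit weight) compares the closed form max{0, h(y) − h(x) + τ} against a centered finite difference of the preference term of the objective:

```python
def pref_loss(hx, hy, tau=1.0):
    """Preference term of the objective for one pair (unit weight):
    0.5 * max(0, h(y) - h(x) + tau)^2."""
    return 0.5 * max(0.0, hy - hx + tau) ** 2

# Hypothetical scores h(x) = 0.2, h(y) = 0.5 with margin tau = 1:
# the pair is in violation, so the gradient component is active.
hx, hy, tau = 0.2, 0.5, 1.0
analytic = max(0.0, hy - hx + tau)   # negative gradient w.r.t. h(x): 1.3
eps = 1e-6
numeric = -(pref_loss(hx + eps, hy, tau) - pref_loss(hx - eps, hy, tau)) / (2 * eps)
assert abs(analytic - numeric) < 1e-5
```

For a satisfied pair (h(x) − h(y) ≥ τ) both the analytic and the numeric gradient vanish, matching the text.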
For the second-order term, it can be readily verified that the Hessian of R(h) is block-diagonal, with 2-by-2 blocks corresponding to h(x_i) and h(y_i) and 1-by-1 blocks for h(z_i). In particular, if we evaluate the Hessian at h, the 2-by-2 block equals\n\n[ 1  −1 ; −1  1 ]  or  [ 0  0 ; 0  0 ]\n\nfor x_i ≻ y_i with h(x_i) − h(y_i) < τ and h(x_i) − h(y_i) ≥ τ, respectively. We can upper bound the first matrix by the diagonal matrix diag(2, 2), leading to a quadratic upper bound. We summarize the above derivations in the following algorithm.\n\nAlgorithm 2 Boosted Ranking using Successive Quadratic Approximation (QBrank)\n\nStart with an initial guess h_0; for m = 1, 2, ...,\n1) construct a training set for fitting g_m(x) by adding the following for each ⟨x_i, y_i⟩ ∈ S,\n(x_i, max{0, h_{m−1}(y_i) − h_{m−1}(x_i) + τ}),  (y_i, −max{0, h_{m−1}(y_i) − h_{m−1}(x_i) + τ}),\nand\n{(z_i, l_i − h_{m−1}(z_i)), i = 1, ..., n}.\nThe fitting of g_m(x) is done by using a base regressor with the above training set; we weigh the preference data by w and the labeled data by 1 − w, respectively.\n2) form h_m = h_{m−1} + η s_m g_m(x), where s_m is found by line search to minimize the objective function and η is a shrinkage factor.\n\nThe shrinkage factor η is 1 by default, but Friedman [8] reported better results (coming from better regularization) by taking η < 1. In general, we choose η and w by cross-validation. τ could be the degree of preference if that information is available, e.g., the absolute grade difference for each preference pair if it is converted from labeled data. Otherwise, we simply set it to 1.0. When there is no preference data and the weak regression learner produces a regression tree, QBrank is identical to Gradient Boosting Trees (GBT) as proposed in [8].\n\nREMARK. 
An x_i can appear multiple times in Step 1); in this case we use the average of the gradient values as the target value for each distinct x_i.\n\n4 Experiment Results\n\nWe carried out several experiments illustrating the properties and effectiveness of QBrank using combined preference data and labeled data in the context of learning ranking functions for Web search [3]. We also compared its performance with QBrank using preference data only and with several existing algorithms such as Gradient Boosting Trees [8] and RankSVM [11, 12]. RankSVM is a preference learning method which learns pairwise preferences based on an SVM approach.\n\nDATA COLLECTION. We first describe how the data used in the experiments were collected. For each query-document pair we extracted a set of features to form a feature vector, which consists of three parts, x = [x_Q, x_D, x_QD], where 1) the query-feature vector x_Q comprises features that depend on the query q only and have constant values across all the documents d in the document set, for example, the number of terms in the query, whether or not the query is a person name, etc.; 2) the document-feature vector x_D comprises features that depend on the document d only and have constant values across all the queries q in the query set, for example, the number of inbound links pointing to the document, the amount of anchor-text in bytes for the document, and the language identity of the document, etc.; and 3) the query-document feature vector x_QD comprises features that depend on the relation of the query q with respect to the document d, for example, the number of times each term in the query q appears in the document d, the number of times each term in the query q appears in the anchor-texts of the document d, etc.\n\nWe sampled a set of queries from the query logs of a commercial search engine and generated a certain number of query-document pairs for each of the queries. 
A five-level numerical grade (0, 1, 2, 3, 4) is assigned to each query-document pair based on the degree of relevance. In total we have 4,898 queries and 105,243 query-document pairs. We split the data into three subsets as follows: 1) we extract all the queries which have documents with a single label. The set of feature vectors and the corresponding labels form the training set L1, which contains around 2,000 queries giving rise to 20,000 query-document pairs. (Some single-labeled data are from an editorial database, where each query has a few ideal results with the same label. Others are bad ranking cases submitted internally, in which all the documents for a query are labeled as bad. As we will see, this type of single-labeled data is very useful for learning ranking functions.) 2) We then randomly split the remaining data by queries, and construct a training set L2 containing about 1,300 queries and 40,000 query-document pairs and a test set L3 with about 1,400 queries and 44,000 query-document pairs.\n\nWe use L2 or L3 to generate a set of preference data as follows: given a query q and two documents d_x and d_y, let the feature vectors for (q, d_x) and (q, d_y) be x and y, respectively. If d_x has a higher grade than d_y, we include the preference x ≻ y, while if d_y has a higher grade than d_x, we include the preference y ≻ x. For each query, we consider all pairs of documents within the search results for that query except those with equal grades. This way, we generate around 500,000 preference pairs in total. We denote the preference data as P2 and P3, corresponding to L2 and L3, respectively.\n\nEVALUATION METRICS. The output of QBrank is a ranking function h which is used to rank the documents x according to h(x). Therefore, document x is ranked higher than y by the ranking function h if h(x) > h(y), and we call this the predicted preference. 
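The pair-generation scheme described above (all unequal-grade pairs per query, ties skipped) can be sketched as follows, with hypothetical document identifiers in place of feature vectors:

```python
def extract_preferences(docs):
    """All unequal-grade pairs for one query's results; tied pairs are skipped.
    docs: list of (document_id, grade) tuples."""
    prefs = []
    for i, (di, gi) in enumerate(docs):
        for dj, gj in docs[i + 1:]:
            if gi > gj:
                prefs.append((di, dj))   # di preferred over dj
            elif gj > gi:
                prefs.append((dj, di))   # dj preferred over di
    return prefs

docs = [("d1", 4), ("d2", 2), ("d3", 2)]
# The (d2, d3) pair is tied and produces no preference.
assert extract_preferences(docs) == [("d1", "d2"), ("d1", "d3")]
```

Applied per query over L2 or L3, this is the construction that yields the roughly 500,000 preference pairs reported above.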
We propose the following two metrics to evaluate the performance of a ranking function with respect to a given set of preferences, which we consider as the true preferences.\n\n1) Precision at K%: for two documents x and y (with respect to the same query), it is reasonable to assume that it is easy to compare x and y if |h(x) − h(y)| is large, and that x and y should have about the same rank if h(x) is close to h(y). Based on this, we sort all the document pairs ⟨x, y⟩ according to |h(x) − h(y)|. We call precision at K% the fraction of non-contradicting pairs in the top K% of the sorted list. Precision at 100% can be considered as an overall performance measure of a ranking function.\n\n2) Discounted Cumulative Gain (DCG): DCG has been widely used to assess relevance in the context of search engines [10]. For a ranked list of N documents (N is set to 5 in our experiments), we use the following variation of DCG: DCG_N = ∑_{i=1}^N G_i / log₂(i + 1), where G_i represents the weight assigned to the label of the document at position i. A higher degree of relevance corresponds to a higher value of the weight.\n\nPARAMETERS. There are three parameters in QBrank: τ, η, and w. In our experiments, τ is the absolute grade difference between each pair ⟨x_i, y_i⟩. We set η to be 0.05 and w to be 0.5 in our experiments.\n\nTable 1: Precision at K% for QBrank, GBT, and RankSVM\n\nK%      QBrank   GBT      RankSVM\n10%     0.9446   0.9328   0.8524\n20%     0.903    0.8939   0.8152\n30%     0.8611   0.8557   0.7839\n40%     0.8246   0.8199   0.7578\n50%     0.7938   0.7899   0.7357\n60%     0.7673   0.7637   0.7151\n70%     0.7435   0.7399   0.6957\n80%     0.7218   0.7176   0.6779\n90%     0.7015   0.6977   0.6615\n100%    0.6834   0.6803   0.6465\n\nFor a fair comparison, we used a single regression tree with 20 leaf nodes as the base regressor for both GBT and QBrank in our experiments. η and the number of leaf nodes were tuned for GBT through cross-validation. 
We did not retune them for QBrank.\n\nEXPERIMENTS AND RESULTS. We are interested in the following questions: 1) How does GBT using labeled data L2 compare with QBrank or RankSVM using the preference data extracted from the same labeled data, P2? 2) Is it useful to include the single-labeled data L1 in GBT and QBrank? To this end, we considered the following six experiments for comparison: 1) GBT using L1, 2) GBT using L2, 3) GBT using L1 ∪ L2, 4) RankSVM using P2, 5) QBrank using P2, and 6) QBrank using P2 ∪ L1.\n\nTable 1 presents the precision at K% on data P3 for the ranking functions learned by GBT with labeled training data L2, and by QBrank and RankSVM with the corresponding preference data P2. It shows that QBrank outperforms both GBT and RankSVM with respect to the precision at K% metric.\n\nThe DCG-5 for RankSVM using P2 is 6.181, while that for the other five methods is shown in Figure 1, from which we can see that it is useful to include single-labeled data in GBT training. In the case of preference learning, no preference pairs can be extracted from single-labeled data. Therefore, existing methods such as RankSVM, RankNet and RankBoost that are formulated for preference data only cannot take advantage of such data. The QBrank framework can combine preference data and labeled data in a natural way. From Figure 1, we can see that QBrank using combined preference data and labeled data outperforms both QBrank and RankSVM using preference data only, which indicates that single-labeled data are also useful for QBrank training. Another observation is that GBT using labeled data is significantly worse than QBrank using preference data extracted from the same labeled data.³ The clear convergence trend of QBrank is also demonstrated in Figure 1. 
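The DCG variant defined above can be written in a few lines. In the sketch below the label-to-gain mapping G_i is left unspecified (as in the paper), so raw grades are used directly as gains for illustration:

```python
import math

def dcg(gains, n=5):
    """DCG_N = sum_{i=1..N} G_i / log2(i + 1), where gains[i-1] is the
    gain G_i of the document at rank position i (1-based)."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains[:n], start=1))

# The ideal (descending-grade) ordering scores strictly higher than
# the reversed one, since the log2(i + 1) discount decays with position.
assert dcg([4, 3, 2, 1, 0]) > dcg([0, 1, 2, 3, 4])
assert abs(dcg([1, 0, 0, 0, 0]) - 1.0) < 1e-12   # 1 / log2(2) = 1
```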
Notice that we excluded all tied data (pairs of documents with the same grades) when constructing preference data from the absolute relevance judgments, which can be a significant loss of information: for example, given x_1 > x_2 and x_3 > x_4, if we also know that x_2 ties with x_3, then we can recover the whole ranking x_1 > {x_2, x_3} > x_4. Including tied data could further improve the performance of both GBrank and QBrank.\n\n5 Conclusions and Future Work\n\nWe proposed a general boosting method for optimizing complex loss functions. We also applied the general framework to the problem of learning ranking functions. Experimental results using commercial search engine data show that our approach leads to significant improvements. In future work, 1) we will add regularization to the preference part of the objective function; 2) we plan to apply our general boosting method to other structured learning problems; and 3) we will also explore other applications where both preference and labeled data are available for training ranking functions.\n\n³A 1% DCG gain is considered significant on this data set for commercial search engines.\n\n[Figure: DCG-5 versus iterations (number of trees), for GBT using L2, GBT using L1, GBT using L1+L2, QBrank using P2, and QBrank using P2+L1.]\n\nFigure 1: DCG v. Iterations. 
Notice that the DCG for RankSVM using P2 is 6.181.\n\nAppendix: Convergence results\n\nWe introduce a few definitions.\n\nDefinition 1 C is scale-invariant if ∀g ∈ C and α ∈ ℝ, αg ∈ C.\n\nDefinition 2 ‖g‖_{W,X} = √((1/n) ∑_ℓ w_ℓ g(x_ℓ)²).\n\nDefinition 3 Let h ∈ span(C); then ‖h‖_{W,X} = inf{∑_j |α_j| : h = ∑_j α_j g_j/‖g_j‖_{W,X}, g_j ∈ C}.\n\nDefinition 4 Let R(h) be a function of h. A global upper bound M of its Hessian with respect to [W, X] satisfies: ∀h, β and g: R(h + βg) ≤ R(h) + β ∇R(h)ᵀ g + (β²/2) M ‖g‖²_{W,X}.\n\nAlthough we only consider global upper bounds, it is easy to see that results with respect to local upper bounds can also be established.\n\nTheorem 1 Consider Algorithm 1, where R is a convex function of h. Let M be an upper bound of the Hessian of R. Assume that C is scale-invariant. Let h̄ ∈ span(C). Let s̄_k = s_k ‖g_k‖_{W,X} be the normalized step-size, a_j = ∑_{i=0}^j s̄_i, and b_j = ∑_{i≥j} (s̄_i √(2ε_i) + M s̄_i²/2); then\n\nR(h_{k+1}) ≤ R(h̄) + (‖h̄‖_{W,X} / (‖h̄‖_{W,X} + a_k)) max(0, R(0) − R(h̄)) + inf_j [ (b_0 − b_{j+1}) (‖h̄‖_{W,X} + a_j)/(‖h̄‖_{W,X} + a_k) + (b_{j+1} − b_{k+1}) ].\n\nIf we choose s̄_k ≥ 0 such that ∑_k s̄_k = ∞ and ∑_k (s̄_k² + s̄_k √ε_k) < ∞, then lim_{k→∞} R(h_k) = inf_{h̄ ∈ span(C)} R(h̄), and the rate of convergence compared to any target h̄ ∈ span(C) only depends on ‖h̄‖_{W,X} and the sequences {a_j} and {b_j}.\n\nThe proof is a systematic application of the idea outlined earlier and will be detailed in a separate publication. In practice, one often sets the step size to be a small constant. In particular, for some fixed s > 0, we can choose √(2ε_i) ≤ M s²/2 and s_k ‖g_k‖_{W,X} = s² when R(h_k + s̄_k g̃_k) ≤ R(h_k) (s̄_k = 0 otherwise). 
Theorem 1 gives the following bound when k ≥ √(‖h̄‖_{W,X} max(0, R(0) − R(h̄))/M) s⁻³:\n\nR(h_{k+1}) ≤ R(h̄) + 2s √(max(0, R(0) − R(h̄)) ‖h̄‖_{W,X} M) + M s⁴.\n\nThe convergence results show that in order to achieve a risk not much worse than that of any target function h̄ ∈ span(C), the approximation function h_k does not need to be very complex when the complexity is measured by its 1-norm. It is also important to see that the quantities appearing in the generalization analysis do not depend on the number of samples. These results imply that, statistically, Algorithm 1 (with a small step-size) has an implicit regularization effect that prevents the procedure from overfitting the data. Standard empirical process techniques can then be applied to obtain generalization bounds for Algorithm 1.\n\nReferences\n\n[1] BALCAN, N., BEYGELZIMER, A., LANGFORD, J., AND SORKIN, G. Robust reductions from ranking to classification. Manuscript, 2007.\n\n[2] BERTSEKAS, D. Nonlinear Programming. Athena Scientific, second edition, 1999.\n\n[3] BURGES, C., SHAKED, T., RENSHAW, E., LAZIER, A., DEEDS, M., HAMILTON, N., AND HULLENDER, G. Learning to rank using gradient descent. Proc. of Intl. Conf. on Machine Learning (ICML) (2005).\n\n[4] DIETTERICH, T. G., ASHENFELTER, A., AND BULATOV, Y. Training conditional random fields via gradient tree boosting. Proc. of Intl. Conf. on Machine Learning (ICML) (2004).\n\n[5] CLEMENCON, S., LUGOSI, G., AND VAYATIS, N. Ranking and scoring using empirical risk minimization. Proc. of COLT (2005).\n\n[6] COHEN, W. W., SCHAPIRE, R. E., AND SINGER, Y. Learning to order things. Journal of Artificial Intelligence Research, Neural Computation, 13, 1443–1472 (1999).\n\n[7] FREUND, Y., IYER, R., SCHAPIRE, R. E., AND SINGER, Y. An efficient boosting algorithm for combining preferences. 
Journal of Machine Learning Research 4 (2003), 933–969.\n\n[8] FRIEDMAN, J. H. Greedy function approximation: A gradient boosting machine. Annals of Statistics 29, 5 (2001), 1189–1232.\n\n[9] HERBRICH, R., GRAEPEL, T., AND OBERMAYER, K. Large margin rank boundaries for ordinal regression. 115–132.\n\n[10] JARVELIN, K., AND KEKALAINEN, J. IR evaluation methods for retrieving highly relevant documents. Proc. of ACM SIGIR Conference (2000).\n\n[11] JOACHIMS, T. Optimizing search engines using clickthrough data. Proc. of ACM SIGKDD Conference (2002).\n\n[12] JOACHIMS, T., GRANKA, L., PAN, B., AND GAY, G. Accurately interpreting clickthrough data as implicit feedback. Proc. of ACM SIGIR Conference (2005).\n\n[13] TSOCHANTARIDIS, I., JOACHIMS, T., HOFMANN, T., AND ALTUN, Y. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 6:1453–1484, 2005.", "award": [], "sourceid": 747, "authors": [{"given_name": "Zhaohui", "family_name": "Zheng", "institution": null}, {"given_name": "Hongyuan", "family_name": "Zha", "institution": null}, {"given_name": "Tong", "family_name": "Zhang", "institution": null}, {"given_name": "Olivier", "family_name": "Chapelle", "institution": null}, {"given_name": "Keke", "family_name": "Chen", "institution": null}, {"given_name": "Gordon", "family_name": "Sun", "institution": null}]}