{"title": "Top Rank Optimization in Linear Time", "book": "Advances in Neural Information Processing Systems", "page_first": 1502, "page_last": 1510, "abstract": "Bipartite ranking aims to learn a real-valued ranking function that orders positive instances before negative instances. Recent efforts of bipartite ranking are focused on optimizing ranking accuracy at the top of the ranked list. Most existing approaches either optimize task-specific metrics or extend the rank loss by placing more emphasis on the error associated with the top ranked instances, leading to a high computational cost that is super-linear in the number of training instances. We propose a highly efficient approach, titled TopPush, for optimizing accuracy at the top that has computational complexity linear in the number of training instances. We present a novel analysis that bounds the generalization error for the top ranked instances for the proposed approach. Empirical study shows that the proposed approach is highly competitive with the state-of-the-art approaches and is 10-100 times faster.", "full_text": "Top Rank Optimization in Linear Time\n\nNan Li1, Rong Jin2, Zhi-Hua Zhou1\n1National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China\n2Department of Computer Science and Engineering, Michigan State University, East Lansing, MI 48824\n{lin,zhouzh}@lamda.nju.edu.cn, rongjin@cse.msu.edu\n\nAbstract\n\nBipartite ranking aims to learn a real-valued ranking function that orders positive instances before negative instances. Recent efforts of bipartite ranking are focused on optimizing ranking accuracy at the top of the ranked list. 
Most existing approaches either optimize task-specific metrics or extend the rank loss by placing more emphasis on the error associated with the top ranked instances, leading to a high computational cost that is super-linear in the number of training instances. We propose a highly efficient approach, titled TopPush, for optimizing accuracy at the top that has computational complexity linear in the number of training instances. We present a novel analysis that bounds the generalization error for the top ranked instances for the proposed approach. Empirical study shows that the proposed approach is highly competitive with the state-of-the-art approaches and is 10-100 times faster.\n\n1 Introduction\n\nBipartite ranking aims to learn a real-valued ranking function that places positive instances above negative instances. It has attracted much attention because of its applications in several areas such as information retrieval and recommender systems [32, 25]. Many ranking methods have been developed for bipartite ranking, and most of them are essentially based on pairwise ranking. These algorithms reduce the ranking problem to a binary classification problem by treating each positive-negative instance pair as a single object to be classified [16, 12, 5, 39, 38, 33, 1, 3]. Since the number of instance pairs can grow quadratically in the number of training instances, one limitation of these methods is their high computational cost, which makes them not scalable to large datasets.\nConsidering that for applications such as document retrieval and recommender systems, only the top ranked instances will be examined by users, there has been a growing interest in learning ranking functions that perform especially well at the top of the ranked list [7, 39, 38, 33, 1, 3, 27, 40]. Most of these approaches can be categorized into two groups. 
The first group maximizes the ranking accuracy at the top of the ranked list by optimizing task-specific metrics [17, 21, 23, 40], such as average precision (AP) [42], NDCG [39] and partial AUC [27, 28]. The main limitation of these methods is that they often result in non-convex optimization problems that are difficult to solve efficiently. Structural SVM [37] addresses this issue by translating the non-convexity into an exponential number of constraints. It can still be computationally challenging because it usually requires searching for the most violated constraint at each iteration of optimization. In addition, these methods are statistically inconsistent [36, 21], leading to suboptimal solutions. The second group of methods is based on pairwise ranking. They design special convex loss functions that place more penalties on the ranking errors related to the top ranked instances [38, 33, 1]. Since these methods are based on pairwise ranking, their computational costs are usually proportional to the number of positive-negative instance pairs, making them unattractive for large datasets.\nIn this paper, we address the computational challenge of bipartite ranking by designing a ranking algorithm, named TopPush, that can efficiently optimize the ranking accuracy at the top. The key feature of the proposed TopPush algorithm is that its time complexity is only linear in the number of training instances. This is in contrast to most existing methods for bipartite ranking, whose computational costs depend on the number of instance pairs. Moreover, we develop novel analysis for bipartite ranking. One deficiency of the existing theoretical studies [33, 1] on bipartite ranking is that they try to bound the probability for a positive instance to be ranked before any negative instance, leading to relatively pessimistic bounds. 
We overcome this limitation by bounding the probability of ranking a positive instance before most negative instances, and show that TopPush is effective in placing positive instances at the top of a ranked list. Extensive empirical study shows that TopPush is computationally more efficient than most ranking algorithms, and yields comparable performance to the state-of-the-art approaches that maximize the ranking accuracy at the top.\nThe rest of this paper is organized as follows. Section 2 introduces the preliminaries of bipartite ranking, and addresses the difference between AUC optimization and maximizing accuracy at the top. Section 3 presents the proposed TopPush algorithm and its key theoretical properties. Section 4 summarizes the empirical study, and Section 5 concludes this work with future directions.\n\n2 Bipartite Ranking: AUC vs. Accuracy at the Top\n\nLet X = {x ∈ R^d : ||x|| ≤ 1} be the instance space. Let S = S+ ∪ S− be a set of training instances, where S+ = {x+_i ∈ X}_{i=1}^m and S− = {x−_i ∈ X}_{i=1}^n include m positive instances and n negative instances independently sampled from distributions P+ and P−, respectively. The goal of bipartite ranking is to learn a ranking function f : X → R that is likely to place a positive instance before most negative ones. In the literature, bipartite ranking has found applications in many domains [32, 25], and its theoretical properties have been examined by several studies [2, 6, 20, 26].\nAUC is a commonly used evaluation metric for bipartite ranking [15, 9]. By exploiting its equivalence to the Wilcoxon-Mann-Whitney statistic [15], many ranking algorithms have been developed to optimize AUC by minimizing the ranking loss defined as\n\nL_rank(f; S) = (1/mn) Σ_{i=1}^m Σ_{j=1}^n I(f(x+_i) ≤ f(x−_j)),   (1)\n\nwhere I(·) is the indicator function. 
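As a concrete illustration (not code from the paper), the ranking loss (1) compares every positive score against every negative score, which is exactly the O(mn) pairwise cost that motivates this work; a minimal NumPy sketch with toy scores:

```python
import numpy as np

def rank_loss(scores_pos, scores_neg):
    """Empirical ranking loss (1): fraction of positive-negative
    pairs (i, j) with f(x+_i) <= f(x-_j).  Cost is O(m * n)."""
    m, n = len(scores_pos), len(scores_neg)
    # Broadcasting forms the full m x n comparison matrix,
    # mirroring the explicit double sum in (1).
    errors = scores_pos[:, None] <= scores_neg[None, :]
    return errors.sum() / (m * n)

s_pos = np.array([2.0, 0.5, 1.5])   # f(x+_i), m = 3 (illustrative values)
s_neg = np.array([1.0, -0.3])       # f(x-_j), n = 2
print(rank_loss(s_pos, s_neg))      # 1 mis-ordered pair out of 6
```

The broadcast comparison makes the quadratic blow-up explicit: doubling both m and n quadruples the work, which is the scaling the paper sets out to avoid.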
Other than a few special loss functions (e.g., exponential and logistic loss) [33, 20], most of these methods need to enumerate all the positive-negative instance pairs, making them unattractive for large datasets. Various methods have been developed to address this computational challenge [43, 13].\nRecently, there has been a growing interest in optimizing ranking accuracy at the top [7, 3]. Maximizing AUC is not suitable for this goal, as indicated by the analysis in [7]. To address this challenge, we propose to maximize the number of positive instances that are ranked before the first negative instance, which is known as positives at the top [33, 1, 3]. We can translate this objective into the minimization of the following loss\n\nL(f; S) = (1/m) Σ_{i=1}^m I(f(x+_i) ≤ max_{1≤j≤n} f(x−_j)),   (2)\n\nwhich computes the fraction of positive instances ranked below the top-ranked negative instance. By minimizing the loss in (2), we essentially push negative instances away from the top of the ranked list, leading to more positive ones placed at the top. We note that (2) is fundamentally different from AUC optimization, as AUC does not focus on the ranking accuracy at the top. More discussion about the relationship between (1) and (2) can be found in the longer version of the paper [22].\nTo design practical learning algorithms, we replace the indicator function in (2) with its convex surrogate, leading to the following loss function\n\nL_ℓ(f; S) = (1/m) Σ_{i=1}^m ℓ(max_{1≤j≤n} f(x−_j) − f(x+_i)),   (3)\n\nwhere ℓ(·) is a convex loss function that is non-decreasing1 and differentiable. Examples of such loss functions include the truncated quadratic loss ℓ(z) = [1 + z]_+^2, the exponential loss ℓ(z) = e^z, and the logistic loss ℓ(z) = log(1 + e^z). In the discussion below, we restrict ourselves to the truncated quadratic loss, though most of our analysis applies to others.\nIt is easy to verify that the loss L_ℓ(f; S) in (3) is equivalent to the loss used in InfinitePush [1] (a special case of P-norm Push [33]), i.e.,\n\nL_ℓ∞(f; S) = max_{1≤j≤n} (1/m) Σ_{i=1}^m ℓ(f(x−_j) − f(x+_i)).   (4)\n\nThe apparent advantage of employing L_ℓ(f; S) instead of L_ℓ∞(f; S) is that it only needs to be evaluated on m positive-negative instance pairs, whereas the latter needs to enumerate all the mn instance pairs. As a result, the number of dual variables induced by L_ℓ(f; S) is n + m, linear in the number of training instances, which is significantly smaller than mn, the number of dual variables induced by L_ℓ∞(f; S) [1, 31]. It is this difference that makes the proposed algorithm achieve a computational complexity linear in the number of training instances and therefore be more efficient than most state-of-the-art algorithms for bipartite ranking.\n\n1In this paper, we let ℓ(z) be non-decreasing for the simplicity of formulating the dual problem.\n\n3 TopPush for Optimizing Top Accuracy\n\nWe first present a learning algorithm to minimize the loss function in (3), and then the computational complexity and performance guarantee for the proposed algorithm.\n\n3.1 Dual Formulation\n\nWe consider a linear ranking function2, i.e., f(x) = w^T x, where w ∈ R^d is the weight vector to be learned. 
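With a linear scorer f(x) = w^T x, the loss (3) needs only the single largest negative score, so it is evaluated in O(m + n) score computations rather than O(mn) pair enumerations; a minimal sketch using the truncated quadratic surrogate (the toy data is illustrative, not from the paper):

```python
import numpy as np

def truncated_quadratic(z):
    # l(z) = [1 + z]_+^2, the surrogate the paper focuses on
    return np.maximum(0.0, 1.0 + z) ** 2

def top_push_loss(w, X_pos, X_neg):
    """Loss (3): (1/m) sum_i l(max_j w'x-_j - w'x+_i).
    One max over the n negatives, then m surrogate evaluations."""
    neg_scores = X_neg @ w                     # f(x-_j), O(n d)
    margins = neg_scores.max() - X_pos @ w     # max_j f(x-_j) - f(x+_i)
    return truncated_quadratic(margins).mean()

w = np.array([1.0, -0.5])
X_pos = np.array([[2.0, 0.0], [0.5, 1.0]])    # m = 2 positives
X_neg = np.array([[1.0, 1.0], [0.0, 0.0]])    # n = 2 negatives
print(top_push_loss(w, X_pos, X_neg))
```

Contrast with the sketch of (1) above: only the top-ranked negative enters the loss, which is what makes the dual formulation below so compact.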
As a result, the learning problem is given by the following optimization problem\n\nmin_w (λ/2)||w||^2 + (1/m) Σ_{i=1}^m ℓ(max_{1≤j≤n} w^T x−_j − w^T x+_i),   (5)\n\nwhere λ > 0 is a regularization parameter. Directly minimizing the objective in (5) can be challenging because of the max operator in the loss function. We address this challenge by developing a dual formulation for (5). Specifically, given a convex and differentiable function ℓ(z), we can rewrite it in its convex conjugate form as ℓ(z) = max_{α∈Ω} αz − ℓ*(α), where ℓ*(α) is the convex conjugate of ℓ(z) and Ω is the domain of the dual variable [4]. For example, the convex conjugate of the truncated quadratic loss is ℓ*(α) = −α + α^2/4 with Ω = R+. We note that the dual form has been widely used to improve computational efficiency [35] and to connect different styles of learning algorithms [19]. Here we exploit it to overcome the difficulty caused by the max operator. The dual form of (5) is given in the following theorem, whose detailed proof can be found in the longer version [22].\nTheorem 1. Define X+ = (x+_1, ..., x+_m)^T and X− = (x−_1, ..., x−_n)^T; the dual problem of (5) is\n\nmin_{(α,β)∈Ξ} g(α, β) = (1/(2λm)) ||α^T X+ − β^T X−||^2 + Σ_{i=1}^m ℓ*(α_i),   (6)\n\nwhere α and β are dual variables, and the domain Ξ is defined as\n\nΞ = {α ∈ R^m_+, β ∈ R^n_+ : 1_m^T α = 1_n^T β}.\n\nLet α* and β* be the optimal solution to the dual problem (6). Then, the optimal solution w* to the primal problem in (5) is given by\n\nw* = (1/(λm)) (α*^T X+ − β*^T X−).   (7)\n\nRemark The key feature of the dual problem in (6) is that the number of dual variables is m + n, leading to a linear time ranking algorithm. This is in contrast to the InfinitePush algorithm in [1], which introduces mn dual variables and incurs a higher computational cost. In addition, the objective function in (6) is smooth if the convex conjugate ℓ*(·) is smooth, which is true for many common loss functions (e.g., truncated quadratic loss and logistic loss). It is well known in the optimization literature [4] that an O(1/T^2) convergence rate can be achieved if the objective function is smooth, where T is the number of iterations; this also helps in designing efficient learning algorithms.\n\n2A nonlinear function can be trained by kernel methods, and the Nyström method and random Fourier features can transform the kernelized problem into a linear one. See [41] for more discussions.\n\n3.2 Linear Time Bipartite Ranking\n\nAccording to Theorem 1, to learn a ranking function f(w), it is sufficient to learn the dual variables α and β by solving the problem in (6). For this purpose, we adopt the accelerated gradient method due to its light computation per iteration, and refer to the obtained algorithm as TopPush. Specifically, we choose Nesterov's method [30, 29], which achieves an optimal convergence rate O(1/T^2) for smooth objective functions. One of the key features of Nesterov's method is that it maintains two sequences of solutions: {(α_k, β_k)} and {(s^α_k, s^β_k)}, where the sequence of auxiliary solutions {(s^α_k, s^β_k)} is introduced to exploit the smoothness of the objective to achieve a faster convergence rate. 
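To make Theorem 1 concrete, the dual objective (6), the conjugate ℓ*(α) = −α + α²/4 of the truncated quadratic loss, and the primal recovery (7) can be written down directly; a hedged sketch under the 1/(2λm) scaling printed in (6) (illustrative, not the authors' code):

```python
import numpy as np

def conj_truncated_quadratic(alpha):
    # Convex conjugate of l(z) = [1 + z]_+^2 on its domain alpha >= 0:
    # l*(alpha) = -alpha + alpha^2 / 4
    return -alpha + alpha ** 2 / 4.0

def dual_objective(alpha, beta, X_pos, X_neg, lam):
    """Dual objective g(alpha, beta) in (6); alpha in R^m_+, beta in R^n_+."""
    m = X_pos.shape[0]
    nu = alpha @ X_pos - beta @ X_neg          # nu = alpha'X+ - beta'X-
    return nu @ nu / (2 * lam * m) + conj_truncated_quadratic(alpha).sum()

def primal_from_dual(alpha, beta, X_pos, X_neg, lam):
    # Recover w* via (7): w* = (alpha'X+ - beta'X-) / (lam * m)
    m = X_pos.shape[0]
    return (alpha @ X_pos - beta @ X_neg) / (lam * m)

Xp = np.array([[1.0, 0.0]])                    # one positive, d = 2
Xn = np.array([[0.0, 1.0]])                    # one negative
alpha, beta = np.array([0.5]), np.array([0.5])  # feasible: 1'alpha = 1'beta
print(dual_objective(alpha, beta, Xp, Xn, lam=1.0))   # -> -0.1875
print(primal_from_dual(alpha, beta, Xp, Xn, lam=1.0))
```

Note that only m + n dual coordinates appear, matching the Remark after Theorem 1; an InfinitePush-style dual would need an m × n array here.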
Algorithm 1 shows the key steps3 of Nesterov's method for solving the problem in (6), where the gradients of the objective function g(α, β) can be efficiently computed as\n\n∇_α g(α, β) = X+ ν^T/(λm) + ℓ*'(α),   ∇_β g(α, β) = −X− ν^T/(λm),   (8)\n\nwhere ν = α^T X+ − β^T X− and ℓ*'(·) is the derivative of ℓ*(·).\n\nAlgorithm 1 The TopPush Algorithm\nInput: X+ ∈ R^{m×d}, X− ∈ R^{n×d}, λ, ε\nOutput: w\n1: initialize α_1 = α_0 = 0_m, β_1 = β_0 = 0_n, and let t_{−1} = 0, t_0 = 1, L_0 = 1\n2: repeat for k = 1, 2, ...\n3: compute s^α_k = α_k + ω_k(α_k − α_{k−1}) and s^β_k = β_k + ω_k(β_k − β_{k−1}), where ω_k = (t_{k−2} − 1)/t_{k−1}\n4: compute g_α = ∇_α g(s^α_k, s^β_k) and g_β = ∇_β g(s^α_k, s^β_k) based on (8)\n5: find L_k > L_{k−1} such that g(α_{k+1}, β_{k+1}) ≤ g(s^α_k, s^β_k) − (||g_α||^2 + ||g_β||^2)/(2L_k), where [α_{k+1}; β_{k+1}] = π_Ξ([α'_{k+1}; β'_{k+1}]) with α'_{k+1} = s^α_k − (1/L_k) g_α and β'_{k+1} = s^β_k − (1/L_k) g_β\n6: update t_k = (1 + sqrt(1 + 4 t_{k−1}^2))/2\n7: until convergence (i.e., |g(α_{k+1}, β_{k+1}) − g(α_k, β_k)| < ε)\n8: return w = (1/(λm)) (α_k^T X+ − β_k^T X−)\n\nIt should be noted that (6) is a constrained problem, and therefore, at each step of gradient mapping, we have to project the dual solution onto the domain Ξ (i.e., [α_{k+1}; β_{k+1}] = π_Ξ([α'_{k+1}; β'_{k+1}]) in step 5) to keep it feasible. Below, we discuss how to solve this projection step efficiently.\nProjection Step For clear notation, we expand the projection step into the problem\n\nmin_{α≥0, β≥0} (1/2)||α − α_0||^2 + (1/2)||β − β_0||^2   s.t. 1_m^T α = 1_n^T β,   (9)\n\nwhere α_0 and β_0 are the solutions obtained in the last iteration. We note that similar projection problems have been studied in [34, 24], where they either have O((m + n) log(m + n)) time complexity [34] or only provide approximate solutions [24]. By using a proof technique similar to that for Theorem 2 in [24], we can prove the following proposition:\nProposition 1. The optimal solution to the projection problem in (9) is given by\n\nα* = [α_0 − γ*]_+ and β* = [β_0 + γ*]_+,\n\nwhere γ* is the root of the function ρ(γ) = Σ_{i=1}^m [α_{0,i} − γ]_+ − Σ_{j=1}^n [β_{0,j} + γ]_+.\n\nBased on Proposition 1, we provide a method that finds the exact solution to (9) in O(m + n) time. The key to solving this problem is to find the root of ρ(γ). Instead of approximating the solution via bisection as in [24], we develop a divide-and-conquer method to find the exact solution γ* in O(m + n) time, where a similar approach has been used in [10]. 
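To illustrate Proposition 1, here is a simple root-finding sketch that scans the sorted kinks of the piecewise-linear, non-increasing ρ(γ); this costs O((m + n) log(m + n)) because of the sort, whereas the divide-and-conquer scheme described next attains O(m + n). It is an illustrative re-implementation, not the authors' procedure:

```python
import numpy as np

def project_onto_xi(alpha0, beta0):
    """Exact projection (9) onto Xi = {a >= 0, b >= 0, 1'a = 1'b},
    via Proposition 1: a* = [alpha0 - g*]_+, b* = [beta0 + g*]_+,
    where g* is the root of rho.  Breakpoint scan, O((m+n) log(m+n))."""
    def rho(g):
        return (np.maximum(alpha0 - g, 0).sum()
                - np.maximum(beta0 + g, 0).sum())

    # rho is piecewise linear and non-increasing; its kinks sit at
    # g = alpha0_i and g = -beta0_j, and the root lies between the
    # outermost kinks (rho >= 0 at the leftmost, <= 0 at the rightmost).
    kinks = np.sort(np.concatenate([alpha0, -beta0]))
    vals = np.array([rho(g) for g in kinks])          # non-increasing
    i = np.searchsorted(-vals, 0, side='right') - 1   # last kink with rho >= 0
    a = kinks[i]
    b = kinks[min(i + 1, len(kinks) - 1)]
    ra, rb = rho(a), rho(b)
    # rho is linear on [a, b]; solve for its zero (flat segments => rho = 0).
    g_star = a if rb == ra else a - ra * (b - a) / (rb - ra)
    return np.maximum(alpha0 - g_star, 0), np.maximum(beta0 + g_star, 0)

a_star, b_star = project_onto_xi(np.array([3.0, 1.0]), np.array([0.5]))
print(a_star, b_star)   # feasible: non-negative, equal sums
```

For `alpha0 = [3, 1]`, `beta0 = [0.5]`, the root is γ* = 1.25, giving α* = [1.75, 0] and β* = [1.75], which indeed satisfy 1'α* = 1'β*.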
The basic idea is to first identify the smallest interval that contains the root, based on a modification of the randomized median finding algorithm [8], and then solve for the root exactly within that interval. The detailed projection procedure can be found in the longer version [22].\n\n3The step size of Nesterov's method depends on the smoothness of the objective function. In the current work we adopt Nemirovski's line search scheme [29] to compute the smoothness parameter, and the detailed algorithm can be found in [22].\n\nTable 1: Comparison of computational complexities for ranking algorithms, where d is the number of dimensions, ε is the precision parameter, and m and n are the number of positive and negative instances, respectively.\n\nAlgorithm | Computational Complexity\nSVMRank [18] | O(((m + n)d + (m + n) log(m + n))/ε)\nSVMMAP [42] | O(((m + n)d + (m + n) log(m + n))/ε)\nOWPC [38] | O(((m + n)d + (m + n) log(m + n))/ε)\nSVMpAUC [27, 28] | O((n log n + m log m + (m + n)d)/ε)\nInfinitePush [1] | O((mnd + mn log(mn))/ε^2)\nL1SVIP [31] | O((mnd + mn log(mn))/ε)\nTopPush (this paper) | O((m + n)d/√ε)\n\n3.3 Convergence and Computational Complexity\n\nThe theorem below states the convergence of the TopPush algorithm, which follows immediately from the convergence result for Nesterov's method [29].\nTheorem 2. 
Let α_T and β_T be the solution output by TopPush after T iterations; we have\n\ng(α_T, β_T) ≤ min_{(α,β)∈Ξ} g(α, β) + ε\n\nprovided T ≥ O(1/√ε).\nFinally, since the computational cost of each iteration is dominated by the gradient evaluation and the projection step, the time complexity of each iteration is O((m + n)d): the complexity of the projection step is O(m + n) and the cost of computing the gradient is O((m + n)d). Combining this result with Theorem 2, we have that, to find an ε-suboptimal solution, the total computational complexity of the TopPush algorithm is O((m + n)d/√ε), which is linear in the number of training instances.\nTable 1 compares the computational complexity of TopPush with that of the state-of-the-art algorithms. It is easy to see that TopPush is asymptotically more efficient than the state-of-the-art ranking algorithms4. For instance, it is much more efficient than InfinitePush and its sparse extension L1SVIP, whose complexity depends on the number of positive-negative instance pairs; compared with SVMRank, SVMMAP and SVMpAUC, which handle specific performance metrics via structural SVM, the linear dependence on the number of training instances makes our TopPush approach more appealing, especially for large datasets.\n\n3.4 Theoretical Guarantee\n\nWe develop a theoretical guarantee for the ranking performance of TopPush. In [33, 1], the authors have developed margin-based generalization bounds for the loss function L_ℓ∞. One limitation of the analysis in [33, 1] is that they try to bound the probability for a positive instance to be ranked before any negative instance, leading to relatively pessimistic bounds5. 
Our analysis avoids this pitfall by considering the probability of ranking a positive instance before most negative instances. To this end, we first define h_b(x, w), the probability for any negative instance to be ranked above x using the ranking function f(x) = w^T x, as\n\nh_b(x, w) = E_{x−∼P−}[I(w^T x ≤ w^T x−)].\n\nSince we are interested in whether positive instances are ranked above most negative instances, we measure the quality of f(x) = w^T x by the probability for any positive instance to be ranked below δ percent of the negative instances, i.e.,\n\nP_b(w, δ) = Pr_{x+∼P+}(h_b(x+, w) ≥ δ).\n\nClearly, if a ranking function achieves a high ranking accuracy at the top, it should have a large percentage of positive instances with ranking scores higher than most of the negative instances, leading to a small value of P_b(w, δ) for small δ. The following theorem bounds P_b(w, δ) for TopPush, and the detailed proof can be found in the longer version [22].\n\n4In Table 1, we report the complexity of SVMpAUC_tight in [28], which is more efficient than SVMpAUC in [27]. In addition, SVMpAUC_tight is used in the experiments, and we do not distinguish between them in this paper.\n5For instance, for the bounds in [33], the failure probability can be as large as 1 if the parameter p is large.\n\nTheorem 3. Given training data S consisting of m independent samples from P+ and n independent samples from P−, let w* be the optimal solution to the problem in (5). Assume m ≥ 12 and n ≫ t; then, with probability at least 1 − 2e^{−t},\n\nP_b(w*, δ) ≤ L_ℓ(w*, S) + O(√((t + log m)/m)),\n\nwhere δ = O(√(log m / n)) and L_ℓ(w*, S) = (1/m) Σ_{i=1}^m ℓ(max_{1≤j≤n} w*^T x−_j − w*^T x+_i).\n\nRemark Theorem 3 implies that if the empirical loss L_ℓ(w*, S) ≤ O(log m/m), then for most positive instances x+ (i.e., a fraction 1 − O(log m/m)), the percentage of negative instances ranked above x+ is upper bounded by O(√(log m/n)). We observe that m and n play different roles in the bound: because the empirical loss compares the positive instances to the negative instance with the largest score, it usually grows significantly more slowly with increasing n. For instance, the largest absolute value of Gaussian random samples grows as √(log n). Thus, we believe that the main effect of increasing n in our bound is to reduce δ (which decreases at the rate 1/√n), especially when n is large. Meanwhile, by increasing the number of positive instances m, we reduce the bound on P_b(w, δ), and consequently increase the chance of finding positive instances at the top.\n\n4 Experiments\n\n4.1 Settings\n\nTo evaluate the performance of the TopPush algorithm, we conduct a set of experiments on real-world datasets. Table 2 (left column) summarizes the datasets used in our experiments. Some of them were used in previous studies [1, 31, 3], and others are larger datasets from different domains. We compare TopPush with state-of-the-art algorithms that focus on accuracy at the top, including SVMMAP [42], SVMpAUC [28] with α = 0 and β = 1/n, AATP [3] and InfinitePush [1]. In addition, for completeness, several state-of-the-art classification and ranking models are included in the comparison: logistic regression (LR) for binary classification, cost-sensitive SVM (cs-SVM), which addresses imbalanced class distributions by introducing a different misclassification cost for each class, and SVMRank [18] for AUC optimization. We implement TopPush and InfinitePush using 
We implement TopPush and In\ufb01nitePush using\nMATLAB, implement AATP using CVX [14], and use LIBLINEAR [11] for LR and cs-SVM, and\nuse the codes shared by the authors of the original works.\nWe measure the accuracy at\nthe top\n(Pos@Top) [1, 31, 3], which is de\ufb01ned as the fraction of positive instances ranked above the top-\nranked negative, (ii) average precision (AP) and (iii) normalized DCG scores (NDCG). On each\ndataset, experiments are run for thirty trials. In each trial, the dataset is randomly divided into two\nsubsets: 2/3 for training and 1/3 for test. For all algorithms, we set the precision parameter \u0001 to\n10\u22124, choose other parameters by 5-fold cross validation (based on the average value of Pos@Top)\non training set, and perform the evaluation on test set. Finally, averaged results over thirty trails are\nreported. All experiments are run on a machine with two Intel Xeon E7 CPUs and 16GB memory.\n\nthe top by commonly used metrics6:\n\n(i) positives at\n\n4.2 Results\n\nIn table 2, we report the performance of the algorithms in comparison, where the statistics of testbeds\nare included in the \ufb01rst column of the table. For better comparison between the performance of\nTopPush and baselines, pairwise t-tests at signi\ufb01cance level of 0.9 are performed and results are\nmarks \u201c\u2022 / \u25e6\u201d in table 2 when TopPush is statistically signi\ufb01cantly better/worse.\nWhen an evaluation task can not be completed in two weeks, it will be stopped automatically, and no\nresult will be reported. As a consequence, we observe that results for some algorithms are missing\nin Table 2 for certain datasets, especially for large ones. We can see from Table 2 that TopPush,\nLR and cs-SVM succeed to \ufb01nish the evaluation on all datasets (even the largest datasets url). In\ncontrast, SVMRank, SVMRank and SVMpAUC fail to complete the training in time for several large\ndatasets. 
InfinitePush and AATP have the worst scalability: they are only able to finish on the smallest dataset, diabetes. We thus conclude that, overall, TopPush scales well to large datasets.\n\n6It is worth mentioning that we also measure the ranking performance by AUC, and the results can be found in [22]. In addition, more details of the experimental setting can be found there.\n\nTable 2: Data statistics (left column) and experimental results. For each dataset, the number of positive and negative instances is listed as m/n, together with the dimensionality d. For the training time comparison, “▲” (“★”) is marked if TopPush is at least 10 (100) times faster than the compared algorithm. For the performance (mean±std) comparison, “•” (“◦”) is marked if TopPush performs significantly better (worse) than the baseline based on a pairwise t-test at the 0.9 significance level. On each dataset, if the evaluation of an algorithm cannot be completed in two weeks, it is stopped and its results are missing from the table.\n\nData (m/n, d) | Algorithm | Time (s) | Pos@Top | AP | NDCG\ndiabetes (500/268, d=34) | TopPush | 5.11×10^-3 | .123±.056 | .872±.023 | .976±.005\n | LR | 2.30×10^-2 | .064±.075• | .881±.022 | .973±.008\n | cs-SVM | 7.70×10^-2 | .077±.088• | .758±.166• | .920±.078•\n | SVMRank | 6.11×10^-2 | .087±.082• | .879±.022 | .975±.006\n | SVMMAP | 4.71×10^0 | .077±.072• | .879±.012 | .969±.009\n | SVMpAUC | 2.09×10^-1 ▲ | .053±.096• | .668±.123• | .884±.065•\n | InfinitePush | 2.63×10^1 ★ | .119±.051 | .877±.035 | .978±.007\n | AATP | 2.72×10^3 ★ | .127±.061 | .881±.035 | .979±.010\nnews20-forsale (999/18,929, d=62,061) | TopPush | 2.16×10^0 | .191±.088 | .843±.018 | .970±.005\n | LR | 4.14×10^0 | .086±.067• | .803±.020• | .962±.005\n | cs-SVM | 1.89×10^0 | .114±.069• | .766±.021• | .955±.006•\n | SVMRank | 2.96×10^2 ★ | .149±.056• | .850±.016 | .972±.003\n | SVMMAP | 8.42×10^2 ★ | .184±.092 | .832±.022 | .969±.007\n | SVMpAUC | 3.25×10^2 ★ | .196±.087 | .812±.019• | .963±.005•\nnslkdd (71,463/77,054, d=121) | TopPush | 7.64×10^1 | .633±.088 | .978±.001 | .997±.001\n | LR | 3.63×10^1 | .220±.053• | .981±.002 | .998±.001\n | cs-SVM | 1.86×10^0 | .556±.037• | .980±.001 | .998±.001\n | SVMpAUC | 1.72×10^2 | .634±.059 | .956±.002• | .996±.001\nreal-sim (22,238/50,071, d=20,958) | TopPush | 1.34×10^1 | .186±.049 | .986±.001 | .998±.001\n | LR | 7.67×10^0 | .100±.043• | .989±.001 | .999±.001\n | cs-SVM | 4.84×10^0 | .146±.031• | .979±.001 | .998±.001\n | SVMRank | 1.83×10^3 ★ | .090±.045• | .986±.000 | .999±.001\nspambase (1,813/2,788, d=57) | TopPush | 1.51×10^-1 | .129±.077 | .922±.006 | .988±.001\n | LR | 3.11×10^-2 | .071±.053• | .920±.010 | .987±.003\n | cs-SVM | 8.31×10^-2 | .069±.059• | .907±.010• | .980±.004•\n | SVMRank | 2.31×10^1 ▲ | .069±.076• | .931±.010 | .990±.003\n | SVMMAP | 1.92×10^2 ★ | .097±.069• | .935±.014 | .984±.005\n | SVMpAUC | 1.73×10^0 ▲ | .073±.058• | .854±.024• | .975±.007•\n | InfinitePush | 1.78×10^3 ★ | .132±.087 | .920±.005 | .987±.002\nurl (792,145/1,603,985, d=3,231,961) | TopPush | 5.11×10^3 | .474±.046 | .986±.001 | .999±.001\n | LR | 8.98×10^3 | .362±.113• | .993±.001◦ | .999±.001\n | cs-SVM | 3.78×10^3 | .432±.069• | .991±.002 | .998±.001\nw8a (1,933/62,767, d=300) | TopPush | 7.35×10^0 | .226±.053 | .710±.019 | .938±.005\n | LR | 2.46×10^0 | .107±.093• | .450±.374• | .775±.221•\n | cs-SVM | 3.87×10^0 | .118±.105• | .447±.372• | .774±.220•\n | SVMpAUC | 2.59×10^3 ★ | .207±.046 | .673±.021• | .929±.006•\n\nPerformance Comparison In terms of the evaluation metric Pos@Top, we find that TopPush yields similar performance to InfinitePush and AATP, and performs significantly better than the other baselines, including LR, cs-SVM, SVMRank, SVMMAP and SVMpAUC. This is consistent with the design of TopPush, which aims to maximize the accuracy at the top of the ranked list. Since the loss functions optimized by InfinitePush and AATP are similar to that of TopPush, it is not surprising that they yield similar performance. The key advantage of the proposed algorithm over InfinitePush and AATP is that it is computationally more efficient and scales well to large datasets. In terms of AP and NDCG, we observe that TopPush yields similar, if not better, performance compared to the state-of-the-art methods, such as SVMMAP and SVMpAUC, that are designed to optimize these metrics. 
Overall, we conclude that the proposed algorithm is effective in optimizing the ranking accuracy for the top ranked instances.

Training Efficiency To evaluate computational efficiency, we set the parameters of the different algorithms to the values selected by cross-validation, and run the algorithms on the full datasets, which include both the training and testing sets. Table 2 summarizes the training time of the different algorithms. From the results, we can see that TopPush is faster than state-of-the-art ranking methods on most datasets. In fact, the training time of TopPush is similar to that of LR and cs-SVM implemented in LIBLINEAR. Since the time complexity of learning a binary classification model is usually linear in the number of training instances, this result implicitly suggests a linear time complexity for the proposed algorithm.

Scalability We study how TopPush scales with the number of training examples using the largest dataset, url. Figure 1 shows a log-log plot of the training time of TopPush versus the size of the training data, where different lines correspond to different values of λ. For comparison, we also include a black dash-dot line that fits the training time with a function linear in the number of training instances (i.e., Θ(m + n)). From the plot, we can see that for all values of the regularization parameter λ, the training time of TopPush grows even more slowly than the number of training instances. This is consistent with our theoretical analysis in Section 3.3.

Figure 1: Training time of TopPush versus training data size for different values of λ.

5 Conclusion

In this paper, we focus on bipartite ranking algorithms that optimize accuracy at the top of the ranked list.
To this end, we consider maximizing the number of positive instances that are ranked above any negative instance, and develop an efficient algorithm, named TopPush, to solve the related optimization problem. In contrast to most existing algorithms for bipartite ranking, whose time complexities depend on the number of positive-negative instance pairs, the proposed TopPush algorithm scales linearly in the number of training instances. Moreover, our theoretical analysis shows that it leads to a ranking function that places many positive instances at the top of the ranked list. Empirical studies verify the theoretical claims: the TopPush algorithm is effective in maximizing the accuracy at the top and is significantly more efficient than state-of-the-art algorithms for bipartite ranking. In the future, we plan to develop appropriate univariate losses, instead of pairwise ranking losses, for efficient bipartite ranking that maximizes accuracy at the top.

Acknowledgement This research was supported by the 973 Program (2014CB340501), NSFC (61333014), NSF (IIS-1251031), and ONR Award (N000141210431).

References
[1] S. Agarwal. The infinite push: A new support vector ranking algorithm that directly optimizes accuracy at the absolute top of the list. In SDM, pages 839-850, 2011.
[2] S. Agarwal, T. Graepel, R. Herbrich, S. Har-Peled, and D. Roth. Generalization bounds for the area under the ROC curve. JMLR, 6:393-425, 2005.
[3] S. Boyd, C. Cortes, M. Mohri, and A. Radovanovic. Accuracy at the top. In NIPS, pages 962-970, 2012.
[4] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[5] C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. Hullender. Learning to rank using gradient descent. In ICML, pages 89-96, 2005.
[6] S. Clémençon, G. Lugosi, and N. Vayatis.
Ranking and empirical minimization of U-statistics. Annals of Statistics, 36(2):844-874, 2008.
[7] S. Clémençon and N. Vayatis. Ranking the best instances. JMLR, 8:2671-2699, 2007.
[8] T. Cormen, C. Leiserson, R. Rivest, and C. Stein. Introduction to Algorithms. MIT Press, 2001.
[9] C. Cortes and M. Mohri. AUC optimization vs. error rate minimization. In NIPS, pages 313-320, 2004.
[10] J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra. Efficient projections onto the ℓ1-ball for learning in high dimensions. In ICML, pages 272-279, 2008.
[11] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. JMLR, 9:1871-1874, 2008.
[12] Y. Freund, R. Iyer, R. Schapire, and Y. Singer. An efficient boosting algorithm for combining preferences. JMLR, 4:933-969, 2003.
[13] W. Gao, R. Jin, S. Zhu, and Z.-H. Zhou. One-pass AUC optimization. In ICML, pages 906-914, 2013.
[14] M. Grant and S. Boyd. CVX: Matlab software for disciplined convex programming, version 2.1. http://cvxr.com/cvx, March 2014.
[15] J. Hanley and B. McNeil. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143:29-36, 1982.
[16] R. Herbrich, T. Graepel, and K. Obermayer. Large margin rank boundaries for ordinal regression. In Advances in Large Margin Classifiers, pages 115-132. MIT Press, 2000.
[17] T. Joachims. A support vector method for multivariate performance measures. In ICML, pages 377-384, 2005.
[18] T. Joachims. Training linear SVMs in linear time. In KDD, pages 217-226, 2006.
[19] T. Kanamori, A. Takeda, and T. Suzuki.
Conjugate relation between loss functions and uncertainty sets in classification problems. JMLR, 14:1461-1504, 2013.
[20] W. Kotlowski, K. Dembczynski, and E. Hüllermeier. Bipartite ranking through minimization of univariate loss. In ICML, pages 1113-1120, 2011.
[21] Q. V. Le and A. Smola. Direct optimization of ranking measures. CoRR, abs/0704.3359, 2007.
[22] N. Li, R. Jin, and Z.-H. Zhou. Top rank optimization in linear time. CoRR, abs/1410.1462, 2014.
[23] N. Li, I. W. Tsang, and Z.-H. Zhou. Efficient optimization of performance measures by classifier adaptation. IEEE-PAMI, 35(6):1370-1382, 2013.
[24] J. Liu and J. Ye. Efficient Euclidean projections in linear time. In ICML, pages 657-664, 2009.
[25] T.-Y. Liu. Learning to Rank for Information Retrieval. Springer, 2011.
[26] H. Narasimhan and S. Agarwal. On the relationship between binary classification, bipartite ranking, and binary class probability estimation. In NIPS, pages 2913-2921, 2013.
[27] H. Narasimhan and S. Agarwal. A structural SVM based approach for optimizing partial AUC. In ICML, pages 516-524, 2013.
[28] H. Narasimhan and S. Agarwal. SVMpAUC-tight: A new support vector method for optimizing partial AUC based on a tight convex upper bound. In KDD, pages 167-175, 2013.
[29] A. Nemirovski. Efficient Methods in Convex Programming. Lecture notes, 1994.
[30] Y. Nesterov. Introductory Lectures on Convex Optimization. Kluwer Academic Publishers, 2003.
[31] A. Rakotomamonjy. Sparse support vector infinite push. In ICML, 2012.
[32] S. Rendle, L. Balby Marinho, A. Nanopoulos, and L. Schmidt-Thieme. Learning optimal ranking with tensor factorization for tag recommendation. In KDD, pages 727-736, 2009.
[33] C. Rudin and R. Schapire. Margin-based ranking and an equivalence between AdaBoost and RankBoost. JMLR, 10:2193-2232, 2009.
[34] S. Shalev-Shwartz and Y. Singer.
Efficient learning of label ranking by soft projections onto polyhedra. JMLR, 7:1567-1599, 2006.
[35] S. Sun and J. Shawe-Taylor. Sparse semi-supervised learning using conjugate functions. JMLR, 11:2423-2455, 2010.
[36] A. Tewari and P. Bartlett. On the consistency of multiclass classification methods. JMLR, 8:1007-1025, 2007.
[37] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large margin methods for structured and interdependent output variables. JMLR, 6:1453-1484, 2005.
[38] N. Usunier, D. Buffoni, and P. Gallinari. Ranking with ordered weighted pairwise classification. In ICML, pages 1057-1064, 2009.
[39] H. Valizadegan, R. Jin, R. Zhang, and J. Mao. Learning to rank by optimizing NDCG measure. In NIPS, pages 1883-1891, 2009.
[40] M. Xu, Y.-F. Li, and Z.-H. Zhou. Multi-label learning with PRO loss. In AAAI, pages 998-1004, 2013.
[41] T. Yang, Y.-F. Li, M. Mahdavi, R. Jin, and Z.-H. Zhou. Nyström method vs random Fourier features: A theoretical and empirical comparison. In NIPS, pages 485-493, 2012.
[42] Y. Yue, T. Finley, F. Radlinski, and T. Joachims. A support vector method for optimizing average precision. In SIGIR, pages 271-278, 2007.
[43] P. Zhao, S. C. H. Hoi, R. Jin, and T. Yang. Online AUC maximization. In ICML, pages 233-240, 2011.