{"title": "Stochastic Online AUC Maximization", "book": "Advances in Neural Information Processing Systems", "page_first": 451, "page_last": 459, "abstract": "Area under the ROC curve (AUC) is a widely used metric for measuring classification performance on imbalanced data. It is of theoretical and practical interest to develop online learning algorithms that maximize AUC for large-scale data. A specific challenge in developing an online AUC maximization algorithm is that the learning objective is usually defined over a pair of training examples of opposite classes, so existing methods achieve online processing only at higher space and time complexity. In this work, we propose a new stochastic online algorithm for AUC maximization. In particular, we show that AUC optimization can be equivalently formulated as a convex-concave saddle point problem. From this saddle representation, a stochastic online algorithm (SOLAM) is proposed whose time and space complexity is that of one datum. We establish theoretical convergence of SOLAM with high probability and demonstrate its effectiveness and efficiency on standard benchmark datasets.", "full_text": "Stochastic Online AUC Maximization

Yiming Ying†, Longyin Wen‡, Siwei Lyu‡
†Department of Mathematics and Statistics, SUNY at Albany, Albany, NY, 12222, USA
‡Department of Computer Science, SUNY at Albany, Albany, NY, 12222, USA

Abstract

Area under the ROC curve (AUC) is a widely used metric for measuring classification performance on imbalanced data. It is of theoretical and practical interest to develop online learning algorithms that maximize AUC for large-scale data. A specific challenge in developing an online AUC maximization algorithm is that the learning objective is usually defined over a pair of training examples of opposite classes, so existing methods achieve online processing only at higher space and time complexity.
In this work, we propose a new stochastic online algorithm for AUC maximization. In particular, we show that AUC optimization can be equivalently formulated as a convex-concave saddle point problem. From this saddle representation, a stochastic online algorithm (SOLAM) is proposed whose time and space complexity is that of one datum. We establish theoretical convergence of SOLAM with high probability and demonstrate its effectiveness on standard benchmark datasets.

1 Introduction

Area Under the ROC Curve (AUC) [8] is a widely used metric for measuring classification performance. Unlike the misclassification error, which reflects a classifier's ability to classify a single randomly chosen example, AUC concerns the overall performance of a functional family of classifiers and quantifies their ability to correctly rank any positive instance above a randomly chosen negative instance. Most algorithms optimizing AUC for classification [5, 9, 12, 17] are batch learning methods, which assume all training data are available in advance.

On the other hand, online learning algorithms [1, 2, 3, 16, 19, 22] have proven very efficient for large-scale datasets. However, most studies of online learning focus on the misclassification error or its surrogate loss, for which the objective function is a sum of losses over individual examples. It is thus desirable to develop online learning algorithms that optimize the AUC metric. The main challenge for an online AUC algorithm is that the objective of AUC maximization is a sum of pairwise losses between instances from different classes, which is quadratic in the number of training examples.
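To make the pairwise structure concrete, here is a minimal sketch of our own (not code from the paper) of the empirical AUC: it explicitly enumerates every positive-negative pair, which is why naive optimization scales quadratically with the sample size.

```python
import numpy as np

def empirical_auc(scores, labels):
    """Fraction of positive-negative pairs ranked correctly (ties count 1/2)."""
    pos = scores[labels == 1]
    neg = scores[labels == -1]
    # Explicit enumeration of all n+ * n- pairs -- the quadratic cost
    # that makes naive online AUC optimization expensive.
    diff = pos[:, None] - neg[None, :]
    return (diff > 0).mean() + 0.5 * (diff == 0).mean()

scores = np.array([0.9, 0.8, 0.3, 0.1])
labels = np.array([1, 1, -1, -1])
print(empirical_auc(scores, labels))  # perfectly ranked -> 1.0
```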
As such, directly deploying existing online algorithms would require storing all training data received, which is infeasible for large-scale data analysis.

Several recent works [6, 11, 18, 20, 21] have studied a type of online AUC maximization method that updates the classifier upon the arrival of each new training example. However, this type of algorithm needs to access all previous examples at iteration t, giving O(td) space and per-iteration time complexity, where d is the dimension of the data. Per-iteration space and time complexity that grows with the number of iterations is undesirable for online applications that must run with fixed resources. This problem is partially alleviated by the use of buffers of a fixed size s in [11, 21], which reduces the per-iteration space and time complexity to O(sd).

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Although this change makes the per-iteration space and time complexity independent of the number of iterations, in practice the buffer needs to be set sufficiently large to reduce the variance in learning performance. The work of [6] proposes an alternative method that updates and stores the first-order (mean) and second-order (covariance) statistics of the training data, so the space and per-iteration complexity becomes O(d²). Although this eliminates the need to access all previous training examples, the per-iteration complexity is now quadratic in the data dimension, which makes this method inefficient for high-dimensional data. To this end, the authors of [6] further proposed approximating the covariance matrices with low-rank random Gaussian matrices.
However, the approximation method is not a general solution to the original problem, and its convergence was only established under the assumption that the effective numerical rank of the set of covariance matrices is small (i.e., they can be well approximated by low-rank matrices).

In this work, we present a new stochastic online AUC maximization (SOLAM) method for the ℓ₂ loss function. In contrast to existing online AUC maximization methods, e.g. [6, 21], SOLAM does not need to store previously received training examples or covariance matrices, while at the same time enjoying a convergence rate comparable, up to a logarithmic term, to those in [6, 21]. To the best of our knowledge, this is the first online learning algorithm for AUC optimization with space and per-iteration time complexity of O(d), the same as the online gradient descent algorithm [1, 2, 16, 22] for classification. The key step of SOLAM is to reformulate the original problem as a stochastic saddle point problem [14]. This connection is the foundation of the SOLAM algorithm and its convergence analysis. When evaluated on several standard benchmark datasets, SOLAM achieves performance on par with state-of-the-art online AUC optimization methods, with significant improvement in running time.

The main contributions of our work can be summarized as follows:

• We provide a new formulation of the AUC optimization problem as a stochastic Saddle Point Problem (SPP). This formulation facilitates the development of online algorithms for AUC optimization.
• Our algorithm SOLAM achieves per-iteration space and time complexity that is linear in the data dimensionality.
• Our theoretical analysis provides a guarantee of convergence, with high probability, of the proposed algorithm.

2 Method

Let the input space be X ⊆ ℝ^d and the output space Y = {−1, +1}. We assume the training data z = {(x_i, y_i), i = 1, …
, n} is an i.i.d. sample drawn from an unknown distribution ρ on Z = X × Y. The ROC curve is the plot of the true positive rate versus the false positive rate. The area under the ROC curve (AUC) of a scoring function f : X → ℝ equals the probability that a positive sample ranks higher than a negative sample (e.g. [4, 8]). It is defined as

$$\mathrm{AUC}(f) = \Pr\big(f(x) \ge f(x') \,\big|\, y = +1, y' = -1\big), \qquad (1)$$

where (x, y) and (x', y') are drawn independently from ρ. The goal of AUC maximization is to find the optimal decision function f:

$$\arg\max_f \mathrm{AUC}(f) = \arg\min_f \Pr\big(f(x) < f(x') \,\big|\, y = 1, y' = -1\big) = \arg\min_f \mathbb{E}\big[\mathbb{I}_{[f(x') - f(x) > 0]} \,\big|\, y = 1, y' = -1\big], \qquad (2)$$

where I(·) is the indicator function that takes value 1 if its argument is true and 0 otherwise. Let p = Pr(y = 1). For any random variable ξ(z), recall that its conditional expectation is defined by $\mathbb{E}[\xi(z)\,|\,y=1] = \frac{1}{p}\iint \xi(z)\,\mathbb{I}_{[y=1]}\,d\rho(z)$. Since I(·) is not continuous, it is often replaced by a convex surrogate. Two common choices are the ℓ₂ loss $(1 - (f(x) - f(x')))^2$ and the hinge loss $(1 - (f(x) - f(x')))_+$. In this work, we use the ℓ₂ loss, as it has been shown to be statistically consistent with AUC while the hinge loss is not [6, 7]. We also restrict our interest to the family of linear functions, i.e., f(x) = w⊤x. In summary, AUC maximization can be formulated as

$$\arg\min_{\|w\| \le R} \mathbb{E}\big[(1 - w^\top(x - x'))^2 \,\big|\, y = 1, y' = -1\big] = \arg\min_{\|w\| \le R} \frac{1}{p(1-p)} \iint_{Z \times Z} (1 - w^\top(x - x'))^2\, \mathbb{I}_{[y=1]}\,\mathbb{I}_{[y'=-1]}\, d\rho(z)\, d\rho(z'). \qquad (3)$$

When ρ is the uniform distribution over the training data z, we obtain the empirical risk minimization (ERM) problem for AUC optimization studied in [6, 21]¹:

$$\arg\min_{\|w\| \le R} \frac{1}{n_+ n_-} \sum_{i=1}^n \sum_{j=1}^n (1 - w^\top(x_i - x_j))^2\, \mathbb{I}_{[y_i = 1 \wedge y_j = -1]}, \qquad (4)$$

where n₊ and n₋ denote the numbers of instances in the positive and negative classes, respectively.

2.1 Equivalent Representation as a (Stochastic) Saddle Point Problem (SPP)

The main result of this work is the equivalence of problem (3) to a stochastic Saddle Point Problem (SPP) (e.g., [14]). A stochastic SPP is generally of the form

$$\min_{u \in \Omega_1} \max_{\alpha \in \Omega_2} \big\{ f(u, \alpha) := \mathbb{E}[F(u, \alpha, \xi)] \big\}, \qquad (5)$$

where Ω₁ ⊆ ℝ^d and Ω₂ ⊆ ℝ^m are nonempty closed convex sets, ξ is a random vector with nonempty measurable support Ξ ⊆ ℝ^p, and F : Ω₁ × Ω₂ × Ξ → ℝ. Here $\mathbb{E}[F(u, \alpha, \xi)] = \int_\Xi F(u, \alpha, \xi)\, d\Pr(\xi)$, and the function f(u, α) is convex in u ∈ Ω₁ and concave in α ∈ Ω₂. In general, u and α are referred to as the primal variable and the dual variable, respectively.

The following theorem shows that (3) is equivalent to a stochastic SPP (5).
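To make (4) concrete, here is a small sketch of our own (not code from the paper) that evaluates the empirical pairwise ℓ₂ objective directly over all opposite-class pairs:

```python
import numpy as np

def pairwise_l2_objective(w, X, y):
    """Empirical AUC objective (4): mean squared pairwise l2 loss
    over all n+ * n- positive-negative pairs."""
    scores = X @ w
    pos = scores[y == 1]
    neg = scores[y == -1]
    # (1 - (f(x_i) - f(x_j)))^2 for every positive i and negative j
    margins = 1.0 - (pos[:, None] - neg[None, :])
    return np.mean(margins ** 2)
```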
First, define F : ℝ^d × ℝ³ × Z → ℝ, for any w ∈ ℝ^d, a, b, α ∈ ℝ and z = (x, y) ∈ Z, by

$$F(w, a, b, \alpha; z) = (1-p)(w^\top x - a)^2\,\mathbb{I}_{[y=1]} + p(w^\top x - b)^2\,\mathbb{I}_{[y=-1]} + 2(1+\alpha)\big(p\, w^\top x\, \mathbb{I}_{[y=-1]} - (1-p)\, w^\top x\, \mathbb{I}_{[y=1]}\big) - p(1-p)\alpha^2. \qquad (6)$$

Theorem 1. The AUC optimization (3) is equivalent to

$$\min_{\substack{\|w\| \le R \\ (a,b) \in \mathbb{R}^2}} \; \max_{\alpha \in \mathbb{R}} \; \Big\{ f(w, a, b, \alpha) := \int_Z F(w, a, b, \alpha; z)\, d\rho(z) \Big\}. \qquad (7)$$

Proof. It suffices to prove the claim that the objective function of (3) equals $1 + \min_{(a,b) \in \mathbb{R}^2} \max_{\alpha \in \mathbb{R}} \frac{1}{p(1-p)} \int_Z F(w, a, b, \alpha; z)\, d\rho(z)$.

To this end, note that z = (x, y) and z' = (x', y') are samples independently drawn from ρ. Therefore, the objective function of (3) can be rewritten as

$$\mathbb{E}\big[(1 - w^\top(x - x'))^2 \,\big|\, y = 1, y' = -1\big] = 1 + \mathbb{E}[(w^\top x)^2 | y = 1] + \mathbb{E}[(w^\top x')^2 | y' = -1] - 2\mathbb{E}[w^\top x | y = 1] + 2\mathbb{E}[w^\top x' | y' = -1] - 2\big(\mathbb{E}[w^\top x | y = 1]\big)\big(\mathbb{E}[w^\top x' | y' = -1]\big)$$
$$= 1 + \big\{\mathbb{E}[(w^\top x)^2 | y = 1] - (\mathbb{E}[w^\top x | y = 1])^2\big\} + \big\{\mathbb{E}[(w^\top x')^2 | y' = -1] - (\mathbb{E}[w^\top x' | y' = -1])^2\big\} - 2\mathbb{E}[w^\top x | y = 1] + 2\mathbb{E}[w^\top x' | y' = -1] + \big(\mathbb{E}[w^\top x | y = 1] - \mathbb{E}[w^\top x' | y' = -1]\big)^2. \qquad (8)$$

Note that

$$\mathbb{E}[(w^\top x)^2 | y = 1] - \big(\mathbb{E}[w^\top x | y = 1]\big)^2 = \min_{a \in \mathbb{R}} \mathbb{E}[(w^\top x - a)^2 | y = 1] = \min_{a \in \mathbb{R}} \frac{1}{p}\int_Z (w^\top x - a)^2\, \mathbb{I}_{[y=1]}\, d\rho(z), \qquad (9)$$

where the minimum is achieved by a = E[w⊤x | y = 1]. Likewise,

$$\min_{b \in \mathbb{R}} \mathbb{E}[(w^\top x' - b)^2 | y' = -1] = \mathbb{E}[(w^\top x')^2 | y' = -1] - \big(\mathbb{E}[w^\top x' | y' = -1]\big)^2, \qquad (10)$$

where the minimum is obtained by letting b = E[w⊤x' | y' = −1]. Moreover, observe that

$$\big(\mathbb{E}[w^\top x | y = 1] - \mathbb{E}[w^\top x' | y' = -1]\big)^2 = \max_{\alpha \in \mathbb{R}} \big\{2\alpha\big(\mathbb{E}[w^\top x' | y' = -1] - \mathbb{E}[w^\top x | y = 1]\big) - \alpha^2\big\}, \qquad (11)$$

where the maximum is achieved with α = E[w⊤x' | y' = −1] − E[w⊤x | y = 1].

Putting all these equalities into (8) implies that

$$\mathbb{E}\big[(1 - w^\top(x - x'))^2 \,\big|\, y = 1, y' = -1\big] = 1 + \min_{(a,b) \in \mathbb{R}^2} \max_{\alpha \in \mathbb{R}} \frac{\int_Z F(w, a, b, \alpha; z)\, d\rho(z)}{p(1-p)}.$$

This proves the claim and hence the theorem. □

¹The work [6, 21] studied the regularized ERM problem, i.e. $\min_{w \in \mathbb{R}^d} \frac{1}{n_+ n_-}\sum_{i=1}^n \sum_{j=1}^n (1 - w^\top(x_i - x_j))^2\, \mathbb{I}_{[y_i=1]}\,\mathbb{I}_{[y_j=-1]} + \frac{\lambda}{2}\|w\|^2$, which is equivalent to (3) with Ω being a bounded ball in ℝ^d.

In addition, we can prove the following result.

Proposition 1. The function f(w, a, b, α) is convex in (w, a, b) ∈ ℝ^{d+2} and concave in α ∈ ℝ.

The proof of this proposition can be found in the Supplementary Materials.

2.2 Stochastic Online Algorithm for AUC Maximization

The optimal solution to an SPP is called a saddle point. Stochastic first-order methods are widely used to find such a saddle point. The main idea of such algorithms (e.g.
[13, 14]) is to use an unbiased stochastic estimator of the true gradient to perform, at each iteration, gradient descent in the primal variable and gradient ascent in the dual variable.

Using the stochastic SPP formulation (7) for AUC optimization, we can develop stochastic online learning algorithms that need only a single pass over the data. For notational simplicity, let v = (w⊤, a, b)⊤ ∈ ℝ^{d+2}, and for any w ∈ ℝ^d, a, b, α ∈ ℝ and z = (x, y) ∈ Z, denote f(w, a, b, α) by f(v, α) and F(w, a, b, α; z) by F(v, α, z). The gradient of the objective function of the stochastic SPP (7) is the (d+3)-dimensional column vector g(v, α) = (∂_v f(v, α), −∂_α f(v, α)), and its unbiased stochastic estimator is given, for any z ∈ Z, by G(v, α, z) = (∂_v F(v, α, z), −∂_α F(v, α, z)). One could directly apply the stochastic first-order method of [14] to the stochastic SPP formulation (7) for AUC optimization. However, from the definition of F in (6), this would require knowing the probability p = Pr(y = 1) a priori. To overcome this problem, for any v⊤ = (w⊤, a, b) ∈ ℝ^{d+2}, α ∈ ℝ and z ∈ Z, let

$$\hat F_t(v, \alpha, z) = (1 - \hat p_t)(w^\top x - a)^2\,\mathbb{I}_{[y=1]} + \hat p_t(w^\top x - b)^2\,\mathbb{I}_{[y=-1]} + 2(1+\alpha)\big(\hat p_t\, w^\top x\, \mathbb{I}_{[y=-1]} - (1 - \hat p_t)\, w^\top x\, \mathbb{I}_{[y=1]}\big) - \hat p_t(1 - \hat p_t)\alpha^2, \qquad (12)$$

where $\hat p_t = \frac{1}{t}\sum_{i=1}^t \mathbb{I}_{[y_i = 1]}$ is the empirical estimate of p at iteration t.
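For illustration, the coordinates of the estimator defined from (12) have a simple closed form. The sketch below is our own derivation (with hypothetical names, not the authors' code) of the per-example gradients with respect to the primal block (w, a, b) and the dual variable α:

```python
import numpy as np

def grad_F_hat(w, a, b, alpha, x, y, p_hat):
    """Gradients of the per-example objective F_t-hat from Eq. (12)."""
    wx = w @ x
    if y == 1:
        # only the positive-class terms of (12) are active
        dw = 2 * (1 - p_hat) * ((wx - a) - (1 + alpha)) * x
        da = -2 * (1 - p_hat) * (wx - a)
        db = 0.0
        dalpha = -2 * (1 - p_hat) * wx - 2 * p_hat * (1 - p_hat) * alpha
    else:  # y == -1
        dw = 2 * p_hat * ((wx - b) + (1 + alpha)) * x
        da = 0.0
        db = -2 * p_hat * (wx - b)
        dalpha = 2 * p_hat * wx - 2 * p_hat * (1 - p_hat) * alpha
    return dw, da, db, dalpha
```

The estimator then descends along (dw, da, db) and ascends along dalpha.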
We propose, at iteration t, to use the stochastic estimator

$$\hat G_t(v, \alpha, z) = \big(\partial_v \hat F_t(v, \alpha, z), -\partial_\alpha \hat F_t(v, \alpha, z)\big) \qquad (13)$$

in place of the unbiased, but practically inaccessible, stochastic estimator G(v, α, z). Assume κ = sup_{x∈X} ‖x‖ < ∞, and recall that ‖w‖ ≤ R. For any optimal solution (w*, a*, b*) of the stochastic SPP (7) for AUC optimization, by (9), (10) and (11) we know that $|a^*| = \frac{1}{p}\big|\int_Z \langle w^*, x\rangle\, \mathbb{I}_{[y=1]}\, d\rho(z)\big| \le R\kappa$, $|b^*| = \frac{1}{1-p}\big|\int_Z \langle w^*, x'\rangle\, \mathbb{I}_{[y'=-1]}\, d\rho(z')\big| \le R\kappa$, and $|\alpha^*| = \big|\frac{1}{1-p}\int_Z \langle w^*, x'\rangle\, \mathbb{I}_{[y'=-1]}\, d\rho(z') - \frac{1}{p}\int_Z \langle w^*, x\rangle\, \mathbb{I}_{[y=1]}\, d\rho(z)\big| \le 2R\kappa$. Therefore, we can restrict (w, a, b) and α to the following bounded domains:

$$\Omega_1 = \big\{(w, a, b) \in \mathbb{R}^{d+2} : \|w\| \le R, |a| \le R\kappa, |b| \le R\kappa\big\}, \qquad \Omega_2 = \big\{\alpha \in \mathbb{R} : |\alpha| \le 2R\kappa\big\}. \qquad (14)$$

In this case, the projection steps (steps 4 and 5 in Table 1) can be computed easily. The pseudo-code of the online AUC optimization algorithm, to which we refer as SOLAM, is given in Table 1.

3 Analysis

We now present the convergence results of the proposed algorithm for AUC optimization. Let u = (v, α) = (w, a, b, α). The quality of an approximate solution (v̄_t, ᾱ_t) to the SPP problem (5) at iteration t is measured by the duality gap:

$$\varepsilon_f(\bar v_t, \bar\alpha_t) = \max_{\alpha \in \Omega_2} f(\bar v_t, \alpha) - \min_{v \in \Omega_1} f(v, \bar\alpha_t). \qquad (15)$$

Stochastic Online AUC Maximization (SOLAM)
1. 
Choose step sizes {γ_t > 0 : t ∈ ℕ}.
2. Initialize t = 1, v₁ ∈ Ω₁, α₁ ∈ Ω₂, and let p̂₀ = 0, v̄₀ = 0, ᾱ₀ = 0 and γ̄₀ = 0.
3. Receive a sample z_t = (x_t, y_t) and compute p̂_t = ((t−1) p̂_{t−1} + I_{[y_t=1]}) / t.
4. Update v_{t+1} = P_{Ω₁}(v_t − γ_t ∂_v F̂_t(v_t, α_t, z_t)).
5. Update α_{t+1} = P_{Ω₂}(α_t + γ_t ∂_α F̂_t(v_t, α_t, z_t)).
6. Update γ̄_t = γ̄_{t−1} + γ_t.
7. Update v̄_t = (γ̄_{t−1} v̄_{t−1} + γ_t v_t) / γ̄_t and ᾱ_t = (γ̄_{t−1} ᾱ_{t−1} + γ_t α_t) / γ̄_t.
8. Set t ← t + 1.

Table 1: Pseudo-code of the proposed algorithm. In steps 4 and 5, P_{Ω₁}(·) and P_{Ω₂}(·) denote the projections onto the convex sets Ω₁ and Ω₂, respectively.

Theorem 2. Assume that the samples {(x₁, y₁), (x₂, y₂), …, (x_T, y_T)} are drawn i.i.d. from a distribution ρ over X × Y, let Ω₁ and Ω₂ be given by (14), and let the step sizes be given by {γ_t > 0 : t ∈ ℕ}. For the sequence {(v̄_t, ᾱ_t) : t ∈ [1, T]} generated by SOLAM (Table 1), and any 0 < δ < 1, with probability 1 − δ the following holds:

$$\varepsilon_f(\bar v_T, \bar\alpha_T) \le C_\kappa \max(R^2, 1)\,\sqrt{\ln\frac{4T}{\delta}}\,\Big(\sum_{j=1}^T \gamma_j\Big)^{-1}\Big[1 + \sum_{j=1}^T \gamma_j^2 + \Big(\sum_{j=1}^T \gamma_j^2\Big)^{\frac12} + \sum_{j=1}^T \frac{\gamma_j}{\sqrt j}\Big],$$

where C_κ is an absolute constant independent of R and T (see its explicit expression in the proof).

Denote by f* the optimum of (7), which, by Theorem 1, is identical to the optimal value of the AUC optimization (3). From Theorem 2, the following convergence rate is straightforward.

Corollary 1. Under the same assumptions as in Theorem 2, and with {γ_j = ζ j^{−1/2} : j ∈ ℕ} for a constant ζ > 0, with probability 1 − δ it holds that $|f(\bar v_T, \bar\alpha_T) - f^*| \le \varepsilon_f(\bar u_T) = O\Big(\frac{\ln T}{\sqrt T}\sqrt{\ln\frac{4T}{\delta}}\Big)$.

While the above convergence rate is obtained by choosing decaying step sizes, a similar result can be established when a constant step size is appropriately chosen.

The proof of Theorem 2 requires several lemmas. The first is a standard result from convex online learning [16, 22]. We include its proof in the Supplementary Materials for completeness.

Lemma 1. For any T ∈ ℕ, let {ξ_j : j ∈ [1, T]} be a sequence of vectors in ℝ^m, and let ũ₁ ∈ Ω, where Ω is a convex set. For any t ∈ [1, T] define ũ_{t+1} = P_Ω(ũ_t − ξ_t). Then, for any u ∈ Ω, there holds

$$\sum_{t=1}^T (\tilde u_t - u)^\top \xi_t \le \frac{\|\tilde u_1 - u\|^2}{2} + \frac12\sum_{t=1}^T \|\xi_t\|^2.$$

The second lemma is the Pinelis-Bernstein inequality for martingale difference sequences in a Hilbert space, from [15, Theorem 3.4].

Lemma 2. Let {S_k : k ∈ ℕ} be a martingale difference sequence in a Hilbert space. Suppose that almost surely ‖S_k‖ ≤ B and $\sum_{k=1}^T \mathbb{E}[\|S_k\|^2 \,|\, S_1, \ldots, S_{k-1}] \le \sigma_T^2$. Then, for any 0 < δ < 1, there holds, with probability at least 1 − δ, $\sup_{1 \le j \le T}\big\|\sum_{k=1}^j S_k\big\| \le 2\big(\frac{B}{3} + \sigma_T\big)\log\frac{2}{\delta}$.

The third lemma shows that the approximate stochastic estimator Ĝ_t(u, z) defined by (13) is not far from the unbiased one G(u, z). Its proof is given in the Supplementary Materials.

Lemma 3.
Let Ω₁ and Ω₂ be given by (14) and denote Ω = Ω₁ × Ω₂. For any t ∈ ℕ, with probability 1 − δ, there holds

$$\sup_{u \in \Omega, z \in Z} \|\hat G_t(u, z) - G(u, z)\| \le 2\kappa(4\kappa R + 11R + 1)\Big(\frac{\ln(2/\delta)}{t}\Big)^{\frac12}.$$

Proof of Theorem 2. By the convexity of f(·, α) and the concavity of f(v, ·), for any u = (v, α) ∈ Ω₁ × Ω₂, we get f(v_t, α) − f(v, α_t) = (f(v_t, α_t) − f(v, α_t)) + (f(v_t, α) − f(v_t, α_t)) ≤ (v_t − v)⊤∂_v f(v_t, α_t) − (α_t − α)∂_α f(v_t, α_t) = (u_t − u)⊤g(u_t). Hence, there holds

$$\max_{\alpha\in\Omega_2} f(\bar v_T, \alpha) - \min_{v\in\Omega_1} f(v, \bar\alpha_T) \le \Big(\sum_{t=1}^T\gamma_t\Big)^{-1}\Big(\max_{\alpha\in\Omega_2}\sum_{t=1}^T \gamma_t f(v_t, \alpha) - \min_{v\in\Omega_1}\sum_{t=1}^T \gamma_t f(v, \alpha_t)\Big) \le \Big(\sum_{t=1}^T\gamma_t\Big)^{-1}\max_{u\in\Omega_1\times\Omega_2}\sum_{t=1}^T \gamma_t (u_t - u)^\top g(u_t). \qquad (16)$$

Recall that Ω = Ω₁ × Ω₂. Steps 4 and 5 of SOLAM can be rewritten as u_{t+1} = (v_{t+1}, α_{t+1}) = P_Ω(u_t − γ_t Ĝ_t(u_t, z_t)). By applying Lemma 1 with ξ_t = γ_t Ĝ_t(u_t, z_t), we have, for any u ∈ Ω, that $\sum_{t=1}^T \gamma_t(u_t - u)^\top \hat G_t(u_t, z_t) \le \frac{\|u_1 - u\|^2}{2} + \frac12\sum_{t=1}^T \gamma_t^2\|\hat G_t(u_t, z_t)\|^2$, which yields

$$\sup_{u\in\Omega}\sum_{t=1}^T \gamma_t(u_t - u)^\top g(u_t) \le \sup_{u\in\Omega}\frac{\|u_1 - u\|^2}{2} + \frac12\sum_{t=1}^T \gamma_t^2\|\hat G_t(u_t, z_t)\|^2 + \sup_{u\in\Omega}\sum_{t=1}^T \gamma_t(u_t - u)^\top\big(g(u_t) - G(u_t, z_t)\big) + \sup_{u\in\Omega}\sum_{t=1}^T \gamma_t(u_t - u)^\top\big(G(u_t, z_t) - \hat G_t(u_t, z_t)\big). \qquad (17)$$

We now estimate the terms on the right-hand side of (17) as follows.

For the first term, we have

$$\sup_{u\in\Omega}\frac{\|u_1 - u\|^2}{2} \le 2\sup_{u\in\Omega}\|u\|^2 \le 2\sup_{v\in\Omega_1, \alpha\in\Omega_2}\big(\|v\|^2 + |\alpha|^2\big) \le 2R^2(1 + 6\kappa^2). \qquad (18)$$

For the second term on the right-hand side of (17), observe that sup_{x∈X} ‖x‖ ≤ κ and u_t = (w_t, a_t, b_t, α_t) ∈ Ω = {(w, a, b, α) : ‖w‖ ≤ R, |a| ≤ κR, |b| ≤ κR, |α| ≤ 2κR}. Combining this with the definition of Ĝ_t(u_t, z_t) given by (13), one can easily get ‖Ĝ_t(u_t, z_t)‖ ≤ ‖∂_w F̂_t(u_t, z_t)‖ + |∂_a F̂_t(u_t, z_t)| + |∂_b F̂_t(u_t, z_t)| + |∂_α F̂_t(u_t, z_t)| ≤ 2κ(2R + 1 + 2Rκ). Hence, there holds

$$\frac12\sum_{t=1}^T \gamma_t^2\|\hat G_t(u_t, z_t)\|^2 \le 2\kappa^2(2R + 1 + 2R\kappa)^2\sum_{t=1}^T\gamma_t^2. \qquad (19)$$

The third term on the right-hand side of (17) can be bounded by $\sup_{u\in\Omega}\big[\sum_{t=1}^T \gamma_t(\tilde u_t - u)^\top(g(u_t) - G(u_t, z_t))\big] + \sum_{t=1}^T \gamma_t(u_t - \tilde u_t)^\top(g(u_t) - G(u_t, z_t))$, where ũ₁ = 0 ∈ Ω and ũ_{t+1} = P_Ω(ũ_t − γ_t(g(u_t) − G(u_t, z_t))) for any t ∈ [1, T]. Applying Lemma 1 with ξ_t = γ_t(g(u_t) − G(u_t, z_t)) yields that

$$\sup_{u\in\Omega}\sum_{t=1}^T \gamma_t(\tilde u_t - u)^\top\big(g(u_t) - G(u_t, z_t)\big) \le \sup_{u\in\Omega}\frac{\|u\|^2}{2} + \frac12\sum_{t=1}^T\gamma_t^2\|g(u_t) - G(u_t, z_t)\|^2 \le \frac12 R^2(1 + 6\kappa^2) + 4\kappa^2(2R + 1 + 2R\kappa)^2\sum_{t=1}^T\gamma_t^2, \qquad (20)$$

where we used that ‖G(u_t, z_t)‖ and ‖g(u_t)‖ are uniformly bounded by 2κ(2R + 1 + 2Rκ). Since u_t and ũ_t depend only on {z₁, z₂, …, z_{t−1}}, the sequence {S_t = γ_t(u_t − ũ_t)⊤(g(u_t) − G(u_t, z_t)) : t = 1, …, T} is a martingale difference sequence. Observe that $\mathbb{E}[\|S_t\|^2 \,|\, z_1, \ldots, z_{t-1}] = \gamma_t^2\int_Z \big((u_t - \tilde u_t)^\top(g(u_t) - G(u_t, z))\big)^2 d\rho(z) \le \gamma_t^2\sup_{u\in\Omega, z\in Z}\big[\|u_t - \tilde u_t\|^2\,\|g(u_t) - G(u_t, z)\|^2\big]$, with ‖u_t − ũ_t‖ ≤ 2R√(1 + 6κ²) and ‖g(u_t) − G(u_t, z)‖ ≤ 4κ(2R + 1 + 2Rκ). Applying Lemma 2 (with δ replaced by δ/2) implies that, with probability 1 − δ/2, there holds

$$\sum_{t=1}^T \gamma_t(u_t - \tilde u_t)^\top\big(g(u_t) - G(u_t, z_t)\big) \le \frac{16\kappa R\sqrt{1 + 6\kappa^2}\,(2R + 1 + 2R\kappa)}{3}\,\log\frac{4}{\delta}\,\Big(\sum_{t=1}^T\gamma_t^2\Big)^{\frac12}. \qquad (21)$$

Combining (20) with (21) implies, with probability 1 − δ/2,

$$\sup_{u\in\Omega}\sum_{t=1}^T \gamma_t(u_t - u)^\top\big(g(u_t) - G(u_t, z_t)\big) \le \frac{R^2(1 + 6\kappa^2)}{2} + 4\kappa^2(2R + 1 + 2R\kappa)^2\sum_{t=1}^T\gamma_t^2 + \frac{16\kappa R\sqrt{1 + 6\kappa^2}\,(2R + 1 + 2R\kappa)}{3}\,\log\frac{4}{\delta}\,\Big(\sum_{t=1}^T\gamma_t^2\Big)^{\frac12}. \qquad (22)$$

By Lemma 3, for any t ∈ [1, T] there holds, with probability 1 − δ/(2T),

$$\sup_{u\in\Omega, z\in Z}\|\hat G_t(u, z) - G(u, z)\| \le 2\kappa\big(2R(\kappa + 1) + 1\big)\sqrt{\ln\Big(\frac{4T}{\delta}\Big)\Big/t}.$$

Hence, the fourth term on the right-hand side of (17) can be estimated as follows: with probability 1 − δ/2, there holds

$$\sup_{u\in\Omega}\sum_{t=1}^T \gamma_t(u_t - u)^\top\big(G(u_t, z_t) - \hat G_t(u_t, z_t)\big) \le 2\sup_{u\in\Omega}\|u\|\sum_{t=1}^T \gamma_t\sup_{u\in\Omega, z\in Z}\|\hat G_t(u, z) - G(u, z)\| \le 4R\kappa\sqrt{1 + 6\kappa^2}\,\big(2R(\kappa + 1) + 1\big)\sqrt{\ln\frac{4T}{\delta}}\,\sum_{t=1}^T\frac{\gamma_t}{\sqrt t}. \qquad (23)$$

Putting the estimates (18), (19), (22) and (23) into (17), and the result back into (16), implies that

$$\varepsilon_f(\bar u_T) \le C_\kappa \max(R^2, 1)\,\sqrt{\ln\frac{4T}{\delta}}\,\Big(\sum_{t=1}^T\gamma_t\Big)^{-1}\Big[1 + \sum_{t=1}^T\gamma_t^2 + \Big(\sum_{t=1}^T\gamma_t^2\Big)^{\frac12} + \sum_{t=1}^T\frac{\gamma_t}{\sqrt t}\Big],$$

where $C_\kappa = \frac52(1 + 6\kappa^2) + 6\kappa^2(\kappa + 3)^2 + \frac{112}{3}\kappa\sqrt{6\kappa^2 + 1}\,(2\kappa + 3)$. □

4 Experiments

In this section, we report experimental evaluations of the SOLAM algorithm, comparing its performance with existing state-of-the-art learning algorithms for AUC optimization. SOLAM was implemented in MATLAB, and MATLAB code for the compared methods was obtained from the authors of the corresponding papers. In the training phase, we use five-fold cross validation to determine the initial learning rate ζ ∈ [1 : 9 : 100] and the bound on w, R ∈ 10^[−1:1:5], by a grid search. Following the evaluation protocol of [6], the performance of SOLAM was evaluated by averaging results from five runs of five-fold cross validation.

Our experiments were performed on 12 datasets that have been used in previous studies.

Table 2: Basic information about the benchmark datasets used in the experiments.

dataset   | #inst   | #feat
----------|---------|------
diabetes  | 768     | 8
fourclass | 862     | 2
german    | 1,000   | 24
splice    | 3,175   | 60
usps      | 9,298   | 256
a9a       | 32,561  | 123
mnist     | 60,000  | 780
acoustic  | 78,823  | 50
ijcnn1    | 141,691 | 22
covtype   | 581,012 | 54
sector    | 9,619   | 55,197
news20    | 15,935  | 62,061
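For readers who want to experiment with the method, the updates of Table 1 can be sketched in a few lines. The following is our own illustrative Python (not the authors' MATLAB implementation); the data, radius R, and step-size constant ζ are arbitrary assumptions, and for simplicity only the averaged w is returned and the post-update iterates are averaged:

```python
import numpy as np

def solam(X, y, R=1.0, zeta=1.0, n_passes=1):
    """Single-pass sketch of SOLAM (Table 1): projected stochastic gradient
    descent on v = (w, a, b), ascent on alpha, with weighted averaging."""
    n, d = X.shape
    kappa = np.max(np.linalg.norm(X, axis=1))  # bound on ||x||, defines (14)
    w = np.zeros(d)
    a = b = alpha = 0.0
    w_bar = np.zeros(d)
    p_hat, gamma_bar, t = 0.0, 0.0, 0
    for _ in range(n_passes):
        for x, yt in zip(X, y):
            t += 1
            gamma = zeta / np.sqrt(t)                    # gamma_t = zeta * t^(-1/2)
            p_hat = ((t - 1) * p_hat + (yt == 1)) / t    # running estimate of Pr(y=1)
            wx = w @ x
            # gradients of F_t-hat from (12)
            if yt == 1:
                dw = 2 * (1 - p_hat) * ((wx - a) - (1 + alpha)) * x
                da = -2 * (1 - p_hat) * (wx - a)
                db = 0.0
                dalpha = -2 * (1 - p_hat) * wx - 2 * p_hat * (1 - p_hat) * alpha
            else:
                dw = 2 * p_hat * ((wx - b) + (1 + alpha)) * x
                da = 0.0
                db = -2 * p_hat * (wx - b)
                dalpha = 2 * p_hat * wx - 2 * p_hat * (1 - p_hat) * alpha
            # descent in the primal, ascent in the dual, then project onto (14)
            w = w - gamma * dw
            norm_w = np.linalg.norm(w)
            if norm_w > R:
                w *= R / norm_w
            a = np.clip(a - gamma * da, -R * kappa, R * kappa)
            b = np.clip(b - gamma * db, -R * kappa, R * kappa)
            alpha = np.clip(alpha + gamma * dalpha, -2 * R * kappa, 2 * R * kappa)
            # weighted averaging (steps 6-7 of Table 1)
            gamma_bar += gamma
            w_bar += gamma * (w - w_bar) / gamma_bar
    return w_bar
```

Each update touches a single example and O(d) memory, matching the complexity claims in the Introduction.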
For multi-class datasets, e.g., news20 and sector, we transform them into binary classification problems by randomly partitioning the data into two groups, where each group includes the same number of classes. Information about these datasets is summarized in Table 2.

On these datasets, we evaluate and compare SOLAM with four online and two offline learning algorithms for AUC maximization: one-pass AUC maximization (OPAUC) [6], which uses the ℓ₂ loss surrogate of the AUC objective; online AUC maximization [21], which uses the hinge loss surrogate of the AUC objective, with two variants, one with sequential update (OAMseq) and the other with gradient update (OAMgra); online Uni-Exp [12], which uses a weighted univariate exponential loss; B-SVM-OR [10], a batch learning algorithm using the hinge loss surrogate of the AUC objective; and B-LS-SVM, a batch learning algorithm using the ℓ₂ loss surrogate of the AUC objective.

Classification performance on the test sets for all methods is given in Table 3. These results show that SOLAM achieves performance similar to other state-of-the-art online and offline AUC maximization methods. SOLAM outperforms the offline methods on acoustic and covtype, which could be due to the normalization of features used in our experiments for SOLAM. On the other hand, the main advantage of SOLAM is its running efficiency: as pointed out in the Introduction, its per-iteration running time and space complexity are linear in the data dimension and do not depend on the iteration number. In Figure 1, we show AUC vs.
run time (seconds) for

Datasets  | SOLAM       | OPAUC       | OAMseq      | OAMgra      | online Uni-Exp | B-SVM-OR    | B-LS-SVM
diabetes  | .8253±.0314 | .8309±.0350 | .8264±.0367 | .8262±.0338 | .8215±.0309    | .8326±.0328 | .8325±.0329
fourclass | .8226±.0240 | .8310±.0251 | .8306±.0247 | .8295±.0251 | .8281±.0305    | .8305±.0311 | .8309±.0309
german    | .7882±.0243 | .7978±.0347 | .7747±.0411 | .7723±.0358 | .7908±.0367    | .7935±.0348 | .7994±.0343
splice    | .9253±.0097 | .9232±.0099 | .8594±.0194 | .8864±.0166 | .8931±.0213    | .9239±.0089 | .9245±.0092
usps      | .9766±.0032 | .9620±.0040 | .9310±.0159 | .9348±.0122 | .9538±.0045    | .9630±.0047 | .9634±.0045
a9a       | .9001±.0042 | .9002±.0047 | .8420±.0174 | .8571±.0173 | .9005±.0024    | .9009±.0036 | .8982±.0028
mnist     | .9324±.0020 | .9242±.0021 | .8615±.0087 | .8643±.0112 | .7932±.0245    | .9340±.0020 | .9336±.0025
acoustic  | .8898±.0026 | .8192±.0032 | .7113±.0590 | .7711±.0217 | .8171±.0034    | .8262±.0032 | .8210±.0033
ijcnn1    | .9215±.0045 | .9269±.0021 | .9209±.0079 | .9100±.0092 | .9264±.0035    | .9337±.0024 | .9320±.0037
covtype   | .9744±.0004 | .8244±.0014 | .7361±.0317 | .7403±.0289 | .8236±.0017    | .8248±.0013 | .8222±.0014
sector    | .9834±.0023 | .9292±.0081 | .9163±.0087 | .9043±.0100 | .9215±.0034    | -           | -
news20    | .9467±.0039 | .8871±.0083 | .8543±.0099 | .8346±.0094 | .8880±.0047    | -           | -

Table 3: Comparison of the testing AUC values (mean±std.) on the evaluated datasets. To accelerate the experiments, the performances of OPAUC, OAMseq, OAMgra, online Uni-Exp, B-SVM-OR and B-LS-SVM were taken from [6].

Figure 1 [(a) a9a, (b) usps, (c) sector]: AUC vs. time curves of the SOLAM algorithm and three state-of-the-art AUC learning algorithms, i.e., OPAUC [6], OAMseq [21], and OAMgra [21].
The values in parentheses indicate the average running time (seconds) per pass for each algorithm.

SOLAM and three other state-of-the-art online learning algorithms, i.e., OPAUC [6], OAMseq [21], and OAMgra [21], over three datasets (a9a, usps, and sector), along with the per-iteration running time in the legend². These results show that SOLAM in general converges faster than the compared methods, while achieving competitive performance.

5 Conclusion

In this paper we showed that AUC maximization is equivalent to a stochastic saddle point problem, from which we proposed a novel online learning algorithm for AUC optimization. In contrast to the existing algorithms [6, 21], the main advantage of our algorithm is that it does not need to store all previous examples or their second-order covariance matrix. Hence, it is a truly online learning algorithm whose per-iteration space and time complexities are those of one datum, the same as online gradient descent algorithms [22] for classification.
There are several research directions for future work. Firstly, the convergence rate O(1/√T) for SOLAM only matches that of the black-box sub-gradient method. It would be interesting to derive a fast convergence rate O(1/T) by exploring the special structure of the objective function F defined by (6). Secondly, the convergence was established using the duality gap associated with the stochastic SPP formulation (7). It would be interesting to establish the strong convergence of the output w̄_T of algorithm SOLAM to the optimal solution of the actual AUC optimization problem (3). Thirdly, the SPP formulation (1) holds for the least squares loss.
We do not know if the same formulation holds true for other loss functions such as the logistic loss or the hinge loss.

² Experiments were performed, with running times reported, on a workstation with 12 nodes, each with an Intel Xeon E5-2620 2.0GHz CPU and 64GB RAM.

References

[1] F. R. Bach and E. Moulines. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In NIPS, 2011.

[2] L. Bottou and Y. LeCun. Large scale online learning. In NIPS, 2003.

[3] N. Cesa-Bianchi, A. Conconi, and C. Gentile. On the generalization ability of on-line learning algorithms. IEEE Trans. Information Theory, 50(9):2050–2057, 2004.

[4] S. Clemencon, G. Lugosi, and N. Vayatis. Ranking and empirical minimization of U-statistics. The Annals of Statistics, 36(2):844–874, 2008.

[5] C. Cortes and M. Mohri. AUC optimization vs. error rate minimization. In NIPS, 2003.

[6] W. Gao, R. Jin, S. Zhu, and Z. H. Zhou. One-pass AUC optimization. In ICML, 2013.

[7] W. Gao and Z. H. Zhou. On the consistency of AUC pairwise optimization. In International Joint Conference on Artificial Intelligence, 2015.

[8] J. A. Hanley and B. J. McNeil. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143(1):29–36, 1982.

[9] T. Joachims. A support vector method for multivariate performance measures. In ICML, 2005.

[10] T. Joachims. Training linear SVMs in linear time. In Proceedings of the Twelfth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 217–226, 2006.

[11] P. Kar, B. K. Sriperumbudur, P. Jain, and H. Karnick. On the generalization ability of online learning algorithms for pairwise loss functions. In ICML, 2013.

[12] W. Kotlowski, K. Dembczynski, and E. Hüllermeier. Bipartite ranking through minimization of univariate loss. In ICML, 2011.

[13] G. Lan.
An optimal method for stochastic composite optimization. Mathematical Programming, 133(1-2):365–397, 2012.

[14] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.

[15] I. Pinelis. Optimum bounds for the distributions of martingales in Banach spaces. The Annals of Probability, 22(4):1679–1706, 1994.

[16] A. Rakhlin, O. Shamir, and K. Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. In ICML, 2012.

[17] A. Rakotomamonjy. Optimizing area under ROC curve with SVMs. In 1st International Workshop on ROC Analysis in Artificial Intelligence, 2004.

[18] Y. Wang, R. Khardon, D. Pechyony, and R. Jones. Generalization bounds for online learning algorithms with pairwise loss functions. In COLT, 2012.

[19] Y. Ying and M. Pontil. Online gradient descent learning algorithms. Foundations of Computational Mathematics, 8(5):561–596, 2008.

[20] Y. Ying and D. X. Zhou. Online pairwise learning algorithms. Neural Computation, 28:743–777, 2016.

[21] P. Zhao, S. C. H. Hoi, R. Jin, and T. Yang. Online AUC maximization. In ICML, 2011.

[22] M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In ICML, 2003.