{"title": "SGD Algorithms based on Incomplete U-statistics: Large-Scale Minimization of Empirical Risk", "book": "Advances in Neural Information Processing Systems", "page_first": 1027, "page_last": 1035, "abstract": "In many learning problems, ranging from clustering to ranking through metric learning, empirical estimates of the risk functional consist of an average over tuples (e.g., pairs or triplets) of observations, rather than over individual observations. In this paper, we focus on how to best implement a stochastic approximation approach to solve such risk minimization problems. We argue that in the large-scale setting, gradient estimates should be obtained by sampling tuples of data points with replacement (incomplete U-statistics) instead of sampling data points without replacement (complete U-statistics based on subsamples). We develop a theoretical framework accounting for the substantial impact of this strategy on the generalization ability of the prediction model returned by the Stochastic Gradient Descent (SGD) algorithm. It reveals that the method we promote achieves a much better trade-off between statistical accuracy and computational cost. 
Beyond the rate bound analysis, experiments on AUC maximization and metric learning provide strong empirical evidence of the superiority of the proposed approach.", "full_text": "SGD Algorithms based on Incomplete U-statistics:\n\nLarge-Scale Minimization of Empirical Risk\n\nGuillaume Papa, Stéphan Clémençon\n\nLTCI, CNRS, Télécom ParisTech\n\nUniversité Paris-Saclay, 75013 Paris, France\nfirst.last@telecom-paristech.fr\n\nAurélien Bellet\n\nMagnet Team, INRIA Lille - Nord Europe\n\n59650 Villeneuve d'Ascq, France\naurelien.bellet@inria.fr\n\nAbstract\n\nIn many learning problems, ranging from clustering to ranking through metric learning, empirical estimates of the risk functional consist of an average over tuples (e.g., pairs or triplets) of observations, rather than over individual observations. In this paper, we focus on how to best implement a stochastic approximation approach to solve such risk minimization problems. We argue that in the large-scale setting, gradient estimates should be obtained by sampling tuples of data points with replacement (incomplete U-statistics) instead of sampling data points without replacement (complete U-statistics based on subsamples). We develop a theoretical framework accounting for the substantial impact of this strategy on the generalization ability of the prediction model returned by the Stochastic Gradient Descent (SGD) algorithm. It reveals that the method we promote achieves a much better trade-off between statistical accuracy and computational cost. Beyond the rate bound analysis, experiments on AUC maximization and metric learning provide strong empirical evidence of the superiority of the proposed approach.\n\n1 Introduction\n\nIn many machine learning problems, the statistical risk functional is an expectation over d-tuples (d ≥ 2) of observations, rather than over individual points. 
This is the case in supervised metric learning [3], where one seeks to optimize a distance function such that it assigns smaller values to pairs of points with the same label than to those with different labels. Other popular examples include bipartite ranking (see [27] for instance), where the goal is to maximize the number of concordant pairs (i.e., AUC maximization), and more generally multipartite ranking (cf. [12]), as well as pairwise clustering (see [7]). Given a data sample, the most natural empirical risk estimate (which is known to have minimal variance among all unbiased estimates) is obtained by averaging over all tuples of observations and thus takes the form of a U-statistic (an average of dependent variables generalizing the sample mean; see [19]). The Empirical Risk Minimization (ERM) principle, one of the main paradigms of statistical learning theory, has been extended to the case where the empirical risk of a prediction rule is a U-statistic [5], using concentration properties of U-processes (i.e., collections of U-statistics). The computation of the empirical risk is however numerically unfeasible in large and even moderate scale situations due to the exploding number of possible tuples.\n\nIn practice, the minimization of such empirical risk functionals is generally performed by means of stochastic optimization techniques such as Stochastic Gradient Descent (SGD), where at each iteration only a small number of randomly selected terms are used to compute an estimate of the gradient (see [27, 24, 16, 26] for instance). A drawback of the original SGD learning method, introduced in the case where empirical risk functionals are computed by summing over independent observations (sample mean statistics), is its slow convergence due to the variance of the gradient estimates, see [15]. This has recently motivated the development of a wide variety of SGD variants implementing a variance reduction method in order to improve convergence. 
Variance reduction is achieved by occasionally computing the exact gradient (see SAG [18], SVRG [15], MISO [20] and SAGA [9], among others) or by means of nonuniform sampling schemes (see [21, 28] for instance). However, such ideas can hardly be applied to the case under study here: due to the overwhelming number of possible tuples, computing even a single exact gradient or maintaining a probability distribution over the set of all tuples is computationally unfeasible in general.\n\nIn this paper, we leverage the specific structure and statistical properties of the empirical risk functional when it is of the form of a U-statistic to design an efficient implementation of the SGD learning method. We study the performance of the following sampling scheme for the gradient estimation step involved in the SGD algorithm: drawing with replacement a set of tuples directly (in order to build an incomplete U-statistic gradient estimate), rather than drawing a subset of observations without replacement and forming all possible tuples based on these (the corresponding gradient estimate is then a complete U-statistic based on a subsample). While [6] has investigated maximal deviations between U-processes and their incomplete approximations, the performance analysis carried out in the present paper is inspired by [4] and involves both the optimization error of the SGD algorithm and the estimation error induced by the statistical finite-sample setting. We first provide non-asymptotic rate bounds and asymptotic convergence rates for the SGD procedure applied to the empirical minimization of a U-statistic. These results shed light on the impact of the conditional variance of the gradient estimators on the speed of convergence of SGD. We then derive a novel generalization bound which depends on the variance of the sampling strategies. 
This bound establishes the indisputable superiority of the incomplete U-statistic estimation approach over the complete variant in terms of the trade-off between statistical accuracy and computational cost. Our experimental results on AUC maximization and metric learning tasks on large-scale datasets are consistent with our theoretical findings and show that the use of the proposed sampling strategy can provide spectacular performance gains in practice. We conclude this paper with promising lines for future research, in particular regarding the trade-offs involved in a possible implementation of nonuniform sampling strategies to further improve convergence.\n\nThe rest of this paper is organized as follows. In Section 2, we briefly review the theory of U-statistics and their approximations, together with elementary notions of gradient-based stochastic approximation. Section 3 provides a detailed description of the SGD implementation we propose, along with a performance analysis conditional upon the data sample. In Section 4, based on these results, we derive a generalization bound based on a decomposition into optimization and estimation errors. Section 5 presents our numerical experiments, and we conclude in Section 6. Technical proofs are sketched in the Appendix, and further details can be found in the Supplementary Material.\n\n2 Background and Problem Setup\n\nHere and throughout, the indicator function of any event E is denoted by I{E} and the variance of any square integrable r.v. Z by σ²(Z).\n\n2.1 U-statistics: Definition and Examples\n\nGeneralized U-statistics are extensions of standard sample mean statistics, as defined below.\n\nDefinition 1. Let K ≥ 1 and (d_1, . . . , d_K) ∈ N^{*K}. Let X_{{1, ..., n_k}} = (X^{(k)}_1, . . . , X^{(k)}_{n_k}), 1 ≤ k ≤ K, be K independent samples of sizes n_k ≥ d_k and composed of i.i.d. 
random variables taking their values in some measurable space X_k with distribution F_k(dx) respectively. Let H : X_1^{d_1} × ··· × X_K^{d_K} → R be a measurable function, square integrable with respect to the probability distribution μ = F_1^{⊗d_1} ⊗ ··· ⊗ F_K^{⊗d_K}. Assume in addition (without loss of generality) that H(x^{(1)}, . . . , x^{(K)}) is symmetric within each block of arguments x^{(k)} (valued in X_k^{d_k}), 1 ≤ k ≤ K. The generalized (or K-sample) U-statistic of degrees (d_1, . . . , d_K) with kernel H is then defined as\n\n  U_n(H) = (1 / ∏_{k=1}^K \binom{n_k}{d_k}) ∑_{I_1} ··· ∑_{I_K} H(X^{(1)}_{I_1}; X^{(2)}_{I_2}; . . . ; X^{(K)}_{I_K}),   (1)\n\nwhere n = (n_1, . . . , n_K), the symbol ∑_{I_1} ··· ∑_{I_K} refers to summation over all elements of Λ, the set of the ∏_{k=1}^K \binom{n_k}{d_k} index vectors (I_1, . . . , I_K), I_k being a set of d_k indexes 1 ≤ i_1 < . . . < i_{d_k} ≤ n_k, and X^{(k)}_{I_k} = (X^{(k)}_{i_1}, . . . , X^{(k)}_{i_{d_k}}) for 1 ≤ k ≤ K.\n\nIn the above definition, standard mean statistics correspond to the case where K = 1 = d_1. More generally, when K = 1, U_n(H) is an average over all d_1-tuples of observations. Finally, K ≥ 2 corresponds to the multi-sample situation where a d_k-tuple is used for each sample k ∈ {1, . . . , K}. The key property of the statistic (1) is that it has minimum variance among all unbiased estimates of\n\n  μ(H) = E[H(X^{(1)}_1, . . . , X^{(1)}_{d_1}, . . . , X^{(K)}_1, . . . , X^{(K)}_{d_K})] = E[U_n(H)].\n\nOne may refer to [19] for further results on the theory of U-statistics. 
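To make Definition 1 concrete, here is a minimal Python sketch (our illustration, not part of the paper) of the simplest non-trivial case K = 1, d_1 = 2: the complete U-statistic averages a symmetric kernel over all \binom{n}{2} pairs. With the kernel H(x, y) = (x − y)²/2, U_n(H) is exactly the unbiased sample variance.

```python
import itertools
import statistics

def complete_u_statistic(sample, kernel):
    """Complete one-sample U-statistic of degree 2 (K = 1, d_1 = 2):
    the average of a symmetric kernel over all C(n, 2) pairs."""
    values = [kernel(x, y) for x, y in itertools.combinations(sample, 2)]
    return sum(values) / len(values)

# Kernel H(x, y) = (x - y)^2 / 2: the resulting U-statistic equals the
# unbiased sample variance, which statistics.variance also computes.
sample = [1.0, 2.0, 3.0, 4.0]
u = complete_u_statistic(sample, lambda x, y: 0.5 * (x - y) ** 2)
assert abs(u - statistics.variance(sample)) < 1e-12
```

Note that the number of kernel evaluations grows as \binom{n}{2}, which is precisely the scalability issue discussed in the introduction.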
In machine learning, generalized U-statistics are used as performance criteria in various problems, such as those listed below.\n\nClustering. Given a distance D : X_1 × X_1 → R_+, the quality of a partition P of X_1 with respect to the clustering of an i.i.d. sample X_1, . . . , X_n drawn from F_1(dx) can be assessed through the within-cluster point scatter:\n\n  (2 / (n(n − 1))) ∑_{i<j} D(X_i, X_j) · ∑_{C∈P} I{(X_i, X_j) ∈ C²}.   (2)\n\nIt is a one-sample U-statistic of degree 2 with kernel H_P(x, x') = D(x, x') · ∑_{C∈P} I{(x, x') ∈ C²}.\n\n[...] there exists a constant c > 0 such that, for all n ∈ N^{*K},\n\n  σ²(g̃_{n'}(θ)) ≤ c · σ²_θ / ∑_{k=1}^K (n'_k / d_k)   and   σ²(ḡ_B(θ)) ≤ c · σ²_θ / B,  with  B = ∏_{k=1}^K \binom{n'_k}{d_k},\n\nwhere σ²_θ = σ²(∇_θ H(X^{(1)}_1, . . . , X^{(K)}_{d_K}; θ)). Explicit but lengthy expressions of the variances are given in [19].\n\nRemark 1. The results of this paper can be extended to other sampling schemes to approximate (4), such as Bernoulli sampling or sampling without replacement in Λ, following the proposal of [14]. For clarity, we focus on sampling with replacement, which is computationally more efficient.\n\n3.2 A Conditional Performance Analysis\n\nAs a first go, we investigate and compare the performance of the SGD methods described above conditionally upon the observed data samples. For simplicity, we denote by P_n(·) the conditional probability measure given the data and by E_n[·] the P_n-expectation. Given a matrix M, we denote by M^T the transpose of M and ‖M‖_{HS} := √(Tr(MM^T)) its Hilbert-Schmidt norm. 
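Before moving to the conditional analysis, the complete-versus-incomplete variance comparison can be checked empirically. The following toy Python experiment (ours, not the authors' code) contrasts, at an equal budget of kernel evaluations, a complete U-statistic built on a small subsample drawn without replacement against an incomplete U-statistic drawing the same number of pairs with replacement from the full dataset.

```python
import itertools
import random

def kernel(x, y):
    # Degree-2 kernel whose complete U-statistic is the unbiased variance.
    return 0.5 * (x - y) ** 2

def complete_on_subsample(data, n_sub, rng):
    # Complete U-statistic: subsample n_sub points without replacement,
    # then average the kernel over all C(n_sub, 2) pairs.
    sub = rng.sample(data, n_sub)
    vals = [kernel(x, y) for x, y in itertools.combinations(sub, 2)]
    return sum(vals) / len(vals)

def incomplete(data, num_terms, rng):
    # Incomplete U-statistic: draw num_terms pairs with replacement
    # directly from the full sample.
    total = 0.0
    for _ in range(num_terms):
        i, j = rng.sample(range(len(data)), 2)
        total += kernel(data[i], data[j])
    return total / num_terms

def mc_variance(estimator, reps=2000):
    # Monte-Carlo estimate of the variance of a randomized estimator.
    vals = [estimator() for _ in range(reps)]
    mean = sum(vals) / reps
    return sum((v - mean) ** 2 for v in vals) / reps

rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(2000)]
n_sub = 8
budget = n_sub * (n_sub - 1) // 2  # 28 kernel evaluations for both schemes

var_complete = mc_variance(lambda: complete_on_subsample(data, n_sub, rng))
var_incomplete = mc_variance(lambda: incomplete(data, budget, rng))
assert var_incomplete < var_complete  # same budget, lower variance
```

At the same computational cost, sampling pairs with replacement yields a markedly lower-variance estimate, which is the behavior the analysis above formalizes.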
We assume that the loss function H is l-smooth in θ, i.e., its gradient is l-Lipschitz, with l > 0. We also restrict ourselves to the case where L̂_n is α-strongly convex for some deterministic constant α:\n\n  L̂_n(θ_1) − L̂_n(θ_2) ≤ ∇_θL̂_n(θ_1)^T(θ_1 − θ_2) − (α/2)‖θ_1 − θ_2‖²,   (7)\n\nand we denote by θ*_n its unique minimizer. We point out that the present analysis can be extended to the smooth but non-strongly convex case, see [1]. A classical argument based on convex analysis and stochastic optimization (see [1, 22] for instance) shows precisely how the conditional variance of the gradient estimator impacts the empirical performance of the solution produced by the corresponding SGD method, and thus strongly advocates the use of the SGD variant proposed in Section 3.1.\n\nProposition 2. Consider the recursion θ_{t+1} = θ_t − γ_t g(θ_t), where E_n[g(θ_t)|θ_t] = ∇_θL̂_n(θ_t), and denote by σ²_n(g(θ)) the conditional variance of g(θ). For step size γ_t = γ_1/t^β, the following holds.\n\n1. If 1/2 < β < 1, then:\n\n  E_n[L̂_n(θ_{t+1}) − L̂_n(θ*_n)] ≤ (σ²_n(g(θ*_n)) / t^β) · C_1 + o(1/t^β),  with  C_1 = γ_1 l 2^{β−1} (1/(2α) + l γ_1²/(2β − 1)).\n\n2. 
If β = 1 and γ_1 > 1/(2α), then:\n\n  E_n[L̂_n(θ_{t+1}) − L̂_n(θ*_n)] ≤ (σ²_n(g(θ*_n)) / (t + 1)) · C_2 + o(1/t),  with  C_2 = 2αγ_1 l exp(2αlγ_1²) γ_1² / (2αγ_1 − 1).\n\nProposition 2 illustrates the well-known fact that the convergence rate of SGD is dominated by the variance term, so one needs to focus on reducing this term to improve its performance.\n\nWe are also interested in the asymptotic behavior of the algorithm (when t → +∞), under the following assumptions:\n\nA1 The function L̂_n(θ) is twice differentiable on a neighborhood of θ*_n.\nA2 The function ∇L̂_n(θ) is bounded.\n\nLet us set Γ = ∇²L̂_n(θ*_n). We establish the following result (refer to the Supplementary Material for a detailed proof).\n\nTheorem 1. Let the covariance matrix Σ*_n be the unique solution of the Lyapunov equation:\n\n  ΓΣ*_n + Σ*_nΓ − ηΣ*_n = Σ_n(θ*_n),   (8)\n\nwhere Σ_n(θ*_n) = E_n[g(θ*_n)g(θ*_n)^T] and η = 1/γ_1 if β = 1 (recall that γ_1 > 1/(2α) in this case), η = 0 otherwise. Then, under Assumptions A1-A2, we have:\n\n  (1/γ_t)(L̂_n(θ_t) − L̂_n(θ*_n)) ⇒ (1/2) U^T (Σ*_n)^{1/2} Γ (Σ*_n)^{1/2} U,\n\nwhere U ∼ N(0, I_q). In addition, in the case η = 0, we have:\n\n  ‖(Σ*_nΓ)^{1/2}‖²_{HS} = E[U^T (Σ*_n)^{1/2} Γ (Σ*_n)^{1/2} U] = (1/2) σ²_n(g(θ*_n)).   (9)\n\nTheorem 1 reveals that the conditional variance term again plays a key role in the asymptotic performance of the algorithm. In particular, it is the dominating term in the precision of the solution. 
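As a toy illustration of the regime covered by Proposition 2 (a sketch under our own assumptions: a one-dimensional, 1-strongly convex pairwise kernel invented for the example), the following Python code runs SGD with step size γ_t = γ_1/t^β, using incomplete U-statistic gradient estimates obtained by drawing pairs with replacement.

```python
import random

# Invented pairwise kernel H(x, x'; theta) = 0.5 * (theta - (x - x')**2)**2,
# which is 1-strongly convex and 1-smooth in theta. The empirical risk
# minimizer is the average of (x - x')**2 over all pairs, i.e. twice the
# unbiased sample variance.
rng = random.Random(1)
data = [rng.gauss(0.0, 1.0) for _ in range(5000)]

def incomplete_grad(theta, batch_size):
    # Incomplete U-statistic gradient estimate: average the pairwise
    # gradient over batch_size pairs drawn with replacement.
    g = 0.0
    for _ in range(batch_size):
        i, j = rng.sample(range(len(data)), 2)
        g += theta - (data[i] - data[j]) ** 2
    return g / batch_size

theta, gamma1, beta = 0.0, 1.0, 0.75  # step size gamma_t = gamma1 / t**beta
for t in range(1, 3001):
    theta -= (gamma1 / t ** beta) * incomplete_grad(theta, batch_size=16)

mean = sum(data) / len(data)
target = 2.0 * sum((x - mean) ** 2 for x in data) / (len(data) - 1)
assert abs(theta - target) < 0.15  # close to the empirical minimizer
```

With β ∈ (1/2, 1), the iterates settle near the empirical minimizer at a speed governed by the variance of the gradient estimate, as the proposition indicates.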
In the next section, we build on these results to derive a generalization bound in the spirit of [4] which explicitly depends on the true variance of the gradient estimator.\n\n4 Generalization Bounds\n\nLet θ* = argmin_{θ∈Θ} L(θ) be the minimizer of the true risk. As proposed in [4], the mean excess risk can be decomposed as follows: ∀ n ∈ N^{*K},\n\n  E[L(θ_t) − L(θ*)] ≤ 2 E[sup_{θ∈Θ} |L̂_n(θ) − L(θ)|] + E[L̂_n(θ_t) − L̂_n(θ*_n)],   (10)\n\nwhere the first term on the right-hand side is (twice) the estimation error E_1 and the second is the optimization error E_2. Beyond the optimization error E_2, analyzing the generalization ability of the learning method previously described requires controlling the estimation error E_1. This can be achieved by means of the result stated below, which extends Corollary 3 in [5] to the K-sample situation.\n\nProposition 3. Let H be a collection of bounded symmetric kernels on ∏_{k=1}^K X_k^{d_k} such that M_H = sup_{(H,x)∈H×X} |H(x)| < +∞. Suppose also that H is a VC major class of functions with finite Vapnik-Chervonenkis dimension V < +∞. Let κ = min{⌊n_1/d_1⌋, . . . , ⌊n_K/d_K⌋}. Then, for any n ∈ N^{*K},\n\n  E[sup_{H∈H} |U_n(H) − μ(H)|] ≤ M_H {2 √(2V log(1 + κ)/κ)}.   (11)\n\nWe are now ready to derive our main result.\n\nTheorem 2. Let θ_t be the sequence generated by SGD using the incomplete statistic gradient estimator (6) with B = ∏_{k=1}^K \binom{n'_k}{d_k} terms for some n'_1, . . . 
, n'_K. Assume that {L(·; θ) : θ ∈ Θ} is a VC major class of finite VC dimension V such that\n\n  M_Θ = sup_{θ∈Θ, (x^{(1)}, ..., x^{(K)}) ∈ ∏_{k=1}^K X_k^{d_k}} |H(x^{(1)}, . . . , x^{(K)}; θ)| < +∞,\n\nand N_Θ = sup_{θ∈Θ} σ²_θ < +∞. If the step size satisfies the condition of Proposition 2, we have: ∀ n ∈ N^{*K},\n\n  E[|L(θ_t) − L(θ*)|] ≤ C N_Θ/(B t^β) + 2M_Θ {2 √(2V log(1 + κ)/κ)}.   (12)\n\nFor any δ ∈ (0, 1), we also have with probability at least 1 − δ: ∀ n ∈ N^{*K},\n\n  |L(θ_t) − L(θ*)| ≤ C N_Θ/(B t^β) + √(D_β log(2/δ)/t^β) + 2M_Θ {2 √(2V log(1 + κ)/κ) + √(log(4/δ)/κ)},   (13)\n\nfor some constants C and D_β depending on the parameters l, α, γ_1, a_1.\n\nThe generalization bound provided by Theorem 2 shows the advantage of using an incomplete U-statistic (6) as the gradient estimator. In particular, we can obtain results of the same form as Theorem 2 for the complete U-statistic estimator (5), but B = ∏_{k=1}^K \binom{n'_k}{d_k} is then replaced by ∑_{k=1}^K n'_k/d_k (following Proposition 1), leading to greatly weakened bounds. Using an incomplete U-statistic, we thus achieve better performance on the test set while reducing the number of iterations (and therefore the number of gradient computations) required to converge to an accurate solution. To the best of our knowledge, this is the first result of this type for empirical minimization of U-statistics. 
In the next section, we provide experiments showing that these gains are very significant in practice.\n\n5 Numerical Experiments\n\nIn this section, we provide numerical experiments to compare the incomplete and complete U-statistic gradient estimators (5) and (6) in SGD when they rely on the same number of terms B. The datasets we use are available online.1 In all experiments, we randomly split the data into 80% training set and 20% test set, and sample 100K pairs from the test set to estimate the test performance. We used a step size of the form γ_t = γ_1/t, and the results below are with respect to the number of SGD iterations. Computational time comparisons can be found in the supplementary material.\n\nAUC Optimization. We address the problem of learning a binary classifier by optimizing the Area Under the Curve, which corresponds to the VUS criterion (Eq. 2) when K = 2. Given a sequence of i.i.d. observations Z_i = (X_i, Y_i) where X_i ∈ R^p and Y_i ∈ {−1, 1}, we denote by X⁺ = {X_i; Y_i = 1}, X⁻ = {X_i; Y_i = −1} and N = |X⁺||X⁻|. 
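The experiments below optimize a pairwise logistic surrogate of the AUC. A minimal sketch of this approach on synthetic data (our illustration, not the paper's experimental code): SGD on the pairwise logistic loss, drawing B positive-negative pairs with replacement at each iteration.

```python
import math
import random

rng = random.Random(2)
dim = 3
theta_true = [1.0, -2.0, 0.5]  # hypothetical ground-truth direction

def draw(shift):
    # Gaussian point shifted along theta_true, so positives score higher.
    return [rng.gauss(0.0, 1.0) + shift * t for t in theta_true]

pos = [draw(+0.7) for _ in range(500)]
neg = [draw(-0.7) for _ in range(500)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def auc(theta):
    # Two-sample U-statistic of degrees (1, 1): rate of concordant pairs.
    wins = sum(1 for xp in pos for xn in neg if dot(theta, xp) > dot(theta, xn))
    return wins / (len(pos) * len(neg))

# SGD on the pairwise logistic loss log(1 + exp(s(x-) - s(x+))), with an
# incomplete U-statistic gradient: B pairs drawn with replacement per step.
theta, gamma1, B = [0.0] * dim, 1.0, 25
for t in range(1, 1001):
    grad = [0.0] * dim
    for _ in range(B):
        xp, xn = rng.choice(pos), rng.choice(neg)
        margin = dot(theta, xp) - dot(theta, xn)
        w = -1.0 / (1.0 + math.exp(margin))  # derivative of the logistic loss
        for k in range(dim):
            grad[k] += w * (xp[k] - xn[k])
    theta = [tk - (gamma1 / t) * gk / B for tk, gk in zip(theta, grad)]

assert auc(theta) > 0.8  # learned scorer ranks positives above negatives
```

Each iteration touches only B pairs out of the |X⁺||X⁻| possible ones, which is what keeps the per-iteration cost independent of the dataset size.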
As done in [27, 13], we take a linear scoring rule s_θ(x) = θ^T x, where θ ∈ R^p is the parameter to learn, and use the logistic loss as a smooth convex function upper bounding the Heaviside function, leading to the following ERM problem:\n\n  min_{θ∈R^p} (1/N) ∑_{X_i⁺ ∈ X⁺} ∑_{X_j⁻ ∈ X⁻} log(1 + exp(s_θ(X_j⁻) − s_θ(X_i⁺))).\n\n1 http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/\n\nFigure 1: Average over 50 runs of the risk estimate with the number of iterations (solid lines) +/− their standard deviation (dashed lines). Panels: (a) Covtype, batch size = 9, γ_1 = 1; (b) Covtype, batch size = 400, γ_1 = 1; (c) Ijcnn1, batch size = 25, γ_1 = 2; (d) Ijcnn1, batch size = 100, γ_1 = 5.\n\nWe use two datasets: IJCNN1 (∼200K examples, 22 features) and covtype (∼600K examples, 54 features). We try different values for the initial step size γ_1 and the batch size B. Some results, averaged over 50 runs of SGD, are displayed in Figure 1. As predicted by our theoretical findings, we found that the incomplete U-statistic estimator always outperforms its complete variant. The performance gap between the two strategies can be small (for instance when B is very large or γ_1 is unnecessarily small), but for values of the parameters that are relevant in practical scenarios (i.e., B reasonably small and γ_1 ensuring a significant decrease in the objective function), the difference can be substantial. We also observe a smaller variance between SGD runs with the incomplete version.\n\nMetric Learning. We now turn to a metric learning formulation, where we are given a sample of N i.i.d. observations Z_i = (X_i, Y_i) where X_i ∈ R^p and Y_i ∈ {1, . . . , c}. 
Following the existing literature [2], we focus on (pseudo) distances of the form D_M(x, x') = (x − x')^T M (x − x'), where M is a p × p symmetric positive semi-definite matrix. We again use the logistic loss to obtain a convex and smooth surrogate for (3). The ERM problem is as follows: