{"title": "Generalization Error Bounds for Aggregation by Mirror Descent with Averaging", "book": "Advances in Neural Information Processing Systems", "page_first": 603, "page_last": 610, "abstract": null, "full_text": "Generalization Error Bounds for Aggregation by Mirror Descent with Averaging

Anatoli Juditsky
Laboratoire de Modélisation et Calcul - Université Grenoble I
B.P. 53, 38041 Grenoble, France
anatoli.iouditski@imag.fr

Alexander Nazin
Institute of Control Sciences - Russian Academy of Science
65, Profsoyuznaya str., GSP-7, Moscow, 117997, Russia
nazine@ipu.rssi.ru

Alexandre Tsybakov
Laboratoire de Probabilités et Modèles Aléatoires - Université Paris VI
4, place Jussieu, 75252 Paris Cedex, France
tsybakov@ccr.jussieu.fr

Nicolas Vayatis
Laboratoire de Probabilités et Modèles Aléatoires - Université Paris VI
4, place Jussieu, 75252 Paris Cedex, France
vayatis@ccr.jussieu.fr

Abstract

We consider the problem of constructing an aggregated estimator from a finite class of base functions which approximately minimizes a convex risk functional under the ℓ1 constraint. For this purpose, we propose a stochastic procedure, the mirror descent, which performs gradient descent in the dual space. The generated estimates are additionally averaged in a recursive fashion with specific weights. Mirror descent algorithms have been developed in different contexts and they are known to be particularly efficient in high dimensional problems. Moreover, their implementation is adapted to the online setting. The main result of the paper is an upper bound on the convergence rate for the generalization error.

1 Introduction

We consider the aggregation problem (cf. 
[16]) where we have at hand a finite class of M predictors which are to be combined linearly under an ℓ1 constraint ‖θ‖_1 = λ on the vector θ ∈ ℝ^M that determines the coefficients of the linear combination. In order to exhibit such a combination, we focus on the strategy of penalized convex risk minimization, which is motivated by recent statistical studies of boosting and SVM algorithms [11, 14, 18]. Moreover, we take a stochastic approximation approach, which is particularly relevant in the online setting since it leads to recursive algorithms where the update uses a single data observation per iteration step. In this paper, we consider a general setting for which we propose a novel stochastic gradient algorithm and show tight upper bounds on its expected accuracy. Our algorithm builds on the ideas of mirror descent methods, first introduced by Nemirovski and Yudin [12], which consider updates of the gradient in the dual space. The mirror descent algorithm has been successfully applied in high dimensional problems, both in deterministic and stochastic settings [2, 7]. In the present work, we describe a particular instance of the algorithm with an entropy-like proxy function. This method presents similarities with the exponentiated gradient descent algorithm, which was derived under different motivations in [10]. A crucial distinction between the two is the additional averaging step in our version, which guarantees statistical performance. The idea of averaging recursive procedures is well known (see e.g. [13] and the references therein) and it has been invoked recently by Zhang [19] for the standard stochastic gradient descent (taking place in the initial parameter space). 
Also, it is worth noticing that most of the existing online methods are evaluated in terms of relative loss bounds, which are related to the empirical risk, while we focus on generalization error bounds (see [4, 5, 10] for insights on connections between the two types of criteria). The rest of the paper is organized as follows. We first introduce the setup (Section 2), then we describe the algorithm and state the main convergence result (Section 3). Further, we provide the intuition underlying the proposed algorithm and compare it to other methods (Section 4). We end with a technical section dedicated to the proof of our main result (Section 5).

2 Setup and notations

Let Z be a random variable with values in a measurable space (Z, A). We set a parameter λ > 0 and an integer M ≥ 2. The unknown parameter is a vector θ ∈ ℝ^M which is compelled to stay in the decision set Θ = Θ_{M,λ} defined by:

Θ_{M,λ} = { θ = (θ^(1), ..., θ^(M))^T ∈ ℝ^M_+ : Σ_{i=1}^M θ^(i) = λ } .   (1)

Now we introduce the loss function Q : Θ × Z → ℝ_+ such that the random function Q(·, Z) : Θ → ℝ_+ is convex for almost all Z, and define the convex risk function A : Θ → ℝ_+ to be minimized as follows:

A(θ) = E Q(θ, Z) .   (2)

Assume a training sample is given in the form of a sequence (Z_1, ..., Z_{t−1}), where each Z_i has the same distribution as Z. We assume for simplicity that the training sequence is i.i.d., though this assumption can be weakened.

We propose to minimize the convex target function A over the decision set Θ on the basis of the stochastic sub-gradients of Q:

u_i(θ) = ∇_θ Q(θ, Z_i) ,   i = 1, 2, . . .   (3)

Note that the expectations E u_i(·) belong to the sub-differential of A(·). In the sequel, we will characterize the accuracy of an estimate θ̂_t = θ̂_t(Z_1, ..., Z_{t−1}) ∈ Θ of the minimizer of A by the excess risk:

E A(θ̂_t) − min_{θ∈Θ} A(θ) ,   (4)

where the expectation is taken over the sample (Z_1, ..., Z_{t−1}).

We now introduce the notation that is necessary to present the algorithm in the next section. For a vector z = (z^(1), ..., z^(M))^T ∈ ℝ^M, define the norms

‖z‖_1 := Σ_{j=1}^M |z^(j)| ,   ‖z‖_∞ := max_{‖θ‖_1=1} z^T θ = max_{j=1,...,M} |z^(j)| .

The space ℝ^M equipped with the norm ‖·‖_1 is called the primal space E, and the same space equipped with the dual norm ‖·‖_∞ is called the dual space E*. Introduce a so-called entropic proxy function:

∀ θ ∈ Θ,   V(θ) = λ ln(M/λ) + Σ_{j=1}^M θ^(j) ln θ^(j) ,   (5)

which has its minimum at θ_0 = (λ/M, ..., λ/M)^T. It is easy to check that this function is α-strongly convex with respect to the norm ‖·‖_1 with parameter α = 1/λ, i.e.,

V(sx + (1−s)y) ≤ s V(x) + (1−s) V(y) − (α/2) s(1−s) ‖x − y‖_1^2   (6)

for all x, y ∈ Θ and any s ∈ [0, 1]. Let β > 0 be a parameter. We call β-conjugate of V the following convex transform:

∀ z ∈ ℝ^M,   W_β(z) := sup_{θ∈Θ} { −z^T θ − β V(θ) } .

As it straightforwardly follows from (5), the β-conjugate is given here by:

∀ z ∈ ℝ^M,   W_β(z) = λβ ln( (1/M) Σ_{k=1}^M e^{−z^(k)/β} ) ,   (7)

which has a Lipschitz-continuous gradient w.r.t. ‖·‖_1, namely,

‖∇W_β(z) − ∇W_β(z̃)‖_1 ≤ (λ/β) ‖z − z̃‖_∞ ,   ∀ z, z̃ ∈ ℝ^M .   (8)

Though we will focus on a particular algorithm based on the entropic proxy function, our results apply to a generic algorithmic scheme which takes advantage of the general properties of convex transforms (see [8] for details). The key property in the proof is the inequality (8).

3 Algorithm and main result

The mirror descent algorithm is a stochastic gradient algorithm in the dual space. At each iteration i, a new data point Z_i is observed and there are two updates: one is the value ζ_i resulting from the stochastic gradient descent in the dual space, the other is the update of the parameter θ_i, which is the "mirror image" of ζ_i. In order to tune the algorithm properly, we need two fixed positive sequences (γ_i)_{i≥1} (stepsize) and (β_i)_{i≥1} (temperature) such that β_i ≥ β_{i−1}. The mirror descent algorithm with averaging is as follows:

Algorithm.

• Fix the initial values θ_0 ∈ Θ and ζ_0 = 0 ∈ ℝ^M.
• For i = 1, ..., t − 1, do

ζ_i = ζ_{i−1} + γ_i u_i(θ_{i−1}) ,   θ_i = −∇W_{β_i}(ζ_i) .   (9)

At this point, we have actually described a class of algorithms. Given the observations of the stochastic sub-gradient (3), particular choices of the proxy function V and of the stepsize and temperature parameters will determine the algorithm completely. We discuss these choices in more detail in [8]. 
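To make the procedure concrete, here is a minimal Python sketch of the dual update (9), the stepsize/temperature schedule (11) and the averaging step (10). It is an illustration under stated assumptions, not the authors' implementation; for the entropic proxy, −∇W_β(ζ) is a λ-scaled softmax of −ζ/β, which is what `mirror_map` computes.

```python
import math

def mirror_map(zeta, lam, beta):
    """theta = -grad W_beta(zeta): a lambda-scaled softmax of -zeta/beta."""
    m = min(zeta)  # shift before exponentiating for numerical stability; the ratio is unchanged
    w = [math.exp(-(z - m) / beta) for z in zeta]
    s = sum(w)
    return [lam * wi / s for wi in w]

def mirror_descent_average(subgrad, data, lam, L, M):
    """Updates (9) with the schedule (11) (gamma_i = 1, beta_i = beta0*sqrt(i+1),
    beta0 = L/sqrt(ln M)), followed by the averaging step (10)."""
    beta0 = L / math.sqrt(math.log(M))
    zeta = [0.0] * M
    theta = [lam / M] * M                            # theta_0: the minimizer of the proxy V
    avg = [0.0] * M
    for i, z in enumerate(data, start=1):
        avg = [a + th for a, th in zip(avg, theta)]  # accumulate theta_{i-1}
        u = subgrad(theta, z)                        # stochastic sub-gradient u_i(theta_{i-1})
        zeta = [zc + uc for zc, uc in zip(zeta, u)]  # dual update, gamma_i = 1
        theta = mirror_map(zeta, lam, beta0 * math.sqrt(i + 1))
    avg = [a + th for a, th in zip(avg, theta)]      # include the last iterate
    n = len(data) + 1
    return [a / n for a in avg]
```

As a hypothetical toy instance of the Example of Section 3, one can feed it the hinge-loss sub-gradient of Q(θ, Z) = φ(Y θ^T H(X)) with φ(v) = max(0, 1 − v) and M = 2 fixed base predictors H(x) = (1, −1); with labels Y = 1 the averaged iterate puts most of the mass λ on the first coordinate.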
The algorithm is completed by an averaging step:

• Output at iteration t the following convex combination:

θ̂_t = Σ_{i=1}^t γ_i θ_{i−1} / Σ_{j=1}^t γ_j .   (10)

In this paper, we focus on the entropic proxy function and consider a nearly optimal choice for the stepsize and temperature parameters, which is the following:

γ_i ≡ 1 ,   β_i = β_0 √(i+1) ,   i = 1, 2, . . . ,   β_0 > 0 .   (11)

We can now state our rate of convergence result.

Theorem. Assume that the loss function Q satisfies the following boundedness condition:

sup_{θ∈Θ} E ‖∇_θ Q(θ, Z)‖_∞^2 ≤ L^2 < ∞ .   (12)

Fix also β_0 = L/√(ln M). Then, for any integer t ≥ 1, the excess risk of the estimate θ̂_t described above satisfies the following bound:

E A(θ̂_t) − min_{θ∈Θ} A(θ) ≤ 2Lλ (ln M)^{1/2} √(t+1) / t .   (13)

Example. Consider the setting of supervised learning where the data are modelled by a pair (X, Y), with X ∈ X an observation vector and Y a label, either integer (classification) or real-valued (regression). Boosting and SVM algorithms are related to the minimization of a functional

R(f) = E φ(Y f(X)) ,

where φ is a convex non-negative cost function (typically exponential, logit or hinge loss) and f belongs to a given class of combined predictors. The aggregation problem consists in finding the best linear combination of elements from a finite set of predictors {h_1, ..., h_M} with h_j : X → [−K, K]. In compact notation, this means that we search for f of the form f = θ^T H, with H denoting the vector-valued function whose components are the base predictors:

H(x) = (h_1(x), ..., h_M(x))^T ,

and θ belonging to a decision set Θ = Θ_{M,λ}. 
Take for instance φ to be non-increasing. It is easy to see that this problem can be interpreted in terms of our general setting with Z = (X, Y), Q(θ, Z) = φ(Y θ^T H(X)) and L = Kφ′(Kλ).

4 Discussion

In this section, we provide some insights on the method and the result of the previous section.

4.1 Heuristics

Suppose that we want to minimize a convex function θ ↦ A(θ) over a convex set Θ. If θ_0, ..., θ_{t−1} are the available search points at iteration t, we can form the affine approximations φ_i of the function A defined, for θ ∈ Θ, by

φ_i(θ) = A(θ_{i−1}) + (θ − θ_{i−1})^T ∇A(θ_{i−1}) ,   i = 1, ..., t .

Here θ ↦ ∇A(θ) is a vector function belonging to the sub-gradient of A(·). Taking a convex combination of the φ_i's, we obtain an averaged approximation of A(θ):

φ̄_t(θ) = Σ_{i=1}^t γ_i ( A(θ_{i−1}) + (θ − θ_{i−1})^T ∇A(θ_{i−1}) ) / Σ_{i=1}^t γ_i .   (14)

At first glance, it would seem reasonable to choose as the next search point a vector θ ∈ Θ minimizing the approximation φ̄_t, i.e.,

θ_t = argmin_{θ∈Θ} φ̄_t(θ) = argmin_{θ∈Θ} θ^T ( Σ_{i=1}^t γ_i ∇A(θ_{i−1}) ) .   (15)

However, this does not make any progress, because our approximation is "good" only in the vicinity of the search points θ_0, ..., θ_{t−1}. Therefore, it is necessary to modify the criterion, for instance by adding a special penalty B_t(θ, θ_{t−1}) to the target function in order to keep the next search point θ_t in the desired region. Thus, one chooses the point:

θ_t = argmin_{θ∈Θ} [ θ^T ( Σ_{i=1}^t γ_i ∇A(θ_{i−1}) ) + B_t(θ, θ_{t−1}) ] .

Our algorithm corresponds to a specific type of penalty B_t(θ, θ_{t−1}) = β_t V(θ), where V is the proxy function. Also note that in our problem the vector function ∇A(·) is not available. Therefore, we replace in (15) the unknown gradients ∇A(θ_{i−1}) by the observed stochastic sub-gradients u_i(θ_{i−1}). This yields a new definition of the t-th search point:

θ_t = argmin_{θ∈Θ} [ θ^T ( Σ_{i=1}^t γ_i u_i(θ_{i−1}) ) + β_t V(θ) ] = argmax_{θ∈Θ} [ −ζ_t^T θ − β_t V(θ) ] ,   (16)

where ζ_t = Σ_{i=1}^t γ_i u_i(θ_{i−1}). By a standard result of convex analysis (see e.g. [3]), the solution to this problem reads as −∇W_{β_t}(ζ_t), and it is now easy to deduce the iterative scheme (9) of the mirror descent algorithm.

4.2 Comparison with previous work

The versions of the mirror descent method proposed in [12] are somewhat different from our iterative scheme (9). One of them, closest to ours, is studied in detail in [3]. It is based on the recursive relation

θ_i = −∇W_1( −∇V(θ_{i−1}) + γ_i u_i(θ_{i−1}) ) ,   i = 1, 2, . . . ,   (17)

where the function V is strongly convex with respect to the norm of the initial space E (which is not necessarily the space ℓ_1^M) and W_1 is the 1-conjugate function of V. If Θ = ℝ^M and V(θ) = (1/2)‖θ‖_2^2, the scheme (17) coincides with the ordinary gradient method.

For the unit simplex Θ = Θ_{M,1} and the entropy-type proxy function V from (5) with λ = 1, the coordinates θ_i^(j) of the vector θ_i from (17) are:

∀ j = 1, ..., M,   θ_i^(j) = θ_0^(j) exp( −Σ_{m=1}^i γ_m u_{m,j}(θ_{m−1}) ) / Σ_{k=1}^M θ_0^(k) exp( −Σ_{m=1}^i γ_m u_{m,k}(θ_{m−1}) ) .   (18)

The algorithm is also known as the exponentiated gradient (EG) method [10]. The differences between the algorithm (17) and ours are the following:

• the initial iterative scheme of the Algorithm is different from that of (17); in particular, it includes the second tuning parameter β_i; moreover, the algorithm (18) uses the initial value θ_0 in a different manner;
• our algorithm contains the additional averaging step of the updates (10).

The convergence properties of the EG method (18) have been studied in a deterministic setting [6]. Namely, it has been shown that, under some assumptions, the difference A_t(θ_t) − min_{θ∈Θ_{M,1}} A_t(θ), where A_t is the empirical risk, is bounded by a constant depending on M and t. If this constant is small enough, these results show that the EG method provides good numerical minimizers of the empirical risk A_t. The averaging step allows the use of the results provided in [5] to derive generalization error bounds from relative loss bounds. This technique also leads to rates of convergence of order √((ln M)/t), but with a suboptimal multiplicative factor in λ.

Finally, we point out that the algorithm (17) may be deduced from the ideas mentioned in Subsection 4.1, which are studied in the literature on proximal methods within the field of convex optimization (see, e.g., [9, 1] and the references therein). 
Namely, under rather general conditions, the variable θ_i from (17) solves the minimization problem

θ_i = argmin_{θ∈Θ} ( θ^T γ_i u_i(θ_{i−1}) + B(θ, θ_{i−1}) ) ,   (19)

where the penalty B(θ, θ_{i−1}) = V(θ) − V(θ_{i−1}) − (θ − θ_{i−1})^T ∇V(θ_{i−1}) is the Bregman divergence between θ and θ_{i−1} related to the function V.

4.3 General comments

Performance and efficiency. The rate of convergence of order √(ln M)/√t is typical in the absence of low-noise assumptions (as they are introduced in [17]). Batch procedures based on minimization of the empirical convex risk functional achieve a similar rate. From the statistical point of view, there is no remarkable difference between batch procedures and our mirror-descent procedure. On the other hand, from the computational point of view, our procedure is quite comparable with direct stochastic gradient descent. However, the mirror-descent algorithm presents two major advantages over both batch and direct stochastic gradient methods: (i) its behavior with respect to the cardinality of the base class is better than for direct stochastic gradient descent (of the order of √(ln M) in the Theorem, instead of M or √M for direct stochastic gradient); (ii) mirror descent is more efficient, especially in high-dimensional problems, as its algorithmic complexity and memory requirements are of strictly smaller order than for the corresponding batch procedures (see [7] for a comparison).

Optimality of the rate of convergence. Using the techniques of [7] and [16], it is not hard to prove a minimax lower bound on the excess risk E A(θ̂_t) − min_{θ∈Θ_{M,λ}} A(θ) of order (ln M)^{1/2}/√t for M ≥ t^{1/2+δ} with some δ > 0. 
This indicates that the upper bound of the Theorem is rate optimal for such values of M.

Choice of the base class. We point out that the good behaviour of this method crucially relies on the choice of the base class of functions {h_j}_{1≤j≤M}. As far as theory is concerned, in order to provide a complete statistical analysis, one should establish approximation error bounds on the quantity inf_{f∈F_{M,λ}} A(f) − inf_f A(f), showing that the richness of the base class is reflected both by the diversity (orthogonality or independence) of the h_j's and by its cardinality M. For example, one can take the h_j's as the eigenfunctions associated with some positive definite kernel. We refer to [14], [15] for related results. The choice of λ can be motivated by similar considerations. In fact, to minimize the approximation error it might be useful to take λ depending on the sample size t and tending to infinity at some slow rate, as in [11]. A balance between the stochastic error as given in the Theorem and the approximation error would then determine the optimal choice of λ.

5 Proof of the Theorem

Introduce the notation ∇A(θ) = E u_i(θ) and ξ_i(θ) = u_i(θ) − ∇A(θ). Put v_i = u_i(θ_{i−1}), which gives ζ_i − ζ_{i−1} = γ_i v_i. 
By continuous differentiability of W_{β_{i−1}} and by (8) we have:

W_{β_{i−1}}(ζ_i) = W_{β_{i−1}}(ζ_{i−1}) + γ_i v_i^T ∇W_{β_{i−1}}(ζ_{i−1}) + γ_i ∫_0^1 v_i^T [ ∇W_{β_{i−1}}(τζ_i + (1−τ)ζ_{i−1}) − ∇W_{β_{i−1}}(ζ_{i−1}) ] dτ
  ≤ W_{β_{i−1}}(ζ_{i−1}) + γ_i v_i^T ∇W_{β_{i−1}}(ζ_{i−1}) + λγ_i^2 ‖v_i‖_∞^2 / (2β_{i−1}) .

Then, using the fact that (β_i)_{i≥1} is a non-decreasing sequence, that, for z fixed, β ↦ W_β(z) is a non-increasing function, and that θ_{i−1} = −∇W_{β_{i−1}}(ζ_{i−1}) by (9), we get

W_{β_i}(ζ_i) ≤ W_{β_{i−1}}(ζ_i) ≤ W_{β_{i−1}}(ζ_{i−1}) − γ_i θ_{i−1}^T v_i + λγ_i^2 ‖v_i‖_∞^2 / (2β_{i−1}) .

Summing up over the i's and using the representation ζ_t = Σ_{i=1}^t γ_i v_i, we get, since W_{β_0}(ζ_0) = 0:

∀ θ ∈ Θ,   Σ_{i=1}^t γ_i (θ_{i−1} − θ)^T v_i ≤ −W_{β_t}(ζ_t) − ζ_t^T θ + Σ_{i=1}^t λγ_i^2 ‖v_i‖_∞^2 / (2β_{i−1}) .

From the definition of W_β, we have, ∀ ζ ∈ ℝ^M and ∀ θ ∈ Θ, −W_{β_t}(ζ) − ζ^T θ ≤ β_t V(θ). Finally, since v_i = ∇A(θ_{i−1}) + ξ_i(θ_{i−1}), we get

Σ_{i=1}^t γ_i (θ_{i−1} − θ)^T ∇A(θ_{i−1}) ≤ β_t V(θ) − Σ_{i=1}^t γ_i (θ_{i−1} − θ)^T ξ_i(θ_{i−1}) + Σ_{i=1}^t λγ_i^2 ‖v_i‖_∞^2 / (2β_{i−1}) .

As we are to take expectations, we note that, conditioning on θ_{i−1} and using the independence between θ_{i−1} and Z_i, we have E[ (θ_{i−1} − θ)^T ξ_i(θ_{i−1}) ] = 0. Now, convexity of A and the previous display lead to:

∀ θ ∈ Θ,   E A(θ̂_t) − A(θ) ≤ Σ_{i=1}^t γ_i E[ (θ_{i−1} − θ)^T ∇A(θ_{i−1}) ] / Σ_{i=1}^t γ_i
  = (1/t) Σ_{i=1}^t E[ (θ_{i−1} − θ)^T ∇A(θ_{i−1}) ]
  ≤ (√(t+1)/t) ( β_0 V* + λL^2/β_0 ) ,

where we have set V* = max_{θ∈Θ} V(θ) and made use of the boundedness assumption E‖u_i(θ)‖_∞^2 ≤ L^2 and of the particular choice (11) of the stepsize and temperature parameters. Noticing that V* = λ ln M and optimizing this bound in β_0 > 0, we obtain the result.

Acknowledgments

We thank Nicolò Cesa-Bianchi for sharing with us his expertise on relative loss bounds.

References

[1] Beck, A. & Teboulle, M. (2003) Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31:167–175.

[2] Ben-Tal, A., Margalit, T. & Nemirovski, A. (2001) The Ordered Subsets Mirror Descent optimization method and its use for the Positron Emission Tomography reconstruction problem. SIAM J. on Optimization, 12:79–108.

[3] Ben-Tal, A. & Nemirovski, A.S. (1999) The conjugate barrier mirror descent method for non-smooth convex optimization. MINERVA Optimization Center Report, Technion Institute of Technology. Available at http://iew3.technion.ac.il/Labs/Opt/opt/Pap/CP MD.pdf

[4] Cesa-Bianchi, N. & Gentile, C. (2005) Improved risk tail bounds for on-line algorithms. Submitted.

[5] Cesa-Bianchi, N., Conconi, A. & Gentile, C. (2004) On the generalization ability of on-line learning algorithms. 
IEEE Transactions on Information Theory, 50(9):2050–2057.

[6] Helmbold, D.P., Kivinen, J. & Warmuth, M.K. (1999) Relative loss bounds for single neurons. IEEE Trans. on Neural Networks, 10(6):1291–1304.

[7] Juditsky, A. & Nemirovski, A. (2000) Functional aggregation for nonparametric estimation. Annals of Statistics, 28(3):681–712.

[8] Juditsky, A.B., Nazin, A.V., Tsybakov, A.B. & Vayatis, N. (2005) Recursive Aggregation of Estimators via the Mirror Descent Algorithm with Averaging. Technical Report LPMA, Université Paris 6. Available at http://www.proba.jussieu.fr/pageperso/vayatis/publication.html

[9] Kiwiel, K.C. (1997) Proximal minimization methods with generalized Bregman functions. SIAM J. Control Optim., 35:1142–1168.

[10] Kivinen, J. & Warmuth, M.K. (1997) Additive versus exponentiated gradient updates for linear prediction. Information and Computation, 132(1):1–64.

[11] Lugosi, G. & Vayatis, N. (2004) On the Bayes-risk consistency of regularized boosting methods (with discussion). Annals of Statistics, 32(1):30–55.

[12] Nemirovski, A.S. & Yudin, D.B. (1983) Problem Complexity and Method Efficiency in Optimization. Wiley-Interscience.

[13] Polyak, B.T. & Juditsky, A.B. (1992) Acceleration of stochastic approximation by averaging. SIAM J. Control Optim., 30:838–855.

[14] Scovel, J.C. & Steinwart, I. (2005) Fast Rates for Support Vector Machines. In Proceedings of the 18th Conference on Learning Theory (COLT 2005), Bertinoro, Italy.

[15] Tarigan, B. & van de Geer, S. (2004) Adaptivity of Support Vector Machines with ℓ1 Penalty. Preprint, University of Leiden.

[16] Tsybakov, A. (2003) Optimal Rates of Aggregation. Proceedings of COLT'03, LNCS, Springer, Vol. 2777:303–313.

[17] Tsybakov, A. (2004) Optimal aggregation of classifiers in statistical learning. Annals of Statistics, 32(1):135–166.

[18] Zhang, T. 
(2004) Statistical behavior and consistency of classification methods based on convex risk minimization (with discussion). Annals of Statistics, 32(1):56–85.

[19] Zhang, T. (2004) Solving large scale linear prediction problems using stochastic gradient descent algorithms. In Proceedings of ICML'04.
", "award": [], "sourceid": 2779, "authors": [{"given_name": "Anatoli", "family_name": "Juditsky", "institution": null}, {"given_name": "Alexander", "family_name": "Nazin", "institution": null}, {"given_name": "Alexandre", "family_name": "Tsybakov", "institution": null}, {"given_name": "Nicolas", "family_name": "Vayatis", "institution": null}]}