{"title": "From Stochastic Mixability to Fast Rates", "book": "Advances in Neural Information Processing Systems", "page_first": 1197, "page_last": 1205, "abstract": "Empirical risk minimization (ERM) is a fundamental learning rule for statistical learning problems where the data is generated according to some unknown distribution $\\mathsf{P}$ and returns a hypothesis $f$ chosen from a fixed class $\\mathcal{F}$ with small loss $\\ell$. In the parametric setting, depending upon $(\\ell, \\mathcal{F},\\mathsf{P})$ ERM can have slow $(1/\\sqrt{n})$ or fast $(1/n)$ rates of convergence of the excess risk as a function of the sample size $n$. There exist several results that give sufficient conditions for fast rates in terms of joint properties of $\\ell$, $\\mathcal{F}$, and $\\mathsf{P}$, such as the margin condition and the Bernstein condition. In the non-statistical prediction with expert advice setting, there is an analogous slow and fast rate phenomenon, and it is entirely characterized in terms of the mixability of the loss $\\ell$ (there being no role there for $\\mathcal{F}$ or $\\mathsf{P}$). The notion of stochastic mixability builds a bridge between these two models of learning, reducing to classical mixability in a special case. The present paper presents a direct proof of fast rates for ERM in terms of stochastic mixability of $(\\ell,\\mathcal{F}, \\mathsf{P})$, and in so doing provides new insight into the fast-rates phenomenon. The proof exploits an old result of Kemperman on the solution to the general moment problem. We also show a partial converse that suggests a characterization of fast rates for ERM in terms of stochastic mixability is possible.", "full_text": "From Stochastic Mixability to Fast Rates\n\nNishant A. Mehta\n\nResearch School of Computer Science\n\nAustralian National University\n\nnishant.mehta@anu.edu.au\n\nRobert C. 
Williamson

Research School of Computer Science

Australian National University and NICTA

bob.williamson@anu.edu.au

Abstract

Empirical risk minimization (ERM) is a fundamental learning rule for statistical learning problems where the data is generated according to some unknown distribution P and returns a hypothesis f chosen from a fixed class F with small loss ℓ. In the parametric setting, depending upon (ℓ, F, P), ERM can have slow (1/√n) or fast (1/n) rates of convergence of the excess risk as a function of the sample size n. There exist several results that give sufficient conditions for fast rates in terms of joint properties of ℓ, F, and P, such as the margin condition and the Bernstein condition. In the non-statistical prediction with expert advice setting, there is an analogous slow and fast rate phenomenon, and it is entirely characterized in terms of the mixability of the loss ℓ (there being no role there for F or P). The notion of stochastic mixability builds a bridge between these two models of learning, reducing to classical mixability in a special case. The present paper presents a direct proof of fast rates for ERM in terms of stochastic mixability of (ℓ, F, P), and in so doing provides new insight into the fast-rates phenomenon. The proof exploits an old result of Kemperman on the solution to the general moment problem. 
We also show a partial converse that suggests a characterization of fast rates for ERM in terms of stochastic mixability is possible.

1 Introduction

Recent years have unveiled central contact points between the areas of statistical and online learning. These include Abernethy et al.'s [1] unified Bregman-divergence-based analysis of online convex optimization and statistical learning; the online-to-batch conversion of the exponentially weighted average forecaster (a special case of the aggregating algorithm for mixable losses), which yields the progressive mixture rule, as can be seen e.g. from the work of Audibert [2]; and, most recently, Van Erven et al.'s [21] injection of the concept of mixability into the statistical learning space in the form of stochastic mixability. It is this last connection that will be our departure point for this work.

Mixability is a fundamental property of a loss that characterizes when constant regret is possible in the online learning game of prediction with expert advice [23]. Stochastic mixability is a natural adaptation of mixability to the statistical learning setting; in fact, in the special case where the function class consists of all possible functions from the input space to the prediction space, stochastic mixability is equivalent to mixability [21]. Just as Vovk and coworkers (see e.g. [24, 8]) have developed a rich convex geometric understanding of mixability, stochastic mixability can be understood as a sort of effective convexity.

In this work, we study the O(1/n)-fast-rate phenomenon in statistical learning from the perspective of stochastic mixability. Our motivation is that stochastic mixability might characterize fast rates in statistical learning. As a first step, Theorem 5 herein establishes via a rather direct argument that stochastic mixability implies an exact oracle inequality (i.e. with leading constant 1) with a fast rate for finite function classes, and Theorem 7 extends this result to VC-type classes. This result can be understood as a new chapter in an evolving narrative that started with Lee et al.'s [13] seminal paper showing fast rates for agnostic learning with squared loss over convex function classes, and that was continued by Mendelson [18], who showed that fast rates are possible for p-losses (y, ŷ) ↦ |y − ŷ|^p over effectively convex function classes by passing through a Bernstein condition (defined in (12)).

We also show that when stochastic mixability does not hold in a certain sense (described in Section 5), then the risk minimizer is not unique in a bad way. This is precisely the situation at the heart of the works of Mendelson [18] and Mendelson and Williamson [19], which show that having non-unique minimizers is symptomatic of bad geometry of the learning problem. In such situations, there are certain targets (i.e. output conditional distributions) close to the original target under which empirical risk minimization (ERM) learns at a slow rate, where the guilty target depends on the sample size and the target sequence approaches the original target asymptotically. Even the best known upper bounds have constants that blow up in the case of non-unique minimizers. Thus, whereas stochastic mixability implies fast rates, a sort of converse is also true: learning is hard in a "neighborhood" of statistical learning problems for which stochastic mixability does not hold. In addition, since a stochastically mixable problem's function class looks convex from the perspective of risk minimization, and since when stochastic mixability fails the function class looks non-convex from the same perspective (it has multiple well-separated minimizers), stochastic mixability characterizes the effective convexity of the learning problem from the perspective of risk minimization.

Much of the recent work on obtaining faster learning rates in agnostic learning has taken place in settings where a Bernstein condition holds, including results based on local Rademacher complexities [3, 10]. The Bernstein condition appears to have first been used by Bartlett and Mendelson [4] in their analysis of ERM; this condition is subtly different from the margin condition of Mammen and Tsybakov [15, 20], which has been used to obtain fast rates for classification. Lecué [12] pinpoints the difference between the two conditions: the margin condition applies to the excess loss relative to the best predictor (not necessarily in the model class), whereas the Bernstein condition applies to the excess loss relative to the best predictor in the model class. Our approach in this work is complementary to the approaches of previous works, coming from a different assumption that forms a bridge to the online learning setting. Yet this assumption is related; the Bernstein condition implies stochastic mixability under a bounded-losses assumption [21]. Further understanding the connection between the Bernstein condition and stochastic mixability is an ongoing effort.

Contributions. The core contribution of this work is to show a new path to the Õ(1/n)-fast rate in statistical learning. We are not aware of previous results that show fast rates from the stochastic mixability assumption. 
Secondly, we establish intermediate learning rates that interpolate between the fast and slow rate under a weaker notion of stochastic mixability. Finally, we show that in a certain sense stochastic mixability characterizes the effective convexity of the statistical problem.

In the next section we formally define the statistical problem, review stochastic mixability, and explain our high-level approach toward getting fast rates. This approach involves directly appealing to the Cramér-Chernoff method, from which nearly all known concentration inequalities arose in one way or another. In Section 3, we frame the problem of computing a particular moment of a certain excess loss random variable as a general moment problem. We sufficiently bound the optimal value of the moment, which allows for a direct application of the Cramér-Chernoff method. These results easily imply a fast-rates bound for finite classes that can be extended to parametric (VC-type) classes, as shown in Section 4. We describe in Section 5 how stochastic mixability characterizes a certain notion of convexity of the statistical learning problem. In Section 6, we extend the fast-rates results to classes that obey a notion we call weak stochastic mixability. Finally, Section 7 concludes this work with connections to related topics in statistical learning theory and a discussion of open problems.

2 Stochastic mixability, Cramér-Chernoff, and ERM

Let (ℓ, F, P) be a statistical learning problem with ℓ : Y × R → R₊ a nonnegative loss, F ⊂ R^X a compact function class, and P a probability measure over X × Y for input space X and output/target space Y. Let Z be a random variable defined as Z = (X, Y) ∼ P. We assume for all f ∈ F that ℓ(Y, f(X)) ≤ V almost surely (a.s.) for some constant V.

A probability measure P operates on functions and loss-composed functions as

    P f = E_{(X,Y)∼P} f(X)        P ℓ(·, f) = E_{(X,Y)∼P} ℓ(Y, f(X)).

Similarly, an empirical measure P_n associated with an n-sample z, comprising n iid samples (x₁, y₁), …, (x_n, y_n), operates on functions and loss-composed functions as

    P_n f = (1/n) Σ_{j=1}^n f(x_j)        P_n ℓ(·, f) = (1/n) Σ_{j=1}^n ℓ(y_j, f(x_j)).

Let f* be any function for which P ℓ(·, f*) = inf_{f∈F} P ℓ(·, f). For each f ∈ F define the excess risk random variable Z_f := ℓ(Y, f(X)) − ℓ(Y, f*(X)).

We frequently work with the following two subclasses. For any ε > 0, define the subclasses

    F_{⪯ε} := {f ∈ F : P Z_f ≤ ε}        F_{⪰ε} := {f ∈ F : P Z_f ≥ ε}.

2.1 Stochastic mixability

For η > 0, we say that (ℓ, F, P) is η-stochastically mixable if for all f ∈ F

    log E exp(−η Z_f) ≤ 0.        (1)

If η-stochastic mixability holds for some η > 0, then we say that (ℓ, F, P) is stochastically mixable. Throughout this paper it is assumed that the stochastic mixability condition holds, and we take η* to be the largest η such that η-stochastic mixability holds. Condition (1) has a rich history, beginning from the foundational thesis of Li [14], who studied the special case of η* = 1 in density estimation with log loss from the perspective of information geometry. The connections that Li showed between this condition and convexity were strengthened by Grünwald [6, 7] and Van Erven et al. [21].

2.2 Cramér-Chernoff

The high-level strategy taken here is to show that with high probability ERM will not select a fixed hypothesis function f with excess risk above a/n for some constant a > 0. For each hypothesis, this guarantee will flow from the Cramér-Chernoff method [5] by controlling the cumulant generating function (CGF) of −Z_f in a particular way to yield exponential concentration. This control will be possible because the η*-stochastic mixability condition implies that the CGF of −Z_f takes the value 0 at some η ≥ η*, a fact later exploited by our key tool, Theorem 3.

Let Z be a real-valued random variable. Applying Markov's inequality to an exponentially transformed random variable yields that, for any η ≥ 0 and t ∈ R,

    Pr(Z ≥ t) ≤ exp(−ηt + log E exp(ηZ));        (2)

the inequality is non-trivial only if t > E Z and η > 0.

2.3 Analysis of ERM

We consider the ERM estimator f̂_z := arg min_{f∈F} P_n ℓ(·, f). That is, given an n-sample z, ERM selects any f̂_z ∈ F minimizing the empirical risk P_n ℓ(·, f). We say ERM is ε-good when f̂_z ∈ F_{⪯ε}. In order to show that ERM is ε-good it is sufficient to show that for all f ∈ F \ F_{⪯ε} we have P_n Z_f > 0. The goal is to show that with high probability ERM is ε-good, and we will do this by showing that with high probability, uniformly for all f ∈ F \ F_{⪯ε}, we have P_n Z_f > t for some slack t > 0 that will come in handy later.

For a real-valued random variable X, recall that the cumulant generating function of X is η ↦ Λ_X(η) := log E e^{ηX}; we allow Λ_X(η) to be infinite for some η > 0.

Theorem 1 (Cramér-Chernoff Control on ERM). Let a > 0 and select f such that E Z_f > 0. Let t < E Z_f. If there exists η > 0 such that Λ_{−Z_f}(η) ≤ −a/n, then

    Pr{P_n ℓ(·, f) ≤ P_n ℓ(·, f*) + t} ≤ exp(−a + ηt).

Proof. Let Z_{f,1}, …, Z_{f,n} be iid copies of Z_f, and define the sum S_{f,n} := Σ_{j=1}^n (−Z_{f,j}). Since (−t) > E (1/n) S_{f,n}, from (2) we have

    Pr((1/n) Σ_{j=1}^n Z_{f,j} ≤ t) = Pr((1/n) S_{f,n} ≥ −t) ≤ exp(ηt + log E exp(η S_{f,n})) = exp(ηt) (E exp(−η Z_f))^n.

Making the replacement Λ_{−Z_f}(η) = log E exp(−η Z_f) yields

    log Pr((1/n) S_{f,n} ≥ −t) ≤ ηt + n Λ_{−Z_f}(η).

By assumption, Λ_{−Z_f}(η) ≤ −a/n, and so Pr{P_n Z_f ≤ t} ≤ exp(−a + ηt), as desired. ∎

This theorem will be applied by showing that for an excess loss random variable Z_f taking values in [−1, 1], if for some η > 0 we have E exp(−η Z_f) = 1 and if E Z_f = a/n for some constant a (that can and must depend on n), then Λ_{−Z_f}(η/2) ≤ −cηa/n, where c > 0 is a universal constant. This is the nature of the next section. We then extend this result to random variables taking values in [−V, V].

3 Semi-infinite linear programming and the general moment problem

The key subproblem now is to find, for each excess loss random variable Z_f with mean a/n and Λ_{−Z_f}(η) = 0 (for some η ≥ η*), a pair of constants η₀ > 0 and c > 0 for which Λ_{−Z_f}(η₀) ≤ −ca/n. Theorem 1 would then imply that ERM will prefer f* over this particular f with high probability for ca large enough. 
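Theorem 1 can be made concrete with a small numerical sketch. The two-point excess-loss distribution below is an assumed toy example (the atoms and their probabilities are illustrative, not from the paper); the code finds a root η of Λ_{−Z_f} by bisection and checks that the CGF evaluated at η/2 is strictly negative, so that Theorem 1 with t = 0 yields a bound decaying exponentially in n.

```python
import math

# Toy excess-loss variable Z_f: value -1 w.p. 0.1, value +0.2 w.p. 0.9 (illustrative).
atoms = [(-1.0, 0.1), (0.2, 0.9)]
mean = sum(z * p for z, p in atoms)  # E Z_f = 0.08 > 0

def cgf_neg(eta):
    """Cumulant generating function of -Z_f, i.e. Lambda_{-Z_f}(eta)."""
    return math.log(sum(p * math.exp(-eta * z) for z, p in atoms))

# Bisection for the nonzero root eta with Lambda_{-Z_f}(eta) = 0.
# The CGF is negative near 0 (since E(-Z_f) < 0) and positive for large eta.
lo, hi = 1e-6, 5.0
for _ in range(80):
    mid = (lo + hi) / 2
    lo, hi = (mid, hi) if cgf_neg(mid) < 0 else (lo, mid)
eta_star = (lo + hi) / 2

# By convexity of the CGF, Lambda_{-Z_f}(eta_star / 2) < 0, so Theorem 1 with
# t = 0 gives Pr(P_n Z_f <= 0) <= exp(n * Lambda_{-Z_f}(eta_star / 2)).
lam_half = cgf_neg(eta_star / 2)

def bound(n):
    return math.exp(n * lam_half)

print(mean, eta_star, lam_half, bound(10), bound(100))
```

The exponent here plays the role of −a/n in the theorem: the per-sample CGF deficit compounds into exponential concentration over n samples.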
This subproblem is in fact an instance of the general moment problem, a problem on which Kemperman [9] has conducted a very nice geometric study. We now describe this problem.

The general moment problem. Let P(A) be the space of probability measures over a measurable space A = (A, S). For real-valued measurable functions h and (g_j)_{j∈[m]} on a measurable space A = (A, S), the general moment problem is

    inf_{μ∈P(A)} E_{X∼μ} h(X)
    subject to E_{X∼μ} g_j(X) = y_j,    j ∈ {1, …, m}.        (3)

Let the vector-valued map g : A → R^m be defined in terms of coordinate functions as (g(x))_j = g_j(x), and let the vector y ∈ R^m be equal to (y₁, …, y_m). Let D* ⊂ R^{m+1} be the set

    D* := { d* = (d₀, d₁, …, d_m) ∈ R^{m+1} : h(x) ≥ d₀ + Σ_{j=1}^m d_j g_j(x) for all x ∈ A }.        (4)

Theorem 3 of [9] states that if y ∈ int conv g(A), the optimal value of problem (3) equals

    sup { d₀ + Σ_{j=1}^m d_j y_j : d* = (d₀, d₁, …, d_m) ∈ D* }.        (5)

Our instantiation. We choose A = [−1, 1], set m = 2, and define h, (g_j)_{j∈{1,2}}, and y ∈ R² as

    h(x) = −e^{(η/2)x},    g₁(x) = x,    g₂(x) = e^{ηx},    y₁ = −a/n,    y₂ = 1,

for any η > 0, a > 0, and n ∈ N. 
This yields the following instantiation of problem (3):

    inf_{μ∈P([−1,1])} E_{X∼μ} −e^{(η/2)X}        (6a)
    subject to E_{X∼μ} X = −a/n        (6b)
    E_{X∼μ} e^{ηX} = 1.        (6c)

Note that equation (5) from the general moment problem now instantiates to

    sup { d₀ − (a/n) d₁ + d₂ : d* = (d₀, d₁, d₂) ∈ D* },        (7)

with D* equal to the set

    { d* = (d₀, d₁, d₂) ∈ R³ : −e^{(η/2)x} ≥ d₀ + d₁ x + d₂ e^{ηx} for all x ∈ [−1, 1] }.        (8)

Applying Theorem 3 of [9] requires the condition y ∈ int conv g([−1, 1]). We first characterize when y ∈ conv g([−1, 1]) holds and handle the int conv g([−1, 1]) version after Theorem 3.

Lemma 2 (Feasible Moments). The point y = (−a/n, 1) ∈ conv g([−1, 1]) if and only if

    a/n ≤ (e^η + e^{−η} − 2)/(e^η − e^{−η}) = (cosh(η) − 1)/sinh(η).        (9)

Proof. Let W denote the convex hull of g([−1, 1]). We need to see if (−a/n, 1) ∈ W. Note that W is the convex set formed by starting with the graph of x ↦ e^{ηx} on the domain [−1, 1], including the line segment connecting this curve's endpoints (−1, e^{−η}) and (1, e^η), and including all of the points below this line segment but above the aforementioned graph. That is, W is precisely the set

    W := { (x, y) ∈ R² : e^{ηx} ≤ y ≤ (e^η + e^{−η})/2 + ((e^η − e^{−η})/2) x, x ∈ [−1, 1] }.

It remains to check that 1 is sandwiched between the lower and upper bounds at x = −a/n. Clearly the lower bound holds. Simple algebra shows that the upper bound is equivalent to condition (9). ∎

Note that if (9) does not hold, then the semi-infinite linear program (6) is infeasible; infeasibility in turn implies that such an excess loss random variable cannot exist. Thus, we need not worry about whether (9) holds; it holds for any excess loss random variable satisfying constraints (6b) and (6c).

The following theorem is a key technical result for using stochastic mixability to control the CGF. The proof is long and can be found in Appendix A.

Theorem 3 (Stochastic Mixability Concentration). Let f be an element of F with Z_f taking values in [−1, 1], n ∈ N, E Z_f = a/n for some a > 0, and Λ_{−Z_f}(η) = 0 for some η > 0. If

    a/n < (e^η + e^{−η} − 2)/(e^η − e^{−η}),        (10)

then

    E e^{(η/2)(−Z_f)} ≤ 1 − 0.18(η ∧ 1) a/n.

Note that since log(1 − x) ≤ −x when x < 1, we have Λ_{−Z_f}(η/2) ≤ −0.18(η ∧ 1) a/n.

In order to apply Theorem 3, we need (10) to hold, but only (9) is guaranteed to hold. The corner case is when (9) holds with equality. However, observe that one can always approximate the random variable X by a perturbed version X′ which has a nearly identical mean a′ ≈ a and a nearly identical η′ ≈ η for which E_{X′∼μ′} e^{η′X′} = 1, and yet the inequality in (9) is strict. 
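As a quick numerical check of Lemma 2 (with an illustrative choice of η), the feasibility boundary (e^η + e^{−η} − 2)/(e^η − e^{−η}) equals tanh(η/2), and it is attained by the two-point distribution on the chord endpoints {−1, +1} whose masses are pinned down by the constraint E e^{ηX} = 1:

```python
import math

def max_feasible_mean(eta):
    # Lemma 2: y = (-a/n, 1) lies in conv g([-1, 1]) iff a/n <= (cosh(eta) - 1)/sinh(eta).
    return (math.cosh(eta) - 1) / math.sinh(eta)

eta = 1.3  # illustrative value
# Boundary case: all mass on the endpoints -1 and +1; E e^{eta X} = 1 fixes the mass p at -1.
p = (math.exp(eta) - 1) / (math.exp(eta) - math.exp(-eta))
mean = -p + (1 - p)  # equals -(cosh(eta) - 1)/sinh(eta) = -tanh(eta/2)
print(max_feasible_mean(eta), -mean, math.tanh(eta / 2))
```

So the largest feasible excess risk a/n shrinks to 0 as η → 0 and approaches 1 as η → ∞, matching the geometry of the chord over the graph of x ↦ e^{ηx}.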
Later, in the proof of Theorem 5, for any random variable that required perturbation to satisfy the interior condition (10), we implicitly apply the analysis to the perturbed version, show that ERM would not pick the (slightly different) function corresponding to the perturbed version, and use the closeness of the two functions to show that ERM also would not pick the original function.

We now present a necessary extension for the case of losses with range [0, V], proved in Appendix A.

Lemma 4 (Bounded Losses). Let g₁(x) = x and y₂ = 1 be common settings for the following two problems. The instantiation of problem (3) with A = [−V, V], h(x) = −e^{(η/2)x}, g₂(x) = e^{ηx}, and y₁ = −a/n has the same optimal value as the instantiation of problem (3) with A = [−1, 1], h(x) = −e^{(Vη/2)x}, g₂(x) = e^{(Vη)x}, and y₁ = −(a/V)/n.

4 Fast rates

We now show how the above results can be used to obtain an exact oracle inequality with a fast rate. We first present a result for finite classes and then present a result for VC-type classes (classes with logarithmic universal metric entropy).

Theorem 5 (Finite Classes Exact Oracle Inequality). Let (ℓ, F, P) be η*-stochastically mixable, where |F| = N, ℓ is a nonnegative loss, and sup_{f∈F} ℓ(Y, f(X)) ≤ V a.s. for a constant V. Then for all n ≥ 1, with probability at least 1 − δ,

    P ℓ(·, f̂_z) ≤ P ℓ(·, f*) + 6 max{V, 1/η*} (log(1/δ) + log N) / n.

Proof. Let γ_n = a/n for a constant a to be fixed later. For each η > 0, let F^{(η)}_{⪰γ_n} ⊂ F_{⪰γ_n} correspond to those functions in F_{⪰γ_n} for which η is the largest constant such that E exp(−η Z_f) = 1. Let F^{hyper}_{⪰γ_n} ⊂ F_{⪰γ_n} correspond to functions f in F_{⪰γ_n} for which lim_{η→∞} E exp(−η Z_f) < 1. Clearly, F_{⪰γ_n} = (∪_{η∈[η*,∞)} F^{(η)}_{⪰γ_n}) ∪ F^{hyper}_{⪰γ_n}. The excess loss random variables corresponding to elements f ∈ F^{hyper}_{⪰γ_n} are "hyper-concentrated" in the sense that they are infinitely stochastically mixable. However, Lemma 10 in Appendix B shows that for each hyper-concentrated Z_f, there exists another excess loss random variable Z′_f with mean arbitrarily close to that of Z_f, with E exp(−η Z′_f) = 1 for some arbitrarily large but finite η, and with Z′_f ≤ Z_f with probability 1. The last property implies that the empirical risk of Z′_f is no greater than that of Z_f; hence for each hyper-concentrated Z_f it is sufficient (from the perspective of ERM) to study a corresponding Z′_f. From now on, we implicitly make this replacement in F_{⪰γ_n} itself, so that we now have F_{⪰γ_n} = ∪_{η∈[η*,∞)} F^{(η)}_{⪰γ_n}.

Consider an arbitrary a > 0. For some fixed η ∈ [η*, ∞) for which |F^{(η)}_{⪰γ_n}| > 0, consider the subclass F^{(η)}_{⪰γ_n}. Individually for each such function, we will apply Theorem 1 as follows. From Lemma 4, we have Λ_{−Z_f}(η/2) = Λ_{−(1/V)Z_f}(Vη/2). From Theorem 3, the latter is at most

    −0.18 (Vη ∧ 1)(a/V)/n = −0.18 η a / ((Vη ∨ 1) n).

Hence, Theorem 1 with t = 0 and the η from the theorem taken to be η/2 implies that the probability of the event P_n ℓ(·, f) ≤ P_n ℓ(·, f*) is at most exp(−0.18 η a / (Vη ∨ 1)). Applying the union bound over all of F_{⪰γ_n}, we conclude that

    Pr{∃f ∈ F_{⪰γ_n} : P_n ℓ(·, f) ≤ P_n ℓ(·, f*)} ≤ N exp(−η* · 0.18 a / (Vη* ∨ 1)).

Since ERM selects hypotheses based on their empirical risk, by inversion it holds that with probability at least 1 − δ, ERM will not select any hypothesis with excess risk at least 6 max{V, 1/η*}(log(1/δ) + log N)/n. ∎

Before presenting the result for VC-type classes, we require some definitions. For a pseudometric space (G, d) and any ε > 0, let N(ε, G, d) be the ε-covering number of (G, d); that is, N(ε, G, d) is the minimal number of balls of radius ε needed to cover G. We will further constrain the cover (the set of centers of the balls) to be a subset of G (i.e. to be proper), thus ensuring that the stochastic mixability assumption transfers to any (proper) cover of F. Note that the "proper" requirement at most doubles the constant K below, as shown by Vidyasagar [22, Lemma 2.1].

We now state a localization-based result that allows us to extend the result for finite classes to VC-type classes. Although the localization result can be obtained by combining standard techniques,¹ we could not find this particular result in the literature. Below, an ε-net F_ε of a set F is a subset of F such that F is contained in the union of the balls of radius ε with centers in F_ε.

Theorem 6. 
Let F be a separable function class whose functions have range bounded in [0, V] and for which, for a constant K ≥ 1, for each u ∈ (0, K] the L₂(P) covering numbers are bounded as

    N(u, F, L₂(P)) ≤ (K/u)^C.

Suppose F_ε is a minimal ε-net for F in the L₂(P) norm, with ε = 1/n. Denote by π : F → F_ε an L₂(P)-metric projection from F to F_ε. Then, provided that δ ≤ 1/2, with probability at most δ can there exist f ∈ F such that

    P_n f < P_n(π(f)) − (V/n) ( 1080 C log(2Kn) + 90 √( (log(1/δ)) (C log(2Kn) + log(e/δ)) ) ).        (11)

The proof is presented in Appendix C. We now present the fast rates result for VC-type classes. The proof (in Appendix C) uses Theorem 6 and the proof of Theorem 5. Below, we denote the loss-composed version of a function class F as ℓ ∘ F := {ℓ(·, f) : f ∈ F}.

¹See e.g. the techniques of Massart and Nédélec [16] and equation (3.17) of Koltchinskii [11].

Theorem 7 (VC-Type Classes Exact Oracle Inequality). Let (ℓ, F, P) be η*-stochastically mixable with ℓ ∘ F separable, where, for a constant K ≥ 1, for each ε ∈ (0, K] we have N(ℓ ∘ F, L₂(P), ε) ≤ (K/ε)^C, and sup_{f∈F} ℓ(Y, f(X)) ≤ V a.s. for a constant V ≥ 1. Then for all n ≥ 5 and δ ≤ 1/2, with probability at least 1 − δ,

    P ℓ(·, f̂_z) ≤ P ℓ(·, f*) + (1/n) [ 8 max{V, 1/η*} (C log(Kn) + log(2/δ)) + 2V ( 1080 C log(2Kn) + 90 √( (log(2/δ)) (C log(2Kn) + log(2e/δ)) ) ) ].

5 Characterizing convexity from the perspective of risk minimization

In the following, when we say (ℓ, F, P) has a unique minimizer we mean that any two minimizers f*₁, f*₂ of P ℓ(·, f) over F satisfy ℓ(Y, f*₁(X)) = ℓ(Y, f*₂(X)) a.s. We say the excess loss class {ℓ(·, f) − ℓ(·, f*) : f ∈ F} satisfies a (β, B)-Bernstein condition with respect to P, for some B > 0 and 0 < β ≤ 1, if for all f ∈ F:

    P (ℓ(·, f) − ℓ(·, f*))² ≤ B (P (ℓ(·, f) − ℓ(·, f*)))^β.        (12)

It is already known that the stochastic mixability condition guarantees that there is a unique minimizer [21]; this is a simple consequence of Jensen's inequality. This leaves open the question: if stochastic mixability does not hold, are there necessarily non-unique minimizers? We show that in a certain sense this is indeed the case, in a bad way: the set of minimizers will be a disconnected set.

For any ε > 0, define G_ε as the class G_ε := {f*} ∪ {f ∈ F : ‖f − f*‖_{L₁(P)} ≥ ε}, where in case there are multiple minimizers in F we arbitrarily select one of them as f*. Since we assume that F is compact and G_ε \ {f*} is equal to F minus an open set homeomorphic to the unit L₁(P) ball, G_ε \ {f*} is also compact.

Theorem 8 (Non-Unique Minimizers). 
Suppose there exists some ε > 0 such that G_ε is not stochastically mixable. Then there are minimizers f*₁, f*₂ ∈ F of P ℓ(·, f) over F such that it is not the case that ℓ(Y, f*₁(X)) = ℓ(Y, f*₂(X)) a.s.

Proof. Select ε > 0 as in the theorem and some fixed η > 0. Since G_ε is not η-stochastically mixable, there exists f_η ∈ G_ε such that Λ_{−Z_{f_η}}(η) > 0. Note that there exists η′ ∈ (0, η) with Λ_{−Z_{f_η}}(η′) = 0; if not, lim_{η↓0} (Λ_{−Z_{f_η}}(η) − Λ_{−Z_{f_η}}(0))/η > 0, so Λ′_{−Z_{f_η}}(0) = E(−Z_{f_η}) > 0, which implies that E Z_{f_η} < 0, a contradiction! From Lemma 2, E Z_{f_η} ≤ (cosh(η′) − 1)/sinh(η′); for η′ ≥ 0 the RHS has upper bound η′/2, since the derivative of η′/2 − (cosh(η′) − 1)/sinh(η′) is the nonnegative function (1/2) tanh²(η′/2), and (η′/2 − (cosh(η′) − 1)/sinh(η′))|_{η′=0} = 0. Thus, E Z_{f_η} → 0 as η → 0. As G_ε \ {f*} is compact, we can take a positive decreasing sequence (η_j)_j approaching 0, corresponding to a sequence (f_{η_j})_j ⊂ G_ε \ {f*} with limit point g* ∈ G_ε \ {f*} for which E Z_{g*} = 0, and so there is a risk minimizer in G_ε \ {f*}. ∎

The implications of having non-unique risk minimizers. In the case of non-unique risk minimizers, Mendelson [17] showed that for p-losses (y, ŷ) ↦ |y − ŷ|^p with p ∈ [2, ∞) there is an n-indexed sequence of probability measures (P⁽ⁿ⁾)_n approaching the true probability measure as n → ∞ such that, for each n, ERM learns at a slow rate under sample size n when the true distribution is P⁽ⁿ⁾. This behavior is a consequence of the statistical learning problem's poor geometry: there are multiple minimizers and the set of minimizers is not even connected. Furthermore, in this case, the best known fast rate upper bounds (see [18] and [19]) have a multiplicative constant that approaches ∞ as the target probability measure approaches a probability measure for which there are non-unique minimizers. The reason for the poor upper bounds in this case is that the constant B in the Bernstein condition explodes, and the upper bounds rely upon the Bernstein condition.

6 Weak stochastic mixability

For some κ ∈ [0, 1], we say (ℓ, F, P) is (κ, η₀)-weakly stochastically mixable if, for every ε > 0, for all f ∈ {f*} ∪ F_{⪰ε}, the inequality log E exp(−η_ε Z_f) ≤ 0 holds with η_ε := η₀ ε^{1−κ}. This concept was introduced by Van Erven et al. [21] without a name.

Suppose that some fixed function has excess risk a = ε. Then, roughly, with high probability ERM does not make a mistake provided that a η_a = 1/n, i.e. when ε · η₀ ε^{1−κ} = 1/n and hence when ε = (η₀ n)^{−1/(2−κ)}. 
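The critical radius computed above can be sketched in code (η₀, κ, and n below are arbitrary illustrative values); the resulting rate interpolates between n^{−1/2} at κ = 0 (slow) and n^{−1} at κ = 1 (fast):

```python
def critical_radius(eta0, kappa, n):
    # Solving eps * eta_eps = 1/n with eta_eps = eta0 * eps^{1 - kappa}
    # gives eps = (eta0 * n)^{-1/(2 - kappa)}.
    return (eta0 * n) ** (-1.0 / (2.0 - kappa))

n, eta0 = 10000, 1.0
for kappa in (0.0, 0.5, 1.0):
    eps = critical_radius(eta0, kappa, n)
    eta_eps = eta0 * eps ** (1.0 - kappa)
    print(kappa, eps, eps * eta_eps * n)  # last column is ~1 by construction
```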
Modifying the proof of the finite classes result (Theorem 5) to consider all functions in the subclass F_⪰γ_n for γ_n = (η_0 n)^{−1/(2−κ)} yields the following corollary of Theorem 5.

Corollary 9. Let (ℓ, F, P) be (κ, η_0)-weakly stochastically mixable for some κ ∈ [0, 1], where |F| = N, ℓ is a nonnegative loss, and sup_{f∈F} ℓ(Y, f(X)) ≤ V a.s. for a constant V. Then for any n ≥ (1/η_0) V^{(1−κ)/(2−κ)}, with probability at least 1 − δ

    P ℓ(·, f̂_z) ≤ P ℓ(·, f*) + 6(log(1/δ) + log N) / (η_0 n)^{1/(2−κ)}.

It is simple to show a similar result for VC-type classes; the ε-net can still be taken at the resolution 1/n, but we need only apply the analysis to the subclass of F with excess risk at least (η_0 n)^{−1/(2−κ)}.

7 Discussion

We have shown that stochastic mixability implies fast rates for VC-type classes, using a direct argument based on the Cramér-Chernoff method and sufficient control of the optimal value of a certain instance of the general moment problem. The approach is amenable to localization in that the analysis separately controls the probability of large deviations for individual elements of F. An important open problem is to extend the results presented here for VC-type classes to results for nonparametric classes with polynomial metric entropy, and moreover, to achieve rates similar to those obtained for these classes under the Bernstein condition.

There are still some unanswered questions with regards to the connection between the Bernstein condition and stochastic mixability. Van Erven et al. [21] showed that for bounded losses the Bernstein condition implies stochastic mixability.
Therefore, when starting from a Bernstein condition, Theorem 5 offers a different path to fast rates. An open problem is to settle the question of whether the Bernstein condition and stochastic mixability are equivalent. Previous results [21] suggest that stochastic mixability does imply a Bernstein condition, but the proof was non-constructive, and it relied upon a bounded losses assumption. It is well known (and easy to see) that both stochastic mixability and the Bernstein condition hold only if there is a unique minimizer. Theorem 8 shows in a certain sense that if stochastic mixability does not hold, then there cannot be a unique minimizer. Is the same true when the Bernstein condition fails to hold? Regardless of whether stochastic mixability is equivalent to the Bernstein condition, the direct argument presented here and the connection to classical mixability, which does characterize constant regret in the simpler non-stochastic setting, motivate further study of stochastic mixability.

Finally, it would be of great interest to discard the bounded losses assumption. Ignoring the dependence of the metric entropy on the maximum possible loss, the upper bound V on the loss enters the final bound through the difficulty of controlling the minimum value of u_η(−1) when η is large (see the proof of Theorem 3). From extensive experiments with a grid-approximation linear program, we have observed that the worst (CGF-wise) random variables for fixed negative mean and fixed optimal stochastic mixability constant are those which place very little probability mass at −V and most of the probability mass at a small positive number that scales with the mean. These random variables correspond to functions that with low probability beat f* by a large (loss) margin but with high probability have slightly higher loss than f*.
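A grid-approximation linear program of the kind just mentioned can be sketched as follows. This is our own reconstruction, not the authors' code, and it rests on assumptions: W stands for −Z_f on a grid over [−V, V], the constants V, η_0, η, and the mean μ < 0 are placeholders, and we read "worst CGF-wise" as maximizing the exponential moment E[e^{ηW}] at some η subject to the η_0-mixability-style constraint E[e^{η_0 W}] ≤ 1 and the fixed-mean constraint; both the objective and the constraints are linear in the probability vector, so this is a linear program.

```python
import numpy as np
from scipy.optimize import linprog

# Illustrative grid LP (placeholder constants, not the paper's experiment).
# Decision variable: a probability vector p over a grid of outcomes w in [-V, V],
# where w plays the role of -Z_f (the negated excess loss).
V, eta0, eta, mu = 1.0, 1.0, 3.0, -0.1
w = np.linspace(-V, V, 201)              # outcome grid

c = -np.exp(eta * w)                     # maximize E[exp(eta*W)]; linprog minimizes
A_eq = np.vstack([np.ones_like(w), w])   # sum(p) = 1 and E[W] = mu
b_eq = np.array([1.0, mu])
A_ub = np.exp(eta0 * w)[None, :]         # mixability-style constraint E[exp(eta0*W)] <= 1
b_ub = np.array([1.0])

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=(0.0, None), method="highs")
p = res.x                                # extremal distribution on the grid
```

Inspecting p for parameters like these is one way to probe whether the extremal distributions really concentrate most of their mass at a small value with only a sliver at the endpoint; we make no claim that these particular constants reproduce the experiments reported above.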
It would be useful to understand if this exotic behavior is a real concern and, if not, find a simple, mild condition on the moments that rules it out.

Acknowledgments

RCW thanks Tim van Erven for the initial discussions around the Cramér-Chernoff method during his visit to Canberra in 2013 and for his gracious permission to proceed with the present paper without him as an author, and both authors thank him for the further enormously helpful spotting of a serious error in our original proof for fast rates for VC-type classes. This work was supported by the Australian Research Council (NAM and RCW) and NICTA (RCW). NICTA is funded by the Australian Government through the Department of Communications and the Australian Research Council through the ICT Centre of Excellence program.

References

[1] Jacob Abernethy, Alekh Agarwal, Peter L. Bartlett, and Alexander Rakhlin. A stochastic view of optimal regret through minimax duality. In Proceedings of the 22nd Annual Conference on Learning Theory (COLT 2009), 2009.

[2] Jean-Yves Audibert. Fast learning rates in statistical inference through aggregation. The Annals of Statistics, 37(4):1591–1646, 2009.

[3] Peter L. Bartlett, Olivier Bousquet, and Shahar Mendelson. Local Rademacher complexities. The Annals of Statistics, 33(4):1497–1537, 2005.

[4] Peter L. Bartlett and Shahar Mendelson. Empirical minimization. Probability Theory and Related Fields, 135(3):311–334, 2006.

[5] Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration inequalities: A nonasymptotic theory of independence. Oxford University Press, 2013.

[6] Peter Grünwald. Safe learning: bridging the gap between Bayes, MDL and statistical learning theory via empirical convexity. In Proceedings of the 24th International Conference on Learning Theory (COLT 2011), pages 397–419, 2011.

[7] Peter Grünwald. The safe Bayesian.
In Proceedings of the 23rd International Conference on Algorithmic Learning Theory (ALT 2012), pages 169–183. Springer, 2012.

[8] Yuri Kalnishkan and Michael V. Vyugin. The weak aggregating algorithm and weak mixability. In Proceedings of the 18th Annual Conference on Learning Theory (COLT 2005), pages 188–203. Springer, 2005.

[9] Johannes H.B. Kemperman. The general moment problem, a geometric approach. The Annals of Mathematical Statistics, 39(1):93–122, 1968.

[10] Vladimir Koltchinskii. Local Rademacher complexities and oracle inequalities in risk minimization. The Annals of Statistics, 34(6):2593–2656, 2006.

[11] Vladimir Koltchinskii. Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems: École d'Été de Probabilités de Saint-Flour XXXVIII-2008, volume 2033. Springer, 2011.

[12] Guillaume Lecué. Interplay between concentration, complexity and geometry in learning theory with applications to high dimensional data analysis. Habilitation à diriger des recherches, Université Paris-Est, 2011.

[13] Wee Sun Lee, Peter L. Bartlett, and Robert C. Williamson. The importance of convexity in learning with squared loss. IEEE Transactions on Information Theory, 44(5):1974–1980, 1998.

[14] Jonathan Qiang Li. Estimation of mixture models. PhD thesis, Yale University, 1999.

[15] Enno Mammen and Alexandre B. Tsybakov. Smooth discrimination analysis. The Annals of Statistics, 27(6):1808–1829, 1999.

[16] Pascal Massart and Élodie Nédélec. Risk bounds for statistical learning. The Annals of Statistics, 34(5):2326–2366, 2006.

[17] Shahar Mendelson. Lower bounds for the empirical minimization algorithm. IEEE Transactions on Information Theory, 54(8):3797–3803, 2008.

[18] Shahar Mendelson. Obtaining fast error rates in nonconvex situations.
Journal of Complexity, 24(3):380–397, 2008.

[19] Shahar Mendelson and Robert C. Williamson. Agnostic learning nonconvex function classes. In Proceedings of the 15th Annual Conference on Computational Learning Theory (COLT 2002), pages 1–13. Springer, 2002.

[20] Alexander B. Tsybakov. Optimal aggregation of classifiers in statistical learning. The Annals of Statistics, 32(1):135–166, 2004.

[21] Tim Van Erven, Peter D. Grünwald, Mark D. Reid, and Robert C. Williamson. Mixability in statistical learning. In Advances in Neural Information Processing Systems 25 (NIPS 2012), pages 1700–1708, 2012.

[22] Mathukumalli Vidyasagar. Learning and Generalization with Applications to Neural Networks. Springer, 2002.

[23] Volodya Vovk. A game of prediction with expert advice. Journal of Computer and System Sciences, 56(2):153–173, 1998.

[24] Volodya Vovk. Competitive on-line statistics. International Statistical Review, 69(2):213–248, 2001.