{"title": "Divergences, surrogate loss functions and experimental design", "book": "Advances in Neural Information Processing Systems", "page_first": 1011, "page_last": 1018, "abstract": "", "full_text": "Divergences, surrogate loss functions and experimental design\n\nXuanLong Nguyen, University of California, Berkeley, CA 94720, xuanlong@cs.berkeley.edu\nMartin J. Wainwright, University of California, Berkeley, CA 94720, wainwrig@eecs.berkeley.edu\nMichael I. Jordan, University of California, Berkeley, CA 94720, jordan@cs.berkeley.edu\n\nAbstract\n\nIn this paper, we provide a general theorem that establishes a correspondence between surrogate loss functions in classification and the family of f-divergences. Moreover, we provide constructive procedures for determining the f-divergence induced by a given surrogate loss, and conversely for finding all surrogate loss functions that realize a given f-divergence. Next we introduce the notion of universal equivalence among loss functions and corresponding f-divergences, and provide necessary and sufficient conditions for universal equivalence to hold. These ideas have applications to classification problems that also involve a component of experiment design; in particular, we leverage our results to prove consistency of a procedure for learning a classifier under decentralization requirements.\n\n1 Introduction\n\nA unifying theme in the recent literature on classification is the notion of a surrogate loss function, a convex upper bound on the 0-1 loss.
Many practical classification algorithms can be formulated in terms of the minimization of surrogate loss functions; well-known examples include the support vector machine (hinge loss) and AdaBoost (exponential loss). Significant progress has been made on the theoretical front by analyzing the general statistical consequences of using surrogate loss functions [e.g., 2, 10, 13].\n\nThese recent developments have an interesting historical antecedent. Working in the context of experimental design, researchers in the 1960s recast the (intractable) problem of minimizing the probability of classification error in terms of the maximization of various surrogate functions [e.g., 5, 8]. Examples of experimental design include the choice of a quantizer as a preprocessor for a classifier [12], or the choice of a \u201csignal set\u201d for a radar system [5]. The surrogate functions that were used included the Hellinger distance and various forms of KL divergence; maximization of these functions was proposed as a criterion for the choice of a design. Theoretical support for this approach was provided by a classical theorem on the comparison of experiments due to Blackwell [3]. An important outcome of this line of work was the definition of a general family of \u201cf-divergences\u201d (also known as \u201cAli-Silvey distances\u201d), which includes the Hellinger distance and KL divergence as special cases [1, 4].\n\nIn broad terms, the goal of the current paper is to bring together these two literatures, in particular by establishing a correspondence between the family of surrogate loss functions and the family of f-divergences.
Several specific goals motivate us in this regard: (1) different f-divergences are related by various well-known inequalities [11], so that a correspondence between loss functions and f-divergences would allow these inequalities to be harnessed in analyzing surrogate loss functions; (2) a correspondence could allow the definition of interesting equivalence classes of losses or divergences; and (3) the problem of experimental design, which motivated the classical research on f-divergences, provides new venues for applying the loss function framework from machine learning. In particular, one natural extension, and one which we explore towards the end of this paper, is in requiring consistency not only in the choice of an optimal discriminant function but also in the choice of an optimal experiment design.\n\nThe main technical contribution of this paper is to state and prove a general theorem relating surrogate loss functions and f-divergences.\u00b9 We show that the correspondence is quite strong: any surrogate loss induces a corresponding f-divergence, and any f-divergence satisfying certain conditions corresponds to a family of surrogate loss functions. Moreover, exploiting tools from convex analysis, we provide a constructive procedure for finding loss functions from f-divergences. We also introduce and analyze a notion of universal equivalence among loss functions (and corresponding f-divergences). Finally, we present an application of these ideas to the problem of proving consistency of classification algorithms with an additional decentralization requirement.\n\n2 Background and elementary results\n\nConsider a covariate X \u2208 \ud835\udcb3, where \ud835\udcb3 is a compact topological space, and a random variable Y \u2208 \ud835\udcb4 := {\u22121, +1}. The space (\ud835\udcb3 \u00d7 \ud835\udcb4) is assumed to be endowed with a Borel regular probability measure P.
In this paper, we consider a variant of the standard classification problem, in which the decision-maker, rather than having direct access to X, only observes some variable Z \u2208 \ud835\udcb5 that is obtained via a conditional probability Q(Z|X). The stochastic map Q is referred to as an experiment in statistics; in the signal processing literature, where Z is generally taken to be discrete, it is referred to as a quantizer. We let \ud835\udcac denote the space of all stochastic Q, and let \ud835\udcac\u2080 denote its deterministic subset.\n\nGiven a fixed experiment Q, we can formulate a standard binary classification problem as one of finding a measurable function \u03b3 \u2208 \u0393 := {Z \u2192 \u211d} that minimizes the Bayes risk P(Y \u2260 sign(\u03b3(Z))). Our focus is the broader question of determining both the classifier \u03b3 \u2208 \u0393, as well as the experiment choice Q \u2208 \ud835\udcac, so as to minimize the Bayes risk.\n\nThe Bayes risk corresponds to the expectation of the 0-1 loss. Given the non-convexity of this loss function, it is natural to consider a surrogate loss function \u2113 that we optimize in place of the 0-1 loss. We refer to the quantity R\u2113(\u03b3, Q) := E \u2113(Y \u03b3(Z)) as the \u2113-risk. For each fixed quantization rule Q, the optimal \u2113-risk (as a function of Q) is defined as follows:\n\nR\u2113(Q) := inf_{\u03b3\u2208\u0393} R\u2113(\u03b3, Q).   (1)\n\nGiven priors q = P(Y = \u22121) and p = P(Y = 1), define nonnegative measures \u03bc and \u03c0:\n\n\u03bc(z) = P(Y = 1, Z = z) = p \u222b_x Q(z|x) dP(x|Y = 1)\n\u03c0(z) = P(Y = \u22121, Z = z) = q \u222b_x Q(z|x) dP(x|Y = \u22121).\n\n\u00b9Proofs are omitted from this manuscript for lack of space; see the long version of the paper [7] for proofs of all of our results.\n\nAs a consequence of Lyapunov's theorem, the space of {(\u03bc, \u03c0)} obtained by varying Q \u2208 \ud835\udcac (or \ud835\udcac\u2080) is both compact and convex (see [12] for details).
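To make the definitions of \u03bc and \u03c0 concrete, the following sketch computes both measures and the best achievable Bayes risk over decision rules based on Z, for a small finite instance; the distributions, quantizer Q, and priors are all invented for illustration.

```python
import itertools
import numpy as np

# Hypothetical finite instance: X takes 3 values, Z takes 2, priors p = q = 1/2.
p, q = 0.5, 0.5
P_x_pos = np.array([0.7, 0.2, 0.1])   # P(x | Y = +1)
P_x_neg = np.array([0.1, 0.3, 0.6])   # P(x | Y = -1)
Q = np.array([[0.9, 0.1],             # Q(z | x): rows index x, columns index z
              [0.5, 0.5],
              [0.2, 0.8]])

# mu(z) = P(Y = +1, Z = z) and pi(z) = P(Y = -1, Z = z)
mu = p * P_x_pos @ Q
pi = q * P_x_neg @ Q

# Bayes risk of the best decision rule gamma: Z -> {-1, +1}, by brute force:
# choosing gamma(z) = +1 incurs the mass pi(z); choosing -1 incurs mu(z).
risks = [sum(pi[z] if g[z] == +1 else mu[z] for z in range(2))
         for g in itertools.product([-1, +1], repeat=2)]
R_bayes = min(risks)
# The brute-force optimum picks min(mu(z), pi(z)) at each z.
assert np.isclose(R_bayes, np.minimum(mu, pi).sum())
```

The brute-force search over deterministic rules is only feasible because Z is binary here; it serves to confirm the pointwise form of the optimal rule.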
For simplicity, we assume that the space \ud835\udcac of experiments is restricted such that both \u03bc and \u03c0 are strictly positive measures.\n\nOne approach to choosing Q is to define an f-divergence between \u03bc and \u03c0; indeed this is the classical approach referred to earlier [e.g., 8]. Rather than following this route, however, we take an alternative path, setting up the problem in terms of the \u2113-risk and optimizing out the discriminant function \u03b3. Note in particular that the \u2113-risk can be represented in terms of the measures \u03bc and \u03c0 as follows:\n\nR\u2113(\u03b3, Q) = \u03a3_z \u2113(\u03b3(z)) \u03bc(z) + \u2113(\u2212\u03b3(z)) \u03c0(z).   (2)\n\nThis representation allows us to compute the optimal value for \u03b3(z) for all z \u2208 \ud835\udcb5, as well as the optimal \u2113-risk for a fixed Q. We illustrate this calculation with several examples:\n\n0-1 loss. If \u2113 is the 0-1 loss, then \u03b3(z) = sign(\u03bc(z) \u2212 \u03c0(z)). Thus the optimal Bayes risk given a fixed Q takes the form Rbayes(Q) = \u03a3_{z\u2208\ud835\udcb5} min{\u03bc(z), \u03c0(z)} = 1/2 \u2212 (1/2) \u03a3_{z\u2208\ud835\udcb5} |\u03bc(z) \u2212 \u03c0(z)| =: (1/2)(1 \u2212 V(\u03bc, \u03c0)), where V(\u03bc, \u03c0) denotes the variational distance between the two measures \u03bc and \u03c0.\n\nHinge loss. Let \u2113hinge(y\u03b3(z)) = (1 \u2212 y\u03b3(z))\u208a. In this case \u03b3(z) = sign(\u03bc(z) \u2212 \u03c0(z)), and the optimal risk takes the form Rhinge(Q) = \u03a3_{z\u2208\ud835\udcb5} 2 min{\u03bc(z), \u03c0(z)} = 1 \u2212 \u03a3_{z\u2208\ud835\udcb5} |\u03bc(z) \u2212 \u03c0(z)| = 1 \u2212 V(\u03bc, \u03c0) = 2 Rbayes(Q).\n\nLeast squares loss. Letting \u2113sqr(y\u03b3(z)) = (1 \u2212 y\u03b3(z))\u00b2, we have \u03b3(z) = (\u03bc(z) \u2212 \u03c0(z))/(\u03bc(z) + \u03c0(z)). The optimal risk takes the form Rsqr(Q) = \u03a3_{z\u2208\ud835\udcb5} 4\u03bc(z)\u03c0(z)/(\u03bc(z) + \u03c0(z)) = 1 \u2212 \u03a3_{z\u2208\ud835\udcb5} (\u03bc(z) \u2212 \u03c0(z))\u00b2/(\u03bc(z) + \u03c0(z)) =: 1 \u2212 \u0394(\u03bc, \u03c0), where \u0394(\u03bc, \u03c0) denotes the triangular discrimination distance.\n\nLogistic loss.
Letting \u2113log(y\u03b3(z)) := log(1 + exp(\u2212y\u03b3(z))), we have \u03b3(z) = log(\u03bc(z)/\u03c0(z)). The optimal risk for the logistic loss takes the form Rlog(Q) = \u03a3_{z\u2208\ud835\udcb5} \u03bc(z) log((\u03bc(z) + \u03c0(z))/\u03bc(z)) + \u03c0(z) log((\u03bc(z) + \u03c0(z))/\u03c0(z)) = log 2 \u2212 KL(\u03bc || (\u03bc+\u03c0)/2) \u2212 KL(\u03c0 || (\u03bc+\u03c0)/2) =: log 2 \u2212 C(\u03bc, \u03c0), where C(\u03bc, \u03c0) denotes the capacitory discrimination distance.\n\nExponential loss. Letting \u2113exp(y\u03b3(z)) = exp(\u2212y\u03b3(z)), we have \u03b3(z) = (1/2) log(\u03bc(z)/\u03c0(z)). The optimal risk for the exponential loss takes the form Rexp(Q) = \u03a3_{z\u2208\ud835\udcb5} 2\u221a(\u03bc(z)\u03c0(z)) = 1 \u2212 \u03a3_{z\u2208\ud835\udcb5} (\u221a\u03bc(z) \u2212 \u221a\u03c0(z))\u00b2 = 1 \u2212 2h\u00b2(\u03bc, \u03c0), where h(\u03bc, \u03c0) denotes the Hellinger distance between the measures \u03bc and \u03c0.\n\nAll of the distances given above (e.g., variational, Hellinger) are particular instances of f-divergences. This fact points to an interesting correspondence between optimized \u2113-risks and f-divergences. How general is this correspondence?\n\n3 The correspondence between loss functions and f-divergences\n\nIn order to resolve this question, we begin with precise definitions of f-divergences and surrogate loss functions. An f-divergence functional is defined as follows [1, 4]:\n\nDefinition 1. Given any continuous convex function f : [0, +\u221e) \u2192 \u211d \u222a {+\u221e}, the f-divergence between measures \u03bc and \u03c0 is given by If(\u03bc, \u03c0) := \u03a3_z \u03c0(z) f(\u03bc(z)/\u03c0(z)).\n\nFor instance, the variational distance is given by f(u) = |u \u2212 1|, KL divergence by f(u) = u log u, triangular discrimination by f(u) = (u \u2212 1)\u00b2/(u + 1), and Hellinger distance by f(u) = (1/2)(\u221au \u2212 1)\u00b2.\n\nSurrogate loss \u2113. First, we require that any surrogate loss function \u2113 is continuous and convex.
Second, the function \u2113 must be classification-calibrated [2], meaning that for any a, b \u2265 0 with a \u2260 b, inf_{\u03b1: \u03b1(a\u2212b)<0} \u2113(\u03b1)a + \u2113(\u2212\u03b1)b > inf_{\u03b1\u2208\u211d} \u2113(\u03b1)a + \u2113(\u2212\u03b1)b. It can be shown [2] that in the convex case \u2113 is classification-calibrated if and only if it is differentiable at 0 and \u2113\u2032(0) < 0. Lastly, let \u03b1* = inf{\u03b1 : \u2113(\u03b1) = inf \u2113}. If \u03b1* < +\u221e, then for any \u03b4 > 0, we require that \u2113(\u03b1* \u2212 \u03b4) \u2265 \u2113(\u03b1* + \u03b4). The interpretation of this last assumption is that one should penalize deviations away from \u03b1* in the negative direction at least as strongly as deviations in the positive direction; this requirement is intuitively reasonable given the margin-based interpretation of \u03b1.\n\nFrom \u2113-risk to f-divergence. We begin with a simple result that formalizes how any \u2113-risk induces a corresponding f-divergence. More precisely, the following lemma proves that the optimal \u2113-risk for a fixed Q can be written as the negative of an f-divergence.\n\nLemma 2. For each fixed Q, let \u03b3Q denote the optimal decision rule. The \u2113-risk for (Q, \u03b3Q) is the negative of an f-divergence between \u03bc and \u03c0 for some convex function f:\n\nR\u2113(Q) = \u2212If(\u03bc, \u03c0).   (3)\n\nProof. The optimal \u2113-risk takes the form:\n\nR\u2113(Q) = \u03a3_{z\u2208\ud835\udcb5} inf_\u03b1 (\u2113(\u03b1)\u03bc(z) + \u2113(\u2212\u03b1)\u03c0(z)) = \u03a3_z \u03c0(z) inf_\u03b1 (\u2113(\u2212\u03b1) + \u2113(\u03b1) \u03bc(z)/\u03c0(z)).\n\nFor each z, let u = \u03bc(z)/\u03c0(z); then inf_\u03b1 (\u2113(\u2212\u03b1) + \u2113(\u03b1)u) is a concave function of u (since minimization over a set of linear functions is a concave function). Thus, the claim follows by defining (for u \u2208 \u211d)\n\nf(u) := \u2212 inf_\u03b1 (\u2113(\u2212\u03b1) + \u2113(\u03b1)u).   (4)\n\nFrom f-divergence to \u2113-risk. In the remainder of this section, we explore the converse of Lemma 2.
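The construction (4) is easy to check numerically. The sketch below is our own scaffolding (grid range and tolerance are ad hoc choices): it recovers, by grid search over \u03b1, the divergences that Section 2 associated with the exponential and hinge losses.

```python
import numpy as np

# Numerically evaluate f(u) = -inf_alpha [ ell(-alpha) + ell(alpha) * u ]  (equation (4))
# by a grid search over alpha; grid range and resolution are ad hoc.
def induced_f(ell, u, alphas=np.linspace(-10.0, 10.0, 200001)):
    return -float(np.min(ell(-alphas) + ell(alphas) * u))

exp_loss   = lambda t: np.exp(-t)
hinge_loss = lambda t: np.maximum(0.0, 1.0 - t)

for u in (0.25, 1.0, 4.0):
    # exponential loss induces f(u) = -2*sqrt(u), i.e. the Hellinger case of Section 2
    assert abs(induced_f(exp_loss, u) - (-2.0 * np.sqrt(u))) < 1e-3
    # hinge loss induces f(u) = -2*min(u, 1), i.e. the variational case
    assert abs(induced_f(hinge_loss, u) - (-2.0 * min(u, 1.0))) < 1e-3
```

For the exponential loss the infimum in (4) is attained at \u03b1 = (1/2) log u, matching the optimal discriminant computed in Section 2.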
Given a divergence If(\u03bc, \u03c0) for some convex function f, does there exist a loss function \u2113 for which R\u2113(Q) = \u2212If(\u03bc, \u03c0)? In the following, we provide a precise characterization of the set of f-divergences that can be realized in this way, as well as a constructive procedure for determining all \u2113 that realize a given f-divergence.\n\nOur method requires the introduction of several intermediate functions. First, let us define, for each \u03b2, the inverse mapping \u2113\u207b\u00b9(\u03b2) := inf{\u03b1 : \u2113(\u03b1) \u2264 \u03b2}, where inf \u2205 := +\u221e. Using the function \u2113\u207b\u00b9, we then define a new function \u03a8 : \u211d \u2192 \u211d by\n\n\u03a8(\u03b2) := \u2113(\u2212\u2113\u207b\u00b9(\u03b2)) if \u2113\u207b\u00b9(\u03b2) \u2208 \u211d, and +\u221e otherwise.   (5)\n\nNote that the domain of \u03a8 is Dom(\u03a8) = {\u03b2 \u2208 \u211d : \u2113\u207b\u00b9(\u03b2) \u2208 \u211d}. Define\n\n\u03b21 := inf{\u03b2 : \u03a8(\u03b2) < +\u221e} and \u03b22 := inf{\u03b2 : \u03a8(\u03b2) = inf \u03a8}.   (6)\n\nIt is simple to check that inf \u2113 = inf \u03a8 = \u2113(\u03b1*), and \u03b21 = \u2113(\u03b1*), \u03b22 = \u2113(\u2212\u03b1*). Furthermore, \u03a8(\u03b22) = \u2113(\u03b1*) = \u03b21 and \u03a8(\u03b21) = \u2113(\u2212\u03b1*) = \u03b22. With this set-up, the following lemma captures several important properties of \u03a8:\n\nLemma 3. (a) \u03a8 is strictly decreasing in (\u03b21, \u03b22). If \u2113 is decreasing, then \u03a8 is also decreasing in (\u2212\u221e, +\u221e). In addition, \u03a8(\u03b2) = +\u221e for \u03b2 < \u03b21.\n(b) \u03a8 is convex in (\u2212\u221e, \u03b22]. If \u2113 is decreasing, then \u03a8 is convex in (\u2212\u221e, +\u221e).\n(c) \u03a8 is lower semi-continuous, and continuous in its domain.\n(d) There exists u* \u2208 (\u03b21, \u03b22) such that \u03a8(u*) = u*.\n(e) There holds \u03a8(\u03a8(\u03b2)) = \u03b2 for all \u03b2 \u2208 (\u03b21, \u03b22).\n\nThe connection between \u03a8 and an f-divergence arises from the following fact.
Given the definition (5) of \u03a8, it is possible to show that\n\nf(u) = sup_{\u03b2\u2208\u211d} (\u2212\u03b2u \u2212 \u03a8(\u03b2)) = \u03a8*(\u2212u),   (7)\n\nwhere \u03a8* denotes the conjugate dual of the function \u03a8. Hence, if \u03a8 is a lower semicontinuous convex function, it is possible to recover \u03a8 from f by means of convex duality [9]: \u03a8(\u03b2) = f*(\u2212\u03b2). Thus, equation (5) provides a means for recovering a loss function \u2113 from \u03a8. Indeed, the following theorem provides a constructive procedure for finding all such \u2113 when \u03a8 satisfies the necessary conditions specified in Lemma 3:\n\nTheorem 4. (a) Given a lower semicontinuous convex function f : \u211d \u2192 \u211d, define\n\n\u03a8(\u03b2) = f*(\u2212\u03b2).   (8)\n\nIf \u03a8 is a decreasing function satisfying the properties specified in parts (c), (d) and (e) of Lemma 3, then there exist convex continuous loss functions \u2113 for which (3) and (4) hold.\n\n(b) More precisely, all such functions \u2113 are of the form: for any \u03b1 \u2265 0,\n\n\u2113(\u03b1) = \u03a8(g(\u03b1 + u*)) and \u2113(\u2212\u03b1) = g(\u03b1 + u*),   (9)\n\nwhere u* satisfies \u03a8(u*) = u* for some u* \u2208 (\u03b21, \u03b22), and g : [u*, +\u221e) \u2192 \u211d is any increasing continuous convex function such that g(u*) = u*. Moreover, g is differentiable at u*+ and g\u2032(u*+) > 0.\n\nOne interesting consequence of Theorem 4 is that any realizable f-divergence can in fact be obtained from a fairly large set of loss functions \u2113. More precisely, examining the statement of Theorem 4(b) reveals that for \u03b1 \u2264 0, we are free to choose a function g that must satisfy only mild conditions; given a choice of g, \u2113 is then specified for \u03b1 > 0 accordingly by equation (9).
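The recipe of Theorem 4(b) can be exercised numerically. In the sketch below we take f(u) = \u22122\u221au, for which a direct computation of the dual gives \u03a8(\u03b2) = f*(\u2212\u03b2) = 1/\u03b2 for \u03b2 > 0 and u* = 1; the helper `make_loss` and the grid check are our own scaffolding, not part of the paper.

```python
import numpy as np

# Theorem 4(b) sketch for f(u) = -2*sqrt(u): f*(-beta) = 1/beta (beta > 0), u* = 1.
def make_loss(g, Psi, u_star):
    def ell(t):  # ell(alpha) = Psi(g(alpha + u*)), ell(-alpha) = g(alpha + u*), alpha >= 0
        return Psi(g(t + u_star)) if t >= 0 else g(-t + u_star)
    return ell

Psi = lambda beta: 1.0 / beta
ell_exp  = make_loss(lambda u: np.exp(u - 1.0), Psi, 1.0)  # recovers exp(-alpha)
ell_frac = make_loss(lambda u: u, Psi, 1.0)                # ell(alpha) = 1/(alpha + 1)

# Both constructed losses should induce the same f via equation (4).
alphas = np.linspace(-10.0, 10.0, 20001)
for ell in (ell_exp, ell_frac):
    for u in (0.25, 1.0, 4.0):
        induced = -min(ell(-a) + ell(a) * u for a in alphas)
        assert abs(induced - (-2.0 * np.sqrt(u))) < 1e-3
```

The choice g(u) = e^{u\u22121} reproduces the exponential loss, while g(u) = u yields a different loss realizing the same divergence, previewing the examples that follow.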
We describe below how the Hellinger distance, for instance, is realized not only by the exponential loss (as described earlier), but also by many other surrogate loss functions. Additional examples can be found in [7].\n\nIllustrative examples. Consider the Hellinger distance, which is an f-divergence\u00b2 with f(u) = \u22122\u221au. Augment the domain of f with f(u) = +\u221e for u < 0. Following the prescription of Theorem 4(a), we first recover \u03a8 from f:\n\n\u03a8(\u03b2) = f*(\u2212\u03b2) = sup_{u\u2208\u211d} (\u2212\u03b2u \u2212 f(u)) = 1/\u03b2 when \u03b2 > 0, and +\u221e otherwise.\n\nClearly, u* = 1. Now if we choose g(u) = e^{u\u22121}, then we obtain the exponential loss \u2113(\u03b1) = exp(\u2212\u03b1). However, making the alternative choice g(u) = u, we obtain \u2113(\u03b1) = 1/(\u03b1 + 1) and \u2113(\u2212\u03b1) = \u03b1 + 1 (for \u03b1 \u2265 0), which also realizes the Hellinger distance.\n\nRecall that we have shown previously that the 0-1 loss induces the variational distance, which can be expressed as an f-divergence with fvar(u) = \u22122 min(u, 1) for u \u2265 0. It is thus of particular interest to determine other loss functions that also lead to the variational distance. If we augment the function fvar by defining fvar(u) = +\u221e for u < 0, then we can recover \u03a8 from fvar as follows:\n\n\u03a8(\u03b2) = fvar*(\u2212\u03b2) = sup_{u\u2208\u211d} (\u2212\u03b2u \u2212 fvar(u)) = (2 \u2212 \u03b2)\u208a when \u03b2 \u2265 0, and +\u221e when \u03b2 < 0.\n\n\u00b2We consider the f-divergences for two convex functions f1 and f2 to be equivalent if f1 and f2 are related by a linear term, i.e., f1 = cf2 + au + b for some constants c > 0 and a, b, because then If1 and If2 differ only by a constant.\n\nClearly u* = 1. Choosing g(u) = u leads to the hinge loss \u2113(\u03b1) = (1 \u2212 \u03b1)\u208a, which is consistent with our earlier findings.
Making the alternative choice g(u) = e^{u\u22121} leads to a rather different loss, namely \u2113(\u03b1) = (2 \u2212 e^\u03b1)\u208a for \u03b1 \u2265 0 and \u2113(\u03b1) = e^{\u2212\u03b1} for \u03b1 < 0, that also realizes the variational distance.\n\nUsing Theorem 4, it can be shown that an f-divergence is realizable by a margin-based surrogate loss if and only if it is symmetric [7]. Hence, the list of non-realizable f-divergences includes the KL divergence KL(\u03bc||\u03c0) (as well as KL(\u03c0||\u03bc)). The symmetric KL divergence KL(\u03bc||\u03c0) + KL(\u03c0||\u03bc) is a realizable f-divergence, and Theorem 4 allows us to construct all losses \u2113 that realize it. One of them turns out to have the simple closed form \u2113(\u03b1) = e^{\u2212\u03b1} \u2212 \u03b1, but obtaining it requires some non-trivial calculations [7].\n\n4 On comparison of loss functions and quantization schemes\n\nThe previous section was devoted to the study of the correspondence between f-divergences and the optimal \u2113-risk R\u2113(Q) for a fixed experiment Q. Our ultimate goal, however, is that of choosing an optimal Q, a problem known as experimental design in the statistics literature [3]. One concrete application is the design of quantizers for performing decentralized detection [12, 6] in a sensor network.\n\nIn this section, we address the experiment design problem via the joint optimization of the \u2113-risk (or more precisely, its empirical version) over both the decision rule \u03b3 and the choice of experiment Q (hereafter referred to as a quantizer). This procedure raises a natural theoretical question: for what loss functions \u2113 does such joint optimization lead to minimum Bayes risk? Note that the minimum here is taken over both the decision rule \u03b3 and the space of experiments \ud835\udcac, so that this question is not covered by standard consistency results [13, 10, 2].
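The closed form for the symmetric KL divergence can be sanity-checked numerically: plugging \u2113(\u03b1) = e^{\u2212\u03b1} \u2212 \u03b1 into equation (4) should yield u log u \u2212 log u up to a linear term (here \u2212(u + 1)), which by the equivalence of footnote 2 leaves the divergence unchanged. The grid search and tolerance below are ad hoc choices of ours.

```python
import numpy as np

# f(u) = -inf_alpha [ ell(-alpha) + ell(alpha) * u ]   (equation (4)), by grid search
def induced_f(ell, u, alphas=np.linspace(-10.0, 10.0, 20001)):
    return -float(np.min(ell(-alphas) + ell(alphas) * u))

ell_skl = lambda t: np.exp(-t) - t   # the closed-form loss for symmetric KL

for u in (0.2, 1.0, 5.0):
    f_u = induced_f(ell_skl, u)
    # induced f equals u*log(u) - log(u) after removing the linear term -(u + 1),
    # i.e. ell_skl realizes KL(mu||pi) + KL(pi||mu) in the sense of footnote 2
    assert abs((f_u + u + 1.0) - (u * np.log(u) - np.log(u))) < 1e-3
```

Setting the derivative of \u2113(\u2212\u03b1) + \u2113(\u03b1)u to zero shows the infimum is attained at \u03b1 = log u, consistent with the grid search.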
Here we describe how the results of the previous section can be leveraged to resolve this issue of consistency.\n\n4.1 Universal equivalence\n\nThe connection between f-divergences and the 0-1 loss can be traced back to seminal work on the comparison of experiments [3]. Formally, we say that the quantization scheme Q1 dominates Q2 if Rbayes(Q1) \u2264 Rbayes(Q2) for any prior probabilities q \u2208 (0, 1). We have the following theorem [3] (see also [7] for a short proof):\n\nTheorem 5. Q1 dominates Q2 iff If(\u03bc^{Q1}, \u03c0^{Q1}) \u2265 If(\u03bc^{Q2}, \u03c0^{Q2}) for all convex functions f. The superscripts denote the dependence of \u03bc and \u03c0 on the quantizer rules Q1, Q2.\n\nUsing Lemma 2, we can establish the following:\n\nCorollary 6. Q1 dominates Q2 iff R\u2113(Q1) \u2264 R\u2113(Q2) for any surrogate loss \u2113.\n\nOne implication of Corollary 6 is that if R\u2113(Q1) \u2264 R\u2113(Q2) for some loss function \u2113, then Rbayes(Q1) \u2264 Rbayes(Q2) for some set of prior probabilities on the labels Y. This fact justifies the use of a surrogate \u2113-loss as a proxy for the 0-1 loss, at least for a certain subset of prior probabilities. Typically, however, the goal is to select the optimal experiment Q for a pre-specified set of priors, in which context this implication is of limited use. We are thus motivated to consider a different method of determining which loss functions (or equivalently, f-divergences) lead to the same optimal experimental design as the 0-1 loss (respectively, the variational distance). More generally, we are interested in comparing two arbitrary loss functions \u21131 and \u21132, with corresponding divergences induced by f1 and f2 respectively:\n\nDefinition 7.
The surrogate loss functions \u21131 and \u21132 are universally equivalent, denoted \u21131 \u2248 \u21132 (and f1 \u2248 f2), if for any P(X, Y) and quantization rules Q1, Q2, there holds:\n\nR\u21131(Q1) \u2264 R\u21131(Q2) \u21d4 R\u21132(Q1) \u2264 R\u21132(Q2).   (10)\n\nThe following result provides necessary and sufficient conditions for universal equivalence:\n\nTheorem 8. Suppose that f1 and f2 are differentiable a.e., convex functions that map [0, +\u221e) to \u211d. Then f1 \u2248 f2 if and only if f1(u) = cf2(u) + au + b for some constants a, b \u2208 \u211d and c > 0.\n\nIf we restrict our attention to convex and differentiable a.e. functions f, then it follows that all f-divergences universally equivalent to the variational distance must have the form\n\nf(u) = \u2212c min(u, 1) + au + b with c > 0.   (11)\n\nAs a consequence, the only \u2113-loss functions universally equivalent to the 0-1 loss are those that induce an f-divergence of this form (11). One well-known example of such a function is the hinge loss; more generally, Theorem 4 allows us to construct all such \u2113.\n\n4.2 Consistency in experimental design\n\nThe notion of universal equivalence might appear quite restrictive, because condition (10) must hold for any underlying probability measure P(X, Y). However, this is precisely what we need when P(X, Y) is unknown. Assume that the knowledge about P(X, Y) comes from an empirical data sample (xi, yi), i = 1, ..., n.\n\nConsider any algorithm (such as that proposed by Nguyen et al. [6]) that involves choosing a classifier-quantizer pair (\u03b3, Q) \u2208 \u0393 \u00d7 \ud835\udcac by minimizing an empirical version of the \u2113-risk:\n\nR\u0302\u2113(\u03b3, Q) := (1/n) \u03a3_{i=1}^{n} \u03a3_z \u2113(yi \u03b3(z)) Q(z|xi).\n\nMore formally, suppose that (Cn, Dn) is a sequence of increasing compact function classes such that C1 \u2286 C2 \u2286 ... \u2286 \u0393 and D1 \u2286 D2 \u2286 ... \u2286 \ud835\udcac.
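The empirical \u2113-risk has a direct implementation. The sketch below instantiates it for the hinge loss; the quantizer Q, the discriminant \u03b3, and the data sample are all hypothetical values chosen only for illustration.

```python
import numpy as np

# hat-R_ell(gamma, Q) = (1/n) sum_i sum_z ell(y_i * gamma(z)) * Q(z | x_i)
hinge = lambda t: max(0.0, 1.0 - t)

def empirical_risk(gamma, Q, xs, ys):
    n = len(xs)
    return sum(hinge(y * gamma[z]) * Q[x, z]
               for x, y in zip(xs, ys)
               for z in range(Q.shape[1])) / n

Q = np.array([[0.9, 0.1],   # Q(z | x): rows index x in {0,1,2}, columns z in {0,1}
              [0.2, 0.8],
              [0.5, 0.5]])
gamma = [+1.0, -1.0]        # gamma(z) for z in {0, 1}
xs = [0, 0, 1, 2, 1]        # toy observations
ys = [+1, +1, -1, -1, +1]   # toy labels
risk = empirical_risk(gamma, Q, xs, ys)
assert np.isclose(risk, 0.68)   # verified by hand for this toy sample
```

Joint minimization over (\u03b3, Q), as in the procedure analyzed next, would optimize this quantity over both arguments rather than evaluating it at fixed values.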
Let (\u03b3*n, Q*n) be an optimal solution to the minimization problem min_{(\u03b3,Q)\u2208(Cn,Dn)} R\u0302\u2113(\u03b3, Q), and let R*bayes denote the minimum Bayes risk achieved over the space of decision rules (\u03b3, Q) \u2208 (\u0393, \ud835\udcac). We call Rbayes(\u03b3*n, Q*n) \u2212 R*bayes the Bayes error of our estimation procedure. We say that such a procedure is universally consistent if the Bayes error tends to 0 as n \u2192 \u221e, i.e., for any (unknown) Borel probability measure P on \ud835\udcb3 \u00d7 \ud835\udcb4,\n\nlim_{n\u2192\u221e} Rbayes(\u03b3*n, Q*n) \u2212 R*bayes = 0 in probability.\n\nWhen the surrogate loss \u2113 is universally equivalent to the 0-1 loss, we can prove that suitable learning procedures are indeed universally consistent. Our approach is based on the framework developed by various authors [13, 10, 2] for the case of ordinary classification, and uses the strategy of decomposing the Bayes error into a combination of (a) the approximation error introduced by the bias of the function classes Cn \u2286 \u0393 and Dn \u2286 \ud835\udcac: E0(Cn, Dn) = inf_{(\u03b3,Q)\u2208(Cn,Dn)} R\u2113(\u03b3, Q) \u2212 R*\u2113, where R*\u2113 := inf_{(\u03b3,Q)\u2208(\u0393,\ud835\udcac)} R\u2113(\u03b3, Q); and (b) the estimation error introduced by the variance of using a finite sample size n: E1(Cn, Dn) = E sup_{(\u03b3,Q)\u2208(Cn,Dn)} |R\u0302\u2113(\u03b3, Q) \u2212 R\u2113(\u03b3, Q)|, where the expectation is taken with respect to the (unknown) probability measure P(X, Y).\n\nAssumptions. Assume that the loss function \u2113 is universally equivalent to the 0-1 loss. From Theorem 8, the corresponding f-divergence must be of the form f(u) = \u2212c min(u, 1) + au + b, for a, b \u2208 \u211d and c > 0.
Finally, we also assume that\n(a \u00a1 b)(p \u00a1 q) \u201a 0 and `(0) \u201a 0.3\nIn addition, for each n = 1; 2; : : :, suppose that\nMn := supy;z sup((cid:176);Q)2(Cn;Dn) j`(y(cid:176)(z))j < +1.\n\n3These technical conditions are needed so that the approximation error due to varying Q domi-\n\nnates the approximation error due to varying (cid:176). Setting a = b is suf\ufb01cient.\n\n\fThe following lemma plays a key role in our proof: it links the excess `-risk to the Bayes\nerror when performing joint minimization:\nLemma 9. For any ((cid:176); Q), we have c\n\n2 (Rbayes((cid:176); Q) \u00a1 R\u2044\n\nbayes) \u2022 R`((cid:176); Q) \u00a1 R\u2044\n`:\n\nFinally, we can relate the Bayes error to the approximation error and estimation error, and\nprovide general conditions for universal consistency:\nTheorem 10. (a) For any Borel probability measure P , with probability at least 1\u00a1\u2013, there\nbayes \u2022 2\nc (2E1(Cn;Dn) + E0(Cn;Dn) + 2Mnp2 ln(2=\u2013)=n):\nholds: Rbayes((cid:176)\u2044\nn=1Cn is dense in \u00a1 so\n(b) (Universal Consistency) If [1\nthat limn!1 E0(Cn;Dn) = 0, and if the sequence of function classes (Cn;Dn) grows\nsuf\ufb01ciently slowly enough so that limn!1 E1(Cn;Dn) = limn!1 Mnpln n=n = 0, there\nholds limn!1 Rbayes((cid:176)\u2044\n\nn=1Dn is dense in Q and if [1\n\nbayes = 0 in probability.\n\nn; Q\u2044\n\nn; Q\u2044\n\nn) \u00a1 R\u2044\n\nn) \u00a1 R\u2044\n\n5 Conclusions\n\nWe have presented a general theoretical connection between surrogate loss functions and\nf-divergences. As illustrated by our application to decentralized detection, this connec-\ntion can provide new domains of application for statistical learning theory. We also expect\nthat this connection will provide new applications for f-divergences within learning the-\nory; note in particular that bounds among f-divergences (of which many are known; see,\ne.g., [11]) induce corresponding bounds among loss functions.\n\nReferences\n[1] S. M. Ali and S. D. 
Silvey. A general class of coefficients of divergence of one distribution from another. J. Royal Stat. Soc. Series B, 28:131\u2013142, 1966.\n\n[2] P. Bartlett, M. I. Jordan, and J. D. McAuliffe. Convexity, classification and risk bounds. Journal of the American Statistical Association, 2005. To appear.\n\n[3] D. Blackwell. Equivalent comparisons of experiments. Annals of Mathematical Statistics, 24(2):265\u2013272, 1953.\n\n[4] I. Csisz\u00e1r. Information-type measures of difference of probability distributions and indirect observation. Studia Sci. Math. Hungar., 2:299\u2013318, 1967.\n\n[5] T. Kailath. The divergence and Bhattacharyya distance measures in signal selection. IEEE Trans. on Communication Technology, 15(1):52\u201360, 1967.\n\n[6] X. Nguyen, M. J. Wainwright, and M. I. Jordan. Nonparametric decentralized detection using kernel methods. IEEE Transactions on Signal Processing, 53(11):4053\u20134066, 2005.\n\n[7] X. Nguyen, M. J. Wainwright, and M. I. Jordan. On divergences, surrogate loss functions and decentralized detection. Technical Report 695, Department of Statistics, University of California at Berkeley, September 2005.\n\n[8] H. V. Poor and J. B. Thomas. Applications of Ali-Silvey distance measures in the design of generalized quantizers for binary decision systems. IEEE Trans. on Communications, 25:893\u2013900, 1977.\n\n[9] G. Rockafellar. Convex Analysis. Princeton University Press, Princeton, 1970.\n\n[10] I. Steinwart. Consistency of support vector machines and other regularized kernel machines. IEEE Trans. Info. Theory, 51:128\u2013142, 2005.\n\n[11] F. Tops\u00f8e. Some inequalities for information divergence and related measures of discrimination. IEEE Transactions on Information Theory, 46:1602\u20131609, 2000.\n\n[12] J. Tsitsiklis. Extremal properties of likelihood-ratio quantizers. IEEE Trans. on Communications, 41(4):550\u2013558, 1993.\n\n[13] T. Zhang.
Statistical behavior and consistency of classification methods based on convex risk minimization. Annals of Statistics, 32:56\u2013134, 2004.\n", "award": [], "sourceid": 2905, "authors": [{"given_name": "XuanLong", "family_name": "Nguyen", "institution": null}, {"given_name": "Martin", "family_name": "Wainwright", "institution": null}, {"given_name": "Michael", "family_name": "Jordan", "institution": null}]}