{"title": "Toward a Characterization of Loss Functions for Distribution Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 7237, "page_last": 7246, "abstract": "In this work we study loss functions for learning and evaluating probability distributions over large discrete domains. Unlike classification or regression where a wide variety of loss functions are used, in the distribution learning and density estimation literature, very few losses outside the dominant \\emph{log loss} are applied. We aim to understand this fact, taking an axiomatic approach to the design of loss functions for distributions. We start by proposing a set of desirable criteria that any good loss function should satisfy. Intuitively, these criteria require that the loss function faithfully evaluates a candidate distribution, both in expectation and when estimated on a few samples. Interestingly, we observe that \\emph{no loss function} possesses all of these criteria. However, one can circumvent this issue by introducing a natural restriction on the set of candidate distributions. Specifically, we require that candidates are \\emph{calibrated} with respect to the target distribution, i.e., they may contain less information than the target but otherwise do not significantly distort the truth. We show that, after restricting to this set of distributions, the log loss and a large variety of other losses satisfy the desired criteria. These results pave the way for future investigations of distribution learning that look beyond the log loss, choosing a loss function based on application or domain need.", "full_text": "Toward a Characterization of Loss Functions for\n\nDistribution Learning\n\nNika Haghtalab\u2217\nCornell University\n\nnika@cs.cornell.edu\n\nCameron Musco\u2217\nUMass Amherst\n\ncmusco@cs.umass.edu\n\nBo Waggoner\u2020\nU. 
Colorado\n\nbwag@colorado.edu\n\nAbstract\n\nIn this work we study loss functions for learning and evaluating probability dis-\ntributions over large discrete domains. Unlike classi\ufb01cation or regression where\na wide variety of loss functions are used, in the distribution learning and density\nestimation literature, very few losses outside the dominant log loss are applied.\nWe aim to understand this fact, taking an axiomatic approach to the design of loss\nfunctions for distributions. We start by proposing a set of desirable criteria that\nany good loss function should satisfy. Intuitively, these criteria require that the loss\nfunction faithfully evaluates a candidate distribution, both in expectation and when\nestimated on a few samples. Interestingly, we observe that no loss function pos-\nsesses all of these criteria. However, one can circumvent this issue by introducing\na natural restriction on the set of candidate distributions. Speci\ufb01cally, we require\nthat candidates are calibrated with respect to the target distribution, i.e., they may\ncontain less information than the target but otherwise do not signi\ufb01cantly distort\nthe truth. We show that, after restricting to this set of distributions, the log loss and\na large variety of other losses satisfy the desired criteria. These results pave the\nway for future investigations of distribution learning that look beyond the log loss,\nchoosing a loss function based on application or domain need.\n\n1\n\nIntroduction\n\nEstimating a probability distribution given independent samples from that distribution is a fundamental\nproblem in machine learning and statistics [e.g. 23, 2, 24, 5]. 
In machine learning applications, the distribution of interest is often over a very large but finite sample space, e.g., the set of all English sentences up to a certain length or images of a fixed size in their RGB format.

A central problem is evaluating the learned distribution, most commonly using a loss function. Such evaluation is an important task in its own right, as well as central to some learning techniques. Given a ground truth distribution p over a set of outcomes X and a sample x ~ p, a loss function ℓ(q, x) evaluates the performance of a candidate distribution q in predicting x. Generally, ℓ(q, x) will be higher if q places smaller probability on x. Thus, in expectation over x ~ p, the loss will be lower for candidate distributions that closely match p.

The dominant loss applied in practice is the log loss ℓ(q, x) = ln(1/qx), which corresponds to the learning technique of log-likelihood maximization. Surprisingly, few other losses are ever considered. This is in sharp contrast to other areas of machine learning, including supervised learning, where different applications have necessitated the use of different losses, such as the squared loss, hinge loss, ℓ1 loss, etc. However, alternative loss functions can be beneficial for distribution learning on large domains, as we show with a brief motivating example.

*Research conducted while at Microsoft Research, New England.
†Research conducted while at Microsoft Research, New York City.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Motivating example. In many learning applications, one seeks to fit a complex distribution with a simple model that cannot fully capture its complexity. This includes, e.g., noise-tolerant or agnostic learning. As an example, consider modeling the distribution over English words with a character trigram model. 
While this model, trained by minimizing log loss, fits the distribution of English words relatively well, its performance significantly degrades if a small amount of mostly-irrelevant data is added, e.g., if the dataset includes a small fraction of foreign-language words. The model is unable to fit the 'tail' of the distribution (corresponding to foreign words); however, in trying to do so it performs significantly worse on the 'head' of the distribution (corresponding to common English words). This is due to the fact that minimizing log loss requires qx to not be much smaller than px for all x. A more robust loss function, such as the log log loss ℓ(q, x) = ln(ln(1/qx)), emphasizes the importance of fitting the 'head' and is less sensitive to the introduction of the foreign words. See Figure 1 and the full version of the paper for details.

[Figure 1 shows sample words drawn from q1 and q2 alongside the losses: log loss(p) = 7.45, log log loss(p) = 1.91; log loss(q1) = 11.25, log log loss(q1) = 2.22; log loss(q2) = 12.26, log log loss(q2) = 2.18.]

Figure 1: Modeling the distribution of English words, corrupted with 12% French and German words, with character trigrams. Distribution q1 is trained by minimizing log loss. q2 achieves worse log loss but better log log loss and better performance at fitting the 'head' of the target p, indicating that log log loss may be more appropriate in this application. See the full version for more details.

Loss function properties. In this paper, we start by understanding the desirable properties of log loss and seek to identify other loss functions with such properties that can have applications in various domains. A key characteristic of the log loss is that it is (strictly) proper. 
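Before turning to properness, the head-vs-tail effect of the motivating example can be reproduced with a small numerical sketch. The distributions below are illustrative stand-ins of our own making, not the paper's trigram models:

```python
import math

# Toy domain: 10 "head" items (common words) and 100 "tail" items (rare,
# foreign-language words). The probabilities are invented for illustration.
head, tail = 10, 100
p  = [0.88 / head] * head + [0.12 / tail] * tail   # corrupted target
q1 = [0.60 / head] * head + [0.40 / tail] * tail   # hedges toward the tail
q2 = [0.999 / head] * head + [0.001 / tail] * tail # fits the head, ignores tail

def expected_loss(q, p, loss):
    # l(q; p) = E_{x~p}[ loss(q_x) ]
    return sum(px * loss(qx) for px, qx in zip(p, q))

log_loss     = lambda qx: math.log(1 / qx)
log_log_loss = lambda qx: math.log(math.log(1 / qx))

# q1 wins under log loss, but q2 -- which models the head far better --
# wins under the more robust log log loss, mirroring Figure 1.
assert expected_loss(q1, p, log_loss) < expected_loss(q2, p, log_loss)
assert expected_loss(q2, p, log_log_loss) < expected_loss(q1, p, log_log_loss)
```

None of this, however, tells us whether a given loss is even trustworthy to minimize; that is the role of properness.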
That is, the true underlying distribution p (uniquely) minimizes the expected loss on samples drawn from p. Properness is essential for loss functions: without it, minimizing the expected loss leads to choosing an incorrect candidate distribution even when the target distribution is fully known. Log loss is also local (sometimes termed pointwise). That is, the loss of q on sample x is a function of the probability qx and not of qx′ for x′ ≠ x. Local losses are preferred in machine learning, where qx is often implicitly represented as the output of a likelihood function applied to x, but where fully computing q requires time at least linear in the size of the sample space N and is infeasible for large domains, such as learning the distribution of all English sentences up to a certain length.

It is well known that log loss is the unique local and strictly proper loss function [19, 22, 13]. Thus, requiring strict properness and locality already restricts us to using the log loss. At the same time, these restrictive properties are not sufficient for effective distribution learning, because:

• A candidate distribution may be far from the target yet have arbitrarily close to optimal loss. Motivated by this problem, we define strongly proper losses that, if given a candidate far from the target, will give an expected loss significantly worse than optimal.

• A candidate distribution might be far from the target, yet on a small number of samples, it may be likely to have smaller empirical loss than that of the target. This motivates our definition of sample-proper losses.

• On a small number of samples, the empirical loss of a distribution may be far from its expected loss, making evaluation impossible. 
This motivates our definition of concentrating losses.

Naively, it seems we cannot satisfy all our desired criteria: our only local strictly proper loss is the log loss, which in fact fails to satisfy the concentration requirement (see Example 4). We propose to overcome this challenge by restricting the set of candidate distributions, specifically to ones that satisfy the reasonable condition of calibration. We then consider the properties of loss functions on, not the set of all possible distributions, but the set of calibrated distributions.

Calibration and results. We call a candidate distribution q calibrated with respect to a target p if all elements to which q assigns probability α actually occur on average with probability α in the target distribution.3 This can also be interpreted as requiring q to be a coarsening of p, i.e., a calibrated distribution may contain less information than p but does not otherwise distort information. While for simplicity we focus on exactly calibrated distributions, in the full version we extend our results to a natural notion of approximate calibration. Our main results show that the calibration constraint overcomes the impossibility of satisfying properness along with our three desired criteria.

Main results (informal summary). Any local loss ℓ(q, x) := f(1/qx) such that f is strictly concave and monotonically increasing has the following properties subject to calibration:

1. ℓ is strictly proper, i.e., the target distribution minimizes expected loss.

2. If in addition f satisfies left-strong-concavity, then ℓ is strongly proper, i.e., distributions far from the target have significantly worse loss.

3. If in addition to the above f grows relatively slowly, then ℓ is sample proper, i.e., on few samples, distributions far from the target have higher empirical loss with high probability.

4. 
Under these same conditions, ℓ concentrates, i.e., on few samples, a distribution's empirical loss is a reliable estimate of its expected loss with high probability.

The above criteria are formally introduced in Section 3. Each criterion is parameterized, and different losses satisfy them with different parameters. We illustrate a few examples in Table 1 below. We emphasize that all losses shown below achieve relatively strong bounds, depending only polylogarithmically on the domain size N. Thus, we view all of these loss functions as viable alternatives to the log loss, which may be useful in different applications.

ℓ(q, x) | Strong Properness: E ℓ(q; x) − E ℓ(p; x) | Concentration: sample size m(γ, N) | Sample Properness: sample size m(ε, N)
ln(1/qx) | Ω(ε²) | Õ(γ⁻² (ln(N/γ))²) | O(ε⁻⁴ (ln N)²)
(ln(1/qx))^p for p ∈ (0, 1] | Ω(ε² (ln N)^(p−1)) | Õ(γ⁻² (ln(N/γ))^(2p)) | O(ε⁻⁴ (ln N)²)
ln ln(1/qx) | Ω(ε²/ln N) | Õ(γ⁻² (ln ln(N/γ))²) | O(ε⁻⁴ (ln ln N)² (ln N)²)
(ln(e²/qx))² | Ω(ε²) | Õ(γ⁻² (ln(N/γ))⁴) | O(ε⁻⁴ (ln N)⁴)

Table 1: Examples of loss functions that demonstrate strong properness, sample properness, and concentration, when restricted to calibrated distributions. 
In the above, N is the distribution's support size, ε := ‖p − q‖₁ is the ℓ1 distance between p and q, and γ is an approximation parameter for concentration (see Section 4.2 for details). We assume for simplicity that ε ≥ 1/N and hide dependencies on a success probability parameter for sample properness and concentration. Õ(·) suppresses logarithmic dependence on 1/ε and 1/γ.

1.1 Related work

Our work is directly inspired by applications of distribution estimation in very high-dimensional spaces, such as language modeling [18]. However, we do not know of work in this area that takes a systematic approach to designing loss functions.

A conceptually related research problem is that of learning distributions using computationally and statistically efficient algorithms. Beyond loss function minimization, a number of general-purpose methods have been proposed for this problem, including using histograms, nearest neighbor estimators, etc. See [15] for a survey of these methods. Much of the work in this space focuses on learning structured or parametric distributions [7, 16, 17, 6], e.g., monotone distributions or mixtures of Gaussians. On the other hand, learning an unstructured discrete distribution with support size N to within ℓ1 distance ε requires poly(N, 1/ε) samples. Thus, works in this space typically focus on designing computationally efficient algorithms for optimal estimation using large sample sets [24]. In comparison, we focus on unstructured distributions with prohibitively large supports and characterize loss functions that only require polylog(N) sample complexity to estimate.

3This definition is an adaptation of the standard calibration criterion applied to sequences of predictions made by a forecaster [8, 11]. See discussion in the full version of the paper. 
We do not introduce a general algorithm for distribution learning, since any such algorithm would require Ω(N) samples. Rather, motivated by tailored algorithms used in complex domains such as natural language processing, our work characterizes loss functions that could be used by a variety of algorithms.

Outside distribution learning, loss functions (termed scoring rules) have been studied for decades in the information elicitation literature, which seeks to incentivize experts, such as weather forecasters, to give accurate predictions [e.g., 4, 14, 22, 12, 13]. The notion of loss function properness, for example, comes from this literature. Recent research has made some connections between information elicitation and loss functions in machine learning; however, it has focused mostly on classification and regression and not distribution learning [1, 12, 20, 21, 9]. Our work can be viewed as a contribution to the literature on evaluating forecasters by showing that, if the forecaster is constrained to be calibrated, then a variety of simple local loss functions become (strongly, sample) proper.

2 Preliminaries

We work with distributions over a finite domain X with |X| = N. The set of all distributions over X is denoted by ∆X. We denote a distribution p ∈ ∆X by a vector of probabilities, where px is the probability p places on x ∈ X. For any set B ⊆ X, the total probability p places on B is denoted by p(B) := ∑_{x∈B} px. We use X to denote a random variable on X whose distribution is specified in context. We also consider point mass distributions δx ∈ ∆X, where δx_{x′} = 1[x = x′]. Throughout this paper, we typically use p to denote the true (or target) distribution and q to denote a candidate or predicted distribution. 
For any two distributions p and q, the total variation distance between them is defined by TV(p, q) := sup_{B⊆X} p(B) − q(B) = (1/2)‖p − q‖₁, where ‖·‖₁ denotes the ℓ1 norm of a vector. Together, ℓ1 and the total variation distance are two of the most widely used measures of distance between distributions.

To measure the quality of a candidate distribution q given samples from p, machine learning typically turns to loss functions. A loss function is a function ℓ : ∆X × X → R where ℓ(q, x) is the loss assigned to candidate q on outcome x. Given a target distribution p, the expected loss for candidate q is defined as ℓ(q; p) := E_{X∼p}[ℓ(q, X)]. A loss function is called proper if ℓ(p; p) ≤ ℓ(q; p) for all p ≠ q, and strictly proper if the inequality is always strict.4 Two common examples of proper loss functions are the log loss ℓ(q, x) = ln(1/qx) (with the logarithm always taken base e in this paper) and the quadratic loss ℓ(q, x) = (1/2)‖δx − q‖₂². A loss function ℓ is called local if ℓ(q, x) is a function of qx alone. For example, the log loss is local while the quadratic loss is not.

Our main results are characterized by the geometry of the loss functions we consider. For simplicity, we will generally assume functions are differentiable, although our results can be extended.

Definition 1 (Strongly Concave). A function f : [0, ∞] → R is β-strongly concave if for all z, z′ in the domain of f, f(z) ≤ f(z′) + ∇f(z′) · (z − z′) − (β/2)(z − z′)².

We also consider a relaxation of strong concavity that helps us in analyzing functions that have a large curvature close to the origin but flatten out as we move farther from it.

Definition 2 (Left-Strongly Concave). A function f : [0, ∞] → R is β(z)-left-strongly concave if the function restricted to [0, z] is β(z)-strongly concave, for all z.

As discussed, a natural assumption on the set of candidate distributions is calibration. Formally:

Definition 3 (Calibration). Given a distribution q ∈ ∆X, let Bt(q) = {x : qx = t}. When it is clear from the context, we suppress q in the definition of Bt. We say that q is calibrated with respect to p if q(Bt(q)) = p(Bt(q)) for all t ∈ [0, 1]. We let C(p) denote the set of all calibrated distributions with respect to p.

In other words, q is calibrated with respect to p if points assigned probability qx = t have average probability t under p. Equivalently, p can be "coarsened" to q by taking subsets of points and replacing their probabilities with the subset average. Note that the uniform distribution q = (1/N, ..., 1/N) is calibrated with respect to all p, and that p is calibrated with respect to itself. Also note that there are only finitely many values t ∈ [0, 1] for which Bt is non-empty. We denote the set of these values by T(q) = {t : Bt ≠ ∅}.

We refer an interested reader to the full version of the paper for a more detailed discussion of the notion of calibration and its connections to similar notions used in forecasting theory, e.g., [8, 11]. See the full version for a discussion of how our results can be extended to a natural notion of approximate calibration.

4Our use of "properness" is inspired by the literature on proper scoring rules. It is not to be confused with "properness" in learning theory, where the learned hypothesis must belong to a pre-determined class of hypotheses.

3 Three Desirable Properties of Loss Functions

In this section, we define three criteria and discuss why any desirable loss function should demonstrate them. 
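Before turning to those criteria, the calibration condition of Definition 3 can be operationalized in a few lines. The helper names below are ours, not the paper's:

```python
from collections import defaultdict

def is_calibrated(q, p, tol=1e-9):
    """Definition 3: for every level set B_t = {x : q[x] = t},
    the p-mass of B_t must equal its q-mass, which is t * |B_t|."""
    buckets = defaultdict(list)
    for x, t in enumerate(q):
        buckets[t].append(x)
    return all(abs(sum(p[x] for x in B) - t * len(B)) <= tol
               for t, B in buckets.items())

def coarsen(p, partition):
    """Coarsen p by averaging its probabilities over each part of a
    partition; the result is calibrated w.r.t. p by construction."""
    q = [0.0] * len(p)
    for part in partition:
        avg = sum(p[x] for x in part) / len(part)
        for x in part:
            q[x] = avg
    return q

p = [0.5, 0.2, 0.2, 0.1]
q = coarsen(p, [[0, 1], [2, 3]])   # q = [0.35, 0.35, 0.15, 0.15]
uniform = [0.25] * 4               # calibrated with respect to every p

assert is_calibrated(q, p) and is_calibrated(uniform, p) and is_calibrated(p, p)
assert not is_calibrated([0.6, 0.1, 0.2, 0.1], p)
```

Here `coarsen` mirrors the subset-averaging interpretation above: any such coarsening of p, including the uniform distribution, is calibrated with respect to p.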
We use examples of loss functions, such as the log loss ℓ_log-loss(q, x) = ln(1/qx) and the linear loss ℓ_lin-loss(q, x) = −qx, to help demonstrate the existence or lack of these criteria.

3.1 Strong Properness

Recall that a loss function is strictly proper if all incorrect candidate distributions yield a higher expected loss value than the target distribution. Here, we expand this to strong properness, where this gap in expected loss grows with distance from the target distribution. We also extend both definitions to hold over a specific domain of candidate distributions, rather than all distributions.

Definition 4 (Calibrated Properness). Let P : ∆X → 2^∆X be a domain function, that is, P(p) ⊆ ∆X is a restricted set of distributions. A loss function ℓ is proper over P if for all p ∈ ∆X, p ∈ argmin_{q∈P(p)} ℓ(q; p). A loss function is said to be strictly proper over P if the argmin is always unique. When P(p) = C(p), i.e., the set of calibrated distributions w.r.t. p, we call such a loss function (strictly) calibrated proper.

Example 1. It is well known that ℓ_log-loss(q, x) = ln(1/qx) is the unique local proper loss function (up to scaling) over the unrestricted domain P(p) = ∆X [3]. Indeed, it is known that the difference in expected log loss of a prediction q and the target distribution p is the KL-divergence, i.e.,

ℓ_log-loss(q; p) − ℓ_log-loss(p; p) = KL(p, q) := ∑_x px ln(px/qx).   (1)

Furthermore, the KL-divergence is strictly positive for p ≠ q. This proves that the log loss is strictly proper over ∆X and, as a result, is strictly calibrated proper as well.

On the other hand, ℓ_lin-loss(q, x) = −qx is not proper over ∆X. This is due to the fact that the minimizer of this loss is the point mass distribution δx for x = argmax_x px. For example, for target distribution p = (1/3, 2/3), distribution q = (0, 1) yields a lower ℓ_lin-loss than that of p. Note, however, that such a choice of q is not calibrated with respect to p. When loss minimization is constrained to the set of calibrated distributions, C(p) = {(1/3, 2/3), (1/2, 1/2)}, p minimizes the expected linear loss. Indeed, in Section 4 we show more generally that the linear loss and in fact many reasonable local loss functions are calibrated proper.

While strict properness is an important baseline guarantee, we would like a "stronger" property: if q is significantly incorrect, in the sense of being far from p, then the expected loss of q should be significantly worse. This motivates the following definition. An analogous definition has appeared in the context of mechanism design in [10].

Definition 5 (Strong Calibrated Properness). A loss function ℓ is β-strongly proper over a domain function P if for all p ∈ ∆X and all q ∈ P(p), ℓ(q; p) − ℓ(p; p) ≥ (β/2)‖p − q‖₁². When P(p) = C(p), we call such functions β-strongly calibrated proper, and when P(p) = ∆X, we simply refer to them as β-strongly proper.

Example 2. The log loss is 1-strongly proper. This is equivalent to Pinsker's inequality, which states that for all p and q, KL(p, q) ≥ 2 TV(p, q)². Together with (1) and the fact that TV(p, q) = (1/2)‖p − q‖₁, this shows that log loss is 1-strongly proper (and thus also 1-strongly calibrated proper).

As we will see in Section 4, strong calibrated properness relates to the notion of strong concavity (of the inverse loss function) in the ℓ1 norm. We refer the interested reader to the full version of the paper for a discussion of the use of alternative norms in the definition of strong properness. In the full version 
In the full version\n\n2 (cid:107)p \u2212 q(cid:107)2\n\n1\n\n5\n\n\fwe extend the study of normed concavity of loss functions to strong properness of a loss function\nover \u2206X .\n\n3.2 Sample-properness\nSo far, we have focused on the loss a candidate q receives in expectation over x \u223c p. Of course,\nif one is attempting to learn p, this expectation can generally not be computed. We would like the\nnotion of properness to carry over to the setting when the loss on q is estimated using a small set of\nsamples from p. We say that a loss function is sample-proper if within a small number, all candidate\ndistributions that are suf\ufb01ciently far from p yield a loss that is larger than that of p on the samples.\nIn the remainder of this paper, let \u02c6p denote the empirical distribution corresponding to samples drawn\nfrom p. Note that the average loss of any q on the samples can be written (cid:96)(q; \u02c6p). Formally:\nDe\ufb01nition 6 (Calibrated Sample-Properness). A loss function (cid:96) is m(\u0001, \u03b4, N )-sample proper over a\nfunction domain P if, for all p \u2208 \u2206X and all q \u2208 P(p) with (cid:107)p \u2212 q(cid:107)1 \u2265 \u0001, with probability at least\n1 \u2212 \u03b4 over m(\u0001, \u03b4, N ) i.i.d. samples from p, we have (cid:96)(p; \u02c6p) < (cid:96)(q; \u02c6p). When P(p) = C(p), we\ncall such functions calibrated m(\u0001, \u03b4, N )-sample proper.\n\nExample 3. A folklore theorem states that (cid:96)log-loss is O(cid:0) 1\na result it is calibrated O(cid:0) 1\n\n(cid:1)(cid:1)-sample proper.\n\n(cid:1)(cid:1)-sample proper over \u2206X , and as\n\n\u00012 ln(cid:0) 1\n\n\u00012 ln(cid:0) 1\n\nNow consider (cid:96)lin-loss(q, x) = \u2212qx. Since it is not a proper loss function over \u2206X , by de\ufb01nition\nit is not sample proper over \u2206X for any m(\u0001, \u03b4, N ). When restricting to calibrated distributions\nhowever, as we claimed in Example 1 linear loss is calibrated proper in expectation. 
It is interesting to note that linear loss is not sample proper for any m(ε, δ, N) ∈ o(N). To observe this, consider p where p₁ = 1/4 + 1/√m, p₂ = 1/4 − 1/√m, px = 1/(2(N/2 − 2)) for x = 3, ..., N/2, and px = 0 for x = N/2 + 1, ..., N. Consider q where q₁ = q₂ = 1/4 and qx = 1/(2(N − 2)) for x = 3, ..., N. Let p̂ be the empirical distribution. With a constant probability, p̂₁ ≤ 1/4 − 1/√m and p̂₂ ≥ 1/4. Let ν = 1/(2(N/2 − 2)) − 1/(2(N − 2)) = Θ(1/N). Therefore,

ℓ(q; p̂) − ℓ(p; p̂) = ∑_{x=1}^{N} p̂x (px − qx)
= (p̂₁ − p̂₂)/√m + ν ∑_{x=3}^{N/2} p̂x − (1/(2(N − 2))) ∑_{x=N/2+1}^{N} p̂x
≤ −1/m + Θ(1/N) < 0,

when m ∈ o(N). Furthermore, note that q is calibrated w.r.t. p with two non-empty buckets, B_{1/4}(q) = {1, 2} and B_{1/(2(N−2))}(q) = {3, ..., N}. Moreover, ‖p − q‖₁ = Θ(1). Thus, for ℓ_lin-loss to be calibrated m(ε, δ, N)-sample proper, we must have m(Θ(1), Θ(1), N) ∈ Ω(N).

3.3 Concentration

Beyond sample properness, when the expected loss ℓ(q; p) is estimated from a small i.i.d. sample from p, we would like the empirical loss to remain faithful to the true value. For example, one might hope that minimizing loss on that sample will result in a distribution that has small loss on p. This will hold as long as the empirical loss well approximates the true expected loss with high probability.

Definition 7 (Calibrated Concentration). A loss function ℓ concentrates over domain function P with m(γ, δ, N) samples if for all p ∈ ∆X, for all q ∈ P(p), for m(γ, δ, N) i.i.d. samples from p, Pr[|ℓ(q; p̂) − ℓ(q; p)| ≥ γ] ≤ δ. When P(p) = C(p), we say that ℓ calibrated concentrates with m(γ, δ, N) samples.5

5We use γ to denote a difference in loss to avoid confusion with ε, which generally denotes a distance between distributions.

Example 4. We can easily see that log loss does not concentrate with o(N) samples over ∆X. Let p be the uniform distribution and q be uniform on X \ {x}. With high probability, x is not sampled, and ℓ(q; p̂) is finite. Yet ℓ(q; p) = ∞. Note that although this example is extreme, its conclusion is robust: one can make an arbitrarily large finite gap. As we will see, the log loss, along with many other reasonable losses, will concentrate with a small number of samples over calibrated distributions.

4 Main Results

Looking back at the criteria defined in Section 3, we are immediately faced with an impossibility result: no local loss function exists that satisfies properness, o(N)-sample properness, and concentration with o(N) samples. This is because log loss is the unique local loss function that satisfies the first property, and as shown in Example 4 it does not concentrate. In this section, we show that a broad class of local loss functions with certain niceness properties satisfies the above three criteria over calibrated domains. Specifically, we consider loss functions ℓ(q, x) that are non-increasing in qx and are inversely concave: ℓ(q, x) = f(1/qx) for some concave function f. 
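Example 4's failure mode is easy to simulate; the sketch below follows the stated construction, with variable names of our own choosing:

```python
import math, random

random.seed(0)
N = 10_000
m = 100                                  # m << N samples
x0 = 0

# Example 4's construction: p uniform on {0,...,N-1}; q uniform on the
# domain with x0 removed, so q_{x0} = 0 and the true expected log loss
# l(q; p) is infinite, while a small sample rarely reveals this.
def q_prob(x):
    return 0.0 if x == x0 else 1.0 / (N - 1)

sample = [random.randrange(N) for _ in range(m)]
if x0 not in sample:                     # happens with prob. (1 - 1/N)^m ~ 0.99
    emp_loss = sum(math.log(1 / q_prob(x)) for x in sample) / m
    assert math.isfinite(emp_loss)       # empirical loss looks perfectly tame
```

Calibration excludes such pathologies: as noted in Section 4.2, a calibrated q satisfies qx/px ≥ 1/N for all x, so no outcome with positive target probability receives a vanishing qx. The relevant family here is the inversely concave losses ℓ(q, x) = f(1/qx) just introduced.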
Similarly, we say that ℓ is inversely strongly concave if the corresponding f is strongly concave.

4.1 Calibrated and Strong Calibrated Properness

In this section, we show that any (strongly) nice loss function is (strongly) proper over the domain of calibrated distributions. More formally:

Theorem 1 (Strict Properness). Suppose the local loss function ℓ is such that ℓ(q, x) = f(1/qx) for a strictly concave function f. Then ℓ is strictly proper over the domain function C.

Theorem 2 (Strong Properness). Suppose the loss function ℓ is such that ℓ(q, x) = f(1/qx), where f is non-decreasing and is C(x)/x²-left-strongly concave, where C(x) is non-increasing and non-negative for x ≥ 1. Then for all p ∈ ∆X and q ∈ C(p),

ℓ(q; p) − ℓ(p; p) ≥ (C(4N/‖p − q‖₁)/128) · ‖p − q‖₁².

We defer the proof of Theorem 2 to the full version.

We begin with the proof of Theorem 1, which relies on a key property of calibration stated in Lemma 1. At a high level, this lemma shows that the average value of 1/px and 1/qx is the same over instances x such that qx = t, which is also equal to 1/t.

Lemma 1. For any distribution p ∈ ∆X and q ∈ C(p), and for any t ∈ [0, 1], we have E_{X∼p}[1/p_X | X ∈ Bt] = 1/t, where Bt = {x : qx = t}.

Proof. We have

E[1/p_X | X ∈ Bt] = ∑_{x∈Bt} (px/p(Bt)) · (1/px) = |Bt|/p(Bt) = 1/t.

Proof of Theorem 1. Suppose ℓ(q, x) = f(1/qx) for a strictly concave f. Consider any q that is calibrated with respect to p. 
Recall that B_t = {x : q_x = t} and that T(q) = {t : B_t ≠ ∅} is a finite set. We have

$$\ell(p; p) = \sum_{t \in T(q)} p(B_t) \, \mathbb{E}\left[ f\!\left(\frac{1}{p_X}\right) \Big|\, X \in B_t \right] \le \sum_{t \in T(q)} p(B_t) \, f\!\left( \mathbb{E}\left[ \frac{1}{p_X} \Big|\, X \in B_t \right] \right) = \sum_{t \in T(q)} p(B_t) \, f\!\left(\frac{1}{t}\right) = \sum_{t \in T(q)} \sum_{x \in B_t} p_x f\!\left(\frac{1}{q_x}\right) = \ell(q; p),$$

where the second transition is by Jensen's inequality and the third transition is by Lemma 1. If f is strictly concave and there exists a B_t on which q and p disagree, then the inequality is strict.

4.2 Concentration

The (strong) properness of a loss function, as discussed in Section 4.1, concerns only the loss in expectation. In this section, we consider finite-sample guarantees. Recall that ℓ concentrates over P(p) (Definition 7) if, with m(γ, δ, N) samples, the empirical loss ℓ(q; p̂) of a distribution q ∈ P(p) is γ-close to its true loss ℓ(q; p) with probability 1 − δ. Concentration can be difficult to achieve: by Example 4, even the log loss does not concentrate for any sample size o(N) for general q ∈ Δ_X. However, as we show below, when q is calibrated, many natural loss functions, including the log loss, indeed concentrate. All that is needed is that the loss function is inversely concave, increasing, and does not grow too quickly as q_x → 0.

Theorem 3 (Concentration). Suppose ℓ is a local loss function with ℓ(q, x) = f(1/q_x) for nonnegative, increasing, concave f(z).
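The inequality chain can be spot-checked numerically. The sketch below uses a small illustrative calibrated pair (q averages p over two level sets) and an arbitrary sample of inversely concave losses; `expected_loss` is a hypothetical helper, not a function from the paper. In each case the calibrated candidate q does no better than p itself, as Theorem 1 requires.

```python
import math

def expected_loss(cand, p, f):
    # l(cand; p) = E_{x~p}[f(1 / cand_x)] for a local loss l(q, x) = f(1/q_x)
    return sum(px * f(1.0 / cx) for px, cx in zip(p, cand))

p = [0.1, 0.3, 0.2, 0.4]        # target
q = [0.2, 0.2, 0.3, 0.3]        # calibrated coarsening of p

# A few inversely concave losses: f concave and increasing on z >= 1.
fs = {"log": math.log,
      "sqrt": math.sqrt,
      "loglog": lambda z: math.log(math.log(z))}

for name, f in fs.items():
    gap = expected_loss(q, p, f) - expected_loss(p, p, f)
    assert gap >= 0                  # Theorem 1: l(q; p) >= l(p; p)
    print(f"{name}: gap = {gap:.4f}")
```

For the log loss the gap equals the KL divergence between p and q, consistent with strict properness.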
Suppose further that f(z) ≤ c√z for all z ≥ 1 and some constant c. Then ℓ concentrates over the domain function C with any m(γ, δ, N) ≤ N such that

$$m(\gamma, \delta, N) \ge \frac{c_1 \cdot f(\beta)^2 \ln\frac{1}{\delta}}{\gamma^2}, \quad \text{where } c_1 \text{ is a fixed constant and } \beta := \frac{16N^8}{\delta \cdot \min(1, \gamma^2/c^2)}.$$

That is, for any p ∈ Δ_X and q ∈ C(p), drawing at least m(γ, δ, N) samples guarantees |ℓ(q; p̂) − ℓ(q; p)| ≤ γ with probability ≥ 1 − δ.

Note that γ bounds the absolute difference between ℓ(q; p̂) and ℓ(q; p). The desired difference may depend on the relative scale of the loss function. If, e.g., we take ℓ(q, x) and scale it to obtain ℓ′(q, x) = α · ℓ(q, x) for some α, then the desired error γ scales by α, and f(β) and c both scale by α; thus the sample complexity remains fixed.

We defer the proof of Theorem 3 to the full version of the paper. At a high level, Theorem 3 holds because calibration lets us avoid worst-case instances (as in Example 4) via a very simple fact: when q is calibrated, we have q_x/p_x ≥ 1/N for all x. This rules out very low probability events that contribute significantly to ℓ(q; p) but require many samples to identify. To prove Theorem 3, we partition X into Ω, containing the elements of very small probability, and X \ Ω. With high probability, no element of Ω is ever sampled from p. Conditioned on this event, the loss is bounded (and its expectation does not change much), so a concentration result can be applied.
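The phenomenon behind Theorem 3 can be illustrated with a small Monte Carlo experiment; all constants below (N, the shape of p, m, the seed) are arbitrary illustrative choices. A calibrated candidate with two level sets is built by averaging p over each half of the domain, and its empirical log loss on a moderate number of samples lands close to the true loss.

```python
import math
import random

random.seed(1)

N = 200
half = N // 2
total = N * (N + 1) // 2
p = [(x + 1) / total for x in range(N)]          # target: increasing masses

# Calibrated candidate with two level sets: q averages p over each half,
# so p(B_t) = t * |B_t| on both halves.
p_lo, p_hi = sum(p[:half]), sum(p[half:])
q = [p_lo / half] * half + [p_hi / half] * (N - half)

true_loss = sum(px * math.log(1 / qx) for px, qx in zip(p, q))

m = 2000
xs = random.choices(range(N), weights=p, k=m)
empirical_loss = sum(math.log(1 / q[x]) for x in xs) / m

print(f"true = {true_loss:.4f}, empirical = {empirical_loss:.4f}")
```

Here the deviation is on the order of 1/√m; contrast this with Example 4, where an uncalibrated candidate's empirical log loss can stay bounded while its true loss is infinite.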
Recall that a loss function is sample proper if every candidate distribution that is sufficiently far from p incurs a larger loss than p on the empirical distribution p̂ corresponding to a small number of samples from p. It is not hard to see that sample properness of a loss function is a direct consequence of its concentration and strong properness. For any candidate distribution q for which ‖q − p‖₁ is large, strong properness (Theorem 2) implies that ℓ(q; p) is significantly larger than ℓ(p; p). Furthermore, concentration (Theorem 3) implies that with high probability ℓ(q; p) ≈ ℓ(q; p̂) and ℓ(p; p) ≈ ℓ(p; p̂). Therefore, with high probability, ℓ(q; p̂) > ℓ(p; p̂). Formally, in the full version of the paper we prove:

Theorem 4 (Sample Properness). Suppose ℓ is a local loss function with ℓ(q, x) = f(1/q_x) for nonnegative, increasing, concave f(z). Suppose further that f(z) ≤ c√z for all z ≥ 1 and some constant c, and that f is C(x)/x²-left-strongly concave, where C(x) is non-increasing and non-negative for x ≥ 1. Then for all p ∈ Δ_X and q ∈ C(p), if p̂ is the empirical distribution constructed from m independent samples of p with m ≤ N and

$$m \ge \frac{c_1 \cdot f(\beta)^2 \ln\frac{1}{\delta}}{\left( C\!\left(\frac{4N}{\|p-q\|_1}\right) \|p-q\|_1^2 \right)^2}, \quad \text{where } c_1 \text{ is a fixed constant and } \beta := \frac{288N^8}{\delta \cdot \min\!\left(1, \left[\frac{C\left(\frac{4N}{\|p-q\|_1}\right) \|p-q\|_1^2}{128\,c}\right]^2\right)},$$

then ℓ(q; p̂) > ℓ(p; p̂) with probability ≥ 1 − δ.
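Sample properness can be observed empirically. In the sketch below, the target p, the candidate q, the sample size, and the seed are all arbitrary illustrative choices; q is calibrated (a single level set) but far from p in ℓ₁ distance, and both the log loss and the ln ln loss from Section 4.4 give it a visibly larger empirical loss than p.

```python
import math
import random

random.seed(2)

p = [0.05, 0.35, 0.1, 0.5]      # target
q = [0.25, 0.25, 0.25, 0.25]    # calibrated (one level set), ||p - q||_1 = 0.7

m = 2000
counts = [0] * 4
for x in random.choices(range(4), weights=p, k=m):
    counts[x] += 1

def empirical_loss(cand, f):
    # l(cand; p-hat) for the local loss l(q, x) = f(1/q_x)
    return sum(c * f(1.0 / cx) for c, cx in zip(counts, cand)) / m

for name, f in [("log", math.log), ("loglog", lambda z: math.log(math.log(z)))]:
    gap = empirical_loss(q, f) - empirical_loss(p, f)
    assert gap > 0                # sample properness: q loses on p-hat
    print(f"{name}: empirical gap = {gap:.3f}")
```

For the log loss, the empirical gap concentrates around KL(p‖q) > 0, matching the argument above: strong properness supplies the gap in expectation and concentration transfers it to the empirical losses.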
Refer to Table 1 for other loss functions, and see the full version for details on the derivations.

First, note that ln ln(z) is C(z)/z²-left-strongly concave for C(z) = (1 + ln(z))/ln(z)².⁶ Moreover, C(z) is non-increasing and non-negative for z ≥ 1, and ln ln(z) ≤ √z. Using these facts, for any p and q ∈ C(p) such that ‖p − q‖₁ ≥ ε, we have:

• By Theorem 2, ℓ(q; p) − ℓ(p; p) ≥ Ω(ε²/ln(N/ε)).

• By Theorem 3, an empirical distribution p̂ of Õ(γ⁻² ln ln(N)² ln(1/δ)) i.i.d. samples from p is sufficient such that |ℓ(q; p̂) − ℓ(q; p)| ≤ γ with probability 1 − δ.

• By Theorem 4, an empirical distribution p̂ of Õ(ε⁻⁴ ln ln(N ln(N))² ln(1/δ) ln(N)) i.i.d. samples from p is sufficient such that ℓ(q; p̂) > ℓ(p; p̂) with probability 1 − δ.

5 Discussion

In this work, we characterized loss functions that meet three desirable properties: properness in expectation, concentration, and sample properness. We demonstrated that no local loss function meets all of these properties over the domain of all candidate distributions. But if one enforces the criterion of calibration (or approximate calibration, as discussed in the full version), then many simple loss functions have good properties for evaluating learned distributions over large discrete domains. We hope that our work provides a starting point for several future research directions.

One natural question is how to select a loss function based on the application domain. Our language-modeling example from the introduction motivates the idea that the log loss is not always the best choice.
Understanding this more formally, for example in the framework of robust distribution learning, could provide a systematic approach for selecting loss functions based on the needs of the domain. Our work also leaves open the question of designing computationally and statistically efficient learning algorithms for different loss functions under the constraint that the candidate q is (approximately) calibrated. One challenge in designing computationally efficient algorithms is that the space of calibrated distributions is not convex. We present some advances toward dealing with this challenge in the full version by providing an efficient procedure for 'projecting' a non-calibrated distribution onto the space of approximately calibrated distributions. It remains to be seen whether iteratively applying this procedure could be useful in designing an efficient algorithm for minimizing the loss over calibrated distributions.

Acknowledgements

We thank Adam Kalai for significant involvement in the early stages of this project and for suggesting the idea of exploring alternatives to the log loss under calibration restrictions. We also thank Gautam Kamath for helpful discussions.
⁶In the full version, we show that a function f is b(z)-left-strongly concave if, for all z, f″(z) ≤ −b(z).