{"title": "The power of absolute discounting: all-dimensional distribution estimation", "book": "Advances in Neural Information Processing Systems", "page_first": 6660, "page_last": 6669, "abstract": "Categorical models are a natural fit for many problems. When learning the distribution of categories from samples, high-dimensionality may dilute the data. Minimax optimality is too pessimistic to remedy this issue. A serendipitously discovered estimator, absolute discounting, corrects empirical frequencies by subtracting a constant from observed categories, which it then redistributes among the unobserved. It outperforms classical estimators empirically, and has been used extensively in natural language modeling. In this paper, we rigorously explain the prowess of this estimator using less pessimistic notions. We show that (1) absolute discounting recovers classical minimax KL-risk rates, (2) it is \\emph{adaptive} to an effective dimension rather than the true dimension, (3) it is strongly related to the Good-Turing estimator and inherits its \\emph{competitive} properties. We use power-law distributions as the cornerstone of these results. We validate the theory via synthetic data and an application to the Global Terrorism Database.", "full_text": "The power of absolute discounting:\n\nall-dimensional distribution estimation\n\nMoein Falahatgar\n\nUCSD\n\nMesrob Ohannessian\n\nTTIC\n\nAlon Orlitsky\n\nUCSD\n\nVenkatadheeraj Pichapati\n\nUCSD\n\nmoein@ucsd.edu\n\nmesrob@gmail.com\n\nalon@ucsd.edu\n\ndheerajpv7@ucsd.edu\n\nAbstract\n\nCategorical models are a natural \ufb01t for many problems. When learning the dis-\ntribution of categories from samples, high-dimensionality may dilute the data.\nMinimax optimality is too pessimistic to remedy this issue. A serendipitously\ndiscovered estimator, absolute discounting, corrects empirical frequencies by sub-\ntracting a constant from observed categories, which it then redistributes among the\nunobserved. 
It outperforms classical estimators empirically, and has been used extensively in natural language modeling. In this paper, we rigorously explain the prowess of this estimator using less pessimistic notions. We show that (1) absolute discounting recovers classical minimax KL-risk rates, (2) it is adaptive to an effective dimension rather than the true dimension, (3) it is strongly related to the Good–Turing estimator and inherits its competitive properties. We use power-law distributions as the cornerstone of these results. We validate the theory via synthetic data and an application to the Global Terrorism Database.

1 Introduction
Many natural problems involve uncertainties about categorical objects. When modeling language, we reason about words, meanings, and queries. When inferring about mutations, we manipulate genes, SNPs, and phenotypes. It is sometimes possible to embed these discrete objects into continuous spaces, which allows us to use the arsenal of the latest machine learning tools that often (though admittedly not always) need numerically meaningful data. But why not operate in the discrete space directly? One of the main obstacles to this is the dilution of data due to the high-dimensional aspect of the problem, where dimension in this case refers to the number k of categories.
The classical framework of categorical distribution estimation, studied at length by the information theory community, involves a fixed small k [BS04]. Add-constant estimators are sufficient for this purpose. Some of the impetus for understanding the large-k regime came from the neuroscience world [Pan04]. But this extended the pessimistic worst-case perspective of the earlier framework, resulting in guarantees that left a lot to be desired. This is because high dimension often also comes with additional structure. 
In particular, if a distribution produces only roughly d distinct categories in a sample of size n, then we ought to think of d (and not k) as the effective dimension of the problem. There are also some ubiquitous structures, like power-law distributions. Natural language is a flagship example of this, observed as early as Zipf in [Zip35]. Species and genera, rainfall, and terror incidents, to mention just a few, all obey power-laws [SLE+03, CSN09, ADW13]. Are there estimators that mold to both dimension and structure? It turns out we don't need to search far. In natural language processing (NLP) it was first discovered that an estimator proposed by Good and Turing worked very well [Goo53]. Only recently did we start understanding why and how [OSZ03, OD12, AJOS13, OS15]. The best explanation thus far is that it implicitly competes with the best estimator in a very small neighborhood of the true distribution. NLP researchers [NEK94, KN95, CG96] have long realized that another, simpler estimator, absolute discounting, is equally good. But why and how this is the case was never properly determined, save some mention in [OD12] and in [FNT16], where the focus is primarily on form.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

In this paper, we first show that absolute discounting, defined in Section 3, recovers pessimistic minimax optimality in both the low- and high-dimensional regimes. This is an immediate consequence of an upper bound that we provide in Section 5. We then study lower bounds, with classes defined by the number of distinct categories d and also by power-law structure, in Section 6. This reveals that absolute discounting in fact adapts to the family of these classes. We further unravel the relationship of absolute discounting with the Good–Turing estimator, for power-law distributions. 
Interestingly, this leads to a further refinement of this estimator's performance in terms of competitivity. Lastly, we give some synthetic experiments in Section 8 and then explore forecasting global terror incidents on real data [LDMN16], which showcases very well the “all-dimensional” learning power of absolute discounting. These contributions are summarized in more detail in Section 4. We start out in Section 2 by laying out what we mean by these notions of optimality.

2 Optimal distribution learning
In this section we concretely formulate the optimal distribution learning framework and take the opportunity to point out related work.

Problem setting Let p = (p1, p2, ..., pk) be a distribution over [k] := {1, 2, ..., k} categories. Let [k]* be the set of finite sequences over [k]. An estimator q is a mapping that assigns to every sequence x^n ∈ [k]* a distribution q(x^n) over [k]. We model p as the underlying distribution over the categories. We have access to data consisting of n samples X^n = X1, X2, ..., Xn generated i.i.d. from p. Intuitively, our goal is to find a choice of q that is guaranteed to be as close to p, on average, as any other estimator can be. We first need to quantify how performance is measured.

General notation: Let (µj : j = 1, ..., k) denote the empirical counts, i.e. µj is the number of times symbol j appears in X^n, and let D := Σj 1{µj > 0} be the number of distinct categories appearing in X^n. We denote by d := E[D] its expectation. Let (Φµ : µ = 0, ..., n) be the total number of categories appearing exactly µ times, Φµ := Σj 1{µj = µ}. Note that D = Σ_{µ>0} Φµ. Also let (Sµ : µ = 0, ..., n) be the total probability within each such group, Sµ := Σj pj 1{µj = µ}. Lastly, denote the empirical distribution by q+0_j := µj/n.

KL-Risk We adopt the Kullback–Leibler (KL) divergence as a measure of loss between two distributions. When a distribution p is approximated by another q, the KL divergence is given by KL(p||q) := Σ_{j=1}^{k} pj log(pj/qj). We can then measure the performance of an estimator q that depends on data in terms of the KL-risk, the expectation of the divergence with respect to the samples. We use the following notation to express the KL-risk of q after observing n samples X^n:

    rn(p, q) := E_{X^n ∼ p^n} [KL(p || q(X^n))].

An estimator that is identical to p regardless of the data is unbeatable, since rn(p, q) = 0. Therefore it is important to model our ignorance of p and gauge the optimality of an estimator q accordingly. This can be done in various ways. We elaborate on the three most relevant perspectives: minimax, adaptive, and competitive distribution learning.

Minimax In the minimax setting, p is only known to belong to some class of distributions P, but we don't know which one. We would like to perform well, no matter which distribution it is. To each q corresponds a distribution p ∈ P (assuming the class is finite or closed) on which q has its worst performance:

    rn(P, q) := max_{p ∈ P} rn(p, q).

The minimax risk is the least worst-case KL-risk achieved by any estimator q,

    rn(P) := min_q rn(P, q).

The minimax risk depends only on the class P. It is a lower bound: no estimator can beat it for all p, i.e. it's not possible that rn(p, q) < rn(P) for all p ∈ P. 
An estimator q that satisfies an upper bound of the form rn(P, q) = (1 + o(1)) rn(P) is said to be minimax optimal “even to the constant” (an informal but informative expression that we adopt in this paper). If instead rn(P, q) = O(1) rn(P), we say that q is rate optimal. Near-optimality notions are also possible, but we don't dwell on them. As an aside, note that universal compression is minimax optimality using cumulative risk. See [FJO+15] for such related work on universal compression for power laws.

Adaptive The minimax perspective captures our ignorance of p in a pessimistic fashion. This is because rn(P) may be large, but for a specific p ∈ P we may have a much smaller rn(p, q). How can we go beyond this pessimism? Observe that when a class is smaller, rn(P) is smaller, because we would be maximizing over a smaller set. In the extreme case noted earlier, when P contains only a single distribution, we have rn(P) = 0. The adaptive learning setting finds an intermediate ground where we have a family of distribution classes F = {Ps : s ∈ S} indexed by a (not necessarily countable) index set S. For each s, we have a corresponding rn(Ps), which is often much smaller than rn(∪_{s∈S} Ps), and we would like the estimator to achieve the risk bound corresponding to the smaller class. We say that an estimator q is adaptive to the family F if for all s ∈ S:

    rn(p, q) ≤ Os(1) rn(Ps)  ∀p ∈ Ps,  equivalently  rn(Ps, q) ≤ Os(1) rn(Ps).

There often is a price to adaptivity, which is a function of the granularity of F and is paid in the form of varying/large leading constants per class. This framework has been particularly successful in density estimation with smoothness classes [Tsy09] and has recently been used in the discrete setting for universal compression [BGO15].

Competitive The adaptive perspective can be tightened by demanding that, rather than up to a multiplicative constant, the KL-risk track the minimax risk up to a vanishingly small additive term:

    rn(p, q) = rn(Ps) + εn(Ps, q)  ∀p ∈ Ps.

Ideally, we would like the competitive loss εn(Ps, q) to be negligible compared to the risk of each class rn(Ps). If εn(Ps, q) = Os(1) rn(Ps) for all s, then we recover adaptivity. And when εn(Ps, q) = os(1) rn(Ps) for all s ∈ S, we have minimax optimality even to the constant within each class, which is a much stronger form of adaptivity. We then say that the estimator is competitive with respect to the family F. We may also evaluate the worst-case competitive loss over S.
This formulation was recently introduced in [OS15] in the context of distribution learning. That work shows that the celebrated Good–Turing estimator [Goo53], combined with the empirical estimator, has small worst-case competitive loss over the family of classes defined by any given distribution and all its permutations. Most importantly, this loss was shown to stay bounded even as the dimension increases. This provided a rigorous theoretical explanation for the performance of the Good–Turing estimator in high dimensions. A similar framework is also studied for ℓ1-loss in [VV15].

3 Absolute discounting
One of the first things to observe is that the empirical distribution is particularly ill-suited to handle KL-risk. 
This is most easily seen from the fact that the risk blows up to infinity whenever some µj = 0, which happens with positive probability. Instead, one could resort to an add-constant estimator, which for a positive β is of the form q+β_j := (µj + β)/(n + kβ). The most widely-studied class of distributions is the one that includes all of them: the k-dimensional simplex, Δk := {(p1, p2, ..., pk) : Σi pi = 1, pi ≥ 0 ∀i ∈ [k]}. In the low-dimensional scaling, when n/k → ∞ (the “dimension” here being the support size k), the minimax risk is

    rn(Δk) = (1 + o(1)) (k − 1)/(2n).

In [BS04], a variant of the add-constant estimator is shown to achieve this risk even to the constant. Furthermore, any add-constant estimator is rate optimal when k is fixed. But in the very high-dimensional setting, when k/n → ∞, [Pan04] showed that the minimax risk behaves as

    rn(Δk) = (1 + o(1)) log(k/n),

achieved by an add-constant estimator, but with a constant that depends on the ratio of k and n.
Despite these classical results on minimax optimal estimators, in practice people often use other estimators that have better empirical performance. This was a long-running mystery in the language modeling community [CG96], where variants of the Good–Turing estimator were shown to perform the best [JM85, GS95]. The gap in performance was only understood recently, using the notion of competitivity [OS15]. In essence, the Good–Turing estimator works well in both the low- and high-dimensional regimes, and in between. Another estimator, absolute discounting, unlike add-constant estimators, simply subtracts a positive constant from the empirical counts and redistributes the subtracted amount to unseen categories. For a discount parameter δ ∈ [0, 1), it is defined as:

    q−δ_j := (µj − δ)/n         if µj > 0,
    q−δ_j := Dδ/(n(k − D))      if µj = 0.    (1)

Starting with the work of [NEK94], absolute discounting soon supplanted the Good–Turing estimator, due to both its simplicity and comparable performance. Kneser–Ney smoothing [KN95], which uses absolute discounting at its core, was long held as the preferred way to train N-gram models. Even to this day, the state-of-the-art language models are combined systems that usually interpolate between recurrent neural networks and Kneser–Ney smoothing [JVS+16]. Can this success be explained?
Kneser–Ney is for the most part a principled implementation of the notion of back-off, which we only touch upon in the conclusion. The use of absolute discounting is critical, however, as performance deteriorates if we back off with care but use a more naïve add-constant or even Katz-style smoothing [Kat87], which switches from the Good–Turing to the empirical distribution at a fixed frequency point. It is also important to mention the Bayesian approach of [Teh06], the Hierarchical Pitman–Yor language model, which performs similarly to Kneser–Ney. The hierarchies in this model reprise the role of back-off, while the two-parameter Poisson–Dirichlet prior proposed by Pitman and Yor [PY97] results in estimators that are very similar to absolute discounting. The latter is no surprise, because this prior almost surely generates a power-law distribution, which is intimately related to absolute discounting, as we study in this paper. Though our theory applies more generally, it can in fact be straightforwardly adapted to give guarantees for estimators built upon this prior.

4 Contributions
We investigate the reason behind the auspicious behavior of the absolute discounting estimator. 
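Before proceeding, here is a minimal reference implementation of the estimator in Equation (1). This is our own sketch for illustration, not the authors' code; the counts in the usage comment are hypothetical, and the default discount of 0.75 is merely a value commonly used in language modeling.

```python
import numpy as np

def absolute_discounting(counts, k, delta=0.75):
    """Absolute discounting estimator q^{-delta} of Equation (1).

    counts : length-k array of empirical counts mu_j (zeros for unseen
             categories); k is the total number of categories.
    Observed categories receive (mu_j - delta)/n; the total discounted
    mass D*delta/n is spread uniformly over the k - D unseen categories.
    Returns a length-k probability vector.
    """
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()
    seen = counts > 0
    D = int(seen.sum())                    # number of distinct observed categories
    q = np.zeros(k)
    q[seen] = (counts[seen] - delta) / n   # discounted observed frequencies
    if D < k:
        q[~seen] = D * delta / (n * (k - D))  # redistribute to unseen categories
    return q

# Hypothetical usage: 5 categories, counts (3, 1, 1, 0, 0), discount 0.5.
q = absolute_discounting([3, 1, 1, 0, 0], k=5, delta=0.5)
```

By construction the output sums to one: the observed categories carry mass (n − Dδ)/n and the unseen ones carry the remaining Dδ/n.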
We achieve this by demonstrating the adaptivity and competitivity of this estimator for many relevant families of distribution classes. In summary:

• We analyze the performance of the absolute discounting estimator by upper bounding the KL-risk for each class in a family of distribution classes defined by the expected number of distinct categories. [Section 5, Theorem 1] This result implies that absolute discounting achieves classical minimax rate-optimality in both the low- and high-dimensional regimes over the whole simplex Δk, as outlined in Section 2.

• We provide a generic lower bound on the minimax risk of classes defined by a single distribution and all of its permutations. We then show that if the defining distribution is a truncated (possibly perturbed) power-law, then this lower bound matches the upper bound of absolute discounting, up to a constant factor. [Section 6, Corollaries 3 and 4]

• This implies that absolute discounting is adaptive to the family of classes defined by a truncated power-law distribution and its permutations. Also, since classes defined by the expected number of distinct categories necessarily include a power-law, absolute discounting is also adaptive to this family. This is a strict refinement of classical minimax rate-optimality.

• We give an equivalence between the absolute discounting and Good–Turing estimators in the high-dimensional setting, whenever the distribution is a truncated power-law. This is a finite-sample guarantee, as compared to the asymptotic version of [OD12]. As a consequence, absolute discounting becomes competitive with respect to the family of classes defined by permutations of power-laws, inheriting Good–Turing's behavior [OS15]. [Section 7, Lemma 5 and Theorem 6]

We corroborate the theoretical results with synthetic experiments that reproduce the theoretical minimax risk bounds. 
We also show that the prowess of absolute discounting on real data is not restricted to language modeling. In particular, we explore a striking application to forecasting global terror incidents and show that, unlike naive estimators, absolute discounting gives accurate predictions simultaneously in all of the low-, medium-, and high-activity zones. [Section 8]

5 Upper bound and classical minimax optimality

We now give an upper bound for the risk of the absolute discounting estimator and show that it recovers classical minimax rates in the low- and high-dimensional regimes. Recall that d := E[D] is the expected number of distinct categories in the samples. The upper bound that we derive can be written as a function of only d, k, and n, and is non-decreasing in d. For a given n and k, let Pd be the set of all distributions for which E[D] ≤ d. The upper bound is thus also a worst-case bound over Pd.

Theorem 1 (Upper bound). Consider the absolute discounting estimator q = q−δ, defined in (1). Let p be such that E[D] = d. Given a discount 0 < δ < 1, there exists a constant c that may depend on δ and only δ, such that

    rn(p, q) ≤ (d/n) log(k/d) + c d/n    if d ≥ 10 log log k,
    rn(p, q) ≤ (d/n) log k + c d/n       if d < 10 log log k.    (2)

The same bound holds for rn(Pd, q).

We defer the proof of the theorem to the supplementary material. Here are the immediate implications. For the low-dimensional regime n/k → ∞ and the class Δk, the largest that d can be, once n > k, is k. The risk of absolute discounting is thus bounded by c(1 + o(1)) k/n. This is minimax rate-optimal [BS04]. For the high-dimensional regime k/n → ∞ and the class Δk, the largest that d can be, when k > n, is n. The risk of absolute discounting is thus dominated by the first term, which reduces to (1 + o(1)) log(k/n). This is the optimal risk for the class Δk [Pan04], even to the constant. Therefore on the two extreme ranges of k and n, absolute discounting recovers the best performance, either as rate-optimal or as optimal even to the constant. These results are for the entire k-dimensional simplex Δk. Furthermore, for smaller classes, the bound characterizes the worst-case risk of the class through d, the expected number of distinct categories. Is this characterization tight?

6 Lower bounds and adaptivity
In order to lower bound the minimax risk of a given class P, we use a finer granularity than the Pd classes described in Section 5. In particular, let Pp be the permutation class of distributions consisting of a single distribution p and all of its permutations. Note that the multiset of probabilities is the same for all distributions in Pp, and since the expected number of distinct categories depends only on the multiset (d = Σj [1 − (1 − pj)^n]), it follows that Pp ⊂ Pd.¹ To find a good lower bound for Pd, we need a p that is “worst case”. We first give the following generic lower bound.

Theorem 2 (Generic lower bound). Let Pp be a permutation class defined by a distribution p and let γ > 1. Then for k > γd, the minimax risk is bounded by:

    rn(Pp) ≥ (1 − 1/γ) [ (Σ_{j=γd}^{k} pj) log( (k − γd) / Σ_{j=γd}^{k} pj ) + Σ_{j=γd}^{k} pj log pj ].    (3)

Equation (3) can be used as a starting point for more concrete lower bounds on various distribution classes. We illustrate this for two cases. First, let us choose p to be a truncated power-law distribution with power α: pj ∝ j^{−α}, for j = 1, ..., k. We always assume α ≥ α0 > 1. This leads to the following lower bound.
Corollary 3. 
Let P be all permutations of a single power-law distribution with power α truncated over k categories. Then there exists a constant c > 0 and a large enough n0 such that when n > n0 and k > max{n, 1.2^{1/(α−1)} n^{1/α}},

    rn(P) ≥ c (d/n) log( (k − 2d) / (2d) ).

Next, we use a different choice of p for Pp to provide a lower bound whenever d grows linearly with n. This essentially closes the gap of the previous corollary as α approaches 1.

Corollary 4. Let ρ ∈ (1, 1.75) and let P be all permutations of a single uniform distribution over a subset of k′ = n/ρ out of k categories. Then d ∼ (1 − e^{−ρ}) n/ρ, and there exists a constant c > 0 and a large enough n0 such that when n > n0 and k > n^5,

    rn(P) ≥ c (d/n) log( (k − 1.2d) / d ).

¹We abuse notation by distinguishing the classes by the letter used, while at the same time using the letters to denote actual quantities. From the context we understand that d is the expected number of distinct categories for p, at the given n.

We defer the proofs of the theorem and its corollaries to the supplementary material. The upper bound of Theorem 1 and the lower bounds of Corollaries 3 and 4 are within constant factors of each other. The immediate consequence is that absolute discounting is adaptive with respect to the families of classes of the corollaries. Furthermore, over the family of classes Pd where we can write d as n^{1/α} for some α > 1, or d ∝ n, we can select a distribution from the corollaries within each class and use the corresponding lower bound to match the upper bound of Theorem 1 up to a constant factor. Therefore absolute discounting is adaptive to this family of classes. 
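To make the role of the effective dimension concrete, here is a small numerical illustration of our own (not one of the paper's experiments): for a truncated power-law with power α, the quantity d = Σj [1 − (1 − pj)^n] grows roughly like n^{1/α}. The parameter choices below (k = 100,000, α = 2) are arbitrary.

```python
import numpy as np

def truncated_power_law(k, alpha):
    """p_j proportional to j^(-alpha) for j = 1..k, normalized."""
    w = np.arange(1, k + 1, dtype=float) ** (-alpha)
    return w / w.sum()

def expected_distinct(p, n):
    """Effective dimension d = E[D] = sum_j (1 - (1 - p_j)^n) under n i.i.d. draws."""
    return float(np.sum(1.0 - (1.0 - p) ** n))

p = truncated_power_law(k=100_000, alpha=2.0)
d_small = expected_distinct(p, n=100)
d_large = expected_distinct(p, n=10_000)
# n grew by a factor of 100; for alpha = 2, d should grow roughly by a
# factor of sqrt(100) = 10, i.e., d behaves like n^(1/alpha), far below k.
```

The point of the sketch is only that d, not k, sets the scale of the bounds above: even with a hundred thousand categories, a few thousand samples from this distribution touch only on the order of a hundred of them.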
Intuitively, adaptivity to these classes establishes optimality in the intermediate range between the low- and high-dimensional settings, in a distribution-dependent fashion governed by the expected number of distinct categories d, which we may regard as the effective dimension of the problem.

7 Relationship to Good–Turing and competitivity
We now establish a relationship between the absolute discounting and Good–Turing estimators, and refine the adaptivity results of the previous section into competitivity results. When [OS15] introduced the notion of competitive optimality, they showed that a variation of the Good–Turing estimator is worst-case competitive with respect to the family of distribution classes defined by any given probability distribution and its permutations. In light of the results of Sections 5 and 6, it is natural to ask whether absolute discounting enjoys the same kind of competitive properties. Not only that, but it was observed empirically by [NEK94] and shown theoretically in [OD12] that asymptotically Good–Turing behaves exactly like absolute discounting, when the underlying distribution is a (possibly perturbed) power-law. We therefore choose this family of classes to prove competitivity for. We first make the aforementioned equivalence concrete by establishing a finite-sample version. We use the following idealized version of the Good–Turing estimator [Goo53]:

    qGT_j := ((µj + 1)/n) · E[Φ_{µj+1}] / E[Φ_{µj}]    if µj > 0,
    qGT_j := E[Φ1] / (n(k − D))                        if µj = 0.    (4)

Lemma 5. Let p be a power law with power α truncated over k categories. Then for k > max{n, n^{1/(α−1)}}, we have the equivalence:

    qGT_j = ((µj − 1/α)/n) (1 + O(n^{−1/2})),    ∀ µj ∈ {1, ..., n^{1/(2α+1)}}.

An interesting outcome of the equivalence of Lemma 5 is that it suggests a choice of the discount δ in terms of the power, namely 1/α. To give a data-driven version of 1/α, we will use a robust version of the ratio Φ1/D proposed in [OD12, BBO17], which is a strongly consistent estimator when k = ∞.

Theorem 6. Let P be all permutations of a truncated power law p with power α. Let q be the absolute discounting estimator with δ = min{ max{Φ1, 1}/D, δmax }, for a suitable choice of δmax. Then for k > max{n, n^{1/(α−1)}}, the competitive loss is

    εn(Pp, q) = O(n^{−(2α−1)/(2α+1)}).

The implications are as follows. For the union of all such classes above a given α, we find that we beat the n^{−1/3} rate of the worst-case competitive loss obtained for the estimator in [OS15]. Theorem 6 and the bounds of Sections 5 and 6 together imply that absolute discounting is not only worst-case competitive, but also class-by-class competitive with respect to the power-law permutation family. In other words, it in fact achieves minimax optimality even to the constant. One of the advantages of absolute discounting is that it transitions gradually from values close to the empirical distribution for abundant categories (since µ then dominates the discount δ) to a behavior that imitates the Good–Turing estimator for rare categories (as established by Lemma 5). 
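The data-driven discount of Theorem 6 is straightforward to compute from counts alone. A minimal sketch of ours (the sample and the δmax value are hypothetical):

```python
from collections import Counter

def data_driven_discount(samples, delta_max=0.9):
    """Discount of Theorem 6: delta = min(max(Phi_1, 1)/D, delta_max),
    where Phi_1 is the number of categories seen exactly once and D is
    the number of distinct categories in the sample."""
    counts = Counter(samples)
    D = len(counts)
    phi1 = sum(1 for c in counts.values() if c == 1)
    return min(max(phi1, 1) / D, delta_max)

# Hypothetical sample: 'a' seen twice, 'b' and 'c' once each,
# so Phi_1 = 2, D = 3, and the discount is 2/3.
delta = data_driven_discount(['a', 'a', 'b', 'c'])
```

Since Φ1/D estimates 1/α for power-law data, this plugs directly into the estimator of Equation (1) without any tuning.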
In contrast, the estimator proposed in [OS15], and its antecedents starting from [Kat87], have to carefully choose a threshold at which they switch abruptly from one estimator to the other.

8 Experiments
We now illustrate the theory with some experimental results. Our purpose is to (1) validate the functional form of the risk as given by our lower and upper bounds and (2) compare absolute discounting on both synthetic and real data to estimators that have various optimality guarantees. In all synthetic experiments, we use 500 Monte Carlo iterations. Also, we set the discount value based on data, δ = min{ max(Φ1, 1)/D, 0.9 }. This is as suggested in Section 7, assuming δmax = 0.9 is sufficient.

Figure 1: Risk of absolute discounting in different ranges of k and n for a power-law with α = 2. (a) k fixed; (b) n fixed, k << n; (c) n fixed, k >> n.

Validation For our first goal, we consider absolute discounting in isolation. Figure 1(a) shows the decay of the KL-risk with the number of samples n for a power-law distribution. The dependence of the risk on the number of categories k is captured in Figures 1(b) (linear x-axis) and 1(c) (logarithmic x-axis). Note the linear growth when k is small and the logarithmic growth when k is large. For the last plot we give 95% confidence intervals for the simulations, by performing 100 restarts.

Synthetic data For our second goal, we start with synthetic data. In Figure 2, we pit absolute discounting against a number of estimators on distributions related to power-laws. The estimators used for our comparisons are: the empirical estimator q+0(x) = µx/n; add-beta estimators q+β and two of their variants:
• Braess and Sauer, qBS [BS04]: q+β with β0 = 0.5, β1 = 1, and βi = 0.75 ∀i ≥ 2;
• Paninski, qPan [Pan04]: q+β with βi = n/(k log k) ∀i;
as well as absolute discounting q−δ, described in (1); Good–Turing + empirical, qGT, from [OS15]; and an oracle-aided estimator where Sµ is known.
In Figures 2(a) and 2(b), samples are generated according to a power-law distribution with power α = 2 over k = 1,000 categories. The underlying distribution in Figure 2(c), however, is a piece-wise power-law. It consists of three equal-length pieces, with powers 1.3, 2, and 1.5. Paninski's estimator is not shown in Figures 2(b) and 2(c) since it is not well-defined in this range (it is designed for the case k > n only). Unsurprisingly, absolute discounting dominates these experiments. What is more interesting is that it does not seem to need a pure power-law (similar results hold for other kinds of perturbations, such as mixtures and noise). 
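A stripped-down Monte Carlo version of such a comparison can be sketched as follows. This is our own illustration rather than the paper's experiment code: it uses only two of the estimators above (an add-one baseline in place of the add-beta variants, and absolute discounting with the data-driven discount of Section 7), with arbitrary choices of k, n, and the iteration count.

```python
import numpy as np

rng = np.random.default_rng(0)

def kl(p, q):
    """KL(p || q) over the support of p."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def add_constant(counts, beta):
    """Add-constant estimator q^{+beta}."""
    return (counts + beta) / (counts.sum() + beta * len(counts))

def abs_discount(counts, delta):
    """Absolute discounting estimator of Equation (1)."""
    n, k = counts.sum(), len(counts)
    D = (counts > 0).sum()
    q = np.where(counts > 0, (counts - delta) / n, 0.0)
    if D < k:
        q[counts == 0] = D * delta / (n * (k - D))
    return q

# Truncated power-law with alpha = 2 in the high-dimensional regime k > n.
k, n, iters = 1000, 500, 200
p = np.arange(1, k + 1, dtype=float) ** -2.0
p /= p.sum()

risk = {"add-1": 0.0, "abs-disc": 0.0}
for _ in range(iters):
    counts = rng.multinomial(n, p).astype(float)
    risk["add-1"] += kl(p, add_constant(counts, 1.0))
    phi1 = (counts == 1).sum()          # categories seen exactly once
    D = (counts > 0).sum()              # distinct categories
    delta = min(max(phi1, 1) / D, 0.9)  # data-driven discount of Section 7
    risk["abs-disc"] += kl(p, abs_discount(counts, delta))
risk = {name: r / iters for name, r in risk.items()}
```

On a run like this, absolute discounting should incur markedly lower KL-risk than the add-one baseline, which over-smooths the power-law head by pushing mass toward the many unseen categories.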
Also, Good–Turing is a tight second.

Figure 2: Comparing estimators for power-law variants with power α = 2 and k = 1000. (a) pure power-law; (b) pure power-law; (c) piece-wise power-law.

Real data One of the chief motivations to investigate absolute discounting is natural language modeling. But there have been such extensive empirical studies verifying, over and over, the power of absolute discounting (see the classical survey of [CG96]) that we chose to use this space for something new. We use the START Global Terrorism Database from the University of Maryland [LDMN16] and explore how well we can forecast the number of terrorist incidents in different cities. The data contains the record of more than 50,000 terror incidents between the years 1992 and 2010, in more than 12,000 different cities around the world. First, we display in Figure 3(a) the frequency of incidents across the entire dataset versus the activity rank of the city in log-log scale, showing a striking adherence to a power-law (see [CSN09] for more on this).
The forecasting problem that we solve is to estimate the number of total incidents in a subset of the cities over the coming year, using the current year's data from all cities. 
In order to emulate the various dimension regimes, we look at three subsets: (1) low-activity cities, with no incidents in the current year and fewer than 20 incidents in the whole data; (2) medium-activity cities, with some incidents in the current year and fewer than 20 incidents in the whole data; and (3) high-activity individual cities, with a large number of overall incidents.
The results for (1) are in Figure 3(b). The frequency estimator trivially estimates zero. Braess-Sauer does something meaningful. But the absolute discounting and Good–Turing estimators, indistinguishable from each other, are remarkably spot-on. And this without having observed a single incident in any of these cities! This nicely captures the importance of using structure when dimensionality is so high and data is so scarce. The results for (2) are in Figure 3(c). The frequency estimator markedly overestimates. But now absolute discounting, Good–Turing, and Braess-Sauer perform similarly. This is a lower-dimensional regime than in (1), but still not adequate for simply using frequencies. This changes in case (3), illustrated in Figure 4. To take advantage of the abundance of data, in this case at each time point we used the previous 2,000 incidents for learning, and predicted the share of each city in the next 2,000 incidents. In fact, incidents are so abundant that we can simply rely on the previous window's count. Note how Braess-Sauer over-penalizes such abundant categories and suffers, whereas absolute discounting and Good–Turing continue to hold their own, mimicking the performance of the empirical counts. This is a very low-dimensional regime.
The closeness of the Good–Turing estimator to absolute discounting in all of our experiments validates the equivalence result of Lemma 5.
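The case (1) forecast can be sketched with the Good–Turing missing-mass estimate: the share of next year's incidents falling in currently unobserved cities is estimated by the fraction of incidents in singleton cities this year. A minimal sketch, where the per-city counts and the next-year total of 2,000 incidents are made-up illustrative numbers, not values from the database:

```python
import numpy as np

def good_turing_missing_mass(counts):
    # Good–Turing estimate of the total probability of all unseen
    # categories: (number of categories seen exactly once) / n.
    n = counts.sum()
    singletons = np.count_nonzero(counts == 1)
    return singletons / n

# Hypothetical per-city incident counts for the current year.
counts = np.array([8, 5, 3, 2, 1, 1, 1, 1])
m0 = good_turing_missing_mass(counts)   # 4 singletons out of n = 22 incidents

# Forecast for cities with no incidents this year: the missing mass
# times an assumed next-year total of 2,000 incidents.
forecast_unseen = m0 * 2000
print(m0, forecast_unseen)
```

By the equivalence of Lemma 5, replacing the Good–Turing missing mass with the absolute-discounting unseen mass dδ/n gives essentially the same forecast.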
The robustness in various regimes, and the improvement in performance over such minimax-optimal estimators as Braess-Sauer's and Paninski's, are evidence that absolute discounting truly molds to both the raw dimension and the effective dimension / structure.

(a) frequency vs rank (b) unobserved cities (c) observed cities
Figure 3: (a) power-law behavior of frequency vs rank in terror incidents; (b) and (c) comparing forecasts of the number of incidents in unobserved cities and observed ones, respectively.

9 Conclusion
In this paper, we offered a rigorous analysis of the absolute discounting estimator for categorical distributions. We showed that it recovers classical minimax optimality. The true reason for its success, however, is in adapting to distributions much more intimately: by recovering the right dependence on the number of distinct observed categories d, which can be regarded as an effective dimension, and by optimally tracking structure such as power-laws. We also tightened its relationship with the celebrated Good–Turing estimator.

(a) Baghdad (b) Fallujah (c) Belfast
Figure 4: Estimating the number of incidents based on previous data for different cities.

Some of our analysis could possibly be tightened, in particular in terms of the range of applicability over n, k, and d. Also, the limiting case of α = 1 (very heavy tails, known as "fast variation" [BBO17]), to which our results don't directly apply, merits investigation. But more importantly, absolute discounting is often a module.
For example, we already noted how it is widely used in N-gram back-off models [KN95]. Also, it has recently been successfully applied to smoothing low-rank probability matrices [FOO16]. Perhaps to further understand its power, it is worthwhile to study how it interacts with such larger systems.

Acknowledgements We thank Vaishakh Ravindrakumar for very helpful suggestions, and NSF for supporting this work through grants CIF-1564355 and CIF-1619448.

References
[ADW13] Armen E. Allahverdyan, Weibing Deng, and Q. A. Wang. Explaining Zipf's law via a mental lexicon. Physical Review E - Statistical, Nonlinear, and Soft Matter Physics, 88(6), 2013.
[AJOS13] Jayadev Acharya, Ashkan Jafarpour, Alon Orlitsky, and Ananda Theertha Suresh. Optimal probability estimation with applications to prediction and classification. In COLT, pages 764–796, 2013.
[BBO17] Anna Ben-Hamou, Stéphane Boucheron, and Mesrob I. Ohannessian. Concentration inequalities in the infinite urn scheme for occupancy counts and the missing mass, with applications. Bernoulli, 2017.
[BGO15] Stéphane Boucheron, Elisabeth Gassiat, and Mesrob I. Ohannessian. About adaptive coding on countable alphabets: max-stable envelope classes. IEEE Transactions on Information Theory, 61(9), 2015.
[BS04] Dietrich Braess and Thomas Sauer. Bernstein polynomials and learning theory. Journal of Approximation Theory, 128(2):187–206, 2004.
[CG96] Stanley F. Chen and Joshua Goodman. An empirical study of smoothing techniques for language modeling. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, pages 310–318, 1996.
[CSN09] Aaron Clauset, Cosma Rohilla Shalizi, and Mark E. J. Newman. Power-law distributions in empirical data. SIAM Review, 51(4):661–703, 2009.
[FJO+15] Moein Falahatgar, Ashkan Jafarpour, Alon Orlitsky, Venkatadheeraj Pichapati, and Ananda Theertha Suresh. Universal compression of power-law distributions. In Information Theory (ISIT), 2015 IEEE International Symposium on, pages 2001–2005. IEEE, 2015.
[FNT16] Stefano Favaro, Bernardo Nipoti, and Yee Whye Teh. Rediscovery of Good–Turing estimators via Bayesian nonparametrics. Biometrics, 72(1):136–145, 2016.
[FOO16] Moein Falahatgar, Mesrob I. Ohannessian, and Alon Orlitsky. Near-optimal smoothing of structured conditional probability matrices. In NIPS, pages 4860–4868, 2016.
[Goo53] Irving J. Good. The population frequencies of species and the estimation of population parameters. Biometrika, pages 237–264, 1953.
[GS95] William A. Gale and Geoffrey Sampson. Good–Turing frequency estimation without tears. Journal of Quantitative Linguistics, 2(3):217–237, 1995.
[JM85] Frederick Jelinek and Robert Mercer. Probability distribution estimation from sparse data. IBM Technical Disclosure Bulletin, 28:2591–2594, 1985.
[JVS+16] Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410, 2016.
[Kat87] Slava M. Katz. Estimation of probabilities from sparse data for the language model component of a speech recognizer, 1987.
[KN95] Reinhard Kneser and Hermann Ney.
Improved backing-off for M-gram language modeling. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pages 181–184, Detroit, MI, May 1995.
[LDMN16] Gary LaFree, Laura Dugan, Erin Miller, and National Consortium for the Study of Terrorism and Responses to Terrorism. Global Terrorism Database, 2016.
[NEK94] Hermann Ney, Ute Essen, and Reinhard Kneser. On structuring probabilistic dependences in stochastic language modelling. Computer Speech & Language, 8(1):1–38, 1994.
[Neu13] Edward Neuman. Inequalities and bounds for the incomplete gamma function. Results in Mathematics, pages 1–6, 2013.
[NP00] Pierpaolo Natalini and Biagio Palumbo. Inequalities for the incomplete gamma function. Mathematical Inequalities & Applications, 3(1):69–77, 2000.
[OD12] Mesrob I. Ohannessian and Munther A. Dahleh. Rare probability estimation under regularly varying heavy tails. In COLT, page 21, 2012.
[OS15] Alon Orlitsky and Ananda Theertha Suresh. Competitive distribution estimation: why is Good–Turing good. In NIPS, pages 2143–2151, 2015.
[OSZ03] Alon Orlitsky, Narayana P. Santhanam, and Junan Zhang. Always Good–Turing: asymptotically optimal probability estimation. Science, 302(5644):427–431, 2003.
[Pan04] Liam Paninski. Variational minimax estimation of discrete distributions under KL loss. In NIPS, pages 1033–1040, 2004.
[PY97] Jim Pitman and Marc Yor. The two-parameter Poisson–Dirichlet distribution derived from a stable subordinator. Annals of Probability, 25(2):855–900, 1997.
[SLE+03] Felisa A. Smith, S. Kathleen Lyons, S. K. Ernest, Kate E. Jones, Dawn M. Kaufman, Tamar Dayan, Pablo A. Marquet, James H. Brown, and John P. Haskell. Body mass of late Quaternary mammals. Ecology, 84(12):3403, 2003.
[Teh06] Yee-Whye Teh.
A hierarchical Bayesian language model based on Pitman–Yor processes. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, pages 985–992, 2006.
[Tsy09] Alexandre B. Tsybakov. Introduction to Nonparametric Estimation. Springer Series in Statistics. Springer, 2009.
[VV15] Gregory Valiant and Paul Valiant. Instance optimal learning. arXiv preprint arXiv:1504.05321, 2015.
[Zip35] George Kingsley Zipf. The Psycho-Biology of Language. Houghton Mifflin, 1935.