{"title": "Shadow Dirichlet for Restricted Probability Modeling", "book": "Advances in Neural Information Processing Systems", "page_first": 613, "page_last": 621, "abstract": "Although the Dirichlet distribution is widely used, the independence structure of its components limits its accuracy as a model. The proposed shadow Dirichlet distribution manipulates the support in order to model probability mass functions (pmfs) with dependencies or constraints that often arise in real world problems, such as regularized pmfs, monotonic pmfs, and pmfs with bounded variation. We describe some properties of this new class of distributions, provide maximum entropy constructions, give an expectation-maximization method for estimating the mean parameter, and illustrate with real data.", "full_text": "Shadow Dirichlet for Restricted Probability Modeling\n\nBela A. Frigyik, Maya R. Gupta, and Yihua Chen\n\nDepartment of Electrical Engineering\n\nfrigyik@gmail.com, gupta@ee.washington.edu, yihuachn@gmail.com\n\nUniversity of Washington\n\nSeattle, WA 98195\n\nAbstract\n\nAlthough the Dirichlet distribution is widely used, the independence structure of\nits components limits its accuracy as a model. The proposed shadow Dirichlet\ndistribution manipulates the support in order to model probability mass functions\n(pmfs) with dependencies or constraints that often arise in real world problems,\nsuch as regularized pmfs, monotonic pmfs, and pmfs with bounded variation. We\ndescribe some properties of this new class of distributions, provide maximum en-\ntropy constructions, give an expectation-maximization method for estimating the\nmean parameter, and illustrate with real data.\n\n1 Modeling Probabilities for Machine Learning\n\nModeling probability mass functions (pmfs) as random is useful in solving many real-world prob-\nlems. A common random model for pmfs is the Dirichlet distribution [1]. 
The Dirichlet is conjugate to the multinomial and hence mathematically convenient for Bayesian inference, and the number of parameters is conveniently linear in the size of the sample space. However, the Dirichlet is a distribution over the entire probability simplex, and for many problems this is simply the wrong domain if there is application-specific prior knowledge that the pmfs come from a restricted subset of the simplex.
For example, in natural language modeling, it is common to regularize a pmf over n-grams by some generic language model distribution q_0, that is, the pmf to be modeled is assumed to have the form θ = λq + (1 − λ)q_0 for some q in the simplex, λ ∈ (0, 1), and a fixed generic model q_0 [2]. But once q_0 and λ are fixed, the pmf θ can only come from a subset of the simplex. Another natural language processing example is modeling the probability of keywords in a dictionary where some words are related, such as espresso and latte, and evidence for one is to some extent evidence for the other. This relationship can be captured with a bounded variation model that would constrain the modeled probability of espresso to be within some ε of the modeled probability of latte. We show that such bounds on the variation between pmf components also restrict the domain of the pmf to a subset of the simplex. As a third example of restricting the domain, the similarity discriminant analysis classifier estimates class-conditional pmfs that are constrained to be monotonically increasing over an ordered sample space of discrete similarity values [3].
In this paper we propose a simple variant of the Dirichlet whose support is a subset of the simplex, explore its properties, and show how to learn the model from data. 
We first discuss the alternative solution of renormalizing the Dirichlet over the desired subset of the simplex, and other related work. Then we propose the shadow Dirichlet distribution; explain how to construct a shadow Dirichlet for three types of restricted domains: the regularized pmf case, bounded variation between pmf components, and monotonic pmfs; and discuss the most general case. We show how to use the expectation-maximization (EM) algorithm to estimate the shadow Dirichlet parameter α, and present simulation results for the estimation.

Figure 1: Dirichlet, shadow Dirichlet, and renormalized Dirichlet for α = [3.94 2.25 2.81].

2 Related Work

One solution to modeling pmfs on only a subset of the simplex is to simply restrict the support of the Dirichlet to the desired support S̃, and renormalize the Dirichlet over S̃ (see Fig. 1 for an example). This renormalized Dirichlet has the advantage that it is still a conjugate distribution for the multinomial. Nallapati et al. considered the renormalized Dirichlet for language modeling, but found it difficult to use because the density requires numerical integration to compute the normalizer [4]. In addition, there is no closed-form solution for the mean, covariance, or peak of the renormalized Dirichlet, making it difficult to work with. Table 1 summarizes these properties. Additionally, generating samples from the renormalized Dirichlet is inefficient: one draws samples from the standard Dirichlet, then rejects realizations that are outside S̃. 
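To make the inefficiency concrete, here is a minimal sketch of that rejection scheme (our own illustration; the bounded-variation support S̃ and the threshold eps are hypothetical, and the α values are those of Fig. 1):

```python
import numpy as np

# Rejection sampling for a renormalized Dirichlet: draw from the standard
# Dirichlet and keep only realizations inside the restricted support S~.
# Here S~ is (hypothetically) the set of pmfs whose entries differ by <= eps.
rng = np.random.default_rng(0)
alpha = np.array([3.94, 2.25, 2.81])
eps = 0.2

draws = rng.dirichlet(alpha, size=10000)
# max pairwise gap per sample, shape (10000,)
gaps = np.abs(draws[:, :, None] - draws[:, None, :]).max(axis=(1, 2))
accepted = draws[gaps <= eps]
acceptance_rate = accepted.shape[0] / draws.shape[0]
# For a tight eps the acceptance rate is small, so most draws are wasted.
```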
For high-dimensional sample spaces, this could greatly increase the time to generate samples.
Although the Dirichlet is a classic and popular distribution on the simplex, Aitchison warns it “is totally inadequate for the description of the variability of compositional data,” because of its “implied independence structure and so the Dirichlet class is unlikely to be of any great use for describing compositions whose components have even weak forms of dependence” [5]. Aitchison instead championed a logistic normal distribution with more parameters to control covariance between components.
A number of variants of the Dirichlet that can capture more dependence have been proposed and analyzed. For example, the scaled Dirichlet enables a more flexible shape for the distribution [5], but does not change the support. The original Dirichlet(α_1, α_2, . . . , α_d) can be derived as Y_j / \sum_j Y_j where Y_j ∼ Γ(α_j, β), whereas the scaled Dirichlet is derived from Y_j ∼ Γ(α_j, β_j), resulting in density

p(\theta) = \gamma \prod_j \beta_j^{\alpha_j} \theta_j^{\alpha_j - 1} \Big/ \Big(\sum_i \beta_i \theta_i\Big)^{\alpha_1 + \cdots + \alpha_d},

where β, α ∈ R^d_+ are parameters, and γ is the normalizer.
Another variant is the generalized Dirichlet [6] which also has parameters β, α ∈ R^d_+, and allows greater control of the covariance structure, again without changing the support. As perhaps first noted by Karl Pearson [7] and expounded upon by Aitchison [5], correlations of proportional data can be very misleading. Many Dirichlet variants have been generalizations of the Connor-Mosimann variant, Dirichlet process variants, other compound Dirichlet models, and hierarchical Dirichlet models. Ongaro et al. [8] propose the flexible Dirichlet distribution by forming a re-parameterized mixture of Dirichlet distributions. Rayens and Srinivasan [9] considered the dependence structure for the general Dirichlet family called the generalized Liouville distributions. In contrast to prior efforts, the shadow Dirichlet manipulates the support to achieve various kinds of dependence that arise frequently in machine learning problems.

3 Shadow Dirichlet Distribution

We introduce a new distribution that we call the shadow Dirichlet distribution. Let S be the probability (d − 1)-simplex, and let Θ̃ ∈ S be a random pmf drawn from a Dirichlet distribution with density p_D and unnormalized parameter α ∈ R^d_+. Then we say the random pmf Θ ∈ S is distributed according to a shadow Dirichlet distribution if Θ = M Θ̃ for some fixed d × d left-stochastic (that is, each column of M sums to 1) full-rank (and hence invertible) matrix M, and we call Θ̃ the generating Dirichlet of Θ, or Θ's Dirichlet shadow. Because M is a left-stochastic linear map between finite-dimensional spaces, it is a continuous map from the convex and compact S to a convex and compact subset of S that we denote S_M.
The shadow Dirichlet has two parameters: the generating Dirichlet's parameter α ∈ R^d_+, and the d × d matrix M. Both α and M can be estimated from data. 
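The construction Θ = M Θ̃ is easy to sketch numerically; the following is our own illustration (the particular M shown is a hypothetical left-stochastic choice, not one from the paper):

```python
import numpy as np

# Shadow Dirichlet construction: Theta = M @ Theta_tilde, with M left-stochastic
# (each column sums to 1) and full rank, so samples stay inside the simplex.
rng = np.random.default_rng(1)
alpha = np.array([3.94, 2.25, 2.81])      # generating Dirichlet parameter
M = np.array([[0.6, 0.2, 0.2],
              [0.2, 0.6, 0.2],
              [0.2, 0.2, 0.6]])           # hypothetical left-stochastic, invertible M

theta_tilde = rng.dirichlet(alpha, size=1000)   # draws from the generating Dirichlet
theta = theta_tilde @ M.T                       # shadow Dirichlet draws

# The mean of Theta is M alpha / alpha_0 (Table 1), mirroring the
# Dirichlet mean alpha / alpha_0.
mean_theory = M @ (alpha / alpha.sum())
```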
However, as we show in the following subsections, the matrix M can be profitably used as a design parameter that is chosen based on application-specific knowledge or side-information to specify the restricted domain S_M, and in that way impose dependency between the components of the random pmfs.
The shadow Dirichlet density p(θ) is the normalized pushforward of the Dirichlet density, that is, it is the composition of the Dirichlet density and M^{-1} with the Jacobian:

p(\theta) = \frac{1}{B(\alpha)\,|\det(M)|} \prod_j (M^{-1}\theta)_j^{\alpha_j - 1},   (1)

where B(\alpha) \triangleq \frac{\prod_j \Gamma(\alpha_j)}{\Gamma(\alpha_0)} is the standard Dirichlet normalizer, and \alpha_0 = \sum_{j=1}^d \alpha_j is the standard Dirichlet precision factor. Table 1 summarizes the basic properties of the shadow Dirichlet. Fig. 1 shows an example shadow Dirichlet distribution.
Generating samples from the shadow Dirichlet is trivial: generate samples from its generating Dirichlet (for example, using stick-breaking or urn-drawing) and multiply each sample by M to create the corresponding shadow Dirichlet sample.

Table 1: Comparison of the Dirichlet, shadow Dirichlet, and renormalized Dirichlet distributions.

Density p(θ):
  Dirichlet(α): \frac{1}{B(\alpha)} \prod_{j=1}^d \theta_j^{\alpha_j - 1}
  Shadow Dirichlet(α, M): \frac{1}{B(\alpha)|\det(M)|} \prod_{j=1}^d (M^{-1}\theta)_j^{\alpha_j - 1}
  Renormalized Dirichlet(α, S̃): \prod_{j=1}^d \theta_j^{\alpha_j - 1} \Big/ \int_{\tilde S} \prod_{j=1}^d q_j^{\alpha_j - 1}\, dq
Mean:
  Dirichlet: \alpha / \alpha_0
  Shadow Dirichlet: M\alpha / \alpha_0
  Renormalized Dirichlet: \int_{\tilde S} \theta\, p(\theta)\, d\theta
Covariance:
  Dirichlet: Cov(Θ)
  Shadow Dirichlet: M\, Cov(Θ)\, M^T
  Renormalized Dirichlet: \int_{\tilde S} (\theta - \bar\theta)(\theta - \bar\theta)^T p(\theta)\, d\theta
Mode (if α > 1):
  Dirichlet: jth component (\alpha_j - 1)/(\alpha_0 - d)
  Shadow Dirichlet: M applied to the Dirichlet mode, with jth component M(\alpha_j - 1)/(\alpha_0 - d)
  Renormalized Dirichlet: \arg\max_{\theta \in \tilde S} p(\theta)
How to Sample:
  Dirichlet: stick-breaking, urn-drawing
  Shadow Dirichlet: draw from Dirichlet(α), multiply by M
  Renormalized Dirichlet: draw from Dirichlet(α), reject if not in S̃
ML Estimate:
  Dirichlet: iterative (simple functions)
  Shadow Dirichlet: iterative (simple functions)
  Renormalized Dirichlet: unknown complexity
ML Compound Estimate:
  Dirichlet: iterative (simple functions)
  Shadow Dirichlet: iterative (numerical integration)
  Renormalized Dirichlet: unknown complexity

3.1 Example: Regularized Pmfs

The shadow Dirichlet can be designed to specify a distribution over a set of regularized pmfs S_M = \{\theta \mid \theta = \lambda\tilde\theta + (1 - \lambda)\breve\theta, \tilde\theta \in S\}, for specific values of λ and θ̆. In general, for a given λ and θ̆ ∈ S, the following d × d matrix M will change the support to the desired subset S_M by mapping the extreme points of S to the extreme points of S_M:

M = (1 - \lambda)\breve\theta \mathbf{1}^T + \lambda I,   (2)

where I is the d × d identity matrix. In Section 4 we show that the M given in (2) is optimal in a maximum entropy sense.

3.2 Example: Bounded Variation Pmfs

We describe how to use the shadow Dirichlet to model a random pmf that has bounded variation such that |\theta_k - \theta_l| \le \epsilon_{k,l} for any k, l \in \{1, 2, \ldots, d\}, where \epsilon_{k,l} > 0. To construct specified bounds on the variation, we first analyze the variation for a given M. For any d × d left-stochastic matrix M, \theta = M\tilde\theta = [\sum_{j=1}^d M_{1j}\tilde\theta_j \;\; \ldots \;\; \sum_{j=1}^d M_{dj}\tilde\theta_j]^T, so the difference between any two entries is

|\theta_k - \theta_l| = \Big|\sum_j (M_{kj} - M_{lj})\tilde\theta_j\Big| \le \sum_j |M_{kj} - M_{lj}|\, \tilde\theta_j.   (3)

Thus, to obtain a distribution over pmfs with bounded |\theta_k - \theta_l| \le \epsilon_{k,l} for any k, l components, it is sufficient to choose components of the matrix M such that |M_{kj} - M_{lj}| \le \epsilon_{k,l} for all j = 1, \ldots, d, because \tilde\theta in (3) sums to 1.
One way to create such an M is using the regularization strategy described in Section 3.1. For this case, the jth component of θ is \theta_j = (M\tilde\theta)_j = \lambda\tilde\theta_j + (1 - \lambda)\breve\theta_j, and thus the variation between the ith and jth components of any pmf in S_M is:

|\theta_i - \theta_j| = |\lambda\tilde\theta_i + (1 - \lambda)\breve\theta_i - \lambda\tilde\theta_j - (1 - \lambda)\breve\theta_j| \le \lambda|\tilde\theta_i - \tilde\theta_j| + (1 - \lambda)|\breve\theta_i - \breve\theta_j| \le \lambda + (1 - \lambda)\max_{i,j}|\breve\theta_i - \breve\theta_j|.   (4)

Thus by choosing an appropriate λ and regularizing pmf θ̆, one can impose the bounded variation given by (4). 
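As a quick numerical check of the bound in (4), using the regularization map (2) with a uniform regularizing pmf (our own sketch; d and λ are arbitrary illustrative values):

```python
import numpy as np

# Check the variation bound (4): with M = (1 - lam) * breve 1^T + lam * I from (2)
# and breve uniform, every sampled pmf satisfies |theta_i - theta_j| <= lam.
rng = np.random.default_rng(2)
d, lam = 4, 0.5
breve = np.full(d, 1.0 / d)                                    # uniform regularizing pmf
M = (1 - lam) * np.outer(breve, np.ones(d)) + lam * np.eye(d)  # the map from (2)

theta = rng.dirichlet(np.ones(d), size=2000) @ M.T
max_gap = np.abs(theta[:, :, None] - theta[:, None, :]).max()
# max_gap is at most lam, approached as theta_tilde nears a vertex of the simplex
```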
For example, set θ̆ to be the uniform pmf and choose any λ ∈ (0, 1); then the matrix M given by (2) will guarantee that the difference between any two entries of any pmf drawn from the shadow Dirichlet (M, α) will be less than or equal to λ.

3.3 Example: Monotonic Pmfs

For pmfs over ordered components, it may be desirable to restrict the support of the random pmf distribution to only monotonically increasing pmfs (or to only monotonically decreasing pmfs). A d × d left-stochastic matrix M that will result in a shadow Dirichlet that generates only monotonically increasing d × 1 pmfs has kth column [0 \; \ldots \; 0 \;\; 1/(d - k + 1) \; \ldots \; 1/(d - k + 1)]^T; we call this the monotonic M. It is easy to see that with this M only monotonic θ's can be produced, because \theta_1 = \frac{1}{d}\tilde\theta_1, which is less than or equal to \theta_2 = \frac{1}{d}\tilde\theta_1 + \frac{1}{d-1}\tilde\theta_2, and so on. In Section 4 we show that the monotonic M is optimal in a maximum entropy sense.
Note that to provide support over both monotonically increasing and decreasing pmfs with one distribution is not achievable with a shadow Dirichlet, but could be achieved by a mixture of two shadow Dirichlets.

3.4 What Restricted Subsets are Possible?

Above we have described solutions to construct M for three kinds of dependence that arise in machine learning applications. Here we consider the more general question: what subsets of the simplex can be the support of the shadow Dirichlet, and how can one design a shadow Dirichlet for a particular support? For any matrix M, by the Krein-Milman theorem [10], S_M = MS is the convex hull of its extreme points. 
If M is injective, the extreme points of S_M are easy to specify, as a d × d matrix M will have d extreme points that occur for the d choices of θ̃ that have only one nonzero component; the rest of the θ̃ create non-trivial convex combinations of the columns of M, and therefore cannot result in extreme points of S_M by definition. That is, the extreme points of S_M are the d columns of M, and one can design any S_M with d extreme points by setting the columns of M to be those extreme pmfs.
However, if one wants the new support to be a polytope in the probability (d − 1)-simplex with m > d extreme points, then one must use a fat M with d × m entries. Let S^m denote the probability (m − 1)-simplex; then the domain of the shadow Dirichlet will be MS^m, which is the convex hull of the m columns of M and forms a convex polytope in S with at most m vertices. In this case M cannot be injective, and hence it is not bijective between S^m and MS^m. However, a density on MS^m can be defined as:

p(\theta) = \frac{1}{B(\alpha)} \int_{\{\tilde\theta \,\mid\, M\tilde\theta = \theta\}} \prod_j \tilde\theta_j^{\alpha_j - 1}\, d\tilde\theta.   (5)

On the other hand, if one wants the support to be a low-dimensional polytope subset of a higher-dimensional probability simplex, then a thin d × m matrix M, where m < d, can be used to implement this. If M is injective, then it has a left inverse M^* that is a matrix of dimension m × d, and the normalized pushforward of the original density can be used as a density on the image MS^m:

p(\theta) = \frac{1}{B(\alpha)\,|\det(M^T M)|^{1/2}} \prod_j (M^*\theta)_j^{\alpha_j - 1}.

If M is not injective, then one way to determine a density is to use (5).

4 Information-theoretic Properties

In this section we note two information-theoretic properties of the shadow Dirichlet. 
Let Θ be drawn from shadow Dirichlet density p_M, and let its generating Dirichlet Θ̃ be drawn from p_D. Then the differential entropy of the shadow Dirichlet is h(p_M) = \log|\det(M)| + h(p_D), where h(p_D) is the differential entropy of its generating Dirichlet. In fact, the shadow Dirichlet always has less entropy than its Dirichlet shadow because \log|\det(M)| \le 0, which can be shown as a corollary to the following lemma (proof not included due to lack of space):
Lemma 4.1. Let \{x_1, \ldots, x_n\} and \{y_1, \ldots, y_n\} be column vectors in R^n. If each y_j is a convex combination of the x_i's, i.e. y_j = \sum_{i=1}^n \gamma_{ji} x_i, \sum_{i=1}^n \gamma_{ji} = 1, \gamma_{jk} \ge 0, \forall j, k \in \{1, \ldots, n\}, then |\det[y_1, \ldots, y_n]| \le |\det[x_1, \ldots, x_n]|.
It follows from Lemma 4.1 that the constructive solutions for M given in (2) and the monotonic M are optimal in the sense of maximizing entropy:
Corollary 4.1. Let M_reg be the set of left-stochastic matrices M that parameterize shadow Dirichlet distributions with support in \{\theta \mid \theta = \lambda\tilde\theta + (1 - \lambda)\breve\theta, \tilde\theta \in S\}, for a specific choice of λ and θ̆. Then the M given in (2) results in the shadow Dirichlet with maximum entropy, that is, (2) solves \arg\max_{M \in M_reg} h(p_M).
Corollary 4.2. Let M_mono be the set of left-stochastic matrices M that parameterize shadow Dirichlet distributions that generate only monotonic pmfs. 
Then the monotonic M given in Section 3.3 results in the shadow Dirichlet with maximum entropy, that is, the monotonic M solves \arg\max_{M \in M_mono} h(p_M).

5 Estimating the Distribution from Data

In this section, we discuss the estimation of α for the shadow Dirichlet and compound shadow Dirichlet, and the estimation of M.

5.1 Estimating α for the Shadow Dirichlet

Let matrix M be specified (for example, as described in the subsections of Section 3), let q be a d × N matrix where the ith column q_i is the ith sample pmf for i = 1, \ldots, N, and let (q_i)_j be the jth component of the ith sample pmf for j = 1, \ldots, d. Then finding the maximum likelihood estimate of α for the shadow Dirichlet is straightforward:

\arg\max_{\alpha \in R^d_+} \log \prod_{i=1}^N p(q_i \mid \alpha)
  \equiv \arg\max_{\alpha \in R^d_+} \log\left[\frac{1}{B(\alpha)|\det(M)|}\right]^N + \log \prod_i \prod_j (M^{-1}q_i)_j^{\alpha_j - 1}
  \equiv \arg\max_{\alpha \in R^d_+} \log\left(\frac{1}{B(\alpha)^N} \prod_i \prod_j (\tilde q_i)_j^{\alpha_j - 1}\right),   (6)

where \tilde q = M^{-1} q. Note (6) is the maximum likelihood estimation problem for the Dirichlet distribution given the matrix \tilde q, and can be solved using the standard methods for that problem (see e.g. [11, 12]).

5.2 Estimating α for the Compound Shadow Dirichlet

For many machine learning applications the given data are modeled as samples from realizations of a random pmf, and given these samples one must estimate the random pmf model's parameters. We refer to this case as the compound shadow Dirichlet, analogous to the compound Dirichlet (also called the multivariate Pólya distribution). 
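Before turning to the compound case, the reduction in (6) can be sketched concretely: back-transform the observed pmfs by M^{-1} and run any standard Dirichlet maximum likelihood routine on the result. The helper below is our own illustration (a generic optimizer rather than the specialized methods of [11, 12]; the M and α values are hypothetical):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

# Shadow Dirichlet MLE for alpha via (6): the problem reduces to a standard
# Dirichlet MLE on q_tilde = M^{-1} q.
def dirichlet_nll(alpha, mean_logq):
    # negative per-sample Dirichlet log-likelihood;
    # mean_logq[j] = mean over samples i of log (q_tilde_i)_j
    return -(gammaln(alpha.sum()) - gammaln(alpha).sum()
             + ((alpha - 1) * mean_logq).sum())

rng = np.random.default_rng(3)
d = 3
true_alpha = np.array([3.94, 2.25, 2.81])
M = 0.5 * np.full((d, d), 1.0 / d) + 0.5 * np.eye(d)    # a regularization map as in (2)

q = rng.dirichlet(true_alpha, size=2000) @ M.T          # observed shadow Dirichlet pmfs
q_tilde = np.linalg.solve(M, q.T).T                     # q_tilde = M^{-1} q
mean_logq = np.log(q_tilde).mean(axis=0)

res = minimize(dirichlet_nll, x0=np.ones(d), args=(mean_logq,),
               bounds=[(1e-6, None)] * d)
alpha_hat = res.x   # close to true_alpha for a large sample
```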
Assuming one has already specified M, we first discuss method-of-moments estimation, and then describe an expectation-maximization (EM) method for computing the maximum likelihood estimate \hat\alpha.
One can form an estimate of α by the method of moments. For the standard compound Dirichlet, one treats the samples of the realizations as normalized empirical histograms, sets the normalized α parameter equal to the empirical mean of the normalized histograms, and uses the empirical variances to determine the precision α_0. By definition, this estimate will be less likely than the maximum likelihood estimate, but may be a practical short-cut in some cases. For the compound shadow Dirichlet, we believe the method-of-moments estimator will be a poorer estimate in general. The problem is that if one draws samples from a pmf θ from a restricted subset S_M of the simplex, then the normalized empirical histogram of those samples may not be in S_M. For example, given a monotonic pmf, the histogram of five samples drawn from it may not be monotonic. Then the empirical mean of such normalized empirical histograms may not be in S_M, and so setting the shadow Dirichlet mean Mα/α_0 equal to the empirical mean may lead to an infeasible estimate (one that is outside S_M). A heuristic solution is to first project the empirical mean into S_M, for example, by finding the nearest pmf in S_M in squared error or relative entropy. As with the compound Dirichlet, this may still be a useful approach in practice for some problems.
Next we state an EM method to find the maximum likelihood estimate \hat\alpha. Let s be a d × N matrix of sample histograms from different experiments, such that the ith column s_i is the ith histogram for i = 1, \ldots, N, and (s_i)_j is the number of times we have observed the jth event from the ith pmf v_i. 
Then the maximum log-likelihood estimate of α solves \arg\max_{\alpha \in R^d_+} \log p(s \mid \alpha). If the random pmfs are drawn from a Dirichlet distribution, then finding this maximum likelihood estimate requires an iterative procedure, and can be done in several ways including a gradient ascent approach. However, if the random pmfs are drawn from a shadow Dirichlet distribution, then a direct gradient approach is highly inconvenient as it requires taking derivatives of numerical integrals. However, it is practical to apply the expectation-maximization (EM) algorithm [13][14], as we describe in the rest of this section. Code to perform the EM estimation of α can be downloaded from idl.ee.washington.edu/publications.php.
We assume that the experiments are independent and therefore p(s \mid \alpha) = p(\{s_i\} \mid \alpha) = \prod_i p(s_i \mid \alpha), and hence \arg\max_{\alpha \in R^d_+} \log p(s \mid \alpha) = \arg\max_{\alpha \in R^d_+} \sum_i \log p(s_i \mid \alpha).
To apply the EM method, we consider the complete data to be the sample histograms s and the pmfs that generated them (s, v_1, v_2, \ldots, v_N), whose expected log-likelihood will be maximized. Specifically, because of the assumed independence of the \{v_i\}, the EM method requires one to repeatedly maximize the Q-function such that the estimate of α at the (m + 1)th iteration is:

\alpha^{(m+1)} = \arg\max_{\alpha \in R^d_+} \sum_{i=1}^N E_{v_i \mid s_i, \alpha^{(m)}}\left[\log p(v_i \mid \alpha)\right].   (7)

Like the compound Dirichlet likelihood, the compound shadow Dirichlet likelihood is not necessarily concave. However, note that the Q-function given in (7) is concave, because \log p(v_i \mid \alpha) = -\log|\det(M)| + \log p_{D,\alpha}(M^{-1}v_i), where p_{D,\alpha} is the Dirichlet distribution with parameter α, and by a theorem of Ronning [11], \log p_{D,\alpha} is a concave function, and adding a constant does not change the concavity. 
The Q-function is a finite integration of such concave functions and hence also concave [15].
We simplify (7) without destroying the concavity to yield the equivalent problem \alpha^{(m+1)} = \arg\max_{\alpha \in R^d_+} g(\alpha), where

g(\alpha) = \log\Gamma(\alpha_0) - \sum_{j=1}^d \log\Gamma(\alpha_j) + \sum_{j=1}^d \beta_j \alpha_j,

and \beta_j = \frac{1}{N}\sum_{i=1}^N t_{ij}/z_i, where t_{ij} and z_i are integrals we compute with Monte Carlo integration:

t_{ij} = \int_{S_M} \log(M^{-1}v_i)_j\, \gamma_i \prod_{k=1}^d (v_i)_k^{(s_i)_k}\, p_M(v_i \mid \alpha^{(m)})\, dv_i,
z_i = \int_{S_M} \gamma_i \prod_{k=1}^d (v_i)_k^{(s_i)_k}\, p_M(v_i \mid \alpha^{(m)})\, dv_i,

where \gamma_i is the normalization constant for the multinomial with histogram s_i.
We apply the Newton method [16] to maximize g(\alpha), where the gradient \nabla g(\alpha) has kth component \psi_0(\alpha_0) - \psi_0(\alpha_k) + \beta_k, where \psi_0 denotes the digamma function. Let \psi_1 denote the trigamma function; then the Hessian matrix of g(\alpha) is H = \psi_1(\alpha_0)\mathbf{1}\mathbf{1}^T - \mathrm{diag}(\psi_1(\alpha_1), \ldots, \psi_1(\alpha_d)). Note that because H has a very simple structure, the inversion of H required by the Newton step is greatly simplified by using the Woodbury identity [17]:

H^{-1} = -\mathrm{diag}(\xi_1, \ldots, \xi_d) - \frac{1}{\xi_0 - \sum_{j=1}^d \xi_j}\,[\xi_i \xi_j]_{d \times d},

where \xi_0 = 1/\psi_1(\alpha_0) and \xi_j = 1/\psi_1(\alpha_j), j = 1, \ldots, d.

5.3 Estimating M for the Shadow Dirichlet

Thus far we have discussed how to construct M to achieve certain desired properties and how to interpret a given M's effect on the support. In some cases it may be useful to estimate M directly from data, for example, finding the maximum likelihood M. In general, this is a non-convex problem because the set of rank d − 1 matrices is not convex. However, we offer two approximations. 
First, note that as in estimating the support of a uniform distribution, the maximum likelihood M will correspond to a support that is no larger than needed to contain the convex hull of the sample pmfs. Second, the mean of the empirical pmfs will be in the support, and thus a heuristic is to set the kth column of M (which corresponds to the kth vertex of the support) to be a convex combination of the kth vertex of the standard probability simplex and the empirical mean pmf. We provide code that finds the d optimal such convex combinations such that a specified percentage of the sample pmfs are within the support, which reduces the non-convex problem of finding the maximum likelihood d × d matrix M to a d-dimensional convex relaxation.

6 Demonstrations

It is reasonable to believe that if the shadow Dirichlet better matches the problem's statistics, it will perform better in practice, but an open question is how much better? To motivate the reader to investigate this question further in applications, we provide two small demonstrations.

6.1 Verifying the EM Estimation

We used a broad suite of simulations to test and verify the EM estimation. Here we include a simple visual confirmation that the EM estimation works: we drew 100 i.i.d. pmfs from a shadow Dirichlet with monotonic M for d = 3 and α = [3.94 2.25 2.81] (used in [18]). From each of the 100 pmfs, we drew 100 i.i.d. samples. Then we applied the EM algorithm to find the α for both the standard compound Dirichlet, and the compound shadow Dirichlet with the correct M. Fig. 
2 shows the true distribution and the two estimated distributions.

Figure 2: Samples were drawn from the true distribution (a shadow Dirichlet), and the given EM method was applied to form the estimated shadow Dirichlet and estimated Dirichlet distributions.

6.2 Estimating Proportions from Sales

Manufacturers often have constrained manufacturing resources, such as equipment, inventory of raw materials, and employee time, with which to produce multiple products. The manufacturer must decide how to proportionally allocate such constrained resources across their product line based on their estimate of proportional sales. Manufacturer Artifact Puzzles gave us their past retail sales data for the 20 puzzles they sold during July 2009 through Dec 2009, which we used to predict the proportion of sales expected for each puzzle. These estimates were then tested on the next four months of sales data, January 2010 through April 2010. The company also provided a similarity between puzzles S, where S(A, B) is the proportion of times an order during the six training months included both puzzle A and B if it included puzzle A. We compared treating each of the six training months of sales data as a sample from a compound Dirichlet versus a compound shadow Dirichlet. For the shadow Dirichlet, we normalized each column of the similarity matrix S to sum to one so that it was left-stochastic, and used that as the M matrix; this forces puzzles that are often bought together to have closer estimated proportions. We estimated each α parameter by EM to maximize the likelihood of the past sales data, and then estimated the future sales proportions to be the mean of the estimated Dirichlet or shadow Dirichlet distribution. 
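The prediction step just described is simply the shadow Dirichlet mean from Table 1, Mα/α_0; a small sketch with hypothetical numbers (not the actual puzzle data or similarity matrix):

```python
import numpy as np

# Predicted proportions = mean of the estimated shadow Dirichlet = M alpha / alpha_0.
# Hypothetical alpha and similarity-derived M for a 3-product example.
alpha = np.array([2.0, 1.0, 3.0])
M = np.array([[0.7, 0.2, 0.1],
              [0.2, 0.6, 0.2],
              [0.1, 0.2, 0.7]])       # columns sum to 1 (left-stochastic)

predicted = M @ (alpha / alpha.sum())  # a pmf, since M maps the simplex to itself
```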
We also compared with treating all six months of sales data as coming from one multinomial, which we estimated as the maximum likelihood multinomial, and with taking the mean of the six empirical pmfs.

Table 2: Squared errors between estimates and actual proportional sales.

       Multinomial   Mean Pmf   Dirichlet   Shadow Dirichlet
Jan.   .0129         .0106      .0109       .0093
Feb.   .0185         .0206      .0172       .0164
Mar.   .0231         .0222      .0227       .0197
Apr.   .0240         .0260      .0235       .0222

7 Summary

In this paper we have proposed a variant of the Dirichlet distribution that naturally captures some of the dependent structure that arises often in machine learning applications. We have discussed some of its theoretical properties, and shown how to specify the distribution for regularized pmfs, bounded variation pmfs, monotonic pmfs, and for any desired convex polytopal domain. We have derived the EM method and made available code to estimate both the shadow Dirichlet and compound shadow Dirichlet from data. Experimental results demonstrate that the EM method can estimate the shadow Dirichlet effectively, and that the shadow Dirichlet may provide worthwhile advantages in practice.

References
[1] B. Frigyik, A. Kapila, and M. R. Gupta, “Introduction to the Dirichlet distribution and related processes,” Tech. Rep., University of Washington, 2010.
[2] C. Zhai and J. Lafferty, “A study of smoothing methods for language models applied to information retrieval,” ACM Trans. on Information Systems, vol. 22, no. 2, pp. 179–214, 2004.
[3] Y. Chen, E. K. Garcia, M. R. Gupta, A. Rahimi, and L. Cazzanti, “Similarity-based classification: Concepts and algorithms,” Journal of Machine Learning Research, vol. 10, pp. 747–776, March 2009.
[4] R. Nallapati, T. Minka, and S. Robertson, “The smoothed-Dirichlet distribution: a building block for generative topic models,” Tech. Rep., Microsoft Research, Cambridge, 2007.
[5] J. Aitchison, Statistical Analysis of Compositional Data, Chapman & Hall, New York, 1986.
[6] R. J. Connor and J. E. Mosimann, “Concepts of independence for proportions with a generalization of the Dirichlet distribution,” Journal of the American Statistical Association, vol. 64, pp. 194–206, 1969.
[7] K. Pearson, “Mathematical contributions to the theory of evolution–on a form of spurious correlation which may arise when indices are used in the measurement of organs,” Proc. Royal Society of London, vol. 60, pp. 489–498, 1897.
[8] A. Ongaro, S. Migliorati, and G. S. Monti, “A new distribution on the simplex containing the Dirichlet family,” Proc. 3rd Compositional Data Analysis Workshop, 2008.
[9] W. S. Rayens and C. Srinivasan, “Dependence properties of generalized Liouville distributions on the simplex,” Journal of the American Statistical Association, vol. 89, no. 428, pp. 1465–1470, 1994.
[10] W. Rudin, Functional Analysis, McGraw-Hill, New York, 1991.
[11] G. Ronning, “Maximum likelihood estimation of Dirichlet distributions,” Journal of Statistical Computation and Simulation, vol. 34, no. 4, pp. 215–221, 1989.
[12] T. Minka, “Estimating a Dirichlet distribution,” Tech. Rep., Microsoft Research, Cambridge, 2009.
[13] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” Journal of the Royal Statistical Society: Series B (Methodological), vol. 39, no. 1, pp. 1–38, 1977.
[14] M. R. Gupta and Y. Chen, Theory and Use of the EM Method, Foundations and Trends in Signal Processing, Hanover, MA, 2010.
[15] R. T. Rockafellar, Convex Analysis, Princeton University Press, Princeton, NJ, 1970.
[16] S. Boyd and L. Vandenberghe, Convex Optimization, Cambridge University Press, Cambridge, 2004.
[17] K. B. Petersen and M. S. 
Pedersen, The Matrix Cookbook, 2009. Available at matrixcookbook.com.
[18] R. E. Madsen, D. Kauchak, and C. Elkan, “Modeling word burstiness using the Dirichlet distribution,” in Proc. Intl. Conf. Machine Learning, 2005.
", "award": [], "sourceid": 900, "authors": [{"given_name": "Bela", "family_name": "Frigyik", "institution": null}, {"given_name": "Maya", "family_name": "Gupta", "institution": null}, {"given_name": "Yihua", "family_name": "Chen", "institution": null}]}