{"title": "Density Estimation under Independent Similarly Distributed Sampling Assumptions", "book": "Advances in Neural Information Processing Systems", "page_first": 713, "page_last": 720, "abstract": null, "full_text": "Density Estimation under Independent Similarly\n\nDistributed Sampling Assumptions\n\nTony Jebara, Yingbo Song and Kapil Thadani\n\nDepartment of Computer Science\n\nColumbia University\nNew York, NY 10027\n\n{ jebara,yingbo,kapil }@cs.columbia.edu\n\nAbstract\n\nA method is proposed for semiparametric estimation where parametric and non-\nparametric criteria are exploited in density estimation and unsupervised learning.\nThis is accomplished by making sampling assumptions on a dataset that smoothly\ninterpolate between the extreme of independently distributed (or id) sample data\n(as in nonparametric kernel density estimators) to the extreme of independent\nidentically distributed (or iid) sample data. This article makes independent simi-\nlarly distributed (or isd) sampling assumptions and interpolates between these two\nusing a scalar parameter. The parameter controls a Bhattacharyya af\ufb01nity penalty\nbetween pairs of distributions on samples. Surprisingly, the isd method maintains\ncertain consistency and unimodality properties akin to maximum likelihood esti-\nmation. The proposed isd scheme is an alternative for handling nonstationarity in\ndata without making drastic hidden variable assumptions which often make esti-\nmation dif\ufb01cult and laden with local optima. Experiments in density estimation\non a variety of datasets con\ufb01rm the value of isd over iid estimation, id estimation\nand mixture modeling.\n\n1 Introduction\n\nDensity estimation is a popular unsupervised learning technique for recovering distributions from\ndata. 
Most approaches can be split into two categories: parametric methods where the functional form of the distribution is known a priori (often from the exponential family (Collins et al., 2002; Efron & Tibshirani, 1996)) and non-parametric approaches which explore a wider range of distributions with less constrained forms (Devroye & Gyorfi, 1985). Parametric approaches can underfit or may be mismatched to real-world data if they are built on incorrect a priori assumptions. A popular non-parametric approach is kernel density estimation or the Parzen windows method (Silverman, 1986). However, these may over-fit, thus requiring smoothing, bandwidth estimation and adaptation (Wand & Jones, 1995; Devroye & Gyorfi, 1985; Bengio et al., 2005). Semiparametric efforts (Olkin & Spiegelman, 1987) combine the complementary advantages of both schools. For instance, mixture models in their infinite-component setting (Rasmussen, 1999) as well as statistical processes (Teh et al., 2004) make only partial parametric assumptions. Alternatively, one may seed non-parametric distributions with parametric assumptions (Hjort & Glad, 1995) or augment parametric models with nonparametric factors (Naito, 2004). This article instead proposes a continuous interpolation between iid parametric density estimation and id kernel density estimation. It makes independent similarly distributed (isd) sampling assumptions on the data. In isd, a scalar parameter \lambda trades off parametric and non-parametric properties to produce an overall better density estimate. The method avoids sampling or approximate inference computations and only recycles well known parametric update rules for estimation. It remains computationally efficient, unimodal and consistent for a wide range of models.

This paper is organized as follows.
Section 2 shows how id and iid sampling setups can be smoothly interpolated using a novel isd posterior which maintains log-concavity for many popular models. Section 3 gives analytic formulae for the exponential family case as well as slight modifications to familiar maximum likelihood updates for recovering parameters under isd assumptions. Some consistency properties of the isd posterior are provided. Section 4 then extends the method to hidden variable models or mixtures and provides simple update rules. Section 5 provides experiments comparing isd with id and iid as well as mixture modeling. We conclude with a brief discussion.

2 A Continuum between id and iid

Assume we are given a dataset of N - 1 inputs x_1, \ldots, x_{N-1} from some sample space \Omega. Given a new query input x_N also in the same sample space, density estimation aims at recovering a density function p(x_1, \ldots, x_{N-1}, x_N) or p(x_N | x_1, \ldots, x_{N-1}) using a Bayesian or frequentist approach. Therefore, a general density estimation task is, given a dataset X = x_1, \ldots, x_N, recover p(x_1, \ldots, x_N). A common subsequent assumption is that the data points are id or independently sampled, which leads to the following simplification:

p_{id}(X) = \prod_{n=1}^{N} p_n(x_n).

The joint likelihood factorizes into a product of independent singleton marginals p_n(x_n), each of which can be different. A stricter assumption is that all samples share the same singleton marginal:

p_{iid}(X) = \prod_{n=1}^{N} p(x_n),

which is the popular iid sampling situation. In maximum likelihood estimation, either of the above likelihood scores (p_{id} or p_{iid}) is maximized by exploring different settings of the marginals. The id setup gives rise to what is commonly referred to as kernel density or Parzen estimation. Meanwhile, the iid setup gives rise to traditional iid parametric maximum likelihood (ML) or maximum a posteriori (MAP) estimation.
Both methods have complementary advantages and disadvantages. The iid assumption may be too aggressive for many real world problems. For instance, data may be generated by some slowly time-varying nonstationary distribution or (more distressingly) from a distribution that does not match our parametric assumptions. Similarly, the id setup may be too flexible and might over-fit when the marginal p_n(x) is myopically recovered from a single x_n.

Consider the parametric ML and MAP setting where parameters \Theta = \{\theta_1, \ldots, \theta_N\} are used to define the marginals. We will use p(x|\theta_n) = p_n(x) interchangeably. The MAP id parametric setting involves maximizing the following posterior (likelihood times a prior) over the models:

p_{id}(X, \Theta) = \prod_{n=1}^{N} p(x_n|\theta_n) p(\theta_n).

To mimic ML, simply set p(\theta_n) to uniform. For simplicity assume that these singleton priors are always kept uniform. Parameters \Theta are then estimated by maximizing p_{id}. To obtain the iid setup, we can maximize p_{id} subject to constraints that force all marginals to be equal, in other words \theta_m = \theta_n for all m, n \in \{1, \ldots, N\}.

Instead of applying N(N-1)/2 hard pairwise constraints in an iid setup, consider imposing penalty functions across pairs of marginals. These penalty functions reduce the posterior score when marginals disagree and encourage some stickiness between models (Teh et al., 2004). We measure the level of agreement between two marginals p_m(x) and p_n(x) using the following Bhattacharyya affinity metric (Bhattacharyya, 1943) between two distributions:

B(p_m, p_n) = B(p(x|\theta_m), p(x|\theta_n)) = \int p^{\beta}(x|\theta_m) \, p^{\beta}(x|\theta_n) \, dx.

This is a symmetric non-negative quantity in both distributions p_m and p_n. The natural choice for the setting of \beta is 1/2 and in this case, it is easy to verify the affinity is maximal and equals one if and only if p_m(x) = p_n(x).
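The \beta = 1/2 affinity can be checked numerically. The sketch below (not from the paper; the function names are ours) compares a direct numerical integral against the closed form exp(-(mu_m - mu_n)^2 / 8), which holds for unit-variance Gaussians, and confirms that the affinity equals one exactly when the two densities coincide.

```python
import math

def gauss_pdf(x, mu):
    # Unit-variance Gaussian density N(x; mu, 1).
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

def bhattacharyya(mu_m, mu_n, lo=-20.0, hi=20.0, steps=100000):
    # B(p_m, p_n) = integral of sqrt(p_m(x) p_n(x)) dx, approximated on a grid.
    dx = (hi - lo) / steps
    return sum(math.sqrt(gauss_pdf(lo + i * dx, mu_m) * gauss_pdf(lo + i * dx, mu_n))
               for i in range(steps)) * dx

# Closed form for unit-variance Gaussians: exp(-(mu_m - mu_n)**2 / 8).
num = bhattacharyya(0.0, 2.0)
closed = math.exp(-(0.0 - 2.0) ** 2 / 8)
print(abs(num - closed) < 1e-6)                    # the two computations agree
print(abs(bhattacharyya(1.0, 1.0) - 1.0) < 1e-6)   # affinity is one iff p_m = p_n
```

The grid sum is accurate here because the integrand is smooth and decays to zero well inside the integration window.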
A large family of alternative information divergences exists to relate pairs of distributions (Topsoe, 1999); these are discussed in the Appendix. In this article, the Bhattacharyya affinity is preferred since it has some useful computational, analytic, and log-concavity properties. In addition, it leads to straightforward variants of the estimation algorithms as in the id and iid situations for many choices of parametric densities. Furthermore (unlike the Kullback-Leibler divergence), it is possible to compute the Bhattacharyya affinity analytically and efficiently for a wide range of probability models including hidden Markov models (Jebara et al., 2004).

We next define (up to a constant scaling) the posterior score for independent similarly distributed (isd) data:

p_{\lambda}(X, \Theta) \propto \prod_n p(x_n|\theta_n) p(\theta_n) \prod_{m \neq n} B^{\lambda/N}(p(x|\theta_m), p(x|\theta_n)).   (1)

Here, a scalar power \lambda/N is applied to each affinity. The parameter \lambda adjusts the importance of the similarity between pairs of marginals. Clearly, if \lambda \to 0, then the affinity is always unity and the marginals are completely unconstrained as in the id setup. Meanwhile, as \lambda \to \infty, the affinity is zero unless the marginals are exactly identical. This produces the iid setup. We will refer to Equation 1 as the isd posterior and, when p(\theta_n) is set to uniform, we will call it the isd likelihood. One can also view the additional term in isd as id estimation with a modified prior \tilde{p}(\Theta) as follows:

\tilde{p}(\Theta) \propto \prod_n p(\theta_n) \prod_{m \neq n} B^{\lambda/N}(p(x|\theta_m), p(x|\theta_n)).

This prior is a Markov random field tying all parameters in a pairwise manner in addition to the standard singleton potentials in the id scenario.
However, this perspective is less appealing since it disguises the fact that the samples are not quite id or iid.

One of the appealing properties of iid and id maximum likelihood estimation is its unimodality for log-concave distributions. The isd posterior also benefits from a unique optimum and log-concavity. However, the conditional distributions p(x|\theta_n) are required to be jointly log-concave in both the parameters \theta_n and the data x. This set of distributions includes the Gaussian distribution (with fixed variance) and many exponential family distributions such as the Poisson, multinomial and exponential distribution. We next show that the isd posterior score for log-concave distributions is log-concave in \Theta. This produces a unique estimate for the parameters as was the case for id and iid setups.

Theorem 1 The isd posterior is log-concave for jointly log-concave density distributions and for log-concave prior distributions.

Proof 1 The isd log-posterior is the sum of the id log-likelihoods, the singleton log-priors and pairwise log-Bhattacharyya affinities:

\log p_{\lambda}(X, \Theta) = const + \sum_n \log p(x_n|\theta_n) + \sum_n \log p(\theta_n) + \frac{\lambda}{N} \sum_n \sum_{m \neq n} \log B(p_m, p_n).

The id log-likelihood is the sum of the log-probabilities of distributions that are log-concave in the parameters and is therefore concave. Adding the log-priors maintains concavity since these are log-concave in the parameters. The Bhattacharyya affinities are log-concave by the following key result (Prekopa, 1973). The Bhattacharyya affinity for log-concave distributions is given by the integral over the sample space of the product of two distributions.
Since the term in the integral is a product of jointly log-concave distributions (by assumption), the integrand is a jointly log-concave function. Integrating a log-concave function over some of its arguments produces a log-concave function in the remaining arguments (Prekopa, 1973). Therefore, the Bhattacharyya affinity is log-concave in the parameters of jointly log-concave distributions. Finally, since the isd log-posterior is the sum of concave terms and concave log-Bhattacharyya affinities, it must be concave.

This log-concavity permits iterative and greedy maximization methods to reliably converge in practice. Furthermore, the isd setup will produce convenient update rules that build upon iid estimation algorithms. There are additional properties of isd which are detailed in the following sections. We first explore the \beta = 1/2 setting and subsequently discuss the \beta = 1 setting.

3 Exponential Family Distributions and \beta = 1/2

We first specialize the above derivations to the case where the singleton marginals obey the exponential family form as follows:

p(x|\theta_n) = \exp\left( H(x) + \theta_n^T T(x) - A(\theta_n) \right).

An exponential family distribution is specified by providing H, the Lebesgue-Stieltjes integrator; \theta_n, the vector of natural parameters; T, the sufficient statistic; and A, the normalization factor (which is also known as the cumulant-generating function or the log-partition function). Tables of these values are shown in (Jebara et al., 2004). The function A is obtained by normalization (a Legendre transform) and is convex by construction. Therefore, exponential family distributions are always log-concave in the parameters \theta_n.
For the exponential family, the Bhattacharyya affinity is computable in closed form as follows:

B(p_m, p_n) = \exp\left( A(\theta_m/2 + \theta_n/2) - A(\theta_m)/2 - A(\theta_n)/2 \right).

Assuming uniform priors on the exponential family parameters, it is now straightforward to write an iterative algorithm to maximize the isd posterior. We find settings of \theta_1, \ldots, \theta_N that maximize the isd posterior or \log p_{\lambda}(X, \Theta) using a simple greedy method. Assume a current set of parameters \tilde{\theta}_1, \ldots, \tilde{\theta}_N is available. We then update a single \theta_n to increase the posterior while all other parameters (denoted \tilde{\Theta}_{/n}) remain fixed at their previous settings. It suffices to consider only terms in \log p_{\lambda}(X, \Theta) that are variable with \theta_n:

\log p_{\lambda}(X, \theta_n, \tilde{\Theta}_{/n}) = const + \theta_n^T T(x_n) - \frac{N + \lambda(N-1)}{N} A(\theta_n) + \frac{2\lambda}{N} \sum_{m \neq n} A(\tilde{\theta}_m/2 + \theta_n/2).

If the exponential family is jointly log-concave in parameters and data (as is the case for Gaussians), this term is log-concave in \theta_n. Therefore, we can take its partial derivative with respect to \theta_n and set it to zero to maximize:

A'(\theta_n) = \frac{N}{N + \lambda(N-1)} \left( T(x_n) + \frac{\lambda}{N} \sum_{m \neq n} A'(\tilde{\theta}_m/2 + \theta_n/2) \right).   (2)

For the Gaussian mean case (i.e. a white Gaussian with covariance locked at identity), we have A(\theta) = \theta^T \theta. Then a closed-form formula is easy to recover from the above [1]. However, a simpler iterative update rule for \theta_n is also possible as follows.
Since A(\theta) is a convex function, we can compute a linear variational lower bound on each A(\theta_m/2 + \theta_n/2) term for the current setting of \theta_n:

\log p_{\lambda}(X, \theta_n, \tilde{\Theta}_{/n}) \geq const + \theta_n^T T(x_n) - \frac{N + \lambda(N-1)}{N} A(\theta_n) + \frac{\lambda}{N} \sum_{m \neq n} \left[ 2 A(\tilde{\theta}_m/2 + \tilde{\theta}_n/2) + A'(\tilde{\theta}_m/2 + \tilde{\theta}_n/2)^T (\theta_n - \tilde{\theta}_n) \right].

This gives an iterative update rule of the form of Equation 2 where the \theta_n on the right hand side is kept fixed at its previous setting (i.e. replace the right hand side \theta_n with \tilde{\theta}_n) while the equation is iterated multiple times until the value of \theta_n converges. Since we have a variational lower bound, each iterative update of \theta_n monotonically increases the isd posterior. We can also work with a robust (yet not log-concave) version of the isd score which has the form:

\log \hat{p}_{\lambda}(X, \Theta) = const + \sum_n \log p(x_n|\theta_n) + \sum_n \log p(\theta_n) + \frac{\lambda}{N} \sum_n \log\left( \sum_{m \neq n} B(p_m, p_n) \right),

and leads to the general update rule (where \alpha = 0 reproduces isd and larger \alpha increases robustness):

A'(\theta_n) = \frac{N}{N + \lambda(N-1)} \left( T(x_n) + \frac{\lambda}{N} \sum_{m \neq n} \frac{(N-1) B^{\alpha}(p(x|\tilde{\theta}_m), p(x|\tilde{\theta}_n))}{\sum_{l \neq n} B^{\alpha}(p(x|\tilde{\theta}_l), p(x|\tilde{\theta}_n))} A'(\tilde{\theta}_m/2 + \tilde{\theta}_n/2) \right).

We next examine marginal consistency, another important property of the isd posterior.

[1] The update for the Gaussian mean with covariance = I is: \theta_n = \frac{1}{N + \lambda(N-1)/2} \left( N x_n + \frac{\lambda}{2} \sum_{m \neq n} \tilde{\theta}_m \right).

3.1 Marginal Consistency in the Gaussian Mean Case

For marginal consistency, if a datum and model parameter are hidden and integrated over, this should not change our estimate.
It is possible to show that the isd posterior is marginally consistent at least in the Gaussian mean case (one element of the exponential family). In other words, marginalizing over an observation and its associated marginal's parameter (which can be taken to be x_N and \theta_N without loss of generality) still produces a similar isd posterior on the remaining observations X_{/N} and parameters \Theta_{/N}. Thus, we need:

\int \int p_{\lambda}(X, \Theta) \, dx_N \, d\theta_N \propto p_{\lambda}(X_{/N}, \Theta_{/N}).

We then would recover the posterior formed using the formula in Equation 1 with only N - 1 observations and N - 1 models.

Theorem 2 The isd posterior with \beta = 1/2 is marginally consistent for Gaussian distributions.

Proof 2 Start by integrating over x_N:

\int p_{\lambda}(X, \Theta) \, dx_N \propto \prod_{i=1}^{N-1} p(x_i|\theta_i) \prod_{n=1}^{N} \left[ p(\theta_n) \prod_{m=n+1}^{N} B^{2\lambda/N}(p_m, p_n) \right].

Assume the singleton prior p(\theta_N) is uniform and integrate over \theta_N to obtain:

\int \int p_{\lambda}(X, \Theta) \, dx_N \, d\theta_N \propto \prod_{i=1}^{N-1} p(x_i|\theta_i) \prod_{n=1}^{N-1} \prod_{m=n+1}^{N-1} B^{2\lambda/N}(p_m, p_n) \int \prod_{m=1}^{N-1} B^{2\lambda/N}(p_m, p_N) \, d\theta_N.

Consider only the right hand integral and impute the formula for the Bhattacharyya affinity:

\int \prod_{m=1}^{N-1} B^{2\lambda/N}(p_m, p_N) \, d\theta_N = \int \exp\left( \frac{2\lambda}{N} \sum_{m=1}^{N-1} \left[ A\!\left(\frac{\theta_m}{2} + \frac{\theta_N}{2}\right) - \frac{A(\theta_m)}{2} - \frac{A(\theta_N)}{2} \right] \right) d\theta_N.

In the (white) Gaussian case A(\theta) = \theta^T \theta, which simplifies the above into:

\int \prod_{m=1}^{N-1} B^{2\lambda/N}(p_m, p_N) \, d\theta_N = \int \exp\left( -\frac{2\lambda}{N} \sum_{m=1}^{N-1} A\!\left(\frac{\theta_m}{2} - \frac{\theta_N}{2}\right) \right) d\theta_N
\propto \exp\left( \frac{2\lambda}{N(N-1)} \sum_{n=1}^{N-1} \sum_{m=n+1}^{N-1} \left[ A\!\left(\frac{\theta_m}{2} + \frac{\theta_n}{2}\right) - \frac{A(\theta_m)}{2} - \frac{A(\theta_n)}{2} \right] \right)
\propto \prod_{n=1}^{N-1} \prod_{m=n+1}^{N-1} B^{\frac{2\lambda}{N(N-1)}}(p_m, p_n).

Reinserting the integral changes the exponent of the pairs of Bhattacharyya affinities between the (N-1) models, raising it to the appropriate power \lambda/(N-1):

\int \int p_{\lambda}(X, \Theta) \, dx_N \, d\theta_N \propto \prod_{i=1}^{N-1} p(x_i|\theta_i) \prod_{n=1}^{N-1} \prod_{m=n+1}^{N-1} B^{2\lambda/(N-1)}(p_m, p_n) = p_{\lambda}(X_{/N}, \Theta_{/N}).

Therefore, we get the same isd score that we would have obtained had we started with only (N-1) data points. We conjecture that it is possible to generalize the marginal consistency argument to other distributions beyond the Gaussian. The isd estimator thus has useful properties and still agrees with id when \lambda = 0 and iid when \lambda = \infty. Next, the estimator is generalized to handle distributions beyond the exponential family where latent variables are implicated (as is the case for mixtures of Gaussians, hidden Markov models, latent graphical models and so on).

4 Hidden Variable Models and \beta = 1

One important limitation of most divergences between distributions is that they become awkward when dealing with hidden variables or mixture models. This is because they may involve intractable integrals. The Bhattacharyya affinity with the setting \beta = 1, also known as the probability product kernel, is an exception to this since it only involves integrating the product of two distributions. In fact, it is known that this affinity is efficient to compute for mixtures of Gaussians, multinomials and even hidden Markov models (Jebara et al., 2004). This permits the affinity metric to efficiently pull together parameters \theta_m and \theta_n.
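The closed form of the \beta = 1 kernel for Gaussian mixtures reduces to a double sum over component pairs, since the integral of a product of two Gaussians is itself a Gaussian evaluation. The following 1-D sketch (our own illustration; `ppk_mixtures` and the example mixtures are not from the paper) uses the identity that the integral of N(x; mu_i, v_i) N(x; mu_j, v_j) over x equals N(mu_i; mu_j, v_i + v_j), and checks it against a numerical integral.

```python
import math

def normal_pdf(x, mu, var):
    # Gaussian density N(x; mu, var).
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def ppk_mixtures(mix_m, mix_n):
    # beta = 1 probability product kernel: integral of p_m(x) p_n(x) dx for two
    # 1-D Gaussian mixtures given as lists of (weight, mean, variance) tuples.
    # Component pairs integrate in closed form: N(mu_i; mu_j, v_i + v_j).
    return sum(wi * wj * normal_pdf(mi, mj, vi + vj)
               for wi, mi, vi in mix_m for wj, mj, vj in mix_n)

# Hypothetical example mixtures.
mix_a = [(0.5, -1.0, 1.0), (0.5, 1.0, 0.5)]
mix_b = [(0.3, 0.0, 2.0), (0.7, 2.0, 1.0)]
closed = ppk_mixtures(mix_a, mix_b)

# Numerical check on a dense grid.
dx = 1e-3
grid = [-15.0 + i * dx for i in range(30000)]
pm = [sum(w * normal_pdf(x, m, v) for w, m, v in mix_a) for x in grid]
pn = [sum(w * normal_pdf(x, m, v) for w, m, v in mix_b) for x in grid]
num = sum(a * b for a, b in zip(pm, pn)) * dx
print(abs(closed - num) < 1e-5)
```

The cost is quadratic in the number of components, with no sampling or numerical integration needed, which is what makes the \beta = 1 setting attractive for latent variable models.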
However, for mixture models, there is the presence of hidden variables h in addition to the observed variables. Therefore, we replace all the marginals by p(x|\theta_n) = \sum_h p(x, h|\theta_n). The affinity is still straightforward to compute for any pair of latent variable models (mixture models, hidden Markov models and so on). Thus, evaluating the isd posterior is straightforward for such models when \beta = 1. We next provide a variational method that makes it possible to maximize a lower bound on the isd posterior in these cases.

Assume a current set of parameters \tilde{\Theta} = \tilde{\theta}_1, \ldots, \tilde{\theta}_N is available. We will find a new setting for \theta_n that increases the posterior while all other parameters (denoted \tilde{\Theta}_{/n}) remain fixed at their previous settings. It suffices to consider only terms in \log p_{\lambda}(X, \Theta) that depend on \theta_n. This yields:

\log p_{\lambda}(X, \theta_n, \tilde{\Theta}_{/n}) = const + \log p(x_n|\theta_n) p(\theta_n) + \frac{2\lambda}{N} \sum_{m \neq n} \log \int p(x|\tilde{\theta}_m) p(x|\theta_n) \, dx
\geq const + \log p(x_n|\theta_n) p(\theta_n) + \frac{2\lambda}{N} \sum_{m \neq n} \int p(x|\tilde{\theta}_m) \log p(x|\theta_n) \, dx.

The application of Jensen's inequality above produces an auxiliary function Q(\theta_n|\tilde{\Theta}_{/n}) which is a lower bound on the log-posterior. Note that each density function has hidden variables, p(x_n|\theta_n) = \sum_h p(x_n, h|\theta_n). Applying Jensen's inequality again (as in the Expectation-Maximization or EM algorithm) replaces the log-incomplete likelihoods over h with expectations over the complete posteriors given the previous parameters \tilde{\theta}_n.
This gives isd the following auxiliary function:

Q(\theta_n|\tilde{\Theta}) = \sum_h p(h|x_n, \tilde{\theta}_n) \log p(x_n, h|\theta_n) + \log p(\theta_n) + \frac{2\lambda}{N} \sum_{m \neq n} \int p(x|\tilde{\theta}_m) \sum_h p(h|x, \tilde{\theta}_n) \log p(x, h|\theta_n) \, dx.

This is a variational lower bound which can be iteratively maximized instead of the original isd posterior. While it is possible to directly solve for the maximum of Q(\theta_n|\tilde{\Theta}) in some mixture models, in practice, a further simplification is to replace the integral over x with synthesized samples drawn from p(x|\tilde{\theta}_m). This leads to the following approximate auxiliary function (based on the law of large numbers), which is merely the EM update rule for \theta_n with s = 1, \ldots, S virtual samples x_{m,s} obtained from the m'th model p(x|\tilde{\theta}_m) for each of the other N - 1 models:

\tilde{Q}(\theta_n|\tilde{\Theta}) = \sum_h p(h|x_n, \tilde{\theta}_n) \log p(x_n, h|\theta_n) + \log p(\theta_n) + \frac{2\lambda}{SN} \sum_{m \neq n} \sum_s \sum_h p(h|x_{m,s}, \tilde{\theta}_n) \log p(x_{m,s}, h|\theta_n).

We now have an efficient update rule for latent variable models (mixtures, hidden Markov models, etc.) which maximizes a lower bound on p_{\lambda}(X, \Theta). Unfortunately, as with most EM implementations, the arguments for log-concavity no longer hold.

5 Experiments

A preliminary way to evaluate the usefulness of the isd framework is to explore density estimation over real-world datasets under varying \lambda. If we set \lambda large, we have the standard iid setup and only fit a single parametric model to the dataset. For small \lambda, we obtain the kernel density or Parzen estimator. In between, an iterative algorithm is available to maximize the isd posterior to obtain potentially superior models \theta^*_1, \ldots, \theta^*_N. Figure 1 shows the isd estimator with Gaussian models on a ring-shaped 2D dataset. The new estimator recovers the shape of the distribution more accurately.
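The interpolation behind these fits can be sketched in the Gaussian mean case using the footnoted update \theta_n = (N x_n + (\lambda/2) \sum_{m \neq n} \tilde{\theta}_m) / (N + \lambda(N-1)/2), iterated to convergence. The function name and the Jacobi-style iteration scheme below are our own illustration, not the authors' code.

```python
# Iterative isd update for Gaussian means (white Gaussian, covariance = I):
# theta_n = (N*x_n + (lam/2) * sum_{m != n} theta_m) / (N + lam*(N-1)/2).
# lam = 0 recovers id (theta_n = x_n, Parzen-style centers); large lam pulls
# every mean toward the shared sample mean, approaching the iid fit.
def isd_gaussian_means(xs, lam, iters=500):
    n_pts = len(xs)
    thetas = list(xs)  # initialize at the id solution
    for _ in range(iters):
        total = sum(thetas)
        thetas = [(n_pts * x + (lam / 2) * (total - th)) / (n_pts + lam * (n_pts - 1) / 2)
                  for x, th in zip(xs, thetas)]
    return thetas

data = [0.0, 1.0, 2.0, 7.0]

print(isd_gaussian_means(data, lam=0.0))   # id limit: means sit on the data points
big = isd_gaussian_means(data, lam=1e6)
print(all(abs(t - sum(data) / len(data)) < 1e-3 for t in big))  # ~iid: common mean
```

Intermediate \lambda values leave each mean partway between its own data point and the consensus of the other models, which is the behavior the ring-shaped example above illustrates.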
To evaluate performance on real data, we aggregate the isd learned models into a single density estimate as is done with Parzen estimators and compute the iid likelihood of held out test data via \sum_{\tau} \log\left( \frac{1}{N} \sum_n p(x_{\tau}|\theta^*_n) \right). A larger score implies a better p(x) density estimate.

[Figure 1: Estimation with isd for Gaussian models (mean and covariance) on synthetic data. Panels show fits for \lambda = 0, 1, 2, \infty with \alpha = 0 (top row) and \alpha = 1/2 (bottom row).]

Dataset | id | iid-1 | iid-2 | iid-3 | iid-4 | iid-5 | iid-\infty | isd \alpha=0 | isd \alpha=1/2
SPIRAL | -5.61e3 | -1.36e3 | -1.36e3 | -1.19e3 | -7.98e2 | -6.48e2 | -4.86e2 | -1.19e2 | -2.26e2
MIT-CBCL | -9.82e2 | -1.39e3 | -1.19e3 | -1.00e3 | -1.01e3 | -1.10e3 | -3.14e3 | -9.79e2 | -9.79e2
HEART | -1.94e3 | -2.02e4 | -3.23e4 | -2.50e4 | -1.68e4 | -3.15e4 | -4.02e2 | -4.47e2 | -4.51e2
DIABETES | -6.25e3 | -2.12e5 | -2.85e5 | -4.48e5 | -2.03e5 | -3.40e5 | -8.22e2 | -8.09e2 | -8.28e2
CANCER | -5.80e3 | -7.22e6 | -2.94e6 | -3.92e6 | -4.08e6 | -3.96e6 | -1.22e2 | -5.54e2 | -5.54e2
LIVER | -3.41e3 | -2.53e4 | -1.88e4 | -2.79e4 | -2.62e4 | -3.23e4 | -4.56e2 | -4.69e2 | -4.74e2

Table 1: Gaussian test log-likelihoods using id, iid, EM, \infty GMM and isd estimation.

Table 1 summarizes experiments with the Gaussian (mean and covariance) models. On 6 standard datasets, we show the average test log-likelihood of Gaussian estimation while varying the settings of \lambda compared to a single iid Gaussian, an id Parzen RBF estimator and a mixture of 2 to 5 Gaussians using EM. Comparisons with (Rasmussen, 1999) are also shown. Cross-validation was used to choose the \sigma, \lambda or EM local minimum (from ten initializations) for the id, isd and EM algorithms respectively. Train, cross-validation and test split sizes were 80%, 10% and 10% respectively. The test log-likelihoods show that isd outperformed iid, id and EM estimation and was comparable to infinite Gaussian mixture (iid-\infty) models (Rasmussen, 1999) (which is a far more computationally demanding method). In another synthetic experiment with hidden Markov models, 40 sequences of 8 binary symbols were generated using 2 state HMMs with 2 discrete emissions. However, the parameters generating the HMMs were allowed to slowly drift during sampling (i.e. not iid). The data was split into 20 training and 20 testing examples. Table 2 shows that the isd estimator for certain values of \lambda produced higher test log-likelihoods than id and iid.

6 Discussion

This article has provided an isd scheme to smoothly interpolate between id and iid assumptions in density estimation. This is done by penalizing divergence between pairs of models using a Bhattacharyya affinity.
The method maintains simple update rules for recovering parameters for exponential families as well as mixture models. In addition, the isd posterior maintains useful log-concavity and marginal consistency properties. Experiments show its advantages in real-world datasets where id or iid assumptions may be too extreme. Future work involves extending the approach into other aspects of unsupervised learning such as clustering. We are also considering computing the isd posterior with a normalizing constant which depends on \lambda and thus permits a direct estimate of \lambda by maximization instead of cross-validation [2].

\lambda = 0 | \lambda = 1 | \lambda = 2 | \lambda = 3 | \lambda = 4 | \lambda = 5 | \lambda = 10 | \lambda = 20 | \lambda = 30 | \lambda = \infty
-5.7153 | -5.5875 | -5.5692 | -5.5648 | -5.5757 | -5.5825 | -5.5849 | -5.5856 | -5.6152 | -5.5721

Table 2: HMM test log-likelihoods using id, iid and isd estimation.

7 Appendix: Alternative Information Divergences

There is a large family of information divergences (Topsoe, 1999) between pairs of distributions (Renyi measure, variational distance, \chi^2 divergence, etc.) that can be used to pull models p_m and p_n towards each other. The Bhattacharyya, though, is computationally easier to evaluate and minimize over a wide range of probability models (exponential families, mixtures and hidden Markov models). An alternative is the Kullback-Leibler divergence D(p_m \| p_n) = \int p_m(x) (\log p_m(x) - \log p_n(x)) \, dx and its symmetrized variant D(p_m \| p_n)/2 + D(p_n \| p_m)/2. The Bhattacharyya affinity is related to the symmetrized variant of KL. Consider a variational distribution q that lies between the inputs p_m and p_n. The log Bhattacharyya affinity with \beta = 1/2 can be written as follows:

\log B(p_m, p_n) = \log \int q(x) \frac{\sqrt{p_m(x) p_n(x)}}{q(x)} \, dx \geq -D(q \| p_m)/2 - D(q \| p_n)/2.

Thus, B(p_m, p_n) \geq \exp(-D(q \| p_m)/2 - D(q \| p_n)/2).
The choice of q that maximizes the lower bound on the Bhattacharyya is q(x) = \frac{1}{Z} \sqrt{p_m(x) p_n(x)}, where Z = B(p_m, p_n) normalizes q(x); the bound is then tight and equal to the Bhattacharyya affinity. Thus we have the following property:

-2 \log B(p_m, p_n) = \min_q D(q \| p_m) + D(q \| p_n).

It is interesting to note that the Jensen-Shannon divergence (another symmetrized variant of KL) emerges by placing the variational q distribution as the second argument in the divergences:

2 JS(p_m, p_n) = D(p_m \| p_m/2 + p_n/2) + D(p_n \| p_m/2 + p_n/2) = \min_q D(p_m \| q) + D(p_n \| q).

Simple manipulations then show 2 JS(p_m, p_n) \leq \min(D(p_m \| p_n), D(p_n \| p_m)). Thus, there are close ties between the Bhattacharyya, Jensen-Shannon and symmetrized KL divergences.

References

Bengio, Y., Larochelle, H., & Vincent, P. (2005). Non-local manifold Parzen windows. Neural Information Processing Systems.
Bhattacharyya, A. (1943). On a measure of divergence between two statistical populations defined by their probability distributions. Bull. Calcutta Math Soc.
Collins, M., Dasgupta, S., & Schapire, R. (2002). A generalization of principal components analysis to the exponential family. NIPS.
Devroye, L., & Gyorfi, L. (1985). Nonparametric density estimation: The L1 view. John Wiley.
Efron, B., & Tibshirani, R. (1996). Using specially designed exponential families for density estimation. The Annals of Statistics, 24, 2431-2461.
Hjort, N., & Glad, I. (1995). Nonparametric density estimation with a parametric start. The Annals of Statistics, 23, 882-904.
Jebara, T., Kondor, R., & Howard, A. (2004). Probability product kernels. Journal of Machine Learning Research, 5, 819-844.
Naito, K. (2004). Semiparametric density estimation by local L2-fitting. The Annals of Statistics, 32, 1162-1192.
Olkin, I., & Spiegelman, C. (1987). A semiparametric approach to density estimation.
Journal of the American Statistical Association, 82, 858-865.
Prekopa, A. (1973). On logarithmic concave measures and functions. Acta Sci. Math., 34, 335-343.
Rasmussen, C. (1999). The infinite Gaussian mixture model. NIPS.
Silverman, B. (1986). Density estimation for statistics and data analysis. Chapman and Hall: London.
Teh, Y., Jordan, M., Beal, M., & Blei, D. (2004). Hierarchical Dirichlet processes. NIPS.
Topsoe, F. (1999). Some inequalities for information divergence and related measures of discrimination. Journal of Inequalities in Pure and Applied Mathematics, 2.
Wand, M., & Jones, M. (1995). Kernel smoothing. CRC Press.

[2] Work supported in part by NSF Award IIS-0347499 and ONR Award N000140710507.
", "award": [], "sourceid": 1057, "authors": [{"given_name": "Tony", "family_name": "Jebara", "institution": null}, {"given_name": "Yingbo", "family_name": "Song", "institution": null}, {"given_name": "Kapil", "family_name": "Thadani", "institution": null}]}