{"title": "On Reversing Jensen's Inequality", "book": "Advances in Neural Information Processing Systems", "page_first": 231, "page_last": 237, "abstract": null, "full_text": "On Reversing Jensen's Inequality \n\nTony Jebara \nMIT Media Lab \nCambridge, MA 02139 \njebara@media.mit.edu \n\nAlex Pentland \nMIT Media Lab \nCambridge, MA 02139 \nsandy@media.mit.edu \n\nAbstract \n\nJensen's inequality is a powerful mathematical tool and one of the workhorses in statistical learning. Its applications therein include the EM algorithm, Bayesian estimation and Bayesian inference. Jensen computes simple lower bounds on otherwise intractable quantities such as products of sums and latent log-likelihoods. This simplification then permits operations like integration and maximization. Quite often (i.e. in discriminative learning) upper bounds are needed as well. We derive and prove an efficient analytic inequality that provides such variational upper bounds. This inequality holds for latent variable mixtures of exponential family distributions and thus spans a wide range of contemporary statistical models. We also discuss applications of the upper bounds including maximum conditional likelihood, large margin discriminative models and conditional Bayesian inference. Convergence, efficiency and prediction results are shown.1 \n\n1 Introduction \n\nStatistical model estimation and inference often require the maximization, evaluation, and integration of complicated mathematical expressions. One approach for simplifying the computations is to find and manipulate variational upper and lower bounds instead of the expressions themselves. A prominent tool for computing such bounds is Jensen's inequality, which subsumes many information-theoretic bounds (cf. Cover and Thomas 1996). In maximum likelihood (ML) estimation under incomplete data, Jensen is used to derive an iterative EM algorithm [2]. 
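The Jensen/EM lower bound on a log-sum is easy to check numerically. The following sketch (our own illustration, not part of the paper) verifies that log Σ_m α_m p_m ≥ Σ_m h_m log(α_m p_m / h_m) for a toy two-component mixture, where the h_m are the posterior responsibilities; at the parameters where h was computed, the bound is tight.

```python
import numpy as np

# Toy mixture: log-sum log(sum_m alpha_m * p_m) and its Jensen/EM lower bound.
alpha = np.array([0.3, 0.7])   # mixing weights
p = np.array([0.02, 0.11])     # component likelihoods p(X|Theta_m)

log_sum = np.log(np.sum(alpha * p))

# Responsibilities h_m: posterior over the latent variable m.
h = alpha * p / np.sum(alpha * p)

# Jensen: log sum_m alpha_m p_m >= sum_m h_m log(alpha_m p_m / h_m).
lower_bound = np.sum(h * np.log(alpha * p / h))

assert lower_bound <= log_sum + 1e-12
# Tangential contact: equality holds at the point where h was computed.
assert abs(lower_bound - log_sum) < 1e-9
```

Recomputing h after each parameter update and re-maximizing the bound is exactly the EM iteration referenced above.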
For graphical models, intractable inference and estimation are performed via variational bounds [7]. Bayesian integration also uses Jensen and EM-like bounds to compute integrals that are otherwise intractable [9]. \n\nRecently, however, the learning community has seen the proliferation of conditional or discriminative criteria. These include support vector machines, maximum entropy discrimination distributions [4], and discriminative HMMs [3]. These criteria allocate resources with the given task (classification or regression) in mind, yielding improved performance. In contrast, under canonical ML each density is trained separately to describe observations rather than to optimize classification or regression, so performance is compromised. \n\n1This is the short version of the paper. Please download the long version with tighter bounds, detailed proofs, more results, important extensions and sample matlab code from: http://www.media.mit.edu/~jebara/bounds \n\nComputationally, what differentiates these criteria from ML is that they not only require Jensen-type lower bounds but may also utilize the corresponding upper bounds. The Jensen bounds only partially simplify their expressions and some intractabilities remain. For instance, latent distributions need to be bounded above and below in a discriminative setting [4] [3]. Metaphorically, discriminative learning requires lower bounds to cluster positive examples and upper bounds to repel away from negative ones. We derive these complementary upper bounds2, which are useful for discriminative classification and regression. These bounds are structurally similar to Jensen bounds, allowing easy migration of ML techniques to discriminative settings. \n\nThis paper is organized as follows: We introduce the probabilistic models we will use: mixtures of the exponential family. 
We then describe some estimation criteria on these models which are intractable. One simplification is to lower bound via Jensen's inequality or EM. The reverse upper bound is then derived. We show implementation and results of the bounds in applications (i.e. conditional maximum likelihood (CML)). Finally, a strict algebraic proof is given to validate the reverse-bound. \n\n2 The Exponential Family \n\nWe restrict the reverse-Jensen bounds to mixtures of the exponential family (e-family). In practice this class of densities covers a very large portion of contemporary statistical models. Mixtures of the e-family include Gaussian Mixture Models, Multinomials, Poisson, Hidden Markov Models, Sigmoidal Belief Networks, Discrete Bayesian Networks, etc. [1] The e-family has the following form: p(X|Θ) = exp(A(X) + X^T Θ − K(Θ)). \n\nE-Distribution   A(X)                          K(Θ) \nGaussian         −X^T X/2 − (D/2) log(2π)      Θ^T Θ/2 \nMultinomial      0                             log(1 + Σ_d exp(Θ_d)) \n\nHere, K(Θ) is convex in Θ, a multi-dimensional parameter vector. Typically the data vector X is constrained to live in the gradient space of K, i.e. X ∈ ∂K(Θ)/∂Θ. The e-family has special properties (i.e. conjugates, convexity, linearity, etc.) [1]. The reverse-Jensen bound also exploits these intrinsic properties. The table above lists example A and K functions for Gaussian and multinomial distributions. More generally, though, we will deal with mixtures of the e-family (where m represents the incomplete data3), i.e.: \n\np(X|Θ) = Σ_m p(m, X|Θ) = Σ_m α_m exp(A_m(X_m) + X_m^T Θ_m − K_m(Θ_m)) \n\nThese latent probability distributions need to get maximized, integrated, marginalized, conditioned, etc. to solve various inference, prediction, and parameter estimation tasks. However, such manipulations can be difficult or intractable. \n\n3 Conditional and Discriminative Criteria \n\nThe combination of ML with EM and Jensen has indeed produced straightforward and monotonically convergent estimation procedures for mixtures of the e-family [2] [1] [7]. 
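As a small sanity check on the e-family form above (our own sketch, not part of the paper), a unit-variance scalar Gaussian with natural parameter Θ = μ has A(X) = −X²/2 − ½ log(2π) and K(Θ) = Θ²/2, and exp(A(X) + XΘ − K(Θ)) recovers the usual N(μ, 1) density:

```python
import math

def efamily_gaussian(x, theta):
    """Unit-variance Gaussian written in e-family form: exp(A(x) + x*theta - K(theta))."""
    A = -0.5 * x * x - 0.5 * math.log(2 * math.pi)  # A(X) from the table
    K = 0.5 * theta * theta                          # K(Theta), convex in theta
    return math.exp(A + x * theta - K)

def gaussian_pdf(x, mu):
    """Standard N(mu, 1) density for comparison."""
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

for x, mu in [(0.0, 0.0), (1.3, -0.4), (-2.0, 1.0)]:
    assert abs(efamily_gaussian(x, mu) - gaussian_pdf(x, mu)) < 1e-12
```

Completing the square shows the two expressions are identical term by term, and note that ∂K/∂Θ = Θ = μ, the mean, consistent with the gradient-space constraint on X.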
\nHowever, ML criteria are non-discriminative modeling techniques for estimating generative models. Consequently, they suffer when model assumptions are inaccurate. \n\n2A weaker bound for Gaussian mixture regression appears in [6]. Other reverse-bounds are in [8]. \n3Note we use Θ to denote an aggregate model encompassing all individual Θ_m ∀m. \n\n[Figure 1 here: two scatter plots of the same 2D data, one under the ML classifier (l = -8.0, l_c = -1.7) and one under the CML classifier (l = -54.7, l_c = 0.4).] \n\nFigure 1: ML vs. CML (Thick Gaussians represent circles, thin ones represent x's). \n\nFor visualization, observe the binary classification4 problem above. Here, our model incorrectly has 2 Gaussians (identity covariances) per class but the true data is generated from 8 Gaussians. Two solutions are shown, ML and CML. Note the values of joint log-likelihood l and conditional log-likelihood l_c. The ML solution performs as well as random chance guessing while CML classifies the data very well. Thus CML, in estimating a conditional density, propagates the classification task into the estimation criterion. \n\nIn such examples, we are given training examples X_i and corresponding binary labels c_i to classify with a latent variable e-family model (mixture of Gaussians). We use m to represent the latent missing variables. 
The corresponding objective functions, log-likelihood l and conditional log-likelihood l_c, are: \n\nl = Σ_i log Σ_m p(m, c_i, X_i | Θ) \nl_c = Σ_i log p(c_i | X_i, Θ) = Σ_i [ log Σ_m p(m, c_i, X_i | Θ) − log Σ_m Σ_c p(m, c, X_i | Θ) ] \n\nThe classification and regression task can be even more powerfully exploited in the case of discriminative (or large-margin) estimation [4] [5]. Here, hard constraints are posed on a discriminant function L(X|Θ), the ratio of each class' latent likelihood. Prediction of class labels is done via the sign of the function, c = sign L(X|Θ). \n\nL(X|Θ) = log [ p(X|Θ₊) / p(X|Θ₋) ] = log Σ_m p(m, X|Θ₊) − log Σ_m p(m, X|Θ₋)   (1) \n\nIn the above log-likelihoods and discriminant functions we note logarithms of sums (latent likelihood is basically a product of sums) which cause intractabilities. For instance, it is difficult to maximize or integrate the above log-sum quantities. Thus, we need to invoke simplifying bounds. \n\n4 Jensen and EM Bounds \n\nRecall the definition of Jensen's inequality: f(E{X}) ≥ E{f(X)} for concave f. The log-summations in l, l_c, and L(X|Θ) all involve a concave f = log around an expectation, i.e. a log-sum or probabilistic mixture over latent variables. We apply Jensen as follows: \n\nlog Σ_m p(m, X|Θ) = log Σ_m α_m exp(A_m(X_m) + X_m^T Θ_m − K_m(Θ_m)) \n  ≥ Σ_m [ p(m, X|Θ̃) / Σ_n p(n, X|Θ̃) ] log [ p(m, X|Θ) / p(m, X|Θ̃) ] + log Σ_m p(m, X|Θ̃) \n  = Σ_m [h_m] (X_m^T Θ_m − K_m(Θ_m)) + C \n\nAbove, we have also expanded the bound in the e-family notation. This forms a variational lower bound on the log-sum which makes tangential contact with it at Θ̃ and is much easier \n\n4These derivations extend to multi-class classification and regression as well. \n\nto manipulate. Basically, the log-sum becomes a sum of log-exponential family members. 
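A numeric sketch of this tangential contact (our own illustration, not from the paper): fixing the responsibilities h_m at the current parameters Θ̃ of a two-component unit-variance Gaussian mixture, the Jensen bound equals the log-sum at Θ̃ and stays below it everywhere else.

```python
import numpy as np

def log_sum(x, mus, alphas):
    """log of a mixture of unit-variance Gaussians: log sum_m alpha_m N(x; mu_m, 1)."""
    comps = alphas * np.exp(-0.5 * (x - mus) ** 2) / np.sqrt(2 * np.pi)
    return np.log(np.sum(comps))

x = 1.0
alphas = np.array([0.4, 0.6])
mu_tilde = np.array([-1.0, 2.0])   # current parameters (Theta-tilde)

# Responsibilities h_m computed once at Theta-tilde and then held fixed.
comps_tilde = alphas * np.exp(-0.5 * (x - mu_tilde) ** 2) / np.sqrt(2 * np.pi)
h = comps_tilde / comps_tilde.sum()

def jensen_lower_bound(mus):
    """sum_m h_m log(alpha_m N(x; mu_m, 1) / h_m), with h frozen at Theta-tilde."""
    comps = alphas * np.exp(-0.5 * (x - mus) ** 2) / np.sqrt(2 * np.pi)
    return np.sum(h * (np.log(comps) - np.log(h)))

# Tangential contact at Theta-tilde, and a valid global lower bound elsewhere.
assert abs(jensen_lower_bound(mu_tilde) - log_sum(x, mu_tilde, alphas)) < 1e-9
for mus in [np.array([0.0, 0.0]), np.array([-2.0, 3.0]), np.array([1.0, 1.0])]:
    assert jensen_lower_bound(mus) <= log_sum(x, mus, alphas) + 1e-12
```

Maximizing this fixed-h bound over the parameters and then recomputing h is one EM iteration; the reverse-Jensen construction that follows supplies the matching upper bound.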
\nThere is an additive constant term C and the positive scalar h_m terms (the responsibilities) are given by the terms in the square brackets (here, brackets are for grouping terms and are not operators). These quantities are relatively straightforward to compute. We only require local evaluations of log-sum values at the current Θ̃ to compute a global lower bound. If we bound all log-sums in the log-likelihood, we have a lower bound on the objective l which we can maximize easily. Iterating maximization and lower bound computation at the new Θ̃ produces a local maximum of log-likelihood as in EM. However, applying Jensen on log-sums in l_c and L(X|Θ) is not as straightforward. Some terms in these expressions involve negative log-sums and so Jensen is actually solving for an upper bound on those terms. If we want overall lower and upper bounds on l_c and L(X|Θ), we need to compute reverse-Jensen bounds. \n\n5 Reverse-Jensen Bounds \n\nIt seems strange that we can reverse Jensen (i.e. f(E{X}) ≤ E{f(X)}) but it is possible. We need to exploit the convexity of the K functions in the e-family instead of exploiting the concavity of f = log. However, not only does the reverse-bound have to upper-bound the log-sum, it should also have the same form as the Jensen-bound above, i.e. a sum of log-exponential family terms. That way, upper and lower bounds can be combined homogeneously and ML tools can be quickly adapted to the new bounds. We thus need: \n\nlog Σ_m α_m exp(A_m(X_m) + X_m^T Θ_m − K_m(Θ_m)) ≤ Σ_m −w_m (Y_m^T Θ_m − K_m(Θ_m)) + k   (2) \n\nHere, we give the parameters of the bound directly; refer to the proof at the end of the paper for their algebraic derivation. This bound again makes tangential contact at Θ̃ yet is an upper bound on the log-sum5. \n\nk = log p(X|Θ̃) + Σ_m w_m (Y_m^T Θ̃_m − K_m(Θ̃_m)) \nY_m = (h_m / w_m) (∂K(Θ_m)/∂Θ_m |_{Θ̃_m} − X_m) + ∂K(Θ_m)/∂Θ_m |_{Θ̃_m} \nw_m = min w_m' such that (h_m / w_m') (∂K(Θ_m)/∂Θ_m |_{Θ̃_m} − X_m) + ∂K(Θ_m)/∂Θ_m |_{Θ̃_m} ∈ ∂K(Θ_m)/∂Θ_m and such that the Loewner ordering constraint on w_m' derived in the proof holds \n\nThis bound effectively reweights (w_m) and translates (Y_m) incomplete data to obtain complete data. Tighter bounds are possible (i.e. smaller w_m) which also depend on the h_m terms (see web page). The first condition requires that the w_m' generate a valid Y_m that lives in the gradient space of the K functions (a typical e-family constraint). Thus, from local computations of the log-sum's values, gradients and Hessians at the current Θ̃, we can compute global upper bounds. \n\n6 Applications and Results \n\nIn Fig. 2 we plot the bounds for a two-component unidimensional Gaussian mixture model case and a two-component binomial (unidimensional multinomial) mixture model. The Jensen-type bounds as well as the reverse-Jensen bounds are shown at various configurations of Θ and X. Jensen bounds are usually tighter but this is inevitable due to the intrinsic shape of the log-sum. In addition to viewing many such 2D visualizations, we computed higher dimensional bounds and sampled them extensively, empirically verifying that the reverse-Jensen bound remained above the log-sum. Below we describe practical uses of this new reverse-bound. \n\n5We can also find multinomial bounds on the α priors jointly with the Θ parameters. \n\n[Figure 2 here: surface plots over (Θ₁, Θ₂) for (a) the Gaussian case and (b) the multinomial case.] \n\nFigure 2: Jensen (black) and reverse-Jensen (white) bounds on the log-sum (gray). \n\n6.1 Conditional Maximum Likelihood \n\nThe inequalities above were used to fully lower bound l_c, and the bound was maximized iteratively. This is like the CEM algorithm [6] except the new bounds handle the whole e-family (i.e. generalized CEM). The synthetic Gaussian mixture model problem portrayed in Fig. 1 was implemented. 
Both ML and CML estimators (with reverse-bounds) were initialized in the same random configuration and maximized. The Gaussians converged as in Fig. 1. CML classification accuracy was 93% while ML obtained 59%. Figure (A) depicts the convergence of l_c per iteration under CML (top line) and ML (bottom line). Similarly, we computed multinomial models for 3-class data of 60 base-pair protein chains in Figure (B). \n\n[Figures (A) and (B) here: l_c per iteration for the Gaussian and multinomial experiments.] \n\nComputationally, utilizing both Jensen and reverse-Jensen bounds for optimizing CML needs double the processing of ML using EM. For example, we estimated 2 classes of mixtures of multinomials (5-way mixture) from 40 10-dimensional data points. In non-optimized Matlab code, ML took 0.57 seconds per epoch while CML took 1.27 seconds due to extra bound computations. Thus, efficiency is close to EM for practical problems. Complexity per epoch roughly scales linearly with sample size, dimensions and number of latent variables. \n\n6.2 Conditional Variational Bayesian Inference \n\nIn [9], Bayesian integration methods were demonstrated on latent-variable models by invoking Jensen-type lower bounds on the integrals of interest. A similar technique can be used to approximate conditional Bayesian integration. Traditionally, we compute the joint Bayesian integral from (X, Y) data as p(X, Y) = ∫ p(X, Y|Θ) p(Θ|X, Y) dΘ and condition it to obtain p(Y|X)^j (the superscript indicates we initially estimated a joint density). Alternatively, we can compute the conditional Bayesian integral directly. The corresponding dependency graphs (Fig. 3 (b) and (c)) depict the differences between joint and conditional estimation. The conditional Bayesian integral exploits the graph's factorization to solve p(Y|X)^c. 
\n\np(YIX )c = f p(YIX ,ElC)[p (El clx ,Y )]dElc= f p(YIX,ElC) [ P(YI;~yjC1)(0\") l dElc \n\nJensen and reverse-Jensen bound the terms to permit analytic integration. Iterating this \nprocess efficiently converges to an approximation of the true integral. We also exhaustively \nsolved both Bayesian integrals exactly for a 2 Gaussian mixture model on 4 data points. \nFig. 3 shows the data and densities. \nIn Fig. 3(d) joint and conditional estimates are \ninconsistent under Bayesian integration (i.e. P(Y IX )C -j. P(Y IX )j). \n\n~ \n~ \n\n7YY \n\n. ~pIYlx/ \nfP1;'x( \n~gral. \n\nIn~ral. \n\nIX' Y~YIX} \n\nCondition \n\n(a) Data \n\n(b) Conditioned Joint \n\n(c) Direct Conditional \n\n(d) Inconsistency \n\nFigure 3: Conditioned Joint and Conditional Bayesian Estimates \n\n6.3 Maximum Entropy Discrimination \n\nRecently, Maximum Entropy Discrimination (MED) was proposed as an alternative criterion \nfor estimating discriminative exponential densities [4] [5] and was shown to subsume SVMs. \nThe technique integrates over discriminant functions like Eq. 1 but this is intractable under \nlatent variable situations. However, if Jensen and reverse-Jensen bounds are used, the \nrequired computations can be done. This permits iterative MED solutions to obtain large \nmargin mixture models and mixtures of SVMs (see web page). \n\n7 Discussion \n\nWe derived and proved an upper bound on the log-sum of e-farnily distributions that acts \nas the reverse of the Jensen lower bound. This tool has applications in conditional and \ndiscriminative learning for latent variable models. For further results, extensions, etc. see: \nhttp://www.media.mit.edu/ ~jebara/bounds. \n\n8 Proof \n\nStarting from Eq. 2, we directly compute k and Ym by ensuring the variational bound \nmakes tangential contact with the log-sum at e (i.e. making their value and gradients \nequal). Substituting k and Y minto Eq. 
2, we get constraints on w_m via Bregman distances. \n\nDefine F_m(Θ_m) = K(Θ_m) − K(Θ̃_m) − (Θ_m − Θ̃_m)^T K'(Θ̃_m). The F functions are convex and have a minimum (which is zero) at Θ̃_m. Replace the K functions with F: \n\nΣ_m w_m F_m(Θ_m) ≥ log [ Σ_m exp{D_m + Θ_m^T Z_m − F_m(Θ_m)} / Σ_m exp{D_m + Θ̃_m^T Z_m − F_m(Θ̃_m)} ] − Σ_m h_m (Θ_m − Θ̃_m)^T Z_m \n\nHere, D_m are constants and Z_m := X_m − K'(Θ̃_m). Next, define a mapping from these bowl-shaped functions to quadratics: \n\nF_m(Θ_m) = G_m(Φ_m) = ½ (Φ_m − Φ̃_m)^T (Φ_m − Φ̃_m) \n\nThis permits us to rewrite Eq. 2 in terms of Φ: \n\nΣ_m w_m G(Φ_m) ≥ log [ Σ_m exp{D_m + Θ_m(Φ_m)^T Z_m − G(Φ_m)} / Σ_m exp{D_m + Θ̃_m^T Z_m − G(Φ̃_m)} ] − Σ_m h_m (Θ_m(Φ_m) − Θ̃_m)^T Z_m   (3) \n\nLet us find properties of the mapping F = G. Take 2nd derivatives over Φ_m: \n\nK''(Θ_m) (∂Θ_m/∂Φ_m) (∂Θ_m/∂Φ_m)^T + (K'(Θ_m) − K'(Θ̃_m)) ∂²Θ_m/∂Φ_m² = I \n\nSetting Θ_m = Θ̃_m above, we get the following for a family of such mappings: ∂Θ_m/∂Φ_m |_{Θ̃_m} = [K''(Θ̃_m)]^{−1/2}. In an e-family, we can always find a Θ_m* such that X_m = K'(Θ_m*). By convexity of F we create a linear lower bound at Θ_m*: \n\nF(Θ_m*) + (Θ_m − Θ_m*)^T ∂F(Θ_m)/∂Θ_m |_{Θ_m*} ≤ F(Θ_m) = G(Φ_m) \n\nTake 2nd derivatives over Φ_m: F'(Θ_m*) ∂²Θ_m/∂Φ_m² ≤ I, which is rewritten as: Z_m ∂²Θ_m/∂Φ_m² ≤ I. \n\nIn Eq. 3, D_m + Θ_m(Φ_m)^T Z_m − G(Φ_m) is always concave since its Hessian is Z_m ∂²Θ_m/∂Φ_m² − I, which is negative. So, we upper bound these terms by a variational linear bound at Θ̃_m: \n\nΣ_m w_m G(Φ_m) ≥ log [ Σ_m exp{D_m + Φ_m^T [K''(Θ̃_m)]^{−1/2} Z_m} / Σ_m exp{D_m + Θ̃_m^T Z_m − G(Φ̃_m)} ] − Σ_m h_m (Θ_m(Φ_m) − Θ̃_m)^T Z_m \n\nTake 2nd derivatives of both sides with respect to each Φ_m to obtain (after simplifications): \n\nw_m I ≥ Z_m K''(Θ̃_m)^{−1} Z_m^T − h_m Z_m ∂²Θ_m/∂Φ_m² \n\nIf we invoke the constraint on w_m'', we can replace −h_m Z_m ∂²Θ_m/∂Φ_m² ≤ w_m'' I. Manipulating, we get the constraint on w_m (as a Loewner ordering here), guaranteeing a global upper bound. □ \n\n9 Acknowledgments \n\nThe authors thank T. Minka, T. Jaakkola and K. Popat for valuable discussions. \n\nReferences \n[1] Buntine, W. (1994). 
Operations for learning with graphical models. JAIR 2, 1994. \n[2] Dempster, A.P., Laird, N.M. and Rubin, D.B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, B39. \n[3] Gopalakrishnan, P.S., Kanevsky, D., Nadas, A. and Nahamoo, D. (1991). An inequality for rational functions with applications to some statistical estimation problems. IEEE Trans. Information Theory, pp. 107-113, Jan. 1991. \n[4] Jaakkola, T., Meila, M. and Jebara, T. (1999). Maximum entropy discrimination. NIPS 12. \n[5] Jebara, T. and Jaakkola, T. (2000). Feature selection and dualities in maximum entropy discrimination. UAI 2000. \n[6] Jebara, T. and Pentland, A. (1998). Maximum conditional likelihood via bound maximization and the CEM algorithm. NIPS 11. \n[7] Jordan, M., Ghahramani, Z., Jaakkola, T. and Saul, L. (1997). An introduction to variational methods for graphical models. Learning in Graphical Models, Kluwer Academic. \n[8] Pecaric, J.E., Proschan, F. and Tong, Y.L. (1992). Convex Functions, Partial Orderings, and Statistical Applications. Academic Press. \n[9] Ghahramani, Z. and Beal, M. (1999). Variational inference for Bayesian mixtures of factor analysers. NIPS 12. \n", "award": [], "sourceid": 1879, "authors": [{"given_name": "Tony", "family_name": "Jebara", "institution": null}, {"given_name": "Alex", "family_name": "Pentland", "institution": null}]}