{"title": "Non-conjugate Variational Message Passing for Multinomial and Binary Regression", "book": "Advances in Neural Information Processing Systems", "page_first": 1701, "page_last": 1709, "abstract": "Variational Message Passing (VMP) is an algorithmic implementation of the Variational Bayes (VB) method which applies only in the special case of conjugate exponential family models. We propose an extension to VMP, which we refer to as Non-conjugate Variational Message Passing (NCVMP) which aims to alleviate this restriction while maintaining modularity, allowing choice in how expectations are calculated, and integrating into an existing message-passing framework: Infer.NET. We demonstrate NCVMP on logistic binary and multinomial regression. In the multinomial case we introduce a novel variational bound for the softmax factor which is tighter than other commonly used bounds whilst maintaining computational tractability.", "full_text": "Non-conjugate Variational Message Passing for Multinomial and Binary Regression\n\nDavid A. Knowles\nDepartment of Engineering\nUniversity of Cambridge\n\nThomas P. Minka\nMicrosoft Research\nCambridge, UK\n\nAbstract\n\nVariational Message Passing (VMP) is an algorithmic implementation of the Variational Bayes (VB) method which applies only in the special case of conjugate exponential family models. We propose an extension to VMP, which we refer to as Non-conjugate Variational Message Passing (NCVMP) which aims to alleviate this restriction while maintaining modularity, allowing choice in how expectations are calculated, and integrating into an existing message-passing framework: Infer.NET. We demonstrate NCVMP on logistic binary and multinomial regression. 
In the multinomial case we introduce a novel variational bound for the softmax factor which is tighter than other commonly used bounds whilst maintaining computational tractability.\n\n1 Introduction\n\nVariational Message Passing [20] is a message passing implementation of the mean-field approximation [1, 2], also known as variational Bayes (VB). Although Expectation Propagation [12] can have more desirable properties as a result of the particular Kullback-Leibler divergence that is minimised, VMP is more stable than EP under certain circumstances, such as multi-modality in the posterior distribution.\nUnfortunately, VMP is effectively limited to conjugate-exponential models since otherwise the messages become exponentially more complex at each iteration. In conjugate exponential models this is avoided due to the closure of exponential family distributions under multiplication. There are many non-conjugate problems which arise in Bayesian statistics, for example logistic regression or learning the hyperparameters of a Dirichlet.\nPrevious work extending Variational Bayes to non-conjugate models has focused on two aspects. The first is how to fit the variational parameters when the VB free form updates are not viable. Various authors have used standard numerical optimization techniques [15, 17, 3], or adapted such methods to be more suitable for this problem [7, 8]. A disadvantage of this approach is that the convenient and efficient message-passing formulation is lost.\nThe second line of work applying VB to non-conjugate models involves deriving lower bounds to approximate the expectations [9, 18, 5, 10, 11] required to calculate the KL divergence. We contribute to this line of work by proposing and evaluating a new bound for the useful softmax factor, which is tighter than other commonly used bounds whilst maintaining computational tractability. 
We also demonstrate, in agreement with [19] and [14], that for univariate expectations such as required for logistic regression, carefully designed quadrature methods can be effective.\nExisting methods typically represent a compromise on modularity or performance. To maintain modularity one is effectively constrained to use exponential family bounds (e.g. quadratic in the Gaussian case [9, 5]) which we will show often gives sub-optimal performance. Methods which use more general bounds, e.g. [3], must then resort to numerical optimisation, and sacrifice modularity. This is a particular disadvantage for an inference framework such as Infer.NET [13] where we want to allow modular construction of inference algorithms from arbitrary deterministic and stochastic factors. We propose a novel message passing algorithm, which we call Non-conjugate Variational Message Passing (NCVMP), which generalises VMP and gives a recipe for calculating messages out of any factor. NCVMP gives much greater freedom in how expectations are taken (using bounds or quadrature) so that performance can be maintained along with modularity.\nThe outline of the paper is as follows. In Sections 2 and 3 we briefly review VB and VMP. Section 4 is the main contribution of the paper: Non-conjugate VMP. Section 5 describes the binary logistic and multinomial softmax regression models, and implementation options with and without NCVMP. Results on synthetic and standard UCI datasets are given in Section 6 and some conclusions are drawn in Section 7.\n\n2 Mean-field approximation\n\nOur aim is to approximate some model $p(x)$, represented as a factor graph $p(x) = \\prod_a f_a(x_a)$ where factor $f_a$ is a function of all $x \\in x_a$. The mean-field approximation assumes a fully-factorised variational posterior $q(x) = \\prod_i q_i(x_i)$ where $q_i(x_i)$ is an approximation to the marginal distribution of $x_i$ (note however $x_i$ might be vector valued, e.g. with multivariate normal $q_i$). The variational approximation $q(x)$ is chosen to minimise the Kullback-Leibler divergence $\\mathrm{KL}(q\\|p)$, given by\n\n$$\\mathrm{KL}(q\\|p) = \\int q(x) \\log \\frac{q(x)}{p(x)} \\, dx = -H[q(x)] - \\int q(x) \\log p(x) \\, dx, \\tag{1}$$\n\nwhere $H[q(x)] = -\\int q(x) \\log q(x) \\, dx$ is the entropy. It can be shown [1] that if the functions $q_i(x_i)$ are unconstrained then minimising this functional can be achieved by coordinate descent, setting $q_i(x_i) = \\exp \\langle \\log p(x) \\rangle_{\\neg q_i(x_i)}$, iteratively for each $i$, where $\\langle \\cdot \\rangle_{\\neg q_i(x_i)}$ implies marginalisation of all variables except $x_i$.\n\n3 Variational Message Passing on factor graphs\n\nVMP is an efficient algorithmic implementation of the mean-field approximation which leverages the fact that the mean-field updates only require local operations on the factor graph. The variational distribution $q(x)$ factorises into approximate factors $\\tilde{f}_a(x_a)$. As a result of the fully factorised approximation, the approximate factors themselves factorise into messages, i.e. $\\tilde{f}_a(x_a) = \\prod_{x_i \\in x_a} m_{a \\to i}(x_i)$ where the message from factor $a$ to variable $i$ is $m_{a \\to i}(x_i) = \\exp \\langle \\log f_a(x_a) \\rangle_{\\neg q_i(x_i)}$. The message from variable $i$ to factor $a$ is the current variational posterior of $x_i$, denoted $q_i(x_i)$, i.e. $m_{i \\to a}(x_i) = q_i(x_i) = \\prod_{a \\in N(i)} m_{a \\to i}(x_i)$ where $N(i)$ are the factors connected to variable $i$.\nFor conjugate-exponential models the messages to a particular variable $x_i$ will all be in the same exponential family. Thus calculating $q_i(x_i)$ simply involves summing sufficient statistics. If, however, our model is not conjugate-exponential, there will be a variable $x_i$ which receives incoming messages which are in different exponential families, or which are not even exponential family distributions at all. Thus $q_i(x_i)$ will be some more complex distribution. 
Computing the required expectations becomes more involved, and worse still the complexity of the messages (e.g. the number of possible modes) grows exponentially per iteration.\n\n4 Non-conjugate Variational Message Passing\n\nIn this section we give some criteria under which the algorithm was conceived. We set up required notation and describe the algorithm, and prove some important properties. Finally we give some intuition about what the algorithm is doing. The approach we take aims to fulfill certain criteria:\n\n1. provides a recipe for any factor\n2. reduces to standard VMP in the case of conjugate exponential factors\n3. allows modular implementation and combining of deterministic and stochastic factors\n\nNCVMP ensures the gradients of the approximate KL divergence implied by the message match the gradients of the true KL. This means that we will have a fixed point at the correct point in parameter space: the algorithm will be at a fixed point if the gradient of the KL is zero.\nWe use the following notation: variable $x_i$ has current variational posterior $q_i(x_i; \\theta_i)$, where $\\theta_i$ is the vector of natural parameters of the exponential family distribution $q_i$. Each factor $f_a$ which is a neighbour of $x_i$ sends a message $m_{a \\to i}(x_i; \\phi_{a \\to i})$ to $x_i$, where $m_{a \\to i}$ is in the same exponential family as $q_i$, i.e. $m_{a \\to i}(x_i; \\phi) = \\exp(\\phi^T u(x_i) - \\kappa(\\phi))$ and $q_i(x_i; \\theta) = \\exp(\\theta^T u(x_i) - \\kappa(\\theta))$ where $u(\\cdot)$ are sufficient statistics, and $\\kappa(\\cdot)$ is the log partition function. We define $C(\\theta)$ as the Hessian of $\\kappa(\\cdot)$ evaluated at $\\theta$, i.e. $C_{ij}(\\theta) = \\frac{\\partial^2 \\kappa(\\theta)}{\\partial \\theta_i \\partial \\theta_j}$. It is straightforward to show that $C(\\theta) = \\mathrm{cov}(u(x) | \\theta)$ so if the exponential family $q_i$ is identifiable, $C$ will be symmetric positive definite, and therefore invertible. The factor $f_a$ contributes a term $S_a(\\theta_i) = \\int q_i(x_i; \\theta_i) \\langle \\log f_a(x) \\rangle_{\\neg q_i(x_i)} \\, dx_i$ to the KL divergence, where we have only made the dependence on $\\theta_i$ explicit: this term is also a function of the variational parameters of the other variables neighbouring $f_a$. With this notation in place we are now able to describe the NCVMP algorithm.\n\nAlgorithm 1 Non-conjugate Variational Message Passing\n1: Initialise all variables to uniform $\\theta_i := 0 \\; \\forall i$\n2: while not converged do\n3:   for all variables $i$ do\n4:     for all neighbouring factors $a \\in N(i)$ do\n5:       $\\phi_{a \\to i} := C(\\theta_i)^{-1} \\frac{\\partial S_a(\\theta_i)}{\\partial \\theta_i}$\n6:     end for\n7:     $\\theta_i := \\sum_{a \\in N(i)} \\phi_{a \\to i}$\n8:   end for\n9: end while\n\nTo motivate Algorithm 1 we give a rough proof that we will have a fixed point at the correct point in parameter space: the algorithm will be at a fixed point if the gradient of the KL divergence is zero.\nTheorem 1. Algorithm 1 has a fixed point at $\\{\\theta_i : \\forall i\\}$ if and only if $\\{\\theta_i : \\forall i\\}$ is a stationary point of the KL divergence $\\mathrm{KL}(q\\|p)$.\n\nProof. Firstly define the function\n\n$$\\tilde{S}_a(\\theta; \\phi) := \\int q_i(x_i; \\theta) \\log m_{a \\to i}(x_i; \\phi) \\, dx_i, \\tag{2}$$\n\nwhich is an approximation to the function $S_a(\\theta)$. Since $q_i$ and $m_{a \\to i}$ belong to the same exponential family we can simplify as follows,\n\n$$\\tilde{S}_a(\\theta; \\phi) = \\int q_i(x_i; \\theta) (\\phi^T u(x_i) - \\kappa(\\phi)) \\, dx_i = \\phi^T \\langle u(x_i) \\rangle_\\theta - \\kappa(\\phi) = \\phi^T \\frac{\\partial \\kappa(\\theta)}{\\partial \\theta} - \\kappa(\\phi), \\tag{3}$$\n\nwhere $\\langle \\cdot \\rangle_\\theta$ implies expectation wrt $q_i(x_i; \\theta)$ and we have used the standard property of exponential families that $\\langle u(x_i) \\rangle_\\theta = \\frac{\\partial \\kappa(\\theta)}{\\partial \\theta}$. Taking derivatives wrt $\\theta$ we have $\\frac{\\partial \\tilde{S}_a(\\theta; \\phi)}{\\partial \\theta} = C(\\theta) \\phi$. Now, the update in Algorithm 1, Line 5 for $\\phi_{a \\to i}$ ensures that\n\n$$C(\\theta) \\phi = \\frac{\\partial S_a(\\theta)}{\\partial \\theta} \\iff \\frac{\\partial \\tilde{S}_a(\\theta; \\phi)}{\\partial \\theta} = \\frac{\\partial S_a(\\theta)}{\\partial \\theta}. \\tag{4}$$\n\nThus this update ensures that the gradients wrt $\\theta_i$ of $S$ and $\\tilde{S}$ match. The update in Algorithm 1, Line 7 for $\\theta_i$ is minimising an approximate local KL divergence for $x_i$:\n\n$$\\theta_i := \\arg\\min_\\theta \\left( -H[q_i(x_i, \\theta)] - \\sum_{a \\in N(i)} \\tilde{S}_a(\\theta; \\phi_{a \\to i}) \\right) = \\sum_{a \\in N(i)} \\phi_{a \\to i} \\tag{5}$$\n\nwhere $H[\\cdot]$ is the entropy. If and only if we are at a fixed point of the algorithm, we will have\n\n$$\\frac{\\partial}{\\partial \\theta_i} \\left( -H[q_i(x_i, \\theta_i)] - \\sum_{a \\in N(i)} \\tilde{S}_a(\\theta_i; \\phi_{a \\to i}) \\right) = -\\frac{\\partial H[q_i(x_i, \\theta_i)]}{\\partial \\theta_i} - \\sum_{a \\in N(i)} \\frac{\\partial \\tilde{S}_a(\\theta_i; \\phi_{a \\to i})}{\\partial \\theta_i} = 0$$\n\nfor all variables $i$. By (4), if and only if we are at a fixed point (so that $\\theta_i$ has not changed since updating $\\phi$) we have\n\n$$-\\frac{\\partial H[q_i(x_i, \\theta_i)]}{\\partial \\theta_i} - \\sum_{a \\in N(i)} \\frac{\\partial S_a(\\theta_i)}{\\partial \\theta_i} = \\frac{\\partial \\mathrm{KL}(q\\|p)}{\\partial \\theta_i} = 0 \\tag{6}$$\n\nfor all variables $i$.\n\nTheorem 1 showed that if NCVMP converges to a fixed point then it is at a stationary point of the KL divergence $\\mathrm{KL}(q\\|p)$. In practice this point will be a minimum because any maximum would represent an unstable equilibrium. However, unlike VMP we have no guarantee to decrease $\\mathrm{KL}(q\\|p)$ at every step, and indeed we do sometimes encounter convergence problems which require damping to fix: see Section 7. 
Theorem 1 also gives some intuition about what NCVMP is doing. $\\tilde{S}_a$ is a conjugate approximation to the true $S_a$ function, chosen to have the correct gradients at the current $\\theta_i$. The update at variable $x_i$ for $\\theta_i$ combines all these approximations from factors involving $x_i$ to get an approximation to the local KL, and then moves $\\theta_i$ to the minimum of this approximation.\nAnother important property of Non-conjugate VMP is that it reduces to standard VMP for conjugate factors.\nTheorem 2. If $\\langle \\log f_a(x) \\rangle_{\\neg q_i(x_i)}$ as a function of $x_i$ can be written $\\mu^T u(x_i) - c$ where $c$ is a constant, then the NCVMP message $m_{a \\to i}(x_i, \\phi_{a \\to i})$ will be the standard VMP message $m_{a \\to i}(x_i, \\mu)$.\nProof. To see this note that $\\langle \\log f_a(x) \\rangle_{\\neg q_i(x_i)} = \\mu^T u(x_i) - c \\Rightarrow S_a(\\theta) = \\mu^T \\langle u(x_i) \\rangle_\\theta - c$, where $\\mu$ is the expected natural statistic under the messages from the variables connected to $f_a$ other than $x_i$. We have $S_a(\\theta) = \\mu^T \\frac{\\partial \\kappa(\\theta)}{\\partial \\theta} - c \\Rightarrow \\frac{\\partial S_a(\\theta)}{\\partial \\theta} = C(\\theta) \\mu$ so from Algorithm 1, Line 5 we have $\\phi_{a \\to i} := C(\\theta)^{-1} \\frac{\\partial S_a(\\theta)}{\\partial \\theta} = C(\\theta)^{-1} C(\\theta) \\mu = \\mu$, the standard VMP message.\n\nThe update for $\\theta_i$ in Algorithm 1, Line 7 is the same as for VMP, and Theorem 2 shows that for conjugate factors the messages sent to the variables are the same as for VMP. Thus NCVMP is a generalisation of VMP.\nNCVMP can alternatively be derived by assuming the incoming messages to $x_i$ are fixed apart from $m_{a \\to i}(x_i; \\phi)$ and calculating a fixed point update for $m_{a \\to i}(x_i; \\phi)$. Gradient matching for NCVMP can be seen as analogous to moment matching in EP. 
Due to space limitations we defer the details to the supplementary material.\n\n4.1 Gaussian variational distribution\n\nHere we describe the NCVMP updates for a Gaussian variational distribution $q(x) = N(x; m, v)$ and approximate factor $\\tilde{f}(x; m_f, v_f)$. Although these can be derived from the generic formula using natural parameters it is mathematically more convenient to use the mean and variance (NCVMP is parameterisation invariant so it is valid to do this).\n\n$$\\frac{1}{v_f} = -2 \\frac{dS(m, v)}{dv}, \\qquad \\frac{m_f}{v_f} = \\frac{m}{v_f} + \\frac{dS(m, v)}{dm}. \\tag{7}$$\n\n5 Logistic regression models\n\nWe illustrate NCVMP on Bayesian binary and multinomial logistic regression. The regression part of the model is standard: $g_{kn} = \\sum_{d=1}^D W_{kd} X_{dn} + m_k$ where $g$ is the auxiliary variable, $W$ is a matrix of weights with standard normal prior, $X$ is the design matrix and $m$ is a per class mean, which is also given a standard normal prior. For binary regression we just have $k = 1$, and the observation model is $p(y = 1 | g_{1n}) = \\sigma(g_{1n})$ where $\\sigma(x) = 1/(1 + e^{-x})$ is the logistic function. In the multinomial case $p(y = k | g_{:n}) = \\sigma_k(g_{:n})$ where $\\sigma_k(x) = \\frac{e^{x_k}}{\\sum_l e^{x_l}}$ is the \u201csoftmax\u201d function. The VMP messages for the regression part of the model are standard so we omit the details due to space limitations.\n\n5.1 Binary logistic regression\n\nFor logistic regression we require the following factor: $f(s, x) = \\sigma(x)^s (1 - \\sigma(x))^{1-s}$ where we assume $s$ is observed. The log factor is $sx - \\log(1 + e^x)$. There are two problems: we cannot analytically compute expectations wrt $x$, and we need to optimise the variational parameters. [9] propose the \u201cquadratic\u201d bound on the integrand\n\n$$\\sigma(x) \\geq \\tilde{\\sigma}(x, t) = \\sigma(t) \\exp\\left( (x - t)/2 - \\frac{\\lambda(t)}{2} (x^2 - t^2) \\right), \\tag{8}$$\n\nwhere $\\lambda(t) = \\frac{\\tanh(t/2)}{2t} = \\frac{\\sigma(t) - 1/2}{t}$. It is straightforward to analytically optimise $t$ to make the bound as tight as possible. The bound is conjugate to a Gaussian, but its performance can be poor. An alternative proposed in [18] is to bound the integral:\n\n$$\\langle \\log f(x) \\rangle_q \\geq sm - \\frac{1}{2} a^2 v - \\log(1 + e^{m + (1 - 2a)v/2}), \\tag{9}$$\n\nwhere $m, v$ are the mean and variance of $q(x)$ and $a$ is a variational parameter which can be optimised using the fixed point iteration $a := \\sigma(m + (1 - 2a)v/2)$. We refer to this as the \u201ctilted\u201d bound. This bound is not conjugate to a Gaussian, but we can calculate the NCVMP message, which has parameters: $\\frac{1}{v_f} = a(1 - a)$, $\\frac{m_f}{v_f} = \\frac{m}{v_f} + s - a$, where we have assumed $a$ has been optimised. A final possibility is to use quadrature to calculate the gradients of $S(m, v)$ directly. The NCVMP message then has parameters $\\frac{1}{v_f} = \\frac{\\langle x \\sigma(x) \\rangle_q - m \\langle \\sigma(x) \\rangle_q}{v}$, $\\frac{m_f}{v_f} = \\frac{m}{v_f} + s - \\langle \\sigma(x) \\rangle_q$. The univariate expectations $\\langle \\sigma(x) \\rangle_q$ and $\\langle x \\sigma(x) \\rangle_q$ can be efficiently computed using Gauss-Hermite or Clenshaw-Curtis quadrature.\n\n5.2 Multinomial softmax regression\n\nConsider the softmax factor $f(x, p) = \\prod_{k=1}^K \\delta(p_k - \\sigma_k(x))$, where $x_k$ are real valued and $p$ is a probability vector with current Dirichlet variational posterior $q(p) = \\mathrm{Dir}(p; d)$. We can integrate out $p$ to give the log factor $\\log f(x) = \\sum_{k=1}^K (d_k - 1) x_k - (d_\\cdot - K) \\log \\sum_l e^{x_l}$ where we define $d_\\cdot := \\sum_{k=1}^K d_k$. Let the incoming message from $x$ be $q(x) = \\prod_{k=1}^K N(x_k; m_k, v_k)$. How should we deal with the $\\log \\sum_l e^{x_l}$ term? The approach used by [3] is a linear Taylor expansion of the log, which is accurate for small variances $v$:\n\n$$\\langle \\log \\sum_i e^{x_i} \\rangle \\leq \\log \\sum_i \\langle e^{x_i} \\rangle = \\log \\sum_i e^{m_i + v_i/2}, \\tag{10}$$\n\nwhich we refer to as the \u201clog\u201d bound. The messages are still not conjugate, so some numerical method must still be used to learn $m$ and $v$: while [3] used LBFGS we will use NCVMP. Another bound was proposed by [5]:\n\n$$\\log \\sum_{k=1}^K e^{x_k} \\leq a + \\sum_{k=1}^K \\log(1 + e^{x_k - a}), \\tag{11}$$\n\nwhere $a$ is a new variational parameter. Combining with (8) we get the \u201cquadratic bound\u201d on the integrand, with $K + 1$ variational parameters. This has conjugate updates, so modularity can be achieved without NCVMP, but as we will see, results are often poor. [5] derives coordinate ascent fixed point updates to optimise $a$, but reducing to a univariate optimisation in $a$ and using Newton\u2019s method is much faster (see supplementary material).\nInspired by the univariate \u201ctilted\u201d bound in Equation 9 we propose the multivariate tilted bound:\n\n$$\\langle \\log \\sum_i e^{x_i} \\rangle \\leq \\frac{1}{2} \\sum_j a_j^2 v_j + \\log \\sum_i e^{m_i + (1 - 2a_i) v_i / 2}. \\tag{12}$$\n\nSetting $a_k = 0$ for all $k$ we recover Equation 10 (hence this is the \u201ctilted\u201d version). Maximisation with respect to $a$ can be achieved by the fixed point update (see supplementary material): $a := \\sigma\\left[ m + \\frac{1}{2} (1 - 2a) \\cdot v \\right]$. This is an $O(K)$ operation since the denominator of the softmax function is shared. For the softmax factor quadrature is not viable because of the high dimensionality of the integrals. From Equation 7 the NCVMP messages using the tilted bound have natural parameters $\\frac{1}{v_{kf}} = (d_\\cdot - K) a_k (1 - a_k)$, $\\frac{m_{kf}}{v_{kf}} = \\frac{m_k}{v_{kf}} + d_k - 1 - (d_\\cdot - K) a_k$ where we have assumed $a$ has been optimised. 
As an alternative we suggest choosing whether to send the message resulting from the quadratic bound or tilted bound depending on which is currently the tightest, referred to as the \u201cadaptive\u201d method. Finally we consider a simple Taylor series expansion of the integrand around the mean of $x$, denoted \u201cTaylor\u201d, and the multivariate quadratic bound of [4], denoted \u201cBohning\u201d (see the Supplementary material for details).\n\n6 Results\n\nHere we aim to present the typical compromise between performance and modularity that NCVMP addresses. We will see that for both binary logistic and multinomial softmax models achieving conjugate updates by being constrained to quadratic bounds is sub-optimal, in terms of estimates of variational parameters, marginal likelihood estimation, and predictive performance. NCVMP gives the freedom to choose a wider class of bounds, or even use efficient quadrature methods in the univariate case, while maintaining simplicity and modularity.\n\n6.1 The logistic factor\n\nWe first test the logistic factor methods of Section 5.1 at the task of estimating the toy model $\\sigma(x) \\pi(x)$ with varying Gaussian prior $\\pi(x)$ (see Figure 1(a)). We calculate the true mean and variance using quadrature. The quadratic bound has the largest errors for the posterior mean, and the posterior variance is severely underestimated. In contrast, NCVMP using quadrature, while being slightly more computationally costly, approximates the posterior much more accurately: the error here is due only to the VB approximation. Using the tilted bound with NCVMP gives more robust estimates of the variance than the quadratic bound as the prior mean changes. However, both the quadratic and tilted bounds underestimate the variance as the prior variance increases.\n\n(a) Posterior mean and variance estimates of $\\sigma(x) \\pi(x)$ with varying prior $\\pi(x)$. 
Left: varying the prior mean with fixed prior variance $v = 10$. Right: varying the prior variance with fixed prior mean $m = 0$.\n\n(b) Log likelihood of the true regression coefficients under the approximate posterior for 10 synthetic logistic regression datasets.\n\nFigure 1: Logistic regression experiments. [Plot axes in (a): error in posterior mean and error in posterior variance against prior mean and prior variance; legend: NCVMP quad, NCVMP tilted, VMP quadratic.]\n\n6.2 Binary logistic regression\n\nWe generated ten synthetic logistic regression datasets with $N = 30$ data points and $P = 8$ covariates. We evaluated the results in terms of the log likelihood of the true regression coefficients under the approximate posterior, a measure which penalises poorly estimated posterior variances. Figure 1(b) compares the performance of non-conjugate VMP using quadrature and VMP using the quadratic bound. For four of the ten datasets the quadratic bound finds very poor solutions. Non-conjugate VMP finds a better solution in seven out of the ten datasets, and there is marginal difference in the other three. Non-conjugate VMP (with no damping) also converges faster in general, although some oscillation is seen for one of the datasets.\n\n6.3 Softmax bounds\n\nTo have some idea how the various bounds for the softmax integral $E_q[\\log \\sum_{k=1}^K e^{x_k}]$ compare empirically we calculated relative absolute error on 100 random distributions $q(x) = \\prod_k N(x_k; m_k, v)$. We sample $m_k \\sim N(0, u)$. When not being varied, $K = 10$, $u = 1$, $v = 1$. Ground truth was calculated using $10^5$ Monte Carlo samples. 
We vary the number of classes, $K$, the distribution variance $v$ and spread of the means $u$. Results are shown in Figure 2. As expected the tilted bound (12) dominates the log bound (10), since it is a generalisation. As $K$ is increased the relative error made using the quadratic bound increases, whereas both the log and the tilted bound get tighter. In agreement with [5] we find the strength of the quadratic bound (11) is in the high variance case, and Bohning\u2019s bound [4] is very loose under all conditions. Both the log and tilted bound are extremely accurate for variances $v < 1$. In fact, the log and tilted bounds are asymptotically optimal as $v \\to 0$. \u201cTaylor\u201d gives accurate results but is not a bound, so convergence is not guaranteed and the global bound on the marginal likelihood is lost. The spread of the means $u$ does not have much of an effect on the tightness of these bounds. These results show that even when quadrature is not an option, much tighter bounds can be found if the constraint of requiring quadratic bounds imposed by VMP is relaxed. For the remainder of the paper we consider only the quadratic, log and tilted bounds.\n\nFigure 2: $\\log_{10}$ of the relative absolute error approximating $E \\log \\sum \\exp$, averaged over 100 runs.\n\n6.4 Multinomial softmax regression\n\nSynthetic data. For synthetic data sampled from the generative model we know the ground truth coefficients and can control characteristics of the data. We first investigate the performance with sample size $N$, with fixed number of features $P = 6$, classes $K = 4$, and no noise (apart from the inherent noise of the softmax function). As expected our ability to recover the ground truth regression coefficients improves with increasing $N$ (see Figure 3(a), left). However, we see that the methods using the tilted bound perform best, closely followed by the log bound. 
Although the quadratic bound has comparable performance for small $N < 200$ it performs poorly with larger $N$ due to its weakness at small variances. The choice of bound impacts the speed of convergence (see Figure 3(a), right). Although the log bound performed almost as well as the tilted bound at recovering coefficients, it takes many more iterations to converge. The extra flexibility of the tilted bound allows faster convergence, analogous to parameter expansion [16]. For small $N$ the tilted bound, log bound and adaptive method converge rapidly, but as $N$ increases the quadratic bound starts to converge much more slowly, as do the tilted and adaptive methods to a lesser extent. \u201cAdaptive\u201d converges fastest because the quadratic bound gives good initial updates at high variance, and the tilted bound takes over once the variance decreases. We vary the level of noise in the synthetic data, fixing $N = 200$, in Figure 3(b). For all but very large noise values the tilted bound performs best.\n\n[Figure 2 plot axes: log(relative abs error) against classes $K$, input variance $v$ and mean variance $u$; legend: quadratic, log, tilted, Bohning, Taylor.]\n\n(a) Varying sample size\n\n(b) Varying noise level\n\nFigure 3: Left: root mean squared error of inferred regression coefficients. Right: iterations to convergence. Results are shown as quartiles on 16 random synthetic datasets. All the bounds except \u201cquadratic\u201d were fit using NCVMP.\n\nIris | Quadratic | Probit | Tilted | Adaptive\nMarginal likelihood | \u221265 \u00b1 3.5 | \u221237.3 \u00b1 0.79 | \u221231.2 \u00b1 2 | \u221231.2 \u00b1 2\nPredictive likelihood | \u22120.216 \u00b1 0.07 | \u22120.215 \u00b1 0.034 | \u22120.201 \u00b1 0.039 | \u22120.201 \u00b1 0.039\nPredictive error | 0.0892 \u00b1 0.039 | 0.0592 \u00b1 0.03 | 0.065 \u00b1 0.038 | 0.0642 \u00b1 0.037\n\nGlass | Quadratic | Probit | Tilted | Adaptive\nMarginal likelihood | \u2212319 \u00b1 5.6 | \u2212201 \u00b1 2.6 | \u2212193 \u00b1 5.4 | \u2212193 \u00b1 3.9\nPredictive likelihood | \u22120.58 \u00b1 0.12 | \u22120.503 \u00b1 0.095 | \u22120.531 \u00b1 0.1 | \u22120.542 \u00b1 0.11\nPredictive error | 0.197 \u00b1 0.032 | 0.195 \u00b1 0.035 | 0.200 \u00b1 0.032 | 0.200 \u00b1 0.032\n\nThyroid | Quadratic | Probit | Tilted | Adaptive\nMarginal likelihood | \u22121814 \u00b1 43 | \u2212840 \u00b1 18 | \u2212916 \u00b1 31 | \u2212909 \u00b1 30\nPredictive likelihood (values in extracted order; column assignment unclear) | \u22120.114 \u00b1 0.019 | \u22120.0793 \u00b1 0.014 | \u22120.0753 \u00b1 0.008 | \u22120.0916 \u00b1 0.010\nPredictive error | 0.0241 \u00b1 0.0026 | 0.0276 \u00b1 0.0028 | 0.0226 \u00b1 0.0023 | 0.0225 \u00b1 0.0024\n\nTable 1: Average results and standard deviations on three UCI datasets, based on 16 random 50:50 training-test splits. Adaptive and tilted use NCVMP, quadratic and probit use VMP.\n\nUCI datasets. We test the multinomial regression model on three standard UCI datasets: Iris ($N = 150$, $D = 4$, $K = 3$), Glass ($N = 214$, $D = 8$, $K = 6$) and Thyroid ($N = 7200$, $D = 21$, $K = 3$), see Table 1. Here we have also included \u201cProbit\u201d, corresponding to a Bayesian multinomial probit regression model, estimated using VMP, and similar in setup to [6], except that we use EP to approximate the predictive distribution, rather than sampling. On all three datasets the marginal likelihood calculated using the tilted or adaptive bounds is optimal out of the logistic models (\u201cProbit\u201d has a different underlying model, so differences in marginal likelihood are confounded by the Bayes factor). 
In terms of predictive performance the quadratic bound seems to be slightly worse across the datasets, with the performance of the other methods varying between datasets. We did not compare to the log bound since it is dominated by the tilted bound and is considerably slower to converge.\n\n7 Discussion\n\nNCVMP is not guaranteed to converge. Indeed, for some models we have found convergence to be a problem, which can be alleviated by damping: if the NCVMP message is $m_{f \\to i}(x_i)$ then send the message $m_{f \\to i}(x_i)^{1 - \\alpha} m^{\\mathrm{old}}_{f \\to i}(x_i)^{\\alpha}$ where $m^{\\mathrm{old}}_{f \\to i}(x_i)$ was the previous message sent to $i$ and $0 \\leq \\alpha < 1$ is a damping factor. The fixed points of the algorithm remain unchanged.\nWe have introduced Non-conjugate Variational Message Passing, which extends variational Bayes to non-conjugate models while maintaining the convenient message passing framework of VMP and allowing freedom to choose the most accurate available method to approximate required expectations. Deterministic and stochastic factors can be combined in a modular fashion, and conjugate parts of the model can be handled with standard VMP. We have shown NCVMP to be of practical use for fitting Bayesian binary and multinomial logistic models. We derived a new bound for the softmax integral which is tighter than other commonly used bounds, but has variational parameters that are still simple to optimise. 
Tightness of the bound is valuable both in terms of better approximating the posterior and giving a closer approximation to the marginal likelihood, which may be of interest for model selection.\n\n[Figure 3 plot axes: RMS error of coefficients and iterations to convergence against sample size $N$ and synthetic noise variance; legend: adaptive, tilted, quadratic, log.]\n\nReferences\n[1] H. Attias. A variational Bayesian framework for graphical models. Advances in Neural Information Processing Systems, 12(1-2):209\u2013215, 2000.\n[2] M. Beal and Z. Ghahramani. Variational Bayesian learning of directed graphical models with hidden variables. Bayesian Analysis, 1(4):793\u2013832, 2006.\n[3] D. Blei and J. Lafferty. A correlated topic model of science. Annals of Applied Statistics, 2007.\n[4] D. Bohning. Multinomial logistic regression algorithm. Annals of the Institute of Statistical Mathematics, 44:197\u2013200, 1992. 10.1007/BF00048682.\n[5] G. Bouchard. Efficient bounds for the softmax and applications to approximate inference in hybrid models. In NIPS workshop on approximate inference in hybrid models, 2007.\n[6] M. Girolami and S. Rogers. Variational Bayesian multinomial probit regression with Gaussian process priors. Neural Computation, 18(8):1790\u20131817, 2006.\n[7] A. Honkela, T. Raiko, M. Kuusela, M. Tornio, and J. Karhunen. Approximate Riemannian conjugate gradient learning for fixed-form variational Bayes. Journal of Machine Learning Research, 11:3235\u20133268, 2010.\n[8] A. Honkela, M. Tornio, T. Raiko, and J. Karhunen. Natural conjugate gradient in variational inference. In M. Ishikawa, K. Doya, H. Miyamoto, and T. Yamakawa, editors, ICONIP (2), volume 4985 of Lecture Notes in Computer Science, pages 305\u2013314. Springer, 2007.\n[9] T. S. 
Jaakkola and M. I. Jordan. A variational approach to Bayesian logistic regression models and their extensions. In International Conference on Artificial Intelligence and Statistics, 1996.\n[10] M. E. Khan, B. M. Marlin, G. Bouchard, and K. P. Murphy. Variational bounds for mixed-data factor analysis. In Advances in Neural Information Processing (NIPS) 23, 2010.\n[11] B. M. Marlin, M. E. Khan, and K. P. Murphy. Piecewise bounds for estimating Bernoulli-logistic latent Gaussian models. In Proceedings of the 28th Annual International Conference on Machine Learning, 2011.\n[12] T. P. Minka. Expectation propagation for approximate Bayesian inference. In Uncertainty in Artificial Intelligence, volume 17, 2001.\n[13] T. P. Minka, J. M. Winn, J. P. Guiver, and D. A. Knowles. Infer.NET 2.4, 2010. Microsoft Research Cambridge. http://research.microsoft.com/infernet.\n[14] H. Nickisch and C. E. Rasmussen. Approximations for binary Gaussian process classification. Journal of Machine Learning Research, 9:2035\u20132078, Oct. 2008.\n[15] M. Opper and C. Archambeau. The variational Gaussian approximation revisited. Neural Computation, 21(3):786\u2013792, 2009.\n[16] Y. A. Qi and T. Jaakkola. Parameter expanded variational Bayesian methods. In B. Sch\u00f6lkopf, J. C. Platt, and T. Hoffman, editors, Advances in Neural Information Processing (NIPS) 19, pages 1097\u20131104. MIT Press, 2006.\n[17] T. Raiko, H. Valpola, M. Harva, and J. Karhunen. Building blocks for variational Bayesian learning of latent variable models. Journal of Machine Learning Research, 8:155\u2013201, 2007.\n[18] L. K. Saul and M. I. Jordan. A mean field learning algorithm for unsupervised neural networks. Learning in graphical models, 1999.\n[19] M. P. Wand, J. T. Ormerod, S. A. Padoan, and R. Fruhwirth. Variational Bayes for elaborate distributions. In Workshop on Recent Advances in Bayesian Computation, 2010.\n[20] J. Winn and C. M. Bishop. 
Variational message passing. Journal of Machine Learning Research, 6(1):661, 2006.", "award": [], "sourceid": 962, "authors": [{"given_name": "David", "family_name": "Knowles", "institution": null}, {"given_name": "Tom", "family_name": "Minka", "institution": null}]}