{"title": "Information Maximization in Noisy Channels : A Variational Approach", "book": "Advances in Neural Information Processing Systems", "page_first": 201, "page_last": 208, "abstract": "", "full_text": "The IM Algorithm : A variational approach to Information Maximization\n\nDavid Barber\n\nFelix Agakov\n\nInstitute for Adaptive and Neural Computation : www.anc.ed.ac.uk\n\nEdinburgh University, EH1 2QL, U.K.\n\nAbstract\n\nThe maximisation of information transmission over noisy channels is a common, albeit generally computationally difficult problem. We approach the difficulty of computing the mutual information for noisy channels by using a variational approximation. The resulting IM algorithm is analogous to the EM algorithm, yet maximises mutual information, as opposed to likelihood. We apply the method to several practical examples, including linear compression, population encoding and CDMA.\n\n1 Introduction\n\nThe reliable communication of information over noisy channels is a widespread issue, ranging from the construction of good error-correcting codes to feature extraction[3, 12]. In a neural context, maximal information transmission has been extensively studied and proposed as a principal goal of sensory processing[2, 5, 7]. The central quantity in this context is the Mutual Information (MI) which, for source variables (inputs) x and response variables (outputs) y, is\n\nI(x, y) \equiv H(y) - H(y|x),   (1)\n\nwhere H(y) \equiv -\langle \log p(y) \rangle_{p(y)} and H(y|x) \equiv -\langle \log p(y|x) \rangle_{p(x,y)} are the marginal and conditional entropies respectively, and angled brackets represent averages. The goal is to adjust the parameters of the mapping p(y|x) to maximise I(x, y). Despite the simplicity of the statement, the MI is generally intractable for all but special cases. 
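As a concrete instance of (1), the MI of a small discrete channel can be computed exactly by enumerating the marginal p(y) and the conditional p(y|x). The following sketch is not from the paper; the binary symmetric channel, the flip probability, and all names are illustrative assumptions:

```python
import numpy as np

def entropy(p):
    """Entropy in bits of a distribution given as an array of probabilities."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Assumed toy setup: uniform binary source through a binary symmetric channel
# with flip probability f.
p_x = np.array([0.5, 0.5])                # source distribution p(x)
f = 0.1
p_y_given_x = np.array([[1 - f, f],       # rows index x, columns index y
                        [f, 1 - f]])

p_y = p_x @ p_y_given_x                   # marginal p(y) (the mixture)
H_y = entropy(p_y)
H_y_given_x = np.sum(p_x * np.array([entropy(row) for row in p_y_given_x]))
I_xy = H_y - H_y_given_x                  # I(x, y) = H(y) - H(y|x), eq. (1)
```

For this channel the exact answer is the classical 1 - H2(f); the point of the sketch is only that (1) is a difference of two entropies, the first of which involves the mixture p(y).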
The key difficulty lies in the computation of the entropy of p(y) (a mixture).\n\nOne such tractable special case is if the mapping y = g(x; \Theta) is deterministic and invertible, for which the difficult entropy term trivially becomes\n\nH(y) = \langle \log |J| \rangle_{p(y)} + const.   (2)\n\nHere J = \{\partial y_i / \partial x_j\} is the Jacobian of the mapping. For non-Gaussian sources p(x), and special choices of g(x; \Theta), the maximisation of (1) with respect to the parameters \Theta leads to the infomax formulation of ICA[4].\n\nAnother tractable special case is if the source distribution p(x) is Gaussian and the mapping p(y|x) is Gaussian.\n\nFigure 1: An illustration of the form of a more general mixture decoder. x represents the sources or inputs, which are (stochastically) encoded as y. A receiver decodes y (possibly with the aid of auxiliary variables z).\n\nHowever, in general, approximations of the MI need to be considered. A variety of methods have been proposed. In neural coding, a popular alternative is to maximise the Fisher \u2018Information\u2019[5]. Other approaches use different objective criteria, such as average reconstruction error.\n\n2 Variational Lower Bound on Mutual Information\n\nSince the MI is a measure of information transmission, our central aim is to maximise a lower bound on the MI. Using the symmetry of the MI, an equivalent formulation is I(x, y) = H(x) - H(x|y). Since we shall generally be interested in optimising the MI with respect to the parameters of p(y|x), and p(x) is simply the data distribution, we need to bound H(x|y) suitably. The Kullback-Leibler bound \sum_x p(x|y) \log p(x|y) - \sum_x p(x|y) \log q(x|y) \geq 0 gives\n\nI(x, y) \geq H(x) + \langle \log q(x|y) \rangle_{p(x,y)} \stackrel{def}{=} \tilde{I}(x, y),   (3)\n\nin which H(x) plays the role of an \u201centropy\u201d term and \langle \log q(x|y) \rangle_{p(x,y)} of an \u201cenergy\u201d term, and where q(x|y) is an arbitrary variational distribution. 
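The bound (3) can be checked numerically on a channel whose MI is known in closed form. The sketch below is not from the paper: the scalar linear Gaussian channel, the sample size, and the decoder parameterisation are all assumptions. It estimates \tilde{I} = H(x) + \langle \log q(x|y) \rangle_{p(x,y)} by Monte Carlo, with the bound becoming tight when q(x|y) equals the exact posterior:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy channel: x ~ N(0,1), y = x + sigma * noise.
sigma = 0.5
n = 200_000
x = rng.normal(size=n)
y = x + sigma * rng.normal(size=n)

def bound(u, s2):
    """Monte Carlo estimate of I~ = H(x) + <log q(x|y)> for q(x|y) = N(u*y, s2)."""
    H_x = 0.5 * np.log(2 * np.pi * np.e)          # entropy of N(0,1), in nats
    log_q = -0.5 * np.log(2 * np.pi * s2) - (x - u * y) ** 2 / (2 * s2)
    return H_x + log_q.mean()

I_exact = 0.5 * np.log(1 + 1 / sigma**2)          # true MI of this channel
u_star = 1 / (1 + sigma**2)                       # exact posterior mean weight
s2_star = sigma**2 / (1 + sigma**2)               # exact posterior variance
tight = bound(u_star, s2_star)   # q = p(x|y): bound meets the MI (up to MC error)
loose = bound(0.5, 0.5)          # mismatched decoder: strictly smaller bound
```

Any decoder q(x|y) gives a valid lower bound; optimising over q closes the gap, which is exactly the second step of the IM algorithm described below in the paper.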
The bound is exact if q(x|y) \equiv p(x|y). The form of this bound is convenient since it explicitly includes both the encoder p(y|x) and decoder q(x|y), see fig(1).\n\nCertainly other well known lower bounds on the MI may be considered [6] and a future comparison of these different approaches would be interesting. However, our current experience suggests that the bound considered above is particularly computationally convenient. Since the bound is based on the KL divergence, it is equivalent to a moment matching approximation of p(x|y) by q(x|y). This fact is highly beneficial in terms of decoding, since mode matching approaches, such as mean-field theory, typically get trapped in one of many sub-optimal local minima. More successful decoding algorithms approximate the posterior mean[10].\n\nThe IM algorithm\n\nTo maximise the MI with respect to any parameters \theta of p(y|x, \theta), we aim to push up the lower bound (3). First one needs to choose a class Q of variational distributions q(x|y) for which the energy term is tractable. Then a natural recursive procedure for maximising \tilde{I}(X, Y) for given p(x) is:\n\n1. For fixed q(x|y), find \theta^{new} = arg max_\theta \tilde{I}(X, Y).\n\n2. For fixed \theta, find q^{new}(x|y) = arg max_{q(x|y) \in Q} \tilde{I}(X, Y).\n\nThese steps are iterated until convergence. This procedure is analogous to the (G)EM algorithm which maximises a lower bound on the likelihood[9]. The difference is simply in the form of the \u201cenergy\u201d term.\n\nNote that if |y| is large, the posterior p(x|y) will typically be sharply peaked around its mode. This would motivate a simple approximation q(x|y) to the posterior,\n\nFigure 2: The MI optimal linear projection of data x (dots) is not always given by PCA. PCA projects data onto the vertical line, for which the entropy conditional on the projection H(x|y) is large. 
Optimally, we should project onto the horizontal line, for which the conditional entropy is zero.\n\nsignificantly reducing the computational complexity of optimization. In the case of real-valued x, a natural choice in the large |y| limit is to use a Gaussian. A simple approximation would then be to use a Laplace approximation to p(x|y), with covariance elements [\Sigma^{-1}]_{ij} = -\partial^2 \log p(x|y) / \partial x_i \partial x_j. Inserted in the bound, this then gives a form reminiscent of the Fisher Information[5]. The bound presented here is arguably more general and appropriate than that presented in [5] since, whilst it also tends to the exact value of the MI in the limit of a large number of responses, it is a principled bound for any response dimension.\n\nRelation to Conditional Likelihood\n\nConsider an autoencoder x \to y \to \tilde{x} and imagine that we wish to maximise the probability that the reconstruction \tilde{x} is in the same state s as x:\n\n\log p(\tilde{x} = s | x = s) = \log \int_y p(\tilde{x} = s | y) p(y | x = s) \geq \langle \log p(\tilde{x} = s | y) \rangle_{p(y|x=s)},\n\nthe inequality following from Jensen. Averaging this over all the states of x:\n\n\sum_s p(x = s) \log p(\tilde{x} = s | x = s) \geq \sum_s \langle \log p(\tilde{x} = s | y) \rangle_{p(x=s,y)} \equiv \langle \log q(x|y) \rangle_{p(x,y)}.\n\nHence, maximising \tilde{I}(X, Y) (for fixed p(x)) is the same as maximising the lower bound on the probability of a correct reconstruction. This is a reassuring property of the lower bound. Even though we do not directly maximise the MI, we also indirectly maximise the probability of a correct reconstruction \u2013 a form of autoencoder.\n\nGeneralisation to Mixture Decoders\n\nA straightforward application of Jensen's inequality leads to the more general result\n\nI(X, Y) \geq H(X) + \langle \log q(x|y, z) \rangle_{p(y|x)p(x)q(z)} \equiv \tilde{I}(X, Y),\n\nwhere q(x|y, z) and q(z) are variational distributions. The aim is to choose q(x|y, z) such that the bound is tractably computable. 
The structure is illustrated in fig(1).\n\n3 Linear Gaussian Channel : Improving on PCA\n\nA common theme in linear compression and feature extraction is to map a (high dimensional) vector x to a (lower dimensional) vector y = Wx such that the information in the vector x is maximally preserved in y. The classical solution to this problem (which also minimises the linear reconstruction error) is given by PCA. However, as demonstrated in fig(2), the optimal setting for W is, in general, not given by the widely used PCA.\n\nTo see how we might improve on the PCA approach, we consider optimising our bound with respect to linear mappings. We take as our projection (encoder) model p(y|x) \sim N(Wx, s^2 I), with isotropic Gaussian noise. The empirical distribution is simply p(x) \propto \sum_{\mu=1}^P \delta(x - x^\mu), where P is the number of datapoints. Without loss of generality, we assume the data is zero mean. For a decoder q(x|y) = N(m(y), \Sigma(y)), maximising the bound on the MI is equivalent to minimising\n\n\sum_{\mu=1}^P \langle (x^\mu - m(y))^T \Sigma^{-1}(y) (x^\mu - m(y)) + \log \det \Sigma(y) \rangle_{p(y|x^\mu)}.\n\nFor constant diagonal matrices \Sigma(y), this reduces to minimal mean square reconstruction error autoencoder training in the limit s^2 \to 0. This clarifies why autoencoders (and hence PCA) are a sub-optimal special case of MI maximisation.\n\nLinear Gaussian Decoder\n\nA simple decoder is given by q(x|y) \sim N(Uy, \sigma^2 I), for which\n\n\tilde{I}(x, y) \propto 2 tr(UWS) - tr(U M U^T),   (4)\n\nwhere S = \langle x x^T \rangle = \sum_\mu x^\mu (x^\mu)^T / P is the sample covariance of the data, and\n\nM = s^2 I + W S W^T   (5)\n\nis the covariance of the mixture distribution p(y). Optimization of (4) for U leads to S W^T = U M. Eliminating U, this gives\n\n\tilde{I}(x, y) \propto tr(S W^T M^{-1} W S).   (6)\n\nIn the zero noise limit, optimisation of (6) produces PCA. 
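In the zero-noise limit the objective (6) becomes tr(S W^T (W S W^T)^{-1} W S), whose maximum over rank-|y| projections is attained by the top principal directions, with value equal to the sum of the leading eigenvalues of S. A minimal numerical check of this claim (not from the paper; the toy covariance, dimensions, and names are assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed toy covariance S and the zero-noise objective tr(S W^T (W S W^T)^-1 W S).
A = rng.normal(size=(5, 5))
S = A @ A.T                                   # positive definite "sample covariance"
k = 2                                         # projection dimension |y|

def objective(W, S):
    M = W @ S @ W.T                           # covariance of p(y) at s^2 = 0
    return np.trace(S @ W.T @ np.linalg.inv(M) @ W @ S)

eigvals, eigvecs = np.linalg.eigh(S)          # eigenvalues in ascending order
W_pca = eigvecs[:, -k:].T                     # top-k principal directions
W_rand = rng.normal(size=(k, 5))              # an arbitrary full-rank projection

obj_pca = objective(W_pca, S)                 # equals sum of the top-k eigenvalues
obj_rand = objective(W_rand, S)               # no larger than the PCA value
```

Note that the objective is invariant to invertible reparameterisations of the rows of W, which is why only the subspace spanned by W matters in this limit.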
For noisy channels, unconstrained optimization of (6) leads to a divergence of the matrix norm \|W W^T\|_\infty; a norm-constrained optimisation in general produces a different result to PCA. The simplicity of the linear decoder in this case severely limits any potential improvement over PCA, and certainly would not resolve the issue in fig(2). For this, a non-linear decoder q(x|y) is required, for which the integrals become more complex.\n\nNon-linear Encoders and Kernel PCA\n\nAn alternative to using non-linear decoders to improve on PCA is to use a non-linear encoder. A useful choice is\n\np(y|x) = N(W \Phi(x), \sigma^2 I),\n\nwhere \Phi(x) is in general a high dimensional, non-linear embedding function, for which W will be non-square. In the zero-noise limit the optimal solution for the encoder results in non-linear PCA on the covariance \langle \Phi(x) \Phi(x)^T \rangle of the transformed data. By Mercer's theorem, the elements of the covariance matrix may be replaced by a Kernel function of the user's choice[8]. An advantage of our framework is that our bound enables the principled comparison of embedding functions/kernels.\n\n4 Binary Responses (Neural Coding)\n\nIn a neurobiological context, a popular issue is how to encode real-valued stimuli in a population of spiking neurons. Here we look briefly at a simple case in which each neuron fires (y_i = 1) with increasing probability the further the membrane potential w_i^T x is above the threshold -b_i. Independent neural firing suggests:\n\np(y|x) = \prod_i p(y_i|x) \stackrel{def}{=} \prod_i \sigma(y_i (w_i^T x + b_i)).   (7)\n\nFigure 3: Top row: a subset of the original real-valued source data. Middle row: after training, 20 samples from each of the 7 output units, for each of the corresponding source inputs. Bottom row: Reconstruction of the source data from 50 samples of the output units. 
Note that while the 8th and the 10th patterns have closely matching stochastic binary representations, they differ in the firing rates of unit 5. This results in a visibly larger bottom loop of the 8th reconstructed pattern, which agrees with the original source data. Also, the thick vertical 1 (pattern 3) differs from the thin vertical eight (pattern 6) due to the differences in stochastic firings of the third and the seventh units.\n\nHere the response variables y \in \{-1, +1\}^{|y|}, and \sigma(a) \stackrel{def}{=} 1/(1 + e^{-a}). For the decoder, we chose a simple linear Gaussian q(x|y) \sim N(Uy, \Sigma). In this case, exact evaluation of the bound (3) is straightforward, since it only involves computations of the second-order moments of y over the factorized distribution.\n\nA reasonable reconstruction of the source x^* from its representation y will be given by the mean \tilde{x} = \langle x \rangle_{q(x|y)} of the learned approximate posterior. In noisy channels we need to average over multiple possible representations, i.e. \tilde{x} = \langle \langle x \rangle_{q(x|y)} \rangle_{p(y|x^*)}.\n\nWe performed reconstruction of continuous source data from stochastic binary responses for |x| = 196 input and |y| = 7 output units. The bound was optimized with respect to the parameters of p(y|x) and q(x|y), with isotropic norm constraints on W and b, for 30 instances of digits 1 and 8 (15 of each class). The source variables were reconstructed from 50 samples of the corresponding binary representations at the mean of the learned q(x|y), see fig(3).\n\n5 Code Division Multiple Access (CDMA)\n\nIn CDMA[11], a mobile phone user j \in \{1, . . . , M\} wishes to send a bit s_j \in \{0, 1\} of information to a base station. To send s_j = 1, she transmits an N dimensional real-valued vector g^j, which represents a time-discretised waveform (s_j = 0 corresponds to no transmission). 
The simultaneous transmissions from all users result in a received signal at the base station of\n\nr_i = \sum_j g_i^j s_j + \eta_i,   i = 1, . . . , N,   or   r = G s + \eta,\n\nwhere \eta_i is Gaussian noise. Probabilistically, we can write\n\np(r|s) \propto \exp\{-(r - Gs)^2 / (2\sigma^2)\}.\n\nThe task for the base station (which knows G) is to decode the received vector r so that s can be recovered reliably. For simplicity, we assume that N = M so that the matrix G is square. Using Bayes' rule, p(s|r) \propto p(r|s) p(s), and assuming a flat prior on s,\n\np(s|r) \propto \exp\{-(-2 r^T G s + s^T G^T G s) / (2\sigma^2)\}.   (8)\n\nComputing either the MAP solution arg max_s p(s|r) or the MPM solution arg max_{s_j} p(s_j|r), j = 1, . . . , M is, in general, NP-hard.\n\nIf G^T G is diagonal, optimal decoding is easy, since the posterior factorises, with\n\np(s_j|r) \propto \exp\{(2 \sum_i r_i G_{ij} - D_{jj}) s_j / (2\sigma^2)\},\n\nwhere the diagonal matrix D = G^T G (and we used s_i^2 \equiv s_i for s_i \in \{0, 1\}). For suitably randomly chosen matrices G, G^T G will be approximately diagonal in the limit of large N. However, ideally, one would like to construct decoders that perform near-optimal decoding without recourse to the approximate diagonality of G^T G. The MAP decoder solves the problem\n\nmin_{s \in \{0,1\}^N} (s^T G^T G s - 2 s^T G^T r) \equiv min_{s \in \{0,1\}^N} (s - G^{-1} r)^T G^T G (s - G^{-1} r),\n\nand hence the MAP solution is that s which is closest to the vector G^{-1} r. The difficulty lies in the meaning of \u2018closest\u2019 since the space is non-isotropically warped by the matrix G^T G. A useful guess for the decoder is that it is the closest in the Euclidean sense to the vector G^{-1} r. 
This is the so-called decorrelation estimator.\n\nComputing the Mutual Information\n\nOf prime interest in CDMA is the evaluation of decoders in the case of non-orthogonal matrices G[11]. In this respect, a principled comparison of decoders can be obtained by evaluating the corresponding bound on the MI\u00b9,\n\nI(r, s) \equiv H(s) - H(s|r) \geq H(s) + \sum_r \sum_s p(s) p(r|s) \log q(s|r),   (9)\n\nwhere H(s) is trivially given by M (bits). The bound is exact if q(s|r) = p(s|r).\n\nWe make the specific assumption in the following that our decoding algorithm takes the factorised form q(s|r) = \prod_i q(s_i|r) and, without loss of generality, we may write\n\nq(s_i|r) = \sigma((2 s_i - 1) f_i(r))   (10)\n\nfor some decoding function f_i(r). We restrict interest here to the case of simple linear decoding functions\n\nf_i(r) = a_i + \sum_j w_{ij} r_j.\n\nSince p(r|s) is Gaussian, x_i \equiv (2 s_i - 1) f_i(r) is also Gaussian,\n\np(x_i|s) = N(\mu_i(s), var_i),   \mu_i(s) \equiv (2 s_i - 1)(a_i + w_i^T G s),   var_i \equiv \sigma^2 w_i^T w_i,\n\nwhere w_i^T is the ith row of the matrix [W]_{ij} \equiv w_{ij}. Hence\n\n-H(s|r) \geq \sum_i \langle \int_{-\infty}^{\infty} dx \, \frac{1}{\sqrt{2\pi\sigma^2 w_i^T w_i}} [\log \sigma(x)] \, e^{-[x - (2 s_i - 1)(a_i + w_i^T G s)]^2 / (2\sigma^2 w_i^T w_i)} \rangle_{p(s)}.   (11)\n\nIn general, the average over the factorised distribution p(s) can be evaluated by using the Fourier Transform [1]. However, to retain clarity here, we constrain the decoding matrix W so that w_i^T G s = b_i s_i, i.e. W G = diag(b), for a parameter vector b. 
The average over p(s) then gives\n\n-H(s|r) \geq \frac{1}{2} \sum_i \langle [\log \sigma(x)] (1 + e^{-[-2 x b_i - 4 x a_i + 2 a_i b_i + b_i^2] / (2\sigma^2 w_i^T w_i)}) \rangle_{N(x; -a_i, var = \sigma^2 w_i^T w_i)},   (12)\n\na sum of one dimensional integrals, each of which can be evaluated numerically. In the case of an orthogonal matrix G^T G = D the decoding function is optimal and the MI bound is exact with the parameters in (12) set to\n\na_i = -[G^T G]_{ii} / (2\sigma^2),   W = G^T / \sigma^2,   b_i = [G^T G]_{ii} / \sigma^2.\n\n\u00b9Other variational methods may be considered to approximate the normalisation constant of p(s|r)[13], and it would be interesting to look into the possibility of using them in a MI approximation, and also as approximate decoding algorithms.\n\nFigure 4: The bound given by the decoder W \propto G^{-1} (horizontal axis: MI bound for the inverse decoder) plotted against the optimised bound for the same G (vertical axis: MI bound for the constrained optimal decoder), found using 50 updates of conjugate gradients; both axes range from 0 to 1. This was repeated over several trials of randomly chosen matrices G, each of which is square with N = 10 dimensions. For clarity, a small number of poor results (in which the bound is negative) have been omitted. To generate G, form the matrix A_{ij} \sim N(0, 1), and B = A + A^T. From the eigen-decomposition of B, i.e. BE = E\Lambda, form [G]_{ij} = [E\Lambda]_{ij} + 0.1 N(0, 1) (so that G^T G has small off diagonal elements).\n\nOptimising the linear decoder\n\nIn the case that G^T G is non-diagonal, what is the optimal linear decoder? A partial answer is given by numerically optimising the bound from (11). For the constrained case, W G = diag(b), (12) can be used to calculate the bound. 
Using W = diag(b) G^{-1},\n\n\sigma^2 w_i^T w_i = \sigma^2 b_i^2 \sum_j ([G^{-1}]_{ij})^2,\n\nand the bound depends only on a and b. Under this constraint the bound can be numerically optimised as a function of a and b, given the fixed quantities \sum_j ([G^{-1}]_{ij})^2.\n\nAs an alternative we can employ the decorrelation decoder, W = G^{-1}/\sigma^2, with a_i = -1/(2\sigma^2). In fig(4) we see that, according to our bound, the decorrelation (or \u2018inverse\u2019) decoder is suboptimal versus the linear decoder f_i(r) = a_i + w_i^T r with W = diag(b) G^{-1}, optimised over a and b. These initial results are encouraging, and motivate further investigations, for example, using syndrome decoding for CDMA.\n\n6 Posterior Approximations\n\nThere is an interesting relationship between maximising the bound on the MI and computing an optimal estimate q(s|r) of an intractable posterior p(s|r). The optimal bit error solution sets q(s_i|r) to the mean of the exact posterior marginal p(s_i|r). Mean Field Theory approximates the posterior marginal by minimising the KL divergence KL(q||p) = \sum_s (q(s|r) \log q(s|r) - q(s|r) \log p(s|r)), where q(s|r) = \prod_i q(s_i|r). In this case, the KL divergence is tractably computable (up to a negligible constant). However, this form of the KL divergence allows q(s|r) to lock onto any one of a very large number of local modes of the posterior distribution p(s|r). Since the optimal choice is to choose the posterior marginal mean, this is why using Mean Field decoding is generally suboptimal. Alternatively, consider\n\nKL(p||q) = \sum_s (p(s|r) \log p(s|r) - p(s|r) \log q(s|r)) = -\sum_s p(s|r) \log q(s|r) + const.\n\nThis is the correct KL divergence in the sense that, optimally, q(s_i|r) = p(s_i|r), that is, the posterior marginal is correctly calculated. The difficulty lies in performing averages with respect to p(s|r), which are generally intractable. 
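The contrast between the two KL divergences can be made concrete on a toy bimodal distribution: minimising KL(p||q) over factorised q recovers the exact marginals, whereas KL(q||p) scores a single-mode q better than the marginal-matching one. A small illustrative sketch (not from the paper; the two-bit distribution and all names are assumptions):

```python
import numpy as np

# Assumed toy bimodal "posterior" over two bits: mass on (0,0) and (1,1).
eps = 0.01
p = np.array([[0.495, eps / 2],   # p[s1, s2]
              [eps / 2, 0.495]])

def kl(a, b):
    """KL divergence between two distributions given as arrays."""
    return np.sum(a * (np.log(a) - np.log(b)))

def factorised(q1, q2):
    """Factorised q(s1, s2) = q(s1) q(s2) with qi = P(si = 1)."""
    return np.outer([1 - q1, q1], [1 - q2, q2])

# KL(p||q) over factorised q is minimised exactly at the marginals (here 0.5, 0.5):
grid = np.linspace(0.05, 0.95, 181)
best = min((kl(p, factorised(a, b)), a, b) for a in grid for b in grid)

# KL(q||p) instead prefers a single mode: a q concentrated near (0,0)
# beats the marginal-matching q even though its marginals are badly wrong.
kl_qp_marginal = kl(factorised(0.5, 0.5), p)
kl_qp_mode = kl(factorised(0.05, 0.05), p)
```

This is the moment-matching versus mode-matching distinction of Section 2 in miniature: the bound (3), being built on KL(p||q), drives q towards the posterior marginals rather than a single mode.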
Since we will have a distribution p(r), it is reasonable to provide an averaged objective function,\n\n\sum_r \sum_s p(r) p(s|r) \log q(s|r) = \sum_r \sum_s p(s) p(r|s) \log q(s|r).   (13)\n\nWhilst, for any given r, we cannot calculate the best posterior marginal estimate, we may be able to calculate the best posterior marginal estimate on average. This is precisely the case in, for example, CDMA, since the average over p(r|s) is tractable, and the resulting average over p(s) can be well approximated numerically. Any setting in which an average-case objective is desired is of interest for the methods suggested here.\n\n7 Discussion\n\nWe have described a general theoretically justified approach to information maximization in noisy channels. Whilst the bound is straightforward, it appears to have attracted little previous attention as a practical tool for MI optimisation. We have shown how it naturally generalises linear compression and feature extraction. It is a more direct approach to optimal coding than using the Fisher \u2018Information\u2019 in neurobiological population encoding. Our bound enables a principled comparison of different information maximisation algorithms, and may have applications in other areas of machine learning and Information Theory, such as error-correction.\n\n[1] D. Barber, Tractable Approximate Belief Propagation, Advanced Mean Field Methods: Theory and Practice (D. Saad and M. Opper, eds.), MIT Press, 2001.\n\n[2] H. Barlow, Unsupervised Learning, Neural Computation 1 (1989), 295-311.\n\n[3] S. Becker, An Information-theoretic unsupervised learning algorithm for neural networks, Ph.D. thesis, University of Toronto, 1992.\n\n[4] A. J. Bell and T. J. Sejnowski, An information-maximisation approach to blind separation and blind deconvolution, Neural Computation 7 (1995), no. 6, 1004-1034.\n\n[5] N. Brunel and J.-P. 
Nadal, Mutual Information, Fisher Information and Population Coding, Neural Computation 10 (1998), 1731-1757.\n\n[6] T. Jaakkola and M. Jordan, Improving the mean field approximation via the use of mixture distributions, Proceedings of the NATO ASI on Learning in Graphical Models, Kluwer, 1997.\n\n[7] R. Linsker, Deriving Receptive Fields Using an Optimal Encoding Criterion, Advances in Neural Information Processing Systems (S. Hanson, J. Cowan, and L. Giles, eds.), vol. 5, Morgan-Kaufmann, 1993.\n\n[8] S. Mika, B. Schoelkopf, A. J. Smola, K.-R. Mueller, M. Scholz, and G. Raetsch, Kernel PCA and De-Noising in Feature Spaces, Advances in Neural Information Processing Systems 11 (1999).\n\n[9] R. M. Neal and G. E. Hinton, A View of the EM Algorithm That Justifies Incremental, Sparse, and Other Variants, Learning in Graphical Models (M. I. Jordan, ed.), MIT Press, 1998.\n\n[10] D. Saad and M. Opper, Advanced Mean Field Methods: Theory and Practice, MIT Press, 2001.\n\n[11] T. Tanaka, Analysis of Bit Error Probability of Direct-Sequence CDMA Multiuser Demodulators, Advances in Neural Information Processing Systems (T. K. Leen et al., eds.), vol. 13, MIT Press, 2001, pp. 315-321.\n\n[12] K. Torkkola and W. M. Campbell, Mutual Information in Learning Feature Transformations, Proc. 17th International Conf. on Machine Learning (2000).\n\n[13] M. Wainwright, T. Jaakkola, and A. Willsky, A new class of upper bounds on the log partition function, Uncertainty in Artificial Intelligence, 2002.", "award": [], "sourceid": 2410, "authors": [{"given_name": "David", "family_name": "Barber", "institution": null}, {"given_name": "Felix", "family_name": "Agakov", "institution": null}]}