{"title": "Unsupervised Variational Bayesian Learning of Nonlinear Models", "book": "Advances in Neural Information Processing Systems", "page_first": 593, "page_last": 600, "abstract": null, "full_text": "Unsupervised Variational Bayesian\n\nLearning of Nonlinear Models\n\nNeural Networks Research Centre, Helsinki University of Technology\n\nAntti Honkela and Harri Valpola\n\nP.O. Box 5400, FI-02015 HUT, Finland\n\n{Antti.Honkela, Harri.Valpola}@hut.fi\nhttp://www.cis.hut.fi/projects/bayes/\n\nAbstract\n\nIn this paper we present a framework for using multi-layer per-\nceptron (MLP) networks in nonlinear generative models trained\nby variational Bayesian learning. The nonlinearity is handled by\nlinearizing it using a Gauss\u2013Hermite quadrature at the hidden neu-\nrons. This yields an accurate approximation for cases of large pos-\nterior variance. The method can be used to derive nonlinear coun-\nterparts for linear algorithms such as factor analysis, independent\ncomponent/factor analysis and state-space models. This is demon-\nstrated with a nonlinear factor analysis experiment in which even\n20 sources can be estimated from a real world speech data set.\n\n1 Introduction\n\nLinear latent variable models such as factor analysis, principal component analysis\n(PCA) and independent component analysis (ICA) [1] are used in many applications\nranging from engineering to social sciences and psychology. In many of these cases,\nthe e\ufb00ect of the desired factors or sources to the observed data is, however, not\nlinear. A nonlinear model could therefore produce better results.\n\nThe method presented in this paper can be used as a basis for many nonlinear\nlatent variable models, such as nonlinear generalizations of the above models. It is\nbased on the variational Bayesian framework, which provides a solid foundation for\nnonlinear modeling that would otherwise be prone to over\ufb01tting [2]. 
It also allows for easy comparison of different model structures, which is even more important for flexible nonlinear models than for simpler linear models.\n\nGeneral nonlinear generative models for data x(t) of the type\n\nx(t) = f(s(t), θ_f) + n(t) = B φ(A s(t) + a) + b + n(t)   (1)\n\noften employ a multi-layer perceptron (MLP), as in the equation, or a radial basis function (RBF) network to model the nonlinearity. Here s(t) are the latent variables of the model, n(t) is noise and θ_f are the parameters of the nonlinearity, in the case of an MLP the weight matrices A, B and bias vectors a, b. In the context of variational Bayesian methods, RBF networks seem to be the more popular of the two because it is easier to evaluate analytic expressions and bounds for certain key quantities [3]. With MLP networks such values are not as easily available, and one usually has to resort to numerical approximations. Nevertheless, MLP networks can often, especially for nearly linear models and in high dimensional spaces, provide an equally good model with fewer parameters [4]. This is important with generative models whose latent variables are independent or at least uncorrelated and the intrinsic dimensionality of the input is large. A reasonable approximate bound for a good model is also often better than a strict bound for a bad model.\n\nMost existing applications of variational Bayesian methods for nonlinear models are concerned with the supervised case where the inputs of the network are known and only the weights have to be learned [3, 5]. This case is easier, as there are fewer parameters with associated posterior variance above the nonlinear hidden layer, and the distributions thus tend to be easier to handle.\n\nIn this paper we present a novel method for evaluating the statistics of the outputs of an MLP network in the context of unsupervised variational Bayesian learning of its weights and inputs. 
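As a concrete reading of the generative mapping in Eq. (1), the following minimal sketch (plain Python, our own illustration; the weight values are made up, and tanh is the activation function used later in the paper) computes f(s) = B tanh(A s + a) + b for a single latent vector:

```python
import math

def mlp_output(s, A, a, B, b):
    """f(s) = B tanh(A s + a) + b (Eq. 1), with matrices given as lists of rows."""
    # Hidden-layer inputs y = A s + a.
    y = [sum(Aij * sj for Aij, sj in zip(row, s)) + ai for row, ai in zip(A, a)]
    # Nonlinear activations at the hidden neurons.
    hidden = [math.tanh(yi) for yi in y]
    # Output layer x = B * hidden + b.
    return [sum(Bij * hj for Bij, hj in zip(row, hidden)) + bi
            for row, bi in zip(B, b)]

# Toy dimensions: 2 sources, 3 hidden neurons, 2 observed components.
A = [[0.5, -0.3], [0.1, 0.8], [-0.6, 0.2]]
a = [0.0, 0.1, -0.1]
B = [[1.0, 0.5, -0.2], [0.3, -0.4, 0.9]]
b = [0.2, -0.1]
x = mlp_output([1.0, -1.0], A, a, B, b)
```

In the unsupervised setting of the paper, both the weights (A, a, B, b) and the inputs s(t) carry posterior uncertainty, which is what makes propagating means and variances through this mapping nontrivial.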
The method is demonstrated with a nonlinear factor analysis problem. The new method allows for reliable estimation of a larger number of factors than before [6, 7].\n\n2 Variational learning of unsupervised MLPs\n\nLet us denote the observed data by X = {x(t) | t}, the latent variables of the model by S = {s(t) | t} and the model parameters by θ = (θ_i). The nonlinearity (1) can be used as a building block of many different models, depending on the model assumed for the sources S. A simple Gaussian prior on S leads to a nonlinear factor analysis (NFA) model [6, 7], which is studied here because of its simplicity. The method could easily be extended with a mixture-of-Gaussians prior on S [8] to get a nonlinear independent factor analysis model, but this is omitted here. In many nonlinear blind source separation (BSS) problems it is enough to apply simple NFA followed by linear ICA postprocessing to achieve nonlinear BSS [6, 7]. Another possible extension would be to include dynamics for S as in [9].\n\nIn order to deal with flexible nonlinear models, a powerful learning paradigm resistant to overfitting is needed. The variational Bayesian method of ensemble learning [2] has proven useful here. Ensemble learning is based on approximating the true posterior p(S, θ|X) with a tractable approximation q(S, θ), typically a multivariate Gaussian with a diagonal covariance. The approximation is fitted to minimize the cost\n\nC = ⟨log (q(S, θ) / p(S, θ, X))⟩ = D(q(S, θ) || p(S, θ|X)) − log p(X),   (2)\n\nwhere ⟨·⟩ denotes expectation over q(S, θ) and D(q || p) is the Kullback–Leibler divergence between q and p. As the Kullback–Leibler divergence is always non-negative, C yields an upper bound for − log p(X) and thus a lower bound for the evidence p(X). 
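To make the bound in Eq. (2) concrete, the sketch below evaluates C analytically for a hypothetical one-parameter toy model (our own illustration, not part of the paper): data x ~ N(θ, 1) with prior θ ~ N(0, 1) and a Gaussian approximation q(θ) = N(m, v). At the exact posterior N(x/2, 1/2) the divergence term vanishes and C equals −log p(x); any other q gives a strictly larger cost.

```python
import math

def cost(m, v, x):
    """Ensemble learning cost C = <log q - log p(theta, x)> of Eq. (2) for the
    toy model x ~ N(theta, 1), prior theta ~ N(0, 1), q(theta) = N(m, v)."""
    entropy_term = -0.5 * math.log(2 * math.pi * math.e * v)           # <log q>
    prior_term = 0.5 * math.log(2 * math.pi) + 0.5 * (m * m + v)       # <-log p(theta)>
    lik_term = 0.5 * math.log(2 * math.pi) + 0.5 * ((x - m) ** 2 + v)  # <-log p(x|theta)>
    return entropy_term + prior_term + lik_term

x = 1.3
neg_log_evidence = 0.5 * math.log(2 * math.pi * 2.0) + x * x / 4.0  # -log N(x; 0, 2)
c_optimal = cost(x / 2.0, 0.5, x)  # q set to the exact posterior N(x/2, 1/2)
```

Here `c_optimal` coincides with `neg_log_evidence`, while perturbing m or v away from the posterior increases C, illustrating why minimizing C tightens the evidence bound.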
The cost can be evaluated analytically for a large class of mainly linear models [10, 11], leading to simple and efficient learning algorithms.\n\n2.1 Evaluating the cost\n\nUnfortunately, the cost (2) cannot be evaluated analytically for the nonlinear model (1). Assuming a Gaussian noise model, the likelihood term of C becomes\n\nC_x = ⟨− log p(X|S, θ)⟩ = Σ_t ⟨− log N(x(t); f(s(t), θ_f), Σ_x)⟩.   (3)\n\nThe term C_x depends on the first and second moments of f(s(t), θ_f) over the posterior approximation q(S, θ), and these cannot easily be evaluated analytically. Assuming the noise covariance is diagonal, the cross terms of the covariance of the output are not needed, only the scalar variances of the different components.\n\nIf the activation functions of the MLP network were linear, the output mean and variance could be evaluated exactly using only the means and variances of the inputs s(t) and θ_f. Thus a natural first approximation would be to linearize the network about the input mean using derivatives [6]. Taking the derivative with respect to s(t), for instance, yields\n\n∂f(s(t), θ_f) / ∂s(t) = B diag(φ′(ȳ(t))) A,   (4)\n\nwhere diag(v) denotes a diagonal matrix with the elements of vector v on the main diagonal and ȳ(t) is the mean of y(t) = A s(t) + a. Due to the local nature of the approximation, this can lead to severe underestimation of the variance, especially when the hidden neurons of the MLP network operate in the saturated region. This makes the nonlinear factor analysis algorithm using this approach unstable with a large number of factors, because the posterior variance corresponding to the last factors is typically large.\n\nTo avoid this problem, we propose using a Gauss–Hermite quadrature to evaluate an effective linearization of the nonlinear activation functions φ(y_i(t)). 
The Gauss–Hermite quadrature is a method for approximating weighted integrals\n\n∫_{−∞}^{∞} f(x) exp(−x²) dx ≈ Σ_k w_k f(t_k),   (5)\n\nwhere the weights w_k and abscissas t_k are selected by requiring an exact result for a suitable number of low-order polynomials. This allows evaluating the mean and variance of φ(y_i(t)) by the quadratures\n\nφ̄(y_i(t))_GH = Σ_k w′_k φ(ȳ_i(t) + t′_k √ỹ_i(t)),   (6)\n\nφ̃(y_i(t))_GH = Σ_k w′_k [φ(ȳ_i(t) + t′_k √ỹ_i(t)) − φ̄(y_i(t))_GH]²,   (7)\n\nrespectively. Here the weights w′_k and abscissas t′_k have been scaled to take into account the Gaussian pdf weight instead of exp(−x²), and ȳ_i(t) and ỹ_i(t) are the mean and variance of y_i(t), respectively. We used a three point quadrature that yields accurate enough results but can be evaluated quickly. Using e.g. five points improves the accuracy slightly, but slows the computation down significantly. As both of the quadratures depend on φ at the same points, they can be evaluated together easily.\n\nUsing the approximation formula φ̃(y_i(t)) = φ′(y_i(t))² ỹ_i(t), the resulting mean and variance can be interpreted to yield an effective linearization of φ(y_i(t)) through\n\n⟨φ(y_i(t))⟩ := φ̄(y_i(t))_GH,   ⟨φ′(y_i(t))⟩ := √(φ̃(y_i(t))_GH / ỹ_i(t)).   (8)\n\nThe positive square root is used here because the derivative of the sigmoid used as the activation function is always positive. Using these to linearize the MLP as in Eq. (4), the exact mean and variance of the linearized model can be evaluated in a relatively straightforward manner. Evaluation of the variance due to the sources requires propagating matrices through the network to track the correlations between the hidden units. Hence the computational complexity depends quadratically on the number of sources. 
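The three point quadrature of Eqs. (6)–(7) can be sketched as follows (plain Python; the nodes and weights are the standard three point Gauss–Hermite rule rescaled for a unit-Gaussian weight, and the saturated-input example is our own illustration):

```python
import math

# Three point Gauss-Hermite rule rescaled for a N(mean, var) weight:
# nodes mean + t_k * sqrt(var) with t_k in {0, +sqrt(3), -sqrt(3)} and
# weights {2/3, 1/6, 1/6}; exact for polynomials up to degree 5.
NODES = (0.0, math.sqrt(3.0), -math.sqrt(3.0))
WEIGHTS = (2.0 / 3.0, 1.0 / 6.0, 1.0 / 6.0)

def gh_moments(phi, mean, var):
    """Quadrature estimates (Eqs. 6-7) of the mean and variance of phi(y)
    for y ~ N(mean, var)."""
    std = math.sqrt(var)
    values = [phi(mean + t * std) for t in NODES]  # both moments reuse these
    m = sum(w * v for w, v in zip(WEIGHTS, values))
    s2 = sum(w * (v - m) ** 2 for w, v in zip(WEIGHTS, values))
    return m, s2

# A saturated, high-variance input where a local Taylor linearization fails:
m_gh, v_gh = gh_moments(math.tanh, 2.0, 4.0)
v_taylor = (1.0 - math.tanh(2.0) ** 2) ** 2 * 4.0  # derivative-based variance
```

For this input the derivative-based variance `v_taylor` is over an order of magnitude smaller than the quadrature estimate `v_gh`, which is the underestimation problem discussed above; for a linear phi both rules agree exactly.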
The same problem does not affect the network weights, as each parameter only affects the value of one hidden neuron.\n\n2.2 Details of the approximation\n\nThe mean and variance of φ(y_i(t)) depend on the distribution of y_i(t). The Gauss–Hermite quadrature assumes that y_i(t) is Gaussian. This is not true in our case, as the product of two independent normally distributed variables A_ij and s_j(t) is super-Gaussian, although rather close to Gaussian if the mean of one of the variables is significantly larger in absolute value than its standard deviation. In the case of N sources, the actual input y_i(t) is a sum of N such products and a Gaussian variable, and is therefore rather close to a Gaussian, at least for larger values of N.\n\nIgnoring the non-Gaussianity, the quadrature depends on the mean and variance of y_i(t). These can be evaluated exactly because of the linearity of the mapping as\n\nỹ_{i,tot}(t) = Σ_j ( Ã_ij (s̄_j(t)² + s̃_j(t)) + Ā_ij² s̃_j(t) ) + ã_i,   (9)\n\nwhere θ̄ denotes the mean and θ̃ the variance of θ. Here it is assumed that the posterior approximations q(S) and q(θ_f) have diagonal covariances. Full covariances could be used instead without too much difficulty, if necessary.\n\nIn an experiment investigating the approximation accuracy with a random MLP [12], the Taylor approximation was found to underestimate the output variance by a factor of 400 at worst. The worst case result of the above approximation was underestimation by a factor of 40, which is a great improvement over the Taylor approximation, but still far from perfect. The worst case behavior could be improved to underestimation by a factor of 5 by introducing another quadrature evaluated with a different variance for y_i(t). This change cannot be easily justified except by the fact that it produces better results. 
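The total input variance of Eq. (9) can be written out directly for one hidden neuron. The sketch below (plain Python, illustrative values of our own) also checks one term against the exact variance of a product of independent Gaussians, Var(A s) = Ã(s̄² + s̃) + Ā² s̃:

```python
def input_variance(A_mean, A_var, s_mean, s_var, a_var):
    """Total variance of y_i = sum_j A_ij s_j + a_i (Eq. 9), assuming all
    factors independent with the given means (A_mean, s_mean) and
    variances (A_var, s_var, a_var)."""
    return sum(Av * (sm ** 2 + sv) + Am ** 2 * sv
               for Am, Av, sm, sv in zip(A_mean, A_var, s_mean, s_var)) + a_var

# One-term check: A ~ N(1, 0.1), s ~ N(2, 0.3), no bias variance.
# Exact: Var(A s) = E[A^2] E[s^2] - (E[A] E[s])^2 = 1.1 * 4.3 - 4 = 0.73.
v_total = input_variance([1.0], [0.1], [2.0], [0.3], 0.0)
```

Dropping the Ā² s̃ term from each summand gives the weight-only variance of Eq. (10) used by the second quadrature.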
The difference in behavior of the two methods in more realistic cases is less drastic, but the version with two quadratures seems to provide more accurate approximations.\n\nThe more accurate approximation is implemented by evaluating another quadrature using the variance of y_i(t) originating mainly from θ_f,\n\nỹ_{i,weight}(t) = Σ_j Ã_ij (s̄_j(t)² + s̃_j(t)) + ã_i,   (10)\n\nand using the implied ⟨φ′(y_i(t))⟩ in the evaluation of the effects of these variances. The total variance (9) is still used in the evaluation of the means and in the evaluation of the effects of the variance of s(t).\n\n2.3 Learning algorithm for nonlinear factor analysis\n\nThe nonlinear factor analysis (NFA) model [6] is learned by numerically minimizing the cost C evaluated above. The minimization algorithm is a combination of conjugate gradient for the means of S and θ_f, fixed point iteration for the variances of S and θ_f, and EM-like updates for the other parameters and hyperparameters.\n\nThe fixed point update algorithm for the variances follows from writing the cost function as a sum\n\nC = C_q + C_p = ⟨log q(S, θ)⟩ + ⟨− log p(S, θ, X)⟩.   (11)\n\nA parameter θ_i that is assumed independent of the others under q and has a Gaussian posterior approximation q(θ_i) = N(θ_i; θ̄_i, θ̃_i) only affects the corresponding negentropy term −1/2 log(2πe θ̃_i) in C_q. Differentiating this with respect to θ̃_i and setting the result to zero leads to the fixed point update rule θ̃_i = (2 ∂C_p/∂θ̃_i)⁻¹. In order to get a stable update algorithm for the variances, dampening by halving the step on a log scale until the cost function does not increase must be added to the fixed point updates. 
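The damped fixed point update just described can be sketched as below (plain Python; the toy cost C(v) = v − ½ log v with minimizer v = 1/2 is our own illustration, and the 10 % cap on variance growth is the safeguard described in this section):

```python
import math

def damped_variance_update(var_old, grad_cp, cost, max_inc=1.1):
    """One fixed point step var <- (2 dCp/dvar)^(-1), with the increase capped
    at 10 % and the step halved on a log scale while the cost would grow.
    `cost` maps a candidate variance to C; `grad_cp` is dCp/dvar at var_old."""
    if grad_cp > 0:
        target = min(1.0 / (2.0 * grad_cp), var_old * max_inc)
    else:
        target = var_old * max_inc  # negative gradient: grow a little, never go negative
    step = math.log(target / var_old)
    c0 = cost(var_old)
    while cost(var_old * math.exp(step)) > c0 and abs(step) > 1e-12:
        step *= 0.5  # dampen by halving the step on a log scale
    return var_old * math.exp(step)

# Toy cost C(v) = v - 0.5 log v, exact minimizer v = 0.5; here dCp/dv = 1.
toy_cost = lambda v: v - 0.5 * math.log(v)
v1 = damped_variance_update(0.6, 1.0, toy_cost)  # jumps straight to 0.5
v2 = damped_variance_update(0.3, 1.0, toy_cost)  # capped at 0.3 * 1.1 = 0.33
```

Working on a log scale keeps the variance positive by construction, which is why the damping halves the log-step rather than the raw difference.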
The variance is increased by at most 10 % in one iteration, and it is not set to a negative value even if the gradient is negative.\n\nThe required partial derivatives can be evaluated analytically with simple backpropagation-like computations in the MLP network. The quadratures used at the hidden nodes lead to analytical expressions for the means and variances of the hidden nodes, and the corresponding feedback gradients are easy to derive. Along with the derivatives with respect to the variances, it is easy to evaluate the derivatives with respect to the means of the same parameters. These derivatives can then be used in a conjugate gradient algorithm to update the means of S and θ_f.\n\nDue to the flexibility of the MLP network and the gradient based learning algorithm, the nonlinear factor analysis method is sensitive to initialization. We have used linear PCA to initialize the means of the sources S. The means of the weights θ_f are initialized randomly, while all the variances are initialized to small constant values. After this, the sources are kept fixed for 20 iterations while only the network weights are updated. The hyperparameters governing the noise and parameter distributions are only updated after 80 further iterations of updating the sources and the MLP. By that time, a reasonable model of the data has been learned and the method is not likely to prune away all the sources and other parameters as unnecessary.\n\n2.4 Other approximation methods\n\nAnother way to get a more robust approximation for the statistics of f would be to use the deterministic sampling approach of the unscented transform [13], used subsequently in different unscented algorithms. Unfortunately this approach does not work very well in high dimensional cases. 
The unscented transform also ignores all prior information on the form of the nonlinearity. In the case of the MLP network, everything except the scalar activation functions is known to be linear. All information on the correlations of the variables is also ignored, which leads to a loss of accuracy when the output depends on products of input variables, as in our case. In an experiment on mean and log-variance approximation accuracy with a relatively large random MLP [12], the unscented transform needed over 100 % more time to achieve results with 10 times the mean squared error of the proposed approach.\n\nPart of our problem was also faced by Barber and Bishop in their work on ensemble learning for supervised learning of MLP networks [5]. In their work the inputs s(t) of the network are part of the data and thus have no associated variance. This makes the problem easier, as the inputs y(t) of the hidden neurons are then Gaussian. By using the cumulative Gaussian distribution or the error function erf as the activation function, the means of the outputs of the hidden neurons, and thus of the outputs of the whole network, can be evaluated analytically. The covariances still need to be evaluated numerically, and that is done by evaluating all the correlations of the hidden neurons separately. In a network with H hidden neurons, this requires O(H²) quadrature evaluations.\n\nIn our case the inputs of the hidden neurons are not Gaussian, and hence even the error function as the activation function would not allow for exact evaluation of the means. 
This is why we have decided to use the standard sigmoidal activation function in the form of tanh, which is more common and faster to evaluate numerically. In our approach all the required means and variances can be evaluated with O(H) quadratures.\n\n3 Experiments\n\nThe proposed nonlinear factor analysis method was tested on a natural speech data set consisting of spectrograms of 24 individual words of Finnish speech, spoken by 20 different speakers. The spectra were modified to mimic the reception abilities of the human ear, a standard preprocessing procedure for speech recognition. No speaker or word information was used in learning; the spectrograms of the different words were simply blindly concatenated. The preprocessed data consisted of 2547 30-dimensional spectrogram vectors.\n\nThe data set was tested with two different learning algorithms for the NFA model, one based on the Taylor approximation introduced in [6] and the other based on the proposed approximation. Contrary to [6], the algorithm based on the Taylor approximation used the same conjugate gradient based optimization algorithm as the new approximation. This helped greatly in stabilizing an algorithm that used to be rather unstable with high source dimensionalities, due to the sensitivity of the Taylor approximation in regions where it is not really valid. Both algorithms were tested using 1 to 20 sources, each number with four different random initializations of the MLP network weights. 
The number of hidden neurons in the MLP network was 40. The learning algorithm was run for 2000 iterations.1\n\n[Figure 1: two scatter plots of the proposed cost and the Taylor cost against the reference cost, all in nats / sample, with both axes ranging from 45 to 55.]\n\nFigure 1: The attained values of C in different simulations as evaluated by the different approximations, plotted against reference values evaluated by sampling. The left subfigure shows the values from experiments using the proposed approximation and the right subfigure from experiments using the Taylor approximation.\n\nFig. 1 shows a comparison of the cost function values evaluated by the different approximations and a reference value evaluated by sampling. The reference cost values were evaluated by sampling 400 points from the distribution q(S, θ_f), evaluating f(s, θ_f) at those points, and using the mean and variance of the output points in the cost function evaluation. The accuracy of the procedure was checked by performing the evaluation 100 times for one of the simulations. The standard deviation of the values was 5·10⁻³ nats per sample, which should not show at all in the figures. The unit nat here signifies the use of the natural logarithm in Eq. (2).\n\nThe results in Fig. 
1 show that the proposed approximation yields consistently very reliable estimates of the true cost, although it has a slight tendency to underestimate it. The older Taylor approximation [6] breaks down completely in some cases and reports very small costs even though the true value can be significantly larger.\n\n1 The Matlab code used in the experiments is available at http://www.cis.hut.fi/projects/bayes/software/.\n\n[Figure 2: two line plots of cost function value (nats / sample, from 44 to 56) against the number of sources (1 to 20); the left panel compares the proposed approximation to the reference value, the right panel the Taylor approximation to the reference value.]\n\nFigure 2: The attained value of C in simulations with different numbers of sources. The values shown are the means of 4 simulations with different random initializations. The left subfigure shows the values from experiments using the proposed approximation and the right subfigure from experiments using the Taylor approximation. Both values are compared to reference values evaluated by sampling.\n\nThe situations where the Taylor approximation fails are illustrated in Fig. 2, which shows the attained cost as a function of the number of sources used. The Taylor approximation shows a decrease in cost as the number of sources increases, even though the true cost is increasing rapidly. The behavior of the proposed approximation is much more consistent and qualitatively correct.\n\n4 Discussion\n\nThe problem of estimating the statistics of a nonlinear transform of a probability distribution is also encountered in nonlinear extensions of Kalman filtering. 
The Taylor approximation corresponds to the extended Kalman filter, and the new approximation can be seen as a modification of it with a more accurate linearization. This opens up many new potential applications in time series analysis and elsewhere. The proposed method is somewhat similar to unscented Kalman filtering based on the unscented transform [13], but much better suited for high dimensional MLP-like nonlinearities. This is not very surprising, as the worst case complexity of general Gaussian integration is exponential with respect to the dimensionality of the input [14], and the unscented transform, as a general method with linear complexity, is bound to be less accurate in high dimensional problems. In the case of the MLP, the complexity of the unscented transform depends on the number of all weights, which in our case with 20 sources can be more than 2000.\n\n5 Conclusions\n\nIn this paper we have proposed a novel approximation method for unsupervised MLP networks in variational Bayesian learning. The approximation is based on using numerical Gauss–Hermite quadratures to evaluate the global effect of the nonlinear activation function of the network, producing an effective linearization of the MLP. The statistics of the outputs of the linearized network can be evaluated exactly to get accurate and reliable estimates of the statistics of the MLP outputs. These can be used to evaluate the standard variational Bayesian ensemble learning cost function C and to numerically minimize it using a hybrid fixed point / conjugate gradient algorithm.\n\nWe have demonstrated the method with a nonlinear factor analysis model and a real world speech data set. It was able to reliably estimate all the 20 factors we attempted from the 30-dimensional data set. 
The presented method can be used together with linear ICA for nonlinear BSS [7], and the approximation can be easily applied to more complex models such as nonlinear independent factor analysis [6] and nonlinear state-space models [9].\n\nAcknowledgments\n\nThe authors wish to thank David Barber, Markus Harva, Bert Kappen, Juha Karhunen, Uri Lerner and Tapani Raiko for useful comments and discussions. This work was supported in part by the IST Programme of the European Community, under the PASCAL Network of Excellence, IST-2002-506778. This publication only reflects the authors' views.\n\nReferences\n\n[1] A. Hyvärinen, J. Karhunen, and E. Oja. Independent Component Analysis. J. Wiley, 2001.\n\n[2] G. E. Hinton and D. van Camp. Keeping neural networks simple by minimizing the description length of the weights. In Proc. of the 6th Ann. ACM Conf. on Computational Learning Theory, pp. 5–13, Santa Cruz, CA, USA, 1993.\n\n[3] P. Sykacek and S. Roberts. Adaptive classification by variational Kalman filtering. In Advances in Neural Information Processing Systems 15, pp. 753–760. MIT Press, 2003.\n\n[4] S. Haykin. Neural Networks – A Comprehensive Foundation, 2nd ed. Prentice-Hall, 1999.\n\n[5] D. Barber and C. Bishop. Ensemble learning for multi-layer networks. In Advances in Neural Information Processing Systems 10, pp. 395–401. MIT Press, 1998.\n\n[6] H. Lappalainen and A. Honkela. Bayesian nonlinear independent component analysis by multi-layer perceptrons. In M. Girolami, ed., Advances in Independent Component Analysis, pp. 93–121. Springer-Verlag, Berlin, 2000.\n\n[7] H. Valpola, E. Oja, A. Ilin, A. Honkela, and J. Karhunen. Nonlinear blind source separation by variational Bayesian learning. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, E86-A(3):532–541, 2003.\n\n[8] H. Attias. Independent factor analysis. 
Neural Computation, 11(4):803–851, 1999.\n\n[9] H. Valpola and J. Karhunen. An unsupervised ensemble learning method for nonlinear dynamic state-space models. Neural Computation, 14(11):2647–2692, 2002.\n\n[10] H. Attias. A variational Bayesian framework for graphical models. In Advances in Neural Information Processing Systems 12, pp. 209–215. MIT Press, 2000.\n\n[11] Z. Ghahramani and M. Beal. Propagation algorithms for variational Bayesian learning. In Advances in Neural Information Processing Systems 13, pp. 507–513. MIT Press, 2001.\n\n[12] A. Honkela. Approximating nonlinear transformations of probability distributions for nonlinear independent component analysis. In Proc. 2004 IEEE Int. Joint Conf. on Neural Networks (IJCNN 2004), pp. 2169–2174, Budapest, Hungary, 2004.\n\n[13] S. Julier and J. K. Uhlmann. A general method for approximating nonlinear transformations of probability distributions. Technical report, Robotics Research Group, Department of Engineering Science, University of Oxford, 1996.\n\n[14] F. Curbera. Delayed curse of dimension for Gaussian integration. Journal of Complexity, 16(2):474–506, 2000.\n", "award": [], "sourceid": 2564, "authors": [{"given_name": "Antti", "family_name": "Honkela", "institution": null}, {"given_name": "Harri", "family_name": "Valpola", "institution": null}]}