{"title": "Learning a Hierarchical Belief Network of Independent Factor Analyzers", "book": "Advances in Neural Information Processing Systems", "page_first": 361, "page_last": 367, "abstract": null, "full_text": "Learning a Hierarchical Belief Network of \n\nIndependent Factor Analyzers \n\nH. Attias* \n\nhagai@gatsby.ucl.ac.uk \n\nSloan Center for Theoretical Neurobiology, Box 0444 \n\nUniversity of California at San Francisco \n\nSan Francisco, CA 94143-0444 \n\nAbstract \n\nMany belief networks have been proposed that are composed of \nbinary units. However, for tasks such as object and speech recog(cid:173)\nnition which produce real-valued data, binary network models are \nusually inadequate. Independent component analysis (ICA) learns \na model from real data, but the descriptive power of this model \nis severly limited. We begin by describing the independent factor \nanalysis (IFA) technique, which overcomes some of the limitations \nof ICA. We then create a multilayer network by cascading single(cid:173)\nlayer IFA models. At each level, the IFA network extracts real(cid:173)\nvalued latent variables that are non-linear functions of the input \ndata with a highly adaptive functional form, resulting in a hier(cid:173)\narchical distributed representation of these data. Whereas exact \nmaximum-likelihood learning of the network is intractable, we de(cid:173)\nrive an algorithm that maximizes a lower bound on the likelihood, \nbased on a variational approach. \n\n1 \n\nIntroduction \n\nAn intriguing hypothesis for how the brain represents incoming sensory informa(cid:173)\ntion holds that it constructs a hierarchical probabilistic model of the observed data. \nThe model parameters are learned in an unsupervised manner by maximizing the \nlikelihood that these data are generated by the model. A multilayer belief net(cid:173)\nwork is a realization of such a model. Many belief networks have been proposed \nthat are composed of binary units. 
The hidden units in such networks represent latent variables that explain different features of the data, and whose relation to the data is highly non-linear. However, for tasks such as object and speech recognition which produce real-valued data, the models provided by binary networks are often inadequate. Independent component analysis (ICA) learns a generative model from real data, and extracts real-valued latent variables that are mutually statistically independent. Unfortunately, this model is restricted to a single layer and the latent variables are simple linear functions of the data; hence, underlying degrees of freedom that are non-linear cannot be extracted by ICA. In addition, the requirement of equal numbers of hidden and observed variables and the assumption of noiseless data render the ICA model inappropriate.

*Current address: Gatsby Computational Neuroscience Unit, University College London, 17 Queen Square, London WC1N 3AR, U.K.

This paper begins by introducing the independent factor analysis (IFA) technique. IFA is an extension of ICA that allows different numbers of latent and observed variables and can handle noisy data. The paper proceeds to create a multilayer network by cascading single-layer IFA models. The resulting generative model produces a hierarchical distributed representation of the input data, where the latent variables extracted at each level are non-linear functions of the data with a highly adaptive functional form. Whereas exact maximum-likelihood (ML) learning in this network is intractable due to the difficulty in computing the posterior density over the hidden layers, we present an algorithm that maximizes a lower bound on the likelihood. This algorithm is based on a general variational approach that we develop for the IFA network.

2 Independent Component and Independent Factor Analysis

Although the concept of ICA originated in the field of signal processing, it is actually a density estimation problem. Given an L' x 1 observed data vector y, the task is to explain it in terms of an L x 1 vector x of unobserved 'sources' that are mutually statistically independent. The relation between the two is assumed linear,

y = Hx + u,    (1)

where H is the 'mixing' matrix; the noise vector u is usually assumed zero-mean Gaussian with a covariance matrix Λ. In the context of blind source separation [1]-[4], the source signals x should be recovered from the mixed noisy signals y with no knowledge of H, Λ, or the source densities p(x_i), hence the term 'blind'. In the density estimation approach, one regards (1) as a probabilistic generative model for the observed p(y), with the mixing matrix, noise covariance, and source densities serving as model parameters. In principle, these parameters should be learned by ML, followed by inferring the sources via a MAP estimator.

For Gaussian sources, (1) is the factor analysis model, for which an EM algorithm exists and the MAP estimator is linear. The problem becomes interesting and more difficult for non-Gaussian sources. Most ICA algorithms focus on square (L' = L), noiseless (y = Hx) mixing, and fix p(x_i) using prior knowledge (but see [5] for the case of noisy mixing with a fixed Laplacian source prior). Learning H occurs via gradient-ascent maximization of the likelihood [1]-[4]. Source density parameters can also be adapted in this way [3],[4], but the resulting gradient-ascent learning is rather slow. This state of affairs presented a problem to ICA algorithms, since the ability to learn arbitrary source densities that are not known in advance is crucial: using an inaccurate p(x_i) often leads to a bad H estimate and failed separation.
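As a concrete illustration, the generative model (1) can be sampled directly. The sketch below draws non-Gaussian sources, mixes them, and adds Gaussian noise; the dimensions, the Laplacian source density, and the noise variance are illustrative assumptions, not values from the paper (whose algorithm learns the source densities rather than fixing them).

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: L = 3 sources, L' = 5 observed variables, 1000 samples.
L, Lp, T = 3, 5, 1000

# Mutually independent non-Gaussian sources (Laplacian chosen for illustration).
x = rng.laplace(size=(L, T))

# Mixing matrix H and zero-mean Gaussian noise u with covariance Lambda, as in eq. (1).
H = rng.normal(size=(Lp, L))
Lambda = 0.1 * np.eye(Lp)
u = rng.multivariate_normal(np.zeros(Lp), Lambda, size=T).T

# Observed data: y = Hx + u.
y = H @ x + u
```

In the density-estimation view, H, Lambda, and the source densities would then be treated as unknown parameters to be learned from y alone.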
This problem was recently solved by introducing the IFA technique [6]. IFA employs a semi-parametric model of the source densities, which allows learning them (as well as the mixing matrix) using expectation-maximization (EM). Specifically, p(x_i) is described as a mixture of Gaussians (MOG), where the mixture components are labeled by s = 1, ..., n_i and have means μ_{i,s} and variances ν_{i,s}:

p(x_i) = Σ_s p(s_i = s) G(x_i − μ_{i,s}, ν_{i,s}).¹

The mixing proportions are parametrized using the softmax form: p(s_i = s) = exp(a_{i,s}) / Σ_{s'} exp(a_{i,s'}). Beyond noiseless ICA, an EM algorithm for the noisy case (1) with any L, L' was also derived in [6] using the MOG description.² This algorithm learns a probabilistic model p(y | W) for the observed data, parametrized by W = (H, Λ, {a_{i,s}, μ_{i,s}, ν_{i,s}}). A graphical representation of this model is provided by Fig. 1, if we set n = 1 and y⁰_j = b_{j,s} = v_{j,s} = 0.

3 Hierarchical Independent Factor Analysis

In the following we develop a multilayer generalization of IFA, by cascading duplicates of the generative model introduced in [6]. Each layer n = 1, ..., N is composed of two sublayers: a source sublayer which consists of the units x^n_i, i = 1, ..., L_n, and an output sublayer which consists of y^n_j, j = 1, ..., L'_n. The two are linearly related via y^n = H^n x^n + u^n as in (1); u^n is a Gaussian noise vector with covariance Λ^n. The nth-layer source x^n_i is described by a MOG density model with parameters a^n_{i,s}, μ^n_{i,s}, and ν^n_{i,s}, in analogy to the IFA sources above.

The important step is to determine how layer n depends on the previous layers. We choose to introduce a dependence of the ith source of layer n only on the ith output of layer n − 1. Notice that matching L_n = L'_{n−1} is now required.
This dependence is implemented by making the means and mixing proportions of the Gaussians which compose p(x^n_i) dependent on y^{n−1}_i. Specifically, we make the replacements a^n_{i,s} → a^n_{i,s} + b^n_{i,s} y^{n−1}_i and μ^n_{i,s} → μ^n_{i,s} + v^n_{i,s} y^{n−1}_i. The resulting joint density for layer n, conditioned on layer n − 1, is

p(s^n, x^n, y^n | y^{n−1}, W^n) = Π_{i=1}^{L_n} p(s^n_i | y^{n−1}_i) p(x^n_i | s^n_i, y^{n−1}_i) p(y^n | x^n),    (2)

where W^n are the parameters of layer n and

p(s^n_i = s | y^{n−1}_i) = exp(a^n_{i,s} + b^n_{i,s} y^{n−1}_i) / Σ_{s'} exp(a^n_{i,s'} + b^n_{i,s'} y^{n−1}_i),

p(x^n_i | s^n_i = s, y^{n−1}_i) = G(x^n_i − μ^n_{i,s} − v^n_{i,s} y^{n−1}_i, ν^n_{i,s}).

The full model joint density is given by the product of (2) over n = 1, ..., N (setting y⁰ = 0). A graphical representation of layer n of the hierarchical IFA network is given in Fig. 1. All units are hidden except y^N.

To gain some insight into our network, we examine the relation between the nth-layer source x^n_i and the (n−1)th-layer output y^{n−1}_i. This relation is probabilistic and is determined by the conditional density p(x^n_i | y^{n−1}_i) = Σ_{s^n_i} p(s^n_i | y^{n−1}_i) p(x^n_i | s^n_i, y^{n−1}_i). Notice from (2) that this is a MOG density. Its y^{n−1}_i-dependent mean is given by

x̄^n_i = f^n_i(y^{n−1}_i) = Σ_s p(s^n_i = s | y^{n−1}_i) (μ^n_{i,s} + v^n_{i,s} y^{n−1}_i),    (3)

and is a non-linear function of y^{n−1}_i due to the softmax form of p(s^n_i | y^{n−1}_i).

¹Throughout this paper, G(x, Σ) = |2πΣ|^{−1/2} exp(−x^T Σ^{−1} x / 2).
²However, for many sources the E-step becomes intractable, since the number Π_i n_i of source state configurations s = (s_1, ..., s_L) depends exponentially on L. Such cases are treated in [6] using a variational approximation.

Figure 1: Layer n of the hierarchical IFA generative model.
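To make eq. (3) concrete, the sketch below evaluates the conditional mean f^n_i for a single unit with three mixture states: a softmax over y-dependent logits weights a set of affine terms. All parameter values are invented for illustration.

```python
import numpy as np

def softmax(c):
    # Numerically stable softmax over mixture states s.
    c = c - c.max()
    e = np.exp(c)
    return e / e.sum()

# Illustrative parameters for one unit i of layer n, with n_i = 3 states.
a = np.array([0.5, -0.2, 0.1])    # a_{i,s}: softmax biases
b = np.array([2.0, -1.0, 0.5])    # b_{i,s}: softmax slopes in y^{n-1}_i
mu = np.array([-1.0, 0.0, 1.5])   # mu_{i,s}: component means
v = np.array([0.8, -0.3, 1.2])    # v_{i,s}: mean slopes in y^{n-1}_i

def f(y_prev):
    """Conditional mean of eq. (3): sum_s p(s | y_prev) (mu_s + v_s * y_prev)."""
    p = softmax(a + b * y_prev)          # p(s_i^n = s | y_i^{n-1})
    return np.sum(p * (mu + v * y_prev))
```

Because the softmax weights shift with y_prev, f is a smooth non-linear function even though each component contributes only an affine term, which is exactly the "oriented line segments smoothly joined together" picture described next.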
By adjusting the parameters, the function f^n_i can assume a very wide range of forms: suppose that for state s^n_i = s, a^n_{i,s} and b^n_{i,s} are set so that p(s^n_i = s | y^{n−1}_i) is significant only in a small, continuous range of y^{n−1}_i values, with different ranges associated with different s's. In this range, f^n_i will be dominated by the linear term μ^n_{i,s} + v^n_{i,s} y^{n−1}_i. Hence, a desired f^n_i can be produced by placing oriented line segments at appropriate points above the y^{n−1}_i-axis, then smoothly joining them together by the p(s^n_i | y^{n−1}_i). Using the algorithm below, the optimal form of f^n_i will be learned from the data. Therefore, our model describes the data y^N_j as a potentially highly complex function of the top-layer sources, produced by repeated application of linear mixing followed by a non-linearity, with noise allowed at each stage.

4 Learning and Inference by Variational EM

The need for summing over an exponentially large number of source state configurations (s^n_1, ..., s^n_{L_n}), and integrating over the softmax functions p(s^n_i | y^{n−1}_i), makes exact learning intractable in our network. Thus, approximations must be made. In the following we develop a variational approach, in the spirit of [8], to hierarchical IFA. We begin, following the approach of [7] to EM, by bounding the log-likelihood from below:

ℒ = log p(y^N) ≥ Σ_n {E log p(y^n | x^n) + Σ_i [E log p(x^n_i | s^n_i, y^{n−1}_i) + E log p(s^n_i | y^{n−1}_i)]} − E log q,

where E denotes averaging over the hidden layers using an arbitrary posterior q = q(s^{1...N}, x^{1...N}, y^{1...N−1} | y^N). In exact EM, q at each iteration is the true posterior, parametrized by W^{1...N} from the previous iteration. In variational EM, q is chosen to have a form which makes learning tractable, and is parametrized by a separate set of parameters V^{1...N}. These are optimized to bring q as close to the true posterior as possible.
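The bounding step can be checked numerically on a toy model. The sketch below uses a single observation from a two-state Gaussian mixture (a hypothetical stand-in for one hidden unit, not the paper's network): the bound E_q[log p(s, y)] − E_q[log q(s)] equals log p(y) exactly when q is the true posterior, and falls below it for any other q.

```python
import numpy as np

# Toy model: hidden state s in {0, 1} with prior pi, observation y ~ N(mu_s, var).
pi = np.array([0.3, 0.7])
mu, var = np.array([-1.0, 2.0]), 0.5
y = 0.4

def log_gauss(y, m, v):
    return -0.5 * np.log(2 * np.pi * v) - (y - m) ** 2 / (2 * v)

# log p(s, y) for each state, and the exact log-likelihood log p(y).
log_joint = np.log(pi) + log_gauss(y, mu, var)
log_py = np.logaddexp(log_joint[0], log_joint[1])

def elbo(q):
    """Lower bound E_q[log p(s, y)] - E_q[log q(s)] for a distribution q over s."""
    q = np.clip(q, 1e-12, 1.0)
    return np.sum(q * (log_joint - np.log(q)))

# True posterior p(s | y) attains the bound with equality.
post = np.exp(log_joint - log_py)
```

The gap between log_py and elbo(q) is the KL divergence from q to the true posterior, which is what the variational parameters V^{1...N} are adjusted to shrink.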
E-step. We use a variational posterior that is factorized across layers. Within layer n it has the form

q(s^n, x^n, y^n | V^n) = Π_{i=1}^{L_n} ν^n_{i,s^n_i} G(z^n − ρ^n, Σ^n),    (4)

for n < N, where z^n = (x^n, y^n), and q(s^N, x^N | V^N) = Π_i ν^N_{i,s^N_i} G(x^N − ρ^N, Σ^N). The variational parameters V^n = (ρ^n, Σ^n, {ν^n_{i,s}}) depend on the data y^N. The full N-layer posterior is simply a product of (4) over n. Hence, given the data, the nth-layer sources and outputs are jointly Gaussian whereas the states s^n_i are independent.³

Even with the variational posterior (4), the term E log p(s^n_i | y^{n−1}_i) in the lower bound cannot be calculated analytically, since it involves integration over the softmax function. Instead, we calculate yet a lower bound on this term. Let c_{i,s} = a_{i,s} + b_{i,s} y^{n−1}_i and drop the unit and layer indices i, n; then log p(s | y) = −log(1 + e^{−c_s} Σ_{s'≠s} e^{c_{s'}}). B