{"title": "Local Probability Propagation for Factor Analysis", "book": "Advances in Neural Information Processing Systems", "page_first": 442, "page_last": 448, "abstract": null, "full_text": "Local probability propagation for factor \n\nanalysis \n\nComputer Science, University of Waterloo, Waterloo, Ontario, Canada \n\nBrendan J. Frey \n\nAbstract \n\nEver since Pearl's probability propagation algorithm in graphs with \ncycles was shown to produce excellent results for error-correcting \ndecoding a few years ago, we have been curious about whether \nlocal probability propagation could be used successfully for ma(cid:173)\nchine learning. One of the simplest adaptive models is the factor \nanalyzer, which is a two-layer network that models bottom layer \nsensory inputs as a linear combination of top layer factors plus in(cid:173)\ndependent Gaussian sensor noise. We show that local probability \npropagation in the factor analyzer network usually takes just a few \niterations to perform accurate inference, even in networks with 320 \nsensors and 80 factors. We derive an expression for the algorithm's \nfixed point and show that this fixed point matches the exact solu(cid:173)\ntion in a variety of networks, even when the fixed point is unstable. \nWe also show that this method can be used successfully to perform \ninference for approximate EM and we give results on an online face \nrecognition task. \n1 Factor analysis \nA simple way to encode input patterns is to suppose that each input can be well(cid:173)\napproximated by a linear combination of component vectors, where the amplitudes \nof the vectors are modulated to match the input. For a given training set, the most \nappropriate set of component vectors will depend on how we expect the modula(cid:173)\ntion levels to behave and how we measure the distance between the input and its \napproximation. 
These effects can be captured by a generative probability model that specifies a distribution p(z) over modulation levels z = (z_1, ..., z_K)^T and a distribution p(x|z) over sensors x = (x_1, ..., x_N)^T given the modulation levels. Principal component analysis, independent component analysis and factor analysis can be viewed as maximum likelihood learning in a model of this type, where we assume that over the training set, the appropriate modulation levels are independent and the overall distortion is given by the sum of the individual sensor distortions. \n\nIn factor analysis, the modulation levels are called factors and the distributions have the following form: \n\np(z_k) = N(z_k; 0, 1),   p(z) = Π_{k=1}^K p(z_k) = N(z; 0, I),   p(x_n|z) = N(x_n; Σ_{k=1}^K λ_{nk} z_k, ψ_n).   (1) \n\nThe parameters of this model are the factor loading matrix Λ, with elements λ_{nk}, and the diagonal sensor noise covariance matrix Ψ, with diagonal elements ψ_n. A belief network for the factor analyzer is shown in Fig. 1a. The likelihood is p(x|z) = Π_{n=1}^N p(x_n|z) = N(x; Λz, Ψ), and the marginal probability of the input is \n\np(x) = ∫ N(z; 0, I) N(x; Λz, Ψ) dz = N(x; 0, ΛΛ^T + Ψ).   (2) \n\nFigure 1: (a) A belief network for factor analysis. (b) High-dimensional data (N = 560). \n\nOnline factor analysis consists of adapting Λ and Ψ to increase the likelihood of the current input, such as a vector of pixels from an image in Fig. 1b. \n\nProbabilistic inference - computing or estimating p(z|x) - is needed to do dimensionality reduction and to fill in the unobserved factors for online EM-type learning. In this paper, we focus on methods that infer independent factors. 
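As a concrete illustration of the generative model in (1)-(2), the following NumPy sketch (the names `Lam`, `psi` and the toy sizes are illustrative, not from the paper) samples inputs and checks that their empirical covariance approaches ΛΛ^T + Ψ:

```python
import numpy as np

rng = np.random.default_rng(0)
K, N = 3, 8                          # factors and sensors (toy sizes)
Lam = rng.standard_normal((N, K))    # loading matrix, elements lambda_nk
psi = rng.exponential(1.0, N) + 0.1  # diagonal sensor noise variances psi_n

# x = Lam z + noise, with z ~ N(0, I) and noise_n ~ N(0, psi_n), as in (1).
n_samples = 200_000
Z = rng.standard_normal((n_samples, K))
X = Z @ Lam.T + rng.standard_normal((n_samples, N)) * np.sqrt(psi)

# By (2), the marginal covariance of x is Lam Lam^T + diag(psi).
emp_cov = np.cov(X.T)
model_cov = Lam @ Lam.T + np.diag(psi)
print(np.abs(emp_cov - model_cov).max())  # shrinks as n_samples grows
```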
p(z|x) is Gaussian and it turns out that the posterior means and variances of the factors are \n\nE[z|x] = (Λ^T Ψ^{-1} Λ + I)^{-1} Λ^T Ψ^{-1} x,   diag(Cov(z|x)) = diag((Λ^T Ψ^{-1} Λ + I)^{-1}).   (3) \n\nGiven Λ and Ψ, computing these values exactly takes O(K^2 N) computations, mainly because of the time needed to compute Λ^T Ψ^{-1} Λ. Since there are only KN connections in the network, exact inference takes at least O(K) bottom-up/top-down iterations. \n\nOf course, if the same network is going to be applied more than K times for inference (e.g., for batch EM), then the matrices in (3) can be computed once and reused. However, this is not directly applicable in online learning and in biological models. One way to circumvent computing the matrices is to keep a separate recognition network, which approximates E[z|x] with Rx (Dayan et al., 1995). The optimal recognition network, R = (Λ^T Ψ^{-1} Λ + I)^{-1} Λ^T Ψ^{-1}, can be approximated by jointly estimating the generative network and the recognition network using online wake-sleep learning (Hinton et al., 1995). \n\n2 Probability propagation in the factor analyzer network \n\nRecent results on error-correcting coding show that in some cases Pearl's probability propagation algorithm, which does exact probabilistic inference in graphs that are trees, gives excellent performance even if the network contains so many cycles that its minimal cut set is exponential (Frey and MacKay, 1998; Frey, 1998; MacKay, 1999). In fact, the probability propagation algorithm for decoding low-density parity-check codes (MacKay, 1999) and turbocodes (Berrou and Glavieux, 1996) is widely considered to be a major breakthrough in the information theory community. \n\nWhen the network contains cycles, the local computations give rise to an iterative algorithm, which hopefully converges to a good answer. Little is known about the convergence properties of the algorithm. 
Networks containing a single cycle have been successfully analyzed by Weiss (1998) and Smyth et al. (1997), but results for networks containing many cycles are much less revealing. \n\nThe probability messages produced by probability propagation in the factor analyzer network of Fig. 1a are Gaussians. Each iteration of propagation consists of passing a mean and a variance along each edge in a bottom-up pass, followed by passing a mean and a variance along each edge in a top-down pass. At any instant, the bottom-up means and variances can be combined to form estimates of the means and variances of the modulation levels given the input. \n\nInitially, the variance and mean sent from the kth top layer unit to the nth sensor are set to v_{kn}^{(0)} = 1 and η_{kn}^{(0)} = 0. The bottom-up pass begins by computing a noise level and an error signal at each sensor using the top-down variances and means from the previous iteration: \n\ns_n^{(i)} = ψ_n + Σ_{k=1}^K λ_{nk}^2 v_{kn}^{(i-1)},   e_n^{(i)} = x_n - Σ_{k=1}^K λ_{nk} η_{kn}^{(i-1)}.   (4) \n\nThese are used to compute bottom-up variances and means as follows: \n\nφ_{nk}^{(i)} = s_n^{(i)}/λ_{nk}^2 - v_{kn}^{(i-1)},   μ_{nk}^{(i)} = e_n^{(i)}/λ_{nk} + η_{kn}^{(i-1)}.   (5) \n\nThe bottom-up variances and means are then combined to form the current estimates of the modulation variances and means: \n\nv_k^{(i)} = 1/(1 + Σ_{n=1}^N 1/φ_{nk}^{(i)}),   ẑ_k^{(i)} = v_k^{(i)} Σ_{n=1}^N μ_{nk}^{(i)}/φ_{nk}^{(i)}.   (6) \n\nThe top-down pass proceeds by computing top-down variances and means as follows: \n\nv_{kn}^{(i)} = 1/(1/v_k^{(i)} - 1/φ_{nk}^{(i)}),   η_{kn}^{(i)} = v_{kn}^{(i)}(ẑ_k^{(i)}/v_k^{(i)} - μ_{nk}^{(i)}/φ_{nk}^{(i)}).   (7) \n\nNotice that the variance updates are independent of the mean updates, whereas the mean updates depend on the variance updates. \n\n2.1 Performance of local probability propagation. 
We created a total of 200,000 factor analysis networks with 20 different sizes ranging from K = 5, N = 10 to K = 80, N = 320, and for each size of network we measured the inference error as a function of the number of iterations of propagation. Each of the 10,000 networks of a given size was produced by drawing the λ_{nk}s from standard normal distributions and then drawing each sensor variance ψ_n from an exponential distribution with mean Σ_{k=1}^K λ_{nk}^2. (A similar procedure was used by Neal and Dayan (1997).) \n\nFor each random network, a pattern was simulated from the network and probability propagation was applied using the simulated pattern as input. We measured the error between the estimate ẑ^{(i)} and the correct value E[z|x] by computing the difference between their coding costs under the exact posterior distribution and then normalizing by K to get an average number of nats per top layer unit. \n\nFig. 2a shows the inference error on a logarithmic scale versus the number of iterations (maximum of 20) in the 20 different network sizes. In all cases, the median error is reduced below .01 nats within 6 iterations. The rate of convergence of the error improves for larger N, as indicated by a general trend for the error curves to drop when N is increased. In contrast, the rate of convergence of the error appears to worsen for larger K, as shown by a general slight trend for the error curves to rise when K is increased. \n\nFor K ≥ N/8, 0.1% of the networks actually diverge. To better understand the divergent cases, we studied the means and variances for all of the divergent networks. In all cases, the variances converge within a few iterations whereas the means oscillate and diverge. For K = 5, N = 10, 54 of the 10,000 networks diverged and 5 of these are shown in Fig. 2b. This observation suggests that in general the dynamics are determined by the dynamics of the mean updates. 
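The propagation updates (4)-(7) are simple enough to sketch directly. The following NumPy code (names such as `phi`, `mu`, `eta` are illustrative; the random network construction follows the procedure above, at a small toy size) runs the message updates and compares the factor mean estimates against the exact posterior mean in (3):

```python
import numpy as np

rng = np.random.default_rng(1)
K, N = 3, 30
Lam = rng.standard_normal((N, K))        # lambda_nk
psi = rng.exponential((Lam**2).sum(1))   # psi_n, mean sum_k lambda_nk^2
x = Lam @ rng.standard_normal(K) + np.sqrt(psi) * rng.standard_normal(N)

# Exact posterior mean from (3).
M = Lam.T @ (Lam / psi[:, None]) + np.eye(K)
z_exact = np.linalg.solve(M, Lam.T @ (x / psi))

# Message arrays indexed [n, k]: top-down v, eta; bottom-up phi, mu.
v = np.ones((N, K)); eta = np.zeros((N, K))
for i in range(30):
    s = psi + (Lam**2 * v).sum(1)          # (4) noise level s_n
    e = x - (Lam * eta).sum(1)             # (4) error signal e_n
    phi = s[:, None] / Lam**2 - v          # (5) bottom-up variances
    mu = e[:, None] / Lam + eta            # (5) bottom-up means
    vk = 1.0 / (1.0 + (1.0 / phi).sum(0))  # (6) factor variance estimates
    zk = vk * (mu / phi).sum(0)            # (6) factor mean estimates
    v = 1.0 / (1.0 / vk[None, :] - 1.0 / phi)         # (7) top-down variances
    eta = v * (zk[None, :] / vk[None, :] - mu / phi)  # (7) top-down means

print(np.abs(zk - z_exact).max())  # typically tiny after a few iterations
```

Note that subtracting the previous top-down message in (5) makes each bottom-up message exclude the receiving unit's own contribution, the usual message-passing convention.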
\n\n2.2 Fixed points and a condition for global convergence. When the variance updates converge, the dynamics of probability propagation in factor analysis networks become linear. This allows us to derive the fixed point of propagation in closed form and write an eigenvalue condition for global convergence. \n\nFigure 2: (a) Performance of probability propagation. Median inference error (bold curve) on a logarithmic scale as a function of the number of iterations for different sizes of network parameterized by K and N. The two curves adjacent to the bold curve show the range within which 98% of the errors lie. 99.9% of the errors were below the fourth, topmost curve. (b) The error, bottom-up variances and top-down means as a function of the number of iterations (maximum of 20) for 5 divergent networks of size K = 5, N = 10. \n\nTo analyze the system of mean updates, we define the following length-KN vectors of means and the input: \n\nη̄^{(i)} = (η_{11}^{(i)}, η_{21}^{(i)}, ..., η_{K1}^{(i)}, η_{12}^{(i)}, ..., η_{KN}^{(i)})^T,   μ̄^{(i)} = (μ_{11}^{(i)}, μ_{12}^{(i)}, ..., μ_{1K}^{(i)}, μ_{21}^{(i)}, ..., μ_{NK}^{(i)})^T, \n\nx̄ = (x_1, x_1, ..., x_1, x_2, ..., x_2, ..., x_N, ..., x_N)^T, \n\nwhere each x_n is repeated K times in the last vector. The network parameters are represented using KN × KN diagonal matrices, Λ̄ and Ψ̄. The diagonal of Λ̄ is λ_{11}, ..., λ_{1K}, λ_{21}, ..., λ_{NK}, and the diagonal of Ψ̄ is ψ_1 I, ψ_2 I, ..., ψ_N I, where I is the K × K identity matrix. The converged bottom-up variances are represented using a diagonal matrix Φ̄ with diagonal φ_{11}, ..., φ_{1K}, φ_{21}, ..., φ_{NK}. 
\nThe summation operations in the propagation formulas are represented by a KN × KN matrix Σ_z that sums over means sent down from the top layer and a KN × KN matrix Σ_x that sums over means sent up from the sensory input: \n\nΣ_z = blockdiag(1, 1, ..., 1),   Σ_x = [I I ... I; I I ... I; ...; I I ... I].   (8) \n\nThese are N × N matrices of K × K blocks, where 1 is the K × K block of ones and I is the K × K identity matrix. \n\nUsing the above representations, the bottom-up pass is given by \n\nμ̄^{(i)} = Λ̄^{-1} x̄ - Λ̄^{-1}(Σ_z - I)Λ̄ η̄^{(i-1)},   (9) \n\nand the top-down pass is given by \n\nη̄^{(i)} = (I + diag(Σ_x Φ̄^{-1} Σ_x) - Φ̄^{-1})^{-1} (Σ_x - I) Φ̄^{-1} μ̄^{(i)}.   (10) \n\nSubstituting (10) into (9), we get the linear update for μ̄: \n\nμ̄^{(i)} = Λ̄^{-1} x̄ - Λ̄^{-1}(Σ_z - I)Λ̄ (I + diag(Σ_x Φ̄^{-1} Σ_x) - Φ̄^{-1})^{-1} (Σ_x - I) Φ̄^{-1} μ̄^{(i-1)}.   (11) \n\nFigure 3: The error (log scale) versus number of iterations (log scale, maximum of 1000) in 10 of the divergent networks with K = 5, N = 10. The means were initialized to the fixed point solutions and machine round-off errors cause divergence from the fixed points, whose errors are shown by horizontal lines. The modulus of the largest eigenvalue for each network (1.11, 1.06, 1.24, 1.07, 1.49, 1.13, 1.03, 1.02, 1.09, 1.01) is shown above its panel. \n\nThe fixed point of this dynamic system, when it exists, is \n\nμ̄* = Φ̄ (Λ̄Φ̄ + (Σ_z - I)Λ̄(I + diag(Σ_x Φ̄^{-1} Σ_x) - Φ̄^{-1})^{-1}(Σ_x - I))^{-1} x̄.   (12) \n\nA fixed point exists if the determinant of the expression in large braces in (12) is nonzero. We have found a simplified expression for this determinant in terms of the determinants of smaller, K × K matrices. \n\nReinterpreting the dynamics in (11) as dynamics for Λ̄μ̄^{(i)}, the stability of a fixed point is determined by the largest eigenvalue of the update matrix, (Σ_z - I)Λ̄(I + diag(Σ_x Φ̄^{-1} Σ_x) - Φ̄^{-1})^{-1}(Σ_x - I)Φ̄^{-1}Λ̄^{-1}. If the modulus of the largest eigenvalue is less than 1, the fixed point is stable. 
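The fixed point (12) and the eigenvalue condition can be checked numerically. The sketch below (NumPy, with illustrative names; `np.kron` builds the block matrices Σ_z and Σ_x) constructs the KN × KN matrices for a small random network, runs the variance updates to convergence, and verifies that the fixed point in (12) is invariant under the linear update (11):

```python
import numpy as np

rng = np.random.default_rng(2)
K, N = 3, 6
Lam = rng.standard_normal((N, K))
psi = rng.exponential((Lam**2).sum(1))
x = Lam @ rng.standard_normal(K) + np.sqrt(psi) * rng.standard_normal(N)

# Run the variance updates (which do not depend on the means) to convergence.
v = np.ones((N, K))
for _ in range(500):
    s = psi + (Lam**2 * v).sum(1)
    phi = s[:, None] / Lam**2 - v
    vk = 1.0 / (1.0 + (1.0 / phi).sum(0))
    v = 1.0 / (1.0 / vk[None, :] - 1.0 / phi)

KN = K * N
I = np.eye(KN)
Lam_bar = np.diag(Lam.reshape(-1))           # diagonal lambda_11, ..., lambda_NK
Phi_bar = np.diag(phi.reshape(-1))           # converged bottom-up variances
Phi_inv = np.diag(1.0 / phi.reshape(-1))
Sz = np.kron(np.eye(N), np.ones((K, K)))     # sums over k within each sensor n
Sx = np.kron(np.ones((N, N)), np.eye(K))     # sums over sensors n for each k
x_bar = np.repeat(x, K)

D = np.linalg.inv(I + np.diag(np.diag(Sx @ Phi_inv @ Sx)) - Phi_inv)
B = (Sz - I) @ Lam_bar @ D @ (Sx - I)        # common factor in (11) and (12)

# Fixed point (12) and one application of the update (11) to it.
mu_star = Phi_bar @ np.linalg.solve(Lam_bar @ Phi_bar + B, x_bar)
Lam_inv = np.diag(1.0 / Lam.reshape(-1))
mu_next = Lam_inv @ x_bar - Lam_inv @ B @ Phi_inv @ mu_star

# Modulus of the largest eigenvalue of the update matrix governs stability.
rho = np.abs(np.linalg.eigvals(B @ Phi_inv @ Lam_inv)).max()
print(np.abs(mu_next - mu_star).max(), rho)
```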
Since the system is linear, if a stable fixed point exists, the system will be globally convergent to this point. \n\nOf the 200,000 networks we explored, about 99.9% of the networks converged. For 10 of the divergent networks with K = 5, N = 10, we used 1000 iterations of probability propagation to compute the steady state variances. Then, we computed the modulus of the largest eigenvalue of the system and we computed the fixed point. After initializing the bottom-up means to the fixed point values, we performed 1000 iterations to see if numerical errors due to machine precision would cause divergence from the fixed point. Fig. 3 shows the error versus number of iterations (on logarithmic scales) for each network, the error of the fixed point, and the modulus of the largest eigenvalue. In some cases, the network diverges from the fixed point and reaches a dynamic equilibrium that has a lower average error than the fixed point. \n\n3 Online factor analysis \n\nTo perform maximum likelihood factor analysis in an online fashion, each parameter should be modified to slightly increase the log-probability of the current sensory input, log p(x). However, since the factors are hidden, they must be probabilistically filled in using inference before an incremental learning step is performed. \n\nIf the estimated mean and variance of the kth factor are ẑ_k and v_k, then it turns out (e.g., Neal and Dayan, 1997) that the parameters can be updated as follows: \n\nλ_{nk} ← λ_{nk} + η[ẑ_k(x_n - Σ_{j=1}^K λ_{nj} ẑ_j) - v_k λ_{nk}]/ψ_n,   ψ_n ← (1 - η)ψ_n + η[(x_n - Σ_{j=1}^K λ_{nj} ẑ_j)^2 + Σ_{j=1}^K v_j λ_{nj}^2],   (13) \n\nwhere η is a learning rate. \n\nOnline learning consists of performing some number of iterations of probability propagation for the current input (e.g., 4 iterations) and then modifying the parameters before processing the next input. \n\n3.1 Results on simulated data. 
We produced 95 training sets of 200 cases each, with input sizes ranging from 20 sensors to 320 sensors. For each of 19 sizes of factor analyzer, we randomly selected 5 sets of parameters as described above and generated a training set. The factor analyzer sizes were K ∈ {5, 10, 20, 40, 80}, N ∈ {20, 40, 80, 160, 320}, N > K. For each factor analyzer and simulated data set, we estimated the optimal log-probability of the data using 100 iterations of EM. \n\nFigure 4: (a) Achievable errors after the same number of epochs of learning using 4 iterations versus 1 iteration. The horizontal axis gives the log-probability error (log scale) for learning with 1 iteration and the vertical axis gives the error after the same number of epochs for learning with 4 iterations. (b) The achievable errors for learning using 4 iterations of propagation versus wake-sleep learning using 4 iterations. \n\nFor learning, the size of the model to be trained was set equal to the size of the model that was used to generate the data. To avoid the issue of how to schedule learning rates, we searched for achievable learning curves, regardless of whether or not a simple schedule for the learning rate exists. So, for a given method and randomly initialized parameters, we performed one separate epoch of learning using each of the learning rates 1, 0.5, ..., 0.5^20 and picked the learning rate that most improved the log-probability. Each successive learning rate was determined by comparing the performance using the old learning rate and one 0.75 times smaller. \n\nWe are mainly interested in comparing the achievable curves for different methods and how the differences scale with K and N. 
For two methods with the same K and N trained on the same data, we plot the log-probability error (optimal log-probability minus log-probability under the learned model) of one method against the log-probability error of the other method. \n\nFig. 4a shows the achievable errors using 4 iterations versus using 1 iteration. Usually, using 4 iterations produces networks with lower errors than those learned using 1 iteration. The difference is most significant for networks with large K, where in Sec. 2.1 we found that the convergence of the inference error was slower. \n\nFig. 4b shows the achievable errors for learning using 4 iterations of probability propagation versus wake-sleep learning using 4 iterations. Generally, probability propagation achieves much smaller errors than wake-sleep learning, although for small K wake-sleep performs better very close to the optimum log-probability. The most significant difference between the methods occurs for large K, where aside from local optima probability propagation achieves nearly optimal log-probabilities while the log-probabilities for wake-sleep learning are still close to their values at the start of learning. \n\n4 Online face recognition \n\nFig. 1b shows examples from a set of 30,000 20 × 28 greyscale face images of 18 different people. In contrast to other data sets used to test face recognition methods, these faces include wide variation in expression and pose. To make classification more difficult, we normalized the images for each person so that each pixel has the same mean and variance. We used probability propagation and a recognition network in a factor analyzer to reduce the dimensionality of the data online from 560 dimensions to 40 dimensions. For probability propagation, we rather arbitrarily chose a learning rate of 0.0001, but for wake-sleep learning we tried learning rates ranging from 0.1 down to 0.0001. 
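For reference, the online learning loop used in these experiments (a few iterations of propagation per input, followed by the parameter updates in (13)) can be sketched as follows; the toy data stream, sizes and learning rate below are illustrative, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(3)
K, N, lr = 3, 12, 0.01
true_Lam = rng.standard_normal((N, K))   # generates the toy data stream
Lam = 0.1 * rng.standard_normal((N, K))  # model parameters being learned
psi = np.ones(N)

def infer(x, n_iter=4):
    # A few iterations of the propagation updates (4)-(7), arrays indexed [n, k].
    v = np.ones((N, K)); eta = np.zeros((N, K))
    for _ in range(n_iter):
        s = psi + (Lam**2 * v).sum(1)
        e = x - (Lam * eta).sum(1)
        phi = s[:, None] / Lam**2 - v
        mu = e[:, None] / Lam + eta
        vk = 1.0 / (1.0 + (1.0 / phi).sum(0))
        zk = vk * (mu / phi).sum(0)
        v = 1.0 / (1.0 / vk[None, :] - 1.0 / phi)
        eta = v * (zk[None, :] / vk[None, :] - mu / phi)
    return zk, vk

for t in range(2000):
    x = true_Lam @ rng.standard_normal(K) + rng.standard_normal(N)
    zk, vk = infer(x)                    # probabilistically fill in the factors
    resid = x - Lam @ zk
    # Parameter updates from (13), applied once per input presentation.
    Lam = Lam + lr * (np.outer(resid, zk) - vk[None, :] * Lam) / psi[:, None]
    psi = (1 - lr) * psi + lr * (resid**2 + (vk[None, :] * Lam**2).sum(1))
```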
A multilayer perceptron with one hidden layer of 160 tanh units and one output layer of 18 softmax units was simultaneously trained using gradient descent to predict face identity from the mean factors. The learning rate for the multilayer perceptron was set to 0.05 and this value was used for both methods. \n\nFigure 5: Online error curves for probability propagation (solid), wake-sleep learning (dashed), nearest neighbors (dot-dashed) and guessing (dotted), plotted against the number of pattern presentations. \n\nFor each image, a prediction was made before the parameters were modified. Fig. 5 shows online error curves obtained by filtering the losses. The curve for probability propagation is generally below the curves for wake-sleep learning. \n\nThe figure also shows the error curves for two forms of online nearest neighbors, where only the most recent W cases are used to make a prediction. The form of nearest neighbors that performs the worst has W set so that the storage requirements are the same as for the factor analysis / multilayer perceptron method. The better form of nearest neighbors has W set so that the number of computations is the same as for the factor analysis / multilayer perceptron method. \n\n5 Summary \n\nIt turns out that iterative probability propagation can be fruitful when used for learning in a graphical model with cycles, even when the model is densely connected. Although we are more interested in extending this work to more complex models where exact inference takes exponential time, studying iterative probability propagation in the factor analyzer allowed us to compare our results with exact inference and allowed us to derive the fixed point of the algorithm. We are currently applying iterative propagation in multiple cause networks for vision problems. 
\nReferences \n\nC. Berrou and A. Glavieux 1996. Near optimum error correcting coding and decoding: Turbo-codes. IEEE Trans. on Communications, 44, 1261-1271. \n\nP. Dayan, G. E. Hinton, R. M. Neal and R. S. Zemel 1995. The Helmholtz machine. Neural Computation 7, 889-904. \n\nB. J. Frey and D. J. C. MacKay 1998. A revolution: Belief propagation in graphs with cycles. In M. Jordan, M. Kearns and S. Solla (eds), Advances in Neural Information Processing Systems 10, Denver, 1997. \n\nB. J. Frey 1998. Graphical Models for Machine Learning and Digital Communication. MIT Press, Cambridge MA. See http://www.cs.utoronto.ca/~frey . \n\nG. E. Hinton, P. Dayan, B. J. Frey and R. M. Neal 1995. The wake-sleep algorithm for unsupervised neural networks. Science 268, 1158-1161. \n\nD. J. C. MacKay 1999. Information Theory, Inference and Learning Algorithms. Book in preparation, currently available at http://wol.ra.phy.cam.ac.uk/mackay . \n\nR. M. Neal and P. Dayan 1997. Factor analysis using delta-rule wake-sleep learning. Neural Computation 9, 1781-1804. \n\nP. Smyth, R. J. McEliece, M. Xu, S. Aji and G. Horn 1997. Probability propagation in graphs with cycles. Presented at the workshop on Inference and Learning in Graphical Models, Vail, Colorado. \n\nY. Weiss 1998. Correctness of local probability propagation in graphical models. To appear in Neural Computation. \n", "award": [], "sourceid": 1649, "authors": [{"given_name": "Brendan", "family_name": "Frey", "institution": null}]}