{"title": "Autoencoders, Minimum Description Length and Helmholtz Free Energy", "book": "Advances in Neural Information Processing Systems", "page_first": 3, "page_last": 10, "abstract": null, "full_text": "Autoencoders, Minimum Description Length \n\nand Helmholtz Free Energy \n\nGeoffrey E. Hinton \n\nDepartment of Computer Science \n\nUniversity of Toronto \n6 King's College Road \n\nToronto M5S 1A4, Canada \n\nRichard S. Zemel \n\nComputational Neuroscience Laboratory \n\nThe Salk Institute \n\n10010 North Torrey Pines Road \n\nLa Jolla, CA 92037 \n\nAbstract \n\nAn autoencoder network uses a set of recognition weights to convert an input vector into a code vector. It then uses a set of generative weights to convert the code vector into an approximate reconstruction of the input vector. We derive an objective function for training autoencoders based on the Minimum Description Length (MDL) principle. The aim is to minimize the information required to describe both the code vector and the reconstruction error. We show that this information is minimized by choosing code vectors stochastically according to a Boltzmann distribution, where the generative weights define the energy of each possible code vector given the input vector. Unfortunately, if the code vectors use distributed representations, it is exponentially expensive to compute this Boltzmann distribution because it involves all possible code vectors. We show that the recognition weights of an autoencoder can be used to compute an approximation to the Boltzmann distribution and that this approximation gives an upper bound on the description length. Even when this bound is poor, it can be used as a Lyapunov function for learning both the generative and the recognition weights. We demonstrate that this approach can be used to learn factorial codes. 
\n\n1 INTRODUCTION \n\nMany of the unsupervised learning algorithms that have been suggested for neural networks can be seen as variations on two basic methods: Principal Components Analysis (PCA) and Vector Quantization (VQ), which is also called clustering or competitive learning. Both of these algorithms can be implemented simply within the autoencoder framework (Baldi and Hornik, 1989; Hinton, 1989), which suggests that this framework may also include other algorithms that combine aspects of both. VQ is powerful because it uses a very non-linear mapping from the input vector to the code but weak because the code is a purely local representation. Conversely, PCA is weak because the mapping is linear but powerful because the code is a distributed, factorial representation. We describe a new objective function for training autoencoders that allows them to discover non-linear, factorial representations. \n\n2 THE MINIMUM DESCRIPTION LENGTH APPROACH \n\nOne method of deriving a cost function for the activities of the hidden units in an autoencoder is to apply the Minimum Description Length (MDL) principle (Rissanen, 1989). We imagine a communication game in which a sender observes an ensemble of training vectors and must then communicate these vectors to a receiver. For our purposes, the sender can wait until all of the input vectors have been observed before communicating any of them - an online method is not required. Assuming that the components of the vectors are finely quantized, we can ask how many bits must be communicated to allow the receiver to reconstruct the input vectors perfectly. Perhaps the simplest method of communicating the vectors would be to send each component of each vector separately. Even this simple method requires some further specification before we can count the number of bits required. 
To send the value, x_ic, of component i of input vector c, we must encode this value as a bit string. If the sender and the receiver have already agreed on a probability distribution that assigns a probability p(x) to each possible quantized value, x, Shannon's coding theorem implies that x can be communicated at a cost that is bounded below by -log p(x) bits. Moreover, by using block coding techniques we can get arbitrarily close to this bound, so we shall treat it as the true cost. For coding real values to within a quantization width of t it is often convenient to assume a Gaussian probability distribution with mean zero and standard deviation σ. Provided that σ is large compared with t, the cost of coding the value x is then -log t + 0.5 log 2πσ² + x²/2σ². \n\nThis simple method of communicating the training vectors is generally very wasteful. If the components of a vector are correlated, it is generally more efficient to convert the input vector into some other representation before communicating it. The essence of the MDL principle is that the best model of the data is the one that minimizes the total number of bits required to communicate it, including the bits required to describe the coding scheme. For an autoencoder it is convenient to divide the total description length into three terms. An input vector is communicated to the receiver by sending the activities of the hidden units and the residual differences between the true input vector and the one that can be reconstructed from the hidden activities. There is a code cost for the hidden activities and a reconstruction cost for the residual differences. In addition, there is a one-time model cost for communicating the weights that are required to convert the hidden activities into the output of the net. This model cost is generally very important within the MDL framework, but in this paper we will ignore it. 
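As a minimal sketch of this per-value coding cost (the function name is ours, and the formula is valid only when σ is large compared with t):

```python
import math

def gaussian_code_cost_bits(x, sigma, t):
    """Bits needed to code a real value x to precision t under a zero-mean
    Gaussian with standard deviation sigma (assumes sigma >> t):
    -log t + 0.5 log 2*pi*sigma^2 + x^2 / (2 sigma^2), converted to bits."""
    nats = (-math.log(t) + 0.5 * math.log(2 * math.pi * sigma ** 2)
            + x ** 2 / (2 * sigma ** 2))
    return nats / math.log(2)
```

This is just -log2 of the probability mass t * p(x) that the Gaussian density assigns to the quantization bin containing x.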
In effect, we are considering the limit in which there is so much data that this limited model cost is negligible. \n\nPCA can be viewed as a special case of MDL in which we ignore the model cost and we limit the code cost by only using m hidden units. The question of how many bits are required to code each hidden unit activity is also ignored. Thus the only remaining term is the reconstruction cost. Assuming that the residual differences are encoded using a zero-mean Gaussian with the same predetermined variance for each component, the reconstruction cost is minimized by minimizing the squared differences. \n\nSimilarly, VQ is a version of MDL in which we limit the code cost to at most log m bits by using only m winner-take-all hidden units, we ignore the model cost, and we minimize the reconstruction cost. \n\nIn standard VQ we assume that each input vector is converted into a specific code. Surprisingly, it is more efficient to choose the codes stochastically, so that the very same input vector is sometimes communicated using one code and sometimes using another. This type of \"stochastic VQ\" is exactly equivalent to maximizing the log probability of the data under a mixture of Gaussians model. Each code of the VQ then corresponds to the mean of a Gaussian and the probability of picking the code is the posterior probability of the input vector under that Gaussian. Since this derivation of the mixture of Gaussians model is crucial to the new techniques described later, we shall describe it in some detail. \n\n2.1 The \"bits-back\" argument \n\nThe description length of an input vector using a particular code is the sum of the code cost and the reconstruction cost. We define this sum to be the energy of the code, for reasons that will become clear later. 
If the prior probability of code i is π_i and its squared reconstruction error is d_i², the energy of the code is \n\nE_i = -log π_i - k log t + (k/2) log 2πσ² + d_i²/2σ²    (1) \n\nwhere k is the dimensionality of the input vector, σ² is the variance of the fixed Gaussian used for encoding the reconstruction errors, and t is the quantization width. \n\nNow consider the following situation: We have fitted a VQ to some training data and, for a particular input vector, two of the codes are equally good in the sense that they have equal energies. In a standard VQ we would gain no advantage from the fact that there are two equally good codes. However, the fact that we have a choice of two codes should be worth something. It does not matter which code we use, so if we are vague about the choice of code we should be able to save one bit when communicating the code. \n\nTo make this argument precise, consider the following communication game: The sender is already communicating a large number of random bits to the receiver, and we want to compute the additional cost of communicating some input vectors. For each input vector we have a number of alternative codes h_1 ... h_i ... h_m and each code has an energy, E_i. In a standard VQ we would pick the code, j, with the lowest energy. But suppose we pick code i with a probability p_i that depends on E_i. Our expected cost then appears to be higher, since we sometimes pick codes that do not have the minimum value of E: \n\n<Cost> = Σ_i p_i E_i    (2) \n\nwhere <...> is used to denote an expected value. However, the sender can use her freedom of choice in stochastically picking codes to communicate some of the random bits that need to be communicated anyway. 
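Eq. 1 and the apparent expected cost of Eq. 2 can be sketched directly (function names are ours; energies are in nats):

```python
import math

def code_energy(prior_i, d2, k, sigma, t):
    """Energy of code i (Eq. 1): code cost -log prior_i plus the cost of
    coding a k-dimensional residual with squared error d2 under a zero-mean
    Gaussian of variance sigma**2, quantized to width t."""
    return (-math.log(prior_i) - k * math.log(t)
            + 0.5 * k * math.log(2 * math.pi * sigma ** 2)
            + d2 / (2 * sigma ** 2))

def expected_cost(p, E):
    """Apparent expected cost <Cost> = sum_i p_i E_i of picking codes
    stochastically (Eq. 2)."""
    return sum(pi * Ei for pi, Ei in zip(p, E))
```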
It is easy to see how random bits can be used to stochastically choose a code, but it is less obvious how these bits can be recovered by the receiver, because he is only sent the chosen code and does not know the probability distribution from which it was picked. This distribution depends on the particular input vector that is being communicated. To recover the random bits, the receiver waits until all of the training vectors have been communicated losslessly and then runs exactly the same learning algorithm as the sender used. This allows the receiver to recover the recognition weights that are used to convert input vectors into codes, even though the only weights that are explicitly communicated from the sender to the receiver are the generative weights that convert codes into approximate reconstructions of the input. After learning the recognition weights, the receiver can reconstruct the probability distribution from which each code was stochastically picked because the input vector has already been communicated. Since he also knows which code was chosen, he can figure out the random bits that were used to do the picking. The expected number of random bits required to pick a code stochastically is simply the entropy of the probability distribution over codes \n\nH = -Σ_i p_i log p_i    (3) \n\nSo, allowing for the fact that these random bits have been successfully communicated, the true expected combined cost is \n\nF = Σ_i p_i E_i - H    (4) \n\nNote that F has exactly the form of a Helmholtz free energy. It can be shown that the probability distribution which minimizes F is \n\np_i = e^(-E_i) / Σ_j e^(-E_j)    (5) \n\nThis is exactly the posterior probability distribution obtained when fitting a mixture of Gaussians to an input vector. 
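A short numerical sketch of Eqs. 4 and 5 (names are ours) makes the minimization concrete:

```python
import math

def free_energy(p, E):
    """F = sum_i p_i E_i - H(p)  (Eq. 4), in nats."""
    cost = sum(pi * Ei for pi, Ei in zip(p, E))
    entropy = -sum(pi * math.log(pi) for pi in p if pi > 0)
    return cost - entropy

def boltzmann(E):
    """The minimizing distribution p_i = exp(-E_i) / sum_j exp(-E_j)  (Eq. 5)."""
    m = min(E)  # subtract the minimum for numerical stability
    w = [math.exp(-(Ei - m)) for Ei in E]
    Z = sum(w)
    return [wi / Z for wi in w]
```

At the Boltzmann distribution, F equals -log Σ_j e^(-E_j), which is never larger than the minimum energy, so vagueness about the choice of code can only help.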
\n\nThe idea that a stochastic choice of codes is more efficient than just choosing the code with the smallest value of E is an example of the concept of stochastic complexity (Rissanen, 1989) and can also be derived in other ways. The concept of stochastic complexity is unnecessarily complicated if we are only interested in fitting a mixture of Gaussians. Instead of thinking in terms of a stochastically chosen code plus a reconstruction error, we can simply use Shannon's coding theorem directly by assuming that we code the input vectors using the mixture of Gaussians probability distribution. However, when we start using more complicated coding schemes in which the input is reconstructed from the activities of several different hidden units, the formulation in terms of F is much easier to work with because it liberates us from the constraint that the probability distribution over codes must be the optimal one. There is generally no efficient way of computing the optimal distribution, but it is nevertheless possible to use F with a suboptimal distribution as a Lyapunov function for learning (Neal and Hinton, 1993). In MDL terms, we are simply using a suboptimal coding scheme in order to make the computation tractable. \n\nOne particular class of suboptimal distributions is very attractive for computational reasons. In a factorial distribution the probability distribution over m^d alternatives factors into d independent distributions over m alternatives. Because they can be represented compactly, factorial distributions can be computed conveniently by a non-stochastic feed-forward recognition network. \n\n3 FACTORIAL STOCHASTIC VECTOR QUANTIZATION \n\nInstead of coding the input vector by a single, stochastically chosen hidden unit, we could use several different pools of hidden units and stochastically pick one unit in each pool. 
\nAll of the selected units within this distributed representation are then used to reconstruct the input. This amounts to using several different VQs which cooperate to reconstruct the input. Each VQ can be viewed as a dimension and the chosen unit within the VQ is the value on that dimension. The number of possible distributed codes is m^d, where d is the number of VQs and m is the number of units within a VQ. The weights from the hidden units to the output units determine what output is produced by each possible distributed code. Once these weights are fixed, they determine the reconstruction error that would be caused by using a particular distributed code. If the prior probabilities of each code are also fixed, Eq. 5 defines the optimal probability distribution over distributed codes, where the index i now ranges over the m^d possible codes. \n\nComputing the correct distribution requires an amount of work that is exponential in d, so we restrict ourselves to the suboptimal distributions that can be factored into d independent distributions, one for each VQ. The fact that the correct distribution is not really factorial will not lead to major problems as it does in mean field approximations of Boltzmann machines (Galland, 1993). It will simply lead to an overestimate of the description length, but this overestimate can still be used as a bound when learning the weights. Also, the excess bits caused by the non-independence will force the generative weights towards values that cause the correct distribution to be approximately factorial. \n\n3.1 Computing the Expected Reconstruction Error \n\nTo perform gradient descent in the description length given in Eq. 4, it is necessary to compute, for each training example, the derivative of the expected reconstruction cost with respect to the activation probability of each hidden unit. 
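A factorial distribution over the m^d distributed codes is specified by just d probability vectors of length m, so drawing a distributed code, for instance to estimate this derivative by sampling, is cheap. A minimal sketch (the helper name is ours):

```python
import random

def sample_distributed_code(probs, rng=random):
    """Pick one unit per VQ from a factorial distribution.
    probs: d probability vectors, each of length m; the d choices jointly
    index one of the m**d possible distributed codes."""
    return [rng.choices(range(len(p)), weights=p)[0] for p in probs]
```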
An obvious way to approximate this derivative is to use Monte Carlo simulations in which we stochastically pick one hidden unit in each pool. This way of computing derivatives is faithful to the underlying stochastic model, but it is inevitably either slow or inaccurate. Fortunately, it can be replaced by a fast exact method when the output units are linear and there is a squared error measure for the reconstruction. Given the probability, h_i^v, of picking hidden unit i in VQ v, we can compute the expected reconstructed output y_j for output unit j on a given training case \n\ny_j = b_j + Σ_v Σ_i h_i^v w_ji^v    (6) \n\nwhere b_j is the bias of unit j and w_ji^v is the generative weight from i to j in VQ v. We can also compute the variance in the reconstructed output caused by the stochastic choices within the VQs. Under the assumption that the stochastic choices within different VQs are independent, the variances contributed by the different VQs can simply be added: \n\nV_j = Σ_v [ Σ_i h_i^v (w_ji^v)² - (Σ_i h_i^v w_ji^v)² ]    (7) \n\nThe expected squared reconstruction error for each output unit is then V_j + (y_j - d_j)², where d_j is the desired output. So if the reconstruction error is coded assuming a zero-mean Gaussian distribution, the expected reconstruction cost can be computed exactly.¹ It is therefore straightforward to compute the derivatives, with respect to any weight in the network, of all the terms in Eq. 4. \n\n4 AN EXAMPLE OF FACTORIAL VECTOR QUANTIZATION \n\nZemel (1993) presents several different data sets for which factorial vector quantization (FVQ) produces efficient encodings. We briefly describe one of those examples. The data set consists of 200 images of simple curves as shown in figure 1. A network containing 4 VQs, each containing 6 hidden units, is trained on this data set. After training, the final outgoing weights for the hidden units are as shown in figure 2. Each VQ has learned to represent the height of the spline segment that connects a pair of control points. 
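As a concrete recap of Section 3.1, Eqs. 6 and 7 and the resulting expected cost can be sketched as follows (function and argument names are ours):

```python
def expected_recon_error(h, W, b, d):
    """Exact expected squared reconstruction error for a factorial
    stochastic VQ with linear output units (Eqs. 6 and 7).
    h: list of probability vectors, one per VQ (one entry per unit)
    W: list of weight matrices as nested lists, W[v][j][i] = generative
       weight from unit i of VQ v to output unit j
    b: output biases; d: the target (the input being reconstructed)."""
    n_out = len(b)
    y = list(b)
    V = [0.0] * n_out
    for hv, Wv in zip(h, W):
        for j in range(n_out):
            mean = sum(p * w for p, w in zip(hv, Wv[j]))      # Eq. 6 term
            second = sum(p * w * w for p, w in zip(hv, Wv[j]))
            y[j] += mean
            V[j] += second - mean * mean                      # Eq. 7 term
    return sum(Vj + (yj - dj) ** 2 for Vj, yj, dj in zip(V, y, d))
```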
By chaining these four segments together, the image can be reconstructed fairly accurately. For new images generated in the same way, the description length is approximately 18 bits for the reconstruction cost and 7 bits for the code. By contrast, a stochastic vector quantizer with 24 hidden units in a single competing group has a reconstruction cost of 36 bits and a code cost of 4 bits. A set of 4 separate stochastic VQs, each of which is trained on a different 8x3 vertical slice of the image, also does slightly worse than the factorial VQ (by 5 bits) because it cannot smoothly blend the separate segments of the curve together. A purely linear network with 24 hidden units that performs a version of principal components analysis has a slightly lower reconstruction cost but a much higher code cost. \n\n[Figure 1 shows the fixed x positions and randomly chosen y positions of the spline control points, and the resulting image.] \n\nFigure 1: Each image in the spline dataset is generated by fitting a spline to 5 control points with randomly chosen y-positions. An image is formed by blurring the spline with a Gaussian. The intensity of each pixel is indicated by the area of white in the display. The resulting images are 8x12 pixels, but have only 5 underlying degrees of freedom. \n\n¹ Each VQ contributes non-Gaussian noise and the combined noise is also non-Gaussian. But since its variance is known, the expected cost of coding the reconstruction error using a Gaussian prior can be computed exactly. The fact that this prior is not ideal simply means that the computed reconstruction cost is an upper bound on the cost using a better prior. \n\nFigure 2: The outgoing weights of the hidden units for a network containing 4 VQs with 6 units in each, trained on the spline dataset. 
Each 8x12 weight block corresponds to a single unit, and each row of these blocks corresponds to one VQ. \n\n5 DISCUSSION \n\nA natural approach to unsupervised learning is to use a generative model that defines a probability distribution over observable vectors. The obvious maximum likelihood learning procedure is then to adjust the parameters of the model so as to maximize the sum of the log probabilities of a set of observed vectors. This approach works very well for generative models, such as a mixture of Gaussians, in which it is tractable to compute the expectations that are required for the application of the EM algorithm. It can also be applied to the wider class of models in which it is tractable to compute the derivatives of the log probability of the data with respect to each model parameter. However, for non-linear models that use distributed codes it is usually intractable to compute these derivatives, since they require that we integrate over all of the exponentially many codes that could have been used to generate each particular observed vector. \n\nThe MDL principle suggests a way of making learning tractable in these more complicated generative models. The optimal way to code an observed vector is to use the correct posterior probability distribution over codes given the current model parameters. However, we are free to use a suboptimal probability distribution that is easier to compute. The description length using this suboptimal method can still be used as a Lyapunov function for learning the model parameters because it is an upper bound on the optimal description length. The excess description length caused by using the wrong distribution has the form of a Kullback-Leibler distance and acts as a penalty term that encourages the recognition weights to approximate the correct distribution as well as possible. \n\nThere is an interesting relationship to statistical physics. 
Given an input vector, each possible code acts like an alternative configuration of a physical system. The function E defined in Eq. 1 is the energy of this configuration. The function F in Eq. 4 is the Helmholtz free energy, which is minimized by the thermal equilibrium or Boltzmann distribution. The probability assigned to each code at this minimum is exactly its posterior probability given the parameters of the generative model. The difficulty of performing maximum likelihood learning corresponds to the difficulty of computing properties of the equilibrium distribution. Learning is much more tractable if we use the non-equilibrium Helmholtz free energy as a Lyapunov function (Neal and Hinton, 1993). We can then use the recognition weights of an autoencoder to compute some non-equilibrium distribution. The derivatives of F encourage the recognition weights to approximate the equilibrium distribution as well as they can, but we do not need to reach the equilibrium distribution before adjusting the generative weights that define the energy function of the analogous physical system. \n\nIn this paper we have shown that an autoencoder network can learn factorial codes by using non-equilibrium Helmholtz free energy as an objective function. In related work (Zemel and Hinton, 1994) we apply the same approach to learning population codes. We anticipate that the general approach described here will be useful for a wide variety of complicated generative models. It may even be relevant for gradient descent learning in situations where the model is so complicated that it is seldom feasible to consider more than one or two of the innumerable ways in which the model could generate each observation. \n\nAcknowledgements \n\nThis research was supported by grants from the Ontario Information Technology Research Center, the Institute for Robotics and Intelligent Systems, and NSERC. 
Geoffrey Hinton is the Noranda Fellow of the Canadian Institute for Advanced Research. We thank Peter Dayan, Yann Le Cun, Radford Neal and Chris Williams for helpful discussions. \n\nReferences \n\nBaldi, P. and Hornik, K. (1989) Neural networks and principal components analysis: Learning from examples without local minima. Neural Networks, 2, 53-58. \n\nGalland, C. C. (1993) The limitations of deterministic Boltzmann machine learning. Network, 4, 355-379. \n\nHinton, G. E. (1989) Connectionist learning procedures. Artificial Intelligence, 40, 185-234. \n\nNeal, R. and Hinton, G. E. (1993) A new view of the EM algorithm that justifies incremental and other variants. Manuscript available from the authors. \n\nRissanen, J. (1989) Stochastic Complexity in Statistical Inquiry. World Scientific Publishing Co., Singapore. \n\nZemel, R. S. (1993) A Minimum Description Length Framework for Unsupervised Learning. PhD Thesis, Department of Computer Science, University of Toronto. \n\nZemel, R. S. and Hinton, G. E. (1994) Developing Population Codes by Minimizing Description Length. In J. Cowan, G. Tesauro, and J. Alspector (Eds.), Advances in Neural Information Processing Systems 6, San Mateo, CA: Morgan Kaufmann. \n\n", "award": [], "sourceid": 798, "authors": [{"given_name": "Geoffrey", "family_name": "Hinton", "institution": null}, {"given_name": "Richard", "family_name": "Zemel", "institution": null}]}