{"title": "Autoencoders, Minimum Description Length and Helmholtz Free Energy", "book": "Advances in Neural Information Processing Systems", "page_first": 3, "page_last": 10, "abstract": null, "full_text": "Autoencoders, Minimum Description Length \nand Helmholtz Free Energy \n\nGeoffrey E. Hinton \nDepartment of Computer Science \nUniversity of Toronto \n6 King's College Road \nToronto M5S 1A4, Canada \n\nRichard S. Zemel \nComputational Neuroscience Laboratory \nThe Salk Institute \n10010 North Torrey Pines Road \nLa Jolla, CA 92037 \n\nAbstract \n\nAn autoencoder network uses a set of recognition weights to convert an \ninput vector into a code vector. It then uses a set of generative weights to \nconvert the code vector into an approximate reconstruction of the input \nvector. We derive an objective function for training autoencoders based \non the Minimum Description Length (MDL) principle. The aim is to \nminimize the information required to describe both the code vector and \nthe reconstruction error. We show that this information is minimized \nby choosing code vectors stochastically according to a Boltzmann distribution, \nwhere the generative weights define the energy of each possible \ncode vector given the input vector. Unfortunately, if the code vectors \nuse distributed representations, it is exponentially expensive to compute \nthis Boltzmann distribution because it involves all possible code vectors. \nWe show that the recognition weights of an autoencoder can be used to \ncompute an approximation to the Boltzmann distribution and that this approximation \ngives an upper bound on the description length. Even when \nthis bound is poor, it can be used as a Lyapunov function for learning \nboth the generative and the recognition weights. We demonstrate that \nthis approach can be used to learn factorial codes. 
\n\n1 INTRODUCTION \n\nMany of the unsupervised learning algorithms that have been suggested for neural networks \ncan be seen as variations on two basic methods: Principal Components Analysis (PCA) \nand Vector Quantization (VQ) which is also called clustering or competitive learning. \nBoth of these algorithms can be implemented simply within the autoencoder framework \n(Baldi and Hornik, 1989; Hinton, 1989) which suggests that this framework may also \ninclude other algorithms that combine aspects of both. VQ is powerful because it uses \na very non-linear mapping from the input vector to the code but weak because the code \nis a purely local representation. Conversely, PCA is weak because the mapping is linear \nbut powerful because the code is a distributed, factorial representation. We describe a \nnew objective function for training autoencoders that allows them to discover non-linear, \nfactorial representations. \n\n2 THE MINIMUM DESCRIPTION LENGTH APPROACH \n\nOne method of deriving a cost function for the activities of the hidden units in an autoencoder \nis to apply the Minimum Description Length (MDL) principle (Rissanen, 1989). We imagine \na communication game in which a sender observes an ensemble of training vectors and must \nthen communicate these vectors to a receiver. For our purposes, the sender can wait until \nall of the input vectors have been observed before communicating any of them - an online \nmethod is not required. Assuming that the components of the vectors are finely quantized \nwe can ask how many bits must be communicated to allow the receiver to reconstruct the \ninput vectors perfectly. Perhaps the simplest method of communicating the vectors would \nbe to send each component of each vector separately. Even this simple method requires \nsome further specification before we can count the number of bits required.  
To send the \nvalue, $x_{i,c}$, of component $i$ of input vector $c$ we must encode this value as a bit string. If \nthe sender and the receiver have already agreed on a probability distribution that assigns \na probability $p(x)$ to each possible quantized value, $x$, Shannon's coding theorem implies \nthat $x$ can be communicated at a cost that is bounded below by $-\\log p(x)$ bits. Moreover, \nby using block coding techniques we can get arbitrarily close to this bound so we shall \ntreat it as the true cost. For coding real values to within a quantization width of $t$ it is \noften convenient to assume a Gaussian probability distribution with mean zero and standard \ndeviation $\\sigma$. Provided that $\\sigma$ is large compared with $t$, the cost of coding the value $x$ is then \n$-\\log t + 0.5 \\log 2\\pi\\sigma^2 + x^2/2\\sigma^2$. \n\nThis simple method of communicating the training vectors is generally very wasteful. If \nthe components of a vector are correlated it is generally more efficient to convert the input \nvector into some other representation before communicating it. The essence of the MDL \nprinciple is that the best model of the data is the one that minimizes the total number of \nbits required to communicate it, including the bits required to describe the coding scheme. \nFor an autoencoder it is convenient to divide the total description length into three terms. \nAn input vector is communicated to the receiver by sending the activities of the hidden \nunits and the residual differences between the true input vector and the one that can be \nreconstructed from the hidden activities. There is a code cost for the hidden activities and a \nreconstruction cost for the residual differences. In addition there is a one-time model cost \nfor communicating the weights that are required to convert the hidden activities into the \noutput of the net. This model cost is generally very important within the MDL framework, \nbut in this paper we will ignore it.  
In effect, we are considering the limit in which there is \nso much data that this limited model cost is negligible. \n\nPCA can be viewed as a special case of MDL in which we ignore the model cost and we limit \nthe code cost by only using $m$ hidden units. The question of how many bits are required \nto code each hidden unit activity is also ignored. Thus the only remaining term is the \nreconstruction cost. Assuming that the residual differences are encoded using a zero-mean \nGaussian with the same predetermined variance for each component, the reconstruction \ncost is minimized by minimizing the squared differences. \n\nSimilarly, VQ is a version of MDL in which we limit the code cost to at most $\\log m$ bits by \nusing only $m$ winner-take-all hidden units, we ignore the model cost, and we minimize the \nreconstruction cost. \n\nIn standard VQ we assume that each input vector is converted into a specific code. Surprisingly, \nit is more efficient to choose the codes stochastically so that the very same input \nvector is sometimes communicated using one code and sometimes using another. This type \nof \"stochastic VQ\" is exactly equivalent to maximizing the log probability of the data under \na mixture of Gaussians model. Each code of the VQ then corresponds to the mean of a \nGaussian and the probability of picking the code is the posterior probability of the input \nvector under that Gaussian. Since this derivation of the mixture of Gaussians model is \ncrucial to the new techniques described later, we shall describe it in some detail. \n\n2.1 The \"bits-back\" argument \n\nThe description length of an input vector using a particular code is the sum of the code cost \nand reconstruction cost. We define this to be the energy of the code, for reasons that will \nbecome clear later.  
Given an input vector, we define the energy of a code to be the sum \nof the code cost and the reconstruction cost. If the prior probability of code $i$ is $\\pi_i$ and its \nsquared reconstruction error is $d_i^2$, the energy of the code is \n\n$$E_i = -\\log \\pi_i - k \\log t + \\frac{k}{2} \\log 2\\pi\\sigma^2 + \\frac{d_i^2}{2\\sigma^2} \\qquad (1)$$ \n\nwhere $k$ is the dimensionality of the input vector, $\\sigma^2$ is the variance of the fixed Gaussian \nused for encoding the reconstruction errors and $t$ is the quantization width. \n\nNow consider the following situation: We have fitted a VQ to some training data and, for a \nparticular input vector, two of the codes are equally good in the sense that they have equal \nenergies. In a standard VQ we would gain no advantage from the fact that there are two \nequally good codes. However, the fact that we have a choice of two codes should be worth \nsomething. It does not matter which code we use so if we are vague about the choice of \ncode we should be able to save one bit when communicating the code. \n\nTo make this argument precise consider the following communication game: The sender \nis already communicating a large number of random bits to the receiver, and we want to \ncompute the additional cost of communicating some input vectors. For each input vector \nwe have a number of alternative codes $h_1 \\ldots h_i \\ldots h_m$ and each code has an energy, $E_i$. In a \nstandard VQ we would pick the code, $j$, with the lowest energy. But suppose we pick code \n$i$ with a probability $p_i$ that depends on $E_i$. Our expected cost then appears to be higher \nsince we sometimes pick codes that do not have the minimum value of $E$. \n\n$$\\langle \\mathrm{Cost} \\rangle = \\sum_i p_i E_i \\qquad (2)$$ \n\nwhere $\\langle \\ldots \\rangle$ is used to denote an expected value. However, the sender can use her \nfreedom of choice in stochastically picking codes to communicate some of the random \nbits that need to be communicated anyway.  
It is easy to see how random bits can be used \nto stochastically choose a code, but it is less obvious how these bits can be recovered by \nthe receiver, because he is only sent the chosen code and does not know the probability \ndistribution from which it was picked. This distribution depends on the particular input \nvector that is being communicated. To recover the random bits, the receiver waits until all \nof the training vectors have been communicated losslessly and then runs exactly the same \nlearning algorithm as the sender used. This allows the receiver to recover the recognition \nweights that are used to convert input vectors into codes, even though the only weights that \nare explicitly communicated from the sender to the receiver are the generative weights that \nconvert codes into approximate reconstructions of the input. After learning the recognition \nweights, the receiver can reconstruct the probability distribution from which each code was \nstochastically picked because the input vector has already been communicated. Since he \nalso knows which code was chosen, he can figure out the random bits that were used to do \nthe picking. The expected number of random bits required to pick a code stochastically is \nsimply the entropy of the probability distribution over codes \n\n$$H = -\\sum_i p_i \\log p_i \\qquad (3)$$ \n\nSo, allowing for the fact that these random bits have been successfully communicated, the \ntrue expected combined cost is \n\n$$F = \\sum_i p_i E_i - H = \\sum_i p_i E_i + \\sum_i p_i \\log p_i \\qquad (4)$$ \n\nNote that $F$ has exactly the form of Helmholtz free energy. It can be shown that the \nprobability distribution which minimizes $F$ is \n\n$$p_i = \\frac{e^{-E_i}}{\\sum_j e^{-E_j}} \\qquad (5)$$ \n\nThis is exactly the posterior probability distribution obtained when fitting a mixture of \nGaussians to an input vector. 
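
The relationship between Eqs. 3, 4 and 5 can be checked numerically. The following is a minimal sketch, not part of the original paper: the function names boltzmann and free_energy and the three example energies are illustrative assumptions. Energies are taken to be in nats so that the exponential in Eq. 5 applies directly; one bit corresponds to log 2 nats.

```python
import math

def boltzmann(energies):
    # Eq. 5: p_i = exp(-E_i) / sum_j exp(-E_j).
    # Subtracting the minimum energy first avoids underflow.
    e_min = min(energies)
    weights = [math.exp(-(e - e_min)) for e in energies]
    z = sum(weights)
    return [w / z for w in weights]

def free_energy(probs, energies):
    # Eq. 4: F = sum_i p_i E_i - H, with H = -sum_i p_i log p_i (Eq. 3).
    expected_cost = sum(p * e for p, e in zip(probs, energies))
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return expected_cost - entropy

# Hypothetical energies in nats: two equally good codes and one poor code.
energies = [3.0, 3.0, 9.0]

# Stochastic choice of code using the Boltzmann distribution.
p_opt = boltzmann(energies)
f_opt = free_energy(p_opt, energies)

# Standard VQ: always pick one lowest-energy code (zero entropy).
f_vq = free_energy([1.0, 0.0, 0.0], energies)

# At the Boltzmann distribution F reduces to -log sum_j exp(-E_j).
f_closed = -math.log(sum(math.exp(-e) for e in energies))

# Saving from being vague about the choice of code, converted to bits.
saving_bits = (f_vq - f_opt) / math.log(2.0)
print('F (Boltzmann):', f_opt, ' F (standard VQ):', f_vq)
print('bits saved:', saving_bits)
```

With two codes of equal energy and a third that is much worse, the saving comes out very slightly above one bit, matching the informal argument that a free choice between two equally good codes is worth about one bit.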
\n\nThe idea that a stochastic choice of codes is more efficient than just choosing the code with \nthe smallest value of $E$ is an example of the concept of stochastic complexity (Rissanen, \n1989) and can also be derived in other ways. The concept of stochastic complexity is \nunnecessarily complicated if we are only interested in fitting a mixture of Gaussians. \nInstead of thinking in terms of a stochastically chosen code plus a reconstruction error, \nwe can simply use Shannon's coding theorem directly by assuming that we code the input \nvectors using the mixture of Gaussians probability distribution. However, when we start \nusing more complicated coding schemes in which the input is reconstructed from the \nactivities of several different hidden units, the formulation in terms of $F$ is much easier \nto work with because it liberates us from the constraint that the probability distribution \nover codes must be the optimal one. There is generally no efficient way of computing the \noptimal distribution, but it is nevertheless possible to use $F$ with a suboptimal distribution \nas a Lyapunov function for learning (Neal and Hinton, 1993). In MDL terms we are simply \nusing a suboptimal coding scheme in order to make the computation tractable. \n\nOne particular class of suboptimal distributions is very attractive for computational reasons. \nIn a factorial distribution the probability distribution over $m^d$ alternatives factors into $d$ \nindependent distributions over $m$ alternatives. Because they can be represented compactly, \nfactorial distributions can be computed conveniently by a non-stochastic feed-forward \nrecognition network. 
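
To make the idea of a suboptimal factorial distribution concrete, here is a small sketch under stated assumptions. The energy function toy_energy, the sizes m = 3 and d = 2, and all function names are hypothetical and chosen only for illustration. The sketch stores a factorial distribution as d lists of m probabilities, evaluates F for it by brute-force enumeration of all m**d codes (feasible only at toy sizes), and compares the result with the minimum of F, which is the negative log of the sum over codes of exp(-E).

```python
import itertools
import math

def joint_prob(q, code):
    # Factorial distribution: the probability of a distributed code is the
    # product of d independent per-pool probabilities.
    prob = 1.0
    for pool, unit in enumerate(code):
        prob *= q[pool][unit]
    return prob

def free_energy(q, energy, m, d):
    # F = sum_c q(c) E(c) + sum_c q(c) log q(c), enumerating all m**d codes.
    f = 0.0
    for code in itertools.product(range(m), repeat=d):
        p = joint_prob(q, code)
        if p > 0.0:
            f += p * (energy(code) + math.log(p))
    return f

def optimal_free_energy(energy, m, d):
    # Minimum of F over all distributions: -log sum_c exp(-E(c)).
    z = sum(math.exp(-energy(code))
            for code in itertools.product(range(m), repeat=d))
    return -math.log(z)

def toy_energy(code):
    # Hypothetical energy in nats: a per-pool preference plus a coupling
    # between pools 0 and 1, so the Boltzmann distribution is not factorial.
    base = sum(0.5 * unit for unit in code)
    coupling = 1.0 if code[0] != code[1] else 0.0
    return base + coupling

m, d = 3, 2
# A factorial distribution needs only d * m numbers rather than m**d.
q_uniform = [[1.0 / m] * m for _ in range(d)]
f_factorial = free_energy(q_uniform, toy_energy, m, d)
f_optimal = optimal_free_energy(toy_energy, m, d)
excess = f_factorial - f_optimal  # excess description length, never negative
print('F factorial:', f_factorial, ' F optimal:', f_optimal)
```

The excess is non-negative for any choice of factorial distribution, which is what allows F computed with a tractable distribution to serve as an upper bound during learning.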
\n\n3 FACTORIAL STOCHASTIC VECTOR QUANTIZATION \n\nInstead of coding the input vector by a single, stochastically chosen hidden unit, we could \nuse several different pools of hidden units and stochastically pick one unit in each pool. \nAll of the selected units within this distributed representation are then used to reconstruct \nthe input. This amounts to using several different VQs which cooperate to reconstruct the \ninput. Each VQ can be viewed as a dimension and the chosen unit within the VQ is the \nvalue on that dimension. The number of possible distributed codes is $m^d$ where $d$ is the \nnumber of VQs and $m$ is the number of units within a VQ. The weights from the hidden \nunits to the output units determine what output is produced by each possible distributed \ncode. Once these weights are fixed, they determine the reconstruction error that would be \ncaused by using a particular distributed code. If the prior probabilities of each code are also \nfixed, Eq. 5 defines the optimal probability distribution over distributed codes, where the \nindex $i$ now ranges over the $m^d$ possible codes. \n\nComputing the correct distribution requires an amount of work that is exponential in $d$, so \nwe restrict ourselves to the suboptimal distributions that can be factored into $d$ independent \ndistributions, one for each VQ. The fact that the correct distribution is not really factorial \nwill not lead to major problems as it does in mean field approximations of Boltzmann \nmachines (Galland, 1993). It will simply lead to an overestimate of the description length \nbut this overestimate can still be used as a bound when learning the weights. Also the excess \nbits caused by the non-independence will force the generative weights towards values that \ncause the correct distribution to be approximately factorial. \n\n3.1 Computing the Expected Reconstruction Error \n\nTo perform gradient descent in the description length given in Eq. 
4, it is necessary to \ncompute, for each training example, the derivative of the expected reconstruction cost with \nrespect to the activation probability of each hidden unit. An obvious way to approximate \nthis derivative is to use Monte Carlo simulations in which we stochastically pick one hidden \nunit in each pool. This way of computing derivatives is faithful to the underlying stochastic \nmodel, but it is inevitably either slow or inaccurate. Fortunately, it can be replaced by a \nfast exact method when the output units are linear and there is a squared error measure for \nthe reconstruction. Given the probability, $h_i^v$, of picking hidden unit $i$ in VQ $v$, we can \ncompute the expected reconstructed output $\\hat{y}_j$ for output unit $j$ on a given training case \n\n$$\\hat{y}_j = b_j + \\sum_v \\sum_i h_i^v w_{ji}^v \\qquad (6)$$ \n\nwhere $b_j$ is the bias of unit $j$ and $w_{ji}^v$ is the generative weight from $i$ to $j$ in VQ $v$. We \ncan also compute the variance in the reconstructed output caused by the stochastic choices \nwithin the VQs. Under the assumption that the stochastic choices within different VQs are \nindependent, the variances contributed by the different VQs can simply be added. \n\n$$V_j = \\sum_v \\left[ \\sum_i h_i^v (w_{ji}^v)^2 - \\left( \\sum_i h_i^v w_{ji}^v \\right)^2 \\right] \\qquad (7)$$ \n\nThe expected squared reconstruction error for each output unit is $V_j + (\\hat{y}_j - d_j)^2$ where $d_j$ is \nthe desired output. So if the reconstruction error is coded assuming a zero-mean Gaussian \ndistribution the expected reconstruction cost can be computed exactly$^1$. It is therefore \nstraightforward to compute the derivatives, with respect to any weight in the network, of all \nthe terms in Eq. 4. \n\n4 AN EXAMPLE OF FACTORIAL VECTOR QUANTIZATION \n\nZemel (1993) presents several different data sets for which factorial vector quantization \n(FVQ) produces efficient encodings. We briefly describe one of those examples. The data \nset consists of 200 images of simple curves as shown in figure 1. A network containing 4 \nVQs, each containing 6 hidden units, is trained on this data set. 
After training, the final \noutgoing weights for the hidden units are as shown in figure 2. Each VQ has learned \nto represent the height of the spline segment that connects a pair of control points. By \nchaining these four segments together the image can be reconstructed fairly accurately. For \nnew images generated in the same way, the description length is approximately 18 bits for \nthe reconstruction cost and 7 bits for the code. By contrast, a stochastic vector quantizer \nwith 24 hidden units in a single competing group has a reconstruction cost of 36 bits and \na code cost of 4 bits. A set of 4 separate stochastic VQs each of which is trained on a \ndifferent 8x3 vertical slice of the image also does slightly worse than the factorial VQ (by \n5 bits) because it cannot smoothly blend the separate segments of the curve together. A \npurely linear network with 24 hidden units that performs a version of principal components \nanalysis has a slightly lower reconstruction cost but a much higher code cost. \n\n[Figure 1 diagram: fixed x positions, random y positions] \n\nFigure 1: Each image in the spline dataset is generated by fitting a spline to 5 control \npoints with randomly chosen y-positions. An image is formed by blurring the spline with \na Gaussian. The intensity of each pixel is indicated by the area of white in the display. The \nresulting images are 8x12 pixels, but have only 5 underlying degrees of freedom. \n\n1 Each VQ contributes non-Gaussian noise and the combined noise is also non-Gaussian. But \nsince its variance is known, the expected cost of coding the reconstruction error using a Gaussian \nprior can be computed exactly. The fact that this prior is not ideal simply means that the computed \nreconstruction cost is an upper bound on the cost using a better prior. \n\n
Figure 2: The outgoing weights of the hidden units for a network containing 4 VQs with 6 \nunits in each, trained on the spline dataset. Each 8x12 weight block corresponds to a single \nunit, and each row of these blocks corresponds to one VQ. \n\n5 DISCUSSION \n\nA natural approach to unsupervised learning is to use a generative model that defines a \nprobability distribution over observable vectors. The obvious maximum likelihood learning \nprocedure is then to adjust the parameters of the model so as to maximize the sum of the \nlog probabilities of a set of observed vectors. This approach works very well for generative \nmodels, such as a mixture of Gaussians, in which it is tractable to compute the expectations \nthat are required for the application of the EM algorithm. It can also be applied to the wider \nclass of models in which it is tractable to compute the derivatives of the log probability of \nthe data with respect to each model parameter. However, for non-linear models that use \ndistributed codes it is usually intractable to compute these derivatives since they require that \nwe integrate over all of the exponentially many codes that could have been used to generate \neach particular observed vector. \n\nThe MDL principle suggests a way of making learning tractable in these more complicated \ngenerative models. The optimal way to code an observed vector is to use the correct \nposterior probability distribution over codes given the current model parameters. However, \nwe are free to use a suboptimal probability distribution that is easier to compute. The \ndescription length using this suboptimal method can still be used as a Lyapunov function \nfor learning the model parameters because it is an upper bound on the optimal description \nlength.  
The excess description length caused by using the wrong distribution has the form \nof a Kullback-Leibler distance and acts as a penalty term that encourages the recognition \nweights to approximate the correct distribution as well as possible. \n\nThere is an interesting relationship to statistical physics. Given an input vector, each \npossible code acts like an alternative configuration of a physical system. The function \n$E$ defined in Eq. 1 is the energy of this configuration. The function $F$ in Eq. 4 is \nthe Helmholtz free energy which is minimized by the thermal equilibrium or Boltzmann \ndistribution. The probability assigned to each code at this minimum is exactly its posterior \nprobability given the parameters of the generative model. The difficulty of performing \nmaximum likelihood learning corresponds to the difficulty of computing properties of the \nequilibrium distribution. Learning is much more tractable if we use the non-equilibrium \nHelmholtz free energy as a Lyapunov function (Neal and Hinton, 1993). We can then use \nthe recognition weights of an autoencoder to compute some non-equilibrium distribution. \nThe derivatives of $F$ encourage the recognition weights to approximate the equilibrium \ndistribution as well as they can, but we do not need to reach the equilibrium distribution \nbefore adjusting the generative weights that define the energy function of the analogous \nphysical system. \n\nIn this paper we have shown that an autoencoder network can learn factorial codes by using \nnon-equilibrium Helmholtz free energy as an objective function. In related work (Zemel \nand Hinton, 1994) we apply the same approach to learning population codes. We anticipate \nthat the general approach described here will be useful for a wide variety of complicated \ngenerative models.  
It may even be relevant for gradient descent learning in situations where \nthe model is so complicated that it is seldom feasible to consider more than one or two of \nthe innumerable ways in which the model could generate each observation. \n\nAcknowledgements \n\nThis research was supported by grants from the Ontario Information Technology Research \nCenter, the Institute for Robotics and Intelligent Systems, and NSERC. Geoffrey Hinton \nis the Noranda Fellow of the Canadian Institute for Advanced Research. We thank Peter \nDayan, Yann Le Cun, Radford Neal and Chris Williams for helpful discussions. \n\nReferences \n\nBaldi, P. and Hornik, K. (1989) Neural networks and principal components analysis: Learning \nfrom examples without local minima. Neural Networks, 2, 53-58. \n\nGalland, C. C. (1993) The limitations of deterministic Boltzmann machine learning. Network, \n4, 355-379. \n\nHinton, G. E. (1989) Connectionist learning procedures. Artificial Intelligence, 40, 185-234. \n\nNeal, R. and Hinton, G. E. (1993) A new view of the EM algorithm that justifies incremental \nand other variants. Manuscript available from the authors. \n\nRissanen, J. (1989) Stochastic Complexity in Statistical Inquiry. World Scientific Publishing \nCo., Singapore. \n\nZemel, R. S. (1993) A Minimum Description Length Framework for Unsupervised Learning. \nPh.D. Thesis, Department of Computer Science, University of Toronto. \n\nZemel, R. S. and Hinton, G. E. (1994) Developing Population Codes by Minimizing \nDescription Length. In J. Cowan, G. Tesauro, and J. Alspector (Eds.), Advances in Neural \nInformation Processing Systems 6, San Mateo, CA: Morgan Kaufmann. \n\n", "award": [], "sourceid": 798, "authors": [{"given_name": "Geoffrey", "family_name": "Hinton", "institution": null}, {"given_name": "Richard", "family_name": "Zemel", "institution": null}]}