{"title": "High-temperature Expansions for Learning Models of Nonnegative Data", "book": "Advances in Neural Information Processing Systems", "page_first": 465, "page_last": 471, "abstract": null, "full_text": "High-temperature expansions for learning \n\nmodels of nonnegative data \n\nOliver B. Downs \n\nDept.  of Mathematics \nPrinceton University \nPrinceton, NJ 08544 \n\nobdown s@ p r in c et o n.edu \n\nAbstract \n\nRecent  work  has  exploited  boundedness  of data  in  the  unsupervised \nlearning of new types of generative model.  For nonnegative data it was \nrecently  shown  that the  maximum-entropy generative  model  is  a  Non(cid:173)\nnegative Boltzmann Distribution  not  a  Gaussian  distribution,  when  the \nmodel is  constrained to match the  first and second order statistics of the \ndata.  Learning for practical sized problems is made difficult by the need \nto  compute  expectations  under  the  model  distribution.  The  computa(cid:173)\ntional  cost of Markov  chain Monte  Carlo  methods  and  low  fidelity  of \nnaive  mean  field  techniques has  led  to  increasing interest in  advanced \nmean  field  theories  and  variational  methods.  Here I  present a  second(cid:173)\norder mean-field approximation for the Nonnegative Boltzmann Machine \nmodel,  obtained  using  a  \"high-temperature\" expansion.  The  theory  is \ntested  on  learning  a bimodal 2-dimensional model,  a high-dimensional \ntranslationally  invariant distribution,  and  a generative  model for hand(cid:173)\nwritten digits. \n\n1  Introduction \n\nUnsupervised learning of generative and feature-extracting models for continuous nonneg(cid:173)\native data has recently been proposed [1], [2] . In  [1], it was pointed out that the maximum \nentropy  distribution  (matching  Ist- and  2nd-order statistics)  for  continuous  nonnegative \ndata is  not Gaussian, and  indeed  that a  Gaussian  is  not in  general a good  approximation \nto  that  distribution.  The  true  maximum  entropy  distribution  is  known  as  the  Nonnega(cid:173)\ntive  Boltzmann Distribution  (NNBD),  (previously the  rectified Gaussian distribution  [3]) , \nwhich has the functional form \n\np(x) = {o~exp[-E(X)]  if Xi  ~ OVi, \nif any Xi  < 0, \nwhere the energy function E(x) and normalisation constant Z are: \n\nE(x) \n\n(3xT Ax - bT X, \n\nZ  =  (  dx  exp[-E(x)]. \n\n10;\"20 \n\n(1) \n\n(2) \n\n(3) \n\n\fIn  contrast to  the  Gaussian  distribution,  the  NNBD can  be  multimodal in  which case its \nmodes are confined to the boundaries of the nonnegative orthant. \n\nThe Nonnegative Boltzmann Machine (NNBM) has been proposed as a method for learning \nthe maximum likelihood parameters for this maximum entropy model from data.  Without \nhidden units, it has the stochastic-EM learning rule: \n\n(XiXj)f - (XiXj)c \n(Xi)c  - (Xi)r, \n\n(4) \n(5) \n\nwhere the subscript \"c\" denotes a \"clamped\" average over the data,  and the  subscript \"f\" \ndenotes a \"free\" average over the NNBD: \n\n1  M \n\n(f(x))c  =  M  L f(x(I')) \n(f(X))f  =  1 dxp(x)f(x). \n\n1'=1 \n\nx~O \n\n(6) \n\n(7) \n\nThis learning rule has hitherto been extremely computationally costly to  implement, since \nnaive variationaVmean-field approximations for (XXT)r  are found empirically to  be poor, \nleading to the need to use Markov chain Monte Carlo methods. This has made the NNBM \nimpractical for application to high-dimensional data. 
\n\nWhile the NNBD is generally skewed and hence has moments of order greater than 2, the maximum-likelihood learning rule suggests that the distribution can be described solely in terms of the 1st- and 2nd-order statistics of the data. With that in mind, I have pursued advanced approximate models for the NNBM.\n\nIn the following section I derive a second-order approximation for \langle x_i x_j \rangle_f analogous to the TAP-Onsager correction for the mean-field Ising model, using a high-temperature expansion [4]. This produces an analytic approximation for the parameters A_{ij}, b_i in terms of the mean and cross-correlation matrix of the training data.\n\n2  Learning approximate NNBM parameters using high-temperature expansion\n\nHere I use a Taylor expansion, about the \beta = 0 limit, of a \"free energy\" directly related to the partition function Z of the distribution, Eq. 3, to derive a second-order approximation for the NNBM model parameters. In this free energy we embody the constraint that Eq. 5 is satisfied:\n\nG(\beta, m) = -\ln \int_{x \geq 0} dx \, \exp\Big( -\beta \sum_{ij} A_{ij} x_i x_j - \sum_i \lambda_i(\beta)(x_i - m_i) \Big),    (8)\n\nwhere \beta is an \"inverse temperature\". There is a direct relationship between the \"free energy\" G and the normalisation Z of the NNBD, Eq. 3,\n\n-\ln Z = G(\beta, m) + \mathrm{Constant}(b, m).    (9)\n\nThus\n\n\beta \langle x_i x_j \rangle_f = \frac{\partial G(\beta, m)}{\partial A_{ij}}.    (10)\n\nThe Lagrange multipliers \lambda_i embody the constraint that \langle x_i \rangle_f match the mean field of the patterns, m_i = \langle x_i \rangle_c. This effectively forces \Delta b = 0 in Eq. 5, with b_i = -\lambda_i(\beta). Since the Lagrange constraint is enforced for all temperatures, we can solve for the specific case \beta = 0:\n\nm_i = \langle x_i \rangle_f \big|_{\beta=0} = \frac{\prod_k \int_{x_k=0}^{\infty} dx_k \, x_i \exp\big(-\sum_l \lambda_l(0)(x_l - m_l)\big)}{\prod_k \int_{x_k=0}^{\infty} dx_k \, \exp\big(-\sum_l \lambda_l(0)(x_l - m_l)\big)} = \frac{1}{\lambda_i(0)}.    (11)\n\nNote that this embodies the unboundedness of x_k in the nonnegative orthant, as compared to the equivalent term of Georges & Yedidia for the Ising model, m_i = \tanh(\lambda_i(0)).\n\nWe consider Taylor expansion of Eq. 8 about the \"high-temperature\" limit \beta = 0,\n\nG(\beta, m) = G(0, m) + \beta \frac{\partial G}{\partial \beta}\Big|_{\beta=0} + \frac{\beta^2}{2} \frac{\partial^2 G}{\partial \beta^2}\Big|_{\beta=0} + \dots    (12)\n\nSince the integrand becomes factorable in x_i in this limit, the infinite-temperature values of G and its derivatives are analytically calculable:\n\nG(\beta, m)\big|_{\beta=0} = -\sum_k \ln \int_{x_k=0}^{\infty} dx_k \, \exp\big(-\lambda_k(0)(x_k - m_k)\big),    (13)\n\nand, using Eq. 11,\n\nG(\beta, m)\big|_{\beta=0} = -\sum_k \ln\Big( \frac{1}{\lambda_k(0)} \exp\big(\lambda_k(0) m_k\big) \Big) = -N - \sum_k \ln m_k.    (14)\n\nThe first derivative is then as follows:\n\n\frac{\partial G}{\partial \beta}\Big|_{\beta=0} = -\frac{\prod_k \int_0^{\infty} dx_k \, \Big( -\sum_{ij} A_{ij} x_i x_j - \sum_i (x_i - m_i) \frac{\partial \lambda_i}{\partial \beta} \Big) \exp\big(-\sum_l \lambda_l(0)(x_l - m_l)\big)}{\prod_k \int_0^{\infty} dx_k \, \exp\big(-\sum_l \lambda_l(0)(x_l - m_l)\big)}    (15)\n\n= \sum_{ij} (1 + \delta_{ij}) A_{ij} m_i m_j.    (16)\n\nThis term is exactly the result of applying naive mean-field theory to this system, as in [1]. Likewise we obtain the second derivative,\n\n\frac{\partial^2 G}{\partial \beta^2}\Big|_{\beta=0} = -\Big\langle \Big( \sum_{ij} A_{ij} x_i x_j + \sum_k \frac{\partial \lambda_k}{\partial \beta} (x_k - m_k) \Big)^2 \Big\rangle_{\beta=0} + \Big( \sum_{ij} (1 + \delta_{ij}) A_{ij} m_i m_j \Big)^2    (17)\n\n= -\sum_{ij} \sum_{kl} Q_{ijkl} A_{ij} A_{kl} m_i m_j m_k m_l,    (18)\n\nwhere Q_{ijkl} contains the integer coefficients arising from integration by parts in the first term and from the (1 + \delta_{ij}) factors in the second term of Eq. 17.\n\nThis expansion is to the same order as the TAP-Onsager correction term for the Ising model, which can be derived by an analogous approach to the equivalent free energy [4].
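\n\nThe zeroth- and first-order terms above rest on the fact that at \beta = 0 the constrained distribution factorizes into independent exponentials with means m_i (so \lambda_i(0) = 1/m_i), for which \langle x_i x_j \rangle = (1 + \delta_{ij}) m_i m_j. The short numerical check below is my own illustration, not part of the paper; the means in m are arbitrary example values.\n\nimport numpy as np\n\nrng = np.random.default_rng(0)\nm = np.array([0.5, 1.0, 2.0])      # target means m_i, so lambda_i(0) = 1/m_i\nM = 200000\n# At beta = 0 the NNBD reduces to a product of exponentials with mean m_i.\nX = rng.exponential(scale=m, size=(M, len(m)))\n\nemp_mean = X.mean(axis=0)                              # should approach m_i\nemp_second = (X.T @ X) / M                             # empirical <x_i x_j>\npred_second = (1.0 + np.eye(len(m))) * np.outer(m, m)  # (1 + delta_ij) m_i m_j\n\nprint(np.allclose(emp_mean, m, rtol=0.02))\nprint(np.allclose(emp_second, pred_second, rtol=0.05))\n\nBoth checks should print True, confirming the moments that enter Eqs. 14 and 16.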
\n\nSubstituting these results into Eq. 10, we obtain\n\n\beta \langle x_i x_j \rangle_f \approx \beta (1 + \delta_{ij}) m_i m_j - \frac{\beta^2}{2} \sum_{kl} Q_{ijkl} A_{kl} m_i m_j m_k m_l.    (19)\n\nWe arrive at an analytic approximation for A_{ij} as a function of the 1st and 2nd moments of the data by using Eq. 19 in the learning rule, Eq. 4, setting \Delta A_{ij} = 0 and solving the resulting linear equation for A.\n\nWe can obtain an equivalent expansion for \lambda_i(\beta), and hence for b_i. To first order in \beta (equivalent to the order of \beta in the approximation for A), we have\n\n\lambda_i(\beta) \approx \lambda_i(0) + \beta \frac{\partial \lambda_i}{\partial \beta}\Big|_{\beta=0} + \dots    (20)\n\nUsing Eqs. 11 and 15, the first-order coefficient is proportional to -\sum_j (1 + \delta_{ij}) A_{ij} m_j (Eqs. 21-22), and hence b_i = -\lambda_i(\beta) follows in terms of the same data statistics, to the same order in \beta as the approximation for A (Eqs. 23-24).\n\nThe approach presented here makes an explicit approximation of the statistics required for the NNBM learning rule, \langle x x^T \rangle_f, which can be substituted into the fixed-point equation Eq. 4 and yields a linear equation in A to be solved. This is in contrast to the linear response theory approach of Kappen & Rodriguez [6] to the Boltzmann Machine, which exploits the relationship\n\n\frac{\partial^2 \ln Z}{\partial b_i \partial b_j} = \langle x_i x_j \rangle - \langle x_i \rangle \langle x_j \rangle = \chi_{ij}    (25)\n\nbetween the free energy and the covariance matrix \chi of the model. In the learning problem, this produces a quadratic equation in A, the solution of which is non-trivial. Computationally efficient solutions of the linear response theory are then obtained by secondary approximation of the 2nd-order term, compromising the fidelity of the model.\n\n3  Learning a 'Competitive' Nonnegative Boltzmann Distribution\n\nA visualisable test problem is that of learning a bimodal NNBD in 2 dimensions. Monte Carlo slice sampling (see [1] & [5]) was used to generate 200 samples from a NNBD, as shown in Fig. 1(a). The high-temperature expansion was then used to learn approximate parameters for the NNBM model of this data. A surface plot of the resulting model distribution is shown in Fig. 1(b); it is clearly a valid candidate generative distribution for the data. This is in strong contrast with a naive mean-field (\beta = 0) model, which by construction would be unable to produce a multiple-peaked approximation, as previously described [1].\n\nFigure 1: (a) Training data, generated from a 2-dimensional 'competitive' NNBD; (b) learned model distribution, under the high-temperature expansion.\n\n4  Orientation Tuning in Visual Cortex - a translationally invariant model\n\nThe neural network model of Ben-Yishai et al. [7] for orientation tuning in visual cortex has the property that its dynamics exhibit a continuum of stable states which are translationally invariant across the network. The energy function of the network model is a translationally invariant function of the angles of maximal response, \theta_i, of the N neurons, and can be mapped directly onto the energy of the NNBM, as described in [1]:\n\nA_{ij} = \gamma \Big( \delta_{ij} + \frac{1}{N} - \frac{\epsilon}{N} \cos\big(\tfrac{2\pi}{N} |i - j|\big) \Big), \qquad b_i = \gamma.    (26)\n\nWe can generate training data for the NNBM by sampling from the neural network model with known parameters. It is easily shown that A_{ij} has 2 equal negative eigenvalues, the remainder being positive and equal in value. The corresponding pair of eigenvectors of A are sinusoids of period equal to the width of the stable activation bumps of the network, with a small relative phase.
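\n\nTo make the mapping concrete, the sketch below is my own illustration, not code from the paper: the exact form of Eq. 26 and the symbol \epsilon for the cosine amplitude are as reconstructed above, with \epsilon = 4 and \gamma = 100 taken from the experiment described in the next paragraph. It builds A and b for an N-neuron ring and checks the eigenvalue structure just described.\n\nimport numpy as np\n\ndef ring_parameters(N=10, eps=4.0, gamma=100.0):\n    # Eq. 26: A_ij = gamma * (delta_ij + 1/N - (eps/N) * cos(2*pi*|i-j|/N)), b_i = gamma\n    idx = np.arange(N)\n    d = np.abs(idx[:, None] - idx[None, :])\n    A = gamma * (np.eye(N) + 1.0 / N - (eps / N) * np.cos(2.0 * np.pi * d / N))\n    b = gamma * np.ones(N)\n    return A, b\n\nA, b = ring_parameters()\nprint(np.sort(np.linalg.eigvalsh(A)))  # two equal negative eigenvalues, the rest positive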
\n\nHere, the NNBM parameters have been solved using the high-temperature expansion for training data generated by Monte Carlo slice-sampling [5] from a 10-neuron model with parameters \epsilon = 4, \gamma = 100 in Eq. 26. Fig. 2 illustrates modal activity patterns of the learned NNBM model distribution, found using gradient ascent of the log-likelihood function from a random initialisation of the variables,\n\n\dot{x} \propto [-A x + b]^{+},    (27)\n\nwhere the superscript + denotes rectification.
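\n\nA minimal sketch of this mode search follows; it is my own illustration rather than code from the paper. I read Eq. 27 as rectified (projected) gradient ascent on the log-likelihood, with the rectification keeping the state in the nonnegative orthant; the step size, iteration count and inverse temperature are arbitrary assumptions.\n\nimport numpy as np\n\ndef find_mode(A, b, beta=1.0, lr=1e-3, steps=20000, seed=0):\n    # Rectified gradient ascent on log p(x), p(x) ~ exp(-(beta * x^T A x - b^T x)) on x >= 0 (cf. Eq. 27)\n    rng = np.random.default_rng(seed)\n    x = rng.uniform(0.0, 1.0, size=len(b))   # random nonnegative initialisation\n    for _ in range(steps):\n        grad = b - 2.0 * beta * (A @ x)      # gradient of log p(x) with respect to x\n        x = np.maximum(x + lr * grad, 0.0)   # rectification keeps x in the nonnegative orthant\n    return x\n\n# e.g. mode = find_mode(A, b) with the ring parameters built above; different seeds\n# should land on different translated copies of the activity bump, as in Fig. 2.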
\n\nThese modes of the approximate NNBM model are highly similar to the training patterns, and the eigenvectors and eigenvalues of A exhibit similar properties between their learned and training forms. This gives evidence that the approximation is successful in learning a high-dimensional translationally invariant NNBM model.\n\nFigure 2: Upper: 2 modal states of the NNBM model density, located by gradient ascent of the log-likelihood from different random initialisations. Lower: the two negative-eigenvalue eigenvectors of A, (a) in the learned model, and (b) as used to generate the training data.\n\n5  Generative Model for Handwritten Digits\n\nIn Figure 3, I show the results of applying the high-temperature NNBM to learning a generative model for the feature coactivations of the Nonnegative Matrix Factorization [2] decomposition of a database of the handwritten digits 0-9. This problem contains none of the space-filling symmetry of the visual cortex model, and hence requires a more strongly multimodal generative model distribution to generate distinct digits. Here performance is poor, although superior to uniformly-sampled feature activations.\n\nFigure 3: Digit images generated with feature activations sampled from (a) a uniform distribution, and (b) a high-temperature NNBM model for the digits.\n\n6  Discussion\n\nIn this work, an approximate technique has been derived for directly determining the NNBM parameters A, b in terms of the 1st- and 2nd-order statistics of the data, using the method of high-temperature expansion. To second order this produces corrections to the naive mean-field approximation of the system analogous to the TAP term for the Ising Model/Boltzmann Machine. The efficacy of this approximation has been demonstrated in the pathological case of learning the 'competitive' NNBD, in learning the translationally invariant model in 10 dimensions, and in a generative model for handwritten digits.\n\nThese results demonstrate an improvement in approximation to models in this class over a naive mean-field (\beta = 0) approach, without reversion to secondary assumptions such as those made in the linear response theory for the Boltzmann Machine.\n\nThere is strong current interest in the relationship between TAP-like mean-field theory, variational approximation and belief propagation in graphical models with loops. All of these can be interpreted in terms of minimising an effective free energy of the system [8]. The distinction in the work presented here lies in choosing optimal approximate statistics with which to learn the true model, under the assumption that satisfaction of the fixed-point equations of the true model optimises the free energy. This compares favourably with variational approaches, which directly optimise an approximate model distribution.\n\nMethods of this type fail when they add spurious fixed points to the learning dynamics. Future work will focus on understanding the origins of such fixed points, and the regimes in which they lead to a poor approximation of the model parameters.\n\n7  Acknowledgements\n\nThis work was inspired by the NIPS 1999 Workshop on Advanced Mean Field Methods. The author is especially grateful to David MacKay and Gayle Wittenberg for comments on early versions of this manuscript. I also acknowledge guidance from John Hopfield and David Heckerman, detailed discussion with Bert Kappen, Daniel Lee and David Barber, and encouragement from Kim Midwood.\n\nReferences\n\n[1] Downs, OB, MacKay, DJC, & Lee, DD (2000). The Nonnegative Boltzmann Machine. Advances in Neural Information Processing Systems 12, 428-434.\n\n[2] Lee, DD, & Seung, HS (1999). Learning the parts of objects by non-negative matrix factorization. Nature 401, 788-791.\n\n[3] Socci, ND, Lee, DD, & Seung, HS (1998). The rectified Gaussian distribution. Advances in Neural Information Processing Systems 10, 350-356.\n\n[4] Georges, A, & Yedidia, JS (1991). How to expand around mean-field theory using high-temperature expansions. Journal of Physics A 24, 2173-2192.\n\n[5] Neal, RM (1997). Markov chain Monte Carlo methods based on 'slicing' the density function. Technical Report 9722, Dept. of Statistics, University of Toronto.\n\n[6] Kappen, HJ, & Rodriguez, FB (1998). Efficient learning in Boltzmann Machines using linear response theory. Neural Computation 10, 1137-1156.\n\n[7] Ben-Yishai, R, Bar-Or, RL, & Sompolinsky, H (1995). Theory of orientation tuning in visual cortex. Proc. Nat. Acad. Sci. USA 92(9), 3844-3848.\n\n[8] Yedidia, JS, Freeman, WT, & Weiss, Y (2000). Generalized Belief Propagation. Mitsubishi Electric Research Laboratory Technical Report TR-2000-26.\n", "award": [], "sourceid": 1929, "authors": [{"given_name": "Oliver", "family_name": "Downs", "institution": null}]}