{"title": "Modeling image patches with a directed hierarchy of Markov random fields", "book": "Advances in Neural Information Processing Systems", "page_first": 1121, "page_last": 1128, "abstract": "We describe an efficient learning procedure for multilayer generative models that combine the best aspects of Markov random fields and deep, directed belief nets. The generative models can be learned one layer at a time and when learning is complete they have a very fast inference procedure for computing a good approximation to the posterior distribution in all of the hidden layers. Each hidden layer has its own MRF whose energy function is modulated by the top-down directed connections from the layer above. To generate from the model, each layer in turn must settle to equilibrium given its top-down input. We show that this type of model is good at capturing the statistics of patches of natural images.", "full_text": "Modeling image patches with a directed hierarchy of\n\nMarkov random \ufb01elds\n\nSimon Osindero and Geoffrey Hinton\n\nDepartment of Computer Science, University of Toronto\n\n6, King\u2019s College Road, M5S 3G4, Canada\nosindero,hinton@cs.toronto.edu\n\nAbstract\n\nWe describe an ef\ufb01cient learning procedure for multilayer generative models that\ncombine the best aspects of Markov random \ufb01elds and deep, directed belief nets.\nThe generative models can be learned one layer at a time and when learning is\ncomplete they have a very fast inference procedure for computing a good approx-\nimation to the posterior distribution in all of the hidden layers. Each hidden layer\nhas its own MRF whose energy function is modulated by the top-down directed\nconnections from the layer above. To generate from the model, each layer in turn\nmust settle to equilibrium given its top-down input. 
We show that this type of model is good at capturing the statistics of patches of natural images.

1 Introduction

The soldiers on a parade ground form a neat rectangle by interacting with their neighbors. An officer decides where the rectangle should be, but he would be ill-advised to try to tell each individual soldier exactly where to stand. By allowing constraints to be enforced by local interactions, the officer enormously reduces the bandwidth of top-down communication required to generate a familiar pattern. Instead of micro-managing the soldiers, the officer specifies an objective function and leaves it to the soldiers to optimise that function. This example of pattern generation suggests that a multilayer, directed belief net may not be the most effective way to generate patterns. Instead of using shared ancestors to create correlations between the variables within a layer, it may be more efficient for each layer to have its own energy function that is modulated by directed, top-down input from the layer above. Given the top-down input, each layer can then use lateral interactions to settle on a good configuration, and this configuration can then provide the top-down input for the next layer down. When generating an image of a face, for example, the approximate locations of the mouth and nose might be specified by a higher level, and the local interactions would then ensure that the accuracy of their vertical alignment was far greater than the accuracy with which their locations were specified top-down.

In this paper, we show that recently developed techniques for learning deep belief nets (DBN's) can be generalized to solve the apparently more difficult problem of learning a directed hierarchy of Markov Random Fields (MRF's).
The method we describe can learn models that have many hidden layers, each with its own MRF whose energy function is conditional on the values of the variables in the layer above. It does not require detailed prior knowledge about the data to be modeled, though it obviously works better if the architecture and the types of latent variable are well matched to the task.

2 Learning deep belief nets: An overview

The learning procedure for deep belief nets has now been described in several places (Hinton et al., 2006; Hinton and Salakhutdinov, 2006; Bengio et al., 2007) and will only be sketched here. It relies on a basic module, called a restricted Boltzmann machine (RBM), that can be trained efficiently using a method called "contrastive divergence" (Hinton, 2002).

2.1 Restricted Boltzmann Machines

An RBM consists of a layer of binary stochastic "visible" units connected to a layer of binary, stochastic "hidden" units via symmetrically weighted connections. A joint configuration, (v, h), of the visible and hidden units has an energy given by:

E(v, h) = - \sum_{i \in visibles} b_i v_i - \sum_{j \in hiddens} b_j h_j - \sum_{i,j} v_i h_j w_{ij}    (1)

where v_i, h_j are the binary states of visible unit i and hidden unit j, b_i, b_j are their biases, and w_{ij} is the symmetric weight between them. The network assigns a probability to every possible image via this energy function, and the probability of a training image can be raised by adjusting the weights and biases to lower the energy of that image and to raise the energy of similar, reconstructed images that the network would prefer to the real data.

Given a training vector, v, the binary state, h_j, of each feature detector, j, is set to 1 with probability \sigma(b_j + \sum_i v_i w_{ij}), where \sigma(x) is the logistic function 1/(1 + exp(-x)), b_j is the bias of j, v_i is the state of visible unit i, and w_{ij} is the weight between i and j.
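This conditional for the hidden units can be written out directly. The following is a minimal numpy sketch of sampling the hidden layer given a visible vector; the layer sizes, seed, and small random weights are illustrative assumptions, not values from the paper:

```python
import numpy as np

def sigmoid(x):
    # logistic function 1/(1 + exp(-x))
    return 1.0 / (1.0 + np.exp(-x))

def sample_hidden(v, W, b_hid, rng):
    # p(h_j = 1 | v) = sigmoid(b_j + sum_i v_i w_ij)
    p_h = sigmoid(b_hid + v @ W)
    # draw binary states h_j ~ Bernoulli(p_h)
    h = (rng.random(p_h.shape) < p_h).astype(np.float64)
    return h, p_h

rng = np.random.default_rng(0)
num_vis, num_hid = 6, 4                      # illustrative sizes
W = 0.01 * rng.standard_normal((num_vis, num_hid))
b_hid = np.zeros(num_hid)
v = rng.integers(0, 2, num_vis).astype(np.float64)
h, p_h = sample_hidden(v, W, b_hid, rng)
```

The same function, applied with W transposed and the visible biases, gives the symmetric conditional over the visible units.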
Once binary states have been chosen for the hidden units, a reconstruction is produced by setting each v_i to 1 with probability \sigma(b_i + \sum_j h_j w_{ij}). The states of the hidden units are then updated once more so that they represent features of the reconstruction. The change in a weight is given by

\Delta w_{ij} = \epsilon ( <v_i h_j>_{data} - <v_i h_j>_{recon} )    (2)

where \epsilon is a learning rate, <v_i h_j>_{data} is the fraction of times that visible unit i and hidden unit j are on together when the hidden units are being driven by data, and <v_i h_j>_{recon} is the corresponding fraction for reconstructions. A simplified version of the same learning rule is used for the biases. The learning works well even though it is not exactly following the gradient of the log probability of the training data (Hinton, 2002).

2.2 Compositions of experts

A single layer of binary features is usually not the best way to capture the structure in the data. We now show how RBM's can be composed to create much more powerful, multilayer models.

After using an RBM to learn the first layer of hidden features, we have an undirected model that defines p(v, h) via the energy function in Eq. 1. We can also think of the model as defining p(v, h) by defining a consistent pair of conditional probabilities, p(h|v) and p(v|h), which can be used to sample from the model distribution. A different way to express what has been learned is p(v|h) and p(h). Unlike a standard directed model, this p(h) does not have its own separate parameters. It is a complicated, non-factorial prior on h that is defined implicitly by the weights. This peculiar decomposition into p(h) and p(v|h) suggests a recursive algorithm: keep the learned p(v|h) but replace p(h) by a better prior over h, i.e. 
a prior that is closer to the average, over all the data vectors, of the conditional posterior over h.

We can sample from this average conditional posterior by simply applying p(h|v) to the training data. The sampled h vectors are then the "data" that is used for training a higher-level RBM that learns the next layer of features. We could initialize the higher-level RBM by using the same parameters as the lower-level RBM but with the roles of the hidden and visible units reversed. This ensures that p(v) for the higher-level RBM starts out being exactly the same as p(h) for the lower-level one. Provided the number of features per layer does not decrease, Hinton et al. (2006) show that each extra layer increases a variational lower bound on the log probability of the data.

The directed connections from the first hidden layer to the visible units in the final, composite graphical model are a consequence of the fact that we keep the p(v|h) but throw away the p(h) defined by the first-level RBM. In the final composite model, the only undirected connections are between the top two layers, because we do not throw away the p(h) for the highest-level RBM. To suppress noise in the learning signal, we use the real-valued activation probabilities for the visible units of all the higher-level RBM's, but to prevent each hidden unit from transmitting more than one bit of information from the data to its reconstruction, we always use stochastic binary values for the hidden units.

3 Semi-restricted Boltzmann machines

For contrastive divergence learning to work well, it is important for the hidden units to be sampled from their conditional distribution given the data or the reconstructions. It is not necessary, however, for the reconstructions to be sampled from their conditional distribution given the hidden states. All that is required is that the reconstructions have lower free energy than the data. 
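For concreteness, the standard CD-1 step of Section 2.1, in which the reconstruction is sampled from p(v|h), can be sketched as below; it is this reconstruction step that the rest of this section relaxes. The sizes, learning rate, and initial weights are illustrative assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v_data, W, b_vis, b_hid, rng, lr=0.1):
    # positive phase: sample the hidden units driven by the data
    p_h_data = sigmoid(b_hid + v_data @ W)
    h_data = (rng.random(p_h_data.shape) < p_h_data).astype(float)
    # reconstruction: sampled here from p(v | h); an SRBM instead takes a
    # small step towards equilibrium using lateral connections
    p_v_recon = sigmoid(b_vis + h_data @ W.T)
    v_recon = (rng.random(p_v_recon.shape) < p_v_recon).astype(float)
    # update the hidden units once more on the reconstruction
    p_h_recon = sigmoid(b_hid + v_recon @ W)
    # Eq. 2: delta_w = lr * (<v_i h_j>_data - <v_i h_j>_recon),
    # using activation probabilities for the correlations
    W += lr * (np.outer(v_data, p_h_data) - np.outer(v_recon, p_h_recon))
    return W

rng = np.random.default_rng(1)
W = 0.01 * rng.standard_normal((6, 4))
v = rng.integers(0, 2, 6).astype(float)
W = cd1_step(v, W, np.zeros(6), np.zeros(4), rng)
```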
So it is possible to include lateral connections between the visible units and to create reconstructions by taking a small step towards the conditional equilibrium distribution given the hidden states. If we are using mean-field activities for the reconstructions, we can move towards the equilibrium distribution by using a few damped mean-field updates (Welling and Hinton, 2002). We call this a semi-restricted Boltzmann machine (SRBM). The visible units form a conditional MRF with the biases of the visible units being determined by the hidden states. The learning procedure for the visible-to-hidden connections is unaffected, and the same learning procedure applies to the lateral connections. Explicitly, the energy function for an SRBM is given by

E(v, h) = - \sum_{i \in visibles} b_i v_i - \sum_{j \in hiddens} b_j h_j - \sum_{i,j} v_i h_j w_{ij} - \sum_{i<i'} v_i v_{i'} L_{ii'}
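The damped mean-field reconstruction step for the visible units can be sketched as follows. The damping factor, number of update steps, layer sizes, and random lateral weights L (made symmetric with a zero diagonal) are all illustrative assumptions, not values from the paper:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def srbm_reconstruct(h, W, L, b_vis, v_init, n_steps=3, damp=0.5):
    # The hidden states determine the effective biases of a conditional
    # MRF over the visible units: b_i + sum_j h_j w_ij.
    top_down = b_vis + h @ W.T
    v = v_init.copy()
    for _ in range(n_steps):
        # one damped mean-field update towards the conditional
        # equilibrium distribution given the hidden states
        v = damp * v + (1.0 - damp) * sigmoid(top_down + v @ L)
    return v

rng = np.random.default_rng(2)
num_vis, num_hid = 6, 4
W = 0.01 * rng.standard_normal((num_vis, num_hid))
L = 0.01 * rng.standard_normal((num_vis, num_vis))
L = 0.5 * (L + L.T)              # symmetric lateral connections
np.fill_diagonal(L, 0.0)         # no self-connections
h = rng.integers(0, 2, num_hid).astype(float)
v0 = rng.random(num_vis)         # start from mean-field activities
v = srbm_reconstruct(h, W, L, np.zeros(num_vis), v0)
```

Because only a few damped updates are taken, the reconstruction stays close to the data while still having lower free energy, which is all that contrastive divergence requires.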