{"title": "A Better Way to Pretrain Deep Boltzmann Machines", "book": "Advances in Neural Information Processing Systems", "page_first": 2447, "page_last": 2455, "abstract": "We describe how the pre-training algorithm for Deep Boltzmann Machines (DBMs) is related to the pre-training algorithm for Deep Belief Networks and we show that under certain conditions, the pre-training procedure improves the variational lower bound of a two-hidden-layer DBM. Based on this analysis, we develop a different method of pre-training DBMs that distributes the modelling work more evenly over the hidden layers. Our results on the MNIST and NORB datasets demonstrate that the new pre-training algorithm allows us to learn better generative models.", "full_text": "A Better Way to Pretrain Deep Boltzmann Machines\n\nRuslan Salakhutdinov\n\nGeoffrey Hinton\n\nDepartment of Statistics and Computer Science\n\nDepartment of Computer Science\n\nUniversity of Toronto\n\nrsalakhu@cs.toronto.edu\n\nUniversity of Toronto\n\nhinton@cs.toronto.edu\n\nAbstract\n\nWe describe how the pretraining algorithm for Deep Boltzmann Machines\n(DBMs) is related to the pretraining algorithm for Deep Belief Networks and\nwe show that under certain conditions, the pretraining procedure improves the\nvariational lower bound of a two-hidden-layer DBM. Based on this analysis, we\ndevelop a different method of pretraining DBMs that distributes the modelling\nwork more evenly over the hidden layers. Our results on the MNIST and NORB\ndatasets demonstrate that the new pretraining algorithm allows us to learn better\ngenerative models.\n\n1\n\nIntroduction\n\nA Deep Boltzmann Machine (DBM) is a type of binary pairwise Markov Random Field with mul-\ntiple layers of hidden random variables. Maximum likelihood learning in DBMs, and other related\nmodels, is very dif\ufb01cult because of the hard inference problem induced by the partition function\n[3, 1, 12, 6]. 
Multiple layers of hidden units make learning in DBMs far more difficult [13]. Learning meaningful DBM models, particularly when modelling high-dimensional data, relies on the heuristic greedy pretraining procedure introduced by [7], which is based on learning a stack of modified Restricted Boltzmann Machines (RBMs). Unfortunately, unlike the pretraining algorithm for Deep Belief Networks (DBNs), the existing procedure lacks a proof that adding additional layers improves the variational bound on the log-probability that the model assigns to the training data.

In this paper, we first show that under certain conditions, the pretraining algorithm improves a variational lower bound of a two-layer DBM. This result gives a much deeper understanding of the relationship between the pretraining algorithms for Deep Boltzmann Machines and Deep Belief Networks. Using this understanding, we introduce a new pretraining procedure for DBMs and show that it allows us to learn better generative models of handwritten digits and 3D objects.

2 Deep Boltzmann Machines (DBMs)

A Deep Boltzmann Machine is a network of symmetrically coupled stochastic binary units. It contains a set of visible units v ∈ {0, 1}^D, and a series of layers of hidden units h(1) ∈ {0, 1}^F1, h(2) ∈ {0, 1}^F2, ..., h(L) ∈ {0, 1}^FL. There are connections only between units in adjacent layers. Consider a DBM with three hidden layers, as shown in Fig. 1, left panel. The probability that the DBM assigns to a visible vector v is:

P(v; θ) = (1/Z(θ)) Σ_h exp( Σ_ij W(1)_ij v_i h(1)_j + Σ_jl W(2)_jl h(1)_j h(2)_l + Σ_lm W(3)_lm h(2)_l h(3)_m ),   (1)

Figure 1: Left: Deep Belief Network (DBN) and Deep Boltzmann Machine (DBM). 
The top two layers of a DBN form an undirected graph and the remaining layers form a belief net with directed, top-down connections. For a DBM, all the connections are undirected. Right: Pretraining a DBM with three hidden layers consists of learning a stack of RBMs that are then composed to create a DBM. The first and last RBMs in the stack need to be modified by using asymmetric weights.

where h = {h(1), h(2), h(3)} is the set of hidden units, and θ = {W(1), W(2), W(3)} are the model parameters, representing visible-to-hidden and hidden-to-hidden symmetric interaction terms¹. Setting W(2)=0 and W(3)=0 recovers the Restricted Boltzmann Machine (RBM) model.

Approximate Learning: Exact maximum likelihood learning in this model is intractable, but efficient approximate learning of DBMs can be carried out by using mean-field inference to estimate data-dependent expectations, and an MCMC-based stochastic approximation procedure to approximate the model's expected sufficient statistics [7]. In particular, consider approximating the true posterior P(h|v; θ) with a fully factorized distribution over the three sets of hidden units: Q(h|v; μ) = Π_{j=1..F1} q(h(1)_j|v) Π_{l=1..F2} q(h(2)_l|v) Π_{k=1..F3} q(h(3)_k|v), where μ = {μ(1), μ(2), μ(3)} are the mean-field parameters with q(h(l)_i = 1) = μ(l)_i for l = 1, 2, 3. In this case, the variational lower bound on the log-probability of the data takes a particularly simple form:

log P(v; θ) ≥ v⊤W(1)μ(1) + μ(1)⊤W(2)μ(2) + μ(2)⊤W(3)μ(3) − log Z(θ) + H(Q),   (2)

where H(·) is the entropy functional. Learning proceeds by finding the value of μ that maximizes this lower bound for the current value of the model parameters θ, which results in a set of mean-field fixed-point equations. 
Given the variational parameters μ, the model parameters θ are then updated to maximize the variational bound using stochastic approximation (for details see [7, 11, 14, 15]).

3 Pretraining Deep Boltzmann Machines

The above learning procedure works quite poorly when applied to DBMs that start with randomly initialized weights. Hidden units in higher layers are very under-constrained, so there is no consistent learning signal for their weights. To alleviate this problem, [7] introduced a layer-wise pretraining algorithm based on learning a stack of "modified" Restricted Boltzmann Machines (RBMs).

The idea behind the pretraining algorithm is straightforward. When learning the parameters of the first-layer "RBM", the bottom-up weights are constrained to be twice the top-down weights (see Fig. 1, right panel). Intuitively, using twice the weights when inferring the states of the hidden units h(1) compensates for the initial lack of top-down feedback. Conversely, when pretraining the last "RBM" in the stack, the top-down weights are constrained to be twice the bottom-up weights. For all the intermediate RBMs the weights are halved in both directions when composing them to form a DBM, as shown in Fig. 1, right panel.

¹We omit the bias terms for clarity of presentation.

This heuristic pretraining algorithm works surprisingly well in practice. However, it is solely motivated by the need to end up with a model that has symmetric weights, and does not provide any useful insights into what is happening during the pretraining stage. 
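As a rough illustration (our own sketch, not the authors' code), one-step CD learning for the modified first-layer "RBM", with the bottom-up weights fixed at twice the top-down weights, might look as follows; layer sizes, the learning rate, and the use of mean-field reconstructions of the visibles are assumptions chosen to match the description above, and biases are omitted:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_first_layer(v_data, W, lr=0.05, rng=np.random.default_rng(0)):
    """One CD-1 update for the modified first-layer "RBM":
    the bottom-up weights are 2W (twice the top-down weights W),
    compensating for the missing top-down input during pretraining.
    v_data: (batch, D) binary data; W: (D, F1) top-down weights."""
    # Upward pass with doubled weights.
    h_prob = sigmoid(2.0 * (v_data @ W))
    h_sample = (rng.random(h_prob.shape) < h_prob).astype(float)
    # Mean-field reconstruction of the visibles with the single weights W.
    v_recon = sigmoid(h_sample @ W.T)
    h_recon = sigmoid(2.0 * (v_recon @ W))
    # Contrastive-divergence gradient on the shared weight matrix W.
    grad = (v_data.T @ h_prob - v_recon.T @ h_recon) / v_data.shape[0]
    return W + lr * grad
```

The same template, with the roles of bottom-up and top-down scaling swapped, applies to the last "RBM" in the stack.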
Furthermore, unlike the pretraining algorithm for Deep Belief Networks (DBNs), it lacks a proof that each time a layer is added to the DBM, the variational bound improves.

3.1 Pretraining Algorithm for Deep Belief Networks

We first briefly review the pretraining algorithm for Deep Belief Networks [2], which will form the basis for developing a new pretraining algorithm for Deep Boltzmann Machines.

Consider pretraining a two-layer DBN using a stack of RBMs. After learning the first RBM in the stack, we can write the generative model as: p(v; W(1)) = Σ_{h(1)} p(v|h(1); W(1)) p(h(1); W(1)). The second RBM in the stack attempts to replace the prior p(h(1); W(1)) by a better model p(h(1); W(2)) = Σ_{h(2)} p(h(1), h(2); W(2)), thus improving the fit to the training data.

More formally, for any approximating distribution Q(h(1)|v), the DBN's log-likelihood has the following variational lower bound on the log probability of the training data {v_1, ..., v_N}:

Σ_{n=1..N} log P(v_n) ≥ Σ_n E_{Q(h(1)|v_n)}[ log P(v_n|h(1); W(1)) ] − Σ_n KL( Q(h(1)|v_n) || P(h(1); W(1)) ).

We set Q(h(1)|v_n; W(1)) = P(h(1)|v_n; W(1)), which is the true factorial posterior of the first-layer RBM. Initially, when W(2) = W(1)⊤, Q(h(1)|v_n) defines the DBN's true posterior over h(1), and the bound is tight. Maximizing the bound with respect to W(2) only affects the last KL term in the above equation, and amounts to maximizing:

(1/N) Σ_{n=1..N} Σ_{h(1)} Q(h(1)|v_n; W(1)) log P(h(1); W(2)).   (3)

This is equivalent to training the second-layer RBM with vectors drawn from Q(h(1)|v; W(1)) as data. Hence, the second RBM in the stack learns a better model of the mixture over all N training cases, (1/N) Σ_n Q(h(1)|v_n; W(1)), called the "aggregated posterior". 
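In code, training on the "aggregated posterior" simply means drawing an h(1) sample from the first RBM's factorial posterior for each training case and using the stacked samples as data for the next RBM. A minimal sketch under those assumptions (biases omitted):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def aggregated_posterior_samples(V, W1, rng=np.random.default_rng(0)):
    """Sample h(1) ~ Q(h(1)|v; W1) for every training case v (rows of V).
    Stacking the samples over all N cases gives draws from the aggregated
    posterior (1/N) sum_n Q(h(1)|v_n; W1)."""
    probs = sigmoid(V @ W1)  # factorial RBM posterior, one row per case
    return (rng.random(probs.shape) < probs).astype(float)
```

The next RBM in the stack is then trained on these samples exactly as if they were visible data.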
This scheme can be extended to training higher-layer RBMs.

Observe that during the pretraining stage the whole prior of the lower-layer RBM is replaced by the next RBM in the stack. This leads to the hybrid Deep Belief Network model, with the top two layers forming a Restricted Boltzmann Machine, and the lower layers forming a directed sigmoid belief network (see Fig. 1, left panel).

3.2 A Variational Bound for Pretraining a Two-layer Deep Boltzmann Machine

Consider a simple two-layer DBM with tied weights W(2) = W(1)⊤, as shown in Fig. 2a:

P(v; W(1)) = (1/Z(W(1))) Σ_{h(1),h(2)} exp( v⊤W(1)h(1) + h(1)⊤W(1)⊤h(2) ).   (4)

Similar to DBNs, for any approximate posterior Q(h(1)|v), we can write a variational lower bound on the log probability that this DBM assigns to the training data:

Σ_{n=1..N} log P(v_n) ≥ Σ_n E_{Q(h(1)|v_n)}[ log P(v_n|h(1); W(1)) ] − Σ_n KL( Q(h(1)|v_n) || P(h(1); W(1)) ).   (5)

The key insight is to note that the model's marginal distribution over h(1) is the product of two identical distributions, one defined by an RBM composed of h(1) and v, and the other defined by an identical RBM composed of h(1) and h(2) [8]:

P(h(1); W(1)) = (1/Z(W(1))) ( Σ_v e^{v⊤W(1)h(1)} ) ( Σ_{h(2)} e^{h(2)⊤W(1)h(1)} ),   (6)

where the first factor is the RBM with h(1) and v, and the second is the RBM with h(1) and h(2).

Figure 2: Left: Pretraining a Deep Boltzmann Machine with two hidden layers. a) The DBM with tied weights. b) The second RBM with two sets of replicated hidden units, which will replace half of the 1st RBM's prior. c) The resulting DBM with modified second hidden layer. 
Right: The DBM with tied weights is trained to model the data using one-step contrastive divergence.

The idea is to keep one of these two RBMs and replace the other by the square root of a better prior P(h(1); W(2)). In particular, another RBM with two sets of replicated hidden units and tied weights, P(h(1); W(2)) = Σ_{h(2a),h(2b)} P(h(1), h(2a), h(2b); W(2)), is trained to be a better model of the aggregated variational posterior (1/N) Σ_n Q(h(1)|v_n; W(1)) of the first model (see Fig. 2b). By initializing W(2) = W(1)⊤, the second-layer RBM has exactly the same prior over h(1) as the original DBM. If the RBM is trained by maximizing the log-likelihood objective:

Σ_n Σ_{h(1)} Q(h(1)|v_n) log P(h(1); W(2)),   (7)

then we obtain:

Σ_n KL( Q(h(1)|v_n) || P(h(1); W(2)) ) ≤ Σ_n KL( Q(h(1)|v_n) || P(h(1); W(1)) ).   (8)

Similar to Eq. 6, the distribution over h(1) defined by the second-layer RBM is also the product of two identical distributions. Once the two RBMs are composed to form a two-layer DBM model (see Fig. 2c), the marginal distribution over h(1) is the geometric mean of the two probability distributions P(h(1); W(1)) and P(h(1); W(2)) defined by the first and second-layer RBMs:

P(h(1); W(1), W(2)) = (1/Z(W(1), W(2))) ( Σ_v e^{v⊤W(1)h(1)} ) ( Σ_{h(2)} e^{h(1)⊤W(2)h(2)} ).   (9)

Based on Eqs. 8 and 9, it is easy to show that the variational lower bound of Eq. 
5 improves because replacing half of the prior by a better model reduces the KL divergence from the variational posterior:

Σ_n KL( Q(h(1)|v_n) || P(h(1); W(1), W(2)) ) ≤ Σ_n KL( Q(h(1)|v_n) || P(h(1); W(1)) ).   (10)

Due to the convexity of the asymmetric KL divergence, this is guaranteed to improve the variational bound on the training data by at least half as much as fully replacing the original prior would.

This result highlights a major difference between DBNs and DBMs. The procedure for adding an extra layer to a DBN replaces the full prior over the previous top layer, whereas the procedure for adding an extra layer to a DBM only replaces half of the prior. So in a DBM, the weights of the bottom-level RBM perform much more of the work than in a DBN, where the weights are only used to define the last stage of the generative process P(v|h(1); W(1)).

This result also suggests that adding layers to a DBM will give diminishing improvements in the variational bound, compared to adding layers to a DBN. This may explain why DBMs with three hidden layers typically perform worse than DBMs with two hidden layers [7, 8]. On the other hand, the disadvantage of the pretraining procedure for Deep Belief Networks is that the top-layer RBM is forced to do most of the modelling work. This may also explain the need to use a large number of hidden units in the top-layer RBM [2].

There is, however, a way to design a new pretraining algorithm that spreads the modelling work more equally across all layers, hence bypassing the shortcomings of the existing pretraining algorithms for DBNs and DBMs.

Figure 3: Left: Pretraining a Deep Boltzmann Machine with two hidden layers by replacing 2/3 of the prior. a) The DBM with tied weights. b) The second-layer RBM is trained to model 2/3 of the 1st RBM's prior. 
c) The resulting DBM with modified second hidden layer. Right: The corresponding practical implementation of the pretraining algorithm that uses asymmetric weights.

3.3 Controlling the Amount of Modelling Work Done by Each Layer

Consider a slightly modified two-layer DBM with two groups of replicated 2nd-layer units, h(2a) and h(2b), and tied weights (see Fig. 3a). The model's marginal distribution over h(1) is the product of three identical RBM distributions, defined by h(1) and v, h(1) and h(2a), and h(1) and h(2b):

P(h(1); W(1)) = (1/Z(W(1))) ( Σ_v e^{v⊤W(1)h(1)} ) ( Σ_{h(2a)} e^{h(2a)⊤W(1)h(1)} ) ( Σ_{h(2b)} e^{h(2b)⊤W(1)h(1)} ).

During the pretraining stage, we keep one of these RBMs and replace the other two by a better prior P(h(1); W(2)). To do so, similar to Sec. 3.2, we train another RBM, but with three sets of hidden units and tied weights (see Fig. 3b). When we combine the two RBMs into a DBM, the marginal distribution over h(1) is the geometric mean of three probability distributions: one defined by the first-layer RBM, and the remaining two defined by the second-layer RBM:

P(h(1); W(1), W(2)) = (1/Z(W(1), W(2))) P(h(1); W(1)) P(h(1); W(2)) P(h(1); W(2))
                    = (1/Z(W(1), W(2))) ( Σ_v e^{v⊤W(1)h(1)} ) ( Σ_{h(2a)} e^{h(2a)⊤W(2)h(1)} ) ( Σ_{h(2b)} e^{h(2b)⊤W(2)h(1)} ).

In this DBM, 2/3 of the first RBM's prior over the first hidden layer has been replaced by the prior defined by the second-layer RBM. The variational bound on the training data is guaranteed to improve by at least 2/3 as much as fully replacing the original prior would. Hence, in this slightly modified DBM model, the second layer performs 2/3 of the modelling work compared to the first layer. 
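The product form of the marginal over h(1) can be checked numerically by exhaustive enumeration on a toy model. The following brute-force sketch (our own code, with arbitrary tiny layer sizes; tied weights force the replicated groups to have the same dimension as v) compares the marginal of the tied-weight joint with the product of the three identical RBM factors:

```python
import itertools
import numpy as np

def states(n):
    # All 2^n binary vectors of length n.
    return [np.array(s, dtype=float) for s in itertools.product([0, 1], repeat=n)]

def marginal_vs_product(W, D, F1):
    """W: (D, F1) tied weights; h(2a) and h(2b) both have D units because the
    weights are tied to W(1)'s transpose. For every h(1) configuration, return
    the unnormalized marginal (joint summed over v, h2a, h2b) and the product
    of the three identical RBM factors from Sec. 3.3."""
    marg, prod = [], []
    for h1 in states(F1):
        # Marginal: sum the tied-weight joint over v, h(2a), h(2b).
        m = sum(np.exp(v @ W @ h1 + a @ W @ h1 + b @ W @ h1)
                for v in states(D) for a in states(D) for b in states(D))
        # Product of the three identical RBM factors.
        f = sum(np.exp(v @ W @ h1) for v in states(D))
        marg.append(m)
        prod.append(f ** 3)
    return np.array(marg), np.array(prod)
```

Up to the shared normalizer Z(W(1)), the two arrays agree exactly, which is the factorization used throughout this section.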
Clearly, controlling the number of replicated hidden groups allows us to easily control the amount of modelling work left to the higher layers in the stack.

3.4 Practical Implementation

So far, we have made the assumption that we start with a two-layer DBM with tied weights. We now specify how one would train this initial set of tied weights W(1).

Let us consider the original two-layer DBM in Fig. 2a with tied weights. If we knew the initial state vector h(2), we could train this DBM using one-step contrastive divergence (CD) with mean-field reconstructions of both the visible states v and the top-layer states h(2), as shown in Fig. 2, right panel. Instead, we simply set the initial state vector h(2) to be equal to the data, v. Using mean-field reconstructions for v and h(2), one-step CD is then exactly equivalent to training a modified "RBM" with only one hidden layer but with bottom-up weights that are twice the top-down weights, as defined in the original pretraining algorithm (see Fig. 1, right panel). This way of training the simple DBM with tied weights is unlikely to maximize the likelihood objective, but in practice it produces surprisingly good models that reconstruct the training data well.

When learning the second RBM in the stack, instead of maintaining a set of replicated hidden groups, it will often be convenient to approximate CD learning by training a modified RBM with one hidden layer but with asymmetric bottom-up and top-down weights.

For example, consider pretraining a two-layer DBM in which we would like to split the modelling work between the 1st and 2nd-layer RBMs as 1/3 and 2/3. 
In this case, we train the first-layer RBM using one-step CD, but with the bottom-up weights constrained to be three times the top-down weights (see Fig. 3, right panel). The conditional distributions needed for CD learning take the form:

P(h(1)_j = 1|v) = 1 / (1 + exp(−Σ_i 3W(1)_ij v_i)),   P(v_i = 1|h(1)) = 1 / (1 + exp(−Σ_j W(1)_ij h(1)_j)).

Conversely, for the second modified RBM in the stack, the top-down weights are constrained to be 3/2 times the bottom-up weights. The conditional distributions take the form:

P(h(2)_l = 1|h(1)) = 1 / (1 + exp(−Σ_j 2W(2)_jl h(1)_j)),   P(h(1)_j = 1|h(2)) = 1 / (1 + exp(−Σ_l 3W(2)_jl h(2)_l)).

Note that this second-layer modified RBM simply approximates the proper RBM with three sets of replicated h(2) groups. In practice, this simple approximation works well compared to training a proper RBM, and is much easier to implement. When combining the RBMs into a two-layer DBM, we end up with W(1) and 2W(2) in the first and second layers, performing 1/3 and 2/3 of the modelling work respectively:

P(v; θ) = (1/Z(θ)) Σ_{h(1),h(2)} exp( v⊤W(1)h(1) + h(1)⊤2W(2)h(2) ).   (11)

The parameters of the entire model can then be generatively fine-tuned using the combination of the mean-field algorithm and the stochastic approximation algorithm described in Sec. 2.

4 Pretraining a Three-Layer Deep Boltzmann Machine

In the previous section, we showed that, provided we start with a two-layer DBM with tied weights, we can train the second-layer RBM in a way that is guaranteed to improve the variational bound. For a DBM with more than two layers, we have not been able to develop a pretraining algorithm that is guaranteed to improve a variational bound. However, the results of Sec. 
3 suggest that using simple modifications when pretraining a stack of RBMs allows us to approximately control the amount of modelling work done by each layer.

Consider learning a 3-layer DBM in which each layer is forced to perform approximately 1/3 of the modelling work. This can easily be accomplished by learning a stack of three modified RBMs. Similar to the two-layer model, we train the first-layer RBM using one-step CD, but with the bottom-up weights constrained to be three times the top-down weights (see Fig. 4). Two-thirds of this RBM's prior will be modelled by the 2nd and 3rd-layer RBMs.

Figure 4: Layer-wise pretraining of a 3-layer Deep Boltzmann Machine.

For the second modified RBM in the stack, we use 4W(2) bottom-up and 3W(2) top-down. We use 4W(2) bottom-up because we expect to replace half of the second RBM's prior by a third RBM, hence splitting the remaining 2/3 of the work equally between the top two layers. If we were to pretrain only a two-layer DBM, we would use 2W(2) bottom-up and 3W(2) top-down, as discussed in Sec. 3.4.

For the last RBM in the stack, we use 2W(3) bottom-up and 4W(3) top-down. 
When combining the three RBMs into a three-layer DBM, we end up with symmetric weights W(1), 2W(2), and 2W(3) in the first, second, and third layers, with each layer performing 1/3 of the modelling work:

P(v; θ) = (1/Z(θ)) Σ_h exp( v⊤W(1)h(1) + h(1)⊤2W(2)h(2) + h(2)⊤2W(3)h(3) ).   (12)

Algorithm 1: Greedy Pretraining Algorithm for a 3-layer Deep Boltzmann Machine
1: Train the 1st-layer "RBM" using one-step CD learning with mean-field reconstructions of the visible vectors. Constrain the bottom-up weights, 3W(1), to be three times the top-down weights, W(1).
2: Freeze 3W(1), which defines the 1st layer of features, and use samples h(1) from P(h(1)|v; 3W(1)) as the data for training the second RBM.
3: Train the 2nd-layer "RBM" using one-step CD learning with mean-field reconstructions of its visible vectors. Set the bottom-up weights to 4W(2) and the top-down weights to 3W(2).
4: Freeze 4W(2), which defines the 2nd layer of features, and use samples h(2) from P(h(2)|h(1); 4W(2)) as the data for training the next RBM.
5: Train the 3rd-layer "RBM" using one-step CD learning with mean-field reconstructions of its visible vectors. During the learning, set the bottom-up weights to 2W(3) and the top-down weights to 4W(3).
6: Use the weights {W(1), 2W(2), 2W(3)} to compose a three-layer Deep Boltzmann Machine.

The new pretraining procedure for a 3-layer DBM is shown in Alg. 1. Note that compared to the original algorithm, it requires almost no extra work and can be easily integrated into existing code. Extending it to DBMs with more layers is trivial. 
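The whole greedy procedure of Alg. 1 can be sketched end-to-end as follows. This is our own toy illustration, not the authors' implementation: the hyper-parameters, layer sizes, and the use of mean-field activities inside the CD step are assumptions, and biases are again omitted:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_asym(data, W, up, down, lr=0.05):
    """One-step CD for a modified "RBM" whose bottom-up weights are up*W and
    top-down weights are down*W, using mean-field reconstructions throughout."""
    h = sigmoid(up * (data @ W))
    v_recon = sigmoid(down * (h @ W.T))
    h_recon = sigmoid(up * (v_recon @ W))
    return W + lr * (data.T @ h - v_recon.T @ h_recon) / data.shape[0]

def pretrain_3layer(V, sizes, epochs=5, rng=np.random.default_rng(0)):
    """Greedy pretraining sketch following Alg. 1 on toy data V (rows = cases).
    sizes = (D, F1, F2, F3). Returns the composed DBM weights W1, 2*W2, 2*W3."""
    D, F1, F2, F3 = sizes
    W1 = 0.01 * rng.standard_normal((D, F1))
    W2 = 0.01 * rng.standard_normal((F1, F2))
    W3 = 0.01 * rng.standard_normal((F2, F3))
    for _ in range(epochs):                       # step 1: 3W up, W down
        W1 = cd1_asym(V, W1, up=3.0, down=1.0)
    # step 2: sample h(1) from P(h(1)|v; 3W(1)) as data for the next RBM
    H1 = (rng.random((V.shape[0], F1)) < sigmoid(3.0 * (V @ W1))).astype(float)
    for _ in range(epochs):                       # step 3: 4W up, 3W down
        W2 = cd1_asym(H1, W2, up=4.0, down=3.0)
    # step 4: sample h(2) from P(h(2)|h(1); 4W(2))
    H2 = (rng.random((H1.shape[0], F2)) < sigmoid(4.0 * (H1 @ W2))).astype(float)
    for _ in range(epochs):                       # step 5: 2W up, 4W down
        W3 = cd1_asym(H2, W3, up=2.0, down=4.0)
    return W1, 2.0 * W2, 2.0 * W3                 # step 6: compose the DBM
```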
As we show in our experimental results, this pretraining can improve the generative performance of Deep Boltzmann Machines.

5 Experimental Results

In our experiments we used the MNIST and NORB datasets. During greedy pretraining, each layer was trained for 100 epochs using one-step contrastive divergence. Generative fine-tuning of the full DBM model, using mean-field together with stochastic approximation, required 300 epochs. In order to estimate the variational lower bounds achieved by the different pretraining algorithms, we need to estimate the global normalization constant. Recently, [10] demonstrated that Annealed Importance Sampling (AIS) can be used to efficiently estimate the partition function of an RBM. We adopt AIS in our experiments as well. Together with variational inference, this allows us to obtain good estimates of the lower bound on the log-probability of the training and test data.

5.1 MNIST

The MNIST digit dataset contains 60,000 training and 10,000 test images of ten handwritten digits (0 to 9), with 28×28 pixels. In our first experiment, we considered a standard two-layer DBM with 500 and 1000 hidden units², and used two different algorithms for pretraining it. The first pretraining algorithm, which we call DBM-1/2-1/2, is the original algorithm for pretraining DBMs, as introduced by [7] (see Fig. 1). Here, the modelling work between the 1st and 2nd-layer RBMs is split equally. The second algorithm, DBM-1/3-2/3, uses the modified pretraining procedure of Sec. 3.4, so that the second RBM in the stack ends up doing 2/3 of the modelling work compared to the 1st-layer RBM.

Results are shown in Table 1. Prior to the global generative fine-tuning, the estimate of the lower bound on the average test log-probability for DBM-1/3-2/3 was −108.65 per test case, compared to −114.32 achieved by the standard pretraining algorithm DBM-1/2-1/2. 
The large difference of about 6 nats shows that leaving more of the modelling work to the second layer, which has a larger number of hidden units, substantially improves the variational bound. After the global generative fine-tuning, DBM-1/3-2/3 achieves a lower bound of −83.43, which is better than the −84.62 achieved by DBM-1/2-1/2. It is also better than the lower bound of −85.97 achieved by a carefully trained two-hidden-layer Deep Belief Network [10].

In our second experiment, we pretrained a 3-layer Deep Boltzmann Machine with 500, 500, and 1000 hidden units. The existing pretraining algorithm, DBM-1/2-1/4-1/4, approximately splits the modelling work between the three RBMs in the stack as 1/2, 1/4, 1/4, so the weights in the 1st-layer RBM perform half of the work compared to the higher-level RBMs. In contrast, the new pretraining procedure (see Alg. 1), which we call DBM-1/3-1/3-1/3, splits the modelling work equally across all three layers.

²These architectures have been considered before in [7, 9], which allows us to provide a direct comparison.

Table 1: MNIST: Estimating the lower bound on the average training and test log-probabilities for two DBMs: one with two layers (500 and 1000 hidden units), and the other with three layers (500, 500, and 1000 hidden units). 
Results are shown for various pretraining algorithms, followed by generative fine-tuning.

                           Pretraining            Generative Fine-Tuning
                           Train      Test        Train      Test
2 layers  DBM-1/2-1/2      −113.32    −114.32     −83.61     −84.62
          DBM-1/3-2/3      −107.89    −108.65     −82.83     −83.43
3 layers  DBM-1/2-1/4-1/4  −116.74    −117.38     −84.49     −85.10
          DBM-1/3-1/3-1/3  −107.12    −107.65     −82.34     −83.02

Table 2: NORB: Estimating the lower bound on the average training and test log-probabilities for two DBMs: one with two layers (1000 and 2000 hidden units), and the other with three layers (1000, 1000, and 2000 hidden units). Results are shown for various pretraining algorithms, followed by generative fine-tuning.

                           Pretraining            Generative Fine-Tuning
                           Train      Test        Train      Test
2 layers  DBM-1/2-1/2      −640.94    −643.87     −598.13    −601.76
          DBM-1/3-2/3      −633.21    −636.65     −593.76    −597.23
3 layers  DBM-1/2-1/4-1/4  −641.87    −645.06     −598.98    −602.84
          DBM-1/3-1/3-1/3  −632.75    −635.14     −592.87    −596.11

Table 1 shows that DBM-1/3-1/3-1/3 achieves a lower bound on the average test log-probability of −107.65, improving upon DBM-1/2-1/4-1/4's bound of −117.38. The difference of about 10 nats further demonstrates that during the pretraining stage, it is crucial to push more of the modelling work to the higher layers. After generative fine-tuning, the bound on the test log-probabilities for DBM-1/3-1/3-1/3 was −83.02, so with the new pretraining procedure, the three-hidden-layer DBM performs slightly better than the two-hidden-layer DBM. 
With the original pretraining procedure, the 3-layer DBM achieves a bound of −85.10, which is worse than the bound of −84.62 achieved by the 2-layer DBM, as reported by [7, 9].

5.2 NORB

The NORB dataset [4] contains images of 50 different 3D toy objects, with 10 objects in each of five generic classes: cars, trucks, planes, animals, and humans. Each object is photographed from different viewpoints and under various lighting conditions. The training set contains 24,300 stereo image pairs of 25 objects, 5 per class, while the test set contains 24,300 stereo pairs of the remaining, different 25 objects. From the training data, 4,300 cases were set aside for validation. To deal with raw pixel data, we followed the approach of [5] by first learning a Gaussian-binary RBM with 4000 hidden units, and then treating the activities of its hidden layer as preprocessed binary data.

Similar to the MNIST experiments, we trained two Deep Boltzmann Machines: one with two layers (1000 and 2000 hidden units), and the other with three layers (1000, 1000, and 2000 hidden units). Table 2 reveals that for both DBMs, the new pretraining achieves much better variational bounds on the average test log-probability. Even after the global generative fine-tuning, Deep Boltzmann Machines pretrained using the new algorithm improve upon standard DBMs by at least 5 nats.

6 Conclusion

In this paper we provided a better understanding of how the pretraining algorithms for Deep Belief Networks and Deep Boltzmann Machines are related, and used this understanding to develop a different method of pretraining. 
Unlike many of the existing pretraining algorithms for DBNs and DBMs, the new procedure can distribute the modelling work more evenly over the hidden layers. Our results on the MNIST and NORB datasets demonstrate that the new pretraining algorithm allows us to learn much better generative models.

Acknowledgments

This research was funded by NSERC, an Early Researcher Award, and gifts from Microsoft and Google. G.H. and R.S. are fellows of the Canadian Institute for Advanced Research.

References
[1] Y. Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2009.
[2] G. E. Hinton, S. Osindero, and Y. W. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, 2006.
[3] H. Larochelle, Y. Bengio, J. Louradour, and P. Lamblin. Exploring strategies for training deep neural networks. Journal of Machine Learning Research, 10:1–40, 2009.
[4] Y. LeCun, F. J. Huang, and L. Bottou. Learning methods for generic object recognition with invariance to pose and lighting. In CVPR (2), pages 97–104, 2004.
[5] V. Nair and G. E. Hinton. Implicit mixtures of restricted Boltzmann machines. In Advances in Neural Information Processing Systems, volume 21, 2009.
[6] M. A. Ranzato. Unsupervised Learning of Feature Hierarchies. PhD thesis, New York University, 2009.
[7] R. R. Salakhutdinov and G. E. Hinton. Deep Boltzmann machines. In Proceedings of the International Conference on Artificial Intelligence and Statistics, volume 12, 2009.
[8] R. R. Salakhutdinov and G. E. Hinton. An efficient learning procedure for Deep Boltzmann Machines. Neural Computation, 24:1967–2006, 2012.
[9] R. R. Salakhutdinov and H. Larochelle. Efficient learning of deep Boltzmann machines. In Proceedings of the International Conference on Artificial Intelligence and Statistics, volume 13, 2010.
[10] R. R. Salakhutdinov and I. Murray. 
On the quantitative analysis of deep belief networks. In Proceedings of the International Conference on Machine Learning, volume 25, pages 872–879, 2008.
[11] T. Tieleman. Training restricted Boltzmann machines using approximations to the likelihood gradient. In ICML. ACM, 2008.
[12] M. Welling and G. E. Hinton. A new learning algorithm for mean field Boltzmann machines. Lecture Notes in Computer Science, 2415, 2002.
[13] M. Welling and C. Sutton. Learning in Markov random fields with contrastive free energies. In International Workshop on AI and Statistics (AISTATS 2005), 2005.
[14] L. Younes. On the convergence of Markovian stochastic algorithms with rapidly decreasing ergodicity rates, 2000.
[15] A. L. Yuille. The convergence of contrastive divergences. In Advances in Neural Information Processing Systems, 2004.
", "award": [], "sourceid": 1178, "authors": [{"given_name": "Geoffrey", "family_name": "Hinton", "institution": null}, {"given_name": "Russ", "family_name": "Salakhutdinov", "institution": null}]}