{"title": "Iterative Neural Autoregressive Distribution Estimator NADE-k", "book": "Advances in Neural Information Processing Systems", "page_first": 325, "page_last": 333, "abstract": "Training of the neural autoregressive density estimator (NADE) can be viewed as doing one step of probabilistic inference on missing values in data. We propose a new model that extends this inference scheme to multiple steps, arguing that it is easier to learn to improve a reconstruction in $k$ steps rather than to learn to reconstruct in a single inference step. The proposed model is an unsupervised building block for deep learning that combines the desirable properties of NADE and multi-predictive training: (1) Its test likelihood can be computed analytically, (2) it is easy to generate independent samples from it, and (3) it uses an inference engine that is a superset of variational inference for Boltzmann machines. The proposed NADE-k is competitive with the state-of-the-art in density estimation on the two datasets tested.", "full_text": "Iterative Neural Autoregressive Distribution\n\nEstimator (NADE-k)\n\nTapani Raiko\nAalto University\n\nLi Yao\n\nUniversit\u00b4e de Montr\u00b4eal\n\nKyungHyun Cho\n\nUniversit\u00b4e de Montr\u00b4eal\n\nYoshua Bengio\n\nUniversit\u00b4e de Montr\u00b4eal,\nCIFAR Senior Fellow\n\nAbstract\n\nTraining of the neural autoregressive density estimator (NADE) can be viewed as\ndoing one step of probabilistic inference on missing values in data. We propose\na new model that extends this inference scheme to multiple steps, arguing that\nit is easier to learn to improve a reconstruction in k steps rather than to learn to\nreconstruct in a single inference step. 
The proposed model is an unsupervised\nbuilding block for deep learning that combines the desirable properties of NADE\nand multi-prediction training: (1) Its test likelihood can be computed analytically,\n(2) it is easy to generate independent samples from it, and (3) it uses an inference\nengine that is a superset of variational inference for Boltzmann machines. The\nproposed NADE-k is competitive with the state-of-the-art in density estimation\non the two datasets tested.\n\n1\n\nIntroduction\n\nTraditional building blocks for deep learning have some unsatisfactory properties. Boltzmann ma-\nchines are, for instance, dif\ufb01cult to train due to the intractability of computing the statistics of the\nmodel distribution, which leads to the potentially high-variance MCMC estimators during training\n(if there are many well-separated modes (Bengio et al., 2013)) and the computationally intractable\nobjective function. Autoencoders have a simpler objective function (e.g., denoising reconstruction\nerror (Vincent et al., 2010)), which can be used for model selection but not for the important choice\nof the corruption function. On the other hand, this paper follows up on the Neural Autoregressive\nDistribution Estimator (NADE, Larochelle and Murray, 2011), which specializes previous neural\nauto-regressive density estimators (Bengio and Bengio, 2000) and was recently extended (Uria et al.,\n2014) to deeper architectures. It is appealing because both the training criterion (just log-likelihood)\nand its gradient can be computed tractably and used for model selection, and the model can be\ntrained by stochastic gradient descent with backpropagation. However, it has been observed that the\nperformance of NADE has still room for improvement.\nThe idea of using missing value imputation as a training criterion has appeared in three recent pa-\npers. 
This approach can be seen either as training an energy-based model to impute missing values\nwell (Brakel et al., 2013), as training a generative probabilistic model to maximize a generalized\npseudo-log-likelihood (Goodfellow et al., 2013), or as training a denoising autoencoder with a mask-\ning corruption function (Uria et al., 2014). Recent work on generative stochastic networks (GSNs),\nwhich include denoising auto-encoders as special cases, justi\ufb01es dependency networks (Hecker-\nman et al., 2000) as well as generalized pseudo-log-likelihood (Goodfellow et al., 2013), but have\nthe disadvantage that sampling from the trained \u201cstochastic \ufb01ll-in\u201d model requires a Markov chain\n(repeatedly resampling some subset of the values given the others).\nIn all these cases, learning\nprogresses by back-propagating the imputation (reconstruction) error through inference steps of the\nmodel. This allows the model to better cope with a potentially imperfect inference algorithm. This\nlearning-to-cope was introduced recently in 2011 by Stoyanov et al. (2011) and Domke (2011).\n\n1\n\n\fFigure 1: The choice of a structure for NADE-k is very \ufb02exible. The dark \ufb01lled halves indicate that\na part of the input is observed and \ufb01xed to the observed values during the iterations. Left: Basic\nstructure corresponding to Equations (6\u20137) with n = 2 and k = 2. Middle: Depth added as in\nNADE by Uria et al. (2014) with n = 3 and k = 2. Right: Depth added as in Multi-Prediction Deep\nBoltzmann Machine by Goodfellow et al. (2013) with n = 2 and k = 3. The \ufb01rst two structures are\nused in the experiments.\n\nThe NADE model involves an ordering over the components of the data vector. The core of the\nmodel is the reconstruction of the next component given all the previous ones. 
In this paper we\nreinterpret the reconstruction procedure as a single iteration in a variational inference algorithm,\nand we propose a version where we use k iterations instead, inspired by (Goodfellow et al., 2013;\nBrakel et al., 2013). We evaluate the proposed model on two datasets and show that it outperforms\nthe original NADE (Larochelle and Murray, 2011) as well as NADE trained with the order-agnostic\ntraining algorithm (Uria et al., 2014).\n\n2 Proposed Method: NADE-k\n\nWe propose a probabilistic model called NADE-k for D-dimensional binary data vectors x. We start\nby defining $p_\\theta$ for imputing missing values using a fully factorial conditional distribution:\n\n$$p_\\theta(x_{\\mathrm{mis}} \\mid x_{\\mathrm{obs}}) = \\prod_{i \\in \\mathrm{mis}} p_\\theta(x_i \\mid x_{\\mathrm{obs}}), \\tag{1}$$\n\nwhere the subscripts mis and obs denote missing and observed components of x. From the conditional\ndistribution $p_\\theta$ we compute the joint probability distribution over x given an ordering o (a\npermutation of the integers from 1 to D) by\n\n$$p_\\theta(x \\mid o) = \\prod_{d=1}^{D} p_\\theta(x_{o_d} \\mid x_{o_{<d}}), \\tag{2}$$\n\nwhere $o_{<d}$ stands for indices $o_1 \\ldots o_{d-1}$.\nThe model is trained to minimize the negative log-likelihood averaged over all possible orderings o\n\n$$L(\\theta) = \\mathbb{E}_{o \\in D!} \\left[ \\mathbb{E}_{x \\in \\mathrm{data}} \\left[ -\\log p_\\theta(x \\mid o) \\right] \\right] \\tag{3}$$\n\nusing an unbiased, stochastic estimator of $L(\\theta)$\n\n$$\\hat{L}(\\theta) = -\\frac{D}{D - d + 1} \\log p_\\theta(x_{o_{\\geq d}} \\mid x_{o_{<d}}) \\tag{4}$$\n\nby drawing o uniformly from all D! possible orderings and d uniformly from 1 . . . D (Uria et al.,\n2014). Note that while the model definition in Eq. (2) is sequential in nature, the training criterion\n(4) involves reconstruction of all the missing values in parallel. 
In this way, training does not involve\npicking or following specific orders of indices.\nIn this paper, we define the conditional model $p_\\theta(x_{\\mathrm{mis}} \\mid x_{\\mathrm{obs}})$ using a deep feedforward neural\nnetwork with nk layers, where we use n weight matrices k times. This can also be interpreted as\nrunning k successive inference steps with an n-layer neural network.\nThe input to the network is\n\n$$v^{\\langle 0 \\rangle} = m \\odot \\mathbb{E}_{x \\in \\mathrm{data}}[x] + (1 - m) \\odot x \\tag{5}$$\n\nwhere m is a binary mask vector indicating missing components with 1, and $\\odot$ is an element-wise\nmultiplication. $\\mathbb{E}_{x \\in \\mathrm{data}}[x]$ is an empirical mean of the observations. For simplicity, we give\nequations for a simple structure with n = 2. See Fig. 1 (left) for the illustration of this simple\nstructure.\n\nFigure 2: The inner working mechanism of NADE-k. The leftmost column shows the data vectors x,\nthe second column shows their masked version and the subsequent columns show the reconstructions\n$v^{\\langle 0 \\rangle} \\ldots v^{\\langle 10 \\rangle}$ (See Eq. (7)).\n\nIn this case, the activations of the layers at the t-th step are\n\n$$h^{\\langle t \\rangle} = \\phi(W v^{\\langle t-1 \\rangle} + c) \\tag{6}$$\n$$v^{\\langle t \\rangle} = m \\odot \\sigma(V h^{\\langle t \\rangle} + b) + (1 - m) \\odot x \\tag{7}$$\n\nwhere $\\phi$ is an element-wise nonlinearity, $\\sigma$ is a logistic sigmoid function, and the iteration index t\nruns from 1 to k. The conditional probabilities of the variables (see Eq. (1)) are read from the output\n$v^{\\langle k \\rangle}$ as\n\n$$p_\\theta(x_i = 1 \\mid x_{\\mathrm{obs}}) = v_i^{\\langle k \\rangle}. \\tag{8}$$\n\nFig. 2 shows examples of how $v^{\\langle t \\rangle}$ evolves over iterations, with the trained model.\nThe parameters \u03b8 = {W, V, c, b} can be learned by stochastic gradient descent to minimize \u2212L(\u03b8)\nin Eq. 
(3), or its stochastic approximation \u2212 \u02c6L(\u03b8) in Eq. (4), with the stochastic gradient computed\nby back-propagation.\nOnce the parameters \u03b8 are learned, we can define a mixture model by using a uniform probability\nover a set of orderings O. We can compute the probability of a given vector x as a mixture model\n\n$$p_{\\mathrm{mixt}}(x \\mid \\theta, O) = \\frac{1}{|O|} \\sum_{o \\in O} p_\\theta(x \\mid o) \\tag{9}$$\n\nwith Eq. (2). We can draw independent samples from the mixture by first drawing an ordering o and\nthen sequentially drawing each variable using $x_{o_d} \\sim p_\\theta(x_{o_d} \\mid x_{o_{<d}})$. Furthermore, we can draw\nsamples from the conditional $p(x_{\\mathrm{mis}} \\mid x_{\\mathrm{obs}})$ easily by considering only orderings where the observed\nindices appear before the missing ones.\nPretraining It is well known that training deep networks is difficult without pretraining, and in our\nexperiments, we train networks up to kn = 7\u00d73 = 21 layers. When pretraining, we train the model\nto produce good reconstructions $v^{\\langle t \\rangle}$ at each step t = 1 . . . k. More formally, in the pretraining\nphase, we replace Equations (4) and (8) by\n\n$$\\hat{L}_{\\mathrm{pre}}(\\theta) = -\\frac{D}{D - d + 1} \\frac{1}{k} \\sum_{t=1}^{k} \\log \\prod_{i \\in o_{\\geq d}} p_\\theta^{\\langle t \\rangle}(x_i \\mid x_{o_{<d}}) \\tag{10}$$\n$$p_\\theta^{\\langle t \\rangle}(x_i = 1 \\mid x_{\\mathrm{obs}}) = v_i^{\\langle t \\rangle}. \\tag{11}$$\n\n2.1 Related Methods and Approaches\n\nOrder-agnostic NADE The proposed method follows closely the order-agnostic version of\nNADE (Uria et al., 2014), which may be considered as the special case of NADE-k with k = 1. On\nthe other hand, NADE-k can be seen as a deep NADE with some specific weight sharing (matrices\nW and V are reused for different depths) and gating in the activations of some layers (See Equation\n(7)).\n\nAdditionally, Uria et al. 
(2014) found it crucial to give the mask m as an auxiliary input to the\nnetwork, and initialized missing values to zero instead of the empirical mean (See Eq. (5)). Due to\nthese differences, we call their approach NADE-mask. One should note that NADE-mask has more\nparameters due to using the mask as a separate input to the network, whereas NADE-k is roughly k\ntimes more expensive to compute.\nProbabilistic Inference Let us consider the task of missing value imputation in a probabilistic\nlatent variable model. We get the conditional probability of interest by marginalizing out the latent\nvariables from the posterior distribution:\n\n$$p(x_{\\mathrm{mis}} \\mid x_{\\mathrm{obs}}) = \\int_{h} p(h, x_{\\mathrm{mis}} \\mid x_{\\mathrm{obs}}) \\, dh. \\tag{12}$$\n\nAccessing the joint distribution $p(h, x_{\\mathrm{mis}} \\mid x_{\\mathrm{obs}})$ directly is often harder than alternately updating\nh and $x_{\\mathrm{mis}}$ based on the conditional distributions $p(h \\mid x_{\\mathrm{mis}}, x_{\\mathrm{obs}})$ and $p(x_{\\mathrm{mis}} \\mid h)$ (see footnote 1). Variational\ninference is one of the representative examples that exploit this.\nIn variational inference, a factorial distribution $q(h, x_{\\mathrm{mis}}) = q(h) q(x_{\\mathrm{mis}})$ is iteratively fitted to\n$p(h, x_{\\mathrm{mis}} \\mid x_{\\mathrm{obs}})$ such that the KL-divergence between q and p\n\n$$\\mathrm{KL}[q(h, x_{\\mathrm{mis}}) \\,\\|\\, p(h, x_{\\mathrm{mis}} \\mid x_{\\mathrm{obs}})] = -\\int_{h, x_{\\mathrm{mis}}} q(h, x_{\\mathrm{mis}}) \\log \\left[ \\frac{p(h, x_{\\mathrm{mis}} \\mid x_{\\mathrm{obs}})}{q(h, x_{\\mathrm{mis}})} \\right] dh \\, dx_{\\mathrm{mis}} \\tag{13}$$\n\nis minimized. The algorithm alternates between updating q(h) and $q(x_{\\mathrm{mis}})$, while considering the\nother one fixed.\nAs an example, let us consider a restricted Boltzmann machine (RBM) defined by\n\n$$p(v, h) \\propto \\exp(b^\\top v + c^\\top h + h^\\top W v). \\tag{14}$$\n\nWe can fit an approximate posterior distribution parameterized as $q(v_i = 1) = \\bar{v}_i$ and $q(h_j = 1) = \\bar{h}_j$\nto the true posterior distribution by iteratively computing\n\n$$\\bar{h} \\leftarrow \\sigma(W \\bar{v} + c) \\tag{15}$$\n$$\\bar{v} \\leftarrow m \\odot \\sigma(W^\\top h + b) + (1 - m) \\odot v. \\tag{16}$$\n\nWe notice the similarity to Eqs. 
(6)\u2013(7): If we assume \u03c6 = \u03c3 and V = W(cid:62), the inference in the\nNADE-k is equivalent to performing k iterations of variational inference on an RBM for the missing\nvalues (Peterson and Anderson, 1987). We can also get variational inference on a deep Boltzmann\nmachine (DBM) using the structure in Fig. 1 (right).\nMulti-Prediction Deep Boltzmann Machine Goodfellow et al. (2013) and Brakel et al. (2013)\nuse backpropagation through variational inference steps to train a deep Boltzmann machine. This\nis very similar to our work, except that they approach the problem from the view of maximizing\nthe generalized pseudo-likelihood (Huang and Ogata, 2002). Also, the deep Boltzmann machine\nlacks the tractable probabilistic interpretation similar to NADE-k (See Eq. (2)) that would allow\nto compute a probability or to generate independent samples without resorting to a Markov chain.\nAlso, our approach is somewhat more \ufb02exible in the choice of model structures, as can be seen in\nFig. 1. For instance, in the proposed NADE-k, encoding and decoding weights do not have to be\nshared and any type of nonlinear activations, other than a logistic sigmoid function, can be used.\nProduct and Mixture of Experts One could ask what would happen if we would de\ufb01ne an ensemble\nlikelihood along the line of the training criterion in Eq. (3). That is,\n\n\u2212 log pprod(x | \u03b8) \u221d Eo\u2208D! [\u2212 log p(x | \u03b8, o)] .\n\n(17)\nMaximizing this ensemble likelihood directly will correspond to training a product-of-experts\nmodel (Hinton, 2000). 
However, this requires us to evaluate the intractable normalization constant\nduring training as well as in the inference, making the model not tractable anymore.\nOn the other hand, we may consider using the log-probability of a sample under the mixture-of-experts\nmodel as the training criterion\n\n$$-\\log p_{\\mathrm{mixt}}(x \\mid \\theta) = -\\log \\mathbb{E}_{o \\in D!} [p(x \\mid \\theta, o)]. \\tag{18}$$\n\nThis criterion resembles clustering, where individual models may specialize in only a fraction of the\ndata. In this case, however, the simple estimator such as in Eq. (4) would not be available.\n\n1 We make a typical assumption that observations are mutually independent given the latent variables.\n\nModel | Log-Prob.\nNADE 1HL (fixed order) | -88.86\nNADE 1HL | -99.37\nNADE 2HL | -95.33\nNADE-mask 1HL | -92.17\nNADE-mask 2HL | -89.17\nNADE-mask 4HL | -89.60\nEoNADE-mask 1HL (128 Ords) | -87.71\nEoNADE-mask 2HL (128 Ords) | -85.10\nRBM (500h, CD-25) | \u2248 -86.34\nDBN (500h+2000h) | \u2248 -84.55\nDARN (500h) | \u2248 -84.71\nDARN (500h, adaNoise) | \u2248 -84.13\nNADE-5 1HL | -90.02\nNADE-5 2HL | -87.14\nEoNADE-5 1HL (128 Ords) | -86.23\nEoNADE-5 2HL (128 Ords) | -84.68\n\nTable 1: Results obtained on MNIST using various models and number of hidden layers (1HL\nor 2HL). \u201cOrds\u201d is short for \u201corderings\u201d. These are the average log-probabilities of the test set.\nEoNADE refers to the ensemble probability (See Eq. (9)). From here on, in all figures and tables we\nuse \u201cHL\u201d to denote the number of hidden layers and \u201ch\u201d for the number of hidden units.\n\n3 Experiments\n\nWe study the proposed model with two datasets: binarized MNIST handwritten digits and Caltech-101\nsilhouettes.\nWe train NADE-k with one or two hidden layers (n = 2 and n = 3, see Fig. 1, left and middle)\nwith a hyperbolic tangent as the activation function \u03c6(\u00b7). 
We use stochastic gradient descent on\nthe training set with a minibatch size fixed to 100. We use AdaDelta (Zeiler, 2012) to adaptively\nchoose a learning rate for each parameter update on-the-fly. We use the validation set for early\nstopping and to select the hyperparameters. With the best model on the validation set, we report the\nlog-probability computed on the test set. We have made our implementation available (see footnote 2).\n\n3.1 MNIST\n\nWe closely followed the procedure used by Uria et al. (2014), including the split of the dataset into\n50,000 training samples, 10,000 validation samples and 10,000 test samples. We used the same\nversion where the data has been binarized by sampling.\nWe used a fixed width of 500 units per hidden layer. The number of steps k was selected among\n{1, 2, 4, 5, 7}. According to our preliminary experiments, we found that no separate regularization\nwas needed when using a single hidden layer, but in case of two hidden layers, we used weight\ndecay with the regularization constant in the interval $[e^{-5}, e^{-2}]$. Each model was pretrained for\n1000 epochs and fine-tuned for 1000 epochs in the case of one hidden layer and 2000 epochs in the\ncase of two.\nFor both NADE-k with one and two hidden layers, the validation performance was best with k = 5.\nThe regularization constant was chosen to be 0.00122 for the two-hidden-layer model.\nResults We report in Table 1 the mean of the test log-probabilities averaged over randomly selected\norderings. We also show the experimental results by others from (Uria et al., 2014; Gregor et al.,\n2014). We denote the model proposed in (Uria et al., 2014) as a NADE-mask.\nFrom Table 1, it is clear that NADE-k outperforms the corresponding NADE-mask both with the\nindividual orderings and ensembles over orderings using either 1 or 2 hidden layers. 
NADE-k with\ntwo hidden layers achieved the generative performance comparable to that of the deep belief network\n(DBN) with two hidden layers.\nFig. 3 shows training curves for some of the models. We can see that the NADE-1 does not perform\nas well as NADE-mask. This confirms that in the case of k = 1, the auxiliary mask input is indeed\nuseful. Also, we can note that the performance of NADE-5 is still improving at the end of the\npreallocated 2000 epochs, further suggesting that it may be possible to obtain a better performance\nsimply by training longer.\n\n2 git@github.com:yaoli/nade k.git\n\nFigure 3: NADE-k with k steps of variational inference helps to reduce the training cost (a) and to\ngeneralize better (b). NADE-mask performs better than NADE-1 without masks both in training and\ntest.\n\nFigure 4: (a) The generalization performance of different NADE-k models trained with different k.\n(b) The generalization performance of NADE-5 2HL, trained with k=5, but with various k at test time.\n\nFig. 4 (a) shows the effect of the number of iterations k during training. Already with k = 2, we can\nsee that the NADE-k outperforms its corresponding NADE-mask. The performance increases until\nk = 5. We believe the worse performance of k = 7 is due to the well known training difficulty of a\ndeep neural network, considering that NADE-7 with two hidden layers effectively is a deep neural\nnetwork with 21 layers.\nAt inference time, we found that it is important to use the exact k that one used to train the model.\nAs can be seen from Fig. 4 (b), the assigned probability increases up to the training k, but starts decreasing\nas the number of iterations goes beyond it (see footnote 3).\n\n3.1.1 Qualitative Analysis\nIn Fig. 2, we present how each iteration t = 1 . . . k improves the corrupted input ($v^{\\langle t \\rangle}$ from Eq. (5)).\nWe also investigate what happens with test-time k being larger than the training k = 5. 
We can see\nthat in all cases, the iteration, which is a fixed point update, seems to converge to a point that is\nin most cases close to the ground-truth sample. Fig. 4 (b) shows however that the generalization\nperformance drops after k = 5 when training with k = 5. From Fig. 2, we can see that the\nreconstruction continues to sharpen even after k = 5, which seems to be the underlying reason\nfor this phenomenon.\n\n3 In the future, one could explore possibilities for improving convergence beyond step k, for instance by\nusing costs based on reconstructions at k \u2212 1 and k even in the fine-tuning phase.\n\n[Plots for Figs. 3 and 4: Fig. 3 (a) training cost and (b) test-set log-probability versus training epochs for NADE-mask 1HL, NADE-5 1HL and NADE-1 1HL, with the end of pretraining marked; Fig. 4 (a) test-set log-probability versus the number of iteration steps k used in training for NADE-k and NADE-mask with 1HL and 2HL, and (b) test-set log-probability versus the number of iteration steps k performed at test time for NADE-5 2HL and NADE-mask 2HL.]\n\nFigure 5: Samples generated from NADE-k trained on (a) MNIST and (b) Caltech-101 Silhouettes.\n\nFigure 6: Filters learned from NADE-5 2HL. (a) A random subset of the encoding filters. (b) A\nrandom subset of the decoding filters.\n\nFrom the samples generated from the trained NADE-5 with two hidden layers shown in Fig. 5 (a),\nwe can see that the model is able to generate digits. Furthermore, the filters learned by the model\nshow that it has learned parts of digits such as pen strokes (See Fig. 
6).\n\n3.1.2 Variability over Orderings\nIn Section 2, we argued that we can perform any inference task $p(x_{\\mathrm{mis}} \\mid x_{\\mathrm{obs}})$ easily and efficiently\nby restricting the set of orderings O in Eq. (9) to ones where $x_{\\mathrm{obs}}$ is before $x_{\\mathrm{mis}}$. For this to work\nwell, we should investigate how much the different orderings vary.\nTo measure the variability over orderings, we computed the variance of log p(x | o) for 128 randomly\nchosen orderings o with the trained NADE-k\u2019s and NADE-mask with a single hidden layer.\nFor comparison, we computed the variance of log p(x | o) over the 10,000 test samples.\n\nlog p(x | o) | $\\mathbb{E}_{o,x}[\\cdot]$ | $\\sqrt{\\mathbb{E}_x \\mathrm{Var}_o[\\cdot]}$ | $\\sqrt{\\mathbb{E}_o \\mathrm{Var}_x[\\cdot]}$\nNADE-mask 1HL | -92.17 | 3.5 | 23.5\nNADE-5 1HL | -90.02 | 3.1 | 24.2\nNADE-5 2HL | -87.14 | 2.4 | 22.7\n\nTable 2: The variance of log p(x | o) over orderings o and over test samples x.\n\nIn Table 2, the variability over the orderings is clearly much smaller than that over the samples.\nFurthermore, the variability over orderings tends to decrease with the better models.\n\n3.2 Caltech-101 silhouettes\n\nWe also evaluate the proposed NADE-k on Caltech-101 Silhouettes (Marlin et al., 2010), using\nthe standard split of 4100 training samples, 2264 validation samples and 2307 test samples. We\ndemonstrate the advantage of NADE-k compared with NADE-mask under the constraint that they\nhave a matching number of parameters.\nIn particular, we compare NADE-k with 1000 hidden\nunits with NADE-mask with 670 hidden units. We also compare NADE-k with 4000 hidden units with\nNADE-mask with 2670 hidden units.\nWe optimized the hyper-parameter k \u2208 {1, 2, . . . , 10} in the case of NADE-k. In both NADE-k\nand NADE-mask, we experimented without regularization, with weight decay, or with dropout.\nUnlike the previous experiments, we did not use the pretraining scheme (See Eq. (10)).\n\nTable 3: Average log-probabilities of test samples of Caltech-101 Silhouettes. 
(*) The results are\nfrom Cho et al. (2013). The terms in parentheses indicate the number of hidden units, the total\nnumber of parameters (M for million), and the L2 regularization coefficient. NADE-mask 670h\nachieves the best performance without any regularizations.\n\nModel | Test LL\nRBM* (2000h, 1.57M) | -108.98\nNADE-mask (670h, 1.58M) | -112.51\nNADE-2 (1000h, 1.57M, L2=0.0054) | -108.81\nRBM* (4000h, 3.14M) | -107.78\nNADE-mask (2670h, 6.28M, L2=0.00106) | -110.95\nNADE-5 (4000h, 6.28M, L2=0.0068) | -107.28\n\nAs we can see from Table 3, NADE-k outperforms the NADE-mask regardless of the number of\nparameters. In addition, NADE-2 with 1000 hidden units matches the performance of an RBM with\nthe same number of parameters. Furthermore, NADE-5 has outperformed the previous best result\nobtained with the RBMs in (Cho et al., 2013), achieving the state-of-the-art result on this dataset. We\ncan see from the samples generated by the NADE-k shown in Fig. 5 (b) that the model has learned\nthe data well.\n\n4 Conclusions and Discussion\n\nIn this paper, we proposed a model called iterative neural autoregressive distribution estimator\n(NADE-k) that extends the conventional neural autoregressive distribution estimator (NADE) and its\norder-agnostic training procedure. The proposed NADE-k maintains the tractability of the original\nNADE while we showed that it outperforms the original NADE as well as similar, but intractable\ngenerative models such as restricted Boltzmann machines and deep belief networks.\nThe proposed extension is inspired by the variational inference in probabilistic models such as\nrestricted Boltzmann machines (RBM) and deep Boltzmann machines (DBM). 
Just like an iterative\nmean-field approximation in Boltzmann machines, the proposed NADE-k performs multiple iterations\nthrough hidden layers and a visible layer to infer the probability of the missing value, unlike\nthe original NADE which performs the inference of a missing value in a single iteration through\nhidden layers.\nOur empirical results show that this approach of multiple iterations improves the performance of\na model that has the same number of parameters, compared to performing a single iteration. This\nsuggests that the inference method has a significant effect on the efficiency of utilizing the model\nparameters. Also, we were able to observe that the generative performance of NADE can come\nclose to more sophisticated models such as deep belief networks in our approach.\nIn the future, more in-depth analysis of the proposed NADE-k is needed. For instance, the relationship\nbetween NADE-k and related models such as the RBM needs to be studied both theoretically and\nempirically. The computational speed of the method could be improved both in training (by\nusing better optimization algorithms; see, e.g., Pascanu and Bengio, 2014) and in testing (e.g. by\nhandling the components in chunks rather than fully sequentially). The computational efficiency of\nsampling for NADE-k can be further improved based on the recent work of Yao et al. (2014) where\nan annealed Markov chain may be used to efficiently generate samples from the trained ensemble.\nAnother promising idea to improve the model performance further is to let the model adjust its own\nconfidence based on d. For instance, in the top right corner of Fig. 
2, we see a case with lots of missing\nvalues (low d), where the model is too confident about the reconstructed digit 8 instead\nof the correct digit 2.\n\nAcknowledgements\n\nThe authors would like to acknowledge the support of NSERC, Calcul Qu\u00e9bec, Compute Canada,\nthe Canada Research Chair and CIFAR, and developers of Theano (Bergstra et al., 2010; Bastien\net al., 2012).\n\nReferences\nBastien, F., Lamblin, P., Pascanu, R., Bergstra, J., Goodfellow, I. J., Bergeron, A., Bouchard, N.,\nand Bengio, Y. (2012). Theano: new features and speed improvements. Deep Learning and\nUnsupervised Feature Learning NIPS 2012 Workshop.\n\nBengio, Y. and Bengio, S. (2000). Modeling high-dimensional discrete data with multi-layer neural\nnetworks. In NIPS\u201999, pages 400\u2013406. MIT Press.\n\nBengio, Y., Mesnil, G., Dauphin, Y., and Rifai, S. (2013). Better mixing via deep representations.\nIn Proceedings of the 30th International Conference on Machine Learning (ICML\u201913). ACM.\n\nBergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J.,\nWarde-Farley, D., and Bengio, Y. (2010). Theano: a CPU and GPU math expression compiler. In\nProceedings of the Python for Scientific Computing Conference (SciPy). Oral Presentation.\n\nBrakel, P., Stroobandt, D., and Schrauwen, B. (2013). Training energy-based models for time-series\nimputation. The Journal of Machine Learning Research, 14(1), 2771\u20132797.\n\nCho, K., Raiko, T., and Ilin, A. (2013). Enhanced gradient for training restricted boltzmann\nmachines. Neural computation, 25(3), 805\u2013831.\n\nDomke, J. (2011). Parameter learning with truncated message-passing. In Computer Vision and\nPattern Recognition (CVPR), 2011 IEEE Conference on, pages 2937\u20132943. IEEE.\n\nGoodfellow, I., Mirza, M., Courville, A., and Bengio, Y. (2013). Multi-prediction deep boltzmann\nmachines. 
In Advances in Neural Information Processing Systems, pages 548\u2013556.\n\nGregor, K., Danihelka, I., Mnih, A., Blundell, C., and Wierstra, D. (2014). Deep autoregressive\nnetworks. In International Conference on Machine Learning (ICML\u20192014).\n\nHeckerman, D., Chickering, D. M., Meek, C., Rounthwaite, R., and Kadie, C. (2000). Dependency\nnetworks for inference, collaborative filtering, and data visualization. Journal of Machine\nLearning Research, 1, 49\u201375.\n\nHinton, G. E. (2000). Training products of experts by minimizing contrastive divergence. Technical\nReport GCNU TR 2000-004, Gatsby Unit, University College London.\n\nHuang, F. and Ogata, Y. (2002). Generalized pseudo-likelihood estimates for Markov random fields\non lattice. Annals of the Institute of Statistical Mathematics, 54(1), 1\u201318.\n\nLarochelle, H. and Murray, I. (2011). The neural autoregressive distribution estimator. Journal of\nMachine Learning Research, 15, 29\u201337.\n\nMarlin, B., Swersky, K., Chen, B., and de Freitas, N. (2010). Inductive principles for restricted\nBoltzmann machine learning. In Proceedings of The Thirteenth International Conference on\nArtificial Intelligence and Statistics (AISTATS\u201910), volume 9, pages 509\u2013516.\n\nPascanu, R. and Bengio, Y. (2014). Revisiting natural gradient for deep networks. In International\nConference on Learning Representations 2014 (Conference Track).\n\nPeterson, C. and Anderson, J. R. (1987). A mean field theory learning algorithm for neural networks.\nComplex Systems, 1(5), 995\u20131019.\n\nStoyanov, V., Ropson, A., and Eisner, J. (2011). Empirical risk minimization of graphical model\nparameters given approximate inference, decoding, and model structure. In International Conference\non Artificial Intelligence and Statistics, pages 725\u2013733.\n\nUria, B., Murray, I., and Larochelle, H. (2014). A deep and tractable density estimator. 
In Proceedings\nof the 30th International Conference on Machine Learning (ICML\u201914).\n\nVincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. (2010). Stacked denoising\nautoencoders: Learning useful representations in a deep network with a local denoising criterion.\nJ. Machine Learning Res., 11.\n\nYao, L., Ozair, S., Cho, K., and Bengio, Y. (2014). On the equivalence between deep nade and generative\nstochastic networks. In European Conference on Machine Learning (ECML/PKDD\u201914).\nSpringer.\n\nZeiler, M. D. (2012). ADADELTA: an adaptive learning rate method. Technical report, arXiv\n1212.5701.\n", "award": [], "sourceid": 239, "authors": [{"given_name": "Tapani", "family_name": "Raiko", "institution": "Aalto University"}, {"given_name": "Yao", "family_name": "Li", "institution": "University of Montreal"}, {"given_name": "Kyunghyun", "family_name": "Cho", "institution": "Universit\u00e9 de Montr\u00e9al"}, {"given_name": "Yoshua", "family_name": "Bengio", "institution": "University of Montreal"}]}