{"title": "Variational Bayes on Monte Carlo Steroids", "book": "Advances in Neural Information Processing Systems", "page_first": 3018, "page_last": 3026, "abstract": "Variational approaches are often used to approximate intractable posteriors or normalization constants in hierarchical latent variable models. While often effective in practice, it is known that the approximation error can be arbitrarily large. We propose a new class of bounds on the marginal log-likelihood of directed latent variable models. Our approach relies on random projections to simplify the posterior. In contrast to standard variational methods, our bounds are guaranteed to be tight with high probability. We provide a new approach for learning latent variable models based on optimizing our new bounds on the log-likelihood. We demonstrate empirical improvements on benchmark datasets in vision and language for sigmoid belief networks, where a neural network is used to approximate the posterior.", "full_text": "Variational Bayes on Monte Carlo Steroids\n\nAditya Grover, Stefano Ermon\nDepartment of Computer Science\n\nStanford University\n\n{adityag,ermon}@cs.stanford.edu\n\nAbstract\n\nVariational approaches are often used to approximate intractable posteriors or nor-\nmalization constants in hierarchical latent variable models. While often effective\nin practice, it is known that the approximation error can be arbitrarily large. We\npropose a new class of bounds on the marginal log-likelihood of directed latent\nvariable models. Our approach relies on random projections to simplify the poste-\nrior. In contrast to standard variational methods, our bounds are guaranteed to be\ntight with high probability. We provide a new approach for learning latent variable\nmodels based on optimizing our new bounds on the log-likelihood. 
We demonstrate\nempirical improvements on benchmark datasets in vision and language for sigmoid\nbelief networks, where a neural network is used to approximate the posterior.\n\n1 Introduction\n\nHierarchical models with multiple layers of latent variables are emerging as a powerful class of\ngenerative models of data in a range of domains, from images to text [1, 18]. The great\nexpressive power of these models, however, comes at a signi\ufb01cant computational cost. Inference and\nlearning are typically very dif\ufb01cult, often involving intractable posteriors or normalization constants.\nThe key challenge in learning latent variable models is to evaluate the marginal log-likelihood\nof the data and optimize it over the parameters. The marginal log-likelihood is generally non-\nconvex and intractable to compute, as it requires marginalizing over the unobserved variables.\nExisting approaches rely on Monte Carlo [12] or variational methods [2] to approximate this integral.\nVariational approximations are particularly suitable for directed models, because they directly provide\ntractable lower bounds on the marginal log-likelihood.\nVariational Bayes approaches use variational lower bounds as a tractable proxy for the true marginal\nlog-likelihood. While optimizing a lower bound is a reasonable strategy, the true marginal log-\nlikelihood of the data is not necessarily guaranteed to improve.\nIn fact, it is well known that\nvariational bounds can be arbitrarily loose. Intuitively, dif\ufb01culties arise when the approximating\nfamily of tractable distributions is too simple and cannot capture the complexity of the (intractable)\nposterior, no matter how well the variational parameters are chosen.\nIn this paper, we propose a new class of marginal log-likelihood approximations for directed latent\nvariable models with discrete latent units that are guaranteed to be tight, assuming an optimal choice\nfor the variational parameters. 
Our approach uses a recently introduced class of random projections\n[7, 15] to improve the approximation achieved by a standard variational approximation such as\nmean-\ufb01eld. Intuitively, our approach relies on a sequence of random projections to simplify the\nposterior, without losing too much information at each step, until it becomes easy to approximate\nwith a mean-\ufb01eld distribution.\nWe provide a novel learning framework for directed, discrete latent variable models based on\noptimizing this new lower bound. Our approach jointly optimizes the parameters of the generative\nmodel and the variational parameters of the approximating model using stochastic gradient descent\n(SGD). We demonstrate an application of this approach to sigmoid belief networks, where neural\nnetworks are used to specify both the generative model and the family of approximating distributions.\nWe use a new stochastic, sampling based approximation of the variational projected bound, and show\nempirically that by employing random projections we are able to signi\ufb01cantly improve the marginal\nlog-likelihood estimates.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nOverall, our paper makes the following contributions:\n\n1. We extend [15], deriving new (tight) stochastic bounds for the marginal log-likelihood of\ndirected, discrete latent variable models.\n\n2. We develop a \u201cblack-box\u201d [23] random-projection based algorithm for learning and inference\nthat is applicable beyond the exponential family and does not require deriving potentially\ncomplex updates or gradients by hand.\n\n3. 
We demonstrate the superior performance of our algorithm on sigmoid belief networks\nwith discrete latent variables in which a highly expressive neural network approximates the\nposterior and optimization is done using an SGD variant [16].\n\n2 Background setup\n\nLet p\u03b8(X, Z) denote the joint probability distribution of a directed latent variable model parameterized\nby \u03b8. Here, X = {X_i}_{i=1}^m represents the observed random variables which are explained through a\nset of latent variables Z = {Z_i}_{i=1}^n. In general, X and Z can be discrete or continuous. Our learning\nframework assumes discrete latent variables Z whereas X can be discrete or continuous.\nLearning latent variable models based on the maximum likelihood principle involves an intractable\nmarginalization over the latent variables. There are two complementary approaches to learning latent\nvariable models based on approximate inference which we discuss next.\n\n2.1 Learning based on amortized variational inference\n\nIn variational inference, given a data point x, we introduce a distribution q\u03c6(z) parametrized by\na set of variational parameters \u03c6. Using Jensen\u2019s inequality, we can lower bound the marginal\nlog-likelihood of x as an expectation with respect to q.\n\n$$\\log p_\\theta(x) = \\log \\sum_z p_\\theta(x, z) = \\log \\sum_z q_\\phi(z) \\cdot \\frac{p_\\theta(x, z)}{q_\\phi(z)} \\geq \\sum_z q_\\phi(z) \\cdot \\log \\frac{p_\\theta(x, z)}{q_\\phi(z)} = E_q[\\log p_\\theta(x, z) - \\log q_\\phi(z)]. \\qquad (1)$$\n\nThe evidence lower bound (ELBO) above is tight when q\u03c6(z) = p\u03b8(z|x). 
Therefore, variational\ninference can be seen as a problem of computing the parameters \u03c6 from an approximating family of\ndistributions Q such that the ELBO can be evaluated ef\ufb01ciently and the approximate posterior over\nthe latent variables is close to the true posterior.\nIn the setting we consider, we only have access to samples x \u223c p\u03b8(x) from the underlying distribution.\nFurther, we can amortize the cost of inference by learning a single data-dependent variational posterior\nq\u03c6(z|x) [9]. This increases the generalization strength of our approximate posterior and speeds up\ninference at test time. Hence, learning using amortized variational inference optimizes the average\nELBO (across all x) jointly over the model parameters (\u03b8) as well as the variational parameters (\u03c6).\n\n2.2 Learning based on importance sampling\n\nA tighter lower bound of the log-likelihood can be obtained using importance sampling (IS) [4]. From\nthis perspective, we view q\u03c6(z|x) as a proposal distribution and optimize the following lower bound:\n\n$$\\log p_\\theta(x) \\geq E_q\\left[\\log \\frac{1}{S} \\sum_{i=1}^{S} \\frac{p_\\theta(x, z_i)}{q_\\phi(z_i|x)}\\right] \\qquad (2)$$\n\nwhere each of the S samples is drawn from q\u03c6(z|x). The IS estimate reduces to the variational\nobjective for S = 1 in Eq. (1). From Theorem 1 of [4], the IS estimate is also a lower bound to the\ntrue log-likelihood of a model and is asymptotically unbiased under mild conditions. Furthermore,\nincreasing S will never lead to a weaker lower bound.\n\n3 Learning using random projections\n\nComplex data distributions are well represented by generative models that are \ufb02exible and have many\nmodes. Even though the posterior is generally much more peaked than the prior, learning a model\nwith multiple modes can help represent arbitrary structure and supports multiple explanations for the\nobserved data. 
This largely explains the empirical success of deep models for representational learning,\nwhere the number of modes grows nearly exponentially with the number of hidden layers [1, 22].\nSampling-based estimates for the marginal log-likelihood in Eq. (1) and Eq. (2) have high variance,\nbecause they might \u201cmiss\u201d important modes of the distribution. Increasing S helps but one might\nneed an extremely large number of samples to cover the entire posterior if it is highly multi-modal.\n\n3.1 Exponential sampling\n\nOur key idea is to use random projections [7, 15, 28], a hash-based inference scheme that can\nef\ufb01ciently sample an exponentially large number of latent variable con\ufb01gurations from the posterior.\nIntuitively, instead of sampling a single latent con\ufb01guration each time, we sample (exponentially\nlarge) buckets of con\ufb01gurations de\ufb01ned implicitly as the solutions to randomly generated constraints.\nFormally, let P be the set of all posterior distributions de\ufb01ned over z \u2208 {0, 1}^n conditioned on x (see footnote 1).\nA random projection R^k_{A,b} : P \u2192 P is a family of operators speci\ufb01ed by A \u2208 {0, 1}^{k\u00d7n}, b \u2208 {0, 1}^k\nfor a k \u2208 {0, 1, . . . , n}. Each operator maps the posterior distribution p\u03b8(z|x) to another distribution\nR^k_{A,b}[p\u03b8(z|x)] with probability mass proportional to p\u03b8(z|x) and a support set restricted to {z : Az =\nb mod 2}. When A, b are chosen uniformly at random, this de\ufb01nes a family of pairwise independent\nhash functions H = {h_{A,b}(z) : {0, 1}^n \u2192 {0, 1}^k} where h_{A,b}(z) = Az + b mod 2. See [7, 27]\nfor details.\nThe constraints on the space of assignments of z can be viewed as parity (XOR) constraints. The\nrandom projection reduces the dimensionality of the problem in the sense that a subset of k variables\nbecomes a deterministic function of the remaining n \u2212 k. 
(See footnote 2.) By uniformly randomizing over the choice\nof the constraints, we can extend similar results from [28] to get the following expressions for the\n\ufb01rst and second order moments of the normalization constant of the projected posterior distribution.\nLemma 3.1. Given A \u2208 {0, 1}^{k\u00d7n} iid\u223c Bernoulli(1/2) and b \u2208 {0, 1}^k iid\u223c Bernoulli(1/2) for k \u2208\n{0, 1, . . . , n}, we have the following relationships:\n\n$$E_{A,b}\\left[\\sum_{z : Az = b \\bmod 2} p_\\theta(x, z)\\right] = 2^{-k} p_\\theta(x) \\qquad (3)$$\n\n$$\\mathrm{Var}\\left(\\sum_{z : Az = b \\bmod 2} p_\\theta(x, z)\\right) = 2^{-k}(1 - 2^{-k}) \\sum_z p_\\theta(x, z)^2 \\qquad (4)$$\n\nHence, a typical random projection of the posterior distribution partitions the support into 2^k subsets\nor buckets, each containing 2^{n\u2212k} states. In contrast, typical Monte Carlo estimators for variational\ninference and importance sampling can be thought of as partitioning the state space into 2^n subsets,\neach containing a single state.\nThere are two obvious challenges with this random projection approach:\n\n1. What is a good proposal distribution to select the appropriate constraint sets, i.e., buckets?\n\n2. Once we select a bucket, how can we perform ef\ufb01cient inference over the (exponentially\nlarge number of) con\ufb01gurations within the bucket?\n\nFootnote 1: For brevity, we use binary random variables, although our analysis extends to discrete random variables.\nFootnote 2: This is the typical case: randomly generated constraints can be linearly dependent, leading to larger buckets.\n\nSurprisingly, using a uniform proposal for 1) and a simple mean-\ufb01eld inference strategy for 2), we will\nprovide an estimator for the marginal log-likelihood that will guarantee tight bounds for the quality\nof our solution. Unlike the estimates produced by variational inference in Eq. (1) and importance\nsampling in Eq. 
(2) which are stochastic lower bounds for the true log-likelihood, our estimate will\nbe a provably tight approximation for the marginal log-likelihood with high probability using a small\nnumber of samples, assuming we can compute an optimal mean-\ufb01eld approximation. Given that\n\ufb01nding an optimal mean-\ufb01eld (fully factored) approximation is a non-convex optimization problem,\nour result does not violate known worst-case hardness results for probabilistic inference.\n\n3.2 Tighter guarantees on the marginal log-likelihood\n\nIntuitively, we want to project the posterior distribution in a \u201cpredictable\u201d way such that key properties\nare preserved. Speci\ufb01cally, in order to apply the results in Lemma 3.1, we will use a uniform proposal\nfor any given choice of constraints. Secondly, we will reason about the exponential con\ufb01gurations\ncorresponding to any given choice of constraint set using variational inference with an approximating\nfamily of tractable distributions Q. We follow the proof strategy of [15] and extend their work on\nbounding the partition function for inference in undirected graphical models to the learning setting\nfor directed latent variable models. We assume the following:\nAssumption 3.1. The set D of degenerate distributions, i.e., distributions which assign all the\nprobability mass to a single con\ufb01guration, is contained in Q: D \u2282 Q.\nThis assumption is true for most commonly used approximating families of distributions such as\nmean-\ufb01eld Q_MF = {q(z) : q(z) = q_1(z_1)\u00b7\u00b7\u00b7 q_\u2113(z_\u2113)}, structured mean-\ufb01eld [3], etc. We now de\ufb01ne\na projected variational inference problem as follows:\nDe\ufb01nition 3.1. Let A^k_t \u2208 {0, 1}^{k\u00d7n} iid\u223c Bernoulli(1/2) and b^k_t \u2208 {0, 1}^k iid\u223c Bernoulli(1/2) for k \u2208\n[0, 1,\u00b7\u00b7\u00b7 , n] and t \u2208 [1, 2,\u00b7\u00b7\u00b7 , T ]. 
Let Q be a family of distributions such that Assumption 3.1 holds.\nThe optimal solutions for the projected variational inference problems, \u03b3^k_t, are de\ufb01ned as follows:\n\n$$\\log \\gamma^k_t(x) = \\max_{q \\in Q} \\sum_{z : A^k_t z = b^k_t \\bmod 2} q_\\phi(z|x) \\left( \\log p_\\theta(x, z) - \\log q_\\phi(z|x) \\right) \\qquad (5)$$\n\nWe now derive bounds on the marginal likelihood p\u03b8(x) using two estimators that aggregate solutions\nto the projected variational inference problems.\n\n3.2.1 Bounds based on mean aggregation\n\nOur \ufb01rst estimator is a weighted average of the projected variational inference problems.\nDe\ufb01nition 3.2. For any given k, the mean estimator over T instances of the projected variational\ninference problems is de\ufb01ned as follows:\n\n$$L^{k,T}_\\mu(x) = \\frac{1}{T} \\sum_{t=1}^{T} \\gamma^k_t(x) 2^k. \\qquad (6)$$\n\nNote that the stochasticity in the mean estimator is due to the choice of our random matrices A^k_t, b^k_t\nin De\ufb01nition 3.1. Consequently, we obtain the following guarantees:\nTheorem 3.1. The mean estimator is a lower bound for p\u03b8(x) in expectation:\n\n$$E\\left[L^{k,T}_\\mu(x)\\right] \\leq p_\\theta(x).$$\n\nMoreover, there exists a k* and a positive constant \u03b1 such that for any \u2206 > 0, if T \u2265 (1/\u03b1) log(2n/\u2206)\nthen with probability at least (1 \u2212 2\u2206),\n\n$$L^{k^\\star,T}_\\mu(x) \\geq \\frac{p_\\theta(x)}{64(n + 1)}.$$\n\nProof sketch: For the \ufb01rst part of the theorem, note that the solution of a projected variational\nproblem for any choice of A^k_t and b^k_t with a \ufb01xed k in Eq. (5) is a lower bound to the sum\n\u2211_{z : A^k_t z = b^k_t mod 2} p\u03b8(x, z) using Eq. (1). Now, we can use Eq. (3) in Lemma 3.1 to obtain the upper\nbound in expectation. The second part of the proof extends naturally from Theorem 3.2 which we\nstate next. 
Please refer to the supplementary material for a detailed proof.\n\n3.2.2 Bounds based on median aggregation\n\nWe can additionally aggregate the solutions to Eq. (5) using the median estimator. This gives us\ntighter guarantees, including a lower bound that does not require us to take an expectation.\nDe\ufb01nition 3.3. For any given k, the median estimator over T instances of the projected variational\ninference problems is de\ufb01ned as follows:\n\n$$L^{k,T}_{Md}(x) = \\mathrm{Median}\\left(\\gamma^k_1(x), \\cdots, \\gamma^k_T(x)\\right) 2^k. \\qquad (7)$$\n\nThe guarantees we obtain through the median estimator are formalized in the theorem below:\nTheorem 3.2. For the median estimator, there exists a k* > 0 and positive constant \u03b1 such that for\nany \u2206 > 0, if T \u2265 (1/\u03b1) log(2n/\u2206) then with probability at least (1 \u2212 2\u2206),\n\n$$4 p_\\theta(x) \\geq L^{k^\\star,T}_{Md}(x) \\geq \\frac{p_\\theta(x)}{32(n + 1)}$$\n\nProof sketch: The upper bound follows from the application of Markov\u2019s inequality to the positive\nrandom variable \u2211_{z : A^k_t z = b^k_t mod 2} p\u03b8(x, z) (\ufb01rst moments are bounded from Lemma 3.1) and \u03b3^k_t(x)\nlower bounds this sum. The lower bound of the above theorem extends a result from Theorem 2 of\n[15]. Please refer to the supplementary material for a detailed proof.\nHence, the rescaled variational solutions aggregated through a mean or median can provide tight\nbounds on the log-likelihood estimate for the observed data with high probability unlike the ELBO\nestimates in Eq. (1) and Eq. (2), which could be arbitrarily far from the true log-likelihood.\n\n4 Algorithmic framework\n\nIn recent years, there have been several algorithmic advancements in variational inference and\nlearning using black-box techniques [23]. 
These techniques involve a range of ideas such as the use\nof mini-batches, amortized inference, Monte Carlo gradient computation, etc., for scaling variational\ntechniques to large data sets. See Section 6 for a discussion. In this section, we integrate random\nprojections into a black-box algorithm for belief networks, a class of directed, discrete latent variable\nmodels. These models are especially hard to learn, since the \u201creparametrization trick\u201d [17] is not\napplicable to discrete latent variables, leading to gradient updates with high variance.\n\n4.1 Model speci\ufb01cation\n\nWe will describe our algorithm using the architecture of a sigmoid belief network (SBN), a multi-layer\nperceptron which is the basic building block for directed deep generative models with discrete latent\nvariables [21]. A sigmoid belief network consists of L densely connected layers of binary hidden\nunits (Z^{1:L}) with the bottom layer connected to a single layer of binary visible units (X). The nodes\nand edges in the network are associated with biases and weights respectively. The state of the units\nin the top layer (Z^L) is a sigmoid function (\u03c3(\u00b7)) of the corresponding biases. For all other layers,\nthe conditional distribution of any unit given its parents is represented compactly by a non-linear\nactivation of the linear combination of the weights of parent units with their binary state and an\nadditive bias term. The generative process can be summarized as follows:\n\n$$p(Z^L_i = 1) = \\sigma(b^L_i); \\quad p(Z^l_i = 1 | z^{l+1}) = \\sigma(W^{l+1} \\cdot z^{l+1} + b^l_i); \\quad p(X_i = 1 | z^1) = \\sigma(W^1 \\cdot z^1 + b^0_i)$$\n\nIn addition to the basic SBN design, we also consider the amortized inference setting. 
Here, we have\nan inference network with the same architecture as the SBN, but with the feedforward loop running in the\nreverse direction from the input (x) to the output q(z^L|x).\n\nAlgorithm 1 VB-MCS: Learning belief networks with random projections.\nVB-MCS (Mini-batches {x_h}_{h=1}^H, Generative Network (G, \u03b8), Inference Network (I, \u03c6), Epochs E,\nConstraints k, Instances T )\nfor e = 1 : E do\n    for h = 1 : H do\n        for t = 1 : T do\n            Sample A \u2208 {0, 1}^{k\u00d7n} iid\u223c Bernoulli(1/2) and b \u2208 {0, 1}^k iid\u223c Bernoulli(1/2)\n            C, b' \u2190 RowReduce(A, b)\n            log \u03b3^k_t(x_h) \u2190 ComputeProjectedELBO(x_h, G, \u03b8, I, \u03c6, C, b')\n        log L^{k,T}(x_h) \u2190 log [Aggregate(\u03b3^k_1(x_h),\u00b7\u00b7\u00b7 , \u03b3^k_T(x_h))]\n        Update \u03b8, \u03c6 \u2190 StochasticGradientDescent(\u2212 log L^{k,T}(x_h))\nreturn \u03b8, \u03c6\n\n4.2 Algorithm\n\nThe basic algorithm for learning belief networks with augmented inference networks is inspired by\nthe wake-sleep algorithm [13]. One key difference from the wake-sleep algorithm is that there is a\nsingle objective being optimized. This is typically the ELBO (see Eq. (1)) and optimization is done\nusing stochastic mini-batch descent jointly over the model and inference parameters.\nTraining consists of two alternating phases for every mini-batch of points. The \ufb01rst step makes a\nforward pass through the inference network producing one or more samples from the top layer of the\ninference network, and \ufb01nally, these samples complete a forward pass through the generative network.\nThe reverse pass computes, with respect to the model and variational parameters, the gradient of the\nELBO in Eq. 
(1) and uses these gradient updates to perform a gradient descent step on the negative ELBO.\nWe now introduce a black-box technique within this general learning framework, which we refer to\nas Variational Bayes on Monte Carlo Steroids (VB-MCS) due to the exponential sampling property.\nVB-MCS requires as input a data-dependent parameter k, which is the number of variables to constrain.\nAt every training epoch, we \ufb01rst sample entries of a full-rank constraint matrix A \u2208 {0, 1}^{k\u00d7n}\nand vector b \u2208 {0, 1}^k and then optimize for the objective corresponding to a projected variational\ninference problem de\ufb01ned in Eq. (5). This procedure is repeated for T problem instances, and the\nindividual likelihood estimates are aggregated using the mean or median based estimators de\ufb01ned in\nEq. (6) and Eq. (7). The pseudocode is given in Algorithm 1.\nFor computing the projected ELBO, the inference network considers the marginal distribution of only\nn \u2212 k free latent variables. We consider the mean-\ufb01eld family of approximations where the free latent\nvariables are sampled independently from their corresponding marginal distributions. The remaining\nk latent variables are speci\ufb01ed by parity constraints. Using Gaussian elimination, the original linear\nsystem Az = b mod 2 is reduced into a row echelon representation of the form Cz = b' where\nC = [I_{k\u00d7k}|A'] such that A' \u2208 {0, 1}^{k\u00d7(n\u2212k)} and b' \u2208 {0, 1}^k. Finally, we read off the constrained\nvariables as\n\n$$z_j = \\bigoplus_{i=k+1}^{n} c_{ji} z_i \\oplus b'_j$$\n\nfor j = 1, 2,\u00b7\u00b7\u00b7 , k where \u2295 is the XOR operator.\n\n5 Experimental evaluation\n\nWe evaluated the performance of VB-MCS as a black-box technique for learning discrete, directed latent\nvariable models for images and documents. Our test-architecture is a simple sigmoid belief network\nwith a single hidden layer consisting of 200 units and a visible layer. 
Through our experiments, we\nwish to demonstrate that the theoretical advantage offered by random projections easily translates\ninto practice using an associated algorithm such as VB-MCS. We will compare a baseline sigmoid\nbelief network (Base-SBN) learned using Variational Bayes and evaluate it against a similar network\nwith parity constraints imposed on k latent variables (henceforth referred to as k-SBN) and learned\nusing VB-MCS. We now discuss some parameter settings below, which have been \ufb01xed with respect to\nthe best validation performance of Base-SBN on the Caltech 101 Silhouettes dataset.\n\nTable 1: Test performance evaluation of VB-MCS. Random projections lead to improvements in terms\nof estimated negative log-likelihood and log-perplexity.\nDataset | Evaluation Metric | Base | k=5 | k=10 | k=20\nVision: Caltech 101 Silhouettes | NLL | 251.04 | 245.60 | 248.79 | 256.60\nLanguage: NIPS Proceedings | log-perplexity | 5009.79 | 4919.35 | 4919.22 | 4920.71\n\nFigure 1 [panels: (a) Dataset Images, (b) Denoised Images, (c) Samples]: Denoised images (center) of the actual ones (left) and sample images (right) generated from\nthe best k-SBN model trained on the Caltech 101 Silhouettes dataset.\n\nImplementation details: The prior probabilities for the latent layer are speci\ufb01ed using autoregressive\nconnections [10]. The learning rate was \ufb01xed based on validation performance to 3 \u00d7 10^{\u22124} for the\ngenerator network and reduced by a factor of 5 for the inference network. Mini-batch size was \ufb01xed\nto 20. Regularization was imposed by early stopping of training after 50 epochs. The optimizer used\nis Adam [16]. 
For k-SBN, we show results for three values of k: 5, 10, and 20, and the aggregation is\ndone using the median estimator with T = 3.\n\n5.1 Generative modeling of images in the Caltech 101 Silhouettes dataset\n\nWe trained a generative model for silhouette images of 28 \u00d7 28 dimensions from the Caltech 101\nSilhouettes dataset (footnote 3). The dataset consists of 4,100 train images, 2,264 validation images and 2,307\ntest images. This is a particularly hard dataset due to the asymmetry in silhouettes compared to other\ncommonly used structured datasets. As we can see in Table 1, the k-SBNs trained using VB-MCS can\noutperform the Base-SBN by several nats in terms of the negative log-likelihood estimates on the test\nset. The performance for k-SBNs dips as we increase k, which is related to the empirical quality of\nthe approximation our algorithm makes for different k values.\nThe qualitative evaluation results of SBNs trained using VB-MCS and additional control variates [19]\non denoising and sampling are shown in Fig. 1. While the qualitative evaluation is subjective, the\ndenoised images seem to smooth out the edges in the actual images. The samples generated from the\nmodel largely retain essential qualities such as silhouette connectivity and varying edge patterns.\n\n5.2 Generative modeling of documents in the NIPS Proceedings dataset\n\nWe performed the second set of experiments on the latest version of the NIPS Proceedings dataset (footnote 4),\nwhich consists of the distribution of words in all papers that appeared in NIPS from 1988-2003. We\nperformed an 80/10/10 split of the dataset into 1,986 train, 249 validation, and 248 test documents.\nThe relevant metric here is the average perplexity per word for D documents, given by\n\n$$P = \\exp\\left(-\\frac{1}{D} \\sum_{i=1}^{D} \\frac{1}{L_i} \\log p(x_i)\\right)$$\n\nwhere L_i is the length of document i. 
We feed in raw word counts per\ndocument as input to the inference network and consequently, the visible units in the generative\nnetwork correspond to the (unnormalized) probability distribution of words in the document.\nTable 1 shows the log-perplexity scores (in nats) on the test set. From the results, we again observe\nthe superior performance of all k-SBNs over the Base-SBN. The different k-SBNs have comparable\nperformance, although we do not expect this observation to hold true more generally for other\ndatasets. For a qualitative evaluation, we sample the relative word frequencies in a document and\nthen generate the top-50 words appearing in a document. One such sampling is shown in Figure 2.\nThe bag-of-words appears to be semantically re\ufb02ective of coappearing words in a NIPS paper.\n\nFootnote 3: Available at https://people.cs.umass.edu/~marlin/data.shtml\nFootnote 4: Available at http://ai.stanford.edu/~gal/\n\n6 Discussion and related work\n\nThere have been several recent advances in approximate inference and learning techniques\nfrom both a theoretical and empirical perspective. On the empirical side, the various black-box\ntechniques [23] such as mini-batch updates [14], amortized inference [9] etc. are key\nto scaling and generalizing variational inference to a wide range of settings. Additionally,\nadvancements in representational learning have made it possible to specify and learn highly\nexpressive directed latent variable models based on neural networks, e.g., [4, 10, 17, 19, 20, 24]. 
Rather than taking a purely variational or sampling-based approach, these techniques\nstand out in combining the computational ef\ufb01ciency of variational techniques with the\ngeneralizability of Monte Carlo methods [25, 26].\n\nFigure 2: Bag-of-words for a 50 word document sampled from the best k-SBN model trained on the\nNIPS Proceedings dataset.\n\nOn the theoretical end, there is a rich body of recent work in hash-based inference applied to\nsampling [11], variational inference [15], and hybrid inference techniques at the intersection of\nthe two paradigms [28]. The techniques based on random projections have not only led to better\nalgorithms but, more importantly, they come with strong theoretical guarantees [5, 6, 7].\nIn this work, we attempt to bridge the gap between theory and practice by applying hash-based\ninference techniques to the learning of latent variable models. We introduced a novel bound on the\nmarginal log-likelihood of directed latent variable models with discrete latent units. Our analysis\nextends the theory of random projections for inference previously done in the context of discrete,\nfully-observed log-linear undirected models to the general setting of both learning and inference in\ndirected latent variable models with discrete latent units while the observed data can be discrete or\ncontinuous. Our approach combines a traditional variational approximation with random projections\nto get provable accuracy guarantees and can be used to improve the quality of traditional ELBOs\nsuch as the ones obtained using a mean-\ufb01eld approximation.\nThe power of black-box techniques lies in their wide applicability, and in the second half of the paper,\nwe close the loop by developing VB-MCS, an algorithm that incorporates the theoretical underpinnings\nof random projections into belief networks that have shown tremendous promise for generative\nmodeling. 
We demonstrate an application of this idea to sigmoid belief networks, which can also\nbe interpreted as probabilistic autoencoders. VB-MCS simultaneously learns the parameters of the\n(generative) model and the variational parameters (subject to random projections) used to approximate\nthe intractable posterior. Our approach can still leverage backpropagation to ef\ufb01ciently compute\ngradients of the relevant quantities. The resulting algorithm is scalable and the use of random\nprojections signi\ufb01cantly improves the quality of the results on benchmark data sets in both vision and\nlanguage domains.\nFuture work will involve devising random projection schemes for latent variable models with continu-\nous latent units and other variational families beyond mean-\ufb01eld [24]. On the empirical side, it would\nbe interesting to investigate potential performance gains by employing complementary heuristics\nsuch as variance reduction [19] and data augmentation [8] in conjunction with random projections.\n\nAcknowledgments\n\nThis work was supported by grants from the NSF (grant 1649208) and Future of Life Institute (grant\n2016-158687).\n\nReferences\n\n[1] Y. Bengio. Learning deep architectures for AI. Foundations and trends in ML, 2(1):1\u2013127, 2009.\n\n[2] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent dirichlet allocation. JMLR, 3:993\u20131022, 2003.\n\n[3] A. Bouchard-C\u00f4t\u00e9 and M. I. Jordan. Optimization of structured mean \ufb01eld objectives. In UAI, 2009.\n\n[4] Y. Burda, R. Grosse, and R. Salakhutdinov. Importance weighted autoencoders. In ICLR, 2016.\n\n[5] S. Ermon, C. Gomes, A. Sabharwal, and B. Selman. Low-density parity constraints for hashing-based\ndiscrete integration. In ICML, 2014.\n\n[6] S. Ermon, C. P. Gomes, A. Sabharwal, and B. Selman. Optimization with parity constraints: From binary\ncodes to discrete integration. In UAI, 2013.\n\n[7] S. Ermon, C. P. Gomes, A. Sabharwal, and B. Selman. 
Taming the curse of dimensionality: Discrete\nintegration by hashing and optimization. In ICML, 2013.\n\n[8] Z. Gan, R. Henao, D. E. Carlson, and L. Carin. Learning deep sigmoid belief networks with data\naugmentation. In AISTATS, 2015.\n\n[9] S. Gershman and N. D. Goodman. Amortized inference in probabilistic reasoning. In Proceedings of the\nThirty-Sixth Annual Conference of the Cognitive Science Society, 2014.\n\n[10] K. Gregor, I. Danihelka, A. Mnih, C. Blundell, and D. Wierstra. Deep autoregressive networks. In ICML,\n2014.\n\n[11] S. Hadjis and S. Ermon. Importance sampling over sets: A new probabilistic inference scheme. In UAI,\n2014.\n\n[12] G. E. Hinton. Training products of experts by minimizing contrastive divergence. Neural computation,\n14(8):1771\u20131800, 2002.\n\n[13] G. E. Hinton, P. Dayan, B. J. Frey, and R. M. Neal. The \"wake-sleep\" algorithm for unsupervised neural\nnetworks. Science, 268(5214):1158\u20131161, 1995.\n\n[14] M. D. Hoffman, D. M. Blei, C. Wang, and J. Paisley. Stochastic variational inference. JMLR, 14(1):1303\u2013\n1347, 2013.\n\n[15] L.-K. Hsu, T. Achim, and S. Ermon. Tight variational bounds via random projections and I-projections. In\nAISTATS, 2016.\n\n[16] D. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.\n\n[17] D. P. Kingma and M. Welling. Auto-encoding variational bayes. In ICLR, 2014.\n\n[18] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng. Convolutional deep belief networks for scalable\nunsupervised learning of hierarchical representations. In ICML, 2009.\n\n[19] A. Mnih and K. Gregor. Neural variational inference and learning in belief networks. In ICML, 2014.\n\n[20] A. Mnih and D. J. Rezende. Variational inference for Monte Carlo objectives. In ICML, 2016.\n\n[21] R. M. Neal. Connectionist learning of belief networks. AIJ, 56(1):71\u2013113, 1992.\n\n[22] H. Poon and P. Domingos. Sum-product networks: A new deep architecture. In UAI, 2011.\n\n[23] R. Ranganath, S. 
Gerrish, and D. M. Blei. Black box variational inference. In AISTATS, 2013.\n\n[24] D. J. Rezende and S. Mohamed. Variational inference with normalizing \ufb02ows. In ICML, 2015.\n\n[25] T. Salimans, D. P. Kingma, and M. Welling. Markov chain Monte Carlo and variational inference: Bridging\nthe gap. In ICML, 2015.\n\n[26] M. Titsias and M. L\u00e1zaro-Gredilla. Local expectation gradients for black box variational inference. In\nNIPS, 2015.\n\n[27] S. P. Vadhan et al. Pseudorandomness. In Foundations and Trends in TCS, 2011.\n\n[28] M. Zhu and S. Ermon. A hybrid approach for probabilistic inference using random projections. In ICML,\n2015.\n", "award": [], "sourceid": 1505, "authors": [{"given_name": "Aditya", "family_name": "Grover", "institution": "Stanford University"}, {"given_name": "Stefano", "family_name": "Ermon", "institution": "Stanford"}]}