{"title": "Learning Disentangled Representations with Semi-Supervised Deep Generative Models", "book": "Advances in Neural Information Processing Systems", "page_first": 5925, "page_last": 5935, "abstract": "Variational autoencoders (VAEs) learn representations of data by jointly training a probabilistic encoder and decoder network. Typically these models encode all features of the data into a single variable. Here we are interested in learning disentangled representations that encode distinct aspects of the data into separate variables. We propose to learn such representations using model architectures that generalise from standard VAEs, employing a general graphical model structure in the encoder and decoder. This allows us to train partially-specified models that make relatively strong assumptions about a subset of interpretable variables and rely on the flexibility of neural networks to learn representations for the remaining variables. We further define a general objective for semi-supervised learning in this model class, which can be approximated using an importance sampling procedure. We evaluate our framework's ability to learn disentangled representations, both by qualitative exploration of its generative capacity, and quantitative evaluation of its discriminative ability on a variety of models and datasets.", "full_text": "Learning Disentangled Representations with\nSemi-Supervised Deep Generative Models\n\nN. Siddharth\u2020\n\nUniversity of Oxford\nnsid@robots.ox.ac.uk\n\nBrooks Paige\u2020\n\nAlan Turing Institute\n\nUniversity of Cambridge\nbpaige@turing.ac.uk\n\nJan-Willem van de Meent\u2020\n\nNortheastern University\n\nj.vandemeent@northeastern.edu\n\nAlban Desmaison\nUniversity of Oxford\n\nalban@robots.ox.ac.uk\n\nNoah D. Goodman\nStanford University\n\nngoodman@stanford.edu\n\nPushmeet Kohli \u2217\n\nDeepmind\n\npushmeet@google.com\n\nFrank Wood\n\nUniversity of Oxford\n\nfwood@robots.ox.ac.uk\n\nPhilip H.S. 
Torr\n\nUniversity of Oxford\n\nphilip.torr@eng.ox.ac.uk\n\nAbstract\n\nVariational autoencoders (VAEs) learn representations of data by jointly training a probabilistic\nencoder and decoder network. Typically these models encode all features of the data into a\nsingle variable. Here we are interested in learning disentangled representations that encode\ndistinct aspects of the data into separate variables. We propose to learn such representations\nusing model architectures that generalise from standard VAEs, employing a general graphical\nmodel structure in the encoder and decoder. This allows us to train partially-speci\ufb01ed models\nthat make relatively strong assumptions about a subset of interpretable variables and rely on\nthe \ufb02exibility of neural networks to learn representations for the remaining variables. We\nfurther de\ufb01ne a general objective for semi-supervised learning in this model class, which can be\napproximated using an importance sampling procedure. We evaluate our framework\u2019s ability\nto learn disentangled representations, both by qualitative exploration of its generative capacity,\nand quantitative evaluation of its discriminative ability on a variety of models and datasets.\n\n1\n\nIntroduction\n\nLearning representations from data is one of the fundamental challenges in machine learning and\narti\ufb01cial intelligence. Characteristics of learned representations can depend on their intended use.\nFor the purposes of solving a single task, the primary characteristic required is suitability for that\ntask. However, learning separate representations for each and every such task involves a large amount\nof wasteful repetitive effort. 
A representation that has some factorisable structure, and consistent semantics associated with different parts, is more likely to generalise to a new task.\nProbabilistic generative models provide a general framework for learning representations: a model is specified by a joint probability distribution both over the data and over latent random variables, and a representation can be found by considering the posterior on latent variables given specific data. The learned representation \u2014 that is, inferred values of latent variables \u2014 then depends not just on the data, but also on the generative model in its choice of latent variables and the relationships between the latent variables and the data. There are two extremes of approaches to constructing generative models. At one end are fully-specified probabilistic graphical models [19, 22], in which a practitioner decides on all latent variables present in the joint distribution, the relationships between them, and the functional form of the conditional distributions which define the model. At the other end are\n\n\u2217Author was at Microsoft Research during this project. \u2020 indicates equal contribution.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fdeep generative models [8, 17, 20, 21], which impose very few assumptions on the structure of the model, instead employing neural networks as flexible function approximators that can be used to train a conditional distribution on the data, rather than specify it by hand.\nThe tradeoffs are clear. In an explicitly constructed graphical model, the structure and form of the joint distribution ensures that latent variables will have particular semantics, yielding a disentangled representation. Unfortunately, defining a good probabilistic model is hard: in complex perceptual domains such as vision, extensive feature engineering (e.g. Berant et al. [1], Siddharth et al. 
[31]) may be necessary to define a suitable likelihood function. Deep generative models completely sidestep the difficulties of feature engineering. Although they learn representations that enable them to better reconstruct data, the representations themselves do not always exhibit consistent meaning along axes of variation: they produce entangled representations. While such approaches have considerable merit, particularly when faced with the absence of any side information about data, there are often situations when aspects of variation in data can be, or are desired to be, characterised.\nBridging this gap is challenging. One way to enforce a disentangled representation is to hold different axes of variation fixed during training [21]. Johnson et al. [14] combine a neural net likelihood with a conjugate exponential family model for the latent variables. In this class of models, efficient marginalisation over the latent variables can be performed by learning a projection onto the same conjugate exponential family in the encoder. Here we propose a more general class of partially-specified graphical models: probabilistic graphical models in which the modeller need only specify the exact relationship for some subset of the random variables in the model. Factors left undefined in the model definition are then learned, parametrised by flexible neural networks. This provides the ability to situate oneself at a particular point on a spectrum, by specifying precisely those axes of variation (and their dependencies) we have information about or would like to extract, and learning disentangled representations for them, while leaving the rest to be learned in an entangled manner.\nA subclass of partially-specified models that is particularly common is that where we can obtain supervision data for some subset of the variables. 
In practice, there is often variation in the data which is (at least conceptually) easy to explain, and therefore annotate, whereas other variation is less clear. For example, consider the MNIST dataset of handwritten digits: the images vary both in terms of content (which digit is present), and style (how the digit is written), as is visible in the right-hand side of Fig. 1. Having an explicit \u201cdigit\u201d latent variable captures a meaningful and consistent axis of variation, independent of style; using a partially-specified graphical model means we can define a \u201cdigit\u201d variable even while leaving unspecified the semantics of the different styles, and the process of rendering a digit to an image. With unsupervised learning there is no guarantee that inference on a model with 10 classes will induce factored latent representations with factors corresponding to the 10 digits. However, given a small number of labelled examples, this task becomes significantly easier.\nFundamentally, our approach conforms to the idea that well-defined notions of disentanglement require specification of a task under which to measure it [4]. For example, when considering images of people\u2019s faces, we might wish to capture the person\u2019s identity in one context, and the lighting conditions on the faces in another, facial features in another, or combinations of these in yet other contexts. Partially-specified models and weak supervision can be seen as a way to operationalise this task-dependence directly into the learning objective.\nIn this paper we introduce a recipe for learning and inference in partially-specified models, a flexible framework that learns disentangled representations of data by using graphical model structures to encode constraints to interpret the data. 
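To make the running example concrete, a partially-specified generative model for MNIST-like digits can be sketched in a few lines of NumPy. This is an illustrative toy of ours, not the paper's implementation: the priors p(y) and p(z) are specified exactly, while the likelihood p_theta(x | y, z) is left to a neural network, here a randomly initialised single-layer decoder standing in for a learned one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Specified part of the model: exact priors over the interpretable digit
# label y (uniform categorical) and the remaining entangled style z.
def sample_y():
    return int(rng.integers(10))

def sample_z(dim=2):
    return rng.standard_normal(dim)

# Unspecified part: p_theta(x | y, z) is left to a neural network. A randomly
# initialised single-layer decoder stands in for the learned one here.
W = rng.standard_normal((784, 12)) * 0.01
b = np.zeros(784)

def decode(y, z):
    h = np.concatenate([np.eye(10)[y], z])       # [one-hot label, style]
    return 1.0 / (1.0 + np.exp(-(W @ h + b)))    # Bernoulli means per pixel

y, z = sample_y(), sample_z()
x_mean = decode(y, z)
```

Only the priors are pinned down; everything about how a digit and a style become pixels is deferred to the learned decoder weights.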
We present this framework in the context of variational autoencoders (VAEs), developing a generalised formulation of semi-supervised learning with DGMs that enables our framework to automatically employ the correct factorisation of the objective for any given choice of model and set of latents taken to be observed. In this respect our work extends previous efforts to introduce supervision into variational autoencoders [18, 24, 32]. We introduce a variational objective which is applicable to a more general class of models, allowing us to consider graphical-model structures with arbitrary dependencies between latents, continuous-domain latents, and those with dynamically changing dependencies. We provide a characterisation of how to compile partially-supervised generative models into stochastic computation graphs, suitable for end-to-end training. This approach also allows us to amortise inference [7, 23, 29, 34], simultaneously learning a network that performs approximate inference over representations at the same time we learn the unknown factors of the model itself. We demonstrate the efficacy of our framework on a variety of tasks, involving classification, regression, and predictive synthesis, including its ability to encode latents of variable dimensionality.\n\n2\n\n\fFigure 1: Semi-supervised learning in structured variational autoencoders, illustrated on MNIST digits. Top-Left: Generative model. Bottom-Left: Recognition model. Middle: Stochastic computation graph, showing expansion of each node to its corresponding sub-graph. Generative-model dependencies are shown in blue and recognition-model dependencies are shown in orange. See Section 2.2 for a detailed explanation. Right: learned representation.\n\n2 Framework and Formulation\n\nVAEs [17, 28] are a class of deep generative models that simultaneously train both a probabilistic encoder and decoder for elements of a data set D = {x1, . . . xN}. 
The central analogy is that an encoding z can be considered a latent variable, casting the decoder as a conditional probability density p\u03b8(x|z). The parameters \u03b7\u03b8(z) of this distribution are the output of a deterministic neural network with parameters \u03b8 (most commonly MLPs or CNNs) which takes z as input. By placing a weak prior over z, the decoder defines a posterior and joint distribution p\u03b8(z | x) \u221d p\u03b8(x | z)p(z).\nInference in VAEs can be performed using a variational method that approximates the posterior distribution p\u03b8(z | x) using an encoder q\u03c6(z | x), whose parameters \u03bb\u03c6(x) are the output of a network (with parameters \u03c6) that is referred to as an \u201cinference network\u201d or a \u201crecognition network\u201d. The generative and inference networks, denoted by solid and dashed lines respectively in the graphical model, are trained jointly by performing stochastic gradient ascent on the evidence lower bound (ELBO) L(\u03c6, \u03b8; D) \u2264 log p\u03b8(D),\n\nL(\u03c6, \u03b8; D) = \u2211_{n=1}^{N} L(\u03c6, \u03b8; xn) = \u2211_{n=1}^{N} E_{q\u03c6(z|xn)}[log p\u03b8(xn | z) + log p(z) \u2212 log q\u03c6(z|xn)].    (1)\n\nTypically, the first term E_{q\u03c6(z|xn)}[log p\u03b8(xn | z)] is approximated by a Monte Carlo estimate and the remaining two terms are expressed as a divergence \u2212KL(q\u03c6(z|xn) || p(z)), which can be computed analytically when the encoder model and prior are Gaussian.\nIn this paper, we will consider models in which both the generative model p\u03b8(x, y, z) and the approximate posterior q\u03c6(y, z | x) can have arbitrary conditional dependency structures involving random variables defined over a number of different distribution types. We are interested in defining VAE architectures in which a subset of variables y are interpretable. 
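Before supervision enters the picture, the unsupervised ELBO of Eq. (1) can be sketched numerically. The following is a minimal NumPy sketch of ours (not the paper's code): a Monte Carlo estimate of the reconstruction term plus the analytic Gaussian KL term mentioned above, with a toy likelihood standing in for the decoder network.

```python
import numpy as np

def gaussian_kl(mu, logvar):
    # Analytic KL( N(mu, diag(exp(logvar))) || N(0, I) ).
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)

def elbo_estimate(x, mu, logvar, log_px_given_z, n_samples=10, rng=None):
    # Eq. (1) for one datapoint: E_q[log p(x|z)] - KL(q(z|x) || p(z)),
    # with reparameterised samples z = mu + sigma * eps.
    rng = rng or np.random.default_rng()
    recon = 0.0
    for _ in range(n_samples):
        eps = rng.standard_normal(mu.shape)
        z = mu + np.exp(0.5 * logvar) * eps
        recon += log_px_given_z(x, z)
    return recon / n_samples - gaussian_kl(mu, logvar)

# Toy check: a "decoder" whose likelihood is N(x; z, I) on a 3-dim example.
rng = np.random.default_rng(0)
log_px = lambda x, z: -0.5 * np.sum((x - z) ** 2)   # up to an additive constant
bound = elbo_estimate(np.zeros(3), np.zeros(3), np.zeros(3), log_px, rng=rng)
```

In practice the encoder parameters (mu, logvar) would themselves be network outputs \u03bb\u03c6(x), and the bound would be ascended by stochastic gradients with respect to \u03b8 and \u03c6.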
For these variables, we assume that supervision labels are available for some fraction of the data. The VAE will additionally retain some set of variables z for which inference is performed in a fully unsupervised manner. This is in keeping with our central goal of defining and learning in partially-specified models. In the running example for MNIST, y corresponds to the classification label, whereas z captures all other implicit features, such as the pen type and handwriting style.\nThis class of models is more general than the models in the work by Kingma et al. [18], who consider three model designs with a specific conditional dependence structure. We also do not require p(y, z) to be a conjugate exponential family model, as in the work by Johnson et al. [15]. To perform semi-supervised learning in this class of models, we need to i) define an objective that is suitable for general dependency graphs, and ii) define a method for constructing a stochastic computation graph [30] that incorporates both the conditional dependence structure in the generative model and that of the recognition model into this objective.\n\n3\n\n\f2.1 Objective Function\n\nPrevious work on semi-supervised learning for deep generative models [18] defines an objective over N unsupervised data points D = {x1, . . . , xN} and M supervised data points Dsup = {(x1, y1), . . . , (xM, yM)},\n\nL(\u03b8, \u03c6; D, Dsup) = \u2211_{n=1}^{N} L(\u03b8, \u03c6; xn) + \u03b3 \u2211_{m=1}^{M} Lsup(\u03b8, \u03c6; xm, ym).    (2)\n\nOur model\u2019s joint distribution factorises into unsupervised and supervised collections of terms over D and Dsup as shown in the graphical model. The standard variational bound on the joint evidence of all observed data (including supervision) also factorises as shown in Eq. (2). As the factor corresponding to the unsupervised part of the graphical model is exactly that of Eq. (1), we focus on the supervised term in Eq. (2), expanded below, incorporating an additional weighted component as in Kingma et al. [18].\n\nLsup(\u03b8, \u03c6; xm, ym) = E_{q\u03c6(z|xm,ym)}[log (p\u03b8(xm, ym, z) / q\u03c6(z | xm, ym))] + \u03b1 log q\u03c6(ym | xm).    (3)\n\nNote that the formulation in Eq. 
(2) introduces a constant \u03b3 that controls the relative strength of the supervised term. While the joint distribution in our model implicitly weights the two terms, in situations where the relative sizes of D and Dsup are vastly different, having control over the relative weights of the terms can help ameliorate such discrepancies.\nThis definition in Eq. (3) implicitly assumes that we can evaluate the conditional probability q\u03c6(z|x, y) and the marginal q\u03c6(y|x) = \u222b dz q\u03c6(y, z|x). This was indeed the case for the models considered by Kingma et al. [18], which have a factorisation q\u03c6(y, z|x) = q\u03c6(z|x, y)q\u03c6(y|x). Here we will derive an estimator for Lsup that generalises to models in which q\u03c6(y, z | x) can have an arbitrary conditional dependence structure. For purposes of exposition, we will for the moment consider the case where q\u03c6(y, z | x) = q\u03c6(y | x, z)q\u03c6(z | x). For this factorisation, generating samples zm,s \u223c q\u03c6(z | xm, ym) requires inference, which means we can no longer compute a simple Monte Carlo estimator by sampling from the unconditioned distribution q\u03c6(z | xm). Moreover, we also cannot evaluate the density q\u03c6(z | xm, ym).\nIn order to address these difficulties, we re-express the supervised terms in the objective as\n\nLsup(\u03b8, \u03c6; xm, ym) = E_{q\u03c6(z|xm,ym)}[log (p\u03b8(xm, ym, z) / q\u03c6(ym, z | xm))] + (1 + \u03b1) log q\u03c6(ym | xm),    (4)\n\nwhich removes the need to evaluate q\u03c6(z | xm, ym). We can then use (self-normalised) importance sampling to approximate the expectation. To do so, we sample proposals zm,s \u223c q\u03c6(z | xm) from the unconditioned encoder distribution, and define the estimator\n\nE_{q\u03c6(z|xm,ym)}[log (p\u03b8(xm, ym, z) / q\u03c6(ym, z | xm))] \u2248 (1/S) \u2211_{s=1}^{S} (wm,s / Zm) log (p\u03b8(xm, ym, zm,s) / q\u03c6(ym, zm,s | xm)),    (5)\n\nwhere the unnormalised importance weights wm,s and normaliser Zm are defined as\n\nwm,s := q\u03c6(ym, zm,s | xm) / q\u03c6(zm,s | xm),    Zm = (1/S) \u2211_{s=1}^{S} wm,s.    (6)\n\nTo approximate log q\u03c6(ym | xm), we use a Monte Carlo estimator of the lower bound that is normally used in maximum likelihood estimation,\n\nlog q\u03c6(ym | xm) \u2265 E_{q\u03c6(z|xm)}[log (q\u03c6(ym, z | xm) / q\u03c6(z | xm))] \u2248 (1/S) \u2211_{s=1}^{S} log wm,s,    (7)\n\nusing the same samples zm,s and weights wm,s as in Eq. (5). When we combine the terms in Eqs. (5) and (7), we obtain the estimator\n\n\u02c6Lsup(\u03b8, \u03c6; xm, ym) := (1/S) \u2211_{s=1}^{S} [(wm,s / Zm) log (p\u03b8(xm, ym, zm,s) / q\u03c6(ym, zm,s | xm)) + (1 + \u03b1) log wm,s].    (8)\n\n4\n\n\fWe note that this estimator applies to any conditional dependence structure. Suppose that we were to define an encoder q\u03c6(z2, y1, z1 | x) with factorisation q\u03c6(z2 | y1, z1, x)q\u03c6(y1 | z1, x)q\u03c6(z1 | x). If we propose z2 \u223c q\u03c6(z2 | y1, z1, x) and z1 \u223c q\u03c6(z1 | x), then the importance weights wm,s for the estimator in Eq. (8) are defined as\n\nwm,s := q\u03c6(z2^{m,s}, y1^m, z1^{m,s} | xm) / (q\u03c6(z2^{m,s} | y1^m, z1^{m,s}, xm) q\u03c6(z1^{m,s} | xm)) = q\u03c6(y1^m | z1^{m,s}, xm).\n\nIn general, the importance weights are simply the product of conditional probabilities of the supervised variables y in the model. Note that this also applies to the models in Kingma et al. [18], whose objective we can recover by taking the weights to be constants wm,s = q\u03c6(ym | xm).\nWe can also define an objective analogous to the one used in importance-weighted autoencoders [2], in which we compute the logarithm of a Monte Carlo estimate, rather than the Monte Carlo estimate of a logarithm. This objective takes the form\n\n\u02c6Lsup,iw(\u03b8, \u03c6; xm, ym) := log[(1/S) \u2211_{s=1}^{S} p\u03b8(xm, ym, zm,s) / q\u03c6(zm,s | xm)] + \u03b1 log[(1/S) \u2211_{s=1}^{S} wm,s],    (9)\n\nwhich can be derived by moving the sums in Eq. (8) into the logarithms and applying the substitution wm,s / q\u03c6(ym, zm,s | xm) = 1 / q\u03c6(zm,s | xm).\n\n2.2 Construction of the Stochastic Computation Graph\n\nTo perform gradient ascent on the objective in Eq. (8), we map the graphical models for p\u03b8(x, y, z) and q\u03c6(y, z|x) onto a stochastic computation graph in which each stochastic node forms a sub-graph. Figure 1 shows this expansion for the simple VAE for MNIST digits from [17]. In this model, y is a discrete variable that represents the underlying digit, our latent variable of interest, for which we have partial supervision data. An unobserved Gaussian-distributed variable z captures the remainder of the latent information. This includes features such as the hand-writing style and stroke thickness. In the generative model (Fig. 1 top-left), we assume a factorisation p\u03b8(x, y, z) = p\u03b8(x | y, z)p(y)p(z) in which y and z are independent under the prior. In the recognition model (Fig. 1 bottom-left), we use a conditional dependency structure q\u03c6(y, z | x) = q\u03c6z(z | y, x)q\u03c6y(y|x) to disentangle the digit label y from the handwriting style z (Fig. 1 right).\nThe generative and recognition models jointly form a stochastic computation graph (Fig. 
1 centre) containing a sub-graph for each stochastic variable. These can correspond to fully supervised, partially supervised and unsupervised variables. This example graph contains three types of sub-graphs, corresponding to the three possibilities for supervision and gradient estimation:\n\u2022 For the fully supervised variable x, we compute the likelihood p under the generative model, that is p\u03b8(x | y, z) = N(x ; \u03b7\u03b8(y, z)). Here \u03b7\u03b8(y, z) is a neural net with parameters \u03b8 that returns the parameters of a normal distribution (i.e. a mean vector and a diagonal covariance).\n\u2022 For the unobserved variable z, we compute both the prior probability p(z) = N(z ; \u03b7z), and the conditional probability q\u03c6(z | x, y) = N(z ; \u03bb\u03c6z(x, y)). Here z is sampled from q\u03c6(z | x, y) by first drawing \u03b5 \u223c N(0, I) and applying the reparametrisation trick z = g(\u03b5, \u03bb\u03c6(x, y)).\n\u2022 For the partially observed variable y, we also compute probabilities p(y) = Discrete(y; \u03b7y) and q\u03c6y(y|x) = Discrete(y; \u03bb\u03c6y(x)). The value y is treated as observed when available, and sampled otherwise. In this particular example, we sample y from q\u03c6y(y|x) using a Gumbel-softmax [13, 25] relaxation of the discrete distribution.\nThe example in Fig. 1 illustrates a general framework for defining VAEs with arbitrary dependency structures. We begin by defining a node for each random variable. For each node we then specify a distribution type and parameter function \u03b7, which determines how the probability under the generative model depends on the other variables in the network. This function can be a constant, fully deterministic, or a neural network whose parameters are learned from the data. 
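The Gumbel-softmax relaxation referenced above admits a very short implementation. The following is a generic NumPy sketch (ours, not the paper's code): Gumbel(0, 1) noise is added to the logits, and a temperature-scaled softmax yields a differentiable relaxed sample on the simplex.

```python
import numpy as np

def gumbel_softmax(logits, tau=1.0, rng=None):
    # Relaxed sample from a discrete distribution with the given logits:
    # add Gumbel(0, 1) noise, then take a temperature-scaled softmax.
    rng = rng or np.random.default_rng()
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))  # Gumbel(0, 1) noise
    y = (logits + g) / tau
    y = y - y.max()                                       # numerical stability
    e = np.exp(y)
    return e / e.sum()

rng = np.random.default_rng(0)
logits = np.log(np.array([0.1, 0.2, 0.7]))
sample = gumbel_softmax(logits, tau=0.5, rng=rng)
```

As the temperature tau approaches 0 the samples approach one-hot vectors, recovering discrete draws; larger tau gives smoother samples with lower-variance gradients.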
For each unsupervised and semi-supervised variable we must additionally specify a function \u03bb that returns the parameter values in the recognition model, along with a (reparametrised) sampling procedure.\nGiven this specification of a computation graph, we can now compute the importance sampling estimate in Eq. (8) by simply running the network forward repeatedly to obtain samples from q\u03c6(\u00b7|\u03bb) for all unobserved variables. We then calculate p\u03b8(x, y, z), q\u03c6(y|x), q\u03c6(y, z|x), and the importance\n\n5\n\n\fFigure 2: (a) Visual analogies for the MNIST data, partially supervised with just 100 labels (out of 50000). We infer the style variable z and then vary the label y. (b) Exploration in style space with label y held fixed and (2D) style z varied. Visual analogies for the SVHN data when (c) partially supervised with just 1000 labels, and (d) fully supervised.\n\nweight w, which is the joint probability of all semi-supervised variables for which labels are available. This estimate can then be optimised with respect to the parameters \u03b8 and \u03c6 to train the autoencoder.\n\n3 Experiments\n\nWe evaluate our framework along a number of different axes pertaining to its ability to learn disentangled representations through the provision of partial graphical-model structures for the latents and weak supervision. In particular, we evaluate its ability to (i) function as a classifier/regressor for particular latents under the given dataset, (ii) learn the generative model in a manner that preserves the semantics of the latents with respect to the data generated, and (iii) perform these tasks, in a flexible manner, for a variety of different models and data.\nFor all the experiments run, we choose architecture and parameters that are considered standard for the type and size of the respective datasets. 
Where images are concerned (with the exception of MNIST), we employ (de)convolutional architectures, and use a standard GRU recurrence in the Multi-MNIST case. For learning, we used AdaM [16] with a learning rate and momentum-correction terms set to their default values. The minibatch sizes varied from 100 to 700 depending on the dataset being used and the size of the labelled subset Dsup. All of the above, including further details of precise parameter values and the source code, including our PyTorch-based library for specifying arbitrary graphical models in the VAE framework, is available at https://github.com/probtorch/probtorch.\n\n3.1 MNIST and SVHN\n\nWe begin with an experiment involving a simple dependency structure, in fact the very same as that in Kingma et al. [18], to validate the performance of our importance-sampled objective in the special case where the recognition network and generative models factorise as indicated in Fig. 1 (left), giving us importance weights that are constant, wm,s = q\u03c6(ym|xm). The model is tested on its ability to classify digits and perform conditional generation on the MNIST and Google Street-View House Numbers (SVHN) datasets. As Fig. 1 (left) shows, the generative and recognition models have the \u201cdigit\u201d label, denoted y, partially specified (and partially supervised), and the \u201cstyle\u201d factor, denoted z, assumed to be an unobserved (and unsupervised) variable.\nFigure 2(a) and (c) illustrate the conditional generation capabilities of the learned model, where we show the effect of first transforming a given input (leftmost column) into the disentangled latent space, and, with the style latent variable fixed, manipulating the digit through the generative model to generate data with expected visual characteristics. 
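The visual-analogy procedure just described is a short loop once the networks are trained. In the sketch below, encode_style and decode are stand-ins of ours for the trained recognition and generative networks (random projections, purely for illustration of the control flow):

```python
import numpy as np

rng = np.random.default_rng(0)
D, Z, K = 784, 10, 10      # pixels, style dimensions, digit classes

# Stand-ins for trained networks: random projections, purely illustrative.
W_enc = rng.standard_normal((Z, D)) * 0.01       # "style encoder" for q(z | x)
W_dec = rng.standard_normal((D, K + Z)) * 0.01   # "decoder" for p(x | y, z)

def encode_style(x):
    return W_enc @ x

def decode(y, z):
    return W_dec @ np.concatenate([np.eye(K)[y], z])

x = rng.random(D)                 # a given input image (leftmost column)
z = encode_style(x)               # infer the style latent and hold it fixed
analogies = np.stack([decode(y, z) for y in range(K)])  # vary the digit label
```

Each row of analogies renders the same inferred style under a different digit label, which is exactly the analogy grid shown in Fig. 2(a) and (c).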
Note that both these results were obtained with partial supervision \u2013 100 (out of 50000) labelled data points in the case of MNIST and 1000 (out of 70000) labelled data points in the case of SVHN. The style latent variable z was taken to be a diagonal-covariance Gaussian of 10 and 15 dimensions respectively. Figure 2(d) shows the same for SVHN with full supervision. Figure 2(b) illustrates the alternate mode of conditional generation, where the style latent, here taken to be a 2D Gaussian, is varied with the digit held fixed.\nNext, we evaluate our model\u2019s ability to effectively learn a classifier from partial supervision. We compute the classification error on the label-prediction task on both datasets, and the results are\n\n6\n\n\fMNIST (N = 50000):\nM      Ours             M2 [18]\n100    9.71 (\u00b1 0.91)    11.97 (\u00b1 1.71)\n600    3.84 (\u00b1 0.86)    4.94 (\u00b1 0.13)\n1000   2.88 (\u00b1 0.79)    3.60 (\u00b1 0.56)\n3000   1.57 (\u00b1 0.93)    3.92 (\u00b1 0.63)\n\nSVHN (N = 70000):\nM      Ours             M1+M2 [18]\n1000   38.91 (\u00b1 1.06)   36.02 (\u00b1 0.10)\n3000   29.07 (\u00b1 0.83)   \u2014\n\nFigure 3: Right: Classification error rates for different labelled-set sizes M over multiple runs, with supervision rate \u03c1 = \u03b3M/(N + \u03b3M), \u03b3 = 1. For SVHN, we compare against a multi-stage process (M1+M2) [18], where our model only uses a single stage. Left: Classification error over different labelled set sizes and supervision rates for MNIST (top) and SVHN (bottom). Here, scaling of the classification objective is held fixed at \u03b1 = 50 (MNIST) and \u03b1 = 70 (SVHN). Note that for sparsely labelled data (M \u226a N), a modicum of over-representation (\u03b3 > 1) helps improve generalisation with better performance on the test set. Conversely, too much over-representation leads to overfitting.\n\nreported in the table in Fig. 3. 
Note that there are a few minor points of difference in the setup between our method and those we compare against [18]. We always run our models directly on the data, with no pre-processing or pre-learning on the data. Thus, for MNIST, we compare against model M2 from the baseline, which does just the same. However, for SVHN, the baseline method does not report errors for the M2 model; only the two-stage M1+M2 model, which involves a separate feature-extraction step on the data before learning a semi-supervised classifier.\nAs the results indicate, our model and objective do indeed perform on par with the setup considered in Kingma et al. [18], serving as basic validation of our framework. We note, however, that from the perspective of achieving the lowest possible classification error, one could adopt any number of alternate factorisations [24] and innovations in neural-network architectures [27, 33].\nSupervision rate: As discussed in Section 2.1, we formulate our objective to provide a handle on the relative weight between the supervised and unsupervised terms. For a given unsupervised set size N, supervised set size M, and scaling term \u03b3, the relative weight is \u03c1 = \u03b3M/(N + \u03b3M). Figure 3 shows exploration of this relative weight parameter over the MNIST and SVHN datasets and over different supervised set sizes M. Each line in the graph measures the classification error for a given M, over \u03c1, starting at \u03b3 = 1, i.e. \u03c1 = M/(N + M). In line with Kingma et al. [18], we use \u03b1 = 0.1/\u03c1. When the labelled data is very sparse (M \u226a N), over-representing the labelled examples during training can aid generalisation by improving performance on the test data. 
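The supervision rate and classification-objective scaling follow directly from N, M, and \u03b3, and are worth checking numerically (values taken from the MNIST setting in the text):

```python
def supervision_rate(N, M, gamma=1.0):
    # Relative weight of the supervised term: rho = gamma*M / (N + gamma*M).
    return gamma * M / (N + gamma * M)

rho = supervision_rate(N=50000, M=100)   # MNIST setting with gamma = 1
alpha = 0.1 / rho                        # the alpha = 0.1/rho rule from the text
```

For M = 100 and N = 50000 this gives rho of roughly 0.002 and alpha = 50.1, consistent with the fixed value of about 50 quoted for MNIST in the caption of Fig. 3; increasing gamma raises rho, i.e. over-represents the labelled examples.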
In our experiments, for the most part, choosing this factor to be ρ = M/(N + M) provides good results. However, as is to be expected, overfitting occurs when ρ is increased beyond a certain point.

3.2 Intrinsic Faces

We next move to a more complex domain involving generative models of faces. As can be seen from the graphical models for this experiment in Fig. 4, the dependency structures employed here are more complex than those from the previous experiment. Here, we use the "Yale B" dataset [6], as processed by Jampani et al. [12], for the results in Fig. 5. We are interested in showing that our model can learn disentangled representations of identity and lighting, and we evaluate its performance on the tasks of (i) classification of person identity, and (ii) regression for lighting direction.
Note that our generative model assumes no special structure: we simply specify a model where all latent variables are independent under the prior. Previous work [12] assumed a generative model with latent variables identity i, lighting l, shading s, and reflectance r, following the relationship (n · l) × r + ε for the pixel data. Here, we wish to demonstrate that our generative model still learns the correct relationship over these latent variables, by virtue of the structure in the recognition model and given (partial) supervision.
Note that in the recognition model (Fig. 4), the lighting l is a latent variable with continuous domain, and one that we partially supervise.
Further, we encode identity i as a categorical random variable, instead of constructing a pixel-wise surface-normal map (each assumed to be independent Gaussian) as is customary. This formulation allows us to address the task of predicting identity directly, instead of applying surrogate evaluation methods (e.g. nearest-neighbour classification based on inferred reflectance).

[Figure 3, left: plots of classification error (%) against supervision rate ρ, for MNIST (M = 100, 600, 1000, 3000) and SVHN (M = 1000, 3000).]

Figure 4: Generative and recognition models for the intrinsic-faces and multi-MNIST experiments. [Diagrams: the intrinsic-faces models relate identity i, lighting ℓ, shading s, and reflectance r to the image x; the multi-MNIST models relate digit labels yk, styles zk, affine transformations ak, and recurrent states hk−1, hk to the digit images xk and the combined image x, for k = 1, . . . , K.]

[Figure 5 images: inputs and reconstructions, alongside generations varying identity and varying lighting.]

                               Identity         Lighting
  Ours (Full Supervision)      1.9% (± 1.5)     3.1% (± 3.8)
  Ours (Semi-Supervised)       3.5% (± 3.4)     17.6% (± 1.8)
  Jampani et al. [12]
  (plot asymptotes)            ≈ 30             ≈ 10

Figure 5: Left: Exploring the generative capacity of the supervised model by manipulating identity and lighting given a fixed (inferred) value of the other latent variables. Right: Classification and regression error rates for the identity and lighting latent variables, fully-supervised and semi-supervised (with 6 labelled example images for each of the 38 individuals, a supervision rate of ρ = 0.5, and α = 10). Classification is a direct 1-out-of-38 choice, whereas for the comparison, error is a nearest-neighbour loss based on the inferred reflectance. Regression loss is angular distance.
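As a concrete reading of the lighting metric in Figure 5 (a plausible implementation sketch; the text only states that the regression loss is an angular distance between direction vectors):

```python
import numpy as np

def angular_error_degrees(l_pred, l_true):
    """Angle (in degrees) between predicted and true lighting-direction vectors."""
    l_pred = np.asarray(l_pred, dtype=float)
    l_true = np.asarray(l_true, dtype=float)
    cos = np.dot(l_pred, l_true) / (np.linalg.norm(l_pred) * np.linalg.norm(l_true))
    # clip guards against round-off pushing |cos| slightly above 1
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
```

Being scale-invariant, the metric compares only the directions, not the magnitudes, of the two vectors.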
Figure 5 presents both qualitative and quantitative evaluation of the framework's ability to jointly learn the structured recognition model and the generative model parameters.

3.3 Multi-MNIST

Finally, we conduct an experiment that extends the complexity of the prior models even further. In particular, we explore the capacity of our framework to handle models with stochastic dimensionality, where the number of latent variables is itself determined by a random variable, and models that can be composed of other, smaller (sub-)models. We conduct this experiment in the domain of multi-MNIST. This is an apposite choice as it satisfies both requirements above: each image can have a varying number of individual digits, which dictates that the model must learn to count, and as each image is itself composed of (scaled and translated) exemplars from the MNIST data, we can employ the MNIST model itself within the multi-MNIST model.
The model structure that we assume for the generative and recognition networks is shown in Fig. 4. We extend the models from the MNIST experiment by composing them with a stochastic sequence generator, in which the loop length K is a random variable. For each loop iteration k = 1, . . . , K, the generative model samples a digit yk and a style zk, and uses these to generate a digit image xk in the same manner as in the earlier MNIST example. Additionally, an affine transformation is sampled for each digit in each iteration, which transforms the digit images xk onto a common, combined canvas that represents the final generated image x, using a spatial transformer network [11].
In the recognition model, we predict the number of digits K from the pixels in the image. For each loop iteration k = 1, . . . , K, we define a Bernoulli-distributed digit image xk.
When supervision is available, we compute the probability of xk from the binary cross-entropy in the same manner as in the likelihood term for the MNIST model. When no supervision is available, we deterministically set xk to the mean of the distribution. This can be seen as akin to providing bounding boxes around the constituent digits as supervision, which must be taken into account when learning the affine transformations that decompose a multi-MNIST image into its constituent MNIST-like images. This model design is similar to the one used in DRAW [10], recurrent VAEs [3], and AIR [5].

  Count accuracy (%)
  M/(M+N)    w/o MNIST          w/ MNIST
  0.1        85.45 (± 5.77)     76.33 (± 8.91)
  0.5        93.27 (± 2.15)     80.27 (± 5.45)
  1.0        99.81 (± 1.81)     84.79 (± 5.11)

Figure 6: Left: Example input multi-MNIST images and reconstructions. Top-Right: Decomposition of multi-MNIST images into constituent MNIST digits. Bottom-Right: Count accuracy over different supervised set sizes M for given dataset size M + N = 82000.

In the absence of a canonical multi-MNIST dataset, we created our own from the MNIST dataset by manipulating the scale and positioning of the standard digits on a combined canvas, evenly balanced across the counts (1-3) and digits. We then conducted two experiments within this domain. In the first experiment, we seek to measure how well the stochastic sequence generator learns to count on its own, with no heed paid to disentangling the latent representations of the underlying digits. Here, the generative model presumes the availability of individual MNIST-digit images, generating combinations under sampled affine transformations.
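The dataset-construction step described above can be sketched as follows. This is a minimal stand-in using integer scaling by pixel repetition on a fixed-size canvas; the canvas size and scale range are illustrative assumptions, not the paper's exact settings:

```python
import numpy as np

rng = np.random.default_rng(0)

def compose_canvas(digits, canvas_size=64, max_count=3):
    """Place a random number (1..max_count) of 28x28 digit images onto a blank
    canvas at random scales and positions, returning the canvas and the count."""
    count = int(rng.integers(1, max_count + 1))
    canvas = np.zeros((canvas_size, canvas_size))
    for _ in range(count):
        digit = digits[rng.integers(len(digits))]
        scale = int(rng.integers(1, 3))  # integer scale via pixel repetition
        img = np.repeat(np.repeat(digit, scale, axis=0), scale, axis=1)
        h, w = img.shape
        top = int(rng.integers(0, canvas_size - h + 1))
        left = int(rng.integers(0, canvas_size - w + 1))
        # overlay with max so overlapping digits remain visible
        canvas[top:top + h, left:left + w] = np.maximum(
            canvas[top:top + h, left:left + w], img)
    return canvas, count
```

Balancing across counts and digit classes, as described in the text, would then amount to stratified sampling over `count` and the digit indices.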
In the second experiment, we extend the above model to also incorporate the pre-trained MNIST model from the previous section, which allows the generative model to sample MNIST-digit images while also being able to predict the underlying digits. This also demonstrates how we can leverage the compositionality of models: when a complex model has a known simpler model as a substructure, the simpler model and its learned weights can be dropped in directly.
The count accuracies across different supervised set sizes, reconstructions for a random set of inputs, and the decomposition of a given set of inputs into their constituent individual digits are shown in Fig. 6. All reconstructions and image decompositions shown correspond to the nested-model configuration. We observe that not only are we able to reliably infer the counts of the digits in the given images, but we are also able to simultaneously reconstruct the inputs as well as their constituent parts.
4 Discussion and Conclusion
In this paper we introduce a framework for learning disentangled representations of data using partially-specified graphical model structures and semi-supervised learning schemes in the domain of variational autoencoders (VAEs). This is accomplished by defining hybrid generative models which incorporate both structured graphical models and unstructured random variables in the same latent space. We demonstrate the flexibility of this approach by applying it to a variety of different tasks in the visual domain, and evaluate its efficacy at learning disentangled representations in a semi-supervised manner, showing strong performance.
Such partially-specified models yield recognition networks that make predictions in an interpretable and disentangled space, constrained by the structure provided by the graphical model and the weak supervision.
The framework is implemented as a PyTorch library [26], enabling the construction of stochastic computation graphs which encode the requisite structure and computation. This suggests a further direction to explore in the future: extending the stochastic computation graph framework to probabilistic programming [9, 35, 36]. Probabilistic programs go beyond the presented framework to permit more expressive models, incorporating recursive structures and higher-order functions. The combination of such frameworks with neural networks has recently been studied in Le et al. [23] and Ritchie et al. [29], indicating a promising avenue for further exploration.
Acknowledgements
This work was supported by the EPSRC, ERC grant ERC-2012-AdG 321162-HELIOS, EPSRC grant Seebibyte EP/M013774/1, and EPSRC/MURI grant EP/N019474/1. BP & FW were supported by The Alan Turing Institute under the EPSRC grant EP/N510129/1. JWM, FW & NDG were supported under DARPA PPAML through the U.S. AFRL under Cooperative Agreement FA8750-14-2-0006. JWM was additionally supported through startup funds provided by Northeastern University. FW was additionally supported by Intel and DARPA D3M, under Cooperative Agreement FA8750-17-2-0093.

References

[1] Jonathan Berant, Vivek Srikumar, Pei-Chun Chen, Abby Vander Linden, Brittany Harding, Brad Huang, Peter Clark, and Christopher D. Manning. Modeling biological processes for reading comprehension. In EMNLP, 2014.
[2] Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519, 2015.
[3] Junyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron C. Courville, and Yoshua Bengio. A recurrent latent variable model for sequential data. In Advances in Neural Information Processing Systems, pages 2980–2988, 2015.
[4] Cian Eastwood and Christopher K. I. Williams. A framework for the quantitative evaluation of disentangled representations. In Proceedings of the Workshop on Learning Disentangled Representations: from Perception to Control, at NIPS 2017, 2017.
[5] S. M. Ali Eslami, Nicolas Heess, Theophane Weber, Yuval Tassa, Koray Kavukcuoglu, and Geoffrey E. Hinton. Attend, infer, repeat: Fast scene understanding with generative models. arXiv preprint arXiv:1603.08575, 2016.
[6] A. S. Georghiades, P. N. Belhumeur, and D. J. Kriegman. From few to many: Illumination cone models for face recognition under variable lighting and pose. IEEE Trans. Pattern Anal. Mach. Intelligence, 23(6):643–660, 2001.
[7] Samuel Gershman and Noah Goodman. Amortized inference in probabilistic reasoning. In CogSci, 2014.
[8] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
[9] N. D. Goodman, V. K. Mansinghka, D. Roy, K. Bonawitz, and J. B. Tenenbaum. Church: A language for generative models. In Uncertainty in Artificial Intelligence, pages 220–229, 2008.
[10] Karol Gregor, Ivo Danihelka, Alex Graves, Danilo Rezende, and Daan Wierstra. DRAW: A recurrent neural network for image generation. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 1462–1471, 2015.
[11] Max Jaderberg, Karen Simonyan, Andrew Zisserman, et al. Spatial transformer networks. In Advances in Neural Information Processing Systems, pages 2017–2025, 2015.
[12] Varun Jampani, S. M. Ali Eslami, Daniel Tarlow, Pushmeet Kohli, and John Winn. Consensus message passing for layered graphical models. In International Conference on Artificial Intelligence and Statistics, pages 425–433, 2015.
[13] E. Jang, S. Gu, and B. Poole. Categorical reparameterization with Gumbel-softmax. arXiv preprint arXiv:1611.01144, 2016.
[14] Matthew Johnson, David K. Duvenaud, Alex Wiltschko, Ryan P. Adams, and Sandeep R. Datta. Composing graphical models with neural networks for structured representations and fast inference. In Advances in Neural Information Processing Systems, pages 2946–2954, 2016.
[15] Matthew J. Johnson, David K. Duvenaud, Alex B. Wiltschko, Sandeep R. Datta, and Ryan P. Adams. Composing graphical models with neural networks for structured representations and fast inference. In Advances in Neural Information Processing Systems, 2016.
[16] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014. URL http://arxiv.org/abs/1412.6980.
[17] Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. In Proceedings of the 2nd International Conference on Learning Representations, 2014.
[18] Diederik P. Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, pages 3581–3589, 2014.
[19] Daphne Koller and Nir Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.
[20] Tejas D. Kulkarni, Pushmeet Kohli, Joshua B. Tenenbaum, and Vikash Mansinghka. Picture: A probabilistic programming language for scene perception. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4390–4399, 2015.
[21] Tejas D. Kulkarni, William F. Whitney, Pushmeet Kohli, and Josh Tenenbaum. Deep convolutional inverse graphics network. In Advances in Neural Information Processing Systems, pages 2530–2538, 2015.
[22] Steffen L. Lauritzen and David J. Spiegelhalter. Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society, Series B (Methodological), pages 157–224, 1988.
[23] Tuan Anh Le, Atilim Gunes Baydin, and Frank Wood. Inference compilation and universal probabilistic programming. arXiv preprint arXiv:1610.09900, 2016.
[24] L. Maaløe, C. K. Sønderby, S. K. Sønderby, and O. Winther. Auxiliary deep generative models. arXiv preprint arXiv:1602.05473, 2016.
[25] C. J. Maddison, A. Mnih, and Y. W. Teh. The concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712, 2016.
[26] PyTorch. http://pytorch.org/, 2017. Accessed: 2017-11-04.
[27] A. Rasmus, H. Valpola, M. Honkala, M. Berglund, and T. Raiko. Semi-supervised learning with ladder networks. In Advances in Neural Information Processing Systems, pages 3532–3540, 2015.
[28] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning, pages 1278–1286, 2014.
[29] Daniel Ritchie, Paul Horsfall, and Noah D. Goodman. Deep amortized inference for probabilistic programs. arXiv preprint arXiv:1610.05735, 2016.
[30] John Schulman, Nicolas Heess, Theophane Weber, and Pieter Abbeel. Gradient estimation using stochastic computation graphs. In Advances in Neural Information Processing Systems, pages 3510–3522, 2015.
[31] N. Siddharth, A. Barbu, and J. M. Siskind. Seeing what you're told: Sentence-guided activity recognition in video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 732–739, June 2014.
[32] Kihyuk Sohn, Honglak Lee, and Xinchen Yan. Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pages 3465–3473, 2015.
[33] C. K. Sønderby, T. Raiko, L. Maaløe, S. K. Sønderby, and O. Winther. Ladder variational autoencoders. In Advances in Neural Information Processing Systems, 2016.
[34] Andreas Stuhlmüller, Jacob Taylor, and Noah Goodman. Learning stochastic inverses. In Advances in Neural Information Processing Systems, pages 3048–3056, 2013.
[35] David Wingate, Andreas Stuhlmueller, and Noah D. Goodman. Lightweight implementations of probabilistic programming languages via transformational compilation. In International Conference on Artificial Intelligence and Statistics, pages 770–778, 2011.
[36] Frank Wood, Jan Willem van de Meent, and Vikash Mansinghka. A new approach to probabilistic programming inference. In Artificial Intelligence and Statistics, pages 1024–1032, 2014.