{"title": "Probabilistic Neural Programmed Networks for Scene Generation", "book": "Advances in Neural Information Processing Systems", "page_first": 4028, "page_last": 4038, "abstract": "In this paper we address the text to scene image generation problem. Generative models that capture the variability in complicated scenes containing rich semantics is a grand goal of image generation. Complicated scene images contain rich visual elements, compositional visual concepts, and complicated relations between objects. Generative models, as an analysis-by-synthesis process, should encompass the following three core components: 1) the generation process that composes the scene; 2) what are the primitive visual elements and how are they composed; 3) the rendering of abstract concepts into their pixel-level realizations. We propose PNP-Net, a variational auto-encoder framework that addresses these three challenges: it flexibly composes images with a dynamic network structure, learns a set of distribution transformers that can compose distributions based on semantics, and decodes samples from these distributions into realistic images.", "full_text": "Probabilistic Neural Programmed Networks\n\nfor Scene Generation\n\nZhiwei Deng, Jiacheng Chen, Yifang Fu, Greg Mori\n\nSimon Fraser University\n\n{zhiweid, jca348, yifangf}@sfu.ca, mori@cs.sfu.ca\n\nAbstract\n\nIn this paper we address the text to scene image generation problem. Generative\nmodels that capture the variability in complicated scenes containing rich semantics\nis a grand goal of image generation. Complicated scene images contain varied\nvisual elements, compositional visual concepts, and complicated relations between\nobjects. 
Generative models, as an analysis-by-synthesis process, should encompass\nthe following three core components: 1) the generation process that composes the\nscene; 2) what are the primitive visual elements and how are they composed; 3) the\nrendering of abstract concepts into their pixel-level realizations. We propose PNP-\nNet, a variational auto-encoder framework that addresses these three challenges:\nit \ufb02exibly composes images with a dynamic network structure, learns a set of\ndistribution transformers that can compose distributions based on semantics, and\ndecodes samples from these distributions into realistic images.\n\n1\n\nIntroduction\n\nPowerful latent data representations should encode abstract, semantic concepts. Accompanying\ngenerative models to decode these rich representations into their varied realizations as data instances,\nalong with learning of the latent representation directly from data represent a coveted guerdon of AI\nresearch. As such, the search for expressive, learnable latent encodings along with corresponding\ngeneration techniques has been a preoccupation of signi\ufb01cant research effort.\nExamples of data modeling tasks abound. Speci\ufb01cally in this work we consider that of image\ngeneration. We are particularly interested in the representation of high-level semantic concepts and\nhence focus on the depiction of a complex scene x, as illustrated in Fig. 1. Such scenes are composed\nof a variable number of objects, with individual properties and relations among them. Effective latent\nrepresentations for these images need \ufb02exibility and compositionality.\nImpressive strides in generative models for images have been made via a number of advances in\ncontinuous stochastic latent variable models under the variational auto-encoder (VAE) formalism\n[1, 2, 3]. We work within this formalism, in which a (conditional) prior p(z|y) controls generation\nof an output x under condition y via a non-linear mapping p\u03b8(x|z). 
Recent work (e.g. [3]) has emphasized that the utility of a VAE hinges on its ability to capture useful information in the latent representation z, complementary to that in powerful decoder networks pθ(x|z).
On the other hand, renewed interest in programmatic representations in AI has sought modular, generalized, compositional representations for high-level concepts [4, 5, 6, 7, 8]. This line of work uses dynamically programmable networks and has demonstrated impressive results at question answering, graphics-based image rendering, and synthesizing programs for computation or image generation. We build on these approaches to learn compositional, modular latent representations for scenes.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Figure 1: Generating a complicated scene with rich primitive concepts, compositionality and interactions is challenging. We use text-based programs (which can be derived from other forms of language, e.g. sentences) and propose PNP-Net, a powerful and flexible model for combining and generating scene images. Samples generated by our model are shown here.

The crux of the problem in learning latent data representations is this dichotomy between powerful decoder networks and the latent encoding. However, structured, complex scenes present an enticing opportunity for learning latent representations. While recent advances in generative modeling have produced exciting progress toward image generation, the successes are largely focused on images of a single object or hand-drawn symbol.
We leverage these lower-level image generation successes to build higher-level latent representations of semantic concepts. Modeling complex phenomena arguably requires a structured form to the prior p(z|y). 
We advocate for the use of a modular, compositional prior that permits learning disentangled representations that flexibly scale to variable numbers of, and relationships between, concepts in a scene. Specifically, we formulate a generic recursive form of the prior: p(z|y) = τ(p_1(z_1|y), . . . , p_K(z_K|y)), where each component of this prior is itself defined in a similar fashion. Consider the case of composing a scene of a shiny, red cylinder next to a matte, blue sphere. Priors for the properties of the objects, such as p_shiny(z_shiny|y) and p_red(z_red|y), are composed via aggregation operators τ(·, . . . , ·) in a programmatic fashion, i.e. yielding a prior p_red,shiny(z_red,shiny|y) = τ(p_red(z_red|y), p_shiny(z_shiny|y)) for the composite concept of red and shiny.
This paper describes Probabilistic Neural Programmed Networks (PNP-Net), a probabilistic modeling framework. PNP-Net combines the advantages of modular programmable frameworks with probabilistic modeling. The contributions of this paper include: 1) a set of visual elements and neural modules/programs for modifying the appearance of these visual elements; 2) integrating these probabilistic neural modules into the canonical VAE framework, generalizing the VAE by empowering it with reusable, composable, interpretable modules; and 3) demonstrating generalization ability for complicated scene understanding, including zero-shot learning of novel compositions.
This approach leads to compositional models of appearance that can be utilized across variable scenes. Learning is sample-efficient, in that shared properties (such as shiny cylinders and shiny spheres) can benefit from commonality among training samples. The model we present is effective at harnessing the strengths of powerful generative decoders, while encoding semantic properties. 
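To make the recursive prior composition above concrete, here is a minimal numerical sketch (ours, not the paper's implementation): a product-of-experts-style aggregation operator τ applied to two hypothetical primitive diagonal Gaussian priors, p_red and p_shiny.

```python
import numpy as np

def tau_poe(mu1, var1, mu2, var2):
    """One simple choice of aggregation operator tau: a product of two
    diagonal Gaussian experts, which is again a diagonal Gaussian.
    (PNP-Net learns a gated variant of this; see Sec. 3.1.2.)"""
    prec1, prec2 = 1.0 / var1, 1.0 / var2
    var = 1.0 / (prec1 + prec2)              # combined variance
    mu = var * (prec1 * mu1 + prec2 * mu2)   # precision-weighted mean
    return mu, var

# Hypothetical primitive priors p_red(z_red|y) and p_shiny(z_shiny|y).
mu_red, var_red = np.array([1.0, 0.0]), np.array([1.0, 1.0])
mu_shiny, var_shiny = np.array([0.0, 1.0]), np.array([1.0, 1.0])

# p_{red,shiny}(z|y) = tau(p_red, p_shiny): prior for "red and shiny".
mu_rs, var_rs = tau_poe(mu_red, var_red, mu_shiny, var_shiny)
```

Because each argument can itself be the output of a τ, composite priors nest recursively, exactly as in p(z|y) = τ(p_1(z_1|y), . . . , p_K(z_K|y)).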
We demonstrate that we can learn compositional semantic priors that capture the variability in complex scenes. These models outperform competing approaches on Color-MNIST and CLEVR-G image generation tasks, while yielding semantically meaningful latent representations.

2 Related Work

This paper proposes a generative modeling approach by linking language elements to both semantic latent priors and functional symbolic networks. We review previous work in the following related fields: generative modeling, compositional semantics, language in vision, and disentangled representations.
Generative modeling: Image synthesis via generative models has received a flurry of renewed interest. Training in a generative adversarial framework [9] has been used for a variety of tasks. Promising results have been achieved for image-conditional tasks [10, 11]. These can be augmented with hierarchical models [12, 13, 14, 15] for maintaining image structure in the context of body pose. Gregor et al. [16] advocate a recurrent approach for image generation, with successive rounds of refinement in a generative process that remains close to the pixels. A push toward hierarchical variants could include layers or successively more abstract information regarding image content. PixelCNN [17, 18] demonstrates the power of low-level architectures to model fine-scale pixel detail. Impressive (conditional) image generation results are achieved with a slate of variants that include recurrent architectures, convolutional approaches, and gated models.
Generative models with learnable priors: PixelVAE [2] utilizes hierarchical latent variables with auto-regressive structure, in line with a PixelCNN output pixel value decoding. Chen et al. [3] analyze the dichotomy between latent codes and powerful decoders. 
Approaches that narrow the decoder's view or use auto-regressive latent codes can encourage the VAE's latent code to store useful information. Hoffman [19] develops a Markov chain Monte Carlo algorithm for refining an initial variational approximation to the data likelihood.
Programmatic representations and reasoning: Neural Module Networks [5] dynamically construct neural network architectures by composition of modules. Wu et al. [8, 20] describe methods for extracting physical world representations for scenes and videos via structured building block components. Johnson et al. [21] develop symbolic modules for visual reasoning. Subsequent work [22] generates an image from a scene graph using graph convolution to decode object layouts.
SPIRAL [6] produces programs capable of generating images in an adversarial context. Reed and de Freitas [4] describe the Neural Programmer-Interpreter architecture: building blocks, including a key-value data store and composable modules that represent function calls and arguments, are combined in a curriculum learning framework. Parisotto et al. [23] synthesize complex programs for input-output synthesis in a domain-specific language.
Language and vision: Words and pictures research has deep roots [24]. Recent work on image captioning typically uses encoder-decoder architectures (e.g. [25, 26]), i.e. learning a vector to represent the meaning of a word/sentence/description. There is also work combining word embeddings with images. Karpathy et al. [27] ground language fragments in visual appearance. DeViSE [28] proposes a joint embedding between single-word labels and images. But these works only use a single vector to represent an image and are not applicable to generative modeling. 
The realization of a concept covers infinitely many possible image instances, namely a distribution of images.
Learning disentangled representations: There has been significant work on learning disentangled representations. Kulkarni et al. [29] encourage VAE latent variables to focus on disentangled factors via a novel model structure and mini-batches with certain transformations active or inactive. InfoGAN [30] learns disentangled angles, lighting, etc. Reed et al. [31] apply visual differences to new images to make content changes. FVAE [32] learns disentangled representations for audience identities and times. In this work we aim for a more general image understanding, linking visual appearance to abstract descriptions to more deeply understand semantics.

3 Model

In this section we describe how we can take the semantic description y of a scene and generate an image x containing these concepts. Our proposed model has two core components: 1) a set of mapping functions τ(·, . . . , ·), which take either semantic concepts or a series of distributions as input, and generate distributions over the latent space capturing their combined meaning; 2) a probabilistic modeling framework which performs inference and learning using this latent space.
We work with a description y that forms a tree structure. Construction of the latent representation z proceeds with a bottom-up pass over the tree structure. Each node i ∈ y has a type t(i) and a concept word w(i). To each node we apply the appropriate mapping function τ_t(i) over it and its children, modulated by the content of the concept word w(i). For example, a "describe" node with concept word "cube" would specify how to combine the child visual property "brown" with the object "cube" into a combined representation for a brown cube. 
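The bottom-up pass can be made concrete with a hypothetical skeleton (the node fields and operator stubs are ours, not the paper's API): children are evaluated first, then the operator for the node's type t(i) is applied, modulated by the concept word w(i).

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    type: str                  # t(i): "concept", "describe", "combine", ...
    word: str                  # w(i): concept word modulating the operator
    children: list = field(default_factory=list)

def evaluate(node, ops):
    """Bottom-up pass: evaluate children first, then apply the mapping
    function tau_{t(i)} for this node's type, modulated by w(i)."""
    child_vals = [evaluate(c, ops) for c in node.children]
    return ops[node.type](node.word, child_vals)

# Toy operators that return strings; the real model returns latent
# distributions (mean/variance tensors) instead.
ops = {
    "concept": lambda w, _: w,
    "describe": lambda w, kids: f"{kids[0]} {w}",   # e.g. brown + cube
}

tree = Node("describe", "cube", [Node("concept", "brown")])
result = evaluate(tree, ops)   # "brown cube"
```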
Figure 3 shows an example tree structure.
This modular approach with mapping functions allows us to reuse aspects of our model and combine them to represent complex scenes. The framework permits varied mapping functions with different input arity and output dimension. In the following sections we elaborate on this model and provide specific examples of the mapping functions we use (illustrated in Fig. 2).

Figure 2: Core modules. a) Combine takes two distributions and generates a compound distribution as output. b) Describe models the "decorating" interaction between distributions and generates output distributions. c) Transform instantiates the size of a spatial appearance distribution, and performs bilinear interpolation on it to generate a new appearance distribution. d) Layout places two appearance distributions according to sampled offsets on a canvas and renders it by conv layers.

3.1 Probabilistic operators

Given abstract semantic concepts, there are many possible groundings in the visual domain. Assume there is a latent space which describes both the semantic concepts and visual data. Our goal is to define a set of reusable operators, which can be used to gradually compose a latent distribution p(z|y) that fully describes the possible complicated scenes. We follow the standard variational autoencoder setup and use Gaussian distributions p(z; μ, σ) in the latent space, with μ and σ as mean and variance respectively. To handle the background, which is not described by the semantics, we keep a globally learned background mean μ and variance σ.

3.1.1 Concept mapping operator

Primitive concepts are the most basic elements, e.g. attributes (shiny, blue, metal, ...), objects (car, sphere, table, ...) and relations between objects (on top of, holding, ...). 
We first define a concept mapping operator τ_concept(w), which takes the concept word w and generates distributions in latent space that describe the properties of the concept. Ideally, the distribution should describe what the visual appearance variation is, and what the location/scale information is, for that concept. We use the following coupled distributions to model the appearance and location/scale latent distributions:

τ_concept(w) = [ p(z^a; f^a_μ(w), f^a_σ(w)), p(z^s; f^s_μ(w), f^s_σ(w)) ]    (1)

here f_μ and f_σ embed the word w to its corresponding mean μ and variance σ respectively. z^a is a latent variable which characterizes the visual appearance of concept w, and z^s is the variable for the location/scale of w. Note that we design μ^a and σ^a to be C × H × W tensors with height H and width W for maintaining spatial appearance information, while μ^s and σ^s are simply C-dim vectors encoding scale/location information.

3.1.2 Aggregation operators

Given the primitive concept latent distributions derived from the concept mapping operator, we now have the basic elements for composing objects and more complex scenes. In the rest of this section, we describe a set of aggregation mapping functions which operate on distributions. These operators take in intermediate distributions, and generate output distributions based on the semantic meaning of the operator.
Combine: The combine operator combines attributes. 
It takes multiple attribute distributions, and aims to generate compound latent distributions which represent composite concepts "A and B". For example, τ_combine(p_shiny, p_metal) will output a distribution for shiny metal. This operator is defined separately for appearance and location/scale to process p(z^a) and p(z^s) in parallel, due to the different properties of appearance and scale information. We use Product of Experts (PoE) [33, 34] for combining distributions, and further propose a gated version of PoE (g-PoE)¹, which takes the parameters of distributions p_i(z_i) and p_j(z_j), and generates a set of gates {g} to control the information during combining. The combine operator can be summarized by:

τ_combine(p_i, p_j) = (1/Z) p(z_i; g^μ_i ⊙ μ_i, g^σ_i ⊙ σ_i) p(z_j; g^μ_j ⊙ μ_j, g^σ_j ⊙ σ_j)    (2)

where ⊙ is the element-wise product. Note that this operator can be reused repeatedly to combine more than two attributes.
Describe: The describe operator grounds attributes to an exact object. It models the interactions between attribute distributions and an object distribution in a content-aware manner, considering that the way attributes affect an object depends on the properties of the target object. When composing the distribution for red sphere, it first takes the distributions of red and sphere, then generates a content-aware "decorating" distribution p′, which is finally imposed on the distribution of sphere to get the distribution p(z|red, sphere). Note that this operator is also separately defined for appearance and location/scale to process p(z^a) and p(z^s) in parallel. 
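The g-PoE used by the combine and describe operators can be sketched numerically as follows. This is illustrative only: in PNP-Net the gates come from a learned network, and we gate variances here where the paper gates σ.

```python
import numpy as np

def poe(mu1, var1, mu2, var2):
    """Product of two diagonal Gaussian experts, again a diagonal Gaussian."""
    var = 1.0 / (1.0 / var1 + 1.0 / var2)
    return var * (mu1 / var1 + mu2 / var2), var

def gated_poe(mu_i, var_i, mu_j, var_j, g_mu_i, g_var_i, g_mu_j, g_var_j):
    """Eq. (2)-style combine: element-wise gates modulate each expert's
    parameters before the product, controlling each expert's contribution."""
    return poe(g_mu_i * mu_i, g_var_i * var_i, g_mu_j * mu_j, g_var_j * var_j)

mu_i, var_i = np.array([2.0]), np.array([1.0])
mu_j, var_j = np.array([0.0]), np.array([1.0])
ones = np.ones(1)

# With all gates at 1, g-PoE reduces to plain PoE.
mu, var = gated_poe(mu_i, var_i, mu_j, var_j, ones, ones, ones, ones)
```

Inflating an expert's gated variance shrinks its precision, so its "vote" in the product is down-weighted; this is the control that plain PoE lacks.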
We represent the describe module by:

τ_describe(p_i, p_j) = p′(z′; f_μ(μ_i, μ_j), f_σ(σ_i, σ_j)) ⊗ p_j(z_j; μ_j, σ_j)    (3)

where ⊗ represents g-PoE, and f_μ(·), f_σ(·) are the functions for combining means or variances. Note that f_μ(·) and f_σ(·) are CNNs when the inputs are appearance distributions, and MLPs for scale/location distributions.
Transform: Scale invariance is critical to modeling visual content. The transform operator instantiates the size of an object by first sampling bounding box sizes from a scale distribution, and then performing bilinear interpolation on both mean and standard deviation to adapt the appearance distribution to varied sizes. More precisely, we first sample a scale tuple s = (h, w) from p(s|z^s), z^s ∼ p(z^s), indicating the size of the bounding box in latent space. The transformation matrix A is then defined as a diagonal matrix with scaling factors specified by h and w. After the transformation, the re-sampled mean and variance have new spatial size h and w. The transform module is summarized by:

τ_transform(p) = p(z; bilinear(μ, A), bilinear(σ, A))    (4)

Layout: The layout operator models the interactions between objects. Based on semantic concepts such as right-next-to, holding, etc., it generates the positions for arranging the latent distributions of two children nodes. Specifically, we first sample a tuple l = (x, y) from p(l|z^s), indicating the offsets between the two children's bounding boxes in latent space, where z^s ∼ p(z^s) and p(z^s) is generated by our concept mapping operator τ_concept from the relation word of the layout node. Then we place two masks on the background canvas, guided by the offsets l and the bounding box scales s of the children nodes. We then fill the distributions p_i and p_j from the children nodes into the current canvas according to the masks. 
The rest of the canvas is filled with background biases. The canvas size is set to the minimum size that covers all objects, or to the actual latent map size of the whole image when layout is the root node. Denote the mean and variance of the final canvas as μ^ij_o and σ^ij_o, and let f_μ(·) and f_σ(·) represent CNNs for further refining the final canvas; the final step of the layout module can be expressed as:

τ_layout(p_i, p_j) = p(z; f_μ(μ^ij_o), f_σ(σ^ij_o))    (5)

¹In the variational autoencoder setting, PoE potentially has a "ghost information" issue and lacks the ability to control how experts are combined. See the supplementary material for more information.

Figure 3: Our model takes a program derived from text concepts, performs compositional generation in the latent space z, then decodes the latent code to image space. During training, an encoder (Reader) is used as a proposal distribution for learning the latent space. Describe and Transform are put together in the figure to emphasize the fact that the application of a Transform module always follows a Describe module in our formulation.

3.2 Model formulation

Given the semantic description for image x, semantic information y, and operators {τ(·, ..., ·)}, we design a model that utilizes the set of operators to generate x from y. We first map the semantic information into programmatic operations P which represent the generation process. Similar to [5], we use tree structures for y to represent the generation process, but note that our operators and method could be generalized to directed acyclic graph structures.
Following the generation process, a model should be able to gradually compose primitive latent distributions into a full latent distribution p(z|y) describing a complex scene, from which it samples a latent z as a possible valid scene, then renders it into an image (normally with a learnable general decoder). However, finding a latent space Z describing the observations (image x and semantic concepts y) is known to be a hard problem, since calculating the true posterior distribution p(z|x, y) is often intractable. We follow a standard variational inference framework and use a proposal distribution q(z|x) to approximate such a latent space by minimizing the KL divergence KL(q(z|x)||p(z|x, y)), where p(z|x, y) is the true posterior distribution. By rewriting the KL term, we get the following equations:

log p(x|y) − KL(q(z|x)||p(z|x, y)) = E_{z∼q(z|x)}[log p(z, x|y) − log q(z|x)]    (6)
                                   = E_{z∼q(z|x)}[log p(x|z)] − KL(q(z|x)||p(z|y))    (7)
                                   = E_{z∼q(z|x)}[log p(x|z)] − KL(q(z|x)||P(τ, y))    (8)

In the above equations, we formulate the prior distribution p(z|y) as a learnable prior, and assume that z contains the full information needed by the decoder to reconstruct x. The learnable distribution is defined by the generation process P(τ, y) = τ(p_1(z_1|y), . . . , p_K(z_K|y)). Since the KL divergence is non-negative, we have log p(x|y) ≥ E_{z∼q(z|x)}[log p(x|z)] − KL(q(z|x)||P(τ, y)). The right-hand side is known as the Evidence Lower Bound (ELBO), where the first term is the likelihood for reconstructing observations, and the second term minimizes the KL divergence between a particular data sample's proposal and a complex, semantics-driven prior that can correspond to many possibilities. We optimize the ELBO to maximize the log likelihood of the conditional distribution p(x|y). 
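With diagonal Gaussians for both the proposal q(z|x) and the composed prior P(τ, y), the KL term of the ELBO above has the usual closed form. A minimal sketch (shapes and variable names are illustrative, not the paper's implementation):

```python
import numpy as np

def kl_diag_gauss(mu_q, var_q, mu_p, var_p):
    """KL( q || p ) between diagonal Gaussians, summed over dimensions."""
    return 0.5 * np.sum(var_q / var_p + (mu_p - mu_q) ** 2 / var_p
                        - 1.0 + np.log(var_p) - np.log(var_q))

rng = np.random.default_rng(0)
mu_q, var_q = np.zeros(4), np.ones(4)   # proposal q(z|x), e.g. from the Reader
mu_p, var_p = np.zeros(4), np.ones(4)   # composed prior P(tau, y)

# Reparameterized sample z ~ q(z|x); the decoder's log-likelihood of x
# given z would supply the first ELBO term, E_q[log p(x|z)].
z = mu_q + np.sqrt(var_q) * rng.standard_normal(4)

# Second ELBO term: KL(q(z|x) || P(tau, y)); zero here since q equals the prior.
kl = kl_diag_gauss(mu_q, var_q, mu_p, var_p)
```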
Along with the training process, all operators affiliated with the tree are updated through the second KL term. The flexibility of the operators in the prior permits a variety of groundings in the visual domain. The M-projection KL term for the prior p(z|y) encourages the scene prior to cover those groundings in the latent space. Note that to better train the Transform and Layout operators, we also use ground-truth bounding boxes as an auxiliary loss for learning the size and offset samplers.
By defining a small set of operators/modules, the prior distribution p(z|y) conditioned on the semantic concepts acquires much more flexibility. During generation, our model takes the programmatic process P and dynamically constructs a network p(·; θ) by assembling and reusing the operators τ. The network finally generates the prior distribution p(z|y; θ), which is then mapped to possibilities of visual data p(x|z) by a decoder function.

4 Experiments

We evaluate PNP-Net on the task of text to scene generation across a series of datasets and experimental settings of varying complexity.
Performance Measure: Measuring the quality of samples from a generative model is challenging. Classifier-based scores have proven to be more directly relevant to the visual quality of generated samples. Inception score [35] measures generated samples using the final output of an Inception network trained for classification. However, most classifier-based scores are designed for measuring the quality of a whole image or a single object. 
In contrast, we require a metric for measuring the quality of generated objects in a complex scene.
We propose to use the mean average precision (mAP) of a pre-trained detector to measure whether the conditionally generated image samples possess the desired semantic information. We modify the standard PASCAL VOC detection metric as follows: we measure the semantic correctness of conditional generation per class by considering standard precision-recall curves of the pre-trained detector. Unlike object detection, there are no ground-truth bounding boxes for each generated image. We instead use the conditional information (semantic concepts) as ground truth. In an image, a correctly generated object class which aligns with the ground truth is a true positive; any other detections for the same class are false positives. Recall for a class is calculated as the number of correctly generated objects (true positives) divided by the total count of that class in the conditions. We train three types of detectors to test the generated samples: an objectness detector (OBJ-N), an object detector (OBJ-T), and an object-attribute detector (OBJ-A). The objectness detector measures whether the generated images contain object-like samples (regardless of object class); the object detector and object-attribute detector measure whether the model generates correct objects or objects with appropriate attributes, respectively. In all the following experiment tables, we report the accuracy of the detectors on ground-truth data using standard (bounding-box-based) mAP to verify that the detectors are highly accurate, and our proposed detector score on images generated by the models.
Implementation Details: We briefly describe our implementation details here. More details can be found in our project repository², where we release the code for model training/evaluation and dataset generation. For the encoder and decoder, we use convolutional blocks with residual connections. 
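The per-class true/false-positive counting described under Performance Measure can be sketched as follows. This is a simplification of ours: detections are reduced to class labels, ignoring boxes and confidence-based precision-recall curves.

```python
from collections import Counter

def detector_score(detections, conditions):
    """Per-class precision/recall with conditioning concepts as ground
    truth: per image, up to count(c) detections of class c are true
    positives; surplus detections of the same class are false positives.
    detections: per-image lists of predicted classes;
    conditions:  per-image lists of ground-truth concept classes."""
    tp, fp, gt = Counter(), Counter(), Counter()
    for dets, conds in zip(detections, conditions):
        gt.update(conds)
        remaining = Counter(conds)   # how many of each class are still unmatched
        for cls in dets:
            if remaining[cls] > 0:
                tp[cls] += 1
                remaining[cls] -= 1
            else:
                fp[cls] += 1         # extra detection of an exhausted class
    return {c: (tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0,  # precision
                tp[c] / gt[c])                                       # recall
            for c in gt}

# One image conditioned on {cube, sphere}; the detector fires twice on cube.
scores = detector_score([["cube", "cube", "sphere"]], [["cube", "sphere"]])
```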
The hidden dimension size is 160, and batchnorm layers are added to stabilize training. The appearance latent distribution has 64 channels with spatial height and width 16; the location/scale latent distribution has 8 dimensions. The learning rate is set to 0.0001 and the model is optimized by Adamax [36]. For baseline models, we use an LSTM with 128 hidden dimensions to encode the concept information. In the data generation process, since sentence-program generation on CLEVR and MNIST is often rule-based, parsing a sentence back into a program is trivial. We instead directly use the generated programs, which actually lead to more complicated scenes than normal sentences describe.
Baselines: We compare our model with two widely used architectures: Conditional DCGAN [37, 38] and Conditional GatedPixelCNN. We first serialize the semantic concepts by traversing the tree, then use an LSTM to encode this information. We follow the originally proposed DCGAN architecture [38]. For GatedPixelCNN, we use a 10-layer gated PixelCNN [18] with 64 filters, trained with a discretized logistic loss [39]. These two models should be able to use the LSTM as a powerful global planning model for the distribution based on the semantic conditions. We also use a simple product-of-experts-based VAE model as a baseline [40]. The product of experts simply combines all concepts and generates a text-based distribution without considering scene structure.

4.1 ColorMNIST dataset

MNIST contains hand-written digits with large variations and is often used for testing generative models. We create a ColorMNIST dataset which contains images with up to two digits per image. We use 2 size attributes (small, large), 6 color attributes, and 4 relation attributes (top, bottom, left, right) to create compositional scene images. We use 8000 training and 8000 test images.
The results are summarized in Tab. 5b. 
From the objectness detector to the object-attribute detector, the metrics are progressively more challenging. We found that all methods can produce images that contain "digit-like" content. PixelCNN generates the images with the highest-quality pixel detail. PoE-VAE has a relatively lower score due to the loss of global scene structure. Our model preserves the most semantic content. Gated PoE leads to a 2-3% performance boost. Note that using plain PoE in the operators, which has fewer parameters, can also lead to good generation results.

4.2 Model evaluation on CLEVR-G dataset

CLEVR is a standard dataset containing rich compositionality and complicated scenes for examining reasoning in visual question answering. We create a CLEVR-G dataset which contains 10000 64x64 training images and 10000 testing images. It contains 2 sizes (small, large), 2 materials (rubber, metal), 8 colors, and 6 relations (left, right, left-front, left-behind, right-front, right-behind).

²https://github.com/Lucas2012/ProbabilisticNeuralProgrammedNetwork

Figure 4: (l) Visualizing the shape concepts in CLEVR-G. (r) Sample images generated by PNP-Net on Color-MNIST and corresponding text-program inputs.

Figure 5: Qualitative comparisons between PNP-Net and the baselines on ColorMNIST (left); visualization of visual concepts defined in our formulation (right). (a) Samples from the attribute concept Metal grounded on different objects. (b) Detector scores on Color-MNIST (table below); GT means we use the exact real image in the test set as the input to our pre-trained detector.

Method    | OBJ-N | OBJ-T | OBJ-A
GT        | 0.999 | 0.999 | 0.999
DC-GAN    | 0.990 | 0.211 | 0.146
PixelCNN  | 0.921 | 0.474 | 0.318
PoE-VAE   | 0.913 | 0.136 | 0.051
Ours      | 0.981 | 0.419 | 0.363

Fig. 6a shows examples of images generated by PNP-Net and baselines. Tab. 6b provides quantitative results. 
Generally, all methods perform reasonably well at generating a set of objects (high objectness scores, and qualitatively the correct number of objects). However, our structured prior gives much more accurate depictions of the required objects and their attributes (object-type and object-attribute scores, and qualitatively correct examples). Further, some global coherence is captured in the generated images (e.g. shape variance with respect to camera distance and position).

Method   | OBJ-N | OBJ-T | OBJ-A
GT       | 0.999 | 0.998 | 0.976
DC-GAN   | 0.979 | 0.566 | 0.176
PixelCNN | 0.894 | 0.444 | 0.074
PoE-VAE  | 0.974 | 0.493 | 0.134
Ours     | 0.971 | 0.833 | 0.737

Figure 6: Qualitative (left) and quantitative (right) comparisons on CLEVR-G. (a) Samples of the different models on the CLEVR-G dataset. (b) Detector scores on the CLEVR-G dataset; GT means we use the exact real images in the test set as the input to our pre-trained detector.

Visualizing the learned concepts: Our model learns the primitive concepts as visual words. It is interesting to see what distributions are learned during training. We visualize the shape concepts the model learns by pushing samples through the decoder. The results are shown in Fig. 4a: our model successfully learns those concepts separately. We also show attribute-concept generation by grounding an attribute to an object (as required by Describe). The results are shown in Fig. 5a.

Zero-shot combination: As our model is composed of reusable modules, it potentially has a better ability to generalize to unseen combinations. We further create a CLEVR-G-ZS dataset by the following process: we split all concepts into two disjoint sets: colors into c1 and c2, materials into m1 and m2, objects into o1 and o2, and relations into r1 and r2.
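Such a split can be sketched as follows. This is a schematic under stated assumptions: the concept vocabularies listed here are CLEVR's standard ones (shape names as in Fig. 4a), the halving is an arbitrary random split, and two cross-combinations of the halves are held out for testing; all identifiers are illustrative:

```python
import itertools
import random

def split_concepts(concepts, seed=0):
    """Split a concept list into two disjoint halves (set_1, set_2)."""
    shuffled = sorted(concepts)
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return set(shuffled[:half]), set(shuffled[half:])

# Assumed CLEVR-style vocabularies (illustrative).
colors    = ["gray", "red", "blue", "green", "brown", "purple", "cyan", "yellow"]
materials = ["rubber", "metal"]
shapes    = ["cube", "sphere", "cylinder"]
relations = ["left", "right", "left-front", "left-behind",
             "right-front", "right-behind"]

c1, c2 = split_concepts(colors)
m1, m2 = split_concepts(materials)
o1, o2 = split_concepts(shapes)
r1, r2 = split_concepts(relations)

# Hold out the two cross-combinations o1 x m2 x c2 x r1 and o2 x m1 x c1 x r2;
# training scenes may not contain any held-out combination.
held_out = set(itertools.product(o1, m2, c2, r1)) | set(itertools.product(o2, m1, c1, r2))

def in_training_set(shape, material, color, relation):
    """Keep a scene description for training iff it avoids the held-out set."""
    return (shape, material, color, relation) not in held_out
```

By construction the test-time combinations never appear during training, so any success on them must come from recombining modules rather than memorization.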
In our zero-shot setting, the combinations o1 ∩ m2 ∩ c2 ∩ r1 and o2 ∩ m1 ∩ c1 ∩ r2 are omitted from the training set, while the test set contains only them. The model has to exploit its modular nature to handle the generation of unseen combinations. Qualitative results are shown in Fig. 7a and the corresponding detector scores are summarized in Table 7b. We use the same pre-trained detector as in our non-zero-shot experiments. Note that the object-attribute scores under the zero-shot setting are not directly comparable with the non-zero-shot results, since the number of ground-truth object-attribute classes in the zero-shot test set is smaller than that of a normal test set.
Generalizing to complicated scenes: To test whether our model can scale up to more complex scenes by assembling reusable modules, we design the following two experiments: (1) we create a 128x128 version of the dataset with up to 8 objects, then train and test our model on this more complicated dataset; (2) we create another 128x128 version with up to 4 objects, then train our model on this up-to-4 dataset and test on the up-to-8 dataset. More complex scenes are intrinsically harder, with richer object interactions and occlusions, but the modular property of PNP-Net enables it to handle complicated semantics properly even with easy training data. The results are summarized in Table 1: our model still outperforms the scores the baseline methods obtain on simpler scenes, and it is robust to scene complexity due to the modularization embedded in our generative model.

Method   | OBJ-N | OBJ-T | OBJ-A
DC-GAN   | 0.989 | 0.418 | 0.332
PixelCNN | 0.899 | 0.420 | 0.192
PoE-VAE  | 0.975 | 0.441 | 0.318
Ours     | 0.970 | 0.752 | 0.734

(b) Detector scores on CLEVR-G under the zero-shot setting. Again, the numbers here are not directly comparable to the ones in Fig. 6b, as explained in Sec.
4.2.

(a) CLEVR-G Zero-shot

Figure 7: Qualitative (left) and quantitative (right) zero-shot comparisons on CLEVR-G.

Settings (trained on) | OBJ-N | OBJ-T | OBJ-A
GT (-)                | 0.979 | 0.977 | 0.943
Ours (up-to-8)        | 0.973 | 0.734 | 0.567
Ours (up-to-4)        | 0.976 | 0.715 | 0.518

Table 1: Detector scores on CLEVR-G 128x128 images with up to 8 objects.

5 Conclusion

We proposed a novel programmatic approach to constructing priors for generative modeling of complex scenes. A set of modular components can be combined to represent complex abstract concepts. Individual components represent base concepts such as a sphere, or the material property shiny. These are combined via aggregation operators that allow for modification of, or interactions between, components. We demonstrated that these priors can be used to model the variability and compositional aspects of complex images consisting of multiple entities with different properties, outperforming related methods that do not model scene structure.

References

[1] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.

[2] Ishaan Gulrajani, Kundan Kumar, Faruk Ahmed, Adrien Ali Taiga, Francesco Visin, David Vazquez, and Aaron Courville. PixelVAE: A latent variable model for natural images. ICLR, 2017.

[3] Xi Chen, Diederik P Kingma, Tim Salimans, Yan Duan, Prafulla Dhariwal, John Schulman, Ilya Sutskever, and Pieter Abbeel. Variational lossy autoencoder. arXiv preprint arXiv:1611.02731, 2016.

[4] Scott Reed and Nando de Freitas. Neural programmer-interpreters. In ICLR, 2016.

[5] Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. Deep compositional question answering with neural module networks. In CVPR, 2016.

[6] Yaroslav Ganin, Tejas Kulkarni, Igor Babuschkin, S. M. Ali Eslami, and Oriol Vinyals.
Synthesizing\n\nprograms for images using reinforced adversarial learning. CoRR, abs/1804.01118, 2018.\n\n[7] Tuan Anh Le, Atilim Gunes Baydin, and Frank D. Wood. Inference compilation and universal probabilistic\n\nprogramming. CoRR, abs/1610.09900, 2016.\n\n[8] Jiajun Wu, Joshua B. Tenenbaum, and Pushmeet Kohli. Neural scene de-rendering. In CVPR, 2017.\n\n[9] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron\nCourville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing\nSystems (NIPS), 2014.\n\n[10] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional\n\nadversarial networks. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.\n\n[11] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. Unpaired image-to-image translation using\n\ncycle-consistent adversarial networks. In International Conference on Computer Vision (ICCV), 2017.\n\n[12] Jacob Walker, Kenneth Marino, Abhinav Gupta, and Martial Hebert. The pose knows: Video forecasting\n\nby generating pose futures. In International Conference on Computer Vision (ICCV), 2017.\n\n[13] Ruben Villegas, Jimei Yang, Yuliang Zou, Sungryull Sohn, Xunyu Lin, and Honglak Lee. Learning to\ngenerate long-term future via hierarchical prediction. In International Conference on Machine Learning\n(ICML), 2017.\n\n[14] Shizhan Zhu, Sanja Fidler, Raquel Urtasun, Dahua Lin, and Chen Change Loy. Be your own prada: Fashion\n\nsynthesis with structural coherence. CoRR, 2017.\n\n[15] Levent Karacan, Zeynep Akata, Aykut Erdem, and Erkut Erdem. Learning to generate images of outdoor\n\nscenes from attributes and semantic layouts. CoRR, 2016.\n\n[16] Karol Gregor, Frederic Besse, Danilo Jimenez Rezende, Ivo Danihelka, and Daan Wierstra. Towards\nconceptual compression. 
In Advances In Neural Information Processing Systems, pages 3549–3557, 2016.

[17] Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759, 2016.

[18] Aaron van den Oord, Nal Kalchbrenner, Lasse Espeholt, Oriol Vinyals, Alex Graves, et al. Conditional image generation with pixelcnn decoders. In Advances in Neural Information Processing Systems, pages 4790–4798, 2016.

[19] Matthew D Hoffman. Learning deep latent gaussian models with markov chain monte carlo. In International Conference on Machine Learning, pages 1510–1519, 2017.

[20] Jiajun Wu, Erika Lu, Pushmeet Kohli, William T Freeman, and Joshua B Tenenbaum. Learning to see physics via visual de-animation. In Advances in Neural Information Processing Systems, 2017.

[21] Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Judy Hoffman, Li Fei-Fei, C. Lawrence Zitnick, and Ross Girshick. Inferring and executing programs for visual reasoning. In CVPR, 2017.

[22] Justin Johnson, Agrim Gupta, and Li Fei-Fei. Image generation from scene graphs. In CVPR, 2018.

[23] Emilio Parisotto, Abdelrahman Mohamed, Rishabh Singh, Lihong Li, Dengyong Zhou, and Pushmeet Kohli. Neuro-symbolic program synthesis. arXiv, 2016.

[24] Kobus Barnard, Pinar Duygulu, David Forsyth, Nando de Freitas, David M Blei, and Michael I Jordan. Matching words and pictures. Journal of machine learning research, 3(Feb):1107–1135, 2003.

[25] Justin Johnson, Andrej Karpathy, and Li Fei-Fei. Densecap: Fully convolutional localization networks for dense captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.

[26] Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. Long-term recurrent convolutional networks for visual recognition and description.
In Proceedings of the IEEE conference on computer vision and pattern recognition, pages\n2625\u20132634, 2015.\n\n[27] Andrej Karpathy, Armand Joulin, and Li Fei-Fei. Deep fragment embeddings for bidirectional image-\n\nsentence mapping. In NIPS, 2014.\n\n[28] Andrea Frome, Greg S Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Tomas Mikolov, et al. Devise: A\ndeep visual-semantic embedding model. In Advances in neural information processing systems, pages\n2121\u20132129, 2013.\n\n[29] Tejas D Kulkarni, William F Whitney, Pushmeet Kohli, and Josh Tenenbaum. Deep convolutional inverse\n\ngraphics network. In Advances in neural information processing systems, pages 2539\u20132547, 2015.\n\n[30] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel.\n\nInfogan:\nInterpretable representation learning by information maximizing generative adversarial nets. In Advances\nin Neural Information Processing Systems, pages 2172\u20132180, 2016.\n\n[31] Scott E Reed, Yi Zhang, Yuting Zhang, and Honglak Lee. Deep visual analogy-making. In Advances in\n\nNeural Information Processing Systems (NIPS), 2015.\n\n[32] Z. Deng, R. Navarathna, P. Carr, S. Mandt, Y. Yue, I. Matthews, and G. Mori. Factorized variational\nautoencoders for modeling audience reactions to movies. In Computer Vision and Pattern Recognition\n(CVPR), 2017.\n\n[33] Christopher KI Williams and Felix V Agakov. Products of gaussians and probabilistic minor component\n\nanalysis. Neural Computation, 14(5):1169\u20131182, 2002.\n\n[34] Geoffrey E Hinton. Training products of experts by minimizing contrastive divergence. Neural computation,\n\n14(8):1771\u20131800, 2002.\n\n[35] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved\ntechniques for training gans. In Advances in Neural Information Processing Systems, pages 2234\u20132242,\n2016.\n\n[36] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. 
arXiv preprint arXiv:1412.6980, 2014.

[37] Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. Generative adversarial text to image synthesis. arXiv preprint arXiv:1605.05396, 2016.

[38] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.

[39] Tim Salimans, Andrej Karpathy, Xi Chen, and Diederik P Kingma. PixelCNN++: Improving the pixelcnn with discretized logistic mixture likelihood and other modifications. arXiv preprint arXiv:1701.05517, 2017.

[40] Ramakrishna Vedantam, Ian Fischer, Jonathan Huang, and Kevin Murphy. Generative models of visually grounded imagination. arXiv preprint arXiv:1705.10762, 2017.