{"title": "Learning to Infer Graphics Programs from Hand-Drawn Images", "book": "Advances in Neural Information Processing Systems", "page_first": 6059, "page_last": 6068, "abstract": "We introduce a model that learns to convert simple hand drawings\n into graphics programs written in a subset of \\LaTeX.~The model\n combines techniques from deep learning and program synthesis. We\n learn a convolutional neural network that proposes plausible drawing\n primitives that explain an image. These drawing primitives are a\n specification (spec) of what the graphics program needs to draw. We\n learn a model that uses program synthesis techniques to recover a\n graphics program from that spec. These programs have constructs like\n variable bindings, iterative loops, or simple kinds of\n conditionals. With a graphics program in hand, we can correct errors\n made by the deep network and extrapolate drawings.", "full_text": "Learning to Infer Graphics Programs from\n\nHand-Drawn Images\n\nKevin Ellis\n\nMIT\n\nDaniel Ritchie\nBrown University\n\nellisk@mit.edu\n\ndaniel_ritchie@brown.edu\n\nArmando Solar-Lezama\n\nMIT\n\nasolar@csail.mit.edu\n\nJoshua B. Tenenbaum\n\nMIT\n\njbt@mit.edu\n\nAbstract\n\nWe introduce a model that learns to convert simple hand drawings into graphics\nprograms written in a subset of LATEX. The model combines techniques from\ndeep learning and program synthesis. We learn a convolutional neural network\nthat proposes plausible drawing primitives that explain an image. These drawing\nprimitives are a speci\ufb01cation (spec) of what the graphics program needs to draw.\nWe learn a model that uses program synthesis techniques to recover a graphics\nprogram from that spec. These programs have constructs like variable bindings,\niterative loops, or simple kinds of conditionals. 
With a graphics program in hand,\nwe can correct errors made by the deep network and extrapolate drawings.\n\n1\n\nIntroduction\n\nHuman vision is rich \u2013 we infer shape, objects, parts of objects, and relations between objects \u2013 and\nvision is also abstract: we can perceive the radial symmetry of a spiral staircase, the iterated repetition\nin the Ising model, see the forest for the trees, and also the recursion within the trees. How could we\nbuild an agent with similar visual inference abilities? As a small step in this direction, we cast this\nproblem as program learning, and take as our goal to learn high\u2013level graphics programs from simple\n2D drawings. The graphics programs we consider make \ufb01gures like those found in machine learning\npapers (Fig. 1), and capture high-level features like symmetry, repetition, and reuse of structure.\n\nfor (i < 3)\n\nrectangle(3*i,-2*i+4,\n\n3*i+2,6)\n\nfor (j < i + 1)\n\ncircle(3*i+1,-2*j+5)\n\nreflect(y=8)\n\nfor(i<3)\nif(i>0)\n\nrectangle(3*i-1,2,3*i,3)\n\ncircle(3*i+1,3*i+1)\n\n(a)\n\n(b)\n\nFigure 1: (a): Model learns to convert hand drawings (top) into LATEX (rendered below). (b) Learns to\nsynthesize high-level graphics program from hand drawing.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fThe key observation behind our work is that going from pixels to programs involves two distinct\nsteps, each requiring different technical approaches. The \ufb01rst step involves inferring what objects\nmake up an image \u2013 for diagrams, these are things like as rectangles, lines and arrows. The second\nstep involves identifying the higher-level visual concepts that describe how the objects were drawn.\nIn Fig. 
1(b), it means identifying a pattern in how the circles and rectangles are being drawn that is\nbest described with two nested loops, and which can easily be extrapolated to a bigger diagram.\nThis two-step factoring can be framed as probabilistic inference in a generative model where a\nlatent program is executed to produce a set of drawing commands, which are then rendered to form\nan image (Fig. 2). We refer to this set of drawing commands as a speci\ufb01cation (spec) because it\nspeci\ufb01es what the graphics program drew while lacking the high-level structure determining how the\nprogram decided to draw it. We infer a spec from an image using stochastic search (Sequential Monte\nCarlo) and infer a program from a spec using constraint-based program synthesis [1] \u2013 synthesizing\nstructures like symmetries, loops, or conditionals. In practice, both stochastic search and program\nsynthesis are prohibitively slow, and so we learn models that accelerate inference for both programs\nand specs, in the spirit of \u201camortized inference\u201d [2], training a neural network to amortize the cost of\ninferring specs from images and using a variant of Bias\u2013Optimal Search [3] to amortize the cost of\nsynthesizing programs from specs.\n\nImage\n\n(Observed)\n\nSpec/Drawing Commands\n\n(Latent)\n\nRendering\n\nline , line ,\nrectangle ,\nline , ...\n\nExecution\n\nProgram\n(Latent)\n\nfor ( j < 3)\nfor ( i < 3)\nif (...)\n\nline (...)\nline (...)\n\nr e c t a n g l e (...)\n\nLearning +\nProgram synthesis\nSection 3: Spec\u2192Program\n\nExtrapolation\n\nError\ncorrection\n\nLearning +\nStochastic search\nSection 2: Image\u2192Spec\n\nSection 4: Applications\nFigure 2: Black arrows: Top\u2013down generative model; Program\u2192Spec\u2192Image. Red arrows: Bottom\u2013\nup inference procedure. 
Bold: Random variables (image/spec/program)\n\nThe new contributions of this work are (1) a working model that can infer high-level symbolic\nprograms from perceptual input, and (2) a technique for using learning to amortize the cost of\nprogram synthesis, described in Section 3.1.\n\n2 Neural architecture for inferring specs\n\nWe developed a deep network architecture for ef\ufb01ciently inferring a spec, S, from a hand-drawn\nimage, I. Our model combines ideas from Neurally-Guided Procedural Models [4] and Attend-\nInfer-Repeat [5], but we wish to emphasize that one could use many different approaches from the\ncomputer vision toolkit to parse an image in to primitive drawing commands (in our terminology,\na \u201cspec\u201d) [6]. Our network constructs the spec one drawing command at a time, conditioned on\nwhat it has drawn so far (Fig. 3). We \ufb01rst pass a 256 \u00d7 256 target image and a rendering of the\ndrawing commands so far (encoded as a two-channel image) to a convolutional network. Given the\nfeatures extracted by the convnet, a multilayer perceptron then predicts a distribution over the next\ndrawing command to execute (see Tbl. 1). We also use a differentiable attention mechanism (Spatial\nTransformer Networks: [7]) to let the model attend to different regions of the image while predicting\ndrawing commands. We currently constrain coordinates to lie on a discrete 16 \u00d7 16 grid, but the grid\ncould be made arbitrarily \ufb01ne.\nWe trained our network by sampling specs S and target images I for randomly generated scenes1 and\nmaximizing P\u03b8[S|I], the likelihood of S given I, with respect to model parameters \u03b8, by gradient\nascent. We trained on 105 scenes, which takes a day on an Nvidia TitanX GPU. 
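The autoregressive decoding loop described above can be sketched in a few lines. This is our minimal illustration, not the authors' code: `score` stands in for the trained CNN+MLP distribution over next drawing commands given the target image and the canvas so far, and all names and the toy scorer are assumptions.

```python
# Sketch of spec inference: repeatedly pick the highest-scoring next drawing
# command, conditioned on the target image and the commands drawn so far,
# until STOP. `score` is a stand-in for P_theta[next command | image, canvas].

def infer_spec(target_image, score, candidates, max_commands=20):
    spec = []
    for _ in range(max_commands):
        best = max(candidates + ["STOP"],
                   key=lambda cmd: score(target_image, spec, cmd))
        if best == "STOP":
            break
        spec.append(best)
    return spec

# Toy scorer (our assumption): reward commands present in the target
# that have not been drawn yet; STOP scores in between.
target = {"circle(2,8)", "circle(5,8)"}

def score(img, spec, cmd):
    if cmd == "STOP":
        return 0.5
    return 1.0 if cmd in img and cmd not in spec else 0.0

result = infer_spec(target, score, ["circle(2,8)", "circle(5,8)"])
assert set(result) == target
```

In the real model the candidate set is the full discretized command space of Tbl. 1 and decoding uses beam search rather than this greedy loop.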
Supplement Section\n1 gives the full details of the architecture and training of this network.\n\n1Because rendering ignores ordering we put the drawing commands into a canonical order\n\n2\n\n\fTarget image: I\n\n(cid:76)\n\n2\n\u00d7\n6\n5\n2\n\u00d7\n6\n5\n2\n\nSTN\n\nSTN\n\nCNN\n\nMLP\n\nMLP\n\nMLP\n\nCanvas: render(S)\n\nRenderer\n\ncircle(\n\nX=7,\n\nY=12)\n\nNext drawing command\n\nFigure 3: Neural\narchitecture\nfor\ninferring specs from\nimages. Blue: net-\nwork inputs. Black:\nnetwork\nopera-\ntions. Red: draws\nfrom a multino-\nmial. Typewriter\nfont: network out-\nputs. Renders on a\n16 \u00d7 16 grid, shown\nin gray. STN: dif-\nferentiable attention\nmechanism [7].\n\nTable 1: Primitive drawing commands currently supported by our model.\n\ncircle(x, y)\nrectangle(x1, y1, x2, y2)\nline(x1, y1, x2, y2,\n\narrow \u2208 {0, 1}, dashed \u2208 {0, 1})\n\nSTOP\n\nCircle at (x, y)\nRectangle with corners at (x1, y1) & (x2, y2)\nLine from (x1, y1) to (x2, y2),\n\noptionally with an arrow and/or dashed\n\nFinishes spec inference\n\nOur network can \u201cderender\u201d random synthetic images by doing a beam search to recover specs\nmaximizing P\u03b8[S|I]. But, if the network predicts an incorrect drawing command, it has no way\nof recovering from that error. For added robustness we treat the network outputs as proposals for\na Sequential Monte Carlo (SMC) sampling scheme [8]. Our SMC sampler draws samples from\nthe distribution \u221d L(I|render(S))P\u03b8[S|I], where L(\u00b7|\u00b7) uses the pixel-wise distance between two\nimages as a proxy for a likelihood. Here, the network is learning a proposal distribution to amortize\nthe cost of inverting a generative model (the renderer) [2].\nExperiment 1: Figure 4. To evaluate which components of the model are necessary to parse\ncomplicated scenes, we compared the neural network with SMC against the neural network by itself\n(i.e., w/ beam search) or SMC by itself. 
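The SMC scheme above can be sketched concretely. This is a toy illustration under our own assumptions, not the paper's implementation: the "renderer" lights one grid cell per command, `propose` stands in for the network's proposal distribution, and the likelihood uses pixel-wise distance as in Sec. 2.

```python
import math
import random

GRID = 8  # the paper renders on a 16x16 grid; smaller here for the toy

def render(spec):
    img = [[0] * GRID for _ in range(GRID)]
    for _, x, y in spec:  # toy renderer: one lit cell per circle command
        img[y][x] = 1
    return img

def log_likelihood(target, img, beta=2.0):
    # pixel-wise distance between two images as a proxy for a likelihood
    dist = sum(abs(a - b) for row_t, row_i in zip(target, img)
               for a, b in zip(row_t, row_i))
    return -beta * dist

def smc(target, n_particles=100, n_steps=2, seed=0):
    rng = random.Random(seed)
    propose = lambda: ("circle", rng.randrange(GRID), rng.randrange(GRID))
    particles = [[] for _ in range(n_particles)]
    for _ in range(n_steps):
        # extend each particle, then resample in proportion to likelihood
        particles = [p + [propose()] for p in particles]
        weights = [math.exp(log_likelihood(target, render(p)))
                   for p in particles]
        particles = rng.choices(particles, weights=weights, k=n_particles)
    return max(particles, key=lambda p: log_likelihood(target, render(p)))

true_spec = [("circle", 1, 2), ("circle", 5, 5)]
best = smc(render(true_spec))
assert len(best) == 2
```

The resampling step is what distinguishes this from beam search: a particle that misdraws a command is likely to be replaced by a better-scoring competitor, rather than the error persisting in every extension.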
Only the combination of the two passes a critical test of\ngeneralization: when trained on images with \u2264 12 objects, it successfully parses scenes with many\nmore objects than the training data. We compare with a baseline that produces the spec in one shot by\nusing the CNN to extract features of the input which are passed to an LSTM which \ufb01nally predicts the\nspec token-by-token (LSTM in Fig. 4). This architecture is used in several successful neural models\nof image captioning (e.g., [9]), but, for this domain, cannot parse cluttered scenes with many objects.\n\nFigure 4: Parsing LATEX output after train-\ning on diagrams with \u2264 12 objects. Out-of-\nsample generalization: Model generalizes\nto scenes with many more objects (\u2248 at ceil-\ning when tested on twice as many objects as\nwere in the training data). Neither SMC nor\nthe neural network are suf\ufb01cient on their\nown. # particles varies by model: we com-\npare the models with equal runtime (\u2248 1\nsec/object). Average number of errors is (#\nincorrect drawing commands predicted by\nmodel)+(# correct commands that were not\npredicted by model).\n\n3\n\n\f2.1 Generalizing to real hand drawings\n\nWe trained the model to generalize to hand drawings by introducing noise into the renderings of\nthe training target images, where the noise process mimics the kinds of variations found in hand\ndrawings. While our neurally-guided SMC procedure used pixel-wise distance as a surrogate for\na likelihood function (L(\u00b7|\u00b7) in Sec. 2), pixel-wise distance fares poorly on hand drawings, which\nnever exactly match the model\u2019s renders. So, for hand drawings, we learn a surrogate likelihood\nfunction, Llearned(\u00b7|\u00b7). The density Llearned(\u00b7|\u00b7) is predicted by a convolutional network that we train\nto predict the distance between two specs conditioned upon their renderings. 
We train Llearned(\u00b7|\u00b7) to\napproximate the symmetric difference, which is the number of drawing commands by which two\nspecs differ:\n\n\u2212 log Llearned(render(S1)|render(S2)) \u2248 |S1 \u2212 S2| + |S2 \u2212 S1|\n\n(1)\n\nSupplement Section 2 explains the architecture and training of Llearned.\nExperiment 2: Figures 5\u20137. We evaluated, but did not train, our system on 100 real hand-drawn\n\ufb01gures; see Fig. 5\u20136. These were drawn carefully but not perfectly with the aid of graph paper. For\neach drawing we annotated a ground truth spec and had the neurally guided SMC sampler produce\n103 samples. For 63% of the drawings, the Top-1 most likely sample exactly matches the ground\ntruth; with more samples, the model \ufb01nds specs that are closer to the ground truth annotation (Fig. 7).\nWe will show that the program synthesizer corrects some of these small errors (Sec. 4.1).\n\nFigure 5: Left to right: Ising model, recurrent network architec-\nture, \ufb01gure from a deep learning textbook [10], graphical model\n\nFigure 6: Near misses. Right-\nmost: illusory contours (note:\nno SMC in rightmost)\n\nFigure 7: How close are the model\u2019s out-\nputs to the ground truth on hand draw-\nings, as we consider larger sets of sam-\nples (1, 5, 100)? Distance to ground\ntruth measured by the intersection over\nunion (IoU) of predicted spec vs. ground\ntruth spec: IoU of sets (specs) A and B is\n|A\u2229B|/|A\u222aB|. (a) for 63% of drawings\nthe model\u2019s top prediction is exactly cor-\nrect; (b) for 70% of drawings the ground\ntruth is in the top 5 model predictions;\n(c) for 4% of drawings all of the model\noutputs have no overlap with the ground\ntruth. Red: the full model. 
Other colors:\nlesioned versions of our model.\n\n3 Synthesizing graphics programs from specs\n\nAlthough the spec describes the contents of a scene, it does not encode higher-level features of\nan image such as repeated motifs or symmetries, which are more naturally captured by a graphics\nprogram. We seek to synthesize graphics programs from their specs.\n\n4\n\n\fWe constrain the space of programs by writing down a context free grammar over programs \u2013 what\nin the program languages community is called a Domain Speci\ufb01c Language (DSL) [11]. Our DSL\n(Tbl. 2) encodes prior knowledge of what graphics programs tend to look like.\n\nTable 2: Grammar over graphics programs. We allow loops (for) with conditionals (if), vertical/hor-\nizontal re\ufb02ections (reflect), variables (Var) and af\ufb01ne transformations (Z\u00d7Var+Z).\n\nProgram\u2192 Statement; \u00b7\u00b7\u00b7 ; Statement\nStatement\u2192 circle(Expression,Expression)\nStatement\u2192 rectangle(Expression,Expression,Expression,Expression)\nStatement\u2192 line(Expression,Expression,Expression,Expression,Boolean,Boolean)\nStatement\u2192 for(0 \u2264 Var < Expression) { if (Var > 0) { Program }; Program }\nStatement\u2192 reflect(Axis) { Program }\nExpression\u2192 Z\u00d7Var+Z\nZ \u2192 an integer\n\nAxis\u2192 X = Z | Y = Z\n\nGiven the DSL and a spec S, we want a program that both satis\ufb01es S and, at the same time, is\nthe \u201cbest\u201d explanation of S. For example, we might prefer more general programs or, in the spirit\nof Occam\u2019s razor, prefer shorter programs. We wrap these intuitions up into a cost function over\nprograms, and seek the minimum cost program consistent with S:\n\n1 [p consistent w/ S] exp (\u2212cost(p))\n\n(2)\n\nprogram(S) = arg max\np\u2208DSL\n\nWe de\ufb01ne the cost of a program to be the number of Statement\u2019s it contains (Tbl. 2). We also\npenalize using many different numerical constants; see Supplement Section 3. 
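A concrete, simplified reading of this cost function can be written down directly. The sketch below is ours: it counts only top-level integer arguments as constants, and the penalty weight is an assumption (the paper's exact penalty is in its supplement).

```python
# Simplified program cost in the spirit of Eq. 2: number of statements,
# plus a penalty for each distinct numerical constant. Only top-level
# integer arguments are counted here -- a simplification of the DSL.

def cost(program, constant_penalty=0.5):
    statements = len(program)
    constants = {z for stmt in program for z in stmt[1:]
                 if isinstance(z, int)}
    return statements + constant_penalty * len(constants)

# A loop-free spec of three circles vs. an equivalent one-statement loop.
loop_free = [("circle", 1, 1), ("circle", 4, 1), ("circle", 7, 1)]
looped = [("for", 3, ("circle", ("expr", 3, 0, 1), ("expr", 0, 1, 1)))]
assert cost(looped) < cost(loop_free)
```

Under this cost, the looped program wins (1 statement, 1 constant, cost 1.5) over the verbatim spec (3 statements, 3 constants, cost 4.5), which is the Occam's-razor preference the synthesizer optimizes.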
Returning to the\ngenerative model in Fig. 2, this setup is the same as saying that the prior probability of a program p is\n\u221d exp (\u2212cost(p)) and the likelihood of a spec S given a program p is 1[p consistent w/ S].\nThe constrained optimization problem in Eq. 2 is intractable in general, but there exist ef\ufb01cient-in-\npractice tools for \ufb01nding exact solutions to such program synthesis problems. We use the state-of-\nthe-art Sketch tool [1]. Sketch takes as input a space of programs, along with a speci\ufb01cation of the\nprogram\u2019s behavior and optionally a cost function. It translates the synthesis problem into a constraint\nsatisfaction problem and then uses a SAT solver to \ufb01nd a minimum-cost program satisfying the\nspeci\ufb01cation. Sketch requires a \ufb01nite program space, which here means that the depth of the program\nsyntax tree is bounded (we set the bound to 3), but has the guarantee that it always eventually \ufb01nds\na globally optimal solution. In exchange for this optimality guarantee it comes with no guarantees\non runtime. For our domain synthesis times vary from minutes to hours, with 27% of the drawings\ntiming out the synthesizer after 1 hour. Tbl. 3 shows programs recovered by our system. A main\nimpediment to our use of these general techniques is the prohibitively high cost of searching for\nprograms. We next describe how to learn to synthesize programs much faster (Sec. 3.1), timing out\non 2% of the drawings and solving 58% of problems within a minute.\n\n3.1 Learning a search policy for synthesizing programs\n\nWe want to leverage powerful, domain-general techniques from the program synthesis community,\nbut make them much faster by learning a domain-speci\ufb01c search policy. A search policy poses\nsearch problems like those in Eq. 2, but also offers additional constraints on the structure of the\nprogram (Tbl. 4). 
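The space of synthesis problems σ parameterized in Tbl. 4 is small enough to enumerate exhaustively, which is what makes a policy over it tractable to learn. The field names below are our own rendering of the table's parameters.

```python
from itertools import product

# Enumerate the synthesis-problem space sigma from Tbl. 4: whether loops
# and reflections are allowed, whether synthesis is incremental, and the
# bound on the program syntax tree depth.
SIGMAS = [
    {"loops": lp, "reflects": rf, "incremental": inc, "max_depth": d}
    for lp, rf, inc, d in product([True, False], [True, False],
                                  [True, False], [1, 2, 3])
]
assert len(SIGMAS) == 24  # 2 x 2 x 2 x 3 problem spaces
```

The policy π_θ(σ|S) is then a distribution over these 24 configurations, conditioned on the spec.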
For example, a policy might decide to \ufb01rst try searching over small programs\nbefore searching over large programs, or decide to prioritize searching over programs that have loops.\nA search policy \u03c0\u03b8(\u03c3|S) takes as input a spec S and predicts a distribution over synthesis problems,\neach of which is written \u03c3 and corresponds to a set of possible programs to search over (so \u03c3 \u2286 DSL).\nGood policies will prefer tractable program spaces, so that the search procedure will terminate early,\nbut should also prefer program spaces likely to contain programs that concisely explain the data.\nThese two desiderata are in tension: tractable synthesis problems involve searching over smaller\nspaces, but smaller spaces are less likely to contain good programs. Our goal now is to \ufb01nd the\nparameters of the policy, written \u03b8, that best navigate this trade-off.\nGiven a search policy, what is the best way of using it to quickly \ufb01nd minimum cost programs? We\nuse a bias-optimal search algorithm (c.f. Schmidhuber 2004 [3]):\n\n5\n\n\fTable 3: Drawings (left), their specs (middle left), and programs synthesized from those specs (middle\nright). Compared to the specs the programs are more compressive (right: programs have fewer lines\nthan specs) and automatically group together related drawing commands. Note the nested loops and\nconditionals in the Ising model, combination of symmetry and iteration in the bottom \ufb01gure, af\ufb01ne\ntransformations in the top \ufb01gure, and the complicated program in the second \ufb01gure to bottom.\n\nDrawing\n\nSpec\n\nProgram\n\nCompression factor\n\nLine (2 ,15 , 4 ,15)\nLine (4 ,9 , 4 ,13)\nLine (3 ,11 , 3 ,14)\nLine (2 ,13 , 2 ,15)\nLine (3 ,14 , 6 ,14)\nLine (4 ,13 , 8 ,13)\n\nCircle (5 ,8)\nCircle (2 ,8)\nCircle (8 ,11)\nLine (2 ,9 , 2 ,10)\nCircle (8 ,8)\nLine (3 ,8 , 4 ,8)\nLine (3 ,11 , 4 ,11)\n... etc. 
...; 21 lines\n\nR e c t a n g l e (1 ,10 ,3 ,11)\nR e c t a n g l e (1 ,12 ,3 ,13)\nR e c t a n g l e (4 ,8 ,6 ,9)\nR e c t a n g l e (4 ,10 ,6 ,11)\n... etc. ...; 16 lines\n\nfor (i <3)\n\nline (i , -1* i +6 ,\n\n2* i +2 , -1* i +6)\n\nline (i , -2* i +4 , i , -1* i +6)\n\n6\n\n3 = 2x\n\nfor (i <3)\n\nfor (j <3)\nif (j >0)\n\nline ( -3* j +8 , -3* i +7 ,\n-3* j +9 , -3* i +7)\nline ( -3* i +7 , -3* j +8 ,\n-3* i +7 , -3* j +9)\n\ncircle ( -3* j +7 , -3* i +7)\n\n21\n\n6 = 3.5x\n\nfor (i <4)\n\nfor (j <4)\n\nr e c t a n g l e ( -3* i +9 , -2* j +6 ,\n\n-3* i +11 , -2* j +7)\n\n16\n\n3 = 5.3x\n\nfor (i <4)\n\nLine (11 ,14 ,13 ,14 , arrow )\nCircle (10 ,10)\nLine (10 ,13 ,10 ,11 , arrow )\nCircle (6 ,10)\n... etc. ...; 15 lines\n\nline ( -4* i +13 ,4 , -4* i +13 ,2 , arrow )\nfor (j <3)\nif (j >0)\n\ncircle ( -4* i +13 ,4* j + -3)\n\nline ( -4* j +10 ,5 , -4* j +12 ,5 ,\n\narrow )\n\n15\n\n6 = 2.5x\n\nLine (3 ,10 ,3 ,14 , arrow )\nR e c t a n g l e (11 ,8 ,15 ,10)\nR e c t a n g l e (11 ,14 ,15 ,15)\nLine (13 ,10 ,13 ,14 , arrow )\n... etc. ...; 16 lines\n\nfor (i <3)\n\nline (7 ,1 ,5* i +2 ,3 , arrow )\nfor (j < i +1)\n\nif (j >0)\n\nline (5* j -1 ,9 ,5* i ,5 , arrow )\n\nline (5* j +2 ,5 ,5* j +2 ,9 , arrow )\n\nr e c t a n g l e (5* i ,3 ,5* i +4 ,5)\nr e c t a n g l e (5* i ,9 ,5* i +4 ,10)\n\nr e c t a n g l e (2 ,0 ,12 ,1)\n\n16\n\n9 = 1.8x\n\nCircle (2 ,8)\nR e c t a n g l e (6 ,9 , 7 ,10)\nCircle (8 ,8)\nR e c t a n g l e (6 ,12 , 7 ,13)\nR e c t a n g l e (3 ,9 , 4 ,10)\n... etc. ...; 9 lines\n\nr e f l e c t ( y =8)\n\nfor (i <3)\nif (i >0)\n\nr e c t a n g l e (3* i -1 ,2 ,3* i ,3)\n\ncircle (3* i +1 ,3* i +1)\n\n9\n\n5 = 1.8x\n\nDe\ufb01nition: Bias-optimality. 
A search algorithm is n-bias optimal with respect to a distribution\nPbias[\u00b7] if it is guaranteed to \ufb01nd a solution in \u03c3 after searching for at least time n \u00d7 t(\u03c3)\nPbias[\u03c3], where\nt(\u03c3) is the time it takes to verify that \u03c3 contains a solution to the search problem.\nBias-optimal search over program spaces is known as Levin Search [12]; an example of a 1-bias\noptimal search algorithm is an ideal time-sharing system that allocates Pbias[\u03c3] of its time to trying \u03c3.\nWe construct a 1-bias optimal search algorithm by identifying Pbias[\u03c3] = \u03c0\u03b8(\u03c3|S) and t(\u03c3) = t(\u03c3|S),\nwhere t(\u03c3|S) is how long the synthesizer takes to search \u03c3 for a program for S. Intuitively, this\nmeans that the search algorithm explores the entire program space, but spends most of its time in the\nregions of the space that the policy judges to be most promising. Concretely, this means that we run\nmany different program searches in parallel (i.e., run in parallel different instances of the synthesizer,\none for each \u03c3), but to allocate compute time to a \u03c3 in proportion to \u03c0\u03b8(\u03c3|S).\n\n6\n\n\fNow in theory any \u03c0\u03b8(\u00b7|\u00b7) is a bias-optimal searcher. But the actual runtime of the algorithm depends\nstrongly upon the bias Pbias[\u00b7]. 
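The time-sharing scheme above can be simulated in a few lines. This is our toy model, not the system's scheduler: each σ's solve time t(σ) is hidden from the searcher but known to the simulator, and the names and time step are assumptions.

```python
# Toy simulation of 1-bias-optimal time sharing: each sigma receives
# compute in proportion to its policy weight; the first sigma whose
# allocated time reaches its (hidden) solve time returns a solution.
# Assumes at least one sigma in solve_times has a solution.

def bias_optimal_search(policy_weights, solve_times, dt=0.01):
    elapsed = 0.0
    while True:
        elapsed += dt
        for sigma, w in policy_weights.items():
            allocated = elapsed * w        # time-shared allocation
            t = solve_times.get(sigma)     # None = sigma has no solution
            if t is not None and allocated >= t:
                return sigma, elapsed

weights = {"small": 0.7, "large": 0.3}
times = {"small": None, "large": 1.0}  # only the large space has a solution
sigma, total = bias_optimal_search(weights, times)
assert sigma == "large"
# Total wall-clock time is close to t(sigma)/P[sigma] = 1.0/0.3,
# matching the 1-bias-optimality bound.
assert abs(total - 1.0 / 0.3) < 0.05
```

Even though the policy put most of its mass on the wrong σ, the search still terminates, just slowed by the factor 1/P_bias[σ]; this is the sense in which the whole space remains covered.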
Our new approach is to learn Pbias[\u00b7] by picking the policy minimizing\nthe expected bias-optimal time to solve a training corpus, D, of graphics program synthesis problems:\n\n(cid:21)\n\n(cid:20)\n\nmin\n\n\u03c3\u2208BEST(S)\n\nt(\u03c3|S)\n\u03c0\u03b8(\u03c3|S)\n\nLOSS(\u03b8;D) = ES\u223cD\n\n+ \u03bb(cid:107)\u03b8(cid:107)2\nwhere \u03c3 \u2208 BEST(S) if a minimum cost program for S is in \u03c3.\n\n2\n\n(3)\n\nTo generate a training corpus for learning a policy, we synthesized minimum cost programs for\neach drawing and for each \u03c3, then minimized 3 using gradient descent while annealing a softened\nminimum to the hard minimization equation 3. Because we want to learn a policy from only 100\ndrawings, we parameterize \u03c0 with a low-capacity bilinear model with only 96 real-valued parameters.\nSupplement Section 4 further details the parameterization and training of the policy.\nExperiment 3: Table 5; Figure 8; Supplement Section 4. We compare synthesis times for our\nlearned search policy with 4 alternatives: Sketch, which poses the entire problem wholesale to the\nSketch program synthesizer; DC, a DeepCoder\u2013style model that learns to predict which program\ncomponents (loops, re\ufb02ections) are likely to be useful [13]; End\u2013to-End, which trains a recurrent\nneural network to regress directly from images to programs; and an Oracle, a policy which always\npicks the quickest to search \u03c3 also containing a minimum cost program. Our approach improves upon\nSketch by itself, and comes close to the Oracle\u2019s performance. One could never construct this Oracle,\nbecause the agent does not know ahead of time which \u03c3\u2019s contain minimum cost programs nor does\nit know how long each \u03c3 will take to search. With this learned policy in hand we can synthesize 58%\nof programs within a minute.\n\nTable 4: Parameterization of different ways of posing the program synthesis problem. 
The policy\nlearns to choose parameters likely to quickly yield a minimal cost program.\n\nParameter\nLoops?\nRe\ufb02ects?\nIncremental?\nMaximum depth Bound on the depth of the program syntax tree\n\nDescription\nIs the program allowed to loop?\nIs the program allowed to have re\ufb02ections?\nSolve the problem piece-by-piece or all at once?\n\nRange\n{True, False}\n{True, False}\n{True, False}\n{1, 2, 3}\n\nModel\n\nSketch\nDC\nEnd\u2013to\u2013End\nOracle\nOurs\n\nMedian\n\nsearch time\n274 sec\n187 sec\n63 sec\n6 sec\n28 sec\n\nTimeouts\n\n(1 hr)\n\n27%\n2%\n94%\n2%\n2%\n\nTable 5: Time to synthesize a minimum cost pro-\ngram. Sketch: out-of-the-box performance of\nSketch [1]. DC: Deep\u2013Coder style baseline that\npredicts program components, trained like [13].\nEnd\u2013to\u2013End: neural net trained to regress directly\nfrom images to programs, which fails to \ufb01nd valid\nprograms 94% of the time. Oracle: upper bounds\nthe performance of any bias\u2013optimal search policy.\nOurs: evaluated w/ 20-fold cross validation.\n\nFigure 8: Time to synthesize a minimum cost program (compare w/ Table 5). End\u2013to\u2013End: not\nshown because it times out on 96% of drawings, and has its median time (63s) calculated only on\nnon-timeouts, wheras the other comparisons include timeouts in their median calculation. \u221e =\ntimeout. Red dashed line is median time.\n\n7\n\n\f4 Applications of graphics program synthesis\n\n4.1 Correcting errors made by the neural network\n\nThe program synthesizer corrects errors made by the neural\nnetwork by favoring specs which lead to more concise or gen-\neral programs. For example, \ufb01gures with perfectly aligned\nobjects are preferable, and precise alignment lends itself to\nshort programs. Concretely, we run the program synthesizer\non the Top-k most likely specs output by the neurally guided\nsampler. Then, the system reranks the Top-k by the prior prob-\nability of their programs. 
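The reranking step can be sketched as follows. This is an illustration under our own assumptions: `synthesize` and `cost` are hypothetical stand-ins for the Sketch-based synthesizer and the learned program prior, and the toy programs are ours.

```python
# Rerank the Top-k specs from the neural sampler by the prior over their
# synthesized programs: prefer the spec whose program is cheapest, i.e.
# most concise/general. `synthesize` and `cost` are stand-ins.

def rerank(top_k_specs, synthesize, cost):
    return min(top_k_specs, key=lambda spec: cost(synthesize(spec)))

# Toy usage: the aligned spec compresses into a one-statement loop, so it
# wins over the misaligned spec even if the sampler ranked it lower.
specs = [("circle(1,1)", "circle(4,1)", "circle(7,2)"),   # misaligned
         ("circle(1,1)", "circle(4,1)", "circle(7,1)")]   # aligned
toy_programs = {specs[0]: ["c1", "c2", "c3"],             # verbatim spec
                specs[1]: ["for(i<3) circle(3*i+1,1)"]}   # single loop
assert rerank(specs, toy_programs.__getitem__, len) == specs[1]
```

This is the mechanism by which "perfect alignment lends itself to short programs" translates into error correction: the misaligned interpretation cannot be compressed, so the prior demotes it.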
The prior probability of a program\nis learned by optimizing the parameters of the prior so as to\nmaximize the likelihood of the ground truth specs; see supple-\nment for details. But, this procedure can only correct errors\nwhen a correct spec is in the Top-k. Our sampler could only do\nbetter on 7/100 drawings by looking at the Top-100 samples\n(see Fig. 7), precluding a statistically signi\ufb01cant analysis of\nhow much learning a prior over programs could help correct\nerrors. But, learning this prior does sometimes help correct\nmistakes made by the neural network; see Fig. 9 for a represen-\ntative example of the kinds of corrections that it makes. See\nSupplement Section 5 for details.\n\n4.2 Extrapolating \ufb01gures\n\nFigure 9: Left: hand drawings. Cen-\nter: interpretations favored by the\ndeep network. Right: interpretations\nfavored after learning a prior over\nprograms. The prior favors simpler\nprograms, thus (top) continuing the\npattern of not having an arrow is pre-\nferred, or (bottom) continuing the\n\u201cbinary search tree\u201d is preferred.\n\nHaving access to the source code of a graphics program facilitates coherent, high-level image editing.\nFor example, we could change all of the circles to squares or make all of the lines be dashed, or we\ncan (automatically) extrapolate \ufb01gures by increasing the number of times that loops are executed.\nExtrapolating repetitive visuals patterns comes naturally to humans, and is a practical application:\nimagine hand drawing a repetitive graphical model structure and having our system automatically\ninduce and extend the pattern. Fig. 10 shows extrapolations produced by our system.\n\nFigure 10: Top, white: drawings. Bottom, black: extrapolations automatically produced by our\nsystem.\n\n8\n\n\f5 Related work\n\nProgram Induction: Our approach to learning to search for programs draws theoretical under-\npinnings from Levin search [12, 14] and Schmidhuber\u2019s OOPS model [3]. 
DeepCoder [13] is a\nrecent model which, like ours, learns to predict likely program components. Our work differs by\nidentifying and modeling the trade-off between tractability and probability of success. TerpreT [15]\nsystematically compares constraint-based program synthesis techniques against gradient-based search\nmethods, like those used to train Differentiable Neural Computers [16]. The TerpreT experiments\nmotivate our use of constraint-based techniques. Neurally Guided Deductive Search (NGDS: [17]) is\na recent neurosymbolic approach; combining our work with ideas from NGDS could be promising.\nDeep Learning: Our neural network combines the architectural ideas of Attend-Infer-Repeat [5]\n\u2013 which learns to decompose an image into its constituent objects \u2013 with the training regime and\nSMC inference of Neurally Guided Procedural Modeling [4] \u2013 which learns to control procedural\ngraphics programs. The very recent SPIRAL [18] system learns to infer procedures for controlling\na \u2018pen\u2019 to derender highly diverse natural images, complementing our focus here on more abstract\nprocedures but less natural images. IM2LATEX [19] and pix2code [20] are recent works that derender\nLATEX equations and GUIs, respectively, both recovering a markup-like representation. Our goal is to\ngo from noisy input to a high-level program, which goes beyond markup languages by supporting\nprogramming constructs like loops and conditionals.\nHand-drawn sketches: Sketch-n-Sketch is a bi-directional editing system where direct manipula-\ntions to a program\u2019s output automatically propagate to the program source code [21]. This work\ncompliments our own: programs produced by our method could be provided to a Sketch-n-Sketch-like\nsystem as a starting point for further editing. Other systems in the computer graphics literature convert\nsketches to procedural representations, e.g. 
using a convolutional network to match a sketch to the\noutput of a parametric 3D modeling system in [22] or supporting interactive sketch-based instan-\ntiation of procedural primitives in [23] In contrast, we seek to automatically infer a programmatic\nrepresentation capturing higher-level visual patterns. The CogSketch system [24] also aims to have a\nhigh-level understanding of hand-drawn \ufb01gures. Their goal is cognitive modeling, whereas we are\ninterested in building an automated AI application.\n\n6 Contributions\n\nWe have presented a system for inferring graphics programs which generate LATEX-style \ufb01gures from\nhand-drawn images using a combination of learning, stochastic search, and program synthesis. In the\nnear future, we believe it will be possible to produce professional-looking \ufb01gures just by drawing\nthem and then letting an AI write the code. More generally, we believe the problem of inferring\nvisual programs is a promising direction for research in machine perception.\n\nAcknowledgments\n\nWe are grateful for advice from Will Grathwohl and Jiajun Wu on the neural architecture, and for\nfunding from NSF GRFP, NSF Award #1753684, the MUSE program (DARPA grant FA8750-14-2-\n0242), and AFOSR award FA9550-16-1-0012. This material is based upon work supported by the\nCenter for Brains, Minds and Machines (CBMM), funded by NSF STC award CCF-1231216.\n\nCode, data, and drafts\n\nA longer version of this paper is available at https://arxiv.org/abs/1707.09627. The code\nand data are available at https://github.com/ellisk42/TikZ.\n\nReferences\n[1] Armando Solar Lezama. Program Synthesis By Sketching. PhD thesis, EECS Department, University of\n\nCalifornia, Berkeley, Dec 2008.\n\n[2] Brooks Paige and Frank Wood. Inference networks for sequential monte carlo in graphical models. In\n\nInternational Conference on Machine Learning, pages 3040\u20133049, 2016.\n\n[3] J\u00fcrgen Schmidhuber. Optimal ordered problem solver. 
Machine Learning, 54(3):211\u2013254, 2004.\n\n9\n\n\f[4] Daniel Ritchie, Anna Thomas, Pat Hanrahan, and Noah Goodman. Neurally-guided procedural models:\n\nAmortized inference for procedural graphics programs using neural networks. In NIPS, 2016.\n\n[5] SM Eslami, N Heess, and T Weber. Attend, infer, repeat: Fast scene understanding with generative models.\n\nIn NIPS, 2016.\n\n[6] Jiajun Wu, Joshua B Tenenbaum, and Pushmeet Kohli. Neural scene de-rendering. In CVPR, 2017.\n\n[7] Max Jaderberg, Karen Simonyan, Andrew Zisserman, et al. Spatial transformer networks. In NIPS, 2015.\n\n[8] Arnaud Doucet, Nando De Freitas, and Neil Gordon, editors. Sequential Monte Carlo Methods in Practice.\n\nSpringer, 2001.\n\n[9] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image\ncaption generator. In Proceedings of the IEEE conference on computer vision and pattern recognition,\npages 3156\u20133164, 2015.\n\n[10] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press.\n\n[11] Oleksandr Polozov and Sumit Gulwani. Flashmeta: A framework for inductive program synthesis. ACM\n\nSIGPLAN Notices, 50(10):107\u2013126, 2015.\n\n[12] Leonid Anatolevich Levin. Universal sequential search problems. Problemy Peredachi Informatsii,\n\n9(3):115\u2013116, 1973.\n\n[13] Matej Balog, Alexander L. Gaunt, Marc Brockschmidt, Sebastian Nowozin, and Daniel Tarlow. DeepCoder:\n\nLearning to write programs. arXiv preprint arXiv:1611.01989, November 2016.\n\n[14] Raymond J Solomonoff. Optimum sequential search. 1984.\n\n[15] Alexander L Gaunt, Marc Brockschmidt, Rishabh Singh, Nate Kushman, Pushmeet Kohli, Jonathan Taylor,\nand Daniel Tarlow. Terpret: A probabilistic programming language for program induction. 
arXiv preprint\narXiv:1608.04428, 2016.\n\n[16] Alex Graves, Greg Wayne, Malcolm Reynolds, Tim Harley, Ivo Danihelka, Agnieszka Grabska-Barwi\u00b4nska,\nSergio G\u00f3mez Colmenarejo, Edward Grefenstette, Tiago Ramalho, John Agapiou, et al. Hybrid computing\nusing a neural network with dynamic external memory. Nature, 538(7626):471\u2013476, 2016.\n\n[17] Ashwin Kalyan, Abhishek Mohta, Oleksandr Polozov, Dhruv Batra, Prateek Jain, and Sumit Gulwani.\n\nNeural-guided deductive search for real-time program synthesis from examples. ICLR, 2018.\n\n[18] Yaroslav Ganin, Tejas Kulkarni, Igor Babuschkin, SM Eslami, and Oriol Vinyals. Synthesizing programs\n\nfor images using reinforced adversarial learning. ICML, 2018.\n\n[19] Yuntian Deng, Anssi Kanervisto, Jeffrey Ling, and Alexander M. Rush. Image-to-markup generation with\n\ncoarse-to-\ufb01ne attention. In ICML, 2017.\n\n[20] Tony Beltramelli. pix2code: Generating code from a graphical user interface screenshot. CoRR,\n\nabs/1705.07962, 2017.\n\n[21] Brian Hempel and Ravi Chugh. Semi-automated svg programming via direct manipulation. In Proceedings\nof the 29th Annual Symposium on User Interface Software and Technology, UIST \u201916, pages 379\u2013390,\nNew York, NY, USA, 2016. ACM.\n\n[22] Haibin Huang, Evangelos Kalogerakis, Ersin Yumer, and Radomir Mech. Shape synthesis from sketches\nvia procedural models and convolutional networks. IEEE transactions on visualization and computer\ngraphics, 2017.\n\n[23] Gen Nishida, Ignacio Garcia-Dorado, Daniel G. Aliaga, Bedrich Benes, and Adrien Bousseau. Interactive\n\nsketching of urban procedural models. ACM Trans. Graph., 35(4), 2016.\n\n[24] Kenneth Forbus, Jeffrey Usher, Andrew Lovett, Kate Lockwood, and Jon Wetzel. Cogsketch: Sketch\nunderstanding for cognitive science research and for education. 
Topics in Cognitive Science, 3(4):648\u2013666,\n2011.\n\n10\n\n\f", "award": [], "sourceid": 2974, "authors": [{"given_name": "Kevin", "family_name": "Ellis", "institution": "MIT"}, {"given_name": "Daniel", "family_name": "Ritchie", "institution": "Brown University"}, {"given_name": "Armando", "family_name": "Solar-Lezama", "institution": "MIT"}, {"given_name": "Josh", "family_name": "Tenenbaum", "institution": "MIT"}]}