{"title": "Simple, Distributed, and Accelerated Probabilistic Programming", "book": "Advances in Neural Information Processing Systems", "page_first": 7598, "page_last": 7609, "abstract": "We describe a simple, low-level approach for embedding probabilistic programming in a deep learning ecosystem. In particular, we distill probabilistic programming down to a single abstraction\u2014the random variable. Our lightweight implementation in TensorFlow enables numerous applications: a model-parallel variational auto-encoder (VAE) with 2nd-generation tensor processing units (TPUv2s); a data-parallel autoregressive model (Image Transformer) with TPUv2s; and multi-GPU No-U-Turn Sampler (NUTS). For both a state-of-the-art VAE on 64x64 ImageNet and Image Transformer on 256x256 CelebA-HQ, our approach achieves an optimal linear speedup from 1 to 256 TPUv2 chips. With NUTS, we see a 100x speedup on GPUs over Stan and 37x over PyMC3.", "full_text": "Simple, Distributed, and Accelerated\n\nProbabilistic Programming\n\nDustin Tran\u21e4 Matthew D. Hoffman\u2020\nSrinivas Vasudevan\u2020\n\nAlexey Radul\u2020 Matthew Johnson\u21e4\n\nDave Moore\u2020\n\nChristopher Suter\u2020\nRif A. Saurous\u2020\n\n\u21e4Google Brain, \u2020Google\n\nAbstract\n\nWe describe a simple, low-level approach for embedding probabilistic program-\nming in a deep learning ecosystem. In particular, we distill probabilistic program-\nming down to a single abstraction\u2014the random variable. Our lightweight imple-\nmentation in TensorFlow enables numerous applications: a model-parallel varia-\ntional auto-encoder (VAE) with 2nd-generation tensor processing units (TPUv2s); a\ndata-parallel autoregressive model (Image Transformer) with TPUv2s; and multi-\nGPU No-U-Turn Sampler (NUTS). For both a state-of-the-art VAE on 64x64 Im-\nageNet and Image Transformer on 256x256 CelebA-HQ, our approach achieves\nan optimal linear speedup from 1 to 256 TPUv2 chips. 
With NUTS, we see a 100x\nspeedup on GPUs over Stan and 37x over PyMC3.\u00b9\n\n1 Introduction\n\nMany developments in deep learning can be interpreted as blurring the line between model and\ncomputation. Some have even gone so far as to declare a new paradigm of \u201cdifferentiable program-\nming,\u201d in which the goal is not merely to train a model but to perform general program synthesis.\u00b2 In\nthis view, attention [3] and gating [18] describe boolean logic; skip connections [17] and conditional\ncomputation [6, 14] describe control flow; and external memory [12, 15] accesses elements outside a\nfunction\u2019s internal scope. Learning algorithms are also increasingly dynamic: for example, learning\nto learn [19], neural architecture search [52], and optimization within a layer [1].\nThe differentiable programming paradigm encourages modelers to explicitly consider computational\nexpense: one must consider not only a model\u2019s statistical properties (\u201chow well does the model\ncapture the true data distribution?\u201d), but its computational, memory, and bandwidth costs (\u201chow\nefficiently can it train and make predictions?\u201d). This philosophy allows researchers to engineer\ndeep-learning systems that run at the very edge of what modern hardware makes possible.\nBy contrast, the probabilistic programming community has tended to draw a hard line between\nmodel and computation: first, one specifies a probabilistic model as a program; second, one per-\nforms an \u201cinference query\u201d to automatically train the model given data [44, 33, 8]. This design\nchoice makes it difficult to implement probabilistic models at truly large scales, where training multi-\nbillion parameter models requires splitting model computation across accelerators and scheduling\ncommunication [41]. Recent advances such as Edward [48] have enabled finer control over infer-\nence procedures in deep learning (see also [28, 7]). 
However, they all treat inference as a closed system: this makes them difficult to compose with arbitrary computation, and with the broader\nmachine learning ecosystem, such as production platforms [5].\n\n\u00b9 All code, including experiments and more details from code snippets displayed here, is\navailable at http://bit.ly/2JpFipt. Namespaces: import tensorflow as tf; ed=edward2;\ntfe=tf.contrib.eager. Code snippets assume tensorflow==1.12.0.\n\n\u00b2 Recent advocates of this trend include Tom Dietterich (https://twitter.com/tdietterich/status/948811925038669824)\nand Yann LeCun (https://www.facebook.com/yann.lecun/posts/10155003011462143).\nIt is a classic idea in the programming languages field [4].\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fdef model():\n  p = ed.Beta(1., 1., name=\"p\")\n  x = ed.Bernoulli(probs=p,\n                   sample_shape=50,\n                   name=\"x\")\n  return x\n\nFigure 1: Beta-Bernoulli program. In eager\nmode, model() generates a binary vector of\n50 elements. In graph mode, model() re-\nturns an op to be evaluated in a TensorFlow\nsession.\n\nimport neural_net_negative, neural_net_positive\n\ndef variational(x):\n  eps = ed.Normal(0., 1., sample_shape=2)\n  if eps[0] > 0:\n    return neural_net_positive(eps[1], x)\n  else:\n    return neural_net_negative(eps[1], x)\n\nFigure 2: Variational program [35], available in\neager mode. Python control flow is applicable to\ngenerative processes: given a coin flip, the pro-\ngram generates from one of two neural nets. Their\noutputs can have differing shape (and structure).\n\nIn this paper, we describe a simple approach for embedding probabilistic programming in a deep\nlearning ecosystem; our implementation is in TensorFlow and Python, named Edward2. 
This\nlightweight approach offers a low-level modality for \ufb02exible modeling\u2014one which deep learners\nbene\ufb01t from \ufb02exible prototyping with probabilistic primitives, and one which probabilistic model-\ners bene\ufb01t from tighter integration with familiar numerical ecosystems.\nContributions. We distill the core of probabilistic programming down to a single abstraction\u2014the\nrandom variable. Unlike existing languages, there is no abstraction for learning: algorithms may for\nexample be functions taking a model as input (another function) and returning tensors.\nThis low-level design has two important implications. First, it enables research \ufb02exibility: a re-\nsearcher has freedom to manipulate model computation for training and testing. Second, it enables\nbigger models using accelerators such as tensor processing units (TPUs) [22]: TPUs require special-\nized ops in order to distribute computation and memory across a physical network topology.\nWe illustrate three applications: a model-parallel variational auto-encoder (VAE) [24] with TPUs; a\ndata-parallel autoregressive model (Image Transformer [31]) with TPUs; and multi-GPU No-U-Turn\nSampler (NUTS) [21]. For both a state-of-the-art VAE on 64x64 ImageNet and Image Transformer\non 256x256 CelebA-HQ, our approach achieves an optimal linear speedup from 1 to 256 TPUv2\nchips. With NUTS, we see a 100x speedup on GPUs over Stan [8] and 37x over PyMC3 [39].\n\n1.1 Related work\n\nTo the best of our knowledge, this work takes a unique design standpoint. Although its lightweight\ndesign adds research \ufb02exibility, it removes many high-level abstractions which are often desirable\nfor practitioners. In these cases, automated inference in alternative probabilistic programming lan-\nguages (PPLs) [25, 39] prove useful, so both styles are important for different audiences.\nCombining PPLs with deep learning poses many practical challenges; we outline three. 
First, with\nthe exception of recent works [49, 36, 39, 42, 7, 34], most languages lack support for minibatch\ntraining and variational inference, and most lack critical systems features such as numerical stabil-\nity, automatic differentiation, accelerator support, and vectorization. Second, existing PPLs restrict\nlearning algorithms to be \u201cinference queries\u201d, which return conditional or marginal distributions of\na program. By blurring the line between model and computation, a lighter-weight approach allows\nany algorithm operating on probability distributions; this enables, e.g., risk minimization and the\ninformation bottleneck. Third, it has been an open challenge to scale PPLs to 50+ million parameter\nmodels, to multi-machine environments, and with data or model parallelism. To the best of our\nknowledge, this work is the first to do so.\n\n2 Random Variables Are All You Need\nWe outline probabilistic programs in Edward2. They require only one abstraction: a random vari-\nable. We then describe how to perform flexible, low-level manipulations using tracing.\n\n2.1 Probabilistic Programs, Variational Programs, and Many More\n\n2\n\n\fEdward2 reifies any computable probability distribution as a Python function (program). Typically,\nthe function executes the generative process and returns samples.\u00b3 Inputs to the program\u2014along\nwith any scoped Python variables\u2014represent values the distribution conditions on.\nTo specify random choices in the program, we use RandomVariables from Edward [49], which\nhas similarly been built on by Zhusuan [42] and Probtorch [34]. Random variables provide methods\nsuch as log_prob and sample, wrapping TensorFlow Distributions [10]. 
Further, Edward random\nvariables augment a computational graph of TensorFlow operations: each random variable x is\nassociated with a sampled tensor $x^* \\sim p(x)$ in the graph.\nFigure 1 illustrates a toy example: a Beta-Bernoulli model,\n$p(x, p) = \\text{Beta}(p \\mid 1, 1) \\prod_{n=1}^{50} \\text{Bernoulli}(x_n \\mid p)$, where p is a latent probability shared across the 50\ndata points $x \\in \\{0, 1\\}^{50}$. The random variable x is 50-dimensional, parameterized by the tensor\n$p^* \\sim p(p)$. As part of TensorFlow, Edward2 supports two execution modes. Eager mode\nsimultaneously places operations onto the computational graph and executes them; here, model()\ncalls the generative process and returns a binary vector of 50 elements. Graph mode separately\nstages graph-building and execution; here, model() returns a deferred TensorFlow vector; one may\nrun a TensorFlow session to fetch the vector.\nImportantly, all distributions\u2014regardless of downstream use\u2014are written as probabilistic programs.\nFigure 2 illustrates an implicit variational program, i.e., a variational distribution which admits sam-\npling but may not have a tractable density. In general, variational programs [35], proposal programs\n[9], and discriminators in adversarial training [13] are computable probability distributions. If we\nhave a mechanism for manipulating these probabilistic programs, we do not need to introduce any\nadditional abstractions to support powerful inference paradigms. Below we demonstrate this flexi-\nbility using a model-parallel VAE.\n\n2.2 Example: Model-Parallel VAE with TPUs\nFigure 4 implements a model-parallel variational auto-encoder (VAE), which consists of a decoder,\nprior, and encoder. The decoder generates 16-bit audio (a sequence of T values in $[0, 2^{16} - 1]$\nnormalized to [0, 1]); it employs an autoregressive flow, which for training efficiently parallelizes\nover sequence length [30]. The prior posits latents representing a coarse 8-bit resolution over $T/2$\nsteps; it is learnable with a similar architecture. 
The encoder compresses each sample into the coarse\nresolution; it is parameterized by a compressing function.\nA TPU cluster arranges cores in a toroidal network, where for example, 512 cores may be arranged\nas a 16x16x2 torus interconnect. To utilize the cluster, the prior and decoder apply distributed au-\ntoregressive flows (Figure 3). They split compute across a virtual 4x4 topology in two ways: \u201cacross\nflows\u201d, where every 2 flows are placed on a different core; and \u201cwithin a flow\u201d, where 4 independent\nflows apply layers respecting autoregressive ordering (for space, we omit code for splitting within a\nflow). The encoder splits computation via compressor; for space, we also omit it.\nThe probabilistic programs are concise. They capture recent advances such as autoregressive flows\nand multi-scale latent variables, and they enable never-before-tried architectures: with 16x16\nTPUv2 chips (512 cores), the model can split across 4.1 TB of memory and utilize up to $10^{16}$ FLOPS.\nAll elements of the VAE\u2014distributions, architectures, and computation placement\u2014are extensible.\nFor training, we use typical TensorFlow ops; we describe how this works next.\n\n2.3 Tracing\nWe defined probabilistic programs as arbitrary Python functions. To enable flexible training, we\napply tracing, a classic technique used across probabilistic programming [e.g., 28, 45, 36, 11, 7]\nas well as automatic differentiation [e.g., 27]. 
A tracer wraps a subset of the language\u2019s primitive\noperations so that the tracer can intercept control just before those operations are executed.\nFigure 5 displays the core implementation: it is 10 lines of code.\u2074 trace is a context manager\nwhich, upon entry, pushes a tracer callable to a stack, and upon exit, pops tracer from the stack.\ntraceable is a decorator: it registers functions so that they may be traced according to the stack.\n\n\u00b3 Instead of sampling, one can also represent a distribution in terms of its density; see Section 3.1.\n\u2074 Rather than implement tracing, one can also reuse the pre-existing one in an autodiff system. However, our\npurposes require tracing with user control (tracer functions above) in order to manipulate computation. This is\nnot presently available in TensorFlow Eager or Autograd [27]\u2014which motivated our implementation.\n\n3\n\n\fimport SplitAutoregressiveFlow, masked_network\ntfb = tf.contrib.distributions.bijectors\n\nclass DistributedAutoregressiveFlow(tfb.Bijector):\n\n  def __init__(self, flow_size=[4]*8):\n    self.flows = []\n    for num_splits in flow_size:\n      flow = SplitAutoregressiveFlow(masked_network, num_splits)\n      self.flows.append(flow)\n    # A final single-split flow aggregates the outputs.\n    self.flows.append(SplitAutoregressiveFlow(masked_network, 1))\n    super(DistributedAutoregressiveFlow, self).__init__()\n\n  def _forward(self, x):\n    for l, flow in enumerate(self.flows):\n      # Place every 2 consecutive flows on the same TPU core.\n      with tf.device(tf.contrib.tpu.core(l//2)):\n        x = flow.forward(x)\n    return x\n\n  def _inverse_and_log_det_jacobian(self, y):\n    ldj = 0.\n    for l, flow in enumerate(self.flows[::-1]):\n      with tf.device(tf.contrib.tpu.core(l//2)):\n        y, new_ldj = flow.inverse_and_log_det_jacobian(y)\n        ldj += new_ldj\n    return y, ldj\n\nFigure 3: Distributed autoregressive flows. (right) The default length is 8, each with 4 independent\nflows. Each flow transforms inputs via layers respecting autoregressive ordering. 
(left) Flows are\npartitioned across a virtual topology of 4x4 cores (rectangles); each core computes 2 flows and\nis locally connected; a final core aggregates. The virtual topology aligns with the physical TPU\ntopology: for 4x4 TPUs, it is exact; for 16x16 TPUs, it is duplicated for data parallelism.\n\nimport upsample, compressor\n\ndef prior():\n  \"\"\"Uniform noise to 8-bit latent, [u1,...,u(T/2)] -> [z1,...,z(T/2)]\"\"\"\n  dist = ed.Independent(ed.Uniform(low=tf.zeros([batch_size, T//2])))\n  return ed.TransformedDistribution(dist, DistributedAutoregressiveFlow(flow_size))\n\ndef decoder(z):\n  \"\"\"Uniform noise + latent to 16-bit audio, [u1,...,uT], [z1,...,z(T/2)] -> [x1,...,xT]\"\"\"\n  dist = ed.Independent(ed.Uniform(low=tf.zeros([batch_size, T])))\n  dist = ed.TransformedDistribution(dist, tfb.Affine(shift=upsample(z)))\n  return ed.TransformedDistribution(dist, DistributedAutoregressiveFlow(flow_size))\n\ndef encoder(x):\n  \"\"\"16-bit audio to 8-bit latent, [x1,...,xT] -> [z1,...,z(T/2)]\"\"\"\n  loc, log_scale = tf.split(compressor(x), 2, axis=-1)\n  return ed.Normal(loc=loc, scale=tf.exp(log_scale))\n\nFigure 4: Model-parallel VAE with TPUs, generating 16-bit audio from 8-bit latents. The prior\nand decoder split computation according to distributed autoregressive flows. The encoder may split\ncomputation according to compressor; we omit it for space.\n\n4\n\n\ffrom contextlib import contextmanager\n\nSTACK = [lambda f, *a, **k: f(*a, **k)]\n\n@contextmanager\ndef trace(tracer):\n  STACK.append(tracer)\n  yield\n  STACK.pop()\n\ndef traceable(f):\n  def f_wrapped(*a, **k):\n    # Return the tracer's result so traced ops still yield values.\n    return STACK[-1](f, *a, **k)\n  return f_wrapped\n\nFigure 5: Minimal implementation of trac-\ning. trace defines a context; any traceable\nops executed during it are replaced by calls\nto tracer. traceable registers these ops;\nwe register Edward random variables.\n\nFigure 6: A program execution. 
It is a directed\nacyclic graph and is traced for various operations\nsuch as accumulating log-probabilities or finding\nconditional independence.\n\ndef make_log_joint_fn(model):\n  def log_joint_fn(**model_kwargs):\n    def tracer(rv_call, *args, **kwargs):\n      name = kwargs.get(\"name\")\n      # Fix the random variable's value to the input, if one is given.\n      kwargs[\"value\"] = model_kwargs.get(name)\n      rv = rv_call(*args, **kwargs)\n      log_probs.append(tf.reduce_sum(rv.log_prob(rv)))\n      return rv\n    log_probs = []\n    with trace(tracer):\n      model(**model_kwargs)\n    return sum(log_probs)\n  return log_joint_fn\n\nFigure 7: A higher-order function which takes a\nmodel program as input and returns its log-joint\ndensity function.\n\ndef mutilate(model, **do_kwargs):\n  def mutilated_model(*args, **kwargs):\n    def tracer(rv_call, *args, **kwargs):\n      name = kwargs.get(\"name\")\n      if name in do_kwargs:\n        return do_kwargs[name]\n      return rv_call(*args, **kwargs)\n    with trace(tracer):\n      return model(*args, **kwargs)\n  return mutilated_model\n\nFigure 8: A higher-order function which\ntakes a model program as input and returns\nits causally intervened program. Intervention\ndiffers from conditioning: it does not change\nthe sampled value but the distribution.\n\nEdward2 registers random variables:\nfor example, Normal = traceable(edward1.Normal).\nThe tracing implementation is also agnostic to the numerical backend. Appendix A applies Fig-\nure 5 to implement Edward2 on top of SciPy.\n\n2.4 Tracing Applications\nTracing is a common tool for probabilistic programming. However, in other languages, tracing\nprimarily serves as an implementation detail to enable inference \u201cmeta-programming\u201d procedures.\nIn our approach, we promote it to be a user-level technique for flexible computation. We outline two\nexamples; both are difficult to implement without user access to tracing.\nFigure 7 illustrates a make_log_joint_fn factory function. It takes a model program as input and\nreturns its log-joint density function across a trace. 
We implement it using a tracer which sets random\nvariable values to the inputs and accumulates their log-probabilities as a side-effect. Section 3.3 applies\nmake_log_joint_fn in a variational inference algorithm.\nFigure 8 illustrates causal intervention [32]: it \u201cmutilates\u201d a program by setting random variables\nindexed by their name to another random variable. Note this effect is propagated to any descen-\ndants while leaving non-descendants unaltered: this is possible because Edward2 implicitly traces a\ndataflow graph over random variables, following a \u201cpush\u201d model of evaluation. Other probabilistic\noperations more naturally follow a \u201cpull\u201d model of evaluation: mean-field variational inference re-\nquires evaluating energy terms corresponding to a single factor; we do so by reifying a variational\nprogram\u2019s trace (e.g., Figure 6) and walking backwards from that factor\u2019s node in the trace.\n\n3 Examples: Learning with Low-Level Functions\nWe described probabilistic programs and how to manipulate their computation with low-level tracing\nfunctions. Unlike existing PPLs, there is no abstraction for learning. 
Below we provide examples of\nhow this works and its implications.\n\n5\n\n\fimport get_channel_embeddings, add_positional_embedding_nd, local_attention_1d\n\ndef image_transformer(inputs, hparams):\n  x = get_channel_embeddings(3, inputs, hparams.hidden_size)\n  x = tf.reshape(x, [-1, 32*32*3, hparams.hidden_size])\n  x = tf.pad(x, [[0, 0], [1, 0], [0, 0]])[:, :-1, :] # shift pixels right\n  x = add_positional_embedding_nd(x, max_length=32*32*3+3)\n  x = tf.nn.dropout(x, keep_prob=0.7)\n  for _ in range(hparams.num_layers):\n    y = local_attention_1d(x, hparams, attention_type=\"local_mask_right\",\n                           q_padding=\"LEFT\", kv_padding=\"LEFT\")\n    x = tf.contrib.layers.layer_norm(tf.nn.dropout(y, keep_prob=0.7) + x, begin_norm_axis=-1)\n    y = tf.layers.dense(x, hparams.filter_size, activation=tf.nn.relu)\n    y = tf.layers.dense(y, hparams.hidden_size, activation=None)\n    x = tf.contrib.layers.layer_norm(tf.nn.dropout(y, keep_prob=0.7) + x, begin_norm_axis=-1)\n  logits = tf.layers.dense(x, 256, activation=None)\n  return ed.Categorical(logits=logits).log_prob(inputs)\n\nloss = -tf.reduce_sum(image_transformer(inputs, hparams)) # inputs has shape [batch,32,32,3]\ntrain_op = tf.contrib.tpu.CrossShardOptimizer(tf.train.AdamOptimizer()).minimize(loss)\n\nFigure 9: Data-parallel Image Transformer with TPUs [31]. It is a neural autoregressive model\nwhich computes the log-probability of a batch of images with self-attention. Our lightweight design\nenables representing and training the model as a log-probability function; this is more efficient\nthan the typical representation of programs as a generative process. 
Embedding and self-attention\nfunctions are assumed in the environment; they are available in Tensor2Tensor [50].\n\n3.1 Example: Data-Parallel Image Transformer with TPUs\nAll PPLs have so far focused on a unifying representation of models, typically as a generative pro-\ncess. However, this can be inef\ufb01cient in practice for certain models. Because our lightweight ap-\nproach has no required signature for training, it permits alternative model representations.5\nFor example, Figure 9 represents the Image Transformer [31] as a log-probability function. The\nImage Transformer is a state-of-the-art autoregressive model for image generation, consisting of a\nCategorical distribution parameterized by a batch of right-shifted images, embeddings, a sequence\nof alternating self-attention and feedforward layers, and an output layer. The function computes\nlog_prob with respect to images and parallelizes over pixel dimensions. Unlike the log-probability,\nsampling requires programming the autoregressivity in serial, which is inef\ufb01cient and harder to\nimplement.6 With the log-probability representation, data parallelism with TPUs is also immediate\nby cross-sharding the optimizer. The train op can be wrapped in a TF Estimator, or applied with\nmanual TPU ops in order to aggregate training across cores.\n\n3.2 Example: No-U-Turn Sampler\nFigure 10 demonstrates the core logic behind the No-U-Turn Sampler (NUTS), a Hamiltonian Monte\nCarlo algorithm which adaptively selects the path length hyperparameter during leapfrog integra-\ntion. Its implementation uses non-tail recursion, following the pseudo-code in Hoffman and Gelman\n[21, Alg 6]; both CPUs and GPUs are compatible. 
See source code for the full implementation;\nAppendix B also implements a grammar VAE [26] using a data-dependent while loop.\nThe ability to integrate NUTS requires interoperability with eager mode: NUTS requires Python\ncontrol flow, as it is difficult to implement recursion natively with TensorFlow ops. (NUTS is not\navailable, e.g., in Edward 1.) However, eager execution has tradeoffs (not unique to our approach).\nFor example, it incurs a non-negligible overhead over graph mode, and it has only preliminary support\nfor TPUs. Our lightweight design supports both modes so the user can select either.\n\n\u2075 The Image Transformer provides a performance reason for when density representations may be preferred.\nAnother compelling example is energy-based models $p(x) \\propto \\exp\\{f(x)\\}$, where sampling is not even avail-\nable in closed-form; in contrast, the unnormalized density is.\n\u2076 In principle, one can reify any model in terms of sampling and apply make_log_joint_fn to obtain its\ndensity. However, make_log_joint_fn cannot always be applied efficiently in practice, such as in this example. 
In\ncontrast, the reverse program transformation from density to sampling can be done ef\ufb01ciently: in this example,\nsampling can at best compute in serial order; therefore it requires no performance optimization.\n\n6\n\n\fdef nuts(...):\nsamples = []\nfor _ in range(num_samples):\n\nstate = set_up_trajectory(...)\ndepth = 0\nwhile no_u_turn(state):\n\nstate = extend_trajectory(depth, state)\ndepth += 1\n\nsamples.append(state)\n\nreturn samples\n\ndef extend_trajectory(depth, state):\n\nif depth == 0:\n\nstate = one_leapfrog_step(state)\n\nelse:\n\nstate = extend_trajectory(depth-1, state)\nif no_u_turn(state):\nstate = extend_trajectory(depth-1, state)\n\nreturn state\n\nFigure 10: Core logic in No-U-Turn Sampler [21].\nThis algorithm has data-dependent non-tail recur-\nsion.\n\n3.3 Example: Alignment of Probabilistic Programs\n\nFigure 11: Learning often involves\nmatching two execution traces such as\na model program\u2019s (left) and a varia-\ntional program\u2019s (right), or a model pro-\ngram\u2019s with data tensors (bottom). Red\narrows align prior and variational vari-\nables. Blue arrows align observed vari-\nables and data; edges from data to varia-\ntional variables represent amortization.\n\nLearning algorithms often involve manipulating multiple probabilistic programs. For example, a\nvariational inference algorithm takes two programs as input\u2014the model program and variational\nprogram\u2014and computes a loss function for optimization. This requires specifying which variables\nrefer to each other in the two programs.\nWe apply alignment (Figure 11), which is a dictionary of key-value pairs, each from one string (a\nrandom variable\u2019s name) to another (a random variable in the other program). This dictionary pro-\nvides \ufb02exibility over how random variables are aligned, independent of their speci\ufb01cations in each\nprogram. 
For example, this enables ladder VAEs [43] where prior and variational topological order-\nings are reversed; and VampPriors [46] where prior and variational parameters are shared.\nFigure 12 shows variational inference with gradient descent using a \ufb01xed preconditioner. It applies\nmake_log_joint_fn (Figure 7) and assumes model applies a random variable with name \u2019x\u2019\n(such as the VAE in Section 2.2). Note this extends alignment from Edward 1 to dynamic programs\n[48]: instead of aligning nodes in static graphs at construction-time, it aligns nodes in execution\ntraces at runtime. It also has applications for aligning model and proposal programs in Metropolis-\nHastings; model and discriminator programs in adversarial training; and even model programs and\ndata infeeding functions (\u201cprograms\u201d) in input-output pipelines.\n\n3.4 Example: Learning to Learn by Variational Inference by Gradient Descent\n\nA lightweight design is not only advantageous for \ufb02exible speci\ufb01cation of learning algorithms but\n\ufb02exible composability: here, we demonstrate nested inference via learning to learn. Recall Figure 12\nperforms variational inference with gradient descent. Figure 13 applies gradient descent on the out-\nput of that gradient descent algorithm. It \ufb01nds the optimal preconditioner [2]. This is possible be-\ncause learning algorithms are simply compositions of numerical operations; the composition is fully\ndifferentiable. This differentiability is not possible with Edward, which manipulates inference\nobjects: taking gradients of one is not well-de\ufb01ned.7 See also Appendix C which illustrates Markov\nchain Monte Carlo within variational inference.\n\n4 Experiments\nWe introduced a lightweight approach for embedding probabilistic programming in a deep learning\necosystem. 
Here, we show that such an approach is particularly advantageous for exploiting modern\nhardware for multi-TPU VAEs and autoregressive models, and multi-GPU NUTS.\n\n\u2077 Unlike Edward, Edward2 can also specify distributions over the learning algorithm.\n\n7\n\n\fimport model, variational, align, x\n\ndef train(precond):\n  def loss_fn(x):\n    qz = variational(x)\n    log_joint_fn = make_log_joint_fn(model)\n    kwargs = {align[rv.name]: rv\n              for rv in toposort(qz)}\n    energy = log_joint_fn(x=x, **kwargs)\n    entropy = sum([tf.reduce_sum(rv.entropy())\n                   for rv in toposort(qz)])\n    return -energy - entropy\n  grad_fn = tfe.implicit_gradients(loss_fn)\n  optimizer = tf.train.AdamOptimizer(0.1)\n  for _ in range(500):\n    grads = tf.tensordot(precond, grad_fn(x), [[1], [0]])\n    optimizer.apply_gradients(grads)\n  return loss_fn(x)\n\nFigure 12: Variational inference with precondi-\ntioned gradient descent. Edward2 offers writing the\nprobabilistic program and performing arbitrary Ten-\nsorFlow computation for learning.\n\ngrad_fn = tfe.gradients_function(train)\noptimizer = tf.train.AdamOptimizer(0.1)\nfor _ in range(100):\n  optimizer.apply_gradients(grad_fn())\n\nFigure 13: Learning-to-learn. It finds the op-\ntimal preconditioner for train (Figure 12) by\ndifferentiating the entire learning algorithm\nwith respect to the preconditioner.\n\nFigure 14: Vector-Quantized VAE on 64x64\nImageNet.\n\nFigure 15: Image Transformer on 256x256\nCelebA-HQ.\n\nCPU experiments\nuse a six-core Intel E5-1650 v4, GPU experiments use 1-8 NVIDIA Tesla V100 GPUs, and TPU\nexperiments use 2nd generation chips under a variety of topology arrangements. The TPUv2 chip\ncomprises two cores: each features roughly 22 teraflops on mixed 16/32-bit precision (it is roughly\ntwice the flops of an NVIDIA Tesla P100 GPU on 32-bit precision). In all distributed experiments,\nwe cross-shard the optimizer for data-parallelism: each shard (core) takes a batch size of 1. 
All\nnumbers are averaged over 5 runs.\n\n4.1 High-Quality Image Generation\n\nWe evaluate models with near state-of-the-art results (\u201cbits/dim\u201d) for non-autoregressive generation\non 64x64 ImageNet [29] and autoregressive generation on 256x256 CelebA-HQ [23]. We measure\nthroughput as the number of examples (data points) processed per second of wall-clock time.\nFor 64x64 ImageNet, we use a vector-quantized variational auto-encoder trained with soft EM [37].\nIt encodes a 64x64x3 pixel image into an 8x8x10 tensor of latents, with a codebook size of 256 and\nwhere each code vector has 512 dimensions. The prior is an Image Transformer [31] with 6 layers\nof local 1D self-attention. The encoder applies 4 convolutional layers with kernel size 5 and stride\n2, 2 residual layers, and a dense layer. The decoder applies the reverse of a dense layer, 2 residual\nlayers, and 4 transposed convolutional layers.\n\n8\n\n[Plots for Figures 14 and 15: Examples/Sec versus # TPUv2 chips (ticks at 1, 16, 64, 128, 256); panel titles: Speedup over TPUs, slope=1.40 and Speedup over TPUs, slope=7.49.]\n\n\fSystem | Runtime (ms)\nStan (CPU) | 201.0\nPyMC3 (CPU) | 74.8\nHandwritten TF (CPU) | 66.2\nEdward2 (CPU) | 68.4\nHandwritten TF (1 GPU) | 9.5\nEdward2 (1 GPU) | 9.7\nEdward2 (8 GPU) | 2.3\n\nTable 1: Time per leapfrog step for No-U-Turn Sampler in Bayesian logistic regression. Edward2\n(GPU) achieves a 100x speedup over Stan (CPU) and 37x over PyMC3 (CPU); dynamism is not\navailable in Edward. Edward2 also incurs negligible overhead over handwritten TensorFlow code.\n\nFor 256x256 CelebA-HQ, we use a relatively small Image Transformer [31] in order to fit the model\nin memory. It applies 5 layers of local 1D self-attention with block length of 256, hidden sizes of\n128, attention key/value channels of 64, and feedforward layers with a hidden size of 256.\nFigure 14 and Figure 15 show that for both models, Edward2 achieves an optimal linear scaling\nover the number of TPUv2 chips from 1 to 256. 
In experiments, we also found that larger batch sizes\ndrastically sped up training.\n\n4.2 No-U-Turn Sampler\n\nWe use the No-U-Turn Sampler (NUTS, [21]) to illustrate the power of dynamic algorithms on\naccelerators. NUTS implements a variant of Hamiltonian Monte Carlo in which the fixed trajectory\nlength is replaced by a recursive doubling procedure that adapts the length per iteration.\nWe compare Bayesian logistic regression using NUTS implemented in Stan [8] and in PyMC3 [39]\nto our eager-mode TensorFlow implementation. The model\u2019s log joint density is implemented both\nas \u201chandwritten\u201d TensorFlow code and as a probabilistic program in Edward2; see code in\nAppendix D. We use the Covertype dataset (581,012 data points, 54 features, outcomes are binarized).\nSince adaptive sampling may lead NUTS iterations to take wildly different numbers of leapfrog\nsteps, we report the average time per leapfrog step, averaged over 5 full NUTS trajectories (in these\nexperiments, that typically amounted to about a thousand leapfrog steps total).\nTable 1 shows that Edward2 (GPU) has up to a 37x speedup over PyMC3 with multi-threaded CPU;\nit has up to a 100x speedup over Stan, which is single-threaded.8 In addition, while Edward2 in\nprinciple introduces overhead in eager mode due to its tracing mechanism, the speed differential\nbetween Edward2 and handwritten TensorFlow code is negligible (smaller than between-run variation).\nThis demonstrates that the power of the PPL formalism comes with negligible overhead.\n\n5 Discussion\n\nWe described a simple, low-level approach for embedding probabilistic programming in a deep\nlearning ecosystem. For both a state-of-the-art VAE on 64x64 ImageNet and Image Transformer on\n256x256 CelebA-HQ, we achieve an optimal linear speedup from 1 to 256 TPUv2 chips. 
For NUTS,\nwe see up to 100x speedups over other systems.\nIn ongoing work, we are building on this design as a platform for fundamental research in generative\nmodels and Bayesian neural networks (e.g., [47, 51, 16]). In addition, our experiments relied on\ndata parallelism to show massive speedups. Recent work has improved distributed programming of\nneural networks for both model parallelism and parallelism over large inputs such as\nsuper-high-resolution images [40]. Combined with this work, we hope to push the limits of giant probabilistic\nmodels with over 1 trillion parameters and over 4K resolutions (50 million dimensions).\n\n8 PyMC3 is actually slower with GPU than CPU; its code frequently communicates between Theano on the\nGPU and NumPy on the CPU. Stan used only one thread, as it leverages multiple threads by running HMC\nchains in parallel, and it requires double precision.\n\nAcknowledgements. We thank the anonymous NIPS reviewers, TensorFlow Eager team, PyMC\nteam, Alex Alemi, Samy Bengio, Josh Dillon, Delesley Hutchins, Dick Lyon, Dougal Maclaurin,\nKevin Murphy, Niki Parmar, Zak Stone, and Ashish Vaswani for their assistance in improving the\nimplementation, the benchmarks, and/or the paper.\n\nReferences\n\n[1] Amos, B. and Kolter, J. Z. (2017). OptNet: Differentiable optimization as a layer in neural\nnetworks. In International Conference on Machine Learning.\n\n[2] Andrychowicz, M., Denil, M., Gomez, S., Hoffman, M. W., Pfau, D., Schaul, T., and de Freitas,\nN. (2016). Learning to learn by gradient descent by gradient descent. In Neural Information\nProcessing Systems.\n\n[3] Bahdanau, D., Cho, K., and Bengio, Y. (2015). Neural machine translation by jointly learning\nto align and translate. In International Conference on Learning Representations.\n\n[4] Baydin, A. G., Pearlmutter, B. A., Radul, A. A., and Siskind, J. M. (2015). Automatic\ndifferentiation in machine learning: a survey. 
arXiv preprint arXiv:1502.05767.\n\n[5] Baylor, D., Breck, E., Cheng, H.-T., Fiedel, N., Foo, C. Y., Haque, Z., Haykal, S., Ispir, M.,\nJain, V., Koc, L., et al. (2017). TFX: A TensorFlow-based production-scale machine learning\nplatform. In Knowledge Discovery and Data Mining.\n\n[6] Bengio, E., Bacon, P.-L., Pineau, J., and Precup, D. (2015). Conditional computation in neural\nnetworks for faster models. arXiv preprint arXiv:1511.06297.\n\n[7] Bingham, E., Chen, J. P., Jankowiak, M., Obermeyer, F., Pradhan, N., Karaletsos, T., Singh,\nR., Szerlip, P., Horsfall, P., and Goodman, N. D. (2018). Pyro: Deep Universal Probabilistic\nProgramming. arXiv preprint arXiv:1810.09538.\n\n[8] Carpenter, B., Gelman, A., Hoffman, M. D., Lee, D., Goodrich, B., Betancourt, M., Brubaker,\nM., Guo, J., Li, P., and Riddell, A. (2016). Stan: A probabilistic programming language. Journal\nof Statistical Software.\n\n[9] Cusumano-Towner, M. F. and Mansinghka, V. K. (2018). Using probabilistic programs as\nproposals. In POPL Workshop.\n\n[10] Dillon, J. V., Langmore, I., Tran, D., Brevdo, E., Vasudevan, S., Moore, D., Patton, B.,\nAlemi, A., Hoffman, M., and Saurous, R. A. (2017). TensorFlow Distributions. arXiv preprint\narXiv:1711.10604.\n\n[11] Ge, H., Xu, K., Scibior, A., Ghahramani, Z., et al. (2018). The Turing language for\nprobabilistic programming. In Artificial Intelligence and Statistics.\n\n[12] Giles, C. L., Sun, G.-Z., Chen, H.-H., Lee, Y.-C., and Chen, D. (1990). Higher order recurrent\nnetworks and grammatical inference. In Neural Information Processing Systems.\n\n[13] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville,\nA., and Bengio, Y. (2014). Generative Adversarial Nets. In Neural Information Processing\nSystems.\n\n[14] Graves, A. (2016). Adaptive computation time for recurrent neural networks. arXiv preprint\narXiv:1603.08983.\n\n[15] Graves, A., Wayne, G., and Danihelka, I. (2014). 
Neural turing machines. arXiv preprint\narXiv:1410.5401.\n\n[16] Hafner, D., Tran, D., Irpan, A., Lillicrap, T., and Davidson, J. (2018). Reliable uncertainty\nestimates in deep neural networks using noise contrastive priors. arXiv preprint.\n\n[17] He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition.\nIn Computer Vision and Pattern Recognition.\n\n[18] Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural computation,\n9(8):1735\u20131780.\n\n[19] Hochreiter, S., Younger, A. S., and Conwell, P. R. (2001). Learning to learn using gradient\ndescent. In International Conference on Artificial Neural Networks, pages 87\u201394.\n\n[20] Hoffman, M. D. (2017). Learning deep latent Gaussian models with Markov chain Monte\nCarlo. In International Conference on Machine Learning.\n\n[21] Hoffman, M. D. and Gelman, A. (2014). The No-U-turn sampler: Adaptively setting path\nlengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15(1):1593\u20131623.\n\n[22] Jouppi, N. P., Young, C., Patil, N., Patterson, D., Agrawal, G., Bajwa, R., Bates, S., Bhatia, S.,\nBoden, N., Borchers, A., et al. (2017). In-datacenter performance analysis of a tensor processing\nunit. In Proceedings of the 44th Annual International Symposium on Computer Architecture.\n\n[23] Karras, T., Aila, T., Laine, S., and Lehtinen, J. (2018). Progressive growing of gans for\nimproved quality, stability, and variation. In International Conference on Learning Representations.\n\n[24] Kingma, D. P. and Welling, M. (2014). Auto-encoding variational Bayes. In International\nConference on Learning Representations.\n\n[25] Kucukelbir, A., Tran, D., Ranganath, R., Gelman, A., and Blei, D. M. (2017). Automatic\ndifferentiation variational inference. The Journal of Machine Learning Research, 18(1):430\u2013474.\n\n[26] Kusner, M. J., Paige, B., and Hern\u00e1ndez-Lobato, J. M. (2017). 
Grammar variational autoencoder.\nIn International Conference on Machine Learning.\n\n[27] Maclaurin, D., Duvenaud, D., Johnson, M., and Adams, R. P. (2015). Autograd: Reverse-mode\ndifferentiation of native Python.\n\n[28] Mansinghka, V., Selsam, D., and Perov, Y. (2014). Venture: A higher-order probabilistic\nprogramming platform with programmable inference. arXiv preprint arXiv:1404.0099.\n\n[29] Oord, A. v. d., Kalchbrenner, N., and Kavukcuoglu, K. (2016). Pixel recurrent neural networks.\narXiv preprint arXiv:1601.06759.\n\n[30] Papamakarios, G., Murray, I., and Pavlakou, T. (2017). Masked autoregressive flow for density\nestimation. In Advances in Neural Information Processing Systems, pages 2335\u20132344.\n\n[31] Parmar, N., Vaswani, A., Uszkoreit, J., Kaiser, \u0141., Shazeer, N., Ku, A., and Tran, D. (2018).\nImage transformer. In International Conference on Machine Learning.\n\n[32] Pearl, J. (2003). Causality: models, reasoning, and inference. Econometric Theory,\n19(675-685):46.\n\n[33] Pfeffer, A. (2007). The design and implementation of IBAL: A general-purpose probabilistic\nlanguage. Introduction to Statistical Relational Learning, page 399.\n\n[34] Probtorch Developers (2017). Probtorch. https://github.com/probtorch/probtorch.\n\n[35] Ranganath, R., Altosaar, J., Tran, D., and Blei, D. M. (2016). Operator variational inference.\nIn Neural Information Processing Systems.\n\n[36] Ritchie, D., Horsfall, P., and Goodman, N. D. (2016). Deep Amortized Inference for\nProbabilistic Programs. arXiv preprint arXiv:1610.05735.\n\n[37] Roy, A., Vaswani, A., Neelakantan, A., and Parmar, N. (2018). Theory and experiments on\nvector quantized autoencoders. arXiv preprint arXiv:1805.11063.\n\n[38] Salimans, T., Kingma, D., and Welling, M. (2015). Markov chain Monte Carlo and variational\ninference: Bridging the gap. In International Conference on Machine Learning.\n\n[39] Salvatier, J., Wiecki, T. V., and Fonnesbeck, C. (2016). 
Probabilistic programming in Python\nusing PyMC3. PeerJ Computer Science, 2:e55.\n\n[40] Shazeer, N., Cheng, Y., Parmar, N., Tran, D., Vaswani, A., Koanantakool, P., Hawkins, P., Lee,\nH., Hong, M., Young, C., Sepassi, R., and Hechtman, B. (2018). Mesh-tensorflow: Deep learning\nfor supercomputers. In Neural Information Processing Systems.\n\n[41] Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., and Dean, J. (2017).\nOutrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint\narXiv:1701.06538.\n\n[42] Shi, J., Chen, J., Zhu, J., Sun, S., Luo, Y., Gu, Y., and Zhou, Y. (2017). Zhusuan: A library for\nbayesian deep learning. arXiv preprint arXiv:1709.05870.\n\n[43] S\u00f8nderby, C. K., Raiko, T., Maal\u00f8e, L., S\u00f8nderby, S. K., and Winther, O. (2016). Ladder\nvariational autoencoders. In Neural Information Processing Systems.\n\n[44] Spiegelhalter, D. J., Thomas, A., Best, N. G., and Gilks, W. R. (1995). BUGS: Bayesian\ninference using Gibbs sampling, version 0.50. MRC Biostatistics Unit, Cambridge.\n\n[45] Tolpin, D., van de Meent, J.-W., Yang, H., and Wood, F. (2016). Design and implementation\nof probabilistic programming language Anglican. In Proceedings of the 28th Symposium on the\nImplementation and Application of Functional Programming Languages, page 6.\n\n[46] Tomczak, J. M. and Welling, M. (2018). Vae with a vampprior. In Artificial Intelligence and\nStatistics.\n\n[47] Tran, D. and Blei, D. (2018). Implicit causal models for genome-wide association studies. In\nInternational Conference on Learning Representations.\n\n[48] Tran, D., Hoffman, M. D., Saurous, R. A., Brevdo, E., Murphy, K., and Blei, D. M. (2017).\nDeep probabilistic programming. In International Conference on Learning Representations.\n\n[49] Tran, D., Kucukelbir, A., Dieng, A. B., Rudolph, M., Liang, D., and Blei, D. M. 
(2016).\nEdward: A library for probabilistic modeling, inference, and criticism. arXiv preprint\narXiv:1610.09787.\n\n[50] Vaswani, A., Bengio, S., Brevdo, E., Chollet, F., Gomez, A. N., Gouws, S., Jones, L., Kaiser,\nL., Kalchbrenner, N., Parmar, N., Sepassi, R., Shazeer, N., and Uszkoreit, J. (2018).\nTensor2tensor for neural machine translation. CoRR, abs/1803.07416.\n\n[51] Wen, Y., Vicol, P., Ba, J., Tran, D., and Grosse, R. (2018). Flipout: Efficient\npseudo-independent weight perturbations on mini-batches. In International Conference on Learning\nRepresentations.\n\n[52] Zoph, B. and Le, Q. V. (2017). Neural architecture search with reinforcement learning. In\nInternational Conference on Learning Representations.\n", "award": [], "sourceid": 3771, "authors": [{"given_name": "Dustin", "family_name": "Tran", "institution": "Google Brain"}, {"given_name": "Matthew", "family_name": "Hoffman", "institution": "Google"}, {"given_name": "Dave", "family_name": "Moore", "institution": "Google"}, {"given_name": "Christopher", "family_name": "Suter", "institution": "Google, Inc"}, {"given_name": "Srinivas", "family_name": "Vasudevan", "institution": "Google"}, {"given_name": "Alexey", "family_name": "Radul", "institution": "Google"}]}