{"title": "Generative Models for Graph-Based Protein Design", "book": "Advances in Neural Information Processing Systems", "page_first": 15820, "page_last": 15831, "abstract": "Engineered proteins offer the potential to solve many problems in biomedicine, energy, and materials science, but creating designs that succeed is difficult in practice. A significant aspect of this challenge is the complex coupling between protein sequence and 3D structure, with the task of finding a viable design often referred to as the inverse protein folding problem. We develop relational language models for protein sequences that directly condition on a graph specification of the target structure. Our approach efficiently captures the complex dependencies in proteins by focusing on those that are long-range in sequence but local in 3D space. Our framework significantly improves in both speed and robustness over conventional and deep-learning-based methods for structure-based protein sequence design, and takes a step toward rapid and targeted biomolecular design with the aid of deep generative models.", "full_text": "Generative models for graph-based protein design\n\nJohn Ingraham, Vikas K. Garg, Regina Barzilay, Tommi Jaakkola\n\nComputer Science and Arti\ufb01cial Intelligence Lab, MIT\n\n{ingraham, vgarg, regina, tommi}@csail.mit.edu\n\nAbstract\n\nEngineered proteins offer the potential to solve many problems in biomedicine,\nenergy, and materials science, but creating designs that succeed is dif\ufb01cult in\npractice. A signi\ufb01cant aspect of this challenge is the complex coupling between\nprotein sequence and 3D structure, with the task of \ufb01nding a viable design often\nreferred to as the inverse protein folding problem. In this work, we introduce a\nconditional generative model for protein sequences given 3D structures based on\ngraph representations. 
Our approach efficiently captures the complex dependencies in proteins by focusing on those that are long-range in sequence but local in 3D space. This graph-based approach improves in both speed and reliability over conventional and other neural network-based approaches, and takes a step toward rapid and targeted biomolecular design with the aid of deep generative models.

1 Introduction

A central goal for computational protein design is to automate the invention of protein molecules with defined structural and functional properties. This field has seen tremendous progress in the past two decades [1], including the design of novel 3D folds [2], enzymes [3], and complexes [4]. Despite these successes, current approaches are often unreliable, requiring multiple rounds of trial-and-error in which initial designs often fail [5, 6]. Moreover, diagnosing the origin of this unreliability is difficult, as contemporary bottom-up approaches depend both on the accuracy of complex, composite energy functions for protein physics and on the efficiency of sampling algorithms for jointly exploring the protein sequence and structure space.

Here, we explore an alternative, top-down framework for protein design that directly learns a conditional generative model for protein sequences given a specification of the target structure, which is represented as a graph over the residues (amino acids). Specifically, we augment the autoregressive self-attention of recent sequence models [7] with graph-based representations of 3D molecular structure.
By composing multiple layers of this structured self-attention, our model can effectively capture higher-order, interaction-based dependencies between sequence and structure, in contrast to previous parametric approaches [8, 9] that are limited to first-order effects.

A graph-structured sequence model offers several benefits, including favorable computational efficiency, inductive bias, and representational flexibility. We accomplish the first two by leveraging a well-evidenced finding in protein science, namely that long-range dependencies in sequence are generally short-range in 3D space [10-12]. By making the graph and self-attention similarly sparse and localized in 3D space, we achieve computational scaling that is linear in sequence length. Additionally, graph-structured inputs offer representational flexibility, as they accommodate both coarse, 'flexible backbone' (connectivity and topology) and fine-grained (precise atom locations) descriptions of structure.

We demonstrate the merits of our approach via a detailed empirical study. Specifically, we evaluate our model's performance for structural generalization to sequences of protein 3D folds that are topologically distinct from those in the training set. For fixed-backbone sequence design, we find that our model achieves considerably improved statistical performance over a prior neural-network-based model and that it achieves higher accuracy and efficiency than Rosetta fixbb, a state-of-the-art program for protein design.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

The rest of the paper is organized as follows. We first position our contributions with respect to prior work in Section 1.1. We provide details on our methods, including structure representation, in Section 2. We introduce our Structured Transformer model in Section 2.2.
The details of our experiments are laid out in Section 3, and the corresponding results that elucidate the merits of our approach are presented in Section 4.

1.1 Related Work

Generative models for protein sequence and structure A number of works have explored the use of generative models for protein engineering and design [13]. [8, 9, 14] have used neural network-based models for sequences given 3D structure, where the amino acids are modeled independently of one another. [15] introduced a generative model for protein sequences conditioned on a 1D, context-free grammar based specification of the fold topology. Multiple works [16, 17] have modeled the conditional distribution of single amino acids given surrounding structure and sequence context with convolutional neural networks. In contrast to these works, our model captures the joint distribution of the full protein sequence while grounding these dependencies in terms of long-range interactions arising from structure.

In parallel to the development of structure-based models, there has been considerable work on deep generative models for protein sequences in individual protein families [18-21]. While useful, these methods presume the availability of a large number of sequences from a particular family, which are unavailable in the case of designing novel proteins that diverge significantly from natural sequences.

Several groups have obtained promising results using unconditional protein language models [22-25] to learn sequence representations that transfer well to supervised tasks. While serving different purposes, we emphasize that one advantage of conditional generative modeling is to facilitate adaptation to specific (and potentially novel) parts of structure space.
Language models trained on hundreds of millions of evolutionary sequences will still be 'semantically' bottlenecked by the thousands of 3D evolutionary folds that these sequences represent. We propose evaluating protein language models with structure-based splitting of sequence data, and begin to see how unconditional language models may struggle to assign high likelihoods to sequences from out-of-training folds.

In a complementary line of research, several deep and differentiable parameterizations of protein structure [26-29] have recently been proposed that could be used to generate, optimize, or validate 3D structures for input to sequence design.

Protein design and interaction graphs For classical approaches to computational protein design, which are based on joint modeling of structure and sequence, we refer the reader to a review of both methods and accomplishments in [1]. Many of the major 'firsts' in protein design are due to Rosetta [30, 31], a leading framework for protein design. More recently, there have been successes with non-parametric approaches to protein design [32], which are based on finding substructural homologies between the target and diverse templates in a large protein database. In this work, we focus on comparisons to Rosetta (Section 4.2), since it is based on a single parametric energy function for capturing the sequence-structure relationship.

Self-Attention Our model extends the Transformer [33] to capture sparse, pairwise relational information between sequence elements. The dense variation of this problem was explored in [34] and [35]. As noted in those works, incorporating general pairwise information incurs O(N^2) memory (and computational) cost for sequences of length N, which can be highly limiting for training on GPUs. We circumvent this cost by instead restricting the self-attention to the sparsity of the input graph.
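The memory argument above can be made concrete with a minimal sketch, assuming a precomputed neighbor list (illustrative NumPy with made-up shapes and a random stand-in graph, not the paper's implementation): gathering keys and values only for each node's k neighbors keeps the attention tensors at O(N * k) rather than O(N^2).

```python
import numpy as np

def knn_attention(h, neighbors):
    """Single-head self-attention restricted to a fixed k-NN neighbor list.

    h         : (N, d) node embeddings
    neighbors : (N, k) integer indices of each node's k nearest neighbors
    The gathered key/value tensor is (N, k, d), so memory grows as O(N * k)
    instead of the O(N^2) of dense all-pairs attention.
    """
    N, d = h.shape
    kv = h[neighbors]                                   # (N, k, d) gather
    logits = np.einsum('nd,nkd->nk', h, kv) / np.sqrt(d)
    w = np.exp(logits - logits.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)                   # softmax over neighbors
    return np.einsum('nk,nkd->nd', w, kv)

rng = np.random.default_rng(0)
N, d, k = 500, 16, 30
h = rng.standard_normal((N, d))
neighbors = rng.integers(0, N, size=(N, k))  # stand-in for a real spatial k-NN graph
out = knn_attention(h, neighbors)
print(out.shape)  # (500, 16)
```

With k fixed (the paper uses k = 30), doubling the sequence length doubles, rather than quadruples, the attention cost.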
Given this graph-structured self-attention, our model may also reasonably be cast in the framework of message-passing or graph neural networks [36, 37] (Section 4.1). Our approach is similar to Graph Attention Networks [38], but augmented with edge features and an autoregressive decoder.

Figure 1: A graph-based, autoregressive model for protein sequences given 3D structures. (A) We cast protein design as language modeling conditioned on an input graph. In our architecture, an encoder develops a sequence-independent representation of 3D structure via multi-head self-attention [7] on the spatial k-nearest neighbors graph. A decoder then autoregressively generates each amino acid si given the full structure and previously decoded amino acids. (B) Each layer of the encoder and decoder contains a step of neighborhood aggregation (self-attention) and of local information processing (position-wise feedforward).

2 Methods

In this work, we introduce a Structured Transformer model that draws inspiration from the self-attention based Transformer model [7] and is augmented for scalable incorporation of relational information (Figure 1). While general relational attention incurs quadratic memory and computation costs, we avert these by restricting the attention for each node i to the set N(i, k) of its k nearest neighbors in 3D space. Since our architecture is multilayered, iterated local attention can derive progressively more global estimates of context for each node i. Additionally, unlike the standard Transformer, we include edge features to embed the spatial and positional dependencies in deriving the attention. Thus, our model generalizes the Transformer to spatially structured settings.

2.1 Representing structure as a graph

We represent protein structure in terms of an attributed graph G = (V, E) with node features V = {v1, . . .
, vN} describing each residue (an amino acid; these are the letters that compose a protein sequence) and edge features E = {eij}i≠j capturing relationships between them. This formulation can accommodate different variations on the macromolecular design problem, including both 'rigid backbone' design, where the precise coordinates of backbone atoms are fixed, and 'flexible backbone' design, where softer constraints such as blueprints of hydrogen-bonding connectivity [5] or 1D architectures [15] could define the structure of interest.

3D considerations For a rigid-body design problem, the structure for conditioning is a fixed set of backbone coordinates X = {xi ∈ R3 : 1 ≤ i ≤ N}, where N is the number of positions.1 We desire a graph representation of the coordinates G(X) that has two properties:

• Invariance. The features are invariant to rotations and translations.
• Locally informative. The edge features incident to vi due to its neighbors N(i, k), i.e. {eij}j∈N(i,k), contain sufficient information to reconstruct all adjacent coordinates {xj}j∈N(i,k) up to rigid-body motion.

While invariance is motivated by standard symmetry considerations, the second property is motivated by limitations of current graph neural networks [36]. In these networks, updates to node features

1 Here we consider a single representative coordinate per position when deriving edge features, but may revisit multiple atom types per position for features such as backbone angles or hydrogen bonds.

Figure 2: Spatial features capture structural relationships across diverse folds.
(A) The edge features of our most detailed protein graph representation capture the relative distance, direction, and orientation between two positions on the backbone. For scalability, all computation after an initially dense Euclidean distance calculation (right, top), such as relative directions (right, bottom) and neural steps, can be restricted to the k-nearest neighbors graph. (B) Example of topological variation in the dataset. Protein chains in train, test, and validation are split by the sub-chain CATH [40] topologies, which means that folds in each set will have distinct patterns of spatial connectivity.

vi depend only on the edge and node features adjacent to vi. However, these features are typically insufficient to reconstruct the relative neighborhood positions {xj}j∈N(i,k), so individual updates cannot fully depend on the 'local environment'. For example, when reasoning about the neighborhood around coordinate xi, the pairwise distances Dia and Dib will be insufficient to determine whether xa and xb are on the same or opposite sides.

Relative spatial encodings We develop invariant and locally informative features by first augmenting the points xi with 'orientations' Oi that define a local coordinate system at each point (Figure 2). We define these in terms of the backbone geometry as

Oi = [bi  ni  bi × ni],

where bi is the negative bisector of the angle between the rays (xi−1 − xi) and (xi+1 − xi), and ni is a unit vector normal to that plane. Formally, we have

ui = (xi − xi−1) / ||xi − xi−1||,   bi = (ui − ui+1) / ||ui − ui+1||,   ni = (ui × ui+1) / ||ui × ui+1||.

Finally, we derive the spatial edge features e(s)ij from the rigid-body transformation that relates reference frame (xi, Oi) to reference frame (xj, Oj).
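The frame construction above can be sketched directly from the formulas (illustrative NumPy with made-up coordinates for one interior residue; not code from the paper):

```python
import numpy as np

def local_frame(x_prev, x_i, x_next):
    """Build O_i = [b_i, n_i, b_i x n_i] from three consecutive backbone coordinates."""
    unit = lambda v: v / np.linalg.norm(v)
    u_i = unit(x_i - x_prev)          # u_i     = (x_i - x_{i-1}) / ||.||
    u_next = unit(x_next - x_i)       # u_{i+1} = (x_{i+1} - x_i) / ||.||
    b = unit(u_i - u_next)            # negative bisector of the backbone angle
    n = unit(np.cross(u_i, u_next))   # unit normal to the (u_i, u_{i+1}) plane
    return np.stack([b, n, np.cross(b, n)], axis=1)   # columns: b_i, n_i, b_i x n_i

O = local_frame(np.array([0.0, 0.0, 0.0]),
                np.array([1.5, 0.0, 0.0]),
                np.array([2.0, 1.2, 0.0]))
print(np.allclose(O.T @ O, np.eye(3)))  # True: columns form an orthonormal frame
```

Because each O_i is orthonormal, expressing neighbor geometry in this frame (below) discards global pose while keeping the local arrangement.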
While this transformation has 6 degrees of freedom, we decompose it into features for distance, direction, and orientation as

e(s)ij = ( r(||xj − xi||),  Oi^T (xj − xi) / ||xj − xi||,  q(Oi^T Oj) ).

Here the first vector is a distance encoding r(·) lifted into a radial basis,2 the second vector is a direction encoding that corresponds to the relative direction of xj in the reference frame of (xi, Oi), and the third vector is an orientation encoding q(·) of the quaternion representation of the spatial rotation matrix Oi^T Oj. Quaternions represent 3D rotations as four-element vectors that can be efficiently and reasonably compared by inner products [39].3

2 We used 16 Gaussian RBFs isotropically spaced from 0 to 20 Angstroms.
3 We represent quaternions in terms of their vector of real coefficients.

Relative positional encodings As in the original Transformer, we also represent distances between residues in the sequence (rather than in space) with positional embeddings e(p)ij. Specifically, we need to represent the positioning of each neighbor j relative to the node under consideration i. Therefore, we obtain the position embedding as a sinusoidal function of the gap i − j. We retain the sign of i − j because protein sequences are generally asymmetric. These relative encodings contrast with the absolute encodings of the original Transformer, but are consistent with the modifications described in subsequent work [34].

Node and edge features Finally, we obtain an aggregate edge encoding vector eij by concatenating the structural encodings e(s)ij with the positional encodings e(p)ij and then linearly transforming them to have the same dimension as the model.
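The three spatial encodings can be sketched as follows (illustrative NumPy; the RBF spacing follows footnote 2, while `rotation_to_quaternion` is a standard matrix-to-quaternion conversion, not code from the paper):

```python
import numpy as np

def rbf(d, num=16, d_min=0.0, d_max=20.0):
    """Lift a scalar distance (Angstroms) into Gaussian radial basis functions."""
    centers = np.linspace(d_min, d_max, num)
    sigma = (d_max - d_min) / num
    return np.exp(-((d - centers) / sigma) ** 2)

def rotation_to_quaternion(R):
    """Unit quaternion (w, x, y, z) of a rotation matrix, w >= 0 branch."""
    w = np.sqrt(max(0.0, 1 + R[0, 0] + R[1, 1] + R[2, 2])) / 2
    x = np.copysign(np.sqrt(max(0.0, 1 + R[0, 0] - R[1, 1] - R[2, 2])) / 2, R[2, 1] - R[1, 2])
    y = np.copysign(np.sqrt(max(0.0, 1 - R[0, 0] + R[1, 1] - R[2, 2])) / 2, R[0, 2] - R[2, 0])
    z = np.copysign(np.sqrt(max(0.0, 1 - R[0, 0] - R[1, 1] + R[2, 2])) / 2, R[1, 0] - R[0, 1])
    return np.array([w, x, y, z])

def spatial_edge_features(x_i, O_i, x_j, O_j):
    """Concatenate distance RBF, direction in frame i, and relative orientation."""
    v = x_j - x_i
    direction = O_i.T @ (v / np.linalg.norm(v))   # x_j as seen from frame i
    quat = rotation_to_quaternion(O_i.T @ O_j)    # relative rotation between frames
    return np.concatenate([rbf(np.linalg.norm(v)), direction, quat])

e_ij = spatial_edge_features(np.zeros(3), np.eye(3),
                             np.array([3.0, 4.0, 0.0]), np.eye(3))
print(e_ij.shape)  # (23,) = 16 RBF + 3 direction + 4 quaternion
```

Each piece is invariant to a global rotation and translation applied to both residues, which is exactly the invariance property required of G(X).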
We only include edges in the k-nearest neighbors graph of X, with k = 30 for all experiments. This k is generous, as typical definitions of residue-residue contacts in proteins will result in fewer than 20 contacts per residue. For node features, we compute the three dihedral angles of the protein backbone (φi, ψi, ωi) and embed these on the 3-torus as {sin, cos} × (φi, ψi, ωi).

Flexible backbone features We also consider 'flexible backbone' descriptions of 3D structure based on topological binary edge features and coarse backbone geometry. We combine the relative positional encodings with two binary edge features: contacts, which indicate when the distance between Cα residues at positions i and j is less than 8 Angstroms, and hydrogen bonds, which are directed and defined by the electrostatic model of DSSP [41]. For coarse node features, we compute virtual dihedral angles and bond angles between backbone Cα residues, interpret them as spherical coordinates, and represent them as points on the unit sphere.

2.2 Structured Transformer

Autoregressive decomposition We decompose the distribution of a protein sequence s given a 3D structure x as

p(s|x) = ∏i p(si | x, s<i),

where, in the decoder, position i attends to relational information that concatenates the edge features eij with (h(dec)j, g(sj)) for preceding positions j < i, and with (h(enc)j, 0) for positions i ≤ j. Here h(dec)j is the embedding of node j in the current layer of the decoder, h(enc)j is the embedding of node j in the final layer of the encoder, and g(sj) is a sequence embedding of amino acid sj at node j. This concatenation and masking structure ensures that sequence information only flows to position i from positions j < i, but still allows position i to attend to subsequent structural information, unlike the standard Transformer decoder.

We now demonstrate the merits of our approach via a detailed empirical analysis.
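Before turning to the experiments, the autoregressive factorization above can be illustrated with a left-to-right sampling loop. This is a toy sketch: `conditional` is a hypothetical stand-in for the trained decoder, and the temperature adjustment anticipates the biased sampling described in Section 4.2.

```python
import numpy as np

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def sample_sequence(conditional, length, T=0.1, seed=0):
    """Draw s_i from p(s_i | x, s_<i)^(1/T), renormalized, left to right.

    `conditional(prefix)` stands in for the structure-conditioned decoder:
    it must return a length-20 probability vector for the next amino acid.
    """
    rng = np.random.default_rng(seed)
    seq = []
    for _ in range(length):
        p = conditional(seq) ** (1.0 / T)  # temperature T < 1 sharpens the conditional
        p /= p.sum()
        seq.append(ALPHABET[rng.choice(len(ALPHABET), p=p)])
    return "".join(seq)

uniform = lambda prefix: np.full(20, 1 / 20)  # placeholder conditional
designed = sample_sequence(uniform, length=50)
print(len(designed))  # 50
```

Because each s_i is drawn conditioned on all of x but only the previously decoded s_<i, the loop respects the masking structure of the decoder.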
We begin with the experimental setup, including our architecture and a description of the data used in our experiments.

3 Training

Architecture In all experiments, we used three layers of self-attention and position-wise feedforward modules for the encoder and decoder, with a hidden dimension of 128.

Optimization We trained models using the learning rate schedule and initialization of the original Transformer paper [7], a dropout rate of 10% [42], a label smoothing rate of 10%, and early stopping based on validation perplexity. The unconditional language models did not include dropout or label smoothing.

Dataset To evaluate the ability of our models to generalize across different protein folds, we collected a dataset based on the CATH hierarchical classification of protein structure [40]. For all domains in the CATH 4.2 40% non-redundant set of proteins, we obtained full chains up to length 500 and then randomly assigned their CATH topology classifications (CAT codes) to train, validation, and test sets at a targeted 80/10/10 split. Since each chain can contain multiple CAT codes, we first removed any redundant entries from train and then from validation. Finally, we removed any chains from the test set that had CAT overlap with train, and removed chains from the validation set with CAT overlap with train or test. This resulted in a dataset of 18024 chains in the training set, 608 chains in the validation set, and 1120 chains in the test set. There is zero CAT overlap between these sets.

4 Results

A challenge in evaluating computational protein design methods is the degeneracy of the relationship between protein structure and sequence. Many protein sequences may reasonably design the same 3D structure [43], meaning that sequence similarity need not be high. At the same time, single mutations may cause a protein to break or misfold, meaning that high sequence similarity isn't sufficient for a correct design. To deal with this, we focus on three kinds of evaluation: (i) likelihood-based, where we test the ability of the generative model to assign high likelihood to held-out sequences; (ii) native sequence recovery, where we evaluate generated sequences against the native sequences of templates; and (iii) experimental comparison, where we compare the likelihoods of the model to high-throughput data from a de novo protein design experiment.

We find that our model is able to attain considerably improved statistical performance in its likelihoods while simultaneously providing more accurate and efficient sequence recovery.

Table 2: Per-residue perplexities for protein language modeling (lower is better). The protein chains have been cluster-split by CATH topology, such that the test set includes only unseen 3D folds. While a structure-conditioned language model can generalize in this structure-split setting, unconditional language models struggle.

  Test set                          Short   Single chain     All
  Structure-conditioned models
    Structured Transformer (ours)    8.54       9.03        6.85
    SPIN2 [8]                       12.11      12.61          -
  Language models
    LSTM (h = 128)                  16.06      16.38       17.13
    LSTM (h = 256)                  16.08      16.37       17.12
    LSTM (h = 512)                  15.98      16.38       17.13
  Test set size                        94        103        1120

4.1 Statistical comparison to likelihood-based models

Protein perplexities What kind of perplexities might be useful? To provide context, we first present perplexities for some simple models of protein sequences in Table 1. The amino acid alphabet and its natural frequencies upper-bound perplexity at 20 and ∼17.8, respectively. Random protein sequences under these null models are unlikely to be functional without further selection [44]. First-order profiles of protein sequences such as those from the Pfam database [45], however, are widely used for protein engineering.
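Per-residue perplexity is the exponential of the average negative log-likelihood that a model assigns to each amino acid. A quick sketch of the uniform null bound (the ∼17.8 natural-frequency bound would additionally require the empirical amino acid frequency table, which we omit):

```python
import math

def perplexity(log_probs):
    """Per-residue perplexity: exp of the mean negative log-likelihood."""
    return math.exp(-sum(log_probs) / len(log_probs))

# A uniform model over the 20-letter amino acid alphabet hits the stated bound of 20.
uniform_scores = [math.log(1 / 20)] * 100
print(round(perplexity(uniform_scores), 6))  # 20.0
```

On this scale, the gap between ∼16-17 (unconditional language models) and ∼7 (structure-conditioned) corresponds to a large reduction in per-residue uncertainty.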
We found the average perplexity per letter of profiles in Pfam 32 (ignoring alignment uncertainty) to be ∼11.6. This suggests that even models with perplexities as high as ∼11 have the potential to be useful for the space of functional protein sequences.

The importance of structure We found that there was a significant gap between unconditional language models of protein sequences and models conditioned on structure. Remarkably, for a range of structure-independent language models, the typical test perplexities are ∼16-17 (Table 2), which were barely better than the null letter frequencies (Table 1). We emphasize that the RNNs were not broken and could still learn the training set in these capacity ranges. All structure-based models had (unsurprisingly) considerably lower perplexities. In particular, our Structured Transformer model attained a perplexity of ∼7 on the full test set. It seems that protein language models trained on one subset of 3D folds (in our cluster-splitting procedure) generalize poorly to predict the sequences of unseen folds.

Table 3: Ablation of graph features and model components. Test perplexities (lower is better).

  Edge features               Node features   Aggregation   Short   Single chain    All
  Rigid backbone
    Distances, Orientations   Dihedrals       Attention      8.54       9.03       6.85
    Distances, Orientations   Dihedrals       PairMLP        8.33       8.86       6.55
    Distances, Orientations   Cα angles       Attention      9.16       9.37       7.83
    Distances                 Dihedrals       Attention      9.11       9.63       7.87
  Flexible backbone
    Contacts, Hydrogen bonds  Cα angles       Attention     11.71      11.81      11.51

Table 4: Improved reliability and speed compared to Rosetta. (a) On the 'single chain' test set, our model more accurately recovers native sequences than Rosetta fixbb with greater speed (CPU: single core of Intel Xeon Gold 5115; GPU: NVIDIA RTX 2080). This set includes NMR-based structures for which Rosetta is known to not be robust [46]. (b) Our model also performs favorably on a prior benchmark of 40 proteins. All results are reported as the median of averages over 100 designs.

(a) Single chain test set (103 proteins)
  Method               Recovery (%)   Speed (AA/s) CPU   Speed (AA/s) GPU
  Rosetta 3.10 fixbb       17.9         4.88 × 10^−1           N/A
  Ours (T = 0.1)           27.6         2.22 × 10^2        1.04 × 10^4

(b) Ollikainen benchmark (40 proteins)
  Method               Recovery (%)
  Rosetta, fixbb 1         33.1
  Rosetta, fixbb 2         38.4
  Ours (T = 0.1)           39.2
We believe this possibility might be important to consider when training protein language models for protein engineering and design.

Improvement over deep profile-based methods We also compared to a recent method, SPIN2, which uses deep neural networks to predict protein sequence profiles given protein structures [8]. Since SPIN2 is computationally intensive (minutes per protein for small proteins) and was trained on complete proteins rather than chains, we evaluated it on two subsets of the full test set: a 'Short' subset containing chains up to length 100 and a 'Single chain' subset containing only those models where the single chain accounted for the entire protein record in the Protein Data Bank. Both subsets discarded any chains with structural gaps (chain breaks). We found that our Structured Transformer model significantly improved upon the perplexities of SPIN2 (Table 2).

Graph representations and attention mechanisms The graph-based formulation of protein design can accommodate very different formulations of the problem, depending on how structure is represented by a graph. We tested different approaches for representing the protein, including both more 'rigid' design with precise geometric details and 'flexible' topological design based on spatial contacts and hydrogen bonding (Table 3). For the best perplexities, we found that using local orientation information was indeed important above simple distance measures.
At the same time, even the topological features were sufficient to obtain better perplexities than SPIN2 (Table 2), which uses precise atomic details.

In addition to varying the graph features, we also experimented with an alternative aggregation function from message passing neural networks [36]. We found that a simple aggregation function

∆hi = Σj MLP(hi, hj, eij)

led to the best performance of all models, where MLP(·) is a two-layer perceptron that preserves the hidden dimension of the model. We speculate that this is due to potential overfitting by the attention mechanism. Although this suggests room for future improvements, we use multi-head self-attention throughout the remaining experiments.

4.2 Benchmarking protein redesign

Decoding strategies Generating protein sequence designs requires a sampling scheme for drawing high-likelihood sequences from the model. While beam search or top-k sampling [47] are commonly used heuristics for decoding, we found that simple biased sampling from the temperature-adjusted distributions p(T)(s|x) ∝ ∏i p(si|x, s<i)^(1/T)