{"title": "Pointer Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 2692, "page_last": 2700, "abstract": "We introduce a new neural architecture to learn the conditional probability of an output sequence with elements that are discrete tokens corresponding to positions in an input sequence. Such problems cannot be trivially addressed by existent approaches such as sequence-to-sequence and Neural Turing Machines, because the number of target classes in each step of the output depends on the length of the input, which is variable. Problems such as sorting variable sized sequences, and various combinatorial optimization problems belong to this class. Our model solves the problem of variable size output dictionaries using a recently proposed mechanism of neural attention. It differs from the previous attention attempts in that, instead of using attention to blend hidden units of an encoder to a context vector at each decoder step, it uses attention as a pointer to select a member of the input sequence as the output. We call this architecture a Pointer Net (Ptr-Net). We show Ptr-Nets can be used to learn approximate solutions to three challenging geometric problems -- finding planar convex hulls, computing Delaunay triangulations, and the planar Travelling Salesman Problem -- using training examples alone. Ptr-Nets not only improve over sequence-to-sequence with input attention, but also allow us to generalize to variable size output dictionaries. We show that the learnt models generalize beyond the maximum lengths they were trained on. 
We hope our results on these tasks will encourage a broader exploration of neural learning for discrete problems.", "full_text": "Pointer Networks\n\nOriol Vinyals\u2217\nGoogle Brain\n\nMeire Fortunato\u2217\n\nDepartment of Mathematics, UC Berkeley\n\nNavdeep Jaitly\nGoogle Brain\n\nAbstract\n\nWe introduce a new neural architecture to learn the conditional probability of an\noutput sequence with elements that are discrete tokens corresponding to positions\nin an input sequence. Such problems cannot be trivially addressed by existent ap-\nproaches such as sequence-to-sequence [1] and Neural Turing Machines [2], be-\ncause the number of target classes in each step of the output depends on the length\nof the input, which is variable. Problems such as sorting variable sized sequences,\nand various combinatorial optimization problems belong to this class. Our model\nsolves the problem of variable size output dictionaries using a recently proposed\nmechanism of neural attention. It differs from the previous attention attempts in\nthat, instead of using attention to blend hidden units of an encoder to a context\nvector at each decoder step, it uses attention as a pointer to select a member of\nthe input sequence as the output. We call this architecture a Pointer Net (Ptr-Net).\nWe show Ptr-Nets can be used to learn approximate solutions to three challenging\ngeometric problems \u2013 \ufb01nding planar convex hulls, computing Delaunay triangu-\nlations, and the planar Travelling Salesman Problem \u2013 using training examples\nalone. Ptr-Nets not only improve over sequence-to-sequence with input attention,\nbut also allow us to generalize to variable size output dictionaries. 
We show that\nthe learnt models generalize beyond the maximum lengths they were trained on.\nWe hope our results on these tasks will encourage a broader exploration of neural\nlearning for discrete problems.\n\n1\n\nIntroduction\n\nRecurrent Neural Networks (RNNs) have been used for learning functions over sequences from\nexamples for more than three decades [3]. However, their architecture limited them to settings\nwhere the inputs and outputs were available at a \ufb01xed frame rate (e.g. [4]). The recently introduced\nsequence-to-sequence paradigm [1] removed these constraints by using one RNN to map an input\nsequence to an embedding and another (possibly the same) RNN to map the embedding to an output\nsequence. Bahdanau et al. augmented the decoder by propagating extra contextual information\nfrom the input using a content-based attentional mechanism [5, 2, 6, 7]. These developments have\nmade it possible to apply RNNs to new domains, achieving state-of-the-art results in core problems\nin natural language processing such as translation [1, 5] and parsing [8], image and video captioning\n[9, 10], and even learning to execute small programs [2, 11].\nNonetheless, these methods still require the size of the output dictionary to be \ufb01xed a priori. Because\nof this constraint we cannot directly apply this framework to combinatorial problems where the size\nof the output dictionary depends on the length of the input sequence. In this paper, we address this\nlimitation by repurposing the attention mechanism of [5] to create pointers to input elements. 
We\nshow that the resulting architecture, which we name Pointer Networks (Ptr-Nets), can be trained to\noutput satisfactory solutions to three combinatorial optimization problems \u2013 computing planar con-\nvex hulls, Delaunay triangulations and the symmetric planar Travelling Salesman Problem (TSP).\nThe resulting models produce approximate solutions to these problems in a purely data driven fash-\nion (i.e., when we only have examples of inputs and desired outputs). The proposed approach is\ndepicted in Figure 1.\n\n\u2217Equal contribution\n\n1\n\n\f(a) Sequence-to-Sequence\n\n(b) Ptr-Net\n\nFigure 1: (a) Sequence-to-Sequence - An RNN (blue) processes the input sequence to create a code\nvector that is used to generate the output sequence (purple) using the probability chain rule and\nanother RNN. The output dimensionality is \ufb01xed by the dimensionality of the problem and it is the\nsame during training and inference [1]. (b) Ptr-Net - An encoding RNN converts the input sequence\nto a code (blue) that is fed to the generating network (purple). At each step, the generating network\nproduces a vector that modulates a content-based attention mechanism over inputs ([5, 2]). The\noutput of the attention mechanism is a softmax distribution with dictionary size equal to the length\nof the input.\nThe main contributions of our work are as follows:\n\n\u2022 We propose a new architecture, that we call Pointer Net, which is simple and effective. It\ndeals with the fundamental problem of representing variable length dictionaries by using a\nsoftmax probability distribution as a \u201cpointer\u201d.\n\u2022 We apply the Pointer Net model to three distinct non-trivial algorithmic problems involving\ngeometry. 
We show that the learned model generalizes to test problems with more points\nthan the training problems.\n\u2022 Our Pointer Net model learns a competitive small scale (n \u2264 50) TSP approximate solver.\nOur results demonstrate that a purely data driven approach can learn approximate solutions\nto problems that are computationally intractable.\n\n2 Models\n\nWe review the sequence-to-sequence [1] and input-attention models [5] that are the baselines for this\nwork in Sections 2.1 and 2.2. We then describe our model - Ptr-Net in Section 2.3.\n\n2.1 Sequence-to-Sequence Model\nGiven a training pair, (P,CP ), the sequence-to-sequence model computes the conditional probabil-\nity p(CP|P; \u03b8) using a parametric model (an RNN with parameters \u03b8) to estimate the terms of the\nprobability chain rule (also see Figure 1), i.e.\n\np(CP|P; \u03b8) = \u220f_{i=1}^{m(P)} p(Ci|C1, . . . , Ci\u22121,P; \u03b8).\n\n(1)\n\n2\n\n\fHere P = {P1, . . . , Pn} is a sequence of n vectors and CP = {C1, . . . , Cm(P)} is a sequence of\nm(P) indices, each between 1 and n (we note that the target sequence length m(P) is, in general, a\nfunction of P).\nThe parameters of the model are learnt by maximizing the conditional probabilities for the training\nset, i.e.\n\n\u03b8\u2217 = arg max_\u03b8 \u2211_{P,CP} log p(CP|P; \u03b8),\n\n(2)\n\nwhere the sum is over training examples.\nAs in [1], we use a Long Short-Term Memory (LSTM) [12] to model p(Ci|C1, . . . , Ci\u22121,P; \u03b8).\nThe RNN is fed Pi at each time step, i, until the end of the input sequence is reached, at which time\na special symbol, \u21d2 is input to the model. The model then switches to the generation mode until\nthe network encounters the special symbol \u21d0, which represents termination of the output sequence.\nNote that this model makes no statistical independence assumptions. 
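As a concrete illustration of the chain rule in Equation (1), the probability of a target index sequence decomposes into a product of per-step softmax terms, so its log-probability is a simple sum; the sketch below (with invented per-step distributions, not taken from the paper) shows the quantity that the training objective of Equation (2) sums over the training set.

```python
import numpy as np

# Minimal sketch of Equation (1): the decoder emits one softmax
# distribution per output step, and the log-probability of a target index
# sequence is the sum of the per-step log-probabilities (chain rule).
# The distributions below are invented for illustration only.
def sequence_log_prob(step_dists, targets):
    """log p(CP | P) = sum_i log p(C_i | C_1..C_{i-1}, P)."""
    return sum(np.log(dist[t]) for dist, t in zip(step_dists, targets))

dists = [np.array([0.7, 0.1, 0.1, 0.1]),    # p(C_1 | P)
         np.array([0.1, 0.8, 0.05, 0.05]),  # p(C_2 | C_1, P)
         np.array([0.25, 0.25, 0.25, 0.25])]
ll = sequence_log_prob(dists, targets=[0, 1, 2])
```

Training, as in Equation (2), then maximizes the sum of such terms over all training pairs, e.g. by stochastic gradient ascent.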
We use two separate RNNs\n(one to encode the sequence of vectors Pj, and another one to produce or decode the output symbols\nCi). We call the former RNN the encoder and the latter the decoder or the generative RNN.\nDuring inference, given a sequence P, the learnt parameters \u03b8\u2217 are used to select the sequence\n\u02c6CP with the highest probability, i.e., \u02c6CP = arg max_{CP} p(CP|P; \u03b8\u2217). Finding the optimal sequence \u02c6C\nis computationally impractical because of the combinatorial number of possible output sequences.\nInstead we use a beam search procedure to \ufb01nd the best possible sequence given a beam size.\nIn this sequence-to-sequence model, the output dictionary size for all symbols Ci is \ufb01xed and equal\nto n, since the outputs are chosen from the input. Thus, we need to train a separate model for each\nn. This prevents us from learning solutions to problems that have an output dictionary with a size\nthat depends on the input sequence length.\nUnder the assumption that the number of outputs is O(n) this model has computational complexity\nof O(n). However, exact algorithms for the problems we are dealing with are more costly. For exam-\nple, the convex hull problem has complexity O(n log n). The attention mechanism (see Section 2.2)\nadds more \u201ccomputational capacity\u201d to this model.\n\n2.2 Content Based Input Attention\nThe vanilla sequence-to-sequence model produces the entire output sequence CP using the \ufb01xed\ndimensional state of the recognition RNN at the end of the input sequence P. This constrains\nthe amount of information and computation that can \ufb02ow through to the generative model. The\nattention model of [5] ameliorates this problem by augmenting the encoder and decoder RNNs with\nan additional neural network that uses an attention mechanism over the entire sequence of encoder\nRNN states.\nFor notation purposes, let us de\ufb01ne the encoder and decoder hidden states as (e1, . . . 
, en) and\n(d1, . . . , dm(P)), respectively. For the LSTM RNNs, we use the state after the output gate has\nbeen component-wise multiplied by the cell activations. We compute the attention vector at each\noutput time i as follows:\n\nu^i_j = v^T tanh(W1 e_j + W2 d_i),   j \u2208 (1, . . . , n)\na^i_j = softmax(u^i_j),   j \u2208 (1, . . . , n)\nd\u2032_i = \u2211_{j=1}^{n} a^i_j e_j\n\n(3)\n\nwhere softmax normalizes the vector u^i (of length n) to be the \u201cattention\u201d mask over the inputs,\nand v, W1, and W2 are learnable parameters of the model. In all our experiments, we use the same\nhidden dimensionality at the encoder and decoder (typically 512), so v is a vector and W1 and W2\nare square matrices. Lastly, d\u2032_i and di are concatenated and used as the hidden states from which we\nmake predictions and which we feed to the next time step in the recurrent model.\nNote that for each output we have to perform n operations, so the computational complexity at\ninference time becomes O(n2).\n\n3\n\n\fThis model performs signi\ufb01cantly better than the sequence-to-sequence model on the convex hull\nproblem, but it is not applicable to problems where the output dictionary size depends on the input.\nNevertheless, a very simple extension (or rather reduction) of the model allows us to do this easily.\n\n2.3 Ptr-Net\n\nWe now describe a very simple modi\ufb01cation of the attention model that allows us to apply the\nmethod to solve combinatorial optimization problems where the output dictionary size depends on\nthe number of elements in the input sequence.\nThe sequence-to-sequence model of Section 2.1 uses a softmax distribution over a \ufb01xed sized output\ndictionary to compute p(Ci|C1, . . . , Ci\u22121,P) in Equation 1. Thus it cannot be used for our problems\nwhere the size of the output dictionary is equal to the length of the input sequence. To solve this\nproblem we model p(Ci|C1, . . . 
, Ci\u22121,P) using the attention mechanism of Equation 3 as follows:\n\nu^i_j = v^T tanh(W1 e_j + W2 d_i),   j \u2208 (1, . . . , n)\np(Ci|C1, . . . , Ci\u22121,P) = softmax(u^i)\n\nwhere softmax normalizes the vector u^i (of length n) to be an output distribution over the dictionary\nof inputs, and v, W1, and W2 are learnable parameters of the output model. Here, we do not blend\nthe encoder state ej to propagate extra information to the decoder, but instead, use u^i_j as pointers\nto the input elements. In a similar way, to condition on Ci\u22121 as in Equation 1, we simply copy\nthe corresponding PCi\u22121 as the input. Both our method and the attention model can be seen as an\napplication of content-based attention mechanisms proposed in [6, 5, 2, 7].\nWe also note that our approach speci\ufb01cally targets problems whose outputs are discrete and corre-\nspond to positions in the input. Such problems may be addressed arti\ufb01cially \u2013 for example we could\nlearn to output the coordinates of the target point directly using an RNN. However, at inference,\nthis solution does not respect the constraint that the outputs map back to the inputs exactly. With-\nout the constraints, the predictions are bound to become blurry over longer sequences as shown in\nsequence-to-sequence models for videos [13].\n\n3 Motivation and Datasets Structure\n\nIn the following sections, we review each of the three problems we considered, as well as our data\ngeneration protocol.1\nIn the training data, the inputs are planar point sets P = {P1, . . . , Pn} with n elements each, where\nPj = (xj, yj) are the cartesian coordinates of the points over which we \ufb01nd the convex hull, the De-\nlaunay triangulation or the solution to the corresponding Travelling Salesman Problem. In all cases,\nwe sample from a uniform distribution in [0, 1] \u00d7 [0, 1]. The outputs CP = {C1, . . . , Cm(P)} are\nsequences representing the solution associated to the point set P. 
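A hedged sketch of this data-generation protocol for the convex-hull case follows; Andrew's monotone chain is used here as one standard exact O(n log n) hull algorithm (an assumption for illustration: the paper does not specify which exact algorithm produced its datasets).

```python
import random

def cross(o, a, b):
    """Cross product of vectors o->a and o->b (positive for a left turn)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull_indices(points):
    """1-based indices of the convex hull in counter-clockwise order
    (Andrew's monotone chain)."""
    order = sorted(range(len(points)), key=lambda i: points[i])
    def half(idx_iter):
        out = []
        for i in idx_iter:
            # Pop while the last two kept points and point i do not turn left.
            while len(out) >= 2 and cross(points[out[-2]], points[out[-1]], points[i]) <= 0:
                out.pop()
            out.append(i)
        return out
    lower = half(order)
    upper = half(reversed(order))
    hull = lower[:-1] + upper[:-1]      # CCW, endpoints not repeated
    return [i + 1 for i in hull]

# One training pair (P, CP): n points sampled uniformly from [0,1] x [0,1].
n = 10
P = [(random.random(), random.random()) for _ in range(n)]
C = convex_hull_indices(P)
```

The beginning/end-of-sequence tokens and the paper's specific starting-point convention (Section 3.1) would be added on top of these raw indices.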
In Figure 2, we \ufb01nd an illustration\nof an input/output pair (P,CP ) for the convex hull and the Delaunay problems.\n\n3.1 Convex Hull\n\nWe used this example as a baseline to develop our models and to understand the dif\ufb01culty of solving\ncombinatorial problems with data driven approaches. Finding the convex hull of a \ufb01nite number\nof points is a well understood task in computational geometry, and there are several exact solutions\navailable (see [14, 15, 16]).\nIn general, \ufb01nding the (generally unique) solution has complexity\nO(n log n), where n is the number of points considered.\nThe vectors Pj are uniformly sampled from [0, 1] \u00d7 [0, 1]. The elements Ci are indices between 1\nand n corresponding to positions in the sequence P, or special tokens representing beginning or end\nof sequence. See Figure 2 (a) for an illustration. To represent the output as a sequence, we start\nfrom the point with the lowest index, and go counter-clockwise \u2013 this is an arbitrary choice but helps\nreduce ambiguities during training.\n\n1We will release all the datasets at hidden for reference.\n\n4\n\n\f(a) Input P = {P1, . . . , P10}, and the output se-\nquence CP = {\u21d2, 2, 4, 3, 5, 6, 7, 2,\u21d0} represent-\ning its convex hull.\n\n(b) Input P = {P1, . . . , P5}, and the output CP =\n{\u21d2, (1, 2, 4), (1, 4, 5), (1, 3, 5), (1, 2, 3),\u21d0} repre-\nsenting its Delaunay Triangulation.\n\nFigure 2: Input/output representation for (a) convex hull and (b) Delaunay triangulation. The tokens\n\u21d2 and \u21d0 represent beginning and end of sequence, respectively.\n\n3.2 Delaunay Triangulation\nA Delaunay triangulation for a set P of points in a plane is a triangulation such that each circumcircle\nof every triangle is empty, that is, there is no point from P in its interior. Exact O(n log n) solutions\nare available [17], where n is the number of points in P.\nIn this example, the outputs CP = {C1, . . . 
, Cm(P)} are the corresponding sequences representing\nthe triangulation of the point set P. Each Ci is a triple of integers from 1 to n corresponding to the\nposition of triangle vertices in P or the beginning/end of sequence tokens. See Figure 2 (b).\nWe note that any permutation of the sequence CP represents the same triangulation for P; addi-\ntionally, each triangle representation Ci of three integers can also be permuted. Without loss of\ngenerality, and similarly to what we did for convex hulls at training time, we order the triangles Ci\nby their incenter coordinates (lexicographic order) and choose the increasing triangle representa-\ntion2. Without ordering, the models learned were not as good, and \ufb01nding a better ordering that the\nPtr-Net could better exploit is part of future work.\n\n3.3 Travelling Salesman Problem (TSP)\n\nTSP arises in many areas of theoretical computer science and is an important problem with appli-\ncations in microchip design and DNA sequencing. In our work we focused on the planar symmetric TSP: given\na list of cities, we wish to \ufb01nd the shortest possible route that visits each city exactly once and\nreturns to the starting point. Additionally, we assume the distance between two cities is the same\nin each opposite direction. This is an NP-hard problem which allows us to test the capabilities and\nlimitations of our model.\nThe input/output pairs (P,CP ) have a similar format as in the Convex Hull problem described in\nSection 3.1. P will be the cartesian coordinates representing the cities, which are chosen randomly\nin the [0, 1] \u00d7 [0, 1] square. CP = {C1, . . . , Cn} will be a permutation of integers from 1 to n\nrepresenting the optimal path (or tour). For consistency, in the training dataset, we always start in\nthe \ufb01rst city without loss of generality.\nTo generate exact data, we implemented the Held-Karp algorithm [18] which \ufb01nds the optimal\nsolution in O(2nn2) (we used it up to n = 20). 
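The Held-Karp dynamic program can be sketched as follows; the paper's implementation details are not given, so this is the textbook bitmask formulation, where dp[S][j] is the length of the shortest path that starts at city 0, visits exactly the cities in subset S, and ends at city j.

```python
from itertools import combinations
from math import dist

def held_karp(points):
    """Exact TSP tour length via Held-Karp, O(2^n n^2) time."""
    n = len(points)
    d = [[dist(p, q) for q in points] for p in points]
    # Subsets are bitmasks over cities 1..n-1; city 0 is the fixed start.
    dp = {(1 << j, j): d[0][j] for j in range(1, n)}
    for size in range(2, n):
        for subset in combinations(range(1, n), size):
            mask = sum(1 << j for j in subset)
            for j in subset:
                prev = mask ^ (1 << j)
                dp[(mask, j)] = min(dp[(prev, k)] + d[k][j]
                                    for k in subset if k != j)
    full = (1 << n) - 2                  # all cities except city 0
    return min(dp[(full, j)] + d[j][0] for j in range(1, n))

square = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, 0.0)]
opt_len = held_karp(square)              # optimal tour of the unit square: 4.0
```

The exponential subset enumeration is what makes the cut-off around n = 20 in the paper a practical necessity.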
For larger n, producing exact solutions is extremely\ncostly; therefore we also considered algorithms that produce approximated solutions: A1 [19] and\nA2 [20], which are both O(n2), and A3 [21] which implements the O(n3) Christo\ufb01des algorithm.\nThe latter algorithm is guaranteed to \ufb01nd a solution within a factor of 1.5 from the optimal length.\nTable 2 shows how they performed in our test sets.\n\n2We choose Ci = (1, 2, 4) instead of (2,4,1) or any other permutation.\n\n5\n\n\f4 Empirical Results\n\n4.1 Architecture and Hyperparameters\n\nNo extensive architecture or hyperparameter search of the Ptr-Net was done in the work presented\nhere, and we used virtually the same architecture throughout all the experiments and datasets. Even\nthough there are likely some gains to be obtained by tuning the model, we felt that having the same\nmodel hyperparameters operate on all the problems makes the main message of the paper stronger.\nAs a result, all our models used a single layer LSTM with either 256 or 512 hidden units, trained with\nstochastic gradient descent with a learning rate of 1.0, batch size of 128, random uniform weight\ninitialization from -0.08 to 0.08, and L2 gradient clipping of 2.0. We generated 1M training example\npairs, and we did observe over\ufb01tting in some cases where the task was simpler (i.e., for small n).\nTraining generally converged after 10 to 20 epochs.\n\n4.2 Convex Hull\n\nWe used the convex hull as the guiding task which allowed us to understand the de\ufb01ciencies of\nstandard models such as the sequence-to-sequence approach, and also to set up our expectations\non what a purely data driven model would be able to achieve with respect to an exact solution.\nWe reported two metrics: accuracy, and area covered of the true convex hull (note that any simple\npolygon will have full intersection with the true convex hull). 
To compute the accuracy, we con-\nsidered two output sequences C1 and C2 to be the same if they represent the same polygon. For\nsimplicity, we only computed the area coverage for the test examples in which the output represents\na simple polygon (i.e., without self-intersections). If an algorithm fails to produce a simple polygon\nin more than 1% of the cases, we simply reported FAIL.\nThe results are presented in Table 1. We note that the area coverage achieved with the Ptr-Net is\nclose to 100%. Looking at examples of mistakes, we see that most problems come from points that\nare aligned (see Figure 3 (d) for a mistake for n = 500) \u2013 this is a common source of errors in most\nalgorithms to solve the convex hull.\nIt was seen that the order in which the inputs are presented to the encoder during inference affects\nits performance. When the points on the true convex hull are seen \u201clate\u201d in the input sequence, the\naccuracy is lower. This is possibly because the network does not have enough processing steps to \u201cupdate\u201d\nthe convex hull it computed until the latest points were seen. In order to overcome this problem,\nwe used the attention mechanism described in Section 2.2, which allows the decoder to look at\nthe whole input at any time. This modi\ufb01cation boosted the model performance signi\ufb01cantly. We\ninspected what attention was focusing on, and we observed that it was \u201cpointing\u201d at the correct\nanswer on the input side. This inspired us to create the Ptr-Net model described in Section 2.3.\nMore than outperforming both the LSTM and the LSTM with attention, our model has the key\nadvantage of being inherently variable length. 
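Recalling Section 2.3, the variable-length output distribution that makes this possible can be sketched as follows; the dimensions and random parameters below are illustrative only (the paper's models use LSTM states with hidden sizes of 256 or 512).

```python
import numpy as np

# Sketch of the pointer mechanism of Section 2.3: the decoder state d_i is
# scored against every encoder state e_j, and the softmax over the scores
# u^i_j is used directly as the output distribution over input positions
# (no blending into a context vector).  All tensors here are random stand-ins.
rng = np.random.default_rng(0)
hidden, n = 8, 5                           # hidden size, input length
W1 = rng.normal(size=(hidden, hidden))
W2 = rng.normal(size=(hidden, hidden))
v = rng.normal(size=hidden)
E = rng.normal(size=(n, hidden))           # encoder states e_1..e_n
d_i = rng.normal(size=hidden)              # decoder state at output step i

u = np.array([v @ np.tanh(W1 @ e_j + W2 @ d_i) for e_j in E])
p = np.exp(u - u.max()); p /= p.sum()      # p(C_i | C_1..C_{i-1}, P) over the n inputs
```

Because the softmax is taken over however many encoder states exist, the same parameters apply unchanged to any input length n.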
The bottom half of Table 1 shows that, when training\nour model on a variety of lengths ranging from 5 to 50 (uniformly sampled, as we found other forms\nof curriculum learning to not be effective), a single model is able to perform quite well on all lengths\nit has been trained on (but some degradation for n = 50 can be observed w.r.t. the model trained only\non length 50 instances). More impressive is the fact that the model does extrapolate to lengths that it\nhas never seen during training. Even for n = 500, our results are satisfactory and indirectly indicate\nthat the model has learned more than a simple lookup. Neither the LSTM nor the LSTM with attention can\nbe used for any given n\u2032 \u2260 n without training a new model on n\u2032.\n\n4.3 Delaunay Triangulation\n\nThe Delaunay Triangulation test case is connected to our \ufb01rst problem of \ufb01nding the convex hull. In\nfact, the Delaunay Triangulation for a given set of points triangulates the convex hull of these points.\nWe reported two metrics: accuracy and triangle coverage in percentage (the percentage of triangles\nthe model predicted correctly). Note that, in this case, for an input point set P, the output sequence\nC(P) is, in fact, a set. As a consequence, any permutation of its elements will represent the same\ntriangulation.\n\n6\n\n\fTable 1: Comparison between LSTM, LSTM with attention, and our Ptr-Net model on the convex\nhull problem. Note that the baselines must be trained on the same n that they are tested on. 
5-50\nmeans the dataset had a uniform distribution over lengths from 5 to 50.\n\nMETHOD | TRAINED n | n | ACCURACY | AREA\nLSTM [1] | 50 | 50 | 1.9% | FAIL\n+ATTENTION [5] | 50 | 50 | 38.9% | 99.7%\nPTR-NET | 50 | 50 | 72.6% | 99.9%\nLSTM [1] | 5 | 5 | 87.7% | 99.6%\nPTR-NET | 5-50 | 5 | 92.0% | 99.6%\nLSTM [1] | 10 | 10 | 29.9% | FAIL\nPTR-NET | 5-50 | 10 | 87.0% | 99.8%\nPTR-NET | 5-50 | 50 | 69.6% | 99.9%\nPTR-NET | 5-50 | 100 | 50.3% | 99.9%\nPTR-NET | 5-50 | 200 | 22.1% | 99.9%\nPTR-NET | 5-50 | 500 | 1.3% | 99.2%\n\n(a) LSTM, m=50, n=50\n\n(b) Truth, n=50\n\n(c) Truth, n=20\n\n(d) Ptr-Net, m=5-50, n=500\n\n(e) Ptr-Net, m=50, n=50\n\n(f) Ptr-Net, m=5-20, n=20\n\nFigure 3: Examples of our model on Convex hulls (left), Delaunay (center) and TSP (right), trained\non m points, and tested on n points. A failure of the LSTM sequence-to-sequence model for Convex\nhulls is shown in (a). Note that the baselines cannot be applied to a different length from training.\n\nUsing the Ptr-Net model for n = 5, we obtained an accuracy of 80.7% and triangle coverage of\n93.0%. For n = 10, the accuracy was 22.6% and the triangle coverage 81.3%. For n = 50, we\ndid not produce any precisely correct triangulation, but obtained 52.8% triangle coverage. See the\nmiddle column of Figure 3 for an example for n = 50.\n4.4 Travelling Salesman Problem\n\nAs the third problem, we considered the planar symmetric travelling salesman problem (TSP), which\nis NP-hard. Similarly to \ufb01nding convex hulls, it also has sequential outputs. Given that the Ptr-\nNet implements an O(n2) algorithm, it was unclear if it would have enough capacity to learn a\nuseful algorithm solely from data.\nAs discussed in Section 3.3, it is feasible to generate exact solutions for relatively small values\nof n to be used as training data. For larger n, due to the importance of TSP, good and ef\ufb01cient\nalgorithms providing reasonable approximate solutions exist. 
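Such approximate solvers can be as simple as a greedy construction; the sketch below uses a nearest-neighbour heuristic as one representative O(n^2) method (an assumption for illustration only: A1 and A2 in this paper are external implementations whose exact algorithms are not described here).

```python
from math import dist

def nearest_neighbour_tour(points):
    """Greedy tour: start at city 0, always move to the closest unvisited
    city; returns the tour (as 0-based indices) and its closed length."""
    n = len(points)
    unvisited = list(range(1, n))
    tour, cur = [0], 0
    while unvisited:
        # Note: at call time, `cur` still holds the previous city.
        cur = min(unvisited, key=lambda j: dist(points[cur], points[j]))
        unvisited.remove(cur)
        tour.append(cur)
    length = sum(dist(points[tour[k]], points[tour[(k + 1) % n]])
                 for k in range(n))
    return tour, length

square = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, 0.0)]
tour, length = nearest_neighbour_tour(square)
```

Heuristics of this kind carry no worst-case guarantee, unlike the Christofides algorithm (A3), which is within a factor of 1.5 of the optimum.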
We used three different algorithms in\nour experiments \u2013 A1, A2, and A3 (see Section 3.3 for references).\n\n7\n\n\f(Figure 3, right panel legend: Ground Truth tour length is 3.518; Predictions tour length is 3.523.)\n\nTable 2: Tour length of the Ptr-Net and a collection of algorithms on a small scale TSP problem.\n\nn | OPTIMAL | A1 | A2 | A3 | PTR-NET\n5 | 2.12 | 2.18 | 2.12 | 2.12 | 2.12\n10 | 2.87 | 3.07 | 2.87 | 2.87 | 2.88\n50 (A1 TRAINED) | N/A | 6.46 | 5.84 | 5.79 | 6.42\n50 (A3 TRAINED) | N/A | 6.46 | 5.84 | 5.79 | 6.09\n5 (5-20 TRAINED) | 2.12 | 2.18 | 2.12 | 2.12 | 2.12\n10 (5-20 TRAINED) | 2.87 | 3.07 | 2.87 | 2.87 | 2.87\n20 (5-20 TRAINED) | 3.83 | 4.24 | 3.86 | 3.85 | 3.88\n25 (5-20 TRAINED) | N/A | 4.71 | 4.27 | 4.24 | 4.30\n30 (5-20 TRAINED) | N/A | 5.11 | 4.63 | 4.60 | 4.72\n40 (5-20 TRAINED) | N/A | 5.82 | 5.27 | 5.23 | 5.91\n50 (5-20 TRAINED) | N/A | 6.46 | 5.84 | 5.79 | 7.66\n\nTable 2 shows all of our results on TSP. The number reported is the length of the proposed tour.\nUnlike the convex hull and Delaunay triangulation cases, where the decoder was unconstrained, in\nthis example we set the beam search procedure to only consider valid tours. Otherwise, the Ptr-Net\nmodel would sometimes output an invalid tour \u2013 for instance, it would repeat two cities or decide\nto ignore a destination. This procedure was relevant for n > 20: for n \u2264 20, the unconstrained\ndecoding failed in less than 1% of the cases, and thus was not necessary. For 30, which goes beyond\nthe longest sequence seen in training, the failure rate went up to 35%, and for 40, it went up to 98%.\nThe \ufb01rst group of rows in the table shows the Ptr-Net trained on optimal data, except for n = 50,\nsince that is not feasible computationally (we trained a separate model for each n). 
Interestingly,\nwhen using the worst algorithm (A1) data to train the Ptr-Net, our model outperforms the algorithm\nthat it is trying to imitate.\nThe second group of rows in the table shows how the Ptr-Net trained on optimal data with 5 to\n20 cities can generalize beyond that. The results are virtually perfect for n = 25, and good for\nn = 30, but it seems to break for 40 and beyond (still, the results are far better than chance). This\ncontrasts with the convex hull case, where we were able to generalize by a factor of 10. However,\nthe underlying algorithms have greater complexity than O(n log n), which could explain this.\n\n5 Conclusions\n\nIn this paper we described Ptr-Net, a new architecture that allows us to learn a conditional prob-\nability of one sequence CP given another sequence P, where CP is a sequence of discrete tokens\ncorresponding to positions in P. We show that Ptr-Nets can be used to learn solutions to three dif-\nferent combinatorial optimization problems. Our method works on variable sized inputs (yielding\nvariable sized output dictionaries), something the baseline models (sequence-to-sequence with or\nwithout attention) cannot do directly. Even more impressively, they outperform the baselines on\n\ufb01xed input size problems - to which both models can be applied.\nPrevious methods such as RNNSearch, Memory Networks and Neural Turing Machines [5, 6, 2]\nhave used attention mechanisms to process inputs. However these methods do not directly address\nproblems that arise with variable output dictionaries. We have shown that an attention mechanism\ncan be applied to the output to solve such problems. In so doing, we have opened up a new class\nof problems to which neural networks can be applied without arti\ufb01cial assumptions. 
In this paper,\nwe have applied this extension to RNNSearch, but the methods are equally applicable to Memory\nNetworks and Neural Turing Machines.\nFuture work will try to show its applicability to other problems such as sorting where the outputs\nare chosen from the inputs. We are also excited about the possibility of using this approach to other\ncombinatorial optimization problems.\n\nAcknowledgments\n\nWe would like to thank Rafal Jozefowicz, Ilya Sutskever, Quoc Le and Samy Bengio for useful\ndiscussions. We would also like to thank Daniel Gillick for his help with the \ufb01nal manuscript.\n\n8\n\n\fReferences\n[1] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural\n\nnetworks. In Advances in Neural Information Processing Systems, pages 3104\u20133112, 2014.\n\n[2] Alex Graves, Greg Wayne, and Ivo Danihelka. Neural turing machines. arXiv preprint\n\narXiv:1410.5401, 2014.\n\n[3] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning internal representa-\n\ntions by error propagation. Technical report, DTIC Document, 1985.\n\n[4] Anthony J Robinson. An application of recurrent nets to phone probability estimation. Neural\n\nNetworks, IEEE Transactions on, 5(2):298\u2013305, 1994.\n\n[5] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by\n\njointly learning to align and translate. In ICLR 2015, arXiv preprint arXiv:1409.0473, 2014.\n\n[6] Jason Weston, Sumit Chopra, and Antoine Bordes. Memory networks. In ICLR 2015, arXiv\n\npreprint arXiv:1410.3916, 2014.\n\n[7] Alex Graves. Generating sequences with recurrent neural networks.\n\narXiv preprint\n\narXiv:1308.0850, 2013.\n\n[8] Oriol Vinyals, Lukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey Hinton.\n\nGrammar as a foreign language. arXiv preprint arXiv:1412.7449, 2014.\n\n[9] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural\n\nimage caption generator. 
In CVPR 2015, arXiv preprint arXiv:1411.4555, 2014.\n\n[10] Jeff Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venu-\ngopalan, Kate Saenko, and Trevor Darrell. Long-term recurrent convolutional networks for\nvisual recognition and description. In CVPR 2015, arXiv preprint arXiv:1411.4389, 2014.\n\n[11] Wojciech Zaremba and Ilya Sutskever. Learning to execute. arXiv preprint arXiv:1410.4615,\n\n2014.\n\n[12] Sepp Hochreiter and J\u00fcrgen Schmidhuber. Long short-term memory. Neural computation,\n\n9(8):1735\u20131780, 1997.\n\n[13] Nitish Srivastava, Elman Mansimov, and Ruslan Salakhutdinov. Unsupervised learning of\n\nvideo representations using lstms. In ICML 2015, arXiv preprint arXiv:1502.04681, 2015.\n\n[14] Ray A Jarvis. On the identi\ufb01cation of the convex hull of a \ufb01nite set of points in the plane.\n\nInformation Processing Letters, 2(1):18\u201321, 1973.\n\n[15] Ronald L. Graham. An ef\ufb01cient algorith for determining the convex hull of a \ufb01nite planar set.\n\nInformation processing letters, 1(4):132\u2013133, 1972.\n\n[16] Franco P. Preparata and Se June Hong. Convex hulls of \ufb01nite sets of points in two and three\n\ndimensions. Communications of the ACM, 20(2):87\u201393, 1977.\n\n[17] S. Rebay. Ef\ufb01cient unstructured mesh generation by means of delaunay triangulation and\n\nbowyer-watson algorithm. Journal of computational physics, 106(1):125\u2013138, 1993.\n\n[18] Richard Bellman. Dynamic programming treatment of the travelling salesman problem. Jour-\n\nnal of the ACM (JACM), 9(1):61\u201363, 1962.\n\n[19] Suboptimal travelling salesman problem (tsp) solver. Available at\n\nhttps://github.com/dmishin/tsp-solver.\n\n[20] Traveling salesman problem c++ implementation. Available at\n\nhttps://github.com/samlbest/traveling-salesman.\n\n[21] C++ implementation of traveling salesman problem using christo\ufb01des and 2-opt. 
Available at\n\nhttps://github.com/beckysag/traveling-salesman.\n\n9\n\n\f", "award": [], "sourceid": 1558, "authors": [{"given_name": "Oriol", "family_name": "Vinyals", "institution": "Google"}, {"given_name": "Meire", "family_name": "Fortunato", "institution": null}, {"given_name": "Navdeep", "family_name": "Jaitly", "institution": "Google"}]}