{"title": "Reverse engineering recurrent networks for sentiment classification reveals line attractor dynamics", "book": "Advances in Neural Information Processing Systems", "page_first": 15696, "page_last": 15705, "abstract": "Recurrent neural networks (RNNs) are a widely used tool for modeling sequential data, yet they are often treated as inscrutable black boxes. Given a trained recurrent network, we would like to reverse engineer it--to obtain a quantitative, interpretable description of how it solves a particular task. Even for simple tasks, a detailed understanding of how recurrent networks work, or a prescription for how to develop such an understanding, remains elusive. In this work, we use tools from dynamical systems analysis to reverse engineer recurrent networks trained to perform sentiment classification, a foundational natural language processing task. Given a trained network, we find fixed points of the recurrent dynamics and linearize the nonlinear system around these fixed points. Despite their theoretical capacity to implement complex, high-dimensional computations, we find that trained networks converge to highly interpretable, low-dimensional representations. In particular, the topological structure of the fixed points and corresponding linearized dynamics reveal an approximate line attractor within the RNN, which we can use to quantitatively understand how the RNN solves the sentiment analysis task. Finally, we find this mechanism present across RNN architectures (including LSTMs, GRUs, and vanilla RNNs) trained on multiple datasets, suggesting that our findings are not unique to a particular architecture or dataset. 
Overall, these results demonstrate that surprisingly universal and human interpretable computations can arise across a range of recurrent networks.", "full_text": "Reverse engineering recurrent networks for sentiment classification reveals line attractor dynamics\n\nNiru Maheswaranathan\u2217\nGoogle Brain, Google Inc.\nMountain View, CA\nnirum@google.com\n\nAlex H. Williams\u2217\nStanford University\nStanford, CA\nahwillia@stanford.edu\n\nMatthew D. Golub\nStanford University\nStanford, CA\nmgolub@stanford.edu\n\nSurya Ganguli\nStanford and Google Brain, Google Inc.\nStanford, CA\nsganguli@stanford.edu\n\nDavid Sussillo\nGoogle Brain, Google Inc.\nMountain View, CA\nsussillo@google.com\n\nAbstract\n\nRecurrent neural networks (RNNs) are a widely used tool for modeling sequential data, yet they are often treated as inscrutable black boxes. Given a trained recurrent network, we would like to reverse engineer it: to obtain a quantitative, interpretable description of how it solves a particular task. Even for simple tasks, a detailed understanding of how recurrent networks work, or a prescription for how to develop such an understanding, remains elusive. In this work, we use tools from dynamical systems analysis to reverse engineer recurrent networks trained to perform sentiment classification, a foundational natural language processing task. Given a trained network, we find fixed points of the recurrent dynamics and linearize the nonlinear system around these fixed points. Despite their theoretical capacity to implement complex, high-dimensional computations, we find that trained networks converge to highly interpretable, low-dimensional representations. 
In particular, the topological structure of the fixed points and corresponding linearized dynamics reveal an approximate line attractor within the RNN, which we can use to quantitatively understand how the RNN solves the sentiment analysis task. Finally, we find this mechanism present across RNN architectures (including LSTMs, GRUs, and vanilla RNNs) trained on multiple datasets, suggesting that our findings are not unique to a particular architecture or dataset. Overall, these results demonstrate that surprisingly universal and human interpretable computations can arise across a range of recurrent networks.\n\n1 Introduction\n\nRecurrent neural networks (RNNs) are a popular tool for sequence modelling tasks. These architectures are thought to learn complex relationships in input sequences, and exploit this structure in a nonlinear fashion. However, RNNs are typically viewed as black boxes, despite considerable interest in better understanding how they function.\nHere, we focus on studying how recurrent networks solve document-level sentiment analysis, a simple but longstanding benchmark task for language modeling [7, 19]. Simple models, such as logistic regression trained on a bag-of-words representation, can achieve good performance in this setting [17]. Nonetheless, baseline models without bi-gram features miss obviously important syntactic relations, such as negation clauses [18].\n\n\u2217equal contribution\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nFigure 1: Example LSTM hidden state activity for a network trained on sentiment classification. Each panel shows the evolution of the hidden state for all of the units in the network for positive (left) and negative (right) example documents over the first 150 tokens. At a glance, the activation time series for individual units appear inscrutable.
To capture complex structure in text, especially over long distances, many recent works have investigated a wide variety of feed-forward and recurrent neural network architectures for this task (for a review, see [19]).\nWe demonstrate that popular RNN architectures, despite having the capacity to implement high-dimensional and nonlinear computations, in practice converge to low-dimensional representations when trained on this task. Moreover, using analysis techniques from dynamical systems theory, we show that locally linear approximations to the nonlinear RNN dynamics are highly interpretable. In particular, they all involve approximate low-dimensional line attractor dynamics, a useful dynamical feature that can be implemented by linear dynamics and can be used to store an analog value [13]. Furthermore, we show that this mechanism is surprisingly consistent across a range of RNN architectures. Taken together, these results demonstrate how a remarkably simple operation, linear integration, arises as a universal mechanism in disparate, nonlinear recurrent architectures that solve a real-world task.\n\n2 Related Work\n\nSeveral studies have tried to interpret recurrent networks by visualizing the activity of individual RNN units and memory gates during NLP tasks [5, 15]. While some individual RNN state variables appear to encode semantically meaningful features, most units do not have clear interpretations. For example, the hidden states of an LSTM appear extremely complex when performing a task (Fig. 1). Other work has suggested that network units with human-interpretable behaviors (e.g. class selectivity) are not more important for network performance [10], and thus our understanding of RNN function may be misled by focusing only on single interpretable units. 
Instead, this work aims to interpret the entire hidden state to infer computational mechanisms underlying trained RNNs.\nAnother line of work has developed quantitative methods to identify important words or phrases in an input sequence that influenced the model's ultimate prediction [8, 11]. These approaches can identify interesting salient features in subsets of the inputs, but do not directly shed light on the computational mechanism of RNNs.\n\n3 Methods\n\n3.1 Preliminaries\n\nWe denote the hidden state of a recurrent network at time t as a vector, ht. Similarly, the input to the network at time t is given by a vector xt. We use F to denote a function that applies any recurrent network update, i.e. ht+1 = F (ht, xt).\n\n3.2 Training\n\nWe trained four RNN architectures (LSTM [4], GRU [1], Update Gate RNN (UGRNN) [2], and standard (vanilla) RNNs) on binary sentiment classification tasks. We trained each network type on each of three datasets: the IMDB movie review dataset, which contains 50,000 highly polarized\n\nFigure 2: LSTMs trained to identify the sentiment of Yelp reviews explore a low-dimensional volume of state space. (a) PCA on LSTM hidden states - PCA applied to all hidden states visited during 1000 test examples for untrained (light gray) vs. trained (black) LSTMs. After training, most of the variance in LSTM hidden unit activity is captured by a few dimensions. (b) RNN state space - Projection of LSTM hidden unit activity onto the top two principal components (PCs). 2D histogram shows density of visited states for test examples colored for negative (red) and positive (green) reviews. Two example trajectories are shown for a document of each type (red and green solid lines, respectively). The projection of the initial state (black dot) and readout vector (black arrows) in this low-dimensional space are also shown. 
Dashed black line shows a readout value of 0. (c) Approximate fixed points - Projection of approximate fixed points of the LSTM dynamics (see Methods) onto the top PCs. The fixed points lie along a 1-D manifold (inset shows variance explained by PCA on the approximate fixed points), parameterized by a coordinate \u03b8 (see Methods).\n\nreviews [9]; the Yelp review dataset, which contains 500,000 user reviews [20]; and the Stanford Sentiment Treebank, which contains 11,855 sentences taken from movie reviews [14]. For each task and architecture, we analyzed the best performing networks, selected using a validation set (see Appendix B for test accuracies of the best networks).\n\n3.3 Fixed point analysis\n\nWe analyzed trained networks by linearizing the dynamics around approximate fixed points. Approximate fixed points are state vectors {h\u2217\u2081, h\u2217\u2082, h\u2217\u2083, \u00b7\u00b7\u00b7} that do not change appreciably under the RNN dynamics with zero inputs: h\u2217\u1d62 \u2248 F (h\u2217\u1d62, x=0) [16]. Briefly, we find these fixed points numerically by first defining a loss function q = (1/N) \u2016h \u2212 F (h, 0)\u2016\u2082\u00b2, and then minimizing q with respect to hidden states, h, using standard auto-differentiation methods [3]. We ran this optimization multiple times starting from different initial values of h. These initial conditions were sampled randomly from the distribution of state activations explored by the trained network, which was done to intentionally sample states related to the operation of the RNN.\n\n4 Results\n\nFor brevity, in what follows we explain our approach using the working example of the LSTM trained on the Yelp dataset (Figs. 2-3). At the end of the results we show a summary figure across a few more architectures and datasets (Fig. 6). 
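The fixed point search of Section 3.3 can be sketched in a few lines. The sketch below is illustrative only: it uses a toy vanilla RNN with random, made-up parameters and a hand-derived gradient of q, in place of the paper's auto-differentiation over a trained LSTM.

```python
import numpy as np

# Toy vanilla RNN update with zero input: F(h) = tanh(W h).
# W is random and contracting, so an approximate fixed point exists;
# all values here are invented for illustration.
rng = np.random.default_rng(0)
N = 8
W = 0.3 * rng.standard_normal((N, N)) / np.sqrt(N)

def F(h):
    return np.tanh(W @ h)

def q(h):
    """Speed of the state under the zero-input dynamics: (1/N)||h - F(h)||^2."""
    r = h - F(h)
    return (r @ r) / N

def grad_q(h):
    """Hand-derived dq/dh for this toy RNN (the paper uses autodiff instead)."""
    z = np.tanh(W @ h)
    r = h - z
    return (2.0 / N) * (r - W.T @ ((1.0 - z ** 2) * r))

def find_fixed_point(h0, lr=0.5, steps=10000):
    """Minimize q by gradient descent from a sampled initial state h0."""
    h = h0.copy()
    for _ in range(steps):
        h = h - lr * grad_q(h)
    return h

# Random start, standing in for a state sampled from the network's activity.
h_star = find_fixed_point(rng.standard_normal(N))
```

Runs that terminate with q below a small threshold (Section 4.2 uses q < 10\u207b\u2078) are kept as approximate fixed points; repeating the search from many sampled initial states yields the collection of fixed points analyzed in the Results.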
We find similar results for all architectures and datasets, as demonstrated by an exhaustive set of figures in the supplementary materials.\n\n4.1 RNN dynamics are low-dimensional\n\nAs an initial exploratory analysis step, we performed principal components analysis (PCA) on the RNN states concatenated across 1,000 test examples. The top 2-3 PCs explained \u223c90% of the variance in hidden state activity (Fig. 2a, black line). The distribution of hidden states visited by untrained networks on the same set of examples was much higher dimensional (Fig. 2a, gray line), suggesting that training the networks stretched the geometry of their representations along a low-dimensional subspace.\nWe then visualized the RNN dynamics in this low-dimensional space by forming a 2D histogram of the density of RNN states colored by the sentiment label (Fig. 2b), and visualized how RNN states evolved within this low-dimensional space over a full sequence of text (Fig. 2b).\n\nFigure 3: Characterizing the top eigenmodes of each fixed point. (a) Same plot as in Fig. 2c (fixed points are grey), with three example fixed points highlighted. (b) For each of these fixed points, we compute the LSTM Jacobian (see Methods) and show the distribution of eigenvalues (colored circles) in the complex plane (black line is the unit circle). (c-d) The time constants (\u03c4 in terms of # of input tokens, see Appendix C) associated with the eigenvalues. (c) The time constant for the top three modes for all fixed points as a function of the position along the line attractor (parameterized by a manifold coordinate, \u03b8). (d) All time constants for all eigenvalues associated with the three highlighted fixed points. 
The top eigenmode across fixed points has a time constant on the order of hundreds to thousands of tokens.\n\nWe observed that the state vector incrementally moved from a central position towards one or another end of the PC-plane, with the direction corresponding either to a positive or negative sentiment prediction. Input words with positive valence (\u201camazing\u201d, \u201cgreat\u201d, etc.) incremented the hidden state towards a positive sentiment prediction, while words with negative valence (\u201cbad\u201d, \u201chorrible\u201d, etc.) pushed the hidden state in the opposite direction. Neutral words and phrases did not typically exert large effects on the RNN state vector.\nThese observations are reminiscent of line attractor dynamics. That is, the RNN state vector evolves along a 1D manifold of marginally stable fixed points. Movement along the line is negligible whenever non-informative inputs (i.e. neutral words) are input to the network, whereas when an informative word or phrase (e.g. \u201cdelicious\u201d or \u201cmediocre\u201d) is encountered, the state vector is pushed towards one or the other end of the manifold. Thus, the model's representation of positive and negative documents gradually separates as evidence is incrementally accumulated.\nThe hypothesis that RNNs approximate line attractor dynamics makes four specific predictions, which we investigate and confirm in subsequent sections. First, the fixed points form an approximately 1D manifold that is aligned with the readout weights (Section 4.2). Second, all fixed points are attracting and marginally stable. That is, in the absence of input (or, perhaps, if a string of neutral/uninformative words is encountered) the RNN state should rapidly converge to the closest fixed point and then should not change appreciably (Section 4.4). Third, locally around each fixed point, inputs representing positive vs. negative evidence should produce linearly separable effects on the RNN state vector along some dimension (Section 4.5). Finally, these instantaneous effects should be integrated by the recurrent dynamics along the direction of the 1D fixed point manifold (Section 4.5).\n\n4.2 RNNs follow a 1D manifold of stable fixed points\n\nThe line attractor hypothesis predicts that the RNN state vector should rapidly approach a fixed point if no input is delivered to the network. To test this, we initialized the RNN to a random state (chosen uniformly from the distribution of states observed on the test set) and simulated the RNN without any input. In all cases, the normalized velocity of the state vector (\u2016ht+1 \u2212 ht\u2016/\u2016ht\u2016) approached zero within a few steps, and often the initial velocity was small. From this we conclude that the RNN is very often in close proximity to a fixed point during the task.\nWe numerically identified the location of \u223c500 RNN fixed points using previously established methods [16, 3]. Briefly, we minimized the quantity q = (1/N) \u2016h \u2212 F (h, 0)\u2016\u2082\u00b2 over the RNN hidden state vector, h, from many initial conditions drawn to match the distribution of hidden states during training. Critical points of this loss function satisfying q < 10\u207b\u2078 were considered fixed points (similar results were observed for different choices of this threshold). For each architecture, we found \u223c500 (approximate) fixed points.\nWe then projected these fixed points into the same low-dimensional space used in Fig. 2b. 
Although the PCA projection was fit to the RNN hidden states, and not the fixed points, a very high percentage of variance in the fixed points was captured by this projection (Fig. 2c, inset), suggesting that the RNN states remain close to the manifold of fixed points. We call m the vector that describes the main axis of variation of the 1D manifold. Consistent with the line attractor hypothesis, the fixed points appeared to be spread along a 1D curve when visualized in PC space, and furthermore the principal direction of this curve was aligned with the readout weights (Fig. 2c).\nWe further verified that this low-dimensional approximation was accurate by using locally linear embedding (LLE) [12] to parameterize a 1D manifold of fixed points in the raw, high-dimensional data. This provided a scalar coordinate, \u03b8\u1d62 \u2208 [\u22121, 1], for each fixed point, which was well matched to the position of the fixed point manifold in PC space (coloring of points in Fig. 2c).\n\n4.3 Linear approximations of RNN dynamics\n\nWe next aimed to demonstrate that the identified fixed points were marginally stable, and thus could be used to preserve accumulated information from the inputs. To do this, we used a standard linearization procedure [6] to obtain an approximate, but highly interpretable, description of the RNN dynamics near the fixed point manifold. 
Briefly, given the last state ht\u22121 and the current input xt, the approach is to locally approximate the update rule with a first-order Taylor expansion:\n\nht = F (h\u2217 + \u2206ht\u22121, x\u2217 + \u2206xt) \u2248 F (h\u2217, x\u2217) + Jrec \u2206ht\u22121 + Jinp \u2206xt    (1)\n\nwhere \u2206ht\u22121 = ht\u22121 \u2212 h\u2217 and \u2206xt = xt \u2212 x\u2217, and {Jrec, Jinp} are Jacobian matrices of the system, with elements (Jrec)ij = \u2202F (h\u2217, x\u2217)i/\u2202h\u2217j and (Jinp)ij = \u2202F (h\u2217, x\u2217)i/\u2202x\u2217j.\nWe choose h\u2217 to be a numerically identified fixed point and x\u2217 = 0 (see footnote 2), thus we have F (h\u2217, x\u2217) \u2248 h\u2217 and \u2206xt = xt. Under this choice, equation (1) reduces to a discrete-time linear dynamical system:\n\n\u2206ht = Jrec \u2206ht\u22121 + Jinp xt.    (2)\n\nIt is important to note that both Jacobians depend on which fixed point we choose to linearize around, and should thus be thought of as functions of h\u2217; for notational simplicity we do not denote this dependence explicitly.\nBy reducing the nonlinear RNN to a linear system, we can analytically estimate the network's response to a sequence of T inputs. In this approximation, the effect of each input xt is decoupled from all others; that is, the final state is given by the sum of all individual effects (see footnote 3).\nWe can restrict our focus to the effect of a single input, xt. Let k = T \u2212 t be the number of time steps between xt and the end of the document. The total effect of xt on the final RNN state is (Jrec)\u1d4f Jinp xt. 
After substituting the eigendecomposition Jrec = R\u039bL for a non-normal matrix, this becomes:\n\nR\u039b\u1d4fL Jinp xt = \u2211(a=1 to N) \u03bba\u1d4f ra \u2113a\u1d40 Jinp xt,    (3)\n\nwhere L = R\u207b\u00b9, the columns of R (denoted ra) contain the right eigenvectors of Jrec, the rows of L (denoted \u2113a\u1d40) contain the left eigenvectors of Jrec, and \u039b is a diagonal matrix containing complex-valued eigenvalues, sorted by magnitude so that |\u03bb1| > |\u03bb2| > . . . > |\u03bbN|.\n\nFootnote 2: We also tried linearizing around the average embedding over all words; this did not change the results. The average embedding is very close to the zeros vector (the norm of the difference between the two is less than 8 \u00d7 10\u207b\u00b3), so it is not surprising that using that as the linearization point yields similar results.\nFootnote 3: We consider the case where the network has closely converged to a fixed point, so that h0 = h\u2217 and thus \u2206h0 = 0.\n\nFigure 4: Effect of different word inputs on the LSTM state vector. (a) Effect of word inputs, Jinp x, for positive, negative, and neutral words (green, red, cyan dots). The green and red arrows point to the center of mass for the positive and negative words, respectively. Blue arrows denote \u21131, the top left eigenvector. The PCA projection is the same as Fig. 2c, but centered around each fixed point. Each plot denotes a separate fixed point (labeled in panel b). (b) Same plot as in Fig. 2c, with the three example fixed points in (a) highlighted (the rest of the approximate fixed points are shown in grey). Blue arrows denote r1, the top right eigenvector. In all cases r1 is aligned with the orientation of the manifold, m, consistent with an approximate line attractor. 
(c) Average projection of inputs onto the top left eigenvector (\u21131\u1d40 Jinp x) over 100 positive (green), negative (red), or neutral (cyan) words. Histogram displays the distribution of this input projection over all fixed points. (d) Distribution of r1\u1d40 m (overlap of the top right eigenvector with the fixed point manifold) over all fixed points. Null distribution consists of randomly generated unit vectors of the same dimension as the hidden state.\n\n4.4 An analysis of integration eigenmodes\n\nEach mode of the system either decays to zero or diverges exponentially fast, with a time constant given by \u03c4a = |1/log(|\u03bba|)| (see Appendix C for derivation). This time constant has units of tokens (or, roughly, words) and yields an interpretable number for the effective memory of the system. In practice we find, with high consistency, that nearly all eigenmodes are stable and only a small number cluster around |\u03bba| \u2248 1.\nFig. 3 plots the eigenvalues and associated time constants and shows the distribution of all eigenvalues at three representative fixed points along the fixed point manifold (Fig. 3a). In Fig. 3c, we plot the decay time constant of the top three modes; the slowest decaying mode persists for \u223c1000 time steps, while the next two modes persist for \u223c100 time steps, with lower modes decaying even faster. Since the average review length for the Yelp dataset is \u223c175 words, only a small number of modes can retain information from the beginning of the document.\nOverall, these eigenvalue spectra are consistent with our observation that RNN states only explore a low-dimensional subspace when performing sentiment classification. RNN activity along the majority of dimensions is associated with fast time constants and is therefore quickly forgotten. 
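The eigenvalue-to-time-constant conversion \u03c4a = |1/log(|\u03bba|)| can be sketched directly. The Jacobian below is a toy stand-in with a prescribed spectrum (one near-unity eigenvalue plus fast-decaying modes, loosely mimicking Fig. 3); it is not computed from a trained network.

```python
import numpy as np

# Toy recurrent Jacobian with one slow, near-unity mode and fast modes;
# the eigenvalues are prescribed here purely for illustration.
rng = np.random.default_rng(1)
N = 16
eigs = np.concatenate(([0.999], rng.uniform(0.1, 0.7, N - 1)))
R = rng.standard_normal((N, N))
J_rec = R @ np.diag(eigs) @ np.linalg.inv(R)

# Eigenvalues sorted by magnitude, as in eq. (3).
lam = np.linalg.eigvals(J_rec)
lam = lam[np.argsort(-np.abs(lam))]

# Time constant per mode, tau_a = |1 / log|lambda_a||, in units of tokens.
tau = 1.0 / np.abs(np.log(np.abs(lam)))
```

Here the slowest mode has \u03c4 \u2248 1/|log 0.999| \u2248 1000 tokens, while every other mode forgets its input within a handful of tokens, matching the qualitative picture in Fig. 3c-d.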
While multiple eigenmodes likely contribute to the performance of the network, we restrict this initial study to the slowest mode, for which \u03bb1 \u2248 1.\n\n4.5 Left and right eigenvectors\n\nRestricting our focus to the top eigenmode for simplicity (there may be a few slow modes of integration), the effect of a single input, xt, on the network activity (eq. 3) becomes: r1 \u21131\u1d40 Jinp x. We have dropped the dependence on t since \u03bb1 \u2248 1, so the effect of x is largely insensitive to the exact time it was input to the system. Using this expression, we separately analyzed the effects of specific words with positive, negative and neutral valences. We defined positive, negative, and neutral words based on the magnitude and sign of the logistic regression coefficients of a bag-of-words classifier.\n\nFigure 5: Linearized LSTM dynamics display low fractional error. (a) At every step along a trajectory, we compute the next state using either the full nonlinear system (solid, black) or the linearized system (dashed, red). Inset shows a zoomed-in version of the dynamics. (b) Histogram of fractional error of the linearized system over many test examples, evaluated in the high-dimensional state space.\n\nWe first examined the term Jinp x for various choices of x (i.e. various word tokens). This quantity represents the instantaneous linear effect of x on the RNN state vector. We projected the resulting vectors onto the same low-dimensional subspace shown in Fig. 2c. We see that positive and negative valence words push the hidden state in opposite directions. 
Neutral words, in contrast, exert much smaller effects on the RNN state (Fig. 4).\nWhile Jinp x represents the instantaneous effect of a word, only the features of this input that overlap with the top few eigenmodes are reliably remembered by the network. The scalar quantity \u21131\u1d40 Jinp x, which we call the input projection, captures the magnitude of change induced by x along the eigenmode associated with the longest timescale. Again we observe that the valence of x strongly correlates with this quantity: neutral words have an input projection near zero while positive and negative words produced larger magnitude responses of opposite sign. Furthermore, this is reliably observed across all fixed points. Fig. 4c shows the average input projection for positive, negative, and neutral words; the histogram summarizes these effects across all fixed points along the line attractor.\nFinally, if the input projection onto the top eigenmode is non-negligible, then the right eigenvector r1 (which is normalized to unit length) represents the direction along which x is integrated. If the RNN implements an approximate line attractor, then r1 (and potentially other slow modes) should align with the principal direction of the manifold of fixed points, m. In essence, this prediction states that an informative input pushes the current RNN state along the fixed point manifold and towards a neighboring fixed point, with the direction of this movement determined by word or phrase valence. We indeed observe a high degree of overlap between r1 and m both visually in PC space (Fig. 4b) and quantitatively across all fixed points (Fig. 4d).\n\n4.6 Linearized dynamics approximate the nonlinear system\n\nTo verify that the linearized dynamics (2) well approximate the nonlinear system, we compared hidden state trajectories of the full, nonlinear RNN to the linearized dynamics. 
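Concretely, this one-step comparison can be sketched with a toy vanilla RNN whose zero-input fixed point sits at h\u2217 = 0, so that the Jacobians there are simply the weight matrices. The network and all parameter values below are invented stand-ins for the trained LSTM.

```python
import numpy as np

# Toy vanilla RNN: F(h, x) = tanh(W h + U x). Under zero input, h* = 0 is a
# fixed point, and since tanh'(0) = 1 the Jacobians there are simply W and U.
# All parameters are made up for illustration.
rng = np.random.default_rng(2)
N, D = 16, 4
W = 0.5 * rng.standard_normal((N, N)) / np.sqrt(N)
U = rng.standard_normal((N, D)) / np.sqrt(D)

def F(h, x):
    return np.tanh(W @ h + U @ x)

J_rec, J_inp = W, U  # Jacobians of F evaluated at (h*, x*) = (0, 0)

# Compare one nonlinear step against the linearized step of eq. (2),
# starting from states near the fixed point.
errs = []
for _ in range(100):
    h = 0.1 * rng.standard_normal(N)
    x = 0.1 * rng.standard_normal(D)
    h_nonlinear = F(h, x)
    h_linear = J_rec @ h + J_inp @ x  # h* = 0, so no offset terms
    errs.append(np.linalg.norm(h_nonlinear - h_linear)
                / np.linalg.norm(h_nonlinear))
mean_err = float(np.mean(errs))
```

Near the fixed point the one-step fractional error in this toy is on the order of a percent; the \u223c10% figure reported for the LSTM in Fig. 5b is larger, presumably because real trajectories stray further from the fixed point manifold.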
That is, at each step, we computed the next hidden state using the nonlinear LSTM update equations, hLSTM_{t+1} = F (ht, xt), and the linear approximation of the dynamics at the nearest fixed point, hlin_{t+1} = h\u2217 + Jrec(h\u2217)(ht \u2212 h\u2217) + Jinp(h\u2217) xt. Fig. 5a shows the true, nonlinear trajectory (solid black line) as well as the linear approximations at every point along the trajectory (red dashed line). To summarize the error across many examples, we computed the relative error \u2016hLSTM_{t+1} \u2212 hlin_{t+1}\u2016\u2082 / \u2016hLSTM_{t+1}\u2016\u2082. Fig. 5b shows that this error is small (around 10%) across many test examples.\nNote that this error is the single-step error, computed by running either the nonlinear or linear dynamics forward for one time step. If we run the dynamics for many time steps, we find that small errors in the linearized system accumulate, thus causing the trajectories to diverge. This suggests that we cannot, in practice, replace the full nonlinear LSTM with a single linearized version.\n\n4.7 Universal mechanisms across architectures and datasets\n\nEmpirically, we investigated whether the mechanisms identified in the LSTM (line attractor dynamics) were present not only for other network architectures but also for networks trained on other datasets\n\nFigure 6: Universal mechanisms across architectures and datasets (see Appendix A for all other architecture-dataset combinations). Top row: comparison of left eigenvector (blue) against instantaneous effect of word input Jinp x by valence (green and red dots are positive and negative words, compare to Fig. 4a) for an example fixed point. Second row: Histogram of input projections summarizing the effect of input across fixed points (average of \u21131\u1d40 Jinp x, compare to Fig. 4c). Third row: Example fixed point (blue) shown on top of the manifold of fixed points (gray) projected into the principal components of hidden state activity, along with the corresponding top right eigenvector (compare to Fig. 4b). Bottom row: Distribution of projections of the top right eigenvector onto the manifold across fixed points (distribution of r1\u1d40 m, compare to Fig. 4d).\n\nused for sentiment classification. Remarkably, we see a surprising near-universality across networks (but see Supp. Mat. for another solution for the VRNN). Fig. 6 shows, for different architectures and datasets, the correlation of the top left eigenvectors with the instantaneous input for a given fixed point (first row), as well as a histogram over the same quantity over fixed points (second row). We observe the same configuration of a line attractor of approximate fixed points, and show an example fixed point and right eigenvector highlighted (third row) along with a summary of the projection of the top right eigenvector along the manifold across fixed points (bottom row). We see that regardless of architecture or dataset, each network approximately solves the task using the same mechanism.\n\n5 Discussion\n\nIn this work we applied dynamical systems analysis to understand how RNNs solve sentiment analysis. We found a simple mechanism, integration along a line attractor, present in multiple architectures trained on different sentiment analysis tasks. Overall, this work provides preliminary, but optimistic, evidence that different, highly intricate network models can converge to similar solutions that may be reduced and understood by human practitioners.\nIn summary, we found that in nearly all cases the key activity performed by the RNN for sentiment analysis is simply counting the number of positive and negative words used. 
More precisely, a slow mode of a local linear system aligns its left eigenvector with the current effective input, which itself nicely separates positive and negative word tokens. The associated right eigenvector then represents that input in a direction aligned to a line attractor, which in turn is aligned to the readout vector. As the RNN iterates over a document, integration of negative and positive words moves the system state along this line attractor, corresponding to accumulation of evidence by the RNN towards a prediction.\nSuch a mechanism is consistent with a solution that does not make use of word order when making a decision. As such, it is likely that we have not understood all the dynamics relevant in the computation of sentiment analysis. For example, we speculate there may be some yet unknown mechanism that detects simple bi-gram negations of one word by another, e.g. \u201cnot bad,\u201d since the gated RNNs performed a few percentage points better than the bag-of-words model. Nonetheless, it appears that approximate line attractor dynamics represent a fundamental computational mechanism in these RNNs, which can be built upon by future investigations.\nWhen we compare the overall classification accuracy of the Jacobian-linearized version of the LSTM with the full nonlinear LSTM, we find that the linearized version is much worse, presumably due to small errors in the linear approximation that accrue as the network processes a document. Note that if we directly train a linear model (as opposed to linearizing a nonlinear model), the performance is quite high (only around 3% worse than the LSTM), which suggests that the error of the Jacobian-linearized model has to do with errors in the approximation, not from having less expressive power.\nWe showed that similar dynamical features occur in 4 different architectures, the LSTM, GRU, UGRNN and vanilla RNNs (Fig. 6 and Supp. Mat.), and across three datasets. 
These rather different architectures all implemented the solution to sentiment analysis in a highly similar way. This hints at a surprising notion of universality of mechanism across disparate RNN architectures.

While our results pertain to a specific task, sentiment analysis is nevertheless representative of a larger set of modeling tasks that require integrating both relevant and irrelevant information over long sequences of symbols. Thus, it is possible that the uncovered mechanism, approximate line attractor dynamics, will arise in other practical settings, though perhaps employed in different ways on a per-task basis.

Acknowledgments

The authors would like to thank Peter Liu, Been Kim, and Michael C. Mozer for helpful feedback and discussions.