{"title": "Universality and individuality in neural dynamics across large populations of recurrent networks", "book": "Advances in Neural Information Processing Systems", "page_first": 15629, "page_last": 15641, "abstract": "Many recent studies have employed task-based modeling with recurrent neural networks (RNNs) to infer the computational function of different brain regions. These models are often assessed by quantitatively comparing the low-dimensional neural dynamics of the model and the brain, for example using canonical correlation analysis (CCA). However, the nature of the detailed neurobiological inferences one can draw from such efforts remains elusive. For example, to what extent does training neural networks to solve simple tasks, prevalent in neuroscientific studies, uniquely determine the low-dimensional dynamics independent of neural architectures? Or alternatively, are the learned dynamics highly sensitive to different neural architectures? Knowing the answer to these questions has strong implications on whether and how to use task-based RNN modeling to understand brain dynamics. To address these foundational questions, we study populations of thousands of networks of commonly used RNN architectures trained to solve neuroscientifically motivated tasks and characterize their low-dimensional dynamics via CCA and nonlinear dynamical systems analysis. We find the geometry of the dynamics can be highly sensitive to different network architectures, and further find striking dissociations between geometric similarity as measured by CCA and network function, yielding a cautionary tale. Moreover, we find that while the geometry of neural dynamics can vary greatly across architectures, the underlying computational scaffold: the topological structure of fixed points, transitions between them, limit cycles, and linearized dynamics, often appears {\\it universal} across all architectures. 
Overall, this analysis of universality and individuality across large populations of RNNs provides a much needed foundation for interpreting quantitative measures of dynamical similarity between RNN and brain dynamics.", "full_text": "Universality and individuality in neural dynamics\nacross large populations of recurrent networks\n\nNiru Maheswaranathan\u2217\nGoogle Brain, Google Inc.\n\nMountain View, CA\nnirum@google.com\n\nAlex H. Williams\u2217\nStanford University\n\nStanford, CA\n\nahwillia@stanford.edu\n\nMatthew D. Golub\nStanford University\n\nStanford, CA\n\nmgolub@stanford.edu\n\nSurya Ganguli\n\nStanford University and Google Brain\nStanford, CA and Mountain View, CA\n\nsganguli@stanford.edu\n\nDavid Sussillo\u2020\n\nGoogle Brain, Google Inc.\n\nMountain View, CA\n\nsussillo@google.com\n\nAbstract\n\nTask-based modeling with recurrent neural networks (RNNs) has emerged as a\npopular way to infer the computational function of different brain regions. These\nmodels are quantitatively assessed by comparing the low-dimensional neural rep-\nresentations of the model with the brain, for example using canonical correlation\nanalysis (CCA). However, the nature of the detailed neurobiological inferences\none can draw from such efforts remains elusive. For example, to what extent does\ntraining neural networks to solve common tasks uniquely determine the network\ndynamics, independent of modeling architectural choices? Or alternatively, are\nthe learned dynamics highly sensitive to different model choices? Knowing the\nanswer to these questions has strong implications for whether and how we should\nuse task-based RNN modeling to understand brain dynamics. To address these\nfoundational questions, we study populations of thousands of networks, with com-\nmonly used RNN architectures, trained to solve neuroscienti\ufb01cally motivated tasks\nand characterize their nonlinear dynamics. 
We find the geometry of the RNN representations can be highly sensitive to different network architectures, yielding a cautionary tale for measures of similarity that rely on representational geometry, such as CCA. Moreover, we find that while the geometry of neural dynamics can vary greatly across architectures, the underlying computational scaffold—the topological structure of fixed points, transitions between them, limit cycles, and linearized dynamics—often appears universal across all architectures.

∗Equal contribution.
†Corresponding author.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

1 Introduction

The computational neuroscience community is increasingly relying on deep learning both to directly model large-scale neural recordings [1, 2, 3] as well as to train neural networks on computational tasks and compare the internal dynamics of such trained networks to measured neural recordings [4, 5, 6, 7, 8, 9]. For example, several recent studies have reported similarities between the internal representations of biological and artificial networks [5, 10, 11, 12, 13, 14, 15, 16]. These representational similarities are quite striking since artificial neural networks clearly differ in many ways from their much more biophysically complex natural counterparts. How, then, should we scientifically interpret the striking representational similarity of biological and artificial networks, despite their vast disparity in biophysical and architectural mechanisms?
A fundamental impediment to achieving any such clear scientific interpretation lies in the fact that infinitely many model networks may be consistent with any particular computational task or neural recording. 
Indeed, many modern applications of deep learning utilize a wide variety of recurrent\nneural network (RNN) architectures [17, 18, 19, 20], initialization strategies [21] and regularization\nterms [22, 23]. Moreover, new architectures continually emerge through large-scale automated\nsearches [24, 25, 26]. This dizzying set of modelling degrees of freedom in deep learning raises\nfundamental questions about how the degree of match between dynamical properties of biological\nand arti\ufb01cial networks varies across different modelling choices used to generate RNNs.\nFor example, do certain properties of RNN dynamics vary widely across individual architectures?\nIf so, then a high degree of match between these properties measured in both an arti\ufb01cial RNN\nand a biological circuit might yield insights into the architecture underlying the biological circuit\u2019s\ndynamics, as well as rule out other potential architectures. Alternatively, are other properties of RNN\ndynamics universal across many architectural classes and other modelling degrees of freedom? If\nso, such properties are interesting neural invariants determined primarily by the task, and we should\nnaturally expect them to recur not only across diverse classes of arti\ufb01cial RNNs, but also in relevant\nbrain circuits that solve the same task. The existence of such universal properties would then provide\na satisfying explanation of certain aspects of the match in internal representations between biological\nand arti\ufb01cial RNNs, despite many disparities in their underlying mechanisms.\nInterestingly, such universal properties can also break the vast design space of RNNs into different\nuniversality classes, with these universal dynamical properties being constant within classes, and\nvarying only between classes. This offers the possibility of theoretically calculating or understanding\nsuch universal properties by analyzing the simplest network within each universality class3. 
Thus\na foundational question in the theory of RNNs, as well as in their application to neuroscienti\ufb01c\nmodelling, lies in ascertaining which aspects of RNN dynamics vary across different architectural\nchoices, and which aspects\u2014if any\u2014are universal across such choices.\nTheoretical clarity on the nature of individuality and universality in nonlinear RNN dynamics is\nlargely lacking4, with some exceptions [29, 30, 31, 32]. Therefore, with the above neuroscienti\ufb01c\nand theoretical motivations in mind, we initiate an extensive numerical study of the variations in\nRNN dynamics across thousands of RNNs with varying modelling choices. We focus on canonical\nneuroscienti\ufb01cally motivated tasks that exemplify basic elements of neural computation, including\nthe storage and maintenance of multiple discrete memories, the production of oscillatory motor-like\ndynamics, and contextual integration in the face of noisy evidence [33, 4].\nTo compare internal representations across networks, we focused on comparing the geometry of\nneural dynamics using common network similarity measures such as singular vector canonical\ncorrelation analysis (SVCCA) [34] and centered kernel alignment (CKA) [35]. We also used tools\nfrom dynamical systems analysis to extract more topological aspects of neural dynamics, including\n\ufb01xed points, limit cycles, and transition pathways between them, as well as the linearized dynamics\naround \ufb01xed points [33]. We focused on these approaches because comparisons between arti\ufb01cial\nand biological network dynamics at the level of geometry, and topology and linearized dynamics, are\noften employed in computational neuroscience.\nUsing these tools, we \ufb01nd that different RNN architectures trained on the same task exhibit both\nuniversal and individualistic dynamical properties. In particular, we \ufb01nd that the geometry of neural\nrepresentations varies considerably across RNNs with different nonlinearities. 
We also \ufb01nd surprising\ndissociations between dynamical similarity and functional similarity, whereby trained and untrained\narchitectures of a given type can be more similar to each other than trained architectures of different\ntypes. This yields a cautionary tale for using SVCCA or CKA to compare neural geometry, as\nthese similarity metrics may be more sensitive to particular modeling choices than to overall task\nperformance. Finally, we \ufb01nd considerably more universality across architectures in the topological\n\n3This situation is akin to that in equilibrium statistical mechanics in which physical materials as disparate\nas water and ferromagnets have identical critical exponents at second order phase transitions, by virtue of the\nfact that they fall within the same universality class [27]. Moreover, these universal critical exponents can be\ncomputed theoretically in the simplest model within this class: the Ising model.\n\n4Although Feigenbaum\u2019s analysis [28] of period doubling in certain 1D maps might be viewed as an analysis\n\nof 1D RNNs.\n\n2\n\n\fstructure of \ufb01xed points, limit cycles, and speci\ufb01c properties of the linearized dynamics about \ufb01xed\npoints. Thus overall, our numerical study provides a much needed foundation for understanding\nuniversality and individuality in network dynamics across various RNN models, a question that is\nboth of intrinsic theoretical interest, and of importance in neuroscienti\ufb01c applications.\n\n2 Methods\n\n2.1 Model Architectures and Training Procedure\n\nWe de\ufb01ne an RNN by an update rule, ht = F (ht\u22121, xt), where F denotes some nonlinear function\nof the network state vector ht\u22121 \u2208 RN and the network input xt \u2208 RM . Here, t is an integer index\ndenoting discrete time steps. Given an initial state, h0, and a stream of T inputs, x1, x2, . . ., xT , the\nRNN states are recursively computed, h1, h2, . . ., hT . 
The model predictions are based on a linear\nreadout of these state vector representations of the input stream. We studied 4 RNN architectures, the\nvanilla RNN (Vanilla), the Update-Gate RNN (UGRNN; [20]), the Gated Recurrent Unit (GRU; [18]),\nand the Long-Short-Term-Memory (LSTM; [17]). The equations for these RNNs can be found in\nAppendix A. For each RNN architecture we modi\ufb01ed the (non-gate) point-wise activation function to\nbe either recti\ufb01ed linear (relu) or hyperbolic tangent (tanh). The point-wise activation for the gating\nunits is kept as a sigmoid.\nWe trained networks for every combination of the following parameters: RNN architecture (Vanilla,\nUGRNN, LSTM, GRU), activation (relu, tanh), number of units/neurons (64, 128, 256), and L2\nregularization (1e-5, 1e-4, 1e-3, 1e-2). This yielded 4\u00d72\u00d73\u00d74 = 96 unique con\ufb01gurations. For each\none of these con\ufb01gurations, we performed a separate random hyperparameter search over gradient\nclipping values [22] (logarithmically spaced from 0.1 to 10) and the learning rate schedule parameters.\nThe learning rate schedule is an exponentially decaying schedule parameterized by the initial rate\n(with search range from 1e-5 to 0.1), decay rate (0.1 to 0.9), and momentum (0 to 1). All networks\nwere trained using stochastic gradient descent with momentum [36, 37] for 20,000 iterations with\na batch size of 64. For each network con\ufb01guration, we selected the best hyperparameters using a\nvalidation set. We additionally trained each of these con\ufb01gurations with 30 random seeds, yielding\n2,880 total networks for analysis for each task. 
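As a concrete reference for the notation above, the generic update rule and linear readout can be sketched for the vanilla architecture (a sketch only: the weight names, shapes, and readout matrix are illustrative, and the gated architectures replace the update F as given in Appendix A):

```python
import numpy as np

def vanilla_rnn_rollout(x, W_h, W_x, b, W_out, h0=None):
    """Roll out h_t = tanh(W_h @ h_{t-1} + W_x @ x_t + b) with a linear readout.

    x: (T, M) input stream. Returns states (T, N) and readouts (T, K).
    """
    T = x.shape[0]
    N = W_h.shape[0]
    h = np.zeros(N) if h0 is None else h0
    states = np.empty((T, N))
    outputs = np.empty((T, W_out.shape[0]))
    for t in range(T):
        h = np.tanh(W_h @ h + W_x @ x[t] + b)  # recurrent update F(h, x_t)
        states[t] = h
        outputs[t] = W_out @ h                 # linear readout of the state
    return states, outputs
```

The trained networks differ only in which F they implement; the readout is linear in every case.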
All networks achieve low error; histograms of the\n\ufb01nal loss values achieved by all networks are available in Appendix C.\n\n2.2 Tasks\n\nWe used three canonical tasks that have been previously studied in the neuroscience literature:\n\nK-bit \ufb02ip-\ufb02op Following [33], RNNs were provided K inputs taking discrete values in\n{\u22121, 0, +1}. The RNN has K outputs, each of which is trained to remember the last non-zero\ninput on its corresponding input. Here we set K = 3, so e.g. output 2 remembers the last non-zero\nstate of input 2 (+1 or -1), but ignores inputs 1 and 3. We set the number of time steps, T , to 100, and\nthe \ufb02ip probability (the probability of any input \ufb02ipping on a particular time step) to 5%.\n\nFrequency-cued sine wave Following [33], RNNs received a static input, x \u223c Uniform(0, 1), and\nwere trained to produce a unit amplitude sine wave, sin(2\u03c0\u03c9t), whose frequency is proportional to\nthe input: \u03c9 = 0.04x + 0.01. We set T = 500 and dt = 0.01 (5 simulated seconds total).\n\nContext-dependent integration (CDI) Following previous work [4], RNNs were provided with\nK static context inputs and K time-varying white noise input streams. On each trial, all but one\ncontext input was zero, thus forming a one-hot encoding indicating which noisy input stream of\nlength T should be integrated. The white noise input was sampled from N (\u00b5, 1) at each time step,\nwith \u00b5 sampled uniformly between -1 and 1 and kept static across time for each trial. RNNs were\ntrained to report the cumulative sum of the cued white-noise input stream across time. Here, we set\nK = 2 and T = 30.\n\n3\n\n\f2.3 Assessing model similarity\n\nThe central questions we examined were: how similar are the representations and dynamics of\ndifferent RNNs trained on the same task? 
To address this, we use approaches that highlight different but sometimes overlapping aspects of RNN function:

SVCCA and CKA to assess representational geometry   We quantified similarity at the level of representational geometry [38]. In essence, this means quantifying whether the responses of two RNNs to the same inputs are well-aligned by some kind of linear transformation.
We focused on singular vector canonical correlation analysis (SVCCA; [34]), which has found traction in both the neuroscience [12] and machine learning [39, 15] communities. SVCCA compares representations in two steps. First, each representation is projected onto its top principal components to remove the effect of noisy (low variance) directions. Typically, the number of components is chosen to retain ~95% of the variance in the representation. Then, canonical correlation analysis (CCA) is performed to find a linear transformation that maximally correlates the two representations. This yields R correlation coefficients, 1 ≥ ρ1 ≥ . . . ≥ ρR ≥ 0, providing a means to compare the two datasets, typically by averaging or summing the coefficients (see Appendix D for further details).
In addition to SVCCA, we explored a related metric, centered kernel alignment (CKA; [35]). CKA is related to SVCCA in that it also suppresses low variance directions; however, CKA weights the components in proportion to their singular values (as opposed to removing some completely). We found that SVCCA and CKA yielded similar results for the purposes of determining whether representations cluster by architecture or activation function, so we present SVCCA results in the main text and provide a comparison with CKA in Appendix E.

Fixed point topology to assess computation   An alternative perspective to representational geometry for understanding computation in RNNs is dynamics. 
We studied RNN dynamics by reducing their nonlinear dynamics to linear approximations. Briefly, this approach starts by numerically optimizing to find the fixed points {h∗1, h∗2, . . .} of an RNN, i.e. states such that h∗i ≈ F(h∗i, x∗). We use the term fixed point to also include approximate fixed points, which are not truly fixed but are nevertheless very slow on the time scale of the task.
We set the input (x∗) to be static when finding fixed points. These inputs can be thought of as specifying different task conditions. In particular, the static command frequency in the sine wave task and the one-hot context signal in the CDI task are examples of such condition-specifying inputs. Note, however, that dimensions of x that are time-varying are set to 0 in x∗. In particular, the dimensions of the input that represent the input pulses in the 3-bit memory task and the white noise input streams in the CDI task are set to 0 in x∗.
Numerical procedures for identifying fixed points are discussed in [33, 40]. Around each fixed point, the local behavior of the system can be approximated by a reduced system with linear dynamics:

ht ≈ h∗ + J(h∗, x∗) (ht−1 − h∗),

where Jij(h∗, x∗) = ∂Fi(h∗, x∗)/∂h∗j denotes the Jacobian of the RNN update rule. We studied these linearized systems using the eigenvector decomposition for non-normal matrices (see Appendix B). In this analysis, both the topology of the fixed points and the linearizations around those fixed points become objects of interest.

Visualizing similarity with multi-dimensional scaling   For each analysis, we computed network similarity between all pairs of network configurations for a given task, yielding a large (dis-)similarity matrix for each task (for example, we show this distance matrix for the flip-flop task in Fig. 1c). 
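The SVCCA distance that fills these matrices can be sketched as follows (a sketch of the two-step procedure described in the Methods: the ~95% variance cutoff and the one-minus-mean-correlation distance follow the text, while the implementation details here are assumptions):

```python
import numpy as np

def svcca_distance(X, Y, var_frac=0.95):
    """1 minus the mean canonical correlation after projecting onto top PCs.

    X, Y: (samples, units) activation matrices of two networks driven by
    the same inputs.
    """
    def top_components(A):
        A = A - A.mean(axis=0)                       # center each unit
        U, s, _ = np.linalg.svd(A, full_matrices=False)
        var = np.cumsum(s**2) / np.sum(s**2)
        k = int(np.searchsorted(var, var_frac)) + 1  # keep ~95% of variance
        return U[:, :k]                              # orthonormal PC scores
    Ux, Uy = top_components(X), top_components(Y)
    # For whitened (orthonormal) scores, CCA reduces to an SVD: the singular
    # values of Ux.T @ Uy are the canonical correlations rho_1 >= ... >= rho_R.
    rho = np.linalg.svd(Ux.T @ Uy, compute_uv=False)
    rho = np.clip(rho, 0.0, 1.0)
    return 1.0 - rho.mean()
```

Because the distance depends only on the retained subspaces, any invertible linear re-expression of the same low-dimensional dynamics scores as (nearly) identical.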
To visualize the structure in these matrices, we used multi-dimensional scaling (MDS) [41] to generate a 2D embedding (Fig. 1d and f, Fig. 2c and e, Fig. 3c and d). For visualization purposes, we show separate plots colored by RNN architecture (for a fixed nonlinearity, tanh) and by nonlinearity (for a fixed architecture, Vanilla).

3 Results

The major contributions in this paper are as follows. First, we carefully train and tune large populations of RNNs on several canonical tasks relating to discrete memory [33], pattern generation [33],

4

Figure 1: 3-bit discrete memory. a) Inputs (black) of -1 or 1 come in at random times while the corresponding output (dashed red) has to remember the last non-zero state of the input (either +1 or -1). b) Example PCA trajectories of dynamics for an example architecture and activation function. c) Dynamics across networks are compared via SVCCA and given a distance (one minus the average correlation coefficient), yielding a network-network distance matrix. d) This distance matrix is used to create a 2D embedding via multidimensional scaling (MDS) of all networks, showing clustering based on RNN architecture (left) and activation function (right). e) Topological analysis of a network using fixed points. First, the fixed points of a network's dynamics are found, and their linear stability is assessed (left; black dots: stable fixed points, red: one unstable dimension, green: two unstable dimensions, blue: three unstable dimensions). By studying heteroclinic and homoclinic orbits, the fixed point structure is translated to a graph representation (right). f) This graph representation is then compared across networks, creating another network-network distance matrix. 
The distance matrix is used to embed the network comparisons into 2D space using MDS, showing that the topological representation of a network using fixed point structure is more similar across architectures (left) and activation functions (right) than the geometry of the network is (layout as in 1d).

and analog memory and integration [4]. Then, we show that representational geometry is sensitive to model architecture (Figs. 1-3). Next, we show that all RNN architectures, including complex, gated architectures (e.g. LSTM and GRU), converge to qualitatively similar dynamical solutions, as quantified by the topology of fixed points and corresponding linearized dynamics (Figs. 1-3). Finally, we highlight a case where SVCCA is not necessarily indicative of functional similarity (Fig. 4).

3.1 3-bit discrete memory

We trained RNNs to store and report three discrete binary inputs (Fig. 1a). In Fig. 1b, we use a simple “probe input” consisting of a series of random inputs to highlight the network structure. Across all network architectures the resulting trajectories roughly trace out the corners of a three-dimensional cube. While these example trajectories look qualitatively similar across architectures, SVCCA revealed systematic differences. This is visible in the raw SVCCA distance matrix (Fig. 1c), as well as in low-dimensional linear embeddings created from the SVCCA distance matrix by applying multi-dimensional scaling (MDS) (Fig. 1d).
To study the dynamics of these networks, we ran an optimization procedure [40] to numerically identify fixed points for each trained network (see Methods). A representative network is shown in Fig. 1e (left). 
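The fixed-point optimization used here can be sketched as follows (a sketch following the general recipe of [33, 40]: minimize the speed q(h) = ½‖F(h, x∗) − h‖² by gradient descent; the toy contractive tanh network and all parameter values below are illustrative stand-ins for the trained RNNs):

```python
import numpy as np

def find_fixed_point(F, jac, h0, lr=0.1, tol=1e-8, max_iter=50000):
    """Minimize the speed q(h) = 0.5 * ||F(h) - h||^2 by gradient descent.

    F is the state update with inputs frozen at x*; jac(h) is its Jacobian.
    Returns the candidate fixed point and its final speed q.
    """
    h = h0.copy()
    q = np.inf
    I = np.eye(h.size)
    for _ in range(max_iter):
        r = F(h) - h                       # residual; zero exactly at a fixed point
        q = 0.5 * r @ r
        if q < tol:
            break
        h = h - lr * (jac(h) - I).T @ r    # gradient of q with respect to h
    return h, q

# Toy contractive tanh network (weights illustrative, not a trained RNN)
rng = np.random.default_rng(1)
N = 6
W = 0.3 * rng.standard_normal((N, N)) / np.sqrt(N)
b = 0.1 * rng.standard_normal(N)
F = lambda h: np.tanh(W @ h + b)
jac = lambda h: np.diag(1.0 - np.tanh(W @ h + b) ** 2) @ W
h_star, speed = find_fixed_point(F, jac, rng.standard_normal(N))
```

Around each such point, jac(h_star) then supplies the linearized dynamics analyzed throughout the text.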
The network solves the task by encoding all 2³ possible outputs as 8 stable fixed points. Furthermore, there are saddle points with one, two, or three unstable dimensions (see caption), which route the network activity towards the appropriate stable fixed point for a given input.
We devised an automated procedure to quantify the computational logic of the fixed point structure in Fig. 1e that effectively ignored the precise details in the transient dynamics and overall geometry of the 3D cube evident in the PCA trajectories. Specifically, we distilled the dynamical trajectories into a directed graph, with nodes representing fixed points, and weighted edges representing the probability of moving from one fixed point to another when starting the initial state a small distance

5

Figure 2: Sine wave generation. a) Schematic showing conversion of a static input specifying a command frequency, ω, for the sine wave output sin(2πωt). b) PCA plots showing trajectories using many evenly divided command frequencies delivered one at a time (blue: smallest ω, yellow: largest ω). c) MDS plots based on SVCCA network-network distances, layout as in Fig. 1d. d) Left, fixed points (colored circles, with color indicating ω, one fixed point per command frequency) showing a single fixed point in the middle of each oscillatory trajectory. Right, the complex eigenvalues of all the linearized systems, one per fixed point, overlaid on top of each other, with primary oscillatory eigenvalues colored as in panel b. 
e) MDS network-network distances based on fixed point topology, assessing systematic differences in the topology of the input-dependent fixed points (layout as in Fig. 1d). f) Summary analysis showing the frequency of the oscillatory mode in the linearized system vs. command frequency for different architectures (left) and activations (right). Solid line and shaded patch show the mean ± standard error over networks trained with different random seeds. Small, though systematic, variations exist in the frequency of each oscillatory mode.

away from the first fixed point. We did this 100 times for each fixed point, yielding a probability of transitioning from one fixed point to another. As expected, stable fixed points have no outgoing edges, and only have a self-loop. All unstable fixed points had two or more outgoing edges, which are directed at nearby stable fixed points. We constructed a fixed point graph for each network and used the Euclidean distance between the graph connectivity matrices to quantify dis-similarity5. These heteroclinic orbits are shown in Fig. 1e as light black trajectories from one fixed point to another. Using this topological measure of RNN similarity, we find that all architectures converge to very similar solutions, as shown by an MDS embedding of the fixed point graph (Fig. 1f).

3.2 Sine wave generation

We trained RNNs to convert a static input into a sine wave, e.g. to convert the command frequency ω into sin(2πωt) (Fig. 2a). Fig. 2b shows low-dimensional trajectories in trained networks across all architectures and nonlinearities (LSTM with ReLU did not train effectively, so we excluded it). Each

5While determining whether two graphs are isomorphic is a challenging problem in general, we circumvented this issue by lexicographically ordering the fixed points based on the RNN readout. 
Networks with different numbers of fixed points than the modal number were discarded (less than 10% of the population).

6

Figure 3: Context-Dependent Integration. a) One of two streams of white-noise input (blue or red) is contextually selected by a one-hot static context input to be integrated as the output of the network, while the other is ignored. b) The trained networks were studied with probe inputs (inset in panel a; probes vary from blue to red). For this and subsequent panels, only one context is shown for clarity. Shown in b are the PCA plots of RNN hidden states when driven by probe inputs (blue to red). The fixed points (black dots) show approximate line attractors for all RNN architectures and nonlinearities. c) MDS embedding of SVCCA network-network distances comparing representations based on architecture (left) and activation (right), layout as in Fig. 1d. d) Using the same method as in the sine-wave example to assess the topology of the input-dependent fixed points, we embedded the network-network distances using the topological structure of the line attractor (colored by architecture (left) and activation (right), layout as in Fig. 1d). e) Average sorted eigenvalues as a function of architecture. Solid line and shaded patch show mean ± standard error over networks trained with different random seeds. 
f) Output of the network when probed with a unit magnitude input using the linearized dynamics, averaged over all fixed points on the line attractor, as a function of architecture and number of linear modes retained. In order to study the dimensionality of the solution to integration, we systematically removed the modes with the smallest eigenvalues one at a time, and recomputed the prediction of the new linear system for the unit magnitude input. These plots indicate that the vanilla RNN (blue) uses a single mode to perform the integration, while the gated architectures distribute this across a larger number of linear modes.

trajectory is colored by the input frequency. Furthermore, all trajectories followed a similar pattern: oscillations occur in a roughly 2D subspace (circular trajectories), with separate circles for each frequency input separated along a third dimension. We then performed an analogous series of analyses to those used in the previous task. In particular, we computed the SVCCA distances (raw distances not shown) and used those to create an embedding of the network activity (Fig. 2c) as a function of either RNN architecture or activation. These SVCCA MDS summaries show systematic differences in the representations across both architecture and activation.
Moving to the analysis of dynamics, we found for each input frequency a single input-dependent fixed point (Fig. 2d, left). We studied the linearized dynamics around each fixed point and found a single pair of imaginary eigenvalues, representing a mildly unstable oscillatory mode whose complex angle aligned well with the input frequency (Fig. 2d, right). We compared the frequency of the linear model to the input frequency and found generally good alignment. We averaged the linear frequency across all networks within architecture or activation and found small, but systematic, differences (Fig. 2f). 
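The comparison between the linearized mode and the command frequency can be sketched as follows (a sketch: a discrete-time eigenvalue λ = r·e^{iθ} rotates its mode by θ radians per step, i.e. θ/(2π·dt) Hz with dt = 0.01 as in the task; the rotation-matrix Jacobian below is purely illustrative):

```python
import numpy as np

def linearized_frequency(J, dt=0.01):
    """Oscillation frequency (Hz) implied by the leading complex eigenvalue
    pair of a linearized update h_t ≈ J @ h_{t-1} (plus constant terms)."""
    eigvals = np.linalg.eigvals(J)
    osc = eigvals[np.abs(eigvals.imag) > 1e-9]  # keep only oscillatory modes
    lam = osc[np.argmax(np.abs(osc))]           # leading (largest-|λ|) mode
    return np.abs(np.angle(lam)) / (2.0 * np.pi * dt)

# A pure rotation by θ per step encodes a 2 Hz oscillation at dt = 0.01 s
theta = 2.0 * np.pi * 2.0 * 0.01
J = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
```

Applied to the Jacobians at the input-dependent fixed points, this recovers the linear frequencies plotted against the command frequency in Fig. 2f.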
Embeddings of the topological structure of the input-dependent fixed points did not reveal any structure that systematically varied by architecture or activation (Fig. 2e).

7

Figure 4: An example where SVCCA yields a stronger correlation between untrained networks and trained networks than between trained networks with different nonlinearities. a) An example (single context shown) of the representation of the probe inputs (blue through red) for four networks: two trained, and two untrained, with tanh and ReLU nonlinearities. In this case the untrained tanh and ReLU networks have a higher correlation to the trained tanh network than the trained tanh network does to the trained ReLU network. b) MDS plot of SVCCA-based distances for many trained and untrained networks, showing that trained and untrained relu networks are more similar to each other on average than to tanh networks.

3.3 Context-dependent integration (analog memory)

We trained an RNN to contextually integrate one of two white noise input streams, while ignoring the other (Fig. 3a). We then studied the network representations by delivering a set of probe inputs (Fig. 3a). The 3D PCA plots are shown in Fig. 
3b, showing obvious differences in representational\ngeometry as a function of architecture and activation. The MDS summary plot of the SVCCA\ndistances of the representations is shown in Fig. 3c, again showing systematic clustering as a function\nof architecture (left) and activation (right). We also analyzed the topology of the \ufb01xed points (black\ndots in Fig. 3b) to assess how well the \ufb01xed points approximated a line attractor. We quanti\ufb01ed this\nby generating a graph with edges between \ufb01xed points that were nearest neighbors. This resulted in a\ngraph for each line attractor in each context, which we then compared using Euclidean distance and\nembedded in a 2D space using MDS (Fig. 3d). The MDS summary plot did not cluster strongly by\narchitecture, but did cluster based on activation.\nWe then studied the linearized dynamics around each \ufb01xed point (Fig. 3e,f). We focused on a single\ncontext, and studied how a unit magnitude relevant input (as opposed to the input that should be\ncontextually ignored) was integrated by the linear system around the nearest \ufb01xed point. This was\npreviously studied in depth in [4]. Here we were interested in differences in integration strategy as a\nfunction of architecture. We found similar results to [4] for the vanilla RNN, which integrated the\ninput using a single linear mode with an eigenvalue of 1, with input coming in on the associated\nleft eigenvector and represented on the associated right eigenvector. Examination of all linearized\ndynamics averaged over all \ufb01xed points within the context showed that different architectures had\na similar strategy, except that the gated architectures had many more eigenvalues near 1 (Fig. 3e)\nand thus used a high-dimensional strategy to accomplish the same goal as the vanilla RNN does in 1\ndimension. 
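The eigenmode view of linearized integration described above can be sketched numerically. The snippet below is a toy illustration, not the paper's code: the Jacobian is hand-built with one eigenvalue-1 integration mode plus decaying modes (standing in for the linearization around a fixed point on a line attractor), and the readout vector is an arbitrary stand-in.

```python
import numpy as np

def linearized_integral(J, B, w, n_steps=100, keep=None):
    """Readout w . x_n of the linearized system x_{t+1} = J x_t + B (unit input
    held on for n_steps), decomposed over the eigenmodes of J. Each mode `a`
    contributes (w . r_a) * (sum_t lambda_a^t) * (l_a . B); `keep` retains only
    the largest-|lambda| modes, mimicking the mode-removal analysis."""
    lam, R = np.linalg.eig(J)                # columns of R: right eigenvectors
    L = np.linalg.inv(R)                     # rows of L: left eigenvectors
    order = np.argsort(-np.abs(lam))         # sort modes by |eigenvalue|
    lam, R, L = lam[order], R[:, order], L[order]
    if keep is not None:
        lam, R, L = lam[:keep], R[:, :keep], L[:keep]
    # Each mode accumulates a geometric series of its input projection.
    series = np.array([np.sum(l ** np.arange(n_steps)) for l in lam])
    return float(np.real((w @ R) @ (series * (L @ B))))

# Toy Jacobian: one integration mode (eigenvalue 1) plus decaying modes.
rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.standard_normal((8, 8)))
lam_true = np.array([1.0, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05, 0.01])
J = Q @ np.diag(lam_true) @ Q.T              # symmetric toy Jacobian
B = Q[:, 0] + 0.3 * Q[:, 1]                  # input hits the integrator plus one decaying mode
B /= np.linalg.norm(B)                       # unit-magnitude input
w = Q.sum(axis=1)                            # readout with weight on every mode

full = linearized_integral(J, B, w)          # all modes kept
top1 = linearized_integral(J, B, w, keep=1)  # only the lambda = 1 mode
```

Here keeping only the eigenvalue-1 mode already recovers nearly the full response, the vanilla-RNN-like regime; for a gated architecture, many eigenvalues near 1 would make the truncated response degrade much more gradually with `keep`.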
We further studied the dimensionality by systematically zeroing out eigenvalues from smallest to largest to discover how many linear modes were necessary to integrate a unit magnitude input, compared to the full linear approximation (Fig. 3f). These results show that all of the networks and architectures use essentially the same integration strategy, but systematically vary by architecture in the number of modes they employ. To a lesser degree, they also vary in how much higher-order terms contribute to the solution, as shown by the deviations from an integral of 1 for a unit magnitude input in the full linearized system with no modes zeroed out (analogous to Fig. 2f).

Finally, to highlight the difficulty of using CCA-based techniques to compare representational geometry in simple tasks, we used the inputs of the context-dependent integration task to drive both trained and untrained vanilla RNNs (Fig. 4). We found that the average canonical correlation between trained and untrained networks can be larger than between trained RNNs with different nonlinearities. The summary MDS plot across many RNNs shows that the clusters of trained and untrained ReLU networks lie closer to each other than to the clusters of trained tanh networks (Fig. 4b).

4 Related Work

Researchers are beginning to study, both empirically and theoretically, how deep networks may show universal properties. For example, [32] proved that representational geometry is a universal property among all trained deep linear networks that solve a task optimally with smallest-norm weights. Also, [42, 43] studied how expressive capacity increases with network depth and width. Work on RNNs is far more preliminary, though it is well known that RNNs are universal approximators of dynamical systems [44].
More recently, the per-parameter capacity of RNNs was found to be\nremarkably similar across various RNN architectures [20]. The authors of [45] studied all the possible\ntopological arrangements of \ufb01xed points in a 2D continuous-time GRU, conjecturing that dynamical\ncon\ufb01gurations such as line or ring attractors that require an in\ufb01nite number of \ufb01xed points can only\nbe created in approximation, even in GRUs with more than two dimensions.\nUnderstanding biological neural systems in terms of arti\ufb01cial dynamical systems has a rich tradition\n[46, 47, 48, 49]. Researchers have attempted to understand optimized neural networks with nonlinear\ndynamical systems techniques [33, 50] and to compare those arti\ufb01cial networks to biological circuits\n[4, 12, 51, 52, 53, 13, 14].\nPrevious work has studied vanilla RNNs in similar settings [33, 4, 54], but has not systematically\nsurveyed the variability in network dynamics across commonly used RNN architectures, such as\nLSTMs [17] or GRUs [18], nor quanti\ufb01ed variations in dynamical solutions over architecture and\nnonlinearity, although [16] considers many issues concerning how RNNs may hold memory. Finally,\nthere has been a recent line of work comparing arti\ufb01cial network representations to neural data [1, 2,\n3, 10, 11, 12]. Investigators have been studying ways to improve the utility of CCA-based comparison\nmethods [34, 55], as well as comparing CCA to other methods [35].\n\n5 Discussion\n\nIn this work we empirically study aspects of individuality and universality in recurrent networks.\nWe \ufb01nd individuality in that representational geometry of RNNs varies signi\ufb01cantly as a function\nof architecture and activation function (Fig. 1d, 2c, 3c). We also see hints of universality: the \ufb01xed\npoint topologies show far less variation across networks than the representations do (Fig. 1f, 2e, 3d).\nLinear analyses also showed similar solutions, e.g. 
essentially linear oscillations for the sine wave task (Fig. 2f) and linear integration in the context-dependent integration (CDI) task (Fig. 3f). However, linear analyses also showed variation across architectures in the dimensionality of the solution to integration (Fig. 3e).

While the linear analyses showed common computational strategies across all architectures (such as a slightly unstable oscillation in the linearized system around each fixed point), we did see small systematic differences that clustered by architecture (such as the difference between the input frequency and the frequency of the oscillatory mode in the linearized system). This indicates that another aspect of individuality appears to be the degree to which higher-order terms contribute to the total solution.

The fixed point analysis discussed here has one major limitation, namely that the number of fixed points must be the same across networks that are being compared. For the three tasks studied here, we found that the vast majority of trained networks did indeed have the same number of fixed points for each task. However, an important direction for future work is extending the analysis to be more robust with respect to differing numbers of fixed points.

In summary, we hope this empirical study begins a larger effort to characterize methods for comparing RNN dynamics, building a foundation for future connections between biological circuits and artificial neural networks.

Acknowledgments

The authors would like to thank Jeffrey Pennington, Maithra Raghu, Jascha Sohl-Dickstein, and Larry Abbott for helpful feedback and discussions. MDG was supported by the Stanford Neurosciences Institute and the Office of Naval Research Grant #N00014-18-1-2158.

References

[1] Lane McIntosh, Niru Maheswaranathan, Aran Nayebi, Surya Ganguli, and Stephen Baccus. “Deep Learning Models of the Retinal Response to Natural Scenes”. In: Advances in Neural Information Processing Systems 29. Ed.
by D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett. Curran Associates, Inc., 2016, pp. 1369–1377. URL: http://papers.nips.cc/paper/6388-deep-learning-models-of-the-retinal-response-to-natural-scenes.pdf.

[2] Niru Maheswaranathan, Lane T McIntosh, David B Kastner, Josh Melander, Luke Brezovec, Aran Nayebi, Julia Wang, Surya Ganguli, and Stephen A Baccus. “Deep learning models reveal internal structure and diverse computations in the retina under natural scenes”. In: bioRxiv (2018), p. 340943.

[3] Chethan Pandarinath, Daniel J O’Shea, Jasmine Collins, Rafal Jozefowicz, Sergey D Stavisky, Jonathan C Kao, Eric M Trautmann, Matthew T Kaufman, Stephen I Ryu, Leigh R Hochberg, Jaimie M Henderson, Krishna V Shenoy, L F Abbott, and David Sussillo. “Inferring single-trial neural population dynamics using sequential auto-encoders”. In: Nature Methods 15.10 (2018), pp. 805–815. ISSN: 1548-7105. DOI: 10.1038/s41592-018-0109-9. URL: https://doi.org/10.1038/s41592-018-0109-9.

[4] Valerio Mante, David Sussillo, Krishna V. Shenoy, and William T. Newsome. “Context-dependent computation by recurrent dynamics in prefrontal cortex”. In: Nature 503 (2013), p. 78.

[5] Alexander J E Kell, Daniel L K Yamins, Erica N Shook, Sam V Norman-Haignere, and Josh H McDermott. “A Task-Optimized Neural Network Replicates Human Auditory Behavior, Predicts Brain Responses, and Reveals a Cortical Processing Hierarchy”. In: Neuron 98.3 (May 2018), 630–644.e16. ISSN: 0896-6273. DOI: 10.1016/j.neuron.2018.03.044. URL: https://doi.org/10.1016/j.neuron.2018.03.044.

[6] Rishi Rajalingham, Elias B. Issa, Pouya Bashivan, Kohitij Kar, Kailyn Schmidt, and James J. DiCarlo. “Large-Scale, High-Resolution Comparison of the Core Visual Object Recognition Behavior of Humans, Monkeys, and State-of-the-Art Deep Artificial Neural Networks”.
In: Journal of Neuroscience 38.33 (2018), pp. 7255–7269. ISSN: 0270-6474. DOI: 10.1523/JNEUROSCI.0388-18.2018. URL: http://www.jneurosci.org/content/38/33/7255.

[7] Christopher J Cueva and Xue-Xin Wei. “Emergence of grid-like representations by training recurrent neural networks to perform spatial localization”. In: arXiv preprint arXiv:1803.07770 (2018).

[8] Andrea Banino, Caswell Barry, Benigno Uria, Charles Blundell, Timothy Lillicrap, Piotr Mirowski, Alexander Pritzel, Martin J Chadwick, Thomas Degris, Joseph Modayil, et al. “Vector-based navigation using grid-like representations in artificial agents”. In: Nature 557.7705 (2018), p. 429.

[9] Stefano Recanatesi, Matthew Farrell, Guillaume Lajoie, Sophie Deneve, Mattia Rigotti, and Eric Shea-Brown. “Predictive learning extracts latent space representations from sensory observations”. In: bioRxiv (2019). DOI: 10.1101/471987. URL: https://www.biorxiv.org/content/early/2019/07/13/471987.

[10] Daniel L. K. Yamins, Ha Hong, Charles F. Cadieu, Ethan A. Solomon, Darren Seibert, and James J. DiCarlo. “Performance-optimized hierarchical models predict neural responses in higher visual cortex”. In: Proceedings of the National Academy of Sciences 111.23 (2014), pp. 8619–8624. ISSN: 0027-8424. DOI: 10.1073/pnas.1403112111.

[11] Seyed-Mahdi Khaligh-Razavi and Nikolaus Kriegeskorte. “Deep Supervised, but Not Unsupervised, Models May Explain IT Cortical Representation”. In: PLOS Computational Biology 10.11 (Nov. 2014), pp. 1–29. DOI: 10.1371/journal.pcbi.1003915. URL: https://doi.org/10.1371/journal.pcbi.1003915.

[12] David Sussillo, Mark M Churchland, Matthew T Kaufman, and Krishna V Shenoy.
\u201cA neural\nnetwork that \ufb01nds a naturalistic solution for the production of muscle activity\u201d. In: Nature\nneuroscience 18.7 (2015), p. 1025.\n\n10\n\n\f[13] Evan D Remington, Devika Narain, Eghbal A Hosseini, and Mehrdad Jazayeri. \u201cFlexible\nSensorimotor Computations through Rapid Recon\ufb01guration of Cortical Dynamics\u201d. In: Neuron\n98.5 (2018), 1005\u20131019.e5. ISSN: 0896-6273. DOI: 10.1016/j.neuron.2018.05.020.\nJing Wang, Devika Narain, Eghbal A Hosseini, and Mehrdad Jazayeri. \u201cFlexible timing by\ntemporal scaling of cortical responses\u201d. In: Nature neuroscience 21.1 (2018), p. 102.\n\n[14]\n\n[15] David GT Barrett, Ari S Morcos, and Jakob H Macke. \u201cAnalyzing biological and arti\ufb01cial neu-\nral networks: challenges with opportunities for synergy?\u201d In: Current Opinion in Neurobiology\n55 (2019). Machine Learning, Big Data, and Neuroscience, pp. 55\u201364. ISSN: 0959-4388.\n\n[16] A Emin Orhan and Wei Ji Ma. \u201cA diverse range of factors affect the nature of neural represen-\ntations underlying short-term memory\u201d. In: Nature Neuroscience 22.2 (2019), pp. 275\u2013283.\nISSN: 1546-1726. DOI: 10.1038/s41593- 018- 0314- y. URL: https://doi.org/10.\n1038/s41593-018-0314-y.\n\n[17] Sepp Hochreiter and J\u00fcrgen Schmidhuber. \u201cLong short-term memory\u201d. In: Neural computation\n\n9.8 (1997), pp. 1735\u20131780.\n\n[18] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Fethi Bougares, Holger Schwenk,\nand Yoshua Bengio. \u201cLearning Phrase Representations using RNN Encoder-Decoder for\nStatistical Machine Translation\u201d. In: Proc. Conference on Empirical Methods in Natural\nLanguage Processing. Unknown, Unknown Region, 2014.\n\n[19] Scott Wisdom, Thomas Powers, John Hershey, Jonathan Le Roux, and Les Atlas. \u201cFull-\nCapacity Unitary Recurrent Neural Networks\u201d. In: Advances in Neural Information Processing\nSystems 29. Ed. by D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. 
Garnett. 2016, pp. 4880–4888.

[20] Jasmine Collins, Jascha Sohl-Dickstein, and David Sussillo. “Capacity and Trainability in Recurrent Neural Networks”. In: ICLR. 2017.

[21] Quoc V. Le, Navdeep Jaitly, and Geoffrey E. Hinton. A Simple Way to Initialize Recurrent Networks of Rectified Linear Units. 2015. eprint: arXiv:1504.00941.

[22] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. “On the Difficulty of Training Recurrent Neural Networks”. In: Proceedings of the 30th International Conference on Machine Learning. ICML’13. Atlanta, GA, USA, 2013, pp. III-1310–III-1318.

[23] Stephen Merity, Nitish Shirish Keskar, and Richard Socher. “Regularizing and Optimizing LSTM Language Models”. In: ICLR. 2018.

[24] Barret Zoph and Quoc V. Le. “Neural Architecture Search with Reinforcement Learning”. In: 2017.

[25] Hieu Pham, Melody Y. Guan, Barret Zoph, Quoc V. Le, and Jeff Dean. “Efficient Neural Architecture Search via Parameter Sharing”. In: ICML. 2018.

[26] Liang-chieh Chen, Maxwell Collins, Yukun Zhu, George Papandreou, Barret Zoph, Florian Schroff, Hartwig Adam, and Jonathon Shlens. “Searching for Efficient Multi-Scale Architectures for Dense Image Prediction”. In: 2018. URL: https://arxiv.org/pdf/1809.04184.pdf.

[27] Harry Eugene Stanley. Introduction to Phase Transitions and Critical Phenomena. Oxford University Press, 1971.

[28] Mitchell J Feigenbaum. “Universal behavior in nonlinear systems”. In: Universality in Chaos, 2nd edition. Routledge, 2017, pp. 49–50.

[29] Alexander Rivkind and Omri Barak. “Local dynamics in trained recurrent neural networks”. In: Physical review letters 118.25 (2017), p. 258101.

[30] Francesca Mastrogiuseppe and Srdjan Ostojic.
\u201cLinking connectivity, dynamics, and computa-\n\ntions in low-rank recurrent neural networks\u201d. In: Neuron 99.3 (2018), pp. 609\u2013623.\n\n[31] Francesca Mastrogiuseppe and Srdjan Ostojic. \u201cA Geometrical Analysis of Global Stability in\n\nTrained Feedback Networks\u201d. In: Neural computation 31.6 (2019), pp. 1139\u20131182.\n\n[32] Andrew M Saxe, James L McClelland, and Surya Ganguli. \u201cA mathematical theory of semantic\n\ndevelopment in deep neural networks\u201d. In: Proc. Natl. Acad. Sci. U. S. A. (May 2019).\n\n[33] David Sussillo and Omri Barak. \u201cOpening the Black Box: Low-Dimensional Dynamics in\nHigh-Dimensional Recurrent Neural Networks\u201d. In: Neural Computation 25.3 (2013), pp. 626\u2013\n649. DOI: 10.1162/NECO_a_00409.\n\n11\n\n\f[34] Maithra Raghu, Justin Gilmer, Jason Yosinski, and Jascha Sohl-Dickstein. \u201cSVCCA: Singular\nVector Canonical Correlation Analysis for Deep Learning Dynamics and Interpretability\u201d.\nIn: Advances in Neural Information Processing Systems 30. Ed. by I. Guyon, U. V. Luxburg,\nS. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett. Curran Associates, Inc.,\n2017, pp. 6076\u20136085.\n\n[35] Simon Kornblith, Mohammad Norouzi, Honglak Lee, and Geoffrey Hinton. \u201cSimilarity of\n\nNeural Network Representations Revisited\u201d. In: arXiv preprint arXiv:1905.00414 (2019).\n\n[36] Boris T Polyak. \u201cSome methods of speeding up the convergence of iteration methods\u201d. In:\n\nUSSR Computational Mathematics and Mathematical Physics 4.5 (1964), pp. 1\u201317.\nIlya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. \u201cOn the importance of\ninitialization and momentum in deep learning\u201d. In: International conference on machine\nlearning. 2013, pp. 1139\u20131147.\n\n[37]\n\n[38] Nikolaus Kriegeskorte and Rogier A. Kievit. \u201cRepresentational geometry: integrating cognition,\ncomputation, and the brain\u201d. In: Trends in Cognitive Sciences 17.8 (2013), pp. 
401\u2013412. ISSN:\n1364-6613.\n\n[39] Saskia E. J. de Vries, Jerome Lecoq, Michael A. Buice, Peter A. Groblewski, Gabriel K. Ocker,\nMichael Oliver, David Feng, Nicholas Cain, Peter Ledochowitsch, Daniel Millman, et al. \u201cA\nlarge-scale, standardized physiological survey reveals higher order coding throughout the\nmouse visual cortex\u201d. In: bioRxiv (2018). DOI: 10.1101/359513.\n\n[40] Matthew Golub and David Sussillo. \u201cFixedPointFinder: A Tensor\ufb02ow toolbox for identifying\nand characterizing \ufb01xed points in recurrent neural networks\u201d. In: Journal of Open Source\nSoftware 3.31 (Nov. 2018), p. 1003. DOI: 10.21105/joss.01003. URL: https://doi.\norg/10.21105/joss.01003.\nIngwer Borg and Patrick Groenen. \u201cModern multidimensional scaling: Theory and applica-\ntions\u201d. In: Journal of Educational Measurement 40.3 (2003), pp. 277\u2013280.\n\n[41]\n\n[42] Ben Poole, Subhaneil Lahiri, Maithra Raghu, Jascha Sohl-Dickstein, and Surya Ganguli.\n\u201cExponential expressivity in deep neural networks through transient chaos\u201d. In: Advances in\nneural information processing systems. 2016, pp. 3360\u20133368.\n\n[43] Maithra Raghu, Ben Poole, Jon Kleinberg, Surya Ganguli, and Jascha Sohl Dickstein. \u201cOn\nthe expressive power of deep neural networks\u201d. In: Proceedings of the 34th International\nConference on Machine Learning-Volume 70. JMLR. org. 2017, pp. 2847\u20132854.\n\n[44] Kenji Doya. \u201cUniversality of fully connected recurrent neural networks\u201d. In: Dept. of Biology,\n\nUCSD, Tech. Rep (1993).\nIan D. Jordan, Piotr Aleksander Sokol, and Il Memming Park. \u201cGated recurrent units viewed\nthrough the lens of continuous time dynamical systems\u201d. In: CoRR abs/1906.01005 (2019).\narXiv: 1906.01005. URL: http://arxiv.org/abs/1906.01005.\n\n[45]\n\n[46] Alain Destexhe and Terrence J Sejnowski. \u201cThe Wilson\u2013Cowan model, 36 years later\u201d. In:\n\nBiological cybernetics 101.1 (2009), pp. 
1\u20132.\nJohn J Hop\ufb01eld. \u201cNeural networks and physical systems with emergent collective computa-\ntional abilities\u201d. In: Proceedings of the national academy of sciences 79.8 (1982), pp. 2554\u2013\n2558.\n\n[47]\n\n[48] Haim Sompolinsky, Andrea Crisanti, and Hans-Jurgen Sommers. \u201cChaos in random neural\n\nnetworks\u201d. In: Physical review letters 61.3 (1988), p. 259.\n\n[49] H. S. Seung. \u201cHow the brain keeps the eyes still\u201d. In: Proceedings of the National Academy\nof Sciences 93.23 (1996), pp. 13339\u201313344. ISSN: 0027-8424. DOI: 10.1073/pnas.93.23.\n13339.\n\n[50] Omri Barak, David Sussillo, Ranulfo Romo, Misha Tsodyks, and LF Abbott. \u201cFrom \ufb01xed\npoints to chaos: three models of delayed discrimination\u201d. In: Progress in neurobiology 103\n(2013), pp. 214\u2013222.\n\n[51] David Sussillo. \u201cNeural circuits as computational dynamical systems\u201d. In: Current opinion in\n\nneurobiology 25 (2014), pp. 156\u2013163.\n\n[52] Kanaka Rajan, Christopher D Harvey, and David W Tank. \u201cRecurrent network models of\n\nsequence generation and memory\u201d. In: Neuron 90.1 (2016), pp. 128\u2013142.\n\n[53] Omri Barak. \u201cRecurrent neural networks as versatile tools of neuroscience research\u201d. In:\n\nCurrent opinion in neurobiology 46 (2017), pp. 1\u20136.\n\n12\n\n\f[54] Guangyu Robert Yang, Madhura R Joglekar, H Francis Song, William T Newsome, and\nXiao-Jing Wang. \u201cTask representations in neural networks trained to perform many cognitive\ntasks\u201d. In: Nature neuroscience 22.2 (2019), p. 297.\n\n[55] Ari Morcos, Maithra Raghu, and Samy Bengio. \u201cInsights on representational similarity in\nneural networks with canonical correlation\u201d. In: Advances in Neural Information Processing\nSystems. 2018, pp. 5727\u20135736.\nIngmar Kanitscheider and Ila Fiete. \u201cTraining recurrent networks to generate hypotheses\nabout how the brain solves hard navigation problems\u201d. 
In: Advances in Neural Information\nProcessing Systems 30. Ed. by I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus,\nS. Vishwanathan, and R. Garnett. Curran Associates, Inc., 2017, pp. 4529\u20134538.\n\n[56]\n\n[57] Harold Hotelling. \u201cRelations between two sets of variates\u201d. In: Breakthroughs in statistics.\n\nSpringer, 1992, pp. 162\u2013190.\n\n13\n\n\f", "award": [], "sourceid": 9068, "authors": [{"given_name": "Niru", "family_name": "Maheswaranathan", "institution": "Google Brain"}, {"given_name": "Alex", "family_name": "Williams", "institution": "Stanford University"}, {"given_name": "Matthew", "family_name": "Golub", "institution": "Stanford University"}, {"given_name": "Surya", "family_name": "Ganguli", "institution": "Stanford"}, {"given_name": "David", "family_name": "Sussillo", "institution": "Google Inc."}]}