{"title": "Neural system identification for large populations separating \u201cwhat\u201d and \u201cwhere\u201d", "book": "Advances in Neural Information Processing Systems", "page_first": 3506, "page_last": 3516, "abstract": "Neuroscientists classify neurons into different types that perform similar computations at different locations in the visual field. Traditional methods for neural system identification do not capitalize on this separation of \u201cwhat\u201d and \u201cwhere\u201d. Learning deep convolutional feature spaces that are shared among many neurons provides an exciting path forward, but the architectural design needs to account for data limitations: While new experimental techniques enable recordings from thousands of neurons, experimental time is limited so that one can sample only a small fraction of each neuron's response space. Here, we show that a major bottleneck for fitting convolutional neural networks (CNNs) to neural data is the estimation of the individual receptive field locations \u2013 a problem that has been scratched only at the surface thus far. We propose a CNN architecture with a sparse readout layer factorizing the spatial (where) and feature (what) dimensions. Our network scales well to thousands of neurons and short recordings and can be trained end-to-end. We evaluate this architecture on ground-truth data to explore the challenges and limitations of CNN-based system identification. Moreover, we show that our network model outperforms current state-of-the art system identification models of mouse primary visual cortex.", "full_text": "Neural system identi\ufb01cation for large populations\n\nseparating \u201cwhat\u201d and \u201cwhere\u201d\n\nDavid A. Klindt * 1-3, Alexander S. 
Ecker * 1,2,4,6, Thomas Euler 1-3, Matthias Bethge 1,2,4-6\n\n* Authors contributed equally\n\n1 Centre for Integrative Neuroscience, University of T\u00fcbingen, Germany\n\n2 Bernstein Center for Computational Neuroscience, University of T\u00fcbingen, Germany\n\n3 Institute for Ophthalmic Research, University of T\u00fcbingen, Germany\n4 Institute for Theoretical Physics, University of T\u00fcbingen, Germany\n5 Max Planck Institute for Biological Cybernetics, T\u00fcbingen, Germany\n\n6 Center for Neuroscience and Arti\ufb01cial Intelligence, Baylor College of Medicine, Houston, USA\n\nklindt.david@gmail.com, alexander.ecker@uni-tuebingen.de,\n\nthomas.euler@cin.uni-tuebingen.de, matthias.bethge@bethgelab.org\n\nAbstract\n\nNeuroscientists classify neurons into different types that perform similar compu-\ntations at different locations in the visual \ufb01eld. Traditional methods for neural\nsystem identi\ufb01cation do not capitalize on this separation of \u201cwhat\u201d and \u201cwhere\u201d.\nLearning deep convolutional feature spaces that are shared among many neurons\nprovides an exciting path forward, but the architectural design needs to account for\ndata limitations: While new experimental techniques enable recordings from thou-\nsands of neurons, experimental time is limited so that one can sample only a small\nfraction of each neuron\u2019s response space. Here, we show that a major bottleneck\nfor \ufb01tting convolutional neural networks (CNNs) to neural data is the estimation\nof the individual receptive \ufb01eld locations \u2013 a problem that has been scratched only\nat the surface thus far. We propose a CNN architecture with a sparse readout layer\nfactorizing the spatial (where) and feature (what) dimensions. Our network scales\nwell to thousands of neurons and short recordings and can be trained end-to-end.\nWe evaluate this architecture on ground-truth data to explore the challenges and\nlimitations of CNN-based system identi\ufb01cation. 
Moreover, we show that our network model outperforms current state-of-the-art system identification models of mouse primary visual cortex.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

1 Introduction

In neural system identification, we seek to construct quantitative models that describe how a neuron responds to arbitrary stimuli [1, 2]. In sensory neuroscience, the standard way to approach this problem is with a generalized linear model (GLM): a linear filter followed by a point-wise nonlinearity [3, 4]. However, neurons elicit complex nonlinear responses to natural stimuli even as early as in the retina [5, 6] and the degree of nonlinearity increases as one goes up the visual hierarchy. At the same time, neurons in the same brain area tend to perform similar computations at different positions in the visual field. This separability of what is computed from where it is computed is a key idea underlying the notion of functional cell types tiling the visual field in a retinotopic fashion.

For early visual processing stages like the retina or primary visual cortex, several nonlinear methods have been proposed, including energy models [7, 8], spike-triggered covariance methods [9, 10], linear-nonlinear (LN-LN) cascades [11, 12], convolutional subunit models [13, 14] and GLMs based on handcrafted nonlinear feature spaces [15]. While these models outperform the simple GLM, they still cannot fully account for the responses of even early visual processing stages (i.e. retina, V1), let alone higher-level areas such as V4 or IT. The main problem is that the expressiveness of the model (i.e. 
number of parameters) is limited by the amount of data that can be collected for each neuron.\n\nThe recent success of deep learning in computer vision and other \ufb01elds has sparked interest in using\ndeep learning methods for understanding neural computations in the brain [16, 17, 18], including\npromising \ufb01rst attempts to learn feature spaces for neural system identi\ufb01cation [19, 20, 21, 22, 23].\nIn this study, we would like to achieve a better understanding of the possible advantages of deep\nlearning methods over classical tools for system identi\ufb01cation by analyzing their effectiveness on\nground truth models. Classical approaches have traditionally been framed as individual multivari-\nate regression problems for each recorded neuron, without exploiting computational similarities\nbetween different neurons for regularization. One of the most obvious similarities between different\nneurons, however, is that the visual system simultaneously extracts similar features at many differ-\nent locations. Because of this spatial equivariance, the same nonlinear subspace is spanned at many\nnearby locations and many neurons share similar nonlinear computations. Thus, we should be able\nto learn much more complex nonlinear functions by combining data from many neurons and learning\na common feature space from which we can linearly predict the activity of each neuron.\n\nWe propose a convolutional neural network (CNN) architecture with a special readout layer that\nseparates the problem of learning a common feature space from estimating each neuron\u2019s receptive\n\ufb01eld location and cell type, but can still be trained end-to-end on experimental data. We evaluate\nthis model architecture using simple simulations and show its potential for developing a functional\ncharacterization of cell types. 
Moreover, we show that our model outperforms the current state-of-\nthe-art on a publicly available dataset of mouse V1 responses to natural images [19].\n\n2 Related work\n\nUsing arti\ufb01cial neural networks to predict neural responses has a long history [24, 25, 26]. Recently,\ntwo studies [13, 14] \ufb01t two-layer models with a convolutional layer and a pooling layer. They\ndo \ufb01nd marked improvements over GLMs and spike-triggered covariance methods, but like most\nother previous studies they \ufb01t their model only to individual cells\u2019 responses and do not exploit\ncomputational similarities among neurons.\n\nAntolik et al. [19] proposed learning a common feature space to improve neural system identi\ufb01ca-\ntion. They outperform GLM-based approaches by \ufb01tting a multi-layer neural network consisting of\nparameterized difference-of-Gaussian \ufb01lters in the \ufb01rst layer, followed by two fully-connected lay-\ners. However, because they do not use a convolutional architecture, features are shared only locally.\nThus, every hidden unit has to be learned \u2018from scratch\u2019 at each spatial location and the number of\nparameters in the fully-connected layers grows quadratically with population size.\n\nMcIntosh et al. [20] \ufb01t a CNN to retinal data. The bottleneck in their approach is the \ufb01nal fully-\nconnected layer that maps the convolutional feature space to individual cells\u2019 responses. The number\nof parameters in this \ufb01nal readout layer grows very quickly and even for their small populations\nrepresents more than half of the total number of parameters.\n\nBatty et al. [21] also advocate feature sharing and explore using recurrent neural networks to model\nthe shared feature space. They use a two-step procedure, where they \ufb01rst estimate each neuron\u2019s\nlocation via spike-triggered average, then crop the stimulus accordingly for each neuron and then\nlearn a model with shared features. 
The performance of this approach depends critically on the accuracy of the initial location estimate, which can be problematic for nonlinear neurons with a weak spike-triggered average response (e. g. complex cells in primary visual cortex).

Our contribution is a novel network architecture consisting of a number of convolutional layers followed by a sparse readout layer factorizing the spatial and feature dimensions. Our approach has two main advantages over prior art. First, it reduces the effective number of parameters in the readout layer substantially while still being trainable end-to-end. Second, our readout forces all computations to be performed in the convolutional layers while the factorized readout layer provides an estimate of the receptive field location and the cell type of each neuron.

In addition, our work goes beyond the findings of these previous studies by providing a systematic evaluation, on ground truth models, of the advantages of feature sharing in neural system identification – in particular in settings with many neurons and few observations.

Figure 1: Feature sharing makes more efficient use of the available data. Red line: System identification performance with one recorded neuron. Blue lines: Performance for a hypothetical population of 10 neurons with identical receptive field shapes whose locations we know. A shared model (solid blue) is equivalent to having 10× as much data, i. e. the performance curve shifts to the left. If we fit all neurons independently (dashed blue), we do not benefit from their similarity.

3 Learning a common feature space

We illustrate why learning a common feature space makes much more efficient use of the available data by considering a simple thought experiment. Suppose we record from ten neurons that all compute exactly the same function, except that they are located at different positions. 
If we know each neuron's position, we can pool their data to estimate a single model by shifting the stimulus such that it is centered on each neuron's receptive field. In this case we have effectively ten times as much data as in the single-neuron case (Fig. 1, red line) and we will achieve the same model performance with a tenth of the data (Fig. 1, solid blue line). In contrast, if we treat each neuron as an individual regression problem, the performance will on average be identical to the single-neuron case (Fig. 1, dashed blue line). Although this insight has been well known from transfer learning in machine learning, it has so far not been applied widely in a neuroscience context.

In practice we neither know the receptive field locations of all neurons a priori nor do all neurons implement exactly the same nonlinear function. However, the improvements of learning a shared feature space can still be substantial. First, estimating the receptive field location of an individual neuron is a much simpler task than estimating its entire nonlinear function from scratch. Second, we expect the functional response diversity within a cell type to be much smaller than the overall response diversity across cell types [27, 28]. Third, cells in later processing stages (e. g. V1) share the nonlinear computations of their upstream areas (retina, LGN), suggesting that equipping them with a common feature space will simplify learning their individual characteristics [19].

4 Feature sharing in a simple linear ground-truth model

We start by investigating the possible advantages of learning a common feature space with a simple ground truth model – a population of linear neurons with Poisson-like output noise:

    r_n = a_n^T s,    y_n ∼ N(r_n, sqrt(|r_n|))    (1)

Here, s is the (Gaussian white noise) stimulus, r_n the firing rate of neuron n, a_n its receptive field kernel and y_n its noisy response. 
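As a concrete reference, the ground-truth population of Eq. (1) can be simulated in a few lines. A minimal numpy sketch; the shapes follow the paper, but the kernels are random stand-ins rather than the shifted on-center/off-surround filters used in the experiments:

```python
import numpy as np

rng = np.random.default_rng(0)

P = 48 * 48   # stimulus dimensionality (48 x 48 Gaussian white noise)
N = 10        # number of simulated neurons
T = 1000      # number of stimulus presentations

A = rng.normal(size=(N, P)) * 0.01   # receptive-field kernels a_n (random stand-ins)
S = rng.normal(size=(T, P))          # Gaussian white-noise stimuli s

R = S @ A.T                                            # firing rates r_n = a_n^T s
# Poisson-like noise: Gaussian with standard deviation sqrt(|r_n|), as in Eq. (1)
Y = R + rng.normal(size=R.shape) * np.sqrt(np.abs(R))
```

Rescaling R so that the mean absolute rate is about 0.1 would reproduce the signal-to-noise regime used in Section 4.4.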
In this simple model, the classical GLM-based approach reduces to (regularized) multivariate linear regression, which we compare to a convolutional neural network.

4.1 Convolutional neural network model

Our neural network consists of a convolutional layer and a readout layer (Fig. 2). The first layer convolves the image with a number of kernels to produce K feature maps, followed by batch normalization [29]. There is no nonlinearity in the network (i.e. activation function is the identity). Batch normalization ensures that the output has fixed variance, which is important for the regularization in the second layer. The readout layer pools the output, c, of the convolutional layer by applying a sparse mask, q, for each neuron:

    r̂_n = Σ_{i,j,k} c_{ijk} q_{ijkn}    (2)

Here, r̂_n is the predicted firing rate of neuron n. The mask q is factorized in the spatial and feature dimension:

    q_{ijkn} = m_{ijn} w_{kn},    (3)

where m is a spatial mask and w is a set of K feature weights for each neuron. The spatial mask and feature weights encode each neuron's receptive field location and cell type, respectively. As we expect them to be highly sparse, we regularize both by an L1 penalty (with strengths λ_m and λ_w).

[Figure 2 schematic: input (48 × 48) → convolution with K kernels (17 × 17 × K) → feature space (32 × 32 × K) → factorized readout ((32 × 32 + K) × N parameters) → responses (N × 1).]

Figure 2: Our proposed CNN architecture in its simplest form. It consists of a feature space module and a readout layer. The feature space is extracted via one or more convolutional layers (here one is shown). 
The readout layer computes for each neuron a weighted sum over the entire feature space. To keep the number of parameters tractable and facilitate interpretability, we factorize the readout into a location mask and a vector of feature weights, which are both encouraged to be sparse by regularizing with L1 penalty.

By factorizing the spatial and feature dimension in the readout layer, we achieve several useful properties: first, it reduces the number of parameters substantially compared to a fully-connected layer [20]; second, it limits the expressiveness of the layer, forcing the ‘computations’ down to the convolutional layers, while the readout layer performs only the selection; third, this separation of computation from selection facilitates the interpretation of the learned parameters in terms of functional cell types.

We minimize the following penalized mean-squared error using the Adam optimizer [30]:

    L = (1/B) Σ_{b,n} (y_{bn} − r̂_{bn})² + λ_m Σ_{i,j,n} |m_{ijn}| + λ_w Σ_{k,n} |w_{kn}|    (4)

where b denotes the sample index and B = 256 is the minibatch size. We use an initial learning rate of 0.001 and early stopping based on a separate validation set consisting of 20% of the training set. When the validation error has not improved for 300 consecutive steps, we go back to the best parameter set and decrease the learning rate once by a factor of ten. After the second time we end the training. We find the optimal regularization weights λ_m and λ_w via grid search.

To achieve optimal performance, we found it to be useful to initialize the masks well. Shifting the convolution kernel by one pixel in one direction while shifting the mask in the opposite direction in principle produces the same output. However, because in practice the filter size is finite, poorly initialized masks can lead to suboptimal solutions with partially cropped filters (cf. Fig. 
3C, CNN10). To initialize the masks, we calculated the spike-triggered average for each neuron, smoothed it with a large Gaussian kernel and took the pixel with the maximum absolute value as our initial guess for the neurons' location. We set this pixel to the standard deviation of the neuron's response (because the output of the convolutional layer has unit variance) and initialized the rest of the mask randomly from a Gaussian N(0, 0.001). We initialized the convolution kernels randomly from N(0, 0.01) and the feature weights from N(1/K, 0.01).

4.2 Baseline models

In the linear example studied here, the GLM reduces to simple linear regression. We used two forms of regularization: lasso (L1) and ridge (L2). To maximize the performance of these baseline models, we cropped the stimulus around each neuron's receptive field. Thus, the number of parameters these models have to learn is identical to those in the convolution kernel of the CNN. Again, we cross-validated over the regularization strength.

4.3 Performance evaluation

To measure the models' performance we compute the fraction of explainable variance explained:

    FEV = 1 − ⟨(r̂ − r)²⟩ / Var(r)    (5)

[Figure 3 panels: A, population receptive fields; B, fraction of explainable variance explained vs. number of samples (2^6–2^18) for kernel known, OLS, Lasso, Ridge, CNN1, CNN10, CNN100 and CNN1000; C, learned filters per model and sample count.]

Figure 3: Feature sharing in homogeneous linear population. A, Population of homogeneous spatially shifted on-center/off-surround neurons. B, Model comparison: Fraction of explainable variance explained vs. the number of samples used for fitting the models. 
Ordinary least squares (OLS), L1 (Lasso) and L2 (Ridge) regularized regression models are fit to individual neurons. CNN_N are convolutional models with N neurons fit jointly. The dashed line shows the performance (for N → ∞) of estimating the mask given the ground truth convolution kernel. C, Learned filters for different methods and number of samples.

The FEV (Eq. 5) is evaluated on the ground-truth firing rates r without observation noise. A perfect model would achieve FEV = 1. We evaluate FEV on a held-out test set not seen during model fitting and cross-validation.

4.4 Single cell type, homogeneous population

We first considered the idealized situation where all neurons share the same 17 × 17 px on-center/off-surround filter, but at different locations (Fig. 3A). In other words, there is only one feature map in the convolutional layer (K = 1). We used a 48 × 48 px Gaussian white noise stimulus and scaled the neurons' output such that ⟨|r|⟩ = 0.1, mimicking a neurally-plausible signal-to-noise ratio at firing rates of 1 spike/s and an observation window of 100 ms. We simulated populations of N = 1, 10, 100 and 1000 neurons and varied the amount of training data.

The CNN model consistently outperformed the linear regression models (Fig. 3B). The ridge-regularized linear regression explained around 60% of the explainable variance with 4,000 samples (i. e. pairs of stimulus and N-dimensional neural response vector). A CNN model pooling over 10 neurons achieved the same level of performance with less than a quarter of the data. The margin in performance increased with the number of neurons pooled over in the model, although the relative improvement started to level off when going from 100 to 1,000 neurons.

With few observations, the bottleneck appears to be estimating each neuron's location mask. Two observations support this hypothesis. 
First, the CNN1000 model learned much ‘cleaner’ weights with 256 samples than ridge regression with 4,096 (Fig. 3C), although the latter achieved a higher predictive performance (FEV = 55% vs. 65%). This observation suggests that the feature space can be learned efficiently with few samples and many neurons, but that the performance is limited by the estimation of neurons' location masks. Second, when using the ground-truth kernel and optimizing solely the location masks, performance was only marginally better than for 1,000 neurons (Fig. 3B, blue dotted line), indicating an upper performance bound by the problem of estimating the location masks.

4.5 Functional classification of cell types

Our next step was to investigate whether our model architecture can learn interpretable features and obtain a functional classification of cell types. Using the same simple linear model as above, we simulated two cell types with different filter kernels. To make the simulation a bit more realistic, we made the kernels heterogeneous within a cell type (Fig. 4A). We simulated a population of 1,000 neurons (500 of each type).

Figure 4: A, Example receptive fields of two types of neurons, differing in their average size. B, Learned filters of the CNN model. C, Scatter plot of the feature weights for the two cell types.

With sparsity on the readout weights every neuron has to select one of the two convolutional kernels. As a consequence, the feature weights represent more or less directly the cell type identity of each neuron (Fig. 4C). This in turn forces the kernels to learn the average of each type (Fig. 4B). However, any other set of kernels spanning the same subspace would have achieved the same predictive performance. 
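The factorized readout of Eqs. (2) and (3) amounts to a single tensor contraction. A minimal numpy sketch; shapes follow Fig. 2, and all values are random placeholders rather than trained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

H = W = 32   # spatial size of the convolutional feature maps
K = 2        # number of feature channels (cell types)
N = 1000     # number of neurons

c = rng.normal(size=(H, W, K))   # conv-layer output c_ijk for one stimulus
m = rng.normal(size=(H, W, N))   # spatial masks m_ijn (sparse after L1 training)
w = rng.normal(size=(K, N))      # feature weights w_kn

# q_ijkn = m_ijn * w_kn  (Eq. 3);  r_hat_n = sum_ijk c_ijk * q_ijkn  (Eq. 2)
r_hat = np.einsum('ijk,ijn,kn->n', c, m, w)

# Equivalent two-step view: pool over space first, then weight the features
pooled = np.einsum('ijk,ijn->kn', c, m)
assert np.allclose(r_hat, (pooled * w).sum(axis=0))
```

The two-step view makes the separation explicit: the mask selects *where* each neuron reads out, the feature weights select *what* it computes.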
Thus, we \ufb01nd that sparsity on the feature weights facilitates interpretability: each neuron\nchooses one feature channel which represents the essential computation of this type of neuron.\n\n5 Learning nonlinear feature spaces\n\n5.1 Ground truth model\n\nNext, we investigated how our approach scales to more complex, nonlinear neurons and natural\nstimuli. To keep the bene\ufb01ts of having ground truth data available, we chose our model neurons from\nthe VGG-19 network [31], a popular CNN trained on large-scale object recognition. We selected\nfour random feature maps from layer conv2_2 as \u2018cell types\u2019. For each cell type, we picked 250\nunits with random locations (32 \u00d7 32 possible locations). We computed ground-truth responses\nfor all 1000 cells on 44 \u00d7 44 px image patches obtained by randomly cropping images from the\nImageNet (ILSVRC2012) dataset. As before, we rescaled the output to produce sparse, neurally\nplausible mean responses of 0.1 and added Poisson-like noise.\n\nWe \ufb01t a CNN with three convolutional layers consisting of 32, 64 and 4 feature maps (kernel size\n5 \u00d7 5), followed by our sparse, factorized readout layer (Fig. 5A). Each convolutional layer was\nfollowed by batch normalization and a ReLU nonlinearity. We trained the model using Adam with\na batch-size of 64 and the same initial step size, early stopping, cross-validation and initialization of\nthe masks as described above. As a baseline, we \ufb01t a ridge-regularized GLM with ReLU nonlinearity\nfollowed by an additional bias.\n\nTo show that our sparse, factorized readout layer is an important feature of our architecture, we also\nimplemented two alternative ways of choosing the readout, which have been proposed in previous\nwork on learning common feature spaces for neural populations. 
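To make the parameter savings of the factorization concrete before turning to the alternatives, a quick count of readout parameters, using the shapes of the VGG experiment below (32 × 32 final feature maps, K = 4 channels, N = 1,000 neurons):

```python
# Readout parameter counts: factorized (spatial mask + feature weights)
# vs. a fully-connected readout tensor as in [20].
H, W, K, N = 32, 32, 4, 1000

fully_connected = H * W * K * N    # one weight per (pixel, channel, neuron)
factorized = (H * W + K) * N       # mask (H*W) plus feature weights (K) per neuron

print(fully_connected)  # 4096000
print(factorized)       # 1028000
```

The gap widens with K: going from K = 4 to K = 64 multiplies the fully-connected count sixteen-fold, while the factorized readout grows by only 60 weights per neuron.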
The first approach is to estimate the receptive field location in advance based on the spike-triggered average of each neuron [21].^1 To do so, we determined the pixel with the strongest spike-triggered average. We then set this pixel to one in the location mask and all other pixels to zero. We then kept the location mask fixed while optimizing convolution kernels and feature weights. The second approach is to use a fully-connected readout tensor [20] and regularize the activations of all neurons with L1 penalty. In addition, we regularized the fully-connected readout tensor with L2 weight decay. We fit both models to populations of 1,000 neurons.

Our CNN with the factorized readout outperformed all three baselines (Fig. 5B).^2 The performance of the GLM saturated at ≈ 20% FEV (Fig. 5B), highlighting the high degree of nonlinearity of our model neurons. Using a fully-connected readout [20] incurred a substantial performance penalty when the number of samples was small and only asymptotically (for a large number of samples) reached the same performance as our factorized readout.

^1 Note that they used a recurrent neural network for the shared feature space. Here we only reproduce their approach to defining the readout.
^2 It did not reach 100% performance, since the feature space we fit was smaller and the network shallower than the one used to generate the ground truth data.

[Figure 5 panels: A, architecture — input 44 × 44 × 3, three 5 × 5 convolutions producing 40 × 40 × 32, 36 × 36 × 64 and 32 × 32 × 4 feature maps, then a factorized readout with (32 × 32 + K) × N parameters; B, FEV vs. number of samples (2^9–2^16) for CNNs with 1,000/100/12 neurons, fixed mask, full readout and GLM; C, feature-weight scatter for types 1–4; D, location masks; E, FEV vs. number of cell types (4, 8, 16) for ours, fixed mask and full readout.]

Figure 5: Inferring a complex, nonlinear feature space. A, Model architecture. B, Dependence of model performance (FEV) on number of samples used for training. C, Feature weights of the four cell types for CNN1000 with 2^15 samples cluster strongly. D, Learned location masks for four randomly chosen cells (one per type). E, Dependence of model performance (FEV) on number of types of neurons in population, number of samples fixed to 2^12.

Estimating the receptive field location in advance [21] led to a drop in performance – even for large sample sizes. 
A likely explanation for this finding is the fact that the responses are quite nonlinear and, thus, estimates of the receptive field location via spike-triggered average (a linear method) are not very reliable, even for large sample sizes.

Note that the fact that we can fit the model is not trivial, although ground truth is a CNN. We have observations of noise-perturbed VGG units whose locations we do not know. Thus, we have to infer both the location of each unit as well as the complex, nonlinear feature space simultaneously. Our results show that our model solves this task more efficiently than both simpler (GLM) and equally expressive [20] models when the number of samples is relatively small.

In addition to fitting the data well, the model also recovered both the cell types and the receptive field locations correctly (Fig. 5C, D). When fit using 2^16 samples (2^10 for validation/test and the rest for training), the readout weights of the four cell types clustered nicely (Fig. 5C) and it successfully recovered the location masks (Fig. 5D). In fact, all cells were classified correctly based on their largest feature weight.

Next, we investigated how our model and its competitors [20, 21] fare when scaling up to large recordings with many types of neurons. To simulate this scenario, we sampled again VGG units (from the same layer as above), taking 64 units with random locations from up to 16 different feature maps (i.e. cell types). Correspondingly we increased the number of feature maps in the last convolutional layer of the models. We fixed the number of training samples to 2^12 to compare models in a challenging regime (cf. Fig. 5B) where performance can be high but is not yet asymptotic.

Our CNN model scales gracefully to more diverse neural populations (Fig. 5E), remaining roughly at the same level of performance. 
Similarly, the CNN with the fixed location masks estimated in advance scales well, although with lower overall performance. In contrast, the performance of the fully-connected readout drops fast, because the number of parameters in the readout layer grows very quickly with the number of feature maps in the final convolutional layer. In fact, we were unable to fit models with more than 16 feature maps with this approach, because the size of the read-out tensor became prohibitively large for GPU memory.

Table 1: Application to data from primary visual cortex (V1) of mice [19]. The table shows average correlations between model predictions and neural responses on the test set.

    Scan                                    1      2      3      Average
    Antolik et al. 2016 [19]                0.51   0.43   0.46   0.47
    LNP                                     0.37   0.30   0.38   0.36
    CNN with fully connected readout        0.47   0.34   0.43   0.43
    CNN with fixed mask                     0.45   0.38   0.41   0.42
    CNN with factorized readout (ours)      0.55   0.45   0.49   0.50

Finally, we asked how far we can push our model with long recordings and many neurons. We tested our model with 2^16 training samples from 128 different types of neurons (again 64 units each). On this large dataset with ≈ 60,000 recordings from ≈ 8,000 neurons we were still able to fit the model on a single GPU and perform at 90% FEV (data not shown). Thus, we conclude that our model scales well to large-scale problems with thousands of nonlinear and diverse neurons.

5.2 Application to data from primary visual cortex

To test our approach on real data and going beyond the previously explored retinal data [20, 21], we used the publicly available dataset from Antolik et al. [19].^3 The dataset has been obtained by two-photon imaging in the primary visual cortex of sedated mice viewing natural images. 
It contains\nthree scans with 103, 55 and 102 neurons, respectively, and their responses to static natural images.\nEach scan consists of a training set of images that were each presented once (1800, 1260 and 1800\nimages, respectively) as well as a test set consisting of 50 images (each image repeated 10, 8 and 12\ntimes, respectively). We use the data in the same form as the original study [19], to which we refer\nthe reader for full details on data acquisition, post-processing and the visual stimulation paradigm.\n\nTo \ufb01t this dataset, we used the same basic CNN architecture described above, with three small\nmodi\ufb01cations. First, we replaced the ReLU activation functions by a soft-thresholding nonlinearity,\nf (x) = log(1 + exp(x)). Second, we replaced the mean-squared error loss by a Poisson loss\n(because neural responses are non-negative and the observation noise scales with the mean response).\nThird, we had to regularize the convolutional kernels, because the dataset is relatively limited in\nterms of recording length and number of neurons. We used two forms of regularization: smoothness\nand group sparsity. Smoothness is achieved by an L2 penalty on the Laplacian of the convolution\nkernels:\n\nwhere Wijkl is the 4D tensor representing the convolution kernels, i and j depict the two spatial\ndimensions of the \ufb01lters and k, l the input and output channels. Group sparsity encourages \ufb01lters to\npool from only a small set of feature maps in the previous layer and is de\ufb01ned as:\n\nLgroup = \u03bbgroupXi,j sXkl\n\nW 2\n\nijkl.\n\n(7)\n\nWe \ufb01t CNNs with one, two and three layers. After an initial exploration of different CNN architec-\ntures (\ufb01lter sizes, number of feature maps) on the \ufb01rst scan, we systematically cross-validated over\ndifferent \ufb01lter sizes, number of feature maps and regularization strengths via grid search on all three\nscans. 
We \ufb01t all models using 80% of the training dataset for training and the remaining 20% for\nvalidation using Adam and early stopping as described above. For each scan, we selected the best\nmodel based on the likelihood on the validation set. In all three scans, the best model had 48 feature\nmaps per layer and 13 \u00d7 13 px kernels in the \ufb01rst layer. The best model for the \ufb01rst two scans had\n3 \u00d7 3 kernels in the subsequent layers, while for the third scan larger 8 \u00d7 8 kernels performed best.\n\nWe compared our model to four baselines: (a) the Hierarchical Structural Model from the original\npaper publishing the dataset [19], (b) a regularized linear-nonlinear Poisson (LNP) model, (c) a\nCNN with fully-connected readout (as in [20]) and (d) a CNN with \ufb01xed spatial masks, inferred\n\n3See [22, 23] for concurrent work on primate V1.\n\n8\n\nLlaplace = \u03bblaplace Xi,j,k,l\n\n(W:,:,kl \u2217 L)2\nij,\n\nL =h 0.5\n\n0.5\n\n1\n1 \u22126\n1\n\n0.5\n1\n\n0.5i\n\n(6)\n\n\ffrom the spike-triggered averages of each neuron (as in [21]). We used a separate, held-out test set\nto compare the performance of the models. On the test set, we computed the correlation coef\ufb01cient\nbetween the response predicted by each model and the average observed response across repeats of\nthe same image.4\n\nOur CNN with factorized readout outperformed all four baselines on all three scans (Table 1). The\nother two CNNs, which either did not use a factorized readout (as in [20]) or did not jointly optimize\nfeature space and readout (as in [21]), performed substantially worse. Interestingly, they did not\neven reach the performance of [19], which uses a three-layer fully-connected neural network instead\nof a CNN. 
Thus, our model is the new state of the art for predicting neural responses in mouse V1, and the factorized readout was necessary to outperform an earlier (and simpler) neural network architecture that also learned a shared feature space for all neurons [19].

6 Discussion

Our results show that the benefits of learning a shared convolutional feature space can be substantial. Predictive performance increases, however, only up to an upper bound imposed by the difficulty of estimating each neuron's location in the visual field. We propose a CNN architecture with a sparse, factorized readout layer that separates these two problems effectively. It allows scaling up the complexity of the convolutional layers to many parallel channels (which are needed to describe diverse, nonlinear neural populations), while keeping the inference problem of each neuron's receptive field location and type identity tractable.

Furthermore, our performance curves (see Figs. 3 and 5) may inform experimental designs by determining whether one should aim for longer recordings or more neurons. For instance, if we want to explain at least 80% of the variance in a very homogeneous population of neurons, we could choose to record either ≈ 2,000 responses from each of 10 cells or ≈ 500 responses from each of 1,000 cells.

Besides making more efficient use of the data to infer neurons' nonlinear computations, the main promise of our new regularization scheme for system identification with CNNs is that the explicit separation of "what" and "where" provides us with a principled way to functionally classify cells into different types: the feature weights of our model can be thought of as a "barcode" identifying each cell type. We are currently working on applying this approach to large-scale data from the retina and primary visual cortex.
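To make the scaling argument behind the factorized readout concrete, the following sketch contrasts the parameter counts of a fully-connected readout with the factorized one and shows the factorized computation itself. The function names and shapes are illustrative assumptions, not taken from the paper's code.

```python
import numpy as np

def fully_connected_readout_params(h, w, c, n_neurons):
    # fully-connected readout: one weight per (location, feature map) per neuron
    return h * w * c * n_neurons

def factorized_readout_params(h, w, c, n_neurons):
    # factorized readout: a spatial mask ("where") plus one weight per
    # feature map ("what") for each neuron
    return (h * w + c) * n_neurons

def factorized_readout(conv_out, masks, features):
    """Per-neuron response: apply each neuron's spatial mask to every feature
    map, then take a weighted sum over feature maps.
    conv_out: (h, w, c); masks: (n, h, w); features: (n, c) -> (n,)."""
    return np.einsum('xyc,nxy,nc->n', conv_out, masks, features)
```

For example, with 48 feature maps on a 36 × 36 grid and 100 neurons, a fully-connected readout needs about 6.2 million parameters, while the factorized readout needs only about 134 thousand.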
Later processing stages, such as primary visual cortex, could additionally benefit from similarly exploiting equivariance not only in the spatial domain, but also (approximately) in the orientation or direction-of-motion domain.

Availability of code

The code to fit the models and reproduce the figures is available online at:
https://github.com/david-klindt/NIPS2017

Acknowledgements

We thank Philipp Berens, Katrin Franke, Leon Gatys, Andreas Tolias, Fabian Sinz, Edgar Walker and Christian Behrens for comments and discussions.

This work was supported by the German Research Foundation (DFG) through Collaborative Research Center (CRC 1233) "Robust Vision" as well as DFG grant EC 479/1-1; the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 674901; the German Excellence Initiative through the Centre for Integrative Neuroscience Tübingen (EXC307). The research was also supported by Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior/Interior Business Center (DoI/IBC) contract number D16PC00003. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoI/IBC, or the U.S. Government.

⁴ We used the correlation coefficient for evaluation (a) to facilitate comparison with the original study [19] and (b) because estimating FEV on data with a small number of repetitions per image is unreliable.

References

[1] Matteo Carandini, Jonathan B. Demb, Valerio Mante, David J. Tolhurst, Yang Dan, Bruno A. Olshausen, Jack L. Gallant, and Nicole C. Rust.
Do we know what the early visual system does? The Journal of Neuroscience, 25(46):10577–10597, 2005.

[2] Michael C.-K. Wu, Stephen V. David, and Jack L. Gallant. Complete functional characterization of sensory neurons by system identification. Annual Review of Neuroscience, 29:477–505, 2006.

[3] Judson P. Jones and Larry A. Palmer. The two-dimensional spatial structure of simple receptive fields in cat striate cortex. Journal of Neurophysiology, 58(6):1187–1211, 1987.

[4] Alison I. Weber and Jonathan W. Pillow. Capturing the dynamical repertoire of single neurons with generalized linear models. arXiv:1602.07389 [q-bio], 2016.

[5] Tim Gollisch and Markus Meister. Eye smarter than scientists believed: neural computations in circuits of the retina. Neuron, 65(2):150–164, 2010.

[6] Alexander Heitman, Nora Brackbill, Martin Greschner, Alexander Sher, Alan M. Litke, and E. J. Chichilnisky. Testing pseudo-linear models of responses to natural scenes in primate retina. bioRxiv, page 45336, 2016.

[7] David H. Hubel and Torsten N. Wiesel. Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. The Journal of Physiology, 160(1):106, 1962.

[8] Edward H. Adelson and James R. Bergen. Spatiotemporal energy models for the perception of motion. Journal of the Optical Society of America A, 2(2):284–299, 1985.

[9] Nicole C. Rust, Odelia Schwartz, J. Anthony Movshon, and Eero P. Simoncelli. Spatiotemporal elements of macaque V1 receptive fields. Neuron, 46(6):945–956, 2005.

[10] Jon Touryan, Gidon Felsen, and Yang Dan. Spatial structure of complex cell receptive fields measured with natural images. Neuron, 45(5):781–791, 2005.

[11] James M. McFarland, Yuwei Cui, and Daniel A. Butts. Inferring nonlinear neuronal computation based on physiologically plausible inputs.
PLOS Computational Biology, 9(7):e1003143, 2013.

[12] Esteban Real, Hiroki Asari, Tim Gollisch, and Markus Meister. Neural circuit inference from function to structure. Current Biology, 2017.

[13] Brett Vintch, J. Anthony Movshon, and Eero P. Simoncelli. A convolutional subunit model for neuronal responses in macaque V1. The Journal of Neuroscience, 35(44):14829–14841, 2015.

[14] Ryan J. Rowekamp and Tatyana O. Sharpee. Cross-orientation suppression in visual area V2. Nature Communications, 8, 2017.

[15] Ben Willmore, Ryan J. Prenger, Michael C.-K. Wu, and Jack L. Gallant. The Berkeley wavelet transform: a biologically inspired orthogonal wavelet transform. Neural Computation, 20(6):1537–1564, 2008.

[16] Daniel L. K. Yamins, Ha Hong, Charles F. Cadieu, Ethan A. Solomon, Darren Seibert, and James J. DiCarlo. Performance-optimized hierarchical models predict neural responses in higher visual cortex. Proceedings of the National Academy of Sciences, 111(23):8619–8624, 2014.

[17] Ari S. Benjamin, Hugo L. Fernandes, Tucker Tomlinson, Pavan Ramkumar, Chris VerSteeg, Lee Miller, and Konrad P. Kording. Modern machine learning far outperforms GLMs at predicting spikes. bioRxiv, page 111450, 2017.

[18] Seyed-Mahdi Khaligh-Razavi, Linda Henriksson, Kendrick Kay, and Nikolaus Kriegeskorte. Explaining the hierarchy of visual representational geometries by remixing of features from many computational vision models. bioRxiv, page 9936, 2014.

[19] Ján Antolík, Sonja B. Hofer, James A. Bednar, and Thomas D. Mrsic-Flogel. Model constrained by visual hierarchy improves prediction of neural responses to natural scenes. PLOS Computational Biology, 12(6):e1004927, 2016.

[20] Lane T. McIntosh, Niru Maheswaranathan, Aran Nayebi, Surya Ganguli, and Stephen A. Baccus. Deep learning models of the retinal response to natural scenes.
arXiv:1702.01825 [q-bio, stat], 2017.

[21] Eleanor Batty, Josh Merel, Nora Brackbill, Alexander Heitman, Alexander Sher, Alan Litke, E. J. Chichilnisky, and Liam Paninski. Multilayer recurrent network models of primate retinal ganglion cell responses. In 5th International Conference on Learning Representations, 2017.

[22] William F. Kindel, Elijah D. Christensen, and Joel Zylberberg. Using deep learning to reveal the neural code for images in primary visual cortex. arXiv:1706.06208 [cs, q-bio], 2017.

[23] Santiago A. Cadena, George H. Denfield, Edgar Y. Walker, Leon A. Gatys, Andreas S. Tolias, Matthias Bethge, and Alexander S. Ecker. Deep convolutional models improve predictions of macaque V1 responses to natural images. bioRxiv, page 201764, 2017.

[24] S. R. Lehky, T. J. Sejnowski, and R. Desimone. Predicting responses of nonlinear neurons in monkey striate cortex to complex patterns. The Journal of Neuroscience, 12(9):3568–3581, 1992.

[25] Brian Lau, Garrett B. Stanley, and Yang Dan. Computational subunits of visual cortical neurons revealed by artificial neural networks. Proceedings of the National Academy of Sciences, 99(13):8974–8979, 2002.

[26] Ryan Prenger, Michael C. K. Wu, Stephen V. David, and Jack L. Gallant. Nonlinear V1 responses to natural scenes revealed by neural network analysis. Neural Networks, 17(5–6):663–679, 2004.

[27] Tom Baden, Philipp Berens, Katrin Franke, Miroslav R. Rosón, Matthias Bethge, and Thomas Euler. The functional diversity of retinal ganglion cells in the mouse. Nature, 529(7586):345–350, 2016.

[28] Katrin Franke, Philipp Berens, Timm Schubert, Matthias Bethge, Thomas Euler, and Tom Baden. Inhibition decorrelates visual feature representations in the inner retina. Nature, 542(7642):439–444, 2017.

[29] Sergey Ioffe and Christian Szegedy.
Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv:1502.03167 [cs], 2015.

[30] Diederik Kingma and Jimmy Ba. Adam: a method for stochastic optimization. arXiv:1412.6980, 2014.

[31] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556, 2014.