{"title": "Neural Networks for Efficient Bayesian Decoding of Natural Images from Retinal Neurons", "book": "Advances in Neural Information Processing Systems", "page_first": 6434, "page_last": 6445, "abstract": "Decoding sensory stimuli from neural signals can be used to reveal how we sense our physical environment, and is valuable for the design of brain-machine interfaces. However, existing linear techniques for neural decoding may not fully reveal or exploit the fidelity of the neural signal. Here we develop a new approximate Bayesian method for decoding natural images from the spiking activity of populations of retinal ganglion cells (RGCs). We sidestep known computational challenges with Bayesian inference by exploiting artificial neural networks developed for computer vision, enabling fast nonlinear decoding that incorporates natural scene statistics implicitly. We use a decoder architecture that first linearly reconstructs an image from RGC spikes, then applies a convolutional autoencoder to enhance the image. The resulting decoder, trained on natural images and simulated neural responses, significantly outperforms linear decoding, as well as simple point-wise nonlinear decoding. These results provide a tool for the assessment and optimization of retinal prosthesis technologies, and reveal that the retina may provide a more accurate representation of the visual scene than previously appreciated.", "full_text": "Neural Networks for Ef\ufb01cient Bayesian Decoding of\n\nNatural Images from Retinal Neurons\n\nNikhil Parthasarathy\u2217\nStanford University\n\nnikparth@gmail.com\n\nEleanor Batty\u2217\nColumbia University\n\nerb2180@columbia.edu\n\nWilliam Falcon\n\nColumbia University\n\nwaf2107@columbia.edu\n\nThomas Rutten\n\nColumbia University\n\ntkr2112@columbia.edu\n\nMohit Rajpal\n\nColumbia University\n\nmr3522@columbia.edu\n\nE.J. 
Chichilnisky\u2020\nStanford University\nej@stanford.edu\n\nLiam Paninski\u2020\nColumbia University\n\nliam@stat.columbia.edu\n\nAbstract\n\nDecoding sensory stimuli from neural signals can be used to reveal how we sense\nour physical environment, and is valuable for the design of brain-machine interfaces.\nHowever, existing linear techniques for neural decoding may not fully reveal or ex-\nploit the \ufb01delity of the neural signal. Here we develop a new approximate Bayesian\nmethod for decoding natural images from the spiking activity of populations of\nretinal ganglion cells (RGCs). We sidestep known computational challenges with\nBayesian inference by exploiting arti\ufb01cial neural networks developed for computer\nvision, enabling fast nonlinear decoding that incorporates natural scene statistics\nimplicitly. We use a decoder architecture that \ufb01rst linearly reconstructs an image\nfrom RGC spikes, then applies a convolutional autoencoder to enhance the image.\nThe resulting decoder, trained on natural images and simulated neural responses,\nsigni\ufb01cantly outperforms linear decoding, as well as simple point-wise nonlinear\ndecoding. These results provide a tool for the assessment and optimization of reti-\nnal prosthesis technologies, and reveal that the retina may provide a more accurate\nrepresentation of the visual scene than previously appreciated.\n\n1\n\nIntroduction\n\nNeural coding in sensory systems is often studied by developing and testing encoding models that\ncapture how sensory inputs are represented in neural signals. For example, models of retinal function\nare designed to capture how retinal ganglion cells (RGCs) respond to diverse patterns of visual\nstimulation. An alternative approach \u2013 decoding visual stimuli from RGC responses \u2013 provides a\ncomplementary method to assess the information contained in RGC spikes about the visual world\n[31, 37]. 
Understanding decoding can also be useful for the design of retinal prostheses, by providing\na measure of the visual restoration that is possible with a prosthesis [26].\nThe most common and well-understood decoding approach, linear regression, has been used in various\nsensory systems [29, 40]. This method was shown to be successful at reconstructing white noise\ntemporal signals from RGC activity [37] and revealed that coarse structure of natural image patches\ncould be recovered from ensemble responses in the early visual system [33]. Other linear methods\n\n\u2217,\u2020Equal contributions\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fFigure 1: Outline of approach. A) The original image is fed through the simulated neural encoding\nmodels to produce RGC responses on which we \ufb01t a linear decoder. A deep neural network is then\nused to further enhance the image. B) We use a convolutional autoencoder with a 4 layer encoder and\na 4 layer decoder to enhance the linear decoded image.\n\nsuch as PCA and linear perceptrons have been used to decode low-level features such as color and\nedge orientation from cortical visual areas [14, 4]. For more complex natural stimuli, computationally\nexpensive approximations to Bayesian inference have been used to construct decoders that incorporate\nimportant prior information about signal structure [25, 27, 30]. However, despite decades of effort,\nderiving an accurate prior on natural images poses both computational and theoretical challenges, as\ndoes computing the posterior distribution on images given an observed neural response, limiting the\napplicability of traditional Bayesian inference.\nHere we develop and assess a new method for decoding natural images from the spiking activity of\nlarge populations of RGCs, to sidestep some of these dif\ufb01culties1 . 
Our approach exploits inference\ntools that approximate optimal Bayesian inference, and emerge from the recent literature on deep\nneural network (DNN) architectures for computer vision tasks such as super-resolution, denoising,\nand inpainting [17, 39]. We propose a novel staged decoding methodology \u2013 linear decoding followed\nby a (nonlinear) DNN trained speci\ufb01cally to enhance the images output by the linear decoder \u2013 and\nuse it to reconstruct natural images from realistic simulated retinal ganglion cell responses. This\napproach leverages recent progress in deep learning to more fully incorporate natural image priors in\nthe decoder. We show that the approach substantially outperforms linear decoding. These \ufb01ndings\nprovide a potential tool to assess the \ufb01delity of retinal prostheses for treating blindness, and provide a\nsubstantially higher bound on how accurately real visual signals may be represented in the brain.\n\n2 Approach\n\nTo decode images from spikes, we use a linear decoder to produce a baseline reconstructed image,\nthen enhance this image using a more complex nonlinear model, namely a static nonlinearity or a\nDNN (Figure 1). There are a few reasons for this staged approach. First, it allows us to cast the\ndecoding problem as a classic image enhancement problem that can directly utilize the computer\nvision literature on super-resolution, in-painting, and denoising. This is especially important for the\nconstruction of DNNs, which remain nontrivial to tune for problems in non-standard domains (e.g.,\nimage reconstruction from neural spikes). 
Second, by solving the problem partially with a simple\nlinear model, we greatly reduce the space of transformations that a neural network needs to learn,\nconstraining the problem signi\ufb01cantly.\n\n1 Source code is available at: https://github.com/nikparth/visual-neural-decode\n\n[Figure 1 diagram labels. A) Image, RGC responses, linear decoder, linear decoded image, deep neural network, NN-enhanced image. B) Notation: Conv(\ufb01lter_size, \ufb01lter_num). Encoder: Conv(7, 64), Conv(5, 128), Conv(3, 256), Conv(3, 256), each paired with Downsample(2,2). Decoder: Conv(3, 256), Conv(3, 128), Conv(5, 64), Conv(7, 1), each paired with Upsample(2,2). Output: enhanced image.]\n\nIn order to leverage image enhancement tools from deep learning, we need large training data sets.\nWe use an encoder-decoder approach: \ufb01rst, develop a realistic encoding model that can simulate\nneural responses to arbitrary input images, constrained by real data. We build this encoder to predict\nthe average outputs of many RGCs, but this approach could also be applied to encoders \ufb01t on a\ncell-by-cell basis [3]. Once this encoder is in hand, we train arbitrarily complex decoders by sampling\nmany natural scenes, passing them through the encoder model, and training the decoder so that the\noutput of the full encoder-decoder pipeline matches the observed image as accurately as possible.\n\n2.1 Encoder model: simulation of retinal ganglion cell responses\n\nFor our encoding model, we create a static simulation of the four most numerous retinal ganglion cell\ntypes (ON and OFF parasol cells and ON and OFF midget cells) based on experimental data. We \ufb01t\nlinear-nonlinear-Poisson models to RGC responses to natural scene movies, recorded in an isolated\nmacaque retina preparation [7, 10, 12]. These \ufb01ts produce imperfect but reasonable predictions of\nRGC responses (Figure 2 A). 
We averaged the parameters (spatial \ufb01lter, temporal \ufb01lter, and sigmoid\nparameters) of these \ufb01ts across neurons, to create a single model for each of four cell types. We chose\nthis model as it is simple and a relatively good baseline encoder with which to test our decoding\nmethod. (Recently, encoding models that leverage deep neural networks [3, 24] have been shown to\n\ufb01t RGC responses better than the simple model we are using; substituting a more complex encoding\nmodel should improve the quality of our \ufb01nal decoder, and we intend to pursue this approach in\nfuture work.) To deal with static images, we then reduced these models to static models, consisting\nof one spatial \ufb01lter followed by a nonlinearity and Poisson spike generation. The outputs of the static\nmodel are equal to summing the spikes produced by the full model over the image frames of a pulse\nmovie: gray frames followed by one image displayed for multiple frames. Spatial \ufb01lters and the\nnonlinearity of the \ufb01nal encoding model are shown in Figure 2 B and C.\nWe then tiled the image space (128 x 128 pixels) with these simulated neurons. For each cell type,\nwe \ufb01t a 2D Gaussian to the spatial \ufb01lter of that cell type and then chose receptive \ufb01eld centers with a\nwidth equal to 2 times the standard deviation of the Gaussian \ufb01t rounded up to the nearest integer.\nThe centers are shifted on alternate rows to form a lattice (Figure 2 D). The resulting response of\neach neuron to an example image is displayed in Figure 2 E as a function of its location on the image.\nThe entire simulation consisted of 5398 RGCs.\n\n2.2 Model architecture\n\nOur decoding model starts with a classic linear regression decoder (LD) to generate linearly decoded\nimages I LD [37]. 
The LD learns a reconstruction mapping $\\hat{\\theta}$ between neural responses $X$ and\nstimulus images $I^{ST}$ by modeling each pixel as a weighted sum of the neural responses:\n$\\hat{\\theta} = (X^{T} X)^{-1} X^{T} I^{ST}$. $X$ is augmented with a bias term in the \ufb01rst column. The model inputs are $m$\nimages, $p$ pixels and $n$ neurons such that: $I^{ST} \\in \\mathbb{R}^{m \\times p}$, $X \\in \\mathbb{R}^{m \\times (n+1)}$, $\\hat{\\theta} \\in \\mathbb{R}^{(n+1) \\times p}$. To decode\nthe set of neural responses $X$ we compute the matrix product of $X$ and $\\hat{\\theta}$: $I^{LD} = X \\hat{\\theta}$.\nThe next step of our decoding pipeline enhances $I^{LD}$ through the use of a deep convolutional\nautoencoder (CAE). Our model consists of a 4-layer encoder and a 4-layer decoder. This model\narchitecture was inspired by similar models used in image denoising [11] and inpainting [35, 22].\nIn the encoder network $E$, each layer applies a convolution and downsampling operation to the\noutput tensor of the previous layer. The output of the encoder is a tensor of activation maps\nrepresenting a low-dimensional embedding of $I^{LD}$. The decoder network $D$ inverts the encoding\nprocess by applying a sequence of upsampling and convolutional layers to the output tensor of the\nprevious layer. This model outputs the reconstructed image $I^{CAE}$. We optimize the CAE end-to-end\nthrough backpropagation by minimizing the pixelwise MSE between the output image of the CAE,\n$I^{CAE} = D(E(I^{LD}))$, and the original stimulus image $I^{ST}$.\nThe \ufb01lter sizes, number of layers, and number of \ufb01lters were all tuned through an exhaustive grid\nsearch. We searched over the following parameter space in our grid search: number of encoding\n/ decoding layers: [3, 4, 5], number of \ufb01lters in each layer: [32, 64, 128, 256], \ufb01lter sizes: [7x7,\n5x5, 3x3], learning rates: [0.00005, 0.0001, 0.0002, 0.0004, 0.0008, 0.001, 0.002, 0.004]. Speci\ufb01c\narchitecture details are provided in Figure 1.\n\nFigure 2: Encoding model. 
A) Full spatiotemporal encoding model performance on experimental data.\nRecorded responses (black) vs LNP predictions (red; using the averaged parameters over all cells of\neach type) for one example cell of each type. The spiking responses to 57 trials of a natural scenes test\nmovie were averaged over trials and then smoothed with a 10 ms SD Gaussian. B) Spatial \ufb01lters of\nthe simulated neural encoding model are shown for each cell type. C) The nonlinearity following the\nspatial \ufb01lter-stimulus multiplication is shown for each cell type. We draw from a Poisson distribution\non the output of the nonlinearity to obtain the neural responses. D) Demonstration of the mosaic\nstructure for each cell type on a patch of the image space. The receptive \ufb01elds of each neuron are\nrepresented by the 1 SD contour of the Gaussian \ufb01t to the spatial \ufb01lter of each cell type. E) The\nresponse of each cell is plotted in the square around its receptive \ufb01eld center. The visual stimulus is\nshown on the left. The color maps of ON and OFF cells are reversed to associate high responses with\ntheir preferred stimulus polarity.\n\n[Figure 2 panel labels. Panels A-E each show the four cell types: ON Parasol, OFF Parasol, ON Midget, OFF Midget. Axes: Time (s), Firing Rate (Hz). Scale bars: 10 pixels.]\n\n2.3 Training and Evaluation\n\nTo train the linear decoder, we iterate through the training data once to collect the suf\ufb01cient statistics\n$X^{T} X$ and $X^{T} I^{ST}$. We train the convolutional autoencoder to minimize the pixelwise MSE $P_{MSE}$\nwith the Adam optimizer [15]. To avoid over\ufb01tting, we monitor $P_{MSE}$ changes on a validation set\nthree times per epoch and keep track of the current best loss $P_{MSE,best}$. 
We stop training if we have\ngone through 2 epochs' worth of training data and the validation loss has not decreased by more\nthan 0.1% of the best validation loss $P_{MSE,best}$.\nIn our experiments we use two image datasets, ImageNet [8] and the CelebA face dataset [21]. We\napply preprocessing steps described previously in [17] to each image: 1) convert to grayscale, 2)\nrescale to 256x256, 3) crop the middle 128x128 region. From ImageNet we use 930k random images\nfor training, 50k for validation, and a 10k held-out set for testing. We use ImageNet in all but one of\nour experiments, the context-dependent decoding of Section 3.3. For the latter, we use the CelebA face dataset [21] with 160k\nimages for training, 30k for validation, and a 10k held-out set for testing.\nWe evaluate all the models in our results using two separate metrics, pixelwise MSE and multi-scale\nstructural-similarity (SSIM) [36]. Although each metric alone has known shortcomings, in\ncombination, they provide an objective evaluation of image reconstruction that is interpretable and\nwell-understood.\n\n3 Results\n\n3.1 ImageNet decoding\n\nAs expected [33], the linear decoder reconstructed blurry, noisy versions of the original natural\nimages from the neural responses, a result that is attributable to the noisy responses from the RGCs\ndown-sampling the input images. The two-staged model of the CAE trained on the output of the linear\ndecoder (L-CAE) resulted in substantially improved reconstructions, perceptually and quantitatively\n(Figure 3). L-CAE decoding outperformed linear decoding both on average and for the vast majority\nof images, by both the MSE and 1-SSIM measures. Qualitatively, the improvements made\nby the CAE generally show increased sharpening of edges, adjustment of contrast, and smoothing\nwithin object boundaries that reduced overall noise. Similar improvement in decoding could not\nbe replicated by utilizing static nonlinearities to transform the linear decoded output to the original\nimages. 
We used a 6th degree polynomial \ufb01tted to approximate the relation between linearly decoded\nand original image pixel intensities, and then evaluated this nonlinear decoding on held-out data. This\napproach produced a small improvement in reconstruction: 3.25% reduction in MSE compared to\n34.50% for the L-CAE. This reveals that the improvement in performance from the CAE involves\nnonlinear image enhancement beyond simple remapping of pixel intensities. Decoding noisier neural\nresponses especially highlights the bene\ufb01ts of using the autoencoder: there are features identi\ufb01able in\nthe L-CAE enhanced images that are not in the linear decoder images (Supplementary Figure 6).\nThe results shown here utilize a large training dataset for the decoder, so it is natural to ask, for a\ngiven \ufb01xed encoder model, how many training responses we need to simulate to obtain a good\ndecoder. We tested this by \ufb01xing our encoder and then training the CAE stage of the decoder with\nvarying amounts of training data (Supplementary Figure 8). We observed that even with a small\ntraining data set of 20k examples, we can improve signi\ufb01cantly on the linear decoder, and after around\n500k examples, our performance begins to saturate. An analogous question can be asked about the\namount of training data required to \ufb01t a good encoder and we intend to explore this aspect in future\nwork.\n\n3.2 Phase Scrambled Training\n\nA possible explanation for the improved performance of the L-CAE compared to the baseline linear\ndecoder is that it more fully exploits phase structure that is characteristic of natural images [2],\nperhaps by incorporating priors on phase structure that are not captured by linear decoding. 
To test\nthis possibility, we trained both linear and L-CAE decoders on phase-scrambled natural images.\nThe CAE input was produced by the linear decoder trained on the same image type as the CAE.\nObserved responses of RGCs to these stimuli followed approximately the same marginal distribution\nas responses to the original natural images. We then compared the performance of these linear and\n\n5\n\n\fFigure 3: Comparison of linear and CAE decoding. A) MSE on a log-log plot for the ImageNet 10k\nexample test set comparing the L-CAE model trained on ImageNet (only 1k subsampled examples\nare plotted here for visualization purposes). B) 1-SSIM version of the same \ufb01gure. C) Example\nimages from the test set show the original, linear decoded, L-CAE enhanced versions. The average\n(MSE, 1-SSIM) for the linear decoder over the full test set was (0.0077, 0.35) and the corresponding\naverages for the L-CAE were (0.0051, 0.25).\n\nL-CAE decoders to the performance of the original decoders, on the original natural images (Figure\n4). The linear decoder exhibited similar decoding performance when trained on the original and\nphase-scrambled images, while the L-CAE exhibited substantially higher performance when trained\non real images. These \ufb01ndings are consistent with the idea that the CAE is able to capture prior\ninformation on image phase structure not captured by linear decoding. However, direct comparisons\nof the L-CAE and LD trained and tested on phase scrambled images show that the L-CAE does still\nlead to some improvements which are most likely just due to the increased parameter complexity of\nthe decoding model (Supplementary Figure 7).\n\n3.3 Context Dependent Training\n\nThe above results suggest that the CAE is capturing important natural image priors. 
However, it\nremains unclear whether these priors are suf\ufb01cient to decode speci\ufb01c classes of natural images as\naccurately as decoding models that are tuned to incorporate class-speci\ufb01c priors. We explored this in\n\n[Figure 3 panel labels. A) MSE and B) 1-SSIM, L-CAE vs. linear decoder. C) Original, linear decoded, and L-CAE decoded images.]\n\nFigure 4: Comparison of phase scrambled and ImageNet trained models. A) MSE on log-log plot\ncomparing the performance of the linear decoder \ufb01t on natural images to the linear decoder \ufb01t on\nphase scrambled images. The subscript of each model indicates the dataset on which it was trained.\nThe reported MSE values are based on performance on the natural image test set (1k subsampled\nexamples shown). B) Similar plot to A but comparing the L-CAE \ufb01t on natural images to the L-CAE\n\ufb01t on phase scrambled images. C) 1-SSIM version of A. D) 1-SSIM version of B. E) One example\ntest natural image (represented by blue dot in A-D) showing the reconstructions from all 4 models\nand the phase scrambled version.\n\nthe context of human faces by fully re-training a class-speci\ufb01c L-CAE using the CelebA face dataset.\nBoth linear and CAE stages were trained from scratch (random initialization) using only this dataset.\nAs with the phase scrambled comparisons, the CAE input is produced by the linear decoder trained\non the same image type. We then compare these different linear decoder and L-CAE models on a test\nset of CelebA faces. For the linear decoders, we see a 17% improvement in average test MSE and a\n14% improvement in 1-SSIM when training on CelebA as compared to training on ImageNet (Figure\n5 A and C). 
We \ufb01nd that the differences in MSE and 1-SSIM between the differently trained L-CAE\nmodels are smaller (5% improvement in MSE and a 4% improvement in 1-SSIM) (Figure 5 B and D).\nThe much smaller difference in MSE and 1-SSIM suggests that the L-CAE decoder does a better job\nat generalizing to unseen context-speci\ufb01c classes than the linear decoder. However, the images show\nthat there are still important face-speci\ufb01c features (such as nose and eye de\ufb01nition) that are much\nbetter decoded by the L-CAE trained only on faces (Figure 5E). This suggests that while the natural\nimage statistics captured by the CAE do help improve its generalization to more structured classes,\nthere are still signi\ufb01cant bene\ufb01ts in training class-speci\ufb01c models.\n\n4 Discussion\n\nThe work presented here develops a novel approximate Bayesian decoding technique that uses\nnon-linear DNNs to decode images from simulated responses of retinal neurons. The approach\nsubstantially outperforms linear reconstruction techniques that have usually been used to decode\nneural responses to high-dimensional stimuli.\nPerhaps the most successful previous applications of Bayesian neural decoding are in cases where the\nvariable to be decoded is low-dimensional. The work of [5] stimulated much progress in hippocampus\nand motor cortex using Bayesian state-space approaches applied to low-dimensional (typically\n\n[Figure 4 panel labels. A, B) MSE on test ImageNet. C, D) 1-SSIM on test ImageNet. Models: Linear and L-CAE, each trained on ImageNet or on phase scrambled images. E) Original, phase scrambled, and the four model reconstructions.]\n\nFigure 5: Comparison of CelebA and ImageNet trained models. 
A) MSE on log-log plot comparing\nthe performance of the linear decoder \ufb01t on CelebA to the linear decoder \ufb01t on ImageNet. The\nsubscript of each model indicates the dataset on which it was trained. The reported MSE values are\nbased on performance on the natural image test set (1k subsampled examples shown). B) Similar\nplot to A but comparing the L-CAE \ufb01t on CelebA to the L-CAE \ufb01t on ImageNet. C) 1-SSIM version\nof A. D) 1-SSIM version of B. E) One example test natural image (represented by blue dot in A-D)\nshowing the reconstructions from all 4 models.\n\ntwo- or three-dimensional) position variables; see also [16] and [28] for further details. The low\ndimensionality of the state variable and simple Markovian priors leads to fast Bayesian computation\nin these models. At the same time, non-Bayesian approaches based on support vector regression [32]\nor recurrent neural networks [34] have also proven powerful in these applications.\nDecoding information from the retina or early visual pathway requires ef\ufb01cient computations over\nobjects of much larger dimensionality: images and movies. Several threads are worth noting here.\nFirst, some previous work has focused on decoding of \ufb02icker stimuli [37] or motion statistics [18, 23],\nboth of which reduce to low-dimensional decoding problems. Other work has applied straightforward\nlinear decoding methods [33, 9]. Finally, some work has tackled the challenging problem of decoding\nstill images undergoing random perturbations due to eye movements [6, 1]. These studies developed\napproximate Bayesian decoders under simpli\ufb01ed natural image priors, and it would be interesting in\nfuture work to examine potential extensions of our approach to those applications.\nWhile our focus here has been on the decoding of spike counts from populations of neurons recorded\nwith single-cell precision, the ideas developed here could also be applied in the context of decoding\nfMRI data. 
Our approach shares some conceptual similarity to previous work [25, 27] which used\nelegant encoding models combined with brute-force computation over a large discrete sample space\nto compute posteriors, and to other work [38] which used neural network methods similar to those\ndeveloped in [41] to decode image features. Our approach, for example, could be extended to\nreplace a brute-force discrete-sample decoder [25, 27] with a decoder that operates over the full\nhigh-dimensional continuous space of all images.\nMany state-of-the-art models for in-painting and super-resolution image enhancement rely on\ngenerative adversarial networks (GANs). However, these models currently require speci\ufb01c architecture\ntuning based on the exact problem structure. Because our problem involves some complex and\nunknown combination of denoising, super-resolution, and inpainting, we required a more robust model\nthat could be tested with little hand-tuning. Furthermore, we have no parametric form for the noise\nin the linear decoded images, so standard pre-trained networks could not be applied directly. Based\non previous work in [39], it seems that autoencoder architectures can robustly achieve reasonable\nresults for these types of tasks; therefore, we chose the CAE architecture as a useful starting point.\nWe have begun to explore GAN architectures, but these early results do not show any signi\ufb01cant\nimprovements over our CAE model. We plan to explore these networks further in future work.\n\n[Figure 5 panel labels. A, B) MSE on test CelebA. C, D) 1-SSIM on test CelebA. Models: Linear and L-CAE, each trained on CelebA or on ImageNet. E) Original and the four model reconstructions.]\n\nIn Section 3.3 we saw that even though there were small differences in MSE and 1-SSIM between\nthe outputs of the L-CAE decoders trained on ImageNet vs. 
CelebA datasets, visually there were still\nsigni\ufb01cant differences. The most likely explanation for this discrepancy is that these loss functions\nare imperfect and do not adequately capture perceptually relevant differences between two images.\nIn recent years, more complex perceptual similarity metrics have gained traction in the deep learning\ncommunity [42, 20, 13]. While we did not extensively explore this aspect, we have done some\npreliminary experiments that suggest that using just a standard VGG-based perceptual metric [13]\ndecreases some blurring seen using MSE, but does not signi\ufb01cantly improve decoding in a robust\nway. We plan to further explore these ideas by implementing perceptual loss functions that utilize\nmore of our understanding of operations in the early human visual system [19]. Progress in this space\nis vital as any retinal prosthetics application of this work would require decoding of visual scenes\nthat is accurate by perceptual metrics rather than MSE.\nWe have shown improved reconstruction based on simulated data; clearly, an important next step is\nto apply this approach to decode real experimental data. In addition, we have shown better L-CAE\nreconstruction only based on one perfect mosaic of the simulated neurons. In reality, these mosaics\ndiffer from retina to retina and there are gaps in the mosaic when we record from retinal neurons.\nTherefore, it will be important to investigate whether the CAE can learn to generalize over different\nmosaic patterns. We also plan to explore reconstruction of movies and color images.\nThe present results have two implications for visual neuroscience. First, the results provide a\nframework for understanding how an altered neural code, such as the patterns of activity elicited in\na retinal prosthesis, could in\ufb02uence perception of the visual image. 
With our approach, this can be\nassessed in the image domain directly (instead of the domain of spikes) by examining the quality\nof \"optimal\" reconstruction from electrical activity induced by the prosthesis. Second, the results\nprovide a way to understand which aspects of natural scenes are effectively encoded in the natural\noutput of the retina, again, as assessed in the image domain. Previous efforts toward these two goals\nhave relied on linear reconstruction. The substantially higher performance of the L-CAE provides\na more stringent assessment of prosthesis function, and suggests that the retina may convey visual\nimages to the brain with higher \ufb01delity than was previously appreciated.\n\n5 Acknowledgments\n\nNSF GRFP DGE-16-44869 (EB), NSF/NIH Collaborative Research in Computational Neuroscience\nGrant IIS-1430348/1430239 (EJC & LP), DARPA Contract FA8650-16-1-7657 (EJC), Simons\nFoundation SF-SCGB-365002 (LP); IARPA MICRONS D16PC00003 (LP); DARPA N66001-17-C-\n4002 (LP).\n\nReferences\n[1] Alexander G Anderson, Bruno A Olshausen, Kavitha Ratnam, and Austin Roorda. A neural\nmodel of high-acuity vision in the presence of \ufb01xational eye movements. In Signals, Systems\nand Computers, 2016 50th Asilomar Conference on, pages 588\u2013592. IEEE, 2016.\n\n[2] Elizabeth Arsenault, Ahmad Yoonessi, and Curtis Baker. Higher order texture statistics impair\n\ncontrast boundary segmentation. Journal of vision, 11(10):14\u201314, 2011.\n\n[3] Eleanor Batty, Josh Merel, Nora Brackbill, Alexander Heitman, Alexander Sher, Alan Litke,\nE.J. Chichilnisky, and Liam Paninski. Multilayer recurrent network models of primate retinal\nganglion cell responses. International Conference on Learning Representations, 2017.\n\n[4] Gijs Joost Brouwer and David J. Heeger. Decoding and reconstructing color from responses\nin human visual cortex. 
The Journal of Neuroscience: the of\ufb01cial journal of the Society for\nNeuroscience, 29(44):13992\u201314003, 2009.\n\n9\n\n\f[5] Emery N Brown, Loren M Frank, Dengda Tang, Michael C Quirk, and Matthew A Wilson. A\nstatistical paradigm for neural spike train decoding applied to position prediction from ensemble\n\ufb01ring patterns of rat hippocampal place cells. Journal of Neuroscience, 18(18):7411\u20137425,\n1998.\n\n[6] Yoram Burak, Uri Rokni, Markus Meister, and Haim Sompolinsky. Bayesian model of dynamic\nimage stabilization in the visual system. Proceedings of the National Academy of Sciences,\n107(45):19525\u201319530, 2010.\n\n[7] E.J. Chichilnisky. A simple white noise analysis of neuronal light responses. Network: Compu-\n\ntation in Neural Systems, 12(2):199\u2013213, 2001.\n\n[8] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale\nhierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009.\nIEEE Conference on, pages 248\u2013255. IEEE, 2009.\n\n[9] Ariadna R. Diaz-Tahoces, Antonio Martinez-Alvarez, Alejandro Garcia-Moll, and Eduardo\nFernandez. Towards the reconstruction of moving images by populations of retinal ganglion\ncells. In 6th International Work-Conference on the Interplay Between Natural and Arti\ufb01cial\nComputation, IWINAC, volume 9107, 2015.\n\n[10] ES Frechette, A Sher, MI Grivich, D Petrusca, AM Litke, and EJ Chichilnisky. Fidelity of the\nensemble code for visual motion in primate retina. Journal of neurophysiology, 94(1):119\u2013135,\n2005.\n\n[11] Lovedeep Gondara. Medical image denoising using convolutional denoising autoencoders.\n\narXiv pre-print 1608.04667, 2016.\n\n[12] Alexander Heitman, Nora Brackbill, Martin Greschner, Alexander Sher, Alan M Litke, and\nEJ Chichilnisky. Testing pseudo-linear models of responses to natural scenes in primate retina.\nbioRxiv, page 045336, 2016.\n\n[13] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. 
Perceptual losses for real-time style transfer\nand super-resolution. In European Conference on Computer Vision, pages 694\u2013711. Springer,\n2016.\n\n[14] Yukiyasu Kamitani and Frank Tong. Decoding the visual and subjective contents of the human\nbrain. Nature Neuroscience, 8(5):679\u2013685, 2005.\n\n[15] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint\narXiv:1412.6980, 2014.\n\n[16] Shinsuke Koyama, Lucia Castellanos P\u00e9rez-Bolde, Cosma Rohilla Shalizi, and Robert E Kass.\nApproximate methods for state-space models. Journal of the American Statistical Association,\n105(489):170\u2013180, 2010.\n\n[17] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep\nconvolutional neural networks. In Advances in Neural Information Processing Systems, 2012.\n\n[18] Edmund C Lalor, Yashar Ahmadian, and Liam Paninski. The relationship between optimal and\nbiologically plausible decoding of stimulus velocity in the retina. JOSA A, 26(11):B25\u2013B42,\n2009.\n\n[19] Valero Laparra, Alex Berardino, Johannes Ball\u00e9, and Eero P Simoncelli. Perceptually optimized\nimage rendering. arXiv preprint arXiv:1701.06641, 2017.\n\n[20] Christian Ledig, Lucas Theis, Ferenc Husz\u00e1r, Jose Caballero, Andrew Cunningham,\nAlejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al.\nPhoto-realistic single image super-resolution using a generative adversarial network. arXiv preprint\narXiv:1609.04802, 2016.\n\n[21] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the\nwild. In Proceedings of International Conference on Computer Vision (ICCV), 2015.\n\n[22] Xiao-Jiao Mao, Chunhua Shen, and Yu-Bin Yang. Image restoration using convolutional\nauto-encoders with symmetric skip connections.
In Advances in Neural Information Processing Systems,\n2016.\n\n[23] Olivier Marre, Vicente Botella-Soler, Kristina D Simmons, Thierry Mora, Ga\u0161per Tka\u010dik, and\nMichael J Berry II. High accuracy decoding of dynamical motion from a large retinal population.\nPLoS Comput Biol, 11(7):e1004304, 2015.\n\n[24] Lane McIntosh, Niru Maheswaranathan, Aran Nayebi, Surya Ganguli, and Stephen A.\nBaccus. Deep learning models of the retinal response to natural scenes. In Advances in Neural\nInformation Processing Systems, 2016.\n\n[25] Thomas Naselaris, Ryan J. Prenger, Kendrick N. Kay, Michael Oliver, and Jack L. Gallant.\nBayesian reconstruction of natural images from human brain activity. Neuron, 63(9):902\u2013915,\n2009.\n\n[26] Sheila Nirenberg and Chetan Pandarinath. Retinal prosthetic strategy with the capacity to restore\nnormal vision. Proceedings of the National Academy of Sciences, 109(37), 2012.\n\n[27] Shinji Nishimoto, An T. Vu, Thomas Naselaris, Yuval Benjamini, Bin Yu, and Jack L. Gallant.\nReconstructing visual experiences from brain activity evoked by natural movies. Current\nBiology, 21(19):1641\u20131646, 2011.\n\n[28] Liam Paninski, Yashar Ahmadian, Daniel Gil Ferreira, Shinsuke Koyama, Kamiar Rahnama\nRad, Michael Vidne, Joshua Vogelstein, and Wei Wu. A new look at state-space models for\nneural data. Journal of computational neuroscience, 29(1-2):107\u2013126, 2010.\n\n[29] Brian N. Pasley, Stephen V. David, Nima Mesgarani, Adeen Flinker, Shihab A. Shamma,\nNathan E. Crone, Robert T. Knight, and Edward F. Chang. Reconstructing speech from human\nauditory cortex. PLOS Biology, 10(1), 2012.\n\n[30] Alexandro D Ramirez, Yashar Ahmadian, Joseph Schumacher, David Schneider, Sarah M. N.\nWoolley, and Liam Paninski. Incorporating naturalistic correlation structure improves\nspectrogram reconstruction from neuronal activity in the songbird auditory midbrain. Journal of\nNeuroscience, 31(10):3828\u20133842, 2011.\n\n[31] Fred Rieke, David Warland, Rob de Ruyter van Steveninck, and William Bialek.
Spikes: Exploring the Neural Code. MIT Press, Cambridge, MA, USA, 1999.\n\n[32] Lavi Shpigelman, Hagai Lalazar, and Eilon Vaadia. Kernel-ARMA for hand tracking and\nbrain-machine interfacing during 3D motor control. In Advances in Neural Information Processing\nSystems, pages 1489\u20131496, 2009.\n\n[33] Garrett B. Stanley, Fei F. Li, and Yang Dan. Reconstruction of natural scenes from ensemble\nresponses in the lateral geniculate nucleus. Journal of Neuroscience, 19(18):8036\u20138042, 1999.\n\n[34] David Sussillo, Sergey D Stavisky, Jonathan C Kao, Stephen I Ryu, and Krishna V Shenoy.\nMaking brain\u2013machine interfaces robust to future neural variability. Nature Communications, 7,\n2016.\n\n[35] Zhangyang Wang, Yingzhen Yang, Zhaowen Wang, Shiyu Chang, Wen Han, Jianchao Yang, and\nThomas S. Huang. Self-tuned deep super resolution. In Proceedings of the IEEE Conference on\nComputer Vision and Pattern Recognition Workshops, 2015.\n\n[36] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment:\nfrom error visibility to structural similarity. IEEE transactions on image processing, 13(4):600\u2013\n612, 2004.\n\n[37] David K. Warland, Pamela Reinagel, and Markus Meister. Decoding visual information from a\npopulation of retinal ganglion cells. Journal of neurophysiology, 78(5):2336\u20132350, 1997.\n\n[38] Haiguang Wen, Junxing Shi, Yizhen Zhang, Kun-Han Lu, and Zhongming Liu. Neural encoding\nand decoding with deep learning for dynamic natural vision. arXiv preprint arXiv:1608.03425, 2016.\n\n[39] Junyuan Xie, Linli Xu, and Enhong Chen. Image denoising and inpainting with deep neural\nnetworks. In Advances in Neural Information Processing Systems, pages 341\u2013349, 2012.\n\n[40] Kai Xu, Yueming Wang, Shaomin Zhang, Ting Zhao, Yiwen Wang, Weidong Chen, and\nXiaoxiang Zhang. Comparisons between linear and nonlinear methods for decoding motor\ncortical activities of monkey.
In Engineering in Medicine and Biology Society, EMBC, Annual\nInternational Conference of the IEEE, 2011.\n\n[41] Daniel LK Yamins, Ha Hong, Charles F Cadieu, Ethan A Solomon, Darren Seibert, and James J\nDiCarlo. Performance-optimized hierarchical models predict neural responses in higher visual\ncortex. Proceedings of the National Academy of Sciences, 111(23):8619\u20138624, 2014.\n\n[42] Hang Zhao, Orazio Gallo, Iuri Frosio, and Jan Kautz. Loss functions for neural networks for\nimage processing. arXiv preprint arXiv:1511.08861, 2015.\n", "award": [], "sourceid": 3218, "authors": [{"given_name": "Nikhil", "family_name": "Parthasarathy", "institution": "New York University"}, {"given_name": "Eleanor", "family_name": "Batty", "institution": "Columbia University"}, {"given_name": "William", "family_name": "Falcon", "institution": "Columbia University"}, {"given_name": "Thomas", "family_name": "Rutten", "institution": "Columbia University"}, {"given_name": "Mohit", "family_name": "Rajpal", "institution": "Columbia University"}, {"given_name": "E.J.", "family_name": "Chichilnisky", "institution": "Stanford University"}, {"given_name": "Liam", "family_name": "Paninski", "institution": "Columbia University"}]}