{"title": "Reconstructing perceived faces from brain activations with deep adversarial neural decoding", "book": "Advances in Neural Information Processing Systems", "page_first": 4246, "page_last": 4257, "abstract": "Here, we present a novel approach to solve the problem of reconstructing perceived stimuli from brain responses by combining probabilistic inference with deep learning. Our approach first inverts the linear transformation from latent features to brain responses with maximum a posteriori estimation and then inverts the nonlinear transformation from perceived stimuli to latent features with adversarial training of convolutional neural networks. We test our approach with a functional magnetic resonance imaging experiment and show that it can generate state-of-the-art reconstructions of perceived faces from brain activations.", "full_text": "Reconstructing perceived faces from brain activations\n\nwith deep adversarial neural decoding\n\nYa\u02d8gmur G\u00fc\u00e7l\u00fct\u00fcrk*, Umut G\u00fc\u00e7l\u00fc*,\n\nKatja Seeliger, Sander Bosch,\n\nRob van Lier, Marcel van Gerven,\n\nRadboud University, Donders Institute for Brain, Cognition and Behaviour\n\nNijmegen, the Netherlands\n\n{y.gucluturk, u.guclu}@donders.ru.nl\n\n*Equal contribution\n\nAbstract\n\nHere, we present a novel approach to solve the problem of reconstructing perceived\nstimuli from brain responses by combining probabilistic inference with deep learn-\ning. Our approach \ufb01rst inverts the linear transformation from latent features to brain\nresponses with maximum a posteriori estimation and then inverts the nonlinear\ntransformation from perceived stimuli to latent features with adversarial training\nof convolutional neural networks. 
We test our approach with a functional magnetic resonance imaging experiment and show that it can generate state-of-the-art reconstructions of perceived faces from brain activations.

Figure 1: An illustration of our approach to solve the problem of reconstructing perceived stimuli from brain responses by combining probabilistic inference with deep learning.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

1 Introduction

A key objective in sensory neuroscience is to characterize the relationship between perceived stimuli and brain responses. This relationship can be studied with neural encoding and neural decoding in functional magnetic resonance imaging (fMRI) [1]. The goal of neural encoding is to predict brain responses to perceived stimuli [2]. Conversely, the goal of neural decoding is to classify [3, 4], identify [5, 6] or reconstruct [7–11] perceived stimuli from brain responses.

The recent integration of deep learning into neural encoding has been a very successful endeavor [12, 13]. To date, the most accurate predictions of brain responses to perceived stimuli have been achieved with convolutional neural networks [14–20], leading to novel insights about the functional organization of neural representations. At the same time, the use of deep learning as the basis for neural decoding has received less widespread attention. Deep neural networks have been used for classifying or identifying stimuli via the use of a deep encoding model [16, 21] or by predicting intermediate stimulus features [22, 23]. 
Deep belief networks and convolutional neural networks have\nbeen used to reconstruct basic stimuli (handwritten characters and geometric \ufb01gures) from patterns\nof brain activity [24, 25]. To date, going beyond such mostly retinotopy-driven reconstructions and\nreconstructing complex naturalistic stimuli with high accuracy have proven to be dif\ufb01cult.\nThe integration of deep learning into neural decoding is an exciting approach for solving the recon-\nstruction problem, which is de\ufb01ned as the inversion of the (non)linear transformation from perceived\nstimuli to brain responses to obtain a reconstruction of the original stimulus from patterns of brain\nactivity alone. Reconstruction can be formulated as an inference problem, which can be solved by\nmaximum a posteriori estimation. Multiple variants of this formulation have been proposed in the\nliterature [26\u201330]. At the same time, signi\ufb01cant improvements are to be expected from deep neural\ndecoding given the success of deep learning in solving image reconstruction problems in computer\nvision such as colorization [31], face hallucination [32], inpainting [33] and super-resolution [34].\nHere, we present a new approach by combining probabilistic inference with deep learning, which\nwe refer to as deep adversarial neural decoding (DAND). Our approach \ufb01rst inverts the linear\ntransformation from latent features to observed responses with maximum a posteriori estimation.\nNext, it inverts the nonlinear transformation from perceived stimuli to latent features with adversarial\ntraining and convolutional neural networks. An illustration of our model is provided in Figure 1. 
We show that our approach achieves state-of-the-art reconstructions of perceived faces from the human brain.

2 Methods

2.1 Problem statement

Let x ∈ R^{h×w×c}, z ∈ R^p, y ∈ R^q be a stimulus, feature, response triplet, and φ : R^{h×w×c} → R^p be a latent feature model such that z = φ(x) and x = φ⁻¹(z). Without loss of generality, we assume that all of the variables are normalized to have zero mean and unit variance.

We are interested in solving the problem of reconstructing perceived stimuli from brain responses:

x̂ = φ⁻¹(arg max_z Pr(z | y))    (1)

where Pr(z | y) is the posterior. We reformulate the posterior through Bayes' theorem:

x̂ = φ⁻¹(arg max_z [Pr(y | z) Pr(z)])    (2)

where Pr(y | z) is the likelihood, and Pr(z) is the prior. In the following subsections, we define the latent feature model, the likelihood and the prior.

2.2 Latent feature model

We define the latent feature model φ(x) by modifying the VGG-Face pretrained model [35]. This model is a 16-layer convolutional neural network, which was trained for face recognition. First, we truncate it by retaining the first 14 layers and discarding the last two layers of the model. At this point, the truncated model outputs 4096-dimensional latent features. To reduce the dimensionality of the latent features, we then combine the model with principal component analysis by estimating the loadings that project the 4096-dimensional latent features to the first 699 principal component scores (maximum number of components given the number of training observations) and adding them at the end of the truncated model as a new fully-connected layer. 
At this point, the combined model outputs 699-dimensional latent features.

Following the ideas presented in [36–38], we define the inverse of the feature model φ⁻¹(z) (i.e., the image generator) as a convolutional neural network which transforms the 699-dimensional latent variables to 64 × 64 × 3 images, and estimate its parameters via an adversarial process. The generator comprises five deconvolution layers: the ith layer has 2^{10−i} kernels with a size of 4 × 4, a stride of 2 × 2, a padding of 1 × 1, batch normalization [39] and rectified linear units. Exceptions are the first layer, which has a stride of 1 × 1 and no padding, and the last layer, which has three kernels, no batch normalization and hyperbolic tangent units. Note that we do not use the inverse of the loadings in the generator.

To enable adversarial training, we define a discriminator (ψ) along with the generator. The discriminator comprises five convolution layers. The ith layer has 2^{5+i} kernels with a size of 4 × 4, a stride of 2 × 2, a padding of 1 × 1, batch normalization and leaky rectified linear units with a slope of 0.2, except for the first layer, which has no batch normalization, and the last layer, which has one kernel, a stride of 1 × 1, no padding, no batch normalization and a sigmoid unit.

We train the generator and the discriminator by pitting them against each other in a two-player zero-sum game, where the goal of the discriminator is to discriminate stimuli from reconstructions and the goal of the generator is to generate reconstructions that are indiscriminable from original stimuli. 
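The layer specification above can be sanity-checked with the standard transposed-convolution output-size formula, out = stride · (in − 1) + kernel − 2 · pad; a minimal sketch (the latent vector is assumed to enter as a 1 × 1 spatial map, which the text implies but does not state):

```python
def deconv_out(size, kernel=4, stride=2, pad=1):
    """Output spatial size of a transposed-convolution (deconvolution) layer."""
    return stride * (size - 1) + kernel - 2 * pad

# Layer spec from the text: the ith layer has 2**(10 - i) kernels, except the
# last, which has 3 (RGB); the first layer has stride 1 and no padding.
layers = []
size = 1  # the 699-dimensional latent vector treated as a 1x1 spatial map
for i in range(1, 6):
    channels = 3 if i == 5 else 2 ** (10 - i)
    stride, pad = (1, 0) if i == 1 else (2, 1)
    size = deconv_out(size, 4, stride, pad)
    layers.append((channels, size))

print(layers)  # [(512, 4), (256, 8), (128, 16), (64, 32), (3, 64)]
```

The spatial map doubles at every strided layer (4 → 8 → 16 → 32 → 64), ending at the stated 64 × 64 × 3 output.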
This ensures that reconstructed stimuli are similar to target stimuli on a pixel level and a feature level.

The discriminator is trained by iteratively minimizing the following discriminator loss function:

L_dis = −E[log(ψ(x)) + log(1 − ψ(φ⁻¹(z)))]    (3)

where ψ is the output of the discriminator, which gives the probability that its input is an original stimulus and not a reconstructed stimulus. The generator is trained by iteratively minimizing a generator loss function, which is a linear combination of an adversarial loss function, a feature loss function and a stimulus loss function:

L_gen = −λ_adv E[log(ψ(φ⁻¹(z)))] + λ_fea E[‖ξ(x) − ξ(φ⁻¹(z))‖²] + λ_sti E[‖x − φ⁻¹(z)‖²]    (4)

where the three terms are the adversarial loss L_adv, the feature loss L_fea and the stimulus loss L_sti, and ξ is the relu3_3 outputs of the pretrained VGG-16 model [40, 41]. Note that the targets and the reconstructions are lower resolution (i.e., 64 × 64) than the images that are used to obtain the latent features (i.e., 224 × 224).

2.3 Likelihood and prior

We define the likelihood as a multivariate Gaussian distribution over y:

Pr(y | z) = N_y(B⊤z, Σ)    (5)

where B = (β₁, . . . , β_q) ∈ R^{p×q} and Σ = diag(σ₁², . . . , σ_q²) ∈ R^{q×q}. Here, the features × voxels matrix B contains the learnable parameters of the likelihood in its columns β_i (which can also be interpreted as regression coefficients of a linear regression model, which predicts y from z).

We estimate the parameters with ordinary least squares, such that β̂_i = arg min_{β_i} E[‖y_i − β_i⊤z‖²] and σ̂_i² = E[‖y_i − β̂_i⊤z‖²].

We define the prior as a zero mean and unit variance multivariate Gaussian distribution Pr(z) = N_z(0, I).

2.4 Posterior

To derive the posterior (2), we first reformulate the likelihood as a multivariate Gaussian distribution over z. That is, after taking out constant terms with respect to z from the likelihood, it immediately becomes proportional to the canonical form Gaussian over z with ν = BΣ⁻¹y and Λ = BΣ⁻¹B⊤, which is equivalent to the standard form Gaussian with mean Λ⁻¹ν and covariance Λ⁻¹. This allows us to write:

Pr(z | y) ∝ N_z(Λ⁻¹ν, Λ⁻¹) N_z(0, I)    (6)

Next, recall that the product of two multivariate Gaussians can be formulated in terms of one multivariate Gaussian [42]. That is, N_z(m₁, Σ₁) N_z(m₂, Σ₂) ∝ N_z(m_c, Σ_c) with m_c = (Σ₁⁻¹ + Σ₂⁻¹)⁻¹(Σ₁⁻¹m₁ + Σ₂⁻¹m₂) and Σ_c = (Σ₁⁻¹ + Σ₂⁻¹)⁻¹. By plugging this formulation into Equation (6), we obtain Pr(z | y) ∝ N_z(m_c, Σ_c) with m_c = (BΣ⁻¹B⊤ + I)⁻¹BΣ⁻¹y and Σ_c = (BΣ⁻¹B⊤ + I)⁻¹.

Recall that we are interested in reconstructing stimuli from responses by generating reconstructions from the features that maximize the posterior. 
Notice that the (unnormalized) posterior is maximized at its mean m_c, since this corresponds to the mode for a multivariate Gaussian distribution. Therefore, the solution of the problem of reconstructing stimuli from responses reduces to the following simple expression:

x̂ = φ⁻¹((BΣ⁻¹B⊤ + I)⁻¹BΣ⁻¹y)    (7)

3 Results

3.1 Datasets

We used the following datasets in our experiments:

fMRI dataset. We collected a new fMRI dataset, which comprises face stimuli and associated blood-oxygen-level dependent (BOLD) responses. The stimuli used in the fMRI experiment were drawn from [43–45] and other online sources, and consisted of photographs of front-facing individuals with neutral expressions. We measured BOLD responses (TR = 1.4 s, voxel size = 2 × 2 × 2 mm³, whole-brain coverage) of two healthy adult subjects (S1: 28-year-old female; S2: 39-year-old male) as they were fixating on a target (0.6 × 0.6 degree) [46] superimposed on the stimuli (15 × 15 degrees). Each face was presented at 5 Hz for 1.4 s and followed by a middle gray background presented for 2.8 s. In total, 700 faces were presented twice for the training set, and 48 faces were repeated 13 times for the test set. The test set was balanced in terms of gender and ethnicity (based on the norming data provided in the original datasets). The experiment was approved by the local ethics committee (CMO Regio Arnhem-Nijmegen) and the subjects provided written informed consent in accordance with the Declaration of Helsinki. 
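The estimation in Sections 2.3–2.4 reduces to a few lines of linear algebra: an ordinary-least-squares fit of B and σ² followed by the closed-form posterior mean of Equation (7). A minimal numpy sketch on toy data (all sizes and the noise level are hypothetical, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)
p, q, n = 20, 50, 200  # latent dims, voxels, training trials (toy sizes)

# Toy training data generated from the likelihood model y = B^T z + noise.
Z = rng.standard_normal((n, p))
B_true = rng.standard_normal((p, q))
Y = Z @ B_true + 0.1 * rng.standard_normal((n, q))

# Section 2.3: ordinary least squares fit of B and per-voxel residual variance.
B, *_ = np.linalg.lstsq(Z, Y, rcond=None)   # (p, q): features x voxels
sigma2 = ((Y - Z @ B) ** 2).mean(axis=0)    # diagonal of Sigma

# Section 2.4 / Eq. (7): posterior-mean latent features for a new response y.
Sigma_inv = np.diag(1.0 / sigma2)
y = Y[0]
z_map = np.linalg.solve(B @ Sigma_inv @ B.T + np.eye(p), B @ Sigma_inv @ y)
# The reconstruction would then be phi^{-1}(z_map), i.e. the generator output.
```

With a diagonal Σ the solve is cheap; the (p × p) system only has to be factorized once per subject, since m_c is linear in y.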
Our fMRI dataset is available from the \ufb01rst authors on reasonable request.\nThe stimuli were preprocessed as follows: Each image was cropped and resized to 224 \u00d7 224 pixels.\nThis procedure was organized such that the distance between the top of the image and the vertical\ncenter of the eyes was 87 pixels, the distance between the vertical center of the eyes and the vertical\ncenter of the mouth was 75 pixels, the distance between the vertical center of the mouth and the\nbottom of the image was 61 pixels, and the horizontal center of the eyes and the mouth was at the\nhorizontal center of the image.\nThe fMRI data were preprocessed as follows: Functional scans were realigned to the \ufb01rst functional\nscan and the mean functional scan, respectively. Realigned functional scans were slice time corrected.\nAnatomical scans were coregistered to the mean functional scan. Brains were extracted from\nthe coregistered anatomical scans. Finally, stimulus-speci\ufb01c responses were deconvolved from\nthe realigned and slice time corrected functional scans with a general linear model [47]. Here,\ndeconvolution refers to estimating regression coef\ufb01cients (y) of the following GLMs: y\u2217 = Xy,\nwhere y\u2217 is raw voxel responses, X is HRF-convolved design matrix (one regressor per stimulus\nindicating its presence), and y is deconvolved voxel responses such that y is a vector of size m \u00d7 1\nwith m denoting the number of unique stimuli, and there is one y per voxel.\nCelebA dataset [48]. This dataset comprises 202599 in-the-wild portraits of 10177 people, which\nwere drawn from online sources. 
The portraits are annotated with 40 attributes and five landmarks. We preprocessed the portraits as we preprocessed the stimuli in our fMRI dataset.

3.2 Implementation details

Our implementation makes use of Chainer and CuPy with CUDA and cuDNN [49] except for the following: The VGG-16 and VGG-Face pretrained models were ported to Chainer from Caffe [50]. Principal component analysis was implemented in scikit-learn [51]. fMRI preprocessing was implemented in SPM [52]. Brain extraction was implemented in FSL [53].

We trained the discriminator and the generator on the entire CelebA dataset by iteratively minimizing the discriminator loss function and the generator loss function in sequence for 100 epochs with Adam [54]. Model parameters were initialized as follows: biases were set to zero, the scaling parameters were drawn from N(1, 2·10⁻² I), the shifting parameters were set to zero and the weights were drawn from N(0, 10⁻² I) [37]. We set the hyperparameters of the loss functions as follows: λ_adv = 10², λ_dis = 10², λ_fea = 10⁻² and λ_sti = 2·10⁻⁶ [38]. We set the hyperparameters of the optimizer as follows: α = 0.001, β₁ = 0.9, β₂ = 0.999 and ε = 10⁻⁸ [37].

We estimated the parameters of the likelihood term on the training split of our fMRI dataset.

3.3 Evaluation metrics

We evaluated our approach on the test split of our fMRI dataset with the following metrics: First, the feature similarity between the stimuli and their reconstructions, where the feature similarity is defined as the Euclidean similarity between the features, defined as the relu7 outputs of the VGG-Face pretrained model. Second, the Pearson correlation coefficient between the stimuli and their reconstructions. Third, the structural similarity between the stimuli and their reconstructions [55]. 
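The first two metrics are straightforward to compute. A sketch, assuming "Euclidean similarity" means 1 / (1 + Euclidean distance) (the text does not spell out the normalization, so this form is an assumption):

```python
import numpy as np

def euclidean_similarity(a, b):
    """Assumed form of 'Euclidean similarity': 1 / (1 + Euclidean distance);
    equals 1 for identical feature vectors and decays toward 0."""
    return 1.0 / (1.0 + np.linalg.norm(np.ravel(a) - np.ravel(b)))

def pearson_r(a, b):
    """Pearson correlation coefficient between two flattened arrays."""
    a = np.ravel(a).astype(float) - np.ravel(a).mean()
    b = np.ravel(b).astype(float) - np.ravel(b).mean()
    return float(a @ b / np.sqrt((a @ a) * (b @ b)))

x = np.array([1.0, 2.0, 3.0, 4.0])
print(euclidean_similarity(x, x))   # 1.0
print(pearson_r(x, 2 * x + 1))      # 1.0 (up to floating point)
```

In the paper these would be applied to relu7 feature vectors and to pixel arrays respectively; structural similarity (SSIM) needs the windowed luminance/contrast/structure terms of [55] and is omitted here.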
All\nevaluation was done on a held-out set not used at any point during model estimation or training. The\nvoxels used in the reconstructions were selected as follows: For each test trial, n voxels with smallest\nresiduals (on training set) were selected. n itself was selected such that reconstruction accuracy of\nremaining test trials was highest. We also performed an encoding analysis to see how well the latent\nfeatures were predictive of voxel responses in different brain areas. The results of this analysis is\nreported in the supplementary material.\n\n3.4 Reconstruction\n\nWe \ufb01rst demonstrate our results by reconstructing the stimulus images in the test set using i) the latent\nfeatures and ii) the brain responses. Figure 2 shows 4 representative examples of the test stimuli\nand their reconstructions. The \ufb01rst column of both panels show the original test stimuli. The second\ncolumn of both panels show the reconstructions of these stimuli x from the latent features z obtained\nby \u03c6(x). These can be considered as an upper limit for the reconstruction accuracy of the brain\nresponses since they are the best possible reconstructions that we can expect to achieve with a perfect\nneural decoder that can exactly predict the latent features from brain responses. The third and fourth\ncolumns of the \ufb01gure show reconstructions of brain responses to stimuli of Subject 1 and Subject 2,\nrespectively.\n\nFigure 2: Reconstructions of the test stimuli from the latent features (model) and the brain responses\nof the two subjects (brain 1 and brain 2).\n\nVisual inspection of the reconstructions from brain responses reveals that they match the test stimuli\nin several key aspects, such as gender, skin color and facial features. 
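The voxel-selection rule described above can be sketched as follows; the outer loop that tunes n on the remaining test trials is omitted, so only the inner ranking step is shown:

```python
import numpy as np

def select_voxels(train_residuals, n):
    """Indices of the n voxels with the smallest training-set residuals.
    (In the paper n itself is chosen so that reconstruction accuracy on the
    remaining test trials is highest; that search is not reproduced here.)"""
    return np.argsort(train_residuals)[:n]

# Hypothetical per-voxel mean squared residuals of the likelihood model:
resid = np.array([0.9, 0.1, 0.5, 0.05, 0.7])
chosen = select_voxels(resid, 2)
print(chosen)  # [3 1]
```

The selected indices would then mask both y and the corresponding rows/columns of B and Σ before applying Equation (7).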
Table 1 shows the three reconstruction accuracy metrics for both subjects in terms of the ratio of the reconstruction accuracy from brain responses to the reconstruction accuracy from latent features, which were significantly (p < 0.05, permutation test) above those for randomly sampled latent features (cf. 0.5181, 0.1532 and 0.5183, respectively).

Table 1: Reconstruction accuracy of the proposed decoding approach. The results are reported as the ratio of accuracy of reconstructing from brain responses and latent features.

      Feature similarity   Pearson correlation coefficient   Structural similarity
S1    0.6546 ± 0.0220      0.6512 ± 0.0493                   0.8365 ± 0.0239
S2    0.6465 ± 0.0222      0.6580 ± 0.0480                   0.8325 ± 0.0229

Furthermore, besides reconstruction accuracy, we tested the identification performance within and between groups that shared similar features (those that share gender or ethnicity as defined by the norming data were assumed to share similar features). Identification accuracies (which ranged between 57% and 62%) were significantly above chance-level (which ranged between 3% and 8%) in all cases (p ≪ 0.05, Student's t-test). Furthermore, we found no significant differences between the identification accuracies when a reconstruction was identified among a group sharing similar features versus among a group that did not share similar features (p > 0.79, Student's t-test) (cf. [56]).

3.5 Visualization, interpolation and sampling

In the second experiment, we analyzed the properties of the stimulus features predictive of brain activations to characterize neural representations of faces. We first investigated the model representations to better understand what kind of features drive responses of the model. 
We visualized the features\nexplaining the highest variance by independently setting the values of the \ufb01rst few latent dimensions\nto vary between their minimum and maximum values and generating reconstructions from these\nrepresentations (Figure 3). As a result, we found that many of the latent features were coding for\ninterpretable high level information such as age, gender, etc. For example, the \ufb01rst feature in Figure 3\nappears to code for gender, the second one appears to code for hair color and complexion, the third\none appears to code for age, and the fourth one appears to code for two different facial expressions.\n\nFigure 3: Reconstructions from features with single features set to vary between their minimum and\nmaximum values.\n\nWe then explored the feature space that was learned by the latent feature model and the response\nspace that was learned by the likelihood by systematically traversing the reconstructions obtained\nfrom different points in these spaces.\nFigure 4A shows examples of reconstructions of stimuli from the latent features (rows one and four)\nand brain responses (rows two, three, \ufb01ve and six), as well as reconstructions from their interpolations\nbetween two points (columns three to nine). The reconstructions from the interpolations between two\npoints show semantic changes with no sharp transitions.\nFigure 4B shows reconstructions from latent features sampled from the model prior (\ufb01rst row) and\nfrom responses sampled from the response prior of each subject (second and third rows). The\nreconstructions from sampled representations are diverse and of high quality.\nThese results provide evidence that no memorization took place and the models learned relevant and\ninteresting representations [37]. 
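The single-dimension traversal behind Figure 3 can be sketched as below; decoding each row with the generator φ⁻¹ would render the image strips (the generator call itself, and the real latent statistics, are omitted):

```python
import numpy as np

def traverse(z_ref, z_min, z_max, dim, steps=5):
    """Rows: copies of a reference latent vector with one dimension swept
    from its minimum to its maximum observed value; each row would be fed
    to the generator phi^{-1} to produce one image of the strip."""
    zs = np.tile(z_ref, (steps, 1))
    zs[:, dim] = np.linspace(z_min[dim], z_max[dim], steps)
    return zs

# Toy stand-in for the 699-dimensional latent features of the training set:
Z = np.random.default_rng(0).standard_normal((100, 699))
grid = traverse(Z.mean(axis=0), Z.min(axis=0), Z.max(axis=0), dim=0)
```

The same helper applied to dims 0–3 would reproduce the four feature sweeps discussed above.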
Furthermore, these results suggest that neural representations of faces might be embedded in a continuous and distributed space in the brain.

3.6 Comparison versus state-of-the-art

In this section we qualitatively (Figure 5) and quantitatively (Table 2) compare the performance of our approach with two existing decoding approaches from the literature∗. Figure 5 shows example reconstructions from brain responses with three different approaches, namely with our approach, the eigenface approach [11, 57] and the identity transform approach [58, 29]. To achieve a fair comparison, the implementations of the three approaches only differed in terms of the feature models that were used, i.e. the eigenface approach had an eigenface (PCA) feature model and the identity transform approach had simply an identity transformation in place of the feature model.

Visual inspection of the reconstructions displayed in Figure 5 shows that DAND clearly outperforms the existing approaches. In particular, our reconstructions better capture the features of the stimuli such as gender, skin color and facial features. Furthermore, our reconstructions are more detailed, sharper, less noisy and more photorealistic than the eigenface and identity transform approaches.

∗We also experimented with the VGG-ImageNet pretrained model, which failed to match the reconstruction performance of the VGG-Face model, while their encoding performances were comparable in non-face related brain areas. We plan to further investigate other models in detail in future work.

Figure 4: Reconstructions from interpolated (A) and sampled (B) latent features (model) and brain responses of the two subjects (brain 1 and brain 2).
A quantitative comparison of the performance of the three approaches shows that the reconstruction accuracies achieved by our approach were significantly higher than those achieved by the existing approaches (p ≪ 0.05, Student's t-test).

Table 2: Reconstruction accuracies of the three decoding approaches. LF denotes reconstructions from latent features.

                 Feature similarity   Pearson correlation coefficient   Structural similarity
Identity    S1   0.1254 ± 0.0031      0.4194 ± 0.0347                   0.3744 ± 0.0083
            S2   0.1254 ± 0.0038      0.4299 ± 0.0350                   0.3877 ± 0.0083
            LF   1.0000 ± 0.0000      1.0000 ± 0.0000                   1.0000 ± 0.0000
Eigenface   S1   0.1475 ± 0.0043      0.3779 ± 0.0403                   0.3735 ± 0.0102
            S2   0.1457 ± 0.0043      0.2241 ± 0.0435                   0.3671 ± 0.0113
            LF   0.3841 ± 0.0149      0.9875 ± 0.0011                   0.9234 ± 0.0040
DAND        S1   0.1900 ± 0.0052      0.4679 ± 0.0358                   0.4662 ± 0.0126
            S2   0.1867 ± 0.0054      0.4722 ± 0.0344                   0.4676 ± 0.0130
            LF   0.2895 ± 0.0137      0.7181 ± 0.0419                   0.5595 ± 0.0181

Figure 5: Reconstructions from the latent features and brain responses of the two subjects (brain 1 and brain 2) using our decoding approach, as well as the eigenface and identity transform approaches for comparison.

3.7 Factors contributing to reconstruction accuracy

Finally, we investigated the factors contributing to the quality of reconstructions from brain responses. All of the faces in the test set had been annotated with 30 objective physical measures (such as nose width, face length, etc.) and 14 subjective measures (such as attractiveness, gender, ethnicity, etc.). 
Among these measures, we identi\ufb01ed \ufb01ve subjective measures that are important for face\nperception [59\u201364] as measures of interest and supplemented them with an additional measure of\nstimulus complexity. Complexity was included because of its important role in visual perception [65].\nThe selected measures were attractiveness, complexity, ethnicity, femininity, masculinity and proto-\ntypicality. Note that the complexity measure was not part of the dataset annotations and was de\ufb01ned\nas the Kolmogorov complexity of the stimuli, which was taken to be their compressed \ufb01le sizes [66].\nTo this end, we correlated the reconstruction accuracies of the 48 stimuli in the test set (for both\nsubjects) with their corresponding measures (except for ethnicity) and used a two-tailed Student\u2019s\nt-test to test if the multiple comparison corrected (Bonferroni correction) p-value was less than the\ncritical value of 0.05. In the case of ethnicity we used one-way analysis of variance to compare the\nreconstruction accuracies of faces with different ethnicities.\nWe were able to reject the null hypothesis for the measures complexity, femininity and masculinity,\nbut failed to do so for attractiveness, ethnicity and prototypicality. Speci\ufb01cally, we observed a signif-\nicant negative correlation (r = -0.3067) between stimulus complexity and reconstruction accuracy.\nFurthermore, we found that masculinity and reconstruction accuracy were signi\ufb01cantly positively\ncorrelated (r = 0.3841). Complementing this result, we found a negative correlation (r = -0.3961)\nbetween femininity and reconstruction accuracy. We found no effect of attractiveness, ethnicity and\nprototypicality on the quality of reconstructions. 
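The complexity measure is easy to reproduce; a minimal sketch, assuming zlib stands in for whatever compressor produced the file sizes (the choice of compressor is not specified in the text):

```python
import zlib
import numpy as np

def complexity(image_uint8):
    """Kolmogorov-complexity proxy from the text: the compressed size of
    the stimulus. zlib is an assumption standing in for the actual codec."""
    return len(zlib.compress(image_uint8.tobytes()))

rng = np.random.default_rng(0)
noisy = rng.integers(0, 256, (64, 64), dtype=np.uint8)  # incompressible
flat = np.full((64, 64), 128, dtype=np.uint8)           # highly compressible
print(complexity(noisy) > complexity(flat))  # True
```

Correlating this scalar with per-stimulus reconstruction accuracy (as done above for the subjective measures) would reproduce the reported negative complexity–accuracy relationship, modulo the codec choice.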
We then compared the complexity levels of the images of each gender and found that female face images were significantly more complex than male face images (p < 0.05, Student's t-test), pointing to complexity as the factor underlying the relationship between reconstruction accuracy and gender. This result demonstrates the importance of taking stimulus complexity into account while making inferences about factors driving the reconstructions from brain responses.

4 Conclusion

In this study we combined probabilistic inference with deep learning to derive a novel deep neural decoding approach. We tested our approach by reconstructing face stimuli from BOLD responses at an unprecedented level of accuracy and detail, matching the target stimuli in several key aspects such as gender, skin color and facial features, as well as identifying perceptual factors contributing to the reconstruction accuracy. Deep decoding approaches such as the one developed here are expected to play an important role in the development of new neuroprosthetic devices that operate by reading subjective information from the human brain.

Acknowledgments

This work has been partially supported by a VIDI grant (639.072.513) from the Netherlands Organization for Scientific Research and a GPU grant (GeForce Titan X) from the Nvidia Corporation.

References

[1] T. Naselaris, K. N. Kay, S. Nishimoto, and J. L. Gallant, "Encoding and decoding in fMRI," NeuroImage, vol. 56, no. 2, pp. 400–410, may 2011.

[2] M. van Gerven, "A primer on encoding models in sensory neuroscience," J. Math. Psychol., vol. 76, no. B, pp. 172–183, 2017.

[3] J. V. Haxby, "Distributed and overlapping representations of faces and objects in ventral temporal cortex," Science, vol. 293, no. 5539, pp. 
2425\u20132430, sep 2001.\n\n[4] Y. Kamitani and F. Tong, \u201cDecoding the visual and subjective contents of the human brain,\u201d Nature\n\nNeuroscience, vol. 8, no. 5, pp. 679\u2013685, apr 2005.\n\n[5] T. M. Mitchell, S. V. Shinkareva, A. Carlson, K.-M. Chang, V. L. Malave, R. A. Mason, and M. A. Just,\n\u201cPredicting human brain activity associated with the meanings of nouns,\u201d Science, vol. 320, no. 5880, pp.\n1191\u20131195, may 2008.\n\n[6] K. N. Kay, T. Naselaris, R. J. Prenger, and J. L. Gallant, \u201cIdentifying natural images from human brain\n\nactivity,\u201d Nature, vol. 452, no. 7185, pp. 352\u2013355, mar 2008.\n\n[7] B. Thirion, E. Duchesnay, E. Hubbard, J. Dubois, J.-B. Poline, D. Lebihan, and S. Dehaene, \u201cInverse\nretinotopy: Inferring the visual content of images from brain activation patterns,\u201d NeuroImage, vol. 33,\nno. 4, pp. 1104\u20131116, dec 2006.\n\n[8] Y. Miyawaki, H. Uchida, O. Yamashita, M. aki Sato, Y. Morito, H. C. Tanabe, N. Sadato, and Y. Kamitani,\n\u201cVisual image reconstruction from human brain activity using a combination of multiscale local image\ndecoders,\u201d Neuron, vol. 60, no. 5, pp. 915\u2013929, dec 2008.\n\n[9] T. Naselaris, R. J. Prenger, K. N. Kay, M. Oliver, and J. L. Gallant, \u201cBayesian reconstruction of natural\n\nimages from human brain activity,\u201d Neuron, vol. 63, no. 6, pp. 902\u2013915, sep 2009.\n\n[10] S. Nishimoto, A. T. Vu, T. Naselaris, Y. Benjamini, B. Yu, and J. L. Gallant, \u201cReconstructing visual\nexperiences from brain activity evoked by natural movies,\u201d Current Biology, vol. 21, no. 19, pp.\n1641\u20131646, oct 2011.\n\n[11] A. S. Cowen, M. M. Chun, and B. A. Kuhl, \u201cNeural portraits of perception: Reconstructing face images\n\nfrom evoked brain activity,\u201d NeuroImage, vol. 94, pp. 12\u201322, jul 2014.\n\n[12] D. L. K. Yamins and J. J. Dicarlo, \u201cUsing goal-driven deep learning models to understand sensory cortex,\u201d\n\nNat. Neurosci., vol. 
19, pp. 356–365, 2016.

[13] N. Kriegeskorte, "Deep neural networks: A new framework for modeling biological vision and brain information processing," Annu. Rev. Vis. Sci., vol. 1, no. 1, pp. 417–446, 2015.

[14] D. L. K. Yamins, H. Hong, C. F. Cadieu, E. A. Solomon, D. Seibert, and J. J. DiCarlo, "Performance-optimized hierarchical models predict neural responses in higher visual cortex," Proceedings of the National Academy of Sciences, vol. 111, no. 23, pp. 8619–8624, may 2014.

[15] S.-M. Khaligh-Razavi and N. Kriegeskorte, "Deep supervised, but not unsupervised, models may explain IT cortical representation," PLoS Computational Biology, vol. 10, no. 11, p. e1003915, nov 2014.

[16] U. Güçlü and M. van Gerven, "Deep neural networks reveal a gradient in the complexity of neural representations across the ventral stream," Journal of Neuroscience, vol. 35, no. 27, pp. 10005–10014, jul 2015.

[17] R. M. Cichy, A. Khosla, D. Pantazis, A. Torralba, and A. Oliva, "Comparison of deep neural networks to spatio-temporal cortical dynamics of human visual object recognition reveals hierarchical correspondence," Scientific Reports, vol. 6, no. 1, jun 2016.

[18] U. Güçlü, J. Thielen, M. Hanke, and M. van Gerven, "Brains on beats," in Advances in Neural Information Processing Systems, 2016.

[19] U. Güçlü and M. A. J. van Gerven, "Modeling the dynamics of human brain activity with recurrent neural networks," Frontiers in Computational Neuroscience, vol. 11, feb 2017.

[20] M. Eickenberg, A. Gramfort, G. Varoquaux, and B. Thirion, "Seeing it all: Convolutional network layers map the function of the human visual system," NeuroImage, vol. 152, pp. 184–194, may 2017.

[21] U. Güçlü and M.
van Gerven, "Increasingly complex representations of natural movies across the dorsal stream are shared between subjects," NeuroImage, vol. 145, pp. 329–336, jan 2017.

[22] T. Horikawa and Y. Kamitani, "Generic decoding of seen and imagined objects using hierarchical visual features," Nature Communications, vol. 8, p. 15037, may 2017.

[23] T. Horikawa and Y. Kamitani, "Hierarchical neural representation of dreamed objects revealed by brain decoding with deep neural network features," Frontiers in Computational Neuroscience, vol. 11, jan 2017.

[24] M. van Gerven, F. de Lange, and T. Heskes, "Neural decoding with hierarchical generative models," Neural Comput., vol. 22, no. 12, pp. 3127–3142, 2010.

[25] C. Du, C. Du, and H. He, "Sharing deep generative representation for perceived image reconstruction from human brain activity," CoRR, vol. abs/1704.07575, 2017.

[26] B. Thirion, E. Duchesnay, E. Hubbard, J. Dubois, J.-B. Poline, D. Lebihan, and S. Dehaene, "Inverse retinotopy: Inferring the visual content of images from brain activation patterns," NeuroImage, vol. 33, no. 4, pp. 1104–1116, 2006.

[27] T. Naselaris, R. J. Prenger, K. N. Kay, M. Oliver, and J. L. Gallant, "Bayesian reconstruction of natural images from human brain activity," Neuron, vol. 63, no. 6, pp. 902–915, 2009.

[28] U. Güçlü and M. van Gerven, "Unsupervised learning of features for Bayesian decoding in functional magnetic resonance imaging," in Belgian-Dutch Conference on Machine Learning, 2013.

[29] S. Schoenmakers, M. Barth, T. Heskes, and M. van Gerven, "Linear reconstruction of perceived images from human brain activity," NeuroImage, vol. 83, pp. 951–961, dec 2013.

[30] S. Schoenmakers, U. Güçlü, M. van Gerven, and T.
Heskes, "Gaussian mixture models and semantic gating improve reconstructions from human brain activity," Frontiers in Computational Neuroscience, vol. 8, jan 2015.

[31] R. Zhang, P. Isola, and A. A. Efros, "Colorful image colorization," Lect. Notes Comput. Sci., vol. 9907 LNCS, pp. 649–666, 2016.

[32] Y. Güçlütürk, U. Güçlü, R. van Lier, and M. van Gerven, "Convolutional sketch inversion," in Lecture Notes in Computer Science. Springer International Publishing, 2016, pp. 810–824.

[33] D. Pathak, P. Krähenbühl, J. Donahue, T. Darrell, and A. A. Efros, "Context encoders: Feature learning by inpainting," CoRR, vol. abs/1604.07379, 2016.

[34] C. Ledig, L. Theis, F. Huszar, J. Caballero, A. P. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi, "Photo-realistic single image super-resolution using a generative adversarial network," CoRR, vol. abs/1609.04802, 2016.

[35] O. M. Parkhi, A. Vedaldi, and A. Zisserman, "Deep face recognition," in British Machine Vision Conference, jul 2016.

[36] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. C. Courville, and Y. Bengio, "Generative adversarial networks," CoRR, vol. abs/1406.2661, 2014.

[37] A. Radford, L. Metz, and S. Chintala, "Unsupervised representation learning with deep convolutional generative adversarial networks," CoRR, vol. abs/1511.06434, 2015.

[38] A. Dosovitskiy and T. Brox, "Generating images with perceptual similarity metrics based on deep networks," CoRR, vol. abs/1602.02644, 2016.

[39] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," CoRR, vol. abs/1502.03167, 2015.

[40] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," CoRR, vol.
abs/1409.1556, 2014.

[41] J. Johnson, A. Alahi, and F. Li, "Perceptual losses for real-time style transfer and super-resolution," CoRR, vol. abs/1603.08155, 2016.

[42] K. B. Petersen and M. S. Pedersen, "The matrix cookbook," nov 2012, version 20121115.

[43] D. S. Ma, J. Correll, and B. Wittenbrink, "The Chicago face database: A free stimulus set of faces and norming data," Behavior Research Methods, vol. 47, no. 4, pp. 1122–1135, jan 2015.

[44] N. Strohminger, K. Gray, V. Chituc, J. Heffner, C. Schein, and T. B. Heagins, "The MR2: A multi-racial, mega-resolution database of facial stimuli," Behavior Research Methods, vol. 48, no. 3, pp. 1197–1204, aug 2015.

[45] O. Langner, R. Dotsch, G. Bijlstra, D. H. J. Wigboldus, S. T. Hawk, and A. van Knippenberg, "Presentation and validation of the Radboud faces database," Cognition & Emotion, vol. 24, no. 8, pp. 1377–1388, dec 2010.

[46] L. Thaler, A. Schütz, M. Goodale, and K. Gegenfurtner, "What is the best fixation target? The effect of target shape on stability of fixational eye movements," Vision Research, vol. 76, pp. 31–42, jan 2013.

[47] J. A. Mumford, B. O. Turner, F. G. Ashby, and R. A. Poldrack, "Deconvolving BOLD activation in event-related designs for multivoxel pattern classification analyses," NeuroImage, vol. 59, no. 3, pp. 2636–2643, feb 2012.

[48] Z. Liu, P. Luo, X. Wang, and X. Tang, "Deep learning face attributes in the wild," in Proceedings of International Conference on Computer Vision (ICCV), dec 2015.

[49] S. Tokui, K. Oono, S. Hido, and J. Clayton, "Chainer: a next-generation open source framework for deep learning," in Advances in Neural Information Processing Systems Workshops, 2015.

[50] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. B. Girshick, S. Guadarrama, and T.
Darrell, "Caffe: Convolutional architecture for fast feature embedding," CoRR, vol. abs/1408.5093, 2014.

[51] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, "Scikit-learn: Machine learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.

[52] K. Friston, J. Ashburner, S. Kiebel, T. Nichols, and W. Penny, Eds., Statistical Parametric Mapping: The Analysis of Functional Brain Images. Academic Press, 2007.

[53] M. Jenkinson, C. F. Beckmann, T. E. Behrens, M. W. Woolrich, and S. M. Smith, "FSL," NeuroImage, vol. 62, no. 2, pp. 782–790, aug 2012.

[54] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," CoRR, vol. abs/1412.6980, 2014.

[55] Z. Wang, A. Bovik, H. Sheikh, and E. Simoncelli, "Image quality assessment: From error visibility to structural similarity," IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, apr 2004.

[56] E. Goesaert and H. P. O. de Beeck, "Representations of facial identity information in the ventral visual stream investigated with multivoxel pattern analyses," Journal of Neuroscience, vol. 33, no. 19, pp. 8549–8558, may 2013.

[57] H. Lee and B. A. Kuhl, "Reconstructing perceived and retrieved faces from activity patterns in lateral parietal cortex," Journal of Neuroscience, vol. 36, no. 22, pp. 6069–6082, jun 2016.

[58] M. van Gerven and T. Heskes, "A linear Gaussian framework for decoding of perceived images," in 2012 Second International Workshop on Pattern Recognition in NeuroImaging. IEEE, jul 2012.

[59] A. C. Hahn and D. I. Perrett, "Neural and behavioral responses to attractiveness in adult and infant faces," Neuroscience & Biobehavioral Reviews, vol.
46, pp. 591–603, oct 2014.

[60] D. I. Perrett, K. A. May, and S. Yoshikawa, "Facial shape and judgements of female attractiveness," Nature, vol. 368, no. 6468, pp. 239–242, mar 1994.

[61] B. Birkás, M. Dzhelyova, B. Lábadi, T. Bereczkei, and D. I. Perrett, "Cross-cultural perception of trustworthiness: The effect of ethnicity features on evaluation of faces' observed trustworthiness across four samples," Personality and Individual Differences, vol. 69, pp. 56–61, oct 2014.

[62] M. A. Strom, L. A. Zebrowitz, S. Zhang, P. M. Bronstad, and H. K. Lee, "Skin and bones: The contribution of skin tone and facial structure to racial prototypicality ratings," PLoS ONE, vol. 7, no. 7, p. e41193, jul 2012.

[63] A. C. Little, B. C. Jones, D. R. Feinberg, and D. I. Perrett, "Men's strategic preferences for femininity in female faces," British Journal of Psychology, vol. 105, no. 3, pp. 364–381, jun 2013.

[64] M. de Lurdes Carrito, I. M. B. dos Santos, C. E. Lefevre, R. D. Whitehead, C. F. da Silva, and D. I. Perrett, "The role of sexually dimorphic skin colour and shape in attractiveness of male faces," Evolution and Human Behavior, vol. 37, no. 2, pp. 125–133, mar 2016.

[65] Y. Güçlütürk, R. H. A. H. Jacobs, and R. van Lier, "Liking versus complexity: Decomposing the inverted U-curve," Frontiers in Human Neuroscience, vol. 10, mar 2016.

[66] D. Donderi and S. McFadden, "Compressed file length predicts search time and errors on visual displays," Displays, vol. 26, no. 2, pp.
71\u201378, apr 2005.\n\n12\n\n\f", "award": [], "sourceid": 2227, "authors": [{"given_name": "Ya\u011fmur", "family_name": "G\u00fc\u00e7l\u00fct\u00fcrk", "institution": "Radboud University"}, {"given_name": "Umut", "family_name": "G\u00fc\u00e7l\u00fc", "institution": "Donders Institute"}, {"given_name": "Katja", "family_name": "Seeliger", "institution": "Donders Institute for Brain, Cognition and Behaviour"}, {"given_name": "Sander", "family_name": "Bosch", "institution": "Radboud University"}, {"given_name": "Rob", "family_name": "van Lier", "institution": "Donders Institute for Brain, Cognition and Behaviour, Radboud University"}, {"given_name": "Marcel", "family_name": "van Gerven", "institution": "Radboud Universiteit"}]}