{"title": "Deep Predictive Coding Network with Local Recurrent Processing for Object Recognition", "book": "Advances in Neural Information Processing Systems", "page_first": 9201, "page_last": 9213, "abstract": "Inspired by \"predictive coding\" - a theory in neuroscience, we develop a bi-directional and dynamic neural network with local recurrent processing, namely predictive coding network (PCN). Unlike feedforward-only convolutional neural networks, PCN includes both feedback connections, which carry top-down predictions, and feedforward connections, which carry bottom-up errors of prediction. Feedback and feedforward connections enable adjacent layers to interact locally and recurrently to refine representations towards minimization of layer-wise prediction errors. When unfolded over time, the recurrent processing gives rise to an increasingly deeper hierarchy of non-linear transformation, allowing a shallow network to dynamically extend itself into an arbitrarily deep network. We train and test PCN for image classification with SVHN, CIFAR and ImageNet datasets. Despite notably fewer layers and parameters, PCN achieves competitive performance compared to classical and state-of-the-art models. Further analysis shows that the internal representations in PCN converge over time and yield increasingly better accuracy in object recognition. 
Errors of top-down prediction also reveal visual saliency or bottom-up attention.", "full_text": "Deep Predictive Coding Network with Local\nRecurrent Processing for Object Recognition\n\nKuan Han1,3, Haiguang Wen1,3, Yizhen Zhang1,3, Di Fu1,3, Eugenio Culurciello1,2,\n\nZhongming Liu1,2,3\u2217\n\n1School of Electrical and Computer Engineering, Purdue University\n\n2Weldon School of Biomedical Engineering, Purdue University\n3Purdue Institute for Integrative Neuroscience, Purdue University\n\nAbstract\n\nInspired by \"predictive coding\" - a theory in neuroscience, we develop a bi-\ndirectional and dynamic neural network with local recurrent processing, namely\npredictive coding network (PCN). Unlike feedforward-only convolutional neural\nnetworks, PCN includes both feedback connections, which carry top-down predic-\ntions, and feedforward connections, which carry bottom-up errors of prediction.\nFeedback and feedforward connections enable adjacent layers to interact locally\nand recurrently to re\ufb01ne representations towards minimization of layer-wise pre-\ndiction errors. When unfolded over time, the recurrent processing gives rise to\nan increasingly deeper hierarchy of non-linear transformation, allowing a shallow\nnetwork to dynamically extend itself into an arbitrarily deep network. We train and\ntest PCN for image classi\ufb01cation with SVHN, CIFAR and ImageNet datasets. De-\nspite notably fewer layers and parameters, PCN achieves competitive performance\ncompared to classical and state-of-the-art models. Further analysis shows that the\ninternal representations in PCN converge over time and yield increasingly better\naccuracy in object recognition. Errors of top-down prediction also reveal visual\nsaliency or bottom-up attention.\n\n1\n\nIntroduction\n\nModern computer vision is mostly based on feedforward convolutional neural networks (CNNs)\n[18, 33, 50]. 
To achieve better performance, CNN models tend to use an increasing number of layers [19, 24, 50, 59], while sometimes adding \"short-cuts\" to bypass layers [19, 56]. What motivates such design choices is the notion that models should learn a deep hierarchy of representations to perform complex tasks in vision [50, 59]. This notion generally agrees with the brain\u2019s hierarchical organization [31, 62, 67, 16]: visual areas are connected in series to enable a cascade of neural processing [60]. If one layer in a model is analogous to one area in the visual cortex, state-of-the-art CNNs are considerably deeper (with 50 to 1000 layers) [20, 18] than the visual cortex (with 10 to 20 areas). As we look to the brain for more inspiration, it is noteworthy that biological neural networks support robust and efficient intelligence for a wide range of tasks without any need to grow their depth or width [37].\n\n\u2217Correspondence to: Zhongming Liu\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nWhat distinguishes the brain from CNNs is the presence of abundant feedback connections that link a feedforward series of brain areas in reverse order [11]. Given both feedforward and feedback connections, information passes not only bottom-up but also top-down, and the two streams interact to update the internal states over time. The interplay between feedforward and feedback connections has been thought to subserve so-called \"predictive coding\" [44, 14, 52, 25, 12], a neuroscience theory that has become popular. It holds that feedback connections from a higher layer carry the prediction of its lower-layer representation, while feedforward connections in turn carry the error of prediction upward to the higher layer. Repeating such bi-directional interactions across layers renders the visual system a dynamic and recurrent neural network [1, 12]. 
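The dynamics this theory describes can be simulated in a few lines. The toy example below is our illustration, not from the paper: a single dense weight matrix with tied feedforward and feedback weights stands in for the connections between two layers, and repeated cycles shrink the error of the top-down prediction.

```python
import numpy as np

rng = np.random.default_rng(0)

W = rng.standard_normal((4, 8)) * 0.1  # tied weights between a 4-unit higher layer and an 8-unit lower layer
r_low = rng.standard_normal(8)         # fixed lower-level representation
r_high = np.zeros(4)                   # higher-level representation, refined over time
alpha = 0.5                            # update rate

errors = []
for t in range(10):
    prediction = W.T @ r_high              # top-down prediction of the lower layer
    error = r_low - prediction             # bottom-up error of prediction
    r_high = r_high + alpha * (W @ error)  # update the higher layer to reduce the error
    errors.append(float(np.linalg.norm(error)))

# Each cycle shrinks the prediction error (for a sufficiently small update rate).
assert all(e2 <= e1 + 1e-12 for e1, e2 in zip(errors, errors[1:]))
```

Running more cycles plays the role of depth: each cycle applies the same weights again, like one more layer of an unrolled, weight-shared feedforward network.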
Such a notion can also apply to artificial neural networks. As recurrent processing unfolds in time, a static network architecture is used over and over to apply increasingly more non-linear operations to the input, as if the input were computed through more and more layers stacked onto an increasingly deeper feedforward network [37]. In other words, running computation through a bi-directional network for a longer time may give rise to an effectively deeper network that approximates a complex and nonlinear transformation from pixels to concepts [37, 6], which is potentially how the brain solves invariant object recognition without the need to grow its depth.\nInspired by the theory of predictive coding, we propose a bi-directional and dynamic network, namely the Deep Predictive Coding Network (PCN), to run a cascade of local recurrent processing [30, 5, 43] for object recognition. PCN combines predictive coding and local recurrent processing into an iterative inference algorithm. When tested for image classification with benchmark datasets (CIFAR-10, CIFAR-100, SVHN and ImageNet), PCN uses notably fewer layers and parameters to achieve competitive performance relative to classical or state-of-the-art models. Further behavioral analysis of PCN sheds light on its computational mechanism and its potential use for mapping visual saliency or bottom-up attention.\n\n2 Related Work\n\nPredictive Coding In the brain, connections between cortical areas are mostly reciprocal [11]. Rao and Ballard suggest that bi-directional connections subserve \"predictive coding\" [44]: feedback connections from a higher cortical area carry neural predictions to the lower cortical area, while the feedforward connections carry the unpredictable information (or error of prediction) to the higher area to correct the neuronal states throughout the hierarchy. 
With supporting evidence from empirical studies [1, 52, 28], this mechanism enables iterative inference for perception [44] and unsupervised learning [53], has been incorporated into modern neural networks for classification [54] and video prediction [39], and likely represents a unified theory of the brain [12, 25].\nPredictive Coding Network with Global Recurrent Processing Driven by the predictive coding theory, a bi-directional and recurrent neural network has been proposed in [63]. It runs global recurrent processing by alternating a bottom-up cascade of feedforward computation and a top-down cascade of feedback computation. In each cycle of recurrent dynamics, the feedback prediction starts from the top layer and propagates layer by layer until the bottom layer; then, the feedforward error starts from the bottom layer and propagates layer by layer until the top layer. The model described herein is similar, but uses local recurrent processing instead of global recurrent processing. Only for convenience of notation in this paper, we refer to the proposed PCN with local recurrent processing simply as \"PCN\", while referring to the model in [63] explicitly as \"PCN with global recurrent processing\".\nLocal Recurrent Processing In the brain, feedforward-only processing plays a central role in rapid object recognition [47, 9]. Although less understood, feedback connections are thought to convey top-down attention [4, 3] or prediction [12, 44, 52]. Evidence also suggests that feedback signals may operate between hierarchically adjacent areas along the ventral stream [30, 5, 43] to enable local recurrent processing for object recognition [65, 2], especially given ambiguous or degraded visual input [64, 51]. 
Therefore, feedback processes may be an integral part of both global and local recurrent processing, underlying top-down attention on a slower time scale and visual recognition on a faster time scale.\n\n3 Predictive Coding Network\n\nHerein, we design a bi-directional (feedforward and feedback) neural network that runs local recurrent processing between neighboring layers, and we refer to this network as the Predictive Coding Network (PCN). As illustrated in Fig. 1, PCN is a stack of recurrent blocks, each running dynamic and recurrent processing within itself through feedforward and feedback connections.\n\nFigure 1: Architecture of CNN vs. PCN. (a) The plain model (left) is a feedforward CNN with 3\u00d73 convolutional connections (solid arrows) and 1\u00d71 bypass connections (dashed arrows). On the basis of the plain model, the local PCN (right) uses additional feedback (solid arrows) and recurrent (circular arrows) connections. The feedforward, feedback and bypass connections are constructed as convolutions, while the recurrent connections are constructed as identity mappings. (b) The PCN consists of a stack of basic building blocks. Each block runs multiple cycles of local recurrent processing between adjacent layers, and merges its input into its output through the bypass connections. The output from one block is then sent to the next block to initiate local recurrent processing at a higher level. This continues until reaching the top of the network.\n\nFeedback connections are used to predict lower-layer representations. In turn, feedforward connections send the error of prediction to update the higher-layer representations. After repeating this processing for multiple cycles within a given block, the lower-layer representation is merged into the higher-layer representation through a bypass connection. 
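A minimal sketch of one such recurrent block, under our own simplifying assumptions (dense matrices in place of the paper's convolutions, a fixed scalar update rate instead of a learnable one, and no batch normalization; the function and variable names are ours):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def pcn_block(r_lower, W_ff, W_fb, W_bp, alpha=0.1, cycles=5):
    """One recurrent block: the higher layer repeatedly predicts the lower
    layer through feedback weights, the rectified prediction error is fed
    forward to refine the higher layer, and the block input is finally
    merged into the output through a bypass connection."""
    r = relu(W_ff @ r_lower)                 # initial feedforward pass
    for _ in range(cycles):
        prediction = W_fb @ r                # top-down prediction of the lower layer
        error = relu(r_lower - prediction)   # rectified error of prediction
        r = r + alpha * (W_ff @ error)       # feed the error forward to update the higher layer
    return r + W_bp @ r_lower                # merge the input via the bypass connection

rng = np.random.default_rng(1)
r0 = rng.standard_normal(16)               # block input (lower-layer representation)
W_ff = rng.standard_normal((8, 16)) * 0.1  # feedforward weights
W_fb = rng.standard_normal((16, 8)) * 0.1  # feedback weights
W_bp = rng.standard_normal((8, 16)) * 0.1  # bypass weights
out = pcn_block(r0, W_ff, W_fb, W_bp)      # output passed to the next block
assert out.shape == (8,)
```

In the full model, each block's output (after optional pooling) feeds the next block, and a classifier sits on top of the highest block.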
The merged representation is further sent as the input to the next recurrent block to start another series of recurrent processing at a higher level. After the local recurrent processing continues through all recurrent blocks in series, the emerging top-level representations are used for image classification.\nIn the following mathematical descriptions, we use italic letters as symbols for scalars, bold lowercase letters for column vectors and bold uppercase letters for matrices. We use T to denote the number of recurrent cycles, r_l(t) to denote the representation of layer l at time t, W_{l-1,l} to denote the feedforward weights from layer l-1 to layer l, W_{l,l-1} to denote the feedback weights from layer l to layer l-1, and W^{bp}_{l-1,l} to denote the weights of bypass connections.\n\n3.1 Local Recurrent Processing in PCN\n\nWithin each recurrent block (e.g. between layer l-1 and layer l), the local recurrent processing serves to reduce the error of prediction. As in Eq. (1), the higher-layer representation r_l(t) generates a prediction, p_{l-1}(t), of the lower-layer representation, r_{l-1}, through feedback connections, W_{l,l-1}, yielding an error of prediction e_{l-1}(t) as in Eq. (2).\n\np_{l-1}(t) = (W_{l,l-1})^T r_l(t)    (1)\ne_{l-1}(t) = r_{l-1} - p_{l-1}(t)    (2)\n\nThe objective of recurrent processing is to reduce the sum of the squared prediction error (Eq. (3)) by updating the higher-layer representation, r_l(t), with a gradient descent algorithm [55]. In each cycle of recurrent processing, r_l(t) is updated along the direction opposite to the gradient (Eq. (4)) with an incremental size proportional to an update rate, \u03b1_l. As r_l(t) is updated over time as in Eq. (5), it tends to converge while the gross error of prediction tends to decrease. Note that Eq. (6) is equivalent to Eq. (5) if the feedback weights are tied to be the transpose of the feedforward weights. Eq. (6) is useful even without this assumption, as shown in a prior study [63], and it is thus used in this study instead of Eq. (5).\n\nL_{l-1}(t) = (1/2) || r_{l-1} - p_{l-1}(t) ||^2    (3)\n\u2202L_{l-1}(t)/\u2202r_l(t) = -W_{l,l-1} e_{l-1}(t)    (4)\nr_l(t+1) = r_l(t) - \u03b1_l \u2202L_{l-1}(t)/\u2202r_l(t) = r_l(t) + \u03b1_l W_{l,l-1} e_{l-1}(t)    (5)\nr_l(t+1) = r_l(t) + \u03b1_l (W_{l-1,l})^T e_{l-1}(t)    (6)\n\n3.2 Network Architecture\n\nWe implement PCN with some architectural features common to modern CNNs. Specifically, feedforward and feedback connections are implemented as regular convolutions and transposed convolutions [10], respectively, with a kernel size of 3. Bypass connections are implemented as 1\u00d71 convolutions. Batch normalization (BN) [26] is applied to the input to each recurrent block. During the recurrent processing between layer l-1 and layer l, rectified linear units (ReLU) [41] are applied to the initial output r_l(0) at t = 0 and to the prediction error at each time step, e_{l-1}(t), as expressed by Eq. (7) and Eq. (8), respectively. ReLU renders the local recurrent processing an increasingly non-linear operation as the processing continues over time.\n\nr_l(0) = ReLU((W_{l-1,l})^T r_{l-1})    (7)\ne_{l-1}(t) = ReLU(r_{l-1} - p_{l-1}(t))    (8)\n\nBesides, a 2\u00d72 max-pooling with a stride of 2 is optionally applied to the output from every 2 (or 3) blocks. On top of the highest recurrent block is a classifier, including global average pooling and a fully-connected layer, followed by softmax.\nFor comparison, we also design the feedforward-only counterpart of PCN and refer to it as the \"plain\" model. It includes the same feedforward and bypass connections and uses the same classification layer as PCN, but excludes any feedback connection or recurrent processing. The architecture of
The architecture of\nthe plain model is similar to that of Inception CNN [59].\n\nrl(0) = ReLU(cid:0)(cid:0)W T\n\n(cid:1)(cid:1)\n\n(cid:1)(cid:1);\n\nfor t = 1 to T do\n\nl\u22121 = BatchN orm(rl\u22121);\nrBN\n\nrl(0) = ReLU(cid:0)F F Conv(cid:0)rBN\n\nAlgorithm 1 Predictive Coding Network with local recurrent processing.\nInput: The input image r0;\n1: for l = 1 to L do\n2:\n3:\n4:\n5:\n6:\n7:\n8:\n9:\n10: end for\n11: return rL for classi\ufb01cation;\n12: (cid:46) FFConv represents the feedforward convolution, FBConv represents the feedback convolution\n\npl\u22121(t) = F BConv (rl(t \u2212 1));\nel\u22121(t) = ReLU (rl\u22121 \u2212 pl\u22121(t));\nrl(t) = rl(t \u2212 1) + \u03b1lF F conv (el\u22121(t));\n\nrl = rl(T ) + BP Conv(cid:0)rBN\n\nend for\n\nl\u22121\n\n(cid:1);\n\nl\u22121\n\nand BPConv represents the bypass convolution.\n\nOur implementation is described in Algorithm 1. Note that the update rate \u03b1l used for local recurrent\nprocessing is a learnable and non-negative parameter separately de\ufb01ned for each \ufb01lter in each\nrecurrent block. The number of cycles of recurrent processing - an important parameter in PCN, is\nvaried to be T = 1, ..., 5. For both PCN and its CNN counterpart (i.e. T=0), we design multiple\n\n4\n\n\farchitectures (labeled as A through E) suitable for different benchmark datasets (SVHN, CIFAR and\nImageNet), as summarized in Table 1. For example, PCN-A-5 stands for a PCN with architecture A\nand 5 cycles of local recurrent processing.\n\nTable 1: Architectures of PCN. Each column shows a model. We use PcConv to represent a predictive\ncoding layer, with its parameters denoted as \"PcConv--\". The \ufb01rst layer in PCN-E is a\nregular convolution with a kernel size of 7, a padding of 3, a stride of 2 and 64 output channels. \"*\"\nindicates the layer applying maxpooling to its output. 
Feature maps in one grid have the same size.\n\nPCN Configuration:\nArchitecture A (SVHN, 32\u00d732 input, 7 layers, 0.15M params): PcConv3-3-16 | PcConv3-16-16 | PcConv3-16-32* | PcConv3-32-32 | PcConv3-32-64* | PcConv3-64-64\nArchitecture B (SVHN, 32\u00d732 input, 7 layers, 0.61M params): PcConv3-3-16 | PcConv3-16-32 | PcConv3-32-64* | PcConv3-64-64 | PcConv3-64-128* | PcConv3-128-128\nArchitecture C (CIFAR, 32\u00d732 input, 9 layers, 4.91M params): PcConv3-3-64 | PcConv3-64-64 | PcConv3-64-128* | PcConv3-128-128 | PcConv3-128-256* | PcConv3-256-256 | PcConv3-256-256 | PcConv3-256-256\nArchitecture D (CIFAR, 32\u00d732 input, 9 layers, 9.90M params): PcConv3-3-64 | PcConv3-64-64 | PcConv3-64-128* | PcConv3-128-128 | PcConv3-128-256* | PcConv3-256-256 | PcConv3-256-512 | PcConv3-512-512\nArchitecture E (ImageNet, 224\u00d7224 input, 13 layers, 17.26M params): Conv7-64 | PcConv3-64-64 | PcConv3-64-128* | PcConv3-128-128 | PcConv3-128-128* | PcConv3-128-128 | PcConv3-128-256* | PcConv3-256-256 | PcConv3-256-256 | PcConv3-256-512* | PcConv3-512-512 | PcConv3-512-512\nClassification (all architectures): global average pooling, FC-10/100/1000, softmax\n\n4 Experiments\n\nWe train PCN with local recurrent processing and its corresponding plain model for object recognition with the following datasets, and compare their performance with classical and state-of-the-art models.\n\n4.1 Datasets\n\nCIFAR The CIFAR-10 and CIFAR-100 datasets [32] consist of 32\u00d732 color images drawn from 10 and 100 categories, respectively. Both datasets contain 50,000 training images and 10,000 testing images. For preprocessing, all images are normalized by channel means and standard deviations. For data augmentation, we use a standard scheme (flip/translation) as suggested by previous works [19, 23, 38, 45, 24].\nSVHN The Street View House Numbers (SVHN) dataset [42] consists of 32\u00d732 color images. There are 73,257 images in the training set, 26,032 images in the test set and 531,131 images for extra training. 
Following the common practice [23, 38], we train the model with the training and extra sets and test it with the test set. No data augmentation is introduced, and we use the same pre-processing scheme as in CIFAR.\nImageNet The ILSVRC-2012 dataset [7] consists of 1.28 million training images and 50k validation images, drawn from 1000 categories. Following [19, 20, 59], we use the standard data augmentation scheme for the training set: a 224\u00d7224 crop is randomly sampled from either the original image or its horizontal flip. For testing, we apply a single crop or ten crops of size 224\u00d7224 to the validation set, and the top-1 or top-5 classification error is reported.\n\n4.2 Training\n\nBoth PCN and (plain) CNN models are trained with stochastic gradient descent (SGD). On the CIFAR and SVHN datasets, we use a batch size of 128 and an initial learning rate of 0.01 for 300 and 40 epochs, respectively. The learning rate is divided by 10 at 50%, 75% and 87.5% of the total number of training epochs. Besides, we use a Nesterov momentum of 0.9 and a weight decay of 1e-3, determined by a 45k/5k split on the CIFAR-100 training set. On ImageNet, we follow the most common practice [19, 66] and use an initial learning rate of 0.01, a momentum of 0.9, a weight decay of 1e-4, and 100 epochs with the learning rate dropped by 0.1 at epochs 30, 60 and 90. The batch size is 256 for PCN-E-0/1/2, 128 for PCN-E-3/4 and 115 for PCN-E-5 due to limited computational resources.\n\nTable 2: Error rates (%) on CIFAR datasets. #L and #P are the number of layers and parameters, respectively.\nType | Model | #L | #P | C-10 | C-100\nFeedforward | HighwayNet [56] | 19 | 2.3M | 7.72 | 32.39\nFeedforward | FractalNet [34] | 21 | 38.6M | 5.22 | 23.30\nFeedforward | ResNet [19] | 110 | 1.7M | 6.41 | 27.22\nFeedforward | ResNet (Pre-act) [20] | 164 | 1.7M | 5.23 | 24.58\nFeedforward | ResNet (Pre-act) [20] | 1001 | 10.2M | 4.62 | 22.71\nFeedforward | WRN [69] | 28 | 36.5M | 4.00 | 19.25\nFeedforward | DenseNet-BC [24] | 100 | 0.8M | 4.51 | 22.27\nFeedforward | DenseNet-BC [24] | 190 | 25.6M | 3.46 | 17.18\nRecurrent | RCNN [36] | 160 | 1.86M | 7.09 | 31.75\nRecurrent | DasNet [58] | - | - | 9.22 | 33.78\nRecurrent | FeedbackNet [70] | 12 | - | - | 28.88\nRecurrent | CliqueNet [68] | 18 | 10.14M | 5.06 | 23.14\nRecurrent | CliqueNet [68] | 30 | 10.02M | 5.06 | 21.83\nRecurrent | PCN with Global Recurrent Processing [63] | 7 | 0.57M | 7.60 | 31.69\nRecurrent | PCN with Global Recurrent Processing [63] | 9 | 1.16M | 7.20 | 30.66\nRecurrent | PCN with Global Recurrent Processing [63] | 9 | 4.65M | 6.17 | 27.42\nPCN | PCN-C-1 | 9 | 4.91M | 5.70 | 24.01\nPCN | PCN-C-2 | 9 | 4.91M | 5.38 | 22.89\nPCN | PCN-C-5 | 9 | 4.91M | 5.10 | 22.43\nPCN | PCN-D-1 | 9 | 9.90M | 5.73 | 23.78\nPCN | PCN-D-2 | 9 | 9.90M | 5.39 | 22.75\nPCN | PCN-D-5 | 9 | 9.90M | 4.89 | 21.77\nPlain | Plain-C | 9 | 2.59M | 5.68 | 25.65\nPlain | Plain-D | 9 | 5.21M | 5.61 | 25.31\n\nTable 3: Error rates (%) on SVHN. #L and #P are the number of layers and parameters, respectively.\nModel | #L | #P | SVHN\nMaxOut [15] | - | - | 2.47\nNIN [38] | - | - | 2.35\nDropConnect [61] | - | - | 1.94\nDSN [35] | - | - | 1.92\nRCNN [36] | 6 | 2.67M | 1.77\nFitNet [45] | 13 | 1.5M | 2.42\nWRN [69] | 16 | 11M | 1.54\nPCN (Global) [63] | 7 | 0.14M | 2.42\nPCN (Global) [63] | 7 | 0.57M | 2.42\nPCN-A-1 | 7 | 0.15M | 2.29\nPCN-A-2 | 7 | 0.15M | 2.22\nPCN-A-5 | 7 | 0.15M | 2.07\nPCN-B-1 | 7 | 0.61M | 1.99\nPCN-B-2 | 7 | 0.61M | 1.97\nPCN-B-5 | 7 | 0.61M | 1.96\nPlain-A | 7 | 0.08M | 2.85\nPlain-B | 7 | 0.32M | 2.43\n\n4.3 Evaluating the Behavior of PCN\n\nTo understand how the model works, we further examine how local recurrent processing changes the internal representations of PCN. For this purpose, we focus on testing PCN-D-5 with CIFAR-100.\nConverging representation? Since local recurrent processing is governed by predictive coding, it is anticipated that the error of prediction tends to decrease over time. 
To confirm this expectation, the L2 norm of the prediction error is calculated for each layer and each cycle of recurrent processing, and averaged across all testing examples in CIFAR-100. This analysis reveals the temporal behavior of recurrently refined internal representations.\nWhat does the prediction error mean? Since the error of prediction drives the recurrent update of internal representations, we examine the spatial distribution of the error signal (after the final cycle) first for each layer, and then average the error distributions across all layers by rescaling them to the same size. The resulting error distribution is used as a spatial pattern in the input space, and it is applied to the input image as a weighted mask to visualize its selectivity.\nDoes predictive coding help image classification? The goal of predictive coding is to reduce the error of top-down prediction, seemingly independent of the objective of categorization that the PCN model is trained for. As recurrent processing progressively updates the layer-wise representation, does each update also subserve the purpose of categorization? As in [27], the loss of categorization, as a nonlinear function of the layer-wise representation L(r_l(t)), has the Taylor expansion in Eq. (9), where \u2206r_l(t) = r_l(t+1) - r_l(t) is the incremental update of r_l at time t.\n\nL(r_l(t+1)) - L(r_l(t)) = \u2206r_l(t) \u00b7 \u2202L(r_l(t))/\u2202r_l(t) + O(\u2206r_l(t)^2)    (9)\n\nIf each update tends to reduce the categorization loss, it should satisfy \u2206r_l(t) \u00b7 \u2202L(r_l(t))/\u2202r_l(t) < 0, or equivalently the \"cosine distance\" between \u2206r_l(t) and \u2202L(r_l(t))/\u2202r_l(t) should be negative. Here O(\u00b7) is ignored as suggested by [27], since the incremental part becomes minor because of the convergent representation in PCN. 
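For a single layer and cycle, the quantity being averaged is an ordinary cosine similarity between the update and the gradient; a hypothetical helper (our naming, plain Python) makes the sign test concrete:

```python
from math import sqrt

def cosine_distance(delta_r, grad):
    """Cosine of the angle between the representation update delta_r and the
    gradient of the categorization loss; a negative value means the update
    also tends to reduce the loss."""
    dot = sum(d * g for d, g in zip(delta_r, grad))
    norms = sqrt(sum(d * d for d in delta_r)) * sqrt(sum(g * g for g in grad))
    return dot / norms

# An update taken along the negative gradient gives a cosine of -1.
grad = [1.0, -2.0, 0.5]
delta = [-0.1 * g for g in grad]
assert abs(cosine_distance(delta, grad) + 1.0) < 1e-9
```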
We test this by calculating the cosine distance for each cycle of recurrent processing in each layer, and then averaging the results across all testing images in CIFAR-100.\n\nTable 4: Error rates (%) on ImageNet.\nModel | #Layers | #Params | Single-crop top-1 | Single-crop top-5 | 10-crop top-1 | 10-crop top-5\nResNet-18 | 18 | 11.69M | 30.24 | 10.92 | 28.15 | 9.40\nResNet-34 | 34 | 21.80M | 26.69 | 8.58 | 24.73 | 7.46\nResNet-50 | 50 | 25.56M | 23.87 | 7.14 | 22.57 | 6.24\nPCN-E-5 | 13 | 17.26M | 25.31 | 7.79 | 23.52 | 6.64\nPCN-E-3 | 13 | 17.26M | 25.36 | 7.78 | 23.62 | 6.69\nPlain-E | 13 | 9.34M | 31.15 | 11.27 | 28.82 | 9.79\n\nFigure 2: PCN shows better categorization performance given more cycles of recurrent processing, for CIFAR-10, CIFAR-100 and ImageNet. The red dashed line represents the accuracy of the plain model.\n\n4.4 Experimental results\n\nClassification performance On the CIFAR and SVHN datasets, PCN always outperforms its corresponding CNN counterpart (Tables 2 and 3). On CIFAR-100, PCN-D-5 reduces the error rate by 3.54% relative to Plain-D. The performance of PCN is also better than those of the ResNet family [19, 20], although PCN is much shallower with only 9 layers whereas ResNet may use as many as 1001 layers. Although PCN under-performs WRN [69] and DenseNet-190 [24] by 2-4%, it uses a much shallower architecture with many fewer parameters. On SVHN, PCN shows competitive performance despite fewer layers and parameters. On ImageNet, PCN also performs better than its plain counterpart. With 5 cycles of local recurrent processing (i.e. PCN-E-5), PCN slightly under-performs ResNet-50 but outperforms ResNet-34, while using fewer layers and parameters than both of them. 
Therefore, the classification performance of PCN compares favorably with other state-of-the-art models, especially in terms of the performance-to-layer ratio.\nRecurrent Cycles The classification performance generally improves as the number of cycles of local recurrent processing increases, for both CIFAR and ImageNet and especially for ImageNet (Fig. 2). It is worth noting that this gain in performance is achieved without increasing the number of layers or parameters, but by simply running computation for a longer time on the same network.\nLocal vs. Global Recurrent Processing The PCN with local recurrent processing (proposed in this paper) performs better than the PCN with global recurrent processing (proposed in our earlier paper [63]). Table 2 shows that PCN with local recurrent processing reduces the error rate by 5% on CIFAR-100 compared to PCN with the same architecture but global recurrent processing. In addition, local recurrent processing also requires fewer computational resources for both training and inference.\n\n4.5 Behavioral Analysis\n\nTo understand how and why PCN works, we further examine how local recurrent processing changes its internal representations and whether the change helps categorization. Behavioral analysis reveals some intriguing findings.\n\nFigure 3: Behavioral analysis of PCN. (a) Errors of prediction tend to converge over repeated cycles of recurrent processing. The norm of the prediction error is shown for each layer and each cycle of local recurrent processing. (b) Errors of prediction reveal visual saliency. Given an input image (left), the spatial distribution of the error averaged across layers (middle) shows visual saliency or bottom-up attention, highlighting the part of the input with defining features (right). (c) The averaged cosine distance between \u2206r_l(t) and \u2202L(r_l(t))/\u2202r_l(t).\n\nAs expected for predictive coding, local recurrent processing progressively reduces the error of top-down prediction for all layers, except the top layer on which image classification is based (Fig. 3a). This finding implies that the internal representations converge to a stable state through local recurrent processing. Interestingly, the spatial distribution of the prediction error (averaged across all layers) highlights the apparently most salient part of the input image, and/or the most discriminative visual information for object recognition (Fig. 3b). The recurrent update of the layer-wise representation tends to align along the negative gradient of the categorization loss with respect to the corresponding representation (Fig. 3c). From a different perspective, this finding lends support to an intriguing implication that predictive coding facilitates object recognition, which is somewhat surprising because predictive coding, as a computational principle, herein only explicitly reduces the error of top-down prediction without having any explicit role that favors inference or learning towards object categorization.\n\n5 Discussion and Conclusion\n\nIn this study, we advocate further synergy between neuroscience and artificial intelligence. Complementary to engineering innovation in computing and optimization, the brain must possess additional mechanisms to enable generalizable, continuous, and efficient learning and inference. Here, we highlight the fact that the brain runs recurrent processing with lateral and feedback connections under predictive coding, instead of feedforward-only processing. While our focus is on object recognition, the PCN architecture can potentially be generalized to other computer vision tasks (e.g. 
object detection, semantic segmentation and image captioning) [13, 22, 40], or subserve new computational models that can encode [62, 16, 29, 49] or decode [46, 17, 48, 57] brain activities. PCN with local recurrent processing outperforms its counterpart with global recurrent processing, which is not surprising because global feedback pathways might be necessary for top-down attention [4, 3], but may not be necessary for core object recognition [9] itself. By modeling different mechanisms, we support the notion that local recurrent processing is a necessary addition to the initial feedforward process for object recognition.\nThis study leads us to rethink models for classification beyond feedforward-only networks. One interesting idea is to evaluate the equivalence between ResNets and recurrent neural networks (RNNs). Deep residual networks with shared weights can be strictly reformulated as a shallow RNN [37]. Regular ResNets can be reformulated as time-variant RNNs [37], and their representations are iteratively refined along the stacked residual blocks [27]. Similarly, DenseNet has been shown to be a generalized form of a higher-order recurrent neural network (HORNN) [6]. The results in this study are in line with such notions: a dynamic and bi-directional network can refine its representations across time, leading to convergent representations that support object recognition.\nOn the other hand, PCN is not contradictory to existing feedforward models, because the PCN block itself is integrated with an Inception-type CNN module. We expect that other network modules are applicable to further improve PCN performance, including cutout regularization [8], dropout layers [21], residual learning [19] and dense connectivity [24]. Although not explicitly trained for it, the error signals of PCN can be used to predict saliency in images, suggesting that other computer vision tasks [13, 22, 40] could benefit from the diverse feature representations (e.g. 
error, prediction and state signals) in PCN.\nThe PCN with local recurrent processing described herein has the following advantages over feedforward CNNs and other dynamic or recurrent models, including a similar PCN with global recurrent processing [63]. 1) It can achieve competitive performance in image classification with a shallow network and fewer parameters. 2) Its internal representations converge as recurrent processing proceeds over time, suggesting a self-organized mechanism towards stability [12]. 3) It reveals visual saliency or bottom-up attention while performing object recognition. However, its disadvantages are 1) a longer computation time than plain networks (with the same number of layers) and 2) sequentially executed recurrent processing, both of which should be improved or addressed in future studies.\n\nAcknowledgement\n\nThe research was supported by NIH R01MH104402 and the College of Engineering at Purdue University.\n\nReferences\n\n[1] Andre M Bastos, W Martin Usrey, Rick A Adams, George R Mangun, Pascal Fries, and Karl J Friston. Canonical microcircuits for predictive coding. Neuron, 76(4):695\u2013711, 2012.\n\n[2] CN Boehler, MA Schoenfeld, H-J Heinze, and J-M Hopf. Rapid recurrent processing gates awareness in primary visual cortex. Proceedings of the National Academy of Sciences, 105(25):8742\u20138747, 2008.\n\n[3] Steven L Bressler, Wei Tang, Chad M Sylvester, Gordon L Shulman, and Maurizio Corbetta. Top-down control of human visual cortex by frontal and parietal cortex in anticipatory visual spatial attention. Journal of Neuroscience, 28(40):10056\u201310061, 2008.\n\n[4] Elizabeth A Buffalo, Pascal Fries, Rogier Landman, Hualou Liang, and Robert Desimone. A backward progression of attentional effects in the ventral stream. Proceedings of the National Academy of Sciences, 107(1):361\u2013365, 2010.\n\n[5] Joan A Camprodon, Ehud Zohary, Verena Brodbeck, and Alvaro Pascual-Leone. 
Two phases of V1 activity for visual recognition of natural images. Journal of Cognitive Neuroscience, 22(6):1262–1269, 2010.

[6] Yunpeng Chen, Jianan Li, Huaxin Xiao, Xiaojie Jin, Shuicheng Yan, and Jiashi Feng. Dual path networks. In Advances in Neural Information Processing Systems, pages 4470–4478, 2017.

[7] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 248–255. IEEE, 2009.

[8] Terrance DeVries and Graham W Taylor. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552, 2017.

[9] James J DiCarlo, Davide Zoccolan, and Nicole C Rust. How does the brain solve visual object recognition? Neuron, 73(3):415–434, 2012.

[10] Vincent Dumoulin and Francesco Visin. A guide to convolution arithmetic for deep learning. arXiv preprint arXiv:1603.07285, 2016.

[11] Daniel J Felleman and David C Van Essen. Distributed hierarchical processing in the primate cerebral cortex. Cerebral Cortex, 1(1):1–47, 1991.

[12] Karl Friston. The free-energy principle: a unified brain theory? Nature Reviews Neuroscience, 11(2):127, 2010.

[13] Alberto Garcia-Garcia, Sergio Orts-Escolano, Sergiu Oprea, Victor Villena-Martinez, and Jose Garcia-Rodriguez. A review on deep learning techniques applied to semantic segmentation. arXiv preprint arXiv:1704.06857, 2017.

[14] Dileep George and Jeff Hawkins. Towards a mathematical theory of cortical micro-circuits. PLoS Computational Biology, 5(10):e1000532, 2009.

[15] Ian J Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio. Maxout networks. arXiv preprint arXiv:1302.4389, 2013.

[16] Umut Güçlü and Marcel AJ van Gerven.
Deep neural networks reveal a gradient in the complexity of neural representations across the ventral stream. Journal of Neuroscience, 35(27):10005–10014, 2015.

[17] Kuan Han, Haiguang Wen, Junxing Shi, Kun-Han Lu, Yizhen Zhang, and Zhongming Liu. Variational autoencoder: An unsupervised model for modeling and decoding fMRI activity in visual cortex. bioRxiv, page 214247, 2017.

[18] Kaiming He and Jian Sun. Convolutional neural networks at constrained time cost. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5353–5360. IEEE, 2015.

[19] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[20] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision, pages 630–645. Springer, 2016.

[21] Geoffrey E Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.

[22] Hoo-Chang Shin, Holger R Roth, Mingchen Gao, Le Lu, Ziyue Xu, Isabella Nogues, Jianhua Yao, Daniel Mollura, and Ronald M Summers. Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning. IEEE Transactions on Medical Imaging, 35(5):1285, 2016.

[23] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q Weinberger. Deep networks with stochastic depth. In European Conference on Computer Vision, pages 646–661. Springer, 2016.

[24] Gao Huang, Zhuang Liu, Kilian Q Weinberger, and Laurens van der Maaten. Densely connected convolutional networks.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 1, page 3, 2017.

[25] Yanping Huang and Rajesh PN Rao. Predictive coding. Wiley Interdisciplinary Reviews: Cognitive Science, 2(5):580–593, 2011.

[26] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.

[27] Stanisław Jastrzebski, Devansh Arpit, Nicolas Ballas, Vikas Verma, Tong Che, and Yoshua Bengio. Residual connections encourage iterative inference. arXiv preprint arXiv:1710.04773, 2017.

[28] Janneke FM Jehee, Constantin Rothkopf, Jeffrey M Beck, and Dana H Ballard. Learning receptive fields using predictive feedback. Journal of Physiology-Paris, 100(1-3):125–132, 2006.

[29] Tim Christian Kietzmann, Patrick McClure, and Nikolaus Kriegeskorte. Deep neural networks in computational neuroscience. bioRxiv, page 133504, 2018.

[30] Mika Koivisto, Henry Railo, Antti Revonsuo, Simo Vanni, and Niina Salminen-Vaparanta. Recurrent processing in V1/V2 contributes to categorization of natural scenes. Journal of Neuroscience, 31(7):2488–2492, 2011.

[31] Nikolaus Kriegeskorte. Deep neural networks: a new framework for modeling biological vision and brain information processing. Annual Review of Vision Science, 1:417–446, 2015.

[32] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. 2009.

[33] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

[34] Gustav Larsson, Michael Maire, and Gregory Shakhnarovich. Fractalnet: Ultra-deep neural networks without residuals. arXiv preprint arXiv:1605.07648, 2016.

[35] Chen-Yu Lee, Saining Xie, Patrick Gallagher, Zhengyou Zhang, and Zhuowen Tu.
Deeply-supervised nets. In Artificial Intelligence and Statistics, pages 562–570, 2015.

[36] Ming Liang and Xiaolin Hu. Recurrent convolutional neural network for object recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3367–3375, 2015.

[37] Qianli Liao and Tomaso Poggio. Bridging the gaps between residual learning, recurrent neural networks and visual cortex. arXiv preprint arXiv:1604.03640, 2016.

[38] Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. arXiv preprint arXiv:1312.4400, 2013.

[39] William Lotter, Gabriel Kreiman, and David Cox. Deep predictive coding networks for video prediction and unsupervised learning. arXiv preprint arXiv:1605.08104, 2016.

[40] Jiasen Lu, Caiming Xiong, Devi Parikh, and Richard Socher. Knowing when to look: Adaptive attention via a visual sentinel for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 6, page 2, 2017.

[41] Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 807–814, 2010.

[42] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, volume 2011, page 5, 2011.

[43] Randall C O'Reilly, Dean Wyatte, Seth Herd, Brian Mingus, and David J Jilk. Recurrent processing during object recognition. Frontiers in Psychology, 4:124, 2013.

[44] Rajesh PN Rao and Dana H Ballard. Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects.
Nature Neuroscience, 2(1):79, 1999.

[45] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. Fitnets: Hints for thin deep nets. arXiv preprint arXiv:1412.6550, 2014.

[46] K Seeliger, U Güçlü, L Ambrogioni, Y Güçlütürk, and MAJ van Gerven. Generative adversarial networks for reconstructing natural images from brain activity. NeuroImage, 181:775–785, 2018.

[47] Thomas Serre, Gabriel Kreiman, Minjoon Kouh, Charles Cadieu, Ulf Knoblich, and Tomaso Poggio. A quantitative theory of immediate visual recognition. Progress in Brain Research, 165:33–56, 2007.

[48] Guohua Shen, Tomoyasu Horikawa, Kei Majima, and Yukiyasu Kamitani. Deep image reconstruction from human brain activity. bioRxiv, page 240317, 2017.

[49] Junxing Shi, Haiguang Wen, Yizhen Zhang, Kuan Han, and Zhongming Liu. Deep recurrent neural network reveals a hierarchy of process memory during dynamic natural vision. Human Brain Mapping, 39(5):2269–2282, 2018.

[50] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[51] Courtney J Spoerer, Patrick McClure, and Nikolaus Kriegeskorte. Recurrent convolutional neural networks: a better model of biological object recognition. Frontiers in Psychology, 8:1551, 2017.

[52] Michael W Spratling. Reconciling predictive coding and biased competition models of cortical function. Frontiers in Computational Neuroscience, 2:4, 2008.

[53] Michael W Spratling. Unsupervised learning of generative and discriminative weights encoding elementary image components in a predictive coding model of cortical function. Neural Computation, 24(1):60–103, 2012.

[54] Michael W Spratling. A hierarchical predictive coding model of object recognition in natural images.
Cognitive Computation, 9(2):151–167, 2017.

[55] Michael W Spratling. A review of predictive coding algorithms. Brain and Cognition, 112:92–97, 2017.

[56] Rupesh K Srivastava, Klaus Greff, and Jürgen Schmidhuber. Training very deep networks. In Advances in Neural Information Processing Systems, pages 2377–2385, 2015.

[57] Ghislain St-Yves and Thomas Naselaris. Generative adversarial networks conditioned on brain activity reconstruct seen images. bioRxiv, page 304774, 2018.

[58] Marijn F Stollenga, Jonathan Masci, Faustino Gomez, and Jürgen Schmidhuber. Deep networks with internal selective attention through feedback connections. In Advances in Neural Information Processing Systems, pages 3545–3553, 2014.

[59] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich, et al. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.

[60] David C Van Essen and John HR Maunsell. Hierarchical organization and functional streams in the visual cortex. Trends in Neurosciences, 6:370–375, 1983.

[61] Li Wan, Matthew Zeiler, Sixin Zhang, Yann Le Cun, and Rob Fergus. Regularization of neural networks using dropconnect. In International Conference on Machine Learning, pages 1058–1066, 2013.

[62] Haiguang Wen, Junxing Shi, Yizhen Zhang, Kun-Han Lu, Jiayue Cao, and Zhongming Liu. Neural encoding and decoding with deep learning for dynamic natural vision. Cerebral Cortex, pages 1–25, 2017.

[63] Haiguang Wen, Kuan Han, Junxing Shi, Yizhen Zhang, Eugenio Culurciello, and Zhongming Liu. Deep predictive coding network for object recognition. arXiv preprint arXiv:1802.04762, 2018.

[64] Dean Wyatte, Tim Curran, and Randall O'Reilly. The limits of feedforward vision: recurrent processing promotes robust object recognition when objects are degraded.
Journal of Cognitive Neuroscience, 24(11):2248–2261, 2012.

[65] Dean Wyatte, David J Jilk, and Randall C O'Reilly. Early recurrent feedback facilitates visual object recognition under challenging conditions. Frontiers in Psychology, 5:674, 2014.

[66] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5987–5995. IEEE, 2017.

[67] Daniel LK Yamins and James J DiCarlo. Using goal-driven deep learning models to understand sensory cortex. Nature Neuroscience, 19(3):356, 2016.

[68] Yibo Yang, Zhisheng Zhong, Tiancheng Shen, and Zhouchen Lin. Convolutional neural networks with alternately updated clique. arXiv preprint arXiv:1802.10419, 2018.

[69] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.

[70] Amir R Zamir, Te-Lin Wu, Lin Sun, William B Shen, Bertram E Shi, Jitendra Malik, and Silvio Savarese. Feedback networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1808–1817. IEEE, 2017.