{"title": "Compete to Compute", "book": "Advances in Neural Information Processing Systems", "page_first": 2310, "page_last": 2318, "abstract": "Local competition among neighboring neurons is common in biological neural networks (NNs). We apply the concept to gradient-based, backprop-trained artificial multilayer NNs. NNs with competing linear units tend to outperform those with non-competing nonlinear units, and avoid catastrophic forgetting when training sets change over time.", "full_text": "Compete to Compute\n\nRupesh Kumar Srivastava, Jonathan Masci, Sohrob Kazerounian,\n\nFaustino Gomez, J\u00fcrgen Schmidhuber\n\nIDSIA, USI-SUPSI\n\nManno\u2013Lugano, Switzerland\n\n{rupesh, jonathan, sohrob, tino, juergen}@idsia.ch\n\nAbstract\n\nLocal competition among neighboring neurons is common in biological neu-\nral networks (NNs). In this paper, we apply the concept to gradient-based,\nbackprop-trained arti\ufb01cial multilayer NNs. NNs with competing linear\nunits tend to outperform those with non-competing nonlinear units, and\navoid catastrophic forgetting when training sets change over time.\n\nIntroduction\n\n1\nAlthough it is often useful for machine learning methods to consider how nature has arrived\nat a particular solution, it is perhaps more instructive to \ufb01rst understand the functional\nrole of such biological constraints. Indeed, arti\ufb01cial neural networks, which now represent\nthe state-of-the-art in many pattern recognition tasks, not only resemble the brain in a\nsuper\ufb01cial sense, but also draw on many of its computational and functional properties.\nOne of the long-studied properties of biological neural circuits which has yet to fully impact\nthe machine learning community is the nature of local competition. 
That is, a common\n\ufb01nding across brain regions is that neurons exhibit on-center, o\ufb00-surround organization\n[1, 2, 3], and this organization has been argued to give rise to a number of interesting\nproperties across networks of neurons, such as winner-take-all dynamics, automatic gain\ncontrol, and noise suppression [4].\nIn this paper, we propose a biologically inspired mechanism for arti\ufb01cial neural networks\nthat is based on local competition, and ultimately relies on local winner-take-all (LWTA)\nbehavior. We demonstrate the bene\ufb01t of LWTA across a number of di\ufb00erent networks and\npattern recognition tasks by showing that LWTA not only enables performance comparable\nto the state-of-the-art, but moreover, helps to prevent catastrophic forgetting [5, 6] common\nto arti\ufb01cial neural networks when they are \ufb01rst trained on a particular task, then abruptly\ntrained on a new task. This property is desirable in continual learning wherein learning\nregimes are not clearly delineated [7]. Our experiments also show evidence that a type of\nmodularity emerges in LWTA networks trained in a supervised setting, such that di\ufb00erent\nmodules (subnetworks) respond to di\ufb00erent inputs. This is bene\ufb01cial when learning from\nmultimodal data distributions as compared to learning a monolithic model.\nIn the following, we \ufb01rst discuss some of the relevant neuroscience background motivating\nlocal competition, then show how we incorporate it into arti\ufb01cial neural networks, and\nhow LWTA, as implemented here, compares to alternative methods. We then show how\nLWTA networks perform on a variety of tasks, and how it helps bu\ufb00er against catastrophic\nforgetting.\n2 Neuroscience Background\nCompetitive interactions between neurons and neural circuits have long played an important\nrole in biological models of brain processes. 
This is largely due to early studies showing that many cortical [3] and sub-cortical (e.g., hippocampal [1] and cerebellar [2]) regions of the brain exhibit a recurrent on-center, off-surround anatomy, where cells provide excitatory feedback to nearby cells, while scattering inhibitory signals over a broader range. Biological modeling has since tried to uncover the functional properties of this sort of organization, and its role in the behavioral success of animals.
The earliest models to describe the emergence of winner-take-all (WTA) behavior from local competition were based on Grossberg's shunting short-term memory equations [4], which showed that a center-surround structure enables not only WTA dynamics, but also contrast enhancement and normalization. Analysis of their dynamics showed that networks with slower-than-linear signal functions uniformize input patterns; linear signal functions preserve and normalize input patterns; and faster-than-linear signal functions enable WTA dynamics. Sigmoidal signal functions, which contain slower-than-linear, linear, and faster-than-linear regions, enable the suppression of noise in input patterns, while contrast-enhancing, normalizing, and storing the relevant portions of an input pattern (a form of soft WTA). 
The functional properties of competitive interactions have been further studied to show, among other things, the effects of distance-dependent kernels [8], inhibitory time lags [8, 9], the development of self-organizing maps [10, 11, 12], and the role of WTA networks in attention [13]. Biological models have also been extended to show how competitive interactions in spiking neural networks give rise to (soft) WTA dynamics [14], as well as how they may be efficiently constructed in VLSI [15, 16].
Although competitive interactions and WTA dynamics have been studied extensively in the biological literature, it is only more recently that they have been considered from computational or machine learning perspectives. For example, Maass [17, 18] showed that feedforward neural networks with WTA dynamics as the only non-linearity are as computationally powerful as networks with threshold or sigmoidal gates, and that networks employing only soft WTA competition are universal function approximators. Moreover, these results hold even when the network weights are strictly positive, a finding which has ramifications for our understanding of biological neural circuits, as well as for the development of neural networks for pattern recognition. The large body of evidence supporting the advantages of locally competitive interactions makes it noteworthy that this simple mechanism has not provoked more study by the machine learning community. Nonetheless, networks employing local competition have existed since the late 80s [21], and, along with [22], serve as a primary inspiration for the present work. 
More recently, maxout networks [19] have leveraged locally competitive interactions in combination with a technique known as dropout [20] to obtain the best results on certain benchmark problems.
3 Networks with local winner-take-all blocks
This section describes the general network architecture with locally competing neurons. The network consists of $B$ blocks which are organized into layers (Figure 1). Each block $b_i$, $i = 1..B$, contains $n$ computational units (neurons), and produces an output vector $y_i$, determined by the local interactions between the individual neuron activations in the block:
$$y_i^j = g(h_i^1, h_i^2, \ldots, h_i^n), \qquad (1)$$
where $g(\cdot)$ is the competition/interaction function, encoding the effect of local interactions in each block, and $h_i^j$, $j = 1..n$, is the activation of the $j$-th neuron in block $i$, computed by:
$$h_i^j = f(w_{ij}^T x), \qquad (2)$$
where $x$ is the input vector from neurons in the previous layer, $w_{ij}$ is the weight vector of neuron $j$ in block $i$, and $f(\cdot)$ is a (generally non-linear) activation function. The output activations $y$ are passed as inputs to the next layer. In this paper we use the winner-take-all interaction function, inspired by studies in computational neuroscience. In particular, we use the hard winner-take-all function:
$$y_i^j = \begin{cases} h_i^j & \text{if } h_i^j \geq h_i^k, \ \forall k = 1..n \\ 0 & \text{otherwise.} \end{cases}$$
In the case of multiple winners, ties are broken by index precedence. In order to investigate the capabilities of the hard winner-take-all interaction function in isolation, $f(x) = x$ (identity) is used for the activation function in equation (2).
Figure 1: A Local Winner-Take-All (LWTA) network with blocks of size two, showing the winning neuron in each block (shaded) for a given input example. Activations flow forward only through the winning neurons; errors are backpropagated through the active neurons. Greyed out connections do not propagate activations. The active neurons form a subnetwork of the full network which changes depending on the inputs.
The difference between this Local Winner Take All (LWTA) network and a standard multilayer perceptron is that no non-linear activation functions are used, and, during the forward propagation of inputs, local competition between the neurons in each block turns off the activations of all neurons except the one with the highest activation. During training, the error signal is only backpropagated through the winning neurons.
In an LWTA layer, there are as many active neurons as there are blocks for any given input pattern^1. We denote a layer with blocks of size $n$ as LWTA-$n$. For each input pattern presented to the network, only a subgraph of the full network is active, e.g. the highlighted neurons and synapses in Figure 1. Training on a dataset consists of simultaneously training an exponential number of models that share parameters, as well as learning which model should be active for each pattern. Unlike networks with sigmoidal units, where all of the free parameters need to be set properly for all input patterns, only a subset is used for any given input, so that patterns coming from very different sub-distributions can potentially be modelled more efficiently through specialization.
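The forward pass described above (linear activations followed by hard winner-take-all within each block) can be sketched in a few lines of NumPy. This is a minimal illustration with our own names, not the authors' implementation (the paper's experiments used the Gnumpy and CUDAMat GPU libraries):

```python
import numpy as np

def lwta_forward(x, W, block_size=2):
    """Hard winner-take-all forward pass for one LWTA layer with f(x) = x.

    x: input vector, shape (d,)
    W: weight matrix, shape (n_blocks * block_size, d)
    Returns y where, in each block, only the neuron with the highest
    activation keeps its value and the others are zeroed.
    """
    h = W @ x                           # per-neuron activations, identity f
    blocks = h.reshape(-1, block_size)  # one row per block
    winners = blocks.argmax(axis=1)     # argmax breaks ties by index precedence
    y = np.zeros_like(blocks)
    rows = np.arange(blocks.shape[0])
    y[rows, winners] = blocks[rows, winners]
    return y.ravel()

# Example: 3 blocks of size 2 over a 5-dimensional input.
rng = np.random.default_rng(0)
x = rng.standard_normal(5)
W = rng.standard_normal((6, 5))
y = lwta_forward(x, W)  # one (possibly negative) non-zero entry per block
```

During backpropagation, the same winner mask would route the error signal so that only the winning neuron of each block receives a gradient.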
This modular property is similar to that of networks with rectified linear units (ReLU), which have recently been shown to be very good at several learning tasks (links with ReLU are discussed in section 4.3).
4 Comparison with related methods
4.1 Max-pooling
Neural networks with max-pooling layers [23] have been found to be very useful, especially for image classification tasks, where they have achieved state-of-the-art performance [24, 25]. These layers are usually used in convolutional neural networks to subsample the representation obtained after convolving the input with a learned filter, by dividing the representation into pools and selecting the maximum in each one. Max-pooling lowers the computational burden by reducing the number of connections in subsequent convolutional layers, and adds translational/rotational invariance.
^1 However, there is always the possibility that the winning neuron in a block has an activation of exactly zero, so that the block has no output.
Figure 2: Max-pooling vs. LWTA. (a) In max-pooling, each group of neurons in a layer has a single set of output weights that transmits the winning unit's activation (0.8 in this case) to the next layer, i.e. the layer activations are subsampled. (b) In an LWTA block, there is no subsampling. The activations flow into subsequent units via a different set of connections depending on the winning unit.
At first glance, max-pooling seems very similar to a WTA operation; however, the two differ substantially: there is no downsampling in a WTA operation, and thus the number of features is not reduced; instead, the representation is "sparsified" (see Figure 2).
4.2 Dropout
Dropout [20] can be interpreted as a model-averaging technique that jointly trains several models sharing subsets of parameters and input dimensions, or as data augmentation when applied to the input layer [19, 20]. 
This is achieved by probabilistically omitting (\u201cdrop-\nping\u201d) units from a network for each example during training, so that those neurons do not\nparticipate in forward/backward propagation. Consider, hypothetically, training an LWTA\nnetwork with blocks of size two, and selecting the winner in each block at random. This\nis similar to training a neural network with a dropout probability of 0.5. Nonetheless, the\ntwo are fundamentally di\ufb00erent. Dropout is a regularization technique while in LWTA the\ninteraction between neurons in a block replaces the per-neuron non-linear activation.\nDropout is believed to improve generalization performance since it forces the units to learn\nindependent features, without relying on other units being active. During testing, when\npropagating an input through the network, all units in a layer trained with dropout are\nused with their output weights suitably scaled. In an LWTA network, no output scaling is\nrequired. A fraction of the units will be inactive for each input pattern depending on their\ntotal inputs. Viewed this way, WTA is restrictive in that only a fraction of the parameters\nare utilized for each input pattern. However, we hypothesize that the freedom to use di\ufb00erent\nsubsets of parameters for di\ufb00erent inputs allows the architecture to learn from multimodal\ndata distributions more accurately.\n4.3 Recti\ufb01ed Linear units\nRecti\ufb01ed Linear Units (ReLU) are simply linear neurons that clamp negative activations to\nzero (f(x) = x if x > 0, f(x) = 0 otherwise). ReLU networks were shown to be useful for\nRestricted Boltzmann Machines [26], outperformed sigmoidal activation functions in deep\nneural networks [27], and have been used to obtain the best results on several benchmark\nproblems across multiple domains [24, 28].\nConsider an LWTA block with two neurons compared to two ReLU neurons, where x1 and\nx2 are the weighted sum of the inputs to each neuron. 
Table 1 shows the outputs $y_1$ and $y_2$ in all combinations of positive and negative $x_1$ and $x_2$, for ReLU and LWTA neurons.
Table 1: Comparison of rectified linear activation and LWTA-2.
x1 | x2 | ReLU y1 | ReLU y2 | LWTA y1 | LWTA y2
Positive | Positive (x1 > x2) | x1 | x2 | x1 | 0
Positive | Positive (x2 > x1) | x1 | x2 | 0 | x2
Positive | Negative | x1 | 0 | x1 | 0
Negative | Positive | 0 | x2 | 0 | x2
Negative | Negative (x1 > x2) | 0 | 0 | x1 | 0
Negative | Negative (x2 > x1) | 0 | 0 | 0 | x2
For both ReLU and LWTA neurons, $x_1$ and $x_2$ are passed through as output in half of the possible cases. The difference is that in LWTA the two neurons are never both active or both inactive at the same time, and the activations and errors flow through exactly one neuron in the block. For ReLU neurons, being inactive (saturation) is a potential drawback, since neurons that do not get activated will not get trained, leading to wasted capacity. However, previous work suggests that there is no negative impact on optimization, leading to the hypothesis that such hard saturation helps in credit assignment, and that, as long as errors flow through certain paths, optimization is not affected adversely [27]. Continued research along these lines supports this hypothesis [29], though it may still be possible to train ReLU networks better.
While many of the above arguments for and against ReLU networks apply to LWTA networks, there is a notable difference: during training of an LWTA network, inactive neurons can become active due to the training of the other neurons in the same block. 
This suggests that LWTA nets may be less sensitive to weight initialization, and that a greater portion of the network's capacity may be utilized.
5 Experiments
In the following experiments, LWTA networks were tested on various supervised learning datasets, demonstrating their ability to learn useful internal representations without utilizing any other non-linearities. In order to clearly assess the utility of local competition, no special strategies such as augmenting the data with transformations or noise, or using dropout, were employed. We also did not encourage sparse representations in the hidden layers by adding activation penalties to the objective function, a technique commonly used with ReLU units as well. Thus, our objective is to evaluate the value of using LWTA rather than to achieve the absolute best testing scores. Blocks of size two are used in all the experiments.^2
All networks were trained using stochastic gradient descent with mini-batches, with learning rate $\alpha_t$ and momentum $m_t$ at epoch $t$ given by
$$\alpha_t = \begin{cases} \alpha_0 \lambda^t & \text{if } \alpha_t > \alpha_{\min} \\ \alpha_{\min} & \text{otherwise,} \end{cases} \qquad m_t = \begin{cases} \frac{t}{T} m_f + \left(1 - \frac{t}{T}\right) m_i & \text{if } t < T \\ m_f & \text{if } t \geq T, \end{cases}$$
where $\lambda$ is the learning rate annealing factor, $\alpha_{\min}$ is the lower learning rate limit, and the momentum is scaled from $m_i$ to $m_f$ over $T$ epochs, after which it remains constant at $m_f$. L2 weight decay was used for the convolutional network (section 5.2), and max-norm normalization for the other experiments. This setup is similar to that of [20].
5.1 Permutation Invariant MNIST
The MNIST handwritten digit recognition task consists of 70,000 28x28 images (60,000 training, 10,000 test) of the 10 digits, centered by their center of mass [33]. In the permutation invariant setting of this task, we attempt to classify the digits without utilizing the 2D structure of the images, i.e. every digit is treated as a flat vector of pixels. The last 10,000 examples in the training set were used for hyperparameter tuning. 
The model with the best hyperparameter setting was trained until convergence on the full training set. Mini-batches of size 20 were used, and the pixel values were rescaled to [0, 1] (no further preprocessing). The best model obtained, which gave a test set error of 1.28%, consisted of three LWTA layers of 500 blocks followed by a 10-way softmax layer. To our knowledge, this is the best reported error, without utilizing implicit/explicit model averaging, for this setting, which does not use deformations/noise to enhance the dataset or unsupervised pretraining. Table 2 compares our results with other methods which do not use unsupervised pre-training. The performance of LWTA is comparable to that of a ReLU network with dropout in the hidden layers.
^2 To speed up our experiments, the Gnumpy [30] and CUDAMat [31] libraries were used.
Table 2: Test set errors on the permutation invariant MNIST dataset for methods without data augmentation or unsupervised pre-training.
Activation | Test Error
Sigmoid [32] | 1.60%
ReLU [27] | 1.43%
ReLU + dropout in hidden layers [20] | 1.30%
LWTA-2 | 1.28%
Table 3: Test set errors on the MNIST dataset for convolutional architectures with no data augmentation. Results marked with an asterisk use layer-wise unsupervised feature learning to pre-train the network and global fine tuning.
Architecture | Test Error
2-layer CNN + 2-layer MLP [34] * | 0.60%
2-layer ReLU CNN + 2-layer LWTA-2 | 0.57%
3-layer ReLU CNN [35] | 0.55%
2-layer CNN + 2-layer MLP [36] * | 0.53%
3-layer ReLU CNN + stochastic pooling [33] | 0.47%
3-layer maxout + dropout [19] | 0.45%
Using\ndropout in input layers as well, lower error rates of 1.1% using ReLU [20] and 0.94% using\nmaxout [19] have been obtained.\n5.2 Convolutional Network on MNIST\nFor this experiment, a convolutional network (CNN) was used consisting of 7 \u00d7 7 \ufb01lters in\nthe \ufb01rst layer followed by a second layer of 6 \u00d7 6, with 16 and 32 maps respectively, and\nReLU activation. Every convolutional layer is followed by a 2 \u00d7 2 max-pooling operation.\nWe then use two LWTA-2 layers each with 64 blocks and \ufb01nally a 10-way softmax output\nlayer. A weight decay of 0.05 was found to be bene\ufb01cial to improve generalization. The\nresults are summarized in Table 3 along with other state-of-the-art approaches which do not\nuse data augmentation (for details of convolutional architectures, see [33]).\n5.3 Amazon Sentiment Analysis\nLWTA networks were tested on the Amazon sentiment analysis dataset [37] since ReLU units\nhave been shown to perform well in this domain [27, 38]. We used the balanced subset of the\ndataset consisting of reviews of four categories of products: Books, DVDs, Electronics and\nKitchen appliances. The task is to classify the reviews as positive or negative. The dataset\nconsists of 1000 positive and 1000 negative reviews in each category. The text of each review\nwas converted into a binary feature vector encoding the presence or absence of unigrams\nand bigrams. Following [27], the 5000 most frequent vocabulary entries were retained as\nfeatures for classi\ufb01cation. We then divided the data into 10 equal balanced folds, and\ntested our network with cross-validation, reporting the mean test error over all folds. ReLU\nactivation was used on this dataset in the context of unsupervised learning with denoising\nautoencoders to obtain sparse feature representations which were used for classi\ufb01cation. 
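As a concrete illustration of the feature construction described above, the following sketch builds binary presence/absence vectors over the most frequent unigrams and bigrams. The whitespace tokenizer and the use of per-review (document-level) counts for ranking are our simplifying assumptions, not details specified in the paper:

```python
from collections import Counter

def ngram_features(reviews, vocab_size=5000):
    """Binary presence/absence vectors over the most frequent unigrams and bigrams."""
    def grams(text):
        toks = text.lower().split()  # assumed whitespace tokenizer
        return toks + [' '.join(p) for p in zip(toks, toks[1:])]

    # Rank n-grams by the number of reviews they occur in (our assumption).
    counts = Counter(g for r in reviews for g in set(grams(r)))
    vocab = [g for g, _ in counts.most_common(vocab_size)]
    index = {g: i for i, g in enumerate(vocab)}

    vectors = []
    for r in reviews:
        v = [0] * len(vocab)
        for g in set(grams(r)):
            if g in index:
                v[index[g]] = 1  # presence/absence, not frequency
        vectors.append(v)
    return vectors, vocab
```

Each review thus becomes a fixed-length binary vector suitable as network input.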
We trained an LWTA-2 network with three layers of 500 blocks each in a supervised setting to directly classify each review as positive or negative using a 2-way softmax output layer. We obtained mean accuracies of Books: 80%, DVDs: 81.05%, Electronics: 84.45% and Kitchen: 85.8%, giving a mean accuracy of 82.82%, compared to 78.95% reported in [27] for denoising autoencoders using ReLU and unsupervised pre-training to find a good initialization.
Table 4: LWTA networks outperform sigmoid and ReLU activation in remembering dataset P1 after training on dataset P2.
Testing error on P1 | LWTA | Sigmoid | ReLU
After training on P1 | 1.30 ± 0.13% | 1.55 ± 0.20% | 1.38 ± 0.06%
After training on P2 | 6.12 ± 3.39% | 57.84 ± 1.13% | 16.63 ± 6.07%
6 Implicit long term memory
This section examines the effect of the LWTA architecture on catastrophic forgetting. That is, does the fact that the network implements multiple models allow it to retain information about dataset A, even after being trained on a different dataset B? To test for this implicit long term memory, the MNIST training and test sets were each divided into two parts: P1, containing only the digits {0, 1, 2, 3, 4}, and P2, consisting of the remaining digits {5, 6, 7, 8, 9}. Three different network architectures were compared: (1) three LWTA layers, each with 500 blocks of size 2; (2) three layers, each with 1000 sigmoidal neurons; and (3) three layers, each with 1000 ReLU neurons. All networks have a 5-way softmax output layer representing the probability of an example belonging to each of the five classes. All networks were initialized with the same parameters, and trained with a fixed learning rate and momentum.
Each network was first trained to reach a 0.03 log-likelihood error on the P1 training set. This value was chosen heuristically to produce low test set errors in reasonable time for all three network types. 
The weights for the output layer (corresponding to the softmax classifier) were then stored, and the network was trained further, starting with new random output layer weights, to reach the same log-likelihood value on P2. Finally, the output layer weights saved from P1 were restored, and the network was evaluated on the P1 test set. The experiment was repeated for 10 different initializations.
Table 4 shows that the LWTA network remembers what was learned from P1 much better than the sigmoid and ReLU networks; it is also notable that the sigmoid network performs much worse than both LWTA and ReLU. While the test error values depend on the learning rate and momentum used, LWTA networks tended to remember better than the ReLU network by about a factor of two in most cases, and sigmoid networks always performed much worse. Although standard network architectures are known to suffer from catastrophic forgetting, we show here, for the first time, that ReLU networks are actually quite good in this regard, and moreover, that they are outperformed by LWTA. We expect this behavior to manifest itself in competitive models in general, and to become more pronounced with increasingly complex datasets. The neurons encoding specific features in one dataset are not affected much during training on another dataset, whereas neurons encoding common features can be reused. Thus, LWTA may be a step towards models that do not forget easily.
7 Analysis of subnetworks
A network with a single LWTA-$m$ layer of $N$ blocks consists of $m^N$ subnetworks which can be selected and trained for individual examples while training over a dataset. 
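Because each block contributes exactly one winner, the subnetwork active for a given input can be summarized by the tuple of per-block winner indices; a small sketch (our illustration, not the paper's code):

```python
import numpy as np

def subnetwork_id(h, block_size=2):
    """Return the active subnetwork as the tuple of winner indices, one per block."""
    winners = h.reshape(-1, block_size).argmax(axis=1)
    return tuple(int(j) for j in winners)

# A single layer of N = 3 blocks of size m = 2 has m**N = 8 possible subnetworks.
h = np.array([0.3, -0.1, 0.7, 0.9, -2.0, -1.5])
print(subnetwork_id(h))  # -> (0, 1, 1)
```

Recording these tuples over a dataset makes it possible to compare which neurons different input classes activate.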
After training, we expect the subnetworks consisting of the neurons active for examples from the same class to have more neurons in common than the subnetworks activated for different classes. In the case of relatively simple datasets like MNIST, it is possible to examine the number of common neurons between the mean subnetworks used for each class. To do this, we recorded which neurons in the layer were active for each example in a subset of 10,000 examples. For each class, the subnetwork consisting of the neurons active for at least 90% of the examples was designated the representative mean subnetwork, which was then compared to all other class subnetworks by counting the number of neurons in common.
Figure 3: (a) Each entry in the matrix denotes the fraction of neurons that a pair of MNIST digits has in common, on average, in the subnetworks that are most active for each of the two digit classes. (b) The fraction of neurons in common in the subnetworks of each of the 55 possible digit pairs, before and after training.
Figure 3a shows the fraction of neurons in common between the mean subnetworks of each pair of digits. Digits that are morphologically similar, such as "3" and "8", have subnetworks with more neurons in common than the subnetworks for digits "1" and "2" or "1" and "5", which are intuitively less similar. To verify that this subnetwork specialization is a result of training, we looked at the fraction of common neurons between all pairs of digits for the same 10,000 examples both before and after training (Figure 3b). 
Clearly, the subnetworks\nwere much more similar prior to training, and the full network has learned to partition its\nparameters to re\ufb02ect the structure of the data.\n8 Conclusion and future research directions\nOur LWTA networks automatically self-modularize into multiple parameter-sharing sub-\nnetworks responding to di\ufb00erent input representations. Without signi\ufb01cant degradation of\nstate-of-the-art results on digit recognition and sentiment analysis, LWTA networks also\navoid catastrophic forgetting, thus retaining useful representations of one set of inputs even\nafter being trained to classify another. This has implications for continual learning agents\nthat should not forget representations of parts of their environment when being exposed to\nother parts. We hope to explore many promising applications of these ideas in the future.\nAcknowledgments\nThis research was funded by EU projects WAY (FP7-ICT-288551), NeuralDynamics (FP7-\nICT-270247), and NASCENCE (FP7-ICT-317662); additional funding from ArcelorMittal.\nReferences\n[1] Per Anderson, Gary N. Gross, Terje L\u00f8mo, and Ola Sveen. Participation of inhibitory and\nexcitatory interneurones in the control of hippocampal cortical output. In Mary A.B. Brazier,\neditor, The Interneuron, volume 11. University of California Press, Los Angeles, 1969.\n\n[2] John Carew Eccles, Masao Ito, and J\u00e1nos Szent\u00e1gothai. The cerebellum as a neuronal machine.\n\nSpringer-Verlag New York, 1967.\n\n[3] Costas Stefanis. Interneuronal mechanisms in the cortex. In Mary A.B. Brazier, editor, The\n\nInterneuron, volume 11. University of California Press, Los Angeles, 1969.\n\n[4] Stephen Grossberg. Contour enhancement, short-term memory, and constancies in reverber-\n\nating neural networks. Studies in Applied Mathematics, 52:213\u2013257, 1973.\n\n[5] Michael McCloskey and Neal J. Cohen. Catastrophic interference in connectionist networks:\nThe sequential learning problem. 
The Psychology of Learning and Motivation, 24:109-164, 1989.
[6] Gail A. Carpenter and Stephen Grossberg. The art of adaptive pattern recognition by a self-organising neural network. Computer, 21(3):77-88, 1988.
[7] Mark B. Ring. Continual Learning in Reinforcement Environments. PhD thesis, Department of Computer Sciences, The University of Texas at Austin, Austin, Texas 78712, August 1994.
[8] Samuel A. Ellias and Stephen Grossberg. Pattern formation, contrast control, and oscillations in the short term memory of shunting on-center off-surround networks. Biological Cybernetics, 1975.
[9] Brad Ermentrout. Complex dynamics in winner-take-all neural nets with slow inhibition. Neural Networks, 5(1):415-431, 1992.
[10] Christoph von der Malsburg. Self-organization of orientation sensitive cells in the striate cortex. Kybernetik, 14(2):85-100, December 1973.
[11] Teuvo Kohonen. Self-organized formation of topologically correct feature maps. Biological Cybernetics, 43(1):59-69, 1982.
[12] Risto Miikkulainen, James A. Bednar, Yoonsuck Choe, and Joseph Sirosh. Computational maps in the visual cortex. Springer Science+Business Media, 2005.
[13] Dale K. Lee, Laurent Itti, Christof Koch, and Jochen Braun. Attention activates winner-take-all competition among visual filters. Nature Neuroscience, 2(4):375-81, April 1999.
[14] Matthias Oster and Shih-Chii Liu. Spiking inputs to a winner-take-all network. In Proceedings of NIPS, volume 18, 2006.
[15] John P. Lazzaro, Sylvie Ryckebusch, Misha Anne Mahowald, and Carver A. Mead. Winner-take-all networks of O(n) complexity. Technical report, 1988.
[16] Giacomo Indiveri. 
Modeling selective attention using a neuromorphic analog VLSI device. Neural Computation, 12(12):2857-2880, 2000.
[17] Wolfgang Maass. Neural computation with winner-take-all as the only nonlinear operation. In Proceedings of NIPS, volume 12, 1999.
[18] Wolfgang Maass. On the computational power of winner-take-all. Neural Computation, 12:2519-2535, 2000.
[19] Ian J. Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio. Maxout networks. In Proceedings of the ICML, 2013.
[20] Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors, 2012. arXiv:1207.0580.
[21] Juergen Schmidhuber. A local learning algorithm for dynamic feedforward and recurrent networks. Connection Science, 1(4):403-412, 1989.
[22] Rupesh K. Srivastava, Bas R. Steunebrink, and Juergen Schmidhuber. First experiments with PowerPlay. Neural Networks, 2013.
[23] Maximilian Riesenhuber and Tomaso Poggio. Hierarchical models of object recognition in cortex. Nature Neuroscience, 2(11), 1999.
[24] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Proceedings of NIPS, pages 1-9, 2012.
[25] Dan Ciresan, Ueli Meier, and Jürgen Schmidhuber. Multi-column deep neural networks for image classification. In Proceedings of the CVPR, 2012.
[26] Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the ICML, 2010.
[27] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier networks. In AISTATS, volume 15, pages 315-323, 2011.
[28] George E. Dahl, Tara N. Sainath, and Geoffrey E. Hinton. Improving Deep Neural Networks for LVCSR using Rectified Linear Units and Dropout. 
In Proceedings of ICASSP, 2013.\n\n[29] Andrew L. Maas, Awni Y. Hannun, and Andrew Y. Ng. Recti\ufb01er nonlinearities improve neural\n\nnetwork acoustic models. In Proceedings of the ICML, 2013.\n\n[30] Tijmen Tieleman. Gnumpy: an easy way to use GPU boards in Python. Department of\n\nComputer Science, University of Toronto, 2010.\n\n[31] Volodymyr Mnih. CUDAMat: a CUDA-based matrix class for Python. Department of Com-\n\nputer Science, University of Toronto, Tech. Rep. UTML TR, 4, 2009.\n\n[32] Patrice Y. Simard, Dave Steinkraus, and John C. Platt. Best practices for convolutional\nneural networks applied to visual document analysis. In International Conference on Document\nAnalysis and Recognition (ICDAR), 2003.\n\n[33] Yann LeCun, Leon Bottou, Yoshua Bengio, and Patrick Ha\ufb00ner. Gradient-based learning\n\napplied to document recognition. Proceedings of the IEEE, 1998.\n\n[34] Marc\u2019Aurelio Ranzato, Christopher Poultney, Sumit Chopra, and Yann LeCun. E\ufb03cient learn-\n\ning of sparse representations with an energy-based model. In Proceedings of NIPS, 2007.\n\n[35] Matthew D. Zeiler and Rob Fergus. Stochastic pooling for regularization of deep convolutional\n\nneural networks. In Proceedings of the ICLR, 2013.\n\n[36] Kevin Jarrett, Koray Kavukcuoglu, Marc\u2019Aurelio Ranzato, and Yann LeCun. What is the best\nmulti-stage architecture for object recognition? In Proc. of the ICCV, pages 2146\u20132153, 2009.\n[37] John Blitzer, Mark Dredze, and Fernando Pereira. Biographies, bollywood, boom-boxes and\n\nblenders: Domain adaptation for sentiment classi\ufb01cation. Annual Meeting-ACL, 2007.\n\n[38] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Domain adaptation for large-scale senti-\nment classi\ufb01cation: A deep learning approach. 
In Proceedings of the ICML, 2011.\n", "award": [], "sourceid": 1109, "authors": [{"given_name": "Rupesh", "family_name": "Srivastava", "institution": "IDSIA"}, {"given_name": "Jonathan", "family_name": "Masci", "institution": "IDSIA"}, {"given_name": "Sohrob", "family_name": "Kazerounian", "institution": "IDSIA"}, {"given_name": "Faustino", "family_name": "Gomez", "institution": "IDSIA"}, {"given_name": "J\u00fcrgen", "family_name": "Schmidhuber", "institution": "IDSIA"}]}