{"title": "Slow, Decorrelated Features for Pretraining Complex Cell-like Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 99, "page_last": 107, "abstract": "We introduce a new type of neural network activation function based on recent physiological rate models for complex cells in visual area V1. A single-hidden-layer neural network of this kind of model achieves 1.5% error on MNIST. We also introduce an existing criterion for learning slow, decorrelated features as a pretraining strategy for image models. This pretraining strategy results in orientation-selective features, similar to the receptive fields of complex cells. With this pretraining, the same single-hidden-layer model achieves better generalization error, even though the pretraining sample distribution is very different from the fine-tuning distribution. To implement this pretraining strategy, we derive a fast algorithm for online learning of decorrelated features such that each iteration of the algorithm runs in linear time with respect to the number of features.", "full_text": "Slow, Decorrelated Features for\n\nPretraining Complex Cell-like Networks\n\nJames Bergstra\n\nUniversity of Montreal\n\njames.bergstra@umontreal.ca\n\nYoshua Bengio\n\nUniversity of Montreal\n\nyoshua.bengio@umontreal.ca\n\nAbstract\n\nWe introduce a new type of neural network activation function based on recent\nphysiological rate models for complex cells in visual area V1. A single-hidden-\nlayer neural network of this kind of model achieves 1.50% error on MNIST.\nWe also introduce an existing criterion for learning slow, decorrelated features\nas a pretraining strategy for image models. This pretraining strategy results in\norientation-selective features, similar to the receptive \ufb01elds of complex cells. 
With\nthis pretraining, the same single-hidden-layer model achieves 1.34% error, even\nthough the pretraining sample distribution is very different from the \ufb01ne-tuning\ndistribution. To implement this pretraining strategy, we derive a fast algorithm for\nonline learning of decorrelated features such that each iteration of the algorithm\nruns in linear time with respect to the number of features.\n\n1\n\nIntroduction\n\nVisual area V1 is the \ufb01rst area of cortex devoted to handling visual input in the human visual sys-\ntem (Dayan & Abbott, 2001). One convenient simpli\ufb01cation in the study of cell behaviour is to\nignore the timing of individual spikes, and to look instead at their frequency. Some cells in V1\nare described well by a linear \ufb01lter that has been recti\ufb01ed to be non-negative and perhaps bounded.\nThese so-called simple cells are similar to sigmoidal activation functions: their activity (\ufb01ring fre-\nquency) is greater as an image stimulus looks more like some particular linear \ufb01lter. However, these\nsimple cells are a minority in visual area V1 and the characterization of the remaining cells there\n(and even beyond in visual areas V2, V4, MT, and so on) is a very active area of ongoing research.\nComplex cells are the next-simplest kind of cell. They are characterized by an ability to respond to\nnarrow bars of light with particular orientations in some region (translation invariance) but to turn off\nwhen all those overlapping bars are presented at once. This non-linear response has been modeled\nby quadrature pairs (Adelson & Bergen, 1985; Dayan & Abbott, 2001): pairs of linear \ufb01lters with\nthe property that the sum of their squared responses is constant for an input image with particular\nspatial frequency and orientation (i.e. edges). It has also been modeled by max-pooling across two\nor more linear \ufb01lters (Riesenhuber & Poggio, 1999). 
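The quadrature-pair energy model described above is easy to illustrate numerically. The following is an illustrative sketch, not code from the paper: the 1-D Gabor pair, its frequency and envelope width, and the grating stimuli are all our own assumptions.

```python
import numpy as np

# A 1-D quadrature pair in the style of Adelson & Bergen (1985): two Gabor
# filters 90 degrees out of phase. Their squared-and-summed response is
# (nearly) invariant to the spatial phase of a matched oriented grating.
n = 64
pos = np.arange(n) - n / 2
freq = 0.2                                    # cycles per sample (assumed)
envelope = np.exp(-pos**2 / (2 * 8.0**2))     # Gaussian envelope (assumed width)
u = envelope * np.cos(2 * np.pi * freq * pos)  # even-phase filter
v = envelope * np.sin(2 * np.pi * freq * pos)  # odd-phase filter (90 deg offset)

def complex_cell(x):
    # energy-model response: sqrt((u.x)^2 + (v.x)^2)
    return np.sqrt(np.dot(u, x) ** 2 + np.dot(v, x) ** 2)

# Responses to gratings of matched frequency but shifted phase are nearly equal:
phases = np.linspace(0, 2 * np.pi, 8, endpoint=False)
responses = np.array([complex_cell(np.cos(2 * np.pi * freq * pos + p))
                      for p in phases])
```

A single linear filter's response would swing with the grating phase; the paired energy response stays essentially constant, which is the translation invariance attributed to complex cells.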
More recently, it has been argued that V1 cells\nexhibit a range of behaviour that blurs distinctions between simple and complex cells and between\nenergy models and max-pooling models (Rust et al., 2005; Kouh & Poggio, 2008; Finn & Ferster,\n2007).\nAnother theme in neural modeling is that cells do not react to single images, they react to image\nsequences. It is a gross approximation to suppose that each cell implements a function from image\nto activity level. Furthermore, the temporal sequence of images in a video sequence contains a lot\nof information about the invariances that we would like our models to learn. Throwing away that\ntemporal structure makes learning about objects from images much more dif\ufb01cult. The principle\nof identifying slowly moving/changing factors in temporal/spatial data has been investigated by\nmany (Becker & Hinton, 1993; Wiskott & Sejnowski, 2002; Hurri & Hyv\u00a8arinen, 2003; K\u00a8ording\net al., 2004; Cadieu & Olshausen, 2009) as a principle for \ufb01nding useful representations of images,\n\n1\n\n\fand as an explanation for why V1 simple and complex cells behave the way they do. A good\noverview can be found in (Berkes & Wiskott, 2005).\nThis work follows the pattern of initializing neural networks with unsupervised learning (pretrain-\ning) before \ufb01ne-tuning with a supervised learning criterion. Supervised gradient descent explores the\nparameter space suf\ufb01ciently to get low training error on smaller training sets (tens of thousands of\nexamples, like MNIST). However, models that have been pretrained with appropriate unsupervised\nlearning procedures (such as RBMs and various forms of auto-encoders) generalize better (Hinton\net al., 2006; Larochelle et al., 2007; Lee et al., 2008; Ranzato et al., 2008; Vincent et al., 2008).\nSee Bengio (2009) for a comprehensive review and Erhan et al. (2009) for a thorough experimental\nanalysis of the improvements obtained. 
It appears that unsupervised pretraining guides the learning\ndynamics in better regions of parameter space associated with basins of attraction of the supervised\ngradient procedure corresponding to local minima with lower generalization error, even for very\nlarge training sets (unlike other regularizers whose effects tend to quickly vanish on large training\nsets) with millions of examples.\nRecent work in the pretraining of neural networks has taken a generative modeling perspective. For\nexample, the Restricted Boltzmann Machine is an undirected graphical model, and training it (by\nmaximum likelihood) as such has been demonstrated to also be a good initialization. However, it is\nan interesting open question whether a better generative model is necessarily (or even typically) a\nbetter point of departure for \ufb01ne-tuning. Contrastive divergence (CD) is not maximum likelihood,\nand works just \ufb01ne as pretraining. Reconstruction error is an even poorer approximation of the\nmaximum likelihood gradient, and sometimes works better than CD (with additional twists like\nsparsity or the denoising of (Vincent et al., 2008)).\nThe temporal coherence and decorrelation criterion is an alternative to training generative models\nsuch as RBMs or auto-encoder variants. Recently (Mobahi et al., 2009) demonstrated that a slowness\ncriterion regularizing the top-most internal layer of a deep convolutional network during supervised\nlearning helps their model to generalize better. Our model is similar in spirit to pre-training with\nthe semi-supervised embedding criterion at each level (Weston et al., 2008; Mobahi et al., 2009),\nbut differs in the use of decorrelation as a mechanism for preventing trivial solutions to a slowness\ncriterion. 
Whereas RBMs and denoising autoencoders are defined for general input distributions, the temporal coherence and decorrelation criterion makes sense only in the context of data with slowly-changing temporal or spatial structure, such as images, video, and sound.\nIn the same way that simple cell models were the inspiration for sigmoidal activation units in artificial neural networks (and were in turn validated by them), we investigate the value of complex cell models in artificial neural network classifiers. This paper builds on that line of work by showing that the principle of temporal coherence is useful for finding initial conditions for the hidden layer of a neural network that bias it towards better generalization in object recognition. We introduce temporal coherence and decorrelation as a pretraining algorithm. Hidden units are initialized so that they are invariant to irrelevant transformations of the image, and sensitive to relevant ones. In order for this criterion to be useful in the context of large models, we derive a fast online algorithm for decorrelating units and maximizing temporal coherence.\n\n2 Algorithm\n\n2.1 Slow, decorrelated feature learning algorithm\n\n(Körding et al., 2004) introduced a principle (and training criterion) to explain the formation of complex cell receptive fields. They based their analysis on the complex-cell model of (Adelson & Bergen, 1985), which describes a complex cell as a pair of half-rectified linear filters whose outputs are squared and added together and then a square root is applied to that sum. Suppose x is an input image and we have F complex cells h_1, ..., h_F such that h_i = sqrt((u_i · x)² + (v_i · x)²). (Körding et al., 2004) showed that by minimizing the following cost,\n\nL_K2004 = α Σ_{i≠j} Cov_t(h_i, h_j)² / (Var(h_i) Var(h_j)) + Σ_i Σ_t (h_{i,t} − h_{i,t−1})² / Var(h_i)    (1)\n\nover consecutive natural movie frames (with respect to model parameters), the filters u_i and v_i of each complex cell form local Gabor filters whose phases are offset by about 90 degrees, like the sine and cosine curves that implement a Fourier transform.\nThe criterion in Equation 1 requires a batch minimization algorithm because of the variance and covariance statistics that must be collected. This makes the criterion too slow for use with large datasets. At the same time, the size of the covariance matrix is quadratic in the number of features, so it is computationally expensive (perhaps prohibitively) to apply the criterion to train large numbers of features.\n\n2.1.1 Online Stochastic Estimation of Covariance\n\nThis section presents an algorithm for approximately minimizing L_K2004 using an online algorithm whose iterations run in linear time with respect to the number of features. One way to apply the criterion to large or infinite datasets is by estimating the covariance (and variance) from consecutive minibatches of N movie frames. Then the cost can be minimized by stochastic gradient descent. We used an exponentially-decaying moving average to track the mean of each feature over time.\n\nh̄_i(t) = ρ h̄_i(t − 1) + (1 − ρ) h_i(t)\n\nFor good results, ρ should be chosen so that the estimates change very slowly.
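The exponentially-decaying moving average above is straightforward to implement. A minimal sketch (the `update_mean` helper and the synthetic feature stream are our own, not the paper's):

```python
import numpy as np

rho = 1.0 - 5.0e-5  # very slow decay, matching the value reported in the text

def update_mean(h_bar, h):
    # one online update of the exponentially-decaying moving average
    return rho * h_bar + (1.0 - rho) * h

rng = np.random.RandomState(0)
h_bar = np.zeros(3)                    # running means for 3 features
for h in rng.randn(1000, 3) + 5.0:     # synthetic feature stream with mean 5
    h_bar = update_mean(h_bar, h)
# after 1000 frames the estimate has moved only ~5% of the way toward the
# true mean, illustrating how slowly it adapts with rho this close to 1
```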
We used a value of 1.0 − 5.0 × 10^−5.\nThen we estimated the variance of each feature over a minibatch like this:\n\nVar(h_i) ≈ (1 / (N − 1)) Σ_{τ=t}^{t+N−1} (h_i(τ) − h̄_i(τ))²\n\nWith this mean and variance, we computed normalized features for each minibatch:\n\nz_i(t) = (h_i(t) − h̄_i(t)) / sqrt(Var(h_i) + 10^−10)\n\nLetting Z denote an F × N matrix with N columns of F normalized feature values, we estimate the correlation between features h_i by the covariance in these normalized features: C(t) = (1/N) Z(t)Z(t)'. We can now write down L(t), a minibatch-wise approximation to Eq. 1:\n\nL(t) = α Σ_{i≠j} C_{ij}(t)² + Σ_{τ=0}^{N−1} Σ_i (z_i(t + τ) − z_i(t + τ − 1))²    (2)\n\nThe time complexity of evaluating L(t) from Z using this expression is O(F²N + NF). In practice we use small minibatches and our model has lots of features, so the fact that the time complexity of the algorithm is quadratic in F is troublesome.\nThere is, however, a way to compute this value exactly in time linear in F. The key observation is that the sum of the squared elements of C can be computed from the N × N Gram matrix G(t) = Z(t)'Z(t):\n\nΣ_{i=1}^{F} Σ_{j=1}^{F} C_{ij}(t)² = Tr(C(t)C(t)) = (1/N²) Tr(Z(t)Z(t)'Z(t)Z(t)') = (1/N²) Tr(Z(t)'Z(t)Z(t)'Z(t)) = (1/N²) Tr(G(t)G(t)) = (1/N²) Tr(G(t)G(t)') = (1/N²) Σ_{k=1}^{N} Σ_{l=1}^{N} G_{kl}(t)² = (1/N²) |Z(t)'Z(t)|²\n\nSubtracting the C_{ii}² terms from the sum of all squared elements lets us rewrite Equation 2 in a way that suggests the linear-time implementation.\n\nL(t) = (α/N²) ( |Z(t)'Z(t)|² − Σ_{i=1}^{F} ( Σ_{τ=1}^{N} z_i(τ)² )² ) + (1/(N − 1)) Σ_{τ=1}^{N−1} Σ_{i=1}^{F} (z_i(τ) − z_i(τ − 1))²    (3)\n\nThe time complexity of computing L(t) using Equation 3 from Z(t) is O(N²F).
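The trace identity above is easy to check numerically. The sketch below is our own (the choices F = 300 and N = 40 are arbitrary); it compares the direct O(F²N) computation of the sum of squared correlations against the O(N²F) route through the Gram matrix:

```python
import numpy as np

rng = np.random.RandomState(0)
F, N = 300, 40                       # many features, small minibatch (N << F)
Z = rng.randn(F, N)                  # stand-in for normalized features

C = Z @ Z.T / N                      # F x F covariance estimate: O(F^2 N) to form
direct = np.sum(C ** 2)              # sum of all squared entries of C

G = Z.T @ Z                          # N x N Gram matrix: O(N^2 F) to form
via_gram = np.sum(G ** 2) / N ** 2   # (1/N^2) Tr(G G), since G is symmetric

print(np.allclose(direct, via_gram))  # True: identical up to rounding
```

Both quantities are the squared Frobenius norm of the same product, so they agree to floating-point precision while the Gram route never materializes the F × F matrix.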
The sum of squared correlations is still the most expensive term, but for the case where N << F, this expression makes L(t)'s computation linear in F. Considering that each iteration treats N training examples, the per-training-example cost of this algorithm can be seen as O(NF). In implementation, an additional factor of two in runtime can be obtained by only computing half of the Gram matrix G, which is symmetric.\n\n2.2 Complex-cell activation function\n\nRecently, (Rust et al., 2005) have argued that existing models, such as that of (Adelson & Bergen, 1985), cannot account for the variety of behaviour found in visual area V1. Some complex cells behave like simple cells to some extent and vice versa; there is a continuous range of simple to complex cells. They put forward a similar but more involved expression that can capture the simple and complex cells as special cases, but ultimately parameterizes a larger class of cell-response functions (Eq. 4).\n\na + β ( (max(0, wx)² + Σ_{i=1}^{I} (u(i)x)²)^ζ − δ (Σ_{j=1}^{J} (v(j)x)²)^ζ ) / ( 1 + γ (max(0, wx)² + Σ_{i=1}^{I} (u(i)x)²)^ζ + ε (Σ_{j=1}^{J} (v(j)x)²)^ζ )    (4)\n\nThe numerator in Eq. 4 describes the difference between an excitation term and a shunting inhibition term. The denominator acts to normalize this difference. Parameters w, u(i), v(j) have the same shape as the input image x, and can be thought of as image filters like the first layer of a neural network or the codebook of a sparse-coding model. The parameters a, β, δ, γ, ε, ζ are scalars that control the range and shape of the activation function, given all the filter responses.
The numbers I and J of quadratic filters required to explain a particular cellular response were on the order of 2-16.\nWe introduce the approximation in Equation 5 because it is easier to learn by gradient descent. We replaced the max operation with a softplus(x) = log(1 + e^x) function so that there is always a gradient on w and b, even when wx + b is negative. We fixed the scalar parameters to prevent the system from entering regimes of extreme non-linearity. We fixed β, δ, γ, ε to 1, and a to 0. We chose to fix the exponent ζ to 0.5 because (Rust et al., 2005) found that values close to 0.5 offered good fits to cell firing-rate data. Future work might look at choosing these constants in a principled way or adapting them; we found that these values worked well. The range of this activation function (as a function of x) is a connected set on the (−1, 1) interval. However, the whole (−1, 1) range is not always available, depending on the parameters. If the inhibition term is always 0, for example, then the activation function will be non-negative.\n\n( sqrt(log(1 + e^{wx+b})² + Σ_{i=1}^{I} (u(i)x)²) − sqrt(Σ_{j=1}^{J} (v(j)x)²) ) / ( 1.0 + sqrt(log(1 + e^{wx+b})² + Σ_{i=1}^{I} (u(i)x)²) + sqrt(Σ_{j=1}^{J} (v(j)x)²) )    (5)\n\n3 Results\n\nClassification results were obtained by adding a logistic regression model on top of the features learned, and treating the resulting model as a single-hidden-layer neural network. The weights of the logistic regression were always initialized to zero.\nAll work was done on 28x28 images (MNIST-sized), using a model with 300 hidden units. Each hidden unit had one linear filter w, a bias b, two quadratic excitatory filters u1, u2 and two quadratic inhibitory filters v1, v2.
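A direct transcription of the activation in Equation 5, with the fixed scalars β = δ = γ = ε = 1, a = 0, ζ = 0.5, might look as follows. This is our own sketch: the function name, array shapes, and the small random weights in the usage example are assumptions, not the paper's code.

```python
import numpy as np

def softplus(a):
    # smooth replacement for max(0, a); always has a nonzero gradient
    return np.log1p(np.exp(a))

def complex_cell_unit(x, w, b, U, V):
    # Equation 5 with beta = delta = gamma = epsilon = 1, a = 0, zeta = 0.5.
    # x: flattened image (D,); w: linear filter (D,); b: scalar bias;
    # U: (I, D) excitatory quadratic filters; V: (J, D) inhibitory filters.
    excite = np.sqrt(softplus(w @ x + b) ** 2 + np.sum((U @ x) ** 2))
    inhibit = np.sqrt(np.sum((V @ x) ** 2))
    return (excite - inhibit) / (1.0 + excite + inhibit)

rng = np.random.RandomState(0)
D = 784                                 # flattened 28x28 image
x = rng.randn(D)
w, b = 0.01 * rng.randn(D), 0.0
U, V = 0.01 * rng.randn(2, D), 0.01 * rng.randn(2, D)
y = complex_cell_unit(x, w, b, U, V)    # always lies in (-1, 1)
```

Because |excite − inhibit| < 1 + excite + inhibit, the output is structurally confined to (−1, 1), and with the inhibition filters at zero it is non-negative, matching the range discussion above.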
The computational cost of evaluating each unit was thus \ufb01ve times the cost\nof evaluating a normal sigmoidal activation function of the form tanh(w0x + b).\n\n4\n\n\f3.1 Random initialization\n\nAs a baseline, our model parameters were initialized to small random weights and used as the hidden\nlayer of a neural network. Training this randomly-initialized model by stochastic gradient descent\nyielded test-set performance of 1.56% on MNIST.\nThe \ufb01lters learned by this procedure looked somewhat noisy for the most part, but had low-frequency\ntrends. For example, some of the quadratic \ufb01lters had small local Gabor-like \ufb01lters. We believe that\nthese phase-offset pairs of Gabor-like functions allow the units to implement some shift-invariant\nresponse to edges with a speci\ufb01c orientation (Fig. 1).\n\nFigure 1: Four of the three hundred activation functions learned by training our model from random\ninitialization to perform classi\ufb01cation. Top row: the red and blue channels are the two quadratic\n\ufb01lters of the excitation term. Bottom row: the red and blue channels are the two quadratic \ufb01lters of\nthe shunting inhibition term. Training approximately yields locally orientation-selective edge \ufb01lters,\nopposite-orientation edges are inhibitory.\n\n3.2 Pretraining with natural movies\n\nUnder the hypothesis that the matched Gabor functions (see Fig. 1) allowed our model to generalize\nbetter across slight translations of the image, we appealed to a pretraining process to initialize our\nmodel with values better than random noise.\nWe pretrained the hidden layer according to the online version of the cost in Eq. 3, using movies\n(MIXED-movies) made by sliding a 28 x 28 pixel window across large photographs. Each of these\nmovies was short (just four frames long) and ten movies were used in each minibatch (N = 40). The\nsliding speed was sampled uniformly between 0.5 and 2 pixels per frame. 
The sliding direction was sampled uniformly from 0 to 2π. The sliding initial position was sampled uniformly from image coordinates. Any sampled movie that slid off of the underlying image was rejected. We used two photographs to generate the movies. The first photograph was a grey-scale forest scene (resolution 1744x1308). The second photograph was a tiling of 100x100 MNIST digits (resolution 2800x2800). As a result of this procedure, digits are not at all centered in MIXED-movies: there might be part of a '3' in the upper-left part of a frame, and part of a '7' in the lower right.\nThe shunting inhibition filters (v1, v2) learned after five hundred thousand movies (fifty thousand iterations of stochastic gradient descent) are shown in Figure 2. The units learn to implement orientation-selective, shift-invariant filters at different spatial frequencies. The filters shown in Figure 2 have fairly global receptive fields, but smaller, more local receptive fields were obtained by applying L1 weight-penalization during pretraining. The α parameter that balances decorrelation and slowness was chosen manually on the basis of the trained filters. We were looking for a diversity of filters with relatively low spatial frequency. The excitatory filters learned similar Gabor pairs, but the receptive fields tended to be both smaller (more localized) and lower-frequency. Fine-tuning this pre-trained model with a learning rate of 0.003 and L1 weight decay of 10^−5 yielded a test error rate of 1.34% on MNIST.\n\n3.3 Pretraining with MNIST movies\n\nWe also tried pretraining with videos whose frames follow a similar distribution to the images used for fine-tuning and testing. We created MNIST movies by sampling an image from the training set and moving it around (translating it) according to a Brownian motion.
The initial velocity was sampled from a zero-mean normal distribution with std-deviation 0.2. Changes in that velocity between each frame were sampled from a zero-mean normal distribution with std-deviation 0.2. Furthermore, the digit image in each frame was modified according to a randomly chosen elastic deformation, as in (Loosli et al., 2007). As before, movies of four frames were created in this way and training was conducted on minibatches of ten movies (N = 4 × 10 = 40). Unlike the MNIST frames in MIXED-movies, the frames of MNIST-movies contain a single digit that is approximately centered.\n\nFigure 2: Filters from some of the units of the model, pretrained on small sliding image patches from two large images. The features learn to be direction-selective for moving edges by approximately implementing windowed Fourier transforms. These features have global receptive fields, but become more local when an L1 weight penalization is applied during pretraining. Excitatory filters looked similar, but tended to be more localized and with lower spatial frequency (fewer, shorter, broader stripes). Columns of the figure are arranged in triples: linear filter w in grey, u(1), u(2) in red and green, v(1), v(2) in blue and green.\n\nThe activation functions learned by minimizing Equation 3 on these MNIST movies were qualitatively different from the activation functions learned from the MIXED movies. The inhibitory weights (v1, v2) learned from MNIST movies are shown in Figure 3. Once again, the inhibitory weights exhibit the narrow red and green stripes that indicate edge-orientation selectivity. But this time they are not parallel straight stripes; they follow contours that are adapted to digit edges. The excitation filters u1, u2 were also qualitatively different. Instead of forming localized Gabor pairs, some formed large smooth blob-like shapes but most converged toward zero.
Fine-tuning this pre-trained model with a learning rate of 0.003 and L1 weight decay of 10^−5 yielded a test error rate of 1.37% on MNIST.\n\nFigure 3: Filters of our model, pretrained on movies of centered MNIST training images subjected to Brownian translation. The features learn to be direction-selective for moving edges by approximately implementing windowed Fourier transforms. The filters are tuned to the higher spatial frequency in MNIST digits, as compared with the natural scene. Columns of the figure are arranged in triples: linear filter w in grey, u(1), u(2) in red and green, v(1), v(2) in blue and green.\n\nTable 1: Generalization error (% error) from 100 labeled MNIST examples after pretraining on MIXED-movies and MNIST-movies.\n\nPre-training Dataset | 0 | 1 | 2 | 3 | 4 | 5   (number of pretraining iterations, ×10^4)\nMIXED-movies | 23.1 | 21.2 | 20.8 | 20.8 | 20.6 | 20.6\nMNIST-movies | 23.1 | 19.0 | 18.7 | 18.8 | 18.4 | 18.6\n\n4 Discussion\n\nThe results on MNIST compare well with many results in the literature. A single-hidden-layer neural network of sigmoidal units can achieve 1.8% error by training from random initial conditions, and our model achieves 1.5% from random initial conditions. A single-hidden-layer sigmoidal neural network pretrained as a denoising auto-encoder (and then fine-tuned) can achieve 1.4% error on average, and our model achieves 1.34% error across many different fine-tuned models (Erhan et al., 2009). Gaussian SVMs trained just on the original MNIST data achieve 1.4%; our pretraining strategy allows our single-layer model to be better than Gaussian SVMs (Decoste & Schölkopf, 2002). Deep learning algorithms based on denoising auto-encoders and RBMs are typically able to achieve slightly lower scores in the range of 1.2–1.3% (Hinton et al., 2006; Erhan et al., 2009).
The best convolutional architectures and models that have access to enriched datasets for fine-tuning can achieve classification error rates under 0.4% (Ranzato et al., 2007). In future work, we will explore strategies for combining these methods with our decorrelation criterion to train deep networks of models with quadratic input interactions. We will also look at comparative performance on a wider variety of tasks.\n\n4.1 Transfer learning, the value of pretraining\n\nTo evaluate our unsupervised criterion of slow, decorrelated features as a pretraining step for classification by a neural network, we fine-tuned the weights obtained after ten, twenty, thirty, forty, and fifty thousand iterations of unsupervised learning. We used only a small subset (the first 100 training examples) from the MNIST data to magnify the importance of pre-training. The results are listed in Table 1. Training from random initial weights led to 23.1% error. The value of pretraining is evident right away: after two unsupervised passes over the MNIST training data (100K movies and 10K iterations), the weights have been initialized better. Fine-tuning the weights learned on the MIXED-movies led to a test error rate of 21.2%, and fine-tuning the weights learned on the MNIST-movies led to a test error rate of 19.0%. Further pretraining offers a diminishing marginal return, although after ten unsupervised passes through the training data (500K movies) there is no evidence of over-pretraining. The best score (20.6%) on MIXED-movies occurs at both eight and ten unsupervised passes, and the best score on MNIST-movies (18.4%) occurs after eight. A larger test set would be required to make a strong conclusion about a downward trend in test set scores for larger numbers of pretraining iterations.
The results with MNIST-movies pretraining are slightly better than with MIXED-movies, but these results suggest strong transfer learning: videos featuring digits in random locations and natural image patches are almost as good for pretraining as videos featuring images very similar to those in the test set.\n\n4.2 Slowness in normalized features encourages binary activations\n\nSomewhat counter-intuitively, the slowness criterion requires movement in the features h. Suppose a feature hi has activation levels that are distributed around 0.1 and 0.2, but the activation at each frame of a movie is independent of previous frames. Since the feature has a small variance, the normalized feature zi will oscillate in the same way, but with unit variance. This will cause zi(t) − zi(t − 1) to be relatively high, so our slowness criterion will not be well satisfied. In this way the lack of variance in hi can actually make for a relatively fast normalized feature zi rather than a slow one.\nHowever, if hi has activation levels that are distributed around .1 and .2 for some image sequences and around .8 and .9 for other image sequences, the marginal variance in hi will be larger. The larger marginal variance will make the oscillations between .1 and .2 lead to much smaller changes in the normalized feature zi(t). In this sense, the slowness objective can be maximally satisfied by features hi(t) that take near-minimum and near-maximum values for most movies, and never transition from a near-minimum to a near-maximum value during a movie.\nWhen training on multiple short videos instead of one continuous one, it is possible for large changes in normalized-feature activation never (or rarely) to occur during a video.
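The argument above can be checked with a small simulation, which is entirely our own construction (movie counts, jitter ranges, and the bimodal offset are assumptions): the same within-movie jitter is penalized far less, after normalization, when a feature also varies strongly across movies.

```python
import numpy as np

rng = np.random.RandomState(0)
n_movies, frames = 250, 4

def within_movie_slowness(h):
    z = (h - h.mean()) / h.std()              # normalize by marginal statistics
    z = z.reshape(n_movies, frames)
    return np.mean(np.diff(z, axis=1) ** 2)   # only frame-to-frame changes within a movie

jitter = rng.uniform(0.1, 0.2, size=(n_movies, frames))
low_var = jitter.ravel()                      # always oscillates around .1-.2
modes = 0.7 * rng.binomial(1, 0.5, size=(n_movies, 1))
high_var = (jitter + modes).ravel()           # near-min in some movies, near-max in others

# identical raw jitter, but the bimodal feature pays a much smaller penalty
print(within_movie_slowness(low_var) / within_movie_slowness(high_var))  # >> 1
```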
Perhaps this is one of the roles of saccades in the visual system: to suspend the normal objective of temporal coherence during a rapid widespread change of activation levels.\n\n4.3 Eigenvalue interpretation of decorrelation term\n\nWhat does our unsupervised cost mean? One way of thinking about the decorrelation term (first term in Eq. 1), which helped us to design an efficient algorithm for computing it, is to think of it as flattening the eigen-spectrum of the correlation matrix of our features h (over time). It is helpful to rewrite this cost in terms of normalized features: z_i = (h_i − h̄_i)/σ_i, and to consider that we sum over all the elements of the correlation matrix including the diagonal.\n\nΣ_{i≠j} Cov_t(h_i, h_j)² / (Var(h_i) Var(h_j)) = 2 Σ_{i=1}^{F−1} Σ_{j=i+1}^{F} Cov_t(z_i, z_j)² = ( Σ_{i=1}^{F} Σ_{j=1}^{F} Cov_t(z_i, z_j)² ) − F\n\nIf we use C to denote the matrix whose i, j entry is Cov_t(z_i, z_j), and we use U'ΛU to denote the eigen-decomposition of C, then we can transform this sum over i ≠ j further.\n\n( Σ_{i=1}^{F} Σ_{j=1}^{F} Cov_t(z_i, z_j)² ) − F = Tr(C'C) − F = Tr(CC) − F = Tr(U'ΛU U'ΛU) − F = Tr(UU'ΛUU'Λ) − F = Σ_{k=1}^{F} Λ_k² − F\n\nWe can interpret the first term of Eq. 1 as penalizing the squared eigenvalues of the covariance matrix between features in a normalized feature space (z as opposed to h), or as minimizing the squared eigenvalues of the correlation matrix between features h.\n\n5 Conclusion\n\nWe have presented an activation function for use in neural networks that is a simplification of a recent rate model of visual area V1 complex cells.
This model learns shift-invariant, orientation-selective edge filters from purely supervised training on MNIST and achieves lower generalization error than conventional neural nets.\nTemporal coherence and decorrelation have been put forward as a principle for explaining the functional behaviour of visual area V1 complex cells. We have described an online algorithm for minimizing correlation that has linear time complexity in the number of hidden units. Pretraining our model with this unsupervised criterion yields even lower generalization error: better than Gaussian SVMs, and competitive with deep denoising auto-encoders and 3-layer deep belief networks. The good performance of our model, compared with poorer approximations of V1, encourages machine learning research inspired by neural information processing in the brain. It also helps to validate the corresponding computational neuroscience theories by showing that these neuron activations and unsupervised criteria have value in terms of learning.\n\nAcknowledgments\n\nThis research was performed thanks to funding from NSERC, MITACS, and the Canada Research Chairs.\n\nReferences\n\nAdelson, E. H., & Bergen, J. R. (1985). Spatiotemporal energy models for the perception of motion. Journal of the Optical Society of America, 2, 284–99.\nBecker, S., & Hinton, G. E. (1993). Learning mixture models of spatial coherence. Neural Computation, 5, 267–277.\nBengio, Y. (2009). Learning deep architectures for AI. Foundations and Trends in Machine Learning, to appear.\nBerkes, P., & Wiskott, L. (2005). Slow feature analysis yields a rich repertoire of complex cell properties. Journal of Vision, 5, 579–602.\nCadieu, C., & Olshausen, B. (2009). Learning transformational invariants from natural movies. In Advances in Neural Information Processing Systems 21 (NIPS'08), 209–216. MIT Press.\nDayan, P., & Abbott, L. F. (2001). Theoretical neuroscience.
The MIT Press.\nDecoste, D., & Schölkopf, B. (2002). Training invariant support vector machines. Machine Learning, 46, 161–190.\nErhan, D., Manzagol, P.-A., Bengio, Y., Bengio, S., & Vincent, P. (2009). The difficulty of training deep architectures and the effect of unsupervised pre-training. AISTATS'2009 (pp. 153–160). Clearwater (Florida), USA.\nFinn, I., & Ferster, D. (2007). Computational diversity in complex cells of cat primary visual cortex. Journal of Neuroscience, 27, 9638–48.\nHinton, G. E., Osindero, S., & Teh, Y. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18, 1527–1554.\nHurri, J., & Hyvärinen, A. (2003). Temporal coherence, natural image sequences, and the visual cortex. Advances in Neural Information Processing Systems 15 (NIPS'02) (pp. 141–148).\nKörding, K. P., Kayser, C., Einhäuser, W., & König, P. (2004). How are complex cell properties adapted to the statistics of natural stimuli? Journal of Neurophysiology, 91, 206–212.\nKouh, M. M., & Poggio, T. T. (2008). A canonical neural circuit for cortical nonlinear operations. Neural Computation, 20, 1427–1451.\nLarochelle, H., Erhan, D., Courville, A., Bergstra, J., & Bengio, Y. (2007). An empirical evaluation of deep architectures on problems with many factors of variation. ICML 2007 (pp. 473–480). Corvallis, OR: ACM.\nLee, H., Ekanadham, C., & Ng, A. (2008). Sparse deep belief net model for visual area V2. In Advances in Neural Information Processing Systems 20 (NIPS'07). Cambridge, MA: MIT Press.\nLoosli, G., Canu, S., & Bottou, L. (2007). Training invariant support vector machines using selective sampling. In L. Bottou, O. Chapelle, D. DeCoste and J. Weston (Eds.), Large scale kernel machines, 301–320. Cambridge, MA: MIT Press.\nMobahi, H., Collobert, R., & Weston, J. (2009). 
Deep learning from temporal coherence in video.\n\nICML 2009. ACM. To appear.\n\nRanzato, M., Boureau, Y., & LeCun, Y. (2008). Sparse feature learning for deep belief networks.\n\nNIPS 20.\n\nRanzato, M., Poultney, C., Chopra, S., & LeCun, Y. (2007). Ef\ufb01cient learning of sparse representa-\n\ntions with an energy-based model. NIPS 19.\n\nRiesenhuber, M., & Poggio, T. (1999). Hierarchical models of object recognition in cortex. Nature\n\nNeuroscience, 2, 1019\u20131025.\n\nRust, N., Schwartz, O., Movshon, J. A., & Simoncelli, E. (2005). Spatiotemporal elements of\n\nmacaque V1 receptive \ufb01elds. Neuron, 46, 945\u2013956.\n\nVincent, P., Larochelle, H., Bengio, Y., & Manzagol, P.-A. (2008). Extracting and composing robust\n\nfeatures with denoising autoencoders. ICML 2008 (pp. 1096\u20131103). ACM.\n\nWeston, J., Ratle, F., & Collobert, R. (2008). Deep learning via semi-supervised embedding. ICML\n\n2008 (pp. 1168\u20131175). New York, NY, USA: ACM.\n\nWiskott, L., & Sejnowski, T. (2002). Slow feature analysis: Unsupervised learning of invariances.\n\nNeural Computation, 14, 715\u2013770.\n\n9\n\n\f", "award": [], "sourceid": 933, "authors": [{"given_name": "Yoshua", "family_name": "Bengio", "institution": null}, {"given_name": "James", "family_name": "Bergstra", "institution": null}]}