{"title": "Tiled convolutional neural networks", "book": "Advances in Neural Information Processing Systems", "page_first": 1279, "page_last": 1287, "abstract": "Convolutional neural networks (CNNs) have been successfully applied to many tasks such as digit and object recognition. Using convolutional (tied) weights signi\ufb01cantly reduces the number of parameters that have to be learned, and also allows translational invariance to be hard-coded into the architecture. In this paper, we consider the problem of learning invariances, rather than relying on hard-coding. We propose tiled convolution neural networks (Tiled CNNs), which use a regular \u201ctiled\u201d pattern of tied weights that does not require that adjacent hidden units share identical weights, but instead requires only that hidden units k steps away from each other to have tied weights. By pooling over neighboring units, this architecture is able to learn complex invariances (such as scale and rotational invariance) beyond translational invariance. Further, it also enjoys much of CNNs\u2019 advantage of having a relatively small number of learned parameters (such as ease of learning and greater scalability). We provide an efficient learning algorithm for Tiled CNNs based on Topographic ICA, and show that learning complex invariant features allows us to achieve highly competitive results for both the NORB and CIFAR-10 datasets.", "full_text": "Tiled convolutional neural networks\n\nQuoc V. Le, Jiquan Ngiam, Zhenghao Chen, Daniel Chia, Pang Wei Koh, Andrew Y. Ng\n\nComputer Science Department, Stanford University\n\n{quocle,jngiam,zhenghao,danchia,pangwei,ang}@cs.stanford.edu\n\nAbstract\n\nConvolutional neural networks (CNNs) have been successfully applied to many\ntasks such as digit and object recognition. Using convolutional (tied) weights\nsigni\ufb01cantly reduces the number of parameters that have to be learned, and also\nallows translational invariance to be hard-coded into the architecture. 
In this paper, we consider the problem of learning invariances, rather than relying on hard-coding. We propose tiled convolutional neural networks (Tiled CNNs), which use a regular \u201ctiled\u201d pattern of tied weights that does not require that adjacent hidden units share identical weights, but instead requires only that hidden units k steps away from each other have tied weights. By pooling over neighboring units, this architecture is able to learn complex invariances (such as scale and rotational invariance) beyond translational invariance. Further, it also enjoys much of CNNs\u2019 advantage of having a relatively small number of learned parameters (such as ease of learning and greater scalability). We provide an efficient learning algorithm for Tiled CNNs based on Topographic ICA, and show that learning complex invariant features allows us to achieve highly competitive results for both the NORB and CIFAR-10 datasets.\n\n1 Introduction\n\nConvolutional neural networks (CNNs) [1] have been successfully applied to many recognition tasks. These tasks include digit recognition (MNIST dataset [2]), object recognition (NORB dataset [3]), and natural language processing [4]. CNNs take translated versions of the same basis function, and \u201cpool\u201d over them to build translationally invariant features. By sharing the same basis function across different image locations (weight-tying), CNNs have significantly fewer learnable parameters, which makes it possible to train them with fewer examples than if entirely different basis functions were learned at different locations (untied weights). Furthermore, CNNs naturally enjoy translational invariance, since this is hard-coded into the network architecture. 
However, one disadvantage of this hard-coding approach is that the pooling architecture captures only translational invariance; the network does not, for example, pool across units that are rotations of each other or capture more complex invariances, such as out-of-plane rotations.\n\nIs it better to hard-code translational invariance \u2013 since this is a useful form of prior knowledge \u2013 or let the network learn its own invariances from unlabeled data? In this paper, we show that the latter is superior and describe an algorithm that can do so, outperforming convolutional methods. In particular, we present tiled convolutional networks (Tiled CNNs), which use a novel weight-tying scheme (\u201ctiling\u201d) that simultaneously enjoys the benefit of significantly reducing the number of learnable parameters while giving the algorithm flexibility to learn other invariances. Our method is based on only constraining weights/basis functions k steps away from each other to be equal (with the special case of k = 1 corresponding to convolutional networks).\n\nFigure 1: Left: Convolutional Neural Networks with local receptive fields and tied weights. Right: Partially untied local receptive field networks \u2013 Tiled CNNs. Units with the same color belong to the same map; within each map, units with the same fill texture have tied weights. (Network diagrams in the paper are shown in 1D for clarity.)\n\nIn order to learn these invariances from unlabeled data, we employ unsupervised pretraining, which has been shown to help performance [5, 6, 7]. In particular, we use a modification of Topographic ICA (TICA) [8], which learns to organize features in a topographical map by pooling together groups of related features. By pooling together local groups of features, it produces representations that are robust to local transformations [9]. 
We show in this paper how TICA can be ef\ufb01ciently used to\npretrain Tiled CNNs through the use of local orthogonality.\n\nThe resulting Tiled CNNs pretrained with TICA are indeed able to learn invariant representations,\nwith pooling units that are robust to both scaling and rotation. We \ufb01nd that this improves classi\ufb01ca-\ntion performance, enabling Tiled CNNs to be competitive with previously published results on the\nNORB [3] and CIFAR-10 [10] datasets.\n\n2 Tiled CNNs\n\nCNNs [1, 11] are based on two key concepts: local receptive \ufb01elds, and weight-tying. Using local\nreceptive \ufb01elds means that each unit in the network only \u201clooks\u201d at a small, localized region of the\ninput image. This is more computationally ef\ufb01cient than having full receptive \ufb01elds, and allows\nCNNs to scale up well. Weight-tying additionally enforces that each \ufb01rst-layer (simple) unit shares\nthe same weights (see Figure 1-Left). This reduces the number of learnable parameters, and (by\npooling over neighboring units) further hard-codes translational invariance into the model.\n\nEven though weight-tying allows one to hard-code translational invariance, it also prevents the pool-\ning units from capturing more complex invariances, such as scale and rotation invariance. This is\nbecause the second layer units are constrained to pool over translations of identical bases. In this\npaper, rather than tying all of the weights in the network together, we instead develop a method that\nleaves nearby bases untied, but far-apart bases tied. This lets second-layer units pool over simple\nunits that have different basis functions, and hence learn a more complex range of invariances.\n\nWe call this local untying of weights \u201ctiling.\u201d Tiled CNNs are parametrized by a tile size k: we\nconstrain only units that are k steps away from each other to be tied. 
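To make the tying rule concrete, here is a minimal 1D sketch in NumPy (our own illustration, not the paper's code; `tiled_responses` and its arguments are hypothetical names):

```python
import numpy as np

def tiled_responses(x, banks, s):
    """1D sketch of tiled weight-tying: simple unit i sees the patch
    x[i:i+s] and uses weight bank i % k, so units exactly k steps
    apart share identical weights (k = number of banks)."""
    k = len(banks)
    n = len(x)
    return np.array([banks[i % k] @ x[i:i + s] for i in range(n - s + 1)])

x = np.arange(6.0)
# k = 1: a single shared bank -- this is an ordinary convolution.
conv = tiled_responses(x, np.ones((1, 2)), s=2)
# k = 2: even-indexed units use one bank, odd-indexed units another.
tiled = tiled_responses(x, np.stack([np.ones(2), np.zeros(2)]), s=2)
```

With one bank (k = 1) every unit shares the same weights, recovering the convolutional case; with two banks, units one step apart may differ while units two steps apart are identical.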
By varying k, we obtain a\nspectrum of models which trade off between being able to learn complex invariances, and having\nfew learnable parameters. At one end of the spectrum we have traditional CNNs (k = 1), and at the\nother, we have fully untied simple units.\n\nNext, we will allow our model to use multiple \u201cmaps,\u201d so as to learn highly overcomplete repre-\nsentations. A map is a set of pooling units and simple units that collectively cover the entire image\n(see Figure 1-Right). When varying the tiling size, we change the degree of weight tying within\neach map; for example, if k = 1, the simple units within each map will have the same weights. In\nour model, simple units in different maps are never tied. By having units in different maps learn\ndifferent features, our model can learn a rich and diverse set of features. Tiled CNNs with multiple\nmaps enjoy the twin bene\ufb01ts of (i) being able to represent complex invariances, by pooling over\n(partially) untied weights, and (ii) having a relatively small number of learnable parameters.\n\n2\n\n\fFigure 2: Left: TICA network architecture. Right: TICA \ufb01rst layer \ufb01lters (2D topography, 25 rows\nof W ).\n\nUnfortunately, existing methods for pretraining CNNs [11, 12] are not suitable for untied weights;\nfor example, the CDBN algorithm [11] breaks down without the weight-tying constraints. In the\nfollowing sections, we discuss a pretraining method for Tiled CNNs based on the TICA algorithm.\n\n3 Unsupervised feature learning via TICA\n\nTICA is an unsupervised learning algorithm that learns features from unlabeled image patches.\nA TICA network [9] can be described as a two-layered network (Figure 2-Left), with square and\nsquare-root nonlinearities in the \ufb01rst and second layers respectively. 
The weights W in the first layer are learned, while the weights V in the second layer are fixed and hard-coded to represent the neighborhood/topographical structure of the neurons in the first layer. Specifically, each second layer hidden unit pi pools over a small neighborhood of adjacent first layer units hi. We call the hi and pi simple and pooling units, respectively.\n\nMore precisely, given an input pattern x(t), the activation of each second layer unit is\n\npi(x(t); W, V ) = sqrt( sum_{k=1}^{m} Vik ( sum_{j=1}^{n} Wkj x(t)_j )^2 ).\n\nTICA learns the parameters W through finding sparse feature representations in the second layer, by solving:\n\nminimize_W sum_{t=1}^{T} sum_{i=1}^{m} pi(x(t); W, V ), subject to W W T = I, (1)\n\nwhere the input patterns {x(t)}, t = 1, . . . , T, are whitened.1 Here, W \u2208 Rm\u00d7n and V \u2208 Rm\u00d7m, where n is the size of the input and m is the number of hidden units in a layer. V is a fixed matrix (Vij = 1 or 0) that encodes the 2D topography of the hidden units hi. Specifically, the hi units lie on a 2D grid, with each pi connected to a contiguous 3x3 (or other size) block of hi units.2 The case of each pi being connected to exactly one hi corresponds to standard ICA. The orthogonality constraint W W T = I provides competitiveness and ensures that the learned features are diverse.\n\nOne important property of TICA is that it can learn invariances even when trained only on unlabeled data, as demonstrated in [8, 9]. This is due both to the pooling architecture, which gives rise to pooling units that are robust to local transformations of their inputs, and the learning algorithm, which promotes selectivity by optimizing for sparsity. 
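As a concrete reference, the pooling activation and the objective in (1) can be computed in a few lines (a NumPy sketch under our own naming, not the authors' implementation; the degenerate V = I case below is the plain-ICA special case mentioned above):

```python
import numpy as np

def tica_pooling(X, W, V):
    """Pooling activations p_i(x; W, V) = sqrt(sum_k V_ik (W x)_k^2).

    X: (n, T) whitened input patterns as columns, W: (m, n) filters,
    V: (m, m) fixed 0/1 pooling matrix. Returns an (m, T) array."""
    return np.sqrt(V @ (W @ X) ** 2)

def tica_objective(X, W, V):
    """Sum of all pooling activations; minimized subject to W W^T = I."""
    return float(tica_pooling(X, W, V).sum())

rng = np.random.default_rng(0)
n = m = 16
T = 50
X = rng.standard_normal((n, T))
W = np.linalg.qr(rng.standard_normal((n, n)))[0]  # orthogonal, so W W^T = I
V = np.eye(m)  # degenerate topography: each p_i pools a single h_i (plain ICA)
```

With V = I the pooling activation reduces to |W x|, matching the standard-ICA case described above.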
This combination of robustness and selectivity is\ncentral to feature invariance, which is in turn essential for recognition tasks [13].\n\nIf we choose square and square-root activations for the simple and pooling units in the Tiled CNN,\nwe can view the Tiled CNN as a special case of a TICA network, with the topography of the pooling\nunits specifying the matrix V .3 Crucially, Tiled CNNs incorporate local receptive \ufb01elds, which play\nan important role in speeding up TICA. We discuss this next.\n\n1Whitening means that they have been linearly transformed to have zero mean and identity covariance.\n2For illustration, however, the \ufb01gures in this paper depict xi, hi and pi in 1D and show a 1D topography.\n3The locality constraint, in addition to being biologically motivated by the receptive \ufb01eld organization\npatterns in V1, is also a natural approximation to the original TICA algorithm as the original learned receptive\n\n3\n\n\f4 Local receptive \ufb01elds in TICA\n\nTiled CNNs typically perform much better at object recognition when the learned representation\nconsists of multiple feature maps (Figure 1-Right). This corresponds to training TICA with an over-\ncomplete representation (m > n). When learning overcomplete representations [14], the orthogo-\nnality constraint cannot be satis\ufb01ed exactly, and we instead try to satisfy an approximate orthogonal-\nity constraint [15]. Unfortunately, these approximate orthogonality constraints are computationally\nexpensive and have hyperparameters which need to be extensively tuned. 
Much of this tuning can be avoided by using score matching [16], but this is computationally even more expensive, and while orthogonalization can be avoided altogether with topographic sparse coding, those models are also expensive as they require further work either for inference at prediction time [9, 14] or for learning a decoder unit at training time [17].\n\nWe can avoid approximate orthogonalization by using local receptive fields, which are inherently built into Tiled CNNs. With these, the weight matrix W for each simple unit is constrained to be 0 outside a small local region. This locality constraint automatically ensures that the weights of any two simple units with non-overlapping receptive fields are orthogonal, without the need for an explicit orthogonality constraint. Empirically, we find that orthogonalizing partially overlapping receptive fields is not necessary for learning distinct, informative features either.\n\nHowever, orthogonalization is still needed to decorrelate units that occupy the same position in their respective maps, for they look at the same region on the image. Fortunately, this local orthogonalization is cheap: for example, if there are l maps and if each receptive field is restricted to look at an input patch that contains s pixels, we would only need to orthogonalize the rows of an l-by-s matrix to ensure that the l features over these s pixels are orthogonal. Specifically, so long as l \u2264 s, we can demand that these l units that share an input patch be orthogonal. Using this method, we can learn networks that are overcomplete by a factor of about s (i.e., by learning l = s maps), while having to orthogonalize only matrices that are l-by-s. This is significantly lower in cost than standard TICA. For l maps, our computational cost is O(ls^2 n), compared to standard TICA\u2019s O(l^2 n^3).\n\nIn general, we will have l \u00d7 k \u00d7 s learnable parameters for an input of size n. 
We note that setting k to its maximum value of n \u2212 s + 1 gives exactly the untied local TICA model outlined in the previous section.4\n\n5 Pretraining Tiled CNNs with local TICA\n\nAlgorithm 1 Unsupervised pretraining of Tiled CNNs with TICA (line search)\nInput: {x(t)}, t = 1, . . . , T; W; V; k, s // k is the tile size, s is the receptive field size\nOutput: W\nrepeat\n  f_old \u2190 sum_{t=1}^{T} sum_{i=1}^{m} pi(x(t); W, V ); g \u2190 \u2202f_old / \u2202W\n  f_new \u2190 +\u221e, \u03b1 \u2190 1\n  while f_new \u2265 f_old do\n    W_new \u2190 W \u2212 \u03b1g\n    W_new \u2190 localize(W_new, s)\n    W_new \u2190 tie_weights(W_new, k)\n    W_new \u2190 orthogonalize_local_RF(W_new)\n    f_new \u2190 sum_{t=1}^{T} sum_{i=1}^{m} pi(x(t); W_new, V )\n    \u03b1 \u2190 0.5\u03b1\n  end while\n  W \u2190 W_new\nuntil convergence\n\nOur pretraining algorithm, which is based on gradient descent on the TICA objective function (1), is shown in Algorithm 1. The innermost loop is a simple implementation of backtracking linesearch.\n\n3(continued) fields tend to be very localized, even without any explicit locality constraint. For example, when trained on natural images, TICA\u2019s first layer weights usually resemble localized Gabor filters (Figure 2-Right).\n\n4For a 2D input image of size nxn and local RF of size sxs, the maximum value of k is (n \u2212 s + 1)^2.\n\nIn orthogonalize_local_RF(W_new), we only orthogonalize the weights that have completely overlapping receptive fields. In tie_weights, we enforce weight-tying by averaging each set of tied weights.\n\nThe algorithm is trained by batch projected gradient descent and usually requires little tuning of optimization parameters. This is because TICA\u2019s tractable objective function allows us to monitor convergence easily. 
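The outer loop of Algorithm 1 amounts to projected gradient descent with backtracking; a minimal sketch follows (the `project` argument stands in for the localize, tie-weights, and orthogonalize steps, and the toy objective at the bottom is only for illustration):

```python
import numpy as np

def pretrain_step(W, f, grad_f, project, alpha0=1.0):
    """One outer iteration of the pretraining loop: take a gradient step
    on objective f, re-apply the constraints via project, and halve the
    step size alpha until the projected step decreases the objective."""
    g, f_old, alpha = grad_f(W), f(W), alpha0
    while True:
        W_new = project(W - alpha * g)
        if f(W_new) < f_old:
            return W_new
        alpha *= 0.5
        if alpha < 1e-12:  # safeguard not in the paper's pseudocode
            return W

# Toy check on a smooth stand-in objective with an identity projection.
f = lambda W: float((W ** 2).sum())
grad_f = lambda W: 2.0 * W
W = pretrain_step(np.ones((2, 2)), f, grad_f, project=lambda W: W)
```

Because the objective is re-evaluated after every projection, the accepted step always decreases the constrained objective, which is what makes convergence easy to monitor.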
In contrast, other unsupervised feature learning algorithms such as RBMs [6] and autoencoders [18] require much more parameter tuning, especially during optimization.\n\n6 Experiments\n\n6.1 Speed-up\n\nWe first establish that the local receptive fields intrinsic to Tiled CNNs allow us to implement TICA learning for overcomplete representations in a much more efficient manner.\n\nFigure 3 shows the relative speed-up of pretraining Tiled CNNs over standard TICA using approximate fixed-point orthogonalization (W \u2190 (3/2)W \u2212 (1/2)W W T W ) [15]. These experiments were run on 10000 images of size 32x32 or 50x50, with s = 8.\n\nWe note that the weights in this experiment were left fully untied, i.e., k = n \u2212 s + 1. Hence, the speed-up observed here is not from an efficient convolutional implementation, but purely due to the local receptive fields. Overcoming this computational challenge is the key that allows Tiled CNNs to successfully use TICA to learn features from unlabeled data.5\n\nFigure 3: Speed-up of Tiled CNNs compared to standard TICA.\n\n6.2 Classification on NORB\n\nNext, we show that TICA pretraining for Tiled CNNs performs well on object recognition. We start with the normalized-uniform set for NORB, which consists of 24300 training examples and 24300 test examples drawn from 5 categories. In our case, each example is a preprocessed pair of 32x32 images.6\n\nIn our classification experiments, we fix the size of each local receptive field to 8x8, and set V such that each pooling unit pi in the second layer pools over a block of 3x3 simple units in the first layer, without wraparound at the borders. The number of pooling units in each map is exactly the same as the number of simple units. We densely tile the input images with overlapping 8x8 local receptive fields, with a step size (or \u201cstride\u201d) of 1. 
This gives us 25 \u00d7 25 = 625 simple units and 625 pooling units per map in our experiments on 32x32 images.\n\nA summary of results is reported in Table 1.\n\nTable 1: Test set accuracy on NORB\n\nTiled CNNs (with finetuning) (Section 6.2.2): 96.1%\nTiled CNNs (without finetuning) (Section 6.2.1): 94.5%\nStandard TICA (10x overcomplete): 89.6%\nConvolutional Neural Networks [19], [12]: 94.1%, 94.4%\n3D Deep Belief Networks [19]: 93.5%\nSupport Vector Machines [20]: 88.4%\nDeep Boltzmann Machines [21]: 92.8%\n\n6.2.1 Unsupervised pretraining\n\nWe first consider the case in which the features are learned purely from unsupervised data. In particular, we use the NORB training set itself (without the labels) as a source of unsupervised data with which to learn the weights W of the Tiled CNN. We call this initial phase the unsupervised pretraining phase.\n\n5All algorithms are implemented in MATLAB, and executed on a computer with 3.0GHz CPU, 9Gb RAM. While orthogonalization alone is 10^4 times faster in Tiled CNNs, other computations such as gradient calculations reduce its overall speed-up factor to 10x-250x.\n\n6Each NORB example is a binocular pair of 96x96 images. To reduce processing time, we downsampled each 96x96 image to 32x32 pixels. Hence, each simple unit sees 128 pixels from an 8x8 patch from each of the two binocular images. The input was whitened using ZCA (Zero-Phase Components Analysis).\n\nAfter learning a feature representation from the unlabeled data, we train a linear classifier on the output of the Tiled CNN network (i.e., the activations of the pooling units) on the labeled training set. 
During this supervised training phase, only the weights of the linear classi\ufb01er were learned,\nwhile the lower weights of the Tiled CNN model remained \ufb01xed.\nWe train a range of models to investigate the role of the tile size k and the number of maps l.7 The test\nset accuracy results of these models are shown in Figure 4-Left. Using a randomly sampled hold-out\nvalidation set of 2430 examples (10%) taken from the training set, we selected a convolutional model\nwith 48 maps that achieved an accuracy of 94.5% on the test set, indicating that Tiled CNNs learned\npurely on unsupervised data compare favorably to many state-of-the-art algorithms on NORB.\n\n6.2.2 Supervised \ufb01netuning of W\n\nNext, we study the effects of supervised \ufb01netuning [23] on the models produced by the unsupervised\npretraining phase. Supervised \ufb01netuning takes place after unsupervised pretraining, but before the\nsupervised training of the classi\ufb01er.\n\nUsing softmax regression to calculate the gradients, we backpropagated the error signal from the\noutput back to the learned features in order to update W , the weights of the simple units in the Tiled\nCNN model. During the \ufb01netuning step, the weights W were adjusted without orthogonalization.\n\nThe results of supervised \ufb01netuning on our models are shown in Figure 4-Right. As above, we used a\nvalidation set comprising 10% of the training data for model selection. Models with larger numbers\nof maps tended to over\ufb01t and hence performed poorly on the validation set. The best performing\n\ufb01ne-tuned model on the validation set was the model with 16 maps and k = 2, which achieved\na test-set accuracy of 96.1%. 
This substantially outperforms standard TICA, as well as the best published results on NORB to date (see Table 1).\n\nFigure 4: Left: NORB test set accuracy across various tile sizes and numbers of maps, without finetuning. Right: NORB test set accuracy, with finetuning.\n\n6.2.3 Limited training data\n\nTo test the ability of our pretrained features to generalize across rotations and lighting conditions given only a weak supervised signal, we limited the labeled training set to comprise only examples with a particular set of viewing angles and lighting conditions. Specifically, NORB contains images spanning 9 elevations, 18 azimuths and 6 lighting conditions, and we trained our linear classifier only on data with elevations {2, 4, 6}, azimuths {10, 18, 24} and lighting conditions {1, 3, 5}. Thus, for each object instance, the linear classifier sees only 27 training images, making for a total of 675 out of the possible 24300 training examples.\n\nFigure 5: Test set accuracy on full and limited training sets\n\n7We used an SVM [22] as the linear classifier and determined C by cross-validation over {10^\u22124, 10^\u22123, . . . , 10^4}. Models were trained with various untied map sizes k \u2208 {1, 2, 9, 16, 25} and number of maps l \u2208 {4, 6, 10, 16}. When k = 1, we were able to use an efficient convolutional implementation to scale up the number of maps in the models, allowing us to train additional models with l \u2208 {22, 36, 48}.\n\nUsing the pretrained network in Section 6.2.1, we trained a linear classifier on these 675 labeled examples. We obtained an accuracy of 72.2% on the full test set using the model with k = 2 and 22 maps. A smaller, approximately 2.5x overcomplete model with k = 2 and 4 maps obtained an accuracy of 64.9%. 
In stark contrast, raw pixel performance dropped sharply from 80.2% with a full supervised training set, to a near-chance level of 20.8% on this limited training set (Figure 5).\n\nThese results demonstrate that Tiled CNNs perform well even with limited labeled data. This is most likely because the partial weight-tying results in a relatively small number of learnable parameters, reducing the need for large amounts of labeled data.\n\n6.3 Classification on CIFAR-10\n\nTable 2: Test set accuracy on CIFAR-10\n\nDeep Tiled CNNs (s=4, with finetuning) (Section 6.3.2): 73.1%\nTiled CNNs (s=8, without finetuning) (Section 6.3.1): 66.1%\nStandard TICA (10x, fixed-point orthogonalization): 56.1%\nRaw pixels [10]: 41.1%\nRBM (one layer, 10000 units, finetuning) [10]: 64.8%\nRBM (two layers, 10000 units, finetuning both layers) [10]: 60.3%\nRBM (two layers, 10000 units, finetuning top layer) [10]: 62.2%\nmcRBM (convolutional, trained on two million tiny images) [24]: 71.0%\nLocal Coordinate Coding (LCC) [25]: 72.3%\nImproved Local Coordinate Coding (Improved LCC) [25]: 74.5%\n\nThe CIFAR-10 dataset contains 50000 training images and 10000 test images drawn from 10 categories.8 A summary of results is reported in Table 2.\n\n6.3.1 Unsupervised pretraining and supervised finetuning\n\nAs before, models were trained with tile size k \u2208 {1, 2, 25}, and number of maps l \u2208 {4, 10, 16, 22, 32}. The convolutional model (k = 1) was also trained with l = 48 maps. This 48-map convolutional model performed the best on our 10% hold-out validation set, and achieved a test set accuracy of 66.1%. We find that supervised finetuning of these models on CIFAR-10 causes overfitting, and generally reduces test-set accuracy; the top model on the validation set, with 32 maps and k = 1, only achieves 65.1%.\n\n8Each CIFAR-10 example is a 32x32 RGB image, also whitened using ZCA. 
Hence, each simple unit sees three patches from three channels of the color image input (RGB).\n\n6.3.2 Deep Tiled CNNs\n\nWe additionally investigate the possibility of training a deep Tiled CNN in a greedy layer-wise fashion, similar to models such as DBNs [6] and stacked autoencoders [26, 18]. We constructed this network by stacking two Tiled CNNs, each with 10 maps and k = 2. The resulting four-layer network has the structure W1 \u2192 V1 \u2192 W2 \u2192 V2, where the weights W1 are local receptive fields of size 4x4, and W2 is of size 3x3, i.e., each unit in the third layer \u201clooks\u201d at a 3x3 window of each of the 10 maps in the first layer. These parameters were chosen by an efficient architecture search [27] on the hold-out validation set. The number of maps in the third and fourth layer is also 10.\n\nAfter finetuning, we found that the deep model outperformed all previous models on the validation set, and achieved a test set accuracy of 73.1%. This demonstrates the potential of deep Tiled CNNs to learn more complex representations.\n\n6.4 Effects of optimizing the pooling units\n\nWhen the tile size is 1 (i.e., a fully tied model), a na\u00efve approach to learning the filter weights is to directly train the first layer filters using small patches (e.g., 8x8) randomly sampled from the dataset, with a method such as ICA. This method is computationally more attractive and probably easier to implement. Here, we investigate if such benefits come at the expense of classification accuracy.\n\nWe use ICA to learn the first layer weights on CIFAR-10 with 16 filters. These weights are then used in a Tiled CNN with a tile size of 1 and 16 maps. This method is compared to pretraining the model of the same architecture with TICA. For both methods, we do not use finetuning. 
Interestingly, classification results on the test set show that the na\u00efve approach yields significantly reduced accuracy: the na\u00efve approach obtains 51.54% on the test set, while pretraining with TICA achieves 58.66%. These results confirm that optimizing for sparsity of the pooling units results in better features than just na\u00efvely approximating the first layer weights.\n\n7 Discussion and Conclusion\n\nOur results show that untying weights is beneficial for classification performance. Specifically, we find that selecting a tile size of k = 2 achieves the best results for both the NORB and CIFAR-10 datasets, even with deep networks. More importantly, untying weights allows the networks to learn more complex invariances from unlabeled data. By visualizing [28, 29] the range of optimal stimuli that activate each pooling unit in a Tiled CNN, we found units that were scale and rotationally invariant.9 We note that a standard CNN is unlikely to be invariant to these transformations.\n\nA natural choice of the tile size k would be to set it to the size of the pooling region p, which in this case is 3. In this case, each pooling unit always combines simple units which are not tied. However, increasing the tile size leads to a higher degree of freedom in the models, making them susceptible to overfitting (learning unwanted non-stationary statistics of the dataset). Fortunately, the Tiled CNN only requires unlabeled data for training, which can be obtained cheaply. Our preliminary results on networks pretrained using 250000 unlabeled images from the Tiny images dataset [30] show that performance increases as k goes from 1 to 3, flattening out at k = 4. 
This suggests that when there is sufficient data to avoid overfitting, setting k = p can be a very good choice.\n\nIn this paper, we introduced Tiled CNNs as an extension of CNNs that support both unsupervised pretraining and weight tiling. The idea of tiling, or partial untying of filter weights, is a parametrization of a spectrum of models which includes both fully-convolutional and fully-untied weight schemes as natural special cases. Furthermore, the use of local receptive fields enables our models to scale up well, producing massively overcomplete representations that perform well on classification tasks. These principles allow Tiled CNNs to achieve competitive results on the NORB and CIFAR-10 object recognition datasets. Importantly, tiling is directly applicable and can potentially benefit a wide range of other feature learning models.\n\nAcknowledgements: We thank Adam Coates, David Kamm, Andrew Maas, Andrew Saxe, Serena Yeung and Chenguang Zhu for insightful discussions. This work was supported by the DARPA Deep Learning program under contract number FA8650-10-C-7020.\n\n9These visualizations are available at http://ai.stanford.edu/\u223cquocle/.\n\nReferences\n\n[1] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.\n[2] P. Simard, D. Steinkraus, and J. Platt. Best practices for convolutional neural networks applied to visual document analysis. In ICDAR, 2003.\n[3] Y. LeCun, F.J. Huang, and L. Bottou. Learning methods for generic object recognition with invariance to pose and lighting. In CVPR, 2004.\n[4] R. Collobert and J. Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In ICML, 2008.\n[5] R. Raina, A. Battle, H. Lee, B. Packer, and A.Y. Ng. Self-taught learning: Transfer learning from unlabeled data. 
In ICML, 2007.\n[6] G.E. Hinton, S. Osindero, and Y.W. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 2006.\n[7] D. Erhan, A. Courville, Y. Bengio, and P. Vincent. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 2010.\n[8] A. Hyvarinen and P. Hoyer. Topographic independent component analysis as a model of V1 organization and receptive fields. Neural Computation, 2001.\n[9] A. Hyvarinen, J. Hurri, and P. Hoyer. Natural Image Statistics. Springer, 2009.\n[10] A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, U. Toronto, 2009.\n[11] H. Lee, R. Grosse, R. Ranganath, and A.Y. Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In ICML, 2009.\n[12] K. Jarrett, K. Kavukcuoglu, M.A. Ranzato, and Y. LeCun. What is the best multi-stage architecture for object recognition? In ICCV, 2009.\n[13] I. Goodfellow, Q.V. Le, A. Saxe, H. Lee, and A.Y. Ng. Measuring invariances in deep networks. In NIPS, 2010.\n[14] B. Olshausen and D. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 1996.\n[15] A. Hyvarinen, J. Karhunen, and E. Oja. Independent Component Analysis. Wiley Interscience, 2001.\n[16] A. Hyvarinen. Estimation of non-normalized statistical models using score matching. JMLR, 2005.\n[17] K. Kavukcuoglu, M.A. Ranzato, R. Fergus, and Y. LeCun. Learning invariant features through topographic filter maps. In CVPR, 2009.\n[18] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle. Greedy layerwise training of deep networks. In NIPS, 2007.\n[19] V. Nair and G. Hinton. 3D object recognition with deep belief nets. In NIPS, 2009.\n[20] Y. Bengio and Y. LeCun. Scaling learning algorithms towards AI. In Large-Scale Kernel Machines, 2007.\n[21] R. Salakhutdinov and H. Larochelle. 
Efficient learning of Deep Boltzmann Machines. In AISTATS, 2010.\n[22] R.E. Fan, K.W. Chang, C.J. Hsieh, X.R. Wang, and C.J. Lin. LIBLINEAR: A library for large linear classification. JMLR, 9:1871\u20131874, 2008.\n[23] G. Hinton and R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 2006.\n[24] M. Ranzato and G. Hinton. Modeling pixel means and covariances using factorized third-order Boltzmann machines. In CVPR, 2010.\n[25] K. Yu and T. Zhang. Improved local coordinate coding using local tangents. In ICML, 2010.\n[26] Y. Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2009.\n[27] A. Saxe, M. Bhand, Z. Chen, P.W. Koh, B. Suresh, and A.Y. Ng. On random weights and unsupervised feature learning. In Workshop: Deep Learning and Unsupervised Feature Learning (NIPS), 2010.\n[28] D. Erhan, Y. Bengio, A. Courville, and P. Vincent. Visualizing higher-layer features of a deep network. Technical report, University of Montreal, 2009.\n[29] P. Berkes and L. Wiskott. Slow feature analysis yields a rich repertoire of complex cell properties. Journal of Vision, 2005.\n[30] A. Torralba, R. Fergus, and W.T. Freeman. 80 million tiny images: a large dataset for non-parametric object and scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008.\n", "award": [], "sourceid": 550, "authors": [{"given_name": "Jiquan", "family_name": "Ngiam", "institution": null}, {"given_name": "Zhenghao", "family_name": "Chen", "institution": null}, {"given_name": "Daniel", "family_name": "Chia", "institution": null}, {"given_name": "Pang", "family_name": "Koh", "institution": null}, {"given_name": "Quoc", "family_name": "Le", "institution": null}, {"given_name": "Andrew", "family_name": "Ng", "institution": null}]}