{"title": "Convolutional-Recursive Deep Learning for 3D Object Classification", "book": "Advances in Neural Information Processing Systems", "page_first": 656, "page_last": 664, "abstract": null, "full_text": "Convolutional-Recursive Deep Learning\n\nfor 3D Object Classi\ufb01cation\n\nRichard Socher, Brody Huval, Bharath Bhat, Christopher D. Manning, Andrew Y. Ng\n\nComputer Science Department, Stanford University, Stanford, CA 94305, USA\n\nrichard@socher.org, {brodyh,bbhat,manning}@stanford.edu, ang@cs.stanford.edu\n\nAbstract\n\nRecent advances in 3D sensing technologies make it possible to easily record color\nand depth images which together can improve object recognition. Most current\nmethods rely on very well-designed features for this new 3D modality. We in-\ntroduce a model based on a combination of convolutional and recursive neural\nnetworks (CNN and RNN) for learning features and classifying RGB-D images.\nThe CNN layer learns low-level translationally invariant features which are then\ngiven as inputs to multiple, \ufb01xed-tree RNNs in order to compose higher order fea-\ntures. RNNs can be seen as combining convolution and pooling into one ef\ufb01cient,\nhierarchical operation. Our main result is that even RNNs with random weights\ncompose powerful features. Our model obtains state of the art performance on a\nstandard RGB-D object dataset while being more accurate and faster during train-\ning and testing than comparable architectures such as two-layer CNNs.\n\n1\n\nIntroduction\n\nObject recognition is one of the hardest problems in computer vision and important for making\nrobots useful in home environments. New sensing technology, such as the Kinect, that can record\nhigh quality RGB and depth images (RGB-D) has now become affordable and could be combined\nwith standard vision systems in household robots. 
The depth modality provides useful extra information to the complex problem of general object detection [1] since depth information is invariant to lighting or color variations, provides geometrical cues and allows better separation from the background. Most recent methods for object recognition with RGB-D images use hand-designed features such as SIFT for 2D images [2], Spin Images [3] for 3D point clouds, or specific color, shape and geometry features [4, 5].\nIn this paper, we introduce the first convolutional-recursive deep learning model for object recognition that can learn from raw RGB-D images. Compared to other recent 3D feature learning methods [6, 7], our approach is fast, does not need additional input channels such as surface normals, and obtains state-of-the-art results on the task of detecting household objects. Fig. 1 outlines our approach. Code for training and testing is available at www.socher.org.\nOur model starts with raw RGB and depth images and first separately extracts features from them. Each modality is first given to a single convolutional neural net layer (CNN, [8]) which provides useful translational invariance of low level features such as edges and allows parts of an object to be deformable to some extent. The pooled filter responses are then given to a recursive neural network (RNN, [9]) which can learn compositional features and part interactions. RNNs hierarchically project inputs into a lower dimensional space through multiple layers with tied weights and nonlinearities.\nWe also explore new deep learning architectures for computer vision. 
Our previous work on RNNs in natural language processing and computer vision [9, 10] (i) used a different tree structure for each input, (ii) employed a single RNN with one set of weights, (iii) restricted tree structures to be strictly binary, and (iv) trained the RNN with backpropagation through structure [11, 12]. In this paper, we expand the space of possible RNN-based architectures in these four dimensions by using fixed tree structures and multiple RNNs on the same input and allowing n-ary trees. We show that because of the CNN layer, fixing the tree structure does not hurt performance and it allows us to speed up recognition. Similar to recent work [13, 14] we show that performance of RNN models can improve with an increasing number of features. The hierarchically composed RNN features of each modality are concatenated and given to a joint softmax classifier.\n\nFigure 1: An overview of our model: A single CNN layer extracts low level features from RGB and depth images. Both representations are given as input to a set of RNNs with random weights. Each of the many RNNs (around 100 for each modality) then recursively maps the features into a lower dimensional space. The concatenation of all the resulting vectors forms the final feature vector for a softmax classifier.\n\nMost importantly, we demonstrate that RNNs with random weights can also produce high quality features. So far random weights have only been shown to work for convolutional neural networks [15, 16]. Because the supervised training reduces to optimizing the weights of the final softmax classifier, a large set of RNN architectures can quickly be explored. By combining the above ideas we obtain a state-of-the-art system for classifying 3D objects which is very fast to train and highly parallelizable at test time.\nWe first briefly describe the unsupervised learning of filter weights and their convolution to obtain low level features. 
Next we give details of how multiple random RNNs can be used to obtain high level features of the entire image. Then, we discuss related work. In our experiments we show quantitative comparisons of different models, analyze model ablations and describe our state-of-the-art results on the RGB-D dataset of Lai et al. [2].\n\n2 Convolutional-Recursive Neural Networks\n\nIn this section, we describe our new CNN-RNN model. We first learn the CNN filters in an unsupervised way by clustering random patches and then feed these patches into a CNN layer. The resulting low-level, translationally invariant features are given to recursive neural networks. RNNs compose higher order features that can then be used to classify the images.\n\n2.1 Unsupervised Pre-training of CNN Filters\n\nWe follow the procedure described by Coates et al. [13] to learn filters which will be used in the convolution. First, random patches are extracted into two sets, one for each modality (RGB and depth). Each set of patches is then normalized and whitened. The pre-processed patches are clustered by simply running k-means. Fig. 2 shows the resulting filters for both modalities. They capture standard edge and color features. One interesting result when applying this method to the depth channel is that the edges are much sharper. This is due to the large discontinuities between object boundaries and the background. While the depth channel is often quite noisy most of the features are still smooth.\n\nFigure 2: Visualization of the k-means filters used in the CNN layer after unsupervised pre-training: (left) Standard RGB filters (best viewed in color) capture edges and colors. 
When the method is applied to depth images (center) the resulting filters have sharper edges which arise due to the strong discontinuities at object boundaries. The same is true, though to a lesser extent, when compared to filters trained on gray scale versions of the color images (right).\n\n2.2 A Single CNN Layer\n\nTo generate features for the RNN layer, a CNN architecture is chosen for its translational invariance properties. The main idea of CNNs is to convolve filters over the input image in order to extract features. Our single layer CNN is similar to the one proposed by Jarrett et al. [17] and consists of a convolution, followed by rectification and local contrast normalization (LCN). LCN was inspired by computational neuroscience and is used to contrast features within a feature map, as well as across feature maps at the same spatial location [17, 18, 14].\nWe convolve each image of size (height and width) dI with K square filters of size dP, resulting in K filter responses, each of dimensionality dI \u2212 dP + 1. We then average pool them with square regions of size d\u2113 and a stride size of s, to obtain a pooled response with width and height equal to r = (dI \u2212 d\u2113)/s + 1. So the output X of the CNN layer applied to one image is a K \u00d7 r \u00d7 r dimensional 3D matrix. We apply this same procedure to both color and depth images separately.\n\n2.3 Fixed-Tree Recursive Neural Networks\n\nThe idea of recursive neural networks [19, 9] is to learn hierarchical feature representations by applying the same neural network recursively in a tree structure. In our case, the leaf nodes of the tree are K-dimensional vectors (the result of the CNN pooling over an image patch repeated for all K filters) and there are r^2 of them.\nIn our previous RNN work [9, 10, 20] the tree structure depended on the input. 
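As a concrete check of the pooling arithmetic in Section 2.2, the following sketch (ours, not from the paper; the helper name is hypothetical) computes the CNN output shape using the sizes reported later in the experiments (dI = 148, dP = 9, K = 128, d\u2113 = 10, s = 5). Note the pooling formula is applied to the convolved response of width dI \u2212 dP + 1.

```python
# Sketch of the output-size arithmetic for the single CNN layer.
# Numbers follow the paper's RGB-D setup; the function name is our own.

def cnn_output_shape(d_image, d_filter, n_filters, d_pool, stride):
    """Return (K, r, r) for a valid convolution followed by average pooling."""
    d_conv = d_image - d_filter + 1      # valid convolution: 148 - 9 + 1 = 140
    r = (d_conv - d_pool) // stride + 1  # pooled width/height: (140 - 10)/5 + 1 = 27
    return (n_filters, r, r)

shape = cnn_output_shape(148, 9, 128, 10, 5)
print(shape)  # (128, 27, 27), matching the 3D matrix X used in the experiments
```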
While this allows for more flexibility, we found that for the task of object classification in conjunction with a CNN layer it was not necessary for obtaining high performance. Furthermore, the search over optimal trees slows down the method considerably as one cannot easily parallelize the search or make use of parallelization of large matrix products. The latter could benefit immensely from new multicore hardware such as GPUs. In this work, we focus on fixed-trees which we can design to be balanced. Previous work also only combined pairs of vectors. We generalize our RNN architecture to allow each layer to merge blocks of adjacent vectors instead of only pairs.\nWe start with a 3D matrix X \u2208 R^{K\u00d7r\u00d7r} for each image (the columns are K-dimensional). We define a block to be a list of adjacent column vectors which are merged into a parent vector p \u2208 R^K. In the following we use only square blocks for convenience. Blocks are of size K \u00d7 b \u00d7 b. For instance, if we merge vectors in a block with b = 3, we get a total size 128 \u00d7 3 \u00d7 3 and a resulting list of vectors (x_1, . . . , x_9). In general, we have b^2 many vectors in each block. The neural network for computing the parent vector is\n\np = f( W [x_1; . . . ; x_{b^2}] ),    (1)\n\nwhere the parameter matrix W \u2208 R^{K \u00d7 b^2 K} and f is a nonlinearity such as tanh. We omit the bias term which turns out to have no effect in the experiments below. Eq. 1 will be applied to all blocks of vectors in X with the same weights W. Generally, there will be (r/b)^2 many parent vectors p, forming a new matrix P1. The vectors in P1 will again be merged in blocks just as those in matrix X using Eq. 1 with the same tied weights resulting in matrix P2. This procedure continues until only one parent vector remains. Fig. 
3 shows an example of a pooled CNN output of size K \u00d7 4 \u00d7 4 and an RNN tree structure with blocks of 4 children.\nThe model so far has been unsupervised. However, our original task is to classify each block into one of many object categories. Therefore, we use the top vector P_top as the feature vector to a softmax classifier. In order to minimize the cross entropy error of the softmax, we could backpropagate through the recursive neural network [12] and convolutional layers [8]. In practice, this is very slow and we will discuss alternatives in the next section.\n\n2.4 Multiple Random RNNs\n\nPrevious work used only a single RNN. We can actually use the 3D matrix X as input to a number of RNNs. Each of N RNNs will output a K-dimensional vector. After we forward propagate through all the RNNs, we concatenate their outputs to a NK-dimensional vector which is then given to the softmax classifier.\nInstead of taking derivatives of the W matrices of the RNNs, which would require backprop through structure [11], we found that even RNNs with random weights produce high quality feature vectors. Similar results have been found for random weights in the closely related CNNs [16]. Before comparing to other approaches, we briefly review related work.\n\nFigure 3: Recursive Neural Network applied to blocks: At each node, the same neural network is used to compute the parent vector of a set of child vectors. The original input matrix is the output of a pooled convolution.\n\n3 Related Work\n\nThere has been great interest in object recognition and scene understanding using RGB-D data. Silberman and Fergus have published a 3D dataset for full scene understanding [21]. 
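The fixed-tree computation of Sections 2.3-2.4 can be sketched in a few lines of NumPy. This is our illustration, not the paper's code: the function name and weight scale are our choices, we use 4 random RNNs instead of the paper's 128 to keep it small, and the flattening order of each block is immaterial since W is random.

```python
import numpy as np

# Minimal sketch of one fixed-tree RNN pass: pooled CNN output X of size
# K x 27 x 27, non-overlapping 3 x 3 blocks merged at every level with one
# tied, randomly initialized weight matrix W (Eq. 1; bias omitted as in the paper).

def rnn_forward(X, W, b=3):
    """Recursively merge b*b blocks of K-dim columns until one vector remains."""
    K, r, _ = X.shape
    while r > 1:
        r_new = r // b
        P = np.empty((K, r_new, r_new))
        for i in range(r_new):
            for j in range(r_new):
                block = X[:, i*b:(i+1)*b, j*b:(j+1)*b].reshape(K * b * b)
                P[:, i, j] = np.tanh(W @ block)   # p = f(W [x_1; ...; x_{b^2}])
        X, r = P, r_new
    return X[:, 0, 0]

rng = np.random.default_rng(0)
K, b = 128, 3
X = rng.standard_normal((K, 27, 27))   # stand-in for a pooled CNN response
# Several random RNNs; their outputs are concatenated for the softmax classifier.
feats = np.concatenate([rnn_forward(X, 0.01 * rng.standard_normal((K, K * b * b)), b)
                        for _ in range(4)])
print(feats.shape)  # (512,): 4 RNNs x 128 dimensions
```

The tree depth falls out of the sizes automatically: 27 → 9 → 3 → 1, matching the X, P1, P2, P3 sequence in the experiments.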
Koppula et al. also recently provided a new dataset for indoor scene segmentation [4].\nThe most common approach today for standard object recognition is to use well-designed features based on orientation histograms such as SIFT, SURF [22] or textons and give them as input to a classifier such as a random forest. Despite their success, they have several shortcomings such as being only applicable to one modality (grey scale images in the case of SIFT), not adapting easily to new modalities such as RGB-D or to varying image domains. There have been some attempts to modify these features to colored images via color histograms [23] or simply extending SIFT to the depth channel [2]. More advanced methods that generalize these ideas and can combine several important RGB-D image characteristics such as size, 3D shape and depth edges are kernel descriptors [5].\nAnother related line of work is about spatial pyramids in object classification, in particular the pyramid matching kernel [24]. The similarity is mostly in that our model also learns a hierarchical image representation that can be used to classify objects.\nAnother solution to the above mentioned problems is to employ unsupervised feature learning methods [25, 26, 27] (among many others) which have made large improvements in object recognition. While many deep learning methods exist for learning features from RGB images, few deep learning architectures have yet been investigated for 3D images. Very recently, Blum et al. [6] introduced convolutional k-means descriptors (CKM) for RGB-D data. They use SURF interest points and learn features using k-means similar to [28]. Their work is similar to ours in that they also learn features in an unsupervised way.\nVery recent work by Bo et al. 
[7] uses unsupervised feature learning based on sparse coding to learn\ndictionaries from 8 different channels including grayscale intensity, RGB, depth scalars, and surface\nnormals. Features are then used in hierarchical matching pursuit which consists of two layers. Each\nlayer has three modules: batch orthogonal matching pursuit, pyramid max pooling, and contrast\nnormalization. This results in a very large feature vector size of 188,300 dimensions which is used\nfor classi\ufb01cation.\nLastly, recursive autoencoders have been introduced by Pollack [19] and Socher et al. [10] to which\nwe compare quantitatively in our experiment section. Recursive neural networks have been applied\nto full scene segmentation [9] but they used hand-designed features. Farabet et al. [29] also introduce\na model for scene segmentation that is based on multi-scale convolutional neural networks and learns\nfeature representations.\n\n4 Experiments\n\nAll our experiments are carried out on the recent RGB-D dataset of Lai et al. [2]. There are 51\ndifferent classes of household objects and 300 instances of these classes. Each object instance is\nimaged from 3 different angles resulting in roughly 600 images per instance. The dataset consists of\na total of 207,920 RGB-D images. We subsample every 5th frame of the 600 images resulting in a\ntotal of 120 images per instance.\nIn this work we focus on the problem of category recognition and we use the same setup as [2] and\nthe 10 random splits they provide. All development is carried out on a separate split and model\nablations are run on one of the 10 splits. For each split\u2019s test set we sample one object from each\nclass resulting in 51 test objects, each with about 120 independently classi\ufb01ed images. This leaves\nabout 34,000 images for training our model. 
Before the images are given to the CNN they are resized to be dI = 148.\nUnsupervised pre-training for CNN filters is performed for all experiments by using k-means on 500,000 image patches randomly sampled from each split\u2019s training set. Before unsupervised pre-training, the 9 \u00d7 9 \u00d7 3 patches for RGB and 9 \u00d7 9 patches for depth are individually normalized by subtracting the mean and dividing by the standard deviation of their elements. In addition, ZCA whitening is performed to de-correlate pixels and get rid of redundant features in raw images [30].\nA valid convolution is performed with filter bank size K = 128 and filter width and height of 9. Average pooling is then performed with pooling regions of size d\u2113 = 10 and stride size s = 5 to produce a 3D matrix of size 128 \u00d7 27 \u00d7 27 for each image.\nEach RNN has non-overlapping child sizes of 3 \u00d7 3 applied spatially. This leads to the following matrices at each depth of the tree: X \u2208 R^{128\u00d727\u00d727} to P1 \u2208 R^{128\u00d79\u00d79} to P2 \u2208 R^{128\u00d73\u00d73} to finally P3 \u2208 R^{128}. We use 128 randomly initialized RNNs in both modalities. The combination of RGB and depth is done by concatenating the final features which have 2 \u00d7 128^2 = 32,768 dimensions.\n\n4.1 Comparison to Other Methods\n\nIn this section we compare our model to related models in the literature. Table 1 lists the main accuracy numbers and compares to the published results [2, 5, 6, 7]. Recent work by Bo et al. [5] investigates multiple kernel descriptors on top of various features, including 3D shape, physical size of the object, depth edges, gradients, kernel PCA, local binary patterns, etc. In contrast, all our features are learned in an unsupervised way from the raw color and depth images. Blum et al. 
[6] also learn feature descriptors and apply them sparsely to interest points. We outperform all methods except that of Bo et al. [7] who perform 0.7% better with a final feature vector that requires five times the amount of memory compared to ours. They make additional use of surface normals and gray scale images on top of RGB and depth channels and also learn features from these inputs with unsupervised methods based on sparse coding. Sparse coding is known to not scale well in terms of speed for large input dimensions [31].\n\nClassifier | Extra Features for 3D; RGB | 3D | RGB | Both\nLinear SVM [2] | Spin Images, efficient match kernel (EMK), random Fourier sets, width, depth, height; SIFT, EMK, texton histogram, color histogram | 53.1\u00b11.7 | 74.3\u00b13.3 | 81.9\u00b12.8\nKernel SVM [2] | same as above | 64.7\u00b12.2 | 74.5\u00b13.1 | 83.9\u00b13.5\nRandom Forest [2] | same as above | 66.8\u00b12.5 | 74.7\u00b13.6 | 79.6\u00b14.0\nSVM [5] | 3D shape, physical size of the object, depth edges, gradients, kernel PCA, local binary patterns, multiple depth kernels | 78.8\u00b12.7 | 77.7\u00b11.9 | 86.2\u00b12.1\nCKM [6] | SURF interest points | \u2013 | \u2013 | 86.4\u00b12.3\nSP+HMP [7] | surface normals | 81.2\u00b12.3 | 82.4\u00b13.1 | 87.5\u00b12.9\nCNN-RNN | \u2013 | 78.9\u00b13.8 | 80.8\u00b14.2 | 86.8\u00b13.3\n\nTable 1: Comparison of our CNN-RNN to multiple related approaches. We outperform all approaches except that of Bo et al. which uses an extra input modality of surface normals.\n\n4.2 Model Analysis\n\nWe analyze our model through several ablations and model variations. We picked one of the splits as our development fold and focus on RGB images and RNNs with random weights only unless otherwise noted.\nTwo layer CNN. Fig. 4 (left) shows a comparison between our CNN-RNN model and a two layer CNN. We compare a previously recommended architecture for CNNs [17] and one which uses filters trained with k-means. 
In both settings, the CNN-RNN outperforms the two layer CNN. Because it also requires many fewer matrix multiplications, it is approximately 4\u00d7 faster in our experiments compared to a second CNN layer. However, the main bottleneck of our method is still the first CNN layer. Both models could benefit from fast GPU implementations [32, 33].\nTree structured neural nets with untied weights. Fig. 4 (left) also gives results when the weights of the random RNNs are untied across layers in the tree (TNN). In other words, different random weights are used at each depth of the tree. Since weights are still tied inside each layer this setting can be seen as a convolution where the stride size is equal to the filter size. We call this a tree neural network (TNN) because it is technically not a recursive neural network. While this results in a large increase in parameters, it actually hurts performance, underlining the fact that tying the weights in RNNs is beneficial.\nTrained RNN. Another comparison shown in Fig. 4 (left) is between many random RNNs and a single trained RNN. We carefully cross validated the RNN training procedure, objectives (adding reconstruction costs at each layer as in [10], classifying each layer or only at the top node), regularization, layer size etc. The best performance was still lacking compared to 128 random RNNs (about a 2% difference) and training time is much longer. With a more efficient GPU-based implementation it might be possible to train many RNNs in the future.\nNumber of random RNNs: Fig. 4 (center) shows that increasing the number of random RNNs improves performance, leveling off at around 64 on this dataset.\nRGB & depth combinations and features: Fig. 4 (right) shows that combining RGB and depth features from RNNs improves performance. 
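As a size check for the combined representation used in the experiments (128 random RNNs per modality, K = 128, 51 classes), a small sketch of concatenating per-modality features and applying a softmax; the random inputs and weight scale are placeholders of ours, not trained values:

```python
import numpy as np

# Concatenate RGB and depth RNN features and apply an (untrained) softmax.
rng = np.random.default_rng(0)
n_rnns, K, n_classes = 128, 128, 51

rgb = rng.standard_normal(n_rnns * K)    # stand-in for RGB RNN outputs
depth = rng.standard_normal(n_rnns * K)  # stand-in for depth RNN outputs
x = np.concatenate([rgb, depth])         # 2 x 128^2 = 32,768 dimensions

W = 1e-3 * rng.standard_normal((n_classes, x.size))  # placeholder softmax weights
logits = W @ x
probs = np.exp(logits - logits.max())
probs /= probs.sum()                     # class probabilities, sum to 1
print(x.shape, probs.shape)              # (32768,) (51,)
```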
The two modalities complement each other and produce features that are independent enough so that the classifier can benefit from their combination.\nGlobal autoencoder on pixels and depth. In this experiment we investigate whether CNN-RNNs learn better features than simply using a single layer of features on raw pixels. Many methods such as those of Coates and Ng [28] show remarkable results with a single very wide layer. The global autoencoder achieves only 61.1% (it is overfitting at 93.3% training accuracy). We cross-validated over the number of hidden units and sparsity parameters. This shows that even random recursive neural nets can clearly capture more of the underlying class structure in their feature representations than a single layer autoencoder.\n\nFilters | 2nd Layer | Acc.\nSee [17] | CNN | 77.66\nSee [17] | RNN | 77.04\nk-means | tRNN | 78.10\nk-means | TNN | 79.67\nk-means | CNN | 78.65\nk-means | RNN\u2217 | 80.15\n\nFigure 4: Model analysis on the development split (left and center use RGB only). Left: Comparison of two layer CNN with CNN-RNN with different pre-processing ([17] and [13]). TNN is a tree structured neural net with untied weights across layers, tRNN is a single RNN trained with backpropagation (see text for details). The best performance is achieved with our model of random RNNs (marked with \u2217). Center: Increasing the number of random RNNs improves performance. Right: Combining both modalities improves performance to 88% on the development split.\n\n4.3 Error Analysis\n\nFig. 5 shows our confusion matrix across all 51 classes. Most model confusions are very reasonable, showing that recursive deep learning methods on raw pixels and depth can provide high quality features. The only class that we consistently misclassify is mushrooms, which are very similar in appearance to garlic.\nFig. 6 shows 4 pairs of often confused classes. 
Both garlic and mushrooms have very similar appearances and colors. Water bottles and shampoo bottles in particular are problematic because the IR sensors do not properly reflect from see-through surfaces.\n\n5 Conclusion\n\nWe introduced a new model based on a combination of convolutional and recursive neural networks. Unlike previous RNN models, we fix the tree structure, allow multiple vectors to be combined, use multiple RNN weights and keep parameters randomly initialized. This architecture allows for parallelization and high speeds, outperforms two layer CNNs and obtains state of the art performance without any external features. We also demonstrate the applicability of convolutional and recursive feature learning to the new domain of depth images.\n\nAcknowledgments\n\nWe thank Stephen Miller and Alex Teichman for tips on 3D images, Adam Coates for chats about image pre-processing and Ilya Sutskever and Andrew Maas for comments on the paper. We thank the anonymous reviewers for insightful comments. Richard is supported by the Microsoft Research PhD fellowship. The authors gratefully acknowledge the support of the Defense Advanced Research Projects Agency (DARPA) Machine Reading Program under Air Force Research Laboratory (AFRL) prime contract no. FA8750-09-C-0181, and the DARPA Deep Learning program under contract number FA8650-10-C-7020. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the view of DARPA, AFRL, or the US government.\n\nReferences\n\n[1] M. Quigley, S. Batra, S. Gould, E. Klingbeil, Q. Le, A. Wellman, and A.Y. Ng. High-accuracy 3D sensing for mobile manipulation: improving object detection and door opening. In ICRA, 2009.\n\nFigure 5: Confusion Matrix of our CNN-RNN model. 
The ground truth labels are on the y-axis and the predicted labels on the x-axis. Many misclassifications are between (a) garlic and mushroom (b) food-box and kleenex.\n\nFigure 6: Examples of confused classes: Shampoo bottle and water bottle, mushrooms labeled as garlic, pitchers classified as caps due to shape and color similarity, white caps classified as kleenex boxes at certain angles.\n\n[2] K. Lai, L. Bo, X. Ren, and D. Fox. A Large-Scale Hierarchical Multi-View RGB-D Object Dataset. In ICRA, 2011.\n\n[3] A. Johnson. Spin-Images: A Representation for 3-D Surface Matching. PhD thesis, Robotics Institute, Carnegie Mellon University, 1997.\n\n[4] H.S. Koppula, A. Anand, T. Joachims, and A. Saxena. Semantic labeling of 3D point clouds for indoor scenes. In NIPS, 2011.\n\n[5] L. Bo, X. Ren, and D. Fox. Depth kernel descriptors for object recognition. In IROS, 2011.\n\n[6] M. Blum, J. T. Springenberg, J. W\u00fclfing, and M. Riedmiller. A Learned Feature Descriptor for Object Recognition in RGB-D Data. In ICRA, 2012.\n\n[7] L. Bo, X. Ren, and D. Fox. Unsupervised Feature Learning for RGB-D Based Object Recognition. In ISER, June 2012.\n\n[8] Y. LeCun, L. Bottou, Y. Bengio, and P. 
Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), November 1998.\n\n[9] R. Socher, C. Lin, A. Y. Ng, and C. D. Manning. Parsing Natural Scenes and Natural Language with Recursive Neural Networks. In ICML, 2011.\n\n[10] R. Socher, J. Pennington, E. H. Huang, A. Y. Ng, and C. D. Manning. Semi-Supervised Recursive Autoencoders for Predicting Sentiment Distributions. In EMNLP, 2011.\n\n[11] C. Goller and A. K\u00fcchler. Learning task-dependent distributed representations by backpropagation through structure. In Proceedings of the International Conference on Neural Networks (ICNN-96), 1996.\n\n[12] R. Socher, C. D. Manning, and A. Y. Ng. Learning continuous phrase representations and syntactic parsing with recursive neural networks. In Proceedings of the NIPS-2010 Deep Learning and Unsupervised Feature Learning Workshop, 2010.\n\n[13] A. Coates, A. Y. Ng, and H. Lee. An Analysis of Single-Layer Networks in Unsupervised Feature Learning. Journal of Machine Learning Research - Proceedings Track: AISTATS, 2011.\n\n[14] Q. V. Le, M. A. Ranzato, R. Monga, M. Devin, K. Chen, G. S. Corrado, J. Dean, and A. Y. Ng. Building high-level features using large scale unsupervised learning. In ICML, 2012.\n\n[15] K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun. What is the best multi-stage architecture for object recognition? In ICCV, 2009.\n\n[16] A. Saxe, P. W. Koh, Z. Chen, M. Bhand, B. Suresh, and A. Y. Ng. On random weights and unsupervised feature learning. In ICML, 2011.\n\n[17] K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun. What is the Best Multi-Stage Architecture for Object Recognition? In ICCV. IEEE, 2009.\n\n[18] N. Pinto, D. D. Cox, and J. J. DiCarlo. Why is real-world visual object recognition hard? PLoS Comput Biol, 2008.\n\n[19] J. B. Pollack. Recursive distributed representations. Artificial Intelligence, 46, 1990.\n\n[20] R. Socher, E. H. Huang, J. 
Pennington, A. Y. Ng, and C. D. Manning. Dynamic Pooling and Unfolding Recursive Autoencoders for Paraphrase Detection. In NIPS. MIT Press, 2011.\n\n[21] N. Silberman and R. Fergus. Indoor scene segmentation using a structured light sensor. In ICCV Workshop on 3D Representation and Recognition, 2011.\n\n[22] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool. Speeded-Up Robust Features (SURF). Computer Vision and Image Understanding, 110(3), 2008.\n\n[23] A. E. Abdel-Hakim and A. A. Farag. CSIFT: A SIFT descriptor with color invariant characteristics. In CVPR, 2006.\n\n[24] K. Grauman and T. Darrell. The Pyramid Match Kernel: Discriminative Classification with Sets of Image Features. ICCV, 2005.\n\n[25] G. Hinton and R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786), 2006.\n\n[26] Y. Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1), 2009.\n\n[27] M. Ranzato, F. J. Huang, Y. Boureau, and Y. LeCun. Unsupervised learning of invariant feature hierarchies with applications to object recognition. CVPR, 2007.\n\n[28] A. Coates and A. Ng. The Importance of Encoding Versus Training with Sparse Coding and Vector Quantization. In ICML, 2011.\n\n[29] C. Farabet, C. Couprie, L. Najman, and Y. LeCun. Scene parsing with multiscale feature learning, purity trees, and optimal covers. In ICML, 2012.\n\n[30] A. Hyv\u00e4rinen and E. Oja. Independent component analysis: algorithms and applications. Neural Netw., 13, 2000.\n\n[31] J. Ngiam, P. Koh, Z. Chen, S. Bhaskar, and A. Y. Ng. Sparse filtering. In NIPS, 2011.\n\n[32] D. C. Ciresan, U. Meier, J. Masci, L. M. Gambardella, and J. Schmidhuber. Flexible, high performance convolutional neural networks for image classification. In IJCAI, 2011.\n\n[33] C. Farabet, B. Martini, P. Akselrod, S. Talay, Y. LeCun, and E. Culurciello. 
Hardware accelerated convolutional neural networks for synthetic vision systems. In Proc. International Symposium on Circuits and Systems (ISCAS\u201910), 2010.\n", "award": [], "sourceid": 4773, "authors": [{"given_name": "Richard", "family_name": "Socher", "institution": null}, {"given_name": "Brody", "family_name": "Huval", "institution": null}, {"given_name": "Bharath", "family_name": "Bhat", "institution": null}, {"given_name": "Christopher", "family_name": "Manning", "institution": null}, {"given_name": "Andrew", "family_name": "Ng", "institution": null}]}