{"title": "Learning the 2-D Topology of Images", "book": "Advances in Neural Information Processing Systems", "page_first": 841, "page_last": 848, "abstract": "We study the following question: is the two-dimensional structure of images a very strong prior or is it something that can be learned with a few examples of natural images? If someone gave us a learning task involving images for which the two-dimensional topology of pixels was not known, could we discover it automatically and exploit it? For example suppose that the pixels had been permuted in a fixed but unknown way, could we recover the relative two-dimensional location of pixels on images? The surprising result presented here is that not only the answer is yes but that about as few as a thousand images are enough to approximately recover the relative locations of about a thousand pixels. This is achieved using a manifold learning algorithm applied to pixels associated with a measure of distributional similarity between pixel intensities. We compare different topology-extraction approaches and show how having the two-dimensional topology can be exploited.", "full_text": "Learning the 2-D Topology of Images\n\nNicolas Le Roux\n\nUniversity of Montreal\n\nYoshua Bengio\n\nUniversity of Montreal\n\nnicolas.le.roux@umontreal.ca\n\nyoshua.bengio@umontreal.ca\n\nPascal Lamblin\n\nUniversity of Montreal\n\nlamblinp@umontreal.ca\n\nMarc Joliveau\n\n\u00b4Ecole Centrale Paris\n\nmarc.joliveau@ecp.fr\n\nBal\u00b4azs K\u00b4egl\n\nLAL/LRI, University of Paris-Sud, CNRS\n\n91898 Orsay, France\n\nkegl@lal.in2p3.fr\n\nAbstract\n\nWe study the following question: is the two-dimensional structure of images a\nvery strong prior or is it something that can be learned with a few examples of\nnatural images? If someone gave us a learning task involving images for which\nthe two-dimensional topology of pixels was not known, could we discover it auto-\nmatically and exploit it? 
For example, suppose that the pixels had been permuted in a fixed but unknown way; could we recover the relative two-dimensional location of pixels on images? The surprising result presented here is that not only is the answer yes, but as few as about a thousand images are enough to approximately recover the relative locations of about a thousand pixels. This is achieved using a manifold learning algorithm applied to pixels, associated with a measure of distributional similarity between pixel intensities. We compare different topology-extraction approaches and show how having the two-dimensional topology can be exploited.\n\n1 Introduction\n\nMachine learning has been applied to a number of tasks involving an input domain with a special topology: one-dimensional for sequences, two-dimensional for images, three-dimensional for videos and for 3-D capture. Some learning algorithms are generic, e.g., working on arbitrary unstructured vectors in R^d, such as ordinary SVMs, decision trees, neural networks, and boosting applied to generic learning algorithms. On the other hand, other learning algorithms successfully exploit the specific topology of their input, e.g., SIFT-based machine vision [10], convolutional neural networks [6, 7], and time-delay neural networks [5, 16].\n\nIt has been conjectured [8, 2] that the two-dimensional structure of natural images is a very strong prior that would require a huge number of bits to specify, if starting from the completely uniform prior over all possible permutations.\n\nThe question studied here is the following: is the two-dimensional structure of natural images a very strong prior or is it something that can be learned from a few examples? If a small number of examples is enough to discover that structure, then the conjecture in [8] about the image topology was probably incorrect. 
To answer that question we consider a hypothetical learning task involving images whose pixels have been permuted in a fixed but unknown way. Could we recover the two-dimensional relations between pixels automatically? Could we exploit them to obtain better generalization? A related study performed in the context of ICA can be found in [1].\n\nThe basic idea of the paper is that the two-dimensional topology of pixels can be recovered by looking for a two-dimensional manifold embedding of the pixels (each pixel is a point in that space), such that nearby pixels have similar distributions of intensity (and possibly color) values.\n\nWe explore a number of manifold learning techniques with this goal in mind, and explain how we have adapted these techniques in order to obtain the positive and surprising result: the two-dimensional structure of pixels can be recovered from a rather small number of training images. On images we find that the first 2 dimensions are dominant, meaning that even the knowledge that 2 dimensions are most appropriate could probably be inferred from the data.\n\n2 Manifold Learning Techniques Used\n\nIn this paper we have explored the question raised in the introduction for the particular case of images, i.e., with 2-dimensional structures, and our experiments have been performed with images of size 27 × 27 to 30 × 30, i.e., with about a thousand pixels. This means that we have to look for the embedding of about a thousand points (the pixels) on a two-dimensional manifold. Metric Multi-Dimensional Scaling (MDS) is a linear embedding technique (analogous to PCA, but starting from distances and yielding coordinates along the principal directions of maximum variance). Non-parametric techniques such as Isomap [13], Local Linear Embedding (LLE) [12], or Semidefinite Embedding (SDE, also known as MVU for Maximum Variance Unfolding) [17] have computation time that scales polynomially in the number of examples n. 
With n around a thousand, all of these are feasible, and we experimented with MDS, Isomap, LLE, and MVU.\n\nSince we found Isomap to work best to recover the pixel topology even on small sets of images, we review the basic elements of Isomap. It applies the metric multidimensional scaling (MDS) algorithm to geodesic distances in the neighborhood graph. The neighborhood graph is obtained by connecting the k nearest neighbors of each point. Each arc of the graph is associated with a distance (the user-provided distance between points), and is used to compute an approximation of the geodesic distance on the manifold with the length of the shortest path between two points. The metric MDS algorithm then transforms these distances into d-dimensional coordinates as follows. It first computes the n × n dot-product (or Gram) matrix M using the “double-centering” formula, yielding entries M_ij = −(1/2) (D²_ij − (1/n) Σ_i D²_ij − (1/n) Σ_j D²_ij + (1/n²) Σ_{i,j} D²_ij). The d principal eigenvectors v_k and eigenvalues λ_k (k = 1, …, d) of M are then computed. This yields the coordinates: x_ik = v_ki √λ_k is the k-th embedding coordinate of point i.\n\n3 Topology-Discovery Algorithms\n\nIn order to apply a manifold learning algorithm, we must generally have a notion of similarity or distance between the points to embed. Here each point corresponds to a pixel, and the data we have about the pixels provide an empirical distribution of intensities for each pixel. Therefore we want to estimate the statistical dependency between two pixels, in order to determine if they should be “neighbors” on the manifold. 
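The Isomap computation reviewed above (k-nearest-neighbor graph, shortest-path geodesic distances, then double-centered metric MDS) can be sketched as follows. This is a minimal illustration, not the authors' implementation; it assumes SciPy's `shortest_path` for the geodesics and a connected neighborhood graph (the paper notes Isomap cannot handle a disconnected one).

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path

def isomap(D, k=4, d=2):
    """Minimal Isomap sketch: D is an (n, n) symmetric pseudo-distance matrix."""
    n = D.shape[0]
    # Build the k-nearest-neighbor graph: keep the k smallest distances per row.
    G = np.full((n, n), np.inf)
    idx = np.argsort(D, axis=1)[:, 1:k + 1]          # skip self (distance 0)
    rows = np.repeat(np.arange(n), k)
    G[rows, idx.ravel()] = D[rows, idx.ravel()]
    G = np.minimum(G, G.T)                           # symmetrize the graph
    # Geodesic distances = shortest paths in the neighborhood graph.
    # If the graph is disconnected, some entries stay infinite and MDS fails.
    DG = shortest_path(G, method='D', directed=False)
    # Classical metric MDS on the geodesics, via double centering:
    # M = -1/2 J D^2 J with J = I - (1/n) 11', matching the formula in the text.
    D2 = DG ** 2
    J = np.eye(n) - np.ones((n, n)) / n
    M = -0.5 * J @ D2 @ J
    w, v = np.linalg.eigh(M)
    order = np.argsort(w)[::-1][:d]                  # d largest eigenvalues
    return v[:, order] * np.sqrt(np.maximum(w[order], 0.0))
```

On a set of points that truly lie on a 2-D grid, the two leading eigenvalues dominate, which is the basis for the paper's remark that the dimensionality itself could be inferred from the data.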
A simple and natural dependency statistic is the correlation between pixel intensities, and it works very well.\n\nThe empirical correlation ρ_ij between the intensity of pixel i and pixel j is in the interval [−1, 1]. However, two highly anti-correlated pixels are much more likely to be close than two uncorrelated pixels (think of edges in an image). We should thus consider the absolute value of the correlations. If we assume it to be the value of a Gaussian kernel,\n\n|ρ_ij| = K(x_i, x_j) = e^{−(1/2) ‖x_i − x_j‖²},\n\nthen by defining D_ij = ‖x_i − x_j‖ and solving the above for D_ij we obtain a “distance” formula that can be used with the manifold learning algorithms:\n\nD_ij = √(−log |ρ_ij|).    (1)\n\nNote that scaling the distances in the Gaussian kernel by a variance parameter would only scale the resulting embedding, so it is unnecessary.\n\nMany other measures of distance would probably work as well. However, we found the absolute correlation to be simple and easy to understand while yielding nice embeddings.\n\n3.1 Dealing With Low-Variance Pixels\n\nA difficulty we observed in experimenting with different manifold learning algorithms on data sets such as MNIST is the influence of low-variance pixels. On MNIST digit images the border pixels may have zero or very small variance. This makes them all want to be close to each other, which tends to fold the manifold on itself.\n\nTo handle this problem we have simply ignored pixels with very low variance. When these represent a fixed background (as in MNIST images), this strategy works fine. In the experiments with MNIST we removed pixels with standard deviation less than 15% of the maximum standard deviation (maximum over all pixels). 
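The pseudo-distance of equation (1), combined with the low-variance filtering of Section 3.1, can be sketched in a few lines. This is an illustrative sketch (the clipping constant guarding against log(0) is our own addition, not from the paper):

```python
import numpy as np

def correlation_distances(X, delta=0.15):
    """Pseudo-distances D_ij = sqrt(-log |rho_ij|) from a data matrix X of
    shape (n_examples, n_pixels), after dropping low-variance pixels."""
    sigma = X.std(axis=0)
    keep = sigma >= delta * sigma.max()       # drop near-constant pixels (Sec. 3.1)
    Xk = X[:, keep]
    rho = np.corrcoef(Xk, rowvar=False)       # pixel-by-pixel empirical correlations
    a = np.clip(np.abs(rho), 1e-12, 1.0)      # guard against log(0) and rounding > 1
    D = np.sqrt(-np.log(a))
    np.fill_diagonal(D, 0.0)
    return D, keep
```

Strongly (anti-)correlated pixel pairs get a small pseudo-distance, uncorrelated pairs a large one, which is exactly what the neighborhood graph construction needs.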
On the NORB dataset, which has varied backgrounds, this step does not remove any of the pixels (so it is unnecessary there).\n\n4 Converting Back to a Grid Image\n\nOnce we have obtained an embedding for the pixels, the next thing we would like to do is to transform the data vectors back into images. For this purpose we have performed the following two steps:\n\n1. Choosing horizontal and vertical axes (since the coordinates on the manifold can be arbitrarily rotated), and rotating the embedding coordinates accordingly, and\n\n2. Transforming the input vector of intensity values (along with the pixel coordinates) into an ordinary discrete image on a grid. This should be done so that the resulting intensity at position (i, j) is close to the intensity values associated with input pixels whose embedding coordinates are near (i, j).\n\nSuch a mapping of pixels to a grid has already been done in [4], where a grid topology is defined by the connections in a graphical model, which is then trained by maximizing the approximate likelihood. However, they do not start from a continuous embedding, but from the original data.\n\nLet p_k (k = 1 … N) be the embedding coordinates found by the dimensionality reduction algorithm for the k-th input variable. We select the horizontal axis as the direction of smaller spread, the vertical axis being in the orthogonal direction, and perform the appropriate rotation.\n\nOnce we have a coordinate system that assigns a 2-dimensional position p_k to the k-th input pixel, placed at irregular locations inside a rectangular grid, we can map the input intensities x_k into intensities M_i,j, so as to obtain a regular image that can be processed by standard image-processing and machine vision learning algorithms. 
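The axis-selection step (rotate the embedding so that the direction of smaller spread becomes the horizontal axis) amounts to projecting the centered coordinates onto the eigenvectors of their covariance. A minimal sketch, under our own convention that the small-spread eigenvector becomes the first output column:

```python
import numpy as np

def align_axes(p):
    """Rotate 2-D embedding coordinates p (N, 2) so that the direction of
    smaller spread becomes the horizontal axis, as described in the text."""
    c = p - p.mean(axis=0)
    cov = np.cov(c, rowvar=False)
    w, v = np.linalg.eigh(cov)   # eigenvalues in ascending order
    # Columns of v: [smaller-spread direction, larger-spread direction].
    q = c @ v                    # q[:, 0] = horizontal, q[:, 1] = vertical
    return q
```

Note that this leaves a residual ambiguity (reflections and 90-degree choices), which is why the quantitative evaluation later in the paper aligns embeddings with a full similarity transformation before computing the RMSE.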
The output image pixel intensity M_i,j at coordinates (i, j) is obtained through a convex average\n\nM_i,j = Σ_k w_i,j,k x_k    (2)\n\nwhere the weights are non-negative and sum to one, and are chosen with an exponential of the L1 distance to give less weight to farther points:\n\nw_i,j,k = v_i,j,k / Σ_k v_i,j,k,   with   v_i,j,k = exp(−γ ‖(i, j) − p_k‖₁) · 1_{N(i,j,k)},    (3)\n\nwhere N(i, j, k) is true if ‖(i, j) − p_k‖₁ < 2 (or less than a larger radius, to make sure that at least one input pixel k is associated with output grid position (i, j)). We used γ = 3 in the experiments, after trying only 1, 3 and 10. Large values of γ correspond to using only the nearest neighbor of (i, j) among the p_k. Smaller values smooth the intensities and make the output look better if the embedding is not perfect. Too small values result in a loss of effective resolution.\n\nAlgorithm 1 Pseudo-code of the topology-learning algorithm that recovers the 2-D structure of inputs provided in an arbitrary but fixed order.\nInput: X {raw n × N input data matrix, one row per example, with elements in fixed but arbitrary order}\nInput: δ = 0.15 (default value) {minimum relative standard deviation threshold, to remove too-low-variance pixels}\nInput: k = 4 (default value) {number of neighbors used to build the Isomap neighborhood graph}\nInput: L = √N, W = √N (default values) {dimensions (length L, width W) of the output image}\nInput: γ = 3 (default value) {smoothing coefficient to recover images}\nOutput: p {N × 2 matrix of embedding coordinates (one per row) for each input variable}\nOutput: w {convolution weights to recover an image from a raw input vector}\n\nn = number of examples (rows of X)\nfor all columns X_:i do\n  μ_i ← (1/n) Σ_t X_ti {compute means}\n  σ²_i ← (1/n) Σ_t (X_ti − μ_i)² {compute variances}\nend for\nRemove columns of X for which σ_i / max_j σ_j < δ\nfor all columns X_:i do\n  for all columns X_:j do\n    empirical correlation ρ_ij = (X_:i − μ_i)ᵀ(X_:j − μ_j) / (n σ_i σ_j) {compute all pairwise empirical correlations}\n    pseudo-distances D_ij = √(−log |ρ_ij|)\n  end for\nend for\n{compute the 2-D embeddings (p_k1, p_k2) of each input variable k through Isomap}\np = Isomap(D, k, 2)\n{rotate the coordinates p to try to align them to a vertical-horizontal grid (see text)}\n{invert the axes if L < W}\n{compute the convolution weights that will map raw values to output image pixel intensities}\nfor all grid positions (i, j) in the output image (i in 1 … L, j in 1 … W) do\n  r = 1\n  repeat\n    neighbors ← {k : ‖p_k − (i, j)‖₁ < r}\n    r ← r + 1\n  until neighbors not empty\n  for all k in neighbors do\n    v_k ← e^{−γ ‖p_k − (i, j)‖₁}\n  end for\n  w_i,j,: ← 0\n  for all k in neighbors do\n    w_i,j,k ← v_k / Σ_k v_k {compute convolution weights}\n  end for\nend for\n\nAlgorithm 2 Convolve a raw input vector into a regular grid image, using the already discovered embedding for each input variable.\nInput: x {raw input N-vector (in the same format as a row of X above)}\nInput: p {N × 2 matrix of embedding coordinates (one per row) for each input variable}\nInput: w {convolution weights to recover an image from a raw input vector}\nOutput: Y {L × W output image}\n\nfor all grid positions (i, j) in the output image (i in 1 … L, j in 1 … W) do\n  Y_i,j ← Σ_k w_i,j,k x_k {perform the convolution}\nend for\n\n5 Experimental Results\n\nWe performed experiments on two sets of images: the MNIST digits dataset and the NORB object classification dataset.¹ We used the “jittered objects and cluttered background” image set from NORB. The MNIST images are particular in that they have a white background, whereas the NORB images have more varying backgrounds. 
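The weight computation of equations (2)-(3) and the convolution of Algorithm 2 can be sketched as follows. This is an illustrative sketch, not the authors' code; the decaying sign on the exponent follows the text's statement that farther points get less weight, and the starting radius of 2 follows the definition of N(i, j, k):

```python
import numpy as np

def grid_weights(p, L, W, gamma=3.0):
    """Convolution weights mapping N raw inputs at embedding positions p (N, 2)
    onto an L x W grid, following equations (2)-(3): exponentially decaying
    L1-distance weights, normalized to sum to one for each grid cell."""
    N = p.shape[0]
    w = np.zeros((L, W, N))
    for i in range(L):
        for j in range(W):
            d = np.abs(p - np.array([i, j], dtype=float)).sum(axis=1)  # L1 distances
            r = 2.0
            while not (d < r).any():      # grow the radius until non-empty
                r += 1.0
            v = np.exp(-gamma * d) * (d < r)
            w[i, j] = v / v.sum()         # non-negative weights summing to one
    return w

def to_grid_image(x, w):
    """Algorithm 2: convolve a raw input vector x (N,) into an L x W image."""
    return np.tensordot(w, x, axes=([2], [0]))
```

With a large γ the convolution degenerates to nearest-neighbor lookup, matching the paper's remark; smaller values of γ blend neighboring inputs and smooth over embedding errors.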
The NORB images are originally of dimension 108 × 108; we subsampled them by 4 × 4 averaging into 27 × 27 images. The experiments have been performed with k = 4 neighbors for the Isomap embedding. Smaller values of k often led to unconnected neighborhood graphs, which Isomap cannot deal with.\n\n(a) Isomap embedding   (b) LLE embedding   (c) MDS embedding   (d) MVU embedding\n\nFigure 1: Examples of embeddings discovered by Isomap, LLE, MDS and MVU with 250 training images from NORB. Each of the original pixels is placed at the location discovered by the algorithm. The size of the circle and the gray level indicate the original true location of the pixel. Manifold learning produces coordinates with an arbitrary rotation. Isomap appears to be the most robust method, and MDS the worst, for this task.\n\nIn Figure 1 we compare four different manifold learning algorithms on the NORB images: Isomap, LLE, MDS and MVU. Figure 2 explains why Isomap is giving good results, especially in comparison with MDS. On the one hand, MDS is using the pseudo-distance defined in equation 1, whose relationship with the real distance between two pixels in the original image is linear only in a small neighborhood. On the other hand, Isomap uses the geodesic distances in the neighborhood graph, whose relationship with the real distance is very close to linear.\n\nFigure 2: (a) and (c): Pseudo-distance D_ij (using formula 1) vs. the true distance on the grid.\n(b) and (d): Geodesic distance in neighborhood graph vs. 
the true distance on the grid.\nThe true distance is on the horizontal axis for all figures.\n(a) and (b) are for a point in the upper-left corner, (c) and (d) for a point in the center.\n\nFigure 3 shows the embeddings obtained on the NORB data using different numbers of examples. In order to quantitatively evaluate the reconstruction, we applied to each embedding the similarity transformation that minimizes the Root Mean Squared Error (RMSE) between the coordinates of each pixel on the embedding and their coordinates on the original grid, before measuring the residual error. This minimization is justified because the discovered embedding could be arbitrarily rotated, isotropically scaled, and mirrored. 100 examples are enough to get a reasonable embedding, and with 2000 or more a very good embedding is obtained: the RMSE for 2000 examples is 1.13, meaning that in expectation each pixel is off by slightly more than one pixel width.\n\n¹Both can be obtained from Yann Le Cun’s web site: http://yann.lecun.com/.\n\nFigure 3: Embedding discovered by Isomap on the NORB dataset, with different numbers of training samples (top row). The second row shows the same embeddings aligned (by a similarity transformation) on the original grid; the third row shows the residual error (RMSE) after the alignment: 9.25 (10 examples), 2.43 (50 examples), 1.68 (100 examples), 1.21 (1000 examples), 1.13 (2000 examples).\n\nFigure 4 shows the whole process of transforming an original image (with pixels possibly permuted) into an embedded image and finally into a reconstructed image, as per Algorithms 1 and 2.\n\nFigure 4: Example of the process of transforming an MNIST image (top) from which the pixel order is unknown (second row) into its embedding (third row), finally reconstructed as an image after rotation and convolution (bottom). 
In the third row, we show the intensity associated with each original pixel by the grey level of a circle located at the pixel coordinates discovered by Isomap.\n\nWe also performed experiments with acoustic spectral data to see if the time-frequency topology can be recovered. The acoustic data come from the first 100 blues pieces of a publicly available genre classification dataset [14]. The FFT is computed for each frame, and there are 86 frames per second. The first 30 frequency bands are kept, each covering 21.51 Hz. We used examples formed by 30-frame spectrograms, i.e., just like images of size 30 × 30. Using the first 600,000 audio samples from each recording yielded 2600 30-frame images, on which we applied our technique. Figure 5 shows the resulting embedding when we removed the 30 coordinates of lowest standard deviation (δ = 0.15).\n\n(a) Blues embedding   (b) Spectrum\n\nFigure 5: Embedding and spectrum decay for sequences of blues music. The spectrum panel shows the eigenvalues and the ratio of consecutive eigenvalues for the first 10 components.\n\n6 Discussion\n\nAlthough [8] argue that learning the right permutation of pixels with a flat prior might be too difficult (either in a lifetime or through evolution), our results suggest otherwise.\n\nHow do we interpret that apparent contradiction?\n\nThe main element of explanation that we see is that the space of permutations of d numbers is not such a large class of functions. There are approximately N = √(2πd) (d/e)^d permutations of d numbers (Stirling approximation). Since this is a finite class of functions, its VC-dimension [15] is\n\nh = log N ≈ d log d − d.\n\nHence if we had a bounded criterion (say taking values in [0, 1]) to compare different permutations and we used n examples (i.e., n images, here), we would expect the difference between generalization error and training error to be bounded [15] by 2 √(2 log(N/η) / n) with probability 1 − η. Hence, with n a multiple of d log d, we would expect that one could approximately learn a good permutation. When d = 400 (the number of pixels with non-negligible variance in MNIST images), d log d − d ≈ 2000. This is more than what we have found necessary to recover a “good” representation of the images, but on the other hand there are equivalence classes within the set of permutations that give as good results as far as our objective and subjective criteria are concerned: we do not care about image symmetries, rotations, and small errors in pixel placement.\n\nWhat is the selection criterion that we have used to recover the image structure? Mainly we have used an additional prior which gives a preference to an order for which nearby pixels have similar distributions. How specific to natural images and how strong is that prior? This may be an application of a more general principle that could be advantageous to learning algorithms as well as to brains. When we are trying to compute useful functions from raw data, it is important to discover dependencies between the input random variables. If we are going to perform computations on subsets of variables at a time (which would seem necessary when the number of inputs is very large, to reduce the amount of connecting hardware), it would seem wiser that these computations combine variables that have dependencies with each other. 
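The Stirling-based count above is easy to sanity-check numerically. A small verification of the d = 400 arithmetic (the constants here are ours, chosen only to check the claim):

```python
import math

d = 400                              # pixels with non-negligible variance on MNIST
log_N = d * math.log(d) - d          # Stirling: log d! ~ d log d - d, about 2000
exact = math.lgamma(d + 1)           # exact log(d!) for comparison
# The Stirling approximation is within a few units of the exact value here,
# so "d log d - d is approximately 2000" holds as stated in the text.
```

The next-order Stirling term, (1/2) log(2πd), accounts for almost all of the small remaining gap.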
That directly gives rise to the notion of local connectivity between neurons associated with nearby spatial locations, in the case of brains, the same notion that is exploited in convolutional neural networks.\n\nThe fact that nearby pixels are more correlated is true at many scales in natural images. This is well known and explains why Gabor-like filters often emerge when trying to learn good filters for images, e.g., by ICA [9] or Products of Experts [3, 11].\n\nIn addition to the above arguments, there is another important consideration to keep in mind. The way in which we score permutations is not the way one would score functions in an ordinary learning experiment. Indeed, by using the distributional similarity between pairs of pixels, we get not just a scalar score but d(d−1)/2 scores. Since our “scoring function” is much more informative, it is not surprising that it allows us to generalize from many fewer examples.\n\n7 Conclusion and Future Work\n\nWe showed here that, even with a small number of examples, we are able to recover almost perfectly the 2-D topology of images. This allows us to use image-specific learning algorithms without specifying any prior other than the dimensionality of the coordinates. We also showed that this algorithm performs well on sound data, even though the topology might be less obvious in that case.\n\nHowever, in this paper we only considered the simple case where we knew in advance the dimensionality of the coordinates. One could also apply this algorithm to data whose intrinsic dimensionality is unknown. In that case, one would not convert the embedding to a grid image but rather keep it and connect only the inputs associated with close coordinates (performing a k-nearest-neighbor search, for instance). 
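The k-nearest-neighbor connectivity suggested for the unknown-dimensionality case can be sketched directly from the embedding coordinates. A minimal sketch (our own illustration of the suggestion, not code from the paper):

```python
import numpy as np

def knn_connectivity(p, k=4):
    """Connect each input variable to the k others with the closest embedding
    coordinates p (N, m). Returns a symmetric boolean (N, N) adjacency matrix."""
    N = p.shape[0]
    d = np.linalg.norm(p[:, None, :] - p[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                   # never select self
    nbrs = np.argsort(d, axis=1)[:, :k]
    A = np.zeros((N, N), dtype=bool)
    A[np.repeat(np.arange(N), k), nbrs.ravel()] = True
    return A | A.T   # edge if either endpoint selects the other
```

Such a graph could then define the local receptive fields of a convolutional-style architecture even when the inputs do not live on a regular grid.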
It is not known whether such an embedding would be useful for other types of data than the ones discussed above.\n\nAcknowledgements\n\nThe authors would like to thank James Bergstra for helping with the audio data. They also want to acknowledge the support of several funding agencies: NSERC, the Canada Research Chairs, and the MITACS network.\n\nReferences\n[1] S. Abdallah and M. Plumbley. Geometry dependency analysis. Technical Report C4DM-TR06-05, Center for Digital Music, Queen Mary, University of London, 2006.\n\n[2] Y. Bengio and Y. Le Cun. Scaling learning algorithms towards AI. In L. Bottou, O. Chapelle, D. DeCoste, and J. Weston, editors, Large Scale Kernel Machines. MIT Press, 2007.\n\n[3] G. Hinton, M. Welling, Y. Teh, and S. Osindero. A new view of ICA. In Proceedings of ICA-2001, San Diego, CA, 2001.\n\n[4] A. Hyvärinen, P. O. Hoyer, and M. Inki. Topographic independent component analysis. Neural Computation, 13(7):1527–1558, 2001.\n\n[5] K. J. Lang and G. E. Hinton. The development of the time-delay neural network architecture for speech recognition. Technical Report CMU-CS-88-152, Carnegie-Mellon University, 1988.\n\n[6] Y. LeCun, B. Boser, J. Denker, D. Henderson, R. Howard, W. Hubbard, and L. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, 1989.\n\n[7] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, November 1998.\n\n[8] Y. LeCun and J. S. Denker. Natural versus universal probability, complexity, and entropy. In IEEE Workshop on the Physics of Computation, pages 122–127. IEEE, 1992.\n\n[9] T.-W. Lee and M. S. Lewicki. Unsupervised classification, segmentation and enhancement of images using ICA mixture models. IEEE Trans. Image Proc., 11(3):270–279, 2002.\n\n[10] D. Lowe. Distinctive image features from scale-invariant keypoints. 
International Journal of Computer Vision, 60(2):91–110, 2004.\n\n[11] S. Osindero, M. Welling, and G. Hinton. Topographic product models applied to natural scene statistics. Neural Computation, 18:381–414, 2005.\n\n[12] S. Roweis and L. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, Dec. 2000.\n\n[13] J. Tenenbaum, V. de Silva, and J. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, Dec. 2000.\n\n[14] G. Tzanetakis and P. Cook. Musical genre classification of audio signals. IEEE Transactions on Speech and Audio Processing, 10(5):293–302, Jul 2002.\n\n[15] V. Vapnik. Estimation of Dependences Based on Empirical Data. Springer-Verlag, Berlin, 1982.\n\n[16] A. Waibel. Modular construction of time-delay neural networks for speech recognition. Neural Computation, 1:39–46, 1989.\n\n[17] K. Q. Weinberger and L. K. Saul. An introduction to nonlinear dimensionality reduction by maximum variance unfolding. In Proceedings of the National Conference on Artificial Intelligence (AAAI), Boston, MA, 2006.\n", "award": [], "sourceid": 925, "authors": [{"given_name": "Nicolas", "family_name": "Roux", "institution": null}, {"given_name": "Yoshua", "family_name": "Bengio", "institution": null}, {"given_name": "Pascal", "family_name": "Lamblin", "institution": null}, {"given_name": "Marc", "family_name": "Joliveau", "institution": null}, {"given_name": "Bal\u00e1zs", "family_name": "K\u00e9gl", "institution": null}]}