{"title": "Semi-Supervised Learning in Gigantic Image Collections", "book": "Advances in Neural Information Processing Systems", "page_first": 522, "page_last": 530, "abstract": "With the advent of the Internet it is now possible to collect hundreds of millions of images. These images come with varying degrees of label information. “Clean labels” can be manually obtained on a small fraction, “noisy labels” may be extracted automatically from surrounding text, while for most images there are no labels at all. Semi-supervised learning is a principled framework for combining these different label sources. However, it scales polynomially with the number of images, making it impractical for use on gigantic collections with hundreds of millions of images and thousands of classes. In this paper we show how to utilize recent results in machine learning to obtain highly efficient approximations for semi-supervised learning that are linear in the number of images. Specifically, we use the convergence of the eigenvectors of the normalized graph Laplacian to eigenfunctions of weighted Laplace-Beltrami operators. We combine this with a label sharing framework obtained from Wordnet to propagate label information to classes lacking manual annotations. Our algorithm enables us to apply semi-supervised learning to a database of 80 million images with 74 thousand classes.", "full_text": "Semi-supervised Learning in Gigantic Image Collections

Rob Fergus, Courant Institute, NYU, 715 Broadway, New York, NY 10003, fergus@cs.nyu.edu
Yair Weiss, School of Computer Science, Hebrew University, 91904, Jerusalem, Israel, yweiss@huji.ac.il
Antonio Torralba, CSAIL, EECS, MIT, 32 Vassar St., Cambridge, MA 02139, torralba@csail.mit.edu

Abstract

With the advent of the Internet it is now possible to collect hundreds of millions of images. These images come with varying degrees of label information.
“Clean labels” can be manually obtained on a small fraction, “noisy labels” may be extracted automatically from surrounding text, while for most images there are no labels at all. Semi-supervised learning is a principled framework for combining these different label sources. However, it scales polynomially with the number of images, making it impractical for use on gigantic collections with hundreds of millions of images and thousands of classes. In this paper we show how to utilize recent results in machine learning to obtain highly efficient approximations for semi-supervised learning that are linear in the number of images. Specifically, we use the convergence of the eigenvectors of the normalized graph Laplacian to eigenfunctions of weighted Laplace-Beltrami operators. Our algorithm enables us to apply semi-supervised learning to a database of 80 million images gathered from the Internet.

1 Introduction

Gigantic quantities of visual imagery are present on the web and in off-line databases. Effective techniques for searching and labeling this ocean of images and video must address two conflicting problems: (i) understanding the visual content of an image, and (ii) scaling to millions or billions of images or video frames. Both aspects have received significant attention from researchers, the former being addressed by recent work on object and scene recognition, while the latter is the focus of the content-based image retrieval (CBIR) community [7]. A key issue pertaining to both aspects of the problem is the diversity of label information accompanying real-world image data. A variety of collaborative and online annotation efforts have attempted to build large collections of human-labeled images, ranging from simple image classifications, to bounding boxes and precise pixel-level segmentation [16, 21, 24].
While impressive, these manual efforts have no hope of scaling to the many billions of images on the Internet. However, even though most images on the web lack human annotation, they often have some kind of noisy label gleaned from nearby text or from the image filename, and often this gives a strong cue about the content of the image. Finally, there are images where we have no information beyond the pixels themselves. Semi-supervised learning (SSL) methods are designed to handle this spectrum of label information [26, 28]. They rely on the density structure of the data itself to propagate known labels to areas lacking annotations, and provide a natural way to incorporate labeling uncertainty. However, to model the density of the data, each point must measure its proximity to every other. This requires polynomial time – prohibitive for large-scale problems.

In this paper, we introduce a semi-supervised learning scheme that is linear in the number of images, enabling us to tackle very large scale problems. Building on recent results in spectral graph theory, we efficiently construct accurate numerical approximations to the eigenvectors of the normalized graph Laplacian. Using these approximations, we can easily propagate labels through huge collections of images.

1.1 Related Work

Cleaning up Internet image data has been explored by several authors: Berg et al. [4], Fergus et al. [8], Li et al. [13], Vijayanarasimhan et al. [22], amongst others. Unlike our approach, these methods operate independently on each class and would be problematic to scale to millions or billions of images. A related group of techniques uses active labeling, e.g. [10]. Semi-supervised learning is a rapidly growing sub-field of machine learning, dealing with datasets that have a large number of unlabeled points and a much smaller number of labeled points (see [5] for a recent overview).
The most popular approaches are based on the graph Laplacian (e.g. [26, 28]), and there has been much theoretical work devoted to the asymptotics of these Laplacians [3, 6, 14]. However, these methods require the explicit manipulation of an n × n Laplacian matrix (n being the number of data points); for example, [2] notes: “our algorithms compute the inverse of a dense Gram matrix which leads to O(n^3) complexity. This may be impractical for large datasets.”

The large computational complexity of standard graph Laplacian methods has led to a number of recent papers on efficient semi-supervised learning (see [27] for an overview). Many of these methods (e.g. [18, 12, 29, 25]) are based on calculating the Laplacian only for a smaller, backbone, graph, which reduces the complexity to be cubic in the size of the small graph. In most cases [18, 12] the smaller graph is built simply by randomly subsampling a subset of the points, while in [29] a mixture model is learned on the original dataset and each mixture component defines a node in the backbone graph. In [25] the backbone graph is found using non-negative matrix factorization. In [9] the backbone graph is a uniform grid over the high-dimensional space (so the number of nodes grows exponentially with dimension). In [20] the number of datapoints is not reduced but rather the number of edges; this allows the use of sparse numerical algebra techniques.

The problem with approaches based on backbone graphs is that the spectrum of the graph Laplacian can change dramatically with different backbone construction methods [12]. This can also be seen visually (see Fig. 3) by examining the clusterings suggested by the full data and a small subsample. Even in cases where the “correct” clustering is obvious when the full data is considered, the smaller subset may suggest erroneous clusterings (e.g. Fig. 3(left)). In our approach, we take an alternative route.
Rather than trying to reduce the number of points, we take the limit as the number of points goes to infinity.

2 Semi-supervised Learning

We start by introducing semi-supervised learning in a graph setting and then describe an approximation that reduces the learning time from polynomial to linear in the number of images. Fig. 1 illustrates the semi-supervised learning problem. Following the notation of Zhu et al. [28], we are given a labeled dataset of input-output pairs (X_l, Y_l) = {(x_1, y_1), ..., (x_l, y_l)} and an unlabeled dataset X_u = {x_{l+1}, ..., x_n}. Thus in Fig. 1(a) we are given two labeled points and 500 unlabeled points. Fig. 1(b) shows the output of a nearest-neighbor classifier on the unlabeled points. The purely supervised solution ignores the apparent clustering suggested by the data.

In order to use the unlabeled data, we form a graph G = (V, E) where the vertices V are the datapoints x_1, ..., x_n, and the edges E are represented by an n × n matrix W. Entry W_ij is the edge weight between nodes i, j, and a common practice is to set W_ij = exp(−‖x_i − x_j‖^2 / 2ε^2). Let D be a diagonal matrix whose diagonal elements are given by D_ii = Σ_j W_ij; the combinatorial graph Laplacian is defined as L = D − W, which is also called the unnormalized Laplacian.

In graph-based semi-supervised learning, the graph Laplacian L is used to define a smoothness operator that takes into account the unlabeled data. The main idea is to find functions f which agree with the labeled data but are also smooth with respect to the graph. The smoothness is measured by the graph Laplacian:

f^T L f = (1/2) Σ_{i,j} W_ij (f(i) − f(j))^2

Of course, simply minimizing smoothness can be achieved by the trivial solution f = 1, but in semi-supervised learning we minimize a combination of the smoothness and the training loss.
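The graph construction just described can be written down directly; the following NumPy sketch (ours, not the authors' code) builds W, D and L and checks the smoothness identity above:

```python
import numpy as np

# Sketch: affinity matrix W, degree matrix D, combinatorial Laplacian L = D - W,
# then verify f^T L f = 1/2 * sum_ij W_ij (f_i - f_j)^2.
def build_laplacian(X, eps):
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-sq_dists / (2 * eps ** 2))   # W_ij = exp(-||x_i - x_j||^2 / 2 eps^2)
    D = np.diag(W.sum(axis=1))               # D_ii = sum_j W_ij
    return D - W, W

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
L, W = build_laplacian(X, eps=1.0)

f = rng.normal(size=50)
lhs = f @ L @ f
rhs = 0.5 * ((f[:, None] - f[None, :]) ** 2 * W).sum()
assert np.allclose(lhs, rhs)
```

Note that forming W is already quadratic in n, in both time and memory, which is exactly the cost that the eigenfunction approximation of Section 2.1 is designed to avoid.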
For squared error training loss, this is simply:

J(f) = f^T L f + Σ_{i=1}^{l} λ (f(i) − y_i)^2 = f^T L f + (f − y)^T Λ (f − y)

Figure 1: Comparison of supervised and semi-supervised learning on toy data (panels (a)–(c): Data, Supervised, Semi-Supervised). Semi-supervised learning seeks functions that are smooth with respect to the input density.

Figure 2: Left: The three generalized eigenvectors of the graph Laplacian for the toy data (φ_1, σ_1 = 0; φ_2, σ_2 = 0.0002; φ_3, σ_3 = 0.038). Note that the semi-supervised solution can be written as a linear combination of these eigenvectors (in this case, the second eigenvector is enough). Using generalized eigenvectors (or equivalently normalized Laplacians) increases robustness of the first eigenvectors, compared to using the un-normalized eigenvectors. Right: The 2D density of the toy data, and the associated smoothness eigenfunctions defined by that density (Φ_1, σ_1 = 0; Φ_2, σ_2 = 0.0002; Φ_3, σ_3 = 0.035). The plots use the Matlab jet colormap.

where Λ is a diagonal matrix whose diagonal elements are Λ_ii = λ if i is a labeled point and Λ_ii = 0 for unlabeled points. The minimizer is of course a solution to (L + Λ)f = Λy. Fig. 1(c) shows the semi-supervised solution.

Although the solution can be given in closed form for the squared error loss, note that it requires solving an n × n system of linear equations. For large n this poses serious problems with computation time and robustness. But as suggested in [5, 17, 28], the dimension of the problem can be reduced dramatically by only working with a small number of eigenvectors of the Laplacian.

Let Φ_i, σ_i be the generalized eigenvectors and eigenvalues of the graph Laplacian L (solutions to LΦ_i = σ_i DΦ_i).
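These generalized eigenvectors can be computed with a standard symmetric-definite eigensolver; a small illustrative sketch (ours, using scipy, on two-cluster toy data in the spirit of Fig. 1):

```python
import numpy as np
from scipy.linalg import eigh

# Sketch: generalized eigenvectors of the graph Laplacian, i.e. solutions of
# L phi = sigma * D phi, for two well-separated clusters.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 0.3, size=(40, 2)),
               rng.normal(+2, 0.3, size=(40, 2))])
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
W = np.exp(-sq / (2 * 0.5 ** 2))
D = np.diag(W.sum(axis=1))
L = D - W

sigma, Phi = eigh(L, D)   # generalized eigenvalues, ascending order
# The first eigenvector is the constant (sigma_1 = 0); the second, smoothest
# non-trivial one separates the two clusters by sign.
s = np.sign(Phi[:, 1])
assert abs(sigma[0]) < 1e-8
assert (s[:40] == s[0]).all() and (s[40:] == s[40]).all() and s[0] != s[40]
```

This dense solve is cubic in n, which is what motivates replacing eigenvectors by eigenfunctions below.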
Note that the smoothness of an eigenvector Φ_i is simply Φ_i^T L Φ_i = σ_i, so that eigenvectors with smaller eigenvalues are smoother. Since any vector in R^n can be written f = Σ_i α_i Φ_i, the smoothness of a vector is simply Σ_i α_i^2 σ_i, so that smooth vectors will be linear combinations of the eigenvectors with small eigenvalues. (This discussion holds for both ordinary and generalized eigenvectors, but the latter are much more stable and we use them.)

Fig. 2(left) shows the three generalized eigenvectors of the Laplacian for the toy data shown in Fig. 1(a). Note that the semi-supervised solution (Fig. 1(c)) is a linear combination of these three eigenvectors (in fact just one eigenvector is enough). In general, we can significantly reduce the dimension of f by requiring it to be of the form f = Uα, where U is an n × k matrix whose columns are the k eigenvectors with smallest eigenvalues. We now have:

J(α) = α^T Σ α + (Uα − y)^T Λ (Uα − y)

The minimizing α is now a solution to the k × k system of equations:

(Σ + U^T Λ U) α = U^T Λ y    (1)

2.1 From Eigenvectors to Eigenfunctions

Given the eigenvectors of the graph Laplacian, we can now solve the semi-supervised problem in a reduced dimensional space. But to find the eigenvectors in the first place, we need to diagonalize an n × n matrix. How can we efficiently calculate the eigenvectors as the number of unlabeled points increases?

We follow [23, 14] in assuming the data x_i ∈ R^d are samples from a distribution p(x) and analyzing the eigenfunctions of the smoothness operator defined by p(x). Fig. 2(right) shows the density in two
This density de\ufb01nes a weighted smoothness operator on any function\nF (x) de\ufb01ned on Rd which we will denote by Lp(F ):\n\nLp(F ) =\n\n1\n\n2 Z (F (x1) \u2212 F (x2))2W (x1, x2)p(x1)p(x2)dx1x2\n\nwith W (x1, x2) = exp(\u2212kx1 \u2212 x2k2/2\u01eb2). Just as the graph Laplacian de\ufb01ned eigenvectors of in-\ncreasing smoothness, the smoothness operator will de\ufb01ne eigenfunctions of increasing smoothness.\nWe de\ufb01ne the \ufb01rst eigenfunction of LP (f ) by a minimization problem:\n\n\u03a61 = arg\n\nmin\n\nLp(F )\n\nF :R F 2(x)p(x)D(x)dx=1\n\nwhere D(x) = Rx2\n\nW (x, x2)p(x2)dx2. Note that the \ufb01rst eigenfunction will always be the trivial\nfunction \u03a6(x) = 1 since it has maximal smoothness LP (1) = 0. The second eigenfunction of\nLp(f ) minimizes the same problem, with the additional constraint that it be orthogonal to the \ufb01rst\n\neigenfunction R F (x)\u03a61(x)D(x)p(x)dx = 0. More generally, the kth eigenfunction minimizes\nLp(f ) under additional constraints that R F (x)\u03a6l(x)p(x)D(x)dx = 0 for all l < k. The eigen-\n\nvalue of an eigenfunction \u03a6k is simply its smoothness \u03c3k = Lp(\u03a6k). Fig. 2(right) shows the \ufb01rst\nthree eigenfunctions corresponding to the density of the toy data. Similar to the eigenvectors of the\ngraph Laplacian, the second eigenfunction reveals the natural clustering of the data. Note that the\neigenvalue of the eigenfunctions is similar to the eigenvalue of the discrete generalized eigenvector.\n\n1\n\nn2 f T Lf = 1\n\nHow are these eigenfunctions related to the generalized eigenvectors of the Laplacian? It is easy\nto see that as n \u2192 \u221e,\n1\n\n2n2 Pi,j Wij (f (i) \u2212 f (j))2 will approach Lp(F ), and\nn Pi f 2(i)D(i, i) will approach R F 2(x)D(x)p(x)dx so that the minimization problems that de-\n\n\ufb01ne the eigenvectors approach the problems that de\ufb01ne the eigenfunctions as n \u2192 \u221e. 
Thus under suitable convergence conditions, the eigenfunctions can be seen as the limit of the eigenvectors as the number of points goes to infinity [1, 3, 6, 14]. For certain parametric probability functions (e.g. uniform, Gaussian) the eigenfunctions can be calculated analytically [14, 23]. Thus for these cases, there is a tremendous advantage in estimating p(x) and calculating the eigenfunctions from p(x) rather than attempting to estimate the eigenvectors directly. For example, consider a problem with 80 million datapoints sampled from a 32-dimensional Gaussian. Instead of diagonalizing an 80 million × 80 million matrix, we can simply estimate a 32 × 32 covariance matrix and get analytical eigenfunctions. In low dimensions, we can calculate the eigenfunctions numerically by discretizing the density. Let g be the eigenfunction values at a set of discrete points; then g satisfies:

(D̃ − P W̃ P) g = σ P D̂ g    (2)

where W̃ is the affinity between the discrete points, P is a diagonal matrix whose diagonal elements give the density at the discrete points, D̃ is a diagonal matrix whose diagonal elements are the sum of the columns of P W̃ P, and D̂ is a diagonal matrix whose diagonal elements are the sum of the columns of P W̃. This method was used to calculate the eigenfunctions in Fig. 2(right).

Instead of assuming that p(x) has a simple, parametric form, we will use a more modest assumption: that p(x) has a product form. Specifically, we assume that if we rotate the data s = Rx then p(s) = p(s_1) p(s_2) ⋯ p(s_d). This assumption allows us to calculate the eigenfunctions of L_p using only the marginal distributions p(s_i).

Observation: Assume p(s) = p(s_1) p(s_2) ⋯ p(s_d). Let p_k be the marginal distribution of a single coordinate in s.
Let Φ_i(s_k) be an eigenfunction of L_{p_k} with eigenvalue σ_i; then Φ_i(s) = Φ_i(s_k) is also an eigenfunction of L_p with the same eigenvalue σ_i.

Proof: This follows from the observation in [14, 23] that for separable distributions, the eigenfunctions are also separable.

This observation motivates the following algorithm:

• Find a rotation of the data R, so that the components of s = Rx are as independent as possible.
• For each “independent” component s_k, use a histogram to approximate the density p(s_k). In order to regularize the solution (see below), we add a small constant to the value of the histogram at each bin.
• Given the approximated density p(s_k), solve numerically for the eigenfunctions and eigenvalues of L_{p_k} using Eqn. 2. As discussed above, this can be done by solving a generalized eigenvalue problem for a B × B matrix, where B is the number of bins in the histogram.
• Order the eigenfunctions from all components by increasing eigenvalue.

The need to add a small constant to the histogram comes from the fact that the smoothness operator L_p(F) ignores the value of F wherever the density vanishes, p(x) = 0. Thus the eigenfunctions can oscillate wildly in regions with zero density. By adding a small constant to the density we enforce an additional smoothness regularizer, even in regions of zero density. Similar regularizers are used in [2, 9].

This algorithm will recover eigenfunctions of L_p which depend only on a single coordinate. As discussed in [23], products of these eigenfunctions for different coordinates are also eigenfunctions, but we will assume the semi-supervised solution is a linear combination of only the single-coordinate eigenfunctions. By choosing the k eigenfunctions with smallest eigenvalue we now have k functions Φ_k(x) whose value is given at a set of discrete points for each coordinate.
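The per-coordinate numerical step can be sketched as follows (our own illustration; the function name and parameter values such as B = 50, ε = 0.3 and the regularizing constant are assumptions, not the paper's settings):

```python
import numpy as np
from scipy.linalg import eigh

# Sketch of Eqn. 2 for one "independent" component: approximate the 1D density
# p(s_k) by a regularized histogram, then solve the B x B generalized
# eigenproblem (D~ - P W~ P) g = sigma * P D^ g.
def eigenfunctions_1d(s, B=50, eps=0.3, reg=1e-3, k=4):
    hist, edges = np.histogram(s, bins=B, density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    p = hist + reg                        # small constant regularizes empty bins
    Wt = np.exp(-(centers[:, None] - centers[None, :]) ** 2 / (2 * eps ** 2))
    P = np.diag(p)
    PWP = P @ Wt @ P
    Dt = np.diag(PWP.sum(axis=0))         # column sums of P W~ P
    Dh = np.diag((P @ Wt).sum(axis=0))    # column sums of P W~
    sigma, g = eigh(Dt - PWP, P @ Dh)     # B x B generalized eigenproblem
    return centers, sigma[:k], g[:, :k]

# Bimodal 1D density: the second eigenfunction should change sign between modes.
rng = np.random.default_rng(2)
s = np.concatenate([rng.normal(-3, 0.5, 5000), rng.normal(3, 0.5, 5000)])
centers, sigma, g = eigenfunctions_1d(s)
assert abs(sigma[0]) < 1e-6       # first eigenfunction is the trivial constant
assert g[0, 1] * g[-1, 1] < 0     # second one separates the two modes
```

The cost is governed by B, not by the number of samples: each coordinate contributes one B × B solve, regardless of n.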
We then use linear interpolation in 1D to interpolate Φ(x) at each of the labeled points x_l. This allows us to solve Eqn. 1 in time that is independent of the number of unlabeled points.

Although this algorithm has a number of approximate steps, it should be noted that if the “independent” components are indeed independent, and if the semi-supervised solution is only a linear combination of the single-coordinate eigenfunctions, then this algorithm will exactly recover the semi-supervised solution as n → ∞. Consider again a dataset of 80 million points in 32 dimensions and assume 100 bins per dimension. If the independent components s = Rx are indeed independent, then this algorithm will exactly recover the semi-supervised solution by solving 32 generalized eigenvector problems, each of size 100 × 100, and a single k × k least-squares problem. In contrast, directly estimating the eigenvectors of the graph Laplacian would require diagonalizing an 80 million × 80 million matrix.

3 Experiments

In this section we describe experiments to illustrate the performance and scalability of our approach. The results are reported on the Tiny Images database [19], in combination with the CIFAR-10 label set [11]. This data is diverse and highly variable, having been collected directly from Internet search engines. The set of labels allows us to accurately measure the performance of our algorithm, while using data typical of the large-scale Internet settings for which our algorithm is designed.

We start with a toy example that illustrates our eigenfunction approach, compared to the Nystrom method of Talwalkar et al. [18], another approximate semi-supervised learning scheme that can scale to large datasets. In Fig.
3 we show two different 2D datasets, designed to reveal the failure modes of the two methods.

Figure 3: A comparison of the separable eigenfunction approach and the Nystrom method (panels, left to right: Data, Nystrom, Eigenfunction, for each of the two datasets). Both methods have comparable computational cost. The Nystrom method is based on computing the graph Laplacian on a set of sparse landmark points and fails in cases where the landmarks do not adequately summarize the density (left). The separable eigenfunction approach fails when the density is far from a product form (right).

3.1 Features

For the experiments in this paper we use global image descriptors to represent the entire image (there is no attempt to localize the objects within the images). Each image is thus represented by a single Gist descriptor [15], which we then project down to 64 dimensions using PCA. As illustrated in Fig. 3, the eigenfunction approach assumes that the input distribution is separable over dimension. In Fig. 4 we show that while the raw Gist descriptors exhibit strong dependencies between dimensions, this is no longer the case after the PCA projection. Note that PCA is one of the few types of projection permitted: since distances between points must be preserved, only rotations of the data are allowed.

[Figure 4 panels: log histograms of Gist descriptors (Dim. 2 vs 3, MI: 0.555; Dim. 3 vs 4, MI: 0.484; Dim. 2 vs 16, MI: 0.159) and of PCA’d Gist descriptors (Dim. 2 vs 3, MI: 0.017; Dim. 3 vs 4, MI: 0.009; Dim. 2 vs 16, MI: 0.007).]

Figure 4: 2D log histograms formed from 1 million Gist descriptors. Red and blue correspond to high and low densities respectively.
Left: three pairs of dimensions in the raw Gist descriptor, along with their mutual information score (MI), showing strong dependencies between dimensions. Right: the dimensions in the Gist descriptors after a PCA projection, as used in our experiments. The dependencies between dimensions are now much weaker, as the MI scores show. Hence the separability assumption made by our approach is not an unreasonable one for this type of data.

3.2 Experiments with CIFAR label set

The CIFAR dataset [11] was constructed by asking human subjects to label a subset of classes of the Tiny Images dataset. For a given keyword and image, the subjects determined whether the given image was indeed an image of that keyword. The resulting labels span 386 distinct keywords in the Tiny Images dataset. Our experiments use the subset of 126 classes which had at least 200 positive labels and 300 negative labels, giving a total of 63,000 images.

Our experimental protocol is as follows: we take a random subset of C classes from the set of 126. For each class c, we randomly choose a fixed test set of 100 positive and 200 negative examples, reflecting the typical signal-to-noise ratio found in images from Internet search engines. The training examples consist of t positive/negative pairs drawn from the remaining pool of 100 positive/negative images for each keyword.

For each class in turn, we use our scheme to propagate labels from the training examples to the test examples. By assigning higher probability (values in f) to the genuine positive images of each class, we are able to re-rank the images. We also make use of the training examples from keywords other than c by treating them as additional negative examples. For example, if we have C = 16 keywords and t = 5 training pairs per keyword, then we have 5 positive training examples and 5 + (16 − 1) × 10 = 155 negative training examples for each class.
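The propagation step itself reduces to the k × k system of Eqn. 1. A minimal sketch (ours; the random U here merely stands in for the interpolated eigenfunction values):

```python
import numpy as np

# Sketch of the label-propagation step: solve (Sigma + U^T Lam U) alpha = U^T Lam y
# (Eqn. 1), where U holds the k eigenfunction values at every datapoint (obtained
# by 1D linear interpolation, e.g. np.interp, in the real pipeline).
def propagate(U, sigma, labels, lam=50.0):
    n, k = U.shape
    Lam = np.zeros(n)                     # Lam_ii: lam on labeled points, 0 elsewhere
    y = np.zeros(n)
    for i, yi in labels.items():
        Lam[i], y[i] = lam, yi
    A = np.diag(sigma) + U.T @ (Lam[:, None] * U)   # k x k: independent of n
    b = U.T @ (Lam * y)
    alpha = np.linalg.solve(A, b)
    return U @ alpha                      # scores f = U alpha for every image

rng = np.random.default_rng(3)
n, k = 1000, 16
U = rng.normal(size=(n, k)) / np.sqrt(n)
sigma = np.sort(rng.uniform(0.0, 0.1, size=k))
f = propagate(U, sigma, {0: +1, 1: +1, 2: -1, 3: -1})
assert f.shape == (n,) and np.isfinite(f).all()
```

Noisy labels (Section 3.3) enter the same way, simply with a smaller weight (λ/10) on the corresponding diagonal entries of Λ.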
We use these to re-rank the 300 test images of that particular class. Note that the propagation from labeled images to test images may go through the unlabeled images that are not even in the same class. Our use of examples from other classes as negative examples is motivated by real problems, where training labels are spread over many keywords but relatively few labels are available per class.

In experiments using our eigenfunction approach, we compute a fixed set of k = 256 eigenfunctions on the entire 63,000 datapoints in the 64D space with ε = 0.2 and used these for all runs. For approaches that require explicit formation of the affinity matrix, we calculate the distance between the 64D image descriptors using ε = 0.125. All approaches use λ = 50. To evaluate performance, we choose to measure the precision at a low recall rate of 15%, this being a sensible operating point as it corresponds to the first webpage or so in an Internet retrieval setting. Given the split of +ve/−ve examples in the test data, chance-level performance corresponds to a precision of 33%. All results were generated by averaging over 10 different runs, each with different random train/test draws, and with different subsets of classes.

In our first set of experiments, shown in Fig. 5(left), we compare our eigenfunction approach to a variety of alternative learning schemes. We use C = 16 different classes drawn randomly from the 126, and vary the number of training pairs t from 0 up to 100 (thus the total number of labeled points, positive and negative, varied from 0 to 3200). Our eigenfunction approach outperforms other methods, particularly where relatively few training examples are available. We use two baseline classifiers: (i) Nearest-Neighbor and (ii) RBF kernel SVM, with kernel width ε.
The SVM approach badly over-fits the data for small numbers of training examples, but catches up with the eigenfunction approach once 64 +ve / 1984 −ve labeled examples are used.

We also test a range of SSL approaches. The exact least-squares approach (f = (L + Λ)^{−1} Λ y) achieves comparable results to the eigenfunction method, although it is far more expensive. The eigenvector approach (Eqn. 1) performs less well, being limited by the k = 256 eigenvectors used (as k is increased, the performance converges to the exact least-squares solution). Neither of these methods scales to large image collections, as the affinity matrix W becomes too big and cannot be inverted or have its eigenvectors computed. Fig. 5(left) also shows the efficient Nystrom method [18], using 1000 landmark points, which has a somewhat disappointing performance. Evidently, as in Fig. 3, the landmark points do not adequately summarize the density. As the number of landmarks is increased, the performance approaches that of the least-squares solution.

[Figure 5 plots. Legend: Eigenfunction; Eigenfunction w/noisy labels; Nystrom; Least-squares; Eigenvector; SVM; NN; Chance. Axes: mean precision at 15% recall vs Log2 number of +ve training examples/class (left) and Log2 # classes or # eigenfunctions (right, panels (a) without noisy labels, (b) with noisy labels, (c) without noisy labels).]

Figure 5: Left: Performance (precision at 15% recall) on the Tiny Image CIFAR label set for different learning schemes as the number of training pairs is increased, averaged over 16 different classes. −Inf corresponds to the unsupervised case (0 examples). Our eigenfunction scheme (solid red) outperforms standard supervised methods (nearest-neighbors (green) and a Gaussian SVM (blue)) for small numbers of training pairs. Compared to other semi-supervised schemes, ours matches the exact least-squares solution (which is too expensive to run on a large number of images), while outperforming approximate schemes, such as Nystrom [18]. By using noisy labels in addition to the training pairs, the performance is boosted when few training examples are available (dashed red). Right: (a): The performance of our eigenfunction approach as the number of training pairs per class and number of classes is varied. Increasing the number of classes also aids performance since labeled examples from other classes can be used as negative examples. (b): As for (a) but now using noisy label information (Section 3.3). Note the improvement in performance when few training pairs are available. (c): The performance of our approach (using no noisy labels) as the number of eigenfunctions is varied.

In Fig. 5(right)(a) we explore how our eigenfunction approach performs as the number of classes C is varied, for different numbers of training pairs t per class. For a fixed t, as C increases, the number of negative examples available increases, thus aiding performance. Fig. 5(right)(c) shows the effect of varying the number of eigenfunctions k for C = 16 classes. The performance is fairly stable above k = 128 eigenfunctions (i.e.
on average 2 per dimension), although some mild over-fitting seems to occur for small numbers of training examples when a very large number is used.

3.3 Leveraging noisy labels

In the experiments above, only two types of data are used: labeled training examples and unlabeled test examples. However, an additional source is the noisy labels from the Tiny Image dataset (the keyword used to query the image search engine). These labels can easily be utilized by our framework: all 300 test examples for a class c are given a positive label with a small weight (λ/10), while the 300(C − 1) test examples from other classes are given a negative label with the same small weight. Note that these labels do not reveal any information about which of the 300 test images are true positives. These noisy labels can provide a significant performance gain when few training (clean) labels are available, as shown in Fig. 5(left) (cf. solid and dashed red lines). Indeed, when no training labels are available, just the noisy labels, our eigenfunction scheme still performs very well. The performance gain is explored in more detail in Fig. 5(right)(b). In summary, using the eigenfunction approach with noisy labels, the performance obtained with a total of 32 labeled examples is comparable to the SVM trained with 64*16=512 labeled examples.

3.4 Experiments on Tiny Images dataset

Our final experiment applies the eigenfunction approach to the whole of the Tiny Images dataset (79,302,017 images). We map the Gist descriptor for each image down to a 32D space using PCA and precompute k = 64 eigenfunctions over the entire dataset. The 445,954 CIFAR labels (64,185 of which are +ve) cover 386 keywords, any of which can be re-ranked by solving Eqn. 1, which takes around 1 ms on a fast PC. In Fig.
6 we show results of our scheme on four different keywords, each using 3 labeled training pairs, resulting in a significant improvement in quality over the original ordering. A nearest-neighbor classifier, which is not regularized by the data density, performs worse than our approach.

Figure 6: Re-ranking images from 4 keywords in an 80 million image dataset, using 3 labeled pairs for each keyword. Rows from top: “Japanese spaniel”, “airbus”, “ostrich”, “auto”. From L to R, the columns show the original image order, the results of nearest-neighbors and the results of our eigenfunction approach. By regularizing the solution using eigenfunctions computed from all 80 million images, our semi-supervised scheme outperforms the purely supervised method.

4 Discussion

We have proposed a novel semi-supervised learning scheme that is linear in the number of images, and demonstrated it on challenging datasets, including one of 80 million images. The approach can easily be parallelized, making it practical for Internet-scale image collections. It can also incorporate a variety of label types, including noisy labels, in one consistent framework.

Acknowledgments
The authors would like to thank Héctor Bernal and the anonymous reviewers and area chairs for their constructive comments. We also thank Alex Krizhevsky and Geoff Hinton for providing the CIFAR label set. Funding support came from: NSF Career award (ISI 0747120), ISF and a Microsoft Research gift.

References

[1] M. Belkin and P. Niyogi. Towards a theoretical foundation for Laplacian-based manifold methods. Journal of Computer and System Sciences, 2007.

[2] M. Belkin, P. Niyogi, and V. Sindhwani. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. JMLR, 7:2399–2434, 2006.

[3] Y. Bengio, O.
Delalleau, N. Le Roux, J.-F. Paiement, P. Vincent, and M. Ouimet. Learning eigenfunctions links spectral embedding and kernel PCA. In NIPS, pages 2197–2219, 2004.

[4] T. Berg and D. Forsyth. Animals on the web. In CVPR, pages 1463–1470, 2006.

[5] O. Chapelle, B. Schölkopf, and A. Zien. Semi-Supervised Learning. MIT Press, 2006.

[6] R. R. Coifman, S. Lafon, A. Lee, M. Maggioni, B. Nadler, F. Warner, and S. Zucker. Geometric diffusions as a tool for harmonic analysis and structure definition of data, Part I: Diffusion maps. PNAS, 102(21):7426–7431, 2005.

[7] R. Datta, D. Joshi, J. Li, and J. Z. Wang. Image retrieval: Ideas, influences, and trends of the new age. ACM Computing Surveys, 2008.

[8] R. Fergus, L. Fei-Fei, P. Perona, and A. Zisserman. Learning object categories from Google's image search. In ICCV, volume 2, pages 1816–1823, Oct. 2005.

[9] J. Garcke and M. Griebel. Semi-supervised learning with sparse grids. In ICML workshop on learning with partially classified training data, 2005.

[10] A. Kapoor, K. Grauman, R. Urtasun, and T. Darrell. Active learning with Gaussian processes for object categorization. In CVPR, 2007.

[11] A. Krizhevsky and G. E. Hinton. Learning multiple layers of features from tiny images. Technical report, Computer Science Department, University of Toronto, 2009.

[12] S. Kumar, M. Mohri, and A. Talwalkar. Sampling techniques for the Nyström method. In AISTATS, 2009.

[13] L. J. Li, G. Wang, and L. Fei-Fei. OPTIMOL: automatic object picture collection via incremental model learning. In CVPR, 2007.

[14] B. Nadler, S. Lafon, R. R. Coifman, and I. G. Kevrekidis. Diffusion maps, spectral clustering and reaction coordinates of dynamical systems. Applied and Computational Harmonic Analysis, 21:113–127, 2006.

[15] A. Oliva and A. Torralba. Modeling the shape of the scene: a holistic representation of the spatial envelope.
IJCV, 42:145–175, 2001.

[16] B. C. Russell, A. Torralba, K. P. Murphy, and W. T. Freeman. LabelMe: a database and web-based tool for image annotation. IJCV, 77(1):157–173, 2008.

[17] B. Schölkopf and A. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2002.

[18] A. Talwalkar, S. Kumar, and H. Rowley. Large-scale manifold learning. In CVPR, 2008.

[19] A. Torralba, R. Fergus, and W. T. Freeman. 80 million tiny images: a large database for non-parametric object and scene recognition. IEEE PAMI, 30(11):1958–1970, November 2008.

[20] I. Tsang and J. Kwok. Large-scale sparsified manifold regularization. In NIPS, 2006.

[21] L. von Ahn. The ESP game, 2006.

[22] S. Vijayanarasimhan and K. Grauman. Keywords to visual categories: Multiple-instance learning for weakly supervised object categorization. In CVPR, 2008.

[23] Y. Weiss, A. Torralba, and R. Fergus. Spectral hashing. In NIPS, 2008.

[24] B. Yao, X. Yang, and S. C. Zhu. Introduction to a large scale general purpose ground truth dataset: methodology, annotation tool, and benchmarks. In EMMCVPR, 2007.

[25] K. Yu, S. Yu, and V. Tresp. Blockwise supervised inference on large graphs. In ICML workshop on learning with partially classified training data, 2005.

[26] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf. Learning with local and global consistency. In NIPS, 2004.

[27] X. Zhu. Semi-supervised learning literature survey. Technical Report 1530, University of Wisconsin-Madison, 2008.

[28] X. Zhu, Z. Ghahramani, and J. Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. In ICML, pages 912–919, 2003.

[29] X. Zhu and J. Lafferty. Harmonic mixtures: combining mixture models and graph-based methods for inductive and scalable semi-supervised learning.
In ICML, 2005.
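For concreteness, the per-keyword re-ranking solve used in Sections 3.3 and 3.4 can be sketched in a few lines of NumPy. This is a minimal sketch, not the authors' code: the function name and interface are ours, and since Eqn. 1 is not reproduced in this excerpt, the small k × k system below is our reading of the eigenfunction-basis least-squares solve described in the text, with the per-image weights carrying the clean-label weight λ and the noisy-label weight λ/10 of Section 3.3.

```python
import numpy as np

def ssl_rerank(U, sigma, y, w):
    """Sketch of the eigenfunction-based semi-supervised solve.

    Rather than solving the full n x n graph-Laplacian system, we restrict
    the solution to the span of the k precomputed eigenfunctions, f = U @ alpha,
    and solve the small k x k system

        (Sigma + U^T W U) alpha = U^T W y

    U     : (n, k) eigenfunction values at each of the n images
    sigma : (k,)   eigenvalues, penalizing non-smooth components
    y     : (n,)   labels: +1/-1 where known (clean or noisy), 0 elsewhere
    w     : (n,)   per-image label weights, e.g. lambda for clean labels,
                   lambda/10 for noisy ones (Section 3.3), 0 if unlabeled
    """
    A = np.diag(sigma) + (U.T * w) @ U   # k x k; cheap since k << n
    b = U.T @ (w * y)
    alpha = np.linalg.solve(A, b)
    return U @ alpha                     # smooth score for every image
```

Because the system is only k × k (k = 64 in Section 3.4), the per-keyword solve is essentially free once the eigenfunctions are precomputed, consistent with the roughly 1 ms figure quoted above.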