{"title": "Unsupervised Detection of Regions of Interest Using Iterative Link Analysis", "book": "Advances in Neural Information Processing Systems", "page_first": 961, "page_last": 969, "abstract": "This paper proposes a fast and scalable alternating optimization technique to detect regions of interest (ROIs) in cluttered Web images without labels. The proposed approach discovers highly probable regions of object instances by iteratively repeating the following two functions: (1) choose the exemplar set (i.e. small number of high ranked reference ROIs) across the dataset and (2) refine the ROIs of each image with respect to the exemplar set. These two subproblems are formulated as ranking in two different similarity networks of ROI hypotheses by link analysis. The experiments with the PASCAL 06 dataset show that our unsupervised localization performance is better than one of state-of-the-art techniques and comparable to supervised methods. Also, we test the scalability of our approach with five objects in Flickr dataset consisting of more than 200,000 images.", "full_text": "Unsupervised Detection of Regions of Interest\n\nUsing Iterative Link Analysis\n\nGunhee Kim\n\nSchool of Computer Science\nCarnegie Mellon University\ngunhee@cs.cmu.edu\n\nComputer Science and Arti\ufb01cial Intelligence Laboratory\n\nAntonio Torralba\n\nMassachusetts Institute of Technology\n\ntorralba@csail.mit.edu\n\nAbstract\n\nThis paper proposes a fast and scalable alternating optimization technique to de-\ntect regions of interest (ROIs) in cluttered Web images without labels. The pro-\nposed approach discovers highly probable regions of object instances by itera-\ntively repeating the following two functions: (1) choose the exemplar set (i.e. a\nsmall number of highly ranked reference ROIs) across the dataset and (2) re\ufb01ne\nthe ROIs of each image with respect to the exemplar set. 
These two subproblems\nare formulated as ranking in two different similarity networks of ROI hypotheses\nby link analysis. The experiments with the PASCAL 06 dataset show that our\nunsupervised localization performance is better than one of state-of-the-art tech-\nniques and comparable to supervised methods. Also, we test the scalability of our\napproach with \ufb01ve objects in Flickr dataset consisting of more than 200K images.\n\n1 Introduction\n\nThis paper proposes an unsupervised approach to the detection of regions of interest (ROIs) from a\nWeb-sized dataset (Fig.1). We de\ufb01ne the regions of interest as highly probable rectangular regions of\nobject instances in the images. The extraction of ROIs is extremely helpful for recognition and Web\nuser interfaces. For example, [3, 5] showed comparative studies in which ROI detection is useful to\nlearn more accurate models, which leads to nontrivial improvement of classi\ufb01cation and localization\nperformance. In the recognition of indoor scenes [17], the local regions that contain objects may\nhave special meaning to characterize the scene description. Also, many Web applications allow a\nuser to attach notes on user-speci\ufb01ed regions in a cluttered image (e.g. Flickr Notes). Our algorithm\ncan make this cumbersome annotation easier by suggesting the regions a user may be interested in.\nOur solution to the problem of unsupervised ROI detection is inspired by an alternating optimiza-\ntion. Alternating optimization is one of widely used heuristics where optimization over two sets of\nvariables is not straightforward, but optimization with respect to one while keeping the other \ufb01xed\nis much easier and solvable. This approach has been successful in a wide range of areas such as\nK-means, Expectation-Maximization, and Iterative Closest Point algorithms [2].\n\nFigure 1: Detection of regions of interest (ROIs). 
Given a Web-sized dataset, our algorithm detects bounding-box-shaped ROIs that are statistically significant across the dataset in an unsupervised manner. The yellow boxes are groundtruth labels, and the red and blue ones are ROIs detected by the proposed method.\n\nThe unsupervised ROI detection can be thought of as a chicken-and-egg problem between (1) finding exemplars of objects in the dataset and (2) localizing object instances in each image. If class-representative exemplars are given, the detection of objects in images is solvable (i.e. a conventional detection or localization problem). Conversely, if object instances are clearly annotated beforehand, the exemplars can be easily obtained (i.e. a conventional modeling or ranking problem).\nGiven an image set, we first assume that each image itself is the best ROI (i.e. the most confident object region). Then a small number of highly ranked ones among the selected ROIs are chosen as exemplars (called hub seeking), which serve as references to refine the ROIs of each image (called ROI refinement). We repeat these two updates until convergence. The two steps are formulated as ranking in two different similarity networks of ROI hypotheses by link analysis. The hub seeking corresponds to finding a central and diverse hub set in a network of the selected ROIs (i.e. the inter-image level). The ROI refinement is the ranking in a bipartite graph between the hub set and all possible ROI hypotheses of each image (i.e. the intra-image level).\nOur work is closely related to work on ROI detection [3, 5, 17, 14], unsupervised localization [9, 24, 21, 18, 1, 12], and online image collection [13, 19, 6]. ROI detection and unsupervised localization share the goal of detecting the regions of objects in cluttered images. However, most previous work has been successful only on standard datasets with thousands of images. 
On the other hand, our goal is to propose a simple and fast method that can take advantage of enormous amounts of Web data. The main objective of online image collection is to collect relevant images from highly noisy data queried by keywords from the Web. Its main limitation is that much of the previous work requires additional assumptions such as a small number of seed images in the beginning [13], texts and HTML tags associated with images [19], or user-labeled images [6]. In contrast, no additional meta-data are required in our approach.\nRecently, link analysis techniques on visual similarity networks were successfully exploited in computer vision problems [12, 15, 11, 16]. [15] applied the random walk with restart technique to the auto-captioning task. However, their work is a supervised method requiring annotated caption words for the segmented regions in training images. [12] is similar to ours in that unsupervised classification and localization are the main objectives. However, their method suffers from a scalability issue, and thus their experiments were performed using only 600 images. [11] successfully applied the PageRank technique to large-scale image search, but unlike ours their approach is evaluated with quite clean images, and sub-image-level localization is not dealt with. Likewise, [16] also exploited the matching graph of a large-scale image set, but localization was not discussed.\nThe main advantages of our approach are summarized as follows. First, the proposed method is extremely simple and fast, with compelling performance. Our approach shows superior results over a state-of-the-art unsupervised localization method [18] on the PASCAL 06 dataset. We propose a simple heuristic for scalability that makes the computation time linear in the data size without a severe performance drop. 
For example, the localization of 200K images took only 4.5 hours with\nnaive matlab implementation on a single PC equipped with Intel Xeon 2.83 GHz CPU (once image\nsegmentation and feature extraction were done). Second, our approach is dynamic thanks to the\nevolving network representation. At every iteration, new ROI hypothesis are added and trivial ones\nare removed from the network while reusing a large portion of previously computed information.\nThird, unlike most previous work, our approach requires neither human annotation, meta-data, nor\ninitial seed images. Finally, we evaluate our approach with a challenging Flickr dataset of up to\n200K images. Although some work [22] in image retrieval uses millions of images, this work has a\ndifferent goal from ours. The objective of image retrieval is to quickly index and search the nearest\nimages to a given query. On the other hand, our goal is to localize objects in every single image of\na dataset without supervision.\n\n2 ROI Candidates and Description\nThe input to our algorithm is a set of images I = {I1, I2, ..., I|I|}. The \ufb01rst task is to de\ufb01ne a set\nof ROI hypotheses from the image set R = {R1, R2, ..., R|I|}. Ideally, the set of ROI hypotheses\nRa = {ra1, ..., ram} of an image Ia enumerates all plausible bounding boxes, and at least one of\nthem is supposed to be a good object annotation. Fig.2 shows the procedure of ROI hypothesis\ngeneration. Given an image, 15 segments are extracted by Normalized cuts [20]. The minimum\nrectangle to enclose each segment is de\ufb01ned as initial ROI hypotheses. Since the over-segmentation\n\n2\n\n\fFigure 2: An example of ROI extraction and description. From left to right: (a) An input image. (b) 15\nsegments. (c) 43 ROI hypotheses. (d) Distribution of visual words. (e) Edge gradients.\n\nis unavoidable in most cases, the combinations of the initial hypotheses are also considered. 
We first compute pairwise minimum paths between the initial hypotheses using the Dijkstra algorithm. Then the bounding boxes that enclose those minimum paths are added to the ROI hypothesis set. Finally, a largely overlapped pair of ROIs is merged if |rai \u2229 raj| / |rai \u222a raj| > 0.8. Note that the hypothesis set always includes the image itself as the largest candidate, and the average set size is about 50.\nEach ROI hypothesis is represented by two types of descriptors: spatial pyramids of visual words [17] and HOG [3]. As usual, the visual words are generated by vector quantization of randomly selected SIFT descriptors. K-means is applied to form a dictionary of 200 visual words. A visual word is assigned to each pixel of an image by finding the nearest cluster center in the dictionary, and then binned using a two-level spatial pyramid. The oriented gradients are computed by Canny edge detection and a Sobel mask. Then the HOG descriptor is discretized into 20 orientation bins in the range of [0\u25e6,180\u25e6] following [3]. The pyramid level is up to three. The similarity measure between a pair of ROIs is cosine similarity, which is simply calculated as the dot product of two L2-normalized histograms. Here both descriptors are equally weighted.\n\n3 The Algorithm\n\n3.1 Similarity Networks and Link Analysis Techniques\n\nAll inferences in our approach are based on link analysis of a k-nearest-neighbor similarity network between ROI hypotheses. The similarity network is a weighted graph G = (V,E,W), where V is the set of vertices that are ROI hypotheses. E and W are the edge and weight sets discovered by the similarity measure in the previous section. Each vertex is only connected to its k nearest neighbors with k = a\u00b7log |V| [23], where a is a constant set to 10. This results in a sparse network, which is more advantageous in terms of computational speed and accuracy. 
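As a concrete illustration, the k-NN network construction of this subsection can be sketched as follows. This is a minimal NumPy sketch, not the authors' code: the function name and the dense-matrix shortcut are our own, and it assumes the descriptor histograms are already L2-normalized.

```python
import numpy as np

def knn_similarity_network(H, a=10):
    """Build a k-NN cosine-similarity graph (sketch of Sec. 3.1).

    H : (n, d) array of L2-normalized descriptor histograms.
    Each vertex keeps edges only to its k = a * log(n) nearest
    neighbors, and the result is row-normalized so each row is a
    transition distribution for a random surfer.
    """
    n = H.shape[0]
    k = min(n - 1, max(1, int(a * np.log(n))))
    S = H @ H.T                       # cosine similarity (rows are unit norm)
    np.fill_diagonal(S, 0.0)          # no self-loops
    G = np.zeros_like(S)
    for i in range(n):
        nbrs = np.argsort(S[i])[-k:]  # indices of the k most similar vertices
        G[i, nbrs] = S[i, nbrs]
    row_sums = G.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0     # guard isolated vertices
    return G / row_sums               # row-normalize
```

Keeping only k = a·log|V| neighbors per vertex is what makes the graph sparse, so the later link analysis stays cheap.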
It guarantees that the complexity of network analysis is O(|V| log |V|) at worst. Finally, the network is row-normalized so that the edge weight from node i to node j indicates the probability of a random surfer jumping from i to j. The link analysis technique we use is PageRank [4, 10]. Given a similarity matrix G, it computes a PageRank vector p of the same length, which assigns a ranked score to each vertex of the network. Intuitively, the PageRank scores of the network of ROI hypotheses are indices of the goodness of the hypotheses.\n\n3.2 Overview of the Algorithm\nAlgorithm 1 summarizes the proposed algorithm. The main input is the set of ROI hypotheses R generated by the method of section 2. The output is the set of selected ROIs S\u2217(\u2282 R). In each image, usually one or two, and rarely more than three, of the most promising ROIs are chosen.\nThe basic idea of our approach is to jointly optimize the ROI selection of each image and the exemplar detection among the selected ROIs. Exemplars correspond to hubs in our network representation. We begin with the images themselves as an initial set of ROI selections S (0) (Step 1). Even though this initialization is quite poor, highly ranked hubs among the ROIs are likely to be much more reliable. They are detected by the function Hub seeking (Step 3). Then, the hub set is exploited to refine the ROIs of each image by the function ROI refinement (Step 4). In turn, those refined ROIs are likely to lead to a better hub set at the next iteration. The alternating iterations of these two functions are expected to lead to convergence to not only the best ROI selection for each image but also the most representative ROIs of the dataset. An example of the evolution of ROI selection is shown in Fig.4.(c). 
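For reference, the PageRank computation that both functions rely on can be written in a few lines. This is a hedged sketch: the paper does not state its damping factor or iteration scheme, so the values below are conventional choices, not the authors' settings.

```python
import numpy as np

def pagerank(G, damping=0.85, tol=1e-10, max_iter=200):
    """Power-iteration PageRank over a row-normalized similarity graph.

    G[i, j] is the probability that a random surfer at vertex i jumps
    to vertex j; with probability (1 - damping) the surfer teleports
    to a uniformly random vertex instead.
    """
    n = G.shape[0]
    p = np.full(n, 1.0 / n)  # start from the uniform distribution
    for _ in range(max_iter):
        p_next = (1.0 - damping) / n + damping * (p @ G)
        if np.abs(p_next - p).sum() < tol:
            break
        p = p_next
    return p_next
```

On a network of ROI hypotheses, the resulting scores are the "goodness" indices used throughout the algorithm; vertices that receive many strong similarity votes end up with high PageRank.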
Although our algorithm forces the selection of at least one ROI for each image, the PageRank vector from Hub seeking can indicate the confidence of each ROI, which can be used to filter out wrongly selected ROIs. Conceptually, both functions share a similar ranking problem: select a small subset of highly ranked nodes from the input networks of ROI hypotheses. They will be discussed in detail in the following subsections.\n\nFigure 3: Examples of hub images. The pictures illustrate the highest-ranked images in 10,000 randomly selected images from five objects of our Flickr dataset and all {train+val} images from two objects of the PASCAL06.\n\nInherently, a good initialization is essential for alternating optimization. Our key assumption is as follows: provided that the similarity network includes a sufficiently large number of images, the hub images are likely to be good references. This is based on the finding of our previous work [12]: if each visual entity votes for others that are similar to itself, this democratic voting can reveal the dominant statistics of the image set. Although the images in a dataset are highly variable, the more repetitive visual information may get more similarity votes, and so can be easily and quickly discovered as hubs by link analysis. Fig.3 supports this argument on our dataset. It illustrates top-ranked images of our dataset, in which the objects are clearly shown in the center at significant size. Obviously, they are excellent initialization candidates.\nSince we deal with discrete patches from unordered natural images on the Web, it is extremely difficult to analytically understand several important behaviors of our algorithm such as convexity, convergence, sensitivity to the initial guess, and the quality of our solution. One widely used assumption in optimization with image patches is linearity under small incremental displacement (e.g. AAM [8]). 
However, that is not the case in our problem, and it causes a severe computation increase. These issues may be open challenges for the optimization of large-scale image analysis.\n\nAlgorithm 1 The Algorithm\nInput: ROI hypotheses R associated with image set I.\nOutput: The set of selected ROIs S\u2217 (\u2282 R), where S\u2217 = S(T) when converged at T.\n1: S(0) \u2190 the largest ROI hypothesis in each image.\nwhile S(t\u22121) \u2260 S(t) do\n2:  Generate the k-NN similarity network G(t) of S(t).\n3:  H(t) \u2190 Hub seeking(G(t)), where the hub set H(t) \u2282 S(t).\n  for all Ia \u2208 I, unless the ROI selection of Ia has been unchanged for several consecutive iterations, do\n4:    s_a(t) \u2190 ROI refinement(H(t), Ra), where s_a(t) is the ROI selection of Ia and Ra is the set of ROI hypotheses of Ia.\n5:    S(t) \u2190 S(t) \u222a s_a(t) \\ s_a(t\u22121).\n  end for\nend while\n\nAlgorithm 2 Hub seeking function\nInput: (1) Network G(t). (2) Window size d.\nOutput: (1) Hub set H(t).\n1: Compute the PageRank vector p of G(t).\nfor all vertices v \u2208 G(t) do\n2:  Find the neighbor set of v, Nv = {u | maximum reachable probability from v to u > d}.\n3:  Find the local maximum node of v, m(v) = arg max_{u \u2208 Nv} p(u).\n4:  H(t) \u2190 v if v = m(v).\nend for\n\nAlgorithm 3 ROI refinement function\nInput: (1) Hub set H(t). (2) Ra, the ROI hypotheses of Ia.\nOutput: (1) The selected ROIs s_a(t) (\u2282 Ra).\n1: Generate the k-NN self-similarity matrix Wi of Ra and the k-NN similarity matrix Wo between Ra and H(t). Both of them are row-normalized.\n2: Generate the augmented bipartite graph W = [ \u03b1Wi, (1 \u2212 \u03b1)Wo ; Wo^T, 0 ].\n3: Compute the PageRank vector p of W.\n4: s_a\u2217 = arg max_{raj \u2208 Ra} p(raj).\n\n3.3 Hub Seeking with Centrality and Diversity\nThe goal of this step is to detect a hub set H(t) from S (t) by analyzing the network G(t). The main criteria are centrality and diversity. 
In other words, the selected hub set should be not only highly ranked but also diverse enough not to lose various aspects of an object. To meet this requirement, we design the hub seeking inspired by Mean Shift [7]. Given feature points, the algorithm creates a fixed-radius window at each point. Then each window iteratively moves in the direction of the maximum increase of the local density function until it reaches a local maximum. Those local maxima become the modes, and the data points that converge to the same maxima are clustered.\n\nFigure 4: (a) An example of a bipartite graph between the hub set and the ROI hypotheses of an image. The similarity between hubs and hypotheses is captured by Wo and the affinity between the hypotheses by Wi. The hub set is sorted by PageRank values from left to right. The values of the leftmost and rightmost are 0.0081 and 0.0024, respectively. They successfully capture various aspects related to the car object. (b) The effect of the augmented bipartite graph. The left image is with \u03b1 = 0 and the right with \u03b1 = 0.1. The ranking of hypotheses is represented by a jet colormap from red (high) to blue (low). In the left, the weights from the red box to the blue one are (0.052, 0.050, 0.049, 0.049, 0.049); in the right, (0.060, 0.060, 0.059, 0.059, 0.057). (c) An example of ROI evolution. At T = 0, the selected ROI is the image itself, but it converges to the real object after T = 5.\n\nThe proposed Algorithm 2 works in the same manner. For each vertex, we define the search window in the form of a maximum reachable probability d (Step 2). The window covers the vertices whose maximum reachable probability is larger than d. For example, given d = 0.1, wij = 0.6, wjk = 0.2, the probability from vertex i to k is 0.6 \u00d7 0.2 = 0.12 > d. Then k is considered inside the search window of i. 
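Under the same notation, the window construction and local-maximum test of Algorithm 2 can be sketched as below. This is our illustrative code, not the paper's implementation: it uses a dense Floyd-Warshall-style max-product pass to obtain the maximum reachable probabilities (sensible only for small graphs), and the PageRank vector p is assumed to be given.

```python
import numpy as np

def hub_seeking(G, p, d=0.1):
    """Sketch of the hub-seeking step (Algorithm 2).

    G : row-normalized k-NN graph; p : its PageRank vector.
    A vertex u lies in v's search window if the maximum product of edge
    weights along some path v -> u exceeds d (e.g. with w_ij = 0.6 and
    w_jk = 0.2, the reachability of k from i is 0.6 * 0.2 = 0.12 > 0.1).
    A vertex is a hub if it is the PageRank maximum of its own window.
    """
    n = G.shape[0]
    R = G.copy()
    for k in range(n):                        # max-product reachability
        R = np.maximum(R, np.outer(R[:, k], R[k, :]))
    hubs = []
    for v in range(n):
        window = np.flatnonzero(R[v] > d)     # vertices reachable above d
        window = np.append(window, v)         # the window always contains v
        if v == window[np.argmax(p[window])]:
            hubs.append(v)                    # v is its own local maximum
    return hubs
```

Because each vertex converges toward the local PageRank maximum of its window, the surviving hubs are both central (high-ranked) and diverse (one per density mode), matching the two criteria above.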
For the density function, we use the PageRank vector because it is proportional to the\nvertex degree if the graph is symmetric and connected [25]. In Step 3, we compute the vector m\nthat assigns the local maximum vertex within the window of each vertex in G(t). If v = m(v), the v\nis a local maximum, and it is added to H(t). Additionally, we can easily perform the clustering from\nm. For each node, the search window keeps moving the maximum direction indicated by m until it\nreaches the local maximum. Then the nodes that converge to the same maxima can be clustered.\n\n3.4 ROI Re\ufb01nement\nFormally, this step is to de\ufb01ne a nonparametric function for each image fa : Ra \u2192 R+ (positive real\nnumber) with respect to the hub set H(t). Then the hypothesis with maximum ranked value is chosen\nas the best ROI. In order to solve this problem, we \ufb01rst construct an augmented bipartite graph W\nbetween the hub set H(t) and all possible ROIs Ra as shown in Step 2 of Algorithm 3 (see Fig.4(a)).\nFor better understanding, let us \ufb01rst consider a pure bipartite graph with \u03b1 = 0. Then the matrix W\nrepresents the similarity voting between the ROI candidates and the hub set. If the PageRank vector\np of W is computed, then p(Ra) summarizes the relative importance of each ROI hypothesis with\nrespect to the H(t), which is exactly what we require. Rather than a pure bipartite graph (\u03b1 = 0),\nwe augment it by nonzero \u03b1. Fig.4.(b) explains the effects of \u03b1. The left image shows the result of\n\u03b1 = 0. Even though the red hypothesis is the maximum, several hypotheses near the dark gray car\nhave signi\ufb01cant values. With nonzero \u03b1 = 0.1, those hypotheses are allowed to augment each other,\nso the maximum ROI is changed to a hypothesis on the car. In terms of link analysis, if a random\nsurfer visits nodes of ROI hypotheses (Ra), it jumps to other hypotheses with probability \u03b1 or other\nhubs with 1 \u2212 \u03b1. 
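A compact sketch of Algorithm 3's augmented graph and ranking follows. It is again illustrative, with assumed details where the paper is silent: the damping factor and the per-iteration renormalization (needed because the hub rows Wo^T are not stochastic) are our choices.

```python
import numpy as np

def refine_roi(Wi, Wo, alpha=0.1, damping=0.85, iters=200):
    """Rank the m ROI hypotheses of one image against h hubs (Algorithm 3 sketch).

    Wi : (m, m) row-normalized affinity among the hypotheses.
    Wo : (m, h) row-normalized similarity from hypotheses to hubs.
    A random surfer at a hypothesis moves to another hypothesis with
    probability alpha and to a hub with probability 1 - alpha
    (alpha = 0 gives the pure bipartite graph).
    """
    m, h = Wo.shape
    n = m + h
    W = np.zeros((n, n))
    W[:m, :m] = alpha * Wi            # hypothesis -> hypothesis
    W[:m, m:] = (1.0 - alpha) * Wo    # hypothesis -> hub
    W[m:, :m] = Wo.T                  # hub -> hypothesis (similarity votes)
    p = np.full(n, 1.0 / n)
    for _ in range(iters):            # damped power iteration
        p = (1.0 - damping) / n + damping * (p @ W)
        p /= p.sum()                  # hub rows are not stochastic: renormalize
    return int(np.argmax(p[:m]))      # index of the best-ranked hypothesis
```

Hypotheses similar to highly ranked hubs collect the most mass, and with nonzero alpha overlapping hypotheses reinforce each other, reproducing the effect illustrated in Fig.4.(b).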
Since nearby hypotheses share large portions of their rectangles, they have higher similarity, which results in more votes for nearby hypotheses.\n\n3.5 Scalability Setting\n\nThe bottleneck of our approach is Step 2 of Algorithm 1: the network generation requires a quadratic computation of cosine similarities over S (t). In order to bound the computational complexity, we limit the maximum number of images considered in each run of Algorithm 1 by a constant number N. N should be small enough not to suffer from the computational burden. Simultaneously, it should be large enough to successfully detect the meaningful statistics of an extremely variable dataset. (In experiments, N is set to 10,000.) If the dataset size |I| > N, we randomly sample N images from I and construct an initial consideration set Ic \u2282 I. Algorithm 1 is applied to the image set Ic to obtain S\u2217c. Then we generate a new Ic by sampling unvisited images from I. In order to reuse the result of S\u2217c for the next iteration, we sample x% of N from the previous S\u2217c based on the PageRank values of the network G\u2217 of S\u2217c. In other words, the highly ranked (i.e. highly confident) ROIs of the previous step are reused to expedite the convergence of the next iteration. We iterate the above strategy until all images are examined. This simple heuristic allows our technique to analyze an extremely large dataset in linear time without a significant performance drop.\n\n4 Results\n\nWe evaluate our approach with two different experiments: (1) performance tests with PASCAL VOC 2006 and (2) scalability tests with Flickr images. The PASCAL dataset provides groundtruth labels, so our approach is quantitatively evaluated and compared with other approaches. Using the Flickr dataset, we examine the scalability of our method on a real-world problem. The images are collected by a query that consists of one object word and one context word. 
We downloaded\nimages of the objects {butter\ufb02y+insect(69,990), classic+car(265,731), motorcycle+bike(106,590),\nsun\ufb02ower(165,235), giraffe+zoo(53,620)}. The numbers in parentheses are dataset sizes.\n\n4.1 Performance Tests\n\nThe input of our algorithm consists of unlabeled images, which may include a single object (called as\nweakly supervised) or multiple objects (called unsupervised). For unsupervised cases, we perform\nnot only localization but also classi\ufb01cation according to object types. The PASCAL 06 dataset is so\nchallenging to use that only very rare previous work has used it for unsupervised localization. For\ncomparison, we ran publicly available code of one of the state-of-the-art techniques proposed by\nRussell et al2 [18] in the identical setting.\nThe PASCAL dataset consists of {train+val+test}. However, our approach requires only images\nas an input, and thus all of the {train+val+test} images are used without discrimination between\nthem. Note that our task is an image annotation not a learning problem that requires training and\ntesting steps. The performance is measured by following the protocol of PASCAL evaluation: (1)\nThe performance is evaluated from only the {test} set. In practice, there is very little performance\ndifference between analysis of all {train+val+test} and {test} only. (2) The detection is considered\ncorrect if the overlap between the prediction and ground truth exceeds 50%.\nWeakly supervised localization. Fig.5 shows the detection performance as Precision-Recall (PR)\ncurves. For [18], we iterate experiments by changing the number of topics from two to six, and the\nbest results are reported. For clear comparison between our results and [18], we select only the best\nbounding box in each image. We also present the best result of each object in VOC06 competition.\nStrictly speaking, it is not a valid comparison because the experimental setups of VOC06 competi-\ntion and ours are totally different. 
However, we illustrate them as references to show how closely\nour approach can reach the best supervised methods in VOC 06 for the localization. Although the\nperformance varies according to objects, our approach signi\ufb01cantly outperformed [18] except in\ncow. Promisingly, the performances of our approach for bicycle and motorbike are comparable, and\nthose for bus, cat, and dog objects are superior to the bests of the supervised methods in VOC06.\nUnsupervised classi\ufb01cation and localization. Here we evaluate how well our approach works\nfor unsupervised classi\ufb01cation and localization tasks (i.e. images of multiple objects without any\nannotation are given). Since both our method and [18] aim at sub-image level classi\ufb01cation and\ndetection, we \ufb01rst \ufb01nd out the most con\ufb01dent region of each image, and run the clustering by LDA\nin [18] and spectral clustering [20] in our method. The evaluation of classi\ufb01cation follows the rule\nof VOC06 by the ROC curves as shown in Fig.6. We also show the best of the VOC06 submissions\nfor supervised classi\ufb01cation as a reference. As shown in Fig.6.(a)\u2212(c), our method and [18] present\nsimilar ROC performance. In other words, both methods are quite good at ranking for classi\ufb01cation.\nHowever, the classi\ufb01cation rates of our method are better by about 10% for both 3-object and 4-\nobject cases. (Ours: 69.08%; [18]: 59.05% for {bicycle, car, dog}. Ours: 59.51%; [18]: 50.99%\n\n1The dataset is available at http://www.pascal-network.org/challenges/VOC/\n2The code is available at http://www.di.ens.fr/\u223crussell/projects/mult seg discovery/index.html\n\n6\n\n\fFigure 5: Results of weakly supervised localization. PR curves for the {test} sets of all objects in the PASCAL\n06 dataset. (Ours: blue; [18]: red; the best of VOC06: green). Note that our localization and that of [18] are\nunsupervised, but the VOC06 localization is supervised. 
(X-axis: recall; Y-axis: precision).\n\nFigure 6: Results of unsupervised classification and localization. (a)\u2212(c) ROC curves for the {test} set of {bicycle, car, dog}. (Ours: blue; [18]: red; the best of VOC06: green). The AUCs of ours, [18], and the best of VOC06 are bicycle: (0.892, 0.869, 0.948), car: (0.968, 0.965, 0.977), and dog: (0.932, 0.954, 0.876), respectively. (X-axis: false positive rates; Y-axis: true positive rates). (d)\u2212(f) PR curves for unsupervised localization of ours (blue) and [18] (magenta). For comparison, we also represent the results of our weakly supervised localization (red) and the best of VOC 06 (green). (X-axis: recall; Y-axis: precision).\n\nfor {bicycle, car, dog, sheep}.) We also show the unsupervised localization performance as PR-curves in Fig.6.(d)\u2212(f). For comparison, we also represent the results of our weakly supervised experiments and the bests of VOC 06 for the corresponding objects. A nontrivial performance drop is observed due to classification errors and distraction by other objects in the dataset.\n\n4.2 Scalability Tests\n\nIt is an open question how to evaluate the results of a large number of Web-downloaded images that have no ground truth. For a quantitative evaluation, we manually annotated 0.5% of randomly selected images of the datasets, and they are used as limited but approximate indices of performance. According to the data sizes used in experiments, we randomly pick x% from the annotated set and (100 \u2212 x)% from the non-annotated set. The x is {20, 10, 5, 1, 0.5, 0.5} for the dataset sizes of {500, 5K, 10K, 50K, 100K, 200K}.\nWeakly supervised localization. One interesting question we address here is how performance and computation time vary as a function of data size. The experiments are repeated ten times for each dataset size, and the median (i.e. fifth-best) performance scores are reported. 
Similarly\nto previous tests, we select only the best ROI per image. As shown in Fig.7, the performances of\n500 images \ufb02uctuate, but the results of the dataset size above 5K are stable. As the dataset size\nincreases, a small performance improvement is observed. Since the maximum number of images at\neach running of the algorithm is bounded by N(= 10, 000), the computation times are linear to the\nnumber of images, and the performances of the data size above N are similar each other.\nPerturbation tests. Here we test the goodness of selected ROIs from a different view: robustness of\nROI detection against random network formation. For example, given an image Ia, we can generate\n100 sets of 200 randomly selected images that include Ia. If the ROI selection for Ia is repetitive\nacross 100 different sets, we can say the ROI estimator for Ia is con\ufb01dent. This procedure is similar\nto bootstrapping or cross-validation.\n\n7\n\n\fFigure 7: Weakly supervised localization. (a) PR curves for \ufb01ve objects of our Flickr dataset by varying\ndataset sizes from 500 to 200K. (b) The log-log plot between the number of images and computation times for\nthe car object. The slope of each range is {1.23, 2.05, 0.95, 1.05, 1.28} from left to right.\n\nFigure 8: Examples of perturbation tests. The histograms summarize how many times each ROI is selected\nin 100 random sets. The frequencies of particular ROIs are represented by the thickness of bounding boxes\nand the jet colormap from red (high) to blue (low). From left to right, the entropies of the distributions are\n{0.2419, 1.6846, 2.4331}, respectively. (X-axis: ROI hypotheses; Y-axis: Frequency).\n\nFig.8 shows some examples of the perturbation tests. The histogram indicates how many times each\nROI hypothesis is selected among 100 random sets. From left to right, one can see the increase of\nthe dif\ufb01culty of ROI detection. 
A peak is observed in an obvious image, but the distribution is wider\nin a challenging image. The entropy of the distribution can be an index of the measure of dif\ufb01culty\nor the con\ufb01dence of the estimator of the image.\nMore localization examples. Fig.9 shows more examples of localization in our approach. The third\nrow illustrates some typical examples of failure. Frequently co-occurred objects can be detected\ninstead such as \ufb02owers in butter\ufb02y images, insects on sun\ufb02owers, other animals in the zoo, and\npersons everywhere. Also, sometimes small multiple instances are detected by one ROI or a part of\nan object is discovered (e.g. a giraffe face rather than the whole body).\n\n5 Discussion\n\nWe proposed an alternating optimization approach for scalable unsupervised ROI detection by an-\nalyzing the statistics of similarity links between ROI hypotheses. Both tests with PASCAL 06 and\nFlickr datasets showed that our approach is not only comparable to other unsupervised and super-\nvised techniques but also applicable to real images on the Web.\nAcknowledgement. Funding for this research was provided by NSF Career award (IIS 0747120).\n\nFigure 9: More examples of object localization. The \ufb01rst and second rows represent successful detection, and\nthe third row illustrates some typical failures. The yellow boxes are groundtruth labels, and the red and blue\nones are ROIs detected by the proposed method.\n\n8\n\n\fReferences\n[1] N. Ahuja and S. Todorovic. Learning the taxonomy and models of categories present in arbitrary images.\n\nIn ICCV, 2007.\n\n[2] P. J. Besl and N. D. McKay. A method for registration of 3-d shapes. IEEE Trans. on Pattern Analysis\n\nand Machine Intelligence, 14(2):239\u2013256, 1992.\n\n[3] A. Bosch, A. Zisserman, and X. Munoz. Image classi\ufb01cation using random forests and ferns. In ICCV,\n\n2007.\n\n[4] S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. In WWW, 1998.\n[5] O. 
Chum and A. Zisserman. An exemplar model for learning object classes. In CVPR, 2007.\n[6] B. Collins, J. Deng, K. Li, and L. Fei-Fei. Towards scalable dataset construction: An active learning approach. In ECCV, 2008.\n[7] D. Comaniciu and P. Meer. Mean shift: A robust approach toward feature space analysis. IEEE Trans. on Pattern Analysis and Machine Intelligence, 24(5):603\u2013619, 2002.\n[8] T. F. Cootes, G. J. Edwards, and C. J. Taylor. Active appearance models. IEEE Trans. on Pattern Analysis and Machine Intelligence, 23(6):681\u2013685, 2001.\n[9] R. Fergus, L. Fei-Fei, P. Perona, and A. Zisserman. Learning object categories from Google's image search. In ICCV, pages 1816\u20131823, Oct. 2005.\n[10] G. Jeh and J. Widom. Scaling personalized web search. In WWW, 2003.\n[11] Y. Jing and S. Baluja. VisualRank: PageRank for Google image search. IEEE Trans. on Pattern Analysis and Machine Intelligence, 30(11):1\u201331, 2008.\n[12] G. Kim, C. Faloutsos, and M. Hebert. Unsupervised modeling of object categories using link analysis techniques. In CVPR, 2008.\n[13] L.-J. Li, G. Wang, and L. Fei-Fei. OPTIMOL: automatic object picture collection via incremental model learning. In CVPR, 2007.\n[14] T. Liu, J. Sun, N.-N. Zheng, X. Tang, and H.-Y. Shum. Learning to detect a salient object. In CVPR, 2007.\n[15] J.-Y. Pan, H.-J. Yang, C. Faloutsos, and P. Duygulu. Automatic multimedia cross-modal correlation discovery. In SIGKDD, 2004.\n[16] J. Philbin and A. Zisserman. Object mining using a matching graph on very large image collections. In ICVGIP, 2008.\n[17] A. Quattoni and A. Torralba. Recognizing indoor scenes. In CVPR, 2009.\n[18] B. C. Russell, A. A. Efros, J. Sivic, W. T. Freeman, and A. Zisserman. Using multiple segmentations to discover objects and their extent in image collections. In CVPR, 2006.\n[19] F. Schroff, A. Criminisi, and A. Zisserman. Harvesting image databases from the web. 
In ICCV, 2007.\n[20] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Trans. on Pattern Analysis and Machine Intelligence, 22(8):888\u2013905, 2000.\n[21] J. Sivic, B. C. Russell, A. A. Efros, A. Zisserman, and W. T. Freeman. Discovering objects and their location in images. In ICCV, 2005.\n[22] A. Torralba, R. Fergus, and W. T. Freeman. 80 million tiny images: a large dataset for non-parametric object and scene recognition. IEEE Trans. on Pattern Analysis and Machine Intelligence, 30(11):1958\u20131970, 2008.\n[23] U. von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395\u2013416, 2007.\n[24] J. Winn and N. Jojic. LOCUS: Learning object classes with unsupervised segmentation. In ICCV, 2005.\n[25] D. Zhou, J. Weston, A. Gretton, O. Bousquet, and B. Sch\u00f6lkopf. Ranking on data manifolds. In NIPS, 2004.\n", "award": [], "sourceid": 4, "authors": [{"given_name": "Gunhee", "family_name": "Kim", "institution": null}, {"given_name": "Antonio", "family_name": "Torralba", "institution": null}]}