{"title": "Object Bank: A High-Level Image Representation for Scene Classification & Semantic Feature Sparsification", "book": "Advances in Neural Information Processing Systems", "page_first": 1378, "page_last": 1386, "abstract": "Robust low-level image features have been proven to be effective representations for a variety of visual recognition tasks such as object recognition and scene classification; but pixels, or even local image patches, carry little semantic meanings. For high level visual tasks, such low-level image representations are potentially not enough. In this paper, we propose a high-level image representation, called the Object Bank, where an image is represented as a scale invariant response map of a large number of pre-trained generic object detectors, blind to the testing dataset or visual task. Leveraging on the Object Bank representation, superior performances on high level visual recognition tasks can be achieved with simple off-the-shelf classifiers such as logistic regression and linear SVM. Sparsity algorithms make our representation more efficient and scalable for large scene datasets, and reveal semantically meaningful feature patterns.", "full_text": "Object Bank: A High-Level Image Representation for Scene\n\nClassi\ufb01cation & Semantic Feature Sparsi\ufb01cation\n\nLi-Jia Li*1, Hao Su*1, Eric P. Xing2, Li Fei-Fei1\n1 Computer Science Department, Stanford University\n\n2 Machine Learning Department, Carnegie Mellon University\n\nAbstract\n\nRobust low-level image features have been proven to be effective representations\nfor a variety of visual recognition tasks such as object recognition and scene clas-\nsi\ufb01cation; but pixels, or even local image patches, carry little semantic meanings.\nFor high level visual tasks, such low-level image representations are potentially\nnot enough. 
In this paper, we propose a high-level image representation, called the\nObject Bank, where an image is represented as a scale-invariant response map of a\nlarge number of pre-trained generic object detectors, blind to the testing dataset or\nvisual task. Leveraging on the Object Bank representation, superior performances\non high level visual recognition tasks can be achieved with simple off-the-shelf\nclassi\ufb01ers such as logistic regression and linear SVM. Sparsity algorithms make\nour representation more ef\ufb01cient and scalable for large scene datasets, and reveal\nsemantically meaningful feature patterns.\n\n1 Introduction\n\nUnderstanding the meanings and contents of images remains one of the most challenging problems\nin machine intelligence and statistical learning. Contrast to inference tasks in other domains, such\nas NLP, where the basic feature space in which the data lie usually bears explicit human perceivable\nmeaning, e.g., each dimension of a document embedding space could correspond to a word [21], or\na topic, common representations of visual data seem to primarily build on raw physical metrics of\nthe pixels such as color and intensity, or their mathematical transformations such as various \ufb01lters,\nor simple image statistics such as shape, edges orientations etc. Depending on the speci\ufb01c visual\ninference task, such as classi\ufb01cation, a predictive method is deployed to pool together and model the\nstatistics of the image features, and make use of them to build some hypothesis for the predictor. For\nexample, Fig.1 illustrates the gradient-based GIST features [25] and texture-based Spatial Pyramid\nrepresentation [19] of two different scenes (foresty mountain vs. street). But such schemes often\nfail to offer suf\ufb01cient discriminative power, as one can see from the very similar image statistics in\nthe examples in Fig.1.\n\nFigure 1: (Best viewed in colors and magni\ufb01cation.) 
Comparison of object bank (OB) representation with\ntwo low-level feature representations, GIST and SIFT-SPM of two types of images, mountain vs. city street.\nFrom left to right, for each input image, we show the selected \ufb01lter responses in the GIST representation [25],\na histogram of the SPM representation of SIFT patches [19], and a selected number of OB responses.\n\n*indicates equal contributions.\n\n[Figure 1 panels: Original Image; Gist (\ufb01lters); SIFT-SPM (L=2); Object Filters in OB (Tower, Sky, Mountain, Tree).]\n\nWhile more sophisticated low-level feature engineering and recognition model design remain impor-\ntant sources of future developments, we argue that the use of a semantically more meaningful feature\nspace, such as one that is directly based on the content (e.g., objects) of the images, as words for tex-\ntual documents, may offer another promising avenue to empower a computational visual recognizer\nto potentially handle arbitrary natural images, especially in our current era where visual knowledge\nof millions of common objects is readily available from various easy sources on the Internet.\nIn this paper, we propose \u201cObject Bank\u201d (OB), a new representation of natural images based on\nobjects, or more rigorously, a collection of object sensing \ufb01lters built on a generic collection of la-\nbeled objects. We explore how a simple linear hypothesis classi\ufb01er, combined with a sparse-coding\nscheme, can leverage this representation, despite its extreme high-dimensionality, to achieve\nsuperior predictive power over similar linear prediction models trained on conventional representa-\ntions. We show that an image representation based on objects can be very useful in high-level visual\nrecognition tasks for scenes cluttered with objects. It provides complementary information to that of\nthe low-level features. 
As illustrated in Fig.1, these two different scenes show very different image\nresponses to objects such as tree, street, water, sky, etc. Given the availability of large-scale image\ndatasets such as LabelMe [30] and ImageNet [5], it is no longer inconceivable to obtain trained ob-\nject detectors for a large number of visual concepts. In fact we envision the usage of thousands if\nnot millions of these available object detectors as the building block of such image representation in\nthe future.\nWhile the OB representation offers a rich, high-level description of images, a key technical chal-\nlenge due to this representation is the \u201ccurse of dimensionality\u201d, which is severe because of the size\n(i.e., number of objects) of the object bank and the dimensionality of the response vector for each\nobject. Typically, for a modest sized picture, even hundreds of object detectors would result into a\nrepresentation of tens of thousands of dimensions. Therefore to achieve robust predictor on practi-\ncal dataset with typically only dozens or a couple of hundreds of instances per class, structural risk\nminimization via appropriate regularization of the predictive model is essential.\nIn this paper, we propose a regularized logistic regression method, akin to the group lasso approach\nfor structured sparsity, to explore both feature sparsity and object sparsity in the Object Bank repre-\nsentation for learning and classifying complex scenes. We show that by using this high-level image\nrepresentation and a simple sparse coding regularization, our algorithm not only achieves superior\nimage classi\ufb01cation results in a number of challenging scene datasets, but also can discover seman-\ntically meaningful descriptions of the learned scene classes.\n2 Related Work\nA plethora of image descriptors have been developed for object recognition and image classi\ufb01ca-\ntion [25, 1, 23]. 
We particularly draw the analogy between our object bank and the texture \ufb01lter\nbanks [26, 10].\nObject detection and recognition also entail a large body of literature [7]. In this work, we mainly\nuse the current state-of-the-art object detectors of Felzenszwalb et al. [9], as well as the geometric\ncontext classi\ufb01ers (\u201cstuff\u201d detectors) of Hoiem et al. [13] for pre-training the object detectors.\nThe idea of using many object detectors as the basic representation of images is analogous to ap-\nproaches applying a large number of \u201csemantic concepts\u201d to video and image annotation and re-\ntrieval [12, 33, 35]. In contrast to our work, in [12] and [33] each semantic concept is trained by using\nentire images or frames of video. As there is no localized representation of meaningful object\nconcepts in scenes, such approaches are dif\ufb01cult to use for understanding cluttered images\ncomposed of many objects. In [35], a small number of concepts are trained\nand only the most probable concept is used to form the representation for each region, whereas in\nour approach all the detector responses are used to encode richer semantic information.\nCombinations of a small set (\u223c a dozen) of off-the-shelf object detectors with global scene context\nhave been used to improve object detection [14, 28, 29]. Also related to our work is a very recent\nexploration of using attributes for recognition [17, 8, 16]. But we emphasize such usage is not a\nuniversal representation of images as we have proposed. To our knowledge, this is the \ufb01rst work that\nuses such high-level image features at different image locations and scales.\n\nFigure 2: (Best viewed in colors and magni\ufb01cation.) Illustration of OB. 
A large number of object detectors\nare \ufb01rst applied to an input image at multiple scales. For each object at each scale, a three-level spatial pyramid\nrepresentation of the resulting object \ufb01lter map is used, resulting in No.Objects \u00d7 No.Scales \u00d7 (1^2 + 2^2 + 4^2)\ngrids; the maximum response for each object in each grid is then computed, resulting in a No.Objects length\nfeature vector for each grid. A concatenation of features in all grids leads to an OB descriptor for the image.\n\n3 The Object Bank Representation of Images\nObject Bank (OB) is an image representation constructed from the responses of many object de-\ntectors, which can be viewed as the response of a \u201cgeneralized object convolution.\u201d We use two\nstate-of-the-art detectors for this operation: the latent SVM object detectors [9] for most of the\nblobby objects such as tables, cars, humans, etc., and a texture classi\ufb01er by Hoiem [13] for more\ntexture- and material-based objects such as sky, road, sand, etc. We point out here that we use the\nword \u201cobject\u201d in its very general form \u2013 while cars and dogs are objects, so are sky and water. Our\nimage representation is agnostic to any speci\ufb01c type of object detector; we take the \u201coutsourcing\u201d\napproach and assume the availability of these pre-trained detectors.\nFig. 2 illustrates the general setup for obtaining the OB representation. A large number of object\ndetectors are run across an image at different scales. For each scale and each detector, we obtain an\ninitial response map of the image (see Appendix for more details of using the object detectors [9,\n13]). In this paper, we use 200 object detectors at 12 detection scales and 3 spatial pyramid levels\n(L=0,1,2) [19]. 
We note that this is a universal representation of any image for any task. We use\nthe same set of object detectors regardless of the scenes or the testing dataset.\n\n3.1 Implementation Details of Object Bank\nSo what are the \u201cobjects\u201d to use in the object bank? And how many? An obvious answer to this\nquestion is to use all objects. As the detectors become more robust, especially with the emergence\nof large-scale datasets such as LabelMe [30] and ImageNet [5], this goal becomes more reachable.\nBut the time is not yet ripe to consider using all objects in, say, the LabelMe dataset. Not enough\nresearch has yet gone into building robust object detectors for tens of thousands of generic objects.\nAnd even more importantly, not all objects are of equal importance and prominence in natural im-\nages. As Fig.1 in Appendix shows, the distribution of objects follows Zipf\u2019s Law, which implies\nthat a small proportion of object classes accounts for the majority of object instances.\nFor this paper, we will choose a few hundred most useful (or popular) objects in images (footnote 1). An impor-\ntant practical consideration for our study is to ensure the availability of enough training images for\neach object detector. We therefore focus our attention on obtaining the objects from popular image\ndatasets such as ESP [31], LabelMe [30], ImageNet [5] and the Flickr online photo sharing com-\nmunity. After ranking the objects according to their frequencies in each of these datasets, we take\nthe intersection set of the most frequent 1000 objects, resulting in 200 objects, where the identities\nand semantic relations of some of them are illustrated in Fig.2 in the Appendix. To train each of the\n200 object detectors, we use 100\u223c200 images and their object bounding box information from the\nLabelMe [30] (86 objects) and ImageNet [5] datasets (177 objects). We use a subset of the LabelMe\nscene dataset to evaluate the object detector performance. 
Final object detectors are selected based\non their performance on the validation set from LabelMe (see Appendix for more details).\n\n1 This criterion prevents us from using the Caltech101/256 datasets to train our object detectors [6, 11], where\nthe objects are chosen without any particular considerations of their relevance to daily life pictures.\n\n[Figure 2 labels: Original Image; Objects (Sailboat, Water, Sky, Bear); Object Detector Responses (by detector scale); Spatial Pyramid; Max Response (OB); Object Bank Representation.]\n\n4 Scene Classi\ufb01cation and Feature/Object Compression via Structured Regularized Learning\n\nWe envisage that with the avalanche of annotated objects on the web, the number of object detec-\ntors in our object bank will increase quickly from hundreds to thousands or even millions, offering\nincreasingly rich signatures for each image based on the identity, location, and scale of the object-\nbased content of the scene. However, from a learning point of view, it also poses a challenge on how\nto train predictive models built on such a high-dimensional representation with a limited number of ex-\namples. We argue that, with an \u201covercomplete\u201d OB representation, it is possible to compress the ultra-\nhigh dimensional image vector without losing semantic saliency. We refer to this semantic-preserving\ncompression as content-based compression, in contrast to the conventional information-theoretic com-\npression that aims at lossless reconstruction of the data.\nIn this paper, we intend to explore the power of OB representation in the context of Scene Clas-\nsi\ufb01cation, and we are also interested in discovering meaningful (possibly small subset of) dimen-\nsions during regularized learning for different classes of scenes. 
For simplicity, here we present our\nmodel in the context of a linear binary classi\ufb01er in a 1-versus-all classi\ufb01cation scheme for K classes.\nGeneralization to a multiway softmax classi\ufb01er is slightly more involved under structured regu-\nlarization and thus deferred to future work. Let X = [x_1^T; x_2^T; ...; x_N^T] \u2208 R^{N\u00d7J}, an N \u00d7 J\nmatrix, represent the design built on the J-dimensional object bank representation of N images;\nand let Y = (y_1, ..., y_N) \u2208 {0, 1}^N denote the binary classi\ufb01cation labels of N samples. A\nlinear classi\ufb01er is a function h_\u03b2 : R^J \u2192 {0, 1} de\ufb01ned as h_\u03b2(x) \u225c arg max_{y\u2208{0,1}} x\u03b2, where\n\u03b2 = (\u03b2_1, ..., \u03b2_J) \u2208 R^J is a vector of parameters to be estimated. This leads to the following\nlearning problem: min_{\u03b2\u2208R^J} \u03bbR(\u03b2) + (1/m) \u2211_{i=1}^{m} L(\u03b2; x_i, y_i), where L(\u03b2; x, y) is some non-negative,\nconvex loss, m is the number of training images, R(\u03b2) is a regularizer that avoids over\ufb01tting, and\n\u03bb \u2208 R is the regularization coef\ufb01cient, whose value can be determined by cross validation.\nA common choice of L is the Log loss, L = log(1/P(y_i|x_i, \u03b2)), where P(y_i|x_i, \u03b2) is the logis-\ntic function P(y|x, \u03b2) = (1/Z) exp((1/2) y (x \u00b7 \u03b2)). This leads to the popular logistic regression (LR)\nclassi\ufb01er (footnote 2). Structural risk minimization schemes over LR via various forms of regularizations have\nbeen widely studied and understood in the literature. In particular, recent asymptotic analysis of the\n\u21131 norm and \u21131/\u21132 mixed norm regularized LR proved that under certain conditions the estimated\nsparse coef\ufb01cient vector \u03b2 enjoys a property called sparsistency [34], suggesting their applicabil-\nity for meaningful variable selection in high-dimensional feature space. In this paper, we employ\nan LR classi\ufb01er for our scene classi\ufb01cation problem. 
We investigate content-based compression\nof the high-dimensional OB representation that exploits raw feature-, object-, and (feature+object)-\nsparsity, respectively, using LR with appropriate regularization.\nFeature sparsity via \u21131 regularized LR (LR1) By letting R(\u03b2) \u225c \u2016\u03b2\u2016_1 = \u2211_{j=1}^{J} |\u03b2_j|, we\nobtain an estimator of \u03b2 that is sparse. The shrinkage function on \u03b2 is applied indistinguishably\nto all dimensions in the OB representation, and it does not have a mechanism to incorporate any\npotential coupling of multiple features that are possibly synergistic, e.g., features induced by the\nsame object detector. We call such a sparsity pattern feature sparsity, and denote the resultant\ncoef\ufb01cient estimator by \u03b2F.\n\nObject sparsity via \u21131/\u21132 (group) regularized LR (LRG) Recently, a mixed-norm (e.g., \u21131/\u21132)\nregularization [36] has been used for recovery of joint sparsity across input dimensions. By letting\nR(\u03b2) \u225c \u2016\u03b2\u2016_{1,2} = \u2211_{j=1}^{J} \u2016\u03b2_j\u2016_2, where \u03b2_j is the j-th group (i.e., features grouped by an object j),\nand \u2016 \u00b7 \u2016_2 is the vector \u21132-norm, we let each feature group correspond to all features\ninduced by the same object in the OB. This shrinkage tends to encourage features in the same group\nto be jointly zero. Therefore, the sparsity is now imposed on object level, rather than merely on raw\nfeature level. Such structured sparsity is often desired because it is expected to generate semantically\nmore meaningful lossless compression, that is, out of all the objects in the OB, only a few are needed\nto represent any given natural image. 
We call such a sparsity pattern object sparsity, and denote the\nresultant coef\ufb01cient estimator by \u03b2O.\n\n2 We choose not to use the popular SVM, which corresponds to L being a hinge loss and R(\u03b2) being an\n\u21132-regularizer, because under SVM, content-based compression via structured regularization is much harder.\n\nFigure 3: (Best viewed in colors and magni\ufb01cation.) Comparison of classi\ufb01cation performance of different\nfeatures (GIST vs. BOW vs. SPM vs. OB) and classi\ufb01ers (SVM vs. LR) on (top to down) 15 scene, LabelMe,\nUIUC-Sports and MIT-Indoor datasets. In the LabelMe dataset, the \u201cideal\u201d classi\ufb01cation accuracy is 90%,\nwhere we use the human ground-truth object identities to predict the labels of the scene classes. The blue bar\nin the last panel is the performance of \u201cpseudo\u201d object bank representation extracted from the same number\nof \u201cpseudo\u201d object detectors. The values of the parameters in these \u201cpseudo\u201d detectors are generated without\naltering the original detector structures. In the case of linear classi\ufb01er, the weights of the classi\ufb01er are randomly\ngenerated from a uniform distribution instead of learned. \u201cPseudo\u201d OB is then extracted with exactly the same\nsetting as OB.\n\nJoint object/feature sparsity via \u21131/\u21132 + \u21131 (sparse group) regularized LR (LRG1) The group-\nregularized LR does not, however, yield sparsity within a group (object) for those groups with non-\nzero total weights. That is, if a group of parameters is non-zero, they will all be non-zero. Translating\nto the OB representation, this means there is no scale or spatial location selection for an object. 
To\nremedy this, we propose a composite regularizer, R(\u03b2) \u225c \u03bb_1 \u2016\u03b2\u2016_{1,2} + \u03bb_2 \u2016\u03b2\u2016_1, which conjoins the\nsparsi\ufb01cation effects of both shrinkage functions, and yields sparsity at both the group and individual\nfeature levels. This regularizer necessitates determination of two regularization parameters \u03bb_1 and\n\u03bb_2, and therefore is more dif\ufb01cult to optimize. Furthermore, although the optimization problem for\n\u21131/\u21132 + \u21131 regularized LR is convex, the non-smooth penalty function makes the optimization highly\nnontrivial. In the Appendix, we derive a coordinate descent algorithm for solving this problem. To\nconclude, we call the sparse group shrinkage pattern object/feature sparsity, and denote the resultant\ncoef\ufb01cient estimator by \u03b2OF.\n\n5 Experiments and Results\nDataset We evaluate the OB representation on 4 scene datasets, ranging from generic natural scene\nimages (15-Scene, LabelMe 9-class scene dataset (footnote 3)), to cluttered indoor images (MIT Indoor Scene),\nand to complex event and activity images (UIUC-Sports). Scene classi\ufb01cation performance is eval-\nuated by average multi-way classi\ufb01cation accuracy over all scene classes in each dataset. We list\nbelow the experiment setting for each dataset:\n\n\u2022 15-Scene: This is a dataset of 15 natural scene classes. We use 100 images in each class for training\nand the rest for testing following [19].\n\n\u2022 LabelMe: This is a dataset of 9 classes. 50 randomly drawn images from each scene class\nare used for training and 50 for testing.\n\n\u2022 MIT Indoor: This is a dataset of 15620 images over 67 indoor scenes assembled by [27]. We follow\nthe experimental setting in [27] by using 80 images from each class for training and 20 for testing.\n\u2022 UIUC-Sports: This is a dataset of 8 complex event classes. 
70 randomly drawn images from each\nclass are used for training and 60 for testing following [22].\n\nExperiment Setup We compare OB in scene classi\ufb01cation tasks with different types of conven-\ntional image features, such as SIFT-BoW [23, 3], GIST [25] and SPM [19]. An off-the-shelf SVM\nclassi\ufb01er, and an in-house implementation of the logistic regression (LR) classi\ufb01er were used on\nall feature representations being compared. We investigate the behaviors of different structural risk\nminimization schemes over LR on the OB representation. As introduced in Sec 4, we experimented with\n\u21131 regularized LR (LR1), \u21131/\u21132 regularized LR (LRG) and \u21131/\u21132 + \u21131 regularized LR (LRG1).\n\n3 From 100 popular scene names, we obtained 9 classes from the LabelMe dataset in which there are more\nthan 100 images: beach, mountain, bathroom, church, garage, of\ufb01ce, sail, street, forest. The maximum number\nof images in those classes is 1000.\n\n[Figure 3 panels: Classi\ufb01cation on 15-Scenes; Classi\ufb01cation on LabelMe Scenes; Classi\ufb01cation on MIT Indoor; Classi\ufb01cation on UIUC-Sports. Bars: Gist, SPM, BOW, OB-SVM, OB-LR, and Pseudo OB; y-axis: average percent correctness.]\n\n5.1 Scene Classi\ufb01cation\nFig.3 summarizes the results on scene classi\ufb01cation based on OB and a set of well known low-\nlevel feature representations: GIST [25], Bag of Words (BOW) [3] and Spatial Pyramid Matching\n(SPM) [19] on four challenging scene datasets. We show the results of OB using both an LR classi-\n\ufb01er and a linear SVM (footnote 4). We achieve substantially superior performances on three out of four datasets,\nand are on par with prior results on the 15-Scene dataset. 
The substantial performance gain on the UIUC-Sports\nand the MIT-Indoor scene datasets illustrates the importance of using a semantically meaningful\nrepresentation for complex scenes cluttered with objects. For example, the difference between a liv-\ning room and a bedroom lies less in the overall texture (easily captured by BoW or GIST), and more\nin the different objects and their arrangements. This result underscores the effectiveness of OB,\nhighlighting the fact that in high-level visual tasks such as complex scene recognition, a higher level\nimage representation can be very useful. We further decompose the spatial structure and semantic\nmeaning encoded in OB by using a \u201cpseudo\u201d OB without semantic meaning. The signi\ufb01cant im-\nprovement of OB in classi\ufb01cation performance over the \u201cpseudo object bank\u201d is largely attributed\nto the effectiveness of using object detectors trained from images. For each of the existing scene\ndatasets (UIUC-Sports, 15-Scene and MIT-Indoor), we also compare the reported state-of-the-art\nperformance to our OB algorithm (using a standard LR classi\ufb01er). This result is shown in Tab. 1 (footnote 5).\n\nTable 1: Comparison of classi\ufb01cation results using OB with reported state-of-the-art algorithms. Many of the algorithms use\nmore complex models and supervised information, whereas our results are obtained by applying simple logistic regression.\nMethod | 15-Scene | MIT-Indoor | UIUC-Sports\nstate-of-the-art | 72.2% [19], 81.1% [19] | 26% [27] | 66.0% [32], 73.4% [22]\nOB | 80.9% | 37.6% | 76.3%\n\n5.2 Control Experiment: Object Recognition by OB vs. Classemes [33]\nOB is constructed from the responses of many objects,\nwhich encodes the semantic and spatial information of\nobjects within images. It can be naturally applied to the ob-\nject recognition task. We compare the object recognition\nperformance on the Caltech 256 dataset to [33], a high\nlevel image representation obtained as the output of a\nlarge number of weakly trained object classi\ufb01ers on the\nimage. 
By encoding the spatial locations of the objects\nwithin an image, OB (39%) signi\ufb01cantly outperforms\n[33] (36%) on the 256-way classi\ufb01cation task, where per-\nformance is measured as the average of the diagonal values of a 256\u00d7256 confusion matrix.\n5.3 Semantic Feature Sparsi\ufb01cation Over OB\nIn this subsection, we systematically investigate semantic feature sparsi\ufb01cation of the OB represen-\ntation. We focus on the practical issues directly relevant to the effectiveness of OB representation\nand quality of feature sparsi\ufb01cation, and study the following three aspects of the scene classi\ufb01er:\n1) robustness, 2) feasibility of lossless content-based compression, 3) pro\ufb01tability over a growing\nOB. The interpretability of predictive features is examined in Sec. 5.4.\n5.3.1 Robustness with Respect to Training Sample Size\n\nFigure 4: (a) Classi\ufb01cation performance (and s.t.d.) w.r.t number of training images. Each pair represents per-\nformances of LR1 and LRG respectively. X-axis is the ratio of the training images over the full training dataset\n(70 images/class). (b) Classi\ufb01cation performance w.r.t feature dimension. X-axis is the size of compressed\nfeature dimension, represented as the ratio of the compressed feature dimension over the full OB representation\ndimension (44604). (c) Same as (b), represented in Log Scale to contrast the performances of different algo-\nrithms. (d) Classi\ufb01cation performance w.r.t number of object \ufb01lters. X-axis is the number of object \ufb01lters. 3\nrounds of randomized sampling are performed to choose the object \ufb01lters from all the object detectors.\nThe intrinsic high dimensionality of the OB representation raises a legitimate concern on its demand\non training sample size. 
We investigate the robustness of the logistic regression classi\ufb01er built on\nfeatures selected by LR1 and LRG in this experiment. We train LR1 and LRG on the UIUC-Sports\ndataset by using multiple sizes of training examples, ranging from 25%, 50%, 75% to 100% of the\nfull training data.\nAs shown in Fig. 4(a), we observe only a moderate drop of performance when the number of training\nsamples decreases from 100% to 25% of the training examples, suggesting that the OB representa-\ntion is a rich representation in which the discriminating information resides in a lower dimensional \u201cin-\nformative\u201d feature space that is likely to be retained during feature sparsi\ufb01cation, thereby\nensuring robustness under small training data. We explore this issue further in the next experiment.\n\n4 We also evaluate the classi\ufb01cation performance of using the detected object location and its detection score\nof each object detector as the image representation. The classi\ufb01cation performance of this representation is\n62.0%, 48.3%, 25.1% and 54% on the 15 scene, LabelMe, UIUC-Sports and MIT-Indoor datasets respectively.\n5 We refer to the Appendix for a further discussion of the issue of comparing different algorithms based on\ndifferent training strategies.\n\n[Figure 4 plot labels: (a) Accuracy vs. training-set ratio (25% to 100%) for LR1 and LRG; (b) Compression of Image Representation, Accuracy vs. Dimension Percentage for LR, LR1, LRG, LRG1; (c) the same with Dimension Percentage in Log Scale; (d) Classi\ufb01cation Accuracy vs. Number of Objects for LRG.]\n\n5.3.2 Near Losslessness of Content-based Compression via Regularized Learning\nWe believe that the OB can offer an overcomplete representation of any natural image. 
Therefore,\nthere is great room for possibly (near) lossless content-based compression of the image features into\na much lower-dimensional, but equally discriminative subspace where key semantic information of\nthe images is preserved, and the quality of inference on images such as scene classi\ufb01cation is not\ncompromised signi\ufb01cantly. Such compression can be attractive in reducing the representation cost of\nimage queries, and improving the speed of query inference.\nIn this experiment, we use the classi\ufb01cation performance as a measurement to show how different\nregularization schemes over LR can preserve the discriminative power. For LR1, LRG and LRG1,\ncross-validation is used to decide the best regularization parameters. To study the extent of infor-\nmation loss as a function of the number of features being retained in the classi\ufb01er, we re-train\nan LR classi\ufb01er using features from the top x% of the rank list, where x is a compression\nscale ranging from 0.05% to 100%. One might think that LR itself, when \ufb01tted on the full input dimen-\nsion, can also produce a rank list of features for subsequent selection. For comparison purposes, we\nalso include results from the LR-ranked features; as can be seen in Fig.4(b,c), their performance indeed\ndrops faster than that of all the regularization methods.\nIn Fig.4 (b), we observe that the classi\ufb01cation accuracy drops very slowly as the number of selected\nfeatures decreases. By excluding 75% of the feature dimensions, classi\ufb01cation performance of each algo-\nrithm decreases by less than 3%. One point to notice here is that the non-zero entries only appear in\ndimensions corresponding to no more than 45 objects for LRG at this point. Even more surprisingly,\nLR1 and LRG preserve accuracies above 70% when 99% of the feature dimensions are excluded.\nFig. 4 (c) shows more detailed information in the low feature dimension range, which corresponds\nto a high compression ratio. 
We observe that algorithms imposing sparsity in features (LR1, LRG,\nand LRG1) outperform the unregularized algorithm (LR) by a larger margin as the compression\nratio becomes higher. This reflects that the sparsity learning algorithms are capable of learning a\nmuch lower-dimensional, but highly discriminative subspace.\n\n5.3.3 Profitability Over Growing OB\nWe envisage that the Object Bank will grow rapidly and constantly as more and more labeled web images\nbecome available. This will naturally lead to increasingly richer and higher-dimensional representations\nof images. We ask: are image inference tasks such as scene classification going to benefit from\nthis trend?\nAs group regularized LR imposes sparsity at the object level, we choose to use it to investigate how the\nnumber of objects affects the discriminative power of the OB representation. To simulate what happens\nwhen the size of OB grows, we randomly sample subsets of object detectors at 1%, 5%, 10%,\n25%, 50% and 75% of the total number of objects for multiple rounds. As shown in Fig.4(d), the classification\nperformance of LRG continuously increases when more objects are incorporated in the OB representation.\nWe conjecture that this is due to the accumulation of discriminative object features, and\nwe believe that future growth of OB will lead to stronger representation power and discriminability\nof image models built on OB.\n\n5.4 Interpretability of the Compressed Representation\nIntuitively, a few key objects can discriminate a scene class from another. In this experiment, we aim\nto discover the object sparsity and investigate its interpretability. Again, we use group regularized\nLR (LRG) since the sparsity is imposed at the object level and hence generates a more semantically\nmeaningful compression.\n\nFigure 6: Illustration of the learned \u03b2OF by LRG1 within\nan object group. 
Columns from left to right correspond to\n\u201cbuilding\u201d in \u201cchurch\u201d scene, \u201ctree\u201d in \u201cmountain\u201d, \u201ccloud\u201d\nin \u201cbeach\u201d, and \u201cboat\u201d in \u201csailing\u201d. Top Row: weights of\nOB dimensions corresponding to different scales, from small\nto large. The weight of a scale is obtained by summing up\nthe weights of all features corresponding to this scale in \u03b2OF.\nMiddle: Heat map of feature weights in image space at the\nscale with the highest weight (purple bars above). We project\nthe learned feature weights back to the image by reverting the\nOB extraction procedure. The purple bounding box shows the\nsize of the object filter at this scale, centered at the peak of\nthe heat map. Bottom: example scene images masked by the\nfeature weights in image space (at the highest weighted scale),\nhighlighting the most relevant object dimension.\n\nFigure 5: Object-wise coefficients given\nscene class. Selected objects correspond to\nnon-zero \u03b2 values learned by LRG.\n\nWe show in Fig.5 the object-wise coefficients of the compression\nresults for 4 sample scene classes. The object\nweight is obtained by accumulating the coefficient of \u03b2O\nfrom the feature dimensions of each object (at different\nscales and spatial locations) learned by LRG. Objects\nwith all zero coefficients in the resultant coefficient estimator\nare not displayed. Fig.5 shows that objects that are\n\u201crepresentative\u201d for each scene are retained by LRG. For\nexample, \u201csailboat\u201d, \u201cboat\u201d, and \u201csky\u201d are objects with\nvery high weight in the \u201csailing\u201d scene class. 
This suggests\nthat the representation compression via LRG is virtually\nbased upon the image content and is semantically\nmeaningful; therefore, it is nearly \u201csemantically lossless\u201d.\nKnowing the important objects learned by the compression\nalgorithm, we further investigate the discriminative\ndimensions within the object level. We use LRG1 to examine the learned weights within an object.\nAs introduced in Sec.3, each feature dimension in the OB representation is directly related\nto a specific scale, geometric location and object identity. Hence, the weights in \u03b2OF reflect the\nimportance of an object at a certain scale and location. To verify this hypothesis, we examine the\nimportance of objects across scales by summing up the weights of related spatial locations and pyramid\nresolutions. We show one representative object in a scene and visualize the feature patterns within\nthe object group. As shown in Fig.6(Top), LRG1 achieves joint object/feature sparsification\nby zeroing out less relevant scales, so that only the most discriminative scales are retained. To analyze\nhow \u03b2OF reflects the geometric location, we further project the learned coefficients back to the\nimage space by reversing the OB representation extraction procedure. In Fig.6(Middle), we observe\nthat the regions with high intensities are also the locations where the object frequently appears. For\nexample, cloud usually appears in the upper half of a scene in the beach class.\n\n6 Conclusion\nAs we try to tackle higher level visual recognition problems, we show that the Object Bank representation\nis powerful on scene classification tasks because it carries rich semantic-level image information.\nWe also apply structured regularization schemes on the OB representation, and achieve nearly\nlossless semantic-preserving compression. 
In the future, we will further test OB representation in\nother useful vision applications, as well as other interesting structural regularization schemes.\n\nAcknowledgments L. F-F is partially supported by an NSF CAREER grant (IIS-0845230), a Google research\naward, and a Microsoft Research Fellowship. E. X is supported by AFOSR FA9550010247, ONR\nN0001140910758, NSF Career DBI-0546594, NSF IIS-0713379 and Alfred P. Sloan Fellowship. We thank\nWei Yu, Jia Deng, Olga Russakovsky, Bangpeng Yao, Barry Chai, Yongwhan Lim, and anonymous reviewers\nfor helpful comments.\n", "award": [], "sourceid": 159, "authors": [{"given_name": "Li-jia", "family_name": "Li", "institution": null}, {"given_name": "Hao", "family_name": "Su", "institution": null}, {"given_name": "Li", "family_name": "Fei-fei", "institution": null}, {"given_name": "Eric", "family_name": "Xing", "institution": null}]}