{"title": "Large-Scale Category Structure Aware Image Categorization", "book": "Advances in Neural Information Processing Systems", "page_first": 1251, "page_last": 1259, "abstract": "Most previous research on image categorization has focused on medium-scale data sets, while large-scale image categorization with millions of images from thousands of categories remains a challenge. With the emergence of structured large-scale dataset such as the ImageNet, rich information about the conceptual relationships between images, such as a tree hierarchy among various image categories, become available. As human cognition of complex visual world benefits from underlying semantic relationships between object classes, we believe a machine learning system can and should leverage such information as well for better performance. In this paper, we employ such semantic relatedness among image categories for large-scale image categorization. Specifically, a category hierarchy is utilized to properly define loss function and select common set of features for related categories. An efficient optimization method based on proximal approximation and accelerated parallel gradient method is introduced. Experimental results on a subset of ImageNet containing 1.2 million images from 1000 categories demonstrate the effectiveness and promise of our proposed approach.", "full_text": "Large-Scale Category Structure Aware Image\n\nCategorization\n\nBin Zhao\n\nSchool of Computer Science\nCarnegie Mellon University\n\nLi Fei-Fei\n\nComputer Science Department\n\nStanford University\n\nEric P. Xing\n\nSchool of Computer Science\nCarnegie Mellon University\n\nbinzhao@cs.cmu.edu\n\nfeifeili@cs.stanford.edu\n\nepxing@cs.cmu.edu\n\nAbstract\n\nMost previous research on image categorization has focused on medium-scale\ndata sets, while large-scale image categorization with millions of images from\nthousands of categories remains a challenge. 
With the emergence of structured large-scale datasets such as ImageNet, rich information about the conceptual relationships between images, such as a tree hierarchy among various image categories, becomes available. As human cognition of the complex visual world benefits from underlying semantic relationships between object classes, we believe a machine learning system can and should leverage such information as well for better performance. In this paper, we employ such semantic relatedness among image categories for large-scale image categorization. Specifically, a category hierarchy is utilized to properly define the loss function and select a common set of features for related categories. An efficient optimization method based on proximal approximation and an accelerated parallel gradient method is introduced. Experimental results on a subset of ImageNet containing 1.2 million images from 1000 categories demonstrate the effectiveness and promise of our proposed approach.\n\n1 Introduction\n\nImage categorization / object recognition has been one of the most important research problems in the computer vision community. While most previous research on image categorization has focused on medium-scale data sets, involving objects from dozens of categories, there is now a growing consensus that it is necessary to build general-purpose object recognizers that are able to recognize many more classes of objects. (A human being has little problem recognizing tens of thousands of visual categories, even with very little \u201ctraining\u201d data.) The Caltech 101/256 [14, 18] is a pioneering benchmark data set on that front. LabelMe [31] provides 30k labeled and segmented images, covering around 200 image categories. Moreover, the newly released ImageNet [12] data set goes a big step further: it increases the number of classes to over 15000, and has more than 1000 images for each class on average. 
Similarly, TinyImage [36] contains 80 million 32 x 32 low-resolution images, with each image loosely labeled with one of 75,062 English nouns. Clearly, these are no longer artificial visual categorization problems created for machine learning, but instead more like a human-level cognition problem for real-world object recognition with a much bigger set of objects. A natural way to formulate this problem is multi-way or multi-task classification, but the seemingly standard formulation on such a gigantic data set poses a completely new challenge to both computer vision and machine learning. Unfortunately, despite the well-known advantages and recent advancements of multi-way classification techniques [1, 19, 4] in machine learning, complexity concerns have driven most research on such super large-scale data sets back to simple methods such as nearest neighbor search [6], least squares regression [16] or learning thousands of binary classifiers [24].\n\nFigure 1: (a) Image category hierarchy in ImageNet; (b) Overlapping group structure; (c) Semantic relatedness measure between image categories.\n\nThe hierarchical semantic structure stemming from the WordNet over image categories in the ImageNet data makes it distinctive from other existing large-scale datasets, and it resembles how the human cognitive system stores visual knowledge. Figure 1(a) shows an example of such a tree hierarchy, where leaf nodes are individual categories, and each internal node denotes the cluster of categories corresponding to the leaf nodes of the subtree rooted at the given node. As human cognition of the complex visual world benefits from underlying semantic relationships between object classes, we believe a machine learning system can and should leverage such information as well for better performance. 
Specifically, we argue that instead of formulating the recognition task as a flat classification problem, where each category is treated equally and independently, a better strategy is to utilize the rich information residing in the concept hierarchy among image categories to train a system that couples all the different recognition tasks over different categories. It should be noted that our proposed method is applicable to any tree structure over image categories, such as a category structure learned to capture visual appearance similarities between image classes [32, 17, 13].\n\nTo the best of our knowledge, our attempt in this paper represents an initial foray into systematically utilizing the information residing in a concept hierarchy for multi-way classification on super large-scale image data sets. More precisely, our approach utilizes the concept hierarchy in two aspects: the loss function and feature selection. First, the loss function used in our formulation weights different misclassification outcomes differently: misclassifying an image to a category that is close to its true identity should receive less penalty than misclassifying it to a totally unrelated one. Second, in an image classification problem with thousands of categories, it is not realistic to assume that all of the classes share the same set of relevant features. That is to say, a subset of highly related categories may share a common set of relevant features, whereas weakly related categories are less likely to be affected by the same features. Consequently, the image categorization problem is formulated as augmented logistic regression with overlapping-group-lasso regularization. The corresponding optimization problem involves a non-smooth convex objective function represented as a summation over all training examples. 
To solve this optimization problem, we introduce the Accelerated Parallel ProximaL gradiEnT (APPLET) method, which tackles the non-smoothness of the overlapping-group-lasso penalty via proximal gradient [20, 9], and the huge number of training samples by Map-Reduce parallel computing [10]. Therefore, the contributions made in this paper are: (1) We incorporate the semantic relationships between object classes into an augmented multi-class logistic regression formulation, regularized by the overlapping-group-lasso penalty. The sheer size of the ImageNet data set that our formulation is designed to tackle singles out our work from previous attempts on multi-class classification or transfer learning. (2) We propose a proximal gradient based method for solving the resulting non-smooth optimization problem, where the super large scale of the problem is tackled by Map-Reduce parallel computation.\n\nThe rest of this paper is organized as follows. A detailed explanation of the formulation is provided in Section 2. Section 3 introduces the Accelerated Parallel ProximaL gradiEnT (APPLET) method for solving the corresponding large-scale non-smooth optimization problem. Section 4 briefly reviews several related works. Section 5 demonstrates the effectiveness of the proposed algorithm using millions of training images from 1000 categories, followed by conclusions in Section 6.\n\n2 Category Structure Aware Image Categorization\n\n2.1 Motivation\n\nImageNet organizes the different classes of images in a densely populated semantic hierarchy. Specifically, image categories in ImageNet are interlinked by several types of relations, with the \u201cIS-A\u201d relation being the most comprehensive and useful [11], resulting in a tree hierarchy over image categories. For example, the 'husky' category follows a path in the tree composed of 'working dog', 'dog', 'canine', etc. 
The distance between two nodes in the tree depicts the difference between the two corresponding image categories. Consequently, in the category hierarchy of ImageNet, an internal node near the bottom of the tree indicates that the image categories in its subtree are highly correlated, whereas an internal node near the root represents relatively weaker correlations among the categories in its subtree.\n\nThe class hierarchy provides a measure of relatedness between image classes. Misclassifying an image to a category that is close to its true identity should receive less penalty than misclassifying it to a totally unrelated one. For example, although horses are not exactly ponies, we expect the loss for classifying a \u201cpony\u201d as a \u201chorse\u201d to be lower than classifying it as a \u201ccar\u201d. Instead of using the 0-1 loss as in conventional image categorization, which treats image categories equally and independently, our approach utilizes a loss function that is aware of the category hierarchy.\n\nMoreover, highly related image categories are more likely to share common visual patterns. For example, in Figure 1(a), husky and shepherd share similar object shape and texture. Consequently, recognition of these related categories is more likely to be affected by the same features. In this work, we regularize the sparsity pattern of the weight vectors for related categories. This is equivalent to learning a low-dimensional representation that is shared across multiple related categories.\n\n2.2 Logistic Regression with Category Structure\n\nSuppose we are given N training images, each represented as a J-dimensional input vector and belonging to one of K categories. Let X denote the J x N input matrix, where each column corresponds to an instance. Similarly, let Y denote the N x 1 output vector, where each element corresponds to the label for an image. 
Multi-class logistic regression defines a weight vector w_k for each class k \u2208 {1, . . . , K} and classifies sample x by y^* = arg max_{y \u2208 {1,...,K}} P(y|x, W), with the conditional likelihood computed as\n\nP(y_i|x_i, W) = exp(w_{y_i}^T x_i) / sum_k exp(w_k^T x_i)    (1)\n\nThe optimal weight vectors W^* = [w_1^*, . . . , w_K^*] are\n\nW^* = arg min_W - sum_{i=1}^N log P(y_i|x_i, W) + \u03bb \u2126(W)    (2)\n\nwhere \u2126(W) is a regularization term defined on W and \u03bb is the regularization parameter.\n\n2.2.1 Augmented Soft-Max Loss Function\n\nUsing the tree hierarchy on image categories, we can calculate a semantic relatedness (a.k.a. similarity) matrix S \u2208 R^{K x K} over all categories, where S_{ij} measures the semantic relatedness of classes i and j. Using the semantic relatedness measure, the likelihood of x_i belonging to category y_i can be modified as follows:\n\n\u02c6P(y_i|x_i, W) \u221d sum_{r=1}^K S_{y_i,r} P(r|x_i, W) \u221d sum_{r=1}^K S_{y_i,r} exp(w_r^T x_i) / sum_k exp(w_k^T x_i) \u221d sum_{r=1}^K S_{y_i,r} exp(w_r^T x_i)    (3)\n\nSince sum_{r=1}^K \u02c6P(r|x_i, W) = 1, consequently,\n\n\u02c6P(y_i|x_i, W) = sum_{r=1}^K S_{y_i,r} exp(w_r^T x_i) / ( sum_{r=1}^K sum_{k=1}^K S_{k,r} exp(w_r^T x_i) )    (4)\n\nFor the special case where the semantic relatedness matrix S is an identity matrix, meaning each class is related only to itself, Eq. (4) simplifies to Eq. (1). 
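As a concrete sketch of the augmented likelihood in Eq. (4), and of the relatedness matrix S built from root-to-leaf paths (defined in Section 2.2.2), consider the following minimal Python. The function names, the `paths` input, and the use of raw per-class scores in place of w_r^T x_i are illustrative assumptions, not the authors' code.

```python
import math

def semantic_relatedness(paths, kappa=5.0):
    """Relatedness matrix S = exp(-kappa * (1 - D)), with D from Eq. (6).

    paths[i] lists the nodes on the path from the root to leaf category i
    (a hypothetical input format chosen for this sketch).
    """
    K = len(paths)
    S = [[0.0] * K for _ in range(K)]
    for i in range(K):
        for j in range(K):
            shared = len(set(paths[i]) & set(paths[j]))     # intersect(path(i), path(j))
            d = shared / max(len(paths[i]), len(paths[j]))  # Eq. (6)
            S[i][j] = math.exp(-kappa * (1.0 - d))
    return S

def augmented_likelihood(scores, S, y):
    """Eq. (4): augmented softmax, where scores[r] plays the role of w_r^T x."""
    K = len(scores)
    num = sum(S[y][r] * math.exp(scores[r]) for r in range(K))
    den = sum(sum(S[k][r] for k in range(K)) * math.exp(scores[r]) for r in range(K))
    return num / den
```

With S set to the identity matrix, `augmented_likelihood` reduces to the standard softmax of Eq. (1), matching the special case noted above; summing it over all y always yields 1, as required by Eq. (4).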
Using this modified softmax loss function, the image categorization problem can be formulated as\n\nmin_W sum_{i=1}^N [ log( sum_r sum_k S_{k,r} exp(w_r^T x_i) ) - log( sum_r S_{y_i,r} exp(w_r^T x_i) ) ] + \u03bb \u2126(W)    (5)\n\n2.2.2 Semantic Relatedness Matrix\n\nTo compute the semantic relatedness matrix S in the above formulation, we first define a metric measuring the semantic distance between image categories. A simple way to compute semantic distance in a structure such as the one provided by ImageNet is to utilize the paths connecting the two corresponding nodes to the root node. Following [7], we define the semantic distance D_{ij} between class i and class j as the number of nodes shared by their two parent branches, divided by the length of the longest of the two branches:\n\nD_{ij} = intersect(path(i), path(j)) / max(length(path(i)), length(path(j)))    (6)\n\nwhere path(i) is the path from the root node to node i and intersect(p1, p2) counts the number of nodes shared by the two paths p1 and p2. We construct the semantic relatedness matrix as S = exp(-\u03ba (1 - D)), where \u03ba is a constant controlling the decay of semantic relatedness with respect to semantic distance. Figure 1(c) shows the semantic relatedness matrix computed with \u03ba = 5.\n\n2.3 Tree-Guided Sparse Feature Coding\n\nIn ImageNet, image categories are grouped at multiple granularities as a tree hierarchy. As illustrated in Section 2.1, the image categories in each internal node are likely to be influenced by a common set of features. 
In order to achieve this type of structured sparsity at multiple levels of the hierarchy, we utilize an overlapping-group-lasso penalty recently proposed in [21] for the genetic association mapping problem, where the goal is to identify a small number of SNPs (inputs) out of millions of SNPs that influence phenotypes (outputs) such as gene expression measurements.\n\nSpecifically, given the tree hierarchy T = (V, E) over image categories, each node v \u2208 V of tree T is associated with a group G_v, composed of all leaf nodes in the subtree rooted at v, as illustrated in Figure 1(b). Clearly, each group G_v is a subset of the power set of {1, . . . , K}. Given these groups G = {G_v}_{v \u2208 V} of categories, we define the following overlapping-group-lasso penalty [21]:\n\n\u2126(W) = sum_{v \u2208 V} sum_j \u03b3_v ||w_{j,G_v}||_2    (7)\n\nwhere w_{j,G_v} is the vector of weight coefficients {w_{jk}, k \u2208 G_v} for input j \u2208 {1, . . . , J} associated with the categories in G_v, and each group G_v is associated with a weight \u03b3_v that reflects the strength of correlation within the group. It should be noted that we do not require the groups in G to be mutually exclusive, and consequently, each leaf node may belong to multiple groups at various granularities.\n\nInserting the above overlapping-group-lasso penalty into (5), we formulate category structure aware image categorization as follows:\n\nmin_W sum_{i=1}^N [ log( sum_r sum_k S_{k,r} exp(w_r^T x_i) ) - log( sum_r S_{y_i,r} exp(w_r^T x_i) ) ] + \u03bb sum_{v \u2208 V} sum_j \u03b3_v ||w_{j,G_v}||_2    (8)\n\n3 Accelerated Parallel ProximaL gradiEnT (APPLET) Method\n\nThe challenge in solving problem (8) lies in two facts: the non-separability of W in the non-smooth overlapping-group-lasso penalty \u2126(W), and the huge number N of training samples. 
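For concreteness, the overlapping-group-lasso penalty of Eq. (7) can be sketched in a few lines of Python. The function name and the list-of-lists encodings of W and the groups are assumptions made for this sketch.

```python
import math

def overlapping_group_lasso(W, groups, gammas):
    """Eq. (7): Omega(W) = sum over nodes v and inputs j of gamma_v * ||w_{j,G_v}||_2.

    W is a J x K weight matrix (one row per input dimension j);
    groups[v] lists the leaf categories under tree node v (groups may overlap).
    """
    total = 0.0
    for group, gamma in zip(groups, gammas):
        for row in W:  # one row per input dimension j
            total += gamma * math.sqrt(sum(row[k] ** 2 for k in group))
    return total
```

Because the groups come from the subtrees of the hierarchy, a single category index k typically appears in several groups, which is exactly the overlap that couples feature selection across related categories.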
Conventionally, to handle the non-smoothness of \u2126(W), we could reformulate the problem as either a second order cone program (SOCP) or a quadratic program (QP) [35]. However, the state-of-the-art approach for solving SOCPs and QPs, based on the interior point method, requires solving a Newton system to find the search direction, and is computationally very expensive even for moderate-sized problems. Moreover, due to the huge number of samples in the training set, off-the-shelf optimization solvers are too slow to be used.\n\nIn this work, we adopt a proximal-gradient method to handle the non-smoothness of \u2126(W). Specifically, we first reformulate the overlapping-group-lasso penalty \u2126(W) into a max problem over auxiliary variables using the dual norm, and then introduce its smooth lower bound [20, 9]. Instead of optimizing the original non-smooth penalty, we run the accelerated gradient descent method [27] under a Map-Reduce framework [10] to optimize the smooth lower bound. The proposed approach enjoys a fast convergence rate and low per-iteration complexity.\n\n3.1 Reformulate the Penalty\n\nFor referring convenience, we number the elements of the set G = {G_v}_{v \u2208 V} as G = {g_1, . . . , g_{|G|}} according to an arbitrary order, where |G| denotes the total number of elements in G. For each input j and group g_i associated with w_{j,g_i}, we introduce a vector of auxiliary variables \u03b1_{j,g_i} \u2208 R^{|g_i|}. Since the dual norm of the L2 norm is also an L2 norm, we can reformulate ||w_{j,g_i}||_2 as ||w_{j,g_i}||_2 = max_{||\u03b1_{j,g_i}||_2 <= 1} \u03b1_{j,g_i}^T w_{j,g_i}. Moreover, define the following (sum_{g \u2208 G} |g|) x J matrix\n\nA = [ \u03b1_{1,g_1} . . . \u03b1_{J,g_1} ; . . . ; \u03b1_{1,g_{|G|}} . . . \u03b1_{J,g_{|G|}} ]    (9)\n\nin domain O = {A : ||\u03b1_{j,g_i}||_2 <= 1, for all j \u2208 {1, . . . , J}, g_i \u2208 G}. 
Following [9], the overlapping-group-lasso penalty in (8) can be equivalently reformulated as\n\n\u2126(W) = sum_i sum_j \u03b3_i max_{||\u03b1_{j,g_i}||_2 <= 1} \u03b1_{j,g_i}^T w_{j,g_i} = max_{A \u2208 O} \u27e8C W^T, A\u27e9    (10)\n\nwhere i = 1, . . . , |G|, j = 1, . . . , J, C \u2208 R^{(sum_{g \u2208 G} |g|) x K}, and \u27e8U, V\u27e9 = Tr(U^T V) is the inner product of two matrices. Moreover, the matrix C is defined with rows indexed by (s, g_i) such that s \u2208 g_i and i \u2208 {1, . . . , |G|}, columns indexed by k \u2208 {1, . . . , K}, and the value of the element at row (s, g_i) and column k set to C_{(s,g_i),k} = \u03b3_i if s = k and 0 otherwise.\n\nAfter the above reformulation, (10) is still a non-smooth function of W, and this makes the optimization challenging. To tackle this problem, we introduce an auxiliary function [20, 9] to construct a smooth approximation of (10). Specifically, our smooth approximation function is defined as:\n\nf_\u00b5(W) = max_{A \u2208 O} \u27e8C W^T, A\u27e9 - \u00b5 d(A)    (11)\n\nwhere \u00b5 is a positive smoothness parameter and d(A) is an arbitrary smooth strongly-convex function defined on O. The original penalty term can be viewed as f_\u00b5(W) with \u00b5 = 0. Since our algorithm will utilize the optimal solution A^* of (11), we choose d(A) = (1/2)||A||_F^2 so that we can obtain a closed form solution for A^*. Clearly, f_\u00b5(W) is a lower bound of f_0(W), with the gap computed as D = max_{A \u2208 O} d(A) = max_{A \u2208 O} (1/2)||A||_F^2 = (1/2) J |G|.\n\nTheorem 1 For any \u00b5 > 0, f_\u00b5(W) is a convex and continuously differentiable function of W, and the gradient of f_\u00b5(W) can be computed as \u2207f_\u00b5(W) = A^{*T} C, where A^* is the optimal solution to (11).\n\nAccording to Theorem 1, f_\u00b5(W) is a smooth function for any \u00b5 > 0, with a simple form of gradient, and can be viewed as a smooth approximation of f_0(W) with a maximum gap of \u00b5 D. 
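A minimal sketch of the gradient of Theorem 1, using the closed-form solution for A^* given next (a per-group shrinkage of \u03b3_i w_{j,g_i} / \u00b5); the function names and the list-of-lists encoding of W are assumptions of this sketch, not the authors' implementation.

```python
import math

def shrink(u):
    """Project u onto the L2 unit ball (the shrinkage operator of Eq. (12))."""
    norm = math.sqrt(sum(x * x for x in u))
    return [x / norm for x in u] if norm > 1.0 else list(u)

def smoothed_penalty_grad(W, groups, gammas, mu):
    """Gradient of f_mu per Theorem 1: grad = A*^T C, where each block of A*
    is given in closed form by shrink(gamma_i * w_{j,g_i} / mu)."""
    J, K = len(W), len(W[0])
    grad = [[0.0] * K for _ in range(J)]
    for group, gamma in zip(groups, gammas):
        for j in range(J):
            a = shrink([gamma * W[j][k] / mu for k in group])
            for idx, k in enumerate(group):
                grad[j][k] += gamma * a[idx]  # row (s, g_i) of C carries gamma_i at column s
    return grad
```

As \u00b5 shrinks, more blocks hit the unit-ball boundary and the gradient approaches the subgradient of the exact penalty, which is the usual accuracy/smoothness trade-off of this construction.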
Finally, the optimal solution A^* of (11) is composed of \u03b1^*_{j,g_i} = S(\u03b3_i w_{j,g_i} / \u00b5), where S is the shrinkage operator defined as follows:\n\nS(u) = u / ||u||_2 if ||u||_2 > 1;  S(u) = u if ||u||_2 <= 1    (12)\n\n3.2 Accelerated Parallel Gradient Method\n\nGiven the smooth approximation of \u2126(W) in (11) and the corresponding gradient presented in Theorem 1, we can apply the gradient descent method to solve the problem. Specifically, we replace the overlapping-group-lasso penalty in (8) with its smooth approximation f_\u00b5(W) to obtain the following optimization problem\n\nmin_W \u02dcf(W) = g(W) + \u03bb f_\u00b5(W)    (13)\n\nwhere g(W) = sum_{i=1}^N [ log( sum_r sum_k S_{k,r} exp(w_r^T x_i) ) - log( sum_r S_{y_i,r} exp(w_r^T x_i) ) ] is the augmented logistic regression loss function. The gradient of g(W) w.r.t. w_k can be calculated as follows\n\n\u2202g(W)/\u2202w_k = sum_{i=1}^N x_i [ ( sum_q S_{k,q} ) exp(w_k^T x_i) / ( sum_r ( sum_q S_{r,q} ) exp(w_r^T x_i) ) - S_{y_i,k} exp(w_k^T x_i) / ( sum_r S_{y_i,r} exp(w_r^T x_i) ) ]    (14)\n\nTherefore, the gradient of g(W) w.r.t. W can be computed as \u2207g(W) = [\u2202g(W)/\u2202w_1, . . . , \u2202g(W)/\u2202w_K]. According to Theorem 1, the gradient of \u02dcf(W) is given by\n\n\u2207\u02dcf(W) = \u2207g(W) + \u03bb A^{*T} C    (15)\n\nAlthough \u02dcf(W) is a smooth function of W, it is represented as a summation over all training samples. Consequently, \u2207\u02dcf(W) can only be computed by summing over all N training samples. Due to the huge number of samples in the training set, we adopt a Map-Reduce parallel framework [10] to compute \u2207g(W) as shown in Eq. (14). While standard gradient schemes have a slow convergence rate, they can often be accelerated. 
This stems from the pioneering work of Nesterov [27], a deterministic algorithm for smooth optimization. In this paper, we adopt this accelerated gradient method, and the whole algorithm is shown in Algorithm 1.\n\nAlgorithm 1 Accelerated Parallel ProximaL gradiEnT method (APPLET)\nInput: X, Y, C, desired accuracy \u03f5, step parameters {\u03b7_t}\nInitialization: B_0 = 0\nfor t = 1, 2, . . ., until convergence do\n  Map-step: Distribute the data to M cores {X_1, . . . , X_M}; compute \u2207g_m(B_{t-1}) for X_m in parallel\n  Reduce-step:\n  (1) \u2207\u02dcf(B_{t-1}) = sum_{m=1}^M \u2207g_m(B_{t-1}) + \u03bb A^{*T} C\n  (2) W_t = B_{t-1} - \u03b7_t \u2207\u02dcf(B_{t-1})\n  (3) B_t = W_t + ((t-1)/(t+2)) (W_t - W_{t-1})\nend for\nOutput: \u02c6W = W_t\n\n4 Related Works\n\nVarious attempts at sharing information across related image categories have been explored. Early approaches stem from neural networks, where the hidden layers are shared across different classes [8, 23]. Recent approaches transfer information across classes by regularizing the parameters of the classifiers across classes [37, 28, 15, 33, 34, 2, 26, 30]. Common to all these approaches is that experiments are always performed with relatively few classes [16]. It is unclear how these approaches would perform on super large-scale data sets containing thousands of image categories. Some of these approaches would encounter severe computational bottlenecks when scaling up to thousands of classes [16].\n\nAnother line of research is the ImageNet Large Scale Visual Recognition Challenge 2010 (ILSVRC10) [3], where the best performing approaches use techniques such as spatial pyramid matching [22], locality-constrained linear coding [38], the Fisher vector [29], and linear SVMs trained using stochastic gradient descent. Success has been witnessed in ILSVRC10 even with simple machine learning techniques. 
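Algorithm 1 above can be simulated serially in a few lines (a sketch under stated assumptions: the per-core gradient functions stand in for the Map step, the smoothed-penalty term \u03bb A^{*T} C is assumed folded into them, and a constant step size replaces {\u03b7_t}).

```python
def applet(grad_parts, eta, T, shape):
    """Serial simulation of Algorithm 1 (APPLET).

    grad_parts: one gradient function per simulated core (the Map step);
    their outputs are summed in the Reduce step.
    """
    J, K = shape
    W_prev = [[0.0] * K for _ in range(J)]  # W_0
    B = [[0.0] * K for _ in range(J)]       # B_0 = 0
    for t in range(1, T + 1):
        # Map-step + Reduce-step (1): sum the per-core gradients at B_{t-1}
        parts = [g(B) for g in grad_parts]
        grad = [[sum(p[j][k] for p in parts) for k in range(K)] for j in range(J)]
        # Reduce-step (2): gradient step W_t = B_{t-1} - eta * grad
        W = [[B[j][k] - eta * grad[j][k] for k in range(K)] for j in range(J)]
        # Reduce-step (3): Nesterov momentum B_t = W_t + (t-1)/(t+2) * (W_t - W_{t-1})
        c = (t - 1) / (t + 2)
        B = [[W[j][k] + c * (W[j][k] - W_prev[j][k]) for k in range(K)] for j in range(J)]
        W_prev = W
    return W_prev
```

In the paper's setting the `grad_parts` would be the distributed evaluations of Eq. (14) on data shards; here any list of functions whose sum is the full gradient exercises the same update.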
However, none of these approaches utilize the semantic relationships defined among image categories in ImageNet, which we argue is a crucial source of information for further improvement on such a super large-scale classification problem.\n\n5 Experiments\n\nIn this section, we test the performance of APPLET on the subset of ImageNet used in ILSVRC10, containing 1.2 million images from 1000 categories, divided into distinct portions for training, validation and test. The number of images for each category ranges from 668 to 3047. We use the provided validation set for parameter selection, and the final results are obtained on the test set.\n\nBefore presenting the classification results, we would like to make clear that the goal and contributions of this work are different from those of the aforementioned approaches proposed in ILSVRC10. Those approaches were designed to enter a performance competition, where heavy feature engineering and post-processing (such as ad hoc voting over multiple algorithms) were used to achieve high accuracy. Our work, on the other hand, looks at this problem from a different angle, focusing on a principled methodology that explores the benefit of utilizing class structure in image categorization and proposing a model and related optimization technique to properly incorporate such information. We did not use the full scope of all the features and post-processing schemes to boost our classification results as the ILSVRC10 competition teams did. Therefore we argue that the results of our work are not directly comparable with the ILSVRC10 competitions.\n\n5.1 Image Features\n\nEach image is resized to have a max side length of 300 pixels. SIFT [25] descriptors are computed on 20 x 20 overlapping patches with a spacing of 10 pixels. Images are further downsized to 1/2 of the side length and then 1/4 of the side length, and more descriptors are computed. 
We then perform k-means clustering on a random subset of 10 million SIFT descriptors to form a visual vocabulary of 1000 visual words. Using this learned vocabulary, we employ Locality-constrained Linear Coding (LLC) [38], which has shown state-of-the-art performance on several benchmark data sets, to construct a vector representation for each image. Finally, a single feature vector is computed for each image using max pooling on a spatial pyramid [22]. The pooled features from various locations and scales are then concatenated to form a spatial pyramid representation of the image. Consequently, each image is represented as a vector in a 21,000-dimensional space.\n\n5.2 Evaluation Criteria\n\nWe adopt the same performance measures used in ILSVRC10. Specifically, for every image, each tested algorithm produces a list of 5 object categories in descending order of confidence. Performance is measured using the top-n error rate, n = 1, . . . , 5 in our case, and two error measures are reported. The first is a flat error, which equals 1 if the true class is not within the n most confident predictions, and 0 otherwise. The second is a hierarchical error, reporting the minimum height of the lowest common ancestor between the true and predicted classes. 
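The two error measures can be sketched as follows. This is a simplified, self-contained version: the hierarchical error here counts steps below the lowest common ancestor along root-to-leaf paths, an approximation of the ILSVRC10 definition (which measures a node's height by its longest path down to a leaf), so it will not reproduce the paper's exact numbers. All names and the `paths` format are assumptions of the sketch.

```python
def flat_error(true_label, predictions):
    """Flat error: 1 if the true class is absent from the top-n predictions."""
    return 0.0 if true_label in predictions else 1.0

def hierarchical_error(true_label, predictions, paths):
    """Hierarchical error (sketch): for each prediction, find the lowest common
    ancestor with the true class on the root-to-leaf paths, measure how far below
    it the deeper of the two classes sits, and return the minimum over the top-n
    predictions (0 for an exact hit)."""
    true_path = paths[true_label]
    best = None
    for p in predictions:
        pred_path = paths[p]
        shared = 0  # number of common ancestors from the root down
        for a, b in zip(true_path, pred_path):
            if a != b:
                break
            shared += 1
        height = max(len(true_path), len(pred_path)) - shared
        best = height if best is None else min(best, height)
    return best
```

Averaging either function over all test images gives the overall error score reported in Table 1.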
For each of the above two criteria, the overall error score for an algorithm is the average error over all test images.\n\nTable 1: Classification results (both flat and hierarchical errors) of various algorithms.\n\nAlgorithm | Flat Error (Top 1 / 2 / 3 / 4 / 5) | Hierarchical Error (Top 1 / 2 / 3 / 4 / 5)\nLR      | 0.797 / 0.726 / 0.678 / 0.639 / 0.607 | 8.727 / 6.974 / 5.997 / 5.355 / 4.854\nALR     | 0.796 / 0.723 / 0.668 / 0.624 / 0.587 | 8.259 / 6.234 / 5.061 / 4.269 / 3.659\nGroupLR | 0.786 / 0.699 / 0.642 / 0.600 / 0.568 | 7.620 / 5.460 / 4.322 / 3.624 / 3.156\nAPPLET  | 0.779 / 0.698 / 0.634 / 0.589 / 0.565 | 7.208 / 4.985 / 3.798 / 3.166 / 3.012\n\nFigure 2: Left: image classes with highest accuracy. Right: image classes with lowest accuracy.\n\n5.3 Comparisons & Classification Results\n\nWe have conducted comprehensive performance evaluations by testing our method under different circumstances. Specifically, to better understand the effect of augmenting logistic regression with semantic relatedness and the use of the overlapping-group-lasso penalty to enforce group-level feature selection, we study the model with only the augmented logistic regression loss and with only the overlapping-group-lasso penalty separately, and compare with the APPLET method. We use conventional L2-regularized logistic regression [5] as the baseline. The algorithms that we evaluated are listed below: (1) L2-regularized logistic regression (LR) [5]; (2) Augmented logistic regression with L2 regularization (ALR); (3) Logistic regression with overlapping-group-lasso regularization (GroupLR); (4) Augmented logistic regression with overlapping-group-lasso regularization (APPLET).\n\nTable 1 presents the classification results of the various algorithms. According to the classification results, we can clearly see the advantage of APPLET over conventional logistic regression, especially on the top-5 error rate. 
Specifically, comparing the top-5 error rate, APPLET outperforms LR by a margin of 0.04 on the flat loss, and a margin of 1.84 on the hierarchical loss. It should be noted that the hierarchical error is measured by the height of the lowest common ancestor in the hierarchy, and moving up a level can more than double the number of descendants. Table 1 also compares the performance of ALR with LR. Specifically, ALR outperforms LR slightly when using the top-1 prediction results. However, on the top-5 prediction results, ALR performs clearly better than LR. A similar phenomenon is observed when comparing the classification results of GroupLR with LR. Moreover, Figure 2 shows the image categories with the highest and lowest classification accuracy.\n\nOne key reason for introducing the augmented loss function is to ensure that the predicted image class falls not too far from its true class on the semantic hierarchy. Results in Table 2 demonstrate that even though APPLET cannot guarantee a correct prediction on each image, it produces labels that are closer to the true one than LR, which generates labels far from the correct ones.\n\nTable 2: Example prediction results of APPLET and LR. Numbers indicate the hierarchical error of the misclassification, defined in Section 5.2.\n\nTrue class: laptop | linden | gordon setter | gourd | bullfrog | volcano | odometer | earthworm\nAPPLET:     laptop (0) | live oak (3) | Irish setter (2) | acorn (2) | woodfrog (2) | volcano (0) | odometer (0) | earthworm (0)\nLR:         laptop (0) | log wood (3) | alp (11) | olive (2) | water snake (9) | geyser (4) | odometer (0) | slug (8)\n\nAs shown in Table 1, the systematic reduction in classification error using APPLET shows that acknowledging semantic relationships between image classes enables the system to discriminate at more informative semantic levels. 
Moreover, the results in Table 2 demonstrate that the classification results of APPLET can be significantly more informative: labeling a \u201cbullfrog\u201d as \u201cwoodfrog\u201d gives a more useful answer than \u201cwater snake\u201d, as it is still correct at the \u201cfrog\u201d level.\n\n5.4 Effects of \u03bb and \u03ba on the Performance of APPLET\n\nWe present in Figure 3 how categorization performance scales with \u03bb and \u03ba.\n\nFigure 3: Classification results (flat error and hierarchical error) of APPLET with various \u03bb and \u03ba.\n\nAccording to Figure 3, APPLET achieves its lowest categorization error around \u03bb = 0.01. Moreover, the error rate increases when \u03bb is larger than 0.1, where excessive regularization hampers the algorithm from differentiating semantically related categories. Similarly, APPLET achieves its best performance with \u03ba = 5. When \u03ba is too small, a large number of categories are mixed together, resulting in a much higher flat loss. On the other hand, when \u03ba >= 50, the semantic relatedness matrix is close to diagonal, resulting in treating all categories independently, and the categorization performance becomes similar to that of LR.\n\n6 Conclusions\n\nIn this paper, we argue for the positive effect of incorporating category hierarchy information in super large-scale image categorization. The sheer size of the problem considered here singles out our work from previous works on multi-way classification or transfer learning. An empirical study using 1.2 million training images from 1000 categories demonstrates the effectiveness and promise of our proposed approach.\n\nAcknowledgments\n\nE. P. Xing is supported by NSF IIS-0713379, DBI-0546594, Career Award, ONR N000140910758, DARPA NBCH1080007 and the Alfred P. Sloan Foundation. L. 
Fei-Fei is partially supported by an NSF CAREER grant (IIS-0845230) and an ONR MURI grant.

References

[1] B. Bakker and T. Heskes. Task clustering and gating for bayesian multitask learning. JMLR, 4:83–99, 2003.
[2] E. Bart and S. Ullman. Cross-generalization: learning novel classes from a single example by feature replacement. In CVPR, 2005.
[3] A. Berg, J. Deng, and L. Fei-Fei. Large scale visual recognition challenge 2010. http://www.image-net.org/challenges/LSVRC/2010/, 2010.
[4] A. Binder, K.-R. Müller, and M. Kawanabe. On taxonomies for multi-class image categorization. IJCV, pages 1–21, 2011.
[5] C. Bishop. Pattern Recognition and Machine Learning. Springer-Verlag New York, Inc., 2006.
[6] O. Boiman, E. Shechtman, and M. Irani. In defense of nearest-neighbor based image classification. In CVPR, 2008.
[7] A. Budanitsky and G. Hirst. Evaluating wordnet-based measures of lexical semantic relatedness. Comput. Linguist., 32:13–47, March 2006.
[8] R. Caruana. Multitask learning. Machine Learning, 28:41–75, 1997.
[9] X. Chen, Q. Lin, S. Kim, J. Carbonell, and E. P. Xing. Smoothing proximal gradient method for general structured sparse learning. In UAI, 2011.
[10] C. Chu, S. Kim, Y. Lin, Y. Yu, G. Bradski, A. Ng, and K. Olukotun. Map-reduce for machine learning on multicore. In NIPS, 2007.
[11] J. Deng, A. Berg, K. Li, and L. Fei-Fei. What does classifying more than 10,000 image categories tell us? In ECCV, 2010.
[12] J. Deng, W. Dong, R. 
Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR, 2009.
[13] J. Deng, S. Satheesh, A. Berg, and L. Fei-Fei. Fast and balanced: Efficient label tree learning for large scale object recognition. In NIPS, 2011.
[14] L. Fei-Fei, R. Fergus, and P. Perona. Learning generative visual models from few training examples: an incremental bayesian approach tested on 101 object categories. In CVPR Workshop on Generative-Model Based Vision, 2004.
[15] L. Fei-Fei, R. Fergus, and P. Perona. One-shot learning of object categories. PAMI, 28:594–611, 2006.
[16] R. Fergus, H. Bernal, Y. Weiss, and A. Torralba. Semantic label sharing for learning with many categories. In ECCV, 2010.
[17] T. Gao and D. Koller. Discriminative learning of relaxed hierarchy for large-scale visual recognition. In ICCV, 2011.
[18] G. Griffin, A. Holub, and P. Perona. Caltech-256 object category dataset. Technical Report 7694, California Institute of Technology, 2007.
[19] L. Jacob, F. Bach, and J.-P. Vert. Clustered multi-task learning: A convex formulation. In NIPS, 2008.
[20] R. Jenatton, J. Mairal, G. Obozinski, and F. Bach. Proximal methods for sparse hierarchical dictionary learning. In ICML, 2010.
[21] S. Kim and E. Xing. Tree-guided group lasso for multi-task regression with structured sparsity. In ICML, 2010.
[22] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR, 2006.
[23] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proc. IEEE, 86:2278–2324, 1998.
[24] Y. Lin, F. Lv, S. Zhu, M. Yang, T. Cour, K. Yu, L. Cao, and T. Huang. Large-scale image classification: fast feature extraction and SVM training. In CVPR, 2011.
[25] D. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60:91–110, 2004.
[26] E. Miller, N. Matsakis, and P. Viola. Learning from one example through shared densities on transforms. In CVPR, 2000.
[27] Y. Nesterov. A method for unconstrained convex minimization problem with the rate of convergence O(1/k^2). Doklady AN SSSR (translated as Soviet Math. Docl.), 269:543–547, 1983.
[28] A. Opelt, A. Pinz, and A. Zisserman. Incremental learning of object detectors using a visual shape alphabet. In CVPR, 2006.
[29] F. Perronnin, J. Sanchez, and T. Mensink. Improving the Fisher kernel for large-scale image classification. In ECCV, 2010.
[30] A. Quattoni, M. Collins, and T. Darrell. Transfer learning for image classification with sparse prototype representations. In CVPR, 2008.
[31] B. Russell, A. Torralba, K. Murphy, and W. Freeman. LabelMe: A database and web-based tool for image annotation. IJCV, 77:157–173, 2008.
[32] R. Salakhutdinov, A. Torralba, and J. Tenenbaum. Learning to share visual appearance for multiclass object detection. In CVPR, 2011.
[33] E. Sudderth, A. Torralba, W. Freeman, and A. Willsky. Learning hierarchical models of scenes, objects, and parts. In CVPR, 2005.
[34] J. Tenenbaum and W. Freeman. Separating style and content with bilinear models. Neural Computation, 12:1247–1283, 2000.
[35] R. Tibshirani, M. Saunders, S. Rosset, J. Zhu, and K. Knight. Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society Series B, pages 91–108, 2005.
[36] A. Torralba, R. Fergus, and W. Freeman. 80 million tiny images: A large data set for nonparametric object and scene recognition. PAMI, 30:1958–1970, 2008.
[37] A. Torralba, K. Murphy, and W. Freeman. Sharing features: efficient boosting procedures for multiclass object detection. In CVPR, 2004.
[38] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong. Locality-constrained linear coding for image classification. In CVPR, 2010.