{"title": "A Unified Semantic Embedding: Relating Taxonomies and Attributes", "book": "Advances in Neural Information Processing Systems", "page_first": 271, "page_last": 279, "abstract": "We propose a method that learns a discriminative yet semantic space for object categorization, where we also embed auxiliary semantic entities such as supercategories and attributes. Contrary to prior work which only utilized them as side information, we explicitly embed the semantic entities into the same space where we embed categories, which enables us to represent a category as their linear combination. By exploiting such a unified model for semantics, we enforce each category to be generated as a sparse combination of a supercategory + attributes, with an additional exclusive regularization to learn discriminative composition. The proposed reconstructive regularization guides the discriminative learning process to learn a better generalizing model, as well as generates compact semantic description of each category, which enables humans to analyze what has been learned.", "full_text": "A Uni\ufb01ed Semantic Embedding:\n\nRelating Taxonomies and Attributes\n\nSung Ju Hwang\u2217\nDisney Research\nPittsburgh, PA\n\nsungju.hwang@disneyresearch.com\n\nAbstract\n\nLeonid Sigal\n\nDisney Research\nPittsburgh, PA\n\nlsigal@disneyresearch.com\n\nWe propose a method that learns a discriminative yet semantic space for object\ncategorization, where we also embed auxiliary semantic entities such as supercat-\negories and attributes. Contrary to prior work, which only utilized them as side in-\nformation, we explicitly embed these semantic entities into the same space where\nwe embed categories, which enables us to represent a category as their linear com-\nbination. 
By exploiting such a unified model for semantics, we enforce each category to be generated as a supercategory + a sparse combination of attributes, with an additional exclusive regularization to learn discriminative compositions. The proposed reconstructive regularization guides the discriminative learning process toward a model with better generalization. The model also generates a compact semantic description of each category, which enhances interpretability and enables humans to analyze what has been learned.

1 Introduction

Object categorization is a challenging problem that requires drawing boundaries between groups of objects in a seemingly continuous space. Semantic approaches have gained a lot of attention recently, as object categorization has become more focused on large-scale and fine-grained recognition tasks and datasets. Attributes [1, 2, 3, 4] and semantic taxonomies [5, 6, 7, 8] are two popular semantic sources that impose certain relations between category models; more recently introduced analogies [9] induce even higher-order relations between them. While many techniques have been introduced to utilize each of these semantic sources individually for object categorization, no unified model has been proposed to relate them.

We propose a unified semantic model in which we learn to place categories, supercategories, and attributes as points (or vectors) in a hypothetical common semantic space, where taxonomies provide specific topological relationships between these semantic entities.
Further, we propose a discriminative learning framework, based on dictionary learning and large-margin embedding, that learns each of these semantic entities to be well separated and pseudo-orthogonal, such that we can use them to improve visual recognition tasks such as category or attribute recognition.

However, having semantic entities embedded into a common space is not enough to utilize the vast number of relations that exist between them. Thus, we impose a graph-based regularization between the semantic embeddings, such that each semantic embedding is regularized by a sparse combination of auxiliary semantic embeddings. This additional requirement on the discriminative learning model guides the learning so that we obtain not just the optimal model for class discrimination, but a semantically plausible model that has the potential to be more robust and human-interpretable; we call this model the Unified Semantic Embedding (USE).

* Now at Ulsan National Institute of Science and Technology in Ulsan, South Korea.

Figure 1: Concept: We regularize each category to be represented by its supercategory + a sparse combination of attributes, where the regularization parameters are learned. The resulting embedding model improves generalization through the specific relations between the semantic entities, and can also compactly represent a novel category in this manner. For example, given a novel category tiger, our model can describe it as a striped feline.

The observation we make to draw the relation between categories and attributes is that a category can be represented as the sum of its supercategory + a category-specific modifier, which in many cases can be represented by a combination of attributes. Further, we want the representation to be compact.
Instead of describing a dalmatian as a domestic animal with a lean body, four legs, a long tail, and spots, it is more efficient to say it is a spotted dog (Figure 1). It is also more exact, since the higher-level category dog contains the general properties shared by dog breeds, including hard-to-describe dog-specific properties such as the shape of the head and the posture. This exemplifies how a human would describe an object to efficiently communicate and understand the concept. Such a decomposition of a category into supercategory + attributes can hold for categories at any level. For example, the supercategory feline can be described as a stalking carnivore.

With the addition of this new generative objective, our goal is to learn a discriminative model that can be compactly represented as a combination of semantic entities, which helps learn a model that is semantically more reasonable. We want to balance between the discriminative and generative objectives when learning a model for each object category. For object categories that have scarce training examples, we can put more weight on the generative part of the model.

Contributions: Our contributions are threefold: (1) We present a multitask learning formulation for object categorization that learns a unified semantic space for supercategories and attributes, while drawing relations between them. (2) We propose a novel sparse-coding-based regularization that enforces the object category representation to be reconstructed as the sum of a supercategory and a sparse combination of attributes. (3) We show experimentally that generative learning with the sparse-coding-based regularization helps improve object categorization performance, especially in the one- or few-shot learning case, by generating semantically plausible predictions.

2 Related Work

Semantic methods for object recognition.
For many years, vision researchers have sought to exploit external semantic knowledge about objects to incorporate semantics into model learning. Taxonomies, or class hierarchies, were the first to be explored by vision researchers [5, 6], and were mostly used to efficiently rule out irrelevant category hypotheses by leveraging the hierarchical class structure [8, 10]. Attributes are visual or semantic properties of an object that are common across multiple categories, mostly regarded as describable mid-level representations. They have been used to directly infer categories [1, 2], or as additional supervision to aid the main categorization problem in a multitask learning framework [3]. While many methods have been proposed to leverage either of these two popular types of semantic knowledge, little work has been done to relate the two, which is what our paper aims to address.

Discriminative embedding for object categorization. Since the conventional kernel-based multiclass SVM does not scale, due to its memory and computational requirements, to today's large-scale classification tasks, embedding-based methods have gained popularity. Embedding-based methods perform classification in a low-dimensional shared space optimized for class discrimination. Most methods learn two linear projections, for data instances and class labels, into a common lower-dimensional space optimized with a ranking loss. Bengio et al. [10] solve the problem using stochastic gradient descent, and also provide a way to learn a tree structure that enables efficient prediction of the class label at test time. Mensink et al. [11] eliminated the need for learned class embeddings by replacing them with class means, which enabled generalization to new classes at near-zero cost.

There have also been efforts to incorporate semantic information into the learned embedding space. Weinberger et al.
[7] used taxonomies to preserve inter-class similarities, in terms of distance, in the learned space. Akata et al. [4] used attribute and taxonomy information as labels, replacing the conventional unit-vector-based class representation with more structured labels to improve zero-shot performance. One of the most recent works in this direction is DeViSE [12], which learns embeddings that optimize a ranking loss, as an additional layer on top of deep networks for both images and labels. However, these models impose structure only on the output space; structure on the learned space is not explicitly enforced, which is our goal.

Recently, Hwang et al. [9] introduced one such model, which regularizes category quadruplets that form an analogy to also form a parallelogram in the embedding space. Our goal is similar, but we explore a more general compositional relationship, which we learn without any manual supervision.

Multitask learning. Our work can be viewed as a multitask learning method, since we relate the models for different semantic entities by learning both a joint semantic space and geometric constraints between them. Perhaps the most similar work is [13], where the parameters of each model are regularized while fixing the parameters of its parent-level model. We use a similar strategy, but instead of enforcing sharing between the models, we simply learn each model to be close to its approximation obtained using higher-level (more abstract) concepts.

Sparse coding. Our method of approximating each category embedding as the sum of its direct supercategory plus a sparse combination of attributes is similar to the objective of sparse coding. One work that is specifically relevant to ours is Mairal et al. [14], where the learning objective is to reduce both the classification and reconstruction error, given class labels.
In our model, however, the dictionary atoms are discriminatively learned with supervision, and are assembled into a semantically meaningful combination of a supercategory + attributes, while [14] learns the dictionary atoms in an unsupervised way.

3 Approach

We now explain our unified semantic embedding model, which learns a discriminative common low-dimensional space in which to embed both images and semantic concepts, including object categories, while enforcing relationships between them using semantic reconstruction.

Suppose that we have a d-dimensional image descriptor and an m-dimensional vector describing the labels associated with each instance, including category labels at different semantic granularities and attributes. Our goal then is to embed both images and labels onto a single unified semantic space, where the images are associated with their corresponding semantic labels.

To formally state the problem: given a training set D of N labeled examples, i.e. D = {x_i, y_i}_{i=1}^{N}, where x_i ∈ R^d denotes an image descriptor and y_i ∈ {1, ..., m} is its label associated with one of m unique concepts, we want to embed each x_i as z_i, and each label y_i as u_{y_i}, in a d_e-dimensional space, such that the similarity between z_i and u_{y_i}, S(z_i, u_{y_i}), is maximized.

One way to solve the above problem is to use regression, with S(z_i, u_{y_i}) = -\|z_i - u_{y_i}\|_2^2. That is, we estimate the data embedding as z_i = W x_i, and minimize its distance to the correct label embedding u_{y_i} ∈ R^m, where the dimension for y_i is set to 1 and every other dimension is set to 0:

\min_{W} \sum_{i=1}^{N} \|W x_i - u_{y_i}\|_2^2 + \lambda \|W\|_F^2.    (1)

The above ridge regression will project each instance close to its correct label embedding. However, it does not guarantee that the resulting embeddings are well separated.
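To make Eq. (1) concrete, the ridge-regression embedding admits a closed-form solution. The sketch below is our own minimal illustration (the function and variable names are invented, not the authors' implementation):

```python
import numpy as np

# Sketch of the ridge-regression embedding in Eq. (1): learn W so that each
# projected instance W x_i lands near the one-hot label embedding u_{y_i}.
# Illustrative only; names are invented, not the authors' code.

def fit_ridge_embedding(X, y, m, lam=1.0):
    """X: (N, d) image descriptors; y: (N,) labels in {0, ..., m-1}.
    Returns W of shape (m, d) mapping descriptors into the label space."""
    N, d = X.shape
    Y = np.eye(m)[:, y]  # (m, N) one-hot targets u_{y_i}
    # closed form of min_W sum_i ||W x_i - u_{y_i}||^2 + lam * ||W||_F^2
    return Y @ X @ np.linalg.inv(X.T @ X + lam * np.eye(d))

# toy usage: two nearly axis-aligned classes in 2-D
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
y = np.array([0, 0, 1, 1])
W = fit_ridge_embedding(X, y, m=2, lam=0.1)
# nearest one-hot embedding in squared distance = largest coordinate of W x
pred = (X @ W.T).argmax(axis=1)
```

On this toy data the learned W projects every instance closest to its own label embedding; note, however, that nothing in this objective forces the embeddings of different classes apart.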
Therefore, most embedding methods for categorization add discriminative constraints, which ensure that each projected instance has higher similarity to its own category embedding than to the others. One way to enforce this is through large-margin constraints on distances: \|W x_i - u_{y_i}\|_2^2 + 1 \le \|W x_i - u_c\|_2^2 + \xi_{ic}, \forall c \ne y_i, which can be translated into the following discriminative loss:

L_C(W, U, x_i, y_i) = \sum_{c \ne y_i} [1 + \|W x_i - u_{y_i}\|_2^2 - \|W x_i - u_c\|_2^2]_+,    (2)

where U is the column-wise concatenation of the label embedding vectors, such that u_j denotes the jth column of U. After replacing the generative loss in the ridge regression formulation with the discriminative loss, we get the following discriminative learning problem:

\min_{W, U} \sum_{i=1}^{N} L_C(W, U, x_i, y_i) + \lambda \|W\|_F^2 + \lambda \|U\|_F^2, \quad y_i \in \{1, \dots, m\},    (3)

where \lambda keeps W and U from growing unboundedly. This is one of the most common objectives used for learning discriminative category embeddings for multi-class classification [10, 7], while ranking-loss-based models [15] have also been explored for L_C. A bilinear model in the single variable W has also been used in Akata et al. [4], with structured labels (attributes) as u_{y_i}.

3.1 Embedding auxiliary semantic entities.

Now we describe how we embed the supercategories and attributes into the learned shared space.

Supercategories. While our objective is to better categorize entry-level categories, categories in general can appear at different semantic granularities. For example, a zebra could be both an equus and an odd-toed ungulate.
To learn the embeddings for the supercategories, we require each data instance to be closer to its correct supercategory embedding than to the siblings of that supercategory: \|W x_i - u_s\|_2^2 + 1 \le \|W x_i - u_c\|_2^2 + \xi_{sc}, \forall s \in P_{y_i} and c \in S_s, where P_{y_i} denotes the set of superclasses of class y_i at all levels, and S_s is the set of siblings of s. The constraints can be translated into the following loss term:

L_S(W, U, x_i, y_i) = \sum_{s \in P_{y_i}} \sum_{c \in S_s} [1 + \|W x_i - u_s\|_2^2 - \|W x_i - u_c\|_2^2]_+.    (4)

Attributes. Attributes can be considered normalized basis vectors for the semantic space, whose combination represents a category. Basically, we want to maximize the correlation between each projected instance that possesses a given attribute and the embedding of that attribute, as follows:

L_A(W, U, x_i, y_i) = \sum_{a \in A_{y_i}} [\sigma - y_i^a (W x_i)^T u_a]_+, \quad \|u_a\|_2 \le 1, \; y_i^a \in \{0, 1\},    (5)

where A_{y_i} is the set of all attributes for class y_i, \sigma is the margin (which can be determined empirically; we simply use a fixed value of \sigma = 1), y_i^a is the label indicating the presence/absence of attribute a for the ith training instance, and u_a is the embedding vector for attribute a.

3.2 Relationship between the categories, supercategories, and attributes

Simply summing up all previously defined loss functions while adding {u_s} and {u_a} as additional columns of U results in a multitask formulation that implicitly associates the semantic entities through the shared data embedding W. However, we want to further utilize the relationships between the semantic entities, to explicitly impose structural regularization on the semantic embeddings U.
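Before moving to that structural regularization, note that the category and supercategory losses in Eqs. (2) and (4) share the same hinge form over squared distances. A minimal sketch (our own illustration; all names are invented):

```python
import numpy as np

# Sketch of the hinge losses in Eqs. (2) and (4): an instance embedding
# z = W x_i must be at least margin 1 closer (in squared distance) to its
# correct (super)category embedding than to each competing one.
# Illustrative only; names are invented, not the authors' code.

def large_margin_loss(z, U, correct, competitors):
    """z: projected instance; U: (d_e, m) column-wise label embeddings."""
    d_true = np.sum((z - U[:, correct]) ** 2)
    loss = 0.0
    for c in competitors:
        d_wrong = np.sum((z - U[:, c]) ** 2)
        loss += max(0.0, 1.0 + d_true - d_wrong)  # [1 + d_true - d_wrong]_+
    return loss

# toy check: an instance sitting exactly on u_0, far from u_1, incurs no loss,
# while one equidistant from both embeddings violates the margin
U = np.eye(2)
zero_loss = large_margin_loss(np.array([1.0, 0.0]), U, correct=0, competitors=[1])
mid_loss = large_margin_loss(np.array([0.5, 0.5]), U, correct=0, competitors=[1])
```

L_C uses all wrong classes as competitors, while L_S restricts the competitors to the siblings of each supercategory in the taxonomy.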
One simple and intuitive relation is that an object class can be represented as the combination of its parent-level category plus a sparse combination of attributes, which translates into the following constraint:

u_c = u_p + U^A \beta_c, \quad c \in C_p, \; \|\beta_c\|_0 \le \gamma_1, \; \beta_c \succeq 0, \quad \forall c, p \in \{1, \dots, C + S\},    (6)

where U^A is the aggregation of all attribute embeddings {u_a}, C_p is the set of children classes of class p, \gamma_1 is the sparsity parameter, C is the number of leaf-level categories, and S is the number of supercategories. We require \beta_c to be non-negative, since it is more natural and more efficient to describe an object by the attributes it might have rather than by the attributes it might not have.

We rewrite Eq. 6 as a regularization term, replacing the \ell_0-norm constraints with \ell_1-norm regularizations for tractable optimization:

R(U, B) = \sum_{c=1}^{C} \|u_c - u_p - U^A \beta_c\|_2^2 + \gamma_2 \|\beta_c + \beta_o\|_2^2, \quad c \in C_p, \; o \in P_c \cup S_c, \; 0 \preceq \beta_c \preceq \gamma_1, \quad \forall c, p \in \{1, \dots, C + S\},    (7)

where B is the matrix whose jth column \beta_j is the reconstruction weight for class j, S_c is the set of all sibling classes of class c, and \gamma_2 is the parameter that enforces exclusivity.

The exclusive regularization term is used to prevent the semantic reconstruction \beta_c for class c from fitting to the same attributes fitted by its parents and siblings. This is because attributes common across parent and child, and between siblings, are less discriminative. This regularization is especially useful for discrimination between siblings, which belong to the same superclass and differ only by the category-specific modifier.
By generating a unique semantic decomposition for each class, we can better discriminate between any two categories using a semantic combination of discriminatively learned auxiliary entities.

With the sparsity enforced by \gamma_1, penalizing the simple sum of the two weight vectors prevents two (super)categories from putting high weight on the same attribute, which lets each category embedding fit to an exclusive set of attributes. This is, in fact, the exclusive lasso regularizer introduced in [16], except for the nonnegativity constraint on \beta_c, which makes the problem easier to solve.

3.3 Unified semantic embeddings for object categorization

After augmenting the categorization objective in Eq. 3 with the superclass and attribute losses and the sparse-coding-based regularization in Eq. 7, we obtain the following multitask learning formulation that jointly learns all the semantic entities along with the sparse-coding-based regularization:

\min_{W, U, B} \sum_{i=1}^{N} L_C(W, U, x_i, y_i) + \mu_1 (L_S(W, U, x_i, y_i) + L_A(W, U, x_i, y_i)) + \mu_2 R(U, B);
\|w_j\|_2^2 \le \lambda, \; \|u_k\|_2^2 \le \lambda, \; 0 \preceq \beta_c \preceq \gamma_1, \quad \forall j \in \{1, \dots, d\}, \; \forall k \in \{1, \dots, m\}, \; \forall c, p \in \{1, \dots, C + S\},    (8)

where w_j is the jth column of W, and \mu_1 and \mu_2 are parameters that balance between the main and auxiliary tasks, and between the discriminative and generative objectives.

Eq. 8 can also be used for knowledge transfer when learning a model for a novel set of categories, by replacing U^A in R(U, B) with U^S, learned on a class set S from which to transfer knowledge.

3.4 Numerical optimization

Eq. 8 is not jointly convex in all of its variables, and has both discriminative and generative terms.
This problem is similar to the one in [14], where the objective is to learn the dictionary, sparse coefficients, and classifier parameters together, and it can be optimized using a similar alternating scheme, although each subproblem differs. We first describe how we optimize for each variable.

Learning W and U. The optimization of the two embedding models is similar, except for the reconstructive regularization on U, and the main bottleneck lies in the minimization of the O(Nm) large-margin losses. Since the losses are non-differentiable, we solve these problems using a stochastic subgradient method. Specifically, we implement the proximal gradient algorithm in [17], handling the \ell_2-norm constraints with proximal operators.

Learning B. This is similar to the sparse coding problem, but simpler. We use a projected gradient method, where at each iteration t we project the intermediate solution \beta_c^{t+1/2} for category c onto the \ell_1-norm ball and the nonnegative orthant, to obtain a \beta_c^{t+1} that satisfies the constraints.

Alternating optimization. We decompose Eq. 8 into two convex problems: 1) optimization of the data embedding W and the approximation parameters B (the two variables have no direct link between them), and 2) optimization of the category embedding U. We alternate between optimizing each of the convex problems while fixing the remaining variables, until the convergence criterion¹ is met or the maximum number of iterations is reached.

Run-time complexity. Training: Optimization of W and U using the proximal stochastic gradient method [17] has time complexity O(d_e d(k+1)) and O(d_e(dk + m)), respectively. Both terms are dominated by the gradient computation for the k (k ≪ N) sampled constraints, which is O(d_e dk). The outer alternation loop converges within 5-10 iterations, depending on ε.
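The Learning B step above can be sketched as a single projected-gradient update on the per-class objective of Eq. (7). This is our own illustration, not the paper's code: the ℓ1-ball/nonnegative-orthant projection is a standard simplex-projection routine, and all names are invented.

```python
import numpy as np

# Sketch of one projected-gradient update for the reconstruction weights
# beta_c in Eq. (7), followed by projection onto the nonnegative orthant
# and an l1-ball. Illustrative only; names are invented.

def project_l1_nonneg(v, radius):
    """Project v onto {b : b >= 0, sum(b) <= radius}."""
    v = np.maximum(v, 0.0)
    if v.sum() <= radius:
        return v
    # standard Euclidean projection onto the simplex of the given radius
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(u) + 1) > (css - radius))[0][-1]
    theta = (css[rho] - radius) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def beta_step(beta, u_c, u_p, UA, betas_other, gamma1, gamma2, lr=0.1):
    """One gradient step on ||u_c - u_p - UA @ beta||^2
    + gamma2 * sum_o ||beta + beta_o||^2, then project."""
    resid = u_c - u_p - UA @ beta
    grad = -2.0 * UA.T @ resid
    for b_o in betas_other:
        grad += 2.0 * gamma2 * (beta + b_o)
    return project_l1_nonneg(beta - lr * grad, gamma1)

# toy usage: with UA = I, u_c - u_p = (1, 0), and no parent/sibling terms,
# one step with a suitable learning rate recovers the sparse code (1, 0)
b1 = beta_step(np.zeros(2), np.array([1.0, 0.0]), np.zeros(2), np.eye(2),
               betas_other=[], gamma1=1.0, gamma2=0.0, lr=0.5)
```

The exclusive term enters only through the gradient: whenever a parent or sibling already uses an attribute, the update pushes beta_c away from it before projection.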
Test: Test-time complexity is exactly the same as in LME, which is O(d_e(C + d)).

4 Results

We validate our method's multiclass categorization performance on two different datasets generated from a public image collection, and also test knowledge transfer in few-shot learning.

4.1 Datasets

We use the Animals with Attributes dataset [1], which consists of 30,475 images of 50 animal classes, with 85 class-level attributes². We use the WordNet hierarchy to generate supercategories. Since there is no fixed training/test split, we use a {30, 30, 30} random split for training/validation/test. We generate the following two datasets using the provided features. 1) AWA-PCA: We compose a 300-dimensional feature vector by performing PCA on each of the 6 provided feature types, including SIFT, rgSIFT, SURF, HoG, LSS, and CQ, keeping 50 dimensions per feature type, and concatenating them. 2) AWA-DeCAF: For the second dataset, we use the provided 4096-D DeCAF features [18], obtained from the layer just before the output layer of a deep convolutional neural network.

¹ \|W^{t+1} - W^t\|_2 + \|U^{t+1} - U^t\|_2 + \|B^{t+1} - B^t\|_2 < ε.
² Attributes are defined on color (black, orange), texture (stripes, spots), parts (longneck, hooves), and other high-level behavioral properties (slow, hibernate, domestic) of the animals.

4.2 Baselines

We compare our proposed method against multiple existing embedding-based categorization approaches that either do not use any semantic information, or use semantic information but do not explicitly embed semantic entities. For the non-semantic baselines, we use the following: 1) Ridge Regression: linear regression with \ell_2-norm regularization (Eq. 1). 2) NCM: the nearest class mean classifier from [11], which uses the class mean as the category embedding (u_c = \bar{x}_c).
We use the code provided by the authors³. 3) LME: a base large-margin embedding (Eq. 3) solved using alternating optimization.

For the implicit semantic baselines, we consider two different methods. 4) LMTE: our implementation of Weinberger et al. [7], which enforces semantic similarity between class embeddings via distance constraints, where U is regularized to preserve the pairwise class similarities from a given taxonomy. 5-7) ALE, HLE, AHLE: our implementation of the attribute label embedding of Akata et al. [4], which encodes the semantic information by representing each class with structured labels that indicate the class's association with superclasses and attributes. We implement variants that use attributes (ALE), leaf-level + superclass labels (HLE), and both (AHLE).

For our models, we implement multiple variants to analyze the impact of each semantic entity and of the proposed regularization. 1) LME-MTL-S: the multitask semantic embedding model learned with supercategories. 2) LME-MTL-A: the multitask embedding model learned with attributes. 3) USE-No Reg.: the unified semantic embedding model learned using both attributes and supercategories, without semantic regularization. 4) USE-Reg: USE with the sparse-coding regularization.

As for the parameters, we use a projection dimensionality of d_e = 50 for all our models⁴. For the other parameters, we find the optimal value by cross-validation on the validation set. We set \mu_1 = 1, which balances the main and auxiliary tasks equally, and search for the discriminative/generative tradeoff \mu_2 in the range {0.01, 0.1, 0.2, …
, 1, 10}, and set the \ell_2-norm regularization parameter \lambda = 1. For the sparsity parameter \gamma_1, we set it to select several (3 or 4) attributes per class on average, and for the disjoint parameter \gamma_2, we use 10\gamma_1, without tuning for performance.

| | Method | Flat hit @ 1 (%) | Flat hit @ 2 (%) | Flat hit @ 5 (%) | Hier. precision @ 2 (%) | Hier. precision @ 5 (%) |
| No semantics | Ridge Regression | 19.31 ± 1.15 | 28.34 ± 1.53 | 44.17 ± 2.33 | 28.95 ± 0.54 | 39.39 ± 0.17 |
| | NCM [11] | 18.93 ± 1.71 | 29.75 ± 0.92 | 47.33 ± 1.60 | 30.81 ± 0.53 | 43.43 ± 0.53 |
| | LME | 19.87 ± 1.56 | 30.47 ± 1.56 | 48.07 ± 1.06 | 30.98 ± 0.62 | 42.63 ± 0.56 |
| Implicit semantics | LMTE [7] | 20.76 ± 1.64 | 30.71 ± 1.35 | 47.76 ± 2.25 | 31.05 ± 0.71 | 43.13 ± 0.29 |
| | ALE [4] | 15.72 ± 1.14 | 25.63 ± 1.44 | 43.42 ± 1.67 | 29.26 ± 0.50 | 43.71 ± 0.34 |
| | HLE [4] | 17.09 ± 1.09 | 27.52 ± 1.20 | 45.49 ± 0.61 | 30.51 ± 0.48 | 44.76 ± 0.20 |
| | AHLE [4] | 16.65 ± 0.47 | 26.55 ± 0.77 | 43.05 ± 1.22 | 29.49 ± 0.89 | 43.41 ± 0.65 |
| Explicit semantics (USE) | LME-MTL-S | 20.77 ± 1.41 | 32.09 ± 1.67 | 50.94 ± 1.21 | 33.71 ± 0.94 | 45.73 ± 0.71 |
| | LME-MTL-A | 20.65 ± 0.83 | 31.51 ± 0.72 | 49.40 ± 0.62 | 31.69 ± 0.49 | 43.47 ± 0.23 |
| | USE-No Reg. | 21.07 ± 1.53 | 31.59 ± 1.57 | 50.11 ± 1.51 | 33.67 ± 0.55 | 45.41 ± 0.43 |
| | USE-Reg. | 21.64 ± 1.02 | 32.69 ± 0.83 | 52.04 ± 1.02 | 33.37 ± 0.74 | 47.17 ± 0.91 |

Table 1: Multiclass classification performance on the AWA-PCA dataset (300-D PCA features).

4.3 Multiclass categorization

We first evaluate the suggested multitask learning framework for categorization performance.
We report the average classification performance and standard error over 5 random training/test splits in Tables 1 and 2, using both flat hit@k, the accuracy of the top-k predictions made, and hierarchical precision@k from [12], which measures the precision of the prediction at k across all levels of the hierarchy.

The non-semantic baselines, ridge regression and NCM, were outperformed by our most basic LME model. Among the implicit semantic baselines, the ALE variants underperformed even the ridge regression baseline with regard to top-1 classification accuracy⁵, while they improve upon the top-2 recognition accuracy and hierarchical precision. This shows that hard-coding structure into the label space does not necessarily improve discrimination performance, while it helps to learn a more semantic space. LMTE makes a substantial improvement with the 300-D features, but not with the DeCAF features.

Explicit embedding of semantic entities using our method improved both top-1 accuracy and hierarchical precision, with the USE variants achieving the best performance on both. Specifically, adding superclass embeddings as auxiliary entities improves hierarchical precision, while using attributes improves flat top-k classification accuracy. USE-Reg, especially, made substantial improvements on flat hit and hierarchical precision @ 5, which shows the proposed regularization's effectiveness in learning a semantic space that also discriminates well.

³ http://staff.science.uva.nl/~tmensink/code.php
⁴ Except for the ALE variants, where d_e = m, the number of semantic entities.

| | Method | Flat hit @ 1 (%) | Flat hit @ 2 (%) | Flat hit @ 5 (%) | Hier. precision @ 2 (%) | Hier. precision @ 5 (%) |
| No semantics | Ridge Regression | 38.39 ± 1.48 | 48.61 ± 1.29 | 62.12 ± 1.20 | 38.51 ± 0.61 | 41.73 ± 0.54 |
| | NCM [11] | 43.49 ± 1.23 | 57.45 ± 0.91 | 75.48 ± 0.58 | 45.25 ± 0.52 | 50.32 ± 0.47 |
| | LME | 44.76 ± 1.77 | 58.08 ± 2.05 | 75.11 ± 1.48 | 44.84 ± 0.98 | 49.87 ± 0.39 |
| Implicit semantics | LMTE [7] | 38.92 ± 1.12 | 49.97 ± 1.16 | 63.35 ± 1.38 | 38.67 ± 0.46 | 41.72 ± 0.45 |
| | ALE [4] | 36.40 ± 1.03 | 50.43 ± 1.92 | 70.25 ± 1.97 | 42.52 ± 1.17 | 52.46 ± 0.37 |
| | HLE [4] | 33.56 ± 1.64 | 45.93 ± 2.56 | 64.66 ± 1.77 | 46.11 ± 2.65 | 56.79 ± 2.05 |
| | AHLE [4] | 38.01 ± 1.69 | 52.07 ± 1.19 | 71.53 ± 1.41 | 44.43 ± 0.66 | 54.39 ± 0.55 |
| Explicit semantics (USE) | LME-MTL-S | 45.03 ± 1.32 | 57.73 ± 1.75 | 74.43 ± 1.26 | 46.05 ± 0.89 | 51.08 ± 0.36 |
| | LME-MTL-A | 45.55 ± 1.71 | 58.60 ± 1.76 | 74.67 ± 0.93 | 44.23 ± 0.95 | 48.52 ± 0.29 |
| | USE-No Reg. | 45.93 ± 1.76 | 59.37 ± 1.32 | 74.97 ± 1.15 | 47.13 ± 0.62 | 51.04 ± 0.46 |
| | USE-Reg. | 46.42 ± 1.33 | 59.54 ± 0.73 | 76.62 ± 1.45 | 47.39 ± 0.82 | 53.35 ± 0.30 |

Table 2: Multiclass classification performance on the AWA-DeCAF dataset (4096-D DeCAF features).

| Category | Ground-truth attributes | Supercategory + learned attributes |
| Otter | An animal that swims, fish, water, new world, small, flippers, furry, black, brown, tail, … | A musteline mammal that is quadrapedal, flippers, furry, ocean |
| Skunk | An animal that is smelly, black, stripes, white, tail, furry, ground, quadrapedal, new world, walks, … | A musteline mammal that has stripes |
| Deer | An animal that is brown, fast, horns, grazer, forest, quadrapedal, vegetation, timid, hooves, walks, … | A deer that has spots, nestspot, longneck, yellow, hooves |
| Moose | An animal that has horns, brown, big, quadrapedal, new world, vegetation, grazer, hooves, strong, ground, … | A deer that is arctic, stripes, black |
| Equine | N/A | An odd-toed ungulate that is lean and active |
| Primate | N/A | An animal that has hands and bipedal |

Table 3: Semantic descriptions generated using ground-truth attribute labels and the learned semantic decomposition of each category. For the ground-truth labels, we show the top 10 attributes ranked by human-rated relevance. For our method, we rank the attributes by their learned weights. Incorrect attributes are colored in red.

4.3.1 Qualitative analysis

Besides learning a space that is both discriminative and generalizes well, our method's main advantage over existing methods is its ability to generate compact, semantic descriptions of each category it has learned. This is a great benefit, since with most models, including state-of-the-art deep convolutional networks, humans cannot understand what has been learned; by generating human-understandable explanations, our model can communicate with humans, allowing them to understand the rationale behind its categorization decisions and possibly provide corrective feedback.

To show the effectiveness of using supercategory + attributes in the description, we report the learned reconstructions of our model, compared against the descriptions generated from ground-truth attributes, in Table 3. The results show that our method generates a compact description of each category, focusing on its discriminative attributes. For example, our method selects attributes such as flippers for otter and stripes for skunk, instead of common, nondiscriminative attributes such as tail.
Note that some attributes that are ranked less relevant by humans were selected for their discriminativeness, e.g., yellow for deer and black for moose, both of which human annotators regarded as brown. Further, our method selects discriminative attributes for each supercategory, even though no attribute labels are provided for supercategories.

⁵We did an extensive parameter search for the ALE variants.

Figure 2: Learned discriminative attribute associations on the AWA-PCA dataset. Incorrect attributes are colored in red.

Class          | Learned decomposition
Humpback whale | A baleen whale, with plankton, flippers, blue, skimmer, arctic
Leopard        | A big cat that is orange, claws, black
Hippopotamus   | An even-toed ungulate, that is gray, bulbous, water, smelly, hands
Chimpanzee     | A primate, that is mountains, strong, stalker, black
Persian cat    | A domestic cat, that is arctic, nestspot, fish, bush

Figure 3: Few-shot experiment results on the AWA dataset, and generated semantic decompositions.

Figure 2 shows the discriminative attributes disjointly selected at each node of the class hierarchy. We observe that coarser-grained categories fit attributes that are common throughout all their children (e.g., pads, stalker, and paws for carnivore), while finer-grained categories fit attributes that help with finer-grained distinctions (e.g., orange for tiger, spots for leopard, and desert for lion).

4.4 One-shot/Few-shot learning

Our method is expected to be especially useful for few-shot learning, since it generates a richer description than existing methods, which approximate the new input category using either trained categories or attributes. For this experiment, we divide the 50 categories into a predefined 40/10 training/test split, and compare against knowledge transfer using AHLE. We assume that no attribute labels are provided for the test set.
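This kind of knowledge transfer amounts to biasing the weights learned on the few-shot classes toward weights already learned on the training classes. A minimal sketch of such a ridge-style transfer penalty and its gradient, under stated assumptions: the data-fitting term here is a stand-in squared loss (not the paper's large-margin objective), and `W_S`, `lam2`, and the shapes are illustrative:

```python
import numpy as np

# Sketch of transfer regularization: base loss + lam2 * ||W - W_S||^2,
# where W_S holds weights learned on the source/training classes.

def transfer_loss(W, W_S, X, Y, lam2):
    residual = X @ W - Y                      # stand-in least-squares fitting term
    fit = 0.5 * np.sum(residual ** 2)
    reg = lam2 * np.sum((W - W_S) ** 2)       # pulls W toward the source model
    return fit + reg

def transfer_grad(W, W_S, X, Y, lam2):
    return X.T @ (X @ W - Y) + 2.0 * lam2 * (W - W_S)

# Tiny usage example with random data (hypothetical dimensions).
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 5))              # 20 few-shot examples, 5-D features
Y = rng.standard_normal((20, 3))              # targets for 3 novel classes
W_S = rng.standard_normal((5, 3))             # weights from the source classes
W = W_S.copy()                                # initialize at the source model
for _ in range(200):                          # plain gradient descent
    W -= 0.01 * transfer_grad(W, W_S, X, Y, lam2=1.0)
```

With large `lam2` the solution stays close to the source model; with `lam2 = 0` it reduces to fitting the few-shot data alone, which is exactly when overfitting hurts most.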
For AHLE and USE, we regularize the learning of $W$ with the $W^S$ learned on the training class set $S$ by adding $\lambda_2\|W - W^S\|_2^2$ to the LME objective (Eq. 3). For USE-Reg, we use the reconstructive regularizer to regularize the model to generate semantic decompositions using $U^S$.
Figure 3 shows the results, along with the learned semantic decomposition of each novel category. While all methods improve over the no-transfer baseline, USE-Reg achieves the largest improvement, improving the two-shot result on AWA-DeCAF from 38.93% to 49.87%, with USE coming in second at 44.87%. Most learned reconstructions look reasonable, and fit the discriminative traits that help to discriminate between the test classes, which in this case are colors: orange for leopard, gray for hippopotamus, blue for humpback whale, and arctic (white) for Persian cat.

5 Conclusion

We propose a unified semantic space model that learns a discriminative space for object categorization, with the help of auxiliary semantic entities such as supercategories and attributes. The auxiliary entities aid object categorization both indirectly, by sharing a common data embedding, and directly, through a sparse-coding-based regularizer that enforces each category to be generated by its supercategory + a sparse combination of attributes. Our USE model improves both the flat-hit accuracy and the hierarchical precision on the AWA dataset, and also generates semantically meaningful decompositions of categories that provide human-interpretable rationale.

[Figure 2 content: learned discriminative attribute associations for each AWA class and supercategory node. Figure 3 plots: accuracy (%) vs. number of training examples on AWA-PCA and AWA-DeCAF, comparing No transfer, AHLE, USE, and USE-Reg.]

References

[1] Christoph Lampert, Hannes Nickisch, and Stefan Harmeling. Learning to Detect Unseen Object Classes by Between-Class Attribute Transfer. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.

[2] Ali Farhadi, Ian Endres, Derek Hoiem, and David Forsyth. Describing Objects by their Attributes. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.

[3] Sung Ju Hwang, Fei Sha, and Kristen Grauman. Sharing features between objects and their attributes. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1761–1768, 2011.

[4] Zeynep Akata, Florent Perronnin, Zaid Harchaoui, and Cordelia Schmid. Label-Embedding for Attribute-Based Classification. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 819–826, 2013.

[5] Marcin Marszalek and Cordelia Schmid. Constructing category hierarchies for visual recognition. In European Conference on Computer Vision (ECCV), 2008.

[6] Gregory Griffin and Pietro Perona.
Learning and using taxonomies for fast visual categorization. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–8, 2008.

[7] Kilian Q. Weinberger and Olivier Chapelle. Large margin taxonomy embedding for document categorization. In Neural Information Processing Systems (NIPS), pages 1737–1744, 2009.

[8] Tianshi Gao and Daphne Koller. Discriminative learning of relaxed hierarchy for large-scale visual recognition. In International Conference on Computer Vision (ICCV), pages 2072–2079, 2011.

[9] Sung Ju Hwang, Kristen Grauman, and Fei Sha. Analogy-preserving semantic embedding for visual object categorization. In International Conference on Machine Learning (ICML), pages 639–647, 2013.

[10] Samy Bengio, Jason Weston, and David Grangier. Label Embedding Trees for Large Multi-Class Tasks. In Neural Information Processing Systems (NIPS), 2010.

[11] Thomas Mensink, Jakob Verbeek, Florent Perronnin, and Gabriela Csurka. Distance-based image classification: Generalizing to new classes at near zero cost. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 35(11), 2013.

[12] Andrea Frome, Greg Corrado, Jon Shlens, Samy Bengio, Jeffrey Dean, Marc'Aurelio Ranzato, and Tomas Mikolov. DeViSE: A deep visual-semantic embedding model. In Neural Information Processing Systems (NIPS), 2013.

[13] Alon Zweig and Daphna Weinshall. Hierarchical regularization cascade for joint learning. In International Conference on Machine Learning (ICML), volume 28, pages 37–45, 2013.

[14] Julien Mairal, Francis Bach, Jean Ponce, Guillermo Sapiro, and Andrew Zisserman. Supervised dictionary learning. In Neural Information Processing Systems (NIPS), pages 1033–1040, 2008.

[15] Jason Weston, Samy Bengio, and Nicolas Usunier. Wsabie: Scaling up to large vocabulary image annotation.
In International Joint Conferences on Artificial Intelligence (IJCAI), 2011.

[16] Yang Zhou, Rong Jin, and Steven C. H. Hoi. Exclusive lasso for multi-task feature selection. Journal of Machine Learning Research, 9:988–995, 2010.

[17] John Duchi and Yoram Singer. Efficient online and batch learning using forward backward splitting. Journal of Machine Learning Research, 10, 2009.

[18] Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. In International Conference on Machine Learning (ICML), 2014.