{"title": "Learning a Tree of Metrics with Disjoint Visual Features", "book": "Advances in Neural Information Processing Systems", "page_first": 621, "page_last": 629, "abstract": "We introduce an approach to learn discriminative visual representations while exploiting external semantic knowledge about object category relationships.  Given a hierarchical taxonomy that captures semantic similarity between the objects, we learn a corresponding tree of metrics (ToM).  In this tree, we have one metric for each non-leaf node of the object hierarchy, and each metric is responsible for discriminating among its immediate subcategory children.  Specifically, a Mahalanobis metric learned for a given node must satisfy the appropriate (dis)similarity constraints generated only among its subtree members' training instances.  To further exploit the semantics, we introduce a novel regularizer coupling the metrics that prefers a sparse disjoint set of features to be selected for each metric relative to its ancestor supercategory nodes' metrics.  Intuitively, this reflects that visual cues most useful to distinguish the generic classes (e.g., feline vs. canine) should be different than those cues most useful to distinguish their component fine-grained classes (e.g., Persian cat vs. Siamese cat).  
We validate our approach with multiple image datasets using the WordNet taxonomy, show its advantages over alternative metric learning approaches, and analyze the meaning of attribute features selected by our algorithm.", "full_text": "Learning a Tree of Metrics\nwith Disjoint Visual Features\n\nSung Ju Hwang\nUniversity of Texas\nAustin, TX 78701\n\nKristen Grauman\nUniversity of Texas\nAustin, TX 78701\n\nFei Sha\n\nUniversity of Southern California\n\nLos Angeles, CA 90089\n\nsjhwang@cs.utexas.edu\n\ngrauman@cs.utexas.edu\n\nfeisha@usc.edu\n\nAbstract\n\nWe introduce an approach to learn discriminative visual representations while ex-\nploiting external semantic knowledge about object category relationships. Given\na hierarchical taxonomy that captures semantic similarity between the objects,\nwe learn a corresponding tree of metrics (ToM). In this tree, we have one metric\nfor each non-leaf node of the object hierarchy, and each metric is responsible for\ndiscriminating among its immediate subcategory children. Speci\ufb01cally, a Maha-\nlanobis metric learned for a given node must satisfy the appropriate (dis)similarity\nconstraints generated only among its subtree members\u2019 training instances. To fur-\nther exploit the semantics, we introduce a novel regularizer coupling the metrics\nthat prefers a sparse disjoint set of features to be selected for each metric rela-\ntive to its ancestor (supercategory) nodes\u2019 metrics. Intuitively, this re\ufb02ects that\nvisual cues most useful to distinguish the generic classes (e.g., feline vs. canine)\nshould be different than those cues most useful to distinguish their component\n\ufb01ne-grained classes (e.g., Persian cat vs. Siamese cat). 
We validate our approach\nwith multiple image datasets using the WordNet taxonomy, show its advantages\nover alternative metric learning approaches, and analyze the meaning of attribute\nfeatures selected by our algorithm.\n\n1\n\nIntroduction\n\nVisual recognition is a fundamental computer vision problem that demands sophisticated image\nrepresentations\u2014due to both the large number of object categories a system should ultimately rec-\nognize, as well as the noisy cluttered conditions in which training examples are often captured.\nThe research community has made great strides in recent years by training discriminative models\nwith an array of well-engineered descriptors, e.g., capturing gradient texture, color, or local part\ncon\ufb01gurations. In particular, recent work shows promising results when integrating powerful fea-\nture selection techniques, whether through kernel combination [1, 2], sparse coding dictionaries [3],\nstructured sparsity regularization [4, 5], or metric learning approaches [6, 7, 8, 9, 10].\nHowever, typically the semantic information embedded in the learned features is restricted to the cat-\negory labels on image exemplars. For example, a learned metric generates (dis)similarity constraints\nusing instances with the different/same class label; multiple kernel learning methods optimize fea-\nture weights to minimize class prediction errors; group sparsity regularizers exploit class labels to\nguide the selected dimensions. Unfortunately, this means richer information about the meaning of\nthe target object categories is withheld from the learned representations. 
While suf\ufb01cient for ob-\njects starkly different in appearance, this omission is likely restrictive for objects with \ufb01ner-grained\ndistinctions, or when a large number of classes densely populate the original feature space.\n\n1\n\n\fWe propose a metric learning approach to learn discriminative visual representations while also\nexploiting external knowledge about the target objects\u2019 semantic similarity.1 We assume the external\nknowledge itself is available in the form of a hierarchical taxonomy over the objects (e.g., from\nWordNet or some other knowledge base). Our approach exploits these semantics in two novel ways.\nFirst, we construct a tree of metrics (ToM) to directly capture the hierarchical structure. In this tree,\neach metric is responsible for discriminating among its immediate object subcategories. Speci\ufb01cally,\nwe learn one metric for each non-leaf node, and require it to satisfy (dis)similarity constraints gen-\nerated among its subtree members\u2019 training instances. We use a variant of the large-margin nearest\nneighbor objective [11], and augment it with a regularizer for sparsity in order to unify Mahalanobis\nparameter learning with a simple means of feature selection.\nSecond, rather than learn the metrics at each node independently, we introduce a novel regularizer\nfor disjoint sparsity that couples each metric with those of its ancestors. This regularizer speci\ufb01es\nthat a disjoint set of features should be selected for a given node and its ancestors, respectively. In-\ntuitively, this represents that the visual features most useful to distinguish the coarse-grained classes\n(e.g., feline vs. canine) should often be different than those cues most useful to distinguish their\n\ufb01ne-grained subclasses (e.g., Persian vs. Siamese cat, German Shepherd vs. Boxer). 
The resulting\noptimization problem is convex, and can be optimized with a projected subgradient approach.\nThe ideas of exploiting label hierarchy and model sparsity are not completely new to computer\nvision and machine learning researchers. Hierarchical classi\ufb01ers are used to speed up classi\ufb01cation\ntime and alleviate data sparsity problems [12, 13, 14, 15, 16]. Parameter sparsity is increasingly\nused to derive parsimonious models with informative features [4, 5, 3].\nOur novel contribution lies in the idea of ToM and disjoint sparsity together as a new strategy for\nvisual feature learning. Our idea reaps the bene\ufb01ts of both schools of thought. Rather than relying on\nlearners to discover both sparse features and a visual hierarchy fully automatically, we use external\n\u201creal-world\u201d knowledge expressed in hierarchical structures to bias which sparsity patterns we want\nthe learned discriminative feature representations to exhibit. Thus, our end-goal is not any sparsity\npattern returned by learners, but the patterns that are in concert with rich high-level semantics.\nWe validate our approach with the Animals with Attributes [17] and ImageNet [18] datasets using the\nWordNet taxonomy. We demonstrate that the proposed ToM outperforms both global and multiple-\nmetric metric learning baselines that have similar objectives but lack the hierarchical structure and\nproposed disjoint sparsity regularizer. In addition, we show that when the dimensions of the original\nfeature space are interpretable (nameable) visual attributes, the disjoint features selected for super-\nand sub-classes by our method can be quite intuitive.\n\n2 Related Work\n\nA wide variety of feature learning approaches have been explored for visual recognition. Some of\nthe very best results on benchmark image classi\ufb01cation tasks today use multiple kernel learning\napproaches [1, 2] or sparse coding dictionaries for local features (e.g., [3]). 
One way to regularize\nvisual feature selection is to prefer that object categories share features, so as to speed up object\ndetection [19]; more recent work uses group sparsity to impose some sharing among the (un)selected\nfeatures within an object category or view [4, 5]. We instead seek disjoint features between coarse\nand fine categories, such that the regularizer helps to focus on useful differences across levels.\nMetric learning has been a subject of extensive research in recent years, in both vision and learning.\nGood visual metrics can be trained with boosting [20, 21], feature weight learning [6], or\nMahalanobis metric learning methods [7, 8, 10]. An array of Mahalanobis metric learners has been\ndeveloped in the machine learning literature [22, 23, 11]. The idea of using multiple \u201clocal\u201d metrics\nto cover a complex feature space is not new [24, 9, 10, 25]; however, in contrast to our approach,\nexisting methods resort to clustering or (flat) class labels to determine the partitioning of training\ninstances to metrics. Most methods treat the partitioning and metric learning processes separately,\nbut some recent work integrates the grouping directly into the learning objective [21], or trains multiple\nmetrics jointly across tasks [26]. No previous work explores mapping the semantic hierarchy\nto a ToM, nor couples metrics across the hierarchy levels, as we propose. To show the impact, in\nexperiments we directly compare to a state-of-the-art approach for learning multiple metrics.\n\nFootnote 1: We use \u201clearned representation\u201d and \u201clearned metric\u201d interchangeably, since we deal with sparse Mahalanobis\nmetrics, which are equivalent to selecting a subset of features and applying a linear feature space\ntransformation.\n\nPrevious metric learning work integrates feature learning and selection via a regularizer for sparsity\n[27], as we do here. 
However, whereas that approach targets sparsity in the linear transformed\nspace, ours targets sparsity in the original feature space, and, most importantly, also includes a dis-\njoint sparsity regularizer. The advantage in doing so is that our learner will be able to return both\ndiscriminative and interpretable feature dimensions, as we demonstrate in our results. Transformed\nfeature spaces\u2014while suitably \ufb02exible if only discriminative power is desired\u2014add layers that com-\nplicate interpretability, not only to models for individual classi\ufb01ers but also (more seriously) to tease\napart patterns across related categories (such as parent-child).\nThe \u201corthogonal transfer\u201d by [28] is most closely related in spirit to our goal of selecting disjoint\nfeatures. However, unlike [28], our regularizer will yield truly disjoint features when minimized\u2014a\nproperty hinging on the metric-based classi\ufb01cation scheme we have chosen. Our learning problem\nis guaranteed to be convex, whereas hyperparameters need to be tuned to ensure convexity in [28].\nWe return to these differences in Section 3.3, after explaining our algorithm in detail.\nExternal semantics beyond object class labels are rarely used in today\u2019s object recognition systems,\nbut recent work has begun to investigate new ways to integrate richer knowledge. Hierarchical\ntaxonomies have natural appeal, and researchers have studied ways to discover such structure auto-\nmatically [29, 30, 13], or to integrate known structure to train classi\ufb01ers at different levels [12, 31].\nThe emphasis is generally on saving prediction time (by traversing the tree from its root) or com-\nbining decisions, whereas we propose to in\ufb02uence feature learning based on these semantics. 
While\nsemantic structure need not always translate into helping visual feature selection, the correlation between\nWordNet semantics and visual confusions observed in [32] supports our use of the knowledge\nbase in this work. The machine learning community has also long explored hierarchical classification\n(e.g., [14, 15, 16]). Of this work, our goals most relate to [14], but our focus is on learning\nfeatures discriminatively and biasing toward a disjoint feature set via regularization.\nBeyond taxonomies, researchers are also injecting semantics by learning mid-level nameable \u201cattributes\u201d\nfor object categorization (e.g., [17, 33]). We show that when our method is applied to\nattributes as base features, the disjoint sparsity effects appear to be fairly interpretable.\n\n3 Approach\n\nWe review briefly the techniques for learning distance metrics. We then describe an $\\ell_1$-norm based\nregularization for selecting a sparse set of features in the context of metric learning. Building on that,\nwe proceed to describe our main algorithmic contribution, that is, the design of a metric learning algorithm\nthat prefers not only sparse but also disjoint features for discriminating different categories.\n\n3.1 Distance metric learning\n\nMany learning algorithms depend on calculating distances between samples, notably k-nearest\nneighbor classifiers or clustering. While the default is to use the Euclidean distance, the more\ngeneral Mahalanobis metric is often more suitable. For two data points $x_i, x_j \\in \\mathbb{R}^D$, their (squared)\nMahalanobis distance is given by\n\n$$d_M^2(x_i, x_j) = (x_i - x_j)^T M (x_i - x_j), \\qquad (1)$$\n\nwhere $M \\succeq 0$ is a positive semidefinite matrix. 
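As a minimal illustration of eq. (1), the following Python sketch evaluates the squared distance on plain lists. It is written for exposition only and is not part of any released code; the function name mahalanobis_sq and the toy values are hypothetical, and M is assumed to be supplied as a symmetric positive semidefinite matrix (the sketch does not check this).

```python
def mahalanobis_sq(x_i, x_j, M):
    '''Squared Mahalanobis distance (x_i - x_j)^T M (x_i - x_j).'''
    diff = [a - b for a, b in zip(x_i, x_j)]
    # Matrix-vector product M * diff, computed row by row.
    M_diff = [sum(m * d for m, d in zip(row, diff)) for row in M]
    # Inner product of diff with M * diff.
    return sum(d * md for d, md in zip(diff, M_diff))


identity = [[1.0, 0.0], [0.0, 1.0]]
# With M equal to the identity, eq. (1) is the squared Euclidean distance.
d_euclid = mahalanobis_sq([1.0, 2.0], [4.0, 6.0], identity)

# A metric whose second row and column are zero ignores the second feature.
first_only = [[1.0, 0.0], [0.0, 0.0]]
d_first = mahalanobis_sq([1.0, 2.0], [4.0, 6.0], first_only)
```

Here d_euclid sums the squared differences over both features, while d_first keeps only the contribution of the first feature.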
Arguably, the Mahalanobis distance can better\nmodel complex data, as it considers correlations between feature dimensions.\nLearning the optimal M from labeled data has been an active research topic (e.g., [23, 22, 11]).\nMost methods follow an intuitively appealing strategy: a good metric M should pull data points\nbelonging to the same class closer and push away data points belonging to different classes. As an\nillustrative example, we describe the technique used in constructing large margin nearest neighbor\n(LMNN) classifiers [11], to which our empirical studies extensively compare.\nIn LMNN, each point $x_i$ in the training set is associated with two sets of different data points in $x_i$\u2019s\nnearest neighbors (identified in the Euclidean distance): the \u201ctargets\u201d whose labels are the same as\n$x_i$\u2019s and the \u201cimpostors\u201d whose labels are different. Let $x_i^+$ and $x_i^-$ denote the \u201ctarget\u201d and\n\u201cimpostor\u201d sets, respectively. LMNN identifies the optimal M as the solution to\n\n$$\\min_{M \\succeq 0} \\; \\ell(M) = \\sum_i \\sum_{j \\in x_i^+} d_M^2(x_i, x_j) + \\gamma \\sum_{ijl} \\xi_{ijl}$$\n$$\\text{subject to} \\quad 1 + d_M^2(x_i, x_j) - d_M^2(x_i, x_l) \\le \\xi_{ijl}; \\quad \\xi_{ijl} \\ge 0, \\quad \\forall\\, j \\in x_i^+,\\; l \\in x_i^-, \\qquad (2)$$\n\nwhere the objective function $\\ell(M)$ balances two forces: pulling the target towards $x_i$ and pushing\nthe impostor away. The latter is characterized by the constraint composed of a triplet of data points:\nthe distance to an impostor should be greater than the distance to a target by at least a margin of 1,\npossibly with the help of a slack variable $\\xi_{ijl}$. The minimization of eq. 
(2) is a convex optimization\nproblem with the semidefinite constraint $M \\succeq 0$, and is tractable with standard techniques.\nOur approach extends previous work on metric learning in two aspects: i) we apply a sparsity-based\nregularization to identify informative features (Section 3.2); ii) at the same time, we seek metrics that\nrely on disjoint subsets of features for categories at different semantic granularities (Section 3.3).\n\n3.2 Sparse feature selection for metric learning\n\nHow can we learn a metric such that only a sparse set of features is relevant? Examining the\ndefinition of the Mahalanobis distance in eq. (1), we deduce that if the d-th feature of x is not to be\nused, it is sufficient and necessary for the d-th diagonal element of M to be zero (since M is positive\nsemidefinite, a zero diagonal element forces the corresponding row and column to vanish).\nTherefore, analogous to the use of the $\\ell_1$-norm by the popular LASSO technique [34], we add the\n$\\ell_1$-norm of M\u2019s diagonal elements to the large margin metric learning criterion $\\ell(M)$ in eq. (2),\n\n$$\\min_{M \\succeq 0} \\; \\sum_i \\sum_{j \\in x_i^+} d_M^2(x_i, x_j) + \\gamma \\sum_{ijl} \\xi_{ijl} + \\lambda\\, \\mathrm{Trace}[M], \\qquad (3)$$\n\nwhere we have omitted the constraints as they are not changed. Because the diagonal elements of a positive\nsemidefinite matrix are nonnegative, their $\\ell_1$-norm is exactly $\\mathrm{Trace}[M]$. $\\lambda$ and $\\gamma$ are nonnegative\n(hyper)parameters trading off the sparsity of the model and the other parts in the objective. Note that\nsince the matrix trace $\\mathrm{Trace}[\\cdot]$ is a linear function of its argument, this sparse feature metric learning\nproblem remains a convex optimization.\n\n3.3 Learning a tree of metrics (ToM) with disjoint visual features\n\nHow can we learn a tree of metrics so each metric uses features disjoint from its ancestors\u2019?\nUsing disjoint features To characterize the \u201cdisjointness\u201d between two metrics $M_t$ and $M_{t'}$, we\nuse the vectors of their nonnegative diagonal elements $v_t$ and $v_{t'}$ as proxies for which features are\n(more heavily) used. 
This is a reasonable choice as we use the sparsity-inducing $\\ell_1$-norm in learning\nthe metrics. We measure their degree of \u201ccompetition\u201d for common features,\n\n$$C(M_t, M_{t'}) = \\| v_t + v_{t'} \\|_2^2 . \\qquad (4)$$\n\nIntuitively, if a feature dimension is not used by either metric, the competition for that feature is low.\nIf a feature dimension is used by both metrics heavily, then the competition is high. Consequently,\nminimizing eq. (4) as a regularization term will encourage different metrics to use disjoint features.\nNote that the measure is a convex function of $v_t$ and $v_{t'}$, hence also convex in $M_t$ and $M_{t'}$.\nLearning a tree of metrics Formally, assume we have a tree $\\mathcal{T}$ where each node corresponds to\na category. Let $t$ index the $T$ non-leaf or internal nodes. We learn a metric $M_t$ to differentiate its\nchildren categories $c(t)$. For any node $t$, we use $D(t)$ to denote those training samples whose labeled\ncategories are offspring of $t$, and $a(t)$ to denote the nodes on the path from the root to $t$.\nTo learn our metrics $\\{M_t\\}_{t=1}^{T}$, we apply similar strategies of learning metrics for large-margin\nnearest neighbor classifiers. We cast it as a convex optimization problem:\n\n$$\\min_{\\{M_t\\} \\succeq 0} \\; \\sum_t \\sum_{c \\in c(t)} \\sum_{i,j \\in D(c)} d_{M_t}^2(x_i, x_j) + \\gamma \\sum_{t,c,r,ijl} \\xi_{tcrijl} + \\sum_t \\lambda_t\\, \\mathrm{Trace}[M_t] + \\sum_t \\sum_{a \\in a(t)} \\gamma_{ta}\\, C(M_t, M_a)$$\n$$\\text{subject to} \\quad \\forall\\, t,\\; \\forall\\, c \\in c(t),\\; \\forall\\, r \\in c(t) - \\{c\\},\\; \\forall\\, x_i, x_j \\in D(c),\\; x_l \\in D(r):$$\n$$1 + d_{M_t}^2(x_i, x_j) - d_{M_t}^2(x_i, x_l) \\le \\xi_{tcrijl}; \\quad \\xi_{tcrijl} \\ge 0 . \\qquad (5)$$\n\nIn short, there are $T$ learning (sub)problems, one for each metric. Each metric learning problem is\nin the style of the sparse feature metric learning eq. (3). 
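To make the competition measure of eq. (4) concrete, the sketch below compares a child metric that reuses its ancestor's feature with one that selects a different feature. It is an illustration under our own naming (competition and the toy 2x2 metrics are hypothetical), not the optimization code. Both child metrics have the same trace, so only the overlap with the ancestor changes the penalty.

```python
def competition(M_t, M_a):
    '''C(M_t, M_a): squared l2 norm of the sum of the two diagonals.'''
    v_t = [M_t[d][d] for d in range(len(M_t))]
    v_a = [M_a[d][d] for d in range(len(M_a))]
    return sum((a + b) ** 2 for a, b in zip(v_t, v_a))


# Ancestor metric that relies on the first feature only.
ancestor = [[2.0, 0.0], [0.0, 0.0]]
# Child that reuses the first feature (overlapping selection).
child_overlap = [[2.0, 0.0], [0.0, 0.0]]
# Child that relies on the second feature instead (disjoint selection).
child_disjoint = [[0.0, 0.0], [0.0, 2.0]]

c_overlap = competition(child_overlap, ancestor)
c_disjoint = competition(child_disjoint, ancestor)
```

Expanding the squared norm gives the two squared norms of the diagonals plus twice their inner product; for nonnegative diagonals this cross term vanishes exactly when the two metrics use disjoint features, which is why c_disjoint is the smaller of the two values.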
However, more importantly, these metric\nlearning problems are coupled together through the disjoint regularization. Our disjoint regularization\nencourages a metric $M_t$ to use different sets of features from its super-categories, that is, the categories\non the tree path from the root.\nNumerical optimization The optimization problem in eq. (5) is convex, though nonsmooth due\nto the nonnegative slack variables. We use the subgradient method, previously used for similar\nproblems [11]. For problems with a large taxonomy, learning all the regularization coefficients $\\lambda_t$\nand $\\gamma_{ta}$ jointly is prohibitive, as the number of coefficient combinations is $O(k^T)$, where $T$ is the number\nof nodes and $k$ is the number of values a coefficient can take. Thus, for the large-scale problems we\nfocus on, we use a simpler and computationally more efficient strategy of Sequential Optimization\n(SO) by sequentially optimizing one metric at a time. Specifically, we optimize the metric at the\nroot node and then its children, assuming the metric at the root is fixed. We then recursively (in\nbreadth-first order) optimize the rest of the metrics, always treating the metrics at the higher levels\nof the hierarchy as fixed. This strategy has a significantly reduced computational cost of $O(kT)$.\nIn addition, the SO procedure allows each metric to be optimized with different parameters and\nprevents a badly-learned low-level metric from influencing upper-level ones through the disjoint\nregularization terms. (This can also be achieved by adjusting all regularization coefficients in parallel\nthrough extensive cross-validation, but at a much higher computational expense.)\nUsing a tree of metrics for classification Once the metrics at all nodes are learned, they can be\nused for several classification tasks (e.g., with k-NN or as a kernel to an SVM). 
In this work, we\nstudy two tasks in particular: 1) We consider \u201cper-node classification\u201d, where the metric at each\nnode is used to discriminate its sub-categories. Since decisions at higher-level nodes must span a\nvariety of object sub-categories, these generic decisions are interesting to test the learned features in\na broader context. 2) We consider hierarchical classification [35], a natural way to use the full ToM.\nIn this case, we examine the recognition accuracy for the finest-level categories only. We classify an\nobject from the root node down; the leaf node that terminates the path is the predicted label.\nWe stress that our metric learning criterion of eq. (5) aims to minimize classification errors at each\nnode. Thus, improvement in per-node accuracy is more directly indicative of whether the learning\nhas resulted in useful metrics. Understanding the relation between per-node and full multi-class\naccuracy has been a challenging research problem in building hierarchical classifiers [16, 12].\nRelationship to orthogonal transfer Our work shares a similar spirit to the \u201corthogonal transfer\u201d\nidea explored in [28]. The authors there use non-overlapping features to construct multiple SVM\nclassifiers for hierarchical classification of text documents. Concretely, they propose an orthogonal\nregularizer $\\sum_{ij} K_{ij} |w_i^T w_j|$, where $w_i$ and $w_j$ are the SVM parameters. Minimizing it will\nreduce the similarity of the parameter vectors and make them \u201corthogonal\u201d to each other. However,\northogonality does not necessarily imply disjoint features. This can be seen with a contrived\ntwo-dimensional counterexample where $w_i = [1, -1]^T$ and $w_j = [-1, -1]^T$. Both features are\nused, yet the two parameter vectors are orthogonal. In contrast, our disjoint regularizer eq. (4) is\nmore indicative of true disjointness. 
Specifically, when our regularizer attains its minimum value of\nzero, we are guaranteed that features are non-overlapping as our $v_i$ and $v_j$ are nonnegative diagonal\nelements of positive semidefinite matrices. Our regularizer is also guaranteed to be convex, whereas\nthe convexity of the regularizer in [28] depends critically on tuning $K_{ij}$.\n\n[Figure 1 panels: (a) Class Hierarchy, (b) Means of the features, (c) ToM, (d) ToM + Sparsity, (e) ToM + Disjoint]\n\nFigure 1: Synthetic dataset example. Our disjoint regularizer yields a sparse metric that only considers the\nfeature dimension(s) necessary for discrimination at that given level.\n\n4 Results\n\nWe validate our ToM approach on several datasets, and consider three baselines: 1) Euclidean:\nEuclidean distance in the original feature space, 2) Global LMNN: a single global metric for all\nclasses learned with the LMNN algorithm [11], and 3) Multi-Metric LMNN: one metric learned\nper class using the multiple metric LMNN variant [11]. We use the code provided by the authors.\nTo evaluate the influence of each aspect of our method, we test it under three variants: 1) ToM:\nToM learning without any regularization terms, 2) ToM+Sparsity: ToM learning with the sparsity\nregularization term, and 3) ToM+Disjoint: ToM learning with the disjoint regularization term.\nFor all experiments, we test with five random data splits of 60%/20%/20% for train/validation/test.\nWe use the validation data to set the regularization parameters $\\lambda$ and $\\gamma$ among candidate values\n{0, 1, 10, 100, 1000}, and we generate 500 $(x_i, x_j, x_l)$ training triplets per class.\n\n4.1 Proof of concept on synthetic dataset\n\nFirst we use synthetic data to clearly illustrate disjoint sparsity regularization. We generate data with\nprecisely the property that coarser categories are distinguishable using feature dimensions distinct\nfrom those needed to discriminate their subclasses. 
Speci\ufb01cally, we sample 2000 points from each\nof four 4D Gaussians, giving four leaf classes {a, b, c, d}. They are grouped into two superclasses\nA = {a, b} and B = {c, d}. The \ufb01rst two dimensions of all points are speci\ufb01c to the superclass\ndecision (A vs. B), while the last two are speci\ufb01c to the subclasses. See Fig. 1 (a) and (b).\nWe run hierarchical k-nearest neighbor classi\ufb01cation (k = 3) on the test set. ToM+Sparsity increases\nthe recognition rate by 0.90%, while ToM+Disjoint increases it by 4.05%. Thus, as expected, dis-\njoint sparsity does best, since it selects different features for the super- and sub-classes. Accordingly,\nin the learned Mahalanobis matrices for each node (Fig. 1(c)-(e)), we see disjoint sparsity zeros out\nthe unneeded features for the upper-level metric, showed as black squares in the \ufb01gure (e). In con-\ntrast, the ToM+Sparsity features are sub-optimal and \ufb01t to some noise in the data (d).\n4.2 Visual recognition experiments\n\nNext we demonstrate our approach on challenging visual recognition tasks.\nDatasets and implementation details We validate with three datasets drawn from two publicly\navailable image collections: Animals with Attributes (AWA) [17] and ImageNet [18, 32]. Both are\nwell-suited for our scenario, since they consist of \ufb01ne-grained categories that can be grouped into\nmore general object categories. AWA contains 30,475 images and 50 animal classes, and we use\nit to create two datasets: 1) AWA-PCA, which uses the provided features (SIFT, rgSIFT, PHOG,\nSURF, LSS, RGB), concatenated, standardized, and PCA-reduced to 50 dimensions, and 2) AWA-\nATTR, which uses 85-dimensional attribute predictions as the original feature space. The latter is\nformed by concatenating the outputs of 85 linear SVMs trained to predict the presence/absence of\nthe 85 nameable properties annotated by [17], e.g., furry, white, quadrupedal, etc. 
For our third\ndataset VEHICLE-20, we take 20 vehicle classes and 26,624 images from ImageNet, and apply\nPCA to reduce the authors\u2019 provided visual word features [32] to 50 dimensions per image (this\ndimensionality worked best for the Global LMNN baseline).\nWe use WordNet to generate the semantic hierarchies for all datasets. We retrieve all nodes in\nWordNet that contain any of the object class names on their word lists. In the case of homonyms,\nwe manually disambiguate the word sense. Then, we build a compact partial hierarchy over those\nnodes by 1) pruning out any node that has only one child (i.e., removing superfluous nodes), and 2)\nresolving any instances of multiple parents by choosing the path from the leaf to root having the\nmost overlap with other classes. See Figures 2 and 3 for the resulting AWA and VEHICLE trees.\n\nFigure 2: Semantic hierarchy for AWA (top) and the per-node accuracy improvements relative to Euclidean\ndistance, for the AWA-PCA (left) and AWA-ATTR (right) datasets. Numbers in legends denote average improvement\nover all nodes. We generally achieve a sizable accuracy gain relative to the Global LMNN baseline\n(dark left bar for each class), showing the advantage of exploiting external semantics with our ToM approach.\n\nFigure 3: Semantic hierarchy for VEHICLE-20 and the per-node accuracy gains, plotted as above.\n\nThroughout, we evaluate classification accuracy using k-nearest neighbors (k-NN). For ToM, at\nnode $n$ we use $k = 2^{l_n - 1} + 1$, where $l_n$ is the level of the node, and $l_n = 1$ for leaf nodes. This\nmeans we use a larger k at the higher nodes in the tree where there is larger intra-class variation,\nin an effort to be more robust to outliers. For the Euclidean and LMNN baselines, which lack a\nhierarchy, we simply use k=3. 
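Reading the rule above as k = 2^(l_n - 1) + 1, the neighborhood size at each level of the tree can be tabulated with a short sketch. The helper name knn_k_for_level is hypothetical, and the exponent form is our reading of the formula rather than something taken from released code.

```python
def knn_k_for_level(level):
    '''Neighborhood size k = 2^(level - 1) + 1, with level 1 denoting the leaves.'''
    if level < 1:
        raise ValueError('level must be at least 1')
    return 2 ** (level - 1) + 1


# Levels 2, 3 and 4 are internal nodes one, two and three steps above the leaves.
k_by_level = {level: knn_k_for_level(level) for level in (2, 3, 4)}
```

The size grows geometrically with height, so metrics near the root, which must cover more varied subtrees, vote over more neighbors than metrics just above the leaves.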
Note that ToM\u2019s setting at the \ufb01nal decision nodes (just above a leaf)\nis also k = 3, comparable to the baselines.\n4.2.1 Per-node accuracy and analysis of the learned representations\n\nSince our algorithm optimizes the metrics at every node, we \ufb01rst examine the resulting per-node\ndecisions. That is, how accurately can we predict the correct subcategory at any given node? The\nbar charts in Figures 2 and 3 show the results, in terms of raw k-NN accuracy improvements over the\nEuclidean baseline. For reference, we also show the Global LMNN baseline. Multi-Metric LMNN\nis omitted here, since its metrics are only learned for the leaf node classes. We observe a good\nincrease for most classes, as well as a clear advantage relative to LMNN. Furthermore, our results\nare usually strongest when including the novel disjoint sparsity regularizer. This result supports our\nbasic claim about the potential advantage of exploiting external semantics in ToM.\nWe \ufb01nd that absolute gains are similar in either the PCA or ATTR feature spaces for AWA, though\nexact gains per class differ. While the ATTR variant exposes the semantic features directly to the\nlearner, the PCA variant encapsulates an array of low-level descriptors into its dimensions. Thus,\nwhile we can better interpret the meaning of disjoint sparsity on the attributes, our positive result on\nraw image features assures that disjoint feature selection is also amenable in the more general case.\nTo look more closely at this, Table 1 displays representative superclasses from AWA-ATTR together\nwith the attributes that ToM+Disjoint selects as discriminative for their subclasses. The attributes\nshown are those with nonzero weights in the learned metrics. Intuitively, we see that often the se-\nlected attributes are indeed useful for discriminating the child classes. 
For example, \u2018tusks\u2019 and\n\u2018plankton\u2019 attributes help distinguish common dolphins from killer whales, whereas \u2018stripes\u2019 and\n\u2018domestic\u2019 help distinguish zebras from horses. At the same time, as desired, we see that the\nattributes useful for coarser-level categories are distinct from those employed to discriminate the\nfiner-level objects. For example, \u2018fast\u2019, \u2018longneck\u2019, or \u2018hairless\u2019 are used to differentiate equine\nfrom rhino, but are excluded when differentiating horses from zebras (equine\u2019s subclasses).\n\n[Figure: semantic hierarchy for AWA, with bar charts of per-node accuracy improvement. Panel AWA-PCA legend: Global LMNN: 1.33, ToM: 1.44, ToM+Sparsity: 1.93, ToM+Disjoint: 2.15. Panel AWA-ATTR legend: Global LMNN: 1.01, ToM: 1.53, ToM+Sparsity: 1.94, ToM+Disjoint: 2.45.]\n\n[Figure: semantic hierarchy for VEHICLE-20, with a bar chart of per-node accuracy improvement. Legend: Global LMNN: 0.86, ToM: 2.42, ToM+Sparsity: 2.79, ToM+Disjoint: 3.13.]\n\nSuperclass | Subclasses | Attributes selected\ndolphin | common dolphin, killer whale | tusks, plankton, blue, gray, red, patches, slow, muscle, active, insects\nwhale | dolphin, baleen whale | black, white, blue, gray, toughskin, chewteeth, strainteeth, smelly, slow, muscle, active, fish, hunter, skimmer, oldworld, arctic, ...\nequine | horse, zebra | stripes, domestic, orange, red, yellow, toughskin, newworld, arctic, bush\nodd-toed ungulate | equine, rhinoceros | fast, longneck, hairless, black, white, yellow, patches, spots, bulbous, longleg, buckteeth, horns, tusks, smelly, ...\n\nTable 1: Attributes selected by ToM+Disjoint for various superclass objects in AWA. See text.\n\nMethod | AWA-ATTR: Correct label | AWA-ATTR: Semantic similarity | AWA-PCA: Correct label | AWA-PCA: Semantic similarity | VEHICLE-20: Correct label | VEHICLE-20: Semantic similarity\nEuclidean | 32.36 \u00b1 0.21 | 53.60 \u00b1 0.26 | 17.54 \u00b1 0.38 | 38.11 \u00b1 0.58 | 28.51 \u00b1 0.56 | 56.10 \u00b1 0.41\nGlobal LMNN | 32.49 \u00b1 0.42 | 53.93 \u00b1 0.88 | 19.62 \u00b1 0.51 | 40.34 \u00b1 0.32 | 29.65 \u00b1 0.44 | 57.57 \u00b1 0.45\nMulti-metric LMNN | 32.34 \u00b1 0.35 | 53.73 \u00b1 0.71 | 17.61 \u00b1 0.33 | 38.94 \u00b1 0.31 | 30.00 \u00b1 0.51 | 57.91 \u00b1 0.54\nToM | 36.79 \u00b1 0.27 | 58.36 \u00b1 0.09 | 18.70 \u00b1 0.41 | 43.44 \u00b1 0.43 | 31.23 \u00b1 0.67 | 60.72 \u00b1 0.54\nToM + Sparsity | 37.58 \u00b1 0.32 | 59.29 \u00b1 0.58 | 18.79 \u00b1 0.46 | 43.38 \u00b1 0.34 | 32.09 \u00b1 0.18 | 62.66 \u00b1 0.26\nToM + Disjoint | 38.29 \u00b1 0.61 | 59.72 \u00b1 0.62 | 19.00 \u00b1 0.30 | 43.59 \u00b1 0.19 | 32.77 \u00b1 0.32 | 63.01 \u00b1 0.21\n\nTable 2: Multi-class hierarchical classification accuracy and semantic similarity on all three datasets. Numbers are averages over 5 splits, with standard errors for a 95% confidence interval. Our method outperforms the baselines in almost all cases, and notably provides more semantically close predictions. See text.\n\n4.2.2 Hierarchical multi-class classification accuracy\n\nNext we evaluate the complete multi-class classification accuracy, where we use all the learned ToM\nmetrics together to predict the leaf-node label of the test points. 
This is a 50-way task for AWA, and\na 20-way task for VEHICLE-20. Table 2 shows the results.\nWe score accuracy in two ways: Correct label records the percentage of examples assigned the\ncorrect (leaf) label, while Semantic similarity records the semantic similarity between the predicted\nand true labels. For both, higher is better. The former is standard recognition accuracy, while the\nlatter gives a more nuanced view of the \u201csemantic magnitude\u201d of the classifiers\u2019 errors. Specifically,\nwe calculate the semantic similarity between classes (nodes) i and j using the metric defined in [36],\nwhich counts the number of nodes shared by their two parent branches, divided by the length of the\nlonger of the two branches. In the spirit of other recent evaluations [37, 32, 36], this metric reflects\nthat some errors are worse than others; for example, calling a Persian cat a Siamese cat is a less\nglaring error than calling a Persian cat a horse. This is especially relevant in our case, since our key\nmotivation is to instill external semantics into the feature learning process.\nIn terms of pure label correctness, ToM improves over the strong LMNN baselines for both\nAWA-ATTR and VEHICLE-20. Further, in all cases, we see that disjoint sparsity is an important addition\nto ToM. However, in AWA-PCA, Global LMNN produces the best results by a statistically\ninsignificant margin. We did not find a clear rationale for this one case. For AWA-ATTR, however, our\nmethod is substantially better than Global LMNN, perhaps due to our method\u2019s strength in exploiting\nsemantic features. While we initially expected Multi-Metric LMNN to outperform Global LMNN,\nwe suspect it struggles with clusters that are too close together. In all cases where ToM+Disjoint\noutperforms the LMNN or Euclidean baselines, the improvement is statistically significant.\nIn terms of semantic similarity, our ToM is better than all baselines on all datasets. 
This is a very\nencouraging result, since it suggests our approach is in fact leveraging semantics in a useful way.\nIn practice, the ability to make such \u201creasonable\u201d errors is likely to be increasingly important as the\ncommunity tackles larger and larger multi-class recognition problems.\n\nConclusion We presented a new metric learning approach for visual recognition that integrates\nexternal semantics about object hierarchy. Experiments with challenging datasets indicate its promise,\nand support our hypothesis that outside knowledge about how objects relate is valuable for feature\nlearning. In future work, we are interested in exploring local features in this context, and considering\nways to learn both the hierarchy and the useful features simultaneously.\n\nAcknowledgments F. Sha is supported by NSF IIS-1065243 and benefited from discussions with\nD. Zhou and B. Kulis. K. Grauman is supported by NSF IIS-1065390.\n\nReferences\n[1] A. Vedaldi, V. Gulshan, M. Varma, and A. Zisserman. Multiple kernels for object detection. In ICCV, 2009.\n[2] P. Gehler and S. Nowozin. On feature combination for multiclass object classification. In ICCV, 2009.\n[3] J. Yang, K. Yu, Y. Gong, and T. Huang. Linear spatial pyramid matching using sparse coding for image classification. In CVPR, 2009.\n[4] L.-J. Li, H. Su, E. Xing, and L. Fei-Fei. Object bank: A high-level image representation for scene classification and semantic feature sparsification. In NIPS, 2010.\n[5] Y. Jia, M. Salzmann, and T. Darrell. Factorized latent spaces with structured sparsity. In NIPS, 2010.\n[6] A. Frome, Y. Singer, and J. Malik. Image retrieval and classification using local distance functions. In NIPS, 2006.\n[7] P. Kumar, P. Torr, and A. Zisserman. An invariant large margin nearest neighbour classifier. In ICCV, 2007.\n[8] P. Jain, B. Kulis, and K. Grauman. Fast image search for learned metrics. In CVPR, 2008.\n[9] D. Ramanan and S. Baker. Local distance functions: A taxonomy, new algorithms, and an evaluation. In PAMI, 2011.\n[10] Z. Wang, Y. Hu, and L.-T. Chia. Image-to-class distance metric learning for image classification. In ECCV, 2010.\n[11] K. Q. Weinberger and K. L. Saul. Distance metric learning for large margin nearest neighbor classification. JMLR, 10:207\u2013244, June 2009.\n[12] M. Marszalek and C. Schmid. Constructing category hierarchies for visual recognition. In ECCV, 2008.\n[13] G. Griffin and P. Perona. Learning and using taxonomies for fast visual category recognition. In CVPR, 2008.\n[14] D. Koller and M. Sahami. Hierarchically classifying documents using very few words. In ICML, 1997.\n[15] A. McCallum, R. Rosenfeld, T. Mitchell, and A. Ng. Improving text classification by shrinkage in a hierarchy of classes. In ICML, 1998.\n[16] L. Cai and T. Hofmann. Hierarchical document categorization with support vector machines. In CIKM, 2004.\n[17] C. Lampert, H. Nickisch, and S. Harmeling. Learning to detect unseen object classes by between-class attribute transfer. In CVPR, 2009.\n[18] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.\n[19] A. Torralba and K. Murphy. Sharing visual features for multiclass and multiview object detection. PAMI, 29(5), 2007.\n[20] G. Shakhnarovich. Learning Task-Specific Similarity. PhD thesis, MIT, 2006.\n[21] B. Babenko, S. Branson, and S. Belongie. Similarity functions for categorization: from monolithic to category specific. In ICCV, 2009.\n[22] A. Globerson and S. Roweis. Metric learning by collapsing classes. In NIPS, pages 451\u2013458, 2006.\n[23] J. Davis, B. Kulis, P. Jain, S. Sra, and I. Dhillon. Information-theoretic metric learning. In ICML, 2007.\n[24] K. Weinberger and L. Saul. Fast solvers and efficient implementations for distance metric learning. In ICML, 2008.\n[25] Q. Chen and S. Sun. Hierarchical large margin nearest neighbor classification. In ICPR, 2010.\n[26] S. Parameswaran and K. Weinberger. Large margin multi-task metric learning. In NIPS, 2010.\n[27] Y. Ying, K. Huang, and C. Campbell. Sparse metric learning via smooth optimization. In NIPS, 2009.\n[28] D. Zhou, L. Xiao, and M. Wu. Hierarchical classification via orthogonal transfer. In ICML, 2011.\n[29] J. Sivic, B. Russell, A. Zisserman, W. Freeman, and A. Efros. Unsupervised discovery of visual object class hierarchies. In CVPR, 2008.\n[30] E. Bart, I. Porteous, P. Perona, and M. Welling. Unsupervised learning of visual taxonomies. In CVPR, 2008.\n[31] A. Zweig and D. Weinshall. Exploiting object hierarchy: Combining models from different category levels. In ICCV, 2007.\n[32] J. Deng, A. Berg, K. Li, and L. Fei-Fei. What does classifying more than 10,000 image categories tell us? In ECCV, 2010.\n[33] Y. Wang and G. Mori. A discriminative latent model of object classes and attributes. In ECCV, 2010.\n[34] R. Tibshirani. Regression shrinkage and selection via the lasso. J. Roy. Statistical Society, 58:267\u2013288, 1994.\n[35] S. Dumais and H. Chen. Hierarchical classification of web content. In Research and Development in Information Retrieval, 2000.\n[36] R. Fergus, H. Bernal, Y. Weiss, and A. Torralba. Semantic label sharing for learning with many categories. In ECCV, 2010.\n[37] K. Barnard, Q. Fan, R. Swaminathan, A. Hoogs, R. Collins, P. Rondot, and J. Kaufhold. Evaluation of localized semantics: data, methodology, and experiments. Technical report, University of Arizona, 2005.\n", "award": [], "sourceid": 439, "authors": [{"given_name": "Kristen", "family_name": "Grauman", "institution": null}, {"given_name": "Fei", "family_name": "Sha", "institution": null}, {"given_name": "Sung", "family_name": "Hwang", "institution": null}]}