Title: Predicting Useful Neighborhoods for Lazy Local Learning
Venue: Advances in Neural Information Processing Systems, pp. 1916-1924

Predicting Useful Neighborhoods for Lazy Local Learning

Aron Yu
University of Texas at Austin
aron.yu@utexas.edu

Kristen Grauman
University of Texas at Austin
grauman@cs.utexas.edu

Abstract

Lazy local learning methods train a classifier "on the fly" at test time, using only a subset of the training instances that are most relevant to the novel test example. The goal is to tailor the classifier to the properties of the data surrounding the test example. Existing methods assume that the instances most useful for building the local model are strictly those closest to the test example.
However, this fails to account for the fact that the success of the resulting classifier depends on the full distribution of selected training instances. Rather than simply gathering the test example's nearest neighbors, we propose to predict the subset of training data that is jointly relevant to training its local model. We develop an approach to discover patterns between queries and their "good" neighborhoods using large-scale multi-label classification with compressed sensing. Given a novel test point, we estimate both the composition and size of the training subset likely to yield an accurate local model. We demonstrate the approach on image classification tasks on SUN and aPascal and show its advantages over traditional global and local approaches.

1 Introduction

Many domains today -- vision, speech, biology, and others -- are flush with data. Data availability, combined with recent large-scale annotation efforts and crowdsourcing developments, has yielded labeled datasets of unprecedented size. Though a boon for learning approaches, large labeled datasets also present new challenges. Beyond the obvious scalability concerns, the diversity of the data can make it difficult to learn a single global model that will generalize well. For example, a standard binary dog classifier forced to simultaneously account for the visual variations among hundreds of dog breeds may be "diluted" to the point that it falls short in detecting new dog instances. Furthermore, with training points distributed unevenly across the feature space, the model capacity required in any given region of the space will vary.
As a result, if we train a single high-capacity learning algorithm, it may succeed near parts of the decision boundary that are densely populated with training examples, yet fail in poorly sampled areas of the feature space.

Local learning methods offer a promising direction to address these challenges. Local learning is an instance of "lazy learning", where one defers processing of the training data until test time. Rather than estimate a single global model from all training data, local learning methods instead focus on a subset of the data most relevant to the particular test instance. This helps learn fine-grained models tailored to the new input, and makes it possible to adjust the capacity of the learning algorithm to the local properties of the data [5]. Local methods include classic nearest neighbor classification as well as various novel formulations that use only nearby points to either train a model [2, 3, 5, 13, 29] or learn a feature transformation [8, 9, 15, 25] that caters to the novel input.

A key technical question in local learning is how to determine which training instances are relevant to a test instance. All existing methods rely on an important core assumption: that the instances most useful for building a local model are those that are nearest to the test example. This assumption is well motivated by the factors discussed above, in terms of data density and intra-class variation.

Furthermore, identifying training examples solely based on proximity has the appeal of permitting specialized similarity functions (whether learned or engineered for the problem domain), which can be valuable for good results, especially in structured input spaces.

On the other hand, there is a problem with this core assumption. By treating the individual nearness of training points as a metric of their utility for local training, existing methods fail to model how those training points will actually be employed.
Namely, the relative success of a locally trained model is a function of the entire set or distribution of the selected data points -- not simply the individual pointwise nearness of each one to the query. In other words, the ideal target subset consists of a set of instances that together yield a good predictive model for the test instance.

Based on this observation, we propose to learn the properties of a "good neighborhood" for local training. Given a test instance, the goal is to predict which subset of the training data should be enlisted to train a local model on the fly. The desired prediction task is non-trivial: with a large labeled dataset, the power set of candidates is enormous, and we can observe relatively few training instances for which the most effective neighborhood is known. We show that the problem can be cast in terms of large-scale multi-label classification, where we learn a mapping from an individual instance to an indicator vector over the entire training set that specifies which instances are jointly useful to the query. Our approach maintains an inherent bias towards neighborhoods that are local, yet makes it possible to discover subsets that (i) deviate from a strict nearest-neighbor ranking and (ii) vary in size.

The proposed technique is a general framework to enhance local learning. We demonstrate its impact on image classification tasks for computer vision, and show its substantial advantages over existing local learning strategies. Our results illustrate the value in estimating the size and composition of discriminative neighborhoods, rather than relying on proximity alone.

2 Related Work

Local learning algorithms  Lazy local learning methods are most relevant to our work. Existing methods primarily vary in how they exploit the labeled instances nearest to a test point.
One strategy is to identify a fixed number of neighbors most similar to the test point, then train a model with only those examples (e.g., a neural network [5], SVM [29], ranking function [3, 13], or linear regression [2]). Alternatively, the nearest training points can be used to learn a transformation of the feature space (e.g., Linear Discriminant Analysis); after projecting the data into the new space, the model is better tailored to the query's neighborhood properties [8, 9, 15, 25]. In local selection methods, strictly the subset of nearby data is used, whereas in locally weighted methods, all training points are used but weighted according to their distance [2]. All prior methods select the local neighborhood based on proximity, and they typically fix its size. In contrast, our idea is to predict the set of training instances that will produce an effective discriminative model for a given test instance.

Metric learning  The question "what is relevant to a test point?" also brings to mind the metric learning problem. Metric learning methods optimize the parameters of a distance function so as to best satisfy known (dis)similarity constraints between training data [4]. Most relevant to our work are those that learn local metrics; rather than learn a single global parameterization, the metric varies in different regions of the feature space. For example, to improve nearest neighbor classification, in [11] a set of feature weights is learned for each individual training example, while in [26, 28] separate metrics are trained for clusters discovered in the training data.

Such methods are valuable when the data is multi-modal and thus ill-suited to a single global metric. Furthermore, one could plug a learned metric into the basic local learning framework.
However, we stress that learning what a good neighbor looks like (metric learning's goal) is distinct from learning what a good neighborhood looks like (our goal). Whereas a metric can be trained with pairwise constraints indicating what should be near or far, jointly predicting the instances that ought to compose a neighborhood requires a distinct form of learning, which we tackle in this work.

Hierarchical classification  For large multi-class problems, hierarchical classification approaches offer a different way to exploit "locality" among the training data. The idea is to assemble a tree of decision points, where at each node only a subset of labels is considered (e.g., [6, 12, 21]). Such methods are valuable for reducing computational complexity at test time, and broadly speaking they share the motivation of focusing on finer-grained learning tasks to improve accuracy. However, otherwise the work is quite distant from our problem. Hierarchical methods precompute groups of labels to isolate in classification tasks, and apply the same classifiers to all test instances; lazy local learning predicts at test time which set of training instances is relevant for each novel test instance.

Weighting training instances  Our problem can be seen as deciding which training instances to "trust" most. Various scenarios call for associating weights with training instances such that some influence the learned parameters more than others. For example, weighted instances can reflect label confidences [27], help cope with imbalanced training sets [24], or resist the influence of outliers [20]. However, unlike our setting, the weights are given at training time and they are used to create a single global model.
Methods to estimate the weights per example arise in domain adaptation, where one aims to give more weight to source domain samples distributed most like those in the target domain [14, 17, 18]. These are non-local, offline approaches, whereas we predict useful neighborhoods in an online, query-dependent manner. Rather than close the mismatch between a source and target domain, we aim to find a subset of training data amenable to a local model.

Active learning  Active learning [23] aims to identify informative unlabeled training instances, with the goal of minimizing labeling effort when training a single (global) classifier. In contrast, our goal is to ignore those labeled training points that are irrelevant to a particular novel instance.

3 Approach

We propose to predict the set of training instances which, for a given test example, are likely to compose an effective neighborhood for local classifier learning. We use the word "neighborhood" to refer to such a subset of training data -- though we stress that the optimal subset need not consist of strictly rank-ordered nearest neighbor points.

Our approach has three main phases: (i) an offline stage where we generate positive training neighborhoods (Sec. 3.1), (ii) an offline stage where we learn a mapping from individual examples to their useful neighborhoods (Sec. 3.2), and (iii) an online phase where we apply the learned model to infer a novel example's neighborhood, train a local classifier, and predict the test label (Sec. 3.3).

3.1 Generating training neighborhoods

Let T = {(x_1, c_1), ..., (x_M, c_M)} denote the set of M category-labeled training examples. Each x_i ∈ R^d is a vector in some d-dimensional feature space, and each c_i ∈ {1, ..., C} is its target category label. Given these examples, we first aim to generate a set of training neighborhoods, N = {(x_n1, y_n1), ..., (x_nN, y_nN)}.
Each training neighborhood (x_ni, y_ni) consists of an individual instance x_ni paired with a set of training instance indices capturing its target "neighbors", the latter being represented as an M-dimensional indicator vector y_ni. If y_ni(j) = 1, this means x_j appears in the target neighborhood for x_ni. Otherwise, y_ni(j) = 0. Note that the dimensionality of this target indicator vector is M, the number of total available training examples. We will generate N such pairs, where typically N ≪ M.

As discussed above, there are very good motivations for incorporating nearby points for local learning. Indeed, we do not intend to eschew the "locality" aspect of local learning. Rather, we start from the premise that points near to a query are likely relevant -- but relevance is not necessarily preserved purely by their rank order, nor must the best local set be within a fixed radius of the query (or have a fixed set size). Instead, we aim to generalize the locality concept to jointly estimate the members of a neighborhood such that, taken together, they are equipped to train an accurate query-specific model.

With these goals in mind, we devise an empirical approach to generate the pairs (x_ni, y_ni) ∈ N. The main idea is to sample a series of candidate neighborhoods for each instance x_ni, evaluate their relative success at predicting the training instance's label, and record the best candidate.

Specifically, for instance x_ni, we first compute its proximity to the M - 1 other training images in the feature space. (We simply apply Euclidean distance, but a task-specific kernel or learned metric could also be used here.) Then, for each of a series of possible neighborhood sizes {k_1, ..., k_K}, we sample a neighborhood of size k from among all training images, subject to two requirements: (i) points nearer to x_ni are more likely to be chosen, and (ii) the category label composition within the neighborhood set is balanced. In particular, for each possible category label 1, ..., C we sample k/C training instances without replacement, where the weight associated with an instance is inversely related to its (normalized) distance to x_ni. We repeat the sampling S times for each value of k, yielding K × S candidates per instance x_ni.

Next, for each of these candidates, we learn a local model. Throughout we employ linear support vector machine (SVM) classifiers, both due to their training efficiency and because lower capacity models are suited to the sparse, local datasets under consideration; however, kernelized/non-linear models are also possible.^1 Note that any number of the K × S sampled neighborhoods may yield a classifier that correctly predicts x_ni's category label c_ni. Thus, to determine which among the successful classifiers is best, we rank them by their prediction confidences. Let p_s^k(x_ni) = P(c_ni | x_ni) be the posterior estimated by the s-th candidate classifier for neighborhood size k, as computed via Platt scaling using the neighborhood points. To automatically select the best k for instance x_ni, we average these posteriors across all samples per k value, then take the one with the highest probability:

    k* = argmax_k (1/S) Σ_{s=1}^{S} p_s^k(x_ni).

The averaging step aims to smooth the estimated probability using the samples for that value of k, each of which favors near points but varies in its composition. Finally, we obtain a single neighborhood pair (x_ni, y_ni), where y_ni is the indicator vector for the neighborhood sampled with size k* having the highest posterior p_s^{k*}.

In general we can expect higher values of S and denser samplings of k to provide the best results, though at a correspondingly higher computational cost during this offline training procedure.

3.2 Learning to predict neighborhoods with compressed sensing

With the training instance-neighborhood pairs in hand, we next aim to learn a function capturing their relationships. This function must estimate the proper neighborhood for novel test instances.

We are faced with a non-trivial learning task. The most straightforward approach might be to learn a binary decision function for each x_i ∈ T, trained with all x_nj for which y_nj(i) = 1 as positives. However, this approach has several problems. First, it would require training M binary classifiers, and in the applications of interest M -- the number of all available category-labeled examples -- may be very large, easily reaching the millions. Second, it would fail to represent the dependencies between the instances appearing in a single training neighborhood, which ought to be informative for our task. Finally, it is unclear how to properly gather negative instances for such a naive solution.

Instead, we pose the learning task as a large-scale multi-label classification problem. In multi-label classification, a single data point may have multiple labels. Typical examples include image and web page tagging [16, 19] or recommending advertiser bid phrases [1].
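Before detailing how this multi-label view is used, the offline neighborhood-generation stage of Sec. 3.1 can be made concrete with a minimal NumPy sketch on toy data. A tiny logistic regression stands in for the paper's linear SVM with Platt scaling (it yields posteriors directly), and all sizes, names, and hyperparameters are illustrative rather than the authors' settings; for simplicity the query itself may be sampled into its own neighborhood, whereas the paper draws from the M - 1 other points.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_candidate(X, labels, query, k, rng):
    """Distance-weighted, class-balanced sample of k indices (k/C per class)."""
    d = np.linalg.norm(X - query, axis=1)
    w = 1.0 / (1e-8 + d / d.max())              # nearer points -> larger weight
    classes = np.unique(labels)
    idx = []
    for c in classes:
        pool = np.flatnonzero(labels == c)
        p = w[pool] / w[pool].sum()
        idx.extend(rng.choice(pool, size=k // len(classes), replace=False, p=p))
    return np.array(idx)

def fit_logreg(X, y, iters=200, lr=0.5):
    """Tiny logistic regression: stand-in for a linear SVM + Platt scaling."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w = np.zeros(Xb.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))
        w -= lr * Xb.T @ (p - y) / len(y)
    return w

def posterior(w, x):
    return 1.0 / (1.0 + np.exp(-(np.append(x, 1.0) @ w)))

def generate_neighborhood(X, labels, query, true_label, ks=(10, 20), S=5):
    """k* = argmax_k (1/S) sum_s p_s^k(query); keep that k's best sample."""
    best_mean, best_idx = -1.0, None
    for k in ks:
        cands = [sample_candidate(X, labels, query, k, rng) for _ in range(S)]
        posts = [posterior(fit_logreg(X[idx],
                                      (labels[idx] == true_label).astype(float)),
                           query) for idx in cands]
        if np.mean(posts) > best_mean:
            best_mean = np.mean(posts)
            best_idx = cands[int(np.argmax(posts))]
    y_ni = np.zeros(len(X), dtype=int)          # M-dimensional indicator y_ni
    y_ni[best_idx] = 1
    return y_ni

# toy two-class data; the query is a training point from class 0
X = np.vstack([rng.normal(0, 1, (40, 5)), rng.normal(3, 1, (40, 5))])
labels = np.array([0] * 40 + [1] * 40)
indicator = generate_neighborhood(X, labels, X[5], 0)
```

The returned vector plays the role of y_ni: it is balanced by construction (k/C members per class) and biased toward near points through the distance-derived sampling weights.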
In our case, rather than predict which labels to associate with a novel example, we want to predict which training instances belong in its neighborhood. This is exactly what is encoded by the target indicator vectors y_ni defined above. Furthermore, we want to exploit the fact that, compared to the number of all labeled training images, the most useful local neighborhoods will contain relatively few examples.

Therefore, we adopt a large-scale multi-label classification approach based on compressed sensing [19] in our framework. With it, we can leverage sparsity in the high-dimensional target neighborhood space to efficiently learn a prediction function that jointly estimates all useful neighbors. First, for each of the N training neighborhoods, we project its M-dimensional neighborhood vector y_ni to a lower-dimensional space using a random transformation: z_ni = φ y_ni, where φ is a D × M random matrix, and D denotes the compressed indicators' dimensionality. Then, we learn regression functions to map the original features to these projected values z_n1, ..., z_nN as targets. That is, we obtain a series of D ≪ M regression functions f_1, ..., f_D minimizing the loss in the compressed indicator vector space. Given a novel instance x_q, those same regression functions are applied to map it to the reduced space, [f_1(x_q), ..., f_D(x_q)].
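A compact sketch of this compress-then-regress pipeline, including the sparse recovery stage that completes it, follows. Plain ridge regression and orthogonal matching pursuit (OMP) stand in here for the Bayesian multi-label compressed sensing machinery of [19]; all sizes are toy values, and the synthetic indicator vectors are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)
M, D, d, N = 200, 100, 10, 50      # M labels, D = O(l log(M/l)) compressed dims

Phi = rng.normal(0, 1 / np.sqrt(D), (D, M))   # random projection phi (D x M)

# synthetic training pairs (x_ni, y_ni): 8-sparse M-dim indicator vectors
W_true = rng.normal(0, 1, (d, M))
Xtr = rng.normal(0, 1, (N, d))
Ytr = np.zeros((N, M))
for i in range(N):
    Ytr[i, np.argsort(-(Xtr[i] @ W_true))[:8]] = 1

Ztr = Ytr @ Phi.T                  # compress targets: z = phi y

# the D ridge regressions f_1..f_D, fit jointly in closed form
lam = 1.0
A = np.linalg.solve(Xtr.T @ Xtr + lam * np.eye(d), Xtr.T @ Ztr)   # d x D

def omp(Phi, z, sparsity):
    """Orthogonal matching pursuit: recover sparse y with Phi @ y ~ z."""
    resid, support = z.copy(), []
    for _ in range(sparsity):
        support.append(int(np.argmax(np.abs(Phi.T @ resid))))
        B = Phi[:, support]
        coef, *_ = np.linalg.lstsq(B, z, rcond=None)
        resid = z - B @ coef
    yhat = np.zeros(Phi.shape[1])
    yhat[support] = coef
    return yhat

xq = Xtr[0]
zq = xq @ A                        # regress the query into the compressed space
yq = omp(Phi, zq, sparsity=8)      # soft M-dimensional indicator estimate
```

With D Gaussian measurements well above the sparsity level, OMP recovers the support of an exactly-compressed indicator with high probability, which is the guarantee the paper leans on.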
Finally, we predict the complete indicator vector by recovering the M-dimensional vector using a standard reconstruction algorithm from the compressed sensing literature.

We employ the Bayesian multi-label compressed sensing framework of [19], since it unifies the regression and sparse recovery stages, yielding accurate results for a compact set of latent variables. Due to compressed sensing guarantees, an M-dimensional indicator vector with l nonzero entries can be recovered efficiently using D = O(l log(M/l)) measurements [16].

^1 In our experiments the datasets have binary labels (C = 2); in the case of C > 2 the local model must be multi-class, e.g., a one-versus-rest SVM.

3.3 Inferring the neighborhood for a novel example

All processing so far is performed offline. At test time, we are given a novel example x_q, and must predict its category label. We first predict its neighborhood using the compressed sensing approach overviewed in the previous section, obtaining the M-dimensional vector ŷ_q. The entries of this vector are real-valued, and correspond to our relative confidence that each category-labeled instance x_i ∈ T belongs in x_q's neighborhood.

Past multi-label classification work focuses its evaluation on the precision of (a fixed number of) the top few most confident predictions and the raw reconstruction error [16, 19], and does not handle the important issue of how to truncate the values to produce hard binary decisions. In contrast, our setting demands that we extract both the neighborhood size estimate and the neighborhood composition from the estimated real-valued indicator vector.

To this end, we perform steps paralleling the training procedure defined in Sec. 3.1, as follows.
First, we use the sorted confidence values in ŷ_q to generate a series of candidate neighborhoods of sizes varying from k_1 to k_K, each time ensuring balance among the category labels. That is, for each k, we take the k/C most confident training instances per label. Recall that all M training instances referenced by ŷ_q have a known category label among 1, ..., C. Analogous to before, we then apply each of the K candidate predicted neighborhoods in turn to train a local classifier. Of those, we return the category label prediction from the classifier with the most confident decision value.

Note that this process automatically selects the neighborhood size k to apply for the novel input. In contrast, existing local learning approaches typically manually define this parameter and fix it for all test examples [5, 8, 13, 15, 29]. Our results show that this approach is sub-optimal; not only does the most useful neighborhood deviate from the strict ranked list of neighbors, it also varies in size.

We previously explored an alternative approach for inference, where we directly used the confidences in ŷ_q as weights in an importance-weighted SVM. That is, for each query, we trained a model with all M data points, but modulated their influence according to the soft indicator vector ŷ_q, such that less confident points incurred lower slack penalties. However, we found that approach inferior, likely due to the difficulty of validating the slack scale factor for all training instances (problematic in the local learning setting), as well as the highly imbalanced datasets we tackle in the experiments.

3.4 Discussion

While local learning methods strive to improve accuracy over standard global models, their lazy use of training data makes them more expensive to apply.
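As an aside, the Sec. 3.3 inference loop just described (balanced top-k/C candidates from the sorted confidences, then the most confident local decision) can be sketched in a few lines. Here distance-based scores simulate the soft indicator ŷ_q rather than running the compressed sensing stage, and a tiny logistic regression again stands in for the linear SVM; names and sizes are illustrative.

```python
import numpy as np

def fit_logreg(X, y, iters=300, lr=0.5):
    """Tiny logistic regression as a stand-in for the paper's linear SVM."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w = np.zeros(Xb.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))
        w -= lr * Xb.T @ (p - y) / len(y)
    return w

def predict_query(xq, conf, X, labels, ks=(4, 8, 12)):
    """For each candidate size k, take the k/C most confident instances per
    class, train a local model, and return the label from the candidate
    classifier with the most confident decision value."""
    classes = np.unique(labels)
    best_score, best_label = -np.inf, None
    for k in ks:
        idx = []
        for c in classes:                       # balanced top-k/C per class
            pool = np.flatnonzero(labels == c)
            idx.extend(pool[np.argsort(-conf[pool])[: k // len(classes)]])
        idx = np.array(idx)
        w = fit_logreg(X[idx], (labels[idx] == classes[1]).astype(float))
        score = np.append(xq, 1.0) @ w          # signed decision value
        if abs(score) > best_score:
            best_score = abs(score)
            best_label = classes[1] if score > 0 else classes[0]
    return best_label

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-2, 1, (30, 4)), rng.normal(2, 1, (30, 4))])
labels = np.array([0] * 30 + [1] * 30)
xq = np.full(4, 2.0)                            # query drawn near class 1
conf = -np.linalg.norm(X - xq, axis=1)          # simulated soft indicator
pred = predict_query(xq, conf, X, labels)
```

Note the size k is chosen per query as a side effect of picking the most confident candidate classifier, mirroring the automatic selection the paper describes.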
This added expense is true of any local approach that must compute distances to neighbors and train a fresh classifier online for each new test example. In our case, using Matlab, the run-time for processing a single novel test point can vary from 30 seconds to 30 minutes. It is dominated by the compressed sensing reconstruction step, which takes about 80% of the computation time and is highly dependent on the complexity of the trained model. One could improve performance by using approximate nearest neighbor methods to sort T, or by pre-computing a set of representative local models. We leave these implementation improvements as future work.

The offline stages of our algorithm (Secs. 3.1 and 3.2) require about 5 hours for datasets with M = 14,000, N = 2,000, d = 6,300, and D = 2,000. The run-time is dominated by the SVM evaluation of K × S candidate training neighborhoods on the N images, which could be performed in parallel. The compressed sensing formulation is quite valuable for efficiency here; if we were to instead naively train M independent classifiers, the offline run-time would be on the order of days.

We found that building category-label balance into the training and inference algorithms was crucial for good results when dealing with highly imbalanced datasets. Earlier versions of our method that ignored label balance would often predict neighborhoods with only the same label as the query. Local methods typically handle this by reverting to a nearest neighbor decision. However, as we will see below, this can be inferior to explicitly learning to identify a local and balanced neighborhood, which can be used to build a more sophisticated classifier (like an SVM).

Finally, while our training procedure designates a single neighborhood as the prediction target, it is determined by a necessarily limited sample of candidates (Sec. 3.1).
Our confidence ranking step accounts for the differences between those candidates that ultimately make the same label prediction. Nonetheless, the non-exhaustive training samples mean that slight variations on the target vectors may be equally good in practice. This suggests future extensions to explicitly represent "missing" entries in the indicator vector during training, or to employ some form of active learning.

4 Experiments

We validate our approach on an array of binary image classification tasks on public datasets.

Datasets  We consider two challenging datasets with visual attribute classification tasks. The SUN Attributes dataset [22] (SUN) contains 14,340 scene images labeled with binary attributes of various types (e.g., materials, functions, lighting). We use all images and randomly select 8 attribute categories. We use the 6,300-dimensional HOG 2×2 features provided by the authors, since they perform best for this dataset [22]. The aPascal training dataset [10] contains 6,440 object images labeled with attributes describing the objects' shapes, materials, and parts. We use all images and randomly select 6 attribute categories. We use the base features from [10], which include color, texture, edges, and HOG. We reduce their dimensionality to 200 using PCA. For both datasets, we treat each attribute as a separate binary classification task (C = 2).

Implementation Details  For each attribute, we compose a test set of 100 randomly chosen images (balanced between positives and negatives), and use all other images for T. This makes M = 14,240 for SUN and M = 6,340 for aPascal. We use N = 2,000 training neighborhoods for both, and set D = 2,000 for SUN and D = 1,000 for aPascal, roughly 15% of their original label indicator lengths. Generally higher values of D yield better accuracy (less compression), but at greater expense.
We fix the number of samples S = 100, and consider neighborhood sizes from k_1 = 50 to k_K = 500, in increments of 10 to 50.

Baselines and Setup  We compare to the following methods: (1) Global: for each test image, we apply the same global classifier trained with all M training images; (2) Local: for each test image, we apply a classifier trained with only its nearest neighbors, as measured with Euclidean distance on the image features. This baseline considers a series of k values, like our method, and independently selects the best k per test point according to the confidence of the resulting local classifiers (see Sec. 3.3). (3) Local+ML: same as Local, except the Euclidean distance is replaced with a learned metric. We apply the ITML metric learning algorithm [7] using the authors' public code.

Global represents the default classification approach, and lets us gauge to what extent the classification task requires local models at all (e.g., how multi-modal the dataset is). The two Local baselines represent the standard local learning approach [3, 5, 13, 15, 25, 29], in which proximal data points are used to train a model per test case, as discussed in Sec. 2. By using proximity instead of ŷ_q to define neighborhoods, they isolate the impact of our compressed sensing approach.

All results reported for our method and the Local baselines use the automatically selected k value per test image (cf. Sec. 3.3), unless otherwise noted. Each local method independently selects its best k value. All methods use the exact same image features and train linear SVMs, with the cost parameter cross-validated based on the Global baseline. To ensure the baselines do not suffer from the imbalanced data, we show results for the baselines using both balanced (B) and unbalanced (U) training sets.
For the balanced case, for Global we randomly downsample the negatives and average results over 10 such runs, and for Local we gather the nearest k/2 neighbors from each class.

SUN Results  The SUN attributes are quite challenging classification tasks. Images within the same attribute exhibit wide visual variety. For example, the attribute "eating" (see Fig. 1, top right) is positive for any image where annotators could envision eating occurring, spanning from a restaurant scene, to a home kitchen, to a person eating, to a banquet table close-up. Furthermore, the attribute may occupy only a portion of the image (e.g., "metal" might occupy any subset of the pixels). It is exactly this variety that we expect local learning may handle well.

Table 1 shows the results on SUN. Our method outperforms all baselines for all attributes. Global benefits from a balanced training set (B), but still underperforms our method (by 6 points on average). We attribute this to the high intra-class variability of the dataset. Most notably, conventional Local learning performs very poorly -- whether or not we enforce balance. (Recall that the test sets are always balanced, so chance is 0.50.) Adding metric learning to local (Local+ML) improves things only marginally, likely because the attributes are not consistently localized in the image. We also implemented a local metric learning baseline that clusters the training points then learns a metric per cluster, similar to [26, 28], then proceeds as Local+ML. Its results are similar to those of Local+ML (see Supp. file).

Attribute     Global       Local        Local+ML     Ours  ||  Local   Local+ML  Ours    Ours
              B     U      B     U      B     U            ||  k=400   k=400     k=400   Fix-k*
hiking        0.80  0.60   0.51  0.56   0.55  0.65   0.85  ||  0.53    0.53      0.89    0.89
eating        0.73  0.55   0.50  0.50   0.50  0.51   0.78  ||  0.50    0.50      0.79    0.82
exercise      0.59  0.69   0.50  0.53   0.50  0.53   0.74  ||  0.50    0.50      0.75    0.77
farming       0.77  0.56   0.51  0.54   0.52  0.57   0.83  ||  0.51    0.51      0.81    0.88
metal         0.64  0.57   0.50  0.50   0.50  0.51   0.67  ||  0.50    0.50      0.67    0.70
still water   0.70  0.54   0.51  0.53   0.51  0.52   0.76  ||  0.50    0.50      0.71    0.81
clouds        0.78  0.77   0.70  0.74   0.74  0.75   0.80  ||  0.74    0.65      0.79    0.84
sunny         0.67  0.60   0.65  0.67   0.62  0.60   0.73  ||  0.57    0.59      0.72    0.78

Table 1: Accuracy (% of correctly labeled images) for the SUN dataset. B and U refer to balanced and unbalanced training data, respectively. All local results to the left of the double line use k values automatically selected per method and per test instance; all those to the right use a fixed k for all queries. See text for details.

Figure 1: Example neighborhoods using visual similarity alone (Local) and compressed sensing inference (Ours) on SUN. For each attribute, we show a positive test image and its top 5 neighbors. Best viewed on pdf.

The results left of the double bar correspond to auto-selected k values per query, which averaged k = 106 with a standard deviation of 24 for our method; see Supp. file for per-attribute statistics.
The rightmost columns of Table 1 show results when we fix k for all the local methods for all queries, as is standard practice.2 Here too, our gain over Local is sizeable, assuring that Local is not at any disadvantage due to our k auto-selection procedure.

The rightmost column, Fix-k*, shows our results had we been able to choose the optimal fixed k (applied uniformly to all queries). Note this requires peeking at the test labels, and is something of an upper bound. It is useful, however, to isolate the quality of our neighborhood membership confidence estimates from the issue of automatically selecting the neighborhood size. We see there is room for improvement on the latter.

Our method is more expensive at test time than the Local baseline due to the compressed sensing reconstruction step (see Sec. 3.4). In an attempt to equalize that factor, we also ran an experiment where the Local method was allowed to check more candidate k values than our method. Specifically, it could generate as many (proximity-based) candidate neighborhoods at test time as would fit in the run-time required by our approach, where k ranges from 20 up to 6,000 in increments of 10. Preliminary tests, however, showed that this gave no accuracy improvement to the baseline. This indicates our method's higher computational overhead is warranted.

Despite its potential to handle intra-class variations, the Local baseline fails on SUN because the neighbors that look most similar are often negative, leading to near-chance accuracy. Even when we balance its local neighborhood by label, the positives it retrieves can be quite distant (e.g., see "exercise" in Fig. 1).
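The reconstruction step charged to our method above follows the compressed sensing multi-label framework of Hsu et al. [16]: the M-dimensional "good neighborhood" indicator vector is compressed by a random projection, regressors map image features to the compressed code, and at test time a sparse recovery routine yields per-instance membership confidences. The sketch below illustrates only the encode/decode core on a toy instance, using orthogonal matching pursuit for recovery and the true code in place of the regressors' output; the names and dimensions are our own assumptions, not the authors' implementation.

```python
import numpy as np

def omp(A, z, sparsity):
    """Orthogonal matching pursuit: greedily recover a sparse y with A @ y ~= z."""
    residual, support = z.astype(float).copy(), []
    for _ in range(sparsity):
        j = int(np.argmax(np.abs(A.T @ residual)))  # column most correlated with residual
        if j not in support:
            support.append(j)
        coef, *_ = np.linalg.lstsq(A[:, support], z, rcond=None)
        residual = z - A[:, support] @ coef         # re-fit on the support, update residual
    y = np.zeros(A.shape[1])
    y[support] = coef
    return y

# Toy instance: a k-sparse neighborhood indicator over M training points is
# compressed to d << M dimensions, then recovered from the compressed code.
rng = np.random.default_rng(1)
M, d, k = 200, 120, 5
A = rng.normal(size=(d, M)) / np.sqrt(d)            # random sensing matrix
y_true = np.zeros(M)
y_true[rng.choice(M, size=k, replace=False)] = 1.0  # the "good neighborhood"
z = A @ y_true                                      # stands in for the regressors' predicted code
y_hat = omp(A, z, sparsity=k)                       # entries of y_hat rank neighborhood membership
```

In this noiseless regime, with far more measurements than the sparsity level requires, OMP typically recovers the support exactly; in the full method the recovered confidences both rank training instances and inform the choice of neighborhood size.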
Our approach, on the other hand, combines locality with what it learned about useful neighbor combinations, attaining much better results. Altogether, our gains over both Local and Local+ML (20 points on average) support our central claim that learning what makes a good neighbor is not equivalent to learning what makes a good neighborhood.

2 We chose k = 400 based on the range where the Local baseline had best results.

               Global       Local        Local+ML          |      fixed k = 400       Ours
Attribute      B     U      B     U      B     U     Ours  | Local  Local+ML  Ours   Fix-k*
wing           0.69  0.76   0.58  0.67   0.59  0.67  0.71  | 0.53   0.50      0.66   0.78
wheel          0.84  0.86   0.61  0.71   0.62  0.69  0.78  | 0.63   0.54      0.74   0.81
plastic        0.67  0.71   0.50  0.60   0.64  0.54  0.50  | 0.50   0.50      0.54   0.67
cloth          0.72  0.74   0.70  0.67   0.72  0.68  0.72  | 0.65   0.69      0.64   0.77
furry          0.80  0.80   0.58  0.75   0.60  0.71  0.81  | 0.63   0.54      0.72   0.82
shiny          0.72  0.77   0.56  0.67   0.72  0.64  0.57  | 0.55   0.52      0.62   0.73

Table 2: Accuracy (% of correctly labeled images) for the aPascal dataset, formatted as in Table 1.

Figure 1 shows example test images and the top 5 images in the neighborhoods produced by both Local and our approach. We stress that while Local's neighbors are ranked based on visual similarity, our method's "neighborhood" uses visual similarity only to guide its sampling during training, then directly predicts which instances are useful. Thus, purer visual similarity in the retrieved examples is not necessarily optimal. We see that the most confident neighborhood members predicted by our method are more often positives.
Relying solely on visual similarity, Local can retrieve less informative instances (e.g., see "farming") that share global appearance but do not assist in capturing the class distribution. The attributes where the Local baseline is most successful, "sunny" and "clouds", seem to differ from the rest in that (i) they exhibit more consistent global image properties, and (ii) they have many more positives in the dataset (e.g., 2,416 positives for "sunny" vs. only 281 for "farming"). In fact, this scenario is exactly where one would expect traditional visual ranking for local learning to be adequate. Our method does well not only in such cases, but also where image nearness is not a good proxy for relevance to classifier construction.

aPascal Results  Table 2 shows the results on the aPascal dataset. Again we see a clear and consistent advantage of our approach compared to the conventional Local baselines, with an average accuracy gain of 10 points across all the Local variants. The addition of metric learning again provides a slight boost over Local, but is inferior to our method, again showing the importance of learning good neighborhoods. On average, the auto-selected k values for this dataset were 144 with a standard deviation of 20 for our method; see Supp. file for per-attribute statistics.

That said, on this dataset Global has a slight advantage over our method, by 2.7 points on average. We attribute Global's success on this dataset to two factors: the images have better spatial alignment (they are cropped to the boundaries of the object, as opposed to displaying a whole scene as in SUN), and each attribute exhibits lower visual diversity (they stem from just 20 object classes, as opposed to 707 scene classes in SUN). See Supp. file. For this data, training with all examples is most effective.
While this dataset yields a negative result for local learning on the whole, it is nonetheless a positive result for the proposed form of local learning, since we steadily outperform the standard Local baseline. Furthermore, in principle, our approach could match the accuracy of the Global method if we let K = M during training; in that case our method could learn that for certain queries, it is best to use all examples. This is a flexibility not offered by traditional local methods. However, due to run-time considerations, at the time of writing we have not yet verified this in practice.

5 Conclusions

We proposed a new form of lazy local learning that predicts at test time what training data is relevant for the classification task. Rather than rely solely on feature space proximity, our key insight is to learn to predict a useful neighborhood. Our results on two challenging image datasets show our method's advantages, particularly when categories are multi-modal and/or similar instances are difficult to match based on global feature distances alone. In future work, we plan to explore ways to exploit active learning during training neighborhood generation to reduce its costs. We will also pursue extensions to allow incremental additions to the labeled data without complete retraining.

Acknowledgements  We thank Ashish Kapoor for helpful discussions. This research is supported in part by NSF IIS-1065390.

References

[1] R. Agrawal, A. Gupta, Y. Prabhu, and M. Varma. Multi-label learning with millions of labels: Recommending advertiser bid phrases for web pages. In WWW, 2013.
[2] C. Atkeson, A. Moore, and S. Schaal. Locally weighted learning. AI Review, 1997.
[3] S. Banerjee, A. Dubey, J. Machchhar, and S. Chakrabarti. Efficient and accurate local learning for ranking. In SIGIR Wkshp, 2009.
[4] A. Bellet, A. Habrard, and M. Sebban.
A survey on metric learning for feature vectors and structured data. CoRR, abs/1306.6709, 2013.
[5] L. Bottou and V. Vapnik. Local learning algorithms. Neural Comp, 1992.
[6] L. Cai and T. Hofmann. Hierarchical document categorization with support vector machines. In CIKM, 2004.
[7] J. Davis, B. Kulis, P. Jain, S. Sra, and I. Dhillon. Information-theoretic metric learning. In ICML, 2007.
[8] C. Domeniconi and D. Gunopulos. Adaptive nearest neighbor classification using support vector machines. In NIPS, 2001.
[9] K. Duh and K. Kirchhoff. Learning to rank with partially-labeled data. In SIGIR, 2008.
[10] A. Farhadi, I. Endres, D. Hoiem, and D. Forsyth. Describing objects by their attributes. In CVPR, 2009.
[11] A. Frome, Y. Singer, and J. Malik. Image retrieval and classification using local distance functions. In NIPS, 2006.
[12] T. Gao and D. Koller. Discriminative learning of relaxed hierarchy for large-scale visual recognition. In ICCV, 2011.
[13] X. Geng, T. Liu, T. Qin, A. Arnold, H. Li, and H. Shum. Query dependent ranking using k-nearest neighbor. In SIGIR, 2008.
[14] B. Gong, K. Grauman, and F. Sha. Connecting the dots with landmarks: Discriminatively learning domain-invariant features for unsupervised domain adaptation. In ICML, 2013.
[15] T. Hastie and R. Tibshirani. Discriminant adaptive nearest neighbor classification. PAMI, 1996.
[16] D. Hsu, S. Kakade, J. Langford, and T. Zhang. Multi-label prediction via compressed sensing. In NIPS, 2009.
[17] J. Huang, A. Smola, A. Gretton, K. Borgwardt, and B. Scholkopf. Correcting sample selection bias by unlabeled data. In NIPS, 2007.
[18] J. Jiang and C. Zhai. Instance weighting for domain adaptation in NLP. In ACL, 2007.
[19] A. Kapoor, P. Jain, and R. Viswanathan. Multilabel classification using Bayesian compressed sensing. In NIPS, 2012.
[20] M. Lapin, M. Hein, and B. Schiele.
Learning using privileged information: SVM+ and weighted SVM. Neural Networks, 53, 2014.
[21] M. Marszalek and C. Schmid. Constructing category hierarchies for visual recognition. In ECCV, 2008.
[22] G. Patterson and J. Hays. SUN attribute database: Discovering, annotating, and recognizing scene attributes. In CVPR, 2012.
[23] B. Settles. Active Learning Literature Survey. Computer Sciences Technical Report 1648, University of Wisconsin-Madison, 2009.
[24] K. Veropoulos, C. Campbell, and N. Cristianini. Controlling the sensitivity of support vector machines. In IJCAI, 1999.
[25] P. Vincent and Y. Bengio. K-local hyperplane and convex distance nearest neighbor algorithms. In NIPS, 2001.
[26] K. Weinberger and L. Saul. Distance metric learning for large margin nearest neighbor classification. JMLR, 2009.
[27] X. Wu and R. Srihari. Incorporating prior knowledge with weighted margin support vector machines. In KDD, 2004.
[28] L. Yang, R. Jin, R. Sukthankar, and Y. Liu. An efficient algorithm for local distance metric learning. In AAAI, 2006.
[29] H. Zhang, A. Berg, M. Maire, and J. Malik. SVM-KNN: Discriminative nearest neighbor classification for visual category recognition. In CVPR, 2006.