{"title": "Transfer Learning by Borrowing Examples for Multiclass Object Detection", "book": "Advances in Neural Information Processing Systems", "page_first": 118, "page_last": 126, "abstract": "", "full_text": "Transfer Learning by Borrowing Examples\n\nfor Multiclass Object Detection\n\nJoseph J. Lim\nCSAIL, MIT\n\nlim@csail.mit.edu\n\nRuslan Salakhutdinov\n\nDepartment of Statistics, University of Toronto\n\nrsalakhu@utstat.toronto.edu\n\nAntonio Torralba\n\nCSAIL, MIT\n\ntorralba@csail.mit.edu\n\nAbstract\n\nDespite the recent trend of increasingly large datasets for object detection, there\nstill exist many classes with few training examples. To overcome this lack of train-\ning data for certain classes, we propose a novel way of augmenting the training\ndata for each class by borrowing and transforming examples from other classes.\nOur model learns which training instances from other classes to borrow and how\nto transform the borrowed examples so that they become more similar to instances\nfrom the target class. Our experimental results demonstrate that our new object\ndetector, with borrowed and transformed examples, improves upon the current\nstate-of-the-art detector on the challenging SUN09 object detection dataset.\n\nIntroduction\n\n1\nConsider building a sofa detector using a database of annotated images containing sofas and many\nother classes, as shown in Figure 1. One possibility would be to train the sofa detector using only\nthe sofa instances. However, this would result in somewhat poor performance due to the limited\nsize of the training set. An alternative is to build priors about the appearance of object categories\nand share information among object models of different classes. 
In most previous work, transfer of information between models takes place by imposing some regularization across model parameters. This is the standard approach both in the discriminative setting [1, 2, 3, 4, 5, 6, 7, 8] and in generative object models [9, 10, 11, 12, 13, 14].

In this paper, we propose a different approach to transferring information across object categories. Instead of building object models in which we enforce regularization across the model parameters, we propose to directly share training examples from similar categories. In the example from Figure 1, we can try to use training examples from other classes that are similar enough, for instance armchairs. We could simply add all the armchair examples to the sofa training set. However, not all instances of armchairs will look close enough to sofa examples to train an effective detector. Therefore, we propose a mechanism to select, among all training examples from other classes, the ones that are closest to the sofa class. We can increase the number of instances that we can borrow by applying various transformations (e.g., stretching armchair instances horizontally to look closer to sofas). The transformations will also depend on the viewpoint. For instance, a frontal view of an armchair looks like a compressed sofa, whereas the side views of an armchair and a sofa often look indistinguishable.

Our approach differs from techniques that generate new examples by perturbing examples from the same class (e.g., adding mirrored or rotated versions) [15]; rather, such techniques can be combined with ours. Our approach learns which classes to borrow from, which examples to borrow, and what the best transformation for each example is. Our work has similarities with three pieces of work on transfer

Figure 1: An illustration of training a sofa detector by borrowing examples from other related classes.
Our model can find (1) good examples to borrow, by learning a weight for each example, and (2) the best transformation for each training example, in order to increase borrowing flexibility. Transformed examples in the blue (or red) box are more similar to the sofa's frontal (or side) view. Transformed examples, selected according to their learned weights, are used to train the sofa detector together with the original sofa examples. (An X on an image indicates a low weight for borrowing.)

learning for object recognition. Miller et al. [9] propose a generative model for digits that shares transformations across classes. The generative model decomposes each model into an appearance model and a distribution over transformations that can be applied to the visual appearance to generate new samples. The set of transformations is shared across classes. In their work, the transfer of information is achieved by sharing parameters across the generative models and not by reusing training examples. The work by Fergus et al. [16] achieves transfer across classes by learning a regression from features to labels. Training examples from classes similar to the target class are assigned labels between +1 and −1. This is similar to borrowing training examples, but relaxing the confidence of the classification score for the borrowed examples. Wang et al. [17] assign rankings to similar examples, by enforcing the highest and lowest rankings for the original positive and negative examples, respectively, and requiring borrowed examples to lie somewhere in between. The latter two works rely on a pre-defined similarity metric (e.g., WordNet or aspect-based similarity) for deciding which classes to share with.
Our method, on the other hand, learns which classes to borrow from, as well as which examples to borrow within those classes, as part of the model learning process.

Borrowing training examples becomes effective when many categories are available. When there are few and distinct object classes, as in the PASCAL dataset [18], the improvement may be limited. However, a number of other efforts are under way for building large annotated image databases with many categories [19, 20, 21]. As the number of classes grows, the number of sets of classes with similar visual appearances (e.g., the set of truck, car, van, SUV, or chair, armchair, swivel chair, sofa) will increase, and the effectiveness of our approach will grow as well. In our experiments, we show that borrowing training examples from other classes improves performance over current state-of-the-art detectors trained on a single class. In addition, we also show that our technique can be used in a different but related task. In some cases, we are interested in merging multiple datasets in order to improve performance on a particular test set. We show that learning which examples to merge results in better performance than simply combining the two datasets.

2 Learning to Borrow Examples

Consider the challenging problem of detecting and localizing objects from a wide variety of categories such as cars, chairs, and trees.
Many current state-of-the-art object detection (and object recognition) systems use rather elaborate models, based on separate appearance and shape components, that can cope with changes in viewpoint, illumination, shape, and other visual properties. However, many of these systems [22, 23] detect objects by testing sub-windows and scoring corresponding image patches x with a linear function of the form y = β⊤Φ(x), where Φ(x) represents a vector of different image features, and β represents a vector of model parameters.

In this work, we focus on training detection systems for multiple object classes. Our goal is to develop a novel framework that enables borrowing examples from related classes for a generic object detector, making minimal assumptions about the type of classifier or image features used.

2.1 Loss Function for Borrowing Examples

Consider a classification problem where we observe a dataset D = {x_i, y_i}, i = 1, ..., n, of n labeled training examples. Each example belongs to one of C classes (e.g., 100 object classes), and each class c ∈ C = {1, ..., C} contains a set of n_c labeled examples. We let x_i ∈ ℝ^D denote the input feature vector of length D for training case i, and y_i be its corresponding class label. Suppose that we are also given a separate background class, containing b examples. We further assume a binary representation for class labels¹, i.e.
y_i ∈ C ∪ {−1}, indicating whether training example i belongs to one of the given C classes or to the "negative" background class².

For a standard binary classification problem, a commonly used approach is to minimize:

    min_{β_c} ( Σ_{i=1}^{n_c+b} Loss(β_c · x_i, sign(y_i)) + λ R(β_c) ),    (1)

where i ranges over the positive and negative examples of the target class c; β_c ∈ ℝ^D is the vector of unknown parameters, or regression coefficients, for class c; Loss(·) is the associated loss function; and R(·) is a regularization function for β.

Now, consider learning which other training examples from the entire dataset D our target class c could borrow. The key idea is to learn a vector of weights w^c of length n + b, such that each w^c_i represents a soft indicator of how much class c borrows from training example x_i. The soft indicator variables w^c_i range between 0 and 1, with 0 indicating borrowing nothing and 1 indicating borrowing the entire example as an additional training instance of class c. All true positive examples belonging to class c, with y_i = c, and all true negative examples belonging to the background class, with y_i = −1, will have w^c_i = 1, as they are used fully. The remaining training examples will have w^c_i between 0 and 1. Our proposed regularization model takes the following form:

    min_{β_c, w^{*,c}} ( Σ_{i=1}^{n+b} (1 − w^{*,c}_i) Loss(β_c · x_i, sign(y_i)) + λ R(β_c) + Ω_{λ1,λ2}(w^{*,c}) ),    (2)

subject to w^c_i = 1 for y_i = −1 or c, and 0 ≤ w^c_i ≤ 1 for all other i, where we defined³ w* = 1 − w, and where i ranges over all training examples in the dataset.
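As a concrete illustration of the data term in Eq (2), the sketch below evaluates the borrowing-weighted loss with numpy, using a hinge loss and a squared L2 regularizer as in our experiments. This is a hypothetical toy sketch, not the implementation used in our system (which builds on [22]); the names `weighted_data_term`, `X`, `w`, and `lam` are our own.

```python
import numpy as np

def weighted_data_term(beta, X, y, w, lam):
    """Data term of Eq (2): sum_i w_i * Loss(beta . x_i, sign(y_i)) + lam * R(beta),
    with a hinge loss and squared L2 regularizer. Here w_i = 1 - w*_i is the
    borrowing weight of example i (0 = not borrowed, 1 = fully borrowed)."""
    signs = np.where(y == -1, -1.0, 1.0)     # sign(y_i): -1 for background, +1 otherwise
    margins = signs * (X @ beta)             # signed scores beta . x_i
    hinge = np.maximum(0.0, 1.0 - margins)   # hinge loss per example
    return w @ hinge + lam * (beta @ beta)
```

With w_i fixed to 1 for class c and the background and 0 elsewhere, this reduces to the standard objective of Eq (1).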
We further define Ω(w*) as:

    Ω_{λ1,λ2}(w*) = λ1 Σ_{l∈C} √n_l ‖w*_(l)‖_2 + λ2 ‖w*‖_1,    (3)

where w*_(l) = (w*_{j1}, w*_{j2}, ..., w*_{jn_l}), with y_{jm} = l, represents the vector of weights for the examples of class l. Here, Ω(·) regularizes w^{*,c} using a sparse group lasso criterion [24]. Its first term can be viewed as an intermediate between the L1- and L2-type penalties. A pleasing property of L1-L2 regularization is that it performs variable selection at the group level. The second term of Ω(·) is an L1-norm, which keeps the sparsity of weights at the individual level.

The overall objective of Eq (2) and its corresponding regularizer Ω(·) have an intuitive interpretation. The regularization term encourages borrowing all examples as new training instances for the target class c. Indeed, setting the corresponding regularization parameters λ1 and λ2 to high enough values (i.e., forcing w to be an all-ones vector) would amount to borrowing all examples, which would result in learning a "generic" object detector. On the other hand, setting λ1 = λ2 = 0 recovers the original standard objective of Eq (1), without borrowing any examples. Figure 2b displays the learned w_i for 6547 instances to be borrowed by the truck class. Observe that classes with visual appearances similar to the target truck class (e.g., van, bus) have w_i close to 1 and are grouped together (compare with Figure 2a, which only uses an L1 norm).

¹ This is a standard "1 vs. all" classification setting.
² When learning a model for class c, all other classes can be considered as "negative" examples.
In this work, for clarity of presentation, we will simply assume that we are given a separate background class.
³ For clarity of presentation, throughout the rest of the paper, we will use the identity w* = 1 − w.

(a) Only with L1-norm    (b) Learned by Ω(·) without the Heaviside step function    (c) Learned by Ω(·) with the Heaviside step function

Figure 2: Learning to borrow for the target truck class: learned weights w^truck for 6547 instances using (a) an L1-norm; (b) Ω(·) regularization; and (c) Ω(·) with the symmetric borrowing constraint.

We would also like to point out an analogy between our model and various other transfer learning models that regularize the β parameter space [25, 26]. The general form, applied to our problem setting, is:

    Σ_{c∈C} min_{β_c} ( Σ_i Loss(β_c · x_i, sign(y_i)) + λ R(β_c) + γ ‖β_c − (1/C) Σ_{k=1}^C β_k‖²_2 ).    (4)

The model in Eq (4) regularizes all β_c to be close to a single mode, (1/C) Σ_k β_k. This can be further generalized so that β_c is regularized toward one of many modes, or "super-categories", as pursued in [27]. Contrary to previous work, our model from Eq (2) regularizes weights on all training examples, rather than parameters, across all categories. This allows us to directly learn both which examples and which categories we should borrow from. We also note that model performance could potentially be improved by introducing additional regularization across model parameters.

2.2 Learning

Solving our final optimization problem, Eq (2), for w and β jointly is a non-convex problem.
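For concreteness, the sparse group lasso penalty Ω(·) of Eq (3) can be sketched as below. This is our own illustrative numpy code, not part of any released implementation; `w_star` holds w* = 1 − w for the candidate examples, and `labels` holds each candidate's class label y_i.

```python
import numpy as np

def omega_penalty(w_star, labels, lam1, lam2):
    """Sparse group lasso penalty of Eq (3): the first term sums, over classes l,
    sqrt(n_l) * ||w*_(l)||_2 (group-level selection); the second term is a plain
    L1 norm over all entries (individual-level sparsity)."""
    group_term = 0.0
    for l in np.unique(labels):
        block = w_star[labels == l]               # w*_(l): entries for class l's examples
        group_term += np.sqrt(block.size) * np.linalg.norm(block)
    return lam1 * group_term + lam2 * np.abs(w_star).sum()
```

Since Ω penalizes w* = 1 − w, large λ1 and λ2 push w* toward zero, i.e., w toward all ones (borrow everything), while λ1 = λ2 = 0 removes any pressure to borrow.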
We therefore resort to an iterative algorithm, based on the fact that solving for β given w, and for w given β, are each convex problems. The algorithm iterates between (1) solving for β given w, based on [22], and (2) solving for w given β, using the block coordinate descent algorithm [28], until convergence. We initialize the model by setting w^c_i to 1 for y_i = c and y_i = −1, and to 0 for all other training examples. Given this initialization, the first iteration is equivalent to solving C separate binary classification problems of Eq (1), with no borrowing⁴.

Even though most irrelevant examples receive low borrowing indicator weights w_i, it is desirable to clean up these noisy examples. To this end, we introduce a symmetric borrowing constraint: if the car class does not borrow examples from the chair class, then we would also like the chair class not to borrow examples from the car class. To accomplish this, we multiply w^c_i by H(w̄^{y_i}_c − ε), where H(·) is the Heaviside step function. We note that w^c_i refers to the weight of example x_i to be borrowed by the target class c, whereas w̄^{y_i}_c refers to the average weight of the examples that class y_i borrows from the target class c. In other words, if the examples that class y_i borrows from class c have low weights on average (i.e., w̄^{y_i}_c < ε), then class c will not borrow example x_i, as this indicates that classes c and y_i may not be similar enough. The resulting weights after introducing this symmetric relationship are shown in Figure 2c.

3 Borrowing Transformed Examples

So far, we have assumed that each training example is borrowed as is. Here, we describe how we apply transformations to the candidate examples during the training phase.
This will allow us to borrow from a much richer set of categories, such as sofa-armchair, cushion-pillow, and car-van. We employ three different transformations: translation, scaling, and affine transformation.

Translation and scaling: Translation and scaling are naturally handled by existing detection systems during scoring. Scaling is resolved by scanning windows at multiple scales of the image, which typical sliding-window detectors already do. Translation is implemented by relaxing the location of the ground-truth bounding box B_i. Similar to the approach of Felzenszwalb et al. [22] for finding latent positive examples, we extract x_i from multiple boxes that have a significant overlap with B_i, and select the candidate example that has the smallest Loss(β_c · x_i, sign(y_i)).

⁴ In this paper, we iterate only once, as it was sufficient to borrow similar examples (see Figure 2).

Original Class | Without transformation            | With transformation
               | Borrowed Classes | AP improvement | Borrowed Classes | AP improvement
Truck          | car, van         | +7.14          | car, van         | +9.49
Shelves        | bookcase         | +0.17          | bookcase         | +4.73
Car            | truck, van       | +1.07          | truck, van, bus  | +1.78
Desk lamp      | ∅                | N/A            | floor lamp       | +0.30
Toilet         | ∅                | N/A            | sink, cup        | −0.68

Table 1: Learned borrowing relationships: most discovered relations are consistent with human subjective judgment. Classes that were borrowed only with transformations are shown in bold.

Affine transformation: We also change the aspect ratios of borrowed examples so that they look more alike (as in sofa-armchair and desk lamp-floor lamp).
Our method is to transform training examples to every canonical aspect ratio of the target class c, and find the best candidate for borrowing. The canonical aspect ratios can be determined by clustering the aspect ratios of all ground-truth bounding boxes [22], or based on viewpoints, provided we have labels for each viewpoint. Specifically, suppose that there is a candidate example x_i to be borrowed by the target class c, and that there are L canonical aspect ratios of c. We transform x_i into x^l_i by resizing one dimension, so that {x^l_i}, 0 ≤ l ≤ L, contains all L canonical aspect ratios of c (and x⁰_i = x_i). In order to ensure that only one candidate is generated from x_i, we select, for each i, the single transformed example x^l_i that minimizes Loss(β_c · x^l_i, sign(y_i)). Note that this final candidate can be re-selected during every training iteration, so that the best selection can change as the model is updated.

Figure 1 illustrates the kind of learning our model performs. To borrow examples for sofa, each example in the dataset is transformed into the frontal and side view aspect ratios of sofa. The transformed example that has the smallest Loss(·) is selected for borrowing. Each example is then assigned a borrowing weight using Eq (2). Finally, the new sofa detector is trained using the borrowed examples together with the original sofa examples. We refer to the detector trained without affine transformation as the borrowed-set detector, and to the one trained with affine transformation as the borrowed-transformed detector.

4 Experimental Results

We present experimental results on two standard datasets: the SUN09 dataset [21] and the PASCAL VOC 2007 challenge [18]. The SUN09 dataset contains 4,082 training images and 9,518 testing images.
We selected the top 100 object categories according to the number of training examples. These 100 object categories include a wide variety of classes, such as bed, car, stool, column, and flowers, and their distribution is heavy-tailed, varying from 1,356 down to 8 instances. The PASCAL dataset contains 2,051 training images and 5,011 testing images, belonging to 20 different categories. For both datasets, we use the PASCAL VOC 2008 evaluation protocol [18]. During the testing phase, in order to enable a direct comparison between various detectors, we measure the detection score of class c as the mean Average Precision (AP) score across all positive images that belong to class c and randomly sub-sampled negative images, so that the ratio between positive and negative examples remains the same across all classes.

Our experiments are based on one of the state-of-the-art detectors [22]. Following [22], we use a hinge loss for Loss(·) and a squared L2-norm for R(·) in Eq (2), where every detector contains two root components. There are four controllable parameters: λ, λ1, λ2, and ε (see Eq (2)). We used the same λ as in [22]; λ1 and λ2 were picked based on the validation set, and ε was set to 0.6. To reduce computation time, we threshold each weight w_i so that it is either 0 or 1.

We perform two kinds of experiments: (1) borrowing examples from other classes within the same dataset, and (2) borrowing examples from the same class in a different dataset. Both experiments require identifying which examples are beneficial to borrow for the target class.

4.1 Borrowing from Other Classes

We first tested our model's ability to identify a useful set of examples to borrow from other classes in order to improve detection quality on the SUN09 dataset. A unique feature of the SUN09 dataset is
A unique feature of the SUN09 dataset is\nthat all images were downloaded from the internet without making any effort to create a uniform\ndistribution over object classes. We argue that this represents a much more realistic setting, in which\nsome classes contain a lot of training data and many other classes contain little data.\n\n5\n\n\f(a) Shelves for Bookcase\n\n(b) Chair for Swivel chair\n\nFigure 3: Borrowing Weights: Examples are ranked by learned weights, w: (a) shelves examples to be\nborrowed by the bookcase class and (b) chair examples to be borrowed by the swivel chair class. Both show\nthat examples with higher w are more similar to the target class. (green: borrowed, red: not borrowed)\n\n(a) Number of examples\nbefore/after borrowing\n\n(b) Borrowed-set\nAP improvements\n\n(c) Borrowed-transformed\n\nAP improvements\n\nFigure 4: (a) Number of examples used for training per class before borrowing (blue) and after borrowing\n(red). Categories with fewer examples tend to borrow more examples. AP improvements (b) without and (c)\nwith transformations, compared to the single detector trained only with the original examples. Note that our\nmodel learned to borrow from (b) 28 classes, and (c) 37 classes.\n\nAmong 100 classes, our model learned that there are 28 and 37 classes that can borrow from other\nclasses without and with transformations, respectively. Table 1 shows some of the learned borrowing\nrelationships along with their improvements. Most are consistent with human subjective judgment.\nInterestingly, our model excluded bag, slot machine, \ufb02ag, and \ufb01sh, among others, from borrowing.\nMany of those objects have quite distinctive visual appearances compared to other object categories.\nFigure 3 shows borrowed examples along with their relative orders according to the borrowing in-\ndicator weights, wi. 
Note that our model learns quite reliable weights: for example, the chair examples in the green box are similar to the target swivel chair class, whereas the examples in the red box are either occluded or very atypical.

Figure 4 further displays AP improvements of the borrowed-set and borrowed-transformed detectors over standard single detectors. Observe that over 20 categories benefit, to varying degrees, from borrowing related examples. Among borrowed-transformed detectors, the categories with the largest improvements are truck (9.49), picture (7.54), bus (7.32), swivel chair (6.88), and bookcase (5.62). We note that all of these objects borrow visual appearance from other related frequent objects, including car, chair, and shelves. The five objects with the largest decrease in AP are plate (-3.53), fluorescent tube (-3.45), ball (-3.21), bed (-2.69), and microwave (-2.52). Model performance often deteriorates when our model discovers relationships that are not ideal (e.g., toilet borrowing cup and sink; plate borrowing mug).

Table 2 further breaks down borrowing rates as a function of the number of training examples, where the borrowing rate is defined as the ratio of the total number of borrowed examples to the number of original training examples. Observe that borrowing rates are much higher when there are fewer training examples (see also Figure 4a). On average, the borrowed-set detectors borrow 75% of the total number of original training examples, whereas the borrowed-transformed detectors borrow about twice as many examples, 149%.

Table 3 shows AP improvements of our methods. Borrowed-set detectors improve AP by 1.00 and borrowed-transformed detectors by 1.36. This is to be expected, as introducing transformations allows us to borrow from a much richer set of object classes.
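For reference, the borrowing rate reported in Table 2 is a simple per-class ratio averaged within each training-set-size bin. A minimal illustrative helper (our own, not from any released code; the bin edges mirror Table 2):

```python
def borrowing_rates_by_bin(n_original, n_borrowed,
                           bins=((1, 30), (31, 50), (51, 100),
                                 (101, 150), (151, float("inf")))):
    """Average borrowing rate (borrowed / original examples) per class,
    grouped by the number of original training examples, as in Table 2."""
    rates = {}
    for lo, hi in bins:
        cls = [i for i, n in enumerate(n_original) if lo <= n <= hi]
        if cls:  # skip bins with no classes
            rates[(lo, hi)] = sum(n_borrowed[i] / n_original[i] for i in cls) / len(cls)
    return rates
```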
(a) countertop    (b) swivel chair

Figure 5: Detection results on random images containing the target class. Only the most confident detection is shown per image. For clearer visualization, we do not show images where both detectors have large overlap. Our detectors (2nd/4th rows) show better localizations than single detectors (1st/3rd rows). (red: correct detection, yellow: false detection)

Number of Training Examples | 1-30 | 31-50 | 51-100 | 101-150 | >150 | All
Borrowed-set                | 1.69 | 0.48  | 0.43   | 0.48    | 0.13 | 0.75
Borrowed-transformed        | 2.75 | 2.57  | 0.94   | 0.81    | 0.17 | 1.49

Table 2: Borrowing rates for the borrowed-set and borrowed-transformed models. The borrowing rate is defined as the ratio of the number of borrowed examples to the number of original examples.

Methods              | Borrowed-set | All examples from the same classes | Borrowed-transformed
AP without borrowing | 14.99        | 16.59                              | 16.59
AP improvements      | +1.00        | +0.30                              | +1.36

Table 3: AP improvements of the borrowed-set and borrowed-transformed detectors. We also compare the borrowed-transformed method against a baseline approach that borrows all examples, without any selection, from the same classes our method borrows from. The second row shows the average AP score of the detectors without any borrowing over the classes used for borrowed-set or borrowed-transformed.

We also compare to a baseline approach, which uses all examples from the classes borrowed by the borrowed-transformed method.
For example, if class A borrows some examples from classes B and C using the borrowed-transformed method, then the baseline approach uses all examples from classes A, B, and C without any selection. Note that this baseline improves AP by only 0.30, compared to 1.36 for our method.

Finally, Figure 5 displays detection results. Single and borrowed-transformed detections are visualized on test images, chosen at random, that contain the target class. In many cases, transformed detectors are better at localizing the target object, even when they fail to place a bounding box around the full object. We also note that borrowing similar examples tends to introduce some confusion between related object categories. However, we argue that this type of failure is much more tolerable than that of the single detector, which often produces false detections of completely unrelated objects.

4.2 Borrowing from Other Datasets

Combining datasets is a non-trivial task, as different datasets contain different biases. Consider training a car detector that is going to be evaluated on the PASCAL dataset. The best training set for such a detector would be the dataset provided by the PASCAL challenge, as both the training and test sets come from the same underlying distribution. In order to improve model performance, a simple mechanism would be to add more training examples. For this, we could look for other datasets that contain annotated images of cars – for example, the SUN09 dataset. However, as the PASCAL and SUN09 datasets come with different biases, many of the training examples from SUN09 are not as effective for training when the detector is evaluated on the PASCAL dataset – a problem that was extensively studied by [29].
Here, we show that, instead of simply mixing the two datasets, our model can select a useful set of examples from SUN09 for the PASCAL dataset, and vice versa.

(a) (b) (c)

Figure 6: SUN09 borrowing PASCAL examples: (a) typical SUN09 car images, (b) typical PASCAL car images, (c) PASCAL car images sorted by learned borrowing weights. (c) shows that examples are sorted from canonical viewpoints (left) to atypical or occluded examples (right). (green: borrowed, red: not borrowed)

(a) Testing on the SUN09 dataset

       | SUN09 only | PASCAL only | SUN09+PASCAL | SUN09+borrow PASCAL
car    | 43.31      | 39.47       | 43.64        | 45.88
person | 45.46      | 28.78       | 46.46        | 46.90
sofa   | 12.96      | 11.97       | 12.86        | 15.25
chair  | 18.82      | 13.84       | 18.18        | 20.45
mean   | 30.14      | 23.51       | 30.29        | 32.12
Diff.  |            | -6.63       | +0.15        | +1.98

(b) Testing on the PASCAL 2007 dataset

       | PASCAL only | SUN09 only | PASCAL+SUN09 | PASCAL+borrow SUN09
car    | 49.58       | 40.81      | 49.91        | 51.00
person | 23.58       | 22.31      | 26.05        | 27.05
sofa   | 19.91       | 13.99      | 20.01        | 22.17
chair  | 14.23       | 14.20      | 19.06        | 18.55
mean   | 26.83       | 22.83      | 28.76        | 29.69
Diff.  |             | -4.00      | +1.93        | +2.86

Table 4: Borrowing from other datasets: AP scores of various detectors. "SUN09 only" and "PASCAL only" are trained using the SUN09 dataset [21] and the PASCAL dataset [18] without borrowing any examples. "SUN09+PASCAL" is trained using positive examples from both SUN09 and PASCAL, and negative examples from the target dataset. "PASCAL+borrow SUN09" and "SUN09+borrow PASCAL" borrow selected examples from the other dataset for each target dataset using our method. The last row (Diff.) shows AP improvements over the "standard" state-of-the-art detector trained on the target dataset (column 1).

Figure 6 shows the kind of borrowing our model performs. Figures 6a,b display typical car images from the SUN09 and PASCAL datasets.
Compared to SUN09, PASCAL images display a much wider variety of car types, with different viewpoints and occlusions. Figure 6c further shows the ranking of PASCAL examples by w^{SUN09 car}_i for i ∈ D_PASCAL. Observe that images with high w match the canonical representations of SUN09 images much better than images with low w.

Table 4 shows the performance of four detectors. Observe that detectors trained on the target dataset (column 1) outperform those trained on the other dataset (column 2). This shows that there exists a significant difference between the two datasets, which agrees with previous work [29]. Next, we tested detectors trained by simply combining positive examples from both datasets and using negative examples from the target dataset (column 3). On the SUN09 test set, the improvement was not significant, and on the PASCAL test set, we observed slight improvements. Detectors trained by our model (column 4) substantially outperformed single detectors as well as those trained by mixing the two datasets. The detectors in columns 1 and 2 were trained using the state-of-the-art algorithm [22].

5 Conclusion

In this paper we presented an effective method for transfer learning across object categories. The proposed approach consists of searching for similar object categories using a sparse group lasso framework, and borrowing examples that have similar visual appearances to the target class. We further demonstrated that our method, both with and without transformations, is able to find useful object instances to borrow, resulting in improved accuracy for multiclass object detection compared to the state-of-the-art detector trained only with the examples available for each class.

Acknowledgments: This work is funded by ONR MURI N000141010933, CAREER Award No. 0747120, NSERC, and an NSF Graduate Research Fellowship.

References

[1] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner.
Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, November 1998.

[2] S. Krempp, D. Geman, and Y. Amit. Sequential learning of reusable parts for object detection. Technical report, CS Johns Hopkins, 2002.

[3] A. Torralba, K. P. Murphy, and W. T. Freeman. Sharing features: efficient boosting procedures for multiclass object detection. In CVPR, 2004.

[4] E. Bart and S. Ullman. Cross-generalization: learning novel classes from a single example by feature replacement. In CVPR, 2005.

[5] A. Opelt, A. Pinz, and A. Zisserman. Incremental learning of object detectors using a visual shape alphabet. In CVPR, 2006.

[6] K. Levi, M. Fink, and Y. Weiss. Learning from a small number of training examples by exploiting object categories. In Workshop of Learning in Computer Vision, 2004.

[7] A. Quattoni, M. Collins, and T. Darrell. Transfer learning for image classification with sparse prototype representations. In CVPR, 2008.

[8] C. H. Lampert, H. Nickisch, and S. Harmeling. Learning to detect unseen object classes by between-class attribute transfer. In CVPR, 2009.

[9] E. Miller, N. Matsakis, and P. Viola. Learning from one example through shared densities on transforms. In CVPR, 2000.

[10] L. Fei-Fei, R. Fergus, and P. Perona. A Bayesian approach to unsupervised one-shot learning of object categories. In ICCV, 2003.

[11] L. Fei-Fei, R. Fergus, and P. Perona. Learning generative visual models from few training examples: an incremental Bayesian approach tested on 101 object categories. In IEEE Workshop on GMBV, 2004.

[12] E. Sudderth, A. Torralba, W. T. Freeman, and A. Willsky. Learning hierarchical models of scenes, objects, and parts. In ICCV, 2005.

[13] J. Sivic, B. C. Russell, A. Zisserman, W. T. Freeman, and A. A. Efros. Unsupervised discovery of visual object class hierarchies.
In CVPR, 2008.

[14] E. Bart, I. Porteous, P. Perona, and M. Welling. Unsupervised learning of visual taxonomies. In CVPR, 2008.

[15] D. M. Gavrila and J. Giebel. Virtual sample generation for template-based shape matching. In CVPR, 2001.

[16] R. Fergus, H. Bernal, Y. Weiss, and A. Torralba. Semantic label sharing for learning with many categories. In ECCV, 2010.

[17] G. Wang, D. Forsyth, and D. Hoiem. Comparative object similarity for improved recognition with few or no examples. In CVPR, 2010.

[18] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes (VOC) challenge. International Journal of Computer Vision, 88(2):303-338, June 2010.

[19] B. C. Russell, A. Torralba, K. P. Murphy, and W. T. Freeman. LabelMe: a database and web-based tool for image annotation. IJCV, 77(1-3):157-173, 2008.

[20] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: a large-scale hierarchical image database. In CVPR, 2009.

[21] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba. SUN database: large-scale scene recognition from abbey to zoo. In CVPR, 2010.

[22] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. TPAMI, 32(9):1627-1645, 2010.

[23] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.

[24] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B, 68:49-67, 2006.

[25] T. Evgeniou and M. Pontil. Regularized multi-task learning. In ACM SIGKDD, 2004.

[26] T. Tommasi, F. Orabona, and B. Caputo. Safety in numbers: learning categories from few examples with multi model knowledge transfer. In CVPR, 2011.

[27] R. Salakhutdinov, A. Torralba, and J. Tenenbaum.
Learning to share visual appearance for multiclass object detection. In CVPR, 2011.

[28] J. Friedman, T. Hastie, and R. Tibshirani. A note on the group lasso and a sparse group lasso. Technical report, Department of Statistics, Stanford University, 2010.

[29] A. Torralba and A. Efros. Unbiased look at dataset bias. In CVPR, 2011.