{"title": "Reshaping Visual Datasets for Domain Adaptation", "book": "Advances in Neural Information Processing Systems", "page_first": 1286, "page_last": 1294, "abstract": "In visual recognition problems, the common data distribution mismatches between training and testing make domain adaptation essential. However, image data is difficult to manually divide into the discrete domains required by adaptation algorithms, and the standard practice of equating datasets with domains is a weak proxy for all the real conditions that alter the statistics in complex ways (lighting, pose, background, resolution, etc.) We propose an approach to automatically discover latent domains in image or video datasets. Our formulation imposes two key properties on domains: maximum distinctiveness and maximum learnability. By maximum distinctiveness, we require the underlying distributions of the identified domains to be different from each other; by maximum learnability, we ensure that a strong discriminative model can be learned from the domain. We devise a nonparametric representation and efficient optimization procedure for distinctiveness, which, when coupled with our learnability constraint, can successfully discover domains among both training and test data. We extensively evaluate our approach on object recognition and human activity recognition tasks.", "full_text": "Reshaping Visual Datasets for Domain Adaptation\n\nBoqing Gong\n\nU. of Southern California\nLos Angeles, CA 90089\nboqinggo@usc.edu\n\nKristen Grauman\nU. of Texas at Austin\n\nAustin, TX 78701\n\ngrauman@cs.utexas.edu\n\nFei Sha\n\nU. of Southern California\nLos Angeles, CA 90089\nfeisha@usc.edu\n\nAbstract\n\nIn visual recognition problems, the common data distribution mismatches between\ntraining and testing make domain adaptation essential. 
However, image data is difficult to divide manually into the discrete domains required by adaptation algorithms, and the standard practice of equating datasets with domains is a weak proxy for all the real conditions that alter the statistics in complex ways (lighting, pose, background, resolution, etc.). We propose an approach to automatically discover latent domains in image or video datasets. Our formulation imposes two key properties on domains: maximum distinctiveness and maximum learnability. By maximum distinctiveness, we require the underlying distributions of the identified domains to be different from each other to the maximum extent; by maximum learnability, we ensure that a strong discriminative model can be learned from each domain. We devise a nonparametric formulation and an efficient optimization procedure that can successfully discover domains among both training and test data. We extensively evaluate our approach on object recognition and human activity recognition tasks.

1 Introduction

A domain refers to an underlying data distribution. Generally, there are two: the one on which classifiers are trained, and the other to which classifiers are applied. While many learning algorithms assume the two are the same, in real-world applications the distributions are often mismatched, causing significant performance degradation when the classifiers are applied. Domain adaptation techniques are thus crucial for building classifiers that remain robust to new and unexpected target environments. As such, the subject has been intensively studied in computer vision [1, 2, 3, 4], speech and language processing [5, 6], and statistics and learning [7, 8, 9, 10].
While domain adaptation research largely focuses on how adaptation should proceed, there are also vital questions concerning the domains themselves: what exactly is a domain composed of, and how are domains different from each other? 
For some applications, the answers come naturally. For example, in speech recognition, we can organize data into speaker-specific domains where each domain contains a single speaker's utterances. In language processing, we can organize text data into language-specific domains. For those types of data, we can neatly categorize each instance with a discrete set of semantically meaningful properties; a domain is thus naturally composed of instances sharing the same (subset of) properties.
For visual recognition, however, the same is not possible. In addition to large intra-category appearance variations, images and video of objects (or scenes, attributes, activities, etc.) are also significantly affected by many extraneous factors such as pose, illumination, occlusion, camera resolution, and background. Many of these factors simply do not naturally lend themselves to deriving discrete domains. Furthermore, the factors overlap and interact in images in complex ways. In fact, even coming up with a comprehensive set of such properties is a daunting task in its own right, not to mention automatically detecting them in images!
Partially due to these conceptual and practical constraints, datasets for visual recognition are not deliberately collected with clearly identifiable domains [11, 12, 13, 14, 15]. Instead, standard image/video collection is a product of trying to ensure coverage of the target category labels on one hand, and managing resource availability on the other. As a result, a troubling practice in visual domain adaptation research is to equate datasets with domains and study the problem of cross-dataset generalization or correcting dataset bias [16, 17, 18, 19].
One pitfall of this ad hoc practice is that a dataset can be an agglomeration of several distinctive domains. Thus, modeling the dataset as a single domain would necessarily blend the distinctions, potentially damaging visual discrimination. 
Consider the following human action recognition task, which we also study empirically in this work. Suppose we have a training set containing videos of multiple subjects taken at view angles of 30° and 90°. Unaware of the distinction between these two views, a model that treats the training set as a single training domain must account for both inter-subject and inter-view variations. Presumably, applying this model to recognize videos taken at a view angle of 45° (i.e., from the test domain) would be less effective than applying models that account for the two view angles separately, i.e., models of inter-subject variations only.
How can we avoid such pitfalls? More specifically, how can we form characteristic domains, without resorting to the hopeless task of manually defining properties along which to organize them? We propose novel learning methods to automatically reshape datasets into domains. This is a challenging unsupervised learning problem. On the surface, we are not given any information about the domains that the datasets contain, such as the statistical properties of the domains, or even the number of domains. Furthermore, the challenge cannot be construed as a traditional clustering problem; simply clustering images by their appearance is prone to reshaping datasets into per-category domains, as observed in [20] and in our own empirical studies. Moreover, there may be many complex factors behind the domains, making it difficult to model them with the parametric mixture models on which traditional clustering algorithms (e.g., K-means or Gaussian mixtures) are based.
Our key insights are two axiomatic properties that latent domains should possess: maximum distinctiveness and maximum learnability. By maximum distinctiveness, we identify domains that are maximally different in distribution from each other. 
This ensures the domains are characteristic in terms of their large inter-domain variations. By maximum learnability, we identify domains from which we can derive strong discriminative models to apply to new testing data.
In section 2, we describe our learning methods for extracting domains with these desirable properties. We derive nonparametric approaches to measure domain discrepancies and show how to optimize them to arrive at maximum distinctiveness. We also show how to achieve maximum learnability by monitoring an extracted domain's discriminative learning performance, and we use that property to automatically choose the number of latent domains. To the best of our knowledge, [20] is the first and only prior work addressing latent domain discovery. We postpone a detailed discussion of and comparison to their method until section 3, after we have described our own.
In section 4, we demonstrate the effectiveness of our approach on several domain adaptation tasks for object recognition and human activity recognition. We show that we achieve far better classification results using adapted classifiers learned on the discovered domains. We conclude in section 5.

2 Proposed approach

We assume that we have access to one or more annotated datasets with a total of M data instances. The data instances are of the form (x_m, y_m), where x_m \in R^D is the feature vector and y_m \in [C] the corresponding label out of C categories. Moreover, we assume that each data instance comes from a latent domain z_m \in [K], where K is the number of domains.
In what follows, we start by describing our algorithm for inferring z_m assuming K is known. Then we describe how to infer K from the data.

2.1 Maximally distinctive domains

Given K, we denote the distribution of the unknown domain D_k by P_k(x, y) for k \in [K]. We do not impose any parametric form on P_k(\cdot, \cdot). 
Instead, the marginal distribution P_k(x) is approximated by the empirical distribution

\hat{P}_k(x) = \frac{1}{M_k} \sum_m \delta_{x_m} z_{mk},

where M_k is the number of data instances assigned to domain k and \delta_{x_m} is an atom at x_m. z_{mk} \in \{0, 1\} is a binary indicator variable that takes the value 1 when z_m = k. Note that M_k = \sum_m z_{mk} and \sum_k M_k = M.

What kind of properties do we expect from \hat{P}_k(x)? Intuitively, we would like any two different domains \hat{P}_k(x) and \hat{P}_{k'}(x) to be as distinctive as possible. In the context of modeling visual data, this implies that intra-class variations between domains are often far more pronounced than inter-class variations within the same domain. As a concrete example, consider the task of differentiating commercial jetliners from fighter jets. While the two categories are easily distinguishable when viewed from the same pose (frontal view, side view, etc.), there is a significant change in appearance when either category undergoes a pose change. Clearly, defining domains by simply clustering the images by appearance is insufficient; the inter-category and inter-pose variations will both contribute to the clustering procedure and may lead to unreasonable clusters. Instead, to identify characteristic domains, we need to look for divisions of the data that yield maximally distinctive distributions.
To quantify this intuition, we need a way to measure the difference in distributions. To this end, we apply a kernel-based method to examine whether two samples are from the same distribution [21]. Concretely, let k(\cdot, \cdot) denote a characteristic positive semidefinite kernel (such as the Gaussian kernel). 
We compute the difference between the means of the two empirical distributions in the reproducing kernel Hilbert space (RKHS) H induced by the kernel function,

d(k, k') = \left\| \frac{1}{M_k} \sum_{m=1}^{M} k(\cdot, x_m) z_{mk} - \frac{1}{M_{k'}} \sum_{m=1}^{M} k(\cdot, x_m) z_{mk'} \right\|_H^2    (1)

where k(\cdot, x_m) is the image (or kernel-induced feature) of x_m under the kernel. As the number of samples tends to infinity, this measure approaches zero if and only if the two domains are the same, P_k = P_{k'}. We define the total domain distinctiveness (TDD) as the sum of this quantity over all possible pairs of domains:

TDD(K) = \sum_{k \neq k'} d(k, k'),    (2)

and choose domain assignments z_m such that TDD is maximized. We first discuss this optimization problem in its native formulation as an integer program, followed by a more computationally convenient continuous relaxation.

Optimization  In addition to the binary constraints on z_{mk}, we also enforce

\sum_{k=1}^{K} z_{mk} = 1, \forall m \in [M],  and  \frac{1}{M_k} \sum_{m=1}^{M} z_{mk} y_{mc} = \frac{1}{M} \sum_{m=1}^{M} y_{mc}, \forall c \in [C], k \in [K]    (3)

where y_{mc} is a binary indicator variable, taking the value 1 if y_m = c.
The first constraint stipulates that every instance is assigned to one and only one domain. The second constraint, which we refer to as the label prior constraint (LPC), requires that within each domain the class labels are distributed according to the prior distribution (of the labels), estimated empirically from the labeled data.
LPC does not restrict the absolute numbers of instances of different labels in each domain. 
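For concreteness, the squared RKHS distance of eq. (1) and the TDD objective of eq. (2) can be computed directly from a kernel matrix and a hard domain assignment. The sketch below is illustrative only; the function names and the toy Gaussian-kernel helper are ours, not part of any released implementation:

```python
import numpy as np

def gaussian_kernel(X, bandwidth):
    # Pairwise squared distances, then the Gaussian (RBF) kernel matrix.
    sq = np.sum(X**2, axis=1, keepdims=True)
    d2 = sq + sq.T - 2 * X @ X.T
    return np.exp(-d2 / (2 * bandwidth**2))

def domain_discrepancy(K, z, k1, k2):
    """Eq. (1): squared RKHS distance between the kernel means of the
    instances assigned to domains k1 and k2.
    K : (M, M) kernel matrix; z : (M,) hard domain assignments."""
    a = (z == k1).astype(float)
    b = (z == k2).astype(float)
    a /= a.sum()              # 1/M_k weights on domain k1's instances
    b /= b.sum()
    w = a - b
    return float(w @ K @ w)   # ||mu_k1 - mu_k2||_H^2

def total_domain_distinctiveness(K, z):
    """Eq. (2): sum of d(k, k') over all ordered pairs k != k'."""
    ks = np.unique(z)
    return sum(domain_discrepancy(K, z, i, j) for i in ks for j in ks if i != j)
```

With a characteristic kernel such as the Gaussian, d(k, k') vanishes in the limit only when the two empirical distributions coincide, which is what makes TDD a meaningful distinctiveness measure.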
It only reflects the intuition that in the process of data collection, the relative percentages of different classes are approximately in accordance with a prior distribution that is independent of domains. For example, in action recognition, if the "walking" category occurs relatively frequently in a domain corresponding to brightly lit video, we also expect it to be frequent in the darker videos. Thus, when data instances are re-arranged into latent domains, the same percentages are likely to be preserved.
The optimization problem is NP-hard due to the integer constraints. In the following, we relax it into a continuous optimization, which is more accessible with off-the-shelf optimization packages.

Relaxation  We introduce new variables \beta_{mk} = z_{mk}/M_k, and relax them to live on the simplex

\beta_k = (\beta_{1k}, \cdots, \beta_{Mk})^T \in \Delta = \{ \beta_k : \beta_{mk} \geq 0, \sum_{m=1}^{M} \beta_{mk} = 1 \}

for k = 1, \cdots, K. With the new variables, our optimization problem becomes

max_\beta  TDD(K) = \sum_{k \neq k'} (\beta_k - \beta_{k'})^T K (\beta_k - \beta_{k'})    (4)
s.t.  1/M \leq \sum_k \beta_{mk} \leq 1/C,  m = 1, 2, \cdots, M,
      \frac{(1 - \delta)}{M} \sum_m y_{mc} \leq \sum_m \beta_{mk} y_{mc} \leq \frac{(1 + \delta)}{M} \sum_m y_{mc},  c = 1, \cdots, C,  k = 1, \cdots, K,    (5)

where K is the M × M kernel matrix. The first constraint stems from the (default) requirement that every domain should have at least one instance per category, namely M_k \geq C, and that every domain should have at most M instances (M_k \leq M). The second constraint is a relaxed version of the LPC, allowing a small deviation from the prior distribution by setting \delta = 1%. 
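As an illustration, the relaxed objective (4) and the constraints (5) can be evaluated for a candidate \beta as follows. This is a sketch only; in our experiments the actual maximization is handed to an off-the-shelf solver, and the helper names below are hypothetical:

```python
import numpy as np

def relaxed_tdd(beta, K):
    """Objective (4): sum over pairs k != k' of (b_k - b_k')^T K (b_k - b_k').
    beta : (M, num_domains), the continuous relaxation of z_mk / M_k."""
    total = 0.0
    num_domains = beta.shape[1]
    for k in range(num_domains):
        for kp in range(num_domains):
            if k != kp:
                d = beta[:, k] - beta[:, kp]
                total += d @ K @ d
    return total

def satisfies_constraints(beta, y, C, delta=0.01):
    """Check constraints (5): per-instance row sums in [1/M, 1/C],
    and each domain's label mass within a (1 +/- delta) band of the prior."""
    M = beta.shape[0]
    rows = beta.sum(axis=1)
    if not np.all((rows >= 1.0 / M - 1e-9) & (rows <= 1.0 / C + 1e-9)):
        return False
    for c in range(C):
        prior = np.mean(y == c)           # (1/M) * sum_m y_mc
        for k in range(beta.shape[1]):
            mass = beta[y == c, k].sum()  # sum_m beta_mk * y_mc
            if not ((1 - delta) * prior - 1e-9 <= mass <= (1 + delta) * prior + 1e-9):
                return False
    return True
```

A hard assignment z with \beta_{mk} = z_{mk}/M_k is always feasible when each domain respects the label prior exactly, so the relaxation strictly enlarges the search space.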
We assign x_m to the domain k for which \beta_{mk} is the maximum of \beta_{m1}, \cdots, \beta_{mK}.
This relaxed problem is the maximization of a convex quadratic function subject to linear constraints. Though in general still NP-hard, this type of optimization problem has been studied extensively, and we have found that existing solvers are adequate for yielding satisfactory solutions.

2.2 Maximally learnable domains: determining the number of domains

Given M instances, how many domains hide inside? Note that the total domain distinctiveness TDD(K) increases as K increases — presumably, in the extreme case, each domain has only a few instances and their distributions would be maximally different from each other. However, such tiny domains would offer insufficient data to separate the categories of interest reliably.
To infer the optimal K, we appeal to maximum learnability, another desirable property we impose on the identified domains. Specifically, for any identified domain, we would like the data instances it contains to be adequate to build a strong classifier for labeled data — failing to do so would cripple the domain's adaptability to new test data.
Following this line of reasoning, we propose domain-wise cross-validation (DWCV) to identify the optimal K. DWCV consists of the following steps. First, starting from K = 2, we use the method described in the previous section to identify K domains. Second, for each identified domain, we build discriminative classifiers using the label information and evaluate them with cross-validation. Denoting the cross-validation accuracy for the k-th domain by A_k, we combine all the accuracies with a weighted sum

A(K) = \frac{1}{M} \sum_{k=1}^{K} M_k A_k.

For very large K, such that each domain contains only a few examples, A(K) approaches the classification accuracy obtained by classifying with the class prior probability alone. 
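As an illustration of the DWCV score just defined, the following sketch computes A(K) with a leave-one-out nearest-class-centroid classifier standing in for the per-domain discriminative model. The stand-in classifier and the function names are our simplifications, not the exact models used in the experiments:

```python
import numpy as np

def domain_cv_accuracy(X, y):
    """Leave-one-out accuracy of a nearest-class-centroid classifier,
    a stand-in for the discriminative model trained within one domain."""
    correct = 0
    for i in range(len(y)):
        mask = np.arange(len(y)) != i
        Xtr, ytr = X[mask], y[mask]
        classes = np.unique(ytr)
        # class centroids computed without the held-out point
        cents = np.array([Xtr[ytr == c].mean(axis=0) for c in classes])
        pred = classes[np.argmin(np.linalg.norm(cents - X[i], axis=1))]
        correct += (pred == y[i])
    return correct / len(y)

def dwcv_score(X, y, z):
    """A(K) = (1/M) * sum_k M_k * A_k over the identified domains z."""
    M = len(y)
    score = 0.0
    for k in np.unique(z):
        idx = (z == k)
        score += idx.sum() * domain_cv_accuracy(X[idx], y[idx])
    return score / M
```

Running this for K = 2, 3, ... and keeping the K with the highest score is exactly the selection rule described next.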
Thus, starting at K = 2 (and assuming A(2) is greater than the prior probability's classification accuracy), we choose K* as the value that attains the highest cross-validation accuracy: K* = arg max_K A(K). For N-fold cross-validation, a practical bound for the largest K we need to examine is K_max \leq min{M/(NC), C}; beyond this bound, cross-validation ceases to be meaningful.

3 Related work

Domain adaptation is a fundamental research subject in statistical machine learning [9, 22, 23, 10], and is also extensively studied in speech and language processing [5, 6, 8] and computer vision [1, 2, 3, 4, 24, 25]. Most of these approaches are validated by adapting between datasets, which, as discussed above, do not necessarily correspond to well-defined domains.
In our previous work, we proposed to identify landmark data points in the source domain which are distributed similarly to the target domain [26]. While that approach also slices the training set, it differs in its objective: we discover the underlying domains of the training datasets, each of which will be adaptable, whereas the landmarks in [26] are intentionally biased towards the single given target domain. Hoffman et al.'s work [20] is the most relevant to ours. They also aim at discovering latent domains from datasets, by modeling the data with a hierarchical distribution consisting of Gaussian mixtures. However, their explicit form of distribution may not be easily satisfied by real data. In contrast, we appeal to nonparametric methods, overcoming this limitation without assuming any form of distribution. In addition, we examine the new scenario where the test set is also composed of heterogeneous domains.
A generalized clustering approach by Jegelka et al. [27] shares the idea of a maximum distinctiveness (or "discriminability," as termed in [27]) criterion with our approach. 
However, their focus is the setting of unsupervised clustering, whereas ours is domain discovery. As such, they adopt a regularization term different from ours, which exploits the labels in the datasets.
Multi-domain adaptation methods suppose that multiple source domains are given as input, and the learner must adapt from (some of) them to do well in testing on a novel target domain [28, 29, 10]. In contrast, in the problem we tackle, the division of data into domains is not given — our algorithm must discover the latent domains. After our approach slices the training data into multiple domains, it is natural to apply multi-domain techniques to achieve good performance on a test domain. We present some related experiments in the next section.

4 Experimental Results

We validate our approach on visual object recognition and human activity recognition tasks. We first describe our experimental settings, and then report the results of identifying latent domains and of using the identified domains to adapt classifiers to a new mono-domain test set. After that, we report experimental results on reshaping heterogeneous test datasets into domains matching the identified training domains. Finally, we give some qualitative analyses and details on choosing the number of domains.

4.1 Experimental setting

Data  For object recognition, we use images from Caltech-256 (C) [14] and the image datasets of Amazon (A), DSLR (D), and Webcam (W) provided by Saenko et al. [2]. There are 10 categories common to all 4 datasets. These images mainly differ in their collection sources: Caltech-256 was collected from webpages on the Internet, Amazon images from amazon.com, and DSLR and Webcam images from an office environment. We represent images with bag-of-visual-words descriptors, following previous work on domain adaptation [2, 4]. 
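Concretely, a bag-of-visual-words descriptor is computed by quantizing an image's local features against a learned codebook and histogramming the resulting word indices. A minimal sketch of that quantization step (illustrative numpy only; the helper name is ours, and the codebook is taken as given):

```python
import numpy as np

def bow_histogram(descriptors, codebook):
    """Quantize local descriptors (e.g., SURF) against a K-means codebook
    and return an L1-normalized bag-of-visual-words histogram."""
    # squared distance from every descriptor to every codeword
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    words = d2.argmin(axis=1)            # nearest codeword per descriptor
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / hist.sum()
```

The experiments below instantiate this with SURF descriptors and an 800-word codebook, yielding one 800-bin histogram per image.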
In particular, we extract SURF [30] features from the images, use K-means to build a codebook of 800 clusters, and finally obtain an 800-bin histogram for each image.
For action recognition from video sequences, we use the IXMAS multi-view action dataset [15]. There are five views (Camera 0, 1, ..., 4) of eleven actions in the dataset. Each action is performed three times by twelve actors and is captured by the five cameras. We keep the first five actions performed by alba, andreas, daniel, hedlena, julien, and nicolas, so that the irregularly performed actions [15] are excluded. In each view, 20 sequences are randomly selected per actor per action. We use shape-flow descriptors to characterize the motion of the actions [31].

Evaluation strategy  The four image datasets are commonly used as distinctive domains in visual domain adaptation research [2, 3, 4, 32]. Likewise, each view in the IXMAS dataset is often taken as a domain in action recognition [33, 34, 35, 24]. Similarly, in our experiments, we use a subset of these datasets (views) as source domains for training classifiers and the rest of the datasets (views) as target domains for testing. However, the key difference is that we do not compare the performance of different adaptation algorithms that assume domains are already given. 
Instead, we evaluate the effectiveness of our approach by investigating whether its automatically identified domains improve adaptation, that is, whether recognition accuracy on the target domains can be improved by reshaping the datasets into their latent source domains.

Table 1: Oracle recognition accuracy on target domains by adapting original or identified domains

S (sources)  | A, C  | D, W  | C, D, W | Cam 0, 1    | Cam 2, 3, 4
T (targets)  | D, W  | A, C  | A       | Cam 2, 3, 4 | Cam 0, 1
GORIG        | 41.0  | 32.6  | 41.8    | 44.6        | 47.1
GOTHER [20]  | 33.7  | 39.5  | 34.6    | 43.9        | 45.1
GOURS        | 42.6  | 35.5  | 44.6    | 47.3        | 50.3

Table 2: Adaptation recognition accuracies, using original and identified domains with different multi-source adaptation methods (same source/target columns as Table 1)

Latent domains | Multi-DA method | A, C → D, W | D, W → A, C | C, D, W → A | Cam 0, 1 → Cam 2, 3, 4 | Cam 2, 3, 4 → Cam 0, 1
ORIGINAL       | UNION           | 35.8        | 41.7        | 41.0        | 45.1                   | 47.8
[20]           | ENSEMBLE        | 31.7        | 34.4        | 38.9        | 43.3                   | 29.6
[20]           | MATCHING        | 34.0        | 39.6        | 34.6        | 43.2                   | 45.2
OURS           | ENSEMBLE        | 35.8        | 38.7        | 42.8        | 45.0                   | 40.5
OURS           | MATCHING        | 42.6        | 35.5        | 44.6        | 47.3                   | 50.3

We use the geodesic flow kernel for adapting classifiers [4]. To use the kernel-based method for computing distribution differences, we use Gaussian kernels, cf. section 2. We set the kernel bandwidth to twice the median distance over all pairwise data points. The number of latent domains K is determined by the DWCV procedure (cf. section 2.2).

4.2 Identifying latent domains from training datasets

Notation  Let S = {S_1, S_2, ..., S_J} denote the J datasets we will be using as training source datasets and let T = {T_1, T_2, ..., T_L} denote the L datasets we will be using as testing target datasets. Furthermore, let K denote the optimal number of domains discovered by our DWCV procedure and U = {U_1, U_2, ..., U_K} the K hidden domains identified by our approach. 
Let r(A → B) denote the recognition accuracy on the target domain B with A as the source domain.

Goodness of the identified domains  We examine whether {U_k} is a set of good domains by computing the expected best possible accuracy of using the identified domains separately for adaptation:

GOURS = E_{B \sim P} \max_k r(U_k, B) \approx \frac{1}{L} \sum_l \max_k r(U_k \to T_l),    (6)

where B is a target domain drawn from a distribution P over domains. Since this distribution is not obtainable, we approximate the expectation with the empirical average over the observed testing datasets {T_l}. Likewise, we define GORIG, where we compute the best possible accuracy for the original domains {S_j}, and GOTHER, where we compute the same quantity for a competing method for identifying latent domains, proposed in [20]. Note that the max operation requires that the target domains be annotated; thus the accuracies are the most optimistic estimates for all methods, and upper bounds for practical algorithms.
Table 1 reports the three quantities on different pairs of source and target domains. Clearly, our method yields a better set of identified domains, which are always better than the original datasets. We also experimented with using K-means or random partitioning for clustering data instances into domains. Neither yields competitive performance, and the results are omitted here for brevity.

Practical utility of identified domains  In practical applications of domain adaptation algorithms, however, the target domains are not annotated. The oracle accuracies reported in Table 1 are thus not achievable in general. In the following, we examine how closely the performance of the identified domains can approximate the oracle if we employ multi-source adaptation. To this end, we consider several choices of multi-source domain adaptation methods:

• UNION  The most naive way is to combine all the source domains into a single dataset and adapt from this "mega" domain to the target domains. We use this as a baseline.
• ENSEMBLE  A more sophisticated strategy is to adapt each source domain to the target domain and combine the adaptation results in the form of combining multiple classifiers [20].
• MATCHING  This strategy compares the empirical (marginal) distributions of the source domains and the target domain, and selects the single source domain with the smallest difference to the target domain for adaptation. We use the kernel-based method to compare distributions, as explained in section 2. Note that since we compare only the marginal distributions, we do not require the target domains to be annotated.

Table 3: Results of reshaping the test set when it consists of data from multiple domains.

        | No reshaping | Reshaping training only (from identified) | Conditional reshaping
        | A∪B∪C → F    | A' → F | B' → F | C' → F                  | X → F_X, ∀X ∈ {A', B', C'}
Cam 012 | 37.3         | 36.4   | 37.1   | 37.7                    | 38.5
Cam 123 | 39.9         | 40.4   | 38.7   | 39.6                    | 41.1
Cam 234 | 47.8         | 46.5   | 45.7   | 46.1                    | 49.2
Cam 340 | 52.3         | 50.7   | 50.6   | 50.5                    | 54.9
Cam 401 | 43.3         | 43.6   | 41.8   | 43.9                    | 44.8

Table 2 reports the averaged recognition accuracies on the target domains, using either the original datasets/domains or the identified domains as the source domains. 
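The oracle quantity of eq. (6) reduces to a per-target max over candidate source domains, averaged over the observed test sets; a minimal sketch (the accuracy matrix in the test is made up purely for illustration):

```python
import numpy as np

def oracle_accuracy(r):
    """Empirical estimate of eq. (6): r[k, l] holds the accuracy of
    adapting identified domain U_k to test set T_l; the oracle picks
    the best source per target, then averages over targets."""
    return float(np.mean(np.max(r, axis=0)))
```

Because the per-target max requires annotated targets, this value is an upper bound; the MATCHING strategy tries to approach it without target labels.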
The latent domains identified by our method generally perform well, especially when using MATCHING to select the single best source domain for the target domain. In fact, contrasting Table 2 with Table 1, the MATCHING strategy for adaptation is able to match the oracle accuracies, even though the matching process uses no label information from the target domains.

4.3 Reshaping the test datasets

So far we have concentrated on reshaping multiple annotated datasets (for training classifiers) into domains for adapting to test datasets. However, test datasets can also be made of multiple latent domains. Hence, it is also instructive to investigate whether we can reshape the test datasets into multiple domains to achieve better adaptation results.
The reshaping process for test datasets has a critical difference from reshaping training datasets, though. Specifically, we should reshape test datasets conditioning on the identified domains from the training datasets — the goal is to discover latent domains in the test datasets that match the domains in the training datasets as much as possible. We term this conditional reshaping.
Computationally, conditional reshaping is more tractable than identifying latent domains from the training datasets. Concretely, we minimize the distribution differences between the latent domains in the test datasets and the identified domains in the training datasets, using the kernel-based measure explained in section 2. The resulting optimization problem can be relaxed into a convex quadratic program. Details are in the Suppl. Material.
Table 3 demonstrates the benefit of conditionally reshaping the test datasets on cross-view action recognition. This problem inherently needs test set reshaping, since the person may be viewed from any direction at test time. (In contrast, the test sets for the object recognition datasets above are less heterogeneous.) 
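Both the MATCHING strategy and conditional reshaping rest on comparing marginal distributions with the kernel-based measure of section 2. A sketch of selecting the closest source by squared maximum mean discrepancy (illustrative only; the RBF helper and function names are ours):

```python
import numpy as np

def rbf(X, Y, bw=1.0):
    # Gaussian kernel matrix between two samples.
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * bw**2))

def mmd2(Ks, Kt, Kst):
    """Squared MMD between two samples from their kernel blocks."""
    return Ks.mean() + Kt.mean() - 2 * Kst.mean()

def matching_source(kernel, sources, target):
    """Pick the source domain whose marginal is closest to the target,
    as in the MATCHING strategy; no target labels are needed.
    kernel : callable (X, Y) -> kernel matrix; sources : list of arrays."""
    best, best_d = None, np.inf
    for j, S in enumerate(sources):
        d = mmd2(kernel(S, S), kernel(target, target), kernel(S, target))
        if d < best_d:
            best, best_d = j, d
    return best
```

The same discrepancy, minimized over soft assignments of test instances instead of over source indices, is what the conditional-reshaping quadratic program optimizes.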
The first column shows five groups of training datasets; each group comprises three different views, denoted by A, B, and C. In each group, the remaining views D and E are merged into a new test dataset, denoted by F = D ∪ E. Two baselines are included: (1) adapting from the identified domains A', B', and C' to the merged dataset F; (2) adapting from the merged dataset A ∪ B ∪ C to F. These are contrasted with adapting from the identified domains in the training datasets to the matched domains in F. In most groups, there is a significant improvement in recognition accuracy by conditional reshaping over no reshaping on either training or testing, and over reshaping on training only.

4.4 Analysis of identified domains and the optimal number of domains

It is also interesting to see which factors are dominant in the identified domains. Object appearance, illumination, or background? Do they coincide with the factors controlled by the dataset collectors? Some exemplar images are shown in Figure 1, where each row corresponds to an original dataset, and each column is an identified domain across two datasets. On the left of Figure 1 we reshape Amazon and Caltech-256 into two domains. In Domain II all the "laptop" images 1) are taken from the front view and 2) have colorful screens, while Domain I images are less colorful and have more diversified views. 

Figure 1: Exemplar images from the original and identified domains after reshaping (rows: Amazon, Caltech, DSLR, Webcam; columns: Identified Domains I and II). Note that identified domains contain images from both datasets.

Figure 2: Domain-wise cross-validation (DWCV) for choosing the number of domains.

It appears that the domains in Amazon and Caltech-256 are mainly determined by the factors of object pose and appearance (color).
The figures on the right are from reshaping DSLR and Webcam, whose "keyboard" images were taken in an office environment with the lighting, object poses, and background varied by the dataset creators [2]. We can see that the images in Domain II have a gray background, while in Domain I the background is either white or wooden. Moreover, keyboards of the same model, characterized by color and shape, are almost perfectly assigned to the same domain. In sum, the main factors here are probably background and object appearance (color and shape).
Figure 2 plots some intermediate results of the domain-wise cross-validation (DWCV) for determining the number of domains K to identify from the multiple training datasets. In addition to the DWCV accuracy A(K), the average classification accuracies on the target domain(s) are also included for reference. We set A(K) to 0 when some category in a domain is assigned only one or no data points (as a result of the optimization). Generally, A(K) rises and then drops at some point; the value just before the drop is the optimal K* we use in the experiments. Interestingly, the number favored by DWCV coincides with the number of datasets we mix, even though, as our experiments above show, the ideal domain boundaries do not coincide with the dataset boundaries.

5 Conclusion

We introduced two domain properties, maximum distinctiveness and maximum learnability, to discover latent domains from datasets. Accordingly, we proposed nonparametric approaches that encourage the extracted domains to satisfy these properties. Since visual discrimination is more consistent within each domain than within the heterogeneous datasets, better prediction performance can be achieved on the target domain. 
The proposed approach is extensively evaluated on visual object recognition and human activity recognition tasks. Our identified domains outperform not only the original datasets but also the domains discovered by [20], validating the effectiveness of our approach. Examining the main factors behind the domains discovered in existing datasets may also shed light on how to construct datasets in the future.

Acknowledgments K.G. is supported by ONR ATL N00014-11-1-0105. B.G. and F.S. are supported by ARO Award# W911NF-12-1-0241, DARPA Contract# D11AP00278, and IARPA via DoD/ARL contract # W911NF-12-C-0012. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoD/ARL, or the U.S. Government.

References
[1] L. Duan, D. Xu, I.W. Tsang, and J. Luo. Visual event recognition in videos by learning from web data. In CVPR, 2010.
[2] K. Saenko, B. Kulis, M. Fritz, and T. Darrell. Adapting visual category models to new domains. In ECCV, 2010.
[3] R. Gopalan, R. Li, and R. Chellappa. Domain adaptation for object recognition: An unsupervised approach. In ICCV, 2011.
[4] B. Gong, Y. Shi, F. Sha, and K. Grauman. Geodesic flow kernel for unsupervised domain adaptation. In CVPR, 2012.
[5] H. Daumé III. Frustratingly easy domain adaptation. In ACL, 2007.
[6] J.
Blitzer, R. McDonald, and F. Pereira. Domain adaptation with structural correspondence learning. In EMNLP, 2006.
[7] J. Huang, A.J. Smola, A. Gretton, K.M. Borgwardt, and B. Schölkopf. Correcting sample selection bias by unlabeled data. In NIPS, 2007.
[8] S.J. Pan, I.W. Tsang, J.T. Kwok, and Q. Yang. Domain adaptation via transfer component analysis. IEEE Trans. NN, (99):1–12, 2009.
[9] J. Quiñonero-Candela, M. Sugiyama, A. Schwaighofer, and N.D. Lawrence. Dataset shift in machine learning. The MIT Press, 2009.
[10] Y. Mansour, M. Mohri, and A. Rostamizadeh. Domain adaptation with multiple sources. In NIPS, 2009.
[11] J. Deng, W. Dong, R. Socher, L.J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[12] M. Everingham, L. Van Gool, C.K.I. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes (VOC) challenge. IJCV, 88(2):303–338, 2010.
[13] B.C. Russell, A. Torralba, K.P. Murphy, and W.T. Freeman. LabelMe: a database and web-based tool for image annotation. IJCV, 77:157–173, 2008.
[14] G. Griffin, A. Holub, and P. Perona. Caltech-256 object category dataset. Technical report, California Institute of Technology, 2007.
[15] D. Weinland, E. Boyer, and R. Ronfard. Action recognition from arbitrary views using 3D exemplars. In ICCV, 2007.
[16] A. Torralba and A.A. Efros. Unbiased look at dataset bias. In CVPR, 2011.
[17] B. Gong, F. Sha, and K. Grauman. Overcoming dataset bias: An unsupervised domain adaptation approach. In NIPS Workshop on Large Scale Visual Recognition and Retrieval, 2012.
[18] L. Cao, Z. Liu, and T.S. Huang. Cross-dataset action detection. In CVPR, 2010.
[19] T. Tommasi, N. Quadrianto, B. Caputo, and C. Lampert. Beyond dataset bias: multi-task unaligned shared knowledge transfer. In ACCV, 2012.
[20] J. Hoffman, B. Kulis, T. Darrell, and K. Saenko. Discovering latent domains for multisource domain adaptation. In ECCV,
2012.
[21] A. Gretton, K. Borgwardt, M. Rasch, B. Schölkopf, and A. Smola. A kernel method for the two-sample problem. In NIPS, 2007.
[22] H. Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227–244, 2000.
[23] S. Ben-David, J. Blitzer, K. Crammer, and F. Pereira. Analysis of representations for domain adaptation. In NIPS, 2007.
[24] R. Li and T. Zickler. Discriminative virtual views for cross-view action recognition. In CVPR, 2012.
[25] K. Tang, V. Ramanathan, L. Fei-Fei, and D. Koller. Shifting weights: Adapting object detectors from image to video. In NIPS, 2012.
[26] B. Gong, K. Grauman, and F. Sha. Connecting the dots with landmarks: Discriminatively learning domain-invariant features for unsupervised domain adaptation. In ICML, 2013.
[27] S. Jegelka, A. Gretton, B. Schölkopf, B.K. Sriperumbudur, and U. von Luxburg. Generalized clustering via kernel embeddings. In Advances in Artificial Intelligence, 2009.
[28] Q. Sun, R. Chattopadhyay, S. Panchanathan, and J. Ye. A two-stage weighting framework for multi-source domain adaptation. In NIPS, 2011.
[29] L. Duan, I.W. Tsang, D. Xu, and T. Chua. Domain adaptation from multiple sources via auxiliary classifiers. In ICML, 2009.
[30] H. Bay, T. Tuytelaars, and L. Van Gool. SURF: Speeded up robust features. In ECCV, 2006.
[31] D. Tran and A. Sorokin. Human activity recognition with metric learning. In ECCV, 2008.
[32] A. Bergamo and L. Torresani. Exploiting weakly-labeled web images to improve object classification: a domain adaptation approach. In NIPS, 2010.
[33] A. Farhadi and M. Tabrizi. Learning to recognize activities from the wrong view point. In ECCV, 2008.
[34] C.-H. Huang, Y.-R. Yeh, and Y.-C. Wang. Recognizing actions across cameras by exploring the correlated subspace. In ECCV, 2012.
[35] J. Liu, M. Shah, B. Kuipers, and S.
Savarese. Cross-view action recognition via view knowledge transfer. In CVPR, 2011.