{"title": "Label Efficient Learning of Transferable Representations acrosss Domains and Tasks", "book": "Advances in Neural Information Processing Systems", "page_first": 165, "page_last": 177, "abstract": "We propose a framework that learns a representation transferable across different domains and tasks in a data efficient manner. Our approach battles domain shift with a domain adversarial loss, and generalizes the embedding to novel task using a metric learning-based approach. Our model is simultaneously optimized on labeled source data and unlabeled or sparsely labeled data in the target domain. Our method shows compelling results on novel classes within a new domain even when only a few labeled examples per class are available, outperforming the prevalent fine-tuning approach. In addition, we demonstrate the effectiveness of our framework on the transfer learning task from image object recognition to video action recognition.", "full_text": "Label Ef\ufb01cient Learning of Transferable\nRepresentations across Domains and Tasks\n\nZelun Luo\n\nStanford University\n\nzelunluo@stanford.edu\n\nYuliang Zou\nVirginia Tech\nylzou@vt.edu\n\nJudy Hoffman\n\nUniversity of California, Berkeley\njhoffman@eecs.berkeley.edu\n\nAbstract\n\nLi Fei-Fei\n\nStanford University\n\nfeifeili@cs.stanford.edu\n\nWe propose a framework that learns a representation transferable across different\ndomains and tasks in a label ef\ufb01cient manner. Our approach battles domain shift\nwith a domain adversarial loss, and generalizes the embedding to novel task using\na metric learning-based approach. Our model is simultaneously optimized on\nlabeled source data and unlabeled or sparsely labeled data in the target domain.\nOur method shows compelling results on novel classes within a new domain even\nwhen only a few labeled examples per class are available, outperforming the\nprevalent \ufb01ne-tuning approach. 
In addition, we demonstrate the effectiveness of our framework on the transfer learning task from image object recognition to video action recognition.\n\n1 Introduction\n\nHumans are exceptional visual learners, capable of generalizing their learned knowledge to novel domains and concepts and of learning from few examples. In recent years, computational models based on end-to-end learnable convolutional networks have made significant improvements in visual recognition [18, 28, 54] and have been shown to demonstrate some cross-task generalization [8, 48] while enabling faster learning of subsequent tasks, as most frequently evidenced through fine-tuning [14, 36, 50].\nHowever, most efforts focus on the supervised learning scenario, where a closed-world assumption is made at training time about both the domain of interest and the tasks to be learned. Thus, any generalization ability of these models is only an observed byproduct. There has been a large push in the research community to address generalizing and adapting deep models across different domains [64, 13, 58, 38], to learn tasks in a data-efficient way through few-shot learning [27, 70, 47, 11], and to generically transfer information across tasks [1, 14, 50, 35].\nWhile most approaches consider each scenario in isolation, we aim to directly tackle the joint problem of adapting to a novel domain which has new tasks and few annotations. Given a large labeled source dataset with annotations for a task set, A, we seek to transfer knowledge to a sparsely labeled target domain with a possibly wholly new task set, B. This setting is in line with our intuition that we should be able to learn reusable, general-purpose representations which enable faster learning of future tasks while requiring less human intervention. 
In addition, this setting closely matches the most common practical approach for training deep models, which is to use a large labeled source dataset (often ImageNet [6, 52]) to train an initial representation and then to continue supervised learning with a new set of data, often with new concepts.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nIn our approach, we jointly adapt a source representation for use in a distinct target domain using a new multi-layer unsupervised domain adversarial formulation, while introducing a novel cross-domain and within-domain class similarity objective. This new objective can be applied even when the target domain's classes do not overlap with the source domain's.\nWe evaluate our approach in the challenging setting of joint transfer across domains and tasks and demonstrate our ability to successfully transfer, reducing the need for annotated data in the target domain and tasks. First, we present results transferring from a subset of Google Street View House Numbers (SVHN) [41] containing only digits 0-4 to a subset of MNIST [29] containing only digits 5-9. Second, we present results on the challenging setting of adapting from ImageNet [6] object-centric images to UCF-101 [57] videos for action recognition.\n\n2 Related work\n\nDomain adaptation. Domain adaptation seeks to learn, from related source domains, a well-performing model for the target data distribution [4]. Existing work often assumes that both domains are defined on the same task and that labeled data in the target domain is sparse or non-existent [64]. Several methods have tackled the problem with the Maximum Mean Discrepancy (MMD) loss [17, 36, 37, 38, 73] between the source and target domain. Weight sharing of CNN parameters [58, 22, 21, 3] and minimizing the distribution discrepancy of network activations [51, 65, 30] have also shown convincing results. 
Adversarial generative models [33, 32, 2, 59] aim to generate source-like data from target data by training a generator and a discriminator simultaneously, while adversarial discriminative models [62, 64, 13, 12, 23] focus on aligning embedded feature representations of the target domain to the source domain. Inspired by adversarial discriminative models, we propose a method that aligns domain features using multi-layer information.\nTransfer learning. Transfer learning aims to transfer knowledge by leveraging the existing labeled data of some related task or domain [45, 71]. In computer vision, examples of transfer learning include [1, 31, 61], which try to overcome the deficit of training samples for some categories by adapting classifiers trained for other categories [43]. With the power of deep supervised learning and the ImageNet dataset [6, 52], learned knowledge can even transfer to a totally different task (e.g., image classification → object detection [50, 49, 34]; image classification → semantic segmentation [35]) and achieve state-of-the-art performance. In this paper, we focus on the setting where the source and target domains have differing label spaces but the label spaces share the same structure; namely, we adapt between classifying different category sets, but do not transfer from classification to a localization-plus-classification task.\nFew-shot learning. Few-shot learning seeks to learn new concepts with only a few annotated examples. Deep siamese networks [27] are trained to rank similarity between examples. Matching networks [70] learn a network that maps a small labeled support set and an unlabeled example to its label. Aside from these metric learning-based methods, meta-learning has also played an essential role. Ravi et al. [47] propose to learn an LSTM meta-learner that learns the update rule of a learner. Finn et al. 
[11] try to find a good initialization point that can be easily fine-tuned with new examples from new tasks. When there is a domain shift, the results of prior few-shot learning methods are often degraded.\nUnsupervised learning. Many unsupervised learning algorithms have focused on modeling raw data using reconstruction objectives [19, 69, 26]. Other popular probabilistic models include restricted Boltzmann machines [20], deep Boltzmann machines [53], GANs [15, 10, 9], and autoregressive models [42, 66]. An alternative approach, often termed “self-supervised learning” [5], defines a pretext task such as predicting patch ordering [7], frame ordering [40], motion dynamics [39], or colorization [72] as a form of indirect supervision. Compared to these approaches, our unsupervised learning method does not rely on exploiting the spatial or temporal structure of the data, and is therefore more generic.\n\n3 Method\n\nWe introduce a semi-supervised learning algorithm which transfers information from a large labeled source domain, S, to a sparsely labeled target domain, T. The goal is to learn a strong target classifier without requiring the large annotation overhead of standard supervised learning approaches.\n\nFigure 1: Our proposed learning framework for joint transfer across domains, and semantic transfer across source and target and from labeled to unlabeled target data. We introduce a domain discriminator which aligns source and target representations across multiple layers of the network through domain adversarial learning. We enable semantic transfer by minimizing the entropy of the pairwise similarity between unlabeled and labeled target images, and use the temperature of the softmax over the similarity vector to allow for non-overlapping label spaces.\n\nIn fact, this setting is very commonly explored for convolutional network (convnet) based recognition methods. 
When learning with convnets, the usual procedure is to use a very large labeled dataset (e.g. ImageNet [6, 52]) for initial training of the network parameters (termed pre-training). The learned weights are then used as initialization for continued learning on new data and for new tasks, called fine-tuning. Fine-tuning has been broadly applied to reduce the number of labeled examples needed for learning new tasks, such as recognizing new object categories after ImageNet pre-training [54, 18], or learning new label structures such as detection after classification pre-training [14, 50]. Here we focus on transfer in the case of a shared label structure (e.g. classification of different category sets).\nWe assume the source domain contains n_s images, x_s ∈ X^S, with associated labels, y_s ∈ Y^S. Similarly, the target domain consists of n_t unlabeled images, x̃_t ∈ X̃^T, as well as m_t images, x_t ∈ X^T, with associated labels, y_t ∈ Y^T. We assume that the target domain is only sparsely labeled, so that the number of image-label pairs is much smaller than the number of unlabeled images, m_t ≪ n_t. Additionally, the number of source labeled images is assumed to be much larger than the number of target labeled images, m_t ≪ n_s.\nUnlike standard domain adaptation approaches, which transfer knowledge from source to target domains assuming a marginal or conditional distribution shift under a shared label space (Y^S = Y^T), we tackle joint image or feature space adaptation as well as transfer across semantic spaces. 
Namely, we consider the case where the source and target label spaces are not equal, Y^S ≠ Y^T, and even the most challenging case where the sets are non-overlapping, Y^S ∩ Y^T = ∅.\n\n3.1 Joint domain and semantic transfer\n\nOur approach consists of unsupervised feature alignment between source and target as well as semantic transfer to the unlabeled target data from either the labeled target or the labeled source data. We introduce a new multi-layer domain discriminator which can be used for domain alignment, following recent domain adversarial learning approaches [13, 64]. We next introduce a new semantic transfer learning objective which uses cross-category similarity and can be tuned to account for varying sizes of label set overlap.\n\nWe depict our overall model in Figure 1. We take the n_s source labeled examples, {x_s, y_s}, the m_t target labeled examples, {x_t, y_t}, and the n_t unlabeled target images, {x̃_t}, as input. We learn an initial layered source representation and classification network (depicted in blue in Figure 1) using standard supervised techniques. We then initialize the target model (depicted in green in Figure 1) with the source parameters and begin our adaptive transfer learning.\nOur model jointly optimizes over a target supervised loss, L_sup, a domain transfer objective, L_DT, and finally a semantic transfer objective, L_ST. 
Thus, our total objective can be written as follows:\n\nL(X^S, Y^S, X^T, Y^T, X̃^T) = L_sup(X^T, Y^T) + α L_DT(X^S, X̃^T) + β L_ST(X^S, X^T, X̃^T)   (1)\n\nwhere the hyperparameters α and β determine the influence of the domain transfer loss and the semantic transfer loss, respectively. In the following sections we elaborate on our domain and semantic transfer objectives.\n\n3.2 Multi-layer domain adversarial loss\n\nWe define a novel domain alignment objective function called the multi-layer domain adversarial loss. Recent efforts in deep domain adaptation have shown strong performance using feature-space domain adversarial objectives [13, 64]. These methods learn a target representation such that the target distribution viewed under this model is aligned with the source distribution viewed under the source representation. This alignment is accomplished through an adversarial minimization across domains, analogous to the prevalent generative adversarial approaches [15]. In particular, a domain discriminator, D(·), is trained to classify whether a particular data point arises from the source or the target domain. Simultaneously, the target embedding function E_t(x_t) (defined as the application of the layers of the network) is trained to generate a target representation that cannot be distinguished from the source domain representation by the domain discriminator. 
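In code, the combined objective of Eq. 1 is simply a weighted sum of the three terms. The sketch below is illustrative only: the loss values are placeholders computed elsewhere, while the defaults α = β = 0.1 are the values reported in the implementation details of Section 4.

```python
def total_loss(l_sup, l_dt, l_st, alpha=0.1, beta=0.1):
    """Eq. (1): supervised target loss plus weighted domain-transfer and
    semantic-transfer terms; alpha and beta set each term's influence."""
    return l_sup + alpha * l_dt + beta * l_st
```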
Similar to [63, 64], we consider a representation to be domain invariant if the domain discriminator cannot distinguish examples from the two domains.\nPrior work considers alignment for a single layer of the embedding at a time, and as such learns a domain discriminator which takes the output from the corresponding source and target layers as input. Separately, domain alignment methods which focus on first- and second-order statistics have shown improved performance by applying domain alignment independently at multiple layers of the network [36]. Rather than learning independent discriminators for each layer of the network, we propose a simultaneous alignment of multiple layers through a multi-layer discriminator.\nAt each layer of our multi-layer domain discriminator, information is accumulated from both the output of the previous discriminator layer and the source and target activations from the corresponding layer in their respective embeddings. Thus, the output of each discriminator layer is defined as:\n\nd_l = D_l(σ(γ d_{l-1} ⊕ E_l(x)))   (2)\n\nwhere l is the current layer, σ(·) is the activation function, γ ≤ 1 is the decay factor, ⊕ represents concatenation or element-wise summation, and x is taken either from the source data, x_s ∈ X^S, or the target data, x̃_t ∈ X̃^T. Notice that the intermediate discriminator layers share the same structure with their corresponding encoding layers to match the dimensions.\nThus, the following loss functions are proposed to optimize the multi-layer domain discriminator and the embeddings, respectively, according to our domain transfer objective:\n\nL^D_DT = −E_{x_s∼X^S}[log d^s_l] − E_{x_t∼X̃^T}[log(1 − d^t_l)]   (3)\n\nL^{E_t}_DT = −E_{x_s∼X^S}[log(1 − d^s_l)] − E_{x_t∼X̃^T}[log d^t_l]   (4)\n\nwhere d^s_l and d^t_l are the outputs of the last layer of the source and target multi-layer domain discriminator. Note that these losses are placed after the final domain discriminator layer and the last embedding layer, but produce gradients which back-propagate through all relevant lower-layer parameters. These two losses together comprise L_DT, and there is no iterative optimization procedure involved. This multi-layer discriminator (shown in yellow in Figure 1) allows for deeper alignment of the source and target representations, which we find empirically results in improved target classification performance as well as more stable adversarial learning.\n\nFigure 2: We illustrate the purpose of temperature (τ) for our pairwise similarity vector. Consider an example unlabeled target point and its similarity to four labeled source points (x-axis). We show the original unnormalized scores (leftmost) as well as the same similarity scores after applying softmax with different temperatures, τ. Notice that the entropy values, H(x), have higher variance for scores normalized with a small-temperature softmax.\n\n3.3 Cross-category similarity for semantic transfer\n\nIn the previous section, we introduced a method for transferring an embedding from the source to the target domain. 
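As a concrete illustration of the multi-layer adversarial objective of Section 3.2, the layer recurrence of Eq. 2 and the two losses of Eqs. 3 and 4 can be sketched in plain Python. This is a minimal scalar sketch, not the paper's implementation: the per-layer weights standing in for D_l, the scalar activations standing in for E_l(x), and the final sigmoid squashing are all illustrative assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def multilayer_discriminator(feats, weights, gamma=0.1):
    """Sketch of Eq. (2): d_l = D_l(sigma(gamma * d_{l-1} (+) E_l(x))).
    `feats` are scalar stand-ins for the per-layer embedding activations
    E_l(x); `weights` are hypothetical scalar discriminator layers D_l;
    (+) is realized here as element-wise summation."""
    d = 0.0
    for f, w in zip(feats, weights):
        d = w * sigmoid(gamma * d + f)
    # Squash the final score to a "came from the source domain" probability
    # (an illustrative choice, not specified in the paper).
    return sigmoid(d)


def domain_transfer_losses(d_source, d_target):
    """Eqs. (3)-(4): the discriminator loss and the inverted-label
    embedding loss, for one source and one target example."""
    l_disc = -math.log(d_source) - math.log(1.0 - d_target)  # Eq. (3)
    l_emb = -math.log(1.0 - d_source) - math.log(d_target)   # Eq. (4)
    return l_disc, l_emb
```

When the discriminator is confident (d_source near 1, d_target near 0), its own loss is small while the embedding loss is large, which is exactly what drives the target encoder toward domain-invariant features.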
However, this only enforces alignment of the global domain statistics, with no class-specific transfer. Here, we define a new semantic transfer objective, L_ST, which transfers information from a labeled set of data to an unlabeled set of data by minimizing the entropy of the softmax with temperature of the similarity vector between an unlabeled point and all labeled points. Thus, this loss may be applied either between the source and unlabeled target data or between the labeled and unlabeled target data.\nFor each unlabeled target image, x̃_t, we compute the similarity, ψ(·), to each labeled example or to each prototypical example [56] per class in the labeled set. For simplicity of presentation, let us consider semantic transfer from the source to the target domain first. For each unlabeled target image we compute a similarity vector whose ith element is the similarity between this target image and the ith labeled source image: [v_s(x̃_t)]_i = ψ(x̃_t, x^s_i). Our semantic transfer loss can be defined as follows:\n\nL_ST(X̃^T, X^S) = Σ_{x̃_t ∈ X̃^T} H(σ(v_s(x̃_t)/τ))   (5)\n\nwhere H(·) is the information entropy function, σ(·) is the softmax function, and τ is the temperature of the softmax. Note that the temperature can be used to directly control the percentage of source examples we expect the target example to be similar to (see Figure 2).\nEntropy minimization has been widely used for unsupervised [44] and semi-supervised [16] learning by encouraging low-density separation between clusters or classes. Recently this principle of entropy minimization has been applied to unsupervised adaptation [38]. 
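To make the role of the temperature concrete, here is a small pure-Python sketch of the tempered-softmax entropy of Eq. 5; the similarity values in the usage note are made up for illustration.

```python
import math


def softmax(v, tau=1.0):
    """Softmax with temperature tau; smaller tau sharpens the distribution."""
    exps = [math.exp(x / tau) for x in v]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(p):
    """Information entropy H(p) in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)


def semantic_transfer_loss(sim_vectors, tau=2.0):
    """Sketch of Eq. (5): sum, over unlabeled examples, of the entropy of
    the tempered softmax of their similarity vectors to the labeled set.
    tau = 2 matches the source-to-target value reported in Section 4."""
    return sum(entropy(softmax(v, tau)) for v in sim_vectors)
```

For a similarity vector such as [2.0, 1.0, 0.2, 0.1], a small temperature (e.g. τ = 0.5) yields a sharp distribution with low entropy, while a large one (e.g. τ = 4) spreads mass over several labeled examples, mirroring Figure 2: minimizing Eq. 5 with a small τ pushes each unlabeled point toward a single labeled class, while a larger τ tolerates similarity to multiple classes.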
In that setting [38], the source and target domains are assumed to share a label space; each unlabeled target example is passed through the initial source classifier and the entropy of the softmax output scores is minimized.\nIn contrast, we do not assume a shared label space between the source and target domains, and as such cannot assume that each target image maps to a single source label. Instead, we compute pairwise similarities between the target points and the source points (or per-class averages of source points [56]) across the feature spaces aligned by our multi-layer domain adversarial transfer. We then tune the softmax temperature based on the expected similarity between the source and target labeled sets. For example, if the source and target label sets overlap, then a small temperature will encourage each target point to be very similar to one source class, whereas a larger temperature will allow target points to be similar to multiple source classes.\nFor semantic transfer within the target domain, we utilize a metric-based cross entropy loss between labeled target examples to stabilize and improve learning. For a labeled target example, in addition to the traditional cross entropy loss, we also calculate a metric-based cross entropy loss.¹ Assume we have k labeled examples from each class in the target domain. We compute the embedding for each example and then the centroid c^T_i of each class in the embedding space. Thus, we can compute the similarity vector for each labeled example, whose ith element is the similarity between this labeled example and the centroid of the ith class: [v_t(x_t)]_i = ψ(x_t, c^T_i).\n\n¹We refer to this as \"metric-based\" to cue the reader that this is not a cross entropy within the label space.\n\n
We can then calculate the metric-based cross entropy loss:\n\nL_ST,sup(X^T) = − Σ_{(x_t, y_t) ∈ X^T} log( exp([v_t(x_t)]_{y_t}) / Σ^n_{i=1} exp([v_t(x_t)]_i) )   (6)\n\nSimilar to the source-to-target scenario, for target-to-target transfer we also have an unsupervised part:\n\nL_ST,unsup(X̃^T, X^T) = Σ_{x̃_t ∈ X̃^T} H(σ(v_t(x̃_t)/τ))   (7)\n\nWith the metric-based cross entropy loss, we introduce the constraint that target domain data of the same class should be similar in the embedding space. We also find that this loss provides guidance that helps the unsupervised semantic transfer learn in a more stable way. L_ST is the combination of the unsupervised source-to-target loss (Equation 5), the supervised within-target loss (Equation 6), and the unsupervised within-target loss (Equation 7), i.e.,\n\nL_ST(X^S, X^T, X̃^T) = L_ST(X̃^T, X^S) + L_ST,sup(X^T) + L_ST,unsup(X̃^T, X^T)   (8)\n\n4 Experiment\n\nThis section is structured as follows. In Section 4.1, we show that our method outperforms the fine-tuning approach by a large margin and that all parts of our method are necessary. In Section 4.2, we show that our method generalizes to larger datasets. In Section 4.3, we show that our multi-layer domain adversarial method outperforms state-of-the-art domain adversarial approaches.\nDatasets. We perform adaptation experiments across two different paired data settings. First, for adaptation across digit domains, we use MNIST [29] and Google Street View House Numbers (SVHN) [41]. The MNIST handwritten digits database has a training set of 60,000 examples and a test set of 10,000 examples. The digits have been size-normalized and centered in fixed-size images. SVHN is a real-world image dataset for machine learning and object recognition with minimal requirements on data preprocessing and formatting. It has 73,257 digits for training and 26,032 digits for testing. 
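Returning briefly to the within-target objective, the metric-based cross entropy of Eq. 6 can be sketched in plain Python. This is a minimal sketch: it uses a plain dot product as a stand-in for the similarity ψ (the paper's similarity is the dot product of normalized support features and the unnormalized target feature), and the toy feature vectors are illustrative.

```python
import math


def centroid(examples):
    """Mean embedding c_i of a list of equal-length feature vectors."""
    return [sum(dim) / len(dim) for dim in zip(*examples)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def metric_cross_entropy(x, y, centroids):
    """Sketch of Eq. (6) for one labeled target example x with class index
    y: cross entropy over the similarity vector [v_t(x)]_i = psi(x, c_i),
    with an unnormalized dot product standing in for psi."""
    v = [dot(x, c) for c in centroids]
    log_z = math.log(sum(math.exp(s) for s in v))
    return -(v[y] - log_z)
```

With two well-separated class centroids, the loss is small when the example sits near its own class centroid and larger otherwise, which is the clustering pressure described above.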
As our second experimental setup, we consider adaptation from object-centric images in ImageNet [52] to action recognition in video using the UCF-101 [57] dataset. ImageNet is a large benchmark for object classification; we use the task 1 split from ILSVRC2012. UCF-101 is an action recognition dataset collected from YouTube. With 13,320 videos from 101 action categories, UCF-101 provides large diversity in terms of actions, with large variations in camera motion, object appearance and pose, object scale, viewpoint, cluttered background, illumination conditions, etc.\nImplementation details. We pre-train the source domain embedding function with a cross-entropy loss. For the domain adversarial loss, the discriminator takes the last three layers' activations as input when the number of output classes is the same for the source and target tasks, and takes the second- and third-to-last layers' activations when they differ. The similarity score is the dot product of the normalized support features and the unnormalized target feature. We use the temperature τ = 2 for source-target semantic transfer, and τ = 1 for within-target transfer as the label space is shared. We use α = 0.1 and β = 0.1 in our objective function. The network is trained with the Adam optimizer [25] with learning rate 10⁻³. We conduct all experiments in the PyTorch framework.\n\n4.1 SVHN 0-4 → MNIST 5-9\n\nExperimental setting. In this experiment, we define three datasets: (i) labeled data in the source domain, D1; (ii) a few labeled data in the target domain, D2; (iii) unlabeled data in the target domain, D3. We take the training split of the SVHN dataset as D1. To fairly compare with the traditional learning paradigm and episodic training, we subsample k examples from each class to construct D2, so that we can perform traditional training or episodic (k − 1)-shot learning. 
We experiment with k = 2, 3, 4, 5, which corresponds to 10, 15, 20, 25 labeled examples, or 0.017%, 0.025%, 0.033%, 0.042% of the total training data, respectively. Since our approach involves using annotations from a small subset of the data, we randomly subsample 10 different subsets {D^i_2}^10_{i=1} from the training split of MNIST, and use the remaining data as {D^i_3}^10_{i=1}, for each k. Note that the source domain and target domain have non-overlapping classes: we only utilize digits 0-4 in SVHN, and digits 5-9 in MNIST.\n\nFigure 3: An illustration of our task. Our model effectively transfers the learned representation on SVHN digits 0-4 (left) to MNIST digits 5-9 (right).\n\nBaselines and prior work. We compare against six different methods: (i) Target only: the model is trained on D2 from scratch; (ii) Fine-tune: the model is pretrained on D1 and fine-tuned on D2; (iii) Matching networks [70]: we first pretrain the model on D3, then use D2 as the support set in the matching networks; (iv) Fine-tuned matching networks: same as (iii), except that for each k the model is fine-tuned on D2 with 5-way (k − 1)-shot learning: k − 1 examples in each class are randomly selected as the support set, and the last example in each class is used as the query set; (v) Fine-tune + adversarial: in addition to (ii), the model is also trained on D1 and D3 with a domain adversarial loss; (vi) Full model: fine-tune the model with the proposed multi-layer domain adversarial loss.\nResults and analysis. We report the mean and standard error of the accuracies across the 10 sets of data in Table 1. Due to domain shift, matching networks perform poorly without fine-tuning, and fine-tuning is only marginally better than training from scratch. Our fine-tune + adversarial variant improves the overall performance, but is more sensitive to the subsampled data. 
Our full method achieves a significant performance gain, especially when the number of labeled examples is small (k = 2). For reference, fine-tuning on the full target dataset gives an accuracy of 99.65%.\n\nTable 1: Test accuracies of the baseline models and our method. Rows 1 to 6 correspond (in the same order) to the six methods described in Section 4.1. Note that the accuracies of the two matching-net methods are calculated based on nearest neighbors in the support set. We report the mean and the standard error of each method across 10 different subsampled datasets.\n\nMethod | k=2 | k=3 | k=4 | k=5\nTarget only | 0.642 ± 0.026 | 0.771 ± 0.015 | 0.801 ± 0.010 | 0.840 ± 0.013\nFine-tune | 0.612 ± 0.020 | 0.779 ± 0.018 | 0.802 ± 0.016 | 0.830 ± 0.011\nMatching nets [70] | 0.469 ± 0.019 | 0.455 ± 0.014 | 0.566 ± 0.013 | 0.513 ± 0.023\nFine-tuned matching nets | 0.645 ± 0.019 | 0.755 ± 0.024 | 0.793 ± 0.013 | 0.827 ± 0.011\nOurs: fine-tune + adv. | 0.702 ± 0.020 | 0.800 ± 0.013 | 0.804 ± 0.014 | 0.831 ± 0.013\nOurs: full model (γ = 0.1) | 0.917 ± 0.007 | 0.936 ± 0.006 | 0.942 ± 0.006 | 0.950 ± 0.004\n\nFigure 4: t-SNE [68, 67] visualizations of different feature embeddings. (a) Source domain embedding. (b) Target domain embedding using the encoder trained with source domain data. (c) Target domain embedding using the encoder fine-tuned with target domain data. (d) Target domain embedding using the encoder trained with our method. (e) An overlay of (a) and (c). (f) An overlay of (a) and (d). (Best viewed in color and with zoom.)\n\n4.2 Image object recognition → video action recognition\n\nProblem analysis. Many recent works [60, 24] study the domain shift between images and videos in the object detection setting. 
Compared to still images, videos provide several advantages: (i) motion provides information for foreground vs. background segmentation [46]; (ii) videos often show multiple views of an object and thus provide 3D information. On the other hand, video frames usually suffer from (i) motion blur, (ii) compression artifacts, and (iii) objects that are out of focus or out of frame.\n\nExperimental setting. In this experiment, we use three dataset splits: (i) the ImageNet training set as the labeled data in the source domain, D1; (ii) k video clips per class randomly sampled from the UCF-101 training set as the few labeled data in the target domain, D2; (iii) the remaining videos in the UCF-101 training set as the unlabeled data in the target domain, D3. We experiment with k = 3, 5, 10, which corresponds to 303, 505, 1010 video clips, or 2.27%, 3.79%, 7.59% of the total training data, respectively. Each experiment is run 3 times on D1, {D^i_2}^3_{i=1}, and {D^i_3}^3_{i=1}.\nBaselines and prior work. We compare our method with two baseline methods: (i) Target only: the model is trained on D2 from scratch; (ii) Fine-tune: the model is first pre-trained on D1, then fine-tuned on D2. For reference, we report the performance of a fully supervised method [55].\nResults and analysis. The accuracy of each model is shown in Table 2. We also fine-tune a model with all the labeled data for comparison. Per-frame performance (img) and average-across-frame performance (vid) are both reported. Note that we calculate the average-across-frame performance by averaging the softmax scores of the frames in a video. Our method achieves a significant improvement in average-across-frame performance over standard fine-tuning for each value of k. 
Note that, compared to fine-tuning, our method has a bigger gap between per-frame and per-video accuracy. We believe this is due to the semantic transfer: our entropy loss encourages sharper (more confident) per-frame softmax predictions, increasing the variance among the per-frame scores within a video (if the variance were zero, per-frame accuracy would equal per-video accuracy). By making more confident predictions on key frames, our method achieves a more significant gain in per-video performance, even when there is little change in the per-frame predictions.\n\nTable 2: Accuracy of UCF-101 action classification. The results of the two-stream spatial model are taken from [55] and vary depending on hyperparameters. We report the mean and the standard error of each method across 3 different subsampled datasets.\n\nMethod | k=3 | k=5 | k=10 | All\nTarget only (img) | 0.098 ± 0.003 | 0.126 ± 0.022 | 0.100 ± 0.035 | -\nTarget only (vid) | 0.105 ± 0.003 | 0.133 ± 0.024 | 0.106 ± 0.038 | -\nFine-tune (img) | 0.380 ± 0.013 | 0.486 ± 0.012 | 0.529 ± 0.039 | 0.672\nFine-tune (vid) | 0.406 ± 0.015 | 0.523 ± 0.010 | 0.568 ± 0.042 | 0.714\nTwo-stream spatial [55] | - | - | - | 0.708 - 0.720\nOurs (img) | 0.393 ± 0.006 | 0.459 ± 0.013 | 0.523 ± 0.002 | -\nOurs (vid) | 0.467 ± 0.007 | 0.545 ± 0.014 | 0.620 ± 0.005 | -\n\n4.3 Ablation: unsupervised domain adaptation\n\nTo validate our multi-layer domain adversarial loss objective, we conduct an ablation experiment on unsupervised domain adaptation, comparing against multiple recent domain adversarial unsupervised adaptation methods. In this experiment, we first pretrain a source embedding CNN on the training split of SVHN [41] and then adapt the target embedding for MNIST by performing adversarial domain adaptation. We evaluate the classification performance on the test split of MNIST [29]. 
We follow the same training strategy and model architecture for the embedding network as [64].

All the models here use a two-step training strategy and share the first stage. ADDA [64] optimizes the encoder and classifier simultaneously; we also evaluate a similar variant that optimizes the encoder only. In addition, we try a model with no classifier in the last layer (i.e., performing domain adversarial training in feature space). We choose γ = 0.1 as the decay factor for this model.

The accuracy of each model is shown in Table 3. We find that our method achieves a 6.5% performance gain over the best competing domain adversarial approach, indicating that our multi-layer objective indeed contributes to our overall performance. In addition, in our experiments we found that the multi-layer approach improved overall optimization stability, as evidenced by our small standard error.

Table 3: Experimental results on unsupervised domain adaptation from SVHN to MNIST. The results of Gradient reversal, Domain confusion, and ADDA are taken from [64]; the results of the other methods are averaged across 5 different subsampled datasets.

Method                   Accuracy
Source only              0.601 ± 0.011
Gradient reversal [13]   0.739
Domain confusion [62]    0.681 ± 0.003
ADDA [64]                0.760 ± 0.018
Ours                     0.810 ± 0.003

5 Conclusion

In this paper, we propose a method to learn a representation that is transferable across different domains and tasks in a data-efficient manner. The framework is trained jointly to minimize the domain shift, to transfer knowledge to new tasks, and to learn from large amounts of unlabeled data. We show superior performance over the popular fine-tuning approach, and we hope to improve the method further in future work.

Acknowledgement

We would like to start by thanking our sponsors: the Stanford Computer Science Department and the Stanford Program in AI-assisted Care (PAC).
Next, we especially thank De-An Huang, Kenji Hata, Serena Yeung, Ozan Sener, and all the members of the Stanford Vision and Learning Lab for their insightful discussion and feedback. Lastly, we thank all the anonymous reviewers for their valuable comments.

References

[1] Yusuf Aytar and Andrew Zisserman. Tabula rasa: Model transfer for object category detection. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 2252–2259. IEEE, 2011.

[2] Konstantinos Bousmalis, Nathan Silberman, David Dohan, Dumitru Erhan, and Dilip Krishnan. Unsupervised pixel-level domain adaptation with generative adversarial networks. arXiv preprint arXiv:1612.05424, 2016.

[3] Lluis Castrejon, Yusuf Aytar, Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. Learning aligned cross-modal representations from weakly aligned data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2940–2949, 2016.

[4] Gabriela Csurka. Domain adaptation for visual applications: A comprehensive survey. arXiv preprint arXiv:1702.05374, 2017.

[5] Virginia R de Sa. Learning classification with unlabeled data. Advances in Neural Information Processing Systems, pages 112–112, 1994.

[6] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, 2009.

[7] Carl Doersch, Abhinav Gupta, and Alexei A Efros. Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE International Conference on Computer Vision, pages 1422–1430, 2015.

[8] Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. In Proceedings of the 31st International Conference on Machine Learning, ICML'14, pages I-647–I-655.
JMLR.org, 2014.

[9] Jeff Donahue, Philipp Krähenbühl, and Trevor Darrell. Adversarial feature learning. arXiv preprint arXiv:1605.09782, 2016.

[10] Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Alex Lamb, Martin Arjovsky, Olivier Mastropietro, and Aaron Courville. Adversarially learned inference. arXiv preprint arXiv:1606.00704, 2016.

[11] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400, 2017.

[12] Yaroslav Ganin and Victor Lempitsky. Unsupervised domain adaptation by backpropagation. arXiv preprint arXiv:1409.7495, 2014.

[13] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. Journal of Machine Learning Research, 17(59):1–35, 2016.

[14] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Computer Vision and Pattern Recognition, 2014.

[15] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

[16] Yves Grandvalet, Yoshua Bengio, et al. Semi-supervised learning by entropy minimization. In NIPS, volume 17, pages 529–536, 2004.

[17] A. Gretton, A. J. Smola, J. Huang, Marcel Schmittfull, K. M. Borgwardt, B. Schölkopf, J. Quiñonero Candela, M. Sugiyama, A. Schwaighofer, and N. D. Lawrence. Covariate shift by kernel mean matching. In Dataset Shift in Machine Learning, pages 131–160, 2009.

[18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[19] Geoffrey E Hinton and Ruslan R Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.

[20] Geoffrey E Hinton and Terrence J Sejnowski. Learning and relearning in Boltzmann machines. Parallel Distributed Processing, 1, 1986.

[21] Judy Hoffman, Saurabh Gupta, and Trevor Darrell. Learning with side information through modality hallucination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 826–834, 2016.

[22] Judy Hoffman, Saurabh Gupta, Jian Leong, Sergio Guadarrama, and Trevor Darrell. Cross-modal adaptation for RGB-D detection. In Robotics and Automation (ICRA), 2016 IEEE International Conference on, pages 5032–5039. IEEE, 2016.

[23] Judy Hoffman, Dequan Wang, Fisher Yu, and Trevor Darrell. FCNs in the wild: Pixel-level adversarial and constraint-based adaptation. arXiv preprint arXiv:1612.02649, 2016.

[24] Vicky Kalogeiton, Vittorio Ferrari, and Cordelia Schmid. Analysing domain shift factors between videos and images for object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(11):2327–2334, 2016.

[25] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[26] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.

[27] Gregory Koch. Siamese neural networks for one-shot image recognition. Thesis, University of Toronto, 2015.

[28] Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton. ImageNet classification with deep convolutional neural networks. In Neural Information Processing Systems (NIPS), 2012.

[29] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition.
Proceedings of the IEEE, 86(11):2278–2324, 1998.

[30] Yanghao Li, Naiyan Wang, Jianping Shi, Jiaying Liu, and Xiaodi Hou. Revisiting batch normalization for practical domain adaptation. arXiv preprint arXiv:1603.04779, 2016.

[31] Joseph J Lim, Ruslan Salakhutdinov, and Antonio Torralba. Transfer learning by borrowing examples for multiclass object detection. In Proceedings of the 24th International Conference on Neural Information Processing Systems, pages 118–126. Curran Associates Inc., 2011.

[32] Ming-Yu Liu, Thomas Breuel, and Jan Kautz. Unsupervised image-to-image translation networks. arXiv preprint arXiv:1703.00848, 2017.

[33] Ming-Yu Liu and Oncel Tuzel. Coupled generative adversarial networks. In Advances in Neural Information Processing Systems, pages 469–477, 2016.

[34] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. SSD: Single shot multibox detector. In European Conference on Computer Vision, pages 21–37. Springer, 2016.

[35] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.

[36] Mingsheng Long, Yue Cao, Jianmin Wang, and Michael I Jordan. Learning transferable features with deep adaptation networks. In ICML, pages 97–105, 2015.

[37] Mingsheng Long, Jianmin Wang, and Michael I Jordan. Deep transfer learning with joint adaptation networks. arXiv preprint arXiv:1605.06636, 2016.

[38] Mingsheng Long, Han Zhu, Jianmin Wang, and Michael I Jordan. Unsupervised domain adaptation with residual transfer networks. In Advances in Neural Information Processing Systems, pages 136–144, 2016.

[39] Zelun Luo, Boya Peng, De-An Huang, Alexandre Alahi, and Li Fei-Fei. Unsupervised learning of long-term motion dynamics for videos.
arXiv preprint arXiv:1701.01821, 2017.

[40] Ishan Misra, C Lawrence Zitnick, and Martial Hebert. Shuffle and learn: Unsupervised learning using temporal order verification. In European Conference on Computer Vision, pages 527–544. Springer, 2016.

[41] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, volume 2011, page 5, 2011.

[42] Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759, 2016.

[43] Maxime Oquab, Leon Bottou, Ivan Laptev, and Josef Sivic. Learning and transferring mid-level image representations using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1717–1724, 2014.

[44] Gintautas Palubinskas, Xavier Descombes, and Frithjof Kruggel. An unsupervised clustering method using the entropy minimization. In AAAI, 1999.

[45] Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, 2010.

[46] Deepak Pathak, Ross Girshick, Piotr Dollár, Trevor Darrell, and Bharath Hariharan. Learning features by watching objects move. arXiv preprint arXiv:1612.06370, 2016.

[47] Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. In International Conference on Learning Representations, volume 1, page 6, 2017.

[48] Ali Sharif Razavian, Hossein Azizpour, Josephine Sullivan, and Stefan Carlsson. CNN features off-the-shelf: An astounding baseline for recognition. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR Workshops 2014, Columbus, OH, USA, June 23-28, 2014, pages 512–519, 2014.

[49] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi.
You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 779–788, 2016.

[50] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.

[51] Artem Rozantsev, Mathieu Salzmann, and Pascal Fua. Beyond sharing weights for deep domain adaptation. arXiv preprint arXiv:1603.06432, 2016.

[52] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.

[53] Ruslan Salakhutdinov and Geoffrey Hinton. Deep Boltzmann machines. In Artificial Intelligence and Statistics, pages 448–455, 2009.

[54] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.

[55] Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems, pages 568–576, 2014.

[56] Jake Snell, Kevin Swersky, and Richard S Zemel. Prototypical networks for few-shot learning. arXiv preprint arXiv:1703.05175, 2017.

[57] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.

[58] Baochen Sun and Kate Saenko. Deep CORAL: Correlation alignment for deep domain adaptation. In Computer Vision–ECCV 2016 Workshops, pages 443–450. Springer, 2016.

[59] Yaniv Taigman, Adam Polyak, and Lior Wolf.
Unsupervised cross-domain image generation. arXiv preprint arXiv:1611.02200, 2016.

[60] Kevin Tang, Vignesh Ramanathan, Li Fei-Fei, and Daphne Koller. Shifting weights: Adapting object detectors from image to video. In Advances in Neural Information Processing Systems, pages 638–646, 2012.

[61] Tatiana Tommasi, Francesco Orabona, and Barbara Caputo. Safety in numbers: Learning categories from few examples with multi model knowledge transfer. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 3081–3088. IEEE, 2010.

[62] Eric Tzeng, Judy Hoffman, Trevor Darrell, and Kate Saenko. Simultaneous deep transfer across domains and tasks. In Proceedings of the IEEE International Conference on Computer Vision, pages 4068–4076, 2015.

[63] Eric Tzeng, Judy Hoffman, Trevor Darrell, and Kate Saenko. Simultaneous deep transfer across domains and tasks. In International Conference in Computer Vision (ICCV), 2015.

[64] Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. Adversarial discriminative domain adaptation. In Computer Vision and Pattern Recognition (CVPR), 2017.

[65] Eric Tzeng, Judy Hoffman, Ning Zhang, Kate Saenko, and Trevor Darrell. Deep domain confusion: Maximizing for domain invariance. arXiv preprint arXiv:1412.3474, 2014.

[66] Aaron van den Oord, Nal Kalchbrenner, Lasse Espeholt, Oriol Vinyals, Alex Graves, et al. Conditional image generation with PixelCNN decoders. In Advances in Neural Information Processing Systems, pages 4790–4798, 2016.

[67] Laurens van der Maaten. Accelerating t-SNE using tree-based algorithms. Journal of Machine Learning Research, 15(1):3221–3245, 2014.

[68] Laurens van der Maaten and Geoffrey Hinton. Visualizing non-metric similarities in multiple maps. Machine Learning, 87(1):33–55, 2012.

[69] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol.
Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, pages 1096–1103. ACM, 2008.

[70] Oriol Vinyals, Charles Blundell, Tim Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pages 3630–3638, 2016.

[71] Karl Weiss, Taghi M Khoshgoftaar, and DingDing Wang. A survey of transfer learning. Journal of Big Data, 3(1):1–40, 2016.

[72] Richard Zhang, Phillip Isola, and Alexei A Efros. Colorful image colorization. In European Conference on Computer Vision, pages 649–666. Springer, 2016.

[73] Xu Zhang, Felix Xinnan Yu, Shih-Fu Chang, and Shengjin Wang. Deep transfer network: Unsupervised domain adaptation. arXiv preprint arXiv:1503.00591, 2015.