{"title": "Domain Generalization via Model-Agnostic Learning of Semantic Features", "book": "Advances in Neural Information Processing Systems", "page_first": 6450, "page_last": 6461, "abstract": "Generalization capability to unseen domains is crucial for machine learning models when deploying to real-world conditions. We investigate the challenging problem of domain generalization, i.e., training a model on multi-domain source data such that it can directly generalize to target domains with unknown statistics. We adopt a model-agnostic learning paradigm with gradient-based meta-train and meta-test procedures to expose the optimization to domain shift. Further, we introduce two complementary losses which explicitly regularize the semantic structure of the feature space. Globally, we align a derived soft confusion matrix to preserve general knowledge of inter-class relationships. Locally, we promote domain-independent class-specific cohesion and separation of sample features with a metric-learning component. The effectiveness of our method is demonstrated with new state-of-the-art results on two common object recognition benchmarks. Our method also shows consistent improvement on a medical image segmentation task.", "full_text": "Domain Generalization via\n\nModel-Agnostic Learning of Semantic Features\n\nQi Dou Daniel C. Castro Konstantinos Kamnitsas Ben Glocker\n\nBiomedical Image Analysis Group, Imperial College London, UK\n\n{qi.dou,dc315,kk2412,b.glocker}@imperial.ac.uk\n\nAbstract\n\nGeneralization capability to unseen domains is crucial for machine learning models\nwhen deploying to real-world conditions. We investigate the challenging problem\nof domain generalization, i.e., training a model on multi-domain source data such\nthat it can directly generalize to target domains with unknown statistics. We adopt\na model-agnostic learning paradigm with gradient-based meta-train and meta-test\nprocedures to expose the optimization to domain shift. 
Further, we introduce\ntwo complementary losses which explicitly regularize the semantic structure of\nthe feature space. Globally, we align a derived soft confusion matrix to preserve\ngeneral knowledge about inter-class relationships. Locally, we promote domain-\nindependent class-speci\ufb01c cohesion and separation of sample features with a\nmetric-learning component. The effectiveness of our method is demonstrated with\nnew state-of-the-art results on two common object recognition benchmarks. Our\nmethod also shows consistent improvement on a medical image segmentation task.\n\n1\n\nIntroduction\n\nMachine learning methods have achieved remarkable success, under the assumption that training and\ntest data are sampled from the same distribution. In real-world applications, this assumption is often\nviolated as conditions for data acquisition may change, and a trained system may fail to produce\naccurate predictions for unseen data with domain shift. To tackle this issue, domain adaptation\nalgorithms normally learn to align source and target data in a domain-invariant discriminative feature\nspace [6, 11, 19, 32, 33, 42, 43, 50, 51]. These methods rely on access to a few labelled [6, 42, 50] or\nunlabelled [11, 19, 32, 33, 43, 51] data samples from the target distribution during training.\nAn arguably harder problem is domain generalization, which aims to train a model using multi-domain\nsource data, such that it can directly generalize to new domains without need of retraining. This\nsetting is very different to domain adaptation as no information about the new domains is available,\na scenario that is encountered in real-world applications. In the \ufb01eld of healthcare, for example,\nmedical images acquired at different sites can differ signi\ufb01cantly in their data distribution, due to\nvarying scanners, imaging protocols or patient cohorts. 
At deployment, each new hospital can be\nregarded as a new domain but it is impractical to collect data each time to adapt a trained system.\nLearning a model which directly generalizes to new clinical sites would be of great practical value.\nDomain generalization is an active research area with a number of approaches being proposed. As no\na priori knowledge of the target distribution is available, the key question is how to guide the model\nlearning to capture information which is discriminative for the speci\ufb01c task but insensitive to changes\nof domain-speci\ufb01c statistics. For computer vision applications, the aim is to capture general semantic\nfeatures for object recognition. Previous work has demonstrated that this can be investigated through\nregularization of the feature space, e.g., by minimizing divergence between marginal distributions of\ndata sources [35], or joint consideration of the class conditional distributions [30]. Li et al. [28] use\nadversarial feature alignment via maximum mean discrepancy. Leveraging distance metrics of feature\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fvectors is another method [20, 34]. Model-agnostic meta-learning [10] is a recent gradient-based\nmethod for fast adaptation of models to new conditions, e.g., a new task at few-shot learning. Meta-\nlearning has been introduced to address domain generalization [1, 26, 31], by adopting an episodic\ntraining paradigm, i.e., splitting the available source domains into meta-train and meta-test at each\niteration, to simulate domain shift. Promising performance has been demonstrated by deriving the\nloss from a task error [26], a classi\ufb01er regularizer [1], or a predictive feature-critic module [31].\nWe introduce two complementary losses which explicitly regularize the semantic structure of the\nfeature space via a model-agnostic episodic learning procedure. 
Our optimization objective encour-\nages the model to learn semantically consistent features across training domains that may generalize\nbetter to unseen domains. Globally, we align a derived soft confusion matrix to preserve inter-class\nrelationships. Locally, we use a metric-learning component to encourage domain-independent while\nclass-speci\ufb01c cohesion and separation of sample features. The effectiveness of our approach is\ndemonstrated with new state-of-the-art performance on two common object recognition benchmarks.\nOur method also shows consistent improvement on a medical image segmentation task. Code for our\nproposed method is available at: https://github.com/biomedia-mira/masf.\n\n2 Related Work\n\nDomain adaptation is based on the central theme of bounding the target error by the source error\nplus a discrepancy metric between the target and the source [2]. This is practically performed by\nnarrowing the domain shift between the target and source either in input space [19], feature space [6,\n11, 32, 42, 51], or output space [33, 43, 49], generally using maximum mean discrepancy [15, 46] or\nadversarial learning [14]. The success of methods operating on feature representations motivates us\nto optimize the semantic feature space for domain generalization in this paper.\nDomain generalization aims to generalize models to unseen domains without knowledge about the\ntarget distribution during training. Different methods have been proposed for learning generalizable\nand transferable representations. A promising direction is to extract task-speci\ufb01c but domain-invariant\nfeatures [12, 28, 30, 34, 35]. Muandet et al. [35] propose a domain-invariant component analysis\nmethod with a kernel-based optimization algorithm to minimize the dissimilarity across domains.\nGhifary et al. [12] learn multi-task auto-encoders to extract invariant features which are robust to\ndomain variations. Li et al. 
[30] consider the conditional distribution of label space over input space,\nand minimize discrepancy of a joint distribution. Motiian et al. [34] use contrastive loss to guide\nsamples from the same class being embedded nearby in latent space across data sources. Li et al. [28]\nextend adversarial autoencoders by imposing maximum mean discrepancy measure to align multi-\ndomain distributions. Instead of harmonizing the feature space, others use low-rank parameterized\nCNNs [25] or decompose network parameters to domain-speci\ufb01c/-invariant components [22]. Data\naugmentation strategies, such as gradient-based domain perturbation [47] or adversarially perturbed\nsamples [53] demonstrate effectiveness for model generalization. A recent method with state-of-the-\nart performance is JiGen [3], which leverages self-supervised signals by solving jigsaw puzzles.\nMeta-learning (a.k.a. learning to learn [44, 48]) is a long standing topic exploring the training of\na meta-learner that learns how to train particular models [10, 29, 36, 37]. Recently, gradient-based\nmeta-learning methods [10, 36] have been successfully applied to few-shot learning, with a procedure\npurely leveraging gradient descent. The episodic training paradigm, originated from model-agnostic\nmeta-learning (MAML) [10], has been introduced to address domain generalization [1, 26, 27, 31].\nEpi-FCR [27] alternates domain-speci\ufb01c feature extractors and classi\ufb01ers across domains via episodic\ntraining, but without using inner gradient descent update. The method of MLDG [26] closely follows\nthe update rule of MAML, back-propagating the gradients from an ordinary task loss on meta-test\ndata. A limitation is that using the task objective might be sub-optimal, as it is highly abstracted\nfrom the feature representations (only using class probabilities). Moreover, it may not well \ufb01t the\nscenario where target data are unavailable (as pointed out by Balaji et al. [1]). 
A recent method,\nMetaReg [1], learns a regularization function (e.g., weighted L1 loss) particularly for the network\u2019s\nclassi\ufb01cation layer, excluding the feature extractor. Instead, Li et al. [31] propose a feature-critic\nnetwork which learns an auxiliary meta loss (producing a non-negative scalar) depending on output of\nthe feature extractor. Both [1] and [31] lack notable guidance from semantics of feature space, which\nmay contain crucial domain-independent \u2018general knowledge\u2019 for model generalization. Our method\nis orthogonal to previous work, proposing to enforce semantic features via global class alignment and\nlocal sample clustering, with losses explicitly derived in an episodic learning procedure.\n\n2\n\n\fsemantic\n\nfeature space\n\nclass\n\nrelationships\n\nF\u03c8(cid:48)\n\nT\u03b8(cid:48)\n\n(cid:96)i,j\nglobal\n\nF\u03c8\n\nT\u03b8\n\nF\u03c8(cid:48)\n\nT\u03b8(cid:48)\n\n\u2207Ltask\n\n\u2207Lglobal\n\nM\u03c6\n\n\u2207Llocal\n\n(a)\n\nDi\n\nDj\n\n(b)\n\nlearned\nmetric\n\nM\u03c6\n\nd\u03c6\n\nD1\n\nD2\n\nD3\n\nLlocal\n\n(c)\n\nFigure 1: An overview of the proposed model-agnostic learning of semantic features (MASF):\n(a) episodic training under simulated domain shift, with gradient \ufb02ows indicated; (b) global alignment\nof class relationships; (c) local sample clustering, towards cohesion and separation. F\u03c8 and T\u03b8 are\nthe feature extractor and the task net, F\u03c8(cid:48) and T\u03b8(cid:48) are their updated versions by inner gradient descent\non the task loss Ltask, the M\u03c6 is a metric embedding net, and Dk denotes different source domains.\n\nn , y(k)\n\nn )}Nk\n\n3 Method\nIn the following, we denote input and label spaces by X and Y, the domains D = {D1, D2, . . . , DK}\nare different distributions on the joint space X \u00d7 Y. Since domain generalization involves a common\npredictive task, the label space is shared by all domains. 
In each domain, samples are drawn from a dataset Dk = {(x_n^(k), y_n^(k))}_{n=1}^{Nk}, where
Nk is the number of labeled data points in the k-th domain. The domain generalization (DG) setting
further assumes the existence of domain-invariant patterns in the inputs (e.g. semantic features),
which can be extracted to learn a label predictor that performs well across seen and unseen domains.
Unlike domain adaptation, DG assumes no access to observations from, or explicit knowledge about,
the target distribution.

In this work, we consider a classification model composed of a feature extractor, Fψ : X → Z, where
Z is a feature space (typically of much lower dimension than X), and a task network, Tθ : Z → R^C,
where C is the number of classes in Y. The final class predictions are given by
p(y | x; ψ, θ) = ŷ = softmax(Tθ(Fψ(x))), where softmax(a)_r = e^{a_r} / Σ_{r′} e^{a_{r′}}.¹
The parameters (ψ, θ) are optimized with respect to a task-specific loss Ltask, e.g. cross-entropy:
ℓtask(y, ŷ) = −Σ_c 1[y = c] log ŷ_c.

Although the minimization of Ltask may produce highly discriminative features z = Fψ(x), and
hence an excellent predictor for data from the training domains, nothing in this process prevents the
model from overfitting to the source domains and suffering from degradation on unseen test domains.
We therefore propose to optimize the feature space such that its semantic structure is insensitive to
the different training domains and generalizes better to new, unseen domains. Figure 1 gives an
overview of our model-agnostic learning of semantic features (MASF), which we detail in this section.

3.1 Model-Agnostic Learning with Episodic Training

The key element of our learning procedure is an episodic training scheme, originating from
model-agnostic meta-learning [10], that exposes the model optimization to distribution mismatch.
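As a quick reference, the prediction pipeline ŷ = softmax(Tθ(Fψ(x))) and the cross-entropy loss ℓtask defined above can be sketched in a few lines of NumPy. The linear "networks" and all shapes below are toy stand-ins for the paper's actual CNN, purely for illustration:

```python
import numpy as np

def softmax(a):
    # softmax(a)_r = exp(a_r) / sum_r' exp(a_r'), stabilized by subtracting the max
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(y, y_hat, eps=1e-12):
    # l_task(y, y_hat) = -sum_c 1[y == c] * log(y_hat_c), averaged over the batch
    return -np.log(y_hat[np.arange(len(y)), y] + eps).mean()

rng = np.random.default_rng(0)
W_psi = rng.standard_normal((16, 8))   # toy stand-in for the feature extractor F_psi: X -> Z
W_theta = rng.standard_normal((8, 5))  # toy stand-in for the task network T_theta: Z -> R^C

x = rng.standard_normal((4, 16))       # a mini-batch of 4 inputs
z = x @ W_psi                          # features z = F_psi(x)
y_hat = softmax(z @ W_theta)           # class posteriors; each row sums to 1
```

Minimizing `cross_entropy` over (ψ, θ) is exactly the Ltask term used throughout the episodic procedure below.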
In line with our goal of domain generalization, the model is trained on a sequence of simulated
episodes with domain shift. Specifically, at each iteration, the available domains D are randomly
split into sets of meta-train domains Dtr and meta-test domains Dte. The model is trained such that,
after being optimized with one or more steps of gradient descent on the Dtr domains, it also performs
well, in a semantic sense, on the held-out Dte. In our case, the feature extractor's and task network's
parameters, ψ and θ, are first updated with the task-specific supervised loss Ltask (e.g. cross-entropy
for classification), computed on meta-train:

(ψ′, θ′) = (ψ, θ) − α ∇ψ,θ Ltask(Dtr; ψ, θ),   (1)

where α is a learning-rate hyperparameter. This results in a predictive model Tθ′ ◦ Fψ′ with improved
task accuracy on the meta-train source domains, Dtr. Once this optimized set of parameters has been
obtained, we can apply a meta-learning step, aiming to enforce certain properties that we desire the
model to exhibit on the held-out domains Dte.
Crucially, the objective function quantifying these properties, Lmeta, is computed based on the
updated parameters, (ψ′, θ′), while the gradients are computed with respect to the original parameters,
(ψ, θ). Intuitively, besides the task itself, the training procedure is learning how to generalize under
domain shift. In other words, parameters are updated such that future updates with the given source
domains also improve the model regarding some generalizable aspects on unseen target domains.
In particular, we desire the feature space to encode semantically relevant properties: features from
different domains should respect inter-class relationships, and they should be compactly clustered
by class labels regardless of domain (cf. Alg. 1).

Algorithm 1 Model-agnostic learning of semantic features for domain generalization
Input: Source training domains D = {Dk}_{k=1}^K; hyperparameters α, η, γ, β1, β2 > 0
Output: Feature extractor Fψ, task network Tθ, embedding network Mφ
1: repeat
2:   Randomly split source domains D into disjoint meta-train Dtr and meta-test Dte
3:   (ψ′, θ′) ← (ψ, θ) − α ∇ψ,θ Ltask(Dtr; ψ, θ)
4:   Compute global class alignment loss:   // Section 3.2
       Lglobal ← (1/|Dtr|)(1/|Dte|) Σ_{Di∈Dtr} Σ_{Dj∈Dte} ℓglobal(Di, Dj; ψ′, θ′)
5:   Compute local sample clustering loss:   // Section 3.3
       Llocal(D; ψ′, φ) ← E_D[ℓcon^{n,m}] or E_D[ℓtri^{a,p,n}]
6:   Lmeta ← β1 Lglobal + β2 Llocal
7:   (ψ, θ) ← (ψ, θ) − η ∇ψ,θ (Ltask + Lmeta)
8:   φ ← φ − γ ∇φ Llocal
9: until convergence

¹For image segmentation, Fψ extracts feature maps and the task network Tθ is applied pixel-wise.
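The inner/outer structure of this episodic update can be illustrated with a minimal NumPy sketch on a toy linear-regression "task". Everything here is invented for illustration (domains, data, learning rates), and the outer step uses a first-order approximation of the meta-gradient with the meta-test task error standing in for Lmeta; the actual method back-propagates through the inner update (second-order gradients) and uses the semantic losses of Sections 3.2 and 3.3:

```python
import numpy as np

rng = np.random.default_rng(0)
w_true = rng.standard_normal(3)  # the shared, domain-invariant concept

def make_domain(shift, n=64):
    # toy "domain": same linear concept, domain-specific input statistics
    X = rng.standard_normal((n, 3)) + shift
    return X, X @ w_true

def mse(w, X, y):
    return float(np.mean((X @ w - y) ** 2))

def grad_mse(w, X, y):
    # gradient of the task loss L_task(w) = mean((Xw - y)^2)
    return 2 * X.T @ (X @ w - y) / len(y)

alpha, eta = 0.05, 0.05
w = np.zeros(3)
domains = [make_domain(s) for s in (-1.0, 0.0, 1.0)]
for _ in range(200):
    idx = rng.permutation(3)
    (Xtr, ytr), (Xte, yte) = domains[idx[0]], domains[idx[1]]  # meta-train / meta-test split
    w_inner = w - alpha * grad_mse(w, Xtr, ytr)                # Eq. (1): inner task update
    # outer step, cf. Alg. 1 line 7: meta-train task gradient plus the meta-test
    # loss evaluated at the *updated* parameters (first-order approximation here)
    w = w - eta * (grad_mse(w, Xtr, ytr) + grad_mse(w_inner, Xte, yte))
```

After training, `w` fits all three toy domains, since the meta-test term only rewards updates that also help the held-out domain.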
In the remainder of this section we describe the design of our semantic meta-objective,
Lmeta = β1 Lglobal + β2 Llocal, composed of a global class alignment term and a local sample
clustering term, with weighting coefficients β1, β2 > 0.

3.2 Global Class Alignment Objective

Relationships between class concepts exist in a purely semantic space, independent of changes in the
observation domain. In light of this, compared with individual hard label predictions, aligning class
relationships can promote more transferable knowledge towards model generalization. This is also
noted by Tzeng et al. [50] in the context of domain adaptation, who aggregate the output probability
distributions when fine-tuning the model on a few labelled target data. In contrast to their work, our
goal is to structure the feature space itself to preserve learned class relationships on unseen data, by
means of explicit regularization.

Specifically, we formulate this objective in a manner that imposes a global layout of extracted features,
such that the relative locations of features from different classes embody the inherent similarity in
semantic structures. Inspired by knowledge distillation from neural networks [18], we exploit what
the model has learned about class ambiguities, in the form of per-class soft labels, and enforce
them to be consistent between the Dtr and Dte domains. For each domain k, we summarize the model's
current 'concept' of each class c by computing the class-specific mean feature vector z̄_c^(k):

z̄_c^(k) = (1/N_k^(c)) Σ_{n: y_n^(k)=c} Fψ′(x_n^(k)) ≈ E_Dk[Fψ′(x) | y = c],   (2)

where N_k^(c) is the number of samples in domain Dk labelled as class c. The obtained z̄_c^(k)
conveys how samples from a particular class are generally represented.
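Eq. (2) is simply a per-class average of feature vectors within each domain; a minimal sketch with toy arrays (stand-ins for the CNN features Fψ′(x)):

```python
import numpy as np

def class_mean_features(z, y, num_classes):
    # Eq. (2): z_bar_c = (1 / N_c) * sum over samples n with y_n == c of z_n
    # z: (num_samples, feature_dim) features of one domain; returns (num_classes, feature_dim)
    return np.stack([z[y == c].mean(axis=0) for c in range(num_classes)])

z = np.array([[0.0, 2.0],
              [2.0, 0.0],
              [4.0, 4.0]])      # toy features of one domain
y = np.array([0, 0, 1])         # class labels
means = class_mean_features(z, y, num_classes=2)
```

Each row of `means` is one z̄_c^(k), i.e. the domain's current "concept" of class c.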
It is then forwarded to the task network Tθ′ to compute a soft label distribution s_c^(k) with a
'softened' softmax at temperature τ > 1 [18]:

s_c^(k) = softmax(Tθ′(z̄_c^(k)) / τ).   (3)

The collection of soft labels [s_c^(k)]_{c=1}^C represents a kind of 'soft confusion matrix' associated
with a particular domain, encoding the inter-class relationships learned by the model. Such
relationships should be preserved as general semantics on meta-test after updating the classification
model on meta-train (e.g., cartoon dogs are more easily misclassified as horses than as houses, which
likely holds in unseen domains). Standard supervised training with Ltask focuses only on the dominant
hard label prediction, so there is no a priori reason for such inter-class alignment to be consistent.
We therefore propose to align the soft class confusion matrices between two domains Di ∈ Dtr and
Dj ∈ Dte, by minimizing their symmetrized Kullback–Leibler (KL) divergence, averaged over all
C classes:

ℓglobal(Di, Dj; ψ′, θ′) = (1/C) Σ_{c=1}^C (1/2) [DKL(s_c^(i) ‖ s_c^(j)) + DKL(s_c^(j) ‖ s_c^(i))],   (4)

where DKL(p ‖ q) = Σ_r p_r log(p_r / q_r). Other symmetric divergences, such as Jensen–Shannon (JS),
could also be considered, although our preliminary experiments showed no significant difference
between JS and symmetrized KL. Finally, the global class alignment loss, Lglobal(Dtr, Dte; ψ′, θ′), is
calculated as the average of ℓglobal(Di, Dj; ψ′, θ′) over all pairs of available meta-train and meta-test
domains, (Di, Dj) ∈ Dtr × Dte (cf. Alg. 1).
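A sketch of the softened softmax of Eq. (3) and the symmetrized-KL alignment of Eq. (4); the per-class logit rows below are hypothetical numbers standing in for Tθ′(z̄_c) in two domains:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def soft_labels(logits, tau=2.0):
    # Eq. (3): 'softened' softmax at temperature tau > 1
    return softmax(logits / tau)

def kl(p, q):
    # D_KL(p || q) = sum_r p_r * log(p_r / q_r), for strictly positive p, q
    return float(np.sum(p * np.log(p / q)))

def l_global(s_i, s_j):
    # Eq. (4): symmetrized KL between two domains' soft confusion matrices
    # (one row per class), averaged over the C classes
    C = len(s_i)
    return sum(0.5 * (kl(p, q) + kl(q, p)) for p, q in zip(s_i, s_j)) / C

# hypothetical per-class mean-feature logits for domains i and j (C = 3 classes)
logits_i = np.array([[4.0, 1.0, 0.0], [0.5, 3.0, 1.0], [0.0, 1.0, 2.5]])
logits_j = np.array([[3.5, 1.5, 0.0], [0.0, 3.0, 1.5], [0.5, 1.0, 2.0]])
loss = l_global(soft_labels(logits_i), soft_labels(logits_j))
```

The loss is zero exactly when the two domains induce identical soft confusion matrices, and symmetric in its two arguments.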
The complexity of this computation is not problematic in practice, since the number of domains
selected in a training mini-batch is limited (as in MAML [10]), and in our experiments we took
|Dtr| = 2 and |Dte| = 1.

3.3 Local Sample Clustering Objective

In addition to promoting the alignment of class relationships across domains with Lglobal as defined
above, we further encourage robust semantic features that locally cluster according to class regardless
of the domain. This is crucial, as neither of the class-prediction-based losses (Ltask or Lglobal) ensures
that features of samples in the same class will lie close to each other and away from those of different
classes, a.k.a. feature compactness [21]. If the model cannot project the inputs to semantic
feature clusters with domain-independent class-specific cohesion and separation, the predictions may
suffer from ambiguous decision boundaries, and still be sensitive to unseen kinds of domain shift.
We therefore propose a local regularization objective Llocal to boost robustness, by increasing the
compactness of class-specific clusters while reducing their overlap. Note how this is complementary
to the global class alignment, which semantically structures the relative locations among class clusters.
Our preliminary experiments revealed that applying such regularization explicitly onto the features
may constrain the optimization for Ltask and Lglobal too heavily, hurting generalization performance
on unseen domains. We thus take a metric-learning approach, introducing an embedding network Mφ
that operates on the extracted features, z = Fψ′(x).
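To make this metric-learning component concrete, the following sketch implements the embedding distance and the two candidate clustering losses that are formally defined next (Eqs. (5)–(7)), together with the cheap O(N) pair-sampling estimator described below. The embeddings here are plain arrays standing in for the outputs of Mφ, and all numbers are toy values:

```python
import numpy as np

def d_phi(e_n, e_m):
    # Eq. (5): Euclidean distance between embeddings e = M_phi(z)
    return float(np.linalg.norm(e_n - e_m))

def contrastive_loss(d, same_class, margin=1.0):
    # Eq. (6): attract same-class pairs; repel different-class pairs
    # only up to the distance margin xi
    return d ** 2 if same_class else max(0.0, margin - d) ** 2

def triplet_loss(e_a, e_p, e_n, margin=1.0):
    # Eq. (7): the anchor-positive pair should be closer than the
    # anchor-negative pair by at least the margin xi (squared distances)
    d_ap = np.sum((e_a - e_p) ** 2)
    d_an = np.sum((e_a - e_n) ** 2)
    return float(max(0.0, d_ap - d_an + margin))

def local_loss_estimate(e, y, margin=1.0, seed=0):
    # O(N) unbiased estimate of the average pairwise contrastive loss:
    # shuffle, then evaluate only the (2i - 1, 2i) pairs
    idx = np.random.default_rng(seed).permutation(len(y))
    losses = [contrastive_loss(d_phi(e[idx[2 * i]], e[idx[2 * i + 1]]),
                               y[idx[2 * i]] == y[idx[2 * i + 1]], margin)
              for i in range(len(y) // 2)]
    return float(np.mean(losses))
```

Note this sketch omits the online semi-hard triplet mining used in the paper; it only shows the per-pair and per-triplet loss values themselves.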
This component represents a learnable distance function [5] between feature vectors (rather than
between raw inputs):

dφ(zn, zm) = ‖en − em‖2 = ‖Mφ(zn) − Mφ(zm)‖2.   (5)

The sample pairs (n, m) are randomly drawn from all source domains D, because we expect the
updated Fψ′ to harmonize the semantic feature space of Dte with that of Dtr, in terms of class-
specific clustering regardless of domain. The computed embeddings, e = Mφ(z), can then be
optimized with any suitable metric-learning loss Llocal(D; ψ′, φ) to regularize the local sample
clustering. Under mild domain shift, the contrastive loss [16] is a sensible choice, as it attempts to
separately collapse each group of same-class exemplars to a distinct single point. It might however be
over-restrictive for more extreme situations, wherein domains are semantically related but have
wildly distinct low-level statistics. For such cases, we propose instead to employ the triplet loss [45].

Contrastive loss is computed for pairs of samples, attracting samples of the same class and repelling
samples of different classes [16]. Instead of pushing clusters apart to infinity, the repulsion range is
bounded by a distance margin ξ. Our contrastive loss for a pair of samples (n, m) is defined as:

ℓcon^{n,m} = dφ(zn, zm)²  if yn = ym,   (max{0, ξ − dφ(zn, zm)})²  if yn ≠ ym.   (6)

The total loss for a training mini-batch, Llocal, is normally averaged over all pairs of samples. In cases
where full O(N²) enumeration is intractable (e.g. image segmentation, which would involve all
pairs of pixels in all images), we can obtain an unbiased O(N) estimator of the loss by e.g. shuffling
the samples and iterating over the (2i − 1, 2i) pairs with i = 1, . . .
, ⌊N/2⌋.

Triplet loss aims to make pairs of samples from the same class closer than pairs from different
classes, by a certain margin ξ [45]. Given one 'anchor' sample a, one 'positive' sample p (with
ya = yp), and one 'negative' sample n (with ya ≠ yn), we compute their triplet loss as follows:

ℓtri^{a,p,n} = max{0, dφ(za, zp)² − dφ(za, zn)² + ξ}.   (7)

Schroff et al. [45] argue that judicious triplet selection is essential for good convergence, as many
triplets may already satisfy this constraint and others may be too hard to contribute meaningfully to
the learning process. Here we adopt their proposed online 'semi-hard' triplet mining strategy, and
Llocal is the average over all selected triplets.

4 Experiments

We evaluate and compare our method on three datasets: 1) the classic VLCS domain generalization
benchmark for image classification, 2) the recently introduced PACS benchmark for object recognition
with challenging domain shift, and 3) a real-world medical imaging task of tissue segmentation in brain
MRI. Results with an in-depth analysis and an ablation study are presented in the following.

4.1 VLCS Dataset

VLCS [8] is a classic benchmark for domain generalization, which includes images from four datasets:
PASCAL VOC2007 (V) [7], LabelMe (L) [41], Caltech (C) [9], and SUN09 (S) [4]. The multi-class
object recognition task includes five classes: bird, car, chair, dog and person. We follow previous
work [3, 27, 34] in using the publicly available pre-extracted DeCAF6 features (4096-dimensional
vectors) for leave-one-domain-out validation, randomly dividing each domain into 70% training
and 30% test, and feeding the features to two fully connected layers with output sizes of 1024 and 128
and ReLU activation.
For our metric embedding Mφ (taking the 128-dimensional vector as input), we use two fully
connected layers with output sizes of 128 and 64. The triplet loss is adopted for computing Llocal, with
coefficient β2 = 0.005, such that it is on a similar scale to Ltask and Lglobal (β1 = 1). We use the
Adam optimizer [23] with η initialized to 10^−3 and exponentially decayed by 2% every 1k iterations.
For the inner optimization to obtain (ψ′, θ′), we clip the gradients by norm (threshold 2.0) to
prevent them from exploding, since this step uses plain, non-adaptive gradient descent (with learning
rate α = 10^−5). Note that, although performing gradient descent on Lmeta involves second-order
gradients with respect to (ψ, θ), their computation does not incur a substantial overhead in training
time [10]. We also employ an Adam optimizer for the meta-updates of φ, with learning rate γ = 10^−5
and no decay. The batch size is 128 for each source domain, on an Nvidia TITAN Xp 12 GB GPU. The
metric-learning margin hyperparameter ξ was chosen heuristically, based on observing the distances
within and between the clusters of class features. For our results, we report the average and standard
deviation over three independent runs.

Results. Table 1 shows the object recognition accuracies on the different target domains. Our DeepAll
baseline, i.e., merging all source domains and training Tθ ◦ Fψ by standard supervised learning on
Ltask with the same hyperparameters, achieves an average accuracy of 72.19% over the four domains.
Using our episodic training paradigm with regularization of the semantic feature space, we improve
the performance to 74.11%, setting the state-of-the-art accuracy on VLCS. We compare with eight
different methods (cf. Section 2) which report previous best results on this benchmark.
CCSA [34] combines a contrastive loss with the ordinary cross-entropy, without using an episodic
meta-update paradigm. Notably, our approach outperforms MLDG [26], indicating that explicitly
encouraging semantic properties in the feature space is superior to using a highly abstracted task
loss on meta-test.

Table 1: Domain generalization results on VLCS dataset with object recognition accuracy (%).

Source  Target | D-MTAE [12] | CIDDG [30] | CCSA [34] | DBADG [25] | MMD-AAE [28] | MLDG [26] | Epi-FCR [27] | JiGen [3] | DeepAll (Baseline) | MASF (Ours)
L,C,S   V      | 63.90 | 64.38 | 67.10 | 69.99 | 67.70 | 67.7 | 67.1 | 70.62 | 68.67±0.09 | 69.14±0.19
V,C,S   L      | 60.13 | 63.06 | 62.10 | 63.49 | 62.60 | 61.3 | 64.3 | 60.90 | 63.10±0.11 | 64.90±0.08
V,L,S   C      | 89.05 | 88.83 | 92.30 | 93.63 | 94.40 | 94.4 | 94.1 | 96.93 | 92.86±0.13 | 94.78±0.16
V,L,C   S      | 61.33 | 62.10 | 59.10 | 61.32 | 64.40 | 65.9 | 65.9 | 64.30 | 64.11±0.17 | 67.64±0.12
Average        | 68.60 | 69.59 | 70.15 | 72.11 | 72.28 | 72.3 | 72.9 | 73.19 | 72.19      | 74.11

4.2 PACS Dataset

The PACS dataset [25] is a recent benchmark with more severe distribution shift between domains,
making it more challenging than VLCS. It consists of four domains: art painting, cartoon, photo,
sketch, with objects from seven classes: dog, elephant, giraffe, guitar, house, horse, person.
Following practice in the literature [1, 3, 26, 27], we also use leave-one-domain-out cross-validation,
i.e., training on three domains and testing on the remaining unseen one, and adopt an AlexNet [24]
pre-trained on ImageNet [40]. The metric embedding Mφ is connected to the last fully connected
layer (i.e., the fc7 layer with a 4096-dimensional vector), by stacking two fully connected layers
with output sizes of 1024 and 256. For Llocal, we also use the triplet loss, with β2 = 0.005 and
β1 = 1.0, particularly considering the severe domain shift. We initialize the learning rates
α = η = γ = 10^−5 and clip inner gradients by norm. The batch size is 128 for each source domain.

Table 2: Domain generalization results on PACS dataset with recognition accuracy (%) using AlexNet.

Source  Target       | D-MTAE [12] | CIDDG [30] | DBADG [25] | MLDG [26] | Epi-FCR [27] | MetaReg [1] | JiGen [3] | DeepAll (Baseline) | MASF (Ours)
C,P,S   Art painting | 60.27 | 62.70 | 62.86 | 66.23 | 64.7 | 69.82 | 67.63 | 67.60±0.21 | 70.35±0.33
A,P,S   Cartoon      | 58.65 | 69.73 | 66.97 | 66.88 | 72.3 | 70.35 | 71.71 | 68.87±0.22 | 72.46±0.19
A,C,S   Photo        | 91.12 | 78.65 | 89.50 | 88.00 | 86.1 | 91.07 | 89.00 | 89.20±0.24 | 90.68±0.12
A,C,P   Sketch       | 47.68 | 64.45 | 57.51 | 58.96 | 65.0 | 59.26 | 65.18 | 61.13±0.30 | 67.33±0.12
Average              | 64.48 | 68.88 | 69.21 | 70.01 | 72.0 | 72.62 | 73.38 | 71.70      | 75.21

Figure 2: The t-SNE visualization of features extracted by Fψ, using (a-b) our proposed MASF
and (c-d) the DeepAll model on the PACS dataset. In (a) and (c), different colors indicate different
classes; correspondingly, in (b) and (d), different colors indicate different domains.

Results.
Table 2 summarizes the results of object recognition on the PACS dataset, with a comparison
to previous work (noting that not all compared methods report results on both VLCS and PACS).
MLDG [26] and MetaReg [1] employ episodic training with meta-learning, but from different angles
in terms of the meta-learner's objective (Li et al. [26] minimize the task error, Balaji et al. [1] learn
a classifier regularizer). The promising results for [1, 26, 27] indicate that exposing the training
procedure to domain shift benefits model generalization to unseen domains. Our method further
explicitly considers the semantic structure, regarding both global class alignment and local sample
clustering, yielding improved accuracy. Across all domains, our method increases average accuracy
by 3.51% over the baseline. Note that the current state-of-the-art JiGen [3] improves by 1.86% over
its own baseline. In addition, we observe an improvement of 6.20% when the unseen domain is sketch,
which has a distinct style and requires more general knowledge about semantic concepts.

Ablation analysis. We conduct an extensive study on the PACS benchmark to investigate two key
points: 1) the contribution of each component to our method's performance, and 2) how the semantic
feature space is influenced by our proposed meta losses. First, we test all possible combinations of
the key components: episodic meta-learning simulating domain shift, the global class alignment
loss, and the local sample clustering loss. Accuracies averaged over three runs for the different
combinations are given in Table 3. For example, the first row corresponds to the DeepAll baseline with
standard training by aggregating all source data. The fifth row directly adds the Lglobal and Llocal
losses on top of Ltask with the standard optimization scheme, i.e., without splitting D into meta-train
and meta-test domains.
From the ablation study, we observe that each component plays its own role in a complementary way. Specifically, the proposed losses that encourage semantic structure in the feature space yield an improvement over DeepAll, as well as over pure episodic training (the second row), which corresponds to our implementation of MLDG and thus enables a straightforward comparison. By further leveraging the gradient-based update paradigm, performance is further improved across all settings.

Figure 3: Analysis of the learning procedure: (a) margin of feature distance between sample negative pairs (with different classes) and positive pairs (with the same class), (b) class relationships alignment loss between unseen target domain and source domain, (c) Silhouette plot of the embeddings from meta metric-learning. Detailed analysis is in Section 4.2 for (a-b) and Section 4.3 for (c).

Table 3: Ablation study on key components of our method with the PACS dataset (accuracy, %).

Episodic  Lglobal  Llocal  Art          Cartoon      Photo        Sketch       Average
-         -        -       67.60±0.21   68.87±0.22   89.20±0.24   61.13±0.30   71.70
✓         -        -       69.19±0.10   70.66±0.37   90.36±0.18   59.89±0.26   72.52
-         ✓        -       69.43±0.29   70.22±0.21   90.64±0.15   60.11±0.17   72.60
-         -        ✓       69.50±0.15   70.25±0.13   90.12±0.12   63.02±0.12   73.22
-         ✓        ✓       69.48±0.20   71.15±0.16   90.16±0.15   64.73±0.34   73.88
✓         ✓        -       69.94±0.15   72.16±0.28   90.10±0.12   63.54±0.13   73.93
✓         -        ✓       69.50±0.20   71.44±0.34   90.16±0.15   64.97±0.28   74.02
✓         ✓        ✓       70.35±0.33   72.46±0.19   90.68±0.12   67.33±0.12   75.21

Table 4: PACS results with deep residual network architectures (accuracy, %).

                        ResNet-18                   ResNet-50
Source  Target          DeepAll      MASF (ours)    DeepAll      MASF (ours)
C,P,S   Art painting    77.38±0.15   80.29±0.18     81.41±0.16   82.89±0.16
A,P,S   Cartoon         75.65±0.11   77.17±0.08     78.61±0.17   80.49±0.21
A,C,S   Photo           94.25±0.09   94.99±0.09     94.83±0.06   95.01±0.10
A,C,P   Sketch          69.64±0.25   71.69±0.22     69.69±0.11   72.29±0.15

We utilize t-SNE [52] to analyze the feature space learned with our proposed model and the DeepAll baseline (cf. Fig. 2). It appears that our MASF model yields a better separation of classes. We also note that the sketch domain is further apart from art painting and cartoon, although all three are source domains in this experiment, possibly explained by the unique characteristics of sketches. In Figure 3 (a), we plot the difference of feature distances between samples of negative pairs and positive pairs, i.e., E[‖za − zn‖2 − ‖za − zp‖2]. For the two magenta lines, respectively for MASF and DeepAll, sample pairs are drawn from different training source domains. We see that both distance margins naturally increase as training progresses. The shaded area highlights that MASF yields a higher distance margin between classes compared to DeepAll, indicating that sample clusters are better separated with MASF. Similarly, for the two blue lines, sample pairs are from the unseen target domain and a source domain (randomly selected at each iteration). As expected, the margin is not as large as between training domains, yet our method still presents a notably bigger margin than the baseline. In Figure 3 (b), we plot ℓglobal, quantifying differences of average class posteriors between the unseen target domain and a source domain during training.
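To illustrate the quantity tracked in Figure 3 (b), the class-relationship alignment between two domains can be sketched as a symmetric KL divergence between per-class averages of temperature-softened posteriors. This is a simplified numpy sketch under assumed details (temperature value, averaging scheme), not the authors' exact implementation:

```python
import numpy as np

def softmax(logits, tau=2.0):
    """Temperature-softened softmax over the last axis (tau is an assumed value)."""
    z = logits / tau
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def class_alignment_loss(logits_a, labels_a, logits_b, labels_b, n_classes, eps=1e-8):
    """Symmetric KL between per-class average soft posteriors of two domains."""
    total, counted = 0.0, 0
    for c in range(n_classes):
        sel_a, sel_b = labels_a == c, labels_b == c
        if not (sel_a.any() and sel_b.any()):
            continue  # the class must appear in both domains to be compared
        pa = softmax(logits_a[sel_a]).mean(axis=0)  # "soft confusion" row for class c
        pb = softmax(logits_b[sel_b]).mean(axis=0)
        total += 0.5 * (np.sum(pa * np.log((pa + eps) / (pb + eps)))
                        + np.sum(pb * np.log((pb + eps) / (pa + eps))))
        counted += 1
    return total / max(counted, 1)
```

A value near zero indicates that the two domains induce the same inter-class relationships; Figure 3 (b) plots this gap between the unseen target domain and a source domain over training.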
We observe that the semantic inter-class relationships, conveying general knowledge about a recognition task, would not naturally converge and generalize to the unseen domain without explicit guidance.
Deeper architectures. In the interest of providing stronger baseline results, we perform additional preliminary experiments using more up-to-date deep residual architectures [17] with ResNet-18 and ResNet-50. Table 4 shows strong and consistent improvements of MASF over the DeepAll baseline in all PACS splits for both network architectures. This suggests our proposed algorithm is also beneficial for domain generalization with deeper feature extractors.

4.3 Tissue Segmentation in Multi-site Brain MRI

We evaluate our method on a real-world medical imaging task of brain tissue segmentation in T1-weighted MRI. Data was acquired from four clinical centers (denoted as Set-A/B/C/D). Domain shift occurs due to differences in scanners, acquisition protocols and many other factors, posing severe limitations for translating learning-based methods to clinical practice [13]. Figure 4 shows example images and intensity histograms.

Table 5: Evaluation of brain tissue segmentation (Dice coefficient, %) in different settings: columns 1–4: train model on single source domain, test on all domains; columns 5–6: train on three source domains, test on remaining domain.

        Train
Test    Set-A  Set-B  Set-C  Set-D  DeepAll  MASF
Set-A   90.62  88.91  88.81  85.03  89.09    89.82
Set-B   85.03  94.22  81.38  88.31  90.41    91.71
Set-C   93.14  92.80  95.40  88.68  94.30    94.50
Set-D   76.32  88.39  73.50  94.29  88.62    89.51

Figure 4: Different brain MRI datasets with example images and intensity histograms.
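Table 5 reports segmentation quality as Dice overlap. For reference, a generic per-class Dice computation on integer label maps (the standard metric, sketched here independently of our pipeline) is:

```python
import numpy as np

def dice_coefficient(pred, target, cls):
    """Dice overlap (in [0, 1]) for one class between integer label maps."""
    p = pred == cls
    t = target == cls
    denom = p.sum() + t.sum()
    if denom == 0:
        return 1.0  # class absent in both maps: perfect agreement by convention
    return 2.0 * np.logical_and(p, t).sum() / denom
```

The scores in Table 5 correspond to this quantity averaged over the three foreground classes (GM/WM/CSF).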
We adapt MASF for the segmentation of four classes: background, grey matter (GM), white matter (WM), and cerebrospinal fluid (CSF). We employ a U-Net [38], commonly used for this task. For Lglobal, the per-class average feature z̄c(k) is computed by averaging over all pixels of a class. Our metric embedding has two layers of 1×1 convolutions, with the contrastive loss for Llocal. We randomly split each domain into 80% for training and 20% for testing in our experimental settings.
Results. For easier comparison, we average the Dice scores achieved for the three foreground classes (GM/WM/CSF) and report them in Table 5. Although hard to notice visually from the gray-scale images, the domain shift from data distribution degrades segmentation significantly, by up to 10%. DeepAll is a strong baseline, yet our model-agnostic learning scheme provides consistent improvement over naively aggregating data from multiple sources, especially when generalizing to a new clinical site with relatively poorer imaging quality (i.e., Set-D). Figure 3 (c) is the Silhouette plot [39] of the embeddings from Mφ, demonstrating that the samples within the same class cluster are tightly grouped, as well as clearly separated from those of other classes.

5 Conclusions

We have presented promising results for a new approach to domain generalization of predictive models by incorporating global and local constraints for learning semantic feature spaces. The better generalization capability is demonstrated by new state-of-the-art results on popular benchmarks and a dense classification task (i.e., semantic segmentation) for medical images. The proposed loss functions are generally orthogonal to other algorithms, and evaluating the benefit of their integration is an appealing future direction. 
Our learning procedure could also be interesting to explore in the context of generative models, which may greatly benefit from semantic guidance when learning low-dimensional data representations from multiple sources.

Acknowledgements

This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant No 757173, project MIRA, ERC-2017-STG) and is supported by an EPSRC Impact Acceleration Award (EP/R511547/1). DCC is also partly supported by CAPES, Ministry of Education, Brazil (BEX 1500/2015-05).

References

[1] Yogesh Balaji, Swami Sankaranarayanan, and Rama Chellappa. MetaReg: Towards domain generalization using meta-regularization. In Advances in Neural Information Processing Systems (NeurIPS), pages 998–1008, 2018.

[2] Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. A theory of learning from different domains. Machine Learning, 79(1-2):151–175, 2010.

[3] Fabio Maria Carlucci, Antonio D'Innocente, Silvia Bucci, Barbara Caputo, and Tatiana Tommasi. Domain generalization by solving jigsaw puzzles. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

[4] Myung Jin Choi, Joseph J. Lim, Antonio Torralba, and Alan S. Willsky. Exploiting hierarchical context on a large database of object categories. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 129–136, 2010.

[5] Sumit Chopra, Raia Hadsell, and Yann LeCun. Learning a similarity metric discriminatively, with application to face verification. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 539–546, 2005.

[6] Hal Daumé III, Abhishek Kumar, and Avishek Saha. Co-regularization based semi-supervised domain adaptation. 
In Advances in Neural Information Processing Systems (NeurIPS), pages 478–486, 2010.

[7] Mark Everingham, Luc Van Gool, Christopher K. I. Williams, John Winn, and Andrew Zisserman. The Pascal Visual Object Classes (VOC) challenge. International Journal of Computer Vision, 88(2):303–338, 2010.

[8] Chen Fang, Ye Xu, and Daniel N. Rockmore. Unbiased metric learning: On the utilization of multiple datasets and web images for softening bias. In IEEE International Conference on Computer Vision (ICCV), pages 1657–1664, 2013.

[9] Li Fei-Fei, Rob Fergus, and Pietro Perona. Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. Computer Vision and Image Understanding, 106(1):59–70, 2007.

[10] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 1126–1135, 2017.

[11] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. Journal of Machine Learning Research, 17(1):2096–2030, 2016.

[12] Muhammad Ghifary, W Bastiaan Kleijn, Mengjie Zhang, and David Balduzzi. Domain generalization for object recognition with multi-task autoencoders. In IEEE International Conference on Computer Vision (ICCV), pages 2551–2559, 2015.

[13] Ben Glocker, Robert Robinson, Daniel C. Castro, Qi Dou, and Ender Konukoglu. Machine learning with multi-site imaging data: An empirical study on the impact of scanner effects. In Medical Imaging Meets NeurIPS Workshop, 2019.

[14] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. 
In Advances in Neural Information Processing Systems (NeurIPS), pages 2672–2680, 2014.

[15] Arthur Gretton, Dino Sejdinovic, Heiko Strathmann, Sivaraman Balakrishnan, Massimiliano Pontil, Kenji Fukumizu, and Bharath K. Sriperumbudur. Optimal kernel choice for large-scale two-sample tests. In Advances in Neural Information Processing Systems (NeurIPS), pages 1205–1213, 2012.

[16] Raia Hadsell, Sumit Chopra, and Yann LeCun. Dimensionality reduction by learning an invariant mapping. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, pages 1735–1742, 2006.

[17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.

[18] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. In NeurIPS 2014 Deep Learning and Representation Learning Workshop, 2014.

[19] Judy Hoffman, Eric Tzeng, Taesung Park, Jun-Yan Zhu, Phillip Isola, Kate Saenko, Alexei A. Efros, and Trevor Darrell. CyCADA: Cycle-consistent adversarial domain adaptation. In Proceedings of the 35th International Conference on Machine Learning (ICML), pages 1989–1998, 2018.

[20] Yen-Chang Hsu, Zhaoyang Lv, and Zsolt Kira. Learning to cluster in order to transfer across domains and tasks. In International Conference on Learning Representations (ICLR), 2018.

[21] Konstantinos Kamnitsas, Daniel C. Castro, Loic Le Folgoc, Ian Walker, Ryutaro Tanno, Daniel Rueckert, Ben Glocker, Antonio Criminisi, and Aditya Nori. Semi-supervised learning via compact latent space clustering. In Proceedings of the 35th International Conference on Machine Learning (ICML), pages 2459–2468, 2018.

[22] Aditya Khosla, Tinghui Zhou, Tomasz Malisiewicz, Alexei A Efros, and Antonio Torralba. Undoing the damage of dataset bias. 
In European Conference on Computer Vision (ECCV), pages 158–171, 2012.

[23] Diederik P. Kingma and Jimmy L. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.

[24] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NeurIPS), pages 1097–1105, 2012.

[25] Da Li, Yongxin Yang, Yi-Zhe Song, and Timothy M. Hospedales. Deeper, broader and artier domain generalization. In IEEE International Conference on Computer Vision (ICCV), pages 5542–5550, 2017.

[26] Da Li, Yongxin Yang, Yi-Zhe Song, and Timothy M. Hospedales. Learning to generalize: Meta-learning for domain generalization. In Thirty-Second AAAI Conference on Artificial Intelligence (AAAI), pages 3490–3497, 2018.

[27] Da Li, Jianshu Zhang, Yongxin Yang, Cong Liu, Yi-Zhe Song, and Timothy M. Hospedales. Episodic training for domain generalization. arXiv preprint arXiv:1902.00113, 2019.

[28] Haoliang Li, Sinno Jialin Pan, Shiqi Wang, and Alex C Kot. Domain generalization with adversarial feature learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5400–5409, 2018.

[29] Ke Li and Jitendra Malik. Learning to optimize. In International Conference on Learning Representations (ICLR), 2017.

[30] Ya Li, Xinmei Tian, Mingming Gong, Yajing Liu, Tongliang Liu, Kun Zhang, and Dacheng Tao. Deep domain generalization via conditional invariant adversarial networks. In European Conference on Computer Vision (ECCV), pages 624–639, 2018.

[31] Yiying Li, Yongxin Yang, Wei Zhou, and Timothy M Hospedales. Feature-critic networks for heterogeneous domain generalization. arXiv preprint arXiv:1901.11448, 2019.

[32] Mingsheng Long, Han Zhu, Jianmin Wang, and Michael I. Jordan. 
Unsupervised domain adaptation with residual transfer networks. In Advances in Neural Information Processing Systems (NeurIPS), pages 136–144, 2016.

[33] Yawei Luo, Liang Zheng, Tao Guan, Junqing Yu, and Yi Yang. Taking a closer look at domain shift: Category-level adversaries for semantics consistent domain adaptation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

[34] Saeid Motiian, Marco Piccirilli, Donald A Adjeroh, and Gianfranco Doretto. Unified deep supervised domain adaptation and generalization. In IEEE International Conference on Computer Vision (ICCV), pages 5716–5726, 2017.

[35] Krikamol Muandet, David Balduzzi, and Bernhard Schölkopf. Domain generalization via invariant feature representation. In Proceedings of the 30th International Conference on Machine Learning (ICML), pages 10–18, 2013.

[36] Alex Nichol, Joshua Achiam, and John Schulman. On first-order meta-learning algorithms. arXiv preprint arXiv:1803.02999, 2018.

[37] Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. In International Conference on Learning Representations (ICLR), 2017.

[38] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention (MICCAI), pages 234–241, 2015.

[39] Peter J. Rousseeuw. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20:53–65, 1987.

[40] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.

[41] Bryan C. Russell, Antonio Torralba, Kevin P. 
Murphy, and William T. Freeman. LabelMe: a database and web-based tool for image annotation. International Journal of Computer Vision, 77(1-3):157–173, 2008.

[42] Kate Saenko, Brian Kulis, Mario Fritz, and Trevor Darrell. Adapting visual category models to new domains. In European Conference on Computer Vision (ECCV), pages 213–226, 2010.

[43] Kuniaki Saito, Kohei Watanabe, Yoshitaka Ushiku, and Tatsuya Harada. Maximum classifier discrepancy for unsupervised domain adaptation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3723–3732, 2018.

[44] Jürgen Schmidhuber. Evolutionary principles in self-referential learning, or on learning how to learn: the meta-meta-... hook. PhD thesis, Technische Universität München, 1987.

[45] Florian Schroff, Dmitry Kalenichenko, and James Philbin. FaceNet: A unified embedding for face recognition and clustering. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 815–823, 2015.

[46] Dino Sejdinovic, Bharath Sriperumbudur, Arthur Gretton, and Kenji Fukumizu. Equivalence of distance-based and RKHS-based statistics in hypothesis testing. The Annals of Statistics, 41(5):2263–2291, 2013.

[47] Shiv Shankar, Vihari Piratla, Soumen Chakrabarti, Siddhartha Chaudhuri, Preethi Jyothi, and Sunita Sarawagi. Generalizing across domains via cross-gradient training. In International Conference on Learning Representations (ICLR), 2018.

[48] Sebastian Thrun and Lorien Pratt, editors. Learning to Learn. Springer, 1998.

[49] Yi-Hsuan Tsai, Wei-Chih Hung, Samuel Schulter, Kihyuk Sohn, Ming-Hsuan Yang, and Manmohan Chandraker. Learning to adapt structured output space for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7472–7481, 2018.

[50] Eric Tzeng, Judy Hoffman, Trevor Darrell, and Kate Saenko. Simultaneous deep transfer across domains and tasks. 
In IEEE International Conference on Computer Vision (ICCV), pages 4068–4076, 2015.

[51] Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. Adversarial discriminative domain adaptation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7167–7176, 2017.

[52] Laurens van der Maaten and Geoffrey E. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.

[53] Riccardo Volpi, Hongseok Namkoong, Ozan Sener, John C. Duchi, Vittorio Murino, and Silvio Savarese. Generalizing to unseen domains via adversarial data augmentation. In Advances in Neural Information Processing Systems (NeurIPS), pages 5334–5344, 2018.