{"title": "MetaReg: Towards Domain Generalization using Meta-Regularization", "book": "Advances in Neural Information Processing Systems", "page_first": 998, "page_last": 1008, "abstract": "Training models that generalize to new domains at test time is a problem of fundamental importance in machine learning. In this work, we encode this notion of domain generalization using a novel regularization function. We pose the problem of finding such a regularization function in a Learning to Learn (or) meta-learning framework. The objective of domain generalization is explicitly modeled by learning a regularizer that makes the model trained on one domain to perform well on another domain. Experimental validations on computer vision and natural language datasets indicate that our method can learn regularizers that achieve good cross-domain generalization.", "full_text": "MetaReg: Towards Domain Generalization using\n\nMeta-Regularization\n\nYogesh Balaji\n\nDepartment of Computer Science\n\nUniversity of Maryland\n\nCollege Park, MD\n\nyogesh@cs.umd.edu\n\nSwami Sankaranarayanan\u2217\n\nButter\ufb02y Network Inc.\n\nNewYork, NY\n\nswamiviv@butterflynetinc.com\n\nDepartment of Electrical and Computer Engineering\n\nRama Chellappa\n\nUniversity of Maryland\n\nCollege Park, MD\n\nrama@umiacs.umd.edu\n\nAbstract\n\nTraining models that generalize to new domains at test time is a problem of\nfundamental importance in machine learning. In this work, we encode this notion\nof domain generalization using a novel regularization function. We pose the\nproblem of \ufb01nding such a regularization function in a Learning to Learn (or) meta-\nlearning framework. The objective of domain generalization is explicitly modeled\nby learning a regularizer that makes the model trained on one domain to perform\nwell on another domain. 
Experimental validations on computer vision and natural language datasets indicate that our method can learn regularizers that achieve good cross-domain generalization.\n\n1 Introduction\n\nExisting machine learning algorithms, including deep neural networks, achieve good performance when the training and the test data are sampled from the same distribution. While this is a reasonable assumption to make, it might not hold true in practice. Deploying the perception system of an autonomous vehicle in new environments compared to its training setting might lead to failure owing to the shift in data distribution. Even strong learners such as deep neural networks are known to be sensitive to such domain shifts [9][33]. Approaches that resolve this issue in a domain adaptation framework have access to the target distribution. This is hardly true in practice: deploying real systems involves generalizing to unseen sources of data. This problem, also known as domain generalization, is the focus of this paper.\nMost machine learning models (including neural networks) are susceptible to domain shift: models trained on one dataset perform poorly on a related but shifted dataset. Approaches for addressing this issue can be broadly grouped into two major categories, domain adaptation and domain generalization. Domain adaptation techniques assume access to the target dataset: models are trained using a combination of labeled source data and unlabeled or sparsely labeled target data. Domain generalization, on the other hand, is a much harder problem, as we assume no access to target information. Instead, the variations in multiple source domains are utilized to generalize to novel test distributions.\n\n*Work done while at University of Maryland, College Park.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.\n\nFigure 1: Illustration of the proposed approach. 
Figure (a) depicts the network design - We employ a\nshared feature network F and p task networks {Ti}p\ni=1. Each task network Ti is trained only on the\ndata from domain i, and the shared network F is trained on all p source domains. The \ufb01gure on the\nright illustrates the optimization updates. At each iteration we sample a pair of domains (i, j) from\nthe training set. The black arrows are the SGD updates of the task network Ti trained on domain i.\nFrom each point in the black path, we take l gradient steps using the regularized loss and the samples\nfrom domain i to reach a new point \u2217. We then compute the loss on domain j at \u2217. The regularizer\nparameters \u03c6 are updated so that this meta-loss is minimized. This ensures that the task network Ti\ntrained with the proposed regularizer generalizes to domain j\n\nThe conventional approach to improve the generalization of a parametric model is to introduce\nregularization in the loss function [32]. In a Bayesian setting, regularization can be interpreted as\npriors on the parameters. Several regularization schemes have been proposed for neural networks\nincluding weight decay [17], Dropout [28], DropConnect [31], batch normalization [14], etc. While\nthese schemes have been shown to reduce test error on samples drawn from the same training\ndistribution, they do not generalize when there is a training-test distribution mismatch. Hence, the\nobjective of this work is to learn a regularizer that generalizes to novel distributions not present in the\ntraining data.\nDesigning such regularizers for achieving cross-domain generalization is a challenging problem. The\ndif\ufb01culty in mathematically modeling domain shifts makes it hard to design hand-crafted regularizers.\nInstead, we take a data-driven approach where we aim to learn the regularization function using the\nvariability in the source domains. 
We cast the problem of learning regularizers in a learning to learn,\nor meta-learning framework, which has received a resurgence in interest recently with applications\nincluding few-shot learning [7][24] and learning optimizers [20][1]. Similar to [7], we follow an\nepisodic paradigm where at each iteration, we sample an episode comprising meta-train and meta-test\ndata such that the domains contained in meta-train and meta-test sets are disjoint. The objective is\nthen to train the regularizer such that k steps of gradient descent using the meta-train data results in\ndecreasing the loss in the meta-test. This procedure is repeated for multiple episodes sampled from\nthe source dataset. After the regularizer is trained, we \ufb01ne-tune a new model on the entire source\ndataset using the trained regularizer.\nThe primary contribution of this work is that we propose a scheme for learning regularization\nfunctions that enable domain generalization. We show how the notion of domain generalization can\nbe explicitly encoded in a regularization function, which can then be used to train models that are\nmore robust to domain shifts. This framework is also scalable as the same regularizer can later be\nused to \ufb01ne-tune on a larger dataset. Experiments indicate that our approach can learn regularizers\nthat achieve good cross-domain generalization on benchmark domain generalization datasets.\n\n2 Related work\n\nDomain Adaptation Domain adaptation has received signi\ufb01cant attention in recent years from\nmachine learning, computer vision and natural language processing communities. Non-deep learning\n\n2\n\n\fapproaches to this problem include feature engineering methods [6], learning intermediate subspaces\nusing manifolds [11][12] and dictionaries [23], etc. Recent methods harness the expressive power\nof deep neural networks to learn domain invariant representations. 
The method in [9] attempts to reduce the distributional distance between source and target embeddings by formulating a minimax game between the feature network and a domain discriminator network. [5] uses stacked denoising autoencoders for learning robust data representations that adapt well to target domains. Other notable works include the use of Maximum Mean Discrepancy (MMD) [21], Generative Adversarial Networks (GAN) [25][26], co-training [4], etc.\n\nMeta-learning The concept of meta-learning (or learning to learn) has a long-standing history; some of the earlier works include [30][27]. Recently, there has been a lot of interest in applying such strategies to deep neural networks. One interesting application is the problem of learning the optimization updates of neural networks by casting it as a policy learning problem in a Markov decision process [1][20]. Few-shot learning is another problem where meta-learning strategies have been widely explored. [24] proposes an LSTM-based meta-learner for learning the optimization updates of a few-shot classifier. Instead of learning the updates, [7] learns transferable weight representations that quickly adapt to a new task using only a few samples. Other recent applications that use meta-learning include imitation learning [8], visual question answering [29], etc.\n\nDomain Generalization Unlike domain adaptation, domain generalization is a relatively less explored area of research. [22] proposes domain invariant component analysis, a kernel-based algorithm for minimizing the differences in the marginal distributions of multiple domains. [10] attempts to learn a domain-invariant feature representation by using multi-view autoencoders to perform cross-domain reconstructions. The method in [15] decomposes the parameters of a model (an SVM classifier) into domain-specific and domain-invariant components, and uses the domain-invariant parameters to make predictions on the unseen domain. 
[18] extends this idea to decompose the weights of deep neural networks using a multi-linear model and tensor decomposition.\nFinn et al. [7] recently proposed a model-agnostic meta-learning (MAML) procedure for few-shot learning problems. The objective of the MAML approach is to find a good initialization θ such that a few gradient steps from θ result in a good task-specific network. The focus of MAML is to adapt quickly in few-shot settings. Recently, [19] proposed a meta-learning based approach (MLDG) extending MAML to the domain generalization problem. This approach has the following limitations. First, the objective function of MAML is more suited to the fast task adaptation for which it was originally proposed; in domain generalization, however, we do not have access to samples from a new domain, so a MAML-like objective might not be effective. The second issue is scalability: it is hard to scale MLDG to deep architectures like Resnet [13]. Our approach attempts to tackle both these problems: (1) we explicitly address the notion of domain generalization in our episodic training procedure by using a regularizer to go from a task-specific representation to a task-general representation at each episode, and (2) we make our approach scalable by freezing the feature network and performing meta-learning only on the task network. This enables us to use our approach to train deeper models like Resnet-50. A similar approach for training meta-learning algorithms in feature space has been explored in a recent work [34].\n\n3 Method\n\n3.1 Problem Setup\n\nWe begin with a formal description of the domain generalization problem. Let X denote the instance space (which can be images, text, etc.) and Y denote the label space. Domain generalization involves data sampled from p source distributions and q target distributions, each containing data for performing the same task. Classification tasks are considered in this work. 
Hence, Y is the discrete set {1, 2, . . . , Nc}, where Nc denotes the number of classes. Let {D_i}_{i=1}^{p+q} represent the p + q distributions, each of which exists on the joint space X × Y. Let D_i = {(x_j^(i), y_j^(i))}_{j=1}^{N_i} represent the dataset sampled from the i-th distribution, i.e., each (x_j^(i), y_j^(i)) is drawn i.i.d. from D_i. In the rest of the paper, D_i is referred to as the i-th domain. Note that every D_i shares the same label space. In the domain generalization problem, each of the p + q domains has different domain statistics. The objective is to train models on the p source domains so that they generalize well to the q novel target domains.\n\nWe are interested in training a parametric model MΘ : X → Y using data only from the p source domains. In this work, we consider MΘ to be a deep neural network. We decompose the network M into a feature network F and a task network T, i.e., MΘ(x) = (Tθ ◦ Fψ)(x), where Θ = {ψ, θ}. Here, ψ denotes the weights of the feature network F, and θ denotes the weights of the task network. The output of MΘ(x) is a vector of dimension Nc whose i-th entry denotes the probability that the instance x belongs to class i. Standard neural network training involves minimizing the cross-entropy loss function given by Eq. (1):\n\nL(ψ, θ) = E_{(x,y)∼D}[−y · log(MΘ(x))] = Σ_{i=1}^{p} Σ_{j=1}^{N_i} −y_j^(i) · log(MΘ(x_j^(i)))    (1)\n\nHere, y_j^(i) is the one-hot representation of the label y_j^(i), and '·' denotes the dot product between two vectors. The above loss function does not take into account any factor that models domain shift, so generalization to a new domain is not expected. To accomplish this, we propose using a regularizer R(ψ, θ). The new loss function then becomes Lreg(ψ, θ) = L(ψ, θ) + R(ψ, θ). 
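To make the objective concrete, here is a minimal pure-Python sketch of the regularized loss for a single sample. The function names are our own, and the weighted-L1 form of R (the form adopted later in Section 3.3) is one illustrative choice among many:

```python
import math

# Minimal single-sample version of the regularized objective
# L_reg(psi, theta) = L + R. Function names and the weighted-L1 form of R
# are illustrative choices, not the paper's implementation.

def cross_entropy(probs, one_hot):
    # Eq (1) for one sample: -y . log(M_Theta(x))
    return -sum(y * math.log(p) for y, p in zip(one_hot, probs))

def weighted_l1(phi, theta):
    # R_phi(theta) = sum_i phi_i * |theta_i|
    return sum(p * abs(t) for p, t in zip(phi, theta))

def regularized_loss(probs, one_hot, phi, theta):
    # L_reg = L + R_phi
    return cross_entropy(probs, one_hot) + weighted_l1(phi, theta)

# Toy usage: a 3-class prediction and a 2-weight task network.
loss = regularized_loss([0.7, 0.2, 0.1], [1, 0, 0], phi=[0.5, -0.1], theta=[2.0, 3.0])
```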
The regularizer\nR(\u03c8, \u03b8) should capture the notion of domain generalization (i.e) it should enable generalization to a\nnew distribution with varied domain statistics. Designing such regularizers is hard in general, so we\npropose to learn it using meta learning.\n\n3.2 Learning the regularizer\n\nIn this work, we model the regularizer R as a neural network parametrized by weights \u03c6. Moreover,\nthe regularization is applied only on the parameters \u03b8 of the task network to enable scalable meta-\nlearning. So, the regularizer is denoted as R\u03c6(\u03b8) in the rest of the paper. We now discuss how the\nparameters of the regularizer R\u03c6(\u03b8) are estimated. In this stage of the training pipeline, the neural\nnetwork architecture consists of a feature network F and p task networks {Ti}p\ni=1 (with parameters\nof Ti denoted by \u03b8i) as shown in Fig. 1. Each Ti is trained only on the samples from domain i\nand F is the shared network trained on all p source domains. The reason for using p task networks\nis to enforce domain-speci\ufb01city in the models so that the regularizer can be trained to make them\ndomain-invariant.\nWe now describe the procedure for learning the regularizer:\n\n\u2022 Base model updates: We begin by training the shared network F and p task networks\n{Ti}p\ni=1 using supervised classi\ufb01cation loss L(\u03c8, \u03b8) given by Eq (1). Note that there is no\nregularization in this step. Let the network parameters at the kth step of this optimization be\ndenoted as [\u03c8(k), \u03b8(k)\n\n1 , . . . \u03b8(k)\np ].\n\n\u2022 Episode creation: To train R\u03c6(\u03b8), we follow an episodic training procedure similar to [19].\nLet a, b be two randomly chosen domains from the training set. Each episode contains data\npartitioned into two subsets - (1) m1 labeled samples from domain a denoted as metatrain\nset and (2) m2 labeled samples from domain b denoted as metatest set. 
The domains contained in the two sets are disjoint, i.e., a ≠ b, and the data is sampled only from the source distributions, i.e., a, b ∈ {1, 2, . . . , p}.\n\n• Regularizer updates: At iteration k, a new task network Tnew is initialized with θ_a^(k), the base model's task network parameters of the a-th domain at iteration k. Using the samples from the metatrain set (which contains domain a), l steps of gradient descent are performed on Tnew with the regularized loss function Lreg(ψ, θ). Let θ̂_a^(k) denote the parameters of Tnew after these l gradient steps. We treat each update of the network Tnew as a separate variable in the computational graph, so θ̂_a^(k) depends on φ through these l gradient steps. The unregularized loss on the metatest set computed using Tnew (with parameters θ̂_a^(k)) is then minimized with respect to the regularizer parameters φ; each regularizer update therefore unrolls through the l gradient steps. This entire procedure can be expressed by the following set of equations:\n\n
This update ensures that l steps of gradient descent using the regularized\nloss on samples from domain a results in task network a performing well on domain b.\nIt is important to note that the dependence of \u03c6 on \u02c6\u03b8(k)\ncomes from the l gradient steps\nperformed in Eq. 2. So, the gradients of \u03c6 propagates through these l unrolled gradient steps.\n\na\n\nSince the same regularizer R\u03c6(\u03b8) is trained on every (a, b) pair, the resulting regularizer we learn\ncaptures the notion of domain generalization. Please refer to Fig. 1 for a pictoral description of the\nmeta-update step. The entire algorithm is given in Algorithm 1\n\n3.3 Training the \ufb01nal model\n\nweighted L1 loss as our regularization function, (i.e) R\u03c6(\u03b8) = (cid:80)\n\nOnce the regularizer is learnt, the regularization parameters \u03c6 are frozen and the \ufb01nal task network\ninitialized from scratch is trained on all p source domains using the regularized loss function\nLreg(\u03c8, \u03b8). The network architectures consists of just one F \u2212 T pair.\nIn this paper, we use\ni \u03c6i|\u03b8i|. The weights of this\nregularizer are estimated using the meta-learning procedure discussed above. However, our approach\nis general and can be extended to any class of regularizers (refer to Section. 5). The use of weighted\nL1 loss can be interpreted as a learnable weight decay mechanism - Weights \u03b8i for which \u03c6i is positive\nwill be decayed to 0 and those for which \u03c6i is negative will be boosted. 
By using our meta-learning procedure, we select a common set of weights that achieve good cross-domain generalization across every pair of source domains (a, b).\n\nAlgorithm 1 MetaReg training algorithm\nRequire: Niter: number of training iterations\nRequire: α1, α2: learning rate hyperparameters\n1: for t in 1 : Niter do\n2:   for i in 1 : p do\n3:     Sample nb labeled images {(x_j^(i), y_j^(i)) ∼ D_i}_{j=1}^{nb}\n4:     Perform supervised classification updates:\n5:       ψ^(t) ← ψ^(t−1) − α1 ∇_ψ L^(i)(ψ^(t−1), θ_i^(t−1))\n6:       θ_i^(t) ← θ_i^(t−1) − α1 ∇_{θ_i} L^(i)(ψ^(t−1), θ_i^(t−1))\n7:   end for\n8:   Choose a, b ∈ {1, 2, . . . , p} randomly such that a ≠ b\n9:   Sample metatrain set {(x_j^(a), y_j^(a)) ∼ D_a}_{j=1}^{nb}\n10:  β_1 ← θ_a^(t)\n11:  for i = 2 : l do\n12:    β_i = β_{i−1} − α2 ∇_{β_{i−1}} [L^(a)(ψ^(t), β_{i−1}) + Rφ(β_{i−1})]\n13:  end for\n14:  θ̂_a^(t) = β_l\n15:  Sample metatest set {(x_j^(b), y_j^(b)) ∼ D_b}_{j=1}^{nb}\n16:  Perform meta-update for the regularizer: φ^(t) = φ^(t−1) − α2 ∇_φ L^(b)(ψ^(t), θ̂_a^(t)) |_{φ=φ^(t−1)}\n17: end for\n\nTable 1: Cross-domain recognition accuracy (in %) averaged over 5 runs on PACS dataset using Alexnet architecture. 
For the baseline setting, the numbers in parentheses indicate the baseline performance as reported by [19].\n\nMethod | Art painting | Cartoon | Photo | Sketch | Average\nBaseline | 67.21 ± 0.72 (64.91) | 66.12 ± 0.51 (64.28) | 88.47 ± 0.63 (86.67) | 55.32 ± 0.44 (53.08) | 69.28 (67.24)\nD-MTAE [10] | 60.27 | 58.65 | 91.12 | 47.68 | 64.48\nDSN [3] | 61.13 | 66.54 | 83.25 | 58.58 | 67.37\nDBA-DG [18] | 62.86 | 66.97 | 89.50 | 57.51 | 69.21\nMLDG [19] | 66.23 | 66.88 | 88.0 | 58.96 | 70.01\nMetaReg (Ours) | 69.82 ± 0.76 | 70.35 ± 0.63 | 91.07 ± 0.41 | 59.26 ± 0.31 | 72.62\n\n3.4 Summary of the training pipeline\n\nThe feature network is first trained using combined data from all source domains, and is kept frozen for the rest of training. The regularizer parameters are then estimated using the meta-learning procedure described in the previous section. As the individual task networks are updated on their respective source domain data, the regularizer updates are derived from each point of this SGD path with the objective of cross-domain generalization (refer to Alg. 1). To learn the regularizer effectively at the early stages of the task network updates, a replay memory is used, where the regularizer updates are periodically derived from the early stages of the task networks' SGD paths. The learned regularizer is used in the final step of the training process, where a single F−T network is trained using the regularized cross-entropy loss.\n\n4 Experiments\n\nIn this section, we describe the experimental validation of our proposed approach. We perform experiments on two benchmark domain generalization datasets: multi-domain image recognition using the PACS dataset [18] and sentiment classification using the Amazon Reviews dataset [2].\n\n4.1 PACS dataset\n\nPACS is a recently proposed benchmark dataset for domain generalization. 
This dataset contains images from four domains: Photo, Art painting, Cartoon and Sketch. Following [19], we perform experiments in four settings: in each setting, one of the four domains is treated as the unseen target domain, and the model is trained on the other three source domains.\n\nAlexnet. The first set of experiments is based on the Alexnet [16] model pretrained on Imagenet. The feature network F comprises the layers of the Alexnet model up to the pool5 layer, while the task network T contains the fc6, fc7 and fc8 layers. For the regularizer network, we used the weighted L1 loss, i.e., Rφ(θ) = Σ_i φ_i |θ_i|, where the φ_i are the parameters estimated using meta-learning. In all our experiments, the Baseline setting denotes training a neural network (Alexnet in this case) on all of the source domains without performing any domain generalization. Other comparison methods include Multi-task Autoencoders (MTAE) [10], Domain Separation Networks (DSN) [3], Deeper, Broader and Artier Domain Generalization (DBA-DG) [18] and MLDG [19]. While some of these methods were originally proposed for domain adaptation, they were adapted to the domain generalization problem as done in [19].\nAll our models are trained using the SGD optimizer with learning rate 5e−4 and a batch size of 64. This is in accordance with the setup used in [19]. Table 1 presents the results of our approach along with the other comparison methods. We observe that our method obtains a performance improvement of 3.34% over the baseline, thus achieving state-of-the-art performance on this dataset.\n\nResnet. One disadvantage of approaches like MLDG [19] is that they require differentiating through k steps of optimization updates, which might not be scalable to deeper architectures like Resnet. Even our approach requires a similar optimization process. However, unlike [19], we perform meta-learning only on the task network. 
Since the task network is much shallower than the feature network, our approach is scalable even to some of the contemporary deep architectures. In this section, we show experiments using two such architectures: Resnet-18 and Resnet-50.\n\nWe use the Resnet-18 and Resnet-50 models pretrained on ImageNet as our feature network, and the last fully connected layer as our task network. Similar to the previous experiment, we used the weighted L1 loss as our class of regularizers. All models were trained using the SGD optimizer with a learning rate of 0.001 and momentum 0.9. The hyper-parameters α1 and α2 are both set to 0.001. The results of our experiments are reported in Table 2. Our method performs better than the baseline in both settings. It is important to note that the baseline numbers for the Resnet architectures are much higher than those of Alexnet. Even on such stronger baselines, our method gives a performance improvement.\n\n4.2 Sentiment Classification\n\nIn this section, we perform experiments on the task of sentiment classification on the Amazon Reviews dataset as pre-processed by [5]. The dataset contains reviews of products belonging to four domains: books, DVD, electronics and kitchen appliances. The differences in the textual descriptions of the reviews across these product categories manifest as domain shift. Following [9], we use unigrams and bigrams as features, resulting in 5000-dimensional vector representations. The reviews are assigned binary labels: 0 if the rating of the product is up to 3 stars, and 1 if the rating is 4 or 5 stars.\nWe conduct 4 cross-domain experiments: in each setting, one of the four domains is treated as the unseen test domain, and the other three domains are used as source domains. Similar to [9], we used a neural network with one hidden layer (with 100 neurons) as our task network. 
All models were trained using the SGD optimizer with learning rate 0.01 and momentum 0.9 for 5000 iterations. The results of our experiments are reported in Table 3. Since there is significant variation in performance over runs, each experiment was repeated 10 times with different random weight initializations, and the averages of these 10 runs are reported. We observe that our method performs better than the baseline in all of the settings. However, the performance improvement is smaller compared to the previous experiments, because of the nature of the problem and the architectural choice. We would like to point out that even domain adaptation methods that make use of unlabeled target data achieve similar gains in performance [9] on this dataset.\n\n5 Ablation Study\n\nFor all the ablation experiments except 5.3, we use the Resnet-18 model as our neural network architecture and the Art painting setting of the PACS dataset as our experimental setting, i.e., we use the Art painting domain as the test domain, and Cartoon, Photo and Sketch as source domains.\n\n5.1 Class of Regularizers\n\nIn this experiment, we study the effect of different regularizers on the performance of our approach. We experimented with the following classes of regularizers: (1) weighted L1 loss: Rφ(θ) = Σ_i φ_i |θ_i|; (2) weighted L2 loss: Rφ(θ) = Σ_i φ_i θ_i^2; and (3) a two-layer neural network: Rφ(θ) = φ^(2)T ReLU(φ^(1)T θ). The performance of these regularizers is reported in Table 4. We observe that the weighted L1 regularizer performs the best among the three. Also, we observed that training networks with the weighted L1 regularizer leads to better convergence and stability in performance compared to the other two. We also compare our approach with two other schemes: (1) DropConnect [31] and (2) default L1 regularization, which is weighted L1 regularization with all weights φ_i = 1. 
We observe that neither of these schemes improves the baseline performance.\n\nTable 2: Cross-domain recognition accuracy (in %) averaged over 5 runs on PACS dataset using Resnet architectures\n\nResnet-18\nMethod | Art painting | Cartoon | Photo | Sketch | Average\nBaseline | 79.9 ± 0.22 | 75.1 ± 0.35 | 95.2 ± 0.18 | 69.5 ± 0.37 | 79.9\nMetaReg (Ours) | 83.7 ± 0.19 | 77.2 ± 0.31 | 95.5 ± 0.24 | 70.3 ± 0.28 | 81.7\nResnet-50\nBaseline | 85.4 ± 0.24 | 77.7 ± 0.31 | 97.8 ± 0.17 | 69.5 ± 0.42 | 82.6\nMetaReg (Ours) | 87.2 ± 0.13 | 79.2 ± 0.27 | 97.6 ± 0.31 | 70.3 ± 0.18 | 83.6\n\nTable 3: Cross-domain classification accuracy (in %) averaged over 10 runs on Amazon Reviews dataset\n\nMethod | Books | DVD | Electronics | Kitchen | Average\nBaseline | 75.5 ± 0.52 | 79.0 ± 0.37 | 83.7 ± 0.44 | 84.7 ± 0.63 | 80.7\nMetaReg (Ours) | 76.1 ± 0.41 | 79.6 ± 0.32 | 83.9 ± 0.28 | 85.1 ± 0.43 | 81.2\n\nTable 4: Effect of different classes of regularization functions\nBaseline | DropConnect [31] | Default L1 | Weighted L1 | Weighted L2 | 2-layer NN\n79.9 | 80.1 | 79.7 | 83.7 | 83.2 | 83.3\n\n5.2 Availability of data over time\n\nIn all of the previous experiments, we assumed that the entire training data is available from the start of the training process. But consider a more general setting where we train our model on some initial data, and more data becomes available over time. Is it possible to make use of the newly available data to improve our models without having to perform meta-learning again? We propose the following solution: train the feature network, task network and regularizer on the initial dataset. 
On the new data, finetune the task network and feature network using the regularizer trained on the initial data. Note that we do not perform meta-learning again on the new data, so this is computationally efficient, since the meta-learning procedure incurs a significant overhead over a regular finetuning process. With approaches like MLDG [19], meta-learning has to be performed even on the new data.\nWe simulate these experimental conditions as follows: in each setting, we consider a fraction f of the PACS dataset as our initial dataset on which our model and the regularizer are trained. We then finetune our model on the remaining data using the trained regularizer. The performance of these models on the test set is shown in Table 5. We observe that there is little drop in performance for all f values. Our approach is able to learn good regularizers even with 10% of the entire dataset.\n\nTable 5: Experiments for training models on less data\nData fraction f | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 1.0\nAccuracy (in %) | 82.86 | 83.11 | 83.42 | 83.62 | 83.60 | 83.71\n\n5.3 Effect of the number of layers regularized\n\nIn our training paradigm, the neural network is decomposed into feature and task networks, and the meta-regularization is performed only on the task network. Deciding this feature/task network split is a design choice that needs to be understood. The effect of varying the number of regularized layers on domain generalization performance is reported in Table 6. This experiment is performed using the Alexnet architecture on the PACS dataset with Cartoon as the target domain. We observe that as the number\n\nFigure 2: Histogram of the weights learnt by the task network. 
\"No reg\" corresponds to the network without regularization, and \"Reg, f=x\" corresponds to the regularized network, where the regularizer R is trained on only a fraction x of the data. Panels: (a) No reg, (b) Reg, f=0.1, (c) Reg, f=0.5, (d) Reg, f=1.\n\nof regularization layers increases, the generalization performance increases and saturates beyond a point.\n\nTable 6: Effect on cross-domain generalization of varying the number of layers regularized on PACS dataset using the Alexnet model. Cartoon is used as the test domain\nLayers regularized | None | fc8 | fc7 + fc8 | fc6 + fc7 + fc8\nAccuracy (in %) | 66.12 | 67.31 | 70.10 | 70.35\n\n5.4 Visualizing the weights\n\nWe plot the histograms of the weights learned by the task network with and without our regularizer in Fig. 2. The following observations can be made: (1) For the network with regularization, there is a sharp peak at 0; this is because the weights θ_i for which the φ_i are positive are decayed to 0. (2) The weights of the network with regularization have a wider spread compared to the network without regularization; this is because the weights θ_i for which the φ_i are negative are boosted, due to which certain weights attain high values.\n\n6 Conclusion and Future Work\n\nIn this work, we addressed the problem of domain generalization by using regularization. The task of finding the desired regularizer that captures the notion of domain generalization is modeled as a meta-learning problem. Experiments indicate that the learnt regularizers achieve good cross-domain generalization on the benchmark domain generalization datasets. 
Some avenues for future work include scalable meta-learning approaches for learning regularization functions over convolutional layers while preserving the spatial dependency between the channels, and extending our approach to deep reinforcement learning problems.

7 Acknowledgements

This research was supported by MURI from the Army Research Office under Grant No. W911NF-17-1-0304. This is part of the collaboration between the US DOD, UK MOD and the UK Engineering and Physical Sciences Research Council (EPSRC) under the Multidisciplinary University Research Initiative.

References

[1] Marcin Andrychowicz, Misha Denil, Sergio Gómez, Matthew W. Hoffman, David Pfau, Tom Schaul, and Nando de Freitas. Learning to learn by gradient descent by gradient descent. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 3981–3989. Curran Associates, Inc., 2016.

[2] John Blitzer, Ryan McDonald, and Fernando Pereira. Domain adaptation with structural correspondence learning. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, EMNLP '06, 2006.

[3] Konstantinos Bousmalis, George Trigeorgis, Nathan Silberman, Dilip Krishnan, and Dumitru Erhan. Domain separation networks. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 343–351. Curran Associates, Inc., 2016.

[4] Minmin Chen, Kilian Q. Weinberger, and John Blitzer. Co-training for domain adaptation. In Advances in Neural Information Processing Systems 24. Curran Associates, Inc., 2011.

[5] Minmin Chen, Zhixiang Eddie Xu, Kilian Q. Weinberger, and Fei Sha. Marginalized denoising autoencoders for domain adaptation. In ICML. icml.cc / Omnipress, 2012.

[6] Hal Daume III. Frustratingly easy domain adaptation.
In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, June 2007.

[7] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pages 1126–1135, 2017.

[8] Chelsea Finn, Tianhe Yu, Tianhao Zhang, Pieter Abbeel, and Sergey Levine. One-shot visual imitation learning via meta-learning. Proceedings of Machine Learning Research, 2017.

[9] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. J. Mach. Learn. Res., 17(1):2096–2030, January 2016.

[10] Muhammad Ghifary, W. Bastiaan Kleijn, Mengjie Zhang, and David Balduzzi. Domain generalization for object recognition with multi-task autoencoders. In 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, 2015.

[11] Boqing Gong, Yuan Shi, Fei Sha, and Kristen Grauman. Geodesic flow kernel for unsupervised domain adaptation. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 2066–2073, 2012.

[12] Raghuraman Gopalan, Ruonan Li, and Rama Chellappa. Domain adaptation for object recognition: An unsupervised approach. In Proceedings of the 2011 International Conference on Computer Vision, ICCV '11, 2011.

[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.

[14] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift.
In Proceedings of the 32nd International Conference on Machine Learning, ICML'15, pages 448–456, 2015.

[15] Aditya Khosla, Tinghui Zhou, Tomasz Malisiewicz, Alexei A. Efros, and Antonio Torralba. Undoing the damage of dataset bias. In Proceedings of the 12th European Conference on Computer Vision - Volume Part I, ECCV'12, 2012.

[16] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pages 1097–1105, 2012.

[17] Anders Krogh and John A. Hertz. A simple weight decay can improve generalization. In Advances in Neural Information Processing Systems, pages 950–957, 1992.

[18] Da Li, Yongxin Yang, Yi-Zhe Song, and Timothy M. Hospedales. Deeper, broader and artier domain generalization. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 5543–5551, 2017.

[19] Da Li, Yongxin Yang, Yi-Zhe Song, and Timothy M. Hospedales. Learning to generalize: Meta-learning for domain generalization. CoRR, abs/1710.03463, 2017.

[20] Ke Li and Jitendra Malik. Learning to optimize neural nets. CoRR, abs/1703.00441, 2017.

[21] Mingsheng Long, Yue Cao, Jianmin Wang, and Michael I. Jordan. Learning transferable features with deep adaptation networks. In Proceedings of the 32nd International Conference on Machine Learning, pages 97–105, 2015.

[22] K. Muandet, D. Balduzzi, and B. Schölkopf. Domain generalization via invariant feature representation. In Proceedings of the 30th International Conference on Machine Learning, W&CP 28(1), pages 10–18. JMLR, 2013.

[23] Jie Ni, Qiang Qiu, and Rama Chellappa. Subspace interpolation via dictionary learning for unsupervised domain adaptation.
In 2013 IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, June 23-28, 2013, pages 692–699, 2013.

[24] Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. In International Conference on Learning Representations (ICLR), 2017.

[25] Swami Sankaranarayanan, Yogesh Balaji, Carlos D. Castillo, and Rama Chellappa. Generate to adapt: Aligning domains using generative adversarial networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.

[26] Swami Sankaranarayanan, Yogesh Balaji, Arpit Jain, Ser Nam Lim, and Rama Chellappa. Learning from synthetic data: Addressing domain shift for semantic segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.

[27] Jürgen Schmidhuber. On learning how to learn learning strategies. Technical report, 1995.

[28] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res., 15(1), January 2014.

[29] Damien Teney and Anton van den Hengel. Visual question answering as a meta learning task. CoRR, abs/1711.08105, 2017.

[30] Sebastian Thrun and Lorien Pratt, editors. Learning to Learn. Kluwer Academic Publishers, Norwell, MA, USA, 1998.

[31] Li Wan, Matthew Zeiler, Sixin Zhang, Yann Le Cun, and Rob Fergus. Regularization of neural networks using dropconnect. In Proceedings of the 30th International Conference on Machine Learning, pages 1058–1066, 2013.

[32] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In ICLR, 2017.

[33] Yang Zhang, Philip David, and Boqing Gong. Curriculum domain adaptation for semantic segmentation of urban scenes.
In The IEEE International Conference on Computer Vision (ICCV), volume 2, page 6, Oct 2017.

[34] Fengwei Zhou, Bin Wu, and Zhenguo Li. Deep meta-learning: Learning to learn in the concept space. CoRR, abs/1802.03596, 2018.