{"title": "Learning Robust Global Representations by Penalizing Local Predictive Power", "book": "Advances in Neural Information Processing Systems", "page_first": 10506, "page_last": 10518, "abstract": "Despite their renowned in-domain predictive power, convolutional neural networks are known to rely more on high-frequency patterns that humans deem superficial than on low-frequency patterns that agree better with intuitions about what constitutes category membership. This paper proposes a method for training robust convolutional networks by penalizing the predictive power of the local representations learned by earlier layers. Intuitively, our networks are forced to discard predictive signals such as color and texture that can be gleaned from local receptive fields and to rely instead on the global structures of the image. Across a battery of synthetic and benchmark domain adaptation tasks, our method confers improved generalization out of the domain. Additionally, to evaluate cross-domain transfer, we introduce ImageNet-Sketch, a new dataset consisting of sketch-like images that matches the ImageNet classification validation set in scale and dimension.", "full_text": "Learning Robust Global Representations\n\nby Penalizing Local Predictive Power\n\nHaohan Wang, Songwei Ge, Eric P. Xing, Zachary C. Lipton\n\n{haohanw,songweig,epxing,zlipton}@cs.cmu.edu\n\nSchool of Computer Science\nCarnegie Mellon University\n\nPittsburgh, PA 15213\n\nAbstract\n\nDespite their well-documented predictive power on i.i.d. data, convolutional\nneural networks have been demonstrated to rely more on high-frequency (textural)\npatterns that humans deem super\ufb01cial than on low-frequency patterns that agree\nbetter with intuitions about what constitutes category membership. This paper\nproposes a method for training robust convolutional networks by penalizing the\npredictive power of the local representations learned by earlier layers. 
Intuitively, our networks are forced to discard predictive signals such as color and texture that can be gleaned from local receptive fields and to rely instead on the global structure of the image. Across a battery of synthetic and benchmark domain adaptation tasks, our method confers improved generalization. To evaluate cross-domain transfer, we introduce ImageNet-Sketch, a new dataset consisting of sketch-like images and matching the ImageNet classification validation set in categories and scale.

1 Introduction

Consider the task of determining whether a photograph depicts a tortoise or a sea turtle. A human might check to see whether the shell is dome-shaped (indicating tortoise) or flat (indicating turtle). She might also check to see whether the feet are short and bent (indicating tortoise) or fin-like and webbed (indicating turtle). However, the pixels corresponding to the turtle (or tortoise) itself are not alone in offering predictive value. As easily confirmed through a Google Image search, sea turtles tend to be photographed in the sea while tortoises tend to be photographed on land.
Although an image's background may indeed be predictive of the category of the depicted object, it nevertheless seems unsettling that our classifiers should depend so precariously on a signal that is in some sense irrelevant. After all, a tortoise appearing in the sea is still a tortoise and a turtle on land is still a turtle. One reason why we might seek to avoid such a reliance on correlated but semantically unrelated artifacts is that they might be liable to change out-of-sample. Even if all cats in a training set appear indoors, we might require a classifier capable of recognizing an outdoors cat at test time. Indeed, recent papers have attested to the tendency of neural networks to rely on surface statistical regularities rather than learning global concepts (Jo and Bengio, 2017; Geirhos et al., 2019).
A number of papers have demonstrated unsettling drops in performance when convolutional neural networks are applied to out-of-domain testing data, even in the absence of adversarial manipulation.
The problem of developing robust classifiers capable of performing well on out-of-domain data is broadly known as Domain Adaptation (DA). While the problem is known to be impossible absent any restrictions on the relationship between training and test distributions (Ben-David et al., 2010b), progress is often possible under reasonable assumptions. Theoretically-principled algorithms have been proposed under a variety of assumptions, including covariate shift (Shimodaira, 2000; Gretton et al., 2009) and label shift (Storkey, 2009; Schölkopf et al., 2012; Zhang et al., 2013; Lipton et al., 2018).

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: In addition to the primary classifier, our model consists of a number of side classifiers, applied at each 1 × 1 location in a designated early layer. The side classifiers result in one prediction per spatial location. The goal of patch-wise adversarial regularization is to fool all of them (via reverse gradient) while nevertheless outputting the correct class from the topmost layer.
Despite some known impossibility results for general DA problems (Ben-David et al., 2010b), in practice, humans exhibit remarkable robustness to a wide variety of distribution shifts, exploiting a variety of invariances and knowledge about what a label actually means.
Our work is motivated by the intuition that for the classes typically of interest in many image classification tasks, the larger-scale structure of the image is what makes the class apply. While small local patches might be predictive of the label, such local features, considered independently, should not (vis-à-vis robustness desiderata) comprise the basis for outputting a given classification. Instead, we posit that classifiers that are required to (in some sense) discard this local signal (i.e., patches of an image correlated with the label within a data collection), basing predictions instead on global concepts (i.e., concepts that can only be derived by combining information intelligently across regions), may better mimic the robustness that humans demonstrate in visual recognition.
In this paper, in order to coerce a convolutional neural network to focus on the global concept of an object, we introduce Patch-wise Adversarial Regularization (PAR), a learning scheme that penalizes the predictive power of local representations in earlier layers. The method consists of a patch-wise classifier applied at each spatial location in a low-level representation. Via the reverse gradient technique popularized by Ganin et al. (2016), our network is optimized to fool the side classifiers and simultaneously optimized to output correct predictions at the final layer.
Design choices of PAR include the layer on which the penalty is applied, the regularization strength, and the number of layers in the patch-wise network-in-network classifier.
In extensive experiments across a wide spectrum of synthetic and real data sets, our method outperforms the competing ones, especially when domain information is not available. We also take measures to evaluate our model's ability to learn concepts at real-world scale despite the small scale of popular domain adaptation benchmarks. Thus we introduce a new benchmark dataset that resembles ImageNet in the choice of categories and size, but consists only of images with the aesthetic of hand-drawn sketches. Performance on this new benchmark also supports the value of our regularization.

2 Related Work

A broad set of papers have addressed various formulations of DA (Bridle and Cox, 1991; Ben-David et al., 2010a), dating in the ML and statistics literature to early work on covariate shift (Shimodaira, 2000), with antecedents in classic econometrics work on sample selection bias (Heckman, 1977; Manski and Lerman, 1977).
Several modern works address principled learning techniques under covariate shift (when p(y|x) does not change) (Gretton et al., 2009) and under label shift (when p(x|y) does not change) (Storkey, 2009; Zhang et al., 2013; Lipton et al., 2018), as well as various other assumptions (e.g., bounded divergences between source and target distributions) (Mansour et al., 2009; Hu et al., 2016).
With the recent success of deep learning methods, a number of heuristic domain adaptation methods have been proposed that, despite lacking theoretical backing, nevertheless confer improvements on a number of benchmarks, even when traditional assumptions break down (e.g., no shared support).
At a high level, these methods comprise two subtypes: fine-tuning over the target domain (Long et al., 2016; Hoffman et al., 2017; Motiian et al., 2017a; Gebru et al., 2017; Volpi et al., 2018) and coercing domain invariance via adversarial learning (or further extensions) (Ganin et al., 2016; Bousmalis et al., 2017; Tzeng et al., 2017; Xie et al., 2018; Hoffman et al., 2018; Long et al., 2018; Zhao et al., 2018b; Kumar et al., 2018; Li et al., 2018b; Zhao et al., 2018a; Schoenauer-Sebag et al., 2019). While some methods have justified domain-adversarial learning by appealing to theoretical bounds due to Ben-David et al. (2010a), the theory does not in fact guarantee generalization (recently shown by Johansson et al. (2019) and Wu et al. (2019)) and sometimes guarantees failure. For a general primer, we refer to several literature reviews (Weiss et al., 2016; Csurka, 2017; Wang and Deng, 2018).
In contrast to the typical unsupervised DA setup, which requires access to both labeled source data and unlabeled target data, several recent papers propose deep learning methods that confer robustness to a variety of natural-seeming distribution shifts (in practice) without requiring any data (even unlabeled data) from the target distribution.
In domain generalization (DG) methods (Muandet et al., 2013) (sometimes known as "zero-shot domain adaptation" (Kumagai and Iwata, 2018; Niu et al., 2015; Erfani et al., 2016; Li et al., 2017c)), one possesses domain identifiers for a number of known in-sample domains, and the goal is to generalize to a new domain. More recent DG approaches incorporate adversarial (or similar) techniques (Ghifary et al., 2015; Wang et al., 2016; Motiian et al., 2017b; Li et al., 2018a; Carlucci et al., 2018), or build ensembles of per-domain models whose representations are then fused together (Bousmalis et al., 2016; Ding and Fu, 2018; Mancini et al., 2018). Meta-learning techniques have also been explored (Li et al., 2017b; Balaji et al., 2018).
More recently, Wang et al. (2019) demonstrated promising results on a number of benchmarks without using domain identifiers. Their method addresses distribution shift by incorporating a new component intended to be especially sensitive to domain-specific signals. Our paper extends the setup of (Wang et al., 2019) and empirically studies the problem of developing image classifiers robust to a variety of natural shifts without leveraging any domain information at training or deployment time.

3 Method

We use ⟨X, y⟩ to denote the samples and f(g(·; φ); θ) to denote a convolutional neural network, where g(·; φ) denotes the output of the bottom convolutional layers (e.g., the first layer), and φ and θ are parameters to be learned.
The traditional training process addresses the optimization problem

$$\min_{\phi, \theta} \; \mathbb{E}_{(X,y)}\big[\ell(f(g(X; \phi); \theta), y)\big], \qquad (1)$$

where ℓ(·, ·) denotes the loss function, commonly the cross-entropy loss in classification problems. Following the standard set-up of a convolutional layer, φ is a tensor of c × m × n parameters, where c denotes the number of convolutional channels, and m × n is the size of the convolutional kernel. Therefore, for the ith sample, g(X_i; φ) is a representation of X_i of dimension c × m′ × n′, where m′ (or n′) is a function of the image dimension and m (or n).1

3.1 Patch-wise Adversarial Regularization

We first introduce a new classifier, h(·; ψ), that takes as input a c-length vector and predicts the label. Thus, h(·; ψ) can be applied to the representation g(X_i; φ) and yields m′ × n′ predictions. Each of the m′ × n′ predictions can therefore be seen as a prediction made only by considering the small image patch corresponding to one of the receptive fields in g(X_i; φ). If any of the image patches are predictive and g(·; φ) summarizes the predictive representation well, h(·; ψ) can be trained to achieve high prediction accuracy.
On the other hand, if g(·; φ) summarizes the patch-wise predictive representation well, higher layers (f(·; θ)) can directly utilize these representations for prediction and thus may not be required to learn a global concept.
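Concretely, the side classifier makes one prediction per spatial fiber of the feature map. The step can be sketched in a few lines of NumPy; the array names, sizes, and the linear form of h are our own illustration under the assumptions above, not the authors' released code:

```python
import numpy as np

rng = np.random.default_rng(0)
c, m_out, n_out, k = 8, 5, 6, 10            # channels c, spatial dims m', n', k classes

feat = rng.normal(size=(c, m_out, n_out))    # stand-in for g(X; phi): one c-length fiber per location
W = rng.normal(size=(k, c))                  # shared weights of h(.; psi) ...
b = rng.normal(size=(k,))                    # ... and its shared bias

# Apply the same classifier h to every spatial fiber: m' x n' sets of k-way logits.
patch_logits = np.einsum('kc,cij->ijk', W, feat) + b

# Equivalent naive view: loop over locations and run the fully-connected layer per fiber.
for i in range(m_out):
    for j in range(n_out):
        assert np.allclose(patch_logits[i, j], W @ feat[:, i, j] + b)
```

The tensor contraction and the per-location loop agree exactly, which is why a single shared classifier suffices to score every patch.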
Our intuition is that by regularizing g(·; φ) such that each fiber (i.e., the representation at the same spatial location from every channel) in the activation tensor is not individually predictive of the label, we can prevent our model from relying on local patterns and instead force it to learn a pattern that can only be revealed by aggregating information across multiple receptive fields.

1 The exact function depends on padding size and stride size, and is irrelevant to the discussion of this paper.

As a result, in addition to the standard optimization problem (Eq. 1), we also optimize the following term:

$$\min_{\psi} \max_{\phi} \; \mathbb{E}_{(X,y)}\Big[\sum_{i,j}^{m', n'} \ell(h(g(X; \phi)_{i,j}; \psi), y)\Big] \qquad (2)$$

where the minimization consists of training h(·; ψ) to predict the label based on the local features (at each spatial location), while the maximization consists of training g(·; φ) to shift focus away from local predictive representations.
We hypothesize that by jointly solving these two optimization problems (Eq. 1 and Eq. 2), we can train a model that predicts the label well without relying too strongly on local patterns. The optimization can be reformulated into the following two problems:

$$\min_{\phi, \theta} \; \mathbb{E}_{(X,y)}\Big[\ell(f(g(X; \phi); \theta), y) - \frac{\lambda}{m' n'} \sum_{i,j}^{m', n'} \ell(h(g(X; \phi)_{i,j}; \psi), y)\Big]$$

$$\min_{\psi} \; \mathbb{E}_{(X,y)}\Big[\frac{\lambda}{m' n'} \sum_{i,j}^{m', n'} \ell(h(g(X; \phi)_{i,j}; \psi), y)\Big]$$

where λ is a tuning hyperparameter. We divide the loss by m′n′ to keep the two terms at the same scale.
Our method can be implemented efficiently as follows. In practice, we consider h(·; ψ) to be a fully-connected layer; ψ consists of a c × k weight matrix and a k-length bias vector, where k is the number of classes.
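The maximization over the feature extractor in Eq. 2 is realized with the reverse-gradient technique of Ganin et al. (2016): the side loss is back-propagated normally into the side classifier, but its gradient is negated and scaled by the tuning hyperparameter before reaching the lower layers. A toy NumPy sketch of that sign flip (the function names are our own illustration, not the released code):

```python
import numpy as np

def grad_reverse_forward(x):
    # Forward pass: the reversal layer is the identity.
    return x

def grad_reverse_backward(upstream_grad, lam=1.0):
    # Backward pass: negate and scale the gradient, so a step that
    # *decreases* the side classifier's loss with respect to its own
    # parameters simultaneously *increases* it with respect to the
    # feature extractor's parameters.
    return -lam * np.asarray(upstream_grad)

x = np.array([0.3, -1.2, 2.0])
assert np.allclose(grad_reverse_forward(x), x)
assert np.allclose(grad_reverse_backward(np.ones(3), lam=0.5), -0.5 * np.ones(3))
```

With this layer in place, a single backward pass simultaneously trains the side classifier and pushes the early layers away from locally predictive features.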
The m′ × n′ forward operations as fully-connected networks can be efficiently implemented as a 1 × 1 convolutional operation with c input channels and k output channels operating on the m′ × n′ representation.
Note that although the input has m′ × n′ vectors, h(·; ψ) has only one set of parameters that is used for all of them, in contrast to building a separate set of parameters for every receptive field of the m′ × n′ dimension. Using only one set of parameters not only helps to reduce the computational load and the parameter space, but also helps to identify predictive local patterns, because a predictive local pattern does not necessarily appear at the same position across images. Our method is illustrated in Figure 1.

3.2 Other Extensions and Training Heuristics

There are many simple extensions to the basic PAR setting discussed above. Here we introduce three extensions that we experiment with later in the experiment section.
More Powerful Pattern Classifier: We explore the space of discriminator architectures, replacing the single-layer network h(·; ψ) with a more powerful network architecture, e.g., a multilayer perceptron (MLP). In this paper, we consider three-layer MLPs with ReLU activation functions. We name this variant PARM.
Broader Local Pattern: We can also extend the 1 × 1 convolution operation to enlarge the notion of "local". In this paper, we experiment with a 3 × 3 convolution operation, which increases the number of parameters in ψ. We refer to this variant as PARB.
Higher Level of Local Concept: Further, we can also build the regularization upon higher convolutional layers. Building the regularization on higher layers is related to enlarging the image patch, but also considers a higher level of abstraction. In this paper, we experiment with the regularization on the second layer.
We refer to this method as PARH.
Training Heuristics: Finally, we introduce a training heuristic that plays an important role in our regularization technique, especially in modern architectures such as AlexNet or ResNet. The heuristic is simple: we first train the model conventionally until convergence (or for a certain number of epochs), and then train the model with our regularization. In other words, we can also directly start from pretrained models and continue to fine-tune the parameters with our regularization.

Figure 2: Prediction accuracy with standard deviation for MNIST with patterns. Notations: V: vanilla baseline, E: HEX, D: DANN, I: InfoDrop, P: PAR, B: PARB, M: PARM, H: PARH

4 Experiments

In this section, we test PAR in a variety of settings: we first test with perturbed MNIST under the domain generalization setting, and then with perturbed CIFAR10 under the domain adaptation setting. Further, we test on more challenging data sets, with the PACS data under the domain generalization setting and with our newly proposed ImageNet-Sketch data set. We compare with the previous state-of-the-art when available, or with the most popular benchmarks such as DANN (Ganin et al., 2016), InfoDrop (Achille and Soatto, 2018), and HEX (Wang et al., 2019) on synthetic experiments.2,3

4.1 MNIST with Perturbation

We follow the set-up of Wang et al. (2019) in experimenting with the MNIST data set with different superficial patterns. There are three different superficial patterns (radial kernel, random kernel, and original image). The training/validation samples are attached with two of these patterns, while the testing samples are attached with the remaining one. As in Wang et al.
(2019), training/validation samples are attached with patterns following two strategies: 1) independently: the pattern is independent of the digit, and 2) dependently: images of digits 0-4 have one pattern while images of digits 5-9 have the other pattern.
We use the same model architecture and learning rate as in Wang et al. (2019). The extra hyperparameter λ is set to 1 as the most straightforward choice. Methods in Wang et al. (2019) are trained for 100 epochs, so we train the model for 50 epochs as pretraining and 50 epochs with our regularization. The results are shown in Figure 2. In addition to the direct message that our proposed method outperforms competing ones in most cases, it is worth mentioning that the proposed methods behave differently in the "dependent" settings. For example, PARM performs the best in the "original" and "radial" settings, but almost the worst among the proposed methods in the "random" setting, which may indicate that the pattern attached by the "random" kernel can be more easily detected and removed by PARM during training. (Notice that the name of the setting ("original", "radial", or "random") indicates the pattern attached to the testing images; the training samples are attached with the other two patterns.) More information about hyperparameter choice is in Appendix A.

4.2 CIFAR with Perturbation

We continue to experiment on the CIFAR10 data set by modifying the color and texture of the test data with four different schemes: 1) greyscale; 2) negative color; 3) random kernel; 4) radial kernel. Some examples of the perturbed data are shown in Appendix B. In this experiment, we use ResNet-32 as our base classifier, which has roughly 92% prediction accuracy on the original CIFAR10 test data set. As for PAR, we first train the base classifier for 250 epochs and then train with the adversarial loss for another 150 epochs.
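This two-stage recipe (conventional training first, then training with the adversarial loss switched on) amounts to a simple epoch schedule. A small illustrative sketch (the helper and its boolean flag are hypothetical, not part of the released code):

```python
def two_stage_schedule(total_epochs, pretrain_epochs):
    """Return one flag per epoch: False while pretraining conventionally,
    True once the patch-wise adversarial regularizer is switched on."""
    if not 0 <= pretrain_epochs <= total_epochs:
        raise ValueError("pretrain_epochs must lie in [0, total_epochs]")
    return [epoch >= pretrain_epochs for epoch in range(total_epochs)]

# The CIFAR10 setting described above: 250 conventional epochs, then 150 with PAR.
schedule = two_stage_schedule(total_epochs=400, pretrain_epochs=250)
```

The same schedule covers the fine-tuning variant: starting from a pretrained model simply corresponds to skipping the first stage.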
As for the competing models, we also train for 400 epochs with carefully selected hyperparameters. The overall performances are shown in Table 1. In general, PAR and its variants achieve the best performances on all four test data sets, even though DANN has an unfair advantage over the others by seeing unlabelled testing data during training. To be specific, PAR achieves the best performance in the greyscale and radial kernel settings; PARM is the best in the negative color and random kernel settings. One may argue that the numeric improvements are not significant and that PAR may only affect the model marginally, but a closer look at the training process of the methods indicates that our regularization of local patterns benefits robustness significantly while minimally impacting the original performance. More detailed discussions are in Appendix B.

2 A clean demonstration of the implementation can be found at: https://github.com/HaohanWang/PAR
3 Source code for replication can be found at: https://github.com/HaohanWang/PAR_experiments

Table 1: Test accuracy of PAR and variants on CIFAR10 data sets with perturbed color and texture.

               ResNet   DANN   InfoDrop   HEX    PAR    PARB   PARM   PARH
Greyscale       87.7    87.9     86.4     87.3   88.1   87.6   87.8   86.9
NegColor        65.3    62.8     57.6     64.3   66.2   62.4   67.6   62.7
RandKernel      40.5    43.0     41.3     33.4   47.0   42.5   47.5   40.8
RadialKernel    62.4    63.2     60.3     63.3   63.8   61.9   63.2   61.4
Average         63.9    64.2     61.4     62.0   66.3   63.6   66.5   62.9

4.3 PACS

We test on the PACS data set (Li et al., 2017a), which consists of collections of images over four domains: photo, art painting, cartoon, and sketch. Many recent methods have been tested on this data set, which offers a convenient way to compare PAR with the previous state-of-the-art. Following Li et al. (2017a), we use AlexNet as the baseline and build PAR upon it.
We compare with the recently reported state-of-the-art on this data set, including DSN (Bousmalis et al., 2016), L-CNN (Li et al., 2017a), MLDG (Li et al., 2017b), Fusion (Mancini et al., 2018), MetaReg (Balaji et al., 2018), Jigen (Carlucci et al., 2019), and HEX (Wang et al., 2019), in addition to the baseline reported in (Li et al., 2017a). We are also aware that methods that explicitly use domain knowledge (e.g., Lee et al., 2018) may be helpful, but we do not directly compare with them numerically, as those methods deviate from the central theme of this paper.

Table 2: Prediction accuracy of PAR and variants on the PACS data set in comparison with the previously reported state-of-the-art results. Bold numbers indicate the best performance (three sets, one for each scenario). We use * to denote the methods that use the training setting in (Carlucci et al., 2019) (e.g., extra data augmentation, different train-test split, and different learning rate scheduling). Notably, PARH achieves the best performance in the sketch testing case even in comparison to all other methods without data augmentation.

          Forgoing Domain ID   Data Aug.    Art    Cartoon   Photo   Sketch   Average
AlexNet           ✓               –        63.3     63.1     87.7    54.0     67.03
DSN               –               –        61.1     66.5     83.2    58.5     67.33
L-CNN             –               –        62.8     66.9     89.5    57.5     69.18
MLDG              –               –        63.6     63.4     87.8    54.9     67.43
Fusion            –               –        64.1     66.8     90.2    60.1     70.30
MetaReg           –               –        69.8     70.4     91.1    59.2     72.63
HEX               ✓               –        66.8     69.7     87.9    56.3     70.18
PAR               ✓               –        66.9     67.1     88.6    62.6     71.30
PARB              ✓               –        66.3     67.8     87.2    61.8     70.78
PARM              ✓               –        65.7     68.1     88.9    61.7     71.10
PARH              ✓               –        66.3     68.3     89.6    64.1     72.08
Jigen*            ✓               ✓        67.6     71.7     89.0    65.1     73.38
PAR*              ✓               ✓        68.0     71.6     90.8    61.8     73.05
PARB*             ✓               ✓        67.6     70.7     90.1    62.0     72.59
PARM*             ✓               ✓        68.7     71.5     90.5    62.6     73.33
PARH*             ✓               ✓        68.7     70.5     90.4    64.6     73.54

Following the training heuristics we introduced, we continue with trained AlexNet weights4 and fine-tune on training domain
data of PACS for 100 epochs. We notice that once our regularization is plugged in, we can outperform the baseline AlexNet with a 2% improvement. The results are reported in Table 2, where we separate the results of techniques relying on domain identification from techniques free of domain identification.

4 https://www.cs.toronto.edu/~guerzhoy/tf_alexnet/

Figure 3: Sample images from ImageNet-Sketch. Corresponding classes: (a) magpie, (b) box turtle, (c) goldfish, (d) golden retriever, (e) parachute, (f) bookshop, (g) acoustic guitar, (h) racer, (i) giant panda.

We also report the results based on the training schedule used by (Carlucci et al., 2019), as shown in the bottom part of Table 2. Note that (Carlucci et al., 2019) used a random training-test split that is different from the official split used by the other baselines. In addition, they used another data augmentation technique that converts image patches to grayscale, which could benefit adaptation to the Sketch domain.
While our methods are in general competitive, it is worth mentioning that our methods improve upon previous methods by a relatively large margin when Sketch is the testing domain. The improvement on Sketch is notable because Sketch is the only colorless domain of the four domains in PACS. Therefore, when tested on the other three domains, a model may learn to exploit color information, which is usually local, to predict; but when tested on the Sketch domain, the model has to learn colorless concepts to make good predictions.

4.4 ImageNet-Sketch

4.4.1 The ImageNet-Sketch Data

Inspired by the Sketch data of (Li et al., 2017a) with seven classes, and several other sketch datasets, such as the Sketchy dataset (Sangkloy et al., 2016) with 125 classes and the Quick Draw!
dataset (QuickDraw, 2018) with 345 classes, and motivated by the absence of a large-scale sketch dataset fitting the shape and size of popular image classification benchmarks, we construct the ImageNet-Sketch data set for evaluating the out-of-domain classification performance of vision models trained on ImageNet.
Compatible with the standard ImageNet validation data set for the classification task (Deng et al., 2009), our ImageNet-Sketch data set consists of 50000 images, 50 images for each of the 1000 ImageNet classes. We construct the data set with Google Image queries "sketch of __", where __ is the standard class name. We search only within the "black and white" color scheme. We initially query 100 images for every class, and then manually clean the pulled images by deleting the irrelevant images and images that belong to similar but different classes. For some classes, there are fewer than 50 images after manual cleaning, and we then augment the data set by flipping and rotating the images.
We expect ImageNet-Sketch to serve as a unique ImageNet-scale out-of-domain evaluation dataset for image classification. Also, notably, different from perturbed ImageNet validation sets (Geirhos et al., 2019; Hendrycks and Dietterich, 2019), the images of ImageNet-Sketch are collected independently from the original validation images. The independent collection procedure is more similar to that of (Recht et al., 2019), who collected a new set of standard colorful ImageNet validation images. However, while their goal was to assess overfitting to the benchmark validation sets, and thus they tried to replicate

Table 3: Testing accuracy of competing methods on the ImageNet-Sketch data. The bottom half denotes methods that have extra advantages: † denotes the method that has access to unlabelled target domain data, and * 
denotes methods that use extra data augmentation.

          AlexNet   InfoDrop    HEX      PAR      PARB     PARM     PARH
Top 1     0.1204     0.1224    0.1292   0.1306   0.1273   0.1287   0.1266
Top 5     0.2480     0.2560    0.2654   0.2627   0.2575   0.2603   0.2544

          DANN†     Jigen*     PAR*     PARB*    PARM*    PARH*
Top 1     0.1360     0.1469    0.1494   0.1494   0.1501   0.1499
Top 5     0.2712     0.2898    0.2949   0.2945   0.2957   0.2954

the ImageNet collection procedure exactly, our goal is to collect out-of-domain black-and-white sketch images in order to test a model's ability to extrapolate out of domain.5 Sample images are shown in Figure 3.

Table 4: Some examples that are predicted correctly by our method but wrongly by the original AlexNet, because the original model seems to focus on local patterns.

            AlexNet-PAR                    AlexNet
            prediction     confidence      prediction     confidence
            stethoscope      0.6608        hook             0.3903
            tricycle         0.9260        safety pin       0.5143
            Afghan hound     0.8945        swab (mop)       0.7379
            red wine         0.5999        goblet           0.7427

4.4.2 Experiment Results

We use AlexNet as the baseline and test whether our method can help improve out-of-domain prediction. We start with the ImageNet-pretrained AlexNet and continue to use PAR to tune AlexNet for another five epochs on the original ImageNet training dataset. The results are reported in Table 3.
We are particularly interested in how PAR improves upon AlexNet, so we further investigate the top-1 prediction results. Although the numeric results in Table 3 seem to show that PAR only improves upon AlexNet by predicting a few more examples correctly, we notice that these models share 5025 correct predictions, while AlexNet predicts another 1098 images correctly and PAR predicts a different set of 1617 images correctly.
We first investigate the examples that are correctly predicted by the original AlexNet, but wrongly predicted by PAR.
We notice some examples that help verify the performance of PAR. For example, PAR incorrectly predicts three instances of "keyboard" as "crossword puzzle," while AlexNet predicts these samples correctly. It is notable that two of these samples are "keyboards with missing keys" and hence look similar to a "crossword puzzle."
We also investigate the examples that are correctly predicted by PAR, but wrongly predicted by the original AlexNet. Interestingly, we notice several samples that are wrongly predicted by AlexNet because the model may focus only on local patterns. Some of the most interesting examples are reported in Table 4. The first example is a stethoscope: PAR predicts it correctly with 0.66 confidence, while AlexNet predicts it to be a hook. We conjecture the reason to be that AlexNet tends to focus only on the curvature, which resembles a hook. The second example tells a similar story: PAR predicts tricycle correctly with 0.92 confidence, but AlexNet predicts it as a safety pin with 0.51 confidence. We believe this is because part of the image (likely the seat-supporting frame) resembles the structure of a safety pin. For the third example, PAR correctly predicts it to be an Afghan hound with 0.89 confidence, but AlexNet predicts it as a mop with 0.73 confidence. This is likely because the fur of the hound is similar to the head of a mop. For the last example, PAR correctly predicts the object to be red wine with 0.59 confidence, but AlexNet predicts it to be a goblet with 0.74 confidence. This is likely because part of the image is indeed part of a goblet, but PAR may learn to make predictions based on the global concept, considering the bottle, the liquid, and part of the goblet together.

5 The ImageNet-Sketch data can be found at: https://github.com/HaohanWang/ImageNet-Sketch
Table 4 highlights only a few examples; more are shown in Appendix C.

5 Conclusion

In this paper, we introduced patch-wise adversarial regularization, a technique that regularizes models to learn global concepts for classifying objects by penalizing the model's ability to make predictions from representations of local patches. We extended our basic setup with several variants and conducted extensive experiments, evaluating these methods on several datasets for domain adaptation and domain generalization tasks. The experimental results favored our methods, especially when domain information is unavailable to the methods. Beyond the strong performance achieved in these experiments, we sought to further challenge our method at real-world scale. Therefore, we also constructed a dataset that matches the ImageNet classification validation set in categories and scale but contains only sketch-like images. Our new ImageNet-Sketch dataset can serve as new territory for evaluating models' ability to generalize to out-of-domain images at an unprecedented scale.

While our method often confers benefits on out-of-domain data, we note that it may not help (or may even hurt) in-domain accuracy when local patterns are truly predictive of the labels. However, we argue that local patterns, while predictive in-sample, may be less reliable out-of-domain than larger-scale patterns, which motivates this paper. For the three variants we introduced, our experiments indicate that different variants are applicable to different scenarios. We recommend that users decide which variant to use based on their understanding of the problem, and we hope, in future work, to develop clear principles for guiding these choices.
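The patch-wise penalty described in the conclusion can be made concrete with a small numerical sketch of the two opposing loss terms. The per-patch linear classifier, the feature shapes, and the weight lam below are hypothetical simplifications for illustration, not the exact architecture used in the paper:

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_entropy(probs, label):
    # Negative log-likelihood of the true label.
    return -np.log(probs[..., label] + 1e-12)

rng = np.random.default_rng(0)
H, W, C, num_classes = 3, 3, 8, 5   # hypothetical grid of local patch features
patch_feats = rng.normal(size=(H, W, C))
label = 2

# A small per-patch classifier with weights shared across locations.
W_patch = rng.normal(size=(C, num_classes))
patch_probs = softmax(patch_feats @ W_patch)        # shape (H, W, num_classes)

# Per-patch cross-entropy, averaged over locations. In training, the patch
# classifier minimizes this loss while the feature extractor maximizes it
# (an adversarial / gradient-reversal update), so local patches are pushed
# to carry as little label information as possible.
patch_loss = cross_entropy(patch_probs, label).mean()

# Main classification loss on a global (here, mean-pooled) representation.
W_main = rng.normal(size=(C, num_classes))
global_feat = patch_feats.mean(axis=(0, 1))
main_loss = cross_entropy(softmax(global_feat @ W_main), label)

lam = 0.1  # hypothetical regularization weight
encoder_objective = main_loss - lam * patch_loss    # adversarial sign flip
```

The sign flip on patch_loss is what distinguishes the method from an auxiliary-task setup: the encoder is rewarded, not penalized, when its local representations fail to predict the label.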
While we did not give a clear choice of which PAR variant to use, we note that none of the variants outperforms the vanilla PAR consistently. However, the vanilla PAR outperforms most comparable baselines in the vast majority of our experiments.

Acknowledgments

Haohan Wang is supported by NIH R01GM114311, NIH P30DA035778, and NSF IIS1617583. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Institutes of Health or the National Science Foundation. Zachary Lipton thanks the Center for Machine Learning and Health, a joint venture of Carnegie Mellon University, UPMC, and the University of Pittsburgh, for supporting our collaboration with Abridge AI to develop robust models for machine learning in healthcare. He is also grateful to Salesforce Research, Facebook Research, and Amazon AI for faculty awards supporting his lab's research on robust deep learning under distribution shift.

References

A. Achille and S. Soatto. Information dropout: Learning optimal representations through noisy computation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(12):2897–2905, 2018.

Y. Balaji, S. Sankaranarayanan, and R. Chellappa. MetaReg: Towards domain generalization using meta-regularization. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 998–1008. Curran Associates, Inc., 2018.

S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. W. Vaughan. A theory of learning from different domains. Machine Learning, 79(1):151–175, 2010a.

S. Ben-David, T. Lu, T. Luu, and D. Pál. Impossibility theorems for domain adaptation. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2010b.

K. Bousmalis, G. Trigeorgis, N.
Silberman, D. Krishnan, and D. Erhan. Domain separation networks. In Advances in Neural Information Processing Systems, pages 343–351, 2016.

K. Bousmalis, N. Silberman, D. Dohan, D. Erhan, and D. Krishnan. Unsupervised pixel-level domain adaptation with generative adversarial networks. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jul 2017. doi: 10.1109/cvpr.2017.18.

J. S. Bridle and S. J. Cox. RecNorm: Simultaneous normalisation and classification applied to speech recognition. In Advances in Neural Information Processing Systems, pages 234–240, 1991.

F. M. Carlucci, P. Russo, T. Tommasi, and B. Caputo. Agnostic domain generalization. arXiv preprint arXiv:1808.01102, 2018.

F. M. Carlucci, A. D'Innocente, S. Bucci, B. Caputo, and T. Tommasi. Domain generalization by solving jigsaw puzzles. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.

G. Csurka. Domain adaptation for visual applications: A comprehensive survey. arXiv preprint arXiv:1702.05374, 2017.

J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.

Z. Ding and Y. Fu. Deep domain generalization with structured low-rank constraint. IEEE Transactions on Image Processing, 27(1):304–313, 2018.

S. Erfani, M. Baktashmotlagh, M. Moshtaghi, V. Nguyen, C. Leckie, J. Bailey, and R. Kotagiri. Robust domain generalisation by enforcing distribution invariance. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, pages 1455–1461. AAAI Press/International Joint Conferences on Artificial Intelligence, 2016.

Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096–2030, 2016.

T. Gebru, J.
Hoffman, and L. Fei-Fei. Fine-grained recognition in the wild: A multi-task domain adaptation approach. 2017 IEEE International Conference on Computer Vision (ICCV), Oct 2017. doi: 10.1109/iccv.2017.151.

R. Geirhos, P. Rubisch, C. Michaelis, M. Bethge, F. A. Wichmann, and W. Brendel. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. In International Conference on Learning Representations, 2019.

M. Ghifary, W. Bastiaan Kleijn, M. Zhang, and D. Balduzzi. Domain generalization for object recognition with multi-task autoencoders. In Proceedings of the IEEE International Conference on Computer Vision, pages 2551–2559, 2015.

A. Gretton, A. J. Smola, J. Huang, M. Schmittfull, K. M. Borgwardt, and B. Schölkopf. Covariate shift by kernel mean matching. Journal of Machine Learning Research, 2009.

J. J. Heckman. Sample selection bias as a specification error (with an application to the estimation of labor supply functions), 1977.

D. Hendrycks and T. Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. In International Conference on Learning Representations, 2019.

J. Hoffman, E. Tzeng, T. Darrell, and K. Saenko. Simultaneous deep transfer across domains and tasks. Advances in Computer Vision and Pattern Recognition, pages 173–187, 2017. ISSN 2191-6594. doi: 10.1007/978-3-319-58347-1_9.

J. Hoffman, E. Tzeng, T. Park, J.-Y. Zhu, P. Isola, K. Saenko, A. Efros, and T. Darrell. CyCADA: Cycle-consistent adversarial domain adaptation. In J. Dy and A. Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1989–1998, Stockholmsmässan, Stockholm, Sweden, 10–15 Jul 2018. PMLR.

W. Hu, G. Niu, I. Sato, and M. Sugiyama. Does distributionally robust supervised learning give robust classifiers?
arXiv preprint arXiv:1611.02041, 2016.

J. Jo and Y. Bengio. Measuring the tendency of CNNs to learn surface statistical regularities. arXiv preprint arXiv:1711.11561, 2017.

F. D. Johansson, R. Ranganath, and D. Sontag. Support and invertibility in domain-invariant representations. arXiv preprint arXiv:1903.03448, 2019.

A. Kumagai and T. Iwata. Zero-shot domain adaptation without domain semantic descriptors. arXiv preprint arXiv:1807.02927, 2018.

A. Kumar, P. Sattigeri, K. Wadhawan, L. Karlinsky, R. Feris, B. Freeman, and G. Wornell. Co-regularized alignment for unsupervised domain adaptation. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 9345–9356. Curran Associates, Inc., 2018.

K. Lee, K. Lee, H. Lee, and J. Shin. A simple unified framework for detecting out-of-distribution samples and adversarial attacks. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 1019–1030. Curran Associates, Inc., 2018.

D. Li, Y. Yang, Y.-Z. Song, and T. M. Hospedales. Deeper, broader and artier domain generalization. In Computer Vision (ICCV), 2017 IEEE International Conference on, pages 5543–5551. IEEE, 2017a.

D. Li, Y. Yang, Y.-Z. Song, and T. M. Hospedales. Learning to generalize: Meta-learning for domain generalization. arXiv preprint arXiv:1710.03463, 2017b.

H. Li, S. J. Pan, S. Wang, and A. C. Kot. Domain generalization with adversarial feature learning. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2018a.

W. Li, Z. Xu, D. Xu, D. Dai, and L. Van Gool. Domain generalization and adaptation using low rank exemplar SVMs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017c.

Y. Li, M. Murias, G. Dawson, and D. E. Carlson.
Extracting relationships by multi-domain matching. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 6798–6809. Curran Associates, Inc., 2018b.

Z. C. Lipton, Y.-X. Wang, and A. Smola. Detecting and correcting for label shift with black box predictors. In International Conference on Machine Learning (ICML), 2018.

M. Long, H. Zhu, J. Wang, and M. I. Jordan. Unsupervised domain adaptation with residual transfer networks. In Advances in Neural Information Processing Systems, pages 136–144, 2016.

M. Long, Z. Cao, J. Wang, and M. I. Jordan. Conditional adversarial domain adaptation. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 1640–1650. Curran Associates, Inc., 2018.

M. Mancini, S. R. Bulò, B. Caputo, and E. Ricci. Best sources forward: domain generalization through source-specific nets. arXiv preprint arXiv:1806.05810, 2018.

C. F. Manski and S. R. Lerman. The estimation of choice probabilities from choice based samples. Econometrica: Journal of the Econometric Society, 1977.

Y. Mansour, M. Mohri, and A. Rostamizadeh. Domain adaptation: Learning bounds and algorithms. arXiv preprint arXiv:0902.3430, 2009.

S. Motiian, M. Piccirilli, D. A. Adjeroh, and G. Doretto. Unified deep supervised domain adaptation and generalization. 2017 IEEE International Conference on Computer Vision (ICCV), Oct 2017a. doi: 10.1109/iccv.2017.609.

S. Motiian, M. Piccirilli, D. A. Adjeroh, and G. Doretto. Unified deep supervised domain adaptation and generalization. In The IEEE International Conference on Computer Vision (ICCV), volume 2, page 3, 2017b.

K. Muandet, D. Balduzzi, and B. Schölkopf. Domain generalization via invariant feature representation.
In International Conference on Machine Learning, pages 10–18, 2013.

L. Niu, W. Li, and D. Xu. Multi-view domain generalization for visual recognition. In Proceedings of the IEEE International Conference on Computer Vision, pages 4193–4201, 2015.

QuickDraw. Quick, Draw! The data, 2018. URL https://quickdraw.withgoogle.com/data.

B. Recht, R. Roelofs, L. Schmidt, and V. Shankar. Do ImageNet classifiers generalize to ImageNet? 2019.

P. Sangkloy, N. Burnell, C. Ham, and J. Hays. The Sketchy database: Learning to retrieve badly drawn bunnies. ACM Transactions on Graphics (Proceedings of SIGGRAPH), 2016.

A. Schoenauer-Sebag, L. Heinrich, M. Schoenauer, M. Sebag, L. Wu, and S. Altschuler. Multi-domain adversarial learning. In International Conference on Learning Representations, 2019.

B. Schölkopf, D. Janzing, J. Peters, E. Sgouritsa, K. Zhang, and J. Mooij. On causal and anticausal learning. In International Conference on Machine Learning (ICML-12), pages 459–466. Omnipress, 2012.

H. Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 2000.

A. Storkey. When training and test sets are different: characterizing learning transfer. Dataset Shift in Machine Learning, 2009.

E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell. Adversarial discriminative domain adaptation. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jul 2017. doi: 10.1109/cvpr.2017.316.

R. Volpi, H. Namkoong, O. Sener, J. C. Duchi, V. Murino, and S. Savarese. Generalizing to unseen domains via adversarial data augmentation. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, pages 5334–5344. Curran Associates, Inc., 2018.

H. Wang, A. Meghawat, L.-P. Morency, and E. P. Xing.
Select-additive learning: Improving generalization in multimodal sentiment analysis. arXiv preprint arXiv:1609.05244, 2016.

H. Wang, Z. He, Z. C. Lipton, and E. P. Xing. Learning robust representations by projecting superficial statistics out. In International Conference on Learning Representations, 2019.

M. Wang and W. Deng. Deep visual domain adaptation: A survey. Neurocomputing, 312:135–153, Oct 2018. ISSN 0925-2312. doi: 10.1016/j.neucom.2018.05.083.

K. Weiss, T. M. Khoshgoftaar, and D. Wang. A survey of transfer learning. Journal of Big Data, 3(1):9, 2016.

Y. Wu, E. Winston, D. Kaushik, and Z. Lipton. Domain adaptation with asymmetrically-relaxed distribution alignment. International Conference on Machine Learning, 2019.

S. Xie, Z. Zheng, L. Chen, and C. Chen. Learning semantic representations for unsupervised domain adaptation. In J. Dy and A. Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 5423–5432, Stockholmsmässan, Stockholm, Sweden, 10–15 Jul 2018. PMLR.

K. Zhang, B. Schölkopf, K. Muandet, and Z. Wang. Domain adaptation under target and conditional shift. In International Conference on Machine Learning, 2013.

A. Zhao, M. Ding, J. Guan, Z. Lu, T. Xiang, and J.-R. Wen. Domain-invariant projection learning for zero-shot recognition. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 1019–1030. Curran Associates, Inc., 2018a.

H. Zhao, S. Zhang, G. Wu, J. M. F. Moura, J. P. Costeira, and G. J. Gordon. Adversarial multiple source domain adaptation. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R.
Garnett, editors, Advances in Neural Information Processing Systems 31, pages 8559–8570. Curran Associates, Inc., 2018b.