{"title": "Co-regularized Alignment for Unsupervised Domain Adaptation", "book": "Advances in Neural Information Processing Systems", "page_first": 9345, "page_last": 9356, "abstract": "Deep neural networks, trained with large amounts of labeled data, can fail to generalize well when tested with examples from a target domain whose distribution differs from the training data distribution, referred to as the source domain. It can be expensive or even infeasible to obtain the required amount of labeled data in all possible domains. Unsupervised domain adaptation sets out to address this problem, aiming to learn a good predictive model for the target domain using labeled examples from the source domain but only unlabeled examples from the target domain.\nDomain alignment approaches this problem by matching the source and target feature distributions, and has been used as a key component in many state-of-the-art domain adaptation methods. However, matching the marginal feature distributions does not guarantee that the corresponding class conditional distributions will be aligned across the two domains. We propose co-regularized domain alignment for unsupervised domain adaptation, which constructs multiple diverse feature spaces and aligns source and target distributions in each of them individually, while encouraging the alignments to agree with each other with regard to the class predictions on the unlabeled target examples.\nThe proposed method is generic and can be used to improve any domain adaptation method which uses domain alignment. 
We instantiate it in the context of a recent state-of-the-art method and observe that it provides significant performance improvements on several domain adaptation benchmarks.", "full_text": "Co-regularized Alignment for Unsupervised Domain Adaptation\n\nAbhishek Kumar (MIT-IBM Watson AI Lab, IBM Research) abhishk@us.ibm.com\nPrasanna Sattigeri (MIT-IBM Watson AI Lab, IBM Research) psattig@us.ibm.com\nKahini Wadhawan (MIT-IBM Watson AI Lab, IBM Research) kahini.wadhawan@ibm.com\nLeonid Karlinsky (MIT-IBM Watson AI Lab, IBM Research) leonidka@il.ibm.com\nRogerio Feris (MIT-IBM Watson AI Lab, IBM Research) rsferis@us.ibm.com\nWilliam T. Freeman (MIT) billf@mit.edu\nGregory Wornell (MIT) gww@mit.edu\n\nAbstract\n\nDeep neural networks, trained with large amounts of labeled data, can fail to generalize well when tested with examples from a target domain whose distribution differs from the training data distribution, referred to as the source domain. It can be expensive or even infeasible to obtain the required amount of labeled data in all possible domains. Unsupervised domain adaptation sets out to address this problem, aiming to learn a good predictive model for the target domain using labeled examples from the source domain but only unlabeled examples from the target domain. Domain alignment approaches this problem by matching the source and target feature distributions, and has been used as a key component in many state-of-the-art domain adaptation methods. However, matching the marginal feature distributions does not guarantee that the corresponding class conditional distributions will be aligned across the two domains. 
We propose co-regularized domain alignment for unsupervised domain adaptation, which constructs multiple diverse feature spaces and aligns source and target distributions in each of them individually, while encouraging the alignments to agree with each other with regard to the class predictions on the unlabeled target examples. The proposed method is generic and can be used to improve any domain adaptation method which uses domain alignment. We instantiate it in the context of a recent state-of-the-art method and observe that it provides significant performance improvements on several domain adaptation benchmarks.\n\n1 Introduction\n\nDeep learning has shown impressive performance improvements on a wide variety of tasks. These remarkable gains often rely on access to large amounts of labeled examples (x, y) for the concepts of interest (y \u2208 Y). However, a predictive model trained on a certain distribution of data ({(x, y) : x \u223c Ps(x)}, referred to as the source domain) can fail to generalize when faced with observations pertaining to the same concepts but from a different distribution (x \u223c Pt(x), referred to as the target domain). This problem of mismatch between training and test data distributions is commonly referred to as domain or covariate shift [34]. The goal in domain adaptation is to address this mismatch and obtain a model that generalizes well on the target domain with limited or no labeled examples from\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nFigure 1: Example scenarios for domain alignment between source S (green) and target T (blue). The continuous boundary denotes the + class and the dashed boundary denotes the \u2212 class. 
(a) Ps and Pt are not aligned but dH\u2206H(Ps, Pt) is zero for H (a hypothesis class of linear separators) given by the shaded orange region, (b) marginal distributions Ps and Pt are aligned reasonably well but the expected error \u03bb is high, (c) marginal distributions Ps and Pt are aligned reasonably well and the expected error \u03bb is small.\n\nthe target domain. Domain adaptation finds applications in many practical scenarios, including the special case when the source domain consists of simulated or synthetic data (for which labels are readily available from the simulator) and the target domain consists of real world observations [43, 39, 5]. We consider the problem of unsupervised domain adaptation, where the learner has access to only unlabeled examples from the target domain. The goal is to learn a good predictive model for the target domain using labeled source examples and unlabeled target examples. Domain alignment [13, 15] approaches this problem by extracting features that are invariant to the domain but preserve the discriminative information required for prediction. Domain alignment has been used as a crucial ingredient in numerous existing domain adaptation methods [17, 26, 41, 40, 4, 16, 44, 42, 35]. The core idea is to align the distributions of points (in the feature space) belonging to the same concept class across the two domains (i.e., aligning g#Ps(\u00b7|y) and g#Pt(\u00b7|y), where g is a measurable feature generator mapping and g#P denotes the push-forward of a distribution P), and the prediction performance in the target domain directly depends on the correctness of this alignment. However, the right alignment of class conditional distributions can be challenging to achieve without access to any labels in the target domain. 
Indeed, there is still a significant gap between the performance of unsupervised domain adapted classifiers obtained with existing methods and a fully-supervised target classifier, especially when the discrepancy between the source and target domains is high.1\nIn this work, we propose an approach to improve the alignment of class conditional feature distributions of source and target domains for unsupervised domain adaptation. Our approach works by constructing two (or possibly more) diverse feature embeddings for the source domain examples and aligning the target domain feature distribution to each of them individually. We co-regularize the multiple alignments by making them agree with each other with regard to the class prediction, which helps in reducing the search space of possible alignments while still keeping the correct set of alignments under consideration. The proposed method is generic and can be used to improve any domain adaptation method that uses domain alignment as an ingredient. We evaluate our approach on commonly used benchmark domain adaptation tasks such as digit recognition (MNIST, MNIST-M, SVHN, Synthetic Digits) and object recognition (CIFAR-10, STL), and observe significant improvement over state-of-the-art performance on these tasks.\n\n2 Formulation\n\nWe first provide a brief background on domain alignment while highlighting the challenges involved in using it for unsupervised domain adaptation.\n\n2.1 Domain Alignment\n\nThe idea of aligning source and target distributions for domain adaptation can be motivated by the following result of Ben-David et al. [2]:\nTheorem 1 ([2]) Let H be the common hypothesis class for source and target. 
The expected error for the target domain is upper bounded as\n\n\u03b5t(h) \u2264 \u03b5s(h) + (1/2) dH\u2206H(Ps, Pt) + \u03bb,  \u2200h \u2208 H,  (1)\n\nwhere dH\u2206H(Ps, Pt) = 2 sup h,h'\u2208H |Pr x\u223cPs[h(x) \u2260 h'(x)] \u2212 Pr x\u223cPt[h(x) \u2260 h'(x)]|, \u03bb = min h [\u03b5s(h) + \u03b5t(h)], and \u03b5s(h) is the expected error of h on the source domain.\nLet gs : X \u2192 Rm and gt : X \u2192 Rm be the feature generators for source and target examples, respectively. We assume gs = gt = g for simplicity, but the following discussion also holds for different gs and gt. Let g#Ps be the push-forward distribution of the source distribution Ps induced by g (similarly for g#Pt). Let H be a class of hypotheses defined over the feature space {g(x) : x \u223c Ps} \u222a {g(x) : x \u223c Pt}. It should be noted that alignment of the distributions g#Ps and g#Pt is not a necessary condition for dH\u2206H to vanish, and there may exist sets of Ps, Pt, and H for which dH\u2206H is zero without g#Ps and g#Pt being well aligned (Fig. 1a). However, for unaligned g#Ps and g#Pt, it is difficult to choose an appropriate hypothesis class H with small dH\u2206H and small \u03bb without access to labeled target data.\nOn the other hand, if the source feature distribution g#Ps and the target feature distribution g#Pt are aligned well, it is easy to see that the H\u2206H-distance will vanish for any space H of sufficiently smooth hypotheses. A small H\u2206H-distance alone does not guarantee small expected error on the target domain (Fig. 
1b): it is also required to have source and target feature distributions such that there exists a hypothesis h\u2217 \u2208 H with low expected error \u03bb on both source and target domains. For well aligned marginal feature distributions, having a low \u03bb requires that the corresponding class conditional distributions g#Ps(\u00b7|y) and g#Pt(\u00b7|y) be aligned for all y \u2208 Y (Fig. 1c). However, directly pursuing the alignment of the class conditional distributions is not possible, as we do not have access to target labels in unsupervised domain adaptation. Hence most unsupervised domain adaptation methods optimize for alignment of the marginal distributions g#Ps and g#Pt, hoping that the corresponding class conditional distributions will get aligned as a result.\nThere is a large body of work on distribution alignment which becomes readily applicable here. The goal is to find a feature generator g (or a pair of feature generators gs and gt) such that g#Ps and g#Pt are close. Methods based on minimizing various distances between the two distributions (e.g., maximum mean discrepancy [17, 44], suitable divergences and their approximations [15, 4, 35]) or matching the moments of the two distributions [41, 40] have been proposed for unsupervised domain adaptation.\n\n2.2 Co-regularized Domain Alignment\n\nThe idea of co-regularization has been successfully used in semi-supervised learning [37, 38, 31, 36] for reducing the size of the hypothesis class. It works by learning two predictors in two hypothesis classes H1 and H2, respectively, while penalizing the disagreement between their predictions on the unlabeled examples. This intuitively results in shrinking the search space by ruling out predictors from H1 that don\u2019t have an agreeing predictor in H2 (and vice versa) [36]. 
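As a concrete companion to Theorem 1, the H\u2206H-distance can be computed exactly for simple hypothesis classes. The following is a minimal NumPy sketch, not from the paper: the class of 1-D threshold classifiers, the Gaussian samples, and all names are illustrative assumptions.

```python
import numpy as np

def h_delta_h_distance(xs, xt, thresholds):
    """Empirical H-delta-H distance of Theorem 1 for 1-D threshold
    classifiers H = {h_c(x) = 1[x > c] : c in thresholds}: twice the
    largest gap, over hypothesis pairs (h, h'), between the disagreement
    rate Pr[h != h'] under the source sample xs and under the target
    sample xt."""
    ps = xs[None, :] > np.asarray(thresholds)[:, None]  # (|H|, n_s)
    pt = xt[None, :] > np.asarray(thresholds)[:, None]  # (|H|, n_t)
    best = 0.0
    for i in range(len(thresholds)):
        dis_s = np.mean(ps[i] != ps, axis=1)  # disagreement with h_i on source
        dis_t = np.mean(pt[i] != pt, axis=1)  # disagreement with h_i on target
        best = max(best, float(np.max(np.abs(dis_s - dis_t))))
    return 2.0 * best

rng = np.random.default_rng(0)
xs = rng.normal(0.0, 1.0, 1000)   # source sample
xt = rng.normal(3.0, 1.0, 1000)   # target sample under a large shift
grid = np.linspace(-3.0, 6.0, 50)
print(h_delta_h_distance(xs, xs, grid))  # identical samples: exactly 0.0
print(h_delta_h_distance(xs, xt, grid))  # shifted target: close to the maximum of 2
```

Aligning the feature distributions, as in Section 2.1, drives this quantity toward zero for any such hypothesis class.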
When H1 and H2 are reproducing kernel Hilbert spaces, the co-regularized hypothesis class has been formally shown to have a reduced Rademacher complexity, by an amount that depends on a certain data dependent distance between the two views [31]. This results in improved generalization bounds comparing with the best predictor in the co-regularized class (reduced variance).2\nSuppose the true labeling functions for the source and target domains are given by fs : X \u2192 Y and ft : X \u2192 Y, respectively. Let Xy_s = {x : fs(x) = y, x \u223c Ps} and Xy_t = {x : ft(x) = y, x \u223c Pt} be the sets which are assigned label y in the source and target domains, respectively. As discussed in the earlier section, the hope is that alignment of the marginal distributions g#Ps and g#Pt will result in aligning the corresponding class conditionals g#Ps(\u00b7|y) and g#Pt(\u00b7|y), but this is not guaranteed. There might be sets Ay1_s \u2282 Xy1_s and Ay2_t \u2282 Xy2_t, for y1 \u2260 y2, such that their images under g (i.e., g(Ay1_s) := {g(x) : x \u2208 Ay1_s} and g(Ay2_t) := {g(x) : x \u2208 Ay2_t}) get aligned in the feature space, which is difficult to detect or correct in the absence of target labels. We propose to use the idea of co-regularization to trim the space of possible alignments without ruling out the desirable alignments of class conditional distributions from the space.\n\n1 Heavily-tuned manual data augmentation can be used to bring the two domains closer in the observed space X [14], but it requires the augmentation to be tuned individually for every domain pair to be successful.\n2 Sridharan and Kakade [38] show that the bias introduced by co-regularization is small when each view carries sufficient information about Y on its own (i.e., the mutual informations I(Y; X1|X2) and I(Y; X2|X1) are small), and the generalization bounds comparing with the Bayes optimal predictor are also tight.\n\n
Let G1, G2 be the two hypothesis spaces for the feature generators, and H1, H2 be the hypothesis classes of predictors defined on the output of the feature generators from G1 and G2, respectively. We want to learn a gi \u2208 Gi and an hi \u2208 Hi such that hi \u25e6 gi minimizes the prediction error on the source domain, while aligning the source and target feature distributions by minimizing a suitable distance D(gi#Ps, gi#Pt) (for i = 1, 2). To measure the disagreement between the alignments of feature distributions in the two feature spaces (gi#Ps and gi#Pt, for i = 1, 2), we look at the distance between the predictions (h1 \u25e6 g1)(x) and (h2 \u25e6 g2)(x) on unlabeled target examples x \u223c Pt. If the predictions agree, it can be seen as an indicator that the alignment of source and target feature distributions is similar across the two feature spaces induced by g1 and g2 (with respect to the classifier boundaries). Coming back to the example of erroneous alignment given in the previous paragraph, if there is a g1 \u2208 G1 which aligns g1(Ay1_s) and g1(Ay2_t) but does not have any agreeing g2 \u2208 G2 with respect to the classifier predictions, it will be ruled out of consideration. 
Hence, ideally we would like to construct G1 and G2 such that they induce complementary erroneous alignments of source and target distributions, while each of them still contains the set of desirable feature generators that produce the right alignments.\nThe proposed co-regularized domain alignment (referred to as Co-DA) can be summarized by the following objective function (denoting fi = hi \u25e6 gi for i = 1, 2):\n\nmin_{gi\u2208Gi, hi\u2208Hi, fi=hi\u25e6gi}  Ly(f1; Ps) + \u03bbd Ld(g1#Ps, g1#Pt) + Ly(f2; Ps) + \u03bbd Ld(g2#Ps, g2#Pt) + \u03bbp Lp(f1, f2; Pt) \u2212 \u03bbdiv Dg(g1, g2),  (2)\n\nwhere Ly(fi; Ps) := \u2212Ex,y\u223cPs[y\u22a4 ln fi(x)] is the usual cross-entropy loss for the source examples (assuming fi outputs the probabilities of classes and y is the label vector), Ld(\u00b7,\u00b7) is the loss term measuring the distance between the two distributions, Lp(f1, f2; Pt) := Ex\u223cPt lp(f1(x), f2(x)), where lp(\u00b7,\u00b7) measures the disagreement between the two predictions for a target sample, and Dg(g1, g2) quantifies the diversity of g1 and g2. In the following, we instantiate Co-DA algorithmically, arriving at a concrete objective that can be optimized.\n\n2.2.1 Algorithmic Instantiation\n\nWe make our approach of co-regularized domain alignment more concrete by making the following algorithmic choices:\nDomain alignment. Following much of the earlier work, we minimize the variational form of the Jensen-Shannon (JS) divergence [29, 18] between source and target feature distributions [15, 4, 35]:\n\nLd(gi#Ps, gi#Pt) := sup_di Ldisc(gi, di; Ps, Pt),  where Ldisc(gi, di; Ps, Pt) := Ex\u223cPs ln di(gi(x)) + Ex\u223cPt ln(1 \u2212 di(gi(x))),  (3)\n\nwhere di is the domain discriminator, taken to be a two-layer neural network that outputs the probability of the input sample belonging to the source domain.\nTarget prediction agreement. 
We use the \u21131 distance between the predicted class probabilities (twice the total variation distance) as the measure of disagreement (although other measures such as the JS-divergence are also possible):\n\nLp(f1, f2; Pt) := Ex\u223cPt \u2016f1(x) \u2212 f2(x)\u20161  (4)\n\nDiverse g1 and g2. It is desirable to have g1 and g2 such that the errors in the distribution alignments are different from each other, so that target prediction agreement can play its role. To this end, we encourage the source feature distributions induced by g1 and g2 to be different from each other. There can be multiple ways to approach this; here we adopt the simpler option of pushing the minibatch means (with batch size b) far apart:\n\nDg(g1, g2) := min( \u2016(1/b) \u03a3_{j=1}^{b} (g1(xj) \u2212 g2(xj))\u2016_2^2 , \u03bd ),  xj \u223c Ps  (5)\n\nThe hyperparameter \u03bd is a positive real controlling the maximum disparity between g1 and g2. This is needed for the stability of the feature maps g1 and g2 during training: we empirically observed that taking \u03bd to be infinity results in their continued divergence from each other, harming the alignment of source and target distributions in both G1 and G2. Note that we only encourage the source feature distributions g1#Ps and g2#Ps to be different, hoping that aligning the corresponding target distributions g1#Pt and g2#Pt to them will produce different alignments.\nCluster assumption. The large amount of unlabeled target data can be used to bias the classifier boundaries to pass through regions containing a low density of data points. This is referred to as the cluster assumption [7], which has been used for semi-supervised learning [19, 27] and was also recently used for unsupervised domain adaptation [35]. 
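The alignment-related terms above lend themselves to a direct numeric reading. Below is a hedged NumPy sketch (the released implementation is in TensorFlow; the function names, the fixed logistic discriminator, and the toy minibatches are illustrative assumptions, not the authors' code): the inner value of Eq. (3) for a fixed discriminator, the \u21131 disagreement of Eq. (4), and the clamped minibatch-mean diversity of Eq. (5).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def l_disc(feat_s, feat_t, w, b):
    """Inner value of Eq. (3) for a fixed logistic discriminator
    d(z) = sigmoid(w.z + b): E_s[ln d(g(x))] + E_t[ln(1 - d(g(x)))]."""
    return (np.mean(np.log(sigmoid(feat_s @ w + b)))
            + np.mean(np.log(1.0 - sigmoid(feat_t @ w + b))))

def l_p(probs1, probs2):
    """Eq. (4): expected l1 distance between the two classifiers'
    predicted class probabilities on a target minibatch."""
    return np.mean(np.sum(np.abs(probs1 - probs2), axis=1))

def d_g(feat1, feat2, nu):
    """Eq. (5): squared distance between the minibatch means of the two
    feature embeddings, clamped at nu for training stability."""
    gap = feat1.mean(axis=0) - feat2.mean(axis=0)
    return min(float(gap @ gap), nu)

rng = np.random.default_rng(0)
feat_s, feat_t = rng.normal(size=(64, 8)), rng.normal(size=(64, 8))
# A blind discriminator (w = 0) outputs 1/2 everywhere; this is the value
# every discriminator is reduced to when the two feature distributions coincide.
blind = l_disc(feat_s, feat_t, np.zeros(8), 0.0)   # = 2 ln(1/2)
p1 = np.array([[0.9, 0.1], [0.2, 0.8]])
p2 = np.array([[0.6, 0.4], [0.2, 0.8]])
agreement_gap = l_p(p1, p2)
diversity = d_g(np.zeros((4, 3)), np.ones((4, 3)), nu=1.0)  # clamped to nu
```

When the feature distributions are aligned, the supremum in Eq. (3) bottoms out at 2 ln(1/2); the clamp \u03bd in d_g is what keeps the diversity term from growing without bound during training.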
Minimization of the conditional entropy of fi(x) can be used to push the predictor boundaries away from the high density regions [19, 27, 35]. However, this alone may result in overfitting to the unlabeled examples if the classifier has high capacity. To avoid this, virtual adversarial training (VAT) [27] has been successfully used in conjunction with conditional entropy minimization to smooth the classifier surface around the unlabeled points [27, 35]. We follow this line of work and add the following additional loss terms for conditional entropy minimization and VAT to the objective in (2):\n\nLce(fi; Pt) := \u2212Ex\u223cPt[fi(x)\u22a4 ln fi(x)],  Lvt(fi; Pt) := Ex\u223cPt[ max_{\u2016r\u2016\u2264\u03b5} Dkl(fi(x) \u2016 fi(x + r)) ]  (6)\n\nWe also use the VAT loss Lvt(fi; Ps) on the source domain examples, following Shu et al. [35]. Our final objective is given as:\n\nmin_{gi, hi, fi=hi\u25e6gi}  L(f1) + L(f2) + \u03bbp Lp(f1, f2; Pt) \u2212 \u03bbdiv Dg(g1, g2),  where\nL(fi) := Ly(fi; Ps) + \u03bbd Ld(gi#Ps, gi#Pt) + \u03bbsv Lvt(fi; Ps) + \u03bbce (Lce(fi; Pt) + Lvt(fi; Pt))  (7)\n\nRemarks.\n(1) The proposed co-regularized domain alignment (Co-DA) can be used to improve any domain adaptation method that has a domain alignment component in it. We instantiate it in the context of a recently proposed method, VADA [35], which has the same objective as L(fi) in Eq. (7) and has shown state-of-the-art results on several datasets. Indeed, we observe that co-regularized domain alignment significantly improves upon these results.\n(2) The proposed method can be naturally extended to more than two hypotheses; however, we limit ourselves to two hypothesis classes in the empirical evaluations.\n\n3 Related Work\n\nDomain Adaptation. Due to the significance of domain adaptation in reducing the need for labeled data, there has been extensive activity on it during the past several years. 
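The conditional-entropy term Lce of Eq. (6) above can be checked with a small numeric sketch (NumPy, illustrative only; the VAT term is omitted here since evaluating its inner maximization requires gradients of the model):

```python
import numpy as np

def l_ce(probs):
    """Lce of Eq. (6): -E_x[ f(x)^T ln f(x) ], estimated on a minibatch
    of predicted class probabilities (rows sum to 1)."""
    eps = 1e-12  # numerical guard against log(0); not part of Eq. (6)
    return -np.mean(np.sum(probs * np.log(probs + eps), axis=1))

confident = np.array([[0.99, 0.01], [0.01, 0.99]])
uncertain = np.array([[0.5, 0.5], [0.5, 0.5]])
# Minimizing Lce favors confident predictions, which pushes the class
# boundaries away from high-density regions of the unlabeled target data.
print(l_ce(confident) < l_ce(uncertain))  # True
```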
Domain alignment has almost become a representative approach for domain adaptation, acting as a crucial component in many recently proposed methods [17, 26, 41, 40, 4, 16, 44, 42, 35]. The proposed co-regularized domain alignment framework is applicable to all such methods that utilize domain alignment as an ingredient. Perhaps most related to our proposed method is a recent work by Saito et al. [33], who proposed directly optimizing a proxy for the H\u2206H-distance [2] in the context of deep neural networks. Their model consists of a single feature generator g that feeds into two different multi-layer NN classifiers h1 and h2. Their approach alternates between two steps: (i) for a fixed g, find h1 and h2 such that the discrepancy or disagreement between the predictions (h1 \u25e6 g)(x) and (h2 \u25e6 g)(x) is maximized for x \u223c Pt; (ii) for fixed h1 and h2, find g which minimizes the discrepancy between the predictions (h1 \u25e6 g)(x) and (h2 \u25e6 g)(x) for x \u223c Pt. Our approach also has a discrepancy minimization term over the predictions for target samples, but the core idea in our approach is fundamentally different: we want to have diverse feature generators g1 and g2 that induce different alignments for the source and target populations, and which can correct each other\u2019s errors by minimizing the disagreement between them as measured by target predictions. Further, unlike [33], where the discrepancy is maximized at the final predictions (h1 \u25e6 g)(x) and (h2 \u25e6 g)(x) (Step (i)), we maximize diversity at the output of the feature generators g1 and g2. Apart from the aforementioned approaches, methods based on image translations across domains have also been proposed for unsupervised domain adaptation [24, 28, 6].\nCo-regularization and Co-training. 
The related ideas of co-training [3] and co-regularization [37, 36] have been successfully used for semi-supervised learning as well as unsupervised learning [21, 20]. Chen et al. [8] used the idea of co-training for semi-supervised domain adaptation (assuming a few target labeled examples are available) by finding a suitable split of the features into two sets based on the notion of \u03b5-expandability [1]. A related work [9] used the idea of co-regularization for semi-supervised domain adaptation, but their approach is quite different from our method: they learn different classifiers for source and target, making their predictions agree on the unlabeled target samples. Tri-training [45] can be regarded as an extension of co-training [3] and uses the output of three different classifiers to assign pseudo-labels to unlabeled examples. Saito et al. [32] proposed asymmetric tri-training for unsupervised domain adaptation, where one of the three models is learned only on pseudo-labeled target examples. Asymmetric tri-training, similar to [33], works with a single feature generator g which feeds into three different classifiers h1, h2 and h3.\nEnsemble learning. There is an extensive line of work on ensemble methods for neural nets which combine predictions from multiple models [11, 10, 30, 25, 23]. Several ensemble methods also encourage diversity among the classifiers in the ensemble [25, 23]. However, ensemble methods have a different motivation from co-regularization/co-training: in the latter, diversity and agreement go hand in hand, working together towards reducing the size of the hypothesis space, and the two classifiers converge to similar performance after the completion of training due to the agreement objective. Indeed, we observe this in our experiments as well, and either of the two classifiers can be used for test-time predictions. 
On the other hand, ensemble methods need to combine predictions from all member models to get the desired accuracy, which can be both memory and computation intensive.\n\n4 Experiments\n\nWe evaluate the proposed Co-regularized Domain Alignment (Co-DA) by instantiating it in the context of a recently proposed method, VADA [35], which has shown state-of-the-art results on several benchmarks, and observe that Co-DA yields further significant improvement over it, establishing a new state-of-the-art in several cases. For a fair comparison, we evaluate on the same datasets as used in [35] (i.e., MNIST, SVHN, MNIST-M, Synthetic Digits, CIFAR-10 and STL), and base our implementation on the code released by the authors3 to rule out incidental differences due to implementation specific details.\nNetwork architecture. VADA [35] has three components in the model architecture: a feature generator g, a feature classifier h that takes the output of g as input, and a domain discriminator d for domain alignment (Eq. 3). Their data classifier f = h \u25e6 g consists of nine conv layers followed by a global pool and an fc layer, with some additional dropout, max-pool and Gaussian noise layers in g. The last few layers of this network (the last three conv layers, global pool and fc layer) are taken as the feature classifier h, and the remaining earlier layers are taken as the feature generator g. Each conv and fc layer in g and h is followed by batch-norm. The objective of VADA for learning a data classifier fi = hi \u25e6 gi is given in Eq. (7) as L(fi). We experiment with the following two architectural versions for creating the hypotheses f1 and f2 in our method: (i) We use two VADA models as our two hypotheses, with each of these following the same architecture as used in [35] (for all three components gi, hi and di) but initialized with different random seeds. This version is referred to as Co-DA in the result tables. 
(ii) We use a single (shared) set of parameters for the conv and fc layers in g1/g2 and h1/h2, but use conditional batch-normalization [12] to create two different sets of batch-norm layers for the two hypotheses. However, we still have two different discriminators (unshared parameters) performing domain alignment for the features induced by g1 and g2. This version is referred to as Co-DAbn in the result tables. Additionally, we also experiment with fully shared network parameters without conditional batch-normalization (i.e., shared batch-norm layers): in this case, g1 and g2 differ only due to random sampling in each forward pass through the model (by virtue of the dropout and Gaussian noise layers in the feature generator). We refer to this variant as Co-DAsh (for shared parameters). The diversity term Dg (Eq. (5)) becomes inapplicable in this case. This also has resemblance to the \u03a0-model [22] and fraternal dropout [46], which were recently proposed in the context of (semi-)supervised learning.\n\n3 https://github.com/RuiShu/dirt-t\n\nOther details and hyperparameters. For domain alignment, which involves solving a saddle point problem (min_gi max_di Ldisc(gi, di; Ps, Pt), as defined in Eq. 3), Shu et al. [35] replace gradient reversal [15] with alternating minimization (max_di Ldisc(gi, di; Ps, Pt), min_gi Ldisc(gi, di; Ps, Pt)), as used by Goodfellow et al. [18] in the context of GAN training. This is claimed to alleviate the problem of saturating gradients, and we also use this approach following [35]. We also use instance normalization following [35], which helps in making the classifier invariant to channel-wide shifts and scaling of the input pixel intensities. We do not use any sort of data augmentation in any of our experiments. For the VADA hyperparameters \u03bbce and \u03bbsv (Eq. 7), we fix their values to those reported by Shu et al. [35] for all the datasets (obtained after a hyperparameter search in [35]). 
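The conditional batch-normalization used by Co-DAbn above keeps one shared feature extractor but a separate set of batch-norm scale/shift parameters per hypothesis. A minimal sketch of that mechanism (NumPy; the shapes, names, and parameter values are illustrative, not the paper's architecture):

```python
import numpy as np

def cond_batchnorm(x, idx, gammas, betas, eps=1e-5):
    """Conditional batch-norm sketch: shared features x are normalized
    with batch statistics, then scaled/shifted by the parameter set of
    hypothesis `idx` (0 or 1)."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    xhat = (x - mu) / np.sqrt(var + eps)
    return gammas[idx] * xhat + betas[idx]

x = np.random.default_rng(0).normal(2.0, 3.0, size=(32, 4))
gammas = [np.ones(4), 2.0 * np.ones(4)]   # one set of BN parameters per hypothesis
betas = [np.zeros(4), 0.5 * np.ones(4)]
h1 = cond_batchnorm(x, 0, gammas, betas)  # features seen by hypothesis 1
h2 = cond_batchnorm(x, 1, gammas, betas)  # features seen by hypothesis 2
```

h1 and h2 here differ only through their batch-norm parameters, mirroring how g1/g2 share conv and fc weights in Co-DAbn.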
For the domain alignment hyperparameter \u03bbd, we do our own search over the grid {10^\u22123, 10^\u22122} (the grid for \u03bbd was taken to be {0, 10^\u22122} in [35]). The hyperparameter for target prediction agreement, \u03bbp, was obtained by a search over {10^\u22123, 10^\u22122, 10^\u22121}. For the hyperparameters in the diversity term, we fix \u03bbdiv = 10^\u22122 and do a grid search for \u03bd (Eq. 5) over {1, 5, 10, 100}. The hyperparameters are tuned by randomly selecting 1000 target labeled examples from the training set and using them for validation, following [35, 32]. We completely follow [35] for training our model, using the Adam optimizer (lr = 0.001, \u03b21 = 0.5, \u03b22 = 0.999) with Polyak averaging (i.e., an exponential moving average with momentum = 0.998 on the parameter trajectory), and train the models in all experiments for 80k iterations as in [35].\nBaselines. We primarily compare with VADA [35] to show that co-regularized domain alignment can provide further improvements over state-of-the-art results. We also show results for Co-DA without the diversity loss term (i.e., \u03bbdiv = 0) to tease apart the effect of explicitly encouraging diversity through Eq. 5 (note that some diversity can arise even with \u03bbdiv = 0, due to different random seeds, and the Gaussian noise / dropout layers present in g). Shu et al. [35] also propose to incrementally refine the learned VADA model by shifting the classifier boundaries to pass through low density regions of the target domain (referred to as the DIRT-T phase) while keeping it from moving too far away. 
If f\u2217n is the classifier at iteration n (f\u22170 being the solution of VADA), the new classifier is obtained as\n\nf\u2217(n+1) = arg min_{f(n+1)} \u03bbce(Lce(f(n+1); Pt) + Lvt(f(n+1); Pt)) + \u03b2 Ex\u223cPt Dkl(f\u2217n(x) \u2016 f(n+1)(x)).\n\nWe also perform DIRT-T refinement individually on each of the two trained hypotheses obtained with Co-DA (i.e., f\u22170_1 and f\u22170_2) to see how it compares with DIRT-T refinement on the VADA model [35]. Note that the DIRT-T refinement phase is carried out individually for f\u22170_1 and f\u22170_2, and there is no co-regularization term connecting the two in the DIRT-T phase. Again, following the evaluation protocol in [35], we train DIRT-T for {20k, 40k, 60k, 80k} iterations, with the number of iterations taken as a hyperparameter. We do not perform any hyperparameter search for \u03b2, and the values for \u03b2 are fixed to those reported in [35] for all datasets. Apart from VADA, we also show comparisons with other recently proposed unsupervised domain adaptation methods for completeness.\n\n4.1 Domain adaptation results\n\nWe evaluate Co-DA on the following domain adaptation benchmarks. The results are shown in Table 1. The two numbers A/B in the table for the proposed methods are the individual test accuracies of both classifiers, which are quite close to each other at convergence.\nMNIST\u2192SVHN. Both MNIST and SVHN are digits datasets but differ greatly in style: MNIST consists of gray-scale handwritten digits whereas SVHN consists of house numbers from street view images. This is the most challenging domain adaptation setting in our experiments (many earlier domain adaptation methods have omitted it from their experiments due to the difficulty of adaptation). VADA [35] showed good performance (73.3%) on this challenging setting using instance normalization but without using any data augmentation. 
The proposed Co-DA improves it substantially to ∼81.7%, even surpassing the performance of VADA+DIRT-T (76.5%) [35]. Figure 2 shows the test accuracy as training proceeds. For the case of no instance normalization as well, we see a substantial improvement over VADA, from 47.5% to 52% using Co-DA and to 55.3% using Co-DAbn. Applying iterative refinement with DIRT-T [35] further improves the accuracy to 88% with instance normalization and to ∼60% without it. This sets a new state of the art for MNIST→SVHN domain adaptation without using any data augmentation. To directly measure the improvement in source and target feature distribution alignment, we also do the following experiment: (i) we take the feature embeddings g1(·) for the source training examples, reduce the dimensionality to 50 using PCA, and use these as the training set for a k-nearest-neighbor (kNN) classifier; (ii) we then compute the accuracy of this kNN classifier on the target test sets (again with PCA applied to the output of the feature generator). We also do steps (i) and (ii) for VADA, and repeat for multiple values of k. Fig. 3 compares the target test accuracy scores for VADA and Co-DA.

| Method | MNIST→SVHN | SVHN→MNIST | MNIST→MNIST-M | Syn-DIGITS→SVHN | CIFAR→STL | STL→CIFAR |
|---|---|---|---|---|---|---|
| DANN [15] | 35.7 | 71.1 | 81.5 | 90.3 | - | - |
| DSN [4] | - | 82.7 | 83.2 | 91.2 | - | - |
| ATT [32] | 52.8 | 86.2 | 94.2 | 92.9 | - | - |
| MCD [33] | - | 96.2 | - | - | - | - |
| Without instance-normalized input: | | | | | | |
| VADA [35] | 47.5 | 97.9 | 97.7 | 94.8 | 80.0 | 73.5 |
| Co-DA (λdiv = 0) | 50.7/50.1 | 97.4/97.2 | 98.9/99.0 | 94.9/94.6 | 81.3/80.0 | 76.1/75.5 |
| Co-DAbn (λdiv = 0) | 46.0/45.9 | 98.4/98.3 | 99.0/99.0 | 94.9/94.8 | 80.4/80.3 | 76.3/76.6 |
| Co-DAsh | 52.8 | 98.6 | 98.9 | 96.1 | 78.9 | 76.1 |
| Co-DA | 52.0/49.7 | 98.3/98.2 | 99.0/98.9 | 96.1/96.0 | 81.1/80.4 | 76.4/75.7 |
| Co-DAbn | 55.3/55.2 | 98.8/98.7 | 98.6/98.7 | 95.4/95.3 | 81.4/81.2 | 76.3/76.2 |
| VADA+DIRT-T [35] | 54.5 | 99.4 | 98.9 | 96.1 | - | 75.3 |
| Co-DA+DIRT-T | 59.8/60.8 | 99.4/99.4 | 99.1/99.0 | 96.4/96.5 | - | 76.3/76.6 |
| Co-DAbn+DIRT-T | 62.4/63.0 | 99.3/99.2 | 98.9/99.0 | 96.1/96.1 | - | 77.6/77.5 |
| With instance-normalized input: | | | | | | |
| VADA [35] | 73.3 | 94.5 | 95.7 | 94.9 | 78.3 | 71.4 |
| Co-DA (λdiv = 0) | 78.5/78.2 | 97.6/97.5 | 97.1/96.4 | 95.1/94.9 | 80.1/79.2 | 74.5/73.9 |
| Co-DAbn (λdiv = 0) | 74.5/74.3 | 98.4/98.4 | 96.7/96.6 | 95.3/95.2 | 78.9/79.0 | 74.2/74.4 |
| Co-DAsh | 79.9 | 98.7 | 96.9 | 96.0 | 78.4 | 74.7 |
| Co-DA | 81.7/80.9 | 98.6/98.5 | 97.5/97.0 | 96.0/95.9 | 80.6/79.9 | 74.7/74.2 |
| Co-DAbn | 81.4/81.3 | 98.5/98.5 | 98.0/97.9 | 95.3/95.3 | 80.6/80.4 | 74.7/74.6 |
| VADA+DIRT-T [35] | 76.5 | 99.4 | 98.7 | 96.2 | - | 73.3 |
| Co-DA+DIRT-T | 88.0/87.3 | 99.3/99.4 | 98.7/98.6 | 96.4/96.5 | - | 74.8/74.2 |
| Co-DAbn+DIRT-T | 86.5/86.7 | 99.4/99.3 | 98.8/98.8 | 96.4/96.5 | - | 75.9/75.6 |

Table 1: Test accuracy on the target domain. Co-DAbn is the proposed method with the two classifiers sharing parameters but having different batch-norm layers and different domain discriminators. Co-DAsh is another variant where the only sources of difference between the two classifiers are the stochastic layers (dropout and Gaussian noise); the stochastic layers collapse to their expectations at test time, so we effectively have a single classifier during the test phase. For Co-DA, the two numbers A/B are the accuracies of the two classifiers (at 80k iterations). Numbers in bold denote the best accuracy among the comparable methods and those in italics denote the close runner-up, if any. VADA and DIRT-T results are taken from [35].

SVHN→MNIST. This adaptation direction is easier, as the target domain MNIST is easy to classify, and the performance of existing methods is already quite high (97.9% with VADA).
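The alignment probe in steps (i) and (ii) above can be sketched end to end. The snippet below is a NumPy-only stand-in: the toy two-cluster features play the role of the g1(·) embeddings, and the PCA dimension of 2 replaces the 50 used in the actual experiment.

```python
import numpy as np

def pca_fit(X, k):
    """Return the mean and top-k principal directions of X (n_samples x dim)."""
    mu = X.mean(axis=0)
    # rows of Vt are principal directions of the centered data
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, Vt[:k]

def knn_predict(train_X, train_y, test_X, k=5):
    """Brute-force k-nearest-neighbor prediction by majority vote."""
    d2 = ((test_X[:, None, :] - train_X[None, :, :]) ** 2).sum(axis=-1)
    nearest = np.argsort(d2, axis=1)[:, :k]
    return np.array([np.bincount(train_y[row]).argmax() for row in nearest])

# toy stand-ins: labeled "source features" and "target features" to evaluate
rng = np.random.default_rng(0)
src = np.vstack([rng.normal(0, 0.5, (50, 8)), rng.normal(3, 0.5, (50, 8))])
src_y = np.array([0] * 50 + [1] * 50)
tgt = np.vstack([rng.normal(0, 0.5, (20, 8)), rng.normal(3, 0.5, (20, 8))])
tgt_y = np.array([0] * 20 + [1] * 20)

# step (i): PCA on source features, kNN trained on the projected source set
mu, W = pca_fit(src, k=2)  # the paper projects g1(.) embeddings to 50 dims
# step (ii): project target features with the same PCA and score the kNN
pred = knn_predict((src - mu) @ W.T, src_y, (tgt - mu) @ W.T, k=5)
acc = (pred == tgt_y).mean()
```

The better the source and target feature distributions are aligned class-wise, the higher this kNN target accuracy, which is what Fig. 3 uses to compare VADA and Co-DA.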
Co-DA is still able to yield a reasonable improvement over VADA: about 1% without instance normalization and about 4% with instance normalization. Applying DIRT-T after Co-DA does not give a significant boost over VADA+DIRT-T, as the performance is already saturated with Co-DA (close to 99%).
MNIST→MNIST-M. Images in MNIST-M are created by blending MNIST digits with random color patches from the BSDS500 dataset. Co-DA provides improvements over VADA similar to the earlier SVHN→MNIST setting: about 1% without instance normalization and about 2% with instance normalization.
Syn-DIGITS→SVHN. The Syn-DIGITS data consist of about 50k synthetic digit images of varying positioning, orientation, background, stroke color, and amount of blur. We again observe a reasonable improvement of about 1% with Co-DA over VADA, getting close to the accuracy of a fully supervised target model for SVHN (without data augmentation).

Figure 2: Test accuracy as the training iterations proceed for MNIST→SVHN with instance normalization: there is high disagreement between the two classifiers during the earlier iterations for Co-DA, which eventually vanishes at convergence. VADA [35] gets to a much higher accuracy early on during training but eventually falls short of Co-DA's performance.

Figure 3: Test accuracy of a kNN classifier on the target domain for VADA and Co-DA: source domain features (output of g1(·), followed by PCA reducing the dimensionality to 50) are used as training data for the classifier.

CIFAR↔STL. CIFAR has more labeled examples than STL, hence CIFAR→STL is an easier adaptation problem than STL→CIFAR.
We observe more significant gains on the harder problem of STL→CIFAR, with Co-DA improving over VADA by 3% both with and without instance normalization.

5 Conclusion

We proposed co-regularization based domain alignment for unsupervised domain adaptation. We instantiated it in the context of a state-of-the-art domain adaptation method and observed that it provides improved performance on some commonly used domain adaptation benchmarks, with substantial gains on the more challenging tasks, setting a new state of the art in these cases. Further investigation is needed into more effective diversity losses (Eq. (5)). A theoretical understanding of co-regularization for domain adaptation in the context of deep neural networks, particularly characterizing its effect on the alignment of source and target feature distributions, is also an interesting direction for future work.

References

[1] Maria-Florina Balcan, Avrim Blum, and Ke Yang. Co-training and expansion: Towards bridging theory and practice. In Advances in Neural Information Processing Systems, pages 89–96, 2005.

[2] Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. A theory of learning from different domains. Machine Learning, 79(1-2):151–175, 2010.

[3] A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In Conference on Learning Theory, 1998.

[4] Konstantinos Bousmalis, George Trigeorgis, Nathan Silberman, Dilip Krishnan, and Dumitru Erhan. Domain separation networks. In Advances in Neural Information Processing Systems, pages 343–351, 2016.

[5] Konstantinos Bousmalis, Alex Irpan, Paul Wohlhart, Yunfei Bai, Matthew Kelcey, Mrinal Kalakrishnan, Laura Downs, Julian Ibarz, Peter Pastor, Kurt Konolige, et al. Using simulation and domain adaptation to improve efficiency of deep robotic grasping.
arXiv preprint arXiv:1709.07857, 2017.\n\n[6] Konstantinos Bousmalis, Nathan Silberman, David Dohan, Dumitru Erhan, and Dilip Krishnan. Unsuper-\nvised pixel-level domain adaptation with generative adversarial networks. In The IEEE Conference on\nComputer Vision and Pattern Recognition (CVPR), volume 1, page 7, 2017.\n\n[7] Olivier Chapelle and A. Zien. Semi-Supervised Classi\ufb01cation by Low Density Separation. In AISTATS,\n\npages 57\u201364, 2005.\n\n[8] Minmin Chen, Kilian Q Weinberger, and John Blitzer. Co-training for domain adaptation. In Advances in\n\nneural information processing systems, pages 2456\u20132464, 2011.\n\n[9] Hal Daume III, Abhishek Kumar, and Avishek Saha. Co-regularization Based Semi-supervised Domain\n\nAdaptation. In Advances in Neural Information Processing Systems, 2010.\n\n[10] Thomas G Dietterich. Ensemble methods in machine learning. In International workshop on multiple\n\nclassi\ufb01er systems, pages 1\u201315. Springer, 2000.\n\n[11] Harris Drucker, Corinna Cortes, Lawrence D Jackel, Yann LeCun, and Vladimir Vapnik. Boosting and\n\nother ensemble methods. Neural Computation, 6(6):1289\u20131301, 1994.\n\n[12] Vincent Dumoulin, Jonathon Shlens, and Manjunath Kudlur. A learned representation for artistic style. In\n\nProceedings of the International Conference on Learning Representations, Toulon, France, April 2017.\n\n[13] Basura Fernando, Amaury Habrard, Marc Sebban, and Tinne Tuytelaars. Unsupervised visual domain\nadaptation using subspace alignment. In Computer Vision (ICCV), 2013 IEEE International Conference\non, pages 2960\u20132967. IEEE, 2013.\n\n[14] Geoffrey French, Michal Mackiewicz, and Mark Fisher. Self-ensembling for domain adaptation. In\n\nInternational Conference on Learning Representations, 2018.\n\n[15] Yaroslav Ganin and Victor Lempitsky. Unsupervised domain adaptation by backpropagation. 
In ICML, 2015.

[16] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096–2030, 2016.

[17] Muhammad Ghifary, W Bastiaan Kleijn, and Mengjie Zhang. Domain adaptive neural networks for object recognition. In Pacific Rim International Conference on Artificial Intelligence, pages 898–904. Springer, 2014.

[18] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

[19] Yves Grandvalet and Yoshua Bengio. Semi-supervised learning by entropy minimization. In Advances in Neural Information Processing Systems, pages 529–536, 2005.

[20] Abhishek Kumar and Hal Daumé. A co-training approach for multi-view spectral clustering. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 393–400, 2011.

[21] Abhishek Kumar, Piyush Rai, and Hal Daume. Co-regularized multi-view spectral clustering. In Advances in Neural Information Processing Systems, pages 1413–1421, 2011.

[22] Samuli Laine and Timo Aila. Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242, 2016.

[23] Stefan Lee, Senthil Purushwalkam Shiva Prakash, Michael Cogswell, Viresh Ranjan, David Crandall, and Dhruv Batra. Stochastic multiple choice learning for training diverse deep ensembles. In Advances in Neural Information Processing Systems, pages 2119–2127, 2016.

[24] Ming-Yu Liu, Thomas Breuel, and Jan Kautz. Unsupervised image-to-image translation networks. In Advances in Neural Information Processing Systems, pages 700–708, 2017.

[25] Yong Liu and Xin Yao.
Ensemble learning via negative correlation. Neural Networks, 12(10):1399–1404, 1999.

[26] Mingsheng Long, Yue Cao, Jianmin Wang, and Michael I Jordan. Learning transferable features with deep adaptation networks. arXiv preprint arXiv:1502.02791, 2015.

[27] Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, and Shin Ishii. Virtual adversarial training: a regularization method for supervised and semi-supervised learning. arXiv preprint arXiv:1704.03976, 2017.

[28] Zak Murez, Soheil Kolouri, David Kriegman, Ravi Ramamoorthi, and Kyungnam Kim. Image to image translation for domain adaptation. arXiv preprint arXiv:1712.00479, 2017.

[29] XuanLong Nguyen, Martin J Wainwright, and Michael I Jordan. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11):5847–5861, 2010.

[30] Bruce E Rosen. Ensemble learning using decorrelated neural networks. Connection Science, 8(3-4):373–384, 1996.

[31] David S Rosenberg and Peter L Bartlett. The Rademacher complexity of co-regularized kernel classes. In Artificial Intelligence and Statistics, pages 396–403, 2007.

[32] Kuniaki Saito, Yoshitaka Ushiku, and Tatsuya Harada. Asymmetric tri-training for unsupervised domain adaptation. arXiv preprint arXiv:1702.08400, 2017.

[33] Kuniaki Saito, Kohei Watanabe, Yoshitaka Ushiku, and Tatsuya Harada. Maximum classifier discrepancy for unsupervised domain adaptation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

[34] Hidetoshi Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227–244, 2000.

[35] Rui Shu, Hung H Bui, Hirokazu Narui, and Stefano Ermon. A DIRT-T approach to unsupervised domain adaptation.
In International Conference on Learning Representations, 2018.

[36] Vikas Sindhwani and David S Rosenberg. An RKHS for multi-view learning and manifold co-regularization. In Proceedings of the 25th International Conference on Machine Learning, 2008.

[37] Vikas Sindhwani, Partha Niyogi, and Mikhail Belkin. A co-regularization approach to semi-supervised learning with multiple views. In Proceedings of the Workshop on Learning with Multiple Views, International Conference on Machine Learning, 2005.

[38] Karthik Sridharan and Sham M Kakade. An information theoretic framework for multi-view learning. In COLT, 2008.

[39] Baochen Sun and Kate Saenko. From virtual to reality: Fast adaptation of virtual object detectors to real domains. In BMVC, volume 1, page 3, 2014.

[40] Baochen Sun and Kate Saenko. Deep CORAL: Correlation alignment for deep domain adaptation. In European Conference on Computer Vision, pages 443–450. Springer, 2016.

[41] Eric Tzeng, Judy Hoffman, Trevor Darrell, and Kate Saenko. Simultaneous deep transfer across domains and tasks. In Computer Vision (ICCV), 2015 IEEE International Conference on, pages 4068–4076. IEEE, 2015.

[42] Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. Adversarial discriminative domain adaptation. In Computer Vision and Pattern Recognition (CVPR), volume 1, page 4, 2017.

[43] David Vazquez, Antonio M Lopez, Javier Marin, Daniel Ponsa, and David Geronimo. Virtual and real world adaptation for pedestrian detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(4):797–809, 2014.

[44] Hongliang Yan, Yukang Ding, Peihua Li, Qilong Wang, Yong Xu, and Wangmeng Zuo. Mind the class weight bias: Weighted maximum mean discrepancy for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2272–2281, 2017.

[45] Zhi-Hua Zhou and Ming Li.
Tri-training: Exploiting unlabeled data using three classifiers. IEEE Transactions on Knowledge and Data Engineering, 17(11):1529–1541, 2005.

[46] Konrad Zolna, Devansh Arpit, Dendi Suhubdy, and Yoshua Bengio. Fraternal dropout. In International Conference on Learning Representations, 2018.