{"title": "Wasserstein Dependency Measure for Representation Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 15604, "page_last": 15614, "abstract": "Mutual information maximization has emerged as a powerful learning objective for unsupervised representation learning, obtaining state-of-the-art performance in applications such as object recognition, speech recognition, and reinforcement learning. However, such approaches are fundamentally limited since a tight lower bound on mutual information requires sample size exponential in the mutual information. This limits the applicability of these approaches for prediction tasks with high mutual information, such as in video understanding or reinforcement learning. In these settings, such techniques are prone to overfit, both in theory and in practice, and capture only a few of the relevant factors of variation. This leads to incomplete representations that are not optimal for downstream tasks. In this work, we empirically demonstrate that mutual information-based representation learning approaches do fail to learn complete representations on a number of designed and real-world tasks. To mitigate these problems we introduce the Wasserstein dependency measure, which learns more complete representations by using the Wasserstein distance instead of the KL divergence in the mutual information estimator. 
We show that a practical approximation to this theoretically motivated solution, constructed using Lipschitz constraint techniques from the GAN literature, achieves substantially improved results on tasks where incomplete representations are a major challenge.", "full_text": "Wasserstein Dependency Measure for Representation Learning\n\nSherjil Ozair\nMila, Universit\u00e9 de Montr\u00e9al\n\nCorey Lynch\nGoogle Brain\n\nYoshua Bengio\nMila, Universit\u00e9 de Montr\u00e9al\n\nA\u00e4ron van den Oord\nDeepMind\n\nSergey Levine\nGoogle Brain\n\nPierre Sermanet\nGoogle Brain\n\nAbstract\n\nMutual information maximization has emerged as a powerful learning objective for unsupervised representation learning, obtaining state-of-the-art performance in applications such as object recognition, speech recognition, and reinforcement learning. However, such approaches are fundamentally limited since a tight lower bound on mutual information requires sample size exponential in the mutual information. This limits the applicability of these approaches for prediction tasks with high mutual information, such as in video understanding or reinforcement learning. In these settings, such techniques are prone to overfit, both in theory and in practice, and capture only a few of the relevant factors of variation. This leads to incomplete representations that are not optimal for downstream tasks. In this work, we empirically demonstrate that mutual information-based representation learning approaches do fail to learn complete representations on a number of designed and real-world tasks. To mitigate these problems we introduce the Wasserstein dependency measure, which learns more complete representations by using the Wasserstein distance instead of the KL divergence in the mutual information estimator. 
We show that a practical approximation to this theoretically motivated solution, constructed using Lipschitz constraint techniques from the GAN literature, achieves substantially improved results on tasks where incomplete representations are a major challenge.\n\n1 Introduction\n\nRecent success in supervised learning can arguably be attributed to the paradigm shift from engineering representations to learning representations [32]. Especially in the supervised setting, effective representations can be acquired directly from the labels. However, representation learning in the unsupervised setting, without hand-specified labels, becomes significantly more challenging: although much more data is available for learning, this data lacks the clear learning signal that would be provided by human-specified semantic labels.\n\nNevertheless, unsupervised representation learning has made significant progress recently, due to a number of different approaches. Representations can be learned via implicit generative methods [22, 17, 16, 40], via explicit generative models [28, 44, 13, 29], and via self-supervised learning [6, 15, 50, 14, 47, 49, 24]. Among these, the latter methods are particularly appealing because they remove the need to actually generate full observations (e.g., image pixels or audio waveforms). Self-supervised learning techniques have demonstrated state-of-the-art performance in speech and image understanding [47, 24], reinforcement learning [25, 18, 26], imitation learning [45, 5], and natural language processing [12, 43].\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nSelf-supervised learning techniques make use of discriminative pretext tasks, chosen in such a way that their labels can be extracted automatically and such that solving the task requires a semantic understanding of the data, and therefore a meaningful representation. For instance, Doersch et al. 
[15] predict the relative position of adjacent patches extracted from an image. Zhang et al. [50] reconstruct images from their grayscaled versions. Gidaris et al. [21] predict the canonically upwards direction in rotated images. Sermanet et al. [45] maximize the mutual information between two views of the same scene. Hjelm et al. [24] maximize mutual information between an image and patches of the image.\n\nHowever, a major issue with such techniques is that the pretext task must not admit trivial or easy solutions. For instance, Doersch et al. [15] found that relative position could be easily predicted using low-level cues such as boundary patterns, shared textures, long edges, and even chromatic aberration. In some cases, trivial solutions are easily identifiable and rectified, such as by adding gaps between patches, adding random jitter, and/or grayscaling the patches.\n\nIdentifying such exploits and finding fixes is a cumbersome process that requires expert knowledge about the domain and can still fail to eliminate all degenerate solutions. However, even when care is taken to remove such low-level regularities, self-supervised representation learning techniques can still suffer and produce incomplete representations, i.e., representations that capture only a few of the underlying factors of variation in the data.\n\nRecent work has provided a theoretical underpinning of this empirically observed shortcoming. A number of self-supervised learning techniques can be shown to be maximizing a lower bound to mutual information between representations of different data modalities [7, 47, 42]. However, as shown by McAllester and Stratos [35], lower bounds to the mutual information are only tight for sample size exponential in the mutual information. Unfortunately, many practical problems of interest where representation learning would be beneficial have large mutual information. 
For instance, mutual information between successive frames in a temporal setting scales with the number of objects in the scene. Self-supervised learning techniques in such settings often only capture a few objects, since modeling a few objects is sufficient to confidently predict future frames from a random sample of frames.\n\nIn this paper, we motivate this limitation formally in terms of the fundamental limitations of mutual information estimation and KL divergences, and show examples of this limitation empirically, illustrating relatively simple problems where fully reconstructive models can easily learn complete representations, while self-supervised learning methods struggle. We then propose a potential solution to this problem by employing the Wasserstein metric in place of the KL divergence as a training objective. In practice, we show that approximating this by means of recently proposed regularization methods designed for generative adversarial networks can substantially reduce the incomplete representation problem, leading to a substantial improvement in the ability of representations learned via mutual information estimation to capture task-salient features.\n\n2 Mutual Information Estimation and Maximization\n\nMutual information for representation learning has a long history. One approach [33, 8] is to maximize the mutual information between observed data samples x and learned representations z = f(x), i.e., I(x; z), thus ensuring the learned representations retain the most information about the underlying data. Another [6] is to maximize the mutual information between representations of two different modalities of the data, i.e., I(f(x); f(y)).\n\nHowever, such representation learning approaches have been limited due to the difficulty of estimating mutual information. 
Previous approaches have had to make parametric assumptions about the data or use nonparametric approaches [30, 37] which don\u2019t scale well to high-dimensional data. More recently, Nguyen et al. [39], Belghazi et al. [7], van den Oord et al. [47], and Poole et al. [42] have proposed variational energy-based lower bounds to the mutual information which are tractable in high dimension and can be estimated by gradient-based optimization, which makes them suitable to combine with deep learning.\n\nWhile Becker and Hinton [6] maximize mutual information between representations of spatially adjacent patches of an image, one can also use past and future states, as shown recently by van den Oord et al. [47] and Sermanet et al. [45], which has connections to predictive coding in speech [4, 19], predictive processing in the brain [10, 41, 46], and the free energy principle [20]. These techniques have shown promising results, but their applicability is still limited to low mutual information settings.\n\n2.1 Formal Limitations in Mutual Information Estimation\n\nThe limitations of estimating mutual information via lower bounds stem from the limitations of the KL divergence as a measure of distribution similarity. Theorem 1 formalizes this limitation of estimating the KL divergence via lower bounds. This result is based on the derivation by McAllester and Stratos [35], who prove a stronger claim for the case where p(x) is fully known.\n\nTheorem 1. Let p(x) and q(x) be two distributions, and let $R = \{x_i \sim p(x)\}_{i=1}^{n}$ and $S = \{x_i \sim q(x)\}_{i=1}^{n}$ be two sets of n samples from p(x) and q(x) respectively. Let \u03b4 be a confidence parameter, and let B(R, S, \u03b4) be a real-valued function of the two samples R and S and the confidence parameter \u03b4. 
We have that, if with probability at least $1 - \delta$,\n\n$$B(R, S, \delta) \leq \mathrm{KL}(p(x) \,\|\, q(x)),$$\n\nthen with probability at least $1 - 4\delta$ we have\n\n$$B(R, S, \delta) \leq \log n.$$\n\nThus, since the mutual information is a KL divergence (between the joint distribution and the product of marginals), we can conclude that any high-confidence lower bound on the mutual information requires $n = \exp(I(x; y))$, i.e., sample size exponential in the mutual information.\n\n3 Wasserstein Dependency Measure\n\nThe KL divergence is not only problematic for representation learning due to the statistical limitations described in Theorem 1, but also because it is completely agnostic to the metric of the underlying data distribution, and invariant to any invertible transformation. The KL divergence is also sensitive to small differences in the data samples. When used for representation learning, the encoder can often represent only small parts of the data samples, since any small difference found is sufficient to maximize the KL divergence. The Wasserstein distance, however, is a metric-aware divergence, and represents the difference between two distributions in terms of the actual distance between data samples. A large Wasserstein distance actually represents large distances between the underlying data samples. On the other hand, the KL divergence can be large even if the underlying data samples differ very little.\n\nThis qualitative difference between the KL divergence and the Wasserstein distance was recently exploited by Arjovsky et al. [3] to propose the Wasserstein GAN, a metric-aware extension of the original GAN proposed by Goodfellow et al. [22], which is based on the Jensen symmetrization of the KL divergence [11]. 
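The log n ceiling of Theorem 1 can be checked numerically for the contrastive (InfoNCE-style) estimator used by CPC: whatever critic scores are plugged in, the batch estimate never exceeds log n. A minimal sketch (our illustration, not from the paper; the scores here are arbitrary random numbers standing in for a learned critic f):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 128  # batch size; the estimate is capped at log(n) nats

# Arbitrary critic scores f(x_i, y_j); the diagonal entries are the
# "positive" pairs, off-diagonal entries are negatives.
scores = 10.0 * rng.normal(size=(n, n))

def logsumexp(a, axis):
    m = a.max(axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.exp(a - m).sum(axis=axis))

# InfoNCE estimate of I(x; y): log n plus the mean log-softmax of the positives.
estimate = np.log(n) + np.mean(np.diag(scores) - logsumexp(scores, axis=1))

# The log-softmax term is <= 0, so the estimate can never exceed log n,
# no matter how large the true mutual information is.
assert estimate <= np.log(n) + 1e-9

# Even a "perfect" critic that separates positives arbitrarily well
# saturates at log n instead of recovering a larger I(x; y).
perfect = 1000.0 * np.eye(n)
saturated = np.log(n) + np.mean(np.diag(perfect) - logsumexp(perfect, axis=1))
print(saturated, np.log(n))  # both ~4.85 nats for n = 128
```

With a dataset of n samples per batch, the estimator therefore cannot certify more than log n nats, which is the practical content of the theorem.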
For GANs, we would like the discriminator to model not only the density ratio of two distributions, but the complete process of how one distribution can be transformed into another, which is the underlying basis of the theory of optimal transport and Wasserstein distances [48]. This motivates us to investigate the use of the Wasserstein distance as a replacement for the KL divergence in mutual information, which we call the Wasserstein dependency measure.\n\nDefinition 3.1. Wasserstein dependency measure. For two random variables x and y with joint distribution p(x, y), we define the Wasserstein dependency measure $I_W(x; y)$ as the Wasserstein distance between the joint distribution p(x, y) and the product of marginal distributions p(x)p(y):\n\n$$I_W(x; y) \stackrel{\mathrm{def}}{=} \mathcal{W}(p(x, y), p(x)p(y)). \quad (1)$$\n\nThus, the Wasserstein dependency measure (WDM) measures the cost of transforming samples from the marginals to samples from the joint, and thus cannot ignore parts of the input, unlike the KL divergence, which can ignore large parts of the input.\n\nChoice of Metric Space. The Wasserstein dependency measure assumes that the data lies in a known metric space. However, the purpose of representation learning is, in a sense, to use the representations to implicitly form a metric space for the data. Thus, it may seem that we\u2019re assuming the solution by requiring knowledge of the metric space. However, the difference between the two is that the base metric used in the Wasserstein distance is data-independent, while the metric induced by the representations is informed by the data. The two metrics can be thought of as prior and posterior metrics. Thus, the base metric should encode our prior beliefs about the task independent of the data samples, which acts as an inductive bias to help learn a better posterior metric induced by the learned representations. 
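The metric-awareness contrast drawn in Section 3 is easy to see in one dimension. In the sketch below (our illustration, not from the paper), three spike distributions on a grid are compared: the KL divergence cannot tell a spike that moved one bin from a spike that moved eighty bins, while the Wasserstein-1 distance, computed from CDF differences, tracks how far the mass actually moved:

```python
import numpy as np

bins = 100

def delta(i, eps=1e-12):
    # Near-point-mass at bin i, smoothed by eps so KL is finite.
    p = np.full(bins, eps)
    p[i] = 1.0
    return p / p.sum()

p = delta(10)        # reference distribution: a spike at bin 10
q_near = delta(11)   # nearly identical data: spike one bin away
q_far = delta(90)    # very different data: spike 80 bins away

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

def w1(p, q):
    # In 1-D, Wasserstein-1 is the integral of |CDF_p - CDF_q|.
    return float(np.abs(np.cumsum(p) - np.cumsum(q)).sum())

# KL is metric-agnostic: both comparisons look equally (and hugely) "far".
print(kl(p, q_near), kl(p, q_far))
# W1 reflects the actual distance the samples moved: ~1 bin vs ~80 bins.
print(w1(p, q_near), w1(p, q_far))
```

Because the two KL values are identical by construction, no KL-based objective can prefer the encoder that distinguishes "slightly different" from "very different" inputs here, whereas the Wasserstein distance separates the two cases by nearly two orders of magnitude.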
In our experiments we assume a Euclidean metric space for all the tasks.\n\n3.1 Generalization of Wasserstein Distances\n\nTheorem 1 is a statement about the inability of mutual information bounds to generalize for large values of mutual information, since the gap between the lower bound sample estimate and the true mutual information can be exponential in the mutual information value. The Wasserstein distance, however, can be shown to have better generalization properties when used with Lipschitz neural net function approximation via its dual representation. Neyshabur et al. [38] show that a neural network\u2019s generalization gap is proportional to the square root of the network\u2019s Lipschitz constant, which is bounded (= 1) for the function class used in Wasserstein distance estimation, but is unbounded for the function class used in mutual information lower bounds.\n\n4 Wasserstein Predictive Coding for Representation Learning\n\nEstimating Wasserstein distances is intractable in general. We will use the Kantorovich-Rubinstein duality [48] to obtain the dual form of the Wasserstein dependency measure, which allows for easier estimation, since the dual form allows gradient-based optimization over the function space using neural networks:\n\n$$I_W(x; y) \stackrel{\mathrm{def}}{=} \mathcal{W}(p(x, y), p(x)p(y)) = \sup_{f \in \mathcal{L}_{M \times M}} \mathbb{E}_{p(x,y)}[f(x, y)] - \mathbb{E}_{p(x)p(y)}[f(x, y)]. \quad (2)$$\n\nHere, $\mathcal{L}_{M \times M}$ is the set of all 1-Lipschitz functions $f : M \times M \to \mathbb{R}$. 
We note that Equation 2 is similar to contrastive predictive coding (CPC) [47], which optimizes\n\n$$J_{CPC} = \sup_{f \in \mathcal{F}} \mathbb{E}_{p(x,y)\,p(y_j)}\left[ \log \frac{\exp f(x, y)}{\sum_j \exp f(x, y_j)} \right] = \sup_{f \in \mathcal{F}} \mathbb{E}_{p(x,y)}[f(x, y)] - \mathbb{E}_{p(x)p(y_j)}\left[ \log \sum_j \exp f(x, y_j) \right]. \quad (3)$$\n\nThe two main differences between contrastive predictive coding and the dual Wasserstein dependency measure are the Lipschitz constraint on the function class and the $\log \sum \exp$ in the second term of CPC. We propose a new objective, which is a lower bound on both contrastive predictive coding and the dual Wasserstein dependency measure, by keeping both the Lipschitz class of functions and the $\log \sum \exp$, which we call Wasserstein predictive coding (WPC):\n\n$$J_{WPC} = \sup_{f \in \mathcal{L}_{M \times M}} \mathbb{E}_{p(x,y)}[f(x, y)] - \mathbb{E}_{p(x)p(y_j)}\left[ \log \sum_j \exp f(x, y_j) \right]. \quad (4)$$\n\nWe choose to keep the $\log \sum \exp$ term, since it decreases the variance when we use samples to estimate the gradient, which we found to improve performance in practice. In the previous sections, we motivated the use of the Wasserstein distance, which directly suggests the use of a Lipschitz constraint in Equation 4. CPC and similar contrastive learning techniques work by reducing the distance between paired samples and increasing the distance between unpaired random samples. However, when using powerful neural networks as the representation encoder, the network can learn to exaggerate small differences between unpaired samples to increase the distance between them arbitrarily. This then prevents the encoder from learning any other differences between unpaired samples, because one discernible difference suffices to optimize the objective. 
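The relationship between CPC and WPC can be made concrete in a few lines. The sketch below (our illustration, not the paper's implementation) evaluates the shared batch objective of Equations 3 and 4 using a hypothetical bilinear critic f(x, y) = x\u1d40Wy; WPC differs only in constraining f to be 1-Lipschitz, approximated here, as in Section 4.1, by a gradient penalty, which happens to be analytic for this critic:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 64, 16
x = rng.normal(size=(n, d))
y = x + 0.1 * rng.normal(size=(n, d))  # y[i] is the positive paired with x[i]
W = 0.1 * rng.normal(size=(d, d))      # bilinear critic: f(x, y) = x @ W @ y

scores = x @ W @ y.T                   # scores[i, j] = f(x_i, y_j)

def logsumexp(a, axis):
    m = a.max(axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.exp(a - m).sum(axis=axis))

# Shared batch objective of Eqs. 3 and 4:
#   E_{p(x,y)}[f(x, y)] - E[log sum_j exp f(x, y_j)]
objective = np.diag(scores).mean() - logsumexp(scores, axis=1).mean()

# WPC additionally requires f to be 1-Lipschitz. A gradient-penalty
# surrogate pushes the input-gradient norm toward 1; for this bilinear
# critic the gradient is analytic: df/dx_i = W y_i, df/dy_i = W^T x_i.
gx = y @ W.T                           # gx[i] = W @ y_i
gy = x @ W                             # gy[i] = W^T @ x_i
grad_norm = np.sqrt((gx ** 2).sum(axis=1) + (gy ** 2).sum(axis=1))
penalty = ((grad_norm - 1.0) ** 2).mean()

lam = 10.0                             # penalty weight (a free hyperparameter)
wpc_loss = -objective + lam * penalty  # minimized over the critic in practice
print(objective, penalty)
```

In practice f is a neural network and the gradient is obtained by automatic differentiation; the point of the sketch is only that WPC is CPC's objective plus a Lipschitz surrogate on the critic.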
However, if we force the encoder to be Lipschitz, then the distance between learned representations is bounded by the distance between the underlying samples. Thus the encoder is forced to represent more components of the data to make progress on the objective.\n\nFigure 1: Left - The SpatialMultiOmniglot dataset consists of pairs of images (x, y), each comprising multiple Omniglot characters in a grid, where the characters in y are the next characters in the alphabet of the characters in x. Middle - The Shapes3D dataset is a collection of colored images of an object in a room. Each image corresponds to a unique value for the underlying latent variables: color of object, color of wall, color of floor, shape of object, size of object, and viewing angle. Right - The SplitCelebA dataset consists of pairs of images p(x, y) where x and y are the left and right halves of the same CelebA image, respectively.\n\n4.1 Approximating Lipschitz Continuity\n\nOptimization over Lipschitz functions with neural networks is a challenging problem, and a topic of active research. Due to the popularity of the Wasserstein GAN [3], a number of techniques have been proposed to approximate Lipschitz continuity [23, 36]. However, recent work [2] has shown that such techniques severely restrict the capacity of typical neural networks and could hurt performance in complex tasks where high-capacity neural networks are essential. This is also observed empirically by Brock et al. [9] in the context of training GANs.\n\nThus, in our experiments, we use the gradient penalty technique proposed by Gulrajani et al. [23], which is sufficient to provide experimental evidence in support of our hypothesis, but we note the caveat that gradient penalty might not be effective for complex tasks. 
Incorporating better and more scalable methods to enforce Lipschitz continuity would likely further improve practical WDM implementations.\n\n5 Experiments\n\nThe goals of our experiments are the following. We demonstrate and quantify the limitations of mutual information-based representation learning. We quantitatively compare our proposed alternative, the Wasserstein dependency measure, with mutual information for representation learning. We demonstrate the importance of the class of functions used to practically approximate dependency measures, such as fully-connected or convolutional networks.\n\n5.1 Evaluation Methodology\n\nAll of our experiments make use of datasets generated via the following process: p(z)p(x|z)p(y|z). Here, z is the underlying latent variable, and x and y are observed variables. For example, z could be a class label, and x and y two images of the same class. We specifically use datasets with large values of the mutual information I(x; y), which is common in practice and is also the condition under which we expect current MI estimators to struggle.\n\nThe goal of the representation learning task is to learn representation encoders f \u2208 F and g \u2208 F, such that the representations f(x) and g(y) capture the underlying generative factors of variation represented by the latent variable z. For example, for SpatialMultiOmniglot (described in Section 5.2), we aim to learn f(x) which captures the class of each of the characters in the image. 
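The evaluation protocol of Section 5.1 — freeze the encoder, then fit a linear classifier predicting the latent z from the representation — can be sketched on synthetic data. Everything below is a hypothetical stand-in (random class centers playing the role of frozen encoder outputs), not the paper's code:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 2000, 32, 10   # samples, representation dim, classes of latent z

# Synthetic stand-in for frozen encoder outputs: representations that
# carry the latent class plus noise.
z = rng.integers(0, k, size=n)
class_centers = rng.normal(size=(k, d))
reps = class_centers[z] + 0.3 * rng.normal(size=(n, d))

# Linear probe: ridge regression onto one-hot labels, predict by argmax.
train, test = slice(0, 1500), slice(1500, None)
X, Y = reps[train], np.eye(k)[z[train]]
W = np.linalg.solve(X.T @ X + 1e-3 * np.eye(d), X.T @ Y)

pred = (reps[test] @ W).argmax(axis=1)
accuracy = (pred == z[test]).mean()
print(accuracy)  # high when the representation exposes z linearly
```

A representation that contains z only in a highly nonlinear form would score poorly under such a probe even though the information is present, which is exactly the distinction the evaluation is designed to capture.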
However, representation learning is not about making sure the representations contain the requisite information, but that they contain the requisite information in an accessible way, ideally via linear probes [1]. Thus, we measure the quality of the representations by learning linear classifiers predicting the underlying latent variables z. This methodology is standard in the self-supervised representation learning literature.\n\nTable 1: WPC outperforms CPC on the SplitCelebA dataset.\n\nMethod: CPC (fc) | WPC (fc) | CPC (conv) | WPC (conv)\nAccuracy: 0.85 | 0.87 | 0.82 | 0.87\n\n5.2 Datasets\n\nWe present experimental results on four tasks: SpatialMultiOmniglot, StackedMultiOmniglot, MultiviewShapes3D, and SplitCelebA.\n\nSpatialMultiOmniglot. We used the Omniglot dataset [31] as a base dataset to construct SpatialMultiOmniglot and StackedMultiOmniglot. SpatialMultiOmniglot is a dataset of paired images x and y, where x is an image of size (32m, 32n) comprised of mn Omniglot characters arranged in an (m, n) grid from different Omniglot alphabets, as illustrated in Figure 1. The characters in y are the next characters of the corresponding characters in x, and the latent variable z is the index of each of the characters in x.\n\nLet $l_i$ be the alphabet size for the ith character in x. Then, the mutual information is $I(x; y) = \sum_{i=1}^{mn} \log l_i$. Thus, adding more characters increases the mutual information and allows easy control of the complexity of the task.\n\nFor our experiments, we picked the 9 largest alphabets, which are Tifinagh (55 characters), Japanese (hiragana) (52), Gujarati (48), Japanese (katakana) (47), Bengali (46), Grantha (43), Sanskrit (42), Armenian (41), and Mkhedruli (Georgian) (41), with their respective alphabet sizes in parentheses.\n\nStackedMultiOmniglot. 
StackedMultiOmniglot is similar to SpatialMultiOmniglot except that the characters are stacked in the channel axis, and thus x and y are arrays of size (32, 32, n). This dataset is designed to remove the feature transfer between characters of different alphabets that is present in SpatialMultiOmniglot when using convolutional neural networks. The mutual information is the same, i.e., $I(x; y) = \sum_{i=1}^{n} \log l_i$.\n\nMultiviewShapes3D. Shapes3D [27] (Figure 1) is a dataset of images of a single object in a room. It has six factors of variation: object color, wall color, floor color, object shape, object size, and camera angle. These factors have 10, 10, 10, 4, 6, and 15 values respectively. Thus, the total entropy of the dataset is log(10 \u00d7 10 \u00d7 10 \u00d7 4 \u00d7 6 \u00d7 15). MultiviewShapes3D is a subset of Shapes3D where we select only the two extreme camera angles for x and y, and the other 5 factors of x and y comprise the latent variable z.\n\nSplitCelebA. CelebA [34] is a dataset consisting of celebrity faces. The SplitCelebA task uses samples from this dataset split into left and right halves. Thus x is the left half, and y is the right half. We use the CelebA binary attributes as the latent variable z.\n\n5.3 Experimental Results\n\nEffect of Mutual Information. Our first main experimental contribution is to show the effect of the mutual information in the data on the performance of mutual information-based representation learning, in particular of contrastive predictive coding (CPC). Figure 2 (bottom) shows the performance of CPC and WPC as the mutual information increases. We were able to control the mutual information in the data by controlling the number of characters in the images. We kept the training dataset size fixed at 50,000 samples. This confirms our hypothesis that mutual information-based representation learning indeed suffers when the mutual information is large. 
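The quantities driving this experiment follow directly from the alphabet sizes listed in Section 5.2; a quick check of the arithmetic:

```python
import numpy as np

# Alphabet sizes of the 9 largest Omniglot alphabets (Section 5.2).
sizes = [55, 52, 48, 47, 46, 43, 42, 41, 41]
dataset_size = 50_000

# Mutual information with all 9 characters: sum of log alphabet sizes.
total_mi = sum(np.log(s) for s in sizes)
print(round(total_mi, 2))  # ~34.43 nats

# exp(MI) with k characters vs. the fixed dataset size: Theorem 1
# suggests trouble once exp(MI) exceeds the number of samples.
for k in range(1, 4):
    print(k, int(np.prod(sizes[:k])))
# k = 2 gives 55 * 52 = 2860 (< 50,000); k = 3 gives 137,280 (> 50,000)
```

This is why the transition from 2 to 3 characters is where CPC's performance is expected to break down in Figure 2 (bottom).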
As can be seen, for small numbers of characters (1 and 2), CPC achieves near-perfect representation learning. The exponential of the mutual information in this case is 55 and 55 \u00d7 52 = 2860 (i.e., the product of alphabet class sizes), which is smaller than the dataset size. However, when the number of characters is 3, the exponential of the mutual information is 55 \u00d7 52 \u00d7 48 = 137280, which is larger than the dataset size. This is the case where CPC is no longer a good lower bound estimator for the mutual information, and the representation learning performance drops significantly.\n\nWhile WPC\u2019s performance also drops when the mutual information is increased, it is always better than CPC\u2019s, and the drop in performance is more gradual. Ideally, representation learning performance should not be affected by the number of characters at all. We believe WPC\u2019s less-than-ideal performance is due to the practical approximations we used, such as the gradient penalty.\n\nFigure 2: Performance of CPC and WPC (a) on SpatialMultiOmniglot using fully-connected neural networks, (b) on StackedMultiOmniglot using fully-connected networks, and (c) using convolutional neural networks. Top - WPC consistently performs better than CPC over different dataset sizes, especially when using fully-connected networks. Middle - WPC is more robust to minibatch size, while CPC\u2019s performance drops rapidly on reduction in minibatch size. Bottom - As mutual information is increased, WPC\u2019s drop in performance is more gradual, while CPC\u2019s drop in performance is drastic (when mutual information passes log dataset size).\n\nFigure 3: Performance of CPC and WPC on MultiviewShapes3D using (a,b) fully-connected networks, and (c,d) convolutional networks. WPC performs consistently better than CPC for multiple dataset and minibatch sizes.\n\nEffect of Dataset Size. 
Figures 2 (top) and 3 (a,c) show the performance of CPC and WPC as we vary the dataset size. For the Omniglot datasets, the number of characters has been fixed at 9, and the mutual information for this dataset is the logarithm of the product of the 9 alphabet sizes, which is around 34.43 nats. This is a very large information value compared to the dataset size, and thus we observe that the performance of either method is far from perfect. However, WPC performs significantly better than CPC.\n\nEffect of Minibatch Size. Both CPC and WPC are minibatch-dependent techniques. For small minibatches, the variance of the estimator becomes large. We observe in Figure 2 (middle) that CPC\u2019s performance increases as the minibatch size is increased. However, WPC\u2019s performance is not as sensitive to the minibatch size. WPC reaches its optimal performance with a minibatch size of 32, and any further increase in minibatch size does not improve the performance. This suggests that Wasserstein-based representation learning can be effective even at small minibatch sizes.\n\nEffect of Neural Network Inductive Bias. The use of fully-connected neural networks allowed us to make predictions about the performance based on whether the mutual information is larger or smaller than the log dataset size. However, most practical uses of representation learning rely on convolutional neural networks (convnets). Convnets change the interplay of mutual information and dataset size, since they can be more efficient with smaller datasets by bringing in inductive biases such as translation invariance or invertibility via residual connections. Convnets also perform worse on StackedMultiOmniglot (Figure 2 (c)) than on SpatialMultiOmniglot (Figure 2 (b)), which is expected since SpatialMultiOmniglot arranges the Omniglot characters spatially, which works well with convnets\u2019 translation invariance. 
When the data does not match convnets\u2019 inductive bias, as in StackedMultiOmniglot, WPC provides a larger improvement over CPC.\n\n6 Conclusion\n\nWe proposed a new representation learning objective as an alternative to mutual information. This objective, which we refer to as the Wasserstein dependency measure, uses the Wasserstein distance in place of the KL divergence in mutual information. A practical implementation of this approach, Wasserstein predictive coding, is obtained by regularizing existing mutual information estimators to enforce Lipschitz continuity. We explore the fundamental limitations of prior mutual information-based estimators, present several problem settings where these limitations manifest themselves, resulting in poor representation learning performance, and show that WPC mitigates these issues to a large extent. However, optimization of Lipschitz-continuous neural networks is still a challenging problem. Our results indicate that Lipschitz continuity is highly beneficial for representation learning, and an exciting direction for future work is to develop better techniques for enforcing Lipschitz continuity. As better regularization methods are developed, we expect the quality of representations learned via the Wasserstein dependency measure to also improve.\n\nAcknowledgement\n\nThe authors would like to thank Ben Poole, George Tucker, Alex Alemi, Alex Lamb, Aravind Srinivas, Ethan Holly, Eric Jang, Luke Metz, and Julian Ibarz for useful discussions and feedback on our research. SO is thankful to the Google Brain team for providing a productive and empowering research environment.\n\nReferences\n\n[1] Guillaume Alain and Yoshua Bengio. Understanding intermediate layers using linear classifier probes. arXiv preprint arXiv:1610.01644, 2016.\n\n[2] Cem Anil, James Lucas, and Roger Grosse. Sorting out Lipschitz function approximation. 
arXiv\n\npreprint arXiv:1811.05381, 2018.\n\n[3] Martin Arjovsky, Soumith Chintala, and L\u00e9on Bottou. Wasserstein gan. arXiv preprint\n\narXiv:1701.07875, 2017.\n\n[4] Bishnu S Atal and Manfred R Schroeder. Adaptive predictive coding of speech signals. Bell\n\nSystem Technical Journal, 49(8):1973\u20131986, 1970.\n\n[5] Yusuf Aytar, Tobias Pfaff, David Budden, Tom Le Paine, Ziyu Wang, and Nando de Freitas.\nPlaying hard exploration games by watching youtube. arXiv preprint arXiv:1805.11592, 2018.\n\n[6] Suzanna Becker and Geoffrey E Hinton. Self-organizing neural network that discovers surfaces\n\nin random-dot stereograms. Nature, 355(6356):161, 1992.\n\n[7] Mohamed Ishmael Belghazi, Aristide Baratin, Sai Rajeshwar, Sherjil Ozair, Yoshua Ben-\ngio, Aaron Courville, and Devon Hjelm. Mutual information neural estimation.\nIn Jen-\nnifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference\non Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages\n531\u2013540, Stockholmsm\u00e4ssan, Stockholm Sweden, 10\u201315 Jul 2018. PMLR. URL http:\n//proceedings.mlr.press/v80/belghazi18a.html.\n\n[8] Anthony J Bell and Terrence J Sejnowski. An information-maximization approach to blind\n\nseparation and blind deconvolution. Neural computation, 7(6):1129\u20131159, 1995.\n\n[9] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale gan training for high \ufb01delity\n\nnatural image synthesis. arXiv preprint arXiv:1809.11096, 2018.\n\n[10] Andy Clark. Whatever next? predictive brains, situated agents, and the future of cognitive\n\nscience. Behavioral and brain sciences, 36(3):181\u2013204, 2013.\n\n[11] Gavin E Crooks. On measures of entropy and information. 2017.\n\n[12] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of\ndeep bidirectional transformers for language understanding. 
arXiv preprint arXiv:1810.04805, 2018.

[13] Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using real NVP. arXiv preprint arXiv:1605.08803, 2016.

[14] Carl Doersch and Andrew Zisserman. Multi-task self-supervised visual learning.

[15] Carl Doersch, Abhinav Gupta, and Alexei A Efros. Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE International Conference on Computer Vision, pages 1422–1430, 2015.

[16] Jeff Donahue, Philipp Krähenbühl, and Trevor Darrell. Adversarial feature learning. arXiv preprint arXiv:1605.09782, 2016.

[17] Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Olivier Mastropietro, Alex Lamb, Martin Arjovsky, and Aaron Courville. Adversarially learned inference. arXiv preprint arXiv:1606.00704, 2016.

[18] Debidatta Dwibedi, Jonathan Tompson, Corey Lynch, and Pierre Sermanet. Learning actionable representations from visual observations. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1577–1584. IEEE, 2018.

[19] Peter Elias. Predictive coding–I. IRE Transactions on Information Theory, 1(1):16–24, 1955.

[20] Karl Friston and Stefan Kiebel. Predictive coding under the free-energy principle. Philosophical Transactions of the Royal Society of London B: Biological Sciences, 364(1521):1211–1221, 2009.

[21] Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728, 2018.

[22] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

[23] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of Wasserstein GANs. 
In Advances in Neural Information Processing Systems, pages 5767–5777, 2017.

[24] R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Phil Bachman, Adam Trischler, and Yoshua Bengio. Learning deep representations by mutual information estimation and maximization. International Conference on Learning Representations (ICLR), 2019.

[25] Max Jaderberg, Volodymyr Mnih, Wojciech Marian Czarnecki, Tom Schaul, Joel Z Leibo, David Silver, and Koray Kavukcuoglu. Reinforcement learning with unsupervised auxiliary tasks. arXiv preprint arXiv:1611.05397, 2016.

[26] Hyoungseok Kim, Jaekyeom Kim, Yeonwoo Jeong, Sergey Levine, and Hyun Oh Song. EMI: Exploration with mutual information maximizing state and action embeddings. arXiv preprint arXiv:1810.01176, 2018.

[27] Hyunjik Kim and Andriy Mnih. Disentangling by factorising. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 2649–2658, Stockholmsmässan, Stockholm, Sweden, 10–15 Jul 2018. PMLR. URL http://proceedings.mlr.press/v80/kim18b.html.

[28] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

[29] Durk P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, pages 4743–4751, 2016.

[30] Alexander Kraskov, Harald Stögbauer, and Peter Grassberger. Estimating mutual information. Physical Review E, 69(6):066138, 2004.

[31] Brenden M Lake, Ruslan Salakhutdinov, and Joshua B Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.

[32] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. 
Nature, 521(7553):436, 2015.

[33] Ralph Linsker. Self-organization in a perceptual network. Computer, 21(3):105–117, 1988.

[34] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), 2015.

[35] David McAllester and Karl Stratos. Formal limitations on the measurement of mutual information. arXiv preprint arXiv:1811.04251, 2018.

[36] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957, 2018.

[37] Ilya Nemenman, William Bialek, and Rob de Ruyter van Steveninck. Entropy and information in neural spike trains: Progress on the sampling problem. Physical Review E, 69(5):056111, 2004.

[38] Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nathan Srebro. A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. arXiv preprint arXiv:1707.09564, 2017.

[39] XuanLong Nguyen, Martin J Wainwright, and Michael I Jordan. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11):5847–5861, 2010.

[40] Augustus Odena, Christopher Olah, and Jonathon Shlens. Conditional image synthesis with auxiliary classifier GANs. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML'17, pages 2642–2651. JMLR.org, 2017. URL http://dl.acm.org/citation.cfm?id=3305890.3305954.

[41] Stephanie E Palmer, Olivier Marre, Michael J Berry, and William Bialek. Predictive information in a sensory population. Proceedings of the National Academy of Sciences, 112(22):6908–6913, 2015.

[42] Ben Poole, Sherjil Ozair, Aäron Van den Oord, Alexander A Alemi, and George Tucker. On variational bounds of mutual information. 
In International Conference on Machine Learning, 2019.

[43] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training.

[44] Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770, 2015.

[45] Pierre Sermanet, Corey Lynch, Yevgen Chebotar, Jasmine Hsu, Eric Jang, Stefan Schaal, and Sergey Levine. Time-contrastive networks: Self-supervised learning from video. arXiv preprint arXiv:1704.06888, 2017.

[46] Gašper Tkačik and William Bialek. Information processing in living systems. Annual Review of Condensed Matter Physics, 7:89–117, 2016.

[47] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.

[48] Cédric Villani. Optimal Transport: Old and New, volume 338. Springer Science & Business Media, 2008.

[49] Donglai Wei, Joseph J. Lim, Andrew Zisserman, and William T. Freeman. Learning and using the arrow of time. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.

[50] Richard Zhang, Phillip Isola, and Alexei A Efros. Colorful image colorization. In ECCV, 2016.
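To make the recipe summarized in the conclusion concrete, the following is a hypothetical minimal sketch (not the authors' implementation): a CPC-style InfoNCE objective [47] whose linear critic weights are spectrally normalized, one of the Lipschitz constraint techniques from the GAN literature [36] that WPC draws on. All function names, shapes, and the toy data are illustrative assumptions.

```python
import numpy as np

# Hypothetical sketch of the WPC recipe: an InfoNCE-style contrastive
# estimator whose critic is regularized toward Lipschitz continuity by
# spectral normalization of its weight matrices.

def spectral_normalize(W, n_iter=30, seed=0):
    """Divide W by an estimate of its largest singular value (power iteration)."""
    u = np.random.RandomState(seed).randn(W.shape[0])
    for _ in range(n_iter):
        v = W.T @ u
        v /= np.linalg.norm(v)
        u = W @ v
        u /= np.linalg.norm(u)
    sigma = u @ W @ v  # approximate spectral norm of W
    return W / sigma

def infonce_loss(scores):
    """scores[i, j] = critic score for pair (x_i, y_j); positives on the diagonal.

    Mean cross-entropy of classifying each positive against the rest of the batch.
    """
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    log_softmax = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_softmax))

# Toy usage: a bilinear critic f(x)^T g(y) with spectrally normalized weights.
rng = np.random.RandomState(1)
x = rng.randn(8, 16)              # batch of observations
y = x + 0.1 * rng.randn(8, 16)    # correlated "future" observations
Wf = spectral_normalize(rng.randn(4, 16))
Wg = spectral_normalize(rng.randn(4, 16))
scores = (x @ Wf.T) @ (y @ Wg.T).T   # 8 x 8 score matrix
loss = infonce_loss(scores)
```

The paper instead discusses such Lipschitz regularization as applied to deep critics; this linear version only illustrates the two ingredients (contrastive scoring and a spectral-norm constraint) in isolation.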