{"title": "Learning Representations by Maximizing Mutual Information Across Views", "book": "Advances in Neural Information Processing Systems", "page_first": 15535, "page_last": 15545, "abstract": "We propose an approach to self-supervised representation learning based on maximizing mutual information between features extracted from multiple views of a shared context. For example, one could produce multiple views of a local spatio-temporal context by observing it from different locations (e.g., camera positions within a scene), and via different modalities (e.g., tactile, auditory, or visual). Or, an ImageNet image could provide a context from which one produces multiple views by repeatedly applying data augmentation. Maximizing mutual information between features extracted from these views requires capturing information about high-level factors whose influence spans multiple views \u2013 e.g., presence of certain objects or occurrence of certain events. Following our proposed approach, we develop a model which learns image representations that significantly outperform prior methods on the tasks we consider. Most notably, using self-supervised learning, our model learns representations which achieve 68.1% accuracy on ImageNet using standard linear evaluation. This beats prior results by over 12% and concurrent results by 7%. When we extend our model to use mixture-based representations, segmentation behaviour emerges as a natural side-effect. 
Our code is available online: https://github.com/Philip-Bachman/amdim-public.", "full_text": "Learning Representations by Maximizing Mutual\n\nInformation Across Views\n\nPhilip Bachman\nMicrosoft Research\n\nphil.bachman@gmail.com\n\nR Devon Hjelm\n\nMicrosoft Research, MILA\n\ndevon.hjelm@microsoft.com\n\nWilliam Buchwalter\nMicrosoft Research\n\nwibuch@microsoft.com\n\nAbstract\n\nWe propose an approach to self-supervised representation learning based on max-\nimizing mutual information between features extracted from multiple views of a\nshared context. For example, one could produce multiple views of a local spatio-\ntemporal context by observing it from different locations (e.g., camera positions\nwithin a scene), and via different modalities (e.g., tactile, auditory, or visual). Or,\nan ImageNet image could provide a context from which one produces multiple\nviews by repeatedly applying data augmentation. Maximizing mutual informa-\ntion between features extracted from these views requires capturing information\nabout high-level factors whose in\ufb02uence spans multiple views \u2013 e.g., presence of\ncertain objects or occurrence of certain events. Following our proposed approach,\nwe develop a model which learns image representations that signi\ufb01cantly outper-\nform prior methods on the tasks we consider. Most notably, using self-supervised\nlearning, our model learns representations which achieve 68.1% accuracy on Im-\nageNet using standard linear evaluation. This beats prior results by over 12%\nand concurrent results by 7%. When we extend our model to use mixture-based\nrepresentations, segmentation behaviour emerges as a natural side-effect. Our code\nis available online: https://github.com/Philip-Bachman/amdim-public.\n\n1\n\nIntroduction\n\nLearning useful representations from unlabeled data is a challenging problem and improvements\nover existing methods can have wide-reaching bene\ufb01ts. 
For example, consider the ubiquitous use of\npre-trained model components, such as word vectors [Mikolov et al., 2013, Pennington et al., 2014]\nand context-sensitive encoders [Peters et al., 2018, Devlin et al., 2019], for achieving state-of-the-art\nresults on hard NLP tasks. Similarly, large convolutional networks pre-trained on large supervised\ncorpora have been widely used to improve performance across the spectrum of computer vision tasks\n[Donahue et al., 2014, Ren et al., 2015, He et al., 2017, Carreira and Zisserman, 2017]. Though,\nthe necessity of pre-trained networks for many vision tasks has been convincingly questioned in\nrecent work [He et al., 2018]. Nonetheless, the core motivations for unsupervised learning \u2013 namely\nminimizing dependence on potentially costly corpora of manually annotated data \u2013 remain strong.\nWe propose an approach to self-supervised representation learning based on maximizing mutual\ninformation between features extracted from multiple views of a shared context. This is analogous to\na human learning to represent observations generated by a shared cause, e.g. the sights, scents, and\nsounds of baking, driven by a desire to predict other related observations, e.g. the taste of cookies.\nFor a more concrete example, the shared context could be an image from the ImageNet training set,\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fand multiple views of the context could be produced by repeatedly applying data augmentation to\nthe image. Alternatively, one could produce multiple views of an image by repeatedly partitioning\nits pixels into \u201cpast\u201d and \u201cfuture\u201d sets, with the considered partitions corresponding to a \ufb01xed\nautoregressive ordering, as in Contrastive Predictive Coding [CPC, van den Oord et al., 2018]. 
The\nkey idea is that maximizing mutual information between features extracted from multiple views of a\nshared context forces the features to capture information about higher-level factors (e.g., presence of\ncertain objects or occurrence of certain events) that broadly affect the shared context.\nWe introduce a model for self-supervised representation learning based on local Deep InfoMax [DIM,\nHjelm et al., 2019]. Local DIM maximizes mutual information between a global summary feature\nvector, which depends on the full input, and a collection of local feature vectors pulled from an\nintermediate layer in the encoder. Our model extends local DIM in three key ways: it predicts\nfeatures across independently-augmented versions of each input, it predicts features simultaneously\nacross multiple scales, and it uses a more powerful encoder. Each of these modi\ufb01cations provides\nimprovements over local DIM. Predicting across independently-augmented copies of an input and\npredicting at multiple scales are two simple ways of producing multiple views of the context provided\nby a single image. We also extend our model to mixture-based representations, and \ufb01nd that\nsegmentation-like behaviour emerges as a natural side-effect. Section 3 discusses the model and\ntraining objective in detail.\nWe evaluate our model using standard datasets: CIFAR10, CIFAR100, STL10 [Coates et al., 2011],\nImageNet1 [Russakovsky et al., 2015], and Places205 [Zhou et al., 2014]. We evaluate performance\nfollowing the protocol described by Kolesnikov et al. [2019]. Our model outperforms prior work\non these datasets. Our model signi\ufb01cantly improves on existing results for STL10, reaching over\n94% accuracy with linear evaluation and no encoder \ufb01ne-tuning. On ImageNet, we reach over 68%\naccuracy for linear evaluation, which beats the best prior result by over 12% and the best concurrent\nresult by 7%. 
We reach 55% accuracy on the Places205 task using representations learned with ImageNet data, which beats the best prior result by 7%. Section 4 discusses the experiments in detail.

2 Related Work

One characteristic which distinguishes self-supervised learning from classic unsupervised learning is its reliance on procedurally-generated supervised learning problems. When developing a self-supervised learning method, one seeks to design a problem generator such that models must capture useful information about the data in order to solve the generated problems. Problems are typically generated from prior knowledge about useful structure in the data, rather than from explicit labels. Self-supervised learning is gaining popularity across the NLP, vision, and robotics communities – e.g., [Devlin et al., 2019, Logeswaran and Lee, 2018, Sermanet et al., 2017, Dwibedi et al., 2018]. Some seminal work on self-supervised learning for computer vision involves predicting spatial structure or color information that has been procedurally removed from the data. E.g., Doersch et al. [2015] and Noroozi and Favaro [2016] learn representations by predicting/reconstructing spatial structure. Zhang et al. [2016] introduce the task of predicting color information that has been removed by converting images to grayscale. Gidaris et al. [2018] propose learning representations by predicting the rotation of an image relative to a fixed reference frame, which works surprisingly well.
We approach self-supervised learning by maximizing mutual information between features extracted from multiple views of a shared context. For example, consider maximizing mutual information between features extracted from a video with most color information removed and features extracted from the original full-color video. Vondrick et al.
[2018] showed that object tracking can emerge as a side-effect of optimizing this objective in the special case where the features extracted from the full-color video are simply the original video frames. Similarly, consider predicting how a scene would look when viewed from a particular location, given an encoding computed from several views of the scene from other locations. This task, explored by Eslami et al. [2018], requires maximizing mutual information between features from the multi-view encoder and the content of the held-out view. The general goal is to distill information from the available observations such that contextually-related observations can be identified among a set of plausible alternatives. Closely related work considers learning representations by predicting cross-modal correspondence [Arandjelović and Zisserman, 2017, 2018]. In concurrent work, Tian et al. [2019] propose applying global Deep InfoMax across multiple image views constructed by splitting each source image into a pair of images comprising the L and ab channels of the LAB colorspace version of the source image. While the mutual information bounds in [Vondrick et al., 2018, Eslami et al., 2018] rely on explicit density estimation, our model uses the contrastive bound from CPC [van den Oord et al., 2018], which has been further analyzed by McAllester and Stratos [2018], and Poole et al. [2019].

1 ILSVRC2012 version

Another line of prior work considers relating the information shared across multiple views of an input distribution. Earlier works in this direction, called multiview learning, were largely focused on methods based on Canonical Correlation Analysis, in settings where assumptions were imposed to permit a more rigorous analysis [Kakade and Foster, 2007, Sridharan and Kakade, 2008].
More recent\nwork based on multiview CCA has extended these approaches beyond formally tractable settings,\nand is more akin to our own work [Wang et al., 2015]. Basing multiview learning on general mutual\ninformation, rather than correlation, seems to provide a broader foundation for future work.\nEvaluating new self-supervised learning methods presents some challenges. E.g., performance\ngains may be largely due to improvements in model architectures and training practices, rather than\nadvances in the self-supervised learning component. This point was addressed by Kolesnikov et al.\n[2019], who found massive gains in standard metrics when existing methods were reimplemented\nwith up-to-date architectures and optimized to run at larger scales. When evaluating our model,\nwe follow their protocols and compare against their optimized results for existing methods. Some\npotential shortcomings with standard evaluation protocols have been noted by Goyal et al. [2019].\n\n3 Method Description\n\nOur model, which we call Augmented Multiscale DIM (AMDIM), extends the local version of Deep\nInfoMax introduced by Hjelm et al. [2019] in several ways. First, we maximize mutual information\nbetween features extracted from independently-augmented copies of each image, rather than between\nfeatures extracted from a single, unaugmented copy of each image.2 Second, we maximize mutual\ninformation between multiple feature scales simultaneously, rather than between a single global and\nlocal scale. Third, we use a more powerful encoder architecture. Finally, we introduce mixture-based\nrepresentations. We now describe local DIM and the components added by our new model.\n\n3.1 Local DIM\n\nLocal DIM maximizes mutual information between global features f1(x), produced by a convolu-\ntional encoder f, and local features {f7(x)ij : \u2200i, j}, produced by an intermediate layer in f. 
The\nsubscript d \u2208 {1, 7} denotes features from the top-most encoder layer with spatial dimension d \u00d7 d,\nand the subscripts i and j index the two spatial dimensions of the array of activations in layer d.3\nIntuitively, this mutual information measures how much better we can guess the value of f7(x)ij when\nwe know the value of f1(x) than when we do not know the value of f1(x). Optimizing this relative\nability to predict, rather than absolute ability to predict, helps avoid degenerate representations which\nmap all observations to similar values. Such degenerate representations perform well in terms of\nabsolute ability to predict, but poorly in terms of relative ability to predict.\nFor local DIM, the terms global and local uniquely de\ufb01ne where features come from in the encoder\nand how they will be used. In AMDIM, this is no longer true. So, we will refer to the features\nthat encode the data to condition on (global features) as antecedent features, and the features to be\npredicted (local features) as consequent features. We choose these terms based on their role in logic.\nWe can construct a distribution p(f1(x), f7(x)ij) over (antecedent, consequent) feature pairs via\nancestral sampling as follows: (i) sample an input x \u223c D, (ii) sample spatial indices i \u223c u(i)\nand j \u223c u(j), and (iii) compute features f1(x) and f7(x)ij. Here, D is the data distribution, and\nu(i)/u(j) denote uniform distributions over the range of valid spatial indices into the relevant encoder\nlayer. 
We denote the marginal distributions over per-layer features as p(f1(x)) and p(f7(x)ij). Given p(f1(x)), p(f7(x)ij), and p(f1(x), f7(x)ij), local DIM seeks an encoder f that maximizes the mutual information I(f1(x); f7(x)ij) in p(f1(x), f7(x)ij).

2 We focus on images in this paper, but the approach directly extends to, e.g., audio, video, and text.
3 d refers to the layer's spatial dimension and should not be confused with its depth in the encoder.

3.2 Noise-Contrastive Estimation

The best results with local DIM were obtained using a mutual information bound based on Noise-Contrastive Estimation (NCE – [Gutmann and Hyvärinen, 2010]), as used in various NLP applications [Ma and Collins, 2018], and applied to infomax objectives by van den Oord et al. [2018]. This class of bounds has been studied in more detail by McAllester and Stratos [2018], and Poole et al. [2019]. We can maximize the NCE lower bound on I(f1(x); f7(x)ij) by minimizing the following loss:

\mathbb{E}_{(f_1(x),\, f_7(x)_{ij})} \Big[ \mathbb{E}_{N_7} \big[ \mathcal{L}_\Phi(f_1(x), f_7(x)_{ij}, N_7) \big] \Big].   (1)

The positive sample pair (f1(x), f7(x)ij) is drawn from the joint distribution p(f1(x), f7(x)ij). N7 denotes a set of negative samples, comprising many "distractor" consequent features drawn independently from the marginal distribution p(f7(x)ij). Intuitively, the task of the antecedent feature is to pick its true consequent out of a large bag of distractors. The loss LΦ is a standard log-softmax, where the normalization is over a large set of matching scores Φ(f1, f7). Roughly speaking, Φ(f1, f7) maps (antecedent, consequent) feature pairs onto scalar-valued scores, where higher scores indicate higher likelihood of a positive sample pair.
We can write LΦ as follows:

\mathcal{L}_\Phi(f_1, f_7, N_7) = -\log \frac{\exp(\Phi(f_1, f_7))}{\sum_{\tilde{f}_7 \in N_7 \cup \{f_7\}} \exp(\Phi(f_1, \tilde{f}_7))},   (2)

where we omit spatial indices and dependence on x for brevity. Training in local DIM corresponds to minimizing the loss in Eqn. 1 with respect to f and Φ, which we assume to be represented by parametric function approximators, e.g. deep neural networks.

3.3 Efficient NCE Computation

We can efficiently compute the bound in Eqn. 1 for many positive sample pairs, using large negative sample sets, e.g. |N7| ≫ 10000, by using a simple dot product for the matching score Φ:

\Phi(f_1(x), f_7(x)_{ij}) \triangleq \phi_1(f_1(x))^\top \phi_7(f_7(x)_{ij}).   (3)

The functions φ1/φ7 non-linearly transform their inputs to some other vector space. Given a sufficiently high-dimensional vector space, in principle we should be able to approximate any (reasonable) class of functions we care about – which correspond to belief shifts like log p(f7|f1)/p(f7) in our case – via linear evaluation. The power of linear evaluation in high-dimensional spaces can be understood by considering Reproducing Kernel Hilbert Spaces (RKHS). One weakness of this approach is that it limits the rank of the set of belief shifts our model can represent when the vector space is finite-dimensional, as was previously addressed in the context of language modeling by introducing mixtures [Yang et al., 2018]. We provide pseudo-code for the NCE bound in Figure 1.
When training with larger models on more challenging datasets, i.e. STL10 and ImageNet, we use some tricks to mitigate occasional instability in the NCE cost. The first trick is to add a weighted regularization term that penalizes the squared matching scores like: λ(φ1(f1(x))ᵀφ7(f7(x)ij))². We use NCE regularization weight λ = 4e−2 for all experiments.
The second trick is to apply a soft clipping non-linearity to the scores after computing the regularization term and before computing the log-softmax in Eqn. 2. For clipping score s to the range [−c, c], we applied the non-linearity s′ = c · tanh(s/c), which is linear around 0 and saturates as one approaches ±c. We use c = 20 for all experiments. We suspect there may be interesting formal and practical connections between regularization that restricts the variance/range/etc. of scores that go into the NCE bound, and things like the KL/information cost in Variational Autoencoders [Kingma and Welling, 2013].

Figure 1: (a): Local DIM with predictions across views generated by data augmentation. (b): Augmented Multiscale DIM, with multiscale infomax across views generated by data augmentation. (c)-top: An algorithm for efficient NCE with minibatches of na images, comprising one antecedent and nc consequents per image. For each true (antecedent, consequent) positive sample pair, we compute the NCE bound using all consequents associated with all other antecedents as negative samples. Our pseudo-code is roughly based on pytorch. We use dynamic programming in the log-softmax normalizations required by ℓnce. (c)-bottom: Our ImageNet encoder architecture.

3.4 Data Augmentation

Our model extends local DIM by maximizing mutual information between features from augmented views of each input. We describe this with a few minor changes to our notation for local DIM. We construct the augmented feature distribution pA(f1(x1), f7(x2)ij) as follows: (i) sample an input x ∼ D, (ii) sample augmented images x1 ∼ A(x) and x2 ∼ A(x), (iii) sample spatial indices i ∼ u(i) and j ∼ u(j), and (iv) compute features f1(x1) and f7(x2)ij. We use A(x) to denote the distribution of images generated by applying stochastic data augmentation to x.
For this paper, we apply some standard data augmentations: random resized crop, random jitter in color space, and random conversion to grayscale. We apply a random horizontal flip to x before computing x1 and x2. Given pA(f1(x1), f7(x2)ij), we define the marginals pA(f1(x1)) and pA(f7(x2)ij). Using these, we rewrite the infomax objective in Eqn. 1 to include prediction across data augmentation:

\mathbb{E}_{(f_1(x_1),\, f_7(x_2)_{ij})} \Big[ \mathbb{E}_{N_7} \big[ \mathcal{L}_\Phi(f_1(x_1), f_7(x_2)_{ij}, N_7) \big] \Big],   (4)

where negative samples in N7 are now sampled from the marginal pA(f7(x2)ij), and LΦ is unchanged. Figure 1a illustrates local DIM with prediction across augmented views.

3.5 Multiscale Mutual Information

Our model further extends local DIM by maximizing mutual information across multiple feature scales. Consider features f5(x)ij taken from position (i, j) in the top-most layer of f with spatial dimension 5 × 5. Using the procedure from the preceding subsection, we can construct joint distributions over pairs of features from any position in any layer like: pA(f5(x1)ij, f7(x2)kl), pA(f5(x1)ij, f5(x2)kl), or pA(f1(x1), f5(x2)kl). We can now define a family of n-to-m infomax costs:

\mathbb{E}_{(f_n(x_1)_{ij},\, f_m(x_2)_{kl})} \Big[ \mathbb{E}_{N_m} \big[ \mathcal{L}_\Phi(f_n(x_1)_{ij}, f_m(x_2)_{kl}, N_m) \big] \Big],   (5)

where Nm denotes a set of independent samples from the marginal pA(fm(x2)kl) over features from the top-most m × m layer in f. For the experiments in this paper we maximize mutual information from 1-to-5, 1-to-7, and 5-to-5. We uniformly sample locations for both features in each positive sample pair. These costs may look expensive to compute at scale, but it is actually straightforward to efficiently compute Monte Carlo approximations of the relevant expectations using many samples in a single pass through the encoder for each batch of (x1, x2) pairs.
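The single-pass computation described above can be sketched in numpy, following the memory-efficient NCE algorithm shown in Figure 1c. This is an illustrative sketch, not the authors' implementation: the function name and argument shapes are our own, and we fold in the soft score clipping from Section 3.3 (c = 20).

```python
import numpy as np

def nce_bound(phi_a, phi_c, clip=20.0):
    """Average NCE score over all positive (antecedent, consequent) pairs.

    phi_a: (na, d) array of antecedent embeddings, one per image.
    phi_c: (na, nc, d) array of consequent embeddings, nc per image.
    For each positive pair, the negatives are all consequents belonging
    to the other na - 1 images in the batch.
    """
    # Raw matching scores: s[i, j, k] = <phi_a[i], phi_c[j, k]>  -> (na, na, nc)
    s = np.einsum('id,jkd->ijk', phi_a, phi_c)
    # Soft clipping (Section 3.3): keeps scores in [-clip, clip].
    s = clip * np.tanh(s / clip)
    # Per-antecedent shift stabilizes the log-sum-exp.
    s_shift = s.max(axis=(1, 2), keepdims=True)    # (na, 1, 1)
    s_exp = np.exp(s - s_shift)                    # (na, na, nc)
    s_self = s_exp.sum(axis=2, keepdims=True)      # (na, na, 1)
    s_full = s_self.sum(axis=1, keepdims=True)     # (na, 1, 1)
    s_other = s_full - s_self                      # negatives: other images' consequents
    s_lse = np.log(s_exp + s_other)                # (na, na, nc)
    s_nce = s - s_shift - s_lse                    # log-softmax scores
    # Positive pairs live on the diagonal j == i.
    return np.einsum('iik->ik', s_nce).mean()
```

Each diagonal entry equals the log-softmax of a positive score against itself plus all cross-image negatives, so the returned value is a (negated) version of the loss in Eqn. 4, averaged over na × nc positive pairs.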
Figure 1b illustrates our full model, which we call Augmented Multiscale Deep InfoMax (AMDIM).

3.6 Encoder

Our model uses an encoder based on the standard ResNet [He et al., 2016a,b], with changes to make it suitable for DIM. Our main concern is controlling receptive fields. When the receptive fields for features in a positive sample pair overlap too much, the task becomes too easy and the model performs worse. Another concern is keeping the feature distributions stationary by avoiding padding.

Algorithm (Figure 1c, top): compute and memory-efficient NCE
// na: number of antecedents, nc: number of consequents per antecedent
// s: array of phi(f_a)^T phi(f_c) scores, with size (na, na, nc)
// the tuple after each statement gives the result size
s_shift = max(max(s, dim=2), dim=1)   // (na, 1, 1)
s_exp   = exp(s - s_shift)            // (na, na, nc)
s_self  = sum(s_exp, dim=2)           // (na, na, 1)
s_full  = sum(s_self, dim=1)          // (na, 1, 1)
s_other = s_full - s_self             // (na, na, 1)
s_lse   = log(s_exp + s_other)        // (na, na, nc)
s_nce   = s - s_shift - s_lse         // (na, na, nc)
l_nce   = (1 / (na * nc)) * sum_{i=1..na} sum_{j=1..nc} s_nce[i, i, j]

Algorithm (Figure 1c, bottom): ImageNet encoder architecture
ReLU(Conv2d(3, ndf, 5, 2, 2))
ReLU(Conv2d(ndf, ndf, 3, 1, 0))
ResBlock(1*ndf, 2*ndf, 4, 2, ndepth)
ResBlock(2*ndf, 4*ndf, 4, 2, ndepth)
ResBlock(4*ndf, 8*ndf, 2, 2, ndepth)  – provides f7
ResBlock(8*ndf, 8*ndf, 3, 1, ndepth)  – provides f5
ResBlock(8*ndf, 8*ndf, 3, 1, ndepth)
ResBlock(8*ndf, nrkhs, 3, 1, 1)       – provides f1

The encoder comprises a sequence of blocks, with each block comprising multiple residual layers. The first layer in each block applies mean pooling with kernel width w and stride s to compute a base output, and computes residuals to add to the base output using a convolution with kernel width w and stride s, followed by a ReLU and then a 1 × 1 convolution. Subsequent layers in the block are standard 1 × 1 residual layers, i.e. w = s = 1.
The mean pooling compensates for not using padding, and the 1 × 1 layers control receptive field growth. Exhaustive details can be found in our code: https://github.com/Philip-Bachman/amdim-public. We train our models using 4-8 standard Tesla V100 GPUs per model. Other recent, strong self-supervised models are not reproducible on standard hardware.
We use the encoder architecture in Figure 1c when working with ImageNet and Places205. We use 128 × 128 input for these datasets due to resource constraints. The argument order for Conv2d is (input dim, output dim, kernel width, stride, padding). The argument order for ResBlock is the same as Conv2d, except the last argument (i.e. ndepth) gives block depth rather than padding. Parameters ndf and nrkhs determine the encoder feature dimension and the output dimension for the embedding functions φn(fn). The embeddings φ7(f7) and φ5(f5) are computed by applying a small MLP via convolution. We use similar architectures for the other datasets, with minor changes to account for input sizes.

3.7 Mixture-Based Representations

We now extend our model to use mixture-based features. For each antecedent feature f1, we compute a set of mixture features {f1^1, ..., f1^k}, where k is the number of mixture components. We compute these features using a function mk: {f1^1, ..., f1^k} = mk(f1). We represent mk using a fully-connected network with a single ReLU hidden layer and a residual connection between f1 and each mixture feature f1^i. When using mixture features, we maximize the following objective:

\underset{f,\, q}{\text{maximize}} \;\; \mathbb{E}_{(x_1, x_2)} \Bigg[ \frac{1}{n_c} \sum_{i=1}^{n_c} \sum_{j=1}^{k} \Big( q\big(f_1^j(x_1) \mid f_7^i(x_2)\big)\, s_{\text{nce}}\big(f_1^j(x_1), f_7^i(x_2)\big) + \alpha H(q) \Big) \Bigg].   (6)

For each augmented image pair (x1, x2), we extract k mixture features {f1^1(x1), ..., f1^k(x1)} and nc consequent features {f7^1(x2), ..., f7^nc(x2)}. snce(f1^j(x1), f7^i(x2)) denotes the NCE score between f1^j(x1) and f7^i(x2), computed as described in Figure 1c. This score gives the log-softmax term for the mutual information bound in Eqn. 2.
We also add an entropy maximization term αH(q). In practice, given the k scores {snce(f1^1, f7^i), ..., snce(f1^k, f7^i)} assigned to consequent feature f7^i by the k mixture features {f1^1, ..., f1^k}, we can compute the optimal distribution q as follows:

q(f_1^j \mid f_7^i) = \frac{\exp\big(\tau\, s_{\text{nce}}(f_1^j, f_7^i)\big)}{\sum_{j'} \exp\big(\tau\, s_{\text{nce}}(f_1^{j'}, f_7^i)\big)},   (7)

where τ is a temperature parameter that controls the entropy of q. We motivate Eqn. 7 by analogy to Reinforcement Learning. Given the scores snce(f1^j, f7^i), we could define q using an indicator of the maximum score. But, when q depends on the stochastic scores, this choice will be overoptimistic in expectation, since it is biased towards scores which are pushed up by the stochasticity (which comes from sampling negative samples). Rather than take a maximum, we encourage q to be less greedy by adding the entropy maximization term αH(q). For any value of α in Eqn. 6, there exists a value of τ in Eqn. 7 such that computing q using Eqn. 7 provides an optimal q with respect to Eqn. 6. This directly relates to the formulation of optimal Boltzmann-type policies in the context of Soft Q-Learning; see, e.g., Haarnoja et al. [2017]. In practice, we treat τ as a hyperparameter.

4 Experiments

We evaluate our model on standard benchmarks for self-supervised visual representation learning. We use CIFAR10, CIFAR100, STL10, ImageNet, and Places205. To measure performance, we first train an encoder using all examples from the training set (sans labels), and then train linear and MLP classifiers on top of the encoder features f1(x) (sans backprop into the encoder). The final performance metric is the accuracy of these classifiers. This follows the evaluation protocol described by Kolesnikov et al. [2019].
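As a concrete illustration of the posterior in Eqn. 7 from Section 3.7: for one consequent feature, q is simply a temperature-scaled softmax over its k NCE scores. A minimal numpy sketch (the function name is ours, not from the AMDIM code):

```python
import numpy as np

def mixture_posterior(scores, tau):
    """Eqn. 7: q(f_1^j | f_7^i) as a temperature softmax over the k scores
    {s_nce(f_1^1, f_7^i), ..., s_nce(f_1^k, f_7^i)} for one consequent f_7^i.

    Larger tau lowers the entropy of q (greedier assignment); as tau grows,
    q approaches an indicator of the maximum score.
    """
    z = tau * np.asarray(scores, dtype=float)
    z = z - z.max()          # subtract the max for numerical stability
    q = np.exp(z)
    return q / q.sum()
```

Per the discussion above, τ in this softmax plays the same role as the entropy weight α in Eqn. 6: each trades off greediness against entropy in the component assignment.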
Our model outperforms prior work on these datasets.

Table 1: (a): Comparing AMDIM with prior results for the ImageNet and Imagenet→Places205 transfer tasks using linear evaluation. The competing methods are from: [Gidaris et al., 2018, Dosovitskiy et al., 2014, Doersch and Zisserman, 2017, Noroozi and Favaro, 2016, van den Oord et al., 2018, Hénaff et al., 2019]. The non-CPC results are from updated versions of the models by Kolesnikov et al. [2019]. The (sup) models were fully-supervised, with no self-supervised costs. The small and large AMDIM models had size parameters: (ndf=192, nrkhs=1536, ndepth=8) and (ndf=320, nrkhs=2560, ndepth=10). AMDIM outperforms prior and concurrent methods by a large margin. We trained AMDIM models for 150 epochs on 8 NVIDIA Tesla V100 GPUs. When we train the small model using a shorter 50 epoch schedule, it achieves 62.7% accuracy in 2 days on 4 GPUs.

Table 2: (b): Comparing AMDIM with fully-supervised models on CIFAR10 and CIFAR100, using linear and MLP evaluation. Supervised results are from [Srivastava et al., 2015, He et al., 2016a, Zagoruyko and Komodakis, 2016]. The small and large AMDIM models had size parameters: (ndf=128, nrkhs=1024, ndepth=10) and (ndf=256, nrkhs=2048, ndepth=10). AMDIM features performed on par with classic fully-supervised models.

Table 3: (c): Results of single ablations on STL10 and ImageNet. The size parameters for all models on both datasets were: (ndf=192, nrkhs=1536, ndepth=8). We trained these models for 50 epochs, thus the ImageNet models were smaller and trained for one third as long as our best result (68.1%). We perform ablations against a baseline model that applies basic data augmentation which includes resized cropping, color jitter, and random conversion to grayscale. We ablate different aspects of the data augmentation as well as multiscale feature learning and NCE cost regularization.
Our strongest\nresults used the Fast AutoAugment augmentation policy from Lim et al. [2019], and we report the\neffects of switching from basic augmentation to stronger augmentation as \u201c+strong aug\u201d. Data\naugmentation had the strongest effect by a large margin, followed by stability regularization and\nmultiscale prediction.\n\nOn CIFAR10 and CIFAR100 we trained small models with size parameters: (ndf=128, nrkhs=1024,\nndepth=10), and large models with size parameters: (ndf=320, nrkhs=1280, ndepth=10). On CI-\nFAR10, the large model reaches 91.2% accuracy with linear evaluation and 93.1% accuracy with\nMLP evaluation. On CIFAR100, it reaches 70.2% and 72.8%. These are comparable with slightly\nolder fully-supervised models, and well ahead of other work on self-supervised feature learning. See\nTable 2 for a comparison with standard fully-supervised models. On STL10, using size parame-\nters: (ndf=192, nrkhs=1536, ndepth=8), our model signi\ufb01cantly improves on prior self-supervised\nresults. STL10 was originally intended to test semi-supervised learning methods, and comprises\n10 classes with a total of 5000 labeled examples. Strong results have been achieved on STL10\nvia semi-supervised learning, which involves \ufb01ne-tuning some of the encoder parameters using the\navailable labeled data. Examples of such results include [Ji et al., 2019] and [Berthelot et al., 2019],\nwhich achieve 88.8% and 94.4% accuracy respectively. Our model reaches 94.2% accuracy on STL10\nwith linear evaluation, which compares favourably with semi-supervised results that \ufb01ne-tune the\nencoder using the labeled data.\nOn ImageNet, using a model with size parameters: (ndf=320, nrkhs=2536, ndepth=10), and a batch\nsize of 1008, we reach 68.1% accuracy for linear evaluation, beating the best prior result by over\n12% and the best concurrent results by 7% [Kolesnikov et al., 2019, H\u00e9naff et al., 2019, Tian et al.,\n2019]. 
Our model is significantly smaller than the models which produced those results and is reproducible on standard hardware. Using MLP evaluation, our model reaches 69.5% accuracy. Our linear and MLP evaluation results on ImageNet both surpass the original AlexNet trained end-to-end by a large margin. Table 3 provides results from single ablation tests on STL10 and ImageNet. We perform ablations on individual aspects of data augmentation and on the use of multiscale feature learning and NCE cost regularization. See Table 1 for a comparison with well-optimized results for prior and concurrent models. We also tested our model on an Imagenet→Places205 transfer task, which involves training the encoder on ImageNet and then training the evaluation classifiers on the Places205 data. Our model also beat prior results on that task. Performance with the transferred features is close to that of features learned on the Places205 data. See Table 1.

Table 1 (data):
Method            ImageNet  Places205
ResNet50v2 (sup)  74.4      61.6
AMDIM (sup)       71.3      57.4
Rotation          55.4      48.0
Exemplar          46.0      42.7
Patch Offset      51.4      45.3
Jigsaw            44.6      42.2
CPC-large         48.7      n/a
CPC-huge          61.0      n/a
CMC-large         60.1      n/a
AMDIM-small       63.5      n/a
AMDIM-large       68.1      55.0

Table 2 (data):
Model             CIFAR10 (linear, MLP)  CIFAR100 (linear, MLP)
Highway Network   92.28                  67.61
ResNet-101        93.58                  74.84
WideResNet-40-4   95.47                  79.82
AMDIM-small       89.5, 91.4             68.1, 71.5
AMDIM-large       91.2, 93.1             70.2, 72.8

Table 3 (data):
Ablation          STL10 (linear, MLP)  ImageNet (linear, MLP)
AMDIM             93.4, 93.8           61.7, 62.6
+ strong aug      94.2, 94.5           62.7, 63.1
− color jitter    90.3, 90.6           57.7, 58.8
− random gray     88.3, 89.4           53.6, 54.9
− random crop     86.0, 87.1           53.2, 54.9
− multiscale      92.6, 93.0           59.9, 61.2
− stabilize       93.5, 93.8           57.2, 59.5

Figure 2: Visualizing behaviour of AMDIM. (a) and (b) combine two things – KNN retrieval based on cosine similarity between features f1, and the matching scores (i.e., φ1(f1)ᵀφ7(f7)) between features f1 and f7. (a) is from ImageNet and (b) is from Places205. Each left-most column shows a query image, whose f1 was used to retrieve the 7 most similar images. For each query, we visualize similarity between its f1 and the f7s from the retrieved images. On ImageNet, we see that good retrieval is often based on similarity focused on the main object, while poor retrieval depends more on background similarity. The pattern is more diffuse for Places205. (c) and (d) visualize the data augmentation that produces paired images x1 and x2, and three types of similarity: between f1(x1) and f7(x2), between f7(x1) and f7(x2), and between f5(x1) and f5(x2). (e, f, g, h): we visualize models trained on STL10 with 2, 3, 3, and 4 components in the top-level mixtures. For each x1 (left) and x2 (right), the mixture components were inferred from x1 and we visualize the posteriors over those components for the f7 features from x2. We compute the posteriors as described in Section 3.7.

We include additional visualizations of model behaviour in Figure 2. See the figure caption for more information. Briefly, though our model generally performs well, it does exhibit some characteristic weaknesses that provide interesting subjects for future work. Intriguingly, when we incorporate mixture-based representations, segmentation behaviour emerges as a natural side-effect. The mixture-based model is more sensitive to hyperparameters, and we have not had time to tune it for ImageNet. However, the qualitative behaviour on STL10 is exciting and we observe roughly a 1% boost in performance with a simple bag-of-features approach for using the mixture features during evaluation.

5 Discussion

We introduced an approach to self-supervised learning based on maximizing mutual information between arbitrary features extracted from multiple views of a shared context.
Following this approach, we developed a model called Augmented Multiscale Deep InfoMax (AMDIM), which improves on prior results while remaining computationally practical. Our approach extends to a variety of domains, including video, audio, and text. E.g., we expect that capturing natural relations using multiple views of local spatio-temporal contexts in video could immediately improve our model. Worthwhile subjects for future research include: modifying the AMDIM objective to work better with standard architectures, improving scalability and running on better infrastructure, further work on mixture-based representations, and examining (formally and empirically) the role of regularization in the NCE-based mutual information bound. We believe contrastive self-supervised learning has a lot to offer, and that AMDIM represents a particularly effective approach.

References

Relja Arandjelović and Andrew Zisserman. Look, listen and learn. International Conference on Computer Vision (ICCV), 2017.

Relja Arandjelović and Andrew Zisserman. Objects that sound. European Conference on Computer Vision (ECCV), 2018.

David Berthelot, Nicholas Carlini, Ian Goodfellow, Nicolas Papernot, Avital Oliver, and Colin Raffel. MixMatch: A holistic approach to semi-supervised learning. arXiv:1905.02249 [cs.LG], 2019.

Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

Adam Coates, Honglak Lee, and Andrew Y Ng. An analysis of single layer networks in unsupervised feature learning. International Conference on Artificial Intelligence and Statistics (AISTATS), 2011.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding.
Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2019.

Carl Doersch and Andrew Zisserman. Multi-task self-supervised visual learning. International Conference on Computer Vision (ICCV), 2017.

Carl Doersch, Abhinav Gupta, and Alexei A Efros. Unsupervised visual representation learning by context prediction. International Conference on Computer Vision (ICCV), 2015.

Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. International Conference on Machine Learning (ICML), 2014.

Alexey Dosovitskiy, Philipp Fischer, Jost Tobias Springenberg, Martin Riedmiller, and Thomas Brox. Discriminative unsupervised feature learning with exemplar convolutional neural networks. Advances in Neural Information Processing Systems (NIPS), 2014.

Debidatta Dwibedi, Jonathan Tompson, Corey Lynch, and Pierre Sermanet. Learning actionable representations from visual observations. International Conference on Intelligent Robots and Systems (IROS), 2018.

SM Ali Eslami, Danilo Jimenez Rezende, Frederic Besse, Fabio Viola, Ari S Morcos, Marta Garnelo, Avraham Ruderman, Andrei A Rusu, Ivo Danihelka, Karol Gregor, David P Reichert, Lars Buesing, Theophane Weber, Oriol Vinyals, Dan Rosenbaum, Neil Rabinowitz, Helen King, Chloe Hillier, Matt Botvinick, Daan Wierstra, Koray Kavukcuoglu, and Demis Hassabis. Neural scene representation and rendering. Science, 2018.

Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. International Conference on Learning Representations (ICLR), 2018.

Priya Goyal, Dhruv Mahajan, Abhinav Gupta, and Ishan Misra. Scaling and benchmarking self-supervised visual representation learning. arXiv:1905.01235 [cs.CV], 2019.

Michael Gutmann and Aapo Hyvärinen.
Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. International Conference on Artificial Intelligence and Statistics (AISTATS), 2010.

Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. Reinforcement learning with deep energy-based policies. International Conference on Machine Learning (ICML), 2017.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. Conference on Computer Vision and Pattern Recognition (CVPR), 2016a.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. European Conference on Computer Vision (ECCV), 2016b.

Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. International Conference on Computer Vision (ICCV), 2017.

Kaiming He, Ross Girshick, and Piotr Dollár. Rethinking ImageNet pre-training. arXiv:1811.08883 [cs.CV], 2018.

Olivier J Hénaff, Ali Razavi, Carl Doersch, S M Ali Eslami, and Aaron van den Oord. Data-efficient image recognition with contrastive predictive coding. arXiv:1905.09272 [cs.LG], 2019.

R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Philip Bachman, Adam Trischler, and Yoshua Bengio. Learning deep representations by mutual information estimation and maximization. International Conference on Learning Representations (ICLR), 2019.

Xu Ji, João F Henriques, and Andrea Vedaldi. Invariant information clustering for unsupervised image classification and segmentation. arXiv:1807.06653, 2019.

Sham M Kakade and Dean P Foster. Multi-view regression via canonical correlation analysis. Annual Conference on Learning Theory (COLT), 2007.

Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv:1312.6114, 2013.

Alexander Kolesnikov, Xiaohua Zhai, and Lucas Beyer.
Revisiting self-supervised visual representation learning. arXiv:1901.09005, 2019.

Sungbin Lim, Ildoo Kim, Taesup Kim, Chiheon Kim, and Sungwoong Kim. Fast AutoAugment. arXiv:1905.00397 [cs.LG], 2019.

Lajanugen Logeswaran and Honglak Lee. An efficient framework for learning sentence representations. International Conference on Learning Representations (ICLR), 2018.

Zhuang Ma and Michael Collins. Noise-contrastive estimation and negative sampling for conditional models: Consistency and statistical efficiency. Conference on Empirical Methods in Natural Language Processing (EMNLP), 2018.

David McAllester and Karl Stratos. Formal limitations on the measurement of mutual information. arXiv:1811.04251 [cs.IT], 2018.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems (NIPS), 2013.

Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. European Conference on Computer Vision (ECCV), 2016.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global vectors for word representation. Empirical Methods in Natural Language Processing (EMNLP), 2014.

Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2018.

Ben Poole, Sherjil Ozair, Aaron van den Oord, Alexander A Alemi, and George Tucker. On variational bounds of mutual information. International Conference on Machine Learning (ICML), 2019.

Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks.
Advances in Neural Information Processing Systems (NIPS), 2015.

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C Berg, and Li Fei-Fei. ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV), 2015.

Pierre Sermanet, Corey Lynch, Yevgen Chebotar, Jasmine Hsu, Eric Jang, Stefan Schaal, and Sergey Levine. Time-contrastive networks: Self-supervised learning from video. International Conference on Robotics and Automation (ICRA), 2017.

Karthik Sridharan and Sham M Kakade. An information theoretic framework for multi-view learning. Annual Conference on Learning Theory (COLT), 2008.

Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. Training very deep networks. Advances in Neural Information Processing Systems (NIPS), 2015.

Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive multiview coding. arXiv:1906.05849 [cs.LG], 2019.

Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv:1807.03748, 2018.

Carl Vondrick, Abhinav Shrivastava, Alireza Fathi, Sergio Guadarrama, and Kevin Murphy. Tracking emerges by colorizing videos. European Conference on Computer Vision (ECCV), 2018.

Weiran Wang, Raman Arora, Karen Livescu, and Jeff Bilmes. On deep multi-view representation learning. International Conference on Machine Learning (ICML), 2015.

Zhilin Yang, Zihang Dai, Ruslan Salakhutdinov, and William W Cohen. Breaking the softmax bottleneck: A high-rank RNN language model. International Conference on Learning Representations (ICLR), 2018.

Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. British Machine Vision Conference (BMVC), 2016.

Richard Zhang, Phillip Isola, and Alexei A Efros. Colorful image colorization. European Conference on Computer Vision (ECCV), 2016.

Bolei Zhou, Agata Lapedriza, Jianxiong Xiao, Antonio Torralba, and Aude Oliva. Learning deep features for scene recognition using places database. Advances in Neural Information Processing Systems (NIPS), 2014.