{"title": "Information Competing Process for Learning Diversified Representations", "book": "Advances in Neural Information Processing Systems", "page_first": 2178, "page_last": 2189, "abstract": "Learning representations with diversified information remains as an open problem. Towards learning diversified representations, a new approach, termed Information Competing Process (ICP), is proposed in this paper. Aiming to enrich the information carried by feature representations, ICP separates a representation into two parts with different mutual information constraints. The separated parts are forced to accomplish the downstream task independently in a competitive environment which prevents the two parts from learning what each other learned for the downstream task. Such competing parts are then combined synergistically to complete the task. By fusing representation parts learned competitively under different conditions, ICP facilitates obtaining diversified representations which contain rich information. Experiments on image classification and image reconstruction tasks demonstrate the great potential of ICP to learn discriminative and disentangled representations in both supervised and self-supervised learning settings.", "full_text": "Information Competing Process for Learning\n\nDiversi\ufb01ed Representations\n\nJie Hu12, Rongrong Ji123\u2217, ShengChuan Zhang1, Xiaoshuai Sun1,\n\nQixiang Ye4, Chia-Wen Lin5, Qi Tian6.\n\n1Media Analytics and Computing Lab, Department of Arti\ufb01cial Intelligence,\n\nSchool of Informatics, Xiamen University.\n\n2National Institute for Data Science in Health and Medicine, Xiamen University.\n\n3Peng Cheng Laboratory. 4University of Chinese Academy of Sciences.\n\n5National Tsing Hua University. 
6Noah's Ark Lab, Huawei.

Abstract

Learning representations with diversified information remains an open problem. Towards learning diversified representations, a new approach, termed Information Competing Process (ICP), is proposed in this paper. Aiming to enrich the information carried by feature representations, ICP separates a representation into two parts with different mutual information constraints. The separated parts are forced to accomplish the downstream task independently in a competitive environment, which prevents the two parts from learning what each other learned for the downstream task. The competing parts are then combined synergistically to complete the task. By fusing representation parts learned competitively under different conditions, ICP facilitates obtaining diversified representations which contain rich information. Experiments on image classification and image reconstruction tasks demonstrate the great potential of ICP to learn discriminative and disentangled representations in both supervised and self-supervised learning settings. 1

1 Introduction

Representation learning aims to make the learned feature representations more effective at extracting useful information from the input for downstream tasks [4]. It has been an active research topic in recent years and has become the foundation for many tasks [28, 8, 11, 15, 40, 20, 6]. Notably, a majority of works on representation learning have been studied from the viewpoint of mutual information constraints. For instance, the Information Bottleneck (IB) theory [38, 1] minimizes the information carried by representations to fit the target outputs, and generative models such as β-VAE [13, 5] also rely on such an information constraint to learn disentangled representations. Some other works [22, 3, 26, 14] reveal the advantages of maximizing the mutual information for learning discriminative representations.
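For reference, the quantity all of these constraints act on — mutual information — can be computed exactly for discrete variables from the joint distribution. A minimal self-contained sketch (the function name is ours, for illustration only, not from any of the cited works):

```python
import math

def mutual_information(joint):
    """I(X;Z) in nats from a discrete joint distribution,
    given as a nested list: joint[x][z] = P(X=x, Z=z)."""
    px = [sum(row) for row in joint]                 # marginal P(x)
    pz = [sum(col) for col in zip(*joint)]           # marginal P(z)
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0:
                mi += p * math.log(p / (px[i] * pz[j]))
    return mi

# A representation that copies a uniform binary input carries exactly
# log 2 nats about it; an independent one carries zero.
copy = [[0.5, 0.0], [0.0, 0.5]]
indep = [[0.25, 0.25], [0.25, 0.25]]
```

Maximizing this quantity makes a representation informative about its input; minimizing it compresses the representation, which is the tension the methods below exploit.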
Despite this exciting progress, learning diversified representations remains an open problem. Diversified representations are learned with different constraints that encourage representation parts to extract various information from the inputs, resulting in powerful features for representing the inputs. In principle, a good representation learning approach is supposed to discriminate and disentangle the underlying explanatory factors hidden in the input [4]. However, this goal is hard to realize, as existing methods typically resort to only one type of information constraint. As a consequence, the information diversity of the learned representations deteriorates.

*Corresponding Author.
1Codes, models and experimental results are all available at https://github.com/hujiecpp/InformationCompetingProcess/

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

In this paper we present a diversified representation learning scheme, termed Information Competing Process (ICP), which handles the above issues through a new information diversifying objective. First, the separated representation parts learned with different constraints are forced to accomplish the downstream task competitively. Then, the rival representation parts are combined to solve the downstream task synergistically. A novel solution is further proposed to optimize the new objective in both supervised and self-supervised learning settings.

We verify the effectiveness of the proposed ICP on both image classification and image reconstruction tasks, where neural networks are used as the feature extractors. In the supervised image classification task, we integrate ICP with four different network architectures (i.e., VGG [34], GoogLeNet [35], ResNet [12], and DenseNet [16]) to demonstrate how the diversified representations boost classification accuracy.
In the self-supervised image reconstruction task, we implement ICP with β-VAE [13] to investigate its ability to learn disentangled representations that reconstruct and manipulate the inputs. Empirical evaluations suggest that ICP fits finer-labeled datasets better and disentangles fine-grained semantic information in the representations.

2 Related Work

Representation Learning with Mutual Information. Mutual information has long been a powerful tool in representation learning. In the unsupervised setting, mutual information maximization is typically studied; it aims to add specific information to the representation and force the representation to be discriminative. For instance, the InfoMax principle [22, 3] advocates maximizing the mutual information between the inputs and the representations, which forms the basis of independent component analysis [17]. Contrastive Predictive Coding [26] and Deep InfoMax [14] maximize the mutual information between global and local representation pairs, or between the input and global/local representation pairs.

In the supervised or self-supervised settings, mutual information minimization is commonly utilized. For instance, the Information Bottleneck (IB) theory [38] uses an information-theoretic objective to constrain the mutual information between the input and the representation. IB was then introduced to deep neural networks [37, 33, 31], and Deep Variational Information Bottleneck (VIB) [1] was recently proposed to refine IB with a variational approximation. Another group of works in the self-supervised setting adopt generative models to learn representations [19, 30], in which mutual information plays an important role in learning disentangled representations. For instance, β-VAE [13] is a variant of the Variational Auto-Encoder [19] that attempts to learn a disentangled representation by optimizing a heavily penalized objective with mutual information minimization.
Recent works [5, 18, 7] revise the objective of β-VAE by applying various constraints. One special case is InfoGAN [8], which maximizes the mutual information between the representation and a factored Gaussian distribution. Besides, Mutual Information Neural Estimation [2] estimates the mutual information of continuous variables. Differing from the above schemes, the proposed ICP leverages both mutual information maximization and minimization to create a competitive environment for learning diversified representations.

Representation Collaboration. The idea of collaborating neural representations can be found in Neural Expectation Maximization [10] and Tagger [9], which use different representations to group and represent individual entities. The Competitive Collaboration [29] method is the most relevant to our work. It defines a three-player game with two competitors and a moderator, where the moderator takes the role of a critic and the two competitors collaborate to train the moderator. Unlike Competitive Collaboration, the proposed ICP enforces two (or more) representation parts to be complementary through different mutual information constraints on the same downstream task in a competitive environment, which endows it with the capability of learning more discriminative and disentangled representations.

3 Information Competing Process

The key idea of ICP is depicted in Fig. 1, in which different representation parts compete and collaborate with each other to diversify the information. In this section, we first unify the supervised and self-supervised objectives for achieving the target tasks. Then, the information competing objective for learning diversified representations is proposed.

Figure 1: The proposed Information Competing Process.
In the competitive step, the rival representation parts are forced to accomplish the downstream task alone, by preventing both parts from knowing what each other learned under different constraints for the task. In the synergetic step, these representation parts are combined to complete the downstream task synergistically. ICP can be generalized to an arbitrary number of constrained parts; in this paper we take two as an example.

3.1 Unifying Supervised and Self-Supervised Objectives

The information constraining objective in the supervised setting has the same form as that of the self-supervised setting except for the target outputs. We therefore unify these two objectives by using t as the output of the downstream task. In the supervised setting, t represents the label of input x. In the self-supervised setting, t represents the input x itself. This leads to the unified objective function linking the representation r of input x and target t:

max [ I(r, t) ],   (1)

where I(·,·) stands for the mutual information. This unified objective describes a constraint whose goal is to maximize the mutual information between the representation r and the target t.

3.2 Separating and Diversifying Representations

To explicitly diversify the information in representations, we directly separate the representation r into two parts [z, y] with different constraints, and encourage the parts to learn discrepant information from the input x. Specifically, we constrain the information capacity of representation part z while increasing the information capacity of representation part y.
To that effect, we have the following objective function:

max [ I(r, t) + αI(y, x) − βI(z, x) ],   (2)

where α > 0 and β > 0 are the regularization factors.

3.3 Competition of Representation Parts

To prevent any one of the representation parts from dominating the downstream task, we let z and y accomplish the downstream task t alone, by utilizing the mutual information constraints I(z, t) and I(y, t). Additionally, to ensure that the representations capture diversified information through the different constraints, ICP prevents z and y from knowing what each other learned for the downstream task, which is realized by enforcing z and y to be independent of each other. These constraints result in a competitive environment that enriches the information carried by the representations. The objective of ICP is accordingly:

max [ I(r, t) + αI(y, x) − βI(z, x) + I(z, t) + I(y, t) − γI(z, y) ],   (3)

where γ > 0 is the regularization factor, and the terms are grouped as ① synergy I(r, t), ② maximization αI(y, x), ③ minimization βI(z, x), and ④ competition I(z, t) + I(y, t) − γI(z, y).

4 Optimizing the Objective of ICP

In this section, we derive a solution to optimize the objective of ICP.
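To make the grouping in Eq. 3 concrete, the objective can be sketched as a plain function of the six estimated mutual-information quantities (an illustrative sketch with hypothetical names; not the authors' released code):

```python
# Hypothetical sketch of the ICP objective (Eq. 3): each argument is an
# estimate of one mutual-information term; the result is to be maximized.
def icp_objective(I_rt, I_yx, I_zx, I_zt, I_yt, I_zy, alpha, beta, gamma):
    synergy = I_rt                             # term 1: combined parts solve the task
    maximization = alpha * I_yx                # term 2: enlarge the info in y
    minimization = -beta * I_zx                # term 3: compress the info in z
    competition = I_zt + I_yt - gamma * I_zy   # term 4: independent rival parts
    return synergy + maximization + minimization + competition
```

The remainder of this section replaces each intractable term with a tractable bound or estimator.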
Although all the terms of this objective have the same formulation, i.e., the mutual information between two variables, they need to be optimized with different methods due to their different aims. We therefore classify these terms as the mutual information minimization term I(z, x), the mutual information maximization term I(y, x), the inference terms I(z, t), I(y, t), I(r, t), and the predictability minimization term I(z, y), and find a solution for each.

4.1 Mutual Information Minimization Term

To minimize the mutual information between x and z, we find a tractable upper bound for the intractable I(z, x). In existing works [19, 1], I(z, x) is usually defined under the joint distribution of the inputs and their encoding distribution, as it is the constraint between the inputs and the representations. Concretely, the formulation is derived as:

I(z, x) = ∫∫ P(z, x) log [ P(z, x) / (P(z)P(x)) ] dx dz = ∫∫ P(z, x) log [ P(z|x) / P(z) ] dx dz
        = ∫∫ P(z, x) log P(z|x) dx dz − ∫∫ P(x|z)P(z) log P(z) dx dz
        = ∫∫ P(z, x) log P(z|x) dx dz − ∫ P(z) log P(z) dz.   (4)

Let Q(z) be a variational approximation of P(z); then:

KL[ P(z) ‖ Q(z) ] ≥ 0 ⇒ ∫ P(z) log P(z) dz ≥ ∫ P(z) log Q(z) dz.   (5)

According to Eq. 5, the tractable upper bound after applying the variational approximation is:

I(z, x) ≤ ∫∫ P(z|x)P(x) log [ P(z|x) / Q(z) ] dx dz = E_{x∼P(x)} [ KL[ P(z|x) ‖ Q(z) ] ],   (6)

which pushes the extracted z conditioned on x toward a predefined distribution Q(z), such as a standard Gaussian distribution.

4.2 Mutual Information Maximization Term

To maximize the mutual information between x and y, we deduce a tractable alternative for the intractable I(y, x). Like the minimization term above, this mutual information is also defined under the joint distribution of the inputs and their encoding distribution. As it is hard to derive a tractable lower bound for this term, we expand the mutual information as:

I(y, x) = ∫∫ P(y, x) log [ P(y, x) / (P(y)P(x)) ] dx dy = KL[ P(y|x)P(x) ‖ P(y)P(x) ].   (7)

Eq. 7 means that maximizing the mutual information is equal to enlarging the Kullback-Leibler (KL) divergence between the distributions P(y|x)P(x) and P(y)P(x), but the maximization of a KL divergence is divergent. We instead maximize the Jensen-Shannon (JS) divergence as an alternative, which approximates the maximization of the KL divergence but is convergent. Following [25], a tractable variational estimation of the JS divergence can be defined as:

JS[ P(y|x)P(x) ‖ P(y)P(x) ] = max_D [ E_{(y,x)∼P(y|x)P(x)} [ log D(y, x) ] + E_{(ŷ,x)∼P(y)P(x)} [ log(1 − D(ŷ, x)) ] ],   (8)

where D is a discriminator that estimates the probability of the input pair, (y, x) is a positive pair sampled from P(y|x)P(x), and (ŷ, x) is a negative pair sampled from P(y)P(x). Since ŷ should still be a valid representation while being independent of the paired x, we shuffle y across the positive pairs (x, y) to obtain the negative pairs (x, ŷ).

4.3 Inference Term

The inference terms in Eq. 3 are defined under the joint distribution of the representation and the output distribution of the downstream task solver. We take I(r, t) as an example; I(z, t) and I(y, t) have the same formulation as I(r, t). We expand this mutual information term as:

I(r, t) = ∫∫ P(r, t) log [ P(t|r) / P(t) ] dr dt
        = ∫∫ P(r, t) log P(t|r) dt dr − ∫ P(t) log P(t) dt
        = ∫∫ P(r, t) log P(t|r) dt dr + H(t)
        ≥ ∫∫ P(t|r)P(r) log P(t|r) dt dr,   (9)

where H(t) ≥ 0 is the information entropy of t. Let Q(t|r) be a variational approximation of P(t|r); then:

KL[ P(t|r) ‖ Q(t|r) ] ≥ 0 ⇒ ∫ P(t|r) log P(t|r) dt ≥ ∫ P(t|r) log Q(t|r) dt.   (10)

By applying the variational approximation, the tractable lower bound of the mutual information between r and t is:

I(r, t) ≥ ∫∫ P(r, t) log Q(t|r) dt dr.   (11)

Based on the above formulation, we derive different objectives for the supervised and self-supervised settings in what follows. The full optimization procedure is summarized in Algorithm 1.

Algorithm 1: Optimization of the Information Competing Process
Input: The source input x with the downstream task target t; the prior distributions Q(z), Q(t|r), Q(t|z) and Q(t|y) for variational approximation; the hyperparameters α, β, γ.
Output: The learned representation extractor and downstream solver.
1  while not converged do
2      Optimize Eq. 8 and Eq. 16 for the discriminator D and the predictor H;
3      // Mutual information minimization term:
4      Replace I(z, x) in Eq. 3 with the tractable upper bound in Eq. 6;
5      // Mutual information maximization term:
6      Replace I(y, x) in Eq. 3 with the tractable alternative in Eq. 8;
7      // Inference term:
8      Replace I(z, t), I(y, t), I(r, t) in Eq. 3 with the tractable lower bound in Eq. 14;
9      // Predictability minimization term:
10     Replace I(z, y) in Eq. 3 with Eq. 16;
11     Optimize Eq. 3 while fixing the parameters of D and H;
12 end

Supervised Setting.
In the supervised setting, t represents the known target labels. By assuming that the representation r does not depend on the label t given x, i.e., P(r|x, t) = P(r|x), we have:

P(x, r, t) = P(r|x, t)P(t|x)P(x) = P(r|x)P(t|x)P(x).   (12)

Accordingly, the joint distribution of r and t can be written as:

P(r, t) = ∫ P(x, r, t) dx = ∫ P(r|x)P(t|x)P(x) dx.   (13)

Combining Eq. 11 with Eq. 13, we get the lower bound of the inference term in the supervised setting:

I(r, t) ≥ ∫∫∫ P(x)P(r|x)P(t|x) log Q(t|r) dt dr dx = E_{x∼P(x)} [ E_{r∼P(r|x)} [ ∫ P(t|x) log Q(t|r) dt ] ].   (14)

Since the conditional probability P(t|x) represents the distribution of labels in the supervised setting, Eq. 14 is exactly the cross entropy loss for classification.

Self-supervised Setting. In the self-supervised setting, t is the input x itself. Therefore, Eq. 11 can be directly derived as:

I(r, x) ≥ ∫∫ P(r|x)P(x) log Q(x|r) dx dr = E_{x∼P(x)} [ E_{r∼P(r|x)} [ log Q(x|r) ] ].   (15)

Assuming Q(x|r) to be a Gaussian distribution, Eq. 15 expands into the L2 reconstruction loss for the input x.

Table 1: Classification error rates (%) on the CIFAR-10 test set.

Method     | VGG16 [34]  | GoogLeNet [35] | ResNet20 [12] | DenseNet40 [16]
Baseline   | 6.67        | 5.83           | 4.92          | 7.63
VIB [1]    | 6.81 ↑0.14  | 5.72 ↓0.11     | 5.09 ↑0.17    | 6.95 ↓0.68
DIM* [14]  | 6.54 ↓0.13  | 6.15 ↑0.32     | 4.65 ↓0.27    | 7.61 ↓0.02
VIB×2      | 6.86 ↑0.19  | 6.36 ↑0.53     | 4.88 ↓0.04    | 6.85 ↓0.78
DIM*×2     | 7.24 ↑0.57  | 5.60 ↓0.23     | 4.95 ↑0.03    | 7.46 ↓0.17
ICP-ALL    | 6.97 ↑0.30  | 6.13 ↑0.30     | 4.76 ↓0.16    | 6.47 ↓1.16
ICP-COM    | 6.59 ↓0.08  | 5.63 ↓0.20     | 4.67 ↓0.25    | 7.33 ↓0.30
ICP        | 6.10 ↓0.57  | 4.99 ↓0.84     | 4.26 ↓0.66    | 6.01 ↓1.62

Table 2: Classification error rates (%) on the CIFAR-100 test set.

Method     | VGG16 [34]  | GoogLeNet [35] | ResNet20 [12] | DenseNet40 [16]
Baseline   | 26.41       | 20.68          | 27.55         | 31.91
VIB [1]    | 26.56 ↑0.15 | 20.93 ↑0.25    | 26.37 ↓1.18   | 30.84 ↓1.07
DIM* [14]  | 26.74 ↑0.33 | 20.94 ↑0.26    | 27.51 ↓0.04   | 32.62 ↑0.71
VIB×2      | 26.08 ↓0.33 | 22.09 ↑1.41    | 29.33 ↑1.78   | 29.74 ↓2.17
DIM*×2     | 25.72 ↓0.69 | 21.74 ↑1.06    | 27.15 ↓0.40   | 30.16 ↓1.75
ICP-ALL    | 26.73 ↑0.32 | 20.90 ↑0.22    | 27.51 ↓0.04   | 28.35 ↓3.56
ICP-COM    | 26.37 ↓0.04 | 20.81 ↑0.13    | 26.85 ↓0.70   | 32.76 ↑0.85
ICP        | 24.54 ↓1.87 | 18.55 ↓2.13    | 24.52 ↓3.03   | 28.13 ↓3.78

4.4 Predictability Minimization Term

To diversify the information and prevent the dominance of one representation part, we constrain the mutual information between z and y, which is equivalent to making z and y independent of each other. Inspired by [32], we introduce a predictor H to fulfill this goal. Concretely, we let H predict y conditioned on z, and prevent the extractor from producing a z from which y can be predicted; the same operation is conducted from y to z. The corresponding objective is:

min max [ E_{z∼P(z|x)} [ H(y|z) ] + E_{y∼P(y|x)} [ H(z|y) ] ],   (16)

where the predictor H maximizes the prediction objective while the representation extractor minimizes it.

So far, we have all the tractable bounds and alternatives for optimizing the information diversifying objective of ICP. The optimization process is summarized in Alg.
1.

5 Experiments

In our experiments, all the probabilistic feature extractors, task solvers, the predictor and the discriminator are implemented by neural networks. We suppose Q(z), Q(t|r), Q(t|z), Q(t|y) are standard Gaussian distributions and use the reparameterization trick, following VAE [19]. The objectives are differentiable and trained using backpropagation. In the classification task (supervised setting), we use one fully-connected layer as the classifier. In the reconstruction task (self-supervised setting), multiple deconvolution layers are used as the decoder to reconstruct the inputs. The implementation details and the experimental logs are all available at our source code page.

(a) Correlation heatmap of ICP-ALL. (b) Correlation heatmap of ICP.

Figure 2: Heatmaps of the correlation between categories and the dimensions of the representations of VGGNet on CIFAR-10. The horizontal axis denotes the dimensions of the representations, and the vertical axis denotes the categories. Darker color denotes higher correlation.

5.1 Supervised Setting: Classification Tasks

5.1.1 Datasets

CIFAR-10 and CIFAR-100 [21] are used to evaluate the performance of ICP on the image classification task. These datasets contain natural images belonging to 10 and 100 classes, respectively; CIFAR-100 comes with finer labels than CIFAR-10. The raw images are 32×32 pixels, and we normalize them using the channel means and standard deviations. Standard data augmentation by random cropping and mirroring is applied to the training set.

5.1.2 Classification Performance and Ablation Study

We utilize four architectures, VGGNet [34], GoogLeNet [35, 36], ResNet [12], and DenseNet [16], to test the general applicability of ICP and to study the diversified representations learned by ICP.
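Under the Gaussian assumptions described above, the reparameterization trick and the KL term of Eq. 6 have simple closed forms. A minimal sketch, assuming a diagonal-Gaussian encoder (illustrative only, not the released implementation):

```python
import math
import random

def reparameterize(mu, log_var, rng=random):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), keeping the
    sampling step differentiable with respect to (mu, log_var)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL[ N(mu, diag(sigma^2)) || N(0, I) ] summed over
    dimensions, i.e., the tractable upper bound of Eq. 6 when Q(z) is
    a standard Gaussian."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0
                     for m, lv in zip(mu, log_var))
```

For example, an encoder output of mu = 0, log_var = 0 matches the prior exactly and incurs zero KL penalty; any deviation is penalized, which is what compresses the z part of the representation.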
We use the classification results of the original network architectures as our baselines. Deep Variational Information Bottleneck (VIB) [1] and the global version of Deep InfoMax with one additional mutual information maximization term (DIM*) [14] are used as references, in which VIB is optimized by maximizing I(z, t) − βI(z, x), and DIM* is optimized by maximizing I(y, t) + αI(y, x). To make a fair comparison, we expand the representation dimension of both methods to the same size as ICP's (denoted as VIB×2 and DIM*×2). VIB, DIM*, VIB×2 and DIM*×2 each use only one type of representation constraint from ICP, so they can also be regarded as an ablation study of ICP with a single information constraint and without the information diversifying objective.

For further ablation, we optimize ICP without any of the information diversifying and competing constraints (i.e., we optimize Eq. 1), denoted as ICP-ALL. We also optimize ICP with the information diversifying objective but without the information competing objective (i.e., we optimize Eq. 2), denoted as ICP-COM.

The classification results on CIFAR-10 and CIFAR-100 are shown in Tables 1 and 2. We find that VIB, DIM*, VIB×2 and DIM*×2 achieve sub-optimal results due to the limited diversification of their representations. ICP-ALL does not work well because the large model capacity overfits the training set, and ICP-COM fails because of the dominance of one type of representation. These results show that expanding models with a sole constraint, or removing one constraint from the objective, decreases performance. Only ICP generalizes to all of these architectures and reports the best performance. In addition, the results on the different datasets suggest that ICP works better on the finer-labeled dataset (i.e., CIFAR-100).
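The maximization term shared by DIM* and ICP relies on the Eq. 8 discriminator, with negative pairs formed by batch shuffling. A minimal sketch of both pieces (hypothetical helper names; the released code may differ):

```python
import math
import random

def js_discriminator_loss(scores_pos, scores_neg):
    """Negative of the Eq. 8 estimate: the discriminator D outputs a
    probability in (0, 1) for each pair, positives drawn from P(y|x)P(x)
    and negatives from P(y)P(x); minimizing this loss trains D, and the
    resulting estimate lower-bounds the JS divergence."""
    eps = 1e-12  # numerical guard against log(0)
    pos = sum(math.log(s + eps) for s in scores_pos) / len(scores_pos)
    neg = sum(math.log(1.0 - s + eps) for s in scores_neg) / len(scores_neg)
    return -(pos + neg)

def make_negative_pairs(xs, ys, rng=random):
    """Approximate samples from P(y)P(x) by shuffling y within the batch,
    so each x is paired with a representation of (almost surely) another x."""
    shuffled = ys[:]
    rng.shuffle(shuffled)
    return list(zip(xs, shuffled))
```

A discriminator that confidently separates positives from negatives yields a lower loss, i.e., a larger estimated divergence, which the encoder then exploits to increase I(y, x).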
We attribute this success to the diversified representations, which capture more detailed information of the inputs.

(a) dSprites (b) 3D Faces

Figure 3: Qualitative disentanglement results of ICP on (a) dSprites and (b) 3D Faces datasets.

5.1.3 Interpretability of the Diversified Representations

To explain the intuitive idea and the superior results of ICP, we study the learned classification models to explore why ICP works, and provide some insights about the interpretability of the learned representations. In the following, we take VGGNet on CIFAR-10 as an example and visualize the normalized absolute values of the classifier's weights. As shown in Fig. 2(a), the classification dependency is fused in ICP-ALL, which means that combining two representations directly without any constraints does not diversify the representation: the first green bounding box shows that the classification relies on both parts, while the second and third green bounding boxes show that the classification relies more on either the first part or the second part. On the contrary, as shown in Fig. 2(b), the classification dependency can be separated into two parts. Since the mutual information minimization makes one representation part carry more general information of the input while the maximization makes the other part carry more specific information, a small number of dimensions is sufficient for inference in one part (i.e., the left bounding box of Fig. 2(b)), while a large number of dimensions is required in the other (i.e., the right bounding box of Fig. 2(b)).
This suggests that ICP learns diversified representations for classification.

5.2 Self-supervised Setting: Reconstruction

5.2.1 Datasets

We perform quantitative and qualitative disentanglement evaluations on the 2D shapes dataset (dSprites) [24] and the synthetic 3D Faces dataset [27]. The ground truth factors of dSprites are scale (6), rotation (40), posX (32) and posY (32); the ground truth factors of 3D Faces are azimuth (21), elevation (11) and lighting (11), where parentheses contain the number of quantized values for each factor. dSprites and 3D Faces additionally contain 3 types of shapes and 50 identities, respectively, which are treated as noise during evaluation. The images of both datasets are resized to 64×64 pixels to compare with the baseline methods. We also evaluate the reconstruction and manipulation performance on the more challenging CelebA [23] dataset, which contains a large number of celebrity faces; these images are resized to 128×128 pixels instead of 64×64 pixels for more detailed reconstruction.

5.2.2 Quantitative Evaluation

We evaluate the disentanglement performance quantitatively by the Mutual Information Gap (MIG) score [7] on the 2D shapes (dSprites) [24] and 3D Faces [27] datasets. MIG is a classifier-free information-theoretic disentanglement metric and is meaningful for any factorized latent distribution. As shown in Table 3, ICP achieves state-of-the-art performance on this quantitative evaluation of disentanglement. We also conduct the same ablation studies as in the supervised setting.

(a) Smile (b) Goatee (c) Eyeglasses (d) Hair Color

Figure 4: Qualitative disentanglement results of β-VAE and ICP on CelebA.
Each row represents a different seed image used to infer the representation.

From the results of ICP-ALL and ICP-COM, we find that the disentanglement performance decreases without the information diversifying and competing process.

Table 3: MIG scores for disentanglement.

Method       | dSprites [24] | 3D Faces [27]
β-VAE [13]   | 0.22          | 0.54
β-TCVAE [7]  | 0.38          | 0.62
ICP-ALL      | 0.33          | 0.26
ICP-COM      | 0.20          | 0.57
ICP          | 0.48          | 0.73

For the challenging CelebA [23] dataset, we evaluate the reconstruction performance via the average Mean Square Error (MSE) and the Structural Similarity Index (SSIM) [39]. The MSE of ICP is 8.5×10^-3 compared with 9.2×10^-3 for β-VAE [13], and the SSIM of ICP is 0.62 compared with 0.60 for β-VAE [13], which shows that ICP retains more information of the input for reconstruction.

5.2.3 Qualitative Evaluation

For qualitative evaluation, we conduct latent space traversals by traversing a single dimension of the learned representation over the range [-3, 3] while keeping the other dimensions fixed. We manually pick the dimensions that have semantic meaning related to human concepts from the reconstruction results. The qualitative disentanglement results are shown in Figs. 3 and 4. Many fine-grained semantic attributes, such as rotation on the dSprites dataset, face width on the 3D Faces dataset and goatee on the CelebA dataset, are disentangled clearly and in detail by ICP.

6 Conclusion

We proposed a new approach named Information Competing Process (ICP) for learning diversified representations. To enrich the information carried by representations, ICP separates a representation into two parts with different mutual information constraints, and prevents both parts from knowing what each other learned for the downstream task. The rival representations are then combined to accomplish the downstream task synergistically.
Experiments demonstrated the great potential of ICP in both supervised and self-supervised settings. The performance gain stems from ICP's ability to learn diversified representations, which provides fresh insights into the representation learning problem.

Acknowledgments

This work is supported by the National Key R&D Program (No.2017YFC0113000 and No.2016YFB1001503), Nature Science Foundation of China (No.U1705262, No.61772443, No.61802324, No.61572410 and No.61702136), and Nature Science Foundation of Fujian Province, China (No.2017J01125 and No.2018J01106).

References

[1] Alexander A Alemi, Ian Fischer, Joshua V Dillon, and Kevin Murphy. Deep variational information bottleneck. In International Conference on Learning Representations, 2017.

[2] Mohamed Ishmael Belghazi, Aristide Baratin, Sai Rajeswar, Sherjil Ozair, Yoshua Bengio, Aaron Courville, and R Devon Hjelm. MINE: Mutual information neural estimation. In International Conference on Machine Learning, 2018.

[3] Anthony J Bell and Terrence J Sejnowski. An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 1995.

[4] Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013.

[5] Christopher P Burgess, Irina Higgins, Arka Pal, Loic Matthey, Nick Watters, Guillaume Desjardins, and Alexander Lerchner. Understanding disentangling in beta-VAE. In Advances in Neural Information Processing Systems, 2018.

[6] Fuhai Chen, Rongrong Ji, Jiayi Ji, Xiaoshuai Sun, Xuri Ge, Baochang Zhang, Yongjian Wu, Feiyue Huang, and Yan Wang. Variational structured semantic inference for diverse image captioning.
In Advances in Neural Information Processing Systems, 2019.

[7] Tian Qi Chen, Xuechen Li, Roger B Grosse, and David K Duvenaud. Isolating sources of disentanglement in variational autoencoders. In Advances in Neural Information Processing Systems, 2018.

[8] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, 2016.

[9] Klaus Greff, Antti Rasmus, Mathias Berglund, Tele Hao, Harri Valpola, and Jürgen Schmidhuber. Tagger: Deep unsupervised perceptual grouping. In Advances in Neural Information Processing Systems, 2016.

[10] Klaus Greff, Sjoerd van Steenkiste, and Jürgen Schmidhuber. Neural expectation maximization. In Advances in Neural Information Processing Systems, 2017.

[11] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, 2017.

[12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.

[13] Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, 2017.

[14] R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Adam Trischler, and Yoshua Bengio. Learning deep representations by mutual information estimation and maximization. In International Conference on Learning Representations, 2019.

[15] Jie Hu, Rongrong Ji, Hong Liu, Shengchuan Zhang, Cheng Deng, and Qi Tian. Towards visual feature translation.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.

[16] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.

[17] Aapo Hyvärinen and Erkki Oja. Independent component analysis: Algorithms and applications. Neural Networks, 2000.

[18] Hyunjik Kim and Andriy Mnih. Disentangling by factorising. In International Conference on Machine Learning, 2018.

[19] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. In International Conference on Learning Representations, 2013.

[20] Alexander Kolesnikov, Xiaohua Zhai, and Lucas Beyer. Revisiting self-supervised visual representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.

[21] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.

[22] Ralph Linsker. Self-organization in a perceptual network. Computer, 1988.

[23] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision, 2015.

[24] Loic Matthey, Irina Higgins, Demis Hassabis, and Alexander Lerchner. dSprites: Disentanglement testing sprites dataset. https://github.com/deepmind/dsprites-dataset/, 2017.

[25] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems, 2016.

[26] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.

[27] Pascal Paysan, Reinhard Knothe, Brian Amberg, Sami Romdhani, and Thomas Vetter.
A 3D face model for pose and illumination invariant face recognition. In IEEE International Conference on Advanced Video and Signal Based Surveillance, 2009.

[28] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.

[29] Anurag Ranjan, Varun Jampani, Kihwan Kim, Deqing Sun, Jonas Wulff, and Michael J Black. Adversarial collaboration: Joint unsupervised learning of depth, camera motion, optical flow and motion segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.

[30] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, 2014.

[31] Andrew Michael Saxe, Yamini Bansal, Joel Dapello, Madhu Advani, Artemy Kolchinsky, Brendan Daniel Tracey, and David Daniel Cox. On the information bottleneck theory of deep learning. In International Conference on Learning Representations, 2018.

[32] Jürgen Schmidhuber. Learning factorial codes by predictability minimization. Neural Computation, 1992.

[33] Ravid Shwartz-Ziv and Naftali Tishby. Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810, 2017.

[34] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[35] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.

[36] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.

[37] Naftali Tishby and Noga Zaslavsky. Deep learning and the information bottleneck principle. In IEEE Information Theory Workshop, 2015.

[38] Naftali Tishby, Fernando C Pereira, and William Bialek. The information bottleneck method. arXiv preprint physics/0004057, 2000.

[39] Zhou Wang, Alan C Bovik, Hamid R Sheikh, Eero P Simoncelli, et al. Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing, 2004.

[40] Xiaosong Zhang, Fang Wan, Chang Liu, Rongrong Ji, and Qixiang Ye. FreeAnchor: Learning to match anchors for visual object detection. In Advances in Neural Information Processing Systems, 2019.