{"title": "Removing the Feature Correlation Effect of Multiplicative Noise", "book": "Advances in Neural Information Processing Systems", "page_first": 627, "page_last": 636, "abstract": "Multiplicative noise, including dropout, is widely used to regularize deep neural networks (DNNs), and is shown to be effective in a wide range of architectures and tasks. From an information perspective, we consider injecting multiplicative noise into a DNN as training the network to solve the task with noisy information pathways, which leads to the observation that multiplicative noise tends to increase the correlation between features, so as to increase the signal-to-noise ratio of information pathways. However, high feature correlation is undesirable, as it increases redundancy in representations. In this work, we propose non-correlating multiplicative noise (NCMN), which exploits batch normalization to remove the correlation effect in a simple yet effective way. We show that NCMN significantly improves the performance of standard multiplicative noise on image classification tasks, providing a better alternative to dropout for batch-normalized networks. Additionally, we present a unified view of NCMN and shake-shake regularization, which explains the performance gain of the latter.", "full_text": "Removing the Feature Correlation Effect of\n\nMultiplicative Noise\n\nZijun Zhang\n\nUniversity of Calgary\n\nzijun.zhang@ucalgary.ca\n\nyining.zhang1@ucalgary.ca\n\nYining Zhang\n\nUniversity of Calgary\n\nZongpeng Li\n\nWuhan University\n\nzongpeng@whu.edu.cn\n\nAbstract\n\nMultiplicative noise, including dropout, is widely used to regularize deep neural\nnetworks (DNNs), and is shown to be effective in a wide range of architectures\nand tasks. 
From an information perspective, we consider injecting multiplicative noise into a DNN as training the network to solve the task with noisy information pathways, which leads to the observation that multiplicative noise tends to increase the correlation between features, so as to increase the signal-to-noise ratio of information pathways. However, high feature correlation is undesirable, as it increases redundancy in representations. In this work, we propose non-correlating multiplicative noise (NCMN), which exploits batch normalization to remove the correlation effect in a simple yet effective way. We show that NCMN significantly improves the performance of standard multiplicative noise on image classification tasks, providing a better alternative to dropout for batch-normalized networks. Additionally, we present a unified view of NCMN and shake-shake regularization, which explains the performance gain of the latter.

1 Introduction

State-of-the-art deep neural networks are often over-parameterized to deliver more expressive power. For instance, a typical convolutional neural network (CNN) for image classification can consist of tens to hundreds of layers, and millions to tens of millions of learnable parameters [1, 2]. To combat overfitting, a variety of regularization techniques have been developed. Examples include dropout [3], DropConnect [4], and the recently proposed shake-shake regularization [5]. Among them, dropout is arguably the most popular, due to its simplicity and effectiveness in a wide range of architectures and tasks, e.g., convolutional neural networks (CNNs) for image recognition [6], and recurrent neural networks (RNNs) for natural language processing (NLP) [7].

Nonetheless, we observe a side effect of dropout that has long been ignored.
That is, it tends to increase the correlation between the features it is applied to, which reduces the efficiency of representations. It is also known that decorrelated features can lead to better generalization [8, 9, 10, 11]. Thus, this side effect may counteract, to some extent, the regularization effect of dropout.

In this work, we demonstrate the feature correlation effect of dropout, as well as other types of multiplicative noise, through analysis and experiments. Our analysis is based on a simple assumption: in order to reduce the interference of noise, the training process will try to maximize the signal-to-noise ratio (SNR) of representations. We show that this tendency to increase the SNR increases feature correlation as a result. To remove the correlation effect, it is possible to resort to feature decorrelation techniques. However, existing techniques penalize high correlation explicitly; they either introduce a substantial computational overhead [10], or yield marginal improvements [11]. Moreover, these techniques require extra hyperparameters to control the strength of the penalty, which further hinders their practical application.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

We propose a simple yet effective approach to solve this problem. Specifically, we first decompose noisy features into the sum of two components, a signal component and a noise component, and then truncate the gradient through the latter, i.e., treat it as a constant. However, naively modifying the gradient would encourage the magnitude of features to grow in order to increase the SNR, causing optimization difficulties.
We solve this problem by combining the aforementioned technique with batch normalization [12], which effectively counteracts the tendency of increasing feature magnitude. The resulting method, non-correlating multiplicative noise (NCMN), is able to reduce the correlation between features, reaching a level even lower than that without multiplicative noise. More importantly, it significantly improves the performance of standard multiplicative noise on image classification tasks.

As another contribution of this work, we further investigate the connection between NCMN and shake-shake regularization. Despite its impressive performance, how shake-shake works remains elusive. We show that the noise produced by shake-shake has a similar form to that of an NCMN variant, and that both NCMN and shake-shake achieve superior generalization performance by avoiding feature correlation.

The rest of this paper is organized as follows. In Section 2, we define a general form of multiplicative noise, and identify the feature correlation effect. In Section 3, we first propose NCMN, which we show through analysis is able to remove the feature correlation effect; then we develop multiple variants of NCMN, and provide a unified view of NCMN and shake-shake regularization. In Section 4, we provide empirical evidence of our analysis, and evaluate the performance of the proposed methods.[1]

2 Motivation

2.1 Multiplicative Noise

Let $x_i$ be the activation of hidden unit $i$. We consider multiplicative noise, $u_i$, which is applied to the activations of layer $l$ as

$$\tilde{x}_i = u_i x_i, \quad \forall i \in H_l. \tag{1}$$

Here, $H_l$ represents the set of hidden units in layer $l$. For simplicity, we restrict our analysis to fully-connected layers without batch normalization, and will later extend it to batch-normalized layers and convolutional layers. When used for regularization purposes, multiplicative noise is typically applied at training time, and removed at test time.
Consequently, the noise should satisfy $\mathbb{E}[u_i] = 1$, such that $\mathbb{E}[\tilde{x}_i] = x_i$.

The noise mask, $u_i$, can be sampled from various distributions, as exemplified by Bernoulli, Gaussian, and uniform distributions. We take dropout and DropConnect as examples. For dropout, let $m_i$ be the dropout mask sampled from a Bernoulli distribution, $\mathrm{Bern}(p)$; then the equivalent multiplicative noise is given by $u_i = m_i / p$. DropConnect is slightly different from dropout, in that the dropout mask is independently sampled for each weight instead of each activation. Thus, we denote the dropout mask by $m_{ij}$, where $j \in H_{l+1}$; then we have $u_{ij} = m_{ij} / p$ and

$$\tilde{x}_{ij} = u_{ij} x_i, \quad \forall i \in H_l, \; j \in H_{l+1}. \tag{2}$$

Comparing Eq. (2) with Eq. (1), we observe that applying multiplicative noise to weights is equivalent to applying it to activations, except that the noise mask is independently sampled for each hidden unit in the upper layer. Therefore, without loss of generality, we consider multiplicative noise of the form in Eq. (1) in the following discussion.

Compared to other types of noise, such as additive isotropic noise, multiplicative noise can adapt the scale of noise to the scale of features, which may contribute to its empirical success.

2.2 The Feature Correlation Effect

As a regularization technique, dropout, and other multiplicative noise, improve generalization by preventing feature co-adaptation [3]. From an information perspective, injecting noise into a neural network can be seen as training the model to solve the task with noisy information pathways.
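As a quick sanity check of the unbiasedness condition $\mathbb{E}[u_i] = 1$ from Section 2.1, the following numpy sketch (our illustration, not the paper's code) implements dropout as multiplicative noise with keep probability `p`:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.8                       # keep probability
x = rng.normal(size=100_000)  # activations of one hidden unit over samples

# Dropout as multiplicative noise: u = m / p with m ~ Bern(p),
# so that E[u] = 1 and the noisy activations are unbiased.
m = rng.random(x.shape) < p
u = m / p
x_tilde = u * x

assert abs(u.mean() - 1.0) < 1e-2          # E[u] ~ 1
assert abs(x_tilde.mean() - x.mean()) < 1e-2  # E[u * x] ~ E[x]
```

A Gaussian or uniform noise mask with mean 1 can be substituted for `u` without changing the unbiasedness property.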
To better solve the task, a simple and natural strategy that can be learned is to increase the signal-to-noise ratio (SNR) of the information pathways.

Concretely, the noisy activations of layer $l$ are aggregated by a weighted sum to form the pre-activations (without biases) of layer $l+1$ as

$$z_j = \sum_{i \in H_l} w_{ij} \tilde{x}_i, \quad \forall j \in H_{l+1}, \tag{3}$$

where $w_{ij}$ is the weight between unit $i$ in layer $l$ and unit $j$ in layer $l+1$. Although we cannot increase the SNR of $\tilde{x}_i$ due to the multiplicative nature of the noise, it is possible to increase the SNR of $z_j$ instead. In the following, we omit the range of summation when it is over $H_l$.

We now focus on the pre-activation, $z_j$, of an arbitrary unit in layer $l+1$, and define its signal and noise components, respectively, as

$$z_j^s = \sum_i w_{ij} x_i, \quad \text{and} \quad z_j^n = \sum_i w_{ij} v_i x_i, \tag{4}$$

where $v_i = u_i - 1$, such that $\mathbb{E}[v_i] = 0$, and $z_j = z_j^s + z_j^n$. Then we can model the tendency of increasing the SNR of information pathways by the following implicit objective function:

$$\text{maximize} \quad \mathrm{SNR}(z_j) = \frac{\mathbb{E}\big[(z_j^s - \mathbb{E}[z_j^s])^2\big]}{\mathbb{E}\big[(z_j^n)^2\big]}. \tag{5}$$

Here, the expectations are taken with respect to both $x_i$ and $v_i$, $\forall i \in H_l$. Note that in Eq. (5), we subtract the constant component, $\mathbb{E}[z_j^s]$, from the signal, since it does not capture the variation of input samples. Let $\sigma^2 = \mathrm{Var}[v_i], \forall i \in H_l$; we have

$$\mathrm{SNR}(z_j) = \frac{1}{\sigma^2}\left[1 + \frac{\mathbb{E}\big[\sum_i \sum_{i' \neq i} (w_{ij} x_i)(w_{i'j} x_{i'})\big] - \mathbb{E}[z_j^s]^2}{\mathbb{E}\big[\sum_i (w_{ij} x_i)^2\big]}\right]. \tag{6}$$

Thus, the objective function in Eq. (5) is equivalent to the following:

$$\text{maximize} \quad \frac{\mathbb{E}\big[\sum_i \sum_{i' \neq i} (w_{ij} x_i)(w_{i'j} x_{i'})\big]}{\mathbb{E}\big[\sum_i (w_{ij} x_i)^2\big]} - \frac{\mathbb{E}[z_j^s]^2}{\mathbb{E}\big[\sum_i (w_{ij} x_i)^2\big]}. \tag{7}$$

Intuitively, the first term in Eq. (7) tries to maximize the correlation between each pair of $w_{ij} x_i$ and $w_{i'j} x_{i'}$, where $i \neq i'$. Although it is not the same as the commonly used Pearson correlation coefficient, it can be regarded as a generalization of the latter to multiple variables. Since $w_{ij}$ can be either positive or negative, maximizing the correlations between the $w_{ij} x_i$'s essentially increases the magnitudes of the correlations between the $x_i$'s, hence causing the feature correlation effect. The second term in Eq. (7) penalizes non-zero values of $\mathbb{E}[z_j^s]$, and does not affect feature correlations.

For batch-normalized layers, if batch normalization is applied to $z_j$ (as a common practice), the numerator and denominator of Eq. (5) will be divided by the same factor, $\sqrt{\mathrm{Var}[z_j]}$. Therefore, Eq. (5) remains the same, and the analysis still holds.

For convolutional layers, one should consider $H_{l+1}$ as the set of convolutional kernels or feature maps in layer $l+1$, and $H_l$ as the set of input activations at each spatial location. Accordingly, the inputs at different spatial locations are considered as different input samples. Since adjacent activations in the same feature map are often highly correlated, sharing the same noise mask across different spatial locations is shown to be more effective [13].

[1] Code is available at https://github.com/zj10/NCMN.
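To make the correlation effect tangible, here is a small numerical check (our illustration, using two unit-variance features and unit weights) that the SNR defined in Eq. (5) grows with the correlation between features:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2 = 0.25  # noise variance Var[v_i]

def snr(rho, n=200_000):
    # Two zero-mean, unit-variance features with correlation rho.
    cov = [[1.0, rho], [rho, 1.0]]
    x = rng.multivariate_normal([0.0, 0.0], cov, size=n)
    # Independent zero-mean uniform noise v_i with variance sigma2.
    a = np.sqrt(3 * sigma2)
    v = rng.uniform(-a, a, size=x.shape)
    z_s = x.sum(axis=1)        # signal component (all w_ij = 1)
    z_n = (v * x).sum(axis=1)  # noise component
    return ((z_s - z_s.mean()) ** 2).mean() / (z_n ** 2).mean()

# Analytically, SNR = (1 + rho) / sigma2 for this toy setup, so raising
# the feature correlation rho raises the SNR of z_j:
assert snr(0.0) < snr(0.5) < snr(0.9)
```

In other words, with the noise power fixed, the only way the network can push the SNR up is to make the summands of $z_j^s$ more correlated, which is exactly the effect Eq. (7) describes.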
In this setting, the feature correlation effect tends to be more prominent between activations in different feature maps than between activations in the same feature map.

3 Methods

3.1 Non-Correlating Multiplicative Noise

A high correlation between features increases redundancy in neural representations, and can thus reduce the expressive power of neural networks. To remove this effect, one can directly penalize the correlation between features as part of the objective function [10], which, however, introduces a substantial computational overhead. Alternatively, one can penalize the correlation between the weight vectors (or convolutional kernels) of different hidden units [11], which is less computationally expensive, but yields only marginal improvements. Moreover, both approaches require manually-tuned penalty strength. A more desirable approach is to simply avoid feature correlation in the first place, rather than counteracting it with other regularization techniques.

From Eq. (5) we observe that, if we consider the denominator, $\mathbb{E}[(z_j^n)^2]$, as a constant, i.e., ignore its gradient during training, the objective function is equivalent to

$$\text{maximize} \quad \mathbb{E}\big[(z_j^s - \mathbb{E}[z_j^s])^2\big]. \tag{8}$$

Eq. (8) implies that if we ignore the gradient of the noise component, $z_j^n$, the tendency of increasing the SNR of $z_j$ will attempt to increase the variance of the signal component, $z_j^s$, instead of increasing the feature correlation of the lower layer. However, we find in practice that such a modification to the gradient causes optimization difficulties, preventing the training process from converging.
Fortunately, by using batch normalization, the remedy for this problem is surprisingly simple.

Concretely, we apply batch normalization to $z_j$ as

$$\hat{z}_j = \mathrm{BN}(z_j; z_j^s) = \frac{z_j - \mathbb{E}[z_j^s]}{\sqrt{\mathrm{Var}[z_j^s]}}. \tag{9}$$

We neglect the small difference between the true mean/variance and the sample mean/variance, and adopt the same notation for simplicity. Note that in Eq. (9), $z_j$ is normalized using the statistics of $z_j^s$, which is slightly different from standard batch normalization. We now consider the SNR of the new pre-activation, $\hat{z}_j$. The signal and noise components of $\hat{z}_j$ are respectively

$$\hat{z}_j^s = \frac{z_j^s - \mathbb{E}[z_j^s]}{\sqrt{\mathrm{Var}[z_j^s]}}, \quad \text{and} \quad \hat{z}_j^n = \hat{z}_j - \hat{z}_j^s = \frac{z_j^n}{\sqrt{\mathrm{Var}[z_j^s]}}. \tag{10}$$

For clarity, we define an identity function, $\mathrm{AsConst}(\cdot)$, meaning that the argument of the function is considered as a constant during the backpropagation phase, or in other words, its gradient is set to zero by the function. We then substitute $\hat{z}_j$ with

$$\hat{z}_j' = \hat{z}_j^s + \mathrm{AsConst}(\hat{z}_j^n) = \mathrm{BN}(z_j^s) + \mathrm{AsConst}\big(\mathrm{BN}(z_j; z_j^s) - \mathrm{BN}(z_j^s)\big), \tag{11}$$

such that the noise component is considered as a constant. Therefore, maximizing $\mathrm{SNR}(\hat{z}_j')$ is equivalent to

$$\text{maximize} \quad \mathbb{E}\big[(\hat{z}_j^s)^2\big]. \tag{12}$$

Due to the use of batch normalization, Eq. (12) is a constant with respect to each sample of $z_j^s$, and thus we have $\partial\,\mathbb{E}\big[(\hat{z}_j^s)^2\big] / \partial z_j^s(m) = 0, \forall m \in B$, where $B$ denotes a set of mini-batch samples, and $z_j^s(m)$ denotes the value of $z_j^s$ corresponding to sample $m$. Therefore, we can now remove the feature correlation effect of multiplicative noise without causing optimization difficulties.
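The gradient truncation in Eq. (11) can be sketched in a few lines of numpy. The snippet below (our illustration; for brevity it ignores the gradient terms through the batch statistics, and all names are our own) contrasts the truncated gradient, which flows only through the signal component, with the naive gradient that also flows through the noise path:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 256, 8
x = rng.normal(size=(n, d))              # lower-layer activations
w = rng.normal(size=d)                   # weights of one upper-layer unit
u = rng.uniform(0.5, 1.5, size=(n, d))   # multiplicative noise, E[u] = 1

# Forward pass of Eq. (9): normalize z using the statistics of z^s.
z_s = x @ w                      # signal component
z = (u * x) @ w                  # noisy pre-activation
mu, sd = z_s.mean(), z_s.std()
zhat_s = (z_s - mu) / sd         # batch-normalized signal
zhat_n = (z - mu) / sd - zhat_s  # noise component, to be wrapped in AsConst(.)
assert np.allclose(zhat_s + zhat_n, (z - mu) / sd)

# Backward pass for a dummy loss L = sum(zhat'): with the noise component
# held constant, dL/dw flows only through the signal path.
g = np.ones(n)                    # dL/dzhat' per sample
grad_w_ncmn = (g / sd) @ x        # gradient through zhat_s only (AsConst)
grad_w_naive = (g / sd) @ (u * x) # gradient if the noise path were kept

# The two differ exactly by the truncated noise path.
assert not np.allclose(grad_w_ncmn, grad_w_naive)
assert np.allclose(grad_w_ncmn + (g / sd) @ ((u - 1) * x), grad_w_naive)
```

In an autodiff framework, the same effect is obtained by detaching (stop-gradient) the noise term of Eq. (11).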
We refer to this approach as non-correlating multiplicative noise (NCMN).

We also note that the non-standard batch normalization used in Eq. (9) is not necessary in practice. To take advantage of existing optimized implementations of batch normalization, we can modify Eq. (11) as follows:

$$\hat{z}_j' = \mathrm{BN}(z_j^s) + \mathrm{AsConst}\big(\mathrm{BN}(z_j) - \mathrm{BN}(z_j^s)\big). \tag{13}$$

In this case, to keep the forward pass consistent between training and testing, the running mean and variance should be calculated based on $z_j$, rather than $z_j^s$.

Interestingly, Eq. (11) and Eq. (13) can be seen as adding a noise component to batch-normalized $z_j^s$, where the noise is generated in a special way and passed through the $\mathrm{AsConst}(\cdot)$ function. However, the analysis in this section does not depend on the particular choice of the noise, as long as it is considered as a constant. This observation leads to a unified view of multiple variants of NCMN and shake-shake regularization, as discussed in the following section.

3.2 A Unified View of NCMN and Shake-shake Regularization

In Eq. (11), the noise component, $\hat{z}_j^n$, is generated indirectly from the multiplicative noise applied to the lower layer activations. Assuming the independence of the $v_i$'s, we have $\mathbb{E}[\hat{z}_j^n] = 0$, and

$$\mathrm{Var}[\hat{z}_j^n] = \frac{\sigma^2}{\mathrm{Var}[z_j^s]} \sum_i (w_{ij} x_i)^2 = \frac{\sigma^2}{\mathrm{Var}[z_j^s]} \left[(z_j^s)^2 - \sum_i \sum_{i' \neq i} (w_{ij} x_i)(w_{i'j} x_{i'})\right], \tag{14}$$

which implies that $\hat{z}_j^n$ can be approximated by a noise multiplied by $z_j^s$, if we neglect the correlation term in Eq. (14), and only the mean and variance of the noise are of interest. We can further simplify the approximation by applying multiplicative noise directly to $\hat{z}_j^s$ as

$$\hat{z}_j' = \hat{z}_j^s + \mathrm{AsConst}(v_j \hat{z}_j^s). \tag{15}$$

The advantage of this variant is that, while Eq. (11) and Eq. (13) require an extra forward pass for each layer, Eq. (15) introduces no computational overhead, and is straightforward to implement. A similar idea was explored for fast dropout training [14].

To indicate the number of layers involved in noise generation, we refer to this variant (Eq. (15)) as NCMN-0, and to the original form (Eq. (13)) as NCMN-1. We next generalize NCMN from NCMN-1 to NCMN-2, and demonstrate its connection to shake-shake regularization [5]. For convenience, we define the following two functions to indicate the pre-activations and activations of a batch-normalized layer:

$$\Psi^{l+1}(x^l) = \mathrm{BN}\big((x^l)^T W^{l+1}\big), \tag{16}$$

where $x^l = [x_i]$, $W^{l+1} = [w_{ij}]$, $i \in H_l$, $j \in H_{l+1}$, and

$$\Phi^{l+1}(x^l) = \phi\big(\gamma^{l+1} \odot \Psi^{l+1}(x^l) + \beta^{l+1}\big), \tag{17}$$

where $\gamma^{l+1}$ and $\beta^{l+1}$ denote, respectively, the scaling and shifting vectors of batch normalization, $\odot$ denotes element-wise multiplication, and $\phi(\cdot)$ denotes the activation function.
Let $\Psi_i^l(\cdot)$ be an element of $\Psi^l(\cdot)$; the noise component of NCMN-1 is then given by

$$\hat{z}_j^n = \Psi_j^{l+1}\big(u^l \odot x^l\big) - \Psi_j^{l+1}\big(x^l\big), \tag{18}$$

where $u^l = [u_i]$, $i \in H_l$, and $j \in H_{l+1}$.

We then define a natural generalization from NCMN-1 to NCMN-2 as

$$\hat{z}_k^s = \Psi_k^{l+2}\big(\Phi^{l+1}(x^l)\big), \quad \text{and} \quad \hat{z}_k^n = \Psi_k^{l+2}\big(u^{l+1} \odot \Phi^{l+1}(u^l \odot x^l)\big) - \hat{z}_k^s, \tag{19}$$

and

$$\hat{z}_k' = \hat{z}_k^s + \mathrm{AsConst}(\hat{z}_k^n), \tag{20}$$

where $k \in H_{l+2}$. Different from NCMN-0 and NCMN-1, which can be applied to every layer, NCMN-2 can only be applied once every two layers. For residual networks (ResNets) in particular, NCMN-2 should be aligned with residual blocks, such that $\gamma^{l+2} \hat{z}_k' + \beta^{l+2}$ or $\phi\big(\gamma^{l+2} \hat{z}_k' + \beta^{l+2}\big)$ is the residual of a residual block, depending on which variant is used [15].

Interestingly, we can formulate shake-shake regularization in a similar way to NCMN-2. Shake-shake regularization is a regularization technique developed specifically for ResNets with two residual branches, as opposed to only one in standard ResNets. It works by averaging the outputs of the two residual branches with random weights. For the forward pass at training time, one of the weights, $\alpha_1$, is sampled uniformly from $[0, 1]$, while the other one is set to $\alpha_2 = 1 - \alpha_1$. For the backward pass, the weights are either sampled in the same way as the forward pass, or set to the same value, i.e., $\alpha_1' = \alpha_2' = 1/2$. We first consider the latter case.
Let $p \in \{1, 2\}$ denote the index of the residual branches; then we have the signal component as

$$\hat{z}_k^s = (\hat{z}_{1,k} + \hat{z}_{2,k})/2, \quad \text{where} \quad \hat{z}_{p,k} = \Psi_{p,k}^{l+2}\big(\Phi_p^{l+1}(x^l)\big), \tag{21}$$

and the noise component as

$$\hat{z}_k^n = v\,(\hat{z}_{1,k} - \hat{z}_{2,k}), \tag{22}$$

where $v = \alpha_1 - 1/2$ is uniformly sampled from $[-1/2, 1/2]$. Accordingly, the noisy pre-activation is also given by Eq. (20).

Similar to Eq. (5), we can define the SNR of $\hat{z}_k'$ as

$$\mathrm{SNR}(\hat{z}_k') = \frac{\mathbb{E}\big[(\hat{z}_k^s)^2\big]}{\mathbb{E}\big[(\hat{z}_k^n)^2\big]} = 3\left[1 + \frac{4\,\mathbb{E}[\hat{z}_{1,k} \hat{z}_{2,k}]}{\mathbb{E}\big[(\hat{z}_{1,k} - \hat{z}_{2,k})^2\big]}\right]. \tag{23}$$

Apparently, maximizing $\mathrm{SNR}(\hat{z}_k')$ does not affect the correlation between features from the same branch. However, if we keep the gradient of the noise component, or, equivalently, let $\alpha_1' = \alpha_1$ and $\alpha_2' = \alpha_2$, then maximizing $\mathrm{SNR}(\hat{z}_k')$ will encourage large $\mathbb{E}[\hat{z}_{1,k} \hat{z}_{2,k}]$ and small $\mathbb{E}\big[(\hat{z}_{1,k} - \hat{z}_{2,k})^2\big]$, leading to highly correlated branches. On the other hand, if we consider the noise component as a constant, then only large $\mathbb{E}[\hat{z}_{1,k} \hat{z}_{2,k}]$ will be encouraged, which results in much weaker correlation, and hence the better generalization performance observed in practice. For a similar reason to that of NCMN, the batch normalization layers before Eq. (20) are crucial for avoiding optimization difficulties.

It is worth noting that setting $\alpha_1' = \alpha_2' = 1/2$ for the backward pass leads to exactly the same gradient for $\hat{z}_{1,k}$ and $\hat{z}_{2,k}$, which can also increase the correlation between the two branches. Thus, sampling $\alpha_1'$ randomly further breaks the symmetry between the two branches, which may explain its slightly better performance on large networks.

While the noise components of NCMN-2 and shake-shake are generated in different ways, they are injected into the network in the same way (Eq. (20)). Therefore, we expect similar regularization effects from NCMN-2 and shake-shake in practice. However, while adjusting the regularization strength is difficult for shake-shake, it can be easily done for NCMN by tuning the variance of the multiplicative noise. Moreover, NCMN is not restricted to ResNets with multiple residual branches, and can be applied to various architectures.

4 Experiments

In our preliminary experiments, we found that different choices of noise distribution, including Bernoulli, Gaussian, and uniform distributions, lead to similar performance, which is consistent with the experimental results for Gaussian dropout [3]. We use uniform noise with variable standard deviation in the following experiments. For fair comparison, in contrast to previous work [5, 16], we tune the hyperparameters (e.g., learning rate, L2 weight decay, noise standard deviation) separately for different types of noise, as well as for different datasets. We use ND-Adam [17] for optimization, which is a variant of Adam [18] that has similar generalization performance to SGD, but is easier to tune due to its decoupled learning rate and weight decay hyperparameters.
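Before moving on to the results, note that the shake-shake decomposition into signal and noise components (Eqs. (21) and (22)) is a pointwise algebraic identity, which the following numpy snippet (our illustration) verifies:

```python
import numpy as np

rng = np.random.default_rng(0)
z1 = rng.normal(size=1000)  # outputs of the two residual branches
z2 = rng.normal(size=1000)

# Shake-shake forward pass: random convex combination per sample.
alpha = rng.uniform(0.0, 1.0, size=z1.shape)
z = alpha * z1 + (1.0 - alpha) * z2

# Decomposition of Eqs. (21)-(22): signal = branch average,
# noise = v * (z1 - z2) with v = alpha - 1/2 uniform on [-1/2, 1/2].
z_signal = 0.5 * (z1 + z2)
v = alpha - 0.5
assert np.allclose(z, z_signal + v * (z1 - z2))
```

Treating the `v * (z1 - z2)` term as a constant during backpropagation is what distinguishes the NCMN-style view of shake-shake from keeping the full gradient.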
See the supplementary material for hyperparameter settings, and practical guidelines for tuning them.

We first empirically verify the feature correlation effect of multiplicative noise, and the decorrelation effect of NCMN. To avoid possible interactions with skip connections, we use plain CNNs rather than more sophisticated architectures for this purpose. Due to the lack of skip connections, we use a relatively shallow network in case of optimization difficulties. Specifically, we construct a CNN by removing the skip connections from a wide residual network (WRN) [16], WRN-16-10, which is a 16-layer ResNet with 10 times more filters per layer than the original. Accordingly, we refer to the modified network as CNN-16-10. We train the network with different types of noise on the CIFAR-10 and CIFAR-100 datasets [19]. For each convolutional layer except the first one, we calculate the correlation between each pair of feature maps after batch normalization, and take the average of their absolute values. The results are grouped by the size of feature maps, and are shown in Fig. 1. In addition, the corresponding test error rates (average of 3 or more runs) are shown in Table 1.

Figure 1: Feature correlations of CNN-16-10 networks trained with different types of noise ((a) CIFAR-10; (b) CIFAR-100). None refers to the baseline without noise injection, and MN refers to standard multiplicative noise. The error bars represent the standard deviation across different layers in a single run, which varies little across different runs.

Compared to the baseline, standard multiplicative noise exhibits slightly higher feature correlations for both CIFAR-10 and CIFAR-100. By modifying the gradient as in Eq. (13), which results in NCMN-1, the feature correlations are significantly reduced, as predicted by our analysis. Surprisingly, for all sizes of feature maps, the feature correlations of NCMN-1 reach a level even lower than that of the baseline. This intriguing result may indicate that the regularization effect of multiplicative noise strongly encourages decorrelation between features, which, however, is counteracted by the feature correlation effect. As a result, NCMN-1 significantly improves the performance of standard multiplicative noise, as shown in Table 1. As discussed in Section 3.2, NCMN-0 can be considered as an approximation to NCMN-1, which is consistent with the fact that the feature correlations and test errors of NCMN-0 are both close to those of NCMN-1.
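The correlation metric used in these experiments (mean absolute Pearson correlation over all pairs of feature maps) can be computed directly with numpy; the function name and array shapes below are our own choices:

```python
import numpy as np

def mean_abs_correlation(feats):
    """Mean absolute Pearson correlation over all pairs of feature maps.

    feats: array of shape (channels, samples), where each row collects a
    feature map's (batch-normalized) activations over samples/locations.
    """
    c = np.corrcoef(feats)                       # (channels, channels)
    off = c[~np.eye(c.shape[0], dtype=bool)]     # off-diagonal entries
    return np.abs(off).mean()

rng = np.random.default_rng(0)
# Independent features -> correlation near 0; a shared component -> higher.
indep = rng.normal(size=(32, 5000))
shared = indep + 2.0 * rng.normal(size=(1, 5000))
assert mean_abs_correlation(indep) < 0.1
assert mean_abs_correlation(indep) < mean_abs_correlation(shared)
```

For convolutional layers, flattening each feature map over both the batch and spatial dimensions before calling this function matches the convention of treating spatial locations as separate samples.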
On the other hand, since NCMN-2 applies noise more sparsely (once every two layers), it exhibits higher correlations and slightly worse generalization performance than NCMN-0 and NCMN-1.

Table 1: CIFAR-10/100 error rates (%) of CNN-16-10 networks trained with different types of noise.

Noise type | CIFAR-10    | CIFAR-100
None       | 4.05 ± 0.05 | 19.22 ± 0.05
MN         | 3.76 ± 0.00 | 18.08 ± 0.03
NCMN-0     | 3.51 ± 0.07 | 17.37 ± 0.05
NCMN-1     | 3.41 ± 0.07 | 17.55 ± 0.06
NCMN-2     | 3.44 ± 0.03 | 18.16 ± 0.04

Table 2: CIFAR-10/100 error rates (%) of WRN-22-7.5 networks trained with different types of noise.

Noise type | CIFAR-10    | CIFAR-100
None       | 3.68 ± 0.02 | 19.29 ± 0.07
MN         | 3.59 ± 0.06 | 18.60 ± 0.03
NCMN-0     | 3.34 ± 0.02 | 17.05 ± 0.08
NCMN-1     | 3.02 ± 0.06 | 17.09 ± 0.10
NCMN-2     | 3.00 ± 0.05 | 16.70 ± 0.13

Next, we extend our experiments to ResNets, in order to investigate possible interactions between skip connections and NCMN. Specifically, we test different types of noise on a WRN-22-7.5 network, which has comparable performance to WRN-28-10, but is much faster to train. Since WRN is based on the full pre-activation variant of ResNets [15], NCMN-1, NCMN-2, and shake-shake require an extra batch normalization layer after each residual branch. The feature correlation is calculated for the output of each residual block, instead of for each layer.

As shown in Fig. 2, NCMN continues to lower the correlation between features, which is especially prominent in the second and third stages of the network. In addition, as shown in Table 2, we obtain even more performance gains from NCMN compared to the CNN-16-10 network.
However, a notable difference is that, while NCMN-1 works well with both architectures, NCMN-0 and NCMN-2 perform slightly better with plain CNNs and ResNets, respectively, which may result from the interaction between the skip connections of ResNets and NCMN.

Figure 2: Feature correlations of WRN-22-7.5 networks trained with different types of noise ((a) CIFAR-10; (b) CIFAR-100).

To compare the performance of NCMN with shake-shake regularization, we train a WRN-22-5.4 network with two residual branches, which has a comparable number of parameters to WRN-22-7.5. We refer to this network as WRN-22-5.4×2. The averaging weights of the residual branches are randomly sampled for each input image, and are independently sampled for the forward and backward passes. The training curves corresponding to different types of noise are shown in Fig. 3. On both CIFAR-10 and CIFAR-100, NCMN and shake-shake show a stronger regularization effect than standard multiplicative noise, as indicated by the difference between training (dashed lines) and testing (solid lines) accuracies.
However, we observe that the regularization strength of shake-shake is stronger at the early stage of training but diminishes rapidly afterwards, while that of NCMN is more consistent throughout the training process. See Table 3 for detailed results.

In Table 3, we provide additional results demonstrating the performance of NCMN on models of different sizes. Interestingly, NCMN is able to significantly improve the performance of both small and large models. It is worth noting that, apart from the differences in architecture and number of parameters, the number of training epochs can also notably affect generalization performance [20]. Compared to the results of shake-shake regularization, a WRN-28-10 network trained with NCMN achieves comparable performance in 9 times fewer epochs. For practical use, NCMN-0 is simple, fast, and can be applied to any batch-normalized neural network, while NCMN-2 yields better generalization performance on ResNets.

(a) Results on CIFAR-10.

(b) Results on CIFAR-100.

Figure 3: Training curves of WRN-22-7.5 networks trained with different types of noise, and that of a WRN-22-5.4×2 network trained with shake-shake regularization.

Table 3: More results on CIFAR-10/100 for comparison. Error rates (%) are reported as "with noise/without noise".

Model                       Params  Epochs  Noise type    CIFAR-10   CIFAR-100
DenseNet-BC (250, 24) [21]  15.3M   300     None          3.62       17.60
ResNeXt-26 (2×96d) [5]      26.2M   1800    Shake/None    2.86/3.58  —
ResNeXt-29 (8×64d) [5]      34.4M   1800    Shake/None    —          15.85/16.34
WRN-28-10 [16]              36.5M   200     Dropout/None  3.89/4.00  18.85/19.25
DenseNet-BC (40, 48)        3.9M    300     NCMN-0/None   3.51/4.07  17.68/19.92
CNN-16-3                    1.6M    200     NCMN-0/None   4.47/5.10  21.92/24.97
CNN-16-10                   17.1M   200     NCMN-1/None   3.41/4.05  17.55/19.22
WRN-22-2                    1.1M    200     NCMN-0/None   4.56/5.19  23.54/25.90
WRN-22-7.5                  15.1M   200     NCMN-2/None   3.00/3.68  16.70/19.29
WRN-22-5.4×2                15.5M   200     Shake/None    3.51/4.04  17.77/19.71
WRN-28-10                   36.5M   200     NCMN-2/None   2.78/3.70  15.86/18.42

We also experimented with language models based on long short-term memories (LSTMs) [22]. Intriguingly, we found that the hidden states of LSTMs had a consistently low level of correlation (less than 0.1 on average), even in the presence of dropout. Consequently, we did not observe significant improvement from replacing dropout with NCMN.

5 Conclusion

In this work, we analyzed multiplicative noise from an information perspective. Our analysis suggests a side effect of dropout and other types of multiplicative noise: they increase the correlation between features and consequently degrade generalization performance.
The same theoretical framework also provides a principled explanation for the performance gain of shake-shake regularization.

Furthermore, we proposed a simple modification to the gradient of noise components, which, when combined with batch normalization, effectively removes the feature correlation effect. The resulting method, NCMN, outperforms standard multiplicative noise by a large margin, proving it to be a better alternative for batch-normalized networks.

While we combine batch normalization with NCMN to counteract the tendency of increasing feature magnitude, an interesting direction for future work is to investigate whether other normalization schemes, such as layer normalization [23] and group normalization [24], can serve the same purpose.

Acknowledgments

This work was supported by NSFC 61628209, Hubei Science Foundation 2016CFA030, 2017AAA125, and Wuhan Science & Tech Program 2018010401011288.

References

[1] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich.
Going deeper with convolutions. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.

[2] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[3] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.

[4] Li Wan, Matthew Zeiler, Sixin Zhang, Yann Le Cun, and Rob Fergus. Regularization of neural networks using DropConnect. In International Conference on Machine Learning, pages 1058–1066, 2013.

[5] Xavier Gastaldi. Shake-shake regularization of 3-branch residual networks. In Workshop of International Conference on Learning Representations, 2017.

[6] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

[7] Stanislau Semeniuta, Aliaksei Severyn, and Erhardt Barth. Recurrent dropout without memory loss. arXiv preprint arXiv:1603.05118, 2016.

[8] Yoshua Bengio and James S. Bergstra. Slow, decorrelated features for pretraining complex cell-like networks. In Advances in Neural Information Processing Systems, pages 99–107, 2009.

[9] Dmytro Mishkin and Jiri Matas. All you need is a good init. In International Conference on Learning Representations, 2016.

[10] Michael Cogswell, Faruk Ahmed, Ross Girshick, Larry Zitnick, and Dhruv Batra. Reducing overfitting in deep networks by decorrelating representations. In International Conference on Learning Representations, 2016.

[11] Pau Rodríguez, Jordi Gonzalez, Guillem Cucurull, Josep M. Gonfaus, and Xavier Roca. Regularizing CNNs with locally constrained decorrelations.
In International Conference on Learning Representations, 2017.

[12] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456, 2015.

[13] Jonathan Tompson, Ross Goroshin, Arjun Jain, Yann LeCun, and Christoph Bregler. Efficient object localization using convolutional networks. In IEEE Conference on Computer Vision and Pattern Recognition, pages 648–656, 2015.

[14] Sida Wang and Christopher Manning. Fast dropout training. In International Conference on Machine Learning, pages 118–126, 2013.

[15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision, pages 630–645. Springer, 2016.

[16] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.

[17] Zijun Zhang, Lin Ma, Zongpeng Li, and Chuan Wu. Normalized direction-preserving Adam. arXiv preprint arXiv:1709.04546, 2017.

[18] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.

[19] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.

[20] Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. In Advances in Neural Information Processing Systems, pages 1729–1739, 2017.

[21] Gao Huang, Zhuang Liu, Kilian Q. Weinberger, and Laurens van der Maaten. Densely connected convolutional networks. In IEEE Conference on Computer Vision and Pattern Recognition, volume 1, page 3, 2017.

[22] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory.
Neural Computation, 9(8):1735–1780, 1997.

[23] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.

[24] Yuxin Wu and Kaiming He. Group normalization. In European Conference on Computer Vision. Springer, 2018.