{"title": "Regularizing Deep Neural Networks by Noise: Its Interpretation and Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 5109, "page_last": 5118, "abstract": "Overfitting is one of the most critical challenges in deep neural networks, and there are various types of regularization methods to improve generalization performance. Injecting noises to hidden units during training, e.g., dropout, is known as a successful regularizer, but it is still not clear enough why such training techniques work well in practice and how we can maximize their benefit in the presence of two conflicting objectives---optimizing to true data distribution and preventing overfitting by regularization. This paper addresses the above issues by 1) interpreting that the conventional training methods with regularization by noise injection optimize the lower bound of the true objective and 2) proposing a technique to achieve a tighter lower bound using multiple noise samples per training example in a stochastic gradient descent iteration. We demonstrate the effectiveness of our idea in several computer vision applications.", "full_text": "Regularizing Deep Neural Networks by Noise:\n\nIts Interpretation and Optimization\n\nHyeonwoo Noh\n\nTackgeun You\n\nDept. 
of Computer Science and Engineering, POSTECH, Korea\n\nJonghwan Mun\n\nBohyung Han\n\n{shgusdngogo,tackgeun.you,choco1916,bhhan}@postech.ac.kr\n\nAbstract\n\nOver\ufb01tting is one of the most critical challenges in deep neural networks, and there\nare various types of regularization methods to improve generalization performance.\nInjecting noises to hidden units during training, e.g., dropout, is known as a suc-\ncessful regularizer, but it is still not clear enough why such training techniques\nwork well in practice and how we can maximize their bene\ufb01t in the presence of\ntwo con\ufb02icting objectives\u2014optimizing to true data distribution and preventing\nover\ufb01tting by regularization. This paper addresses the above issues by 1) interpret-\ning that the conventional training methods with regularization by noise injection\noptimize the lower bound of the true objective and 2) proposing a technique to\nachieve a tighter lower bound using multiple noise samples per training example\nin a stochastic gradient descent iteration. We demonstrate the effectiveness of our\nidea in several computer vision applications.\n\n1\n\nIntroduction\n\nDeep neural networks have been showing impressive performance in a variety of applications in\nmultiple domains [2, 12, 20, 23, 26, 27, 28, 31, 35, 38]. Its great success comes from various factors\nincluding emergence of large-scale datasets, high-performance hardware support, new activation\nfunctions, and better optimization methods. Proper regularization is another critical reason for better\ngeneralization performance because deep neural networks are often over-parametrized and likely to\nsuffer from over\ufb01tting problem. A common type of regularization is to inject noises during training\nprocedure: adding or multiplying noise to hidden units of the neural networks, e.g., dropout. 
This\nkind of technique is frequently adopted in many applications due to its simplicity, generality, and\neffectiveness.\nNoise injection for training incurs a tradeoff between data fitting and model regularization, even\nthough both objectives are important to improve performance of a model. Using more noise makes it\nharder for a model to fit the data distribution while reducing noise weakens the regularization effect. Since\nthe level of noise directly affects the two terms in the objective function, model fitting and regularization\nterms, it would be desirable to maintain proper noise levels during training or develop an effective\ntraining algorithm given a noise level.\nBetween these two potential directions, we are interested in the latter, more effective training. Within\nthe standard stochastic gradient descent framework, we propose to facilitate optimization of deep\nneural networks with noise added for better regularization. Specifically, by regarding noise injected\noutputs of hidden units as stochastic activations, we interpret the conventional training strategy\nas optimizing the lower bound of the marginal likelihood over the hidden units whose values are sampled\nwith a reparametrization trick [18].\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nOur algorithm is motivated by the importance weighted autoencoders [7], which are variational\nautoencoders trained for tighter variational lower bounds using more samples of stochastic variables\nper training example in a stochastic gradient descent iteration. Our novel interpretation of noise\ninjected hidden units as stochastic activations enables the lower bound analysis of [7] to be naturally\napplied to training deep neural networks with regularization by noise. 
It introduces the importance\nweighted stochastic gradient descent, a variant of the standard stochastic gradient descent, which\nemploys multiple noise samples in an iteration for each training example. The proposed training\nstrategy allows trained models to achieve good balance between model \ufb01tting and regularization.\nAlthough our method is general for various regularization techniques by noise, we mainly discuss its\nspecial form, dropout\u2014one of the most famous methods for regularization by noise.\nThe main contribution of our paper is three-fold:\n\n\u2022 We present that the conventional training with regularization by noise is equivalent to\noptimizing the lower bound of the marginal likelihood through a novel interpretation of\nnoise injected hidden units as stochastic activations.\n\n\u2022 We derive the importance weighted stochastic gradient descent for regularization by noise\n\nthrough the lower bound analysis.\n\n\u2022 We demonstrate that the importance weighted stochastic gradient descent often improves\nperformance of deep neural networks with dropout, a special form of regularization by noise.\n\nThe rest of the paper is organized as follows. Section 2 discusses prior works related to our approach.\nWe describe our main idea and instantiation to dropout in Section 3 and 4, respectively. Section 5\nanalyzes experimental results on various applications and Section 6 makes our conclusion.\n\n2 Related Work\n\nRegularization by noise is a common technique to improve generalization performance of deep neural\nnetworks, and various implementations are available depending on network architectures and target\napplications. A well-known example is dropout [34], which randomly turns off a subset of hidden\nunits of neural networks by multiplying noise sampled from a Bernoulli distribution.\nIn addition to the standard form of dropout, there exist several variations of dropout designed to\nfurther improve generalization performance. 
For example, Ba et al. [3] proposed adaptive dropout,\nwhere dropout rate is determined by another neural network dynamically. Li et al. [22] employ\ndropout with a multinormial distribution, instead of a Bernoulli distribution, which generates noise by\nselecting a subset of hidden units out of multiple subsets. Bulo et al. [6] improve dropout by reducing\ngap between training and inference procedure, where the output of dropout layers in inference stage\nis given by learning expected average of multiple dropouts. There are several related concepts to\ndropout, which can be categorized as regularization by noise. In [17, 37], noise is added to weights\nof neural networks, not to hidden states. Learning with stochastic depth [15] and stochastic ensemble\nlearning [10] can also be regarded as noise injection techniques to weights or architecture. Our work\nis differentiated with the prior study in the sense that we improve generalization performance using\nbetter training objective while dropout and its variations rely on the original objective function.\nOriginally, dropout is proposed with interpretation as an extreme form of model ensemble [20, 34],\nand this intuition makes sense to explain good generalization performance of dropout. On the other\nhand, [36] views dropout as an adaptive regularizer for generalized linear models and [16] claims that\ndropout is effective to escape local optima for training deep neural networks. In addition, [9] uses\ndropout for estimating uncertainty based on Bayesian perspective. The proposed training algorithm\nis based on a novel interpretation of training with regularization by noise as training with latent\nvariables. Such understanding is distinguishable from the existing views on dropout, and provides a\nprobabilistic formulation to analyze dropout. 
A similar interpretation to our work is proposed in [24],\nbut it focuses on reducing the gap between training and inference steps of using dropout while our work\nproposes to use a novel training objective for better regularization.\nOur goal is to formulate a stochastic model for regularization by noise and propose an effective\ntraining algorithm given a predefined noise level within a stochastic gradient descent framework.\nA closely related work is the importance weighted autoencoder [7], which employs multiple samples\nweighted by importance to compute gradients and improve performance. This work shows that the\nimportance weighted stochastic gradient descent method achieves a tighter lower bound of the ideal\nmarginal likelihood over latent variables than the variational lower bound. It also shows that the\nbound becomes tighter as the number of samples for the latent variables increases. The importance\nweighted objective has been applied to various applications such as generative modeling [5, 7],\ntraining binary stochastic feed-forward networks [30] and training recurrent attention models [4].\nThis idea is extended to discrete latent variables in [25].\n\n3 Proposed Method\n\nThis section describes the proposed importance weighted stochastic gradient descent using multiple\nsamples in deep neural networks for regularization by noise.\n\n3.1 Main Idea\n\nThe premise of our paper is that injecting noise into deterministic hidden units constructs stochastic\nhidden units. Noise injection during training obviously incurs stochastic behavior of the model and\nthe optimizer. 
By defining deterministic hidden units with noise as stochastic hidden units, we can\nexploit well-defined probabilistic formulations to analyze the conventional training procedure and\npropose approaches for better optimization.\nSuppose that a set of activations over all hidden units across all layers, $z$, is given by\n\n$$z = g(h_\\phi(x), \\epsilon) \\sim p_\\phi(z \\mid x), \\tag{1}$$\n\nwhere $h_\\phi(x)$ denotes the deterministic activations of hidden units for input $x$ and model parameters $\\phi$. A\nnoise injection function $g(\\cdot, \\cdot)$ is given by addition or multiplication of activation and noise, where $\\epsilon$\ndenotes noise sampled from a certain probability distribution such as a Gaussian distribution. If this\npremise is applied to dropout, the noise $\\epsilon$ means random selections of hidden units in a layer and the\nrandom variable $z$ indicates the activation of the hidden layer given a specific sample of dropout.\nTraining a neural network with stochastic hidden units requires optimizing the marginal likelihood\nover the stochastic hidden units $z$, which is given by\n\n$$\\mathcal{L}_{\\text{marginal}} = \\log \\mathbb{E}_{p_\\phi(z \\mid x)} \\left[ p_\\theta(y \\mid z, x) \\right], \\tag{2}$$\n\nwhere $p_\\theta(y \\mid z, x)$ is an output probability of ground-truth $y$ given input $x$ and hidden units $z$, and $\\theta$ is\nthe model parameter for the output prediction. Note that the expectation over training data $\\mathbb{E}_{p(x,y)}$\noutside the logarithm is omitted for notational simplicity.\nFor marginalization of stochastic hidden units constructed by noise, we employ the reparameterization\ntrick proposed in [18]. Specifically, random variable $z$ is replaced by Eq. (1) and the marginalization\nis performed over noise, which is given by\n\n$$\\mathcal{L}_{\\text{marginal}} = \\log \\mathbb{E}_{p(\\epsilon)} \\left[ p_\\theta(y \\mid g(h_\\phi(x), \\epsilon), x) \\right], \\tag{3}$$\n\nwhere $p(\\epsilon)$ is the distribution of noise. Eq. (3) means that training a noise injected neural network\nrequires optimizing the marginal likelihood over noise $\\epsilon$.\n\n3.2 Importance Weighted Stochastic Gradient Descent\n\nWe now describe how the marginal likelihood in Eq. (3) is optimized in an SGD (Stochastic Gradient\nDescent) framework and propose the IWSGD (Importance Weighted Stochastic Gradient Descent)\nmethod derived from the lower bound introduced by SGD.\n\n3.2.1 Objective\n\nIn practice, SGD estimates the marginal likelihood in Eq. (3) by taking expectation over multiple sets\nof noisy samples, where we compute a marginal log-likelihood for a finite number of noise samples\nin each set. Therefore, the real objective for SGD is as follows:\n\n$$\\mathcal{L}_{\\text{marginal}} \\approx \\mathcal{L}_{\\text{SGD}}(S) = \\mathbb{E}_{p(\\mathcal{E})} \\left[ \\log \\frac{1}{S} \\sum_{\\epsilon \\in \\mathcal{E}} p_\\theta(y \\mid g(h_\\phi(x), \\epsilon), x) \\right], \\tag{4}$$\n\nwhere $S$ is the number of noise samples for each training example and $\\mathcal{E} = \\{\\epsilon_1, \\epsilon_2, \\dots, \\epsilon_S\\}$ is a set of\nnoises.\nThe main observation from Burda et al. [7] is that the SGD objective in Eq. (4) is a lower bound of\nthe marginal likelihood in Eq. (3), which holds by Jensen\u2019s inequality as\n\n$$\\begin{aligned} \\mathcal{L}_{\\text{SGD}}(S) &= \\mathbb{E}_{p(\\mathcal{E})} \\left[ \\log \\frac{1}{S} \\sum_{\\epsilon \\in \\mathcal{E}} p_\\theta(y \\mid g(h_\\phi(x), \\epsilon), x) \\right] \\\\ &\\le \\log \\mathbb{E}_{p(\\mathcal{E})} \\left[ \\frac{1}{S} \\sum_{\\epsilon \\in \\mathcal{E}} p_\\theta(y \\mid g(h_\\phi(x), \\epsilon), x) \\right] \\\\ &= \\log \\mathbb{E}_{p(\\epsilon)} \\left[ p_\\theta(y \\mid g(h_\\phi(x), \\epsilon), x) \\right] \\\\ &= \\mathcal{L}_{\\text{marginal}}, \\end{aligned} \\tag{5}$$\n\nwhere $\\mathbb{E}_{p(\\epsilon)} [f(\\epsilon)] = \\mathbb{E}_{p(\\mathcal{E})} \\left[ \\frac{1}{S} \\sum_{\\epsilon \\in \\mathcal{E}} f(\\epsilon) \\right]$ for an arbitrary function $f(\\cdot)$ over $\\epsilon$ if the cardinality of\n$\\mathcal{E}$ is equal to $S$. This characteristic makes the number of noise samples $S$ directly related to the\ntightness of the lower bound as\n\n$$\\mathcal{L}_{\\text{marginal}} \\ge \\mathcal{L}_{\\text{SGD}}(S + 1) \\ge \\mathcal{L}_{\\text{SGD}}(S). \\tag{6}$$\n\nRefer to [7] for the proof of Eq. (6).\nBased on this observation, we propose to use $\\mathcal{L}_{\\text{SGD}}(S > 1)$ as an objective of IWSGD. Note that the\nconventional training procedure for regularization by noise such as dropout [34] relies on the objective\nwith $S = 1$ (Section 4). Thus, we show that using more samples achieves a tighter lower bound and\nthat the optimization by IWSGD has great potential to improve accuracy by proper regularization.\n\n3.2.2 Training\n\nTraining with IWSGD is achieved by computing the weighted average of gradients obtained from\nmultiple noise samples $\\epsilon$. This training strategy is based on the derivative of the IWSGD objective with\nrespect to the model parameters $\\theta$ and $\\phi$, which is given by\n\n$$\\begin{aligned} \\nabla_{\\theta,\\phi} \\mathcal{L}_{\\text{SGD}}(S) &= \\nabla_{\\theta,\\phi} \\mathbb{E}_{p(\\mathcal{E})} \\left[ \\log \\frac{1}{S} \\sum_{\\epsilon \\in \\mathcal{E}} p_\\theta(y \\mid g(h_\\phi(x), \\epsilon), x) \\right] \\\\ &= \\mathbb{E}_{p(\\mathcal{E})} \\left[ \\nabla_{\\theta,\\phi} \\log \\frac{1}{S} \\sum_{\\epsilon \\in \\mathcal{E}} p_\\theta(y \\mid g(h_\\phi(x), \\epsilon), x) \\right] \\\\ &= \\mathbb{E}_{p(\\mathcal{E})} \\left[ \\frac{\\nabla_{\\theta,\\phi} \\sum_{\\epsilon \\in \\mathcal{E}} p_\\theta(y \\mid g(h_\\phi(x), \\epsilon), x)}{\\sum_{\\epsilon' \\in \\mathcal{E}} p_\\theta(y \\mid g(h_\\phi(x), \\epsilon'), x)} \\right] \\\\ &= \\mathbb{E}_{p(\\mathcal{E})} \\left[ \\sum_{\\epsilon \\in \\mathcal{E}} w_\\epsilon \\nabla_{\\theta,\\phi} \\log p_\\theta(y \\mid g(h_\\phi(x), \\epsilon), x) \\right], \\end{aligned} \\tag{7}$$\n\nwhere $w_\\epsilon$ denotes an importance weight with respect to sample noise $\\epsilon$ and is given by\n\n$$w_\\epsilon = \\frac{p_\\theta(y \\mid g(h_\\phi(x), \\epsilon), x)}{\\sum_{\\epsilon' \\in \\mathcal{E}} p_\\theta(y \\mid g(h_\\phi(x), \\epsilon'), x)}. \\tag{8}$$\n\nNote that the weight of each sample is equal to the normalized likelihood of the sample.\nFor training, we first draw a set of noise samples $\\mathcal{E}$ and perform forward and backward propagation\nfor each noise sample $\\epsilon \\in \\mathcal{E}$ to compute likelihoods and corresponding gradients. Then, importance\nweights are computed by Eq. (8), and employed to compute the weighted average of gradients. 
Finally,\nwe optimize the model by SGD with the importance weighted gradients.\n\n3.2.3 Inference\n\nInference in IWSGD is the same as in the standard dropout; input activations to each dropout layer\nare scaled based on dropout probability, rather than taking a subset of activations stochastically.\nTherefore, compared to the standard dropout, neither additional sampling nor computation is required\nduring inference.\n\nFigure 1: Implementation detail of IWSGD for dropout optimization. We compute a weighted\naverage of the gradients from multiple dropout masks. For each training example the gradients for\nmultiple dropout masks are independently computed and are averaged with importance weights in\nEq. (8).\n\n3.3 Discussion\n\nOne may argue that the use of multiple samples is equivalent to running multiple iterations either\ntheoretically or empirically. It is difficult to derive the aggregated lower bounds of the marginal\nlikelihood over multiple iterations since the model parameters are updated in every iteration. However,\nwe observed that performance with a single sample saturates easily and it is unlikely to achieve\nbetter accuracy with additional iterations than our algorithm based on IWSGD, as presented in\nSection 5.1.\n\n4 Importance Weighted Stochastic Gradient Descent for Dropout\n\nThis section describes how the proposed idea is realized in the context of dropout, which is one of the\nmost popular techniques for regularization by noise.\n\n4.1 Analysis of Conventional Dropout\n\nFor training with dropout, binary dropout masks are sampled from a Bernoulli distribution. The\nhidden activations below dropout layers, denoted by $h_\\phi(x)$, are either kept or discarded by element-wise\nmultiplication with a randomly sampled dropout mask $\\epsilon$; activations after the dropout layers are\ndenoted by $g(h_\\phi(x), \\epsilon)$. 
The objective of SGD optimization is obtained by averaging log-likelihoods,\nwhich is formally given by\n\n$$\\mathcal{L}_{\\text{dropout}} = \\mathbb{E}_{p(\\epsilon)} \\left[ \\log p_\\theta(y \\mid g(h_\\phi(x), \\epsilon), x) \\right], \\tag{9}$$\n\nwhere the outermost expectation over training data $\\mathbb{E}_{p(x,y)}$ is omitted for simplicity as mentioned\nearlier. Note that the objective in Eq. (9) is a special case of the objective of IWSGD with $S = 1$.\nThis implies that the conventional dropout training optimizes the lower bound of the ideal marginal\nlikelihood, which is improved by increasing the number of dropout masks for each training example\nin an iteration.\n\n4.2 Training Dropout with Tighter Lower-bound\n\nFigure 1 illustrates how IWSGD is employed to train with dropout layers for regularization. Following\nthe same training procedure described in Section 3.2.2, we sample multiple dropout masks as a\nrealization of the multiple noise sampling.\n\nFigure 2: Impact of multi-sample training in CIFAR datasets with variable dropout rates. Panels: (a) test error in CIFAR-10 (depth=28); (b) test error in CIFAR-100 (depth=28). These\nresults are with wide residual net (widening factor=10, depth=28). Each data point and error bar\nare computed from 3 trials with different seeds. The results show that using IWSGD with multiple\nsamples consistently improves the performance and the results are not sensitive to dropout rates.\n\nTable 1: Comparison with various models in CIFAR datasets. We achieve the near state-of-the-art\nperformance by applying the multi-sample objective to wide residual network [40]. Note that \u00d74\niterations means a model trained with 4 times more iterations. The test errors of our implementations\n(including reproduction of [40]) are obtained from the results with 3 different seeds. The numbers\nwithin parentheses denote the standard deviations of test errors.\n\nModel | CIFAR-10 | CIFAR-100\nResNet [12] | 6.43 | -\nResNet with Stochastic Depth [15] | 4.91 | 24.58\nFractalNet with Dropout [21] | 4.60 | 23.73\nResNet (pre-activation) [13] | 4.62 | 22.71\nPyramidNet [11] | 3.77 | 18.29\nWide ResNet (depth=40) [40] | 3.80 | 18.30\nDenseNet [14] | 3.46 | 17.18\nWide ResNet (depth=28, dropout=0.3) [40] | 3.89 | 18.85\nWide ResNet (depth=28, dropout=0.5) (\u00d74 iterations) | 4.48 (0.15) | 20.70 (0.19)\nWide ResNet (depth=28, dropout=0.5) (reproduced) | 3.88 (0.15) | 19.12 (0.24)\nWide ResNet (depth=28, dropout=0.5) with IWSGD (S = 4) | 3.58 (0.05) | 18.01 (0.16)\nWide ResNet (depth=28, dropout=0.5) with IWSGD (S = 8) | 3.55 (0.11) | 17.63 (0.13)\n\nThe use of IWSGD for optimization requires only minor modifications in implementation. This\nis because the gradient computation part in the standard dropout is reusable. The gradient for the\nstandard dropout is given by\n\n$$\\nabla_{\\theta,\\phi} \\mathcal{L}_{\\text{dropout}} = \\mathbb{E}_{p(\\epsilon)} \\left[ \\nabla_{\\theta,\\phi} \\log p_\\theta(y \\mid g(h_\\phi(x), \\epsilon), x) \\right]. \\tag{10}$$\n\nNote that this is actually an unweighted version of the final line in Eq. (7). Therefore, the only additional\ncomponent for IWSGD is weighting gradients with importance weights. This property makes\nit easy to incorporate IWSGD into many applications with dropout.\n\n5 Experiments\n\nWe evaluate the proposed training algorithm in various architectures for real world tasks including object\nrecognition [40], visual question answering [39], image captioning [35] and action recognition [8].\nThese models are chosen for our experiments since they use dropout actively for regularization. To\nisolate the effect of the proposed training method, we employ simple models without integrating\nheuristics for performance improvement (e.g., model ensembles, multi-scaling, etc.) and make\nhyper-parameters (e.g., type of optimizer, learning rate, batch size, etc.) 
fixed.\n\nTable 2: Accuracy on VQA test-dev dataset. Our re-implementation of SAN [39] is used as baseline.\nIncreasing the number of samples S with IWSGD consistently improves performance.\n\nMethod | Open-Ended: All | Y/N | Num | Others | Multiple-Choice: All | Y/N | Num | Others\nSAN [39] | 58.68 | 79.28 | 36.56 | 46.09 | - | - | - | -\nSAN with 2-layer LSTM (reproduced) | 60.19 | 79.69 | 36.74 | 48.84 | 64.77 | 79.72 | 39.03 | 57.82\nwith IWSGD (S = 5) | 60.31 | 80.74 | 34.70 | 48.66 | 65.01 | 80.73 | 36.36 | 58.05\nwith IWSGD (S = 8) | 60.41 | 80.86 | 35.56 | 48.56 | 65.21 | 80.77 | 37.56 | 58.18\n\n5.1 Object Recognition\n\nThe proposed algorithm is integrated into wide residual network [40], which uses dropout in every\nresidual block, and evaluated on CIFAR datasets [19]. This network achieves accuracy close to the\nstate-of-the-art performance in both CIFAR-10 and CIFAR-100 datasets with data augmentation. We\nuse the publicly available implementation\u00b9 by the authors of [40] and follow all the implementation\ndetails in the original paper.\nFigure 2 presents the impact of IWSGD with multiple samples. We perform experiments using\nthe wide residual network with widening factor 10 and depth 28. Each experiment is performed 3\ntimes with different seeds in CIFAR datasets and test errors with corresponding standard deviations\nare reported. The baseline performance is from [40], and we also report the reproduced results by\nour implementation, which is denoted by Wide ResNet (reproduced). The result by the proposed\nalgorithm is denoted by IWSGD together with the number of samples S.\nTraining with IWSGD with multiple samples clearly improves performance as illustrated in Figure 2.\nIt also shows that, as the number of samples increases, the test errors decrease even more both on\nCIFAR-10 and CIFAR-100, regardless of the dropout rate. 
Another observation is that the results\nfrom the proposed multi-sample training strategy are not sensitive to dropout rates.\nUsing IWSGD with multiple samples to train the wide residual network enables us to achieve the near\nstate-of-the-art performance on CIFAR datasets. As illustrated in Table 1, the accuracy of the model\nwith S = 8 samples is very close to the state-of-the-art performance for CIFAR datasets, which is\nbased on another architecture [14]. To illustrate the bene\ufb01t of our algorithm compared to the strategy\nto simply increase the number of iterations, we evaluate the performance of the model trained with 4\ntimes more iterations, which is denoted by \u00d74 iterations. Note that the model with more iterations\ndoes not improve the performance as discussed in Section 3.3. We believe that the simple increase of\nthe number of iterations is likely to over\ufb01t the trained model.\n\n5.2 Visual Question Answering\n\nVisual Question Answering (VQA) [2] is a task to answer a question about a given image. Input of\nthis task is a pair of an image and a question, and output is an answer to the question. This task is\ntypically formulated as a classi\ufb01cation problem with multi-modal inputs [1, 29, 39].\nTo train models and run experiments, we use VQA dataset [2], which is commonly used for the\nevaluation of VQA algorithms. There are two different kinds of tasks: open-ended and multiple-\nchoice task. The model predicts an answer for an open-ended task without knowing prede\ufb01ned set of\ncandidate answers while selecting one of candidate answers in multiple-choice task. We evaluate\nthe proposed training method using a baseline model, which is similar to [39] but has a single stack\nof attention layer. For question features, we employ a two-layer LSTM based on word embedding2,\nwhile using activations from pool5 layer of VGG-16 [32] for image features.\nTable 2 presents the results of our experiment for VQA. 
SAN with 2-layer LSTM denotes our\nbaseline with the standard dropout. This method already outperforms the comparable model with\nspatial attention [39] possibly due to the use of a stronger question encoder, two-layer LSTM. When\nwe evaluate performance of IWSGD with 5 and 8 samples, we observe consistent performance\nimprovement of our algorithm with increase of the number of samples.\n\n\u00b9 https://github.com/szagoruyko/wide-residual-networks\n\u00b2 https://github.com/VT-vision-lab/VQA_LSTM_CNN\n\nTable 3: Results on MSCOCO test dataset for image captioning. For BLEU metric, we use BLEU-4,\nwhich is computed based on 4-gram words, since the baseline method [35] reported BLEU-4 only.\n\nMethod | BLEU | METEOR | CIDEr\nGoogle-NIC [35] | 27.7 | 23.7 | 85.5\nGoogle-NIC (reproduced) | 26.8 | 22.6 | 82.2\nwith IWSGD (S = 5) | 27.5 | 22.9 | 83.6\n\nTable 4: Average classification accuracy of compared algorithms over three splits on UCF-101\ndataset. TwoStreamFusion (reproduced) denotes our reproduction based on the public source code.\n\nMethod | UCF-101\nTwoStreamFusion [8] | 92.50 %\nTwoStreamFusion (reproduced) | 92.49 %\nwith IWSGD (S = 5) | 92.73 %\nwith IWSGD (S = 10) | 92.69 %\nwith IWSGD (S = 15) | 92.72 %\n\n5.3 Image Captioning\n\nImage captioning is the problem of generating a natural language description given an image. This task is\ntypically handled by an encoder-decoder network, where a CNN encoder transforms an input image\ninto a feature vector and an LSTM decoder generates a caption from the feature by predicting words\none by one. A dropout layer is located on top of the hidden state in the LSTM decoder. To evaluate the\nproposed training method, we exploit a publicly available implementation\u00b3 whose model is identical\nto the standard encoder-decoder model of [35], but uses VGG-16 [32] instead of GoogLeNet as a\nCNN encoder. 
We \ufb01x the parameters of VGG-16 network to follow the implementation of [35].\nWe use MSCOCO dataset for experiment, and evaluate models with several metrics (BLEU, METEOR\nand CIDEr) using the public MSCOCO caption evaluation tool. These metrics measure precision or\nrecall of n-gram words between the generated captions and the ground-truths.\nTable 3 summarizes the results on image captioning. Google-NIC is the reported scores in the\noriginal paper [35] while Google-NIC (reproduced) denotes the results of our reproduction. Our\nreproduction has slightly lower accuracy due to use of a different CNN encoder. IWSGD with 5\nsamples consistently improves performance in terms of all three metrics, which indicates our training\nmethod is also effective to learn LSTMs.\n\n5.4 Action Recognition\n\nAction recognition is a task recognizing a human action in videos. We employ a well-known\nbenchmark of action classi\ufb01cation, UCF-101 [33], for evaluation, which has 13,320 trimmed videos\nannotated with 101 action categories. The dataset has three splits for cross validation, and the \ufb01nal\nperformance is calculated by the average accuracy of the three splits.\nWe employ a variation of two-stream CNN proposed by [8], which shows competitive performance on\nUCF-101. The network consists of three subnetworks: a spatial stream network for image, a temporal\nstream network for optical \ufb02ow and a fusion network for combining the two-stream networks. We\napply our IWSGD only to \ufb01ne-tuning the fusion unit for training ef\ufb01ciency. Our implementation is\nbased on the public source code4. Hyper-parameters such as dropout rate and learning rate scheduling\nis the same as the baseline model [8].\nTable 4 illustrates performance improvement by integrating IWSGD but the overall tendency with\nincrease of the number of samples is not consistent. 
We suspect that this is because the performance\nof the model is already saturated and there is not much room for improvement through fine-tuning\nonly the fusion unit.\n\n\u00b3 https://github.com/karpathy/neuraltalk2\n\u2074 http://www.robots.ox.ac.uk/~vgg/software/two_stream_action/\n\n6 Conclusion\n\nWe proposed an optimization method for regularization by noise, especially for dropout, in deep\nneural networks. This method is based on a novel interpretation of noise injected deterministic hidden\nunits as stochastic hidden ones. Using this interpretation, we proposed to use IWSGD (Importance\nWeighted Stochastic Gradient Descent), which achieves tighter lower bounds as the number of\nsamples increases. We applied the proposed optimization method to dropout, a special case of the\nregularization by noise, and evaluated on various visual recognition tasks: image classification,\nvisual question answering, image captioning and action classification. We observed the consistent\nimprovement of our algorithm over all tasks, and achieved near state-of-the-art performance on\nCIFAR datasets through better optimization. We believe that the proposed method may improve\nmany other deep neural network models with dropout layers.\n\nAcknowledgement This work was supported by the IITP grant funded by the Korea government\n(MSIT) [2017-0-01778, Development of Explainable Human-level Deep Machine Learning Inference\nFramework; 2017-0-01780, The Technology Development for Event Recognition/Relational\nReasoning and Learning Knowledge based System for Video Understanding].\n\nReferences\n[1] J. Andreas, M. Rohrbach, T. Darrell, and D. Klein. Neural module networks. In CVPR, 2016.\n[2] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Lawrence Zitnick, and D. Parikh. VQA: visual question answering. In ICCV, 2015.\n[3] J. Ba and B. Frey. Adaptive dropout for training deep neural networks. In NIPS, 2013.\n[4] J. Ba, R. R. Salakhutdinov, R. B. Grosse, and B. J. 
Frey. Learning wake-sleep recurrent attention models. In NIPS, 2015.\n[5] J. Bornschein and Y. Bengio. Reweighted wake-sleep. In ICLR, 2015.\n[6] S. R. Bulo, L. Porzi, and P. Kontschieder. Dropout distillation. In ICML, 2016.\n[7] Y. Burda, R. Grosse, and R. Salakhutdinov. Importance weighted autoencoders. In ICLR, 2016.\n[8] C. Feichtenhofer, A. Pinz, and A. Zisserman. Convolutional two-stream network fusion for video action recognition. In CVPR, 2016.\n[9] Y. Gal and Z. Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In ICML, 2016.\n[10] B. Han, J. Sim, and H. Adam. Branchout: Regularization for online ensemble tracking with convolutional neural networks. In CVPR, 2017.\n[11] D. Han, J. Kim, and J. Kim. Deep pyramidal residual networks. In CVPR, 2017.\n[12] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.\n[13] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In ECCV, 2016.\n[14] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten. Densely connected convolutional networks. In CVPR, 2017.\n[15] G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Q. Weinberger. Deep networks with stochastic depth. In ECCV, 2016.\n[16] P. Jain, V. Kulkarni, A. Thakurta, and O. Williams. To drop or not to drop: Robustness, consistency and differential privacy properties of dropout. arXiv preprint arXiv:1503.02031, 2015.\n[17] D. P. Kingma, T. Salimans, and M. Welling. Variational dropout and the local reparameterization trick. In NIPS, 2015.\n[18] D. P. Kingma and M. Welling. Auto-encoding variational bayes. In ICLR, 2014.\n[19] A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.\n[20] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.\n[21] G. Larsson, M. Maire, and G. Shakhnarovich. Fractalnet: Ultra-deep neural networks without residuals. In ICLR, 2017.\n[22] Z. Li, B. Gong, and T. Yang. Improved dropout for shallow and deep learning. In NIPS, 2016.\n[23] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.\n[24] X. Ma, Y. Gao, Z. Hu, Y. Yu, Y. Deng, and E. Hovy. Dropout with expectation-linear regularization. In ICLR, 2016.\n[25] A. Mnih and D. Rezende. Variational inference for monte carlo objectives. In ICML, 2016.\n[26] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529\u2013533, 2015.\n[27] H. Nam and B. Han. Learning multi-domain convolutional neural networks for visual tracking. In CVPR, 2016.\n[28] H. Noh, S. Hong, and B. Han. Learning deconvolution network for semantic segmentation. In ICCV, 2015.\n[29] H. Noh, S. Hong, and B. Han. Image question answering using convolutional neural network with dynamic parameter prediction. In CVPR, 2016.\n[30] T. Raiko, M. Berglund, G. Alain, and L. Dinh. Techniques for learning binary stochastic feedforward neural networks. In ICLR, 2015.\n[31] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.\n[32] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.\n[33] K. Soomro, A. R. Zamir, and M. Shah. UCF101: a dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.\n[34] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. 
JMLR, 15(1):1929\u20131958, 2014.\n[35] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In CVPR, 2015.\n[36] S. Wager, S. Wang, and P. S. Liang. Dropout training as adaptive regularization. In NIPS, 2013.\n[37] L. Wan, M. Zeiler, S. Zhang, Y. LeCun, and R. Fergus. Regularization of neural networks using dropconnect. In ICML, 2013.\n[38] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al. Google\u2019s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.\n[39] Z. Yang, X. He, J. Gao, L. Deng, and A. Smola. Stacked attention networks for image question answering. In CVPR, 2016.\n[40] S. Zagoruyko and N. Komodakis. Wide residual networks. In BMVC, 2016.\n", "award": [], "sourceid": 2653, "authors": [{"given_name": "Hyeonwoo", "family_name": "Noh", "institution": "POSTECH"}, {"given_name": "Tackgeun", "family_name": "You", "institution": "POSTECH"}, {"given_name": "Jonghwan", "family_name": "Mun", "institution": "POSTECH"}, {"given_name": "Bohyung", "family_name": "Han", "institution": "POSTECH"}]}