{"title": "Improved Dropout for Shallow and Deep Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 2523, "page_last": 2531, "abstract": "Dropout has achieved great success in training deep neural networks by independently zeroing out the outputs of neurons at random. It has also received a surge of interest for shallow learning, e.g., logistic regression. However, independent sampling for dropout can be suboptimal for convergence. In this paper, we propose to use multinomial sampling for dropout, i.e., sampling features or neurons according to a multinomial distribution with different probabilities for different features/neurons. To determine the optimal dropout probabilities, we analyze shallow learning with multinomial dropout and establish a risk bound for stochastic optimization. By minimizing a sampling-dependent factor in the risk bound, we obtain a distribution-dependent dropout with sampling probabilities dependent on the second-order statistics of the data distribution. To tackle the issue of the evolving distribution of neurons in deep learning, we propose an efficient adaptive dropout (named \\textbf{evolutional dropout}) that computes the sampling probabilities on-the-fly from a mini-batch of examples. Empirical studies on several benchmark datasets demonstrate that the proposed dropouts achieve not only much faster convergence but also a smaller testing error than the standard dropout. 
For example, on the CIFAR-100 data, the evolutional dropout achieves relative improvements of over 10\\% in prediction performance and over 50\\% in convergence speed compared to the standard dropout.", "full_text": "Improved Dropout for Shallow and Deep Learning\n\nZhe Li1, Boqing Gong2, Tianbao Yang1\n1The University of Iowa, Iowa City, IA 52245\n{zhe-li-1,tianbao-yang}@uiowa.edu\n2University of Central Florida, Orlando, FL 32816\nbgong@crcv.ucf.edu\n\nAbstract\n\nDropout has achieved great success in training deep neural networks by independently zeroing out the outputs of neurons at random. It has also received a surge of interest for shallow learning, e.g., logistic regression. However, independent sampling for dropout can be suboptimal for convergence. In this paper, we propose to use multinomial sampling for dropout, i.e., sampling features or neurons according to a multinomial distribution with different probabilities for different features/neurons. To determine the optimal dropout probabilities, we analyze shallow learning with multinomial dropout and establish a risk bound for stochastic optimization. By minimizing a sampling-dependent factor in the risk bound, we obtain a distribution-dependent dropout with sampling probabilities dependent on the second-order statistics of the data distribution. To tackle the issue of the evolving distribution of neurons in deep learning, we propose an efficient adaptive dropout (named evolutional dropout) that computes the sampling probabilities on-the-fly from a mini-batch of examples. Empirical studies on several benchmark datasets demonstrate that the proposed dropouts achieve not only much faster convergence but also a smaller testing error than the standard dropout. 
For example, on the CIFAR-100 data, the evolutional dropout achieves relative improvements of over 10% in prediction performance and over 50% in convergence speed compared to the standard dropout.\n\n1 Introduction\n\nDropout has been widely used to avoid overfitting of deep neural networks with a large number of parameters [9, 16]. It usually samples neurons identically and independently at random and sets their outputs to zero. Extensive experiments [4] have shown that dropout can help obtain state-of-the-art performance on a range of benchmark data sets. Recently, dropout has also been found to improve the performance of logistic regression and other single-layer models for natural language tasks such as document classification and named entity recognition [21].\nIn this paper, instead of zeroing out features or neurons identically and independently at random, we propose to use multinomial sampling for dropout, i.e., sampling features or neurons according to a multinomial distribution with different probabilities for different features/neurons. Intuitively, it makes more sense to use non-uniform multinomial sampling than identical and independent sampling for different features/neurons. For example, in shallow learning, if the input features are centered, we can drop out features with small variance more frequently or completely, allowing the training to focus on more important features and consequently enabling faster convergence. 
To justify the multinomial sampling for dropout and reveal the optimal sampling probabilities, we conduct a rigorous analysis of the risk bound of shallow learning by stochastic optimization with multinomial dropout, and demonstrate that a distribution-dependent dropout leads to a smaller expected risk (i.e., faster convergence and smaller generalization error).\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nInspired by the distribution-dependent dropout, we propose a data-dependent dropout for shallow learning, and an evolutional dropout for deep learning. For shallow learning, the sampling probabilities are computed from the second-order statistics of features of the training data. For deep learning, the sampling probabilities of dropout for a layer are computed on-the-fly from the second-order statistics of the layer\u2019s outputs based on a mini-batch of examples. This is particularly suited for deep learning because (i) the distribution of each layer\u2019s outputs evolves over time, which is known as internal covariate shift [5]; (ii) passing through all the training data in deep neural networks (in particular deep convolutional neural networks) is much more expensive than passing through a mini-batch of examples. For a mini-batch of examples, we can leverage parallel computing architectures to accelerate the computation of sampling probabilities.\nWe note that the proposed evolutional dropout achieves a similar effect to the batch normalization technique (Z-normalization based on a mini-batch of examples) [5] but with a different flavor. Both approaches can be considered to tackle the issue of internal covariate shift for accelerating the convergence. Batch normalization tackles the issue by normalizing the output of neurons to zero mean and unit variance and then performing dropout independently (see footnote 1). 
In contrast, our proposed evolutional dropout tackles this issue from another perspective by exploiting a distribution-dependent dropout, which adapts the sampling probabilities to the evolving distribution of a layer\u2019s outputs. In other words, it uses normalized sampling probabilities based on the second-order statistics of internal distributions. Indeed, we notice that for shallow learning with Z-normalization (normalizing each feature to zero mean and unit variance) the proposed data-dependent dropout reduces to uniform dropout, which acts similarly to the standard dropout. Because of this connection, the presented theoretical analysis also sheds some light on the power of batch normalization from a theoretical angle. Compared to batch normalization, the proposed distribution-dependent dropout is still attractive because (i) it is rooted in a theoretical analysis of the risk bound; (ii) it introduces no additional parameters or layers and does not complicate back-propagation or inference; (iii) it facilitates further research because it shares the same mathematical foundation as standard dropout (e.g., equivalent to a form of data-dependent regularizer) [18].\nWe summarize the main contributions of the paper below.\n\n\u2022 We propose a multinomial dropout and demonstrate that a distribution-dependent dropout leads to faster convergence and a smaller generalization error through the risk bound analysis for shallow learning.\n\u2022 We propose an efficient evolutional dropout for deep learning based on the distribution-dependent dropout.\n\u2022 We justify the proposed dropouts for both shallow learning and deep learning by experimental results on several benchmark datasets.\n\nIn the remainder, we first review some related work and preliminaries. 
We present the main results in Section 4 and experimental results in Section 5.\n\n2 Related Work\n\nIn this section, we review some related work on dropout and optimization algorithms for deep learning.\nDropout is a simple yet effective technique to prevent overfitting in training deep neural networks [16]. Its practical and theoretical properties have recently received much attention from researchers. Notably, Wager et al. [18] and Baldi and Sadowski [2] have analyzed dropout from a theoretical viewpoint and found that dropout is equivalent to a data-dependent regularizer. The simplest form of dropout is to multiply hidden units by i.i.d. Bernoulli noise. Several recent works also found that using other types of noise (e.g., Gaussian noise) works as well as Bernoulli noise, which could lead to a better approximation of the marginalized loss [20, 7]. Some works tried to optimize the hyper-parameters that define the noise level in a Bayesian framework [23, 7]. Graham et al. [3] used the same noise across a batch of examples in order to speed up the computation. The adaptive dropout proposed in [1] overlays a binary belief network over a neural network, incurring more computational overhead for dropout because one has to train the additional binary belief network. In contrast, the present work proposes a new dropout with noise sampled according to distribution-dependent sampling probabilities. To the best of our knowledge, this is the first work that rigorously studies this type of dropout with a theoretical analysis of the risk bound. It is demonstrated that the new dropout can improve the speed of convergence.\n\nFootnote 1: The authors also reported that in some cases dropout is not even necessary.\n\nStochastic gradient descent with back-propagation has been used extensively in optimizing deep neural networks. However, it is notorious for its slow convergence, especially for deep learning. 
Recently, a battery of studies has emerged that try to accelerate the optimization of deep learning [17, 12, 22, 5, 6], tackling the problem from different perspectives. Among them, we notice that the developed evolutional dropout for deep learning achieves a similar effect to batch normalization [5] in addressing the internal covariate shift issue (i.e., evolving distributions of internal hidden units).\n\n3 Preliminaries\n\nIn this section, we present some preliminaries, including the framework of risk minimization in machine learning and learning with dropout noise. We also introduce the multinomial dropout, which allows us to construct a distribution-dependent dropout as revealed in the next section.\nLet $(x, y)$ denote a feature vector and a label, where $x \\in \\mathbb{R}^d$ and $y \\in \\mathcal{Y}$. Denote by $\\mathcal{P}$ the joint distribution of $(x, y)$ and denote by $\\mathcal{D}$ the marginal distribution of $x$. The goal of risk minimization is to learn a prediction function $f(x)$ that minimizes the expected loss, i.e., $\\min_{f \\in \\mathcal{H}} \\mathrm{E}_{\\mathcal{P}}[\\ell(f(x), y)]$, where $\\ell(z, y)$ is a loss function (e.g., the logistic loss) that measures the inconsistency between $z$ and $y$, and $\\mathcal{H}$ is a class of prediction functions. In deep learning, the prediction function $f(x)$ is determined by a deep neural network. In shallow learning, one might be interested in learning a linear model $f(x) = w^\\top x$. In the following presentation, the analysis will focus on the risk minimization of a linear model, i.e.,\n\n$$\\min_{w \\in \\mathbb{R}^d} \\mathcal{L}(w) \\triangleq \\mathrm{E}_{\\mathcal{P}}[\\ell(w^\\top x, y)] \\qquad (1)$$\n\nIn this paper, we are interested in learning with dropout, i.e., the feature vector $x$ is corrupted by a dropout noise. In particular, let $\\epsilon \\sim \\mathcal{M}$ denote a dropout noise vector of dimension $d$, and the corrupted feature vector is given by $\\hat{x} = x \\circ \\epsilon$, where the operator $\\circ$ represents the element-wise multiplication. 
Let $\\hat{\\mathcal{P}}$ denote the joint distribution of the new data $(\\hat{x}, y)$ and $\\hat{\\mathcal{D}}$ denote the marginal distribution of $\\hat{x}$. With the corrupted data, the risk minimization becomes\n\n$$\\min_{w \\in \\mathbb{R}^d} \\hat{\\mathcal{L}}(w) \\triangleq \\mathrm{E}_{\\hat{\\mathcal{P}}}[\\ell(w^\\top (x \\circ \\epsilon), y)] \\qquad (2)$$\n\nIn standard dropout [18, 4], the entries of the noise vector $\\epsilon$ are sampled independently according to $\\Pr(\\epsilon_j = 0) = \\delta$ and $\\Pr(\\epsilon_j = \\frac{1}{1-\\delta}) = 1 - \\delta$, i.e., features are dropped with probability $\\delta$ and scaled by $\\frac{1}{1-\\delta}$ with probability $1 - \\delta$. We can also write $\\epsilon_j = \\frac{b_j}{1-\\delta}$, where $b_j \\in \\{0, 1\\}$, $j \\in [d]$, are i.i.d. Bernoulli random variables with $\\Pr(b_j = 1) = 1 - \\delta$. The scaling factor $\\frac{1}{1-\\delta}$ is added to ensure that $\\mathrm{E}_{\\epsilon}[\\hat{x}] = x$. With the standard dropout, all features are dropped or selected independently with equal probabilities. However, in practice some features could be more informative than others for learning purposes. Therefore, it makes more sense to assign different sampling probabilities to different features and make the features compete with each other.\nTo this end, we introduce the following multinomial dropout.\n\nDefinition 1. (Multinomial Dropout) A multinomial dropout is defined as $\\hat{x} = x \\circ \\epsilon$, where $\\epsilon_i = \\frac{m_i}{k p_i}$, $i \\in [d]$, and $\\{m_1, \\ldots, m_d\\}$ follow a multinomial distribution $\\mathrm{Mult}(p_1, \\ldots, p_d; k)$ with $\\sum_{i=1}^d p_i = 1$ and $p_i \\geq 0$.\n\nRemark: The multinomial dropout allows us to use non-uniform sampling probabilities $p_1, \\ldots, p_d$ for different features. The value of $m_i$ is the number of times that the $i$-th feature is selected in $k$ independent trials of selection. In each trial, the probability that the $i$-th feature is selected is given by $p_i$. As in the standard dropout, the normalization by $k p_i$ is to ensure that $\\mathrm{E}_{\\epsilon}[\\hat{x}] = x$. The parameter $k$ plays the same role as the parameter $1 - \\delta$ in standard dropout, which controls the number of features to be dropped. In particular, the expected total number of kept features using multinomial dropout is $k$ and that using standard dropout is $d(1 - \\delta)$. In the sequel, to make a fair comparison between the two dropouts, we let $k = d(1 - \\delta)$. In this case, when a uniform distribution $p_i = 1/d$ is used in multinomial dropout, which we refer to as uniform dropout, we have $\\epsilon_i = \\frac{m_i}{1-\\delta}$, which acts similarly to the standard dropout using i.i.d. Bernoulli random variables. Note that another way to make the sampling probabilities different is to still use independent Bernoulli random variables but with different probabilities for different features. However, multinomial dropout is more suitable because (i) it is easy to control the level of dropout by varying the value of $k$; (ii) it gives rise to natural competition among features because of the constraint $\\sum_i p_i = 1$; (iii) it allows us to minimize the sampling-dependent risk bound to obtain a better distribution than uniform sampling.\n\nDropout is a data-dependent regularizer. Dropout as a regularizer has been studied in [18, 2] for logistic regression; we state the result in the following proposition for ease of later discussion.\nProposition 1. If $\\ell(z, y) = \\log(1 + \\exp(-yz))$, then\n\n$$\\mathrm{E}_{\\hat{\\mathcal{P}}}[\\ell(w^\\top \\hat{x}, y)] = \\mathrm{E}_{\\mathcal{P}}[\\ell(w^\\top x, y)] + R_{\\mathcal{D},\\mathcal{M}}(w) \\qquad (3)$$\n\nwhere $\\mathcal{M}$ denotes the distribution of $\\epsilon$ and $R_{\\mathcal{D},\\mathcal{M}}(w) = \\mathrm{E}_{\\mathcal{D},\\mathcal{M}}\\left[\\log \\frac{\\exp(w^\\top \\frac{x \\circ \\epsilon}{2}) + \\exp(-w^\\top \\frac{x \\circ \\epsilon}{2})}{\\exp(w^\\top x/2) + \\exp(-w^\\top x/2)}\\right]$.\nRemark: It is notable that $R_{\\mathcal{D},\\mathcal{M}} \\geq 0$ due to the Jensen inequality. Using the second-order Taylor expansion, [18] showed that the following approximation of $R_{\\mathcal{D},\\mathcal{M}}(w)$ is easy to manipulate and understand:\n\n$$\\hat{R}_{\\mathcal{D},\\mathcal{M}}(w) = \\frac{1}{2} \\mathrm{E}_{\\mathcal{D}}[q(w^\\top x)(1 - q(w^\\top x)) w^\\top C_{\\mathcal{M}}[x \\circ \\epsilon] w] \\qquad (4)$$\n\nwhere $q(w^\\top x) = \\frac{1}{1 + \\exp(-w^\\top x/2)}$, and $C_{\\mathcal{M}}$ denotes the covariance matrix in terms of $\\epsilon$. In particular, if $\\epsilon$ is the standard dropout noise, then $C_{\\mathcal{M}}[x \\circ \\epsilon] = \\mathrm{diag}(x_1^2 \\delta/(1-\\delta), \\ldots, x_d^2 \\delta/(1-\\delta))$, where $\\mathrm{diag}(s_1, \\ldots, s_d)$ denotes a $d \\times d$ diagonal matrix with the $i$-th diagonal entry equal to $s_i$. If $\\epsilon$ is the multinomial dropout noise in Definition 1, we have\n\n$$C_{\\mathcal{M}}[x \\circ \\epsilon] = \\frac{1}{k} \\mathrm{diag}(x_i^2/p_i) - \\frac{1}{k} x x^\\top \\qquad (5)$$\n\n4 Learning with Multinomial Dropout\n\nIn this section, we analyze a stochastic optimization approach for minimizing the dropout loss in (2). Assume the sampling probabilities are known. We first obtain a risk bound of learning with multinomial dropout for stochastic optimization. Then we try to minimize the factors in the risk bound that depend on the sampling probabilities. We would like to emphasize that our goal here is not to show that using dropout would render a smaller risk than not using dropout, but rather to focus on the impact of different sampling probabilities on the risk. Let the initial solution be $w_1$. At iteration $t$, we sample $(x_t, y_t) \\sim \\mathcal{P}$ and $\\epsilon_t \\sim \\mathcal{M}$ as in Definition 1 and then update the model by\n\n$$w_{t+1} = w_t - \\eta_t \\nabla \\ell(w_t^\\top (x_t \\circ \\epsilon_t), y_t) \\qquad (6)$$\n\nwhere $\\nabla \\ell$ denotes the (sub)gradient in terms of $w_t$ and $\\eta_t$ is a step size. Suppose we run the stochastic optimization for $n$ steps (i.e., using $n$ examples) and compute the final solution as $\\hat{w}_n = \\frac{1}{n} \\sum_{t=1}^n w_t$.\nWe note that another approach to learning with dropout is to minimize the empirical risk by marginalizing out the dropout noise, i.e., replacing the true expectations $\\mathrm{E}_{\\mathcal{P}}$ and $\\mathrm{E}_{\\mathcal{D}}$ in (3) with empirical expectations over a set of samples $(x_1, y_1), \\ldots, (x_n, y_n)$, denoted by $\\mathrm{E}_{\\mathcal{P}_n}$ and $\\mathrm{E}_{\\mathcal{D}_n}$. Since the data-dependent regularizer $R_{\\mathcal{D}_n,\\mathcal{M}}(w)$ is difficult to compute, one usually uses an approximation $\\hat{R}_{\\mathcal{D}_n,\\mathcal{M}}(w)$ (e.g., as in (4)) in place of $R_{\\mathcal{D}_n,\\mathcal{M}}(w)$. However, the resulting problem is a non-convex optimization, which together with the approximation error would make the risk analysis much more involved. In contrast, the update in (6) can be considered as a stochastic gradient descent update for solving the convex optimization problem in (2), allowing us to establish the risk bound based on previous results of stochastic gradient descent for risk minimization [14, 15]. Nonetheless, this restriction does not lose generality. Indeed, stochastic optimization is usually employed for solving empirical loss minimization in big data and deep learning.\n\nThe following theorem establishes a risk bound of $\\hat{w}_n$ in expectation.\n\nTheorem 1. Let $\\mathcal{L}(w)$ be the expected risk of $w$ defined in (1). Assume $\\mathrm{E}_{\\hat{\\mathcal{D}}}[\\|x \\circ \\epsilon\\|_2^2] \\leq B^2$ and $\\ell(z, y)$ is $G$-Lipschitz continuous. For any $\\|w_*\\|_2 \\leq r$, by appropriately choosing $\\eta$, we can have\n\n$$\\mathrm{E}[\\mathcal{L}(\\hat{w}_n) + R_{\\mathcal{D},\\mathcal{M}}(\\hat{w}_n)] \\leq \\mathcal{L}(w_*) + R_{\\mathcal{D},\\mathcal{M}}(w_*) + \\frac{GBr}{\\sqrt{n}}$$\n\nwhere $\\mathrm{E}[\\cdot]$ takes the expectation over the randomness in $(x_t, y_t, \\epsilon_t)$, $t = 1, \\ldots, n$.\nRemark: In the above theorem, we can choose $w_*$ to be the best model that minimizes the expected risk in (1). 
Since $R_{\\mathcal{D},\\mathcal{M}}(w) \\geq 0$, the upper bound in the theorem above is also an upper bound of the risk of $\\hat{w}_n$, i.e., $\\mathcal{L}(\\hat{w}_n)$, in expectation. The proof of the above theorem follows the standard analysis of stochastic gradient descent. The detailed proof of the theorem is included in the appendix.\n\n4.1 Distribution Dependent Dropout\n\nNext, we consider the sampling-dependent factors in the risk bound. From Theorem 1, we can see that there are two terms that depend on the sampling probabilities, i.e., $B^2$, the upper bound of $\\mathrm{E}_{\\hat{\\mathcal{D}}}[\\|x \\circ \\epsilon\\|_2^2]$, and $R_{\\mathcal{D},\\mathcal{M}}(w_*) - R_{\\mathcal{D},\\mathcal{M}}(\\hat{w}_n) \\leq R_{\\mathcal{D},\\mathcal{M}}(w_*)$. We note that the second term also depends on $w_*$ and $\\hat{w}_n$, which makes it more difficult to optimize. We first try to minimize $\\mathrm{E}_{\\hat{\\mathcal{D}}}[\\|x \\circ \\epsilon\\|_2^2]$ and present the discussion on minimizing $R_{\\mathcal{D},\\mathcal{M}}(w_*)$ later. From Theorem 1, we can see that minimizing $\\mathrm{E}_{\\hat{\\mathcal{D}}}[\\|x \\circ \\epsilon\\|_2^2]$ would lead to not only a smaller risk (given the same number of total examples, a smaller $\\mathrm{E}_{\\hat{\\mathcal{D}}}[\\|x \\circ \\epsilon\\|_2^2]$ gives a smaller risk bound) but also faster convergence (with the same number of iterations, a smaller $\\mathrm{E}_{\\hat{\\mathcal{D}}}[\\|x \\circ \\epsilon\\|_2^2]$ gives a smaller optimization error).\nDue to limited space, the proofs of Propositions 2, 3, and 4 are included in the supplement. The following proposition simplifies the expectation $\\mathrm{E}_{\\hat{\\mathcal{D}}}[\\|x \\circ \\epsilon\\|_2^2]$.\nProposition 2. Let $\\epsilon$ follow the distribution $\\mathcal{M}$ defined in Definition 1. Then\n\n$$\\mathrm{E}_{\\hat{\\mathcal{D}}}[\\|x \\circ \\epsilon\\|_2^2] = \\frac{1}{k} \\sum_{i=1}^d \\frac{1}{p_i} \\mathrm{E}_{\\mathcal{D}}[x_i^2] + \\frac{k-1}{k} \\sum_{i=1}^d \\mathrm{E}_{\\mathcal{D}}[x_i^2] \\qquad (7)$$\n\nGiven the expression of $\\mathrm{E}_{\\hat{\\mathcal{D}}}[\\|x \\circ \\epsilon\\|_2^2]$ in Proposition 2, we can minimize it over $p$, leading to the following result.\nProposition 3. The solution to $p^* = \\arg\\min_{p \\geq 0, p^\\top 1 = 1} \\mathrm{E}_{\\hat{\\mathcal{D}}}[\\|x \\circ \\epsilon\\|_2^2]$ is given by\n\n$$p_i^* = \\frac{\\sqrt{\\mathrm{E}_{\\mathcal{D}}[x_i^2]}}{\\sum_{j=1}^d \\sqrt{\\mathrm{E}_{\\mathcal{D}}[x_j^2]}}, \\quad i = 1, \\ldots, d \\qquad (8)$$\n\nNext, we examine $R_{\\mathcal{D},\\mathcal{M}}(w_*)$. Since direct manipulation of $R_{\\mathcal{D},\\mathcal{M}}(w_*)$ is difficult, we try to minimize the second-order Taylor expansion $\\hat{R}_{\\mathcal{D},\\mathcal{M}}(w_*)$ for the logistic loss. The following proposition establishes an upper bound of $\\hat{R}_{\\mathcal{D},\\mathcal{M}}(w_*)$.\nProposition 4. Let $\\epsilon$ follow the distribution $\\mathcal{M}$ defined in Definition 1. We have\n\n$$\\hat{R}_{\\mathcal{D},\\mathcal{M}}(w_*) \\leq \\frac{1}{8k} \\|w_*\\|_2^2 \\left( \\sum_{i=1}^d \\frac{\\mathrm{E}_{\\mathcal{D}}[x_i^2]}{p_i} - \\mathrm{E}_{\\mathcal{D}}[\\|x\\|_2^2] \\right)$$\n\nRemark: By minimizing the relaxed upper bound in Proposition 4, we obtain the same sampling probabilities as in (8). We note that a tighter upper bound can be established; however, it would yield sampling probabilities dependent on the unknown $w_*$.\n\nIn summary, using the probabilities in (8), we can reduce both $\\mathrm{E}_{\\hat{\\mathcal{D}}}[\\|x \\circ \\epsilon\\|_2^2]$ and $R_{\\mathcal{D},\\mathcal{M}}(w_*)$ in the risk bound, leading to faster convergence and a smaller generalization error. In practice, we can use empirical second-order statistics to compute the probabilities, i.e.,\n\n$$p_i = \\frac{\\sqrt{\\frac{1}{n} \\sum_{j=1}^n [x_j]_i^2}}{\\sum_{i'=1}^d \\sqrt{\\frac{1}{n} \\sum_{j=1}^n [x_j]_{i'}^2}} \\qquad (9)$$\n\nwhere $[x_j]_i$ denotes the $i$-th feature of the $j$-th example, which gives us a data-dependent dropout. We state it formally in the following definition.\n\nDefinition 2. (Data-dependent Dropout) Given a set of training examples $(x_1, y_1), \\ldots, (x_n, y_n)$, a data-dependent dropout is defined as $\\hat{x} = x \\circ \\epsilon$, where $\\epsilon_i = \\frac{m_i}{k p_i}$, $i \\in [d]$, and $\\{m_1, \\ldots, m_d\\}$ follow a multinomial distribution $\\mathrm{Mult}(p_1, \\ldots, p_d; k)$ with $p_i$ given by (9).\nRemark: Note that if the data is normalized such that each feature has zero mean and unit variance (i.e., according to Z-normalization), the data-dependent dropout reduces to uniform dropout. It implies that the data-dependent dropout achieves a similar effect to Z-normalization plus uniform dropout. In this sense, our theoretical analysis also explains why Z-normalization usually speeds up the training [13].\n\n4.2 Evolutional Dropout for Deep Learning\n\nEvolutional Dropout for Deep Learning\nInput: a batch of outputs of a layer: $X^l = (x_1^l, \\ldots, x_m^l)$ and dropout level parameter $k \\in [0, d]$\nOutput: $\\hat{X}^l = X^l \\circ \\Sigma^l$\nCompute sampling probabilities by (10)\nFor $j = 1, \\ldots, m$\n    Sample $m_j^l \\sim \\mathrm{Mult}(p_1^l, \\ldots, p_d^l; k)$\n    Construct $\\epsilon_j^l = \\frac{m_j^l}{k p^l} \\in \\mathbb{R}^d$, where $p^l = (p_1^l, \\ldots, p_d^l)^\\top$\nLet $\\Sigma^l = (\\epsilon_1^l, \\ldots, \\epsilon_m^l)$ and compute $\\hat{X}^l = X^l \\circ \\Sigma^l$\n\nFigure 1: Evolutional Dropout applied to a layer over a mini-batch\n\nNext, we discuss how to implement the distribution-dependent dropout for deep learning. In training deep neural networks, dropout is usually added to the intermediate layers (e.g., fully connected layers and convolutional layers). Let $x^l = (x_1^l, \\ldots, x_d^l)$ denote the outputs of the $l$-th layer (with the index of data omitted). Adding dropout to this layer is equivalent to multiplying $x^l$ by a dropout noise vector $\\epsilon^l$, i.e., feeding $\\hat{x}^l = x^l \\circ \\epsilon^l$ as the input to the next layer. 
Inspired by the data-dependent dropout, we can generate $\\epsilon^l$ according to a distribution given in Definition 1 with sampling probabilities $p_i^l$ computed from $\\{x_1^l, \\ldots, x_n^l\\}$ similarly to (9). However, deep learning is usually trained with big data and a deep neural network is optimized by mini-batch stochastic gradient descent. Therefore, at each iteration it would be too expensive to pass through all examples. To address this issue, we propose to use a mini-batch of examples to calculate the second-order statistics, similar to what was done in batch normalization. Let $X^l = (x_1^l, \\ldots, x_m^l)$ denote the outputs of the $l$-th layer for a mini-batch of $m$ examples. Then we can calculate the probabilities for dropout by\n\n$$p_i^l = \\frac{\\sqrt{\\frac{1}{m} \\sum_{j=1}^m [x_j^l]_i^2}}{\\sum_{i'=1}^d \\sqrt{\\frac{1}{m} \\sum_{j=1}^m [x_j^l]_{i'}^2}}, \\quad i = 1, \\ldots, d \\qquad (10)$$\n\nwhich define the evolutional dropout, named as such because the probabilities $p_i^l$ will also evolve as the distribution of the layer\u2019s outputs evolves. We describe the evolutional dropout as applied to a layer of a deep neural network in Figure 1.\nFinally, we would like to compare the evolutional dropout with batch normalization. Similar to batch normalization, evolutional dropout can also address the internal covariate shift issue by adapting the sampling probabilities to the evolving distribution of layers\u2019 outputs. 
However, different from batch normalization, evolutional dropout is a randomized technique, which enjoys many of the same benefits as standard dropout, including (i) the back-propagation is simple to implement (just multiplying the gradient of $\\hat{X}^l$ by the dropout mask to get the gradient of $X^l$); (ii) the inference (i.e., testing) remains the same (see footnote 2); (iii) it is equivalent to a data-dependent regularizer with a clear mathematical explanation; (iv) it prevents co-adaptation of neurons, which facilitates generalization. Moreover, the evolutional dropout has its root in distribution-dependent dropout, which has a theoretical guarantee to accelerate the convergence and improve the generalization for shallow learning.\n\nFootnote 2: Different from some implementations of standard dropout, which do not scale by $1/(1 - \\delta)$ in training but scale by $1 - \\delta$ in testing, here we do scale in training and thus do not need any scaling in testing.\n\n5 Experimental Results\n\nIn this section, we present some experimental results to justify the proposed dropouts. In all experiments, we set $\\delta = 0.5$ in the standard dropout and $k = 0.5d$ in the proposed dropouts for fair comparison, where $d$ represents the number of features or neurons of the layer that dropout is applied to. For the sake of clarity, we divide the experiments into three parts. In the first part, we compare the performance of the data-dependent dropout (d-dropout) to the standard dropout (s-dropout) for logistic regression. In the second part, we compare the performance of evolutional dropout (e-dropout) to the standard dropout for training deep convolutional neural networks. Finally, we compare e-dropout with batch normalization.\n\nFigure 2: Left three: data-dependent dropout vs. standard dropout on three data sets (real-sim, news20, RCV1) for logistic regression; Right: Evolutional dropout vs. BN on CIFAR-10 
(best seen in color).\n\n5.1 Shallow Learning\n\nWe implement the presented stochastic optimization algorithm. To evaluate the performance of data-dependent dropout for shallow learning, we use three data sets: real-sim, news20 and RCV1 (see footnote 3). In this experiment, we use a fixed step size, tune it over [0.1, 0.05, 0.01, 0.005, 0.001, 0.0005, 0.0001], and report the best results in terms of convergence speed on the training data for both standard dropout and data-dependent dropout. The left three panels in Figure 2 show the obtained results on these three data sets. In each figure, we plot both the training error and the testing error. We can see that both the training and testing errors using the proposed data-dependent dropout decrease much faster than using the standard dropout, and a smaller testing error is also achieved by using the data-dependent dropout.\n\n5.2 Evolutional Dropout for Deep Learning\n\nWe would like to emphasize that we are not aiming to obtain better prediction performance by trying different network structures and different engineering tricks such as data augmentation, whitening, etc., but rather focus on the comparison of the proposed dropout to the standard dropout using Bernoulli noise on the same network structure. In our experiments, we use the default splitting of training and testing data in all data sets. We directly optimize the neural networks using all training images without further splitting them into a validation set to be added to the training in later stages, which explains some marginal gaps from the literature results that we observed (e.g., on CIFAR-10 compared with [19]).\nWe conduct experiments on four benchmark data sets for comparing e-dropout and s-dropout: MNIST [10], SVHN [11], CIFAR-10 and CIFAR-100 [8]. We use the same or similar network structure as in the literature for the four data sets. 
In general, the networks consist of convolution layers, pooling layers, locally connected layers, fully connected layers, softmax layers and a cost layer. For the detailed neural network structures and their parameters, please refer to the supplementary materials. The dropout is added to some fully connected layers or locally connected layers. The rectified linear activation function is used for all neurons. All the experiments are conducted using the cuda-convnet library (see footnote 4). The training procedure is similar to [9] using mini-batch SGD with momentum (0.9). The size of the mini-batch is fixed to 128. The weights are initialized based on the Gaussian distribution with mean zero and standard deviation 0.01. The learning rate (i.e., step size) is decreased after a number of epochs similar to what was done in previous works [9]. We tune the initial learning rates for s-dropout and e-dropout separately from 0.001, 0.005, 0.01, 0.1 and report the best result on each data set that yields the fastest convergence.\n\nFootnote 3: https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/\nFootnote 4: https://code.google.com/archive/p/cuda-convnet/\n\n[Figure 2 plots: x-axis, number of iterations; y-axis, error (left three panels; curves: s-dropout (tr), s-dropout (te), d-dropout (tr), d-dropout (te)) or test accuracy (right panel; curves: no BN and no Dropout, BN, BN+Dropout, Evolutional Dropout).]\n\nFigure 3: Evolutional dropout vs. standard dropout on four benchmark datasets for deep learning: (a) MNIST, (b) SVHN, (c) CIFAR-10, (d) CIFAR-100 (best seen in color).\n\nFigure 3 shows the training and testing error curves in the optimization process on the four data sets using the standard dropout and the evolutional dropout. 
For SVHN data, we only report the first\n12000 iterations, after which the error curves of the two methods almost overlap. We can see that\nusing the evolutional dropout generally converges faster than using the standard dropout. On CIFAR-100\ndata, we have observed significant speed-up. In particular, the evolutional dropout achieves\nrelative improvements over 10% on the testing performance and over 50% on the convergence speed\ncompared to the standard dropout.\n\n5.3 Comparison with the Batch Normalization (BN)\n\nFinally, we make a comparison between the evolutional dropout and the batch normalization. For\nbatch normalization, we use the implementation in Caffe (see footnote 5). We compare the evolutional dropout with\nthe batch normalization on the CIFAR-10 data set. The network structure is from the Caffe package and\ncan be found in the supplement, which is different from the one used in the previous experiment.\nIt contains three convolutional layers and one fully connected layer. Each convolutional layer is\nfollowed by a pooling layer. We compare four methods: (1) No BN and No dropout - without using\nbatch normalization and dropout; (2) BN; (3) BN with standard dropout; (4) Evolutional Dropout.\nThe rectified linear activation is used in all methods. We also tried BN with the sigmoid activation\nfunction, which gives worse results. For the methods with BN, three batch normalization layers are\ninserted before or after each pooling layer following the architecture given in the Caffe package (see\nsupplement). For the evolutional dropout training, only one layer of dropout is added to the last\nconvolutional layer. The mini-batch size is set to 100, the default value in Caffe. 
The initial learning\nrates for the four methods are set to the same value (0.001), and they are decreased once by a factor of ten.\nThe testing accuracy versus the number of iterations is plotted in the right panel of Figure 2, from\nwhich we can see that the evolutional dropout training achieves comparable performance with BN\n+ standard dropout, which justifies our claim that evolutional dropout also addresses the internal\ncovariate shift issue.\n\n6 Conclusion\n\nIn this paper, we have proposed a distribution-dependent dropout for both shallow learning and\ndeep learning. Theoretically, we proved that the new dropout achieves a smaller risk and faster\nconvergence. Based on the distribution-dependent dropout, we developed an efficient evolutional\ndropout for training deep neural networks that adapts the sampling probabilities to the evolving\ndistributions of layers\u2019 outputs. Experimental results on various data sets verified that the proposed\ndropouts can dramatically improve the convergence and also reduce the testing error.\n\nAcknowledgments\n\nWe thank anonymous reviewers for their comments. Z. Li and T. Yang are partially supported by\nNational Science Foundation (IIS-1463988, IIS-1545995). B. Gong is supported in part by NSF\n(IIS-1566511) and a gift from Adobe.\n\n5 https://github.com/BVLC/caffe/\n\n[Figure 3 plots: four panels of error vs. # of iters for s-dropout and e-dropout (training and testing).]\n\nReferences\n[1] Jimmy Ba and Brendan Frey. Adaptive dropout for training deep neural networks. 
In Advances in Neural Information Processing Systems, pages 3084\u20133092, 2013.\n[2] Pierre Baldi and Peter J Sadowski. Understanding dropout. In Advances in Neural Information Processing Systems, pages 2814\u20132822, 2013.\n[3] Benjamin Graham, Jeremy Reizenstein, and Leigh Robinson. Efficient batchwise dropout training using submatrices. CoRR, abs/1502.02478, 2015.\n[4] Geoffrey E Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.\n[5] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.\n[6] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.\n[7] Diederik P. Kingma, Tim Salimans, and Max Welling. Variational dropout and the local reparameterization trick. CoRR, abs/1506.02557, 2015.\n[8] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images, 2009.\n[9] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097\u20131105, 2012.\n[10] Yann LeCun, L\u00e9on Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278\u20132324, 1998.\n[11] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. In NIPS workshop on deep learning and unsupervised feature learning, volume 2011, page 4. Granada, Spain, 2011.\n[12] Behnam Neyshabur, Ruslan R Salakhutdinov, and Nati Srebro. Path-sgd: Path-normalized optimization in deep neural networks. 
In Advances in Neural Information Processing Systems, pages 2413\u20132421, 2015.\n[13] Marc\u2019Aurelio Ranzato, Alex Krizhevsky, and Geoffrey E. Hinton. Factored 3-way restricted boltzmann machines for modeling natural images. In AISTATS, pages 621\u2013628, 2010.\n[14] Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. Stochastic convex optimization. In The 22nd Conference on Learning Theory (COLT), 2009.\n[15] Nathan Srebro, Karthik Sridharan, and Ambuj Tewari. Smoothness, low noise and fast rates. In Advances in Neural Information Processing Systems 23 (NIPS), pages 2199\u20132207, 2010.\n[16] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929\u20131958, 2014.\n[17] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In Proceedings of the 30th international conference on machine learning (ICML-13), pages 1139\u20131147, 2013.\n[18] Stefan Wager, Sida Wang, and Percy S Liang. Dropout training as adaptive regularization. In Advances in Neural Information Processing Systems, pages 351\u2013359, 2013.\n[19] Li Wan, Matthew Zeiler, Sixin Zhang, Yann L Cun, and Rob Fergus. Regularization of neural networks using dropconnect. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 1058\u20131066, 2013.\n[20] Sida Wang and Christopher Manning. Fast dropout training. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 118\u2013126, 2013.\n[21] Sida I Wang, Mengqiu Wang, Stefan Wager, Percy Liang, and Christopher D Manning. Feature noising for log-linear structured prediction. In EMNLP, pages 1170\u20131179, 2013.\n[22] Sixin Zhang, Anna Choromanska, and Yann LeCun. 
Deep learning with elastic averaging sgd. arXiv preprint arXiv:1412.6651, 2014.\n[23] Jingwei Zhuo, Jun Zhu, and Bo Zhang. Adaptive dropout rates for learning with corrupted features. In IJCAI, pages 4126\u20134133, 2015.\n", "award": [], "sourceid": 1311, "authors": [{"given_name": "Zhe", "family_name": "Li", "institution": "The University of Iowa"}, {"given_name": "Boqing", "family_name": "Gong", "institution": "University of Central Florida"}, {"given_name": "Tianbao", "family_name": "Yang", "institution": "University of Iowa"}]}