{"title": "Dropout Training as Adaptive Regularization", "book": "Advances in Neural Information Processing Systems", "page_first": 351, "page_last": 359, "abstract": "Dropout and other feature noising schemes control overfitting by artificially corrupting the training data. For generalized linear models, dropout performs a form of adaptive regularization. Using this viewpoint, we show that the dropout regularizer is first-order equivalent to an $L_2$ regularizer applied after scaling the features by an estimate of the inverse diagonal Fisher information matrix. We also establish a connection to AdaGrad, an online learner, and find that a close relative of AdaGrad operates by repeatedly solving linear dropout-regularized problems. By casting dropout as regularization, we develop a natural semi-supervised algorithm that uses unlabeled data to create a better adaptive regularizer. We apply this idea to document classification tasks, and show that it consistently boosts the performance of dropout training, improving on state-of-the-art results on the IMDB reviews dataset.", "full_text": "Dropout Training as Adaptive Regularization

Stefan Wager*, Sida Wang†, and Percy Liang†
Departments of Statistics* and Computer Science†
Stanford University, Stanford, CA 94305
swager@stanford.edu, {sidaw, pliang}@cs.stanford.edu

Abstract

Dropout and other feature noising schemes control overfitting by artificially corrupting the training data. For generalized linear models, dropout performs a form of adaptive regularization. Using this viewpoint, we show that the dropout regularizer is first-order equivalent to an L2 regularizer applied after scaling the features by an estimate of the inverse diagonal Fisher information matrix. We also establish a connection to AdaGrad, an online learning algorithm, and find that a close relative of AdaGrad operates by repeatedly solving linear dropout-regularized problems.
By casting dropout as regularization, we develop a natural semi-supervised algorithm that uses unlabeled data to create a better adaptive regularizer. We apply this idea to document classification tasks, and show that it consistently boosts the performance of dropout training, improving on state-of-the-art results on the IMDB reviews dataset.

1 Introduction

Dropout training was introduced by Hinton et al. [1] as a way to control overfitting by randomly omitting subsets of features at each iteration of a training procedure.¹ Although dropout has proved to be a very successful technique, the reasons for its success are not yet well understood at a theoretical level.

Dropout training falls into the broader category of learning methods that artificially corrupt training data to stabilize predictions [2, 4, 5, 6, 7]. There is a well-known connection between artificial feature corruption and regularization [8, 9, 10]. For example, Bishop [9] showed that the effect of training with features that have been corrupted with additive Gaussian noise is equivalent to a form of L2-type regularization in the low noise limit. In this paper, we take a step towards understanding how dropout training works by analyzing it as a regularizer. We focus on generalized linear models (GLMs), a class of models for which feature dropout reduces to a form of adaptive model regularization.

Using this framework, we show that dropout training is first-order equivalent to L2-regularization after transforming the input by diag(Î)^{-1/2}, where Î is an estimate of the Fisher information matrix. This transformation effectively makes the level curves of the objective more spherical, and so balances out the regularization applied to different features.
In the case of logistic regression, dropout can be interpreted as a form of adaptive L2-regularization that favors rare but useful features.

The problem of learning with rare but useful features is discussed in the context of online learning by Duchi et al. [11], who show that their AdaGrad adaptive descent procedure achieves better regret bounds than regular stochastic gradient descent (SGD) in this setting. Here, we show that AdaGrad and dropout training have an intimate connection: just as SGD progresses by repeatedly solving linearized L2-regularized problems, a close relative of AdaGrad advances by solving linearized dropout-regularized problems.

Our formulation of dropout training as adaptive regularization also leads to a simple semi-supervised learning scheme, where we use unlabeled data to learn a better dropout regularizer. The approach is fully discriminative and does not require fitting a generative model. We apply this idea to several document classification problems, and find that it consistently improves the performance of dropout training. On the benchmark IMDB reviews dataset introduced by [12], dropout logistic regression with a regularizer tuned on unlabeled data outperforms the previous state of the art.

*S.W. is supported by a B.C. and E.J. Eaves Stanford Graduate Fellowship.
¹Hinton et al. introduced dropout training in the context of neural networks specifically, and also advocated omitting random hidden layers during training. In this paper, we follow [2, 3] and study feature dropout as a generic training method that can be applied to any learning algorithm.
In follow-up research [13], we extend the results from this paper to more complicated structured prediction settings, such as multi-class logistic regression and linear chain conditional random fields.

2 Artificial Feature Noising as Regularization

We begin by discussing the general connections between feature noising and regularization in generalized linear models (GLMs). We will apply the machinery developed here to dropout training in Section 4.

A GLM defines a conditional distribution over a response y ∈ Y given an input feature vector x ∈ R^d:

p(y | x) def= h(y) exp{y x · β − A(x · β)},   ℓ_{x,y}(β) def= −log p(y | x).   (1)

Here, h(y) is a quantity independent of x and β, A(·) is the log-partition function, and ℓ_{x,y}(β) is the loss function (i.e., the negative log likelihood); Table 1 contains a summary of notation. Common examples of GLMs include linear (Y = R), logistic (Y = {0, 1}), and Poisson (Y = {0, 1, 2, ...}) regression.

Given n training examples (x_i, y_i), the standard maximum likelihood estimate β̂ ∈ R^d minimizes the empirical loss over the training examples:

β̂ def= argmin_{β ∈ R^d} Σ_{i=1}^n ℓ_{x_i, y_i}(β).   (2)

With artificial feature noising, we replace the observed feature vectors x_i with noisy versions x̃_i = ν(x_i, ξ_i), where ν is our noising function and ξ_i is an independent random variable. We first create many noisy copies of the dataset, and then average out the auxiliary noise. In this paper, we consider two types of noise:

• Additive Gaussian noise: ν(x_i, ξ_i) = x_i + ξ_i, where ξ_i ∼ N(0, σ² I_{d×d}).
• Dropout noise: ν(x_i, ξ_i) = x_i ⊙ ξ_i, where ⊙ is the elementwise product of two vectors. Each component of ξ_i ∈ {0, (1 − δ)^{−1}}^d is an independent draw from a scaled Bernoulli(1 − δ) random variable. In other words, dropout noise corresponds to setting x̃_ij to 0 with probability δ and to x_ij/(1 − δ) otherwise.²

Integrating over the feature noise gives us a noised maximum likelihood parameter estimate:

β̂ = argmin_{β ∈ R^d} Σ_{i=1}^n E_ξ[ℓ_{x̃_i, y_i}(β)],   where E_ξ[Z] def= E[Z | {x_i, y_i}]   (3)

is the expectation taken with respect to the artificial feature noise ξ = (ξ_1, ..., ξ_n). Similar expressions have been studied by [9, 10].

For GLMs, the noised empirical loss takes on a simpler form:

Σ_{i=1}^n E_ξ[ℓ_{x̃_i, y_i}(β)] = Σ_{i=1}^n (−y_i x_i · β + E_ξ[A(x̃_i · β)]) = Σ_{i=1}^n ℓ_{x_i, y_i}(β) + R(β).   (4)

²Artificial noise of the form x_i ⊙ ξ_i is also called blankout noise. For GLMs, blankout noise is equivalent to dropout noise as defined by [1].

Table 1: Summary of notation.
x_i        Observed feature vector      R(β)     Noising penalty (5)
x̃_i        Noised feature vector        Rq(β)    Quadratic approximation (6)
A(x · β)   Log-partition function       ℓ(β)     Negative log-likelihood (loss)

The first equality holds provided that E_ξ[x̃_i] = x_i, and the second is true with the following definition:

R(β) def= Σ_{i=1}^n E_ξ[A(x̃_i · β)] − A(x_i · β).   (5)

Here, R(β) acts as a regularizer that incorporates the effect of artificial feature noising. In GLMs, the log-partition function A must always be convex, and so R is always positive by Jensen's inequality.

The key observation here is that the effect of artificial feature noising reduces to a penalty R(β) that does not depend on the labels {y_i}. Because of this, artificial feature noising penalizes the complexity of a classifier in a way that does not depend on the accuracy of the classifier.
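The dropout noising scheme and the nonnegativity of the penalty R(β) in (5) are easy to check numerically. The following is a minimal Monte Carlo sketch for logistic regression, where A(z) = log(1 + e^z); the data and variable names are illustrative, not taken from the paper's experiments:

```python
import numpy as np

# Monte Carlo sketch of the noising penalty R(beta) from (5) under dropout
# noise, for logistic regression where A(z) = log(1 + e^z).
rng = np.random.default_rng(0)
delta = 0.5                                  # dropout probability (illustrative)
x = np.array([1.0, 0.0, 2.0, 1.0])           # a single feature vector
beta = np.array([0.5, -1.0, 0.25, 1.5])

def A(z):                                    # logistic log-partition function
    return np.log1p(np.exp(z))

# Dropout noise: each coordinate is 0 w.p. delta, x_j / (1 - delta) otherwise.
xi = rng.binomial(1, 1 - delta, size=(100_000, x.size)) / (1 - delta)
x_tilde = x * xi

# Unbiasedness: E[x_tilde] = x, so the linear term of the loss is unchanged.
assert np.allclose(x_tilde.mean(axis=0), x, atol=0.05)

# R = E[A(x_tilde . beta)] - A(x . beta) >= 0 by Jensen, since A is convex.
R = A(x_tilde @ beta).mean() - A(x @ beta)
assert R > 0
```

The two assertions mirror the two facts used in the derivation of (4)-(5): the noise is unbiased, and the leftover penalty is a Jensen gap of the convex log-partition function.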
Thus, for GLMs, artificial feature noising is a regularization scheme on the model itself that can be compared with other forms of regularization such as ridge (L2) or lasso (L1) penalization. In Section 6, we exploit the label-independence of the noising penalty and use unlabeled data to tune our estimate of R(β).

The fact that R does not depend on the labels has another useful consequence that relates to prediction. The natural prediction rule with artificially noised features is to select ŷ to minimize expected loss over the added noise: ŷ = argmin_y E_ξ[ℓ_{x̃,y}(β̂)]. It is common practice, however, not to noise the inputs and just to output classification decisions based on the original feature vector [1, 3, 14]: ŷ = argmin_y ℓ_{x,y}(β̂). It is easy to verify that these expressions are in general not equivalent, but they are equivalent when the effect of feature noising reduces to a label-independent penalty on the likelihood. Thus, the common practice of predicting with clean features is formally justified for GLMs.

2.1 A Quadratic Approximation to the Noising Penalty

Although the noising penalty R yields an explicit regularizer that does not depend on the labels {y_i}, the form of R can be difficult to interpret. To gain more insight, we will work with a quadratic approximation of the type used by [9, 10]. By taking a second-order Taylor expansion of A around x · β, we get that E_ξ[A(x̃ · β)] − A(x · β) ≈ (1/2) A''(x · β) Var_ξ[x̃ · β]. Here the first-order term E_ξ[A'(x · β)(x̃ − x) · β] vanishes because E_ξ[x̃] = x.
Applying this quadratic approximation to (5) yields the following quadratic noising regularizer, which will play a pivotal role in the rest of the paper:

Rq(β) def= (1/2) Σ_{i=1}^n A''(x_i · β) Var_ξ[x̃_i · β].   (6)

This regularizer penalizes two types of variance over the training examples: (i) A''(x_i · β), which corresponds to the variance of the response y_i in the GLM, and (ii) Var_ξ[x̃_i · β], the variance of the estimated GLM parameter due to noising.³

Accuracy of approximation   Figure 1a compares the noising penalties R and Rq for logistic regression in the case that x̃ · β is Gaussian;⁴ we vary the mean parameter p def= (1 + e^{−x·β})^{−1} and the noise level σ. We see that Rq is generally very accurate, although it tends to overestimate the true penalty for p ≈ 0.5 and to underestimate it for very confident predictions. We give a graphical explanation for this phenomenon in the Appendix (Figure A.1).

The quadratic approximation also appears to hold up on real datasets. In Figure 1b, we compare the evolution during training of both R and Rq on the 20 newsgroups alt.atheism vs soc.religion.christian classification task described in [15]. We see that the quadratic approximation is accurate most of the way through the learning procedure, only deteriorating slightly as the model converges to highly confident predictions.

In practice, we have found that fitting logistic regression with the quadratic surrogate Rq gives similar results to actual dropout-regularized logistic regression. We use this technique for our experiments in Section 6.

³Although Rq is not convex, we were still able (using an L-BFGS algorithm) to train logistic regression with Rq as a surrogate for the dropout regularizer without running into any major issues with local optima.
⁴This assumption holds a priori for additive Gaussian noise, and can be reasonable for dropout by the central limit theorem.

[Figure 1 (plots omitted): Validating the quadratic approximation. (a) Comparison of noising penalties R and Rq for logistic regression with Gaussian perturbations, i.e., (x̃ − x) · β ∼ N(0, σ²). The solid line indicates the true penalty and the dashed one is our quadratic approximation thereof; p = (1 + e^{−x·β})^{−1} is the mean parameter for the logistic model. (b) Comparison of the evolution of the exact dropout penalty R and our quadratic approximation Rq for logistic regression on the AthR classification task in [15] with 22K features and n = 1000 examples. The horizontal axis is the number of quasi-Newton steps taken while training with exact dropout.]

3 Regularization based on Additive Noise

Having established the general quadratic noising regularizer Rq, we now turn to studying the effects of Rq for various likelihoods (linear and logistic regression) and noising models (additive and dropout). In this section, we warm up with additive noise; in Section 4 we turn to our main target of interest, namely dropout noise.

Linear regression   Suppose x̃ = x + ε is generated by adding noise with Var[ε] = σ² I_{d×d} to the original feature vector x. Note that Var_ξ[x̃ · β] = σ²‖β‖₂², and in the case of linear regression A(z) = z²/2, so A''(z) = 1. Applying these facts to (6) yields a simplified form for the quadratic noising penalty:

Rq(β) = (1/2) n σ² ‖β‖₂².   (7)

Thus, we recover the well-known result that linear regression with additive feature noising is equivalent to ridge regression [2, 9]. Note that, with linear regression, the quadratic approximation Rq is exact, and so the correspondence with L2-regularization is also exact.

Logistic regression   The situation gets more interesting when we move beyond linear regression. For logistic regression, A''(x_i · β) = p_i(1 − p_i), where p_i = (1 + exp(−x_i · β))^{−1} is the predicted probability of y_i = 1. The quadratic noising penalty is then

Rq(β) = (1/2) σ² ‖β‖₂² Σ_{i=1}^n p_i(1 − p_i).   (8)

In other words, the noising penalty now simultaneously encourages parsimonious modeling as before (by encouraging ‖β‖₂² to be small) as well as confident predictions (by encouraging the p_i's to move away from 1/2).

Table 2: Form of the different regularization schemes. These expressions assume that the design matrix has been normalized, i.e., that Σ_i x_ij² = 1 for all j. The p_i = (1 + e^{−x_i·β})^{−1} are mean parameters for the logistic model.

                    Linear Regression   Logistic Regression               GLM
L2-penalization     ‖β‖₂²               ‖β‖₂²                             ‖β‖₂²
Additive Noising    σ²‖β‖₂²             σ²‖β‖₂² Σ_i p_i(1 − p_i)          σ²‖β‖₂² tr(V(β))
Dropout Training    ‖β‖₂²               Σ_{i,j} p_i(1 − p_i) x_ij² β_j²   βᵀ diag(XᵀV(β)X) β

4 Regularization based on Dropout Noise

Recall that dropout training corresponds to applying dropout noise to training examples, where the noised features x̃_i are obtained by setting x̃_ij to 0 with some "dropout probability" δ and to x_ij/(1 − δ) with probability (1 − δ), independently for each coordinate j of the feature vector.
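Before turning to dropout, the additive-noise penalty for logistic regression in (8) can be sanity-checked against a Monte Carlo estimate of the exact penalty (5). The sketch below uses illustrative data; the agreement it asserts only holds in the low-noise regime where the quadratic approximation is accurate:

```python
import numpy as np

# Check the quadratic additive-noise penalty (8) for logistic regression,
# Rq = (1/2) sigma^2 ||beta||^2 sum_i p_i (1 - p_i), against a Monte Carlo
# estimate of the exact penalty R from (5). Data and names are illustrative.
rng = np.random.default_rng(1)
sigma = 0.2
X = np.array([[1.0, 0.5, -0.5],
              [0.2, -1.0, 0.3],
              [0.8, 0.1, 0.4],
              [-0.6, 0.9, 1.0],
              [0.3, 0.3, -0.2]])
beta = np.array([1.0, -0.5, 0.5])

def A(z):                                    # logistic log-partition function
    return np.log1p(np.exp(z))

p = 1.0 / (1.0 + np.exp(-X @ beta))
Rq = 0.5 * sigma**2 * (beta @ beta) * np.sum(p * (1 - p))

# Exact penalty, estimated by averaging over the additive Gaussian noise.
eps = rng.normal(scale=sigma, size=(300_000,) + X.shape)
R = (A((X + eps) @ beta).mean(axis=0) - A(X @ beta)).sum()

# In the low-noise limit the two agree closely (here within 20%).
assert Rq > 0
assert abs(R - Rq) < 0.2 * Rq
```

As in Figure 1a, the quadratic surrogate tracks the exact penalty well away from extreme predictions; the tolerance in the final assertion absorbs both Monte Carlo error and the higher-order Taylor terms.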
We can check that

Var_ξ[x̃_i · β] = (δ/(1 − δ)) Σ_{j=1}^d x_ij² β_j²,   (9)

and so the quadratic dropout penalty is

Rq(β) = (1/2) (δ/(1 − δ)) Σ_{i=1}^n A''(x_i · β) Σ_{j=1}^d x_ij² β_j².   (10)

Letting X ∈ R^{n×d} be the design matrix with rows x_i and V(β) ∈ R^{n×n} be a diagonal matrix with entries A''(x_i · β), we can re-write this penalty as

Rq(β) = (1/2) (δ/(1 − δ)) βᵀ diag(XᵀV(β)X) β.   (11)

Let β* be the maximum likelihood estimate given infinite data. When computed at β*, the matrix (1/n) XᵀV(β*)X = (1/n) Σ_{i=1}^n ∇²ℓ_{x_i, y_i}(β*) is an estimate of the Fisher information matrix I. Thus, dropout can be seen as an attempt to apply an L2 penalty after normalizing the feature vector by diag(I)^{−1/2}. The Fisher information is linked to the shape of the level surfaces of ℓ(β) around β*. If I were a multiple of the identity matrix, then these level surfaces would be perfectly spherical around β*. Dropout, by normalizing the problem by diag(I)^{−1/2}, ensures that while the level surfaces of ℓ(β) may not be spherical, the L2-penalty is applied in a basis where the features have been balanced out. We give a graphical illustration of this phenomenon in Figure A.2.

Linear Regression   For linear regression, V is the identity matrix, so the dropout objective is equivalent to a form of ridge regression where each column of the design matrix is normalized before applying the L2 penalty.⁵ This connection has been noted previously by [3].

Logistic Regression   The form of dropout penalties becomes much more intriguing once we move beyond the realm of linear regression.
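The equivalence between the double-sum form (10) and the matrix form (11) is a one-line identity, since diag(XᵀV(β)X) collects exactly the column sums Σ_i A''(x_i · β) x_ij². A small numerical check, with illustrative data:

```python
import numpy as np

# The dropout penalty (10) as a double sum equals the matrix form (11)
# with V(beta) = diag(A''(x_i . beta)). Logistic case: A''(z) = p(1 - p).
rng = np.random.default_rng(2)
n, d, delta = 6, 4, 0.3
X = rng.normal(size=(n, d))
beta = rng.normal(size=d)

p = 1.0 / (1.0 + np.exp(-X @ beta))
v = p * (1 - p)                              # entries of V(beta)

# Eq. (10): (1/2) delta/(1-delta) * sum_i A''(x_i.beta) sum_j x_ij^2 beta_j^2
Rq_sum = 0.5 * delta / (1 - delta) * np.sum(v[:, None] * X**2 * beta**2)

# Eq. (11): (1/2) delta/(1-delta) * beta^T diag(X^T V X) beta
col_scale = np.diag(X.T @ np.diag(v) @ X)    # sum_i v_i x_ij^2, per column j
Rq_mat = 0.5 * delta / (1 - delta) * beta @ (col_scale * beta)

assert np.isclose(Rq_sum, Rq_mat)
```

The vector `col_scale` is exactly the diagonal-Fisher estimate discussed above (up to the 1/n factor), which is what makes the "L2 after rescaling by diag(Î)^{-1/2}" reading precise.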
The case of logistic regression is particularly interesting. Here, we can write the quadratic dropout penalty from (10) as

Rq(β) = (1/2) (δ/(1 − δ)) Σ_{i=1}^n Σ_{j=1}^d p_i(1 − p_i) x_ij² β_j².   (12)

Thus, just like additive noising, dropout generally gives an advantage to confident predictions and small β. However, unlike all the other methods considered so far, dropout may allow for some large p_i(1 − p_i) and some large β_j², provided that the corresponding cross-term x_ij² is small.

Our analysis shows that dropout regularization should be better than L2-regularization for learning weights for features that are rare (i.e., often 0) but highly discriminative, because dropout effectively does not penalize β_j over observations for which x_ij = 0. Thus, in order for a feature to earn a large β_j², it suffices for it to contribute to a confident prediction with small p_i(1 − p_i) each time that it is active.⁶ Dropout training has been empirically found to perform well on tasks such as document

⁵Normalizing the columns of the design matrix before performing penalized regression is standard practice, and is implemented by default in software like glmnet for R [16].
⁶To be precise, dropout does not reward all rare but discriminative features. Rather, dropout rewards those features that are rare and positively co-adapted with other features in a way that enables the model to make confident predictions whenever the feature of interest is active.

Table 3: Accuracy of L2- and dropout-regularized logistic regression on a simulated example. The first row indicates results over test examples where some of the rare useful features are active (i.e., where there is some signal that can be exploited), while the second row indicates accuracy over the full test set. These results are averaged over 100 simulation runs, with 75 training examples in each. All tuning parameters were set to optimal values.
The sampling error on all reported values is within ±0.01.

Accuracy           L2-regularization   Dropout training
Active Instances   0.66                0.73
All Instances      0.53                0.55

classification where rare but discriminative features are prevalent [3]. Our result suggests that this is no mere coincidence.

We summarize the relationship between L2-penalization, additive noising, and dropout in Table 2. Additive noising introduces a product-form penalty depending on both β and A''. However, the full potential of artificial feature noising only emerges with dropout, which allows the penalty terms due to β and A'' to interact in a non-trivial way through the design matrix X (except for linear regression, in which all the noising schemes we consider collapse to ridge regression).

4.1 A Simulation Example

The above discussion suggests that dropout logistic regression should perform well with rare but useful features. To test this intuition empirically, we designed a simulation study where all the signal is grouped in 50 rare features, each of which is active only 4% of the time.
We then added 1000 nuisance features that are always active to the design matrix, for a total of d = 1050 features. To make sure that our experiment was picking up the effect of dropout training specifically and not just normalization of X, we ensured that the columns of X were normalized in expectation.

The dropout penalty for logistic regression can be written as a matrix product

Rq(β) = (1/2) (δ/(1 − δ)) ( ··· p_i(1 − p_i) ··· ) [ x_ij² ] ( ··· β_j² ··· )ᵀ.   (13)

We designed the simulation study in such a way that, at the optimal β, the dropout penalty should have the structure

( Small (confident prediction) | Big (weak prediction) ) · [ x_ij², with 0 in the (weak prediction, useful feature) block ] · ( Big (useful feature) | Small (nuisance feature) )ᵀ.   (14)

A dropout penalty with such a structure should be small. Although there are some uncertain predictions with large p_i(1 − p_i) and some big weights β_j², these terms cannot interact because the corresponding terms x_ij² are all 0 (these are examples without any of the rare discriminative features and thus have no signal). Meanwhile, L2 penalization has no natural way of penalizing some β_j more and others less. Our simulation results, given in Table 3, confirm that dropout training outperforms L2-regularization here as expected. See Appendix A.1 for details.

5 Dropout Regularization in Online Learning

There is a well-known connection between L2-regularization and stochastic gradient descent (SGD). In SGD, the weight vector β̂ is updated with β̂_{t+1} = β̂_t − η_t g_t, where g_t = ∇ℓ_{x_t, y_t}(β̂_t) is the gradient of the loss due to the t-th training example.
We can also write this update as a linear L2-penalized problem

β̂_{t+1} = argmin_β { ℓ_{x_t, y_t}(β̂_t) + g_t · (β − β̂_t) + (1/(2η_t)) ‖β − β̂_t‖₂² },   (15)

where the first two terms form a linear approximation to the loss and the third term is an L2-regularizer. Thus, SGD progresses by repeatedly solving linearized L2-regularized problems.

[Figure 2 (plots omitted; curves for dropout+unlabeled, dropout, and L2, with test-set accuracy on the vertical axis): Test set accuracy on the IMDB dataset [12] with unigram features. Left: 10000 labeled training examples, and up to 40000 unlabeled examples. Right: 3000-15000 labeled training examples, and 25000 unlabeled examples. The unlabeled data is discounted by a factor α = 0.4.]

As discussed by Duchi et al. [11], a problem with classic SGD is that it can be slow at learning weights corresponding to rare but highly discriminative features. This problem can be alleviated by running a modified form of SGD with β̂_{t+1} = β̂_t − η A_t^{−1} g_t, where the transformation A_t is also learned online; this leads to the AdaGrad family of stochastic descent rules. Duchi et al. use A_t = diag(G_t)^{1/2}, where G_t = Σ_{i=1}^t g_i g_iᵀ, and show that this choice achieves desirable regret bounds in the presence of rare but useful features. At least superficially, AdaGrad and dropout seem to have similar goals: for logistic regression, they can both be understood as adaptive alternatives to methods based on L2-regularization that favor learning rare, useful features.
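The updates above are easy to make concrete. A minimal sketch, with illustrative values: the linearized L2 problem (15) is a strictly convex quadratic whose minimizer is the plain SGD step, and an AdaGrad-style step simply rescales each coordinate by the accumulated squared gradients:

```python
import numpy as np

# Sketch of the linearized updates in Section 5. The L2 problem (15) has
# closed-form minimizer beta_t - eta * g_t; AdaGrad rescales coordinates
# by diag(G_t)^(1/2). All values are illustrative.
rng = np.random.default_rng(3)
d, eta = 4, 0.1
beta_t = rng.normal(size=d)
g_t = rng.normal(size=d)

# Closed form of argmin_beta { g_t.(beta - beta_t) + ||beta - beta_t||^2/(2 eta) }.
beta_sgd = beta_t - eta * g_t

# The SGD step minimizes the linearized objective: any perturbation is no better.
obj = lambda b: g_t @ (b - beta_t) + (b - beta_t) @ (b - beta_t) / (2 * eta)
assert all(obj(beta_sgd) <= obj(beta_sgd + 0.01 * rng.normal(size=d))
           for _ in range(10))

# AdaGrad-style step: coordinates with small accumulated squared gradients
# (e.g. rare features) take proportionally larger steps.
past_grads = rng.normal(size=(50, d))
G_diag = (past_grads**2).sum(axis=0)         # diagonal of G_t = sum g_i g_i^T
beta_ada = beta_t - eta * g_t / np.sqrt(G_diag)
assert beta_ada.shape == beta_t.shape
```

The dropout-regularized analogue discussed next replaces the isotropic 1/(2η) scaling with diag(H_t), which is what produces the AdaGrad-like behavior.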
As it turns out, they have a deeper connection.

The natural way to incorporate dropout regularization into SGD is to replace the penalty term ‖β − β̂_t‖₂²/(2η_t) in (15) with the dropout regularizer, giving us the update rule

β̂_{t+1} = argmin_β { ℓ_{x_t, y_t}(β̂_t) + g_t · (β − β̂_t) + Rq(β − β̂_t; β̂_t) },   (16)

where Rq(·; β̂_t) is the quadratic noising regularizer centered at β̂_t:⁷

Rq(β − β̂_t; β̂_t) = (1/2) (β − β̂_t)ᵀ diag(H_t) (β − β̂_t),   where H_t = Σ_{i=1}^t ∇²ℓ_{x_i, y_i}(β̂_t).   (17)

This implies that dropout descent is first-order equivalent to an adaptive SGD procedure with A_t = diag(H_t). To see the connection between AdaGrad and this dropout-based online procedure, recall that for GLMs both of the expressions

E_{β*}[∇²ℓ_{x,y}(β*)] = E_{β*}[∇ℓ_{x,y}(β*) ∇ℓ_{x,y}(β*)ᵀ]   (18)

are equal to the Fisher information I [17]. In other words, as β̂_t converges to β*, G_t and H_t are both consistent estimates of the Fisher information. Thus, by using dropout instead of L2-regularization to solve linearized problems in online learning, we end up with an AdaGrad-like algorithm.

Of course, the connection between AdaGrad and dropout is not perfect. In particular, AdaGrad allows for a more aggressive learning rate by using A_t = diag(G_t)^{1/2} instead of diag(G_t). But, at a high level, AdaGrad and dropout both appear to be aiming for the same goal: scaling the features by the Fisher information to make the level curves of the objective more circular. In contrast, L2-regularization makes no attempt to sphere the level curves, and AROW [18], another popular adaptive method for online learning, only attempts to normalize the effective feature matrix but does not consider the sensitivity of the loss to changes in the model weights.
In the case of logistic regression, AROW also favors learning rare features, but unlike dropout and AdaGrad it does not privilege confident predictions.

⁷This expression is equivalent to (11), except that we used β̂_t and not β to compute H_t.

Table 4: Performance of semi-supervised dropout training for document classification.

(a) Test accuracy with and without unlabeled data on different datasets. Each dataset is split into 3 parts of equal sizes: train, unlabeled, and test. Log. Reg.: logistic regression with L2 regularization; Dropout: dropout trained with the quadratic surrogate; +Unlabeled: using unlabeled data.

Datasets   Log. Reg.   Dropout   +Unlabeled
Subj       88.96       90.85     91.48
RT         73.49       75.18     76.56
IMDB-2k    80.63       81.23     80.33
XGraph     83.10       84.64     85.41
BbCrypt    97.28       98.49     99.24
IMDB       87.14       88.70     89.21

(b) Test accuracy on the IMDB dataset [12]. Labeled: using just labeled data from each paper/method; +Unlabeled: using additional unlabeled data. Drop: dropout with Rq; MNB: multinomial naive Bayes with the semi-supervised frequency estimate from [19];⁸ -Uni: unigram features; -Bi: bigram features.

Methods          Labeled   +Unlabeled
MNB-Uni [19]     83.62     84.13
MNB-Bi [19]      86.63     86.98
Vect.Sent [12]   88.33     88.89
NBSVM [15]-Bi    91.22     –
Drop-Uni         87.78     89.52
Drop-Bi          91.31     91.98

6 Semi-Supervised Dropout Training

Recall that the regularizer R(β) in (5) is independent of the labels {y_i}. As a result, we can use additional unlabeled training examples to estimate it more accurately. Suppose we have an unlabeled dataset {z_i} of size m, and let α ∈ (0, 1] be a discount factor for the unlabeled data. Then we can define a semi-supervised penalty estimate

R*(β) def= (n/(n + αm)) (R(β) + α R_Unlabeled(β)),   (19)

where R(β) is the original penalty estimate and R_Unlabeled(β) = Σ_i E_ξ[A(z_i · β)] − A(z_i · β) is computed using (5) over the unlabeled examples z_i. We select the discount parameter α by cross-validation; empirically, α ∈ [0.1, 0.4] works well. For convenience, we optimize the quadratic surrogate Rq* instead of R*. Another practical option would be to use the Gaussian approximation from [3] for estimating R*(β).

Most approaches to semi-supervised learning either rely on using a generative model [19, 20, 21, 22, 23] or on various assumptions about the relationship between the predictor and the marginal distribution over inputs. Our semi-supervised approach is based on a different intuition: we'd like to set the weights so as to make confident predictions on the unlabeled data as well as the labeled data, an intuition shared by entropy regularization [24] and transductive SVMs [25].

Experiments   We apply this semi-supervised technique to text classification. Results on several datasets described in [15] are shown in Table 4a; Figure 2 illustrates how the use of unlabeled data improves the performance of our classifier on a single dataset. Overall, we see that using unlabeled data to learn a better regularizer R*(β) consistently improves the performance of dropout training.

Table 4b shows our results on the IMDB dataset of [12]. The dataset contains 50,000 unlabeled examples in addition to the labeled train and test sets of size 25,000 each. Whereas the train and test examples are either positive or negative, the unlabeled examples contain neutral reviews as well. We train a dropout-regularized logistic regression classifier on unigram/bigram features, and use the unlabeled data to tune our regularizer.
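Because the penalty (19) never touches the labels, computing it over unlabeled examples is mechanically identical to computing it over the labeled features. A minimal Monte Carlo sketch, with illustrative data and names (the paper's experiments instead optimize the quadratic surrogate Rq*):

```python
import numpy as np

# Sketch of the semi-supervised penalty estimate (19):
#   R*(beta) = n/(n + alpha*m) * (R(beta) + alpha * R_Unlabeled(beta)),
# where both terms are Monte Carlo estimates of (5) under dropout noise.
# Labels never enter, so the unlabeled pass needs feature vectors only.
rng = np.random.default_rng(4)

def A(z):                                    # logistic log-partition function
    return np.log1p(np.exp(z))

def noising_penalty(X, beta, delta, n_mc=20_000):
    # R(beta) = sum_i E[A(x_tilde_i . beta)] - A(x_i . beta), as in (5)
    xi = rng.binomial(1, 1 - delta, size=(n_mc,) + X.shape) / (1 - delta)
    return (A((xi * X) @ beta).mean(axis=0) - A(X @ beta)).sum()

n, m, d, delta, alpha = 20, 100, 5, 0.5, 0.4
X_labeled = rng.normal(size=(n, d))
Z_unlabeled = rng.normal(size=(m, d))
beta = rng.normal(size=d)

R_labeled = noising_penalty(X_labeled, beta, delta)
R_unlabeled = noising_penalty(Z_unlabeled, beta, delta)
R_star = n / (n + alpha * m) * (R_labeled + alpha * R_unlabeled)
assert R_star > 0                            # each Jensen-gap term is nonnegative
```

The n/(n + αm) prefactor keeps the combined penalty on the same scale as the purely supervised one as m grows.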
Our method benefits from unlabeled data even in the presence of a large amount of labeled data, and achieves state-of-the-art accuracy on this dataset.

7 Conclusion

We analyzed dropout training as a form of adaptive regularization. This framework enabled us to uncover close connections between dropout training, adaptively balanced L2-regularization, and AdaGrad; and it led to a simple yet effective method for semi-supervised training. There seem to be multiple opportunities for digging deeper into the connection between dropout training and adaptive regularization. In particular, it would be interesting to see whether the dropout regularizer takes on a tractable and/or interpretable form in neural networks, and whether similar semi-supervised schemes could be used to improve on the results presented in [1].

⁸Our implementation of semi-supervised MNB. MNB with EM [20] failed to give an improvement.

References

[1] Geoffrey E Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.

[2] Laurens van der Maaten, Minmin Chen, Stephen Tyree, and Kilian Q Weinberger. Learning with marginalized corrupted features. In Proceedings of the International Conference on Machine Learning, 2013.

[3] Sida I Wang and Christopher D Manning. Fast dropout training. In Proceedings of the International Conference on Machine Learning, 2013.

[4] Yaser S Abu-Mostafa. Learning from hints in neural networks. Journal of Complexity, 6(2):192–198, 1990.

[5] Chris J.C. Burges and Bernhard Schölkopf. Improving the accuracy and speed of support vector machines. In Advances in Neural Information Processing Systems, pages 375–381, 1997.

[6] Patrice Y Simard, Yann A Le Cun, John S Denker, and Bernard Victorri. 
Transformation invariance in pattern recognition: Tangent distance and propagation. International Journal of Imaging Systems and Technology, 11(3):181–197, 2000.
[7] Salah Rifai, Yann Dauphin, Pascal Vincent, Yoshua Bengio, and Xavier Muller. The manifold tangent classifier. Advances in Neural Information Processing Systems, 24:2294–2302, 2011.
[8] Kiyotoshi Matsuoka. Noise injection into inputs in back-propagation learning. Systems, Man and Cybernetics, IEEE Transactions on, 22(3):436–440, 1992.
[9] Chris M Bishop. Training with noise is equivalent to Tikhonov regularization. Neural Computation, 7(1):108–116, 1995.
[10] Salah Rifai, Xavier Glorot, Yoshua Bengio, and Pascal Vincent. Adding noise to the input of a model trained with a regularized objective. arXiv preprint arXiv:1104.3250, 2011.
[11] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2010.
[12] Andrew L Maas, Raymond E Daly, Peter T Pham, Dan Huang, Andrew Y Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, pages 142–150. Association for Computational Linguistics, 2011.
[13] Sida I Wang, Mengqiu Wang, Stefan Wager, Percy Liang, and Christopher D Manning. Feature noising for log-linear structured prediction. In Empirical Methods in Natural Language Processing, 2013.
[14] Ian J Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio. Maxout networks. In Proceedings of the International Conference on Machine Learning, 2013.
[15] Sida Wang and Christopher D Manning. Baselines and bigrams: Simple, good sentiment and topic classification. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, pages 90–94.
Association for Computational Linguistics, 2012.
[16] Jerome Friedman, Trevor Hastie, and Rob Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1, 2010.
[17] Erich Leo Lehmann and George Casella. Theory of Point Estimation. Springer, 1998.
[18] Koby Crammer, Alex Kulesza, and Mark Dredze. Adaptive regularization of weight vectors. Advances in Neural Information Processing Systems, 22:414–422, 2009.
[19] Jiang Su, Jelber Sayyad Shirab, and Stan Matwin. Large scale text classification using semi-supervised multinomial naive Bayes. In Proceedings of the International Conference on Machine Learning, 2011.
[20] Kamal Nigam, Andrew Kachites McCallum, Sebastian Thrun, and Tom Mitchell. Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2-3):103–134, May 2000.
[21] G. Bouchard and B. Triggs. The trade-off between generative and discriminative classifiers. In International Conference on Computational Statistics, pages 721–728, 2004.
[22] R. Raina, Y. Shen, A. Ng, and A. McCallum. Classification with hybrid generative/discriminative models. In Advances in Neural Information Processing Systems, Cambridge, MA, 2004. MIT Press.
[23] J. Suzuki, A. Fujino, and H. Isozaki. Semi-supervised structured output learning based on a hybrid generative and discriminative approach. In Empirical Methods in Natural Language Processing and Computational Natural Language Learning, 2007.
[24] Y. Grandvalet and Y. Bengio. Entropy regularization. In Semi-Supervised Learning, United Kingdom, 2005. Springer.
[25] Thorsten Joachims.
Transductive inference for text classification using support vector machines. In Proceedings of the International Conference on Machine Learning, pages 200–209, 1999.