{"title": "A Theoretically Grounded Application of Dropout in Recurrent Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 1019, "page_last": 1027, "abstract": "Recurrent neural networks (RNNs) stand at the forefront of many recent developments in deep learning. Yet a major difficulty with these models is their tendency to overfit, with dropout shown to fail when applied to recurrent layers. Recent results at the intersection of Bayesian modelling and deep learning offer a Bayesian interpretation of common deep learning techniques such as dropout. This grounding of dropout in approximate Bayesian inference suggests an extension of the theoretical results, offering insights into the use of dropout with RNN models. We apply this new variational inference based dropout technique in LSTM and GRU models, assessing it on language modelling and sentiment analysis tasks. The new approach outperforms existing techniques, and to the best of our knowledge improves on the single model state-of-the-art in language modelling with the Penn Treebank (73.4 test perplexity). This extends our arsenal of variational tools in deep learning.", "full_text": "A Theoretically Grounded Application of Dropout in Recurrent Neural Networks\n\nYarin Gal\nZoubin Ghahramani\n\nUniversity of Cambridge\n\n{yg279,zg201}@cam.ac.uk\n\nAbstract\n\nRecurrent neural networks (RNNs) stand at the forefront of many recent developments\nin deep learning. Yet a major difficulty with these models is their tendency to\noverfit, with dropout shown to fail when applied to recurrent layers. Recent results\nat the intersection of Bayesian modelling and deep learning offer a Bayesian interpretation\nof common deep learning techniques such as dropout. This grounding of\ndropout in approximate Bayesian inference suggests an extension of the theoretical\nresults, offering insights into the use of dropout with RNN models. 
We apply this\nnew variational inference based dropout technique in LSTM and GRU models,\nassessing it on language modelling and sentiment analysis tasks. The new approach\noutperforms existing techniques, and to the best of our knowledge improves on the\nsingle model state-of-the-art in language modelling with the Penn Treebank (73.4\ntest perplexity). This extends our arsenal of variational tools in deep learning.\n\n1 Introduction\nRecurrent neural networks (RNNs) are sequence-based models of key importance for natural language\nunderstanding, language generation, video processing, and many other tasks [1\u20133]. The model\u2019s input\nis a sequence of symbols, where at each time step a simple neural network (RNN unit) is applied to a\nsingle symbol, as well as to the network\u2019s output from the previous time step. RNNs are powerful\nmodels, showing superb performance on many tasks, but overfit quickly. Lack of regularisation in\nRNN models makes it difficult to handle small data, and to avoid overfitting researchers often use\nearly stopping, or small and under-specified models [4].\nDropout is a popular regularisation technique with deep networks [5, 6] where network units are\nrandomly masked during training (dropped). But the technique has never been applied successfully\nto RNNs. Empirical results have led many to believe that noise added to recurrent layers (connections\nbetween RNN units) will be amplified for long sequences, and drown the signal [4]. Consequently,\nexisting research has concluded that the technique should be used with the inputs and outputs of the\nRNN alone [4, 7\u201310]. But this approach still leads to overfitting, as is shown in our experiments.\nRecent results at the intersection of Bayesian research and deep learning offer interpretation of\ncommon deep learning techniques through Bayesian eyes [11\u201316]. 
This Bayesian view of deep\nlearning allowed the introduction of new techniques into the field, such as methods to obtain principled\nuncertainty estimates from deep learning networks [14, 17]. Gal and Ghahramani [14] for example\nshowed that dropout can be interpreted as a variational approximation to the posterior of a Bayesian\nneural network (NN). Their variational approximating distribution is a mixture of two Gaussians\nwith small variances, with the mean of one Gaussian fixed at zero. This grounding of dropout in\napproximate Bayesian inference suggests that an extension of the theoretical results might offer\ninsights into the use of the technique with RNN models.\nHere we focus on common RNN models in the field (LSTM [18], GRU [19]) and interpret these\nas probabilistic models, i.e. as RNNs with network weights treated as random variables, and with\nsuitably defined likelihood functions. We then perform approximate variational inference in these\nprobabilistic Bayesian models (which we will refer to as Variational RNNs). Approximating the\nposterior distribution over the weights with a mixture of Gaussians (with one component fixed at\nzero and small variances) will lead to a tractable optimisation objective. Optimising this objective is\nidentical to performing a new variant of dropout in the respective RNNs.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n[Figure 1 panels: (a) Naive dropout RNN; (b) Variational RNN. Each panel shows RNN units at time steps t-1, t, t+1 with inputs x_{t-1}, x_t, x_{t+1} and outputs y_{t-1}, y_t, y_{t+1}.]\n\nFigure 1: Depiction of the dropout technique following our Bayesian interpretation (right)\ncompared to the standard technique in the field (left). Each square represents an RNN unit, with\nhorizontal arrows representing time dependence (recurrent connections). Vertical arrows represent\nthe input and output to each RNN unit. 
Coloured connections represent dropped-out inputs, with\ndifferent colours corresponding to different dropout masks. Dashed lines correspond to standard\nconnections with no dropout. Current techniques (naive dropout, left) use different masks at different\ntime steps, with no dropout on the recurrent layers. The proposed technique (Variational RNN, right)\nuses the same dropout mask at each time step, including the recurrent layers.\nIn the new dropout variant, we repeat the same dropout mask at each time step for both inputs, outputs,\nand recurrent layers (drop the same network units at each time step). This is in contrast to the existing\nad hoc techniques where different dropout masks are sampled at each time step for the inputs and\noutputs alone (no dropout is used with the recurrent connections since the use of different masks\nwith these connections leads to deteriorated performance). Our method and its relation to existing\ntechniques is depicted in \ufb01gure 1. When used with discrete inputs (i.e. words) we place a distribution\nover the word embeddings as well. Dropout in the word-based model corresponds then to randomly\ndropping word types in the sentence, and might be interpreted as forcing the model not to rely on\nsingle words for its task.\nWe next survey related literature and background material, and then formalise our approximate\ninference for the Variational RNN, resulting in the dropout variant proposed above. Experimental\nresults are presented thereafter.\n\n2 Related Research\nIn the past few years a considerable body of work has been collected demonstrating the negative\neffects of a naive application of dropout in RNNs\u2019 recurrent connections. Pachitariu and Sahani [7],\nworking with language models, reason that noise added in the recurrent connections of an RNN leads\nto model instabilities. Instead, they add noise to the decoding part of the model alone. Bayer et al. 
[8]\napply a deterministic approximation of dropout (fast dropout) in RNNs. They reason that with dropout,\nthe RNN\u2019s dynamics change dramatically, and that dropout should be applied to the \u201cnon-dynamic\u201d\nparts of the model \u2013 connections feeding from the hidden layer to the output layer. Pham et al. [9]\nassess dropout with handwriting recognition tasks. They conclude that dropout in recurrent layers\ndisrupts the RNN\u2019s ability to model sequences, and that dropout should be applied to feed-forward\nconnections and not to recurrent connections. The work by Zaremba, Sutskever, and Vinyals [4] was\ndeveloped in parallel to Pham et al. [9]. Zaremba et al. [4] assess the performance of dropout in RNNs\non a wide series of tasks. They show that applying dropout to the non-recurrent connections alone\nresults in improved performance, and provide (as yet unbeaten) state-of-the-art results in language\nmodelling on the Penn Treebank. They reason that without dropout only small models were used\nin the past in order to avoid over\ufb01tting, whereas with the application of dropout larger models can\nbe used, leading to improved results. This work is considered a reference implementation by many\n(and we compare to this as a baseline below). Bluche et al. [10] extend on the previous body of work\nand perform exploratory analysis of the performance of dropout before, inside, and after the RNN\u2019s\nunit. They provide mixed results, not showing signi\ufb01cant improvement on existing techniques. More\nrecently, and done in parallel to this work, Moon et al. [20] suggested a new variant of dropout in\nRNNs in the speech recognition community. They randomly drop elements in the LSTM\u2019s internal\ncell ct and use the same mask at every time step. 
This is the closest to our proposed approach\n(although fundamentally different to the approach we suggest, explained in \u00a74.1), and we compare to\nthis variant below as well.\nExisting approaches are based on an empirical experimentation with different flavours of dropout,\nfollowing a process of trial-and-error. These approaches have led many to believe that dropout\ncannot be extended to a large number of parameters within the recurrent layers, leaving them with\nno regularisation. In contrast to these conclusions, we show that it is possible to derive a variational\ninference based variant of dropout which successfully regularises such parameters, by grounding our\napproach in recent theoretical research.\n\n3 Background\nWe review necessary background in Bayesian neural networks and approximate variational inference.\nBuilding on these ideas, in the next section we propose approximate inference in the probabilistic\nRNN which will lead to a new variant of dropout.\n3.1 Bayesian Neural Networks\nGiven training inputs $X = \\{x_1, \\ldots, x_N\\}$ and their corresponding outputs $Y = \\{y_1, \\ldots, y_N\\}$, in\nBayesian (parametric) regression we would like to infer parameters $\\omega$ of a function $y = f^{\\omega}(x)$ that\nare likely to have generated our outputs. What parameters are likely to have generated our data?\nFollowing the Bayesian approach we would put some prior distribution over the space of parameters,\n$p(\\omega)$. This distribution represents our prior belief as to which parameters are likely to have generated\nour data. We further need to define a likelihood distribution $p(y|x, \\omega)$. For classification tasks we\nmay assume a softmax likelihood,\n\n$$p(y = d|x, \\omega) = \\mathrm{Categorical}\\Big(\\exp(f_d^{\\omega}(x)) \\Big/ \\sum_{d'} \\exp(f_{d'}^{\\omega}(x))\\Big)$$\n\nor a Gaussian likelihood for regression. Given a dataset $X, Y$, we then look for the posterior\ndistribution over the space of parameters: $p(\\omega|X, Y)$. 
This distribution captures how likely various\nfunction parameters are given our observed data. With it we can predict an output for a new input\npoint $x^*$ by integrating\n\n$$p(y^*|x^*, X, Y) = \\int p(y^*|x^*, \\omega)\\, p(\\omega|X, Y)\\, d\\omega. \\tag{1}$$\n\nOne way to define a distribution over a parametric set of functions is to place a prior distribution over\na neural network\u2019s weights, resulting in a Bayesian NN [21, 22]. Given weight matrices $W_i$ and bias\nvectors $b_i$ for layer $i$, we often place standard matrix Gaussian prior distributions over the weight\nmatrices, $p(W_i) = \\mathcal{N}(0, I)$ and often assume a point estimate for the bias vectors for simplicity.\n3.2 Approximate Variational Inference in Bayesian Neural Networks\nWe are interested in finding the distribution of weight matrices (parametrising our functions) that have\ngenerated our data. This is the posterior over the weights given our observables $X, Y$: $p(\\omega|X, Y)$.\nThis posterior is not tractable in general, and we may use variational inference to approximate it (as\nwas done in [23\u201325, 12]). We need to define an approximating variational distribution $q(\\omega)$, and then\nminimise the KL divergence between the approximating distribution and the full posterior:\n\n$$\\mathrm{KL}\\big(q(\\omega)\\,||\\,p(\\omega|X, Y)\\big) \\propto -\\int q(\\omega) \\log p(Y|X, \\omega)\\, d\\omega + \\mathrm{KL}\\big(q(\\omega)\\,||\\,p(\\omega)\\big)$$\n$$= -\\sum_{i=1}^{N} \\int q(\\omega) \\log p\\big(y_i|f^{\\omega}(x_i)\\big)\\, d\\omega + \\mathrm{KL}\\big(q(\\omega)\\,||\\,p(\\omega)\\big). \\tag{2}$$\n\nWe next extend this approximate variational inference to probabilistic RNNs, and use a $q(\\omega)$ distribution\nthat will give rise to a new variant of dropout in RNNs.\n\n4 Variational Inference in Recurrent Neural Networks\nIn this section we will concentrate on simple RNN models for brevity of notation. Derivations for\nLSTM and GRU follow similarly. Given input sequence $x = [x_1, \\ldots, x_T]$ of length $T$, a simple RNN\nis formed by a repeated application of a function $f_h$. This generates a hidden state $h_t$ for time step $t$:\n\n$$h_t = f_h(x_t, h_{t-1}) = \\sigma(x_t W_h + h_{t-1} U_h + b_h)$$\n\nfor some non-linearity $\\sigma$. 
The model output can be defined, for example, as $f_y(h_T) = h_T W_y + b_y$.\nWe view this RNN as a probabilistic model by regarding $\\omega = \\{W_h, U_h, b_h, W_y, b_y\\}$ as random\nvariables (following normal prior distributions). To make the dependence on $\\omega$ clear, we write $f_y^{\\omega}$\nfor $f_y$ and similarly for $f_h^{\\omega}$. We define our probabilistic model\u2019s likelihood as above (section 3.1).\nThe posterior over random variables $\\omega$ is rather complex, and we use variational inference with\napproximating distribution $q(\\omega)$ to approximate it.\nEvaluating each sum term in eq. (2) above with our RNN model we get\n\n$$\\int q(\\omega) \\log p\\big(y|f_y^{\\omega}(h_T)\\big)\\, d\\omega = \\int q(\\omega) \\log p\\Big(y\\Big|f_y^{\\omega}\\big(f_h^{\\omega}(x_T, h_{T-1})\\big)\\Big)\\, d\\omega$$\n$$= \\int q(\\omega) \\log p\\Big(y\\Big|f_y^{\\omega}\\big(f_h^{\\omega}(x_T, f_h^{\\omega}(\\ldots f_h^{\\omega}(x_1, h_0)\\ldots))\\big)\\Big)\\, d\\omega$$\n\nwith $h_0 = 0$. We approximate this with Monte Carlo (MC) integration with a single sample:\n\n$$\\approx \\log p\\Big(y\\Big|f_y^{\\hat{\\omega}}\\big(f_h^{\\hat{\\omega}}(x_T, f_h^{\\hat{\\omega}}(\\ldots f_h^{\\hat{\\omega}}(x_1, h_0)\\ldots))\\big)\\Big), \\qquad \\hat{\\omega} \\sim q(\\omega)$$\n\nresulting in an unbiased estimator to each sum term.\nThis estimator is plugged into equation (2) to obtain our minimisation objective\n\n$$\\mathcal{L} \\approx -\\sum_{i=1}^{N} \\log p\\Big(y_i\\Big|f_y^{\\hat{\\omega}_i}\\big(f_h^{\\hat{\\omega}_i}(x_{i,T}, f_h^{\\hat{\\omega}_i}(\\ldots f_h^{\\hat{\\omega}_i}(x_{i,1}, h_0)\\ldots))\\big)\\Big) + \\mathrm{KL}\\big(q(\\omega)\\,||\\,p(\\omega)\\big). \\tag{3}$$\n\nNote that for each sequence $x_i$ we sample a new realisation $\\hat{\\omega}_i = \\{\\hat{W}_h^i, \\hat{U}_h^i, \\hat{b}_h^i, \\hat{W}_y^i, \\hat{b}_y^i\\}$, and that\neach symbol in the sequence $x_i = [x_{i,1}, \\ldots, x_{i,T}]$ is passed through the function $f_h^{\\hat{\\omega}_i}$ with the same\nweight realisations $\\hat{W}_h^i, \\hat{U}_h^i, \\hat{b}_h^i$ used at every time step $t \\leq T$.\nFollowing [17] we define our approximating distribution to factorise over the weight matrices and\ntheir rows in $\\omega$. For every weight matrix row $w_k$ the approximating distribution is:\n\n$$q(w_k) = p\\,\\mathcal{N}(w_k; 0, \\sigma^2 I) + (1 - p)\\,\\mathcal{N}(w_k; m_k, \\sigma^2 I)$$\n\nwith $m_k$ variational parameter (row vector), $p$ given in advance (the dropout probability), and small\n$\\sigma^2$. We optimise over $m_k$ the variational parameters of the random weight matrices; these correspond\nto the RNN\u2019s weight matrices in the standard view$^1$. The KL in eq. (3) can be approximated as $L_2$\nregularisation over the variational parameters $m_k$ [17].\nEvaluating the model output $f_y^{\\hat{\\omega}}(\\cdot)$ with sample $\\hat{\\omega} \\sim q(\\omega)$ corresponds to randomly zeroing (masking)\nrows in each weight matrix $W$ during the forward pass \u2013 i.e. performing dropout. Our objective $\\mathcal{L}$ is\nidentical to that of the standard RNN. In our RNN setting with a sequence input, each weight matrix\nrow is randomly masked once, and importantly the same mask is used through all time steps.$^2$\nPredictions can be approximated by either propagating the mean of each layer to the next (referred to\nas the standard dropout approximation), or by approximating the posterior in eq. (1) with $q(\\omega)$,\n\n$$p(y^*|x^*, X, Y) \\approx \\int p(y^*|x^*, \\omega)\\, q(\\omega)\\, d\\omega \\approx \\frac{1}{K} \\sum_{k=1}^{K} p(y^*|x^*, \\hat{\\omega}_k) \\tag{4}$$\n\nwith $\\hat{\\omega}_k \\sim q(\\omega)$, i.e. by performing dropout at test time and averaging results (MC dropout).\n\n$^1$Graves et al. [26] further factorise the approximating distribution over the elements of each row, and use a\nGaussian approximating distribution with each element (rather than a mixture); the approximating distribution\nabove seems to give better performance, and has a close relation with dropout [17].\n\n$^2$In appendix A we discuss the relation of our dropout interpretation to the ensembling one.\n\n4.1 Implementation and Relation to Dropout in RNNs\nImplementing our approximate inference is identical to implementing dropout in RNNs with the\nsame network units dropped at each time step, randomly dropping inputs, outputs, and recurrent\nconnections. This is in contrast to existing techniques, where different network units would be\ndropped at different time steps, and no dropout would be applied to the recurrent connections (fig. 1).\nCertain RNN models such as LSTMs and GRUs use different gates within the RNN units. For\nexample, an LSTM is defined using four gates: \u201cinput\u201d, \u201cforget\u201d, \u201coutput\u201d, and \u201cinput modulation\u201d,\n\n$$i = \\mathrm{sigm}(h_{t-1} U_i + x_t W_i) \\qquad f = \\mathrm{sigm}(h_{t-1} U_f + x_t W_f)$$\n$$o = \\mathrm{sigm}(h_{t-1} U_o + x_t W_o) \\qquad g = \\tanh(h_{t-1} U_g + x_t W_g)$$\n$$c_t = f \\circ c_{t-1} + i \\circ g \\qquad h_t = o \\circ \\tanh(c_t) \\tag{5}$$\n\nwith $\\omega = \\{W_i, U_i, W_f, U_f, W_o, U_o, W_g, U_g\\}$ weight matrices and $\\circ$ the element-wise product.\nHere an internal state $c_t$ (also referred to as cell) is updated additively.\nAlternatively, the model could be re-parametrised as in [26]:\n\n$$\\begin{pmatrix} i \\\\\\\\ f \\\\\\\\ o \\\\\\\\ g \\end{pmatrix} = \\begin{pmatrix} \\mathrm{sigm} \\\\\\\\ \\mathrm{sigm} \\\\\\\\ \\mathrm{sigm} \\\\\\\\ \\tanh \\end{pmatrix} \\left( \\begin{pmatrix} x_t \\\\\\\\ h_{t-1} \\end{pmatrix} \\cdot W \\right) \\tag{6}$$\n\nwith $\\omega = \\{W\\}$, $W$ a matrix of dimensions $2K$ by $4K$ ($K$ being the dimensionality of $x_t$). We name\nthis parametrisation a tied-weights LSTM (compared to the untied-weights LSTM in eq. (5)).\nEven though these two parametrisations result in the same deterministic model, they lead to different\napproximating distributions $q(\\omega)$. With the first parametrisation one could use different dropout\nmasks for different gates (even when the same input $x_t$ is used). This is because the approximating\ndistribution is placed over the matrices rather than the inputs: we might drop certain rows in one\nweight matrix $W$ applied to $x_t$ and different rows in another matrix $W'$ applied to $x_t$. With the\nsecond parametrisation we would place a distribution over the single matrix $W$. This leads to a\nfaster forward-pass, but with slightly diminished results as we will see in the experiments section.\nIn more concrete terms, we may write our dropout variant with the second parametrisation (eq. (6)) as\n\n$$\\begin{pmatrix} i \\\\\\\\ f \\\\\\\\ o \\\\\\\\ g \\end{pmatrix} = \\begin{pmatrix} \\mathrm{sigm} \\\\\\\\ \\mathrm{sigm} \\\\\\\\ \\mathrm{sigm} \\\\\\\\ \\tanh \\end{pmatrix} \\left( \\begin{pmatrix} x_t \\circ z_x \\\\\\\\ h_{t-1} \\circ z_h \\end{pmatrix} \\cdot W \\right) \\tag{7}$$\n\nwith $z_x$, $z_h$ random masks repeated at all time steps (and similarly for the parametrisation in eq. (5)).\nIn comparison, Zaremba et al. [4]\u2019s variant replaces $z_x$ in eq. (7) with a time-dependent mask: $x_t \\circ z_x^t$\nwhere $z_x^t$ is sampled anew every time step (whereas $z_h$ is removed and the recurrent connection $h_{t-1}$\nis not dropped). On the other hand, Moon et al. [20]\u2019s variant changes eq. (5) by adapting the internal\ncell $c_t = c_t \\circ z_c$ with the same mask $z_c$ used at all time steps. Note that unlike [20], by viewing\ndropout as an operation over the weights our technique trivially extends to RNNs and GRUs.\n4.2 Word Embeddings Dropout\nIn datasets with continuous inputs we often apply dropout to the input layer \u2013 i.e. to the input vector\nitself. This is equivalent to placing a distribution over the weight matrix which follows the input and\napproximately integrating over it (the matrix is optimised, therefore prone to overfitting otherwise).\nBut for models with discrete inputs such as words (where every word is mapped to a continuous\nvector \u2013 a word embedding) this is seldom done. With word embeddings the input can be seen as\neither the word embedding itself, or, more conveniently, as a \u201cone-hot\u201d encoding (a vector of zeros\nwith 1 at a single position). The product of the one-hot encoded vector with an embedding matrix\n$W_E \\in \\mathbb{R}^{V \\times D}$ (where $D$ is the embedding dimensionality and $V$ is the number of words in the\nvocabulary) then gives a word embedding. Curiously, this parameter layer is the largest layer in most\nlanguage applications, yet it is often not regularised. Since the embedding matrix is optimised it can\nlead to overfitting, and it is therefore desirable to apply dropout to the one-hot encoded vectors. 
This\nin effect is identical to dropping words at random throughout the input sentence, and can also be\ninterpreted as encouraging the model to not \u201cdepend\u201d on single words for its output.\nNote that as before, we randomly set rows of the matrix $W_E \\in \\mathbb{R}^{V \\times D}$ to zero. Since we repeat the\nsame mask at each time step, we drop the same words throughout the sequence \u2013 i.e. we drop word\ntypes at random rather than word tokens (as an example, the sentence \u201cthe dog and the cat\u201d might\nbecome \u201c\u2014 dog and \u2014 cat\u201d or \u201cthe \u2014 and the cat\u201d, but never \u201c\u2014 dog and the cat\u201d). A possible\ninefficiency implementing this is the requirement to sample $V$ Bernoulli random variables, where\n$V$ might be large. This can be solved by the observation that for sequences of length $T$, at most $T$\nembeddings could be dropped (other dropped embeddings have no effect on the model output). For\n$T \\ll V$ it is therefore more efficient to first map the words to the word embeddings, and only then to\nzero-out word embeddings based on their word type.\n\n5 Experimental Evaluation\nWe start by implementing our proposed dropout variant into the Torch implementation of Zaremba\net al. [4], that has become a reference implementation for many in the field. Zaremba et al. [4] have\nset a benchmark on the Penn Treebank that to the best of our knowledge hasn\u2019t been beaten for\nthe past 2 years. We improve on [4]\u2019s results, and show that our dropout variant improves model\nperformance compared to early-stopping and compared to using under-specified models. We continue\nto evaluate our proposed dropout variant with both LSTM and GRU models on a sentiment analysis\ntask where labelled data is scarce. 
We \ufb01nish by giving an in-depth analysis of the properties of the\nproposed method, with code and many experiments deferred to the appendix due to space constraints.\n5.1 Language Modelling\nWe replicate the language modelling experiment of Zaremba, Sutskever, and Vinyals [4]. The\nexperiment uses the Penn Treebank, a standard benchmark in the \ufb01eld. This dataset is considered\na small one in the language processing community, with 887, 521 tokens (words) in total, making\nover\ufb01tting a considerable concern. Throughout the experiments we refer to LSTMs with the dropout\ntechnique proposed following our Bayesian interpretation as Variational LSTMs, and refer to existing\ndropout techniques as naive dropout LSTMs (different masks at different steps, applied to the input\nand output of the LSTM alone). We refer to LSTMs with no dropout as standard LSTMs.\nWe implemented a Variational LSTM for both the medium model of [4] (2 layers with 650 units in\neach layer) as well as their large model (2 layers with 1500 units in each layer). The only changes\nwe\u2019ve made to [4]\u2019s setting are 1) using our proposed dropout variant instead of naive dropout, and\n2) tuning weight decay (which was chosen to be zero in [4]). All other hyper-parameters are kept\nidentical to [4]: learning rate decay was not tuned for our setting and is used following [4]. Dropout\nparameters were optimised with grid search (tying the dropout probability over the embeddings\ntogether with the one over the recurrent layers, and tying the dropout probability for the inputs and\noutputs together as well). These are chosen to minimise validation perplexity3. We further compared\nto Moon et al. [20] who only drop elements in the LSTM internal state using the same mask at all\ntime steps (in addition to performing dropout on the inputs and outputs). 
We implemented their\ndropout variant with each model size, and repeated the procedure above to \ufb01nd optimal dropout\nprobabilities (0.3 with the medium model, and 0.5 with the large model). We had to use early stopping\nfor the large model with [20]\u2019s variant as the model starts over\ufb01tting after 16 epochs. Moon et al.\n[20] proposed their dropout variant within the speech recognition community, where they did not\nhave to consider embeddings over\ufb01tting (which, as we will see below, affect the recurrent layers\nconsiderably). We therefore performed an additional experiment using [20]\u2019s variant together with\nour embedding dropout (referred to as Moon et al. [20]+emb dropout).\nOur results are given in table 1. For the variational LSTM we give results using both the tied weights\nmodel (eq. (6)\u2013(7), Variational (tied weights)), and without weight tying (eq. (5), Variational (untied\nweights)). For each model we report performance using both the standard dropout approximation\n(averaging the weights at test time \u2013 propagating the mean of each approximating distribution as input\nto the next layer), and using MC dropout (obtained by performing dropout at test time 1000 times,\nand averaging the model outputs following eq. (4), denoted MC). For each model we report average\nperplexity and standard deviation (each experiment was repeated 3 times with different random seeds\nand the results were averaged). Model training time is given in words per second (WPS).\nIt is interesting that using the dropout approximation, weight tying results in lower validation error\nand test error than the untied weights model. But with MC dropout the untied weights model performs\nmuch better. Validation perplexity for the large model is improved from [4]\u2019s 82.2 down to 77.3 (with\nweight tying), or 77.9 without weight tying. Test perplexity is reduced from 78.4 down to 73.4 (with\nMC dropout and untied weights). 
To the best of our knowledge, these are currently the best single\nmodel perplexities on the Penn Treebank.\nIt seems that Moon et al. [20] underperform even compared to [4]. With no embedding dropout the\nlarge model overfits and early stopping is required (with no early stopping the model\u2019s validation\nperplexity goes up to 131 within 30 epochs). Adding our embedding dropout, the model performs\nmuch better, but still underperforms compared to applying dropout on the inputs and outputs alone.\nComparing our results to the non-regularised LSTM (evaluated with early stopping, giving similar\nperformance as the early stopping experiment in [4]) we see that for either model size an improvement\ncan be obtained by using our dropout variant. Comparing the medium sized Variational model to the\nlarge one we see that a significant reduction in perplexity can be achieved by using a larger model.\nThis cannot be done with the non-regularised LSTM, where a larger model leads to worse results.\nThis shows that reducing the complexity of the model, a possible approach to avoid overfitting,\nactually leads to a worse fit when using dropout.\n\n| Model | Medium LSTM Validation | Medium LSTM Test | Medium LSTM WPS | Large LSTM Validation | Large LSTM Test | Large LSTM WPS |\n|---|---|---|---|---|---|---|\n| Non-regularized (early stopping) | 121.1 | 121.7 | 5.5K | 128.3 | 127.4 | 2.5K |\n| Moon et al. [20] | 100.7 | 97.0 | 4.8K | 122.9 | 118.7 | 3K |\n| Moon et al. [20] +emb dropout | 88.9 | 86.5 | 4.8K | 88.8 | 86.0 | 3K |\n| Zaremba et al. [4] | 86.2 | 82.7 | 5.5K | 82.2 | 78.4 | 2.5K |\n| Variational (tied weights) | 81.8 \u00b1 0.2 | 79.7 \u00b1 0.1 | 4.7K | 77.3 \u00b1 0.2 | 75.0 \u00b1 0.1 | 2.4K |\n| Variational (tied weights, MC) | - | 79.0 \u00b1 0.1 | - | - | 74.1 \u00b1 0.0 | - |\n| Variational (untied weights) | 81.9 \u00b1 0.2 | 79.7 \u00b1 0.1 | 2.7K | 77.9 \u00b1 0.3 | 75.2 \u00b1 0.2 | 1.6K |\n| Variational (untied weights, MC) | - | 78.6 \u00b1 0.1 | - | - | 73.4 \u00b1 0.0 | - |\n\nTable 1: Single model perplexity (on test and validation sets) for the Penn Treebank language\nmodelling task. Two model sizes are compared (a medium and a large LSTM, following [4]\u2019s setup),\nwith number of processed words per second (WPS) reported. Both dropout approximation and MC\ndropout are given for the test set with the Variational model. A common approach for regularisation is\nto reduce model complexity (necessary with the non-regularised LSTM). With the Variational models\nhowever, a significant reduction in perplexity is achieved by using larger models.\n\n3Optimal probabilities are 0.3 and 0.5 respectively for the large model, compared to [4]\u2019s 0.6 dropout probability,\nand 0.2 and 0.35 respectively for the medium model, compared to [4]\u2019s 0.5 dropout probability.\n\nWe also see that the tied weights model achieves very close performance to that of the untied weights\none when using the dropout approximation. Assessing model run time though (on a Titan X GPU),\nwe see that tying the weights results in a more time-efficient implementation. This is because the\nsingle matrix product is implemented as a single GPU kernel, instead of the four smaller matrix\nproducts used in the untied weights model (where four GPU kernels are called sequentially). 
Note\nthough that a low level implementation should give similar run times.\nWe further experimented with a model averaging experiment following [4]\u2019s setting, where several\nlarge models are trained independently with their outputs averaged. We used Variational LSTMs\nwith MC dropout following the setup above. Using 10 Variational LSTMs we improve [4]\u2019s test set\nperplexity from 69.5 to 68.7 \u2013 obtaining identical perplexity to [4]\u2019s experiment with 38 models.\nLastly, we report validation perplexity with reduced learning rate decay (with the medium model).\nLearning rate decay is often used for regularisation by setting the optimiser to make smaller steps\nwhen the model starts over\ufb01tting (as done in [4]). By removing it we can assess the regularisation\neffects of dropout alone. As can be seen in \ufb01g. 2, even with early stopping, Variational LSTM achieves\nlower perplexity than naive dropout LSTM and standard LSTM. Note though that a signi\ufb01cantly\nlower perplexity for all models can be achieved with learning rate decay scheduling as seen in table 1\n5.2 Sentiment Analysis\nWe next evaluate our dropout variant with both LSTM and GRU models on a sentiment analysis task,\nwhere labelled data is scarce. We use MC dropout (which we compare to the dropout approximation\nfurther in appendix B), and untied weights model parametrisations.\nWe use the raw Cornell \ufb01lm reviews corpus collected by Pang and Lee [27]. The dataset is composed\nof 5000 \ufb01lm reviews. We extract consecutive segments of T words from each review for T = 200,\nand use the corresponding \ufb01lm score as the observed output y. The model is built from one embedding\nlayer (of dimensionality 128), one LSTM layer (with 128 network units for each gate; GRU setting is\nbuilt similarly), and \ufb01nally a fully connected layer applied to the last output of the LSTM (resulting\nin a scalar output). 
We use the Adam optimiser [28] throughout the experiments, with batch size 128,\nand MC dropout at test time with 10 samples.\n\nFigure 2: Medium model validation perplexity for the Penn Treebank language modelling task.\nLearning rate decay was reduced to assess model overfitting using dropout alone. Even with early\nstopping, Variational LSTM achieves lower perplexity than naive dropout LSTM and standard LSTM.\nLower perplexity for all models can be achieved with learning rate decay scheduling, seen in table 1.\n\n[Figure 3 panels: (a) LSTM train error: variational, naive dropout, and standard LSTM. (b) LSTM test error: variational, naive dropout, and standard LSTM. (c) GRU test error: variational, naive dropout, and standard LSTM.]\n\nFigure 3: Sentiment analysis error for Variational LSTM / GRU compared to naive dropout LSTM /\nGRU and standard LSTM / GRU (with no dropout).\nThe main results can be seen in fig. 3. We compared Variational LSTM (with our dropout variant\napplied with each weight layer) to standard techniques in the field. Training error is shown in fig. 3a\nand test error is shown in fig. 3b. Optimal dropout probabilities and weight decay were used for each\nmodel (see appendix B). It seems that the only model not to overfit is the Variational LSTM, which\nachieves lowest test error as well. Variational GRU test error is shown in fig. 14 (with loss plot given\nin appendix B). Optimal dropout probabilities and weight decay were used again for each model.\nVariational GRU avoids overfitting to the data and converges to the lowest test error. Early stopping in\nthis dataset will result in smaller test error though (lowest test error is obtained by the non-regularised\nGRU model at the second epoch). It is interesting to note that standard techniques exhibit peculiar\nbehaviour where test error repeatedly decreases and increases. This behaviour is not observed with\nthe Variational GRU. 
Convergence plots of the loss for each model are given in appendix B.\nWe next explore the effects of dropping out different parts of the model. We assessed our Variational\nLSTM with different combinations of dropout over the embeddings (pE = 0, 0.5) and recurrent\nlayers (pU = 0, 0.5) on the sentiment analysis task. The convergence plots can be seen in fig. 4a. It\nseems that without both strong embedding regularisation and strong regularisation over the recurrent\nlayers the model would overfit rather quickly. The behaviour when pU = 0.5 and pE = 0 is quite\ninteresting: test error decreases and then increases before decreasing again. Also, it seems that when\npU = 0 and pE = 0.5 the model becomes very erratic.\nLastly, we tested the performance of Variational LSTM with different recurrent layer dropout\nprobabilities, fixing the embedding dropout probability at either pE = 0 or pE = 0.5 (figs. 4b-4c).\nThese results are rather intriguing. In this experiment all models have converged, with the loss getting\nnear zero (not shown). Yet it seems that with no embedding dropout, a higher dropout probability\nwithin the recurrent layers leads to overfitting! This presumably happens because of the large number\nof parameters in the embedding layer, which is not regularised. Regularising the embedding layer with\ndropout probability pE = 0.5 we see that a higher recurrent layer dropout probability indeed leads to\nincreased robustness to overfitting, as expected. These results suggest that embedding dropout can be\nof crucial importance in some tasks.\nIn appendix B we assess the importance of weight decay with our dropout variant. Common practice\nis to remove weight decay with naive dropout. Our results suggest that weight decay plays an\nimportant role with our variant (it corresponds to our prior belief of the distribution over the weights).\n\n6 Conclusions\nWe presented a new technique for recurrent neural network regularisation. 
Our RNN dropout variant\nis theoretically motivated and its effectiveness was empirically demonstrated.\n\n(a) Combinations of pE = 0, 0.5 with pU = 0, 0.5.\n(b) pU = 0, ..., 0.5 with fixed pE = 0.\n(c) pU = 0, ..., 0.5 with fixed pE = 0.5.\nFigure 4: Test error for Variational LSTM with various settings on the sentiment analysis task.\nDifferent dropout probabilities are used with the recurrent layer (pU) and embedding layer (pE).\n\nReferences\n[1] Martin Sundermeyer, Ralf Schl\u00fcter, and Hermann Ney. LSTM neural networks for language modeling. In INTERSPEECH, 2012.\n[2] N Kalchbrenner and P Blunsom. Recurrent continuous translation models. In EMNLP, 2013.\n[3] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In NIPS, 2014.\n[4] Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329, 2014.\n[5] Geoffrey E Hinton et al. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.\n[6] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. JMLR, 2014.\n[7] Marius Pachitariu and Maneesh Sahani. Regularization and nonlinearities for neural language models: when are they needed? arXiv preprint arXiv:1301.5650, 2013.\n[8] J Bayer et al. On fast dropout and its applicability to recurrent networks. arXiv preprint arXiv:1311.0701, 2013.\n[9] Vu Pham, Th\u00e9odore Bluche, Christopher Kermorvant, and J\u00e9r\u00f4me Louradour. Dropout improves recurrent neural networks for handwriting recognition. In ICFHR. IEEE, 2014.\n[10] Th\u00e9odore Bluche, Christopher Kermorvant, and J\u00e9r\u00f4me Louradour. Where to apply dropout in recurrent neural networks for handwriting recognition? In ICDAR. 
IEEE, 2015.\n[11] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In ICML, 2014.\n[12] Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural network. In ICML, 2015.\n[13] Jose Miguel Hernandez-Lobato and Ryan Adams. Probabilistic backpropagation for scalable learning of Bayesian neural networks. In ICML, 2015.\n[14] Yarin Gal and Zoubin Ghahramani. Bayesian convolutional neural networks with Bernoulli approximate variational inference. arXiv:1506.02158, 2015.\n[15] Diederik Kingma, Tim Salimans, and Max Welling. Variational dropout and the local reparameterization trick. In NIPS. Curran Associates, Inc., 2015.\n[16] Anoop Korattikara Balan, Vivek Rathod, Kevin P Murphy, and Max Welling. Bayesian dark knowledge. In NIPS. Curran Associates, Inc., 2015.\n[17] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. arXiv:1506.02142, 2015.\n[18] S Hochreiter and J Schmidhuber. Long short-term memory. Neural computation, 9(8), 1997.\n[19] Kyunghyun Cho et al. Learning phrase representations using RNN encoder\u2013decoder for statistical machine translation. In EMNLP, Doha, Qatar, October 2014. ACL.\n[20] Taesup Moon, Heeyoul Choi, Hoshik Lee, and Inchul Song. RnnDrop: A Novel Dropout for RNNs in ASR. In ASRU Workshop, December 2015.\n[21] David JC MacKay. A practical Bayesian framework for backpropagation networks. Neural computation, 4(3):448\u2013472, 1992.\n[22] R M Neal. Bayesian learning for neural networks. PhD thesis, University of Toronto, 1995.\n[23] Geoffrey E Hinton and Drew Van Camp. Keeping the neural networks simple by minimizing the description length of the weights. In COLT, pages 5\u201313. ACM, 1993.\n[24] David Barber and Christopher M Bishop. Ensemble learning in Bayesian neural networks. 
NATO ASI Series F Computer and Systems Sciences, 168:215\u2013238, 1998.\n[25] Alex Graves. Practical variational inference for neural networks. In NIPS, 2011.\n[26] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. Speech recognition with deep recurrent neural networks. In ICASSP. IEEE, 2013.\n[27] Bo Pang and Lillian Lee. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In ACL. ACL, 2005.\n[28] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.\n[29] James Bergstra et al. Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy), June 2010. Oral Presentation.\n[30] fchollet. Keras. https://github.com/fchollet/keras, 2015.", "award": [], "sourceid": 594, "authors": [{"given_name": "Yarin", "family_name": "Gal", "institution": "University of Cambridge"}, {"given_name": "Zoubin", "family_name": "Ghahramani", "institution": "University of Cambridge"}]}