{"title": "Adaptive dropout for training deep neural networks", "book": "Advances in Neural Information Processing Systems", "page_first": 3084, "page_last": 3092, "abstract": "Recently, it was shown that by dropping out hidden activities with a probability of 0.5, deep neural networks can perform very well. We describe a model in which a binary belief network is overlaid on a neural network and is used to decrease the information content of its hidden units by selectively setting activities to zero. This ''dropout network can be trained jointly with the neural network by approximately computing local expectations of binary dropout variables, computing derivatives using back-propagation, and using stochastic gradient descent. Interestingly, experiments show that the learnt dropout network parameters recapitulate the neural network parameters, suggesting that a good dropout network regularizes activities according to magnitude. When evaluated on the MNIST and NORB datasets, we found our method can be used to achieve lower classification error rates than other feather learning methods, including standard dropout, denoising auto-encoders, and restricted Boltzmann machines. For example, our model achieves 5.8% error on the NORB test set, which is better than state-of-the-art results obtained using convolutional architectures. \"", "full_text": "Adaptive dropout for training deep neural networks\n\nLei Jimmy Ba Brendan Frey\n\nDepartment of Electrical and Computer Engineering\n\nUniversity of Toronto\n\njimmy, frey@psi.utoronto.ca\n\nAbstract\n\nRecently, it was shown that deep neural networks can perform very well if the\nactivities of hidden units are regularized during learning, e.g, by randomly drop-\nping out 50% of their activities. We describe a method called \u2018standout\u2019 in which\na binary belief network is overlaid on a neural network and is used to regularize\nof its hidden units by selectively setting activities to zero. 
This \u2018adaptive dropout network\u2019 can be trained jointly with the neural network by approximately computing local expectations of binary dropout variables, computing derivatives using back-propagation, and using stochastic gradient descent. Interestingly, experiments show that the learnt dropout network parameters recapitulate the neural network parameters, suggesting that a good dropout network regularizes activities according to magnitude. When evaluated on the MNIST and NORB datasets, we found that our method achieves lower classification error rates than other feature learning methods, including standard dropout, denoising auto-encoders, and restricted Boltzmann machines. For example, our method achieves 0.80% and 5.8% errors on the MNIST and NORB test sets, which is better than state-of-the-art results obtained using feature learning methods, including those that use convolutional architectures.\n\n1 Introduction\n\nFor decades, deep networks with broad hidden layers and full connectivity could not be trained to produce useful results, because of overfitting, slow convergence and other issues. One approach that has proven successful for unsupervised learning of both probabilistic generative models and auto-encoders is to train a deep network layer by layer in a greedy fashion [7]. Each layer of connections is learnt using contrastive divergence in a restricted Boltzmann machine (RBM) [6] or backpropagation through a one-layer auto-encoder [1], and then the hidden activities are used to train the next layer. When the parameters of a deep network are initialized in this way, further fine-tuning can be used to improve the model, e.g., for classification [2]. The unsupervised pre-training stage is a crucial component for achieving competitive overall performance on classification tasks; e.g., Coates et al. 
[4] have achieved improved classification rates by using different unsupervised learning algorithms.\n\nRecently, a technique called dropout was shown to significantly improve the performance of deep neural networks on various tasks [8], including vision problems [10]. Dropout randomly sets hidden unit activities to zero with a probability of 0.5 during training. Each training example can thus be viewed as providing gradients for a different, randomly sampled architecture, so that the final neural network efficiently represents a huge ensemble of neural networks, with good generalization capability. Experimental results on several tasks show that dropout frequently and significantly improves the classification performance of deep architectures. Injecting noise for the purpose of regularization has been studied previously, but in the context of adding noise to the inputs [3], [21] and to network components [16].\n\nUnfortunately, when dropout is used to discriminatively train a deep fully connected neural network on input with high variation, e.g., in viewpoint and angle, little benefit is achieved (Section 5.5), unless spatial structure is built in.\n\nIn this paper, we describe a generalization of dropout, where the dropout probability for each hidden variable is computed using a binary belief network that shares parameters with the deep network. Our method works well both for unsupervised and supervised learning of deep networks. We present results on the MNIST and NORB datasets showing that our \u2018standout\u2019 technique can learn better feature detectors for handwritten digit and object recognition tasks. 
Interestingly, we also find that our method enables the successful training of deep auto-encoders from scratch, i.e., without layer-by-layer pre-training.\n\n2 The model\n\nThe original dropout technique [8] uses a constant probability for omitting a unit, so a natural question we considered is whether it may help to let this probability be different for different hidden units. In particular, there may be hidden units that can individually make confident predictions for the presence or absence of an important feature or combination of features. Dropout will ignore this confidence and drop the unit out 50% of the time. Viewed another way, suppose that after dropout is applied, it is found that several hidden units are highly correlated in their pre-dropout activities. They could be combined into a single hidden unit with a lower dropout probability, freeing up hidden units for other purposes.\n\nWe denote the activity of unit j in a deep neural network by a_j and assume that its inputs are {a_i : i < j}. In dropout, a_j is randomly set to zero with probability 0.5. Let m_j be a binary variable that is used to mask the activity a_j, so that its value is\n\na_j = m_j g(\u2211_{i: i < j} w_{j,i} a_i),    (1)\n\nwhere w_{j,i} is the weight from unit i to unit j, g(\u00b7) is the activation function, and a_0 = 1 accounts for biases. Whereas in standard dropout, m_j is Bernoulli with probability 0.5, here we use an adaptive dropout probability that depends on the input activities:\n\nP(m_j = 1 | {a_i : i < j}) = f(\u2211_{i: i < j} \u03c0_{j,i} a_i),    (2)
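The masking rule in Eqs. (1)-(2) can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the names `standout_layer` and `Pi` are my own, g(.) is taken to be a ReLU, bias terms are omitted, and test-time behaviour is approximated by replacing the sampled mask with its expectation, the usual dropout-style deterministic heuristic.

```python
import numpy as np

def sigmoid(z):
    """f(z) = 1 / (1 + exp(-z)), the dropout network's squashing function."""
    return 1.0 / (1.0 + np.exp(-z))

def standout_layer(a_in, W, Pi, rng, train=True):
    """One hidden layer with adaptive ('standout') dropout.

    a_in : (batch, n_in) input activities {a_i}
    W    : (n_out, n_in) neural-network weights, w_{j,i} in Eq. (1)
    Pi   : (n_out, n_in) dropout-network weights, pi_{j,i} in Eq. (2)
    """
    pre = a_in @ W.T                   # sum_i w_{j,i} a_i
    h = np.maximum(pre, 0.0)           # g(.) taken to be a ReLU here
    keep_prob = sigmoid(a_in @ Pi.T)   # P(m_j = 1 | inputs), Eq. (2)
    if train:
        # Sample the binary mask m_j and apply it, Eq. (1).
        m = (rng.random(keep_prob.shape) < keep_prob).astype(h.dtype)
        return m * h
    # At test time, replace the sampled mask by its expectation E[m_j].
    return keep_prob * h
```

Note that `Pi` is kept as a free matrix here only to make the two sums in Eqs. (1)-(2) explicit; in the paper the dropout network shares parameters with the neural network rather than learning a fully independent set of weights.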