{"title": "DropMax: Adaptive Variational Softmax", "book": "Advances in Neural Information Processing Systems", "page_first": 919, "page_last": 929, "abstract": "We propose DropMax, a stochastic version of softmax classifier which at each iteration drops non-target classes according to dropout probabilities adaptively decided for each instance. Specifically, we overlay binary masking variables over class output probabilities, which are input-adaptively learned via variational inference. This stochastic regularization has an effect of building an ensemble classifier out of exponentially many classifiers with different decision boundaries. Moreover, the learning of dropout rates for non-target classes on each instance allows the classifier to focus more on classification against the most confusing classes. We validate our model on multiple public datasets for classification, on which it obtains significantly improved accuracy over the regular softmax classifier and other baselines. Further analysis of the learned dropout probabilities shows that our model indeed selects confusing classes more often when it performs classification.", "full_text": "DropMax: Adaptive Variational Softmax\n\nHae Beom Lee1,2, Juho Lee3,2, Saehoon Kim2, Eunho Yang1,2, Sung Ju Hwang1,2\n\nKAIST1, AItrics2, South Korea,\n\nUniversity of Oxford3, United Kingdom,\n\n{haebeom.lee, eunhoy, sjhwang82}@kaist.ac.kr\njuho.lee@stats.ox.ac.uk, shkim@aitrics.com\n\nAbstract\n\nWe propose DropMax, a stochastic version of softmax classi\ufb01er which at each\niteration drops non-target classes according to dropout probabilities adaptively\ndecided for each instance. Speci\ufb01cally, we overlay binary masking variables\nover class output probabilities, which are input-adaptively learned via variational\ninference. 
This stochastic regularization has an effect of building an ensemble classifier out of exponentially many classifiers with different decision boundaries. Moreover, the learning of dropout rates for non-target classes on each instance allows the classifier to focus more on classification against the most confusing classes. We validate our model on multiple public datasets for classification, on which it obtains significantly improved accuracy over the regular softmax classifier and other baselines. Further analysis of the learned dropout probabilities shows that our model indeed selects confusing classes more often when it performs classification.\n\n1 Introduction\n\nDeep learning models have shown impressive performance on classification tasks [17, 10, 11]. However, most of the effort thus far has gone into improving the network architecture, while the predominant choice for the final classification function has remained the basic softmax regression. Relatively little research has been done here, except for a few works that propose variants of softmax, such as Sampled Softmax [12], Spherical Softmax [5], and Sparsemax [22]. However, they either do not target accuracy improvement or improve accuracy only in certain limited settings.\nIn this paper, we propose a novel variant of the softmax classifier that achieves improved accuracy over the regular softmax function by leveraging the popular dropout regularization, which we refer to as DropMax. At each stochastic gradient descent step in network training, the DropMax classifier applies dropout to the exponentiations in the softmax function, such that we consider the true class and a random subset of other classes to learn the classifier. 
At each training step, this allows the classifier to be learned to solve a distinct subproblem of the given multi-class classification problem, enabling it to focus on the discriminative properties of the target class relative to the sampled classes. Finally, when training is over, we can obtain an ensemble of exponentially many1 classifiers with different decision boundaries.\nMoreover, when doing so, we further exploit the intuition that some classes could be more important than others for the correct classification of each instance, as they may be confused more with the given instance. For example, in Figure 1, the instance of the class cat on the left is likely to be confused more with the class lion because of the lion mane wig it is wearing. The cat instance on the right, on the other hand, resembles a jaguar due to its spots. Thus, we extend our classifier to learn the probability of dropping non-target classes for each input instance, such that the stochastic classifier can consider classification against confusing classes more often than others, adaptively for each input. This helps to better classify such difficult instances, which in turn results in improved overall classification performance.\n\n1Exponential with respect to the number of classes.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.\n\nThe proposed adaptive class dropout can also be viewed as a stochastic attention mechanism that selects a subset of classes each instance should attend to in order for it to be well discriminated from any of the false classes. 
It also has, in some sense, a similar effect to boosting: learning a classifier at each iteration with randomly selected non-target classes can be seen as learning a weak classifier, which is combined into a final strong classifier that solves the complete multi-class classification problem, with the weights provided by the class retain probabilities learned for each input. Our regularization is generic and can be applied even to networks on which regular dropout is ineffective, such as ResNet, to obtain improved performance.\n\nFigure 1: Concepts. For a given instance, classes are randomly sampled with the probabilities learned adaptively for each instance. Then the sampled subset participates in the classification.\n\nWe validate our model on five public datasets for classification, on which it consistently obtains significant accuracy improvements over the base softmax, with noticeable improvements on fine-grained datasets (3.38%p on AWA and 7.77%p on CUB).\nOur contribution is threefold:\n\n• We propose a novel stochastic softmax function, DropMax, that randomly drops non-target classes when computing the class probability for each input instance.\n\n• We propose a variational inference framework to adaptively learn the dropout probability of non-target classes for each input, s.t. our stochastic classifier considers non-target classes confused with the true class of each instance more often than others.\n\n• We propose a novel approach to incorporate label information into our conditional variational inference framework, which yields more accurate posterior estimation.\n\n2 Related Work\n\nSubset sampling with softmax classifier Several existing works propose to consider only a partial subset of classes to compute the softmax, as done in our work. 
The main motivation is improving the efficiency of the computation, as the matrix multiplication for computing class logits is expensive when there are too many classes to consider. For example, the number of classes (or words) often exceeds a million in language translation tasks. The common practice to tackle this challenge is to use a shortlist of the 30K to 80K most frequent target words to reduce the inherent scale of the classification problem [3, 20]. Further, to leverage the full vocabulary, [12] propose to calculate the importance of each word with a deterministic function and select the top-K among them. On the other hand, [22] suggest a new softmax variant that can generate sparse class probabilities, which has a similar effect to the aforementioned models. Our model also works with a subset of classes, but the main difference is that our model aims to improve the accuracy of the classifier, rather than its computational efficiency.\n\nDropout variational inference Dropout [25] is one of the most popular and successful regularizers for deep neural networks. Dropout randomly drops out each neuron with a predefined probability at each iteration of stochastic gradient descent, to achieve the effect of ensemble learning by combining the exponentially many networks learned during training. Dropout can also be understood as a noise injection process [4], which makes the model robust to small perturbations of the inputs. Noise injection is also closely related to probabilistic modeling, and [8] has shown that a network trained with dropout can be seen as an approximation to a deep Gaussian process. Such a Bayesian understanding of dropout allows us to view model training as posterior inference, where the predictive distribution is sampled by dropout at test time [13]. 
The same process can be applied to convolutional [6] and recurrent networks [7].\n\nLearning dropout probability In regular dropout regularization, the dropout rate is a tunable parameter found via cross-validation. However, some recently proposed models allow the dropout probability to be learned during training. Variational dropout [14] assumes that each individual weight has an independent Gaussian distribution with mean and variance, which are trained with the reparameterization trick. Due to the central limit theorem, such Gaussian dropout is identical to binary dropout, but with much faster convergence [25, 27]. [23] show that variational dropout with unbounded variance results in sparsity, whose effect is similar to automatic relevance determination (ARD). All the aforementioned work deals with posterior distributions that do not depend on the input at test time. On the other hand, adaptive dropout [2] learns an input-dependent posterior at test time by overlaying a binary belief network on the hidden layers. Whereas the approximate posterior is usually assumed to decompose into independent components, adaptive dropout overcomes this by learning correlations between network components in the mean of the input-dependent posterior. Recently, [9] proposed to train the dropout probability of each layer for accurate estimation of model uncertainty, by reparameterizing the Bernoulli distribution with a continuous relaxation [21].\n\n3 Approach\n\nWe first introduce the general problem setup. Suppose a dataset D = {(x_i, y_i)}_{i=1}^N, with x_i ∈ R^d and one-hot categorical labels y_i ∈ {0, 1}^K, where K is the number of classes. We will omit the index i when dealing with a single datapoint. Further suppose h = NN(x; ω), the last feature vector generated from an arbitrary neural network NN(·) parameterized by ω. Note that ω is globally optimized w.r.t. 
the other network components to be introduced later, and we will omit the details for brevity. We then define the K-dimensional class logits (or scores):\n\no(x; ψ) = W^⊤ h + b,  ψ = {W, b}  (1)\n\nThe original form of the softmax classifier can then be written as:\n\np(y|x; ψ) = exp(o_t(x; ψ)) / Σ_k exp(o_k(x; ψ)),  where t is the target class of x.  (2)\n\n3.1 DropMax\n\nAs mentioned in the introduction, we propose to randomly drop out classes at the training phase, with the motivation of learning an ensemble of exponentially many classifiers in a single training run. In (2), one can see that class k is completely excluded from the classification when exp(o_k) = 0, and the gradients are not back-propagated from it. From this observation, we randomly drop exp(o_1), . . . , exp(o_K) based on Bernoulli trials, by introducing a dropout binary mask vector z_k with retain probability ρ_k, which is one minus the dropout probability for each class k:\n\nz_k ∼ Ber(z_k; ρ_k),  p(y|x, z; ψ) = (z_t + ε) exp(o_t(x; ψ)) / Σ_k (z_k + ε) exp(o_k(x; ψ))  (3)\n\nwhere a sufficiently small ε > 0 (e.g. 10^{-20}) prevents the whole denominator from vanishing. However, if we drop the classes based on purely random Bernoulli trials, we may exclude classes that are important for classification. Obviously, the target class t of a given instance should not be dropped, but we cannot manually set the retain probability ρ_t = 1 since the target classes differ for each instance and, more importantly, we do not know them at test time. We also want the retain probabilities ρ_1, . . . 
, ρ_K to encode meaningful correlations between classes, so that highly correlated classes may be dropped or retained together, limiting the hypothesis space to a meaningful subspace.\nTo resolve these issues, we adopt the idea of Adaptive Dropout [2] and model ρ ∈ [0, 1]^K as the output of a neural network that takes the last feature vector h as input:\n\nρ(x; θ) = sgm(W_θ^⊤ h + b_θ),  θ = {W_θ, b_θ}.  (4)\n\nBy learning θ, we expect these retain probabilities to be high for the target class of a given input, and to capture correlations between classes. Based on this retain probability network, DropMax is defined as follows:\n\nz_k | x ∼ Ber(z_k; ρ_k(x; θ)),  p(y|x, z; ψ, θ) = (z_t + ε) exp(o_t(x; ψ)) / Σ_k (z_k + ε) exp(o_k(x; ψ))  (5)\n\nThe main difference of our model from [2] is that, unlike adaptive dropout, which drops neurons of intermediate layers, we drop classes. As stated earlier, this is a critical difference: by dropping classes we let the model learn on a different (sub)problem at each iteration, while adaptive dropout trains a different model at each iteration. Of course, our model can be extended to also learn the dropout probabilities for the intermediate layers, but that is not our primary concern at this point. Note that DropMax can be easily applied to any type of neural network, such as convolutional or recurrent neural nets, provided it has a softmax output at the last layer. 
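As a minimal illustration of the masked softmax in (3) and (5), the following sketch in plain Python (with hypothetical names, not the released implementation) shows how a dropped class is removed from the competition while ε keeps the denominator alive:

```python
import math

def dropmax_prob(logits, mask, eps=1e-20):
    """Masked softmax of (3)/(5): class k competes only when its
    binary retain mask z_k is 1 (illustrative sketch)."""
    m = max(logits)  # subtract the max for numerical stability
    weights = [(z + eps) * math.exp(o - m) for o, z in zip(logits, mask)]
    total = sum(weights)  # eps keeps this from vanishing even if all z_k = 0
    return [w / total for w in weights]

# Dropping class 2 removes it from the competition entirely:
p = dropmax_prob([2.0, 1.0, 3.0], [1, 1, 0])
```

Note that with the mask above, the probabilities reduce to a softmax over the two retained classes only; the dropped class receives probability on the order of ε.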
This generality is another benefit of our approach compared to (adaptive) dropout, which is reported to degrade performance when used in the intermediate layers of convolutional or recurrent neural networks without careful configuration.\nA limitation of [2] is its use of heuristics to learn the dropout probabilities, which may result in high-variance gradients during training. To overcome this weakness, we use the concrete distribution [21], a continuous relaxation of discrete random variables that allows back-propagation through the (relaxed) Bernoulli random variables z_k to compute the gradients of θ [9]:\n\nz_k = sgm{ τ^{-1} (log ρ_k(x; θ) − log(1 − ρ_k(x; θ)) + log u − log(1 − u)) }  (6)\n\nwith u ∼ Unif(0, 1). The temperature τ, usually set to 0.1, determines the degree of probability mass concentration towards 0 and 1.\n\n4 Approximate Inference for DropMax\n\nIn this section, we describe the learning framework for DropMax. For notational simplicity, we define X, Y, Z as the concatenations of x_i, y_i, and z_i over all training instances (i = 1, . . . , N).\n\n4.1 Intractable true posterior\n\nWe first check the form of the true posterior distribution p(Z|X, Y) = ∏_{i=1}^N p(z_i|x_i, y_i). If it is tractable, then we can use exact inference algorithms such as EM to directly maximize the log-likelihood of our observation Y|X. For each instance, the posterior distribution can be written as\n\np(z|x, y) = p(y, z|x) / p(y|x) = p(y|x, z) p(z|x) / Σ_{z'} p(y|x, z') p(z'|x)  (7)\n\nwhere we let p(z|x) = ∏_{k=1}^K p(z_k|x) for simplicity. However, the graphical representation of (5) indicates dependencies among z_1, . . . , z_K when y is observed. This means that, unlike p(z|x), the true posterior p(z|x, y) does not decompose into a product over its elements. 
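The relaxed Bernoulli sampling of (6) can be sketched as follows, a plain-Python illustration with hypothetical names (the actual model draws one u per class inside the network):

```python
import math
import random

def stable_sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def relaxed_bernoulli(rho, tau=0.1, rng=random):
    """Binary-concrete relaxation of z ~ Ber(rho), as in (6): uniform
    noise u is pushed through a tempered sigmoid, giving a sample in
    (0, 1) that is differentiable w.r.t. rho (illustrative sketch)."""
    u = rng.random()
    logit = (math.log(rho) - math.log(1.0 - rho)
             + math.log(u) - math.log(1.0 - u))
    return stable_sigmoid(logit / tau)

random.seed(0)
# A small temperature tau pushes the samples toward the extremes 0 and 1:
z = [relaxed_bernoulli(rho) for rho in (0.95, 0.5, 0.05)]
```

With τ = 0.1 the samples sit very close to 0 or 1, so the relaxation behaves almost like a hard Bernoulli draw while still passing gradients to ρ.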
Further, the denominator is a summation over the exponentially many combinations of z, which makes the form of the true posterior even more complicated.\nThus, we suggest using stochastic gradient variational Bayes (SGVB), a general framework for approximating intractable posteriors of latent variables in neural networks [15, 24]. In standard variational inference, we maximize the evidence lower bound (ELBO):\n\nlog p(Y|X; ψ, θ) ≥ Σ_{i=1}^N { E_{q(z_i|x_i, y_i; φ)}[ log p(y_i|z_i, x_i; ψ) ] − KL[ q(z_i|x_i, y_i; φ) || p(z_i|x_i; θ) ] }  (8)\n\nwhere q(z_i|x_i, y_i; φ) is our approximate posterior with a set of variational parameters φ.\n\n4.2 Structural form of the approximate posterior\n\nFigure 2: Illustration of model architectures. (a) The DropMax (q = p) model at training time, which lets q(z|x, y) = p(z|x; θ) except that it fixes the target mask to 1. (b) The DropMax model at training time, which utilizes the label information. (c) The test-time architecture for both models.\n\nThe probabilistic interpretation of each term in (8) is straightforward. However, it does not tell us how to encode the terms into network components. Especially, in modeling q(z|x, y; φ), how to utilize the label y is not a straightforward matter. [24] suggests a simple concatenation [h(x); y] as the input to q(z|x, y; φ), while generating a pseudo-label from p(z|x; θ) to make the training and testing pipelines identical. However, this significantly increases the number of parameters of the whole network. On the other hand, [24] also proposes another solution where the approximate posterior simply ignores y and shares its parameters with p(z|x; θ) (Figure 2(a)). 
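Under the mean-field Bernoulli factorization used here, the KL term of the ELBO in (8) reduces to a sum of per-class Bernoulli divergences. A sketch in plain Python (illustrative names, not the paper's code):

```python
import math

def bernoulli_kl(q_probs, p_probs, eps=1e-12):
    """KL[q || p] between products of independent Bernoullis, i.e. the
    KL term of the per-instance ELBO in (8) under the mean-field
    factorization (illustrative sketch)."""
    total = 0.0
    for g, rho in zip(q_probs, p_probs):
        g = min(max(g, eps), 1.0 - eps)      # clip away exact 0 and 1
        rho = min(max(rho, eps), 1.0 - eps)
        total += (g * math.log(g / rho)
                  + (1.0 - g) * math.log((1.0 - g) / (1.0 - rho)))
    return total

kl_zero = bernoulli_kl([0.3, 0.7], [0.3, 0.7])  # identical distributions
kl_pos = bernoulli_kl([0.9, 0.1], [0.5, 0.5])
```

Setting the target-class entry of q to 1 recovers the log(1/ρ_t) term for the target class, consistent with the fixed target mask used by the approximate posterior.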
This method is known to be stable due to the consistency between the training and testing pipelines [24, 28, 13]. However, we empirically found that this approach produces suboptimal results for DropMax since it yields an inaccurate approximate posterior.\nOur novel approach to solving this problem starts from the observation that the relationship between z and y is relatively simple in DropMax (5), unlike the case where latent variables are assumed at lower layers. In this case, even though a closed form of the true posterior is not available, we can capture a few important properties of it and encode them into the approximate posterior.\nThe first step is to encode the structural form of the true posterior (7), which is decomposable into two factors: 1) the factor dependent only on x, and 2) the factor dependent on both x and y:\n\np(z|x, y) = p(z|x) × p(y|z, x) / p(y|x),  (9)\n\nwhere we refer to the first factor as A and the second as B. The key idea is that the factor B can be interpreted as a rescaling factor of the unlabeled posterior p(z|x), which takes x and y as inputs. In doing so, we model the approximate posterior q(z|x, y) with two pipelines. Each of them corresponds to: A without the label, which we have already defined as p(z|x; θ) = Ber(z; sgm(W_θ^⊤ h + b_θ)) in (4), and B with the label, which we discuss below.\nB is able to scale A up or down, but at the same time should bound the resultant posterior p(z|x, y) in the range [0, 1]^K. To model B with network components, we simply add to the logit of A a vector r ∈ R^K taking x as an input (we defer the description of how to use y to the next subsection). Then we squash it again into the range [0, 1] (note that addition at the logit level is multiplicative):\n\ng(x; φ) = sgm(W̄_θ^⊤ h + b̄_θ + r(x; φ)),  r(x; φ) = W_φ^⊤ h + b_φ,  (10)\n\nwhere g(x; φ) is the main ingredient of our approximate posterior in (12), and φ = {W_φ, b_φ} is the variational parameter. W̄_θ and b̄_θ denote that stop-gradients are applied to them, to make sure g(x; φ) is parameterized only by the variational parameter φ, which is important for properly defining the variational distribution. Next we discuss how to encode y into r(x; φ) and g(x; φ), to finalize the approximate posterior q(z|x, y; φ).\n\n4.3 Encoding the label information\n\nOur modeling choice for encoding y is based on the following observations.\nObservation 1. If we are given (x, y) and consider z_1, . . . , z_K one by one, then z_t is positively correlated with the DropMax likelihood p(y|x, z) in (5), while z_{k≠t} is negatively correlated with it.\nObservation 2. The true posterior of the target retain probability, p(z_t = 1|x, y), is 1 if we exclude the case z_1 = z_2 = ··· = z_K = 0, i.e. the case where the retain mask of every class is 0.\nOne can easily verify observation 1: the likelihood will increase if we attend to the target more, and vice versa. We encode this observation as follows. Noting that the likelihood p(y|x, z) is in general maximized over the training instances, the factor B in (9) involves p(y|x, z) and should behave consistently (as in observation 1). Toward this, each r_t(x; φ) and r_{k≠t}(x; φ) should be maximized and minimized, respectively. We achieve this by minimizing the cross-entropy for sgm(r(x; φ)) across the training instances:\n\nL_aux(φ) = − Σ_{i=1}^N Σ_{k=1}^K { y_{i,k} log sgm(r_k(x_i; φ)) + (1 − y_{i,k}) log(1 − sgm(r_k(x_i; φ))) }  (11)\n\nObservation 2 says that z_{\\t} ≠ 0 → z_t = 1 given y. Thus, simply ignoring the case z_{\\t} = 0 and fixing q(z_t|x, y; φ) = Ber(z_t; 1) is a close approximation of p(z_t|x, y), especially under the mean-field assumption (see Appendix A for justification). Hence, our final approximate posterior is given as:\n\nq(z|x, y; φ) = Ber(z_t; 1) ∏_{k≠t} Ber(z_k; g_k(x; φ)).  (12)\n\nSee Figure 2(b) and 2(c) for an illustration of the model architecture.\n\n4.4 Regularized variational inference\n\nOne last critical issue in optimizing the ELBO (8) is that p(z|x; θ) collapses into q(z|x, y; φ) too easily, as p(z|x; θ) is parametric in the input x. Preventing this is crucial for z to generalize well to a test instance x*, because z is sampled from p(z|x*; θ) at test time (Figure 2(c)). We empirically found that imposing a prior (e.g. a zero-mean Gaussian or Laplace prior) on θ = {W_θ, b_θ} was not effective in preventing this behavior. (The situation is different from the VAE [15], where the prior of the latent code p(z) is commonly set to a Gaussian with no trainable parameters, i.e. N(0, λI).)\nWe propose to remove weight decay for θ and apply an entropy regularizer directly to p(z|x; θ). We empirically found that this method works well without any scaling hyperparameters:\n\nH(p(z|x; θ)) = − Σ_k { ρ_k log ρ_k + (1 − ρ_k) log(1 − ρ_k) }  (13)\n\nWe are now equipped with all the essential components. 
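The entropy regularizer of (13) can be sketched in plain Python as follows (illustrative names; written with the conventional leading minus sign, so larger values mean more spread-out retain probabilities):

```python
import math

def mask_entropy(retain_probs, eps=1e-12):
    """Entropy of the independent Bernoulli retain distribution
    p(z|x; theta), as in (13); subtracting it from the training loss
    discourages the retain probabilities from collapsing to 0 or 1
    (illustrative sketch)."""
    h = 0.0
    for rho in retain_probs:
        rho = min(max(rho, eps), 1.0 - eps)  # guard log(0)
        h -= rho * math.log(rho) + (1.0 - rho) * math.log(1.0 - rho)
    return h

h_spread = mask_entropy([0.5, 0.5])    # maximal for two classes
h_sharp = mask_entropy([0.999, 0.001])
```

Since the term enters the minimization objective with a minus sign, training favors the higher-entropy (less collapsed) retain distributions, which is the stated purpose of the regularizer.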
The KL divergence and the final minimization objective are given as:\n\nKL[ q(z|x, y; φ) || p(z|x; θ) ] = Σ_k [ I{k=t} log(1/ρ_k) + I{k≠t} { g_k log(g_k/ρ_k) + (1 − g_k) log((1 − g_k)/(1 − ρ_k)) } ]\n\nL(ψ, θ, φ) = Σ_{i=1}^N { −(1/S) Σ_{s=1}^S log p(y_i|x_i, z_i^{(s)}; ψ) + KL[ q(z_i|x_i, y_i; φ) || p(z_i|x_i; θ) ] − H(p(z_i|x_i; θ)) } + L_aux(φ)\n\nwhere z_i^{(s)} ∼ q(z_i|x_i, y_i; φ) and S = 1 as usual. Figure 2(b) and (c) illustrate the model architectures for training and testing, respectively.\nWhen testing, we can perform Monte-Carlo sampling:\n\np(y*|x*) = E_z[ p(y*|x*, z) ] ≈ (1/S) Σ_{s=1}^S p(y*|x*, z^{(s)}),  z^{(s)} ∼ p(z|x*; θ).  (14)\n\nAlternatively, we can approximate the expectation as\n\np(y*|x*) = E_z[ p(y*|x*, z) ] ≈ p(y*|x*, E[z|x*]) = p(y*|x*, ρ(x*; θ)),  (15)\n\nwhich is a common practice for many practitioners. We report test error based on (15).\n\n5 Experiments\n\nBaselines and our models We first introduce the relevant baselines and our models.\n1) Base Softmax. The baseline CNN with softmax, which uses only hidden-unit dropout at the fully connected layers, or no dropout regularization at all.\n2) Sparsemax. The base network with the Sparsemax loss proposed by [22], which produces sparse class probabilities.\n\nTable 1: Classification performance in test error (%). 
The reported numbers are mean and standard errors with 95% confidence intervals over 5 runs.\n\nModels | M-1K | M-5K | M-55K | C10 | C100 | AWA | CUB\nBase Softmax | 7.09±0.46 | 2.13±0.21 | 0.65±0.04 | 7.90±0.21 | 30.60±0.12 | 30.29±0.80 | 48.84±0.85\nSparsemax [22] | 6.57±0.17 | 2.05±0.18 | 0.75±0.06 | 7.90±0.28 | 31.41±0.16 | 36.06±0.64 | 64.41±1.12\nSampled Softmax [12] | 7.36±0.22 | 2.31±0.14 | 0.66±0.04 | 7.98±0.24 | 30.87±0.19 | 29.81±0.45 | 49.90±0.56\nRandom DropMax | 7.19±0.57 | 2.23±0.19 | 0.68±0.07 | 8.21±0.08 | 30.78±0.28 | 31.11±0.54 | 48.87±0.79\nDeterministic Attention | 6.91±0.46 | 2.03±0.11 | 0.69±0.05 | 7.87±0.24 | 30.60±0.21 | 30.98±0.66 | 49.97±0.32\nDeterministic DropMax | 6.30±0.64 | 1.89±0.04 | 0.64±0.05 | 7.84±0.14 | 30.55±0.51 | 26.22±0.76 | 47.35±0.42\nDropMax (q = p) | 7.52±0.26 | 2.05±0.07 | 0.63±0.02 | 7.80±0.22 | 29.98±0.35 | 29.27±1.19 | 42.08±0.94\nDropMax | 5.32±0.09 | 1.64±0.08 | 0.59±0.04 | 7.67±0.11 | 29.87±0.36 | 26.91±0.54 | 41.07±0.57\n\n3) Sampled Softmax. The base network with sampled softmax [12]. The sampling function Q(y|x) is uniform during training. We tune the number of sampled classes among {20%, 40%, 60%} of the total classes, while the target class is always selected. Testing is done with (2).\n4) Random DropMax. A baseline that randomly drops out non-target classes with a predefined retain probability ρ ∈ {0.2, 0.4, 0.6} at training time. For learning stability, the target class is not dropped during training. Testing is done with the softmax function (2), without sampling the dropout masks.\n5) Deterministic Attention. Softmax with deterministic sigmoid attentions multiplied at the exponentiations. The attention probabilities are generated from the last feature vector h with additional weights and biases, similarly to (4).\n6) Deterministic DropMax. The same as Deterministic Attention, except that it is trained in a supervised manner with the true class labels. With such consideration of labels when training the attention-generating network, the model can be viewed as a deterministic version of DropMax.\n7) DropMax (q = p). A variant of DropMax where we let q(z|x, y) = p(z|x; θ), except that we fix q(z_t|x, y) = Ber(z_t; 1) as in (12) for learning stability. The corresponding KL[q||p] can be easily computed. The auxiliary loss term L_aux is removed, and the entropy term H is scaled with a hyperparameter γ ∈ {1, 0.1, 0.01} (see Figure 2(a)).\n8) DropMax. Our adaptive stochastic softmax, where each class is dropped out with input-dependent probabilities trained from the data. No hyperparameters are needed for scaling each term.\nWe implemented DropMax in the Tensorflow [1] framework. The source code is available at https://github.com/haebeom-lee/dropmax.\n\nDatasets and base networks We validate our method on multiple public datasets for classification, with a different network architecture for each dataset.\n1) MNIST. This dataset [19] consists of 60,000 images of hand-written digits from 0 to 9. We experiment with varying numbers of training instances: 1K, 5K, and 55K. The validation and test sets have 5K and 10K instances, respectively. As the base network, we use the CNN provided in the Tensorflow Tutorial, which has a structure similar to LeNet.\n2) CIFAR-10. This dataset [16] consists of 10 generic object classes, with 5000 training images and 1000 test images per class. We use ResNet-34 [10] as the base network.\n3) CIFAR-100. This dataset consists of 100 object classes. It has 500 training images and 100 test images per class. 
We use ResNet-34 as the base network.\n4) AWA. A dataset for classifying animal species [18], containing 30,475 images from 50 animal classes such as cow, fox, and humpback whale. For each class, we use 50 images for testing, while the rest of the images are used as the training set. We use ResNet-18 as the base network.\n5) CUB-200-2011. This dataset [26] consists of 200 bird classes such as Black-footed albatross, Rusty blackbird, and Eastern towhee. It has 5994 training images and 5794 test images, which is quite small compared to the number of classes. We use only the class label for classification. We use ResNet-18 as the base network.\nAs the AWA and CUB datasets are subsets of the ImageNet-1K dataset, for those datasets we do not use a pretrained model but train from scratch. The experimental setup is available in Appendix C.\n\nFigure 4: Visualization of class dropout probabilities for example test instances from the MNIST-1K dataset. Panels: (a) ρ (easy), (b) ρ (hard), (c) q (hard), (d) z^{(s)} (hard), (e) average ρ (hard), (f) p(y|x). (a) and (b) show the estimated class retain probabilities for easy and difficult test instances, respectively. The green o's denote the ground truths, while the red x's denote the base model predictions. (c) shows the approximate posterior q(z|x, y; φ). (d) shows retain masks generated from (b). (e) shows the average retain probability per class for hard instances. (f) shows sampled predictive distributions of an easy and a difficult instance, respectively.\n\n5.1 Quantitative Evaluation\n\nMulti-class classification. We report the classification performance of our models and the baselines in Table 1. The results show that variants of the softmax function such as Sparsemax and Sampled Softmax perform similarly to the original softmax function (or worse). 
Random DropMax also performs worse due to the inconsistency between the train- and test-time handling of the dropout probabilities for the target class. Deterministic Attention also performs similarly to all the previous baselines. Interestingly, Deterministic DropMax with supervised learning of the attention mechanism improves the performance over the base softmax classifier, which suggests that such a combination of a multi-class and a multi-label classifier could be somewhat beneficial. However, the improvements are marginal except on the AWA dataset, because the gating function also lacks proper regularization and thus yields very sharp attention probabilities. DropMax (q = p) has an entropy regularizer to address this issue, but the model obtains suboptimal performance due to inaccurate posterior estimation. On the other hand, the gating function of DropMax is optimally regularized to make a crude selection of candidate classes via the proposed variational inference framework, and shows consistent and significant improvements over the baselines across all datasets. DropMax also obtains noticeably higher accuracy gains on the fine-grained AWA and CUB datasets, with 3.38%p and 7.77%p improvements, as these fine-grained datasets contain many ambiguous instances that can be effectively handled by DropMax with its focus on the most confusing classes. On the MNIST dataset, we also observe that DropMax is more effective when the number of training instances is small. We attribute this to the effect of stochastic regularization, which effectively prevents overfitting; this is a general advantage of Bayesian learning as well.\n\nConvergence rate. We examine the convergence rate of our model against the base network with the regular softmax function. Figure 3 shows plots of the cross-entropy loss computed at each training step on MNIST-55K and CIFAR-100. 
To reduce the variance of z, we plot with ρ instead (training is done with z). DropMax shows a slightly lower convergence rate, but the test loss is significantly improved, effectively preventing overfitting. Moreover, the learning curve of DropMax is more stable than that of the regular softmax (see Appendix B for more discussion).

Figure 3: Convergence plots on (a) MNIST-55K and (b) CIFAR-100.

5.2 Qualitative Analysis

We further perform qualitative analysis of our model to see how exactly it works and where the accuracy improvements come from.

Figure 5: Examples from the CIFAR-100 dataset with top-4 and bottom-2 retain probabilities. Blue and red denote the ground truths and base model predictions respectively.

Figure 4(a) shows the retain probabilities estimated for easy examples, in which case the model sets the retain probability to be high for the true class, and evenly low for non-target classes. Thus, when the examples are easy, the dropout probability estimator works like a second classifier. However, for the difficult examples in Figure 4(b) that are misclassified by the base softmax function, we observe that the retain probability is set high for the target class and a few other candidates, as this helps the model focus on the classification between them. For example, in Figure 4(b), the instance from class 3 sets a high retain probability for class 5, since its handwritten character looks somewhat similar to the number 5. However, the retain probability can be set differently even across instances from the same class, which makes sense since even within the same class, different instances may get confused with different classes.
For example, for the first instance of 4, the class with a high retain probability is 0, which resembles 0 in its contour. However, for the second instance of 4, the network sets a high retain probability for class 9, as this instance looks like a 9.
Similar behaviors can be observed on the CIFAR-100 dataset (Figure 5) as well. As an example, for instances that belong to the class girl, DropMax sets the retain probability high on the classes woman and boy, which shows that it attends to the most confusing classes in order to focus more on such difficult problems.
We further examine the class-average dropout probabilities for each class of the MNIST dataset in Figure 4(e). We observe patterns in which classes are easily confused with others. For example, class 3 is often confused with 5, and class 4 with 9. This suggests that the retain probability implicitly learns correlations between classes, since it is modeled as an input-dependent distribution. Also, since DropMax is a Bayesian inference framework, we can easily obtain predictive uncertainty from MC sampling (Figure 4(f)), even when probabilistic modeling of intermediate layers is difficult.

6 Conclusion and Future Work
We proposed a stochastic version of the softmax function, DropMax, that randomly drops non-target classes at each iteration of the training step. DropMax enables building an ensemble over exponentially many classifiers that provide different decision boundaries. We further proposed to learn the class dropout probabilities based on the input, such that the classifier can consider the discrimination of each instance against its more confusing classes. We cast this as a Bayesian learning problem and presented how to optimize the parameters through variational inference, while proposing a novel regularizer to more accurately estimate the true posterior.
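To make the mechanism summarized above concrete, the following is a minimal NumPy sketch of the masked-softmax idea, not the authors' implementation: the function name, the plain Bernoulli mask sampling, and the ε smoothing constant are illustrative assumptions. Binary masks z are sampled from the input-adaptive retain probabilities ρ(x), the target class is always kept at training time, and averaging over several mask samples gives a crude MC estimate of the predictive distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropmax_predict(logits, retain_prob, target=None, n_samples=1, eps=1e-8):
    """Masked softmax over K classes (illustrative sketch).

    logits:      shape (K,), class scores o_k
    retain_prob: shape (K,), input-adaptive retain probabilities rho_k(x)
    target:      index of the true class; when given (training), it is never dropped
    n_samples:   number of Monte Carlo mask samples to average over
    """
    probs = np.zeros_like(logits, dtype=float)
    for _ in range(n_samples):
        # Sample binary class masks z_k ~ Bernoulli(rho_k(x)).
        z = (rng.random(logits.shape) < retain_prob).astype(float)
        if target is not None:
            z[target] = 1.0  # the target class is always retained
        # Mask the exponentiations; eps keeps every class probability nonzero.
        e = (z + eps) * np.exp(logits - logits.max())
        probs += e / e.sum()
    return probs / n_samples
```

With retain probability 1 for every class this reduces exactly to the regular softmax; dropping a non-target class removes its term from the normalizer, so each sampled classifier discriminates the target only against the retained candidates.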
We validate our model on multiple public datasets for classification, on which it consistently obtains significant performance improvements over the base softmax classifier and its variants, achieving especially high accuracy on datasets for fine-grained classification. For future work, we plan to further investigate the source of the generalization improvements of DropMax, beyond the increased stability of gradients (Appendix B).

Acknowledgement
This research was supported by the Engineering Research Center Program through the National Research Foundation of Korea (NRF) funded by the Korean Government MSIT (NRF-2018R1A5A1059921), Samsung Research Funding Center of Samsung Electronics (SRFC-IT150203), Machine Learning and Statistical Inference Framework for Explainable Artificial Intelligence (No. 2017-0-01779), and the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (2015R1D1A1A01061019). Juho Lee's research leading to these results has received funding from the European Research Council under the European Union's Seventh Framework Programme (FP7/2007-2013) ERC grant agreement no. 617071.

References
[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. arXiv:1603.04467, 2016.
[2] J. Ba and B. Frey. Adaptive dropout for training deep neural networks. In NIPS, 2013.
[3] D. Bahdanau, K. Cho, and Y. Bengio. Neural Machine Translation by Jointly Learning to Align and Translate.
In ICLR, 2015.
[4] X. Bouthillier, K. Konda, P. Vincent, and R. Memisevic. Dropout as data augmentation. arXiv e-prints, June 2015.
[5] A. de Brébisson and P. Vincent. An exploration of softmax alternatives belonging to the spherical loss family. In ICLR, 2016.
[6] Y. Gal and Z. Ghahramani. Bayesian Convolutional Neural Networks with Bernoulli Approximate Variational Inference. arXiv e-prints, June 2015.
[7] Y. Gal and Z. Ghahramani. A Theoretically Grounded Application of Dropout in Recurrent Neural Networks. In NIPS, 2016.
[8] Y. Gal and Z. Ghahramani. Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. In ICML, 2016.
[9] Y. Gal, J. Hron, and A. Kendall. Concrete Dropout. In NIPS, 2017.
[10] K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. In CVPR, 2016.
[11] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In CVPR, 2017.
[12] S. Jean, K. Cho, R. Memisevic, and Y. Bengio. On Using Very Large Target Vocabulary for Neural Machine Translation. In ACL, 2015.
[13] A. Kendall and Y. Gal. What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision? In NIPS, 2017.
[14] D. P. Kingma, T. Salimans, and M. Welling. Variational Dropout and the Local Reparameterization Trick. In NIPS, 2015.
[15] D. P. Kingma and M. Welling. Auto-Encoding Variational Bayes. In ICLR, 2014.
[16] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, 2009.
[17] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. In NIPS, 2012.
[18] C. Lampert, H. Nickisch, and S. Harmeling. Learning to Detect Unseen Object Classes by Between-Class Attribute Transfer. In CVPR, 2009.
[19] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition.
Proceedings of the IEEE, 86(11):2278–2324, 1998.
[20] M.-T. Luong, I. Sutskever, Q. V. Le, O. Vinyals, and W. Zaremba. Addressing the Rare Word Problem in Neural Machine Translation. In ACL, 2015.
[21] C. J. Maddison, A. Mnih, and Y. W. Teh. The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables. In ICLR, 2017.
[22] A. F. T. Martins and R. F. Astudillo. From Softmax to Sparsemax: A Sparse Model of Attention and Multi-Label Classification. In ICML, 2016.
[23] D. Molchanov, A. Ashukha, and D. Vetrov. Variational Dropout Sparsifies Deep Neural Networks. In ICML, 2017.
[24] K. Sohn, H. Lee, and X. Yan. Learning structured output representation using deep conditional generative models. In NIPS, 2015.
[25] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014.
[26] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 Dataset. Technical report, 2011.
[27] S. Wang and C. Manning. Fast dropout training. In ICML, 2013.
[28] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. In ICML, 2015.