{"title": "Sampled Softmax with Random Fourier Features", "book": "Advances in Neural Information Processing Systems", "page_first": 13857, "page_last": 13867, "abstract": "The computational cost of training with softmax cross entropy loss grows linearly with the number of classes. For the settings where a large number of classes are involved, a common method to speed up training is to sample a subset of classes and utilize an estimate of the loss gradient based on these classes, known as the sampled softmax method. However, the sampled softmax provides a biased estimate of the gradient unless the samples are drawn from the exact softmax distribution, which is again expensive to compute. Therefore, a widely employed practical approach involves sampling from a simpler distribution in the hope of approximating the exact softmax distribution. In this paper, we develop the first theoretical understanding of the role that different sampling distributions play in determining the quality of sampled softmax. Motivated by our analysis and the work on kernel-based sampling, we propose the Random Fourier Softmax (RF-softmax) method that utilizes the powerful Random Fourier Features to enable more efficient and accurate sampling from an approximate softmax distribution. We show that RF-softmax leads to low bias in estimation in terms of both the full softmax distribution and the full softmax gradient. Furthermore, the cost of RF-softmax scales only logarithmically with the number of classes.", "full_text": "Sampled Softmax with Random Fourier Features\n\nAnkit Singh Rawat, Jiecao Chen, Felix Yu, Ananda Theertha Suresh, and Sanjiv Kumar\n\nGoogle Research, New York\n\n{ankitsrawat, chenjiecao, felixyu, theertha, sanjivk}@google.com\n\nAbstract\n\nThe computational cost of training with softmax cross entropy loss grows linearly\nwith the number of classes. 
For the settings where a large number of classes are involved, a common method to speed up training is to sample a subset of classes and utilize an estimate of the loss gradient based on these classes, known as the sampled softmax method. However, the sampled softmax provides a biased estimate of the gradient unless the samples are drawn from the exact softmax distribution, which is again expensive to compute. Therefore, a widely employed practical approach involves sampling from a simpler distribution in the hope of approximating the exact softmax distribution. In this paper, we develop the first theoretical understanding of the role that different sampling distributions play in determining the quality of sampled softmax. Motivated by our analysis and the work on kernel-based sampling, we propose the Random Fourier Softmax (RF-softmax) method that utilizes the powerful Random Fourier Features to enable more efficient and accurate sampling from an approximate softmax distribution. We show that RF-softmax leads to low bias in estimation in terms of both the full softmax distribution and the full softmax gradient. Furthermore, the cost of RF-softmax scales only logarithmically with the number of classes.

1 Introduction

The cross entropy loss based on the softmax function is widely used in multi-class classification tasks such as natural language processing [1], image classification [2], and recommendation systems [3]. In multi-class classification, given an input x ∈ X, the goal is to predict its class t ∈ {1, 2, . . . , n}, where n is the number of classes. Given an input feature x, the model (often a neural network) first computes an input embedding h ∈ R^d and then the raw scores or logits for the classes o = (o_1, . . . , o_n) as the (scaled) product of the input embedding h and the class embeddings c_1, . . . , c_n ∈ R^d:

o_i = τ h^T c_i.    (1)

Here, τ is often referred to as the (inverse) temperature parameter of softmax.
Given the logits, the probability that the model assigns to the i-th class is computed using the full softmax function

p_i = e^{o_i}/Z,    (2)

where Z = Σ_{i=1}^{n} e^{o_i} is called the partition function. The distribution in (2) is commonly referred to as the softmax distribution. Given a training set, the model parameters are estimated by minimizing an empirical risk over the training set, where the empirical risk is defined by the cross-entropy loss based on the softmax function, or the full softmax loss. Let t ∈ [n] denote the true class for the input x; then the full softmax loss is defined as¹

L(x, t) := −log p_t = −o_t + log Z.    (3)

¹The results of this paper generalize to a multi-label setting by using multi-label to multi-class reductions [4].

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

One typically employs first order optimization methods to train neural network models. This requires computing the gradient of the loss with respect to the model parameter θ during each iteration:

∇_θ L(x, t) = −∇_θ o_t + Σ_{i=1}^{n} (e^{o_i}/Z) · ∇_θ o_i = −∇_θ o_t + E_{s∼p}[∇_θ o_s],    (4)

where the expectation is taken over the softmax distribution (cf. (2)). As evident from (4), computing the gradient of the full softmax loss takes O(dn) time due to the contributions from all n classes. Therefore, training a model using the full softmax loss becomes prohibitively expensive in the settings where a large number of classes are involved. To this end, various approaches have been proposed for efficient training.
This includes different modified loss functions: hierarchical softmax [5] partitions the classes into a tree based on class similarities, allowing for O(d log n) training and inference time; spherical softmax [6, 7] replaces the exponential function by a quadratic function, enabling an efficient algorithm to compute the updates of the output weights irrespective of the output size. Efficient hardware-specific implementations of softmax are also being actively studied [8].

1.1 Sampled softmax

A popular approach to speed up the training of the full softmax loss is sampled softmax: instead of including all classes during each iteration, a small random subset of the n classes is considered, where each negative class is sampled with some probability. Formally, let the number of sampled classes during each iteration be m, with class i being picked with probability q_i. Let N_t ≜ [n]\{t} be the set of negative classes. Assuming that s_1, . . . , s_m ∈ N_t denote the sampled class indices, following [9], we define the adjusted logits o' = {o'_1, o'_2, . . . , o'_{m+1}} such that o'_1 = o_t and, for i ∈ [m],

o'_{i+1} = o_{s_i} − log(m q_{s_i}).    (5)

Accordingly, we define the sampled softmax distribution as p'_i = e^{o'_i}/Z', where Z' = Σ_{j=1}^{m+1} e^{o'_j}. The sampled softmax loss corresponds to the cross entropy loss with respect to the sampled softmax distribution:

L'(x, t) = −log p'_t = −o_t + log Z'.    (6)

Here, we note that adjusting the logits for the sampled negative classes by their expected number of occurrences in (5) ensures that Z' is an unbiased estimator of Z [9].
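The adjustment in (5) and the unbiasedness of Z' are easy to check numerically. The following is a minimal pure-Python sketch (an illustration with hypothetical helper names, not the paper's implementation): it draws m negatives from an arbitrary q with q_i > 0 and forms Z', and it also evaluates E[Z'] in closed form, which recovers Z exactly.

```python
import math
import random

def sampled_partition(logits, t, q, m, rng):
    """Draw m negatives ~ q and form Z' = e^{o_t} + sum_i e^{o_{s_i}} / (m q_{s_i})."""
    neg = [i for i in range(len(logits)) if i != t]
    samples = rng.choices(neg, weights=[q[i] for i in neg], k=m)
    return math.exp(logits[t]) + sum(math.exp(logits[s]) / (m * q[s]) for s in samples)

def expected_sampled_partition(logits, t, q, m):
    """E[Z'] computed exactly: each of the m draws contributes
    sum_j q_j * e^{o_j} / (m q_j) = Z_t / m, so E[Z'] = e^{o_t} + Z_t = Z."""
    neg = [i for i in range(len(logits)) if i != t]
    per_draw = sum(q[j] * math.exp(logits[j]) / (m * q[j]) for j in neg)
    return math.exp(logits[t]) + m * per_draw

if __name__ == "__main__":
    rng = random.Random(0)
    logits, t, m = [0.5, -1.0, 2.0, 0.0, 1.2], 2, 3
    neg = [i for i in range(len(logits)) if i != t]
    q = {i: 1.0 / len(neg) for i in neg}  # any q with q_i > 0 works
    Z = sum(math.exp(o) for o in logits)
    # E[Z'] equals the full partition function Z, regardless of q.
    assert abs(expected_sampled_partition(logits, t, q, m) - Z) < 1e-12
```

The cancellation of q_j in the expectation is exactly why the correction term −log(m q_{s_i}) in (5) makes Z' unbiased for any valid sampling distribution.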
Since L'(x, t) depends only on m + 1 classes, the computational cost is reduced from O(dn) to O(dm) as compared to the full softmax loss in (3).

In order to realize the training with the full softmax loss, one would like the gradient of the sampled softmax loss to be an unbiased estimator of the gradient of the full softmax loss², i.e.,

E[∇_θ L'] = ∇_θ L,    (7)

where the expectation is taken over the sampling distribution q. As it turns out, the sampling distribution plays a crucial role in ensuring the unbiasedness of ∇_θ L'. Bengio and Senécal [9] show that (7) holds if the sampling distribution is the full softmax distribution itself, i.e., q_i ∝ e^{o_i} (cf. (2)). However, sampling from the softmax distribution itself is again computationally expensive: one needs to compute the partition function Z during each iteration, which is again an O(dn) operation since Z depends both on the current model parameters and the input. As a feasible alternative, one usually samples from a distribution that does not depend on the current model parameters or the input. Common choices are uniform, log-uniform, or the global prior of classes [10, 1]. However, since these distributions are far from the full softmax distribution, they can lead to significantly worse solutions. Various approaches have been proposed to improve negative sampling. For example, a separate model can be used to track the distribution of softmax in language modeling tasks [9]. One can also use an LSH algorithm to find the approximate nearest classes in the embedding space, which in turn helps in sampling from the softmax distribution efficiently [11]. Quadratic kernel softmax [12] uses a kernel-based sampling method and a quadratic approximation of the softmax function to draw each sample in sublinear time. Similarly, the Gumbel trick has been proposed to sample from the softmax distribution in sublinear time [13].
The partition function can also be written in a double-sum formulation to enable an unbiased sampling algorithm for SGD [14, 15].

²Since it is clear from the context, in what follows, we denote L(x, t) and L'(x, t) by L and L', respectively.

Among other training approaches based on sampled losses, Noise Contrastive Estimation (NCE) and its variants avoid computing the partition function [16], and (semi-)hard negative sampling [17, 4, 18] selects the negatives that most violate the current objective function. Hyvärinen [19] proposes minimization of the Fisher divergence (a.k.a. score matching) to avoid computation of the partition function Z. However, in our setting, the partition function depends on the input embedding h, which changes during training. Thus, when calculating the score function (taking the derivative of Z with respect to (h, c)), the partition function has a non-trivial contribution, which makes this approach inapplicable to our setting. We also note the existence of MCMC based approaches in the literature (see, e.g., [20]) for sampling classes with a distribution that is close to the softmax distribution. Such methods do not come with precise computational complexity guarantees.

1.2 Our contributions

Theory. Despite a large body of work on improving the quality of sampled softmax, developing a theoretical understanding of the performance of sampled softmax has not received much attention. Blanc and Rendle [12] show that the full softmax distribution is the only distribution that provides an unbiased estimate of the true gradient ∇_θ L. However, it is not clear how different sampling distributions affect the bias ∇_θ L − E[∇_θ L']. In this paper, we address this issue and characterize the bias of the gradient for a generic sampling distribution (cf. Section 2).

Algorithm.
In Section 3, guided by our analysis and recognizing the practical appeal of kernel-based sampling [12], we propose Random Fourier softmax (RF-softmax), a new kernel-based sampling method for settings with normalized embeddings. RF-softmax employs the powerful Random Fourier Features [21] and guarantees a small bias of the gradient estimate. Furthermore, the complexity of sampling one class with RF-softmax is O(D log n), where D denotes the number of random features used in RF-softmax. In contrast, assuming that d denotes the embedding dimension, the full softmax and the prior kernel-based sampling method (Quadratic-softmax [12]) incur O(dn) and O(d² log n) computational cost to generate one sample, respectively. In practice, D can be two orders of magnitude smaller than d² while achieving similar or better performance. As a result, RF-softmax has two desirable features: 1) better accuracy due to lower bias and 2) computational efficiency due to low sampling cost.

Experiments. We conduct experiments on widely used NLP and extreme classification datasets to demonstrate the utility of the proposed RF-softmax method (cf. Section 4).

2 Gradient bias of sampled softmax

The goal of sampled softmax is to obtain a computationally efficient estimate of the true gradient ∇_θ L (cf. (4)) of the full softmax loss (cf. (3)) with small bias. In this section we develop a theoretical understanding of how different sampling distributions affect the bias of the gradient. To the best of our knowledge, this is the first result of this kind.

For the cross entropy loss based on the sampled softmax (cf. (6)), the training algorithm employs the following estimate of ∇_θ L:

∇_θ L' = −∇_θ o_t + ( e^{o_t} ∇_θ o_t + Σ_{i∈[m]} (e^{o_{s_i}}/(m q_{s_i})) ∇_θ o_{s_i} ) / ( e^{o_t} + Σ_{i∈[m]} e^{o_{s_i}}/(m q_{s_i}) ).    (8)

The following result bounds the bias of the estimate ∇_θ L'.
Without loss of generality, we work with sampling distributions that assign strictly positive probability to each negative class, i.e., q_i > 0 for all i ∈ N_t.

Theorem 1. Let ∇_θ L' (cf. (8)) be the estimate of ∇_θ L based on m negative classes s_1, . . . , s_m, drawn according to the sampling distribution q. We further assume that the gradients of the logits ∇_θ o_i have their coordinates bounded³ by M. Then, the bias of ∇_θ L' satisfies (coordinate-wise)

LB ≤ E[∇_θ L'] − ∇_θ L ≤ UB    (9)

with

LB ≜ −( max_{i,i'∈N_t} e^{o_i}/q_{i'} ) · (Z_t/(m Z²)) · M · (1 − o(1/m)) · 1,    (10)

UB ≜ ( (1/(m Z²)) Σ_{j∈N_t} e^{2o_j}/q_j − Z_t²/(m Z²) + o(1/m) ) · g   [UB1]
   + ( (2M/(m Z³)) Σ_{k∈N_t} e^{o_k} | Z_t − e^{o_k}/q_k | + o(1/m) ) · 1,   [UB2]    (11)

where Z_t ≜ Σ_{j∈N_t} e^{o_j}, g ≜ Σ_{j∈N_t} e^{o_j} ∇_θ o_j, and 1 is the all-one vector.

³This assumption naturally holds in most practical implementations, where each of the gradient coordinates or the norm of the gradient is clipped by a threshold.

The proof of Theorem 1 is presented in Appendix A. Theorem 1 captures the effect of the underlying sampling distribution q on the bias of the gradient estimate in terms of three (closely related) quantities:

Σ_{j∈N_t} e^{2o_j}/q_j,    Σ_{k∈N_t} e^{o_k} | Z_t − e^{o_k}/q_k |,    and    max_{i,i'∈N_t} e^{o_i}/q_{i'}.    (12)

Ideally, we would like to pick a sampling distribution for which all these quantities are as small as possible. Since q is a probability distribution, it follows from the Cauchy–Schwarz inequality that

( Σ_{j∈N_t} e^{o_j} )² ≤ ( Σ_{j∈N_t} q_j ) · ( Σ_{j∈N_t} e^{2o_j}/q_j ).    (13)

If q_j ∝ e^{o_j}, then (13) is attained with equality (equivalently, Σ_j e^{2o_j}/q_j is minimized). In particular, this implies that UB1 in (11) disappears. Furthermore, for such a distribution we have q_j = e^{o_j}/Σ_{i∈N_t} e^{o_i}.
This implies that both UB2 and LB disappear for such a distribution as well. This guarantees a small bias of the gradient estimate ∇_θ L'.

Since sampling exactly from the distribution q such that q_j ∝ e^{o_j} is computationally expensive, one has to resort to other distributions that incur a smaller computational cost. However, to ensure a small bias of the gradient estimate ∇_θ L', Theorem 1 and the accompanying discussion suggest that it is desirable to employ a distribution that ensures that e^{o_j}/q_j is as close to 1 as possible for each j ∈ N_t and all possible values of the logit o_j. In other words, we are interested in those sampling distributions that provide a tight uniform multiplicative approximation of e^{o_j} in a computationally efficient manner. This motivates our main contribution in the next section, where we rely on kernel-based sampling methods to efficiently implement a distribution that uniformly approximates the softmax distribution.

3 Random Fourier Softmax (RF-Softmax)

In this section, guided by the conclusion of Section 2, we propose Random Fourier Softmax (RF-softmax), a new sampling method that employs Random Fourier Features to tightly approximate the full softmax distribution. RF-softmax falls under the broader class of kernel-based sampling methods, which are amenable to efficient implementation. Before presenting RF-softmax, we briefly describe kernel-based sampling and an existing method based on quadratic kernels [12].

3.1 Kernel-based sampling and Quadratic-softmax

Given a kernel K : R^d × R^d → R, the input embedding h ∈ R^d, and the class embeddings c_1, . . . , c_n ∈ R^d, kernel-based sampling selects the class i with probability

q_i = K(h, c_i) / Σ_{j=1}^{n} K(h, c_j).
Note that if K(h, c_i) = exp(o_i) = exp(τ h^T c_i), this amounts to directly sampling from the softmax distribution.

Blanc and Rendle [12] show that if the kernel can be linearized by a mapping φ : R^d → R^D such that K(h, c_i) = φ(h)^T φ(c_i), sampling one point from the distribution takes only O(D log n) time by a divide-and-conquer algorithm. We briefly review the algorithm in this section. Under the linearization assumption, the sampling distribution takes the following form:

q_i = K(h, c_i) / Σ_{j=1}^{n} K(h, c_j) = φ(h)^T φ(c_i) / ( φ(h)^T Σ_{j=1}^{n} φ(c_j) ).

The idea is to organize the classes in a binary tree with the individual classes at the leaves. We then sample along a path of this tree recursively until we reach a single class. Each sampling step takes O(D) time, as we can pre-compute Σ_{j∈S} φ(c_j), where S is any subset of classes. Similarly, when the embedding of a class changes, the cost of updating all the sums Σ_{j∈S} φ(c_j) along the path between the root and this class is again O(D log n).

Note that we pre-compute Σ_{j∈[n]} φ(c_j) for the root node. Now, suppose the left child and the right child of the root node divide all the classes into two disjoint sets S_1 and S_2, respectively. In this case, we pre-compute Σ_{j∈S_1} φ(c_j) and Σ_{j∈S_2} φ(c_j) for the left child and the right child of the root node, respectively. The probability of the sampled class being in S_1 is

q_{S_1} = φ(h)^T Σ_{j∈S_1} φ(c_j) / ( φ(h)^T Σ_{j∈S_1} φ(c_j) + φ(h)^T Σ_{j∈S_2} φ(c_j) ),    (14)

and the probability of the sampled class being in S_2 is 1 − q_{S_1}. We then recursively sample along a path until a single class is reached, i.e., we reach a leaf node.
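The divide-and-conquer sampler described above can be sketched as follows (an illustrative implementation with hypothetical helper names, not the authors' code). It stores Σ_{j∈S} φ(c_j) at every internal node of a binary tree over the classes and samples a leaf by repeatedly applying (14); the quadratic map φ(z) = [√α·(z⊗z), 1] of Quadratic-softmax [12] serves as the example linearized kernel. The branch probabilities telescope, so the probability of reaching leaf i is exactly φ(h)^T φ(c_i) / Σ_j φ(h)^T φ(c_j).

```python
import random

def quad_phi(z, alpha=100.0):
    """Linearization of the quadratic kernel: phi(z) = [sqrt(alpha)*(z ⊗ z), 1],
    so phi(h)·phi(c) = alpha*(h·c)^2 + 1."""
    a = alpha ** 0.5
    return [a * zi * zj for zi in z for zj in z] + [1.0]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def build_tree(phis):
    """Binary tree over classes; node (lo, hi) stores sum of phis[lo:hi]."""
    def node(lo, hi):
        if hi - lo == 1:
            return {"lo": lo, "hi": hi, "sum": phis[lo]}
        mid = (lo + hi) // 2
        left, right = node(lo, mid), node(mid, hi)
        s = [x + y for x, y in zip(left["sum"], right["sum"])]
        return {"lo": lo, "hi": hi, "sum": s, "left": left, "right": right}
    return node(0, len(phis))

def sample_class(tree, phi_h, rng):
    """Walk from root to a leaf, branching left with the probability in (14)."""
    node = tree
    while "left" in node:
        p_left = dot(phi_h, node["left"]["sum"]) / dot(phi_h, node["sum"])
        node = node["left"] if rng.random() < p_left else node["right"]
    return node["lo"]

def leaf_prob(tree, phi_h, i):
    """Probability of reaching leaf i; the branch probabilities telescope."""
    node, p = tree, 1.0
    while "left" in node:
        p_left = dot(phi_h, node["left"]["sum"]) / dot(phi_h, node["sum"])
        if i < node["left"]["hi"]:
            node, p = node["left"], p * p_left
        else:
            node, p = node["right"], p * (1.0 - p_left)
    return p

if __name__ == "__main__":
    rng = random.Random(0)
    d, n = 4, 8
    cs = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
    h = [rng.gauss(0, 1) for _ in range(d)]
    phis = [quad_phi(c) for c in cs]
    tree = build_tree(phis)
    scores = [dot(quad_phi(h), p) for p in phis]
    # Linearization check: phi(h)·phi(c) == alpha*(h·c)^2 + 1.
    assert abs(scores[0] - (100.0 * dot(h, cs[0]) ** 2 + 1.0)) < 1e-6
    # Tree sampling reproduces q_i = phi(h)·phi(c_i) / sum_j phi(h)·phi(c_j).
    assert abs(leaf_prob(tree, quad_phi(h), 3) - scores[3] / sum(scores)) < 1e-9
    assert 0 <= sample_class(tree, quad_phi(h), rng) < n
```

Each sampling step touches one node per level, so drawing one class costs O(D log n), matching the complexity stated above.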
After updating the embedding of the sampled class, we recursively trace back up the tree to update the sums stored at all the nodes on the path to the root node.

Given the efficiency of kernel-based sampling, Blanc and Rendle [12] propose a sampled softmax approach that utilizes the quadratic kernel to define the sampling distribution:

K_quad(h, c_i) = α · (h^T c_i)² + 1.    (15)

Note that the quadratic kernel can be explicitly linearized by the mapping φ(z) = [√α · (z ⊗ z), 1]. This implies that D = O(d²); consequently, sampling one class takes O(d² log n) time. Despite the promising results of Quadratic-softmax, it has the following caveats:
• The quadratic kernel with α = 100, the value used in [12], does not give a tight multiplicative approximation of the exponential kernel e^{o_j}. Thus, according to the analysis in Section 2, this results in a gradient estimate with large bias.
• The O(d²) computational cost can be prohibitive for models with large embedding dimensions.
• Since a quadratic function is a poor approximation of the exponential function for negative numbers, it is instead used in [12] to approximate a modified absolute softmax loss function, where the absolute values of the logits serve as input to the softmax function.

Next, we present a novel kernel-based sampling method that addresses all of these shortcomings.

3.2 RF-softmax

Given the analysis in Section 2 and the low computational cost of linearized kernel-based sampling methods, our goal is to come up with a linearizable kernel (better than quadratic) that provides a good uniform multiplicative approximation of the exponential kernel K(h, c) = e^{o} = e^{τ h^T c}. More concretely, we would like to find a nonlinear map φ(·) : R^d →
R^D such that the error between K(h, c) and K̂(h, c) = φ(h)^T φ(c) is small for all values of h and c.

The Random Maclaurin Features [22] seem to be an obvious choice here, since they provide an unbiased estimator of the exponential kernel. However, due to the rank deficiency of the produced features, they require a large D in order to achieve a small mean squared error [23, 24]. We verify that this is indeed a poor choice in Table 1.

In contrast, Random Fourier Features (RFF) [21] are much more compact [23]. Moreover, these features and their extensions are also theoretically better understood (see, e.g., [25, 26, 27]). This leads to the natural question: can we use RFF to approximate the exponential kernel? However, this approach faces a major challenge at the outset: RFF only works for positive definite shift-invariant kernels such as the Gaussian kernel, while the exponential kernel is not shift-invariant.

A key observation is that when the input embedding h and the class embedding c are normalized, the exponential kernel becomes the Gaussian kernel (up to a multiplicative constant):

e^{τ h^T c} = e^{τ} · e^{−τ ||h − c||²/2}.    (16)

Method | Quadratic [12] | Random Fourier [21]      | Random Maclaurin [22]
D      | 256²           | 256²   | 1000   | 100    | 256²
MSE    | 2.8e-3         | 5.5e-6 | 2.7e-4 | 2.6e-3 | 8.8e-2

Table 1: Mean squared error (MSE) of approximating the kernel exp(τ h^T c). h and c are randomly sampled from the USPS dataset (d = 256). The data is normalized, i.e., ||h||₂ = 1, ||c||₂ = 1. For Quadratic, we assume the form α · (h^T c)² + β and solve for α and β in a linear system to get the optimal MSE. In practice, with α and β fixed as in [12], the MSE will be larger. Random Fourier has a much lower MSE with the same D, and a much smaller D with a similar MSE. Also note that Random Fourier and Random Maclaurin are unbiased.

We note that normalized embeddings are widely used in practice to improve training stability and model quality [28, 29, 30]. In particular, this attains improved performance as long as τ (cf. (1)) is large enough to ensure that the output of softmax can cover (almost) the entire range (0, 1). In Section 4, we verify that restricting ourselves to normalized embeddings does not hurt and, in fact, improves the final performance.

For the Gaussian kernel K(x − y) = e^{−ν ||x − y||²/2} with temperature parameter ν, the D-dimensional RFF map takes the following form:

φ(u) = (1/√D) [cos(w_1^T u), . . . , cos(w_D^T u), sin(w_1^T u), . . . , sin(w_D^T u)],    (17)

where w_1, . . . , w_D ∼ N(0, νI). The RFF map provides an unbiased approximation of the Gaussian kernel [26, Lemma 1]:

e^{−ν ||x − y||²/2} ≈ φ(x)^T φ(y).    (18)

Now, given an input embedding h, if we sample class i with probability q_i ∝ exp(−τ ||c_i − h||²/2), then it follows from (16) that our sampling distribution is the same as the softmax distribution. Therefore, with normalized embeddings, one can employ kernel-based sampling to realize the sampled softmax such that class i is sampled with probability

q_i ∝ φ(c_i)^T φ(h).    (19)

We refer to this method as Random Fourier Softmax (RF-softmax). RF-softmax costs O(D log n) to sample one point (cf. Section 3.1). Note that computing the nonlinear map takes O(Dd) time with the classic Random Fourier Features.
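Both the identity (16) and the RFF approximation (17)–(18) can be verified numerically. The sketch below is illustrative only: it draws plain i.i.d. frequencies w ∼ N(0, νI), the convention we assume here so that (18) holds for the kernel e^{−ν||x−y||²/2}, and compares φ(x)^T φ(y) against the exact Gaussian kernel for unit-norm vectors.

```python
import math
import random

def rff_map(u, ws):
    """Eq. (17)-style map: (1/sqrt(D)) [cos(w_i·u) ..., sin(w_i·u) ...]."""
    D = len(ws)
    feats = [math.cos(sum(a * b for a, b in zip(w, u))) for w in ws] + \
            [math.sin(sum(a * b for a, b in zip(w, u))) for w in ws]
    return [f / math.sqrt(D) for f in feats]

def unit(v):
    nrm = math.sqrt(sum(x * x for x in v))
    return [x / nrm for x in v]

if __name__ == "__main__":
    rng = random.Random(0)
    d, D, tau = 8, 4000, 1.0
    h = unit([rng.gauss(0, 1) for _ in range(d)])
    c = unit([rng.gauss(0, 1) for _ in range(d)])
    # Identity (16): for unit vectors, e^{tau h·c} = e^{tau} * e^{-tau||h-c||^2/2}.
    hc = sum(a * b for a, b in zip(h, c))
    sq = sum((a - b) ** 2 for a, b in zip(h, c))
    assert abs(math.exp(tau * hc) - math.exp(tau) * math.exp(-tau * sq / 2)) < 1e-12
    # RFF approximation (18) of the Gaussian kernel with nu = tau:
    ws = [[rng.gauss(0, math.sqrt(tau)) for _ in range(d)] for _ in range(D)]
    approx = sum(a * b for a, b in zip(rff_map(h, ws), rff_map(c, ws)))
    exact = math.exp(-tau * sq / 2)
    assert abs(approx - exact) < 0.1  # Monte Carlo error shrinks as O(1/sqrt(D))
```

Because the two sides of (16) differ only by the constant e^{τ}, approximating the Gaussian kernel on the right is enough to sample from a distribution proportional to the softmax distribution.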
One can easily use the structured orthogonal random feature (SORF) technique [26] to reduce this complexity to O(D log d) with even lower approximation error. Since the embedding dimension d is typically on the order of hundreds and we consider large n with d ≪ n, the overall complexity of RF-softmax is O(D log n).

As shown in Table 1, the RFF map approximates the exponential kernel with a mean squared error that is orders of magnitude smaller than that of the quadratic map with the same mapping dimension D. The use of RFF raises interesting implementation challenges in terms of selecting the temperature parameter ν to realize low-bias sampling, which we address in the following subsection.

3.3 Analysis and discussions

Recall that the discussion following Theorem 1 implies that, for low-bias gradient estimates, the sampling distribution q_i needs to form a tight multiplicative approximation of the softmax distribution p_i, where p_i ∝ exp(o_i) = exp(τ h^T c_i). In the following result, we quantify the quality of the proposed RF-softmax based on the ratio p_i/q_i. The proof of the result is presented in Appendix B.

Theorem 2. Given the ℓ₂-normalized input embedding h and ℓ₂-normalized class embeddings {c_1, . . . , c_n}, let o_i = τ h^T c_i be the logit associated with class i. Let q_i denote the probability of sampling the i-th class based on D-dimensional Random Fourier Features, i.e., q_i = (1/C) · φ(c_i)^T φ(h), where C is the normalizing constant. Then, as long as e^{2ν} ≤ Δ√D/(ρ√d · log D), the following holds with probability at least 1 − O(1/D²):

e^{(τ−ν) h^T c_i} · (1 − 2Δ) ≤ ( Σ_{i'∈N_t} e^{o_{i'}} ) · q_i / e^{o_i} ≤ e^{(τ−ν) h^T c_i} · (1 + 4Δ),    (20)

where Δ and ρ are positive constants.

Remark 1.
With large enough D, we may invoke Theorem 2 with ν = τ and

Δ = a₀ · ( e^{2τ} · ρ√d · log D / √D ),

where a₀ > 1 is a fixed constant. In this case, since Δ = o_D(1), it follows from (20) that q_i ∝ (1 ± o_D(1)) · p_i for each i ∈ N_t. In particular, at D = ∞, we have q_i ∝ p_i.

We now combine Theorem 1 and Theorem 2 to obtain the following corollary, which characterizes the bias of the gradient estimate for RF-softmax in the regime where D is large. The proof is presented in Appendix C.

Corollary 1. Let q_i = (1/C) · φ(c_i)^T φ(h) denote the probability of sampling the i-th class under RF-softmax. Then, with high probability, for large enough D, selecting ν = τ ensures that

−o_D(1) · 1 ≤ E[∇_θ L'] − ∇_θ L ≤ ( o_D(1) + o(1/m) ) · g + ( o_D(1) + o(1/m) ) · 1,

where g ≜ Σ_{j∈N_t} e^{o_j} ∇_θ o_j and 1 is the all-one vector. Here, the expectation is taken over the sampling distribution q.

Note that Theorem 2 and Remark 1 highlight an important design issue in the implementation of the RF-softmax approach. The ability of q to approximate p (as stated in (20)) degrades as the difference |τ − ν| increases. Therefore, one would like to pick ν to be as close to τ as possible (ideally exactly equal to τ). However, for a fixed-dimensional feature map φ, the approximation guarantee in (20) holds only for those values of ν such that e^{2ν} ≤ Δ√D/(ρ√d · log D). Therefore, the dimension of the feature map dictates which Gaussian kernels we can utilize in the proposed RF-softmax approach. On the other hand, the variance of the kernel approximation of Random Fourier Features grows with ν [26]. Additionally, in order to work with normalized embeddings, it is necessary to select a reasonably large value for the temperature parameter⁴ τ.
Therefore, choosing ν = τ in this case will result in a larger variance of the estimated kernel.

Remark 2. As a trade-off between bias and variance, while approximating the exponential kernel with a limited D and a large τ, ν should be set to a value smaller than τ.

In Section 4, we explore different choices for the value of ν and confirm that some ν < τ achieves the best empirical performance.

4 Experiments

In this section, we experimentally evaluate the proposed RF-softmax and compare it with several baselines using simple neural networks to validate the utility of the proposed sampling method and the accompanying theoretical analysis. We show that, in terms of the final model quality, RF-softmax with computational cost O(D log n) (D ≪ d²) performs on par with the sampled softmax approach based on the full softmax distribution, which incurs a computational cost of O(dn). We also show that RF-softmax outperforms Quadratic-softmax [12], which uses a quadratic function to approximate the exponential kernel and has computational cost O(d² log n). In addition, we highlight the effect of different design choices regarding the underlying RFF map (cf. (17)) on the final performance of RF-softmax.

4.1 Datasets and models

NLP datasets. PENNTREEBANK [31] is a popular benchmark for NLP tasks with a vocabulary of size 10,000. We train a language model using an LSTM, where the normalized output of the LSTM serves as the input embedding. BNEWS [32] is another NLP dataset. For this dataset, we select the most frequent 64,000 words as the vocabulary. Our model architecture for BNEWS is the same as the one used for PENNTREEBANK, with more parameters. We fix the embedding dimension to be d = 200 for PENNTREEBANK and d = 512 for BNEWS.

⁴This provides a wide enough range for the logit values, thus ensuring that the underlying model has sufficient expressive power.
The typical choice ranges from 5 to 30.

Extreme classification datasets. We test the proposed method on three classification datasets with a large number of classes [33]. For each data point, defined by a v-dimensional sparse feature vector, we first map the data point to a 128-dimensional vector by using a v × 128 matrix. Once normalized, this vector serves as the input embedding h. For i ∈ [n], class i is mapped to a 128-dimensional normalized vector c_i. Following the convention of the extreme classification literature [34, 35, 4], we report precision at k (PREC@K) on the test set.

Recall that the proposed RF-softmax draws samples with computational complexity O(D log n). We compare it with the following baselines.
• FULL, the full softmax loss, which has computational complexity O(dn).
• EXP, the sampled softmax approach with the full softmax distribution (cf. (2)) as the sampling distribution. This again has computational complexity O(dn) to draw one sample.
• UNIFORM, the sampled softmax approach that draws negative classes uniformly at random, amounting to O(1) computational complexity to draw one sample.
• QUADRATIC, the sampled softmax approach based on a quadratic kernel (cf. (15)). We follow the implementation of [12] and use α o² + 1 with α = 100. Since α o² + 1 is a better approximation of e^{|o|} than it is of e^{o}, [12] proposes to use the absolute softmax function p̃_i = e^{τ|o_i|} / Σ_{j∈[n]} e^{τ|o_j|} while computing the cross entropy loss. We employ this modified loss function for the quadratic kernel, as it gives better results than the full softmax loss in (3). The cost of drawing one sample with this method is O(d² log n).

Here, we refer to 1/√τ as the temperature of softmax (cf. (2)). We set this parameter to 0.3, as it leads to the best performance for the FULL baseline.
This is a natural choice given that sampled softmax aims at approximating this loss function with a small subset of (sampled) negative classes.

4.2 Experimental results

Wall time. We begin by verifying that RF-softmax indeed incurs a low sampling cost. In Table 2, we compare the wall time that different model-dependent sampling methods take to compute the sampled softmax loss (cf. (6)) for different sets of parameters.

# classes (n)   Method            Wall time
10,000          EXP               1.4 ms
                QUADRATIC         6.5 ms
                RFF (D = 50)      0.5 ms
                RFF (D = 200)     0.6 ms
                RFF (D = 500)     1.2 ms
                RFF (D = 1,000)   1.4 ms
500,000         EXP               32.3 ms
                QUADRATIC         8.2 ms
                RFF (D = 50)      1.6 ms
                RFF (D = 200)     1.7 ms
                RFF (D = 500)     2.0 ms
                RFF (D = 1,000)   2.4 ms

Table 2: Comparison of wall time for computing the sampled softmax loss for different model-dependent sampling methods. (Batch size = 10, m = 10, and d = 64.)

Normalized vs. unnormalized embeddings. To justify our choice of working with normalized embeddings, we ran experiments on FULL with and without embedding normalization for both PENNTREEBANK and AMAZONCAT-13K. On PENNTREEBANK, after 10 epochs, the unnormalized version has a much worse validation perplexity of 126, as compared to 120 for the normalized version. On AMAZONCAT-13K, both versions have PREC@1 of 87%.

Next, we discuss the key design choices for RF-softmax: the choice of the Gaussian sampling kernel defined by ν (cf. (16)), and the dimension of the RFF map D.

Figure 1: Validation perplexity for RF-softmax on PENNTREEBANK with m = 100, D = 1024, and varying values of T.

Effect of the parameter ν. As discussed in Section 3.2, for a finite D, we should choose ν < τ as a trade-off between the variance and the bias.
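The variance side of this trade-off can be seen directly in simulation. The snippet below is our illustration, not the paper's experiment; it again assumes i.i.d. frequencies w ∼ N(0, νI) and measures the empirical MSE of the RFF estimate of the corresponding Gaussian kernel at a small and a large ν for a fixed D. The Monte Carlo variance of the estimate grows with ν [26], which is what pushes the optimal choice below τ.

```python
import math
import random

def rff_kernel_estimate(x, y, ws):
    """(1/D) sum_i cos(w_i · (x - y)) -- the RFF estimate of e^{-nu||x-y||^2/2}."""
    diff = [a - b for a, b in zip(x, y)]
    return sum(math.cos(sum(w_j * d_j for w_j, d_j in zip(w, diff))) for w in ws) / len(ws)

def rff_mse(nu, D, pairs, rng):
    """Empirical MSE of the RFF estimate over a list of vector pairs."""
    d = len(pairs[0][0])
    ws = [[rng.gauss(0, math.sqrt(nu)) for _ in range(d)] for _ in range(D)]
    err = 0.0
    for x, y in pairs:
        sq = sum((a - b) ** 2 for a, b in zip(x, y))
        err += (rff_kernel_estimate(x, y, ws) - math.exp(-nu * sq / 2)) ** 2
    return err / len(pairs)

if __name__ == "__main__":
    rng = random.Random(0)
    def unit(v):
        n = math.sqrt(sum(t * t for t in v))
        return [t / n for t in v]
    pairs = [(unit([rng.gauss(0, 1) for _ in range(8)]),
              unit([rng.gauss(0, 1) for _ in range(8)])) for _ in range(100)]
    mse_small_nu = rff_mse(0.5, 256, pairs, rng)
    mse_large_nu = rff_mse(8.0, 256, pairs, rng)
    # Both errors are small for D = 256, but the larger nu typically
    # incurs a visibly larger MSE at the same feature dimension.
    assert mse_small_nu < 0.05 and mse_large_nu < 0.05
```

The bias side of the trade-off, i.e., the mismatch factor e^{(τ−ν)h^T c} in (20), grows as ν moves away from τ, so ν must balance the two effects.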
Figure 1 shows the performance of the proposed RF-softmax method on PENNTREEBANK for different values of ν. In particular, we vary T = 1/√ν as it defines the underlying RFF map (cf. (18)). The best performance is attained at T = 0.5. This choice of ν < τ is in line with our discussion in Section 3.3. We use this same setting in the remaining experiments.

Table 3: Comparison among sampled softmax methods on extreme classification datasets. We report the metrics based on the same number of training iterations for all methods.

Dataset                 Method      PREC@1  PREC@3  PREC@5
AMAZONCAT-13K           EXP         0.87    0.76    0.62
(n = 13,330,            UNIFORM     0.83    0.69    0.55
v = 203,882)            QUADRATIC   0.84    0.74    0.60
                        RFF         0.87    0.75    0.61
DELICIOUS-200K          EXP         0.42    0.38    0.37
(n = 205,443,           UNIFORM     0.36    0.34    0.32
v = 782,585)            QUADRATIC   0.40    0.36    0.34
                        RFF         0.41    0.37    0.36
WIKILSHTC               EXP         0.58    0.37    0.29
(n = 325,056,           UNIFORM     0.47    0.29    0.22
v = 1,617,899)          QUADRATIC   0.57    0.37    0.28
                        RFF         0.56    0.35    0.26

Effect of D. The accuracy of the approximation of the Gaussian kernel using the RFF map improves as we increase the dimension of the map D. Figure 2 demonstrates the performance of the proposed RF-softmax method on PENNTREEBANK for different values of D. As expected, the performance of RF-softmax gets closer to that of FULL when we increase D.

Figure 2: Validation perplexity for RF-softmax on PENNTREEBANK with m = 100 and varying D.

RF-softmax vs. baselines. Figure 3 illustrates the performance of different sampled softmax approaches on PENNTREEBANK.

Figure 3: Comparison of RF-softmax with other baselines on PENNTREEBANK with m = 100 and validation perplexity as the metric.
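PREC@K, the metric reported in Table 3, is the fraction of the k highest-scoring classes that are true labels for the test point. A minimal sketch (our own illustration; not the evaluation code used in the paper):

```python
import numpy as np

def precision_at_k(scores, true_labels, k):
    """PREC@K: fraction of the top-k scored classes that are true labels."""
    topk = np.argsort(scores)[::-1][:k]       # indices of k highest scores
    return len(set(topk) & set(true_labels)) / k

scores = np.array([0.1, 0.9, 0.3, 0.8, 0.2])
# Top-3 classes by score are {1, 3, 2}; two of them are true labels.
print(precision_at_k(scores, {1, 3, 5}, k=3))  # -> 2/3
```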
The figure shows the validation perplexity as the training progresses. As expected, the performance of the expensive EXP is very close to the performance of FULL. RF-softmax outperforms both QUADRATIC and UNIFORM. We note that RF-softmax with D = 1024 performs better than the QUADRATIC method at significantly lower computational and space cost. Since we have embedding dimension d = 200, RF-softmax is almost 40X more efficient as compared to QUADRATIC (D vs. d²). Figure 4 shows the performance of different methods on BNEWS. Note that the performance of RF-softmax is on par with QUADRATIC when D = 2048. Furthermore, RF-softmax outperforms QUADRATIC when D = 8192. In this experiment, we have d = 512, so RF-softmax with D = 2048 and D = 8192 is 128X and 32X more efficient than QUADRATIC, respectively.

Figure 4: Comparison of RF-softmax with other baselines on BNEWS with m = 100 and validation perplexity as the metric.

Table 3 shows the performance of various sampling methods on three extreme classification datasets [33]. We do not report PREC@K values for FULL as the performance of EXP is an accurate proxy for those. The results demonstrate that RF-softmax attains better or comparable performance relative to QUADRATIC. For WIKILSHTC, even though QUADRATIC has better PREC@K values, RF-softmax leads to a 4% smaller full softmax loss, which validates our analysis.

References

[1] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013.

[2] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

[3] Paul Covington, Jay Adams, and Emre Sargin. Deep neural networks for YouTube recommendations.
In Proceedings of the 10th ACM Conference on Recommender Systems, pages 191–198. ACM, 2016.

[4] Sashank J Reddi, Satyen Kale, Felix Yu, Dan Holtmann-Rice, Jiecao Chen, and Sanjiv Kumar. Stochastic negative mining for learning with large output spaces. In Proceedings of the International Conference on Artificial Intelligence and Statistics, 2019.

[5] Frederic Morin and Yoshua Bengio. Hierarchical probabilistic neural network language model. In Proceedings of the International Conference on Artificial Intelligence and Statistics, volume 5, pages 246–252, 2005.

[6] Pascal Vincent, Alexandre De Brébisson, and Xavier Bouthillier. Efficient exact gradient update for training deep networks with very large sparse targets. In Advances in Neural Information Processing Systems, pages 1108–1116, 2015.

[7] Alexandre de Brébisson and Pascal Vincent. An exploration of softmax alternatives belonging to the spherical loss family. arXiv preprint arXiv:1511.05042, 2015.

[8] Edouard Grave, Armand Joulin, Moustapha Cissé, Hervé Jégou, et al. Efficient softmax approximation for GPUs. In Proceedings of the 34th International Conference on Machine Learning, pages 1302–1310. JMLR.org, 2017.

[9] Yoshua Bengio and Jean-Sébastien Senécal. Adaptive importance sampling to accelerate training of a neural probabilistic language model. IEEE Transactions on Neural Networks, 19(4):713–722, 2008.

[10] Sébastien Jean, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. On using very large target vocabulary for neural machine translation. arXiv preprint arXiv:1412.2007, 2014.

[11] Sudheendra Vijayanarasimhan, Jonathon Shlens, Rajat Monga, and Jay Yagnik. Deep networks with large output spaces. arXiv preprint arXiv:1412.7479, 2014.

[12] Guy Blanc and Steffen Rendle.
Adaptive sampled softmax with kernel based sampling. In International Conference on Machine Learning, 2018.

[13] Stephen Mussmann, Daniel Levy, and Stefano Ermon. Fast amortized inference and learning in log-linear models with randomly perturbed nearest neighbor search. arXiv preprint arXiv:1707.03372, 2017.

[14] Parameswaran Raman, Sriram Srinivasan, Shin Matsushima, Xinhua Zhang, Hyokun Yun, and SVN Vishwanathan. DS-MLR: Exploiting double separability for scaling up distributed multinomial logistic regression. arXiv preprint arXiv:1604.04706, 2016.

[15] Francois Fagan and Garud Iyengar. Unbiased scalable softmax optimization. arXiv preprint arXiv:1803.08577, 2018.

[16] Andriy Mnih and Koray Kavukcuoglu. Learning word embeddings efficiently with noise-contrastive estimation. In Advances in Neural Information Processing Systems, pages 2265–2273, 2013.

[17] Florian Schroff, Dmitry Kalenichenko, and James Philbin. FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 815–823, 2015.

[18] Ian En-Hsu Yen, Satyen Kale, Felix Yu, Daniel Holtmann-Rice, Sanjiv Kumar, and Pradeep Ravikumar. Loss decomposition for fast learning in large output spaces. In International Conference on Machine Learning, pages 5626–5635, 2018.

[19] Aapo Hyvärinen. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6:695–709, December 2005.

[20] Shankar Vembu, Thomas Gärtner, and Mario Boley. Probabilistic structured predictors. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence (UAI), pages 557–564, 2009.

[21] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, pages 1177–1184, 2008.

[22] Purushottam Kar and Harish Karnick.
Random feature maps for dot product kernels. In Proceedings of the International Conference on Artificial Intelligence and Statistics, 2012.

[23] Jeffrey Pennington, Felix Xinnan X Yu, and Sanjiv Kumar. Spherical random features for polynomial kernels. In Advances in Neural Information Processing Systems, pages 1846–1854, 2015.

[24] Raffay Hamid, Ying Xiao, Alex Gittens, and Dennis DeCoste. Compact random feature maps. In International Conference on Machine Learning, pages 19–27, 2014.

[25] Bharath Sriperumbudur and Zoltán Szabó. Optimal rates for random Fourier features. In Advances in Neural Information Processing Systems, pages 1144–1152, 2015.

[26] Felix X Yu, Ananda Theertha Suresh, Krzysztof M Choromanski, Daniel N Holtmann-Rice, and Sanjiv Kumar. Orthogonal random features. In Advances in Neural Information Processing Systems, pages 1975–1983, 2016.

[27] Jiyan Yang, Vikas Sindhwani, Haim Avron, and Michael Mahoney. Quasi-Monte Carlo feature maps for shift-invariant kernels. In International Conference on Machine Learning, pages 485–493, 2014.

[28] Rajeev Ranjan, Carlos D Castillo, and Rama Chellappa. L2-constrained softmax loss for discriminative face verification. arXiv preprint arXiv:1703.09507, 2017.

[29] Weiyang Liu, Yandong Wen, Zhiding Yu, Ming Li, Bhiksha Raj, and Le Song. SphereFace: Deep hypersphere embedding for face recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, page 1, 2017.

[30] Feng Wang, Xiang Xiang, Jian Cheng, and Alan Loddon Yuille. NormFace: L2 hypersphere embedding for face verification. In Proceedings of the 25th ACM International Conference on Multimedia, pages 1041–1049. ACM, 2017.

[31] Mitchell P. Marcus, Beatrice Santorini, Mary Ann Marcinkiewicz, and Ann Taylor. Treebank-3 LDC99T42, 1999.

[32] David Graff, John S. Garofolo, Jonathan G.
Fiscus, William Fisher, and David Pallett. 1996 English broadcast news speech (HUB4) LDC97S44, 1997.

[33] Manik Varma. Extreme classification repository. Website, August 2018. http://manikvarma.org/downloads/XC/XMLRepository.html.

[34] Rahul Agrawal, Archit Gupta, Yashoteja Prabhu, and Manik Varma. Multi-label learning with millions of labels: Recommending advertiser bid phrases for web pages. In Proceedings of the 22nd International Conference on World Wide Web (WWW), pages 13–24, New York, NY, USA, 2013. ACM.

[35] Himanshu Jain, Yashoteja Prabhu, and Manik Varma. Extreme multi-label loss functions for recommendation, tagging, ranking & other missing label applications. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 935–944, New York, NY, USA, 2016. ACM.