{"title": "Fast and Accurate Stochastic Gradient Estimation", "book": "Advances in Neural Information Processing Systems", "page_first": 12339, "page_last": 12349, "abstract": "Stochastic Gradient Descent or SGD is the most popular optimization algorithm for large-scale problems. SGD estimates the gradient by uniform sampling with sample size one. There have been several other works that suggest faster epoch-wise convergence by using weighted non-uniform sampling for better gradient estimates. Unfortunately, the per-iteration cost of maintaining this adaptive distribution for gradient estimation is more than calculating the full gradient itself, which we call the chicken-and-the-egg loop. As a result, the false impression of faster convergence in iterations, in reality, leads to slower convergence in time. In this paper, we break this barrier by providing the first demonstration of a scheme, Locality sensitive hashing (LSH) sampled Stochastic Gradient Descent (LGD), which leads to superior gradient estimation while keeping the sampling cost per iteration similar to that of the uniform sampling. Such an algorithm is possible due to the sampling view of LSH, which came to light recently. As a consequence of superior and fast estimation, we reduce the running time of all existing gradient descent algorithms, that relies on gradient estimates including Adam, Ada-grad, etc. We demonstrate the effectiveness of our proposal with experiments on linear models as well as the non-linear BERT, which is a recent popular deep learning based language representation model.", "full_text": "Fast and Accurate Stochastic Gradient Estimation\n\nBeidi Chen\nRice University\nHouston, Texas\n\nbeidi.chen@rice.edu\n\nYingchen Xu\nRice University\nHouston, Texas\nyx26@rice.edu\n\nAbstract\n\nAnshumali Shrivastava\n\nRice University\nHouston, Texas\n\nanshumali@rice.edu\n\nStochastic Gradient Descent or SGD is the most popular optimization algorithm\nfor large-scale problems. 
SGD estimates the gradient by uniform sampling with sample size one. There have been several other works that suggest faster epoch-wise convergence by using weighted non-uniform sampling for better gradient estimates. Unfortunately, the per-iteration cost of maintaining this adaptive distribution for gradient estimation is more than calculating the full gradient itself, which we call the chicken-and-the-egg loop. As a result, the false impression of faster convergence in iterations, in reality, leads to slower convergence in time. In this paper, we break this barrier by providing the first demonstration of a scheme, Locality sensitive hashing (LSH) sampled Stochastic Gradient Descent (LGD), which leads to superior gradient estimation while keeping the sampling cost per iteration similar to that of uniform sampling. Such an algorithm is possible due to the sampling view of LSH, which came to light recently. As a consequence of superior and fast estimation, we reduce the running time of all existing gradient descent algorithms that rely on gradient estimates, including Adam, AdaGrad, etc. We demonstrate the effectiveness of our proposal with experiments on linear models as well as the non-linear BERT, which is a recent popular deep learning based language representation model.

1 Motivation

Stochastic gradient descent, commonly known as SGD, is the most popular choice of optimization algorithm in the large-scale setting for its computational efficiency. A typical interest in machine learning is to minimize the average loss function f over the training data, with respect to the parameters θ, i.e., the objective function of interest is

θ* = arg min_θ F(θ) = arg min_θ (1/N) Σ_{i=1}^N f(xi, θ).   (1)

Throughout the paper, our training data D = {xi, yi}_{i=1}^N will have N instances with d-dimensional features xi ∈ R^d and labels yi.
The labels can be continuous real valued for regression problems. For classification problems, they take values in a discrete set, i.e., yi ∈ {1, 2, · · · , K}. Typically, the function f is convex, and thus a Gradient Descent (GD) algorithm can achieve the global optimum. The objective function for least squares, f(xi, θ) = (θ · xi − yi)², used in the regression setting is a classical example of f.

SGD [4] samples an instance xj uniformly from the N instances, and performs the gradient descent update:

θt = θt−1 − ηt ∇f(xj, θt−1),   (2)

where ηt is the step size at the t-th iteration. The gradient ∇f(xj, θt−1) is only evaluated on xj, using the current θt−1.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

It should be noted that the full gradient of the objective is given by the average (1/N) Σ_{i=1}^N ∇f(xi, θt−1). Thus, a uniformly sampled gradient ∇f(xj, θt−1) is an unbiased estimator of the full gradient, i.e.,

E[∇f(xj, θt−1)] = (1/N) Σ_{i=1}^N ∇f(xi, θt−1).   (3)

This is the key reason why, despite only using one sample, SGD still converges to the local minima, analogously to full gradient descent, provided ηt is chosen properly [21, 4].

It is known that the convergence rate of SGD is slower than that of full gradient descent [22]. Nevertheless, the cost of computing the full gradient requires O(N) evaluations of ∇f compared to just O(1) evaluation in SGD. Thus, at the cost of one epoch of full gradient descent, SGD can perform O(N) epochs, which overcompensates for the slow convergence (one epoch is one pass over the training data).
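The update in equation 2 can be sketched end-to-end for the least-squares loss. Everything below (data sizes, step size, and helper names such as `grad_f`) is illustrative, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data: N instances, d features (illustrative sizes).
N, d = 1000, 10
X = rng.normal(size=(N, d))
theta_true = rng.normal(size=d)
y = X @ theta_true + 0.01 * rng.normal(size=N)

def grad_f(x, yi, theta):
    # Gradient of the single-instance least-squares loss (theta.x - y)^2.
    return 2.0 * (theta @ x - yi) * x

def sgd(steps=5000, eta=0.01):
    theta = np.zeros(d)
    for _ in range(steps):
        j = rng.integers(N)              # uniform sample, size one (equation 2)
        theta -= eta * grad_f(X[j], y[j], theta)
    return theta

theta_hat = sgd()
mse = float(np.mean((X @ theta_hat - y) ** 2))
```

Averaging `grad_f` over all j recovers the full gradient exactly, which is the unbiasedness in equation 3.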
Therefore, despite slow convergence rates, SGD is almost always the chosen\nalgorithm in large-scale settings as the calculation of the full gradient in every epoch is prohibitively\nslow. Further improving SGD is still an active area of research. Any such improvement will directly\nspeed up most of the state-of-the-art algorithms in machine learning.\nThe slower convergence of SGD in iterations is expected due to the poor estimation of the gradient\n(the average) by only sampling a single instance uniformly. Clearly, the variance of the one-sample\nestimator is high. As a consequence, there have been several efforts in \ufb01nding better sampling\nstrategies for estimating the gradients [29, 18, 30, 2]. The key idea behind these methods is to replace\nsampling from a uniform distribution with sampling from a weighted distribution which leads towards\na lower and even optimal variance.\nHowever, obtaining the optimal weighted distribution is not a straightforward task, due to its correla-\ntion with the L2 norm of the gradients. Therefore, whenever the parameters and the gradients change,\nthe weighted distribution has to change. Unfortunately, as argued in [14, 20], all of these adaptive\nsampling methods for SGD, suffer from what we call the chicken-and-egg loop \u2013 adaptive sampling\nimproves stochastic estimation but maintaining the required adaptive distribution will cost up to O(N )\nper iteration, which is also the cost of computing the full gradient exactly (or at least not O(1)). Not\nsurprisingly [17] showed another O(N ) scheme that improves the running time compared with SGD\nusing O(N ) leverage scores [28] sampling. However, as noted O(N ) per iteration is prohibitive.\nTo the best of our knowledge, there does not exist any generic sampling scheme for adaptive gradient\nestimation, where the cost of maintaining and updating the distribution, per iteration, is O(1) which\nis comparable to SGD. 
Our work provides the first such sampling scheme, utilizing recent advances in sampling and unbiased estimation with Locality Sensitive Hashing [23, 24].

1.1 Related Work: Adaptive Sampling for SGD

For non-uniform sampling, we can sample each xi with an associated weight wi. These wi's can be tuned to minimize the variance. It was first shown in [2] that sampling xi with probability in proportion to the L2 norm (Euclidean norm) of its gradient, i.e., ||∇f(xi, θt−1)||₂, leads to the optimal distribution that minimizes the variance. However, sampling xi with probability in proportion to wi = ||∇f(xi, θt−1)||₂ requires first computing all the wi's, which change in every iteration because θt−1 gets updated. Therefore, maintaining the values of the wi's is even costlier than computing the full gradient. [14] proposed to mitigate this overhead partially by exploiting additional side information such as the cluster structure of the data. Prior to the realization of the optimal variance distribution, [29] and [18] proposed to sample a training instance with probability proportional to the Lipschitz constant of f(xi, θt−1) or ∇f(xi, θt−1), respectively. It is worth mentioning that before these works, a similar idea was used in designing importance-sampling-based low-rank matrix approximation algorithms. The resulting sampling methods, known as leverage score sampling, are again proportional to the squared Euclidean norms of the rows and columns of the underlying matrix [12]. Nevertheless, as argued in [14], the cost of maintaining the distribution is prohibitive.

The Chicken-and-Egg Loop: In summary, to speed up the convergence of stochastic gradient descent, we need non-uniform sampling for better estimates (low variance) of the full gradient.
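The variance-optimal distribution of [2] can be made concrete with a short sketch. Note how recomputing the weights costs a full O(Nd) pass over the data after every update, which is the crux of the loop described above (all names and sizes here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 500, 8
X = rng.normal(size=(N, d))
y = X @ rng.normal(size=d)
theta = rng.normal(size=d)

# Per-instance gradients of the least-squares loss at the current theta.
grads = 2.0 * (X @ theta - y)[:, None] * X          # O(N d) work per iteration

# Variance-optimal sampling weights are proportional to the gradient norms;
# they change whenever theta changes, so maintaining them exactly is as
# costly as a full gradient computation.
norms = np.linalg.norm(grads, axis=1)
w_opt = norms / norms.sum()

# Importance-weighted single-sample estimate of the full gradient.
i = rng.choice(N, p=w_opt)
est = grads[i] / (N * w_opt[i])
```

Summing the estimator over the sampling distribution recovers the full gradient exactly, for any weights with full support.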
Any interesting non-uniform sampling is dependent on the data and the parameter θt, which changes in every iteration. Thus, maintaining the non-uniform distribution for estimation requires O(N) computations to calculate the weights wi, which is the same cost as computing the full gradient exactly. It is not even clear that there exists any sweet spot, an adaptive distribution which breaks this computational chicken-and-egg loop. We provide the first affirmative answer by giving an unusual distribution which is derived from probabilistic indexing based on locality sensitive hashing.

Our Contributions: In this work, we propose a novel LSH-based sampler that breaks the aforementioned chicken-and-egg loop. In our algorithm, which we call LSH sampled Stochastic Gradient Descent (LGD), samples are generated via hash lookups, which have O(1) cost. Moreover, the probability of selecting xi is provably adaptive. Therefore, the current gradient estimate is likely to have lower variance compared to a single-sample SGD, while the computational complexity of sampling is constant and of the order of the SGD sampling cost. Furthermore, we demonstrate that LGD can be utilized to speed up any existing gradient-based optimization algorithm such as AdaGrad [13]. We also show the power of LGD with experiments on both linear and non-linear models.

As a direct consequence, we obtain a generic and efficient gradient descent algorithm which converges significantly faster than SGD, both in terms of iterations as well as running time. It should be noted that rapid iteration-wise or epoch-wise convergence alone does not imply computational efficiency. For instance, Newton's method converges faster, epoch-wise, than any first-order gradient descent, but it is prohibitively slow in practice.
The wall clock time or the number of floating point operations performed to reach convergence should be the metric of consideration for useful conclusions.

Accuracy vs. Running Time: It is rare to see any fair (same computational setting) empirical comparisons of SGD with existing adaptive SGD schemes, which compare the improvement in accuracy with respect to running time on the same computational platform. Almost all methods compare accuracy with the number of epochs. This is unfair to SGD, which can complete O(N) updates at the computational cost (or running time) of one update for adaptive sampling schemes.

2 The LGD Algorithm

2.1 A Generic Framework for Efficient Gradient Estimation

Figure 1: The work-flow of the LGD Algorithm

Our algorithm leverages efficient estimation using locality sensitive hashing, which usually beats random sampling estimators while keeping the sampling cost near-constant. We first provide the intuition of our proposal, and the analysis will follow. Figure 1 shows the complete work-flow of the LGD algorithm. Consider least squares regression with loss function (1/N) Σ_{i=1}^N (yi − θt · xi)², where θt is the parameter at the t-th iteration. The gradient is just like a partition function in a classical discrete system. If we simply follow the procedures in [24], we can easily show a generic unbiased estimator via adaptive sampling. However, better sampling alternatives are possible.

Observing that the gradient with respect to θt concerning xi is given by 2(yi − θt · xi)xi, the L2 norm of the gradient can therefore be written as the absolute value of an inner product:

||∇f(xi, θt)||₂ = |2(θt · xi − yi) ||xi||₂| = 2 |[θt, −1] · [xi ||xi||₂, yi ||xi||₂]|,   (4)

where [θt, −1] is the vector concatenation of θt with −1. According to [14], w*_i = ||∇f(xi, θt)||₂ / Σ_{j=1}^N ||∇f(xj, θt)||₂ is also the optimal sampling weight for xi. Therefore, if the data is normalized, we should sample xi in proportion to w*_i = |[θt, −1] · [xi, yi]|, i.e., large-magnitude inner products should be sampled with higher probability.

As argued, such a sampling process is expensive because w*_i changes with θt. We address this issue by designing a sampling process that does not exactly sample with probability w*_i but instead samples from a different weighted distribution which is a monotonic function of w*_i. Specifically, we sample from w^lsh_i = f(w*_i), where f is some monotonic function. Before we describe the efficient sampling process, we first argue that a monotonic sampling is a good choice for gradient estimation. Figure 2 in the appendix helps visualize the relation among the optimal weighted distribution (target), uniform sampling in SGD and adaptive sampling in LGD.

For any monotonic function f, the weighted distribution w^lsh_i = f(w*_i) is still adaptive and changes with θt. Also, due to monotonicity, if the optimal sampling prefers xi over xj, i.e., w*_i ≥ w*_j, then monotonic sampling will also have the same preference, i.e., w^lsh_i ≥ w^lsh_j.
The key insight is that there are two quantities in the inner product (equation 4), [θt, −1] and [xi, yi]. With successive iterations, [θt, −1] changes while [xi, yi] is fixed. Thus, it is possible to preprocess [xi, yi] into hash tables (a one-time cost) and query with [θt, −1] for efficient and adaptive sampling. With every iteration, only the query changes to [θt+1, −1], but the hash tables remain the same. A few hash lookups are sufficient to sample xi for gradient estimation adaptively. Therefore, we only pay a one-time preprocessing cost of building hash tables and a few hash lookups, typically just one, in every iteration to get a sample for estimation.

Figure 2: Subplots (a)(b) show comparisons of the average (over number of samples) gradient L2 norm of the points that LGD and SGD sampled. Subplots (d)(e) show the comparison of the cosine similarity between the gradient estimated by LGD and the true gradient and the cosine similarity between the gradient estimated by SGD and the true gradient.

There are a few more technical subtleties due to the absolute value of the inner product, |[θt, −1] · [xi, yi]|, rather than the inner product itself. However, the square of the absolute value of the inner product,

|[θt, −1] · [xi, yi]|² = T([θt, −1]) · T([xi, yi]),

can also be written as an inner product, as it is a quadratic kernel, and T is the corresponding feature expansion transformation. Again, the square is a monotonic function, and therefore our sampling is still monotonic, as a composition of monotonic functions is monotonic. Thus, technically we hash T([xi, yi]) to create the hash tables, and the query at the t-th step is T([θt, −1]). Once an xi is sampled via LSH sampling (Algorithm 1), we can precisely compute the probability of its sampling, i.e., pi. It is not difficult to show that our estimation of the full gradient is unbiased (Section 2.3).

Algorithm 1: sampling algorithm
Input: H (hash functions), HT[][] (L hash tables), K, Query
cp(x, Q) is Pr(h(x) = h(Q)) under the given LSH
Output: sampled data x, sampling probability p
l = 0
while true do
  ti = random(1, L)
  bucket = H(Query, ti) (table-specific hash)
  l++
  if HT[ti][bucket] = empty then
    continue
  end if
  S = |HT[ti][bucket]| (size of bucket)
  x = randomly pick one element from HT[ti][bucket]
  break
end while
p = cp(x, Query)^K (1 − cp(x, Query)^K)^{l−1} × (1/S)
return x, p

2.2 Algorithm and Implementation Details

We first describe the detailed steps of our gradient estimator in Algorithm 2. We also provide the sampling procedure, Algorithm 1, in detail. Assume that we have access to the right LSH function h and its collision probability expression cp(x, y) = Pr(h(x) = h(y)). For linear regression, we can use signed random projections, simhash [8], or MIPS hashing. With normalized data, the simhash collision probability is cp(x, y) = 1 − cos⁻¹(x · y / (||x||₂ ||y||₂)) / π, which is monotonic in the inner product. Furthermore, we centered the data we need to store in the LSH hash tables to make the simhash query more efficient.

Algorithm 2: LSH-Sampled Stochastic Gradient Descent (LGD) Algorithm
1: Input: D = {xi, yi}, N, θ0, η
2: Input: LSH family H, parameters K, L
3: Output: θ*
4: HT = preprocess the training data: compute x_lsh, y_lsh and put each [xi_lsh, yi_lsh] into the LSH data structure
5: t = 0
6: while NotConverged do
7:   xi_lsh, p = Sample(H, HT, K, [θt, −1]) (Algorithm 1)
8:   Get xi'_train, yi'_train from the preprocessed data
9:   θt+1 := θt − ηt (∇f(xi'_train, θt) / (p × N))
10: end while
11: return θ*

LGD with Adaptive Learning Rate: The learning rate or step size η in SGD is a one-parameter approximation to the inverse of the Hessian (second-order derivative) [5]. Time-based (or step-based) decay and exponential decay [27] have been empirically found to work well. Furthermore, [13] proposed the popular AdaGrad, which is a dimension-specific adaptive learning rate based on first-order gradient information. Although the methods mentioned above help improve the convergence of SGD by tweaking the learning rate, LGD is not an alternative to them but a complement. In the LGD implementation, AdaGrad as well as those learning rate decay methods are customizable options that can be used in conjunction.

Running Time of Sampling: The computational cost of SGD sampling is merely a single random number generation. The cost of the gradient update (equation 2) is one inner product, which is d multiplications. If we want to design an adaptive sampling procedure that beats SGD, the sampling cost cannot be significantly larger than d multiplications.

The cost of LGD sampling (Algorithm 1) is K × l hash computations followed by l + 1 random number generations (1 extra for sampling from the bucket). Since the scheme works for any K, we can always choose K small enough so that empty buckets are rare (see [24]). In all of our experiments, K = 5, for which l is almost always 1. Thus, we require K hash computations and only two random number generations. If we use very sparse random projections, then the K hash computations only require a constant (≪ d) number of multiplications. For example, in all our experiments we only need d/30 multiplications, in expectation, to get all the hashes using sparse projections. Therefore, our sampling cost is significantly less than the d multiplications which are the cost of the gradient update. Using fast hash computation is critical for our method to work in practice.

Figure 3: In subplots (a)(b), the comparisons of wall-clock training loss convergence are made between plain LGD (red lines) and plain SGD (blue lines). We can clearly see the big gap between them, showing that LGD converges faster than SGD even time-wise. Subplots (d)(e) show the results for the same comparisons but epoch-wise.

2.2.1 Near-Neighbor is Costlier than LSH-Sampling

It might be tempting to use approximate near-neighbor search with query θt to find xi. Near-neighbor search has been used in the past [10] to speed up coordinate descent. However, near-neighbor queries are expensive due to candidate generation and filtering. They are still sub-linear in N (and not constant). Thus, even if we see epoch-wise faster convergence, iterations with a near-neighbor query would be orders of magnitude slower than a single SGD iteration. Moreover, the sampling probability of x cannot be calculated for near-neighbor search, which would cause bias in the gradient estimates.

It is important to note that although LSH is heavily used for near-neighbor search, in our case, we use it as a sampler.
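A toy instantiation of this sampling view, assuming simhash and small illustrative sizes. This is a sketch of the idea, not the authors' implementation; the helper `sample` mirrors the probe-then-pick structure of Algorithm 1:

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(3)
N, d, K, L = 200, 16, 5, 50

X = rng.normal(size=(N, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # normalized data

# One-time preprocessing: L tables, each keyed by K concatenated simhash bits.
projs = rng.normal(size=(L, K, d))
tables = [defaultdict(list) for _ in range(L)]
for i in range(N):
    for t in range(L):
        key = tuple((projs[t] @ X[i] > 0).astype(int))
        tables[t][key].append(i)

def sample(query):
    """Probe randomly chosen tables until a non-empty bucket is found,
    then pick uniformly inside the bucket and return (index, probability)."""
    ell = 0
    while True:
        t = rng.integers(L)
        ell += 1
        key = tuple((projs[t] @ query > 0).astype(int))
        bucket = tables[t].get(key, [])
        if bucket:
            i = bucket[rng.integers(len(bucket))]
            # Simhash collision probability: 1 - angle(x, query) / pi.
            cos = np.clip(X[i] @ query / np.linalg.norm(query), -1.0, 1.0)
            cp = 1.0 - np.arccos(cos) / np.pi
            # ell - 1 empty probes, then a K-bit collision, then 1/|bucket|.
            p = cp ** K * (1.0 - cp ** K) ** (ell - 1) / len(bucket)
            return i, p

i, p = sample(rng.normal(size=d))
```

The returned probability `p` is exactly what Algorithm 2 divides by to keep the gradient estimate unbiased.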
For efficient near-neighbor search, K and L grow with N [15]. In contrast, the sampling works for K and l¹ as small as one, leading to only approximately 1.5 times the cost of an SGD iteration (see Section 3). Efficient unbiased estimation is the key difference that makes sampling practical while near-neighbor queries are prohibitive. It is unlikely that a near-neighbor query would beat SGD in time, while sampling would.

¹L represents the total number of hash tables, while l represents the number of hash tables probed in one query.

2.3 Variance and Convergence Analysis

In Section 1, we discussed the convergence guarantees for SGD on convex functions under the assumption of Lipschitz-continuous objective gradients with Lipschitz constant L > 0.
Now we stress the importance of reducing the variance for a faster convergence rate. It is well known that GD converges at a linear rate while SGD converges at a slower, sublinear rate. The key reason for the much slower rate is the relatively large variance of the SGD estimator. Specifically, assuming that the variance of SGD is bounded, the expected decrease in the objective function yielded by the t-th step is bounded by [6]:

E[f(θt+1)] − f(θt) ≤ −ηt ||∇F(θt)||₂² + ηt² (L/2) E[||∇f(xi, θt)||₂²].   (5)

If the variance term E[||∇f(xi, θt)||₂²] were 0, then SGD would have a linear convergence rate with a constant step size, similar to GD. However, due to the stochasticity introduced by the gradient estimation, i.e., the variance, a smaller step size has to be chosen, thereby slowing down the convergence [5]. Clearly, lowering the variance of the estimator directly helps improve the convergence rate.

Therefore, in this section, we first prove that our estimator of the gradient is unbiased with bounded variance, which is sufficient for convergence. We further argue about the conditions under which LGD has lower variance than SGD. Denote by Sb the bucket, i.e., the set of samples that have the same hash value as the query, and by xm the sample chosen in Algorithm 1. For simplicity we denote the query as θt and pi = cp(xi, θt)^K (1 − cp(xi, θt)^K)^{l−1} as the probability of xi belonging to that bucket.

Theorem 1. The following expression is an unbiased estimator of the full gradient:

Est = (1/N) Σ_{i=1}^N 1_{xi ∈ Sb} · 1_{(xi = xm | xi ∈ Sb)} · ∇f(xi, θt) · |Sb| / pi,   with   E[Est] = (1/N) Σ_{i=1}^N ∇f(xi, θt).   (6)

Theorem 2.
The trace of the covariance of our estimator is:

Tr(Σ(Est)) = (1/N²) Σ_{i=1}^N (||∇f(xi, θt)||₂² / pi²) Σ_{j=1}^N P(xi, xj ∈ Sb) − (1/N²) ||Σ_{i=1}^N ∇f(xi, θt)||₂².   (7)

The trace of the covariance of LGD is the total variance of the descent direction. The variance is minimized when the sampling probability of xi is proportional to the L2 norm of its gradient, as mentioned in Section 1.1. The intuition for the advantage of the LGD estimator comes from sampling xi under a distribution monotonic to the optimal one. We first make a simple comparison of the variance of LGD with that of SGD theoretically, and then in Section 3 we further empirically show the drastic superiority of LGD over SGD.

Lemma 1. The trace of the covariance of LGD's estimator is smaller than that of SGD's estimator if

(1/N) Σ_{i=1}^N (||∇f(xi, θt)||₂² / pi²) Σ_{j=1}^N P(xi, xj ∈ Sb) < Σ_{i=1}^N ||∇f(xi, θt)||₂².   (8)

We analyze the simple case in which the data is uniformly distributed, such that every collision probability is the same. It is trivial to see from equation 8 that the trace of the covariance of LGD is then exactly the same as that of SGD. Intuitively, this happens when all the gradient norms are equal. Therefore, SGD would perform well if the data were uniform, but this is unlikely in practice.

Observe that when the gradients are large, pi is also large due to the monotonicity of pi with the gradient norms. As a result, the term Σ_{j=1}^N P(xi, xj ∈ Sb) / pi² is likely to be much smaller than N, making the corresponding component of the LHS (left-hand side) smaller, favoring the LGD estimator. In a real scenario, with more of a power-law behavior, we have a few large gradients, and most other gradients would be uniform in expectation.
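For the simpler single-sample importance sampler, this variance comparison can be checked exactly on synthetic heavy-tailed gradients. The sketch below uses the norm-proportional weights of Section 1.1 as a stand-in for the actual monotonic LSH distribution; all data is illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)
N, d = 200, 5

# Stand-in per-instance gradients with heavy-tailed (power-law-like) norms,
# the regime where the text argues adaptive sampling helps most.
G = rng.normal(size=(N, d)) * (1.0 + rng.pareto(2.0, size=(N, 1)))

full_grad = G.mean(axis=0)
norms = np.linalg.norm(G, axis=1)

def trace_cov(p):
    # Exact trace of covariance of the estimator g_i / (N p_i) with i ~ p:
    # second moment (1/N^2) sum_i ||g_i||^2 / p_i minus ||mean gradient||^2.
    second_moment = float((norms ** 2 / p).sum()) / N ** 2
    return second_moment - float((full_grad ** 2).sum())

p_uniform = np.full(N, 1.0 / N)
p_norm = norms / norms.sum()        # variance-optimal weights of Section 1.1
```

By Cauchy-Schwarz, the norm-proportional weights never give a larger trace than uniform sampling, and the gap widens as the norms become more heavy-tailed.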
In such cases, we can expect LGD to have smaller variance. A rigorous characterization of the distributions for which LGD is better than SGD is hard due to the correlations between the gradients and the collision probabilities pi, as well as the sizes of the buckets. [7] shows that such an analysis is only possible under several query- and data-specific assumptions. A rigorous analysis in the gradient descent setting, where both the query and the data distribution change in every iteration, is left for future work.

Here we provide an analysis based on assumptions on the data. We first upper bound the left side of equation 8 by

(1/N) Σ_{i=1}^N (||∇f(xi, θt)||₂² / pi²) Σ_{j=1}^N P(xi, xj ∈ Sb) ≤ Σ_{i=1}^N ||∇f(xi, θt)||₂² · (Σ_{j=1}^N pj) / (pi² N).   (9)

Assume the normalized collision probabilities follow a Pareto distribution [3], which is a power-law probability distribution. If X is a random variable with a Pareto (Type I) distribution, then the probability that X is greater than some number x is Pr(X > x) = (xm/x)^α if x > xm, and 1 if x ≤ xm, where xm is the minimum possible value of X and α is a positive parameter. The mean is then µp = α xm / (α − 1). Assume the pi are sorted in descending order, and let us first separate the right side of equation 9 into two parts, Σ_{i=1}^k ||∇f(xi, θt)||₂² · µp/pi² + Σ_{i=k+1}^N ||∇f(xi, θt)||₂² · µp/pi², where k is the index that separates the summation based on µp ≤ pi² or µp > pi².
Then equation 8 becomes

Σ_{i=1}^k ||∇f(xi, θt)||₂² · (1 − µp/pi²) > Σ_{i=k+1}^N ||∇f(xi, θt)||₂² · (µp/pi² − 1),   (10)

making Lemma 1 a reasonable condition if the distribution of the gradient norms also follows a power law, because the large gradient norm terms are on the LHS, the small gradient norm terms are on the RHS, and under the power-law assumption the small gradient norm terms drop off extremely fast.

In practice, we can tune the parameter K of our hashing scheme in LGD, which controls the values of the pi. With this tuning, we achieve better control over the relative decays of the pi, leading to more possibilities of better variance. Recall that the collision probability is pi = cp(xi, θt)^K (1 − cp(xi, θt)^K)^{l−1}. Note that l here, according to Algorithm 1, is the number of tables that have been utilized by the sampling process. In most practical cases, and also in our experiments, K and l are relatively small. L, which is the total number of hash tables, should be large to ensure enough independence across samples, but it does not contribute to the sampling time (see Algorithm 1). Overall, our experiments show that LGD is efficient and generally achieves smaller variance than SGD by setting small enough values of K and l, making the sampling process as efficient as that of SGD.

LGD for Logistic Regression: We can derive a similar form of LGD for logistic regression. Note that the labels yi ∈ {−1, +1}. The loss function of logistic regression can be written as L(θt) = (1/N) Σ_{i=1}^N ln(1 + e^{−yi θt · xi}), where the L2 norm of the per-instance gradient can be derived as

||∇L(θt)i||₂ = ||xi||₂ / (e^{yi θt · xi} + 1) = 1 / (e^{yi θt · xi} + 1),   (11)

when xi is normalized to have unit norm.
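Equation 11 is easy to verify numerically for a unit-norm instance (an illustrative check with synthetic data, not part of the paper's code):

```python
import numpy as np

rng = np.random.default_rng(5)
d = 6
x = rng.normal(size=d)
x /= np.linalg.norm(x)                 # unit-norm instance
y = -1.0                               # label in {-1, +1}
theta = rng.normal(size=d)

# Gradient of ln(1 + exp(-y * theta.x)) with respect to theta.
margin = y * (theta @ x)
grad = -y * x / (1.0 + np.exp(margin))

# Closed form for its l2 norm from equation (11).
norm_closed = 1.0 / (np.exp(margin) + 1.0)
```

The norm depends on theta only through the inner product of −theta with y·x, which is what makes the hashing trick applicable here as well.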
Similar to linear regression, we get two quantities in the inner product, $y_i \cdot x_i$ and $-\theta_t$. The inner product is monotonic to $\frac{1}{e^{y_i \theta_t x_i} + 1}$, which is the $l_2$ norm of the gradient. To apply our LGD framework for estimating the gradient, we can preprocess $y_i \cdot x_i$ into hash tables and query with $-\theta_t$ for efficient and adaptive sampling.

3 Experiments

Linear regression is a basic and commonly used supervised machine learning algorithm for prediction. Deep learning models have recently become popular for their state-of-the-art performance on Natural Language Processing (NLP) and Computer Vision tasks. We therefore chose both linear regression and deep learning models as the target tasks for examining the effectiveness of our algorithm. We proceed in four steps: (1) compare the quality of samples retrieved by LGD with that of samples retrieved by SGD (according to Section 2.1, high-quality samples have larger gradient $L_2$ norm); (2) compare the convergence in time of the linear regression task using SGD and LGD; (3) compare the convergence in time of the linear regression task using SGD with AdaGrad and LGD with AdaGrad; (4) compare the epoch-wise convergence of NLP tasks between SGD and LGD with BERT [9].

Dataset: We used three large regression datasets, YearPredictionMSD [16], Slice [16], and UJIIndoorLoc [25], and two NLP benchmarks, MRPC [11] and RTE [26]. The details are shown in Figure 4 and the appendix.

3.1 Linear Regression Tasks

The three regression datasets were preprocessed as described in Section 2.2.² Note that for all the experiments, the choice of the gradient descent algorithm was the same. For both SGD and LGD, the only difference was the gradient estimator: SGD used a uniform random sampling estimator, while LGD used the adaptive estimator. We used fixed values K = 5 and L = 100 for all the datasets.
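As a hedged illustration of how the parameter $K$ controls the sampling probabilities via the collision-probability formula $p_i = cp^K (1 - cp^K)^{l-1}$ given above, the sketch below uses the standard simhash collision probability $cp = 1 - \theta(x, q)/\pi$ and a few hypothetical query-data cosine similarities (the numbers are synthetic, not from the paper's data):

```python
import numpy as np

def simhash_collision_prob(cos_sim):
    """Collision probability of signed random projections: 1 - angle/pi."""
    return 1.0 - np.arccos(np.clip(cos_sim, -1.0, 1.0)) / np.pi

def sampling_prob(cos_sim, K, l=1):
    """p_i = cp^K * (1 - cp^K)^(l-1), as in the LGD analysis above."""
    cpK = simhash_collision_prob(cos_sim) ** K
    return cpK * (1.0 - cpK) ** (l - 1)

sims = np.array([0.9, 0.5, 0.1])  # hypothetical query-data cosine similarities
for K in (1, 5):
    p = sampling_prob(sims, K)
    print(K, p / p.sum())         # larger K skews mass toward similar points
```

Increasing $K$ makes $p_i$ decay faster as the similarity to the query decreases, which is exactly the tuning knob discussed above for controlling the relative decay of the $p_i$.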
$l$ is the number of hash tables searched before landing in a non-empty bucket for a query; in our experiments $l$ is almost always as low as 1. $L$ only affects preprocessing, not sampling. Our hash function was simhash (signed random projections), and we used sparse random projections with sparsity 1/30 for speed. We know that epoch-wise convergence is not a true indicator of speed, as it hides the per-epoch computation. Our main focus is convergence with running time, which is a better indicator of computational efficiency. To the best of our knowledge, there is no other adaptive estimation baseline where the cost of sampling per iteration is less than linear, O(N). Since our primary focus is wall-clock speedup, no O(N) estimation method can outperform the O(1) SGD (and LGD) estimates on the same platform. From Section 2.2.1, even methods requiring a near-neighbor query would be too costly (by orders of magnitude) to outperform SGD from a computational perspective.

Figure 4: Statistics Information for Datasets

    DATA SET        TRAINING    TESTING    DIMENSION
    YEARMSD         463,715     51,630     90
    SLICE           53,500      42,800     74
    UJIINDOORLOC    10,534      10,534     529
    MRPC            3,669       409        N/A
    RTE             2,491       278        N/A

Figure 5: In subplots (a)(b), comparisons of epoch-wise testing accuracy convergence between LGD (red lines) and SGD (blue lines) on two NLP benchmarks; the gap between them shows that LGD converges faster than SGD. Subplots (c)(d) show a similar comparison of testing loss.

LGD, SGD vs. True Gradient: In the first experiment, as a sanity check, we first verify whether LGD samples data points with probability monotonic to the $L_2$ norm of the gradient, as described in Section 2.1. To do so, we freeze the optimization at an intermediate iteration and use the $\theta$ at that moment to sample data points with both LGD and SGD, computing the gradient $L_2$ norms separately.
We observe that when freezing at the early iterations, the difference in average gradient norm between LGD and SGD samples is not obvious. This is not surprising because the model $\theta$ is initialized randomly. To visualize the quality difference of SGD and LGD samples more clearly, we choose to freeze after 1/4 epoch of cold start. The upper three plots in Figure 2 show the comparison of the sampled gradient norms of LGD and SGD. The x-axis represents the number of samples averaged in the above process. LGD-sampled points consistently have larger gradient norm than SGD-sampled points across all three datasets.

In addition, we do a sanity check of whether, empirically, the samples chosen by LGD give a better estimate of the true gradient direction than those of SGD. Again, we freeze the program at an intermediate iteration as in the experiments above. Then we compute the angular similarity between the full gradient (averaged over the training data) direction and both the LGD and the SGD gradient directions, where

$$\text{Similarity} = 1 - \frac{\cos^{-1}\!\big(\frac{x \cdot y}{\|x\|_2 \|y\|_2}\big)}{\pi}.$$
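The angular similarity above can be sketched in a few lines (the example vectors are illustrative, not from the experiments):

```python
import numpy as np

def angular_similarity(x, y):
    """Similarity = 1 - arccos(cos(x, y)) / pi, in [0, 1]; 1 means fully aligned."""
    cos = x.dot(y) / (np.linalg.norm(x) * np.linalg.norm(y))
    return 1.0 - np.arccos(np.clip(cos, -1.0, 1.0)) / np.pi

full_grad = np.array([1.0, 0.0])
estimate = np.array([1.0, 1.0])  # hypothetical estimated gradient direction
print(angular_similarity(full_grad, estimate))  # 45-degree angle -> 0.75
```

A similarity closer to 1 means the estimated gradient is more aligned with the true gradient direction.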
From the right two plots in Figure 2, we can see that, on average, the LGD-estimated gradient has a smaller angle to (is more aligned with) the true gradient than the SGD-estimated gradient. The variance of both the norm and the cosine similarity decreases when averaging over more samples, as shown in the plots.

²Note that in the experiments we show the plots for two datasets; the third one is in the appendix.

LGD vs. SGD: In this section, we compare vanilla SGD with LGD, i.e., we use simple SGD with a fixed learning rate. This basic experiment aims to demonstrate the performance of pure LGD and SGD on the linear regression task without involving other factors such as L1/L2 regularization. In this way, we can quantify the superiority of LGD more easily. We swept the initial step size from 1e−5 to 1e−1 and chose the one that led to convergence with both LGD and SGD. Figure 3 shows the decrease in the squared loss with epochs; blue lines represent SGD and red lines represent LGD. LGD converges much faster than SGD in both training and testing loss. This is not surprising given the claims in Section 2.2 and the theoretical analysis in Section 2.3. Since LGD uses slightly more computation per epoch than SGD, epoch-wise comparisons alone cannot establish that LGD gains enough benefit. We therefore also show the decrease in error with wall-clock time in Figure 3.
Wall-clock time is the actual quantification of speedup. Again, on every single dataset, LGD shows faster time-wise convergence as well.

As argued in Section 1.1, our LGD algorithm is complementary to any gradient-based optimization algorithm. We repeated the first experiment using AdaGrad [13] instead of plain SGD. Figure 6 shows running-time comparisons of LGD and SGD training convergence. The trends, as expected, are similar to those of LGD vs. SGD: LGD with AdaGrad outperforms AdaGrad with SGD gradient estimates both epoch-wise and time-wise.

Figure 6: Comparisons of wall-clock training loss convergence between LGD+AdaGrad and SGD+AdaGrad on three datasets. We again see a similar gap, showing that LGD converges faster than SGD time-wise. Epoch-wise comparisons are in the appendix.

3.2 BERT Tasks

BERT [9], a recent popular language representation model, is designed to pre-train deep bidirectional representations that can be fine-tuned with just one additional layer to create state-of-the-art models for various tasks. To strengthen the case for LGD, we adapted LGD to BERT for several natural language processing (NLP) tasks; the implementation details are included in the appendix. We used two popular NLP benchmarks, MRPC and RTE, and replicated the experimental setting of the BERT paper. For the pre-trained model, we chose BERT-base because it is more stable on such smaller downstream tasks. For each task, we ran fine-tuning for 3 epochs with batch size 32 and used the Adam optimizer with initial learning rate 2e. As for the LSH parameters, we chose K = 7, L = 10. Results are presented in Figure 5. LGD outperformed SGD in epoch-wise convergence on both tasks by a substantial margin. This is encouraging because, in the previous section, we showed that even with the hashing overhead, LGD leads to faster time-wise convergence.
We do not explore the time-wise convergence comparison between LGD and SGD on these tasks because BERT is implemented in TensorFlow [1] and PyTorch [19] on GPU, while we currently only have a CPU implementation of LSH. Running the LGD algorithm on BERT therefore incurs an extra overhead of switching between GPUs and CPUs. An efficient GPU implementation of LGD is an independent research interest for future work. This section demonstrates the power of LGD on non-linear models.

4 Conclusion

In this paper, we proposed a novel LSH-based sampler that reduces the variance of gradient estimation. We achieve this by sampling with probability proportional to the $L_2$ norm of the instances' gradients, approximating the optimal distribution that minimizes the variance of the estimate. More remarkably, LGD is as computationally efficient as SGD but achieves faster convergence, not only epoch-wise but also time-wise.

Acknowledgments

We thank the reviewers for their valuable comments. We also thank Benjamin Coleman for the helpful discussions. The work was supported by NSF-1652131, NSF-BIGDATA 1838177, AFOSR-YIP FA9550-18-1-0152, an Amazon Research Award, and an ONR BRC grant for Randomized Numerical Linear Algebra.

References

[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S.
Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.

[2] Guillaume Alain, Alex Lamb, Chinnadhurai Sankar, Aaron Courville, and Yoshua Bengio. Variance reduction in SGD by distributed importance sampling. arXiv preprint arXiv:1511.06481, 2015.

[3] Barry C. Arnold. Pareto and generalized Pareto distributions. In Modeling Income Distributions and Lorenz Curves, pages 119–145. Springer, 2008.

[4] Léon Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT'2010, pages 177–186. Springer, 2010.

[5] Léon Bottou and Olivier Bousquet. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems, pages 161–168, 2008.

[6] Léon Bottou, Frank E. Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223–311, 2018.

[7] Moses Charikar and Paris Siminelakis. Hashing-based-estimators for kernel density in high dimensions. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pages 1032–1043. IEEE, 2017.

[8] Moses S. Charikar. Similarity estimation techniques from rounding algorithms. In Proceedings of the Thirty-Fourth Annual ACM Symposium on Theory of Computing, pages 380–388. ACM, 2002.

[9] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova.
BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

[10] Inderjit S. Dhillon, Pradeep K. Ravikumar, and Ambuj Tewari. Nearest neighbor based greedy coordinate descent. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 2160–2168. Curran Associates, Inc., 2011.

[11] William B. Dolan and Chris Brockett. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005), 2005.

[12] Petros Drineas, Malik Magdon-Ismail, Michael W. Mahoney, and David P. Woodruff. Fast approximation of matrix coherence and statistical leverage. Journal of Machine Learning Research, 13(Dec):3475–3506, 2012.

[13] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.

[14] Siddharth Gopal. Adaptive sampling for SGD by exploiting side information. In International Conference on Machine Learning, pages 364–372, 2016.

[15] Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, pages 604–613. ACM, 1998.

[16] M. Lichman. UCI machine learning repository, 2013.

[17] Hongseok Namkoong, Aman Sinha, Steve Yadlowsky, and John C. Duchi. Adaptive sampling probabilities for non-smooth optimization. In International Conference on Machine Learning, pages 2574–2583, 2017.

[18] Deanna Needell, Rachel Ward, and Nati Srebro. Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm. In Advances in Neural Information Processing Systems, pages 1017–1025, 2014.
[19] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017.

[20] Dmytro Perekrestenko, Volkan Cevher, and Martin Jaggi. Faster coordinate descent via adaptive importance sampling. arXiv preprint arXiv:1703.02518, 2017.

[21] Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407, 1951.

[22] Ohad Shamir and Tong Zhang. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In International Conference on Machine Learning, pages 71–79, 2013.

[23] Ryan Spring and Anshumali Shrivastava. Scalable and sustainable deep learning via randomized hashing. arXiv preprint arXiv:1602.08194, 2016.

[24] Ryan Spring and Anshumali Shrivastava. A new unbiased and efficient class of LSH-based samplers and estimators for partition function computation in log-linear models. arXiv preprint arXiv:1703.05160, 2017.

[25] J. Torres-Sospedra, R. Montoliu, A. Martínez-Usó, J. P. Avariento, T. J. Arnau, M. Benedito-Bordonau, and J. Huerta. UJIIndoorLoc: A new multi-building and multi-floor database for WLAN fingerprint-based indoor localization problems. In 2014 International Conference on Indoor Positioning and Indoor Navigation (IPIN), pages 261–270, Oct 2014.

[26] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461, 2018.

[27] Wei Xu. Towards optimal one pass large scale learning with averaged stochastic gradient descent. arXiv preprint arXiv:1107.2490, 2011.

[28] Jiyan Yang, Yin-Lam Chow, Christopher Ré, and Michael W. Mahoney.
Weighted SGD for lp regression with randomized preconditioning. In Proceedings of the Twenty-Seventh Annual ACM-SIAM Symposium on Discrete Algorithms, pages 558–569. SIAM, 2016.

[29] Peilin Zhao and Tong Zhang. Accelerating minibatch stochastic gradient descent using stratified sampling. arXiv preprint arXiv:1405.3080, 2014.

[30] Peilin Zhao and Tong Zhang. Stochastic optimization with importance sampling for regularized loss minimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 1–9, 2015.