{"title": "Statistical Tests for Optimization Efficiency", "book": "Advances in Neural Information Processing Systems", "page_first": 2196, "page_last": 2204, "abstract": "Learning problems such as logistic regression are typically formulated as pure optimization problems defined on some loss function. We argue that this view ignores the fact that the loss function depends on stochastically generated data which in turn determines an intrinsic scale of precision for statistical estimation. By considering the statistical properties of the update variables used during the optimization (e.g. gradients), we can construct frequentist hypothesis tests to determine the reliability of these updates. We utilize subsets of the data for computing updates, and use the hypothesis tests for determining when the batch-size needs to be increased. This provides computational benefits and avoids overfitting by stopping when the batch-size has become equal to the size of the full dataset. Moreover, the proposed algorithms depend on a single interpretable parameter \u2013 the probability for an update to be in the wrong direction \u2013 which is set to a single value across all algorithms and datasets. In this paper, we illustrate these ideas on three L1-regularized coordinate descent algorithms: L1-regularized L2-loss SVMs, L1-regularized logistic regression, and the Lasso, but we emphasize that the underlying methods are much more generally applicable.", "full_text": "Statistical Tests for Optimization Efficiency

Levi Boyles, Anoop Korattikara, Deva Ramanan, Max Welling
{lboyles},{akoratti},{dramanan},{welling}@ics.uci.edu
Department of Computer Science
University of California, Irvine
Irvine, CA 92697-3425

Abstract

Learning problems, such as logistic regression, are typically formulated as pure optimization problems defined on some loss function.
We argue that this view ignores the fact that the loss function depends on stochastically generated data which in turn determines an intrinsic scale of precision for statistical estimation. By considering the statistical properties of the update variables used during the optimization (e.g. gradients), we can construct frequentist hypothesis tests to determine the reliability of these updates. We utilize subsets of the data for computing updates, and use the hypothesis tests for determining when the batch-size needs to be increased. This provides computational benefits and avoids overfitting by stopping when the batch-size has become equal to the size of the full dataset. Moreover, the proposed algorithms depend on a single interpretable parameter – the probability for an update to be in the wrong direction – which is set to a single value across all algorithms and datasets. In this paper, we illustrate these ideas on three L1-regularized coordinate descent algorithms: L1-regularized L2-loss SVMs, L1-regularized logistic regression, and the Lasso, but we emphasize that the underlying methods are much more generally applicable.

1 Introduction

There is an increasing tendency to consider machine learning as a problem in optimization: define a loss function, add constraints and/or regularizers, and formulate it as a (preferably convex) program. Then, solve this program using some of the impressive tools from the optimization literature. The main purpose of this paper is to point out that this "reduction to optimization" ignores certain important statistical features that are unique to statistical estimation. The most important feature we will exploit is the fact that the statistical properties of an estimation problem determine an intrinsic scale of precision (that is usually much larger than machine precision).
This immediately implies that optimizing parameter values beyond that scale is pointless and may even have an adverse effect on generalization when the underlying model is complex. Besides providing a natural stopping criterion, it also leads to much faster optimization before we reach that scale, by realizing that far away from optimality we need much less precision to determine a parameter update than when close to optimality. These observations can be incorporated into many off-the-shelf optimizers and are often orthogonal to speed-up tricks in the optimization toolbox.

The intricate relationship between computation and estimation has been pointed out before in [1] and [2], where asymptotic learning rates were provided. One of the important conclusions was that a not so impressive optimization algorithm such as stochastic gradient descent (SGD) can nevertheless be a very good learning algorithm because it can process more data per unit time. Also in [3] (sec. 5.5) the intimate relationship between computation and model fitting is pointed out. [4] gives bounds on the generalization risk for online algorithms, and [5] shows how additional data can be used to reduce running time for a fixed target generalization error. Regret-minimizing algorithms ([6], [7]) are another way to account for the interplay between learning and computation. Hypothesis testing has been exploited for computational gains before in [8].

Our method exploits the fact that loss functions are random variables subject to uncertainty. In a frequentist world we may ask how different the value of the loss would have been had we sampled another dataset of the same size from a single shared underlying distribution. The role of an optimization algorithm is then to propose parameter updates that will be accepted or rejected on statistical grounds. The test we propose determines whether the direction of a parameter update is correct with high probability.
If we do not pass our tests when using all the available data-cases then we stop learning (or alternatively we switch to sampling or bagging), because we have reached the intrinsic scale of precision set by the statistical properties of the estimation problem.

However, we can use the same tests to speed up the optimization process itself, that is, before we reach the above stopping criterion. To see that, imagine one is faced with an infinite dataset. In batch mode, using the whole (infinite) dataset, one would not take a single optimization step in finite time. Thus, one should really be concerned with making as much progress as possible per computational unit. Hence, one should only use a subset of the total available dataset. Importantly, the optimal batch-size depends on where we are in the learning process: far away from convergence we only need a rough idea of where to move, which requires very few data-cases. On the other hand, the closer we get to the true parameter value, the more resolution we need. Thus, the computationally optimal batch-size is a function of the residual estimation error. Our algorithm adaptively grows a subset of the data by requiring that we have just enough precision to confidently move in the correct direction. Again, when we have exhausted all our data we stop learning.

Our algorithm heavily relies on the central limit tendencies of large sums of random variables. Fortunately, many optimization algorithms are based on averages over data-cases. For instance, gradient descent falls in this class, as the gradient is defined by an average (or sum). As in [11], with large enough batch sizes we can use the Central Limit Theorem to claim that the average gradients are normally distributed and estimate their variance without actually seeing more data (this assumption is empirically verified in section 5.2).
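As a concrete (hypothetical) illustration of such a test – a minimal sketch rather than our actual implementation – the following Python snippet estimates the mean and standard error of an averaged update variable from its per-data-case terms `z` (all names are ours), and uses the Gaussian tail to bound the probability that the true direction differs from the observed one:

```python
import math
import random

def prob_wrong_direction(z):
    """Estimate P(true mean has the opposite sign to the sample mean) for
    the average of iid per-data-case terms z, via the Central Limit Theorem."""
    n = len(z)
    mean = sum(z) / n
    var = sum((v - mean) ** 2 for v in z) / (n - 1)  # unbiased sample variance
    se = math.sqrt(var / n)                          # std. error of the mean
    # Phi(-|mean| / se): Gaussian probability mass on the wrong side of zero.
    return 0.5 * math.erfc(abs(mean) / (se * math.sqrt(2)))

# Toy check: the same weak signal (mean 1, noise sd 5) yields an unreliable
# direction on a small batch but a confident one on a large batch.
random.seed(0)
small = [1.0 + random.gauss(0, 5) for _ in range(20)]
large = [1.0 + random.gauss(0, 5) for _ in range(2000)]
assert 0.0 <= prob_wrong_direction(small) <= 0.5
assert prob_wrong_direction(large) < 0.01
```

An update would then be accepted only when this probability falls below the single parameter ε.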
We have furthermore implemented methods to avoid testing updates for parameters which are likely to fail their test. This ensures that we approximately visit the features with their correct frequency (i.e. important features may require more updates than unimportant ones).

In summary, the main contribution of this paper is to introduce a class of algorithms with the following properties.

• They depend on a single interpretable parameter ε – the probability to update parameters in the wrong direction. Moreover, the performance of the algorithms is relatively insensitive to the exact value we choose.

• They have a natural, inbuilt stopping criterion. The algorithms terminate when the probability to update the parameters in the wrong direction cannot be made smaller than ε.

• They are applicable to a wide range of loss functions. The only requirement is that the updates depend on sums of random variables.

• They inherit the convergence guarantees of the optimization method under consideration. This follows because the algorithms will eventually consider all the data.

• They achieve very significant speedups in learning models from data. Throughout the learning process they determine the size of the data subset required to perform updates that move in the correct direction with probability at least 1 − ε.

We emphasize that our framework is generally applicable. In this paper we show how these considerations can be applied to L1-regularized coordinate descent algorithms: L1-regularized L2-loss SVMs, L1-regularized logistic regression, and the Lasso [9]. Coordinate descent algorithms are convenient because they do not require any tuning of hyper-parameters to be effective, and are still efficient when training sparse models. Our methodology extends these algorithms to be competitive for dense models and for N ≫ p. In section 2 we review the coordinate descent algorithms.
Then, in section 3.2 we formulate our hypothesis testing framework, followed by a heuristic for predicting hypothesis test failures in section 4. We report experimental results in section 5 and we end with conclusions.

2 Coordinate Descent

We consider L1-regularized learning problems where the loss is defined as a statistical average over N datapoints:

\[ f(\beta) = \gamma \|\beta\|_1 + \frac{1}{2N} \sum_{i=1}^{N} \mathrm{loss}(\beta, x_i, y_i) \quad \text{where } \beta, x_i \in \mathbb{R}^p \tag{1} \]

We will consider continuously-differentiable loss functions (squared hinge-loss, log-loss, and squared-loss) that allow for the use of efficient coordinate-descent optimization algorithms, where each parameter is updated \(\beta_j^{new} \leftarrow \beta_j + d_j\) with:

\[ d_j = \operatorname*{argmin}_{d} f(\beta + d e_j), \qquad f(\beta + d e_j) = \gamma|\beta_j + d| + L_j(d; \beta) + \text{const} \tag{2} \]

where \(L_j(d; \beta) = \frac{1}{2N} \sum_{i=1}^{N} \mathrm{loss}(\beta + d e_j, x_i, y_i)\) and \(e_j\) is the jth standard basis vector. To solve the above, we perform a second-order Taylor expansion of the partial loss \(L_j(d; \beta)\):

\[ f(\beta + d e_j) \approx \gamma|\beta_j + d| + L'_j(0; \beta)\, d + \frac{1}{2} L''_j(0; \beta)\, d^2 + \text{const} \tag{3} \]

[10] show that the minimum of the approximate objective (3) is obtained with:

\[ d_j = \begin{cases} -\dfrac{L'_j(0,\beta) + \gamma}{L''_j(0,\beta)} & \text{if } L'_j(0,\beta) + \gamma \le L''_j(0,\beta)\,\beta_j \\[6pt] -\dfrac{L'_j(0,\beta) - \gamma}{L''_j(0,\beta)} & \text{if } L'_j(0,\beta) - \gamma \ge L''_j(0,\beta)\,\beta_j \\[6pt] -\beta_j & \text{otherwise} \end{cases} \tag{4} \]

For quadratic loss functions, the approximation in (3) is exact. For general convex loss functions, one can optimize (2) by repeatedly linearizing and applying the above update. We perform a single update per parameter during the cyclic iteration over parameters.
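To make the update concrete, here is a minimal Python sketch of (4) (the naming is ours: `Lp` and `Lpp` stand for \(L'_j(0,\beta)\) and \(L''_j(0,\beta)\)):

```python
def cd_update(beta_j, Lp, Lpp, gamma):
    """Coordinate update d_j from the second-order approximation (4).
    Assumes Lpp > 0 (i.e. the L'' term has been checked to be well formed)."""
    if Lp + gamma <= Lpp * beta_j:
        return -(Lp + gamma) / Lpp
    elif Lp - gamma >= Lpp * beta_j:
        return -(Lp - gamma) / Lpp
    else:
        return -beta_j  # shrink the coordinate exactly to zero

# With gamma = 0 the step reduces to a plain Newton step -Lp/Lpp.
assert cd_update(beta_j=1.0, Lp=2.0, Lpp=4.0, gamma=0.0) == -0.5
# A gradient that is weak relative to gamma zeroes out the coordinate.
assert cd_update(beta_j=0.5, Lp=0.0, Lpp=1.0, gamma=1.0) == -0.5
```

In the regularized cases the step is a Newton step on the smooth part, shifted by ±γ; the final branch is what produces sparse solutions.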
Notably, the partial derivatives are functions of statistical averages computed over N training points. We show that one can use frequentist hypothesis tests to elegantly manage the amount of data (N) needed to reliably compute these quantities.

2.1 L1-regularized L2-loss SVM

Using a squared hinge-loss function in (1), we obtain an L1-regularized L2-loss SVM:

\[ \mathrm{loss}_{SVM} = \max(0,\, 1 - y_i \beta^T x_i)^2 \tag{5} \]

Appendix F of [10] derives the corresponding partial derivatives, where the second-order statistic is approximate because the squared hinge-loss is not twice differentiable:

\[ L'_j(0,\beta) = -\frac{1}{N} \sum_{i \in I(\beta)} y_i x_{ij} b_i(\beta), \qquad L''_j(0,\beta) = \frac{1}{N} \sum_{i \in I(\beta)} x_{ij}^2 \tag{6} \]

where \(b_i(\beta) = 1 - y_i \beta^T x_i\) and \(I(\beta) = \{i \mid b_i(\beta) > 0\}\). We write \(x_{ij}\) for the jth element of datapoint \(x_i\). In [10], each parameter is updated until convergence, using a line-search for each update, whereas we simply check that the \(L''\) term is not ill-formed rather than performing a line search.

2.2 L1-regularized Logistic Regression

Using a log-loss function in (1), we obtain an L1-regularized logistic regression model:

\[ \mathrm{loss}_{log} = \log(1 + e^{-y_i \beta^T x_i}) \tag{7} \]

Appendix G of [10] derives the corresponding partial derivatives:

\[ L'_j(0,\beta) = \frac{1}{2N} \sum_{i=1}^{N} \frac{-x_{ij}}{1 + e^{y_i \beta^T x_i}}, \qquad L''_j(0,\beta) = \frac{1}{2N} \sum_{i=1}^{N} \left( \frac{x_{ij}}{1 + e^{y_i \beta^T x_i}} \right)^2 e^{y_i \beta^T x_i} \tag{8} \]

2.3 L1-regularized Linear Regression (Lasso)

Using a quadratic loss function in (1), we obtain an L1-regularized linear regression, or Lasso, model:

\[ \mathrm{loss}_{quad} = (y_i - \beta^T x_i)^2 \tag{9} \]

The corresponding partial derivatives [9] are:

\[ L'_j(0,\beta) = -\frac{1}{N} \sum_{i=1}^{N} (y_i - \beta^T x_i)\, x_{ij}, \qquad L''_j(0,\beta) = \frac{1}{N} \sum_{i=1}^{N} x_{ij}^2 \tag{10} \]

Because the Taylor expansion is exact for
quadratic loss functions, we can directly write the closed-form solution for parameter \(\beta_j^{new} = S(\alpha_j, \gamma)\), where

\[ \alpha_j = \frac{1}{N} \sum_{i=1}^{N} x_{ij}\,(y_i - \tilde{y}_i^{(j)}), \qquad S(\alpha, \gamma) = \begin{cases} \alpha - \gamma & \alpha > 0,\ \gamma < |\alpha| \\ \alpha + \gamma & \alpha < 0,\ \gamma < |\alpha| \\ 0 & \gamma \ge |\alpha| \end{cases} \tag{11} \]

where \(\tilde{y}_i^{(j)} = \sum_{k \ne j} x_{ik} \beta_k\) is the prediction made with all parameters except \(\beta_j\), and S is a "soft-threshold" function that is zero for an interval of 2γ about the origin, and shrinks the magnitude of the input α by γ outside of this interval. We can use this expression as an estimator for β from a dataset \(\{x_i, y_i\}\). The above update rule assumes standardized data (\(\frac{1}{N}\sum_i x_{ij} = 0\), \(\frac{1}{N}\sum_i x_{ij}^2 = 1\)), but it is straightforward to extend to the general case.

3 Hypothesis Testing

Each update \(\beta_j^{new} = \beta_j + d_j\) is computed using a statistical average over a batch of N training points. We wish to estimate the reliability of an update as a function of N. To do so, we model the current β vector as a fixed constant and the N training points as random variables drawn from an underlying joint density p(x, y). This also makes the proposed updates \(d_j\) and \(\beta_j^{new}\) random variables, because they are functions of the training points. In the following we will make an explicit distinction between random variables, e.g. \(\beta_j^{new}, d_j, x_{ij}, y_i\), and their instantiations, \(\hat{\beta}_j^{new}, \hat{d}_j, \hat{x}_{ij}, \hat{y}_i\). We would like to determine whether or not a particular update is statistically justified. To this end, we use hypothesis tests: if there is high uncertainty in the direction of the update, we say this update is not justified and the update is not performed.
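The soft-threshold operator S defined in (11) is simple enough to state in code; the following is a sketch of ours (not taken from any published implementation):

```python
def soft_threshold(alpha, gamma):
    """S(alpha, gamma) from (11): zero on the interval of width 2*gamma
    about the origin; otherwise shrink the magnitude of alpha by gamma."""
    if gamma >= abs(alpha):
        return 0.0
    return alpha - gamma if alpha > 0 else alpha + gamma

# The unregularized coordinate estimate alpha_j passes through when gamma = 0.
assert soft_threshold(0.3, 0.0) == 0.3
# Inside the squashed interval the coordinate is set exactly to zero.
assert soft_threshold(0.05, 0.1) == 0.0
# Outside the interval the magnitude is shrunk by gamma.
assert abs(soft_threshold(-0.5, 0.1) - (-0.4)) < 1e-12
```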
For example, if our proposed update \(\hat{\beta}_j^{new}\) is positive, we want to ensure that \(P(\beta_j^{new} < 0)\) is small.

3.1 Algorithm Overview

We propose a "growing batch" algorithm for handling very large or infinite datasets: first we select a very small subsample of the data of size \(N_b \ll N\), and optimize until the entire set of parameters is failing its hypothesis tests (described in more detail below). We then query more data points and include them in our batch, reducing the variance of our estimates and making it more likely that they will pass their tests. We continue adding data to our batch until we are using the full dataset of size N. Once all of the parameters are failing their hypothesis tests on the full batch of data, we stop training. The reasoning behind this is, as argued in the introduction, that at this point we do not have enough evidence even to determine the direction in which to update, which implies that further optimization would result in overfitting. Thus, our algorithm behaves like a stochastic online algorithm during early stages and like a batch algorithm during later stages, equipped with a natural stopping condition.

In our experiments, we increase the batch size \(N_b\) by a factor of 10 once all parameters fail their hypothesis tests for a given batch. Values in the range 2–100 also worked well; however, we chose 10 as it works very well for our implementation.

3.2 Lasso

For quadratic loss functions with standardized variables, we can directly analyze the densities of \(d_j, \beta_j^{new}\). We accept an update if the sign of \(d_j\) can be estimated with sufficient probability. Central to our analysis is \(\alpha_j\) (11), which is equivalent to \(\beta_j^{new}\) for the unregularized case γ = 0. We rewrite
We rewrite\nit as:\n\nj\n\nj\n\n\u03b1j =\n\n1\nN\n\nN\n\nXi=1\n\nzij(\u03b2) where\n\nzij(\u03b2) = xij(yi \u2212 \u02dcy(j)\n\ni\n\n)\n\n(12)\n\nBecause zij(\u03b2) are given by a \ufb01xed transformation of the iid training points, they themselves are iid.\nAs N \u2192 \u221e, we can appeal to the Central Limit Theorm and model \u03b1j as distributed as a standard\nN V ar(zij) \u2200i. Empirical\nNormal: \u03b1j \u223c N (\u00b5\u03b1j , \u03c3\u03b1j ), where \u00b5\u03b1j = E[zij], \u2200i and \u03c32\njusti\ufb01cation of normality of these quantities is given in section 5.2. So, for any given \u03b1j, we can\nprovide estimates\n\n\u03b1j = 1\n\nE[zij] \u2248 \u02c6zj =\n\n1\n\nN Xi\n\n\u02c6zij\n\nV ar(zij) \u2248 \u03c32\n\n\u02c6zj =\n\n4\n\n1\n\nN \u2212 1 Xi\n\n(\u02c6zij \u2212 \u02c6zj)2\n\n(13)\n\n\f0.7\n\n0.6\n\n0.5\n\n0.4\n\n0.3\n\n0.2\n\n0.1\n\n0\n\n \n\n \n\nTransformed\nOriginal\n\ns\ne\n\nl\ni\nt\n\nn\na\nu\nQ\n\n \nt\n\ni\n\nn\ne\nd\na\nr\nG\n\n\u22122\n\n\u22121\n\n0\n\n1\n\n2\n\n3\n\nQ\u2212Q Plots of Gradient Distributions\n\n0.3\n\n0.2\n\n0.1\n\nN\n=1,000,000\nb\n\nN\n=60,000\nb\n\nN\n=250,000\nb\n\n0\n\n\u22120.1\n\n\u22120.2\n\n\u22120.3\n\n\u22124\n\n\u22122\n2\nNormal Theoretic Quantiles\n\n0\n\ni\n\ni\n\nn\no\ns\nc\ne\nr\nP\n \ne\ng\na\nr\ne\nv\nA\n\n0.79\n\n0.78\n\n0.77\n\n0.76\n\n0.75\n\n0.74\n\n0.73\n0\n0\n\nAP and Time responses to \u03b5 (LR on INRIA dataset)\n\nx 104\n3\n\n2.5\n\n2\n\n1.5\n\n1\n\n0.5\n\n)\ns\nd\nn\no\nc\ne\ns\n(\n \ne\nm\nT\n\ni\n\n0.1\n0.1\n\n0.2\n0.2\n\n\u03b5\n\n0.3\n0.3\n\n0.4\n0.4\n\n0\n0.5\n0.5\n\nFigure 1: (left) A Gaussian distribution and the distribution resulting from applying the transformation S,\n(middle) Q-Q plot\nwith \u03b3 = .1. The interval that is \u201csquashed\u201d is shown by the dash-dotted blue lines.\ndemonstrating the normality of the gradients on the L1-regularized L2-loss SVM, computed at various stages\nof the algorithm (i.e. at different batch-sizes Nb and models \u03b2). 
Straight lines provide evidence that the\nempirical distribution is close to normality. (right ) Plot showing the behavior of our algorithm with respect to\n\u01eb, using logistic regression on the INRIA dataset. \u01eb = 0 corresponds to an algorithm which never updates, and\n\u01eb = 0.5 corresponds to an algorithm which always updates (with no stopping criteria), so for these experiments\n\u01eb was chosen in the range [.01, .49]. Error bars denote a single standard deviation.\n\nj\n\nwhich in turn provide estimates for \u00b5\u03b1j and \u03c3\u03b1j . We next apply the soft threshold function S to \u03b1j to\n, a random variable whose pdf is a Gaussian which has a section of width 2\u03b3 \u201csquashed\u201d\nobtain \u03b2new\nto zero into a single point of probability mass, with the remaining density shifted towards zero by a\nmagnitude \u03b3. This is illustrated in Figure 1. Our criterion for accepting an update is that it moves\ntowards the true solution with high probability. Let \u02c6dj be the realization of the random variable\nj \u2212 \u03b2j, computed from the sample batch of N training points. If \u02c6dj > 0, then we want\ndj = \u03b2new\nP (dj \u2264 0) to be small, and vice versa. Speci\ufb01cally, for \u02c6dj > 0, we want P (dj \u2264 0) < \u01eb, where\n\nP (dj \u2264 0) = P (\u03b2new\n\nj \u2264 \u03b2j) = \uf8f1\uf8f2\n\uf8f3\n\n\u03c3\u03b1j\n\n\u03a6(cid:16) \u03b2j\u2212(\u00b5\u03b1j +\u03b3)\n\u03a6(cid:16) \u03b2j\u2212(\u00b5\u03b1j \u2212\u03b3)\n\n\u03c3\u03b1j\n\n(cid:17) if \u03b2j < 0\n(cid:17) if \u03b2j \u2265 0\n\n(14)\n\nwhere \u03a6(\u00b7) denotes the cdf for the standard Normal. This distribution can be derived from its two\nunderlying Gaussians, one with mean \u00b5\u03b1j + \u03b3 and one with mean \u00b5\u03b1j \u2212 \u03b3. Similarly, one can\nde\ufb01ne an analgous test of P (dj \u2265 0) < \u01eb for \u02c6dj < 0. 
These are the hypothesis test equations for a single coordinate, so this test is performed once for each coordinate at its iteration in the coordinate descent algorithm. If a coordinate update fails its test, then we assume that we do not have enough evidence to perform an update on the coordinate, and do not update. Note that, since we are potentially rejecting many updates, significant computation could be going to "waste," as we are computing updates without using them. Methods to address this follow in section 4.

3.3 Gradient-Based Hypothesis Tests

For general convex loss functions, it is difficult to construct a pdf for \(d_j\) and \(\beta_j^{new}\). Instead, we accept an update \(\beta_j^{new}\) if the sign of the partial derivative \(\frac{\partial f(\beta)}{\partial \beta_j}\) can be estimated with sufficient reliability. Because f(β) may be nondifferentiable, we define \(\partial_j f(\beta)\) to be the set of 1D subgradients, or lower tangent planes, at β along direction j. The minimal (in magnitude) subgradient \(g_j\), associated with the flattest lower tangent, is:

\[ g_j = \begin{cases} \alpha_j - \gamma & \text{if } \beta_j < 0 \\ \alpha_j + \gamma & \text{if } \beta_j > 0 \\ S(\alpha_j, \gamma) & \text{otherwise} \end{cases} \qquad \text{where} \qquad \alpha_j = L'_j(0, \beta) = \frac{1}{N} \sum_{i=1}^{N} z_{ij} \tag{15} \]

where \(z_{ij}(\beta) = -2 y_i x_{ij} b_i(\beta)\) for the squared hinge-loss and \(z_{ij}(\beta) = \frac{-x_{ij}}{1 + e^{y_i \beta^T x_i}}\) for the log-loss. Appealing to the same arguments as in Sec. 3.2, one can show that \(\alpha_j \sim \mathcal{N}(\mu_{\alpha_j}, \sigma_{\alpha_j})\) where \(\mu_{\alpha_j} = E[z_{ij}]\ \forall i\) and \(\sigma^2_{\alpha_j} = \frac{1}{N}\mathrm{Var}(z_{ij})\ \forall i\). Thus the pdf of the subgradient g is a Normal shifted by \(\gamma\,\mathrm{sign}(\beta_j)\) in the case where \(\beta_j \ne 0\), or a Normal transformed by the function \(S(\alpha_j, \gamma)\) in the case \(\beta_j = 0\). To formulate our hypothesis test, we write \(\hat{g}_j\) as the realization of the random variable \(g_j\), computed from the batch of N training points.
We want to take an update only if our update is in the correct direction with high probability: for \(\hat{g}_j > 0\), we want \(P(g_j \le 0) < \epsilon\), where

\[ P(g_j \le 0) = \begin{cases} \Phi\!\left(\dfrac{0 - (\mu_{\alpha_j} - \gamma)}{\sigma_{\alpha_j}}\right) & \text{if } \beta_j \le 0 \\[8pt] \Phi\!\left(\dfrac{0 - (\mu_{\alpha_j} + \gamma)}{\sigma_{\alpha_j}}\right) & \text{if } \beta_j > 0 \end{cases} \tag{16} \]

We can likewise define a test of \(P(g_j \ge 0) < \epsilon\) which we use to accept updates given a negative estimated gradient \(\hat{g}_j < 0\).

Figure 2: Plot comparing various algorithms for the L1-regularized L2-loss SVM on the INRIA dataset (left) and the VOC dataset (middle), and for the L1-regularized logistic regression on INRIA (right), using ε = 0.05. "CD-Full" denotes our method using all applicable heuristic speedups, "CD-Hyp Testing" does not use the shrinking heuristic, while "vanilla CD" simply performs coordinate descent without any speedup methods. "SGD" is stochastic gradient descent with an annealing schedule. Optimization of the hyper-parameters of the annealing schedule (on train data) was not included in the total runtime. Note that our method achieves the optimal precision faster than SGD and also stops learning approximately when overfitting sets in.

4 Additional Speedups

It often occurs that many coordinates will fail their respective hypothesis tests for several consecutive iterations, so predicting these consecutive failures and skipping computations on these coordinates could potentially save computation. We employ a simple heuristic based on a few observations (where for simplified notation we drop the subscript j):

1. If the set of parameters that are updating remains constant between updates, then for a particular coordinate, the change in the gradient from one iteration to the next is roughly constant. This is an empirical observation.

2. When close to the solution, \(\sigma_\alpha\) remains roughly constant.

We employ a heuristic which is a complicated instance of a simple idea: if the value a(0) of a variable of interest is changing at a constant rate r, we can predict its value at time t with a(t) = a(0) + rt. In our case, we wish to predict when the gradient will have moved to a point where the associated hypothesis test will pass.

First, we will consider the unregularized case (γ = 0), wherein g = α.
We wish to detect when the gradient will result in the hypothesis test passing; that is, we want to find the values \(\mu_\alpha \approx \hat{\alpha}\), where \(\hat{\alpha}\) is a realization of the random variable α, such that \(P(g \ge 0) = \epsilon\) or \(P(g \le 0) = \epsilon\). For this purpose, we need to draw the distinction between an update which was taken, and one which was proposed but for which the hypothesis test failed. Let the set of accepted updates be indexed by t, as in \(\hat{g}_t\), and let the set of potential updates, after an accepted update at time t, be indexed by s, as in \(\hat{g}_t(s)\). Thus the algorithm described in the previous section will compute \(\hat{g}_t(1) \ldots \hat{g}_t(s^*)\) until the hypothesis test passes for \(\hat{g}_t(s^*)\); we then set \(\hat{g}_{t+1}(0) = \hat{g}_t(s^*)\), and perform an update to β using \(\hat{g}_{t+1}(0)\). Ideally, we would prefer not to compute \(\hat{g}_t(1) \ldots \hat{g}_t(s^* - 1)\) at all, and instead only compute the gradient when we know the hypothesis test will pass, \(s^*\) iterations after the last accept. Given that we have some scheme for skipping k iterations, we estimate a "velocity" at which \(\hat{g}_t(s) = \hat{\alpha}_t(s)\) changes:

\[ \Delta_\alpha \equiv \frac{\hat{\alpha}_t(s) - \hat{\alpha}_t(s-k-1)}{k+1} \]

If, for instance, \(\Delta_\alpha > 0\), we can compute the value of \(\hat{\alpha}\) at which the hypothesis test will pass, assuming \(\sigma_\alpha\) remains constant, by setting \(P(g \le 0 \mid \mu_\alpha = \alpha_{pass}) = \epsilon\), and subsequently we can approximate the number of iterations to skip next¹:

\[ \alpha_{pass} = -\sigma_\alpha \Phi^{-1}(\epsilon), \qquad k_{skip} \leftarrow \frac{\alpha_{pass} - \hat{\alpha}_t(s)}{\Delta_\alpha} \tag{17} \]

¹In practice we cap \(k_{skip}\) at some maximum number of iterations (say 40).

Figure 3: Comparison of our Lasso algorithm against SGD across various hyper-parameter settings for the exponential annealing schedule. Our algorithm is marked by the horizontal lines, with ε ∈ {0.05, 0.2, 0.4}. Note that all algorithms have very similar precision scores in the interval [0.75–0.76]. For values of λ = {0.8, 0.9, 0.96, 0.99, 0.999}, SGD gives a good score; however, picking \(\eta_0 > 1\) had an adverse effect on the optimization speed. Our method converged faster than SGD with the best annealing schedule.

The regularized case with \(\beta_j > 0\) is equivalent to the unregularized case where g = α + γ, and we solve for the value of α that will allow the test to pass via \(P(g \le 0 \mid \mu_\alpha = \alpha_{pass}) = \epsilon\):

\[ \alpha_{pass} = -\sigma_\alpha \Phi^{-1}(\epsilon) - \gamma \tag{18} \]

Similarly, the case with \(\beta_j \le 0\) is equivalent to the unregularized case where g = α − γ: \(\alpha_{pass} = -\sigma_\alpha \Phi^{-1}(\epsilon) + \gamma\). For the case where \(\Delta_\alpha < 0\), we solve for \(P(g \ge 0 \mid \mu_\alpha = \alpha_{pass}) = \epsilon\). This gives \(\alpha_{pass} = -\sigma_\alpha \Phi^{-1}(1-\epsilon) + \gamma\) if \(\beta_j < 0\) and \(\alpha_{pass} = -\sigma_\alpha \Phi^{-1}(1-\epsilon) - \gamma\) otherwise. A similar heuristic for the Lasso case can also be derived.

4.1 Shrinking Strategy

It is common in SVM algorithms to employ a "shrinking" strategy in which datapoints which do not contribute to the loss are removed from future computations. Specifically, if a data point \((x_i, y_i)\) has the property that \(b_i = 1 - y_i \beta^T x_i < \epsilon_{shrink} < 0\), for some \(\epsilon_{shrink}\), then the data point is removed from the current batch. Data points removed from earlier batches in the optimization are still candidates for future batches. We employ this heuristic in our SVM implementation, and Figure 2 shows the relative performance between including this heuristic and not.

5 Experiments

5.1 Datasets

We provide experimental results for the task of visual object detection, building on recent successful approaches that learn linear scanning-window classifiers defined on Histograms of Oriented Gradients (HOG) descriptors [12, 13]. We train and evaluate a pedestrian detector using the INRIA dataset [12], where (N, p) = (5e6, 1100).
We also train and evaluate a car detector using the 2007 PASCAL VOC dataset [13], where (N, p) = (6e7, 1400). For both datasets, we measure performance using the standard PASCAL evaluation protocol of average precision (with 50% overlap of predicted/ground truth bounding boxes). On such large training sets, one would expect delicately-tuned stochastic online algorithms (such as SGD) to outperform standard batch optimization (such as coordinate descent). We show that our algorithm exhibits the speed of the former with the reliability of the latter.

5.2 Normality Tests

In this section we empirically verify the normality claims on the INRIA dataset. Because the negative examples in this data are comprised of many overlapping windows from images, we may expect this non-iid property to damage any normality properties of our updates. For these experiments, we focus on the gradients of the L1-regularized, L2-loss SVM computed during various stages of the optimization process. Figure 1 shows quantile-quantile plots of the average gradient, computed over different subsamples of the data of fixed size \(N_b\), versus the standard Normal. Experiments for smaller N (≈ 100) and random β give similar curves. We conclude that the presence of straight lines provides strong evidence for our assumption that the distribution of gradients is in fact close to normally distributed.

5.3 Algorithm Comparisons

We compared our algorithm to the stochastic gradient method for L1-regularized log-linear models in [14], adapted for the L1-regularized methods above.
We use the following decay schedule for all curves over time labeled “SGD”: η = η0 / (t0 + t). In addition to this schedule, we also tested against SGD using the regret-minimizing schedule of [6] on the INRIA dataset: η = η0 / √(t0 + t). After spending a significant amount of time hand-optimizing the hyper-parameters η0 and t0, we found that η0 ≈ 1 for both rate schedules, with t0 ≈ N/10 (standard SGD) and t0 ≈ (N/10)^2 (regret-minimizing SGD), worked well on our datasets. We ran all our algorithms – Lasso, logistic regression and SVM – with a value of ε = 0.05 on both the INRIA and VOC datasets.

Figure 2 shows a comparison between our method and stochastic gradient descent on the INRIA and VOC datasets. Our method including the shrinking strategy is faster for the SVM, while methods without a data-shrinking strategy, such as logistic regression, are still competitive (see Figure 2). Comparing our methods to the coordinate descent algorithms upon which they are based, we see that our framework provides a considerable speedup over standard coordinate descent. We achieve this with a method that eventually uses the entire batch of data, so the tricks that enable SGD to converge on an L1-regularized problem are not necessary. In terms of performance, our models match or come close to published state-of-the-art results for linear models [13, 15].

We also performed a comparison against SGD with an exponential decay schedule η = η0 e^{−λt} on the Lasso problem (see Figure 3). Exponential decay schedules are known to work well in practice [14], but do not give the theoretical convergence guarantees of other schedules. For a range of values of η0 and λ, we compare SGD against our algorithm with ε ∈ {0.05, 0.2, 0.4}.
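For reference, the three SGD learning-rate schedules used in these comparisons can be written as plain functions. This is a sketch: the function names and default values are ours, with η0, t0 and λ as in the text.

```python
# The three SGD decay schedules from the comparison, as plain functions.
# Function names and defaults are illustrative, not from the paper.
from math import exp, sqrt

def eta_standard(t, eta0=1.0, t0=1000.0):
    return eta0 / (t0 + t)         # eta = eta0 / (t0 + t)

def eta_regret_minimizing(t, eta0=1.0, t0=1000.0):
    return eta0 / sqrt(t0 + t)     # eta = eta0 / sqrt(t0 + t), cf. [6]

def eta_exponential(t, eta0=1.0, lam=1e-4):
    return eta0 * exp(-lam * t)    # eta = eta0 * exp(-lambda * t), cf. [14]
```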
From these experiments we conclude that changing ε from its standard value 0.05 all the way to 0.4 (recall that ε < 0.5) has very little effect on accuracy and speed. This is in contrast to SGD, which required hyper-parameter tuning to achieve comparable performance.
To further demonstrate the robustness of our method to ε, we performed 5 trials of logistic regression on the INRIA dataset with a wide range of values of ε, with random initializations, shown in Figure 1. All choices of ε give a reasonable average precision, and the algorithm begins to become significantly slower only with ε > 0.3.

6 Conclusions

We have introduced a new framework for optimization problems from a statistical, frequentist point of view. Every phase of the learning process has its own optimal batch size. That is to say, we need only a few data-cases early on in learning but many data-cases close to convergence. In fact, we argue that when we are using all of our data and cannot determine with statistical confidence that our update is in the correct direction, we should stop learning to avoid overfitting. These ideas are absent in the usual frequentist (a.k.a. maximum likelihood) and learning-theory approaches, which formulate learning as the optimization of some loss function. A meaningful smallest length scale based on statistical considerations is present in Bayesian analysis through the notion of a posterior distribution. However, the most common inference technique in that domain, MCMC sampling, does not make use of the fact that less precision is needed during the first phases of learning (a.k.a. “burn-in”), because any accept/reject rule requires all data-cases to be seen.
Hence, our approach can be thought of as a middle ground that borrows from both learning philosophies.
Our approach also leverages the fact that some features are more predictive than others, and may deserve more attention during optimization. By predicting when updates will pass their statistical tests, we can update each feature with approximately the correct frequency.
The proposed algorithms feature a single variable that needs to be set. However, the variable has a clear meaning – the allowed probability that an update moves in the wrong direction. We have used ε = 0.05 in all our experiments to showcase the robustness of the method.
Our method is not limited to L1 methods or linear models; our framework can be used on any algorithm in which we take updates that are simple functions of averages over the data.
Relative to vanilla coordinate descent, our algorithms can handle dense datasets with N ≫ p. Relative to SGD² our method can be thought of as “self-annealing” in the sense that it increases its precision by adaptively increasing the dataset size. The advantages over SGD are therefore that we avoid tuning hyper-parameters of an annealing schedule and that we have an automated stopping criterion.

²Recent benchmarks [16] show that a properly tuned SGD solver is highly competitive for large-scale problems [17].

References

[1] L. Bottou and O. Bousquet. Learning using large datasets. In Mining Massive DataSets for Security, NATO ASI Workshop Series. IOS Press, Amsterdam, 2008.

[2] L. Bottou and O. Bousquet. The tradeoffs of large scale learning. Advances in Neural Information Processing Systems, 20:161–168, 2008.

[3] B. Yu. Embracing statistical challenges in the information technology age. Technometrics, American Statistical Association and the American Society for Quality, 49:237–248, 2007.

[4] N. Cesa-Bianchi, A. Conconi, and C. Gentile.
On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9):2050–2057, 2004.

[5] S. Shalev-Shwartz and N. Srebro. SVM optimization: inverse dependence on training set size. In Proceedings of the 25th International Conference on Machine Learning, pages 928–935. ACM, 2008.

[6] M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Twentieth International Conference on Machine Learning, 2003.

[7] P.L. Bartlett, E. Hazan, and A. Rakhlin. Adaptive online gradient descent. In Advances in Neural Information Processing Systems, 21, 2007.

[8] A. Korattikara, L. Boyles, M. Welling, J. Kim, and H. Park. Statistical optimization of non-negative matrix factorization. In AISTATS, 2011.

[9] J. Friedman, T. Hastie, H. Höfling, and R. Tibshirani. Pathwise coordinate optimization. Annals of Applied Statistics, 1(2):302–332, 2007.

[10] R.E. Fan, K.W. Chang, C.J. Hsieh, X.R. Wang, and C.J. Lin. LIBLINEAR: A library for large linear classification. The Journal of Machine Learning Research, 9:1871–1874, 2008.

[11] N. Le Roux, P.A. Manzagol, and Y. Bengio. Topmoumoute online natural gradient algorithm. In Neural Information Processing Systems (NIPS), 2007.

[12] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 1, page 886, 2005.

[13] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results. http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html.

[14] Yoshimasa Tsuruoka, Jun'ichi Tsujii, and Sophia Ananiadou. Stochastic gradient descent training for L1-regularized log-linear models with cumulative penalty.
In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 477–485, Suntec, Singapore, August 2009. Association for Computational Linguistics.

[15] Navneet Dalal. Finding People in Images and Video. PhD thesis, Institut National Polytechnique de Grenoble / INRIA Grenoble, July 2006.

[16] Pascal large scale learning challenge. http://largescale.ml.tu-berlin.de/workshop/, 2008.

[17] A. Bordes, L. Bottou, and P. Gallinari. SGD-QN: Careful Quasi-Newton Stochastic Gradient Descent. Journal of Machine Learning Research, 10:1737–1754, 2009.