{"title": "Nearest Neighbor based Greedy Coordinate Descent", "book": "Advances in Neural Information Processing Systems", "page_first": 2160, "page_last": 2168, "abstract": "Increasingly, optimization problems in machine learning, especially those arising from high-dimensional statistical estimation, have a large number of variables. Modern statistical estimators developed over the past decade have statistical or sample complexity that depends only weakly on the number of parameters when there is some structure to the problem, such as sparsity. A central question is whether similar advances can be made in their computational complexity as well. In this paper, we propose strategies that indicate that such advances can indeed be made. In particular, we investigate the greedy coordinate descent algorithm, and note that performing the greedy step efficiently weakens the costly dependence on the problem size provided the solution is sparse. We then propose a suite of methods that perform these greedy steps efficiently by a reduction to nearest neighbor search. We also devise a more amenable form of greedy descent for composite non-smooth objectives; as well as several approximate variants of such greedy descent. We develop a practical implementation of our algorithm that combines greedy coordinate descent with locality sensitive hashing. Without tuning the latter data structure, we are not only able to significantly speed up the vanilla greedy method, but also outperform cyclic descent when the problem size becomes large. Our results indicate the effectiveness of our nearest neighbor strategies, and also point to many open questions regarding the development of computational geometric techniques tailored towards first-order optimization methods.", "full_text": "Nearest Neighbor based Greedy Coordinate Descent \n\nInderjit S. 
Dhillon \n\nDepartment of Computer Science \n\nUniversity of Texas at Austin \n\ninderjit@cs.utexas.edu \n\nPradeep Ravikumar \n\nDepartment of Computer Science \n\nUniversity of Texas at Austin \n\npradeepr@cs.utexas.edu \n\nAmbuj Tewari \n\nDepartment of Computer Science \n\nUniversity of Texas at Austin \nambuj@cs.utexas.edu \n\nAbstract \n\nIncreasingly, optimization problems in machine learning, especially those arising \nfrom high-dimensional statistical estimation, have a large number of variables. \nModern statistical estimators developed over the past decade have statistical or \nsample complexity that depends only weakly on the number of parameters when \nthere is some structure to the problem, such as sparsity. A central question is \nwhether similar advances can be made in their computational complexity as well. \nIn this paper, we propose strategies that indicate that such advances can indeed be \nmade. In particular, we investigate the greedy coordinate descent algorithm, and \nnote that performing the greedy step efficiently weakens the costly dependence on \nthe problem size provided the solution is sparse. We then propose a suite of methods that perform these greedy steps efficiently by a reduction to nearest neighbor \nsearch. We also devise a more amenable form of greedy descent for composite \nnon-smooth objectives; as well as several approximate variants of such greedy \ndescent. We develop a practical implementation of our algorithm that combines \ngreedy coordinate descent with locality sensitive hashing. Without tuning the latter data structure, we are not only able to significantly speed up the vanilla greedy \nmethod, but also outperform cyclic descent when the problem size becomes large.
\nOur results indicate the effectiveness of our nearest neighbor strategies, and also \npoint to many open questions regarding the development of computational geometric techniques tailored towards first-order optimization methods. \n\n1 Introduction \nIncreasingly, optimization problems in machine learning are very high-dimensional, with a very large number of variables. This has led to a renewed interest in iterative algorithms that require \nbounded time per iteration. Such iterative methods take various forms such as so-called row-action \nmethods [6], which enforce constraints in the optimization problem sequentially, or first-order methods [4], which only compute the gradient or a coordinate of the gradient per step. But the overall time \ncomplexity of these methods still has a high polynomial dependence on the number of parameters. \nModern statistical estimators developed over the past decade have statistical or sample complexity \nthat depends only weakly on the number of parameters [5, 15, 18]. Can similar advances be made \nin their computational complexity? \n\nTowards this, we investigate one of the simplest classes of first-order methods: coordinate descent, \nwhich only updates a single coordinate of the iterate at every step. The coordinate descent class \nof algorithms has seen a renewed interest after recent papers [8, 10, 19] have shown considerable \nempirical success in application to large problems. Saha and Tewari [13] even show that under \ncertain conditions, the convergence rate of cyclic coordinate descent is at least as fast as that of \ngradient descent. \n\nIn this paper, we focus on high-dimensional optimization problems where the solution is sparse. \nWe were motivated to investigate coordinate descent algorithms by the intuition that they could \nleverage the sparsity structure of the solution by judiciously choosing the coordinate to be updated.
\nIn particular, we show that a greedy selection of the coordinates succeeds in weakening the costly \ndependence on problem size, with the caveat that we must be able to perform the greedy step efficiently. The \nnaive greedy update would however take time that scales at least linearly in the problem dimension, \n$O(p)$, since it has to compute the coordinate with the maximum gradient magnitude. We thus come to the other \nmain question of this paper: Can the greedy steps in a greedy coordinate scheme be performed \nefficiently? Surprisingly, we are able to answer in the affirmative, and we show this by a reduction \nto nearest neighbor search. This allows us to leverage the significant amount of recent research \non sublinear methods for nearest neighbor search, provided it suffices to have approximate nearest \nneighbors. The upshot of our results is a suite of methods that depend weakly on the problem size \nor number of parameters. We also investigate several notions of approximate greedy coordinate \ndescent for which we are able to derive similar rates. For the composite objective case, where the \nobjective is the sum of a smooth component and a separable non-smooth component, we propose \nand analyze a "look-ahead" variant of greedy coordinate descent. \n\nThe development in this paper thus raises a new line of research on connections between computational geometry and first-order optimization methods. For instance, given our results, it would be of \ninterest to develop approximate nearest neighbor methods tuned to greedy coordinate descent. As an \ninstance of such a connection, we show that if the covariates underlying the optimization objective \nsatisfy a mutual incoherence condition, then a very simple nearest neighbor data structure suffices to \nyield a good approximation.
Finally, we provide simulations that not only show that greedy coordinate descent with approximate nearest neighbor search performs overwhelmingly better than vanilla \ngreedy coordinate descent, but also that it starts outperforming cyclic descent when the problem size \nincreases: the larger the number of variables, the greater the relative improvement in performance. \nThe results of this paper naturally lead to several open problems: can effective computational geometric data structures be tailored towards greedy coordinate descent? Can these be adapted to \n(a) other first-order methods, perhaps based on sampling, and (b) different regularized variants that \nuncover structured sparsity? We hope this paper fosters further research and cross-fertilization of \nideas in computational geometry and optimization. \n2 Setup and Notation \nWe start our treatment with differentiable objective functions, and then extend this to encompass \nnon-differentiable functions which arise as the sum of a smooth component and a separable non-smooth component. Let $\mathcal{L} : \mathbb{R}^p \to \mathbb{R}$ be a convex differentiable function. We do not assume that \nthe function is strongly convex: indeed most optimizations arising out of high-dimensional machine \nlearning problems are convex but typically not strongly so. Our analysis requires that the function \nsatisfy the following coordinate-wise Lipschitz condition: \nAssumption A1. The loss function $\mathcal{L}$ satisfies \n\n$\|\nabla \mathcal{L}(w) - \nabla \mathcal{L}(v)\|_\infty \le \kappa_1 \cdot \|w - v\|_1$, for some $\kappa_1 > 0$. \n\nWe note that this condition is weaker than the standard Lipschitz condition on the gradient. In particular, we say that $\mathcal{L}$ has $\kappa_2$-Lipschitz continuous gradient w.r.t. $\|\cdot\|_2$ when $\|\nabla \mathcal{L}(w) - \nabla \mathcal{L}(v)\|_2 \le \kappa_2 \cdot \|w - v\|_2$. Note that $\kappa_1 \le \kappa_2$; indeed $\kappa_1$ could be up to $p$ times smaller than $\kappa_2$. E.g.
when \nC(w) = 1/2w T Aw with a positive setui-definite matrix A , we have \"1 = max; A;,;, the maximum \nentry on the diagonal, while \"2 = max; >';(A), the maxium eigenvalue of A. It is thus possible for \n\"2 to be much larger than \",: for instance \"2 = P'\" when A is the all I's matrix. \nWe are interested in the general optimization problem, \nmin C(w). \nwE\"\" \n\n(I) \n\nWe will focus on the case where the solution is bounded and sparse. We thus assume: \nA \u2022\u2022 omptionAl. The solution w' of(J) satisfies: Ilw'll~ ::; B for some constant B < 00 indepen(cid:173)\ndent ofp, and that Ilw'lIo = 8, i.e., solution is 8-sparse. \n\n2.t Coordinate Descent \nCoordinate descent solves (I) iteratively by optimizing over a single coordinate while holding others \nfixed. lYPically, the choice of the coordinate to be updated is cyclic. One caveat with this scheme \n\n2 \n\n\fhowever is that it could be expensive to compute the one-dimensional optimum for general functions \n\u00a3,. Moreover when \u00a3, is not smooth, such coordinatewise descent is not guaranteed to converge to \nthe global optimum in general, unless the non-differentiable component is separable [16]. A line \nof recent work [16, 17, 14] has thus focused on a \"gradient descent\" version of coordinate descent, \nthat iteratively uses a local quadratic upper bound rY of the function C. For the case where the \noptimization function is the sum of a smooth function aod the i l regularizer, this variant is also \nca\\led Iterative Soft Thresholding [7]. A template for such coordinate gradient descent is the set of \n;, VjC(w')ej. Friedman et aI. [8], Genkin et aI. [10], Wu and Laoge [19] \niterates: w' = W'-I -\naod others have shown considerable empirical success in applying these to large problems. 
\n2.2 Greedy Coordinate Descent \nIn this section, we focus on a simple deterministic variant of coordinate descent that picks the coordinate attaining the coordinatewise maximum of the gradient vector: \nAlgorithm 1 Greedy Coordinate Gradient Descent \n\nInitialize: Set the initial value of $w^0$. \nfor t = 1, ... do \n\n$j = \arg\max_i |\nabla_i \mathcal{L}(w^{t-1})|$. \n$w^t = w^{t-1} - \frac{1}{\kappa_1} \nabla_j \mathcal{L}(w^{t-1}) e_j$. \n\nend for \n\nLemma 1. Suppose the convex differentiable function $\mathcal{L}$ satisfies Assumptions A1 and A2. Then \nthe iterates of Algorithm 1 satisfy: \n\n$\mathcal{L}(w^t) - \mathcal{L}(w^*) \le \frac{\kappa_1 \|w^0 - w^*\|_1^2}{t}$. \n\nLetting $c(p)$ denote the time required to solve each greedy step $\max_i |\nabla_i \mathcal{L}(w^t)|$, the greedy version \nof coordinate descent achieves the rate $\mathcal{L}(w^T) - \mathcal{L}(w^*) = O(\kappa_1 c(p)/T)$ at time $T$. Note that the \ndependence on the problem size $p$ is restricted to the greedy step: if we could solve this maximization \nmore efficiently, then we would have a powerful "active-set" method. While brute-force maximization for \nthe greedy step would take $O(p)$ time, if it can be done in $O(1)$ time, then at time $T$ the iterate $w$ \nsatisfies $\mathcal{L}(w) - \mathcal{L}(w^*) = O(\kappa_1/T)$, which would be independent of the problem size. \n3 Nearest Neighbor and Fast Greedy \nIn this section, we examine whether the greedy step can be performed in sublinear time. We focus in \nparticular on optimization problems arising from statistical learning problems where the optimization objective can be written as \n\n$\mathcal{L}(w) = \sum_{i=1}^n \ell(w^T x^i, y^i)$, \n\n(2) \n\nfor some loss function $\ell : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ and a set of observations $\{(x^i, y^i)\}_{i=1}^n$, with $x^i \in \mathbb{R}^p$, $y^i \in \mathbb{R}$. \nNote that such an optimization objective arises in most statistical learning problems. For instance, \nconsider linear regression, with response $y = \langle w, x \rangle + \epsilon$, where $\epsilon \sim N(0, 1)$. Then given observations $\{(x^i, y^i)\}_{i=1}^n$, the maximum likelihood problem has the form of (2), with $\ell(u, v) = (u - v)^2$.
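Algorithm 1 can be sketched in a few lines of NumPy for the squared loss of the regression example; this is our minimal sketch (the function name and interface are ours, not the authors'), and it computes the full gradient by brute force at each step, which is exactly the $O(p)$ cost the nearest neighbor reduction of Section 3 is designed to remove:

```python
import numpy as np

def greedy_cd(X, y, T):
    """Greedy coordinate gradient descent (Algorithm 1) for the squared
    loss L(w) = sum_i (w^T x^i - y^i)^2.  A sketch, not the authors' code."""
    n, p = X.shape
    kappa1 = 2.0 * np.max(np.sum(X ** 2, axis=0))  # coordinate-wise Lipschitz const.
    w = np.zeros(p)
    for _ in range(T):
        grad = 2.0 * X.T @ (X @ w - y)   # full gradient: O(np) brute force here
        j = np.argmax(np.abs(grad))      # greedy step: coordinate of largest |grad|
        w[j] -= grad[j] / kappa1         # w^t = w^{t-1} - (1/kappa1) grad_j e_j
    return w
```

For example, with `X = np.eye(3)` each greedy step exactly minimizes one coordinate, so `greedy_cd(X, y, T=3)` recovers `y` after three iterations.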
\nLetting $g(u, v) = \nabla_u \ell(u, v)$ denote the gradient of the sample loss with respect to its first argument, and $r^i(w) = g(w^T x^i, y^i)$, the gradient of the objective (2) may be written as $\nabla_j \mathcal{L}(w) = \sum_{i=1}^n x_j^i \, r^i(w) = \langle x_j, r(w) \rangle$. It then follows that the greedy coordinate descent step in Algorithm 1 \nreduces to the following simple problem: \n\n$\max_j |\langle x_j, r(w) \rangle|$. \n\n(3) \n\nWe can now see why the greedy step (3) can be performed efficiently: it can be cast as a nearness \nproblem. Indeed, assume that the data is standardized so that $\|x_j\| = 1$ for $j = 1, \ldots, p$. Let \n$\bar{X} = \{x_1, \ldots, x_p, -x_1, \ldots, -x_p\}$ include the negated data vectors. Then, it can be seen that \n\n$\arg\max_{j \in [p]} |\langle x_j, r \rangle| \equiv \arg\min_{j \in [2p]} \|\bar{x}_j - r\|_2^2$. \n\n(4) \n\nThus, the greedy step amounts to a nearest neighbor problem of computing the nearest point to $r$ in \nthe set $\{\bar{x}_j\}_{j=1}^{2p}$. While this would take $O(pn)$ time via brute force, the hope is to leverage the state of \nthe art in nearest neighbor search [11] to perform this greedy selection in sublinear time. Regarding \nthe time taken to compute the gradient $r(w)$, note that after any coordinate descent update, we can \nupdate each $r^i$ in $O(1)$ time if we cache the values $\{\langle w, x^i \rangle\}$, so that $r$ can be updated in $O(n)$ time. \nThe reduction to nearest neighbor search however comes with a caveat: nearest neighbor search variants that run in sublinear time only compute approximate nearest neighbors. This in turn amounts \nto performing the greedy step approximately. In the next few subsections, we investigate the consequences of such approximations. \n\n3.1 Multiplicative Greedy \nWe first consider a variant where the greedy step is performed under a multiplicative approximation, \nwhere we choose a coordinate $j_t$ such that, for some $c \in (0, 1]$, \n\n$|[\nabla \mathcal{L}(w^t)]_{j_t}| \ge c \cdot \|\nabla \mathcal{L}(w^t)\|_\infty$.
\n\n(5) \n\nAs the following lemma shows, the approximate greedy steps have little qualitative effect (proof in \nthe Supplementary Material). \n\nLemma 2. The greedy coordinate descent iterates, with the greedy step computed as in (5), satisfy: \n\n$\mathcal{L}(w^t) - \mathcal{L}(w^*) \le \frac{1}{c} \cdot \frac{\kappa_1 \|w^0 - w^*\|_1^2}{t}$. \n\nThe price for the approximate greedy updates is thus just a constant-factor $1/c \ge 1$ reduction in the \nconvergence rate. \n\nNote that the equivalence of (4) need not hold under multiplicative approximations. That is, approximate nearest neighbor techniques that obtain a nearest neighbor up to a multiplicative factor do not \nin turn guarantee a multiplicative approximation for the inner product in the greedy step. As the next \nlemma shows, this still achieves the required qualitative rate. \n\nLemma 3. Suppose the greedy step is performed as in (5) with a multiplicative approximation factor \nof $(1 + \epsilon_{nn})$ (due to approximate nearest neighbor search, for instance). Then, at any iteration $t$, the \ngreedy coordinate descent iterates satisfy either of the following two conditions, for any $\epsilon > 0$: \n\n(a) $\nabla \mathcal{L}(w^t)$ is small (i.e. the iterate is near-stationary): $\|\nabla \mathcal{L}(w^t)\|_\infty \le \left(\frac{\epsilon_{nn}}{1+\epsilon_{nn}}\right) \|r(w^t)\|_2$, or \n(b) $\mathcal{L}(w^t) - \mathcal{L}(w^*) \le \frac{1+\epsilon_{nn}}{\epsilon_{nn}(1/\epsilon)+1} \cdot \frac{\kappa_1 \|w^0 - w^*\|_1^2}{t}$. \n\n3.2 Additive Greedy \nAnother natural variant is the following additive approximate greedy coordinate descent, where we \nchoose the coordinate $i_t$ such that \n\n$|\nabla_{i_t} \mathcal{L}(w^t)| \ge \|\nabla \mathcal{L}(w^t)\|_\infty - \epsilon_{add}$, \n\n(6) \n\nfor some $\epsilon_{add} > 0$. As the lemma below shows, the approximate greedy steps have little qualitative effect. \nLemma 4. The greedy coordinate descent iterates, with the greedy step computed as in (6), satisfy: \n\n$\mathcal{L}(w^t) - \mathcal{L}(w^*) \le \frac{\kappa_1 \|w^0 - w^*\|_1^2}{t} + \epsilon_{add}$. \n\nNote that we need to obtain an additive approximation in the greedy step only up to the order of the \nfinal precision desired of the optimization problem.
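The reduction (4) can be illustrated with a brute-force stand-in for the nearest neighbor index; this is our sketch (names and the linear scan are ours), where a sublinear structure such as LSH would replace the scan:

```python
import numpy as np

def greedy_coord_via_nn(Xbar, r):
    """Select the greedy coordinate via a nearest neighbor query, per (4).
    Xbar stacks the 2p unit-norm signed feature vectors {x_1..x_p, -x_1..-x_p}
    as rows; the linear scan below is a stand-in for a sublinear NN index."""
    jbar = np.argmin(np.linalg.norm(Xbar - r, axis=1))  # nearest signed vector to r
    return jbar % (Xbar.shape[0] // 2)                  # fold +/- x_j back to j in [p]

# Sanity check of the equivalence argmax_j |<x_j, r>|  ==  NN of r in {+/- x_j},
# which holds because ||x_j - r||^2 = 2 - 2<x_j, r> when ||x_j|| = 1.
rng = np.random.default_rng(0)
n, p = 20, 5
X = rng.standard_normal((n, p))
X /= np.linalg.norm(X, axis=0)        # standardize: unit-norm columns ||x_j|| = 1
r = rng.standard_normal(n)            # plays the role of the residual r(w)
Xbar = np.vstack([X.T, -X.T])         # the 2p signed feature vectors, one per row
assert greedy_coord_via_nn(Xbar, r) == np.argmax(np.abs(X.T @ r))
```

The residual bookkeeping noted above pairs naturally with this: after a single-coordinate update of $w_j$ by $\delta$, each cached inner product $\langle w, x^i \rangle$ changes by $\delta x_j^i$, so $r$ can be refreshed in $O(n)$ time without recomputing $Xw$.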
In particular, for statistical estimation problems \nthe desired optimization accuracy need not be lower than the statistical precision, which is typically \nof the order of $s \log(p)/\sqrt{n}$. Indeed, given the connections elucidated above to greedy coordinate \ndescent, it is an interesting future problem to develop approximate nearest neighbor methods with \nadditive approximations. \n\n4 Tailored Nearest Neighbor Data Structures \nIn this section, we show that one could develop approximate nearest neighbor methods tailored to \nthe statistical estimation setting. \n\n4.1 Quadtree under Mutual Incoherence \nWe will show that just a vanilla quadtree yields a good approximation when the covariates satisfy \na technical statistical condition of mutual incoherence. A quadtree is a tree data structure that \npartitions the space. Each internal node $u$ in the quadtree has a representative point, denoted by \nrep($u$), and a list of children nodes, denoted by children($u$), which partition the space under $u$. For \nfurther details, we refer to Har-Peled [11]. The spread