{"title": "An Homotopy Algorithm for the Lasso with Online Observations", "book": "Advances in Neural Information Processing Systems", "page_first": 489, "page_last": 496, "abstract": "It has been shown that the problem of $\\ell_1$-penalized least-square regression commonly referred to as the Lasso or Basis Pursuit DeNoising leads to solutions that are sparse and therefore achieves model selection. We propose in this paper an algorithm to solve the Lasso with online observations. We introduce an optimization problem that allows us to compute an homotopy from the current solution to the solution after observing a new data point. We compare our method to Lars and present an application to compressed sensing with sequential observations. Our approach can also be easily extended to compute an homotopy from the current solution to the solution after removing a data point, which leads to an efficient algorithm for leave-one-out cross-validation.", "full_text": "An Homotopy Algorithm for the Lasso with Online\n\nObservations\n\nRedwood Center for Theoretical Neuroscience\n\nPierre J. Garrigues\nDepartment of EECS\n\nUniversity of California\n\nBerkeley, CA 94720\n\ngarrigue@eecs.berkeley.edu\n\nLaurent El Ghaoui\nDepartment of EECS\nUniversity of California\n\nBerkeley, CA 94720\n\nelghaoui@eecs.berkeley.edu\n\nAbstract\n\nIt has been shown that the problem of \u20181-penalized least-square regression com-\nmonly referred to as the Lasso or Basis Pursuit DeNoising leads to solutions that\nare sparse and therefore achieves model selection. We propose in this paper Re-\ncLasso, an algorithm to solve the Lasso with online (sequential) observations. We\nintroduce an optimization problem that allows us to compute an homotopy from\nthe current solution to the solution after observing a new data point. We com-\npare our method to Lars and Coordinate Descent, and present an application to\ncompressive sensing with sequential observations. 
Our approach can easily be\nextended to compute an homotopy from the current solution to the solution that\ncorresponds to removing a data point, which leads to an ef\ufb01cient algorithm for\nleave-one-out cross-validation. We also propose an algorithm to automatically\nupdate the regularization parameter after observing a new data point.\n\n1 Introduction\n\nRegularization using the \u20181-norm has attracted a lot of interest in the statistics [1], signal processing\n[2], and machine learning communities. The \u20181 penalty indeed leads to sparse solutions, which is\na desirable property to achieve model selection, data compression, or for obtaining interpretable\nresults. In this paper, we focus on the problem of \u20181-penalized least-square regression commonly\nreferred to as the Lasso [1]. We are given a set of training examples or observations (yi, xi) \u2208\nR \u00d7 Rm, i = 1 . . . n. We wish to \ufb01t a linear model to predict the response yi as a function of xi\nand a feature vector \u03b8 \u2208 Rm, yi = xT\ni \u03b8 + \u03bdi, where \u03bdi represents the noise in the observation. The\nLasso optimization problem is given by\n\ni \u03b8 \u2212 yi)2 + \u00b5nk\u03b8k1,\n(xT\n\n(1)\n\nnX\n\ni=1\n\nmin\n\n\u03b8\n\n1\n2\n\nwhere \u00b5n is a regularization parameter. The solution of (1) is typically sparse, i.e. the solution \u03b8 has\nfew entries that are non-zero, and therefore identi\ufb01es which dimensions in xi are useful to predict\nyi.\nThe \u20181-regularized least-square problem can be formulated as a convex quadratic problem (QP) with\nlinear equality constraints. The equivalent QP can be solved using standard interior-point methods\n(IPM) [3] which can handle medium-sized problems. A specialized IPM for large-scale problems\nwas recently introduced in [4]. Homotopy methods have also been applied to the Lasso to compute\nthe full regularization path when \u03bb varies [5] [6][7]. 
They are particularly ef\ufb01cient when the solution\nis very sparse [8]. Other methods to solve (1) include iterative thresholding algorithms [9][10][11],\nfeature-sign search [12], bound optimization methods [13] and gradient projection algorithms [14].\n\n1\n\n\fWe propose an algorithm to compute the solution of the Lasso when the training examples\n(yi, xi)i=1...N are obtained sequentially. Let \u03b8(n) be the solution of the Lasso after observing n\ntraining examples and \u03b8(n+1) the solution after observing a new data point (yn+1, xn+1) \u2208 R\u00d7Rm.\nWe introduce an optimization problem that allows us to compute an homotopy from \u03b8(n) to \u03b8(n+1).\nHence we use the previously computed solution as a \u201cwarm-start\u201d, which makes our method partic-\nularly ef\ufb01cient when the supports of \u03b8(n) and \u03b8(n+1) are close.\nIn Section 2 we review the optimality conditions of the Lasso, which we use in Section 3 to derive\nour algorithm. We test in Section 4 our algorithm numerically, and show applications to compres-\nsive sensing with sequential observations and leave-one-out cross-validation. We also propose an\nalgorithm to automatically select the regularization parameter each time we observe a new data\npoint.\n\n2 Optimality conditions for the Lasso\n\nThe objective function in (1) is convex and non-smooth since the \u20181 norm is not differentiable when\n\u03b8i = 0 for some i. Hence there is a global minimum at \u03b8 if and only if the subdifferential of the\nobjective function at \u03b8 contains the 0-vector. The subdifferential of the \u20181-norm at \u03b8 is the following\nset\n\n(cid:26)\n\n\u2202k\u03b8k1 =\n\nv \u2208 Rm :\n\n(cid:26)vi = sgn(\u03b8i) if |\u03b8i| > 0\n\nvi \u2208 [\u22121, 1] if \u03b8i = 0\n\n(cid:27)\n\n.\n\nLet X \u2208 Rn\u00d7m be the matrix whose ith row is equal to xT\nconditions for the Lasso are given by\n\ni , and y = (y1, . . . , yn)T . 
The optimality\n\nX T (X\u03b8 \u2212 y) + \u00b5nv = 0, v \u2208 \u2202k\u03b8k1.\n\nWe de\ufb01ne as the active set the indices of the elements of \u03b8 that are non-zero. To simplify notations we\nassume that the active set appears \ufb01rst, i.e. \u03b8T = (\u03b8T\n2 ), where v1i = sgn(\u03b81i)\nfor all i, and \u22121 \u2264 v2j \u2264 1 for all j. Let X = (X1 X2) be the partitioning of X according to the\nactive set. If the solution is unique it can be shown that X T\n1 X1 is invertible, and we can rewrite the\noptimality conditions as\n\n1 , 0T ) and vT = (vT\n\n1 , vT\n\n(cid:26)\u03b81 = (X T\n\n1 X1)\u22121(X T\n\n1 y \u2212 \u00b5nv1)\n\n\u2212\u00b5nv2 = X T\n\n2 (X1\u03b81 \u2212 y)\n\n.\n\nNote that if we know the active set and the signs of the coef\ufb01cients of the solution, then we can\ncompute it in closed form.\n\n3 Proposed homotopy algorithm\n\n3.1 Outline of the algorithm\n\nSuppose we have computed the solution \u03b8(n) to the Lasso with n observation and that we are given\nan additional observation (yn+1, xn+1) \u2208 R \u00d7 Rm. Our goal is to compute the solution \u03b8(n+1) of\nthe augmented problem. We introduce the following optimization problem\n\n\u03b8(t, \u00b5) = arg min\n\n\u03b8\n\n1\n2\n\n(cid:19)\n\n(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)\n\n(cid:18) X\n\ntxT\n\nn+1\n\n(cid:18) y\n\ntyn+1\n\n\u03b8 \u2212\n\n(cid:19)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)2\n\n2\n\n+ \u00b5k\u03b8k1.\n\n(2)\n\nWe have \u03b8(n) = \u03b8(0, \u00b5n) and \u03b8(n+1) = \u03b8(1, \u00b5n+1). We propose an algorithm that computes a path\nfrom \u03b8(n) to \u03b8(n+1) in two steps:\n\nStep 1 Vary the regularization parameter from \u00b5n to \u00b5n+1 with t = 0. This amounts to comput-\ning the regularization path between \u00b5n and \u00b5n+1 as done in Lars. The solution path is\npiecewise linear and we do not review it in this paper (see [15][7][5]).\n\nStep 2 Vary the parameter t from 0 to 1 with \u00b5 = \u00b5n+1. 
We show in Section 3.2 how to compute this path.

3.2 Algorithm derivation

We show in this Section that θ(t, µ) is a piecewise smooth function of t. To make notations lighter we write θ(t) := θ(t, µ). We saw in Section 2 that the solution to the Lasso can be easily computed once the active set and signs of the coefficients are known. This information is available at t = 0, and we show that the active set and signs will remain the same for t in an interval [0, t*) where the solution θ(t) is smooth. We call a point where the active set changes a "transition point" and show how to compute it analytically. At t* we update the active set and signs, which will remain valid until t reaches the next transition point. This process is iterated until we know the active set and signs of the solution at t = 1, and therefore can compute the desired solution θ(n+1).

We suppose as in Section 2 and without loss of generality that the solution at t = 0 is such that θ(0)^T = (θ1^T, 0^T) and v^T = (v1^T, v2^T) ∈ ∂||θ(0)||1 satisfy the optimality conditions.

Lemma 1. Suppose θ1i ≠ 0 for all i and |v2j| < 1 for all j. There exists t* > 0 such that for all t ∈ [0, t*), the solution of (2) has the same support and the same sign as θ(0).

PROOF. The optimality conditions of (2) are given by

X^T(Xθ − y) + t² xn+1 (xn+1^T θ − yn+1) + µw = 0,   (3)

where w ∈ ∂||θ||1. We show that there exists a solution θ(t)^T = (θ1(t)^T, 0^T) and w(t)^T = (v1^T, w2(t)^T) ∈ ∂||θ(t)||1 satisfying the optimality conditions for t sufficiently small. We partition xn+1^T = (xn+1,1^T, xn+1,2^T) according to the active set. We rewrite the optimality conditions as

X1^T(X1 θ1(t) − y) + t² xn+1,1 (xn+1,1^T θ1(t) − yn+1) + µ v1 = 0,
X2^T(X1 θ1(t) − y) + t² xn+1,2 (xn+1,1^T θ1(t) − yn+1) + µ w2(t) = 0.

Solving for θ1(t) using the first equation gives

θ1(t) = (X1^T X1 + t² xn+1,1 xn+1,1^T)⁻¹ (X1^T y + t² yn+1 xn+1,1 − µ v1).   (4)

We can see that θ1(t) is a continuous function of t. Since θ1(0) = θ1 and the elements of θ1 are all non-zero, there exists t*1 such that for t < t*1, all elements of θ1(t) do not change signs. We also have

−µ w2(t) = X2^T(X1 θ1(t) − y) + t² xn+1,2 (xn+1,1^T θ1(t) − yn+1).   (5)

Similarly w2(t) is a continuous function of t, and since w2(0) = v2, there exists t*2 such that for t < t*2 all elements of w2(t) are strictly smaller than 1 in absolute value. By taking t* = min(t*1, t*2) we obtain the desired result.

The solution θ(t) will therefore be smooth until t reaches a transition point where either a component of θ1(t) becomes zero, or one of the components of w2(t) reaches one in absolute value. We now show how to compute the value of the transition point.

Let X̃ = (X; xn+1^T) and ỹ = (y; yn+1). We partition X̃ = (X̃1 X̃2) according to the active set. We use the Sherman-Morrison formula and rewrite (4) as

θ1(t) = θ̃1 − ((t² − 1) ē / (1 + α(t² − 1))) u,

where θ̃1 = (X̃1^T X̃1)⁻¹(X̃1^T ỹ − µ v1), ē = xn+1,1^T θ̃1 − yn+1, α = xn+1,1^T (X̃1^T X̃1)⁻¹ xn+1,1 and u = (X̃1^T X̃1)⁻¹ xn+1,1. Let t1i be the value of t such that θ1i(t) = 0. We have

t1i = (1 + (ē ui / θ̃1i − α)⁻¹)^(1/2).

We now examine the case where a component of w2(t) reaches one in absolute value. We first notice that

X̃1 θ1(t) − ỹ = ẽ − ((t² − 1) ē / (1 + α(t² − 1))) X̃1 u,
xn+1,1^T θ1(t) − yn+1 = ē / (1 + α(t² − 1)),

where ẽ = X̃1 θ̃1 − ỹ. We can rewrite (5) as

−µ w2(t) = X̃2^T ẽ + (ē(t² − 1) / (1 + α(t² − 1))) (xn+1,2 − X̃2^T X̃1 u).

Let cj be the jth column of X̃2, and x(j) the jth element of xn+1,2. The jth component of w2(t) will reach one in absolute value as soon as

| cj^T ẽ + (ē(t² − 1) / (1 + α(t² − 1))) (x(j) − cj^T X̃1 u) | = µ.

Let t+2j (resp. t−2j) be the value such that w2j(t) = 1 (resp. w2j(t) = −1). We have

t+2j = (1 + (ē(x(j) − cj^T X̃1 u) / (−µ − cj^T ẽ) − α)⁻¹)^(1/2),
t−2j = (1 + (ē(x(j) − cj^T X̃1 u) / (µ − cj^T ẽ) − α)⁻¹)^(1/2).

Hence the transition point will be equal to t0 = min{mini t1i, minj t+2j, minj t−2j}, where we restrict ourselves to the real solutions that lie between 0 and 1. 
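The transition-point formulas above are easy to mis-transcribe, so here is a small numerical sketch that evaluates the candidates t1i, t+2j and t−2j and keeps the smallest one in (0, 1]. This is our own illustration, not the authors' code; all function and argument names are ours.

```python
import numpy as np

def candidate_t(ratio, alpha):
    """Evaluate t = (1 + (ratio - alpha)^{-1})^{1/2}; return np.inf when t is
    not real or falls outside (0, 1]. `ratio` plays the role of e*u_i/theta_1i
    (Case 1) or e*(x(j) - c_j^T X1 u)/(-/+mu - c_j^T e~) (Case 2)."""
    denom = ratio - alpha
    if abs(denom) < 1e-15:
        return np.inf
    arg = 1.0 + 1.0 / denom
    if arg <= 0:
        return np.inf
    t = np.sqrt(arg)
    return t if 0 < t <= 1 else np.inf

def next_transition(theta1, u, ebar, alpha, c_e, cross, mu):
    """Smallest transition point in (0, 1] (np.inf if none).
    theta1, u, ebar, alpha: as in the Sherman-Morrison rewrite of Eq. (4);
    c_e[j] = c_j^T e~ and cross[j] = x(j) - c_j^T X~_1 u for each inactive j."""
    # Case 1: an active coefficient theta_1i(t) hits zero
    cands = [candidate_t(ebar * ui / th, alpha) for th, ui in zip(theta1, u)]
    # Case 2: an inactive subgradient w_2j(t) hits +1 or -1
    for ce, q in zip(c_e, cross):
        for s in (1.0, -1.0):
            den = -s * mu - ce
            if den != 0.0:
                cands.append(candidate_t(ebar * q / den, alpha))
    return min(cands) if cands else np.inf
```

A convention of this sketch: candidates outside (0, 1] are mapped to np.inf, so a return value of np.inf means the active set no longer changes before t = 1.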
We now have the necessary ingredients to derive the proposed algorithm.

Algorithm 1 RecLasso: homotopy algorithm for online Lasso
1: Compute the path from θ(n) = θ(0, µn) to θ(0, µn+1).
2: Initialize the active set to the non-zero coefficients of θ(0, µn+1) and let v = sign(θ(0, µn+1)). Let v1 and xn+1,1 be the subvectors of v and xn+1 corresponding to the active set, and X̃1 the submatrix of X̃ whose columns correspond to the active set. Initialize θ̃1 = (X̃1^T X̃1)⁻¹(X̃1^T ỹ − µ v1). Initialize the transition point t0 = 0.
3: Compute the next transition point t0. If it is smaller than the previous transition point or greater than 1, go to Step 5.
   Case 1 The component of θ1(t0) corresponding to the ith coefficient goes to zero: remove i from the active set, and update v by setting vi = 0.
   Case 2 The component of w2(t0) corresponding to the jth coefficient reaches one in absolute value: add j to the active set. If the component reaches 1 (resp. −1), then set vj = 1 (resp. vj = −1).
4: Update v1, X̃1 and xn+1,1 according to the updated active set. Update θ̃1 = (X̃1^T X̃1)⁻¹(X̃1^T ỹ − µ v1) (rank 1 update). Go to Step 3.
5: Compute the final value at t = 1, where the values of θ(n+1) on the active set are given by θ̃1.

The initialization amounts to computing the solution of the Lasso when we have only one data point (y, x) ∈ R × Rm. In this case, the active set has at most one element. Let i0 = arg maxi |x(i)| and v = sign(y x(i0)). We have

θ(1) = (1 / x(i0)²)(y x(i0) − µ1 v) e_i0 if |y x(i0)| > µ1, and θ(1) = 0 otherwise.

We illustrate our algorithm by showing the solution path when the regularization parameter and t are successively varied with a simple numerical example in Figure 1.

Figure 1: Solution path for both steps of our algorithm. We set n = 5, m = 5, µn = .1n. All the values of X, y, xn+1 and yn+1 are drawn at random. On the left is the homotopy when the regularization parameter goes from µn = .5 to µn+1 = .6. There is one transition point as θ2 becomes inactive. On the right is the piecewise smooth path of θ(t) when t goes from 0 to 1. We can see that θ3 becomes zero, θ2 goes from being 0 to being positive, whereas θ1, θ4 and θ5 remain active with their signs unchanged. The three transition points are shown as black dots.

3.3 Complexity

The complexity of our algorithm is dominated by the inversion of the matrix X̃1^T X̃1 at each transition point. The size of this matrix is bounded by q = min(n, m). As the update to this matrix after a transition point is rank 1, the cost of computing the inverse is O(q²). Let k be the total number of transition points after varying the regularization parameter from µn to µn+1 and t from 0 to 1. The complexity of our algorithm is thus O(kq²). In practice, the size of the active set d is much lower than q, and if it remains ∼ d throughout the homotopy, the complexity is O(kd²). It is instructive to compare it with the complexity of recursive least-squares, which corresponds to µn = 0 for all n and n > m. For this problem the solution typically has m non-zero elements, and therefore the cost of updating the solution after a new observation is O(m²). 
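The per-transition cost argument above rests on the Sherman-Morrison identity, which refreshes an inverse after a rank-1 change in O(q²) instead of recomputing it in O(q³). A minimal NumPy sketch (the function name is ours):

```python
import numpy as np

def sherman_morrison(A_inv, u, v):
    """Return (A + u v^T)^{-1} given A^{-1}, via
    (A + u v^T)^{-1} = A^{-1} - (A^{-1} u)(v^T A^{-1}) / (1 + v^T A^{-1} u),
    valid whenever 1 + v^T A^{-1} u != 0. Cost: O(q^2) for q x q A."""
    Au = A_inv @ u          # A^{-1} u, shape (q,)
    vA = v @ A_inv          # v^T A^{-1}, shape (q,)
    return A_inv - np.outer(Au, vA) / (1.0 + v @ Au)
```

In the setting of Eq. (4), the rank-1 term is the t² xn+1,1 xn+1,1^T contribution of the new observation; active-set changes additionally grow or shrink the matrix by one row and column.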
Hence if the solution is sparse (d small)\nand the active set does not change much (k small), updating the solution of the Lasso will be faster\nthan updating the solution to the non-penalized least-square problem.\nSuppose that we applied Lars directly to the problem with n + 1 observations without using knowl-\nedge of \u03b8(n) by varying the regularization parameter from a large value where the size of the active\nset is 0 to \u00b5n+1. Let k0 be the number of transition points. The complexity of this approach is\nO(k0q2), and we can therefore compare the ef\ufb01ciency of these two approaches by comparing the\nnumber of transition points.\n\n4 Applications\n\n4.1 Compressive sensing\nLet \u03b80 \u2208 Rm be an unknown vector that we wish to reconstruct. We observe n linear projections\nyi = xT\ni \u03b80 + \u03bdi, where \u03bdi is Gaussian noise of variance \u03c32. In general one needs m such measure-\nment to reconstruct \u03b80. However, if \u03b80 has a sparse representation with k non-zero coef\ufb01cients, it\nhas been shown in the noiseless case that it is suf\ufb01cient to use n \u221d k log m such measurements.\nThis approach is known as compressive sensing [16][17] and has generated a tremendous amount of\ninterest in the signal processing community. The reconstruction is given by the solution of the Basis\nPursuit (BP) problem\n\nk\u03b8k1 subject to X\u03b8 = y.\n\nmin\n\n\u03b8\n\nIf measurements are obtained sequentially, it is advantageous to start estimating the unknown sparse\nsignal as measurements arrive, as opposed to waiting for a speci\ufb01ed number of measurements. 
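As a hedged illustration of this sequential setting (not the authors' homotopy), the loop below warm-starts a plain iterative soft-thresholding (ISTA) solver each time a measurement arrives, with µn = nλ as in the experiments; all names are ours:

```python
import numpy as np

def soft(x, tau):
    # entrywise soft-thresholding, the proximal operator of the l1 norm
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def ista(X, y, mu, theta0, n_iter=500):
    # minimize 0.5 * ||X theta - y||^2 + mu * ||theta||_1 by proximal gradient
    L = np.linalg.norm(X, 2) ** 2 + 1e-12  # Lipschitz constant of the smooth part
    theta = theta0.copy()
    for _ in range(n_iter):
        grad = X.T @ (X @ theta - y)
        theta = soft(theta - grad / L, mu / L)
    return theta

def sequential_lasso(xs, ys, lam):
    # re-solve after each new measurement, warm-starting from the last estimate
    theta = np.zeros(xs.shape[1])
    for n in range(1, len(ys) + 1):
        theta = ista(xs[:n], ys[:n], lam * n, theta)  # mu_n = n * lambda
    return theta
```

The warm start makes each re-solve cheap when the estimate changes little between measurements, which is the same intuition the homotopy exploits exactly.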
Algorithms to solve BP with sequential measurements have been proposed in [18][19], and it has been shown that the change in the active set gives a criterion for how many measurements are needed to recover the underlying signal [19].

In the case where the measurements are noisy (σ > 0), a standard approach to recover θ0 is to solve the Basis Pursuit DeNoising problem instead [20]. Hence, our algorithm is well suited for compressive sensing with sequential and noisy measurements. We compare our proposed algorithm to Lars as applied to the entire dataset each time we receive a new measurement. We also compare our method to coordinate descent [11] with warm start: when receiving a new measurement, we initialize coordinate descent (CD) to the current solution.

We sample measurements of a model where m = 100, the vector θ0 used to sample the data has 25 non-zero elements whose values are Bernoulli ±1, xi ∼ N(0, Im), σ = 1, and we set µn = .1n. The reconstruction error decreases as the number of measurements grows (not plotted). The parameter that controls the complexity of Lars and RecLasso is the number of transition points. We see in Figure 2 that this quantity is consistently smaller for RecLasso, and that after 100 measurements, when the support of the solution does not change much, there are typically fewer than 5 transition points for RecLasso. We also show in Figure 2 a timing comparison for the three algorithms, which we have each implemented in Python. We observed that CD requires many iterations to converge to the optimal solution when n < m, and we found it difficult to set a stopping criterion that ensures convergence. Our algorithm is consistently faster than Lars and CD with warm start.

Figure 2: Compressive sensing results. On the x-axis of the plots are the iterations of the algorithm, where at each iteration we receive a new measurement. 
On the left is the comparison of the number of transition points for Lars and RecLasso, and on the right is the timing comparison for the three algorithms. The simulation is repeated 100 times and shaded areas represent one standard deviation.

4.2 Selection of the regularization parameter

We have supposed until now a pre-determined regularization schedule, an assumption that is not practical. The amount of regularization indeed depends on the variance of the noise present in the data, which is not known a priori. It is therefore not obvious how to determine the amount of regularization. We write µn = nλn such that λn is the weighting factor between the average mean-squared error and the ℓ1-norm. We propose an algorithm that selects λn in a data-driven manner. The problem with n observations is given by

θ(λ) = arg minθ (1/2n) Σ_{i=1}^n (xi^T θ − yi)² + λ||θ||1.

We have seen previously that θ(λ) is piecewise linear, and we can therefore compute its gradient unless λ is a transition point. Let err(λ) = (xn+1^T θ(λ) − yn+1)² be the error on the new observation. We propose the following update rule to select λn+1:

log λn+1 = log λn − η (∂err/∂log λ)(λn)
⇒ λn+1 = λn × exp{2nη xn+1,1^T (X1^T X1)⁻¹ v1 (xn+1^T θ1 − yn+1)},

where the solution after n observations corresponding to the regularization parameter λn is given by (θ1^T, 0^T), and v1 = sign(θ1). We therefore use the new observation as a test set, which allows us to update the regularization parameter before introducing the new observation by varying t from 0 to 1. We perform the update in the log domain to ensure that λn is always positive. We performed simulations using the same experimental setup as in Section 4.1 and using η = .01. 
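One step of this multiplicative rule can be sketched as follows (NumPy; the function and argument names are ours):

```python
import numpy as np

def update_lambda(lam, eta, n, X1, v1, x1_new, pred_err):
    """One step of the multiplicative update rule of Sec. 4.2:
    lam <- lam * exp(2 n eta * x_{n+1,1}^T (X1^T X1)^{-1} v1 * pred_err).
    X1: active columns of X; v1 = sign(theta_1) on the active set;
    x1_new: active entries of x_{n+1}; pred_err = x_{n+1}^T theta - y_{n+1}."""
    g = x1_new @ np.linalg.solve(X1.T @ X1, v1)   # gradient direction term
    return lam * np.exp(2.0 * n * eta * g * pred_err)
```

Because the update multiplies by an exponential, λ stays positive for any step, mirroring the log-domain argument above.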
We show in\nFigure 3 a representative example where \u03bb converges. We compared this value to the one we would\nobtain if we had a training and a test set with 250 observations each such that we could \ufb01t the model\non the training set for various values of \u03bb, and see which one gives the smallest prediction error on\nthe test set. We obtain a very similar result, and understanding the convergence properties of our\nproposed update rule for the regularization parameter is the object of current research.\n\n4.3 Leave-one-out cross-validation\n\nWe suppose in this Section that we have access to a dataset (yi, xi)i=1...n and that \u00b5n = n\u03bb. The\nparameter \u03bb is tied to the amount of noise in the data which we do not know a priori. A standard\napproach to select this parameter is leave-one-out cross-validation. For a range of values of \u03bb, we\nuse n \u2212 1 data points to solve the Lasso with regularization parameter (n \u2212 1)\u03bb and then compute\nthe prediction error on the data point that was left out. This is repeated n times such that each data\npoint serves as the test set. Hence the best value for \u03bb is the one that leads to the smallest mean\nprediction error.\nOur proposed algorithm can be adapted to the case where we wish to update the solution of the Lasso\nafter a data point is removed. To do so, we compute the \ufb01rst homotopy by varying the regularization\nparameter from n\u03bb to (n \u2212 1)\u03bb. We then compute the second homotopy by varying t from 1 to 0\nwhich has the effect of removing the data point that will be used for testing. As the algorithm is\nvery similar to the one we proposed in Section 3.2 we omit the derivation. We sample a model with\nn = 32 and m = 32. The vector \u03b80 used to generate the data has 8 non-zero elements. We add\nGaussian noise of variance 0.2 to the observations, and select \u03bb for a range of 10 values. 
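The leave-one-out procedure itself is solver-agnostic; a minimal sketch with a pluggable solve(X, y, mu) routine (all names ours, any Lasso solver can be dropped in):

```python
import numpy as np

def loo_error(solve, X, y, lam):
    """Mean leave-one-out prediction error. `solve(X, y, mu)` is any Lasso
    solver returning theta; each fold uses mu = (n - 1) * lam as in Sec. 4.3."""
    n = len(y)
    errs = []
    for i in range(n):
        mask = np.arange(n) != i                       # leave observation i out
        theta = solve(X[mask], y[mask], (n - 1) * lam)
        errs.append((X[i] @ theta - y[i]) ** 2)        # test on the held-out point
    return float(np.mean(errs))

def pick_lambda(solve, X, y, grid):
    # best lambda = smallest mean LOO prediction error over the grid
    return min(grid, key=lambda lam: loo_error(solve, X, y, lam))
```

The homotopy in t run backwards (from 1 to 0) makes each inner solve cheap, since the fit with n points is a warm start for the fit with n − 1 points.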
We show in\nFigure 4 the histogram of the number of transition points for our algorithm when solving the Lasso\nwith n \u2212 1 data points (we solve this problem 10 \u00d7 n times). Note that in the majority cases there\nare very few transition points, which makes our approach very ef\ufb01cient in this setting.\n\nFigure 3: Evolution of the regularization param-\neter when using our proposed update rule.\n\nFigure 4: Histogram of the number of transition\npoints when removing an observation.\n\n5 Conclusion\n\nWe have presented an algorithm to solve \u20181-penalized least-square regression with online observa-\ntions. We use the current solution as a \u201cwarm-start\u201d and introduce an optimization problem that al-\nlows us to compute an homotopy from the current solution to the solution after observing a new data\npoint. The algorithm is particularly ef\ufb01cient if the active set does not change much, and we show a\ncomputational advantage as compared to Lars and Coordinate Descent with warm-start for applica-\ntions such as compressive sensing with sequential observations and leave-one-out cross-validation.\nWe have also proposed an algorithm to automatically select the regularization parameter where each\nnew measurement is used as a test set.\n\n7\n\n\fAcknowledgments\n\nWe wish to acknowledge support from NSF grant 0835531, and Guillaume Obozinski and Chris\nRozell for fruitful discussions.\n\nReferences\n[1] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society.\n\nSeries B, 58(1):267\u2013288, 1996.\n\n[2] S. Chen, D. Donoho, and M. Saunders. Atomic decomposition by basis pursuit. SIAM Review, 43(1):129\u2013\n\n159, 2001.\n\n[3] S. Boyd and L. Vandenberghe. Convex optimization. Cambridge Univ. Press, 2004.\n[4] S-J. Kim, K. Koh, M. Lustig, S. Boyd, and D. Gorinevsky. An interior-point method for large-scale\nl1-regularized least squares. 
IEEE Journal of Selected Topics in Signal Processing, 1(4):606\u2013617, 2007.\n[5] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Annals of Statistics,\n\n32(2):407\u2013499, 2004.\n\n[6] M.R. Osborne, B. Presnell, and B.A. Turlach. A new approach to variable selection in least squares\n\nproblems. IMA Journal of Numerical Analysis, 20:389\u2013404, 2000.\n\n[7] D.M. Malioutov, M. Cetin, and A.S. Willsky. Homotopy continuation for sparse signal representation.\nIn Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP),\nPhiladelphia, PA, March 2005.\n\n[8] I. Drori and D.L. Donoho. Solution of \u20181 minimization problems by lars/homotopy methods. In Proceed-\nings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Toulouse,\nFrance, May 2006.\n\n[9] I. Daubechies, M. Defrise, and C. De Mol. An iterative thresholding algorithm for linear inverse problems\n\nwith a sparsity constraint. Communications on Pure and Applied Mathematics, 57:1413\u20131541, 2004.\n\n[10] C.J. Rozell, D.H. Johnson, R.G. Baraniuk, and B.A. Olshausen. Locally competitive algorithms for sparse\napproximation. In Proceedings of the International Conference on Image Processing (ICIP), San Antonio,\nTX, September 2007.\n\n[11] J. Friedman, T. Hastie, H. Hoe\ufb02ing, and R. Tibshirani. Pathwise coordinate optimization. The Annals of\n\nApplied Statistics, 1(2):302\u2013332, 2007.\n\n[12] H. Lee, A. Battle, R. Raina, and A.Y. Ng. Ef\ufb01cient sparse coding algorithms.\n\nNeural Information Processing Systems (NIPS), 2007.\n\nIn Proceedings of the\n\n[13] M. Figueiredo and R. Nowak. A bound optimization approach to wavelet-based image deconvolution.\nIn Proceedings of the International Conference on Image Processing (ICIP), Genova, Italy, September\n2005.\n\n[14] M. Figueiredo, R. Nowak, and S. Wright. 
Gradient projection for sparse reconstruction: Application to\ncompressed sensing and other inverse problems. IEEE Journal of Selected Topics in Signal Processing,\n1(4):586\u2013597, 2007.\n\n[15] M Osborne. An effective method for computing regression quantiles. IMA Journal of Numerical Analysis,\n\nJan 1992.\n\n[16] E. Cand`es. Compressive sampling. Proceedings of the International Congress of Mathematicians, 2006.\n[17] D.L. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289\u20131306, 2006.\n[18] S. Sra and J.A. Tropp. Row-action methods for compressed sensing. In Proceedings of the International\n\nConference on Acoustics, Speech, and Signal Processing (ICASSP), Toulouse, France, May 2006.\n\n[19] D. Malioutov, S. Sanghavi, and A. Willsky. Compressed sensing with sequential observations. In Pro-\nceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Las\nVegas, NV, March 2008.\n\n[20] Y. Tsaig and D.L. Donoho. Extensions of compressed sensing. Signal Processing, 86(3):549\u2013571, 2006.\n\n8\n\n\f", "award": [], "sourceid": 176, "authors": [{"given_name": "Pierre", "family_name": "Garrigues", "institution": null}, {"given_name": "Laurent", "family_name": "Ghaoui", "institution": null}]}