{"title": "Adaptive Forward-Backward Greedy Algorithm for Sparse Learning with Linear Models", "book": "Advances in Neural Information Processing Systems", "page_first": 1921, "page_last": 1928, "abstract": "Consider linear prediction models where the target function is a sparse linear combination of a set of basis functions. We are interested in the problem of identifying those basis functions with non-zero coefficients and reconstructing the target function from noisy observations. Two heuristics that are widely used in practice are forward and backward greedy algorithms. First, we show that neither idea is adequate. Second, we propose a novel combination that is based on the forward greedy algorithm but takes backward steps adaptively whenever beneficial. We prove strong theoretical results showing that this procedure is effective in learning sparse representations. Experimental results support our theory.", "full_text": "Adaptive Forward-Backward Greedy Algorithm for\n\nSparse Learning with Linear Models\n\nTong Zhang\n\nStatistics Department\nRutgers University, NJ\n\ntzhang@stat.rutgers.edu\n\nAbstract\n\nConsider linear prediction models where the target function is a sparse linear com-\nbination of a set of basis functions. We are interested in the problem of identifying\nthose basis functions with non-zero coef\ufb01cients and reconstructing the target func-\ntion from noisy observations. Two heuristics that are widely used in practice are\nforward and backward greedy algorithms. First, we show that neither idea is ad-\nequate. Second, we propose a novel combination that is based on the forward\ngreedy algorithm but takes backward steps adaptively whenever bene\ufb01cial. We\nprove strong theoretical results showing that this procedure is effective in learning\nsparse representations. Experimental results support our theory.\n\n1 Introduction\nConsider a set of input vectors x1, . . . , xn \u2208 Rd, with corresponding desired output variables\ny1, . . . 
, yn. The task of supervised learning is to estimate the functional relationship y \u2248 f(x)\nbetween the input x and the output variable y from the training examples {(x1, y1), . . . , (xn, yn)}.\nThe quality of prediction is often measured through a loss function \u03c6(f(x), y). In this paper, we\nconsider the linear prediction model f(x) = wT x. As in boosting or kernel methods, nonlinearity can\nbe introduced by including nonlinear features in x.\nWe are interested in the scenario that d \u226b n. That is, there are many more features than the number\nof samples. In this case, an unconstrained empirical risk minimization is inadequate because the\nsolution over\ufb01ts the data. The standard remedy for this problem is to impose a constraint on w to\nobtain a regularized problem. An important target constraint is sparsity, which corresponds to the\n(non-convex) L0 regularization, where we de\ufb01ne \u2016w\u20160 = |{j : wj \u2260 0}|. If we know the\nsparsity parameter k, a good learning method is L0 regularization:\n\n\u02c6w = arg min_{w\u2208Rd} (1/n) \u2211_{i=1}^{n} \u03c6(wT xi, yi)   subject to \u2016w\u20160 \u2264 k.   (1)\n\nIf k is not known, then one may regard k as a tuning parameter, which can be selected through cross-\nvalidation. This method is often referred to as subset selection in the literature. Sparse learning is an\nessential topic in machine learning, which has attracted considerable interest recently. Generally\nspeaking, one is interested in two closely related themes: feature selection, or identifying the basis\nfunctions with non-zero coef\ufb01cients; and estimation accuracy, or reconstructing the target function from\nnoisy observations. It can be shown that the solution of the L0 regularization problem in (1) achieves\ngood prediction accuracy if the target function can be approximated by a sparse \u00afw. It can also\nsolve the feature selection problem under extra identi\ufb01ability assumptions. 
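For concreteness, the subset selection formulation (1) can be illustrated by brute-force enumeration; this is a hypothetical sketch (the helper name `subset_selection` and the use of NumPy are our own, not part of the paper), feasible only for tiny d and k since it solves C(d, k) least squares problems:

```python
import itertools
import numpy as np

def subset_selection(X, y, k):
    # Try every support of size k (C(d, k) of them -- exponential in k)
    # and keep the least squares fit with the smallest empirical risk.
    n, d = X.shape
    best_w, best_risk = None, np.inf
    for support in itertools.combinations(range(d), k):
        cols = list(support)
        w = np.zeros(d)
        # least squares restricted to the chosen support
        w[cols], *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
        risk = np.mean((X @ w - y) ** 2)
        if risk < best_risk:
            best_w, best_risk = w, risk
    return best_w, best_risk
```

On a toy problem where y depends on a single column, the enumeration recovers that column exactly; its exponential cost is precisely the computational difficulty discussed next.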
However, a fundamental\ndif\ufb01culty with this method is the computational cost, because the number of subsets of {1, . . . , d}\nof cardinality k (corresponding to the nonzero components of w) is exponential in k. There are no\nef\ufb01cient algorithms to solve the subset selection formulation (1).\n\nDue to this computational dif\ufb01culty, in practice there are three standard methods for learning sparse\nrepresentations by solving approximations of (1). The \ufb01rst approach is L1-regularization (Lasso).\nThe idea is to replace the L0 regularization in (1) by L1 regularization. It is the closest convex\napproximation to (1). It is known that L1 regularization often leads to sparse solutions. Its perfor-\nmance has been theoretically analyzed recently. For example, if the target is truly sparse, then it\nwas shown in [10] that under some restrictive conditions referred to as irrepresentable conditions,\nL1 regularization solves the feature selection problem. The prediction performance of this method\nhas been considered in [6, 2, 1, 9]. Despite its popularity, there are several problems with L1 regu-\nlarization: \ufb01rst, the sparsity is not explicitly controlled, and a good feature selection property requires\nstrong assumptions; second, in order to obtain a very sparse solution, one has to use a large regulariza-\ntion parameter, which leads to suboptimal prediction accuracy because the L1 penalty not only shrinks\nirrelevant features to zero, but also shrinks relevant features to zero. A sub-optimal remedy is to\nthreshold the resulting coef\ufb01cients; however, this requires additional tuning parameters, making the\nresulting procedures more complex and less robust. The second approach to approximately solving\nthe subset selection problem is the forward greedy algorithm, which we describe in detail in Sec-\ntion 2. The method has been widely used by practitioners. The third approach is the backward greedy\nalgorithm. 
Although this method is widely used by practitioners, there isn\u2019t any theoretical analysis\nwhen n \u226a d (which is the case we are interested in here). The reason will be discussed later.\nIn this paper, we are particularly interested in greedy algorithms because they have been widely\nused but their effectiveness has not been well analyzed. As we shall explain later, neither the standard\nforward greedy idea nor the standard backward greedy idea is adequate for our purpose. However,\nthe \ufb02aws of these methods can be \ufb01xed by a simple combination of the two ideas. This leads to a\nnovel adaptive forward-backward greedy algorithm which we present in Section 3. The general idea\nworks for all loss functions. For the least squares loss, we obtain strong theoretical results showing that\nthe method can solve the feature selection problem under moderate conditions.\nFor clarity, this paper only considers the \ufb01xed design formulation. To simplify notations in our\ndescription, we will replace the optimization problem in (1) with a more general formulation. In-\nstead of working with n input data vectors xi \u2208 Rd, we work with d feature vectors fj \u2208 Rn\n(j = 1, . . . , d), and y \u2208 Rn. Each fj corresponds to the j-th feature component of xi for\ni = 1, . . . , n. That is, fj,i = xi,j. Using this notation, we can generally rewrite (1) in the\nform \u02c6w = arg min_{w\u2208Rd} R(w) subject to \u2016w\u20160 \u2264 k, where the weight w = [w1, . . . , wd] \u2208 Rd,\nand R(w) is a real-valued cost function which we are interested in optimizing. For least squares\nregression, we have R(w) = (1/n) \u2016\u2211_j wjfj \u2212 y\u2016_2^2. In the following, we also let ej \u2208 Rd be the\nvector of zeros, except for the j-th component, which is one. For convenience, we also introduce the\nfollowing notations.\nDe\ufb01nition 1.1 De\ufb01ne supp(w) = {j : wj \u2260 0} as the set of nonzero coef\ufb01cients of a vector\nw = [w1, . . . , wd] \u2208 Rd. For a weight vector w \u2208 Rd, we de\ufb01ne the mapping f : Rd \u2192 Rn as\nf(w) = \u2211_{j=1}^d wjfj. Given f \u2208 Rn and F \u2282 {1, . . . , d}, let \u02c6w(F, f) = arg min_{w\u2208Rd} \u2016f(w) \u2212 f\u2016_2^2\nsubject to supp(w) \u2282 F , and let \u02c6w(F ) = \u02c6w(F, y) be the solution of the least squares\nproblem using features F .\n\n2 Forward and Backward Greedy Algorithms\n\nForward greedy algorithms have been widely used in applications. The basic algorithm is presented\nin Figure 1. Although a number of variations exist, they all share the basic form of greedily picking\nan additional feature at every step to aggressively reduce the cost function. The intention is to make\nthe most signi\ufb01cant progress at each step in order to achieve sparsity. In this regard, the method can be\nconsidered as an approximation algorithm for solving (1).\n\nInput: f1, . . . , fd, y \u2208 Rn and \u03b5 > 0\nOutput: F(k) and w(k)\nlet F(0) = \u2205 and w(0) = 0\nfor k = 1, 2, . . .\n  let i(k) = arg min_i min_\u03b1 R(w(k\u22121) + \u03b1 ei)\n  let F(k) = {i(k)} \u222a F(k\u22121)\n  let w(k) = \u02c6w(F(k))\n  if (R(w(k\u22121)) \u2212 R(w(k)) \u2264 \u03b5) break\nend\n\nFigure 1: Forward Greedy Algorithm\n\nFigure 2: Failure of Forward Greedy Algorithm\n\nA major \ufb02aw of this method is that it can never correct mistakes made in earlier steps. As an\nillustration, we consider the situation plotted in Figure 2 with least squares regression. In the \ufb01gure,\ny can be expressed as a linear combination of f1 and f2, but f3 is closer to y. Therefore, using the\nforward greedy algorithm, we will \ufb01nd f3 \ufb01rst, then f1 and f2. At this point, we have already found\nall good features, as y can be expressed by f1 and f2, but we are not able to remove f3 selected\nin the \ufb01rst step. The above argument implies that the forward greedy method is inadequate for feature\nselection. The method only works when small subsets of the basis functions {fj} are near orthogonal\n[7]. 
In general, Figure 2 shows that the forward greedy\nalgorithm can make errors that are not corrected later on.\nIn order to remedy the problem, the so-called backward greedy algorithm has been widely used by\npractitioners. The idea is to train a full model with all the features, and greedily remove one feature\n(the one with the smallest increase of cost function) at a time. Although at \ufb01rst sight the backward greedy\nmethod appears to be a reasonable idea that addresses the problem of the forward greedy algorithm, it is\ncomputationally very costly because it starts with a full model containing all features. Moreover, there are\nno theoretical results showing that this procedure is effective. In fact, under our setting, the method\nmay only work when d \u226a n (see, for example, [3]), which is not the case we are interested in. In\nthe case d \u226b n, during the \ufb01rst step, we start with a model with all features, which can immediately\nover\ufb01t the data with perfect prediction. In this case, the method has no ability to tell which feature\nis irrelevant and which feature is relevant, because removing any feature still completely over\ufb01ts\nthe data. Therefore the method will completely fail when d \u226b n, which explains why there is no\ntheoretical result for this method.\n\n3 Adaptive Forward-Backward Greedy Algorithm\n\nThe main strength of the forward greedy algorithm is that it always works with a sparse solution ex-\nplicitly, and is thus computationally ef\ufb01cient. Moreover, it does not signi\ufb01cantly over\ufb01t the data, due\nto the explicit sparsity. However, a major problem is its inability to correct any error made by the\nalgorithm. On the other hand, backward greedy steps can potentially correct such an error, but they need\nto start with a good model that does not completely over\ufb01t the data \u2014 they can only correct errors with\na small amount of over\ufb01tting. 
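The two baseline procedures discussed above can be sketched as follows; this is a minimal illustration for the least squares risk (the helper names are ours, with one candidate feature per column of F), not the exact variants used in practice:

```python
import numpy as np

def risk(F, w, y):
    # Empirical least squares risk R(w) = ||F w - y||^2 / n.
    return np.mean((F @ w - y) ** 2)

def refit(F, S, y):
    # Least squares solution restricted to the support S.
    w = np.zeros(F.shape[1])
    if S:
        w[S], *_ = np.linalg.lstsq(F[:, S], y, rcond=None)
    return w

def forward_greedy(F, y, eps):
    # Figure 1: repeatedly add the feature whose best single-coefficient
    # update most reduces the risk; stop when the gain drops below eps.
    S, w = [], np.zeros(F.shape[1])
    while True:
        r = y - F @ w
        scores = (F.T @ r) ** 2 / np.sum(F ** 2, axis=0)
        i = int(np.argmax(scores))
        w_new = refit(F, S + [i], y)
        if risk(F, w, y) - risk(F, w_new, y) <= eps:
            return S, w
        S, w = S + [i], w_new

def backward_greedy(F, y, k):
    # Classical backward deletion: start from the full model and remove
    # the feature whose deletion increases the risk the least.  When
    # d >> n the full model interpolates y, so every deletion looks
    # equally harmless and relevant features cannot be identified.
    S = list(range(F.shape[1]))
    while len(S) > k:
        j = min(S, key=lambda f: risk(F, refit(F, [g for g in S if g != f], y), y))
        S.remove(j)
    return sorted(S)
```

On a well-conditioned problem with d < n both sketches behave sensibly; the failure modes discussed above appear when features are correlated (forward) or d exceeds n (backward).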
Therefore a combination of the two can remedy the fundamental \ufb02aws\nof both methods. However, a key design issue is how to implement a backward greedy strategy\nthat is provably effective. Some heuristics exist in the literature, although without any effectiveness\nproof. For example, the standard heuristic, described in [5] and implemented in SAS, includes\nanother threshold \u03b5\u2032 in addition to \u03b5: a feature is deleted if the cost-function increase caused by\nthe deletion is no more than \u03b5\u2032. Unfortunately, we cannot provide an effectiveness proof for this\nheuristic: if the threshold \u03b5\u2032 is too small, then it cannot delete any spurious features introduced in\nthe forward steps; if it is too large, then one cannot make progress because good features are also\ndeleted. In practice it can be hard to pick a good \u03b5\u2032, and even the best choice may be ineffective.\n\nThis paper takes a more principled approach, where we speci\ufb01cally design a forward-backward\ngreedy procedure with adaptive backward steps that are carried out automatically. The procedure\nhas provably good performance and \ufb01xes the drawbacks of the forward greedy algorithm illustrated in\nFigure 2. There are two main considerations in our approach: we want to take reasonably aggressive\nbackward steps to remove any errors caused by earlier forward steps, and to avoid maintaining a\nlarge number of basis functions; and we want to take backward steps adaptively and make sure that any\nbackward greedy step does not erase the gain made in the forward steps. Our algorithm, which we\nrefer to as FoBa, is listed in Figure 3. It is designed to balance the above two aspects. Note that we\nonly take a backward step when the increase of cost function is no more than half of the decrease of\ncost function in earlier forward steps. 
This implies that if we take \u2113 forward steps, then no matter\nhow many backward steps are performed, the cost function is decreased by at least an amount of\n\u2113\u03b5/2. It follows that if R(w) \u2265 0 for all w \u2208 Rd, then the algorithm terminates after no more than\n2R(0)/\u03b5 steps. This means that the procedure is computationally ef\ufb01cient.\n\nInput: f1, . . . , fd, y \u2208 Rn and \u03b5 > 0\nOutput: F(k) and w(k)\nlet F(0) = \u2205 and w(0) = 0\nlet k = 0\nwhile true\n  let k = k + 1\n  // forward step\n  let i(k) = arg min_i min_\u03b1 R(w(k\u22121) + \u03b1 ei)\n  let F(k) = {i(k)} \u222a F(k\u22121)\n  let w(k) = \u02c6w(F(k))\n  let \u03b4(k) = R(w(k\u22121)) \u2212 R(w(k))\n  if (\u03b4(k) \u2264 \u03b5)\n    k = k \u2212 1\n    break\n  endif\n  // backward step (can be performed after each few forward steps)\n  while true\n    let j(k) = arg min_{j\u2208F(k)} R(w(k) \u2212 w(k)_j ej)\n    let \u03b4\u2032 = R(w(k) \u2212 w(k)_{j(k)} e_{j(k)}) \u2212 R(w(k))\n    if (\u03b4\u2032 > 0.5 \u03b4(k)) break\n    let k = k \u2212 1\n    let F(k) = F(k+1) \u2212 {j(k+1)}\n    let w(k) = \u02c6w(F(k))\n  end\nend\n\nFigure 3: FoBa: Forward-Backward Greedy Algorithm\n\nNow, consider an application of FoBa to the example in Figure 2. Again, in the \ufb01rst three forward\nsteps, we will be able to pick f3, followed by f1 and f2. After the third step, since we are able\nto express y using f1 and f2 only, removing f3 in the backward step does not increase the\ncost. Therefore at this stage, we are able to successfully remove the incorrect basis function f3 while keeping\nthe good features f1 and f2. This simple illustration demonstrates the effectiveness of FoBa. In\nthe following, we formally characterize this intuitive example, and prove the effectiveness of FoBa\nfor feature selection as well as parameter estimation. 
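A minimal sketch of the FoBa idea of Figure 3 (our own simplified rendering, not the paper's exact procedure: it refits by least squares after every change and compares each deletion against half of the most recent forward gain, rather than tracking \u03b4(k) for every k):

```python
import numpy as np

def foba(F, y, eps):
    # Simplified FoBa sketch for the least squares risk
    # R(w) = ||F w - y||^2 / n, one candidate feature per column of F.
    n, d = F.shape

    def refit(S):
        w = np.zeros(d)
        if S:
            w[S], *_ = np.linalg.lstsq(F[:, S], y, rcond=None)
        return w

    def risk(w):
        return np.mean((F @ w - y) ** 2)

    S, w = [], np.zeros(d)
    while True:
        # forward step: add the feature with the largest risk reduction
        r = y - F @ w
        scores = (F.T @ r) ** 2 / np.sum(F ** 2, axis=0)
        i = int(np.argmax(scores))
        w_new = refit(S + [i])
        delta = risk(w) - risk(w_new)
        if delta <= eps:          # stopping criterion
            return sorted(S), w
        S, w = S + [i], w_new
        # backward steps: delete features as long as the resulting risk
        # increase is at most half of the last forward gain delta
        while len(S) > 1:
            j = min(S, key=lambda f: risk(refit([g for g in S if g != f])))
            S_try = [g for g in S if g != j]
            w_try = refit(S_try)
            if risk(w_try) - risk(w) > 0.5 * delta:
                break
            S, w = S_try, w_try
```

On a Figure 2 style toy example (f3 nearly parallel to y = f1 + f2), the forward steps pick f3 first and a later backward step deletes it, leaving the correct support.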
Our analysis assumes the least squares loss.\nHowever, it is possible to handle more general loss functions with a more complicated derivation.\nWe introduce the following de\ufb01nition, which characterizes how linearly independent small subsets\nof {fj} of size k are. For k \u226a n, the number \u03c1(k) can be bounded away from zero even when\nd \u226b n. For example, for random basis functions fj, we may take ln d = O(n/k) and still have \u03c1(k)\nbounded away from zero. This quantity is the smallest eigenvalue of the k \u00d7 k diagonal blocks\nof the d \u00d7 d design matrix [f_i^T f_j]_{i,j=1,...,d}, and has appeared in recent analyses of L1 regularization\nmethods such as in [2, 8], etc. We shall refer to it as the sparse eigenvalue condition. This condition\nis the least restrictive when compared to other conditions in the literature [1].\n\nDe\ufb01nition 3.1 De\ufb01ne for all 1 \u2264 k \u2264 d: \u03c1(k) = inf{ (1/n)\u2016f(w)\u2016_2^2 / \u2016w\u2016_2^2 : \u2016w\u20160 \u2264 k }.\n\nAssumption 3.1 Consider the least squares loss R(w) = (1/n)\u2016f(w) \u2212 y\u2016_2^2. Assume that the basis\nfunctions are normalized such that (1/n)\u2016fj\u2016_2^2 = 1 for all j = 1, . . . , d, and assume that {yi}i=1,...,n\nare independent (but not necessarily identically distributed) sub-Gaussian random variables: there exists \u03c3 \u2265 0 such\nthat \u2200i and \u2200t \u2208 R, E exp(t(yi \u2212 Eyi)) \u2264 exp(\u03c3^2 t^2/2).\n\nBoth Gaussian and bounded random variables are sub-Gaussian using the above de\ufb01nition. For\nexample, we have the following Hoeffding\u2019s inequality: if a random variable \u03be \u2208 [a, b], then\nE exp(t(\u03be \u2212 E\u03be)) \u2264 exp((b \u2212 a)^2 t^2/8). If a random variable is Gaussian, \u03be \u223c N(0, \u03c3^2), then\nE exp(t\u03be) \u2264 exp(\u03c3^2 t^2/2).\nThe following theorem is stated with an explicit \u03b5 for convenience. 
In applications, one can always\nrun the algorithm with a smaller \u03b5 and use cross-validation to determine the optimal stopping point.\n\nTheorem 3.1 Consider the FoBa algorithm in Figure 3, where Assumption 3.1 holds. Assume also\nthat the target is sparse: there exists \u00afw \u2208 Rd such that \u00afwT xi = Eyi for i = 1, . . . , n, and\n\u00afF = supp( \u00afw). Let \u00afk = | \u00afF |, and let \u03b5 > 0 be the stopping criterion in Figure 3. Let s \u2264 d be an\ninteger which either equals d or satis\ufb01es the condition 8\u00afk \u2264 s\u03c1(s)^2. If min_{j\u2208supp( \u00afw)} | \u00afwj|^2 \u2265\n(64/25) \u03c1(s)^{\u22122} \u03b5, and for some \u03b7 \u2208 (0, 1/3), \u03b5 \u2265 64 \u03c1(s)^{\u22122} \u03c3^2 ln(2d/\u03b7)/n, then with probability\nlarger than 1 \u2212 3\u03b7, when the algorithm terminates, we have F(k) = \u00afF and\n\u2016w(k) \u2212 \u00afw\u2016_2 \u2264 \u03c3 \u221a(\u00afk/(n\u03c1(\u00afk))) [1 + \u221a(20 ln(1/\u03b7))].\n\nThe result shows that one can identify the correct set of features \u00afF as long as the weights \u00afwj are\nnot close to zero for j \u2208 \u00afF . Such a condition is necessary for all feature selection algorithms,\nincluding those in previous analyses of Lasso. The theorem can be applied as long as the eigenvalues of small\ns \u00d7 s diagonal blocks of the design matrix [f_i^T f_j]_{i,j=1,...,d} are bounded away from zero (i.e., the sparse\neigenvalue condition). This is the situation under which the forward greedy step can make mistakes,\nbut such mistakes can be corrected using FoBa. Because the conditions of the theorem do not prevent\nforward steps from making errors, the example described in Figure 2 indicates that it is not possible\nto prove a similar result for the forward greedy algorithm. The result we proved is also better than\nthat of Lasso, which can successfully select features under the irrepresentable conditions of [10]. 
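For small problems, the sparse eigenvalue \u03c1(k) of De\ufb01nition 3.1 can be evaluated directly by enumerating supports; the sketch below (helper name is ours, for illustration only) takes the minimum eigenvalue over all k \u00d7 k diagonal blocks of the Gram matrix, which is exponential in k:

```python
import itertools
import numpy as np

def sparse_eigenvalue(F, k):
    # rho(k) of Definition 3.1: the minimum, over all supports of size k,
    # of the smallest eigenvalue of the corresponding k x k diagonal
    # block of the Gram matrix (1/n) [f_i^T f_j].  By eigenvalue
    # interlacing, supports of size exactly k suffice for the infimum
    # over all supports of size at most k.
    n, d = F.shape
    gram = F.T @ F / n
    return min(
        np.linalg.eigvalsh(gram[np.ix_(S, S)]).min()
        for S in itertools.combinations(range(d), k)
    )
```

For an orthonormal design (after the 1/n normalization) this returns 1 for every k, while duplicated columns drive it to 0, matching the intuition that \u03c1(k) measures linear independence of small feature subsets.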
It is\nknown that the sparse eigenvalue condition considered here is generally weaker [8, 1].\n\nOur result relies on the assumption that | \u00afwj| (j \u2208 \u00afF ) is larger than the noise level O(\u03c3\u221a(ln d/n)) in\norder to select features effectively. If any nonzero weight is below the noise level, then no algorithm\ncan distinguish it from zero with large probability. That is, in this case, one cannot reliably perform\nfeature selection due to the noise. Therefore FoBa is near optimal in terms of its ability to perform\nreliable feature selection, except for the constant hiding in O(\u00b7). For targets that are not truly sparse,\nsimilar results can be obtained. In this case, it is not possible to correctly identify all the features\nwith large probability. However, we can show that FoBa can still select part of the features reliably,\nwith good parameter estimation accuracy. Such results can be found in the full version of the paper,\navailable from the author\u2019s website.\n\n4 Experiments\n\nWe compare FoBa (described in Section 3) to forward-greedy and L1-regularization on arti\ufb01cial\nand real data. The experiments show that in practice, FoBa is closer to subset selection than the other two\napproaches, in the sense that FoBa achieves smaller training error at any given sparsity level. In order\nto compare with Lasso, we use the LARS [4] package in R, which generates a path of actions for\nadding and deleting features along the L1 solution path. For example, a path of {1, 3, 5, \u22123, . . .}\nmeans that in the \ufb01rst three steps, features 1, 3, and 5 are added, and the next step removes feature 3.\nUsing such a solution path, we can compare Lasso to forward-greedy and FoBa under the same\nframework. 
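The bookkeeping for such signed action paths can be sketched as follows (a hypothetical helper using the 1-based sign convention above, where a negative entry means deletion; for a given sparsity k it returns the size-k active set with the smallest squared error, matching the evaluation protocol used in this section):

```python
import numpy as np

def best_k_features(path, F, y, k):
    # Walk a signed action path (1-based feature indices; a negative
    # entry means deletion), record every active set of size k, and
    # return the one whose least squares fit has the smallest error.
    def sq_err(S):
        cols = sorted(S)
        w, *_ = np.linalg.lstsq(F[:, cols], y, rcond=None)
        return np.sum((F[:, cols] @ w - y) ** 2)

    active, best, best_err = set(), None, np.inf
    for step in path:
        if step > 0:
            active.add(step - 1)       # convert to 0-based column index
        else:
            active.discard(-step - 1)
        if len(active) == k:
            err = sq_err(active)
            if err < best_err:
                best, best_err = sorted(active), err
    return best
```

For the path {1, 3, 5, \u22123} with k = 2, the candidate sets are {1, 3} and {1, 5}, and the helper picks whichever fits y better.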
Similar to the Lasso path, FoBa also generates a path with both addition and deletion\noperations, while the forward-greedy algorithm only adds features without deletion.\n\nOur experiments compare the performance of the three algorithms using the corresponding feature\naddition/deletion paths. We are interested in the features selected by the three algorithms at any sparsity\nlevel k, where k is the desired number of features present in the \ufb01nal solution. Given a path, we\ncan keep an active feature set by adding or deleting features along the path. For example, for the path\n{1, 3, 5, \u22123}, we have two potential active feature sets of size k = 2: {1, 3} (after two steps) and\n{1, 5} (after four steps). We then de\ufb01ne the k best features as the active feature set of size k with\nthe smallest least squares error, because this is the best approximation to subset selection (along the\npath generated by the algorithm). From the above discussion, we do not have to set \u03b5 explicitly in\nthe FoBa procedure. Instead, we just generate a solution path which is \ufb01ve times as long as the\nmaximum desired sparsity k, and then generate the best k features for any sparsity level using the\nprocedure described above.\n\n4.1 Simulation Data\nSince for real data we do not know the true feature set \u00afF , simulation is needed to compare feature\nselection performance. We generate n = 100 data points of dimension d = 500. The target vector\n\u00afw is truly sparse with \u00afk = 5 nonzero coef\ufb01cients generated uniformly from 0 to 10. The noise\nlevel is \u03c3^2 = 0.1. The basis functions fj are randomly generated with moderate correlation: that\nis, some basis functions are correlated to the basis functions spanning the true target. 
Note that\nif there is no correlation (i.e., the fj are independent random vectors), then both forward-greedy and\nL1-regularization work well, because the basis functions are near orthogonal (this is the well-known\ncase considered in the compressed sensing literature). Therefore in this experiment, we generate\nmoderate correlation so that the performance of the three methods can be differentiated. Such mod-\nerate correlation does not violate the sparse eigenvalue condition in our analysis, but violates the\nmore restrictive conditions for the forward-greedy method and Lasso.\n\n                                 FoBa            Forward-greedy   L1\nleast squares training error     0.093 \u00b1 0.02    0.16 \u00b1 0.089     0.25 \u00b1 0.14\nparameter estimation error       0.057 \u00b1 0.2     0.52 \u00b1 0.82      1.1 \u00b1 1\nfeature selection error          0.76 \u00b1 0.98     1.8 \u00b1 1.1        3.2 \u00b1 0.77\n\nTable 1: Performance comparison on simulation data at sparsity level k = 5\n\nTable 1 shows the performance of the three methods (including two versions of FoBa), where we\nrepeat the experiments 50 times and report the average \u00b1 standard deviation. We use the three\nmethods to select the \ufb01ve best features, using the procedure described above. We report three metrics.\nTraining error is the squared error of the least squares solution with the selected \ufb01ve features. Pa-\nrameter estimation error is the 2-norm of the estimated parameter (with the \ufb01ve features) minus the\ntrue parameter. Feature selection error is the number of incorrectly selected features. It is clear\nfrom the table that for this data, FoBa achieves signi\ufb01cantly smaller training error than the other two\nmethods, which implies that it is closest to subset selection. Moreover, the parameter estimation\nperformance and feature selection performance are also better. 
The two versions of FoBa perform\nvery similarly for this data.\n\n4.2 Real Data\n\nInstead of listing results for many datasets without gaining much insight, we present a more detailed\nstudy of a typical dataset, which re\ufb02ects typical behaviors of the algorithms. Our study shows that\nFoBa does what it is designed to do well: that is, it gives a better approximation to subset selection\nthan either forward-greedy or L1 regularization. Moreover, the difference between aggressive FoBa\nand conservative FoBa becomes more signi\ufb01cant.\nIn this study, we use the standard Boston Housing data, which is the housing data for 506 cen-\nsus tracts of Boston from the 1970 census, available from the UCI Machine Learning Database\nRepository: http://archive.ics.uci.edu/ml/. Each census tract is a data point, with 13 features (we\nadd a constant offset of one as the 14th feature), and the desired output is the housing price. In the\nexperiment, we randomly partition the data into 50 training plus 456 test points. We perform the\nexperiments 50 times, and for each sparsity level from 1 to 10, we report the average training and test\nsquared error. The results are plotted in Figure 4. From the results, we can see that FoBa achieves\nbetter training error for any given sparsity, which is consistent with the theory and the design goal of\nFoBa. Moreover, it achieves better test accuracy at small sparsity levels (corresponding to a more\nsparse solution). At large sparsity levels (corresponding to a less sparse solution), the test error\nincreases more quickly with FoBa. This is because it searches a larger space by more aggressively\nmimicking subset selection, which makes it more prone to over\ufb01tting. However, at the best sparsity level\nof 2 or 3 (for aggressive and conservative FoBa, respectively), FoBa achieves signi\ufb01cantly better test\nerror. 
Moreover, we can observe that at small sparsity levels (a more sparse solution), L1 regularization\nperforms poorly, due to the bias caused by using a large L1 penalty.\n\nFigure 4: Performance of the algorithms on Boston Housing data. Left: average training squared\nerror versus sparsity; Right: average test squared error versus sparsity\n\nFor completeness, we also compare FoBa to the backward-greedy algorithm and the classical heuris-\ntic forward-backward greedy algorithm as implemented in SAS (see its description at the beginning\nof Section 3). We still use the Boston Housing data, but plot the results separately, in order to avoid\nclutter. As we have pointed out, there is no theory for the SAS version of the forward-backward\ngreedy algorithm. It is dif\ufb01cult to select an appropriate backward threshold \u03b5\u2032: a too-small value\nleads to few backward steps, and a too-large value leads to overly aggressive deletion, and the pro-\ncedure terminates very early. In this experiment, we pick a value of 10, because it is a reasonably\nlarge quantity that does not lead to an extremely quick termination of the procedure. The perfor-\nmance of the algorithms is reported in Figure 5. From the results, we can see that the backward greedy\nalgorithm performs reasonably well on this problem. Note that for this data, d \u226a n, which is the\nscenario in which backward greedy does not start with a completely over\ufb01tted full model. Still, it is inferior to\nFoBa at small sparsity levels, which means that some degree of over\ufb01tting still occurs. Note that\nthe backward-greedy algorithm cannot be applied in our simulation data experiment, because d \u226b n,\nwhich causes immediate over\ufb01tting. From the graph, we also see that FoBa is more effective than\nthe SAS implementation of the forward-backward greedy algorithm. The latter does not perform signif-\nicantly better than the forward-greedy algorithm with our choice of \u03b5\u2032. 
Unfortunately, using a larger\nbackward threshold \u03b5\u2032 will lead to an undesirable early termination of the algorithm. This is why the\nprovably effective adaptive backward strategies introduced in this paper are superior.\n\n5 Discussion\n\nThis paper investigates the problem of learning sparse representations using greedy algorithms. We\nshowed that neither forward greedy nor backward greedy algorithms are adequate by themselves.\nHowever, through a novel combination of the two ideas, we showed that an adaptive forward-backward\ngreedy algorithm, referred to as FoBa, can effectively solve the problem under reasonable condi-\ntions. FoBa is designed to be a better approximation to subset selection. Under the sparse eigenvalue\ncondition, we obtained strong performance bounds for FoBa for feature selection and parameter es-\ntimation. In fact, to the author\u2019s knowledge, in terms of sparsity, the bounds developed for FoBa in\nthis paper are superior to all earlier results in the literature for other methods.\n\nFigure 5: Performance of greedy algorithms on Boston Housing data. Left: average training squared\nerror versus sparsity; Right: average test squared error versus sparsity\n\nOur experiments also showed that FoBa achieves its design goal: that is, it gives smaller training\nerror than either forward-greedy or L1 regularization for any given level of sparsity. Therefore the\nexperiments are consistent with our theory. On real data, better sparsity helps on some datasets such\nas Boston Housing. However, we should point out that while FoBa always achieves better training\nerror for a given sparsity in our experiments on other datasets (thus achieving our design goal), L1-\nregularization sometimes achieves better test performance. 
This is not surprising because sparsity is\nnot always the best complexity measure for all problems. In particular, the prior knowledge of using\nsmall weights, which is encoded in the L1 regularization formulation but not in greedy algorithms,\ncan lead to better generalization performance on some data (when such a prior is appropriate).\n\nReferences\n[1] Peter Bickel, Yaacov Ritov, and Alexandre Tsybakov. Simultaneous analysis of Lasso and\n\nDantzig selector. Annals of Statistics, 2008. to appear.\n\n[2] Florentina Bunea, Alexandre Tsybakov, and Marten H. Wegkamp. Sparsity oracle inequalities\n\nfor the Lasso. Electronic Journal of Statistics, 1:169\u2013194, 2007.\n\n[3] Christophe Couvreur and Yoram Bresler. On the optimality of the backward greedy algorithm\n\nfor the subset selection problem. SIAM J. Matrix Anal. Appl., 21(3):797\u2013808, 2000.\n\n[4] Bradley Efron, Trevor Hastie, Iain Johnstone, and Robert Tibshirani. Least angle regression.\n\nAnnals of Statistics, 32(2):407\u2013499, 2004.\n\n[5] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer,\n\n2001.\n\n[6] Vladimir Koltchinskii. Sparsity in penalized empirical risk minimization. Annales de l\u2019Institut\n\nHenri Poincar\u00e9, 2008.\n\n[7] Joel A. Tropp. Greed is good: Algorithmic results for sparse approximation. IEEE Trans. Info.\n\nTheory, 50(10):2231\u20132242, 2004.\n\n[8] Cun-Hui Zhang and Jian Huang. Model-selection consistency of the Lasso in high-dimensional\n\nlinear regression. Technical report, Rutgers University, 2006.\n\n[9] Tong Zhang. Some sharp performance bounds for least squares regression with L1 regulariza-\n\ntion. The Annals of Statistics, 2009. to appear.\n\n[10] Peng Zhao and Bin Yu. On model selection consistency of Lasso. 
Journal of Machine Learning Research, 7:2541\u20132567, 2006.\n", "award": [], "sourceid": 497, "authors": [{"given_name": "Tong", "family_name": "Zhang", "institution": null}]}