{"title": "SpAM: Sparse Additive Models", "book": "Advances in Neural Information Processing Systems", "page_first": 1201, "page_last": 1208, "abstract": null, "full_text": "SpAM: Sparse Additive Models

Pradeep Ravikumar†, Han Liu†‡, John Lafferty∗†, Larry Wasserman‡

†Machine Learning Department, ‡Department of Statistics, ∗Computer Science Department
Carnegie Mellon University, Pittsburgh, PA 15213

Abstract

We present a new class of models for high-dimensional nonparametric regression and classification called sparse additive models (SpAM). Our methods combine ideas from sparse linear modeling and additive nonparametric regression. We derive a method for fitting the models that is effective even when the number of covariates is larger than the sample size. A statistical analysis of the properties of SpAM is given together with empirical results on synthetic and real data, showing that SpAM can be effective in fitting sparse nonparametric models in high-dimensional data.

1 Introduction

Substantial progress has been made recently on the problem of fitting high-dimensional linear regression models of the form Y_i = X_i^T β + ε_i, for i = 1, ..., n. Here Y_i is a real-valued response, X_i is a p-dimensional predictor, and ε_i is a mean-zero error term. Finding an estimate of β when p > n that is both statistically well-behaved and computationally efficient has proved challenging; however, the lasso estimator (Tibshirani, 1996) has been remarkably successful. The lasso estimator β̂ minimizes the ℓ1-penalized sum of squares

    sum_i (Y_i - X_i^T β)^2 + λ sum_{j=1}^p |β_j|    (1)

with the ℓ1 penalty ||β||_1 encouraging sparse solutions, in which many components β̂_j are zero. 
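The ℓ1-penalized objective (1) can be minimized by cyclic coordinate descent, in which each coordinate update is a scalar soft-thresholding step. The sketch below is illustrative, not the paper's code; the 1/(2n) scaling of the squared-error term and all function names are our own conventions.

```python
import numpy as np

def soft_threshold(z, t):
    # Scalar soft-thresholding operator: sign(z) * max(|z| - t, 0).
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Cyclic coordinate descent for (1/2n)||y - X beta||^2 + lam * ||beta||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n   # per-coordinate curvature ||X_j||^2 / n
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual with coordinate j's contribution removed.
            r_j = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ r_j / n
            beta[j] = soft_threshold(rho, lam) / col_sq[j]
    return beta
```

With λ large enough, whole coordinates are set exactly to zero, which is the sparsity behavior that SpAM carries over to the nonparametric setting.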
The good empirical success of this estimator has recently been backed up by results confirming that it has strong theoretical properties; see (Greenshtein and Ritov, 2004; Zhao and Yu, 2007; Meinshausen and Yu, 2006; Wainwright, 2006).

The nonparametric regression model Y_i = m(X_i) + ε_i, where m is a general smooth function, relaxes the strong assumptions made by a linear model, but is much more challenging in high dimensions. Hastie and Tibshirani (1999) introduced the class of additive models of the form

    Y_i = sum_{j=1}^p m_j(X_{ij}) + ε_i    (2)

which is less general, but can be more interpretable and easier to fit; in particular, an additive model can be estimated using a coordinate descent Gauss-Seidel procedure called backfitting. An extension of the additive model is the functional ANOVA model

    Y_i = sum_{1 ≤ j ≤ p} m_j(X_{ij}) + sum_{j < k} m_{jk}(X_{ij}, X_{ik}) + ... + ε_i

For the sparse additive model, the stationary condition for a component function f_j with E(f_j^2) ≠ 0 takes the form

    (1 + λ / sqrt(E(f_j^2))) f_j = P_j    (11)

where P_j = E[R_j | X_j] is the projection of the residual onto H_j, and f_j = 0 otherwise.

    Input: Data (X_i, Y_i), regularization parameter λ.
    Initialize f_j = f_j^(0), for j = 1, ..., p.
    Iterate until convergence:
      For each j = 1, ..., p:
        Compute the residual: R_j = Y - sum_{k ≠ j} f_k(X_k);
        Estimate the projection P_j = E[R_j | X_j] by smoothing: P̂_j = S_j R_j;
        Estimate the norm ŝ_j = sqrt(E[P_j^2]) using, for example, (15) or (35);
        Soft-threshold: f_j = [1 - λ/ŝ_j]_+ P̂_j;
        Center: f_j ← f_j - mean(f_j).
    Output: Component functions f_j and estimator m̂(X_i) = sum_j f_j(X_{ij}).

Figure 1: THE SPAM BACKFITTING ALGORITHM
Condition (11), in turn, implies

    (1 + λ / sqrt(E(f_j^2))) sqrt(E(f_j^2)) = sqrt(E(P_j^2)), or sqrt(E(f_j^2)) = sqrt(E(P_j^2)) - λ.    (12)

Thus, we arrive at the following multiplicative soft-thresholding update for f_j:

    f_j = [1 - λ / sqrt(E(P_j^2))]_+ P_j    (13)

where [·]_+ denotes the positive part. In the finite sample case, as in standard backfitting (Hastie and Tibshirani, 1999), we estimate the projection E[R_j | X_j] by a smooth of the residuals:

    P̂_j = S_j R_j    (14)

where S_j is a linear smoother, such as a local linear or kernel smoother. Let ŝ_j be an estimate of sqrt(E[P_j^2]). A simple but biased estimate is

    ŝ_j = (1/sqrt(n)) ||P̂_j||_2 = sqrt(mean(P̂_j^2)).    (15)

More accurate estimators are possible; an example is given in the appendix. We have thus derived the SpAM backfitting algorithm given in Figure 1.

While the motivating optimization problem (Q) is similar to that considered in the COSSO (Lin and Zhang, 2006) for smoothing splines, the SpAM backfitting algorithm decouples smoothing and sparsity, through a combination of soft-thresholding and smoothing. In particular, SpAM backfitting can be carried out with any nonparametric smoother; it is not restricted to splines. Moreover, by iteratively estimating over the components and using soft thresholding, our procedure is simple to implement and scales to high dimensions.

3.1 SpAM for Nonparametric Logistic Regression

The SpAM backfitting procedure can be extended to nonparametric logistic regression for classification. 
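Before turning to the logistic case, the squared-error backfitting loop of Figure 1 can be sketched in a few lines. The Nadaraya-Watson smoother, bandwidth, and variable names below are illustrative assumptions, not the paper's implementation; any linear smoother S_j could be substituted.

```python
import numpy as np

def nw_smooth(x, r, h=0.1):
    # Nadaraya-Watson Gaussian-kernel smooth of r against x,
    # evaluated at the sample points themselves (this plays the role of S_j).
    d = (x[:, None] - x[None, :]) / h
    w = np.exp(-0.5 * d ** 2)
    return (w @ r) / w.sum(axis=1)

def spam_backfit(X, y, lam, h=0.1, n_iter=25):
    """Sketch of the SpAM backfitting loop of Figure 1."""
    n, p = X.shape
    f = np.zeros((n, p))      # fitted component functions at the data points
    y_c = y - y.mean()        # work with a centered response
    for _ in range(n_iter):
        for j in range(p):
            r_j = y_c - (f.sum(axis=1) - f[:, j])   # residual R_j
            p_j = nw_smooth(X[:, j], r_j, h)        # P_j-hat = S_j R_j
            s_j = np.sqrt(np.mean(p_j ** 2))        # s_j-hat, as in (15)
            if s_j <= lam:
                f[:, j] = 0.0                       # soft-threshold to zero
            else:
                f[:, j] = (1.0 - lam / s_j) * p_j   # multiplicative shrinkage
                f[:, j] -= f[:, j].mean()           # center
    return f
```

Components whose smoothed residual norm falls below λ are set exactly to zero, so irrelevant covariates drop out of the fit entirely.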
The additive logistic model is

    P(Y = 1 | X) ≡ p(X; f) = exp(sum_{j=1}^p f_j(X_j)) / (1 + exp(sum_{j=1}^p f_j(X_j)))    (16)

where Y ∈ {0, 1}, and the population log-likelihood is ℓ(f) = E[Y f(X) - log(1 + exp f(X))]. Recall that in the local scoring algorithm for generalized additive models (Hastie and Tibshirani, 1999) in the logistic case, one runs the backfitting procedure within Newton's method. Here one iteratively computes the transformed response for the current estimate f_0

    Z_i = f_0(X_i) + (Y_i - p(X_i; f_0)) / (p(X_i; f_0)(1 - p(X_i; f_0)))    (17)

and weights w(X_i) = p(X_i; f_0)(1 - p(X_i; f_0)), and carries out a weighted backfitting of (Z, X) with weights w. The weighted smooth is given by

    P̂_j = S_j(w R_j) / (S_j w).    (18)

To incorporate the sparsity penalty, we first note that the Lagrangian is given by

    L(f, λ, μ) = E[log(1 + exp f(X)) - Y f(X)] + λ sum_{j=1}^p sqrt(E(f_j^2(X_j))) + sum_j μ_j E(f_j)    (19)

and the stationary condition for component function f_j is E(p - Y | X_j) + λ v_j = 0, where v_j is an element of the subgradient ∂ sqrt(E(f_j^2)). As in the unregularized case, this condition is nonlinear in f, and so we linearize the gradient of the log-likelihood around f_0. This yields the linearized condition E[w(X)(f(X) - Z) | X_j] + λ v_j = 0. When E(f_j^2) ≠ 0, this implies the condition

    (E(w | X_j) + λ / sqrt(E(f_j^2))) f_j(X_j) = E(w R_j | X_j).    (20)

In the finite sample case, in terms of the smoothing matrix S_j, this becomes

    f_j = S_j(w R_j) / (S_j w + λ / sqrt(E(f_j^2))).    (21)

If ||S_j(w R_j)||_2 < λ, then f_j = 0. 
Otherwise, this implicit, nonlinear equation for f_j cannot be solved explicitly, so we propose to iterate until convergence:

    f_j ← S_j(w R_j) / (S_j w + λ sqrt(n) / ||f_j||_2).    (22)

When λ = 0, this yields the standard local scoring update (18). An example of logistic SpAM is given in Section 5.

4 Properties of SpAM

4.1 SpAM is Persistent

The notion of risk consistency, or persistence, was studied by Juditsky and Nemirovski (2000) and Greenshtein and Ritov (2004) in the context of linear models. Let (X, Y) denote a new pair (independent of the observed data) and define the predictive risk when predicting Y with f(X) by

    R(f) = E(Y - f(X))^2.    (23)

Since we consider predictors of the form f(x) = sum_j β_j g_j(x_j), we also write the risk as R(β, g), where β = (β_1, ..., β_p) and g = (g_1, ..., g_p). Following Greenshtein and Ritov (2004), we say that an estimator m̂_n is persistent relative to a class of functions M_n if

    R(m̂_n) - R(m*_n) → 0 in probability,    (24)

where m*_n = argmin_{f ∈ M_n} R(f) is the predictive oracle. Greenshtein and Ritov (2004) showed that the lasso is persistent for the class of linear models M_n = {f(x) = x^T β : ||β||_1 ≤ L_n} if L_n = o((n / log n)^{1/4}). We show a similar result for SpAM.

Theorem 4.1. Suppose that p_n ≤ e^{nξ} for some ξ < 1. 
Then SpAM is persistent relative to the class of additive models M_n = {f(x) = sum_{j=1}^p β_j g_j(x_j) : ||β||_1 ≤ L_n} if L_n = o(n^{(1-ξ)/4}).

4.2 SpAM is Sparsistent

In the case of linear regression, with m_j(X_j) = β_j^T X_j, Wainwright (2006) shows that under certain conditions on n, p, s = |supp(β)|, and the design matrix X, the lasso recovers the sparsity pattern asymptotically; that is, the lasso estimator β̂_n is sparsistent: P(supp(β) = supp(β̂_n)) → 1. We show a similar result for SpAM with the sparse backfitting procedure.

For the purpose of analysis, we use orthogonal function regression as the smoothing procedure. For each j = 1, ..., p let ψ_j be an orthogonal basis for H_j. We truncate the basis to finite dimension d_n, and let d_n → ∞ such that d_n / n → 0. Let Ψ_j denote the n × d_n matrix Ψ_j(i, k) = ψ_{jk}(X_{ij}). If A ⊂ {1, ..., p}, we denote by Ψ_A the n × d_n|A| matrix in which each Ψ_j, j ∈ A, appears as a submatrix in the natural way. The SpAM optimization problem can then be written as

    min_β (1/2n) ||Y - sum_{j=1}^p Ψ_j β_j||_2^2 + λ_n sum_{j=1}^p sqrt((1/n) β_j^T Ψ_j^T Ψ_j β_j)    (25)

where each β_j is a d_n-dimensional vector. Let S denote the true set of variables {j : m_j ≠ 0}, with s = |S|, and let S^c denote its complement. 
Let Ŝ_n = {j : β̂_j ≠ 0} denote the estimated set of variables from the minimizer β̂_n of (25).

Theorem 4.2. Suppose that Ψ satisfies the conditions

    Λ_max((1/n) Ψ_S^T Ψ_S) ≤ C_max < ∞ and Λ_min((1/n) Ψ_S^T Ψ_S) ≥ C_min > 0    (26)

    ||((1/n) Ψ_{S^c}^T Ψ_S)((1/n) Ψ_S^T Ψ_S)^{-1}||_2 ≤ sqrt(C_min / C_max) (1 - δ) / sqrt(s), for some 0 < δ ≤ 1.    (27)

Let the regularization parameter λ_n → 0 be chosen to satisfy

    λ_n sqrt(s d_n) → 0,   s / (d_n λ_n) → 0,   and   d_n (log d_n + log(p - s)) / (n λ_n^2) → 0.    (28)

Then SpAM is sparsistent: P(Ŝ_n = S) → 1.

5 Experiments

In this section we present experimental results for SpAM applied to both synthetic and real data, including regression and classification examples that illustrate the behavior of the algorithm in various conditions. We first use simulated data to investigate the performance of the SpAM backfitting algorithm, where the true sparsity pattern is known. We then apply SpAM to some real data. Unless explicitly stated otherwise, the data are rescaled to lie in a d-dimensional cube [0, 1]^d, and a kernel smoother with a Gaussian kernel is used. To tune the penalization parameter λ, we use a C_p statistic, defined as

    C_p(f̂) = (1/n) sum_{i=1}^n (Y_i - sum_{j=1}^p f̂_j(X_j))^2 + (2 σ̂^2 / n) sum_{j=1}^p trace(S_j) 1[f̂_j ≠ 0]    (29)

where S_j is the smoothing matrix for the j-th dimension and σ̂^2 is the estimated variance.

5.1 Simulations

We first apply SpAM to an example from (Härdle et al., 2004). A dataset with sample size n = 150 is generated from the following 200-dimensional additive model:

    Y_i = f_1(x_{i1}) + f_2(x_{i2}) + f_3(x_{i3}) + f_4(x_{i4}) + ε_i    (30)

    f_1(x) = -2 sin(2x),   f_2(x) = x^2 - 1/3,   f_3(x) = x - 1/2,   f_4(x) = e^{-x} + e^{-1} - 1    (31)

and f_j(x) = 0 for j ≥ 5, with noise ε_i ~ N(0, 1). 
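The generative model in (30)-(31) is straightforward to reproduce. The Uniform(0, 1) covariate draw below is our assumption; the paper states only that covariates are rescaled to lie in [0, 1].

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 150, 200

# Covariate distribution is an assumption; the paper rescales covariates to [0, 1].
X = rng.uniform(0.0, 1.0, size=(n, p))

def f1(x): return -2.0 * np.sin(2.0 * x)
def f2(x): return x ** 2 - 1.0 / 3.0
def f3(x): return x - 0.5
def f4(x): return np.exp(-x) + np.exp(-1.0) - 1.0

# Only the first four covariates enter the response; the other 196 are irrelevant.
y = f1(X[:, 0]) + f2(X[:, 1]) + f3(X[:, 2]) + f4(X[:, 3]) + rng.normal(size=n)
```

Note that f_2, f_3, and f_4 are centered to mean zero under the uniform distribution, matching the identifiability convention of additive models.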
These data therefore have 196 irrelevant dimensions. The results of applying SpAM with the plug-in bandwidths are summarized in Figure 2.

[Figure 2 here: component norms versus λ; C_p scores versus λ; probability of correct recovery versus sample size for p = 128 and p = 256; estimated versus true component functions m_1 through m_6.]

Figure 2: (Simulated data) Upper left: The empirical ℓ2 norm of the estimated components plotted against the tuning parameter λ; the value on the x-axis is proportional to sum_j ||f̂_j||_2. Upper center: The C_p scores against the tuning parameter λ; the dashed vertical line corresponds to the value of λ with the smallest C_p score. Upper right: The proportion of 200 trials in which the correct relevant variables are selected, as a function of sample size n. 
Lower (from left to right): Estimated (solid lines) versus true additive component functions (dashed lines) for the first 6 dimensions; the remaining components are zero.

5.2 Boston Housing

The Boston housing data was collected to study house values in the suburbs of Boston; there are altogether 506 observations with 10 covariates. The dataset has been studied by many other authors (Härdle et al., 2004; Lin and Zhang, 2006), with various transformations proposed for different covariates. To explore the sparsistency properties of our method, we add 20 irrelevant variables. Ten of them are randomly drawn from Uniform(0, 1); the remaining ten are a random permutation of the original ten covariates, so that they have the same empirical densities.

The full model (containing all 10 chosen covariates) for the Boston housing data is

    medv = α + f_1(crim) + f_2(indus) + f_3(nox) + f_4(rm) + f_5(age) + f_6(dis) + f_7(tax) + f_8(ptratio) + f_9(b) + f_10(lstat).    (32)

The result of applying SpAM to this 30-dimensional dataset is shown in Figure 3. SpAM identifies 6 nonzero components. It correctly zeros out both types of irrelevant variables. From the full solution path, the important variables are seen to be rm, lstat, ptratio, and crim. The importance of the variables nox and b is borderline. These results are basically consistent with those obtained by other authors (Härdle et al., 2004). However, using C_p as the selection criterion, the variables indus, age, dis, and tax are estimated to be irrelevant, a result not seen in other studies.

5.3 SpAM for Spam

Here we consider an email spam classification problem, using the logistic SpAM backfitting algorithm from Section 3.1. This dataset has been studied by Hastie et al. 
(2001), using a set of 3,065 emails as a training set, and conducting hypothesis tests to choose significant variables; there are a total of 4,601 observations with p = 57 attributes, all numeric. The attributes measure the percentage of specific words or characters in the email, the average and maximum run lengths of upper-case letters, and the total number of such letters. To demonstrate how SpAM performs with sparse data, we sample only n = 300 emails as the training set, with the remaining 4,301 data points used as the test set. We also use the test data as the hold-out set to tune the penalization parameter λ. The results of a typical run of logistic SpAM are summarized in Figure 4, using plug-in bandwidths.

[Figure 3 here: component norms versus λ; C_p scores versus λ; additive fits m_1, m_4, m_8, m_10 against covariates x1, x4, x8, x10.]

Figure 3: (Boston housing) Left: The empirical ℓ2 norm of the estimated components versus the regularization parameter λ. Center: The C_p scores against λ; the dashed vertical line corresponds to the best C_p score. 
Right: Additive fits for four relevant variables.

[Figure 4 here: empirical prediction error on the test set versus the penalization parameter λ.]

    λ (×10^-3)   ERROR        # ZEROS   SELECTED VARIABLES
    5.5          0.2009       55        {8, 54}
    5.0          0.1725       51        {8, 9, 27, 53, 54, 57}
    4.5          0.1354       46        {7, 8, 9, 17, 18, 27, 53, 54, 57, 58}
    4.0          0.1083 (√)   20        {4, 6-10, 14-22, 26, 27, 38, 53-58}
    3.5          0.1117       0         ALL
    3.0          0.1174       0         ALL
    2.5          0.1251       0         ALL
    2.0          0.1259       0         ALL

Figure 4: (Email spam) Classification accuracies and variable selection for logistic SpAM.

6 Acknowledgments

This research was supported in part by NSF grant CCF-0625879 and a Siebel Scholarship to PR.

References

GREENSHTEIN, E. and RITOV, Y. (2004). Persistency in high-dimensional linear predictor-selection and the virtue of over-parametrization. Bernoulli 10 971–988.

HÄRDLE, W., MÜLLER, M., SPERLICH, S. and WERWATZ, A. (2004). Nonparametric and Semiparametric Models. Springer-Verlag Inc.

HASTIE, T. and TIBSHIRANI, R. (1999). Generalized Additive Models. Chapman & Hall Ltd.

HASTIE, T., TIBSHIRANI, R. and FRIEDMAN, J. H. (2001). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag.

JUDITSKY, A. and NEMIROVSKI, A. (2000). Functional aggregation for nonparametric regression. Ann. Statist. 28 681–712.

LIN, Y. and ZHANG, H. H. (2006). Component selection and smoothing in multivariate nonparametric regression. Ann. Statist. 34 2272–2297.

MEINSHAUSEN, N. and YU, B. (2006). Lasso-type recovery of sparse representations for high-dimensional data. Tech. Rep. 720, Department of Statistics, UC Berkeley.

TIBSHIRANI, R. (1996). 
Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 58 267–288.

WAINWRIGHT, M. (2006). Sharp thresholds for high-dimensional and noisy recovery of sparsity. Tech. Rep. 709, Department of Statistics, UC Berkeley.

YUAN, M. (2007). Nonnegative garrote component selection in functional ANOVA models. In Proceedings of AI and Statistics, AISTATS.

ZHAO, P. and YU, B. (2007). On model selection consistency of lasso. J. Mach. Learn. Res. 7 2541–2567.", "award": [], "sourceid": 415, "authors": [{"given_name": "Han", "family_name": "Liu", "institution": null}, {"given_name": "Larry", "family_name": "Wasserman", "institution": null}, {"given_name": "John", "family_name": "Lafferty", "institution": null}, {"given_name": "Pradeep", "family_name": "Ravikumar", "institution": null}]}