{"title": "Multi-stage Convex Relaxation for Learning with Sparse Regularization", "book": "Advances in Neural Information Processing Systems", "page_first": 1929, "page_last": 1936, "abstract": "We study learning formulations with non-convex regularization that are natural for sparse linear models. There are two approaches to this problem: (1) Heuristic methods such as gradient descent that only find a local minimum. A drawback of this approach is the lack of a theoretical guarantee showing that the local minimum gives a good solution. (2) Convex relaxation such as $L_1$-regularization that solves the problem under some conditions. However, it often leads to sub-optimal sparsity in reality. This paper tries to remedy the above gap between theory and practice. In particular, we investigate a multi-stage convex relaxation scheme for solving problems with non-convex regularization. Theoretically, we analyze the behavior of a resulting two-stage relaxation scheme for the capped-$L_1$ regularization. Our performance bound shows that the procedure is superior to the standard $L_1$ convex relaxation for learning sparse targets. Experiments confirm the effectiveness of this method on simulated and real data.", "full_text": "Multi-stage Convex Relaxation for Learning with Sparse Regularization\n\nTong Zhang\nStatistics Department\nRutgers University, NJ\ntzhang@stat.rutgers.edu\n\nAbstract\n\nWe study learning formulations with non-convex regularization that are natural for sparse linear models. There are two approaches to this problem:\n• Heuristic methods such as gradient descent that only find a local minimum. A drawback of this approach is the lack of a theoretical guarantee showing that the local minimum gives a good solution.\n• Convex relaxation such as L1-regularization that solves the problem under some conditions. 
However, it often leads to sub-optimal sparsity in reality.\n\nThis paper tries to remedy the above gap between theory and practice. In particular, we investigate a multi-stage convex relaxation scheme for solving problems with non-convex regularization. Theoretically, we analyze the behavior of a resulting two-stage relaxation scheme for the capped-L1 regularization. Our performance bound shows that the procedure is superior to the standard L1 convex relaxation for learning sparse targets. Experiments confirm the effectiveness of this method on simulated and real data.\n\n1 Introduction\n\nConsider a set of input vectors x_1, . . . , x_n ∈ R^d, with corresponding desired output variables y_1, . . . , y_n. The task of supervised learning is to estimate the functional relationship y ≈ f(x) between the input x and the output variable y from the training examples {(x_1, y_1), . . . , (x_n, y_n)}. The quality of prediction is often measured through a loss function φ(f(x), y). We assume that φ(f, y) is convex in f throughout the paper. In this paper, we consider the linear prediction model f(x) = w^T x. As in boosting or kernel methods, nonlinearity can be introduced by including nonlinear features in x.\nWe are mainly interested in the scenario where d ≫ n; that is, there are many more features than the number of samples. In this case, unconstrained empirical risk minimization is inadequate because the solution overfits the data. The standard remedy for this problem is to impose a constraint on w to obtain a regularized problem. An important target constraint is sparsity, which corresponds to the (non-convex) L0 regularization ‖w‖_0 = |{j : w_j ≠ 0}| ≤ k. If we know the sparsity
If we know the sparsity\nparameter k for the target vector, then a good learning method is L0 regularization:\n\n\u03c6(wT xi, yi)\n\nsubject to kwk0 \u2264 k.\n\n(1)\n\n\u02c6w = arg min\nw\u2208Rd\n\n1\nn\n\nnX\n\ni=1\n\nIf k is not known, then one may regard k as a tuning parameter, which can be selected through cross-\nvalidation. This method is often referred to as subset selection in the literature. Sparse learning is\nan essential topic in machine learning, which has attracted considerable interests recently. It can be\nshown that the solution of the L0 regularization problem in (1) achieves good prediction accuracy\n\n1\n\n\fif the target function can be approximated by a sparse \u00afw. However, a fundamental dif\ufb01culty with\nthis method is the computational cost, because the number of subsets of {1, . . . , d} of cardinality k\n(corresponding to the nonzero components of w) is exponential in k.\nDue to the computational dif\ufb01cult, in practice, it is necessary to replace (1) by some easier to solve\nformulations below:\n\nnX\n\ni=1\n\n\u02c6w = arg min\nw\u2208Rd\n\n1\nn\n\n\u03c6(wT xi, yi) + \u03bbg(w),\n\n(2)\n\nwith g(w) =Pd\nas \u03b1 \u2192 0,P\n\nwhere \u03bb > 0 is an appropriately chosen regularization condition. We obtain a formulation equiv-\nalent to (2) by choosing the regularization function as g(w) = kwk0. However, this function is\ndiscontinuous. For computational reasons, it is helpful to consider a continuous approximation with\ng(w) = kwkp, where p > 0. If p \u2265 1, the resulting formulation is convex. In particular, by choos-\ning the closest approximation with p = 1, one obtain Lasso, which is the standard convex relaxation\nformulation for sparse learning. With p \u2208 (0, 1), the Lp regularization kwkp is non-convex but con-\ntinuous. 
In this paper, we are also interested in the following capped-L1 approximation of kwk0,\nj=1 min(|wj|, \u03b1), where for v \u2208 R: This is a good approximation to L0 because\nj min(|wj|, \u03b1)/\u03b1 \u2192 kwk0. Therefore when \u03b1 \u2192 0, this regularization condition is\nequivalent to the sparse L0 regularization upto a rescaling of \u03bb. Note that the capped-L1 regulariza-\ntion is also non-convex. It is related to the so-called SCAD regularization in statistics, which is a\nsmoother version. We use the simpler capped-L1 regularization because the extra smoothness does\nnot affect our algorithm or theory.\nFor a non-convex but smooth regularization condition such as capped-L1 or Lp with p \u2208 (0, 1),\nstandard numerical techniques such as gradient descent leads to a local minimum solution. Unfor-\ntunately, it is dif\ufb01cult to \ufb01nd the global optimum, and it is also dif\ufb01cult to analyze the quality of the\nlocal minimum. Although in practice, such a local minimum solution may outperform the Lasso so-\nlution, the lack of theoretical (and practical) performance guarantee prevents the more wide-spread\napplications of such algorithms. As a matter of fact, results with non-convex regularization are dif-\n\ufb01cult to reproduce because different numerical optimization procedures can lead to different local\nminima. Therefore the quality of the solution heavily depend on the numerical procedure used.\nThe situation is very dif\ufb01cult for a convex relaxation formulation such as L1-regularization (Lasso).\nThe global optimum can be easily computed using standard convex programming techniques. It is\nknown that in practice, 1-norm regularization often leads to sparse solutions (although often sub-\noptimal). Moreover, its performance has been theoretically analyzed recently. 
For example, it is known from the compressed sensing literature that under certain conditions, the solution of L1 relaxation may be equivalent to L0 regularization asymptotically, even when noise is present (e.g., [3] and references therein). If the target is truly sparse, then it was shown in [9] that under some restrictive conditions, referred to as irrepresentable conditions, 1-norm regularization solves the feature selection problem. The prediction performance of this method has been considered in [4, 8, 1].\nDespite its success, L1-regularization often leads to suboptimal solutions because it is not a good approximation to L0 regularization. Statistically, this means that even though it converges to the true sparse target when n → ∞ (consistency), the rate of convergence can be suboptimal. The only way to fix this problem is to employ a non-convex regularization condition that is closer to L0 regularization, such as the capped-L1 regularization. The superiority of capped-L1 is formally proved later in this paper.\nBecause of the above gap between practice and theory, it is important to study direct solutions of non-convex regularization beyond the standard L1 relaxation. Our goal is to design a numerical procedure that leads to a reproducible solution with better theoretical behavior than L1-regularization. This paper shows how this can be done. Specifically, we consider a general multi-stage convex relaxation method for solving learning formulations with non-convex regularization. In this scheme, concave duality is used to construct a sequence of convex relaxations that give better and better approximations to the original non-convex problem. Moreover, using the capped-L1 regularization, we show that after only two stages, the solution gives better statistical performance than standard Lasso when the target is approximately sparse. 
In essence, this paper establishes a performance guarantee for non-convex formulations using a multi-stage convex relaxation approach that is more sophisticated than the one-stage convex relaxation (which is the approach commonly studied in the current literature). Experiments confirm the effectiveness of the multi-stage approach.\n\n2 Concave Duality\n\nGiven a continuous regularization function g(w) in (2), which may be non-convex, we are interested in rewriting it using concave duality. Let h(w) : R^d → Ω ⊂ R^d be a map with range Ω. It may not be a one-to-one map. However, we assume that there exists a function ḡ_h(u) defined on Ω such that g(w) = ḡ_h(h(w)) holds.\nWe assume that we can find h so that the function ḡ_h(u) is a concave function of u on Ω. Under this assumption, we can rewrite the regularization function g(w) as\n\ng(w) = inf_{v∈R^d} [ v^T h(w) + g*_h(v) ] (3)\n\nusing concave duality [6]. In this case, g*_h(v) is the concave dual of ḡ_h(u), given by\n\ng*_h(v) = inf_{u∈Ω} [ −v^T u + ḡ_h(u) ].\n\nMoreover, it is well known that the minimum of the right-hand side of (3) is achieved at\n\nv̂ = ∇_u ḡ_h(u)|_{u=h(w)}. (4)\n\nThis is a very general framework. For illustration, we include two example non-convex sparse regularization conditions discussed in the introduction.\n\nLp regularization. We consider the regularization condition g(w) = Σ_{j=1}^d |w_j|^p for some p ∈ (0, 1). Given any q > p, (3) holds with h(w) = [|w_1|^q, . . . , |w_d|^q] and g*_h(v) = c(p, q) Σ_j v_j^{p/(p−q)} defined on the domain {v : v_j ≥ 0}, where c(p, q) = (q − p) p^{p/(q−p)} q^{q/(p−q)}. In this case, ḡ_h(u) = Σ_{j=1}^d u_j^{p/q} on Ω = {u : u_j ≥ 0}. The solution in (4) is given by v̂_j = (p/q)|w_j|^{p−q}.\n\nCapped-L1 regularization. We consider the regularization condition g(w) = Σ_{j=1}^d min(|w_j|, α). In this case, (3) holds with h(w) = [|w_1|, . . . , |w_d|] and g*_h(v) = Σ_{j=1}^d α(1 − v_j) I(v_j ∈ [0, 1]) defined on the domain {v : v_j ≥ 0}, where I(·) is the set indicator function. The solution in (4) is given by v̂_j = I(|w_j| ≤ α).\n\n3 Multi-stage Convex Relaxation\n\nWe consider a general procedure for solving (2) with convex loss and non-convex regularization g(w). Let h(w) = Σ_j h_j(w) be a convex relaxation of g(w) that dominates g(w) (for example, it can be the smallest convex upper bound, i.e., the inf over all convex upper bounds, of g(w)). A simple convex relaxation of (2) becomes\n\nŵ = arg min_{w∈R^d} [ (1/n) Σ_{i=1}^n φ(w^T x_i, y_i) + λ Σ_{j=1}^d h_j(w) ]. (5)\n\nThis simple relaxation can yield a solution that is not close to the solution of (2). However, if h satisfies the condition of Section 2, then it is possible to write g(w) as (3). Now, with this new representation, we can rewrite (2) as\n\n[ŵ, v̂] = arg min_{w,v∈R^d} [ (1/n) Σ_{i=1}^n φ(w^T x_i, y_i) + λ v^T h(w) + λ g*_h(v) ]. (6)\n\nThis is clearly equivalent to (2) because of (3). If we can find a good approximation of v̂ that improves upon the initial value v̂ = [1, . . . , 1], then the above formulation can lead to a refined convex problem in w that is a better convex relaxation than (5).\nOur numerical procedure exploits the above fact: it tries to improve the estimate of v_j over the initial choice v_j = 1 in (5) using an iterative algorithm. 
This can be done using an alternating optimization procedure, which repeatedly applies the following two steps:\n\n• First we optimize w with v fixed: this is a convex problem in w with appropriately chosen h(w).\n• Second we optimize v with w fixed: although non-convex, it has a closed-form solution that is given by (4).\n\nThe general procedure is presented in Figure 1. It can be regarded as a generalization of CCCP (concave-convex programming) [7], which takes h(w) = w. By repeatedly refining the parameter v, we can potentially obtain better and better convex relaxations, leading to a solution superior to that of the initial convex relaxation. Note that using the Lp and capped-L1 regularization conditions in Section 2, this procedure leads to more specific multi-stage convex relaxation algorithms. We skip the details due to the space limitation.\n\nTuning parameters: λ\nInput: training data (x_1, y_1), . . . , (x_n, y_n)\nOutput: weight vector ŵ\nInitialize v̂_j = 1.\nRepeat the following two steps until convergence:\n• Let ŵ = arg min_{w∈R^d} [ (1/n) Σ_{i=1}^n φ(w^T x_i, y_i) + λ v̂^T h(w) ] (*)\n• Let v̂ = ∇_u ḡ_h(u)|_{u=h(ŵ)}\n\nFigure 1: Multi-stage Convex Relaxation Method\n\n4 Theory of Two-stage Convex Relaxation for Capped-L1 Regularization\n\nAlthough the reasoning in Section 3 is appealing, it is only a heuristic argument without any formal theoretical guarantee. In contrast, the simple one-stage L1 relaxation is known to perform reasonably well under certain assumptions. Therefore, unless we can develop a theory that shows the effectiveness of the multi-stage procedure in Figure 1, our proposal is merely yet another local-minimum-finding scheme that may get stuck at a bad local solution.\nThis section tries to address this issue. 
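Before turning to the theory, the alternating procedure of Figure 1 can be made concrete for the least squares loss with capped-L1 regularization. The following is a minimal numpy sketch (function names and the choice of a plain proximal-gradient/ISTA solver for the convex w-step are ours, not the paper's implementation):

```python
import numpy as np

def weighted_lasso(X, y, v, lam, n_iter=2000):
    """w-step (*): minimize (1/n)||Xw - y||^2 + lam * sum_j v_j |w_j| by ISTA."""
    n, d = X.shape
    step = n / (2.0 * np.linalg.norm(X, 2) ** 2)  # 1/L for the smooth part
    w = np.zeros(d)
    for _ in range(n_iter):
        z = w - step * (2.0 / n) * X.T @ (X @ w - y)          # gradient step
        w = np.sign(z) * np.maximum(np.abs(z) - step * lam * v, 0.0)  # soft-threshold
    return w

def multi_stage_capped_l1(X, y, lam, alpha, n_stages=3):
    """Figure 1 with h(w) = |w|: start from v = 1 (plain Lasso), then refine v
    with the closed-form capped-L1 dual update v_j = I(|w_j| <= alpha) from (4)."""
    v = np.ones(X.shape[1])
    for _ in range(n_stages):
        w = weighted_lasso(X, y, v, lam)
        v = (np.abs(w) <= alpha).astype(float)  # v-step, equation (4)
    return w
```

Stage 1 is exactly the Lasso; later stages stop penalizing coordinates whose estimates already exceed α, which is the bias-removal effect analyzed below.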
Although we have not yet developed a complete theory for the general procedure, we are able to obtain a learning bound for the capped-L1 regularization. In particular, if the target function is sparse, then the performance of the solution after merely two stages of our procedure is superior to that of Lasso. This demonstrates the effectiveness of the multi-stage approach. Since the analysis is rather complicated, we focus on the least squares loss only, and only on the solution after two stages of the algorithm.\nFor a complete theory, the following questions are worth asking:\n\n• Under what conditions is the global solution with a non-convex penalty statistically better than the (one-stage) convex relaxation solution? That is, when does it lead to better prediction accuracy or generalization error?\n• Under what conditions is there only one local minimum solution close to the solution of the initial convex relaxation, and is it also the global optimum? Moreover, does multi-stage convex relaxation find this solution?\n\nThe first question asks whether it is beneficial to use a non-convex penalty function. The second question asks whether we can effectively solve the resulting non-convex problem using multi-stage convex relaxation. The combination of the two questions leads to a satisfactory theoretical answer on the effectiveness of the multi-stage procedure.\nA general theory along this line will be developed in the full paper. In the following, instead of trying to answer the above questions separately, we provide a unified finite-sample analysis for the procedure that directly addresses the combined effect of the two questions. 
The result is adopted from [8], which justifies the multi-stage convex relaxation approach by showing that the two-stage procedure using capped-L1 regularization can lead to better generalization than the standard one-stage L1 regularization.\nThe procedure we shall analyze, which is a special case of the multi-stage algorithm in Figure 1 with capped-L1 regularization and only two stages, is described in Figure 2. It is related to the adaptive Lasso method [10]. The result is reproducible when the solution of the first stage is unique, because it involves two well-defined convex programming problems. Note that it is described with least squares loss only because our analysis assumes least squares loss; a more general analysis for other loss functions is possible but would lead to extra complications that are not central to our interests.\n\nTuning parameters: λ, α\nInput: training data (x_1, y_1), . . . , (x_n, y_n)\nOutput: weight vector ŵ′\nStage 1: Compute ŵ by solving the L1 penalization problem:\n\nŵ = arg min_{w∈R^d} [ (1/n) Σ_{i=1}^n (w^T x_i − y_i)^2 + λ‖w‖_1 ].\n\nStage 2: Solve the following selective L1 penalization problem:\n\nŵ′ = arg min_{w∈R^d} [ (1/n) Σ_{i=1}^n (w^T x_i − y_i)^2 + λ Σ_{j: |ŵ_j| ≤ α} |w_j| ].\n\nFigure 2: Two-stage capped-L1 Regularization\n\nThis particular two-stage procedure also has an intuitive interpretation (besides treating it as a special case of multi-stage convex relaxation). We shall refer to the feature components corresponding to the large weights as relevant features, and the feature components smaller than the cut-off threshold α as irrelevant features. 
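For the least squares loss, the two stages of Figure 2 can be written down directly. Here is a minimal numpy sketch (helper names are ours, and the convex subproblems are solved by plain proximal gradient rather than a production Lasso solver):

```python
import numpy as np

def solve_l1(X, y, lam, penalized, n_iter=2000):
    """Minimize (1/n)||Xw - y||^2 + lam * sum_{j in penalized} |w_j| by ISTA."""
    n, d = X.shape
    step = n / (2.0 * np.linalg.norm(X, 2) ** 2)  # 1/L for the smooth part
    mask = np.zeros(d)
    mask[list(penalized)] = 1.0   # only these coordinates are L1-penalized
    w = np.zeros(d)
    for _ in range(n_iter):
        z = w - step * (2.0 / n) * X.T @ (X @ w - y)
        w = np.sign(z) * np.maximum(np.abs(z) - step * lam * mask, 0.0)
    return w

def two_stage_capped_l1(X, y, lam, alpha):
    d = X.shape[1]
    w_hat = solve_l1(X, y, lam, penalized=range(d))          # Stage 1: plain Lasso
    small = [j for j in range(d) if abs(w_hat[j]) <= alpha]  # below the cut-off
    return solve_l1(X, y, lam, penalized=small)              # Stage 2: selective penalty
```

Stage 2 leaves the features selected in stage 1 unpenalized, which is exactly the shrinkage-removal idea discussed next.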
We observe that, as an estimation method, L1 regularization has two important effects: it shrinks the estimated weights corresponding to irrelevant features toward zero, and it also shrinks the estimated weights corresponding to relevant features toward zero. While the first effect is desirable, the second is not. In fact, we should avoid shrinking the weights corresponding to the relevant features if we can identify these features. This is why the standard L1 regularization may have sub-optimal performance. However, after the first stage of L1 regularization, we can identify the relevant features by picking the components corresponding to the largest weights; in the second stage of L1 regularization, we do not have to penalize the features selected in the first stage, as in Figure 2.\nA related method, called relaxed Lasso, was proposed recently by Meinshausen [5]; it is similar to a two-stage Dantzig selector in [2]. Their idea differs from our proposal in that in the second stage, the weight coefficients w′_j are forced to be zero when j ∉ supp_0(ŵ). It was pointed out in [5] that if supp_0(ŵ) exactly identifies all non-zero components of the target vector, then in the second stage the relaxed Lasso can asymptotically remove the bias of the first-stage Lasso. However, it is not clear what theoretical result can be stated when Lasso cannot exactly identify all relevant features. In the general case, it is not easy to ensure that relaxed Lasso does not degrade the performance when some relevant coefficients become zero in the first stage. In contrast, the two-stage penalization procedure in Figure 2, which is based on the capped-L1 regularization, does not require that all relevant features are identified. Consequently, we are able to prove a result for Figure 2 with no counterpart for relaxed Lasso.\n\nDefinition 4.1 Let w = [w_1, . . .
 , w_d] ∈ R^d and α ≥ 0. We define the set of relevant features with threshold α as\n\nsupp_α(w) = {j : |w_j| > α}.\n\nMoreover, if |w_{i_1}| ≥ · · · ≥ |w_{i_d}| are the components in descending order of magnitude, then define δ_k(w) = (Σ_{j>k} |w_{i_j}|^2)^{1/2}, the 2-norm of w after its largest k components (in absolute value) are removed.\n\nFor simplicity, we assume sub-Gaussian noise as follows.\n\nAssumption 4.1 Assume that {y_i}_{i=1,...,n} are independent (but not necessarily identically distributed) sub-Gaussians: there exists σ ≥ 0 such that for all i and all t ∈ R,\n\nE_{y_i} e^{t(y_i − E y_i)} ≤ e^{σ^2 t^2 / 2}.\n\nBoth Gaussian and bounded random variables are sub-Gaussian under the above definition. For example, if a random variable ξ ∈ [a, b], then E_ξ e^{t(ξ − Eξ)} ≤ e^{(b−a)^2 t^2 / 8}. If a random variable is Gaussian, ξ ∼ N(0, σ^2), then E_ξ e^{tξ} ≤ e^{σ^2 t^2 / 2}.\n\nTheorem 4.1 Let Assumption 4.1 hold. Let Â = (1/n) Σ_{i=1}^n x_i x_i^T, define M_Â = sup_{i≠j} |Â_{i,j}|, and assume that Â_{j,j} = 1 for all j. Consider any target vector w̄ such that E y = w̄^T x, and assume that w̄ contains only s non-zeros, where s ≤ d/3, and that M_Â s ≤ 1/6. Let k = |supp_λ(w̄)|. Consider the two-stage method in Figure 2. Given η ∈ (0, 0.5), with probability larger than 1 − 2η: if α/48 ≥ λ ≥ 12σ √(2 ln(2d/η)/n), then\n\n‖ŵ′ − w̄‖_2 ≤ 24 √(k − q) λ + 24σ √(20q/n) (1 + √(ln(1/η))) + 168 δ_k(w̄),\n\nwhere q = |supp_{1.5α}(w̄)|.\nThe proof of this theorem can be found in [8]. Note that the theorem allows the situation d ≫ n, which is what we are interested in. 
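The two quantities in Definition 4.1 are straightforward to compute; a small sketch (function names ours) may make them concrete:

```python
import numpy as np

def supp(w, alpha):
    """Relevant features with threshold alpha: {j : |w_j| > alpha}."""
    return {j for j in range(len(w)) if abs(w[j]) > alpha}

def delta_k(w, k):
    """2-norm of w after removing its k largest components in absolute value."""
    a = np.sort(np.abs(np.asarray(w, dtype=float)))[::-1]  # descending |w|
    return float(np.sqrt(np.sum(a[k:] ** 2)))
```

For instance, with w = (3, −2, 0.5, 0) we get supp(w, 1) = {0, 1} and δ_2(w) = 0.5, so δ_k(w̄) is small exactly when w̄ is well approximated by its k largest coefficients.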
The condition M_Â s ≤ 1/6, often referred to as mutual coherence, is also quite standard in the analysis of L1 regularization, e.g., in [1, 3]. Although the condition is idealized, the theorem nevertheless yields important insights into the behavior of the two-stage algorithm. This theorem leads to a bound for Lasso with α = ∞ or q = 0. The bound has the form\n\n‖ŵ′ − w̄‖_2 = O(δ_k(w̄) + √k λ).\n\nThis bound is tight for Lasso, in the sense that the right-hand side cannot be improved except for the constants. In particular, the factor O(√k λ) cannot be removed using Lasso; this can be easily verified with an orthogonal design matrix. It is known that in order for Lasso to be effective, one has to pick λ no smaller than the order σ√(ln d / n). Therefore, the generalization error of standard Lasso is of the order δ_k(w̄) + σ√(k ln d / n), which cannot be improved. Similar results appear in [1, 4].\nNow, with a small α, the bound in Theorem 4.1 can be significantly better than that of the standard Lasso result if the sparse target satisfies δ_k(w̄) ≪ √k λ and k − q ≪ k. The latter condition holds when |supp_{1.5α}(w̄)| ≈ |supp_λ(w̄)|. These conditions are satisfied when most non-zero coefficients of w̄ in supp_λ(w̄) are relatively large in magnitude and the rest are small in 2-norm; that is, when the target w̄ can be decomposed as a sparse vector with large coefficients plus another (less sparse) vector with small coefficients. In the extreme case when q = k = |supp_0(w̄)| (that is, all nonzero components of w̄ are large), we obtain ‖ŵ′ − w̄‖_2 = O(√(k ln(1/η)/n)) for the two-stage procedure, which is superior to the standard one-stage Lasso bound ‖ŵ − w̄‖_2 = O(√(k ln(d/η)/n)). 
Again, this bound cannot be improved for Lasso, and the difference can be significant when d is large.\n\n5 Experiments\n\nIn the following, we show on synthetic and real data that our multi-stage approach improves upon the standard Lasso in practice. In order to avoid clutter, we only study results for the two-stage procedure of Figure 2, which corresponds to the capped-L1 regularization. We shall also compare it to the two-stage Lp regularization method with p = 0.5, which corresponds to the adaptive Lasso approach [10]. Note that instead of tuning the α parameter in Figure 2, in these experiments we tune the number of features q in ŵ that are larger than the threshold α (i.e., q = |{j : |ŵ_j| > α}| is the number of features that are not regularized in stage 2). This is clearly more convenient than tuning α. The standard Lasso corresponds to q = 0.\nIn the first experiment, we generate an n × d random matrix with column j corresponding to [x_{1,j}, . . . , x_{n,j}], where each element of the matrix is an independent standard Gaussian N(0, 1). We then normalize its columns so that Σ_{i=1}^n x_{i,j}^2 = n. A truly sparse target β̄ is generated with k nonzero elements that are uniformly distributed in [−10, 10]. The observation is y_i = β̄^T x_i + ε_i, where each ε_i ∼ N(0, σ^2). In this experiment, we take n = 25, d = 100, k = 5, σ = 1, and repeat the experiment 100 times. The average training error and the 2-norm parameter estimation error are reported in Figure 3. We compare the performance of the two-stage method with different q versus the regularization parameter λ. 
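The synthetic design just described can be reproduced in a few lines; the following is a sketch (the helper name and the fixed random seed are ours):

```python
import numpy as np

def make_simulation(n=25, d=100, k=5, sigma=1.0, seed=0):
    """n x d i.i.d. N(0,1) design with columns rescaled so sum_i x_{ij}^2 = n,
    a k-sparse target with coefficients uniform in [-10, 10], Gaussian noise."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    X *= np.sqrt(n) / np.linalg.norm(X, axis=0)      # column normalization
    beta = np.zeros(d)
    support = rng.choice(d, size=k, replace=False)   # random sparse support
    beta[support] = rng.uniform(-10.0, 10.0, size=k)
    y = X @ beta + sigma * rng.standard_normal(n)
    return X, y, beta
```

With these defaults the design matches the paper's setting (n = 25, d = 100, k = 5, σ = 1); averaging over repeated draws reproduces the kind of curves shown in Figure 3.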
As expected, the training error becomes smaller as q increases. Compared to the standard Lasso (which corresponds to q = 0), substantially smaller estimation error is achieved with q = 3 for capped-L1 regularization and with p = 0.5 for Lp regularization. This shows that the multi-stage convex relaxation approach is effective.\n\nFigure 3: Performance of multi-stage convex relaxation on simulation data. Left: average training squared error versus λ; Right: parameter estimation error versus λ.\n\nIn the second experiment, we use real data to illustrate the effectiveness of the multi-stage approach. Due to the space limitation, we only report the performance on a single dataset, Boston Housing. This is the housing data for 506 census tracts of Boston from the 1970 census, available from the UCI Machine Learning Database Repository: http://archive.ics.uci.edu/ml/. Each census tract is a data point, with 13 features (we add a constant offset as the 14th feature), and the desired output is the housing price. In the experiment, we randomly partition the data into 20 training plus 456 test points. We perform the experiments 100 times, and report training and test squared error versus the regularization parameter λ for different q. The results are plotted in Figure 4. In this case, q = 1 achieves the best performance. This means one feature can be reliably identified in this example. In comparison, adaptive Lasso is not effective. Note that this dataset contains only a small number (d = 14) of features, which is not the case where we can expect significant benefit from the multi-stage approach (most other UCI datasets similarly contain only a small number of features). 
In order to illustrate the advantage of the two-stage method more clearly, we also consider a modified Boston Housing dataset, where we append 20 random features (similar to the simulation experiments) to the original Boston Housing data, and rerun the experiments. The results are shown in Figure 5. As expected from Theorem 4.1 and the discussion thereafter, since d becomes large, the multi-stage convex relaxation approach with capped-L1 regularization (q > 0) has a significant advantage over the standard Lasso (q = 0).\n\nFigure 4: Performance of multi-stage convex relaxation on the original Boston Housing data. Left: average training squared error versus λ; Right: test squared error versus λ.\n\nFigure 5: Performance of multi-stage convex relaxation on the modified Boston Housing data. Left: average training squared error versus λ; Right: test squared error versus λ.\n\nReferences\n\n[1] Florentina Bunea, Alexandre Tsybakov, and Marten H. Wegkamp. Sparsity oracle inequalities for the Lasso. Electronic Journal of Statistics, 1:169–194, 2007.\n[2] Emmanuel Candes and Terence Tao. The Dantzig selector: statistical estimation when p is much larger than n. Annals of Statistics, 2007.\n[3] David L. Donoho, Michael Elad, and Vladimir N. Temlyakov. Stable recovery of sparse overcomplete representations in the presence of noise. IEEE Trans. Info. Theory, 52(1):6–18, 2006.\n[4] Vladimir Koltchinskii. Sparsity in penalized empirical risk minimization. Annales de l'Institut Henri Poincaré, 2008.\n[5] Nicolai Meinshausen. Lasso with relaxation. ETH Research Report, 2005.\n[6] R. Tyrrell Rockafellar. Convex analysis. 
Princeton University Press, Princeton, NJ, 1970.\n[7] Alan L. Yuille and Anand Rangarajan. The concave-convex procedure. Neural Computation, 15:915–936, 2003.\n[8] Tong Zhang. Some sharp performance bounds for least squares regression with L1 regularization. The Annals of Statistics, 2009. To appear.\n[9] Peng Zhao and Bin Yu. On model selection consistency of Lasso. Journal of Machine Learning Research, 7:2541–2567, 2006.\n[10] Hui Zou. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101:1418–1429, 2006.\n", "award": [], "sourceid": 494, "authors": [{"given_name": "Tong", "family_name": "Zhang", "institution": null}]}