{"title": "Trace Lasso: a trace norm regularization for correlated designs", "book": "Advances in Neural Information Processing Systems", "page_first": 2187, "page_last": 2195, "abstract": "Using the $\\ell_1$-norm to regularize the estimation of the parameter vector of a linear model leads to an unstable estimator when covariates are highly correlated. In this paper, we introduce a new penalty function which takes into account the correlation of the design matrix to stabilize the estimation. This norm, called the trace Lasso, uses the trace norm of the selected covariates, which is a convex surrogate of their rank, as the criterion of model complexity. We analyze the properties of our norm, describe an optimization algorithm based on reweighted least-squares, and illustrate the behavior of this norm on synthetic data, showing that it is more adapted to strong correlations than competing methods such as the elastic net.", "full_text": "Trace Lasso: a trace norm regularization for\n\ncorrelated designs\n\n\u00b4Edouard Grave\n\nINRIA, Sierra Project-team\n\n\u00b4Ecole Normale Sup\u00b4erieure, Paris\nedouard.grave@inria.fr\n\nGuillaume Obozinski\n\nINRIA, Sierra Project-team\n\n\u00b4Ecole Normale Sup\u00b4erieure, Paris\n\nguillaume.obozinski@inria.fr\n\nFrancis Bach\n\nINRIA, Sierra Project-team\n\n\u00b4Ecole Normale Sup\u00b4erieure, Paris\nfrancis.bach@inria.fr\n\nAbstract\n\nUsing the (cid:96)1-norm to regularize the estimation of the parameter vector of a linear\nmodel leads to an unstable estimator when covariates are highly correlated. In this\npaper, we introduce a new penalty function which takes into account the correla-\ntion of the design matrix to stabilize the estimation. This norm, called the trace\nLasso, uses the trace norm of the selected covariates, which is a convex surrogate\nof their rank, as the criterion of model complexity. 
We analyze the properties of\nour norm, describe an optimization algorithm based on reweighted least-squares,\nand illustrate the behavior of this norm on synthetic data, showing that it is more\nadapted to strong correlations than competing methods such as the elastic net.\n\n1\n\nIntroduction\n\nThe concept of parsimony is central in many scienti\ufb01c domains. In the context of statistics, signal\nprocessing or machine learning, it takes the form of variable or feature selection problems, and is\ncommonly used in two situations: \ufb01rst, to make the model or the prediction more interpretable or\ncheaper to use, i.e., even if the underlying problem does not admit sparse solutions, one looks for the\nbest sparse approximation. Second, sparsity can also be used given prior knowledge that the model\nshould be sparse. Many methods have been designed to learn sparse models, namely methods based\non greedy algorithms [1, 2], Bayesian inference [3] or convex optimization [4, 5].\nIn this paper, we focus on the regularization by sparsity-inducing norms. The simplest example\nof such norms is the (cid:96)1-norm, leading to the Lasso, when used within a least-squares framework.\nIn recent years, a large body of work has shown that the Lasso was performing optimally in high-\ndimensional low-correlation settings, both in terms of prediction [6], estimation of parameters or\nestimation of supports [7, 8]. However, most data exhibit strong correlations, with various correla-\ntion structures, such as clusters (i.e., close to block-diagonal covariance matrices) or sparse graphs,\nsuch as for example problems involving sequences (in which case, the covariance matrix is close to\na Toeplitz matrix [9]). 
In these situations, the Lasso is known to have stability problems: although its predictive performance is not disastrous, the selected predictors may vary a lot (typically, given two correlated variables, the Lasso will only select one of the two, at random).

Several remedies have been proposed for this instability. First, the elastic net [10] adds a strongly convex penalty term (the squared ℓ2-norm) that stabilizes selection (typically, given two correlated variables, the elastic net will select both). However, it is blind to the exact correlation structure, and while strong convexity is required for some variables, it is not for others. Another solution is to consider the group Lasso, which divides the predictors into groups and penalizes the sum of the ℓ2-norms of these groups [11]. This is known to accommodate strong correlations within groups [12]; however, it requires knowing the groups in advance, which is not always possible. A third line of research has focused on sampling-based techniques [13, 14, 15].

An ideal regularizer should thus take into account the design (like the group Lasso, with oracle groups), but without requiring human intervention (like the elastic net); it should add strong convexity only where needed, without modifying variables where things behave correctly. In this paper, we propose a new norm towards this end. More precisely, we make the following contributions:

• We propose in Section 2 a new norm based on the trace norm (a.k.a. nuclear norm) that interpolates between the ℓ1-norm and the ℓ2-norm depending on correlations.
• We show in Section 2.2 that there is a unique minimum when penalizing with this norm.
• We provide optimization algorithms based on reweighted least-squares in Section 3.
• We study the second-order expansion around independence and relate it to existing work on including correlations in Section 4.
• We perform synthetic experiments in Section 5, where we show that the trace Lasso outperforms existing norms in strong-correlation regimes.

Notations. Let M ∈ ℝ^{n×p}. We use superscripts for the columns of M, i.e., M^(i) denotes the i-th column, and subscripts for the rows, i.e., M_i denotes the i-th row. For M ∈ ℝ^{p×p}, diag(M) ∈ ℝ^p is the diagonal of the matrix M, while for u ∈ ℝ^p, Diag(u) ∈ ℝ^{p×p} is the diagonal matrix whose diagonal elements are the u_i. Let S be a subset of {1, ..., p}; then u_S is the vector u restricted to the support S, with 0 outside the support S. We denote by S_p the set of symmetric matrices of size p. We use the following matrix norms: ‖M‖_* is the trace norm, i.e., the sum of the singular values of M; ‖M‖_op is the operator norm, i.e., the maximum singular value of M; ‖M‖_F is the Frobenius norm, i.e., the ℓ2-norm of the singular values, which is also equal to √tr(M⊤M); and ‖M‖_{2,1} is the sum of the ℓ2-norms of the columns of M: ‖M‖_{2,1} = Σ_{i=1}^p ‖M^(i)‖_2.

2 Definition and properties of the trace Lasso

We consider the problem of predicting y ∈ ℝ, given a vector x ∈ ℝ^p, assuming a linear model

y = w⊤x + ε,

where ε is an additive (typically Gaussian) noise with mean 0 and variance σ². 
Given a training set X = (x_1, ..., x_n)⊤ ∈ ℝ^{n×p} and y = (y_1, ..., y_n)⊤ ∈ ℝ^n, a widely used method to estimate the parameter vector w is penalized empirical risk minimization:

ŵ ∈ argmin_w (1/n) Σ_{i=1}^n ℓ(y_i, w⊤x_i) + λ f(w),    (1)

where ℓ is a loss function used to measure the error we make by predicting w⊤x_i instead of y_i, while f is a regularization term used to penalize complex models. This second term helps avoid overfitting, especially in the case where we have many more parameters than observations, i.e., n ≪ p.

2.1 Related work

We now present some classical penalty functions for linear models which are widely used in the machine learning and statistics communities. The first one, known as Tikhonov regularization [16] or ridge [17], is the squared ℓ2-norm. When used with the square loss, estimating the parameter vector w amounts to solving a linear system. One of the main drawbacks of this penalty function is that it does not perform variable selection and thus does not behave well in sparse high-dimensional settings.

Hence, it is natural to penalize linear models by the number of variables used by the model. Unfortunately, this criterion, sometimes denoted by ‖·‖_0 (ℓ0-penalty), is not convex, and solving the problem in Eq. (1) is generally NP-hard [18]. Thus, a convex relaxation for this problem was introduced, replacing the size of the selected subset by the ℓ1-norm of w. This estimator is known as the Lasso [4] in the statistics community and basis pursuit [5] in signal processing. Under some assumptions, the two problems are in fact equivalent (see for example [19] and references therein). When two predictors are highly correlated, the Lasso has a very unstable behavior: it often only selects the variable that is the most correlated with the residual. 
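This instability can be seen directly on a tiny example. The following numpy sketch (illustrative, not from the paper) duplicates a predictor: the two candidate solutions fit the data equally well and have the same ℓ1-norm, so the Lasso has no way to prefer one over the other, whereas the squared ℓ2-norm strictly prefers the shared solution.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(50)
X = np.column_stack([x, x])          # two perfectly correlated predictors
y = X @ np.array([0.5, 0.5])

# Two parameter vectors giving exactly the same predictions...
w_a = np.array([1.0, 0.0])           # all weight on one copy
w_b = np.array([0.5, 0.5])           # weight shared between the copies
assert np.allclose(X @ w_a, X @ w_b)

# ...and the same l1-norm, so the Lasso cannot choose between them,
l1_a, l1_b = np.abs(w_a).sum(), np.abs(w_b).sum()   # both equal 1.0

# while the squared l2-norm strictly prefers the shared solution.
l2_a, l2_b = (w_a ** 2).sum(), (w_b ** 2).sum()     # 1.0 vs 0.5
```

Any convex combination of w_a and w_b is equally good for the ℓ1-penalized problem, which is why the selected support can flip arbitrarily between correlated variables.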
On the other hand, Tikhonov regularization tends to shrink coefficients of correlated variables together, leading to a very stable behavior. In order to get the best of both worlds, stability and variable selection, Zou and Hastie introduced the elastic net [10], which is the sum of the ℓ1-norm and the squared ℓ2-norm. Unfortunately, this estimator needs two regularization parameters and is not adaptive to the precise correlation structure of the data. Some authors also proposed to use pairwise correlations between predictors to interpolate more adaptively between the ℓ1-norm and the squared ℓ2-norm, by introducing a method called the pairwise elastic net [20] (see comparisons with our approach in Section 5).

Finally, when one has more knowledge about the data, for example clusters of variables that should be selected together, one can use the group Lasso [11]. Given a partition (S_i) of the set of variables, it is defined as the sum of the ℓ2-norms of the restricted vectors w_{S_i}:

‖w‖_GL = Σ_{i=1}^k ‖w_{S_i}‖_2.

The effect of this penalty function is to introduce sparsity at the group level: variables in a group are selected all together. One of the main drawbacks of this method, which is also sometimes one of its qualities, is that one needs to know the partition of the variables, and so one needs to have a good knowledge of the data.

2.2 The ridge, the Lasso and the trace Lasso

In this section, we show that Tikhonov regularization and the Lasso penalty can be viewed as norms of the matrix X Diag(w). We then introduce a new norm involving this matrix. The solution of empirical risk minimization penalized by the ℓ1-norm or ℓ2-norm is not equivariant to rescaling of the predictors X^(i), so it is common to normalize the predictors. 
When normalizing the predictors X^(i), and penalizing by Tikhonov regularization or by the Lasso, people are implicitly using a regularization term that depends on the data or design matrix X. In fact, there is an equivalence between normalizing the predictors and not normalizing them, using the two following reweighted ℓ2- and ℓ1-norms instead of Tikhonov regularization and the Lasso:

‖w‖²_2 = Σ_{i=1}^p ‖X^(i)‖²_2 w²_i   and   ‖w‖_1 = Σ_{i=1}^p ‖X^(i)‖_2 |w_i|.    (2)

These two norms can be expressed using the matrix X Diag(w):

‖w‖_2 = ‖X Diag(w)‖_F   and   ‖w‖_1 = ‖X Diag(w)‖_{2,1},

and a natural question arises: are there other relevant choices of functions or matrix norms? A classical measure of the complexity of a model is the number of predictors used by this model, which is equal to the size of the support of w. This penalty being non-convex, people use its convex relaxation, the ℓ1-norm, leading to the Lasso.

Here, we propose a different measure of complexity, which can be shown to be more adapted in model selection settings [21]: the dimension of the subspace spanned by the selected predictors. This is equal to the rank of the selected predictors, or equivalently to the rank of the matrix X Diag(w). Like the size of the support, this function is non-convex, and we propose to replace it by a convex surrogate, the trace norm, leading to the following penalty that we call the "trace Lasso":

Ω(w) = ‖X Diag(w)‖_*.

The trace Lasso has some interesting properties: if all the predictors are orthogonal, then it is equal to the ℓ1-norm. 
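Both limiting cases can be checked numerically. The following minimal numpy sketch (illustrative, not from the paper) computes Ω(w) as the sum of singular values of X Diag(w), and verifies the two extremes: orthonormal predictors recover the ℓ1-norm, identical predictors recover a scaled ℓ2-norm.

```python
import numpy as np

def trace_lasso(X, w):
    """Trace Lasso penalty: sum of singular values of X @ Diag(w)."""
    return np.linalg.svd(X @ np.diag(w), compute_uv=False).sum()

rng = np.random.default_rng(0)
w = rng.standard_normal(4)

# Orthonormal predictors: the penalty reduces to the l1-norm of w.
X_orth = np.linalg.qr(rng.standard_normal((10, 4)))[0]   # orthonormal columns
assert np.isclose(trace_lasso(X_orth, w), np.abs(w).sum())

# Identical predictors: X Diag(w) is rank one, and the penalty
# reduces to the (scaled) l2-norm of w.
x = rng.standard_normal(10)
X_dup = np.tile(x[:, None], (1, 4))                      # all columns equal
assert np.isclose(trace_lasso(X_dup, w),
                  np.linalg.norm(x) * np.linalg.norm(w))
```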
Indeed, we have the decomposition

X Diag(w) = Σ_{i=1}^p (‖X^(i)‖_2 w_i) (X^(i) / ‖X^(i)‖_2) e_i⊤,

where the e_i are the vectors of the canonical basis. Since the predictors are orthogonal and the e_i are orthogonal too, this gives the singular value decomposition of X Diag(w), and we get

‖X Diag(w)‖_* = Σ_{i=1}^p ‖X^(i)‖_2 |w_i| = ‖X Diag(w)‖_{2,1}.

On the other hand, if all the predictors are equal to X^(1), then

X Diag(w) = X^(1) w⊤,

and we get ‖X Diag(w)‖_* = ‖X^(1)‖_2 ‖w‖_2 = ‖X Diag(w)‖_F, which is equivalent to Tikhonov regularization. Thus, when two predictors are strongly correlated, our norm behaves like Tikhonov regularization, while for almost uncorrelated predictors, it behaves like the Lasso.

Always having a unique minimum is an important property for a statistical estimator, as it is a first step towards stability. The trace Lasso, by adding strong convexity exactly in the direction of highly correlated covariates, always has a unique minimum, and is thus much more stable than the Lasso.

Proposition 1. If the loss function ℓ is strongly convex with respect to its second argument, then the solution of the empirical risk minimization penalized by the trace Lasso, i.e., Eq. (1), is unique.

The technical proof of this proposition can be found in [22], and consists in showing that in the flat directions of the loss function, the trace Lasso is strongly convex.

2.3 A new family of penalty functions

In this section, we introduce a new family of penalties, inspired by the trace Lasso, allowing us to write the ℓ1-norm, the ℓ2-norm and the newly introduced trace Lasso as special cases. 
In fact, we note that ‖Diag(w)‖_* = ‖w‖_1 and ‖p^{-1/2} 1⊤ Diag(w)‖_* = ‖w⊤‖_* = ‖w‖_2. In other words, we can express the ℓ1- and ℓ2-norms of w using the trace norm of a given matrix times the matrix Diag(w). A natural question to ask is: what happens when using a matrix P other than the identity or the row vector p^{-1/2} 1⊤, and what are good choices of such matrices? Therefore, we introduce the following family of penalty functions:

Definition 1. Let P ∈ ℝ^{k×p}, all of its columns having unit norm. We introduce the norm Ω_P as

Ω_P(w) = ‖P Diag(w)‖_*.

Proof. The positive homogeneity and triangle inequality are direct consequences of the linearity of w ↦ P Diag(w) and the fact that ‖·‖_* is a norm. Since none of the columns of P is equal to zero, we have

P Diag(w) = 0 ⇔ w = 0,

and so Ω_P separates points and thus is a norm.

As stated before, the ℓ1- and ℓ2-norms are special cases of the family of norms we just introduced. Another important penalty that can be expressed as a special case is the group Lasso, with non-overlapping groups. Given a partition (S_j) of the set {1, ..., p}, the group Lasso is defined by

‖w‖_GL = Σ_{S_j} ‖w_{S_j}‖_2.

We define the matrix P^GL by

P^GL_{ij} = 1/√|S_k|  if i and j are in the same group S_k,  and 0 otherwise.

Figure 1: Unit balls for various values of P⊤P. See the text for the values of P⊤P. (Best seen in color.)

Then,

P^GL Diag(w) = Σ_{S_j} (1_{S_j} / √|S_j|) w_{S_j}⊤.    (3)

Using the fact that (S_j) is a partition of {1, ..., p}, the vectors 1_{S_j} are orthogonal, and so are the vectors w_{S_j}. Hence, after normalizing the vectors, Eq. 
(3) gives a singular value decomposition of P^GL Diag(w), and so the group Lasso penalty can be expressed as a special case of our family of norms:

‖P^GL Diag(w)‖_* = Σ_{S_j} ‖w_{S_j}‖_2 = ‖w‖_GL.

In the following proposition, we show that our norm only depends on the value of P⊤P. This is an important property for the trace Lasso, where P = X, since it underlies the fact that this penalty only depends on the correlation matrix X⊤X of the covariates.

Proposition 2. Let P ∈ ℝ^{k×p}, all of its columns having unit norm. We have

Ω_P(w) = ‖(P⊤P)^{1/2} Diag(w)‖_*.

We plot the unit ball of our norm for the following values of P⊤P (see Figure 1):

( 1  1  0 )      ( 1    0.7  0.49 )      ( 1    0.9  0.1 )
( 1  1  0 )      ( 0.7  1    0.7  )      ( 0.9  1    0.1 )
( 0  0  1 )      ( 0.49 0.7  1    )      ( 0.1  0.1  1   )

We can lower bound and upper bound our norms by the ℓ2-norm and the ℓ1-norm respectively. This shows that, as for the elastic net, our norms interpolate between the ℓ1-norm and the ℓ2-norm. But the main difference between the elastic net and our norms is the fact that our norms are adaptive, and require a single regularization parameter to tune. In particular for the trace Lasso, when two covariates are strongly correlated, it will be close to the ℓ2-norm, while when two covariates are almost uncorrelated, it will behave like the ℓ1-norm. This is a behavior close to that of the pairwise elastic net [20].

Proposition 3. Let P ∈ ℝ^{k×p}, all of its columns having unit norm. We have

‖w‖_2 ≤ Ω_P(w) ≤ ‖w‖_1.

2.4 Dual norm

The dual norm is an important quantity for both optimization and theoretical analysis of the estimator. 
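Propositions 2 and 3 are easy to check numerically. The following sketch (illustrative, not from the paper) draws a random P with unit-norm columns, verifies that Ω_P is unchanged when P is replaced by (P⊤P)^{1/2}, and checks the ℓ2/ℓ1 sandwich bounds.

```python
import numpy as np

def omega(P, w):
    """Omega_P(w) = trace norm of P @ Diag(w)."""
    return np.linalg.svd(P @ np.diag(w), compute_uv=False).sum()

rng = np.random.default_rng(1)
k, p = 20, 5
P = rng.standard_normal((k, p))
P /= np.linalg.norm(P, axis=0)       # unit-norm columns, as in Definition 1
w = rng.standard_normal(p)

# Proposition 2: the norm only depends on P.T @ P.
eigval, eigvec = np.linalg.eigh(P.T @ P)
root = eigvec @ np.diag(np.sqrt(np.clip(eigval, 0, None))) @ eigvec.T
assert np.isclose(omega(P, w), omega(root, w))   # root = (P.T P)^{1/2}

# Proposition 3: l2 <= Omega_P <= l1.
assert np.linalg.norm(w) <= omega(P, w) + 1e-10
assert omega(P, w) <= np.abs(w).sum() + 1e-10
```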
Unfortunately, we are not able in general to obtain a closed-form expression of the dual norm for the family of norms we just introduced. However, we can obtain a bound, which is exact for some special cases:

Proposition 4. The dual norm, defined by Ω*_P(u) = max_{Ω_P(v) ≤ 1} u⊤v, can be bounded by:

Ω*_P(u) ≤ ‖P Diag(u)‖_op.

Proof. Using the fact that diag(P⊤P) = 1, we have

u⊤v = tr(Diag(u) P⊤P Diag(v)) ≤ ‖P Diag(u)‖_op ‖P Diag(v)‖_*,

where the inequality comes from the fact that the operator norm ‖·‖_op is the dual norm of the trace norm. The definition of the dual norm then gives the result.

As a corollary, we can bound the dual norm by a constant times the ℓ∞-norm:

Ω*_P(u) ≤ ‖P Diag(u)‖_op ≤ ‖P‖_op ‖Diag(u)‖_op = ‖P‖_op ‖u‖_∞.

Using Proposition 3, we also have the inequality Ω*_P(u) ≥ ‖u‖_∞.

3 Optimization algorithm

In this section, we introduce an algorithm to estimate the parameter vector w when the loss function is the square loss, ℓ(y, w⊤x) = ½(y − w⊤x)², and the penalty is the trace Lasso. It is straightforward to extend this algorithm to the family of norms indexed by P. The problem we consider is thus

min_w ½ ‖y − Xw‖²_2 + λ ‖X Diag(w)‖_*.

We could optimize this cost function by subgradient descent, but this is quite inefficient: computing the subgradient of the trace Lasso is expensive and the rate of convergence of subgradient descent is quite slow. Instead, we consider an iteratively reweighted least-squares method. First, we need to
First, we need to\nintroduce a well-known variational formulation for the trace norm [23]:\nProposition 5. Let M \u2208 Rn\u00d7p. The trace norm of M is equal to:\n\n(cid:107)M(cid:107)\u2217 =\n\n(cid:16)\n\n1\n2\n\ninf\nS(cid:23)0\n\ntr(cid:0)M(cid:62)S\u22121M(cid:1) + tr (S) ,\n\nMM(cid:62)(cid:17)1/2\nw(cid:62) Diag(cid:0) diag(X(cid:62)S\u22121X)(cid:1)w +\n\n.\n\nand the in\ufb01mum is attained for S =\n\nUsing this proposition, we can reformulate the previous optimization problem as\n\nmin\n\nw\n\ninf\nS(cid:23)0\n\n1\n2\n\n(cid:107)y \u2212 Xw(cid:107)2\n\n2 +\n\n\u03bb\n2\n\n\u03bb\n2\n\ntr(S).\n\nThis problem is jointly convex in (w, S) [24]. In order to optimize this objective function by alter-\n2 tr(S\u22121). Otherwise, the in\ufb01mum\nnating the minimization over w and S, we need to add a term \u03bb\u00b5i\nover S could be attained at a non invertible S, leading to a non convergent algorithm. The in\ufb01mum\n\nover S is then attained for S =(cid:0)X Diag(w)2X(cid:62) + \u00b5iI(cid:1)1/2.\nwhere D = Diag(cid:0)diag(X(cid:62)S\u22121X)(cid:1). It is equivalent to solving the linear system\n\nOptimizing over w is a least-squares problem penalized by a reweighted (cid:96)2-norm equal to w(cid:62)Dw,\n\n(X(cid:62)X + \u03bbD)w = X(cid:62)y.\n\nThis can be done ef\ufb01ciently by using a conjugate gradient method. Since the cost of multiplying\n(X(cid:62)X+\u03bbD) by a vector is O(np), solving the system has a complexity of O(knp), where k \u2264 n+1\nis the number of iterations needed to converge (see theorem 10.2.5 of [9]). Using warm restarts, k\ncan be even smaller than n, since the linear system we are solving does not change a lot from an\niteration to another. 
Below we summarize the algorithm:

ITERATIVE ALGORITHM FOR ESTIMATING w

Input: the design matrix X, the initial guess w_0, the number of iterations N, the sequence μ_i.
For i = 1...N:
• Compute the eigenvalue decomposition U Diag(s_k) U⊤ of X Diag(w_{i−1})² X⊤.
• Set D = Diag(diag(X⊤S^{-1}X)), where S^{-1} = U Diag(1/√(s_k + μ_i)) U⊤.
• Set w_i by solving the system (X⊤X + λD) w = X⊤y.

For the sequence μ_i, we use a decreasing sequence converging to ten times the machine precision.

3.1 Choice of λ

We now give a method to choose the initial parameter λ of the regularization path. In fact, we know that the vector 0 is a solution if and only if λ ≥ Ω*(X⊤y) [25]. Thus, we need to start the path at λ = Ω*(X⊤y), corresponding to the empty solution 0, and then decrease λ. Using the inequalities on the dual norm obtained in the previous section, we get

‖X⊤y‖_∞ ≤ Ω*(X⊤y) ≤ ‖X‖_op ‖X⊤y‖_∞.

Therefore, starting the path at λ = ‖X‖_op ‖X⊤y‖_∞ is a good choice.

4 Approximation around the Lasso

We recall that when P = I ∈ ℝ^{p×p}, our norm is equal to the ℓ1-norm, and we want to understand its behavior when P departs from the identity. Thus, we compute a second-order approximation of our norm around the Lasso: we add a small perturbation Δ ∈ S_p to the identity matrix, and using Prop. 
6 of [22], we obtain the following second-order approximation:

‖(I + Δ) Diag(w)‖_* = ‖w‖_1 + diag(Δ)⊤|w| + Σ_{|w_i|>0} Σ_{|w_j|>0} (Δ_ji|w_i| − Δ_ij|w_j|)² / (4(|w_i| + |w_j|)) + Σ_{|w_i|=0} Σ_{|w_j|>0} (Δ_ij|w_j|)² / (2|w_j|) + o(‖Δ‖²).

We can rewrite this approximation as

‖(I + Δ) Diag(w)‖_* = ‖w‖_1 + diag(Δ)⊤|w| + Σ_{i,j} Δ²_ij (|w_i| − |w_j|)² / (4(|w_i| + |w_j|)) + o(‖Δ‖²),

using a slight abuse of notation, considering that the last term is equal to 0 when w_i = w_j = 0. The second-order term is quite interesting: it shows that when two covariates are correlated, the effect of the trace Lasso is to shrink the corresponding coefficients toward each other. We also note that this term is very similar to pairwise elastic net penalties, which are of the form |w|⊤P|w|, where P_ij is a decreasing function of Δ_ij.

5 Experiments

In this section, we perform experiments on synthetic data to illustrate the behavior of the trace Lasso and other classical penalties when there are highly correlated covariates in the design matrix. The support S of w is equal to {1, ..., k}, where k is the size of the support. For i in the support of w, w_i is independently drawn from a uniform distribution over [−1, 1]. The observations x_i are drawn from a multivariate Gaussian with mean 0 and covariance matrix Σ. In the first setting, Σ is set to the identity; in the second setting, Σ is block diagonal with blocks equal to 0.2 I + 0.8·11⊤, corresponding to clusters of four variables; finally, in the third setting, we set Σ_ij = 0.95^|i−j|, corresponding to a Toeplitz design. For each method, we choose the best λ. We perform a first 
We perform a \ufb01rst\nseries of experiments (p = 1024, n = 256) for which we report the estimation error. For the second\nseries of experiments (p = 512, n = 128), we report the Hamming distance between the estimated\nsupport and the true support.\nIn all six graphs of Figure 2, we observe behaviors that are typical of Lasso, ridge and elastic net:\nthe Lasso performs very well on very sparse models but its performance degrades for denser models.\nThe elastic net performs better than the Lasso for settings where there are strongly correlated covari-\nates, thanks to its strongly convex (cid:96)2 term. In setting 1, since the variables are uncorrelated, there\nis no reason to couple their selection. This suggests that the Lasso should be the most appropriate\nconvex regularization. The trace Lasso approaches the Lasso when n is much larger than p, but the\nweak coupling induced by empirical correlations is suf\ufb01cient to slightly decrease its performance\ncompared to that of the Lasso. By contrast, in settings 2 and 3, the trace Lasso outperforms other\nmethods (including the pairwise elastic net) since variables that should be selected together are in-\ndeed correlated. As for the penalized elastic net, since it takes into account the correlations between\nvariables, it is not surprising that in experiments 2 and 3 it performs better than methods that do not.\nWe do not have a compelling explanation for its superior performance in experiment 1.\n\n7\n\n\fFigure 2: Left: estimation error (p = 1024, n = 256), right: support recovery (p = 512, n = 128).\n(Best seen in color. e-net stands for elastic net, pen stands for pairwise elastic net and trace\nstands for trace Lasso. 
Error bars are obtained over 20 runs.)\n\n6 Conclusion\n\nWe introduce a new penalty function, the trace Lasso, which takes advantage of the correlation\nbetween covariates to add strong convexity exactly in the directions where needed, unlike the elastic\nnet for example, which blindly adds a squared (cid:96)2-norm term in every directions. We show on\nsynthetic data that this adaptive behavior leads to better estimation performance. In the future, we\nwant to show that if a dedicated norm using prior knowledge such as the group Lasso can be used,\nthe trace Lasso will behave similarly and its performance will not degrade too much, providing\ntheoretical guarantees to such adaptivity. Finally, we will seek applications of this estimator in\ninverse problems such as deblurring, where the design matrix exhibits strong correlation structure.\n\nAcknowledgments\n\nGuillaume Obozinski and Francis Bach are supported in part by the European Research Council\n(SIERRA ERC-239993).\n\n8\n\n\fReferences\n[1] S.G. Mallat and Z. Zhang. Matching pursuits with time-frequency dictionaries. Signal Pro-\n\ncessing, IEEE Transactions on, 41(12):3397\u20133415, 1993.\n\n[2] T. Zhang. Adaptive forward-backward greedy algorithm for sparse learning with linear models.\n\nAdvances in Neural Information Processing Systems, 22, 2008.\n\n[3] M.W. Seeger. Bayesian inference and optimal design for the sparse linear model. The Journal\n\nof Machine Learning Research, 9:759\u2013813, 2008.\n\n[4] R. Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical\n\nSociety. Series B (Methodological), 58(1):267\u2013288, 1996.\n\n[5] S.S. Chen, D.L. Donoho, and M.A. Saunders. Atomic decomposition by basis pursuit. SIAM\n\njournal on scienti\ufb01c computing, 20(1):33\u201361, 1999.\n\n[6] P.J. Bickel, Y. Ritov, and A.B. Tsybakov. Simultaneous analysis of Lasso and Dantzig selector.\n\nThe Annals of Statistics, 37(4):1705\u20131732, 2009.\n\n[7] P. Zhao and B. Yu. 
On model selection consistency of Lasso. The Journal of Machine Learning\n\nResearch, 7:2541\u20132563, 2006.\n\n[8] M.J. Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using\n(cid:96)1-constrained quadratic programming (Lasso). Information Theory, IEEE Transactions on,\n55(5):2183\u20132202, 2009.\n\n[9] G.H. Golub and C.F. Van Loan. Matrix computations. Johns Hopkins Univ Pr, 1996.\n[10] H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the\n\nRoyal Statistical Society: Series B (Statistical Methodology), 67(2):301\u2013320, 2005.\n\n[11] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables.\nJournal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1):49\u201367, 2006.\n[12] F. Bach. Consistency of the group Lasso and multiple kernel learning. The Journal of Machine\n\nLearning Research, 9:1179\u20131225, 2008.\n\n[13] F. Bach. Bolasso: model consistent Lasso estimation through the bootstrap. In Proceedings of\n\nthe 25th international conference on Machine learning, pages 33\u201340. ACM, 2008.\n\n[14] H. Liu, K. Roeder, and L. Wasserman. Stability approach to regularization selection (stars) for\nhigh dimensional graphical models. Advances in Neural Information Processing Systems, 23,\n2010.\n\n[15] N. Meinshausen and P. B\u00a8uhlmann. Stability selection. Journal of the Royal Statistical Society:\n\nSeries B (Statistical Methodology), 72(4):417\u2013473, 2010.\n\n[16] A. Tikhonov. Solution of incorrectly formulated problems and the regularization method. In\n\nSoviet Math. Dokl., volume 5, page 1035, 1963.\n\n[17] A.E. Hoerl and R.W. Kennard. Ridge regression: Biased estimation for nonorthogonal prob-\n\nlems. Technometrics, 12(1):55\u201367, 1970.\n\n[18] G. Davis, S. Mallat, and M. Avellaneda. Adaptive greedy approximations. Constructive ap-\n\nproximation, 13(1):57\u201398, 1997.\n\n[19] E.J. Cand`es and T. Tao. 
Decoding by linear programming. Information Theory, IEEE Trans-\n\nactions on, 51(12):4203\u20134215, 2005.\n\n[20] A. Lorbert, D. Eis, V. Kostina, D. M. Blei, and P. J. Ramadge. Exploiting covariate similarity\nin sparse regression via the pairwise elastic net. JMLR - Proceedings of the 13th International\nConference on Arti\ufb01cial Intelligence and Statistics, 9:477\u2013484, 2010.\n\n[21] T. Hastie, R. Tibshirani, and J. Friedman. The elements of statistical learning. 2001.\n[22] E. Grave, G. Obozinski, and F. Bach. Trace lasso: a trace norm regularization for correlated\n\ndesigns. Technical report, arXiv:1109.1990, 2011.\n\n[23] A. Argyriou, T. Evgeniou, and M. Pontil. Multi-task feature learning. Advances in neural\n\ninformation processing systems, 19:41, 2007.\n\n[24] S.P. Boyd and L. Vandenberghe. Convex optimization. Cambridge Univ Pr, 2004.\n[25] F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Convex optimization with sparsity-inducing\n\nnorms. S. Sra, S. Nowozin, S. J. Wright., editors, Optimization for Machine Learning, 2011.\n\n9\n\n\f", "award": [], "sourceid": 1213, "authors": [{"given_name": "Edouard", "family_name": "Grave", "institution": null}, {"given_name": "Guillaume", "family_name": "Obozinski", "institution": null}, {"given_name": "Francis", "family_name": "Bach", "institution": null}]}