{"title": "Fused sparsity and robust estimation for linear models with unknown variance", "book": "Advances in Neural Information Processing Systems", "page_first": 1259, "page_last": 1267, "abstract": "In this paper, we develop a novel approach to the problem of learning sparse representations in the context of fused sparsity and unknown noise level. We propose an algorithm, termed Scaled Fused Dantzig Selector (SFDS), that accomplishes the aforementioned learning task by means of a second-order cone program. A special emphasize is put on the particular instance of fused sparsity corresponding to the learning in presence of outliers. We establish finite sample risk bounds and carry out an experimental evaluation on both synthetic and real data.", "full_text": "Fused sparsity and robust estimation for linear\n\nmodels with unknown variance\n\nYin Chen\n\nUniversity Paris Est, LIGM\n\n77455 Marne-la-Valle, FRANCE\nyin.chen@eleves.enpc.fr\n\nArnak S. Dalalyan\n\nENSAE-CREST-GENES\n\n92245 MALAKOFF Cedex, FRANCE\narnak.dalalyan@ensae.fr\n\nAbstract\n\nIn this paper, we develop a novel approach to the problem of learning sparse rep-\nresentations in the context of fused sparsity and unknown noise level. We propose\nan algorithm, termed Scaled Fused Dantzig Selector (SFDS), that accomplishes\nthe aforementioned learning task by means of a second-order cone program. A\nspecial emphasize is put on the particular instance of fused sparsity corresponding\nto the learning in presence of outliers. We establish \ufb01nite sample risk bounds and\ncarry out an experimental evaluation on both synthetic and real data.\n\n1\n\nIntroduction\n\nConsider the classical problem of Gaussian linear regression1:\n\n\u2217\n\n+ \u03c3\u2217\u03be,\n\n\u03be \u223c Nn(0, In),\n\nY = X\u03b2\n\n\u2217. 
Even if the ambient dimensionality p of \u03b2\n\n(1)\nwhere Y \u2208 Rn and X \u2208 Rn\u00d7p are observed, in the neoclassical setting of very large dimensional\n\u2217 is larger than n, it has proven\nunknown vector \u03b2\npossible to consistently estimate this vector under the sparsity assumption. The letter states that the\n\u2217, denoted by s and called intrinsic dimension, is small compared\nnumber of nonzero elements of \u03b2\nto the sample size n. Most famous methods of estimating sparse vectors, the Lasso and the Dantzig\nSelector (DS), rely on convex relaxation of (cid:96)0-norm penalty leading to a convex program that in-\nvolves the (cid:96)1-norm of \u03b2. More precisely, for a given \u00af\u03bb > 0, the Lasso and the DS [26, 4, 5, 3] are\nde\ufb01ned as\n\n(cid:27)\n\nL\n\n= arg min\n\u03b2\u2208Rp\n\n(cid:107)Y \u2212 X\u03b2(cid:107)2\n\n2 + \u00af\u03bb(cid:107)\u03b2(cid:107)1\n\nDS\n\n= arg min(cid:107)\u03b2(cid:107)1 subject to (cid:107)X(cid:62)(Y \u2212 X\u03b2)(cid:107)\u221e \u2264 \u00af\u03bb.\n\n(Lasso)\n\n(DS)\n\n(cid:26) 1\n\n2\n\n(cid:98)\u03b2\n\n(cid:98)\u03b2\n\nThe performance of these algorithms depends heavily on the choice of the tuning parameter \u00af\u03bb.\nSeveral empirical and theoretical studies emphasized that \u00af\u03bb should be chosen proportionally to the\nnoise standard deviation \u03c3\u2217. Unfortunately, in most applications, the latter is unavailable.\nIt is\ntherefore vital to design statistical procedures that estimate \u03b2 and \u03c3 in a joint fashion. This topic\nreceived special attention in last years, cf. [10] and the references therein, with the introduction of\ncomputationally ef\ufb01cient and theoretically justi\ufb01ed \u03c3-adaptive procedures the square-root Lasso [2]\n(a.k.a. 
scaled Lasso [24]) and ℓ1-penalized log-likelihood minimization [20].

In the present work, we are interested in the setting where β* is not necessarily sparse but, for a known q × p matrix M, the vector Mβ* is sparse. We call this setting the "fused sparsity scenario". The term "fused" sparsity, introduced by [27], originates from the case where Mβ is the discrete derivative of a signal β and the aim is to minimize the total variation; see [12, 19] for a recent overview and some asymptotic results. For general matrices M, tight risk bounds were proved in [14]. We adopt here this framework of general M and aim at designing a computationally efficient procedure capable of handling the situation of unknown noise level, for which we are able to provide theoretical guarantees along with empirical evidence of its good performance.

This goal is attained by introducing a new procedure, termed the Scaled Fused Dantzig Selector (SFDS), which is closely related to the penalized maximum likelihood estimator but has some advantages in terms of computational complexity. We establish tight risk bounds for the SFDS that are nearly as strong as those proved for the Lasso and the Dantzig selector in the case of known σ*. We also show that robust estimation in linear models can be seen as a particular example of the fused sparsity scenario. 

¹We denote by I_n the n × n identity matrix. For a vector v, we use the standard notation ‖v‖₁, ‖v‖₂ and ‖v‖_∞ for the ℓ1, ℓ2 and ℓ∞ norms, corresponding respectively to the sum of absolute values, the square root of the sum of squares and the maximum of the absolute values of the coefficients of v.
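The total-variation instance mentioned above can be made concrete in a few lines of numpy (our illustration, not code from the paper): take M to be the (p − 1) × p discrete-difference matrix; a piecewise-constant signal β* is then dense itself while Mβ* is sparse.

```python
import numpy as np

p = 8
# Discrete-difference matrix: (M @ beta)[j] = beta[j+1] - beta[j]
M = np.diff(np.eye(p), axis=0)            # shape (p-1, p)

# Piecewise-constant signal: dense in beta, sparse in M @ beta
beta_star = np.array([3., 3., 3., -1., -1., -1., -1., 2.])

m_beta = M @ beta_star
print(np.count_nonzero(beta_star))        # 8  -> beta* is not sparse
print(np.count_nonzero(m_beta))           # 2  -> M beta* is sparse (two jumps)
```

Minimizing ‖Mβ‖₁ in this example is exactly the total-variation penalty of [27].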
Finally, we carry out a "proof of concept" type experimental evaluation to show the potential of our approach.

2 Estimation under fused sparsity with unknown level of noise

2.1 Scaled Fused Dantzig Selector

We will only consider the case rank(M) = q ≤ p, which is more relevant for the applications we have in mind (image denoising and robust estimation). Under this condition, one can find a (p − q) × p matrix N such that the augmented matrix M̄ = [M⊤ N⊤]⊤ is of full rank. Let us denote by m_j the jth column of the matrix M̄⁻¹, so that M̄⁻¹ = [m₁, ..., m_p]. We also introduce:

M̄⁻¹ = [M†, N†],  M† = [m₁, ..., m_q] ∈ R^{p×q},  N† = [m_{q+1}, ..., m_p] ∈ R^{p×(p−q)}.

Given two positive tuning parameters λ and μ, we define the Scaled Fused Dantzig Selector (SFDS) (β̂, σ̂) as a solution to the following optimization problem:

minimize Σ_{j=1}^q ‖Xm_j‖₂ |(Mβ)_j| subject to
 |m_j⊤ X⊤(Xβ − Y)| ≤ λσ‖Xm_j‖₂,  j ≤ q;
 N†⊤ X⊤(Xβ − Y) = 0;
 nμσ² + Y⊤Xβ ≤ ‖Y‖²₂.   (P1)

This estimator has several attractive properties: (a) it can be efficiently computed even for very large scale problems using a second-order cone program, (b) it is equivariant with respect to scale transformations both in the response Y and in the rows of M and, finally, (c) it is closely related to the penalized maximum likelihood estimator. 
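The objects M† and N† entering (P1) are straightforward to construct numerically. The following sketch (ours; choosing the rows of N as an orthonormal basis of the complement of the row space of M is one valid option, not prescribed by the paper) builds the augmented matrix and checks the block identities M M† = I_q and M N† = 0, which follow from M̄M̄⁻¹ = I_p.

```python
import numpy as np

rng = np.random.default_rng(0)
p, q = 6, 4
M = rng.standard_normal((q, p))            # assume rank(M) = q < p

# Complete M to an invertible p x p matrix Mbar = [M; N]:
# the rows of N span the orthogonal complement of the row space of M.
_, _, Vt = np.linalg.svd(M)
N = Vt[q:]                                  # (p-q) x p, one valid choice of N

Mbar = np.vstack([M, N])
Minv = np.linalg.inv(Mbar)                  # Mbar^{-1} = [M_dagger, N_dagger]
M_dag, N_dag = Minv[:, :q], Minv[:, q:]

# Block identities used throughout Section 2:
assert np.allclose(M @ M_dag, np.eye(q))    # M M_dagger = I_q
assert np.allclose(M @ N_dag, 0)            # M N_dagger = 0
```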
Let us give further details on these points.

2.2 Relation with the penalized maximum likelihood estimator

One natural way to approach the problem of estimating β* in our setup is to rely on the standard procedure of penalized log-likelihood minimization. If the noise distribution is Gaussian, ξ ∼ N_n(0, I_n), the negative log-likelihood (up to irrelevant additive terms) is given by

ℓ(Y, X; β, σ) = n log(σ) + ‖Y − Xβ‖²₂ / (2σ²).

In the context of large dimension we are concerned with, i.e., when p/n is not small, the maximum likelihood estimator is subject to overfitting and is of very poor quality. If it is plausible to expect that the data can be fitted sufficiently well by a vector β* such that, for some matrix M, only a small fraction of the elements of Mβ* are nonzero, then one can considerably improve the quality of estimation by adding a penalty term to the log-likelihood. However, the most appealing penalty, the number of nonzero elements of Mβ, leads to a nonconvex optimization problem which cannot be efficiently solved even for moderately large values of p. Instead, convex penalties of the form Σ_j ω_j |(Mβ)_j|, where ω_j > 0 are some weights, have proven to provide highly accurate estimates at a relatively low computational cost. This corresponds to defining the estimator (β̂PL, σ̂PL) as the minimizer of the penalized log-likelihood

ℓ̄(Y, X; β, σ) = n log(σ) + ‖Y − Xβ‖²₂/(2σ²) + Σ_{j=1}^q ω_j |(Mβ)_j|.

To ensure scale equivariance, the weights ω_j should be chosen inversely proportional to σ: ω_j = σ⁻¹ ω̄_j. 
This leads to the estimator

(β̂PL, σ̂PL) = arg min_{β,σ} { n log(σ) + ‖Y − Xβ‖²₂/(2σ²) + Σ_{j=1}^q ω̄_j |(Mβ)_j| / σ }.

Although this problem can be cast [20] as a problem of convex minimization (by making the change of parameters φ = β/σ and ρ = 1/σ), it does not belong to the standard categories of convex problems that can be solved either by linear programming, second-order cone programming or semidefinite programming. Furthermore, the smooth part of the objective function is not Lipschitz, which makes it impossible to directly apply most first-order optimization methods developed in recent years. Our goal is to propose a procedure that is close in spirit to the penalized maximum likelihood but has the additional property of being computable by standard algorithms of second-order cone programming.

To achieve this goal, at the first step, we remark that it can be useful to introduce a penalty term that depends exclusively on σ and that prevents the estimator of σ* from being too large or too small. One can show that the only function (up to a multiplicative constant) that can serve as such a penalty without breaking the property of scale equivariance is the logarithmic function. 
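As a sanity check of the convexification recalled above (our sketch, on arbitrary illustrative data): in the variables (φ, ρ) = (β/σ, 1/σ), the objective n log(σ) + ‖Y − Xβ‖²₂/(2σ²) + Σ_j ω̄_j|(Mβ)_j|/σ becomes −n log(ρ) + ‖ρY − Xφ‖²₂/2 + Σ_j ω̄_j|(Mφ)_j|, whose midpoint convexity can be verified numerically.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, q = 20, 5, 5
X, Y = rng.standard_normal((n, p)), rng.standard_normal(n)
M = rng.standard_normal((q, p))
w = np.ones(q)                              # weights omega_bar_j

def obj(phi, rho):
    # Objective in the variables (phi, rho) = (beta/sigma, 1/sigma)
    return (-n * np.log(rho) + 0.5 * np.sum((rho * Y - X @ phi) ** 2)
            + np.sum(w * np.abs(M @ phi)))

# Midpoint convexity check on random pairs of points
for _ in range(100):
    u1, u2 = rng.standard_normal(p), rng.standard_normal(p)
    r1, r2 = rng.uniform(0.1, 2, size=2)
    mid = obj((u1 + u2) / 2, (r1 + r2) / 2)
    assert mid <= 0.5 * obj(u1, r1) + 0.5 * obj(u2, r2) + 1e-9
```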
Therefore, we introduce an additional tuning parameter μ > 0 and look for minimizing the criterion

nμ log(σ) + ‖Y − Xβ‖²₂/(2σ²) + Σ_{j=1}^q ω̄_j |(Mβ)_j| / σ.   (2)

If we make the change of variables φ₁ = Mβ/σ, φ₂ = Nβ/σ and ρ = 1/σ, we get a convex function for which the first-order conditions [20] take the form

m_j⊤ X⊤(Y − Xβ) ∈ ω̄_j sign({Mβ}_j),   (3)
N†⊤ X⊤(Y − Xβ) = 0,   (4)
(1/(nμ)) (‖Y‖²₂ − Y⊤Xβ) = σ².   (5)

Thus, any minimizer of (2) should satisfy these conditions. Therefore, to simplify the optimization problem, we propose to replace the minimization of (2) by the minimization of the weighted ℓ1-norm Σ_j ω̄_j|(Mβ)_j| subject to constraints that are as close as possible to (3)-(5). The only problem here is that the constraints (3) and (5) are not convex. The "convexification" of these constraints leads to the procedure described in (P1). As we explain below, the particular choice of the ω̄_j is dictated by the desire to enforce the scale equivariance of the procedure.

2.3 Basic properties

A key feature of the SFDS is its scale equivariance. Indeed, one easily checks that if (β̂, σ̂) is a solution to (P1) for some inputs X, Y and M, then α(β̂, σ̂) will be a solution to (P1) for the inputs X, αY and M, whatever the value of α ∈ R is. This is equivariance with respect to a scale change in the response Y. Our method is also equivariant with respect to a scale change in M. More precisely, if (β̂, σ̂) is a solution to (P1) for some inputs X, Y and M, then (β̂, σ̂) will be a solution to (P1) for the inputs X, Y and DM, whatever the q × q diagonal matrix D is. The latter property is important since, if we believe that for a given matrix M the vector Mβ* is sparse, then this is also the case for the vector DMβ*, for any diagonal matrix D. Having a procedure whose output is independent of the choice of D is of significant practical importance, since it leads to a solution that is robust with respect to small variations of the problem formulation.

The second attractive feature of the SFDS is that it can be computed by solving a convex optimization problem of second-order cone programming (SOCP). Recall that an SOCP is a constrained optimization problem that can be cast as the minimization with respect to w ∈ R^d of a linear function a⊤w under second-order conic constraints of the form ‖A_i w + b_i‖₂ ≤ c_i⊤w + d_i, where the A_i are some r_i × d matrices, b_i ∈ R^{r_i}, c_i ∈ R^d are some vectors and the d_i are some real numbers. The problem (P1) does belong to this category, since it can be written as min(u₁ + ... + u_q) subject to

‖Xm_j‖₂ |(Mβ)_j| ≤ u_j;  |m_j⊤ X⊤(Xβ − Y)| ≤ λσ‖Xm_j‖₂,  ∀j = 1, ..., q;
N†⊤ X⊤(Xβ − Y) = 0;  √(4nμ‖Y‖²₂σ² + (Y⊤Xβ)²) ≤ 2‖Y‖²₂ − Y⊤Xβ.

Note that all these constraints can be transformed into linear inequalities, except the last one, which is a second-order cone constraint. 
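The second-order cone constraint above is just a rewriting of the constraint nμσ² + Y⊤Xβ ≤ ‖Y‖²₂ from (P1): squaring both sides and cancelling (Y⊤Xβ)² shows the two are equivalent whenever ‖Y‖₂ > 0. A brute-force numerical check of this equivalence (our sketch):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, mu = 30, 10, 1.0
X, Y = rng.standard_normal((n, p)), rng.standard_normal(n)
c = np.sum(Y ** 2)                          # ||Y||_2^2 > 0

for _ in range(1000):
    beta = rng.standard_normal(p)
    sigma = rng.uniform(0, 2)
    a = Y @ X @ beta                        # Y^T X beta
    quad = n * mu * sigma**2 + a <= c       # constraint as stated in (P1)
    soc = np.sqrt(4 * n * mu * c * sigma**2 + a**2) <= 2 * c - a  # SOC form
    assert quad == soc                      # the two formulations agree
```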
Problems of this type can be efficiently solved by various standard toolboxes such as SeDuMi [22] or TFOCS [1].

2.4 Finite sample risk bound

To provide theoretical guarantees for our estimator, we impose the by now usual assumption of restricted eigenvalues on a suitably chosen matrix. This assumption, stated in Definition 2.1 below, was introduced and thoroughly discussed by [3]; we also refer the interested reader to [28].

Definition 2.1. We say that an n × q matrix A satisfies the restricted eigenvalue condition RE(s, 1) if

κ(s, 1) := min_{|J|≤s} min_{‖δ_{J^c}‖₁ ≤ ‖δ_J‖₁} ‖Aδ‖₂ / (√n ‖δ_J‖₂) > 0.

We say that A satisfies the strong restricted eigenvalue condition RE(s, s, 1) if

κ(s, s, 1) := min_{|J|≤s} min_{‖δ_{J^c}‖₁ ≤ ‖δ_J‖₁} ‖Aδ‖₂ / (√n ‖δ_{J∪J₀}‖₂) > 0,

where J₀ is the subset of {1, ..., q} corresponding to the s largest in absolute value coordinates of δ.

For notational convenience, we assume that M is normalized in such a way that the diagonal elements of (1/n) M†⊤ X⊤XM† are all equal to 1. This can always be done by multiplying M from the left by a suitably chosen positive definite diagonal matrix. Furthermore, we will repeatedly use the projector² Π = XN†(N†⊤X⊤XN†)⁻¹N†⊤X⊤ onto the subspace of R^n spanned by the columns of XN†. We denote by r = rank{Π} the rank of this projector, which is typically very small compared to n ∧ p, and is always smaller than n ∧ (p − q). In all theoretical results, the matrices X and M are assumed deterministic.

Theorem 2.1. Let us fix a tolerance level δ ∈ (0, 1) and define λ = √(2nγ log(q/δ)). 
Assume that the tuning parameters γ, μ > 0 satisfy

μ/γ ≤ 1 − r/n − (2/n)(√((n − r) log(1/δ)) + log(1/δ)).   (6)

If the vector Mβ* is s-sparse and the matrix (I_n − Π)XM† satisfies the condition RE(s, 1) with some κ > 0 then, with probability at least 1 − 6δ, it holds:

‖M(β̂ − β*)‖₁ ≤ (4/κ²)(σ̂ + σ*) s √(2γ log(q/δ)/n),   (7)
‖X(β̂ − β*)‖₂ ≤ (2(σ̂ + σ*)/κ) √(2γ s log(q/δ)) + σ* (8 log(1/δ) + r)^{1/2}.   (8)

If, in addition, (I_n − Π)XM† satisfies the condition RE(s, s, 1) with some κ > 0 then, with probability at least 1 − 6δ, we have:

‖Mβ̂ − Mβ*‖₂ ≤ (4(σ̂ + σ*)/κ²) √(2s log(q/δ)/n) + (σ*/κ) √(2s log(1/δ)/n).   (9)

Moreover, with probability at least 1 − 7δ, we have:

σ̂ ≤ σ*/μ^{1/2} + λ‖Mβ*‖₁/(nμ) + s^{1/2} σ* log(q/δ)/(nκμ^{1/2}) + (σ* + ‖Mβ*‖₁) μ^{−1/2} √(2 log(1/δ)/n).   (10)

²Here and in the sequel, the inverse of a singular matrix is understood as the Moore-Penrose pseudoinverse.

Before looking at the consequences of these risk bounds in the particular case of robust estimation, let us present some comments highlighting the claims of Theorem 2.1. 
The first comment is about the conditions on the tuning parameters μ and γ. It is interesting to observe that the roles of these parameters are very clearly defined: γ controls the quality of estimating β* while μ determines the quality of estimating σ*. One can note that all the quantities entering the right-hand side of (6) are known, so that it is not hard to choose μ and γ in such a way that they satisfy the conditions of Theorem 2.1. However, in practice, this theoretical choice may be too conservative, in which case it could be a better idea to rely on cross-validation.

The second remark is about the rates of convergence. According to (8), the rate of estimation measured in the mean prediction loss (1/n)‖X(β̂ − β*)‖²₂ is of the order of s log(q)/n, which is known as the fast or parametric rate. The vector Mβ* is also estimated at the nearly parametric rate in both the ℓ1 and ℓ2 norms. To the best of our knowledge, this is the first work where such fast rates are derived in the context of fused sparsity with unknown noise level. With some extra work, one can check that if, for instance, γ = 1 and |μ − 1| ≤ cn^{−1/2} for some constant c, then the estimator σ̂ also has a risk of the order of sn^{−1/2}. However, the price to pay for being adaptive with respect to the noise level is the presence of ‖Mβ*‖₁ in the bound on σ̂, which deteriorates the quality of estimation in the case of a large signal-to-noise ratio.

Even if Theorem 2.1 requires the noise distribution to be Gaussian, the proposed algorithm remains valid in a far broader context, and tight risk bounds can be obtained under more general conditions on the noise distribution. In fact, one can see from the proof that we only need to know confidence sets for some linear and quadratic functionals of ξ. For instance, such confidence sets can be readily obtained in the case of bounded errors ξ_i using the Bernstein inequality. It is also worthwhile to mention that the proof of Theorem 2.1 is not a simple adaptation of the arguments used to prove analogous results for ordinary sparsity, but contains some qualitatively novel ideas. More precisely, the cornerstone of the proof of risk bounds for the Dantzig selector [4, 3, 9] is that the true parameter β* is a feasible solution. In our case, this argument cannot be used anymore. Our proposal is then to specify another vector β̃ that simultaneously satisfies the following three conditions: Mβ̃ has the same sparsity pattern as Mβ*, β̃ is close to β*, and β̃ lies in the feasible set.

A last remark is about the restricted eigenvalue conditions. They are somewhat cumbersome in this abstract setting, but simplify a lot when the concrete example of robust estimation is considered, cf. the next section. At a heuristic level, these conditions require the columns of XM† to be not too strongly correlated. Unfortunately, this condition fails for the matrices appearing in the problem of multiple change-point detection, which is an important particular instance of fused sparsity. There are some workarounds to circumvent this limitation in that particular setting, see [17, 11]. The extension of these kinds of arguments to the case of unknown σ* is an open problem we intend to tackle in the near future.

3 Application to robust estimation

This methodology can be applied in the context of robust estimation, i.e., when we observe Y ∈ R^n and A ∈ R^{n×k} such that the relation

Y_i = (Aθ*)_i + σ*ξ_i,  ξ_i iid∼ N(0, 1)

holds only for some indexes i ∈ I ⊂ {1, ..., n}, called inliers. The indexes not belonging to I will be referred to as outliers. The setting we are interested in is the one frequently encountered in computer vision [13, 25]: the dimensionality k of θ* is small as compared to n, but the presence of outliers causes the complete failure of the least squares estimator. In what follows, we use the standard assumption that the matrix (1/n)A⊤A has diagonal entries equal to one.

Following the ideas developed in [6, 7, 8, 18, 15], we introduce a new vector ω ∈ R^n that serves to characterize the outliers. If an entry ω_i of ω is nonzero, then the corresponding observation Y_i is an outlier. This leads to the model:

Y = Aθ* + √n ω* + σ*ξ = Xβ* + σ*ξ,  where X = [√n I_n  A],  β* = [ω* ; θ*]⊤.

Thus, we have rewritten the problem of robust estimation in linear models as a problem of estimation in high dimension under the fused sparsity scenario. Indeed, we have X ∈ R^{n×(n+k)} and β* ∈ R^{n+k}, and we are interested in finding an estimator β̂ of β* for which ω̂ = [I_n 0_{n×k}]β̂ contains as many zeros as possible. This means that we expect that the number of outliers is significantly smaller than the sample size. 
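Setting up this reduction in code is immediate; a small numpy sketch (ours, with arbitrary illustrative dimensions) builds the augmented design X = [√n I_n, A] and checks that Xβ* reproduces Aθ* + √n ω*:

```python
import numpy as np

rng = np.random.default_rng(3)
n, k, sigma = 50, 3, 0.5
A = rng.standard_normal((n, k))
theta = rng.standard_normal(k)

# Sparse outlier vector: most observations are inliers (omega_i = 0)
omega = np.zeros(n)
omega[[4, 17]] = [2.0, -3.0]

X = np.hstack([np.sqrt(n) * np.eye(n), A])      # n x (n + k) augmented design
beta = np.concatenate([omega, theta])           # beta = [omega; theta]

xi = rng.standard_normal(n)
Y = X @ beta + sigma * xi
assert np.allclose(Y, A @ theta + np.sqrt(n) * omega + sigma * xi)
```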
We are thus in the setting of fused sparsity with M = [I_n 0_{n×k}]. Setting N = [0_{k×n} I_k], we define the Scaled Robust Dantzig Selector (SRDS) as a solution (θ̂, ω̂, σ̂) of the problem:

minimize ‖ω‖₁ subject to
 √n ‖Aθ + √n ω − Y‖_∞ ≤ λσ,
 A⊤(Aθ + √n ω − Y) = 0,
 nμσ² + Y⊤(Aθ + √n ω) ≤ ‖Y‖²₂.   (P2)

Once again, this can be recast as an SOCP and solved with great efficiency by standard algorithms. Furthermore, the results of the previous section provide us with strong theoretical guarantees for the SRDS. To state the corresponding result, we will need notation for the largest and smallest singular values of (1/√n)A, denoted by ν* and ν_* respectively.

Theorem 3.1. Let us fix a tolerance level δ ∈ (0, 1) and define λ = √(2nγ log(n/δ)). Assume that the tuning parameters γ, μ > 0 satisfy μ/γ ≤ 1 − k/n − (2/n)(√((n − k) log(1/δ)) + log(1/δ)). Let Π denote the orthogonal projector onto the k-dimensional subspace of R^n spanned by the columns of A. If the vector ω* is s-sparse and the matrix √n(I_n − Π) satisfies the condition RE(s, 1) with some κ > 0 then, with probability at least 1 − 5δ, it holds:

‖ω̂ − ω*‖₁ ≤ (4/κ²)(σ̂ + σ*) s √(2γ log(n/δ)/n),   (11)
‖(I_n − Π)(ω̂ − ω*)‖₂ ≤ (2(σ̂ + σ*)/κ) √(2s log(n/δ)/n) + σ* √(2 log(1/δ)/n).   (12)

If, in addition, √n(I_n − Π) satisfies the condition RE(s, s, 1) with some κ > 0 then, with probability at least 1 − 6δ, we have:

‖ω̂ − ω*‖₂ ≤ (4(σ̂ + σ*)/κ²) √(2s log(n/δ)/n) + (σ*/κ) √(2 log(1/δ)/n),
‖θ̂ − θ*‖₂ ≤ (ν*/ν_*²) { (4(σ̂ + σ*)/κ²) √(2s log(n/δ)/n) + (σ*/κ) √(2 log(1/δ)/n) + σ* (√k + √(2 log(1/δ))) / √n }.

Moreover, with probability at least 1 − 7δ, the following inequality holds:

σ̂ ≤ σ*/μ^{1/2} + λ‖ω*‖₁/(nμ) + s^{1/2} σ* log(n/δ)/(nκμ^{1/2}) + (σ* + ‖ω*‖₁) μ^{−1/2} √(2 log(1/δ)/n).   (13)

All the comments made after Theorem 2.1, especially those concerning the tuning parameters and the rates of convergence, hold true for the risk bounds in Theorem 3.1 as well. 
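One useful consequence of the equality constraint in (P2), A⊤(Aθ + √nω − Y) = 0, is that θ is the ordinary least-squares fit on the response corrected by √nω. A quick numerical check (our sketch):

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 40, 3
A = rng.standard_normal((n, k))
Y = rng.standard_normal(n)
omega = rng.standard_normal(n)

# The equality constraint A^T (A theta + sqrt(n) omega - Y) = 0 determines
# theta from omega: the least-squares fit on the "cleaned" response.
theta = np.linalg.solve(A.T @ A, A.T @ (Y - np.sqrt(n) * omega))
residual = A.T @ (A @ theta + np.sqrt(n) * omega - Y)
assert np.allclose(residual, 0)
```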
Furthermore, the restricted eigenvalue condition in the latter theorem is much simpler and deserves special attention. In particular, one can remark that the failure of RE(s, 1) for √n(I_n − Π) implies that there is a unit vector δ in Im(A) such that |δ_{(1)}| + ... + |δ_{(n−s)}| ≤ |δ_{(n−s+1)}| + ... + |δ_{(n)}|, where δ_{(k)} stands for the kth smallest (in absolute value) entry of δ. To gain a better understanding of how restrictive this assumption is, let us consider the case where the rows a₁, ..., a_n of A are i.i.d. zero-mean Gaussian vectors. Since δ ∈ Im(A), its coordinates δ_i are also i.i.d. Gaussian random variables (they can be considered N(0, 1) due to the homogeneity of the inequality we are interested in). The inequality |δ_{(1)}| + ... + |δ_{(n−s)}| ≤ |δ_{(n−s+1)}| + ... + |δ_{(n)}| can be written as (1/n) Σ_i |δ_i| ≤ (2/n)(|δ_{(n−s+1)}| + ... + |δ_{(n)}|). While the left-hand side of this inequality tends to E[|δ₁|] > 0, the right-hand side is upper-bounded by (2s/n) max_i |δ_i|, which is on the order of 2s√(log n)/n. Therefore, if 2s√(log n)/n is small, the condition RE(s, 1) is satisfied. This informal discussion can be made rigorous by studying large deviations of the quantity max_{δ∈Im(A)\{0}} ‖δ‖_∞/‖δ‖₁. A simple sufficient condition entailing RE(s, 1) for √n(I_n − Π) is presented in the following lemma.

Lemma 3.2. Let us set ζ_s(A) = inf_{u∈S^{k−1}} (1/n) Σ_{i=1}^n |a_i u| − 2s‖A‖_{2,∞} √(log n)/n. If ζ_s(A) > 0, then √n(I_n − Π) satisfies both RE(s, 1) and RE(s, s, 1) with κ(s, 1) ≥ κ(s, s, 1) ≥ ζ_s(A)/√((ν*)² + ζ_s(A)²).

(T, p, s*, σ*)       SFDS |β̂−β*|₂   SFDS |σ̂−σ*|   Lasso |β̂−β*|₂   SqRL |β̂−β*|₂   SqRL |σ̂−σ*|
(200, 400, 2, .5)    0.04 (0.03)     0.18 (0.14)    0.07 (0.05)      0.06 (0.04)     0.20 (0.14)
(200, 400, 2, 1)     0.09 (0.05)     0.42 (0.35)    0.16 (0.11)      0.13 (0.09)     0.46 (0.37)
(200, 400, 2, 2)     0.23 (0.17)     0.75 (0.55)    0.31 (0.21)      0.25 (0.18)     0.79 (0.56)
(200, 400, 5, .5)    0.06 (0.01)     0.28 (0.11)    0.13 (0.09)      0.11 (0.06)     0.18 (0.27)
(200, 400, 5, 1)     0.20 (0.05)     0.56 (0.10)    0.31 (0.04)      0.25 (0.02)     0.66 (0.05)
(200, 400, 5, 2)     0.34 (0.11)     0.34 (0.21)    0.73 (0.25)      0.47 (0.29)     0.69 (0.70)
(200, 400, 10, .5)   0.10 (0.01)     0.36 (0.02)    0.15 (0.00)      0.10 (0.01)     0.36 (0.02)
(200, 400, 10, 1)    0.19 (0.09)     0.27 (0.26)    0.31 (0.04)      0.19 (0.09)     0.27 (0.26)
(200, 400, 10, 2)    1.90 (0.20)     4.74 (1.01)    0.61 (0.08)      1.80 (0.04)     3.70 (0.48)

Each cell reports Ave (StD). Table 1: Comparing our procedure SFDS with the (oracle) Lasso and the SqRL on a synthetic dataset. 
The average values and the standard deviations of the quantities |β̂ − β*|₂ and |σ̂ − σ*| over 500 trials are reported. They represent respectively the accuracy in estimating the regression vector and the level of noise.

The proof of the lemma can be found in the supplementary material.

One can take note that the problem (P2) boils down to computing (ω̂, σ̂) as a solution to

minimize ‖ω‖₁ subject to
 √n ‖(I_n − Π)(√n ω − Y)‖_∞ ≤ λσ,
 nμσ² + √n [(I_n − Π)Y]⊤ω ≤ ‖(I_n − Π)Y‖²₂,

and then setting θ̂ = (A⊤A)⁻¹A⊤(Y − √n ω̂).

4 Experiments

For the empirical evaluation we use a synthetic dataset with a randomly drawn Gaussian design matrix X and the real-world dataset fountain-P11³, on which we apply our methodology for computing the fundamental matrices between consecutive images.

4.1 Comparative evaluation on synthetic data

We randomly generated an n × p matrix X with independent entries distributed according to the standard normal distribution. Then we chose a vector β* ∈ R^p that has exactly s nonzero elements, all equal to one. The indexes of these elements were chosen at random. Finally, the response Y ∈ R^n was computed by adding a random noise σ*N_n(0, I_n) to the signal Xβ*. Once Y and X were available, we computed three estimators of the parameters using the standard sparsity penalization (in order to be able to compare our approach to the others): the SFDS, the Lasso and the square-root Lasso (SqRL). 
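The data-generating protocol of Section 4.1 can be sketched as follows (our sketch; the three estimators themselves, which require a convex/SOCP solver, are not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, s, sigma_star = 200, 400, 5, 1.0

X = rng.standard_normal((n, p))               # Gaussian design
beta_star = np.zeros(p)
support = rng.choice(p, size=s, replace=False)
beta_star[support] = 1.0                      # s nonzero coefficients, all equal to one

Y = X @ beta_star + sigma_star * rng.standard_normal(n)

assert np.count_nonzero(beta_star) == s
assert Y.shape == (n,)
```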
We used the "universal" tuning parameters for all these methods: (λ, µ) = (√(2n log p), 1) for the SFDS, λ = √(2 log p) for the SqRL and λ = σ*√(2 log p) for the Lasso. Note that the latter is not really an estimator but rather an oracle, since it exploits the knowledge of the true σ*. This is why the accuracy in estimating σ* is not reported in Table 1. To reduce the well-known bias toward zero [4, 23], we performed a post-processing for all three procedures: it consisted in computing the least squares estimator after removing all the covariates corresponding to vanishing coefficients of the estimator of β*. The results summarized in Table 1 show that the SFDS is competitive with the state-of-the-art methods and, a bit surprisingly, is sometimes more accurate than the oracle Lasso that uses the true variance in the penalization. We stress, however, that the SFDS is designed for, and has theoretical guarantees in, the broader setting of fused sparsity.

4.2 Robust estimation of the fundamental matrix

To provide a qualitative evaluation of the proposed methodology on real data, we applied the SRDS to the problem of fundamental matrix estimation in multiple-view geometry, which constitutes an essential step in almost all pipelines of 3D reconstruction [13, 25].

³ Available at http://cvlab.epfl.ch/~strecha/multiview/denseMVS.html

Table 2: Quantitative results on the fountain dataset.

  Pair          1      2      3      4      5      6      7      8      9      10     Average
  σ̂            0.13   0.13   0.13   0.17   0.16   0.17   0.20   0.18   0.17   0.11   0.15
  ‖ω̂‖₀         218    80     236    90     198    309    17     31     207    8      139.4
  100‖ω̂‖₀/n    1.3    0.46   1.37   0.52   1.13   1.84   0.12   0.19   1.49   1.02   0.94

Figure 1: Qualitative results on the fountain dataset. Top left: the values of ω̂ᵢ for the first pair of images. There is a clear separation between outliers and inliers. Top right: the first pair of images and the matches classified as wrong by the SRDS. Bottom: the eleven images of the dataset.

In short, if we have two images I and I′ representing the same 3D scene, then there is a 3×3 matrix F, called the fundamental matrix, such that a point x = (x, y) in I matches the point x′ = (x′, y′) in I′ only if [x; y; 1]⊤ F [x′; y′; 1] = 0. Clearly, F is defined up to a scale factor: if F₃₃ ≠ 0, one can assume that F₃₃ = 1. Thus, each pair x ↔ x′ of matching points in the images I and I′ yields a linear constraint on the eight remaining coefficients of F. Because of quantization and the presence of noise in the images, these linear relations are satisfied only up to some error. The estimation of F from a family of matching points {xᵢ ↔ x′ᵢ; i = 1, ..., n} is therefore a problem of linear regression. Typically, matches are computed by comparing local descriptors (such as SIFT [16]) and, for images of reasonable resolution, hundreds of matching points are found. The computation of the fundamental matrix would not be a problem in this context of large sample size and low dimension if the matching algorithms were perfectly correct. However, due to noise, repetitive structures and other factors, a non-negligible fraction of the detected matches are wrong (outliers). Eliminating these outliers and robustly estimating F are crucial steps for performing 3D reconstruction.

Here, we apply the SRDS to the estimation of F for the 10 pairs of consecutive images provided by the fountain dataset [21]; the 11 images are shown at the bottom of Fig. 1. Using SIFT descriptors, we found more than 17,000 point matches in most of the 10 pairs of images we consider.
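The linear regression just described can be made concrete: each match contributes one row of an n × 8 design matrix acting on the eight unknown entries of F, with the F₃₃ = 1 term moved to the right-hand side. The sketch below builds this system on noiseless synthetic matches and solves it by plain least squares; the robust SRDS step (outlier elimination and the cone-program solve) is deliberately omitted, and the matrix `F_true` and all point coordinates are made up for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
F_true = rng.normal(size=(3, 3))
F_true[2, 2] = 1.0  # fix the scale ambiguity by setting F33 = 1

def epipolar_row(x, y, xp, yp):
    """Row of the design matrix: [x;y;1]^T F [x';y';1] = 0 expanded in the
    eight unknowns (F11, F12, F13, F21, F22, F23, F31, F32); the F33 term
    contributes the constant -1 on the right-hand side."""
    return [x * xp, x * yp, x, y * xp, y * yp, y, xp, yp]

# Synthetic noiseless matches: draw x = (x, y) in image I, then pick
# x' = (x', y') on the epipolar line l = F^T [x;y;1] in image I'.
rows, n_pts = [], 50
while len(rows) < n_pts:
    x, y = rng.uniform(-1.0, 1.0, size=2)
    line = F_true.T @ np.array([x, y, 1.0])
    if abs(line[1]) < 0.1:       # skip ill-conditioned (near-vertical) lines
        continue
    xp = rng.uniform(-1.0, 1.0)
    yp = -(line[0] * xp + line[2]) / line[1]
    rows.append(epipolar_row(x, y, xp, yp))

A = np.array(rows)
b = -np.ones(n_pts)
f, *_ = np.linalg.lstsq(A, b, rcond=None)
F_est = np.append(f, 1.0).reshape(3, 3)
```

On noiseless matches the least-squares solution recovers F exactly; it is precisely the gross errors in real matches that make the least-squares estimate break down and motivate the robust formulation.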
The CPU time for computing each matrix using the SeDuMi solver [22] was about 7 seconds, despite the large dimensionality. The number of outliers and the estimated noise level for each pair of images are reported in Table 2. We also show in Fig. 1 the 218 outliers detected for the first pair of images. They are all indeed wrong correspondences, even those which correspond to the windows (this is due to the repetitive structure of the windows).

5 Conclusion and perspectives

We have presented a new procedure, the SFDS, for the problem of learning linear models with unknown noise level under the fused sparsity scenario. We showed that this procedure is inspired by penalized maximum likelihood but has the advantage of being computable by solving a second-order cone program. We established tight, nonasymptotic theoretical guarantees for the SFDS, with special attention paid to robust estimation in linear models. The experiments we have carried out are very promising and support our theoretical results.

In the future, we intend to generalize the theoretical study of the performance of the SFDS to the case of non-Gaussian errors ξᵢ, as well as to investigate its power in variable selection. The extension to the case where the number of rows of M is larger than the number of columns is another interesting topic for future research.

References

[1] Stephen Becker, Emmanuel Candès, and Michael Grant. Templates for convex cone problems with applications to sparse signal recovery. Math. Program. Comput., 3(3):165–218, 2011.

[2] A. Belloni, V. Chernozhukov, and L. Wang. Square-root lasso: pivotal recovery of sparse signals via conic programming. Biometrika, to appear, 2012.

[3] Peter J. Bickel, Ya'acov Ritov, and Alexandre B. Tsybakov. Simultaneous analysis of lasso and Dantzig selector. Ann. Statist., 37(4):1705–1732, 2009.

[4] Emmanuel Candès and Terence Tao.
The Dantzig selector: statistical estimation when p is much larger than n. Ann. Statist., 35(6):2313–2351, 2007.

[5] Emmanuel J. Candès. The restricted isometry property and its implications for compressed sensing. C. R. Math. Acad. Sci. Paris, 346(9-10):589–592, 2008.

[6] Emmanuel J. Candès and Paige A. Randall. Highly robust error correction by convex programming. IEEE Trans. Inform. Theory, 54(7):2829–2840, 2008.

[7] Arnak S. Dalalyan and Renaud Keriven. L1-penalized robust estimation for a class of inverse problems arising in multiview geometry. In NIPS, pages 441–449, 2009.

[8] Arnak S. Dalalyan and Renaud Keriven. Robust estimation for an inverse problem arising in multiview geometry. J. Math. Imaging Vision, 43(1):10–23, 2012.

[9] Eric Gautier and Alexandre Tsybakov. High-dimensional instrumental variables regression and confidence sets. Technical Report arXiv:1105.2454, September 2011.

[10] Christophe Giraud, Sylvie Huet, and Nicolas Verzelen. High-dimensional regression with unknown variance. Submitted, arXiv:1109.5587v2 [math.ST].

[11] Z. Harchaoui and C. Lévy-Leduc. Multiple change-point estimation with a total variation penalty. J. Amer. Statist. Assoc., 105(492):1480–1493, 2010.

[12] Zaïd Harchaoui and Céline Lévy-Leduc. Catching change-points with lasso. In John Platt, Daphne Koller, Yoram Singer, and Sam Roweis, editors, NIPS. Curran Associates, Inc., 2007.

[13] R. I. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, June 2004.

[14] A. Iouditski, F. Kilinc Karzan, A. S. Nemirovski, and B. T. Polyak. On the accuracy of l1-filtering of signals with block-sparse structure. In NIPS 24, pages 1260–1268, 2011.

[15] S. Lambert-Lacroix and L. Zwald. Robust regression through the Huber's criterion and adaptive lasso penalty. Electron. J.
Stat., 5:1015–1053, 2011.

[16] David G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.

[17] E. Mammen and S. van de Geer. Locally adaptive regression splines. Ann. Statist., 25(1):387–413, 1997.

[18] Nam H. Nguyen, Nasser M. Nasrabadi, and Trac D. Tran. Robust lasso with missing and grossly corrupted observations. In J. Shawe-Taylor, R. S. Zemel, P. Bartlett, F. C. N. Pereira, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 1881–1889, 2011.

[19] A. Rinaldo. Properties and refinements of the fused lasso. Ann. Statist., 37(5B):2922–2952, 2009.

[20] Nicolas Städler, Peter Bühlmann, and Sara van de Geer. ℓ1-penalization for mixture regression models. TEST, 19(2):209–256, 2010.

[21] C. Strecha, W. von Hansen, L. Van Gool, P. Fua, and U. Thoennessen. On benchmarking camera calibration and multi-view stereo for high resolution imagery. In Conference on Computer Vision and Pattern Recognition, pages 1–8, 2009.

[22] J. F. Sturm. Using SeDuMi 1.02, a MATLAB toolbox for optimization over symmetric cones. Optim. Methods Softw., 11/12(1-4):625–653, 1999.

[23] T. Sun and C.-H. Zhang. Comments on: ℓ1-penalization for mixture regression models. TEST, 19(2):270–275, 2010.

[24] T. Sun and C.-H. Zhang. Scaled sparse linear regression. arXiv:1104.4595, 2011.

[25] R. Szeliski. Computer Vision: Algorithms and Applications. Texts in Computer Science. Springer, 2010.

[26] Robert Tibshirani. Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B, 58(1):267–288, 1996.

[27] Robert Tibshirani, Michael Saunders, Saharon Rosset, Ji Zhu, and Keith Knight. Sparsity and smoothness via the fused lasso. J. R. Stat. Soc. Ser. B Stat. Methodol., 67(1):91–108, 2005.

[28] Sara A. van de Geer and Peter Bühlmann.
On the conditions used to prove oracle results for the Lasso. Electron. J. Stat., 3:1360–1392, 2009.