{"title": "Sparse Support Recovery with Non-smooth Loss Functions", "book": "Advances in Neural Information Processing Systems", "page_first": 4269, "page_last": 4277, "abstract": "In this paper, we study the support recovery guarantees of underdetermined sparse regression using the $\\ell_1$-norm as a regularizer and a non-smooth loss function for data fidelity. More precisely, we focus in detail on the cases of $\\ell_1$ and $\\ell_\\infty$ losses, and contrast them with the usual $\\ell_2$ loss.While these losses are routinely used to account for either sparse ($\\ell_1$ loss) or uniform ($\\ell_\\infty$ loss) noise models, a theoretical analysis of their performance is still lacking. In this article, we extend the existing theory from the smooth $\\ell_2$ case to these non-smooth cases. We derive a sharp condition which ensures that the support of the vector to recover is stable to small additive noise in the observations, as long as the loss constraint size is tuned proportionally to the noise level. A distinctive feature of our theory is that it also explains what happens when the support is unstable. While the support is not stable anymore, we identify an \"extended support\" and show that this extended support is stable to small additive noise. To exemplify the usefulness of our theory, we give a detailed numerical analysis of the support stability/instability of compressed sensing recovery with these different losses. This highlights different parameter regimes, ranging from total support stability to progressively increasing support instability.", "full_text": "Sparse Support Recovery with\nNon-smooth Loss Functions\n\nK\u00e9vin Degraux\n\nISPGroup/ICTEAM, FNRS\n\nUniversit\u00e9 catholique de Louvain\nLouvain-la-Neuve, Belgium 1348\nkevin.degraux@uclouvain.be\n\nGabriel Peyr\u00e9\nCNRS, DMA\n\n\u00c9cole Normale Sup\u00e9rieure\n\nParis, France 75775\n\ngabriel.peyre@ens.fr\n\nJalal M. 
Fadili\n\nNormandie Univ, ENSICAEN,\n\nCNRS, GREYC,\n\nCaen, France 14050\n\nJalal.Fadili@ensicaen.fr\n\nLaurent Jacques\n\nISPGroup/ICTEAM, FNRS\n\nUniversit\u00e9 catholique de Louvain\nLouvain-la-Neuve, Belgium 1348\nlaurent.jacques@uclouvain.be\n\nAbstract\n\nIn this paper, we study the support recovery guarantees of underdetermined sparse\nregression using the \u21131-norm as a regularizer and a non-smooth loss function for\ndata fidelity. More precisely, we focus in detail on the cases of \u21131 and \u2113\u221e losses,\nand contrast them with the usual \u21132 loss. While these losses are routinely used to\naccount for either sparse (\u21131 loss) or uniform (\u2113\u221e loss) noise models, a theoretical\nanalysis of their performance is still lacking. In this article, we extend the existing\ntheory from the smooth \u21132 case to these non-smooth cases. We derive a sharp\ncondition which ensures that the support of the vector to recover is stable to small\nadditive noise in the observations, as long as the loss constraint size is tuned\nproportionally to the noise level. A distinctive feature of our theory is that it also\nexplains what happens when the support is unstable. While the support is not stable\nanymore, we identify an \u201cextended support\u201d and show that this extended support\nis stable to small additive noise. To exemplify the usefulness of our theory, we\ngive a detailed numerical analysis of the support stability/instability of compressed\nsensing recovery with these different losses. 
This highlights different parameter\nregimes, ranging from total support stability to progressively increasing support\ninstability.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n1 Introduction\n\n1.1 Sparse Regularization\n\nThis paper studies sparse linear regression problems of the form\n\ny = \u03a6x0 + w,\n\nwhere x0 \u2208 Rn is the unknown vector to estimate, supposed to be non-zero and sparse, w \u2208 Rm\nis some additive noise and the design matrix \u03a6 \u2208 Rm\u00d7n is in general rank deficient, corresponding to\na noisy underdetermined linear system of equations, i.e., typically in the high-dimensional regime\nwhere m \u226a n. This can also be understood as an inverse problem in imaging sciences, a particular\ninstance of which being the compressed sensing problem [3], where the matrix \u03a6 is drawn from some\nappropriate random matrix ensemble.\nIn order to recover a sparse vector x0, a popular regularization is the \u21131-norm, in which case we\nconsider the following constrained sparsity-promoting optimization problem\n\nmin_{x\u2208Rn} {||x||1 s.t. ||\u03a6x \u2212 y||\u03b1 \u2264 \u03c4},\n\n(P\u03c4\u03b1(y))\n\nwhere for \u03b1 \u2208 [1, +\u221e], ||u||\u03b1 def.= (\u2211i |ui|\u03b1)1/\u03b1 denotes the \u2113\u03b1-norm, and the constraint size \u03c4 \u2265 0\nshould be adapted to the noise level. To avoid trivialities, throughout the paper, we assume that\nproblem (P\u03c4\u03b1(y)) is feasible, which is of course the case if \u03c4 \u2265 ||w||\u03b1. In the special situation where\nthere is no noise, i.e., w = 0, it makes sense to consider \u03c4 = 0 and solve the so-called Lasso [14] or\nBasis-Pursuit problem [4], which is independent of \u03b1, and reads\n\nmin_{x\u2208Rn} {||x||1 s.t. 
\u03a6x = \u03a6x0} .\n\n(P0(\u03a6x0))\n\nThe case \u03b1 = 2 corresponds to the usual \u21132 loss function, which entails a smooth constraint set, and\nhas been studied in depth in the literature (see Section 1.6 for an overview). In contrast, the cases\n\u03b1 \u2208 {1, +\u221e} correspond to very different setups, where the loss function || \u00b7 ||\u03b1 is polyhedral and\nnon-smooth. They are expected to lead to significantly different estimation results and require the\ndevelopment of novel theoretical results, which is the focus of this paper. The case \u03b1 = 1 corresponds to\na \u201crobust\u201d loss function, and is important to cope with impulse noise or outliers contaminating the\ndata (see for instance [11, 13, 9]). At the extreme opposite, the case \u03b1 = +\u221e is typically used to\nhandle uniform noise such as in quantization (see for instance [10]). This paper studies the stability\nof the support supp(x\u03c4 ) of minimizers x\u03c4 of (P\u03c4\u03b1(y)). In particular, we provide a sharp analysis\nfor the polyhedral cases \u03b1 \u2208 {1, +\u221e} that allows one to control the deviation of supp(x\u03c4 ) from\nsupp(x0) if ||w||\u03b1 is not too large and \u03c4 is chosen proportionally to ||w||\u03b1. The general case is studied\nnumerically in a compressed sensing experiment where we compare supp(x\u03c4 ) and supp(x0) for\n\u03b1 \u2208 [1, +\u221e].\n\n1.2 Notations\n\nThe support of x0 is denoted I def.= supp(x0), where supp(u) def.= {i | ui \u2260 0}. The saturation support\nof a vector is defined as sat(u) def.= {i | |ui| = ||u||\u221e}. The sub-differential of a convex function f\nis denoted \u2202f. The subspace parallel to a nonempty convex set C is par(C) def.= R(C \u2212 C). A\u2217 is the\ntranspose of a matrix A and A+ is the Moore-Penrose pseudo-inverse of A. Id is the identity matrix\nand \u03b4i the canonical vector of index i. For a subspace V \u2282 Rn, PV is the orthogonal projector onto\nV. 
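For the polyhedral cases \u03b1 \u2208 {1, \u221e}, the fidelity-constrained problem (P\u03c4\u03b1(y)) introduced above is a linear program. A minimal sketch for \u03b1 = \u221e using a generic LP solver (our reformulation, not the CVX/MOSEK setup used later in the paper):

```python
import numpy as np
from scipy.optimize import linprog

def solve_pta_inf(Phi, y, tau):
    """Solve min ||x||_1  s.t.  ||Phi x - y||_inf <= tau  as an LP.

    Variables are z = [x; t] with t >= |x| elementwise, so the
    objective sum(t) equals ||x||_1 at the optimum.
    """
    m, n = Phi.shape
    c = np.r_[np.zeros(n), np.ones(n)]          # minimize sum(t)
    I = np.eye(n)
    A_ub = np.block([
        [ I,   -I],                             #  x - t <= 0
        [-I,   -I],                             # -x - t <= 0
        [ Phi,  np.zeros((m, n))],              #  Phi x <= y + tau
        [-Phi,  np.zeros((m, n))],              # -Phi x <= tau - y
    ])
    b_ub = np.r_[np.zeros(2 * n), y + tau, tau - y]
    bounds = [(None, None)] * n + [(0, None)] * n
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    return res.x[:n]

# toy check with Phi = Id, y = (3, -1), tau = 1: the solution shrinks
# each entry toward 0 within the box, giving x = (2, 0)
x = solve_pta_inf(np.eye(2), np.array([3.0, -1.0]), 1.0)
```

The \u03b1 = 1 case is analogous: the constraint ||\u03a6x \u2212 y||1 \u2264 \u03c4 is handled with one auxiliary variable per residual entry whose sum is bounded by \u03c4.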
For sets of indices S and I, we denote \u03a6S,I the submatrix of \u03a6 restricted to the rows indexed\nby S and the columns indexed by I. When all rows or all columns are kept, a dot replaces the\ncorresponding index set (e.g., \u03a6\u00b7,I). We denote \u03a6\u2217S,I def.= (\u03a6S,I )\u2217, i.e. the transposition is applied after\nthe restriction.\n\n1.3 Dual Certificates\n\nBefore diving into our theoretical contributions, we first give important definitions. Let Dx0 be the\nset of dual certificates (see, e.g., [17]) defined by\n\nDx0 def.= {p \u2208 Rm | \u03a6\u2217p \u2208 \u2202||x0||1} = {p \u2208 Rm | \u03a6\u2217\u00b7,I p = sign(x0,I ), ||\u03a6\u2217p||\u221e \u2264 1} .\n\n(1)\n\nThe first order optimality condition (see, e.g., [12]) states that x0 is a solution of (P0(\u03a6x0)) if and\nonly if Dx0 \u2260 \u2205. Assuming this is the case, our main theoretical finding (Theorem 1) states that the\nstability (and instability) of the support of x0 is characterized by the following specific subset of\ncertificates\n\np\u03b2 \u2208 Argmin_{p\u2208Dx0} ||p||\u03b2 where 1/\u03b1 + 1/\u03b2 = 1.\n\n(2)\n\nWe call such a certificate p\u03b2 a minimum norm certificate. Note that for 1 < \u03b1 < +\u221e, this p\u03b2 is\nactually unique but that for \u03b1 \u2208 {1, \u221e} it might not be the case.\nAssociated to such a minimal norm certificate, we define the extended support as\n\nJ def.= sat(\u03a6\u2217p\u03b2) = {i \u2208 {1, . . . , n} | |(\u03a6\u2217p\u03b2)i| = 1} .\n\n(3)\n\nWhen the certificate p\u03b2 from which J is computed is unclear from the context, we write it explicitly\nas an index Jp\u03b2 . Note that, from the definition of Dx0, one always has I \u2286 J. 
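Membership in Dx0 and the extended support (3) are straightforward to evaluate numerically for a candidate p; a small sketch (the 2 \u00d7 3 matrix below is an illustrative example in which J strictly contains I):

```python
import numpy as np

def is_certificate(Phi, x0, p, tol=1e-9):
    """Check p in D_{x0}: Phi_{.,I}^T p = sign(x0_I) and ||Phi^T p||_inf <= 1."""
    I = np.flatnonzero(x0)
    eta = Phi.T @ p
    return (np.allclose(eta[I], np.sign(x0[I]), atol=tol)
            and np.max(np.abs(eta)) <= 1 + tol)

def extended_support(Phi, p, tol=1e-9):
    """J = sat(Phi^T p): indices where |(Phi^T p)_i| saturates at 1."""
    return np.flatnonzero(np.abs(np.abs(Phi.T @ p) - 1) <= tol)

# illustrative example: p is a certificate, and the third column also
# saturates, so J = {0, 1, 2} strictly contains I = {0, 1}
Phi = np.array([[1.0, 0.0, 0.5],
                [0.0, 1.0, 0.5]])
x0 = np.array([1.0, 1.0, 0.0])
p = np.array([1.0, 1.0])        # Phi_{.,I}^T p = (1, 1) = sign(x0_I)
```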
Intuitively, J indicates\nthe set of indexes that will be activated in the signal estimate when a small noise w is added to the\nobservation, and thus the situation when I = J corresponds to the case where the support of x0 is\nstable.\n\nFig. 1: Model tangent subspace T\u03b2 in R2 for (\u03b1, \u03b2) = (\u221e, 1).\n\n1.4 Lagrange multipliers and restricted injectivity conditions\n\nIn the case of noiseless observations (w = 0) and when \u03c4 > 0, the following general lemma,\nwhose proof can be found in Section 2, associates with a given dual certificate p\u03b2 an explicit solution of\n(P\u03c4\u03b1(\u03a6x0)). This formula depends on a so-called Lagrange multiplier vector v\u03b2 \u2208 Rn, which will\nbe instrumental to state our main contribution (Theorem 1). Note that this lemma is valid for any\n\u03b1 \u2208 [1, \u221e]. Even though this goes beyond the scope of our main result, one can use the same lemma\nfor an arbitrary \u2113\u03b1-norm for \u03b1 \u2208 [1, \u221e] (see Section 3) or for even more general loss functions.\nLemma 1 (Noiseless solution). We assume that x0 is identifiable, i.e. it is a solution to (P0(\u03a6x0)),\nand consider \u03c4 > 0. Then there exists a v\u03b2 \u2208 Rn supported on J such that\n\n\u03a6\u00b7,J v\u03b2,J \u2208 \u2202||p\u03b2||\u03b2 and \u2212sign(v\u03b2,\u02dcJ ) = \u03a6\u2217\u00b7,\u02dcJ p\u03b2 ,\n\nwhere we denoted \u02dcJ def.= J\\I. If \u03c4 is such that 0 < \u03c4 < x/||v\u03b2,I ||\u221e, with x = mini\u2208I |x0,i|, then a\nsolution \u00afx\u03c4 of (P\u03c4\u03b1(\u03a6x0)) with support equal to J is given by\n\n\u00afx\u03c4,J = x0,J \u2212 \u03c4 v\u03b2,J .\n\nMoreover, its entries have the same sign as those of x0 on its support I, i.e., sign(\u00afx\u03c4,I ) = sign(x0,I ).\nAn important question that arises is whether v\u03b2 can be computed explicitly. 
For this, let us define the\nmodel tangent subspace T\u03b2 def.= par(\u2202||p\u03b2||\u03b2)\u22a5, i.e., T\u03b2 is the orthogonal complement of the subspace parallel to\n\u2202||p\u03b2||\u03b2, which uniquely defines the model vector e\u03b2 def.= PT\u03b2 \u2202||p\u03b2||\u03b2, as shown on Figure 1 (see [17]\nfor details). Using this notation, v\u03b2,J is uniquely defined and expressed in closed-form as\n\nv\u03b2,J = (PT\u03b2 \u03a6\u00b7,J )+ e\u03b2\n\n(4)\n\nif and only if the following restricted injectivity condition holds\n\nKer(PT\u03b2 \u03a6\u00b7,J ) = {0}.\n\n(INJ\u03b1)\n\nFor the special case (\u03b1, \u03b2) = (\u221e, 1), the following lemma, proved in Section 2, gives easily verifiable\nsufficient conditions, which ensure that (INJ\u221e) holds. The notation S def.= supp(p1) is used.\nLemma 2 (Restricted injectivity for \u03b1 = \u221e). Assume x0 is identifiable and \u03a6S,J has full rank. If\n\nsJ /\u2208 Im(\u03a6\u2217S\u2032,J ) \u2200S\u2032 \u2286 {1, . . . , m}, |S\u2032| < |J|, and\nqS /\u2208 Im(\u03a6S,J\u2032 ) \u2200J\u2032 \u2286 {1, . . . , n}, |J\u2032| < |S|,\n\nwhere sJ = \u03a6\u2217\u00b7,J p1 \u2208 {\u22121, 1}|J| and qS = sign(p1,S ) \u2208 {\u22121, 1}|S|, then |S| = |J| and \u03a6S,J is\ninvertible, i.e., since PT1 \u03a6\u00b7,J = Id\u00b7,S \u03a6S,J , (INJ\u221e) holds.\nRemark 1. If \u03a6 is randomly drawn from a continuous distribution with i.i.d. 
entries, e.g., Gaussian,\nthen as soon as x0 is identifiable, the conditions of Lemma 2 hold with probability 1 over the\ndistribution of \u03a6.\nFor (\u03b1, \u03b2) = (1, \u221e), we define\n\nZ def.= sat(p\u221e), \u0398 def.= [ IdZc,\u00b7 ; sign(p\u221e,Z )\u2217 IdZ,\u00b7 ] and \u02dc\u03a6 def.= \u0398\u03a6\u00b7,J ,\n\nwhere the two blocks of \u0398 are stacked row-wise. Following similar reasoning as in Lemma 2 and Remark 1, we can reasonably assume that |Zc| + 1 =\n|J| and \u02dc\u03a6 is invertible. In that case, (INJ1) holds as Ker(PT\u221e \u03a6\u00b7,J ) = Ker(\u02dc\u03a6). Table 1 summarizes\nfor the three specific cases \u03b1 \u2208 {1, 2, +\u221e} the quantities introduced here.\n\nTable 1: Model tangent subspace, restricted injectivity condition and Lagrange multipliers.\n\n\u03b1 | T\u03b2 | (INJ\u03b1) | (PT\u03b2 \u03a6\u00b7,J )+ | v\u03b2,J\n2 | Rm | Ker(\u03a6\u00b7,J ) = {0} | \u03a6+\u00b7,J | \u03a6+\u00b7,J p2/||p2||2\n\u221e | {u | supp(u) = S} | Ker(\u03a6S,J ) = {0} | \u03a6\u22121S,J IdS,\u00b7 | \u03a6\u22121S,J sign(p1,S )\n1 | {u | uZ = \u03c1 sign(p\u221e,Z ), \u03c1 \u2208 R} | Ker(\u02dc\u03a6) = {0} | \u02dc\u03a6\u22121\u0398 | \u02dc\u03a6\u22121\u03b4|J|\n\nFig. 2: (best observed in color) Simulated compressed sensing example showing x\u03c4 (above) for\nincreasing values of \u03c4 and random noise w respecting the hypotheses of Theorem 1, and \u03a6\u2217p\u03b2 (below),\nwhich predicts the support of x\u03c4 when \u03c4 > 0.\n\n1.5 Main result\n\nOur main contribution is Theorem 1 below. A similar result is known to hold in the case of the smooth\n\u21132 loss (\u03b1 = 2, see Section 1.6). Our paper extends it to the more challenging case of non-smooth\nlosses \u03b1 \u2208 {1, +\u221e}. The proof for \u03b1 = +\u221e is detailed in Section 2. 
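The closed forms of Table 1 are directly implementable; a hedged numpy sketch of the three Lagrange multiplier formulas (helper names are ours; the minimal norm certificates p2, p1, p\u221e and the index sets are assumed given):

```python
import numpy as np

def v_case_2(Phi, J, p2):
    """alpha = 2 (Table 1): v_J = Phi_{.,J}^+ (p2 / ||p2||_2)."""
    return np.linalg.pinv(Phi[:, J]) @ (p2 / np.linalg.norm(p2))

def v_case_inf(Phi, S, J, p1):
    """alpha = inf (Table 1): v_J = Phi_{S,J}^{-1} sign(p_{1,S})."""
    return np.linalg.solve(Phi[np.ix_(S, J)], np.sign(p1[S]))

def v_case_1(Phi, J, p_inf):
    """alpha = 1 (Table 1): v_J = Phi_tilde^{-1} delta_{|J|},
    with Phi_tilde = Theta Phi_{.,J} and Theta stacking Id_{Zc,.}
    over the single row sign(p_{inf,Z})^* Id_{Z,.}."""
    m = len(p_inf)
    Z = np.flatnonzero(np.abs(p_inf) >= np.max(np.abs(p_inf)) - 1e-12)
    Zc = np.setdiff1d(np.arange(m), Z)
    Id = np.eye(m)
    theta = np.vstack([Id[Zc], (np.sign(p_inf[Z]) @ Id[Z])[None, :]])
    return np.linalg.solve(theta @ Phi[:, J], np.eye(len(J))[-1])
```

On trivial inputs (e.g. Phi = Id), all three formulas reduce to the expected unit multipliers, which makes a quick consistency check possible.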
It is important to emphasize\nthat the proof strategy is significantly different from the classical approach developed for \u03b1 = 2,\nmainly because of the lack of smoothness of the loss function. The proof for \u03b1 = 1 follows a similar\nstructure, and due to space limitation, it can be found in the supplementary material.\nTheorem 1. Let \u03b1 \u2208 {1, 2, +\u221e}. Suppose that x0 is identifiable, and let p\u03b2 be a minimal norm\ncertificate (see (2)) with associated extended support J (see (3)). Suppose that the restricted injectivity\ncondition (INJ\u03b1) is satisfied so that v\u03b2,J can be explicitly computed (see (4)). Then there exist\nconstants c1, c2 > 0 depending only on \u03a6 and p\u03b2 such that, for any (w, \u03c4 ) satisfying\n\n||w||\u03b1 < c1\u03c4 and \u03c4 \u2264 c2x, where x def.= mini\u2208I |x0,i|,\n\n(5)\n\na solution x\u03c4 of (P\u03c4\u03b1(\u03a6x0 + w)) with support equal to J is given by\n\nx\u03c4,J def.= x0,J + (PT\u03b2 \u03a6\u00b7,J )+w \u2212 \u03c4 v\u03b2,J .\n\n(6)\n\nThis theorem shows that if the signal-to-noise ratio is large enough and \u03c4 is chosen in proportion\nto the noise level ||w||\u03b1, then there is a solution supported exactly in the extended support J. Note\nin particular that this solution (6) has the correct sign pattern sign(x\u03c4,I ) = sign(x0,I ), but might\nexhibit outliers if \u02dcJ def.= J\\I \u2260 \u2205. The special case I = J characterizes the exact support stability\n(\u201csparsistency\u201d), and in the case \u03b1 = 2, the assumptions involving the dual certificate correspond to a\ncondition often referred to as the \u201cirrepresentable condition\u201d in the literature (see Section 1.6).\nIn Section 3, we propose numerical simulations to illustrate our theoretical findings on a compressed\nsensing (CS) scenario. 
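Formula (6) can be sanity-checked on a toy instance for \u03b1 = \u221e, where (P\u03c4\u03b1(y)) is a linear program. In the hedged sketch below (our construction, not the paper's experiment), \u03a6 = Id so that J = I and, using Table 1, (6) reduces to x\u03c4 = x0 + w \u2212 \u03c4 v1 on the support:

```python
import numpy as np
from scipy.optimize import linprog

# Toy instance: Phi = Id (so J = I = {0}), x0 = (1, 0), small uniform noise.
Phi = np.eye(2)
x0 = np.array([1.0, 0.0])
w = np.array([0.05, -0.03])
tau = 0.2
y = Phi @ x0 + w

# Solve min ||x||_1 s.t. ||Phi x - y||_inf <= tau as an LP in z = [x; t].
n = 2
c = np.r_[np.zeros(n), np.ones(n)]
I2 = np.eye(n)
A_ub = np.block([[I2, -I2], [-I2, -I2],
                 [Phi, np.zeros((n, n))], [-Phi, np.zeros((n, n))]])
b_ub = np.r_[np.zeros(2 * n), y + tau, tau - y]
res = linprog(c, A_ub=A_ub, b_ub=b_ub,
              bounds=[(None, None)] * n + [(0, None)] * n)
x_tau = res.x[:n]

# Here v_{1,J} = Phi_{S,J}^{-1} sign(p_{1,S}) = 1, so (6) predicts
# x_tau = (x0_1 + w_1 - tau, 0) = (0.85, 0), with support J = I.
```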
Using Theorem 1, we are able to numerically assess the degree of support\ninstability of CS recovery using \u2113\u03b1 fidelity. As a prelude, to shed light on this result, we show on\nFigure 2 a smaller simulated CS example for (\u03b1, \u03b2) = (\u221e, 1). The parameters are n = 20, m = 10\nand |I| = 4; x0 and \u03a6 are generated as in the experiment of Section 3 and we use CVX/MOSEK\n[8, 7] at best precision to solve the optimization programs. First, we observe that x0 is indeed\nidentifiable by solving (P0(\u03a6x0)). Then we solve (2) to compute p\u03b2 and predict the extended support\nJ. Finally, we add uniformly distributed noise w with wi \u223ci.i.d. U(\u2212\u03b4, \u03b4), with \u03b4 chosen appropriately\nto ensure that the hypotheses hold, and we solve (P\u03c4\u03b1(y)). Observe that as we increase \u03c4, new non-zero\nentries appear in x\u03c4 but, because w and \u03c4 are small enough, as predicted, we have supp(x\u03c4 ) = J.\nLet us now comment on the limitations of our analysis. First, this result does not trivially extend to\nthe general case \u03b1 \u2208 [1, +\u221e] as there is, in general, no simple closed form for x\u03c4 . A generalization\nwould require more material and is out of the scope of this paper. Nevertheless, our simulations in\nSection 3 stand for arbitrary \u03b1 \u2208 [1, +\u221e], which is why the general formulation was presented.\nSecond, the larger noise regime, though interesting, is also out of the scope. Let us note that no other\nresults in the literature (even for \u21132) provide any insight about sparsistency in the large noise regime.\nIn that case, we are only able to provide bounds on the distance between x0 and the recovered vector,\nbut this is the subject of a forthcoming paper.\nFinally, our work is agnostic with respect to the noise models. 
Being able to distinguish between\ndifferent noise models would require further analysis of the constants involved and some additional\nconstraints on \u03a6. However, our result is a big step towards the understanding of the solutions' behavior\nand can be used in this analysis.\n\n1.6 Relation to Prior Works\n\nTo the best of our knowledge, Theorem 1 is the first to study the support stability guarantees of\nminimizing the \u21131-norm with a non-smooth loss function, and in particular here the \u21131 and \u2113\u221e losses.\nThe smooth case \u03b1 = 2 is however much more studied, and in particular, the associated support\nstability results we state here are now well understood. Note that most of the corresponding literature\nstudies in general the penalized form, i.e., minx (1/2)||\u03a6x \u2212 y||2^2 + \u03bb||x||1, instead of our constrained\nformulation (P\u03c4\u03b1(y)). In the case \u03b1 = 2, since the loss is smooth, this distinction is minor and the\nproof is almost the same for both settings. However, for \u03b1 \u2208 {1, +\u221e}, it is crucial to study the\nconstrained problems to be able to state our results. The support stability (also called \u201csparsistency\u201d,\ncorresponding to the special case I = J of our result) of (P\u03c4\u03b1(y)) in the case \u03b1 = 2 has been proved\nby several authors in slightly different setups. In the signal processing literature, this result can be\ntraced back to the early work of J-J. Fuchs [6] who showed Theorem 1 when \u03b1 = 2 and I = J. In\nthe statistics literature, sparsistency is also proved in [19] in the case where \u03a6 is random, the result of\nsupport stability being then claimed with high probability. 
The condition that I = J, i.e., that the\nminimal norm certificate p\u03b2 (for \u03b1 = \u03b2 = 2) is saturating only on the support, is often coined the\n\u201cirrepresentable condition\u201d in the statistics and machine learning literature. These results have been\nextended recently in [5] to the case where the support I is not stable, i.e. I \u228a J. One could also cite\n[15], whose results are somewhat connected but are restricted to the \u21132 loss and do not hold in our\ncase. Note that \u201csparsistency\u201d-like results have been proved for many \u201clow-complexity\u201d regularizers\nbeyond the \u21131-norm. Let us quote among others: the group-lasso [1], the nuclear norm [2], the total\nvariation [16] and a very general class of \u201cpartly-smooth\u201d regularizers [17]. Let us also point out\nthat one of the main sources of application of these results is the analysis of the performance of\ncompressed sensing problems, where the randomness of \u03a6 allows one to derive sharp sample complexity\nbounds as a function of the sparsity of x0 and n, see for instance [18]. Let us also stress that these\nsupport recovery results are different from those obtained using tools such as the Restricted Isometry\nProperty and the like (see for instance [3]) in many respects. For instance, the guarantees they provide\nare uniform (i.e., they hold for any sparse enough vector x0), though they usually lead to quite\npessimistic worst-case bounds, and the stability is measured in the \u21132 sense.\n\n2 Proof of Theorem 1\n\nIn this section, we prove the main result of this paper. For the sake of brevity, when parts of the proof\nbecome specific to a particular choice of \u03b1, we only write the details for \u03b1 = \u221e. 
The details\nof the proof for \u03b1 = 1 can be found in the supplementary material.\nIt can be shown that the Fenchel-Rockafellar dual problem to (P\u03c4\u03b1(y)) is [12]\n\nmin_{p\u2208Rm} {\u2212\u27e8y, p\u27e9 + \u03c4 ||p||\u03b2 s.t. ||\u03a6\u2217p||\u221e \u2264 1} .\n\n(D\u03c4\u03b2(y))\n\nFrom the corresponding (primal-dual) extremality relations, one can deduce that (\u02c6x, \u02c6p) is an optimal\nprimal-dual Kuhn-Tucker pair if, and only if,\n\n\u03a6\u2217\u00b7,\u02c6I \u02c6p = sign(\u02c6x\u02c6I ) and ||\u03a6\u2217\u02c6p||\u221e \u2264 1,\n\n(7)\n\nwhere \u02c6I = supp(\u02c6x), and\n\n(y \u2212 \u03a6\u02c6x)/\u03c4 \u2208 \u2202||\u02c6p||\u03b2 .\n\n(8)\n\nThe first relationship comes from the sub-differential of the \u21131 regularization term while the second is\nspecific to a particular choice of \u03b1 for the \u2113\u03b1-norm data fidelity constraint. We start by proving\nLemma 1 and Lemma 2.\nProof of Lemma 1. Let us rewrite the problem (2) by introducing the auxiliary variable \u03b7 = \u03a6\u2217p\nas\n\nmin_{p,\u03b7} {||p||\u03b2 + \u03b9B\u221e (\u03b7) | \u03b7 = \u03a6\u2217p, \u03b7I = sign(x0,I )} ,\n\n(9)\n\nwhere \u03b9B\u221e is the indicator function of the unit \u2113\u221e ball. 
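For a given candidate pair, the optimality conditions (7)-(8) are easy to check numerically. A sketch for \u03b2 = 1 (\u03b1 = \u221e), where \u2202||p||1 consists of vectors equal to sign(p) on supp(p) and bounded by 1 in magnitude elsewhere; the toy pair below is illustrative:

```python
import numpy as np

def is_kkt_pair(Phi, y, x_hat, p_hat, tau, tol=1e-9):
    """Check the extremality relations (7)-(8) for beta = 1 (alpha = inf)."""
    I_hat = np.flatnonzero(x_hat)
    eta = Phi.T @ p_hat
    # (7): Phi_{.,I}^T p = sign(x_I) and ||Phi^T p||_inf <= 1
    ok7 = (np.allclose(eta[I_hat], np.sign(x_hat[I_hat]), atol=tol)
           and np.max(np.abs(eta)) <= 1 + tol)
    # (8): u = (y - Phi x)/tau must lie in the l1 subdifferential at p_hat
    u = (y - Phi @ x_hat) / tau
    S = np.flatnonzero(p_hat)
    ok8 = (np.allclose(u[S], np.sign(p_hat[S]), atol=tol)
           and np.max(np.abs(u)) <= 1 + tol)
    return ok7 and ok8

# toy pair with Phi = Id, y = (1.2, 0.1), tau = 0.2:
# x = (1, 0) and p = (1, 0) satisfy both relations
Phi = np.eye(2)
ok = is_kkt_pair(Phi, np.array([1.2, 0.1]), np.array([1.0, 0.0]),
                 np.array([1.0, 0.0]), 0.2)
```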
Define the Lagrange multipliers v and zI and\nthe associated Lagrangian function\n\nL(p, \u03b7, v, zI ) = ||p||\u03b2 + \u03b9B\u221e (\u03b7) + \u27e8v, \u03b7 \u2212 \u03a6\u2217p\u27e9 + \u27e8zI , \u03b7I \u2212 sign(x0,I )\u27e9.\n\nDefining zIc = 0, the first order optimality conditions (generalized KKT conditions) for p and \u03b7 read\n\n\u03a6v \u2208 \u2202||p||\u03b2 and \u2212v \u2212 z \u2208 \u2202\u03b9B\u221e (\u03b7).\n\nFrom the normal cone of B\u221e at \u03b7 on its boundary, the second condition is\n\n\u2212v \u2212 z \u2208 {u | uJc = 0, sign(uJ ) = \u03b7J } ,\n\nwhere J = sat(\u03b7) = sat(\u03a6\u2217p). Since I \u2286 J, v is supported on J. Moreover, on \u02dcJ = J\\I, we\nhave \u2212sign(v\u02dcJ ) = \u03b7\u02dcJ . As p\u03b2 is a solution to (9), we can define a corresponding vector of Lagrange\nmultipliers v\u03b2 supported on J such that \u2212sign(v\u03b2,\u02dcJ ) = \u03a6\u2217\u00b7,\u02dcJ p\u03b2 and \u03a6\u00b7,J v\u03b2,J \u2208 \u2202||p\u03b2||\u03b2.\nTo prove the lemma, it remains to show that \u00afx\u03c4 is indeed a solution to (P\u03c4\u03b1(y)), i.e., it obeys (7) and\n(8) for some dual variable \u02c6p. We will show that this is the case with \u02c6p = p\u03b2. Observe that p\u03b2 \u2260 0 as\notherwise, it would mean that x0 = 0, which contradicts our initial assumption of non-zero x0. We\ncan then directly see that (8) is satisfied. Indeed, noting y0 def.= \u03a6x0, we can write\n\ny0 \u2212 \u03a6\u00b7,J \u00afx\u03c4,J = \u03c4 \u03a6\u00b7,J v\u03b2,J \u2208 \u03c4 \u2202||p\u03b2||\u03b2.\n\nBy definition of p\u03b2, we have ||\u03a6\u2217p\u03b2||\u221e \u2264 1. In addition, it must satisfy \u03a6\u2217\u00b7,J p\u03b2 = sign(\u00afx\u03c4,J ). Outside\nI, the condition is always satisfied since \u2212sign(v\u03b2,\u02dcJ ) = \u03a6\u2217\u00b7,\u02dcJ p\u03b2. On I, we know that \u03a6\u2217\u00b7,I p\u03b2 =\nsign(x0,I ). 
The condition on \u03c4 is thus |x0,i| > \u03c4 |v\u03b2,i|, \u2200i \u2208 I, or equivalently, \u03c4 < x/||v\u03b2,I ||\u221e.\nProof of Lemma 2. As established by Lemma 1, the existence of p1 and of v1 is implied by the\nidentifiability of x0. We have the following:\n\n\u2203p1 \u21d2 \u2203pS , \u03a6\u2217S,J pS = sJ \u21d4 \u03a6\u2217S,J is surjective \u21d4 |S| \u2265 |J|,\n\u2203v1 \u21d2 \u2203vJ , \u03a6S,J vJ = qS \u21d4 \u03a6S,J is surjective \u21d4 |J| \u2265 |S|.\n\nTo clarify, we detail the first line. Since \u03a6S,J is full rank, |S| \u2265 |J| is equivalent to surjectivity.\nAssume \u03a6\u2217S,J is not surjective so that |S| < |J|, then sJ /\u2208 Im(\u03a6\u2217S,J ) and the over-determined system\n\u03a6\u2217S,J pS = sJ has no solution in pS , which contradicts the existence of p1. Now assume \u03a6\u2217S,J is\nsurjective, then we can take pS = \u03a6\u2217,\u2020S,J sJ as a solution, where \u03a6\u2217,\u2020S,J is any right-inverse of \u03a6\u2217S,J . This\nproves that \u03a6S,J is invertible.\nWe are now ready to prove the main result in the particular case \u03b1 = \u221e.\nProof of Theorem 1 (\u03b1 = \u221e). Our proof consists in constructing a vector supported on J, obeying\nthe implicit relationship (6), which is indeed a solution to (P\u03c4\u221e(\u03a6x0 + w)) for an appropriate\nregime of the parameters (\u03c4, ||w||\u03b1). Note that we assume that the hypothesis of Lemma 2 on \u03a6 holds\nand in particular, \u03a6S,J is invertible. 
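In practice, this invertibility hypothesis (and the other restricted injectivity conditions of Table 1) can be verified numerically for a given \u03a6 and given index sets; a minimal sketch (the index sets below are illustrative):

```python
import numpy as np

def inj_holds_inf(Phi, S, J):
    """(INJ_inf) via Lemma 2's conclusion: Phi_{S,J} is square and invertible."""
    sub = Phi[np.ix_(S, J)]
    return sub.shape[0] == sub.shape[1] and np.linalg.matrix_rank(sub) == len(J)

def inj_holds_2(Phi, J):
    """(INJ_2) from Table 1: Ker(Phi_{.,J}) = {0}, i.e. full column rank."""
    return np.linalg.matrix_rank(Phi[:, J]) == len(J)

# illustrative check on a seeded Gaussian design (full rank with probability 1)
rng = np.random.default_rng(0)
Phi = rng.standard_normal((4, 6))
ok = inj_holds_inf(Phi, S=[0, 2], J=[1, 3]) and inj_holds_2(Phi, J=[1, 3])
```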
When (\u03b1, \u03b2) = (\u221e, 1), the first order condition (8), which holds\nfor any optimal primal-dual pair (x, p), reads, with Sp def.= supp(p),\n\nySp \u2212 \u03a6Sp,\u00b7 x = \u03c4 sign(pSp ) and ||y \u2212 \u03a6x||\u221e \u2264 \u03c4.\n\n(10)\n\nOne should then look for a candidate primal-dual pair (\u02c6x, \u02c6p) such that supp(\u02c6x) = J and satisfying\n\nyS\u02c6p \u2212 \u03a6S\u02c6p,J \u02c6xJ = \u03c4 sign(\u02c6pS\u02c6p ).\n\n(11)\n\nWe now need to show that the first order conditions (7) and (10) hold for some p = \u02c6p solution of\nthe \u201cperturbed\u201d dual problem (D\u03c41(\u03a6x0 + w)) with x = \u02c6x. Actually, we will show that under the\nconditions of the theorem, this holds for \u02c6p = p1, i.e., p1 is solution of (D\u03c41(\u03a6x0 + w)), so that\n\u02c6pS = p1,S .\nLet us start by proving the equality part of (7), \u03a6\u2217S,J \u02c6pS = sign(\u02c6xJ ). Since \u03a6S,J is invertible, we have\n\n\u02c6xJ = \u03a6\u22121S,J yS \u2212 \u03c4 \u03a6\u22121S,J sign(p1,S ) = x0,J + \u03a6\u22121S,J wS \u2212 \u03c4 v1,J ,\n\nand (7) holds with \u02c6pS = p1,S if and only if sign(\u02c6xJ ) = \u03a6\u2217S,J p1,S . Noting IdI,J the restriction from J to I, we have\n\nsign( x0,I + IdI,J \u03a6\u22121S,J wS \u2212 \u03c4 v1,I ) = sign(x0,I )\n\nas soon as\n\n|( \u03a6\u22121S,J wS )i \u2212 \u03c4 v1,i| < |x0,i| \u2200i \u2208 I.\n\nIt is sufficient to require\n\n||IdI,J \u03a6\u22121S,J wS \u2212 \u03c4 v1,I ||\u221e < x,\n\nwhich is implied by\n\n||\u03a6\u22121S,J ||\u221e,\u221e ||w||\u221e + \u03c4 ||v1,I ||\u221e < x,\n\nwith x = mini\u2208I |x0,i|. Injecting the fact that ||w||\u221e < c1\u03c4 (the value of c1 will be derived later), we\nget the condition\n\n\u03c4 (bc1 + \u03bd) \u2264 x,\n\nwith b = ||\u03a6\u22121S,J ||\u221e,\u221e and \u03bd = ||v1||\u221e \u2264 b. 
Rearranging the terms, we obtain\n\n\u03c4 \u2264 x/(bc1 + \u03bd) = c2x,\n\nwhich guarantees sign(\u02c6xI ) = sign(x0,I ). Outside I, defining Id\u02dcJ,J as the restriction from J to \u02dcJ,\nwe must have\n\n\u03a6\u2217S,\u02dcJ p1,S = sign( Id\u02dcJ,J \u03a6\u22121S,J wS \u2212 \u03c4 v1,\u02dcJ ) .\n\nFrom Lemma 1, we know that \u2212sign(v1,\u02dcJ ) = \u03a6\u2217S,\u02dcJ p1,S , so that the condition is satisfied as soon as\n\n|( \u03a6\u22121S,J wS )j| < \u03c4 |v1,j| \u2200j \u2208 \u02dcJ.\n\nNoting v = minj\u2208\u02dcJ |v1,j|, we get the sufficient condition for (7),\n\n||\u03a6\u22121S,J wS ||\u221e < \u03c4 v, i.e., ||w||\u221e < \u03c4 v/b.\n\n(c1a)\n\nWe can now verify (10). From (11) we see that the equality part is satisfied on S. Outside S, we have\n\nySc \u2212 \u03a6Sc,\u00b7 \u02c6x = wSc \u2212 \u03a6Sc,J \u03a6\u22121S,J wS + \u03c4 \u03a6Sc,J v1,J ,\n\nwhich must be smaller than \u03c4, i.e.,\n\n||wSc \u2212 \u03a6Sc,J \u03a6\u22121S,J wS + \u03c4 \u03a6Sc,J v1,J ||\u221e \u2264 \u03c4.\n\nIt is thus sufficient to have\n\n(1 + ||\u03a6Sc,J \u03a6\u22121S,J ||\u221e,\u221e)||w||\u221e + \u03c4 \u00b5 \u2264 \u03c4,\n\nwith \u00b5 def.= ||\u03a6Sc,J v1,J ||\u221e. Noting a = ||\u03a6Sc,J \u03a6\u22121S,J ||\u221e,\u221e, we get\n\n||w||\u221e \u2264 ((1 \u2212 \u00b5)/(1 + a)) \u03c4.\n\n(c1b)\n\n(c1a) and (c1b) together give the value of c1. This ensures that the inequality part of (10) is satisfied\nfor \u02c6x and, with that, that \u02c6x is solution to (P\u03c4\u221e(\u03a6x0 + w)) and p1 solution to (D\u03c41(\u03a6x0 + w)), which\nconcludes the proof.\nRemark 2. From Lemma 1, we know that in all generality \u00b5 \u2264 1. 
If the inequality were saturated, it\nwould mean that c1 = 0 and no noise would be allowed. Fortunately, it is easy to prove that under a\nmild assumption on \u03a6, similar to the one of Lemma 2 (which holds with probability 1 for Gaussian\nmatrices), the inequality is strict, i.e., \u00b5 < 1.\n\n3 Numerical experiments\n\nIn order to illustrate support stability in Lemma 1 and Theorem 1, we address numerically the\nproblem of comparing supp(x\u03c4 ) and supp(x0) in a compressed sensing setting. Theorem 1 shows\nthat supp(x\u03c4 ) does not depend on w (as long as it is small enough); simulations thus do not involve\nnoise. All computations are done in Matlab, using CVX [8, 7], with the MOSEK solver at \u201cbest\u201d\nprecision setting to solve the convex problems. We set n = 1000, m = 900 and generate 200 times a\nrandom sensing matrix \u03a6 \u2208 Rm\u00d7n with \u03a6ij \u223ci.i.d. N (0, 1). For each sensing matrix, we generate\n60 different k-sparse vectors x0 with support I, where k def.= |I| varies from 10 to 600. The non-zero\nentries of x0 are randomly picked in {\u00b11} with equal probability. Note that this choice does not\nimpact the result because the definition of Jp\u03b2 only depends on sign(x0) (see (1)). It will only affect\nthe bounds in (5). For each case, we verify that x0 is identifiable and for \u03b1 \u2208 {1, 2, \u221e} (which\ncorrespond to \u03b2 \u2208 {\u221e, 2, 1}), we compute the minimum \u2113\u03b2-norm certificate p\u03b2, solution to (2), and\nin particular, the support excess \u02dcJp\u03b2 def.= sat(\u03a6\u2217p\u03b2)\\I. It is important to emphasize that there is no\nnoise in these simulations. As long as the hypotheses of the theorem are satisfied, we can predict that\nsupp(x\u03c4 ) = Jp\u03b2 \u2287 I without actually computing x\u03c4 , choosing \u03c4 , or generating w.\n\nFig. 
3: (best observed in color) Sweep over $s_e \in \{0, 10, \ldots\}$ of the empirical probability, as a function of the sparsity $k$, that $x_0$ is identifiable and $|\tilde J_{p_\infty}| \leq s_e$ (left), $|\tilde J_{p_2}| \leq s_e$ (middle) or $|\tilde J_{p_1}| \leq s_e$ (right). The bluest curve corresponds to $s_e = 0$ and the reddest to the maximal empirical value of $|\tilde J_{p_\beta}|$.

Fig. 4: (best observed in color) Sweep over $1/\alpha \in [0, 1]$ of the empirical probability, as a function of $k$, that $x_0$ is identifiable and $|\tilde J_{p_\beta}| \leq s_e$, for three values of $s_e$. The dotted red line indicates $\alpha = 2$.

We define a support excess threshold $s_e \in \mathbb{N}$ varying from 0 to $\infty$. On Figure 3 we plot the probability that $x_0$ is identifiable and that $|\tilde J_{p_\beta}|$, the cardinality of the predicted support excess, is smaller than or equal to $s_e$. It is interesting to note that the probability that $|\tilde J_{p_1}| = 0$ (the bluest horizontal curve on the right plot) is 0, which means that even for extreme sparsity ($k = 10$) and a relatively high $m/n$ rate of 0.9, the support is never predicted as perfectly stable for $\alpha = \infty$ in this experiment. As a rule of thumb, a support excess of $|\tilde J_{p_1}| \approx k$ is much more likely. In comparison, $\ell_2$ recovery provides perfect support stability with much higher probability for $k$ not too large, and the expected size of $\tilde J_{p_2}$ increases more slowly with $k$. Finally, the support stability with $\ell_1$ data fidelity is in between: it is possible to recover the support perfectly, but the requirement on $k$ is a bit more restrictive than with the $\ell_2$ fidelity.
As previously noted, Lemma 1 and its proof remain valid for smooth loss functions such as the $\ell_\alpha$-norm with $\alpha \in (1, \infty)$. It therefore makes sense to compare the results above with the ones obtained for $\alpha \in (1, \infty)$.
On Figure 4 we display the results of the same experiment, but with $1/\alpha$ as the vertical axis. To produce the figure, we compute $p_\beta$ and $\tilde J_{p_\beta}$ for $\beta$ corresponding to 41 equispaced values of $1/\alpha \in [0, 1]$. The probability that $|\tilde J_{p_\beta}| \leq s_e$ is represented by the color intensity, and the three plots correspond to three different values of $s_e$. On this figure, the yellow-to-blue transition can be interpreted as the maximal $k$ ensuring, with high probability, that $|\tilde J_{p_\beta}|$ does not exceed $s_e$. It is always (for all $s_e$) furthest to the right at $\alpha = 2$, which means that the $\ell_2$ data fidelity constraint provides the highest support stability. Interestingly, this maximal $k$ decreases gracefully as $\alpha$ moves away from 2 in either direction. Finally, as already observed on Figure 3, we see that, especially when $s_e$ is small, the $\ell_1$ loss function has a small advantage over the $\ell_\infty$ loss.
4 Conclusion
In this paper, we provided sharp theoretical guarantees for stable support recovery under small enough noise by $\ell_1$ minimization with non-smooth loss functions. Unlike the classical setting where the data loss is smooth, our analysis reveals the difficulties arising from non-smoothness, which necessitated a novel proof strategy. Though we focused here on the case of $\ell_\alpha$ data loss functions, for $\alpha \in \{1, 2, \infty\}$, our analysis can be extended to more general non-smooth losses, including coercive gauges. This will be our next milestone.

Acknowledgments

KD and LJ are funded by the Belgian F.R.S.-FNRS. JF is partly supported by Institut Universitaire de France.
GP is supported by the European Research Council (ERC project SIGMA-Vision).

References

[1] F. R. Bach. Consistency of the group Lasso and multiple kernel learning. Journal of Machine Learning Research, 9:1179–1225, 2008.

[2] F. R. Bach. Consistency of trace norm minimization. Journal of Machine Learning Research, 9:1019–1048, 2008.

[3] E. J. Candès, J. K. Romberg, and T. Tao. Stable signal recovery from incomplete and inaccurate measurements. Communications on Pure and Applied Mathematics, 59(8):1207–1223, 2006.

[4] S. S. Chen, D. L. Donoho, and M. A. Saunders. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20(1):33–61, 1998.

[5] V. Duval and G. Peyré. Sparse spikes deconvolution on thin grids. Preprint 01135200, HAL, 2015.

[6] J.-J. Fuchs. On sparse representations in arbitrary redundant bases.
IEEE Transactions on Information Theory, 50(6):1341–1344, 2004.

[7] M. Grant and S. Boyd. Graph implementations for nonsmooth convex programs. In V. Blondel, S. Boyd, and H. Kimura, editors, Recent Advances in Learning and Control, Lecture Notes in Control and Information Sciences, pages 95–110. Springer-Verlag Limited, 2008. http://stanford.edu/~boyd/graph_dcp.html.

[8] M. Grant and S. Boyd. CVX: Matlab software for disciplined convex programming, version 2.1. http://cvxr.com/cvx, March 2014.

[9] L. Jacques. On the optimality of a L1/L1 solver for sparse signal recovery from sparsely corrupted compressive measurements. Technical Report TR-LJ-2013.01, arXiv preprint arXiv:1303.5097, 2013.

[10] L. Jacques, D. K. Hammond, and J. M. Fadili. Dequantizing compressed sensing: When oversampling and non-Gaussian constraints combine. IEEE Transactions on Information Theory, 57(1):559–571, 2011.

[11] M. Nikolova. A variational approach to remove outliers and impulse noise. Journal of Mathematical Imaging and Vision, 20(1), 2004.

[12] R. T. Rockafellar. Conjugate Duality and Optimization, volume 16. SIAM, 1974.

[13] C. Studer, P. Kuppinger, G. Pope, and H. Bölcskei. Recovery of sparsely corrupted signals. IEEE Transactions on Information Theory, 58(5):3115–3130, 2012.

[14] R. Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B, 58(1):267–288, 1996.

[15] R. J. Tibshirani. The lasso problem and uniqueness. Electronic Journal of Statistics, 7:1456–1490, 2013.

[16] S. Vaiter, G. Peyré, C. Dossal, and M. J. Fadili. Robust sparse analysis regularization. IEEE Transactions on Information Theory, 59(4):2001–2016, 2013.

[17] S. Vaiter, G. Peyré, and J. Fadili. Model consistency of partly smooth regularizers. Preprint 00987293, HAL, 2014.

[18] M. J. Wainwright.
Sharp thresholds for high-dimensional and noisy sparsity recovery using $\ell_1$-constrained quadratic programming (Lasso). IEEE Transactions on Information Theory, 55(5):2183–2202, 2009.

[19] P. Zhao and B. Yu. On model selection consistency of Lasso. Journal of Machine Learning Research, 7:2541–2563, 2006.