{"title": "A Family of Penalty Functions for Structured Sparsity", "book": "Advances in Neural Information Processing Systems", "page_first": 1612, "page_last": 1623, "abstract": "We study the problem of learning a sparse linear regression vector under additional conditions on the structure of its sparsity pattern. We present a family of convex penalty functions, which encode this prior knowledge by means of a set of constraints on the absolute values of the regression coefficients. This family subsumes the $\\ell_1$ norm and is flexible enough to include different models of sparsity patterns, which are of practical and theoretical importance. We establish some important properties of these functions and discuss some examples where they can be computed explicitly. Moreover, we present a convergent optimization algorithm for solving regularized least squares with these penalty functions. Numerical simulations highlight the benefit of structured sparsity and the advantage offered by our approach over the Lasso and other related methods.", "full_text": "A Family of Penalty Functions for Structured\n\nSparsity\n\nCharles A. Micchelli\u2217\n\nDepartment of Mathematics\nCity University of Hong Kong\n\n83 Tat Chee Avenue, Kowloon Tong\n\nHong Kong\n\nJean M. Morales\n\nDepartment of Computer Science\n\nUniversity College London\nGower Street, London WC1E\n\nEngland, UK\n\ncharles micchelli@hotmail.com\n\nj.morales@cs.ucl.ac.uk\n\nMassimiliano Pontil\n\nDepartment of Computer Science\n\nUniversity College London\nGower Street, London WC1E\n\nEngland, UK\n\nm.pontil@cs.ucl.ac.uk\n\nAbstract\n\nWe study the problem of learning a sparse linear regression vector under addi-\ntional conditions on the structure of its sparsity pattern. We present a family of\nconvex penalty functions, which encode this prior knowledge by means of a set of\nconstraints on the absolute values of the regression coef\ufb01cients. 
This family subsumes the ℓ1 norm and is flexible enough to include different models of sparsity patterns, which are of practical and theoretical importance. We establish some important properties of these functions and discuss some examples where they can be computed explicitly. Moreover, we present a convergent optimization algorithm for solving regularized least squares with these penalty functions. Numerical simulations highlight the benefit of structured sparsity and the advantage offered by our approach over the Lasso and other related methods.

1 Introduction

The problem of sparse estimation is becoming increasingly important in machine learning and statistics. In its simplest form, this problem consists in estimating a regression vector β∗ ∈ R^n from a data vector y ∈ R^m, obtained from the model y = Xβ∗ + ξ, where X is an m × n matrix, which may be fixed or randomly chosen, and ξ ∈ R^m is a vector resulting from the presence of noise. An important rationale for sparse estimation comes from the observation that in many practical applications the number of parameters n is much larger than the data size m, but the vector β∗ is known to be sparse, that is, most of its components are equal to zero. Under these circumstances, it has been shown that regularization with the ℓ1 norm, commonly referred to as the Lasso method, provides an effective means to estimate the underlying regression vector as well as its sparsity pattern, see for example [4, 12, 15] and references therein.

In this paper, we are interested in sparse estimation under additional conditions on the sparsity pattern of β∗. In other words, not only do we expect that β∗ is sparse but also that it is structured sparse, namely certain configurations of its nonzero components are to be preferred to others. This problem

∗C.A. Micchelli is also with the Dept.
of Mathematics and Statistics, State University of New York, Albany, USA. We are grateful to A. Argyriou and Y. Ying for valuable discussions. This work was supported by NSF Grant ITR-0312113, Air Force Grant AFOSR-FA9550, and EPSRC Grant EP/D071542/1.

arises in several applications, see [10] for a discussion. The prior knowledge that we consider in this paper is that the vector |β∗|, whose components are the absolute values of the corresponding components of β∗, should belong to some prescribed convex set Λ. For certain choices of Λ this implies a constraint on the sparsity pattern as well. For example, the set Λ may include vectors with some desired monotonicity constraints, or other constraints on the “shape” of the regression vector. Unfortunately, the constraint that |β∗| ∈ Λ is nonconvex and its implementation is computationally challenging. To overcome this difficulty, we propose a novel family of penalty functions. It is based on an extension of the ℓ1 norm used by the Lasso method and involves the solution of a smooth convex optimization problem, which incorporates the structured sparsity constraints. As we shall see, a key property of our approach is that the penalty function equals the ℓ1 norm of a vector β when |β| ∈ Λ and is strictly greater than the ℓ1 norm otherwise. This observation suggests that the penalty function encourages the desired structured sparsity property.

There has been some recent research interest in structured sparsity, see [1, 2, 7, 9, 10, 11, 13, 16] and references therein. Closest to our approach are penalty methods built around the idea of mixed ℓ1-ℓ2 norms.
In particular, the group Lasso method [16] assumes that the components of the underlying regression vector β∗ can be partitioned into prescribed groups, such that the restriction of β∗ to a group is equal to zero for most of the groups. This idea has been extended in [10, 17] by considering the possibility that the groups overlap according to certain hierarchical or spatially related structures. A limitation of these methods is that they can only handle sparsity patterns forming a single connected region. Our point of view is different from theirs and provides a means of designing more general and flexible penalty functions which maintain convexity whilst modeling richer model structures. For example, we will demonstrate that our family of penalty functions can model sparsity patterns forming multiple connected regions of coefficients.

The paper is organized as follows. In Section 2 we define the learning method and, in particular, describe the associated penalty function and establish some of its important properties. In Section 3 we provide examples of penalty functions, deriving the explicit analytical form in some important cases, namely the case that the set Λ is a box or the wedge with nonincreasing coordinates. In Section 4 we address the issue of solving the learning method numerically by means of an alternating minimization algorithm. Finally, in Section 5 we provide numerical simulations with this method, showing the advantage offered by our approach.

2 Learning method

In this section, we introduce the learning method and establish some important properties of the associated penalty function. We let R_{++} be the positive real line and let Nn be the set of positive integers up to n. We prescribe a convex subset Λ of the positive orthant R^n_{++} and estimate β∗ by a solution of the convex optimization problem

min{ ‖Xβ − y‖₂² + 2ρ Ω(β|Λ) : β ∈ R^n },   (2.1)

where ‖·‖₂ denotes the Euclidean norm. The penalty function takes the form

Ω(β|Λ) = inf{ Γ(β, λ) : λ ∈ Λ },   (2.2)

where the function Γ : R^n × R^n_{++} → R is given by the formula Γ(β, λ) = (1/2) Σ_{i∈Nn} (β_i²/λ_i + λ_i). Note that Γ is convex on its domain because each of its summands is a convex function. Hence, when the set Λ is convex it follows that Ω(·|Λ) is a convex function and (2.1) is a convex optimization problem. An essential idea behind our construction of this function is that, for every λ ∈ R_{++}, the quadratic function Γ(·, λ) provides a smooth approximation to |β| from above, which is exact at β = ±λ. We indicate this graphically in Figure 1-a. This fact follows immediately from the arithmetic-geometric mean inequality, namely (a + b)/2 ≥ √(ab). Using the same inequality it also follows that the Lasso problem corresponds to (2.1) when Λ = R^n_{++}, that is, it holds that Ω(β|R^n_{++}) = ‖β‖₁ := Σ_{i∈Nn} |β_i|. This important special case motivated us to consider the general method described above. The utility of (2.2) is that inserting it into (2.1) results in an optimization problem over λ and β with a continuously differentiable objective function.
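The two facts just used, namely that Γ(·, λ) majorizes the absolute value and touches it exactly at λ = |β|, are easy to check numerically. The following is a minimal illustrative sketch (the function name `gamma` is ours, not from the paper):

```python
import numpy as np

def gamma(beta, lam):
    # Γ(β, λ) = (1/2) Σ_i (β_i²/λ_i + λ_i), defined for λ_i > 0
    return 0.5 * np.sum(beta ** 2 / lam + lam)

beta = np.array([1.5, -0.3, 2.0])
l1_norm = np.sum(np.abs(beta))

# By the AM-GM inequality, Γ(β, λ) ≥ ‖β‖₁ for every λ > 0,
# with equality exactly at λ = |β|
assert np.isclose(gamma(beta, np.abs(beta)), l1_norm)
assert gamma(beta, np.abs(beta) + 0.2) > l1_norm
assert gamma(beta, 0.5 * np.abs(beta)) > l1_norm
```

In particular, minimizing `gamma` coordinate-wise over λ ∈ R^n_{++} recovers ‖β‖₁, which is the Lasso special case noted above.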
Hence, we have succeeded in expressing a nondifferentiable convex objective function by one which is continuously differentiable on its domain.

The next proposition provides a justification of the penalty function as a means to incorporate structured sparsity and establishes circumstances under which the penalty function is a norm.

Figure 1: (a): Function Γ(·, λ) for λ = 0.75 and λ = 1.50, together with the absolute value function; (b): Function Γ(β, ·) for β = 0.2, 1 and 2.

Proposition 2.1. For every β ∈ R^n, it holds that ‖β‖₁ ≤ Ω(β|Λ), and equality holds if and only if |β| := (|β_i| : i ∈ Nn) ∈ Λ. Moreover, if Λ is a nonempty convex cone then the function Ω(·|Λ) is a norm and we have that Ω(β|Λ) ≤ ω‖β‖₁, where ω := max{Ω(e_k|Λ) : k ∈ Nn} and {e_k : k ∈ Nn} is the canonical basis of R^n.

Proof. By the arithmetic-geometric mean inequality we have that ‖β‖₁ ≤ Γ(β, λ), proving the first assertion. If |β| ∈ Λ, there exists a sequence {λ_k : k ∈ N} in Λ such that lim_{k→∞} λ_k = |β|. Since Ω(β|Λ) ≤ Γ(β, λ_k), it readily follows that Ω(β|Λ) ≤ ‖β‖₁. Conversely, if Ω(β|Λ) = ‖β‖₁, then there is a sequence {λ_k : k ∈ N} in Λ such that Γ(β, λ_k) ≤ ‖β‖₁ + 1/k. This inequality implies that some subsequence of this sequence converges to a λ ∈ Λ. Using the arithmetic-geometric mean inequality we conclude that λ = |β| and the result follows. To prove the second part, observe that if Λ is a nonempty convex cone, namely, for any λ ∈ Λ and t ≥ 0 it holds that tλ ∈ Λ, then Ω is positively homogeneous. Indeed, making the change of variable λ′ = λ/|t| we see that Ω(tβ|Λ) = |t| Ω(β|Λ). Moreover, the above inequality Ω(β|Λ) ≥ ‖β‖₁ implies that if Ω(β|Λ) = 0 then β = 0. The proof of the triangle inequality follows from the homogeneity and convexity of Ω, namely Ω(α + β|Λ) = 2 Ω((α + β)/2 | Λ) ≤ Ω(α|Λ) + Ω(β|Λ). Finally, note that Ω(β|Λ) ≤ ω‖β‖₁ if and only if ω = max{Ω(β|Λ) : ‖β‖₁ = 1}. Since Ω is convex, the maximum above is achieved at an extreme point of the ℓ1 unit ball.

This proposition indicates that the function Ω(·|Λ) penalizes less those vectors β which have the property that |β| ∈ Λ, hence encouraging structured sparsity. Indeed, any permutation of the coordinates of a vector β with the above property will incur the same or a larger value of the penalty term. Moreover, for certain choices of the set Λ, some of which we describe below, the penalty function will encourage vectors which not only are sparse but also have sparsity patterns (1_{|β_i|>0} : i ∈ Nn) ∈ Λ, where 1_{·} denotes the indicator function.

We end this section by noting that a normalized version of the group Lasso penalty [16] is included in our setting as a special case.
If {J_ℓ : ℓ ∈ Nk}, k ∈ Nn, form a partition of the index set Nn, the corresponding group Lasso penalty is defined as Ω_GL(β) = Σ_{ℓ∈Nk} √|J_ℓ| ‖β_{J_ℓ}‖₂, where, for every J ⊆ Nn, we use the notation β_J = (β_j : j ∈ J). It is an easy matter to verify that Ω_GL(·) = Ω(·|Λ) for Λ = {λ : λ ∈ R^n_{++}, λ_j = θ_ℓ, j ∈ J_ℓ, ℓ ∈ Nk, θ_ℓ > 0}.

3 Examples of the penalty function

We proceed to discuss some examples of the set Λ ⊆ R^n_{++} which may be used in the design of the penalty function Ω(·|Λ). All but the first example fall into the category that Λ is a polyhedral cone, that is, Λ = {λ : λ ∈ R^n_{++}, Aλ ≥ 0}, where A is an m × n matrix. Thus, in view of Proposition 2.1, the function Ω(·|Λ) is a norm.

The first example corresponds to the prior knowledge that the magnitudes of the components of the regression vector should lie in some prescribed intervals.

Example 3.1. We choose a, b ∈ R^n, 0 < a ≤ b, and define the corresponding box as B[a, b] := ×_{i∈Nn} [a_i, b_i].

The theorem below establishes the form of the box penalty; see also [8, 14] for related penalty functions. To state our result, we define, for every t ∈ R, the function (t)_+ = max(0, t).

Theorem 3.1. We have that

Ω(β|B[a, b]) = ‖β‖₁ + Σ_{i∈Nn} ( (1/(2a_i)) (a_i − |β_i|)_+² + (1/(2b_i)) (|β_i| − b_i)_+² ).

Moreover, the components of the vector λ(β) := argmin{Γ(β, λ) : λ ∈ B[a, b]} are given by the equations λ_i(β) = |β_i| + (a_i − |β_i|)_+ − (|β_i| − b_i)_+, i ∈ Nn.

Proof. Since Ω(β|B[a, b]) = Σ_{i∈Nn} Ω(β_i|[a_i, b_i]), it suffices to establish the result in the case n = 1. We shall show that if a, b, β ∈ R, a ≤ b, then

Ω(β|[a, b]) = |β| + (1/(2a)) (a − |β|)_+² + (1/(2b)) (|β| − b)_+².   (3.1)

Since both sides of the above equation are continuous functions of β, it suffices to prove this equation for β ∈ R∖{0}. In this case, the function Γ(β, ·) is strictly convex in its second argument and, so, has a unique minimum in R_{++} at λ = |β|, see also Figure 1-b. Moreover, if |β| ≤ a the constrained minimum occurs at λ = a, whereas if |β| ≥ b it occurs at λ = b. This establishes the formula for λ(β). Consequently, we have that

Ω(β|[a, b]) = |β| 1_{a≤|β|≤b} + (1/2)(β²/a + a) 1_{|β|<a} + (1/2)(β²/b + b) 1_{|β|>b}.

Equation (3.1) now follows by a direct computation.

Note that the function in equation (3.1) is a concatenation of two quadratic functions, connected together by a linear function. Thus, the box penalty will favor sparsity only for a = 0, a case that is defined by a limiting argument.

The second example implements the prior knowledge that the coordinates of the vector λ are ordered in a nonincreasing fashion.

Example 3.2. We define the wedge as W = {λ : λ ∈ R^n_{++}, λ_j ≥ λ_{j+1}, j ∈ Nn−1}.

We say that a partition J = {J_ℓ : ℓ ∈ Nk} of Nn is contiguous if for all i ∈ J_ℓ, j ∈ J_{ℓ+1}, ℓ ∈ Nk−1, it holds that i < j. For example, if n = 3, the partitions {{1, 2}, {3}} and {{1}, {2}, {3}} are contiguous but {{1, 3}, {2}} is not.

Theorem 3.2. For every β ∈ (R∖{0})^n there is a unique contiguous partition J = {J_ℓ : ℓ ∈ Nk} of Nn, k ∈ Nn, such that

Ω(β|W) = Σ_{ℓ∈Nk} √|J_ℓ| ‖β_{J_ℓ}‖₂.   (3.2)

Moreover, the components of the vector λ(β) = argmin{Γ(β, λ) : λ ∈ W} are given by

λ_j(β) = ‖β_{J_ℓ}‖₂ / √|J_ℓ|,   j ∈ J_ℓ, ℓ ∈ Nk,   (3.3)

and, for every ℓ ∈ Nk and every subset K ⊂ J_ℓ formed by the first k < |J_ℓ| elements of J_ℓ, it holds that

‖β_K‖₂ / √k ≤ ‖β_{J_ℓ∖K}‖₂ / √(|J_ℓ| − k).   (3.4)

The partition J appearing in the theorem is determined by the set of inequalities λ_j ≥ λ_{j+1} which hold as equalities at the minimum. This set is identified by examining the Karush-Kuhn-Tucker optimality conditions [3] of the optimization problem (2.2) for Λ = W. The detailed proof is reported in the supplementary material. Equations (3.3) and (3.4) indicate a strategy to compute the partition associated with a vector β. We explain how to do this in Section 4.

An interesting property of the wedge penalty is that it has the form of a group Lasso penalty (see the discussion at the end of Section 2) with groups not fixed a priori but depending on the location of the vector β. The groups are the elements of the partition J and are identified by certain convex constraints on the vector β. For example, for n = 2 we obtain that Ω(β|W) = ‖β‖₁ if |β_1| > |β_2| and Ω(β|W) = √2 ‖β‖₂ otherwise.
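As a quick sanity check, the closed forms of Theorem 3.1 and of the n = 2 wedge case just stated can be compared against a brute-force grid minimization of Γ. The following sketch is illustrative only; the grid resolutions and tolerances are our choices, not part of the paper:

```python
import numpy as np

def gamma(beta, lam):
    # Γ(β, λ) = (1/2) Σ_i (β_i²/λ_i + λ_i)
    return 0.5 * np.sum(beta ** 2 / lam + lam)

# Theorem 3.1 with n = 1 and box [a, b] = [0.5, 2]:
# closed form vs. grid infimum over λ ∈ [a, b]
a, b = 0.5, 2.0
box_grid = np.linspace(a, b, 400)          # includes both endpoints
for beta in (0.2, 1.0, 3.0):
    closed = (abs(beta)
              + max(a - abs(beta), 0.0) ** 2 / (2 * a)
              + max(abs(beta) - b, 0.0) ** 2 / (2 * b))
    numeric = min(gamma(np.array([beta]), np.array([l])) for l in box_grid)
    assert abs(closed - numeric) < 1e-3

# n = 2 wedge: Ω(β|W) = ‖β‖₁ if |β₁| > |β₂|, and √2 ‖β‖₂ otherwise
grid = np.linspace(0.01, 5.0, 250)
for beta, target in ((np.array([2.0, 1.0]), 3.0),
                     (np.array([1.0, 2.0]), np.sqrt(10.0))):
    numeric = min(gamma(beta, np.array([l1, l2]))
                  for l1 in grid for l2 in grid if l2 <= l1)
    assert abs(numeric - target) < 1e-2
```

The box check is exact at the boundary cases because the grid contains the endpoints a and b; the wedge check tolerates a small grid error since its minimizers lie in the interior of the grid.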
For n = 3, we have that

Ω(β|W) =
  ‖β‖₁,   if |β_1| > |β_2| > |β_3|,   J = {{1}, {2}, {3}},
  √(2(β_1² + β_2²)) + |β_3|,   if |β_1| ≤ |β_2| and β_1² + β_2² > 2β_3²,   J = {{1, 2}, {3}},
  |β_1| + √(2(β_2² + β_3²)),   if |β_2| ≤ |β_3| and 2β_1² > β_2² + β_3²,   J = {{1}, {2, 3}},
  √(3(β_1² + β_2² + β_3²)),   otherwise,   J = {{1, 2, 3}},

where we have also reported the partition involved in each case.

The next example is an extension of the wedge set which is inspired by previous work on the group Lasso estimator with hierarchically overlapping groups [17]. It models vectors whose magnitudes are ordered according to a graphical structure. Within this context, the wedge corresponds to the set associated with a line graph.

Example 3.3. We let A be the incidence matrix of a directed graph and choose Λ = {λ : λ ∈ R^n_{++}, Aλ ≥ 0}.

We have confirmed that Theorem 3.2 extends to the case that the graph is a tree, but the general case is yet to be understood. We postpone this discussion to a future occasion.

Next, we note that the wedge may equivalently be expressed by the constraint that the difference vector D^1(λ) := (λ_j − λ_{j+1} : j ∈ Nn−1) is nonnegative. Our next example extends this observation by using the higher order difference operator, which is given by the formula D^k(λ) = (λ_j + Σ_{ℓ∈Nk} (−1)^ℓ C(k, ℓ) λ_{j+ℓ} : j ∈ Nn−k), where C(k, ℓ) denotes the binomial coefficient.

Example 3.4. For every k ∈ Nn we define the set W^k := {λ : λ ∈ R^n_{++}, D^k(λ) ≥ 0}.

The corresponding penalty Ω(·|W^k) encourages vectors whose sparsity pattern is concentrated on at most k different contiguous regions. The case k = 1 essentially corresponds to the wedge, while the case k = 2 includes vectors which have a convex “profile” and whose sparsity pattern is concentrated either on the first elements of the vector, on the last, or on both.

We end this section by discussing a useful construction which may be applied to generate new penalty functions from available ones. It is obtained by composing a set Θ ⊆ R^k_{++} with a linear transformation modeling the sum of the components of a vector across the elements of a prescribed partition {P_ℓ : ℓ ∈ Nk} of Nn. That is, we let Λ = {λ : λ ∈ R^n_{++}, (Σ_{j∈P_ℓ} λ_j : ℓ ∈ Nk) ∈ Θ}. We use this construction in the composite wedge experiments in Section 5.

4 Optimization method

In this section, we address the issue of implementing the learning method (2.1) numerically. Since the penalty function Ω(·|Λ) is constructed as the infimum of a family of quadratic regularizers, the optimization problem (2.1) reduces to a simultaneous minimization over the vectors β and λ. For a fixed λ ∈ Λ, the minimum over β ∈ R^n is a standard Tikhonov regularization problem and can be solved directly in terms of a matrix inversion. For a fixed β, the minimization over λ ∈ Λ requires computing the penalty function (2.2). These observations naturally suggest an alternating minimization algorithm, which has already been considered in special cases in [1]. To describe our algorithm we choose ε > 0 and introduce the mapping φ^ε : R^n → R^n_{++}, whose i-th coordinate at β ∈ R^n is given by φ^ε_i(β) = √(β_i² + ε). For β ∈ (R∖{0})^n, we also let λ(β) = argmin{Γ(β, λ) : λ ∈ Λ}. The alternating minimization algorithm is defined as follows: choose λ^0 ∈ Λ and, for k ∈ N, define the iterates

β^k = diag(λ^{k−1}) X⊤ (X diag(λ^{k−1}) X⊤ + ρI)^{−1} y,   (4.1)
λ^k = λ(φ^ε(β^k)).   (4.2)

The following theorem establishes convergence of this algorithm. Its proof is presented in the supplementary material.

Theorem 4.1. If the set Λ is convex and, for all a, b ∈ R with 0 < a < b, the set Λ_{a,b} := [a, b]^n ∩ Λ is a nonempty, compact subset of the interior of Λ, then the iterates (4.1)–(4.2) converge to the vector γ(ε) := argmin{ ‖y − Xβ‖₂² + 2ρ Ω(φ^ε(β)|Λ) : β ∈ R^n }. Moreover, any convergent subsequence of the sequence {γ(1/ℓ) : ℓ ∈ N} converges to a solution of the optimization problem (2.1).

Initialization: k ← 0
Input: β ∈ R^n; Output: J_1, . . . , J_k
for t = 1 to n do
  J_{k+1} ← {t}; k ← k + 1
  while k > 1 and ‖β_{J_{k−1}}‖₂ / √|J_{k−1}| ≤ ‖β_{J_k}‖₂ / √|J_k| do
    J_{k−1} ← J_{k−1} ∪ J_k; k ← k − 1
  end
end

Figure 2: Iterative algorithm to compute the wedge penalty.

The most challenging step in the alternating algorithm is the computation of the vector λ^k. Fortunately, if Λ is a second order cone, problem (2.2) defining the penalty function Ω(·|Λ) may be reformulated as a second order cone program (SOCP), see e.g. [5]. To see this, we introduce an additional variable t ∈ R^n and note that

Ω(β|Λ) = min{ (1/2) Σ_{i∈Nn} (t_i + λ_i) : ‖(2β_i, t_i − λ_i)‖₂ ≤ t_i + λ_i, t_i ≥ 0, i ∈ Nn, λ ∈ Λ }.

In particular, in all the examples in Section 3 the set Λ is formed by linear constraints and, so, problem (2.2) is an SOCP. We may then use available toolboxes to compute the solution of this problem. However, in special cases the computation of the penalty function may be significantly facilitated by using the analytical formulas derived in Section 3. Here, for simplicity, we describe how to do this in the case of the wedge penalty. For this purpose we say that a vector β ∈ R^n is admissible if, for every k ∈ Nn, it holds that ‖β_{Nk}‖₂/√k ≤ ‖β‖₂/√n. The proof of the next lemma is straightforward and we do not elaborate on the details.

Lemma 4.1. If β ∈ R^n and δ ∈ R^p are admissible and ‖β‖₂/√n ≤ ‖δ‖₂/√p, then (β, δ) is admissible.

The iterative algorithm presented in Figure 2 can be used to find the partition J = {J_ℓ : ℓ ∈ Nk} and, so, the vector λ(β) described in Theorem 3.2. The algorithm processes the components of the vector β in a sequential manner. Initially, the first component forms the only set in the partition. After the generic iteration t − 1, where the partition is composed of k sets, the index of the next component, t, is put in a new set J_{k+1}. Two cases can occur: either the means of the squares of the sets are in strict descending order, or this order is violated by the last set. The latter is the only case that requires further action, so the algorithm merges the last two sets and repeats until the sets in the partition are fully ordered.
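The merging procedure of Figure 2 and the alternating iterates can be sketched in a few lines of Python. This is an illustrative reimplementation, not the authors' code: the names `wedge_lambda`, `wedge_penalty` and `alternating_wedge` are ours, and the scheme specializes the λ-step (4.2) to Λ = W:

```python
import numpy as np

def wedge_lambda(beta):
    """λ(β) for Λ = W, computed by the merging pass of Figure 2
    (a pool-adjacent-violators style algorithm); assumes beta has
    no zero entries."""
    groups = []                       # stack of [sum of squares, size]
    for b in beta:
        groups.append([b * b, 1])
        # merge while the means of squares fail to strictly decrease
        while len(groups) > 1 and \
                groups[-2][0] / groups[-2][1] <= groups[-1][0] / groups[-1][1]:
            s, c = groups.pop()
            groups[-1][0] += s
            groups[-1][1] += c
    # each group J_ℓ gets the constant value ‖β_{J_ℓ}‖₂ / √|J_ℓ|, eq. (3.3)
    return np.concatenate([np.full(c, np.sqrt(s / c)) for s, c in groups])

def wedge_penalty(beta):
    # Ω(β|W) = Γ(β, λ(β)), which equals Σ_ℓ √|J_ℓ| ‖β_{J_ℓ}‖₂, eq. (3.2)
    lam = wedge_lambda(beta)
    return 0.5 * np.sum(beta ** 2 / lam + lam)

def alternating_wedge(X, y, rho, eps=1e-8, iters=100):
    """Alternating iterates in the spirit of (4.1)-(4.2), with the
    λ-step specialized to the wedge; an illustrative sketch."""
    m, n = X.shape
    lam = np.ones(n)
    for _ in range(iters):
        # β-step: weighted Tikhonov solution (a matrix inversion)
        D = np.diag(lam)
        beta = D @ X.T @ np.linalg.solve(X @ D @ X.T + rho * np.eye(m), y)
        # λ-step: λ(φ_ε(β)); the smoothing keeps every entry positive
        lam = wedge_lambda(np.sqrt(beta ** 2 + eps))
    return beta

# n = 2 sanity checks against the closed form quoted in Section 3
assert np.isclose(wedge_penalty(np.array([2.0, 1.0])), 3.0)            # ‖β‖₁
assert np.isclose(wedge_penalty(np.array([1.0, 2.0])), np.sqrt(10.0))  # √2 ‖β‖₂
```

Storing each group as a (sum of squares, size) pair makes every merge a constant-time weighted average, so the λ-step runs in linear time, matching the argument below.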
Note that, since the only operation performed by the algorithm is the merging of admissible sets, Lemma 4.1 ensures that after each step t the current partition satisfies the conditions (3.4). Moreover, the while loop ensures that after each step the current partition satisfies, for every ℓ ∈ Nk−1, the constraints ‖β_{J_ℓ}‖₂/√|J_ℓ| > ‖β_{J_{ℓ+1}}‖₂/√|J_{ℓ+1}|. Thus, the output of the algorithm is the partition J defined in Theorem 3.2. In the actual implementation of the algorithm, the mean of squares of each set can be saved. This allows us to compute the mean of squares of a merged set as a weighted mean, which is a constant time operation. Since there are n − 1 consecutive pairs in total, this is also the maximum number of merges that the algorithm can perform. Each merge requires exactly one additional test, so we can conclude that the running time of the algorithm is linear.

5 Numerical simulations

In this section we present some numerical simulations with the proposed method. For simplicity, we consider data generated noiselessly from y = Xβ∗, where β∗ ∈ R^{100} is the true underlying regression vector, and X is an m × 100 input matrix, m being the sample size. The elements of X are generated i.i.d. from the standard normal distribution, and the columns of X are then normalized so that their ℓ2 norm is 1. Since we consider the noiseless case, we solve the interpolation problem min{Ω(β) : y = Xβ} for different choices of the penalty function Ω. In practice, we solve problem
In practice, we solve problem\n(2.1) for a tiny value of the parameter \u03c1 = 10\u22128, which we found to be suf\ufb01cient to ensure that the\n\n6\n\n\f350\n\n300\n\n250\n\n200\n\n150\n\n100\n\n50\n\nr\no\nr\nr\ne\n\n \nl\n\ne\nd\no\nM\n\n0\n \n12\n\n15\n\n18\n\n5000\n\n4000\n\nr\no\nr\nr\ne\n\n \nl\n\ne\nd\no\nM\n\n3000\n\n2000\n\n1000\n\n0\n \n12\n\n15\n\n18\n\n \n\nLasso\nBox\u2212A\nBox\u2212B\nBox\u2212C\n\n400\n\n350\n\n300\n\n250\n\n200\n\n150\n\n100\n\n50\n\nr\no\nr\nr\ne\n\n \nl\n\ne\nd\no\nM\n\n50\n\n75\n\n100\n\n0\n \n12\n\n15\n\n18\n\n20\n\n25\n\nSample size\n\n20\n\n25\n\nSample size\n\n(a)\n\n(b)\n\n \n\nLasso\nC\u2212Wedge\nGL\u2212ind\nGL\u2212hie\nGL\u2212con\n\n2500\n\n2000\n\nr\no\nr\nr\ne\n\n \nl\n\ne\nd\no\nM\n\n1500\n\n1000\n\n500\n\n50\n\n75\n\n100\n\n0\n \n12\n\n15\n\n18\n\n20\n\n25\n\nSample size\n\n20\n\n25\n\nSample size\n\n \n\nLasso\nWedge\nGL\u2212lin\n\n700\n\n600\n\n500\n\n400\n\n300\n\n200\n\n100\n\nr\no\nr\nr\ne\n\n \nl\n\ne\nd\no\nM\n\n50\n\n75\n\n100\n\n0\n \n12\n\n15\n\n18\n\n \n\nLasso\nW\u22122\nWedge\nGL\u2212lin\n\n80\n\n70\n\n60\n\n50\n\n40\n\n30\n\n20\n\n10\n\nr\no\nr\nr\ne\n \nl\ne\nd\no\nM\n\n50\n\n75\n\n100\n\n0\n \n22\n\n25\n\n28\n\n \n\nLasso\nWedge\nGL\u2212lin\n\n50\n\n75\n\n100\n\n20\n25\nSample size\n\n(c)\n\n \n\nLasso\nW\u22123\nWedge\nGL\u2212lin\n\n30\n\nSample size\n\n35\n\n50\n\n75\n\n100\n\nFigure 3: Comparison between different penalty methods: (a) Box vs. Lasso; (b,c) Wedge vs. Hier-\narchical group Lasso; (d) Composite wedge; (e) Convex; (f) Cubic. See text for more information\n\n(f)\n\n(d)\n\n(e)\n\ni |)+ and bi = (|\u03b2\u2217\n\nerror term in (2.1) is negligible at the minimum. All experiments were repeated 50 times, generating\neach time a new matrix X. In the \ufb01gures we report the average of the model error E[k \u02c6\u03b2 \u2212 \u03b2\u2217k2\n2] of\nthe vector \u02c6\u03b2 learned by each method, as a function of the sample size m. 
In the following, we discuss\na series of experiments, corresponding to different choices for the model vector \u03b2\u2217 and its sparsity\npattern. In all experiments, we solved the optimization problem (2.1) with the algorithm presented\nin Section 4. Whenever possible we solved step (4.2) using the formulas derived in Section 3 and\nresorted to the solver CVX (http://cvxr.com/cvx/) in the other cases.\nBox. In the \ufb01rst experiment the model is 10-sparse, where each nonzero component, in a random\nposition, is an integer uniformly sampled in the interval [\u221210, 10]. We wish to show that the more\naccurate the prior information about the model is, the more precise the estimate will be. We use\na box penalty (see Theorem 3.1) constructed \u201caround\u201d the model, imagining that an oracle tells us\ni | is bounded within an interval. We consider three boxes B[a, b] of different\nthat each component |\u03b2\u2217\ni |\u2212 r)+ and radii r = 5, 1 and 0.1, which we denote as\nsizes, namely ai = (r \u2212|\u03b2\u2217\nBox-A, Box-B and Box-C, respectively. We compare these methods with the Lasso \u2013 see Figure 3-a.\nAs expected, the three box penalties perform better. Moreover, as the radius of a box diminishes,\nthe amount of information about the true model increases, and the performance improves.\nWedge. In the second experiment, we consider a regression vector, whose components are nonin-\ncreasing in absolute value and only a few are nonzero. Speci\ufb01cally, we choose a 10-sparse vector:\nj = 11 \u2212 j, if j \u2208 N10 and zero otherwise. We compare the Lasso, which makes no use of such\n\u03b2\u2217\nordering information, with the wedge penalty \u2126(\u03b2|W ) (see Example 3.2 and Theorem 3.2) and the\nhierarchical group Lasso in [17], which both make use of such information. For the group Lasso\nwe choose \u2126(\u03b2) = P\u2113\u2208N100 ||\u03b2J\u2113||, with J\u2113 = {\u2113, \u2113 + 1, . . . , 100}, \u2113 \u2208 N100. 
These two methods are referred to as “Wedge” and “GL-lin” in Figure 3-b, respectively. As expected, both methods improve over the Lasso, with “GL-lin” being the better of the two. We further tested the robustness of the methods by adding two additional nonzero components with value 10 to the vector β∗, in random positions between 20 and 100. This result, reported in Figure 3-c, indicates that “GL-lin” is more sensitive to such a perturbation.

Composite wedge. Next we consider a more complex experiment, where the regression vector is sparse within different contiguous regions P_1, . . . , P_{10}, and the ℓ1 norm on one region is larger than the ℓ1 norm on the next region. We choose sets P_i = {10(i − 1) + 1, . . . , 10i}, i ∈ N10, and generate a 6-sparse vector β∗ whose i-th nonzero element has value 31 − i (decreasing) and is in a random position in P_i, for i ∈ N6. We encode this prior knowledge by choosing Ω(β|Λ) with Λ = {λ ∈ R^{100}_{++} : ‖λ_{P_i}‖₁ ≥ ‖λ_{P_{i+1}}‖₁, i ∈ N9}. This method constrains the sums over the sets to be nonincreasing and may be interpreted as the composition of the wedge set with an averaging operation across the sets P_i; see the discussion at the end of Section 3. This method, which is referred to as “C-Wedge” in Figure 3-d, is compared to the Lasso and to three other versions of the group Lasso. The first is a standard group Lasso with the nonoverlapping groups J_i = P_i, i ∈ N10, thus encouraging the presence of sets of zero elements, which is useful because there are 4 such sets. The second is a variation of the hierarchical group Lasso discussed above with J_i = ∪_{j=i}^{10} P_j, i ∈ N10. A problem with these approaches is that the ℓ2 norm is applied at the level of the individual sets P_i, which does not promote sparsity within these sets. To counter this effect we can enforce contiguous nonzero patterns within each of the P_i, as proposed by [10]. That is, we consider as groups the sets formed by all sequences of q ∈ N9 consecutive elements at the beginning or at the end of each of the sets P_i, for a total of 180 groups. These three groupings are referred to as “GL-ind”, “GL-hie” and “GL-con” in Figure 3-d, respectively. This result indicates the advantage of “C-Wedge” over the other methods considered. In particular, the group Lasso methods fall behind our method and the Lasso, with “GL-con” being slightly better than “GL-ind” and “GL-hie”. Notice also that all group Lasso methods gradually diminish the model error until they have a point for each dimension, while our method and the Lasso have a steeper descent, reaching zero at a number of points which is less than half the number of dimensions.

Figure 4: Lasso vs. penalty Ω(·|W^2) (left) and Ω(·|W^3) (right); see text for more information.

Convex and Cubic. To show the flexibility of our framework, we consider two further examples of sparse regression vectors with additional structured properties. In the first example, most of the components of this vector are zero, but the first and the last few elements follow a discrete convex trend. Specifically, we choose β∗ = (5², 4², 3², 2², 1, 0, . . . , 0, 1, 2², 3², 4², 5²) ∈ R^{100}. In this case, we expect the penalty function Ω(β|W^2) to outperform the Lasso, because it favors vectors with convex shape. Results are shown in Figure 3-e, where this penalty is named “W-2”.
Lacking other specific methods to impose this convex shape constraint, and motivated by the fact that the first few components decrease, we compare it with two methods that favor a decreasing learned vector: the Wedge and the group Lasso with J_k = {k, . . . , 100}, k ∈ N_100. These methods and the Lasso fail to use the prior knowledge of convexity and are outperformed by using the constraint set W^2. The second example considers the case where |β*| ∈ W^3, namely the differences of the second order are decreasing. This vector is constructed from the cubic polynomial p(t) = −t(t − 1.5)(t + 6.5). The polynomial is evaluated at 100 equally spaced (0.1 apart) points, starting from −7. The resulting vector starts with 5 nonzero components and then has a bump of another 15 elements. We use our method with the penalty Ω(β|W^3), which is referred to as “W-3” in the figure. The model error, compared again with “W-1” and the linear group Lasso, is shown in Figure 3-f. Finally, Figure 4 displays the regression vector found by the Lasso and the vector learned by “W-2” (left), and by the Lasso and “W-3” (right), in a single run with sample size of 15 and 35, respectively. The estimated vectors (green) are superposed to the true vector (black). Our method provides a better estimate than the Lasso in both cases.

Conclusion

We proposed a family of penalty functions that can be used to model structured sparsity in linear regression. We provided theoretical, algorithmic and computational information about this new class of penalty functions. Our theoretical observations highlight the generality of this framework to model structured sparsity. An important feature of our approach is that it can deal with richer model structures than current approaches while maintaining convexity of the penalty function.
Our practical experience indicates that these penalties perform well numerically, improving over state-of-the-art penalty methods for structured sparsity, which suggests that our framework is promising for applications. In the future, it would be valuable to extend the ideas presented here to learning nonlinear sparse regression models. There is also a need to clarify the rate of convergence of the algorithm presented here.

References

[1] A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Machine Learning, 73(3):243–272, 2008.
[2] R.G. Baraniuk, V. Cevher, M.F. Duarte, and C. Hegde. Model-based compressive sensing. IEEE Transactions on Information Theory, 56(4):1982–2001, 2010.
[3] D. Bertsekas. Nonlinear Programming. Athena Scientific, 1999.
[4] P.J. Bickel, Y. Ritov, and A.B. Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. Annals of Statistics, 37:1705–1732, 2009.
[5] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[6] J.M. Danskin. The theory of max-min, with applications. SIAM Journal on Applied Mathematics, 14(4):641–664, 1966.
[7] J. Huang, T. Zhang, and D. Metaxas. Learning with structured sparsity. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 417–424. ACM, 2009.
[8] L. Jacob. Structured priors for supervised learning in computational biology. Ph.D. Thesis, 2009.
[9] L. Jacob, G. Obozinski, and J.-P. Vert. Group lasso with overlap and graph lasso. In International Conference on Machine Learning (ICML 26), 2009.
[10] R. Jenatton, J.-Y. Audibert, and F. Bach. Structured variable selection with sparsity-inducing norms. arXiv:0904.3523v2, 2009.
[11] S. Kim and E.P. Xing. Tree-guided group lasso for multi-task regression with structured sparsity. Technical report, arXiv:0909.1373, 2009.
[12] K. Lounici.
Sup-norm convergence rate and sign concentration property of Lasso and Dantzig estimators. Electronic Journal of Statistics, 2:90–102, 2008.
[13] K. Lounici, M. Pontil, A.B. Tsybakov, and S. van de Geer. Taking advantage of sparsity in multi-task learning. In Proc. of the 22nd Annual Conference on Learning Theory (COLT), 2009.
[14] A.B. Owen. A robust hybrid of lasso and ridge regression. In Prediction and Discovery: AMS-IMS-SIAM Joint Summer Research Conference, Machine and Statistical Learning: Prediction and Discovery, volume 443, page 59, 2007.
[15] S.A. van de Geer. High-dimensional generalized linear models and the Lasso. Annals of Statistics, 36(2):614, 2008.
[16] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 68(1):49–67, 2006.
[17] P. Zhao, G. Rocha, and B. Yu. Grouped and hierarchical model selection through composite absolute penalties. Annals of Statistics, 37(6A):3468–3497, 2009.

A Appendix

In this appendix we provide the proofs of Theorems 3.2 and 4.1.

A.1 Proof of Theorem 3.2

Before proving the theorem we require some additional notation. Given any two disjoint subsets J, K ⊆ N_n we define the region

Q_{J,K} := {β ∈ R^n : ‖β_J‖₂²/|J| > ‖β_K‖₂²/|K|}.

Note that the boundary of this region is determined by the zero set of a homogeneous polynomial of degree two. We also need the following construction.
Definition A.1. For every subset S ⊆ N_{n−1} we set k = |S| + 1 and label the elements of S in increasing order as S = {j_ℓ : ℓ ∈ N_{k−1}}.
We associate with the subset S a contiguous partition of N_n, given by J(S) = {J_ℓ : ℓ ∈ N_k}, where we define J_ℓ := {j_{ℓ−1} + 1, . . . , j_ℓ} and set j_0 = 0 and j_k = n.

A subset S of N_{n−1} also induces two regions in R^n which play a central role in the identification of the wedge penalty. First, we describe the region which “crosses over” the induced partition J(S). This is defined to be the set

O_S := ∩{Q_{J_ℓ, J_{ℓ+1}} : ℓ ∈ N_{k−1}}.   (A.1)

In other words, β ∈ O_S if the average of the square of its components within each region J_ℓ strictly decreases with ℓ. The next region which is essential in our analysis is the “stays within” region, induced by the partition J(S). This region requires the notation J_{ℓ,q} := {j : j ∈ J_ℓ, j ≤ q} and is defined by the equation

I_S := ∩{Q̄_{J_ℓ, J_{ℓ,q}} : q ∈ J_ℓ, ℓ ∈ N_k},   (A.2)

where Q̄ denotes the closure of the set Q. In other words, all vectors β within this region have the property that, for every set J_ℓ ∈ J(S), the average of the square of a first segment of components of β within this set is not greater than the average over J_ℓ. We note that if S is the empty set the above notation should be interpreted as O_S = R^n and

I_S = ∩{Q̄_{N_n, N_q} : q ∈ N_n}.

We also introduce, for every S ⊆ N_{n−1}, the sets

U_S := O_S ∩ I_S ∩ (R\{0})^n.

We shall prove the following slightly more general version of Theorem 3.2.
Theorem A.1. The collection of sets U := {U_S : S ⊆ N_{n−1}} forms a partition of (R\{0})^n. For each β ∈ (R\{0})^n there is a unique S ⊆ N_{n−1} such that β ∈ U_S, and

Ω(β|W) = Σ_{ℓ∈N_k} √|J_ℓ| ‖β_{J_ℓ}‖₂,   (A.3)

where k = |S| + 1.
Moreover, the components of the vector λ(β) := argmin{Γ(β, λ) : λ ∈ W} are given by the equations λ_j(β) = μ_ℓ, j ∈ J_ℓ, ℓ ∈ N_k, where

μ_ℓ = ‖β_{J_ℓ}‖₂ / √|J_ℓ|.   (A.4)

Proof. First, let us observe that there are n − 1 inequality constraints defining W. It readily follows that all vectors in this constraint set are regular, in the sense of optimization theory; see [3, p. 279]. Hence, we can appeal to [3, Prop. 3.3.4, p. 316 and Prop. 3.3.6, p. 322], which state that λ ∈ R^n_{++} is a solution to the minimum problem determined by the wedge penalty if and only if there exists a vector α = (α_i : i ∈ N_{n−1}) with nonnegative components such that

−β_j²/λ_j² + 1 + α_{j−1} − α_j = 0,   j ∈ N_n,   (A.5)

where we set α_0 = α_n = 0. Furthermore, the following complementary slackness conditions hold true:

α_j(λ_{j+1} − λ_j) = 0,   j ∈ N_{n−1}.   (A.6)

To unravel these equations, we let S := {j : λ_j > λ_{j+1}, j ∈ N_{n−1}}, which is the subset of indexes corresponding to the constraints that are not tight. When k ≥ 2, we express this set in the form {j_ℓ : ℓ ∈ N_{k−1}}, where k = |S| + 1.
As explained in Definition A.1, the set S induces the partition J(S) = {J_ℓ : ℓ ∈ N_k} of N_n. When k = 1 our notation should be interpreted to mean that S is empty and the partition J(S) consists only of N_n. In this case, it is easy to solve the equations (A.5) and (A.6). In fact, all components of the vector λ have a common value, say μ > 0, and by summing both sides of equation (A.5) over j ∈ N_n we obtain that μ² = ‖β‖₂²/n.
Moreover, summing both sides of the same equation over j ∈ N_q we obtain that α_q = −Σ_{j∈N_q} β_j²/μ² + q and, since α_q ≥ 0, we conclude that β ∈ I_S = U_S.
We now consider the case that k ≥ 2. Here, the vector λ has equal components on each subset J_ℓ, which we denote by μ_ℓ, ℓ ∈ N_k. The definition of the set S implies that the μ_ℓ are strictly decreasing, and equation (A.6) implies that α_j = 0 for every j ∈ S. Summing both sides of equation (A.5) over j ∈ J_ℓ we obtain that

−(1/μ_ℓ²) Σ_{j∈J_ℓ} β_j² + |J_ℓ| = 0,

from which equation (A.4) follows. Since the μ_ℓ are strictly decreasing, we conclude that β ∈ O_S. Moreover, choosing q ∈ J_ℓ and summing both sides of equations (A.5) over j ∈ J_{ℓ,q} we obtain that

0 ≤ α_q = −‖β_{J_{ℓ,q}}‖₂²/μ_ℓ² + |J_{ℓ,q}|,

which implies that β ∈ Q̄_{J_ℓ, J_{ℓ,q}}. Since this holds for every q ∈ J_ℓ and ℓ ∈ N_k, we conclude that β ∈ I_S and, therefore, it follows that β ∈ U_S.
In summary, we have shown that β ∈ U_S. In particular, this implies that the collection of sets U covers (R\{0})^n. Next, we show that the elements of U are disjoint. To this end, we observe that the computation described above can be reversed. That is to say, conversely, for any S ⊆ N_{n−1} and β ∈ U_S the vectors α and λ defined above solve the equations (A.5) and (A.6). Since the wedge penalty function is strictly convex, we know that equations (A.5) and (A.6) have a unique solution. Now, if β ∈ U_S ∩ U_{S′} then it must follow that λ = λ′.
Consequently, since the vectors λ and λ′ are constant on each element of their respective partitions J(S) and J(S′), and strictly decreasing from one element to the next in those partitions, it must be the case that S = S′.

We note that if some components of β are zero we may compute Ω(β|Λ) as a limiting process, since the function Ω(·|Λ) is continuous.

A.2 Proof of Theorem 4.1

We divide the proof into several steps. To this end, we define

E_ε(β, λ) := ‖y − Xβ‖₂² + 2ρ Γ(φ_ε(β), λ)

and let β(λ) := argmin{E_ε(α, λ) : α ∈ R^n}.
Step 1. We define two sequences, θ_k = E_ε(β^k, λ^{k−1}) and ν_k = E_ε(β^k, λ^k), and observe, for any k ≥ 2, that

θ_{k+1} ≤ ν_k ≤ θ_k ≤ ν_{k−1}.   (A.7)

These inequalities follow directly from the definition of the alternating algorithm, see equations (4.1) and (4.2).
Step 2. We define the compact set B = {β ∈ R^n : ‖β‖₁ ≤ θ_1}. From the first inequality in Proposition 2.1, ‖β‖₁ ≤ Ω(β|Λ), and inequality (A.7), we conclude, for every k ∈ N, that β^k ∈ B.
Step 3. We define a function g : R^n → R at β ∈ R^n as

g(β) = min{E_ε(α, λ(φ_ε(β))) : α ∈ R^n}.

We claim that g is continuous on B. In fact, there exists a constant κ > 0 such that, for every γ¹, γ² ∈ B, it holds that

|g(γ¹) − g(γ²)| ≤ κ ‖λ(φ_ε(γ¹)) − λ(φ_ε(γ²))‖_∞.   (A.8)

The essential ingredient in the proof of this inequality is the fact that, by our hypothesis on the set Λ, there exist constants a and b such that, for all β ∈ B, λ(φ_ε(β)) ∈ [a, b]^n.
This fact follows by Danskin’s Theorem [6].
Step 4. By step 2, there exists a subsequence {β^{k_ℓ} : ℓ ∈ N} which converges to β̃ ∈ B and, for all β ∈ R^n and λ ∈ Λ, it holds that

E_ε(β̃, λ(φ_ε(β̃))) ≤ E_ε(β, λ(φ_ε(β̃))),   E_ε(β̃, λ(φ_ε(β̃))) ≤ E_ε(β̃, λ).   (A.9)

Indeed, from step 1 we conclude that there exists ψ ∈ R_{++} such that

lim_{k→∞} θ_k = lim_{k→∞} ν_k = ψ.

Since, under our hypothesis, the mapping β ↦ λ(β) is continuous for β ∈ (R\{0})^n, we conclude that

lim_{ℓ→∞} λ^{k_ℓ} = λ(φ_ε(β̃)).

By the definition of the alternating algorithm, we have, for all β ∈ R^n and λ ∈ Λ, that

θ_{k+1} = E_ε(β^{k+1}, λ^k) ≤ E_ε(β, λ^k),   ν_k = E_ε(β^k, λ^k) ≤ E_ε(β^k, λ).

From these inequalities we obtain, passing to the limit, inequalities (A.9).
Step 5. The vector (β̃, λ(φ_ε(β̃))) is a stationary point. Indeed, since Λ is admissible, by step 3, λ(φ_ε(β̃)) ∈ int(Λ). Therefore, since E_ε is continuously differentiable, this claim follows from step 4.
Step 6. The alternating algorithm converges. This claim follows from the fact that E_ε is strictly convex.
Hence, E_ε has a unique global minimum in R^n × Λ, which, by virtue of inequalities (A.9), is attained at (β̃, λ(φ_ε(β̃))).
The last claim in the theorem follows from the fact that the set {γ(ε) : ε > 0} is bounded and the function λ(β) is continuous.
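The explicit formula of Theorem A.1 lends itself to direct computation. The sketch below is our own illustration, not code from the paper: reading the “crosses over” condition (A.1) and the “stays within” condition (A.2) as the characterization of a decreasing isotonic fit of the squared components (an assumption on our part), the partition J(S) can be found by a pool-adjacent-violators style merge, after which (A.3) and (A.4) are applied blockwise. The name `wedge_penalty` is ours.

```python
# Illustrative sketch (not the authors' code): wedge penalty of Theorem A.1.
# Assumption: the partition J(S) is obtained by a pool-adjacent-violators
# merge on the squared components, so that lambda_j^2 is the decreasing
# isotonic fit of beta_j^2.
import math

def wedge_penalty(beta):
    """Return (Omega(beta|W), lambda(beta)) via formulas (A.3)-(A.4)."""
    # Each block J_l is stored as [sum of beta_j^2 over J_l, |J_l|].
    blocks = []
    for b in beta:
        blocks.append([b * b, 1])
        # Merge while the strict decrease of the block mean squares
        # (the "crosses over" condition defining O_S) is violated.
        while (len(blocks) > 1
               and blocks[-2][0] * blocks[-1][1] <= blocks[-1][0] * blocks[-2][1]):
            s, n = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += n
    omega, lam = 0.0, []
    for s, n in blocks:
        mu = math.sqrt(s / n)      # mu_l = ||beta_{J_l}||_2 / sqrt(|J_l|), eq. (A.4)
        omega += math.sqrt(n * s)  # sqrt(|J_l|) * ||beta_{J_l}||_2, eq. (A.3)
        lam.extend([mu] * n)
    return omega, lam
```

For a vector whose absolute values are already strictly decreasing, e.g. β = (3, 2, 1), every block is a singleton, so Ω(β|W) = ‖β‖₁ = 6 and λ = |β|; for β = (1, 2) the two components are pooled and Ω(β|W) = √10 > ‖β‖₁, illustrating the first inequality of Proposition 2.1.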