{"title": "Structured sparsity-inducing norms through submodular functions", "book": "Advances in Neural Information Processing Systems", "page_first": 118, "page_last": 126, "abstract": "Sparse methods for supervised learning aim at finding good linear predictors from as few variables as possible, i.e., with small cardinality of their supports. This combinatorial selection problem is often turned into a convex optimization problem by replacing the cardinality function by its convex envelope (tightest convex lower bound), in this case the L1-norm. In this paper, we investigate more general set-functions than the cardinality, that may incorporate prior knowledge or structural constraints which are common in many applications: namely, we show that for nondecreasing submodular set-functions, the corresponding convex envelope can be obtained from its Lovasz extension, a common tool in submodular analysis. This defines a family of polyhedral norms, for which we provide generic algorithmic tools (subgradients and proximal operators) and theoretical results (conditions for support recovery or high-dimensional inference). By selecting specific submodular functions, we can give a new interpretation to known norms, such as those based on rank-statistics or grouped norms with potentially overlapping groups; we also define new norms, in particular ones that can be used as non-factorial priors for supervised learning.", "full_text": "Structured sparsity-inducing norms\n\nthrough submodular functions\n\nFrancis Bach\n\nINRIA - Willow project-team\n\nLaboratoire d\u2019Informatique de l\u2019Ecole Normale Sup\u00b4erieure\n\nParis, France\n\nfrancis.bach@ens.fr\n\nAbstract\n\nSparse methods for supervised learning aim at \ufb01nding good linear predictors from\nas few variables as possible, i.e., with small cardinality of their supports. 
This\ncombinatorial selection problem is often turned into a convex optimization prob-\nlem by replacing the cardinality function by its convex envelope (tightest convex\nlower bound), in this case the \u21131-norm. In this paper, we investigate more gen-\neral set-functions than the cardinality, that may incorporate prior knowledge or\nstructural constraints which are common in many applications: namely, we show\nthat for nondecreasing submodular set-functions, the corresponding convex en-\nvelope can be obtained from its Lov\u00b4asz extension, a common tool in submodu-\nlar analysis. This de\ufb01nes a family of polyhedral norms, for which we provide\ngeneric algorithmic tools (subgradients and proximal operators) and theoretical\nresults (conditions for support recovery or high-dimensional inference). By se-\nlecting speci\ufb01c submodular functions, we can give a new interpretation to known\nnorms, such as those based on rank-statistics or grouped norms with potentially\noverlapping groups; we also de\ufb01ne new norms, in particular ones that can be used\nas non-factorial priors for supervised learning.\n\n1 Introduction\n\nThe concept of parsimony is central in many scienti\ufb01c domains. In the context of statistics, signal\nprocessing or machine learning, it takes the form of variable or feature selection problems, and is\ncommonly used in two situations: First, to make the model or the prediction more interpretable or\ncheaper to use, i.e., even if the underlying problem does not admit sparse solutions, one looks for the\nbest sparse approximation. Second, sparsity can also be used given prior knowledge that the model\nshould be sparse. In these two situations, reducing parsimony to \ufb01nding models with low cardinality\nturns out to be limiting, and structured parsimony has emerged as a fruitful practical extension, with\napplications to image processing, text processing or bioinformatics (see, e.g., [1, 2, 3, 4, 5, 6, 7]\nand Section 4). 
For example, in [4], structured sparsity is used to encode prior knowledge regarding\nnetwork relationship between genes, while in [6], it is used as an alternative to structured non-\nparametric Bayesian process based priors for topic models.\n\nMost of the work based on convex optimization and the design of dedicated sparsity-inducing norms\nhas focused mainly on the speci\ufb01c allowed set of sparsity patterns [1, 2, 4, 6]: if w \u2208 Rp denotes the\npredictor we aim to estimate, and Supp(w) denotes its support, then these norms are designed so that\npenalizing with these norms only leads to supports from a given family of allowed patterns. In this\npaper, we instead follow the approach of [8, 3] and consider speci\ufb01c penalty functions F (Supp(w))\nof the support set, which go beyond the cardinality function, but are not limited or designed to only\nforbid certain sparsity patterns. As shown in Section 6.2, these may also lead to restricted sets of\nsupports but their interpretation in terms of an explicit penalty on the support leads to additional\n\n1\n\n\finsights into the behavior of structured sparsity-inducing norms (see, e.g., Section 4.1). While direct\ngreedy approaches (i.e., forward selection) to the problem are considered in [8, 3], we provide\nconvex relaxations to the function w 7\u2192 F (Supp(w)), which extend the traditional link between the\n\u21131-norm and the cardinality function.\nThis is done for a particular ensemble of set-functions F , namely nondecreasing submodular func-\ntions. Submodular functions may be seen as the set-function equivalent of convex functions, and\nexhibit many interesting properties that we review in Section 2\u2014see [9] for a tutorial on submodu-\nlar analysis and [10, 11] for other applications to machine learning. 
This paper makes the following\ncontributions:\n\n\u2212 We make explicit links between submodularity and sparsity by showing that the convex enve-\nlope of the function w 7\u2192 F (Supp(w)) on the \u2113\u221e-ball may be readily obtained from the Lov\u00b4asz\nextension of the submodular function (Section 3).\n\u2212 We provide generic algorithmic tools, i.e., subgradients and proximal operators (Section 5), as\nwell as theoretical guarantees, i.e., conditions for support recovery or high-dimensional inference\n(Section 6), that extend classical results for the \u21131-norm and show that many norms may be tackled\nby the exact same analysis and algorithms.\n\n\u2212 By selecting speci\ufb01c submodular functions in Section 4, we recover and give a new interpre-\ntation to known norms, such as those based on rank-statistics or grouped norms with potentially\noverlapping groups [1, 2, 7], and we de\ufb01ne new norms, in particular ones that can be used as non-\nfactorial priors for supervised learning (Section 4). These are illustrated on simulation experiments\nin Section 7, where they outperform related greedy approaches [3].\nNotation.\nFor w \u2208 Rp, Supp(w) \u2282 V = {1, . . . , p} denotes the support of w, de\ufb01ned as\nSupp(w) = {j \u2208 V, wj 6= 0}. For w \u2208 Rp and q \u2208 [1,\u221e], we denote by kwkq the \u2113q-norm of w.\nWe denote by |w| \u2208 Rp the vector of absolute values of the components of w. Moreover, given a\nvector w and a matrix Q, wA and QAA are the corresponding subvector and submatrix of w and Q.\nFinally, for w \u2208 Rp and A \u2282 V , w(A) = Pk\u2208A wk (this de\ufb01nes a modular set-function).\n2 Review of submodular function theory\n\nThroughout this paper, we consider a nondecreasing submodular function F de\ufb01ned on the power\nset 2V of V = {1, . . . 
, p}, i.e., such that:

∀A, B ⊆ V,  F(A) + F(B) ≥ F(A ∪ B) + F(A ∩ B),   (submodularity)
∀A, B ⊆ V,  A ⊆ B ⇒ F(A) ≤ F(B).   (monotonicity)

Moreover, we assume that F(∅) = 0. These set-functions are often referred to as polymatroid set-functions [12, 13]. Also, without loss of generality, we may assume that F is strictly positive on singletons, i.e., for all k ∈ V, F({k}) > 0. Indeed, if F({k}) = 0, then by submodularity and monotonicity, if A ∋ k, F(A) = F(A\{k}), and thus we can simply consider V\{k} instead of V. Classical examples are the cardinality function (which will lead to the ℓ1-norm) and, given a partition of V into B1 ∪ ··· ∪ Bk = V, the set-function A ↦ F(A) equal to the number of groups B1, ..., Bk with nonempty intersection with A (which will lead to the grouped ℓ1/ℓ∞-norm [1, 14]).

Lovász extension. Given any set-function F, one can define its Lovász extension f : R+^p → R as follows: given w ∈ R+^p, order the components of w in decreasing order w_{j1} ≥ ··· ≥ w_{jp} ≥ 0; the value f(w) is then defined as

f(w) = Σ_{k=1}^p w_{jk} [F({j1, ..., jk}) − F({j1, ..., j_{k−1}})].   (1)

The Lovász extension f is always piecewise linear, and when F is submodular, it is also convex (see, e.g., [12, 9]).
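The sorted-order formula of Eq. (1) translates directly into code. The following is a minimal Python sketch (the function names are our own, not from the paper) that evaluates the Lovász extension of a black-box set-function:

```python
import math

def lovasz(F, w):
    """Lovasz extension of a set-function F (called on frozensets of indices),
    evaluated at a nonnegative vector w, via the sorted-order formula."""
    order = sorted(range(len(w)), key=lambda j: -w[j])
    val, prefix, prev = 0.0, [], 0.0
    for j in order:
        prefix.append(j)
        cur = F(frozenset(prefix))
        val += w[j] * (cur - prev)  # increment weighted by marginal gain of F
        prev = cur
    return val

# Cardinality: the Lovasz extension is the sum of entries (the l1-norm on R+^p).
card = lambda A: float(len(A))
assert abs(lovasz(card, [0.3, 1.2, 0.5]) - 2.0) < 1e-12

# Square root of cardinality, a nondecreasing submodular function;
# on indicator vectors f agrees with F: f(1_A) = F(A).
sqrt_card = lambda A: math.sqrt(len(A))
assert abs(lovasz(sqrt_card, [1.0, 0.0, 1.0]) - math.sqrt(2)) < 1e-12
```

The sort makes each evaluation cost p calls to F plus an O(p log p) overhead, which is what makes the norms below algorithmically tractable despite their exponentially many facets.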
Moreover, for all δ ∈ {0, 1}^p, f(δ) = F(Supp(δ)): f is indeed an extension from vectors in {0, 1}^p (which can be identified with indicator vectors of sets) to all vectors in R+^p. Moreover, it turns out that minimizing F over subsets, i.e., minimizing f over {0, 1}^p, is equivalent to minimizing f over [0, 1]^p [13].

Figure 1: Polyhedral unit balls for four different submodular functions (two variables), with different stable inseparable sets leading to different sets of extreme points (annotated as (1,0)/F({1}), (0,1)/F({2}), (1,1)/F({1,2})); changing the values of F may make some of the extreme points disappear. From left to right: F(A) = |A|^{1/2} (all possible extreme points), F(A) = |A| (leading to the ℓ1-norm), F(A) = min{|A|, 1} (leading to the ℓ∞-norm), F(A) = (1/2)·1_{A∩{2}≠∅} + 1_{A≠∅} (leading to the structured norm Ω(w) = (1/2)|w2| + ‖w‖∞).

Submodular polyhedron and greedy algorithm. We denote by P the submodular polyhedron [12], defined as the set of s ∈ R+^p such that for all A ⊆ V, s(A) ≤ F(A), i.e., P = {s ∈ R+^p, ∀A ⊆ V, s(A) ≤ F(A)}, where we use the notation s(A) = Σ_{k∈A} s_k. One important result in submodular analysis is that if F is a nondecreasing submodular function, then we have a representation of f as a maximum of linear functions [12, 9], i.e., for all w ∈ R+^p,

f(w) = max_{s∈P} w⊤s.   (2)

Instead of solving a linear program with p + 2^p constraints, a solution s may then be obtained by the following "greedy algorithm": order the components of w in decreasing order w_{j1} ≥ ··· ≥ w_{jp}, and then take, for all k ∈ {1, ..., p}, s_{jk} = F({j1, ..., jk}) − F({j1, ..., j_{k−1}}).

Stable sets. A set A is said to be stable if it cannot be augmented without increasing F, i.e., if for all sets B ⊃ A, B ≠ A ⇒ F(B) > F(A).
If F is strictly increasing (such as the cardinality), then all sets are stable. The set of stable sets is closed under intersection [13], and will correspond to the set of allowed sparsity patterns (see Section 6.2).

Separable sets. A set A is separable if we can find a partition of A into A = B1 ∪ ··· ∪ Bk such that F(A) = F(B1) + ··· + F(Bk). A set A is inseparable if it is not separable. As shown in [13], the submodular polytope P has full dimension p as soon as F is strictly positive on all singletons, and its faces are exactly the sets {s_k = 0} for k ∈ V and {s(A) = F(A)} for stable and inseparable sets A. We denote by T the set of such sets. This implies that P = {s ∈ R+^p, ∀A ∈ T, s(A) ≤ F(A)}. These stable inseparable sets will play a role when describing extreme points of unit balls of our new norms (Section 3) and for deriving concentration inequalities in Section 6.3. For the cardinality function, the stable and inseparable sets are the singletons.

3 Definition and properties of structured norms

We define the function Ω(w) = f(|w|), where |w| is the vector in R^p composed of the absolute values of w and f is the Lovász extension of F. We have the following properties (see proof in [15]), which show that we indeed define a norm and that it is the desired convex envelope:

Proposition 1 (Convex envelope, dual norm) Assume that the set-function F is submodular, nondecreasing, and strictly positive for all singletons. Define Ω : w ↦ f(|w|). Then:
(i) Ω is a norm on R^p,
(ii) Ω is the convex envelope of the function g : w ↦ F(Supp(w)) on the unit ℓ∞-ball,
(iii) the dual norm (see, e.g., [16]) of Ω is equal to Ω*(s) = max_{A⊆V} ‖s_A‖₁/F(A) = max_{A∈T} ‖s_A‖₁/F(A).

We provide examples of submodular set-functions and norms in Section 4, where we go from set-functions to norms, and vice versa.
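The dual norm of Proposition 1(iii) can be sanity-checked by brute force on small p. The sketch below (our own illustrative code, exponential in p) computes Ω*(s) = max over nonempty A of ‖s_A‖₁/F(A) and recovers the classical ℓ1/ℓ∞ duality for the two simplest set-functions:

```python
from itertools import combinations

def dual_norm(F, s):
    """Omega*(s) = max over nonempty A of ||s_A||_1 / F(A), by exhaustive
    enumeration of subsets (for illustration on small p only)."""
    p = len(s)
    best = 0.0
    for r in range(1, p + 1):
        for A in combinations(range(p), r):
            best = max(best, sum(abs(s[j]) for j in A) / F(frozenset(A)))
    return best

s = [0.5, -2.0, 1.5]
# F(A) = |A| gives Omega = l1, whose dual is l-infinity:
assert abs(dual_norm(lambda A: float(len(A)), s) - 2.0) < 1e-12
# F(A) = min(|A|, 1) gives Omega = l-infinity, whose dual is l1:
assert abs(dual_norm(lambda A: 1.0, s) - 4.0) < 1e-12
```

Proposition 1(iii) also says the maximum may be restricted to the stable inseparable sets T, which is what makes Ω* computable in practice; the exhaustive loop here is only a correctness check.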
From the definition of the Lovász extension in Eq. (1), we see that Ω is a polyhedral norm (i.e., its unit ball is a polyhedron). The following proposition gives the set of extreme points of the unit ball (see proof in [15] and examples in Figure 1):

Proposition 2 (Extreme points of unit ball) The extreme points of the unit ball of Ω are the vectors (1/F(A)) s, with s ∈ {−1, 0, 1}^p, Supp(s) = A and A a stable inseparable set.

This proposition shows that, depending on the number and cardinality of the inseparable stable sets, we can go from 2p (only singletons) to 3^p − 1 extreme points (all possible sign vectors). We show in Figure 1 examples of balls for p = 2, as well as sets of extreme points. These extreme points will play a role in the concentration inequalities derived in Section 6.

Figure 2: Sequence and groups: (left) groups for contiguous patterns, (right) groups for penalizing the number of jumps in the indicator vector
sequence.

Figure 3: Regularization paths (weights vs. log(λ)) for a penalized least-squares problem (black: variables that should be active, red: variables that should be left out). From left to right: ℓ1-norm penalization (a wrong variable is included with the correct ones), polyhedral norm for rectangles in 2D, with zoom (all variables come in together), mix of the two norms (correct behavior).

4 Examples of nondecreasing submodular functions

We consider three main types of submodular functions with potential applications to regularization for supervised learning. Some existing norms are shown to be examples of our framework (Section 4.1, Section 4.3), while other novel norms are designed from specific submodular functions (Section 4.2). Other examples of submodular functions, in particular in terms of matroids and entropies, may be found in [12, 10, 11] and could also lead to interesting new norms.
Note that set covers, which are common examples of submodular functions, are subcases of the set-functions defined in Section 4.1 (see, e.g., [9]).

4.1 Norms defined with non-overlapping or overlapping groups

We consider grouped norms defined with potentially overlapping groups [1, 2], i.e., Ω(w) = Σ_{G⊆V} d(G) ‖w_G‖∞, where d is a nonnegative set-function (with potentially d(G) = 0 when G should not be considered in the norm). It is a norm as soon as ∪_{G: d(G)>0} G = V, and it corresponds to the nondecreasing submodular function F(A) = Σ_{G∩A≠∅} d(G). In the case where ℓ∞-norms are replaced by ℓ2-norms, [2] has shown that the set of allowed sparsity patterns consists of intersections of complements of groups G with strictly positive weights. These sets happen to be the stable sets of the corresponding submodular function; thus the analysis provided in Section 6.2 extends the result of [2] to the new case of ℓ∞-norms. However, in our situation, we can give a reinterpretation through a submodular function that counts the number of times the support A intersects groups G with nonzero weights. This goes beyond restricting the set of allowed sparsity patterns to stable sets. We show later in this section some insights gained by this reinterpretation. We now give some examples of norms, with various topologies of groups.

Hierarchical norms. Hierarchical norms defined on directed acyclic graphs [1, 5, 6] correspond to the set-function F(A) equal to the cardinality of the union of ancestors of elements in A. These have been applied to bioinformatics [5], computer vision and topic models [6].

Norms defined on grids. If we assume that the p variables are organized in a 1D, 2D or 3D grid, [2] considers norms based on overlapping groups leading to stable sets equal to rectangular or convex shapes, with applications in computer vision [17].
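The group-cover set-functions F(A) = Σ_{G∩A≠∅} d(G) of Section 4.1 can be checked for monotonicity and submodularity exhaustively for small p; a minimal sketch (helper names are ours, chosen for illustration):

```python
from itertools import combinations

def group_cover(groups, weights):
    """Set-function F(A) = sum of d(G) over groups G intersecting A."""
    def F(A):
        return sum(d for G, d in zip(groups, weights) if G & A)
    return F

def is_nondecreasing_submodular(F, p):
    """Brute-force check of monotonicity and submodularity over all subsets
    of {0, ..., p-1} (exponential; small p only)."""
    subsets = [frozenset(c) for r in range(p + 1)
               for c in combinations(range(p), r)]
    for A in subsets:
        for B in subsets:
            if A <= B and F(A) > F(B) + 1e-12:          # monotonicity
                return False
            if F(A) + F(B) < F(A | B) + F(A & B) - 1e-12:  # submodularity
                return False
    return True

# Three overlapping groups on a chain of 4 variables, unit weights.
groups = [frozenset({0, 1}), frozenset({1, 2}), frozenset({2, 3})]
F = group_cover(groups, [1.0, 1.0, 1.0])
assert is_nondecreasing_submodular(F, 4)
assert F(frozenset({1})) == 2.0   # variable 1 intersects two groups
```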
For example, for the groups defined in the left side of Figure 2 (with unit weights), we have F(A) = p − 2 + range(A) if A ≠ ∅ and F(∅) = 0 (the range of A is equal to max(A) − min(A) + 1). From empty sets to non-empty sets, there is a gap of p − 1, which is larger than the differences among non-empty sets. This leads to the undesired result, already observed by [2], of adding all variables in one step, rather than gradually, when the regularization parameter decreases in a regularized optimization problem. In order to counterbalance this effect, adding a constant times the cardinality function has the effect of making the first gap relatively smaller. This corresponds to adding a constant times the ℓ1-norm and, as shown in Figure 3, solves the problem of having all variables come in together. All patterns are then allowed, but contiguous ones are encouraged rather than forced.

Another interesting new norm may be defined from the groups in the right side of Figure 2. Indeed, it corresponds to the function F(A) equal to |A| plus the number of intervals of A. Note that this also favors contiguous patterns but is not limited to selecting a single interval (unlike the norm obtained from the groups in the left side of Figure 2). Note that it is to be contrasted with the total variation (a.k.a. fused Lasso penalty [18]), which is a relaxation of the number of jumps in a vector w rather than in its support. In 2D or 3D, this extends to the notions of perimeter and area, but we do not pursue such extensions here.

4.2 Spectral functions of submatrices

Given a positive semidefinite matrix Q ∈ R^{p×p} and a real-valued function h from R+ → R, one may define tr[h(Q)] as Σ_{i=1}^p h(λi), where λ1, ..., λp are the (nonnegative) eigenvalues of Q [19]. We can thus define the set-function F(A) = tr h(Q_AA) for A ⊆ V.
The functions h(λ) = log(λ + t) for t > 0 lead to submodular functions, as they correspond to entropies of Gaussian random variables (see, e.g., [12, 9]). Thus, since for q ∈ (0, 1), λ^q = (q sin(qπ)/π) ∫₀^∞ log(1 + λ/t) t^{q−1} dt (see, e.g., [20]), the functions h(λ) = λ^q for q ∈ (0, 1] are positive linear combinations of functions that lead to nondecreasing submodular functions. Thus, they are also nondecreasing submodular functions and, to the best of our knowledge, provide novel examples of such functions.

In the context of supervised learning from a design matrix X ∈ R^{n×p}, we naturally use Q = X⊤X. If h is linear, then F(A) = tr X_A⊤X_A = Σ_{k∈A} X_k⊤X_k (where X_A denotes the submatrix of X with columns in A) and we obtain a weighted cardinality function and hence a weighted ℓ1-norm, which is a factorial prior, i.e., a sum of terms depending on each variable independently.

In a frequentist setting, the Mallows CL penalty [21] depends on the degrees of freedom, of the form tr X_A⊤X_A (X_A⊤X_A + λI)^{−1}. This is a non-factorial prior, but unfortunately it does not lead to a submodular function. In a Bayesian context however, it is shown by [22] that penalties of the form log det(X_A⊤X_A + λI) (which lead to submodular functions) correspond to marginal likelihoods associated to the set A and have good behavior when used within a non-convex framework. This highlights the need for non-factorial priors which are sub-linear functions of the eigenvalues of X_A⊤X_A, which is exactly what nondecreasing submodular functions of submatrices are. We do not pursue the extensive evaluation of non-factorial convex priors in this paper, but provide in the simulations examples with F(A) = tr(X_A⊤X_A)^{1/2} (which is equal to the trace norm of X_A [16]).

4.3 Functions of cardinality

For F(A) = h(|A|) where h is nondecreasing, concave and such that h(0) = 0, then, from Eq.
(1), Ω(w) is defined from the rank statistics of |w| ∈ R+^p, i.e., if |w_(1)| ≥ |w_(2)| ≥ ··· ≥ |w_(p)|, then Ω(w) = Σ_{k=1}^p [h(k) − h(k−1)] |w_(k)|. This includes the sum of the q largest elements, and might lead to interesting new norms for unstructured variable selection, but this is not pursued here. However, the algorithms and analysis presented in Section 5 and Section 6 apply to this case.

5 Convex analysis and optimization

In this section we provide algorithmic tools related to optimization problems regularized by our novel sparsity-inducing norms. Note that since these norms are polyhedral norms with unit balls having potentially an exponential number of vertices or faces, regular linear programming toolboxes may not be used.

Subgradient. From Ω(w) = max_{s∈P} s⊤|w| and the greedy algorithm¹ presented in Section 2, one can easily obtain in polynomial time one subgradient, as one of the maximizers s. This allows the use of subgradient descent, with, as shown in Figure 4, slow convergence compared to proximal methods.

Proximal operator. Given regularized problems of the form min_{w∈R^p} L(w) + λΩ(w), where L is differentiable with Lipschitz-continuous gradient, proximal methods have been shown to be particularly efficient first-order methods (see, e.g., [23]). In this paper, we consider the method "ISTA" and its accelerated variant "FISTA" [23], which are compared in Figure 4.

¹The greedy algorithm to find extreme points of the submodular polyhedron should not be confused with the greedy algorithm (e.g., forward selection) that we consider in Section 7.

To apply these methods, it suffices to be able to solve efficiently problems of the form min_{w∈R^p} (1/2)‖w − z‖₂² + λΩ(w).
In the case of the ℓ1-norm, this reduces to soft thresholding of z; the following proposition (see proof in [15]) shows that it is equivalent to a particular algorithm for submodular function minimization, namely the minimum-norm-point algorithm, which has no complexity bound but is empirically faster than algorithms with such bounds [12]:

Proposition 3 (Proximal operator) Let z ∈ R^p and λ > 0. Minimizing (1/2)‖w − z‖₂² + λΩ(w) is equivalent to finding the minimum of the submodular function A ↦ λF(A) − |z|(A) with the minimum-norm-point algorithm.

In [15], it is shown how a solution to one problem may be obtained from a solution to the other. Moreover, any algorithm for minimizing submodular functions directly yields the support of the unique solution of the proximal problem, and with a sequence of submodular function minimizations, the full solution may also be obtained. Similar links between convex optimization and minimization of submodular functions have been considered (see, e.g., [24]). However, these are dedicated to symmetric submodular functions (such as the ones obtained from graph cuts) and are thus not directly applicable to our situation of nondecreasing submodular functions.

Finally, note that using the minimum-norm-point algorithm leads to a generic algorithm that can be applied to any submodular function F, but that it may be rather inefficient for simpler subcases (e.g., the ℓ1/ℓ∞-norm, tree-structured groups [6], or general overlapping groups [7]).

6 Sparsity-inducing properties

In this section, we consider a fixed design matrix X ∈ R^{n×p} and y ∈ R^n a vector of random responses.
Given λ > 0, we define ŵ as a minimizer of the regularized least-squares cost:

min_{w∈R^p} (1/(2n)) ‖y − Xw‖₂² + λΩ(w).   (3)

We study the sparsity-inducing properties of solutions of Eq. (3), i.e., we determine in Section 6.2 which patterns are allowed and in Section 6.3 which sufficient conditions lead to correct estimation. Like recent analyses of sparsity-inducing norms [25], the analysis provided in this section relies heavily on decomposability properties of our norm Ω.

6.1 Decomposability

For a subset J of V, we denote by F_J : 2^J → R the restriction of F to J, defined for A ⊆ J by F_J(A) = F(A), and by F^J : 2^{J^c} → R the contraction of F by J, defined for A ⊆ J^c by F^J(A) = F(A ∪ J) − F(J). These two functions are submodular and nondecreasing as soon as F is (see, e.g., [12]).

We denote by Ω_J the norm on R^J defined through the submodular function F_J, and by Ω^J the pseudo-norm defined on R^{J^c} through F^J (as shown in Proposition 4, it is a norm only when J is a stable set). Note that Ω_{J^c} (a norm on R^{J^c}) is in general different from Ω^J.
Moreover, Ω_J(w_J) is actually equal to Ω(w̃) where w̃_J = w_J and w̃_{J^c} = 0, i.e., it is the restriction of Ω to J. We can now prove the following decomposition properties, which show that under certain circumstances, we can decompose the norm Ω on subsets J and their complements:

Proposition 4 (Decomposition) Given J ⊆ V and Ω_J and Ω^J defined as above, we have:
(i) ∀w ∈ R^p, Ω(w) ≥ Ω_J(w_J) + Ω^J(w_{J^c}),
(ii) ∀w ∈ R^p, if min_{j∈J} |w_j| ≥ max_{j∈J^c} |w_j|, then Ω(w) = Ω_J(w_J) + Ω^J(w_{J^c}),
(iii) Ω^J is a norm on R^{J^c} if and only if J is a stable set.

6.2 Sparsity patterns

In this section, we do not make any assumptions regarding the correct specification of the linear model. We show that with probability one, only stable support sets may be obtained (see proof in [15]). For simplicity, we assume invertibility of X⊤X, which forbids the high-dimensional situation p > n that we consider in Section 6.3, but we could consider assumptions similar to the ones used in [2].

Proposition 5 (Stable sparsity patterns) Assume y ∈ R^n has an absolutely continuous density with respect to the Lebesgue measure and that X⊤X is invertible. Then the minimizer ŵ of Eq. (3) is unique and, with probability one, its support Supp(ŵ) is a stable set.

6.3 High-dimensional inference

We now assume that the linear model is well-specified and extend results from [26] for sufficient support recovery conditions and from [25] for estimation consistency. As seen in Proposition 4, the norm Ω is decomposable, and we use this property extensively in this section. We denote by ρ(J) = min_{B⊆J^c, B≠∅} [F(B ∪ J) − F(J)] / F(B); by submodularity and monotonicity of F, ρ(J) is always between zero and one and, as soon as J is stable, it is strictly positive (for the ℓ1-norm, ρ(J) = 1).
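Both ρ(J) and the decomposition of Proposition 4 can be checked by brute force on small examples. An illustrative sketch (our own code, with F(A) = √|A| and J = {0, 1} chosen arbitrarily):

```python
import math
from itertools import combinations

def lovasz(F, w):
    """Lovasz extension of F at a nonnegative vector w (sorted-order formula)."""
    order = sorted(range(len(w)), key=lambda j: -w[j])
    val, prefix, prev = 0.0, [], 0.0
    for j in order:
        prefix.append(j)
        cur = F(frozenset(prefix))
        val += w[j] * (cur - prev)
        prev = cur
    return val

def omega(F, w):
    return lovasz(F, [abs(x) for x in w])

def rho(F, J, p):
    """rho(J) = min over nonempty B in J^c of (F(B u J) - F(J)) / F(B)."""
    Jc = [j for j in range(p) if j not in J]
    return min((F(frozenset(B) | J) - F(J)) / F(frozenset(B))
               for r in range(1, len(Jc) + 1) for B in combinations(Jc, r))

p, J = 4, frozenset({0, 1})
F = lambda A: math.sqrt(len(A))        # nondecreasing submodular
card = lambda A: float(len(A))         # cardinality (the l1-norm case)

assert 0.0 < rho(F, J, p) <= 1.0
assert abs(rho(card, J, p) - 1.0) < 1e-12   # for the l1-norm, rho(J) = 1

# Proposition 4(ii): restriction F_J and contraction F^J split Omega exactly
# when entries on J dominate entries on J^c (local indices 0,1 of J^c are
# relabeled to the global indices 2,3).
FJ = lambda A: F(A)
FcJ = lambda A: F(frozenset(j + 2 for j in A) | J) - F(J)
w = [3.0, -2.5, 0.4, -0.1]
assert abs(omega(F, w) - (omega(FJ, [3.0, -2.5]) + omega(FcJ, [0.4, -0.1]))) < 1e-9
```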
Moreover, we denote by c(J) = sup_{w∈R^p} Ω_J(w_J)/‖w_J‖₂ the equivalence constant between the norm Ω_J and the ℓ2-norm. We always have c(J) ≤ |J|^{1/2} max_{k∈V} F({k}) (with equality for the ℓ1-norm).

The following propositions allow us to recover and extend well-known results for the ℓ1-norm: Propositions 6 and 8 extend results based on support recovery conditions [26], while Propositions 7 and 8 extend results based on restricted eigenvalue conditions (see, e.g., [25]). We can also recover results for the ℓ1/ℓ∞-norm [14]. As shown in [15], the proof techniques are similar and are adapted through the decomposition properties of Proposition 4.

Proposition 6 (Support recovery) Assume that y = Xw* + σε, where ε is a standard multivariate normal vector. Let Q = (1/n) X⊤X ∈ R^{p×p}. Denote by J the smallest stable set containing the support Supp(w*) of w*. Define ν = min_{j: w*_j ≠ 0} |w*_j| > 0, assume κ = λ_min(Q_JJ) > 0, and that for some η > 0, (Ω^J)*[(Ω_J(Q_JJ^{−1} Q_Jj))_{j∈J^c}] ≤ 1 − η. Then, if λ ≤ κν/(2c(J)), the minimizer ŵ is unique and has support equal to J, with probability larger than 1 − 3P(Ω*(z) > ληρ(J)√n/(2σ)), where z is a multivariate normal vector with covariance matrix Q.

Proposition 7 (Consistency) Assume that y = Xw* + σε, where ε is a standard multivariate normal vector. Let Q = (1/n) X⊤X ∈ R^{p×p}. Denote by J the smallest stable set containing the support Supp(w*) of w*. Assume that, for all ∆ such that Ω^J(∆_{J^c}) ≤ 3Ω_J(∆_J), we have ∆⊤Q∆ ≥ κ‖∆_J‖₂².
Then we have Ω(ŵ − w*) ≤ 24 c(J)² λ / (κ ρ(J)²) and (1/n)‖Xŵ − Xw*‖₂² ≤ 36 c(J)² λ² / (κ ρ(J)²), with probability larger than 1 − P(Ω*(z) > λρ(J)√n/(2σ)), where z is a multivariate normal vector with covariance matrix Q.

Proposition 8 (Concentration inequalities) Let z be a normal variable with covariance matrix Q. Let T be the set of stable inseparable sets. Then P(Ω*(z) > t) ≤ Σ_{A∈T} 2^{|A|} exp(−t² F(A)² / (2 · 1⊤Q_AA 1)).

7 Experiments

We provide illustrations on toy examples of some of the results presented in the paper. We consider the regularized least-squares problem of Eq. (3), with data generated as follows: given p, n, k, the design matrix X ∈ R^{n×p} is a matrix of i.i.d. Gaussian components, normalized to have unit ℓ2-norm columns. A set J of cardinality k is chosen at random, the weights w*_J are sampled from a standard multivariate Gaussian distribution, and w*_{J^c} = 0. We then take y = Xw* + n^{−1/2}‖Xw*‖₂ ε, where ε is a standard Gaussian vector (this corresponds to a unit signal-to-noise ratio).

Proximal methods vs. subgradient descent. For the submodular function F(A) = |A|^{1/2} (a simple submodular function beyond the cardinality), we compare the three optimization algorithms described in Section 5 (subgradient descent and two proximal methods, ISTA and its accelerated version FISTA [23]) for p = n = 1000, k = 100 and λ = 0.1. Other settings and other set-functions lead to results similar to the ones presented in Figure 4: FISTA is faster than ISTA, and much faster than subgradient descent.

Relaxation of combinatorial optimization problem.
We compare three strategies for solving the combinatorial optimization problem min_{w∈R^p} (1/(2n))‖y − Xw‖₂² + λF(Supp(w)) with F(A) = tr(X_A⊤X_A)^{1/2}: the approach based on our sparsity-inducing norms, the simpler greedy (forward selection) approach proposed in [8, 3], and thresholding of the ordinary least-squares estimate. For all methods, we try all possible regularization parameters. We see in the two right plots of Figure 4 that for hard cases (middle plot) convex optimization techniques perform better than the other approaches, while for easier cases with more observations (right plot), they do as well as greedy approaches.

Figure 4: (Left) Comparison of iterative optimization algorithms (value of objective function vs. running time). (Middle/Right) Relaxation of combinatorial optimization problem, showing residual error (1/n)‖y − Xŵ‖₂² vs. penalty F(Supp(ŵ)): (middle) high-dimensional case (p = 120, n = 20, k = 40), (right) lower-dimensional case (p = 120, n = 120, k = 40).

Non-factorial priors for variable selection. We now focus on the predictive performance and compare our new norm, with F(A) = tr(X_A⊤X_A)^{1/2}, to greedy approaches [3] and to regularization by ℓ1 or ℓ2 norms. As shown in Table 1, the new norm based on non-factorial priors is more robust than the ℓ1-norm to a smaller number of observations n and to a larger support cardinality k.

 p    n   k   submodular   greedy vs. submod.  ℓ2 vs. submod.  ℓ1 vs. submod.
120  120  80  40.8 ± 0.8   21.8 ± 0.9          -2.6 ± 0.5       0.6 ± 0.0
120  120  40  35.9 ± 0.8   15.8 ± 1.0           2.4 ± 0.4       0.3 ± 0.0
120  120  20  29.0 ± 1.0    6.7 ± 0.9           9.4 ± 0.5      -0.1 ± 0.0
120  120  10  20.4 ± 1.0   -2.8 ± 0.8          17.5 ± 0.5      -0.2 ± 0.0
120  120   6  15.4 ± 0.9   -5.3 ± 0.8          22.7 ± 0.5      -0.2 ± 0.0
120  120   4  11.7 ± 0.9   -6.0 ± 0.8          26.3 ± 0.5      -0.1 ± 0.0
120   20  80  46.8 ± 2.1   22.9 ± 2.3          -0.6 ± 0.5       3.0 ± 0.9
120   20  40  47.9 ± 1.9   23.7 ± 2.0          -0.3 ± 0.5       3.5 ± 0.9
120   20  20  49.4 ± 2.0   23.5 ± 2.1           0.4 ± 0.5       2.2 ± 0.8
120   20  10  49.2 ± 2.0   20.3 ± 2.6           0.0 ± 0.6       1.0 ± 0.8
120   20   6  43.5 ± 2.0   24.4 ± 3.0           3.5 ± 0.8       0.9 ± 0.6
120   20   4  41.0 ± 2.1   25.1 ± 3.5           4.8 ± 0.7      -1.3 ± 0.5

Table 1: Normalized mean-square prediction errors ‖Xŵ − Xw*‖₂²/n (multiplied by 100) with optimal regularization parameters (averaged over 50 replications, with standard deviations divided by √50). The performance of the submodular method is shown; the differences from all other methods to this particular one are then computed, and shown in bold when they are significantly greater than zero, as measured by a paired t-test at level 5% (i.e., when the submodular method is significantly better).

8 Conclusions

We have presented a family of sparsity-inducing norms dedicated to incorporating prior knowledge or structural constraints on the support of linear predictors.
We have provided a set of common algorithms and theoretical results, as well as simulations on synthetic examples illustrating the good behavior of these norms. Several avenues are worth investigating: first, we could follow current practice in sparse methods, e.g., by considering related adapted concave penalties to enhance sparsity-inducing norms, or by extending some of the concepts to norms of matrices, with potential applications in matrix factorization or multi-task learning (see, e.g., [27] for an application of submodular functions to dictionary learning). Second, links between submodularity and sparsity could be studied further, in particular by considering submodular relaxations of other combinatorial functions, or by studying links with other polyhedral norms such as the total variation, which are known to be similarly associated with symmetric submodular set-functions such as graph cuts [24].

Acknowledgements. This paper was partially supported by the Agence Nationale de la Recherche (MGA Project) and the European Research Council (SIERRA Project). The author would like to thank Edouard Grave, Rodolphe Jenatton, Armand Joulin, Julien Mairal and Guillaume Obozinski for discussions related to this work.

References

[1] P. Zhao, G. Rocha, and B. Yu. Grouped and hierarchical model selection through composite absolute penalties. Annals of Statistics, 37(6A):3468–3497, 2009.

[2] R. Jenatton, J.-Y. Audibert, and F. Bach. Structured variable selection with sparsity-inducing norms. Technical report, arXiv:0904.3523, 2009.

[3] J. Huang, T. Zhang, and D. Metaxas. Learning with structured sparsity. In Proc. ICML, 2009.

[4] L. Jacob, G. Obozinski, and J.-P. Vert. Group Lasso with overlaps and graph Lasso. In Proc. ICML, 2009.

[5] S. Kim and E. Xing. Tree-guided group Lasso for multi-task regression with structured sparsity. In Proc. ICML, 2010.

[6] R. Jenatton, J. Mairal, G. Obozinski, and F. Bach.
Proximal methods for sparse hierarchical dictionary learning. In Proc. ICML, 2010.

[7] J. Mairal, R. Jenatton, G. Obozinski, and F. Bach. Network flow algorithms for structured sparsity. In Adv. NIPS, 2010.

[8] J. Haupt and R. Nowak. Signal reconstruction from noisy random projections. IEEE Transactions on Information Theory, 52(9):4036–4048, 2006.

[9] F. Bach. Convex analysis and optimization with submodular functions: a tutorial. Technical Report 00527714, HAL, 2010.

[10] A. Krause and C. Guestrin. Near-optimal nonmyopic value of information in graphical models. In Proc. UAI, 2005.

[11] Y. Kawahara, K. Nagano, K. Tsuda, and J. A. Bilmes. Submodularity cuts and applications. In Adv. NIPS, 2009.

[12] S. Fujishige. Submodular Functions and Optimization. Elsevier, 2005.

[13] J. Edmonds. Submodular functions, matroids, and certain polyhedra. In Combinatorial optimization - Eureka, you shrink!, pages 11–26. Springer, 2003.

[14] S. Negahban and M. J. Wainwright. Joint support recovery under high-dimensional scaling: Benefits and perils of ℓ1-ℓ∞-regularization. In Adv. NIPS, 2008.

[15] F. Bach. Structured sparsity-inducing norms through submodular functions. Technical Report 00511310, HAL, 2010.

[16] S. P. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

[17] R. Jenatton, G. Obozinski, and F. Bach. Structured sparse principal component analysis. In Proc. AISTATS, 2009.

[18] R. Tibshirani, M. Saunders, S. Rosset, J. Zhu, and K. Knight. Sparsity and smoothness via the fused Lasso. J. Roy. Stat. Soc. B, 67(1):91–108, 2005.

[19] R. A. Horn and C. R. Johnson. Matrix Analysis. Cambridge Univ. Press, 1990.

[20] T. Ando. Concavity of certain maps on positive definite matrices and applications to Hadamard products. Linear Algebra and its Applications, 26:203–241, 1979.

[21] C. L. Mallows. Some comments on Cp.
Technometrics, 15(4):661–675, 1973.

[22] D. Wipf and S. Nagarajan. Sparse estimation using general likelihoods and non-factorial priors. In Adv. NIPS, 2009.

[23] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.

[24] A. Chambolle and J. Darbon. On total variation minimization and surface evolution using parametric maximum flows. International Journal of Computer Vision, 84(3):288–307, 2009.

[25] S. Negahban, P. Ravikumar, M. J. Wainwright, and B. Yu. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. In Adv. NIPS, 2009.

[26] P. Zhao and B. Yu. On model selection consistency of Lasso. Journal of Machine Learning Research, 7:2541–2563, 2006.

[27] A. Krause and V. Cevher. Submodular dictionary selection for sparse representation. In Proc. ICML, 2010.
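As a supplementary illustration of the paper's central construction, the norm Ω associated with a nondecreasing submodular set-function F can be evaluated directly from the Lovász extension: when F depends only on the cardinality, Ω(w) = Σ_k |w|_(k) (F(k) − F(k−1)), where |w|_(1) ≥ |w|_(2) ≥ … are the magnitudes of w sorted in decreasing order. The sketch below is our own minimal implementation (not code from the paper; the function name `lovasz_norm` is a hypothetical choice): it checks that F(A) = |A| recovers the ℓ1-norm, and evaluates the norm for F(A) = |A|^{1/2} used in the experiments of Section 7.

```python
import numpy as np

def lovasz_norm(w, F):
    """Norm induced by a nondecreasing submodular set-function that depends
    only on the cardinality, passed as F(k) = F(any set of size k):
    Omega(w) = sum_k |w|_(k) * (F(k) - F(k-1)),
    with |w|_(1) >= |w|_(2) >= ... the decreasingly sorted magnitudes."""
    a = np.sort(np.abs(np.asarray(w, dtype=float)))[::-1]  # sorted magnitudes
    k = np.arange(1, a.size + 1)
    return float(np.sum(a * (F(k) - F(k - 1))))            # Lovasz extension at |w|

w = np.array([3.0, -1.0, 2.0])
# F(A) = |A| (the cardinality) recovers the l1-norm:
print(lovasz_norm(w, lambda k: 1.0 * k))        # prints 6.0
# F(A) = |A|^{1/2}, the submodular function used in the experiments:
print(lovasz_norm(w, lambda k: np.sqrt(k)))     # ~ 4.146
```

Because the weights F(k) − F(k−1) are nonincreasing for a concave F of the cardinality, large magnitudes are penalized at full strength while additional nonzeros cost progressively less, which is one way to read the non-factorial prior interpretation of Section 7.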