{"title": "Shaping Level Sets with Submodular Functions", "book": "Advances in Neural Information Processing Systems", "page_first": 10, "page_last": 18, "abstract": "We consider a class of sparsity-inducing regularization terms based on submodular functions. While previous work has focused on  non-decreasing functions, we explore symmetric submodular functions and their \\lova extensions. We show that the Lovasz extension may be seen as the convex envelope of a function that depends on level sets  (i.e., the set of indices whose corresponding components of the underlying predictor are greater than a given constant): this leads to a class of convex structured regularization terms that impose prior knowledge on the level sets, and not only on the supports of the underlying predictors. We provide a unified set of optimization algorithms, such as proximal operators, and theoretical guarantees (allowed level sets and recovery conditions). By selecting specific submodular functions,  we   give a new interpretation to known norms, such as the total variation; we also define new norms, in particular ones that are based on order statistics with application to clustering and outlier detection, and on noisy cuts in graphs with application to change point detection in the presence of outliers.", "full_text": "Shaping Level Sets with Submodular Functions\n\nFrancis Bach\n\nINRIA - Sierra Project-team\n\nLaboratoire d\u2019Informatique de l\u2019Ecole Normale Sup\u00b4erieure, Paris, France\n\nfrancis.bach@ens.fr\n\nAbstract\n\nWe consider a class of sparsity-inducing regularization terms based on submodular func-\ntions. While previous work has focused on non-decreasing functions, we explore sym-\nmetric submodular functions and their Lov\u00b4asz extensions. 
We show that the Lovász extension may be seen as the convex envelope of a function that depends on level sets (i.e., the set of indices whose corresponding components of the underlying predictor are greater than a given constant): this leads to a class of convex structured regularization terms that impose prior knowledge on the level sets, and not only on the supports of the underlying predictors. We provide unified optimization algorithms, such as proximal operators, and theoretical guarantees (allowed level sets and recovery conditions). By selecting specific submodular functions, we give a new interpretation to known norms, such as the total variation; we also define new norms, in particular ones that are based on order statistics with application to clustering and outlier detection, and on noisy cuts in graphs with application to change point detection in the presence of outliers.

1 Introduction

The concept of parsimony is central in many scientific domains. In the context of statistics, signal processing or machine learning, it may take several forms. Classically, in a variable or feature selection problem, a sparse solution with many zeros is sought so that the model is either more interpretable, cheaper to use, or simply matches available prior knowledge (see, e.g., [1, 2, 3] and references therein). In this paper, we instead consider sparsity-inducing regularization terms that will lead to solutions with many equal values. A classical example is the total variation in one or two dimensions, which leads to piecewise constant solutions [4, 5] and can be applied to various image labelling problems [6, 5], or change point detection tasks [7, 8, 9]. Another example is the "Oscar" penalty which induces automatic grouping of the features [10].
In this paper, we follow the approach of [3], who designed sparsity-inducing norms based on non-decreasing submodular functions, as a convex approximation to imposing a specific prior on the supports of the predictors. Here, we show that a similar parallel holds for another class of submodular functions, namely non-negative set-functions which are equal to zero for the full and empty sets. Our main instances of such functions are symmetric submodular functions.

We make the following contributions:

− We provide in Section 3 explicit links between priors on level sets and certain submodular functions: we show that the Lovász extensions (see, e.g., [11] and a short review in Section 2) associated to these submodular functions are the convex envelopes (i.e., tightest convex lower bounds) of specific functions that depend on all level sets of the underlying vector.

− In Section 4, we reinterpret existing norms such as the total variation and design new norms, based on noisy cuts or order statistics. We propose applications to clustering and outlier detection, as well as to change point detection in the presence of outliers.

− We provide unified algorithms in Section 5, such as proximal operators, which are based on a sequence of submodular function minimizations (SFMs), when such SFMs are efficient, or by adapting the generic slower approach of [3] otherwise.

− We derive unified theoretical guarantees for level set recovery in Section 6, showing that even in the absence of correlation between predictors, level set recovery is not always guaranteed, a situation which is to be contrasted with traditional support recovery situations [1, 3].

Notation. For w ∈ R^p and q ∈ [1,∞], we denote by ‖w‖_q the ℓ_q-norm of w. Given a subset A of V = {1, . . . , p}, 1_A ∈ {0,1}^p is the indicator vector of the subset A.
Moreover, given a vector w and a matrix Q, w_A and Q_AA denote the corresponding subvector and submatrix of w and Q. Finally, for w ∈ R^p and A ⊂ V, w(A) = Σ_{k∈A} w_k = w^⊤ 1_A (this defines a modular set-function). In this paper, for a certain vector w ∈ R^p, we call level sets the sets of indices whose components are larger (or smaller) than or equal to a certain constant α, which we denote {w ≥ α} (or {w ≤ α}), while we call constant sets the sets of indices whose components are equal to a constant α, which we denote {w = α}.

2 Review of Submodular Analysis

In this section, we review relevant results from submodular analysis. For more details, see, e.g., [12], and, for a review with proofs derived from classical convex analysis, see, e.g., [11].

Definition. Throughout this paper, we consider a submodular function F defined on the power set 2^V of V = {1, . . . , p}, i.e., such that ∀A, B ⊂ V, F(A) + F(B) ≥ F(A∪B) + F(A∩B). Unless otherwise stated, we consider functions which are non-negative (i.e., such that F(A) ≥ 0 for all A ⊂ V), and that satisfy F(∅) = F(V) = 0. Usual examples are symmetric submodular functions, i.e., such that ∀A ⊂ V, F(V\A) = F(A), which are known to always have non-negative values. We give several examples in Section 4; for illustrating the concepts introduced in this section and Section 3, we will consider the cut in an undirected chain graph, i.e., F(A) = Σ_{j=1}^{p−1} |(1_A)_j − (1_A)_{j+1}|.

Lovász extension. Given any set-function F such that F(V) = F(∅) = 0, one can define its Lovász extension f : R^p → R, as f(w) = ∫_R F({w ≥ α}) dα (see, e.g., [11] for this particular formulation). The Lovász extension is convex if and only if F is submodular.
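The integral formula above can be evaluated exactly, since {w ≥ α} only changes at the distinct values of w, and F({w ≥ α}) vanishes below min w and above max w because F(V) = F(∅) = 0. A minimal sketch for the chain-graph cut (function names are ours, not from the paper):

```python
def chain_cut(A, p):
    """Cut of the subset A in an undirected chain graph on p nodes:
    F(A) = sum_j |(1_A)_j - (1_A)_{j+1}|."""
    ind = [1 if j in A else 0 for j in range(p)]
    return sum(abs(ind[j] - ind[j + 1]) for j in range(p - 1))

def lovasz_by_level_sets(F, w):
    """Lovasz extension f(w) = integral over alpha of F({w >= alpha}).
    The level set is constant on each interval between consecutive distinct
    values of w, so the integral reduces to a finite sum."""
    p = len(w)
    vals = sorted(set(w))
    f = 0.0
    for lo, hi in zip(vals[:-1], vals[1:]):
        level_set = {j for j in range(p) if w[j] >= hi}  # {w >= alpha}, alpha in (lo, hi]
        f += (hi - lo) * F(level_set, p)
    return f

w = [3.0, 1.0, 2.0, 2.0]
tv = sum(abs(w[j] - w[j + 1]) for j in range(len(w) - 1))  # total variation of w
print(lovasz_by_level_sets(chain_cut, w), tv)
```

For the chain graph, the result coincides with the one-dimensional total variation, and on indicator vectors it recovers F itself (the extension property recalled next).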
Moreover, f is piecewise-linear and for all A ⊂ V, f(1_A) = F(A), that is, it is indeed an extension from 2^V (which can be identified with {0,1}^p through indicator vectors) to R^p. Finally, it is always positively homogeneous. For the chain graph, we obtain the usual total variation f(w) = Σ_{j=1}^{p−1} |w_j − w_{j+1}|.

Base polyhedron. We denote by B(F) = {s ∈ R^p, ∀A ⊂ V, s(A) ≤ F(A), s(V) = F(V)} the base polyhedron [12], where we use the notation s(A) = Σ_{k∈A} s_k. One important result in submodular analysis is that if F is a submodular function, then we have a representation of f as a maximum of linear functions [12, 11], i.e., for all w ∈ R^p, f(w) = max_{s∈B(F)} w^⊤ s. Moreover, instead of solving a linear program with 2^p − 1 constraints, a solution s may be obtained by the following "greedy algorithm": order the components of w in decreasing order w_{j_1} ≥ ··· ≥ w_{j_p}, and then take for all k ∈ {1, . . . , p}, s_{j_k} = F({j_1, . . . , j_k}) − F({j_1, . . . , j_{k−1}}).

Tight and inseparable sets. The polyhedra U = {w ∈ R^p, f(w) ≤ 1} and B(F) are polar to each other (see, e.g., [13] for definitions and properties of polar sets). Therefore, the facial structure of U may be obtained from the one of B(F). Given s ∈ B(F), a set A ⊂ V is said to be tight if s(A) = F(A). It is known that the set of tight sets is a distributive lattice, i.e., if A and B are tight, then so are A∪B and A∩B [12, 11]. The faces of B(F) are thus intersections of hyperplanes {s(A) = F(A)} for A belonging to certain distributive lattices (see Prop. 3). A set A is said to be separable if there exists a non-trivial partition A = B ∪ C such that F(A) = F(B) + F(C). A set is inseparable if it is not separable.
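The greedy algorithm above is short enough to write down directly; this sketch (our own naming) returns a maximizer s ∈ B(F), so that s·w = f(w):

```python
def chain_cut(A, p):
    """Cut of A in an undirected chain graph on p nodes."""
    ind = [1 if j in A else 0 for j in range(p)]
    return sum(abs(ind[j] - ind[j + 1]) for j in range(p - 1))

def greedy(F, w, p):
    """Greedy algorithm: sort w in decreasing order and set
    s_{j_k} = F({j_1,...,j_k}) - F({j_1,...,j_{k-1}}).
    The resulting s lies in B(F) and maximizes s . w, so s . w = f(w)."""
    order = sorted(range(p), key=lambda j: -w[j])
    s, prefix, prev = [0.0] * p, set(), 0.0
    for j in order:
        prefix.add(j)
        val = F(prefix, p)
        s[j], prev = val - prev, val
    return s

w = [3.0, 1.0, 2.0, 2.0]
s = greedy(chain_cut, w, 4)
f_w = sum(si * wi for si, wi in zip(s, w))  # equals the total variation of w
```

One can also verify on this small example that s satisfies the base-polyhedron constraints s(A) ≤ F(A) for all A and s(V) = F(V) = 0.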
For the cut in an undirected graph, inseparable sets are exactly connected sets.

3 Properties of the Lovász Extension

In this section, we derive properties of the Lovász extension for submodular functions, which go beyond convexity and homogeneity. Throughout this section, we assume that F is a non-negative submodular set-function that is equal to zero at ∅ and V. This immediately implies that f is invariant by addition of any constant vector (that is, f(w + α1_V) = f(w) for all w ∈ R^p and α ∈ R), and that f(1_V) = F(V) = 0. Thus, contrary to the non-decreasing case [3], our regularizers are not norms. However, they are norms on the hyperplane {w^⊤ 1_V = 0} as soon as F(A) > 0 for all A ≠ ∅ and A ≠ V (which we assume for the rest of this paper).

We now show that the Lovász extension is the convex envelope of a certain combinatorial function which does depend on all level sets {w ≥ α} of w ∈ R^p (see proof in [14]):

Proposition 1 (Convex envelope) The Lovász extension f(w) is the convex envelope of the function w ↦ max_{α∈R} F({w ≥ α}) on the set [0,1]^p + R1_V = {w ∈ R^p, max_{k∈V} w_k − min_{k∈V} w_k ≤ 1}.

Figure 1: Top: Polyhedral level set of f (projected on the set w^⊤1_V = 0), for 2 different submodular symmetric functions of three variables, with different inseparable sets leading to different sets of extreme points; changing values of F may make some of the extreme points disappear. The various extreme points cut the space into polygons where the ordering of the components is fixed.
Left: F(A) = 1_{|A|∈{1,2}}, leading to f(w) = max_k w_k − min_k w_k (all possible extreme points); note that the polygon need not be symmetric in general. Right: one-dimensional total variation on three nodes, i.e., F(A) = |1_{1∈A} − 1_{2∈A}| + |1_{2∈A} − 1_{3∈A}|, leading to f(w) = |w_1 − w_2| + |w_2 − w_3|, for which the extreme points corresponding to the separable set {1, 3} and its complement disappear.

Note the difference with the result of [3]: we consider here a different set on which we compute the convex envelope ([0,1]^p + R1_V instead of [−1,1]^p), and not a function of the support of w, but of all its level sets.¹ Moreover, the Lovász extension is a convex relaxation of a function of level sets (of the form {w ≥ α}) and not of constant sets (of the form {w = α}). It would perhaps have been more intuitive to consider for example ∫_R F({w = α}) dα, since it does not depend on the ordering of the values that w may take; however, to the best of our knowledge, the latter function does not lead to a convex function amenable to polynomial-time algorithms. This definition through level sets will generate some potentially undesired behavior (such as the well-known staircase effect for the one-dimensional total variation), as we show in Section 6.

The next proposition describes the set of extreme points of the "unit ball" U = {w, f(w) ≤ 1}, giving a first illustration of sparsity-inducing effects (see example in Figure 1, in particular for the one-dimensional total variation).

Proposition 2 (Extreme points) The extreme points of the set U ∩ {w^⊤1_V = 0} are the projections of the vectors 1_A/F(A) on the plane {w^⊤1_V = 0}, for A such that A is inseparable for F and V\A is inseparable for B ↦ F(A ∪ B) − F(A).

Partially ordered sets and distributive lattices.
A subset D of 2^V is a (distributive) lattice if it is invariant by intersection and union. We assume in this paper that all lattices contain the empty set ∅ and the full set V, and we endow the lattice with the inclusion order. Such lattices may be represented as a partially ordered set (poset) Π(D) = {A_1, . . . , A_m} (with order relationship <), where the sets A_j, j = 1, . . . , m, form a partition of V (we always assume a topological ordering of the sets, i.e., A_i < A_j ⇒ i > j). As illustrated in Figure 2, we go from D to Π(D) by considering all maximal chains in D and the differences between consecutive sets. We go from Π(D) to D by constructing all ideals of Π(D), i.e., sets J such that if an element of Π(D) is lower than an element of J, then it has to be in J (see [12] for more details, and an example in Figure 2). Distributive lattices and posets are thus in one-to-one correspondence. Throughout this section, we go back and forth between these two representations. The distributive lattice D will correspond to all authorized level sets {w ≥ α} for w in a single face of U, while the elements of the poset Π(D) are the constant sets (over which w is constant), with the order between the subsets giving partial constraints between the values of the corresponding constants.

Faces of U. The faces of U are characterized by lattices D, with their corresponding posets Π(D) = {A_1, . . . , A_m}.
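The lattice/poset correspondence just described can be checked programmatically on the example of Figure 2; the sketch below (all names ours) recovers the 4 poset blocks from the 7-element lattice and then rebuilds the lattice from the ideals of the poset:

```python
from itertools import combinations

# The 7-element distributive lattice of Figure 2, on V = {1,...,6}.
V = frozenset(range(1, 7))
D = [frozenset(), frozenset({2}), frozenset({1, 2}), frozenset({2, 5, 6}),
     frozenset({1, 2, 5, 6}), frozenset({2, 3, 4, 5, 6}), V]

# Blocks of the poset Pi(D): i and j are in the same block iff they belong
# to exactly the same elements of D.
def signature(i):
    return tuple(i in S for S in D)

groups = {}
for i in V:
    groups.setdefault(signature(i), set()).add(i)
blocks = [frozenset(b) for b in groups.values()]

# Partial order on blocks: P <= Q iff every element of D containing Q also contains P.
def leq(P, Q):
    return all(P <= S for S in D if Q <= S)

# Ideals (down-closed subsets) of the poset; their unions recover the lattice.
def is_ideal(J):
    return all(P in J for Q in J for P in blocks if leq(P, Q))

ideals = [J for r in range(len(blocks) + 1)
          for J in combinations(blocks, r) if is_ideal(J)]
unions = {frozenset().union(*J) for J in ideals}
```

Here the blocks come out as {2}, {1}, {5, 6}, {3, 4}, and the 7 ideals reproduce exactly the 7 lattice elements, illustrating the one-to-one correspondence.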
We denote by U°_D (and by U_D its closure) the set of w ∈ R^p such that (a) f(w) ≤ 1, (b) w is piecewise constant with respect to Π(D), with value v_i on A_i, and (c) for all pairs (i, j), A_i < A_j ⇒ v_i ≥ v_j. For certain lattices D, these will be exactly the relative interiors of all faces of U (see proof in [14]):

Proposition 3 (Faces of U) The (non-empty) relative interiors of all faces of U are exactly of the form U°_D, where D is a lattice such that:
(i) the restriction of F to D is modular, i.e., for all A, B ∈ D, F(A) + F(B) = F(A∪B) + F(A∩B),
(ii) for all j ∈ {1, . . . , m}, the set A_j is inseparable for the function C_j ↦ F(B_{j−1} ∪ C_j) − F(B_{j−1}), where B_{j−1} is the union of all ancestors of A_j in Π(D),
(iii) among all lattices corresponding to the same unordered partition, D is a maximal element of the set of lattices satisfying (i) and (ii).

¹Note that the support {w = 0} is a constant set which is the intersection of two level sets.

[Figure 2 node labels: lattice elements ∅, {2}, {1,2}, {2,5,6}, {1,2,5,6}, {2,3,4,5,6}, {1,2,3,4,5,6}; poset elements {2}, {1}, {5,6}, {3,4}.]

Figure 2: Left: distributive lattice with 7 elements in the power set of {1, 2, 3, 4, 5, 6}, represented with the Hasse diagram corresponding to the inclusion order (for a partial order, a Hasse diagram connects A to B if A is smaller than B and there is no C such that A is smaller than C and C is smaller than B). Right: corresponding poset, with 4 elements that form a partition of {1, 2, 3, 4, 5, 6}, represented with the Hasse diagram corresponding to the order < (a node points to its immediate smaller node according to <). Note that this corresponds to an "allowed" lattice (see Prop. 3) for the one-dimensional total variation.

Among the three conditions, the second one is the easiest to interpret, as it reduces to having constant sets which are inseparable for certain submodular functions; for cuts in an undirected graph, these will exactly be connected sets. Note also that the extreme points from Prop. 2 are recovered with D = {∅, A, V}.

Since we are able to characterize all faces of U (of all dimensions) with non-empty relative interior, we have a partition of the space, and any w ∈ R^p which is not proportional to 1_V will be, up to the strictly positive constant f(w), in exactly one of these relative interiors of faces; we refer to this lattice as the lattice associated to w. Note that from the face w belongs to, we have strong constraints on the constant sets, but we may not be able to determine all level sets of w, because only partial constraints are given by the order on Π(D). For example, in Figure 2 for the one-dimensional total variation, w_2 may be larger or smaller than w_5 = w_6 (and even potentially equal, but with zero probability, see Section 6).

4 Examples of Submodular Functions

In this section, we provide examples of submodular functions and of their Lovász extensions. Some are well-known (such as cut functions and total variations), some are new in the context of supervised learning (regular functions), while some have interesting effects in terms of clustering or outlier detection (cardinality-based functions).

Symmetrization. From any submodular function G, one may define F(A) = G(A) + G(V\A) − G(∅) − G(V), which is symmetric.
Potentially interesting examples which are beyond the scope of this paper are mutual information, or functions of eigenvalues of submatrices [3].

Cut functions. Given a set of nonnegative weights d : V × V → R_+, define the cut F(A) = Σ_{k∈A, j∈V\A} d(k, j). The Lovász extension is equal to f(w) = Σ_{k,j∈V} d(k, j)(w_k − w_j)_+ (which shows submodularity because f is convex), and is often referred to as the total variation. If the weight function d is symmetric, then the submodular function is also symmetric. In this case, it can be shown that inseparable sets for functions A ↦ F(A ∪ B) − F(B) are exactly connected sets. Hence, by Props. 3 and 6, constant sets are connected sets, which is the usual justification behind the total variation. Note however that some configurations of connected sets are not allowed due to the other conditions in Prop. 3 (see examples in Section 6). In Figure 5 (right plot), we give an example of the usual chain graph, leading to the one-dimensional total variation [4, 5]. Note that these functions can be extended to cuts in hypergraphs, which may have interesting applications in computer vision [6].
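As a sanity check of the identity between the cut and its pairwise Lovász extension, one can compare the pairwise formula with the generic greedy computation of Section 2 on a small symmetric weight matrix (an illustrative example, not from the paper):

```python
def cut(A, d):
    """F(A) = sum_{k in A, j not in A} d[k][j]."""
    p = len(d)
    return sum(d[k][j] for k in A for j in range(p) if j not in A)

def lovasz_greedy(F, w, *args):
    """f(w) = s . w with s produced by the greedy algorithm of Section 2."""
    p = len(w)
    order = sorted(range(p), key=lambda j: -w[j])
    total, prefix, prev = 0.0, set(), 0.0
    for j in order:
        prefix.add(j)
        val = F(prefix, *args)
        total += (val - prev) * w[j]
        prev = val
    return total

def tv(w, d):
    """Pairwise form of the Lovasz extension of the cut:
    f(w) = sum_{k,j} d(k,j) (w_k - w_j)_+."""
    p = len(w)
    return sum(d[k][j] * max(w[k] - w[j], 0.0)
               for k in range(p) for j in range(p))

# a small symmetric weight matrix on 3 nodes (made up for illustration)
d = [[0, 2, 1], [2, 0, 3], [1, 3, 0]]
w = [3.0, 1.0, 2.0]
```

On indicator vectors, the pairwise formula also recovers the cut value itself, consistent with f(1_A) = F(A).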
Moreover, directed cuts may be interesting to favor increasing or decreasing jumps along the edges of the graph.

Figure 3: Three left plots: Estimation of noisy piecewise constant 1D signal with outliers (indices 5 and 15 in the chain of 20 nodes). Left: original signal. Middle: best estimation with total variation (level sets are not correctly estimated). Right: best estimation with the robust total variation based on noisy cut functions (level sets are correctly estimated, with less bias and with detection of outliers). Right plot: clustering estimation error vs. noise level, in a sequence of 100 variables, with a single jump, where noise of variance one is added, with 5% of outliers (averaged over 20 replications).

Regular functions and robust total variation. By partial minimization, we obtain so-called regular functions [6, 5].
One application is "noisy cut functions": for a given weight function d : W × W → R_+, where each node in W is uniquely associated with a node in V, we consider the submodular function obtained as the minimum cut adapted to A in the augmented graph (see an example in the right plot of Figure 5): F(A) = min_{B⊂W} Σ_{k∈B, j∈W\B} d(k, j) + λ|AΔB|. This allows for robust versions of cuts, where some gaps may be tolerated; indeed, compared to having directly a small cut for A, B needs to have a small cut and be close to A, thus allowing some elements to be removed from or added to A in order to lower the cut. See examples in Figure 3, illustrating the behavior for the type of graph displayed in the bottom-right plot of Figure 5, where the performance of the robust total variation is significantly more stable in the presence of outliers.

Cardinality-based functions. For F(A) = h(|A|) where h is such that h(0) = h(p) = 0 and h is concave, we obtain a submodular function, and a Lovász extension that depends on the order statistics of w, i.e., if w_{j_1} ≥ ··· ≥ w_{j_p}, then f(w) = Σ_{k=1}^{p−1} h(k)(w_{j_k} − w_{j_{k+1}}). While these examples do not provide significantly different behaviors than the non-decreasing submodular functions explored by [3] (i.e., in terms of support), they lead to interesting behaviors here in terms of level sets, i.e., they will make the components of w cluster together in specific ways. Indeed, as shown in Section 6, allowed constant sets A are such that A is inseparable for the function C ↦ h(|B ∪ C|) − h(|B|) (where B ⊂ V is the set of components with higher values than the ones in A), which imposes that the concave function h is not linear on [|B|, |B|+|A|]. We consider the following examples:

1. F(A) = |A| · |V\A|, leading to f(w) = Σ_{i,j=1}^p (w_i − w_j)_+. This function can thus also be seen as the cut in the fully connected graph.
All patterns of level sets are allowed as the function h is strongly concave (see left plot of Figure 4). This function has been extended in [15] by considering situations where each w_j is a vector, instead of a scalar, and replacing the absolute value |w_i − w_j| by any norm ‖w_i − w_j‖, leading to convex formulations for clustering.

2. F(A) = 1 if A ≠ ∅ and A ≠ V, and 0 otherwise, leading to f(w) = max_{i,j} |w_i − w_j|. Two large level sets appear at the top and bottom; all the remaining variables are in-between and separated (Figure 4, second plot from the left).

3. F(A) = max{|A|, |V\A|}. This function is piecewise affine, with only one kink, thus only one level set of cardinality greater than one (in the middle) is possible, which is observed in Figure 4 (third plot from the left). This may have applications to multivariate outlier detection by considering extensions similar to [15].

5 Optimization Algorithms

In this section, we present optimization methods for minimizing convex objective functions regularized by the Lovász extension of a submodular function. These lead to convex optimization problems, which we tackle using proximal methods (see, e.g., [16, 17] and references therein). We first start by mentioning that subgradients may easily be derived (but subgradient descent is here rather inefficient, as shown in Figure 5).
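For the cardinality-based functions F(A) = h(|A|) of Section 4, the greedy computation just alluded to simplifies further, since the increments of F along the sorted order depend only on position; a sketch comparing it with the order-statistics formula (names are ours):

```python
def lovasz_greedy(h, w):
    """Greedy computation of f for F(A) = h(|A|): along the sorted order the
    increment at step k is h(k) - h(k-1), whichever index enters."""
    order = sorted(range(len(w)), key=lambda j: -w[j])
    return sum((h(k + 1) - h(k)) * w[j] for k, j in enumerate(order))

def lovasz_order_stats(h, w):
    """f(w) = sum_{k=1}^{p-1} h(k) (w_{(k)} - w_{(k+1)}) with w sorted
    in decreasing order."""
    v = sorted(w, reverse=True)
    return sum(h(k) * (v[k - 1] - v[k]) for k in range(1, len(w)))

p = 4
h = lambda k: k * (p - k)  # concave with h(0) = h(p) = 0, i.e. F(A) = |A| |V\A|
w = [3.0, 1.0, 2.0, 2.0]
```

For this choice of h, both computations also agree with the pairwise sum Σ_{i,j} (w_i − w_j)_+, consistent with the fully-connected-graph interpretation of Section 4.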
Moreover, note that with the square loss, the regularization paths are piecewise affine, as a direct consequence of regularizing by a polyhedral function.

Figure 4: Left: Piecewise linear regularization paths of proximal problems (Eq. (1)) for different functions of cardinality. From left to right: quadratic function (all level sets allowed), second example in Section 4 (two large level sets at the top and bottom), piecewise linear with two pieces (a single large level set in the middle). Right: Same plot for the one-dimensional total variation. Note that in all these particular cases the regularization paths for orthogonal designs are agglomerative (see Section 5), while for general designs, they would still be piecewise affine but not agglomerative.

Subgradient. From f(w) = max_{s∈B(F)} s^⊤w and the greedy algorithm² presented in Section 2, one can easily get in polynomial time one subgradient as one of the maximizers s. This allows the use of subgradient descent, with slow convergence compared to proximal methods (see Figure 5).

Proximal problems through sequences of submodular function minimizations (SFMs). Given regularized problems of the form min_{w∈R^p} L(w) + λf(w), where L is differentiable with Lipschitz-continuous gradient, proximal methods have been shown to be particularly efficient first-order methods (see, e.g., [16]). In this paper, we use the method "ISTA" and its accelerated variant "FISTA" [16].
To apply these methods, it suffices to be able to solve efficiently:

min_{w∈R^p} (1/2)‖w − z‖₂² + λf(w),   (1)

which we refer to as the proximal problem. It is known that solving the proximal problem is related to submodular function minimization (SFM). More precisely, the minimum of A ↦ λF(A) − z(A) may be obtained by selecting the negative components of the solution of a single proximal problem [12, 11]. Alternatively, the solution of the proximal problem may be obtained by a sequence of at most p submodular function minimizations of the form A ↦ λF(A) − z(A), by a decomposition algorithm adapted from [18], and described in [11].

Thus, computing the proximal operator has polynomial complexity since SFM has polynomial complexity. However, it may be too slow for practical purposes, as the best generic algorithm has complexity O(p⁶) [19]³. Nevertheless, this strategy is efficient for families of submodular functions for which dedicated fast algorithms exist:

– Cuts: Minimizing the cut or the partially minimized cut, plus a modular function, may be done with a min-cut/max-flow algorithm [see, e.g., 6, 5].
For proximal methods, we need in fact to solve an instance of a parametric max-flow problem, which may be done using other efficient dedicated algorithms [21, 5] than the decomposition algorithm derived from [18].

– Functions of cardinality: minimizing functions of the form A ↦ λF(A) − z(A) can be done in closed form by sorting the elements of z.

Proximal problems through the minimum-norm-point algorithm. In the generic case (i.e., beyond cuts and cardinality-based functions), we can follow [12, 3]: since f(w) is expressed as a maximum of linear functions, the problem reduces to the projection on the polytope B(F), for which we happen to be able to easily maximize linear functions (using the greedy algorithm described in Section 2). This can be tackled efficiently by the minimum-norm-point algorithm [12], which iterates between orthogonal projections on affine subspaces and the greedy algorithm for the submodular function⁴. We compare all optimization methods on synthetic examples in Figure 5.

²The greedy algorithm to find extreme points of the base polyhedron should not be confused with the greedy algorithm (e.g., forward selection) that is common in supervised learning/statistics.

³Note that even in the case of symmetric submodular functions, where more efficient algorithms in O(p³) for submodular function minimization (SFM) exist [20], the minimization of functions of the form λF(A) − z(A) is provably as hard as general SFM [20].

⁴Interestingly, when used for submodular function minimization (SFM), the minimum-norm-point algorithm has no complexity bound but is empirically faster than algorithms with such bounds [12].
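The cardinality bullet above can be made concrete: for F(A) = h(|A|), a set minimizing λF(A) − z(A) at fixed cardinality k simply collects the k largest entries of z, so a single sort plus a scan over k suffices. A brute-force check of this closed form (our own naming):

```python
from itertools import combinations

def sfm_bruteforce(h, z, lam):
    """min_A lam * h(|A|) - z(A), by enumerating all subsets."""
    p = len(z)
    best = lam * h(0)  # A = empty set
    for r in range(1, p + 1):
        for A in combinations(range(p), r):
            best = min(best, lam * h(r) - sum(z[j] for j in A))
    return best

def sfm_by_sorting(h, z, lam):
    """Closed form: at fixed |A| = k, z(A) is maximized by the k largest
    entries of z, so it is enough to sort once and scan over k."""
    zs = sorted(z, reverse=True)
    best, prefix = lam * h(0), 0.0
    for k in range(1, len(z) + 1):
        prefix += zs[k - 1]
        best = min(best, lam * h(k) - prefix)
    return best

p = 4
h = lambda k: k * (p - k)  # cardinality-based, h(0) = h(p) = 0
z = [3.0, -1.0, 2.0, -2.0]
```

The sorting argument does not even use submodularity, since all sets of the same cardinality have the same value of F.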
Figure 5: Left: Matlab running times of different optimization methods on 20 replications of a least-squares regression problem with p = 1000 for a cardinality-based submodular function (best seen in color). Proximal methods with the generic algorithm (using the minimum-norm-point algorithm) are faster than subgradient descent (with two schedules for the learning rate, 1/t or 1/√t). Using the dedicated algorithm (which is not available in all situations) is significantly faster. Right: Examples of graphs (top: chain graph, bottom: hidden chain graph, with sets W and V and examples of a set A in light red, and B in blue, see text for details).

Proximal path as agglomerative clustering. When λ varies from zero to +∞, the unique optimal solution of Eq. (1) goes from z to a constant. We now provide conditions under which the regularization path of the proximal problem may be obtained by agglomerative clustering (see examples in Figure 4):

Proposition 4 (Agglomerative clustering) Assume that for all sets A, B such that B ∩ A = ∅ and A is inseparable for D ↦ F(B ∪ D) − F(B), we have:

∀C ⊂ A, (|C|/|A|) [F(B ∪ A) − F(B)] ≤ F(B ∪ C) − F(B).   (2)

Then the regularization path for Eq. (1) is agglomerative, that is, if two variables are in the same constant set for a certain μ ∈ R_+, then they remain so for all larger λ ≥ μ.

As shown in [14], the assumptions required by Prop. 4 are satisfied by (a) all submodular set-functions that only depend on the cardinality, and (b) by the one-dimensional total variation; we thus recover and extend known results from [7, 22, 15].

Adding an ℓ₁-norm. Following [4], we may add the ℓ₁-norm ‖w‖₁ for additional sparsity of w (on top of shaping its level sets).
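In the tiny case p = 2 with f(w) = |w₁ − w₂|, the proximal operator of f has an elementary closed form (keep the mean, soft-threshold the difference), which lets one check numerically that composing it with soft-thresholding solves the ℓ₁-penalized proximal problem, as formalized in Prop. 5 below; the sketch and its names are ours:

```python
def soft_threshold(x, lam):
    return [max(abs(v) - lam, 0.0) * (1.0 if v > 0 else -1.0) for v in x]

def prox_tv2(z):
    """Prox of f(w) = |w1 - w2| for p = 2: the mean is preserved and the
    difference z1 - z2 is soft-thresholded at level 2 (elementary closed
    form obtained by a change of variables to mean and difference)."""
    mean = (z[0] + z[1]) / 2.0
    d = z[0] - z[1]
    d = max(abs(d) - 2.0, 0.0) * (1.0 if d > 0 else -1.0)
    return [mean + d / 2.0, mean - d / 2.0]

lam = 1.0
z = [3.0, 0.0]
composed = soft_threshold(prox_tv2(z), lam)  # prox of f, then prox of lam * l1

def objective(w):
    return (0.5 * sum((wi - zi) ** 2 for wi, zi in zip(w, z))
            + abs(w[0] - w[1]) + lam * (abs(w[0]) + abs(w[1])))

# brute-force check on a grid: no grid point beats the composed solution
grid = [0.05 * i - 1.0 for i in range(101)]  # covers [-1, 4]
best = min(objective([a, b]) for a in grid for b in grid)
```

Here the composed solution is (1, 0), and the grid search cannot improve on its objective value.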
The following proposition extends the result for the one-dimensional total variation [4, 23] to all submodular functions and their Lovász extensions:

Proposition 5 (Proximal problem for ℓ₁-penalized problems) The unique minimizer of (1/2)‖w − z‖₂² + f(w) + λ‖w‖₁ may be obtained by soft-thresholding the minimizer of (1/2)‖w − z‖₂² + f(w). That is, the proximal operator for f + λ‖·‖₁ is equal to the composition of the proximal operator for f and the one for λ‖·‖₁.

6 Sparsity-inducing Properties

Going from the penalization of supports to the penalization of level sets introduces some complexity; for simplicity, in this section we only consider the analysis in the context of orthogonal design matrices, which is often referred to as the denoising problem, and which already leads to interesting results for level set estimation. That is, we study the unique global minimum ŵ of the proximal problem in Eq. (1), make some assumption regarding z (typically z = w* + noise), and provide guarantees related to the recovery of the level sets of w*. We first start by characterizing the allowed level sets, showing that the partial constraints defined in Section 3 on faces of {f(w) ≤ 1} do not create by chance further groupings of variables (see proof in [14]).

Proposition 6 (Stable constant sets) Assume z ∈ R^p has an absolutely continuous density with respect to the Lebesgue measure. Then, with probability one, the unique minimizer ŵ of Eq. (1) has constant sets that define a partition corresponding to a lattice D defined in Prop.
3.

We now show that under certain conditions the recovered constant sets are the correct ones:

Theorem 1 (Level set recovery) Assume that z = w* + σε, where ε ∈ Rᵖ is a standard Gaussian random vector, and w* is consistent with the lattice D and its associated poset Π(D) = (A₁, ..., A_m), with values v*_j on A_j, for j ∈ {1, ..., m}. Denote B_j = A₁ ∪ ··· ∪ A_j for j ∈ {1, ..., m}. Assume that there exist constants η_j > 0 and ν > 0 such that:
$$\forall C_j \subset A_j,\ \ F(B_{j-1} \cup C_j) - F(B_{j-1}) - \tfrac{|C_j|}{|A_j|}\big[F(B_{j-1} \cup A_j) - F(B_{j-1})\big] \;\geq\; \eta_j \min\Big\{\tfrac{|C_j|}{|A_j|},\, 1 - \tfrac{|C_j|}{|A_j|}\Big\}, \qquad (3)$$
$$\forall i, j \in \{1, \dots, m\},\ \ A_i < A_j \;\Rightarrow\; v_i^* - v_j^* \geq \nu, \qquad (4)$$
$$\forall j \in \{1, \dots, m\},\ \ \lambda\,\Big|\tfrac{F(B_j) - F(B_{j-1})}{|A_j|}\Big| \leq \nu/4. \qquad (5)$$
Then the unique minimizer ŵ of Eq. (1) is associated with the same lattice D as w*, with probability greater than $1 - \sum_{j=1}^m \exp\big(-\tfrac{\nu^2 |A_j|}{32\sigma^2}\big) - 2\sum_{j=1}^m |A_j| \exp\big(-\tfrac{\lambda^2 \eta_j^2}{2\sigma^2 |A_j|^2}\big)$.

We now discuss the three main assumptions of Theorem 1 as well as the probability estimate:

– Eq. (3) is the equivalent of the support recovery condition for the Lasso [1] or its extensions [3]. The main difference is that for support recovery this assumption is always met for orthogonal designs, while here it is not always met. Interestingly, the validity of level set recovery implies the agglomerativity of proximal paths (Eq. (2) in Prop. 4).
Note that if Eq. (3) is satisfied only with η_j = 0 (it is then exactly Eq. (2) in Prop.
4), then, even with infinitesimal noise, one can show that in some cases the wrong level sets may be obtained with non-vanishing probability, while if η_j is strictly negative, one can show that in some cases we never get the correct level sets. Eq. (3) is thus essentially sufficient and necessary.

– Eq. (4) corresponds to having the distinct values of w* far enough from each other.

– Eq. (5) is a constraint on λ which controls the bias of the estimator: if λ is too large, then two clusters may be merged.

– In the probability estimate, the second term is small if all σ²|A_j|⁻¹ are small enough (i.e., given the noise, there is enough data to correctly estimate the values of the constant sets), and the third term is small if λ is large enough, so that clusters do not split.

One-dimensional total variation. In this situation, Eq. (3) is always satisfied with η_j = 0, but in some cases this cannot be improved (i.e., the best possible η_j is equal to zero). As shown in [14], this occurs as soon as there is a "staircase", i.e., a piecewise constant vector with a sequence of at least two consecutive increases, or two consecutive decreases. In the presence of such staircases, one cannot have consistent support recovery, which is a well-known issue in signal processing (typically, more steps are created). If there is no staircase effect, we have η_j = 1 and Eq. (5) becomes $\lambda \leq \tfrac{\nu}{8} \min_j |A_j|$. If we take λ equal to this limiting value, then we obtain a probability greater than $1 - 4p \exp\big(-\tfrac{\nu^2 \min_j |A_j|^2}{128 \sigma^2 \max_j |A_j|^2}\big)$. Note that we could also derive general results when an additional ℓ1-penalty is used, thus extending results from [24].
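The effect of the additional ℓ1-norm (Prop. 5) can be illustrated numerically. The sketch below is an illustration, not the paper's code: it takes p = 2 with f(w) = γ|w₁ − w₂|, the one-dimensional total variation on two variables, whose proximal operator has a simple closed form (the mean of z is preserved and half the difference is soft-thresholded); soft-thresholding its output matches a brute-force grid minimization of the full ℓ1-penalized proximal objective. The test point z and the grid resolution are arbitrary choices.

```python
import numpy as np

def soft(x, t):
    """Elementwise soft-thresholding, the proximal operator of t * ||.||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def prox_tv2(z, gamma):
    """Prox of gamma * |w1 - w2| for p = 2: the mean of z is kept,
    and half the difference is soft-thresholded at level gamma."""
    m, d = (z[0] + z[1]) / 2.0, (z[0] - z[1]) / 2.0
    t = soft(d, gamma)
    return np.array([m + t, m - t])

# Prop. 5: prox of f + lam * ||.||_1 = soft-thresholding composed with prox of f.
z, gamma, lam = np.array([1.3, -0.4]), 0.5, 0.3
w_composed = soft(prox_tv2(z, gamma), lam)

# Sanity check against a brute-force grid minimization of the full objective
# 0.5 ||w - z||^2 + gamma |w1 - w2| + lam ||w||_1.
grid = np.arange(-2.0, 2.0, 0.01)
W1, W2 = np.meshgrid(grid, grid)
obj = (0.5 * ((W1 - z[0]) ** 2 + (W2 - z[1]) ** 2)
       + gamma * np.abs(W1 - W2) + lam * (np.abs(W1) + np.abs(W2)))
i, j = np.unravel_index(np.argmin(obj), obj.shape)
w_grid = np.array([W1[i, j], W2[i, j]])
assert np.allclose(w_composed, w_grid, atol=0.02)  # agree up to grid resolution
```

By Prop. 5 the same composition holds for any submodular F; the grid search is only feasible (and only needed) at this tiny scale.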
Finally, similar (more) negative results may be obtained for the two-dimensional total variation [25, 14].

Clustering with F(A) = |A| · |V\A|. In this case, we have η_j = |A_j|/2, and Eq. (5) becomes λ ≤ ν/(4p), leading to a probability of correct support estimation greater than $1 - 4p \exp\big(-\tfrac{\nu^2}{128 p \sigma^2}\big)$. This indicates that the noise variance σ² should be small compared to 1/p, which is not satisfactory and would be corrected with the weighting schemes proposed in [15].

7 Conclusion

We have presented a family of sparsity-inducing norms dedicated to incorporating prior knowledge or structural constraints on the level sets of linear predictors. We have provided a set of common algorithms and theoretical results, as well as simulations on synthetic examples illustrating the behavior of these norms. Several avenues are worth investigating: following current practice in sparse methods, we could consider related adapted concave penalties to enhance sparsity-inducing capabilities, or extend some of the concepts to norms of matrices, with potential applications in matrix factorization [26] or multi-task learning [27].

Acknowledgements. This paper was partially supported by the Agence Nationale de la Recherche (MGA Project), the European Research Council (SIERRA Project) and Digiteo (BIOVIZ project).

References

[1] P. Zhao and B. Yu. On model selection consistency of Lasso. Journal of Machine Learning Research, 7:2541–2563, 2006.
[2] S. Negahban, P. Ravikumar, M. J. Wainwright, and B. Yu. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. In Adv. NIPS, 2009.
[3] F. Bach. Structured sparsity-inducing norms through submodular functions. In Adv. NIPS, 2010.
[4] R. Tibshirani, M. Saunders, S. Rosset, J. Zhu, and K. Knight. Sparsity and smoothness via the fused Lasso. J. Roy.
Stat. Soc. B, 67(1):91–108, 2005.
[5] A. Chambolle and J. Darbon. On total variation minimization and surface evolution using parametric maximum flows. International Journal of Computer Vision, 84(3):288–307, 2009.
[6] Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization via graph cuts. IEEE Trans. PAMI, 23(11):1222–1239, 2001.
[7] Z. Harchaoui and C. Lévy-Leduc. Catching change-points with Lasso. Adv. NIPS, 20, 2008.
[8] J.-P. Vert and K. Bleakley. Fast detection of multiple change-points shared by many signals using group LARS. Adv. NIPS, 23, 2010.
[9] M. Kolar, L. Song, and E. Xing. Sparsistent learning of varying-coefficient models with structural changes. Adv. NIPS, 22, 2009.
[10] H. D. Bondell and B. J. Reich. Simultaneous regression shrinkage, variable selection, and supervised clustering of predictors with OSCAR. Biometrics, 64(1):115–123, 2008.
[11] F. Bach. Convex analysis and optimization with submodular functions: a tutorial. Technical Report 00527714, HAL, 2010.
[12] S. Fujishige. Submodular Functions and Optimization. Elsevier, 2005.
[13] R. T. Rockafellar. Convex Analysis. Princeton University Press, 1997.
[14] F. Bach. Shaping level sets with submodular functions. Technical Report 00542949-v2, HAL, 2011.
[15] T. Hocking, A. Joulin, F. Bach, and J.-P. Vert. Clusterpath: an algorithm for clustering using convex fusion penalties. In Proc. ICML, 2011.
[16] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
[17] F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Optimization with sparsity-inducing penalties. Technical Report 00613125, HAL, 2011.
[18] H. Groenevelt. Two algorithms for maximizing a separable concave function over a polymatroid feasible region.
European Journal of Operational Research, 54(2):227–236, 1991.
[19] J. B. Orlin. A faster strongly polynomial time algorithm for submodular function minimization. Mathematical Programming, 118(2):237–251, 2009.
[20] M. Queyranne. Minimizing symmetric submodular functions. Mathematical Programming, 82(1):3–12, 1998.
[21] G. Gallo, M. D. Grigoriadis, and R. E. Tarjan. A fast parametric maximum flow algorithm and applications. SIAM Journal on Computing, 18(1):30–55, 1989.
[22] H. Hoefling. A path algorithm for the fused Lasso signal approximator. Technical Report 0910.0526v1, arXiv, 2009.
[23] J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online learning for matrix factorization and sparse coding. Journal of Machine Learning Research, 11:19–60, 2010.
[24] A. Rinaldo. Properties and refinements of the fused Lasso. Ann. Stat., 37(5):2922–2952, 2009.
[25] V. Duval, J.-F. Aujol, and Y. Gousseau. The TVL1 model: A geometric point of view. Multiscale Modeling and Simulation, 8(1):154–189, 2009.
[26] N. Srebro, J. D. M. Rennie, and T. S. Jaakkola. Maximum-margin matrix factorization. In Adv. NIPS 17, 2005.
[27] A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Machine Learning, 73(3):243–272, 2008.