{"title": "Network Flow Algorithms for Structured Sparsity", "book": "Advances in Neural Information Processing Systems", "page_first": 1558, "page_last": 1566, "abstract": "We consider a class of learning problems that involve a structured sparsity-inducing norm defined as the sum of $\\ell_\\infty$-norms over groups of variables. Whereas a lot of effort has been put in developing fast optimization methods when the groups are disjoint or embedded in a specific hierarchical structure, we address here the case of general overlapping groups. To this end, we show that the corresponding optimization problem is related to network flow optimization. More precisely, the proximal problem associated with the norm we consider is dual to a quadratic min-cost flow problem. We propose an efficient procedure which computes its solution exactly in polynomial time. Our algorithm scales up to millions of groups and variables, and opens up a whole new range of applications for structured sparse models. We present several experiments on image and video data, demonstrating the applicability and scalability of our approach for various problems.", "full_text": "Network Flow Algorithms for Structured Sparsity\n\nJulien Mairal\u2217\n\nINRIA - Willow Project-Team\u2020\njulien.mairal@inria.fr\n\nRodolphe Jenatton\u2217\n\nINRIA - Willow Project-Team\u2020\n\nrodolphe.jenatton@inria.fr\n\nGuillaume Obozinski\n\nINRIA - Willow Project-Team\u2020\n\nguillaume.obozinski@inria.fr\n\nFrancis Bach\n\nINRIA - Willow Project-Team\u2020\nfrancis.bach@inria.fr\n\nAbstract\n\nWe consider a class of learning problems that involve a structured sparsity-\ninducing norm de\ufb01ned as the sum of \u2113\u221e-norms over groups of variables. Whereas\na lot of effort has been put in developing fast optimization methods when the\ngroups are disjoint or embedded in a speci\ufb01c hierarchical structure, we address\nhere the case of general overlapping groups. 
To this end, we show that the corresponding optimization problem is related to network flow optimization. More precisely, the proximal problem associated with the norm we consider is dual to a quadratic min-cost flow problem. We propose an efficient procedure that computes its solution exactly in polynomial time. Our algorithm scales up to millions of variables, and opens up a whole new range of applications for structured sparse models. We present several experiments on image and video data, demonstrating the applicability and scalability of our approach for various problems.

1 Introduction

Sparse linear models have become a popular framework for dealing with various unsupervised and supervised tasks in machine learning and signal processing. In such models, linear combinations of small sets of variables are selected to describe the data. Regularization by the ℓ1-norm has emerged as a powerful tool for addressing this combinatorial variable selection problem, relying on both a well-developed theory (see [1] and references therein) and efficient algorithms [2, 3, 4].

The ℓ1-norm primarily encourages sparse solutions, regardless of the potential structural relationships (e.g., spatial, temporal or hierarchical) existing between the variables. Much effort has recently been devoted to designing sparsity-inducing regularizations capable of encoding higher-order information about allowed patterns of non-zero coefficients [5, 6, 7, 8, 9, 10], with successful applications in bioinformatics [6, 11], topic modeling [12] and computer vision [9, 10].

By considering sums of norms of appropriate subsets, or groups, of variables, these regularizations control the sparsity patterns of the solutions. The underlying optimization problem is usually difficult, in part because it involves nonsmooth components.
Proximal methods have proven to be effective in this context, essentially because of their fast convergence rates and their scalability [3, 4]. While the settings where the penalized groups of variables do not overlap or are embedded in a tree-shaped hierarchy [12] have already been studied, regularizations with general overlapping groups have, to the best of our knowledge, never been addressed with proximal methods.

This paper makes the following contributions:

− It shows that the proximal operator associated with the structured norm we consider can be computed with a fast and scalable procedure by solving a quadratic min-cost flow problem.

− It shows that the dual norm of the sparsity-inducing norm we consider can also be evaluated efficiently, which enables us to compute duality gaps for the corresponding optimization problems.

− It demonstrates that our method is relevant for various applications, from video background subtraction to estimation of hierarchical structures for dictionary learning of natural image patches.

∗Contributed equally.
†Laboratoire d'Informatique de l'Ecole Normale Supérieure (INRIA/ENS/CNRS UMR 8548)

2 Structured Sparse Models

We consider in this paper convex optimization problems of the form

$$\min_{w \in \mathbb{R}^p} f(w) + \lambda\Omega(w), \qquad (1)$$

where f : R^p → R is a convex differentiable function and Ω : R^p → R is a convex, nonsmooth, sparsity-inducing regularization function.
When one knows a priori that the solutions of this learning problem have only a few non-zero coefficients, Ω is often chosen to be the ℓ1-norm (see [1, 2]). When these coefficients are organized in groups, a penalty encoding this prior knowledge explicitly can improve the prediction performance and/or interpretability of the learned models [13, 14]. Denoting by G a set of groups of indices, such a penalty might for example take the form:

$$\Omega(w) \triangleq \sum_{g \in G} \eta_g \max_{j \in g} |w_j| = \sum_{g \in G} \eta_g \|w_g\|_\infty, \qquad (2)$$

where w_j is the j-th entry of w for j in [1; p] ≜ {1, . . . , p}, the vector w_g in R^{|g|} records the coefficients of w indexed by g in G, and the scalars η_g are positive weights. A sum of ℓ2-norms is also used in the literature [7], but the ℓ∞-norm is piecewise linear, a property that we take advantage of in this paper. Note that when G is the set of singletons of [1; p], we recover the ℓ1-norm.

If G is a more general partition of [1; p], variables are selected in groups rather than individually. When the groups overlap, Ω is still a norm and sets groups of variables to zero together [5]. The latter setting was first considered for hierarchies [7, 11, 15], and then extended to general group structures [5].1 Solving Eq. (1) in this context becomes challenging and is the topic of this paper. Following Jenatton et al. [12], who tackled the case of hierarchical groups, we propose to approach this problem with proximal methods, which we now introduce.

2.1 Proximal Methods

In a nutshell, proximal methods can be seen as a natural extension of gradient-based techniques, and they are well suited to minimizing the sum f + λΩ of two convex terms, a smooth function f (continuously differentiable with Lipschitz-continuous gradient) and a potentially non-smooth function λΩ (see [16] and references therein).
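As an aside, the penalty of Eq. (2) itself is straightforward to evaluate; here is a minimal Python sketch (our own illustration, not from the paper), with groups given as lists of 0-based indices and unit weights η_g by default:

```python
# Sketch: evaluating the structured penalty of Eq. (2),
# Omega(w) = sum_g eta_g * max_{j in g} |w_j|, for (possibly overlapping) groups
# given as lists of 0-based indices. Unit weights eta_g = 1 by default.

def omega(w, groups, eta=None):
    if eta is None:
        eta = [1.0] * len(groups)
    return sum(e * max(abs(w[j]) for j in g) for e, g in zip(eta, groups))

w = [0.5, -0.2, 0.1]
assert abs(omega(w, [[0], [1], [2]]) - 0.8) < 1e-12   # singletons: the l1-norm
assert omega(w, [[0, 1, 2]]) == 0.5                   # one global group: the l_inf-norm
```

The two assertions illustrate the two extreme cases mentioned above: singleton groups recover the ℓ1-norm, and a single global group gives the ℓ∞-norm.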
At each iteration, the function f is linearized at the current estimate w0 and the so-called proximal problem has to be solved:

$$\min_{w \in \mathbb{R}^p} f(w_0) + (w - w_0)^\top \nabla f(w_0) + \lambda\Omega(w) + \frac{L}{2}\|w - w_0\|_2^2.$$

The quadratic term keeps the solution in a neighborhood where the current linear approximation holds, and L > 0 is an upper bound on the Lipschitz constant of ∇f. This problem can be rewritten as

$$\min_{w \in \mathbb{R}^p} \frac{1}{2}\|u - w\|_2^2 + \lambda'\Omega(w), \qquad (3)$$

with λ′ ≜ λ/L, and u ≜ w0 − (1/L)∇f(w0). We call proximal operator associated with the regularization λ′Ω the function that maps a vector u in R^p onto the (unique, by strong convexity) solution w⋆ of Eq. (3). Simple proximal methods use w⋆ as the next iterate, but accelerated variants [3, 4] are also based on the proximal operator and require solving problem (3) exactly and efficiently to enjoy their fast convergence rates. Note that when Ω is the ℓ1-norm, the solution of Eq. (3) is obtained by soft-thresholding [16]. The approach we develop in the rest of this paper extends [12] to the case of general overlapping groups when Ω is a weighted sum of ℓ∞-norms, broadening the application of these regularizations to a wider spectrum of problems.2

1Note that other types of structured sparse models have also been introduced, either through a different norm [6], or through non-convex criteria [8, 9, 10].
2For hierarchies, the approach of [12] also applies to the case where Ω is a weighted sum of ℓ2-norms.

3 A Quadratic Min-Cost Flow Formulation

In this section, we show that a convex dual of problem (3) for general overlapping groups G can be reformulated as a quadratic min-cost flow problem. We present an efficient algorithm to solve it exactly, as well as a related algorithm to compute the dual norm of Ω.
We start by considering the dual formulation to problem (3) introduced in [12], for the case where Ω is a sum of ℓ∞-norms:

Lemma 1 (Dual of the proximal problem [12]) Given u in R^p, consider the problem

$$\min_{\xi \in \mathbb{R}^{p \times |G|}} \frac{1}{2}\Big\|u - \sum_{g \in G} \xi^g\Big\|_2^2 \quad \text{s.t.} \quad \forall g \in G,\ \|\xi^g\|_1 \le \lambda\eta_g \ \text{ and } \ \xi^g_j = 0 \ \text{if}\ j \notin g, \qquad (4)$$

where ξ = (ξ^g)_{g∈G} is in R^{p×|G|}, and ξ^g_j denotes the j-th coordinate of the vector ξ^g. Then, every solution ξ⋆ = (ξ⋆g)_{g∈G} of Eq. (4) satisfies w⋆ = u − Σ_{g∈G} ξ⋆g, where w⋆ is the solution of Eq. (3).

Without loss of generality,3 we assume from now on that the scalars u_j are all non-negative, and we constrain the entries of ξ to be non-negative. We now introduce a graph modeling of problem (4).

3.1 Graph Model

Let G be a directed graph G = (V, E, s, t), where V is a set of vertices, E ⊆ V × V a set of arcs, s a source, and t a sink. Let c and c′ be two functions on the arcs, c : E → R and c′ : E → R+, where c is a cost function and c′ is a non-negative capacity function. A flow is a non-negative function on arcs that satisfies capacity constraints on all arcs (the value of the flow on an arc is less than or equal to the arc capacity) and conservation constraints on all vertices (the sum of incoming flows at a vertex is equal to the sum of outgoing flows) except for the source and the sink.

We introduce a canonical graph G associated with our optimization problem, and uniquely characterized by the following construction:
(i) V is the union of two sets of vertices V_u and V_gr, where V_u contains exactly one vertex for each index j in [1; p], and V_gr contains exactly one vertex for each group g in G. We thus have |V| = |G| + p.
For simplicity, we identify groups and indices with the vertices of the graph.
(ii) For every group g in G, E contains an arc (s, g). These arcs have capacity λη_g and zero cost.
(iii) For every group g in G, and every index j in g, E contains an arc (g, j) with zero cost and infinite capacity. We denote by ξ^g_j the flow on this arc.
(iv) For every index j in [1; p], E contains an arc (j, t) with infinite capacity and a cost c_j ≜ ½(u_j − ξ̄_j)², where ξ̄_j is the flow on (j, t). Note that by flow conservation, we necessarily have ξ̄_j = Σ_{g∈G} ξ^g_j.

Examples of canonical graphs are given in Figures 1(a)-(c). The flows ξ^g_j associated with G can now be identified with the variables of problem (4): indeed, the sum of the costs on the edges leading to the sink is equal to the objective function of (4), while the capacities of the arcs (s, g) match the constraints on each group. This shows that finding a flow minimizing the sum of the costs on such a graph is equivalent to solving problem (4).

When some groups are included in others, the canonical graph can be simplified to yield a graph with a smaller number of edges. Specifically, if h and g are groups with h ⊂ g, the edges (g, j) for j ∈ h carrying a flow ξ^g_j can be removed and replaced by a single edge (g, h) of infinite capacity and zero cost, carrying the flow Σ_{j∈h} ξ^g_j. This simplification is illustrated in Figure 1(d), with a graph equivalent to the one of Figure 1(c). This does not change the optimal value of ξ̄⋆, which is the quantity of interest for computing the optimal primal variable w⋆ (a proof and a formal definition of these equivalent graphs are available in a longer technical report [17]).
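The construction (i)-(iv) is simple to write down programmatically; the following Python sketch is our own illustration (node labels, 0-based indices and the dict-of-arcs layout are our choices; costs are left implicit since only the sink arcs carry them):

```python
# Sketch: building the canonical graph of construction (i)-(iv) as a dict of arc
# capacities (costs are implicit: only the sink arcs (j, t) carry the quadratic
# costs c_j). Node labels, 0-based indices and the dict layout are our own choices.

INF = float('inf')

def canonical_graph(u, groups, lam, eta=None):
    if eta is None:
        eta = [1.0] * len(groups)          # unit weights, as in the paper's experiments
    arcs = {}                              # (tail, head) -> capacity
    for gi, g in enumerate(groups):
        arcs[('s', ('grp', gi))] = lam * eta[gi]   # (ii): capacity lam * eta_g
        for j in g:
            arcs[(('grp', gi), ('idx', j))] = INF  # (iii): infinite capacity
    for j in range(len(u)):
        arcs[(('idx', j), 't')] = INF              # (iv): cost (u_j - xi_bar_j)^2 / 2
    return arcs

# The structure of Figure 1(b), with 0-based indices: g = {0, 1}, h = {1, 2}.
arcs = canonical_graph([0.3, 0.5, 0.2], [[0, 1], [1, 2]], lam=0.1)
assert arcs[('s', ('grp', 0))] == 0.1
assert len(arcs) == 9   # 2 source arcs + 4 group-to-index arcs + 3 sink arcs
```

The arc count makes the size of the canonical graph explicit: one source arc per group, one arc per group membership, and one sink arc per index.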
These simplifications are useful in practice, since they reduce the number of edges in the graph and improve the speed of the algorithms we are now going to present.

3Let ξ⋆ denote a solution of Eq. (4). Optimality conditions of Eq. (4) derived in [12] show that for all j in [1; p], the signs of the non-zero coefficients ξ⋆g_j for g in G are the same as the signs of the entries u_j. To solve Eq. (4), one can therefore flip the signs of the negative variables u_j, then solve the modified dual formulation (with non-negative variables), which gives the magnitude of the entries ξ⋆g_j (the signs of these being known).

[Figure 1 diagrams omitted: four canonical graphs, each with the source s at the top, group nodes g, h in the middle, index nodes u_1, u_2, u_3 linked to the sink t by arcs carrying flows ξ̄_j with costs c_j, and source arcs constrained by Σ_j ξ^g_j ≤ λη_g. The group structures are: (a) G = {g = {1, 2, 3}}; (b) G = {g = {1, 2}, h = {2, 3}}; (c) G = {g = {1, 2, 3}, h = {2, 3}}; (d) G = {g = {1} ∪ h, h = {2, 3}}.]

Figure 1: Graph representation of simple proximal problems with different group structures G.
The three indices 1, 2, 3 are represented as grey squares, and the groups g, h in G as red discs. The source is linked to every group g, h with respective maximum capacity λη_g, λη_h and zero cost. Each variable u_j is linked to the sink t, with an infinite capacity, and with a cost c_j ≜ ½(u_j − ξ̄_j)². All other arcs in the graph have zero cost and infinite capacity. They represent inclusion relationships in-between groups, and between groups and variables. The graphs (c) and (d) correspond to a special case of tree-structured hierarchy in the sense of [12]. Their min-cost flow problems are equivalent.

3.2 Computation of the Proximal Operator

Quadratic min-cost flow problems have been well studied in the operations research literature [18]. One of the simplest cases, where G contains a single group g (Ω is the ℓ∞-norm) as in Figure 1(a), can be solved by an orthogonal projection on the ℓ1-ball of radius λη_g. It has been shown that such a projection can be done in O(p) operations [18, 19]. When the group structure is a tree as in Figure 1(d), the problem can be solved in O(pd) operations, where d is the depth of the tree [12, 18].4

The general case of overlapping groups is more difficult. Hochbaum and Hong have shown in [18] that quadratic min-cost flow problems can be reduced to a specific parametric max-flow problem, for which an efficient algorithm exists [20].5 While this generic approach could be used to solve Eq. (4), we propose to use Algorithm 1, which also exploits the fact that our graphs have non-zero costs only on edges leading to the sink. As shown in the technical report [17], it has a significantly better performance in practice.
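As a concrete building block, the single-group case (the orthogonal projection onto the ℓ1-ball) can be sketched as follows; this is our own illustration using an O(p log p) sort, whereas the algorithms of [18, 19] achieve O(p) with pivoting:

```python
# Sketch: the single-group proximal operator. prox of lam*||.||_inf is the residual
# of the Euclidean projection onto the l1-ball of radius lam (Figure 1(a) case).
# O(p log p) sort-based version; [18, 19] do this in O(p) with pivoting.

def project_l1_ball(u, radius):
    """Euclidean projection of u onto {x : ||x||_1 <= radius}."""
    if sum(abs(x) for x in u) <= radius:
        return list(u)
    srt = sorted((abs(x) for x in u), reverse=True)
    cumsum = theta = 0.0
    for k, v in enumerate(srt, start=1):
        cumsum += v
        t = (cumsum - radius) / k
        if t >= v:            # entries from here on are thresholded to zero
            break
        theta = t
    return [max(abs(x) - theta, 0.0) * (1.0 if x >= 0 else -1.0) for x in u]

def prox_linf(u, lam):
    return [x - p for x, p in zip(u, project_l1_ball(u, lam))]

# The prox ties the largest entries together instead of shrinking them separately:
assert all(abs(x - 0.3) < 1e-9 for x in prox_linf([0.5, 0.4], 0.3))
```

The final assertion illustrates the qualitative behavior of the ℓ∞ proximal operator: rather than soft-thresholding each coordinate, it clips the large entries to a common value.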
This algorithm clearly shares some similarities with existing approaches in network flow optimization, such as the simplified version of [20] presented in [21] that uses a divide and conquer strategy. Moreover, we discovered after this paper was accepted for publication that an equivalent algorithm exists for minimizing convex functions over polymatroid sets [22]. This equivalence, however, requires a non-trivial representation of structured sparsity-inducing norms with submodular functions, as recently pointed out by [23].

4When restricted to the case where Ω is a sum of ℓ∞-norms, the approach of [12] is in fact similar to [18].
5By definition, a parametric max-flow problem consists in solving, for every value of a parameter, a max-flow problem on a graph whose arc capacities depend on this parameter.

Algorithm 1 Computation of the proximal operator for overlapping groups.
1: Inputs: u ∈ R^p, a set of groups G, positive weights (η_g)_{g∈G}, and λ (regularization parameter).
2: Build the initial graph G_0 = (V_0, E_0, s, t) as explained in Section 3.2.
3: Compute the optimal flow: ξ̄ ← computeFlow(V_0, E_0).
4: Return: w = u − ξ̄ (optimal solution of the proximal problem).

Function computeFlow(V = V_u ∪ V_gr, E)
1: Projection step: γ ← arg min_γ Σ_{j∈V_u} ½(u_j − γ_j)² s.t. Σ_{j∈V_u} γ_j ≤ λ Σ_{g∈V_gr} η_g.
2: For all nodes j in V_u, set γ_j to be the capacity of the arc (j, t).
3: Max-flow step: Update (ξ̄_j)_{j∈V_u} by computing a max-flow on the graph (V, E, s, t).
4: if ∃ j ∈ V_u s.t. ξ̄_j ≠ γ_j then
5:   Denote by (s, V⁺) and (V⁻, t) the two disjoint subsets of (V, s, t) separated by the minimum (s, t)-cut of the graph, and remove the arcs between V⁺ and V⁻. Call E⁺ and E⁻ the two remaining disjoint subsets of E corresponding to V⁺ and V⁻.
6:   (ξ̄_j)_{j∈V⁺_u} ← computeFlow(V⁺, E⁺).
7:   (ξ̄_j)_{j∈V⁻_u} ← computeFlow(V⁻, E⁻).
8: end if
9: Return: (ξ̄_j)_{j∈V_u}.

The intuition behind this algorithm is the following: the first step looks for a candidate value for ξ̄ = Σ_{g∈G} ξ^g by solving a relaxed version of problem (4), where the constraints ‖ξ^g‖₁ ≤ λη_g are dropped and replaced by the single constraint ‖ξ̄‖₁ ≤ λ Σ_{g∈G} η_g. The relaxed problem only depends on ξ̄ and can be solved in linear time. Calling its solution γ, it provides a lower bound ‖u − γ‖₂²/2 on the optimal cost. Then, the second step tries to find a feasible flow of the original problem (4) such that the resulting vector ξ̄ matches γ, which is in fact a max-flow problem [24]. If ξ̄ = γ, then the cost of the flow reaches the lower bound, and the flow is optimal. If ξ̄ ≠ γ, the lower bound is not achievable, and we construct a minimum (s, t)-cut of the graph [25] that defines two disjoint sets of nodes V⁺ and V⁻; V⁺ is the part of the graph that could potentially have received more flow from the source (the arcs between s and V⁺ are not saturated), whereas all arcs linking s to V⁻ are saturated. At this point, it is possible to show that the value of the optimal min-cost flow on all arcs between V⁺ and V⁻ is necessarily zero. Thus, removing them yields an equivalent optimization problem, which can be decomposed into two independent problems of smaller sizes and solved recursively by the calls to computeFlow(V⁺, E⁺) and computeFlow(V⁻, E⁻).
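For intuition, the recursion just described can be sketched end-to-end in Python. This is our own illustrative implementation, not the paper's: it uses a plain Edmonds-Karp max-flow instead of push-relabel [24], an O(p log p) sort-based projection instead of the linear-time one, unit weights η_g = 1, and assumes non-negative inputs u (the general case reduces to this by sign flipping, cf. footnote 3):

```python
# Sketch of Algorithm 1 (computeFlow) for non-negative u and unit weights eta_g = 1.
from collections import deque

INF = float('inf')

def _project(vals, budget):
    # gamma = argmin 0.5 * sum (v_j - g_j)^2  s.t.  g >= 0, sum g_j <= budget.
    if budget <= 0:
        return [0.0] * len(vals)
    if sum(vals) <= budget:
        return list(vals)
    srt = sorted(vals, reverse=True)
    cumsum = theta = 0.0
    for k, v in enumerate(srt, start=1):
        cumsum += v
        t = (cumsum - budget) / k
        if t >= v:
            break
        theta = t
    return [max(v - theta, 0.0) for v in vals]

def _maxflow(cap, s, t):
    # Edmonds-Karp on dict-of-dicts capacities; returns (flow, source side of min-cut).
    flow = {x: dict.fromkeys(cap[x], 0.0) for x in cap}
    while True:
        prev = {s: s}
        q = deque([s])
        while q and t not in prev:
            x = q.popleft()
            for y in cap[x]:
                if y not in prev and cap[x][y] - flow[x][y] > 1e-12:
                    prev[y] = x
                    q.append(y)
        if t not in prev:
            return flow, set(prev)         # BFS-reachable nodes = source side of the cut
        b, y = INF, t
        while y != s:                      # bottleneck along the augmenting path
            b = min(b, cap[prev[y]][y] - flow[prev[y]][y]); y = prev[y]
        y = t
        while y != s:                      # push b units along the path
            x = prev[y]; flow[x][y] += b; flow[y][x] -= b; y = x

def _compute_flow(u, idx, grp, groups, lam, xi):
    if not idx:
        return
    if not grp:                            # no group on this side: nothing is penalized
        for j in idx:
            xi[j] = 0.0
        return
    idx_set = set(idx)
    # Step 1 (projection): relaxed problem with the single constraint sum gamma_j <= lam*|grp|.
    gamma = _project([u[j] for j in idx], lam * len(grp))
    # Steps 2-3 (max-flow): capacities gamma_j on the arcs (j, t).
    cap = {}
    def add(x, y, c):
        cap.setdefault(x, {})[y] = c
        cap.setdefault(y, {}).setdefault(x, 0.0)
    for g in grp:
        add('s', ('g', g), lam)
        for j in groups[g]:
            if j in idx_set:
                add(('g', g), ('j', j), INF)
    for j, gj in zip(idx, gamma):
        add(('j', j), 't', gj)
    flow, side = _maxflow(cap, 's', 't')
    if all(abs(flow[('j', j)]['t'] - gj) <= 1e-9 for j, gj in zip(idx, gamma)):
        for j, gj in zip(idx, gamma):      # lower bound attained: gamma is optimal here
            xi[j] = gj
        return
    # Steps 5-7: split along the min-cut and recurse on both sides.
    _compute_flow(u, [j for j in idx if ('j', j) in side],
                  [g for g in grp if ('g', g) in side], groups, lam, xi)
    _compute_flow(u, [j for j in idx if ('j', j) not in side],
                  [g for g in grp if ('g', g) not in side], groups, lam, xi)

def prox_overlap_linf(u, groups, lam):
    """prox of w -> lam * sum_g ||w_g||_inf at u (u assumed entrywise non-negative)."""
    xi = [0.0] * len(u)
    _compute_flow(u, list(range(len(u))), list(range(len(groups))), groups, lam, xi)
    return [a - b for a, b in zip(u, xi)]

# Disjoint singleton groups reduce to soft-thresholding; one global group ties the
# largest entries together (prox of the l_inf-norm):
w = prox_overlap_linf([0.5, 0.1], [[0], [1]], 0.2)
assert abs(w[0] - 0.3) < 1e-9 and abs(w[1]) < 1e-9
w = prox_overlap_linf([0.5, 0.4], [[0, 1]], 0.3)
assert abs(w[0] - 0.3) < 1e-9 and abs(w[1] - 0.3) < 1e-9
```

The first test case exercises the recursion: the relaxed projection over-allocates to the large coordinate, the max-flow fails to match it, and the min-cut splits the problem into the two singleton subproblems.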
A formal proof of correctness of Algorithm 1 and further details are relegated to [17].

The approach of [18, 20] is guaranteed to have the same worst-case complexity as a single max-flow algorithm. However, we have experimentally observed a significant discrepancy between the worst-case and empirical complexities for these flow problems, essentially because the empirical cost of each max-flow is significantly smaller than its theoretical cost. Despite the fact that the worst-case guarantee of our algorithm is weaker than theirs (up to a factor |V|), it is more adapted to the structure of our graphs and has proven to be much faster in our experiments (see the technical report [17]).

Some implementation details are crucial to the efficiency of the algorithm:

• Exploiting connected components: When there exists no arc between two subsets of V, it is possible to process them independently in order to solve the global min-cost flow problem.

• Efficient max-flow algorithm: We have implemented the "push-relabel" algorithm of [24] for solving our max-flow problems, using classical heuristics that significantly speed it up in practice (see [24, 26]). This algorithm leverages the concept of pre-flow, which relaxes the definition of flow and allows vertices to have a positive excess. It can be initialized with any valid pre-flow, enabling warm-restarts when the max-flow is called several times, as in our algorithm.

• Improved projection step: The first line of the function computeFlow can be replaced by γ ← arg min_γ Σ_{j∈V_u} ½(u_j − γ_j)² s.t. Σ_{j∈V_u} γ_j ≤ λ Σ_{g∈V_gr} η_g and |γ_j| ≤ λ Σ_{g∋j} η_g. The idea is that the structure of the graph will not allow ξ̄_j to be greater than λ Σ_{g∋j} η_g after the max-flow step. Adding these additional constraints leads to better performance when the graph is not well balanced. This modified projection step can still be computed in linear time [19].

3.3 Computation of the Dual Norm

The dual norm Ω* of Ω, defined for any vector κ in R^p by Ω*(κ) ≜ max_{Ω(z)≤1} z⊤κ, is a key quantity to study sparsity-inducing regularizations [5, 15, 27]. We use it here to monitor the convergence of the proximal method through a duality gap, and to define a proper optimality criterion for problem (1). We denote by f* the Fenchel conjugate of f [28], defined by f*(κ) ≜ sup_z [z⊤κ − f(z)]. The duality gap for problem (1) can be derived from standard Fenchel duality arguments [28]; it is equal to f(w) + λΩ(w) + f*(−κ) for w, κ in R^p with Ω*(κ) ≤ λ. Therefore, evaluating the duality gap requires computing Ω* efficiently in order to find a feasible dual variable κ. This is equivalent to solving another network flow problem, based on the following variational formulation:

$$\Omega^*(\kappa) = \min_{\tau,\ \xi \in \mathbb{R}^{p \times |G|}} \tau \quad \text{s.t.} \quad \sum_{g \in G} \xi^g = \kappa, \ \text{and} \ \forall g \in G,\ \|\xi^g\|_1 \le \tau\eta_g \ \text{with}\ \xi^g_j = 0 \ \text{if}\ j \notin g. \qquad (5)$$

In the network problem associated with (5), the capacities on the arcs (s, g), g ∈ G, are set to τη_g, and the capacities on the arcs (j, t), j in [1; p], are fixed to κ_j. Solving problem (5) amounts to finding the smallest value of τ such that there exists a flow saturating the capacities κ_j on the arcs leading to the sink t (i.e., ξ̄ = κ).
The algorithm below is proven to be correct in [17].

Algorithm 2 Computation of the dual norm.
1: Inputs: κ ∈ R^p, a set of groups G, positive weights (η_g)_{g∈G}.
2: Build the initial graph G_0 = (V_0, E_0, s, t) as explained in Section 3.3.
3: τ ← dualNorm(V_0, E_0).
4: Return: τ (value of the dual norm).

Function dualNorm(V = V_u ∪ V_gr, E)
1: τ ← (Σ_{j∈V_u} κ_j)/(Σ_{g∈V_gr} η_g) and set the capacities of arcs (s, g) to τη_g for all g in V_gr.
2: Max-flow step: Update (ξ̄_j)_{j∈V_u} by computing a max-flow on the graph (V, E, s, t).
3: if ∃ j ∈ V_u s.t. ξ̄_j ≠ κ_j then
4:   Define (V⁺, E⁺) and (V⁻, E⁻) as in Algorithm 1, and set τ ← dualNorm(V⁻, E⁻).
5: end if
6: Return: τ.

4 Applications and Experiments

Our experiments use the algorithm of [4] based on our proximal operator, with weights η_g set to 1.

4.1 Speed Comparison

We compare our method (ProxFlow) with two generic optimization techniques, namely a subgradient descent (SG) and an interior point method,6 on a regularized linear regression problem. Both SG and ProxFlow are implemented in C++. Experiments are run on a single-core 2.8 GHz CPU. We consider a design matrix X in R^{n×p} built from overcomplete dictionaries of discrete cosine transforms (DCT), which are naturally organized on one- or two-dimensional grids and display local correlations. The following families of groups G using this spatial information are thus considered: (1) every contiguous sequence of length 3 for the one-dimensional case, and (2) every 3×3-square in the two-dimensional setting. We generate vectors y in R^n according to the linear model y = Xw0 + ε, where ε ∼ N(0, 0.01‖Xw0‖₂²).
The vector w0 has about 20% nonzero components, randomly selected while respecting the structure of G, and generated uniformly in [−1, 1].

In our experiments, the regularization parameter λ is chosen to achieve the same sparsity as w0. For SG, we take the step size to be equal to a/(k + b), where k is the iteration number, and (a, b) are the best parameters selected in {10^-3, . . . , 10} × {10^2, 10^3, 10^4}. For the interior point methods, since problem (1) can be cast either as a quadratic (QP) or as a conic program (CP), we show in Figure 2 the results for both formulations. Our approach compares favorably with the other methods on three problems of different sizes, (n, p) ∈ {(100, 10^3), (1024, 10^4), (1024, 10^5)}; see Figure 2. In addition, note that QP, CP and SG do not obtain sparse solutions, whereas ProxFlow does. We have also run ProxFlow and SG on a larger dataset with (n, p) = (100, 10^6): after 12 hours, ProxFlow and SG have reached a relative duality gap of 0.0006 and 0.02 respectively.7

6In our simulations, we use the commercial software Mosek, http://www.mosek.com/.
7Due to the computational burden, QP and CP could not be run on every problem.

[Figure 2 plots omitted: three log-log panels, "n=100, p=1000, one-dimensional DCT", "n=1024, p=10000, two-dimensional DCT" and "n=1024, p=100000, one-dimensional DCT", each showing log(relative distance to optimum) versus log(CPU time) in seconds for CP, QP, ProxFlow and SG.]

Figure 2: Speed comparisons: distance to the optimal primal value versus CPU time (log-log scale).6

Figure 3: From left to right: original image y; estimated background Xw; foreground (the sparsity pattern of e used as mask on y) estimated with ℓ1; foreground estimated with ℓ1 + Ω; another foreground obtained with Ω, on a different image, with the same values of λ1, λ2 as for the previous image. For the top row, the percentage of pixels matching the ground truth is 98.8% with Ω, 87.0% without. As for the bottom row, the result is 93.8% with Ω, 90.4% without (best seen in color).

4.2 Background Subtraction

Following [9, 10], we consider a background subtraction task. Given a sequence of frames from a fixed camera, we try to segment out foreground objects in a new image. If we denote by y ∈ R^n a test image, we model y as a sparse linear combination of p other images X ∈ R^{n×p}, plus an error term e in R^n, i.e., y ≈ Xw + e for some sparse vector w in R^p. This approach is reminiscent of [29] in the context of face recognition, where e is further made sparse to deal with occlusions. The term Xw accounts for background parts present in both y and X, while e contains specific, or foreground, objects in y. The resulting optimization problem is

$$\min_{w,e} \frac{1}{2}\|y - Xw - e\|_2^2 + \lambda_1\|w\|_1 + \lambda_2\|e\|_1, \quad \text{with } \lambda_1, \lambda_2 \ge 0.$$

In this formulation, the ℓ1-norm penalty on e does not take into account the fact that neighboring pixels in y are likely to share the same label (background or foreground), which may lead to scattered pieces of foreground and background regions (Figure 3). We therefore add a structured regularization term Ω on e, where the groups in G are all the overlapping 3×3-squares on the image.
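These overlapping square groups are cheap to enumerate; a small Python sketch (our own illustration: a single channel with row-major pixel indexing; how the paper groups the three RGB channels is not detailed here):

```python
# Sketch: enumerating the group structure used here -- all overlapping 3x3 squares
# of an h-by-w pixel grid, pixels indexed row-major. A single channel is shown;
# the grouping of the three RGB channels is our simplification, not the paper's.

def square_groups(h, w, k=3):
    return [[(r + dr) * w + (c + dc) for dr in range(k) for dc in range(k)]
            for r in range(h - k + 1) for c in range(w - k + 1)]

gs = square_groups(4, 5)
assert len(gs) == 6   # (h - k + 1) * (w - k + 1) overlapping squares
assert gs[0] == [0, 1, 2, 5, 6, 7, 10, 11, 12]
```

For the 120×160 frames used below, this gives (120 − 2) × (160 − 2) = 18644 overlapping groups per channel, which is well within the scale the proximal algorithm handles.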
A dataset with hand-segmented evaluation images\nis used to illustrate the effect of \u2126.8 For simplicity, we use a single regularization parameter, i.e.,\n\u03bb1 = \u03bb2, chosen to maximize the number of pixels matching the ground truth. We consider p = 200\nimages with n = 57600 pixels (i.e., a resolution of 120\u00d7160, times 3 for the RGB channels). As\nshown in Figure 3, adding \u2126 improves the background subtraction results for the two tested videos,\nby encoding, unlike the \u21131-norm, both spatial and color consistency.\n\n2 ky \u2212 Xw \u2212 ek2\n\n1\n\n4.3 Multi-Task Learning of Hierarchical Structures\n\nIn [12], Jenatton et al. have recently proposed to use a hierarchical structured norm to learn dictio-\nnaries of natural image patches. Following this work, we seek to represent n signals {y1, . . . , yn}\nof dimension m as sparse linear combinations of elements from a dictionary X = [x1, . . . , xp]\nin Rm\u00d7p. This can be expressed for all i in [1; n] as yi \u2248 Xwi, for some sparse vector wi in Rp.\nIn [12], the dictionary elements are embedded in a prede\ufb01ned tree T , via a particular instance of the\nstructured norm \u2126; we refer to it as \u2126tree, and call G the underlying set of groups. In this case, each\nsignal yi admits a sparse decomposition in the form of a subtree of dictionary elements.\n\n8\n\nhttp://research.microsoft.com/en-us/um/people/jckrumm/wallflower/testimages.htm\n\n7\n\n\fInspired by ideas from multi-task learning [14], we propose to learn the tree structure T by pruning\nirrelevant parts of a larger initial tree T0. We achieve this by using an additional regularization\nterm \u2126joint across the different decompositions, so that subtrees of T0 will simultaneously be removed\nfor all signals yi. In other words, the approach of [12] is extended by the following formulation:\n\nmin\nX,W\n\n1\nn\n\nn\n\nX\n\ni=1\n\nh 1\n\n2\n\nkyi \u2212 Xwik2\n\n2 + \u03bb1\u2126tree(wi)i+\u03bb2\u2126joint(W), s.t. 
kxjk2 \u2264 1, for all j in [1; p], (6)\n\nwhere W , [w1, . . . , wn] is the matrix of decomposition coef\ufb01cients in Rp\u00d7n. The new regular-\nization term operates on the rows of W and is de\ufb01ned as \u2126joint(W) , Pg\u2208G maxi\u2208[1;n] |wi\ng|.9 The\noverall penalty on W, which results from the combination of \u2126tree and \u2126joint, is itself an instance\nof \u2126 with general overlapping groups, as de\ufb01ned in Eq (2).\n\nTo address problem (6), we use the same optimization scheme as [12], i.e., alternating between X\nand W, \ufb01xing one variable while optimizing with respect to the other. The task we consider is the\ndenoising of natural image patches, with the same dataset and protocol as [12]. We study whether\nlearning the hierarchy of the dictionary elements improves the denoising performance, compared to\nstandard sparse coding (i.e., when \u2126tree is the \u21131-norm and \u03bb2 = 0) and the hierarchical dictionary\nlearning of [12] based on prede\ufb01ned trees (i.e., \u03bb2 = 0). The dimensions of the training set \u2014\n50 000 patches of size 8\u00d78 for dictionaries with up to p = 400 elements \u2014 impose to handle large\ngraphs, with |E| \u2248 |V | \u2248 4.107. Since problem (6) is too large to be solved many times to select the\nregularization parameters (\u03bb1, \u03bb2) rigorously, we use the following heuristics: we optimize mostly\nwith the currently pruned tree held \ufb01xed (i.e., \u03bb2 = 0), and only prune the tree (i.e., \u03bb2 > 0)\nevery few steps on a random subset of 10 000 patches. We consider the same hierarchies as in [12],\ninvolving between 30 and 400 dictionary elements. The regularization parameter \u03bb1 is selected on\nthe validation set of 25 000 patches, for both sparse coding (Flat) and hierarchical dictionary learning\n(Tree). 
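The penalties above are all instances of the structured norm Ω(w) = Σ_{g∈G} ‖w_g‖∞ with overlapping groups. As a minimal illustration (not the authors' code; the helper names `square_groups`, `omega`, and `omega_joint` are ours, and the toy sizes are arbitrary), the sketch below builds the overlapping 3×3 square groups of Section 4.2 and evaluates both the single-signal penalty Ω and the cross-task penalty Ωjoint:

```python
import numpy as np

def square_groups(h, w, k=3):
    """All overlapping k-by-k squares on an h-by-w pixel grid,
    each returned as a list of flat pixel indices (groups overlap)."""
    idx = np.arange(h * w).reshape(h, w)
    return [idx[i:i + k, j:j + k].ravel().tolist()
            for i in range(h - k + 1) for j in range(w - k + 1)]

def omega(w, groups):
    """Structured penalty Omega(w) = sum over groups g of ||w_g||_inf."""
    return sum(np.max(np.abs(w[g])) for g in groups)

def omega_joint(W, groups):
    """Joint penalty on a p-by-n coefficient matrix W: for each group g,
    the largest coefficient magnitude over the group and all n signals."""
    return sum(np.max(np.abs(W[g, :])) for g in groups)

# Toy example: p = 5 variables, two overlapping groups sharing variable 2.
groups = [[0, 1, 2], [2, 3, 4]]
w = np.array([0.5, -2.0, 1.0, 0.0, 3.0])
print(omega(w, groups))          # 2.0 + 3.0 = 5.0

# Three signals stacked as columns of W (n = 3).
W = np.column_stack([w, 0.5 * w, np.zeros(5)])
print(omega_joint(W, groups))    # per-group maxima across signals: 5.0
```

Evaluating Ω is cheap; what the paper contributes is the computation of its proximal operator, which for such general overlapping groups is cast as a quadratic min-cost flow problem.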
Starting from the tree giving the best performance (in this case the largest one, see Figure 4), we solve problem (6) following our heuristics, for increasing values of λ2. As shown in Figure 4, there is a regime where our approach performs significantly better than the two other compared methods. The standard deviation of the noise is 0.2 (the pixels have values in [0, 1]); no significant improvements were observed for lower levels of noise.

[Figure 4 plot, "Denoising Experiment: Mean Square Error": mean square error versus dictionary size for Flat, Tree, and Multi-task Tree.]

Figure 4: Left: Hierarchy obtained by pruning a larger tree of 76 elements. Right: Mean square error versus dictionary size. The error bars represent two standard deviations, based on three runs.

5 Conclusion

We have presented a new optimization framework for solving sparse structured problems involving sums of ℓ∞-norms of any (overlapping) groups of variables. Interestingly, this sheds new light on connections between sparse methods and the literature of network flow optimization. In particular, the proximal operator for the formulation we consider can be cast as a quadratic min-cost flow problem, for which we propose an efficient and simple algorithm. This allows the use of accelerated gradient methods. Several experiments demonstrate that our algorithm can be applied to a wide class of learning problems, which have not been addressed before within sparse methods.

Acknowledgments

This paper was partially supported by the European Research Council (SIERRA Project). The authors would like to thank Jean Ponce for interesting discussions and suggestions.

9 The simplified case where Ωtree and Ωjoint are the ℓ1- and mixed ℓ1/ℓ2-norms [13] corresponds to [30].

References

[1] P. Bickel, Y. Ritov, and A. Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. Ann. Stat., 37(4):1705–1732, 2009.
[2] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Ann. Stat., 32(2):407–499, 2004.
[3] Y. Nesterov. Gradient methods for minimizing composite objective function. Technical report, Center for Operations Research and Econometrics (CORE), Catholic University of Louvain, 2007.
[4] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imag. Sci., 2(1):183–202, 2009.
[5] R. Jenatton, J.-Y. Audibert, and F. Bach. Structured variable selection with sparsity-inducing norms. Technical report, 2009. Preprint arXiv:0904.3523v1.
[6] L. Jacob, G. Obozinski, and J.-P. Vert. Group Lasso with overlap and graph Lasso. In Proc. ICML, 2009.
[7] P. Zhao, G. Rocha, and B. Yu. The composite absolute penalties family for grouped and hierarchical variable selection. Ann. Stat., 37(6A):3468–3497, 2009.
[8] R. G. Baraniuk, V. Cevher, M. Duarte, and C. Hegde. Model-based compressive sensing. IEEE T. Inform. Theory, 2010. To appear.
[9] V. Cevher, M. F. Duarte, C. Hegde, and R. G. Baraniuk. Sparse signal recovery using Markov random fields. In Adv. NIPS, 2008.
[10] J. Huang, T. Zhang, and D. Metaxas. Learning with structured sparsity. In Proc. ICML, 2009.
[11] S. Kim and E. P. Xing. Tree-guided group lasso for multi-task regression with structured sparsity. In Proc. ICML, 2010.
[12] R. Jenatton, J. Mairal, G. Obozinski, and F. Bach. Proximal methods for sparse hierarchical dictionary learning. In Proc. ICML, 2010.
[13] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. J. Roy. Stat. Soc. B, 68:49–67, 2006.
[14] G. Obozinski, B. Taskar, and M. I. Jordan. Joint covariate selection and joint subspace selection for multiple classification problems. Stat. Comput., 20(2):231–252, 2010.
[15] F. Bach. Exploring large feature spaces with hierarchical multiple kernel learning. In Adv. NIPS, 2008.
[16] P. L. Combettes and J.-C. Pesquet. Proximal splitting methods in signal processing. In Fixed-Point Algorithms for Inverse Problems in Science and Engineering. Springer, 2010.
[17] J. Mairal, R. Jenatton, G. Obozinski, and F. Bach. Network flow algorithms for structured sparsity. Technical report, 2010. Preprint arXiv:1008.5209v1.
[18] D. S. Hochbaum and S. P. Hong. About strongly polynomial time algorithms for quadratic optimization over submodular constraints. Math. Program., 69(1):269–309, 1995.
[19] P. Brucker. An O(n) algorithm for quadratic knapsack problems. Oper. Res. Lett., 3:163–166, 1984.
[20] G. Gallo, M. E. Grigoriadis, and R. E. Tarjan. A fast parametric maximum flow algorithm and applications. SIAM J. Comput., 18:30–55, 1989.
[21] M. Babenko and A. V. Goldberg. Experimental evaluation of a parametric flow algorithm. Technical report, Microsoft Research, 2006. MSR-TR-2006-77.
[22] H. Groenevelt. Two algorithms for maximizing a separable concave function over a polymatroid feasible region. Eur. J. Oper. Res., pages 227–236, 1991.
[23] F. Bach. Structured sparsity-inducing norms through submodular functions. In Adv. NIPS, 2010.
[24] A. V. Goldberg and R. E. Tarjan. A new approach to the maximum flow problem. In Proc. of ACM Symposium on Theory of Computing, 1986.
[25] L. R. Ford and D. R. Fulkerson. Maximal flow through a network. Canadian J. Math., 8(3), 1956.
[26] B. V. Cherkassky and A. V. Goldberg. On implementing the push-relabel method for the maximum flow problem. Algorithmica, 19(4):390–410, 1997.
[27] S. Negahban, P. Ravikumar, M. J. Wainwright, and B. Yu. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. In Adv. NIPS, 2009.
[28] J. M. Borwein and A. S. Lewis. Convex analysis and nonlinear optimization. Springer, 2006.
[29] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma. Robust face recognition via sparse representation. IEEE T. Pattern Anal., pages 210–227, 2008.
[30] P. Sprechmann, I. Ramirez, G. Sapiro, and Y. C. Eldar. Collaborative hierarchical sparse modeling. Technical report, 2010. Preprint arXiv:1003.0400v1.