{"title": "Efficient Minimization of Decomposable Submodular Functions", "book": "Advances in Neural Information Processing Systems", "page_first": 2208, "page_last": 2216, "abstract": "Many combinatorial problems arising in machine learning can be reduced to the problem of minimizing a submodular function. Submodular functions are a natural discrete analog of convex functions, and can be minimized in strongly polynomial time. Unfortunately, state-of-the-art algorithms for general submodular minimization are intractable for practical problems. In this paper, we introduce a novel subclass of submodular minimization problems that we call decomposable. Decomposable submodular functions are those that can be represented as sums of concave functions applied to linear functions. We develop an algorithm, SLG, that can efficiently minimize decomposable submodular functions with tens of thousands of variables. Our algorithm exploits recent results in smoothed convex minimization. We apply SLG to synthetic benchmarks and a joint classification-and-segmentation task, and show that it outperforms the state-of-the-art general purpose submodular minimization algorithms by several orders of magnitude.", "full_text": "Ef\ufb01cient Minimization of\n\nDecomposable Submodular Functions\n\nPeter Stobbe\n\nCalifornia Institute of Technology\n\nPasadena, CA 91125\n\nstobbe@caltech.edu\n\nAndreas Krause\n\nCalifornia Institute of Technology\n\nPasadena, CA 91125\n\nkrausea@caltech.edu\n\nAbstract\n\nMany combinatorial problems arising in machine learning can be reduced to the problem\nof minimizing a submodular function. Submodular functions are a natural discrete analog\nof convex functions, and can be minimized in strongly polynomial time. Unfortunately,\nstate-of-the-art algorithms for general submodular minimization are intractable for larger\nproblems.\nIn this paper, we introduce a novel subclass of submodular minimization\nproblems that we call decomposable. 
Decomposable submodular functions are those\nthat can be represented as sums of concave functions applied to modular functions. We\ndevelop an algorithm, SLG, that can efficiently minimize decomposable submodular\nfunctions with tens of thousands of variables. Our algorithm exploits recent results in\nsmoothed convex minimization. We apply SLG to synthetic benchmarks and a joint\nclassification-and-segmentation task, and show that it outperforms the state-of-the-art\ngeneral purpose submodular minimization algorithms by several orders of magnitude.\n\n1 Introduction\n\nConvex optimization has become a key tool in many machine learning algorithms. Many seemingly\nmultimodal optimization problems such as nonlinear classification, clustering and dimensionality\nreduction can be cast as convex programs. When minimizing a convex loss function, we can rest\nassured that an optimal solution can be found efficiently, even for large problems. Convexity is a\nstructural property of continuous optimization problems. However, many machine learning problems,\nsuch as structure learning, variable selection, and MAP inference in discrete graphical models,\nrequire solving discrete, combinatorial optimization problems.\nIn recent years, another fundamental problem structure, which has similar beneficial properties,\nhas emerged as very useful in many combinatorial optimization problems arising in machine learning:\nSubmodularity is an intuitive diminishing returns property, stating that adding an element to a\nsmaller set helps more than adding it to a larger set. Similarly to convexity, submodularity allows\none to efficiently find provably (near-)optimal solutions. In particular, the minimum of a submodular\nfunction can be found in strongly polynomial time [11]. 
Unfortunately, while solvable in polynomial time, exact techniques for submodular minimization require a number of function evaluations on the\norder of $n^5$ [12], where $n$ is the number of variables in the problem (e.g., number of random variables\nin the MAP inference task), rendering the algorithms impractical for many real-world problems.\nFortunately, several submodular minimization problems arising in machine learning have structure\nthat allows solving them more efficiently. Examples include symmetric functions that can be\nsolved in $O(n^3)$ evaluations using Queyranne\u2019s algorithm [19], and functions that decompose into\nattractive, pairwise potentials, that can be solved using graph cutting techniques [7]. In this paper,\nwe introduce a novel class of submodular minimization problems that can be solved efficiently. In\nparticular, we develop an algorithm, SLG, that can minimize a class of submodular functions that\nwe call decomposable: these are functions that can be decomposed into sums of concave functions\napplied to modular (additive) functions. Our algorithm is based on recent techniques of smoothed\nconvex minimization [18] applied to the Lov\u00e1sz extension. We demonstrate the usefulness of\nour algorithm on a joint classification-and-segmentation task involving tens of thousands of\nvariables, and show that it outperforms state-of-the-art algorithms for general submodular function\nminimization by several orders of magnitude.\n\n2 Background on Submodular Function Minimization\nWe are interested in minimizing set functions that map subsets of some base set $E$ to real numbers.\nThat is, given $f : 2^E \\to \\mathbb{R}$ we wish to solve for $A^* \\in \\arg\\min_A f(A)$. For simplicity of notation, we\nuse the base set $E = \\{1, \\ldots, n\\}$, but in an application the base set may consist of nodes of a graph,\npixels of an image, etc. Without loss of generality, we assume $f(\\emptyset) = 0$. 
If the function $f$ has no\nstructure, then there is no way to solve the problem other than checking all $2^n$ subsets. In this paper,\nwe consider functions that satisfy a key property that arises in many applications: submodularity\n(cf. [16]). A set function $f$ is called submodular iff, for all $A, B \\in 2^E$, we have\n\n$$f(A \\cup B) + f(A \\cap B) \\le f(A) + f(B). \\qquad (1)$$\n\nSubmodular functions can alternatively, and perhaps more intuitively, be characterized in terms of\ntheir discrete derivatives. First, we define $\\Delta_k f(A) = f(A \\cup \\{k\\}) - f(A)$ to be the discrete derivative\nof $f$ with respect to $k \\in E$ at $A$; intuitively this is the change in $f$\u2019s value by adding the element $k$\nto the set $A$. Then, $f$ is submodular iff:\n\n$$\\Delta_k f(A) \\ge \\Delta_k f(B), \\quad \\text{for all } A \\subseteq B \\subseteq E \\text{ and } k \\in E \\setminus B.$$\n\nNote the analogy to concave functions; the discrete derivative is smaller for larger sets, in the same\nway that $\\phi(x+h) - \\phi(x) \\ge \\phi(y+h) - \\phi(y)$ for all $x \\le y$, $h \\ge 0$ if and only if $\\phi$ is a concave function\non $\\mathbb{R}$. Thus a simple example of a submodular function is $f(A) = \\phi(|A|)$ where $\\phi$ is any concave\nfunction. Yet despite this connection to concavity, it is in fact \u2018easier\u2019 to minimize a submodular\nfunction than to maximize it$^1$, just as it is easier to minimize a convex function. One explanation for\nthis is that submodular minimization can be reformulated as a convex minimization problem.\nTo see this, consider taking a set function minimization problem, and reformulating it as a\nminimization problem over the unit cube $[0, 1]^n \\subset \\mathbb{R}^n$. Define $e_A \\in \\mathbb{R}^n$ to be the indicator vector of the\nset $A$, i.e.,\n\n$$e_A[k] = \\begin{cases} 0 & \\text{if } k \\notin A \\\\ 1 & \\text{if } k \\in A \\end{cases}$$\n\nWe use the notation $x[k]$ for the $k$th element of the vector $x$. Also we drop brackets and commas\nin subscripts, so $e_{kl} = e_{\\{k,l\\}}$ and $e_k = e_{\\{k\\}}$ as with the standard unit vectors. A continuous\nextension of a set function $f$ is a function $\\tilde{f} : [0, 1]^n \\to \\mathbb{R}$ on the unit cube with the property\nthat $f(A) = \\tilde{f}(e_A)$. In order to be useful, however, one needs the minima of the set function to be\nrelated to minima of the extension:\n\n$$A^* \\in \\arg\\min_{A \\in 2^E} f(A) \\;\\Rightarrow\\; e_{A^*} \\in \\arg\\min_{x \\in [0,1]^n} \\tilde{f}(x). \\qquad (2)$$\n\nA key result due to Lov\u00e1sz [16] states that each submodular function $f$ has an extension $\\tilde{f}$ that not\nonly satisfies the above property, but is also convex and efficient to evaluate. We can define the\nLov\u00e1sz extension in terms of the submodular polyhedron $P_f$:\n\n$$P_f = \\{v \\in \\mathbb{R}^n : v \\cdot e_A \\le f(A), \\text{ for all } A \\in 2^E\\}, \\qquad \\tilde{f}(x) = \\sup_{v \\in P_f} v \\cdot x.$$\n\nThe submodular polyhedron $P_f$ is defined by exponentially many inequalities, and evaluating $\\tilde{f}$\nrequires solving a linear program over this polyhedron. Perhaps surprisingly, as shown by Lov\u00e1sz, $\\tilde{f}$\ncan be very efficiently computed as follows. For a fixed $x$ let $\\sigma : E \\to E$ be a permutation such that\n$x[\\sigma(1)] \\ge \\ldots \\ge x[\\sigma(n)]$, and then define the set $S_k = \\{\\sigma(1), \\ldots, \\sigma(k)\\}$ (with $S_0 = \\emptyset$). Then we have a formula\nfor $\\tilde{f}$ and a subgradient:\n\n$$\\tilde{f}(x) = \\sum_{k=1}^{n} x[\\sigma(k)]\\,(f(S_k) - f(S_{k-1})), \\qquad \\partial \\tilde{f}(x) \\ni \\sum_{k=1}^{n} e_{\\sigma(k)}\\,(f(S_k) - f(S_{k-1})).$$\n\nNote that if two components of $x$ are equal, the above formula for $\\tilde{f}$ is independent of the permutation\nchosen, but the subgradient is not unique.\n\n$^1$ With the additional assumption that $f$ is nondecreasing, maximizing a submodular function subject to a\ncardinality constraint $|A| \\le M$ is \u2018easy\u2019; a greedy algorithm is known to give a near-optimal answer [17].\n\nEquation (2) was used to show that submodular minimization can be achieved in polynomial time\n[16]. However, algorithms which directly minimize the Lov\u00e1sz extension are regarded as impractical. 
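The sorting formula for the Lovasz extension and its subgradient can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the name lovasz_extension, the representation of f as a callable on frozensets with f(empty set) = 0, and the example function sqrt_card are all assumptions made for this example.

```python
import math

def lovasz_extension(f, x):
    # Sort indices so that x[order[0]] >= x[order[1]] >= ... (the permutation sigma).
    order = sorted(range(len(x)), key=lambda k: -x[k])
    value = 0.0
    subgrad = [0.0] * len(x)
    prefix = set()      # the growing set S_k
    prev = 0.0          # f(S_{k-1}), starting from f(empty set) = 0 (assumed)
    for k in order:
        prefix.add(k)
        cur = f(frozenset(prefix))
        subgrad[k] = cur - prev          # marginal gain of adding element k
        value += x[k] * (cur - prev)
        prev = cur
    return value, subgrad

def sqrt_card(A):
    # Hypothetical example: a concave function of the cardinality, hence submodular.
    return math.sqrt(len(A))
```

At an indicator vector the extension reproduces the set function, so lovasz_extension(sqrt_card, [1.0, 0.0, 1.0]) returns the value of sqrt_card on the set {0, 2}. Each call costs one sort plus n evaluations of f.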
Despite being convex, the Lov\u00e1sz extension is non-smooth, and hence a simple subgradient\ndescent algorithm would need $O(1/\\epsilon^2)$ steps to achieve $O(\\epsilon)$ accuracy.\nRecently, Nesterov showed that if knowledge about the structure of a particular non-smooth convex\nfunction is available, it can be exploited to achieve a running time of $O(1/\\epsilon)$ [18]. One way this is\ndone is to construct a smooth approximation of the non-smooth function, and then use an accelerated\ngradient descent algorithm which is highly effective for smooth functions. Connections of this work\nwith submodularity and combinatorial optimization are also explored in [4] and [2].\nIn fact, in [2], Bach shows that computing the smoothed Lov\u00e1sz gradient of a general submodular function is\nequivalent to solving a submodular minimization problem. In this paper, we do not treat general\nsubmodular functions, but rather a large class of submodular functions that we call\ndecomposable. (To apply the smoothing technique of [18], special structural knowledge about the\nconvex function is required, so it is natural that we would need special structural knowledge about\nthe submodular function to leverage those results.) We further show that we can exploit the discrete\nstructure of submodular minimization in a way that allows terminating the algorithm early with a\ncertificate of optimality, which leads to drastic performance improvements.\n\n3 The Decomposable Submodular Minimization Problem\n\nIn this paper, we consider the problem of minimizing functions of the following form:\n\n$$f(A) = c \\cdot e_A + \\sum_j \\phi_j(w_j \\cdot e_A), \\qquad (3)$$\n\nwhere $c, w_j \\in \\mathbb{R}^n$ and $0 \\le w_j \\le \\mathbf{1}$ and $\\phi_j : [0, w_j \\cdot \\mathbf{1}] \\to \\mathbb{R}$ are arbitrary concave functions. It can\nbe shown that functions of this form are submodular. We call this class of functions decomposable\nsubmodular functions, as they decompose into a sum of concave functions applied to nonnegative\nmodular functions$^2$. 
Below, we give examples of decomposable submodular functions arising in\napplications.\nWe first focus on the special case where all the concave functions are of the form $\\phi_j(\\cdot) = d_j \\min(y_j, \\cdot)$ for some $y_j, d_j > 0$. Since these potentials are of key importance, we define the\nsubmodular functions $\\Psi_{w,y}(A) = \\min(y, w \\cdot e_A)$ and call them threshold potentials. In Section 5,\nwe will show how to generalize our approach to arbitrary decomposable submodular functions.\n\nExamples. The simplest example is a 2-potential, which has the form $\\phi(|A \\cap \\{k, l\\}|)$, where $\\phi(1) - \\phi(0) \\ge \\phi(2) - \\phi(1)$. It can be expressed as a sum of a modular function and a threshold potential:\n\n$$\\phi(|A \\cap \\{k, l\\}|) = \\phi(0) + (\\phi(2) - \\phi(1))\\, e_{kl} \\cdot e_A + (2\\phi(1) - \\phi(0) - \\phi(2))\\, \\Psi_{e_{kl},1}(A).$$\n\nWhy are such potential functions interesting? They arise, for example, when finding the Maximum\na Posteriori configuration of a pairwise Markov Random Field model in image classification\nschemes such as in [20]. On a high level, such an algorithm computes a value $c[k]$ that corresponds\nto the log-likelihood of pixel $k$ being of one class vs. another, and for each pair of adjacent pixels,\na value $d_{kl}$ related to the log-likelihood that pixels $k$ and $l$ are of the same class. Then the algorithm\nclassifies pixels by minimizing a sum of 2-potentials: $f(A) = c \\cdot e_A + \\sum_{k,l} d_{kl}(1 - |1 - e_{kl} \\cdot e_A|)$.\nIf the value $d_{kl}$ is large, this encourages the pixels $k$ and $l$ to be classified similarly.\nMore generally, consider a higher order potential function: a concave function of the number of\nelements in some activation set $S$, $\\phi(|A \\cap S|)$ where $\\phi$ is concave. It can be shown that this can\nbe written as a sum of a modular function and a positive linear combination of $|S| - 1$ threshold\npotentials. 
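The decomposition of a concave cardinality potential into a modular part plus |S| - 1 threshold potentials can be checked numerically. The sketch below assumes one particular choice of coefficients, d_y = 2 phi(y) - phi(y-1) - phi(y+1) for y = 1, ..., |S| - 1, which reduces to the 2-potential identity when |S| = 2. The helper names threshold_decomposition, evaluate and quadratic_potential are hypothetical and do not come from the paper.

```python
def threshold_decomposition(phi, s):
    # Returns (constant, slope, weights) such that, for m = 0..s,
    #   phi(m) = constant + slope * m + sum over y of weights[y] * min(y, m).
    # weights[y] is a negated second difference, nonnegative when phi is concave.
    constant = phi(0)
    slope = phi(s) - phi(s - 1)
    weights = {y: 2 * phi(y) - phi(y - 1) - phi(y + 1) for y in range(1, s)}
    return constant, slope, weights

def evaluate(decomp, m):
    # Evaluate the decomposition at cardinality m.
    constant, slope, weights = decomp
    return constant + slope * m + sum(d * min(y, m) for y, d in weights.items())

def quadratic_potential(s):
    # phi(m) = m * (s - m): the size of the part of a region inside A
    # times the size of the part outside A, for a region with s elements.
    return lambda m: m * (s - m)
```

For a region of six elements, threshold_decomposition(quadratic_potential(6), 6) produces five threshold weights, all nonnegative, and evaluate reproduces the potential at every cardinality from 0 to 6.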
Recent work [14] has shown that classification performance can be improved by adding\nterms corresponding to such higher order potentials $\\phi_j(|R_j \\cap A|)$ to the objective function, where the\nfunctions $\\phi_j$ are piecewise linear concave functions, and the regions $R_j$ of various sizes are generated\nfrom a segmentation algorithm. Minimization of these particular potential functions can then be\nreformulated as a graph cut problem [13], but this is less general than our approach.\nAnother canonical example of a submodular function is a set cover function. Such a function can\nbe reformulated as a combination of concave cardinality functions (details omitted here). So all\nfunctions which are weighted combinations of set cover functions can be expressed as threshold\npotentials. However, threshold potentials with nonuniform weights are strictly more general than\nconcave cardinality potentials. That is, there exist $w$ and $y$ such that $\\Psi_{w,y}(A)$ cannot be expressed\nas $\\sum_j \\phi_j(|R_j \\cap A|)$ for any collection of concave $\\phi_j$ and sets $R_j$.\n\n$^2$ A function is called modular if (1) holds with equality. It can be written as $A \\mapsto w \\cdot e_A$ for some $w \\in \\mathbb{R}^n$.\n\nAnother example of decomposable functions arises in multiclass queueing systems [10]. These are\nof the form $f(A) = c \\cdot e_A + (u \\cdot e_A)\\,\\phi(v \\cdot e_A)$, where $u, v$ are nonnegative weight vectors and $\\phi$ is\na nonincreasing concave function. With the proper choice of $\\phi_j$ and $w_j$ (again details are omitted\nhere), this can in fact be reformulated as a sum of the type in Eq. 3 with $n$ terms.\nIn our own experiments, shown in Section 6, we use an implementation of TextonBoost [20] and\naugment it with quadratic higher order potentials. 
That is, we use TextonBoost to generate per-pixel\nscores $c$, and then minimize $f(A) = c \\cdot e_A + \\sum_j |A \\cap R_j|\\,|R_j \\setminus A|$, where the regions $R_j$ are regions\nof pixels that we expect to be of the same class (e.g., by running a cheap region-growing heuristic).\nThe potential function $|A \\cap R_j|\\,|R_j \\setminus A|$ is smallest when $A$ contains all of $R_j$ or none of it. It gives the\nlargest penalty when exactly half of $R_j$ is contained in $A$. This encourages the classification scheme\nto classify most of the pixels in a region $R_j$ the same way. We generate regions with a basic region-growing\nalgorithm with random seeds. See Figure 1(a) for an illustration of examples of regions\nthat we use. In our experience, this simple idea of using higher-order potentials can dramatically\nincrease the quality of the classification over one using only 2-potentials, as can be seen in Figure 2.\n\n4 The SLG Algorithm for Threshold Potentials\nWe now present our algorithm for efficient minimization of a decomposable submodular function $f$\nbased on smoothed convex minimization. We first show how we can efficiently smooth the Lov\u00e1sz\nextension of $f$. We then apply accelerated gradient descent to the smoothed function.\nLastly, we demonstrate how we can often obtain a certificate of optimality that allows us to stop\nearly, drastically speeding up the algorithm in practice.\n\n4.1 The Smoothed Extension of a Threshold Potential\nThe key challenge in our algorithm is to efficiently smooth the Lov\u00e1sz extension of $f$, so that we\ncan resort to algorithms for accelerated convex minimization. We now show how we can efficiently\nsmooth the threshold potentials $\\Psi_{w,y}(A) = \\min(y, w \\cdot e_A)$ of Section 3, which are simple enough\nto allow efficient smoothing, but rich enough when combined to express a large class of submodular\nfunctions. For $x \\ge 0$, the Lov\u00e1sz extension of $\\Psi_{w,y}$ is\n\n$$\\tilde{\\Psi}_{w,y}(x) = \\sup\\; v \\cdot x \\quad \\text{s.t.} \\quad v \\le w, \\; v \\cdot e_A \\le y \\text{ for all } A \\in 2^E.$$\n\nNote that when $x \\ge 0$, the arg max of the above linear program always contains a point $v$ which\nsatisfies $v \\cdot \\mathbf{1} = y$ and $v \\ge 0$. So we can restrict the domain of the dual variable $v$ to those points\nwhich satisfy these two conditions, without changing the value of $\\tilde{\\Psi}_{w,y}(x)$:\n\n$$\\tilde{\\Psi}_{w,y}(x) = \\max_{v \\in D(w,y)} v \\cdot x \\quad \\text{where} \\quad D(w, y) = \\{v : 0 \\le v \\le w, \\; v \\cdot \\mathbf{1} = y\\}.$$\n\nRestricting the domain of $v$ allows us to define a smoothed Lov\u00e1sz extension (with parameter $\\mu$)\nthat is easily computed:\n\n$$\\tilde{\\Psi}^{\\mu}_{w,y}(x) = \\max_{v \\in D(w,y)} v \\cdot x - \\frac{\\mu}{2}\\|v\\|^2.$$\n\nTo compute the value of this function we need to solve for the optimal vector $v^*$, which is also the\ngradient of this function, as we have the following characterization:\n\n$$\\nabla \\tilde{\\Psi}^{\\mu}_{w,y}(x) = \\arg\\max_{v \\in D(w,y)} v \\cdot x - \\frac{\\mu}{2}\\|v\\|^2 = \\arg\\min_{v \\in D(w,y)} \\left\\| \\frac{x}{\\mu} - v \\right\\|. \\qquad (4)$$\n\nTo derive an expression for $v^*$, we begin by forming the Lagrangian and deriving the dual problem:\n\n$$\\tilde{\\Psi}^{\\mu}_{w,y}(x) = \\min_{t \\in \\mathbb{R},\\, \\lambda_1, \\lambda_2 \\ge 0} \\left( \\max_{v \\in \\mathbb{R}^n} v \\cdot x - \\frac{\\mu}{2}\\|v\\|^2 + \\lambda_1 \\cdot v + \\lambda_2 \\cdot (w - v) + t(y - v \\cdot \\mathbf{1}) \\right) = \\min_{t \\in \\mathbb{R},\\, \\lambda_1, \\lambda_2 \\ge 0} \\frac{1}{2\\mu}\\|x - t\\mathbf{1} + \\lambda_1 - \\lambda_2\\|^2 + \\lambda_2 \\cdot w + ty.$$\n\nIf we fix $t$, we can solve for the optimal dual variables $\\lambda_1^*$ and $\\lambda_2^*$ componentwise. By strong duality,\nwe know the optimal primal variable is given by $v^* = \\frac{1}{\\mu}(x - t^*\\mathbf{1} + \\lambda_1^* - \\lambda_2^*)$. So we have:\n\n$$\\lambda_1^* = \\max(t^*\\mathbf{1} - x, 0), \\quad \\lambda_2^* = \\max(x - t^*\\mathbf{1} - \\mu w, 0) \\;\\Rightarrow\\; v^* = \\min(\\max((x - t^*\\mathbf{1})/\\mu, 0), w).$$\n\nThis expresses $v^*$ as a function of the unknown optimal dual variable $t^*$. 
For the simple case of\n2-potentials, we can solve for $t^*$ explicitly and get a closed form expression:\n\n$$\\nabla \\tilde{\\Psi}^{\\mu}_{e_{kl},1}(x) = \\begin{cases} e_k & \\text{if } x[k] \\ge x[l] + \\mu \\\\ e_l & \\text{if } x[l] \\ge x[k] + \\mu \\\\ \\frac{1}{2}\\left(e_{kl} + \\frac{1}{\\mu}(x[k] - x[l])(e_k - e_l)\\right) & \\text{if } |x[k] - x[l]| < \\mu \\end{cases}$$\n\nHowever, in general to find $t^*$ we note that $v^*$ must satisfy $v^* \\cdot \\mathbf{1} = y$. So define $\\rho^{\\mu}_{x,w}(t)$ as:\n\n$$\\rho^{\\mu}_{x,w}(t) = \\min(\\max((x - t\\mathbf{1})/\\mu, 0), w) \\cdot \\mathbf{1}.$$\n\nThen we note this function is a monotonic continuous piecewise linear function of $t$, so we can use a\nsimple root-finding algorithm to solve $\\rho^{\\mu}_{x,w}(t^*) = y$. This root-finding procedure will take no more\nthan $O(n)$ steps in the worst case.\n\n4.2 The SLG Algorithm for Minimizing Sums of Threshold Potentials\nStepping beyond a single threshold potential, we now assume that the submodular function to be\nminimized can be written as a nonnegative linear combination of threshold potentials and a modular\nfunction, i.e.,\n\n$$f(A) = c \\cdot e_A + \\sum_j d_j \\Psi_{w_j,y_j}(A).$$\n\nThus, we have the smoothed Lov\u00e1sz extension, and its gradient:\n\n$$\\tilde{f}^{\\mu}(x) = c \\cdot x + \\sum_j d_j \\tilde{\\Psi}^{\\mu}_{w_j,y_j}(x) \\quad \\text{and} \\quad \\nabla \\tilde{f}^{\\mu}(x) = c + \\sum_j d_j \\nabla \\tilde{\\Psi}^{\\mu}_{w_j,y_j}(x).$$\n\nWe now wish to use the accelerated gradient descent algorithm of [18] to minimize this function.\nThis algorithm requires that the smoothed objective has a Lipschitz continuous gradient. That is, for\nsome constant $L$, it must hold that $\\|\\nabla \\tilde{f}^{\\mu}(x_1) - \\nabla \\tilde{f}^{\\mu}(x_2)\\| \\le L\\|x_1 - x_2\\|$ for all $x_1, x_2 \\in \\mathbb{R}^n$.\nFortunately, by construction, the smoothed threshold extensions $\\tilde{\\Psi}^{\\mu}_{w_j,y_j}(x)$ all have $1/\\mu$-Lipschitz\ngradient, a direct consequence of the characterization in Equation 4. Hence we have\na loose upper bound for the Lipschitz constant of $\\tilde{f}^{\\mu}$: $L \\le \\frac{D}{\\mu}$, where $D = \\sum_j d_j$. 
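The gradient of a single smoothed threshold potential (Section 4.1) amounts to clipping (x - t) / mu between 0 and w, with the scalar t chosen so that the clipped vector sums to y. The sketch below finds t by plain bisection, which is a simpler substitute for the exact root-finding over breakpoints described in the text, not the authors' procedure. The function name and the iters parameter are hypothetical, and the code assumes mu > 0 and 0 <= y <= sum(w).

```python
def smoothed_threshold_gradient(x, w, y, mu, iters=100):
    def clipped(t):
        # Candidate v(t) = min(max((x - t) / mu, 0), w), computed componentwise.
        return [min(max((xk - t) / mu, 0.0), wk) for xk, wk in zip(x, w)]

    # The sum of clipped(t) is nonincreasing in t.
    lo = min(xk - mu * wk for xk, wk in zip(x, w))   # here every entry equals w[k]
    hi = max(x)                                      # here every entry equals 0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if sum(clipped(mid)) > y:
            lo = mid     # sum too large, so t must increase
        else:
            hi = mid
    return clipped(0.5 * (lo + hi))
```

For a 2-potential on elements 0 and 1 (w = [1, 1, 0], y = 1) with mu = 0.5 and x = [0.3, 0.1, 0.9], the closed form gives (0.7, 0.3, 0), and the bisection sketch agrees to numerical precision. Summing such gradients with weights d_j and adding c gives the gradient of the full smoothed objective.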
Furthermore, the smoothed threshold extensions approximate the threshold extensions uniformly:\n$|\\tilde{\\Psi}^{\\mu}_{w_j,y_j}(x) - \\tilde{\\Psi}_{w_j,y_j}(x)| \\le \\frac{\\mu}{2}$ for all $x$, so $|\\tilde{f}^{\\mu}(x) - \\tilde{f}(x)| \\le \\frac{\\mu D}{2}$.\n\nOne way to use the smoothed gradient is to specify an accuracy $\\epsilon$, then minimize $\\tilde{f}^{\\mu}$ for sufficiently\nsmall $\\mu$ to guarantee that the solution will also be an approximate minimizer of $\\tilde{f}$. Then we simply\napply the accelerated gradient descent algorithm of [18]. See also [3] for a description. Let $P_C(x) = \\arg\\min_{x' \\in C} \\|x - x'\\|$ be the projection of $x$ onto the convex set $C$. In particular, $P_{[0,1]^n}(x) = \\min(\\max(x, 0), 1)$. Algorithm 1 formalizes our Smoothed Lov\u00e1sz Gradient (SLG) algorithm:\n\nAlgorithm 1: SLG: Smoothed Lov\u00e1sz Gradient\nInput: Accuracy $\\epsilon$; decomposable function $f$.\nbegin\n  $\\mu = \\frac{\\epsilon}{2D}$, $L = \\frac{D}{\\mu}$, $x_{-1} = z_{-1} = \\frac{1}{2}\\mathbf{1}$;\n  for $t = 0, 1, 2, \\ldots$ do\n    $g_t = \\nabla \\tilde{f}^{\\mu}(x_{t-1})/L$; $z_t = P_{[0,1]^n}\\left(z_{-1} - \\sum_{s=0}^{t} \\frac{s+1}{2} g_s\\right)$; $y_t = P_{[0,1]^n}(x_{t-1} - g_t)$;\n    if $\\text{gap}_t \\le \\epsilon/2$ then stop;\n    $x_t = (2 z_t + (t + 1) y_t)/(t + 3)$;\n  $x_\\epsilon = y_t$;\nOutput: $\\epsilon$-optimal $x_\\epsilon$ to $\\min_{x \\in [0,1]^n} \\tilde{f}(x)$\n\nThe optimality gap of a smooth convex function at the iterate $y_t$ can be computed from its gradient:\n\n$$\\text{gap}_t = \\max_{x \\in [0,1]^n} (y_t - x) \\cdot \\nabla \\tilde{f}^{\\mu}(y_t) = y_t \\cdot \\nabla \\tilde{f}^{\\mu}(y_t) + \\max(-\\nabla \\tilde{f}^{\\mu}(y_t), 0) \\cdot \\mathbf{1}.$$\n\nIn summary, as a consequence of the results of [18], we have the following guarantee about SLG:\n\nTheorem 1 SLG is guaranteed to provide an $\\epsilon$-optimal solution after running for $O(D/\\epsilon)$ iterations.\n\nSLG is only guaranteed to provide an $\\epsilon$-optimal solution to the continuous optimization problem.\nFortunately, once we have an $\\epsilon$-optimal point for the Lov\u00e1sz extension, we can efficiently round it to\na set which is $\\epsilon$-optimal for the original submodular function using Alg. 2 (see [9] for more details).\n\nAlgorithm 2: Set generation by rounding the continuous solution\nInput: Vector $x \\in [0, 1]^n$; submodular function $f$.\nbegin\n  By sorting, find any permutation $\\sigma$ satisfying: $x[\\sigma(1)] \\ge \\ldots \\ge x[\\sigma(n)]$;\n  $S_k = \\{\\sigma(1), \\ldots, \\sigma(k)\\}$; $K^* = \\arg\\min_{k \\in \\{0, 1, \\ldots, n\\}} f(S_k)$; $C = \\{S_k : k \\in K^*\\}$;\nOutput: Collection of sets $C$, such that $f(A) \\le \\tilde{f}(x)$ for all $A \\in C$\n\n4.3 Early Stopping based on Discrete Certificates of Optimality\nIn general, if the minimum of $f$ is not unique, the output of SLG may be in the interior of the unit\ncube. However, if $f$ admits a unique minimum $A^*$, then the iterates will tend toward the corner\n$e_{A^*}$. If a trend like this is observed, a natural question is whether it is necessary to wait for the\niterates to converge all the way to the optimal solution of the continuous problem $\\min_{x \\in [0,1]^n} \\tilde{f}(x)$,\nwhen one is actually interested in solving the discrete problem $\\min_{A \\in 2^E} f(A)$. Below, we show that\nit is possible to use information about the current iterates to check optimality of a set and terminate\nthe algorithm before the continuous problem has converged.\nTo prove optimality of a candidate set $A$, we can use a subgradient of $\\tilde{f}$ at $e_A$. If $g \\in \\partial \\tilde{f}(e_A)$, then\nwe can compute an optimality gap:\n\n$$f(A) - f^* \\le \\max_{x \\in [0,1]^n} (e_A - x) \\cdot g = \\sum_{k \\in E} \\max(0, g[k](e_A[k] - e_{E \\setminus A}[k])). \\qquad (5)$$\n\nIn particular if $g[k] \\le 0$ for $k \\in A$ and $g[k] \\ge 0$ for $k \\in E \\setminus A$, then $A$ is optimal. But if we only\nhave knowledge of candidate set $A$, then finding a subgradient $g \\in \\partial \\tilde{f}(e_A)$ which demonstrates\noptimality may be extremely difficult, as the set of subgradients is a polyhedron with exponentially\nmany extreme points. 
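The optimality gap of Equation 5 is cheap to evaluate once a subgradient g at the indicator vector of A is available. The sketch below takes the sum over every element of the base set, as the sign conditions in the text require: elements inside A contribute max(0, g[k]) and elements outside A contribute max(0, -g[k]). The function name optimality_gap is hypothetical, and elements are assumed to be indexed 0, ..., n-1.

```python
def optimality_gap(g, A):
    # Upper bound on f(A) - min f, given a subgradient g of the
    # Lovasz extension at the indicator vector of the set A.
    gap = 0.0
    for k, gk in enumerate(g):
        sign = 1.0 if k in A else -1.0   # e_A[k] minus the indicator of the complement
        gap += max(0.0, gk * sign)
    return gap
```

As a sanity check, for a purely modular function with weights c the only subgradient is c itself, and the gap of a set A equals exactly f(A) minus the minimum value, which is the sum of the negative weights.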
But our algorithm naturally suggests the subgradient we could use; the gradient\nof the smoothed extension is one such subgradient, provided a certain condition is satisfied, as\ndescribed in the following lemma.\n\nLemma 1 Suppose $f$ is a decomposable submodular function, with Lov\u00e1sz extension $\\tilde{f}$, and\nsmoothed extension $\\tilde{f}^{\\mu}$ as in the previous section. Suppose $x \\in \\mathbb{R}^n$ and $A \\in 2^E$ satisfy the following\nproperty:\n\n$$\\min_{k \\in A,\\, l \\in E \\setminus A} (x[k] - x[l]) \\ge 2\\mu.$$\n\nThen $\\nabla \\tilde{f}^{\\mu}(x) \\in \\partial \\tilde{f}(e_A)$.\n\nThis is a consequence of our formula for $\\nabla \\tilde{\\Psi}^{\\mu}$, but see the appendix of the extended paper [21] for\na detailed proof. Lemma 1 states that if the components of point $x$ corresponding to elements of $A$\nare all larger than all the other components by at least $2\\mu$, then the gradient at $x$ is a subgradient for\n$\\tilde{f}$ at $e_A$ (which by Equation 5 allows us to compute an optimality gap). In practice, this separation\nof components naturally occurs as the iterates move in the direction of the point $e_A$, long before\nthey ever actually reach the point $e_A$. But even if the components are not separated, we can easily\nadd a positive multiple of $e_A$ to separate them and then compute the gradient there to get an\noptimality gap. In summary, we have the following algorithm to check the optimality of a candidate\nset:\n\nAlgorithm 3: Set Optimality Check\nInput: Set $A$; decomposable function $f$; scale $\\mu$; $x \\in \\mathbb{R}^n$.\nbegin\n  $\\gamma = 2\\mu + \\max_{k \\in A,\\, l \\in E \\setminus A} (x[l] - x[k])$; $g = \\nabla \\tilde{f}^{\\mu}(x + \\gamma e_A)$;\n  $\\text{gap} = \\sum_{k \\in E} \\max(0, g[k](e_A[k] - e_{E \\setminus A}[k]))$;\nOutput: gap, which satisfies $\\text{gap} \\ge f(A) - f^*$\n\nOf critical importance is how to choose the candidate set $A$. But by Equation 5, for a set to be\noptimal, we want the components of the gradient $\\nabla \\tilde{f}^{\\mu}(x + \\gamma e_A)[k]$ to be negative for $k \\in A$ and\npositive for $k \\in E \\setminus A$. So it is natural to choose $A = \\{k : \\nabla \\tilde{f}^{\\mu}(x)[k] \\le 0\\}$. 
Thus, if adding $\\gamma e_A$\ndoes not change the signs of the components of the gradient, then in fact we have found the optimal\nset. This stopping criterion is very effective in practice, and we use it in all of our experiments.\n\nFigure 1: (a) Example regions used for our higher-order potential functions. (b-c) Comparison of\nrunning times of submodular minimization algorithms on synthetic problems from DIMACS [1]: (b) results for genrmf-long, (c) results for genrmf-wide.\n\n5 Extension to General Concave Potentials\nTo extend our algorithm to work on general concave functions, we note that an arbitrary concave\nfunction can be expressed as an integral of threshold potential functions. This is a simple consequence\nof integration by parts, which we state in the following lemma:\n\nLemma 2 For $\\phi \\in C^2([0, T])$,\n\n$$\\phi(x) = \\phi(0) + \\phi'(T)\\,x - \\int_0^T \\min(x, y)\\,\\phi''(y)\\,dy, \\quad \\forall x \\in [0, T].$$\n\nThis means that for a general sum of concave potentials as in Equation (3), we have:\n\n$$f(A) = c \\cdot e_A + \\sum_j \\left( \\phi_j(0) + \\phi_j'(w_j \\cdot \\mathbf{1})\\, w_j \\cdot e_A - \\int_0^{w_j \\cdot \\mathbf{1}} \\Psi_{w_j,y}(A)\\, \\phi_j''(y)\\,dy \\right).$$\n\nThen we can define $\\tilde{f}$ and $\\tilde{f}^{\\mu}$ by replacing $\\Psi$ with $\\tilde{\\Psi}$ and $\\tilde{\\Psi}^{\\mu}$ respectively. Our SLG algorithm is\nessentially unchanged, the conditions for optimality still hold, and so on. Conceptually, we just use\na different smoothed gradient, but calculating it is more involved. We need to compute integrals\nof the form $\\int \\nabla \\tilde{\\Psi}^{\\mu}_{w,y}(x)\\,\\phi''(y)\\,dy$. Since $\\nabla \\tilde{\\Psi}^{\\mu}_{w,y}(x)$ is a piecewise linear function with respect to $y$\nwhich we can compute, we can evaluate the integral by parts so that we need only evaluate $\\phi$, but\nnot its derivatives. We omit the resulting formulas for space limitations.\n\n6 Experiments\nSynthetic Data. We reproduce the experimental setup of [8] designed to compare submodular\nminimization algorithms. 
Our goal is to find the minimum cut of a randomly generated graph (which\nrequires submodular minimization of a sum of 2-potentials) with the graph generated by the specifications\nin [1]. We compare against the state-of-the-art combinatorial algorithms (LEX2, HYBRID,\nSFM3, PR [6]) that are guaranteed to find the exact solution in polynomial time, as well as the\nMinimum Norm algorithm of [8], a practical alternative with unknown running time. Figures 1(b)\nand 1(c) compare the running time of SLG against the running times reported in [8]. In some cases,\nSLG was 6 times faster than the MinNorm algorithm. However the comparison to the MinNorm\nalgorithm is inconclusive in this experiment, since while we used a faster machine, we also used a\nsimple MATLAB implementation. What is clear is that SLG scales at least as well as MinNorm on\nthese problems, and is practical for problem sizes that the combinatorial algorithms cannot handle.\nImage Segmentation Experiments. We also tested our algorithm on the joint image\nsegmentation-and-classification task introduced in Section 3. We used an implementation of\nTextonBoost [20], then trained and tested it on subsampled images from [5]. As seen in Figures 2(e)\nand 2(g), using only the per-pixel score from our TextonBoost implementation gets the general area\nof the object, but does not do a good job of identifying the shape of a classified object. Compare\nto the ground truth in Figures 2(b) and 2(d). We then perform MAP inference in a Markov Random\nField with 2-potentials (as done in [20]). 
While this regularization, as shown in Figures 2(f) and\n2(h), leads to improved performance, it still performs poorly on classifying the boundary.\n\n[Figure 1 plots (b-c): x-axis Problem Size (n), y-axis Running Time (s); curves for SLG, SFM3, MinNorm, HYBRID, PR and LEX2. Panel (a) shows regions labeled R1, R2, R3.]\n\nFigure 2: Segmentation experimental results. Panels: (a) Original Image; (b) Ground Truth; (c) Original Image; (d) Ground Truth; (e) Pixel-based; (f) Pairwise Potentials; (g) Pixel-based; (h) Pairwise Potentials; (i) Concave Potentials; (j) Continuous; (k) Concave Potentials; (l) Continuous.\n\nFinally, we used SLG to regularize with higher order potentials. To generate regions for our potentials,\nwe randomly picked seed pixels and grew the regions based on HSV channels of the image.\nWe picked our seed pixels with a preference for pixels which were included in the fewest\npreviously generated regions. Figure 1(a) shows what the regions typically looked like. For our experiments,\nwe used 90 total regions. We used SLG to minimize $f(A) = c \\cdot e_A + \\sum_j |A \\cap R_j|\\,|R_j \\setminus A|$,\nwhere $c$ was the output from TextonBoost, scaled appropriately. Figures 2(i) and 2(k) show the classification\noutput. The continuous variables $x$ at the end of each run are shown in Figures 2(j) and\n2(l); while it has no formal meaning, in general one can interpret a very high or low value of $x[k]$\nto correspond to high confidence in the classification of the pixel $k$. To generate the result shown\nin Figure 2(k), a problem with $10^4$ variables and 90 concave potentials, our MATLAB/mex implementation\nof SLG took 71.4 seconds. In comparison, the MinNorm implementation of the SFO\ntoolbox [15] gave the same result, but took 6900 seconds. 
Similar problems on an image of twice the\nresolution ($4 \\times 10^4$ variables) were tested using SLG, resulting in runtimes of roughly 1600 seconds.\n\n7 Conclusion\nWe have developed a novel method for efficiently minimizing a large class of submodular functions\nof practical importance. We do so by decomposing the function into a sum of threshold potentials,\nwhose Lov\u00e1sz extensions are convenient for using modern smoothing techniques of convex optimization.\nThis allows us to solve submodular minimization problems with thousands of variables\nthat cannot be expressed using only pairwise potentials. Thus we have achieved a middle ground\nbetween graph-cut-based algorithms which are extremely fast but only able to handle very specific\ntypes of submodular minimization problems, and combinatorial algorithms which assume nothing\nbut submodularity but are impractical for large-scale problems.\n\nAcknowledgements This research was partially supported by NSF grant IIS-0953413, a gift from\nMicrosoft Corporation and an Okawa Foundation Research Grant. Thanks to Alex Gittens and\nMichael McCoy for use of their TextonBoost implementation.\n\nReferences\n[1] Dimacs, The First international algorithm implementation challenge: The core experiments,\n1990.\n\n[2] F. Bach, Structured sparsity-inducing norms through submodular functions, Advances in Neural\nInformation Processing Systems (2010).\n\n[3] S. Becker, J. Bobin, and E.J. Candes, Nesta: A fast and accurate first-order method for sparse\nrecovery, Arxiv preprint arXiv 904 (2009), 1\u201337.\n\n[4] F.A. Chudak and K. Nagano, Efficient solutions to relaxations of combinatorial problems with\nsubmodular penalties via the Lov\u00e1sz extension and non-smooth convex optimization, Proceedings\nof the eighteenth annual ACM-SIAM symposium on Discrete algorithms, Society for\nIndustrial and Applied Mathematics, 2007, pp. 79\u201388.\n\n[5] M. Everingham, L. Van Gool, C. K. I. Williams, J. 
Winn, and A. Zisserman, The PASCAL Visual Object Classes Challenge 2009 (VOC2009) Results, http://www.pascal-network.org/challenges/VOC/voc2009/workshop/index.html.\n\n[6] L. Fleischer and S. Iwata, A push-relabel framework for submodular function minimization and applications to parametric optimization, Discrete Applied Mathematics 131 (2003), no. 2, 311\u2013322.\n\n[7] D. Freedman and P. Drineas, Energy minimization via graph cuts: Settling what is possible, IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2005. CVPR 2005, vol. 2, 2005.\n\n[8] Satoru Fujishige, Takumi Hayashi, and Shigueo Isotani, The Minimum-Norm-Point Algorithm Applied to Submodular Function Minimization and Linear Programming, (2006), 1\u201319.\n\n[9] E. Hazan and S. Kale, Beyond convexity: Online submodular minimization, Advances in Neural Information Processing Systems 22 (Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, eds.), 2009, pp. 700\u2013708.\n\n[10] T. Itoko and S. Iwata, Computational geometric approach to submodular function minimization for multiclass queueing systems, Integer Programming and Combinatorial Optimization (2007), 267\u2013279.\n\n[11] S. Iwata, L. Fleischer, and S. Fujishige, A combinatorial strongly polynomial algorithm for minimizing submodular functions, Journal of the ACM (JACM) 48 (2001), no. 4, 777.\n\n[12] S. Iwata and J.B. Orlin, A simple combinatorial algorithm for submodular function minimization, Proceedings of the Nineteenth Annual ACM-SIAM Symposium on Discrete Algorithms, Society for Industrial and Applied Mathematics, 2009, pp. 1230\u20131237.\n\n[13] P. Kohli, M.P. Kumar, and P.H.S. Torr, P3 & Beyond: Solving Energies with Higher Order Cliques, 2007 IEEE Conference on Computer Vision and Pattern Recognition (2007), 1\u20138.\n\n[14] P. Kohli, L. Ladick\u00fd, and P.H.S. 
Torr, Robust Higher Order Potentials for Enforcing Label Consistency, International Journal of Computer Vision 82 (2009), no. 3, 302\u2013324.\n\n[15] A. Krause, SFO: A Toolbox for Submodular Function Optimization, The Journal of Machine Learning Research 11 (2010), 1141\u20131144.\n\n[16] L. Lov\u00e1sz, Submodular functions and convexity, Mathematical programming: the state of the art, Bonn (1982), 235\u2013257.\n\n[17] G. Nemhauser, L. Wolsey, and M. Fisher, An analysis of the approximations for maximizing submodular set functions, Mathematical Programming 14 (1978), 265\u2013294.\n\n[18] Yu. Nesterov, Smooth minimization of non-smooth functions, Mathematical Programming 103 (2004), no. 1, 127\u2013152.\n\n[19] M. Queyranne, Minimizing symmetric submodular functions, Mathematical Programming 82 (1998), no. 1-2, 3\u201312.\n\n[20] J. Shotton, J. Winn, C. Rother, and A. Criminisi, TextonBoost for Image Understanding: Multi-Class Object Recognition and Segmentation by Jointly Modeling Texture, Layout, and Context, Int. J. Comput. Vision 81 (2009), no. 1, 2\u201323.\n\n[21] P. Stobbe and A. Krause, Efficient minimization of decomposable submodular functions, arXiv:1010.5511 (2010).", "award": [], "sourceid": 480, "authors": [{"given_name": "Peter", "family_name": "Stobbe", "institution": null}, {"given_name": "Andreas", "family_name": "Krause", "institution": null}]}