{"title": "On the Convergence Rate of Decomposable Submodular Function Minimization", "book": "Advances in Neural Information Processing Systems", "page_first": 640, "page_last": 648, "abstract": "Submodular functions describe a variety of discrete problems in machine learning, signal processing, and computer vision. However, minimizing submodular functions poses a number of algorithmic challenges. Recent work introduced an easy-to-use, parallelizable algorithm for minimizing submodular functions that decompose as the sum of simple\" submodular functions. Empirically, this algorithm performs extremely well, but no theoretical analysis was given. In this paper, we show that the algorithm converges linearly, and we provide upper and lower bounds on the rate of convergence. Our proof relies on the geometry of submodular polyhedra and draws on results from spectral graph theory.\"", "full_text": "On the Convergence Rate of Decomposable\n\nSubmodular Function Minimization\n\nRobert Nishihara, Stefanie Jegelka, Michael I. Jordan\n\nElectrical Engineering and Computer Science\n\nUniversity of California\n\nBerkeley, CA 94720\n\n{rkn,stefje,jordan}@eecs.berkeley.edu\n\nAbstract\n\nSubmodular functions describe a variety of discrete problems in machine learn-\ning, signal processing, and computer vision. However, minimizing submodular\nfunctions poses a number of algorithmic challenges. Recent work introduced an\neasy-to-use, parallelizable algorithm for minimizing submodular functions that\ndecompose as the sum of \u201csimple\u201d submodular functions. Empirically, this al-\ngorithm performs extremely well, but no theoretical analysis was given. In this\npaper, we show that the algorithm converges linearly, and we provide upper and\nlower bounds on the rate of convergence. 
Our proof relies on the geometry of submodular polyhedra and draws on results from spectral graph theory.\n\n1 Introduction\n\nA large body of recent work demonstrates that many discrete problems in machine learning can be phrased as the optimization of a submodular set function [2]. A set function F : 2^V → R over a ground set V of N elements is submodular if the inequality F(A) + F(B) ≥ F(A ∪ B) + F(A ∩ B) holds for all subsets A, B ⊆ V. Problems like clustering [33], structured sparse variable selection [1], MAP inference with higher-order potentials [28], and corpus extraction problems [31] can be reduced to the problem of submodular function minimization (SFM), that is\n\nmin_{A ⊆ V} F(A). (P1)\n\nAlthough SFM is solvable in polynomial time, existing algorithms can be inefficient on large-scale problems. For this reason, the development of scalable, parallelizable algorithms has been an active area of research [24, 25, 29, 35]. Approaches to solving Problem (P1) are based either on combinatorial optimization or on convex optimization via the Lovász extension.\n\nFunctions that occur in practice are usually not arbitrary and frequently possess additional exploitable structure. For example, a number of submodular functions admit specialized algorithms that solve Problem (P1) very quickly. Examples include cut functions on certain kinds of graphs, concave functions of the cardinality |A|, and functions counting joint ancestors in trees. We will use the term simple to refer to functions F for which we have a fast subroutine for minimizing F + s, where s ∈ R^N is any modular function. We treat these subroutines as black boxes. 
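To make the definition concrete, submodularity can be checked by brute force on a small ground set. The following sketch (the function choices and ground-set size are purely illustrative, not from the paper) verifies that a concave function of the cardinality satisfies the inequality while a convex one does not.

```python
from itertools import combinations
from math import sqrt

def subsets(V):
    """All subsets of V as frozensets."""
    return [frozenset(c) for r in range(len(V) + 1) for c in combinations(V, r)]

def is_submodular(F, V, tol=1e-9):
    """Brute-force check of F(A) + F(B) >= F(A | B) + F(A & B) for all A, B."""
    S = subsets(V)
    return all(F(A) + F(B) >= F(A | B) + F(A & B) - tol for A in S for B in S)

V = frozenset(range(4))
F = lambda A: sqrt(len(A))   # concave function of |A|, hence submodular
G = lambda A: len(A) ** 2    # convex function of |A|, not submodular
print(is_submodular(F, V), is_submodular(G, V))  # True False
```

The exponential cost of this check is exactly why the specialized subroutines for "simple" functions matter in practice.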
Many commonly occurring submodular functions (for example, graph cuts, hypergraph cuts, MAP inference with higher-order potentials [16, 28, 37], co-segmentation [22], certain structured-sparsity inducing functions [26], covering functions [35], and combinations thereof) can be expressed as a sum\n\nF(A) = Σ_{r=1}^R F_r(A) (1)\n\nof simple submodular functions. Recent work demonstrates that this structure offers important practical benefits [25, 29, 35]. For instance, it admits iterative algorithms that minimize each F_r separately and combine the results in a straightforward manner (for example, dual decomposition).\n\nIn particular, it has been shown that the minimization of decomposable functions can be rephrased as a best-approximation problem, the problem of finding the closest points in two convex sets [25]. This formulation brings together SFM and classical projection methods and yields empirically fast, parallel, and easy-to-implement algorithms. However, the performance of projection methods depends heavily on the specific geometry of the problem at hand and is not well understood in general. Indeed, while Jegelka et al. [25] show good empirical results, the analysis of this alternative approach to SFM was left as an open problem.\n\nContributions. In this work, we study the geometry of the submodular best-approximation problem and ground the prior empirical results in theoretical guarantees. We show that SFM via alternating projections, or block coordinate descent, converges at a linear rate. We show that this rate holds for the best-approximation problem, relaxations of SFM, and the original discrete problem. More importantly, we prove upper and lower bounds on the worst-case rate of convergence. Our proof relies on analyzing angles between the polyhedra associated with submodular functions and draws on results from spectral graph theory. 
It offers insight into the geometry of submodular polyhedra\nthat may be bene\ufb01cial beyond the analysis of projection algorithms.\nSubmodular minimization. The \ufb01rst polynomial-time algorithm for minimizing arbitrary submod-\nular functions was a consequence of the ellipsoid method [19]. Strongly and weakly polynomial-\ntime combinatorial algorithms followed [32]. The current fastest running times are O(N 5\u23271 + N 6)\n[34] in general and O((N 4\u23271 + N 5) log Fmax) for integer-valued functions [23], where Fmax =\nmaxA |F (A)| and \u23271 is the time required to evaluate F . Some work has addressed decomposable\nfunctions [25, 29, 35]. The running times in [29] apply to integer-valued functions and range from\nO((N + R)2 log Fmax) for cuts to O((N + Q2R)(N + Q2R + QR\u23272) log Fmax), where Q \uf8ff N is\nthe maximal cardinality of the support of any Fr, and \u23272 is the time required to minimize a simple\nfunction. Stobbe and Krause [35] use a convex optimization approach based on Nesterov\u2019s smooth-\ning technique. They achieve a (sublinear) convergence rate of O(1/k) for the discrete SFM problem.\nTheir results and our results do not rely on the function being integral.\nProjection methods. Algorithms based on alternating projections between convex sets (and related\nmethods such as the Douglas\u2013Rachford algorithm) have been studied extensively for solving convex\nfeasibility and best-approximation problems [4, 5, 7, 11, 12, 20, 21, 36, 38]. See Deutsch [10] for a\nsurvey of applications. In the simple case of subspaces, the convergence of alternating projections\nhas been characterized in terms of the Friedrichs angle cF between the subspaces [5, 6]. There are\noften good ways to compute cF (see Lemma 6), which allow us to obtain concrete linear rates of\nconvergence for subspaces. The general case of alternating projections between arbitrary convex\nsets is less well understood. 
Bauschke and Borwein [3] give a general condition for the linear convergence of alternating projections in terms of the value κ∗ (defined in Section 3.1). However, except in very limited cases, it is unclear how to compute or even bound κ∗. While it is known that κ∗ < ∞ for polyhedra [5, Corollary 5.26], the rate may be arbitrarily slow, and the challenge is to bound the linear rate away from one. We are able to give a specific uniform linear rate for the submodular polyhedra that arise in SFM.\n\nAlthough both κ∗ and cF are useful quantities for understanding the convergence of projection methods, they largely have been studied independently of one another. In this work, we relate these two quantities for polyhedra, thereby obtaining some of the generality of κ∗ along with the computability of cF. To our knowledge, we are the first to relate κ∗ and cF outside the case of subspaces. We feel that this connection may be useful beyond the context of submodular polyhedra.\n\n1.1 Background\n\nThroughout this paper, we assume that F is a sum of simple submodular functions F_1, . . . , F_R and that F(∅) = 0. Points s ∈ R^N can be identified with (modular) set functions via s(A) = Σ_{n∈A} s_n. The base polytope of F is defined as the set of all modular functions that are dominated by F and that sum to F(V),\n\nB(F) = {s ∈ R^N | s(A) ≤ F(A) for all A ⊆ V and s(V) = F(V)}.\n\nThe Lovász extension f : R^N → R of F can be written as the support function of the base polytope, that is f(x) = max_{s ∈ B(F)} s^T x. Even though B(F) may have exponentially many faces, the extension f can be evaluated in O(N log N) time [15]. The discrete SFM problem (P1) can be relaxed to the non-smooth convex optimization problem\n\nmin_{x ∈ [0,1]^N} f(x) ≡ min_{x ∈ [0,1]^N} Σ_{r=1}^R f_r(x), (P2)\n\nwhere f_r is the Lovász extension of F_r. 
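The support-function characterization yields the standard greedy evaluation of the Lovász extension: sort the coordinates of x in decreasing order and weight the marginal gains of F. A minimal sketch, assuming F(∅) = 0 (the example function is illustrative):

```python
from math import sqrt, isclose

def lovasz_extension(F, x):
    """Evaluate f(x) = max_{s in B(F)} <s, x> by the greedy algorithm:
    visit coordinates of x in decreasing order, weighting marginal gains."""
    order = sorted(range(len(x)), key=lambda i: -x[i])
    value, A = 0.0, set()
    for i in order:
        value += x[i] * (F(A | {i}) - F(A))   # marginal gain of adding i
        A.add(i)
    return value

F = lambda A: sqrt(len(A))                    # a simple submodular function
# On indicator vectors the extension agrees with F: f(1_A) = F(A).
assert isclose(lovasz_extension(F, [1.0, 1.0, 0.0, 0.0]), F({0, 1}))
```

The sort dominates the cost, giving the O(N log N) evaluation time mentioned above (plus N oracle calls to F).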
This relaxation is exact: rounding an optimal continuous solution yields the indicator vector of an optimal discrete solution. The formulation in Problem (P2) is amenable to dual decomposition [30] and smoothing techniques [35], but suffers from the non-smoothness of f [25]. Alternatively, we can formulate a proximal version of the problem\n\nmin_{x ∈ R^N} f(x) + (1/2)||x||^2 ≡ min_{x ∈ R^N} Σ_{r=1}^R (f_r(x) + (1/2R)||x||^2). (P3)\n\nBy thresholding the optimal solution of Problem (P3) at zero, we recover the indicator vector of an optimal discrete solution [17], [2, Proposition 8.4].\n\nLemma 1. [25] The dual of the right-hand side of Problem (P3) is the best-approximation problem\n\nmin ||a − b||^2  s.t. a ∈ A, b ∈ B, (P4)\n\nwhere A = {(a_1, . . . , a_R) ∈ R^{NR} | Σ_{r=1}^R a_r = 0} and B = B(F_1) × · · · × B(F_R).\n\nLemma 1 implies that we can minimize a decomposable submodular function by solving Problem (P4), which means finding the closest points between the subspace A and the product B of base polytopes. Projecting onto A is straightforward because A is a subspace. Projecting onto B amounts to projecting onto each B(F_r) separately. The projection Π_{B(F_r)} z of a point z onto B(F_r) may be computed by minimizing F_r − z [25]. We can compute these projections easily because each F_r is simple.\n\nThroughout this paper, we use A and B to refer to the specific polyhedra defined in Lemma 1 (which live in R^{NR}) and P and Q to refer to general polyhedra (sometimes arbitrary convex sets) in R^D. Note that the polyhedron B depends on the submodular functions F_1, . . . , F_R, but we omit the dependence to simplify our notation. 
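As a concrete illustration of why projecting onto A is straightforward: A is the subspace of block vectors summing to zero, so its orthogonal projection simply subtracts the blockwise mean. A small sketch (shapes and data are illustrative):

```python
import numpy as np

def project_onto_A(w):
    """Orthogonal projection of w = (w_1, ..., w_R), stored as an (R, N)
    array, onto A = {(a_1, ..., a_R) : sum_r a_r = 0}: subtract the mean."""
    return w - w.mean(axis=0, keepdims=True)

rng = np.random.default_rng(0)
w = rng.standard_normal((3, 5))               # R = 3 blocks in R^5
a = project_onto_A(w)
assert np.allclose(a.sum(axis=0), 0)          # a lies in A
# The residual w - a repeats the same vector in every block, so it is
# orthogonal to every point of A; hence a is the closest point in A.
assert np.allclose(w - a, (w - a)[0])
```

The projection onto B, by contrast, is the part delegated to the black-box subroutines for the simple functions F_r.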
Our bound will be uniform over all submodular functions.\n\n2 Algorithm and Idea of Analysis\n\nA popular class of algorithms for solving best-approximation problems are projection methods [5].\nThe most straightforward approach uses alternating projections (AP) or block coordinate descent.\nStart with any point a0 2A , and inductively generate two sequences via bk =\u21e7 Bak and ak+1 =\n\u21e7Abk. Given the nature of A and B, this algorithm is easy to implement and use in our setting, and\nit solves Problem (P4) [25]. This is the algorithm that we will analyze.\nThe sequence (ak, bk) will eventually converge to an optimal pair (a\u21e4, b\u21e4). We say that AP converges\nlinearly with rate \u21b5< 1 if kaka\u21e4k \uf8ff C1\u21b5k and kbkb\u21e4k \uf8ff C2\u21b5k for all k and for some constants\nC1 and C2. Smaller values of \u21b5 are better.\nAnalysis: Intuition. We will provide a detailed analysis of the convergence of AP for the polyhedra\nA and B. To motivate our approach, we \ufb01rst provide some intuition with the following much-\nsimpli\ufb01ed setup. Let U and V be one-dimensional subspaces spanned by the unit vectors u and v\nrespectively. In this case, it is known that AP converges linearly with rate cos2 \u2713, where \u2713 2 [0, \u21e1\n2 ]\nis the angle such that cos \u2713 = u>v. The smaller the angle, the slower the rate of convergence.\nFor subspaces U and V of higher dimension, the relevant generalization of the \u201cangle\u201d between the\nsubspaces is the Friedrichs angle [11, De\ufb01nition 9.4], whose cosine is given by\n\ncF (U, V ) = supu>v | u 2 U \\ (U \\ V )?, v 2 V \\ (U \\ V )?,kuk \uf8ff 1,kvk \uf8ff 1 .\n\n(2)\nIn \ufb01nite dimensions, cF (U, V ) < 1. In general, when U and V are subspaces of arbitrary dimension,\nAP will converge linearly with rate cF (U, V )2 [11, Theorem 9.8]. 
If U and V are af\ufb01ne spaces, AP\nstill converges linearly with rate cF (U u, V v)2, where u 2 U and v 2 V .\nWe are interested in rates for polyhedra P and Q, which we de\ufb01ne as the intersection of \ufb01nitely\nmany halfspaces. We generalize the preceding results by considering all pairs (Px, Qy) of\n\n3\n\n\fP\n\nQ\n\nE\nv\n\nH\n\nP\n\nQ0\n\nE\n\nFigure 1: The optimal sets E, H in Equation (4), the vector v, and the shifted polyhedron Q0.\n\np2P\n\nfaces of P and Q and showing that the convergence rate of AP between P and Q is at worst\nmaxx,y cF (a\u21b50(Px), a\u21b50(Qy))2, where a\u21b5(C) is the af\ufb01ne hull of C and a\u21b50(C) = a\u21b5(C) c\nfor some c 2 C. The faces {Px}x2RD of P are de\ufb01ned as the nonempty maximizers of linear\nfunctions over P , that is\n(3)\n\nPx = arg max\n\nx>p.\n\nWhile we look at angles between pairs of faces, we remark that Deutsch and Hundal [13] consider a\ndifferent generalization of the \u201cangle\u201d between arbitrary convex sets.\nRoadmap of the Analysis. Our analysis has two main parts. First, we relate the convergence rate\nof AP between polyhedra P and Q to the angles between the faces of P and Q. To do so, we give a\ngeneral condition under which AP converges linearly (Theorem 2), which we show depends on the\nangles between the faces of P and Q (Corollary 5) in the polyhedral case. Second, we specialize\nto the polyhedra A and B, and we equate the angles with eigenvalues of certain matrices and use\ntools from spectral graph theory to bound the relevant eigenvalues in terms of the conductance of a\nspeci\ufb01c graph. 
This yields a worst-case bound of 1 − 1/(N^2 R^2) on the rate, stated in Theorem 12. In Theorem 14, we show a lower bound of 1 − 2π^2/(N^2 R) on the worst-case convergence rate.\n\n3 The Upper Bound\n\nWe first derive an upper bound on the rate of convergence of AP between the polyhedra A and B. The results in this section are proved in Appendix A.\n\n3.1 A Condition for Linear Convergence\n\nWe begin with a condition under which AP between two closed convex sets P and Q converges linearly. This result is similar to that of Bauschke and Borwein [3, Corollary 3.14], but the rate we achieve is twice as fast and relies on slightly weaker assumptions.\n\nWe will need a few definitions from Bauschke and Borwein [3]. Let d(K_1, K_2) = inf{||k_1 − k_2|| : k_1 ∈ K_1, k_2 ∈ K_2} be the distance between sets K_1 and K_2. Define the sets of \u201cclosest points\u201d as\n\nE = {p ∈ P | d(p, Q) = d(P, Q)},  H = {q ∈ Q | d(q, P) = d(Q, P)}, (4)\n\nand let v = Π_{Q−P}(0), the projection of the origin onto Q − P (see Figure 1). Note that H = E + v, and when P ∩ Q ≠ ∅ we have v = 0 and E = H = P ∩ Q. Therefore, we can think of the pair (E, H) as a generalization of the intersection P ∩ Q to the setting where P and Q do not intersect. Pairs of points (e, e + v) ∈ E × H are solutions to the best-approximation problem between P and Q. In our analysis, we will mostly study the translated version Q' = Q − v of Q, which intersects P at E.\n\nFor x ∈ R^D \\ E, the function κ relates the distance to E with the distances to P and Q',\n\nκ(x) = d(x, E) / max{d(x, P), d(x, Q')}.\n\nIf κ is bounded, then whenever x is close to both P and Q', it must also be close to their intersection. If, for example, D ≥ 2 and P and Q are balls of radius one whose centers are separated by distance exactly two, then κ is unbounded. The maximum κ∗ = sup_{x ∈ (P ∪ Q') \\ E} κ(x) is useful for bounding the convergence rate.\n\nTheorem 2. Let P and Q be convex sets, and suppose that κ∗ < ∞. Then AP between P and Q converges linearly with rate 1 − 1/κ∗^2. Specifically,\n\n||p_k − p*|| ≤ 2 ||p_0 − p*|| (1 − 1/κ∗^2)^k  and  ||q_k − q*|| ≤ 2 ||q_0 − q*|| (1 − 1/κ∗^2)^k.\n\n3.2 Relating κ∗ to the Angles Between Faces of the Polyhedra\n\nIn this section, we consider the case of polyhedra P and Q, and we bound κ∗ in terms of the angles between pairs of their faces. In Lemma 3, we show that κ is nondecreasing along the sequence of points generated by AP between P and Q'. We treat points p for which κ(p) = 1 separately because those are the points from which AP between P and Q' converges in one step. This lemma enables us to bound κ(p) by initializing AP at p and bounding κ at some later point in the resulting sequence.\n\nLemma 3. For any p ∈ P \\ E, either κ(p) = 1 or 1 < κ(p) ≤ κ(Π_{Q'} p). Similarly, for any q ∈ Q' \\ E, either κ(q) = 1 or 1 < κ(q) ≤ κ(Π_P q).\n\nWe can now bound κ by angles between faces of P and Q.\n\nProposition 4. If P and Q are polyhedra and p ∈ P \\ E, then there exist faces P_x and Q_y such that\n\n1 − 1/κ(p)^2 ≤ cF(aff_0(P_x), aff_0(Q_y))^2.\n\nThe analogous statement holds when we replace p ∈ P \\ E with q ∈ Q' \\ E. Note that aff_0(Q_y) = aff_0(Q'_y). Proposition 4 immediately gives us the following corollary.\n\nCorollary 5. If P and Q are polyhedra, then\n\n1 − 1/κ∗^2 ≤ max_{x,y ∈ R^D} cF(aff_0(P_x), aff_0(Q_y))^2.\n\n3.3 Angles Between Subspaces and Singular Values\n\nCorollary 5 leaves us with the task of bounding the Friedrichs angle. To do so, we first relate the Friedrichs angle to the singular values of certain matrices in Lemma 6. We then specialize this to base polyhedra of submodular functions. 
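As a sanity check of the singular-value correspondence in Lemma 6, consider two lines in R^2 written as nullspaces of matrices with orthonormal rows; the singular value of ST^T recovers the cosine of the angle between the lines (a toy illustration; the angle is arbitrary):

```python
import numpy as np

theta = 0.7
# Rows of S and T are orthonormal; null(S) and null(T) are lines in R^2.
S = np.array([[1.0, 0.0]])                        # null(S) = span{(0, 1)}
T = np.array([[np.sin(theta), -np.cos(theta)]])   # null(T) = span{(cos t, sin t)}
sv = np.linalg.svd(S @ T.T, compute_uv=False)
cF = max((s for s in sv if s < 1 - 1e-12), default=0.0)
# For two distinct lines through the origin, the Friedrichs angle is the
# angle between them: here cos = |<(0,1), (cos theta, sin theta)>| = sin(theta).
assert np.isclose(cF, np.sin(theta))
```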
For convenience, we prove Lemma 6 in Appendix A.5,\nthough this result is implicit in the characterization of principal angles between subspaces given\nin [27, Section 1]. Ideas connecting angles between subspaces and eigenvalues are also used by\nDiaconis et al. [14].\nLemma 6. Let S and T be matrices with orthonormal rows and with equal numbers of columns.\nIf all of the singular values of ST > equal one, then cF (null(S), null(T )) = 0. Otherwise,\ncF (null(S), null(T )) is equal to the largest singular value of ST > that is less than one.\nFaces of relevant polyhedra. Let Ax and By be faces of the polyhedra A and B from Lemma 1.\nSince A is a vector space, its only nonempty face is Ax = A. Hence, Ax = null(S), where S is an\nN \u21e5 N R matrix of N \u21e5 N identity matrices IN:\n|\n\npR\u2713 IN \u00b7\u00b7\u00b7\nrepeated R times \u25c6.\n{z\n}\n\nThe matrix for a\u21b50(By) requires a bit more elaboration. Since B is a Cartesian product, we have\nBy = B(F1)y1 \u21e5\u00b7\u00b7\u00b7\u21e5 B(FR)yR, where y = (y1, . . . , yR) and B(Fr)yr is a face of B(Fr). To\nproceed, we use the following characterization of faces of base polytopes [2, Proposition 4.7].\nProposition 7. Let F be a submodular function, and let B(F )x be a face of B(F ). Then there exists\na partition of V into disjoint sets A1, . . . , AM such that\n\nS =\n\n(5)\n\n1\n\nIN\n\na\u21b5(B(F )x) =\n\nM\\m=1\n\n{s 2 RN | s(A1 [\u00b7\u00b7\u00b7[ Am) = F (A1 [\u00b7\u00b7\u00b7[ Am)}.\n\n5\n\n\fThe following corollary is immediate.\nCorollary 8. De\ufb01ne F , B(F )x, and A1, . . . , AM as in Proposition 7. Then\n\na\u21b50(B(F )x) =\n\nM\\m=1\n\n{s 2 RN | s(A1 [\u00b7\u00b7\u00b7[ Am) = 0}.\n\nBy Corollary 8, for each Fr, there exists a partition of V into disjoint sets Ar1, . . . , ArMr such that\n\na\u21b50(By) =\n\n{(s1, . . . 
, sR) 2 RN R | sr(Ar1 [\u00b7\u00b7\u00b7[ Arm) = 0}.\n\n(6)\n\nR\\r=1\n\nMr\\m=1\n\nIn other words, we can write a\u21b50(By) as the nullspace of either of the matrices\n\n1>A11...\n\n1>A11[\u00b7\u00b7\u00b7[A1M1\n\n...\n\nT 0 =\n\n0BBBBBBBBBBBB@\n\nor T =\n\n1CCCCCCCCCCCCA\n\n1>AR1...\n\n1>AR1[\u00b7\u00b7\u00b7[ARMR\n\n0BBBBBBBBBBBBBBBB@\n\n1>A11p|A11|\n...\n1>A1M1p|A1M1|\n\n...\n\n1>AR1p|AR1|\n...\n\n1>ARMR\np|ARMR|\n\n,\n\n1CCCCCCCCCCCCCCCCA\n\n1\n\npR\u2713 1A11p|A11|\n\nwhere 1A is the indicator vector of A \u2713 V . For T 0, this follows directly from Equation (6). T\ncan be obtained from T 0 via left multiplication by an invertible matrix, so T and T 0 have the same\nnullspace. Lemma 6 then implies that cF (a\u21b50(Ax), a\u21b50(By)) equals the largest singular value of\n\n1A1M1p|A1M1|\n\nST > =\n\n\u00b7\u00b7\u00b7\n\n\u00b7\u00b7\u00b7\n\n\u00b7\u00b7\u00b7\nthat is less than one. We rephrase this conclusion in the following remark.\nRemark 9. The largest eigenvalue of (ST >)>(ST >) less than one equals cF (a\u21b50(Ax), a\u21b50(By))2.\nLet Mall = M1 +\u00b7\u00b7\u00b7 + MR. Then (ST >)>(ST >) is the Mall \u21e5 Mall square matrix whose rows and\ncolumns are indexed by (r, m) with 1 \uf8ff r \uf8ff R and 1 \uf8ff m \uf8ff Mr and whose entry corresponding\nto row (r1, m1) and column (r2, m2) equals\n\n1AR1p|AR1|\n\n1ARMR\n\np|ARMR| \u25c6\n\n1\nR\n\n1>Ar1m1\n\n1Ar2m2\n\np|Ar1m1||Ar2m2|\n\n3.4 Bounding the Relevant Eigenvalues\n\n=\n\n1\nR\n\n|Ar1m1 \\ Ar2m2|\n\np|Ar1m1||Ar2m2|\n\n.\n\nIt remains to bound the largest eigenvalue of (ST >)>(ST >) that is less than one. To do so, we view\nthe matrix in terms of the symmetric normalized Laplacian of a weighted graph. Let G be the graph\nwhose vertices are indexed by (r, m) with 1 \uf8ff r \uf8ff R and 1 \uf8ff m \uf8ff Mr. Let the edge between\nvertices (r1, m1) and (r2, m2) have weight |Ar1m1 \\ Ar2m2|. 
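For a small decomposition, the Gram matrix with the entries computed above and the weighted graph G just defined can be built explicitly; the following sketch (the ground set and partitions are illustrative) checks the entry formula |A ∩ A'| / (R √(|A||A'|)) against the normalized-Laplacian identity (ST^T)^T(ST^T) = I − ((R−1)/R) L stated in the text:

```python
import numpy as np

# Two partitions of V = {0,1,2,3}, one per simple function F_r (illustrative).
parts = [[{0, 1}, {2, 3}],            # partition for F_1
         [{0}, {1, 2, 3}]]            # partition for F_2
R = len(parts)
blocks = [(r, A) for r, P in enumerate(parts) for A in P]
n = len(blocks)

# Gram matrix of S T^T: entry = |A ∩ A'| / (R * sqrt(|A| |A'|)).
M = np.array([[len(A & B) / (R * np.sqrt(len(A) * len(B)))
               for _, B in blocks] for _, A in blocks])

# Graph G: edge weight |A ∩ A'| between blocks of *different* partitions,
# so each block has degree (R - 1)|A|.
W = np.array([[len(A & B) if r != s else 0
               for s, B in blocks] for r, A in blocks])
d = W.sum(axis=1)
Lsym = np.eye(n) - W / np.sqrt(np.outer(d, d))   # normalized Laplacian

assert np.allclose(M, np.eye(n) - ((R - 1) / R) * Lsym)
```

Since L is positive semidefinite with smallest eigenvalue zero, this identity is what lets the eigenvalue of interest be read off from the spectral gap of G.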
We may assume that G is connected\n(the analysis in this case subsumes the analysis in the general case). The symmetric normalized\nLaplacian L of this graph is closely related to our matrix of interest,\n\n(ST >)>(ST >) = I R1\n\nR L.\n\n(7)\n\nHence, the largest eigenvalue of (ST >)>(ST >) that is less than one can be determined from the\nsmallest nonzero eigenvalue 2(L) of L. We bound 2(L) via Cheeger\u2019s inequality (stated in Ap-\npendix A.6) by bounding the Cheeger constant hG of G.\nLemma 10. For R 2, we have hG 2\n\nN R and hence 2(L) 2\n\nN 2R2 .\n\n6\n\n\fWe prove Lemma 10 in Appendix A.7. Combining Remark 9, Equation (7), and Lemma 10, we\nobtain the following bound on the Friedrichs angle.\nProposition 11. Assuming that R 2, we have\n\ncF (a\u21b50(Ax), a\u21b50(By))2 \uf8ff 1 R1\n\nR\n\n2\n\nN 2R2 \uf8ff 1 1\n\nN 2R2 .\n\nTogether with Theorem 2 and Corollary 5, Proposition 11 implies the \ufb01nal bound on the rate.\nTheorem 12. The AP algorithm for Problem (P4) converges linearly with rate 1 1\nN 2R2 , i.e.,\nN 2R2 )k.\nkbk b\u21e4k \uf8ff 2kb0 b\u21e4k(1 1\n\nkak a\u21e4k \uf8ff 2ka0 a\u21e4k(1 1\n\nN 2R2 )k\n\nand\n\n4 A Lower Bound\n\nTo probe the tightness of Theorem 12, we construct a \u201cbad\u201d submodular function and decomposition\nthat lead to a slow rate. Appendix B gives the formal details. Our example is an augmented cut\nfunction on a cycle: for each x, y 2 V , de\ufb01ne Gxy to be the cut function of a single edge (x, y),\n\nGxy =\u21e21\n\n0\n\nif |A \\{ x, y}| = 1\notherwise .\n\nTake N to be even and R 2 and de\ufb01ne the submodular function F lb = F lb\n\nF lb\n1 = G12 + G34 + \u00b7\u00b7\u00b7 + G(N1)N\n\n1 + \u00b7\u00b7\u00b7 + F lb\nF lb\n2 = G23 + G45 + \u00b7\u00b7\u00b7 + GN 1\n\nR , where\n\nr = 0 for all r 3. The optimal solution to the best-approximation problem is the all zeros\n\nand F lb\nvector.\nLemma 13. 
The cosine of the Friedrichs angle between A and a\u21b5(Blb) is\nR1 cos 2\u21e1\nN .\n\u21e4 = cF (A, a\u21b5(Blb))2, and\n\nAround the optimal solution 0, the polyhedra A and Blb behave like subspaces, and it is possible to\npick initializations a0 2A and b0 2B lb such that the Friedrichs angle exactly determines the rate\nof convergence. That means 1 1/\uf8ff2\n\ncF (A, a\u21b5(Blb))2 = 1 1\n\nkakk = (1 1\n\nR (1 cos( 2\u21e1\n\nN )))kka0k\n\nand\n\nkbkk = (1 1\n\nR (1 cos( 2\u21e1\n\nN )))kkb0k.\n\n2 x2 leads to the following lower bound on the rate.\n\nBounding 1 cos(x) \uf8ff 1\nTheorem 14. There exists a decomposed function F lb and initializations for which the convergence\nrate of AP is at least 1 2\u21e12\nN 2R.\nThis theoretical bound can also be observed empirically (Figure 3 in Appendix B).\n\n5 Convergence of the Primal Objective\nWe have shown that AP generates a sequence of points {ak}k0 and {bk}k0 in RN R such that\n(ak, bk) ! (a\u21e4, b\u21e4) linearly, where (a\u21e4, b\u21e4) minimizes the objective in Problem (P4). In this section,\nwe show that this result also implies the linear convergence of the objective in Problem (P3) and of\nthe original discrete objective in Problem (P1). The proofs may be found in Appendix C.\nDe\ufb01ne the matrix = R1/2S, where S is the matrix de\ufb01ned in Equation (5). Multiplication by \nmaps a vector (w1, . . . , wR) to Pr wr, where wr 2 RN for each r. Set xk = bk and x\u21e4 = b\u21e4.\nAs shown in Jegelka et al. [25], Problem (P3) is minimized by x\u21e4.\nProposition 15. We have f (xk) + 1\n\nN 2R2 .\n2kx\u21e4k2 linearly with rate 1 1\nThis linear rate of convergence translates into a linear rate for the original discrete problem.\nTheorem 16. Choose A\u21e4 2 arg minA\u2713V F (A). Let Ak be the suplevel set of xk with smallest\nvalue of F . Then F (Ak) ! F (A\u21e4) linearly with rate 1 1\n\n2kxkk2 ! 
f (x\u21e4) + 1\n\n2N 2R2 .\n\n7\n\n\f6 Discussion\n\nIn this work, we analyze projection methods for parallel SFM and give upper and lower bounds on\nthe linear rate of convergence. This means that the number of iterations required for an accuracy of\n\u270f is logarithmic in 1/\u270f, not linear as in previous work [35]. Our rate is uniform over all submodular\nfunctions. Moreover, our proof highlights how the number R of components and the facial structure\nof B affect the convergence rate. These insights may serve as guidelines when working with projec-\ntion algorithms and aid in the analysis of special cases. For example, reducing R is often possible.\nAny collection of Fr that have disjoint support, such as the cut functions corresponding to the rows\nor columns of a grid graph, can be grouped together without making the projection harder.\nOur analysis also shows the effects of additional properties of F . For example, suppose that F\nis separable, that is, F (V ) = F (S) + F (V \\S) for some nonempty S ( V . Then the subsets\nArm \u2713 V de\ufb01ning the relevant faces of B satisfy either Arm \u2713 S or Arm \u2713 Sc [2]. This makes G\nin Section 3.4 disconnected, and as a result, the N in Theorem 12 gets replaced by max{|S|,|Sc|}\nfor an improved rate. This applies without the user needing to know S when running the algorithm.\nA number of future directions suggest themselves. For example, Jegelka et al. [25] also considered\nthe related Douglas\u2013Rachford (DR) algorithm. DR between subspaces converges linearly with rate\ncF [6], as opposed to c2\nF for AP. We suspect that our approach may be modi\ufb01ed to analyze DR\nbetween polyhedra. Further questions include the extension to cyclic updates (instead of parallel\nones), multiple polyhedra, and stochastic algorithms.\n\nAcknowledgments. We would like to thank M\u02d8ad\u02d8alina Persu for suggesting the use of Cheeger\u2019s\ninequality. 
This research is supported in part by NSF CISE Expeditions Award CCF-1139158,\nLBNL Award 7076018, and DARPA XData Award FA8750-12-2-0331, and gifts from Amazon\nWeb Services, Google, SAP, The Thomas and Stacey Siebel Foundation, Apple, C3Energy, Cisco,\nCloudera, EMC, Ericsson, Facebook, GameOnTalis, Guavus, HP, Huawei, Intel, Microsoft, NetApp,\nPivotal, Splunk, Virdata, VMware, WANdisco, and Yahoo!. This work is supported in part by the\nOf\ufb01ce of Naval Research under grant number N00014-11-1-0688, the US ARL and the US ARO\nunder grant number W911NF-11-1-0391, and the NSF under grant number DGE-1106400.\n\nReferences\n[1] F. Bach. Structured sparsity-inducing norms through submodular functions. In Advances in Neural Infor-\n\nmation Processing Systems, 2011.\n\n[2] F. Bach. Learning with submodular functions: A convex optimization perspective. Foundations and\n\nTrends in Machine Learning, 6(2-3):145\u2013373, 2013.\n\n[3] H. H. Bauschke and J. M. Borwein. On the convergence of von Neumann\u2019s alternating projection algo-\n\nrithm for two sets. Set-Valued Analysis, 1(2):185\u2013212, 1993.\n\n[4] H. H. Bauschke and J. M. Borwein. Dykstra\u2019s alternating projection algorithm for two sets. Journal of\n\nApproximation Theory, 79(3):418\u2013443, 1994.\n\n[5] H. H. Bauschke and J. M. Borwein. On projection algorithms for solving convex feasibility problems.\n\nSIAM Review, 38(3):367\u2013426, 1996.\n\n[6] H. H. Bauschke, J. B. Cruz, T. T. Nghia, H. M. Phan, and X. Wang. The rate of linear convergence of the\nDouglas\u2013Rachford algorithm for subspaces is the cosine of the Friedrichs angle. Journal of Approximation\nTheory, 185:63\u201379, 2014.\n\n[7] A. Beck and L. Tetruashvili. On the convergence of block coordinate descent type methods. SIAM Journal\n\non Optimization, 23(4):2037\u20132060, 2013.\n\n[8] J. V. Burke and J. J. Mor\u00b4e. On the identi\ufb01cation of active constraints. 
SIAM Journal on Numerical\n\nAnalysis, 25(5):1197\u20131211, 1988.\n\n[9] F. R. Chung. Spectral Graph Theory. American Mathematical Society, 1997.\n[10] F. Deutsch. The method of alternating orthogonal projections. In Approximation Theory, Spline Functions\n\nand Applications, pages 105\u2013121. Springer, 1992.\n\n[11] F. Deutsch. Best Approximation in Inner Product Spaces, volume 7. Springer, 2001.\n[12] F. Deutsch and H. Hundal. The rate of convergence of Dykstra\u2019s cyclic projections algorithm: The poly-\n\nhedral case. Numerical Functional Analysis and Optimization, 15(5-6):537\u2013565, 1994.\n\n8\n\n\f[13] F. Deutsch and H. Hundal. The rate of convergence for the cyclic projections algorithm I: angles between\n\nconvex sets. Journal of Approximation Theory, 142(1):36\u201355, 2006.\n\n[14] P. Diaconis, K. Khare, and L. Saloff-Coste. Stochastic alternating projections. Illinois Journal of Mathe-\n\nmatics, 54(3):963\u2013979, 2010.\n\n[15] J. Edmonds. Combinatorial Structures and Their Applications, chapter Submodular Functions, Matroids\n\nand Certain Polyhedra, pages 69\u201387. Gordon and Breach, 1970.\n\n[16] A. Fix, T. Joachims, S. Park, and R. Zabih. Structured learning of sum-of-submodular higher order energy\n\nfunctions. In Int. Conference on Computer Vision (ICCV), 2013.\n\n[17] S. Fujishige and S. Isotani. A submodular function minimization algorithm based on the minimum-norm\n\nbase. Paci\ufb01c Journal of Optimization, 7:3\u201317, 2011.\n\n[18] R. M. Gray. Toeplitz and circulant matrices: A review. Foundations and Trends in Communications and\n\nInformation Theory, 2(3):155\u2013239, 2006.\n\n[19] M. Gr\u00a8otschel, L. Lov\u00b4asz, and A. Schrijver. The ellipsoid method and its consequences in combinatorial\n\noptimization. Combinatorica, 1(2):169\u2013197, 1981.\n\n[20] L. Gubin, B. Polyak, and E. Raik. The method of projections for \ufb01nding the common point of convex\n\nsets. 
USSR Computational Mathematics and Mathematical Physics, 7(6):1\u201324, 1967.\n\n[21] I. Halperin. The product of projection operators. Acta Sci. Math. (Szeged), 23:96\u201399, 1962.\n[22] D. Hochbaum and V. Singh. An ef\ufb01cient algorithm for co-segmentation. In Int. Conference on Computer\n\nVision (ICCV), 2009.\n\n[23] S. Iwata. A faster scaling algorithm for minimizing submodular functions. SIAM J. on Computing, 32:\n\n833\u2013840, 2003.\n\n[24] S. Jegelka, H. Lin, and J. Bilmes. On fast approximate sumodular minimization. In Advances in Neural\n\nInformation Processing Systems, 2011.\n\n[25] S. Jegelka, F. Bach, and S. Sra. Re\ufb02ection methods for user-friendly submodular optimization. In Ad-\n\nvances in Neural Information Processing Systems, pages 1313\u20131321, 2013.\n\n[26] R. Jenatton, J. Mairal, G. Obozinski, and F. Bach. Proximal methods for hierarchical sparse coding.\n\nJMLR, page 22972334, 2011.\n\n[27] A. V. Knyazev and M. E. Argentati. Principal angles between subspaces in an A-based scalar product:\nalgorithms and perturbation estimates. SIAM Journal on Scienti\ufb01c Computing, 23(6):2008\u20132040, 2002.\n[28] P. Kohli, L. Ladick\u00b4y, and P. Torr. Robust higher order potentials for enforcing label consistency. Int.\n\nJournal of Computer Vision, 82, 2009.\n\n[29] V. Kolmogorov. Minimizing a sum of submodular functions. Discrete Applied Mathematics, 160(15):\n\n2246\u20132258, 2012.\n\n[30] N. Komodakis, N. Paragios, and G. Tziritas. MRF energy minimization and beyond via dual decomposi-\n\ntion. IEEE Trans. Pattern Analysis and Machine Intelligence, 2011.\n\n[31] H. Lin and J. Bilmes. Optimal selection of limited vocabulary speech corpora. In Proc. Interspeech, 2011.\n[32] S. McCormick. Handbook on Discrete Optimization, chapter Submodular Function Minimization, pages\n\n321\u2013391. Elsevier, 2006.\n\n[33] M. Narasimhan and J. Bilmes. Local search for balanced submodular clusterings. 
In IJCAI, pages 981\u2013\n\n986, 2007.\n\n[34] J. Orlin. A faster strongly polynomial time algorithm for submodular function minimization. Math.\n\nProgramming, 118:237\u2013251, 2009.\n\n[35] P. Stobbe and A. Krause. Ef\ufb01cient minimization of decomposable submodular functions. In Advances in\n\nNeural Information Processing Systems, 2010.\n\n[36] P. Tseng. Alternating projection-proximal methods for convex programming and variational inequalities.\n\nSIAM Journal on Optimization, 7(4):951\u2013965, 1997.\n\n[37] S. Vicente, V. Kolmogorov, and C. Rother. Joint optimization of segmentation and appearance models. In\n\nInt. Conference on Computer Vision (ICCV), 2009.\n\n[38] J. Von Neumann. Functional Operators: The Geometry of Orthogonal Spaces. Princeton University\n\nPress, 1950.\n\n9\n\n\f", "award": [], "sourceid": 447, "authors": [{"given_name": "Robert", "family_name": "Nishihara", "institution": "UC Berkeley"}, {"given_name": "Stefanie", "family_name": "Jegelka", "institution": "UC Berkeley"}, {"given_name": "Michael", "family_name": "Jordan", "institution": "UC Berkeley"}]}