{"title": "Submodular-Bregman and the Lov\u00e1sz-Bregman Divergences with Applications", "book": "Advances in Neural Information Processing Systems", "page_first": 2933, "page_last": 2941, "abstract": "", "full_text": "Submodular-Bregman and the Lov\u00b4asz-Bregman\n\nDivergences with Applications\n\nRishabh Iyer\n\nJeff Bilmes\n\nDepartment of Electrical Engineering\n\nDepartment of Electrical Engineering\n\nUniversity of Washington\n\nrkiyer@u.washington.edu\n\nUniversity of Washington\n\nbilmes@uw.edu\n\nAbstract\n\nWe introduce a class of discrete divergences on sets (equivalently binary vectors)\nthat we call the submodular-Bregman divergences. We consider two kinds, de\ufb01ned\neither from tight modular upper or tight modular lower bounds of a submodular\nfunction. We show that the properties of these divergences are analogous to the\n(standard continuous) Bregman divergence. We demonstrate how they generalize\nmany useful divergences, including the weighted Hamming distance, squared\nweighted Hamming, weighted precision, recall, conditional mutual information,\nand a generalized KL-divergence on sets. We also show that the generalized\nBregman divergence on the Lov\u00b4asz extension of a submodular function, which we\ncall the Lov\u00b4asz-Bregman divergence, is a continuous extension of a submodular\nBregman divergence. We point out a number of applications, and in particular show\nthat a proximal algorithm de\ufb01ned through the submodular Bregman divergence pro-\nvides a framework for many mirror-descent style algorithms related to submodular\nfunction optimization. We also show that a generalization of the k-means algorithm\nusing the Lov\u00b4asz Bregman divergence is natural in clustering scenarios where\nordering is important. 
A unique property of this algorithm is that computing the\nmean ordering is extremely ef\ufb01cient unlike other order based distance measures.\n\n1\n\nIntroduction\n\nThe Bregman divergence \ufb01rst appeared in the context of relaxation techniques in convex programming\n([4]), and has found numerous applications as a general framework in clustering ([2]), proximal\nminimization ([5]) and online learning ([27]). Many of these applications are due to the nice properties\nof the Bregman divergence, and the fact that they are parameterized by a single convex function.\nThey also generalize a large class of divergences on vectors. Recently Bregman divergences have\nalso been de\ufb01ned between matrices ([26, 6]) and between functions ([8]).\nIn this paper we de\ufb01ne a class of divergences between sets, where each divergence is parameterized by\na submodular function. This can alternatively and equivalently be seen as a divergence between binary\nvectors in the same way that submodular functions are special cases of pseudo-Boolean functions\n[3]. We call this the class of submodular Bregman divergences (or just submodular Bregman). We\nshow an interesting mathematical property of the submodular Bregman, namely that they can be\nde\ufb01ned based on either a tight modular (linear) upper bound or alternatively a tight modular lower\nbound, unlike the traditional (continuous) Bregman de\ufb01nable only via a tight linear lower bound.\nLet V refer to a \ufb01nite ground set {1, 2, . . . ,|V |}. A set function f : 2V \u2192 R is submodular if\n\u2200S, T \u2286 V , f (S) + f (T ) \u2265 f (S \u222a T ) + f (S \u2229 T ). Submodular functions have attractive properties\nthat make their exact or approximate optimization ef\ufb01cient and often practical. Submodularity\ncan be seen as a discrete counterpart to convexity and concavity ([20]) and often the problems are\nclosely related ([1]). 
Indeed, as we shall see in this paper, the connections between submodularity\n\n1\n\n\fand convexity and concavity will help us formulate certain discrete divergences that are analogous\nto the Bregman divergence. We in fact show a direct connection between a submodular Bregman\nand a generalized Bregman divergence de\ufb01ned through the Lov\u00b4asz extension. Further background\non submodular functions may be found in the text [9].\nAn outline of the paper follows. We \ufb01rst de\ufb01ne the different types of submodular Bregman in\nSection 2. We also de\ufb01ne the Lov\u00b4asz Bregman divergence, and show its relation to a version of\nthe submodular Bregman. Then in Section 3, we prove a number of properties of the submodular\nBregman and show how they are related to the Bregman divergence. Finally in Section 4, we show\nhow the submodular Bregman can be used in applications in machine learning. In particular, we show\nhow the proximal framework of the submodular Bregman generalizes a number of mirror-descent\nstyle approximate submodular optimization algorithms. We also consider generalizations of the\nk-means algorithm using the Lov\u00b4asz Bregman divergence, and show how they can be used in\nclustering applications where ordering or ranking is important.\n\n2 The Bregman and Submodular Bregman divergences\n\nNotation: We use \u03c6 to refer to a convex function, f to refer to a submodular function, and \u02c6f as\nf\u2019s Lov\u00b4asz extension. Lowercase characters x, y will refer to continuous vectors, while upper case\ncharacters X, Y, S will refer to sets. We will also refer to the characteristic vectors of a set X as\n1X \u2208 {0, 1}V . Note that the characteristic vector of a set X, 1X is such that 1X (j) = I(j \u2208 X),\nwhere I(\u00b7) is the standard indicator function. We will refer to the ground set as V , and the cardinality\nof the ground set as n = |V |. 
A divergence on vectors and sets is formally defined as follows:\nGiven a domain of vectors or sets S (and if sets, S = a lattice of sets L, where L is a lattice if\n\u2200X, Y \u2208 L, X \u222a Y, X \u2229 Y \u2208 L), a function d : S \u00d7 S \u2192 R+ is called a divergence if \u2200x, y \u2208 S,\nd(x, y) \u2265 0 and \u2200x \u2208 S, d(x, x) = 0. For simplicity, we consider mostly the Boolean lattice L = 2^V\nbut generalizations are possible as well [9].\n\n2.1 Bregman and Generalized Bregman divergences\n\nRecall the definition of the Bregman divergence d_\u03c6 : S \u00d7 S \u2192 R+:\n\nd_\u03c6(x, y) = \u03c6(x) \u2212 \u03c6(y) \u2212 \u27e8\u2207\u03c6(y), x \u2212 y\u27e9.    (1)\n\nFor non-differentiable convex functions we can extend equation (1) to define the generalized Bregman\ndivergence [13, 18]. Define a subgradient map H_\u03c6, which for every vector y, gives a subgradient\nH_\u03c6(y) = h_y \u2208 \u2202\u03c6(y) [13], where \u2202\u03c6(y) is the subdifferential of \u03c6 at y. Then:\n\nd^{H_\u03c6}_\u03c6(x, y) = \u03c6(x) \u2212 \u03c6(y) \u2212 \u27e8H_\u03c6(y), x \u2212 y\u27e9, \u2200x, y \u2208 S.    (2)\n\nWhen \u03c6 is differentiable, then \u2202\u03c6(x) = {\u2207\u03c6(x)} and H_\u03c6(y) = \u2207\u03c6(y). More generally, there may\nbe multiple distinct subgradients in the subdifferential, hence the generalized Bregman divergence is\nparameterized both by \u03c6 and the subgradient-map H_\u03c6. The generalized Bregman divergences have\nalso been defined in terms of \u201cextreme\u201d subgradients [25, 18]:\n\nd^\u266f_\u03c6(x, y) = \u03c6(x) \u2212 \u03c6(y) \u2212 \u03c3_{\u2202\u03c6(y)}(x \u2212 y) and d^\u266e_\u03c6(x, y) = \u03c6(x) \u2212 \u03c6(y) + \u03c3_{\u2202\u03c6(y)}(y \u2212 x),    (3)\n\nwhere, for a convex set C, \u03c3_C(\u00b7) \u225c max_{x\u2208C} \u27e8\u00b7, x\u27e9. Clearly, we then have d^\u266f_\u03c6(x, y) \u2264 d^{H_\u03c6}_\u03c6(x, y) \u2264\nd^\u266e_\u03c6(x, y), \u2200H_\u03c6, which justifies their being called the extreme generalized Bregman divergences [13].\n\n2.2 The Submodular Bregman divergences\n\nIn a similar spirit, we define a submodular Bregman divergence parameterized by a submodular\nfunction and defined as the difference between the function and its modular (sometimes called linear)\nbounds. Surprisingly, any submodular function has both a tight upper and lower modular bound\n([15]), unlike strict convexity where only a tight first-order lower bound exists. Hence, we define\ntwo distinct forms of submodular Bregman parameterized by a submodular function and in terms\nof either its tight upper or tight lower bounds.\n\n2.2.1 Lower bound form of the Submodular Bregman\n\nGiven a submodular function f, the submodular polymatroid P_f, the corresponding base polytope\nB_f and the subdifferential \u2202_f(Y) (at a set Y) for a submodular function f [9] are respectively:\n\nP_f = {x : x(S) \u2264 f(S), \u2200S \u2286 V}, B_f = P_f \u2229 {x : x(V) = f(V)}, and    (4)\n\u2202_f(Y) = {y \u2208 R^V : \u2200X \u2286 V, f(Y) \u2212 y(Y) \u2264 f(X) \u2212 y(X)}.    (5)\n\nNote that here y(S) = \u2211_{j\u2208S} y(j) is a modular function. In a manner similar to the generalized\nBregman divergence ([13]), we define a discrete subgradient map for a submodular function H_f,\nwhich for every set Y, picks a subgradient H_f(Y) = h_Y \u2208 \u2202_f(Y). 
Then, given a submodular\nfunction f and a subgradient-map H_f, the generalized lower bound submodular Bregman, which\nwe shall henceforth call d^{H_f}_f, is defined as:\n\nd^{H_f}_f(X, Y) = f(X) \u2212 f(Y) \u2212 h_Y(X) + h_Y(Y) = f(X) \u2212 f(Y) \u2212 \u27e8H_f(Y), 1_X \u2212 1_Y\u27e9.    (6)\n\nWe remark here that, similar to the definition of the generalized Bregman divergence, this submodular\nBregman is parameterized both by the submodular function f and the subgradient map H_f.\nThe subdifferential corresponding to a submodular function is an unbounded polyhedron [9], with\nan uncountable number of possible subgradients. Its extreme points, however, are easy to find\nand characterize using the greedy algorithm [7]. Thus, we define a subclass of d^{H_f}_f with H_f\nchosen so that it picks an extreme point of \u2202_f(Y), which we will call the permutation based lower\nbound submodular Bregman, henceforth referred to as d^\u03a3_f. The extreme points of \u2202_f(Y) can\nbe obtained via a greedy algorithm ([7, 9]) as follows: Let \u03c3 be a permutation of V and define\nS_i = {\u03c3(1), \u03c3(2), . . . , \u03c3(i)} as its corresponding chain. We define \u03a3_Y as the set of permutations \u03c3_Y\nsuch that their corresponding chains contain Y, meaning S_{|Y|} = Y. Then we can define a subgradient\nh_{Y,\u03c3_Y} (which is an extreme point of \u2202_f(Y)) where:\n\n\u2200\u03c3_Y \u2208 \u03a3_Y, h_{Y,\u03c3_Y}(\u03c3_Y(i)) = f(S_1) if i = 1, and f(S_i) \u2212 f(S_{i\u22121}) otherwise.    (7)\n\nIn the above, h_{Y,\u03c3_Y}(Y) = f(Y). Hence define H^\u03a3_f as a subgradient map which picks a subgradient\nh_{Y,\u03c3_Y}, for some \u03a3(Y) = \u03c3_Y \u2208 \u03a3_Y. Here we treat \u03a3 as a permutation operator which, for a given set\nY, produces a permutation \u03c3_Y \u2208 \u03a3_Y. Hence we can rewrite Eqn. (6), with the above subgradient, as\n\nd^\u03a3_f(X, Y) = f(X) \u2212 h_{Y,\u03c3_Y}(X) = f(X) \u2212 \u27e8H^\u03a3_f(Y), 1_X\u27e9.    (8)\n\nAs can readily be seen, the d^\u03a3_f are special cases of the d^{H_f}_f. Similar to the extreme generalized\nBregman divergence above, we can define forms of the \u201cextreme\u201d lower bound submodular Bregman\ndivergences d^\u266f_f(X, Y) and d^\u266e_f(X, Y). Since in the case of a submodular function \u2202_f(Y) is an\nunbounded polyhedron, we restrict C = \u2202_f(Y) \u2229 P_f, and define:\n\nd^\u266f_f(X, Y) = f(X) \u2212 f(Y) \u2212 \u03c3_C(1_X \u2212 1_Y) and d^\u266e_f(X, Y) = f(X) \u2212 f(Y) + \u03c3_C(1_Y \u2212 1_X).\n\nThe extreme lower bound submodular Bregman have very nice forms as shown in the theorem below:\nTheorem 2.1. For every h_Y \u2208 \u2202_f(Y) \u2229 P_f, d^\u266f_f(X, Y) \u2264 d^{H_f}_f(X, Y) \u2264 d^\u266e_f(X, Y). Similarly\nfor every permutation map \u03a3, d^\u266f_f(X, Y) \u2264 d^\u03a3_f(X, Y) \u2264 d^\u266e_f(X, Y). Further d^\u266f_f(X, Y) = f(X) +\nf(Y) \u2212 f(X \u2229 Y) \u2212 f(X \u222a Y). Similarly d^\u266e_f(X, Y) = f(X) \u2212 f(Y) \u2212 f(Y\\X) \u2212 f(V) + f(V\\X\\Y).\n\nThe above theorem gives bounds for d^{H_f}_f and d^\u03a3_f. Further we see that d^\u266f_f is exactly the divergence\nwhich defines the submodularity of f. Also notice that this is unlike the generalized Bregman\ndivergences, where the \u201cextreme\u201d forms may not be easy to obtain in general [13].\n\n2.2.2 The upper bound submodular Bregman\n\nFor submodular f, [23] established properties of submodular functions using which we can define the\nfollowing divergences (which we call here the Nemhauser divergences):\n\nd^f_\u266f(X, Y) \u225c f(X) \u2212 \u2211_{j\u2208X\\Y} f(j|X \u2212 {j}) + \u2211_{j\u2208Y\\X} f(j|X \u2229 Y) \u2212 f(Y),    (9)\nd^f_\u266e(X, Y) \u225c f(X) \u2212 \u2211_{j\u2208X\\Y} f(j|X \u222a Y \u2212 {j}) + \u2211_{j\u2208Y\\X} f(j|X) \u2212 f(Y),    (10)\n\nwhere f(j|X) \u225c f(X \u222a j) \u2212 f(X). Similar to the approach in ([15]), we can relax the Nemhauser\ndivergences to obtain three modular upper bound submodular Bregmans as:\n\nd^f_1(X, Y) \u225c f(X) \u2212 \u2211_{j\u2208X\\Y} f(j|X \u2212 {j}) + \u2211_{j\u2208Y\\X} f(j|\u2205) \u2212 f(Y),    (11)\nd^f_2(X, Y) \u225c f(X) \u2212 \u2211_{j\u2208X\\Y} f(j|V \u2212 {j}) + \u2211_{j\u2208Y\\X} f(j|X) \u2212 f(Y),    (12)\nd^f_3(X, Y) \u225c f(X) \u2212 \u2211_{j\u2208X\\Y} f(j|V \u2212 {j}) + \u2211_{j\u2208Y\\X} f(j|\u2205) \u2212 f(Y).    (13)\n\nWe call these the Nemhauser based upper-bound submodular Bregmans of, respectively, type-I, II\nand III. Henceforth, we shall refer to them as d^f_1, d^f_2 and d^f_3 and when referring to them collectively,\nwe will use d^f_{1:3}. The Nemhauser divergences are analogous to the extreme divergences of the\ngeneralized Bregman divergences since they bound the Nemhauser based submodular Bregmans. It is\nnot hard to observe that for a submodular function f, d^f_3(X, Y) \u2265 d^f_1(X, Y) \u2265 d^f_\u266f(X, Y). Similarly\nd^f_3(X, Y) \u2265 d^f_2(X, Y) \u2265 d^f_\u266e(X, Y).\nSimilar to the generalized lower bound submodular Bregman d^{H_f}_f, we define a generalized upper\nbound submodular Bregman divergence d^f_{G_f} in terms of any supergradient of f. Interestingly for\na submodular function, we can define a superdifferential \u2202^f(X) at X as follows:\n\n\u2202^f(X) = {x \u2208 R^V : \u2200Y \u2286 V, f(X) \u2212 x(X) \u2265 f(Y) \u2212 x(Y)}.    (14)\n\nGiven a supergradient at X, g_X \u2208 \u2202^f(X), we can define a divergence d^f_{G_f} as:\n\nd^f_{G_f}(X, Y) = f(X) \u2212 f(Y) \u2212 g_X(X) + g_X(Y) = f(X) \u2212 f(Y) \u2212 \u27e8g_X, 1_X \u2212 1_Y\u27e9.    (15)\n\nSimilar to the subgradient map, we can define G_f as the supergradient map, which picks a\nsupergradient G_f(X) = g_X \u2208 \u2202^f(X). In fact, it can be shown that all three forms of d^f_{1:3} are actually\nspecial cases of d^f_{G_f}, in that they form specific supergradient maps. Define three supergradients\ng^1_X, g^2_X and g^3_X (with the corresponding maps G^f_1, G^f_2 and G^f_3) such that: g^1_X(j) = f(j|X \u2212 {j})\nand g^2_X(j) = g^3_X(j) = f(j|V \u2212 {j}) for j \u2208 X. Similarly let g^1_X(j) = g^3_X(j) = f(j|\u2205)\nand g^2_X(j) = f(j|X) for j \u2209 X. Then it can be shown [12] that g^1_X, g^2_X, g^3_X \u2208 \u2202^f(X), and\ncorrespondingly d^f_1, d^f_2 and d^f_3 are special cases of d^f_{G_f}.\n\nd^f_{G_f} also subsumes an interesting class of divergences for any submodular function representable as\nconcave over modular. Consider any decomposable submodular function [24] f, representable as:\nf(X) = \u2211_i \u03bb_i h_i(m_i(X)), where the h_i's are (not necessarily smooth) concave functions and the m_i's\nare vectors in R^n. Let h\u2032_i be any supergradient of h_i. Then we define g^{cm}_X = \u2211_i \u03bb_i h\u2032_i(m_i(X)) m_i.\nFurther we can define a divergence for a concave over modular function as:\n\nd^f_{cm}(X, Y) = \u2211_i \u03bb_i (h_i(m_i(X)) \u2212 h_i(m_i(Y)) \u2212 h\u2032_i(m_i(X)) (m_i(X) \u2212 m_i(Y))).    (16)\n\nThen it can be shown [12] that d^f_{cm} is also a special case of d^f_{G_f} with g_X = g^{cm}_X when f is a\ndecomposable submodular function.\n\nTable 1: Instances of weighted distance measures as special cases of d^{H_f}_f and d^f_{G_f} for w \u2208 R^n_+\n\nName | Type | d | f(X) | H_f(Y) / G_f(X)\nHamming | d^{H_f}_f | w(X\\Y) + w(Y\\X) | w(X) | 2 \u00b7 w \u2299 1_Y\nHamming | d^f_{G_f} | w(X\\Y) + w(Y\\X) | \u2212w(X) | \u22122 \u00b7 w \u2299 1_X\nRecall | d^{H_f}_f | 1 \u2212 w(X \u2229 Y)/w(Y) | 1 | (w \u2299 1_Y)/w(Y)\nPrecision | d^f_{G_f} | 1 \u2212 w(X \u2229 Y)/w(X) | \u22121 | \u2212(w \u2299 1_X)/w(X)\nAER(Y, X; Y) | d^{H_f}_f | 1 \u2212 (|Y| + |Y \u2229 X|)/(2|Y|) | 1/2 | 1_Y/(2|Y|)\nCond. MI | d^\u266f_f | I(X_{X\\Y}; X_{Y\\X} | X_{X\u2229Y}) | H(X_X) | -\nItakura-Saito | d^f_{G_f} | w(Y)/w(X) \u2212 log(w(Y)/w(X)) \u2212 1 | log w(X) | w/w(X)\nGen. KL | d^f_{G_f} | w(Y) log(w(Y)/w(X)) \u2212 w(Y) + w(X) | \u2212w(X) log w(X) | \u2212w(1 + log w(X))\n\nFinally both d^{H_f}_f and d^f_{G_f} generalize a number of interesting distance measures like Hamming, recall,\nprecision, conditional mutual information, and weighted Hamming. We show this in detail in [12],\nand owing to lack of space briefly summarize them in Table 1. The distance measures are shown in\nweighted form, but cardinality based distances are special cases with w = 1.\n\n2.3 The Lov\u00e1sz Bregman divergence\n\nThe Lov\u00e1sz extension ([20]) offers a natural connection between submodularity and convexity. The\nLov\u00e1sz extension is a non-smooth convex function, and hence we can define a generalized Bregman\ndivergence ([13, 18]) which has a number of properties and applications analogous to the Bregman\ndivergence. 
Recall that the generalized Bregman divergence corresponding to a convex function\n\u03c6 is parameterized by the choice of the subgradient map H_\u03c6. The Lov\u00e1sz extension of a submodular\nfunction has a very interesting set of subgradients, which have a particularly nice structure in that\nthere is a very simple way of obtaining them [7].\nGiven a vector y, define a permutation \u03c3_y such that y[\u03c3_y(1)] \u2265 y[\u03c3_y(2)] \u2265 \u00b7\u00b7\u00b7 \u2265 y[\u03c3_y(n)]\nand define Y_k = {\u03c3_y(1), \u00b7\u00b7\u00b7 , \u03c3_y(k)}. The Lov\u00e1sz extension ([7, 20]) is defined as:\n\n\u02c6f(y) = \u2211_{k=1}^{n} y[\u03c3_y(k)] f(\u03c3_y(k)|Y_{k\u22121}).\n\nFor each point y, we can define a subdifferential \u2202\u02c6f(y), which has\na particularly nice form [9]: for any point y \u2208 [0, 1]^n, \u2202\u02c6f(y) = \u2229{\u2202_f(Y_i) | i = 1, 2, \u00b7\u00b7\u00b7 , n}. This\nnaturally defines a generalized Bregman divergence d^{H_{\u02c6f}}_{\u02c6f} of the Lov\u00e1sz extension, parameterized by a\nsubgradient map H_{\u02c6f}, which we can define as:\n\nd^{H_{\u02c6f}}_{\u02c6f}(x, y) = \u02c6f(x) \u2212 \u02c6f(y) \u2212 \u27e8h_y, x \u2212 y\u27e9, for some h_y = H_{\u02c6f}(y) \u2208 \u2202\u02c6f(y).    (17)\n\nWe can also define specific subgradients of \u02c6f at y as h_{y,\u03c3_y}, with h_{y,\u03c3_y}(\u03c3_y(k)) = f(Y_k) \u2212 f(Y_{k\u22121}), \u2200k.\nThese subgradients are really the extreme points of the submodular polyhedron. Then define the\nLov\u00e1sz Bregman divergence d_{\u02c6f} as the Bregman divergence of \u02c6f and the subgradient h_{y,\u03c3_y}. Similar\nto d^\u03a3_f, it can be shown [12] that d_{\u02c6f}(x, y) = \u02c6f(x) \u2212 \u27e8h_{y,\u03c3_y}, x\u27e9. Note that if the vector y is totally\nordered (no two elements are equal to each other), the subgradient of \u02c6f and the corresponding\npermutation \u03c3_y at y will actually be unique. 
When the vector is not totally ordered, we can consider\n\u03c3y as a permutation operator which de\ufb01nes a valid and consistent total ordering for every vector y,\nand we can then de\ufb01ne the Bregman divergence in terms of it. Note also that the points with no total\nordering in the interior of the hypercube is of measure zero. Hence for simplicity we just refer to the\nLov\u00b4asz Bregman divergence as d \u02c6f . The Lov\u00b4asz Bregman divergence is closely related to the lower\nbound submodular Bregman, as we show below.\nTheorem 2.2. The Lov\u00b4asz Bregman divergences are an extension of the lower bound submodular\nBregman, over the interior of the hypercube. Further the Lov\u00b4asz Bregman divergence can be expressed\nas d \u02c6f (x, y) = (cid:104)x, hx,\u03c3x \u2212 hy,\u03c3y(cid:105), and hence depends only x, the permutation \u03c3x and the permutation\nof y(\u03c3y), but is independent of the values of y.\n\n5\n\n\f3 Properties of the submodular Bregman and Lov\u00b4asz Bregman divergences\n\nIn this section, we investigate some of the properties of the submodular Bregman and Lov\u00b4asz\nBregman divergences which make these divergences interesting for Machine Learning applications.\nWe only state them here \u2014 for an elaborate discussion refer to [12]. All forms of the submodular\nBregman divergences are non-negative, and hence they are valid divergences. The lower bound\nsubmodular Bregman is submodular in X for a given Y , while the upper bound submodular Bregman\nis supermodular in Y for a given X. A direct consequence of this is that problems involving\noptimization in X or Y (for example in \ufb01nding the discrete representatives in a discrete k-means\nlike application which we consider in [12]), can be performed either exactly or approximately in\npolynomial time. 
In addition to these, the forms of the submodular Bregman divergence also satisfy\ninteresting properties like a characterization of equivalence classes, a form of set separation, a\ngeneralized triangle inequality over sets and a form of both Fenchel and submodular duality. Finally\nthe generalized submodular Bregman divergence has an interesting alternate characterization, which\nshows that they can potentially subsume a large number of discrete divergences. In particular, a\ndivergence d is of the form d^{H_f}_f iff for any sets A, B \u2286 V, the set function f_A(X) = d(X, A) is\nsubmodular in X and the set function d(X, A) \u2212 d(X, B) is modular in X. Similarly a divergence d\nis of the form d^f_{G_f} iff, for any sets A, B \u2286 V, the set function f_A(Y) = d(A, Y) is supermodular in\nY and the set function d(A, Y) \u2212 d(B, Y) is modular in Y. These facts show that the generalized\nBregman divergences are potentially a very large class of divergences while Table 1 provides just a\nfew of them.\nAdditionally, the Lov\u00e1sz Bregman divergence also has a number of very interesting properties.\nNotable amongst these is the fact that it has an interesting property related to permutations.\nTheorem 3.1. [12] Given a submodular function whose polyhedron contains all possible extreme\npoints (e.g., f(X) = \u221a|X|), d_{\u02c6f}(x, y) = 0 if and only if \u03c3_x = \u03c3_y.\n\nHence the Lov\u00e1sz Bregman divergence can be seen as a divergence between the permutations. While\na number of distance measures capture the notion of a distance amongst orderings [17], the Lov\u00e1sz\nBregman divergence has a unique feature not present in these distance measures. The Lov\u00e1sz\nBregman divergence not only captures the distance between \u03c3_x and \u03c3_y, but also weighs it with the\nvalue of x, thus giving preference to the values and not just the orderings. 
Hence it can be seen\nas a divergence between a score x and a permutation \u03c3y, and hence we shall also represent it as\nd \u02c6f (x, y) = d \u02c6f (x||\u03c3y) = (cid:104)x, hx,\u03c3x \u2212 hx,\u03c3y(cid:105). Correspondingly, given a collection of scores, it also\nmeasures how con\ufb01dent the scores are about the ordering. For example given two scores x and y\nwith the same orderings such that the values of x are nearly equal (low con\ufb01dence), while the values\nof y have large differences, the distance to any other permutation will be more for y than x. This\nproperty intuitively desirable in a permutation based divergence. Finally, as we shall see the Lov\u00b4asz\nBregman divergences are easily amenable to k-means style alternating minimization algorithms for\nclustering ranked data, a process that is typically dif\ufb01cult using other permutation-based distances.\n\n4 Applications\n\nIn this section, we show the utility of the submodular Bregman and Lov\u00b4asz Bregman divergences by\nconsidering some practical applications in machine learning and optimization. The \ufb01rst application\nis that of proximal algorithms which generalize several mirror descent algorithms. As a second\napplication, we motivate the use of the Lov\u00b4asz Bregman divergence as a natural choice in clustering\nwhere the order is important. Due to lack of space, we only concisely describe these applications,\nand for a more elaborate discussion please see [12] where we also consider a third discrete clustering\napplication, and provide a clustering framework for the submodular Bregman with fast algorithms\nfor clustering sets of binary vectors\n\n4.1 A proximal framework for the submodular Bregman divergence\n\nThe Bregman divergence has some nice properties related to a proximal method. 
In particular ([5]), let\n\u03c8 be a convex function that is hard to optimize, but suppose the function \u03c8(x) + \u03bbd_\u03c6(x, y) is easy to\noptimize for a given fixed y. Then a proximal algorithm, which starts with a particular x^0 and updates\nat every iteration x^{t+1} = argmin_x \u03c8(x) + \u03bbd_\u03c6(x, x^t), is bound to converge to the global minimum.\nWe define a similar framework for the submodular Bregmans. Consider a set function\nF, and an underlying combinatorial constraint S. Optimizing this set function may not be\neasy; e.g., if S is the constraint that X be a graph-cut, then this optimization problem is NP\nhard even if F is submodular ([15]). Consider\nnow a divergence d(X, Y) that can be either an upper or lower bound submodular Bregman. Note,\nthe combinatorial constraints S are the discrete analogs of the convex set projection in the proximal\nmethod. We offer a proximal minimization algorithm (Algorithm 1) in a spirit similar to [5].\n\nAlgorithm 1: Proximal Minimization Algorithm\nX^0 = \u2205, t = 0\nwhile not converged do\n    X^{t+1} := argmin_{X\u2208S} F(X) + \u03bbd(X, X^t)\n    t \u2190 t + 1\n\nFurthermore, Algorithm 1 is guaranteed to monotonically decrease the function value over the\niterations [12]. Interestingly, a number of approximate optimization problems considered in the past\nturn out to be special cases of the proximal framework. We show this below:\nMinimizing the difference between submodular (DS) functions: Consider the case where\nF(X) = f(X) \u2212 g(X) is a difference between two submodular functions f and g. This problem\nis known to be NP hard and even NP hard to approximate [22, 11]. However there are a number\nof heuristic algorithms which have been shown to perform well in practice [22, 11]. Consider first:\nd(X, X^t) = d^{\u03a3_t}_g(X, X^t) (for some appropriate schedule \u03a3_t of permutations), with \u03bb = 1 and\nS = 2^V. 
Then it can be shown trivially [12] that we obtain the submodular-supermodular (sub-sup)\n1:3(X t, X), again with \u03bb = 1 and S = 2V .\nprocedure ([22]). Moreover, we can de\ufb01ne d(X, X t) = df\nThen again we can show [12] that we obtain the supermodular-submodular (sup-sub) procedure [11].\nFinally de\ufb01ning d(X, X t) = df\ng (X, X t), we get the modular-modular (mod-mod)\nprocedure [11]. Further, the sup-sub and mod-mod procedures can be used with more complicated\nconstraints like cardinality, matroid and knapsack constraints while the mod-mod algorithm can be\nextended with even combinatorial constraints like the family of cuts, spanning trees, shortest paths,\ncovers, matchings, etc. [11]\nSubmodular function minimization: Algorithm 1 also generalizes a number of approximate\nsubmodular minimization algorithms. If F is a submodular function and the underlying constraints\nS represent the family of cuts, then we obtain the cooperative cut problem ([15], [14]) and one of\nthe algorithms developed in ([15]) is a special case of Algorithm 1. If S = 2V above, we get a\nform of the approximate submodular minimization algorithm suggested for arbitrary (non-graph\nrepresentable) submodular functions ([16]). The proximal minimization algorithm also generalizes\nthree submodular function minimization algorithms IMA-I, II and III, described in detail in [12]\nagain with \u03bb = 1,S = 2V and d = df\n3 respectively. These algorithms are similar to\nthe greedy algorithm for submodular maximization [23]. Interestingly these algorithms provide\nbounds to the lattice of minimizers of the submodular functions.\nIt is known [1] that the sets\nA = {j : f (j|\u2205) < 0}, B = {j : f (j|V \u2212 {j}) > 0 are such that, for every minimizer X\u2217,\nA \u2286 X\u2217 \u2286 B. 
Thus the lattice formed with A and B de\ufb01ned as the join and meet respectively, gives\na bound on the minimizers, and we can restrict the submodular minimization algorithms to this lattice.\n3 as a regularizer (which is IMA-III) and starting with X 0 = \u2205 and X 0 = V , we\nHowever using d = df\nget the sets A and B [10, 12] respectively from Algorithm 1. With versions of algorithm 1 with d = df\n1\n2, and starting respectively from X 0 = \u2205 and X 0 = V , we get sets that provide a tighter\nand d = df\nbound on the lattice of minimizers than the one obtained with A and B. Further these algorithms\nalso provide improved bounds in the context of monotone submodular minimization subject to\ncombinatorial constraints. In particular, these algorithms provide bounds which are better than 1\n\u03bd ,\nwhere \u03bd is a parameter related to the curvature of the submodular function. Hence when the parameter\n\u03bd is a constant, these bounds are constant factor guarantees, which contrasts the O(n) bounds for\nmost of these problems. For a more elaborate and detailed discussion related to this, refer to [10]\nSubmodular function maximization:\nIf f is a submodular function, then using d(X, X v) =\nd\u03a3v\nf (X, X v) forms an iterative algorithm for maximizing the modular lower bound of a submodular\nThis algorithm then generalizes a algorithms number of unconstrained submodular\nfunction.\nmaximization and constrained submodular maximization, in that by an appropriate schedule of \u03a3v\n2 approximate algorithm and a 1 \u2212 1\nwe can obtain these algorithms. Notable amongst them is a 1\ne\n\n1:3(X t, X) + d\u03a3t\n\n1 , df\n\n2 and df\n\n7\n\n\fFigure 1: Results of k-means clustering using the Lov\u00b4asz Bregman divergence (two plots on the left)\nand the Euclidean distance (two plots on the right). 
URLs above link to videos.\n\n//youtu.be/kfEnLOmvEVc\n\n//youtu.be/IqRhemUg14I\n\nhttp:\n\nhttp:\n\napproximation algorithm for unconstrained and cardinality constrained submodular maximization\nrespectively. For a complete list of algorithms generalized by this, refer to [10].\n\n4.2 Clustering framework with the Lov\u00b4asz Bregman divergence\n\nn\n\nIn this section we investigate a clustering framework similar to [2], using the Lov\u00b4asz Bregman\nRecall that the Lov\u00b4asz\ndivergence and show how this is natural for a number of applications.\nBregman divergence in some sense measures the distance between the ordering of the vectors and can\nbe seen as a form of the \u201csorting\u201d distance. We de\ufb01ne the clustering problem as given a set of vectors,\n\ufb01nd a clustering into subsets of vectors with similar orderings. For example, given a set of voters and\ntheir corresponding ranked preferences, we might want to \ufb01nd subsets of voters who mostly agree.\nLet X = {x1, x2,\u00b7\u00b7\u00b7 , xm} represent a set of m vectors, such that \u2200i, xi \u2208 [0, 1]n. We \ufb01rst consider\nthe problem of \ufb01nding the representative of these vectors. Given a set of vectors X and a Lov\u00b4asz\nBregman divergence d \u02c6f , a natural choice of a representative (in this case a permutation) is the point\ni=1 d \u02c6f (xi||\u03c3(cid:48)). 
with minimum average distance, or in other words: $\\sigma = \\operatorname{argmin}_{\\sigma'} \\sum_{i=1}^{n} d_{\\hat{f}}(x_i \\| \\sigma')$. Interestingly, for the Lov\u00e1sz Bregman divergence this problem is easy and the representative permutation is exactly the permutation of the arithmetic mean of X.\nTheorem 4.1. [12] Given a submodular function f, the Lov\u00e1sz Bregman representative $\\operatorname{argmin}_{\\sigma'} \\sum_{i=1}^{n} d_{\\hat{f}}(x_i \\| \\sigma')$ is exactly $\\sigma = \\sigma_{\\mu}$, $\\mu = \\frac{1}{n} \\sum_{i=1}^{n} x_i$.\nIt may not suffice to encode X using a single representative, and hence we partition X into disjoint blocks $C = \\{C_1, \\dots, C_k\\}$ with each block having its own Lov\u00e1sz Bregman representative, with the set of representatives given by $M = \\{\\sigma_1, \\sigma_2, \\dots, \\sigma_k\\}$. Then we define an objective, which captures this idea of clustering vectors into subsets of similar orderings: $\\min_{M,C} \\sum_{j=1}^{k} \\sum_{x_i \\in C_j} d_{\\hat{f}}(x_i, \\mu_j)$.\nConsider then a k-means like alternating algorithm [19, 21]. It has two stages, often called the assignment and the re-estimation step. In the assignment stage, for every point $x_i$ we choose its cluster membership $C_j$ such that $j = \\operatorname{argmin}_{l} d_{\\hat{f}}(x_i \\| \\sigma_l)$. The re-estimation step involves finding the representatives for every cluster $C_j$, which is exactly the permutation of the mean of the vectors in $C_j$. We skip the algorithm here due to space constraints, and refer the reader to [12] for a complete discussion.\nWe remark here that a number of distance measures capture the notion of orderings, like the bubble-sort distance [17], etc. However for these distance measures, finding the representative may not be easy. 
The Lov\u00e1sz Bregman divergence naturally captures the notion of distance between orderings of vectors, and yet the problem of finding the representative in this case is very easy.\nFinally, similar to the analysis in [2, 12], we can show that the k-means algorithm using the Lov\u00e1sz Bregman divergence monotonically decreases the objective at every iteration, and that the algorithm converges to a local minimum [12]. To demonstrate the utility of our clustering framework, we show some results in 2 and 3 dimensions (Fig. 1), where we compare our framework to a k-means algorithm using the Euclidean distance. We use the submodular function $f(X) = \\sqrt{w(X)}$, for an arbitrary vector w ensuring unique base extreme points. The results clearly show that the Lov\u00e1sz Bregman divergence clusters the data based on the orderings of the vectors.\nAcknowledgments: We thank Stefanie Jegelka, Karthik Narayanan, Andrew Guillory, Hui Lin, John Halloran and the rest of the submodular group at UW-EE for discussions. This material is based upon work supported by the National Science Foundation under Grant No. (IIS-1162606), and is also supported by a Google, a Microsoft, and an Intel research award.\n\nReferences\n[1] F. Bach. Learning with Submodular functions: A convex Optimization Perspective. arXiv preprint, 2011.\n[2] A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh. Clustering with Bregman divergences. JMLR, 6:1705\u20131749, Oct. 2005.\n[3] E. Boros and P. L. Hammer. Pseudo-boolean optimization. Discrete Applied Math., 123(1\u20133):155\u2013225, 2002.\n[4] L. Bregman. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Comput. Math and Math Physics, 7, 1967.\n[5] Y. Censor and S. Zenios. Parallel optimization: Theory, algorithms, and applications. Oxford University Press, USA, 1997.\n[6] I. Dhillon and J. Tropp. 
Matrix nearness problems with Bregman divergences. SIAM Journal on Matrix Analysis and Applications, 29(4):1120\u20131146, 2007.\n[7] J. Edmonds. Submodular functions, matroids and certain polyhedra. Combinatorial structures and their Applications, 1970.\n[8] B. Frigyik, S. Srivastava, and M. Gupta. Functional Bregman divergence. In Proc. ISIT, pages 1681\u20131685. IEEE, 2008.\n[9] S. Fujishige. Submodular functions and optimization, volume 58. Elsevier Science, 2005.\n[10] R. Iyer and J. Bilmes. A framework of mirror descent algorithms for submodular optimization. To Appear in NIPS Workshop on Discrete Optimization in Machine Learning (DISCML) 2012: Structure and Scalability, 2012.\n[11] R. Iyer and J. Bilmes. Algorithms for approximate minimization of the difference between submodular functions, with applications. In Proc. UAI, 2012.\n[12] R. Iyer and J. Bilmes. Submodular-Bregman and the Lov\u00e1sz-Bregman Divergences with Applications: Extended Version. 2012.\n[13] R. Iyer and J. Bilmes. A unified theory on the generalized Bregman divergences. Manuscript, 2012.\n[14] S. Jegelka and J. Bilmes. Cooperative cuts: Graph cuts with submodular edge weights. Technical Report TR-189, Max Planck Institute for Biological Cybernetics, 2010.\n[15] S. Jegelka and J. Bilmes. Submodularity beyond submodular energies: coupling edges in graph cuts. In Computer Vision and Pattern Recognition (CVPR), Colorado Springs, CO, June 2011.\n[16] S. Jegelka, H. Lin, and J. Bilmes. Fast approximate submodular minimization. In Proc. NIPS, 2011.\n[17] M. Kendall. A new measure of rank correlation. Biometrika, 30(1/2):81\u201393, 1938.\n[18] K. C. Kiwiel. Free-steering relaxation methods for problems with strictly convex costs and linear constraints. Mathematics of Operations Research, 22(2):326\u2013349, 1997.\n[19] S. Lloyd. Least squares quantization in PCM. 
IEEE Transactions on IT, 28(2):129\u2013137, 1982.\n[20] L. Lov\u00e1sz. Submodular functions and convexity. Mathematical Programming, 1983.\n[21] J. MacQueen et al. Some methods for classification and analysis of multivariate observations. In Proceedings of the fifth Berkeley symposium on math. stats and probability, volume 1, pages 281\u2013297. California, USA, 1967.\n[22] M. Narasimhan and J. Bilmes. A submodular-supermodular procedure with applications to discriminative structure learning. In Uncertainty in Artificial Intelligence (UAI), Edinburgh, Scotland, July 2005.\n[23] G. Nemhauser, L. Wolsey, and M. Fisher. An analysis of approximations for maximizing submodular set functions-I. Mathematical Programming, 14(1):265\u2013294, 1978.\n[24] P. Stobbe and A. Krause. Efficient minimization of decomposable submodular functions. In Proc. Neural Information Processing Systems (NIPS), 2010.\n[25] M. Telgarsky and S. Dasgupta. Agglomerative Bregman clustering. In Proc. ICML, 2012.\n[26] K. Tsuda, G. R\u00e4tsch, and M. Warmuth. Matrix exponentiated gradient updates for on-line learning and Bregman projection. JMLR, 6(1):995, 2006.\n[27] M. K. Warmuth. Online learning and Bregman divergences. Tutorial at the Machine Learning Summer School, 2006.", "award": [], "sourceid": 4692, "authors": [{"given_name": "Rishabh", "family_name": "Iyer", "institution": null}, {"given_name": "Jeff", "family_name": "Bilmes", "institution": null}]}