{"title": "From MAP to Marginals: Variational Inference in Bayesian Submodular Models", "book": "Advances in Neural Information Processing Systems", "page_first": 244, "page_last": 252, "abstract": "Submodular optimization has found many applications in machine learning and beyond. We carry out the first systematic investigation of inference in probabilistic models defined through submodular functions, generalizing regular pairwise MRFs and Determinantal Point Processes. In particular, we present L-Field, a variational approach to general log-submodular and log-supermodular distributions based on sub- and supergradients. We obtain both lower and upper bounds on the log-partition function, which enables us to compute probability intervals for marginals, conditionals and marginal likelihoods. We also obtain fully factorized approximate posteriors, at the same computational cost as ordinary submodular optimization. Our framework results in convex problems for optimizing over differentials of submodular functions, which we show how to optimally solve. We provide theoretical guarantees of the approximation quality with respect to the curvature of the function. We further establish natural relations between our variational approach and the classical mean-field method. Lastly, we empirically demonstrate the accuracy of our inference scheme on several submodular models.", "full_text": "From MAP to Marginals: Variational Inference in Bayesian Submodular Models

Josip Djolonga
Department of Computer Science
ETH Zürich
josipd@inf.ethz.ch

Andreas Krause
Department of Computer Science
ETH Zürich
krausea@ethz.ch

Abstract

Submodular optimization has found many applications in machine learning and beyond. We carry out the first systematic investigation of inference in probabilistic models defined through submodular functions, generalizing regular pairwise MRFs and Determinantal Point Processes.
In particular, we present L-FIELD, a variational approach to general log-submodular and log-supermodular distributions based on sub- and supergradients. We obtain both lower and upper bounds on the log-partition function, which enables us to compute probability intervals for marginals, conditionals and marginal likelihoods. We also obtain fully factorized approximate posteriors, at the same computational cost as ordinary submodular optimization. Our framework results in convex problems for optimizing over differentials of submodular functions, which we show how to optimally solve. We provide theoretical guarantees of the approximation quality with respect to the curvature of the function. We further establish natural relations between our variational approach and the classical mean-field method. Lastly, we empirically demonstrate the accuracy of our inference scheme on several submodular models.

1 Introduction

Submodular functions [1] are a rich class of set functions F : 2^V → R, investigated originally in game theory and combinatorial optimization. They capture natural notions such as diminishing returns and economies of scale. In recent years, submodular optimization has seen many important applications in machine learning, including active learning [2], recommender systems [3], document summarization [4], representation learning [5], clustering [6], the design of structured norms [7], etc.

In this work, instead of using submodular functions to obtain point estimates through optimization, we take a Bayesian approach and define probabilistic models over sets (so-called point processes) using submodular functions. Many of the aforementioned applications can be understood as performing MAP inference in such models.
We develop L-FIELD, a general variational inference scheme for reasoning about log-supermodular (P(A) ∝ exp(−F(A))) and log-submodular (P(A) ∝ exp(F(A))) distributions, where F is a submodular set function.

Previous work. There has been extensive work on submodular optimization (both approximate and exact minimization and maximization, see, e.g., [8, 9, 10, 11]). In contrast, we are unaware of previous work that addresses the general problem of probabilistic inference in Bayesian submodular models. There are two important special cases that have received significant interest. The most prominent examples are undirected pairwise Markov Random Fields (MRFs) with binary variables, also called the Ising model [12], due to their importance in statistical physics, and applications, e.g., in computer vision. While MAP inference is efficient for regular (log-supermodular) MRFs, computing the partition function is known to be #P-hard [13], and the approximation problem has also been shown to be hard [14]. Also, there is no FPRAS in the log-submodular case unless RP=NP [13]. An important case of log-submodular distributions is the Determinantal Point Process (DPP), used in machine learning as a principled way of modeling diversity. Its partition function can be computed efficiently, and a 1/4-approximation scheme for finding the (NP-hard) MAP [15] is known. In this paper, we propose a variational inference scheme for general Bayesian submodular models, that encompasses these two and many other distributions, and has instance-dependent quality guarantees. A hallmark of the models is that they capture high-order interactions between many random variables. Existing variational approaches [16] cannot efficiently cope with such high-order interactions — they generally have to sum over all variables in a factor, scaling exponentially in the size of the factor.
We discuss this prototypically for mean-field in Sec. 5.

Our contributions. In summary, our main contributions are:

• We provide the first general treatment of probabilistic inference with log-submodular and log-supermodular distributions, that can capture high-order variable interactions.
• We develop L-FIELD, a novel variational inference scheme that optimizes over sub- and supergradients of submodular functions. Our scheme yields both upper and lower bounds on the partition function, which imply rigorous probability intervals for marginals. We can also obtain factorial approximations of the distribution at no larger computational cost than performing MAP inference in the model (for which a plethora of algorithms are available).
• We identify a natural link between our scheme and the well-known mean-field method.
• We establish theoretical guarantees about the accuracy of our bounds, dependent on the curvature of the underlying submodular function.
• We demonstrate the accuracy of L-FIELD on several Bayesian submodular models.

2 Submodular functions and optimization

Submodular functions are set functions satisfying a diminishing returns condition. Formally, let V be some finite ground set, w.l.o.g. V = {1, . . . , n}, and consider a set function F : 2^V → R. The marginal gain of adding item i ∈ V to the set A ⊆ V w.r.t. F is defined as F(i|A) = F(A ∪ {i}) − F(A). Then, a function F : 2^V → R is said to be submodular if for all A ⊆ B ⊆ V and i ∈ V − B it holds that F(i|A) ≥ F(i|B). A function F is called supermodular if −F is submodular. Without loss of generality¹, we will also make the assumption that F is normalized so that F(∅) = 0.

The problem of submodular function optimization has received significant attention.
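As a concrete check of the definitions above (marginal gains and diminishing returns), the following small Python sketch verifies the condition by brute force for a coverage function; the ground set and the sets it covers are invented for illustration and are not from the paper:

```python
from itertools import combinations

# Toy coverage function F(A) = |union of covers[i] for i in A|, a classic
# example of a submodular function (illustrative; not part of the paper).
V = [0, 1, 2, 3]
covers = {0: {"a", "b"}, 1: {"b", "c"}, 2: {"c", "d"}, 3: {"d", "e"}}

def F(A):
    """Number of distinct items covered by the groups in A (F(∅) = 0)."""
    return len(set().union(*(covers[i] for i in A))) if A else 0

def gain(i, A):
    """Marginal gain F(i | A) = F(A ∪ {i}) − F(A)."""
    return F(set(A) | {i}) - F(set(A))

def subsets(S):
    """All subsets of S, as sets."""
    S = sorted(S)
    return [set(c) for r in range(len(S) + 1) for c in combinations(S, r)]

def is_submodular():
    """Check F(i | A) ≥ F(i | B) for all A ⊆ B ⊆ V and i ∉ B."""
    return all(gain(i, A) >= gain(i, B)
               for B in subsets(V)
               for A in subsets(B)
               for i in set(V) - B)

print(gain(1, set()), gain(1, {0}))  # gains shrink as the context grows: 2 1
print(is_submodular())               # → True
```

Adding element 1 to the empty set covers two new items, but in the context of {0} it covers only one: exactly the diminishing-returns behaviour the definition formalizes.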
The (unconstrained) minimization of submodular functions, min_A F(A), can be done in polynomial time. While general purpose algorithms [8] can be impractical due to their high order, several classes of functions admit faster, specialized algorithms, e.g. [17, 18, 19]. Many important problems can be cast as the minimization of a submodular objective, ranging from image segmentation [20, 12] to clustering [6]. Submodular maximization has also found numerous applications, e.g. experimental design [21], document summarization [4] or representation learning [5]. While this problem is in general NP-hard, effective constant-factor approximation algorithms exist (e.g. [22, 11]).

In this paper we lift results from submodular optimization to probabilistic inference, which lets us quantify uncertainty about the solutions of the problem, instead of binding us to a single one. Our approach allows us to obtain (approximate) marginals at the same cost as traditional MAP inference.

3 Probabilistic inference in Bayesian submodular models

Which Bayesian models are associated with submodular functions? Suppose F : 2^V → R is a submodular set function. We consider distributions over subsets² A ⊆ V of the form P(A) = (1/Z) e^{+F(A)} and P(A) = (1/Z) e^{−F(A)}, which we call log-submodular and log-supermodular, respectively. The normalizing quantity Z = ∑_{S⊆V} e^{±F(S)} is called the partition function, and −log Z is also known as free energy in the statistical physics literature. Note that distributions over subsets of V are isomorphic to distributions of |V| = n binary random variables X_1, . . . , X_n ∈ {0, 1} — we simply identify X_i as the indicator function of the event i ∈ A, or formally X_i = [i ∈ A].

Examples of log-supermodular distributions. There are many distributions that fit this framework.
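On a tiny ground set, the partition function and marginals just defined can be computed exactly by exhaustive enumeration. The sketch below (illustrative only; the cut function and edge weights are made up, and enumeration is exponential in |V|) does this for a small log-supermodular model:

```python
import math
from itertools import combinations

# A tiny log-supermodular model P(A) ∝ exp(−F(A)) on V = {0, 1, 2}, where F is
# a graph-cut function (submodular). Purely illustrative: exhaustive enumeration
# is only feasible for very small ground sets.
V = [0, 1, 2]
edges = {(0, 1): 1.0, (1, 2): 2.0}

def F(A):
    """Cut value: total weight of edges with exactly one endpoint in A."""
    return sum(w for (u, v), w in edges.items() if (u in A) != (v in A))

subsets = [set(c) for r in range(len(V) + 1) for c in combinations(V, r)]

# Partition function Z = Σ_{A⊆V} exp(−F(A)) and marginals P(i ∈ A).
Z = sum(math.exp(-F(A)) for A in subsets)
marginals = {i: sum(math.exp(-F(A)) for A in subsets if i in A) / Z for i in V}

print(math.log(Z))  # the log-partition function
print(marginals)    # cut symmetry F(A) = F(V − A) forces every marginal to 1/2
```

Since a cut function satisfies F(A) = F(V − A), this particular distribution is symmetric under complementation and every marginal comes out to exactly 1/2. With |V| = 100, as in the experiments later, such enumeration is impossible, which is what motivates the variational bounds of Section 4.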
As a prominent example, consider binary pairwise Markov random fields (MRFs), P(X_1, . . . , X_n) = (1/Z) ∏_{i,j} φ_{i,j}(X_i, X_j). Assuming the potentials φ_{i,j} are positive, such MRFs are equivalent to distributions P(A) ∝ exp(−F(A)), where F(A) = ∑_{i,j} F_{i,j}(A), and F_{i,j}(A) = −log φ_{i,j}([i ∈ A], [j ∈ A]). An MRF is called regular iff each F_{i,j} is submodular (and consequently P(A) is log-supermodular). Such models are extensively used in applications, e.g. in computer vision [12]. More generally, a rich class of distributions can be defined using decomposable submodular functions, which can be written as sums of (usually simpler) submodular functions. As an example, let G_1, . . . , G_k ⊆ V be groups of elements and let φ_1, . . . , φ_k : [0, ∞) → R be concave. Then, the function F(A) = ∑_{i=1}^k φ_i(|G_i ∩ A|) is submodular. Models using these types of functions strictly generalize pairwise MRFs, and can capture higher-order variable interactions, which can be crucial in computer vision applications such as semantic segmentation (e.g. [23]).

Examples of log-submodular distributions. A prominent example of log-submodular distributions are Determinantal Point Processes (DPPs) [24]. A DPP is a distribution over sets A of the form P(A) = (1/Z) exp(F(A)), where F(A) = log |K_A|. Here, K ∈ R^{V×V} is a positive semi-definite matrix, K_A is the square submatrix indexed by A, and |·| denotes the determinant. Because K is positive semi-definite, F(A) is known to be submodular, and hence DPPs are log-submodular. Another natural model is that of facility location. Assume that we have a set of locations V where we can open shops, and a set N of customers that we would like to serve. For each customer i ∈ N and location j ∈ V we have a non-negative number C_{i,j} quantifying how much service i gets from location j. Then, we consider F(A) = ∑_{i∈N} max_{j∈A} C_{i,j}. We can also penalize the number of open shops and use a distribution P(A) ∝ exp(F(A) − λ|A|) for λ > 0. Such objectives have been used for optimization in many applications, ranging from clustering [25] to recommender systems [26].

The Inference Challenge. Having introduced the models that we consider, we now show how to do inference in them³. Let us introduce the following operations that preserve submodularity.

Definition 1. Let F : 2^V → R be submodular and let X, Y ⊆ V. Define the submodular functions F^X as the restriction of F to 2^X, and F_X : 2^{V−X} → R as F_X(A) = F(A ∪ X) − F(X).

First, let us see how to compute marginals. The probability that the random subset S distributed as P(S = A) ∝ exp(−F(A)) is in some non-empty lattice [X, Y] = {A | X ⊆ A ⊆ Y} is equal to

P(S ∈ [X, Y]) = (1/Z) ∑_{X⊆A⊆Y} exp(−F(A)) = (1/Z) ∑_{A⊆Y−X} exp(−F(X ∪ A)) = e^{−F(X)} Z^Y_X / Z,   (1)

where Z^Y_X = ∑_{A⊆Y−X} e^{−(F(X∪A)−F(X))} is the partition function of (F_X)^Y. Marginals P(i ∈ S) of any i ∈ V can be obtained using [{i}, V]. We also obtain conditionals — if, for example, we condition on the event in (1), we have P(S = A | S ∈ [X, Y]) = exp(−F(A))/Z^Y_X if A ∈ [X, Y], 0 otherwise. Note that log-supermodular distributions are conjugate with each other: for a log-supermodular prior P(A) ∝ exp(−F(A)) and a likelihood function⁴ P(E | A) ∝ exp(−L(E; A)), for which L is submodular w.r.t. A for each evidence E, the posterior P(A | E) ∝ exp(−(F(A) + L(E; A))) is log-supermodular as well. The same holds for log-submodular distributions.

4 The variational approach

In Section 3 we have seen that due to the closure properties of submodular functions, important inference tasks (e.g., marginals, conditioning) in Bayesian submodular models require computing partition functions of suitably defined/restricted submodular functions. Given that the general problem is #P-hard, we seek approximate methods. The main idea is to exploit the peculiar property of submodular functions that they can be both lower- and upper-bounded using simple additive functions of the form s(A) + c, where c ∈ R and s : 2^V → R is modular, i.e. it satisfies s(A) = ∑_{i∈A} s({i}). We will also treat modular functions s(·) as vectors s ∈ R^V with coordinates s_i = s({i}). Because modular functions have tractable log-partition functions, we obtain the following bounds.

Lemma 1. If ∀A ⊆ V : s_l(A) + c_l ≤ F(A) ≤ s_u(A) + c_u for modular s_u, s_l, and c_l, c_u ∈ R, then

log Z^+(s_l, c_l) ≤ log ∑_{A⊆V} exp(+F(A)) ≤ log Z^+(s_u, c_u) and
log Z^−(s_u, c_u) ≤ log ∑_{A⊆V} exp(−F(A)) ≤ log Z^−(s_l, c_l),

where log Z^+(s, c) = c + ∑_{i∈V} log(1 + e^{s_i}) and log Z^−(s, c) = −c + ∑_{i∈V} log(1 + e^{−s_i}).

¹The functions F(A) and F(A) + c encode the same distributions by virtue of normalization.
²In the appendix we also consider cardinality constraints, i.e., distributions over sets A that satisfy |A| ≤ k.
³We consider log-supermodular distributions, as the log-submodular case is analogous.
⁴Such submodular loss functions L have been considered, e.g., in document summarization [4].

We can use any modular (upper or lower) bound s(A) + c to define a completely factorized distribution that can be used as a proxy to approximate values of interest of the original distribution. For example, the marginal of i ∈ A under Q(A) ∝ exp(−s(A) + c) is
easily seen to be 1/(1 + e^{s_i}). Instead of optimizing over all possible bounds of the above form, we consider for each X ⊆ V two sets of modular functions, which are exact at X and lower- or upper-bound F respectively. Similarly as for convex functions, we define [8][§6.2] the subdifferential of F at X as

∂F(X) = {s ∈ R^n | ∀Y ⊆ V : F(Y) ≥ F(X) + s(Y) − s(X)}.   (2)

The superdifferential ∂̄F(X) is defined analogously by inverting the inequality sign [27]. For each subgradient s ∈ ∂F(X), the function g_X(Y) = s(Y) + F(X) − s(X) is a lower bound of F. Similarly, for a supergradient s ∈ ∂̄F(X), h_X(Y) = s(Y) + F(X) − s(X) is an upper bound of F. Note that both h_X and g_X are of the form that we considered (modular plus constant) and are tight at X, i.e. h_X(X) = g_X(X) = F(X). Because we will be optimizing over differentials, we define for any X ⊆ V the shorthands Z^+_X(s) = Z^+(s, F(X) − s(X)) and Z^−_X(s) = Z^−(s, F(X) − s(X)).

4.1 Optimizing over subgradients

To analyze the problem of minimizing log Z^−_X(s) subject to s ∈ ∂F(X), we introduce the base polyhedron of F, defined as B(F) = {s ∈ R^V | s(V) = F(V) and ∀A ⊆ V : s(A) ≤ F(A)}, i.e. the set of modular lower bounds that are exact at V. As the following lemma shows, we do not have to consider log Z^−_X for all X and we can restrict our attention to the case X = ∅.

Lemma 2. For all X ⊆ V we have min_{s∈∂F(∅)} Z^−_∅(s) ≤ min_{s∈∂F(X)} Z^−_X(s). Moreover, the former problem is equivalent to

minimize_s ∑_{i∈V} log(1 + e^{−s_i})   subject to   s ∈ B(F).   (3)

Thus, we have to optimize a convex function over B(F), a problem that has been already considered [8, 9].
For example, we can use the Frank-Wolfe algorithm [28, 29], which is easy to implement and has a convergence rate of O(1/k). It requires the optimization of linear functions g(s) = ⟨w, s⟩ = wᵀs over the domain, which, as shown by Edmonds [1], can be done greedily in O(|V| log |V|) time. More precisely, to compute a maximizer s* ∈ B(F) of g(s), pick a bijection σ : {1, . . . , |V|} → V that orders w, i.e. w_{σ(1)} ≥ w_{σ(2)} ≥ · · · ≥ w_{σ(|V|)}. Then, set s*_{σ(i)} = F(σ(i) | {σ(1), . . . , σ(i − 1)}). Alternatively, if we can efficiently minimize the sum of the function plus a modular term, e.g. for the family of graph-cut representable functions [10], we can apply the divide-and-conquer algorithm [9][§9.1], which needs the minimization of O(|V|) problems.

1: procedure FRANK-WOLFE(F, x_1, ε)
2:   Define f(x) = log(1 + e^{−x})   ▷ Elementwise.
3:   for k ← 1, 2, . . . , T do
4:     Pick s ∈ argmin_{x∈B(F)} ⟨x, ∇f(x_k)⟩
5:     if ⟨x_k − s, ∇f(x_k)⟩ ≤ ε then
6:       return x_k   ▷ Small duality gap.
7:     else
8:       x_{k+1} = (1 − γ_k) x_k + γ_k s;  γ_k = 2/(k + 2)

1: procedure DIVIDE-CONQUER(F)
2:   s ← (F(V)/|V|) 1;  A* ← minimizer of F(·) − s(·)
3:   if F(A*) = s(A*) then
4:     return s
5:   else
6:     s_{A*} ← DIVIDE-CONQUER(F^{A*})
7:     s_{V−A*} ← DIVIDE-CONQUER(F_{A*})
8:     return (s_{A*}, s_{V−A*})

The entropy viewpoint and the Fenchel dual. Interestingly, (3) can be interpreted as a maximum entropy problem. Recall that, for s ∈ B(F) we use the distribution P(A) ∝ exp(−s(A)), whose entropy is exactly the negative of our objective.
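The greedy linear optimization over B(F) described above is simple to implement; here is a hypothetical sketch (the function names and the example F are ours, not the authors' released code):

```python
import math

def greedy_linear_maximizer(F, V, w):
    """Edmonds' greedy algorithm: maximize ⟨w, s⟩ over the base polyhedron B(F).
    Sort the ground set so that w is non-increasing, then assign each element
    its marginal gain with respect to the prefix of earlier elements."""
    order = sorted(V, key=lambda i: -w[i])
    s, prefix = {}, set()
    for i in order:
        s[i] = F(prefix | {i}) - F(prefix)  # s*_{σ(i)} = F(σ(i) | {σ(1), …, σ(i−1)})
        prefix.add(i)
    return s

# Example with the submodular function F(A) = sqrt(|A|); inputs are invented.
V = [0, 1, 2]
F = lambda A: math.sqrt(len(A))
s = greedy_linear_maximizer(F, V, {0: 0.1, 1: 0.9, 2: 0.5})

# The result lies in B(F): it is tight at V, i.e. s(V) = F(V).
print(s[1], abs(sum(s.values()) - F(set(V))) < 1e-12)  # → 1.0 True
```

Used as the linear-minimization oracle inside Frank-Wolfe (with w = −∇f(x_k), since minimizing a linear function over B(F) is maximizing its negation), each iteration costs one sort plus O(|V|) evaluations of F.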
Hence, we can consider Problem (3) as that of maximizing the entropy over the set of factorized distributions with parameters in −B(F). We can go back to the standard representation using the marginals p via p_i = 1/(1 + exp(s_i)). This becomes obvious if we consider the Fenchel dual of the problem, which, as discussed in §5, allows us to make connections with the classical mean-field approach. To this end, we introduce the Lovász extension, defined for any F : 2^V → R as the support function over B(F), i.e. f(p) = sup_{s∈B(F)} sᵀp [30]. Let us also define for p ∈ [0, 1]^V by H[p] the Shannon entropy of a vector of |V| independent Bernoulli random variables with success probabilities p.

Lemma 3. The Fenchel dual problem of Problem (3) is

maximize_{p∈[0,1]^V} H[p] − f(p).   (4)

Moreover, there is zero duality gap, and the pair (s*, p*) is primal-dual optimal if and only if

p* = (1/(1 + exp(s*_1)), . . . , 1/(1 + exp(s*_n)))   and   f(p*) = p*ᵀs*.   (5)

From the discussion above, it can be easily seen that the Fenchel dual reparameterizes the problem from the parameters −s to the marginals p. Note that the dual lets us provide a certificate of optimality, as the Lovász extension can be computed with Edmonds' greedy algorithm.

4.2 Optimizing over supergradients

To optimize over supergradients, we pick for each set X ⊆ V a representative supergradient and optimize over all X.
As in [27], we consider the following supergradients, elements of ∂̄F(X).

Grow supergradient ŝ_X:   for i ∈ X, ŝ_X({i}) = F(i|V − {i});   for i ∉ X, ŝ_X({i}) = F(i|X).
Shrink supergradient š_X:   for i ∈ X, š_X({i}) = F(i|X − {i});   for i ∉ X, š_X({i}) = F({i}).
Bar supergradient s̄_X:   for i ∈ X, s̄_X({i}) = F(i|V − {i});   for i ∉ X, s̄_X({i}) = F({i}).

Optimizing the bound over bar supergradients requires the minimization of the original function plus a modular term. As already mentioned for the divide-and-conquer strategy above, we can do this efficiently for several problems. The exact formulation of the problem is presented below.

Lemma 4. Define the modular functions m_1({i}) = log(1 + e^{−F(i|V−i)}) − log(1 + e^{F({i})}), and m_2({i}) = log(1 + e^{F(i|V−i)}) − log(1 + e^{−F({i})}). The following pairs of problems are equivalent.

minimize_X log Z^+_X(s̄_X)  ≡  minimize_X F(X) + m_1(X)
maximize_X log Z^−_X(s̄_X)  ≡  minimize_X F(X) − m_2(X)

Even though we cannot optimize over grow and shrink supergradients, we can evaluate all three at the optimum for the problems above and pick the one that gives the best bound.

5 Mean-field methods and the multi-linear extension

Is there a relation to traditional variational methods? If Q(·) is a distribution over subsets of V, then

0 ≤ KL(Q || P) = E_Q[log(Q(S)/P(S))] = log Z + E_Q[log(Q(S)/exp(−F(S)))] = log Z − H[Q] + E_Q[F],

which yields the bound log Z ≥ H[Q] − E_Q[F]. The mean-field method restricts Q to be a completely factorized distribution, so that elements are picked independently and Q can be described by the vector of marginals q ∈ [0, 1]^V, over which it is then optimized. Compare this with our approach.

Mean-Field Objective:   maximize_{q∈[0,1]^V} H[q] − E_q[F]   ▷ Non-concave, can be hard to evaluate.
Our Objective (L-FIELD):   maximize_{q∈[0,1]^V} H[q] − f(q)   ▷ Concave, efficient to evaluate.

Both the Lovász extension f(q) and the multi-linear extension f̃(q) = E_q[F] are continuous extensions of F, introduced for submodular minimization [30] and maximization [31], respectively. The former agrees with the convex envelope of F and can be efficiently evaluated (in O(|V|) evaluations of F) using Edmonds' greedy algorithm (cf. §4.1, [1]). In contrast, evaluating f̃(q) = E_q[F] = ∑_{A⊆V} ∏_i q_i^{[i∈A]} (1 − q_i)^{[i∉A]} F(A) in general requires summing over exponentially many terms — a problem potentially as hard as the original inference problem! Even if f̃(q) is approximated by sampling, it is neither convex nor concave. Moreover, computing the coordinate ascent updates of mean-field can be intractable for general F. Hence, our approach can be motivated as follows: instead of using the multi-linear extension f̃, we use the Lovász extension f of F, which makes the problem convex and tractable. This analogy motivated the name L-FIELD (L for Lovász).

6 Curvature-dependent approximation bounds

How accurate are the bounds obtained via our variational approach? We now provide theoretical guarantees on the approximation quality as a function of the curvature of F, which quantifies how far the function is from modularity.
Curvature is defined for polymatroid functions, which are normalized non-decreasing submodular functions, i.e., a submodular function F : 2^V → R is polymatroid if for all A ⊆ B ⊆ V it holds that F(A) ≤ F(B).

Definition 2 (From [32]). Let G : 2^V → R be a polymatroid function. The curvature κ of G is defined as⁵ κ = 1 − min_{i∈V : G({i})>0} G(i|V − {i}) / G({i}).

The curvature is always between 0 and 1 and is equal to 0 if and only if the function is modular. Although the curvature is a notion for polymatroid functions, we can still show results for the general case as any submodular function F can be decomposed [33] as the sum of a modular term m(·) defined as m({i}) = F(i|V − {i}) and G = F − m, which is a polymatroid function. Our bounds below depend on the curvature of G and G_MAX = G(V) = F(V) − ∑_{i∈V} F(i|V − i).

Theorem 1. Let F = G + m, where G is polymatroid with curvature κ and m is modular defined as above. Pick any bijection σ : V → {1, 2, . . . , |V|} and define sets S^σ_0 = ∅, S^σ_i = {σ(1), . . . , σ(i)}. If we define s : s_{σ(i)} = G(S^σ_i) − G(S^σ_{i−1}), then s + m ∈ ∂F(∅) and the following inequalities hold.

log Z^−(s + m, 0) − log ∑_{A⊆V} exp(−F(A)) ≤ κ G_MAX   (6)
log ∑_{A⊆V} exp(+F(A)) − log Z^+(s + m, 0) ≤ κ G_MAX   (7)

Theorem 2. Under the same assumptions as in Theorem 1, if we define the modular function s(·) by s(A) = ∑_{i∈A} G({i}), then s + m ∈ ∂̄F(∅) and the following inequalities hold.

log ∑_{A⊆V} exp(−F(A)) − log Z^−(s + m, 0) ≤ κ(n − 1)/(1 + (n − 1)(1 − κ)) G_MAX ≤ κ/(1 − κ) G_MAX   (8)
log Z^+(s + m, 0) − log ∑_{A⊆V} exp(+F(A)) ≤ κ(n − 1)/(1 + (n − 1)(1 − κ)) G_MAX ≤ κ/(1 − κ) G_MAX   (9)

Note that we establish bounds for specific sub-/supergradients. Since our variational scheme considers these in the optimization as well, the same quality guarantees hold for the optimized bounds. Further, note that we get a dependence on the range of the function via G_MAX. However, if we consider αF for large α > 1, most of the mass will be concentrated at the MAP (assuming it is unique). In this case, L-FIELD also performs well, as it can always choose gradients that are tight at the MAP. When we optimize over supergradients, all possible tight sets are considered. Similarly, the subgradients are optimized over B(F), and for any X ⊆ V there exists some s_X ∈ B(F) tight at X.

7 Experiments

Our experiments⁶ aim to address four main questions: (1) How large is the gap between the upper- and lower-bounds for the log-partition function and the marginals? (2) How accurate are the factorized approximations obtained from a single MAP-like optimization problem? (3) How does the accuracy depend on the amount of evidence (i.e., concentration of the posterior), the curvature of the function, and the type of Bayesian submodular model considered?
(4) How does L-FIELD compare to mean-field on problems where the latter can be applied?

We consider approximate marginals obtained from the following methods: lower/upper: obtained from the factorized distributions associated with the modular lower/upper bounds; lower-/upper-bound: the lower/upper bound of the estimated probability interval. All of the functions we consider are graph-representable [17], which allows us to perform the optimization over superdifferentials using a single graph cut and use the exact divide-and-conquer algorithm. We used the min-cut implementation from [34]. Since the update equations are easily computable, we have also implemented mean-field for the first experiment. For the other two experiments computing the updates requires exhaustive enumeration and is intractable. The results are shown on Figure 1 and the experiments are explained below. We plot the averages of several repetitions of the experiments. Note that computing intervals for marginals requires two MAP-like optimizations per variable; hence we focus on small problems with |V| = 100. We point out that obtaining a single factorized approximation (as produced, e.g., by mean-field) only requires a single MAP-like optimization, which can be done for more than 270,000 variables [19].

⁵We differ from the convention to remove i ∈ V s.t. G({i}) = 0. Please see the appendix for a discussion.
⁶The code will be made available at http://las.ethz.ch.

Log-supermodular: Cuts / Pairwise MRFs. Our first experiment evaluates L-FIELD on a sequence of distributions that are increasingly more concentrated. Motivated by applications in semi-supervised learning, we sampled data from a 2-dimensional Gaussian mixture model with 2 clusters. The centers were sampled from N([3, 3], I) and N([−3, −3], I) respectively. For each cluster, we sampled n = 50 points from a bivariate normal. These 2n points were then used as nodes to create a graph with weight between points x and x′ equal to e^{−||x−x′||}. As prior we chose P(A) ∝ exp(−F(A)), where F is the cut function in this graph, hence P(A) is a regular MRF. Then, for k = 1, . . . , n we consider the conditional distribution on the event that k points from the first cluster are on one side of the cut and k points from the other cluster are on the other side. As we provide more evidence, the posterior concentrates, and the intervals for both the log-partition function and marginals shrink. Compared with ground truth, the estimates of the marginal probabilities improve as well. Due to non-convexity, mean-field occasionally gets stuck in local optima, resulting in very poor marginals. To prevent this, we chose the best run out of 20 random restarts. These best runs produced slightly better marginals than L-FIELD for this model, at the cost of less robustness.

Log-supermodular: Decomposable functions. Our second experiment assesses the performance as a function of the curvature of F. It is motivated by a problem in outbreak detection on networks. Assume that we have a graph G = (V, E) and some of its nodes E ⊆ V have been infected by some contagious process. Instead of E, we observe a noisy set N ⊆ V, corrupted with a false positive rate of 0.1 and a false negative rate of 0.2. We used a log-supermodular prior P(A) ∝ exp(−∑_{v∈V} (|N_v ∩ A|/|N_v|)^µ), where µ ∈ [0, 1] and N_v is the union of v and its neighbors. This prior prefers smaller sets and sets that are more clustered on the graph. Note that µ controls the preference of clustered nodes and affects the curvature. We sampled random graphs with 100 nodes from a Watts-Strogatz model and obtained E by running an independent cascade starting from 2 random nodes.
Then, for varying \u00b5, we consider the posterior, which is log-supermodular, as the noise model\nresults in a modular likelihood. As the curvature increases, the intervals for both the log-partition\nfunction and marginals decrease as expected. Surprisingly, the marginals are very accurate (< 0.1\naverage error) even for very large curvature. This suggests that our curvature dependent bounds are\nvery conservative, and much better performance can be expected in practice.\nLog-submodular: Facility location modeling. Our last experiment evaluates how accurate L-\nFIELD is when quantifying uncertainty in submodular maximization tasks. Concretely, we consider\nthe problem of sensor placement in water distribution networks, which can be modeled as sub-\nmodular maximization [35]. More speci\ufb01cally, we have a water distribution network and there are\nsome junctions V where we can put sensors that can detect contaminated water. We also have a\nset I of contamination scenarios. For each i \u2208 I and j \u2208 V we have a utility Ci,j \u2208 [0, 1], that\ncomes from real data [35]. Moreover, as the sensors are expensive, we would like to use as few\nas possible. We use the facility-location model, more precisely P (S = A) \u221d exp(F (A) \u2212 2|A|),\ni\u2208N maxj\u2208A Ci,j. Instead of optimizing for a \ufb01xed placement, here we consider\nthe problem of sampling from P in order to quantify the uncertainty in the optimization task. We\nused the following sampling strategy. We consider nodes v \u2208 V in some order. We then sample a\nBernoulli Z with probability P (Z = 1) = qv based on the factorized distribution q from the modu-\nlar upper bound. We then condition on v \u2208 S if Z = 1, or v /\u2208 S if Z = 0. In the computation of the\nlower bound we used the subgradient sg computed from the greedy order of V \u2014 the i-th element\nin this order v1, . . . 
, vn is the one that gives the highest improvement when added to the set formed by the previous i − 1 elements. Then sg ∈ ∂F (∅) with sg(vi) = F (vi | {v1, . . . , vi−1}). We repeated the experiment several times, using 500 randomly sampled contamination scenarios and 100 locations from a larger dataset. Note that our approximations get better as we condition on more information (i.e., as we proceed through the iterations of the sampling procedure above). Also note that even from the very beginning, the marginals are very accurate (< 0.1 average error).
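The greedy subgradient construction described above can be sketched as follows. This is a toy illustration with randomly generated utilities, not the original experiment; the sizes, seed, and variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)       # toy data; the real utilities come from [35]
num_scen, num_loc = 20, 8            # the experiment used 500 scenarios, 100 locations
C = rng.random((num_scen, num_loc))  # utilities C[i, j] in [0, 1]

def F(A):
    """Facility-location function F(A) = sum_i max_{j in A} C[i, j], with F(empty) = 0."""
    if not A:
        return 0.0
    return C[:, sorted(A)].max(axis=1).sum()

def greedy_subgradient(F, ground_size):
    """Subgradient sg from the greedy order: the i-th element picked is the one
    with the highest marginal gain, and sg[v_i] = F(v_i | {v_1, ..., v_{i-1}})."""
    A, fA = set(), 0.0
    sg, order = {}, []
    remaining = set(range(ground_size))
    while remaining:
        gains = {v: F(A | {v}) - fA for v in remaining}
        v = max(gains, key=gains.get)  # element with the highest marginal gain
        order.append(v)
        sg[v] = gains[v]
        A.add(v)
        fA += gains[v]
        remaining.remove(v)
    return order, sg

order, sg = greedy_subgradient(F, num_loc)
```

By submodularity the greedy gains are non-increasing along the order, and the resulting modular function is a lower bound: the sum of sg[v] over v ∈ A is at most F(A) for every A, with equality at the empty set and at the full ground set.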
[Figure 1 panels: (a,d,g) bounds on the log-partition function; (b,e,h) average gap between the upper and lower probability bounds; (c,f,i) mean absolute error of the marginals.]

Figure 1: Experiments on [CT] Cuts (a-c), [NW] network detection (d-f), [SP] sensor placement (g-i). Note that to generate (c,f,i) we had to compute the exact marginals by exhaustive enumeration. Hence, these three graphs were created using a smaller ground set of size 20. The error bars capture 3 standard errors.

8 Conclusion

We proposed L-FIELD, the first variational method for approximate inference in general Bayesian submodular and supermodular models. Our approach has several attractive properties: it produces rigorous upper and lower bounds on the log-partition function and on marginal probabilities.
These bounds can be optimized efficiently via convex and submodular optimization. Accurate factorized approximations can be obtained at the same computational cost as performing MAP inference in the underlying model, a problem for which a vast array of scalable methods is available. Furthermore, we identified a natural connection to the traditional mean-field method and bounded the quality of our approximations in terms of the curvature of the function. Our experiments demonstrate the accuracy of our inference scheme on several natural examples of Bayesian submodular models. We believe that our results present a significant step in understanding the role of submodularity – so far mainly considered for optimization – in approximate Bayesian inference. Furthermore, L-FIELD presents a significant advance in our ability to perform probabilistic inference in models with complex, high-order dependencies, which present a major challenge for classical techniques.

Acknowledgments. This research was supported in part by SNSF grant 200021 137528, ERC StG 307036 and a Microsoft Research Faculty Fellowship.

References

[1] J. Edmonds. “Submodular functions, matroids, and certain polyhedra”. In: Combinatorial Structures and Their Applications (1970), pp. 69–87.

[2] D. Golovin and A. Krause. “Adaptive Submodularity: Theory and Applications in Active Learning and Stochastic Optimization”. In: Journal of Artificial Intelligence Research (JAIR) 42 (2011), pp. 427–486.

[3] Y. Yue and C. Guestrin. “Linear Submodular Bandits and its Application to Diversified Retrieval”. In: Neural Information Processing Systems (NIPS). 2011.

[4] H. Lin and J. Bilmes. “A class of submodular functions for document summarization”. In: 49th Annual Meeting of the Association for Computational Linguistics: HLT. 2011, pp. 510–520.

[5] V. Cevher and A. Krause.
\u201cGreedy Dictionary Selection for Sparse Representation\u201d. In: IEEE Journal\n\nof Selected Topics in Signal Processing 99.5 (2011), pp. 979\u2013988.\n\n[6] M. Narasimhan, N. Jojic, and J. Bilmes. \u201cQ-clustering\u201d. In: NIPS. Vol. 5. 10.10. 2005, p. 5.\n[7] F. Bach. \u201cStructured sparsity-inducing norms through submodular functions.\u201d In: NIPS. 2010.\n[8] S. Fujishige. Submodular functions and optimization. Vol. 58. Annals of Discrete Mathematics. 2005.\n[9] F. Bach. \u201cLearning with submodular functions: a convex optimization perspective\u201d. In: Foundations and\n\nTrends R(cid:13) in Machine Learning 6.2-3 (2013), pp. 145\u2013373. ISSN: 1935-8237.\n\n[10] S. Jegelka, H. Lin, and J. A. Bilmes. \u201cOn fast approximate submodular minimization.\u201d In: NIPS. 2011.\n[11] N. Buchbinder, M. Feldman, J. Naor, and R. Schwartz. \u201cA tight linear time (1/2)-approximation for\n\nunconstrained submodular maximization\u201d. In: Foundations of Computer Science (FOCS). 2012.\n\n[12] Y. Boykov, O. Veksler, and R. Zabih. \u201cFast approximate energy minimization via graph cuts\u201d. In: Pattern\n\nAnalysis and Machine Intelligence, IEEE Transactions on 23.11 (2001), pp. 1222\u20131239.\n\n[13] M. Jerrum and A. Sinclair. \u201cPolynomial-time approximation algorithms for the Ising model\u201d. In: SIAM\n\nJournal on computing 22.5 (1993), pp. 1087\u20131116.\n\n[14] L. A. Goldberg and M. Jerrum. \u201cThe complexity of ferromagnetic Ising with local \ufb01elds\u201d. In: Combina-\n\ntorics, Probability and Computing 16.01 (2007), pp. 43\u201361.\nJ. Gillenwater, A. Kulesza, and B. Taskar. \u201cNear-Optimal MAP Inference for Determinantal Point Pro-\ncesses\u201d. In: Proc. Neural Information Processing Systems (NIPS). 2012.\n\n[16] M. J. Wainwright and M. I. Jordan. \u201cGraphical Models, Exponential Families, and Variational Infer-\n\n[15]\n\nence\u201d. In: Found. Trends Mach. Learn. 1.1-2 (2008), pp. 1\u2013305.\n\n[17] V. 
Kolmogorov and R. Zabih. “What energy functions can be minimized via graph cuts?” In: IEEE Transactions on Pattern Analysis and Machine Intelligence 26.2 (2004), pp. 147–159.

[18] P. Stobbe and A. Krause. “Efficient Minimization of Decomposable Submodular Functions”. In: Proc. Neural Information Processing Systems (NIPS). 2010.

[19] S. Jegelka, F. Bach, and S. Sra. “Reflection methods for user-friendly submodular optimization”. In: Advances in Neural Information Processing Systems. 2013, pp. 1313–1321.

[20] S. Jegelka and J. Bilmes. “Submodularity beyond submodular energies: coupling edges in graph cuts”. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2011, pp. 1897–1904.

[21] A. Krause and C. Guestrin. “Near-optimal Nonmyopic Value of Information in Graphical Models”. In: Conference on Uncertainty in Artificial Intelligence (UAI). 2005.

[22] A. Krause and D. Golovin. “Submodular Function Maximization”. In: Tractability: Practical Approaches to Hard Problems (to appear). Cambridge University Press, 2014.

[23] P. Kohli, L. Ladický, and P. H. Torr. “Robust higher order potentials for enforcing label consistency”. In: International Journal of Computer Vision 82.3 (2009), pp. 302–324.

[24] A. Kulesza and B. Taskar. “Determinantal Point Processes for Machine Learning”. In: Foundations and Trends in Machine Learning 5.2–3 (2012).

[25] R. Gomes and A. Krause. “Budgeted Nonparametric Learning from Data Streams”. In: ICML. 2010.

[26] K. El-Arini, G. Veda, D. Shahaf, and C. Guestrin. “Turning down the noise in the blogosphere”. In: Proc. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2009.

[27] R. Iyer, S. Jegelka, and J. Bilmes. “Fast Semidifferential-based Submodular Function Optimization”. In: ICML.
2013, pp. 855–863.

[28] M. Frank and P. Wolfe. “An algorithm for quadratic programming”. In: Naval Research Logistics Quarterly 3.1-2 (1956), pp. 95–110.

[29] M. Jaggi. “Revisiting Frank-Wolfe: Projection-free sparse convex optimization”. In: 30th International Conference on Machine Learning (ICML-13). 2013, pp. 427–435.

[30] L. Lovász. “Submodular functions and convexity”. In: Mathematical Programming: The State of the Art. Springer, 1983, pp. 235–257.

[31] G. Calinescu, C. Chekuri, M. Pál, and J. Vondrák. “Maximizing a submodular set function subject to a matroid constraint”. In: Integer Programming and Combinatorial Optimization. Springer, 2007.

[32] M. Conforti and G. Cornuéjols. “Submodular set functions, matroids and the greedy algorithm: tight worst-case bounds and some generalizations of the Rado-Edmonds theorem”. In: Discrete Applied Mathematics 7.3 (1984), pp. 251–274.

[33] W. H. Cunningham. “Decomposition of submodular functions”. In: Combinatorica 3.1 (1983).

[34] Y. Boykov and V. Kolmogorov. “An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence 26.9 (2004).

[35] A. Krause, J. Leskovec, C. Guestrin, J. VanBriesen, and C. Faloutsos. “Efficient Sensor Placement Optimization for Securing Large Water Distribution Networks”. In: Journal of Water Resources Planning and Management 134.6 (2008), pp. 516–526.