{"title": "Sampling from Probabilistic Submodular Models", "book": "Advances in Neural Information Processing Systems", "page_first": 1945, "page_last": 1953, "abstract": "Submodular and supermodular functions have found wide applicability in machine learning, capturing notions such as diversity and regularity, respectively. These notions have deep consequences for optimization, and the problem of (approximately) optimizing submodular functions has received much attention. However, beyond optimization, these notions allow specifying expressive probabilistic models that can be used to quantify predictive uncertainty via marginal inference. Prominent, well-studied special cases include Ising models and determinantal point processes, but the general class of log-submodular and log-supermodular models is much richer and little studied. In this paper, we investigate the use of Markov chain Monte Carlo sampling to perform approximate inference in general log-submodular and log-supermodular models. In particular, we consider a simple Gibbs sampling procedure, and establish two sufficient conditions, the first guaranteeing polynomial-time, and the second fast (O(nlogn)) mixing. We also evaluate the efficiency of the Gibbs sampler on three examples of such models, and compare against a recently proposed variational approach.", "full_text": "Sampling from Probabilistic Submodular Models\n\nAlkis Gotovos\nETH Zurich\n\nS. Hamed Hassani\n\nETH Zurich\n\nAndreas Krause\n\nETH Zurich\n\nalkisg@inf.ethz.ch\n\nhamed@inf.ethz.ch\n\nkrausea@ethz.ch\n\nAbstract\n\nSubmodular and supermodular functions have found wide applicability in ma-\nchine learning, capturing notions such as diversity and regularity, respectively.\nThese notions have deep consequences for optimization, and the problem of (ap-\nproximately) optimizing submodular functions has received much attention. 
How-\never, beyond optimization, these notions allow specifying expressive probabilis-\ntic models that can be used to quantify predictive uncertainty via marginal infer-\nence. Prominent, well-studied special cases include Ising models and determinan-\ntal point processes, but the general class of log-submodular and log-supermodular\nmodels is much richer and little studied. In this paper, we investigate the use of\nMarkov chain Monte Carlo sampling to perform approximate inference in gen-\neral log-submodular and log-supermodular models. In particular, we consider a\nsimple Gibbs sampling procedure, and establish two suf\ufb01cient conditions, the \ufb01rst\nguaranteeing polynomial-time, and the second fast (O(n log n)) mixing. We also\nevaluate the ef\ufb01ciency of the Gibbs sampler on three examples of such models,\nand compare against a recently proposed variational approach.\n\n1\n\nIntroduction\n\nModeling notions such as coverage, representativeness, or diversity is an important challenge in\nmany machine learning problems. These notions are well captured by submodular set functions.\nAnalogously, supermodular functions capture notions of smoothness, regularity, or cooperation. As\na result, submodularity and supermodularity, akin to concavity and convexity, have found numerous\napplications in machine learning. 
The majority of previous work has focused on optimizing such\nfunctions, including the development and analysis of algorithms for minimization [10] and maxi-\nmization [9,26], as well as the investigation of practical applications, such as sensor placement [21],\nactive learning [12], in\ufb02uence maximization [19], and document summarization [25].\nBeyond optimization, though, it is of interest to consider probabilistic models de\ufb01ned via submod-\nular functions, that is, distributions over \ufb01nite sets (or, equivalently, binary random vectors) de\ufb01ned\nas p(S) \u221d exp(\u03b2F (S)), where F : 2V \u2192 R is a submodular or supermodular function (equiva-\nlently, either F or \u2212F is submodular), and \u03b2 \u2265 0 is a scaling parameter. Finding most likely sets in\nsuch models captures classical submodular optimization. However, going beyond point estimates,\nthat is, performing general probabilistic (e.g., marginal) inference in them, allows us to quantify\nuncertainty given some observations, as well as learn such models from data. Only few special\ncases belonging to this class of models have been extensively studied in the past; most notably,\nIsing models [20], which are log-supermodular in the usual case of attractive (ferromagnetic) po-\ntentials, or log-submodular under repulsive (anti-ferromagnetic) potentials, and determinantal point\nprocesses [23], which are log-submodular.\nRecently, Djolonga and Krause [6] considered a more general treatment of such models, and pro-\nposed a variational approach for performing approximate probabilistic inference for them.\nIt is\nnatural to ask to what degree the usual alternative to variational methods, namely Monte Carlo sam-\npling, is applicable to these models, and how it performs in comparison. To this end, in this paper\n\n1\n\n\fwe consider a simple Markov chain Monte Carlo (MCMC) algorithm on log-submodular and log-\nsupermodular models, and provide a \ufb01rst analysis of its performance. 
We present two theoretical conditions that respectively guarantee polynomial-time and fast (O(n log n)) mixing in such models, and experimentally compare against the variational approximations on three examples.

2 Problem Setup

We start by considering set functions F : 2^V → R, where V is a finite ground set of size |V| = n. Without loss of generality, if not otherwise stated, we will hereafter assume that V = [n] := {1, 2, ..., n}. The marginal gain obtained by adding element v ∈ V to set S ⊆ V is defined as F(v|S) := F(S ∪ {v}) − F(S). Intuitively, submodularity expresses a notion of diminishing returns; that is, adding an element to a larger set provides less benefit than adding it to a smaller one. More formally, F is submodular if, for any S ⊆ T ⊆ V, and any v ∈ V \ T, it holds that F(v|T) ≤ F(v|S). Supermodularity is defined analogously by reversing the sign of this inequality. In particular, if a function F is submodular, then the function −F is supermodular. If a function m is both submodular and supermodular, then it is called modular, and may be written in the form m(S) = c + Σ_{v∈S} m_v, where c ∈ R, and m_v ∈ R, for all v ∈ V.

Our main focus in this paper is distributions over the powerset of V of the form

    p(S) = exp(βF(S)) / Z,    (1)

for all S ⊆ V, where F is submodular or supermodular. The scaling parameter β is referred to as the inverse temperature, and distributions of the above form are called log-submodular or log-supermodular, respectively. The constant denominator Z := Z(β) := Σ_{S⊆V} exp(βF(S)) serves the purpose of normalizing the distribution and is called the partition function of p. An alternative and equivalent way of defining distributions of the above form is via binary random vectors X ∈ {0, 1}^n.
If we define V(X) := {v ∈ V | X_v = 1}, it is easy to see that the distribution p_X(X) ∝ exp(βF(V(X))) over binary vectors is isomorphic to the distribution over sets of (1). With a slight abuse of notation, we will use F(X) to denote F(V(X)), and use p to refer to both distributions.

Example models  The (ferromagnetic) Ising model is an example of a log-supermodular model. In its simplest form, it is defined through an undirected graph (V, E), and a set of pairwise potentials σ_{v,w}(S) := 4(1{v∈S} − 0.5)(1{w∈S} − 0.5). Its distribution has the form p(S) ∝ exp(β Σ_{{v,w}∈E} σ_{v,w}(S)), and is log-supermodular, because F(S) = Σ_{{v,w}∈E} σ_{v,w}(S) is supermodular. (Each σ_{v,w} is supermodular, and supermodular functions are closed under addition.)

Determinantal point processes (DPPs) are examples of log-submodular models. A DPP is defined via a positive semidefinite matrix K ∈ R^{n×n}, and has a distribution of the form p(S) ∝ det(K_S), where K_S denotes the square submatrix indexed by S. Since F(S) = ln det(K_S) is a submodular function, p is log-submodular. Another example of log-submodular models are those defined through facility location functions, which have the form F(S) = Σ_{ℓ∈[L]} max_{v∈S} w_{v,ℓ}, where w_{v,ℓ} ≥ 0, and are submodular. If w_{v,ℓ} ∈ {0, 1}, then F represents a set cover function.

Note that both the facility location model and the Ising model use decomposable functions, that is, functions that can be written as a sum of simpler submodular (resp. supermodular) functions F_ℓ:

    F(S) = Σ_{ℓ∈[L]} F_ℓ(S).    (2)

Marginal inference  Our goal is to perform marginal inference for the distributions described above.
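To make the setup concrete, the following sketch computes the partition function Z and the exact marginals p(v ∈ S) by brute-force enumeration for a tiny facility location model of the form (1)-(2). The ground-set size, weight matrix, and β below are made-up illustrative values, not taken from the paper:

```python
import itertools
import math

# Hypothetical facility location weights w[v][l]; F(S) = sum_l max_{v in S} w[v][l].
n, L, beta = 4, 3, 1.0
w = [[0.2, 0.7, 0.1],
     [0.9, 0.1, 0.3],
     [0.4, 0.5, 0.8],
     [0.6, 0.2, 0.6]]

def F(S):
    # max over an empty set contributes 0 (empty-set convention)
    return sum(max((w[v][l] for v in S), default=0.0) for l in range(L))

# Enumerate all 2^n subsets to get the partition function Z ...
subsets = [frozenset(c) for k in range(n + 1)
           for c in itertools.combinations(range(n), k)]
Z = sum(math.exp(beta * F(S)) for S in subsets)

# ... and the exact marginals p(v in S).
marginals = [sum(math.exp(beta * F(S)) for S in subsets if v in S) / Z
             for v in range(n)]
print(marginals)
```

This enumeration costs 2^n evaluations of F, which is exactly why the paper turns to sampling for larger ground sets.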
Concretely, for some fixed A ⊆ B ⊆ V, we would like to compute the probability of sets S that contain all elements of A, but no elements outside of B, that is, p(A ⊆ S ⊆ B). More generally, we are interested in computing conditional probabilities of the form p(A ⊆ S ⊆ B | C ⊆ S ⊆ D). This computation can be reduced to computing unconditional marginals as follows. For any C ⊆ V, define the contraction of F on C, F_C : 2^{V\C} → R, by F_C(S) = F(S ∪ C) − F(S), for all S ⊆ V \ C. Also, for any D ⊆ V, define the restriction of F to D, F^D : 2^D → R, by F^D(S) = F(S), for all S ⊆ D. If F is submodular, then its contractions and restrictions are also submodular, and, thus, (F_C)^D is submodular. Finally, it is easy to see that p(S | C ⊆ S ⊆ D) ∝ exp(β(F_C)^D(S)). In our experiments, we consider computing marginals of the form p(v ∈ S | C ⊆ S ⊆ D), for some v ∈ V, which correspond to A = {v}, and B = V.

Algorithm 1 Gibbs sampler
Input: Ground set V, distribution p(S) ∝ exp(βF(S))
1: X_0 ← random subset of V
2: for t = 0 to N_iter do
3:   v ← Unif(V)
4:   ΔF(v|X_t) ← F(X_t ∪ {v}) − F(X_t \ {v})
5:   p_add ← exp(βΔF(v|X_t)) / (1 + exp(βΔF(v|X_t)))
6:   z ← Unif([0, 1])
7:   if z ≤ p_add then X_{t+1} ← X_t ∪ {v} else X_{t+1} ← X_t \ {v}
8: end for

3 Sampling and Mixing Times

Performing exact inference in models defined by (1) boils down to computing the partition function Z. Unfortunately, this is generally a #P-hard problem, which was shown to be the case even for Ising models by Jerrum and Sinclair [17].
However, they also proposed a sampling-based FPRAS for a\nclass of ferromagnetic models, which gives us hope that it may be possible to ef\ufb01ciently perform\napproximate inference in more general models under suitable conditions.\nMCMC sampling [24] approaches are based on performing randomly selected local moves in a\nstate space E to approximately compute quantities of interest. The visited states (X0, X1, . . .) form\na Markov chain, which under mild conditions converges to a stationary distribution \u03c0. Crucially,\nthe probabilities of transitioning from one state to another are carefully chosen to ensure that the\nstationary distribution is identical to the distribution of interest. In our case, the state space is the\npowerset of V (equivalently, the space of all binary vectors of length n), and to approximate the\nmarginal probabilities of p we construct a chain over subsets of V that has stationary distribution p.\n\nThe Gibbs sampler\nIn this paper, we focus on one of the simplest and most commonly used\nchains, namely the Gibbs sampler, also known as the Glauber chain. We denote by P the transition\nmatrix of the chain; each element P (x, y) corresponds to the conditional probability of transitioning\nfrom state x to state y, that is, P (x, y) := P[Xt+1 = y | Xt = x], for any x, y \u2208 E, and any t \u2265 0.\nWe also de\ufb01ne an adjacency relation x \u223c y on the elements of the state space, which denotes that x\nand y differ by exactly one element. It follows that each x \u2208 E has exactly n neighbors.\nThe Gibbs sampler is de\ufb01ned by an iterative two-step procedure, as shown in Algorithm 1. First, it\nselects an element v \u2208 V uniformly at random; then, it adds or removes v to the current state Xt\naccording to the conditional probability of the resulting state. 
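A direct transcription of this two-step procedure (select a uniform element, then add or remove it with its conditional probability, as in Algorithm 1) might look as follows. The toy function F at the end is an illustrative placeholder, not one of the paper's models:

```python
import math
import random

def gibbs_sampler(F, n, beta, n_iter, seed=0):
    """Gibbs sampler for p(S) proportional to exp(beta * F(S)) over subsets
    of V = {0, ..., n-1}; F maps a Python set to a real number."""
    rng = random.Random(seed)
    X = {v for v in range(n) if rng.random() < 0.5}   # X0: random subset of V
    chain = []
    for _ in range(n_iter):
        v = rng.randrange(n)                          # v ~ Unif(V)
        dF = F(X | {v}) - F(X - {v})                  # Delta F(v | Xt)
        p_add = math.exp(beta * dF) / (1.0 + math.exp(beta * dF))
        if rng.random() <= p_add:                     # add or remove v
            X = X | {v}
        else:
            X = X - {v}
        chain.append(frozenset(X))
    return chain

# Toy monotone submodular function (illustrative only).
chain = gibbs_sampler(lambda S: math.sqrt(len(S)), n=5, beta=1.0, n_iter=1000)
```

Each step touches a single element, so consecutive states differ by at most one element, matching the adjacency relation x ∼ y used in the analysis.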
Importantly, the conditional probabilities that need to be computed do not depend on the partition function Z, thus the chain can be simulated efficiently, even though Z is unknown and hard to compute. Moreover, it is easy to see that ΔF(v|X_t) = 1{v∉X_t} F(v|X_t) + 1{v∈X_t} F(v|X_t \ {v}); thus, the sampler only requires a black box for the marginal gains of F, which are often faster to compute than the values of F itself. Finally, it is easy to show that the stationary distribution of the chain constructed this way is p.

Mixing times  Approximating quantities of interest using MCMC methods is based on using time averages to estimate expectations over the desired distribution. In particular, we estimate the expected value of a function f : E → R by E_p[f(X)] ≈ (1/T) Σ_{r=1}^{T} f(X_{s+r}). For example, to estimate the marginal p(v ∈ S), for some v ∈ V, we would define f(x) = 1{x_v=1}, for all x ∈ E. The choice of burn-in time s and number of samples T in the above expression presents a tradeoff between computational efficiency and approximation accuracy. It turns out that the effect of both s and T is largely dependent on a fundamental quantity of the chain called the mixing time [24].

The mixing time of a chain quantifies the number of iterations t required for the distribution of X_t to be close to the stationary distribution π. More formally, it is defined as t_mix(ε) := min{t | d(t) ≤ ε}, where d(t) denotes the worst-case (over the starting state X_0 of the chain) total variation distance between the distribution of X_t and π.
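The time-average estimator just described can be sketched as a single Gibbs run whose burn-in prefix of length s is discarded; the function, β, and chain lengths below are illustrative assumptions, not settings from the paper:

```python
import math
import random

def estimate_marginals(F, n, beta, n_iter, burn_in, seed=0):
    """Estimate p(v in S) by the time average (1/T) * sum_r f(X_{s+r}),
    with f(x) = 1{x_v = 1}, from one Gibbs sampler run (a sketch)."""
    rng = random.Random(seed)
    X = {v for v in range(n) if rng.random() < 0.5}
    counts = [0] * n
    for t in range(n_iter):
        v = rng.randrange(n)
        dF = F(X | {v}) - F(X - {v})
        if rng.random() <= math.exp(beta * dF) / (1 + math.exp(beta * dF)):
            X.add(v)
        else:
            X.discard(v)
        if t >= burn_in:                 # discard the burn-in prefix s
            for u in X:
                counts[u] += 1
    T = n_iter - burn_in
    return [c / T for c in counts]

# Symmetric toy function, so all marginals should come out roughly equal.
marg = estimate_marginals(lambda S: math.sqrt(len(S)), n=4, beta=1.0,
                          n_iter=20000, burn_in=10000)
```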
Establishing upper bounds on the mixing time of our Gibbs sampler is, therefore, sufficient to guarantee efficient approximate marginal inference (e.g., see [24, Theorem 12.19]).

4 Theoretical Results

In the previous section we mentioned that exact computation of the partition function for the class of models we consider here is, in general, infeasible. Only for very few exceptions, such as DPPs, is exact inference possible in polynomial time [23]. Even worse, it has been shown that the partition function of general Ising models is hard to approximate; in particular, there is no FPRAS for these models, unless RP = NP [17]. This implies that the mixing time of any Markov chain with such a stationary distribution will, in general, be exponential in n. It is, therefore, our aim to derive sufficient conditions that guarantee sub-exponential mixing times for the general class of models.

In some of our results we will use the fact that any submodular function F can be written as

    F = c + m + f,    (3)

where c ∈ R is a constant that has no effect on distributions defined by (1); m is a normalized (m(∅) = 0) modular function; and f is a normalized (f(∅) = 0) monotone submodular function, that is, it additionally satisfies the monotonicity property f(v|S) ≥ 0, for all v ∈ V, and all S ⊆ V. A similar decomposition is possible for any supermodular function as well.

4.1 Polynomial-time mixing

Our guarantee for mixing times that are polynomial in n depends crucially on the following quantity, which is defined for any set function F : 2^V → R:

    ζ_F := max_{A,B⊆V} |F(A) + F(B) − F(A ∪ B) − F(A ∩ B)|.

Intuitively, ζ_F quantifies a notion of distance to modularity. To see this, note that a function F is modular if and only if F(A) + F(B) = F(A ∪ B) + F(A ∩ B), for all A, B ⊆ V.
For modular functions, therefore, we have ζ_F = 0. Furthermore, a function F is submodular if and only if F(A) + F(B) ≥ F(A ∪ B) + F(A ∩ B), for all A, B ⊆ V. Similarly, F is supermodular if the above holds with the sign reversed. It follows that for submodular and supermodular functions, ζ_F represents the worst-case amount by which F violates the modular equality. It is also important to note that, for submodular and supermodular functions, ζ_F depends only on the monotone part of F; if we decompose F according to (3), then it is easy to see that ζ_F = ζ_f. A trivial upper bound on ζ_F, therefore, is ζ_F ≤ f(V). Another quantity that has been used in the past to quantify the deviation of a submodular function from modularity is the curvature [4], defined as κ_F := 1 − min_{v∈V} (F(v|V \ {v}) / F(v)). Although of similar intuitive meaning, the multiplicative nature of its definition makes it significantly different from ζ_F, which is defined additively.

As an example of a function class with ζ_F that does not depend on n, assume a ground set V = ∪_{ℓ=1}^{L} V_ℓ, and consider functions F(S) = Σ_{ℓ=1}^{L} φ(|S ∩ V_ℓ|), where φ : R → R is a bounded concave function, for example, φ(x) = min{φ_max, x}. Functions of this form are submodular, and have been used in applications such as document summarization to encourage diversity [25]. It is easy to see that, for such functions, ζ_F ≤ L φ_max, that is, ζ_F is independent of n.

The following theorem establishes a bound on the mixing time of the Gibbs sampler run on models of the form (1). The bound is exponential in ζ_F, but polynomial in n.

Theorem 1. For any function F : 2^V → R, the mixing time of the Gibbs sampler is bounded by

    t_mix(ε) ≤ 2n² exp(2βζ_F) log(1 / (ε p_min)),

where p_min := min_{S∈E} p(S). If F is submodular or supermodular, then the bound is improved to

    t_mix(ε) ≤ 2n² exp(βζ_f) log(1 / (ε p_min)).

Note that, since the factor of two that constitutes the difference between the two statements of the theorem lies in the exponent, it can have a significant impact on the above bounds. The dependence on p_min is related to the (worst-case) starting state of the chain, and can be eliminated if we have a way to guarantee a high-probability starting state. If F is submodular or supermodular, this is usually straightforward to accomplish by using one of the standard constant-factor optimization algorithms [10, 26] as a preprocessing step. More generally, if F is bounded by 0 ≤ F(S) ≤ F_max, for all S ⊆ V, then log(1/p_min) = O(nβF_max).

Canonical paths  Our proof of Theorem 1 is based on the method of canonical paths [5, 15, 16, 28]. The high-level idea of this method is to view the state space as a graph, and try to construct a path between each pair of states that carries a certain amount of flow specified by the stationary distribution under consideration. Depending on the choice of these paths and the resulting load on the edges of the graph, we can derive bounds on the mixing time of the Markov chain.

More concretely, let us assume that for some set function F and corresponding distribution p as in (1), we construct the Gibbs chain on state space E = 2^V with transition matrix P. We can view the state space as a directed graph that has vertex set E, and, for any S, S′ ∈ E, contains edge (S, S′) if and only if S ∼ S′, that is, if and only if S and S′ differ by exactly one element. Now, assume that, for any pair of states A, B ∈ E, we define what is called a canonical path γ_AB := (A = S_0, S_1, . . .
, S_ℓ = B), such that all (S_i, S_{i+1}) are edges in the above graph. We denote the length of path γ_AB by |γ_AB|, and define Q(S, S′) := p(S) P(S, S′). We also denote the set of all pairs of states whose canonical path goes through (S, S′) by C_SS′ := {(A, B) ∈ E × E | (S, S′) ∈ γ_AB}. The following quantity, referred to as the congestion of an edge, uses a collection of canonical paths to quantify to what amount that edge is overloaded:

    ρ(S, S′) := (1 / Q(S, S′)) Σ_{(A,B)∈C_SS′} p(A) p(B) |γ_AB|.    (4)

The denominator Q(S, S′) quantifies the capacity of edge (S, S′), while the sum represents the total flow through that edge according to the choice of canonical paths. The congestion of the whole graph is then defined as ρ := max_{S∼S′} ρ(S, S′). Low congestion implies that there are no bottlenecks in the state space, and the chain can move around fast, which also suggests rapid mixing. The following theorem makes this concrete.

Theorem 2 ([15, 28]). For any collection of canonical paths with congestion ρ, the mixing time of the chain is bounded by

    t_mix(ε) ≤ ρ log(1 / (ε p_min)).

Proof outline of Theorem 1  To apply Theorem 2 to our class of distributions, we need to construct a set of canonical paths in the corresponding state space 2^V, and upper bound the resulting congestion. First, note that, to transition from state A ∈ E to state B ∈ E, in our case, it is enough to remove the elements of A \ B and add the elements of B \ A. Each removal and addition corresponds to an edge in the state space graph, and the order of these operations identifies a canonical path in this graph that connects A to B.
For our analysis, we assume a fixed order on V (e.g., the natural order of the elements themselves), and perform the operations according to this order.

Having defined the set of canonical paths, we proceed to bounding the congestion ρ(S, S′) for any edge (S, S′). The main difficulty in bounding ρ(S, S′) is due to the sum in (4) over all pairs in C_SS′. To simplify this sum we construct for each edge (S, S′) an injective map η_SS′ : C_SS′ → E; this is a combinatorial encoding technique that has been previously used in similar proofs to ours [15]. We then prove the following key lemma about these maps.

Lemma 1. For any S ∼ S′, and any A, B ∈ E, it holds that

    p(A) p(B) ≤ 2n exp(2βζ_F) Q(S, S′) p(η_SS′(A, B)).

Since η_SS′ is injective, it follows that Σ_{(A,B)∈C_SS′} p(η_SS′(A, B)) ≤ 1. Furthermore, it is clear that each canonical path γ_AB has length |γ_AB| ≤ n, since we need to add and/or remove at most n elements to get from state A to state B. Combining these two facts with the above lemma, we get

    ρ(S, S′) ≤ 2n² exp(2βζ_F).

If F is submodular or supermodular, we show that the dependence on ζ_F in Lemma 1 is improved to exp(βζ_F). More details can be found in the longer version of the paper.

4.2 Fast mixing

We now proceed to show that, under some stronger conditions, we are able to establish even faster, O(n log n), mixing.
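For small ground sets, the quantity ζ_F from Section 4.1 can be checked numerically by brute force over all pairs A, B. A minimal sketch, with a modular and a monotone submodular example (both functions are illustrative choices, not from the paper):

```python
import itertools
import math

def zeta(F, n):
    """Brute-force zeta_F = max_{A,B} |F(A)+F(B)-F(A|B)-F(A&B)| (small n only)."""
    subsets = [frozenset(c) for k in range(n + 1)
               for c in itertools.combinations(range(n), k)]
    return max(abs(F(A) + F(B) - F(A | B) - F(A & B))
               for A in subsets for B in subsets)

# A modular function has zeta_F = 0; a concave-of-cardinality function does not.
m = lambda S: 1.5 + sum(0.3 * (v + 1) for v in S)    # modular (illustrative)
f = lambda S: math.sqrt(len(S))                      # monotone submodular
print(zeta(m, 4), zeta(f, 4))
```

For f above with n = 4, the maximum is attained by two disjoint two-element sets, giving ζ_f = 2√2 − 2 ≈ 0.83.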
For any function F, we denote ΔF(v|S) := F(S ∪ {v}) − F(S \ {v}), and define the following quantity,

    γ_{F,β} := max_{S⊆V, r∈V} Σ_{v∈V} tanh( (β/2) |ΔF(v|S) − ΔF(v|S ∪ {r})| ),

which quantifies the (maximum) total influence of an element r ∈ V on the values of F. For example, if the inclusion of r makes no difference with respect to other elements of the ground set, we will have γ_{F,β} = 0. The following theorem establishes conditions for fast mixing of the Gibbs sampler when run on models of the form (1).

Theorem 3. For any set function F : 2^V → R, if γ_{F,β} < 1, then the mixing time of the Gibbs sampler is bounded by

    t_mix(ε) ≤ (1 / (1 − γ_{F,β})) n (log n + log(1/ε)).

If F is additionally submodular or supermodular, and is decomposed according to (3), then

    t_mix(ε) ≤ (1 / (1 − γ_{f,β})) n (log n + log(1/ε)).

Note that, in the second part of the theorem, γ_{f,β} depends only on the monotone part of F. We have seen in Section 2 that some commonly used models are based on decomposable functions that can be written in the form (2). We prove the following corollary that provides an easy to check condition for fast mixing of the Gibbs sampler when F is a decomposable submodular function.

Corollary 1.
For any submodular function F that can be written in the form of (2), with f being its monotone (also decomposable) part according to (3), if we define

    θ_f := max_{v∈V} Σ_{ℓ∈[L]} √(f_ℓ(v))   and   λ_f := max_{ℓ∈[L]} Σ_{v∈V} √(f_ℓ(v)),

then it holds that

    γ_{f,β} ≤ (β/2) θ_f λ_f.

For example, applying this to the facility location model defined in Section 2, we get θ_f = max_v Σ_{ℓ=1}^{L} √(w_{v,ℓ}) and λ_f = max_ℓ Σ_{v∈V} √(w_{v,ℓ}), and obtain fast mixing if θ_f λ_f ≤ 2/β. As a special case, if we consider the class of set cover functions (w_{v,ℓ} ∈ {0, 1}), such that each v ∈ V covers at most δ sets, and each set ℓ ∈ [L] is covered by at most δ elements, then θ_f, λ_f ≤ δ, and we obtain fast mixing if δ² ≤ 2/β. Note that the corollary can be trivially applied to any submodular function by taking L = 1, but may, in general, result in a loose bound if used that way.

Coupling  Our proof of Theorem 3 is based on the coupling technique [1]; more specifically, we use the path coupling method [2, 15, 24]. Given a Markov chain (Z_t) on state space E with transition matrix P, a coupling for (Z_t) is a new Markov chain (X_t, Y_t) on state space E × E, such that both (X_t) and (Y_t) are by themselves Markov chains with transition matrix P. The idea is to construct the coupling in such a way that, even when the starting points X_0 and Y_0 are different, the chains (X_t) and (Y_t) tend to coalesce. Then, it can be shown that the coupling time t_couple := min{t ≥ 0 | X_t = Y_t} is closely related to the mixing time of the original chain (Z_t).
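As a sketch, the condition of Corollary 1 can be evaluated directly for a small facility location instance; the weight matrix and β below are made-up illustrative values:

```python
import math

# Hypothetical facility location weights w[v][l] (illustrative values).
w = [[1.0, 0.0, 0.5],
     [0.2, 0.8, 0.0],
     [0.0, 0.4, 0.9]]
n, L = len(w), len(w[0])

# theta_f = max_v sum_l sqrt(w[v][l]);  lambda_f = max_l sum_v sqrt(w[v][l])
theta = max(sum(math.sqrt(w[v][l]) for l in range(L)) for v in range(n))
lam = max(sum(math.sqrt(w[v][l]) for v in range(n)) for l in range(L))

# Corollary 1: gamma_{f,beta} <= (beta/2) * theta_f * lambda_f, so the
# fast-mixing condition gamma < 1 holds whenever theta_f * lambda_f < 2/beta.
beta = 0.5
bound = beta / 2 * theta * lam
print(theta, lam, bound)
```

For this instance the bound on γ_{f,β} comes out below 1, so Theorem 3 would certify O(n log n) mixing at this temperature.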
[24]
The main difficulty in applying the coupling approach lies in the construction of the coupling itself, for which one needs to consider any possible pair of states (X_t, Y_t). The path coupling technique makes this construction easier by utilizing the same state-space graph that we used to define canonical paths in Section 4.1. The core idea is to first define a coupling only over adjacent states, and then extend it to any pair of states by using a metric on the graph. More concretely, let us denote by d : E × E → R the path metric on state space E; that is, for any x, y ∈ E, d(x, y) is the minimum length of any path from x to y in the state space graph. The following theorem establishes fast mixing using this metric, as well as the diameter of the state space, diam(E) := max_{x,y∈E} d(x, y).

Theorem 4 ([2, 24]). For any Markov chain (Z_t), if (X_t, Y_t) is a coupling, such that, for some α ≥ 0, and any x, y ∈ E with x ∼ y, it holds that

    E[d(X_{t+1}, Y_{t+1}) | X_t = x, Y_t = y] ≤ e^{−α} d(x, y),

then the mixing time of the original chain is bounded by

    t_mix(ε) ≤ (1/α) (log(diam(E)) + log(1/ε)).

Proof outline of Theorem 3  In our case, the path metric d is the Hamming distance between the binary vectors representing the states (equivalently, the number of elements by which two sets differ). We need to construct a suitable coupling (X_t, Y_t) for any pair of states x ∼ y. Consider the two corresponding sets S, R ⊆ V that differ by exactly one element, and assume that R = S ∪ {r}, for some r ∈ V. (The case S = R ∪ {s} for some s ∈ V is completely analogous.) Remember that the Gibbs sampler first chooses an element v ∈ V uniformly at random, and then adds or removes it according to the conditional probabilities.
Our goal is to make the same updates happen to both S and R as frequently as possible. As a first step, we couple the candidate element for update v ∈ V to always be the same in both chains. Then, we have to distinguish between the following cases.

If v = r, then the conditionals for both chains are identical, therefore we can couple both chains to add r with probability p_add := p(S ∪ {r}) / (p(S) + p(S ∪ {r})), which will result in new sets S′ = R′ = S ∪ {r}, or remove r with probability 1 − p_add, which will result in new sets S′ = R′ = S. Either way, we will have d(S′, R′) = 0.

If v ≠ r, we cannot always couple the updates of the chains, because the conditional probabilities of the updates are different. In fact, we are forced to have different updates (one chain adding v, the other chain removing v) with probability equal to the difference of the corresponding conditionals, which we denote here by p_dif(v). If this is the case, we will have d(S′, R′) = 2, otherwise the chains will make the same update and will still differ only by element r, that is, d(S′, R′) = 1. Putting together all the above, we get the following expected distance after one step:

    E[d(S′, R′)] = 1 − 1/n + (1/n) Σ_{v≠r} p_dif(v) ≤ 1 − (1/n)(1 − γ_{F,β}) ≤ exp( −(1 − γ_{F,β}) / n ).

Our result follows from applying Theorem 4 with α = (1 − γ_{F,β})/n, noting that diam(E) = n.

5 Experiments

We compare the Gibbs sampler against the variational approach proposed by Djolonga and Krause [6] for performing inference in models of the form (1), and use the same three models as in their experiments.
We brie\ufb02y review here the experimental setup and refer to their paper for more details.\nThe \ufb01rst is a (log-submodular) facility location model with an added modular term that penalizes the\nnumber of selected elements, that is, p(S) \u221d exp(F (S) \u2212 2|S|), where F is a submodular facility\nlocation function. The model is constructed from randomly subsampling real data from a problem of\nsensor placement in a water distribution network [22]. In the experiments, we iteratively condition\non random observations for each variable in the ground set. The second is a (log-supermodular)\npairwise Markov random \ufb01eld (MRF; a generalized Ising model with varying weights), constructed\nby \ufb01rst randomly sampling points from a 2-D two-cluster Gaussian mixture model, and then in-\ntroducing a pairwise potential for each pair of points with exponentially-decreasing weight in the\ndistance of the pair. In the experiments, we iteratively condition on pairs of observations, one from\neach cluster. The third is a (log-supermodular) higher-order MRF, which is constructed by \ufb01rst gen-\nerating a random Watts-Strogatz graph, and then creating one higher-order potential per node, which\ncontains that node and all of its neighbors in the graph. The strength of the potentials is controlled\nby a parameter \u00b5, which is closely related to the curvature of the functions that de\ufb01ne them. In the\nexperiments, we vary this parameter from 0 (modular model) to 1 (\u201cstrongly\u201d supermodular model).\nFor all three models, we constrain the size of the ground set to n = 20, so that we are able to\ncompute, and compare against, the exact marginals. 
Furthermore, we run multiple repetitions for each model to account for the randomness of the model instance, and the random initialization of the Gibbs sampler.

[Figure 1: Absolute error of the marginals computed by the Gibbs sampler compared to variational inference [6], shown for (a) the facility location model (vs. number of conditioned elements), (b) the pairwise MRF (vs. number of conditioned pairs), and (c) the higher-order MRF (vs. µ). A modest 500 Gibbs iterations outperform the variational method for the most part.]

The marginals we compute are of the form p(v ∈ S | C ⊆ S ⊆ D), for all v ∈ V. We run the Gibbs sampler for 100, 500, and 2000 iterations on each problem instance. In compliance with recommended MCMC practice [11], we discard the first half of the obtained samples as burn-in, and only use the second half for estimating the marginals.
Figure 1 compares the average absolute error of the approximate marginals with respect to the exact ones. The averaging is performed over v ∈ V, and over the different repetitions of each experiment; error bars depict two standard errors. The two variational approximations are obtained from factorized distributions associated with modular lower and upper bounds, respectively [6]. We notice a similar trend on all three models. For the regimes that correspond to less "peaked" posterior distributions (small number of conditioned variables, small µ), even 100 Gibbs iterations outperform both variational approximations.
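The estimation procedure just described (single-site Gibbs updates on the unconditioned elements, first half of the samples discarded as burn-in, marginals of the form p(v ∈ S | C ⊆ S ⊆ D)) can be sketched as follows; this is an illustrative re-implementation under the model p(S) ∝ exp(βF(S)), not the authors' code:

```python
import math
import random

def gibbs_marginals(F, beta, V, C, D, iters, seed=0):
    """Estimate p(v in S | C ⊆ S ⊆ D) for p(S) ∝ exp(beta * F(S)).
    Only elements of D \\ C are resampled; conditioning keeps C ⊆ S ⊆ D."""
    rng = random.Random(seed)
    free = sorted(set(D) - set(C))
    S = set(C)                      # start from the smallest feasible set
    counts = dict.fromkeys(V, 0)
    kept = 0
    for t in range(iters):
        v = rng.choice(free)
        with_v = math.exp(beta * F(S | {v}))
        without_v = math.exp(beta * F(S - {v}))
        if rng.random() < with_v / (with_v + without_v):
            S.add(v)
        else:
            S.discard(v)
        if t >= iters // 2:         # discard the first half as burn-in
            kept += 1
            for u in S:
                counts[u] += 1
    return {u: counts[u] / kept for u in V}
```

As a sanity check, with β = 0 every free element has exact marginal 1/2, while elements of C have marginal 1 and elements outside D have marginal 0.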
The variational approximations gain an advantage when the posterior is concentrated around only a few states, which happens after having conditioned on almost all variables in the first two models, or for µ close to 1 in the third model.

6 Further Related Work

In work contemporary to ours, Rebeschini and Karbasi [27] analyzed the mixing times of log-submodular models. Using a method based on matrix norms, which was previously introduced by Dyer et al. [7] and is closely related to path coupling, they arrive at a similar, though not directly comparable, condition to the one we presented in Theorem 3.
Iyer and Bilmes [13] recently considered a different class of probabilistic models, called submodular point processes (SPPs), which are also defined through submodular functions, and have the form p(S) ∝ F(S). They showed that inference in SPPs is, in general, also a hard problem, and provided approximations and closed-form solutions for some subclasses.
The canonical path method for bounding mixing times has previously been used in applications such as approximating the partition function of ferromagnetic Ising models [17], approximating matrix permanents [16, 18], and counting matchings in graphs [15]. The most prominent application of coupling-based methods is counting k-colorings in low-degree graphs [3, 14, 15]. Other applications include counting independent sets in graphs [8], and approximating the partition function of various subclasses of Ising models at high temperatures [24].

7 Conclusion

We considered the problem of performing marginal inference using MCMC sampling techniques in probabilistic models defined through submodular functions. In particular, we presented for the first time sufficient conditions to obtain upper bounds on the mixing time of the Gibbs sampler in general log-submodular and log-supermodular models.
Furthermore, we demonstrated that, in practice, the Gibbs sampler compares favorably to previously proposed variational approximations, at least in regimes of high uncertainty. We believe that this is an important step towards a unified framework for further analysis and practical application of this rich class of probabilistic submodular models.

Acknowledgments This work was partially supported by ERC Starting Grant 307036.

References
[1] David Aldous. Random walks on finite groups and rapidly mixing Markov chains. In Seminaire de Probabilites XVII. Springer, 1983.
[2] Russ Bubley and Martin Dyer. Path coupling: A technique for proving rapid mixing in Markov chains. In Symposium on Foundations of Computer Science, 1997.
[3] Russ Bubley, Martin Dyer, and Catherine Greenhill. Beating the 2∆ bound for approximately counting colourings: A computer-assisted proof of rapid mixing. In Symposium on Discrete Algorithms, 1998.
[4] Michele Conforti and Gerard Cornuejols. Submodular set functions, matroids and the greedy algorithm: Tight worst-case bounds and some generalizations of the Rado-Edmonds theorem. Discrete Applied Mathematics, 1984.
[5] Persi Diaconis and Daniel Stroock. Geometric bounds for eigenvalues of Markov chains. The Annals of Applied Probability, 1991.
[6] Josip Djolonga and Andreas Krause. From MAP to marginals: Variational inference in Bayesian submodular models. In Neural Information Processing Systems, 2014.
[7] Martin Dyer, Leslie Ann Goldberg, and Mark Jerrum. Matrix norms and rapid mixing for spin systems. Annals of Applied Probability, 2009.
[8] Martin Dyer and Catherine Greenhill. On Markov chains for independent sets. Journal of Algorithms, 2000.
[9] Uriel Feige, Vahab S. Mirrokni, and Jan Vondrak. Maximizing non-monotone submodular functions. In Symposium on Foundations of Computer Science, 2007.
[10] Satoru Fujishige. Submodular Functions and Optimization.
Elsevier Science, 2005.
[11] Andrew Gelman and Kenneth Shirley. Inference from simulations and monitoring convergence. In Handbook of Markov Chain Monte Carlo. CRC Press, 2011.
[12] Daniel Golovin and Andreas Krause. Adaptive submodularity: Theory and applications in active learning and stochastic optimization. Journal of Artificial Intelligence Research, 2011.
[13] Rishabh Iyer and Jeff Bilmes. Submodular point processes with applications in machine learning. In International Conference on Artificial Intelligence and Statistics, 2015.
[14] Mark Jerrum. A very simple algorithm for estimating the number of k-colorings of a low-degree graph. Random Structures and Algorithms, 1995.
[15] Mark Jerrum. Counting, Sampling and Integrating: Algorithms and Complexity. Birkhäuser, 2003.
[16] Mark Jerrum and Alistair Sinclair. Approximating the permanent. SIAM Journal on Computing, 1989.
[17] Mark Jerrum and Alistair Sinclair. Polynomial-time approximation algorithms for the Ising model. SIAM Journal on Computing, 1993.
[18] Mark Jerrum, Alistair Sinclair, and Eric Vigoda. A polynomial-time approximation algorithm for the permanent of a matrix with non-negative entries. Journal of the ACM, 2004.
[19] David Kempe, Jon Kleinberg, and Eva Tardos. Maximizing the spread of influence through a social network. In Conference on Knowledge Discovery and Data Mining, 2003.
[20] Daphne Koller and Nir Friedman. Probabilistic Graphical Models: Principles and Techniques. The MIT Press, 2009.
[21] Andreas Krause, Carlos Guestrin, Anupam Gupta, and Jon Kleinberg. Near-optimal sensor placements: Maximizing information while minimizing communication cost. In Information Processing in Sensor Networks, 2006.
[22] Andreas Krause, Jure Leskovec, Carlos Guestrin, Jeanne Vanbriesen, and Christos Faloutsos. Efficient sensor placement optimization for securing large water distribution networks.
Journal of Water Resources Planning and Management, 2008.
[23] Alex Kulesza and Ben Taskar. Determinantal point processes for machine learning. Foundations and Trends in Machine Learning, 2012.
[24] David A. Levin, Yuval Peres, and Elizabeth L. Wilmer. Markov Chains and Mixing Times. American Mathematical Society, 2008.
[25] Hui Lin and Jeff Bilmes. A class of submodular functions for document summarization. In Human Language Technologies, 2011.
[26] George L. Nemhauser, Laurence A. Wolsey, and Marshall L. Fisher. An analysis of approximations for maximizing submodular set functions. Mathematical Programming, 1978.
[27] Patrick Rebeschini and Amin Karbasi. Fast mixing for discrete point processes. In Conference on Learning Theory, 2015.
[28] Alistair Sinclair. Improved bounds for mixing rates of Markov chains and multicommodity flow. Combinatorics, Probability and Computing, 1992.