{"title": "Parameter Learning for Log-supermodular Distributions", "book": "Advances in Neural Information Processing Systems", "page_first": 3234, "page_last": 3242, "abstract": "We consider log-supermodular models on binary variables, which are probabilistic models with negative log-densities which are submodular. These models provide probabilistic interpretations of common combinatorial optimization tasks such as image segmentation. In this paper, we focus primarily on parameter estimation in the models from known upper-bounds on the intractable log-partition function. We show that the bound based on separable optimization on the base polytope of the submodular function is always inferior to a bound based on ``perturb-and-MAP'' ideas. Then, to learn parameters, given that our approximation of the log-partition function is an expectation (over our own randomization), we use a stochastic subgradient technique to maximize a lower-bound on the log-likelihood. This can also be extended to conditional maximum likelihood. We illustrate our new results in a set of experiments in binary image denoising, where we highlight the flexibility of a probabilistic model to learn with missing data.", "full_text": "Parameter Learning\n\nfor Log-supermodular Distributions\n\nTatiana Shpakova\n\nFrancis Bach\n\nINRIA - \u00c9cole Normale Sup\u00e9rieure Paris\n\ntatiana.shpakova@inria.fr\n\nINRIA - \u00c9cole Normale Sup\u00e9rieure Paris\n\nfrancis.bach@inria.fr\n\nAbstract\n\nWe consider log-supermodular models on binary variables, which are probabilistic\nmodels with negative log-densities which are submodular. These models provide\nprobabilistic interpretations of common combinatorial optimization tasks such as\nimage segmentation. In this paper, we focus primarily on parameter estimation in\nthe models from known upper-bounds on the intractable log-partition function. 
We\nshow that the bound based on separable optimization on the base polytope of the\nsubmodular function is always inferior to a bound based on \u201cperturb-and-MAP\u201d\nideas. Then, to learn parameters, given that our approximation of the log-partition\nfunction is an expectation (over our own randomization), we use a stochastic\nsubgradient technique to maximize a lower-bound on the log-likelihood. This can\nalso be extended to conditional maximum likelihood. We illustrate our new results\nin a set of experiments in binary image denoising, where we highlight the \ufb02exibility\nof a probabilistic model to learn with missing data.\n\nIntroduction\n\n1\nSubmodular functions provide ef\ufb01cient and \ufb02exible tools for learning on discrete data. Several\ncommon combinatorial optimization tasks, such as clustering, image segmentation, or document\nsummarization, can be achieved by the minimization or the maximization of a submodular function [1,\n8, 14]. The key bene\ufb01t of submodularity is the ability to model notions of diminishing returns, and\nthe availability of exact minimization algorithms and approximate maximization algorithms with\nprecise approximation guarantees [12].\nIn practice, it is not always straightforward to de\ufb01ne an appropriate submodular function for a problem\nat hand. Given fully-labeled data, e.g., images and their foreground/background segmentations in\nimage segmentation, structured-output prediction methods such as the structured-SVM may be\nused [18]. However, it is common (a) to have missing data, and (b) to embed submodular function\nminimization within a larger model. These are two situations well tackled by probabilistic modelling.\nLog-supermodular models, with negative log-densities equal to a submodular function, are a \ufb01rst im-\nportant step toward probabilistic modelling on discrete data with submodular functions [5]. However,\nit is well known that the log-partition function is intractable in such models. 
Several bounds have\nbeen proposed, that are accompanied with variational approximate inference [6]. These bounds are\nbased on the submodularity of the negative log-densities. However, parameter learning (typically by\nmaximum likelihood), which is a key feature of probabilistic modeling, has not been tackled yet. We\nmake the following contributions:\n\u2013 In Section 3, we review existing variational bounds for the log-partition function and show that\nthe bound of [9], based on \u201cperturb-and-MAP\u201d ideas, formally dominates the bounds proposed\nby [5, 6].\n\n\u2013 In Section 4.1, we show that for parameter learning via maximum likelihood the existing bound\nof [5, 6] typically leads to a degenerate solution while the one based on \u201cperturb-and-MAP\u201d ideas\nand logistic samples [9] does not.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n\f\u2013 In Section 4.2, given that the bound based on \u201cperturb-and-MAP\u201d ideas is an expectation (over\nour own randomization), we propose to use a stochastic subgradient technique to maximize the\nlower-bound on the log-likelihood, which can also be extended to conditional maximum likelihood.\n\u2013 In Section 5, we illustrate our new results on a set of experiments in binary image denoising, where\n\nwe highlight the \ufb02exibility of a probabilistic model for learning with missing data.\n\n2 Submodular functions and log-supermodular models\n\nIn this section, we review the relevant theory of submodular functions and recall typical examples of\nlog-supermodular distributions.\n\n2.1 Submodular functions\nWe consider submodular functions on the vertices of the hypercube {0, 1}D. This hypercube\nrepresentation is equivalent to the power set of {1, . . . , D}. 
Indeed, we can go from a vertex of the\nhypercube to a set by looking at the indices of the components equal to one and from set to vertex by\ntaking the indicator vector of the set.\nFor any two vertices of the hypercube, x, y \u2208 {0, 1}D, a function f : {0, 1}D \u2192 R is submodular\nif f (x) + f (y) (cid:62) f (min{x, y}) + f (max{x, y}), where the min and max operations are taken\ncomponent-wise and correspond to the intersection and union of the associated sets. Equivalently, the\nfunction x (cid:55)\u2192 f (x + ei) \u2212 f (x), where ei \u2208 RD is the i-th canonical basis vector, is non-increasing.\nHence, the notion of diminishing returns is often associated with submodular functions. Most widely\nused submodular functions are cuts, concave functions of subset cardinality, mutual information, set\ncovers, and certain functions of eigenvalues of submatrices [1, 7]. Supermodular functions are simply\nnegatives of submodular functions.\nIn this paper, we are going to use a few properties of such submodular functions (see [1, 7] and\nreferences therein). Any submodular function f can be extended from {0, 1}D to a convex function\non RD, which is called the Lov\u00e1sz extension. This extension has the same value on {0, 1}D, hence\nwe use the same notation f. Moreover, this function is convex and piecewise linear, which implies\nthe existence of a polytope B(f ) \u2282 RD, called the base polytope, such that for all x \u2208 RD,\nf (x) = maxs\u2208B(f ) x(cid:62)s, that is, f is the support function of B(f ). The Lov\u00e1sz extension f and the\nbase polytope B(f ) have explicit expressions that are, however, not relevant to this paper. 
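The lattice inequality above can be verified exhaustively for small $D$. The sketch below is our own toy illustration (the two test functions are assumptions, not from the paper): it checks a concave function of cardinality, which is submodular, against a supermodular counterexample.

```python
import itertools

import numpy as np

def f_concave_card(x):
    # sqrt(|x|): a concave function of cardinality, hence submodular.
    return np.sqrt(np.sum(x))

def f_square_card(x):
    # |x|^2 is supermodular, so the check below should fail for it.
    return float(np.sum(x)) ** 2

def is_submodular(f, D):
    # Exhaustively test f(x) + f(y) >= f(min(x, y)) + f(max(x, y))
    # over all pairs of hypercube vertices (feasible only for small D).
    verts = [np.array(v) for v in itertools.product([0, 1], repeat=D)]
    return all(
        f(x) + f(y) >= f(np.minimum(x, y)) + f(np.maximum(x, y)) - 1e-12
        for x in verts for y in verts
    )

print(is_submodular(f_concave_card, 4), is_submodular(f_square_card, 4))  # → True False
```

The component-wise min and max correspond exactly to set intersection and union under the indicator-vector identification.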
We will only use the fact that $f$ can be efficiently minimized on $\{0,1\}^D$, by a variety of generic algorithms, or by more efficient dedicated ones for subclasses such as graph-cuts.

2.2 Log-supermodular distributions

Log-supermodular models were introduced in [5] to model probability distributions on the hypercube, $x \in \{0,1\}^D$, and are defined as
\[
p(x) = \frac{1}{Z(f)} \exp(-f(x)),
\]
where $f : \{0,1\}^D \to \mathbb{R}$ is a submodular function such that $f(0) = 0$ and the partition function is $Z(f) = \sum_{x \in \{0,1\}^D} \exp(-f(x))$. It is more convenient to deal with the convex log-partition function $A(f) = \log Z(f) = \log \sum_{x \in \{0,1\}^D} \exp(-f(x))$. In general, the calculation of the partition function $Z(f)$ or the log-partition function $A(f)$ is intractable, as this class includes simple binary Markov random fields, for which the exact calculation is known to be #P-hard [10]. In Section 3, we review upper-bounds for the log-partition function.

2.3 Examples

Essentially, all submodular functions used in the minimization context can be used as negative log-densities [5, 6]. In computer vision, the most common examples are graph-cuts, which are essentially binary Markov random fields with attractive potentials, but higher-order potentials have been considered as well [11]. In our experiments, we use graph-cuts, where submodular function minimization may be performed with max-flow techniques and is thus efficient [4]. Note that there are extensions of submodular functions to continuous domains that could be considered as well [2].

3 Upper-bounds on the log-partition function

In this section, we review the main existing upper-bounds on the log-partition function for log-supermodular densities. These upper-bounds use several properties of submodular functions, in particular, the Lovász extension and the base polytope.
Note that lower bounds based on submodular maximization aspects and superdifferentials [5] can be used to highlight the tightness of the various bounds, which we present in Figure 1.

3.1 Base polytope relaxation with L-Field [5]

This method exploits the fact that any submodular function $f(x)$ can be lower-bounded by a modular function $s(x)$, i.e., a linear function of $x \in \{0,1\}^D$ in the hypercube representation. The submodular function and its lower bounds are related by $f(x) = \max_{s \in B(f)} s^\top x$, leading to:
\[
A(f) = \log \sum_{x \in \{0,1\}^D} \exp(-f(x)) = \log \sum_{x \in \{0,1\}^D} \min_{s \in B(f)} \exp(-s^\top x),
\]
which, by swapping the sum and the min, is upper-bounded by
\[
\min_{s \in B(f)} \log \sum_{x \in \{0,1\}^D} \exp(-s^\top x) = \min_{s \in B(f)} \sum_{d=1}^{D} \log\big(1 + e^{-s_d}\big) \;\stackrel{\text{def}}{=}\; A_{\text{L-field}}(f). \tag{1}
\]
Since the polytope $B(f)$ is tractable (through its membership oracle, or by maximizing linear functions efficiently), the bound $A_{\text{L-field}}(f)$ is tractable, i.e., computable in polynomial time. Moreover, it has a nice interpretation through convex duality, as the logistic function $\log(1 + e^{-s_d})$ may be represented as $\max_{\mu_d \in [0,1]} -\mu_d s_d - \mu_d \log \mu_d - (1 - \mu_d) \log(1 - \mu_d)$, leading to:
\[
A_{\text{L-field}}(f) = \min_{s \in B(f)} \max_{\mu \in [0,1]^D} -\mu^\top s + H(\mu) = \max_{\mu \in [0,1]^D} H(\mu) - f(\mu),
\]
where $H(\mu) = -\sum_{d=1}^{D} \big\{ \mu_d \log \mu_d + (1 - \mu_d) \log(1 - \mu_d) \big\}$. This shows in particular the convexity of $f \mapsto A_{\text{L-field}}(f)$.
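For intuition, the bound of Eq. (1) can be computed numerically on a toy example. The sketch below is our own illustration and not the algorithm of [5]: it minimizes the separable objective over $B(f)$ with Frank-Wolfe, using Edmonds' greedy algorithm as the linear maximization oracle over the base polytope, and compares the result to the exact $A(f)$ obtained by brute force (the submodular function is an assumed toy one).

```python
import itertools

import numpy as np

D = 4

def f_set(S):
    # Toy submodular function on subsets, 2*sqrt(|S|), with f(empty) = 0.
    return 2.0 * np.sqrt(len(S))

def greedy_vertex(w):
    # Edmonds' greedy algorithm: argmax_{s in B(f)} w^T s is a vertex of B(f),
    # obtained from the marginal gains along any ordering that sorts w descending.
    order = np.argsort(-w)
    s, S, prev = np.zeros(D), [], 0.0
    for d in order:
        S.append(d)
        cur = f_set(S)
        s[d] = cur - prev
        prev = cur
    return s

def a_lfield(num_iters=500):
    # Frank-Wolfe minimization of sum_d log(1 + exp(-s_d)) over B(f);
    # every iterate stays in B(f), so each one yields a valid upper bound.
    s = greedy_vertex(np.ones(D))            # feasible start (a vertex of B(f))
    for k in range(num_iters):
        grad = -1.0 / (1.0 + np.exp(s))      # gradient of the separable objective
        v = greedy_vertex(-grad)             # linear minimization oracle over B(f)
        s = s + 2.0 / (k + 2.0) * (v - s)
    return np.sum(np.log1p(np.exp(-s))), s

def a_exact():
    # Brute-force A(f) for comparison (tiny D only).
    vals = np.array([-f_set([d for d in range(D) if x[d]])
                     for x in itertools.product([0, 1], repeat=D)])
    m = vals.max()
    return m + np.log(np.exp(vals - m).sum())

bound, s = a_lfield()
print(a_exact(), bound)  # A(f) lies below the L-field bound
```

Any point of $B(f)$ gives a valid upper bound, so Frank-Wolfe only tightens it; the exact minimizer can be found with dedicated methods, as discussed next.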
Finally, [6] shows the remarkable result that the minimizer $s \in B(f)$ may be obtained by minimizing a simpler function on $B(f)$, namely the squared Euclidean norm, thus leading to algorithms such as the minimum-norm-point algorithm [7].

3.2 "Perturb-and-MAP" with logistic distributions

Estimating the log-partition function can be done through optimization using "perturb-and-MAP" ideas. The main idea is to perturb the log-density, find the maximum a-posteriori configuration (i.e., perform optimization), and then average over several random perturbations [9, 17, 19].

The Gumbel distribution on $\mathbb{R}$, whose cumulative distribution function is $F(z) = \exp(-\exp(-(z + c)))$, where $c$ is the Euler constant, is particularly useful. Indeed, if $\{g(y)\}_{y \in \{0,1\}^D}$ is a collection of independent random variables $g(y)$ indexed by $y \in \{0,1\}^D$, each following the Gumbel distribution, then the random variable $\max_{y \in \{0,1\}^D} g(y) - f(y)$ is such that $\log Z(f) = \mathbb{E}_g\big[\max_{y \in \{0,1\}^D} \{g(y) - f(y)\}\big]$ [9, Lemma 1]. The main problem is that we need $2^D$ such variables, and a key contribution of [9] is to show that if we consider a factored collection $\{g_d(y_d)\}_{y_d \in \{0,1\},\, d = 1,\dots,D}$ of i.i.d. Gumbel variables, then we get an upper-bound on the log-partition function, that is, $\log Z(f) \le \mathbb{E}_g \max_{y \in \{0,1\}^D} \big\{ \sum_{d=1}^{D} g_d(y_d) - f(y) \big\}$.

Writing $g_d(y_d) = [g_d(1) - g_d(0)]\, y_d + g_d(0)$ and using the fact that (a) $g_d(0)$ has zero expectation and (b) the difference between two independent Gumbel variables has a logistic distribution (with cumulative distribution function $z \mapsto (1 + e^{-z})^{-1}$) [15], we get the following upper-bound:
\[
A_{\text{Logistic}}(f) = \mathbb{E}_{z_1,\dots,z_D \sim \text{logistic}} \Big[ \max_{y \in \{0,1\}^D} \{ z^\top y - f(y) \} \Big], \tag{2}
\]
where the random vector $z \in \mathbb{R}^D$ consists of independent elements taken from the logistic distribution. This is always an upper-bound on $A(f)$ and it uses only the fact that submodular functions are efficient to optimize. It is convex in $f$ as an expectation of a maximum of affine functions of $f$.

3.3 Comparison of bounds

In this section, we show that $A_{\text{L-field}}(f)$ is always dominated by $A_{\text{Logistic}}(f)$. This is complemented by another result within the maximum likelihood framework in Section 4.

Proposition 1. For any submodular function $f : \{0,1\}^D \to \mathbb{R}$, we have:
\[
A(f) \;\le\; A_{\text{Logistic}}(f) \;\le\; A_{\text{L-field}}(f). \tag{3}
\]

Proof. The first inequality was shown by [9].
For the second inequality, we have:
\[
\begin{aligned}
A_{\text{Logistic}}(f)
&= \mathbb{E}_z\Big[\max_{y \in \{0,1\}^D} z^\top y - f(y)\Big]\\
&= \mathbb{E}_z\Big[\max_{y \in \{0,1\}^D} z^\top y - \max_{s \in B(f)} s^\top y\Big] && \text{from properties of the base polytope } B(f),\\
&= \mathbb{E}_z\Big[\max_{y \in \{0,1\}^D} \min_{s \in B(f)} (z - s)^\top y\Big]\\
&= \mathbb{E}_z\Big[\min_{s \in B(f)} \max_{y \in \{0,1\}^D} (z - s)^\top y\Big] && \text{by convex duality},\\
&\le \min_{s \in B(f)} \mathbb{E}_z\Big[\max_{y \in \{0,1\}^D} (z - s)^\top y\Big] && \text{by swapping expectation and minimization},\\
&= \min_{s \in B(f)} \mathbb{E}_z\Big[\sum_{d=1}^{D} (z_d - s_d)_+\Big] && \text{by explicit maximization},\\
&= \min_{s \in B(f)} \sum_{d=1}^{D} \mathbb{E}_{z_d} (z_d - s_d)_+ && \text{by linearity of expectation},\\
&= \min_{s \in B(f)} \sum_{d=1}^{D} \int_{-\infty}^{+\infty} (z_d - s_d)_+\, p(z_d)\, dz_d && \text{by definition of expectation},\\
&= \min_{s \in B(f)} \sum_{d=1}^{D} \int_{s_d}^{+\infty} (z_d - s_d)\, \frac{e^{-z_d}}{(1 + e^{-z_d})^2}\, dz_d && \text{by substituting the logistic density},\\
&= \min_{s \in B(f)} \sum_{d=1}^{D} \log\big(1 + e^{-s_d}\big) = A_{\text{L-field}}(f),
\end{aligned}
\]
which leads to the desired result.

In the inequality above, since the logistic distribution has full support, there cannot be exact equality. However, since the logistic distribution is concentrated around zero, the two bounds become close when, with high probability, $|s_d| \ge |z_d|$ for all $d$, i.e., when $|s_d|$ is large for all $d$ and some $s \in B(f)$.

Running-time complexity of $A_{\text{L-field}}$ and $A_{\text{logistic}}$. Both bounds can be computed as soon as an efficient MAP solver is available for the submodular function (plus a modular term). For L-Field, the divide-and-conquer algorithm of [5] may be applied, so the complexity amounts to solving $O(|V|)$ such minimization problems. For the method based on logistic samples, it is necessary to solve $M$ optimization problems (one per sample).
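The $M$ optimization problems behind $A_{\text{Logistic}}$ in Eq. (2) can be illustrated with a simple Monte Carlo sketch (our own toy example: the submodular function is assumed, and the MAP problem is solved by brute force over $\{0,1\}^D$, which is feasible only for tiny $D$; the paper uses graph-cuts instead).

```python
import itertools

import numpy as np

rng = np.random.default_rng(0)
D = 4
VERTS = [np.array(v, dtype=float) for v in itertools.product([0, 1], repeat=D)]

def f_vec(y):
    # Toy submodular function (concave of cardinality), with f(0) = 0.
    return 2.0 * np.sqrt(np.sum(y))

def a_exact():
    # Brute-force log-partition function A(f) (tiny D only).
    vals = np.array([-f_vec(y) for y in VERTS])
    m = vals.max()
    return m + np.log(np.exp(vals - m).sum())

def a_logistic(num_samples=3000):
    # Monte Carlo estimate of E_z[max_y z^T y - f(y)]: one MAP problem per sample.
    total = 0.0
    for _ in range(num_samples):
        z = rng.logistic(size=D)
        total += max(z @ y - f_vec(y) for y in VERTS)
    return total / num_samples

est = a_logistic()
print(a_exact(), est)
```

Up to Monte Carlo noise, the estimate lies above the exact $A(f)$, consistently with Proposition 1.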
In our empirical bound comparison (next paragraph), the running time was the same for both methods. Note however that for parameter learning, we need a single SFM problem per gradient iteration (and not $M$).

Empirical comparison of $A_{\text{L-field}}$ and $A_{\text{logistic}}$. We compare the upper bounds on the log-partition function $A_{\text{L-field}}$ and $A_{\text{logistic}}$, with the setup used by [5]. We thus consider data from a Gaussian mixture model with 2 clusters in $\mathbb{R}^2$. The centers are sampled from $N([3,3], I)$ and $N([-3,-3], I)$, respectively. We then sample $n = 50$ points for each cluster. These $2n$ points are used as nodes in a complete weighted graph, where the weight between points $x$ and $y$ is equal to $e^{-c\|x - y\|}$. We consider the graph cut function associated with this weighted graph, which defines a log-supermodular distribution. We then consider conditional distributions, one for each $k = 1,\dots,n$, on the events that at least $k$ points from the first cluster lie on one side of the cut and at least $k$ points from the second cluster lie on the other side of the cut. For each conditional distribution, we evaluate and compare the two upper bounds. We also add the tree-reweighted belief propagation upper bound [23] and the superdifferential-based lower bound [5].
In Figure 1, we show the various bounds on $A(f)$ as functions of the number of conditioned pairs.
The logistic upper bound is obtained using 100 logistic samples: the logistic upper bound $A_{\text{logistic}}$ is close to the superdifferential lower bound from [5] and is indeed significantly lower than the bound $A_{\text{L-field}}$. However, the tree-reweighted belief propagation bound behaves a bit better in the second case ($c = 3$), but its calculation takes more time, and it cannot be applied to general submodular functions.

(a) Mean bounds with confidence intervals, c = 1. (b) Mean bounds with confidence intervals, c = 3.
Figure 1: Comparison of log-partition function bounds for different values of c. See text for details.

3.4 From bounds to approximate inference

Since linear functions are submodular functions, given any convex upper-bound on the log-partition function, we may derive an approximate marginal probability for each $x_d \in \{0,1\}$. Indeed, following [9], we consider an exponential family model $p(x|t) = \exp(-f(x) + t^\top x - A(f - t))$, where $f - t$ is the function $x \mapsto f(x) - t^\top x$. When $f$ is assumed to be fixed, this can be seen as an exponential family with base measure $\exp(-f(x))$, sufficient statistics $x$, and log-partition function $A(f - t)$. It is known that the expectation of the sufficient statistics under the exponential family model, $\mathbb{E}_{p(x|t)} x$, is the gradient of the log-partition function [23].
Hence, any approximation of this log-partition function gives an approximation of this expectation, which in our situation is the vector of marginal probabilities that an element is equal to 1.

For the L-field bound, at $t = 0$, we have $\partial_{t_d} A_{\text{L-field}}(f - t) = \sigma(-s^*_d)$, where $\sigma$ is the sigmoid function and $s^*$ is the minimizer of $\sum_{d=1}^{D} \log(1 + e^{-s_d})$, thus recovering the interpretation of [6] from another point of view. For the logistic bound, this is the inference mechanism from [9], with $\partial_{t_d} A_{\text{logistic}}(f - t) = \mathbb{E}_z\, y^*(z)$, where $y^*(z)$ is the maximizer of $\max_{y \in \{0,1\}^D} z^\top y - f(y)$. In practice, in order to perform approximate inference, we only sample $M$ logistic variables. We could do the same for parameter learning, but a much more efficient alternative, based on mixing sampling and convex optimization, is presented in the next section.

4 Parameter learning through maximum likelihood

An advantage of log-supermodular probabilistic models is the opportunity to learn the model parameters from data using the maximum-likelihood principle. In this section, we consider that we are given $N$ observations $x_1,\dots,x_N \in \{0,1\}^D$, e.g., binary images such as shown in Figure 2.

We consider a submodular function $f(x)$ represented as $f(x) = \sum_{k=1}^{K} \alpha_k f_k(x) - t^\top x$. The modular term $t^\top x$ is explicitly taken into account, with $t \in \mathbb{R}^D$, and the $K$ base submodular functions $f_k$ are assumed to be given, with $\alpha \in \mathbb{R}^K_+$ so that the function $f$ remains submodular. Assuming the data $x_1,\dots,x_N$ are independent and identically distributed (i.i.d.), maximum likelihood is equivalent to minimizing:
\[
\min_{\alpha \in \mathbb{R}^K_+,\; t \in \mathbb{R}^D} \; -\frac{1}{N} \sum_{n=1}^{N} \log p(x_n | \alpha, t)
= \min_{\alpha \in \mathbb{R}^K_+,\; t \in \mathbb{R}^D} \; \frac{1}{N} \sum_{n=1}^{N} \Big\{ \sum_{k=1}^{K} \alpha_k f_k(x_n) - t^\top x_n + A(f) \Big\},
\]
which takes the particularly simple form
\[
\min_{\alpha \in \mathbb{R}^K_+,\; t \in \mathbb{R}^D} \; \sum_{k=1}^{K} \alpha_k \Big( \frac{1}{N} \sum_{n=1}^{N} f_k(x_n) \Big) - t^\top \Big( \frac{1}{N} \sum_{n=1}^{N} x_n \Big) + A(\alpha, t), \tag{4}
\]
where we use the notation $A(\alpha, t) = A(f)$.
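For tiny $D$, the objective of Eq. (4) can be evaluated exactly by brute-force summation, which makes the role of $A(\alpha, t)$ concrete. This is our own toy sketch with a single assumed base function $f_1$ (so $K = 1$); for realistic $D$, the $A(\alpha, t)$ term is exactly the intractable part.

```python
import itertools

import numpy as np

D = 3
VERTS = [np.array(v, dtype=float) for v in itertools.product([0, 1], repeat=D)]

def f_base(x):
    # A single hypothetical base submodular function f_1.
    return np.sqrt(np.sum(x))

def neg_log_lik(alpha, t, data):
    # Eq. (4) with K = 1: f(x) = alpha * f_1(x) - t^T x, A(f) by brute force.
    energies = np.array([alpha * f_base(y) - t @ y for y in VERTS])
    A = np.log(np.exp(-energies).sum())
    emp = np.mean([alpha * f_base(x) - t @ x for x in data])
    return emp + A

data = [VERTS[-1], VERTS[0], VERTS[-1]]   # toy "observations"
print(neg_log_lik(1.0, np.zeros(D), data), neg_log_lik(0.0, np.ones(D), data))
```

With data dominated by the all-ones vertex, a positive modular term $t$ already lowers the objective relative to $t = 0$, as expected from the moment-matching interpretation of maximum likelihood.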
We now consider replacing the intractable log-partition function by its approximations defined in Section 3.

4.1 Learning with the L-field approximation

In this section, we show that if we replace $A(f)$ by $A_{\text{L-field}}(f)$, we obtain a degenerate solution. Indeed, we have
\[
A_{\text{L-field}}(\alpha, t) = \min_{s \in B(f)} \sum_{d=1}^{D} \log\big(1 + e^{-s_d}\big)
= \min_{s \in B(\sum_{k=1}^{K} \alpha_k f_k)} \sum_{d=1}^{D} \log\big(1 + e^{-s_d + t_d}\big).
\]
This implies that Eq. (4) becomes
\[
\min_{\alpha \in \mathbb{R}^K_+,\; t \in \mathbb{R}^D} \; \sum_{k=1}^{K} \alpha_k \Big( \frac{1}{N} \sum_{n=1}^{N} f_k(x_n) \Big) - t^\top \Big( \frac{1}{N} \sum_{n=1}^{N} x_n \Big) + \min_{s \in B(\sum_{k=1}^{K} \alpha_k f_k)} \sum_{d=1}^{D} \log\big(1 + e^{-s_d + t_d}\big).
\]
The minimum with respect to $t_d$ may be performed in closed form, with $t_d - s_d = \log \frac{\langle x \rangle_d}{1 - \langle x \rangle_d}$, where $\langle x \rangle = \frac{1}{N} \sum_{n=1}^{N} x_n$.
Putting this back into the equation above, we get the equivalent problem:
\[
\min_{\alpha \in \mathbb{R}^K_+} \; \min_{s \in B(\sum_{k=1}^{K} \alpha_k f_k)} \; \sum_{k=1}^{K} \alpha_k \Big( \frac{1}{N} \sum_{n=1}^{N} f_k(x_n) \Big) - s^\top \Big( \frac{1}{N} \sum_{n=1}^{N} x_n \Big) + \text{const},
\]
which is equivalent to, using the representation of $f$ as the support function of $B(f)$:
\[
\min_{\alpha \in \mathbb{R}^K_+} \; \sum_{k=1}^{K} \alpha_k \Big[ \frac{1}{N} \sum_{n=1}^{N} f_k(x_n) - f_k\Big( \frac{1}{N} \sum_{n=1}^{N} x_n \Big) \Big] + \text{const}.
\]
Since $f_k$ (through its Lovász extension) is convex, by Jensen's inequality, the linear term in $\alpha_k$ is non-negative; thus maximum likelihood through L-field will lead to a degenerate solution where all $\alpha$'s are equal to zero.

4.2 Learning with the logistic approximation with stochastic gradients

In this section we consider the problem (4) and replace $A(f)$ by $A_{\text{Logistic}}(f)$:
\[
\min_{\alpha \in \mathbb{R}^K_+,\; t \in \mathbb{R}^D} \; \sum_{k=1}^{K} \alpha_k \langle f_k(x) \rangle_{\text{emp.}} - t^\top \langle x \rangle_{\text{emp.}} + \mathbb{E}_{z \sim \text{logistic}} \Big[ \max_{y \in \{0,1\}^D} z^\top y + t^\top y - \sum_{k=1}^{K} \alpha_k f_k(y) \Big], \tag{5}
\]
where $\langle M(x) \rangle_{\text{emp.}}$
denotes the empirical average of $M(x)$ (over the data).

Denoting by $y^*(z, t, \alpha) \in \{0,1\}^D$ the maximizers of $z^\top y + t^\top y - \sum_{k=1}^{K} \alpha_k f_k(y)$, the objective function may be written:
\[
\sum_{k=1}^{K} \alpha_k \big[ \langle f_k(x) \rangle_{\text{emp.}} - \langle f_k(y^*(z, t, \alpha)) \rangle_{\text{logistic}} \big] - t^\top \big[ \langle x \rangle_{\text{emp.}} - \langle y^*(z, t, \alpha) \rangle_{\text{logistic}} \big] + \langle z^\top y^*(z, t, \alpha) \rangle_{\text{logistic}}.
\]
This implies that at the optimum, $\langle x \rangle_{\text{emp.}} = \langle y^*(z, t, \alpha) \rangle_{\text{logistic}}$, and, for every $k$ such that $\alpha_k > 0$, $\langle f_k(x) \rangle_{\text{emp.}} = \langle f_k(y^*(z, t, \alpha)) \rangle_{\text{logistic}}$: the expected values of the sufficient statistics match between the data and the optimizers used for the logistic approximation [9].

In order to minimize the expectation in Eq. (5), we propose to use the projected stochastic gradient method, not on the data as usually done, but on our own internal randomization. Once we add a weighted $\ell_2$-regularization $\Omega(t, \alpha)$, the algorithm becomes:

• Input: functions $f_k$, $k = 1,\dots,K$, expected sufficient statistics $\langle f_k(x) \rangle_{\text{emp.}} \in \mathbb{R}$ and $\langle x \rangle_{\text{emp.}} \in [0,1]^D$, regularizer $\Omega(t, \alpha)$.
• Initialization: $\alpha = 0$, $t = 0$.
• Iterations: for $h$ from 1 to $H$:
  – Sample $z \in \mathbb{R}^D$ with independent logistic components;
  – Compute $y^* = y^*(z, t, \alpha) \in \arg\max_{y \in \{0,1\}^D} z^\top y + t^\top y - \sum_{k=1}^{K} \alpha_k f_k(y)$;
  – Replace $t$ by $t - \frac{C}{\sqrt{h}} \big[ y^* - \langle x \rangle_{\text{emp.}} + \partial_t \Omega(t, \alpha) \big]$;
  – Replace $\alpha_k$ by $\big( \alpha_k - \frac{C}{\sqrt{h}} \big[ \langle f_k(x) \rangle_{\text{emp.}} - f_k(y^*) + \partial_{\alpha_k} \Omega(t, \alpha) \big] \big)_+$.
• Output: $(\alpha, t)$.

Since our cost function is convex and Lipschitz-continuous, the averaged iterates converge to the global optimum [16] at rate $1/\sqrt{H}$ (for function values).

4.3 Extension to conditional maximum likelihood

In the experiments in Section 5, we consider a joint model over two binary vectors $x, z \in \{0,1\}^D$, as follows:
\[
p(x, z | \alpha, t, \pi) = p(x | \alpha, t)\, p(z | x, \pi) = \exp(-f(x) - A(f)) \prod_{d=1}^{D} \pi_d^{\,\delta(z_d \neq x_d)} (1 - \pi_d)^{\,\delta(z_d = x_d)}, \tag{6}
\]
which corresponds to sampling $x$ from a log-supermodular model and considering $z$ that switches the values of $x$ with probability $\pi_d$ for each $d$, that is, a noisy observation of $x$. With $u_d = \log \frac{\pi_d}{1 - \pi_d}$, which is equivalent to $\pi_d = (1 + e^{-u_d})^{-1}$, we have:
\[
\log p(x, z | \alpha, t, \pi) = -f(x) - A(f) + \sum_{d=1}^{D} \big\{ -\log(1 + e^{u_d}) + x_d u_d + z_d u_d - 2 x_d z_d u_d \big\}.
\]

(a) original image (b) noisy image (c) denoised image
Figure 2: Denoising of a horse image from the Weizmann horse database [3].

Using Bayes' rule, we have $p(x | z, \alpha, t, \pi) \propto \exp(-f(x) + x^\top u - 2 x^\top (u \circ z))$, which leads to the log-supermodular model $p(x | z, \alpha, t, \pi) = \exp(-f(x) + x^\top (u - 2 u \circ z) - A(f - u + 2 u \circ z))$.
Thus, if we observe both $z$ and $x$, we can consider a conditional maximization of the log-likelihood (still a convex optimization problem), which we do in our experiments for supervised image denoising, where we assume we know both noisy and original images at training time. Stochastic gradient on the logistic samples can then be used.
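The projected stochastic subgradient iteration of Section 4.2 can be sketched end-to-end on a toy problem. This is our own illustration with assumed ingredients: $K = 1$, a brute-force perturbed MAP in place of the graph-cuts used in the paper, and hypothetical step-size and regularization constants.

```python
import itertools

import numpy as np

D = 3
rng = np.random.default_rng(2)
VERTS = [np.array(v, dtype=float) for v in itertools.product([0, 1], repeat=D)]

def f1(y):
    # Single base submodular function (K = 1), toy concave of cardinality.
    return np.sqrt(np.sum(y))

def learn(data, H=3000, C=0.5, lam=1e-3):
    emp_x = np.mean(data, axis=0)               # <x>_emp
    emp_f = np.mean([f1(x) for x in data])      # <f_1(x)>_emp
    alpha, t = 0.0, np.zeros(D)
    for h in range(1, H + 1):
        z = rng.logistic(size=D)
        # Perturbed MAP (brute force here; a single graph-cut in the paper).
        y = max(VERTS, key=lambda v: z @ v + t @ v - alpha * f1(v))
        step = C / np.sqrt(h)
        t = t - step * (y - emp_x + lam * t)
        alpha = max(0.0, alpha - step * (emp_f - f1(y) + lam * alpha))
    return alpha, t

data = [np.array([1.0, 1.0, 0.0]), np.array([1.0, 0.0, 0.0]), np.array([1.0, 1.0, 1.0])]
alpha, t = learn(data)
print(alpha, t)
```

Consistently with the moment-matching interpretation, coordinates with higher empirical marginals end up with larger modular weights $t_d$.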
Note that our conditional ML estimation can be seen as a form of approximate conditional random fields [13].
While supervised learning can be achieved by other techniques such as structured-output-SVMs [18, 20, 22], our approach also applies when we do not observe the original image, which we now consider.

4.4 Missing data through maximum likelihood

In the model in Eq. (6), we now assume we only observe the noisy output $z$, and we perform parameter learning for $\alpha$, $t$, $\pi$. This is a latent variable model for which maximum likelihood can be readily applied. We have:
\[
\log p(z | \alpha, t, \pi) = \log \sum_{x \in \{0,1\}^D} p(z, x | \alpha, t, \pi)
= \log \sum_{x \in \{0,1\}^D} \exp(-f(x) - A(f)) \prod_{d=1}^{D} \pi_d^{\,\delta(z_d \neq x_d)} (1 - \pi_d)^{\,\delta(z_d = x_d)}
= A(f - u + 2 u \circ z) - A(f) + z^\top u - \sum_{d=1}^{D} \log(1 + e^{u_d}).
\]
In practice, we will assume that the noise probability $\pi$ (and hence $u$) is uniform across all elements. While we could use majorization-minimization approaches such as the expectation-maximization algorithm (EM), we consider instead stochastic subgradient descent to learn the model parameters $\alpha$, $t$ and $u$ (now a non-convex optimization problem, for which we still observed good convergence).

5 Experiments

The aim of our experiments is to demonstrate the ability of our approach to remove noise in binary images, following the experimental set-up of [9]. We consider a training sample of $N_{\text{train}} = 100$ images of size $D = 50 \times 50$, and a test sample of $N_{\text{test}} = 100$ binary images, containing a horse silhouette from the Weizmann horse database [3]. We first add noise by flipping pixel values independently with probability $\pi$.
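This flip noise is exactly the exponential-family factor of Eq. (6); the identity between the two parameterizations, with $u = \log\frac{\pi}{1 - \pi}$, can be checked numerically (a minimal sketch for a single pixel):

```python
import numpy as np

def log_p_noise(x, z, pi):
    # Direct form: pi^{delta(z != x)} * (1 - pi)^{delta(z == x)}.
    return np.log(pi) if x != z else np.log(1.0 - pi)

def log_p_noise_exp(x, z, pi):
    # Exponential-family form appearing in the joint log-likelihood of Eq. (6).
    u = np.log(pi / (1.0 - pi))
    return -np.log1p(np.exp(u)) + x * u + z * u - 2.0 * x * z * u

for pi in (0.05, 0.2):
    for x in (0, 1):
        for z in (0, 1):
            assert abs(log_p_noise(x, z, pi) - log_p_noise_exp(x, z, pi)) < 1e-12
print("identity verified for all (x, z) in {0,1}^2")
```

The cross-term $-2 x z u$ is what makes the conditional $p(x|z)$ log-supermodular again, with the extra modular term $u - 2 u \circ z$.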
In Figure 2, we provide an example from the test sample: the original, the noisy and the denoised image (by our algorithm).
We consider the model from Section 4.3, with two functions $f_1(x)$, $f_2(x)$, which are horizontal and vertical cut functions with binary weights, respectively, together with a modular term of dimension $D$. To perform minimization we use graph-cuts [4], as we deal with positive or attractive potentials.

noise π | max-marg. | std   | mean-marginals | std   | SVM-Struct | std
1%      | 0.4%      | <0.1% | 0.4%           | <0.1% | 0.6%       | <0.1%
5%      | 1.1%      | <0.1% | 1.1%           | <0.1% | 1.5%       | <0.1%
10%     | 2.1%      | <0.1% | 2.0%           | <0.1% | 2.8%       | 0.3%
20%     | 4.2%      | <0.1% | 4.1%           | <0.1% | 6.0%       | 0.6%
Table 1: Supervised denoising results.

        |            π is fixed             |           π is not fixed
π       | max-marg. | std   | mean-marg. | std   | max-marg. | std  | mean-marg. | std
1%      | 0.5%      | <0.1% | 0.5%       | <0.1% | 1.0%      | -    | 1.0%       | -
5%      | 0.9%      | 0.1%  | 1.0%       | 0.1%  | 3.5%      | 0.9% | 3.6%       | 0.8%
10%     | 1.9%      | 0.4%  | 2.1%       | 0.4%  | 6.8%      | 2.2% | 7.0%       | 2.0%
20%     | 5.3%      | 2.0%  | 6.0%       | 2.0%  | 20.0%     | -    | 20.0%      | -
Table 2: Unsupervised denoising results.

Supervised image denoising. We assume that we observe $N = 100$ pairs $(x_i, z_i)$ of original-noisy images, $i = 1,\dots,N$. We perform parameter inference by maximum likelihood using stochastic subgradient descent (over the logistic samples), with regularization by the squared $\ell_2$-norm, one parameter for $t$, one for $\alpha$, both learned by cross-validation. Given our estimates, we may denoise a new image by computing the "max-marginal", e.g., the maximum a posteriori $\max_x p(x | z, \alpha, t)$ through a single graph-cut, or by computing "mean-marginals" with 100 logistic samples.
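The mean-marginal decoding just described can be sketched with a few lines (our own toy setup: brute-force MAP over a tiny $D$, and a placeholder submodular term, in place of the graph-cut solver and the learned model).

```python
import itertools

import numpy as np

D = 3
rng = np.random.default_rng(3)
VERTS = [np.array(v, dtype=float) for v in itertools.product([0, 1], repeat=D)]

def f(x):
    # Placeholder for the learned submodular part (toy concave of cardinality).
    return np.sqrt(np.sum(x))

def denoise(z_obs, pi, num_samples=500):
    u = np.log(pi / (1.0 - pi))
    modular = u - 2.0 * u * z_obs          # the x^T (u - 2 u o z) term of p(x|z)
    marg = np.zeros(D)
    for _ in range(num_samples):
        zeta = rng.logistic(size=D)        # logistic perturbation
        y = max(VERTS, key=lambda v: (zeta + modular) @ v - f(v))
        marg += y
    marg /= num_samples                    # approximate mean-marginals
    return (marg > 0.5).astype(int), marg

decoded, marg = denoise(np.array([1.0, 0.0, 1.0]), pi=0.1)
print(decoded, marg)
```

Thresholding the mean-marginals at 0.5 is the Bayes-optimal decision under the (normalized) Hamming loss used below; the max-marginal decoder would instead use a single unperturbed MAP.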
To calculate the error, we use the normalized Hamming distance, over 100 test images.
Results are presented in Table 1, where we compare the two types of decoding, as well as a structured output SVM (SVM-Struct [22]) applied to the same problem. Results are reported as proportions of incorrectly classified pixels. We see that the probabilistic models here slightly outperform the max-margin formulation^1 and that using mean-marginals (which is optimal given our loss measure) leads to slightly better performance.

Unsupervised image denoising. We now only consider $N = 100$ noisy images $z_1,\dots,z_N$ to learn the model, without the original images, and we use the latent model from Section 4.4. We apply stochastic subgradient descent to the resulting difference of two convex functions (each approximated through $A_{\text{logistic}}$) to learn the model parameters, and use fixed regularization parameters equal to $10^{-2}$.
We consider two situations, with a known noise level $\pi$, or with learning it together with $\alpha$ and $t$. The error was calculated using either max-marginals or mean-marginals. Note that here, structured-output SVMs cannot be used because there is no supervision. Results are reported in Table 2. One explanation for the better performance of max-marginals in this case is that the unsupervised approach tends to oversmooth the outcome, and max-marginals correct this a bit.
When the noise level is known, the performance compared to supervised learning is not degraded much, showing the ability of the probabilistic models to perform parameter estimation with missing data. When the noise level is unknown and learned as well, results are worse, still better than a trivial answer for moderate levels of noise (5% and 10%), but not better than outputting the noisy image for extreme levels (1% and 20%).
In the challenging fully unsupervised case the standard deviation is up to 2.2% (which shows that our results are statistically significant).

6 Conclusion

In this paper, we have presented how approximate inference based on stochastic gradients and "perturb-and-MAP" ideas can be used to learn parameters of log-supermodular models, allowing one to benefit from the versatility of probabilistic modelling, in particular in terms of parameter estimation with missing data. While our experiments have focused on simple binary image denoising, exploring larger-scale applications in computer vision (such as those of [24, 21]) should also show the benefits of mixing probabilistic modelling and submodular functions.

Acknowledgements. We acknowledge support from the European Union's H2020 Framework Programme (H2020-MSCA-ITN-2014) under grant agreement no. 642685 (MacSeNet), and thank Sesh Kumar, Anastasia Podosinnikova and Anton Osokin for interesting discussions related to this work.

¹ [9] shows a stronger difference, which we believe (after consulting with the authors) is due to lack of convergence of the iterative algorithm solving the max-margin formulation.

References

[1] F. Bach. Learning with submodular functions: A convex optimization perspective. Foundations and Trends in Machine Learning, 6(2-3):145-373, 2013.
[2] F. Bach. Submodular functions: From discrete to continuous domains. Technical Report 1511.00394, arXiv, 2015.
[3] E. Borenstein, E. Sharon, and S. Ullman. Combining top-down and bottom-up segmentation. In Proc. ECCV, 2004.
[4] Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization via graph cuts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(11):1222-1239, 2001.
[5] J. Djolonga and A. Krause. From MAP to marginals: Variational inference in Bayesian submodular models. In Adv. NIPS, 2014.
[6] J. Djolonga and A. Krause. 
Scalable variational inference in log-supermodular models. In Proc. ICML, 2015.
[7] S. Fujishige. Submodular Functions and Optimization. Annals of Discrete Mathematics. Elsevier, 2005.
[8] D. Golovin and A. Krause. Adaptive submodularity: Theory and applications in active learning and stochastic optimization. Journal of Artificial Intelligence Research, 42:427-486, 2011.
[9] T. Hazan and T. Jaakkola. On the partition function and random maximum a-posteriori perturbations. In Proc. ICML, 2012.
[10] M. Jerrum and A. Sinclair. Polynomial-time approximation algorithms for the Ising model. SIAM Journal on Computing, 22(5):1087-1116, 1993.
[11] P. Kohli, L. Ladicky, and P. H. S. Torr. Robust higher order potentials for enforcing label consistency. International Journal of Computer Vision, 82(3):302-324, 2009.
[12] A. Krause and D. Golovin. Submodular function maximization. In Tractability: Practical Approaches to Hard Problems. Cambridge University Press, 2014.
[13] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. ICML, 2001.
[14] H. Lin and J. Bilmes. A class of submodular functions for document summarization. In Proc. NAACL/HLT, 2011.
[15] S. Nadarajah and S. Kotz. A generalized logistic distribution. International Journal of Mathematics and Mathematical Sciences, 19:3169-3174, 2005.
[16] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574-1609, 2009.
[17] G. Papandreou and A. Yuille. Perturb-and-MAP random fields: Using discrete optimization to learn and sample from energy models. In Proc. ICCV, 2011.
[18] M. Szummer, P. Kohli, and D. Hoiem. Learning CRFs using graph cuts. In Proc. ECCV, 2008.
[19] D. Tarlow, R. P. Adams, and R. S. Zemel. 
Randomized optimum models for structured prediction. In Proc. AISTATS, 2012.
[20] B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. In Adv. NIPS, 2003.
[21] S. Tschiatschek, J. Djolonga, and A. Krause. Learning probabilistic submodular diversity models via noise contrastive estimation. In Proc. AISTATS, 2016.
[22] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 6:1453-1484, 2005.
[23] M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1-305, 2008.
[24] J. Zhang, J. Djolonga, and A. Krause. Higher-order inference for multi-class log-supermodular models. In Proc. ICCV, 2015.