{"title": "Approximate maximum entropy principles via Goemans-Williamson with applications to provable variational methods", "book": "Advances in Neural Information Processing Systems", "page_first": 4628, "page_last": 4636, "abstract": "The well known maximum-entropy principle due to Jaynes, which states that given mean parameters, the maximum entropy distribution matching them is in an exponential family has been very popular in machine learning due to its \u201cOccam\u2019s razor\u201d interpretation. Unfortunately, calculating the potentials in the maximum entropy distribution is intractable [BGS14]. We provide computationally efficient versions of this principle when the mean parameters are pairwise moments: we design distributions that approximately match given pairwise moments, while having entropy which is comparable to the maximum entropy distribution matching those moments. We additionally provide surprising applications of the approximate maximum entropy principle to designing provable variational methods for partition function calculations for Ising models without any assumptions on the potentials of the model. More precisely, we show that we can get approximation guarantees for the log-partition function comparable to those in the low-temperature limit, which is the setting of optimization of quadratic forms over the hypercube. 
([AN06])", "full_text": "Approximate maximum entropy principles via Goemans-Williamson with applications to provable variational methods\n\nYuanzhi Li\nDepartment of Computer Science\nPrinceton University\nPrinceton, NJ, 08450\nyuanzhil@cs.princeton.edu\n\nAndrej Risteski\nDepartment of Computer Science\nPrinceton University\nPrinceton, NJ, 08450\nristeski@cs.princeton.edu\n\nAbstract\n\nThe well known maximum-entropy principle due to Jaynes, which states that\ngiven mean parameters, the maximum entropy distribution matching them is in an\nexponential family, has been very popular in machine learning due to its \u201cOccam\u2019s\nrazor\u201d interpretation. Unfortunately, calculating the potentials in the maximum-entropy\ndistribution is intractable [BGS14]. We provide computationally efficient\nversions of this principle when the mean parameters are pairwise moments: we\ndesign distributions that approximately match given pairwise moments, while\nhaving entropy which is comparable to the maximum entropy distribution matching\nthose moments.\nWe additionally provide surprising applications of the approximate maximum\nentropy principle to designing provable variational methods for partition function\ncalculations for Ising models without any assumptions on the potentials of the\nmodel. More precisely, we show that we can get approximation guarantees for the\nlog-partition function comparable to those in the low-temperature limit, which is\nthe setting of optimization of quadratic forms over the hypercube [AN06].\n\n1 Introduction\n\nMaximum entropy principle The maximum entropy principle [Jay57] states that given mean parameters, i.e. 
E\u00b5[\u03c6t(x)] for a family of functionals \u03c6t(x), t \u2208 [1, T], where \u00b5 is a distribution over the\nhypercube {\u22121, 1}^n, the entropy-maximizing distribution \u00b5 is an exponential family distribution,\ni.e. \u00b5(x) \u221d exp(\u2211_{t=1}^T Jt \u03c6t(x)) for some potentials Jt, t \u2208 [1, T]. 1 This principle has been one\nof the reasons for the popularity of graphical models in machine learning: the \u201cmaximum entropy\u201d\nassumption is interpreted as \u201cminimal assumptions\u201d on the distribution other than what is known\nabout it.\nHowever, this principle is problematic from a computational point of view. Due to results of\n[BGS14, SV14], the potentials Jt of the Ising model, in many cases, are impossible to estimate well\nin polynomial time, unless NP = RP \u2013 so merely getting the description of the maximum entropy\ndistribution is already hard. Moreover, in order to extract useful information about this distribution,\nusually we would also like to at least be able to sample efficiently from this distribution \u2013 which is\ntypically NP-hard or even #P-hard.\n\n1 There is a more general way to state this principle over an arbitrary domain, not just the hypercube, but for\nclarity in this paper we will focus on the hypercube only.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nIn this paper we address this problem in certain cases. We provide a \u201cbi-criteria\u201d approximation\nfor the special case where the functionals \u03c6t(x) are \u03c6i,j(x) = xixj, i.e. pairwise moments: we\nproduce an efficiently sampleable distribution over the hypercube which matches these moments up\nto multiplicative constant factors, and has entropy at most a constant factor smaller than the\nentropy of the maximum entropy distribution. 2\nFurthermore, the distribution which achieves this is very natural: the sign of a multivariate normal\nvariable. 
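A minimal sketch of such a sampler, for illustration only and not taken from the paper: it assumes a two-variable covariance matrix with off-diagonal entry rho, and the function names (sample_sign_gaussian_pair, predicted_moment, empirical_moment) are hypothetical. It draws sign(g) with g ~ N(0, Sigma + beta I) and compares the empirical pairwise moment with (2/pi) arcsin(rho/(1 + beta)), the value given by the classical arcsine identity for signs of correlated Gaussians.

```python
import math
import random


def sample_sign_gaussian_pair(rho, beta, rng):
    # g1 ~ N(0, Sigma) with Sigma = [[1, rho], [rho, 1]], built from two standard normals.
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    g1 = (z1, rho * z1 + math.sqrt(1.0 - rho * rho) * z2)
    # g2 ~ N(0, beta * I), independent of g1.
    scale = math.sqrt(beta)
    g2 = (scale * rng.gauss(0.0, 1.0), scale * rng.gauss(0.0, 1.0))
    # The sample is the coordinate-wise sign of g = g1 + g2.
    return tuple(1 if a + b >= 0 else -1 for a, b in zip(g1, g2))


def predicted_moment(rho, beta):
    # Arcsine identity applied to the normalized correlation rho / (1 + beta).
    return (2.0 / math.pi) * math.asin(rho / (1.0 + beta))


def empirical_moment(rho, beta, num_samples=50000, seed=0):
    # Monte Carlo estimate of E[x1 * x2] under the sign-of-Gaussian distribution.
    rng = random.Random(seed)
    total = 0
    for _ in range(num_samples):
        x1, x2 = sample_sign_gaussian_pair(rho, beta, rng)
        total += x1 * x2
    return total / num_samples
```

For a positive rho, the predicted moment lies between (2/pi) rho/(1 + beta) and rho/(1 + beta), which is the two-sided moment guarantee specialized to this toy case; larger beta trades moment accuracy for more independent noise per coordinate.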
Our result provides a theoretical explanation for the phenomenon observed by the computational\nneuroscience community [BB07] that this distribution (there named the dichotomized Gaussian)\nhas near-maximum entropy.\nVariational methods The above results also allow us to get results for a seemingly unrelated problem\n\u2013 approximating the partition function Z = \u2211_{x\u2208{\u22121,1}^n} exp(\u2211_{t=1}^T Jt \u03c6t(x)) of a member of an\nexponential family. The reason this task is important is that it is tied to calculating marginals.\nOne of the ways this task is solved is variational methods: namely, expressing log Z as an optimization\nproblem. While there is a plethora of work on variational methods, of many flavors (mean field,\nBethe/Kikuchi relaxations, TRBP, etc.; for a survey, see [WJ08]), they typically come either with no\nguarantees, or with guarantees in very constrained cases (e.g. loopless graphs, graphs with large girth,\netc. [WJW03, WJW05]). While this is a rich area of research, the following extremely basic research\nquestion has not been answered:\nWhat is the best approximation guarantee on the partition function in the worst case (with no\nadditional assumptions on the potentials)?\n\nIn the low-temperature limit, i.e. when |Jt| \u2192 \u221e, log Z \u2192 max_{x\u2208{\u22121,1}^n} \u2211_{t=1}^T Jt \u03c6t(x), i.e. the\nquestion reduces purely to optimization. In this regime, this question has very satisfying answers\nfor many families \u03c6t(x). One classical example is when the functionals are \u03c6i,j(x) = xixj. 
In the\ngraphical model community, these are known as Ising models, and in the optimization community this\nis the problem of optimizing quadratic forms, which has been studied by [CW04, AN06, AMMN06].\nIn the optimization version, the previous papers showed that in the worst case, one can get an O(log n)\nmultiplicative factor approximation of it, and that unless P = NP, one cannot get better than\nconstant factor approximations of it.\nIn the finite-temperature version, it is known that it is NP-hard to achieve a 1 + \u03b5 factor approximation\nto the partition function (i.e. construct an FPRAS) [SS12], but nothing is known about coarser\napproximations. We prove in this paper, informally, that one can get comparable multiplicative\nguarantees on the log-partition function in the finite temperature case as well \u2013 using the tools and\ninsights we develop on the maximum entropy principles.\nOur methods are extremely generic, and likely to apply to many other exponential families, where\nalgorithms based on linear/semidefinite programming relaxations are known to give good guarantees\nin the optimization regime.\n\n2 Statements of results and prior work\n\nApproximate maximum entropy The main theorem in this section is the following one.\nTheorem 2.1. 
For any covariance matrix \u03a3 of a centered distribution \u00b5 : {\u22121, 1}^n \u2192 R, i.e.\nE\u00b5[xixj] = \u03a3i,j, E\u00b5[xi] = 0, there is an efficiently sampleable distribution \u02dc\u00b5, which can be sampled\nas sign(g), where g \u223c N(0, \u03a3 + \u03b2I), and satisfies\n\n(G/(1 + \u03b2)) \u03a3i,j \u2264 E\u02dc\u00b5[XiXj] \u2264 (1/(1 + \u03b2)) \u03a3i,j\n\nand has entropy\n\nH(\u02dc\u00b5) \u2265 (n/25) \u00b7 (3^{1/4}\u221a\u03b2 \u2212 1)^2 / (\u221a3 \u03b2), for any \u03b2 \u2265 1/3^{1/2}.\n\nThere are two prior works on computational issues relating to maximum entropy principles, both\nproving hardness results.\n[BGS14] considers the \u201chard-core\u201d model where the functionals \u03c6t are such that the distribution \u00b5(x)\nputs zero mass on configurations x which are not independent sets with respect to some graph G.\nThey show that unless NP = RP, there is no FPRAS for calculating the potentials Jt, given the mean\nparameters E\u00b5[\u03c6t(x)].\n\n2 In fact, we produce a distribution with entropy \u2126(n), which implies the latter claim since the maximum\nentropy of any distribution over {\u22121, 1}^n is at most n.\n\n[SV14] prove an equivalence between calculating the mean parameters and calculating partition\nfunctions. More precisely, they show that given an oracle that can calculate the mean parameters up\nto a (1 + \u03b5) multiplicative factor in time O(poly(1/\u03b5)), one can calculate the partition function of the\nsame exponential family up to (1 + O(poly(\u03b5))) multiplicative factor, in time O(poly(1/\u03b5)). Note,\nthe \u03b5 in this work potentially needs to be polynomially small in n (i.e. an oracle that can calculate the\nmean parameters to a fixed multiplicative constant cannot be used.)\nBoth results prove hardness for fine-grained approximations to the maximum entropy principle, and\nask for outputting approximations to the mean parameters. 
Our result circumvents these hardness\nresults by providing a distribution which is not in the maximum-entropy exponential family, and is\nallowed to only approximately match the moments as well. To the best of our knowledge, such an\napproximation, while very natural, has not been considered in the literature.\nProvable variational methods The main theorems in this section will concern the approximation\nfactor that can be achieved by degree-2 pseudo-moment relaxations of the standard variational\nprinciple due to Gibbs [Ell12]. As outlined before, we will be concerned with a particularly popular\nexponential family: Ising models. We will prove the following three results:\nTheorem 2.2 (Ferromagnetic Ising, informal). There is a convex programming relaxation based on\ndegree-2 pseudo-moments that calculates up to multiplicative approximation factor 50 the value of\nlog Z where Z is the partition function of the exponential distribution \u00b5(x) \u221d exp(\u2211_{i,j} Ji,j xixj) for\nJi,j > 0.\nTheorem 2.3 (Ising model, informal). There is a convex programming relaxation based on degree-2\npseudo-moments that calculates up to multiplicative approximation factor O(log n) the value of\nlog Z where Z is the partition function of the exponential distribution \u00b5(x) \u221d exp(\u2211_{i,j} Ji,j xixj).\nTheorem 2.4 (Ising model, informal). There is a convex programming relaxation based on degree-2\npseudo-moments that calculates up to multiplicative approximation factor O(log \u03c7(G)) the value of\nlog Z where Z is the partition function of the exponential distribution \u00b5(x) \u221d exp(\u2211_{i,j\u2208E(G)} Ji,j xixj)\nwhere G = (V(G), E(G)) is a graph with chromatic number \u03c7(G). 3\n\nWhile a lot of work is done on variational methods in general (see the survey by [WJ08] for a detailed\noverview), to the best of our knowledge nothing is known about the worst-case guarantee that we\nare interested in here. 
Moreover, other than a recent paper by [Ris16], no other work has provided\nprovable bounds for variational methods that proceed via a convex relaxation and a rounding thereof. 4\n[Ris16] provides guarantees in the case of Ising models that are also based on pseudo-moment\nrelaxations of the variational principle, albeit only in the special case when the graph is \u201cdense\u201d in a\nsuitably defined sense. 5 The results there are very specific to the density assumption and cannot be\nadapted to our worst-case setting.\nFinally, we mention that in the special case of the ferromagnetic Ising models, an algorithm based on\nMCMC was provided by [JS93], which can give an approximation factor of (1 + \u03b5) to the partition\nfunction and runs in time O(n^11 poly(1/\u03b5)). In spite of this, the focus of this part of our paper is\nto provide understanding of variational methods in certain cases, as they continue to be popular in\npractice for their faster running time compared to MCMC-based methods but are theoretically much\nmore poorly studied.\n\n3 Theorem 2.4 is strictly more general than Theorem 2.3; however, the proof of Theorem 2.3 uses less heavy\nmachinery and is illuminating enough that we feel it merits being presented as a separate theorem.\n\n4 In some sense, it is possible to give provable bounds for Bethe-entropy based relaxations, via analyzing\nbelief propagation directly, which has been done in cases where there is correlation decay and the graph is locally\ntree-like. 
[WJ08] has a detailed overview of such results.\n\n5 More precisely, they prove that in the case when \u2200i, j, \u2206|Ji,j| \u2264 (\u2206/n^2) \u2211_{i,j} |Ji,j|, one can get an additive\n\u03b5(\u2211_{i,j} Ji,j) approximation to log Z in time n^{O(\u2206/\u03b5^2)}.\n\n3 Approximate maximum entropy principles\n\nLet us recall the problem we want to solve:\nApproximate maximum entropy principles We are given a positive-semidefinite matrix \u03a3 \u2208 R^{n\u00d7n}\nwith \u03a3i,i = 1, \u2200i \u2208 [n], which is the covariance matrix of a centered distribution over {\u22121, 1}^n,\ni.e. E\u00b5[xixj] = \u03a3i,j, E\u00b5[xi] = 0, for a distribution \u00b5 : {\u22121, 1}^n \u2192 R. We wish to produce a\ndistribution \u02dc\u00b5 : {\u22121, 1}^n \u2192 R with pairwise covariances that match the given ones up to constant\nfactors, and entropy within a constant factor of the maximum entropy distribution with covariance \u03a3. 6\n\nBefore stating the result formally, it will be useful to define the following constant:\nDefinition 3.1. Define the constant G = min_{t\u2208[\u22121,1]} {(2/\u03c0) arcsin(t)/t} \u2248 0.64.\nWe will prove the following main theorem:\nTheorem 3.1 (Main, approximate entropy principle). For any positive-semidefinite matrix \u03a3 with\n\u03a3i,i = 1, \u2200i, there is an efficiently sampleable distribution \u02dc\u00b5 : {\u22121, 1}^n \u2192 R, which can be sampled\nas sign(g), where g \u223c N(0, \u03a3 + \u03b2I), and satisfies (G/(1 + \u03b2)) \u03a3i,j \u2264 E\u02dc\u00b5[xixj] \u2264 (1/(1 + \u03b2)) \u03a3i,j and has entropy\nH(\u02dc\u00b5) \u2265 (n/25) \u00b7 (3^{1/4}\u221a\u03b2 \u2212 1)^2 / (\u221a3 \u03b2), where \u03b2 \u2265 1/3^{1/2}.\n\nNote \u02dc\u00b5 is in fact very close to the one which is classically used to round semidefinite relaxations\nfor solving the MAX-CUT problem. 
[GW95] We will prove Theorem 3.1 in two parts \u2013 by first lower\nbounding the entropy of \u02dc\u00b5, and then by bounding the moments of \u02dc\u00b5.\nTheorem 3.2. The entropy of the distribution \u02dc\u00b5 satisfies H(\u02dc\u00b5) \u2265 (n/25) \u00b7 (3^{1/4}\u221a\u03b2 \u2212 1)^2 / (\u221a3 \u03b2) when \u03b2 \u2265 1/3^{1/2}.\n\nProof. A sample g from N(0, \u02dc\u03a3) can be produced by sampling g1 \u223c N(0, \u03a3), g2 \u223c N(0, \u03b2I) and\nsetting g = g1 + g2. The sum of two independent multivariate normals is again a multivariate normal. Furthermore,\nthe mean of g is 0, and since g1, g2 are independent, the covariance of g is \u03a3 + \u03b2I = \u02dc\u03a3.\nLet\u2019s denote the random variable Y = sign(g1 + g2) which is distributed according to \u02dc\u00b5. We wish\nto lower bound the entropy of Y. Toward that goal, denote the random variable S := {i \u2208 [n] :\n|(g1)i| \u2264 cD} for c, D to be chosen. Then, we have, for \u03b3 = (c \u2212 1)/c,\n\nH(Y) \u2265 H(Y|S) = \u2211_{S\u2286[n]} Pr[S = S] H(Y|S = S) \u2265 \u2211_{S\u2286[n],|S|\u2265\u03b3n} Pr[S = S] H(Y|S = S)\n\nwhere the first inequality follows since conditioning does not increase entropy, and the latter by the\nnon-negativity of entropy. Continuing the calculation, we get:\n\n\u2211_{S\u2286[n],|S|\u2265\u03b3n} Pr[S = S] H(Y|S = S) \u2265 \u2211_{S\u2286[n],|S|\u2265\u03b3n} Pr[S = S] min_{S\u2286[n],|S|\u2265\u03b3n} H(Y|S = S)\n= Pr[|S| \u2265 \u03b3n] min_{S\u2286[n],|S|\u2265\u03b3n} H(Y|S = S)\n\nWe will lower bound Pr[|S| \u2265 \u03b3n] first. Notice that E[\u2211_{i=1}^n (g1)_i^2] = n, therefore by Markov\u2019s\ninequality, Pr[\u2211_{i=1}^n (g1)_i^2 \u2265 Dn] \u2264 1/D. On the other hand, if \u2211_{i=1}^n (g1)_i^2 \u2264 Dn, then |{i : (g1)_i^2 \u2265\ncD}| \u2264 n/c, which means that |{i : (g1)_i^2 \u2264 cD}| \u2265 n \u2212 n/c = (c \u2212 1)n/c = \u03b3n. Putting things together,\nthis means Pr[|S| \u2265 \u03b3n] \u2265 1 \u2212 1/D.\nIt remains to lower bound min_{S\u2286[n],|S|\u2265\u03b3n} H(Y|S = S). 
For every S \u2286 [n], |S| \u2265 \u03b3n, denoting by\nYS the coordinates of Y restricted to S, we get\n\nH(Y|S = S) \u2265 H(YS|S = S) \u2265 H\u221e(YS|S = S) = \u2212 log(max_{yS} Pr[YS = yS|S = S])\n\n(where H\u221e is the min-entropy), so we only need to bound max_{yS} Pr[YS = yS|S = S].\n\n6 Note for a distribution over {\u22121, 1}^n, the maximal entropy a distribution can have is n, which is achieved\nby the uniform distribution.\n\nWe will now, for any yS, upper bound Pr[YS = yS|S = S]. Recall that the event S = S implies that\n\u2200i \u2208 S, |(g1)i| \u2264 cD. Since g2 is independent of g1, we know that for every fixed g \u2208 R^n:\n\nPr[YS = yS|S = S, g1 = g] = \u03a0_{i\u2208S} Pr[sign([g]i + [g2]i) = yi]\n\nFor a fixed i \u2208 S, consider the term Pr[sign([g]i + [g2]i) = yi]. Without loss of generality, let\u2019s\nassume [g]i > 0 (the proof is completely symmetric in the other case). Then, since [g]i is positive\nand g2 has mean 0, we have Pr[[g]i + [g2]i < 0] \u2264 1/2.\nMoreover,\n\nPr[[g]i + [g2]i > 0] = Pr[[g2]i > 0] Pr[[g]i + [g2]i > 0 | [g2]i > 0]\n+ Pr[[g2]i < 0] Pr[[g]i + [g2]i > 0 | [g2]i < 0]\n\nThe first term is upper bounded by 1/2 since Pr[[g2]i > 0] \u2264 1/2. 
The second term we will bound using\nstandard Gaussian tail bounds:\n\nPr[[g]i + [g2]i > 0 | [g2]i < 0] \u2264 Pr[|[g2]i| \u2264 |[g]i| | [g2]i < 0]\n= Pr[|[g2]i| \u2264 |[g]i|] \u2264 Pr[([g2]i)^2 \u2264 cD]\n= 1 \u2212 Pr[([g2]i)^2 > cD]\n\u2264 1 \u2212 (2/\u221a(2\u03c0)) exp(\u2212cD/2\u03b2) (\u221a(\u03b2/cD) \u2212 (\u221a(\u03b2/cD))^3)\n\nwhich implies\n\nPr[[g2]i < 0] Pr[[g]i + [g2]i > 0 | [g2]i < 0] \u2264 (1/2) (1 \u2212 (2/\u221a(2\u03c0)) exp(\u2212cD/2\u03b2) (\u221a(\u03b2/cD) \u2212 (\u221a(\u03b2/cD))^3))\n\nPutting together, we have\n\nPr[sign((g1)i + (g2)i) = yi] \u2264 1 \u2212 (1/\u221a(2\u03c0)) exp(\u2212cD/2\u03b2) (\u221a(\u03b2/cD) \u2212 (\u221a(\u03b2/cD))^3)\n\nTogether with the fact that |S| \u2265 \u03b3n we get\n\nPr[YS = yS|S = S, g1 = g] \u2264 [1 \u2212 (1/\u221a(2\u03c0)) exp(\u2212cD/2\u03b2) (\u221a(\u03b2/cD) \u2212 (\u221a(\u03b2/cD))^3)]^{\u03b3n}\n\nwhich implies that\n\nH(Y) \u2265 \u2212(1 \u2212 1/D) ((c \u2212 1)n/c) log[1 \u2212 (1/\u221a(2\u03c0)) exp(\u2212cD/2\u03b2) (\u221a(\u03b2/cD) \u2212 (\u221a(\u03b2/cD))^3)]\n\nBy setting c = D = 3^{1/4}\u221a\u03b2 and a straightforward (albeit unpleasant) calculation, we can check that\nH(Y) \u2265 (n/25) \u00b7 (3^{1/4}\u221a\u03b2 \u2212 1)^2 / (\u221a3 \u03b2), as we need.\n\nWe next show that the moments of the distribution are preserved up to a constant G/(1 + \u03b2).\nLemma 3.1. The distribution \u02dc\u00b5 has (G/(1 + \u03b2)) \u03a3i,j \u2264 E\u02dc\u00b5[XiXj] \u2264 (1/(1 + \u03b2)) \u03a3i,j.\n\nProof. Consider the Gram decomposition of \u02dc\u03a3i,j = \u27e8vi, vj\u27e9. 
Then, \u02dc\u00b5 is equal in distribution\nto (sign(\u27e8v1, s\u27e9), . . . , sign(\u27e8vn, s\u27e9)) where s \u223c N(0, I). Similarly as in the analysis of Goemans-Williamson\n[GW95], if \u00afvi = vi/\u2016vi\u2016, we have G\u27e8\u00afvi, \u00afvj\u27e9 \u2264 E\u02dc\u00b5[XiXj] = (2/\u03c0) arcsin(\u27e8\u00afvi, \u00afvj\u27e9) \u2264\n\u27e8\u00afvi, \u00afvj\u27e9. However, since \u27e8\u00afvi, \u00afvj\u27e9 = \u27e8vi, vj\u27e9/(\u2016vi\u2016\u2016vj\u2016) = \u02dc\u03a3i,j/(\u2016vi\u2016\u2016vj\u2016) = \u03a3i,j/(\u2016vi\u2016\u2016vj\u2016) for i \u2260 j, and \u2016vi\u2016 = \u221a(\u02dc\u03a3i,i) = \u221a(1 + \u03b2), \u2200i \u2208 [1, n], we get that\n\n(G/(1 + \u03b2)) \u03a3i,j \u2264 E\u02dc\u00b5[XiXj] \u2264 (1/(1 + \u03b2)) \u03a3i,j\n\nas we want.\n\nTheorem 3.2 and Lemma 3.1 together imply Theorem 3.1.\n\n4 Provable bounds for variational methods\n\nWe will in this section consider applications of the approximate maximum entropy principles we\ndeveloped for calculating partition functions of Ising models. Before we dive into the results, we give\nbrief preliminaries on variational methods and pseudo-moment convex relaxations.\nPreliminaries on variational methods and pseudo-moment convex relaxations Recall, variational\nmethods are based on the following simple lemma, which characterizes log Z as the solution\nof an optimization problem. It essentially dates back to Gibbs [Ell12], who used it in the context of\nstatistical mechanics, though it has been rediscovered by machine learning researchers [WJ08]:\nLemma 4.1 (Variational characterization of log Z). Let us denote by M the polytope of distributions\nover {\u22121, 1}^n. 
Then,\n\nlog Z = max_{\u00b5\u2208M} { \u2211_t Jt E\u00b5[\u03c6t(x)] + H(\u00b5) }    (1)\n\nWhile the above lemma reduces calculating log Z to an optimization problem, optimizing over\nthe polytope M is impossible in polynomial time. We will proceed in a way which is natural for\noptimization problems \u2013 by instead optimizing over a relaxation M\u2032 of that polytope.\nThe relaxation will be associated with the degree-2 Lasserre hierarchy. Intuitively, M\u2032 has as\nvariables tentative pairwise moments of a distribution over {\u22121, 1}^n, and it imposes all constraints on\nthe moments that hold for distributions over {\u22121, 1}^n. To define M\u2032 more precisely we will need\nthe following notion (for a more in-depth review of moment-based convex hierarchies, the reader\ncan consult [BKS14]):\nDefinition 4.1. A degree-2 pseudo-moment 7 \u02dcE\u03bd[\u00b7] is a linear operator mapping polynomials of\ndegree 2 to R, such that \u02dcE\u03bd[xi^2] = 1, and \u02dcE\u03bd[p(x)^2] \u2265 0 for any polynomial p(x) of degree 1.\nWe will be optimizing over the polytope M\u2032 of all degree-2 pseudo-moments, i.e. we will consider\nsolving\n\nmax_{\u02dcE\u03bd[\u00b7]\u2208M\u2032} { \u2211_t Jt \u02dcE\u03bd[\u03c6t(x)] + \u02dcH(\u02dcE\u03bd[\u00b7]) }\n\nwhere \u02dcH will be a proxy for the entropy we will have to define (since entropy is a global property\nthat depends on all moments, and \u02dcE\u03bd only contains information about second order moments).\nTo see this optimization problem is convex, we show that it can easily be written as a semidefinite\nprogram. Namely, note that the pseudo-moment operators are linear, so it suffices to define them over\nmonomials only. Hence, the variables will simply be \u02dcE\u03bd(xS) for all monomials xS of degree at most\n2. 
The constraints \u02dcE\u03bd[xi^2] = 1 then are clearly linear, as is the \u201cenergy part\u201d of the objective function.\nSo we only need to worry about the constraint \u02dcE\u03bd[p(x)^2] \u2265 0 and the entropy functional.\nWe claim the constraint \u02dcE\u03bd[p(x)^2] \u2265 0 can be written as a PSD constraint: namely, define the\nmatrix Q, which is indexed by all the monomials of degree at most 1, and satisfies Q(xS, xT) =\n\u02dcE\u03bd[xSxT]. It is easy to see that \u02dcE\u03bd[p(x)^2] \u2265 0 \u2261 Q \u2ab0 0.\n\n7 The reason \u02dcE\u03bd[\u00b7] is called a pseudo-moment is that it behaves like the moments of a distribution \u03bd :\n{\u22121, 1}^n \u2192 [0, 1], albeit only over polynomials of degree at most 2.\n\nHence, the final concern is how to write an expression for the entropy in terms of the low-order\nmoments, since entropy is a global property that depends on all moments. There are many candidates\nfor this in machine learning, like Bethe/Kikuchi entropy, tree-reweighted Bethe entropy, log-determinant,\netc. However, in the worst case, none of them come with any guarantees. We will in\nfact show that the entropy functional is not an issue \u2013 we will relax the entropy trivially to n.\nGiven all of this, the final relaxation we will consider is:\n\nmax_{\u02dcE\u03bd[\u00b7]\u2208M\u2032} { \u2211_t Jt \u02dcE\u03bd[\u03c6t(x)] + n }    (2)\n\nFrom the prior setup it is clear that the solution to (2) is an upper bound to log Z. To prove a claim like\nTheorem 2.3 or Theorem 2.4, we will then provide a rounding of the solution. In this instance, this\nwill mean producing a distribution \u02dc\u00b5 which has the value of \u2211_t Jt E\u02dc\u00b5[\u03c6t(x)] + H(\u02dc\u00b5) comparable to\nthe value of the solution. 
Note this is slightly different than the usual requirement in optimization,\nwhere one cares only about producing a single x \u2208 {\u22121, 1}^n with comparable value to the solution.\nOur distribution \u02dc\u00b5 will have entropy \u2126(n), and preserves the \u201cenergy\u201d portion of the objective\n\u2211_t Jt E\u00b5[\u03c6t(x)] up to a comparable factor to what is achievable in the optimization setting.\n\nWarmup: exponential family analogue of MAX-CUT As a warmup, to illustrate the basic ideas\nbehind the above rounding strategy, before we consider Ising models we consider the exponential\nfamily analogue of MAX-CUT. It is defined by the functionals \u03c6i,j(x) = (xi \u2212 xj)^2. Concretely,\nwe wish to approximate the partition function of the distribution \u00b5(x) \u221d exp(\u2211_{i,j} Ji,j (xi \u2212 xj)^2).\n\nWe will prove the following simple observation:\nObservation 4.1. The relaxation (2) provides a factor 2 approximation of log Z.\n\nProof. We proceed as outlined in the previous section, by providing a rounding of (2). We point out\nagain, unlike the standard case in optimization, where typically one needs to produce an assignment of\nthe variables, because of the entropy term here it is crucial that the rounding produces a distribution.\nThe distribution \u02dc\u00b5 we produce here will be especially simple: we will round each xi independently\nwith probability 1/2. 
Then, clearly H(\u02dc\u00b5) = n. On the other hand, we similarly have Pr\u02dc\u00b5[(xi \u2212 xj)^2 =\n1] = 1/2, since xi and xj are rounded independently. Hence, E\u02dc\u00b5[(xi \u2212 xj)^2] \u2265 1/2. Altogether, this\nimplies \u2211_{i,j} Ji,j E\u02dc\u00b5[(xi \u2212 xj)^2] + H(\u02dc\u00b5) \u2265 (1/2) (\u2211_{i,j} Ji,j \u02dcE\u03bd[(xi \u2212 xj)^2] + n),\nas we needed.\n\n4.1 Ising models\n\nWe proceed with the main results of this section on Ising models, which is the case where \u03c6i,j(x) =\nxixj. We will treat the ferromagnetic and general cases separately, as outlined in Section 2.\nTo be concrete, we will be given potentials Ji,j, and we wish to calculate the partition function of the\nIsing model \u00b5(x) \u221d exp(\u2211_{i,j} Ji,j xixj).\n\nFerromagnetic case\nRecall, in the ferromagnetic case of the Ising model, we have the condition that the potentials Ji,j > 0.\nWe will provide a convex relaxation which has a constant factor approximation in this case. First, recall\nthe famous first Griffiths inequality due to Griffiths [Gri67], which states that in the ferromagnetic\ncase, E\u00b5[xixj] \u2265 0, \u2200i, j.\nUsing this inequality, we will look at the following natural strengthening of the relaxation (2):\n\nmax_{\u02dcE\u03bd[\u00b7]\u2208M\u2032; \u02dcE\u03bd[xixj]\u22650, \u2200i,j} { \u2211_t Jt \u02dcE\u03bd[\u03c6t(x)] + n }    (3)\n\nWe will prove the following theorem, as a straightforward implication of our claims from Section 3:\n\nTheorem 4.1. The relaxation (3) provides a factor 50 approximation of log Z.\n\nProof. Notice, due to Griffiths\u2019 inequality, (3) is in fact a relaxation of the Gibbs variational principle\n(and hence an upper bound on log Z). Same as before, we will provide a rounding of (3). We will use the\ndistribution \u02dc\u00b5 we designed in Section 3: the sign of a Gaussian with covariance matrix \u03a3 + \u03b2I, where \u03a3i,j = \u02dcE\u03bd[xixj], for a\n\u03b2 which we will specify. 
By Theorem 3.2, we then have H(\u02dc\u00b5) \u2265 (n/25) \u00b7 (3^{1/4}\u221a\u03b2 \u2212 1)^2 / (\u221a3 \u03b2)\nwhenever \u03b2 \u2265 1/3^{1/2}.\nBy Lemma 3.1, on the other hand, we can prove that E\u02dc\u00b5[xixj] \u2265 (G/(1 + \u03b2)) \u02dcE\u03bd[xixj].\n\nBy setting \u03b2 = 21.8202, we get (n/25) \u00b7 (3^{1/4}\u221a\u03b2 \u2212 1)^2 / (\u221a3 \u03b2) \u2265 0.02n and G/(1 + \u03b2) \u2265 0.02, which implies that\n\n\u2211_{i,j} Ji,j E\u02dc\u00b5[xixj] + H(\u02dc\u00b5) \u2265 0.02 (\u2211_{i,j} Ji,j \u02dcE\u03bd[xixj] + n)\n\nwhich is what we need.\n\nNote that the above proof does not work in the general Ising model case: when \u02dcE\u03bd[xixj] can be\neither positive or negative, even if we preserved each \u02dcE\u03bd[xixj] up to a constant factor, this may not\npreserve the sum \u2211_{i,j} Ji,j \u02dcE\u03bd[xixj] due to cancellations in that expression.\n\nGeneral Ising models case\nFinally, we will tackle the general Ising model case. As noted in the previous section, the straightforward\napplication of the results proven in Section 3 doesn\u2019t work, so we have to consider a different\nrounding \u2013 again inspired by roundings used in optimization.\nThe intuition is the same as in the ferromagnetic case: we wish to design a rounding which preserves\nthe \u201cenergy\u201d portion of the objective, while having a high entropy. In the previous section, this\nwas achieved by modifying the Goemans-Williamson rounding so that it produces a high-entropy\ndistribution. We will do a similar thing here, by modifying the roundings due to [CW04] and [AMMN06].\nThe convex relaxation we will consider will just be the basic one, (2), and we will prove the following\ntwo theorems:\nTheorem 4.2. The relaxation (2) provides a factor O(log n) approximation to log Z when \u03c6i,j(x) =\nxixj.\nTheorem 4.3. 
The relaxation (2) provides a factor O(log(\u03c7(G))) approximation to log Z when\n\u03c6i,j(x) = xixj for (i, j) \u2208 E(G) of some graph G = (V(G), E(G)), and \u03c7(G) is the chromatic\nnumber of G.\n\nSince the chromatic number of a graph is bounded by n, the second theorem is in fact strictly stronger\nthan the first; however, the proof of the first theorem uses less heavy machinery, and is illuminating\nenough to be presented on its own.\nDue to space constraints, the proofs of these theorems are deferred to the appendix.\n\n5 Conclusion\n\nIn summary, we presented computationally efficient approximate versions of the classical max-entropy\nprinciple by [Jay57]: efficiently sampleable distributions which preserve given pairwise\nmoments up to a multiplicative constant factor, while having entropy within a constant factor of the\nmaximum entropy distribution matching those moments. Additionally, we applied our insights to\ndesigning provable variational methods for Ising models which provide comparable guarantees for\napproximating the log-partition function to those in the optimization setting. Our methods are based\non convex relaxations of the standard variational principle due to Gibbs, and are extremely generic,\nand we hope they will find applications for other exponential families.\n\nReferences\n[AMMN06] Noga Alon, Konstantin Makarychev, Yury Makarychev, and Assaf Naor. Quadratic\nforms on graphs. Inventiones mathematicae, 163(3):499\u2013522, 2006.\n\n[AN06] Noga Alon and Assaf Naor. Approximating the cut-norm via Grothendieck\u2019s inequality.\nSIAM Journal on Computing, 35(4):787\u2013803, 2006.\n\n[BB07] Matthias Bethge and Philipp Berens. Near-maximum entropy models for binary neural\nrepresentations of natural images. 2007.\n\n[BGS14] Guy Bresler, David Gamarnik, and Devavrat Shah. Hardness of parameter estimation\nin graphical models. 
In Advances in Neural Information Processing Systems, pages 1062\u20131070, 2014.\n\n[BKS14] Boaz Barak, Jonathan A Kelner, and David Steurer. Rounding sum-of-squares relaxations. In Proceedings of the 46th Annual ACM Symposium on Theory of Computing, pages 31\u201340. ACM, 2014.\n\n[CW04] Moses Charikar and Anthony Wirth. Maximizing quadratic programs: extending Grothendieck\u2019s inequality. In Foundations of Computer Science, 2004. Proceedings. 45th Annual IEEE Symposium on, pages 54\u201360. IEEE, 2004.\n\n[Ell12] Richard S Ellis. Entropy, large deviations, and statistical mechanics, volume 271. Springer Science & Business Media, 2012.\n\n[EN78] Richard S Ellis and Charles M Newman. The statistics of Curie-Weiss models. Journal of Statistical Physics, 19(2):149\u2013161, 1978.\n\n[Gri67] Robert B Griffiths. Correlations in Ising ferromagnets. I. Journal of Mathematical Physics, 8(3):478\u2013483, 1967.\n\n[GW95] Michel X Goemans and David P Williamson. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. Journal of the ACM (JACM), 42(6):1115\u20131145, 1995.\n\n[Jay57] Edwin T Jaynes. Information theory and statistical mechanics. Physical Review, 106(4):620, 1957.\n\n[JS93] Mark Jerrum and Alistair Sinclair. Polynomial-time approximation algorithms for the Ising model. SIAM Journal on Computing, 22(5):1087\u20131116, 1993.\n\n[Ris16] Andrej Risteski. How to compute partition functions using convex programming hierarchies: provable bounds for variational methods. In Proceedings of the Conference on Learning Theory (COLT), 2016.\n\n[SS12] Allan Sly and Nike Sun. The computational hardness of counting in two-spin models on d-regular graphs. In Foundations of Computer Science (FOCS), 2012 IEEE 53rd Annual Symposium on, pages 361\u2013369. IEEE, 2012.\n\n[SV14] Mohit Singh and Nisheeth K Vishnoi. Entropy, optimization and counting. 
In Proceedings of the 46th Annual ACM Symposium on Theory of Computing, pages 50\u201359. ACM, 2014.\n\n[WJ08] Martin J Wainwright and Michael I Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1\u2013305, 2008.\n\n[WJW03] Martin J Wainwright, Tommi S Jaakkola, and Alan S Willsky. Tree-reweighted belief propagation algorithms and approximate ML estimation by pseudo-moment matching. 2003.\n\n[WJW05] Martin J Wainwright, Tommi S Jaakkola, and Alan S Willsky. A new class of upper bounds on the log partition function. Information Theory, IEEE Transactions on, 51(7):2313\u20132335, 2005.\n", "award": [], "sourceid": 2311, "authors": [{"given_name": "Andrej", "family_name": "Risteski", "institution": "Princeton University"}, {"given_name": "Yuanzhi", "family_name": "Li", "institution": "Princeton University"}]}