{"title": "Hardness of parameter estimation in graphical models", "book": "Advances in Neural Information Processing Systems", "page_first": 1062, "page_last": 1070, "abstract": "We consider the problem of learning the canonical parameters specifying an undirected graphical model (Markov random field) from the mean parameters. For graphical models representing a minimal exponential family, the canonical parameters are uniquely determined by the mean parameters, so the problem is feasible in principle. The goal of this paper is to investigate the computational feasibility of this statistical task. Our main result shows that parameter estimation is in general intractable: no algorithm can learn the canonical parameters of a generic pair-wise binary graphical model from the mean parameters in time bounded by a polynomial in the number of variables (unless RP = NP). Indeed, such a result has been believed to be true (see the monograph by Wainwright and Jordan) but no proof was known. Our proof gives a polynomial time reduction from approximating the partition function of the hard-core model, known to be hard, to learning approximate parameters. Our reduction entails showing that the marginal polytope boundary has an inherent repulsive property, which validates an optimization procedure over the polytope that does not use any knowledge of its structure (as required by the ellipsoid method and others).", "full_text": "Hardness of parameter estimation in graphical models\n\nGuy Bresler¹, David Gamarnik², Devavrat Shah¹\n\nLaboratory for Information and Decision Systems\n\n{gbresler,gamarnik,devavrat}@mit.edu\n\nDepartment of EECS¹ and Sloan School of Management²\n\nMassachusetts Institute of Technology\n\nAbstract\n\nWe consider the problem of learning the canonical parameters specifying an undirected graphical model (Markov random field) from the mean parameters. 
For graphical models representing a minimal exponential family, the canonical parameters are uniquely determined by the mean parameters, so the problem is feasible in principle. The goal of this paper is to investigate the computational feasibility of this statistical task. Our main result shows that parameter estimation is in general intractable: no algorithm can learn the canonical parameters of a generic pair-wise binary graphical model from the mean parameters in time bounded by a polynomial in the number of variables (unless RP = NP). Indeed, such a result has been believed to be true (see [1]) but no proof was known.\nOur proof gives a polynomial time reduction from approximating the partition function of the hard-core model, known to be hard, to learning approximate parameters. Our reduction entails showing that the marginal polytope boundary has an inherent repulsive property, which validates an optimization procedure over the polytope that does not use any knowledge of its structure (as required by the ellipsoid method and others).\n\n1 Introduction\n\nGraphical models are a powerful framework for succinct representation of complex high-dimensional distributions. As such, they are at the core of machine learning and artificial intelligence, and are used in a variety of applied fields including finance, signal processing, communications, biology, as well as the modeling of social and other complex networks. In this paper we focus on binary pairwise undirected graphical models, a rich class of models with wide applicability. 
This is a parametric family of probability distributions, and for the models we consider, the canonical parameters θ are uniquely determined by the vector μ of mean parameters, which consist of the node-wise and pairwise marginals.\nTwo primary statistical tasks pertaining to graphical models are inference and parameter estimation. A basic inference problem is the computation of marginals (or conditional probabilities) given the model, that is, the forward mapping θ ↦ μ. Conversely, the backward mapping μ ↦ θ corresponds to learning the canonical parameters from the mean parameters. The backward mapping is defined only for μ in the marginal polytope M of realizable mean parameters, and this is important in what follows. The backward mapping captures maximum likelihood estimation of parameters; the study of the statistical properties of maximum likelihood estimation for exponential families is a classical and important subject.\nIn this paper we are interested in the computational tractability of these statistical tasks. A basic question is whether or not these maps can be computed efficiently (namely in time polynomial in the problem size). As far as inference goes, it is well known that approximating the forward map (inference) is computationally hard in general. This was shown by Luby and Vigoda [2] for the hard-core model, a simple pairwise binary graphical model (defined in (2.1)). More recently, remarkably sharp results have been obtained, showing that computing the forward map for the hard-core model is tractable if and only if the system exhibits the correlation decay property [3, 4]. 
In contrast, to the best of our knowledge, no analogous hardness result exists for the backward mapping (parameter estimation), despite its seeming intractability [1].\nTangentially related hardness results have been previously obtained for the problem of learning the graph structure underlying an undirected graphical model. Bogdanov et al. [5] showed hardness of determining graph structure when there are hidden nodes, and Karger and Srebro [6] showed hardness of finding the maximum likelihood graph with a given treewidth. Computing the backward mapping, in comparison, requires estimation of the parameters when the graph is known.\nOur main result, stated precisely in the next section, establishes hardness of approximating the backward mapping for the hard-core model. Thus, despite the problem being statistically feasible, it is computationally intractable.\nThe proof is by reduction, showing that the backward map can be used as a black box to efficiently estimate the partition function of the hard-core model. The reduction, described in Section 4, uses the variational characterization of the log-partition function as a constrained convex optimization over the marginal polytope of realizable mean parameters. The gradient of the function to be minimized is given by the backward mapping, and we use a projected gradient optimization method. Since approximating the partition function of the hard-core model is known to be computationally hard, the reduction implies hardness of approximating the backward map.\nThe main technical difficulty in carrying out the argument arises because the convex optimization is constrained to the marginal polytope, an intrinsically complicated object. Indeed, even determining membership (or evaluating the projection) to within a crude approximation of the polytope is NP-hard [7]. 
Nevertheless, we show that it is possible to do the optimization without using any knowledge of the polytope structure, as is normally required by ellipsoid, barrier, or projection methods. To this end, we prove that the polytope boundary has an inherent repulsive property that keeps the iterates inside the polytope without actually enforcing the constraint. The consequence of the boundary repulsion property is stated in Proposition 4.6 of Section 4, which is proved in Section 5.\nOur reduction has a close connection to the variational approach to approximate inference [1]. There, the conjugate-dual representation of the log-partition function leads to a relaxed optimization problem defined over a tractable bound for the marginal polytope and with a simple surrogate to the entropy function. What our proof shows is that accurate approximation of the gradient of the entropy obviates the need to relax the marginal polytope.\nWe mention a related work of Kearns and Roughgarden [8] showing a polynomial-time reduction from inference to determining membership in the marginal polytope. Note that such a reduction does not establish hardness of parameter estimation: the empirical marginals obtained from samples are guaranteed to be in the marginal polytope, so an efficient algorithm could hypothetically exist for parameter estimation without contradicting the hardness of marginal polytope membership.\nAfter completion of our manuscript, we learned that Montanari [9] has independently and simultaneously obtained similar results showing hardness of parameter estimation in graphical models from the mean parameters. 
His high-level approach is similar to ours, but the details differ substantially.\n\n2 Main result\n\nIn order to establish hardness of learning parameters from marginals for pairwise binary graphical models, we focus on a specific instance of this class of graphical models, the hard-core model. Given a graph G = (V, E) (where V = {1, . . . , p}), the collection of independent set vectors I(G) ⊆ {0, 1}^V consists of vectors σ such that σ_i = 0 or σ_j = 0 (or both) for every edge {i, j} ∈ E. Each vector σ ∈ I(G) is the indicator vector of an independent set. The hard-core model assigns nonzero probability only to independent set vectors, with\n\nP_θ(σ) = exp( ∑_{i∈V} θ_i σ_i − Φ(θ) )  for each σ ∈ I(G).   (2.1)\n\nThis is an exponential family with vector of sufficient statistics φ(σ) = (σ_i)_{i∈V} ∈ {0, 1}^p and vector of canonical parameters θ = (θ_i)_{i∈V} ∈ R^p. In the statistical physics literature the model is usually parameterized in terms of node-wise fugacity (or activity) λ_i = e^{θ_i}. The log-partition function\n\nΦ(θ) = log( ∑_{σ∈I(G)} exp( ∑_{i∈V} θ_i σ_i ) )\n\nserves to normalize the distribution; note that Φ(θ) is finite for all θ ∈ R^p. Here and throughout, all logarithms are to the natural base.\nThe set M of realizable mean parameters plays a major role in the paper, and is defined as\n\nM = {μ ∈ R^p | there exists a θ such that E_θ[φ(σ)] = μ}.\n\nFor the hard-core model (2.1), the set M is a polytope equal to the convex hull of independent set vectors I(G) and is called the marginal polytope. 
The marginal polytope's structure can be rather complex, and one indication of this is that the number of half-space inequalities needed to represent M can be very large, depending on the structure of the graph G underlying the model [10, 11].\nThe model (2.1) is a regular minimal exponential family, so for each μ in the interior M° of the marginal polytope there corresponds a unique θ(μ) satisfying the dual matching condition\n\nE_θ[φ(σ)] = μ.\n\nWe are concerned with approximation of the backward mapping μ ↦ θ, and we use the following notion of approximation.\nDefinition 2.1. We say that ŷ ∈ R is a δ-approximation to y ∈ R if y(1 − δ) ≤ ŷ ≤ y(1 + δ). A vector v̂ ∈ R^p is a δ-approximation to v ∈ R^p if each entry v̂_i is a δ-approximation to v_i.\nWe next define the appropriate notion of efficient approximation algorithm.\nDefinition 2.2. A fully polynomial randomized approximation scheme (FPRAS) for a mapping f_p : X_p → R is a randomized algorithm that for each δ > 0 and input x ∈ X_p, with probability at least 3/4 outputs a δ-approximation f̂_p(x) to f_p(x), and moreover the running time is bounded by a polynomial Q(p, δ^{−1}).\n\nOur result uses the complexity classes RP and NP, defined precisely in any complexity text (such as [12]). The class RP consists of problems solvable by efficient (randomized polynomial) algorithms, and NP consists of many seemingly difficult problems with no known efficient algorithms. It is widely believed that NP ≠ RP. Assuming this, our result says that there cannot be an efficient approximation algorithm for the backward mapping in the hard-core model (and thus also for the more general class of binary pairwise graphical models).\nWe recall that approximating the backward mapping entails taking a vector μ as input and producing an approximation of the corresponding vector of canonical parameters θ as output. 
It should be noted that even determining whether a given vector μ belongs to the marginal polytope M is known to be an NP-hard problem [7]. However, our result shows that the problem is NP-hard even if the input vector μ is known a priori to be an element of the marginal polytope M.\nTheorem 2.3. Assuming NP ≠ RP, there does not exist an FPRAS for the backward mapping μ ↦ θ.\nAs discussed in the introduction, Theorem 2.3 is proved by showing that the backward mapping can be used as a black box to efficiently estimate the partition function of the hard-core model, known to be hard. This uses the variational characterization of the log-partition function as well as a projected gradient optimization method. Proving validity of the projected gradient method requires overcoming a substantial technical challenge: we show that the iterates remain within the marginal polytope without explicitly enforcing this (in particular, we do not project onto the polytope). The bulk of the paper is devoted to establishing this fact, which may be of independent interest.\nIn the next section we give necessary background on conjugate duality and the variational characterization as well as review the result we will use on hardness of computing the log-partition function. The proof of Theorem 2.3 is then given in Section 4.\n\n3 Background\n\n3.1 Exponential families and conjugate duality\n\nWe now provide background on exponential families (as can be found in the monograph by Wainwright and Jordan [1]) specialized to the hard-core model (2.1) on a fixed graph G = (V, E). General theory on conjugate duality justifying the statements of this subsection can be found in Rockafellar's book [13].\nThe basic relationship between the canonical and mean parameters is expressed via conjugate (or Fenchel) duality. 
The conjugate dual of the log-partition function Φ(θ) is\n\nΦ*(μ) := sup_{θ∈R^p} { ⟨μ, θ⟩ − Φ(θ) }.\n\nNote that for our model Φ(θ) is finite for all θ ∈ R^p and furthermore the supremum is uniquely attained. On the interior M° of the marginal polytope, −Φ* is the entropy function. The log-partition function can then be expressed as\n\nΦ(θ) = sup_{μ∈M} { ⟨θ, μ⟩ − Φ*(μ) },   (3.1)\n\nwith\n\nμ(θ) = arg max_{μ∈M} { ⟨θ, μ⟩ − Φ*(μ) }.   (3.2)\n\nThe forward mapping θ ↦ μ is specified by the variational characterization (3.2) or alternatively by the gradient map ∇Φ : R^p → M°.\nAs mentioned earlier, for each μ in the interior M° there is a unique θ(μ) satisfying the dual matching condition E_{θ(μ)}[φ(σ)] = (∇Φ)(θ(μ)) = μ.\nFor mean parameters μ ∈ M°, the backward mapping μ ↦ θ(μ) to the canonical parameters is given by\n\nθ(μ) = arg max_{θ∈R^p} { ⟨μ, θ⟩ − Φ(θ) }\n\nor by the gradient\n\n∇Φ*(μ) = θ(μ).\n\nThe latter representation will be the more useful one for us.\n\n3.2 Hardness of inference\n\nWe describe an existing result on the hardness of inference and state the corollary we will use. The result says that, subject to widely believed conjectures in computational complexity, no efficient algorithm exists for approximating the partition function of certain hard-core models. Recall that the hard-core model with fugacity λ is given by (2.1) with θ_i = ln λ for each i ∈ V.\nTheorem 3.1 ([3, 4]). Suppose d ≥ 3 and λ > λ_c(d) = (d − 1)^{d−1}/(d − 2)^d. Assuming NP ≠ RP, there exists no FPRAS for computing the partition function of the hard-core model with fugacity λ on regular graphs of degree d. In particular, since λ_c(5) = 256/243 > 1 > λ_c(6) = 3125/4096, no FPRAS exists when λ = 1 and d ≥ 6.\nWe remark that the source of hardness is the long-range dependence property of the hard-core model for λ > λ_c(d). 
It was shown in [14] that for λ < λ_c(d) the model exhibits decay of correlations and there is an FPRAS for the log-partition function (in fact there is a deterministic approximation scheme as well). We note that a number of hardness results are known for the hard-core and Ising models, including [15, 16, 3, 2, 4, 17, 18, 19]. The result stated in Theorem 3.1 suffices for our purposes.\nFrom this section we will need only the following corollary, proved in the Appendix. The proof, standard in the literature, uses the self-reducibility of the hard-core model to express the partition function in terms of marginals computed on subgraphs.\nCorollary 3.2. Consider the hard-core model (2.1) on graphs of degree at most d with parameters θ_i = 0 for all i ∈ V. Assuming NP ≠ RP, there exists no FPRAS μ̂(0) for the vector of marginal probabilities μ(0), where error is measured entry-wise as per Definition 2.1.\n\n4 Reduction by optimizing over the marginal polytope\n\nIn this section we describe our reduction and prove Theorem 2.3. We define polynomial constants\n\nε = p^{−8},  q = p^5,  and  s = ε/(2p^2),   (4.1)\n\nwhich we will leave as ε, q, and s to clarify the calculations. Also, given the asymptotic nature of the results, we assume that p is larger than a universal constant so that certain inequalities are satisfied.\nProposition 4.1. Fix a graph G on p nodes. Let θ̂ : M° → R^p be a black box giving a δ-approximation for the backward mapping μ ↦ θ for the hard-core model (2.1). Using 1/ε^2 calls to θ̂, and computation bounded by a polynomial in p, 1/δ, it is possible to produce a 4δp^{7/2}/(qε^2)-approximation μ̂(0) to the marginals μ(0) corresponding to all zero parameters.\n\nWe first observe that Theorem 2.3 follows almost immediately.\n\nProof of Theorem 2.3. A standard median amplification trick (see e.g. [20]) allows one to decrease the probability 1/4 of erroneous output by an FPRAS to below 1/(pε^{−2}) using O(log(pε^{−2})) function calls. Thus the assumed FPRAS for the backward mapping can be made to give a δ-approximation θ̂ to θ on 1/ε^2 successive calls, with probability of no erroneous outputs equal to at least 3/4. By taking δ = δ̃qε^2p^{−7/2}/4 in Proposition 4.1 we get a δ̃-approximation to μ(0) with computation bounded by a polynomial in p, 1/δ̃. In other words, the existence of an FPRAS for the mapping μ ↦ θ gives an FPRAS for the marginals μ(0), and by Corollary 3.2 this is not possible if NP ≠ RP.\nWe now work towards proving Proposition 4.1, the goal being to estimate the vector of marginals μ(0) for some fixed graph G. The desired marginals are given by the solution to the optimization (3.2) with θ = 0:\n\nμ(0) = arg min_{μ∈M} Φ*(μ).   (4.2)\n\nWe know from Section 3 that for x ∈ M° the gradient is ∇Φ*(x) = θ(x), that is, the backward mapping amounts to a first-order (gradient) oracle. A natural approach to solving the optimization problem (4.2) is to use a projected gradient method. For reasons that will become clear later, instead of projecting onto the marginal polytope M, we project onto the shrunken marginal polytope M_1 ⊂ M defined as\n\nM_1 = {μ ∈ M ∩ [qε, ∞)^p : μ + ε·e_i ∈ M for all i},   (4.3)\n\nwhere e_i is the ith standard basis vector.\nAs mentioned before, projecting onto M_1 is NP-hard, and this must therefore be avoided if we are to obtain a polynomial-time reduction. Nevertheless, we temporarily assume that it is possible to do the projection and address this difficulty later. 
With this in mind, we propose to solve the optimization (4.2) by a projected gradient method with fixed step size s,\n\nx^{t+1} = P_{M_1}(x^t − s∇Φ*(x^t)) = P_{M_1}(x^t − sθ(x^t)).   (4.4)\n\nIn order for the method (4.4) to succeed, a first requirement is that the optimum is inside M_1. The following lemma is proved in the Appendix.\nLemma 4.2. Consider the hard-core model (2.1) on a graph G with maximum degree d on p ≥ 2d+1 nodes and canonical parameters θ = 0. Then the corresponding vector of mean parameters μ(0) is in M_1.\nOne of the benefits of operating within M_1 is that the gradient is bounded by a polynomial in p, and this will allow the optimization procedure to converge in a polynomial number of steps. The following lemma amounts to a rephrasing of Lemmas 5.3 and 5.4 in Section 5 and the proof is omitted.\nLemma 4.3. We have the gradient bound ‖∇Φ*(x)‖_∞ = ‖θ(x)‖_∞ ≤ p/ε = p^9 for any x ∈ M_1.\nNext, we state general conditions under which an approximate projected gradient algorithm converges quickly. Better convergence rates are possible using the strong convexity of Φ* (shown in Lemma 4.5 below), but this lemma suffices for our purposes. The proof is standard (see [21] or Theorem 3.1 in [22] for a similar statement) and is given in the Appendix for completeness.\nLemma 4.4 (Projected gradient method). Let G : C → R be a convex function defined over a compact convex set C with minimizer x* ∈ arg min_{x∈C} G(x). Suppose we have access to an approximate gradient oracle ∇̂G(x) for x ∈ C with error bounded as sup_{x∈C} ‖∇̂G(x) − ∇G(x)‖_1 ≤ δ/2. Let L = sup_{x∈C} ‖∇̂G(x)‖. Consider the projected gradient method x^{t+1} = P_C(x^t − s∇̂G(x^t)) starting at x^1 ∈ C and with fixed step size s = δ/(2L^2). After T = 4‖x^1 − x*‖^2 L^2/δ^2 iterations the average x̄^T = (1/T) ∑_{t=1}^T x^t satisfies G(x̄^T) − G(x*) ≤ δ.\n\nTo translate accuracy in approximating the function value Φ*(x*) to accuracy in approximating x*, we use the fact that Φ* is strongly convex. The proof (in the Appendix) uses the equivalence between strong convexity of Φ* and strong smoothness of the Fenchel dual Φ, the latter being easy to check. Since we only require the implication of the lemma, we defer the definitions of strong convexity and strong smoothness to the appendix where they are used.\nLemma 4.5. The function Φ* : M° → R is p^{−3/2}-strongly convex. As a consequence, if Φ*(x) − Φ*(x*) ≤ δ for x ∈ M° and x* = arg min_{y∈M°} Φ*(y), then ‖x − x*‖ ≤ 2p^{3/2}δ.\nAt this point all the ingredients are in place to show that the updates (4.4) rapidly approach μ(0), but a crucial difficulty remains to be overcome. The assumed black box θ̂ for approximating the mapping μ ↦ θ is only defined for μ inside M, and thus it is not at all obvious how to evaluate the projection onto the closely related polytope M_1. Indeed, as shown in [7], even approximate projection onto M is NP-hard, and no polynomial time reduction can require projecting onto M_1 (assuming P ≠ NP).\nThe goal of the subsequent Section 5 is to prove Proposition 4.6 below, which states that the optimization procedure can be carried out without any knowledge about M or M_1. Specifically, we show that thresholding coordinates suffices, that is, instead of projecting onto M_1 we may project onto the translated non-negative orthant [qε, ∞)^p. Writing P for this projection, we show that the original projected gradient method (4.4) produces the same iterates x^t as the much simpler update rule\n\nx^{t+1} = P(x^t − sθ(x^t)).   (4.5)\n\nProposition 4.6. Choose constants as per (4.1). Suppose x^1 ∈ M_1, and consider the iterates x^{t+1} = P(x^t − sθ̂(x^t)) for t ≥ 1, where θ̂(x^t) is a δ-approximation of θ(x^t) for all t ≥ 1. 
Then x^t ∈ M_1 for all t ≥ 1, and thus the iterates are the same using either P or P_{M_1}.\nThe next section is devoted to the proof of Proposition 4.6. We now complete the reduction.\n\nProof of Proposition 4.1. We start the gradient update procedure x^{t+1} = P(x^t − sθ̂(x^t)) at the point x^1 = (1/2p, 1/2p, . . . , 1/2p), which we claim is within M_1 for any graph G for p = |V| large enough. To see this, note that (1/p, 1/p, . . . , 1/p) is in M, because it is a convex combination (with weight 1/p each) of the independent set vectors e_1, . . . , e_p. Hence x^1 + (1/2p)·e_i ∈ M, and additionally x^1_i = 1/2p ≥ qε, for all i.\nWe establish that x^t ∈ M_1 for each t ≥ 1 by induction, having verified the base case t = 1 in the preceding paragraph. Let x^t ∈ M_1 for some t ≥ 1. At iteration t of the update rule we make a call to the black box θ̂(x^t) giving a δ-approximation to the backward mapping θ(x^t), compute x^t − sθ̂(x^t), and then project onto [qε, ∞)^p. Proposition 4.6 ensures that x^{t+1} ∈ M_1. Therefore, the update x^{t+1} = P(x^t − sθ̂(x^t)) is the same as x^{t+1} = P_{M_1}(x^t − sθ̂(x^t)).\nWe can now apply Lemma 4.4 with G = Φ*, C = M_1, accuracy parameter δ' = 2δp^2/ε in place of the lemma's δ, and L = sup_{x∈C} ‖∇̂G(x)‖_2 ≤ (p(p/ε)^2)^{1/2} = p^{3/2}/ε. After\n\nT = 4‖x^1 − x*‖^2 L^2/δ'^2 ≤ 4p(p^3/ε^2)/(4δ^2p^4/ε^2) = 1/δ^2\n\niterations the average x̄^T = (1/T) ∑_{t=1}^T x^t satisfies G(x̄^T) − G(x*) ≤ δ'. Lemma 4.5 implies that ‖x̄^T − x*‖_2 ≤ 2p^{3/2}δ', and since x*_i ≥ qε, we get the entry-wise bound\n\n|x̄^T_i − x*_i| ≤ 2p^{3/2}δ' x*_i/(qε) = 4δp^{7/2} x*_i/(qε^2)  for each i ∈ V.\n\nHence x̄^T is a 4δp^{7/2}/(qε^2)-approximation for x*.\n\n5 Proof of Proposition 4.6\n\nIn Subsection 5.1 we prove estimates on the parameters θ corresponding to μ close to the boundary of M_1, and then in Subsection 5.2 we use these estimates to show that the boundary of M_1 has a certain repulsive property that keeps the iterates inside.\n\n5.1 Bounds on gradient\n\nWe start by introducing some helpful notation. For a node i, let N(i) = {j ∈ [p] : (i, j) ∈ E} denote its neighbors. We partition the collection of independent set vectors as\n\nI = S_i ∪ S_i^− ∪ S_i^⊘,\n\nwhere\n\nS_i = {σ ∈ I : σ_i = 1} = {Ind sets containing i}\nS_i^− = {σ − e_i : σ ∈ S_i} = {Ind sets where i can be added}\nS_i^⊘ = {σ ∈ I : σ_j = 1 for some j ∈ N(i)} = {Ind sets conflicting with i}.\n\nFor a collection of independent set vectors S ⊆ I we write P(S) as shorthand for P_θ(σ ∈ S) and\n\nf(S) = P(S)·e^{Φ(θ)} = ∑_{σ∈S} exp( ∑_{j∈V} θ_j σ_j ).\n\nWe can then write the marginal at node i as μ_i = P(S_i), and since S_i, S_i^−, S_i^⊘ partition I, the space of all independent sets of G, 1 = P(S_i) + P(S_i^−) + P(S_i^⊘). For each i let\n\nν_i = P(S_i^⊘) = P(a neighbor of i is in σ).\n\nThe following lemma specifies a condition on μ_i and ν_i that implies a lower bound on θ_i.\nLemma 5.1. If μ_i + ν_i ≥ 1 − δ and ν_i ≤ 1 − ζδ for ζ > 1, then θ_i ≥ ln(ζ − 1).\nProof. Let α = e^{θ_i}, and observe that f(S_i) = αf(S_i^−). 
We want to show that α ≥ ζ − 1. The first condition μ_i + ν_i ≥ 1 − δ implies that\n\nf(S_i) + f(S_i^⊘) ≥ (1 − δ)(f(S_i) + f(S_i^⊘) + f(S_i^−)) = (1 − δ)(f(S_i) + f(S_i^⊘) + α^{−1}f(S_i)),\n\nand rearranging gives\n\nf(S_i^⊘) + f(S_i) ≥ ((1 − δ)/δ) α^{−1} f(S_i).   (5.1)\n\nThe second condition ν_i ≤ 1 − ζδ reads f(S_i^⊘) ≤ (1 − ζδ)(f(S_i) + f(S_i^⊘) + f(S_i^−)) or\n\nf(S_i^⊘) ≤ ((1 − ζδ)/(ζδ)) f(S_i)(1 + α^{−1}).   (5.2)\n\nCombining (5.1) and (5.2) and simplifying results in α ≥ ζ − 1.\nWe now use the preceding lemma to show that if a coordinate is close to the boundary of the shrunken marginal polytope M_1, then the corresponding parameter is large.\nLemma 5.2. Let r be a positive real number. If μ ∈ M_1 and μ + rε·e_i ∉ M, then θ_i ≥ ln(q/r − 1).\nProof. We would like to apply Lemma 5.1 with ζ = q/r and δ = rε, which requires showing that (a) ν_i ≤ 1 − qε and (b) μ_i + ν_i ≥ 1 − rε. To show (a), note that if μ ∈ M_1, then μ_i ≥ qε by definition of M_1. It follows that ν_i ≤ 1 − μ_i ≤ 1 − qε.\nWe now show (b). Since μ_i = P(S_i), ν_i = P(S_i^⊘), and 1 = P(S_i) + P(S_i^⊘) + P(S_i^−), (b) is equivalent to P(S_i^−) ≤ rε. We assume that μ + rε·e_i ∉ M and suppose for the sake of contradiction that P(S_i^−) > rε. Writing η_σ = P(σ) for σ ∈ I, so that μ = ∑_{σ∈I} η_σ·σ, we define a new probability measure\n\nη'_σ = η_σ + η_{σ−e_i} if σ ∈ S_i;  η'_σ = 0 if σ ∈ S_i^−;  η'_σ = η_σ otherwise.\n\nOne can check that μ' = ∑_{σ∈I} η'_σ·σ has μ'_j = μ_j for each j ≠ i and μ'_i = μ_i + P(S_i^−) > μ_i + rε. The point μ', being a convex combination of independent set vectors, must be in M, and hence so must μ + rε·e_i. 
But this contradicts the hypothesis and completes the proof of the lemma.\nThe next two lemmas, whose proofs are similar in spirit to that of Lemma 8 in [23], are proved in the Appendix. The first lemma gives an upper bound on the parameters (θ_i)_{i∈V} corresponding to an arbitrary point in M_1.\nLemma 5.3. If μ + ε·e_i ∈ M, then θ_i ≤ p/ε. Hence if μ ∈ M_1, then θ_i ≤ p/ε for all i.\nThe next lemma shows that if a component μ_i is not too small, the corresponding parameter θ_i is also not too negative. As before, this allows us to bound from below the parameters corresponding to an arbitrary point in M_1.\nLemma 5.4. If μ_i ≥ qε, then θ_i ≥ −p/(qε). Hence if μ ∈ M_1, then θ_i ≥ −p/(qε) for all i.\n\n5.2 Finishing the proof of Proposition 4.6\n\nWe sketch the remainder of the proof here; full detail is given in Section D of the Supplement.\nStarting with an arbitrary x^t in M_1, our goal is to show that x^{t+1} = P(x^t − sθ̂(x^t)) remains in M_1. The proof will then follow by induction, because our initial point x^1 is in M_1 by the hypothesis.\nThe argument considers separately each hyperplane constraint for M of the form ⟨h, x⟩ ≤ 1. The distance of x from the hyperplane is 1 − ⟨h, x⟩. Now, the definition of M_1 implies that if x ∈ M_1, then x + ε·e_i ∈ M for all coordinates i, and thus 1 − ⟨h, x⟩ ≥ ε‖h‖_∞ for all constraints. We call a constraint ⟨h, x⟩ ≤ 1 critical if 1 − ⟨h, x⟩ < ε‖h‖_∞, and active if ε‖h‖_∞ ≤ 1 − ⟨h, x⟩ < 2ε‖h‖_∞. For x^t ∈ M_1 there are no critical constraints, but there may be active constraints.\nWe first show that inactive constraints can at worst become active for the next iterate x^{t+1}, which requires only that the step size is not too large relative to the magnitude of the gradient (Lemma 4.3 gives the desired bound). 
Then we show (using the gradient estimates from Lemmas 5.2, 5.3, and 5.4) that the active constraints have a repulsive property and that x^{t+1} is no closer than x^t to any active constraint, that is, ⟨h, x^{t+1}⟩ ≤ ⟨h, x^t⟩. The argument requires care, because the projection P may prevent coordinates i from decreasing despite x^t_i − sθ̂_i(x^t) being very negative if x^t_i is already small. These arguments together show that x^{t+1} remains in M_1, completing the proof.\n\n6 Discussion\n\nThis paper addresses the computational tractability of parameter estimation for the hard-core model. Our main result shows hardness of approximating the backward mapping μ ↦ θ to within a small polynomial factor. This is a fairly stringent form of approximation, and it would be interesting to strengthen the result to show hardness even for a weaker form of approximation. A possible goal would be to show that there exists a universal constant c > 0 such that approximation of the backward mapping to within a factor 1 + c in each coordinate is NP-hard.\n\nAcknowledgments\nGB thanks Sahand Negahban for helpful discussions. Also we thank Andrea Montanari for sharing his unpublished manuscript [9]. This work was supported in part by NSF grants CMMI-1335155 and CNS-1161964, and by Army Research Office MURI Award W911NF-11-1-0036.\n\nReferences\n[1] M. Wainwright and M. Jordan, “Graphical models, exponential families, and variational inference,” Foundations and Trends in Machine Learning, vol. 1, no. 1-2, pp. 1–305, 2008.\n[2] M. Luby and E. Vigoda, “Fast convergence of the Glauber dynamics for sampling independent sets,” Random Structures and Algorithms, vol. 15, no. 3-4, pp. 229–241, 1999.\n[3] A. Sly and N. Sun, “The computational hardness of counting in two-spin models on d-regular graphs,” in FOCS, pp. 361–369, IEEE, 2012.\n[4] A. Galanis, D. Stefankovic, and E. 
Vigoda, \u201cInapproximability of the partition function for the antiferromagnetic Ising and hard-core models,\u201d arXiv preprint arXiv:1203.2226, 2012.\n[5] A. Bogdanov, E. Mossel, and S. Vadhan, \u201cThe complexity of distinguishing Markov random fields,\u201d Approximation, Randomization and Combinatorial Optimization, pp. 331\u2013342, 2008.\n[6] D. Karger and N. Srebro, \u201cLearning Markov networks: Maximum bounded tree-width graphs,\u201d in Symposium on Discrete Algorithms (SODA), pp. 392\u2013401, 2001.\n[7] D. Shah, D. N. Tse, and J. N. Tsitsiklis, \u201cHardness of low delay network scheduling,\u201d Information Theory, IEEE Transactions on, vol. 57, no. 12, pp. 7810\u20137817, 2011.\n[8] T. Roughgarden and M. Kearns, \u201cMarginals-to-models reducibility,\u201d in Advances in Neural Information Processing Systems, pp. 1043\u20131051, 2013.\n[9] A. Montanari, \u201cComputational implications of reducing data to sufficient statistics,\u201d unpublished, 2014.\n[10] M. Deza and M. Laurent, Geometry of cuts and metrics. Springer, 1997.\n[11] G. M. Ziegler, \u201cLectures on 0/1-polytopes,\u201d in Polytopes\u2014combinatorics and computation, pp. 1\u201341, Springer, 2000.\n[12] C. H. Papadimitriou, Computational complexity. John Wiley and Sons Ltd., 2003.\n[13] R. T. Rockafellar, Convex analysis, vol. 28. Princeton University Press, 1997.\n[14] D. Weitz, \u201cCounting independent sets up to the tree threshold,\u201d in Proceedings of the thirty-eighth annual ACM symposium on Theory of computing, pp. 140\u2013149, ACM, 2006.\n[15] M. Dyer, A. Frieze, and M. Jerrum, \u201cOn counting independent sets in sparse graphs,\u201d SIAM Journal on Computing, vol. 31, no. 5, pp. 1527\u20131541, 2002.\n[16] A. Sly, \u201cComputational transition at the uniqueness threshold,\u201d in FOCS, pp. 287\u2013296, 2010.\n[17] F. Jaeger, D. Vertigan, and D. 
Welsh, \u201cOn the computational complexity of the Jones and Tutte polynomials,\u201d Math. Proc. Cambridge Philos. Soc., vol. 108, no. 1, pp. 35\u201353, 1990.\n[18] M. Jerrum and A. Sinclair, \u201cPolynomial-time approximation algorithms for the Ising model,\u201d SIAM Journal on Computing, vol. 22, no. 5, pp. 1087\u20131116, 1993.\n[19] S. Istrail, \u201cStatistical mechanics, three-dimensionality and NP-completeness: I. Universality of intractability for the partition function of the Ising model across non-planar surfaces,\u201d in STOC, pp. 87\u201396, ACM, 2000.\n[20] M. R. Jerrum, L. G. Valiant, and V. V. Vazirani, \u201cRandom generation of combinatorial structures from a uniform distribution,\u201d Theoretical Computer Science, vol. 43, pp. 169\u2013188, 1986.\n[21] Y. Nesterov, Introductory lectures on convex optimization: A basic course, vol. 87. Springer, 2004.\n[22] S. Bubeck, \u201cTheory of convex optimization for machine learning.\u201d Available at http://www.princeton.edu/~sbubeck/pub.html.\n[23] L. Jiang, D. Shah, J. Shin, and J. Walrand, \u201cDistributed random access algorithm: scheduling and congestion control,\u201d IEEE Trans. on Info. Theory, vol. 56, no. 12, pp. 6182\u20136207, 2010.\n[24] D. P. Bertsekas, Nonlinear programming. Athena Scientific, 1999.\n[25] S. M. Kakade, S. Shalev-Shwartz, and A. Tewari, \u201cRegularization techniques for learning with matrices,\u201d J. Mach. Learn. Res., vol. 13, pp. 1865\u20131890, June 2012.\n[26] J. M. Borwein and J. D. Vanderwerff, Convex functions: constructions, characterizations and counterexamples. No. 
109, Cambridge University Press, 2010.", "award": [], "sourceid": 633, "authors": [{"given_name": "Guy", "family_name": "Bresler", "institution": "Massachusetts Institute of Technology"}, {"given_name": "David", "family_name": "Gamarnik", "institution": "Massachusetts Institute of Technology"}, {"given_name": "Devavrat", "family_name": "Shah", "institution": "Massachusetts Institute of Technology"}]}