{"title": "Simple MAP Inference via Low-Rank Relaxations", "book": "Advances in Neural Information Processing Systems", "page_first": 3077, "page_last": 3085, "abstract": "We focus on the problem of maximum a posteriori (MAP) inference in Markov random fields with binary variables and pairwise interactions. For this common subclass of inference tasks, we consider low-rank relaxations that interpolate between the discrete problem and its full-rank semidefinite relaxation, followed by randomized rounding. We develop new theoretical bounds studying the effect of rank, showing that as the rank grows, the relaxed objective increases but saturates, and that the fraction in objective value retained by the rounded discrete solution decreases. In practice, we show two algorithms for optimizing the low-rank objectives which are simple to implement, enjoy ties to the underlying theory, and outperform existing approaches on benchmark MAP inference tasks.", "full_text": "Simple MAP Inference via Low-Rank Relaxations\n\nRoy Frostig*, Sida I. Wang*, Percy Liang, Christopher D. Manning\nComputer Science Department, Stanford University, Stanford, CA 94305\n{rf,sidaw,pliang}@cs.stanford.edu, manning@stanford.edu\n\nAbstract\n\nWe focus on the problem of maximum a posteriori (MAP) inference in Markov random fields with binary variables and pairwise interactions. For this common subclass of inference tasks, we consider low-rank relaxations that interpolate between the discrete problem and its full-rank semidefinite relaxation. We develop new theoretical bounds studying the effect of rank, showing that as the rank grows, the relaxed objective increases but saturates, and that the fraction in objective value retained by the rounded discrete solution decreases. 
In practice, we show two algorithms for optimizing the low-rank objectives which are simple to implement, enjoy ties to the underlying theory, and outperform existing approaches on benchmark MAP inference tasks.\n\n1 Introduction\n\nMaximum a posteriori (MAP) inference in Markov random fields (MRFs) is an important problem with abundant applications in computer vision [1], computational biology [2], natural language processing [3], and others. To find MAP solutions, stochastic hill-climbing and mean-field inference are widely used in practice due to their speed and simplicity, but they do not admit any formal guarantees of optimality. Message passing algorithms based on relaxations of the marginal polytope [4] can offer guarantees (with respect to the relaxed objective), but require more complex bookkeeping. In this paper, we study algorithms based on low-rank SDP relaxations which are both remarkably simple and capable of guaranteeing solution quality.\nOur focus is on MAP in a restricted but common class of models, namely those over binary variables coupled by pairwise interactions. Here, MAP can be cast as optimizing a quadratic function over the vertices of the n-dimensional hypercube: max_{x ∈ {-1,1}^n} x^T A x. A standard optimization strategy is to relax this integer quadratic program (IQP) to a semidefinite program (SDP), and then round the relaxed solution to a discrete one achieving a constant-factor approximation to the IQP optimum [5, 6, 7]. In practice, the SDP can be solved efficiently using low-rank relaxations [8] of the form max_{X ∈ R^{n×k}} tr(X^T A X).\nThe first part of this paper is a theoretical study of the effect of the rank k on low-rank relaxations of the IQP. Previous work focused on either using SDPs to solve IQPs [5] or using low-rank relaxations to solve SDPs [8]. We instead consider the direct link between the low-rank problem and the IQP. 
We show that as k increases, the gap between the relaxed low-rank objective and the SDP shrinks, but vanishes as soon as k ≥ rank(A); our bound adapts to the problem A and can thereby be considerably better than the typical data-independent bound of O(√n) [9, 10]. We also show that the rounded objective shrinks in ratio relative to the low-rank objective, but at a steady rate of Θ(1/k) on average. This result relies on the connection we establish between IQP and low-rank relaxations. In the end, our analysis motivates the use of relatively small values of k, which is advantageous from both a solution quality and algorithmic efficiency standpoint.\n\n*Authors contributed equally.\n\nThe second part of this paper explores the use of very low-rank relaxation and randomized rounding (R3) in practice. We use projected gradient and coordinate-wise ascent for solving the R3 relaxed problem (Section 4). We note that R3 interfaces with the underlying problem in an extremely simple way, much like Gibbs sampling and mean-field: only a black-box implementation of x ↦ Ax is required. This decoupling permits users to customize their implementation based on the structure of the weight matrix A: using GPUs for dense A, lists for sparse A, or much faster specialized algorithms for A that are Gaussian filters [11]. In contrast, belief propagation and marginal polytope relaxations [2] need to track messages for each edge or higher-order clique, thereby requiring more memory and a finer-grained interface to the MRF that inhibits flexibility and performance.\nFinally, we introduce a comparison framework for algorithms via the x ↦ Ax interface, and use it to compare R3 with annealed Gibbs sampling and mean-field on a range of different MAP inference tasks (Section 5). 
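As a concrete illustration of this black-box interface, here is a minimal sketch (our own illustration, not the authors' released implementation) of the R3 pipeline in Python/NumPy: projected gradient ascent on LRP_k, where the only access to the problem is a function computing V ↦ AV, followed by randomized rounding:

```python
import numpy as np

def solve_lrp(Av, n, k, iters=300, step=0.1, seed=0):
    """Projected gradient ascent on LRP_k: maximize tr(X^T A X)
    subject to each row of X lying on the unit sphere.
    `Av` is a black box computing A @ V for an n-by-k matrix V."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, k))
    X /= np.linalg.norm(X, axis=1, keepdims=True)    # project rows to the sphere
    for _ in range(iters):
        X += step * 2.0 * Av(X)                      # gradient of tr(X^T A X) is 2AX
        X /= np.linalg.norm(X, axis=1, keepdims=True)
    return X

def round_rrd(X, rng):
    """Randomized rounding: project each row onto a random direction g, take signs."""
    g = rng.standard_normal(X.shape[1])
    x = np.sign(X @ g)
    x[x == 0] = 1                                    # break (measure-zero) ties
    return x

rng = np.random.default_rng(1)
n, k = 30, 4
B = rng.standard_normal((n, n))
A = B + B.T                                          # a symmetric weight matrix
X = solve_lrp(lambda V: A @ V, n, k)                 # relax
x = round_rrd(X, rng)                                # round back into {-1, +1}^n
```

Because the solver touches the problem only through `Av`, any backend for the product (dense, sparse, GPU-based, or filter-based) can be swapped in without changing the optimization loop.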
We found that R3 often achieves the best-scoring results, and we provide some intuition for our advantage in Section 4.1.\n\n2 Setup and background\n\nNotation. We write S_n for the set of symmetric n × n real matrices and S^k for the unit sphere {x ∈ R^k : ||x||_2 = 1}. All vectors are columns unless stated otherwise. If X is a matrix, then X_i ∈ R^{1×k} is its i'th row.\nThis section reviews how MAP inference on binary graphical models with pairwise interactions can be cast as integer quadratic programs (IQPs) and approximately solved via semidefinite relaxations and randomized rounding. Let us begin with the definition of an IQP:\nDefinition 2.1. Let A ∈ S_n be a symmetric n × n matrix. An (indefinite) integer quadratic program (IQP) is the following optimization problem:\n\n    max_{x ∈ {-1,1}^n} IQP(x) := x^T A x    (1)\n\nSolving (1) is NP-complete in general: the MAX-CUT problem immediately reduces to it [5]. With an eye towards tractability, consider a first candidate relaxation: max_{x ∈ [-1,1]^n} x^T A x. This relaxation is always tight in that the maxima of the relaxed objective and original objective (1) are equal.^1 Therefore it is just as hard to solve. Let us then replace each scalar x_i ∈ [-1,1] with a unit vector X_i ∈ R^k and define the following low-rank problem (LRP):\nDefinition 2.2. Let k ∈ {1, ..., n} and A ∈ S_n. Define the low-rank problem LRP_k by:\n\n    max_{X ∈ R^{n×k}} LRP_k(X) := tr(X^T A X)    (2)\n    subject to ||X_i||_2 = 1, i = 1, ..., n.\n\nNote that setting X_i = [x_i, 0, ..., 0] ∈ R^{1×k} recovers (1). More generally, we have a sequence of successively looser relaxations as k increases. What we get in return is tractability. The LRP_k objective generally yields a non-convex problem, but if we take k = n, the objective can be rewritten as tr(X^T A X) = tr(A X X^T) = tr(A S), where S is a positive semidefinite matrix with ones on the diagonal. 
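The remark in Definition 2.2 that X_i = [x_i, 0, ..., 0] recovers (1) is easy to verify numerically; a small sketch (ours, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 8, 3
B = rng.standard_normal((n, n))
A = B + B.T                      # a symmetric weight matrix A in S_n
x = rng.choice([-1.0, 1.0], size=n)

# Embed the discrete point x as an LRP_k-feasible X: X_i = [x_i, 0, ..., 0].
X = np.zeros((n, k))
X[:, 0] = x

iqp = x @ A @ x                  # IQP objective (1)
lrp = np.trace(X.T @ A @ X)      # LRP_k objective (2) at the embedded point
assert np.isclose(iqp, lrp)      # so LRP_k's optimum is at least IQP's
```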
The result is the classic SDP relaxation, which is convex:\n\n    max_{S ∈ S_n} SDP(S) := tr(A S)    (3)\n    subject to S ⪰ 0, diag(S) = 1\n\nAlthough convexity begets easy optimization in a theoretical sense, the number of variables in the SDP is quadratic in n. Thus for large SDPs, we actually return to the low-rank parameterization (2). Solving LRP_k via simple gradient methods works extremely well in practice and is partially justified by theoretical analyses in [8, 12].\n\n^1 Proof. WLOG, A ⪰ 0, because adding to its diagonal merely adds a constant term to the IQP objective. The objective is then a convex function, as we can factor A = L L^T and write x^T L L^T x = ||L^T x||_2^2, so it must be maximized over its convex polytope domain at a vertex point.\n\nTo complete the picture, we need to convert the relaxed solutions X ∈ R^{n×k} into integral solutions x ∈ {-1,1}^n of the original IQP (1). This can be done as follows: draw a vector g ∈ R^k on the unit sphere uniformly at random, project each X_i onto g, and take the sign. Formally, we write x = rrd(X) to mean x_i = sign(X_i · g) for i = 1, ..., n. This randomized rounding procedure was pioneered by [5] to give the celebrated 0.878-approximation of MAX-CUT.\n\n3 Understanding the relaxation-rounding tradeoff\n\nThe overall IQP strategy is to first relax the integer problem domain, then round back into it. The optimal objective increases in relaxation, but decreases in randomized rounding. How do these effects compound? To guide our choice of relaxation, we analyze the effect that the rank k in (2) has on the approximation ratio of rounded versus optimal IQP solutions.\nMore formally, let x*, X*, and S* 
denote global optima of IQP, of LRP_k, and of SDP, respectively. We can decompose the approximation ratio as follows:\n\n    E[IQP(rrd(X*))] / IQP(x*) = (SDP(S*) / IQP(x*)) × (LRP_k(X*) / SDP(S*)) × (E[IQP(rrd(X*))] / LRP_k(X*))    (4)\n\nwhere the left-hand side is the approximation ratio, the first factor on the right is a constant ≥ 1, the second is the tightening ratio T(k), and the third is the rounding ratio R(k).\nAs k increases from 1, the tightening ratio T(k) increases towards 1 and the rounding ratio R(k) decreases from 1. In this section, we lower-bound T and R each in turn, thus lower-bounding the approximation ratio as a function of k. Specifically, we show that T(k) reaches 1 at small k and that R(k) falls as 2/π + Θ(1/k).\nIn practice, one cannot find X* for general k with guaranteed efficiency (if we could, we would simply use LRP_1 to directly solve the original IQP). However, Section 5 shows empirically that simple procedures solve LRP_k well for even small k.\n\n3.1 The tightening ratio T(k) increases\n\nWe now show that, under the assumption of A ⪰ 0, the tightening ratio T(k) plateaus early and that it approaches this plateau steadily. Hence, provided k is beyond this saturation point, and large enough so that an LRP_k solver is practically capable of providing near-optimal solutions, there is no advantage in taking k larger.\nFirst, T(k) is steadily bounded below. The following is a result of [13] (which also gives insight into the theoretical worst-case hardness of optimizing LRP_k):\nTheorem 3.1 ([13]). Fix A ⪰ 0 and let S* be an optimal SDP solution. 
There is a randomized algorithm that, given S*, outputs X̃ feasible for LRP_k such that E_X̃[LRP_k(X̃)] ≥ γ(k) · SDP(S*), where\n\n    γ(k) := (2/k) (Γ((k+1)/2) / Γ(k/2))^2 = 1 - 1/(2k) + o(1/k)    (5)\n\nFor example, γ(1) = 2/π ≈ 0.6366, γ(2) ≈ 0.7854, γ(3) ≈ 0.8488, γ(4) ≈ 0.8836, γ(5) ≈ 0.9054.^2\nBy optimality of X*, LRP_k(X*) ≥ E_X̃[LRP_k(X̃)] under any probability distribution, so the existence of the algorithm in Theorem 3.1 implies that T(k) ≥ γ(k).\nMoreover, T(k) achieves its maximum of 1 at small k, and hence must strictly exceed the γ(k) lower bound early on. We can arrive at this fact by bounding the rank of the SDP-optimal solution S*. This is because S* factors into S* = X X^T, where X is in R^{n × rank S*} and must be optimal since LRP_{rank S*}(X) = SDP(S*). Without consideration of A, the following theorem uniformly bounds this rank at well below n. The theorem was established independently by [9] and [10]:\nTheorem 3.2 ([9, 10]). Fix a weight matrix A. There exists an optimal solution S* to SDP (3) such that rank S* 
≤ √(2n).\n\n^2 The function γ(k) generalizes the constant approximation factor 2/π = γ(1) with regard to the implications of the unique games conjecture: the authors show that no polynomial-time algorithm can, in general, approximate LRP_k to a factor greater than γ(k), assuming P ≠ NP and the UGC.\n\n[Figure 1: Plots of quantities analyzed in Section 3, under A ∈ R^{100×100} whose entries are sampled independently from a unit Gaussian. (a) R(k) (blue) is close to its 2/(π γ(k)) lower bound (red) across small k. (b) The empirical tightening ratio T̃(k) (blue) clears its lower bound γ(k) (red) and hits its ceiling of 1 at k = 4. (c) Rounded objective values vs. k: optimal SDP (cyan), best IQP rounding (green), and mean IQP rounding ± one standard deviation (black). For this instance, the empirical post-rounding objectives are shown at the right for completeness.]\n\nHence we know already that the tightening ratio T(k) equals 1 by the time k reaches √(2n).\nTaking A into consideration, we can identify a class of problem instances for which T(k) actually saturates at even smaller k. This result is especially useful when the rank of the weight matrix A is known, or even under one's control, while modeling the underlying optimization task:\nTheorem 3.3. If A is symmetric, there is an optimal SDP solution S* such that rank S* ≤ rank A.\nA complete proof is in Appendix A.1. 
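For intuition about Theorem 3.3, consider a hand-checkable rank-1 instance (our illustration, not an example from the paper): when A = vv^T ⪰ 0, any feasible S with diag(S) = 1 and S ⪰ 0 has |S_ij| ≤ 1, so tr(AS) = v^T S v ≤ (Σ_i |v_i|)^2, and S* = ss^T with s = sign(v) attains this bound. That is an optimal SDP solution of rank 1 = rank A, at which SDP and IQP coincide:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
v = rng.standard_normal(n)
A = np.outer(v, v)                        # rank-1 PSD weight matrix A = v v^T

s = np.sign(v)                            # discrete point realizing the bound
upper = np.sum(np.abs(v)) ** 2            # SDP upper bound (sum_i |v_i|)^2
assert np.isclose(s @ A @ s, upper)       # IQP(s) meets the SDP optimum: no gap
assert np.linalg.matrix_rank(A) == 1
```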
Because adding to the diagonal of A is equivalent to merely adding a constant to the objective of all problems considered, Theorem 3.3 can be strengthened:\nCorollary 3.4. For any symmetric weight matrix A, there exists an optimal SDP solution S* such that rank S* ≤ min_{u ∈ R^n} rank(A + diag(u)).\nThat is, changes to the diagonal of A that reduce its rank may be applied to improve the bound.\nIn summary, T(k) grows at least as fast as γ(k), from T(k) ≥ 0.6366 at k = 1 to T(k) = 1 at k = min{√(2n), min_{u ∈ R^n} rank(A + diag(u))}. This is validated empirically in Figure 1b.\n\n3.2 The rounding ratio R(k) decreases\n\nAs the dimension k of the row vectors X_i in the LRP_k problem grows, the rounding procedure incurs a larger expected drop in objective value. Fortunately, we can bound this drop. Even more fortunately, the bound grows no faster than γ(k), exactly the steady lower bound for T(k). We obtain this result with an argument based on the analysis of [13]:\nTheorem 3.5. Fix a weight matrix A ⪰ 0 and any LRP_k-feasible X ∈ R^{n×k}. The rounding ratio for X is bounded below as\n\n    E[IQP(rrd(X))] / LRP_k(X) ≥ 2 / (π γ(k)) = (2/π) (1 + 1/(2k) + o(1/k))    (6)\n\nNote that X in the theorem need not be optimal; the bound applies to whatever solution an LRP_k solver might provide. The proof, given in Appendix A.1, uses Lemma 1 from [13], which is based on the theory of positive definite functions on spheres [14]. A decrease in R(k) that tracks the lower bound is observed empirically in Figure 1a.\nIn summary, considering only the steady bounds (Theorems 3.1 and 3.5), T will always rise opposite to R at least at the same rate. 
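The expectation on the left of (6) is straightforward to estimate by Monte Carlo. A small sketch (our illustration, not the paper's experiment) for a random PSD A and a random feasible X; the estimated ratio should land at or above 2/(π γ(k)):

```python
import numpy as np
from math import gamma, pi

def gamma_k(k):
    # Closed form (5): gamma(k) = (2/k) * (Gamma((k+1)/2) / Gamma(k/2))^2
    return (2.0 / k) * (gamma((k + 1) / 2) / gamma(k / 2)) ** 2

rng = np.random.default_rng(0)
n, k = 40, 3
L = rng.standard_normal((n, n))
A = L @ L.T                                      # A is PSD, as Theorem 3.5 requires
X = rng.standard_normal((n, k))
X /= np.linalg.norm(X, axis=1, keepdims=True)    # any LRP_k-feasible X will do

lrp = np.trace(X.T @ A @ X)
rounded = []
for _ in range(2000):
    g = rng.standard_normal(k)                   # rrd: random direction, then signs
    x = np.sign(X @ g)
    rounded.append(x @ A @ x)
ratio = np.mean(rounded) / lrp                   # estimates E[IQP(rrd(X))] / LRP_k(X)
bound = 2 / (pi * gamma_k(k))                    # about 0.75 for k = 3
```

The same `gamma_k` reproduces the values quoted after Theorem 3.1, e.g. gamma_k(1) = 2/π ≈ 0.6366 and gamma_k(2) ≈ 0.7854.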
Then, the added fact that T plateaus early (Theorem 3.2 and Corollary 3.4) means that T in fact rises even faster.\nIn practice, we would like to take k beyond 1, as we find that the first few relaxations give the optimizer an increasing advantage in arriving at a good LRP_k solution, close to X* in objective. The rapid rise of T relative to R just shown then justifies not taking k much larger, if at all.\n\n4 Pairwise MRFs, optimization, and inference alternatives\n\nHaving understood theoretically how IQP relates to low-rank relaxations, we now turn to MAP inference and empirical evaluation. We will show that the LRP_k objective can be optimized via a simple interface to the underlying MRF. This interface then becomes the basis for (a) a MAP inference algorithm based on very low-rank relaxations, and (b) a comparison to two other basic algorithms for MAP: Gibbs sampling and mean-field variational inference.\nA binary pairwise Markov random field (MRF) models a function h over x ∈ {0,1}^n given by h(x) = Σ_i θ_i(x_i) + Σ_{i