{"title": "Exact inference in structured prediction", "book": "Advances in Neural Information Processing Systems", "page_first": 3698, "page_last": 3707, "abstract": "Structured prediction can be thought of as a simultaneous prediction of multiple labels. \nThis is often done by maximizing a score function on the space of labels, which decomposes as a sum of pairwise and unary potentials.\nThe above is naturally modeled with a graph, where edges and vertices are related to pairwise and unary potentials, respectively.\nWe consider the generative process proposed by Globerson et al. (2015) and apply it to general connected graphs.\nWe analyze the structural conditions of the graph that allow for the exact recovery of the labels.\nOur results show that exact recovery is possible and achievable in polynomial time for a large class of graphs.\nIn particular, we show that graphs that are bad expanders can be exactly recovered by adding small edge perturbations coming from the \\Erdos-\\Renyi model.\nFinally, as a byproduct of our analysis, we provide an extension of Cheeger's inequality.", "full_text": "Exact inference in structured prediction\n\nKevin Bello\n\nDepartment of Computer Science\n\nPurdue Univeristy\n\nWest Lafayette, IN 47906, USA\n\nkbellome@purdue.edu\n\nJean Honorio\n\nDepartment of Computer Science\n\nPurdue Univeristy\n\nWest Lafayette, IN 47906, USA\n\njhonorio@purdue.edu\n\nAbstract\n\nStructured prediction can be thought of as a simultaneous prediction of multiple\nlabels. This is often done by maximizing a score function on the space of labels,\nwhich decomposes as a sum of pairwise and unary potentials. The above is naturally\nmodeled with a graph, where edges and vertices are related to pairwise and unary\npotentials, respectively. We consider the generative process proposed by Globerson\net al. (2015) and apply it to general connected graphs. We analyze the structural\nconditions of the graph that allow for the exact recovery of the labels. 
Our results show that exact recovery is possible and achievable in polynomial time for a large class of graphs. In particular, we show that graphs that are bad expanders can be exactly recovered by adding small edge perturbations coming from the Erdős-Rényi model. Finally, as a byproduct of our analysis, we provide an extension of Cheeger's inequality.

1 Introduction

Throughout the years, structured prediction has been continuously used in multiple domains such as computer vision, natural language processing, and computational biology. Examples of structured prediction problems include dependency parsing, image segmentation, part-of-speech tagging, named entity recognition, and protein folding. In this setting, the input X is some observation, e.g., a social network, an image, a sentence. The output is a labeling y, e.g., an assignment of each individual of a social network to a cluster, an assignment of each pixel in the image to foreground or background, or the parse tree for the sentence. A common approach to structured prediction is to exploit local features to infer the global structure. For instance, one could include a feature that encourages two individuals of a social network to be assigned to different clusters whenever there is a strong disagreement in opinions about a particular subject. Then, one can define a posterior distribution over the set of possible labelings conditioned on the input. Some classical methods for learning the parameters of the model are conditional random fields (Lafferty et al. 2001) and structured support vector machines (Taskar et al. 2003, Tsochantaridis et al. 2005, Altun & Hofmann 2003).
In this work we will focus on the inference problem and assume that the model parameters have already been learned.

In the context of Markov random fields (MRFs), for an undirected graph G = (V, E), one is interested in finding a solution to the following inference problem:

\max_{y \in \mathcal{M}^{|V|}} \sum_{v \in V,\, m \in \mathcal{M}} c_v(m)\, 1[y_v = m] + \sum_{(u,v) \in E,\, m,n \in \mathcal{M}} c_{u,v}(m,n)\, 1[y_u = m, y_v = n],    (1)

where M is the set of possible labels, c_v(m) is the cost of assigning label m to node v, and c_{u,v}(m, n) is the cost of assigning labels m and n to the neighboring nodes u and v, respectively.¹ Similar inference problems arise in the context of statistical physics, sociology, community detection, average case analysis, and graph partitioning. Very few cases of the general MRF inference problem are known to be exactly solvable in polynomial time. For example, Chandrasekaran et al. (2008) showed that (1) can be solved exactly in polynomial time for a graph G with low treewidth via the junction tree algorithm. In the case of Ising models, Schraudolph & Kamenetsky (2009) showed that the inference problem can also be solved exactly in polynomial time for planar graphs via perfect matchings. Finally, polynomial-time solvability can also stem from properties of the pairwise potential; under this view, the inference problem can be solved exactly in polynomial time via graph cuts for binary labels and submodular pairwise potentials (Boykov & Veksler 2006).

¹ In the literature, the cost functions c_v and c_{u,v} are also known as unary and pairwise potentials, respectively.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Despite the intractability of maximum likelihood estimation, maximum a-posteriori estimation, and marginal inference for most models in the worst case, the inference task seems to be easier in practice than the theoretical worst case.
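To make objective (1) concrete, here is a minimal brute-force sketch. The graph, label set, and potential functions below are invented purely for illustration, and the exhaustive search is exponential in |V|, so it is feasible only on a toy instance; it is not one of the polynomial-time methods discussed above.

```python
import itertools

# Brute-force maximization of objective (1) on a tiny toy graph.
# Nodes, edges, labels, and potentials are hypothetical.
nodes = [0, 1, 2]
edges = [(0, 1), (1, 2)]
labels = [0, 1]                      # the label set M

def c_node(v, m):
    # Unary potential c_v(m): node v prefers label v % 2.
    return 1.0 if m == v % 2 else 0.0

def c_edge(u, v, m, n):
    # Pairwise potential c_{u,v}(m, n): reward agreement along edges.
    return 0.5 if m == n else 0.0

def score(y):
    # The two sums of objective (1): unary plus pairwise terms.
    s = sum(c_node(v, y[v]) for v in nodes)
    s += sum(c_edge(u, v, y[u], y[v]) for (u, v) in edges)
    return s

best = max(itertools.product(labels, repeat=len(nodes)), key=score)
print(best, score(best))
```

Note the tension the example makes visible: the unary terms pull toward each node's preferred label while the pairwise terms reward agreement, and the maximizer trades the two off globally.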
Approximate inference algorithms can be extremely effective, often obtaining state-of-the-art results for these structured prediction tasks. Some important theoretical and empirical works on approximate inference include (Foster et al. 2018, Globerson et al. 2015, Kulesza & Pereira 2007, Sontag et al. 2012, Koo et al. 2010, Daumé et al. 2009).

In particular, Globerson et al. (2015) analyze the hardness of approximate inference in the case where performance is measured through the Hamming error, and provide conditions for the minimum achievable Hamming error by studying a generative model. Similar to the objective (1), the authors in (Globerson et al. 2015) consider unary and pairwise noisy observations. As a concrete example (Foster et al. 2018), consider the problem of trying to recover opinions of individuals in social networks. Suppose that every individual in a social network can hold one of two opinions, labeled −1 or +1. One observes a measurement of whether neighbors in the network agree in opinion, but the value of each measurement is flipped with probability p (pairwise observations). Additionally, one receives estimates of the opinion of each individual, perhaps using a classification model on their profile, but these estimates are corrupted with probability q (unary observations). Foster et al. (2018) generalize the work of Globerson et al. (2015), who provide results for grid lattices, by providing results for trees and general graphs that allow tree decompositions (e.g., hypergrids and ring lattices).

Note that the above problem is challenging since there is a statistical and computational trade-off, as in several machine learning problems. The statistical part focuses on giving highly accurate labels while ignoring computational constraints.
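The social-network observation model just described can be simulated in a few lines. This is only an illustrative sketch: the path graph, the random seed, and the values of p and q are invented, not taken from the paper.

```python
import random

# Simulate the noisy-observation model: each pairwise measurement is
# sign-flipped with probability p, each unary estimate with probability q.
random.seed(0)
n, p, q = 6, 0.1, 0.2
edges = [(i, i + 1) for i in range(n - 1)]        # a path graph, for illustration
y_true = [random.choice([-1, +1]) for _ in range(n)]

# Pairwise observations: X[(u, v)] = y*_u * y*_v, flipped w.p. p.
X = {}
for (u, v) in edges:
    flip = -1 if random.random() < p else +1
    X[(u, v)] = y_true[u] * y_true[v] * flip

# Unary observations: c_u = y*_u, flipped w.p. q.
c = [y * (-1 if random.random() < q else +1) for y in y_true]

good_edges = sum(X[(u, v)] == y_true[u] * y_true[v] for (u, v) in edges)
print(f"{good_edges}/{len(edges)} edge observations are uncorrupted")
```

The recovery question studied below is: given only X and c (never y_true), when can the ground-truth labeling be reconstructed exactly?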
In practice this is unrealistic: one cannot afford to wait a long time for each prediction, which has motivated several studies of this trade-off (e.g., Chandrasekaran & Jordan (2013), Bello & Honorio (2018)).

However, while the statistical and computational trade-off appears in general, an interesting question is whether there are conditions under which recovery of the true labels is achievable in polynomial time. That is, conditions under which the Hamming error of the prediction is zero and can be obtained efficiently. The present work addresses this question. In contrast to (Globerson et al. 2015, Foster et al. 2018), we study sufficient conditions for exact recovery in polynomial time, and provide high-probability results for general families of undirected connected graphs, which, to the best of our knowledge, is a novel result. In particular, we show that weak-expander graphs (e.g., grids) can be exactly recovered by adding small perturbations (edges coming from the Erdős-Rényi model with small probability). Also, as a byproduct of our analysis, we provide an extension of Cheeger's inequality (Cheeger 1969). Finally, another work in this line was done by Chen et al. (2016), where the authors consider exact recovery for edges on sparse graphs such as grids and rings. However, (Chen et al. 2016) consider the case where one has multiple i.i.d. observations of edge labels. In contrast, we focus on the case where there is a single (noisy) observation of each edge and node in the graph.

2 Notation and Problem Formulation

This section introduces the notation used throughout the paper and formally defines the problem under analysis.

Vectors and matrices are denoted by lowercase and uppercase bold faced letters respectively (e.g., a, A), while scalars are in normal font weight (e.g., a). Moreover, random variables are written in upright shape (e.g., a, A).
For a random vector a and a random matrix A, their entries are denoted by a_i and A_{i,j} respectively. Indexing starts at 1, with A_{i,:} and A_{:,i} indicating the i-th row and i-th column of A respectively. Finally, sets and tuples are both expressed in uppercase calligraphic fonts and shall be distinguished by the context. For example, R will denote the set of real numbers.

We now present the inference task. We consider a similar problem setting to the one in (Globerson et al. 2015), with the only difference that we consider general undirected graphs. That is, the goal is to predict a vector of n node labels y = (y_1, . . . , y_n)^⊤, where y_i ∈ {+1, −1}, from a set of observations X and c, where X and c correspond to corrupted measurements of edges and nodes respectively. These observations are assumed to be generated from a ground truth labeling y* by a generative process defined via an undirected connected graph G = (V, E), an edge noise p ∈ (0, 0.5), and a node noise q ∈ (0, 0.5). For each edge (u, v) ∈ E, the edge observation X_{u,v} is independently sampled to be y*_u y*_v (good edge) with probability 1 − p, and −y*_u y*_v (bad edge) with probability p. For each edge (u, v) ∉ E, the observation X_{u,v} is always 0. Similarly, for each node u ∈ V, the node observation c_u is independently sampled to be y*_u (good node) with probability 1 − q, and −y*_u (bad node) with probability q. Thus, we have a known undirected connected graph G, an unknown ground truth label vector y* ∈ {+1, −1}^n, and noisy observations X ∈ {−1, 0, +1}^{n×n} and c ∈ {−1, +1}^n, and our goal is to predict a vector label y ∈ {−1, +1}^n.

Definition 1 (Biased Rademacher variable). Let z_p ∈ {+1, −1} be such that P(z_p = +1) = 1 − p, and P(z_p = −1) = p.
We call z_p a biased Rademacher random variable with parameter p and expected value 1 − 2p.

From the definition above, we can write the edge observations as X_{u,v} = y*_u y*_v z_p^{(u,v)} 1[(u, v) ∈ E], where z_p^{(u,v)} is a biased Rademacher variable with parameter p. The node observation is c_u = y*_u z_q^{(u)}, where z_q^{(u)} is a biased Rademacher variable with parameter q.

Given the generative process, we aim to solve the following optimization problem, which is based on the maximum likelihood estimator that returns the label arg max_y P(X, y) (see Globerson et al. (2015)):

\max_y \; \tfrac{1}{2} y^⊤ X y + \alpha c^⊤ y \quad \text{subject to } y_i = \pm 1,    (2)

where α = log((1 − q)/q) / log((1 − p)/p). In general, the above combinatorial problem is NP-hard to compute (see, e.g., (Barahona 1982) for results on grids). Our goal is to find what structural properties of the graph G suffice to achieve, with high probability, exact recovery in polynomial time.

3 On Exact Recovery of Labels

Our approach consists of two stages, similar in spirit to (Globerson et al. 2015). We first use only the quadratic term from (2), which will give us two possible solutions, and then, as a second stage, the linear term is used to decide the best between these two solutions.

3.1 First Stage

We analyze a semidefinite program (SDP) relaxation to the following combinatorial problem (3), motivated by the techniques in (Abbe et al. 2016).

\max_y \; \tfrac{1}{2} y^⊤ X y \quad \text{subject to } y_i = \pm 1.    (3)

We denote the degree of node i as ∆_i, and the maximum node degree as ∆_max = max_{i∈V} ∆_i.
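The objective (2) and its quadratic-only variant (3) can be made concrete with a brute-force sketch. This enumerates all 2^n labelings, so it is only a stand-in to clarify the score being maximized, not the polynomial-time SDP approach analyzed below; the 4-cycle, the noise-free observations, and the values of p and q are hypothetical.

```python
import itertools
import math

# Brute-force maximization of objective (2) on a tiny hypothetical instance.
n = 4
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]          # a 4-cycle
y_star = [+1, -1, -1, +1]
p, q = 0.1, 0.3
alpha = math.log((1 - q) / q) / math.log((1 - p) / p)

# Noise-free observations for clarity (every edge/node observation "good").
X = {(u, v): y_star[u] * y_star[v] for (u, v) in edges}
c = list(y_star)

def score(y):
    # (1/2) y^T X y over observed edges, plus alpha * c^T y.
    quad = 0.5 * sum(X[(u, v)] * y[u] * y[v] for (u, v) in edges)
    return quad + alpha * sum(ci * yi for ci, yi in zip(c, y))

y_hat = max(itertools.product([-1, +1], repeat=n), key=score)
print(list(y_hat) == y_star)   # noise-free observations recover y*
```

The quadratic term alone cannot distinguish y* from −y* (both give the same value), which is exactly why the linear term αc^⊤y is needed as a tie-breaker in the second stage.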
For any subset S ⊂ V, we denote its complement by S^C such that S ∪ S^C = V and S ∩ S^C = ∅. Furthermore, let E(S, S^C) = {(i, j) ∈ E | i ∈ S, j ∈ S^C or j ∈ S, i ∈ S^C}, i.e., |E(S, S^C)| denotes the number of edges between S and S^C.

Definition 2 (Edge Expansion). For a set S ⊂ V with |S| ≤ n/2, its edge expansion, φ_S, is defined as: φ_S = |E(S, S^C)| / |S|. Then, the edge expansion of a graph G = (V, E) is defined as: φ_G = min_{S⊂V, |S|≤n/2} φ_S.

In the literature, φ_G is also known as the Cheeger constant, due to the geometric analogue defined by Cheeger in (Cheeger 1969). Next, we define the Laplacian matrix of a graph and the Rayleigh quotient, which are also used throughout this section.

Definition 3 (Laplacian matrix). For a graph G = (V, E) of n nodes, the Laplacian matrix L is defined as L = D − A, where D is the degree matrix and A is the adjacency matrix.

Definition 4 (Rayleigh quotient). For a given symmetric matrix M ∈ R^{n×n} and a non-zero vector a ∈ R^n, the Rayleigh quotient R_M(a) is defined as: R_M(a) = (a^⊤ M a) / (a^⊤ a).

We now define a signed Laplacian matrix.

Definition 5 (Signed Laplacian matrix). For a graph G = (V, E) of n nodes, a signed Laplacian matrix M is a symmetric matrix that satisfies x^⊤ M x = Σ_{(i,j)∈E} (y_i x_i − y_j x_j)², where y is an eigenvector of M with eigenvalue 0, and y_i ∈ {+1, −1}.

Note that the typical Laplacian matrix, as in Definition 3, fulfills the conditions of Definition 5 with y_i = +1 for all i. Next, we present an intermediate result for later use.

Lemma 1. Let G = (V, E) be an undirected graph of n nodes with Laplacian L. Let M ∈ R^{n×n} be a signed Laplacian with eigenvector y as in Definition 5, and let a ∈ R^n be a vector such that ⟨y, a⟩ = 0.
Finally, let 1 ∈ R^n be a vector of ones. Then we have that, for a given δ ∈ R, R_L(a ∘ y + δ1) ≤ R_M(a), where the operator ∘ denotes the Hadamard product.

Proof. First, note that L has a 0 eigenvalue with corresponding eigenvector 1. Also, we have that x^⊤ L x = Σ_{(i,j)∈E} (x_i − x_j)² for any vector x. Then,

(a ∘ y + δ1)^⊤ L (a ∘ y + δ1) = Σ_{(i,j)∈E} ((y_i a_i + δ) − (y_j a_j + δ))² = Σ_{(i,j)∈E} (y_i a_i − y_j a_j)² = a^⊤ M a.

Therefore, the numerators of R_L(a ∘ y + δ1) and R_M(a) are equal. For the denominators, one can observe that:

(a ∘ y + δ1)^⊤ (a ∘ y + δ1) = (a ∘ y)^⊤ (a ∘ y) + 2δ⟨1, a ∘ y⟩ + δ² 1^⊤ 1 = Σ_i a_i² y_i² + 2δ⟨a, y⟩ + δ² n = a^⊤ a + δ² n ≥ a^⊤ a,

where we used y_i² = 1 and ⟨a, y⟩ = 0. This implies that R_L(a ∘ y + δ1) ≤ R_M(a).

In what follows, we present our first result, which has a connection to Cheeger's inequality (Cheeger 1969).

Theorem 1. Let G, M, L, y be defined as in Lemma 1, and let λ₁ ≤ λ₂ ≤ ··· ≤ λ_n be the eigenvalues of M. Then, we have that

φ_G² / (4∆_max) ≤ λ₂.

Proof. Since y is an eigenvector of M with eigenvalue 0, and M is a symmetric matrix, we can express λ₂ using the variational characterization of eigenvalues as follows:

λ₂ = min_{a ∈ R^n, a^⊤ y = 0} R_M(a),    (4)

where we used the fact that y is orthogonal to all the other eigenvectors, by the Spectral Theorem. Assume that a is the eigenvector associated with λ₂, i.e., we have that M a = λ₂ a and a^⊤ y = 0. Then, by Lemma 1, we have that:

R_L(a ∘ y + δ1) ≤ R_M(a) = λ₂.    (5)

Next, we choose δ ∈ R such that {a₁y₁ + δ, a₂y₂ + δ, . . . , a_n y_n + δ} has median 0. The reason for the zero median is to later ensure that the subset of vertices S has less than n/2 vertices. Let w = a ∘ y + δ1. From equation (5), we have that R_L(w) ≤ λ₂.

Let w⁺ = (w⁺_i)^⊤ be such that w⁺_i = w_i if w_i ≥ 0 and w⁺_i = 0 otherwise. Let w⁻ = (w⁻_i)^⊤ be such that w⁻_i = w_i if w_i ≤ 0 and w⁻_i = 0 otherwise. Then, we have that either R_L(w⁺) ≤ 2 R_L(w) or R_L(w⁻) ≤ 2 R_L(w). Suppose w.l.o.g. that R_L(w⁺) ≤ 2 R_L(w); then it follows that R_L(w⁺) ≤ 2λ₂.

Let us scale w⁺ by some constant β ∈ R so that {βw⁺₁, βw⁺₂, . . . , βw⁺_n} ⊆ [0, 1]. It is clear that R_L(w⁺) = R_L(βw⁺); therefore, we will still use w⁺ to denote the rescaled vector. That is, the entries of the vector w⁺ are now between 0 and 1.

Next, we will show that there exists a set S ⊂ V with |S| ≤ n/2 such that:

E[|E(S, S^C)|] / E[|S|] ≤ √(2 R_L(w⁺) ∆_max).

We construct the set S as follows. We choose t ∈ [0, 1] uniformly at random and let S = {i | (w⁺_i)² ≥ t}. Let B_{i,j} = 1 if i ∈ S and j ∈ S^C, or if j ∈ S and i ∈ S^C, and B_{i,j} = 0 otherwise. Then, E[|E(S, S^C)|] = E[Σ_{(i,j)∈E} B_{i,j}] = Σ_{(i,j)∈E} E[B_{i,j}] = Σ_{(i,j)∈E} P(t falls between (w⁺_i)² and (w⁺_j)²). Recall that (w⁺_i)² ∈ [0, 1]; therefore, the probability above is |(w⁺_i)² − (w⁺_j)²|. Thus,

E[|E(S, S^C)|] = Σ_{(i,j)∈E} |(w⁺_i)² − (w⁺_j)²| = Σ_{(i,j)∈E} |w⁺_i − w⁺_j| · |w⁺_i + w⁺_j|
≤ √( Σ_{(i,j)∈E} (w⁺_i − w⁺_j)² ) · √( Σ_{(i,j)∈E} (w⁺_i + w⁺_j)² )    (6)
≤ √( Σ_{(i,j)∈E} (w⁺_i − w⁺_j)² ) · √( Σ_{(i,j)∈E} 2((w⁺_i)² + (w⁺_j)²) )
≤ √( Σ_{(i,j)∈E} (w⁺_i − w⁺_j)² ) · √( 2∆_max Σ_i (w⁺_i)² ),    (7)

where eq. (6) is due to the Cauchy-Schwarz inequality and eq. (7) uses the maximum degree of a node as an upper bound.

Now consider another random variable b_i such that b_i = 1 if i ∈ S, and b_i = 0 otherwise. Therefore, we have that E[|S|] = E[Σ_i b_i] = Σ_i E[b_i] = Σ_i P(t ≤ (w⁺_i)²) = Σ_i (w⁺_i)².

Thus,

E[|E(S, S^C)|] / E[|S|] ≤ √(2∆_max) · √( Σ_{(i,j)∈E} (w⁺_i − w⁺_j)² / Σ_i (w⁺_i)² ) = √( 2 R_L(w⁺) ∆_max ) ≤ 2√( λ₂ ∆_max ).

The above implies that there exists some S such that |E(S, S^C)| / |S| ≤ 2√(λ₂ ∆_max). Therefore, φ_G ≤ 2√(λ₂ ∆_max), or equivalently φ_G² / (4∆_max) ≤ λ₂.

Remark 1. For a given undirected graph G, its Laplacian matrix L fulfills the conditions of Lemma 1 and Theorem 1. That is, if M = L in Theorem 1, then it becomes the known Cheeger's inequality. Therefore, our result in Theorem 1 applies to more general matrices and is of use for our next result.

We now provide the SDP relaxation of problem (3). Let Y = y y^⊤; we have that y^⊤ X y = Tr(XY) = ⟨X, Y⟩. Since our prediction is a column vector y, we have that y y^⊤ is rank-1 and symmetric, which implies that Y is a positive semidefinite matrix.
Therefore, our relaxation to the combinatorial problem (3) results in the following primal formulation²:

\max_Y \; ⟨X, Y⟩ \quad \text{subject to } Y_{ii} = 1, \; Y ⪰ 0.    (8)

We will make use of the following matrix concentration inequality for our main proof.

Lemma 2 (Matrix Bernstein inequality, Theorem 1.4 in (Tropp 2012)). Consider a finite sequence {N_k} of independent, random, self-adjoint matrices with dimension n. Assume that each random matrix satisfies E[N_k] = 0 and λ_max(N_k) ≤ R almost surely. Then, for all t ≥ 0,

P( λ_max(Σ_k N_k) ≥ t ) ≤ n · exp( −(t²/2) / (σ² + Rt/3) ),    where σ² = ‖Σ_k E[N_k²]‖.

The next theorem includes our main result and provides the conditions for exact recovery of labels with high probability.

Theorem 2. Let G = (V, E) be an undirected connected graph with n nodes, Cheeger constant φ_G, and maximum node degree ∆_max. Then, for the combinatorial problem (3), a solution y ∈ {y*, −y*} is achievable in polynomial time by solving the SDP relaxation (8), with probability at least 1 − ε₁(φ_G, ∆_max, p), where p is the edge noise from our model, and

ε₁(φ_G, ∆_max, p) = 2n · exp( −3(1 − 2p)² φ_G⁴ / (1536 ∆_max³ p(1 − p) + 32(1 − 2p)(1 − p) φ_G² ∆_max) ).

Proof. Without loss of generality assume that y = y*.
The first step of our proof corresponds to finding sufficient conditions for when Y = y y^⊤ is the unique optimal solution to SDP (8), for which we make use of the Karush-Kuhn-Tucker (KKT) optimality conditions (Boyd & Vandenberghe 2004). In the following we write the dual formulation of SDP (8):

\min_V \; Tr(V) \quad \text{subject to } V ⪰ X, \; V \text{ is diagonal.}    (9)

² Here we dropped the constant 1/2 since it does not change the decision problem.

Thus, we have that Y = y y^⊤ is guaranteed to be an optimal solution under the following conditions:

1. y y^⊤ is a feasible solution to the primal problem (8).
2. There exists a matrix V feasible for the dual formulation (9) such that Tr(X y y^⊤) = Tr(V).

The first point is trivially verified. For the second point, we assume strong duality in order to find a dual certificate. To achieve that, we set V_{i,i} = (XY)_{i,i}.³ If V − X ⪰ 0, then the matrix V is a feasible solution to the dual formulation. Thus, our first condition is to have V − X ⪰ 0, and we conclude that y y^⊤ is an optimal solution to SDP (8).

For showing that y y^⊤ is the unique optimal solution, it suffices to have λ₂(V − X) > 0. Suppose that Ŷ is another optimal solution to SDP (8). Then, from complementary slackness we have that ⟨V − X, Ŷ⟩ = 0, and from primal feasibility Ŷ ⪰ 0. Moreover, notice that we have (V − X) y = 0, i.e., y is an eigenvector of V − X with eigenvalue 0. By assumption, the second smallest eigenvalue of V − X is greater than 0; therefore, y spans all of its null space.
This fact, combined with complementary slackness, primal and dual feasibility, entails that Ŷ is a multiple of y y^⊤. Thus, we must have that Ŷ = y y^⊤ because Ŷ_{i,i} = 1.

From the points above we arrive at the two following sufficient conditions:

V − X ⪰ 0 and λ₂(V − X) > 0.    (10)

Our next step is to show when condition (10) is fulfilled with high probability. Since we have that y is an eigenvector of V − X with eigenvalue zero, showing that λ₂(V − X) > 0 will imply that V − X is positive semidefinite. Therefore, we focus on controlling its second smallest eigenvalue. Next, we have that:

λ₂(V − X) > 0 ⟺ λ₂(V − X − E[V − X] + E[V − X]) > 0
⟸ λ₁(V − E[V]) + λ₁(E[X] − X) + λ₂(E[V − X]) > 0.    (11)

We now focus on condition (11) since it implies that λ₂(V − X) > 0. For the first two summands of condition (11) we make use of Lemma 2, while for the third summand we make use of Theorem 1. From V_{i,i} = (XY)_{i,i}, we have that V_{i,i} = y_i X_{i,:} y; thus, V_{i,i} = Σ_{j=1}^n y_i y_j X_{i,j} = Σ_{j=1}^n z_p^{(i,j)} 1[(i, j) ∈ E]. Then, its expected value is E[V_{i,i}] = ∆_i (1 − 2p).

Bounding the third summand of condition (11). Our goal is to find a non-zero lower bound for the second smallest eigenvalue of E[V − X]. Notice that E[V − X] ⪰ 0 since it is a diagonally dominant matrix, and y is its first eigenvector with eigenvalue 0, i.e., λ₁(E[V − X]) = 0.

Then, we write M = E[V − X]. Now we focus on finding a lower bound for λ₂(M). We use the fact that for any vector a ∈ R^n, we have that a^⊤ M a = (1 − 2p) Σ_{(i,j)∈E} (y_i a_i − y_j a_j)².

We also note that M has a 0 eigenvalue with eigenvector y. Thus, the matrix M/(1 − 2p) satisfies the conditions of Theorem 1 and we have that λ₂(M/(1 − 2p)) ≥ φ_G² / (4∆_max). We conclude that,

λ₂(E[V − X]) ≥ (1 − 2p) φ_G² / (4∆_max).    (12)

Bounding the first summand of condition (11). Let N_p^{(i,j)} = z_p^{(i,j)} (e_i e_i^⊤ + e_j e_j^⊤), where e_i is the standard basis vector, i.e., the vector of all zeros except the i-th entry, which is 1. We can now write V = Σ_{(i,j)∈E} N_p^{(i,j)}. Then, we have a sequence of independent random matrices {E[N_p^{(i,j)}] − N_p^{(i,j)}}, for which we obtain the following: λ_max(E[N_p^{(i,j)}] − N_p^{(i,j)}) ≤ 2(1 − p), and also ‖Σ_{(i,j)∈E} E[(E[N_p^{(i,j)}] − N_p^{(i,j)})²]‖ ≤ 4∆_max p(1 − p).

Next, we use the fact that λ_max(A) = −λ₁(−A) for any matrix A. Then, by applying Lemma 2, we obtain:

P( λ₁(V − E[V]) ≤ −(1 − 2p) φ_G² / (8∆_max) ) ≤ n · exp( −3(1 − 2p)² φ_G⁴ / (1536 ∆_max³ p(1 − p) + 32(1 − 2p)(1 − p) φ_G² ∆_max) ).    (13)

³ Note that we now write V in upright shape (i.e., V) since it contains randomness from X.

Bounding the second summand of condition (11). Using similar arguments to the concentration above, we now analyze λ₁(E[X] − X). Let H^{(i,j)} = X_{i,j} (e_i e_j^⊤ + e_j e_i^⊤). Then, we have a sequence of independent random matrices {H^{(i,j)} − E[H^{(i,j)}]} and we can write X = Σ_{(i,j)∈E} H^{(i,j)}. Finally, we have that λ_max(H^{(i,j)} − E[H^{(i,j)}]) ≤ 2(1 − p), and E[(H^{(i,j)} − E[H^{(i,j)}])²] = 4p(1 − p)(e_i e_i^⊤ + e_j e_j^⊤). Thus, ‖Σ_{(i,j)∈E} E[(H^{(i,j)} − E[H^{(i,j)}])²]‖ ≤ 4∆_max p(1 − p), and by applying Lemma 2 we obtain:

P( λ₁(E[X] − X) ≤ −(1 − 2p) φ_G² / (8∆_max) ) ≤ n · exp( −3(1 − 2p)² φ_G⁴ / (1536 ∆_max³ p(1 − p) + 32(1 − 2p)(1 − p) φ_G² ∆_max) ).    (14)

Note that the thresholds in the concentrations above are motivated by equation (12). Finally, combining equations (12), (13), and (14), we have that:

P( λ₂(V − X) > 0 ) ≥ 1 − 2n · exp( −3(1 − 2p)² φ_G⁴ / (1536 ∆_max³ p(1 − p) + 32(1 − 2p)(1 − p) φ_G² ∆_max) ),

which concludes our proof.

Regarding the statistical part of Theorem 2, it is natural to ask under what conditions we obtain a high-probability statement. For example, one can observe that if φ_G²/∆_max ∈ Ω(n), then there is an exponential decay in the probability of error. Another example: if ∆_max ∈ O(√n) and φ_G²/∆_max ∈ Ω(√n), then we also obtain a high-probability argument. Thus, we are interested in finding what classes of graphs fulfill these or other structural properties so that we obtain a high-probability bound in Theorem 2. Regarding the computational complexity of exact recovery, from Theorem 2, we are solving an SDP, and any SDP can be solved in polynomial time using methods such as the interior point method.

3.2 Second Stage

After the first stage, we obtain two feasible solutions for problem (3), that is, y ∈ {y*, −y*}. To decide which solution is correct we will use the node observations c. Specifically, we will output the vector y that maximizes the score c^⊤ y. The next theorem formally states that, with high probability, y = y* maximizes the score c^⊤ y for a sufficiently large n.

Theorem 3. Let y ∈ {y*, −y*}.
Then, with probability at least 1 − ε₂(n, q), we have that c^⊤ y* = max_{y∈{y*,−y*}} c^⊤ y, where ε₂(n, q) = e^{−(n/2)(1−2q)²} and q is the node noise.

The remaining proofs of our manuscript can be found in Appendix A.

Remark 2. From Theorems 2 and 3, we obtain that exact recovery (i.e., y = y*) is achievable with probability at least 1 − ε₁(φ_G, ∆_max, p) − ε₂(n, q). Finally, from Theorem 3, it is clear that since the parameter q ∈ (0, 0.5), for a sufficiently large n we have an exponential decay of the probability of error ε₂. Thus, we focus on the conditions of the first stage and provide examples in the next section.

4 Examples of Graphs for Exact Recovery

In this section, we provide examples of classes of graphs that yield high probability in Theorem 2. Perhaps the most important example we provide in this section is related to smoothed analysis on connected graphs (Krivelevich et al. 2015). Consider any fixed graph G = (V, E) and let Ẽ be a random set of edges over the same set of vertices V, where each edge e ∈ Ẽ is independently drawn according to the Erdős-Rényi model with probability ε/n, and where ε is a small (fixed) positive constant. We denote this as Ẽ ∼ ER(n, ε/n); then let G̃ = (V, E ∪ Ẽ) denote the random graph with the edge set Ẽ added.

The model above can be considered a generalization of the classical Erdős-Rényi random graph, where one starts from an empty graph (i.e., G = (V, ∅)) and adds edges between all possible pairs of vertices independently with a given probability. The focus on "small" ε means that we are interested in the effect of a rather gentle random perturbation.
In particular, it is known that graphs with bad expansion are not suitable for exact inference (see, for instance, (Abbe et al. 2014)), but certain classes such as grids or planar graphs can yield good approximations under some regimes despite being bad expanders, as shown by Globerson et al. (2015). Here we consider the graph G to be a bad expander and show that with a small perturbation, exact inference is achievable.

The following result was presented by (Krivelevich et al. 2015) in an equivalent fashion.⁴

Lemma 3 (Theorem 2 in (Krivelevich et al. 2015)). Let G = (V, E) be a connected graph, choose Ẽ ∼ ER(n, ε/n), and let G̃ = (V, E ∪ Ẽ). Then, for every ε ∈ [1, n], we have that φ_G̃ ≥ ε / (256 + 256 log n), with probability at least 1 − n^{−2.2 − (log ε)/2}.

The above lemma allows us to lower bound the Cheeger constant of the random graph G̃ with high probability, and is of use for our first example.

Corollary 1. Let G = (V, E) be any connected graph, choose Ẽ ∼ ER(n, log⁸ n / n), let G̃ = (V, E ∪ Ẽ), and let ∆_max(G̃) be the maximum node degree of G̃. Then, we have that φ_G̃² / ∆_max(G̃) ∈ Ω(log⁵ n) and ∆_max(G̃) ∈ O(log⁹ n) with high probability. Therefore, exact recovery in polynomial time is achievable with high probability.

We emphasize the nice property of the random graphs G̃ shown in Corollary 1: by adding a small perturbation (edges from the Erdős-Rényi model with small probability), we are able to obtain exact inference despite G having bad properties such as being a bad expander. Our next two examples include complete graphs and d-regular expanders.
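The smoothed-analysis effect can be illustrated numerically: brute-force the edge expansion of a path graph (a bad expander), then of the same graph with a few random edges added. This sketch is only illustrative; the exhaustive search over subsets is exponential in n, so n must stay tiny, and the edge probability 0.15 is chosen for visibility on a small graph rather than matching the ε/n regime of Lemma 3.

```python
import itertools
import random

def cheeger(n, edges):
    # Brute-force edge expansion phi_G = min over S with |S| <= n/2 of
    # |E(S, S^C)| / |S|. Exponential in n; for tiny illustration only.
    eset = {frozenset(e) for e in edges}
    best = float("inf")
    for r in range(1, n // 2 + 1):
        for subset in itertools.combinations(range(n), r):
            S = set(subset)
            cut = sum(1 for e in eset if len(e & S) == 1)
            best = min(best, cut / len(S))
    return best

random.seed(1)
n = 10
path = [(i, i + 1) for i in range(n - 1)]          # a path: a bad expander
extra = [(i, j) for i in range(n) for j in range(i + 1, n)
         if random.random() < 0.15 and (i, j) not in path]

phi_before = cheeger(n, path)
phi_after = cheeger(n, path + extra)
print(phi_before, phi_after)
```

Since adding edges can only enlarge every cut, the perturbed graph's expansion is never smaller, and for a path it typically jumps well above the unperturbed value of 1/(n/2).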
The following corollary shows that, with high probability, exact recovery of labels for complete graphs is possible in polynomial time.
Corollary 2 (Complete graphs). Let G = K_n, where K_n denotes a complete graph of n nodes. Then, we have that φ²_G / Δ_max ∈ Ω(n). Therefore, exact recovery in polynomial time is achievable with high probability.

Another important class of graphs that admits exact recovery is the family of d-regular expanders (Hoory et al. 2006), which is defined below.
Definition 6 (d-regular expander). A d-regular graph with n nodes is an expander with constant c > 0 if, for every set S ⊂ V with |S| ≤ n/2, |E(S, S^C)| ≥ c · d · |S|.
Corollary 3 (Expander graphs). Let G be a d-regular expander with constant c. Then, we have that φ²_G / Δ_max ∈ Ω(d). If d ∈ Ω(log n), then exact recovery in polynomial time is achievable with high probability.

5 Concluding Remarks

We considered a model where we receive a single noisy observation for each edge and each node of a graph. Our approach consisted of two stages, similar in spirit to Globerson et al. (2015). The first stage solved solely the quadratic term of the optimization problem and was based on an SDP relaxation, in order to find the structural properties of a graph that guarantee exact recovery with high probability. Given the two solutions from the first stage, the second stage used solely the node observations and simply output the vector with the higher score. We showed that for any graph G, the term φ²_G / Δ_max determines whether exact recovery is achievable in polynomial time. Examples include complete graphs and d-regular expanders, which are guaranteed to recover the correct labeling with high probability.
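The second of the two stages admits a very short simulation. The sketch below (with hypothetical parameter values n = 200, q = 0.3, and a randomly drawn ground truth, none of which come from the paper) flips each node observation independently with probability q, applies the rule "output whichever of {y*, −y*} has the higher score c⊤y", and checks the empirical error rate against the Theorem 3 bound ε₂(n, q) = e^{−(n/2)(1−2q)²}:

```python
import numpy as np

rng = np.random.default_rng(0)

def second_stage(c, y_star):
    """Given the two candidates from the first stage, output whichever of
    {y*, -y*} scores higher against the noisy node observations c."""
    return y_star if c @ y_star >= c @ (-y_star) else -y_star

def error_rate(n=200, q=0.3, trials=2000):
    y_star = rng.choice([-1, 1], size=n)   # hypothetical ground-truth labeling
    errors = 0
    for _ in range(trials):
        flips = rng.random(n) < q          # each node observation flips w.p. q
        c = np.where(flips, -y_star, y_star)
        errors += not np.array_equal(second_stage(c, y_star), y_star)
    return errors / trials

emp = error_rate()
bound = np.exp(-(200 / 2) * (1 - 2 * 0.3) ** 2)  # eps_2(n, q) from Theorem 3
assert emp <= bound + 0.05                        # empirical error respects the bound
```

Since q < 0.5, the score c⊤y* concentrates well above c⊤(−y*), so the empirical error is essentially zero for even moderate n, consistent with the exponential decay of ε₂.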
Perhaps the most interesting example is related to smoothed analysis on connected graphs, where even a graph with bad properties such as poor expansion can still be exactly recovered by adding small perturbations (edges coming from an Erdős-Rényi model with small probability).

Acknowledgments

This material is based upon work supported by the National Science Foundation under Grant No. 1716609-IIS.

4 Specifically, we set α = 1/2, δ = ε/256, K = 128/ε, C = 1, s = K log n, which results in all the conditions being fulfilled in the proof of Theorem 2 in Krivelevich et al. (2015).

References

Abbe, E., Bandeira, A. S., Bracher, A. & Singer, A. (2014), 'Decoding binary node labels from censored edge measurements: Phase transition and efficient recovery', IEEE Transactions on Network Science and Engineering 1(1), 10–22.

Abbe, E., Bandeira, A. S. & Hall, G. (2016), 'Exact recovery in the stochastic block model', IEEE Transactions on Information Theory 62(1), 471–487.

Altun, Y. & Hofmann, T. (2003), 'Large margin methods for label sequence learning', European Conference on Speech Communication and Technology, pp. 145–152.

Barahona, F. (1982), 'On the computational complexity of Ising spin glass models', Journal of Physics A: Mathematical and General 15(10), 3241.

Bello, K. & Honorio, J. (2018), 'Learning latent variable structured prediction models with Gaussian perturbations', NeurIPS.

Boyd, S. & Vandenberghe, L. (2004), Convex Optimization, Cambridge University Press.

Boykov, Y. & Veksler, O. (2006), Graph cuts in vision and graphics: Theories and applications, in 'Handbook of Mathematical Models in Computer Vision', Springer, pp. 79–96.

Chandrasekaran, V. & Jordan, M. I.
(2013), 'Computational and statistical tradeoffs via convex relaxation', Proceedings of the National Academy of Sciences, p. 201302293.

Chandrasekaran, V., Srebro, N. & Harsha, P. (2008), Complexity of inference in graphical models, in 'Proceedings of the Twenty-Fourth Conference on Uncertainty in Artificial Intelligence', AUAI Press, pp. 70–78.

Cheeger, J. (1969), A lower bound for the smallest eigenvalue of the Laplacian, in 'Proceedings of the Princeton Conference in Honor of Professor S. Bochner'.

Chen, Y., Kamath, G., Suh, C. & Tse, D. (2016), Community recovery in graphs with locality, in 'International Conference on Machine Learning', pp. 689–698.

Daumé, H., Langford, J. & Marcu, D. (2009), 'Search-based structured prediction', Machine Learning 75(3), 297–325.

Foster, D., Sridharan, K. & Reichman, D. (2018), Inference in sparse graphs with pairwise measurements and side information, in 'International Conference on Artificial Intelligence and Statistics', pp. 1810–1818.

Globerson, A., Roughgarden, T., Sontag, D. & Yildirim, C. (2015), How hard is inference for structured prediction?, in 'International Conference on Machine Learning', pp. 2181–2190.

Hoory, S., Linial, N. & Wigderson, A. (2006), 'Expander graphs and their applications', Bulletin of the American Mathematical Society 43(4), 439–561.

Koo, T., Rush, A. M., Collins, M., Jaakkola, T. & Sontag, D. (2010), Dual decomposition for parsing with non-projective head automata, in 'Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing', Association for Computational Linguistics, pp. 1288–1298.

Krivelevich, M., Reichman, D. & Samotij, W. (2015), 'Smoothed analysis on connected graphs', SIAM Journal on Discrete Mathematics 29(3), 1654–1669.

Kulesza, A. & Pereira, F.
(2007), 'Structured learning with approximate inference', Neural Information Processing Systems 20, 785–792.

Lafferty, J., McCallum, A. & Pereira, F. C. (2001), 'Conditional random fields: Probabilistic models for segmenting and labeling sequence data'.

Schraudolph, N. N. & Kamenetsky, D. (2009), Efficient exact inference in planar Ising models, in 'Advances in Neural Information Processing Systems', pp. 1417–1424.

Sontag, D., Choe, D. K. & Li, Y. (2012), 'Efficiently searching for frustrated cycles in MAP inference', arXiv preprint arXiv:1210.4902.

Taskar, B., Guestrin, C. & Koller, D. (2003), 'Max-margin Markov networks', Neural Information Processing Systems 16, 25–32.

Tropp, J. A. (2012), 'User-friendly tail bounds for sums of random matrices', Foundations of Computational Mathematics 12(4), 389–434.

Tsochantaridis, I., Joachims, T., Hofmann, T. & Altun, Y. (2005), 'Large margin methods for structured and interdependent output variables', Journal of Machine Learning Research 6(Sep), 1453–1484.