{"title": "HOGWILD!-Gibbs can be PanAccurate", "book": "Advances in Neural Information Processing Systems", "page_first": 32, "page_last": 41, "abstract": "Asynchronous Gibbs sampling has been recently shown to be fast-mixing and an accurate method for estimating probabilities of events on a small number of variables of a graphical model satisfying Dobrushin's condition~\\cite{DeSaOR16}. We investigate whether it can be used to accurately estimate expectations of functions of {\\em all the variables} of the model. Under the same condition, we show that the synchronous (sequential) and asynchronous Gibbs samplers can be coupled so that the expected Hamming distance between their (multivariate) samples remains bounded by $O(\\tau \\log n),$ where $n$ is the number of variables in the graphical model, and $\\tau$ is a measure of the asynchronicity. A similar bound holds for any constant power of the Hamming distance. Hence, the expectation of any function that is Lipschitz with respect to a power of the Hamming distance, can be estimated with a bias that grows logarithmically in $n$. Going beyond Lipschitz functions, we consider the bias arising from asynchronicity in estimating the expectation of polynomial functions of all variables in the model. Using recent concentration of measure results~\\cite{DaskalakisDK17,GheissariLP17,GotzeSS18}, we show that the bias introduced by the asynchronicity is of smaller order than the standard deviation of the function value already present in the true model. 
We perform experiments on a multi-processor machine to empirically illustrate our theoretical findings.", "full_text": "HOGWILD!-Gibbs Can Be PanAccurate\n\nConstantinos Daskalakis \u2217\nEECS & CSAIL, MIT\n\ncostis@csail.mit.edu\n\nNishanth Dikkala \u2217\nEECS & CSAIL, MIT\n\nnishanthd@csail.mit.edu\n\nSiddhartha Jayanti \u2217\u2020\nEECS & CSAIL, MIT\njayanti@mit.edu\n\nAbstract\n\nAsynchronous Gibbs sampling has been recently shown to be fast-mixing and\nan accurate method for estimating probabilities of events on a small number of\nvariables of a graphical model satisfying Dobrushin\u2019s condition [DSOR16]. We\ninvestigate whether it can be used to accurately estimate expectations of functions\nof all the variables of the model. Under the same condition, we show that the\nsynchronous (sequential) and asynchronous Gibbs samplers can be coupled so\nthat the expected Hamming distance between their (multivariate) samples remains\nbounded by O(\u03c4 log n), where n is the number of variables in the graphical model,\nand \u03c4 is a measure of the asynchronicity. A similar bound holds for any constant\npower of the Hamming distance. Hence, the expectation of any function that is\nLipschitz with respect to a power of the Hamming distance, can be estimated\nwith a bias that grows logarithmically in n. Going beyond Lipschitz functions,\nwe consider the bias arising from asynchronicity in estimating the expectation of\npolynomial functions of all variables in the model. Using recent concentration of\nmeasure results [DDK17, GLP17, GSS18], we show that the bias introduced by\nthe asynchronicity is of smaller order than the standard deviation of the function\nvalue already present in the true model. 
We perform experiments on a multiprocessor machine to empirically illustrate our theoretical findings.

1 Introduction

The increasingly ambitious applications of data analysis, and the corresponding growth in the size of the data that needs to be processed, have brought important scalability challenges to machine learning algorithms. Fundamental methods such as Gradient Descent and Gibbs sampling, which were designed with a sequential computational model in mind, are to be applied to datasets of increasingly larger size. As such, there has recently been increased interest in developing techniques for parallelizing these methods. However, these algorithms are inherently sequential and are difficult to parallelize.
HOGWILD!-SGD, proposed by Niu et al. [NRRW11], is a lock-free asynchronous execution of stochastic gradient descent that has been shown to converge under the right sparsity conditions. Several variants of this method, and extensions of the asynchronous execution approach, have been proposed recently and have found successful applications in a broad range of settings, from PageRank approximation to deep learning and recommender systems [YHSD12, NO14, MBDC15, MPP+15, LWR+15, DSZOR15].
Similar to HOGWILD!-SGD, a lock-free asynchronous execution of Gibbs sampling, called HOGWILD!-Gibbs, was proposed by Smola and Narayanamurthy [SN10], and empirically shown to work well on several models [ZR14]. Johnson et al.
[JSW13] provide sufficient conditions under which they show theoretically that HOGWILD!-Gibbs produces samples with the correct mean in Gaussian models, while Terenin et al. [TSD15] propose a modification to the algorithm that is shown to converge under some strong assumptions on asynchronous computation.

∗Supported by NSF awards CCF-1617730 and IIS-1741137, a Simons Investigator Award, a Google Faculty Research Award, and an MIT-IBM Watson AI Lab research grant.
†Also supported by the Department of Defense (DoD) through the NDSEG Program.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Input: Set of variables V, configuration x_0 ∈ S^{|V|}, distribution π
initialization;
for t = 1 to T do
    Sample i uniformly from {1, 2, . . . , n};
    Sample X_i ∼ Pr_π[· | X_{−i} = x_{−i}] and set x_{i,t} = X_i;
    For all j ≠ i, set x_{j,t} = x_{j,t−1};
end
Algorithm 1: Gibbs Sampling

In a more recent paper, De Sa et al. [DSOR16] propose the study of HOGWILD!-Gibbs under a stochastic model of asynchronicity in graphical models with discrete variables. Whenever the graphical model satisfies Dobrushin's condition, they show that the mixing time of the asynchronous Gibbs sampler is similar to that of the sequential (synchronous) one. Moreover, they establish that the asynchronous Gibbs sampler accurately estimates probabilities of events on a sublinear number of variables; in particular, events on up to O(εn/log n) variables can be estimated within variational distance ε, where n is the total number of variables in the graphical model (Lemma 2, [DSOR16]).
Our Results. Our goal in this paper is to push the theoretical understanding of HOGWILD!-Gibbs to estimate functions of all the variables in a graphical model.
In particular, we are interested in whether HOGWILD!-Gibbs can be used to accurately estimate the expectations of such functions. Results from [DSOR16] imply that accurate estimation is possible whenever the function under consideration is Lipschitz with a good Lipschitz constant with respect to the Hamming metric. Under the same Dobrushin condition used in [DSOR16] (see Definition 3), and under a stochastic model of asynchronicity with weaker assumptions (see Section 2.1), we show that one can do better than the bounds implied by [DSOR16] even for functions with bad Lipschitz constants. For instance, consider quadratic functions on an Ising model, which is a binary graphical model and serves as a canonical example of Markov random fields [LPW09, MS10, Fel04, DMR11, GG86, Ell93]. Under appropriate normalization, these functions take values in the range [−n², n²] and have a Lipschitz constant of n. Given this, the results of [DSOR16] would imply we can estimate quadratic functions on the Ising model within an error of O(n). We improve this error to O(√n). In particular, we show the following in our paper:

• Starting at the same initial configuration, the executions of the sequential and the asynchronous Gibbs samplers can be coupled so that the expected Hamming distance between the multivariate samples that the two samplers maintain is bounded by O(τ log n), where n is the number of variables in the graphical model, and τ is a measure of the average contention in the asynchronicity model of Section 2.1. See Lemma 2. More generally, the expectation of the d-th power of the Hamming distance is bounded by C(d, τ) log^d n, for some function C(d, τ).
See Lemma 3.

• It follows from Lemmas 2 and 3 that, if a function f of the variables of a graphical model is K-Lipschitz with respect to the d-th power of the Hamming distance, then the bias in the expectation of f introduced by HOGWILD!-Gibbs under the asynchronicity model of Section 2.1 is bounded by K · C(d, τ) log^d n. See Corollary 1.

• Next, we improve the bounds of Corollary 1 for functions that are degree-d polynomials of the variables of the graphical model. Low-degree polynomials on graphical models are a natural class of functions which are of interest in many statistical tasks performed on graphical models (see, for instance, [DDK18]). For simplicity we show these improvements for the Ising model, but our results are extendible to general graphical models. We show, in Theorem 4, that the bias introduced by HOGWILD!-Gibbs in the expectation of a degree-d polynomial of the Ising model is bounded by O((n log n)^((d−1)/2)). This bound improves upon the bound computed by Corollary 1 by a factor of about (n/log n)^((d−1)/2), as the Lipschitz constant with respect to the Hamming distance of a degree-d polynomial of the Ising model can be up to O(n^(d−1)). Importantly, the bias of O((n log n)^((d−1)/2)) that we show is introduced by the asynchronicity is of a lower order of magnitude than the standard deviation of degree-d polynomials of the Ising model, which is O(n^(d/2)) (see Theorem 2), and which is already experienced by the sequential sampler. Moreover, in Theorem 5, we also show that the asynchronous Gibbs sampler is not adding a higher order variance to its sample.
Thus, our results suggest that running Gibbs sampling asynchronously leads to a valid bias-variance tradeoff.
Our bounds for the expected Hamming distance between the sequential and the asynchronous Gibbs samplers follow from coupling arguments, while our improvements for polynomial functions of Ising models follow from a combination of our Hamming bounds and recent concentration of measure results for polynomial functions of the Ising model [DDK17, GLP17, GSS18].

• In Section 5, we illustrate our theoretical findings by performing experiments on a multicore machine. We experiment with graphical models over two kinds of graphs. The first is the √n × √n grid graph (which we represent as a torus for degree regularity), where each node has 4 neighbors, and the second is the clique over n nodes.
We first study how valid the assumptions of the asynchronicity model are. The main assumption in the model was that the average contention parameter τ doesn't grow as the number of nodes in the graph grows. It is a constant which depends on the hardware being used, and we observe that this is indeed the case in practice. The expected contention grows linearly with the number of processors on the machine but remains constant with respect to n (see Figures 1 and 2).
Next, we look at quadratic polynomials over graphical models associated with both the grid and clique graphs. We estimate their expected values under the sequential Gibbs sampler and HOGWILD!-Gibbs and measure the bias (absolute difference) between the two. Our theory predicts that this should scale as √n, and we observe that this is indeed the case (Figure 3). Our experiments are described in greater detail in Section 5.

2 The Model and Preliminaries

In this paper, we consider the Gibbs sampling algorithm as applied to discrete graphical models.
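For reference, the sequential sampler of Algorithm 1 can be sketched as follows. The `conditional` callback interface, which returns the conditional distribution π_i(·|x_{−i}) as a dictionary, is our own illustrative assumption and not an API from the paper.

```python
import random

def gibbs_sample(n, conditional, x0, T, seed=0):
    """Sequential Gibbs sampler in the style of Algorithm 1.

    conditional(i, x) must return a dict mapping each value s in S to
    the conditional probability pi_i(s | x_{-i}).
    """
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(T):
        i = rng.randrange(n)          # choose a variable uniformly at random
        dist = conditional(i, x)      # pi_i(. | x_{-i})
        values = list(dist)
        # resample coordinate i from its conditional distribution
        x[i] = rng.choices(values, weights=[dist[v] for v in values])[0]
    return x
```

For instance, plugging in the trivial conditional `lambda i, x: {+1: 0.5, -1: 0.5}` yields i.i.d. uniform ±1 coordinates, while an Ising conditional (Section 2) recovers the sampler analyzed in the paper.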
The models will be defined on a graph G = (V, E) with |V| = n nodes and will represent a probability distribution π. We use S to denote the range of values each node in V can take. For any configuration X ∈ S^{|V|}, π_i(·|X_{−i}) will denote the conditional distribution of variable i given all other variables of state X.
In Section 4, we will look at Ising models, a particular class of discrete binary graphical models with pairwise local correlations. We consider the Ising model on a graph G = (V, E) with n nodes. This is a distribution over Ω = {±1}^n, with a parameter vector θ ∈ R^{|V|+|E|}; θ has a parameter corresponding to each edge e ∈ E and each node v ∈ V. The probability mass function assigned to a string x is

P(x) = exp( Σ_{v∈V} θ_v x_v + Σ_{e=(u,v)∈E} θ_e x_u x_v − Φ(θ) ),

where Φ(θ) is the log-partition function for the distribution. We say an Ising model has no external field if θ_v = 0 for all v ∈ V. For ease of exposition we will focus on the case with no external field in this paper. However, the results extend to Ising models with external fields when the functions under consideration (in Section 4) are appropriately chosen to be centered. See [DDK17].
Throughout the paper we will focus on bounded functions defined on the discrete space S^{|V|}. For a function f, we use ‖f‖_∞ to denote the maximum absolute value of the function over its domain. We will use [n] to denote the set {1, 2, . . . , n}. In Section 4, we will study polynomial functions over the Ising model.
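The Gibbs update for this model has a simple closed form: flipping only coordinate i changes the exponent of P(x) by 2(θ_i + Σ_{j:(i,j)∈E} θ_ij x_j), so the conditional probability of x_i = +1 is a logistic function of the local field. A minimal sketch (the container choices `theta_node`/`theta_edge` are our own illustrative assumptions, not notation from the paper):

```python
import math

def ising_conditional_plus(i, x, theta_node, theta_edge):
    """P(x_i = +1 | x_{-i}) for the Ising model defined above.

    theta_node maps a node v to theta_v; theta_edge maps a
    frozenset {u, v} to the edge parameter theta_e.
    """
    # Only terms containing x_i differ between the two settings of x_i,
    # so the exponent difference is 2 * (theta_i + sum_j theta_ij * x_j).
    field = theta_node.get(i, 0.0)
    for edge, weight in theta_edge.items():
        if i in edge:
            (j,) = edge - {i}
            field += weight * x[j]
    # e^field / (e^field + e^-field) = 1 / (1 + e^(-2 * field))
    return 1.0 / (1.0 + math.exp(-2.0 * field))
```

With no external field and zero edge weights this returns 1/2, matching the symmetry of the model; a positive edge weight with a +1 neighbor pushes the probability above 1/2.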
Since x_i² = 1 always in an Ising model, any polynomial function of degree d can be represented as a multilinear function of degree d, and we will refer to them interchangeably in the context of Ising models.
Definition 1 (Polynomial/Multilinear Functions of the Ising Model). A degree-d polynomial defined on n variables x_1, . . . , x_n is a function of the following form

Σ_{S⊆[n]: |S|≤d} a_S Π_{i∈S} x_i,

where a : 2^[n] → R is a coefficient vector.
We will use a to denote the coefficient vector of such a multilinear function and ‖a‖_∞ to denote the maximum element of a in absolute value. Note that we will use permutations of the subscripts to refer to the same coefficient, i.e., a_ijk is the same as a_jik.

We now give a formal definition of Dobrushin's uniqueness condition, also known as the high-temperature regime. First we define the influence of a node j on a node i.
Definition 2 (Influence in Graphical Models). Let π be a probability distribution over some set of variables V. Let B_j denote the set of state pairs (X, Y) which differ only in their value at variable j. Then the influence of node j on node i is defined as

I(j, i) = max_{(X,Y)∈B_j} ‖π_i(·|X_{−i}) − π_i(·|Y_{−i})‖_TV.

Now, we are ready to state Dobrushin's condition.
Definition 3 (Dobrushin's Uniqueness Condition). Consider a distribution π defined on a set of variables V. Let

α = max_{i∈V} Σ_{j∈V} I(j, i).

π is said to satisfy Dobrushin's uniqueness condition if α < 1.

We have the following result from [DSOR16] about the mixing time of the Gibbs sampler for a model satisfying Dobrushin's condition.
Theorem 1 (Mixing Time of Sequential Gibbs Sampling).
Assume that we run Gibbs sampling on a distribution that satisfies Dobrushin's condition, α < 1. Then the mixing time of sequential-Gibbs is bounded by

t_mix-seq(ε) ≤ (n / (1 − α)) log(n / ε).

Definition 4. For any discrete state space S^{|V|} over the set of variables V, the Hamming distance between x, y ∈ S^{|V|} is defined as d_H(x, y) = Σ_{i∈V} 1{x_i ≠ y_i}.

Definition 5 (The greedy coupling between two Gibbs Sampling chains). Consider two instances of Gibbs sampling associated with the same discrete graphical model π over the state space S^{|V|}: X_0, X_1, . . . and Y_0, Y_1, . . .. The following coupling procedure is known as the greedy coupling. Start chain 1 at X_0 and chain 2 at Y_0 and, in each time step t, choose a node v ∈ V uniformly at random to update in both chains. Without loss of generality assume that S = {1, 2, . . . , k}. Let p(i_1) denote the probability that the first chain sets X_{t,v} = i_1 and let q(i_2) be the probability that the second chain sets Y_{t,v} = i_2. Plot the points P(i) = Σ_{j=1}^{i} p(j) and Q(i) = Σ_{j=1}^{i} q(j) for all i ∈ [k] on the interval [0, 1]. Also pick P(0) = Q(0) = 0 and P(k + 1) = Q(k + 1) = 1. Couple the updates according to the following rule:

Draw a number x uniformly at random from [0, 1]. Suppose x ∈ [P(i_1), P(i_1 + 1)] and x ∈ [Q(i_2), Q(i_2 + 1)]. Choose X_{t,v} = i_1 and Y_{t,v} = i_2.

We state an important property of this coupling, which holds under Dobrushin's condition, in the following Lemma.
Lemma 1. The greedy coupling (Definition 5) satisfies the following property. Let X_0, Y_0 ∈ S^{|V|} and consider two executions of Gibbs sampling associated with distribution π and starting at X_0 and Y_0 respectively. Suppose the executions were coupled using the greedy coupling.
Suppose that in step t = 1, node i is chosen to be updated in both the models. Then,

Pr[X_{1,i} ≠ Y_{1,i}] ≤ ‖π_i(·|(X_0)_{−i}) − π_i(·|(Y_0)_{−i})‖_TV.    (1)

2.1 Modeling Asynchronicity

We use the asynchronicity model from [RRWN11] and [DSOR16]. HOGWILD!-Gibbs is a multi-threaded algorithm where each thread performs a Gibbs update on the state of a graph which is stored in shared memory (typically in RAM). We view each processor's write as occurring at a distinct time instant, and each write starts the next time step of the process. Assuming that the writes are all serialized, one can now talk about the state of the system after t writes. This will be denoted as time t. HOGWILD! is modeled as a stochastic system adapted to a natural filtration F_t; F_t contains all events that have occurred until time t. Some of these writes happen based on a read done a few steps ago and hence correspond to updates based on stale values in the local cache of the processor. The staleness is modeled in a stochastic manner using the random variable τ_{i,t} to denote the delay associated with the read performed on node i at time step t. The value of node i used in the update at time t is going to be Y_{i,t} = X_{i,(t−τ_{i,t})}. Delays across different node reads can be correlated. However, the delay distribution is independent of the configuration of the model at time t. The model imposes two restrictions on the delay distributions. First, the expected value of each delay distribution is bounded by τ. We will think of τ as a constant compared to n in this paper. We call τ the average contention parameter associated with a HOGWILD!-Gibbs execution. [DSOR16] impose a second restriction which bounds the tails of the distribution of τ_{i,t}. We do not need to make this assumption in this paper for our results. [DSOR16] need the assumption to show that the HOGWILD!
chain mixes fast. However, by using coupling arguments we can avoid the need to have the HOGWILD! chain mix, and will just use the mixing time bounds for the sequential Gibbs sampling chain instead. Let T denote the set of all delay distributions. We refer to the sequential Gibbs sampler associated with a distribution π as G_π, and to the HOGWILD! Gibbs sampler with delay distributions T associated with a distribution p as H^T_p. Note that H^T_p is a time-inhomogeneous Markov chain and might not converge to a stationary distribution.

2.2 Concentration of Polynomials on Ising Models

Here we state a known result about concentration of measure for polynomial functions on Ising models satisfying Dobrushin's condition.
Theorem 2 (Concentration of Measure for Polynomial Functions of the Ising model, [DDK17, GLP17, GSS18]). Consider an Ising model p without external field on a graph G = (V, E) satisfying Dobrushin's condition with Dobrushin parameter α < 1. Let f_a be a degree-d polynomial over the Ising model. Let X ∼ p. Then, there is a constant c(α, d), such that,

Pr[|f_a(X) − E[f_a(X)]| > t] ≤ 2 exp( −(1 − α) t^{2/d} / (c(α, d) ‖a‖_∞^{2/d} n) ).

As a corollary this also implies,

Var[f_a(X)] ≤ C_3(d, α) n^d.

3 Bounding the Expected Hamming Distance Between Coupled Executions of HOGWILD! and Sequential Gibbs Samplers

In this section, we show that under the greedy coupling of the sequential and asynchronous chains, the expected Hamming distance between the two chains at any time t is small. This will form the basis for our accurate estimation results of Section 4. We begin with Lemma 2.
Lemma 2. Let π denote a discrete probability distribution on n variables (nodes) with Dobrushin parameter α < 1. Let G_π = X_0, X_1, . . . , X_t, . . .
denote the execution of the sequential Gibbs sampler on π and H^T_π = Y_0, Y_1, . . . , Y_t, . . . denote the HOGWILD! Gibbs sampler associated with π, such that X_0 = Y_0. Suppose the two chains are run coupled in a greedy manner. Let K_t denote all events that have occurred until time t in this coupled execution. Then we have, for all t ≥ 0, under the greedy coupling of the two chains,

E[d_H(X_t, Y_t) | K_0] ≤ τα log n / (1 − α).

At a high level, the proof proceeds by studying the expected change in the Hamming distance under one step of the coupled execution of the chains. We can bound the expected change using the Dobrushin parameter and the property of the greedy coupling (Lemma 1). We then show that the expected change is negative whenever the Hamming distance between the two chains was above O(log n) to begin with. This allows us to argue that when the two chains start at the same configuration, the expected Hamming distance remains bounded by O(log n).
Next, we generalize the above Lemma to bound also the d-th moment of the Hamming distance between X_t and Y_t obtained from the coupled executions.
Lemma 3 (d-th moment bound on Hamming). Consider the same setting as that of Lemma 2. We have, for all t ≥ 0, under the greedy coupling of the two chains,

E[d_H(X_t, Y_t)^d | K_0] ≤ C(τ, α, d) log^d n,

where C(·) is some function of the parameters τ, α and d.

The proof of Lemma 3 is of a similar flavor to that of Lemma 2. It is, however, more involved to bound the expected increase in the d-th power of the Hamming distance, and it requires some careful analysis to see that the bound doesn't scale polynomially in n.

4 Estimating Global Functions Using HOGWILD!
Gibbs Sampling

To begin with, we observe that our Hamming moment bounds from Section 3 imply that we can accurately estimate functions or events of the graphical model if they are Lipschitz. We show this below as Corollary 1 of Lemma 3, which quantifies the error we can attain when trying to estimate expectations of Lipschitz functions using HOGWILD!-Gibbs.
Corollary 1. Let π denote the distribution associated with a graphical model over the set of variables V (|V| = n) taking values in a discrete space S^n. Assume that the model satisfies Dobrushin's condition with Dobrushin parameter α < 1. Let f : S^{|V|} → R be a function such that, for all x, y ∈ S^{|V|},

|f(x) − f(y)| ≤ K d_H(x, y)^d.

Let X ∼ π and let Y_0, Y_1, . . . , Y_t denote an execution of HOGWILD!-Gibbs sampling on π with average contention parameter τ. For t > (n / (1 − α)) log(2‖f‖_∞ n / K),

|E[f(Y_t)] − E[f(X)]| ≤ K · (C(τ, α, d) log^d n + 1).

We note that the results of [DSOR16] can be used to obtain Corollary 1 when the function is Lipschitz with respect to the Hamming distance. The above corollary provides a simple way to bound the bias introduced by HOGWILD! in the estimation of Lipschitz functions. However, many functions of interest over graphical models are not Lipschitz with good Lipschitz constants. In many cases, even when the Lipschitz constants are bad, there is still hope for more accurate estimation. As it turns out, Dobrushin's condition provides such cases. We will focus on one such case, namely polynomial functions of the Ising model. Our goal will be to accurately estimate the expected values of constant-degree polynomials over the Ising model. Using the bounds from Lemmas 2 and 3, we now proceed to bound the bias in computing polynomial functions of the Ising model using HOGWILD! Gibbs sampling.
We first remark that linear functions (degree 1 polynomials) suffer 0 bias in their expected values due to HOGWILD!-Gibbs. This is because, under zero external field Ising models, E[Σ_i a_i X_i] = 0 since each node individually has equal probability of being ±1. This symmetry is maintained by HOGWILD!-Gibbs since the delays are configuration-agnostic. Hence the delays when a node is +1 and when it is −1 can be coupled perfectly, leaving the symmetry intact. Therefore, we start our investigation at quadratic polynomials. Theorem 3 states the bound we show for the bias in computation of degree 2 polynomials of the Ising model.
Theorem 3 (Bias in Quadratic functions of Ising Model computed using HOGWILD!-Gibbs). Consider the quadratic function f_a(x) = Σ_{i,j: i≠j} a_{ij} x_i x_j on an Ising model satisfying Dobrushin's condition with parameter α < 1, and let X_t, Y_t denote the coupled sequential and HOGWILD! chains of Lemma 2. Then we have, for t > (6n / (1 − α)) log n, under the greedy coupling of the two chains,

|E[f_a(X_t) − f_a(Y_t)]| ≤ c_2 ‖a‖_∞ (τα log n / (1 − α)^{3/2}) (n log n)^{1/2}.

The main intuition behind the proof is that we can improve upon the bound implied by the Lipschitz constant by appealing to strong concentration of measure results about functions of graphical models under Dobrushin's condition [DDK17, GLP17, GSS18].
We extend the ideas in the above proof to bound the bias introduced by the HOGWILD! Gibbs algorithm when computing the expected values of a degree d polynomial of the Ising model in high temperature. Our main result concerning d-linear functions is Theorem 4.
Theorem 4 (Bias in degree d polynomials computed using HOGWILD!-Gibbs). Consider a degree d polynomial of the form f_a(x) = Σ_{i_1,i_2,...,i_d} a_{i_1 i_2 ... i_d} x_{i_1} x_{i_2} . . . x_{i_d}. Consider the same setting as that of Theorem 3. Then we have, for t > ((d + 1)n / (1 − α)) log n, under the greedy coupling of the two chains,

|E[f_a(X_t) − f_a(Y_t)]| ≤ c′ ‖a‖_∞ (n log n)^{(d−1)/2}.

Next, we show that we can accurately estimate the expectations above by showing that the variance of the functions under the asynchronous model is comparable to that of the functions under the sequential model.
Theorem 5 (Variance of degree d polynomials computed using HOGWILD!-Gibbs). Consider a high temperature Ising model p on n nodes with Dobrushin parameter α < 1. Let f_a(x) be a degree d polynomial function. Let Y_0, Y_1, . . . , Y_t denote a run of HOGWILD! Gibbs sampling associated with p. We have, for t > ((d + 1)n / (1 − α)) log(n²),

Var[f_a(Y_t)] ≤ ‖a‖²_∞ C(d, α, τ) n^d.

4.1 Going Beyond Ising Models

We presented results for accurate estimation of polynomial functions over the Ising model. However, the results can be extended to hold for more general graphical models satisfying Dobrushin's condition. A main ingredient here was concentration of measure. If the class of functions we look at has d-th-order bounded differences in expectation, then we indeed get concentration of measure for these functions (Theorem 1.2 of [GSS18]). This, combined with the techniques in our paper, would allow similar gains in accurate estimation of such functions on general graphical models.

5 Experiments

We show the results of experiments run on a machine with four 10-core Intel Xeon E7-4850 CPUs to demonstrate the practical validity of our theory. In our experiments, we focused on two Ising models: Curie-Weiss and the Grid. The Curie-Weiss model CW(n, α) is the Ising model corresponding to the complete graph on n vertices with edges of weight β = α/(n − 1).
The Grid(k², α) model is the Ising model corresponding to the k-by-k grid, with the left connected to the right and the top connected to the bottom to form a torus (a four-regular graph); the edge weights are α/4. The total influence of each of these models is at most α, so we chose α = 0.5 to ensure Dobrushin's condition. To generate samples, we start at a uniformly random configuration and run Markov chains for T = 10n log2(n) steps to ensure mixing.
In our first experiment (Figure 1) we validate the modeling assumption that the average delay of a read, τ, is a constant. Computing the exact delays in a real run of HOGWILD! is not possible, but we approximate the delays by making processes log read and write operations to a lock-free queue as they execute the HOGWILD! updates. We present two plots of the average delay of a read in a HOGWILD! run of the CW(n, 0.5) Markov chain with respect to n. Four asynchronous processors were used to generate the first plot, while twenty were used for the second. We notice that the average delay depends on the number of asynchronous processes, but is constant with respect to n, as assumed in our model.
Next, we plot (in Figure 2) the relationship between the number of asynchronous processors used in a HOGWILD! execution and the delay parameter τ. For this plot, we estimated τ by the average empirical delay over HOGWILD! runs of CW(n, 0.5) models, with n ranging from 100 to 1000 in increments of one hundred. The plot shows a linear relationship, and suggests that the delay per additional processor is approximately 0.4 steps.

Figure 1: Average delay of reads for the CW(n, 0.5) model. Four asynchronous processors were used on the left, while twenty were used on the right.

Figure 2: Average delay of reads for the CW(n, 0.5) model as the number of processors used varies.

The primary purpose of our work is to demonstrate that polynomial statistics computed from samples of a HOGWILD! run of Gibbs Sampling will approximate those computed from a sequential run. Our third experiment demonstrates exactly this fact. We plot (in Figure 3 on the left) the empirical expectations of the complete bilinear function f(X_1, . . . , X_n) = Σ_{i≠j} X_i X_j as we vary the number of nodes n in a Curie-Weiss model graph. Each red point is the empirical mean of the function f computed over 5000 samples from the HOGWILD! Markov chain corresponding to CW(n, 0.5), and each blue point is the empirical mean produced from 5000 sequential runs of the same chain. Our theory (Theorem 3) predicts that the bias, the vertical difference in height between red and blue points, at any given value of n will be on the order of the standard deviation divided by √n (the standard deviation is Θ(n) and the bias is O(√n)). We plot error bars of this order, and find that the HOGWILD! means fall inside the error bars, thus corroborating our theory. We show that theory and practice coincide even for sparse graphs, by making the same plot for the Grid(n, 0.5) model on the right of the same figure.

Figure 3: Means (with appropriately scaled error bars) of the complete bilinear function computed over 5000 sequential and HOGWILD! runs of CW(n, 0.5) (left) and Grid(n, 0.5) (right).

6 Acknowledgements

We thank Prof. Srinivas Devadas and Xiangyao Yu for helping us gain access to and program on their multicore machines.

References

[DDK17] Constantinos Daskalakis, Nishanth Dikkala, and Gautam Kamath. Concentration of multilinear functions of the Ising model with applications to network data. In Advances in Neural Information Processing Systems 30, NIPS '17. Curran Associates, Inc., 2017.

[DDK18] Constantinos Daskalakis, Nishanth Dikkala, and Gautam Kamath. Testing Ising models.
In Proceedings of the 29th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '18, Philadelphia, PA, USA, 2018. SIAM.

[DMR11] Constantinos Daskalakis, Elchanan Mossel, and Sébastien Roch. Evolutionary trees and the Ising model on the Bethe lattice: A proof of Steel's conjecture. Probability Theory and Related Fields, 149(1):149–189, 2011.

[DSOR16] Christopher De Sa, Kunle Olukotun, and Christopher Ré. Ensuring rapid mixing and low bias for asynchronous Gibbs sampling. In JMLR Workshop and Conference Proceedings, volume 48, page 1567. NIH Public Access, 2016.

[DSZOR15] Christopher M. De Sa, Ce Zhang, Kunle Olukotun, and Christopher Ré. Taming the wild: A unified analysis of Hogwild-style algorithms. In Advances in Neural Information Processing Systems, pages 2674–2682, 2015.

[Ell93] Glenn Ellison. Learning, local interaction, and coordination. Econometrica, 61(5):1047–1071, 1993.

[Fel04] Joseph Felsenstein. Inferring Phylogenies. Sinauer Associates, Sunderland, 2004.

[GG86] Stuart Geman and Christine Graffigne. Markov random field image models and their applications to computer vision. In Proceedings of the International Congress of Mathematicians, pages 1496–1517. American Mathematical Society, 1986.

[GLP17] Reza Gheissari, Eyal Lubetzky, and Yuval Peres. Concentration inequalities for polynomials of contracting Ising models. arXiv preprint arXiv:1706.00121, 2017.

[GSS18] Friedrich Götze, Holger Sambale, and Arthur Sinulis. Higher order concentration for functions of weakly dependent random variables. arXiv preprint arXiv:1801.06348, 2018.

[JSW13] Matthew Johnson, James Saunderson, and Alan Willsky. Analyzing Hogwild parallel Gaussian Gibbs sampling. In Advances in Neural Information Processing Systems, pages 2715–2723, 2013.

[LPW09] David A. Levin, Yuval Peres, and Elizabeth L. Wilmer. Markov Chains and Mixing
Times. American Mathematical Society, 2009.

[LWR+15] Ji Liu, Stephen J. Wright, Christopher Ré, Victor Bittorf, and Srikrishna Sridhar. An asynchronous parallel stochastic coordinate descent algorithm. The Journal of Machine Learning Research, 16(1):285–322, 2015.

[MBDC15] Ioannis Mitliagkas, Michael Borokhovich, Alexandros G. Dimakis, and Constantine Caramanis. FrogWild!: Fast PageRank approximations on graph engines. Proceedings of the VLDB Endowment, 8(8):874–885, 2015.

[MPP+15] Horia Mania, Xinghao Pan, Dimitris Papailiopoulos, Benjamin Recht, Kannan Ramchandran, and Michael I. Jordan. Perturbed iterate analysis for asynchronous stochastic optimization. arXiv preprint arXiv:1507.06970, 2015.

[MS10] Andrea Montanari and Amin Saberi. The spread of innovations in social networks. Proceedings of the National Academy of Sciences, 107(47):20196–20201, 2010.

[NO14] Cyprien Noel and Simon Osindero. Dogwild!: Distributed Hogwild for CPU & GPU. In NIPS Workshop on Distributed Machine Learning and Matrix Computations, 2014.

[NRRW11] Feng Niu, Benjamin Recht, Christopher Ré, and Stephen Wright. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 693–701, 2011.

[RRWN11] Benjamin Recht, Christopher Ré, Stephen Wright, and Feng Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 693–701, 2011.

[SN10] Alexander Smola and Shravan Narayanamurthy. An architecture for parallel topic models. Proceedings of the VLDB Endowment, 3(1-2):703–710, 2010.

[TSD15] Alexander Terenin, Daniel Simpson, and David Draper. Asynchronous Gibbs sampling. arXiv preprint arXiv:1509.08999, 2015.

[YHSD12] Hsiang-Fu Yu, Cho-Jui Hsieh, Si Si, and Inderjit Dhillon.
Scalable coordinate descent approaches to parallel matrix factorization for recommender systems. In 2012 IEEE 12th International Conference on Data Mining (ICDM), pages 765–774. IEEE, 2012.

[ZR14] Ce Zhang and Christopher Ré. DimmWitted: A study of main-memory statistical analytics. Proceedings of the VLDB Endowment, 7(12):1283–1294, 2014.
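As a companion to the experimental setup described above, the sequential Gibbs dynamics on the CW(n, α) model can be sketched as follows. This is a minimal Python sketch under stated assumptions, not the code used in our experiments: it assumes the Curie-Weiss edge weight is α/n (the natural normalization under which the total influence of each node is at most α), starts from a uniformly random configuration, and runs single-site updates.

```python
import math
import random


def gibbs_cw(n, alpha, steps, rng):
    """Sequential Gibbs sampler for the Curie-Weiss model CW(n, alpha):
    the Ising model on the complete graph with n nodes and edge weight
    beta = alpha/n (assumed normalization, so each node's total influence
    is at most alpha)."""
    beta = alpha / n
    # Start at a uniformly random +/-1 configuration.
    x = [rng.choice([-1, 1]) for _ in range(n)]
    s = sum(x)  # maintain the total magnetization incrementally
    for _ in range(steps):
        i = rng.randrange(n)
        m = s - x[i]  # sum of all spins other than x[i]
        # Conditional law of x[i] given the rest:
        # P(x_i = +1) = e^{beta*m} / (e^{beta*m} + e^{-beta*m})
        p_plus = 1.0 / (1.0 + math.exp(-2.0 * beta * m))
        new = 1 if rng.random() < p_plus else -1
        s += new - x[i]
        x[i] = new
    return x
```

In the experiments, each chain would be run for T = 10n log²(n) such steps before a sample is read off; the HOGWILD! variant differs only in that several processes perform these updates concurrently on shared state, reading possibly stale spins.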