{"title": "Asynchronous Parallel Coordinate Minimization for MAP Inference", "book": "Advances in Neural Information Processing Systems", "page_first": 5734, "page_last": 5744, "abstract": "Finding the maximum a-posteriori (MAP) assignment is a central task in graphical models. Since modern applications give rise to very large problem instances, there is increasing need for efficient solvers. In this work we propose to improve the efficiency of coordinate-minimization-based dual-decomposition solvers by running their updates asynchronously in parallel. In this case message-passing inference is performed by multiple processing units simultaneously without coordination, all reading and writing to shared memory. We analyze the convergence properties of the resulting algorithms and identify settings where speedup gains can be expected. Our numerical evaluations show that this approach indeed achieves significant speedups in common computer vision tasks.", "full_text": "Asynchronous Parallel Coordinate Minimization\n\nfor MAP Inference\n\nOfer Meshi\n\nGoogle\n\nmeshi@google.com\n\nAlexander G. Schwing\n\nDepartment of Electrical and Computer Engineering\n\nUniversity of Illinois at Urbana-Champaign\n\naschwing@illinois.edu\n\nAbstract\n\nFinding the maximum a-posteriori (MAP) assignment is a central task for structured\nprediction. Since modern applications give rise to very large structured problem\ninstances, there is increasing need for ef\ufb01cient solvers. In this work we propose\nto improve the ef\ufb01ciency of coordinate-minimization-based dual-decomposition\nsolvers by running their updates asynchronously in parallel. In this case message-\npassing inference is performed by multiple processing units simultaneously without\ncoordination, all reading and writing to shared memory. We analyze the conver-\ngence properties of the resulting algorithms and identify settings where speedup\ngains can be expected. 
Our numerical evaluations show that this approach indeed\nachieves signi\ufb01cant speedups in common computer vision tasks.\n\n1\n\nIntroduction\n\nFinding the most probable con\ufb01guration of a structured distribution is an important task in machine\nlearning and related applications. It is also known as the maximum a-posteriori (MAP) inference\nproblem in graphical models [Wainwright and Jordan, 2008, Koller and Friedman, 2009], and has\nfound use in a wide range of applications, from disparity map estimation in computer vision, to\npart-of-speech tagging in natural language processing, protein-folding in computational biology and\nothers. Generally, MAP inference is intractable, and ef\ufb01cient algorithms only exist in some special\ncases, such as tree-structured graphs. It is therefore common to use approximations.\nIn recent years, many approximate MAP inference methods have been proposed [see Kappes et al.,\n2015, for a recent survey]. One of the major challenges in applying approximate inference techniques\nis that modern applications give rise to very large instances. For example, in semantic image\nsegmentation the task is to assign labels to all pixels in an image [e.g., Zhou et al., 2016]. This can\ntranslate into a MAP inference problem with hundreds of thousands of variables (one for each pixel).\nFor this reason, ef\ufb01ciency of approximate inference algorithms is becoming increasingly important.\nOne approach to dealing with the growth in problem complexity is to use cheap (but often inaccurate)\nalgorithms. For example, variants of the mean \ufb01eld algorithm have witnessed a surge in popularity\ndue to their impressive success in several computer vision tasks [Kr\u00e4henb\u00fchl and Koltun, 2011]. A\nshortcoming of this approach is that it is limited to a speci\ufb01c type of model (fully connected graphs\nwith Gaussian pairwise potentials). 
Moreover, the mean \ufb01eld approximation is often less accurate\nthan other approximations, e.g., those based on convex relaxations [Desmaison et al., 2016].\nIn this work we study an alternative approach to making approximate MAP inference algorithms\nmore ef\ufb01cient \u2013 parallel computation. Our study is motivated by two developments. First, current\nhardware trends increase the availability of parallel processing hardware in the form of multi-core\nCPUs as well as GPUs. Second, recent theoretical results improve our understanding of various\nasynchronous parallel algorithms, and demonstrate their potential usefulness, especially for objective\nfunctions that are typical in machine learning problems [e.g., Recht et al., 2011, Liu et al., 2015].\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fFocusing on a smoothed objective function originating from a dual-decomposition approximation,\nwe present a fully asynchronous parallel algorithm for MAP inference based on block-coordinate\nupdates. Our approach gives rise to a message-passing procedure, where messages are computed\nand updated in shared memory asynchronously in parallel by multiple processing units, with no\nattempt to coordinate their actions. The reason we focus on asynchronous algorithms is because\nthe runtime of synchronous algorithms is dominated by the slowest worker, which may cause the\noverhead from synchronization to outweigh the gain from parallelization. The asynchronous parallel\nsetting is particularly suitable for message-passing algorithms, like the ones we study here.\nOur analysis is conducted under the bounded delay assumption, which is standard in the literature on\nasynchronous optimization and matches well modern multicore architectures. It reveals the precise\nrelation between the delay and the expected change in objective value following an update. 
This\nresult suggests a natural criterion for adaptively choosing the number of parallel workers to guarantee\nconvergence to the optimal value. Additional analysis shows that speedups which are linear in the\nnumber of processors can be expected under some conditions. We illustrate the performance of our\nalgorithm both on synthetic models and on a disparity estimation task from computer vision. We\ndemonstrate 45-fold improvements or more when compared to other asynchronous optimization\ntechniques.\n\n2 Related Work\n\nOur work is inspired by recent advances in the study of asynchronous parallel algorithms and their\nsuccessful application to various machine learning tasks. In particular, parallel versions of various\nsequential algorithms have been recently analyzed, adding to past work in asynchronous parallel\noptimization [Bertsekas and Tsitsiklis, 1989, Tseng, 1991]. Those include, for example, stochastic\ngradient descent [Recht et al., 2011], conditional gradient [Wang et al., 2016], ADMM [Zhang and\nKwok, 2014], proximal gradient methods [Davis et al., 2016], and coordinate descent [Liu et al.,\n2015, Liu and Wright, 2015, Avron et al., 2015, Hsieh et al., 2015, Peng et al., 2016, You et al., 2016].\nThe algorithms we study here are based on block coordinate minimization, a coordinate descent\nmethod in which an optimal update is computed in closed form.1 To the best of our knowledge, this\nalgorithm has yet to be analyzed in the asynchronous parallel setting. The analysis of this algorithm\nis signi\ufb01cantly more challenging compared to other coordinate descent methods, since there is no\nnotion of a step-size, which is carefully chosen in previous analyses to guarantee convergence [e.g.,\nLiu et al., 2015, Avron et al., 2015, Peng et al., 2016]. Furthermore, in most previous papers, the\nfunction which is being optimized is assumed to be strongly convex, or to satisfy a slightly weaker\ncondition [Liu et al., 2015, Hsieh et al., 2015]. 
In contrast, we analyze a smooth and convex MAP\nobjective, which does not satisfy any of these strong-convexity conditions. We focus on this particular\nobjective function since optimal block coordinate updates are known in this case, which is not true\nfor its strongly convex counterparts [Meshi et al., 2015].\nWe are not the \ufb01rst to study parallel inference methods in graphical models. Parallel variants of\nBelief Propagation (BP) are proposed and analyzed by Gonzalez et al. [2011]. They present bounds\non achievable gains from parallel inference on chain graphs, as well as an optimal parallelization\nscheme. However, the algorithms they propose include global synchronization steps, which often hurt\nef\ufb01ciency. In contrast, we focus on the fully asynchronous setting, so our algorithms and analysis\nare substantially different. Piatkowski and Morik [2011] and Ma et al. [2011] also describe parallel\nimplementations of BP, but those again involve synchronization. We are particularly interested\nin the MAP inference problem and use convergent coordinate minimization methods with a dual-\ndecomposition objective. This is quite different from marginal inference with BP, used in the\naforementioned works; for example, BP is not guaranteed to converge even with sequential execution.\nDual-decomposition based parallel inference for graphical models has been investigated by Choi and\nRutenbar [2012] and extended by Hurkat et al. [2015]. They study hardware implementations of\nthe TRW-S algorithm (a coordinate-minimization algorithm very similar to the ones we study here),\nwhere some message computations can be parallelized. However, their parallelization scheme is quite\ndifferent from ours as it is synchronous, i.e., the messages computed in parallel have to be carefully\nchosen, and it is speci\ufb01c to grid-structured graphs. 
In addition, they provide no theoretical analysis of convergence (which is not directly implied by TRW-S convergence due to different message\nscheduling).\n\n1 For a single coordinate this is equivalent to exact line search, but for larger blocks the updates can differ.\n\nSchwing et al. [2011] and Zhang et al. [2014] also study dual-decomposition based parallel inference.\nThey demonstrate gains when parallelizing the computation across multiple machines in a\ncluster. However, their approach requires the employed processing units to run in synchrony. Parallel\nMAP solvers based on subdifferential techniques [Schwing et al., 2012] have also been considered\nby Schwing et al. [2014] using a Frank-Wolfe algorithm. Although individual computations are\nperformed in parallel, their approach also requires a synchronous gradient step.\nAn alternative parallel inference approach is based on sampling algorithms [Singh et al., 2010, Wick\net al., 2010, Asuncion et al., 2011]. However, the gains in runtime observed in this case are usually\nmuch smaller than those observed for algorithms which do not use sampling.\nOur work is thus the first to propose and analyze a fully asynchronous parallel coordinate minimization\nalgorithm for MAP inference in graphical models.\n\n3 Approach\n\nIn this section we formalize the MAP inference problem and present our algorithmic framework.\nConsider a set of discrete variables $X_1, \\ldots, X_N$, and denote by $x_i \\in \\mathcal{X}_i$ a particular assignment to\nvariable $X_i$ from a discrete set $\\mathcal{X}_i$. Let $r \\subseteq \\{1, \\ldots, N\\}$ denote a subset of the variables, also known\nas a region, and let $\\mathcal{R}$ be the set of all regions that are used in a problem. Each region $r \\in \\mathcal{R}$ is\nassociated with a local score function $\\theta_r(x_r)$, referred to as a factor. The MAP inference problem is\nto find a joint assignment $x$ that maximizes the sum of all factor scores,\n\n$$\\max_{x} \\sum_{r \\in \\mathcal{R}} \\theta_r(x_r) . \\qquad (1)$$\n\nConsider semantic image segmentation as an example. 
Factors depending on a single variable denote\nunivariate preferences often obtained from neural networks [Chen* et al., 2015]. Factors depending\non two or more variables encode local preference relationships.\nThe problem in Eq. (1) is a combinatorial optimization problem which is generally NP-hard [Shimony,\n1994]. Notable tractable special cases include tree-structured graphs and super-modular pairwise\nfactors. In this work we are interested in solving the general form of the problem, therefore we resort\nto approximate inference.\nMultiple ways to compute an approximate MAP solution have been proposed. Here we employ\napproximations based on the dual-decomposition method [Komodakis et al., 2007, Werner, 2010,\nSontag et al., 2011], which often deliver competitive performance compared to other approaches,\nand are also amenable to asynchronous parallel execution. The key idea in dual-decomposition is\nto break the global optimization problem of Eq. (1) into multiple (easy) subproblems, one for each\nfactor. Agreement constraints between overlapping subproblem maximizers are then defined, and the\nresulting program takes the following form,2\n\n$$\\min_{\\delta} \\sum_{r \\in \\mathcal{R}} \\max_{x_r} \\Big( \\theta_r(x_r) + \\sum_{p : r \\in p} \\delta_{pr}(x_r) - \\sum_{c : c \\in r} \\delta_{rc}(x_c) \\Big) \\equiv \\min_{\\delta} \\sum_{r \\in \\mathcal{R}} \\max_{x_r} \\hat{\\theta}^{\\delta}_r(x_r) . \\qquad (2)$$\n\nHere, \u2018$r \\in p$\u2019 (similarly, \u2018$c \\in r$\u2019) represents parent-child containment relationships, often represented\nas a region graph [Wainwright and Jordan, 2008], and $\\delta$ are Lagrange multipliers for the agreement\nconstraints, defined for every region $r$, assignment $x_r$, and parent $p : r \\in p$. In particular, these\nconstraints enforce that the maximizing assignment in a parent region $p$ agrees with the maximizing\nassignment in the child region $r$ on the values of the variables in $r$ (which are also in $p$ due to\ncontainment). For a full derivation see Werner [2010] (Eq. (11)). 
The modification of the model\nfactors $\\theta_r$ by the multipliers $\\delta$ is known as a reparameterization, and is denoted here by $\\hat{\\theta}^{\\delta}_r$ for brevity.\nThe program in Eq. (2) is an unconstrained convex problem with a (piecewise-linear) non-smooth\nobjective function. Standard algorithms, such as subgradient descent, can be applied in this case\n[Komodakis et al., 2007, Sontag et al., 2011]; however, faster algorithms can often be derived for a\nsmoothed variant of this objective function [Johnson, 2008, Hazan and Shashua, 2010, Werner, 2009,\nSavchynskyy et al., 2011]. In this approach the max operator is replaced with soft-max, giving rise to\nthe following problem:\n\n$$\\min_{\\delta} f(\\delta) := \\sum_{r \\in \\mathcal{R}} \\gamma \\log \\sum_{x_r} \\exp\\big( \\hat{\\theta}^{\\delta}_r(x_r) / \\gamma \\big) , \\qquad (3)$$\n\nwhere $\\gamma$ is a parameter controlling the amount of smoothing (larger is smoother).\n\n2 The problem in Eq. (2) can also be derived as the dual of a linear programming relaxation of Eq. (1).\n\nAlgorithm 1 Block Coordinate Minimization\n1: Initialize: $\\delta^0 = 0$\n2: while not converged do\n3:   Choose a block $s$ at random\n4:   Update: $\\delta^{t+1}_s = \\arg\\min_{\\delta'_s} f(\\delta'_s, \\delta^t_{-s})$, and keep: $\\delta^{t+1}_{-s} = \\delta^t_{-s}$\n5: end while\n\nAlgorithms: Several algorithms for optimizing either the smooth (Eq. (3)) or non-smooth (Eq. (2))\nproblem have been studied. Block coordinate minimization algorithms, which are the focus of our\nwork, are among the most competitive methods. In particular, in this approach a block of variables $\\delta_s$\nis updated at each iteration using the values in other blocks, i.e., $\\delta_{-s}$, which are held fixed. Below we\nwill assume a randomized schedule, where the next block to update is chosen uniformly at random.\nOther schedules are possible [e.g., Meshi et al., 2014, You et al., 2016], but this one will help to\navoid unwanted coordination between workers in an asynchronous implementation. 
The resulting\nmeta-algorithm is given in Algorithm 1.\nVarious choices of blocks give rise to different algorithms in this family. A key consideration is to\nmake sure that the update in line 4 of Algorithm 1 can be computed efficiently. Indeed, for several\ntypes of blocks, efficient, oftentimes analytically computable, updates are known [Werner, 2007,\nGloberson and Jaakkola, 2008, Kolmogorov, 2006, Sontag et al., 2011, Meshi et al., 2014]. To make\nthe discussion concrete, we next instantiate the block coordinate minimization update (line 4 in\nAlgorithm 1) using the smooth objective in Eq. (3) for two types of blocks.3 Specifically, we use the\nPencil block, consisting of the variables $\\delta_{pr}(\\cdot)$, and the Star block, which consists of the set $\\delta_{\\cdot r}(\\cdot)$.\nIntuitively, for the Pencil block, we choose a parent $p$ and one of its children $r$. For the Star block we\nchoose a region $r$ and consider all of its parents.\nTo simplify notation, it is useful to define per-factor probability distributions, referred to as beliefs:\n\n$$\\mu_r(x_r) \\propto \\exp\\big( \\hat{\\theta}^{\\delta}_r(x_r) / \\gamma \\big) .$$\n\nUsing this definition, the Pencil update is performed by picking a pair of adjacent regions $p, r$, and\nsetting:\n\n$$\\delta^{t+1}_{pr}(x_r) = \\delta^t_{pr}(x_r) + \\frac{1}{2} \\gamma \\big( \\log \\mu^t_p(x_r) - \\log \\mu^t_r(x_r) \\big) \\qquad (4)$$\n\nfor all $x_r$, where we denote the marginal belief $\\mu_p(x_r) = \\sum_{x'_{p \\setminus r}} \\mu_p(x_r, x'_{p \\setminus r})$. Similarly, for the Star\nupdate we pick a region $r$, and set:\n\n$$\\delta^{t+1}_{pr}(x_r) = \\delta^t_{pr}(x_r) + \\gamma \\log \\mu^t_p(x_r) - \\frac{\\gamma}{P_r + 1} \\cdot \\log \\Big( \\mu^t_r(x_r) \\cdot \\prod_{p' : r \\in p'} \\mu^t_{p'}(x_r) \\Big)$$\n\nfor all $p : r \\in p$ and all $x_r$, where $P_r = |\\{p : r \\in p\\}|$ is the number of parents of $r$ in the region\ngraph. Full derivation of the above updates is outside the scope of this paper and can be found in\nprevious work [e.g., Meshi et al., 2014]. The variables $\\delta$ are sometimes called messages. 
Hence the\nalgorithms considered here belong to the family of message-passing procedures.\nIn terms of convergence rate, it is known that coordinate minimization converges to the optimum of\nthe smooth problem in Eq. (3) with rate O(1/t) [Meshi et al., 2014].\nIn this work our goal is to study asynchronous parallel coordinate minimization for approximate\nMAP inference. This means that each processing unit repeatedly performs the operations in lines 3-4\nof Algorithm 1 independently, with minimal coordination between units. We refer to this algorithm as\nAPCM \u2013 for Asynchronous Parallel Coordinate Minimization. We use APCM-Pencil and APCM-Star\nto refer to the instantiations of APCM with Pencil and Star blocks, respectively.\n\n3 Similar updates for the non-smooth case (Eq. (2)) are also known. Those are easily obtained by switching\nfrom soft-max to max.\n\n4 Analysis\n\nWe now proceed to analyze the convergence properties of the asynchronous variants of Algorithm 1.\nIn this setting, the iteration counter $t$ corresponds to write operations, which are assumed to be atomic.\nNote, however, that in our experiments in Section 5 we use a lock-free implementation, which may\nresult in inconsistent writes and reads.\nIf there is no delay, then the algorithm is performing exact coordinate minimization. However, since\nupdates happen asynchronously, there will generally be a difference between the current beliefs $\\mu^t$\nand the ones used to compute the update. We denote by $k(t)$ the iteration counter corresponding to\nthe time in which values were read. The bounded delay assumption implies that $t - k(t) \\le \\tau$ for\nsome constant $\\tau$. We present results for the Pencil block next, and defer results for the Star block to\nAppendix B.\nOur first result precisely characterizes the expected change in objective value following an update as\na function of the old and new beliefs. All proofs appear in the supplementary material.\nProposition 1. 
The APCM-Pencil algorithm satisfies:\n\n$$\\mathbb{E}_s[f(\\delta^{t+1})] - f(\\delta^t) = \\frac{\\gamma}{n} \\sum_r \\sum_{p : r \\in p} \\Bigg( \\log \\sum_{x_r} \\frac{\\mu^t_r(x_r)}{\\mu^{k(t)}_r(x_r)} \\sqrt{\\mu^{k(t)}_p(x_r) \\cdot \\mu^{k(t)}_r(x_r)} + \\log \\sum_{x_r} \\frac{\\mu^t_p(x_r)}{\\mu^{k(t)}_p(x_r)} \\sqrt{\\mu^{k(t)}_p(x_r) \\cdot \\mu^{k(t)}_r(x_r)} \\Bigg) , \\qquad (5)$$\n\nwhere $n = \\sum_r \\sum_{p : r \\in p} 1$ is the number of Pencil blocks, and the expectation is over the choice of\nblocks.\n\nAt a high-level, our derivation carefully tracks the effect of stale beliefs on convergence by separating\nold and new beliefs after applying the update (see Appendix A.1). We next highlight a few\nconsequences of Proposition 1. First, it provides an exact characterization of the expected change in\nobjective value, not an upper bound. Second, as a sanity check, when there is no delay ($k(t) = t$),\nthe belief ratio terms ($\\mu^t / \\mu^{k(t)}$) drop, and we recover the sequential decrease in objective, which\ncorresponds to the (negative) Bhattacharyya divergence measure between the pair of distributions\n$\\mu^t_r(x_r)$ and $\\mu^t_p(x_r)$ [Meshi et al., 2014]. Finally, Proposition 1 can be used to dynamically set the\ndegree of parallelization as follows. We estimate Eq. (5) (per block) and if the result is strictly positive\nthen it suggests that the delay is too large and we should reduce the number of concurrent processors.\nNext, we obtain an upper bound on the expected change in objective value that takes into account the\nsparsity of the update.\nProposition 2. The APCM-Pencil algorithm satisfies:\n\n$$\\mathbb{E}_s[f(\\delta^{t+1})] - f(\\delta^t) \\le \\frac{\\gamma}{n} \\sum_{d=k(t)}^{t-1} \\Bigg[ \\max_{x_r} \\log \\Bigg( \\frac{\\mu^{d+1}_{r(d)}(x_r)}{\\mu^d_{r(d)}(x_r)} \\Bigg) + \\max_{x_r} \\log \\Bigg( \\frac{\\mu^{d+1}_{p(d)}(x_r)}{\\mu^d_{p(d)}(x_r)} \\Bigg) \\Bigg] \\qquad (6)$$\n$$+ \\frac{\\gamma}{n} \\sum_r \\sum_{p : r \\in p} \\log \\Bigg( \\sum_{x_r} \\sqrt{\\mu^{k(t)}_p(x_r) \\cdot \\mu^{k(t)}_r(x_r)} \\Bigg)^2 . \\qquad (7)$$\n\nThis bound separates the expected change in objective into two terms: the delay term (Eq. (6)) and\nthe (stale) improvement term (Eq. (7)). 
The improvement term is always non-positive; it is equal to\nthe negative Bhattacharyya divergence, and it is exactly the same as the expected improvement in\nthe sequential setting. The delay term is always non-negative, and as before, when there is no delay\n($k(t) = t$), the sum in Eq. (6) is empty, and we recover the sequential improvement. Note that the\ndelay term depends only on the beliefs in regions that were actually updated between the read and\ncurrent write. This result is obtained by exploiting the sparsity of the updates: each message affects\nonly the neighboring nodes in the graph (see Appendix A.2). Similar structural properties are also\nused in related analyses [e.g., Recht et al., 2011]; however, in other settings this involves making\nnon-trivial assumptions (such as how training examples interact), whereas in our case the sparsity\npattern is readily available through the structure of the graphical model.\n\nFigure 1: Simulation of APCM-Pencil on toy models. Curves show 1, 10, 20, and 40 workers, and 40 workers (adaptive). (Left) objective vs. iteration (equiv., update)\non a 3-node chain graph. The dashed lines show the same objective when iterations are divided by\nthe number of workers, which approximates runtime. (Middle) objective vs. iteration and vs. number\nof active workers on a 3-node chain graph when adapting the number of workers. (Right) objective\nvs. iteration (equiv., update) on a 6-node fully connected graph.\n\nTo demonstrate the hardness of our setting, we present in Appendix A.3 a case where the RHS of\nEq. (6)-(7) may be a large positive number. This happens when some beliefs are very close to 0. In\ncontrast, the next theorem uses the results above to show speedups under additional assumptions.\nTheorem 1. Let $|\\hat{\\theta}^{\\delta^t}_r(x_r)| \\le M$ for all $t, r, x_r$, and let $\\|\\delta^t - \\delta^*\\|^2 < B$ for all $t$. Assume that the\ngradient is bounded from below as $\\|\\nabla f\\|^2 \\ge c$, and that the delay is bounded as $\\tau \\le \\frac{\\gamma c}{32 M}$. Then\n$\\mathbb{E}_s[f(\\delta^t)] - f(\\delta^*) \\le \\frac{8 n B}{\\gamma t}$.\nThis upper bound is only 2 times slower than the corresponding sequential bound (see Theorem 3 in\nMeshi et al. [2014]); however, in this parallel setting we execute updates roughly $\\tau$ times faster, so\nwe obtain a linear speedup in this case. Notice that this rate applies only when the gradient is not\ntoo small, so we expect to get large gains from parallelization initially, and smaller gains as we get\ncloser to optimality. This is due to the hardness of our setting (see Appendix A.3), and gives another\ntheoretical justification to adaptively reduce the number of processing units as the iterations progress.\nAt first glance, the assumptions in Theorem 1 (specifically, the bounds $M$ and $B$) seem strong.\nHowever, it turns out that they are easily satisfied whenever $f(\\delta^t) \\le f(\\delta^0)$ (see Lemma 9 in Meshi\net al. [2014]) \u2013 which is a mild assumption that is satisfied in all of our experiments except some\nadversarially constructed toy problems (see Section 5.1).\n\n5 Experiments\n\nIn this section we present numerical experiments to study the performance of APCM in practical MAP\nestimation problems. 
We first simulate APCM on toy problems in Section 5.1, then, in Section 5.2,\nwe demonstrate our approach on a disparity estimation task from computer vision.\n\n5.1 Synthetic Problems\n\nTo better understand the behavior of APCM, we simulate the APCM-Pencil algorithm sequentially as\nfollows. We keep a set of \u2018workers,\u2019 each of which can be in one of two states: \u2018read\u2019 or \u2018update.\u2019\nIn every step, we choose one of the workers at random using a skewed distribution to encourage\nlarge delays: the probability of sampling a worker $w$ is $p_w = e^{\\kappa s_w} / \\sum_{w'} e^{\\kappa s_{w'}}$, where $s_w$ is sampled\nuniformly in $[0, 1]$, and $\\kappa = 5$. If the worker is in the \u2018read\u2019 state, then it picks a message uniformly\nat random, makes a local copy of the beliefs, and moves to state \u2018update.\u2019 Else, if the worker wakes\nup in state \u2018update,\u2019 then it computes the update from its local beliefs, writes the update to the global\nbeliefs, and goes back to state \u2018read.\u2019 This procedure creates delays between the read and write steps.\nOur first toy model consists of 3 binary variables and 2 pairwise factors, forming a chain graph. This\nmodel has a total of 4 messages. Factor values are sampled uniformly in the range $[-5, 5]$. In Fig. 1\n(left) we observe that as the number of workers grows, the updates become less effective due to stale\nbeliefs. Importantly, it takes 40 workers operating on 4 messages to observe divergence. 
We don\u2019t\nexpect a setting with more workers than messages to be observed in practice. We also adaptively\nchange the number of workers as suggested by our theory, which indeed helps to regain convergence.\nFig. 1 (middle) shows how the number of workers decreases as the objective approaches the optimum.\n\nFigure 2: For $\\gamma = 1$ and an 8 state model, we illustrate the convergence behavior of our approach\ncompared to HOGWILD!, for a variety of MRF configurations (2, 4, 8), and different number of\niterations (200, 400). Different number of threads are used for each configuration. (Panels: dual objective vs. time [ms] for our approach and for HOGWILD!; columns correspond to connectivity and iterations of 2/200, 4/200, 8/200, and 8/400; curves correspond to 1, 2, 4, 8, 16, 32, and 46 threads.)\n\nAlgorithm 2 HOGWILD! A single update\n1: Choose a region $r \\in \\mathcal{R}$ at random\n2: Update: $\\delta_{pr}(x_r)$ -= $\\eta_t \\mu_r(x_r)$ for all $x_r$, $p : r \\in p$\n          $\\delta_{rc}(x_c)$ += $\\eta_t \\mu_r(x_c)$ for all $x_c$, $c : c \\in r$\n\nOur second toy model consists of 6 binary variables forming a fully connected graph. This model has\n30 messages. 
In this setting, despite stale beliefs due to a skewed distribution, Fig. 1 (right) shows\nthat APCM is convergent even with 40 active workers. Hypothetically assuming 40 workers to run in\nparallel yields a significant speedup when compared to a single thread, as is illustrated by the dashed\nlines in Fig. 1.\n\n5.2 Disparity Estimation\nWe now proceed to test our approach on a disparity estimation task, a more realistic setup. In our\ncase, the employed pairwise graphical model, often also referred to as a pairwise Markov random\nfield (MRF), is grid structured. It has $144 \\times 185$ = 26,640 unary regions with 8 states and is a\ndownsampled version from Schwing et al. [2011]. We use the temperature parameter $\\gamma = 1$ for the\nsmooth objective (Eq. (3)). We compare our APCM-Star algorithm to the HOGWILD! approach\n[Recht et al., 2011], which employs an asynchronous parallel stochastic gradient descent method \u2013\nsummarized in Algorithm 2, where we use the shorthand $\\mu_r(x_c) = \\sum_{x'_{r \\setminus c}} \\mu_r(x_c, x'_{r \\setminus c})$. We refer\nthe reader to Appendix C in the supplementary material for additional results on graphical models\nwith larger state space size and for results regarding the non-smooth update obtained for $\\gamma = 0$. In\nshort, those results are similar to the ones reported here.\nNo synchronization is used for either HOGWILD! or our approach, i.e., we allow inconsistent\nreads and writes. Hence our optimization is lock-free and each of the threads is entirely devoted to\ncomputing and updating messages. We use one additional thread that constantly monitors progress\nby computing the objective in Eq. (3). We perform this function evaluation a fixed number of times,\neither 200 or 400 times. Running for more iterations lets us compare performance in the high-accuracy\nregime. During function evaluation, other threads randomly and independently choose a region $r$ and\nupdate the variables $\\delta_{\\cdot r}(\\cdot)$, i.e., we evaluate the Star block updates of Eq. (5). 
Our choice is motivated\nby the fact that Star block updates are more overlapping compared to Pencil updates, as they depend\non more variables. Therefore, Star blocks are harder to parallelize (see Theorem 2 in Appendix B).\nTo assess the performance of our technique we use pairwise graphical models of different densities.\nIn particular, we use a \u2018connection width\u2019 of 2, 4, or 8. This means we connect variables in the grid by\npairwise factors, if their $\\ell_\\infty$-norm distance is less than 2, 4, or 8. A \u2018connection width\u2019 of 2 is often\nalso referred to as 8-neighborhood, because a random variable is connected to its eight immediate\nneighbors. A \u2018connection width\u2019 of 4 or 8 connects a random variable to 48 or 224 neighboring\nvariables respectively. Hence, the connectivity of the employed graphical model is reasonably dense\nto observe inconsistent reads and writes. At the same time our experiments cover connection densities\nwell above many typical graphical models used in practice.\n\nFigure 3: Speedup w.r.t. single thread obtained for a specific number of threads for our approach\n(a) and HOGWILD! (b), using a variety of MRF neighborhoods (2, 4, 8), and different number of\niterations (200, 400). Speedups are shown for $\\gamma = 1$ and 8 states. (c) shows the speedup of our\nmethod compared to HOGWILD!.\n\nConvergence: In a first experiment we investigate the convergence behavior of our approach and\nthe HOGWILD! implementation for different graphical model configurations. 
We examine the\nbehavior when using one to 46 threads, where the number of threads is not adapted, but remains\nfixed throughout the run. The stepsize parameter, necessary in the case of HOGWILD!, is chosen\nto be as large as possible while still ensuring convergence (following Recht et al. [2011]). Note\nthat our approach is hyper-parameter free. Hence no tuning is required, which we consider an\nimportant practical advantage. We also evaluated HOGWILD! using a diminishing stepsize, but\nfound those results to be weaker than the ones reported here. Also note that a diminishing stepsize\nintroduces yet another hyper-parameter. Our results are provided in Fig. 2 for a temperature parameter of 1 and 8 states\nper random variable. We assess different MRF configurations (2, 4, 8 connectivity), and iterations\n(200, 400). Irrespective of the chosen setup, we observe monotone convergence even with 46 threads\nat play for both approaches. In none of our configurations do we observe any instability during\noptimization. As expected, we also observe that the exact minimization employed in our approach\nresults in significantly faster descent than use of the gradient (i.e., HOGWILD!). This is consistent\nwith the comparison of these methods in the sequential setting.\nThread speedup: In our second experiment we investigate the speedup obtained when using an\nincreasing number of threads. To this end we use the smallest dual value obtained with a single thread\nand illustrate how much faster we are able to obtain an identical or better value when using more than\none thread during computation. The results for all the investigated graphical model configurations\nare illustrated in Fig. 3 (a) for our approach and in Fig. 3 (b) for HOGWILD!. In these figures, we\nobserve very similar speedups across different graphical model configurations. We also observe that\nour approach scales just as well as the gradient-based technique does.\nHOGWILD! 
speedup: In our third experiment we directly compare HOGWILD! to our approach.\nMore specifically, we use the smallest dual value found with the gradient-based technique using a\nfixed number of threads, and assess how much faster the proposed approach is able to find an identical\nor better value when using the same number of threads. We show speedups of our approach compared\nto HOGWILD! in Fig. 3 (c). Considering the results presented in the previous paragraphs, speedups\nare to be expected. In all cases, we observe the speedups to be larger when using more threads.\nDepending on the model setup, we observe speedups to stabilize at values around 45 or higher.\nIn summary, we found our asynchronous optimization technique to be a compelling practical approach\nto infer approximate MAP configurations for graphical models.\n\n6 Conclusion\n\nWe believe that parallel algorithms are essential for dealing with the scale of modern problem instances\nin graphical models. This has led us to present an asynchronous parallel coordinate minimization\nalgorithm for MAP inference. Our theoretical analysis provides insights into the effect of stale\nupdates on the convergence and speedups of this scheme. Our empirical results show the great\npotential of this approach, achieving linear speedups with up to 46 concurrent threads.\nFuture work may include improving the analysis (possibly under additional assumptions), particularly\nthe restriction on the gradients in Theorems 1 and 2. An interesting extension of our work is to derive\nasynchronous parallel coordinate minimization algorithms for other objective functions, including\nthose arising in other inference tasks, such as marginal inference. 
Another natural extension is to\ntry our algorithms on MAP problems from other domains, such as natural language processing and\ncomputational biology, adding to our experiments on disparity estimation in computer vision.\n\nAcknowledgments\nThis material is based upon work supported in part by the National Science Foundation under Grant\nNo. 1718221. This work utilized computing resources provided by the Innovative Systems Lab (ISL)\nat NCSA.\n\nReferences\nA. Asuncion, P. Smyth, M. Welling, D. Newman, I. Porteous, and S. Triglia. Distributed Gibbs sampling for latent variable models. 2011.\n\nH. Avron, A. Druinsky, and A. Gupta. Revisiting asynchronous linear solvers: Provable convergence rate through randomization. J. ACM, 62(6):51:1\u201351:27, 2015.\n\nD. P. Bertsekas and J. N. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1989. ISBN 0-13-648700-9.\n\nL.-C. Chen*, A. G. Schwing*, A. L. Yuille, and R. Urtasun. Learning Deep Structured Models. In Proc. ICML, 2015. * equal contribution.\n\nJ. Choi and R. A. Rutenbar. Hardware implementation of MRF MAP inference on an FPGA platform. In Field Programmable Logic, 2012.\n\nD. Davis, B. Edmunds, and M. Udell. The sound of APALM clapping: Faster nonsmooth nonconvex optimization with stochastic asynchronous PALM. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 226\u2013234. 2016.\n\nA. Desmaison, R. Bunel, P. Kohli, P. H. Torr, and M. P. Kumar. Efficient continuous relaxations for dense CRF. In European Conference on Computer Vision, pages 818\u2013833, 2016.\n\nA. Globerson and T. Jaakkola. Fixing max-product: Convergent message passing algorithms for MAP LP-relaxations. In NIPS. MIT Press, 2008.\n\nJ. Gonzalez, Y. Low, and C. Guestrin. Parallel Inference on Large Factor Graphs. 
Cambridge University Press, 2011.\n\nT. Hazan and A. Shashua. Norm-product belief propagation: Primal-dual message-passing for approximate inference. IEEE Transactions on Information Theory, 56(12):6294\u20136316, 2010.\n\nC.-J. Hsieh, H.-F. Yu, and I. S. Dhillon. Passcode: Parallel asynchronous stochastic dual co-ordinate descent. In ICML, volume 15, pages 2370\u20132379, 2015.\n\nS. Hurkat, J. Choi, E. Nurvitadhi, J. F. Mart\u00ednez, and R. A. Rutenbar. Fast hierarchical implementation of sequential tree-reweighted belief propagation for probabilistic inference. In Field Programmable Logic, pages 1\u20138, 2015.\n\nJ. Johnson. Convex Relaxation Methods for Graphical Models: Lagrangian and Maximum Entropy Approaches. PhD thesis, EECS, MIT, 2008.\n\nJ. H. Kappes, B. Andres, F. A. Hamprecht, C. Schn\u00f6rr, S. Nowozin, D. Batra, S. Kim, B. X. Kausler, T. Kr\u00f6ger, J. Lellmann, N. Komodakis, B. Savchynskyy, and C. Rother. A comparative study of modern inference techniques for structured discrete energy minimization problems. International Journal of Computer Vision, 115(2):155\u2013184, 2015.\n\nD. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.\n\nV. Kolmogorov. Convergent tree-reweighted message passing for energy minimization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(10):1568\u20131583, 2006.\n\nN. Komodakis, N. Paragios, and G. Tziritas. MRF optimization via dual decomposition: Message-passing revisited, 2007.\n\nP. Kr\u00e4henb\u00fchl and V. Koltun. Efficient inference in fully connected CRFs with Gaussian edge potentials. In Advances in Neural Information Processing Systems 24, pages 109\u2013117. 2011.\n\nJ. Liu and S. J. Wright. Asynchronous stochastic coordinate descent: Parallelism and convergence properties. SIAM Journal on Optimization, 25(1):351\u2013376, 2015.\n\nJ. Liu, S. J. Wright, C. R\u00e9, V. Bittorf, and S. Sridhar. 
An asynchronous parallel stochastic coordinate descent algorithm. Journal of Machine Learning Research, 16:285\u2013322, 2015.\n\nN. Ma, Y. Xia, and V. K. Prasanna. Data parallelism for belief propagation in factor graphs. In 2011 23rd International Symposium on Computer Architecture and High Performance Computing, pages 56\u201363, 2011.\n\nO. Meshi, T. Jaakkola, and A. Globerson. Smoothed coordinate descent for MAP inference. In S. Nowozin, P. V. Gehler, J. Jancsary, and C. Lampert, editors, Advanced Structured Prediction. MIT Press, 2014.\n\nO. Meshi, M. Mahdavi, and A. G. Schwing. Smooth and strong: MAP inference with linear convergence. In Neural Information Processing Systems, 2015.\n\nY. Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341\u2013362, 2012.\n\nZ. Peng, Y. Xu, M. Yan, and W. Yin. Arock: An algorithmic framework for asynchronous parallel coordinate updates. SIAM Journal on Scientific Computing, 38(5):A2851\u2013A2879, 2016.\n\nN. Piatkowski and K. Morik. Parallel inference on structured data with CRFs on GPUs. In International Workshop at ECML PKDD on Collective Learning and Inference on Structured Data (COLISD2011), 2011.\n\nB. Recht, C. Re, S. Wright, and F. Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems 24. 2011.\n\nB. Savchynskyy, S. Schmidt, J. Kappes, and C. Schnorr. A study of Nesterov\u2019s scheme for Lagrangian decomposition and MAP labeling. CVPR, 2011.\n\nA. G. Schwing, T. Hazan, M. Pollefeys, and R. Urtasun. Distributed Message Passing for Large Scale Graphical Models. In Proc. CVPR, 2011.\n\nA. G. Schwing, T. Hazan, M. Pollefeys, and R. Urtasun. Globally Convergent Dual MAP LP Relaxation Solvers using Fenchel-Young Margins. In Proc. NIPS, 2012.\n\nA. G. Schwing, T. Hazan, M. Pollefeys, and R. Urtasun. 
Globally Convergent Parallel MAP LP Relaxation Solver using the Frank-Wolfe Algorithm. In Proc. ICML, 2014.\n\nY. Shimony. Finding the MAPs for belief networks is NP-hard. Artificial Intelligence, 68(2):399\u2013410, 1994.\n\nS. Singh, A. Subramanya, F. Pereira, and A. McCallum. Distributed MAP inference for undirected graphical models. In Neural Information Processing Systems (NIPS) Workshop on Learning on Cores, Clusters, and Clouds (LCCC), 2010.\n\nD. Sontag, A. Globerson, and T. Jaakkola. Introduction to dual decomposition for inference. In Optimization for Machine Learning, pages 219\u2013254. MIT Press, 2011.\n\nP. Tseng. On the rate of convergence of a partially asynchronous gradient projection algorithm. SIAM Journal on Optimization, 1(4):603\u2013619, 1991.\n\nM. Wainwright and M. I. Jordan. Graphical Models, Exponential Families, and Variational Inference. Now Publishers Inc., Hanover, MA, USA, 2008.\n\nY.-X. Wang, V. Sadhanala, W. Dai, W. Neiswanger, S. Sra, and E. Xing. Parallel and distributed block-coordinate Frank-Wolfe algorithms. In Proceedings of The 33rd International Conference on Machine Learning, pages 1548\u20131557, 2016.\n\nT. Werner. A linear programming approach to max-sum problem: A review. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(7):1165\u20131179, 2007.\n\nT. Werner. Revisiting the decomposition approach to inference in exponential families and graphical models. Technical Report CTU-CMP-2009-06, Czech Technical University, 2009.\n\nT. Werner. Revisiting the linear programming relaxation approach to Gibbs energy minimization and weighted constraint satisfaction. IEEE PAMI, 32(8):1474\u20131488, 2010.\n\nM. Wick, A. McCallum, and G. Miklau. Scalable probabilistic databases with factor graphs and MCMC. Proc. VLDB Endow., 3(1-2):794\u2013804, 2010.\n\nY. You, X. Lian, J. Liu, H.-F. Yu, I. S. Dhillon, J. Demmel, and C.-J. Hsieh. 
Asynchronous parallel greedy coordinate descent. In Advances in Neural Information Processing Systems 29, pages 4682\u20134690. 2016.\n\nJ. Zhang, A. G. Schwing, and R. Urtasun. Message Passing Inference for Large Scale Graphical Models with High Order Potentials. In Proc. NIPS, 2014.\n\nR. Zhang and J. T. Kwok. Asynchronous distributed ADMM for consensus optimization. In ICML, pages 1701\u20131709, 2014.\n\nB. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba. Semantic understanding of scenes through the ADE20K dataset. arXiv preprint arXiv:1608.05442, 2016.\n", "award": [], "sourceid": 2934, "authors": [{"given_name": "Ofer", "family_name": "Meshi", "institution": "Google"}, {"given_name": "Alexander", "family_name": "Schwing", "institution": "University of Illinois at Urbana-Champaign"}]}