{"title": "Parallel Recursive Best-First AND/OR Search for Exact MAP Inference in Graphical Models", "book": "Advances in Neural Information Processing Systems", "page_first": 928, "page_last": 936, "abstract": "The paper presents and evaluates the power of parallel search for exact MAP inference in graphical models. We introduce a new parallel shared-memory recursive best-first AND/OR search algorithm, called SPRBFAOO, that explores the search space in a best-first manner while operating with restricted memory. Our experiments show that SPRBFAOO is often superior to the current state-of-the-art sequential AND/OR search approaches, leading to considerable speed-ups (up to 7-fold with 12 threads), especially on hard problem instances.", "full_text": "Parallel Recursive Best-First AND/OR Search for\n\nExact MAP Inference in Graphical Models\n\nAkihiro Kishimoto\nIBM Research, Ireland\nakihirok@ie.ibm.com\n\nRadu Marinescu\n\nIBM Research, Ireland\n\nradu.marinescu@ie.ibm.com\n\nAdi Botea\n\nIBM Research, Ireland\nadibotea@ie.ibm.com\n\nAbstract\n\nThe paper presents and evaluates the power of parallel search for exact MAP\ninference in graphical models. We introduce a new parallel shared-memory recur-\nsive best-\ufb01rst AND/OR search algorithm, called SPRBFAOO, that explores the\nsearch space in a best-\ufb01rst manner while operating with restricted memory. Our\nexperiments show that SPRBFAOO is often superior to the current state-of-the-art\nsequential AND/OR search approaches, leading to considerable speed-ups (up to\n7-fold with 12 threads), especially on hard problem instances.\n\n1 Introduction\n\nGraphical models provide a powerful framework for reasoning with probabilistic information. These\nmodels use graphs to capture conditional independencies between variables, allowing a concise\nknowledge representation and ef\ufb01cient graph-based query processing algorithms. 
Combinatorial\nmaximization, or maximum a posteriori (MAP) tasks arise in many applications and often can be\nef\ufb01ciently solved by search schemes, especially in the context of AND/OR search spaces that are\nsensitive to the underlying problem structure [1].\nRecursive best-\ufb01rst AND/OR search (RBFAOO) is a recent yet very powerful scheme for exact MAP\ninference that was shown to outperform current state-of-the-art depth-\ufb01rst and best-\ufb01rst methods by\nseveral orders of magnitude on a variety of benchmarks [2]. RBFAOO explores the context minimal\nAND/OR search graph associated with a graphical model in a best-\ufb01rst manner (even with non-\nmonotonic heuristics) while running within restricted memory. RBFAOO extends Recursive Best-\nFirst Search (RBFS) [3] to graphical models and thus uses a threshold controlling technique to drive\nthe search in a depth-\ufb01rst like manner while using the available memory for caching.\nUp to now, search-based MAP solvers were developed primarily as sequential search algorithms.\nHowever, parallel, multi-core processing can be a powerful approach to boosting the performance\nof a problem solver. Now that multi-core computing systems are ubiquitous, one way to extract\nsubstantial speed-ups from the hardware is to resort to parallel processing. Parallel search has been\nsuccessfully employed in a variety of AI areas, including planning [4], satis\ufb01ability [5], and game\nplaying [6, 7]. However, little research has been devoted to solving graphical models in parallel.\nThe only parallel search scheme for MAP inference in graphical models that we are aware of is\nthe distributed AND/OR Branch and Bound algorithm (daoopt) [8]. 
This assumes, however, a large and distributed computational grid environment with hundreds of independent and loosely connected computing systems, without access to a shared memory space for caching and reusing partial results.

Contribution. In this paper, we take a radically different approach and explore the potential of parallel search for MAP tasks in a shared-memory environment which, to our knowledge, has not been attempted before. We introduce SPRBFAOO, a new parallelization of RBFAOO in shared-memory environments. SPRBFAOO maintains a single cache table shared among the threads. In this way, each thread can effectively reuse the search effort performed by others. Since all threads start from the root of the search graph using the same search strategy, an effective load balancing is obtained without using sophisticated schemes, as done in previous work [8]. An extensive empirical evaluation shows that our new parallel recursive best-first AND/OR search scheme improves considerably over current state-of-the-art sequential AND/OR search approaches, in many cases leading to considerable speed-ups (up to 7-fold using 12 threads), especially on hard problem instances.

[Figure 1: A simple graphical model and its associated AND/OR search graph: (a) primal graph, (b) pseudo tree, (c) context minimal AND/OR search graph.]

2 Background

Graphical models (e.g., Bayesian Networks [9] or Markov Random Fields [10]) capture the factorization structure of a distribution over a set of variables. A graphical model is a tuple M = ⟨X, D, F⟩, where X = {X_i : i ∈ V} is a set of variables indexed by set V and D = {D_i : i ∈ V} is the set of their finite domains of values. F = {ψ_α : α ∈ F} is a set of discrete positive real-valued local functions defined on subsets of variables, where F ⊆ 2^V is a set of variable subsets.
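To make the tuple M = ⟨X, D, F⟩ concrete, here is a minimal brute-force sketch (our own illustration, not the authors' code; the two-variable model and its factor tables are invented) that enumerates all assignments and returns the one maximizing the product of the local function values, i.e., the MAP assignment:

```python
from itertools import product

# A tiny invented model: one unary factor on A, one pairwise factor on (A, B).
domains = {"A": [0, 1], "B": [0, 1]}
factors = [
    (("A",), {(0,): 0.6, (1,): 0.4}),
    (("A", "B"), {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.8}),
]

def map_assignment(domains, factors):
    """Brute-force MAP: return the assignment maximizing the factor product."""
    variables = sorted(domains)
    best, best_score = None, float("-inf")
    for values in product(*(domains[v] for v in variables)):
        x = dict(zip(variables, values))
        score = 1.0
        for scope, table in factors:
            score *= table[tuple(x[v] for v in scope)]
        if score > best_score:
            best, best_score = x, score
    return best, best_score

best, score = map_assignment(domains, factors)
# best == {"A": 0, "B": 0}; score == 0.6 * 0.9 = 0.54
```

Enumeration is exponential in the number of variables, which is exactly why the search schemes discussed in this paper exploit AND/OR decomposition instead.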
We use α ⊆ V and X_α ⊆ X to indicate the scope of function ψ_α, i.e., X_α = var(ψ_α) = {X_i : i ∈ α}. The function scopes yield a primal graph whose vertices are the variables and whose edges connect any two variables that appear in the scope of the same function. The graphical model M defines a factorized probability distribution on X, as follows: P(X) = (1/Z) ∏_{α∈F} ψ_α(X_α), where the partition function, Z, normalizes the probability.

An important inference task which appears in many real-world applications is maximum a posteriori inference (MAP, sometimes called maximum probable explanation or MPE). MAP/MPE finds a complete assignment to the variables that has the highest probability (i.e., a mode of the joint probability), namely: x* = argmax_x ∏_{α∈F} ψ_α(x_α). The task is NP-hard to solve in general [9]. In this paper we focus on solving MAP as a minimization problem, taking the negative logarithm of the local functions to avoid numerical issues, namely: x* = argmin_x Σ_{α∈F} −log ψ_α(x_α).

Significant improvements for MAP inference have been achieved by using AND/OR search spaces, which often capture problem structure far better than standard OR search methods [11]. A pseudo tree of the primal graph captures the problem decomposition and is used to define the search space. A pseudo tree of an undirected graph G = (V, E) is a directed rooted tree T = (V, E′) such that every arc of G not included in E′ is a back-arc in T, namely it connects a node in T to one of its ancestors in T. The arcs in E′ may not all be included in E.

Given a graphical model M = ⟨X, D, F⟩ with primal graph G and a pseudo tree T of G, the AND/OR search tree S_T has alternating levels of OR nodes corresponding to the variables and AND nodes corresponding to the values of the OR parent's variable, with edges weighted according to F. We denote the weight on the edge from OR node n to AND node m by w(n, m).
Identical sub-problems, identified by their context (the partial instantiation that separates the sub-problem from the rest of the problem graph), can be merged, yielding an AND/OR search graph [11]. Merging all context-mergeable nodes yields the context minimal AND/OR search graph, denoted by C_T. The size of C_T is exponential in the induced width of G along a depth-first traversal of T [11].

A solution tree T of C_T is a subtree such that: (1) it contains the root node of C_T; (2) if an internal AND node n is in T then all of its children are in T; (3) if an internal OR node n is in T then exactly one of its children is in T; (4) every tip node in T (i.e., a node with no children) is a terminal node. The cost of a solution tree is the sum of the weights associated with its edges.

Each node n in C_T is associated with a value v(n) capturing the optimal solution cost of the conditioned sub-problem rooted at n. It was shown that v(n) can be computed recursively based on the values of n's children: at OR nodes by minimization, at AND nodes by summation (see also [11]).

Example 1. Figure 1(a) shows the primal graph of a simple graphical model with 5 variables and 7 binary functions. Figure 1(c) displays the context minimal AND/OR search graph based on the pseudo tree from Figure 1(b) (the contexts are shown next to the pseudo tree nodes). A solution tree corresponding to the assignment (A = 0, B = 1, C = 1, D = 0, E = 0) is shown in red.

Current state-of-the-art sequential search methods for exact MAP inference perform either depth-first or best-first search. Prominent methods, studied and evaluated extensively, are AND/OR Branch and Bound (AOBB) [1] and Best-First AND/OR Search (AOBF) [12].
More recently, Recursive Best-First AND/OR Search (RBFAOO) [2] has emerged as the best performing algorithm for exact MAP inference. RBFAOO belongs to the class of RBFS algorithms and employs a local threshold controlling mechanism to explore the AND/OR search graph in a depth-first like manner [3, 13]. RBFAOO maintains at each node n a lower bound q(n) (called the q-value) on v(n). During search, RBFAOO improves and caches in a fixed-size table q(n), which is calculated by propagating back the q-values of n's children. RBFAOO stops when q(r) = v(r) at the root r, or when it proves that there is no solution, namely q(r) = v(r) = ∞.

3 Our Parallel Algorithm

Algorithm 1 SPRBFAOO
  for all i from 1 to nr_CPU_cores do
    root.th ← ∞ − ε; root.thub ← ∞
    launch tRBFS(root) on a separate thread
  wait for the threads to finish their work
  return optimal cost (e.g., as root's q-value in the cache)

We now describe SPRBFAOO, a parallelization of RBFAOO in shared-memory environments. SPRBFAOO's threads start from the root and run in parallel, as shown in Algorithm 1. Threads share one cache table, allowing them to reuse each other's results. An entry in the cache table, corresponding to a node n, is a tuple with 6 fields: a q-value q(n), a lower bound on the optimal cost of node n; n.solved, a flag indicating whether n is solved optimally; a virtual q-value vq(n), defined later in this section; a best known solution cost bs(n) for node n; the number of threads currently working on n; and a lock. When accessing a cache entry, a thread locks it temporarily against the other threads. The method Ctxt(n) identifies the context of n, which is further used to access the corresponding cache entry. Besides the cache, which is shared among the threads, each thread uses two threshold values, n.th and n.thub, for each node n. These are separate from one thread to another. Algorithm 2 shows the procedure invoked on each thread.
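Before walking through Algorithm 2, the shared cache entry and the role of the virtual q-value can be illustrated with a small self-contained sketch (ours, not the authors' C++ implementation; all field and function names are assumptions). Each simulated thread enters the unsolved child with the smallest vq and bumps it by ζ, so nearly tied children get split between threads:

```python
import threading
from dataclasses import dataclass, field

INF = float("inf")

@dataclass
class CacheEntry:
    # The six fields described above; names are our own.
    q: float = INF          # lower bound on the node's optimal cost
    solved: bool = False    # cost proven optimal (or node proven infeasible)
    vq: float = INF         # virtual q-value: q inflated per working thread
    bs: float = INF         # best known solution cost through this node
    nr_threads: int = 0     # threads currently examining this node
    lock: threading.Lock = field(default_factory=threading.Lock)

# Shared table mapping a node's context to its entry; here just two children.
cache = {
    "c1": CacheEntry(q=1.0, vq=1.0),
    "c2": CacheEntry(q=1.005, vq=1.005),
}

def enter_best_child(zeta):
    """Pick the unsolved child with the smallest vq, then inflate its vq by
    zeta (mirroring lines 1-2 of tRBFS) so later threads are steered away."""
    best = min((c for c in cache if not cache[c].solved),
               key=lambda c: cache[c].vq)
    entry = cache[best]
    with entry.lock:        # entries are locked only briefly
        entry.nr_threads += 1
        entry.vq += zeta
    return best

zeta = 0.01
first = enter_best_child(zeta)   # picks "c1" (vq 1.0 < 1.005), bumps it to 1.01
second = enter_best_child(zeta)  # now picks "c2" (vq 1.005 < 1.01)
```

With a clearly dominant child, several bumps of size ζ would be needed before a second thread diverges, which is exactly the intended behavior: promising regions attract more threads.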
Algorithm 2 Method tRBFS (handling of locks is skipped for clarity)
Require: node n
 1: IncrementNrThreadsInCache(Ctxt(n))
 2: IncreaseVQInCache(Ctxt(n), ζ)
 3: if n has no children then
 4:   (q, solved) ← Evaluate(n)
 5:   SaveInCache(Ctxt(n), q, solved, q, q)
 6:   DecrementNrThreadsInCache(Ctxt(n))
 7:   return
 8: GenerateChildren(n)
 9: if n is an OR node then
10:   loop
11:     (cbest, vq, vq2, q, bs) ← BestChild(n)
12:     n.thub ← min(n.thub, bs)
13:     if n.th < vq ∨ q ≥ n.thub ∨ n.solved then
14:       break
15:     cbest.th ← min(n.th, vq2 + δ) − w(n, cbest)
16:     cbest.thub ← n.thub − w(n, cbest)
17:     tRBFS(cbest)
18:
19: if n is an AND node then
20:   loop
21:     (q, vq, bs) ← Sum(n)
22:     n.thub ← min(n.thub, bs)
23:     if n.th < vq ∨ q ≥ n.thub ∨ n.solved then
24:       break
25:     (cbest, q_cbest, vq_cbest) ← UnsolvedChild(n)
26:     cbest.th ← n.th − (vq − vq_cbest)
27:     cbest.thub ← n.thub − (q − q_cbest)
28:     tRBFS(cbest)
29: if n.solved ∨ NrThreadsInCache(Ctxt(n)) = 1 then
30:   vq ← q
31: DecrementNrThreadsInCache(Ctxt(n))
32: SaveInCache(Ctxt(n), q, n.solved, vq, bs)

When a thread examines a node n, it first increments in the cache the number of threads working on node n (line 1). Then it increases vq(n) by an increment ζ, and stores the new value in the cache (line 2). The virtual q-value vq(n) is initially set to q(n). As more threads work on solving n, vq(n) grows due to the repeated increases by ζ. In effect, vq(n) reflects both the estimated cost of node n (through its q(n) component) and the number of threads working on n. By computing vq(n) this way, our goal is to dynamically control the degree to which threads overlap when exploring the search space. When a given area of the search space is more promising than others, more than one thread is encouraged to work within that area. On the other hand, when several areas are roughly equally promising, threads should diverge and work on different areas. Indeed, in Algorithm 2, the tests on lines 13 and 23 prevent a thread from working on a node n if n.th < vq(n). (The other conditions in these tests are discussed later.) A large vq(n), which increases the likelihood that n.th < vq(n), may reflect a less promising node (i.e., a large q-value), or many threads working on n, or both. Thus, our strategy is an automated and dynamic way of tuning the number of threads working on solving a node n as a function of how promising that node is. We call this the thread coordination mechanism.

Lines 4–7 address the case of nodes with no children, which are either terminal nodes or deadends. In both cases, method Evaluate sets the solved flag to true. The q-value q is set to 0 for terminal nodes and to ∞ otherwise. Method SaveInCache takes as arguments the context of the node and four values to be stored, in order, in these fields of the corresponding cache entry: q, solved, vq, and bs.

Lines 10–17 and 20–28 show, respectively, the cases when the current node n is an OR node or an AND node. Both follow a similar high-level sequence of steps:

• Update vq, q, and bs for n from the children's values (lines 11, 21). Also update n.thub (lines 12, 22), an upper bound on the best solution cost known for n so far. Methods BestChild and Sum are shown in Algorithm 3. In these, child node information is either retrieved from the cache, if available, or initialized with an admissible heuristic function h.

• Perform the backtracking test (lines 13–14 and 23–24).
The thread backtracks to n's parent if at least one of the following conditions holds: n.th < vq(n), discussed earlier; q(n) ≥ n.thub, i.e., a solution containing n cannot possibly beat the best known solution (we call this the suboptimality test); or the node is solved. The solved flag is true iff the node cost has been proven to be optimal, or the node was proven not to have any solution.

• Otherwise, select a successor cbest to continue with (lines 11, 25). At OR nodes n, cbest is the child with the smallest vq among all children not solved yet (see method BestChild). At AND nodes, any unsolved child can be chosen. Then, update the thresholds of cbest (lines 15–16 and 26–27), and recursively process cbest (lines 17, 28). The threshold n.th is updated in a similar way to RBFAOO, including the overestimation parameter δ (see [2]). However, there are two key differences. First, we use vq instead of q, to obtain the thread coordination mechanism presented earlier. Second, we use two thresholds, th and thub, instead of just th, with thub being used to implement the suboptimality test q(n) ≥ n.thub.

When a thread backtracks to n's parent, if either n's solved flag is set, or no other thread currently examines n, the thread sets vq(n) to q(n) (lines 29–30 in Algorithm 2). In this way, SPRBFAOO reduces the frequency of the scenarios where n is considered to be less promising than it actually is. Finally, the thread decrements in the cache the number of threads working on n (line 31), and saves in the cache the recalculated vq(n), q(n), bs(n), and the solved flag (line 32).

Theorem 3.1. With an admissible heuristic in use, SPRBFAOO returns optimal solutions.

Proof sketch. SPRBFAOO's bs(r) at the root r is computed from a solution tree; therefore, bs(r) ≥ v(r). Additionally, SPRBFAOO determines solution optimality by using not vq(n) but the q(n) saved in the cache table.
By an induction-based argument similar to that of Theorem 3.1 in [2], q(n) ≤ v(n) holds for any q(n) saved in the cache table when h is admissible, which implies q(r) ≤ v(r). When SPRBFAOO returns a solution, bs(r) = q(r); therefore, bs(r) = q(r) = v(r).

We conjecture that SPRBFAOO is also complete, and leave a more in-depth analysis as future work.

Algorithm 3 Method BestChild
Require: node n
 1: n.solved ← ⊥ (⊥ stands for false)
 2: initialize vq, vq2, q, bs to ∞
 3: for all children ci of n do
 4:   if Ctxt(ci) in cache then
 5:     (q_ci, s_ci, vq_ci, bs_ci) ← FromCache(Ctxt(ci))
 6:   else
 7:     (q_ci, s_ci, vq_ci, bs_ci) ← (h(ci), ⊥, h(ci), ∞)
 8:   q_ci ← w(n, ci) + q_ci
 9:   vq_ci ← w(n, ci) + vq_ci
10:   bs ← min(bs, w(n, ci) + bs_ci)
11:   if (q_ci < q) ∨ (q_ci = q ∧ ¬n.solved) then
12:     n.solved ← s_ci; q ← q_ci
13:   if vq_ci < vq ∧ ¬s_ci then
14:     vq2 ← vq; vq ← vq_ci; cbest ← ci
15:   else if vq_ci < vq2 ∧ ¬s_ci then
16:     vq2 ← vq_ci
17: return (cbest, vq, vq2, q, bs)

Algorithm 3 (continued) Method Sum
Require: node n
 1: n.solved ← ⊤ (⊤ stands for true)
 2: initialize vq, q, bs to 0
 3: for all children ci of n do
 4:   if Ctxt(ci) in cache then
 5:     (q_ci, s_ci, vq_ci, bs_ci) ← FromCache(Ctxt(ci))
 6:   else
 7:     (q_ci, s_ci, vq_ci, bs_ci) ← (h(ci), ⊥, h(ci), ∞)
 8:   q ← q + q_ci
 9:   vq ← vq + vq_ci
10:   bs ← bs + bs_ci
11:   n.solved ← n.solved ∧ s_ci
12: return (q, vq, bs)

4 Experiments

We evaluate empirically our parallel SPRBFAOO and compare it against sequential RBFAOO and AOBB. We also considered a parallel shared-memory AOBB, denoted SPAOBB, which uses a master thread to explore centrally the AND/OR search graph up to a certain depth and solves the remaining conditioned sub-problems in parallel using a set of worker threads. The cache table is shared among the workers so that some workers may reuse partial search results recorded by others.
In our implementation, the search space explored by the master corresponds to the first m variables in the pseudo tree. The performance of SPAOBB was very poor across all benchmarks, due to noticeably large search overhead as well as poor load balancing, and therefore its results are omitted hereafter.

All competing algorithms (SPRBFAOO, RBFAOO and AOBB) use the pre-compiled mini-bucket heuristic [1] for guiding the search. The heuristic is controlled by a parameter called the i-bound, which allows a trade-off between accuracy and time/space requirements: higher values of i yield a more accurate heuristic but take more time and space to compute. The search algorithms were also restricted to a static variable ordering obtained as a depth-first traversal of a min-fill pseudo tree [1].

Our benchmark problems1 include three sets of instances from genetic linkage analysis (denoted pedigree) [14], grid networks and protein side-chain interaction networks (denoted protein) [15]. In total, we evaluated 21 pedigrees, 32 grids and 240 protein networks. The algorithms were implemented in C++ (64-bit) and the experiments were run on a 2.6GHz 12-core processor with 80GB of RAM. Following [2], RBFAOO ran with a 10–20GB cache table (134,217,728 entries) and overestimation parameter δ = 1. However, SPRBFAOO allocated only 95,869,805 entries within the same amount of memory, due to extra information such as the virtual q-values. We set ζ = 0.01 throughout the experiments (except those where we vary ζ). The time limit was set to 2 hours.
We also record typical ranges of problem-specific parameters, shown in Table 1: the number of variables (n), maximum domain size (k), induced width (w*), and depth of the pseudo tree (h).

Table 1: Ranges (min–max) of the benchmark problem parameters.

benchmark   n            k      w*       h
grid        144 – 676    2      15 – 36  48 – 136
pedigree    334 – 1289   3 – 7  15 – 33  51 – 140
protein     26 – 177     81     6 – 16   15 – 43

Table 2: Number of unsolved problem instances (1 vs 12 cores).

            grid          pedigree      protein
method      i=6   i=14    i=6   i=14    i=2   i=4
RBFAOO      9     5       8     6       41    16
SPRBFAOO    7     5       7     3       32    9

1 http://graphmod.ics.uci.edu

[Table 3: Total CPU time (sec) and nodes expanded on grid instances (75-22-5, 75-24-5, 90-30-5) and pedigree instances (pedigree7, pedigree9, pedigree19) for AOBB, RBFAOO and SPRBFAOO, with columns indexed by i-bounds 6 to 14; the mini-bucket pre-compilation time (mbe) is also recorded per i-bound. Time limit 2 hours.]

[Figure 2: Total CPU time (sec) for RBFAOO vs. SPRBFAOO with smaller (top) and larger (bottom) i-bounds. Time limit 2 hours. i ∈ {6, 14} for grid and pedigree, i ∈ {2, 4} for protein.]

The primary performance measures reported are the run time and the number of node expansions during search. When the run time of a solver is discussed, the total CPU time, reported in seconds, is one metric of overall performance. The total CPU time consists of the heuristic compilation time and the search time. SPRBFAOO does not reduce the heuristic compilation time, which is incurred sequentially.
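Because the mini-bucket compilation remains sequential, an Amdahl-style back-of-the-envelope calculation (our own illustration, with invented timings) shows how it caps the achievable total-time speed-up even if the search phase scaled perfectly:

```python
def overall_speedup(mbe, search, threads):
    """Total-time speed-up when only the search phase parallelizes
    perfectly and the compilation time (mbe) stays sequential."""
    return (mbe + search) / (mbe + search / threads)

# Invented timings: 100s of compilation plus 1100s of sequential search.
s12 = overall_speedup(mbe=100.0, search=1100.0, threads=12)
# s12 is about 6.26: well below 12, despite ideal search scaling.
```

Larger i-bounds shift more of the total time into the sequential (mbe) part, consistent with the diminishing returns of parallel search observed in the experiments.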
Note that parallelizing the heuristic compilation is an important extension left as future work.

Parallel versus sequential search. Table 3 shows detailed results (as total CPU time in seconds and nodes expanded) for solving grid and pedigree instances using parallel and sequential search. The columns are indexed by the i-bound. For each problem instance, we also record the mini-bucket heuristic pre-compilation time, denoted (mbe), corresponding to each i-bound. SPRBFAOO ran with 12 threads. We can see that SPRBFAOO improves considerably over RBFAOO across all reported i-bounds. The benefit of parallel search is more clearly observed at smaller i-bounds, which correspond to relatively weak heuristics. In this case, the heuristic is less likely to guide the search towards more promising regions of the search space, and therefore diversifying the search via multiple parallel threads is key to achieving significant speed-ups. For example, on grid 75-22-5, SPRBFAOO(6) is almost 6 times faster than RBFAOO(6). Similarly, SPRBFAOO(8) solves the pedigree7 instance while RBFAOO(8) runs out of time. This is important since on very hard problem instances it may only be possible to compute rather weak heuristics given limited resources. Notice also that the pre-processing time (mbe) increases with the i-bound.

[Figure 3: Total search time (sec) and average speed-up as a function of parameter ζ. Time limit 2 hours. i = 14 for grid and pedigree, i = 4 for protein.]
Table 2 shows the number of unsolved problems in each domain. Note that SPRBFAOO solved all instances solved by RBFAOO. Figure 2 plots the total CPU time obtained by RBFAOO and SPRBFAOO using smaller (resp. larger) i-bounds, corresponding to relatively weak (resp. strong) heuristics. We selected i ∈ {6, 14} for grid and pedigree, and i ∈ {2, 4} for protein. Specifically, i = 6 (grids, pedigrees) and i = 2 (proteins) were the smallest i-bounds for which SPRBFAOO could solve at least two thirds of the instances within the 2 hour time limit, while i = 14 (grids, pedigrees) and i = 4 (proteins) were the largest i-bounds for which we could compile the heuristics without running out of memory on all instances. The data points shown in green correspond to problem instances that were solved only by SPRBFAOO. As before, we notice the benefit of parallel search when using relatively weak heuristics. The largest speed-up, of 9.59, is obtained on the pdbilk protein instance with i = 2. As the i-bound increases and the heuristics become more accurate, the difference between RBFAOO(i) and SPRBFAOO(i) decreases, because both algorithms are guided more effectively towards the sub-space containing the optimal solution. In addition, the overhead associated with larger i-bounds, which is incurred sequentially, considerably offsets the speed-up obtained by SPRBFAOO(i) over RBFAOO(i) (see, for example, the plot for protein instances with i = 4).

We also observed that SPRBFAOO's speed-up over RBFAOO increases sublinearly as more threads are used (we experimented with 3, 6, and 12 threads, respectively). In addition to search overhead, synchronization overhead is another cause of the sublinear speed-ups. The synchronization overhead can be estimated by checking the node expansion rate per thread.
For example, in the case of SPRBFAOO with 12 threads, the node expansion rate per thread slows down to 47%, 50%, and 61% of RBFAOO's in grid (i = 6), pedigree (i = 6), and protein (i = 2), respectively. This implies that the overhead related to locks is large. Since these numbers with 6 threads are 73%, 79%, and 96%, respectively, the slowdown becomes more severe with more threads. We hypothesize that, due to the nature of the virtual q-value, SPRBFAOO's threads tend to follow the same path from the root until the search directions are diversified, and frequently access the cache table entries of the internal nodes located on that path, where non-negligible lock contention occurs.

Finally, SPRBFAOO's load balance is quite stable in all domains, especially once all threads have been invoked and have performed search for a while. For example, its load balance ranges between 1.005–1.064, 1.013–1.049, and 1.004–1.117 for grid (i = 6), pedigree (i = 6), and protein (i = 2), respectively, on those instances where SPRBFAOO expands at least 1 million nodes with 12 threads.

Impact of parameter ζ. In Figure 3 we analyze the performance of SPRBFAOO with 12 threads as a function of the parameter ζ, which controls the way different threads are encouraged or discouraged to start exploring a specific subproblem (see also Section 3). For this purpose, and to better understand SPRBFAOO's scaling behavior, we ignore the heuristic compilation time. Therefore, we show the total search time (in seconds) over the instances that all parallel versions solve, and the search-time-based average speed-ups over the instances that RBFAOO needs at least 1 second to solve. We obtained these numbers for ζ ∈ {0.001, 0.01, 0.1}. We see that all ζ values lead to improved speed-ups.
This is important because, unlike the approach of [8], which involves a sophisticated load balancing scheme, ours is considerably simpler yet efficient, and only requires tuning a single parameter (ζ).

Table 4: Total CPU time (sec) and node expansions for hard pedigree instances. SPRBFAOO ran with 12 threads, i = 20 (type4b) and i = 16 (largeFam). Time limit 100 hours.

instance         (n, k, w*, h)    (mbe) time  RBFAOO time / nodes     SPRBFAOO time / nodes
type4b-100-19    (7308,5,29,354)  400         132711 / 22243047591    42846 / 50509174040
type4b-120-17    (7766,5,24,319)  191         210 / 4297063           195 / 6046663
type4b-130-21    (8883,5,29,416)  281         290760 / 51481315386    149321 / 177393525747
type4b-140-19    (9274,5,30,366)  488         248376 / 39920187143    74643 / 85152364623
largeFam3-10-52  (1905,3,36,80)   13          154994 / 19363865449    50700 / 44073583335

Of the three ζ values, while SPRBFAOO with ζ = 0.1 spends the largest total search time, it yields the best speed-up. This indicates a trade-off in selecting ζ. Since the instances used to calculate the speed-up values must be solved by RBFAOO, they include relatively easy instances. On the other hand, several difficult instances solved by SPRBFAOO with 12 threads are included in calculating the total search time. In the case of ζ = 0.1, because of increased search overhead, SPRBFAOO needs more search time to solve these difficult instances. There is also one protein instance unsolved with ζ = 0.1 but solved with ζ = 0.01 and ζ = 0.001. This phenomenon can be explained as follows. With a large ζ, SPRBFAOO searches in more diversified directions, which can reduce lock contention, resulting in improved speed-up values.
However, due to the larger diversification, when SPRBFAOO with ζ = 0.1 solves difficult instances, it may focus on less promising portions of the search space, resulting in increased total search time.

Summary of the experiments. In terms of search-time-based speed-ups, our parallel shared-memory method SPRBFAOO improved considerably over its sequential counterpart RBFAOO, by up to 7 times using 12 threads. At relatively large i-bounds, the corresponding computational overhead typically outweighed the gains obtained by parallel search. Still, parallel search had the advantage of solving additional instances unsolved by serial search. Finally, in Table 4 we report the results obtained on 5 very hard pedigree instances from [2] ((mbe) records the heuristic compilation time). We see again that SPRBFAOO improved over RBFAOO on all instances, while achieving a total-time-based speed-up of 3 on two of them (i.e., type4b-100-19 and largeFam3-10-52).

5 Related Work

The distributed AOBB algorithm daoopt [8], which builds on the notion of parallel tree search [16], explores centrally the search tree up to a certain depth and solves the remaining conditioned sub-problems in parallel using a large grid of distributed processing units, without a shared cache.

In parallel evidence propagation, the notion of pointer jumping has been used for exact probabilistic inference. For example, Pennock [17] performs a theoretical analysis. Xia and Prasanna [18] split a junction tree into chains whose evidence propagation is performed in parallel in a distributed-memory environment, and merge the results later on.

Proof-number search (PNS) in AND/OR spaces [19] and its parallel variants [20] have been shown to be effective in two-player games. As PNS is suboptimal, it cannot be applied as-is to exact MAP inference. Kaneko [21] presents a shared-memory parallel depth-first proof-number search with virtual proof and disproof numbers (vpdn).
Virtual proof and disproof numbers combine the proof and disproof numbers [19] with the number of threads examining a node; thus, our vq(n) is closely related to vpdn. However, vpdn suffers from an over-counting problem, which we avoid through the way we dynamically update vq(n). Saito et al. [22] use threads that probabilistically deviate from the best-first strategy. Hoki et al. [23] add small random values to the proof and disproof numbers of each thread, without sharing any cache table.

6 Conclusion

We presented SPRBFAOO, a new shared-memory parallel recursive best-first AND/OR search scheme for graphical models. Using virtual q-values shared in a single cache table, SPRBFAOO enables threads to work on promising regions of the search space while effectively reusing the search effort of other threads. A homogeneous search mechanism across the threads achieves effective load balancing without resorting to the sophisticated schemes used in related work [8]. We prove the correctness of the algorithm. In experiments, SPRBFAOO improves considerably over current state-of-the-art sequential AND/OR search approaches, in many cases leading to considerable speed-ups (up to 7-fold using 12 threads), especially on hard problem instances. Ongoing and future research directions include proving the completeness conjecture, extending SPRBFAOO to distributed-memory environments, and parallelizing the mini-bucket heuristic for shared and distributed memory.

References

[1] R. Marinescu and R. Dechter. AND/OR branch-and-bound search for combinatorial optimization in graphical models. Artificial Intelligence, 173(16-17):1457–1491, 2009.
[2] A. Kishimoto and R. Marinescu. Recursive best-first AND/OR search for optimization in graphical models. In Uncertainty in Artificial Intelligence (UAI), pages 400–409, 2014.
[3] R. Korf. Linear-space best-first search. Artificial Intelligence, 62(1):41–78, 1993.
[4] A.
Kishimoto, A. Fukunaga, and A. Botea. Evaluation of a simple, scalable, parallel best-first search strategy. Artificial Intelligence, 195:222–248, 2013.
[5] W. Chrabakh and R. Wolski. Gradsat: A parallel SAT solver for the Grid. Technical report, University of California at Santa Barbara, 2003.
[6] M. Campbell, A. J. Hoane Jr., and F.-h. Hsu. Deep Blue. Artificial Intelligence, 134(1-2):57–83, 2002.
[7] M. Enzenberger, M. Müller, B. Arneson, and R. Segal. FUEGO - an open-source framework for board games and Go engine based on Monte-Carlo tree search. IEEE Transactions on Computational Intelligence and AI in Games, 2(4):259–270, 2010.
[8] L. Otten and R. Dechter. A case study in complexity estimation: Towards parallel branch-and-bound over graphical models. In Uncertainty in Artificial Intelligence (UAI), pages 665–674, 2012.
[9] J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, 1988.
[10] S. L. Lauritzen. Graphical Models. Clarendon Press, 1996.
[11] R. Dechter and R. Mateescu. AND/OR search spaces for graphical models. Artificial Intelligence, 171(2-3):73–106, 2007.
[12] R. Marinescu and R. Dechter. Memory intensive AND/OR search for combinatorial optimization in graphical models. Artificial Intelligence, 173(16-17):1492–1524, 2009.
[13] A. Nagai. Df-pn Algorithm for Searching AND/OR Trees and Its Applications. PhD thesis, The University of Tokyo, 2002.
[14] M. Fishelson and D. Geiger. Exact genetic linkage computations for general pedigrees. Bioinformatics, 18(1):189–198, 2002.
[15] C. Yanover, O. Schueler-Furman, and Y. Weiss. Minimizing and learning energy functions for side-chain prediction. Journal of Computational Biology, 15(7):899–911, 2008.
[16] A. Grama and V. Kumar. State of the art in parallel search techniques for discrete optimization problems.
IEEE Transactions on Knowledge and Data Engineering, 11(1):28–35, 1999.
[17] D. Pennock. Logarithmic time parallel Bayesian inference. In Uncertainty in Artificial Intelligence (UAI), pages 431–438, 1998.
[18] Y. Xia and V. K. Prasanna. Junction tree decomposition for parallel exact inference. In IEEE International Symposium on Parallel and Distributed Processing (IPDPS), 2008.
[19] L. V. Allis, M. van der Meulen, and H. J. van den Herik. Proof-number search. Artificial Intelligence, 66(1):91–124, 1994.
[20] A. Kishimoto, M. Winands, M. Müller, and J.-T. Saito. Game-tree search using proof numbers: The first twenty years. ICGA Journal, 35(3):131–156, 2012.
[21] T. Kaneko. Parallel depth first proof number search. In AAAI Conference on Artificial Intelligence, pages 95–100, 2010.
[22] J.-T. Saito, M. H. M. Winands, and H. J. van den Herik. Randomized parallel proof-number search. In Advances in Computer Games (ACG'09), volume 6048 of Lecture Notes in Computer Science, pages 75–87. Springer, 2010.
[23] K. Hoki, T. Kaneko, A. Kishimoto, and T. Ito. Parallel dovetailing and its application to depth-first proof-number search. ICGA Journal, 36(1):22–36, 2013.