{"title": "M-Walk: Learning to Walk over Graphs using Monte Carlo Tree Search", "book": "Advances in Neural Information Processing Systems", "page_first": 6786, "page_last": 6797, "abstract": "Learning to walk over a graph towards a target node for a given query and a source node is an important problem in applications such as knowledge base completion (KBC). It can be formulated as a reinforcement learning (RL) problem with a known state transition model. To overcome the challenge of sparse rewards, we develop a graph-walking agent called M-Walk, which consists of a deep recurrent neural network (RNN) and Monte Carlo Tree Search (MCTS). The RNN encodes the state (i.e., history of the walked path) and maps it separately to a policy and Q-values. In order to effectively train the agent from sparse rewards, we combine MCTS with the neural policy to generate trajectories yielding more positive rewards. From these trajectories, the network is improved in an off-policy manner using Q-learning, which modifies the RNN policy via parameter sharing. Our proposed RL algorithm repeatedly applies this policy-improvement step to learn the model. At test time, MCTS is combined with the neural policy to predict the target node. Experimental results on several graph-walking benchmarks show that M-Walk is able to learn better policies than other RL-based methods, which are mainly based on policy gradients. M-Walk also outperforms traditional KBC baselines.", "full_text": "M-Walk: Learning to Walk over Graphs using Monte Carlo Tree Search\n\n*Yelong Shen1, *Jianshu Chen1, *Po-Sen Huang2, Yuqing Guo2, and Jianfeng Gao2\n\n1Tencent AI Lab, Bellevue, WA, USA.\n\n{yelongshen, jianshuchen}@tencent.com\n\n2Microsoft Research, Redmond, WA, USA\n{yuqguo, jfgao}@microsoft.com\n\nAbstract\n\nLearning to walk over a graph towards a target node for a given query and a source\nnode is an important problem in applications such as knowledge base completion\n(KBC). 
It can be formulated as a reinforcement learning (RL) problem with a\nknown state transition model. To overcome the challenge of sparse rewards, we\ndevelop a graph-walking agent called M-Walk, which consists of a deep recurrent\nneural network (RNN) and Monte Carlo Tree Search (MCTS). The RNN encodes\nthe state (i.e., history of the walked path) and maps it separately to a policy and\nQ-values. In order to effectively train the agent from sparse rewards, we combine\nMCTS with the neural policy to generate trajectories yielding more positive rewards.\nFrom these trajectories, the network is improved in an off-policy manner using\nQ-learning, which modifies the RNN policy via parameter sharing. Our proposed\nRL algorithm repeatedly applies this policy-improvement step to learn the model.\nAt test time, MCTS is combined with the neural policy to predict the target node.\nExperimental results on several graph-walking benchmarks show that M-Walk is\nable to learn better policies than other RL-based methods, which are mainly based\non policy gradients. M-Walk also outperforms traditional KBC baselines.\n\n1 Introduction\n\nWe consider the problem of learning to walk over a graph in order to find a target node for a given\nsource node and a query. Such problems appear in, for example, knowledge base completion (KBC)\n[38, 16, 31, 19, 7]. A knowledge graph is a structured representation of world knowledge in the form\nof entities and their relations (e.g., Figure 1(a)), and has a wide range of downstream applications\nsuch as question answering. Although a typical knowledge graph may contain millions of entities\nand billions of relations, it is usually far from complete. KBC aims to predict the missing relations\nbetween entities using information from the existing knowledge graph. More formally, let $\\mathcal{G} = (\\mathcal{N}, \\mathcal{E})$\ndenote a graph, which consists of a set of nodes, $\\mathcal{N} = \\{n_i\\}$, and a set of edges, $\\mathcal{E} = \\{e_{ij}\\}$, that\nconnect the nodes, and let $q$ denote an input query. 
The problem is stated as using the graph $\\mathcal{G}$, the\nsource node $n_S \\in \\mathcal{N}$ and the query $q$ as inputs to predict the target node $n_T \\in \\mathcal{N}$. In KBC tasks, $\\mathcal{G}$\nis a given knowledge graph, $\\mathcal{N}$ is a collection of entities (nodes), and $\\mathcal{E}$ is a set of relations (edges)\nthat connect the entities. In the example in Figure 1(a), the objective of KBC is to identify the target\nnode $n_T$ = USA for the given head entity $n_S$ = Obama and the given query $q$ = Citizenship.\n\n*Yelong Shen, Jianshu Chen, and Po-Sen Huang contributed equally to the paper. The work was done when\nYelong Shen and Jianshu Chen were with Microsoft Research. Po-Sen Huang is now at DeepMind (Email:\nposenhuang@google.com).\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nThe problem can also be understood as constructing a function $f(\\mathcal{G}, n_S, q)$ to predict $n_T$, where the\nfunctional form of $f(\\cdot)$ is generally unknown and has to be learned from a training dataset consisting\nof samples like $(n_S, q, n_T)$. In this work, we model $f(\\mathcal{G}, n_S, q)$ by means of a graph-walking agent\nthat intelligently navigates through a subset of nodes in the graph from $n_S$ towards $n_T$. Since $n_T$ is\nunknown, the problem cannot be solved by conventional search algorithms such as A*-search [11],\nwhich seeks to find paths between the given source and target nodes. Instead, the agent needs to learn\nits search policy from the training dataset so that, after training is complete, the agent knows how to\nwalk over the graph to reach the correct target node $n_T$ for an unseen pair of $(n_S, q)$. Moreover, each\ntraining sample is in the form of \u201c(source node, query, target node)\u201d, and there is no intermediate\nsupervision for the correct search path. Instead, the agent receives only delayed evaluative feedback:\nwhen the agent correctly (or incorrectly) predicts the target node in the training set, the agent will\nreceive a positive (or zero) reward. 
For this reason, we formulate the problem as a Markov decision\nprocess (MDP) and train the agent by reinforcement learning (RL) [27].\nThe problem poses two major challenges. Firstly, since the state of the MDP is the entire trajectory,\nreaching a correct decision usually requires not just the query, but also the entire history of traversed\nnodes. For the KBC example in Figure 1(a), having access to the current node $n_t$ = Hawaii alone is\nnot sufficient to know that the best action is moving to $n_{t+1}$ = USA. Instead, the agent must track\nthe entire history, including the input query $q$ = Citizenship, to reach this decision. Secondly,\nthe reward is sparse, being received only at the end of a search path, for instance, after correctly\npredicting $n_T$ = USA.\nIn this paper, we develop a neural graph-walking agent, named M-Walk, that effectively addresses\nthese two challenges. First, M-Walk uses a novel recurrent neural network (RNN) architecture\nto encode the entire history of the trajectory into a vector representation, which is further used to\nmodel the policy and the Q-function. Second, to address the challenge of sparse rewards, M-Walk\nexploits the fact that the MDP transition model is known and deterministic.2 Specifically, it combines\nMonte Carlo Tree Search (MCTS) with the RNN to generate trajectories that obtain significantly\nmore positive rewards than using the RNN policy alone. These trajectories can be viewed as being\ngenerated from an improved version of the RNN policy. But while these trajectories can improve\nthe RNN policy, their off-policy nature prevents them from being leveraged by policy gradient RL\nmethods. To solve this problem, we design a structure for sharing parameters between the Q-value\nnetwork and the RNN\u2019s policy network. This allows the policy network to be indirectly improved\nthrough Q-learning over the off-policy trajectories. 
Our method is in sharp contrast to existing\nRL-based methods for KBC, which use a policy gradient (REINFORCE) method [36] and usually\nrequire a large number of rollouts to obtain a trajectory with a positive reward, especially in the early\nstages of learning [9, 37, 14]. Experimental results on several benchmarks, including a synthetic\ntask and several real-world KBC tasks, show that our approach learns better policies than previous\nRL-based methods and traditional KBC methods.\nThe rest of the paper is organized as follows: Section 2 formulates graph walking as a Markov decision\nprocess, and Section 3 develops the M-Walk agent, including the model architecture and the training\nand testing algorithms.3 Experimental results are presented in Section\n4. Finally, we discuss related work in Section 5 and conclude the paper in Section 6.\n\n2Whenever the agent takes an action, by selecting an edge connected to a next node, the identity of the next\nnode (which the environment will transition to) is already known. Details can be found in Section 2.\n3The code of this paper is available at: https://github.com/yelongshen/GraphWalk\n\n(a) An example of Knowledge Base Completion\n\n(b) The corresponding Markov Decision Process\n\nFigure 1: An example of Knowledge Base Completion and its formulation as a Markov Decision Process. (a)\nWe want to identify the target node $n_T$ = USA for a given pair of query $q$ = Citizenship and source node\n$n_S$ = Obama. (b) The activated circles and edges (in black lines) denote all the observed information up to time\n$t$ (i.e., the state $s_t$). The double circle denotes the current node $n_t$, while $\\mathcal{E}_{n_t}$ and $\\mathcal{N}_{n_t}$ denote the edges and\nnodes connected to the current node.\n\n2 Graph Walking as a Markov Decision Process\n\nIn this section, we formulate the graph-walking problem as a Markov Decision Process (MDP), which\nis defined by the tuple $(\\mathcal{S}, \\mathcal{A}, \\mathcal{R}, \\mathcal{P})$, where $\\mathcal{S}$ is the set of states, $\\mathcal{A}$ is the set of actions, $\\mathcal{R}$ is the\nreward function, and $\\mathcal{P}$ is the state transition probability. We further define $\\mathcal{S}$, $\\mathcal{A}$, $\\mathcal{R}$ and $\\mathcal{P}$ below.\nFigure 1(b) illustrates the MDP corresponding to the KBC example of Figure 1(a). Let $s_t \\in \\mathcal{S}$ denote\nthe state at time $t$. Recalling that the agent needs the entire history of traversed nodes and the query\nto make a correct decision, we define $s_t$ by the following recursion:\n\n$$s_0 \\triangleq \\{q, n_S, \\mathcal{E}_{n_S}, \\mathcal{N}_{n_S}\\}, \\qquad s_t = s_{t-1} \\cup \\{a_{t-1}, n_t, \\mathcal{E}_{n_t}, \\mathcal{N}_{n_t}\\}, \\tag{1}$$\n\nwhere $a_t \\in \\mathcal{A}$ denotes the action selected by the agent at time $t$, $n_t \\in \\mathcal{G}$ denotes the currently visited\nnode at time $t$, $\\mathcal{E}_{n_t} \\subset \\mathcal{E}$ is the set of all edges connected to $n_t$, and $\\mathcal{N}_{n_t} \\subset \\mathcal{N}$ is the set of all nodes\nconnected to $n_t$ (i.e., the neighborhood). Note that state $s_t$ is a collection of (i) all the traversed nodes\n(along with their edges and neighborhoods) up to time $t$, (ii) all the previously selected (up to time\n$t-1$) actions, and (iii) the initial query $q$. The set $\\mathcal{S}$ consists of all the possible values of $\\{s_t, t \\ge 0\\}$.\nBased on $s_t$, the agent takes one of the following actions at each time $t$: (i) choosing an edge in\n$\\mathcal{E}_{n_t}$ and moving to the next node $n_{t+1} \\in \\mathcal{N}_{n_t}$, or (ii) terminating the walk (denoted as the \u201cSTOP\u201d\naction). Once the STOP action is selected, the MDP reaches the terminal state and outputs $\\hat{n}_T = n_t$\nas a prediction of the target node $n_T$. Therefore, we define the set of feasible actions at time $t$ as\n$\\mathcal{A}_t \\triangleq \\mathcal{E}_{n_t} \\cup \\{\\mathrm{STOP}\\}$, which is usually time-varying. The entire action space $\\mathcal{A}$ is the union of all\n$\\mathcal{A}_t$, i.e., $\\mathcal{A} = \\cup_t \\mathcal{A}_t$. Recall that the training set consists of samples in the form of $(n_S, q, n_T)$. The\nreward is defined to be +1 when the predicted target node $\\hat{n}_T$ is the same as $n_T$ (i.e., $\\hat{n}_T = n_T$), and\nzero otherwise. In the example of Figure 1(a), for a training sample (Obama, Citizenship, USA),\nif the agent successfully navigates from Obama to USA and correctly predicts $\\hat{n}_T$ = USA, the reward\nis +1. Otherwise, it will be 0. 
The rewards are sparse because a positive reward can be received only at\nthe end of a correct path. Furthermore, since the graph $\\mathcal{G}$ is known and static, the MDP transition\nprobability $p(s_t | s_{t-1}, a_{t-1})$ is known and deterministic, and is defined by (1). To see this, we observe\nfrom Figure 1(b) that once an action $a_t$ (i.e., an edge in $\\mathcal{E}_{n_t}$ or \u201cSTOP\u201d) is selected, the next node\n$n_{t+1}$ and its associated $\\mathcal{E}_{n_{t+1}}$ and $\\mathcal{N}_{n_{t+1}}$ are known. By (1) (with $t$ replaced by $t+1$), this means\nthat the next state $s_{t+1}$ is determined. This important (model-based) knowledge will be exploited to\novercome the sparse-reward problem using MCTS and significantly improve the performance of our\nmethod (see Sections 3\u20134 below).\nWe further define $\\pi_\\theta(a_t | s_t)$ and $Q_\\theta(s_t, a_t)$ to be the policy and the Q-function, respectively, where $\\theta$\nis a set of model parameters. The policy $\\pi_\\theta(a_t | s_t)$ denotes the probability of taking action $a_t$ given\nthe current state $s_t$. In M-Walk, it is used as a prior to bias the MCTS search. The Q-function $Q_\\theta(s_t, a_t)$ defines\nthe long-term reward of taking action $a_t$ at state $s_t$ and then following the optimal policy thereafter.\nThe objective is to learn a policy that maximizes the terminal rewards, i.e., correctly identifies the\ntarget node with high probability. We now proceed to explain how to model and jointly learn $\\pi_\\theta$ and\n$Q_\\theta$ to achieve this objective.\n\n3 The M-Walk Agent\n\nIn this section, we develop a neural graph-walking agent named M-Walk (i.e., MCTS for graph\nWalking), which consists of (i) a novel neural architecture for jointly modeling $\\pi_\\theta$ and $Q_\\theta$, and (ii)\nMonte Carlo Tree Search (MCTS). We first introduce the overall neural architecture and then explain\nhow MCTS is used during the training and testing stages. Finally, we describe some further details of\nthe neural architecture. 
Our discussion focuses on addressing the two challenges described earlier:\nhistory-dependent state and sparse rewards.\n\n3.1 The neural architecture for jointly modeling $\\pi_\\theta$ and $Q_\\theta$\nRecall from Section 2 (e.g., (1)) that one challenge in applying RL to the graph-walking problem\nis that the state $s_t$ nominally includes the entire history of observations. To address this problem,\nwe propose a special RNN encoding the state $s_t$ at each time $t$ into a vector representation, $h_t =\n\\mathrm{ENC}_{\\theta_e}(s_t)$, where $\\theta_e$ is the associated model parameter. We defer the discussion of this RNN\nstate encoder to Section 3.4, and focus in this section on how to use $h_t$ to jointly model $\\pi_\\theta$ and\n$Q_\\theta$. Specifically, the vector $h_t$ consists of several sub-vectors of the same dimension $M$: $h_{S,t}$,\n$\\{h_{n',t} : n' \\in \\mathcal{N}_{n_t}\\}$ and $h_{A,t}$. Each sub-vector encodes part of the state $s_t$ in (1). For instance, the\nvector $h_{S,t}$ encodes $(s_{t-1}, a_{t-1}, n_t)$, which characterizes the history in the state. The vector $h_{n',t}$\nencodes the (neighboring) node $n'$ and the edge $e_{n_t,n'}$ connected to $n_t$, which can be viewed as a\nvector representation of the $n'$-th candidate action (excluding the STOP action). And the vector $h_{A,t}$\nis a vector summarization of $\\mathcal{E}_{n_t}$ and $\\mathcal{N}_{n_t}$, which is used to model the STOP action probability. In\nsummary, we use the sub-vectors to model $\\pi_\\theta$ and $Q_\\theta$ according to:\n\n$$u_0 = f_{\\theta_\\pi}(h_{S,t}, h_{A,t}), \\qquad u_{n'} = \\langle h_{S,t}, h_{n',t} \\rangle, \\quad n' \\in \\mathcal{N}_{n_t}, \\tag{2}$$\n\n$$Q_\\theta(s_t, \\cdot) = \\sigma(u_0, u_{n'_1}, \\ldots, u_{n'_k}), \\qquad \\pi_\\theta(\\cdot | s_t) = \\mathrm{softmax}_\\tau(u_0, u_{n'_1}, \\ldots, u_{n'_k}), \\tag{3}$$\n\nwhere $\\langle \\cdot, \\cdot \\rangle$ denotes inner product, $f_{\\theta_\\pi}(\\cdot)$ is a fully-connected neural network with model parameter $\\theta_\\pi$,\n$\\sigma(\\cdot)$ denotes the element-wise sigmoid function, and $\\mathrm{softmax}_\\tau(\\cdot)$ is the softmax function with temperature\nparameter $\\tau$. Note that we use the inner product between the vectors $h_{S,t}$ and $h_{n',t}$ to compute the\n(pre-softmax) score $u_{n'}$ for choosing the $n'$-th candidate action, where $n' \\in \\mathcal{N}_{n_t}$. The inner product\noperation has been shown to be useful in modeling Q-functions when the candidate actions are\ndescribed by vector representations [13, 3] and in solving other problems [33, 1]. Moreover, the value\nof $u_0$ is computed by $f_{\\theta_\\pi}(\\cdot)$ using $h_{S,t}$ and $h_{A,t}$, where $u_0$ gives the (pre-softmax) score for choosing\nthe STOP action. We model the Q-function by applying element-wise sigmoid to $u_0, u_{n'_1}, \\ldots, u_{n'_k}$,\nand we model the policy by applying the softmax operation to the same set of $u_0, u_{n'_1}, \\ldots, u_{n'_k}$.4\nNote that the policy network and the Q-network share the same set of model parameters. We will\nexplain in Section 3.2 how such parameter sharing enables indirect updates to the policy $\\pi_\\theta$ via\nQ-learning from off-policy data.\n\n(a)\n\n(b)\n\nFigure 2: The neural architecture for M-Walk. (a) The vector representation of the state is mapped into $\\pi_\\theta$\nand $Q_\\theta$. (b) The GRU-RNN state encoder maps the state into its vector representation $h_t$. Note that the inputs\n$h_{A,t-1}$ and $h_{a_{t-1},t-1}$ are from the output of the previous time step $t-1$.\n\n3.2 The training algorithm\nWe now discuss how to train the model parameters $\\theta$ (including $\\theta_\\pi$ and $\\theta_e$) from a training dataset\n$\\{(n_S, q, n_T)\\}$ using reinforcement learning. One approach is the policy gradient method (REINFORCE)\n[36, 28], which uses the current policy $\\pi_\\theta(a_t | s_t)$ to roll out multiple trajectories\n$(s_0, a_0, r_0, s_1, \\ldots)$ to estimate a stochastic gradient, and then updates the policy $\\pi_\\theta$ via stochastic\ngradient ascent. 
Previous RL-based KBC methods [38, 5] typically use REINFORCE to learn the\npolicy. However, policy gradient methods generally suffer from low sample efficiency, especially\nwhen the reward signal is sparse, because large numbers of Monte Carlo rollouts are usually needed to\nobtain many trajectories with positive terminal reward, particularly in the early stages of learning. To\naddress this challenge, we develop a novel RL algorithm that uses MCTS to exploit the deterministic\nMDP transition defined in (1). Specifically, on each MCTS simulation, a trajectory is rolled out by\nselecting actions according to a variant of the PUCT algorithm [21, 25] from the root state $s_0$ (defined\nin (1)):\n\n$$a_t = \\operatorname{argmax}_a \\left\\{ c \\cdot \\pi_\\theta(a | s_t)^\\beta \\, \\frac{\\sqrt{\\sum_{a'} N(s_t, a')}}{1 + N(s_t, a)} + \\frac{W(s_t, a)}{N(s_t, a)} \\right\\}, \\tag{4}$$\n\nwhere $\\pi_\\theta(a | s)$ is the policy defined in Section 3.1, $c$ and $\\beta$ are two constants that control the level of\nexploration, and $N(s, a)$ and $W(s, a)$ are the visit count and the total action reward accumulated on\nthe $(s, a)$-th edge on the MCTS tree. Overall, PUCT treats $\\pi_\\theta$ as a prior probability to bias the MCTS\nsearch; PUCT initially prefers actions with high values of $\\pi_\\theta$ and low visit count $N(s, a)$ (because\nthe first term in (4) is large), but then asymptotically prefers actions with high value (because the\nfirst term in (4) vanishes and the second term $W(s, a)/N(s, a)$ dominates).\n\n4An alternative choice is applying softmax to the Q-function to get the policy, which is known as softmax\nselection [27]. We found in our experiments that these two designs do not differ much in performance.\n\n(a) An example of MCTS path (in red) in M-Walk\n\n(b) Iterative policy improvement in M-Walk\n\nFigure 3: MCTS is used to generate trajectories for iterative policy improvement in M-Walk.\n\n
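As an illustration of the selection rule in (4), the following minimal sketch scores each candidate action as c * prior^beta * sqrt(total visits) / (1 + visits) + total reward / visits and returns the argmax. The function name puct_select, the list-based inputs, the default constants, and the convention of using a mean value of 0 for unvisited actions are assumptions made for this example and are not taken from the released code.

```python
import math


def puct_select(prior, visits, total_reward, c=1.0, beta=1.0):
    '''Return the index of the action with the highest PUCT score.

    prior[a]        -- policy probability of action a (assumed given)
    visits[a]       -- visit count N(s, a)
    total_reward[a] -- accumulated reward W(s, a)
    '''
    total_visits = sum(visits)
    best_action, best_score = None, -math.inf
    for a, p in enumerate(prior):
        # Exploration term: large for high-prior, rarely visited actions.
        explore = c * (p ** beta) * math.sqrt(total_visits) / (1 + visits[a])
        # Exploitation term: mean action value (assumed 0 when unvisited).
        exploit = total_reward[a] / visits[a] if visits[a] > 0 else 0.0
        score = explore + exploit
        if score > best_score:
            best_action, best_score = a, score
    return best_action


# A rarely visited action with a good mean value is preferred over a
# heavily visited action with a low mean value.
chosen = puct_select([0.5, 0.3, 0.2], [10, 1, 1], [1.0, 0.9, 0.0])
```

With equal visit counts and equal mean values, the prior alone decides the choice, which matches the description of the policy acting as a prior over the search.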
When PUCT selects the\nSTOP action or the maximum search horizon has been reached, MCTS completes one simulation and\nupdates $W(s, a)$ and $N(s, a)$ using $V_\\theta(s_T) = Q_\\theta(s_T, a = \\mathrm{STOP})$. (See Figure 3(a) for an example\nand Appendix B.1 for more details.) The key idea of our method is that running multiple MCTS\nsimulations generates a set of trajectories with more positive rewards (see Section 4 for more analysis),\nwhich can also be viewed as being generated by an improved policy $\\pi_\\theta$. Therefore, learning from these\ntrajectories can further improve $\\pi_\\theta$. Our RL algorithm repeatedly applies this policy-improvement\nstep to refine the policy. However, since these trajectories are generated by a policy that is different\nfrom $\\pi_\\theta$, they are off-policy data, breaking the assumptions inherent in policy gradient methods. For\nthis reason, we instead update the Q-network from these trajectories in an off-policy manner using\nQ-learning: $\\theta \\leftarrow \\theta + \\alpha \\cdot \\nabla_\\theta Q_\\theta(s_t, a_t) \\times (r(s_t, a_t) + \\max_{a'} Q_\\theta(s_{t+1}, a') - Q_\\theta(s_t, a_t))$. Recall\nfrom Section 3.1 that $\\pi_\\theta$ and $Q_\\theta(s, a)$ share the same set of model parameters; once the Q-network is\nupdated, the policy network $\\pi_\\theta$ will also be automatically improved. Finally, the new $\\pi_\\theta$ is used to\ncontrol the MCTS in the next iteration. The main idea of the training algorithm is summarized in\nFigure 3(b).\n\n3.3 The prediction algorithm\n\nAt test time, we want to infer the target node $n_T$ for an unseen pair of $(n_S, q)$. One approach is to use\nthe learned policy $\\pi_\\theta$ to walk through the graph $\\mathcal{G}$ to find $n_T$. However, this would not exploit the\nknown MDP transition model (1). Instead, we combine the learned $\\pi_\\theta$ and $Q_\\theta$ with MCTS to generate\nan MCTS search tree, as in the training stage. 
Note that there could be multiple paths that reach the\nsame terminal node $n \\in \\mathcal{G}$, meaning that there could be multiple leaf states in MCTS corresponding\nto that node. Therefore, the prediction results from these MCTS leaf states need to be merged into one\nscore to rank the node $n$. Specifically, we use $\\mathrm{Score}(n) = \\sum_{s_T \\to n} N(s_T, a_T)/N \\times Q_\\theta(s_T, \\mathrm{STOP})$,\nwhere $N$ is the total number of MCTS simulations, and the summation is over all the leaf states $s_T$\nthat correspond to the same node $n \\in \\mathcal{G}$. $\\mathrm{Score}(n)$ is a weighted average of the terminal state values\nassociated with the same candidate node $n$.5 Among all the candidate nodes, we select the predicted\ntarget node to be the one with the highest score: $\\hat{n}_T = \\operatorname{argmax}_n \\mathrm{Score}(n)$.\n\n3.4 The RNN state encoder\nWe now discuss the details of the RNN state encoder $h_t = \\mathrm{ENC}_{\\theta_e}(s_t)$, where $\\theta_e \\triangleq \\{\\theta_A, \\theta_S, \\theta_q\\}$,\nas shown in Figure 2(b). Specifically, we explain how the sub-vectors of $h_t$ are computed. We\nintroduce $q_t \\triangleq s_{t-1} \\cup \\{a_{t-1}, n_t\\}$ as an auxiliary variable. Then, the state $s_t$ in (1) can be written\nas $s_t = q_t \\cup \\{\\mathcal{E}_{n_t}, \\mathcal{N}_{n_t}\\}$. Note that the state $s_t$ is composed of two parts: (i) $\\mathcal{E}_{n_t}$ and $\\mathcal{N}_{n_t}$, which\nrepresent the candidate actions to be selected (excluding the STOP action), and (ii) $q_t$, which\nrepresents the history. We use two different neural networks to encode these separately. For the $n'$-th\ncandidate action ($n' \\in \\mathcal{N}_{n_t}$), we concatenate $n'$ with its associated $e_{n_t,n'} \\in \\mathcal{E}_{n_t}$ and input them into\na fully connected network (FCN) $f_{\\theta_A}(\\cdot)$ to compute their joint vector representation $h_{n',t}$, where\n$\\theta_A$ is the model parameter. Recall that the action space $\\mathcal{A}_t = \\mathcal{E}_{n_t} \\cup \\{\\mathrm{STOP}\\}$ can be time-varying\nwhen the size of $\\mathcal{E}_{n_t}$ changes over time. 
To address this issue, we apply the same FCN $f_{\\theta_A}(\\cdot)$ to\ndifferent $(n', e_{n_t,n'})$ to obtain their respective representations. Then, we use a coordinate-wise max-pooling\noperation over $\\{h_{n',t} : n' \\in \\mathcal{N}_{n_t}\\}$ to obtain a (fixed-length) overall vector representation\nof $\\{\\mathcal{E}_{n_t}, \\mathcal{N}_{n_t}\\}$. To encode $q_t$, we call upon the following recursion for $q_t$ (see Appendix A for the\nderivation): $q_{t+1} = q_t \\cup \\{\\mathcal{E}_{n_t}, \\mathcal{N}_{n_t}, a_t, n_{t+1}\\}$. Inspired by this recursion, we propose using the\nGRU-RNN [4] to encode $q_t$ into a vector representation6: $q_{t+1} = f_{\\theta_q}(q_t, [h_{A,t}, h_{a_t,t}, n_{t+1}])$ with\ninitialization $q_0 = f_{\\theta_q}(q, [0, 0, n_S])$, where $\\theta_q$ is the model parameter, and $h_{a_t,t}$ denotes the vector\n$h_{n',t}$ at $n' = a_t$. We use $h_{A,t}$ and $h_{a_t,t}$ computed by the FCNs to represent $(\\mathcal{E}_{n_t}, \\mathcal{N}_{n_t})$ and $a_t$,\nrespectively. Then, we map $q_t$ to $h_{S,t}$ using another FCN $f_{\\theta_S}(\\cdot)$.\n\n5There could be alternative ways to compute the score, such as $\\mathrm{Score}(n) = \\max_{s_T \\to n} Q_\\theta(s_T, \\mathrm{STOP})$.\nHowever, we found in our (unreported) experiments that they do not make much difference.\n\n4 Experiments\n\nWe evaluate and analyze the effectiveness of M-Walk on a synthetic Three Glass Puzzle task and\ntwo real-world KBC tasks. We briefly describe the tasks here, and give the experiment details and\nhyperparameters in Appendix B.\n\nThree Glass Puzzle The Three Glass Puzzle [20] is a problem studied in math puzzles and graph\ntheory. It involves three milk containers A, B, and C, with capacities A, B and C liters, respectively.\nThe containers display no intermediate markings. There are three feasible actions at each time step:\n(i) fill a container (to its capacity), (ii) empty all of its liquid, and (iii) pour its liquid into another\ncontainer (up to its capacity). 
The objective of the problem is, given a desired volume $q$, to take\na sequence of actions on the three containers after which one of them contains $q$ liters of liquid.\nWe formulate this as a graph-walking problem; in the graph $\\mathcal{G}$, each node $n = (a, b, c)$ denotes the\namounts of remaining liquid in the three containers, each edge denotes one of the three feasible\nactions, and the input query is the desired volume $q$. The reward is +1 when the agent successfully\nfills one of the containers to $q$ and 0 otherwise (see Appendix B.2.1 for the details). We use vanilla\npolicy gradient (REINFORCE) [36] as the baseline, with task success rate as the evaluation metric.\n\nKnowledge Base Completion We use WN18RR and NELL995 knowledge graph datasets for\nevaluation. WN18RR [6] is created from the original WN18 [2] by removing various sources of test\nleakage, making the dataset more challenging. The NELL995 dataset was released by [38] and has\nseparate graphs for each query relation. We use the same data split and preprocessing protocol as\nin [6] for WN18RR and in [38, 5] for NELL995. As in [38, 5], we study the 10 relation tasks of\nNELL995 separately. We use HITS@1,3 and mean reciprocal rank (MRR) as the evaluation metrics\nfor WN18RR, and use mean average precision (MAP) for NELL995,7 where HITS@K computes\nthe percentage of the desired entities being ranked among the top-K list, and MRR computes an\naverage of the reciprocal rank of the desired entities. We compare against RL-based methods [38, 5],\nembedding-based models (including DistMult [39], ComplEx [32] and ConvE [6]) and recent work\nin logical rules (NeuralLP) [40]. 
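For concreteness, the two ranking metrics described above can be sketched as follows, assuming each test query has already been reduced to the 1-based rank of its desired entity. The function names and the example ranks are made up for illustration and are not part of the evaluation code used in the experiments.

```python
def hits_at_k(ranks, k):
    '''Fraction of queries whose desired entity is ranked within the top k.'''
    return sum(1 for r in ranks if r <= k) / len(ranks)


def mean_reciprocal_rank(ranks):
    '''Average of 1/rank over all queries (ranks are 1-based).'''
    return sum(1.0 / r for r in ranks) / len(ranks)


# Hypothetical ranks of the desired entity for four test queries.
ranks = [1, 3, 2, 10]
hits1 = hits_at_k(ranks, 1)        # one of four queries is ranked first
hits3 = hits_at_k(ranks, 3)        # three of four queries are in the top 3
mrr = mean_reciprocal_rank(ranks)  # (1 + 1/3 + 1/2 + 1/10) / 4
```

Multiplying these fractions by 100 gives percentages in the form reported for WN18RR.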
For all the baseline methods, we used the implementation released\nby the corresponding authors with their best-reported hyperparameter settings.8 The details of the\nhyperparameters for M-Walk are described in Appendix B.2.2 of the supplementary material.\n\n4.1 Performance of M-Walk\nWe first report the overall performance of the M-Walk algorithm on the three tasks and compare it\nwith other baseline methods. We ran the experiments three times and report the means and standard\ndeviations (except for PRA, TransE, and TransR on NELL995, whose results are directly quoted from\n[38]). On the Three Glass Puzzle task, M-Walk significantly outperforms the baseline: the best model\nof M-Walk achieves an accuracy of (99.0 \u00b1 1.0)% while the best REINFORCE method achieves\n(49.0 \u00b1 2.6)% (see Appendix C for more experiments with different settings on this task). For the\ntwo KBC tasks, we report their results in Tables 1-2, where PG-Walk and Q-Walk are two methods\nwe created just for the ablation study in the next section. The proposed method outperforms previous\nworks in most of the metrics on NELL995 and WN18RR datasets. Additional experiments on the\nFB15k-237 dataset can be found in Appendix C.1.1 of the supplementary material.\n\n6For simplicity, we use the same notation $q_t$ to denote its vector representation.\n7We use these metrics in order to be consistent with [38, 5]. We also report the HITS and MRR scores for\nNELL995 in Table 9 of the supplementary material.\n8ConvE: https://github.com/TimDettmers/ConvE, Neural-LP: https://github.com/fanyangxyz/Neural-LP/,\nDeepPath: https://github.com/xwhan/DeepPath, MINERVA: https://github.com/shehzaadzd/MINERVA/\n\nTable 1: The MAP scores (%) on NELL995 task, where we report RL-based methods in terms of \u201cmean\n(standard deviation)\u201d. PG-Walk and Q-Walk are methods we created just for the ablation study.\n\nTasks | M-Walk | PG-Walk | Q-Walk | MINERVA | DeepPath | PRA | TransE | TransR\nAthletePlaysForTeam | 84.7 (1.3) | 80.8 (0.9) | 82.6 (1.2) | 82.7 (0.8) | 72.1 (1.2) | 54.7 | 62.7 | 67.3\nAthletePlaysInLeague | 97.8 (0.2) | 96.0 (0.6) | 96.2 (0.8) | 95.2 (0.8) | 92.7 (5.3) | 84.1 | 77.3 | 91.2\nAthleteHomeStadium | 91.9 (0.1) | 91.9 (0.3) | 91.1 (1.3) | 92.8 (0.1) | 84.6 (0.8) | 85.9 | 71.8 | 72.2\nAthletePlaysSport | 98.3 (0.1) | 98.0 (0.8) | 97.0 (0.2) | 98.6 (0.1) | 91.7 (4.1) | 47.4 | 87.6 | 96.3\nTeamPlaySports | 88.4 (1.8) | 87.4 (0.9) | 78.5 (0.6) | 87.5 (0.5) | 69.6 (6.7) | 79.1 | 76.1 | 81.4\nOrgHeadquaterCity | 95.0 (0.7) | 94.0 (0.4) | 94.0 (0.6) | 94.5 (0.3) | 79.0 (0.0) | 81.1 | 62.0 | 65.7\nWorksFor | 84.2 (0.6) | 84.0 (1.6) | 82.7 (0.2) | 82.7 (0.5) | 69.9 (0.3) | 68.1 | 67.7 | 69.2\nBornLocation | 81.2 (0.0) | 82.3 (0.6) | 81.4 (0.5) | 78.2 (0.0) | 75.5 (0.5) | 66.8 | 71.2 | 81.2\nPersonLeadsOrg | 88.8 (0.5) | 87.2 (0.5) | 86.9 (0.5) | 83.0 (2.6) | 79.0 (1.0) | 70.0 | 75.1 | 77.2\nOrgHiredPerson | 88.8 (0.6) | 87.2 (0.4) | 87.8 (0.9) | 87.0 (0.3) | 73.8 (1.9) | 59.9 | 71.9 | 73.7\nOverall | 89.9 | 88.9 | 87.8 | 87.6 | 78.8 | 69.7 | 72.3 | 77.5\n\nTable 2: The results on the WN18RR dataset, in the form of \u201cmean (standard deviation)\u201d.\n\nMetric (%) | M-Walk | PG-Walk | Q-Walk | MINERVA | ComplEx | ConvE | DistMult | NeuralLP\nHITS@1 | 41.4 (0.1) | 39.3 (0.2) | 38.2 (0.3) | 35.1 (0.1) | 38.5 (0.3) | 39.6 (0.3) | 38.4 (0.4) | 37.2 (0.1)\nHITS@3 | 44.5 (0.2) | 41.9 (0.1) | 40.8 (0.4) | 44.5 (0.4) | 43.9 (0.3) | 44.7 (0.2) | 42.4 (0.3) | 43.4 (0.1)\nMRR | 43.7 (0.1) | 41.3 (0.1) | 40.1 (0.3) | 40.9 (0.1) | 42.2 (0.2) | 43.3 (0.2) | 41.3 (0.3) | 43.5 (0.1)\n\n4.2 Analysis of M-Walk\n\nWe performed extensive experimental analysis to understand the proposed M-Walk algorithm, including\n(i) the contributions of different components, (ii) its ability to overcome sparse rewards, (iii)\nhyperparameter analysis, (iv) its strengths and weaknesses 
compared to traditional KBC methods,\nand (v) its running time. First, we used ablation studies to analyze the contributions of different\ncomponents in M-Walk. To understand the contribution of the proposed neural architecture in M-Walk,\nwe created a method, PG-Walk, which uses the same neural architecture as M-Walk but with\nthe same training (PG) and testing (beam search) algorithms as MINERVA [5]. We observed that\nthe novel neural architecture of M-Walk contributes an overall 1% gain relative to MINERVA on\nNELL995, and PG-Walk is still 1% worse than M-Walk, which uses MCTS for training and testing. To\nfurther understand the contribution of MCTS, we created another method, Q-Walk, which uses the\nsame model architecture as M-Walk except that it is trained by Q-learning only without MCTS. Note\nthat Q-Walk loses about 2% in overall performance on NELL995 relative to M-Walk. We observed similar trends on WN18RR.\nIn addition, we also analyze the importance of MCTS in the testing stage in Appendix C.1.\nSecond, we analyze the ability of M-Walk to overcome the sparse-reward problem. In Figure 4, we\nshow the positive reward rate (i.e., the percentage of trajectories with positive reward during training)\non the Three Glass Puzzle task and the NELL995 tasks. Compared to the policy gradient method (PG-Walk)\nand the Q-learning method (Q-Walk) under the same model architecture, M-Walk with\nMCTS is able to generate trajectories with more positive rewards, and this continues to improve\nas training progresses. This confirms our motivation of using MCTS to generate higher-quality\ntrajectories to alleviate the sparse-reward problem in graph walking.\nThird, we analyze the performance of M-Walk under different numbers of MCTS rollout simulations\nand different search horizons on WN18RR dataset, with results shown in Figure 5(a). 
We observe\nthat the model is less sensitive to search horizon and more sensitive to the number of MCTS rollouts.\nFinally, we analyze the strengths and weaknesses of M-Walk relative to traditional methods on the\nWN18RR dataset. The \ufb01rst question is how M-Walk performs on reasoning paths of different lengths\ncompared to baselines. To answer this, we analyze the HITS@1 accuracy against ConvE in Fig. 5(b).\nWe categorize each test example using the BFS (breadth-\ufb01rst search) steps from the query entity to\nthe target entity (-1 means not reachable). We observe that M-Walk outperforms the strong baseline\nConvE by 4.6\u201310.9% in samples that require 2 or 3 steps, while it is nearly on par for paths of length\none. Therefore, M-Walk does better at reasoning over longer paths than ConvE. Another question\nis what are the major types of errors made by M-Walk. Recall that M-Walk only walks through\na subset of the graph and ranks a subset of candidate nodes (e.g., MCTS produces about 20\u201360\nunique candidates on WN18RR). When the ground truth is not in the candidate set, M-Walk always\nmakes mistakes and we de\ufb01ne this type of error as out-of-candidate-set error. To examine this\neffect, we show in Figure 5(c)-top the HITS@K accuracies when the ground truth is in the candidate\n\n7\n\n\f(a) #Train Rollouts = 16\n\n(b) #Train Rollouts = 32\n\n(c) #Train Rollouts = 64\n\n(d) M-Walk MCTS Comparison\n\n(e) OrganizationHiredPerson\n\n(f) AthletePlaysForTeam\n\n(g) PersonLeadsOrganization\n\n(h) WorksFor\n\nFigure 4: The positive reward rate. Figures (a)-(d) are the results on the Three Glass Puzzle task and Figures\n(e)-(h) are the results on the NELL-995 task. 
(See Appendix C.1.1 for more results.)\n\n[Figure 5 panels: (a) MCTS hyperparameter analysis; (b) HITS@1 (%) for different path lengths; (c) Error pattern analysis]\n\nFigure 5: M-Walk hyperparameter and error analysis on WN18RR.\n\nset (footnote 9). It shows that M-Walk has very high accuracy in this case, significantly higher than that of\nConvE (80% vs. 39.6% in HITS@1). We further examine the percentage of out-of-candidate-set\nerrors among all errors in Figure 5(c)-bottom. It shows that the major error made by M-Walk is\nthe out-of-candidate-set error. These observations point to an important direction for improving\nM-Walk in future work: increasing the chance that the candidate set covers the target.\n\nTable 3: Running time of M-Walk and MINERVA for different combinations of (horizon, rollouts).\n\nModel | Training (hrs.) | Testing (sec/sample)\nM-Walk (5,64) | 8 | 3 \u00d7 10^-3\nM-Walk (5,128) | 14 | 6 \u00d7 10^-3\nM-Walk (3,64) | 5 | 1.6 \u00d7 10^-3\nM-Walk (3,128) | 8 | 2.7 \u00d7 10^-3\nMINERVA (3,100), best | 3 | 2 \u00d7 10^-2\n\nIn Table 3, we show the running time of M-Walk (in-house C++ & CUDA) and MINERVA (TensorFlow-gpu)\nfor both training and testing on WN18RR with different values of search horizon and number of\nrollouts (i.e., the number of MCTS simulations). Note that the running time of M-Walk is comparable to that\nof MINERVA. Additional results can be found in Figure 9(c) of the supplementary material. Finally,\nin Table 4, we show examples of reasoning paths found by M-Walk (footnote 10).\n\n5 Related Work\n\nReinforcement Learning Recently, deep reinforcement learning has achieved great success in\nmany artificial intelligence problems [17, 24, 25]. The use of deep neural networks with RL allows\npolicies to be learned from raw data (e.g., images) in an end-to-end manner. 
Our work also aligns with this direction.\n\nFootnote 9: The ground truth is always in the candidate list in ConvE, as it examines all the nodes.\nFootnote 10: More examples can be found in Appendix C.2 of the supplementary material.\n\nTable 4: Examples of reasoning paths found by M-Walk on the NELL-995 dataset for the relation\n\u201cAthleteHomeStadium\u201d. True (False) means the prediction is correct (wrong).\n\nAthleteHomeStadium:\nExample 1: athlete ernie banks --AthleteHomeStadium--> ?\nathlete ernie banks --AthletePlaysInLeague--> SportsLeague mlb --TeamPlaysInLeague^-1--> SportsTeam chicago cubs --TeamHomeStadium--> StadiumOrEventVenue wrigley field, (True)\n\nExample 2: coach jim zorn --AthleteHomeStadium--> ?\ncoach jim zorn --CoachWonTrophy--> AwardTrophyTournament super bowl --TeamWonTrophy^-1--> SportsTeam redskins --TeamHomeStadium--> StadiumOrEventVenue fedex field, (True)\n\nExample 3: athlete oliver perez --AthleteHomeStadium--> ?\nathlete oliver perez --AthletePlaysInLeague--> SportsLeague mlb --TeamPlaysInLeague^-1--> SportsTeam chicago cubs --TeamHomeStadium--> StadiumOrEventVenue wrigley field, (False)\n\nFurthermore, the idea of using an RNN to encode the history of observations\nalso appeared in [12, 35]. The combination of model-based and model-free information in our work\nshares the same spirit as [24, 25, 26, 34]. Among them, the most relevant are [24, 25], which combine\nMCTS with neural policy and value functions to achieve superhuman performance on Go. Different\nfrom our work, the policy and the value networks in [24] are trained separately without the help\nof MCTS, and are only used to help MCTS after being trained. The work [25] uses a new policy\niteration method that combines the neural policy and value functions with MCTS during training.\nHowever, the method in [25] improves the policy network from the MCTS probabilities of the moves,\nwhile our method improves the policy from the trajectories generated by MCTS. 
Note that the former\n(the MCTS move probabilities) are constructed from the visit counts of all the edges connected to the MCTS root node, so that method only uses\ninformation near the root node to improve the policy. By contrast, we improve the policy by learning\nfrom the trajectories generated by MCTS, using information over the entire MCTS search tree.\n\nKnowledge Base Completion In KBC tasks, early work [2] focused on learning vector representations\nof entities and relations. Subsequent work has demonstrated a limitation of these\napproaches: they suffer from cascading errors when dealing with compositional (multi-step)\nrelationships [10]. Hence, recent works [8, 18, 10, 15, 30] have proposed approaches for injecting\nmulti-step paths such as random walks through sequences of triples during training, further improving\nperformance on KBC tasks. IRN [23] and Neural LP [40] explore multi-step relations by using an\nRNN controller with attention over an external memory. Compared to RL-based approaches, their\ntraversal paths are hard to interpret, and accessing the\nentire graph in memory can be computationally expensive [23]. Two recent works, DeepPath [38] and MINERVA [5], use RL-based\napproaches to explore paths in knowledge graphs. DeepPath requires target entity information to\nbe in the state of the RL agent, and cannot be applied to tasks where the target entity is unknown.\nMINERVA [5] uses a policy gradient method to explore paths during training and testing. Our proposed\nmodel further exploits state transition information by integrating the MCTS algorithm. Empirically,\nour proposed algorithm outperforms both DeepPath and MINERVA on the KBC benchmarks (footnote 11).\n\n6 Conclusion and Discussion\n\nWe developed an RL agent (M-Walk) that learns to walk over a graph towards a desired target node\nfor a given input query and source node. 
Specifically, we proposed a novel neural architecture that\nencodes the state into a vector representation, and maps it to Q-values and a policy. To learn from\nsparse rewards, we proposed a new reinforcement learning algorithm, which alternates between an\nMCTS trajectory-generation step and a policy-improvement step, to iteratively refine the policy. At\ntest time, the learned networks are combined with MCTS to search for the target node. Experimental\nresults on several benchmarks demonstrate that our method learns better policies than other baseline\nmethods, including RL-based and traditional methods on KBC tasks. We also performed\nextensive experimental analysis to understand M-Walk. We found that our method is more accurate\nwhen the ground truth is in the candidate set. We also found that the out-of-candidate-set error is the\nmain type of error made by M-Walk. Therefore, in future work, we intend to improve this method by\nreducing such out-of-candidate-set errors.\n\nFootnote 11: A preliminary version of M-Walk with limited experiments was reported in the workshop paper [22].\n\nAcknowledgments\n\nWe thank Ricky Loynd, Adith Swaminathan, and anonymous reviewers for their valuable feedback.\n\nReferences\n\n[1] Irwan Bello, Hieu Pham, Quoc V Le, Mohammad Norouzi, and Samy Bengio. Neural combinatorial optimization with reinforcement learning. arXiv preprint arXiv:1611.09940, 2016.\n\n[2] Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. Translating embeddings for modeling multi-relational data. In Proc. NIPS, pages 2787\u20132795, 2013.\n\n[3] Jianshu Chen, Chong Wang, Lin Xiao, Ji He, Lihong Li, and Li Deng. Q-LDA: Uncovering latent patterns in text-based sequential decision processes. In Proc. NIPS, pages 4984\u20134993, 2017.\n\n[4] Kyunghyun Cho, Bart Van Merri\u00ebnboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 
Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proc. EMNLP, 2014.\n\n[5] Rajarshi Das, Shehzaad Dhuliawala, Manzil Zaheer, Luke Vilnis, Ishan Durugkar, Akshay Krishnamurthy, Alexander J. Smola, and Andrew McCallum. Go for a walk and arrive at the answer: Reasoning over paths in knowledge bases using reinforcement learning. In Proc. ICLR, 2018.\n\n[6] Tim Dettmers, Pasquale Minervini, Pontus Stenetorp, and Sebastian Riedel. Convolutional 2D knowledge graph embeddings. In Proc. AAAI, February 2018.\n\n[7] Jianfeng Gao, Michel Galley, and Lihong Li. Neural approaches to conversational AI. arXiv preprint arXiv:1809.08267, 2018.\n\n[8] Matt Gardner, Partha Pratim Talukdar, Jayant Krishnamurthy, and Tom Mitchell. Incorporating vector space similarity in random walk inference over knowledge bases. In Proc. EMNLP, 2014.\n\n[9] Shixiang Gu, Timothy Lillicrap, Zoubin Ghahramani, Richard E Turner, and Sergey Levine. Q-prop: Sample-efficient policy gradient with an off-policy critic. In Proc. ICLR, 2016.\n\n[10] Kelvin Guu, John Miller, and Percy Liang. Traversing knowledge graphs in vector space. In Proc. EMNLP, 2015.\n\n[11] Peter E Hart, Nils J Nilsson, and Bertram Raphael. A formal basis for the heuristic determination of minimum cost paths. IEEE Trans. Systems Science and Cybernetics, 4(2):100\u2013107, 1968.\n\n[12] Matthew J. Hausknecht and Peter Stone. Deep recurrent Q-learning for partially observable MDPs. CoRR, abs/1507.06527, 2015.\n\n[13] Ji He, Jianshu Chen, Xiaodong He, Jianfeng Gao, Lihong Li, Li Deng, and Mari Ostendorf. Deep reinforcement learning with a natural language action space. In Proc. ACL, pages 1621\u20131630, 2016.\n\n[14] Sham M Kakade. A natural policy gradient. In Proc. NIPS, pages 1531\u20131538, 2002.\n\n[15] Yankai Lin, Zhiyuan Liu, Huanbo Luan, Maosong Sun, Siwei Rao, and Song Liu. Modeling relation paths for representation learning of knowledge bases. 
In Proc. EMNLP, 2015.\n\n[16] Yankai Lin, Zhiyuan Liu, Maosong Sun, Yang Liu, and Xuan Zhu. Learning entity and relation embeddings for knowledge graph completion. In Proc. AAAI, 2015.\n\n[17] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.\n\n[18] Arvind Neelakantan, Benjamin Roth, and Andrew McCallum. Compositional vector space models for knowledge base completion. In Proc. AAAI, 2015.\n\n[19] Maximilian Nickel, Volker Tresp, and Hans-Peter Kriegel. A three-way model for collective learning on multi-relational data. In Proc. ICML, pages 809\u2013816, 2011.\n\n[20] Oystein Ore. Graphs and Their Uses, volume 34. Cambridge University Press, 1990.\n\n[21] Christopher D. Rosin. Multi-armed bandits with episode context. Annals of Mathematics and Artificial Intelligence, 61(3):203\u2013230, Mar 2011.\n\n[22] Yelong Shen, Jianshu Chen, Po-Sen Huang, Yuqing Guo, and Jianfeng Gao. ReinforceWalk: Learning to walk in graph with Monte Carlo tree search. In ICLR workshop, 2018.\n\n[23] Yelong Shen, Po-Sen Huang, Ming-Wei Chang, and Jianfeng Gao. Modeling large-scale structured relationships with shared memory for knowledge base completion. In Proceedings of the 2nd Workshop on Representation Learning for NLP, pages 57\u201368, 2017.\n\n[24] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484\u2013489, 2016.\n\n[25] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of Go without human knowledge. 
Nature, 550(7676):354, 2017.\n\n[26] David Silver, Hado van Hasselt, Matteo Hessel, Tom Schaul, Arthur Guez, Tim Harley, Gabriel Dulac-Arnold, David Reichert, Neil Rabinowitz, Andre Barreto, and Thomas Degris. The predictron: End-to-end learning and planning. In Proc. ICML, 2017.\n\n[27] Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction, volume 1. MIT Press, Cambridge, 1998.\n\n[28] Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Proc. NIPS, pages 1057\u20131063, 2000.\n\n[29] Kristina Toutanova and Danqi Chen. Observed versus latent features for knowledge base and text inference. In Proceedings of the 3rd Workshop on Continuous Vector Space Models and their Compositionality, 2015.\n\n[30] Kristina Toutanova, Xi Victoria Lin, Scott Wen-tau Yih, Hoifung Poon, and Chris Quirk. Compositional learning of embeddings for relation paths in knowledge bases and text. In Proc. ACL, 2016.\n\n[31] Th\u00e9o Trouillon, Christopher R Dance, Johannes Welbl, Sebastian Riedel, \u00c9ric Gaussier, and Guillaume Bouchard. Knowledge graph completion via complex tensor factorization. Journal of Machine Learning Research, 2017.\n\n[32] Th\u00e9o Trouillon, Johannes Welbl, Sebastian Riedel, \u00c9ric Gaussier, and Guillaume Bouchard. Complex embeddings for simple link prediction. In Proc. ICML, pages 2071\u20132080, 2016.\n\n[33] Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. Pointer networks. In Proc. NIPS, pages 2692\u20132700, 2015.\n\n[34] Theophane Weber, S\u00e9bastien Racani\u00e8re, David P. Reichert, Lars Buesing, Arthur Guez, Danilo Jimenez Rezende, Adri\u00e0 Puigdom\u00e8nech Badia, Oriol Vinyals, Nicolas Heess, Yujia Li, Razvan Pascanu, Peter Battaglia, David Silver, and Daan Wierstra. Imagination-augmented agents for deep reinforcement learning. In Proc. 
NIPS, 2017.\n\n[35] Daan Wierstra, Alexander F\u00f6rster, Jan Peters, and J\u00fcrgen Schmidhuber. Recurrent policy gradients. Logic Journal of the IGPL, 18(5):620\u2013634, 2010.\n\n[36] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. In Reinforcement Learning, pages 5\u201332. Springer, 1992.\n\n[37] Yuhuai Wu, Elman Mansimov, Roger B Grosse, Shun Liao, and Jimmy Ba. Scalable trust-region method for deep reinforcement learning using kronecker-factored approximation. In Proc. NIPS, pages 5285\u20135294, 2017.\n\n[38] Wenhan Xiong, Thien Hoang, and William Yang Wang. DeepPath: A reinforcement learning method for knowledge graph reasoning. In Proc. EMNLP, pages 575\u2013584, 2017.\n\n[39] Bishan Yang, Wen-tau Yih, Xiaodong He, Jianfeng Gao, and Li Deng. Embedding entities and relations for learning and inference in knowledge bases. In Proc. ICLR, 2015.\n\n[40] Fan Yang, Zhilin Yang, and William W Cohen. Differentiable learning of logical rules for knowledge base reasoning. In Proc. NIPS, pages 2316\u20132325, 2017.", "award": [], "sourceid": 3410, "authors": [{"given_name": "Yelong", "family_name": "Shen", "institution": "Microsoft Research, Redmond, WA"}, {"given_name": "Jianshu", "family_name": "Chen", "institution": "Tencent AI Lab"}, {"given_name": "Po-Sen", "family_name": "Huang", "institution": "Google DeepMind"}, {"given_name": "Yuqing", "family_name": "Guo", "institution": "Microsoft Research"}, {"given_name": "Jianfeng", "family_name": "Gao", "institution": "Microsoft Research, Redmond, WA"}]}