{"title": "Learning Transferable Graph Exploration", "book": "Advances in Neural Information Processing Systems", "page_first": 2518, "page_last": 2529, "abstract": "This paper considers the problem of efficient exploration of unseen environments, a key challenge in AI. We propose a `learning to explore' framework where we learn a policy from a distribution of environments. At test time, presented with an unseen environment from the same distribution, the policy aims to generalize the exploration strategy to visit the maximum number of unique states in a limited number of steps. We particularly focus on environments with graph-structured state-spaces that are encountered in many important real-world applications like software testing and map building.\nWe formulate this task as a reinforcement learning problem where the `exploration' agent is rewarded for transitioning to previously unseen environment states and employ a graph-structured memory to encode the agent's past trajectory. Experimental results demonstrate that our approach is extremely effective for exploration of spatial maps; and when applied on the challenging problems of coverage-guided software-testing of domain-specific programs and real-world mobile applications, it outperforms methods that have been hand-engineered by human experts.", "full_text": "Learning Transferable Graph Exploration\n\nHanjun Dai\"\u2020\u21e4, Yujia Li\u00a7, Chenglong Wang\u2021, Rishabh Singh\u2020, Po-Sen Huang\u00a7, Pushmeet Kohli\u00a7\n\n\u2021 University of Washington, clwang@cs.washington.edu\n\n\u00a7 DeepMind, {yujiali, posenhuang, pushmeet}@google.com\n\n\" Georgia Institute of Technology\n\n\u2020 Google Brain, {hadai, rising}@google.com\n\nAbstract\n\nThis paper considers the problem of ef\ufb01cient exploration of unseen environments,\na key challenge in AI. We propose a \u2018learning to explore\u2019 framework where we\nlearn a policy from a distribution of environments. 
At test time, presented with\nan unseen environment from the same distribution, the policy aims to generalize\nthe exploration strategy to visit the maximum number of unique states in a limited\nnumber of steps. We particularly focus on environments with graph-structured\nstate-spaces that are encountered in many important real-world applications like\nsoftware testing and map building. We formulate this task as a reinforcement\nlearning problem where the \u2018exploration\u2019 agent is rewarded for transitioning to\npreviously unseen environment states and employ a graph-structured memory\nto encode the agent\u2019s past trajectory. Experimental results demonstrate that our\napproach is extremely effective for exploration of spatial maps; and when applied on\nthe challenging problems of coverage-guided software-testing of domain-speci\ufb01c\nprograms and real-world mobile applications, it outperforms methods that have\nbeen hand-engineered by human experts.\n\n1\n\nIntroduction\n\nExploration is a fundamental problem in AI; appearing in the context of reinforcement learning as a\nsurrogate for the underlying target task [1, 2, 3] or to balance exploration and exploitation [4]. In this\npaper, we consider a coverage variant of the exploration problem where given a (possibly unknown)\nenvironment, the goal is to reach as many distinct states as possible, within a given interaction budget.\nThe above-mentioned state-space coverage exploration problem appears in many important real-world\napplications like software testing and map building which we consider in this paper. The goal of\nsoftware testing is to \ufb01nd as many potential bugs as possible with carefully designed or generated\ntest inputs. To quantify the effectiveness of program exploration, program coverage (e.g. number\nof branches of code triggered by the inputs) is typically used as a surrogate objective [5]. 
One popular automated testing technique is fuzzing, which tries to maximize code coverage via randomly generated inputs [6]. In active map building, a robot needs to construct the map for an unknown environment while also keeping track of its location [7]. The more locations one can visit, the better the map reconstruction can be. Most of these problems have a limited budget (e.g. limited time or simulation trials), so having a good exploration strategy is important.
A crucial challenge for these problems is generalization to unseen environments. Take software testing as an example: in most traditional fuzzing methods, the fuzzing procedure starts from scratch for a new program, and the knowledge about previously tested programs is not utilized. Different programs may share common design patterns and semantics, which could be exploited during exploration. Motivated by this problem, this paper proposes a 'learning to explore' framework

* Work done during an internship at DeepMind, while Hanjun was at Georgia Institute of Technology.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

where we learn a policy from a distribution of environments with the aim of achieving transferable exploration efficiency. At test time, presented with an unseen environment from the same distribution, the policy aims to generalize the exploration strategy to visit the maximum number of unique states in a limited number of steps.
We formulate the state-space coverage problem using Reinforcement Learning (RL). The reward mechanism of the corresponding Markov Decision Process (MDP) is non-stationary, as it changes drastically as the episode proceeds. In particular, visiting an unobserved state will be rewarded, but visiting it more than once is a waste of exploration budget.
In other words, the environment always expects something new from the agent.
States in many such exploration problems are typically structured. For example, programs have syntactic and semantic structures [8], and efficiently covering the program statements requires reasoning about the graph structure. The states in a generic RL environment may also form a graph, with edges indicating reachability. To utilize the structure of these environments, we augment our RL agent with a graph neural network (GNN) [9] to encode and represent the graph structured states. This model gives our agent the ability to generalize across problem instances (environments). We also use a graph structured external memory to capture the interaction history of the agent with the environment. Adding this information to the agent's state allows us to handle the non-stationarity challenge of the coverage exploration problem. The key contributions of this paper can be summarized as:
• We propose a new problem framework of exploration in graph structured spaces for several important applications.
• We propose to use GNNs for modeling graph-structured states, and model the exploration history as a sequence of evolving graphs. The modeling of a sequence of evolving graphs in particular is, as far as we know, the first such attempt in the learning and program testing literature.
• We successfully apply the graph exploration agent on a range of challenging problems, from exploring synthetic 2D mazes, to generating inputs for software testing, and finally testing real-world Android apps. Experimental evaluation shows that our approach is comparable or better in terms of exploration efficiency than strong baselines such as heuristics designed by human experts, and symbolic execution using the Z3 SMT (satisfiability modulo theories) solver [10].

2 Problem Formulation
We consider two different exploration settings.
The first setting concerns exploration in an unknown environment, where the agent observes a graph at each step, with each node corresponding to a visited unique environment state, and each edge corresponding to an experienced transition. In this setting, the graph grows in size during an episode, and the agent maximizes the speed of this growth.
The second setting is about exploration in a known but complex environment, and is motivated by program testing. In this setting, we have access to the program source code and thus also its graph structure, where the nodes in the graph correspond to the program branches and the edges correspond to the syntactic and semantic relationships between branches. The challenge here is to reason about and understand the graph structure, and come up with the right actions to increase graph coverage. Each action corresponds to a test input, which resides in a huge action space and has rich structure. Finding such valuable inputs is highly non-trivial in the automated testing literature [5, 11, 12, 13, 14], because of the challenges in modeling complex program semantics for precise logical reasoning.
We formalize both settings with the same formulation. At each step t, the agent observes a graph G_{t-1} = (V_{t-1}, E_{t-1}) and a coverage mask c_{t-1} : V_{t-1} → {0, 1}, indicating which nodes have been covered in the exploration process so far. The agent generates an action x_t; the environment takes this action and returns a new graph G_t = (V_t, E_t) with a new c_t. In the first setting above, the coverage mask c_t is 1 for any node v ∈ V_t, as the graph only contains visited nodes; in the second setting, the graph G_t is constant from step to step, and the coverage mask c_t(v) = 1 if v has been covered in the past by some action and 0 otherwise. We set the initial observation for t = 0 to be c_0 mapping any node to 0, and in the first exploration setting G_0 to be an empty graph.
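As a concrete illustration of the first setting, a toy environment can be sketched as follows. Class and method names are hypothetical, not the authors' implementation; the hidden adjacency structure stands in for the unknown environment.

```python
class ExploreUnknownEnv:
    """Toy instance of the first setting: the observed graph G_t contains
    only the states visited so far, and grows as the agent explores."""

    def __init__(self, adjacency, start, budget):
        self.adjacency = adjacency  # hidden underlying graph: node -> neighbor list
        self.start = start
        self.budget = budget        # exploration budget T

    def reset(self):
        self.t = 0
        self.current = self.start
        self.nodes = {self.start}   # V_t: unique visited states
        self.edges = set()          # E_t: experienced transitions
        # In this setting c_t(v) = 1 for every observed node by construction.
        return self.nodes, self.edges

    def step(self, action):
        """`action` indexes a neighbor of the current state."""
        nxt = self.adjacency[self.current][action]
        reward = 0.0 if nxt in self.nodes else 1.0  # unnormalized coverage gain
        self.nodes.add(nxt)
        self.edges.add((self.current, nxt))
        self.current = nxt
        self.t += 1
        return (self.nodes, self.edges), reward, self.t >= self.budget
```

The second setting differs only in that the full graph is observable from the start and the coverage mask, rather than the graph, evolves.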
The exploration process for a graph structured environment can be seen as a finite horizon Markov Decision Process (MDP), with the number of actions or steps T being the budget for exploration.

[Figure 1 depicts the evolving graph memory: a GGNN encodes the visible graph at each step via message passing and attentive aggregation, the per-step representations are pooled over the exploration history, and domain specific actions are fed to the simulator, whose feedback updates the graph.]

Figure 1: Overview of our meta exploration model for exploring a known but complicated graph structured environment. The GGNN [15] module captures the graph structures at each step, and the representations of each step are pooled together to form a representation of the exploration history.

Action The space for actions x_t is problem specific. We used the letter x instead of the more common letter a to highlight that these actions are sometimes closer to the typical inputs to a neural network, such as test cases for programs, or click/scroll events for app exploration, which live in an exponentially large space with rich structures, than to the more common fixed finite action spaces in typical RL environments. In particular, for testing programs, each action is a test input to the program, which can be text (sequences of characters) or images (2D arrays of characters).
Our task is to provide a sequence of T actions x_1, x_2, ..., x_T to maximize an exploration objective. An obvious choice is the number of unique nodes (environment states) covered, i.e. Σ_{v∈V_T} c_T(v). To handle different graph sizes during training, we further normalize this objective by the maximum possible size of the graph |V| (see footnote 2), which is the number of nodes in the underlying full graph (for the second exploration setting this is the same as |V_T|). We therefore get the objective in Eq.
(1).

max_{x_1, x_2, ..., x_T} Σ_{v∈V_T} c_T(v) / |V|     (1)

Reward Given the above objective, we can define the per-step reward r_t as in Eq. (2):

r_t = Σ_{v∈V_t} c_t(v) / |V| − Σ_{v∈V_{t−1}} c_{t−1}(v) / |V|     (2)

It is easy to verify that Σ_{t=1}^T r_t = Σ_{v∈V_T} c_T(v) / |V|, i.e., the cumulative reward of the MDP is the same as the objective in Eq. (1), as Σ_{v∈V_0} c_0(v) = 0. In this definition, the reward at time step t is given only to the additional coverage introduced by the action x_t.
State Instead of feeding in only the observation (G_t, c_t) at each step to the agent, we use an agent state representation that contains the full interaction history in the episode h_t = {(x_τ, G_τ, c_τ)}_{τ=0}^{t−1}, with x_0 = ∅. An agent policy maps each h_t to an action x_t.

3 Model
Overview of the Framework We aim to learn an action policy π(x|h_t; θ_t) at each time step t, which is parameterized by θ_t. The objective of this specific MDP is formulated as: max_{[θ_1,...,θ_T]} Σ_{t=1}^T E_{x_t∼π(x|h_t;θ_t)} r_t. Note that we could share θ across time steps and learn a single policy π(x|h_t, θ) for all t, but in a finite horizon MDP we found it beneficial to use different θs for different time steps t.
In this paper, we are not only interested in efficient exploration for a single graph structured environment, but also in the generalization and transferability of learned exploration strategies, which can be used without fine-tuning or retraining on unseen graphs. More concretely, letting G denote each graph structured environment, we are interested in the following meta-reinforcement learning problem:

max_{[θ_1,...,θ_T]} E_{G∼D} [ Σ_{t=1}^T E_{x_t^{(G)} ∼ π(x|h_t^{(G)}; θ_t)} r_t^{(G)} ]     (3)

where D is the distribution of graph exploration problems we are interested in, and we share the parameters {θ_t} across graphs.
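The telescoping identity relating Eqs. (1) and (2) — per-step rewards summing to the final normalized coverage — can be checked numerically. A small sketch with toy coverage masks, not tied to any particular environment:

```python
def per_step_reward(cov_t, cov_prev, n_total):
    """Eq. (2): reward is the normalized coverage gained at step t."""
    return sum(cov_t.values()) / n_total - sum(cov_prev.values()) / n_total

# Toy episode over a 4-node graph: masks c_0..c_3 only ever flip 0 -> 1.
masks = [
    {0: 0, 1: 0, 2: 0, 3: 0},  # c_0 maps every node to 0
    {0: 1, 1: 0, 2: 0, 3: 0},
    {0: 1, 1: 1, 2: 0, 3: 0},
    {0: 1, 1: 1, 2: 0, 3: 0},  # revisiting covered nodes yields zero reward
]
n = 4
rewards = [per_step_reward(masks[t], masks[t - 1], n) for t in range(1, len(masks))]
# Telescoping: the sum of rewards equals the Eq. (1) objective at time T.
assert abs(sum(rewards) - sum(masks[-1].values()) / n) < 1e-9
```

Note the third step contributes zero reward, which is exactly the non-stationarity discussed earlier: an action that was rewarding once is worthless when repeated.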
After training, the learned policy can generalize to new graphs G′ ∼ D from the same distribution, as the parameters are not tied to any particular G.

2 When it is unknown, we can simply divide the reward by T to normalize the total reward.

Graph Structured Agent and Exploration History The key to developing an agent that can learn to optimize Eq. (3) well is to have a model that can: 1) effectively exploit and represent the graph structure of the problem; and 2) encode and incorporate the history of exploration.
Fig. 1 shows an overview of the agent structure. Since the observations are graph structured in our formulation, we use a variant of the Graph Neural Network [9] to embed them into a continuous vector space. We implement a mapping g : (G, c) → R^d using a GNN. The mapping starts from initial node features µ_v^(0), which are problem specific and can be e.g. the syntax information of a program branch, or app screen features. We also pad these features with one extra bit c_t(v) to add in run-time coverage information. These representations are then updated through an iterative message passing process,

µ_v^(l+1) = f(µ_v^(l), {(e_uv, µ_u^(l))}_{u∈N(v)}),     (4)

where N(v) is the set of neighbors of node v, and e_uv is the feature for edge (u, v). This iterative process runs for L iterations, aggregating information from L-hop neighborhoods. We use the parameterization of GGNN [15] to implement the update function f(·). To get the graph representation g(G, c), we aggregate the node embeddings µ_v^(L) from the last message passing step through an attention-based weighted sum following [15], which performs better than a simple sum empirically.
Capturing the exploration history is particularly important. Like many other similar problems in RL, the exploration reward is only consistent when taking the history into account, as repeatedly visiting a 'good' state can only be rewarded once.
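The message passing step of Eq. (4) and a graph readout can be sketched as follows. This uses a deliberately simplified stand-in for f (mean of edge-weighted neighbor messages added to the current embedding; GGNN uses a GRU-style update) and sum pooling instead of the attention readout:

```python
def mp_layer(mu, edges, edge_feat):
    """One application of Eq. (4): mu_v^{l+1} = f(mu_v^l, {(e_uv, mu_u^l)}).
    Here f adds the mean of edge-weighted neighbor embeddings to mu_v --
    an illustrative stand-in for the GGNN parameterization."""
    nbrs = {v: [] for v in mu}
    for u, v in edges:                       # undirected neighborhoods
        nbrs[v].append(u)
        nbrs[u].append(v)
    d = len(next(iter(mu.values())))
    new_mu = {}
    for v in mu:
        agg = [0.0] * d
        for u in nbrs[v]:
            e = edge_feat.get((u, v), edge_feat.get((v, u), 1.0))
            agg = [a + e * x for a, x in zip(agg, mu[u])]
        k = max(len(nbrs[v]), 1)
        new_mu[v] = [h + a / k for h, a in zip(mu[v], agg)]
    return new_mu

def graph_embed(mu, edges, edge_feat, n_layers):
    """Run L message-passing rounds (L-hop information), then sum-pool.
    (The paper uses attention-based pooling; sum keeps the sketch short.)"""
    for _ in range(n_layers):
        mu = mp_layer(mu, edges, edge_feat)
    d = len(next(iter(mu.values())))
    return [sum(mu[v][i] for v in mu) for i in range(d)]
```

On a path graph 0–1–2, one layer lets each node see its immediate neighbors; two layers propagate the endpoint features across the whole path.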
Here we treat h_t as an evolving graph structured memory for the history. The representation of the full history is obtained by aggregating the per-step representations. Formally, we structure the representation function F for history h_t as F(h_t) = F([g_x(x_0), g(G_0, c_0)], ..., [g_x(x_{t−1}), g(G_{t−1}, c_{t−1})]), where g_x is an encoder for actions and [·] is the concatenation operator. The function F can take a variety of forms, for example: 1) take the most recent element; or 2) auto-regressive aggregation across t steps. We explored a few different settings for this, and obtained the best results with an auto-regressive F (more in Appendix B.1).
The action policy π(x_t|h_t) = π(x_t|F(h_t)), conditioned on an encoding of the history h_t, is parameterized by a domain specific neural network. In program testing, where the actions are the generated program inputs, π is an RNN sequence decoder; in other problems, where we have a small finite set of available actions, an MLP is used instead.
Learning To train this agent, we adopt the advantage actor critic algorithm [16] in the synchronized distributed setting. We use 32 distributed actors to collect on-policy trajectories in parallel, and aggregate them on a single machine to perform the parameter updates.

4 Experiments
In this section, we first illustrate the effectiveness of learning an exploration strategy on synthetic 2D mazes, then study the problem of program testing (coverage guided fuzzing) through learning, where our model generates test cases for programs. Lastly, we evaluate our algorithm on exploring both synthetic and real-world mobile apps. We use GMETAEXP (Graph Meta-Exploration) to denote our proposed method. More details about the experiment setup and more results are included in Appendix B.
To train our agent, we adopt the advantage actor critic algorithm [16] in the synchronized distributed setting.
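The advantage actor critic update relies on per-episode returns and a learned baseline. A minimal sketch of that bookkeeping for one finite-horizon episode (generic A2C arithmetic, not the authors' distributed implementation):

```python
def returns_and_advantages(rewards, values, gamma=1.0):
    """Compute discounted returns R_t and advantages A_t = R_t - V(h_t)
    for one finite-horizon episode. `values` are the critic's estimates
    V(h_t), one per step; the policy gradient is weighted by A_t."""
    acc, returns = 0.0, []
    for r in reversed(rewards):          # accumulate from the episode's end
        acc = r + gamma * acc
        returns.append(acc)
    returns.reverse()
    advantages = [ret - val for ret, val in zip(returns, values)]
    return returns, advantages
```

With the coverage rewards of Eq. (2) and gamma = 1, the return from t = 1 is exactly the remaining coverage the episode will still collect.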
For the synthetic experiments, where we have the data generator, we generate the training graph environments on the fly. During inference, the agent is deployed on unseen graphs and is not allowed to fine-tune its parameters (zero-shot generalization).
The baselines we compare against fall into the following categories:
• Random exploration: randomly picks an action at each step;
• Heuristics: exploration heuristics like depth-first-search (DFS) or expert designed ones;
• Exact solver: we compare with Z3 when testing programs. This is a state-of-the-art SMT solver that can find provably optimal solutions given enough computation time.
• Fuzzer: we compare with state-of-the-art fuzzing tools like AFL (footnote 3) and Neuzz for program testing.
• RL baselines: we also compare with RL models that use different amounts of history information, or different history encoding models.

3 http://lcamtuf.coredump.cx/afl/

Table 1: DSL program dataset information.
DSL          # train    # valid   # test   Coverage
RobustFill   1M         1,000     1,000    RegEx
Karel        212,524    490       467      Branches

Table 2: Fraction of the mazes covered via different exploration methods.
Method      Random   RandomDFS   GMETAEXP
Coverage    33%      54%         72%

Figure 2: Maze exploration visualizations (columns: Full Maze, Random, RandDFS, GMETAEXP). Note the mazes are 6x6 but the walls also take up 1 pixel in the visualizations. The start position is marked red in the first column.

Figure 3: Test cases (2D grid world layouts) generated for Karel. Covered program branches are marked. The generated layout on the right by our model GMETAEXP covers all statements in the program, while the program exits after the first statement using the layout on the left.

4.1 Synthetic 2D Maze Exploration
We start with a simple exploration task in synthetic 2D mazes.
The goal is to visit as much of a maze as possible within a fixed number of steps. This is inspired by applications like map building, where an agent explores an environment and builds a map using e.g. SLAM [7]. In this setup, the agent only observes a small neighborhood around its current location, and does not know the 2D coordinates of the grids it has visited. At each time step, it has at most 4 actions (corresponding to the 4 directions). More concretely, we use the following practical protocol to set up this task:
• Observation: the observed G_t contains the locations (nodes) and the connectivities (edges) for the part of the maze the agent has traversed up to time t, plus 1-hop vision for the current location.
• Reward: as defined in Eq. (2), a positive reward is only received if a new location is visited;
• Termination: when the agent has visited all the nodes, or has used up the exploration budget T.
We train on random mazes of size 6 × 6, and test on 100 held-out mazes from the same distribution. The starting location is chosen randomly. We allow the agent to traverse for T = 36 steps, and report the average fraction of the maze grid locations covered on the 100 held-out mazes.
Table 2 shows the quantitative performance of our method and random exploration baselines. As baselines we have a uniform random policy (denoted Random), and a depth-first search policy with random next-step selection (denoted RandDFS), which allows the agent to backtrack and avoid blindly revisiting a node in the current DFS stack. Note that for such a DFS, the exploration order of actions is randomized instead of being fixed. Our learned exploration strategy performs significantly better. Fig.
2 shows some example maze exploration trajectories using different approaches.
We conducted an ablation study on the importance of utilizing the graph structure, and on different variants for modeling the history (see more details in Appendix B.4). We found that 1) exploiting the full graph structure performs significantly better than only using the current node (33%) as the observation, or treating all the nodes in a graph as a set (41%) and ignoring the edges; 2) autoregressive aggregation over the history performs significantly better than only using the last step: our best performance is improved by 5% by modeling the full history compared to using a single step observation.

[Figure 3 shows the Karel test program "DEF run turnRight WHILE (rightIsClear) plantTree move IF (TreesPresent) turnRight ELSE plantTree" together with a randomly generated layout and a GMETAEXP-generated layout.]

4.2 Generating Inputs for Testing Domain Specific Programs
In this section, we study the effectiveness of transferable exploration in the domain of program testing (a.k.a. coverage guided fuzzing). In this setup, our model proposes inputs (test cases) to the program being tested, with the goal of covering as many code branches as possible.
We test our algorithms on two datasets of programs written in two domain specific languages (DSLs), RobustFill [17] and Karel [18]. The RobustFill DSL is a regular expression based string manipulation language, with primitives like concatenation, substring, etc. The Karel DSL is an educational language used to define agents that programmatically explore a grid world. This language is more complex than RobustFill, as it contains conditional statements like if/then/else blocks, and loops like for/while. Table 1 summarizes the statistics of the two datasets.
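The coverage objective used throughout this section — the fraction of branches a set of test inputs triggers — can be measured with lightweight tracing. A toy sketch using Python's sys.settrace on an illustrative stand-in program (how the paper instruments the DSL programs is a separate matter; this only demonstrates the metric):

```python
import sys

def lines_covered(fn, inputs):
    """Return the set of executed line offsets inside `fn` across all
    inputs -- a crude proxy for the branch coverage objective."""
    covered = set()
    def tracer(frame, event, arg):
        if event == "line" and frame.f_code is fn.__code__:
            covered.add(frame.f_lineno - fn.__code__.co_firstlineno)
        return tracer
    old = sys.gettrace()
    sys.settrace(tracer)
    try:
        for x in inputs:
            fn(x)
    finally:
        sys.settrace(old)
    return covered

def toy_program(s):            # hypothetical program under test
    if s.startswith("a"):
        return "branch-1"
    elif len(s) > 3:
        return "branch-2"
    return "branch-3"
```

A single unlucky input exercises only one path; a small, well-chosen set of inputs covers strictly more lines, which is exactly the quantity the joint-coverage numbers below report.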
For Karel, we use the published benchmark dataset (footnote 4) with its train/val/test splits; for RobustFill, the training data was generated using a program synthesizer that is described in [17]. Note that for RobustFill the agent actions are sequences of characters that form an input string, while for Karel the actions are 2D arrays of characters that form map layouts (see Fig. 3 for an example program and two generated map layouts); both are generated and encoded by RNNs. Training these RNNs over such huge action spaces jointly with the rest of the model using RL is a challenging task in itself.
Main Results We compare against two baselines: 1) a uniform random policy, and 2) specialized heuristic algorithms designed by a human expert. The objective we optimize is the fraction of unique code branches (for Karel) or regular expressions (for RobustFill) covered by the test cases (i.e., triggered when executing the program on the generated inputs), which is a good indicator of the quality of the generated inputs. Modern fuzzing tools like AFL are also coverage-guided.
Fig. 4(a) summarizes the coverage performance of the different methods. On RobustFill, our method approaches human expert level performance, with both achieving above 90% coverage. Note that the dataset is generated in a way that the programs are sampled to get the most coverage on human generated inputs [17], so the evaluation is biased towards the human expert. Nevertheless, GMETAEXP still gets comparable performance, which is much better than the random fuzzing approach widely used in software testing [19]. For Karel programs, GMETAEXP gets significantly better results than even the human expert, as it is much harder for a human expert to develop heuristic algorithms to generate inputs for programs with complex conditional and loop statements.
Comparing to fuzzers To compare with fuzzing approaches, we adapted AFL and Neuzz to our problems.
We translated all Karel programs into C programs, as afl-gcc is required by both AFL and Neuzz. We limit the vocabulary and fuzzing strategies to provide guidance in generating valid test cases with AFL. We run AFL for 10 minutes for each program, and report coverage using the test cases with distinct execution traces. Note that to get n distinct execution traces, AFL or Neuzz may propose N ≥ n test cases. Neuzz is set up similarly, but with the output from AFL as initialization.

Table 3: Karel program coverage with AFL and Neuzz.
                     AFL                        Neuzz
# distinct inputs    2     3     5     10      2     3     5     10
joint coverage       0.63  0.67  0.76  0.81    0.64  0.69  0.74  0.77
# inputs tried       11k   31k   82k   122k    11k   14k   17k   23k

We report the joint coverage in Table 3. Our approach has a coverage of 0.75 with 1 test case and 0.95 with 5, significantly more efficient than AFL and Neuzz. Correspondingly, when only one test case is allowed, AFL and Neuzz get coverages of 0.53 and 0.55 respectively, which are about the same as random. Also note that we can directly predict the inputs for new programs (in seconds), rather than taking a long time just to warm up, as needed by Neuzz.
However, using our approach on standard problems used in fuzzing is challenging. For example, the benchmark from Neuzz consists of only a few programs, and this small dataset size makes it difficult to use our learning based approach, which focuses on generalization across programs. On the other hand, our approach does not yet scale to very large programs. SMT solvers are similar to our approach in this regard, as both focus on analyzing smaller functions with complicated logic.
Comparing to an SMT solver Since it is difficult to design good heuristics manually for Karel, we implemented a symbolic execution [13] baseline that uses the Z3 SMT solver to find inputs that maximize the program coverage.
4 https://msr-redmond.github.io/karel-dataset/

Table 4: App testing results.
         Fine-tuning                 Generalization
Data     Q-learning   GMETAEXP      RandDFS   GMETAEXP
ER       0.60         0.68          0.52      0.65
App      0.58         0.61          0.54      0.58

Figure 4: Testing DSL programs. (a) Program coverage results. The joint coverage of multiple inputs is reported. (b) Ablation study on different history encoding models.

Figure 5: Comparing learning curves of GMETAEXP with different initializations.

The symbolic execution baseline finds optimal solutions for 412 out of 467 test programs within the time budget, which takes about 4 hours in total. The average score for the solved 412 cases using a single input is 0.837, which is roughly an 'upper bound' on the single input coverage (it is not guaranteed to be optimal, as we restrict the maximum number of paths to check to 100 and the maximum number of expansions of while loops to 3 to make the solving time tractable). In contrast, GMETAEXP gets a 0.76 average score with one input and takes only seconds to run on all the test programs. While the symbolic execution approach achieves higher average coverage, it is slow and often fails to solve cases with highly nested loop structures (i.e., nested repeat or while loops). On the hard cases where the SMT solver failed (i.e., could not find a solution after checking all top 100 potential paths), our approach still gets 0.698 coverage. This shows that GMETAEXP achieves a good balance between computation cost and accuracy.
We visualize the test inputs generated by GMETAEXP for two example test programs in Fig. 3 and Fig. 6 in the appendix. The covered regular expressions or branches are highlighted. We can observe that the randomly generated inputs can only cover a small fraction of the program.
In contrast, our proposed inputs can trigger many more branches. Moreover, the program also performs interesting manipulations on our generated inputs after execution.
Effectiveness of program-aware inputs When compared with randomly generated program inputs, our learned model does significantly better in coverage. Random generation, however, can trade off efficiency for speed, as generating a random input is very fast. We therefore evaluated random generation with a much higher sample budget, and found that with 10 inputs, random generation can reach a joint coverage (i.e., the union of the coverage of graph nodes/program branches using multiple generated test inputs) of 73%, but the coverage maxed out at only 85% even with 100 inputs (Fig. 7 in the appendix). This shows the usefulness of our learned model and the generated program-aware inputs, as we get 93% coverage with just one input.
Comparison of different conditioning models We also study the effectiveness of different exploration history encoders. The encoders we consider here are: (1) UnCond, where the policy network knows nothing about the programs or the task, and blindly proposes generally 'good' test cases; note that it still knows the past test cases it proposed through the autoregressive parameterization of F(h_t). (2) EnvCond, where the policy network is blind to the program but takes the external reward obtained with previous actions into account when generating new actions; this is similar to meta RL [20, 21], where the agent learns an adaptive policy based on the historical interactions with the environment. (3) Program-aware models, where the policy network conditions on an encoding of the program. We use BowEnc, BiLstmEnc, and GnnEnc to denote the bag of words encoder, bidirectional LSTM encoder, and graph neural network encoder, respectively.

Fig. 4(b) shows the ablation results for the different encoders.
For RobustFill, since the DSL is relatively simple, the models conditioned on the program get similar performance; for Karel, we observe that GnnEnc gets the best performance, especially when the exploration budget, i.e., the number of inputs, is small. One interesting observation is that UnCond, which does not rely on the program, also achieves good performance. This shows that one can find some universally good exploration strategies with RL for these datasets. This is also consistent with software testing practice, where there are common strategies for testing corner cases, like empty strings, null pointers, etc.

4.3 App Testing
In this section, we study the exploration and testing problem for mobile apps. Since mobile apps can be very large and the source code is not available for commercial apps, measuring and modeling coverage at the code branch level is very expensive and often impossible. An alternative practice is to measure the number of distinct 'screens' that are covered by test user interactions [22]. Here each 'screen' packs a number of features and UI elements a user can interact with, and testing different interactions on different screens, to explore different transitions between screens, is a good way to discover bugs and crashes [22, 23].
In this section we explore the screen transition graph for each app with a fixed interaction budget T = 15, in the explore-unknown-environment setting. At each step, the agent can choose from a finite set of user interaction actions like search query, click, scroll, etc. Features of a node may come from an encoding of the visual appearance of the screen, the layout or UI elements visible, or an encoding of past test logs, e.g., in a continuous testing scenario.
More details about the setup and results are included in Appendix B.7.
Datasets: We scraped a set of apps from the Android app store and collected 1,000 apps with at most 20 distinct screens as our dataset. We use 5% of them for held-out evaluation. To avoid expensive interaction with the Android app simulator during learning, we instead used random user inputs to test these apps offline and extracted a screen transition graph for each app. We then built a lightweight offline app simulator that transitions between screens based on the recorded graph. Interacting with this offline simulator is cheap.
In addition to the real-world app dataset, we also created a dataset of synthetic apps to further test the capabilities of our approach. We sampled random Erdős-Rényi (denoted ER in the experiment results) graphs with 15-20 nodes and edge probability 0.1, and used these graphs as the underlying screen transition graphs for the synthetic apps. For training we generate random graphs on the fly, and we use 100 held-out graphs for testing generalization performance.
Baselines: Besides the RandDFS baselines defined in Sec 4.1, we also evaluate a tabular Q-learning baseline. This baseline uses the node ID as the state and does not model the exploration history. This limitation makes it impossible for Q-learning to learn the optimal strategy, as the MDP is non-stationary when the state representation only contains the current node ID. Moreover, since this approach is tabular, it does not generalize to new graphs and cannot be used in the generalization setting. We train this baseline on each graph separately for a fixed number of iterations and report the best performance it reaches on those graphs.
Evaluation setup: We evaluate our algorithms in two scenarios, namely fine-tuning and generalization.
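For concreteness, the tabular Q-learning baseline described above can be sketched as follows; the toy graph and hyperparameters are illustrative, not the paper's settings. Because the state is only the current node ID, the Q-table cannot distinguish whether a neighbor was already visited, which is why the induced MDP is non-stationary for this agent:

```python
import random
from collections import defaultdict

def q_learn_episode(q, neighbors, start, budget, eps, alpha, gamma, rng):
    """One tabular Q-learning episode where the state is the node ID alone.
    Reward is 1 for reaching a previously unvisited node, else 0."""
    node, visited = start, {start}
    for _ in range(budget):
        acts = neighbors[node]
        if not acts:
            break
        if rng.random() < eps:
            nxt = rng.choice(acts)                        # epsilon-greedy explore
        else:
            nxt = max(acts, key=lambda a: q[(node, a)])   # greedy exploit
        reward = 0.0 if nxt in visited else 1.0
        best_next = max((q[(nxt, a)] for a in neighbors[nxt]), default=0.0)
        q[(node, nxt)] += alpha * (reward + gamma * best_next - q[(node, nxt)])
        visited.add(nxt)
        node = nxt
    return len(visited)

# Illustrative 4-node graph; the action space is "move to a neighbor".
g = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1]}
q, rng = defaultdict(float), random.Random(1)
for _ in range(200):          # trained separately per graph, as in the baseline
    q_learn_episode(q, g, 0, budget=15, eps=0.1, alpha=0.5, gamma=0.9, rng=rng)
print(q_learn_episode(q, g, 0, budget=15, eps=0.0, alpha=0.0, gamma=0.9, rng=rng))
```

Since the Q-table is keyed by node IDs of one specific graph, none of what it learns transfers to a new graph.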
In the fine-tuning case, the agent is allowed to interact with the app simulator for as many episodes as needed, and we report the performance of the algorithms after they have been fine-tuned on the apps. Alternatively, this can be thought of as the ‘train on some apps, and evaluate on the same set of apps’ setting, which is standard for many RL tasks. For the generalization scenario, the agent is asked to obtain as much reward as possible within a single episode on apps not seen during training. We compare to the tabular Q-learning approach in the first scenario, as it is stronger than random exploration; for the second scenario, since the tabular policy does not generalize, random exploration is used as the baseline instead.
Results: Table 4 summarizes the results of the different approaches on the different datasets. With our modeling of the graph structure and exploration history, GMETAEXP performs better in both the fine-tuning and generalization experiments than the Q-learning and random exploration baselines. Furthermore, our zero-shot generalization performance is even better than the fine-tuned performance of tabular Q-learning. This further shows the importance of embedding the structural history when proposing user inputs for exploration.

We show the learning curves of our model for learning from scratch versus fine-tuning on the 100 test graphs of the synthetic app dataset in Fig 5. For fine-tuning, we initialize with the trained model and perform reinforcement learning on each individual graph. For learning from scratch, we directly learn on each individual graph separately.
We observe that (1) generalization is quite effective in this case, achieving performance close to the fine-tuned model; and (2) learning from the pretrained model is beneficial: it converges faster and to a model with slightly better coverage than learning from scratch.

5 Related work
Balancing exploration and exploitation is a fundamental topic in reinforcement learning. To tackle this challenge, many mechanisms have been designed, ranging from simple ε-greedy, pseudo-counts [1, 3], intrinsic motivation [2], and diversity [24], to meta-learning approaches that learn the exploration algorithm itself [20, 21] or combine structural noise to address multi-modal policy distributions [25]. In the SLAM literature, the exploration problem is typically known as active SLAM, with different uncertainty criteria [26] such as entropy/information-based approaches [27, 28]. Our work focuses purely on exploring distinct states in a graph.
Exploration for Fuzzing: Fuzzing explores corner cases in software, with coverage-guided search [23] or learned proposal distributions [29]. To explore program semantics with input examples, there have been heuristics designed by human experts [17], sampling from manually tuned distributions [30], and greedy approaches [31]. Some recent learning-based fuzzing approaches, such as Learn&Fuzz [29] and DeepFuzz [32], build language models of inputs and sample from them to generate new inputs, but such a paradigm is not directly applicable to program-conditional testing. Neuzz [33] builds a smooth surrogate function on top of AFL that allows gradient-guided input generation. Rajpal et al. [34] learn a function to predict which bytes might lead to new coverage, using supervised learning on previous fuzzing explorations.
Different from these approaches, which explore a specific task, we learn a transferable exploration strategy, encoded in a graph-memory-based agent that can be directly rolled out in new, unseen environments.
Representation Learning over Structures: The representation of our external graph memory builds on recent advances in graph representation learning [35]. The graph neural network [9] and its variants have shown superior results in domains including program modeling [15, 8], semi-supervised learning [36], and bioinformatics and chemistry [37, 38, 39, 40]. In this paper, we adapt the parameterization from Li et al. [15], graph sequence modeling [41], and attention-based read-out [42] for the graph.
Optimization over Graphs: Prior work has studied path-finding problems on graphs. DOM-Q-NET [43] navigates HTML pages to complete certain tasks, while Mirowski et al. [44] learn to handle complex visual sensory inputs. Our task seeks an optimal traversal tour, which is essentially NP-hard. Our work is also closely related to recent advances in combinatorial optimization over graph-structured data. Graph neural networks can be learned with one-bit [45] or full supervision [46, 47] and generalize to new combinatorial optimization problem instances. When supervision is lacking, reinforcement learning has been adapted [48]. Khalil et al. [49] use a finite-horizon DQN to learn the action policy. Our work differs in two main ways: 1) the full structure of the graph is not always observed and instead needs to be explored by the agent; 2) we model the exploration history as a sequence of evolving graphs, rather than learning a Q-function over a single graph.

6 Conclusion
In this paper, we study the problem of transferable graph exploration. We propose to use a sequence of graph-structured external memories to encode the exploration history.
By encoding the graph structure with a GNN, we also obtain transferable history memory representations. We demonstrate our method on domains including synthetic 2D maze exploration and real-world program and app testing, and show comparable or better performance than human-engineered methods. Future work includes scaling up the graph external memory to handle large software systems or code bases.

Acknowledgments
We would like to thank Hengxiang Hu, Shu-Wei Cheng, and other members of the team for providing data and engineering suggestions. We also want to thank Arthur Guez, Georg Ostrovski, Jonathan Uesato, Tejas Kulkarni, and the anonymous reviewers for providing constructive feedback.

References
[1] Georg Ostrovski, Marc G Bellemare, Aaron van den Oord, and Rémi Munos. Count-based exploration with neural density models. arXiv preprint arXiv:1703.01310, 2017.
[2] Deepak Pathak, Pulkit Agrawal, Alexei A. Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction. In ICML, 2017.
[3] Marc Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Remi Munos. Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems, pages 1471–1479, 2016.
[4] Arthur Guez, Théophane Weber, Ioannis Antonoglou, Karen Simonyan, Oriol Vinyals, Daan Wierstra, Rémi Munos, and David Silver. Learning to search with MCTSnets. arXiv preprint arXiv:1802.04697, 2018.
[5] Patrice Godefroid, Michael Y. Levin, and David A. Molnar. Automated whitebox fuzz testing. In Proceedings of the Network and Distributed System Security Symposium, NDSS 2008, San Diego, California, USA, February 10-13, 2008.
[6] Barton P. Miller, Lars Fredriksen, and Bryan So. An empirical study of the reliability of UNIX utilities. Commun. ACM, 33(12):32–44, 1990.
[7] Hugh Durrant-Whyte and Tim Bailey.
Simultaneous localization and mapping: Part I. IEEE Robotics & Automation Magazine, 13(2):99–110, 2006.
[8] Miltiadis Allamanis, Marc Brockschmidt, and Mahmoud Khademi. Learning to represent programs with graphs. arXiv preprint arXiv:1711.00740, 2017.
[9] Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2009.
[10] Leonardo Mendonça de Moura and Nikolaj Bjørner. Z3: an efficient SMT solver. In Tools and Algorithms for the Construction and Analysis of Systems, 14th International Conference, TACAS 2008, Budapest, Hungary, March 29-April 6, 2008, pages 337–340, 2008. URL https://doi.org/10.1007/978-3-540-78800-3_24.
[11] Koushik Sen. DART: directed automated random testing. In Hardware and Software: Verification and Testing - 5th International Haifa Verification Conference, HVC 2009, Haifa, Israel, October 19-22, 2009, Revised Selected Papers, page 4, 2009. URL https://doi.org/10.1007/978-3-642-19237-1_4.
[12] Koushik Sen, Darko Marinov, and Gul Agha. CUTE: a concolic unit testing engine for C. In Proceedings of the 10th European Software Engineering Conference held jointly with 13th ACM SIGSOFT International Symposium on Foundations of Software Engineering, Lisbon, Portugal, September 5-9, 2005, pages 263–272, 2005. URL https://doi.org/10.1145/1081706.1081750.
[13] Cristian Cadar and Koushik Sen. Symbolic execution for software testing: three decades later. Commun. ACM, 56(2):82–90, 2013. URL https://doi.org/10.1145/2408776.2408795.
[14] Caroline Lemieux and Koushik Sen.
FairFuzz: a targeted mutation strategy for increasing greybox fuzz testing coverage. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, ASE 2018, Montpellier, France, September 3-7, 2018, pages 475–485, 2018. URL https://doi.org/10.1145/3238147.3238176.
[15] Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493, 2015.
[16] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937, 2016.
[17] Jacob Devlin, Jonathan Uesato, Surya Bhupatiraju, Rishabh Singh, Abdel-rahman Mohamed, and Pushmeet Kohli. RobustFill: Neural program learning under noisy I/O. arXiv preprint arXiv:1703.07469, 2017.
[18] Rudy Bunel, Matthew J. Hausknecht, Jacob Devlin, Rishabh Singh, and Pushmeet Kohli. Leveraging grammar and reinforcement learning for neural program synthesis. CoRR, abs/1805.04276, 2018.
[19] Michael Sutton, Adam Greene, and Pedram Amini. Fuzzing: Brute Force Vulnerability Discovery. Pearson Education, 2007.
[20] Yan Duan, John Schulman, Xi Chen, Peter L Bartlett, Ilya Sutskever, and Pieter Abbeel. RL²: Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779, 2016.
[21] Jane X Wang, Zeb Kurth-Nelson, Dhruva Tirumala, Hubert Soyer, Joel Z Leibo, Remi Munos, Charles Blundell, Dharshan Kumaran, and Matt Botvinick. Learning to reinforcement learn. arXiv preprint arXiv:1611.05763, 2016.
[22] Tanzirul Azim and Iulian Neamtiu. Targeted and depth-first exploration for systematic testing of Android apps. In ACM SIGPLAN Notices, volume 48, pages 641–660. ACM, 2013.
[23] Ke Mao, Mark Harman, and Yue Jia.
Sapienz: Multi-objective automated testing for Android applications. In Proceedings of the 25th International Symposium on Software Testing and Analysis, pages 94–105. ACM, 2016.
[24] Benjamin Eysenbach, Abhishek Gupta, Julian Ibarz, and Sergey Levine. Diversity is all you need: Learning skills without a reward function. arXiv preprint arXiv:1802.06070, 2018.
[25] Abhishek Gupta, Russell Mendonca, Yuxuan Liu, Pieter Abbeel, and Sergey Levine. Meta-reinforcement learning of structured exploration strategies. CoRR, abs/1802.07245, 2018.
[26] Henry Carrillo, Ian Reid, and José A Castellanos. On the comparison of uncertainty criteria for active SLAM. In Robotics and Automation (ICRA), 2012 IEEE International Conference on, pages 2080–2087. IEEE, 2012.
[27] Beipeng Mu, Matthew Giamou, Liam Paull, Ali-akbar Agha-mohammadi, John Leonard, and Jonathan How. Information-based active SLAM via topological feature graphs. In Decision and Control (CDC), 2016 IEEE 55th Conference on, pages 5583–5590. IEEE, 2016.
[28] Luca Carlone, Jingjing Du, Miguel Kaouk Ng, Basilio Bona, and Marina Indri. Active SLAM and exploration with particle filters using Kullback-Leibler divergence. Journal of Intelligent & Robotic Systems, 75(2):291–311, 2014.
[29] Patrice Godefroid, Hila Peleg, and Rishabh Singh. Learn&Fuzz: Machine learning for input fuzzing. In Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering, pages 50–59. IEEE Press, 2017.
[30] Richard Shin, Neel Kant, Kavi Gupta, Christopher Bender, Brandon Trabucco, Rishabh Singh, and Dawn Song. Synthetic datasets for neural program synthesis. 2018.
[31] Yewen Pu, Zachery Miranda, Armando Solar-Lezama, and Leslie Kaelbling. Selecting representative examples for program synthesis. In International Conference on Machine Learning, pages 4158–4167, 2018.
[32] Xiao Liu, Xiaoting Li, Rupesh Prajapati, and Dinghao Wu.
DeepFuzz: Automatic generation of syntax valid C programs for fuzz testing. 2019.
[33] Dongdong She, Kexin Pei, Dave Epstein, Junfeng Yang, Baishakhi Ray, and Suman Jana. Neuzz: Efficient fuzzing with neural program learning. arXiv preprint arXiv:1807.05620, 2018.
[34] Mohit Rajpal, William Blum, and Rishabh Singh. Not all bytes are equal: Neural byte sieve for fuzzing. CoRR, abs/1711.04596, 2017.
[35] Peter W Battaglia, Jessica B Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018.
[36] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
[37] Hanjun Dai, Bo Dai, and Le Song. Discriminative embeddings of latent variable models for structured data. In International Conference on Machine Learning, pages 2702–2711, 2016.
[38] Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural message passing for quantum chemistry. arXiv preprint arXiv:1704.01212, 2017.
[39] Tao Lei, Wengong Jin, Regina Barzilay, and Tommi Jaakkola. Deriving neural architectures from sequence and graph kernels. arXiv preprint arXiv:1705.09037, 2017.
[40] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pages 1024–1034, 2017.
[41] Daniel D Johnson. Learning graphical state transitions. 2016.
[42] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.
[43] Sheng Jia, Jamie Kiros, and Jimmy Ba. DOM-Q-NET: Grounded RL on structured language.
arXiv preprint arXiv:1902.07257, 2019.
[44] Piotr Mirowski, Razvan Pascanu, Fabio Viola, Hubert Soyer, Andrew J Ballard, Andrea Banino, Misha Denil, Ross Goroshin, Laurent Sifre, Koray Kavukcuoglu, et al. Learning to navigate in complex environments. arXiv preprint arXiv:1611.03673, 2016.
[45] Daniel Selsam, Matthew Lamm, Benedikt Bunz, Percy Liang, Leonardo de Moura, and David L Dill. Learning a SAT solver from single-bit supervision. arXiv preprint arXiv:1802.03685, 2018.
[46] Alex Nowak, Soledad Villar, Afonso S Bandeira, and Joan Bruna. A note on learning algorithms for quadratic assignment with graph neural networks. arXiv preprint arXiv:1706.07450, 2017.
[47] Zhuwen Li, Qifeng Chen, and Vladlen Koltun. Combinatorial optimization with graph convolutional networks and guided tree search. In Advances in Neural Information Processing Systems, pages 537–546, 2018.
[48] Irwan Bello, Hieu Pham, Quoc V Le, Mohammad Norouzi, and Samy Bengio. Neural combinatorial optimization with reinforcement learning. arXiv preprint arXiv:1611.09940, 2016.
[49] Elias Khalil, Hanjun Dai, Yuyu Zhang, Bistra Dilkina, and Le Song. Learning combinatorial optimization algorithms over graphs. In Advances in Neural Information Processing Systems, pages 6348–6358, 2017.