{"title": "Exploring Unknown Environments with Real-Time Search or Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 1003, "page_last": 1009, "abstract": null, "full_text": "Exploring Unknown Environments with \n\nReal-Time Search or Reinforcement Learning \n\nSven Koenig \n\nCollege of Computing, Georgia Institute of Technology \n\nskoenig@cc.gatech.edu \n\nAbstract \n\nLearning Real-Time A* (LRTA*) is a popular control method that interleaves plan(cid:173)\nning and plan execution and has been shown to solve search problems in known \nenvironments efficiently. In this paper, we apply LRTA * to the problem of getting to \na given goal location in an initially unknown environment. Uninformed LRTA * with \nmaximal lookahead always moves on a shortest path to the closest unvisited state, \nthat is, to the closest potential goal state. This was believed to be a good exploration \nheuristic, but we show that it does not minimize the worst-case plan-execution time \ncompared to other uninformed exploration methods. This result is also of interest to \nreinforcement-learning researchers since many reinforcement learning methods use \nasynchronous dynamic programming, interleave planning and plan execution, and \nexhibit optimism in the face of uncertainty, just like LRTA *. \n\n1 Introduction \n\nReal-time (heuristic) search methods are domain-independent control methods that inter(cid:173)\nleave planning and plan execution. They are based on agent-centered search [Dasgupta et \nat., 1994; Koenig, 1996], which restricts the search to a small part of the environment that \ncan be reached from the current state of the agent with a small number of action executions. \nThis is the part of the environment that is immediately relevant for the agent in its current \nsituation. 
The most popular real-time search method is probably the Learning Real-Time A* (LRTA*) method [Korf, 1990]. It has a solid theoretical foundation and the following advantageous properties: First, it allows for fine-grained control over how much planning to do between plan executions and thus is an any-time contract algorithm [Russell and Zilberstein, 1991]. Second, it can use heuristic knowledge to guide planning, which reduces planning time without sacrificing solution quality. Third, it can be interrupted at any state and resume execution at a different state. Fourth, it amortizes learning over several search episodes, which allows it to find plans with suboptimal plan-execution time fast and then improve the plan-execution time as it solves similar planning tasks, until its plan-execution time is optimal. \n\nFigure 1: Uninformed LRTA* \n\nInitially, u(s) = 0 for all s ∈ S. \n1. s_current := s_start. \n2. If s_current ∈ G, then stop successfully. \n3. Generate a local search space S_lss ⊆ S with s_current ∈ S_lss and S_lss ∩ G = ∅. \n4. Update u(s) for all s ∈ S_lss (Figure 2). \n5. a := one-of argmin_{a ∈ A(s_current)} u(succ(s_current, a)). \n6. Execute action a. \n7. s_current := succ(s_current, a). \n8. If s_current ∈ S_lss, then go to 5. \n9. Go to 2. \n\nFigure 2: Value-Update Step \n\n1. For all s ∈ S_lss: u(s) := ∞. \n2. If u(s) < ∞ for all s ∈ S_lss, then return. \n3. s' := one-of argmin_{s ∈ S_lss: u(s) = ∞} min_{a ∈ A(s)} u(succ(s, a)). \n4. If min_{a ∈ A(s')} u(succ(s', a)) = ∞, then return. \n5. u(s') := 1 + min_{a ∈ A(s')} u(succ(s', a)). \n6. Go to 2. \n\nThus, LRTA* always has a small sum of planning and plan-execution time, and it minimizes the plan-execution time in the long run in case similar planning tasks unexpectedly repeat. 
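For concreteness, the two figures can be rendered in executable form. The sketch below is an illustration, not the paper's implementation: it assumes the environment is given as a dict mapping each state to the list of its successor states (undirected edges listed in both directions), and it uses the minimal lookahead, where the local search space contains only the current state.

```python
from collections import defaultdict

INF = float("inf")

def value_update(u, lss, succ):
    """Figure 2: recompute the u-values of the local search space,
    treating u-values outside the space as correct goal distances."""
    for s in lss:                                    # line 1
        u[s] = INF
    while True:
        todo = [s for s in lss if u[s] == INF]
        if not todo:                                 # line 2
            return
        # line 3: pick the state whose best successor value is smallest
        s = min(todo, key=lambda v: min(u[t] for t in succ[v]))
        best = min(u[t] for t in succ[s])
        if best == INF:                              # line 4
            return
        u[s] = 1 + best                              # line 5

def lrta_star(succ, start, goals, max_steps=100_000):
    """Figure 1 with minimal lookahead: the local search space is just
    the current state. Returns the sequence of visited states."""
    u = defaultdict(int)        # uninformed: every u-value starts at 0
    s, path = start, [start]
    for _ in range(max_steps):
        if s in goals:                               # line 2
            return path
        value_update(u, {s}, succ)                   # lines 3-4
        s = min(succ[s], key=lambda t: u[t])         # line 5 (ties: first wins)
        path.append(s)                               # lines 6-7
    raise RuntimeError("step limit exceeded")
```

On the path graph 1-2-3-4 with goal 4, `lrta_star` walks straight to the goal; on a graph where the greedy first move is a dead end, it backtracks using the updated u-values.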
This is important since no search method that executes actions before it has solved a planning task completely can guarantee to minimize the plan-execution time right away. \n\nReal-time search methods have been shown to be efficient alternatives to traditional search methods in known environments. In this paper, we investigate real-time search methods in unknown environments. In such environments, real-time search methods allow agents to gather information early. This information can then be used to resolve some of the uncertainty and thus reduce the amount of planning done for unencountered situations. \n\nWe study robot-exploration tasks without actuator and sensor uncertainty, where the sensors on-board the robot can uniquely identify its location and the neighboring locations. The robot does not know the map in advance, and thus has to explore its environment sufficiently to find the goal and a path to it. A variety of methods can solve these tasks, including LRTA*. The proceedings of the AAAI-97 Workshop on On-Line Search [Koenig et al., 1997] give a good overview of some of these techniques. In this paper, we study whether uninformed LRTA* is able to minimize the worst-case plan-execution time over all state spaces with the same number of states provided that its lookahead is sufficiently large. Uninformed LRTA* with maximal lookahead always moves on a shortest path to the closest unvisited state, that is, to the closest potential goal state; it exhibits optimism in the face of uncertainty [Moore and Atkeson, 1993]. We show that this exploration heuristic is not as good as it was believed to be. This solves the central problem left open in [Pemberton and Korf, 1992] and improves our understanding of LRTA*. 
Our results also apply to learning control for tasks other than robot exploration, for example the control tasks studied in [Davies et al., 1998]. They are also of interest to reinforcement-learning researchers since many reinforcement-learning methods use asynchronous dynamic programming, interleave planning and plan execution, and exhibit optimism in the face of uncertainty, just like LRTA* [Barto et al., 1995; Kearns and Singh, 1998]. \n\n2 LRTA* \n\nWe use the following notation to describe LRTA*: S denotes the finite set of states of the environment, s_start ∈ S the start state, and ∅ ≠ G ⊆ S the set of goal states. The number of states is n := |S|. A(s) ≠ ∅ is the finite, nonempty set of actions that can be executed in state s ∈ S. succ(s, a) denotes the successor state that results from the execution of action a ∈ A(s) in state s ∈ S. We also use two operators with the following semantics: Given a set X, the expression "one-of X" returns an element of X according to an arbitrary rule. A subsequent invocation of "one-of X" can return the same or a different element. The expression "argmin_{x ∈ X} f(x)" returns the elements x ∈ X that minimize f(x), that is, the set {x ∈ X | f(x) = min_{x' ∈ X} f(x')}. \n\nWe model environments (topological maps) as state spaces that correspond to undirected graphs, and assume that it is indeed possible to reach a goal state from the start state. We measure the distances and thus plan-execution time in action executions, which is reasonable if every action can be executed in about the same amount of time. The graph is initially unknown. The robot can always observe whether its current state is a goal state, how many actions can be executed in it, and which successor states they lead to, but not whether the successor states are goal states. 
Furthermore, the robot can identify the successor states when it observes them again at a later point in time. This assumption is realistic, for example, if the states look sufficiently different or the robot has a global positioning system (GPS) available. \n\nLRTA* learns a map of the environment and thus needs memory proportional to the number of states and actions observed. It associates a small amount of information with the states in its map. In particular, it associates a u-value u(s) with each state s ∈ S. The u-values approximate the goal distances of the states. They are updated as the search progresses and used to determine which actions to execute. Figure 1 describes LRTA*: LRTA* first checks whether it has already reached a goal state and thus can terminate successfully (Line 2). If not, it generates the local search space S_lss ⊆ S (Line 3). While we require only that the current state is part of the local search space and the goal states are not [Barto et al., 1995], in practice LRTA* constructs S_lss by searching forward from the current state. LRTA* then updates the u-values of all states in the local search space (Line 4), as shown in Figure 2. The value-update step assigns each state its goal distance under the assumption that the u-values of all states outside of the local search space correspond to their correct goal distances. Formally, if u(s) ∈ [0, ∞] denotes the u-values before the value-update step and û(s) ∈ [0, ∞] denotes the u-values afterwards, then û(s) = 1 + min_{a ∈ A(s)} û(succ(s, a)) for all s ∈ S_lss and û(s) = u(s) otherwise. Based on these u-values, LRTA* decides which action to execute next (Line 5). It greedily chooses the action that minimizes the u-value of the successor state (ties are broken arbitrarily) because the u-values approximate the goal distances and LRTA* attempts to decrease its goal distance as much as possible. 
Finally, LRTA* executes the selected action (Line 6) and updates its current state (Line 7). Then, if the new state is still part of the local search space used previously, LRTA* selects another action for execution based on the current u-values (Line 8). Otherwise, it iterates (Line 9). (The behavior of LRTA* with either minimal or maximal lookahead does not change if Line 8 is deleted.) \n\n3 Plan-Execution Time of LRTA* for Exploration \n\nIn this section, we study the behavior of LRTA* with minimal and maximal lookaheads in unknown environments. We assume that no a-priori heuristic knowledge is available and, thus, that LRTA* is uninformed. In this case, the u-values of all unvisited states are zero and do not need to be maintained explicitly. \n\nMinimal Lookahead: The lookahead of LRTA* is minimal if the local search space contains only the current state. LRTA* with minimal lookahead performs almost no planning between plan executions. Its behavior in initially known and unknown environments is identical. Figure 3 shows an example. \n\nFigure 3: Example (legend: filled circle = current vertex of the robot; open circle = visited vertex, known not to be a goal vertex; shaded circle = unvisited but known vertex, unknown whether it is a goal vertex; number = u-value of the vertex; solid line = edge traversed in at least one direction; dashed line = untraversed edge; shaded region = local search space; the figure contrasts LRTA* with minimal lookahead and LRTA* with maximal lookahead) \n\nFigure 4: A Planar Undirected Graph (start at one end, goal at the other; all edge lengths are one) \n\nLet gd(s) denote the goal distance of state s. Then, according to one of our previous results, uninformed LRTA* with any lookahead reaches a goal state after at most Σ_{s ∈ S} gd(s) action executions [Koenig and Simmons, 1995]. Since Σ_{s ∈ S} gd(s) ≤ Σ_{i=0}^{n-1} i = n²/2 − n/2, uninformed LRTA* with any lookahead reaches a goal state after O(n²) action executions. \n\nThis upper bound on the plan-execution time is tight in the worst case for uninformed LRTA* with minimal lookahead, even if the number of actions that can be executed in any state is bounded from above by a small constant (here: three). Figure 4, for example, shows a rectangular grid-world for which uninformed LRTA* with minimal lookahead reaches a goal state in the worst case only after Θ(n²) action executions. In particular, LRTA* can traverse the state sequence that is printed by the following program in pseudo code. The scope of the for-statements is shown by indentation. \n\nfor i := n-3 downto n/2 step 2 \n  for j := 1 to i step 2 \n    print j \n  for j := i+1 downto 2 step 2 \n    print j \nfor i := 1 to n-1 step 2 \n  print i \n\nIn this case, LRTA* executes 3n²/16 − 3/4 actions before it reaches the goal state (for n ≥ 2 with n mod 4 = 2). For example, for n = 10, it traverses the state sequence s1, s3, s5, s7, s8, s6, s4, s2, s1, s3, s5, s6, s4, s2, s1, s3, s5, s7, and s9. 
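The pseudocode above translates line for line into Python; the loop bounds below are our reading of the step/downto notation. The sketch reproduces the state sequence for n = 10 and the action count 3n²/16 − 3/4:

```python
def worst_case_sequence(n):
    """Indices of the states traversed by LRTA* with minimal lookahead
    in the grid-world of Figure 4 (assumes n mod 4 == 2)."""
    seq = []
    for i in range(n - 3, n // 2 - 1, -2):  # for i := n-3 downto n/2 step 2
        seq += list(range(1, i + 1, 2))     #   for j := 1 to i step 2: print j
        seq += list(range(i + 1, 1, -2))    #   for j := i+1 downto 2 step 2: print j
    seq += list(range(1, n, 2))             # for i := 1 to n-1 step 2: print i
    return seq
```

For n = 10 this yields the 19-state sequence above, and for n = 6, 10, 14 the number of action executions, `len(seq) - 1`, equals 3n²/16 − 3/4.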
Maximal Lookahead: As we increase the lookahead of LRTA*, we expect that its plan-execution time tends to decrease because LRTA* uses more information to decide which action to execute next. This makes it interesting to study LRTA* with maximal lookahead. The lookahead of LRTA* is maximal in known environments if the local search space contains all non-goal states. In this case, LRTA* performs a complete search without interleaving planning and plan execution and follows a shortest path from the start state to a closest goal state. Thus, it needs gd(s_start) action executions. No other method can do better than that. \n\nFigure 5: Another Planar Undirected Graph (m = 3) (legend: open circle = visited vertex; shaded circle = unvisited vertex; solid line = edge traversed in at least one direction; dashed line = untraversed edge; annotations mark the start, the goal, the branches of length 3, the order in which the remaining unvisited vertices are visited, and the current position of LRTA*) \n\nThe maximal lookahead of LRTA* is necessarily smaller in initially unknown environments than in known environments because its value-update step can only search the known part of the environment. Therefore, the lookahead of LRTA* is maximal in unknown environments if the local search space contains all visited non-goal states. Figure 3 shows an example. Uninformed LRTA* with maximal lookahead always moves on a shortest path to the closest unvisited state, that is, to the closest potential goal state. This appears to be a good exploration heuristic. [Pemberton and Korf, 1992] call this behavior "incremental best-first search," but were not able to prove or disprove whether this locally optimal search strategy is also globally optimal. 
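Incremental best-first search, always moving along a shortest known path to the closest unvisited vertex, can be sketched as repeated breadth-first searches over the part of the graph observed so far. This is an illustrative sketch under our observation model, not code from the paper:

```python
from collections import deque

def explore_closest_unvisited(succ, start, goals):
    """Repeatedly walk a shortest path through known territory to the
    closest unvisited vertex until a goal vertex is entered."""
    visited, pos, walk = {start}, start, [start]
    while pos not in goals:
        # Breadth-first search from the current position; only edges
        # incident to visited vertices are known to the robot.
        parent, queue = {pos: None}, deque([pos])
        target = None
        while queue:
            v = queue.popleft()
            if v not in visited:        # closest unvisited vertex found
                target = v
                break
            for w in succ[v]:
                if w not in parent:
                    parent[w] = v
                    queue.append(w)
        if target is None:
            raise ValueError("no unvisited vertex reachable")
        # Recover the shortest path and execute it.
        path = []
        while target is not None:
            path.append(target)
            target = parent[target]
        for v in reversed(path[:-1]):
            walk.append(v)
            visited.add(v)
        pos = walk[-1]
    return walk
```

The walk it produces is locally optimal in the sense of the text: every leg is a shortest known path to the nearest potential goal state.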
Since this exploration heuristic has been used on real mobile robots [Thrun et al., 1998], we study how well its plan-execution time compares to the plan-execution time of other uninformed exploration methods. We show that the worst-case plan-execution time of uninformed LRTA* with maximal lookahead in unknown environments is Ω(n log n / log log n) action executions and thus grows faster than linearly in the number of states n. It follows that the plan-execution time of LRTA* is not optimal in the worst case, since depth-first search needs a number of action executions in the worst case that grows only linearly in the number of states. \n\nConsider the graph shown in Figure 5, that is a variation of a graph in [Koenig and Smirnov, 1996]. It consists of a stem with several branches. Each branch consists of two parallel paths of the same length that connect the stem to a single edge. The length of the branch is the length of each of the two paths. The stem has length m^m for some integer m ≥ 3 and consists of the vertices v_0, v_1, ..., v_{m^m}. For each integer i with 1 ≤ i ≤ m there are m^{m-i} branches of length Σ_{j=1}^{i-1} m^j each (including branches of length zero). These branches attach to the stem at the vertices v_{j·m^i} for integers j; if i is even, then 0 ≤ j ≤ m^{m-i} − 1, otherwise 1 ≤ j ≤ m^{m-i}. There is one additional single edge that attaches to vertex v_0. v_{m^m} is the starting vertex. The vertex at the end of the single edge of the longest branch is the goal vertex. Notice that the graph is planar. This is a desirable property since non-planar graphs are, in general, rather unrealistic models of maps. \n\nUninformed LRTA* with maximal lookahead can traverse the stem repeatedly forward and backward, and the resulting plan-execution time is large compared to the number of vertices that are necessary to mislead LRTA* into this behavior. 
In particular, LRTA* can behave as follows: It starts at vertex v_{m^m} and traverses the whole stem and all branches, excluding the single edges at their end, and finally traverses the additional edge attached to vertex v_0, as shown in Figure 5. At this point, LRTA* knows all vertices. It then traverses the whole stem, visiting the vertices at the ends of the single edges of the branches of length 0. It then switches directions and travels along the whole stem in the opposite direction, this time visiting the vertices at the end of the single edges of the branches of length m, and so forth, switching directions repeatedly. It succeeds when it finally uses the longest branch and discovers the goal vertex. To summarize, the vertices at the ends of the branches are tried out in the order indicated in Figure 5. The total number of edge traversals is Ω(m^{m+1}) since the stem of length m^m is traversed m + 1 times. To be precise, the total number of edge traversals is (m^{m+3} + 3m^{m+2} − 8m^{m+1} + 2m² − m + 3)/(m² − 2m + 1). It holds that n = Θ(m^m) since n = (3m^{m+2} − 5m^{m+1} − m^m + m^{m-1} + 2m² − 2m + 2)/(m² − 2m + 1). This implies that m = Ω(log n / log log n) since it holds that, for k > 1 and all sufficiently large m (to be precise: m with m ≥ k), \n\nlog_k(m^m) / log_k log_k(m^m) = (m log_k m) / (log_k m + log_k log_k m) = m / (1 + log_k log_k m / log_k m) < m / (1 + 0) = m. \n\nPut together, it follows that the total number of edge traversals is Ω(m^{m+1}) = Ω(m · n) = Ω(n log n / log log n). (We also performed a simulation that confirmed our theoretical results.) \n\nThe graph from Figure 5 can be modified to cause LRTA* to behave similarly even if the assumptions about the capabilities of the robot or the environment vary from our assumptions here, including the case where the robot can observe only the actions that lead to unvisited states but not the states themselves. 
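The two closed forms above can be sanity-checked numerically. The sketch below transcribes them verbatim and verifies, for small m, that both quotients are integral, that the traversal count exceeds the m + 1 stem traversals of length m^m, and that n stays within a constant factor of m^m; the specific constant checked (4) is our own illustrative choice.

```python
def figure5_counts(m):
    """Closed-form totals from the text for the graph of Figure 5:
    returns (total edge traversals, number of vertices n)."""
    d = m * m - 2 * m + 1   # denominator, equal to (m - 1)^2
    t_num = m**(m + 3) + 3 * m**(m + 2) - 8 * m**(m + 1) + 2 * m * m - m + 3
    n_num = 3 * m**(m + 2) - 5 * m**(m + 1) - m**m + m**(m - 1) + 2 * m * m - 2 * m + 2
    assert t_num % d == 0 and n_num % d == 0   # both quotients are integral
    return t_num // d, n_num // d
```

For m = 3 this gives 207 edge traversals on a graph with n = 80 vertices, of which the m + 1 = 4 stem traversals alone account for 108.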
4 Future Work \n\nOur example provided a lower bound on the plan-execution time of uninformed LRTA* with maximal lookahead in unknown environments. The lower bound is barely super-linear in the number of states. A tight bound is currently unknown, although upper bounds are known. A trivial upper bound, for example, is O(n²) since LRTA* executes at most n − 1 actions before it visits another state that it has not visited before, and there are only n states to visit. A tighter upper bound follows directly from [Koenig and Smirnov, 1996]. It was surprisingly difficult to construct our example. It is currently unknown, and therefore a topic of future research, for which classes of graphs the worst-case plan-execution time of LRTA* is optimal up to a constant factor and whether these classes of graphs correspond to interesting and realistic environments. It is also currently unknown how the bounds change as LRTA* becomes more informed about where the goal states are. \n\n5 Conclusions \n\nOur work provides a first analysis of uninformed LRTA* in unknown environments. We studied versions of LRTA* with minimal and maximal lookaheads and showed that their worst-case plan-execution time is not optimal, not even up to a constant factor. The worst-case plan-execution time of depth-first search, for example, is smaller than that of LRTA* with either minimal or maximal lookahead. This is not to say that one should always prefer depth-first search over LRTA* since, for example, LRTA* can use heuristic knowledge to direct its search towards the goal states. LRTA* can also be interrupted at any location and get restarted at a different location. If the batteries of the robot need to get recharged during exploration, for instance, LRTA* can be interrupted and later get restarted at the charging station. 
While depth-first search could be modified to have these properties as well, it would lose some of its simplicity. \n\nAcknowledgments \n\nThanks to Yury Smirnov for our collaboration on previous work which this paper extends. Thanks also to the reviewers for their suggestions for improvements and future research directions. Unfortunately, space limitations prevented us from implementing all of their suggestions in this paper. \n\nReferences \n\n(Barto et al., 1995) Barto, A.; Bradtke, S.; and Singh, S. 1995. Learning to act using real-time dynamic programming. Artificial Intelligence 73(1):81-138. \n\n(Dasgupta et al., 1994) Dasgupta, P.; Chakrabarti, P.; and DeSarkar, S. 1994. Agent searching in a tree and the optimality of iterative deepening. Artificial Intelligence 71:195-208. \n\n(Davies et al., 1998) Davies, S.; Ng, A.; and Moore, A. 1998. Applying online search techniques to reinforcement learning. In Proceedings of the National Conference on Artificial Intelligence. 753-760. \n\n(Kearns and Singh, 1998) Kearns, M. and Singh, S. 1998. Near-optimal reinforcement learning in polynomial time. In Proceedings of the International Conference on Machine Learning. 260-268. \n\n(Koenig and Simmons, 1995) Koenig, S. and Simmons, R. G. 1995. Real-time search in non-deterministic domains. In Proceedings of the International Joint Conference on Artificial Intelligence. 1660-1667. \n\n(Koenig and Smirnov, 1996) Koenig, S. and Smirnov, Y. 1996. Graph learning with a nearest neighbor approach. In Proceedings of the Conference on Computational Learning Theory. 19-28. \n\n(Koenig et al., 1997) Koenig, S.; Blum, A.; Ishida, T.; and Korf, R., editors 1997. Proceedings of the AAAI-97 Workshop on On-Line Search. AAAI Press. \n\n(Koenig, 1996) Koenig, S. 1996. Agent-centered search: Situated search with small look-ahead. In Proceedings of the National Conference on Artificial Intelligence. 1365. \n\n(Korf, 1990) Korf, R. 
1990. Real-time heuristic search. Artificial Intelligence 42(2-3):189-211. \n\n(Moore and Atkeson, 1993) Moore, A. and Atkeson, C. 1993. Prioritized sweeping: Reinforcement learning with less data and less time. Machine Learning 13:103-130. \n\n(Pemberton and Korf, 1992) Pemberton, J. and Korf, R. 1992. Incremental path planning on graphs with cycles. In Proceedings of the International Conference on Artificial Intelligence Planning Systems. 179-188. \n\n(Russell and Zilberstein, 1991) Russell, S. and Zilberstein, S. 1991. Composing real-time systems. In Proceedings of the International Joint Conference on Artificial Intelligence. 212-217. \n\n(Thrun et al., 1998) Thrun, S.; Bücken, A.; Burgard, W.; Fox, D.; Fröhlinghaus, T.; Hennig, D.; Hofmann, T.; Krell, M.; and Schmidt, T. 1998. Map learning and high-speed navigation in RHINO. In Kortenkamp, D.; Bonasso, R.; and Murphy, R., editors 1998, Artificial Intelligence Based Mobile Robotics: Case Studies of Successful Robot Systems. MIT Press. 21-52.", "award": [], "sourceid": 1608, "authors": [{"given_name": "Sven", "family_name": "Koenig", "institution": null}]}