{"title": "Learning to Navigate in Cities Without a Map", "book": "Advances in Neural Information Processing Systems", "page_first": 2419, "page_last": 2430, "abstract": "Navigating through unstructured environments is a basic capability of intelligent creatures, and thus is of fundamental interest in the study and development of artificial intelligence. Long-range navigation is a complex cognitive task that relies on developing an internal representation of space, grounded by recognisable landmarks and robust visual processing, that can simultaneously support continuous self-localisation (\"I am here\") and a representation of the goal (\"I am going there\"). Building upon recent research that applies deep reinforcement learning to maze navigation problems, we present an end-to-end deep reinforcement learning approach that can be applied on a city scale. Recognising that successful navigation relies on integration of general policies with locale-specific knowledge, we propose a dual pathway architecture that allows locale-specific features to be encapsulated, while still enabling transfer to multiple cities. A key contribution of this paper is an interactive navigation environment that uses Google Street View for its photographic content and worldwide coverage. Our baselines demonstrate that deep reinforcement learning agents can learn to navigate in multiple cities and to traverse to target destinations that may be kilometres away. 
A video summarizing our research and showing the trained agent in diverse city environments as well as on the transfer task is available at: https://sites.google.com/view/learn-navigate-cities-nips18", "full_text": "Learning to Navigate in Cities Without a Map\n\nPiotr Mirowski, Matthew Koichi Grimes, Mateusz Malinowski, Karl Moritz Hermann,\n\nKeith Anderson, Denis Teplyashin, Karen Simonyan, Koray Kavukcuoglu,\n\nAndrew Zisserman, Raia Hadsell\n\nDeepMind\n\nLondon, United Kingdom\n\n{piotrmirowski, mkg, mateuszm, kmh, keithanderson, }@google.com\n\n{teplyashin, simonyan, korayk, zisserman, raia}@google.com\n\nAbstract\n\nNavigating through unstructured environments is a basic capability of intelligent\ncreatures, and thus is of fundamental interest in the study and development of\narti\ufb01cial intelligence. Long-range navigation is a complex cognitive task that re-\nlies on developing an internal representation of space, grounded by recognisable\nlandmarks and robust visual processing, that can simultaneously support continu-\nous self-localisation (\u201cI am here\u201d) and a representation of the goal (\u201cI am going\nthere\u201d). Building upon recent research that applies deep reinforcement learning to\nmaze navigation problems, we present an end-to-end deep reinforcement learning\napproach that can be applied on a city scale. Recognising that successful nav-\nigation relies on integration of general policies with locale-speci\ufb01c knowledge,\nwe propose a dual pathway architecture that allows locale-speci\ufb01c features to be\nencapsulated, while still enabling transfer to multiple cities. A key contribution of\nthis paper is an interactive navigation environment that uses Google Street View\nfor its photographic content and worldwide coverage. Our baselines demonstrate\nthat deep reinforcement learning agents can learn to navigate in multiple cities and\nto traverse to target destinations that may be kilometres away. 
The project webpage\nhttp://streetlearn.cc contains a video summarizing our research and showing\nthe trained agent in diverse city environments and on the transfer task, the form\nto request the StreetLearn dataset and links to further resources. The StreetLearn environment\ncode is available at https://github.com/deepmind/streetlearn.\n\n1 Introduction\n\nThe subject of navigation is attractive to various research disciplines and technology domains alike,\nbeing at once a subject of inquiry from the point of view of neuroscientists wishing to crack the code\nof grid and place cells [2, 12], as well as a fundamental aspect of robotics research. The majority\nof algorithms involve building an explicit map during an exploration phase and then planning and\nacting via that representation. In this work, we are interested in pushing the limits of end-to-end\ndeep reinforcement learning for navigation by proposing new methods and demonstrating their\nperformance in large-scale, real-world environments. Just as humans can learn to navigate a city\nwithout relying on maps, GPS localisation, or other aids, it is our aim to show that a neural network\nagent can learn to traverse entire cities using only visual observations. In order to realise this aim, we\ndesigned an interactive environment that uses the images and underlying connectivity information\nfrom Google Street View, and propose a dual pathway agent architecture that can navigate within the\nenvironment (see Fig. 1a).\nLearning to navigate directly from visual inputs has been shown to be possible in some domains, by\nusing deep reinforcement learning (RL) approaches that can learn from task rewards \u2013 for instance,\nnavigating to a destination. 
Recent research has demonstrated that RL agents can learn to navigate\nhouse scenes [45, 42], mazes (e.g. [33]), and 3D games (e.g. [30]). These successes notwithstanding,\ndeep RL approaches are notoriously data inefficient and sensitive to perturbations of the environment,\nand are more well-known for their successes in games and simulated environments than in real-world\napplications. It is therefore not obvious that they can be used for large-scale visual navigation based\non real-world images, and hence this is the subject of our investigation.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nFigure 1: (a) Diverse views and corresponding local maps in Street View. Our environment is built of real-world places from Street View (we illustrate Times\nSquare and Central Park in New York City and St. Paul\u2019s Cathedral in London). The green cone\nrepresents the agent\u2019s location and orientation. (b) Street View regions used in this study. We use large regions of London and Paris and in\nNew York we focus on 5 different regions to show transfer.\n\nThe primary contributions of this paper are (a) to present a new RL challenge that features real world\nvisual navigation through city-scale environments, and (b) to propose a modular, goal-conditional\ndeep RL algorithm that can solve this task, thus providing a strong baseline for future research.\nStreetLearn\u00b9 is a new interactive environment for reinforcement learning that features real-world\nimages as agent observations, with real-world grounded content that is built on top of the publicly\navailable Google Street View. 
Within this environment we have developed a traversal task that\nrequires that the agent navigates from goal to goal within London, Paris and New York City.\nTo evaluate the feasibility of learning in such an environment, we propose an agent that learns a goal-\ndependent policy with a dual pathway, modular architecture with similarities to the interchangeable\ntask-speci\ufb01c modules approach from [13], and the target-driven visual navigation approach of [45].\nThe approach features a recurrent neural architecture that supports both locale-speci\ufb01c learning as\nwell as general, transferable navigation behaviour. Balancing these two capabilities is achieved by\nseparating a recurrent neural pathway from the general navigation policy of the agent. This pathway\naddresses two needs. First, it receives and interprets the current goal given by the environment, and\nsecond, it encapsulates and memorises the features and structure of a single city region. Thus, rather\nthan using a map, or an external memory, we propose an architecture with two recurrent pathways\nthat can effectively address a challenging navigation task in a single city as well as transfer to new\ncities or regions by training only a new locale-speci\ufb01c pathway.\n\n2 Related Work\n\nReward-driven navigation in a real-world environment is related to research in various areas of deep\nlearning, reinforcement learning, navigation and planning.\nLearning from real-world imagery. Localising from only an image may seem impossible, but\nhumans can integrate visual cues to geolocate a given image with surprising accuracy, motivating ma-\nchine learning approaches. For instance, convolutional neural networks (CNNs) achieve competitive\nscores on the geolocation task [41] and CNN+LSTM architectures improve on this [15, 31]. Several\nmethods [5, 28], including DeepNav [6], use datasets collected using Street View or Open Street\nMaps and solve navigation-related tasks using supervision. 
RatSLAM demonstrates localisation and\npath planning over long distances using a biologically-inspired architecture [32]. The aforementioned\nmethods rely on supervised training with ground truth labels: with the exception of the compass, we\ndo not provide labels in our environment.\n\n\u00b9 http://streetlearn.cc (dataset) and https://github.com/deepmind/streetlearn (code).\n\n[Figure 1b region labels: Central London, Paris Rive Gauche, Harlem, Wall Street, NYU, Midtown, Central Park]\n\nDeep RL methods for navigation. Many RL-based approaches for navigation rely on simulators\nwhich have the benefit of features like procedurally generated variations but tend to be visually\nsimple and unrealistic [3, 26, 39]. To support sparse reward signals in these environments, recent\nnavigational agents use auxiliary tasks in training [33, 25, 30]. Other methods learn to predict future\nmeasurements or to follow simple text instructions [16, 23, 22, 11]; in our case, the goal is designated\nusing proximity to local landmarks. Deep RL has also been used for active localisation [10]. Similar\nto our proposed architecture, [45] show goal-conditional indoor navigation with a simulated robot\nand environment.\nTo bridge the gap between simulation and reality, researchers have developed more realistic, higher-fidelity\nsimulated environments [17, 29, 38, 42]. However, in spite of their increasing photo-realism,\nthe inherent problems of simulated environments lie in the limited diversity of the environments and\nthe antiseptic quality of the observations. Photographic environments have been used to train agents\non short navigation problems in indoor scenes with limited scale [9, 1, 7, 35]. Our real-world dataset\nis diverse and visually realistic, comprising scenes with vegetation, pedestrians or vehicles, diverse\nweather conditions and covering large geographic areas. 
However, we note that there are obvious\nlimitations of our environment: it does not contain dynamic elements, the action space is necessarily\ndiscrete as it must jump between panoramas, and the street topology cannot be arbitrarily altered.\nDeep RL for path planning and mapping. Several recent approaches have used memory or other\nexplicit neural structures to support end-to-end learning of planning or mapping. These include\nNeural SLAM [44], which proposes an RL agent with an external memory to represent an occupancy\nmap and a SLAM-inspired algorithm, Neural Map [36], which proposes a structured 2D memory for\nnavigation, Memory Augmented Control Networks [27], which use a hierarchical control strategy,\nand MERLIN, a general architecture that achieves superhuman results in novel navigation tasks [40].\nOther work [8, 10] explicitly provides a global map that is input to the agent. The architecture in [21]\nuses an explicit neural mapper and planner for navigation tasks as well as registered pairs of landmark\nimages and poses. Similar to [20, 44], they use extra memory that represents the ego-centric agent\nposition. Another recent work proposes a graph network solution [37]. The focus of our paper is\nto demonstrate that simpler architectures can explore and memorise very large environments using\ntarget-driven visual navigation with a goal-conditional policy.\n\n3 Environment\n\nThis section presents an interactive environment, named StreetLearn, constructed using Google Street\nView, which provides a public API\u00b2. Street View provides a set of geolocated 360\u00b0 panoramic images\nwhich form the nodes of an undirected graph. We selected a number of large regions in New York\nCity, Paris and London that contain between 7,000 and 65,500 nodes (and between 7,200 and 128,600\nedges, respectively), have a mean node spacing of 10m, and cover a range of up to 5km (see Fig. 
1b).\nWe do not simplify the underlying connectivity, thus there are congested areas with complex occluded\nintersections, tunnels and footpaths, and other ephemera. Although the graph is used to construct the\nenvironment, the agent only sees the raw RGB images (see Fig. 1a).\n\n3.1 Agent Interface and the Courier Task\n\nAn RL environment needs to specify the start space, observations, and action space of the agent as\nwell as the task reward. The agent has two inputs: the image $x_t$, which is a cropped, 60\u00b0 square, RGB\nimage that is scaled to 84 \u00d7 84 pixels (i.e. not the entire panorama), and the goal description $g_t$. The\naction space is composed of five discrete actions: \u201cslow\u201d rotate left or right (\u00b122.5\u00b0), \u201cfast\u201d rotate\nleft or right (\u00b167.5\u00b0), or move forward; this action becomes a no-op if there is not an edge in view\nfrom the current agent pose. If there are multiple edges in the view cone of the agent, then the most\ncentral one is chosen.\nThere are many options for how to specify the goal to the agent, from images to agent-relative\ndirections, to text descriptions or addresses. We choose to represent the current goal in terms of\nits proximity to a set L of fixed landmarks: $L = \\{(\\mathrm{Lat}_k, \\mathrm{Long}_k)\\}_k$, specified using the Lat/Long\n(latitude and longitude) coordinate system. To represent a goal at $(\\mathrm{Lat}^g_t, \\mathrm{Long}^g_t)$ we take a softmax\nover the distances to the k landmarks (see Fig. 2a), thus for distances $\\{d^g_{t,k}\\}_k$ the goal vector\ncontains $g_{t,i} = \\exp(-\\alpha d^g_{t,i}) / \\sum_k \\exp(-\\alpha d^g_{t,k})$ for the ith landmark with $\\alpha = 0.002$ (which we\nchose through cross-validation). This forms a goal description with certain desirable qualities: it is a\nscalable representation that extends easily to new regions, it does not rely on any arbitrary scaling of\nmap coordinates, and it has intuitive meaning: humans and animals also navigate with respect to\nfixed landmarks. Note that landmarks are fixed per map and we used the same list of landmarks across\nall experiments; $g_t$ is computed using the distance to all landmarks, but by feeding these distances\nthrough a non-linearity, the contribution of distant landmarks is reduced to zero. In the Supplementary\nmaterial, we show that the locally-continuous landmark-based representation of the goal performs\nas well as the linear scalar representation $(\\mathrm{Lat}^g_t, \\mathrm{Long}^g_t)$. Since the landmark-based representation\nperforms well while being independent of the coordinate system and thus more scalable, we use this\nrepresentation as canonical. Note that the goal description is not relative to the agent\u2019s position and\nonly changes when a new goal is sampled. Locations of the 644 manually defined landmarks in New\nYork, London and Paris are given in the Supplementary material, where we also show that the density\nof landmarks does not impact the agent performance.\n\n\u00b2 https://developers.google.com/maps/documentation/streetview/\n\nFigure 2: (a) Goal description using landmarks. In the illustration of the goal description, we show a set of 5 nearby landmarks and 4\ndistant ones; the code $g_i$ is a vector with a softmax-normalised distance to each landmark. (b) Comparison of architectures. Left:\nGoalNav is a convolutional encoder plus policy LSTM with goal description input. Middle: CityNav\nis a single-city navigation architecture with a separate goal LSTM and optional auxiliary heading ($\\theta$).\nRight: MultiCityNav is a multi-city architecture with individual goal LSTM pathways for each city.\n\nIn the courier task, which we define as the problem of navigating to a series of random locations in a\ncity, the agent starts each episode from a randomly sampled position and orientation. If the agent\ngets within 100m of the goal (approximately one city block), the next goal is randomly chosen and\ninput to the agent. Each episode ends after 1000 agent steps. 
The reward that the agent gets upon\nreaching a goal is proportional to the shortest path between the goal and the agent\u2019s position when\nthe goal is first assigned; much like a delivery service, the agent receives a higher reward for longer\njourneys. Note that we do not reward agents for taking detours, but rather that the reward in a given\nlevel is a function of the optimal distance from start to goal location. As the goals get more distant\nduring the training curriculum, per-episode reward statistics should ideally reach and stay at a plateau\nperformance level if the agent can equally reach closer and further goals.\n\n4 Methods\n\nWe formalise the learning problem as a Markov Decision Process, with state space S, action space\nA, environment E, and a set of possible goals G. The reward function depends on the current goal\nand state: $R : S \\times G \\times A \\to \\mathbb{R}$. The usual reinforcement learning objective is to find the policy\nthat maximises the expected return defined as the sum of discounted rewards starting from state\n$s_0$ with discount $\\gamma$. In this navigation task, the expected return from a state $s_t$ also depends on\nthe series of sampled goals $\\{g_k\\}_k$. The policy is a distribution over actions given the current state\n$s_t$ and the goal $g_t$: $\\pi(a \\mid s, g) = \\Pr(a_t = a \\mid s_t = s, g_t = g)$. 
We define the value function to be\nthe expected return for the agent that is sampling actions from policy $\\pi$ from state $s_t$ with goal $g_t$:\n$V^\\pi(s, g) = E[R_t] = E[\\sum_{k=0}^{\\infty} \\gamma^k r_{t+k} \\mid s_t = s, g_t = g]$.\nWe hypothesise the courier task should benefit from two types of learning: general, and locale-specific.\nA navigating agent not only needs an internal representation that is general, to support cognitive\nprocesses such as scene understanding, but also needs to organise and remember the features and\nstructures that are unique to a place. Therefore, to support both types of learning, we focus on neural\narchitectures with multiple pathways.\n\n4.1 Architectures\n\nThe policy and the value function are both parameterised by a neural network which shares all\nlayers except the final linear outputs. The agent operates on raw pixel images $x_t$, which are passed\nthrough a convolutional network as in [34]. A Long Short-Term Memory (LSTM) [24] receives the\noutput of the convolutional encoder as well as the past reward $r_{t-1}$ and previous action $a_{t-1}$. The\nthree different architectures are described below. Additional architectural details are given in the\nSupplementary Material.\nThe baseline GoalNav architecture (Fig. 2b, panel a) has a convolutional encoder and policy LSTM. The key\ndifference from the canonical A3C agent [34] is that the goal description $g_t$ is input to the policy\nLSTM (along with the previous action and reward).\nThe CityNav architecture (Fig. 2b, panel b) combines the previous architecture with an additional LSTM,\ncalled the goal LSTM, which receives visual features as well as the goal description. 
The CityNav\nagent also adds an auxiliary heading ($\\theta$) prediction task on the outputs of the goal LSTM.\nThe MultiCityNav architecture (Fig. 2b, panel c) extends the CityNav agent to learn in different cities. The\nremit of the goal LSTM is to encode and encapsulate locale-specific features and topology such that\nmultiple pathways may be added, one per city or region. Moreover, after training on a number of\ncities, we demonstrate that the convolutional encoder and the policy LSTM become general enough\nthat only a new goal LSTM needs to be trained for new cities, a benefit of the modular approach [13].\nFigure 2b illustrates that the goal descriptor $g_t$ is not seen by the policy LSTM but only by the locale-specific\nLSTM in the CityNav and MultiCityNav architectures (the baseline GoalNav agent has\nonly one LSTM, so we directly input $g_t$). This separation forces the locale-specific LSTM to interpret\nthe absolute goal position coordinates, with the hope that it then sends relative goal information\n(directions) to the policy LSTM. This hypothesis is tested in section 2.3 of the supplementary material.\nAs shown in [25, 33, 16, 30], auxiliary tasks can speed up learning by providing extra gradients as\nwell as relevant information. We employ a very natural auxiliary task: the prediction of the agent\u2019s\nheading $\\theta_t$, defined as the angle between the north direction and the agent\u2019s pose, using a multinomial\nclassification loss on binned angles. The optional heading prediction is an intuitive way to provide\nadditional gradients for training the convnet. The agent can learn to navigate without it, but we\nbelieve that heading prediction helps the agent learn the geometry of the environment; the Supplementary\nmaterial provides a detailed architecture ablation analysis and agent implementation details.\nTo train the agents, we use IMPALA [18], an actor-critic implementation that decouples acting and\nlearning. 
In our experiments, IMPALA results in similar performance to A3C [34]. We use 256\nactors for CityNav and 512 actors for MultiCityNav, with batch sizes of 256 or 512 respectively, and\nsequences are unrolled to length 50.\n\n4.2 Curriculum Learning\n\nCurriculum learning gradually increases the complexity of the learning task by presenting progressively\nmore difficult examples to the learning algorithm [4, 19, 43]. We use a curriculum to help the\nagent learn to find increasingly distant destinations. Similar to RL problems such as Montezuma\u2019s\nRevenge, the courier task suffers from very sparse rewards; unlike that game, we are able to define a\nnatural curriculum scheme. We start by sampling each new goal to be within 500m of the agent\u2019s\nposition (phase 1). In phase 2, we progressively grow the maximum range of allowed destinations to\ncover the full graph (3.5km in the smaller New York areas, or 5km for central London or Paris).\n\n5 Results\n\nIn this section, we demonstrate and analyse the performance of the proposed architectures on the\ncourier task. We first show the performance of our agents in large city environments, next their\ngeneralisation capabilities on a held-out set of goals. Finally, we investigate whether the proposed\napproach allows transfer of an agent trained on a set of regions to a new and previously unseen region.\n\nFigure 3: (a) NYU (New York City). (b) Central London. (c) Effect of reward shaping. Average per-episode rewards (y axis) are plotted vs. learning steps (x axis) for the courier\ntask. We compare the GoalNav agent, the CityNav agent, and the CityNav agent without skip\nconnection on the NYU environment (a), and the CityNav agent in London (b). We also give Oracle\nperformance and a Heuristic agent. A curriculum is used in London; we indicate the end of phase 1\n(up to 500m) and the end of phase 2 (5000m). 
(c) Results of the CityNav agent on NYU, comparing\nradii of early rewards (ER) vs. ER with random coins vs. curriculum with ER 200m and no coins.\n\n5.1 Courier Navigation in Large, Diverse City Environments\n\nWe first show that the CityNav agent, trained with curriculum learning, succeeds in learning the\ncourier task in New York, London and Paris. We replicated experiments with 5 random seeds and\nplot the mean and standard deviation of the reward statistic throughout the experimental results.\nThroughout the paper, and for ease of comparison with experiments that include reward shaping, we\nreport only the rewards at the goal destination (goal rewards). Figure 3 compares different agents and\nshows that the CityNav architecture with the dual LSTM pathways and the heading prediction task\nattains a higher performance and is more stable than the simpler GoalNav agent. We also trained a\nCityNav agent without the skip connection from the vision layers to the policy LSTM. While this\nhurts the performance in single-city training, we consider it because of the multi-city transfer scenario\n(see Section 5.4) where funnelling all visual information through the locale-specific LSTM seems to\nregularise the interface between the goal LSTM and the policy LSTM. We also consider two baselines\nwhich give lower (Heuristic) and upper (Oracle) bounds on the performance. Heuristic is a random\nwalk on the street graph, where the agent turns in a random direction if it cannot move forward; if at\nan intersection it will turn with a probability p = 0.95. Oracle uses the full graph to compute the\noptimal path using breadth-first search.\n\nFigure 4: (a) Number of steps required for the CityNav agent to reach a goal from 100 start locations\nvs. the straight-line distance to the goal in metres. (b) CityNav performance in London (left panes)\nand NYU (right panes). Top: examples of the agent\u2019s trajectory during one 1000-step episode,\nshowing successful consecutive goal acquisitions. The arrows show the direction of travel of the\nagent. Bottom: We visualise the agent\u2019s value function over 100 trajectories with random starting\npoints and the same goal. Thicker and warmer colour lines correspond to higher values.\n\nWe visualise trajectories from the trained agent over two 1000-step episodes (Fig. 4b (top row)). In\nLondon, we see that the agent crosses a bridge to get to the first goal, then travels to goal 2, and the\nepisode ends before it can reach the third goal. Figure 4b (bottom row) shows the value function\nof the agent as it repeatedly navigates to a chosen destination (respectively, St Paul\u2019s Cathedral in\nLondon and Washington Square in New York).\nTo understand whether the agent has learned a policy over the full extent of the environment, we\nplot the number of steps required by the agent to get to the goal. As the number grows linearly with\nthe straight-line distance to that goal, this result suggests that the agent has successfully learnt the\nnavigation policy in both cities (Fig. 4a).\n\n5.2 Impact of Reward Shaping and Curriculum Learning\n\nTo better understand the environment, we present further experiments on reward shaping and curriculum learning. Additional\nanalysis, including architecture ablations, the robustness of the agent to the choice of goal\nrepresentations, and position and goal decoding, is presented in the Supplementary Material.\nOur navigation task assigns a goal to the agent; once the agent successfully navigates to the goal, a\nnew goal is given to the agent. The long distance separating the agent from the goal makes this a\ndifficult RL problem with sparse rewards. 
To simplify this challenging task, we investigate giving\nearly rewards (reward shaping) to the agent before it reaches the goal (we define goals with a 100m\nradius), or adding random rewards (coins) to encourage exploration [3, 33]. Figure 3c suggests that\ncoins by themselves are ineffective as our task does not benefit from wide explorations. At the same\ntime, large radii of reward shaping help as they greatly simplify the problem. We prefer curriculum\nlearning to reward shaping on large areas because the former approach keeps agent training consistent\nwith its experience at test time and also reduces the risk of learning degenerate strategies such as\nascending the gradient of increasing rewards to reach the goal, rather than learning to read the goal\nspecification $g_t$.\nAs a trade-off between task realism and feasibility, and guided by the results in Fig. 3c, we decide\nto keep a small amount of reward shaping (200m away from the goal) combined with curriculum\nlearning. The specific reward function we use is: $r_t = \\max(0, \\min(1, (d_{ER} - d^g_t)/100)) \\times r_g$, where\n$d^g_t$ is the distance from the current position of the agent to the goal, $d_{ER} = 200$ and $r_g$ is the reward\nthat the agent will receive if it reaches the goal. Early rewards are given only once per panorama /\nnode, and only if the distance $d^g_t$ to the goal is decreasing (in order to avoid the agent developing a\nbehavior of harvesting early rewards around the goal rather than going directly towards the goal).\nWe choose a curriculum that starts by sampling the goal within a radius of 500m from the agent\u2019s\nlocation, and progressively grows that disc until it reaches the maximum distance an agent could travel\nwithin the environment (e.g., 3.5km, and 5km in the NYU and London environments respectively) by\nthe end of the training. 
Note that this does not preclude the agent from going astray in the opposite\ndirection several kilometres away from the goal, and that the goal may occasionally be sampled close\nto the agent. Hence, our curriculum scheme naturally combines easy with difficult cases [43], with\nthe latter becoming more common over time.\n\n5.3 Generalization on Held-out Goals\n\nNavigation agents should, ideally, be able to generalise to unseen environments [14]. While the nature\nof our courier task precludes zero-shot navigation in a new city without retraining, we test the CityNav\nagent\u2019s ability to exploit local linearities of the goal representation to handle unseen goal locations.\nWe mask 25% of the possible goals and train on the remaining ones (Fig. 5). At test time we evaluate\nthe agent only on its ability to reach goals in the held-out areas. Note that the agent is still able to\ntraverse through these areas; it simply never receives a goal there during training. More precisely, the held-out areas are\nsquares sized 0.01\u00b0, 0.005\u00b0 or 0.0025\u00b0 of latitude and longitude (roughly 1km \u00d7 1km, 0.5km \u00d7 0.5km\nand 0.25km \u00d7 0.25km). We call these grids respectively coarse (with few and large held-out areas),\nmedium and fine (with many small held-out areas).\nIn the experiments, we train the CityNav agent for 1B steps, and next freeze the weights of the\nagent and evaluate its performance on held-out areas for 100M steps. Table 1 shows decreasing\nperformance of the agents as the held-out area size increases. We believe that the performance\ndrops on the large held-out areas (medium and coarse grid size) because the model cannot process\nnew or unseen local landmark-based goal specifications, which is due to our landmark-based goal\nrepresentation: as Figure 5 shows, some coarse grid held-out areas cover multiple landmarks. To gain\nfurther understanding, in addition to the Test Reward metric, we also use missed goals (Fail) and\nhalf-trip time ($T_{1/2}$) metrics. The missed goals metric measures the percentage of times goals were not\nreached. The half-trip time measures the number of agent steps necessary to cover half the distance\nseparating the agent from the goal. While the agent misses more goal destinations on larger held-out\ngrids, it still manages to travel half the distance to the goal within a similar time, which suggests that\nthe agent has an approximate held-out goal representation that enables it to head towards the goal until it\ngets close, at which point the representation is no longer useful for the final approach.\n\nFigure 5: Illustration of medium-sized held-out\ngrid with gray corresponding to training destinations,\nblack corresponding to held-out test destinations.\nLandmark locations are marked in red.\n\nTable 1: CityNav agent generalization performance\n(reward and fail metrics) on a set of\nheld-out goal locations. We also compute the\nhalf-trip time ($T_{1/2}$), to reach halfway to the goal.\n\nGRID SIZE | TRAIN REW | TEST REW | FAIL | $T_{1/2}$\nFINE | 655 | 567 | 11% | 229\nMEDIUM | 637 | 293 | 20% | 184\nCOARSE | 623 | 164 | 38% | 243\n\n5.4 Transfer in Multi-city Experiments\n\nA critical test for our proposed method is to demonstrate that it can provide a mechanism for transfer\nto new cities. By definition, the courier task requires a degree of memorization of the map, and\nwhat we focused on was not zero-shot transfer, but rather the capability of models to generalize\nquickly, learning to separate general ability from local knowledge when migrating to a new map. Our\nmotivation for transfer learning experiments comes from the goal of continual learning, which is\nabout learning new skills without forgetting older skills. As with humans, when our agent visits a\nnew city we would expect it to have to learn a new set of landmarks, but not have to re-learn its visual\nrepresentation, its behaviours, etc. 
Specifically, we expect the agent to take advantage of existing\nvisual features (convnet) and movement primitives (policy LSTM). Therefore, using the MultiCityNav\nagent, we train on a number of cities (actually regions in New York City), freeze both the policy\nLSTM and the convolutional encoder, and then train a new locale-specific pathway (the goal LSTM)\non a new city. The gradient that is computed by optimising the RL loss is passed through the policy\nLSTM without affecting it and then applied only to the new pathway.\nWe compare the performance using three different training regimes, illustrated in Fig. 6a: training on\nonly the target city (single training); training on multiple cities, including the target city, together\n(joint training); and joint training on all but the target city, followed by training on the target city\nwith the rest of the architecture frozen (pre-train and transfer). In these experiments, we use the\nwhole Manhattan environment shown in Figure 1b, which consists of the following regions: \u201cWall\nStreet\u201d, \u201cNYU\u201d, \u201cMidtown\u201d, \u201cCentral Park\u201d and \u201cHarlem\u201d. The target city is always the Wall Street\nenvironment, and we evaluate the effects of pre-training on 2, 3 or 4 of the other environments. We\nalso compare performance when the skip connection between the convolutional encoder and the policy\nLSTM is removed.\nWe can see from the results in Figure 6b that not only is transfer possible, but that its effectiveness\nincreases with the number of regions the network is trained on. Remarkably, the agent that is\npre-trained on 4 regions and then transferred to Wall Street achieves comparable performance to\nan agent trained jointly on all the regions, and performs only slightly worse than single-city training on Wall\nStreet alone (see footnote 3). This result supports our intuition that training on a larger set of environments results in\nsuccessful transfer. 
We also note that in the single-city scenario it is better to train an agent with a\nskip-connection, but this trend is reversed in the multi-city transfer scenario. We hypothesise that\nisolating the locale-specific LSTM as a bottleneck is more challenging but reduces overfitting of the\nconvolutional features and enforces a more general interface to the policy LSTM. While the transfer\nlearning performance of the agent is lower than that of the stronger agent trained jointly on all the areas, the\nagent significantly outperforms the baselines and demonstrates goal-dependent navigation.\n\n3 We observed that we could train a model jointly on 4 cities in fewer steps than when training 4 single-city\nmodels.\n\n(a) Diagram of transfer learning experiments.\n\n(b) Transfer learning performance.\n\nFigure 6: Left: Illustration of training regimes: (a) training on a single city (equivalent to CityNav);\n(b) joint training over multiple cities with a dedicated per-city pathway and shared convolutional net\nand policy LSTM; (c) joint pre-training on a number of cities followed by training on a target city\nwith convolutional net and policy LSTM frozen (only the target city pathway is optimised). Right:\nJoint multi-city training and transfer learning performance of variants of the MultiCityNav agent,\nevaluated only on the target city (Wall Street).\n\n6 Conclusion\n\nNavigation is an important cognitive task that enables humans and animals to traverse a complex world\nwithout maps. We have presented a city-scale real-world environment for training RL navigation\nagents, introduced and analysed a new courier task, demonstrated that deep RL algorithms can\nbe applied to problems involving large-scale real-world data, and presented a multi-city neural\nnetwork agent architecture that demonstrates transfer to new environments. 
A multi-city version\nof the Street View based RL environment, with carefully processed images provided by Google\nStreet View (i.e., blurred faces and license plates, with a mechanism for enforcing image take-down\nrequests) has been released for Manhattan and Pittsburgh and is accessible from\nhttp://streetlearn.cc and https://github.com/deepmind/streetlearn. The project webpage at\nhttp://streetlearn.cc also contains resources on how to build and train an agent. Future work\nwill involve learning landmarks from images and scaling up the navigation and path-planning thanks\nto hierarchical RL approaches.\n\nAcknowledgements\n\nThe authors wish to acknowledge Andras Banki-Horvath for open-sourcing the StreetLearn\nenvironment, Lasse Espeholt and Hubert Soyer for technical help with the IMPALA algorithm, Razvan\nPascanu, Ross Goroshin, Pushmeet Kohli and Nando de Freitas for their feedback, Chloe Hillier,\nRazia Ahamed and Vishal Maini for help with the project, and the Google Street View team (Tilman\nReinhardt, Wenfeng Li, Ben Mears, Karen Guo, Oliver Metzger, Jayanth Nayak) as well as Richard\nIves and Ashwin Kakarla for their support in accessing the data.\n\nReferences\n[1] Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko S\u00fcnderhauf, Ian\nReid, Stephen Gould, and Anton van den Hengel. Vision-and-language navigation:\nInterpreting visually-grounded navigation instructions in real environments. arXiv preprint\narXiv:1711.07280, 2017.\n\n
[2] Andrea Banino, Caswell Barry, Benigno Uria, Charles Blundell, Timothy Lillicrap, Piotr\nMirowski, Alexander Pritzel, Martin J Chadwick, Thomas Degris, Joseph Modayil, et al.\nVector-based navigation using grid-like representations in artificial agents. Nature, page 1,\n2018.\n\n[3] Charles Beattie, Joel Z Leibo, Denis Teplyashin, Tom Ward, Marcus Wainwright, Heinrich\nK\u00fcttler, Andrew Lefrancq, Simon Green, V\u00edctor Vald\u00e9s, Amir Sadik, et al. Deepmind lab. arXiv\npreprint arXiv:1612.03801, 2016.\n\n[4] Yoshua Bengio, J\u00e9r\u00f4me Louradour, Ronan Collobert, and Jason Weston. Curriculum learning.\nIn Proceedings of the 26th annual international conference on machine learning, pages 41\u201348.\nACM, 2009.\n\n[5] Rodrigo F Berriel, Lucas Tabelini Torres, Vinicius B Cardoso, R\u00e2nik Guidolini, Claudine Badue,\nAlberto F De Souza, and Thiago Oliveira-Santos. Heading direction estimation using deep\nlearning with automatic large-scale data acquisition. 2018.\n\n[6] Samarth Brahmbhatt and James Hays. Deepnav: Learning to navigate large cities. arXiv\npreprint arXiv:1701.09135, 2017.\n\n[7] Jake Bruce, Niko S\u00fcnderhauf, Piotr Mirowski, Raia Hadsell, and Michael Milford. One-shot\nreinforcement learning for robot navigation with interactive replay. arXiv preprint arXiv:1711.10137,\n2017.\n\n[8] Gino Brunner, Oliver Richter, Yuyi Wang, and Roger Wattenhofer. Teaching a machine to read\nmaps with deep reinforcement learning. arXiv preprint arXiv:1711.07479, 2017.\n\n[9] Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Nie\u00dfner, Manolis\nSavva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3d: Learning from rgb-d data in\nindoor environments. 
arXiv preprint arXiv:1709.06158, 2017.\n\n[10] Devendra Singh Chaplot, Emilio Parisotto, and Ruslan Salakhutdinov. Active neural localization.\nInternational Conference on Learning Representations, 2018.\n\n[11] Devendra Singh Chaplot, Kanthashree Mysore Sathyendra, Rama Kumar Pasumarthi, Dheeraj\nRajagopal, and Ruslan Salakhutdinov. Gated-attention architectures for task-oriented language\ngrounding. arXiv preprint arXiv:1706.07230, 2017.\n\n[12] Christopher J Cueva and Xue-Xin Wei. Emergence of grid-like representations by training\nrecurrent neural networks to perform spatial localization. arXiv preprint arXiv:1803.07770,\n2018.\n\n[13] Coline Devin, Abhishek Gupta, Trevor Darrell, Pieter Abbeel, and Sergey Levine. Learning\nmodular neural network policies for multi-task and multi-robot transfer. In Robotics and\nAutomation (ICRA), 2017 IEEE International Conference on, pages 2169\u20132176. IEEE, 2017.\n\n[14] Vikas Dhiman, Shurjo Banerjee, Brent Griffin, Jeffrey M Siskind, and Jason J Corso. A critical\ninvestigation of deep reinforcement learning for navigation. arXiv preprint arXiv:1802.02274,\n2018.\n\n[15] Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini\nVenugopalan, Kate Saenko, and Trevor Darrell. Long-term recurrent convolutional networks for\nvisual recognition and description. In Proceedings of the IEEE conference on computer vision\nand pattern recognition, pages 2625\u20132634, 2015.\n\n[16] Alexey Dosovitskiy and Vladlen Koltun. Learning to act by predicting the future. arXiv preprint\narXiv:1611.01779, 2016.\n\n[17] Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio L\u00f3pez, and Vladlen Koltun. Carla:\nAn open urban driving simulator. 
arXiv preprint arXiv:1711.03938, 2017.\n\n[18] Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Volodymyr Mnih, Tom Ward,\nYotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, Shane Legg, and Koray Kavukcuoglu.\nImpala: Scalable distributed deep-rl with importance weighted actor-learner architectures. arXiv\npreprint arXiv:1802.01561, 2018.\n\n[19] Alex Graves, Marc G Bellemare, Jacob Menick, Remi Munos, and Koray Kavukcuoglu. Automated\ncurriculum learning for neural networks. arXiv preprint arXiv:1704.03003, 2017.\n\n[20] Saurabh Gupta, James Davidson, Sergey Levine, Rahul Sukthankar, and Jitendra Malik. Cognitive\nmapping and planning for visual navigation. arXiv preprint arXiv:1702.03920, 2017.\n\n[21] Saurabh Gupta, David Fouhey, Sergey Levine, and Jitendra Malik. Unifying map and landmark\nbased representations for visual navigation. arXiv preprint arXiv:1712.08125, 2017.\n\n[22] Karl Moritz Hermann, Felix Hill, Simon Green, Fumin Wang, Ryan Faulkner, Hubert Soyer,\nDavid Szepesvari, Wojtek Czarnecki, Max Jaderberg, Denis Teplyashin, et al. Grounded\nlanguage learning in a simulated 3d world. arXiv preprint arXiv:1706.06551, 2017.\n\n[23] Felix Hill, Karl Moritz Hermann, Phil Blunsom, and Stephen Clark. Understanding grounded\nlanguage learning agents. arXiv preprint arXiv:1710.09867, 2017.\n\n[24] Sepp Hochreiter and J\u00fcrgen Schmidhuber. Long short-term memory. Neural computation,\n9(8):1735\u20131780, 1997.\n\n[25] Max Jaderberg, Volodymyr Mnih, Wojciech Marian Czarnecki, Tom Schaul, Joel Z Leibo,\nDavid Silver, and Koray Kavukcuoglu. Reinforcement learning with unsupervised auxiliary\ntasks. arXiv preprint arXiv:1611.05397, 2016.\n\n[26] Micha\u0142 Kempka, Marek Wydmuch, Grzegorz Runc, Jakub Toczek, and Wojciech Ja\u015bkowski. Vizdoom:\nA doom-based ai research platform for visual reinforcement learning. In Computational\nIntelligence and Games (CIG), 2016 IEEE Conference on, pages 1\u20138. 
IEEE, 2016.\n\n[27] Arbaaz Khan, Clark Zhang, Nikolay Atanasov, Konstantinos Karydis, Vijay Kumar, and\nDaniel D Lee. Memory augmented control networks. arXiv preprint arXiv:1709.05706,\n2017.\n\n[28] Aditya Khosla, Byoungkwon An An, Joseph J Lim, and Antonio Torralba. Looking beyond\nthe visible scene. In Proceedings of the IEEE Conference on Computer Vision and Pattern\nRecognition, pages 3710\u20133717, 2014.\n\n[29] Eric Kolve, Roozbeh Mottaghi, Daniel Gordon, Yuke Zhu, Abhinav Gupta, and Ali Farhadi.\nAi2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv:1712.05474, 2017.\n\n[30] Guillaume Lample and Devendra Singh Chaplot. Playing FPS games with deep reinforcement\nlearning. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, 2017.\n\n[31] Mateusz Malinowski, Marcus Rohrbach, and Mario Fritz. Ask your neurons: A deep learning\napproach to visual question answering. International Journal of Computer Vision, 125(1-3):110\u2013\n135, 2017.\n\n[32] Michael J Milford, Gordon F Wyeth, and David Prasser. Ratslam: a hippocampal model\nfor simultaneous localization and mapping. In Robotics and Automation, 2004. Proceedings.\nICRA\u201904. 2004 IEEE International Conference on, volume 1, pages 403\u2013408. IEEE, 2004.\n\n[33] Piotr Mirowski, Razvan Pascanu, Fabio Viola, Hubert Soyer, Andrew Ballard, Andrea Banino,\nMisha Denil, Ross Goroshin, Laurent Sifre, Koray Kavukcuoglu, Dharshan Kumaran, and Raia\nHadsell. Learning to navigate in complex environments. arXiv preprint arXiv:1611.03673,\n2016.\n\n[34] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap,\nTim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep\nreinforcement learning. In International Conference on Machine Learning, pages 1928\u20131937,\n2016.\n\n[35] Kaichun Mo, Haoxiang Li, Zhe Lin, and Joon-Young Lee. 
The adobeindoornav dataset: Towards\ndeep reinforcement learning based real-world indoor robot visual navigation. arXiv preprint\narXiv:1802.08824, 2018.\n\n[36] Emilio Parisotto and Ruslan Salakhutdinov. Neural map: Structured memory for deep\nreinforcement learning. arXiv preprint arXiv:1702.08360, 2017.\n\n[37] Nikolay Savinov, Alexey Dosovitskiy, and Vladlen Koltun. Semi-parametric topological\nmemory for navigation. arXiv preprint arXiv:1803.00653, 2018.\n\n[38] Shital Shah, Debadeepta Dey, Chris Lovett, and Ashish Kapoor. Airsim: High-fidelity visual\nand physical simulation for autonomous vehicles. In Field and Service Robotics, pages 621\u2013635.\nSpringer, 2018.\n\n[39] Chen Tessler, Shahar Givony, Tom Zahavy, Daniel J. Mankowitz, and Shie Mannor. A deep\nhierarchical approach to lifelong learning in minecraft. In Proceedings of the Thirty-First AAAI\nConference on Artificial Intelligence, 2017.\n\n[40] Greg Wayne, Chia-Chun Hung, David Amos, Mehdi Mirza, Arun Ahuja, Agnieszka\nGrabska-Barwinska, Jack Rae, Piotr Mirowski, Joel Z Leibo, Adam Santoro, et al. Unsupervised\npredictive memory in a goal-directed agent. arXiv preprint arXiv:1803.10760, 2018.\n\n[41] Tobias Weyand, Ilya Kostrikov, and James Philbin. Planet-photo geolocation with convolutional\nneural networks. In European Conference on Computer Vision, pages 37\u201355. Springer, 2016.\n\n[42] Yi Wu, Yuxin Wu, Georgia Gkioxari, and Yuandong Tian. Building generalizable agents with a\nrealistic and rich 3d environment. arXiv preprint arXiv:1801.02209, 2018.\n\n[43] Wojciech Zaremba and Ilya Sutskever. Learning to execute. arXiv preprint arXiv:1410.4615,\n2014.\n\n[44] Jingwei Zhang, Lei Tai, Joschka Boedecker, Wolfram Burgard, and Ming Liu. Neural slam:\nLearning to explore with external memory. arXiv preprint arXiv:1706.09520, 2017.\n\n[45] Yuke Zhu, Roozbeh Mottaghi, Eric Kolve, Joseph J. Lim, Abhinav Gupta, Li Fei-Fei, and Ali\nFarhadi. 
Target-driven visual navigation in indoor scenes using deep reinforcement learning. In\n2017 IEEE International Conference on Robotics and Automation, ICRA, pages 3357\u20133364,\n2017.\n", "award": [], "sourceid": 1233, "authors": [{"given_name": "Piotr", "family_name": "Mirowski", "institution": "DeepMind"}, {"given_name": "Matt", "family_name": "Grimes", "institution": "DeepMind"}, {"given_name": "Mateusz", "family_name": "Malinowski", "institution": "DeepMind"}, {"given_name": "Karl Moritz", "family_name": "Hermann", "institution": "DeepMind"}, {"given_name": "Keith", "family_name": "Anderson", "institution": "DeepMind"}, {"given_name": "Denis", "family_name": "Teplyashin", "institution": "DeepMind"}, {"given_name": "Karen", "family_name": "Simonyan", "institution": "DeepMind"}, {"given_name": "koray", "family_name": "kavukcuoglu", "institution": "Google DeepMind"}, {"given_name": "Andrew", "family_name": "Zisserman", "institution": "DeepMind & University of Oxford"}, {"given_name": "Raia", "family_name": "Hadsell", "institution": "DeepMind"}]}