{"title": "Exponential Family PCA for Belief Compression in POMDPs", "book": "Advances in Neural Information Processing Systems", "page_first": 1667, "page_last": 1674, "abstract": null, "full_text": "Exponential Family PCA for Belief Compression\n\nin POMDPs\n\nNicholas Roy\n\nRobotics Institute\n\nCarnegie Mellon University\n\nPittsburgh, PA 15213\n\nnickr@ri.cmu.edu\n\nAbstract\n\nGeoffrey Gordon\n\nDepartment of Computer Science\n\nCarnegie Mellon University\n\nPittsburgh, PA 15213\nggordon@cs.cmu.edu\n\nStandard value function approaches to \ufb01nding policies for Partially Observable\nMarkov Decision Processes (POMDPs) are intractable for large models. The in-\ntractability of these algorithms is due to a great extent to their generating an optimal\npolicy over the entire belief space. However, in real POMDP problems most belief\nstates are unlikely, and there is a structured, low-dimensional manifold of plausible\nbeliefs embedded in the high-dimensional belief space.\nWe introduce a new method for solving large-scale POMDPs by taking advantage of\nbelief space sparsity. We reduce the dimensionality of the belief space by exponential\nfamily Principal Components Analysis [1], which allows us to turn the sparse, high-\ndimensional belief space into a compact, low-dimensional representation in terms of\nlearned features of the belief state. We then plan directly on the low-dimensional belief\nfeatures. By planning in a low-dimensional space, we can \ufb01nd policies for POMDPs\nthat are orders of magnitude larger than can be handled by conventional techniques.\nWe demonstrate the use of this algorithm on a synthetic problem and also on a mobile\nrobot navigation task.\n\n1 Introduction\n\nLarge Partially Observable Markov Decision Processes (POMDPs) are generally very dif-\n\ufb01cult to solve, especially with standard value iteration techniques [2, 3]. 
Maintaining a full\nvalue function over the high-dimensional belief space entails \ufb01nding the expected reward of\nevery possible belief under the optimal policy. However, in reality most POMDP policies\ngenerate only a small percentage of possible beliefs. For example, a mobile robot navigat-\ning in an of\ufb01ce building is extremely unlikely to ever encounter a belief about its pose that\nresembles a checkerboard. If the execution of a POMDP is viewed as a trajectory inside the\nbelief space, trajectories for most large, real world POMDPs lie on low-dimensional mani-\nfolds embedded in the belief space. So, POMDP algorithms that compute a value function\nover the full belief space do a lot of unnecessary work.\n\nAdditionally, real POMDPs frequently have the property that the belief probability distri-\nbutions themselves are sparse. That is, the probability of being at most states in the world is\nzero. Intuitively, mobile robots and other real world systems have local uncertainty (which\ncan often be multi-modal), but rarely encounter global uncertainty. Figure 1 depicts a mo-\nbile robot travelling down a corridor, and illustrates the sparsity of the belief space.\n\n\fFigure 1: An example probability distribution of a mobile robot navigating in a hallway (map di-\nmensions are 47m x 17m, with a grid cell resolution of 10cm). The white areas are free space, states\nwhere the mobile robot could be. The black lines are walls, and the dark gray particles are the output\nof the particle \ufb01lter tracking the robot\u2019s position. The particles are located in states where the robot\u2019s\nbelief over its position is non-zero. Although the distribution is multi-modal, it is still relatively\ncompact: the majority of the states contain no particles and therefore have zero probability.\n\nWe will take advantage of these characteristics of POMDP beliefs by using a variant of a\ncommon dimensionality reduction technique, Principal Components Analysis (PCA). 
PCA is well-suited to dimensionality reduction where the data lies near a linear manifold in the higher-dimensional space. Unfortunately, POMDP belief manifolds are rarely linear; in particular, sparse beliefs are usually very non-linear. However, we can employ a link function to transform the data into a space where it does lie near a linear manifold; the algorithm which does so (while also correctly handling the transformed residual errors) is called Exponential Family PCA (E-PCA). E-PCA will allow us to find manifolds with only a handful of dimensions, even for belief spaces with thousands of dimensions.

Our algorithm begins with a set of beliefs from a POMDP. It uses these beliefs to find a decomposition of belief space into a small number of belief features. Finally, it plans over a low-dimensional space by discretizing the features and using standard value iteration to find a policy over the discrete beliefs.

2 POMDPs

A Partially Observable Markov Decision Process (POMDP) is a model given by a set of states S = {s_1, ..., s_n}, actions A = {a_1, ..., a_m}, and observations Z = {z_1, ..., z_k}. Associated with these are a set of transition probabilities T(s' | s, a) and observation probabilities O(z | s, a).

The objective of the planning problem is to find a policy that maximises the expected sum of future (possibly discounted) rewards of the agent executing the policy. There are a large number of value function approaches [2, 4] that explicitly compute the expected reward of every belief.
Such approaches produce complete policies, and can guarantee optimality under a wide range of conditions. However, finding a value function this way is usually computationally intractable.

Policy search algorithms [3, 5, 6, 7] have met with success recently. We suggest that a large part of the success of policy search is due to the fact that it focuses computation on relevant belief states. A disadvantage of policy search, however, is that it can be data-inefficient: many policy search techniques have trouble reusing sample trajectories generated from old policies. Our approach focuses computation on relevant belief states, but also allows us to use all relevant training data to estimate the effect of any policy.

Related research has developed heuristics which reduce the belief space representation. In particular, entropy-based representations for heuristic control [8] and full value-function planning [9] have been tried with some success. However, these approaches make strong assumptions about the kind of uncertainties that a POMDP generates. By performing principled dimensionality reduction of the belief space, our technique should be applicable to a wider range of problems.

3 Dimensionality Reduction

Principal Component Analysis is one of the most popular and successful forms of dimensionality reduction [10].
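For reference, the decomposition that conventional PCA computes, the minimiser of the squared-error loss in equation (1), can be obtained directly from a singular value decomposition. The following is an illustrative sketch on synthetic low-rank data, not the paper's belief data:

```python
import numpy as np

# Synthetic stand-in data: 500 rows lying near a 3-dimensional linear
# manifold in a 40-dimensional space, plus small Gaussian noise.
rng = np.random.default_rng(0)
X = rng.random((500, 3)) @ rng.random((3, 40)) + 0.01 * rng.normal(size=(500, 40))

# PCA via SVD: the rank-l truncation of the (centered) data matrix is the
# rank-l factorization UV minimising the squared reconstruction error.
l = 3
mean = X.mean(axis=0)
U, S, Vt = np.linalg.svd(X - mean, full_matrices=False)
X_hat = mean + U[:, :l] * S[:l] @ Vt[:l]        # best rank-l reconstruction

squared_error = np.linalg.norm(X - X_hat) ** 2  # small: only the noise remains
```

Because the squared-error loss treats displacements from the manifold as symmetric, Gaussian-like noise, it is a poor fit for sparse, non-negative belief vectors; this is what motivates the link-function construction of E-PCA.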
PCA operates by finding a set of feature vectors u_1, ..., u_l that minimise the loss function

L(U, V) = || X - UV ||^2   (1)

where X is the original data and V is the matrix of low-dimensional coordinates of X. This particular loss function assumes that the data lie near a linear manifold, and that displacements from this manifold are symmetric and have the same variance everywhere. (For example, i.i.d. Gaussian errors satisfy these requirements.)

Unfortunately, as mentioned previously, probability distributions for POMDPs rarely form a linear subspace. In addition, squared error loss is inappropriate for modelling probability distributions: it does not enforce positive probability predictions.

We use exponential family PCA to address this problem. Other nonlinear dimensionality-reduction techniques [11, 12, 13] could also work for this purpose, but would have different domains of applicability. Although the optimisation procedure for E-PCA may be more complicated than that for other models such as locally-linear models, it requires many fewer samples of the belief space. For real world systems such as mobile robots, large sample sets may be difficult to acquire.

3.1 Exponential family PCA

Exponential family Principal Component Analysis [1] (E-PCA) varies from conventional PCA by adding a link function, in analogy to generalised linear models, and modifying the loss function appropriately. As long as we choose the link and loss functions to match each other, there will exist efficient algorithms for finding U and V given X. By picking particular link functions (with their matching losses), we can reduce the model to an SVD.

We can use any convex function F to generate a matching pair of link and loss functions. The loss function which corresponds to F is

L(U, V) = F(UV) - X ∘ (UV) + F*(X)   (2)

where F* is defined so that the minimum over Z of

F(Z) - X ∘ Z + F*(X)   (3)

is always 0. (F* is called the convex dual of F, and expression (3) is called a generalised Bregman divergence from Z to X.)

The loss functions themselves are only necessary for the analysis; our algorithm needs only the link functions and their derivatives. So, we can pick the loss functions and differentiate to get the matching link functions; or, we can pick the link functions directly and not worry about the corresponding loss functions.

Each choice of link and loss functions results in a different model and therefore a potentially different decomposition of X. This choice is where we should inject our domain knowledge about what sort of noise there is in X and what parameter matrices U and V are a priori most likely. In our case the entries of X are the number of particles from a large sample which fell into a small bin, so a Poisson loss function is most appropriate. The corresponding link function is

f(UV) = e^{UV}   (4)

(taken component-wise) and its associated loss function is

L(U, V) = Σ e^{UV} - X ∘ (UV)   (5)

where the "matrix dot product" X ∘ (UV) is the sum of products of corresponding elements, and the sum in Σ e^{UV} ranges over all elements. It is worth noting that using the Poisson loss for dimensionality reduction is related to Lee and Seung's non-negative matrix factorization [14].

In order to find U and V, we compute the derivatives of the loss function with respect to U and V and set them to 0. The result is a set of fixed-point equations that the optimal parameter settings must satisfy:

e^{UV} V^T = X V^T   (6)
U^T e^{UV} = U^T X   (7)

There are many algorithms which we could use to solve our optimality equations (6) and (7). For example, we could use gradient descent. In other words, we could add a multiple of (X - e^{UV}) V^T to U, add a multiple of U^T (X - e^{UV}) to V, and repeat until convergence. Instead we will use a more efficient algorithm due to Gordon [15]; this algorithm is based on Newton's method and is related to iteratively-reweighted least squares. We refer the reader to this paper for further details.

4 Augmented MDP

Given the belief features acquired through E-PCA, it remains to learn a policy. We do so by using the low-dimensional belief features to convert the POMDP into a tractable MDP. Our conversion algorithm is a variant of the Augmented MDP, or Coastal Navigation algorithm [9], using belief features instead of entropy. Table 1 outlines the steps of this algorithm.

1. Collect sample beliefs
2. Use E-PCA to generate low-dimensional belief features
3. Convert low-dimensional space into discrete state space
4. Learn belief transition probabilities and reward function
5. 
Perform value iteration on the new model, using the discrete states, learned transition probabilities, and reward function.

Table 1: Algorithm for planning in low-dimensional belief space.

We can collect the beliefs in step 1 using some prior policy such as a random walk or a most-likely-state heuristic. We have already described E-PCA (step 2), and value iteration (step 5) is well-known. That leaves steps 3 and 4.

The state space can be discretized in a number of ways, such as laying a grid over the belief features or using distance to the closest training beliefs to divide feature space into Voronoi regions. Thrun [16] has proposed nearest-neighbor discretization in high-dimensional belief space; we propose instead to use low-dimensional feature space, where neighbors should be more closely related.

We can compute the model reward function easily from the reconstructed beliefs:

R(v_i) = Σ_s R(s) b̃_i(s),  where b̃_i = e^{U v_i} is the reconstructed belief for feature state v_i.   (8)

To learn the transition function, we can sample states from the reconstructed beliefs, sample observations from those states, and incorporate those observations to produce new belief states.

One additional question is how to choose the number of bases. One possibility is to examine the singular values of the matrix after performing E-PCA, and use only the features that have singular values above some cutoff. A second possibility is to use a model selection technique such as keeping a validation set of belief samples and picking the basis size with the best reconstruction quality.
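The validation-based selection just described can be sketched end to end, together with the gradient-descent variant of the E-PCA fit from section 3.1 (the paper itself uses Gordon's faster Newton-style algorithm [15]). All data, sizes, and step sizes here are illustrative assumptions:

```python
import numpy as np

# Sketch of E-PCA with the Poisson link f(z) = e^z, fit by plain gradient
# descent on L(U, V) = sum(exp(UV) - X * UV), plus validation-based choice
# of the number of bases l. X holds synthetic particle counts with a
# low-rank log-rate structure: d states (rows) x n sample beliefs (columns).
rng = np.random.default_rng(0)
d, n = 30, 120
true_Z = rng.normal(scale=0.4, size=(d, 2)) @ rng.normal(scale=0.4, size=(2, n))
X = rng.poisson(np.exp(true_Z)).astype(float)
X_train, X_val = X[:, :100], X[:, 100:]

def loss(U, V, X):
    Z = U @ V
    return float(np.sum(np.exp(Z) - X * Z))

def fit_V(U, X, steps=3000, eta=1e-3):
    """Fit coordinates V for fixed bases U (a convex subproblem)."""
    V = 0.01 * rng.normal(size=(U.shape[1], X.shape[1]))
    for _ in range(steps):
        V -= eta * U.T @ (np.exp(U @ V) - X)
    return V

def fit_epca(X, l, steps=4000, eta=1e-3):
    """Joint gradient descent on U and V for an l-basis decomposition."""
    U = 0.01 * rng.normal(size=(X.shape[0], l))
    V = 0.01 * rng.normal(size=(l, X.shape[1]))
    for _ in range(steps):
        R = np.exp(U @ V) - X                # residual e^{UV} - X
        U, V = U - eta * R @ V.T, V - eta * U.T @ R
    return U, V

# Pick the basis size with the best reconstruction of held-out beliefs.
val_losses = {}
for l in (1, 2, 3):
    U, _ = fit_epca(X_train, l)
    val_losses[l] = loss(U, fit_V(U, X_val), X_val)
best_l = min(val_losses, key=val_losses.get)
```

The gradient steps implement the update direction of equations (6) and (7); in practice a second-order method converges far faster on the same fixed points.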
Finally, we could search over basis sizes according to the performance of the resulting policy.

5 Experimental Results

We tested our approach on two models: a synthetic 40 state world with idealised actions and observations, and a large mobile robot navigation task. For each problem, we compared E-PCA to conventional PCA for belief representation quality, and compared E-PCA to some heuristics for policy performance. We are unable to compare our approach to conventional value function approaches, because both problems are too large to be solved by existing techniques.

5.1 Synthetic model

The abstract model has a two-dimensional state space: one dimension of position along a circular corridor, and one binary orientation. States 1 ... 20 inclusive correspond to one orientation, and states 21 ... 40 correspond to the other. The reward is at a known position along the corridor; therefore, the agent needs to discover its orientation, move to the appropriate position, and declare it has arrived at the goal. When the goal is declared the system resets (regardless of whether the agent is actually at the goal). The agent has 4 actions: left, right, sense_orientation, and declare_goal. The observation and transition probabilities are given by von Mises distributions, an exponential family distribution defined over [0, 2π).
The von Mises distribution is the "wrapped" analog of a Gaussian; it accounts for the fact that the two ends of the corridor are connected, and because the sum of two von Mises variates is another von Mises variate, we can guarantee that the true belief distribution is always a von Mises distribution over the corridor for each orientation.

Figure 2: Some sample beliefs from the two-dimensional problem, generated from roll-outs of the model. Notice that some beliefs are bimodal, whereas others are unimodal in one half or the other of the state space.

Figure 2 shows some sample beliefs from this model. Notice that some of the beliefs are bimodal, but some beliefs have probability mass over half of the state space only: these unimodal beliefs follow the sense_orientation action.

Figure 3(a) shows the reconstruction performance of both the E-PCA approach and conventional PCA, plotting average KL-divergence between the sample belief and its reconstruction against the number of bases used for the reconstruction. PCA minimises squared error, while E-PCA with the Poisson loss minimises unnormalised KL-divergence, so it is no surprise that E-PCA performs better. We believe that KL-divergence is a more appropriate measure since we are fitting probabilities. Both PCA and E-PCA reach near-zero error at 3 bases (E-PCA hits zero error, since an l-basis E-PCA can fit an l-parameter exponential family exactly). This fact suggests that both decompositions should generate good policies using only 3 dimensions.

Figure 3: (a) A comparison of the average KL divergence between the sample beliefs and their reconstructions, against the number of bases used, for 500 sample beliefs. (b) A comparison of policy performance using different numbers of bases, for 10000 trials. Policy performance was given by total reward accumulated over trials.

Figure 3(b) shows a comparison of the policies from different algorithms. The PCA techniques do approximately twice as well as the naive Maximum Likelihood heuristic. This is because the ML-heuristic must guess its orientation, and is correct about half the time. In comparison, the Entropy heuristic does very poorly because it is unable to distinguish between a unimodal belief that has uncertainty about its orientation but not its position, and a bimodal belief that knows its position but not its orientation.

5.2 Mobile Robot Navigation

Next we tried our algorithm on a mobile robot navigating in a corridor, as shown in figure 1. As in the previous example, the robot can detect its position, but cannot determine its orientation until it reaches the lab door approximately halfway down the corridor.
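The policies compared in these experiments are computed by value iteration over the discretized belief features (step 5 of Table 1). A minimal sketch, with random placeholder matrices standing in for the learned transition and reward models:

```python
import numpy as np

# Placeholder model over discrete belief-feature states: T[a] is a learned
# transition matrix and R[a] a learned reward vector; both are random here,
# not the models learned in the paper.
rng = np.random.default_rng(2)
n_states, n_actions, gamma = 30, 4, 0.95
T = rng.random((n_actions, n_states, n_states))
T /= T.sum(axis=2, keepdims=True)          # each row is a distribution
R = rng.random((n_actions, n_states))

# Standard value iteration: iterate the Bellman optimality backup.
V = np.zeros(n_states)
for _ in range(1000):
    Q = R + gamma * T @ V                  # Q[a, s] = R[a, s] + gamma * E[V(s')]
    V_new = Q.max(axis=0)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new
policy = Q.argmax(axis=0)                  # greedy action per discrete belief
```

Because the feature MDP has only a few discrete states, this backup is cheap, which is the point of planning in the low-dimensional space rather than the original belief simplex.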
The robot must navigate to within 10cm of the goal and declare the goal to receive the reward. The map is shown in figures 1 and 4, and is 47m x 17m, with a grid cell resolution of 0.1m. The total number of unoccupied cells is 8250, generating a POMDP with a belief space of 8250 dimensions. Without loss of generality, we restricted the robot's actions to forward and backward motion, and similarly simplified the observation model. The reward structure of the problem strongly penalised declaring the goal when the robot was far removed from the goal state.

The initial set of beliefs was collected by a mobile robot navigating in the world, and then post-processed using a noisy sensor model. In this particular environment, the laser data used for localisation normally gives very good localisation results; however, this will not be true for many real world environments [17].

Figure 4 shows a sample robot trajectory using the policy learned using 5 basis functions. Notice that the robot drives past the goal to the lab door in order to verify its orientation before returning to the goal. If the robot had started at the other end of the corridor, its orientation would have become apparent on its way to the goal.

Figure 4: An example robot trajectory, using the policy learned using 5 basis functions. On the left are the start conditions (start distribution and start state) and the goal state. On the right is the robot trajectory. Notice that the robot drives past the goal to the lab door to localise itself, before returning to the goal.

Figure 5(a) shows the reconstruction performance of both the E-PCA approach and conventional PCA, plotting average KL-divergence between the sample belief and its reconstruction against the number of bases used for the reconstruction.

Figure 5: (a) A comparison of the average KL divergence between the sample beliefs and their reconstructions against the number of bases used, for 400 sample beliefs for a navigating mobile robot. (b) A comparison of policy performance using E-PCA, conventional PCA and the Maximum Likelihood heuristic, for 1,000 trials. (Average rewards in (b): ML Heuristic -268500.0, PCA -1000.0, E-PCA 33233.0.)

Figure 5(b) shows the average policy performance for the different techniques, using 5 bases. (The number of bases was chosen based on reconstruction quality of E-PCA: see [15] for further details.) Again, the E-PCA outperformed the other techniques because it was able to model its belief accurately. The Maximum-Likelihood heuristic could not distinguish orientations, and therefore regularly declared the goal in the wrong place. The conventional PCA algorithm failed because it could not represent its belief accurately with only a few bases.

6 Conclusions

We have demonstrated an algorithm for planning for Partially Observable Markov Decision Processes by taking advantage of particular kinds of belief space structure that are prevalent in real world domains.
In particular, we have shown this approach to work well on an abstract small problem, and also on an 8250-state mobile robot navigation task which is well beyond the capability of existing value function techniques.

The heuristic that we chose for dimensionality reduction was simply one of reconstruction error, as in equation 5: a reduction that minimises reconstruction error should allow near-optimal policies to be learned. However, it may be possible to learn good policies with even fewer dimensions by taking advantage of transition probability structure, or cost function structure. For example, for certain classes of problems, a loss function such as

L(U, V) = || T(e^{UV}) - T(X) ||^2   (9)

where T(b) denotes the predicted next belief, would lead to a dimensionality reduction that maximises predictability. Similarly,

L(U, V) = || C(e^{UV}) - C(X) ||^2   (10)

where C is some heuristic cost function (such as from a previous iteration of dimensionality reduction) would lead to a reduction that maximises ability to differentiate states with different values.

Acknowledgments

Thanks to Sebastian Thrun for many suggestions and insight. Thanks also to Drew Bagnell, Aaron Courville and Joelle Pineau for helpful discussion. Thanks to Mike Montemerlo for localisation code.

References

[1] M. Collins, S. Dasgupta, and R. E. Schapire. A generalization of principal components analysis to the exponential family. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems, volume 14, Cambridge, MA, 2002. MIT Press.

[2] Leslie Pack Kaelbling, Michael L. Littman, and Anthony R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101:99–134, 1998.

[3] Andrew Ng and Michael Jordan.
PEGASUS: A policy search method for large MDPs and POMDPs. In Proceedings of Uncertainty in Artificial Intelligence (UAI), 2000.

[4] Milos Hauskrecht. Value-function approximations for partially observable Markov decision processes. Journal of Artificial Intelligence Research, 13:33–94, 2000.

[5] Andrew Ng, Ron Parr, and Daphne Koller. Policy search via density estimation. In Advances in Neural Information Processing Systems 12, 1999.

[6] Jonathan Baxter and Peter Bartlett. Reinforcement learning in POMDP's via direct gradient ascent. In Proceedings of the 17th International Conference on Machine Learning, 2000.

[7] J. Andrew Bagnell and Jeff Schneider. Autonomous helicopter control using reinforcement learning policy search methods. In Proceedings of the International Conference on Robotics and Automation, 2001.

[8] Anthony R. Cassandra, Leslie Pack Kaelbling, and James A. Kurien. Acting under uncertainty: Discrete Bayesian models for mobile-robot navigation. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 1996.

[9] Nicholas Roy and Sebastian Thrun. Coastal navigation with mobile robots. In Advances in Neural Information Processing Systems 12, pages 1043–1049, 1999.

[10] I. T. Jolliffe. Principal Component Analysis. Springer-Verlag, 1986.

[11] Sam Roweis and Lawrence Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, December 2000.

[12] J. B. Tenenbaum, V. de Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, December 2000.

[13] S. T. Roweis, L. K. Saul, and G. E. Hinton. Global coordination of local linear models. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems, volume 14, Cambridge, MA, 2002. MIT Press.

[14] Daniel D. Lee and H. Sebastian Seung.
Learning the parts of objects by non-negative matrix factorization. Nature, 401:788–791, 1999.

[15] Geoffrey Gordon. Generalized² linear² models. In Suzanna Becker, Sebastian Thrun, and Klaus Obermayer, editors, Advances in Neural Information Processing Systems 15. MIT Press, 2003.

[16] Sebastian Thrun. Monte Carlo POMDPs. In Advances in Neural Information Processing Systems 12, 1999.

[17] S. Thrun, M. Beetz, M. Bennewitz, W. Burgard, A. B. Cremers, F. Dellaert, D. Fox, D. Hähnel, C. Rosenberg, N. Roy, J. Schulte, and D. Schulz. Probabilistic algorithms and the interactive museum tour-guide robot Minerva. International Journal of Robotics Research, 19(11):972–999, 2000.", "award": [], "sourceid": 2319, "authors": [{"given_name": "Nicholas", "family_name": "Roy", "institution": null}, {"given_name": "Geoffrey", "family_name": "Gordon", "institution": null}]}