{"title": "Information-based learning by agents in unbounded state spaces", "book": "Advances in Neural Information Processing Systems", "page_first": 3023, "page_last": 3031, "abstract": "The idea that animals might use information-driven planning to explore an unknown environment and build an internal model of it has been proposed for quite some time. Recent work has demonstrated that agents using this principle can efficiently learn models of probabilistic environments with discrete, bounded state spaces. However, animals and robots are commonly confronted with unbounded environments. To address this more challenging situation, we study information-based learning strategies of agents in unbounded state spaces using non-parametric Bayesian models. Specifically, we demonstrate that the Chinese Restaurant Process (CRP) model is able to solve this problem and that an Empirical Bayes version is able to efficiently explore bounded and unbounded worlds by relying on little prior information.", "full_text": "Information-based learning by agents in unbounded\n\nstate spaces\n\nShariq A. Mobin, James A. Arnemann, Friedrich T. Sommer\n\nRedwood Center for Theoretical Neuroscience\n\nUniversity of California, Berkeley\n\nshariqmobin@berkeley.edu, arnemann@berkeley.edu, fsommer@berkeley.edu\n\nBerkeley, CA 94720\n\nAbstract\n\nThe idea that animals might use information-driven planning to explore an un-\nknown environment and build an internal model of it has been proposed for quite\nsome time. Recent work has demonstrated that agents using this principle can ef-\n\ufb01ciently learn models of probabilistic environments with discrete, bounded state\nspaces. However, animals and robots are commonly confronted with unbounded\nenvironments. To address this more challenging situation, we study information-\nbased learning strategies of agents in unbounded state spaces using non-parametric\nBayesian models. 
Specifically, we demonstrate that the Chinese Restaurant Process (CRP) model is able to solve this problem and that an Empirical Bayes version is able to efficiently explore bounded and unbounded worlds by relying on little prior information.

1 Introduction

Learning in animals involves the active gathering of sensor data, presumably selecting those sensor inputs that are most useful for learning a model of the world. Thus, a theoretical framework for learning in agents, where learning itself is the primary objective, would be essential for making testable predictions for neuroscience and psychology [9, 7], and it would also impact applications such as optimal experimental design and building autonomous robots [3].
It has been proposed that information theory-based objective functions, such as those based on the comparison of learned probability distributions, could guide exploratory behavior in animals and artificial agents [13, 18]. Although reinforcement learning theory has largely advanced in describing action planning in fully or partially observable worlds with a fixed reward function, e.g., [17], the study of planning with internally defined and gradually decreasing reward functions has been rather slow. A few recent studies [20, 11, 12] developed remarkably efficient action policies, driven by maximizing an objective of predicted information gain, for learning an internal model of an unknown, fully observable world. Although these studies use somewhat different definitions of information gain, their key insights are that the optimization has to be non-greedy, with a longer time horizon, and that gain in information also translates to efficient reward gathering. However, these models are still quite limited and cannot be applied to agents in more realistic environments: they only work in observable, discrete and bounded state spaces.
Here, we relax one of these restrictions and present a model for unbounded, observable, discrete state spaces. Using methods from non-parametric Bayesian statistics, specifically the Chinese Restaurant Process (CRP), the resulting agent can efficiently learn the structure of an unknown, unbounded state space. To our knowledge this is the first use of CRPs to address this problem; however, CRPs have been introduced to reinforcement learning earlier for other purposes, such as state clustering [2].

2 Model

2.1 Mathematical framework for embodied active learning

In this study we follow [12] and use Controlled Markov Chains (CMC) to describe how an agent can interact with its environment in closed, embodied, action-perception loops. A CMC is a Markov Chain with an additional control variable that allows switching between different transition distributions in each state, e.g. [6]. Put differently, it is a Markov Decision Process (MDP) without the reward function. A CMC is described by a 3-tuple (S, A, \Theta), where S denotes a finite set of states, A is a finite set of actions the agent can take, and \Theta is a 3-dimensional CMC kernel describing the transition probabilities between states for each action:

\Theta_{sas'} = p_{s'|s,a} = P(s_{t+1} = s' | s_t = s, a_t = a)   (1)

As in [12], we consider the exploration task of the agent to be the formation of an accurate estimate, or internal model \hat{\Theta}, of the true CMC kernel, \Theta, that describes its world.

2.2 Modeling the transition in unbounded state spaces

Let t be the current number of observations of states S and K_t be the number of different states discovered so far.
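As an illustration of Eq. (1), a CMC kernel can be represented as a lookup from state-action pairs to successor distributions. The following minimal Python sketch is not from the paper; the two-state kernel and the names Theta and step are hypothetical:

```python
import random

# Hypothetical two-state, two-action CMC kernel Theta_{sas'}:
# each (state, action) pair indexes a distribution over successor states.
Theta = {
    (0, 'left'):  {0: 0.1, 1: 0.9},
    (0, 'right'): {0: 0.8, 1: 0.2},
    (1, 'left'):  {0: 0.5, 1: 0.5},
    (1, 'right'): {1: 1.0},
}

def step(s, a, rng=random):
    """Sample s_{t+1} ~ P(. | s_t = s, a_t = a) as in Eq. (1)."""
    successors, probs = zip(*Theta[(s, a)].items())
    return rng.choices(successors, weights=probs)[0]
```

An agent interacting with this kernel in a closed loop would repeatedly pick an action and call step to obtain the next state.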
The observed counts are denoted by C_t := {#_1, ..., #_{K_t}}.
Species sampling models have been proposed as generalizations of the Dirichlet process [14] and are interesting for non-parametric Bayesian inference in unbounded state spaces. A species sampling sequence (SSS) describes the distribution of the next observation S_{t+1}. It is defined by

S_{t+1} | S_1, \ldots, S_t \sim \sum_{i=1}^{K_t} p_i(C_t) \delta_{\tilde{S}_i} + p_{K_t+1}(C_t)   (2)

with \delta_{\tilde{S}} a degenerate probability measure; see [10] for details. In order to define a valid SSS, the sequence (p_1, p_2, ...) must sum to one and be an Exchangeable Partition Probability Function (EPPF). The exchangeability condition requires that the probabilities depend only on the counts C_t, not on the order in which the agent sampled the transitions.
Here we consider one of the most common EPPF models in the literature, the Chinese Restaurant Process (CRP) or Polya urn process [1]. According to the CRP model, the probability of observing a state is

p_i(C_t) = \frac{#_i}{t + \theta}  for i = 1, ..., K_t   (3)

p_\psi(C_t) \equiv p_{K_t+1}(C_t) = \frac{\theta}{t + \theta}   (4)

where (3) describes revisiting a state and (4) describes the undiscovered probability mass (UPM), i.e., the probability of discovering a new state, which is then labeled K_t + 1. In the following, the set of undiscovered states will be denoted by \psi. Using this formalism, the agent must define a separate CRP for each state-action pair s, a. The internal model is then described by

\hat{\Theta}_{sas'} = p_{s'|s,a}(C_t),   (5)

updated according to (3, 4). The t index in \hat{\Theta}_{sas'} is suppressed for the sake of notational ease.
Our simplest agent uses a CRP (3, 4) with fixed \theta. Further, we will investigate an Empirical Bayes CRP, referred to as EB-CRP, in which the parameter \theta is learned and adjusted online from observations using a maximum likelihood estimate (MLE).
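The predictive rule (3)-(4) for one state-action pair can be sketched directly; the class below is an illustrative reading of those two equations, not code from the paper:

```python
class CRP:
    """Chinese Restaurant Process over the successors of one (s, a)
    pair; Eqs. (3)-(4). theta is the concentration parameter."""
    def __init__(self, theta=0.25):
        self.theta = theta
        self.counts = {}   # successor state -> observed count (C_t)
        self.t = 0         # total number of observations

    def observe(self, s):
        self.counts[s] = self.counts.get(s, 0) + 1
        self.t += 1

    def predictive(self):
        """Return ({s': p(s')}, p_psi): revisit probabilities, Eq. (3),
        and the undiscovered probability mass (UPM), Eq. (4)."""
        z = self.t + self.theta
        return {s: c / z for s, c in self.counts.items()}, self.theta / z

crp = CRP(theta=1.0)
for s in ['A', 'A', 'B']:
    crp.observe(s)
probs, upm = crp.predictive()
# with t=3 and theta=1: p(A)=2/4, p(B)=1/4, UPM=1/4
```

Note that the UPM never vanishes, so the agent always reserves probability for states it has not yet seen.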
This is similar to the approach of [22], but we follow a more straightforward path and derive an MLE of \theta using the EPPF of the CRP and employing an approximation of the harmonic series.
The likelihood of observing a given set of state counts is described by the EPPF of the CRP [8]:

\pi(C_t; \theta) = \frac{\theta^{K_t}}{\prod_{i=0}^{t-1} (\theta + i)} \prod_{i=1}^{K_t} (#_i - 1)!   (6)

Maximizing the log likelihood,

\frac{d}{d\theta} \ln(\pi(C_t; \theta)) = \frac{K_t}{\theta} - \sum_{i=0}^{t-1} \frac{1}{\theta + i} = 0   (7)

yields

\theta(t) \approx \frac{K_t}{\ln(t) + \gamma + \frac{1}{2t} - \frac{1}{12 t^2}},   (8)

where (8) uses a closed-form approximation of the harmonic series in (7) with the Euler-Mascheroni constant \gamma. In our EB-CRP agent, the parameter \theta is updated after each observation according to (8).

2.3 Information-theoretic assessment of learning

Assessing or guiding the progress of the agent in the exploration process can be done by comparing probability distributions. For example, learning progress should increase the similarity between the internal model, \hat{\Theta}, of the agent and the true model, \Theta. A popular measure for comparing distributions of the same dimensions is the KL divergence, D_{KL}. However, in our case, with the size of the underlying state space unknown and states being discovered successively in \hat{\Theta}, models of different sizes have to be compared.
To address this, we apply the following padding procedure to the smaller model with fewer discovered states and transitions (Figure 1). If the smaller model, \hat{\Theta}, has n undiscovered state transitions from a known origin state, one splits the UPM uniformly into n equal probabilities (Figure 1a).
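The closed-form estimate (8) can be checked against a direct numerical solution of the stationarity condition (7). The sketch below is illustrative; the bisection solver assumes a single sign change of the derivative over the search interval, which holds in typical cases with 1 < K_t < t:

```python
import math

GAMMA = 0.5772156649015329  # Euler-Mascheroni constant

def theta_mle_approx(K, t):
    """Closed-form approximation, Eq. (8): the harmonic-like sum in (7)
    is replaced by ln t + gamma + 1/(2t) - 1/(12 t^2)."""
    return K / (math.log(t) + GAMMA + 1 / (2 * t) - 1 / (12 * t ** 2))

def theta_mle_exact(K, t, lo=1e-8, hi=1e6):
    """Solve Eq. (7), K/theta = sum_{i=0}^{t-1} 1/(theta+i), by bisection."""
    def g(th):
        return K / th - sum(1.0 / (th + i) for i in range(t))
    for _ in range(200):
        mid = (lo + hi) / 2
        if g(mid) > 0:
            lo = mid     # derivative still positive: root lies above
        else:
            hi = mid
    return (lo + hi) / 2
```

For moderate t the two estimates agree closely, which is why the cheap closed form (8) suffices for online updates.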
The resulting padded model is given by

\hat{\Theta}^P_{sas'} =
  \hat{\Theta}_{sa\psi} / (|S_{\Theta_{sa}}| - |S_{\hat{\Theta}_{sa}}|),  if \hat{\Theta}_{sas'} = 0   [Figure 1a]
  1 / |S_{\Theta_{sa}}|,  if s \notin S_{\hat{\Theta}}   [Figure 1b]
  \hat{\Theta}_{sas'},  if \hat{\Theta}_{sas'} > 0
   (9)

where |S_{\Theta_{sa}}| is the number of known states reachable from state s by taking action a in \Theta. Further, if there are undiscovered origin states in \hat{\Theta}, one adds such states and a uniform transition kernel to potential target states (Figure 1b).

Figure 1: Illustration of the padding procedure for adding unknown states and state transitions in a smaller, less informed model, \hat{\Theta}, of an unbounded environment in order to compare it with a larger, better informed model, \Theta. (a) If transitions to target states are missing, we uniformly split the UPM into equal transition probabilities to the missing target states, which are in fact the unknown elements of the set \psi. (b) If a state is not discovered yet, we paste this state in with a uniform transition distribution to all target states reachable in the larger model, \Theta.

With this type of padding procedure we can define a distance between two unequally sized models,

D_{KLP}(\Theta_{sa.} || \hat{\Theta}_{sa.}) := D_{KL}(\Theta_{sa.} || \hat{\Theta}^P_{sa.}) := \sum_{s' \in S_{\Theta_{sa}}} \Theta_{sas'} \log_2\left( \frac{\Theta_{sas'}}{\hat{\Theta}^P_{sas'}} \right),   (10)

IM(\Theta || \hat{\Theta}) := \sum_{s \in S, a \in A} D_{KLP}(\Theta_{sa.} || \hat{\Theta}_{sa.}),   (11)

and use it to extend previous information measures for assessing and guiding explorative learning [12] to unbounded state spaces.
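Case (a) of the padding procedure (9) and the resulting divergence (10) can be sketched for a single transition row. This is an illustrative simplification (dictionaries over successor labels, one (s, a) pair), not the paper's implementation:

```python
import math

def pad_row(est_row, upm, true_support):
    """Padding, Eq. (9), case (a): split the UPM of the estimate
    uniformly over target states present in the larger model but
    missing from the smaller one."""
    missing = [s for s in true_support if s not in est_row]
    padded = dict(est_row)
    for s in missing:
        padded[s] = upm / len(missing)   # uniform split of the UPM
    return padded

def dklp(true_row, est_row, upm):
    """D_KLP of Eq. (10): KL divergence (in bits) of the true row
    from the padded estimate row."""
    padded = pad_row(est_row, upm, true_row)
    return sum(p * math.log2(p / padded[s])
               for s, p in true_row.items() if p > 0)

true_row = {1: 0.5, 2: 0.3, 3: 0.2}   # hypothetical true transition row
est_row = {1: 0.6, 2: 0.2}            # estimate; state 3 undiscovered
d = dklp(true_row, est_row, upm=0.2)  # UPM fills in the missing target
```

Because the padded row still sums to one, the comparison remains a divergence between two proper distributions.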
First, we define Missing Information (11), a quantity an external observer can use for assessing the deficiency of the internal model of the agent with respect to the true model. Second, we define Information Gain,

IG(s, a, s') := IM(\Theta || \hat{\Theta}) - IM(\Theta || \hat{\Theta}^{s,a \to s'}),   (12)

a quantity measuring the improvement between two models, in this case, between the current internal model of the agent, \hat{\Theta}, and an improved one, \hat{\Theta}^{s,a \to s'}, which represents an updated model after observing a new state transition from s to s' under action a.

2.4 Predicted information gain

Predicted information gain (PIG), as used in [12], is the expected information gain for a given state-action pair. To extend the previous formula in [12] to compute this expectation in the non-parametric setting, we again make use of the padding procedure described in the last section:

PIG(s, a) := E_{s', \Theta | C_t}[IG(s, a, s')] = \hat{\Theta}_{sa\psi} D_{KLP}(\hat{\Theta}^{s,a \to \eta}_{sa.} || \hat{\Theta}_{sa.}) + \sum_{s' \in S_{\hat{\Theta}_{sa}}} \hat{\Theta}_{sas'} D_{KL}(\hat{\Theta}^{s,a \to s'}_{sa.} || \hat{\Theta}_{sa.})   (13)

Here, D_{KLP} handles the case where the agent, during its planning, hypothetically discovers a new target state, \eta \in \psi, from the state-action pair s, a. There is one small difference in calculating D_{KLP} compared with the previous section: in equation (9), S_{\Theta_{sa}} is replaced by S_{\hat{\Theta}^{s,a \to \eta}_{sa}}. Thus the RHS of (13) can be computed internally by the agent for action planning, as it does not contain the true model, \Theta.

2.5 Value Iteration

When states of low information gain separate the agent from states of high information gain in the environment, greedy maximization of PIG performs poorly.
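A rough sketch of the two terms of Eq. (13) for a single (s, a) pair follows. Several choices here are simplifying assumptions, not the paper's exact procedure: the KL sums run unnormalized over discovered states only, and the hypothetical new state 'eta' receives the whole current UPM under padding (the n = 1 case of Eq. (9)):

```python
import math

def kl_rows(p, q):
    """Unnormalized KL, in bits, over the discovered states of p."""
    return sum(pi * math.log2(pi / q[s]) for s, pi in p.items() if pi > 0)

def crp_row(counts, theta, t):
    """Predictive probabilities, Eq. (3), for discovered successors."""
    return {s: c / (t + theta) for s, c in counts.items()}

def pig(counts, theta, t):
    """Illustrative sketch of Eq. (13) for one (s, a) pair."""
    upm = theta / (t + theta)                 # Eq. (4)
    cur = crp_row(counts, theta, t)
    # Discovery term: hypothetical transition to a fresh state 'eta';
    # padding gives 'eta' the entire current UPM (one missing target).
    disc = dict(counts); disc['eta'] = 1
    padded_cur = dict(cur); padded_cur['eta'] = upm
    total = upm * kl_rows(crp_row(disc, theta, t + 1), padded_cur)
    # Revisit terms: hypothetical transitions to known successors.
    for s in counts:
        upd = dict(counts); upd[s] = upd[s] + 1
        total += cur[s] * kl_rows(crp_row(upd, theta, t + 1), cur)
    return total
```

As the counts grow, each hypothetical observation changes the model less, so the expected gain shrinks; this is what drives the agent toward poorly sampled regions.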
Thus, like in [12], we employ value iteration using the Bellman equations [4]. We begin at a distant time point (\tau = 0), assigning initial values to PIG. Then, we propagate backward in time, calculating the expected reward:

Q_0(s, a) := PIG(s, a)   (14)

Q_{\tau-1}(s, a) := PIG(s, a) + \lambda \left[ \hat{\Theta}_{sa\psi} V_\tau(\psi) + \sum_{s' \in S_{\hat{\Theta}_{sa}}} \hat{\Theta}_{sas'} V_\tau(s') \right]   (15)

V_\tau(s) := \max_a Q_\tau(s, a)   (16)

With the discount factor, \lambda, set to 0.95, one can define how actions are chosen by all our PIG agents:

a_{PIG} := argmax_a Q_{-10}(s, a)   (17)

3 Experimental Results

Here we describe simulation experiments with our two models, CRP-PIG and EB-CRP-PIG, and compare them with published approaches. The models are tested in environments defined in the literature and also in an unbounded world.
First, the agents were tested in a bounded maze environment taken from [12] (Figure 2). The state space in the maze consists of the |S| = 36 rooms. There are |A| = 4 actions that correspond to noisy translations in the four cardinal directions, drawn from a Dirichlet distribution. To make the task of learning harder, 30 transporters are distributed amongst the walls, which lead to an absorbing state (state 29, marked by concentric rings in Figure 2).
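The backup (14)-(16) and the greedy rule (17) amount to finite-horizon value iteration with PIG as the reward. The sketch below is a generic illustration under simplifying assumptions (PIG values precomputed in a table; the UPM target \psi could be added as one extra pseudo-state but is omitted):

```python
def plan(pig_table, trans, states, actions, lam=0.95, horizon=10):
    """Finite-horizon value iteration, Eqs. (14)-(16): back up expected
    future PIG through the agent's model; Eq. (17) then acts greedily.
    trans[(s, a)] maps successor states to model probabilities."""
    Q = {(s, a): pig_table[(s, a)] for s in states for a in actions}
    for _ in range(horizon):
        V = {s: max(Q[(s, a)] for a in actions) for s in states}   # Eq. (16)
        Q = {(s, a): pig_table[(s, a)]                             # Eq. (15)
                     + lam * sum(p * V[sp] for sp, p in trans[(s, a)].items())
             for s in states for a in actions}
    return {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}

# Toy model: state 1 is informative, state 0 is not; 'go' moves 0 -> 1.
states, actions = [0, 1], ['stay', 'go']
pig_table = {(0, 'stay'): 0.0, (0, 'go'): 0.0,
             (1, 'stay'): 1.0, (1, 'go'): 0.2}
trans = {(0, 'stay'): {0: 1.0}, (0, 'go'): {1: 1.0},
         (1, 'stay'): {1: 1.0}, (1, 'go'): {0: 1.0}}
policy = plan(pig_table, trans, states, actions)
```

In this toy model, greedy PIG is indifferent at state 0 (both actions have zero immediate gain), but the backed-up values steer the agent toward the informative state, illustrating why the non-greedy horizon matters.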
Absorbing states, such as at the bottom of gravity wells, are common in real-world environments and pose serious challenges for many exploration algorithms [12].
We compare the learning strategies proposed here, CRP-PIG and EB-CRP-PIG, with the following strategies:

Random action: A negative control, representing the minimally directed action policy that any directed action policy should beat.
Least Taken Action (LTA): A well-known explorative strategy that simply takes the action it has taken least often in the current state [16].
Counter-Based Exploration (CB): Another explorative strategy from the literature that attempts to induce a uniform sampling across states [21].
DP-PIG: The strategy of [12], which applies the same objective function as described here but is given the size of the state space and is therefore at an advantage. This agent uses a Dirichlet process (DP) with \alpha set to 0.20, which was found empirically to be optimal for the maze environment.
Unembodied: An agent which can choose any action from any state at each time step (hence unembodied) and can therefore attain the highest PIG possible at every sampling step. This strategy represents a positive control.

Figure 2: Bounded Maze environment. Two transition distributions, \Theta_{sa.}, are depicted, one for (s=13, a='left') and one for (s=9, a='up'). Dark versus light gray arrows represent high versus low probabilities. For (s=13, a='left'), the agent moves with highest probability left into a transporter (blue line), leading it to the absorbing state 29 (blue concentric rings). With smaller probabilities the agent moves up, moves down, or is reflected back to its current state by the wall to the right. The second transition distribution is displayed similarly.

Figure 3 depicts the missing information (11) in the bounded maze for the various learning strategies over 3000 sampling steps, averaged over 200 runs.
All PIG-based embodied strategies exhibit a faster decrease of missing information with sampling, though still significantly slower than the unembodied control. In this finite environment the DP-PIG agent, with the correct Dirichlet prior (experimentally optimized \alpha-parameter), has an advantage over the CRP-based agents and reduces the missing information more quickly. However, the new strategies for unbounded state spaces still outperform the competitor agents from the literature by far. Interestingly, EB-CRP-PIG with continuously adjusted \theta reduces missing information significantly faster than CRP-PIG with a fixed, experimentally optimized \theta = 0.25.

Figure 3: Missing Information vs. Time for EB-CRP-PIG and several other strategies in the bounded maze environment.

To directly assess how efficient learning translates to the ability to harvest reward, we consider the 5-state "Chain" problem [19], a popular benchmark problem shown in Figure 4. In this environment, agents have two actions available, a and b, which cause transitions between the five states. At each time step the agent "slips" and performs the opposite action with probability p_slip = 0.2. The agent receives a reward of 2 for taking action b in any state, and a reward of 0 for taking action a in every state but the last, in which it receives a reward of 10.

Figure 4: Chain Environment.

The optimal policy, always choosing action a to reach the highest reward at the end of the chain, is used as a positive control for this experiment. We follow the protocol in previous publications and report the cumulative reward over 1000 steps, averaged over 500 runs. Our agent EB-CRP-PIG-R executes the EB-CRP-PIG strategy for S steps, then computes the best reward policy given its internal model and executes it for the remaining 1000-S steps.
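The Chain dynamics described above can be encoded in a few lines. The sketch below follows the common formulation of the benchmark; details not spelled out in the text, such as action b returning the agent to the first state and a keeping it in the last state, are assumptions taken from the standard version of the task:

```python
import random

N_STATES, P_SLIP = 5, 0.2

def chain_step(s, a, rng=random):
    """Return (next_state, reward) for taking action a in state s of the
    5-state Chain benchmark [19]; rewards follow the description above."""
    if rng.random() < P_SLIP:          # the agent "slips": opposite action
        a = 'b' if a == 'a' else 'a'
    if a == 'b':
        return 0, 2                    # b: reward 2 in any state, reset
    if s == N_STATES - 1:
        return s, 10                   # a in the last state: reward 10
    return s + 1, 0                    # a elsewhere: advance, reward 0
```

The slip probability is what makes the problem interesting for exploration: a myopic learner that has only ever slipped backward may conclude that action a is worthless.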
We found S=120 to be roughly optimal for our agent and display the results of the experiment in Table 1, taking the results of the competitor algorithms directly from the corresponding papers. The competitor algorithms define their own balance between exploitation and exploration, leading to different results.

Method            Reward
RAM-RMAX [5]      2810
BOSS [2]          3003
exploit [15]      3078
Bayesian DP [19]  3158 ± 31
EB-CRP-PIG-R      3182 ± 25
Optimal           3658 ± 14

Table 1: Cumulative reward for 1000 steps in the chain environment.

The EB-CRP-PIG-R agent performs best and significantly outperforms many of the other strategies. This result is remarkable because the EB-CRP-PIG-R agent has no prior knowledge of the state space size, unlike all the competitor models. We also note that our algorithm is extremely efficient computationally: it must approximate the optimal policy only once and then simply execute it. In comparison, the exploit strategy [15] must compute the approximation at each time step. Further, we interpret our competitive edge over BOSS to reflect a more efficient exploration strategy. Specifically, BOSS uses LTA for exploration, and Figure 3 indicates that the learning performance of LTA is far worse than that of the PIG-based models.

Figure 5: Missing Information vs. Time for EB-CRP-PIG and CRP-PIG in the unbounded maze environment.

Finally, we consider an unbounded maze environment with |S| infinite and with multiple absorbing states. Figure 5 shows the decrease of missing information (11) for the two CRP-based strategies. Interestingly, as in the bounded maze, the Empirical Bayes version reduces the missing information more rapidly than a CRP with a fixed, but experimentally optimized, parameter value.
What is important about this result is that EB-CRP-PIG is not only better but requires no prior parameter tuning, since \theta is adjusted intrinsically. Figure 6 shows how an EB-CRP-PIG and an LTA agent explore the environment over 6000 steps. The missing information for each state is color coded, light yellow representing high missing information and red representing low missing information (less than 1 bit). Note that the EB-CRP-PIG agent explores a much bigger area than the LTA agent.

Figure 6: Unbounded Maze environment. Exploration is depicted for two different agents, (a) EB-CRP-PIG and (b) LTA, after 2000, 4000, and 6000 exploration steps respectively. Initially all states are white (not depicted), representing unexplored states. Transporters (blue lines) move the agent to the closest gravity well (small blue concentric rings). The current position of the agent is indicated by the purple arrow.

The two agents are also tested in a reward task in the unbounded environment, to assess whether the exploration of EB-CRP-PIG leads to efficient reward acquisition. Specifically, we assign a reward to each state equal to its Euclidean distance from the starting state. As for the Chain problem before, we create two agents, EB-CRP-PIG-R and LTA-R, which each run for 1000 total steps, exploring for S=750 steps and then calculating their best reward policy and executing it for the remaining 250 steps. The agents are repositioned to the start state after the S exploration steps, before the best reward policy is executed. The simulation results are shown in Table 2.
Clearly, the increased coverage of the EB-CRP-PIG agent also results in higher reward acquisition.

Method          Reward
EB-CRP-PIG-R    1053
LTA-R           812

Table 2: Cumulative reward after 1000 steps in the unbounded maze environment.

4 Discussion

The ability to learn environments whose number of states is unknown or even unbounded is crucial for applications in biology as well as in robotics. Here we presented a principled information-based strategy for an agent to learn a model of an unknown, unbounded environment. Specifically, the proposed model uses the Chinese Restaurant Process (CRP) and a version of predicted information gain (PIG) [12], adjusted to accommodate comparisons of models with different numbers of states.
We evaluated our model in three different environments in order to assess its performance. In the bounded maze environment the new algorithm performed quite similarly to DP-PIG despite being at a disadvantage in terms of prior knowledge. This result suggests that agents exploring environments of unknown size can still develop accurate models of them quite rapidly. Since the new model is based on the CRP, calculating the posterior and sampling from it is easily tractable.
The experiments in a simple bounded reward task, the Chain environment, were equally encouraging. Although the agent was unaware of the size of its environment, it was able to learn the states and their transition probabilities quickly and retrieved a cumulative reward that was competitive with published results. Some of the competitor strategies (exploit [15]) require recomputing the best reward policy at each step. In contrast, EB-CRP-PIG computed the best policy only once, yet was able to outperform the exploit [15] strategy.
In the unbounded maze environment, EB-CRP-PIG was able to outperform CRP-PIG even though it required no prior parameter tuning.
In addition, it covered much more ground during exploration than LTA, one of the few existing competitor models able to function in unbounded environments. Specifically, the EB-CRP-PIG model evenly explored a large number of environmental states. In contrast, LTA exhaustively explored a much smaller area limited by two nearby absorbing states.
Two caveats need to be mentioned. First, although the computational complexity of the CRP is low, the complexity of the value iteration algorithm scales linearly with the number of states discovered. Thus, tractability of value iteration is an issue in EB-CRP-PIG. A possible remedy would be to calculate value iteration only for states that are reachable from the current state within the planning time horizon. Second, the described padding procedure implicitly sets a balance between seeking to discover new state transitions versus sampling from known ones. For different goals or environments this balance may not be optimal; a future investigation of alternatives for comparing models of different sizes would be very interesting.
All told, the proposed novel models overcome a major limitation of information-based learning methods, the assumption of a bounded state space of known size. Since the new models are based on the CRP, sampling is quite tractable. Interestingly, by applying Empirical Bayes to continuously update the parameter of the CRP, we are able to build agents that can explore bounded or unbounded environments with very little prior information. For describing learning in animals, models that easily adapt to diverse environments could be crucial. Of course, other restrictions in these models still need to be addressed, in particular the limitation to discrete and fully observable state spaces. For example, the need to act in continuous state spaces is obviously crucial for animals and robots.
Further, recent literature [7] supports that information-based learning in partially observable state spaces, like POMDPs [17], will be important for addressing applications in neuroscience.

5 Acknowledgements

JAA was funded by NSF grant IIS-1111765. FTS was supported by the Director, Office of Science, Office of Advanced Scientific Computing Research, Applied Mathematics program of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231. The authors thank Bruno Olshausen, Tamara Broderick, and the members of the Redwood Center for Theoretical Neuroscience for their valuable input.

References
[1] David Aldous. Exchangeability and related topics. École d'Été de Probabilités de Saint-Flour XIII - 1983, pages 1-198, 1985.
[2] John Asmuth, Lihong Li, Michael L. Littman, Ali Nouri, and David Wingate. A Bayesian sampling approach to exploration in reinforcement learning. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 19-26. AUAI Press, 2009.
[3] Nihat Ay, Nils Bertschinger, Ralf Der, Frank Güttler, and Eckehard Olbrich. Predictive information and explorative behavior of autonomous robots. The European Physical Journal B, 63(3):329-339, 2008.
[4] Richard Bellman. Dynamic Programming. Princeton University Press, 1957.
[5] Ronen I. Brafman and Moshe Tennenholtz. R-max - a general polynomial time algorithm for near-optimal reinforcement learning. The Journal of Machine Learning Research, 3:213-231, 2003.
[6] Hugo Gimbert. Pure stationary optimal strategies in Markov decision processes. In STACS 2007, pages 200-211. Springer, 2007.
[7] Jacqueline Gottlieb, Pierre-Yves Oudeyer, Manuel Lopes, and Adrien Baranes. Information-seeking, curiosity, and attention: computational and neural mechanisms.
Trends in Cognitive Sciences, 17(11):585-593, 2013.
[8] Hemant Ishwaran and Lancelot F. James. Generalized weighted Chinese restaurant processes for species sampling mixture models. Statistica Sinica, 13(4):1211-1236, 2003.
[9] Laurent Itti and Pierre Baldi. Bayesian surprise attracts human attention. Advances in Neural Information Processing Systems, 18:547, 2006.
[10] Jaeyong Lee, Fernando A. Quintana, Peter Müller, and Lorenzo Trippa. Defining predictive probability functions for species sampling models. Statistical Science, 28(2):209, 2013.
[11] Daniel Y. Little and Friedrich T. Sommer. Learning in embodied action-perception loops through exploration. arXiv preprint arXiv:1112.1125, 2011.
[12] Daniel Y. Little and Friedrich T. Sommer. Learning and exploration in action-perception loops. Frontiers in Neural Circuits, 7, 2013.
[13] David J.C. MacKay. Information-based objective functions for active data selection. Neural Computation, 4(4):590-604, 1992.
[14] Jim Pitman. Exchangeable and partially exchangeable random partitions. Probability Theory and Related Fields, 102(2):145-158, 1995.
[15] Pascal Poupart, Nikos Vlassis, Jesse Hoey, and Kevin Regan. An analytic solution to discrete Bayesian reinforcement learning. In Proceedings of the 23rd International Conference on Machine Learning, pages 697-704. ACM, 2006.
[16] Mitsuo Sato, Kenichi Abe, and Hiroshi Takeda. Learning control of finite Markov chains with an explicit trade-off between estimation and control. IEEE Transactions on Systems, Man and Cybernetics, 18(5):677-684, 1988.
[17] Adhiraj Somani, Nan Ye, David Hsu, and Wee Sun Lee. DESPOT: Online POMDP planning with regularization. In Advances in Neural Information Processing Systems, pages 1772-1780, 2013.
[18] Jan Storck, Sepp Hochreiter, and Jürgen Schmidhuber.
Reinforcement driven information acquisition in non-deterministic environments. In Proceedings of the International Conference on Artificial Neural Networks, Paris, volume 2, pages 159-164. Citeseer, 1995.
[19] Malcolm Strens. A Bayesian framework for reinforcement learning. In ICML, pages 943-950, 2000.
[20] Yi Sun, Faustino Gomez, and Jürgen Schmidhuber. Planning to be surprised: Optimal Bayesian exploration in dynamic environments. In Artificial General Intelligence, pages 41-51. Springer, 2011.
[21] Sebastian B. Thrun. Efficient exploration in reinforcement learning. 1992.
[22] Jian Zhang, Zoubin Ghahramani, and Yiming Yang. A probabilistic model for online document clustering with application to novelty detection. In NIPS, volume 4, pages 1617-1624, 2004.
", "award": [], "sourceid": 1573, "authors": [{"given_name": "Shariq", "family_name": "Mobin", "institution": "UC Berkeley"}, {"given_name": "James", "family_name": "Arnemann", "institution": "UC Berkeley"}, {"given_name": "Fritz", "family_name": "Sommer", "institution": "UC Berkeley"}]}