{"title": "Hierarchical Optimistic Region Selection driven by Curiosity", "book": "Advances in Neural Information Processing Systems", "page_first": 1448, "page_last": 1456, "abstract": "This paper aims to take a step forwards making the term ``intrinsic motivation'' from reinforcement learning theoretically well founded, focusing on curiosity-driven learning. To that end, we consider the setting where, a fixed partition P of a continuous space X being given, and a process \\nu defined on X being unknown, we are asked to sequentially decide which cell of the partition to select as well as where to sample \\nu in that cell, in order to minimize a loss function that is inspired from previous work on curiosity-driven learning. The loss on each cell consists of one term measuring a simple worst case quadratic sampling error, and a penalty term proportional to the range of the variance in that cell. The corresponding problem formulation extends the setting known as active learning for multi-armed bandits to the case when each arm is a continuous region, and we show how an adaptation of recent algorithms for that problem and of hierarchical optimistic sampling algorithms for optimization can be used in order to solve this problem. The resulting procedure, called Hierarchical Optimistic Region SElection driven by Curiosity (HORSE.C) is provided together with a finite-time regret analysis.", "full_text": "Hierarchical Optimistic Region Selection driven by\n\nCuriosity\n\nOdalric-Ambrym Maillard\n\nLehrstuhl f\u00a8ur Informationstechnologie\n\nMontanuniversit\u00a8at Leoben\nLeoben, A-8700, Austria\n\nodalricambrym.maillard@gmail.com\n\nAbstract\n\nThis paper aims to take a step forwards making the term \u201cintrinsic motivation\u201d\nfrom reinforcement learning theoretically well founded, focusing on curiosity-\ndriven learning. 
To that end, we consider the setting where, a fixed partition P of a continuous space X being given, and a process \u03bd defined on X being unknown, we are asked to sequentially decide which cell of the partition to select as well as where to sample \u03bd in that cell, in order to minimize a loss function that is inspired by previous work on curiosity-driven learning. The loss on each cell consists of one term measuring a simple worst-case quadratic sampling error, and a penalty term proportional to the range of the variance in that cell. The corresponding problem formulation extends the setting known as active learning for multi-armed bandits to the case when each arm is a continuous region, and we show how an adaptation of recent algorithms for that problem and of hierarchical optimistic sampling algorithms for optimization can be used in order to solve this problem. The resulting procedure, called Hierarchical Optimistic Region SElection driven by Curiosity (HORSE.C), is provided together with a finite-time regret analysis.\n\n1 Introduction\n\nIn this paper, we focus on the setting of intrinsically motivated reinforcement learning (see Oudeyer and Kaplan [2007], Baranes and Oudeyer [2009], Schmidhuber [2010], Graziano et al. [2011]), which is an important emergent topic that poses new, difficult and interesting challenges for the theorist. Indeed, although some formal objective criteria have been proposed to implement specific notions of intrinsic rewards (see Jung et al. [2011], Martius et al. [2007]), so far much, and only, experimental work has been carried out for this problem, often with interesting output (see Graziano et al. [2011], Mugan [2010], Konidaris [2011]) but unfortunately with no performance guarantee validating a proposed approach. Thus proposing such an analysis may have great immediate consequences for validating some experimental studies.\nMotivation. 
A typical example is the work of Baranes and Oudeyer [2009] about curiosity-driven learning (and later on Graziano et al. [2011], Mugan [2010], Konidaris [2011]), where a precise algorithm is defined together with an experimental study, yet no formal goal is defined and no analysis is performed either. They consider a so-called sensory-motor space X def= S \u00d7 M \u2282 [0, 1]^d where S is a (continuous) state space and M is a (continuous) action space. There is no reward, yet one can consider that the goal is to actively select and sample subregions of X for which a notion of \u201clearning progress\u201d - this intuitively measures the decay of some notion of error when successively sampling into one subregion - is maximal. Two key components are advocated in Baranes and Oudeyer [2009] in order to achieve successful results (even though success is a fuzzy notion):\n\n\u2022 The use of a hierarchy of regions, where each region is progressively split into sub-regions.\n\n\u2022 Splitting leaf-regions in two based on the optimization of the dissimilarity, amongst the regions, of the learning progress. The idea is to identify regions with a learning complexity that is globally constant in that region, which also provides better justification for allocating samples between identified regions.\n\nWe believe it is possible to go one step towards a full performance analysis of such algorithms, by relating the corresponding active sampling problem to existing frameworks.\nContribution. This paper aims to take a step towards making the term \u201cintrinsic motivation\u201d from reinforcement learning theoretically well founded, focusing on curiosity-driven learning. 
We introduce a mathematical framework in which a metric space (which intuitively plays the role of the state-action space) is divided into regions and a learner has to sample from an unknown random function in a way that reduces a notion of error measure the most. This error consists of two terms: the first one is a robust measure of the quadratic error between the observed samples and their unknown mean, and the second one penalizes regions with non-constant learning complexity, thus enforcing the notion of curiosity. The paper focuses on how to choose the region to sample from, when a partition of the space is provided.\nThe resulting problem formulation can be seen as a non-trivial extension of the setting of active learning in multi-armed bandits (see Carpentier et al. [2011] or Antos et al. [2010]), where the main idea is to estimate the variance of each arm and sample proportionally to it, to the case when each arm is a region as opposed to a point. In order to deal with this difficulty, the maximal and minimal variance inside each region is tracked by means of a hierarchical optimization procedure, in the spirit of the HOO algorithm from Bubeck et al. [2011]. This leads to a new procedure called Hierarchical Optimistic Region SElection driven by Curiosity (HORSE.C) for which we provide a theoretical performance analysis.\nOutline. The outline of the paper is the following. In Section 2 we introduce the precise setting and define the objective function. Section 3 defines our assumptions. Then in Section 4 we present the HORSE.C algorithm. Finally in Section 5, we provide the main Theorem 1 that gives performance guarantees for the proposed algorithm.\n\n2 Setting: Robust region allocation with curiosity-inducing penalty.\nLet X be a metric space and let Y \u2282 R^d be a normed space, equipped with the Euclidean norm || \u00b7 ||. 
We consider an unknown Y-valued process defined on X, written \u03bd : X \u2192 M_1^+(Y), where M_1^+(Y) refers to the set of all probability measures on Y, such that for all x \u2208 X, the random variable Y \u223c \u03bd(x) has mean \u00b5(x) \u2208 R^d and covariance matrix \u03a3(x) \u2208 M_{d,d}(R) assumed to be diagonal. We thus introduce for convenience the notation \u03c1(x) def= trace(\u03a3(x)), where trace is the trace operator (this corresponds to the variance in dimension 1). We call X the input space or sampling space, and Y the output space or value space.\nIntuition Intuitively, when applied to the setting of Baranes and Oudeyer [2009], X def= S \u00d7 A is the space of state-action pairs, where S is a continuous state space and A a continuous action space, \u03bd is the transition kernel of an unknown MDP, and finally Y def= S. This is the reason why we consider Y \u2282 R^d and not only Y \u2282 R as would seem more natural. One difference is that we assume (see Section 3) that we can sample anywhere in X, which is a restrictive yet common assumption in the reinforcement learning literature. How to get rid of this assumption is an open and challenging question that is left for future work.\nSampling error and robustness Let us consider a sequential sampling process on X, i.e. 
a process that samples at time t a value Y_t \u223c \u03bd(X_t) at point X_t, where X_t \u2208 F 0\n\n$$\\eta_t \\overset{\\mathrm{def}}{=} \\frac{1}{t} \\sum_{s=1}^{t} \\big( Y_s - \\mu(X_s) \\big) \\in \\mathbb{R}^d .$$\n\n$$\\mathbb{E}\\big[ \\|\\eta_t\\|^2 \\big] = \\frac{1}{t} \\mathbb{E}\\Big[ \\frac{1}{t} \\sum_{s=1}^{t} \\rho(X_s) \\Big] .$$\n\nA similar property holds for a region R \u2282 X that has been sampled n_t(R) times, and in order to be robust against a bad sampling strategy inside a region, it is natural to look at the worst case error, that we define as\n\n$$e_R(n_t) \\overset{\\mathrm{def}}{=} \\frac{\\sup_{x \\in R} \\rho(x)}{n_t(R)} .$$\n\nOne reason for looking at robustness is that for instance, in the case we work with an MDP, we are generally not completely free to choose the sample X_s \u2208 S \u00d7 A: we can only choose the action and the next state is generally given by Nature. Thus, it is important to be able to estimate this worst case error so as to guard against bad situations.\nGoal Now let P be a fixed, known partition of the space X and consider the following game. The goal of an algorithm is, at each time step t, to propose one point x_t where to sample the space X, so that its allocation of samples {n_t(R)}_{R\u2208P} (that is, the number of points sampled in each region) minimizes some objective function. Thus, the algorithm is free to sample everywhere in each region, with the goal that the total number of points chosen in each region is optimal in some sense. A simple candidate for this objective function would be the following\n\n$$L_P(n_t) \\overset{\\mathrm{def}}{=} \\max\\big\\{ e_R(n_t) ; R \\in P \\big\\} ,$$\n\nhowever, in order to incorporate a notion of curiosity, we would also like to penalize regions that have a variance term \u03c1 that is non homogeneous (i.e. the less homogeneous, the more samples we allocate). Indeed, if a region has constant variance, then we do not really need to understand more its internal structure, and thus it is better to focus on another region that has very heterogeneous variance. 
For instance, one would like to split such a region into several homogeneous parts, which is essentially the idea behind section C.3 of Baranes and Oudeyer [2009]. We thus add a curiosity-penalization term to the previous objective function, which leads us to define the pseudo-loss of an allocation n_t def= {n_t(R)}_{R\u2208P} in the following way:\n\n$$L_P(n_t) \\overset{\\mathrm{def}}{=} \\max\\Big\\{ e_R(n_t) + \\lambda |R| \\big( \\max_{x \\in R} \\rho(x) - \\min_{x \\in R} \\rho(x) \\big) ; R \\in P \\Big\\} . \\qquad (1)$$\n\nIndeed, this means that we do not want to focus just on regions with high variance, but also trade-off with highly heterogeneous regions, which is coherent with the notion of curiosity (see Oudeyer and Kaplan [2007]). For convenience, we also define the pseudo-loss of a region R by\n\n$$L_R(n_t) \\overset{\\mathrm{def}}{=} e_R(n_t) + \\lambda |R| \\big( \\max_{x \\in R} \\rho(x) - \\min_{x \\in R} \\rho(x) \\big) .$$\n\nRegret The regret (or loss) of an allocation algorithm at time T is defined as the difference between the cumulated pseudo-loss of the allocations n_t = {n_{R,t}}_{R\u2208P} proposed by the algorithm and that of the best allocation strategy $n_t^\\star = \\{n_{R,t}^\\star\\}_{R \\in P}$ at each time step; we define\n\n$$R_T \\overset{\\mathrm{def}}{=} \\sum_{t=|P|}^{T} \\big( L_P(n_t) - L_P(n_t^\\star) \\big) ,$$\n\nwhere an optimal allocation at time t is defined by\n\n$$n_t^\\star \\in \\operatorname{argmin}\\Big\\{ L_P(n_t) ; \\{n_t(R)\\}_{R \\in P} \\text{ is such that } \\sum_{R \\in P} n_t(R) = t \\Big\\} .$$\n\nNote that the sum starts at t = |P| for a technical reason, since for t < |P|, whatever the allocation, there is always at least one region with no sample, and thus L_P(n_t) = \u221e.\nExample 1 In the special case when X = {1, . . . 
, K} is finite with K \u226a T, and when P is the complete partition (each cell corresponds to exactly one point), the penalization term is canceled. Thus the problem reduces to the choice of the quantities n_t(i) for each arm i, and the loss of an allocation simply becomes\n\n$$L(n_t) \\overset{\\mathrm{def}}{=} \\max\\Big\\{ \\frac{\\rho(i)}{n_t(i)} ; 1 \\le i \\le K \\Big\\} .$$\n\nThis almost corresponds to the already challenging setting analyzed for instance in Carpentier et al. [2011] or Antos et al. [2010]. The difference is that we are interested in the cumulative regret of our allocation instead of only the regret suffered for the last round as considered in Carpentier et al. [2011] or Antos et al. [2010]. Also we directly target \u03c1(i)/n_t(i) whereas they consider the mean sampling error (but both terms are actually of the same order). Thus the setting we consider can be seen as a generalization of these works to the case when each arm corresponds to a continuous sampling domain.\n\n3 Assumptions\n\nIn this section, we introduce some mild assumptions. We essentially assume that the unknown distribution has sub-Gaussian noise and smooth mean and variance functions. These are actually very mild assumptions. Concerning the algorithm, we consider it can use a partition tree of the space, and that this one is essentially not degenerate (a typical binary tree that satisfies all the following assumptions is such that each cell is split in two children of equal volume). Such assumptions on trees have been extensively discussed for instance in Bubeck et al. [2011].\nSampling At any time, we assume that we are able to sample at any point in X, i.e. 
we assume we have a generative model (see footnote 1) of the unknown distribution \u03bd.\nUnknown distribution We assume that \u03bd is sub-Gaussian, meaning that for all fixed x \u2208 X\n\n$$\\forall \\lambda \\in \\mathbb{R}^d \\quad \\ln \\mathbb{E} \\exp\\big[ \\langle \\lambda , Y - \\mu(x) \\rangle \\big] \\le \\frac{\\lambda^T \\Sigma(x) \\lambda}{2} ,$$\n\nand has diagonal covariance matrix in each point (see footnote 2).\nThe function \u00b5 is assumed to be Lipschitz w.r.t. a metric \u2113_1, i.e. it satisfies\n\n$$\\forall x, x' \\in X \\quad \\|\\mu(x) - \\mu(x')\\| \\le \\ell_1(x, x') .$$\n\nSimilarly, the function \u03c1 is assumed to be Lipschitz w.r.t. a metric \u2113_2, i.e. it satisfies\n\n$$\\forall x, x' \\in X \\quad |\\rho(x) - \\rho(x')| \\le \\ell_2(x, x') .$$\n\nHierarchy We assume that Y is a convex and compact subset of [0, 1]^d. We consider an infinite binary tree T whose nodes correspond to regions of X. A node is indexed by a pair (h, i), where h \u2265 0 is the depth of the node in T and 0 \u2264 i < 2^h is the position of the node at depth h. We write R(h, i) \u2282 X the region associated with node (h, i). The regions are fixed in advance, are all assumed to be measurable with positive measure, and must satisfy that for each h \u2265 1, {R(h, i)}_{0\u2264i<2^h} is a partition of X that is compatible with depth h \u2212 1, where R(0, 0) def= X; in particular for all h \u2265 0, for all 0 \u2264 i < 2^h, then\n\n$$R(h, i) = R(h+1, 2i) \\cup R(h+1, 2i+1) .$$\n\nIn dimension d, a standard way to define such a tree is to split each parent node in half along the largest side of the corresponding hyper-rectangle, see Bubeck et al. [2011] for details.\nFor a finite sub-tree T_t of T, we write Leaf(T_t) for the set of all leaves of T_t. 
For a region (h, i) \u2208 T_t, we denote by C_t(h, i) the set of its children in T_t, and by T_t(h, i) the subtree of T_t starting with root node (h, i).\nAlgorithm and partition The partition P is assumed to be such that each of its regions R corresponds to one region R(h, i) \u2208 T; equivalently, there exists a finite sub-tree T_0 \u2282 T such that Leaf(T_0) = P. An algorithm is only allowed to expand one node of T_t at each time step t. In the sequel, we write indifferently P \u2208 T and (h, i) \u2208 T or P and R(h, i) \u2282 X to refer to the partition or one of its cells.\n\nExponential decays Finally, we assume that the \u2113_1 and \u2113_2 diameters of the region R(h, i) as well as its volume |R(h, i)| decay at exponential rate in the sense that there exist positive constants \u03b3, \u03b3_1, \u03b3_2 and c, c_1, c_2 such that for all h \u2265 0,\n\n$$|R(h,i)| \\le c \\gamma^h , \\quad \\max_{x', x \\in R(h,i)} \\ell_1(x, x') \\le c_1 \\gamma_1^h \\quad \\text{and} \\quad \\max_{x', x \\in R(h,i)} \\ell_2(x, x') \\le c_2 \\gamma_2^h .$$\n\nSimilarly, we assume that there exist positive constants c' \u2264 c, c'_1 \u2264 c_1 and c'_2 \u2264 c_2 such that for all h \u2265 0,\n\n$$|R(h,i)| \\ge c' \\gamma^h , \\quad \\max_{x', x \\in R(h,i)} \\ell_1(x, x') \\ge c'_1 \\gamma_1^h \\quad \\text{and} \\quad \\max_{x', x \\in R(h,i)} \\ell_2(x, x') \\ge c'_2 \\gamma_2^h .$$\n\nThis assumption is made to avoid degenerate trees and for general purpose only. It actually holds for any reasonable binary tree.\n\nFootnote 1: using the standard terminology in Reinforcement Learning.\nFootnote 2: this assumption is only here to make calculations easier and avoid nasty technical considerations that anyway do not affect the order of the final regret bound but only concern second order terms.\n\n4 Allocation algorithm\n\nIn this section, we now introduce the main algorithm of this paper in order to solve the problem considered in Section 2. 
It is called Hierarchical Optimistic Region SElection driven by Curiosity. Before proceeding, we need to define some quantities.\n\n4.1 High-probability upper-bound and lower-bound estimations\n\nLet us consider the following (biased) estimator\n\n$$\\hat\\sigma_t^2(R) \\overset{\\mathrm{def}}{=} \\frac{1}{N_t(R)} \\sum_{s=1}^{t} \\|Y_s\\|^2 \\mathbb{I}\\{X_s \\in R\\} - \\Big\\| \\frac{1}{N_t(R)} \\sum_{s=1}^{t} Y_s \\mathbb{I}\\{X_s \\in R\\} \\Big\\|^2 .$$\n\nApart from a small multiplicative bias by a factor (N_t(R) \u2212 1)/N_t(R), it has more importantly a positive bias due to the fact that the random variables do not share the same mean; this phenomenon is the same as the estimation of the average variance for independent but non i.i.d. variables with different means {\u00b5_i}_{i\u2264n}, where the bias would be given by $\\frac{1}{n} \\sum_{i=1}^{n} \\big[ \\mu_i - \\frac{1}{n} \\sum_{j=1}^{n} \\mu_j \\big]^2$ (see Lemma 5). In our case, it is thus always non negative, and under the assumption that \u00b5 is Lipschitz w.r.t. the metric \u2113_1, it is fortunately bounded by d_1(R)^2, where d_1(R) is the diameter of R w.r.t. the metric \u2113_1.\nWe then introduce the two following key quantities, defined for all x \u2208 R and \u03b4 \u2208 [0, 1] by\n\n$$U_t(R, x, \\delta) \\overset{\\mathrm{def}}{=} \\hat\\sigma_t^2(R) + (1 + 2\\sqrt{d}) \\sqrt{\\frac{d \\ln(2d/\\delta)}{2 N_t(R)}} + \\frac{d \\ln(2d/\\delta)}{2 N_t(R)} + \\frac{1}{N_t(R)} \\sum_{s=1}^{t} \\ell_2(X_s, x) \\mathbb{I}\\{X_s \\in R\\} ,$$\n\n$$L_t(R, x, \\delta) \\overset{\\mathrm{def}}{=} \\hat\\sigma_t^2(R) - (1 + 2\\sqrt{d}) \\sqrt{\\frac{d \\ln(2d/\\delta)}{2 N_t(R)}} - d_1(R)^2 - \\frac{1}{N_t(R)} \\sum_{s=1}^{t} \\ell_2(X_s, x) \\mathbb{I}\\{X_s \\in R\\} .$$\n\nNote that we would have preferred to replace the terms involving ln(2d/\u03b4) with a term depending on the empirical variance, in the spirit of Carpentier et al. [2011] or Antos et al. [2010]. 
However, contrary to the estimation of the mean, extending the standard results valid for i.i.d. data to the case of a martingale difference sequence is non trivial for the estimation of the variance, especially due to the additive bias resulting from the fact that the variables may not share the same mean, but also to the absence of such results for U-statistics (to the author\u2019s knowledge). For that reason such an extension is left for future work.\nThe following results (we provide the proof in [Maillard, 2012, Appendix A.3]) show that U_t(R, x, \u03b4) is a high probability upper bound on \u03c1(x) while L_t(R, x, \u03b4) is a high probability lower bound on \u03c1(x).\nProposition 1 Under the assumptions that Y is a convex subset of [0, 1]^d, \u03bd is sub-Gaussian, \u03c1 is Lipschitz w.r.t. \u2113_2 and R \u2282 X is compact and convex, then\n\n$$\\mathbb{P}\\big( \\forall x \\in X ; U_t(R, x, \\delta) \\le \\rho(x) \\big) \\le t\\delta .$$\n\nSimilarly, under the same assumptions, then\n\n$$\\mathbb{P}\\big( \\forall x \\in X ; L_t(R, x, \\delta) \\le \\rho(x) - b(x, R, N_t(R), \\delta) \\big) \\le t\\delta ,$$\n\nwhere we introduced for convenience the quantity\n\n$$b(x, R, n, \\delta) \\overset{\\mathrm{def}}{=} 2 \\max_{x' \\in R} \\ell_2(x, x') + d_1(R)^2 + 2 (1 + 2\\sqrt{d}) \\sqrt{\\frac{d \\ln(2d/\\delta)}{2n}} + \\frac{d \\ln(2d/\\delta)}{2n} .$$\n\nOn the other hand, we have that (see the proof in [Maillard, 2012, Appendix A.3])\nProposition 2 Under the assumptions that Y is a convex subset of [0, 1]^d, \u03bd is sub-Gaussian, \u00b5 is Lipschitz w.r.t. \u2113_1, \u03c1 is Lipschitz w.r.t. 
\u2113_2 and R \u2282 X is compact and convex, then\n\n$$\\mathbb{P}\\big( \\forall x \\in X ; U_t(R, x, \\delta) \\ge \\rho(x) + b(x, R, N_t(R), \\delta) \\big) \\le t\\delta .$$\n\nSimilarly, under the same assumptions, then\n\n$$\\mathbb{P}\\big( \\forall x \\in X ; L_t(R, x, \\delta) \\ge \\rho(x) \\big) \\le t\\delta .$$\n\n4.2 Hierarchical Optimistic Region SElection driven by Curiosity (HORSE.C).\n\nThe pseudo-code of the HORSE.C algorithm is presented in Algorithm 1 below. This algorithm relies on the estimation of the quantities max_{x\u2208R} \u03c1(x) and min_{x\u2208R} \u03c1(x) in order to define which point X_{t+1} to sample at time t + 1. It is chosen by expanding a leaf of a hierarchical tree T_t \u2282 T, in an optimistic way, starting with a tree T_0 with leaves corresponding to the partition P.\nThe intuition is the following: let us consider a node (h, i) of the tree T_t expanded by the algorithm at time t. The maximum value of \u03c1 in R(h, i) is thus achieved in one of its child nodes (h', i') \u2208 C_t(h, i). Thus if we have computed an upper bound on the maximal value of \u03c1 in each child, then we have an upper bound on the maximum value of \u03c1 in R(h, i). Proceeding in a similar way for the lower bound, this motivates the following two recursive definitions:\n\n$$\\hat\\rho_t^+(h, i; \\delta) \\overset{\\mathrm{def}}{=} \\min\\Big\\{ \\max_{x \\in R(h,i)} U_t(R(h,i), x, \\delta) , \\max\\big\\{ \\hat\\rho_t^+(h', i'; \\delta) ; (h', i') \\in C_t(h, i) \\big\\} \\Big\\} ,$$\n\n$$\\hat\\rho_t^-(h, i; \\delta) \\overset{\\mathrm{def}}{=} \\max\\Big\\{ \\min_{x \\in R(h,i)} L_t(R(h,i), x, \\delta) , \\min\\big\\{ \\hat\\rho_t^-(h', i'; \\delta) ; (h', i') \\in C_t(h, i) \\big\\} \\Big\\} .$$\n\nThese values are used in order to build an optimistic estimate of the quantity L_{R(h,i)}(N_t) in region (h, i) (step 3), and then to select in which cell of the partition we should sample (step 4). 
Then the algorithm chooses where to sample in the selected region so as to improve the estimations of $\\hat\\rho_t^+$ and $\\hat\\rho_t^-$. This is done by alternating (step 5) between expanding a leaf following a path that is optimistic according to $\\hat\\rho_t^+$ (steps 6, 7, 8), or according to $\\hat\\rho_t^-$ (step 10).\nThus, at a high level, the algorithm performs on each cell (h, i) \u2208 P of the given partition two hierarchical searches, one for the maximum value of \u03c1 in region R(h, i) and one for its minimal value. This can be seen as an adaptation of the algorithm HOO from Bubeck et al. [2011] with the main difference that we target the variance and not just the mean (this is more difficult). On the other hand, there is a strong link between step 4, where we decide to allocate samples between regions {R(h, i)}_{(h,i)\u2208P}, and the CH-AS algorithm from Carpentier et al. [2011].\n\n5 Performance analysis of the HORSE.C algorithm\n\nIn this section, we are now ready to provide the main theorem of this paper, i.e. a regret bound on the performance of the HORSE.C algorithm, which is the main contribution of this work. To this end, we make use of the notion of near-optimality dimension, introduced in Bubeck et al. [2011], and that measures a notion of intrinsic dimension of the maximization problem.\nDefinition (Near optimality dimension) For c > 0, the c-optimality dimension of \u03c1 restricted to the region R with respect to the pseudo-metric \u2113_2 is defined as\n\n$$\\max\\Big\\{ \\limsup_{\\epsilon \\to 0} \\frac{\\ln\\big( N(R_{c\\epsilon}, \\ell_2, \\epsilon) \\big)}{\\ln(\\epsilon^{-1})} , 0 \\Big\\} \\quad \\text{where} \\quad R_{c\\epsilon} \\overset{\\mathrm{def}}{=} \\Big\\{ x \\in R ; \\rho(x) \\ge \\max_{x \\in R} \\rho(x) - \\epsilon \\Big\\} ,$$\n\nand where N(R_{c\u03b5}, \u2113_2, \u03b5) is the \u03b5-packing number of the region R_{c\u03b5}.\nLet d^+(h_0, i_0) be the c-optimality dimension of \u03c1 restricted to the region R(h_0, i_0) (see e.g. Bubeck et al. [2011]), with the constant c def= 4(2c_2 + c_1^2)/c'_2. 
Similarly, let d^\u2212(h_0, i_0) be the c-optimality dimension of \u2212\u03c1 restricted to the region R(h_0, i_0). Let us finally define the biggest near-optimality dimension of \u03c1 over each cell of the partition P to be\n\n$$d_\\rho \\overset{\\mathrm{def}}{=} \\max\\Big\\{ \\max\\big\\{ d^+(h_0, i_0), d^-(h_0, i_0) \\big\\} ; (h_0, i_0) \\in P \\Big\\} .$$\n\nTheorem 1 (Regret bound for HORSE.C) Under the assumptions of Section 3 and if moreover \u03b3_1^2 \u2264 \u03b3_2, then for all \u03b4 \u2208 [0, 1], the regret of the Hierarchical Optimistic Region SElection driven by Curiosity procedure parameterized with \u03b4 is bounded with probability higher than 1 \u2212 2\u03b4 as follows.\n\n$$R_T \\le \\sum_{t=|P|}^{T} \\max_{(h_0,i_0) \\in P} \\Big( \\frac{1}{n_t^\\star(h_0,i_0)} + 2\\lambda c \\gamma^{h_0} \\Big) B\\big( h_0, n_t^\\star(h_0,i_0), \\delta_t \\big) ,$$\n\nwhere \u03b4_t is a shorthand notation for the quantity $\\delta_{n_t^\\star(h_0,i_0), t-1}$, where $n_t^\\star(h_0,i_0)$ is the optimal allocation at round t for the region (h_0, i_0) \u2208 P and where\n\n$$B(h_0, k, \\delta_{k,t}) \\overset{\\mathrm{def}}{=} \\min_{h_0 \\le h} \\Big\\{ 2 c_2 \\gamma_2^h + c_1^2 \\gamma_1^{2h} + 2 (1 + 2\\sqrt{d}) \\sqrt{\\frac{d \\ln(2d/\\delta_{k,t})}{2 N_{h_0}(h,k)}} + \\frac{d \\ln(2d/\\delta_{k,t})}{2 N_{h_0}(h,k)} \\Big\\} ,$$\n\nin which we have used the following quantity\n\n$$N_{h_0}(h, k) \\overset{\\mathrm{def}}{=} \\frac{1}{C (c'_2 \\gamma_2^h)^{-d_\\rho}} \\Big( k - 2^{h-h_0} \\frac{\\big[ 2 + 4\\sqrt{d} + \\sqrt{d \\ln(2d/\\delta_{k,t})/2} \\big]^2}{2 (2 c_2 \\gamma_2^h + c_1^2 \\gamma_1^{2h})^2} \\Big) .$$\n\nAlgorithm 1 The HORSE.C algorithm.\nRequire: An infinite binary tree T, a partition P \u2282 T, \u03b4 \u2208 [0, 1], \u03bb \u2265 0\n1: Let T_0 be such that Leaf(T_0) = P, and $\\delta_{i,t} = \\frac{6\\delta}{\\pi^2 i^2 (2t+1) |P| t^3}$, t := 0.\n2: while true do\n3:   define for each region (h, i) \u2208 T_t the estimated loss $\\hat L_t(h,i) \\overset{\\mathrm{def}}{=} \\frac{\\hat\\rho_t^+(h,i;\\delta)}{N_t(R(h,i))} + \\lambda |R(h,i)| \\big( \\hat\\rho_t^+(h,i;\\delta) - \\hat\\rho_t^-(h,i;\\delta) \\big)$, where \u03b4 = \u03b4_{N_t(R(h,i)),t}, where by convention $\\hat L_t(h,i) = \\infty$ if it is undefined.\n4:   choose the next region of the current partition P \u2282 T to sample: $(H_{t+1}, I_{t+1}) \\overset{\\mathrm{def}}{=} \\operatorname{argmax}\\big\\{ \\hat L_t(h,i) ; (h,i) \\in P \\big\\}$.\n5:   if N_t(R(h, i)) = n is odd then\n6:     sequentially select a path of children of (H_{t+1}, I_{t+1}) in T_t defined by the initial node $(H_{t+1}^0, I_{t+1}^0) \\overset{\\mathrm{def}}{=} (H_{t+1}, I_{t+1})$, and then $(H_{t+1}^{j+1}, I_{t+1}^{j+1}) \\overset{\\mathrm{def}}{=} \\operatorname{argmax}\\big\\{ \\hat\\rho_t^+(h,i;\\delta_{n,t}) ; (h,i) \\in C_t(H_{t+1}^j, I_{t+1}^j) \\big\\}$, until j = j_{t+1} is such that $(H_{t+1}^{j_{t+1}}, I_{t+1}^{j_{t+1}}) \\in \\mathrm{Leaf}(T_t)$.\n7:     expand the node $(H_{t+1}^{j_{t+1}}, I_{t+1}^{j_{t+1}})$ in order to define T_{t+1} and then define the candidate child $(h_{t+1}, i_{t+1}) \\overset{\\mathrm{def}}{=} \\operatorname{argmax}\\big\\{ \\hat\\rho_t^+(h,i;\\delta_{n,t}) ; (h,i) \\in C_{t+1}(H_{t+1}^{j_{t+1}}, I_{t+1}^{j_{t+1}}) \\big\\}$.\n8:     sample at point X_{t+1} and receive the value Y_{t+1} \u223c \u03bd(X_{t+1}), where $X_{t+1} \\overset{\\mathrm{def}}{=} \\operatorname{argmax}\\big\\{ U_t(R(h_{t+1}, i_{t+1}), x, \\delta_{n,t}) ; x \\in R(h_{t+1}, i_{t+1}) \\big\\}$.\n9:   else\n10:     proceed similarly to steps 6, 7, 8 with $\\hat\\rho_t^+$ replaced with $\\hat\\rho_t^-$.\n11:   end if\n12:   t := t + 1.\n13: end while\n\n
Note that the assumption \u03b3_1^2 \u2264 \u03b3_2 is only here so that d_\u03c1 can be defined w.r.t. the metric \u2113_2 only. We can remove it at the price of using instead a metric mixing \u2113_1 and \u2113_2 together and of much more technical considerations. Similarly, we could have expressed the result using the local values d^+(h_0, i_0) instead of the less precise d_\u03c1 (neither those, nor d_\u03c1 need to be known by the algorithm).\nThe full proof of this theorem is reported in the appendix. The main steps of the proof are as follows. First we provide upper and lower confidence bounds for the estimation of the quantities U_t(R, x, \u03b4) and L_t(R, x, \u03b4). 
Then, we lower-bound the depth of the subtree of each region (h_0, i_0) \u2208 P that contains a maximal point argmax_{x\u2208R(h_0,i_0)} \u03c1(x), and proceed similarly for a minimal point. This uses the near-optimality dimension of \u03c1 and \u2212\u03c1 in the region R(h_0, i_0), and enables us to provide an upper bound on $\\hat\\rho_t^+(h, i; \\delta)$ as well as a lower bound on $\\hat\\rho_t^-(h, i; \\delta)$. This then enables us to deduce bounds relating the estimated loss $\\hat L_t(h, i)$ to the true loss L_{R(h,i)}(N_t). Finally, we relate the true loss of the current allocation to the one using the optimal one $n_{t+1}^\\star(h_0, i_0)$ by discussing whether a region has been over or under sampled. This final part is close in spirit to the proof of the regret bound for CH-AS in Carpentier et al. [2011].\nIn order to better understand the gain in Theorem 1, we provide the following corollary that gives more insights about the order of magnitude of the regret.\n\nCorollary 1 Let $\\beta \\overset{\\mathrm{def}}{=} 1 + \\ln\\big( \\max\\{2, \\gamma_2^{-d_\\rho}\\} \\big)$. Under the assumptions of Theorem 1, assuming that the partition P of the space X is well behaved, i.e. that for all (h_0, i_0) \u2208 P, $n_{t+1}^\\star(h_0, i_0)$ grows at least at speed $O\\big( \\ln(t) (1/\\gamma_2)^{2 h_0 \\beta} \\big)$, then for all \u03b4 \u2208 [0, 1], with probability higher than 1 \u2212 2\u03b4 we have\n\n$$R_T = O\\Big( \\sum_{t=|P|}^{T} \\max_{(h_0,i_0) \\in P} \\Big( \\frac{1}{n_t^\\star(h_0,i_0)} + 2\\lambda c \\gamma^{h_0} \\Big) \\Big( \\frac{\\ln(t)}{n_t^\\star(h_0,i_0)} \\Big)^{\\frac{1}{2\\beta}} \\Big) .$$\n\nThis regret term has to be compared with the typical range of the cumulative loss of the optimal allocation strategy, that is given by\n\n$$\\sum_{t=|P|}^{T} L_P(n_t^\\star) = \\sum_{t=|P|}^{T} \\max_{(h_0,i_0) \\in P} \\Big( \\frac{\\rho^+_{(h_0,i_0)}}{n_t^\\star(h_0,i_0)} + 2\\lambda c \\gamma^{h_0} \\big( \\rho^+_{(h_0,i_0)} - \\rho^-_{(h_0,i_0)} \\big) \\Big) ,$$\n\nwhere $\\rho^+_{(h_0,i_0)} \\overset{\\mathrm{def}}{=} \\max_{x \\in R(h_0,i_0)} \\rho(x)$, and similarly $\\rho^-_{(h_0,i_0)} \\overset{\\mathrm{def}}{=} \\min_{x \\in R(h_0,i_0)} \\rho(x)$. Thus, this shows that, after normalization, the relative regret on each cell (h_0, i_0) is roughly of order $\\frac{1}{\\rho^+_{(h_0,i_0)}} \\big( \\frac{\\ln(t)}{n_t^\\star(h_0,i_0)} \\big)^{\\frac{1}{2\\beta}}$, i.e. decays at speed $n_t^\\star(h_0,i_0)^{-\\frac{1}{2\\beta}}$. 
This shows that we are not only able to compete with the performance of the best allocation strategy, but we actually achieve the exact same performance with multiplicative factor 1, up to a second order term. Note also that, when specialized to the case of Example 1, the order of this regret is competitive with the standard results from Carpentier et al. [2011].\nThe loss of the variance term \u03c1^+(h_0, i_0)^{\u22121} (that is actually a constant) here comes from the fact that we are only able to use Hoeffding-like bounds for the estimation of the variance. In order to remove it, one would need empirical Bernstein bounds for variance estimation in the case of martingale difference sequences. This is postponed to future work.\n\n6 Discussion\n\nIn this paper, we have provided an algorithm together with a regret analysis for a problem of online allocation of samples in a fixed partition, where the objective is to minimize a loss that contains a penalty term that is driven by a notion of curiosity. A very specific case (finite state space) already corresponds to a difficult question known as active learning in the multi-armed bandit setting and has been previously addressed in the literature (e.g. Antos et al. [2010], Carpentier et al. [2011]). We have considered an extension of this problem to a continuous domain where a fixed partition of the space as well as a generative model of the unknown dynamic are given, using our curiosity-driven loss function as a measure of performance. 
Our main result is a regret bound for that problem,\nthat shows that our procedure is first order optimal, i.e. achieves the same performance as the best\npossible allocation (thus with multiplicative constant 1).\nWe believe this result contributes to filling the important gap that exists between existing algorithms\nfor the challenging setting of intrinsic reinforcement learning and their theoretical analysis, the\nHORSE.C algorithm being related in spirit to, yet simpler and less ambitious than, the RIAC algorithm\nfrom Baranes and Oudeyer [2009]. Indeed, in order to achieve the objective that RIAC tries to address,\none should first remove the assumption that the partition is given: one trivial solution is to\nrun the HORSE.C algorithm in episodes of doubling length, starting with the trivial partition, and to\nselect at the end of each episode a possibly better partition based on computed confidence intervals; however,\nmaking efficient use of previous samples and avoiding a blow-up of candidate partitions happens to\nbe a challenging question. Then one should relax the generative model assumption (i.e. that we can\nsample wherever we want), a question that shares links with a problem called autonomous exploration.\nThus, even if the regret analysis of the HORSE.C algorithm is already a strong, new result\nthat is interesting independently of such difficult specific goals and of the reinforcement learning\nframework (no MDP structure is required), those questions are naturally left for future work.\n\nAcknowledgements The research leading to these results has received funding from the European\nCommunity\u2019s Seventh Framework Programme (FP7/2007-2013) under grant agreement no 270327\n(CompLACS) and no 216886 (PASCAL2).\n\nReferences\nAndr\u00e1s Antos, Varun Grover, and Csaba Szepesv\u00e1ri. Active learning in heteroscedastic noise.\nTheoretical Computer Science, 411(29-30):2712\u20132728, 2010.\n\nA. Baranes and P.-Y. Oudeyer. 
R-IAC: Robust Intrinsically Motivated Exploration and Active Learning.\nIEEE Transactions on Autonomous Mental Development, 1(3):155\u2013169, October 2009.\n\nS\u00e9bastien Bubeck, R\u00e9mi Munos, Gilles Stoltz, and Csaba Szepesv\u00e1ri. X-armed bandits. Journal of\nMachine Learning Research, 12:1655\u20131695, 2011.\n\nAlexandra Carpentier, Alessandro Lazaric, Mohammad Ghavamzadeh, R\u00e9mi Munos, and Peter\nAuer. Upper-confidence-bound algorithms for active learning in multi-armed bandits. In Jyrki\nKivinen, Csaba Szepesv\u00e1ri, Esko Ukkonen, and Thomas Zeugmann, editors, Algorithmic Learning\nTheory, volume 6925 of Lecture Notes in Computer Science, pages 189\u2013203. Springer Berlin\n/ Heidelberg, 2011.\n\nVincent Graziano, Tobias Glasmachers, Tom Schaul, Leo Pape, Giuseppe Cuccu, J. Leitner, and\nJ. Schmidhuber. Artificial Curiosity for Autonomous Space Exploration. Acta Futura (in press),\n(1), 2011.\n\nTobias Jung, Daniel Polani, and Peter Stone. Empowerment for continuous agent-environment systems.\nAdaptive Behavior - Animals, Animats, Software Agents, Robots, Adaptive Systems, 19(1):\n16\u201339, 2011.\n\nG.D. Konidaris. Autonomous robot skill acquisition. PhD thesis, University of Massachusetts\nAmherst, 2011.\n\nOdalric-Ambrym Maillard. Hierarchical optimistic region selection driven by curiosity. HAL, 2012.\nURL http://hal.archives-ouvertes.fr/hal-00740418.\n\nGeorg Martius, J. Michael Herrmann, and Ralf Der. Guided self-organisation for autonomous\nrobot development. In Proceedings of the 9th European conference on Advances in artificial\nlife, ECAL\u201907, pages 766\u2013775, Berlin, Heidelberg, 2007. Springer-Verlag.\n\nJonathan Mugan. Autonomous Qualitative Learning of Distinctions and Actions in a Developing\nAgent. PhD thesis, University of Texas at Austin, 2010.\n\nPierre-Yves Oudeyer and Frederic Kaplan. What is Intrinsic Motivation? A Typology of Computational\nApproaches. 
Frontiers in neurorobotics, 1(November):6, January 2007.\n\nJ. Schmidhuber. Formal theory of creativity, fun, and intrinsic motivation (1990-2010). Autonomous\nMental Development, IEEE Transactions on, 2(3):230\u2013247, 2010.\n", "award": [], "sourceid": 697, "authors": [{"given_name": "Odalric-ambrym", "family_name": "Maillard", "institution": null}]}