{"title": "Curriculum-guided Hindsight Experience Replay", "book": "Advances in Neural Information Processing Systems", "page_first": 12623, "page_last": 12634, "abstract": "In off-policy deep reinforcement learning, it is usually hard to collect sufficient successful experiences with sparse rewards to learn from. Hindsight experience replay (HER) enables an agent to learn from failures by treating the achieved state of a failed experience as a pseudo goal. However, not all the failed experiences are equally useful to different learning stages, so it is not efficient to replay all of them or uniform samples of them. In this paper, we propose to 1) adaptively select the failed experiences for replay according to the proximity to the true goals and the curiosity of exploration over diverse pseudo goals, and 2) gradually change the proportion of the goal-proximity and the diversity-based curiosity in the selection criteria: we adopt a human-like learning strategy that enforces more curiosity in earlier stages and changes to larger goal-proximity later. This ``Goal-and-Curiosity-driven Curriculum Learning'' leads to ``Curriculum-guided HER (CHER)'', which adaptively and dynamically controls the exploration-exploitation trade-off during the learning process via hindsight experience selection. We show that CHER improves the state of the art in challenging robotics environments.", "full_text": "Curriculum-guided Hindsight Experience Replay\n\nMeng Fang1\u2217, Tianyi Zhou2\u2217, Yali Du3, Lei Han1, Zhengyou Zhang1\n\n2Paul G. Allen School of Computer Science & Engineering, University of Washington\n\n1Tencent Robotics X\n\n3University College London\n\nAbstract\n\nIn off-policy deep reinforcement learning, it is usually hard to collect suf\ufb01cient\nsuccessful experiences with sparse rewards to learn from. Hindsight experience\nreplay (HER) enables an agent to learn from failures by treating the achieved state\nof a failed experience as a pseudo goal. 
However, not all the failed experiences\nare equally useful to different learning stages, so it is not ef\ufb01cient to replay all of\nthem or uniform samples of them. In this paper, we propose to 1) adaptively select\nthe failed experiences for replay according to the proximity to true goals and the\ncuriosity of exploration over diverse pseudo goals, and 2) gradually change the\nproportion of the goal-proximity and the diversity-based curiosity in the selection\ncriteria: we adopt a human-like learning strategy that enforces more curiosity in\nearlier stages and changes to larger goal-proximity later. This \u201cGoal-and-Curiosity-\ndriven Curriculum Learning\u201d leads to \u201cCurriculum-guided HER (CHER)\u201d, which\nadaptively and dynamically controls the exploration-exploitation trade-off during\nthe learning process via hindsight experience selection. We show that CHER\nimproves the state of the art in challenging robotics environments.\n\n1\n\nIntroduction\n\nDeep reinforcement learning (RL) has been an effective framework addressing a rich repertoire of\ncomplex control problems. In simulated domains, deep RL can train agents to perform a diverse array\nof challenging tasks [Mnih et al., 2015, Lillicrap et al., 2015, Duan et al., 2016]. In order to train\nreliable agents, it is critical to not only design a reward faithfully re\ufb02ecting how successful the task is,\nbut also (re)shape the reward [Ng et al., 1999] to provide dense feedback that can ef\ufb01ciently guide\nthe policy optimization towards a better solution in the given environment. Unfortunately, many of\nthe capabilities demonstrated by the current reward engineering are often limited to speci\ufb01c tasks in\nspeci\ufb01ed environments. Moreover, the quality of reward shaping heavily relies on both the choice of\nRL algorithm and domain-speci\ufb01c knowledge. 
For situations where we do not know what admissible behavior may look like, for example, using LEGO bricks to build a desired architecture, it is difficult to apply reward engineering. Therefore, it is essential (but also challenging) to develop smarter and more general algorithms which can directly learn from unshaped and usually sparse reward signals, where the sparsity is caused by the insufficiency of successful experiences (which are expensive to collect). Hindsight Experience Replay (HER) [Andrychowicz et al., 2017] proposes to additionally leverage the rich repository of failed experiences, by replacing the desired (true) goals of training trajectories with the achieved goals of the failed experiences. With this modification, any failed experience can receive a nonnegative reward.

The achieved goals of failed experiences can be significantly different from each other: their proximity to the desired goal varies, so learning how to reach a pseudo goal distant from the true one cannot directly help the targeted task; they also carry different information about the manipulation environment. Hence, they have distinct levels of difficulty to be learned, and their contributions to a task vary across different learning stages. Nevertheless, they are treated equally in HER: they are uniformly sampled to replace the desired goals, and the resulting training trajectories with the replaced goals have the same weight in training. However, not all the failed experiences are equally useful for improving the agent: some provide limited help in reaching the true goal, while others are too similar to each other and thus redundant to learn from.

∗Correspondence to: Meng Fang and Tianyi Zhou.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

In human education, a delicately designed curriculum can significantly improve the learning quality and efficiency. 
Inspired by this, curriculum learning [Bengio et al., 2009] and its applications [Khan\net al., 2011, Basu and Christensen, 2013, Spitkovsky et al., 2009] propose to train a model on a\ndesigned sequence of training samples/tasks, i.e., a curriculum, which leads to improvement on both\nlearning performance and ef\ufb01ciency. In each learning stage, the training samples are selected either\nby a human expert or an adaptive algorithm, and the selection can be either pre-de\ufb01ned before training\nbegins or determined by the learning progress itself on the \ufb02y [Kumar et al., 2010]. Most curriculum\nlearning methods, e.g., self-paced learning and its variants [Tang et al., 2012a, Supancic III and\nRamanan, 2013, Tang et al., 2012b], adopt the strategy of selecting a few easier training samples at\nbeginning and an increased amount of more dif\ufb01cult ones later on. Recent works [Zhou and Bilmes,\n2018, Zhou et al., 2018] show that diversity also needs to be considered in curriculum generation.\nCurriculum learning has been explained as a form of continuation scheme [Allgower and Georg,\n2003] that addresses a hard task by solving a sequence of tasks moving from easy to hard, and uses\nthe solution to each task as the warm start for the next slightly harder task. 
Such continuation schemes\ncan reduce the impact of local minima within neural networks [Bengio et al., 2013, Bengio, 2014].\nIn this paper, we propose \u201cGoal-and-Curiosity-driven Curriculum Learning\u201d that dynamically and\nadaptively controls the exploration-exploitation trade-off in selecting hindsight experiences for replay\nby gradually changing the preference on 1) goal-proximity: how close the achieved goals are to the\ndesired goals; and 2) diversity-based curiosity: how diverse the achieved goals are in the environment.\nSpeci\ufb01cally, given a candidate subset of achieved goals for HER training, we de\ufb01ne its proximity\nas the sum of their similarities to the desired goals, and measure its diversity by a submodular\nfunction [Fujishige, 2005], e.g., the facility location function [Cornu\u00e9jols et al., 1977, Lin et al., 2009].\nIn each episode, a subset of achieved goals are selected according to both its proximity and curiosity:\nwe prefer more curiosity for earlier episodes\u2019 exploration and then gradually increase the proportion\nof proximity in the selection criteria during later episodes. We apply this training framework, called\n\u201cCurriculum-guided HER (CHER)\u201d, to train agents in the multi-goal setting of UVFA [Schaul et al.,\n2015] and HER [Andrychowicz et al., 2017] (which assumes that the goal being pursued does not\nin\ufb02uence the environment dynamics). In several challenging robotics environments (where deep RL\nmethods suffer from sparse reward problem), CHER outperforms the state-of-the-art approaches on\nboth the learning ef\ufb01ciency and \ufb01nal performance.1\n\n2 Related Work\n\nIn recent works, curriculum learning with progressive training strategy has been introduced to different\nscenarios of deep RL. 
Those methods differ in that they apply an increasing difficulty/complexity schedule to different components of the training loop, e.g., the initial positions [Florensa et al., 2017], the required ε-accuracy [Fournier et al., 2018], the policies of intermediate agents used for mixing [Czarnecki et al., 2018], the environments [Wu and Tian, 2017], the aid from built-in AI [Tian et al., 2017], the reward [Justesen and Risi, 2018], and new tasks with masked sub-goals [Eppe et al., 2018]. These works show that curriculum learning can effectively improve deep RL for challenging tasks including robotics manipulation, game bots, and simulated environments such as OpenAI Gym. HER can also be explained as a form of implicit curriculum learning, since the achieved goals of failed experiences are easier to achieve than the desired goals. The last work mentioned above improves HER but requires extra effort in each epoch to evaluate the difficulty of sub-goals and to train the new tasks with sub-goals. It is not practical for tasks with complex goals, such as the hand manipulation tasks studied in this paper.

Another line of recent work [Burda et al., 2019, Pathak et al., 2017, Savinov et al., 2019, Frank et al., 2013] investigates the curiosity-driven exploration of deep RL agents within interactive environments. In particular, they either replace or augment the extrinsic (but usually sparse) reward by a dense intrinsic reward measuring the curiosity or uncertainty of the agent at a given state. Thereby, the agent is encouraged to explore unseen scenarios and unexplored regions of the environment. It has been shown that such a curiosity-driven strategy can improve the learning efficiency, mitigate the sparse reward problem, and successfully learn challenging tasks even without extrinsic reward. 
1Our code is available at https://github.com/mengf1/CHER.

Different from curriculum learning approaches, which are usually goal-oriented with a focus on exploitation, curiosity-driven approaches can be unsupervised/self-supervised with more focus on exploration. Compared to our strategy, they reshape the reward but do not dynamically and adaptively change the proportion of curiosity in the reward during training.

A number of RL methods leveraging hindsight experiences have been proposed since HER. Hindsight Policy Gradient (HPG) [Rauber et al., 2019] extends the idea of training goal-conditional agents on hindsight experiences to the on-policy RL setting. Dynamic Hindsight Experience Replay (DHER) [Fang et al., 2019] assembles failed experiences to train policies handling dynamic goals rather than the static ones studied in HER. On top of HER, Competitive Experience Replay (CER) [Liu et al., 2019] introduces a competition between two agents for better exploration. To handle raw-pixel inputs, Nair et al. [2018] minimize a pixel-MSE given visual observations, with the extra cost of training a VAE. Zhao and Tresp [2018] focus on hindsight trajectories containing higher energy than others and claim that they are more valuable for training. Unlike the above works, our curriculum learning scheme can be generalized to a variety of settings and environments for more efficient goal-conditional RL.

3 Methodology

In this section, we first briefly introduce HER and multi-goal RL. Then, we study the selection criteria applied to hindsight experiences, and introduce an efficient selection algorithm. In the end, we present CHER with scheduled goal-proximity and diversity-based curiosity in the selection criteria.

3.1 HER and Multi-Goal RL

We study an agent operating in a multi-goal environment with sparse reward [Schaul et al., 2015, Andrychowicz et al., 2017]. 
At each time step t, the agent gets an observation (or state) st from the environment and takes an action at in response by applying its policy π(st) (a deterministic policy maps st to at = π(st), while a stochastic policy samples at ∼ p(at|st) = π(st)); it then receives a reward signal rt = r(st, at) and gets the next state st+1 sampled from the transition probability p(·|st, at). Given a behavior policy π(·), the agent can generate a trajectory τ = {(s0, a0), ..., (sT−1, aT−1)} of any length T, with each step t associated with a transition tuple (st, at, rt, st+1). In many RL tasks, the reward only depends on whether the trajectory finally reaches a desired goal g or not. Hence, only the successful trajectories get nonnegative rewards. Since π(·) is not fully trained and has a low success rate, the collected successful trajectories are usually insufficient for training, which results in the sparse reward problem.

HER addresses the sparse reward problem by treating failures as successes and learning from the failed experiences. For any off-policy RL algorithm (e.g., DQN [Mnih et al., 2015], DDPG [Lillicrap et al., 2015], NAF [Gu et al., 2016], SDQN [Metz et al., 2017]), HER modifies the desired goals g in the transition tuples for training to some achieved goals g′ sampled from the states in failed experiences. The desired goal g is the actual goal that the agent aims to achieve, i.e., the real target. An achieved goal g′ is a state that the agent has already achieved, e.g., the Cartesian position of each fingertip on a robotic hand. 
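The relabeling step can be sketched as follows; the dict-based transition format, the field names, and the `tolerance` threshold are illustrative assumptions (not the paper's implementation), with goals and states assumed to be NumPy arrays and the sparse 0/−1 reward used in the robotics tasks considered later:

```python
import numpy as np

def relabel_transition(transition, achieved_goal, tolerance=0.05):
    """HER-style relabeling sketch: replace the desired goal of a stored
    transition with an achieved (pseudo) goal, and recompute the sparse
    reward against the new goal.

    `transition` is assumed to be a dict with keys 's', 'a', 's_next',
    'g' (desired goal), and 'r'; `tolerance` is a hypothetical
    task-specific success threshold.
    """
    new_t = dict(transition)  # leave the original transition untouched
    new_t["g"] = achieved_goal
    # Sparse binary reward: 0 if the next state reaches the goal, -1 otherwise.
    dist = np.linalg.norm(new_t["s_next"] - achieved_goal)
    new_t["r"] = 0.0 if dist <= tolerance else -1.0
    return new_t
```

Any failed transition relabeled this way gets the success reward with respect to its pseudo goal, which is what makes it usable as training signal.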
Once g is replaced by a g′, the corresponding failed experience is assigned a nonnegative reward and thus can contribute to learning policies.

3.2 Goal-and-Curiosity-driven Selection of Pseudo Goals

In HER, the achieved goals used to modify the desired goals are uniformly sampled from (a batch of) previous experiences B. In contrast to uniform sampling, we propose to select a subset of achieved goals A ⊆ B according to 1) their proximity to the desired goals and 2) their diversity, which reflects the curiosity of an agent exploring different achieved goals in the environment. Although all the failed experiences can be turned into successful ones with pseudo goals, they can be very different in the above two quantities, which play important roles in guiding the learning process. In particular, a large proximity enforces the training to proceed towards the desired goals, while a large diversity leads to more exploration of different states and regions in the environment. A desirable trade-off between them is essential to the learning efficiency and the generalization performance of the resulting agents.

In our selection criteria, we select a subset of failed experiences to replay according to its proximity and diversity, which are both defined based on a similarity function sim(·,·) measuring the likeness of two achieved goals in the interactive environment. Given a distance metric dis(·,·) (e.g., Euclidean distance), sim(·,·) can be defined, for example, by the radial basis function (RBF) kernel with bandwidth σ, i.e.,

sim(x, y) ≜ exp(−dis(x, y)² / σ²),    (1)

while another option is a constant c minus the distance, i.e.,

sim(x, y) ≜ c − dis(x, y),    (2)

where c is large enough to guarantee that sim(x, y) ≥ 0 for all possible (x, y). 
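Both similarity options in Eq. (1)-(2) can be written down directly; the minimal sketch below assumes goals are NumPy arrays and uses the Euclidean distance for dis(·,·):

```python
import numpy as np

def rbf_similarity(x, y, sigma=1.0):
    """Eq. (1): RBF-kernel similarity with bandwidth sigma."""
    d = np.linalg.norm(x - y)
    return np.exp(-d**2 / sigma**2)

def shifted_similarity(x, y, c=10.0):
    """Eq. (2): a constant c minus the distance; c must be chosen large
    enough that the result stays non-negative for all goal pairs."""
    return c - np.linalg.norm(x - y)
```

Both are maximal when x = y and decrease with distance, which is all the selection criteria below require.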
The choice of dis(·,·) is usually determined by the task and environment. For instance, in hand manipulation tasks, we can define dis(gi, gj) as the mean distance between the fingertips at time step i and the fingertips at time step j.

We are now able to select a subset A of achieved goals with size up to k from buffered experiences B by solving the following combinatorial optimization that maximizes both the proximity and diversity:

max_{A⊆B, |A|≤k} F(A) ≜ λ Fprox(A) + Fdiv(A).    (3)

The first term Fprox(A), associated with a trade-off weight λ, is a modular function

Fprox(A) ≜ Σ_{i∈A} sim(gi, g),    (4)

which reflects the proximity of the selected achieved goals g′ in A to the desired goal g. The second term Fdiv(A) measures the diversity of the goals from A. We use the facility location function [Cornuéjols et al., 1977, Lin et al., 2009] to compute Fdiv(A), i.e.,

Fdiv(A) ≜ Σ_{j∈B} max_{i∈A} sim(gi, gj).    (5)

Intuitively, we expect the achieved goals selected into A to represent all the goals from B. For each gj from B, Fdiv(A) finds the achieved goal gi most similar to gj from A, and uses sim(gi, gj) to measure how well A can represent gj. Hence, by summing up sim(gi, gj) over all the achieved goals j ∈ B, Fdiv(A) quantifies how representative A is w.r.t. B. It has been widely used as a diversity metric, because a large Fdiv(A) indicates that every goal in B can find a sufficiently similar goal in A; in other words, A spans the space of B. 
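A direct (unoptimized) evaluation of the combined objective in Eq. (3)-(5) might look as follows; `sim` is any pairwise similarity function such as those in Eq. (1)-(2), and goals are assumed to be NumPy arrays:

```python
import numpy as np

def objective(A, B, g, lam, sim):
    """F(A) = lam * F_prox(A) + F_div(A), sketching Eq. (3)-(5).

    A, B: lists of achieved goals (A should be a subset of B);
    g: the desired goal; lam: trade-off weight; sim: similarity function.
    """
    # Eq. (4): proximity of the selected goals to the desired goal.
    f_prox = sum(sim(gi, g) for gi in A)
    # Eq. (5): facility-location diversity -- each goal in B is "covered"
    # by its most similar representative in A (0 when A is empty).
    f_div = sum(max(sim(gi, gj) for gi in A) for gj in B) if A else 0.0
    return lam * f_prox + f_div
```

Since sim(·,·) is non-negative, adding a goal to A can never decrease either term, matching the monotone non-decreasing submodularity discussed next.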
A diverse subset A of achieved goals encourages the agent to explore new states and unseen areas of the environment, and thus to learn to reach different goals.

The facility location function is a typical example from a large, expressive family of submodular functions that satisfy the diminishing return property: given a finite ground set V, any A ⊆ B ⊆ V and an element v ∉ B, v ∈ V fulfill F({v} ∪ A) − F(A) ≥ F({v} ∪ B) − F(B) (with abuse of the former notations A and B). Due to this property, they can naturally measure the diversity of a set of items [Fujishige, 2005], and have been applied in a variety of diversity-driven tasks with appealing results [Lin and Bilmes, 2011, Batra et al., 2012, Prasad et al., 2014, Gillenwater et al., 2012, Fiterau and Dubrawski, 2012]. Although we choose the facility location function as Fdiv(A) in this paper, other submodular functions are worth studying in our curriculum learning framework.

Since F(A) in Eq. (3) is a weighted sum of a non-negative (the similarity is non-negative) modular function Fprox(A) and a submodular function Fdiv(A), it is monotone non-decreasing submodular. Although exactly solving Eq. (3) is NP-hard, a near-optimal solution can be achieved by the greedy algorithm with a worst-case approximation factor α = 1 − e^(−1) [Nemhauser et al., 1978], as a result of the submodularity of F(A). The greedy algorithm starts with A ← ∅, and selects the next goal i ∈ B\A bringing the largest improvement F(i|A) ≜ F({i} ∪ A) − F(A) to the objective F(A), i.e., A ← A ∪ {i∗} where i∗ ∈ argmax_{i∈B\A} F(i|A); this repeats until |A| = k. For the specific F(A) defined in Eq. (3)-Eq. (5),

F(i|A) = λ sim(gi, g) + Σ_{j∈B} max{0, sim(gi, gj) − max_{l∈A} sim(gl, gj)}.    (6)

Algorithm 1 STOCHASTIC-GREEDY(k, m, λ)
Require: experience buffer B
1: Input: k, m, λ
2: Output: a minibatch A of size k
3: Sample a batch B of size O(mk) from the experience buffer, and build a sparse K-nearest neighbor graph of B.
4: Initialize A ← ∅;
5: for i = 0 to k − 1 do
6:   Sample a subset b of size m from B\A;
7:   for each transition tuple (st, at, rt, st+1) in b do
8:     Calculate the utility score F(i|A) by Eq. (6) (using gi = g′t and g) based on the current state st;
9:   end for
10:  Add to A the transition tuple that has the maximum utility score F(i|A);
11: end for

The evaluation of Fdiv(A) and F(i|A) requires the pairwise similarity of any two goals (gi, gj), and can be expensive when the size of B is large. In practice, we can use a kd-tree or ball-tree to build a sparse K-nearest neighbor graph for the goals in B before running the greedy algorithm. It has been shown in previous works [Wei et al., 2014] that a sufficiently good solution can be achieved even for K as small as O(log |B|).

3.3 Lazier than Lazy Greedy for Efficient Selection

The greedy algorithm is simple to implement and usually outperforms other optimization methods, e.g., those based on integer linear programming, but suffers from expensive computation requiring O(|B|k) function evaluations. There exist several accelerations, e.g., lazy greedy [Minoux, 1978], lazier than lazy greedy [Mirzasoleiman et al., 2015] and GreeDi [Mirzasoleiman et al., 2016], which have the same or a close approximation factor but enjoy significant speedups.

We choose lazier than lazy greedy for the speedup of selecting failed experiences in CHER, because it is compatible with the stochastic learning nature of most off-policy RL algorithms. 
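A compact sketch of this stochastic ("lazier than lazy") greedy selection, using the marginal gain of Eq. (6), is given below. The goal representation and the `sim` function are assumed to be supplied by the task; this is a plain-Python illustration, not the paper's optimized K-nearest-neighbor implementation:

```python
import random

def marginal_gain(i, A, B, g, lam, sim):
    """Eq. (6): gain of adding goal index i to the selected index set A."""
    gain = lam * sim(B[i], g)
    for gj in B:
        # How well gj is already covered by the current selection A.
        covered = max((sim(B[l], gj) for l in A), default=0.0)
        gain += max(0.0, sim(B[i], gj) - covered)
    return gain

def stochastic_greedy(B, g, k, m, lam, sim, rng=random):
    """Each of the k steps evaluates F(i|A) only on a random size-m
    subset of the unselected goals, instead of all of B\\A."""
    A, remaining = [], list(range(len(B)))
    for _ in range(min(k, len(B))):
        b = rng.sample(remaining, min(m, len(remaining)))
        best = max(b, key=lambda i: marginal_gain(i, A, B, g, lam, sim))
        A.append(best)
        remaining.remove(best)
    return [B[i] for i in A]
```

With m = |B| this degenerates to the exact greedy algorithm; a large λ makes the selection behave like top-k ranking by proximity, while λ = 0 yields a purely diversity-driven (facility location) selection.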
Algorithm 1 shows the detailed procedures. In each step, from a random subset b of B\A (instead of all of B\A) it selects the goal that yields the largest improvement F(i|A). According to [Mirzasoleiman et al., 2015], when m = O((|B|/k) log(1/ε)), lazier than lazy greedy reduces the approximation factor α of the vanilla greedy algorithm by ε, but only requires O(|B| log(1/ε)) evaluations of F(·).

3.4 Curriculum-guided Hindsight Experience Replay

The trade-off between proximity and diversity in the selection of achieved goals reflects the trade-off between exploitation and exploration. Similar to the learning process of humans, which requires different proportions of exploitation and exploration in different learning stages, the preference for proximity and diversity (or curiosity) in different episodes of deep RL also needs to vary. In the earlier episodes, curiosity over diverse pseudo goals can help the agent to explore new states and unseen areas, which improves the generalization of RL. However, diverse goals can distract the learning of later episodes, in which the proximity to the desired goals is more important for the agent, since it has already accumulated sufficient knowledge about an environment and needs to focus on learning how to achieve the true goals of a task. Another critical reason to avoid large proximity in earlier episodes but promote it later is: the agent policy in earlier episodes cannot produce a sufficient number of pseudo goals close to the desired goals (otherwise the learning is almost accomplished and we never suffer from sparse reward), but after adequate training it is able to do so later.

In the following, we propose “Goal-and-Curiosity-driven Curriculum (GCC) Learning” as an effective learning scheme for CHER. 
It starts from learning to reach different achieved goals with large diversity, and gradually turns the focus to progressively approaching the achieved goals with large proximity to the desired goals. This is achieved by smoothly increasing the weight λ of the proximity term in F(A) of Eq. (3). For the tasks in this paper, we use an exponentially increasing λ over the course of training, i.e.,

λ = (1 + η)^γ λ0,    (7)

where η ∈ [0, 1] is a learning pace controlling the progress of the curriculum, γ is the episode of the off-policy RL, and λ0 is the initial weight for proximity, which should be relatively small.

Algorithm 2 Curriculum-guided HER (CHER)
Require: off-policy RL algorithm A, experience buffer B
1: Input: mini-batch size k, m, λ0, reward function r(·)
2: Initialize A, B ← ∅, λ ← λ0;
3: for episode = 0, 1, ..., M − 1 do
4:   Sample an initial goal g and an initial state s0;
5:   for t = 0, ..., T − 1 do
6:     Sample an action at from the behavioral policy of A, i.e., at ∼ π(st|g);
7:     Execute action at and observe a new state st+1;
8:   end for
9:   for t = 0, ..., T − 1 do
10:    rt := r(st, at, g);
11:    Store the tuple (st|g, at, rt, st+1|g) in B;
12:  end for
13:  for i = 0, 1, ..., N − 1 do
14:    Select a subset A of the achieved goals of B by Algorithm 1, i.e., A ← STOCHASTIC-GREEDY(k, m, λ);
15:    Initialize a minibatch Bi ← ∅;
16:    for g′ ∈ A do
17:      r′ := r(st, at, g′), where ∃(st, at) ∈ τ: g′ has been achieved by τ after t;
18:      Store the tuple (st|g′, at, r′, st+1|g′) in Bi;
19:    end for
20:    Optimize A using minibatch Bi;
21:  end for
22:  λ ← (1 + η)λ;
23: end for

The complete procedures of “Curriculum-guided HER (CHER)” can be found in Algorithm 2. Compared to the vanilla HER, the major differences are at line 14, which selects the achieved goals from the experience buffer according to proximity and diversity, and line 22, which increases the weight for proximity as instructed by the curriculum. The algorithm can be generalized to improve any off-policy RL method, and does not require any extra training on new tasks.

Although Algorithm 1 cannot exactly solve the combinatorial optimization in Eq. (3), it is worth noting that the approximate solution gradually approaches the global optimum as the curriculum proceeds and λ increases. Increasing λ makes F(A) close to a modular function. As a result, the greedy solution approaches the top-k ranking, which is the optimal solution to modular maximization. This trend can be theoretically analyzed by the curvature-dependent approximation bound of the greedy algorithm (which easily extends to lazier than lazy greedy). It improves α = 1 − e^(−1) to α = (1 − e^(−κF))/κF [Conforti and Cornuejols, 1984], where the curvature κF ∈ [0, 1] of F(A) is defined as

κF ≜ 1 − min_{j∈B} F(j|B\{j}) / F(j).    (8)

When κF = 0, F(·) is modular, resulting in α = 1 (which achieves the global optimum); and when κF = 1, F(A) is fully curved and α = 1 − e^(−1). In CHER, when λ is sufficiently large in later episodes, we have κF → 0 and thus α → 1. We theoretically derive an upper bound κF ≤ κS/(λβ + 1) (where κS is the curvature of Fdiv(·)), which goes to zero when λ → ∞ (see Appendix A).

4 Experiments

We evaluate CHER and compare to state-of-the-art baselines on several challenging robotic manipulation tasks in simulated MuJoCo environments [Todorov et al., 2012]. 
In particular, we use a simple Fetch environment as a toy example and Shadow Dexterous Hand environments from OpenAI Gym [Brockman et al., 2016]. It is worth noting that the Shadow Dexterous Hand environments are also the most difficult amongst OpenAI's robotics environments.

Figure 1: The Fetch and four Shadow Dexterous Hand environments.

4.1 Environments

In Figure 1, there are the FetchReach environment and four Shadow Dexterous Hand environments: HandReach, Block manipulation (HandManipulateBlockRotateXYZ-v0), Egg manipulation (HandManipulateEggFull-v0) and Pen manipulation (HandManipulatePenRotate-v0). The FetchReach environment is based on the 7-DoF (degrees of freedom) Fetch robotics arm with a two-fingered parallel gripper. Each action at is a 3-dimensional vector specifying the desired gripper movement in Cartesian coordinates, and the gripper stays closed during the process of reaching the target location. Each observation is the state of the robot. In the simulated environments, the Shadow Dexterous Hand is an anthropomorphic robotic hand with 24 DoF, in which 20 joints can be controlled independently whereas the remaining ones are coupled joints. In all four hand environments, each action at is a 20-dimensional vector containing the absolute position control for all non-coupled joints of the hand. Each observation includes the 24 positions and the associated velocities of the 24 joints. To represent an object that is manipulated, the environment provides the object's Cartesian position and rotation represented by a 7-dimensional vector, as well as its linear and angular velocities. 
The rewards are sparse and binary: the agent receives a reward of 0 if the goal\nhas been achieved (within some task-speci\ufb01c tolerance) and \u22121 otherwise.\nIn FetchReach, the goal of reaching task is a 3-dimensional vector describing the target position of an\nobject (or the end-effector for reaching) and the achieved goal is the position of the gripper. We use\nEuclidean distance for dis(gi, gj). In HandReach, the goal of reaching task is a target position and the\ndesired goal is the position of \ufb01ngertips. In Block and Pen manipulations, the goal of manipulation\ntasks is the rotation of a target pose and the achieved goal is the rotation of the block/pen. In Egg\nmanipulation, the goal of manipulation task is the rotation and location of a target pose and the\nachieved goal is the rotation and location of the egg.\n\n4.2 Baselines\n\nOur evaluation of different methods is based on DDPG. We use different methods to select/sample\nhindsight experiences to replay and train policies on the environments issuing sparse rewards. We\ncompare CHER with the following baselines:\n\ndeterministic policy by a stochastic counterpart to explore during training.\n\n\u2022 DDPG [Lillicrap et al., 2015], a model-free RL algorithm for continuous control. It learns a\n\u2022 DDPG+HER [Andrychowicz et al., 2017], which samples hindsight experiences uniformly\n\u2022 DDPG+HEREBP [Zhao and Tresp, 2018], which uses an energy function to evaluate\n\nfor replay.\n\ntrajectories and prioritize hindsight experiences with large energy.\n\nThe comparison between dense and sparse rewards has been presented in Plappert et al. [2018] and it\nhas shown the advantage of using sparse rewards.\n\n4.3 Training Setting\n\nFor all environments except FetchReach, we train policies on a single machine with 20 CPU cores.\nEach core generates experiences by using two parallel rollouts with MPI for synchronization. We train\neach agent for 50 epochs with batch size 64. 
Hyperparameters are nearly the same as in Andrychowicz et al. [2017]. In CHER, we use |B| = 128, |A| = k = 64 and |b| = m = 3 for Algorithm 1. We evaluate the policies after each epoch by performing 10 deterministic test rollouts per MPI worker, and then compute the test success rate by averaging across rollouts and MPI workers. In all cases, we repeat each experiment with 5 different random seeds and report their performance by computing the median test success rate as well as the interquartile range.

Figure 2: Toy example – FetchReach. (a) Performance for the toy example: CHER learns much faster than other RL methods. (b) Goals at an earlier episode of CHER: the red points (selected achieved goals) compose a diverse and representative subset of the gray points (all achieved goals), but some are not close to any green point (desired goals), since CHER prefers diversity over proximity in earlier episodes. (c) Goals at a later episode of CHER: most red points are close to some green points due to the large proximity in later episodes' selection criteria, but some regions with many concentrated gray points do not contain any red point, since CHER prefers proximity over diversity.

Figure 3: Performance for all four hand environments (Block: λ0 = 0; others: λ0 = 1).

4.4 Toy Example

To quickly prove the concept of our idea, we first study it in a simple environment, FetchReach, as a toy example. We train policies using one CPU core.

Figure 2(a) depicts the median test success rate for the FetchReach environment. FetchReach is known as a very simple environment and can be easily learned by our approach. The results show that DDPG+CHER learns faster than all other baselines. 
Vanilla DDPG can also reach a 100% success rate eventually, but much later than DDPG+CHER. DDPG+HEREBP performs similarly to DDPG+HER on this simple task.

Figures 2(b) and 2(c) visualize the desired goals g (green stars), all the achieved goals B (gray circles), and the achieved goals A ⊆ B selected by Algorithm 1 (red triangles) at an earlier episode (left) and a later episode (right) of DDPG+CHER. In the earlier episode, the achieved goals selected into A spread evenly over the manifold of all the achieved goals B, implying that A is a diverse and representative subset of B. There are regions that contain several selected goals far away from any desired goal, since proximity plays a minor role in earlier episodes while diversity dominates the selection criteria. In the later episode, in contrast, most of the achieved goals selected into A gather around some desired goals, and there are regions where many unselected goals gather but none is selected, which indicates that proximity dominates over diversity in the selection.

4.5 Benchmark Results

Figure 3 reports how the median test success rate achieved by all methods improves during learning in the four hand environments. Similar to what is shown in the FetchReach environment, DDPG+CHER significantly outperforms the other baselines.
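The proximity-versus-diversity selection behavior described above can be illustrated with a deliberately simplified, hypothetical selection rule. This toy sketch is not the paper's Algorithm 1; it only mirrors the trade-off the text describes: greedily pick k achieved goals, scoring each candidate by its distance to the goals already chosen (diversity) plus λ times its negative distance to the nearest desired goal (proximity).

```python
import numpy as np

def select_goals(achieved, desired, k, lam):
    # Toy greedy selection of k achieved goals, trading off diversity
    # (distance to already-selected goals) against proximity (closeness
    # to the nearest desired goal), weighted by lam.
    achieved = np.asarray(achieved, dtype=float)
    desired = np.asarray(desired, dtype=float)
    selected = []
    for _ in range(k):
        best, best_score = None, -np.inf
        for i in range(len(achieved)):
            if i in selected:
                continue
            g = achieved[i]
            # diversity: distance to the closest already-selected goal
            div = (min(np.linalg.norm(g - achieved[j]) for j in selected)
                   if selected else 0.0)
            # proximity: negative distance to the nearest desired goal
            prox = -min(np.linalg.norm(g - d) for d in desired)
            score = div + lam * prox
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
    return selected
```

With a large λ the rule favors goals near the desired goals (later episodes); with λ = 0 it spreads the selection out (earlier episodes), matching the behavior seen in Figures 2(b) and 2(c).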
The results also show that DDPG can easily fail in these environments, but DDPG+HER is able to learn partly successful policies in all of them. Surprisingly, DDPG+CHER achieves significant improvements on the Egg and Pen manipulation tasks, as shown in Figures 3(c) and 3(d). These tasks are known to be difficult because the objects often drop down. With the curriculum learning in CHER, the agent quickly learns how to hold the object. In summary, DDPG+CHER, with curriculum learning that selects hindsight experiences for replay, effectively improves the performance of DDPG+HER.

Figure 4: Performance of DDPG+CHER with different initial λ0 for all four hand environments: (a) HandReach, (b) Block, (c) Egg, (d) Pen.

Figure 5: Ablation study of DDPG+CHER with λ fixed (λ_fixed) for all four hand environments: (a) HandReach, (b) Block, (c) Egg, (d) Pen.

Figure 4 reports the success rate of DDPG+CHER using different initializations λ0 for λ.
It shows that promoting different amounts of proximity in the selection affects the performance. When λ0 = 0, i.e., starting without any proximity preferred, the performance degrades. It also shows that an overly large proximity weight does not improve the performance.

In Figure 5, we test the performance of DDPG+CHER with λ fixed at different values, where λ = INF refers to proximity-only selection. Compared to DDPG+CHER using a curriculum of increasing λ in Figure 3, the performance of CHER can vary significantly across different choices of λ_fixed, and some perform much worse. In contrast, DDPG+CHER with a gradually increasing λ usually results in a smoother and more stable learning process that can rapidly learn to accomplish challenging tasks.

5 Conclusion

The main contributions of this paper are summarized as follows: (1) We introduce "Goal-and-Curiosity-driven Curriculum Learning" for Hindsight Experience Replay (HER). To our knowledge, the resulting Curriculum-guided HER (CHER) is the first work that adaptively selects failed experiences for replay according to their compatibility with and usefulness to different learning stages of deep RL; (2) We show that a large diversity is beneficial to earlier exploration, while a large proximity to the desired goals is essential for effective exploitation in later stages; (3) We show that the sample efficiency and learning speed of off-policy RL algorithms can be significantly improved by CHER. We attribute this to learning global knowledge from a set of failed experiences, which breaks the constraint of local one-episode experience and leads to more robust strategies; (4) We apply CHER to several challenging continuous robotics environments with sparse rewards, and demonstrate its effectiveness and advantage over other HER-based approaches; (5) CHER does not make assumptions on tasks and environments, and can potentially be generalized to other more complicated
tasks, environments, and settings.

Acknowledgments

We would like to thank Tencent AI Lab and Robotics X for providing an excellent research environment that made this work possible. We would also like to thank the anonymous reviewers.

References

E. L. Allgower and K. Georg. Introduction to Numerical Continuation Methods. Society for Industrial and Applied Mathematics, 2003.

M. Andrychowicz, F. Wolski, A. Ray, J. Schneider, R. Fong, P. Welinder, B. McGrew, J. Tobin, O. P. Abbeel, and W. Zaremba. Hindsight experience replay.
In Advances in Neural Information Processing Systems, pages 5048–5058, 2017.

S. Basu and J. Christensen. Teaching classification boundaries to humans. In AAAI Conference on Artificial Intelligence, pages 109–115, 2013.

D. Batra, P. Yadollahpour, A. Guzman-Rivera, and G. Shakhnarovich. Diverse m-best solutions in Markov random fields. In European Conference on Computer Vision, pages 1–16, 2012.

Y. Bengio. Evolving Culture Versus Local Minima, pages 109–138. Springer Berlin Heidelberg, 2014.

Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In International Conference on Machine Learning, pages 41–48, 2009.

Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013.

G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba. OpenAI Gym, 2016.

Y. Burda, H. Edwards, D. Pathak, A. Storkey, T. Darrell, and A. A. Efros. Large-scale study of curiosity-driven learning. In International Conference on Learning Representations, 2019.

M. Conforti and G. Cornuejols. Submodular set functions, matroids and the greedy algorithm: Tight worst-case bounds and some generalizations of the Rado-Edmonds theorem. Discrete Applied Mathematics, 7(3):251–274, 1984.

G. Cornuéjols, M. Fisher, and G. Nemhauser. On the uncapacitated location problem. Annals of Discrete Mathematics, 1:163–177, 1977.

W. Czarnecki, S. Jayakumar, M. Jaderberg, L. Hasenclever, Y. W. Teh, N. Heess, S. Osindero, and R. Pascanu. Mix & Match – agent curricula for reinforcement learning. In International Conference on Machine Learning, volume 80, pages 1087–1095, 2018.

Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel. Benchmarking deep reinforcement learning for continuous control.
In International Conference on Machine Learning, pages 1329–1338, 2016.

M. Eppe, S. Magg, and S. Wermter. Curriculum goal masking for continuous deep reinforcement learning. arXiv preprint arXiv:1809.06146, 2018.

M. Fang, C. Zhou, B. Shi, B. Gong, J. Xu, and T. Zhang. DHER: Hindsight experience replay for dynamic goals. In International Conference on Learning Representations, 2019.

M. Fiterau and A. Dubrawski. Projection retrieval for classification. In Advances in Neural Information Processing Systems, pages 3023–3031, 2012.

C. Florensa, D. Held, M. Wulfmeier, M. Zhang, and P. Abbeel. Reverse curriculum generation for reinforcement learning. In Conference on Robot Learning, volume 78, pages 482–495, 2017.

P. Fournier, O. Sigaud, M. Chetouani, and P.-Y. Oudeyer. Accuracy-based curriculum learning in deep reinforcement learning. arXiv preprint arXiv:1806.09614, 2018.

M. Frank, J. Leitner, M. F. Stollenga, A. Förster, and J. Schmidhuber. Curiosity driven reinforcement learning for motion planning on humanoids. In Frontiers in Neurorobotics, 2013.

S. Fujishige. Submodular Functions and Optimization. Annals of Discrete Mathematics. Elsevier, 2005.

J. Gillenwater, A. Kulesza, and B. Taskar. Near-optimal MAP inference for determinantal point processes. In Advances in Neural Information Processing Systems, pages 2735–2743, 2012.

S. Gu, T. Lillicrap, I. Sutskever, and S. Levine. Continuous deep Q-learning with model-based acceleration. In International Conference on Machine Learning, pages 2829–2838, 2016.

N. Justesen and S. Risi. Automated curriculum learning by rewarding temporally rare events. IEEE Conference on Computational Intelligence and Games, pages 1–8, 2018.

F. Khan, X. J. Zhu, and B. Mutlu. How do humans teach: On curriculum learning and teaching dimension. In Advances in Neural Information Processing Systems, pages 1449–1457, 2011.

M. P.
Kumar, B. Packer, and D. Koller. Self-paced learning for latent variable models. In Advances in Neural Information Processing Systems, pages 1189–1197, 2010.

T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.

H. Lin and J. A. Bilmes. A class of submodular functions for document summarization. In The Annual Meeting of the Association for Computational Linguistics, pages 510–520, 2011.

H. Lin, J. A. Bilmes, and S. Xie. Graph-based submodular selection for extractive summarization. In IEEE Automatic Speech Recognition and Understanding Workshop, Merano, Italy, December 2009.

H. Liu, A. Trott, R. Socher, and C. Xiong. Competitive experience replay. In International Conference on Learning Representations, 2019.

L. Metz, J. Ibarz, N. Jaitly, and J. Davidson. Discrete sequential prediction of continuous actions for deep RL. arXiv preprint arXiv:1705.05035, 2017.

M. Minoux. Accelerated greedy algorithms for maximizing submodular set functions. In Optimization Techniques, volume 7 of Lecture Notes in Control and Information Sciences, chapter 27, pages 234–243. Springer Berlin Heidelberg, 1978.

B. Mirzasoleiman, A. Badanidiyuru, A. Karbasi, J. Vondrák, and A. Krause. Lazier than lazy greedy. In AAAI Conference on Artificial Intelligence, pages 1812–1818, 2015.

B. Mirzasoleiman, A. Karbasi, R. Sarkar, and A. Krause. Distributed submodular maximization. Journal of Machine Learning Research, 17(238):1–44, 2016.

V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis. Human-level control through deep reinforcement learning. Nature, 518:529, 2015.

A. V. Nair, V.
Pong, M. Dalal, S. Bahl, S. Lin, and S. Levine. Visual reinforcement learning with imagined goals. In Advances in Neural Information Processing Systems, pages 9191–9200, 2018.

G. L. Nemhauser, L. A. Wolsey, and M. L. Fisher. An analysis of approximations for maximizing submodular set functions-I. Mathematical Programming, 14(1):265–294, 1978.

A. Y. Ng, D. Harada, and S. Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In International Conference on Machine Learning, volume 99, pages 278–287, 1999.

D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell. Curiosity-driven exploration by self-supervised prediction. In International Conference on Machine Learning, 2017.

M. Plappert, M. Andrychowicz, A. Ray, B. McGrew, B. Baker, G. Powell, J. Schneider, J. Tobin, M. Chociej, P. Welinder, et al. Multi-goal reinforcement learning: Challenging robotics environments and request for research. arXiv preprint arXiv:1802.09464, 2018.

A. Prasad, S. Jegelka, and D. Batra. Submodular meets structured: Finding diverse subsets in exponentially-large structured item sets. In Advances in Neural Information Processing Systems, pages 2645–2653, 2014.

P. Rauber, A. Ummadisingu, F. Mutz, and J. Schmidhuber. Hindsight policy gradients. In International Conference on Learning Representations, 2019.

N. Savinov, A. Raichuk, D. Vincent, R. Marinier, M. Pollefeys, T. Lillicrap, and S. Gelly. Episodic curiosity through reachability. In International Conference on Learning Representations, 2019.

T. Schaul, D. Horgan, K. Gregor, and D. Silver. Universal value function approximators. In International Conference on Machine Learning, pages 1312–1320, 2015.

V. I. Spitkovsky, H. Alshawi, and D. Jurafsky. Baby Steps: How “Less is More” in unsupervised dependency parsing.
In Advances in Neural Information Processing Systems Workshop on Grammar Induction, Representation of Language and Language Learning, 2009.

J. S. Supancic III and D. Ramanan. Self-paced learning for long-term tracking. In Conference on Computer Vision and Pattern Recognition, pages 2379–2386, 2013.

K. Tang, V. Ramanathan, L. Fei-Fei, and D. Koller. Shifting weights: Adapting object detectors from image to video. In Advances in Neural Information Processing Systems, pages 638–646, 2012a.

Y. Tang, Y.-B. Yang, and Y. Gao. Self-paced dictionary learning for image classification. In The ACM International Conference on Multimedia, pages 833–836, 2012b.

Y. Tian, Q. Gong, W. Shang, Y. Wu, and C. L. Zitnick. ELF: An extensive, lightweight and flexible research platform for real-time strategy games. In Advances in Neural Information Processing Systems, pages 2659–2669, 2017.

E. Todorov, T. Erez, and Y. Tassa. MuJoCo: A physics engine for model-based control. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033, 2012.

K. Wei, R. Iyer, and J. Bilmes. Fast multi-stage submodular maximization. In International Conference on Machine Learning, 2014.

Y. Wu and Y. Tian. Training agent for first-person shooter game with actor-critic curriculum learning. In International Conference on Learning Representations, 2017.

R. Zhao and V. Tresp. Energy-based hindsight experience prioritization. In Conference on Robot Learning, pages 113–122, 2018.

T. Zhou and J. Bilmes. Minimax curriculum learning: Machine teaching with desirable difficulties and scheduled diversity. In International Conference on Learning Representations, 2018.

T. Zhou, S. Wang, and J. A. Bilmes. Diverse ensemble evolution: Curriculum data-model marriage. In Advances in Neural Information Processing Systems, pages 5905–5916.
2018.