{"title": "Variational Information Maximisation for Intrinsically Motivated Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 2125, "page_last": 2133, "abstract": "The mutual information is a core statistical quantity that has applications in all areas of machine learning, whether this is in training of density models over multiple data modalities, in maximising the efficiency of noisy transmission channels, or when learning behaviour policies for exploration by artificial agents. Most learning algorithms that involve optimisation of the mutual information rely on the Blahut-Arimoto algorithm --- an enumerative algorithm with exponential complexity that is not suitable for modern machine learning applications. This paper provides a new approach for scalable optimisation of the mutual information by merging techniques from variational inference and deep learning. We develop our approach by focusing on the problem of intrinsically-motivated learning, where the mutual information forms the definition of a well-known internal drive known as empowerment. Using a variational lower bound on the mutual information, combined with convolutional networks for handling visual input streams, we develop a stochastic optimisation algorithm that allows for scalable information maximisation and empowerment-based reasoning directly from pixels to actions.", "full_text": "Variational Information Maximisation for\n\nIntrinsically Motivated Reinforcement Learning\n\nShakir Mohamed and Danilo J. Rezende\n\nGoogle DeepMind, London\n\n{shakir, danilor}@google.com\n\nAbstract\n\nThe mutual information is a core statistical quantity that has applications in all ar-\neas of machine learning, whether this is in training of density models over multiple\ndata modalities, in maximising the ef\ufb01ciency of noisy transmission channels, or\nwhen learning behaviour policies for exploration by arti\ufb01cial agents. 
Most learn-\ning algorithms that involve optimisation of the mutual information rely on the\nBlahut-Arimoto algorithm \u2014 an enumerative algorithm with exponential com-\nplexity that is not suitable for modern machine learning applications. This paper\nprovides a new approach for scalable optimisation of the mutual information by\nmerging techniques from variational inference and deep learning. We develop our\napproach by focusing on the problem of intrinsically-motivated learning, where\nthe mutual information forms the de\ufb01nition of a well-known internal drive known\nas empowerment. Using a variational lower bound on the mutual information,\ncombined with convolutional networks for handling visual input streams, we de-\nvelop a stochastic optimisation algorithm that allows for scalable information\nmaximisation and empowerment-based reasoning directly from pixels to actions.\n\n1 Introduction\nThe problem of measuring and harnessing dependence between random variables is an inescapable\nstatistical problem that forms the basis of a large number of applications in machine learning, includ-\ning rate distortion theory [4], information bottleneck methods [28], population coding [1], curiosity-\ndriven exploration [26, 21], model selection [3], and intrinsically-motivated reinforcement learning\n[22]. In all these problems the core quantity that must be reasoned about is the mutual information.\nIn general, the mutual information (MI) is intractable to compute and few existing algorithms are\nuseful for realistic applications. The received algorithm for estimating mutual information is the\nBlahut-Arimoto algorithm [31] that effectively solves for the MI by enumeration \u2014 an approach\nwith exponential complexity that is not suitable for modern machine learning applications. 
By combining the best current practice from variational inference with that of deep learning, we bring the generality and scalability seen in other problem domains to information maximisation problems. We provide a new approach for maximisation of the mutual information that has significantly lower complexity, allows for computation with high-dimensional sensory inputs, and allows us to exploit modern computational resources.
The technique we derive is generally applicable, but we shall describe and develop our approach by focussing on one popular and increasingly topical application of the mutual information: as a measure of 'empowerment' in intrinsically-motivated reinforcement learning. Reinforcement learning (RL) has seen a number of successes in recent years that have now established it as a practical, scalable solution for realistic agent-based planning and decision making [16, 13]. A limitation of the standard RL approach is that an agent is only able to learn using external rewards obtained from its environment; truly autonomous agents will often exist in environments that lack such external rewards, or in environments where rewards are sparsely distributed. Intrinsically-motivated reinforcement learning [25] attempts to address this shortcoming by equipping an agent with a number of internal drives or intrinsic reward signals, such as hunger, boredom or curiosity, that allow the agent to continue to explore, learn and act meaningfully in a reward-sparse world. There are many ways in which to formally define internal drives, but what all such definitions have in common is that they, in some unsupervised fashion, allow an agent to reason about the value of information in the action-observation sequences it experiences. The mutual information allows for exactly this type of reasoning and forms the basis of one popular intrinsic reward measure, known as empowerment.

Figure 1: Perception-action loop separating environment into internal and external facets.

Figure 2: Computational graph for variational information maximisation.

Our paper begins by describing the framework we use for online and self-motivated learning (section 2) and then describes the general problem associated with mutual information estimation and empowerment (section 3). We then make the following contributions:
• We develop stochastic variational information maximisation, a new algorithm for scalable estimation of the mutual information and channel capacity that is applicable to both discrete and continuous settings.
• We combine variational information optimisation and tools from deep learning to develop a scalable algorithm for intrinsically-motivated reinforcement learning, demonstrating a new application of the variational theory for problems in reinforcement learning and decision making.
• We demonstrate that empowerment-based behaviours obtained using variational information maximisation match those using the exact computation.
We then apply our algorithms to a broad range\nof high-dimensional problems for which it is not possible to compute the exact solution, but for\nwhich we are able to act according to empowerment \u2013 learning directly from pixel information.\n\n2 Intrinsically-motivated Reinforcement Learning\nIntrinsically- or self-motivated learning attempts to address the question of where rewards come\nfrom and how they are used by an autonomous agent. Consider an online learning system that\nmust model and reason about its incoming data streams and interact with its environment. This\nperception-action loop is common to many areas such as active learning, process control, black-box\noptimisation, and reinforcement learning. An extended view of this framework was presented by\nSingh et al. [25], who describe the environment as factored into external and internal components\n(\ufb01gure 1). An agent receives observations and takes actions in the external environment. Impor-\ntantly, the source and nature of any reward signals are not assumed to be provided by an oracle in\nthe external environment, but is moved to an internal environment that is part of the agent\u2019s decision-\nmaking system; the internal environment handles the ef\ufb01cient processing of all input data and the\nchoice and computation of an appropriate internal reward signal.\nThere are two important components of this framework: the state representation and the critic. We\nare principally interested in vision-based self-motivated systems, for which there are no solutions\ncurrently developed. To achieve this, our state representation system is a convolutional neural net-\nwork [14]. 
The critic in figure 1 is responsible for providing intrinsic rewards that allow the agent to act under different types of internal motivations, and is where information maximisation enters the intrinsically-motivated learning problem.
The nature of the critic, and in particular the reward signal it provides, is the main focus of this paper. A wide variety of reward functions have been proposed, including: missing information or Bayesian surprise, which uses the KL divergence to measure the change in an agent's internal belief after the observation of new data [8, 24]; measures based on prediction errors of future states, such as predicted L1 change, predicted mode change or probability gain [17], or salient event prediction [25]; and measures based on information-theoretic quantities such as predicted information gain (PIG) [15], causal entropic forces [30] or empowerment [23]. The paper by Oudeyer & Kaplan [19] currently provides the widest singular discussion of the breadth of intrinsic motivation measures. Although we have a wide choice of intrinsic reward measures, none of the available information-theoretic approaches are efficient to compute or scalable to high-dimensional problems: they require either knowledge of the true transition probability or summation over all configurations of the state space, which is not tractable for complex environments or when the states are large images.

3 Mutual Information and Empowerment
The mutual information is a core information-theoretic quantity that acts as a general measure of dependence between two random variables x and y, defined as:

I(x, y) = E_{p(y|x)p(x)}[log( p(x, y) / (p(x)p(y)) )],   (1)

where p(x, y) is the joint distribution over the random variables, and p(x) and p(y) are the corresponding marginal distributions.
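On small discrete problems, (1) can be evaluated directly from a tabulated joint distribution. The following sketch is an editorial illustration rather than code from the paper; the function name and the test distributions are invented:

```python
import numpy as np

def mutual_information(p_xy):
    """I(x, y) in nats for a discrete joint distribution p_xy[i, j] = p(x=i, y=j)."""
    p_x = p_xy.sum(axis=1, keepdims=True)   # marginal p(x)
    p_y = p_xy.sum(axis=0, keepdims=True)   # marginal p(y)
    mask = p_xy > 0                         # convention: 0 log 0 = 0
    return float((p_xy[mask] * np.log(p_xy[mask] / (p_x * p_y)[mask])).sum())

# Independent variables share no information; a noiseless, uniform
# two-symbol channel attains log 2 nats (one bit).
mi_indep = mutual_information(np.outer([0.5, 0.5], [0.3, 0.7]))
mi_ident = mutual_information(np.array([[0.5, 0.0], [0.0, 0.5]]))
```

Here `mi_indep` is numerically zero for the independent joint, while `mi_ident` recovers log 2 nats for the noiseless channel.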
x and y can be many quantities of interest: in computational neuroscience they are the sensory inputs and the spiking population code; in telecommunications they are the input signal to a channel and the received transmission; when learning exploration policies in RL, they are the current state and the action at some time in the future, respectively.
For intrinsic motivation, we use an internal reward measure referred to as empowerment [12, 23] that is obtained by searching for the maximal mutual information I(·,·), conditioned on a starting state s, between a sequence of K actions a and the final state reached s′:

E(s) = max_ω I^ω(a, s′|s) = max_ω E_{p(s′|a,s)ω(a|s)}[log( p(a, s′|s) / (ω(a|s) p(s′|s)) )],   (2)

where a = {a1, . . . , aK} is a sequence of K primitive actions ak leading to a final state s′, and p(s′|a, s) is the K-step transition probability of the environment. p(a, s′|s) is the joint distribution of action sequences and the final state, ω(a|s) is a distribution over K-step action sequences, and p(s′|s) is the joint probability marginalised over the action sequence.
Equation (2) is the definition of the channel capacity in information theory and is a measure of the amount of information contained in the action sequences a about the future state s′. This measure is compelling since it provides a well-grounded, task-independent measure for intrinsic motivation that fits naturally within the framework for intrinsically-motivated learning described by figure 1. Furthermore, empowerment, like the state- or action-value function in reinforcement learning, assigns a value E(s) to each state s in an environment. An agent that seeks to maximise this value will move towards states from which it can reach the largest number of future states within its planning horizon K.
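For channels small enough to enumerate, the capacity in (2) can be computed with the Blahut-Arimoto iteration discussed in the introduction. A minimal sketch, assuming the channel table p(s′|a, s) for a fixed start state is known and enumerable, which is exactly the assumption that fails at scale:

```python
import numpy as np

def blahut_arimoto(p_y_x, iters=200):
    """Capacity max_w I(x; y) in nats for an enumerable channel p_y_x[x, y].

    For empowerment, x ranges over all N^K action sequences and y over the
    final states, which is what makes this approach intractable at scale.
    """
    def kl_rows(q_y):
        # KL(p(y|x) || q(y)) for every input x, with 0 log 0 = 0.
        ratio = np.where(p_y_x > 0, p_y_x / q_y, 1.0)
        return (p_y_x * np.log(ratio)).sum(axis=1)

    w = np.full(p_y_x.shape[0], 1.0 / p_y_x.shape[0])  # source w(x)
    for _ in range(iters):
        w = w * np.exp(kl_rows(w @ p_y_x))             # multiplicative update
        w /= w.sum()
    return float((w * kl_rows(w @ p_y_x)).sum())

# A noiseless channel with two distinguishable inputs: capacity log 2 nats.
capacity = blahut_arimoto(np.eye(2))
```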
It is this intuition that has led authors to describe empowerment as a measure of agent 'preparedness', or as a means by which an agent may quantify the extent to which it can reliably influence its environment — motivating an agent to move to states of maximum influence [23].
An empowerment-based agent generates an open-loop sequence of actions K steps into the future — this is only used by the agent for its internal planning using ω(a|s). When optimised using (2), the distribution ω(a|s) becomes an efficient exploration policy that allows for uniform exploration of the state space reachable at horizon K, and is another compelling aspect of empowerment (we provide more intuition for this in appendix A). But this policy is not what is used by the agent for acting: when an agent must act in the world, it follows a closed-loop policy obtained by a planning algorithm using the empowerment value (e.g., Q-learning); we expand on this in sect. 4.3. A further consequence is that while acting, the agent is only 'curious' about parts of its environment that can be reached within its internal planning horizon K. We shall not explore the effect of the horizon in this work, but this has been widely explored and we defer to the insights of Salge et al. [23].

4 Scalable Information Maximisation
The mutual information (MI) as we have described it thus far, whether it be for problems in empowerment, channel capacity or rate distortion, hides two difficult statistical problems. Firstly, computing the MI involves expectations over the unknown state transition probability. This can be seen by rewriting the MI in terms of the difference between conditional entropies H(·) as:

I(a, s′|s) = H(a|s) − H(a|s′, s),   (3)

where H(a|s) = −E_{ω(a|s)}[log ω(a|s)] and H(a|s′, s) = −E_{p(s′|a,s)ω(a|s)}[log p(a|s′, s)]. This computation requires marginalisation over the K-step transition dynamics of the environment p(s′|a, s), which is unknown in general. We could estimate this distribution by building a generative model of the environment, and then use this model to compute the MI. Since learning accurate generative models remains a challenging task, a solution that avoids this is preferred (and we also describe one approach for model-based empowerment in appendix B).
Secondly, we currently lack an efficient algorithm for MI computation. There exists no scalable algorithm for computing the mutual information that allows us to apply empowerment to high-dimensional problems and to easily exploit modern computing systems. The current solution is to use the Blahut-Arimoto algorithm [31], which essentially enumerates over all states, and is thus limited to small-scale problems and not applicable to the continuous domain. More scalable non-parametric estimators have been developed [7, 6], but these have a high memory footprint or require a very large number of observations; their approximations may not be bounds on the MI, making reasoning about correctness harder; and they cannot easily be composed with existing (gradient-based) systems that would allow us to design a unified (end-to-end) system. In the continuous domain, Monte Carlo integration has been proposed [10], but Monte Carlo estimators can require a large number of draws to obtain accurate solutions and manageable variance. We have also explored Monte Carlo estimators for empowerment and describe an alternative importance-sampling-based estimator for the MI and channel capacity in appendix B.1.

4.1 Variational Information Lower Bound
The MI can be made more tractable by deriving a lower bound to it and maximising this instead — here we present the bound derived by Barber & Agakov [1].
Using the entropy formulation of the MI (3) reveals that bounding the conditional entropy component is sufficient to bound the entire mutual information. By using the non-negativity property of the KL divergence, we obtain the bound:

KL[p(x|y) ‖ q(x|y)] ≥ 0  ⇒  H(x|y) ≤ −E_{p(x|y)}[log q_ξ(x|y)],

I^ω(s) = H(a|s) − H(a|s′, s) ≥ H(a|s) + E_{p(s′|a,s)ω_θ(a|s)}[log q_ξ(a|s′, s)] = I^{ω,q}(s),   (4)

where we have introduced a variational distribution q_ξ(·) with parameters ξ; the distribution ω_θ(·) has parameters θ. This bound becomes exact when q_ξ(a|s′, s) is equal to the true action posterior distribution p(a|s′, s). Other lower bounds for the mutual information are also possible: Jaakkola & Jordan [9] present a lower bound by using the convexity bound for the logarithm; Brunel & Nadal [2] use a Gaussian assumption and appeal to the Cramer-Rao lower bound.
The bound (4) is highly convenient (especially when compared to other bounds) since the transition probability p(s′|a, s) appears linearly in the expectation and we never need to evaluate its probability — we can thus evaluate the expectation directly by Monte Carlo using data obtained by interaction with the environment. The bound is also intuitive since we operate using the marginal distribution on action sequences ω_θ(a|s), which acts as a source (exploration distribution), the transition distribution p(s′|a, s) acts as an encoder from a to s′, and the variational distribution q_ξ(a|s′, s) conveniently acts as a decoder (planning distribution) taking us from s′ to a.

4.2 Variational Information Maximisation
A straightforward optimisation procedure based on (4) is an alternating optimisation for the parameters of the distributions q_ξ(·) and ω_θ(·).
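A toy illustration of why the bound is convenient: it can be estimated purely from sampled (a, s′) pairs, without ever evaluating the transition probability. The sketch below uses an invented toy channel and sets the decoder to the exact posterior, for which the bound is tight; it is not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy channel for a fixed start state: 4 action sequences, 4 final states.
p_sp_a = np.full((4, 4), 0.05) + 0.80 * np.eye(4)   # rows: p(s'|a, s)
omega = np.full(4, 0.25)                            # uniform source w(a|s)

# "Interact with the environment": sample actions, then final states.
n = 200_000
a = rng.choice(4, size=n, p=omega)
sp = (rng.random(n)[:, None] > p_sp_a.cumsum(axis=1)[a]).sum(axis=1)

# Decoder q(a|s') chosen as the exact posterior (bound is then tight).
joint = omega[:, None] * p_sp_a
q_a_sp = joint / joint.sum(axis=0, keepdims=True)

# Variational bound: I >= H(a) + E[log q(a|s')], from samples alone.
bound = -np.log(omega[a]).mean() + np.log(q_a_sp[a, sp]).mean()

# Exact MI for comparison (possible here because the channel is known).
true_mi = float((joint * np.log(p_sp_a / (omega @ p_sp_a))).sum())
```

With this many samples the Monte Carlo estimate of the bound agrees with the exact mutual information to within a few thousandths of a nat.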
Barber & Agakov [1] made the connection between this approach and the generalised EM algorithm, referring to it as the IM (information maximisation) algorithm; we follow the same optimisation principle. From an optimisation perspective, the maximisation of the bound I^{ω,q}(s) in (4) w.r.t. ω(a|s) can be ill-posed (e.g., in Gaussian models, the variances can diverge). We avoid such divergent solutions by adding a constraint on the value of the entropy H(a|s), which results in the constrained optimisation problem:

Ê(s) = max_{ω,q} I^{ω,q}(s)  s.t.  H(a|s) < ε,    Ê(s) = max_{ω,q} E_{p(s′|a,s)ω(a|s)}[ −(1/β) ln ω(a|s) + ln q_ξ(a|s′, s) ],   (5)

where a is the action sequence performed by the agent when moving from s to s′, and β is an inverse temperature (which is a function of the constraint ε).
At all times we use very general source and decoder distributions formed by complex non-linear functions using deep networks, and use stochastic gradient ascent for optimisation. We refer to our approach as stochastic variational information maximisation to highlight that we do all our computation on a mini-batch of recent experience from the agent. The optimisation for the decoder q_ξ(·) becomes a maximum likelihood problem, and the optimisation for the source ω_θ(·) requires computation of an unnormalised energy-based model, which we describe next. We summarise the overall procedure in algorithm 1.

4.2.1 Maximum Likelihood Decoder
The first step of the alternating optimisation is the optimisation of equation (5) w.r.t. the decoder q, and is a supervised maximum likelihood problem. Given a set of data from past interactions with the environment, we learn a distribution from the start and termination states s, s′, respectively, to the action sequences a that have been taken.
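As a much-simplified stand-in for the neural decoder used in the paper, the sketch below fits a tabular autoregressive decoder by maximum likelihood, estimating each conditional by normalised counts; the tiny chain world and all names are invented for illustration:

```python
import numpy as np
from collections import defaultdict

def fit_decoder(transitions, num_actions):
    """Tabular ML decoder q(a|s, s') factorised autoregressively:
    q(a|s,s') = q(a1|s,s') * prod_k q(ak | a_{k-1}, s, s'),
    each conditional estimated by normalised counts (the ML solution).
    transitions: list of (s, actions, s_final), with actions a K-tuple.
    """
    counts = defaultdict(lambda: np.zeros(num_actions))
    for s, actions, sp in transitions:
        prev = None
        for ak in actions:
            counts[(s, sp, prev)][ak] += 1.0
            prev = ak
    return {k: v / v.sum() for k, v in counts.items()}

def log_q(decoder, s, actions, sp):
    """Log-likelihood ln q(a|s, s') under the fitted decoder."""
    lp, prev = 0.0, None
    for ak in actions:
        lp += float(np.log(decoder[(s, sp, prev)][ak]))
        prev = ak
    return lp

# A deterministic 1-D chain: action 1 moves right, action 0 stays.
data = [(0, (1, 1), 2), (0, (1, 0), 1), (0, (1, 0), 1), (0, (0, 0), 0)]
dec = fit_decoder(data, num_actions=2)
```

In this deterministic world the fitted decoder assigns probability one to each observed sequence, so its log-likelihood is zero.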
We parameterise the decoder as an auto-regressive distribution over the K-step action sequence:

q_ξ(a|s′, s) = q(a1|s, s′) ∏_{k=2}^{K} q(ak | f_ξ(a_{k−1}, s, s′)),   (6)

We are free to choose the distributions q(ak) for each action in the sequence, which we choose as categorical distributions whose mean parameters are the result of the function f_ξ(·) with parameters ξ. f is a non-linear function that we specify using a two-layer neural network with rectified-linear activation functions. By maximising this log-likelihood, we are able to make stochastic updates to the variational parameters ξ of this distribution. The neural network models used are expanded upon in appendix D.

4.2.2 Estimating the Source Distribution
Given a current estimate of the decoder q, the variational solution for the distribution ω(a|s), computed by solving the functional derivative δI^{ω,q}(s)/δω(a|s) = 0 under the constraint that Σ_a ω(a|s) = 1, is given by ω*(a|s) = (1/Z(s)) exp(û(s, a)), where u(s, a) = E_{p(s′|s,a)}[ln q_ξ(a|s, s′)], û(s, a) = β u(s, a), and Z(s) = Σ_a e^{û(s,a)} is a normalisation term. By substituting this optimal distribution into the original objective (5), we find that it can be expressed in terms of the normalisation function Z(s) only: E(s) = (1/β) log Z(s).
The distribution ω*(a|s) is implicitly defined as an unnormalised distribution — there are no direct mechanisms for sampling actions or computing the normalising function Z(s) for such distributions. We could use Gibbs or importance sampling, but these solutions are not satisfactory as they would require several evaluations of the unknown function u(s, a) per decision per state. We obtain a more convenient problem by approximating the unnormalised distribution ω*(a|s) by a normalised (directed) distribution h_θ(a|s). This is equivalent to approximating the energy term û(s, a) by a function of the log-likelihood of the directed model, r_θ:

ω*(a|s) ≈ h_θ(a|s)  ⇒  û(s, a) ≈ r_θ(s, a);    r_θ(s, a) = ln h_θ(a|s) + φ(s).   (7)

We introduced a scalar function φ(s) into the approximation, but since this is not dependent on the action sequence a it does not change the approximation (7), as can be verified by substituting (7) into ω*(a|s). Since h_θ(a|s) is a normalised distribution, this leaves φ(s) to account for the normalisation term β log Z(s), verified by substituting ω*(a|s) and (7) into (5). We therefore obtain a cheap estimator of empowerment: E(s) ≈ (1/β) φ(s).
To optimise the parameters θ of the directed model h_θ and the scalar function φ, we can minimise any measure of discrepancy between the two sides of the approximation (7). We minimise the squared error, giving the loss function L(h_θ, φ) for optimisation as:

L(h_θ, φ) = E_{p(s′|s,a)}[ (β ln q_ξ(a|s, s′) − r_θ(s, a))² ].   (8)

At convergence of the optimisation, we obtain a compact function with which to compute the empowerment that only requires forward evaluation of the function φ. h_θ(a|s) is parameterised using an auto-regressive distribution similar to (6), with conditional distributions specified by deep networks. The scalar function φ is also parameterised using a deep network. Further details of these networks are provided in appendix D.

4.3 Empowerment-based Behaviour Policies
Using empowerment as an intrinsic reward measure, an agent will seek out states of maximal empowerment.
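The one-step greedy use of empowerment described in this section can be sketched as follows; the one-dimensional world and the empowerment function standing in for the learned estimate are invented for illustration:

```python
import numpy as np

def greedy_action(s, actions, transition, empowerment, rng, n_samples=100):
    """One-step greedy policy: pick argmax_a E_{p(s'|s,a)}[E(s')].

    transition(s, a, rng) -> s' samples the (unknown) dynamics, so the
    expectation is estimated by Monte Carlo; empowerment(s) -> float is
    a stand-in for the learned empowerment estimate.
    """
    def expected_emp(a):
        return np.mean([empowerment(transition(s, a, rng))
                        for _ in range(n_samples)])
    return max(actions, key=expected_emp)

# Toy 1-D corridor on {0,...,4}: empowerment is highest mid-corridor.
emp = lambda s: float(-(s - 2) ** 2)
step = lambda s, a, rng: int(np.clip(s + a, 0, 4))   # deterministic here
rng = np.random.default_rng(0)
act = greedy_action(0, actions=(-1, 0, +1), transition=step,
                    empowerment=emp, rng=rng)
```

Starting at the left wall, the greedy agent moves right, towards the state of highest empowerment.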
Algorithm 1: Stochastic Variational Information Maximisation for Empowerment
Parameters: ξ (variational), θ (source), and the convolutional state-representation parameters.
while not converged do
    x ← read current state
    s = ConvNet(x)   {compute state repr.}
    a ∼ ω(a|s)   {draw action sequence}
    Obtain data (x, a, x′)   {acting in env.}
    s′ = ConvNet(x′)   {compute state repr.}
    Δξ ∝ ∇_ξ log q_ξ(a|s, s′)   (6)
    Δθ ∝ −∇_θ L(h_θ, φ)   (8)
    Δ(repr.) ∝ ∇ log q_ξ(a|s, s′) − ∇ L(h_θ, φ)
end while
E(s) = (1/β) φ(s)   {empowerment}

Figure 3: Comparing exact vs approximate empowerment. Heat maps: empowerment in 3 environments: two rooms, cross room, two-rooms. Scatter plot: agreement for two-rooms (approximate vs true empowerment in nats; fit y = −0.083 + 1·x, r² = 0.903).

We can treat the empowerment value E(s) as a state-dependent reward and can then utilise any standard planning algorithm, e.g., Q-learning, policy gradients or Monte Carlo search. We use the simplest planning strategy: one-step greedy empowerment maximisation. This amounts to choosing actions a = argmax_a C(s, a), where C(s, a) = E_{p(s′|s,a)}[E(s′)]. This policy does not account for the effect of actions beyond the planning horizon K. A natural enhancement is to use value iteration [27] to allow the agent to take actions by maximising its long-term (potentially discounted) empowerment. A third approach would be to use empowerment as a potential function, and the difference between the current and previous state's empowerment as a shaping function within the planning [18]. A fourth approach is one where the agent uses the source distribution ω(a|s) as its behaviour policy.
The source distribution has similar properties to the greedy behaviour policy and can also be used, but since it effectively acts as an empowered agent's internal exploration mechanism, it has a large variance (it is designed to allow uniform exploration of the state space). Understanding this choice of behaviour policy is an important line of ongoing research.

4.4 Algorithm Summary and Complexity
The system we have described is a scalable and general-purpose algorithm for mutual information maximisation, and we summarise the core components using the computational graph in figure 2 and in algorithm 1. The state representation mechanism used throughout is obtained by transforming raw observations x, x′ to produce the start and final states s, s′, respectively. When the raw observations are pixels from vision, the state representation is a convolutional neural network [14, 16], while for other observations (such as continuous measurements) we use a fully-connected neural network. Since we use a unified loss function, we can apply gradient descent and backpropagate stochastic gradients through the entire model, allowing for joint optimisation of both the information and representation parameters. For optimisation we use a preconditioned optimisation algorithm such as Adagrad [5].
The computational complexity of empowerment estimators involves the planning horizon K, the number of actions N, and the number of states S. For the exact computation we must enumerate over the number of states, which for grid-worlds is S ∝ D² (for D×D grids), or for binary images is S = 2^{D²}. The complexity of using the Blahut-Arimoto (BA) algorithm is O(N^K S²) = O(N^K D⁴) for grid worlds, or O(N^K 2^{2D²}) for binary images. The BA algorithm, even in environments with a small number of interacting objects, quickly becomes intractable, since the state space grows exponentially with the number of possible interactions, and the algorithm is also exponential in the planning horizon. In contrast, our approach operates directly on the image dimensions. Using visual inputs, the convolutional network produces a vector of size P, upon which all subsequent computation is based, consisting of an L-layer neural network. This gives a complexity for state representation of O(D²P + LP²). The autoregressive distributions have complexity of O(H²KN), where H is the size of the hidden layer. Thus, our approach has at most quadratic complexity in the size of the hidden layers used and linear complexity in the other quantities, and matches the complexity of currently employed large-scale vision-based models. In addition, since we use gradient descent throughout, we are able to leverage the power of GPUs and distributed gradient computations.

5 Results
We demonstrate the use of empowerment and the effectiveness of variational information maximisation in two types of environments. Static environments consist of rooms and mazes in different configurations in which there are no objects with which the agent can interact, or other moving objects.

Figure 4: Empowerment for a room environment, showing a) an empty room, b) room with an obstacle, c) room with a moveable box, d) room with a row of moveable boxes.

Figure 5: Left: empowerment landscape for agent and key scenario. Yellow is the key and green is the door. Right: Agent in a corridor with flowing lava. The agent places bricks to stem the flow of lava.

The number of states in these settings is equal to the number of locations in the environment, so is still manageable for approaches that rely on state enumeration.
In dynamic environments, as-\npects of the environment change, such as \ufb02owing lava that causes the agent to reset, or a predator\nthat chases the agent. For the most part, we consider discrete action settings in which the agent has\n\ufb01ve actions (up, down, left, right, do nothing). The agent may have other actions, such as picking\nup a key or laying down a brick. There are no external rewards available and the agent must reason\npurely using visual (pixel) information. For all these experiments we used a horizon of K = 5.\n\n5.1 Effectiveness of the MI Bound\nWe \ufb01rst establish that the use of the variational information lower bound results in the same be-\nhaviour as that obtained using the exact mutual information in a set of static environments. We\nconsider environments that have at most 400 discrete states and compute the true mutual informa-\ntion using the Blahut-Arimoto algorithm. We compute the variational information bound on the\nsame environment using pixel information (on 20\u21e5 20 images). To compare the two approaches we\nlook at the empowerment landscape obtained by computing the empowerment at every location in\nthe environment and show these as heatmaps. For action selection, what matters is the location of the\nmaximum empowerment, and by comparing the heatmaps in \ufb01gure 3, we see that the empowerment\nlandscape matches between the exact and the variational solution, and hence, will lead to the same\nagent-behaviour.\nIn each image in \ufb01gure 3, we show a heat-map of the empowerment for each location in the environ-\nment. We then analyze the point of highest empowerment: for the large room it is in the centre of\nthe room; for the cross-shaped room it is at the centre of the cross, and in a two-rooms environment,\nit is located near both doors. 
In addition, we show that the empowerment values obtained by our method constitute a close approximation to the true empowerment for the two-rooms environment (correlation coefficient = 1.00, R² = 0.90). These results match those of authors such as Klyubin et al. [12] (using empowerment) and Wissner-Gross & Freer [30] (using a different information-theoretic measure, the causal entropic force). The advantage of the variational approach is clear from this discussion: we obtain solutions of the same quality as the exact computation, we have far more favourable computational scaling (one that is not exponential in the size of the state space and planning horizon), and we are able to plan directly from pixel information.

5.2 Dynamic Environments
Having established the usefulness of the bound and some further understanding of empowerment, we now examine empowerment behaviour in environments with dynamic characteristics. Even in small environments, the number of states becomes extremely large if there are objects that can be moved, or added and removed from the environment, making enumerative algorithms (such as BA) quickly infeasible, since we have an exponential explosion in the number of states. We first reproduce an experiment from Salge et al. [23, §4.5.3] that considers the empowered behaviour of an agent in a room that: is empty, has a fixed box, has a moveable box, or has a row of moveable boxes. Salge et al. [23] use this setup to discuss the choice of state representation, showing that not including the existence of the box severely limits the planning ability of the agent. In our approach, we do not face this problem of choosing the state representation, since the agent reasons about all objects that appear within its visual observations, obviating the need for hand-designed state representations.
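In the dynamic experiments that follow, the agent acts greedily with respect to one-step empowerment: it evaluates the state reached by each available action and takes the action whose resulting state has the highest estimated empowerment. A minimal sketch of that control rule, where `step` and `empowerment` are hypothetical placeholder interfaces rather than the paper's API:

```python
# Greedy empowerment-based action selection: simulate (or predict) the state
# reached by each action and pick the action whose resulting state has the
# highest estimated empowerment. `step(state, action)` and
# `empowerment(state)` are placeholder callables supplied by the caller.

ACTIONS = ["up", "down", "left", "right", "stay"]

def select_action(state, step, empowerment, actions=ACTIONS):
    """Return the action leading to the state of highest empowerment."""
    scores = {a: empowerment(step(state, a)) for a in actions}
    return max(scores, key=scores.get)
```

A one-line toy check: with integer states, a `step` that shifts the state, and `empowerment(s) = -abs(s - 3)`, the rule steers the agent towards the state of value 3.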
Figure 4 shows that in an empty room, the empowerment is uniform almost everywhere except close to the walls; in a room with a fixed box, the fixed box limits the set of future reachable states and, as expected, empowerment is low around the box; in a room where the box can be moved, the box can be seen as a tool and we have high empowerment near the box; similarly, when we have four boxes in a row, the empowerment is highest around the boxes. These results match those of Salge et al. [23] and show the effectiveness of reasoning from pixel information directly.

Figure 6: Empowerment planning in a lava-filled maze environment. Panels show the candidate actions (Box+Up, Box+Down, Box+Left, Box+Right, Up, Down, Left, Right, Stay) and the empowerment values C(s, a) over time steps t = 1 to t = 5. Black panels show the path taken by the agent.

Figure 7: Predator (red) and agent (blue) scenario. Panels 1 and 6 show the 3D simulation. The other panels show a trace of the path that the predator and prey take at points on the trajectory. Blue/red shows the path history; cyan shows the direction to the maximum empowerment.

Figure 6 shows how planning with empowerment works in a dynamic maze environment, where lava flows from a source at the bottom that eventually engulfs the maze. The only way the agent can safeguard itself is to stem the flow of lava by building a wall at the entrance to one of the corridors. At every point in time t, the agent decides its next action by computing the expected empowerment after taking one action. In this environment, we show the planning for all 9 available actions and a bar graph with the empowerment values for each resulting state. The action that leads to the highest empowerment is taken and is indicated by the black panels¹.
Figure 5 (left) shows two rooms separated by a door. The agent is able to collect a key that allows it to open the door.
Before collecting the key, the maximum empowerment is in the region around the key; once the agent has collected the key, the region of maximum empowerment is close to the door². Figure 5 (right) shows an agent in a corridor that must protect itself by building a wall of bricks, which it does successfully using the same empowerment planning approach described for the maze setting.

5.3 Predator-Prey Scenario
We demonstrate the applicability of our approach to continuous settings by studying a simple 3D physics simulation [29], shown in figure 7. Here, the agent (blue) is followed by a predator (red) and is randomly reset to a new location in the environment if caught by the predator. Both the agent and the predator are represented as spheres that roll on a surface with friction. The state is the position, velocity and angular momentum of the agent and the predator, and the action is a 2D force vector. As expected, the maximum empowerment lies in regions away from the predator, which results in the agent learning to escape the predator³.

6 Conclusion
We have developed a new approach for scalable estimation of the mutual information by exploiting recent advances in deep learning and variational inference. We focussed specifically on intrinsic motivation with a reward measure known as empowerment, which requires at its core the efficient computation of the mutual information. By using a variational lower bound on the mutual information, we developed a scalable model and efficient algorithm that expands the applicability of empowerment to high-dimensional problems, with the complexity of our approach being extremely favourable compared to that of the Blahut-Arimoto algorithm that is currently the standard.
The overall system does not require a generative model of the environment to be built, learns using only interactions with the environment, and allows the agent to learn directly from visual information or in continuous state-action spaces. While we chose to develop the algorithm in terms of intrinsic motivation, the mutual information has wide applications in other domains, all of which stand to benefit from a scalable algorithm that allows them to exploit the abundance of data and be applied to large-scale problems.

Acknowledgements: We thank Daniel Polani for invaluable guidance and feedback.

¹ Video: http://youtu.be/eA9jVDa7O38
² Video: http://youtu.be/eSAIJ0isc3Y
³ Videos: http://youtu.be/tMiiKXPirAQ; http://youtu.be/LV5jYY-JFpE

References
[1] Barber, D. and Agakov, F. The IM algorithm: a variational approach to information maximization. In NIPS, volume 16, pp. 201, 2004.
[2] Brunel, N. and Nadal, J. Mutual information, Fisher information, and population coding. Neural Computation, 10(7):1731–1757, 1998.
[3] Buhmann, J. M., Chehreghani, M. H., Frank, M., and Streich, A. P. Information theoretic model selection for pattern analysis. Workshop on Unsupervised and Transfer Learning, 2012.
[4] Cover, T. M. and Thomas, J. A. Elements of Information Theory. John Wiley & Sons, 1991.
[5] Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121–2159, 2011.
[6] Gao, S., Steeg, G. V., and Galstyan, A. Efficient estimation of mutual information for strongly dependent variables. arXiv:1411.2003, 2014.
[7] Gretton, A., Herbrich, R., and Smola, A. J. The kernel mutual information. In ICASSP, volume 4, pp. IV-880, 2003.
[8] Itti, L. and Baldi, P. F. Bayesian surprise attracts human attention. In NIPS, pp. 547–554, 2005.
[9] Jaakkola, T. S. and Jordan, M. I.
Improving the mean field approximation via the use of mixture distributions. In Learning in Graphical Models, pp. 163–173, 1998.
[10] Jung, T., Polani, D., and Stone, P. Empowerment for continuous agent-environment systems. Adaptive Behavior, 19(1):16–39, 2011.
[12] Klyubin, A. S., Polani, D., and Nehaniv, C. L. Empowerment: A universal agent-centric measure of control. In IEEE Congress on Evolutionary Computation, pp. 128–135, 2005.
[13] Koutník, J., Schmidhuber, J., and Gomez, F. Evolving deep unsupervised convolutional networks for vision-based reinforcement learning. In GECCO, pp. 541–548, 2014.
[14] LeCun, Y. and Bengio, Y. Convolutional networks for images, speech, and time series. The Handbook of Brain Theory and Neural Networks, 3361:310, 1995.
[15] Little, D. Y. and Sommer, F. T. Learning and exploration in action-perception loops. Frontiers in Neural Circuits, 7, 2013.
[16] Mnih, V., Kavukcuoglu, K., Silver, D., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
[17] Nelson, J. D. Finding useful questions: on Bayesian diagnosticity, probability, impact, and information gain. Psychological Review, 112(4):979, 2005.
[18] Ng, A. Y., Harada, D., and Russell, S. Policy invariance under reward transformations: Theory and application to reward shaping. In ICML, 1999.
[19] Oudeyer, P. and Kaplan, F. How can we define intrinsic motivation? In International Conference on Epigenetic Robotics, 2008.
[21] Rubin, J., Shamir, O., and Tishby, N. Trading value and information in MDPs. In Decision Making with Imperfect Decision Makers, pp. 57–74, 2012.
[22] Salge, C., Glackin, C., and Polani, D. Changing the environment based on empowerment as intrinsic motivation. Entropy, 16(5):2789–2819, 2014.
[23] Salge, C., Glackin, C., and Polani, D.
Empowerment–An Introduction. In Guided Self-Organization: Inception, pp. 67–114, 2014.
[24] Schmidhuber, J. Formal theory of creativity, fun, and intrinsic motivation (1990–2010). IEEE Transactions on Autonomous Mental Development, 2(3):230–247, 2010.
[25] Singh, S. P., Barto, A. G., and Chentanez, N. Intrinsically motivated reinforcement learning. In NIPS, 2005.
[26] Still, S. and Precup, D. An information-theoretic approach to curiosity-driven reinforcement learning. Theory in Biosciences, 131(3):139–148, 2012.
[27] Sutton, R. S. and Barto, A. G. Introduction to Reinforcement Learning. MIT Press, 1998.
[28] Tishby, N., Pereira, F. C., and Bialek, W. The information bottleneck method. In Allerton Conference on Communication, Control, and Computing, 1999.
[29] Todorov, E., Erez, T., and Tassa, Y. MuJoCo: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), pp. 5026–5033, 2012.
[30] Wissner-Gross, A. D. and Freer, C. E. Causal entropic forces. Physical Review Letters, 110(16), 2013.
[31] Yeung, R. W. The Blahut-Arimoto algorithms. In Information Theory and Network Coding, pp. 211–228, 2008.