{"title": "Multi-View Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 1420, "page_last": 1431, "abstract": "This paper is concerned with multi-view reinforcement learning (MVRL), which allows for decision making when agents share common dynamics but adhere to different observation models. We define the MVRL framework by extending partially observable Markov decision processes (POMDPs) to support more than one observation model and propose two solution methods through observation augmentation and cross-view policy transfer. We empirically evaluate our method and demonstrate its effectiveness in a variety of environments. Specifically, we show reductions in sample complexities and computational time for acquiring policies that handle multi-view environments.", "full_text": "Multi-View Reinforcement Learning\n\nMinne Li \u21e4\n\nUniversity College London\nLondon, United Kingdom\nminne.li@cs.ucl.ac.uk\n\nLisheng Wu \u21e4\n\nUniversity College London\nLondon, United Kingdom\n\nlisheng.wu.17@ucl.ac.uk\n\nHaitham Bou Ammar \u2020\nUniversity College London\nLondon, United Kingdom\n\nhaitham.bouammar71@googlemail.com\n\nJun Wang\n\nUniversity College London\nLondon, United Kingdom\njunwang@cs.ucl.ac.uk\n\nAbstract\n\nThis paper is concerned with multi-view reinforcement learning (MVRL), which\nallows for decision making when agents share common dynamics but adhere to\ndifferent observation models. We de\ufb01ne the MVRL framework by extending\npartially observable Markov decision processes (POMDPs) to support more than\none observation model and propose two solution methods through observation\naugmentation and cross-view policy transfer. We empirically evaluate our method\nand demonstrate its effectiveness in a variety of environments. 
Speci\ufb01cally, we\nshow reductions in sample complexities and computational time for acquiring\npolicies that handle multi-view environments.\n\n1\n\nIntroduction\n\nIn reinforcement learning (RL), tasks are de\ufb01ned as Markov decision processes (MDPs) with state and\naction spaces, transition models, and reward functions. The dynamics of an RL agent commence by\nexecuting an action in a state of the environment according to some policy. Based on the action choice,\nthe environment responds by transitioning the agent to a new state and providing an instantaneous\nreward quantifying the quality of the executed action. This process repeats until a terminal condition\nis met. The goal of the agent is to learn an optimal action-selection rule that maximizes total-expected\nreturns from any initial state. Though minimally supervised, this framework has become a profound\ntool for decision making under uncertainty, with applications ranging from computer games [24, 30]\nto neural architecture search [42], robotics [3, 25, 27], and multi-agent systems [22, 34, 38, 40].\nCommon RL algorithms, however, only consider observations from one view of the state space [17].\nSuch an assumption can become too restrictive in real-life scenarios. To illustrate, imagine designing\nan autonomous vehicle that is equipped with multiple sensors. For such an agent to execute safe\nactions, data-fusion is necessary so as to account for all available information about the world.\nConsequently, agent policies have now to be conditioned on varying state descriptions, which in turn,\nlead to challenging representation and learning questions. In fact, acquiring good-enough policies in\nthe multi-view setting is more complex when compared to standard RL due to the increase in sample\ncomplexities needed to reason about varying views. 
If solved, however, multi-view RL will allow for\ndata-fusion, fault-tolerance to sensor deterioration, and policy generalization across domains.\nNumerous algorithms for multi-view learning in supervised tasks have been proposed. Interested\nreaders are referred to the surveys in [23, 39, 41], and references therein, for a detailed exposition.\n\n\u21e4Equal contributions.\n\u2020Honorary Lecturer at University College London\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nThough abundant in supervised learning tasks, multi-view data fusion for decision making has received\nless attention. In fact, our search revealed only a few papers attempting to target this exact problem.\nA notable algorithm is the work in [6] that proposed a double task deep Q-network for multi-view\nreinforcement learning. We believe the attempt made by the authors handles the multi-view decision\nproblem indirectly by carrying over innovations from computer vision, where they augment different-angle\ncameras into one state and feed it to a standard deep Q-network. Our attempt, on the other hand, aims to\nresolve multi-view decision making directly by learning joint models for autonomous planning. As\na by-product of our method, we arrive at a learning pipeline that allows for improvements in both\npolicy learning and feature representations.\nClosest to our work are algorithms from the domain of partially observable Markov decision processes\n(POMDPs) [17]. There, the environment\u2019s state is also hidden, and the agent is equipped with a\nsensor (i.e., an observation function) for it to build a belief of the latent state in order to execute and\nlearn its policy. Although most algorithms consider one observation function (i.e., one view), one\ncan generalize the definition of POMDPs to multiple types of observations by allowing for a joint\nobservation space, and consequently a joint observation function, across varying views. 
Though\npossible in principle, we are not aware of any algorithm from the POMDP literature targeting this\nscenario. Our problem, in fact, can become substantially harder from both the representation and\nlearning perspectives. To illustrate, consider a POMDP with only two views, the \ufb01rst being images,\nwhile the second a low-dimensional time series corresponding to, say joint angles and angular\nvelocities. Following the idea of constructing a joint observation space, one would be looking for a\nmap from history of observations (and potentially actions) to a new observation at the consequent\ntime steps. Such an observation can be an image, a time series, or both. In this setting, constructing a\njoint observation mapping is dif\ufb01cult due to the varying nature of the outputs and their occurrences in\nhistory. Due to the large sample and computational complexities involved in designing larger deep\nlearners with varying output units, and switching mechanisms to differentiate between views, we\nrather advocate for a more grounded framework by drawing inspiration from the well-established\nmulti-view supervised learning literature. Thus, leading us to multi-view reinforcement learning3.\nOur framework for multi-view reinforcement learning also shares similarities with multi-task rein-\nforcement learning, a framework that has gained considerable attention [3, 10, 11, 12, 26, 32, 36].\nParticularly, one can imagine multi-view RL as a case of multi-task RL where tasks share action\nspaces, transition models, and reward functions, but differ in their state-representations. Though\na bridge between multi-view and multi-task can be constructed, it is worth mentioning that most\nworks on multi-task RL consider same observation and action spaces but varying dynamics and/or\nreward functions [33]. As such, these methods fail to handle fusion and generalization across feature\nrepresentations that vary between domains. 
A notable exception is the work in [2], which transfers\nknowledge between task groups, with varying state and/or action spaces. Though successful,\nthis method assumes a model-free setting with linear policies. As such, it fails to efficiently handle\nhigh-dimensional environments, which require deep network policies.\nIn this paper, we contribute by introducing a framework for multi-view reinforcement learning\nthat generalizes partially observable Markov decision processes (POMDPs) to ones that exhibit\nmultiple observation models. We first derive a straightforward solution based on state augmentation\nthat demonstrates superior performance on various benchmarks when compared to state-of-the-art\nproximal policy optimization (PPO) in multi-view scenarios. We then provide an algorithm for\nmulti-view model learning, and propose a solution capable of transferring policies learned from one\nview to another. This, in turn, reduces the amount of training samples needed by around two\norders of magnitude in most tasks when compared to PPO. Finally, in another set of experiments,\nwe show that our algorithm outperforms PILCO [9] in terms of sample complexities, especially on\nhigh-dimensional and non-smooth systems, e.g., Hopper\u2074. Our contributions can be summarized as:\n\n\u2022 formalizing multi-view reinforcement learning as a generalization of POMDPs;\n\u2022 proposing two solutions based on state augmentation and policy transfer to multi-view RL;\n\u2022 demonstrating improvements in policy performance against state-of-the-art methods on a variety of control benchmarks.\n\n\u00b3We note that other titles for this work are also possible, e.g., Multi-View POMDPs, or Multi-Observation\nPOMDPs. The emphasis is on the fact that little literature has considered RL with varying sensory observations.\n\u2074It is worth noting that our experiments reveal that PILCO, for instance, faces challenges when dealing with\nhigh-dimensional systems, e.g., Hopper. 
In the future, we plan to investigate latent space Gaussian process models\nwith the aim of carrying sample efficient algorithms to high-dimensional systems.\n\n2 Multi-View Reinforcement Learning\n\nThis section introduces multi-view reinforcement learning by extending MDPs to allow for multiple\nstate representations and observation densities. We show that POMDPs can be seen as a special case\nof our framework, where inference about latent state only uses one view of the state-space.\n\n2.1 Multi-View Markov Decision Processes\n\nTo allow agents to reason about varying state representations, we generalize the notion of an MDP to\na multi-view MDP, which is defined by the tuple $\\mathcal{M}_{\\text{multi-view}} = \\langle \\mathcal{S}, \\mathcal{A}, \\mathcal{P}, \\mathcal{O}^1, \\mathcal{P}^1_{\\text{obs}}, \\ldots, \\mathcal{O}^N, \\mathcal{P}^N_{\\text{obs}} \\rangle$.\nHere, $\\mathcal{S} \\subseteq \\mathbb{R}^d$, $\\mathcal{A} \\subseteq \\mathbb{R}^m$, and $\\mathcal{P} : \\mathcal{S} \\times \\mathcal{A} \\times \\mathcal{S} \\to [0, 1]$ represent the standard MDP\u2019s state and\naction spaces, as well as the transition model, respectively.\nContrary to MDPs, multi-view MDPs incorporate additional components responsible for formally\nrepresenting multiple views belonging to different observation spaces. We use $\\mathcal{O}^j$ and $\\mathcal{P}^j_{\\text{obs}} : \\mathcal{S} \\times \\mathcal{O}^j \\to [0, 1]$\nto denote the observation space and observation model of each sensor $j \\in \\{1, \\ldots, N\\}$.\nAt some time step $t$, the agent executes an action $a_t \\in \\mathcal{A}$ according to its policy that conditions\non a history of heterogeneous observations (and potentially actions) $\\mathcal{H}_t = \\{o_1^{i_1}, \\ldots, o_t^{i_t}\\}$, where\n$o_k^{i_k} \\sim \\mathcal{P}^{i_k}_{\\text{obs}}(\\cdot | s_k)$ for $k \\in \\{1, \\ldots, T\\}$ and\u2075 $i_k \\in \\{1, \\ldots, N\\}$. We define $\\mathcal{H}_t$ to represent the history of\nobservations, and introduce a superscript $i_t$ to denote the type of view at the $t$-th time instance. Each\n$i_t$ is dependent on the observation function (availability of specific views) and conditioned on the\nenvironment\u2019s state. As per our definition, we allow for $N$ different types of observations; therefore, $i_t$\nis allowed to vary from one to $N$. 
Depending on the selected action, the environment then transitions\nto a successor state $s_{t+1} \\sim \\mathcal{P}(\\cdot | s_t, a_t)$, which is not directly observable by the agent. On the contrary,\nthe agent only receives a successor view $o_{t+1}^{i_{t+1}} \\sim \\mathcal{P}^{i_{t+1}}_{\\text{obs}}(\\cdot | s_{t+1})$ with $i_{t+1} \\in \\{1, \\ldots, N\\}$.\n\n2.2 Multi-View Reinforcement Learning Objective\n\nAs in standard literature, the goal is to learn a policy that maximizes total expected return. To\nformalize such a goal, we introduce a multi-view trajectory $\\tau^{\\mathcal{M}}$, which augments standard POMDP\ntrajectories with multi-view observations, i.e., $\\tau^{\\mathcal{M}} = [s_1, o_1^{i_1}, a_1, \\ldots, s_T, o_T^{i_T}]$, and consider finite\nhorizon cumulative rewards computed on the real state (hidden to the agent) and the agent\u2019s actions.\nWith this, the goal of the agent is to determine a policy that maximizes the following objective:\n\n$$\\max_{\\pi^{\\mathcal{M}}} \\mathbb{E}_{\\tau^{\\mathcal{M}}}\\left[ G_T\\left(\\tau^{\\mathcal{M}}\\right) \\right], \\quad \\text{where } \\tau^{\\mathcal{M}} \\sim p_{\\pi^{\\mathcal{M}}}\\left(\\tau^{\\mathcal{M}}\\right) \\text{ and } G_T\\left(\\tau^{\\mathcal{M}}\\right) = \\sum_t \\gamma^t \\mathcal{R}(s_t, a_t), \\tag{1}$$\n\nwhere $\\gamma$ is the discount factor. The last component needed for us to finalize our problem definition is\nto understand how to factor multi-view trajectory densities. Knowing that the trajectory density is\ndefined over joint observations, states, and actions, we write:\n\n$$p_{\\pi^{\\mathcal{M}}}\\left(\\tau^{\\mathcal{M}}\\right) = \\mathcal{P}^{i_T}_{\\text{obs}}\\left(o_T^{i_T} | s_T\\right) \\mathcal{P}(s_T | s_{T-1}, a_{T-1}) \\, \\pi^{\\mathcal{M}}(a_{T-1} | \\mathcal{H}_{T-1}) \\cdots \\mathcal{P}^{i_1}_{\\text{obs}}\\left(o_1^{i_1} | s_1\\right) \\mathcal{P}_0(s_1)$$\n$$= \\mathcal{P}_0(s_1) \\, \\mathcal{P}^{i_1}_{\\text{obs}}\\left(o_1^{i_1} | s_1\\right) \\prod_{t=2}^{T} \\mathcal{P}^{i_t}_{\\text{obs}}\\left(o_t^{i_t} | s_t\\right) \\mathcal{P}(s_t | s_{t-1}, a_{t-1}) \\, \\pi^{\\mathcal{M}}(a_{t-1} | \\mathcal{H}_{t-1}),$$\n\nwith $\\mathcal{P}_0(s_1)$ being the initial state distribution. The generalization above arrives with additional\nsample and computational burdens, rendering current solutions to POMDPs impractical. Among\nvarious challenges, multi-view policy representations capable of handling varying sensory signals\ncan become expensive to both learn and represent. 
That being said, we can still reason about such\nstructures and follow a policy-gradient technique to learn the parameters of the network. However, a\nmulti-view policy network needs to adhere to a crucial property, which can be regarded as a special\ncase of representation fusion networks from multi-view representation learning [23, 39, 41]. We give\nour derivation of general gradient update laws following model-free policy gradients in Sect. 3.1.\nContrary to standard POMDPs, our trajectory density is generalized to support multiple state views\nby requiring different observation models through time. Such a generalization allows us to advocate\na more efficient model-based solver that enables cross-view policy transfer (i.e., conditioning one\nview policies on another) and few-shot RL, as we shall introduce in Sect. 3.2.\n\n\u2075Please note it is possible to incorporate the actions in $\\mathcal{H}_t$ by simply introducing additional action variables.\n\n3 Solution Methods\n\nThis section presents our model-free and model-based approaches to solving multi-view reinforcement\nlearning. Given a policy network, our model-free solution derives a policy gradient theorem, which\ncan be approximated using Monte Carlo and history-dependent baselines when updating model\nparameters. We then propose a model-based alternative that learns joint dynamical (and emission)\nmodels to allow for control in the latent space and cross-view policy transfer.\n\n3.1 Model-Free Multi-View Reinforcement Learning through Observation Augmentation\n\nThe type of algorithm we employ for model-free multi-view reinforcement learning falls in the class\nof policy gradient algorithms. Given existing advancements on Monte Carlo estimates, the variance\nreduction methods (e.g., observation-based baselines $\\mathcal{B}(\\mathcal{H}_t)$), and the problem definition in Eq. (1),\nwe can proceed by giving the rule of updating the policy parameters $\\omega$ as:\n\n$$\\omega_{k+1} = \\omega_k + \\eta_k \\frac{1}{M} \\sum_{j=1}^{M} \\sum_{t=1}^{T} \\nabla_{\\omega} \\log \\pi^{\\mathcal{M}}\\left(a_t^j | \\mathcal{H}_t^j\\right) \\left( \\mathcal{R}(s_t^j, a_t^j) - \\mathcal{B}(\\mathcal{H}_t^j) \\right). \\tag{2}$$\n\nPlease refer to the appendix for the detailed derivation. While a policy gradient algorithm can be\nimplemented, the above update rule is oblivious to the availability of multiple views in the state\nspace, closely resembling standard (one-view) POMDP scenarios. This only increases the number of\nsamples required for training, as well as the variance in the gradient estimates. We thus introduce\na straightforward model-free MVRL algorithm by leveraging fusion networks from multi-view\nrepresentation learning. Specifically, we assume that corresponding observations from all views, i.e.,\nobservations that share the same latent state, are accessible during training. Although the parameter\nupdate rule is exactly the same as defined in Eq. (2), this method manages to utilize the knowledge of\nshared dynamics across different views, thus improving over independent model-free learners, i.e.,\nlearners that regard each view as a single environment and learn a policy for it.\n\n3.2 Model-Based Multi-View Reinforcement Learning\n\nWe now propose a model-based approach that learns approximate transition models\nfor multi-view RL allowing for policies that are simpler to learn and represent. Our\nlearnt model can also be used for cross-view policy transfer (i.e., conditioning one\nview policies on another), few-shot RL, and typical model-based RL enabling policy\nimprovements through back-propagation in learnt joint models.\n\nFigure 1: The graphical model of multi-view learning.\n\n3.2.1 Multi-View Model Learning\n\nThe purpose of model-learning is to abstract a hidden joint model shared across varying views of the\nstate space. For accurate predictions, we envision the generative model in Fig. (1). 
Here, observed\nrandom variables are denoted by $o_t^{i_t}$ for a time-step $t$ and $i_t \\in \\{1, \\ldots, N\\}$, while $s_t \\in \\mathbb{R}^d$ represents\nlatent variables that temporally evolve conditioned on applied actions. As multiple observations are\nallowed, our model generalizes to multi-view by supporting varying emission models depending on\nthe nature of observations. Crucially, we do not assume Markov transitions in the latent space as we\nbelieve that reasoning about multiple views requires more than one-step history information.\nTo define our optimization objective, we follow standard variational inference. Before deriving the\nvariational lower bound, however, we first introduce additional notation to ease exposition. Recall that\n$o_t^{i_t}$ is the observation vector at time-step $t$ for view $i_t$\u2076. To account for action conditioning, we further\naugment $o_t^{i_t}$ with executed controls, leading to $o_t^{i_t^C} = [o_t^{i_t}, a_t]^{\\mathsf{T}}$ for $t \\in \\{1, \\ldots, T-1\\}$, and $o_T^{i_t^C} = o_T^{i_t}$\nfor the last time-step $T$. Analogous to standard latent-variable models, our goal is to learn latent\ntransitions that maximize the marginal likelihood of observations as $\\max \\log p(o_1^{i_t^C}, \\ldots, o_T^{i_t^C}) = \\max \\log p(o_1^{i_t}, a_1, \\ldots, o_{T-1}^{i_t}, a_{T-1}, o_T^{i_t})$.\nAccording to the graphical model in Fig. (1), observations\nare generated from latent variables which are temporally evolving. Hence, we regard observations as\nthe result of a process in which latent states have been marginalized out:\n\n$$p\\left(o^{\\mathcal{M}^C}_{1:T}\\right) = \\prod_{i_t=1}^{N} p\\left(o^{i_t^C}\\right) = \\prod_{i_t=1}^{N} \\int_{s_1} \\cdots \\int_{s_T} p(o_1^{i_t}, a_1, \\ldots, o_{T-1}^{i_t}, a_{T-1}, o_T^{i_t}, \\underbrace{s_1, \\ldots, s_T}_{\\text{latent variables}}) \\, \\underbrace{ds_1 \\ldots ds_T}_{\\text{marginalization}},$$\n\nwhere $o^{\\mathcal{M}^C}_{1:T}$ collects all multi-view observations and actions across time-steps. To devise an algorithm\nthat reasons about latent dynamics, two components need to be better understood. The first relates\nto factoring our joint density, while the second to approximating multi-dimensional integrals. To\nfactorize our joint density, we follow the modeling assumptions in Fig. (1) and write:\n\n$$p\\left(o^{\\mathcal{M}^C}_{1:T}, s_1, \\ldots, s_T\\right) = \\prod_{i_t=1}^{N} p_{\\theta_1^{i_t}}(s_1) \\prod_{t=2}^{T} p_{\\theta_2^{i_t}}(o_t^{i_t} | s_t) \\, p_{\\theta_3^{i_t}}(s_t | \\hat{\\mathcal{H}}_t),$$\n\nwhere in the last step we introduced $\\theta^{i_t} = \\{\\theta_1^{i_t}, \\theta_2^{i_t}, \\theta_3^{i_t}\\}$ to emphasize modeling parameters that\nneed to be learnt, and $\\hat{\\mathcal{H}}_t$ to concatenate state and action histories back to $s_1$.\n\n\u2076Different from the model-free solver introduced in Sect. 3.1, we don\u2019t assume the accessibility to all views.\n\nHaving dealt with density factorization, another problem to circumvent is that of computing intractable\nmulti-dimensional integrals. This can be achieved by introducing a variational distribution over latent\nvariables, $q(s_1, \\ldots, s_T | o^{\\mathcal{M}^C}_{1:T})$, which transforms integration into an optimization problem as\n\n$$\\log p\\left(o^{\\mathcal{M}^C}_{1:T}\\right) = \\log \\int_{s_{1:T}} \\frac{q(s_1, \\ldots, s_T | o^{\\mathcal{M}^C}_{1:T})}{q(s_1, \\ldots, s_T | o^{\\mathcal{M}^C}_{1:T})} \\, p(o^{\\mathcal{M}^C}_{1:T}, s_1, \\ldots, s_T) \\, ds_{1:T} \\geq \\int_{s_{1:T}} q(s_1, \\ldots, s_T | o^{\\mathcal{M}^C}_{1:T}) \\log \\left[ \\frac{p(o^{\\mathcal{M}^C}_{1:T}, s_1, \\ldots, s_T)}{q(s_1, \\ldots, s_T | o^{\\mathcal{M}^C}_{1:T})} \\right] ds_{1:T},$$\n\nwhere we used the concavity of the logarithm and Jensen\u2019s inequality in the second step of the derivation.\nWe assume a mean-field decomposition for the variational distribution, $q(s_1, \\ldots, s_T | o^{\\mathcal{M}^C}_{1:T}) = \\prod_{i_t=1}^{N} q_1^{i_t}(s_1) \\prod_{t=2}^{T} q_t^{i_t}(s_t | \\mathcal{H}_t)$,\nwith $\\mathcal{H}_t$ being the observation and action history. This leads to:\n\n$$\\log p\\left(o^{\\mathcal{M}^C}_{1:T}\\right) \\geq \\sum_{i_t=1}^{N} \\left[ \\sum_{t=1}^{T} \\left[ \\mathbb{E}_{q_t^{i_t}} \\left[ \\log p_{\\theta_2^{i_t}}(o_t^{i_t} | s_t) \\right] - \\mathrm{KL}\\left( q_t^{i_t}(s_t | \\mathcal{H}_t) \\,\\|\\, p_{\\theta_3^{i_t}}(s_t | \\hat{\\mathcal{H}}_t) \\right) \\right] - \\mathrm{KL}\\left( q_1^{i_t}(s_1) \\,\\|\\, p_{\\theta_1^{i_t}}(s_1) \\right) \\right],$$\n\nwhere $\\mathrm{KL}(p \\| q)$ denotes the Kullback\u2013Leibler divergence between two distributions. Assuming\nshared variational parameters (e.g., one variational network), model learning can be formulated as:\n\n$$\\max_{\\theta^{m}, \\phi} \\sum_{i_t=1}^{N} \\sum_{t=1}^{T} \\left[ \\mathbb{E}_{q_{\\phi^{i_t}}(s_t | \\mathcal{H}_t)} \\left[ \\log p_{\\theta^{i_t}}(o_t^{i_t} | s_t) \\right] - \\mathrm{KL}\\left( q_{\\phi^{i_t}}(s_t | \\mathcal{H}_t) \\,\\|\\, p_{\\theta^{i_t}}(s_t | \\hat{\\mathcal{H}}_t) \\right) \\right]. \\tag{3}$$\n\nIntuitively, Eq. (3) fits the model by maximizing multi-view observation likelihood, while being\nregularized through the KL-term. Clearly, this is similar to the standard evidence lower-bound with\nadditional components related to handling multi-view types of observations.\n\n3.2.2 Distribution Parameterization and Implementation Details\n\nTo finalize our problem definition, choices for the modeling and variational distributions can ultimately\nbe problem-dependent. To encode transitions beyond Markov assumptions, we use a memory-based\nmodel $g_{\\psi}$ (e.g., a recurrent neural network) to serve as the history encoder and future predictor, i.e.,\n$h_t = g_{\\psi}(s_{t-1}, h_{t-1}, a_{t-1})$. Introducing memory splits the model into stochastic and deterministic\nparts, where the deterministic part is the memory model $g_{\\psi}$, while the stochastic part is the conditional\nprior distribution on latent states $s_t$, i.e., $p_{\\theta^{i_t}}(s_t | h_t)$. We assume that this distribution is Gaussian\nwith its mean and variance parameterized by a feed-forward neural network taking $h_t$ as inputs.\nAs for the observation model, the exact form is domain-specific depending on available observation\ntypes. In the case when our observation is a low-dimensional vector, we chose a Gaussian parameterization\nwith mean and variance output by a feed-forward network as above. When dealing with\nimages, we parameterized the mean by a deconvolutional neural network [13] and kept an identity\ncovariance. 
The variational distribution $q$ can thus, respectively, be parameterized by a feed-forward\nneural network and a convolutional neural network [18] for these two types of views.\nWith the above assumptions, we can now derive the training loss used in our experiments. First, we\nrewrite Eq. (3) as\n\n$$\\max_{\\theta^{m}, \\phi} \\sum_{i_t=1}^{N} \\sum_{t=1}^{T} \\left[ \\mathbb{E}_{q_{\\phi^{i_t}}} \\left[ \\log p_{\\theta^{i_t}}(o_t^{i_t} | s_t) \\right] + \\mathbb{E}_{q_{\\phi^{i_t}}} \\left[ \\log p_{\\theta^{i_t}}(s_t | h_t) \\right] + \\mathbb{H}\\left[ q_{\\phi^{i_t}} \\right] \\right], \\tag{4}$$\n\nwhere $\\mathbb{H}$ denotes the entropy, and $q_{\\phi^{i_t}}$ represents $q_{\\phi^{i_t}}(s_t | h_t, o_t)$. From the first two terms in Eq. (4),\nwe realize that the optimization problem for multi-view model learning consists of two parts: 1)\nobservation reconstruction, 2) transition prediction. Observation reconstruction operates by: 1)\ninferring the latent state $s_t$ from the observation $o_t$ using the variational model, and 2) decoding\n$s_t$ to $\\tilde{o}_t$ (an approximation of $o_t$). Transition predictions, on the other hand, operate by feeding\nthe previous latent state $s_{t-1}$ and the previous action $a_{t-1}$ to predict the next latent state $\\hat{s}_t$ via\nthe memory model. Both parts are optimized by maximizing the log-likelihood under a Gaussian\ndistribution with unit variance. This equates to minimizing the mean squared error between model\noutputs and actual variable values:\n\n$$\\mathcal{L}_r(\\theta^{i_t}, \\phi^{i_t}) = -\\sum_{i_t=1}^{N} \\sum_{t=1}^{T} \\mathbb{E}_{q_{\\phi^{i_t}}} \\left[ \\log p_{\\theta^{i_t}}(o_t^{i_t} | s_t) \\right] = \\sum_{i_t=1}^{N} \\sum_{t=1}^{T} \\| \\tilde{o}_t^{i_t} - o_t^{i_t} \\|_2,$$\n$$\\mathcal{L}_p(\\psi, \\theta^{i_t}, \\phi^{i_t}) = -\\sum_{i_t=1}^{N} \\sum_{t=1}^{T} \\mathbb{E}_{q_{\\phi^{i_t}}} \\left[ \\log p_{\\theta^{i_t}}(s_t | h_t) \\right] = \\sum_{i_t=1}^{N} \\sum_{t=1}^{T} \\| \\hat{s}_t^{i_t} - s_t^{i_t} \\|_2,$$\n\nwhere $\\| \\cdot \\|_2$ is the Euclidean norm.\nOptimizing Eq. (4) also requires maximizing the entropy of the variational model $\\mathbb{H}[q_{\\phi^{i_t}}(s_t | h_t, o_t)]$.\nIntuitively, the variational model aims to increase the element-wise similarity of the latent state $s$\namong corresponding observations [14]. Thus, we represent the entropy term as:\n\n$$\\mathcal{L}_{\\mathbb{H}}(\\theta^{i_t}, \\phi^{i_t}) = -\\sum_{i_t=1}^{N} \\sum_{t=1}^{T} \\mathbb{H}\\left[ q_{\\phi^{i_t}}(s_t | h_t, o_t) \\right] \\equiv \\sum_{i_t=2}^{N} \\sum_{t=1}^{T} \\| \\bar{\\mu}_t^{i_t} - \\bar{\\mu}_t^{1} \\|_2, \\tag{5}$$\n\nwhere $\\bar{\\mu}^{i_t} \\in \\mathbb{R}^K$ is the average value of the mean of the diagonal Gaussian representing\n$q_{\\phi^{i_t}}(s_t | h_t, o_t)$ for each training batch.\n\n3.2.3 Policy Transfer and Few-Shot Reinforcement Learning\n\nAs introduced in Section 2.2, trajectory densities in MVRL generalize to support multiple state views\nby requiring different observation models through time. Such a generalization enables us to achieve\ncross-view policy transfer and few-shot RL, where we only require very little data from a specific\nview to train the multi-view model. This can then be used for action selection by: 1) inferring the\ncorresponding latent state, and 2) feeding the latent state into the policy learned from another view\nwith greater accessibility. Details can be found in Appendix A.\nConcretely, our learned models $p_{\\theta^{i_t}}$ should be able to reconstruct corresponding observations for\nviews with shared underlying dynamics (latent state $s$). 
During model learning, we thus validate the\nvariational and observation model by: 1) inferring the latent state $s$ from the first view\u2019s observation\n$o^1$, and 2) comparing the reconstructed corresponding observation from other views $\\tilde{o}^{i_t}$ with the\nactual observation $o^{i_t}$ through calculating the transformation loss: $\\mathcal{L}_t = \\sum_{i_t=2}^{N} \\| \\tilde{o}^{i_t} - o^{i_t} \\|_2$.\nSimilarly, the memory model can be validated by: 1) reconstructing the predicted latent state $\\hat{s}^1$ of the\nfirst view using the observation model of other views to get $\\hat{o}^{i_t}$, and 2) comparing $\\hat{o}^{i_t}$ with the actual\nobservation $o^{i_t}$, through calculating prediction transformation losses: $\\mathcal{L}_{pt} = \\sum_{i_t=2}^{N} \\| \\hat{o}^{i_t} - o^{i_t} \\|_2$.\n\n4 Experiments\n\nWe evaluate our method on a variety of dynamical systems varying in dimensions of their state\nrepresentation. We consider both high and low dimensional problems to demonstrate the effectiveness\nof our model-free and model-based solutions. On the model-free side, we demonstrate performance\nagainst state-of-the-art methods, such as Proximal Policy Optimization (PPO) [29]. On the model-based\nside, we are interested in knowing whether our model successfully learns shared dynamics\nacross varying views, and if these can then be utilized to enable efficient control.\nWe consider dynamical systems from the Atari suite, Roboschool [29], PyBullet [8], and the Highway\nenvironments [20]. We generate varying views either by transforming state representations, introducing\nnoise, or by augmenting state variables with additional dummy components. When considering\nthe game of Pong, we allow for varying observations by introducing various transformations to the\nimage representing the state, e.g., rotation, flipping, and horizontal swapping. 
We then compare our\nmulti-view model with a state-of-the-art modeling approach titled World Models [16]; see Section 4.1.\nGiven successful modeling results, we proceed to demonstrate control in both model-free and\nmodel-based scenarios in Section 4.2. Our results demonstrate that although multi-view model-free\nalgorithms can present advantages when compared to standard RL, multi-view model-based techniques\nare considerably more efficient in terms of sample complexities.\n\nFigure 2 (panels: (a) $o$ and $t$; (b) $o$ and $h$; (c) $o$ and $c$; (d) $o$ and $m$): Training multi-view models on Atari Pong. Legend: p: prediction loss $\\mathcal{L}_p$; r: reconstruction\nloss $\\mathcal{L}_r$; t: transformation loss $\\mathcal{L}_t$; pt: predicted transformation loss $\\mathcal{L}_{pt}$. These results demonstrate\nthat our method correctly converges in terms of loss values.\n\nFigure 3 (panels: (a) $\\log(\\sigma^2)$, multi-view; (b) distance of $\\mu_o$, $\\mu_t$; (c) $\\log(\\sigma^2)$, WM): Difference between inferred latent states from $o$ and $t$. Results demonstrating that our\nmethod is capable of learning key elements, a property essential for multi-view dynamics learning.\nThese results also demonstrate that extracting such key elements is challenging for world models.\n\n4.1 Modeling Results\n\nTo evaluate multi-view model learning we generated five views by varying state representations (i.e.,\nimages) in the Atari Pong environment. We kept dynamics unchanged and considered four sensory\ntransformations of the observation frame $o$. Namely, we considered varying views as: 1) transposed\nimages $t$, 2) horizontally-swapped images $h$, 3) inverse images $c$, and 4) mirror-symmetric\nimages $m$. Exact details on how these views have been generated can be found in Appendix C.1.1.\nFor simplicity, we built pair-wise multi-view models between $o$ and one of the above four variants.\nFig. (2) illustrates convergence results of multi-view prediction and transformation for different views\nof Atari Pong. 
Fig. (3) further investigates the learnt shared dynamics among different views ($o$ and\n$t$). Fig. (3a) illustrates the converged $\\log(\\sigma^2)$, i.e., the log-variance of the latent state\nvariable, in $o$ and $t$. Observe that a small group of elements (indexed as 14, 18, 21, 24 and 29)\nhave relatively low variance in both views, thus keeping stable values across different observations.\nWe consider these elements as the critical part in representing the shared dynamics and define them\nas key elements. Clearly, learning a shared group of key elements across different views is the target\nin the multi-view model. Results in Fig. (3b) illustrate the distance between $\\mu_o$ and $\\mu_t$ for the\nmulti-view model, demonstrating convergence. As the same group of elements is, in fact, close\nto the key elements learnt by the multi-view model, we conclude that we can capture shared dynamics\nacross different views. Further analysis of these key elements is also presented in the appendix.\nIn Fig. (3c), we also report the converged value of $\\log(\\sigma^2)$ of World Models (WM) [16] under the\nmulti-view setting. Although the world model can still infer the latent dynamics of both environments,\nthe large difference between learnt dynamics demonstrates that varying views represent a challenge\nto world models; our algorithm, however, is capable of capturing such hidden shared dynamics.\n\n4.2 Policy Learning Results\n\nGiven successful modeling results, in this section we present results on controlling systems across\nmultiple views within an RL setting. Namely, we evaluate our multi-view RL approach on several\nhigh and low dimensional tasks. 
Our systems consisted of: 1) Cartpole ($\\mathcal{O} \\subseteq \\mathbb{R}^4$, $a \\in \\{0, 1\\}$),\nwhere the goal is to balance a pole by applying left or right forces to a pivot, 2) hopper ($\\mathcal{O} \\subseteq \\mathbb{R}^{15}$, $\\mathcal{A} \\subseteq \\mathbb{R}^3$),\nwhere the focus is on locomotion such that the dynamical system hops forward as\nfast as possible, 3) RACECAR ($\\mathcal{O} \\subseteq \\mathbb{R}^2$, $\\mathcal{A} \\subseteq \\mathbb{R}^2$), where the observation is the position $(x, y)$\nof a randomly placed ball in the camera frame and the reward is based on the distance to the ball,\nand 4) parking ($\\mathcal{O} \\subseteq \\mathbb{R}^6$, $\\mathcal{A} \\subseteq \\mathbb{R}^2$), where an ego-vehicle must park in a given space with\nan appropriate heading (a goal-conditioned continuous control task).\nThe evaluation metric we used was defined as the average testing return across all\nviews with respect to the amount of training samples (number of interactions). We\nuse the same setting in all experiments to generate multiple views.\nNamely, the original environment observation is used as the first view,\nand adding dummy dimensions (two dims) and large-scale noises (0.1 after observation normalization) to the original observation generates the\nsecond view. Such a setting would allow us to understand if our model can learn shared dynamics\nwith mis-specified state representations, and if such a model can then be used for control.\n\nFigure 4 (panels: (a) Cartpole Policy Learning; (b) Hopper Policy Learning; (c) RACECAR Policy Learning; (d) Parking Policy Learning): Policy learning results demonstrating that our method outperforms\nothers in terms of sample complexities.\n\nFor all experiments, we trained the multi-view model with a few samples gathered from all views\nand used the resultant for policy transfer (MV-PT) between views during the test period. We chose\nstate-of-the-art PPO [29], an algorithm based on the work in [27], as the baseline by training separate\nmodels on different views and aggregating the results. 
The multi-view model-free (MV-MF)\nmethod is trained by augmenting PPO with concatenated observations. Relevant parameter values\nand implementation details are listed in Appendix C.2.\nFig. (4) shows the average testing return (the average testing success rate for the parking\ntask) from all views. On the Cartpole and parking tasks, our MV-MF algorithm shows\nimprovements when compared to strong model-free baselines such as PPO, showing the advantage of\nleveraging information from multiple views over training independently within each view. On the\nother hand, multi-view model-based techniques give the best performance on all tasks and reduce\nthe number of samples needed by around two orders of magnitude in most tasks. This indicates that\nMV-PT greatly reduces the amount of training samples required to reach good performance.\nWe also conducted a pure model-based RL experiment and compared our multi-view dynamic\nmodel against 1) PILCO [9], which can be regarded as a state-of-the-art model-based\nsolution using a Gaussian Process dynamic model, 2) PILCO with augmented states, i.e., a single PILCO model to approximate the\ndata distribution from all views, and 3) a multilayer perceptron (MLP). We use the same planning\nalgorithm for all model-based methods, e.g., hill climbing or Model Predictive Control (MPC) [25],\ndepending on the task at hand. Table 1 shows the result on the Cartpole environment, where we\nevaluate all methods by the amount of interactions till success. Each training rollout has at most\n40 steps of interactions and we define success as reaching an average testing return of 195.0.\nAlthough the multi-view model performs slightly worse than PILCO in the Cartpole task, we found\nthat model-based methods alone cannot perform well on tasks without suitable environment rewards.\nFor example, the reward function in hopper primarily encourages higher speed and lower energy\nconsumption. 
Such high-level reward functions make it hard for model-based methods to succeed;\ntherefore, the results of model-based algorithms on all tasks are lower than those obtained with specifically\ndesigned reward functions. Tailoring reward functions, though interesting, is out of the scope of this\npaper. MV-PT, on the other hand, outperforms others significantly, see Fig. (4).\n\nTable 1: Model-based RL result on the Cartpole.\nMV-MB | PILCO | PILCO-aug | MLP\n381 \u00b1 28 | 240 | 320 | 2334 \u00b1 358\n\n5 Related Work\n\nOur work has extended model-based RL to the multi-view scenario. Model-based RL for POMDPs\nhas been shown to be more effective than model-free alternatives in certain tasks [37, 35, 21, 19].\nOne of the classical combinations of model-based and model-free algorithms is Dyna-Q [31], which\nlearns the policy from both the model and the environment by supplementing real-world on-policy\nexperiences with simulated trajectories. However, using trajectories from a non-optimal or biased\nmodel can lead to learning a poor policy [15]. To model the world environment of Atari Games,\nautoencoders have been used to predict the next observation and environment rewards [19]. Some\nprevious works [28, 16, 19] maintain a recurrent architecture to model the world using unsupervised\nlearning and have demonstrated its efficiency in helping RL agents in complex environments. Mb-Mf [25] is\na framework bridging the gap between model-free and model-based methods by employing MPC\nto pre-train a policy within the learned model before training it with a standard model-free method.\nHowever, these models can only be applied to a single environment and need to be built from scratch\nfor new environments. Although using a similar recurrent architecture, our work differs from the above\nworks by learning the shared dynamics over multiple views. 
Also, many of the above advancements\nare orthogonal to our proposed approach, which can benefit from model ensembles, e.g.,\nby pre-training the model-free policy within the multi-view model when reward models are accessible.\nAnother related research area is multi-task learning (or meta-learning). To achieve multi-task learning,\nrecurrent architectures [10, 36] have also been used to learn to reinforcement learn by adapting to\ndifferent MDPs automatically. These have been shown to be comparable to the UCB1 algorithm [4] on\nbandit problems. Meta-learning shared hierarchies (MLSH) [12] shares sub-policies among different\ntasks to achieve the goal in the training process, where high hierarchy actions are obtained and reused\nin other tasks. The model-agnostic meta-learning algorithm (MAML) [11] minimizes the total error\nacross multiple tasks by locally conducting few-shot learning to find the optimal parameters for\nboth supervised learning and RL. Actor-mimic [26] distills multiple pre-trained DQNs on different\ntasks into one single network to accelerate the learning process by initializing the learning model\nwith learned parameters of the distilled network. To achieve promising results, these pre-trained\nDQNs have to be expert policies. Distral [32] learns multiple tasks jointly and trains a shared policy\nas the \"centroid\" by distillation. Concurrently with our work, ADRL [5] has extended model-free\nRL to multi-view environments and proposed an attention-based policy aggregation method based\non the Q-value of the actor-critic worker for each view. Most of the above approaches consider the\nproblems within the model-free RL paradigm and focus on finding the common structure in the policy\nspace. However, model-free approaches require large amounts of data to explore in high-dimensional\nenvironments. 
In contrast, we explicitly maintain a multi-view dynamic model to capture the latent\nstructures and dynamics of the environment, thus having more stable correlation signals.\nSome algorithms from meta-learning have been adapted to the model-based setting [1, 7]. These\nfocused on model adaptation when the model is incomplete, or the underlying MDPs are evolving.\nBy taking the unlearnt model as a new task and continuously learning new structures, the agent can\nkeep its model up to date. Different from these approaches, we focus on how to establish the common\ndynamics over compact representations of observations generated from different emission models.\n\n6 Conclusions\n\nIn this paper, we proposed multi-view reinforcement learning as a generalization of partially observable\nMarkov decision processes that exhibit multiple observation densities. We derive model-free\nand model-based solutions to multi-view reinforcement learning, and demonstrate the effectiveness\nof our method on a variety of control benchmarks. Notably, we show that model-free multi-view\nmethods through observation augmentation significantly reduce the number of training samples when\ncompared to state-of-the-art reinforcement learning techniques, e.g., PPO, and demonstrate that\nmodel-based approaches through cross-view policy transfer allow for extremely efficient learners\nneeding significantly fewer training samples.\nThere are multiple interesting avenues for future work. First, we would like to apply our technique\nto real-world robotic systems such as self-driving cars, and second, use our method for transferring\nbetween varying views across domains.\n\nReferences\n[1] M. Al-Shedivat, T. Bansal, Y. Burda, I. Sutskever, I. Mordatch, and P. Abbeel. Continuous\nadaptation via meta-learning in nonstationary and competitive environments. arXiv preprint\narXiv:1710.03641, 2017.\n\n[2] H. B. Ammar, E. Eaton, J. M. Luna, and P. Ruvolo. 
Autonomous cross-domain knowledge transfer\nin lifelong policy gradient reinforcement learning. In Proceedings of the 24th International\nConference on Artificial Intelligence, IJCAI\u201915, pages 3345\u20133351. AAAI Press, 2015.\n\n[3] H. B. Ammar, E. Eaton, P. Ruvolo, and M. E. Taylor. Online multi-task learning for policy\ngradient methods. In Proceedings of the 31st International\nConference on Machine Learning - Volume 32, ICML\u201914, pages II\u20131206\u2013II\u20131214. JMLR.org,\n2014.\n\n[4] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem.\nMachine Learning, 47(2):235\u2013256, May 2002.\n\n[5] E. Barati and X. Chen. An actor-critic-attention mechanism for deep reinforcement learning in\nmulti-view environments. In Proceedings of the Twenty-Eighth International Joint Conference\non Artificial Intelligence, IJCAI-19, pages 2002\u20132008. International Joint Conferences on\nArtificial Intelligence Organization, 7 2019.\n\n[6] J. Chen, T. Bai, X. Huang, X. Guo, J. Yang, and Y. Yao. Double-task deep q-learning with\nmultiple views. In Proceedings of the IEEE International Conference on Computer Vision,\npages 1050\u20131058, 2017.\n\n[7] I. Clavera, A. Nagabandi, R. S. Fearing, P. Abbeel, S. Levine, and C. Finn. Learning to adapt:\nMeta-learning for model-based control. arXiv preprint arXiv:1803.11347, 2018.\n\n[8] E. Coumans and Y. Bai. Pybullet, a python module for physics simulation for games, robotics\nand machine learning. http://pybullet.org, 2016\u20132019.\n\n[9] M. Deisenroth and C. E. Rasmussen. Pilco: A model-based and data-efficient approach to policy\nsearch. In Proceedings of the 28th International Conference on machine learning (ICML-11),\npages 465\u2013472, 2011.\n\n[10] Y. Duan, J. Schulman, X. Chen, P. L. Bartlett, I. Sutskever, and P. Abbeel. RL\u00b2: Fast reinforcement\nlearning via slow reinforcement learning. 
arXiv preprint arXiv:1611.02779, 2016.\n\n[11] C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep\nnetworks. arXiv preprint arXiv:1703.03400, 2017.\n\n[12] K. Frans, J. Ho, X. Chen, P. Abbeel, and J. Schulman. Meta learning shared hierarchies. arXiv\npreprint arXiv:1710.09767, 2017.\n\n[13] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and\nY. Bengio. Generative adversarial nets. In Advances in neural information processing systems,\npages 2672\u20132680, 2014.\n\n[14] A. Gretton, K. M. Borgwardt, M. Rasch, B. Sch\u00f6lkopf, and A. J. Smola. A kernel method for\nthe two-sample-problem. In Advances in NIPS, pages 513\u2013520, 2007.\n\n[15] S. Gu, T. Lillicrap, I. Sutskever, and S. Levine. Continuous deep q-learning with model-based\nacceleration. In ICML, pages 2829\u20132838, 2016.\n\n[16] D. Ha and J. Schmidhuber. Recurrent world models facilitate policy evolution. In Advances in\nNeural Information Processing Systems, pages 2455\u20132467, 2018.\n\n[17] L. P. Kaelbling, M. L. Littman, and A. R. Cassandra. Planning and acting in partially observable\nstochastic domains. Artif. Intell., 101(1-2):99\u2013134, May 1998.\n\n[18] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional\nneural networks. In Advances in neural information processing systems, pages 1097\u20131105,\n2012.\n\n[19] F. Leibfried, N. Kushman, and K. Hofmann. A deep learning approach for joint video frame\nand reward prediction in atari games. arXiv preprint arXiv:1611.07078, 2016.\n\n[20] E. Leurent. An environment for autonomous driving decision-making. https://github.com/\neleurent/highway-env, 2018.\n\n[21] S. Levine and P. Abbeel. Learning neural network policies with guided policy search under\nunknown dynamics. In Advances in Neural Information Processing Systems, pages 1071\u20131079,\n2014.\n\n[22] M. Li, Z. Qin, Y. Jiao, Y. Yang, J. 
Wang, C. Wang, G. Wu, and J. Ye. Efficient ridesharing\norder dispatching with mean field multi-agent reinforcement learning. In The World Wide Web\nConference, pages 983\u2013994. ACM, 2019.\n\n[23] Y. Li, M. Yang, and Z. M. Zhang. A survey of multi-view representation learning. IEEE\nTransactions on Knowledge and Data Engineering, 2018.\n\n[24] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves,\nM. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement\nlearning. Nature, 518(7540):529, 2015.\n\n[25] A. Nagabandi, G. Kahn, R. S. Fearing, and S. Levine. Neural network dynamics for model-based\ndeep reinforcement learning with model-free fine-tuning. In 2018 IEEE International\nConference on Robotics and Automation (ICRA), pages 7559\u20137566. IEEE, 2018.\n\n[26] E. Parisotto, J. L. Ba, and R. Salakhutdinov. Actor-mimic: Deep multitask and transfer\nreinforcement learning. arXiv preprint arXiv:1511.06342, 2015.\n\n[27] J. Peters and S. Schaal. Natural actor-critic. Neurocomput., 71(7-9):1180\u20131190, Mar. 2008.\n\n[28] J. Schmidhuber. On learning to think: Algorithmic information theory for novel combinations\nof reinforcement learning controllers and recurrent neural world models. arXiv preprint\narXiv:1511.09249, 2015.\n\n[29] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization\nalgorithms. arXiv preprint arXiv:1707.06347, 2017.\n\n[30] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker,\nM. Lai, A. Bolton, et al. Mastering the game of go without human knowledge. Nature,\n550(7676):354, 2017.\n\n[31] R. S. Sutton. Integrated architectures for learning, planning, and reacting based on approximating\ndynamic programming. In Machine Learning Proceedings 1990, pages 216\u2013224. Elsevier,\n1990.\n\n[32] Y. Teh, V. Bapst, W. M. Czarnecki, J. Quan, J. Kirkpatrick, R. Hadsell, N. 
Heess, and R. Pascanu.\nDistral: Robust multitask reinforcement learning. In Advances in NIPS, pages 4496\u20134506,\n2017.\n\n[33] Y. Teh, V. Bapst, W. M. Czarnecki, J. Quan, J. Kirkpatrick, R. Hadsell, N. Heess, and R. Pascanu.\nDistral: Robust multitask reinforcement learning. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach,\nR. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information\nProcessing Systems 30, pages 4496\u20134506. Curran Associates, Inc., 2017.\n\n[34] Z. Tian, Y. Wen, Z. Gong, F. Punakkath, S. Zou, and J. Wang. A regularized opponent model\nwith maximum entropy objective. In Proceedings of the Twenty-Eighth International Joint\nConference on Artificial Intelligence, IJCAI-19, pages 602\u2013608. International Joint Conferences\non Artificial Intelligence Organization, 7 2019.\n\n[35] N. Wahlstr\u00f6m, T. B. Sch\u00f6n, and M. P. Deisenroth. From pixels to torques: Policy learning with\ndeep dynamical models. arXiv preprint arXiv:1502.02251, 2015.\n\n[36] J. X. Wang, Z. Kurth-Nelson, D. Tirumala, H. Soyer, J. Z. Leibo, R. Munos, C. Blundell, D. Kumaran,\nand M. Botvinick. Learning to reinforcement learn. arXiv preprint arXiv:1611.05763,\n2016.\n\n[37] M. Watter, J. Springenberg, J. Boedecker, and M. Riedmiller. Embed to control: A locally linear\nlatent dynamics model for control from raw images. In Advances in NIPS, pages 2746\u20132754,\n2015.\n\n[38] Y. Wen, Y. Yang, R. Luo, J. Wang, and W. Pan. Probabilistic recursive reasoning for multi-agent\nreinforcement learning. In International Conference on Learning Representations, 2019.\n\n[39] C. Xu, D. Tao, and C. Xu. A survey on multi-view learning. arXiv preprint arXiv:1304.5634,\n2013.\n\n[40] Y. Yang, R. Luo, M. Li, M. Zhou, W. Zhang, and J. Wang. Mean field multi-agent reinforcement\nlearning. In J. Dy and A. 
Krause, editors, Proceedings of the 35th International Conference on\nMachine Learning, volume 80 of Proceedings of Machine Learning Research, pages 5571\u20135580,\nStockholmsm\u00e4ssan, Stockholm Sweden, 10\u201315 Jul 2018. PMLR.\n\n[41] J. Zhao, X. Xie, X. Xu, and S. Sun. Multi-view learning overview: Recent progress and new\nchallenges. Information Fusion, 38:43\u201354, 2017.\n\n[42] B. Zoph and Q. V. Le. Neural architecture search with reinforcement learning. arXiv preprint\narXiv:1611.01578, 2016.\n", "award": [], "sourceid": 824, "authors": [{"given_name": "Minne", "family_name": "Li", "institution": "University College London"}, {"given_name": "Lisheng", "family_name": "Wu", "institution": "UCL"}, {"given_name": "Jun", "family_name": "WANG", "institution": "UCL"}, {"given_name": "Haitham", "family_name": "Bou Ammar", "institution": "UCL"}]}