{"title": "Multiple Futures Prediction", "book": "Advances in Neural Information Processing Systems", "page_first": 15424, "page_last": 15434, "abstract": "Temporal prediction is critical for making intelligent and robust decisions in complex dynamic environments. Motion prediction needs to model the inherently uncertain future which often contains multiple potential outcomes, due to multi-agent interactions and the latent goals of others. Towards these goals, we introduce a probabilistic framework that efficiently learns latent variables to jointly model the multi-step future motions of agents in a scene. Our framework is data-driven and learns semantically meaningful latent variables to represent the multimodal future, without requiring explicit labels. Using a dynamic attention-based state encoder, we learn to encode the past as well as the future interactions among agents, efficiently scaling to any number of agents. Finally, our model can be used for planning via computing a conditional probability density over the trajectories of other agents given a hypothetical rollout of the ego agent. We demonstrate our algorithms by predicting vehicle trajectories of both simulated and real data, demonstrating the state-of-the-art results on several vehicle trajectory datasets.", "full_text": "Multiple Futures Prediction\n\nYichuan Charlie Tang\n\nApple Inc.\n\nyichuan_tang@apple.com\n\nRuslan Salakhutdinov\n\nApple Inc.\n\nrsalakhutdinov@apple.com\n\nAbstract\n\nTemporal prediction is critical for making intelligent and robust decisions in com-\nplex dynamic environments. Motion prediction needs to model the inherently\nuncertain future which often contains multiple potential outcomes, due to multi-\nagent interactions and the latent goals of others. Towards these goals, we introduce\na probabilistic framework that ef\ufb01ciently learns latent variables to jointly model\nthe multi-step future motions of agents in a scene. 
Our framework is data-driven and learns semantically meaningful latent variables to represent the multimodal future, without requiring explicit labels. Using a dynamic attention-based state encoder, we learn to encode the past as well as the future interactions among agents, efficiently scaling to any number of agents. Finally, our model can be used for planning via computing a conditional probability density over the trajectories of other agents given a hypothetical rollout of the 'self' agent. We demonstrate our algorithms by predicting vehicle trajectories on both simulated and real data, achieving state-of-the-art results on several vehicle trajectory datasets.

1 Introduction

The ability to make good predictions lies at the heart of robust and safe decision making. It is especially critical to be able to predict the future motions of all relevant agents in complex and dynamic environments. For example, in the autonomous driving domain, motion prediction is central both to high-level decisions, such as when to perform maneuvers, and to low-level path planning optimizations [34, 28].

Motion prediction is a challenging problem due to the various needs of a good predictive model. The varying objectives, goals, and behavioral characteristics of different agents can lead to multiple possible futures or modes. Agents' states do not evolve independently from one another; rather, they interact with each other. As an illustration, we provide some examples in Fig. 1. In Fig. 1(a), there are a few different possible futures for the blue vehicle approaching an intersection. It can either turn left, go straight, or turn right, forming different modes in trajectory space. In Fig. 1(b), interactions between the two vehicles during a merge scenario show that their trajectories influence each other, depending on who yields to whom.
Besides multimodal interactions, prediction needs to scale efficiently with an arbitrary number of agents in a scene and to take into account auxiliary and contextual information, such as map and road information. Additionally, the ability to measure uncertainty by computing the probability of likely future trajectories of all agents in closed form (as opposed to Monte Carlo sampling) is of practical importance.

Despite a large body of work on temporal motion prediction [24, 7, 13, 26, 16, 2, 30, 8, 39], existing state-of-the-art methods often capture only a subset of the aforementioned features. For example, algorithms are either deterministic, not multimodal, or do not fully capture both past and future interactions. Multimodal techniques often require the explicit labeling of modes prior to training. Models which perform joint prediction often assume the number of agents present to be fixed [36, 31].

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: Examples illustrating the need for multimodal interactive predictions. (a) Multiple possible future trajectories: there are a few possible modes for the blue vehicle. (b) Scenario A: green yields to blue. (c) Scenario B: blue yields to green. (b and c) are time-lapsed visualizations of how interactions between agents influence each other's trajectories.

We tackle these challenges by proposing a unifying framework that captures all of the desirable features mentioned earlier. Our framework, which we call the Multiple Futures Predictor (MFP), is a sequential probabilistic latent variable generative model that learns directly from multi-agent trajectory data. Training maximizes a variational lower bound on the log-likelihood of the data. MFP learns to model multimodal interactive futures jointly for all agents, while using a novel factorization technique to remain scalable to an arbitrary number of agents.
After training, MFP can compute both conditional and unconditional trajectory probabilities in closed form, without requiring any Monte Carlo sampling. MFP builds on the Seq2seq encoder-decoder framework [32] by introducing latent variables and using a set of parallel RNNs (with shared weights) to represent the set of agents in a scene. Each RNN takes on the point-of-view of its agent and aggregates historical information for sequential temporal prediction of that agent. Discrete latent variables, one per RNN, automatically learn semantically meaningful modes to capture multimodality without explicit labeling. MFP can further be efficiently and jointly trained end-to-end for all agents in the scene. To summarize, we make the following contributions: First, semantically meaningful latent variables are automatically learned from trajectory data without labels. This addresses the multimodality problem. Second, interactive and parallel step-wise rollouts are performed for all agents in the scene. This addresses the modeling of interactions between actors during future prediction, see Sec. 3.1. We further propose a dynamic attentional encoding which captures both the relationships between agents and the scene context, see Sec. 3.1. Finally, MFP is capable of performing hypothetical inference: evaluating the conditional probability of agents' trajectories conditioned on fixing one or more agents' trajectories, see Sec. 3.2.

2 Related Work

The problem of predicting future motion for dynamic agents has been well studied in the literature. The bulk of classical methods focus on physics-based dynamic or kinematic models [38, 21, 25]. These approaches include Kalman filters and maneuver-based methods, which compute the future motion of agents by propagating their current state forward in time.
While these methods perform well for short time horizons, longer horizons suffer due to the lack of interaction and context modeling. The success of machine learning and deep learning ushered in a variety of data-driven recurrent neural network (RNN) based methods [24, 7, 13, 26, 16, 2]. These models often combine RNN variants, such as LSTMs or GRUs, with encoder-decoder architectures such as conditional variational autoencoders (CVAEs). These methods eschew physics-based dynamic models in favor of learning generic sequential predictors (e.g. RNNs) directly from data. Converting raw input data to input features can also be learned, often by encoding rasterized inputs using CNNs [7, 13].

Methods that can learn multiple future modes have been proposed in [16, 24, 13]. However, [16] explicitly labels six maneuvers/modes and learns to classify these modes separately. [24, 13] do not require mode labeling, but they also do not train in an end-to-end fashion by maximizing the data log-likelihood of the model. Most of the methods in the literature encode the past interactions of agents in a scene; however, prediction is often an independent rollout of a decoder RNN, independent of the other predicted future trajectories [16, 29]. Encoding of spatial relationships is often done by placing other agents in a fixed and spatially discretized grid [16, 24].

Figure 2: Graphical model and computation graph of the MFP. (a) Graphical model of the MFP. Solid nodes denote observed variables; cross-agent interaction edges are shaded for clarity; xt denotes both the state and contextual information from timesteps 1 to t. (b) Architecture of the proposed MFP. The circular 'world' node contains the world state and positions of all agents; diamond nodes are deterministic, while the circular zn are discrete latent random variables. See text for details. Best viewed in color.

In contrast, MFP proposes a unifying framework which exhibits the aforementioned features.
To summarize, we present a feature comparison of MFP with some of the recent methods in the supplementary materials.

3 Multiple Futures Prediction

We tackle motion prediction by formulating a probabilistic framework over a continuous-space but discrete-time system with a finite (but variable) number of N interacting agents. We represent the joint state of all N agents at time t as X_t ≐ {x_t^1, x_t^2, ..., x_t^N} ∈ R^{N×d}, where d is the dimensionality of each state¹ and x_t^n ∈ R^d is the state of the n-th agent at time t. With a slight abuse of notation, we use the superscripted X^n ≐ {x_{t−τ}^n, x_{t−τ+1}^n, ..., x_t^n} to denote the past states of the n-th agent and X ≐ X_{t−τ:t}^{1:N} to denote the joint agent states from time t−τ to t, where τ is the number of past history steps. The future state of all agents at time δ is denoted by Y_δ ≐ {y_δ^1, y_δ^2, ..., y_δ^N}, and the future trajectory of agent n, from time t to time T, is denoted by Y^n ≐ {y_t^n, y_{t+1}^n, ..., y_T^n}. Y ≐ Y_{t:t+T}^{1:N} denotes the joint state of all agents over the future timesteps. Contextual scene information, e.g. a rasterized image R^{h×w×3} of the map, can provide important cues. We use I_t to represent any contextual information at time t.

The goal of motion prediction is then to accurately model p(Y|X, I_t). As in most sequential modeling tasks, it is both inefficient and intractable to model p(Y|X, I_t) jointly. RNNs are typically employed to sequentially model the distribution in a cascade form. However, there are two major challenges specific to our multi-agent prediction framework: (1) Multimodality: optimizing vanilla RNNs via backpropagation through time will lead to mode-averaging, since the mapping from X to Y is not a function, but rather a one-to-many mapping.
In other words, multimodality means that for a given X there can be multiple distinct modes, each carrying significant probability mass over different sequences of Y. (2) Variable number of agents: the number of agents N is variable and unknown, and therefore we cannot simply vectorize X_t as the input to a standard RNN at time t.

For multimodality, we introduce a set of stochastic latent variables z^n ∼ Multinoulli(K), one per agent, where z^n can take on K discrete values. The intuition here is that z^n will learn to represent intentions (left/right/straight) and/or behavior modes (aggressive/conservative). Learning maximizes the marginalized distribution, where z is free to learn any latent behavior so long as it helps to improve the data log-likelihood. Each z is conditioned on X at the current time (before future prediction) and will influence the distribution over future states Y. A key feature of the MFP is that z^n is sampled only once at time t and must stay consistent for the next T time steps. Compared to sampling z^n at every timestep, this leads to tractability and to more realistic intention/goal modeling, as we will discuss in more detail later. We now arrive at the following distribution:

log p(Y|X, I) = log( Σ_Z p(Y, Z|X, I) ) = log( Σ_Z p(Y|Z, X, I) p(Z|X, I) ),    (1)

where Z denotes the joint latent variables of all agents. Naïvely optimizing Eq. 1 is prohibitively expensive and not scalable as the number of agents and timesteps grows. In addition, the maximum number of possible modes is exponential: O(K^N). We first make the model more tractable by factorizing across time, followed by factorizing across agents.

¹We assume states are fully observable and are agents' (x, y) coordinates on the ground plane (d=2).
The joint future distribution Y assumes the form of a product of conditional distributions:

p(Y|Z, X, I) = ∏_{δ=t+1}^{T} p(Y_δ | Y_{t:δ−1}, Z, X, I),    (2)

p(Y_δ | Y_{t:δ−1}, Z, X, I) = ∏_{n=1}^{N} p(y_δ^n | Y_{t:δ−1}, z^n, X, I).    (3)

The second factorization is sensible, as the factorial component conditions on the joint states of all agents at the immediately preceding timestep, where the typical temporal delta is very short (e.g. 100 ms). Also note that the future distribution of the n-th agent is explicitly dependent on its own mode z^n but implicitly dependent on the latent modes of the other agents, through re-encoding the other agents' predicted states y_δ^m (please see the discussion later and also Sec. 3.1). Conditioning explicitly only on an agent's own latent mode is both more scalable computationally and more realistic: agents in the real world can only infer other agents' latent goals/intentions by observing their states. Finally, our overall objective from Eq. 1 can be written as:

log( Σ_Z p(Y|Z, X, I) p(Z|X, I) ) = log( Σ_Z ∏_{δ=t+1}^{T} ∏_{n=1}^{N} p(y_δ^n | Y_{t:δ−1}, z^n, X, I) p(z^n|X, I) )    (4)

= log( Σ_Z ∏_{n=1}^{N} p(z^n|X, I) ∏_{δ=t+1}^{T} p(y_δ^n | Y_{t:δ−1}, z^n, X, I) ).    (5)

The graphical model of the MFP is illustrated in Fig. 2a. While we show only three agents for simplicity, MFP easily scales to any number of agents. Nonlinear interactions among agents make p(y_δ^n | Y_{t:δ−1}, X, I) complicated to model.
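Equations 2-3 reduce joint prediction to a synchronized loop: at each future step, every agent's next state is predicted from the joint states of all agents at the previous step, together with that agent's own mode z^n. A minimal sketch of this loop, with a toy constant-velocity function standing in for the decoder RNNs (all names and dynamics here are illustrative, not the paper's implementation):

```python
import numpy as np

def joint_rollout(x_last, modes, T, step_fn):
    """Synchronized rollout: each agent's next state is predicted from the
    joint states of ALL agents at the previous step plus its own mode z^n."""
    N = len(x_last)
    traj = [np.array(x_last)]          # joint state at time t
    for _ in range(T):
        prev = traj[-1]                # joint state Y_{delta-1} of all agents
        nxt = np.stack([step_fn(n, prev, modes[n]) for n in range(N)])
        traj.append(nxt)               # all agents advance together (Eq. 3)
    return np.stack(traj[1:])          # shape (T, N, d)

# Toy stand-in for a decoder RNN: constant velocity whose heading depends on
# the agent's discrete mode (0: straight, 1: veer left).
def toy_step(n, prev_joint, z):
    v = np.array([1.0, 0.0]) if z == 0 else np.array([1.0, 0.2])
    return prev_joint[n] + v

Y = joint_rollout([[0.0, 0.0], [5.0, 1.0]], modes=[0, 1], T=3, step_fn=toy_step)
print(Y.shape)  # (3, 2, 2): T timesteps, N=2 agents, d=2 state dims
```

Because every agent consumes the same joint previous state, swapping an agent's mode changes not only its own rollout but, through re-encoding, the inputs seen by everyone else at the next step.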
Recurrent neural networks are powerful and flexible models that can efficiently capture and represent long-term dependencies in sequential data. At a high level, RNNs introduce deterministic hidden units h_t at every timestep t, which act as features or embeddings that summarize all of the observations up until time t. At time step t, an RNN takes as its input the observation x_t and the previous hidden representation h_{t−1}, and computes the update h_t = f_rnn(x_t, h_{t−1}). The prediction y_t is computed from the decoding layer of the RNN: y_t = f_dec(h_t). f_rnn and f_dec are recursively applied at every timestep of the sequence.

Fig. 2b shows the computation graph of the MFP. A point-of-view (PoV) transformation φ_n(X_t) is first used to transform the past states into each agent's own reference frame, by translation and rotation such that the +x-axis aligns with the agent's heading. We then instantiate an encoding and a decoding RNN² per agent. Each encoding RNN is responsible for encoding the past observations x_{t−τ:t} into a feature vector. Scene context is transformed via a convolutional neural network into its own feature. The features are combined via a dynamic attention encoder, detailed in Sec. 3.1, to provide inputs both to the latent variables and to the ensuing decoding RNNs. During predictive rollouts, each decoding RNN predicts its own agent's state at every timestep. The predictions are aggregated and subsequently transformed via φ_n(·), providing inputs to every agent/RNN for the next timestep. Latent variables Z provide extra inputs to the decoding RNNs to enable multimodality.
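The PoV transformation φ_n is a rigid transform: translate so the agent sits at the origin, then rotate so its heading lies along the +x-axis. A sketch for the (x, y) case (function and argument names are ours, for illustration):

```python
import numpy as np

def pov_transform(states_xy, ego_xy, ego_heading):
    """Map (x, y) states of all agents into one agent's frame: translate so
    that agent is at the origin, then rotate by -heading so its heading
    aligns with the +x-axis."""
    c, s = np.cos(-ego_heading), np.sin(-ego_heading)
    R = np.array([[c, -s],
                  [s,  c]])            # rotation by -heading
    return (np.asarray(states_xy) - np.asarray(ego_xy)) @ R.T

# Agent at (2, 1) heading "north" (pi/2); another agent 3 m ahead of it
# lands on the +x-axis of the agent-centric frame.
out = pov_transform([[2.0, 4.0]], ego_xy=[2.0, 1.0], ego_heading=np.pi / 2)
print(np.round(out, 6))  # [[3. 0.]]
```

Normalizing every agent into its own frame in this way is what allows a single set of shared RNN weights to serve all agents.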
Finally, the output y_t^n consists of a 5-dimensional vector governing a bivariate Normal distribution: μ_x, μ_y, σ_x, σ_y, and the correlation coefficient ρ.

While we instantiate two RNNs per agent, these RNNs share the same parameters across agents, which means we can efficiently perform joint predictions by combining inputs in a minibatch, allowing us to scale to an arbitrary number of agents. Making Z discrete and having only one set of latent variables influence subsequent predictions is also a deliberate choice. We would like Z to model modes generated by high-level intentions, such as left/right lane changes, or by conservative/aggressive modes of agent behavior. These latent behavior modes also tend to stay consistent over the time horizons typical of motion prediction (e.g. 5 seconds).

²We use GRUs [10]. LSTMs and GRUs perform similarly, but GRUs were slightly faster computationally.

Learning

Given a set of training trajectory data D = {(X^(i), Y^(i)), ...}_{i=1,2,...,|D|}, we use maximum likelihood estimation (MLE) to find the parameters θ* = argmax_θ L(θ, D) that maximize the marginal data log-likelihood:³

L(θ, D) = log p(Y|X; θ) = log( Σ_Z p(Y, Z|X; θ) ) = Σ_Z p(Z|Y, X; θ) log [ p(Y, Z|X; θ) / p(Z|Y, X; θ) ].    (6)

Optimizing Eq. 6 directly is non-trivial, as the posterior distribution is not only hard to compute but also varies with θ.
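The per-step likelihood terms p(y_δ^n | ·) appearing in this objective come from the 5-parameter bivariate Normal output head described above; its log-density can be written directly from (μ_x, μ_y, σ_x, σ_y, ρ). A sketch, not the paper's code:

```python
import numpy as np

def bivariate_normal_logpdf(x, y, mu_x, mu_y, sigma_x, sigma_y, rho):
    """Log-density of a 2D Gaussian given the 5-dim per-step output
    (mu_x, mu_y, sigma_x, sigma_y, rho)."""
    zx = (x - mu_x) / sigma_x
    zy = (y - mu_y) / sigma_y
    one_m_r2 = 1.0 - rho ** 2
    return (-(zx ** 2 - 2.0 * rho * zx * zy + zy ** 2) / (2.0 * one_m_r2)
            - np.log(2.0 * np.pi * sigma_x * sigma_y * np.sqrt(one_m_r2)))

# Sanity check: at the mean with unit sigmas and rho = 0, the log-density
# reduces to -log(2*pi).
lp = bivariate_normal_logpdf(0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0)
print(np.isclose(lp, -np.log(2.0 * np.pi)))  # True
```

In practice the network would emit σ_x, σ_y through an exponential (or softplus) and ρ through a tanh to keep the parameters valid.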
We can, however, decompose the log-likelihood into the sum of the evidence lower bound (ELBO) and the KL-divergence between the true posterior and an approximating posterior q(Z) [27]:

log p(Y|X; θ) = Σ_Z q(Z|Y, X) log [ p(Y, Z|X; θ) / q(Z|Y, X) ] + D_KL(q || p)
≥ Σ_Z q(Z|Y, X) log p(Y, Z|X; θ) + H(q),    (7)

where Jensen's inequality is used to arrive at the lower bound, H is the entropy function, and D_KL(q || p) is the KL-divergence between the true and approximating posteriors. We learn by maximizing the variational lower bound on the data log-likelihood, first using the true posterior⁴ at the current θ′ as the approximating posterior: q(Z|Y, X) ≐ p(Z|Y, X; θ′). We can then fix the approximate posterior and optimize the model parameters for the following function:

Q(θ, θ′) = Σ_Z p(Z|Y, X; θ′) log p(Y, Z|X; θ)
= Σ_Z p(Z|Y, X; θ′) { log p(Y|Z, X; θ_rnn) + log p(Z|X; θ_Z) },    (8)

where θ = {θ_rnn, θ_Z} denotes the parameters of the RNNs and the parameters of the network layers for predicting Z. As our latent variables Z are discrete and have small cardinality (e.g. < 10), we can compute the posterior exactly for a given θ′. The RNN parameter gradients are computed from ∂Q(θ, θ′)/∂θ_rnn, and the gradient for θ_Z is ∂KL( p(Z|Y, X; θ′) || p(Z|X; θ_Z) )/∂θ_Z. Our learning algorithm is a form of the EM algorithm [14], where for the M-step we optimize the RNN parameters using stochastic gradient descent.
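Because each z^n is discrete with small cardinality K, the E-step posterior is an exact normalized table rather than a Monte Carlo estimate. A log-space sketch for a single agent (names are illustrative; log_lik would come from rolling out the decoder RNN under each of the K modes):

```python
import numpy as np

def exact_posterior(log_prior, log_lik):
    """E-step for one agent: p(z=k | Y, X) is proportional to
    p(z=k | X) * p(Y | z=k, X), computed exactly over the K discrete modes."""
    log_joint = log_prior + log_lik                 # shape (K,)
    log_marginal = np.logaddexp.reduce(log_joint)   # log p(Y | X)
    return np.exp(log_joint - log_marginal), log_marginal

# K = 3 modes: a likelihood strongly favoring mode 1 dominates the posterior.
post, log_p = exact_posterior(
    log_prior=np.log([0.5, 0.3, 0.2]),
    log_lik=np.array([-10.0, -1.0, -8.0]))
print(post.argmax(), np.isclose(post.sum(), 1.0))  # 1 True
```

The same log-sum-exp over K modes also yields the exact marginal log-likelihood of any trajectory in O(K), which is what makes closed-form trajectory scoring cheap in this model.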
By integrating out the latent variables Z, MFP learns directly from trajectory data, without requiring any annotations or weak supervision for the latent modes. We provide detailed training pseudocode in the supplementary materials.

Classmates-forcing

Teacher forcing is a standard (albeit biased) technique to accelerate RNN and sequence-to-sequence training by using the ground truth value y_t as the input to step t+1. Even with scheduled sampling [4], we found that over-fitting due to exposure bias could be an issue. Interestingly, an alternative is possible in the MFP: at time t, for agent n, the ground truth observations y_t^m are used as inputs for all other agents m ≠ n. However, for agent n itself, we still use its previous predicted state instead of the true observation x_t^n as its input.

Connections to other Stochastic RNNs

Various stochastic recurrent models have been proposed in the literature: DRAW [20], STORN [3], VRNN [11], SRNN [18], Z-forcing [19], Graph-VRNN [31]. Besides the multi-agent modeling capability of the MFP, the key difference between these methods and MFP is that the other methods use continuous stochastic latent variables z_t at every timestep, sampled from a standard Normal prior. Training is performed via pathwise derivatives, i.e. the reparameterization trick. Having multiple continuous stochastic variables means that the posterior cannot be computed in closed form, and Monte Carlo (or lower-variance MCMC estimators⁵) must be used to estimate the ELBO. This makes it hard to efficiently compute the log-probability of an arbitrary imagined or hypothetical trajectory, which is useful for planning and decision-making (see Sec. 3.2). In contrast, the latent variables in MFP are discrete and can learn semantically meaningful modes (Sec. 4.1).

³We have omitted the dependence on context I for clarity. The R.H.S. is derived from the common log-derivative trick.
We provide empirical comparisons in Table 2. With K modes, it is possible to evaluate the exact log-likelihood of a trajectory in O(K), without resorting to sampling.

⁴The ELBO is tightest when the KL-divergence is zero and q is the true posterior.

⁵Even with IWAE [6], 50 samples are needed to obtain a somewhat tight lower bound, making it prohibitively expensive to compute good log-densities with these stochastic RNNs for online applications.

3.1 State Encodings

As shown in Fig. 2b, the input to the RNNs at step t is first transformed via the point-of-view transformation ϕ(Y_t), followed by state encoding, which aggregates the relative positions of the other agents with respect to the n-th agent (the ego agent, i.e. the agent for which the RNN is predicting) and encodes the information into a feature vector. We denote the encoded feature s_t ← φ_enc^n(ϕ(Y_t)). Here, we propose a dynamic attention-like mechanism in which radial basis functions are used for matching and routing relevant agents from the input to the feature encoder, shown in Fig. 3.

Each agent uses a neural network to transform its state (position, velocity, acceleration, and heading) into a key or descriptor, which is then matched via a radial basis function to a fixed number of "slots" with learned keys in the encoder network. The ego⁶ agent has a separate slot for its own state. Slots are aggregated and further transformed by a two-layer encoder network, producing the encoded state s_t (e.g. a 128-dimensional vector). The entire dynamic encoder can be learned in an end-to-end fashion. The key matching is similar to dot-product attention [35]; however, the use of radial basis functions allows us to learn spatially sensitive and meaningful keys to extract relevant agents. In addition, the Softmax normalization in dot-product attention lacks the ability to differentiate between a single close-by agent vs.
a far-away agent.

Figure 3: Diagram of the dynamic attentional state encoding. MFP uses state encoding at every timestep to convert the states of surrounding agents into a feature vector for next-step prediction; see text for more details.

3.2 Hypothetical Rollouts

Planning and decision-making must rely on prediction for what-ifs [22]. It is important to predict how others might behave in response to different hypothetical ego actions (e.g. what if ego were to perform a more aggressive lane change?). Specifically, we are interested in the distribution obtained by conditioning on a hypothetical future trajectory Y^n of one (or more) agents:

p(Y^{m:m≠n} | Y^n, X) = Σ_{Z^{m:m≠n}} ∏_{δ=t+1}^{T} ∏_{m:m≠n} p(y_δ^m | Y_{t:δ−1}, z^m, X) p(z^m | X).    (9)

This can be easily computed within MFP by fixing the future states y_{t:T}^n of the conditioning agent on the R.H.S. of Eq. 9 while leaving the states of the other agents y_{t:T}^{m≠n} unchanged. This is possible because MFP performs interactive future rollouts in a synchronized manner for all agents: the jointly predicted states of all agents at time t are used as inputs for predicting the states at time t+1. In comparison, most other prediction algorithms perform independent rollouts, which makes hypothetical rollouts impossible, as there are no interactions during the future timesteps.

4 Experimental Results

We demonstrate the effectiveness of MFP in learning interactive multimodal predictions for the driving domain, where each agent is a vehicle. As a proof of concept, we first generate simulated trajectory data from the CARLA simulator [17], where we can specify the number of modes and script 2nd-order interactions. We demonstrate that MFP can learn semantically meaningful latent modes that capture all of the modes of the data, all without labeling of the latent modes.
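The conditioning in Eq. 9 (Sec. 3.2) can be realized by clamping the conditioning agent's predictions to the hypothetical trajectory during the synchronized rollout, so the other agents keep reacting to it step by step. A toy sketch, with illustrative follower dynamics standing in for the decoder RNNs (not the paper's implementation):

```python
import numpy as np

def hypothetical_rollout(x_last, modes, ego_future, step_fn, ego=0):
    """Roll out all agents jointly, but at each step overwrite the ego
    agent's prediction with a fixed hypothetical trajectory (the
    conditioning in Eq. 9): others react to it, ego itself is clamped."""
    N, T = len(x_last), len(ego_future)
    traj = [np.array(x_last)]
    for step in range(T):
        prev = traj[-1]
        nxt = np.stack([step_fn(n, prev, modes[n]) for n in range(N)])
        nxt[ego] = ego_future[step]        # clamp the conditioning agent
        traj.append(nxt)
    return np.stack(traj[1:])              # shape (T, N, d)

# Toy follower: agent 1 moves 10% of the way toward agent 0 each step, so
# different hypothetical ego futures induce different reactions.
def toy_step(n, prev, z):
    if n == 1:
        return prev[1] + 0.1 * (prev[0] - prev[1])
    return prev[n]                         # ego's own prediction is unused

ego_plan = np.array([[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])
Y = hypothetical_rollout([[0.0, 0.0], [10.0, 0.0]], [0, 0], ego_plan, toy_step)
print(Y[:, 1, 0])  # follower's x drifts toward the ego's hypothetical path
```

With independent per-agent rollouts this clamping would have no effect on the others, which is exactly the contrast drawn in Sec. 3.2.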
We then experiment on a widely known standard dataset of real vehicle trajectories, the NGSIM [12] dataset. We show that MFP achieves state-of-the-art results on modeling held-out test trajectories. In addition, we benchmark MFP against previously published results on the more recent large-scale Argoverse motion forecasting dataset [9]. We provide MFP architecture and learning details in the supplementary materials.

⁶We will use ego to refer to the main or 'self' agent for whom we are predicting.
FQ+DG0gi5vhkEHCPUwJaui3cF5PiuNiFIQNy0svlzVK1bNvnFjZL5gyaWBW7bFWxtVDyaIF6L/fe7Yc09lkAVBClOpYZgZMSCZwKNsl2Y8UiQkdkwDqaBsRnyklnV0/wsVb62AulrgDwTP0+kRJfqcR3dadPYKh+e1PxL68Tg2c7KQ+iGFhA54u8WGAI8TQC3OeSURCJJoRKrm/FdEgkoaCDyuoQvj7F/5PmacnSyVyf5WsXizgy6BAdoQKyUAXV0BWqowaiSKIH9ISejTvj0XgxXuetS8Zi5gD9gPH2CaLfkfQ=(x,y,v,\u2713)AAAB9XicdVDJSgNBEO1xjXGLevTSGIQIIcwYMZNb0IvHCGaBZAw9nZ6kSc9Cd010GPIfXjwo4tV/8ebf2FkEFX1Q8Hiviqp6biS4AtP8MJaWV1bX1jMb2c2t7Z3d3N5+U4WxpKxBQxHKtksUEzxgDeAgWDuSjPiuYC13dDn1W2MmFQ+DG0gi5vhkEHCPUwJaui3cF5PiuNiFIQNy0svlzVK1bNvnFjZL5gyaWBW7bFWxtVDyaIF6L/fe7Yc09lkAVBClOpYZgZMSCZwKNsl2Y8UiQkdkwDqaBsRnyklnV0/wsVb62AulrgDwTP0+kRJfqcR3dadPYKh+e1PxL68Tg2c7KQ+iGFhA54u8WGAI8TQC3OeSURCJJoRKrm/FdEgkoaCDyuoQvj7F/5PmacnSyVyf5WsXizgy6BAdoQKyUAXV0BWqowaiSKIH9ISejTvj0XgxXuetS8Zi5gD9gPH2CaLfkfQ=AAAB9XicdVDJSgNBEO1xjXGLevTSGIQIIcwYMZNb0IvHCGaBZAw9nZ6kSc9Cd010GPIfXjwo4tV/8ebf2FkEFX1Q8Hiviqp6biS4AtP8MJaWV1bX1jMb2c2t7Z3d3N5+U4WxpKxBQxHKtksUEzxgDeAgWDuSjPiuYC13dDn1W2MmFQ+DG0gi5vhkEHCPUwJaui3cF5PiuNiFIQNy0svlzVK1bNvnFjZL5gyaWBW7bFWxtVDyaIF6L/fe7Yc09lkAVBClOpYZgZMSCZwKNsl2Y8UiQkdkwDqaBsRnyklnV0/wsVb62AulrgDwTP0+kRJfqcR3dadPYKh+e1PxL68Tg2c7KQ+iGFhA54u8WGAI8TQC3OeSURCJJoRKrm/FdEgkoaCDyuoQvj7F/5PmacnSyVyf5WsXizgy6BAdoQKyUAXV0BWqowaiSKIH9ISejTvj0XgxXuetS8Zi5gD9gPH2CaLfkfQ=AAAB9XicdVDJSgNBEO1xjXGLevTSGIQIIcwYMZNb0IvHCGaBZAw9nZ6kSc9Cd010GPIfXjwo4tV/8ebf2FkEFX1Q8Hiviqp6biS4AtP8MJaWV1bX1jMb2c2t7Z3d3N5+U4WxpKxBQxHKtksUEzxgDeAgWDuSjPiuYC13dDn1W2MmFQ+DG0gi5vhkEHCPUwJaui3cF5PiuNiFIQNy0svlzVK1bNvnFjZL5gyaWBW7bFWxtVDyaIF6L/fe7Yc09lkAVBClOpYZgZMSCZwKNsl2Y8UiQkdkwDqaBsRnyklnV0/wsVb62AulrgDwTP0+kRJfqcR3dadPYKh+e1PxL68Tg2c7KQ+iGFhA54u8WGAI8TQC3OeSURCJJoRKrm/FdEgkoaCDyuoQvj7F/5PmacnSyVyf5WsXizgy6BAdoQKyUAXV0BWqowaiSKIH9ISejTvj0XgxXuetS8Zi5gD9gPH2CaLfkfQ=AAAB9XicdVDJSgNBEO1xjXGLevTSGIQIIcwYMZNb0IvHCGaBZAw9nZ6kSc9Cd010GPIfXjwo4tV/8ebf2FkEFX1Q8Hiviqp6biS4AtP8MJaWV1bX1jMb2c2t7Z3d3N5+U4WxpKxBQxHKtksUEzxgDeAgWDuSjPiuYC13dDn1W2MmFQ+DG0gi5vhkEHCPUwJaui3cF5PiuNiFIQNy0svlzVK1bNvnFjZL5gyaWBW7bFWxtVDyaIF6L/fe7Yc09l
(a) CARLA simulation [17]. (b) MFP sample rollouts after training. Multiple trials from the same initial locations are overlaid. (c) Learned latent modes. The same marker shape denotes the same mode across agents; time is the z-axis.

Figure 4: (a) CARLA data. (b) Sample rollouts overlaid, showing learned multimodality. (c) MFP learned semantically meaningful latent modes automatically: triangle: right turn, square: straight ahead, circle: stop.

4.1 CARLA

CARLA is a realistic, open-source, high-fidelity driving simulator based on the Unreal Engine [17]. It currently contains six different towns and dozens of different vehicle assets. The simulation includes both highway and urban settings with traffic-light intersections and four-way stops. Simple traffic-law-abiding "auto-pilot" CPU agents are also available.

We create a scenario at an intersection where one vehicle approaches the intersection while two other vehicles move across horizontally (Fig. 4(a)). The first vehicle (red) has three different possibilities, chosen randomly during data generation. In the first mode, it aggressively speeds up and makes the right turn, cutting in front of the green vehicle. In the second mode, it still makes the right turn, but slows down and yields to the green vehicle. In the third mode, the first vehicle slows to a stop, yielding to both of the other vehicles. The far-left vehicle also chooses randomly between going straight and turning right. 
We report the performance of MFP as a function of the number of modes in Table 1.

Metric       C.V.   RNN basic  MFP 1 mode  MFP 2 modes  MFP 3 modes  MFP 4 modes  MFP 5 modes
NLL (nats)   11.46  5.64±0.02  5.23±0.01   3.37±0.81    1.72±0.19    1.39±0.01    1.39±0.01

Table 1: Test performance (NLL, in nats) as a function of the number of modes.

       Fixed-Encoding  DynEnc
NLL    1.878±0.163     1.694±0.175

       Teacher-forcing  Classmates-forcing
NLL    4.344±0.059      4.196±0.075

Table 2: Additional comparisons.

Metric    Vehicle 1 Standard  Vehicle 1 Hypo  Vehicle 2 Standard  Vehicle 2 Hypo
minADE    1.509 ± 0.37        1.402 ± 0.34    0.800 ± 0.064       0.709 ± 0.060
minFDE    2.530 ± 0.635       2.305 ± 0.570   3.171 ± 0.462       2.729 ± 0.415

Table 3: Hypothetical rollouts (K = 12).

The modes learned here are somewhat semantically meaningful. In Fig. 4(c), we can see that even across different vehicles, the same latent variable z learned to be interpretable: mode 0 (squares) learned to go straight, mode 1 (circles) learned to brake/stop, and mode 2 (triangles) represents right turns. Finally, in Table 2, we compare teacher-forcing against the proposed classmates-forcing. In addition, we compare different types of encodings: DynEnc is the encoding proposed in Sec. 3.1, while fixed-encoding uses a fixed agent ordering, which is not ideal when there is an arbitrary number N of agents. We can also look at how well we can perform hypothetical rollouts by conditioning our predictions of other agents on ego's future trajectories. 
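The conditioning just described, predicting the other agents given a hypothetical ego rollout, can be sketched as follows. This is a schematic, not the paper's implementation: `conditional_rollout` and `make_cv_step` are illustrative names, and a stand-in constant-velocity extrapolator plays the role of the learned joint one-step predictor.

```python
import numpy as np

def conditional_rollout(step_fn, state, ego_plan):
    """Roll all agents forward jointly, overriding agent 0 (the ego)
    with a fixed hypothetical plan; returns shape (T, n_agents, 2)."""
    out = []
    for ego_xy in ego_plan:
        state = step_fn(state)      # joint one-step prediction for all agents
        state[0] = ego_xy           # condition on ego's hypothetical plan
        out.append(state.copy())
    return np.stack(out)

def make_cv_step(prev):
    """Stand-in one-step predictor: constant-velocity extrapolation.
    The learned joint model would replace this closure."""
    def step(state):
        nonlocal prev
        nxt = state + (state - prev)   # extrapolate each agent's velocity
        prev = state.copy()
        return nxt
    return step
```

Under this interface, the density of the other agents' trajectories can then be evaluated under different candidate ego plans, which is what makes the model usable for planning.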
These hypothetical-rollout results are reported in Table 3.

                     Town01 test     Town02 test
DESIRE [24]          2.422 ± 0.017   1.697 ± 0.017
SocialGAN            1.141 ± 0.015   0.979 ± 0.015
R2P2-MA [30]         0.770 ± 0.008   1.102 ± 0.011
ESP (no LIDAR) [30]  0.632 ± 0.011   0.784 ± 0.013
ESP [30]             0.447 ± 0.007   0.675 ± 0.009
ESP Flex [30]        0.435 ± 0.009   0.565 ± 0.011
MultiPath [8]        0.68            0.69
MFP-1                0.448 ± 0.007   0.457 ± 0.004
MFP-2                0.291 ± 0.005   0.290 ± 0.003
MFP-3                0.284 ± 0.005   0.295 ± 0.003
MFP-4                0.279 ± 0.005   0.311 ± 0.003
MFP-5                0.374 ± 0.006   0.389 ± 0.004

Table 4: Test performance (minMSD with K = 12) comparisons, in meters squared.

NLL (nats)
time    Cons vel.  CVGMM [15]  [23]  MATF [39]  LSTM  S-LSTM [1]  CS-LSTM (M)  MFP-1       MFP-2       MFP-3       MFP-4       MFP-5
1 sec.  3.72       2.02        -     -          1.17  1.01        0.89 (0.58)  0.73±0.01   -0.32±0.01  -0.58±0.01  -0.65±0.01  -0.45±0.01
2 sec.  5.37       3.63        -     -          2.85  2.49        2.43 (2.14)  2.33±0.01   1.43±0.01   1.26±0.01   1.19±0.01   1.36±0.01
3 sec.  6.40       4.62        -     -          3.80  3.36        3.30 (3.03)  3.17±0.01   2.45±0.01   2.32±0.01   2.28±0.01   2.42±0.01
4 sec.  7.16       5.35        -     -          4.48  4.01        3.97 (3.68)  3.77±0.01   3.21±0.00   3.07±0.00   3.06±0.00   3.17±0.00
5 sec.  7.76       5.93        -     -          4.99  4.54        4.51 (4.22)  4.26±0.00   3.81±0.00   3.69±0.00   3.69±0.00   3.76±0.00

RMSE (m)
time    Cons vel.  CVGMM [15]  [23]  MATF [39]  LSTM  S-LSTM [1]  CS-LSTM [16]  MFP-1      MFP-2      MFP-3      MFP-4      MFP-5
1 sec.  0.73       0.66        0.69  0.66       0.68  0.65        0.61          0.54±0.00  0.55±0.00  0.54±0.00  0.54±0.00  0.55±0.00
2 sec.  1.78       1.56        1.51  1.34       1.65  1.31        1.27          1.16±0.00  1.18±0.00  1.17±0.00  1.16±0.00  1.18±0.00
3 sec.  3.13       2.75        2.55  2.08       2.91  2.16        2.09          1.90±0.00  1.92±0.00  1.91±0.00  1.89±0.00  1.92±0.00
4 sec.  4.78       4.24        3.65  2.97       4.46  3.25        3.10          2.78±0.00  2.80±0.00  2.78±0.00  2.75±0.00  2.78±0.00
5 sec.  6.68       5.99        4.71  4.13       6.27  4.55        4.37          3.83±0.01  3.85±0.01  3.83±0.01  3.78±0.01  3.80±0.01

Table 5: NGSIM prediction results. The MFP columns are our results (lower is better). MFP-K: K is the number of latent modes. The standard error of the mean is over 5 trials. For multimodal MFPs, we report minRMSE over 5 samples. NLL can be negative as we are modeling a continuous density function.

CARLA PRECOG
We next compare MFP to a much larger CARLA dataset with published benchmark results. This dataset consists of over 60K training sequences collected from two different towns in CARLA [30]. We trained MFP (with 1 to 5 modes) on the Town01 training set for 200K updates, with minibatch size 8. We report the minMSD metric (in meters squared) with K = 12 for all 5 agents jointly. We compare with state-of-the-art methods in Table 4. Non-MFP results are reported from [30] (v3) and [8]. MFP significantly outperforms the other methods on this dataset. We include qualitative visualizations of test set predictions in the supplementary materials.

4.2 NGSIM
Next Generation Simulation (NGSIM) [12] is a collection of video-transcribed datasets of vehicle trajectories on US-101 and Lankershim Blvd. in Los Angeles, I-80 in Emeryville, CA, and Peachtree St. in Atlanta, Georgia. In total, it contains approximately 45 minutes of vehicle trajectory data at 10 Hz, consisting of diverse interactions among cars, trucks, buses, and motorcycles in congested flow. We experiment with the US-101 and I-80 datasets and follow the experimental protocol of [16], where the datasets are split into 70% training, 10% validation, and 20% testing. 
We extract 8-second trajectories, using the first 3 seconds as history and predicting 5 seconds into the future.

In Table 5, we report both negative log-likelihood (NLL) and RMSE errors on the test set. RMSE and other measures such as average/final displacement error (ADE/FDE) are not good metrics for multimodal distributions and are only reported for MFP-1. For multimodal MFPs, we report minRMSE over 5 samples, which uses the ground truth to select the best trajectory and could therefore be overly optimistic. Note that this applies equally to other popular metrics such as minADE, minFDE, and minMSD.

(a) Merge-off scenario. (b) Lane change left scenario.
Figure 5: Qualitative MFP-3 results after training on NGSIM data. Three modes (red, purple, and green) are shown as density contour plots for the blue vehicle. Grey vehicles are other agents. The blue path is the past trajectory; the orange path is the actual future ground truth. Grey pixels form a heatmap of frequently visited paths. Additional visualizations are provided in the supplementary materials.

The current state-of-the-art, the multimodal CS-LSTM [16], requires a separate prediction of 6 fixed maneuver modes. As a comparison, MFP achieves significant improvements with fewer modes. Detailed evaluation protocols are provided in the supplementary materials. We also provide qualitative results on the different modes learned by MFP in Fig. 5. In the right panel, we can interpret the green mode as a fairly aggressive lane change, while the purple and red modes are more "cautious". Ablative studies showing the contributions of both interactive rollouts and dynamic attention encoding are also provided in the supplementary materials; we obtain the best performance with the combination of the two.

4.3 Argoverse Motion Forecasting
The Argoverse motion forecasting dataset is a large-scale trajectory prediction dataset with more than 300,000 curated scenarios [9]. Each sequence is 5 seconds long in total, and the task is to predict the next 3 seconds after observing 2 seconds of history. We performed preliminary experiments by training an MFP with 3 modes for 20K updates and compare to the existing official baselines in Table 6. MFP hyperparameters were not tuned for this dataset, so we expect improved MFP performance with additional tuning. We report validation set performance on both version 1.0 and version 1.1 of the dataset.

minADE (K=6)  C.V.  NN+map  LSTM ED+map  LSTM ED  MFP-3 (ver. 1.0)  MFP-3 (ver. 1.1)
meters        3.55  2.28    2.27         2.25     1.411             1.399

Table 6: Argoverse Motion Forecasting. Performance on the validation set. C.V.: constant velocity. Baseline results are from [9].

4.4 Planning and Decision Making
The original motivation for learning a good predictor is to enable robust decision making. We now test this by creating a simple yet non-trivial reinforcement learning (RL) task in the form of an unprotected left turn. Situated in Town05 of the CARLA simulator, the objective is to safely perform an unprotected (no traffic lights) turn; see Fig. 6. Two oncoming vehicles have random initial speeds. Collisions incur a penalty of −500 while success yields +10. There is also a small reward for higher velocity, and the action space is acceleration along the ego agent's default path (blue). Using predictions to learn the policy is in the domain of model-based RL [33, 37]. 
Here, MFP can be used in several ways: 1) we can generate imagined future rollouts and add them to the experiences from which temporal difference methods learn [33], or 2) we can perform online planning via a form of the shooting method [5], which allows us to optimize over future trajectories. We perform experiments with the latter technique, where we progressively train MFP to predict the joint future trajectories of all three vehicles in the scene. We find the optimal policy by leveraging the current MFP model and optimizing over the ego agent's future actions. We compare this approach to two strong model-free RL baselines: DDPG and proximal policy optimization (PPO). In Fig. 7, we plot the reward vs. the number of environment steps taken. In Table 7, we show that MFP-based planning is more robust to parameter variations in the testing environment.

Figure 6: RL learning environment: unprotected left turn.
Figure 7: Learning curves as a function of step sizes.

Δ Env. Params   DDPG  PPO  MFP
vel: +0 m/s     3%    4%   0%
vel: +5 m/s     8%    4%   0%
vel: +10 m/s    6%    15%  0%
acc: +1 m/s²    3%    1%   0%

Table 7: Testing crash rates per 100 trials. The test environment modifies the velocity and acceleration parameters.

5 Discussion
In this paper, we proposed a probabilistic latent variable framework that facilitates the joint multi-step temporal prediction of an arbitrary number of agents in a scene. Leveraging the ability to learn latent modes directly from data and to interactively roll out the future with different point-of-view encodings, MFP demonstrated state-of-the-art performance on several vehicle trajectory datasets. 
For future work, it would be interesting to add a mix of discrete and continuous latent variables, as well as to train and validate on pedestrian or bicycle trajectory datasets.

Acknowledgements We thank Barry Theobald, Gabe Hoffmann, Alex Druinsky, Nitish Srivastava, Russ Webb, and the anonymous reviewers for making this a better manuscript. We also thank the authors of [16] for open sourcing their code and dataset.

References
[1] Alexandre Alahi, Kratarth Goel, Vignesh Ramanathan, Alexandre Robicquet, Li Fei-Fei, and Silvio Savarese. Social LSTM: Human trajectory prediction in crowded spaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 961–971, 2016.
[2] Mayank Bansal, Alex Krizhevsky, and Abhijit S. Ogale. ChauffeurNet: Learning to drive by imitating the best and synthesizing the worst. CoRR, abs/1812.03079, 2018.
[3] Justin Bayer and Christian Osendorfer. Learning stochastic recurrent networks. arXiv preprint arXiv:1411.7610, 2014.
[4] Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, pages 1171–1179, 2015.
[5] John T. Betts. Survey of numerical methods for trajectory optimization. Journal of Guidance, Control, and Dynamics, 21(2):193–207, 1998.
[6] Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519, 2015.
[7] Sergio Casas, Wenjie Luo, and Raquel Urtasun. IntentNet: Learning to predict intention from raw sensor data. 
In Aude Billard, Anca Dragan, Jan Peters, and Jun Morimoto, editors, Proceedings of The 2nd Conference on Robot Learning, volume 87 of Proceedings of Machine Learning Research, pages 947–956. PMLR, 29–31 Oct 2018.
[8] Yuning Chai, Benjamin Sapp, Mayank Bansal, and Dragomir Anguelov. MultiPath: Multiple probabilistic anchor trajectory hypotheses for behavior prediction. arXiv preprint arXiv:1910.05449, 2019.
[9] Ming-Fang Chang, John Lambert, Patsorn Sangkloy, Jagjeet Singh, Slawomir Bak, Andrew Hartnett, De Wang, Peter Carr, Simon Lucey, Deva Ramanan, et al. Argoverse: 3D tracking and forecasting with rich maps. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8748–8757, 2019.
[10] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.
[11] Junyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron C. Courville, and Yoshua Bengio. A recurrent latent variable model for sequential data. In Advances in Neural Information Processing Systems, pages 2980–2988, 2015.
[12] James Colyar and John Halkias. US Highway 101 dataset. FHWA-HRT-07-030, 2007.
[13] Henggang Cui, Vladan Radosavljevic, Fang-Chieh Chou, Tsung-Han Lin, Thi Nguyen, Tzu-Kuo Huang, Jeff Schneider, and Nemanja Djuric. Multimodal trajectory predictions for autonomous driving using deep convolutional networks. CoRR, abs/1809.10732, 2018.
[14] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B, 39:1–38, 1977.
[15] Nachiket Deo, Akshay Rangesh, and Mohan M. Trivedi. How would surround vehicles move? A unified framework for maneuver classification and motion prediction. 
IEEE Transactions on Intelligent Vehicles, 3(2):129–140, 2018.
[16] Nachiket Deo and Mohan M. Trivedi. Convolutional social pooling for vehicle trajectory prediction. CoRR, abs/1805.06771, 2018.
[17] Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. CARLA: An open urban driving simulator. In Proceedings of the 1st Annual Conference on Robot Learning, pages 1–16, 2017.
[18] Marco Fraccaro, Søren Kaae Sønderby, Ulrich Paquet, and Ole Winther. Sequential neural models with stochastic layers. In Advances in Neural Information Processing Systems, pages 2199–2207, 2016.
[19] Anirudh Goyal, Alessandro Sordoni, Marc-Alexandre Côté, Nan Rosemary Ke, and Yoshua Bengio. Z-forcing: Training stochastic recurrent networks. In Advances in Neural Information Processing Systems, pages 6713–6723, 2017.
[20] Karol Gregor, Ivo Danihelka, Alex Graves, Danilo Jimenez Rezende, and Daan Wierstra. DRAW: A recurrent neural network for image generation. arXiv preprint arXiv:1502.04623, 2015.
[21] Tuomas Haarnoja, Anurag Ajay, Sergey Levine, and Pieter Abbeel. Backprop KF: Learning discriminative deterministic state estimators. CoRR, abs/1605.07148, 2016.
[22] Thomas Howard, Mihail Pivtoraiko, Ross A. Knepper, and Alonzo Kelly. Model-predictive motion planning: Several key developments for autonomous mobile robots. IEEE Robotics & Automation Magazine, 21(1):64–73, 2014.
[23] Alex Kuefler, Jeremy Morton, Tim Wheeler, and Mykel Kochenderfer. Imitating driver behavior with generative adversarial networks. In 2017 IEEE Intelligent Vehicles Symposium (IV), pages 204–211. IEEE, 2017.
[24] Namhoon Lee, Wongun Choi, Paul Vernaza, Chris Choy, Philip H. S. Torr, and Manmohan Chandraker. DESIRE: Distant future prediction in dynamic scenes with interacting agents. 
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2165–2174, 2017.
[25] Stéphanie Lefèvre, Dizan Vasquez, and Christian Laugier. A survey on motion prediction and risk assessment for intelligent vehicles. ROBOMECH Journal, 1(1):1, 2014.
[26] Yuexin Ma, Xinge Zhu, Sibo Zhang, Ruigang Yang, Wenping Wang, and Dinesh Manocha. TrafficPredict: Trajectory prediction for heterogeneous traffic-agents. CoRR, abs/1811.02146, 2018.
[27] Radford M. Neal and Geoffrey E. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. In Learning in Graphical Models, pages 355–368. Springer, 1998.
[28] Brian Paden, Michal Cap, Sze Zheng Yong, Dmitry Yershov, and Emilio Frazzoli. A survey of motion planning and control techniques for self-driving urban vehicles. IEEE Transactions on Intelligent Vehicles, 1, 2016.
[29] SeongHyeon Park, Byeongdo Kim, Chang Mook Kang, Chung Choo Chung, and Jun Won Choi. Sequence-to-sequence prediction of vehicle trajectory via LSTM encoder-decoder architecture. CoRR, abs/1802.06338, 2018.
[30] Nicholas Rhinehart, Rowan McAllister, Kris M. Kitani, and Sergey Levine. PRECOG: Prediction conditioned on goals in visual multi-agent settings. CoRR, abs/1905.01296, 2019.
[31] Chen Sun, Per Karlsson, Jiajun Wu, Joshua B. Tenenbaum, and Kevin Murphy. Stochastic prediction of multi-agent interactions from partial observations. CoRR, abs/1902.09641, 2019.
[32] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112, 2014.
[33] Richard S. Sutton. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Machine Learning Proceedings 1990, pages 216–224. 
Elsevier, 1990.
[34] Sebastian Thrun, Mike Montemerlo, Hendrik Dahlkamp, David Stavens, Andrei Aron, James Diebel, Philip Fong, John Gale, Morgan Halpenny, Gabriel Hoffmann, et al. Stanley: The robot that won the DARPA Grand Challenge. Journal of Field Robotics, 23(9):661–692, 2006.
[35] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
[36] Nicholas Watters, Daniel Zoran, Theophane Weber, Peter Battaglia, Razvan Pascanu, and Andrea Tacchetti. Visual interaction networks: Learning a physics simulator from video. In Advances in Neural Information Processing Systems, pages 4539–4547, 2017.
[37] Theophane Weber, Sébastien Racanière, David P. Reichert, Lars Buesing, Arthur Guez, Danilo Jimenez Rezende, Adrià Puigdomènech Badia, Oriol Vinyals, Nicolas Heess, Yujia Li, Razvan Pascanu, Peter W. Battaglia, David Silver, and Daan Wierstra. Imagination-augmented agents for deep reinforcement learning. CoRR, abs/1707.06203, 2017.
[38] Greg Welch, Gary Bishop, et al. An introduction to the Kalman filter. 1995.
[39] Tianyang Zhao, Yifei Xu, Mathew Monfort, Wongun Choi, Chris Baker, Yibiao Zhao, Yizhou Wang, and Ying Nian Wu. Multi-agent tensor fusion for contextual trajectory prediction. CoRR, abs/1904.04776, 2019.