{"title": "Gaussian Process Volatility Model", "book": "Advances in Neural Information Processing Systems", "page_first": 1044, "page_last": 1052, "abstract": "The prediction of time-changing variances is an important task in the modeling of financial data. Standard econometric models are often limited as they assume rigid functional relationships for the evolution of the variance. Moreover, functional parameters are usually learned by maximum likelihood, which can lead to overfitting. To address these problems we introduce GP-Vol, a novel non-parametric model for time-changing variances based on Gaussian Processes. This new model can capture highly flexible functional relationships for the variances. Furthermore, we introduce a new online algorithm for fast inference in GP-Vol. This method is much faster than current offline inference procedures and it avoids overfitting problems by following a fully Bayesian approach. Experiments with financial data show that GP-Vol performs significantly better than current standard alternatives.", "full_text": "Gaussian Process Volatility Model\n\nYue Wu\nCambridge University\nwu5@post.harvard.edu\n\nJos\u00e9 Miguel Hern\u00e1ndez Lobato\nCambridge University\njmh233@cam.ac.uk\n\nZoubin Ghahramani\nCambridge University\nzoubin@eng.cam.ac.uk\n\nAbstract\n\nThe prediction of time-changing variances is an important task in the modeling of financial data. Standard econometric models are often limited as they assume rigid functional relationships for the evolution of the variance. Moreover, functional parameters are usually learned by maximum likelihood, which can lead to overfitting. To address these problems we introduce GP-Vol, a novel non-parametric model for time-changing variances based on Gaussian Processes. This new model can capture highly flexible functional relationships for the variances. Furthermore, we introduce a new online algorithm for fast inference in GP-Vol. 
This method is much faster than current offline inference procedures and it avoids overfitting problems by following a fully Bayesian approach. Experiments with financial data show that GP-Vol performs significantly better than current standard alternatives.\n\n1 Introduction\n\nTime series of financial returns often exhibit heteroscedasticity, that is, the standard deviation or volatility of the returns is time-dependent. In particular, large returns (either positive or negative) are often followed by returns that are also large in size. The result is that financial time series frequently display periods of low and high volatility. This phenomenon is known as volatility clustering [1]. Several univariate models have been proposed in the literature for capturing this property. The best known and most popular is the Generalised Autoregressive Conditional Heteroscedasticity model (GARCH) [2]. An alternative to GARCH is the class of stochastic volatility (SV) models [3]. However, there is no evidence that SV models have better predictive performance than GARCH [4, 5, 6].\nGARCH has further inspired a host of variants and extensions. A review of many of these models can be found in [7]. Most of these GARCH variants attempt to address one or both limitations of GARCH: a) the assumption of a linear dependency between current and past volatilities, and b) the assumption that positive and negative returns have symmetric effects on volatility. Asymmetric effects are often observed, as large negative returns often send measures of volatility soaring, while this effect is smaller for large positive returns [8, 9]. Finally, there are also extensions that use additional data besides daily closing prices to improve volatility predictions [10].\nMost solutions proposed in these variants of GARCH involve: a) introducing nonlinear functional relationships for the evolution of volatility, and b) adding asymmetric effects in these functional relationships. 
However, the GARCH variants do not fundamentally address the problem that the specific functional relationship of the volatility is unknown. In addition, these variants can have a high number of parameters, which may lead to overfitting when using maximum likelihood learning.\nMore recently, volatility modeling has received attention within the machine learning community, with the development of copula processes [11] and heteroscedastic Gaussian processes [12]. These models leverage the flexibility of Gaussian Processes [13] to model the unknown relationship between the variances. However, these models do not address the asymmetric effects of positive and negative returns on volatility.\nWe introduce a new non-parametric volatility model, called the Gaussian Process Volatility Model (GP-Vol). This new model is more flexible, as it is not limited by a fixed functional form. Instead, a non-parametric prior distribution is placed on possible functions, and the functional relationship is learned from the data. This allows GP-Vol to explicitly capture the asymmetric effects of positive and negative returns on volatility. Our new volatility model is evaluated in a series of experiments with real financial returns, and compared against popular econometric models, namely, GARCH, EGARCH [14] and GJR-GARCH [15]. In these experiments, GP-Vol produces the best overall predictions. In addition to this, we show that the functional relationship learned by GP-Vol often exhibits the nonlinear and asymmetric features that previous models attempt to capture.\nThe second main contribution of the paper is the development of an online algorithm for learning GP-Vol. GP-Vol is an instance of a Gaussian Process State Space Model (GP-SSM). 
Previous work on GP-SSMs [16, 17, 18] has mainly focused on developing approximation methods for filtering and smoothing the hidden states in GP-SSM, without jointly learning the GP transition dynamics. Only very recently have Frigola et al. [19] addressed the problem of learning both the hidden states and the transition dynamics by using Particle Gibbs with Ancestor Sampling (PGAS) [20]. In this paper, we introduce a new online algorithm for performing inference on GP-SSMs. Our algorithm has predictive performance similar to that of PGAS on financial data, but is much faster.\n\n2 Review of GARCH and GARCH variants\n\nThe standard variance model for financial data is GARCH. GARCH assumes a Gaussian observation model and a linear transition function for the variance: the time-varying variance \u03c3_t^2 is linearly dependent on p previous variance values and q previous squared time series values, that is,\n\nx_t \u223c N(0, \u03c3_t^2) , and \u03c3_t^2 = \u03b1_0 + \u2211_{j=1}^{q} \u03b1_j x_{t\u2212j}^2 + \u2211_{i=1}^{p} \u03b2_i \u03c3_{t\u2212i}^2 , (1)\n\nwhere x_t are the values of the return time series being modeled. This model is flexible and can produce a variety of clustering behaviors of high and low volatility periods for different settings of \u03b1_1, . . . , \u03b1_q and \u03b2_1, . . . , \u03b2_p. However, it has several limitations. First, only linear relationships between \u03c3_{t\u2212p:t\u22121}^2 and \u03c3_t^2 are allowed. Second, past positive and negative returns have the same effect on \u03c3_t^2 due to the quadratic term x_{t\u2212j}^2. However, it is often observed that large negative returns lead to larger rises in volatility than large positive returns [8, 9].\nA more flexible and often cited GARCH extension is Exponential GARCH (EGARCH) [14]. The equation for \u03c3_t^2 is now:\n\nlog(\u03c3_t^2) = \u03b1_0 + \u2211_{j=1}^{q} \u03b1_j g(x_{t\u2212j}) + \u2211_{i=1}^{p} \u03b2_i log(\u03c3_{t\u2212i}^2) , where g(x_t) = \u03b8 x_t + \u03bb|x_t| . (2)\n\nAsymmetry in the effects of positive and negative returns is introduced through the function g(x_t). If the coefficient \u03b8 is negative, negative returns will increase volatility, while the opposite will happen if \u03b8 is positive. Another GARCH extension that models asymmetric effects is GJR-GARCH [15]:\n\n\u03c3_t^2 = \u03b1_0 + \u2211_{j=1}^{q} \u03b1_j x_{t\u2212j}^2 + \u2211_{i=1}^{p} \u03b2_i \u03c3_{t\u2212i}^2 + \u2211_{k=1}^{r} \u03b3_k x_{t\u2212k}^2 I_{t\u2212k} , (3)\n\nwhere I_{t\u2212k} = 0 if x_{t\u2212k} \u2265 0 and I_{t\u2212k} = 1 otherwise. The asymmetric effect is now captured by I_{t\u2212k}, which is nonzero if x_{t\u2212k} < 0.\n\n3 Gaussian process state space models\n\nGARCH, EGARCH and GJR-GARCH can all be represented as General State-Space or Hidden Markov models (HMM) [21, 22], with the unobserved dynamic variances being the hidden states. Transition functions for the hidden states are fixed and assumed to be linear in these models. The linear assumption limits the flexibility of these models.\nMore generally, a non-parametric approach can be taken where a Gaussian Process (GP) prior is placed on the transition function, so that its functional form can be learned from data. This Gaussian Process state space model (GP-SSM) is a generalization of HMM. GP-SSM and HMM differ in two main ways. First, in HMM the transition function has a fixed functional form, while in GP-SSM it is represented by a GP. Second, in GP-SSM the states do not have Markovian structure once the transition function is marginalized out.\n\nFigure 1: Left, graphical model for GP-Vol. The transitions of the hidden states v_t are represented by the unknown function f. f takes as inputs the previous state v_{t\u22121} and previous observation x_{t\u22121}. Middle, 90% posterior interval for a. Right, 90% posterior interval for b.\n\nThe flexibility of GP-SSMs comes at a cost: inference in GP-SSMs is computationally challenging. Because of this, most of the previous work on GP-SSMs [16, 17, 18] has focused on filtering and smoothing the hidden states in GP-SSM, without jointly learning the GP dynamics. Note that in [18], the authors learn the dynamics, but using a separate dataset in which both input and target values for the GP model are observed. A few papers considered learning both the GP dynamics and the hidden states for special cases of GP-SSMs. For example, [23] applied EM to obtain maximum likelihood estimates for parametric systems that can be represented by GPs. A general method has been recently proposed for joint inference on the hidden states and the GP dynamics using Particle Gibbs with Ancestor Sampling (PGAS) [20, 19]. However, PGAS is a batch MCMC inference method that is computationally very expensive.\n\n4 Gaussian process volatility model\n\nOur new Gaussian Process Volatility Model (GP-Vol) is an instance of GP-SSM:\n\nx_t \u223c N(0, \u03c3_t^2) , v_t := log(\u03c3_t^2) = f(v_{t\u22121}, x_{t\u22121}) + \u03b5_t , \u03b5_t \u223c N(0, \u03c3_n^2) . (4)\n\nNote that we model the logarithm of the variance, which has real support. Equation (4) defines a GP-SSM. We place a GP prior on the transition function f. Let z_t = (v_t, x_t). Then f \u223c GP(m, k) where m(z_t) and k(z_t, z'_t) are the GP mean and covariance functions, respectively. The mean function can encode prior knowledge of the system dynamics. The covariance function gives the prior covariance between function values: k(z_t, z'_t) = Cov(f(z_t), f(z'_t)). Intuitively, if z_t and z'_t are close to each other, the covariances between the corresponding function values should be large: f(z_t) and f(z'_t) should be highly correlated.\nThe graphical model for GP-Vol is given in Figure 1. The explicit dependence of transition function values on the previous return x_{t\u22121} enables GP-Vol to model the asymmetric effects of positive and negative returns on the variance evolution. GP-Vol can be extended to depend on p previous log variances and q past returns like in GARCH(p,q). In this case, the transition would be of the form v_t = f(v_{t\u22121}, v_{t\u22122}, ..., v_{t\u2212p}, x_{t\u22121}, x_{t\u22122}, ..., x_{t\u2212q}) + \u03b5_t.\n\n5 Bayesian inference in GP-Vol\n\nIn the standard GP regression setting, the inputs and targets are fully observed and f can be learned using exact Bayesian inference [13]. However, this is not the case in GP-Vol, where the unknown {v_t} form part of the inputs and all the targets. Let \u03b8 denote the model hyper-parameters and let f = [f(v_1), . . . , f(v_T)]. Directly learning the joint posterior of the unknown variables f, v_{1:T} and \u03b8 is a challenging task. Fortunately, the posterior p(v_t|\u03b8, x_{1:t}), where f has been marginalized out, can be approximated with particles [24]. We first describe a standard sequential Monte Carlo (SMC) particle filter to learn this posterior.\nLet {v^i_{1:t\u22121}}_{i=1}^{N} be particles representing chains of states up to t\u22121 with corresponding normalized weights W^i_{t\u22121}. 
The posterior p(v_{1:t\u22121}|\u03b8, x_{1:t\u22121}) is then approximated by\n\np\u0302(v_{1:t\u22121}|\u03b8, x_{1:t\u22121}) = \u2211_{i=1}^{N} W^i_{t\u22121} \u03b4_{v^i_{1:t\u22121}}(v_{1:t\u22121}) . (5)\n\nThe corresponding posterior for v_{1:t} can be approximated by propagating these particles forward. For this, we propose new states from the GP-Vol transition model and then we importance-weight them according to the GP-Vol observation model. Specifically, we resample particles v^j_{1:t\u22121} from (5) according to their weights W^j_{t\u22121}, and propagate the samples forward. Then, for each of the particles propagated forward, we propose v^j_t from p(v_t|\u03b8, v^j_{1:t\u22121}, x_{1:t\u22121}), which is the GP predictive distribution. The proposed particles are then importance-weighted according to the observation model, that is, W^j_t \u221d p(x_t|\u03b8, v^j_t) = N(x_t|0, exp{v^j_t}).\nThe above setup assumes that \u03b8 is known. To learn these hyper-parameters, we can also encode them in particles and filter them together with the hidden states. However, since \u03b8 is constant across time, naively filtering such particles without regeneration will fail due to particle impoverishment, where a few or even one particle receives all the weight. To solve this problem, the Regularized Auxiliary Particle Filter (RAPF) regenerates parameter particles by performing kernel smoothing operations [25]. This introduces artificial dynamics and estimation bias. Nevertheless, RAPF has been shown to produce state-of-the-art inference in multivariate parametric financial models [6].\nRAPF was designed for HMMs, but GP-Vol is non-Markovian once f is marginalized out. 
Therefore, we design a new version of RAPF for non-Markovian systems and refer to it as the Regularized Auxiliary Particle Chain Filter (RAPCF), see Algorithm 1. There are two main parts in RAPCF. First, there is the Auxiliary Particle Filter (APF) part in lines 5, 6 and 7 of the pseudocode [26]. This part selects particles associated with high expected likelihood, as given by the new expected state in (7) and the corresponding resampling weight in (8). This bias towards particles with high expected likelihood is eliminated when the final importance weights are computed in (9). The most promising particles are propagated forward in lines 8 and 9. The main difference between RAPF and RAPCF is in the effect that previous states v^i_{1:t\u22121} have in the propagation of particles. In RAPCF all the previous states determine the probabilities of the particles being propagated, as the model is non-Markovian, while in RAPF these probabilities are only determined by the last state v^i_{t\u22121}. The second part of RAPCF avoids particle impoverishment in \u03b8. For this, new particles are generated in line 10 by sampling from a Gaussian kernel. The over-dispersion introduced by these artificial dynamics is eliminated in (6) by shrinking the particles towards their empirical average. We fix the shrinking parameter \u03bb to be 0.95. In practice, we found little difference in predictions when we varied \u03bb from 0.99 to 0.95.\nRAPCF has limitations similar to those of RAPF. First, it introduces bias as sampling from the kernel adds artificial dynamics. Second, RAPCF only filters forward and does not smooth backward. Consequently, there will be impoverishment in distant ancestors v_{t\u2212L}, since these states are not regenerated. When this occurs, GP-Vol will consider the collapsed ancestor states as inputs with little uncertainty and the predictive variance near these inputs will be underestimated. 
These issues\ncan be addressed by adopting a batch MCMC approach. In particular, Particle Markov Chain Monte\nCarlo (PMCMC) procedures [24] established a framework for learning the states and the parameters\nin general state space models. Additionally, [20] developed a PMCMC algorithm called Particle\nGibbs with ancestor sampling (PGAS) for learning non-Markovian state space models. PGAS was\napplied by [19] to learn GP-SSMs. These batch MCMC methods are computationally much more\nexpensive than RAPCF. Furthermore, our experiments show that in the GP-Vol model, RAPCF and\nPGAS have similar empirical performance, while RAPCF is orders of magnitude faster than PGAS.\nThis indicates that the aforementioned issues have limited impact in practice.\n\n6 Experiments\n\nWe performed three sets of experiments. First, we tested on synthetic data whether we can jointly\nlearn the hidden states and transition dynamics in GP-Vol using RAPCF. Second, we compared\nthe performance of GP-Vol against standard econometric models GARCH, EGARCH and GJR-\nGARCH on \ufb01fty real \ufb01nancial time series. Finally, we compared RAPCF with the batch MCMC\nmethod PGAS in terms of accuracy and execution time. The code for RAPCF in GP-Vol is publicly\navailable at http://jmhl.org.\n\n6.1 Experiments with synthetic data\nWe generated ten synthetic datasets of length T = 100 according to (4). 
The transition function f is sampled from a GP prior specified with a linear mean function and a squared exponential covariance function. The linear mean function is E(v_t) = m(v_{t\u22121}, x_{t\u22121}) = a v_{t\u22121} + b x_{t\u22121}.\n\nAlgorithm 1 RAPCF\n1: Input: data x_{1:T}, number of particles N, shrinkage parameter 0 < \u03bb < 1, prior p(\u03b8).\n2: Sample N parameter particles from the prior: {\u03b8^i_0}_{i=1,...,N} \u223c p(\u03b8).\n3: Set initial importance weights, W^i_0 = 1/N.\n4: for t = 1 to T do\n5:   Shrink parameter particles towards their empirical mean \u03b8\u0304_{t\u22121} = \u2211_{i=1}^{N} W^i_{t\u22121} \u03b8^i_{t\u22121} by setting \u03b8\u0303^i_t = \u03bb \u03b8^i_{t\u22121} + (1 \u2212 \u03bb) \u03b8\u0304_{t\u22121} . (6)\n6:   Compute the new expected states: \u03bc^i_t = E(v_t | \u03b8\u0303^i_t, v^i_{1:t\u22121}, x_{1:t\u22121}) . (7)\n7:   Compute importance weights proportional to the likelihood of the new expected states: g^i_t \u221d W^i_{t\u22121} p(x_t | \u03bc^i_t, \u03b8\u0303^i_t) . (8)\n8:   Resample N auxiliary indices {j} according to weights {g^i_t}.\n9:   Propagate the corresponding chains of hidden states forward, that is, {v^j_{1:t\u22121}}_{j\u2208J}.\n10:   Add jitter: \u03b8^j_t \u223c N(\u03b8\u0303^j_t, (1 \u2212 \u03bb^2) V_{t\u22121}), where V_{t\u22121} is the empirical covariance of \u03b8_{t\u22121}.\n11:   Propose new states v^j_t \u223c p(v_t | \u03b8^j_t, v^j_{1:t\u22121}, x_{1:t\u22121}).\n12:   Compute importance weights adjusting for the modified proposal: W^j_t \u221d p(x_t | v^j_t, \u03b8^j_t) / p(x_t | \u03bc^j_t, \u03b8\u0303^j_t) . (9)\n13: end for\n14: Output: particles for chains of states v^j_{1:T}, particles for parameters \u03b8^j_t and particle weights W^j_t.\n\n
The squared exponential covariance function is k(y, z) = \u03b3 exp(\u22120.5|y \u2212 z|^2/l^2) where l is the length-scale parameter and \u03b3 is the amplitude parameter.\nWe used RAPCF to learn the hidden states v_{1:T} and the hyper-parameters \u03b8 = (a, b, \u03c3_n, \u03b3, l) using non-informative diffuse priors for \u03b8. In these experiments, RAPCF successfully recovered the state and the hyper-parameter values. For the sake of brevity, we only include two typical plots of the 90% posterior intervals for hyper-parameters a and b in the middle and right of Figure 1. The intervals are estimated from the filtered particles for a and b at each time step t. In both plots, the posterior intervals eventually concentrate around the true parameter values, shown as dotted blue lines.\n\n6.2 Experiments with real data\n\nWe compared the predictive performances of GP-Vol, GARCH, EGARCH and GJR-GARCH on real financial datasets. We used GARCH(1,1), EGARCH(1,1) and GJR-GARCH(1,1,1) models since these variants have the fewest parameters and are consequently less affected by overfitting problems. We considered fifty datasets, consisting of thirty daily Equity and twenty daily foreign exchange (FX) time series. For the Equity series, we used daily closing prices. For FX markets, which operate 24 hours a day with no official daily closing prices, we cross-checked different pricing sources and took the consensus price up to 4 decimal places at 10am New York, which is the time with most market liquidity. Each of the resulting time series contains a total of T = 780 observations from January 2008 to January 2011. The price data p_{1:T} was pre-processed to eliminate prices corresponding to times when markets were closed or not liquid. After this, prices were converted into logarithmic returns, x_t = log(p_t/p_{t\u22121}). 
Finally, the resulting returns were standardized to have zero mean and unit standard deviation.\nDuring the experiments, each method receives an initial time series of length 100. The different models are trained on that data and then a one-step forward prediction is made. The performance of each model is measured in terms of the predictive log-likelihood on the first return out of the training set. Then the training set is augmented with the new observation and the training and prediction steps are repeated. The whole process is repeated sequentially until no further data is received.\n\nFigure 2: Comparison between GP-Vol, GARCH, EGARCH and GJR-GARCH via a Nemenyi test. The figure shows the average rank across datasets of each method (horizontal axis). The methods whose average ranks differ more than a critical distance (segment labeled CD) show significant differences in performance at this confidence level. When the performances of two methods are statistically different, their corresponding average ranks appear disconnected in the figure.\n\nGARCH, EGARCH and GJR-GARCH were implemented using numerical optimization routines provided by Kevin Sheppard (http://www.kevinsheppard.com/wiki/UCSD_GARCH/). A relatively long initial time series of length 100 was needed to train these models. Using shorter initial data resulted in wild jumps in the maximum likelihood estimates of the model parameters. These large fluctuations produced very poor one-step forward predictions. By contrast, GP-Vol is less susceptible to overfitting since it approximates the posterior distribution using RAPCF instead of finding point estimates of the model parameters. 
We placed broad non-informative priors on \u03b8 = (a, b, \u03c3_n, \u03b3, l) and used N = 200 particles and shrinkage parameter \u03bb = 0.95 in RAPCF.\n\nTable 1: FX series.\nDataset | GARCH | EGARCH | GJR | GP-Vol\nAUDUSD | \u22121.303 | \u22121.514 | \u22121.305 | \u22121.297\nBRLUSD | \u22121.203 | \u22121.227 | \u22121.201 | \u22121.180\nCADUSD | \u22121.402 | \u22121.409 | \u22121.402 | \u22121.386\nCHFUSD | \u22121.375 | \u22121.404 | \u22121.404 | \u22121.359\nCZKUSD | \u22121.422 | \u22121.473 | \u22121.422 | \u22121.456\nEURUSD | \u22121.418 | \u22122.120 | \u22121.426 | \u22121.403\nGBPUSD | \u22121.382 | \u22123.511 | \u22121.386 | \u22121.385\nIDRUSD | \u22121.223 | \u22121.244 | \u22121.209 | \u22121.039\nJPYUSD | \u22121.350 | \u22122.704 | \u22121.355 | \u22121.347\nKRWUSD | \u22121.189 | \u22121.168 | \u22121.209 | \u22121.154\nMXNUSD | \u22121.220 | \u22123.438 | \u22121.278 | \u22121.167\nMYRUSD | \u22121.394 | \u22121.412 | \u22121.395 | \u22121.392\nNOKUSD | \u22121.416 | \u22121.567 | \u22121.419 | \u22121.416\nNZDUSD | \u22121.369 | \u22123.036 | \u22121.379 | \u22121.389\nPLNUSD | \u22121.395 | \u22121.385 | \u22121.382 | \u22121.393\nSEKUSD | \u22121.403 | \u22123.705 | \u22121.402 | \u22121.407\nSGDUSD | \u22121.382 | \u22122.844 | \u22121.398 | \u22121.393\nTRYUSD | \u22121.224 | \u22121.461 | \u22121.238 | \u22121.236\nTWDUSD | \u22121.384 | \u22121.377 | \u22121.388 | \u22121.294\nZARUSD | \u22121.318 | \u22121.344 | \u22121.301 | \u22121.304\n\nTable 2: Equity series 1-15.\nDataset | GARCH | EGARCH | GJR | GP-Vol\nA | \u22121.304 | \u22121.449 | \u22121.281 | \u22121.282\nAA | \u22121.228 | \u22121.280 | \u22121.230 | \u22121.218\nAAPL | \u22121.234 | \u22121.358 | \u22121.219 | \u22121.212\nABC | \u22121.341 | \u22121.976 | \u22121.344 | \u22121.337\nABT | \u22121.295 | \u22121.527 | \u22121.3003 | \u22121.302\nACE | \u22121.084 | \u22122.025 | \u22121.106 | \u22121.073\nADBE | \u22121.335 | \u22121.501 | \u22121.386 | \u22121.302\nADI | \u22121.373 | \u22121.759 | \u22121.352 | \u22121.356\nADM | \u22121.228 | \u22121.884 | \u22121.223 | \u22121.223\nADP | \u22121.229 | \u22121.720 | \u22121.205 | \u22121.211\nADSK | \u22121.345 | \u22121.604 | \u22121.340 | \u22121.316\nAEE | \u22121.292 | \u22121.282 | \u22121.263 | \u22121.166\nAEP | \u22121.151 | \u22121.177 | \u22121.146 | \u22121.142\nAES | \u22121.237 | \u22121.319 | \u22121.234 | \u22121.197\nAET | \u22121.285 | \u22121.302 | \u22121.269 | \u22121.246\n\nTable 3: Equity series 16-30.\nDataset | GARCH | EGARCH | GJR | GP-Vol\nAFL | \u22121.057 | \u22121.126 | \u22121.061 | \u22120.997\nAGN | \u22121.270 | \u22121.338 | \u22121.261 | \u22121.274\nAIG | \u22121.151 | \u22121.256 | \u22121.195 | \u22121.069\nAIV | \u22121.111 | \u22121.147 | \u22121.1285 | \u22121.133\nAIZ | \u22121.423 | \u22121.816 | \u22121.469 | \u22121.362\nAKAM | \u22121.230 | \u22121.312 | \u22121.229 | \u22121.246\nAKS | \u22121.030 | \u22121.034 | \u22121.052 | \u22121.015\nALL | \u22121.339 | \u22123.108 | \u22121.316 | \u22121.327\nALTR | \u22121.286 | \u22121.443 | \u22121.277 | \u22121.282\nAMAT | \u22121.319 | \u22121.465 | \u22121.332 | \u22121.310\nAMD | \u22121.342 | \u22121.348 | \u22121.332 | \u22121.243\nAMGN | \u22121.191 | \u22121.542 | \u22121.1772 | \u22121.189\nAMP | \u22121.386 | \u22121.444 | \u22121.365 | \u22121.317\nAMT | \u22121.206 | \u22121.820 | \u22121.3658 | \u22121.210\nAMZN | \u22121.206 | \u22121.567 | \u22121.3537 | \u22121.342\n\nWe show the average predictive log-likelihood of GP-Vol, GARCH, EGARCH and GJR-GARCH in Tables 1, 2 and 3 for the FX series, the first 15 Equity series and the last 15 Equity series, respectively. The results of the best performing method in each dataset have been highlighted in bold. These tables show that GP-Vol obtains the highest predictive log-likelihood in 29 of the 50 analyzed datasets. We perform a statistical test to determine whether differences among GP-Vol, GARCH, EGARCH and GJR-GARCH are significant. These methods are compared against each other using the multiple comparison approach described by [27]. In this comparison framework, all the methods are ranked according to their performance on different tasks. Statistical tests are then applied to determine whether the differences among the average ranks of the methods are significant. 
In our case, each of the 50 datasets analyzed represents a different task. A Friedman rank sum test rejects the hypothesis that all methods have equivalent performance at \u03b1 = 0.05 with p-value less than 10^{\u221215}. Pairwise comparisons between all the methods with a Nemenyi test at a 95% confidence level are summarized in Figure 2. The Nemenyi test shows that GP-Vol is significantly better than the other methods.\nThe other main advantage of GP-Vol over existing models is that it can learn the functional relationship f between the new log variance v_t and the previous log variance v_{t\u22121} and previous return x_{t\u22121}. We plot a typical log variance surface in the left of Figure 3. This surface is generated by plotting the mean predicted outputs v_t against a grid of inputs for v_{t\u22121} and x_{t\u22121}. For this, we use the functional dynamics learned with RAPCF on the AUDUSD time series. AUDUSD stands for the amount of US dollars that an Australian dollar can buy. The grid of inputs is designed to contain a range of values experienced by AUDUSD from 2008 to 2011, which is the period covered by the data. The surface is colored according to the standard deviation of the posterior predictive distribution for the log variance. Large standard deviations correspond to uncertain predictions, and are redder.\n\nFigure 3: Left, surface generated by plotting the mean predicted outputs v_t against a grid of inputs for v_{t\u22121} and x_{t\u22121}. Middle, predicted v_t \u00b1 2 s.d. for inputs (v_{t\u22121}, 0). Right, predicted v_t \u00b1 2 s.d. for inputs (0, x_{t\u22121}).\n\nThe plot in the left of Figure 3 shows several patterns. First, there is an asymmetric effect of positive and negative previous returns x_{t\u22121}. This can be seen in the skewness and lack of symmetry of the contour lines with respect to the v_{t\u22121} axis. 
Second, the relationship between v_{t\u22121} and v_t is slightly non-linear because the distance between consecutive contour lines along the v_{t\u22121} axis changes as we move across those lines, especially when x_{t\u22121} is large. In addition, the relationship between x_{t\u22121} and v_t is nonlinear, resembling a skewed quadratic function. These two patterns confirm the asymmetric effect and the nonlinear transition function that EGARCH and GJR-GARCH attempt to model. Third, there is a dip in predicted log variance for v_{t\u22121} < \u22122 and \u22121 < x_{t\u22121} < 2.5. Intuitively this makes sense, as it corresponds to a calm market environment with low volatility. However, as x_{t\u22121} becomes more extreme the market becomes more turbulent and v_t increases.\nTo further understand the transition function f we study cross sections of the log variance surface. First, v_t is predicted for a grid of v_{t\u22121} and x_{t\u22121} = 0 in the middle plot of Figure 3. Next, v_t is predicted for various x_{t\u22121} and v_{t\u22121} = 0 in the right plot of Figure 3. The confidence bands in the figures correspond to the mean prediction \u00b12 standard deviations. These cross sections confirm the nonlinearity of the transition function and the asymmetric effect of positive and negative returns on the log variance. The transition function is slightly non-linear as a function of v_{t\u22121} as the band in the middle plot of Figure 3 passes through (\u22122, \u22122) and (0, 0), but not (2, 2). Surprisingly, we observe in the right plot of Figure 3 that large positive x_{t\u22121} produces larger v_t when v_{t\u22121} = 0 since the band is slightly higher at x_{t\u22121} = 6 than at x_{t\u22121} = \u22126. However, globally, the highest predicted v_t occurs when v_{t\u22121} > 5 and x_{t\u22121} < \u22125, as shown in the surface plot.\n\n6.3 Comparison between RAPCF and PGAS\n\nWe now analyze the potential shortcomings of RAPCF that were discussed in Section 5. 
For this, we compare RAPCF against PGAS on the twenty FX time series from the previous section in terms of predictive log-likelihood and execution times. The RAPCF setup is the same as in Section 6.2. For PGAS, which is a batch method, the algorithm is run on initial training data x_{1:L}, with L = 100, and a one-step forward prediction is made. The predictive log-likelihood is evaluated on the next observation out of the training set. Then the training set is augmented with the new observation and the batch training and prediction steps are repeated. The process is repeated sequentially until no further data is received. For these experiments we used shorter time series with T = 120 since PGAS is computationally very expensive. Note that we cannot simply learn the GP-SSM dynamics on a small set of training data and then predict on a large test dataset, as was done in [19]. These authors were able to predict forward as they were using synthetic data with known \u201chidden\u201d states.\nWe analyze different settings of RAPCF and PGAS. In RAPCF we use N = 200 particles since that number was used to compare against GARCH, EGARCH and GJR-GARCH in the previous section. PGAS has two parameters: a) N, the number of particles and b) M, the number of iterations. Three combinations of these settings were used. The resulting average predictive log-likelihoods for RAPCF and PGAS are shown in Table 4. On each dataset, the results of the best performing method have been highlighted in bold. The average rank of each method across the analyzed datasets is shown in Table 5. 
From these tables, there is no evidence that PGAS outperforms RAPCF on these\nfinancial datasets, since there is no clear predictive edge of any PGAS setting over RAPCF.\n\nDataset  RAPCF    PGAS.1   PGAS.2   PGAS.3\n         N = 200  N = 10   N = 25   N = 10\n                  M = 100  M = 100  M = 200\nAUDUSD   \u22121.1205  \u22121.0571  \u22121.0699  \u22121.0936\nBRLUSD   \u22121.0102  \u22121.0043  \u22120.9959  \u22120.9759\nCADUSD   \u22121.4174  \u22121.4778  \u22121.4514  \u22121.4077\nCHFUSD   \u22121.8431  \u22121.8536  \u22121.8453  \u22121.8478\nCZKUSD   \u22121.2263  \u22121.2357  \u22121.2424  \u22121.2093\nEURUSD   \u22121.3837  \u22121.4586  \u22121.3717  \u22121.4064\nGBPUSD   \u22121.1863  \u22121.2106  \u22121.1790  \u22121.1729\nIDRUSD   \u22120.5446  \u22120.5220  \u22120.5388  \u22120.5463\nJPYUSD   \u22122.0766  \u22121.9286  \u22122.1585  \u22122.1658\nKRWUSD   \u22121.0566  \u22121.1212  \u22121.2032  \u22121.2066\nMXNUSD   \u22120.2417  \u22120.2731  \u22120.2271  \u22120.2538\nMYRUSD   \u22121.4615  \u22121.5464  \u22121.4745  \u22121.4724\nNOKUSD   \u22121.3095  \u22121.3443  \u22121.3048  \u22121.3169\nNZDUSD   \u22121.2254  \u22121.2101  \u22121.2366  \u22121.2373\nPLNUSD   \u22120.8972  \u22120.8704  \u22120.8708  \u22120.8704\nSEKUSD   \u22121.0085  \u22121.0085  \u22121.0505  \u22121.0360\nSGDUSD   \u22121.6229  \u22121.9141  \u22121.7566  \u22121.7837\nTRYUSD   \u22121.8336  \u22121.8509  \u22121.8352  \u22121.8553\nTWDUSD   \u22121.7093  \u22121.7178  \u22121.8315  \u22121.7257\nZARUSD   \u22121.3236  \u22121.3326  \u22121.3440  \u22121.3286\n\nTable 4: Results for RAPCF vs. PGAS.\n\nMethod  Configuration    Rank\nRAPCF   N = 200          2.025\nPGAS.1  N = 10, M = 100  2.750\nPGAS.2  N = 25, M = 100  2.550\nPGAS.3  N = 10, M = 200  2.675\n\nTable 5: Average ranks.\n\nMethod  Configuration    Avg. Time (min)\nRAPCF   N = 200          6\nPGAS.1  N = 10, M = 100  732\nPGAS.2  N = 25, M = 100  1832\nPGAS.3  N = 10, M = 200  1465\n\nTable 6: Avg. running time.\n\nAs mentioned above, there is little difference between the predictive accuracies of RAPCF and\nPGAS. 
However, PGAS is computationally much more expensive. We show average execution times\nin minutes for RAPCF and PGAS in Table 6. Note that RAPCF is up to two orders of magnitude\nfaster than PGAS. The cost of this latter method could be reduced by using fewer particles N or\nfewer iterations M, but this would also reduce its predictive accuracy. Even after doing so, PGAS\nwould still be more costly than RAPCF. RAPCF is also competitive with GARCH, EGARCH and\nGJR-GARCH, whose average training times are in this case 2.6, 3.5 and 3.1 minutes, respectively. A naive\nimplementation of RAPCF has cost O(N T^4), since at each time step t there is an O(T^3) cost from\nthe inversion of the GP covariance matrix. On the other hand, the cost of applying PGAS naively is\nO(N M T^5), since for each batch of data x_{1:t} there is an O(N M T^4) cost. These costs can be reduced\nto O(N T^3) and O(N M T^4) for RAPCF and PGAS respectively by doing rank-one updates of\nthe inverse of the GP covariance matrix at each time step. The costs can be further reduced by a\nfactor of T^2 by using sparse GPs [28].\n\n7 Summary and discussion\n\nWe have introduced a novel Gaussian Process Volatility model (GP-Vol) for time-varying variances\nin financial time series. GP-Vol is an instance of a Gaussian Process State-Space model (GP-SSM)\nwhich is highly flexible and can model nonlinear functional relationships and asymmetric effects of\npositive and negative returns on time-varying variances. In addition, we have presented an online\ninference method based on particle filtering for GP-Vol called the Regularized Auxiliary Particle\nChain Filter (RAPCF). RAPCF is up to two orders of magnitude faster than existing batch Particle\nGibbs methods. Results for GP-Vol on 50 financial time series show significant improvements in\npredictive performance over existing models such as GARCH, EGARCH and GJR-GARCH. 
Finally,\nthe nonlinear transition functions learned by GP-Vol can be easily analyzed to understand the effect\nof past volatility and past returns on future volatility.\nFor future work, GP-Vol can be extended to learn the functional relationship between a financial\ninstrument\u2019s volatility, its price and other market factors, such as interest rates. The functional\nrelationship thus learned can be useful in the pricing of volatility derivatives on the instrument.\nAdditionally, the computational efficiency of RAPCF makes it an attractive choice for inference in\nother GP-SSMs different from GP-Vol. For example, RAPCF could be more generally applied to\nlearn the hidden states and the dynamics in complex control systems.\n\nReferences\n\n[1] R. Cont. Empirical properties of asset returns: Stylized facts and statistical issues. Quantitative Finance, 1(2):223\u2013236, 2001.\n\n[2] T. Bollerslev. Generalized autoregressive conditional heteroskedasticity. Journal of Econometrics, 31(3):307\u2013327, 1986.\n\n[3] A. Harvey, E. Ruiz, and N. Shephard. Multivariate stochastic variance models. The Review of Economic Studies, 61(2):247\u2013264, 1994.\n\n[4] S. Kim, N. Shephard, and S. Chib. Stochastic volatility: likelihood inference and comparison with ARCH models. The Review of Economic Studies, 65(3):361\u2013393, 1998.\n\n[5] S.H. Poon and C. Granger. Practical issues in forecasting volatility. Financial Analysts Journal, 61(1):45\u201356, 2005.\n\n[6] Y. Wu, J. M. Hern\u00e1ndez-Lobato, and Z. Ghahramani. Dynamic covariance models for multivariate financial time series. In ICML, pages 558\u2013566, 2013.\n\n[7] L. Hentschel. All in the family: Nesting symmetric and asymmetric GARCH models. Journal of Financial Economics, 39(1):71\u2013104, 1995.\n\n[8] G. Bekaert and G. Wu. Asymmetric volatility and risk in equity markets. Review of Financial Studies, 13(1):1\u201342, 2000.\n\n[9] J.Y. Campbell and L. Hentschel. 
No news is good news: An asymmetric model of changing volatility in stock returns. Journal of Financial Economics, 31(3):281\u2013318, 1992.\n\n[10] M.W. Brandt and C.S. Jones. Volatility forecasting with range-based EGARCH models. Journal of Business & Economic Statistics, 24(4):470\u2013486, 2006.\n\n[11] A. Wilson and Z. Ghahramani. Copula processes. In Advances in Neural Information Processing Systems 23, pages 2460\u20132468, 2010.\n\n[12] M. L\u00e1zaro-Gredilla and M. K. Titsias. Variational heteroscedastic Gaussian process regression. In ICML, pages 841\u2013848, 2011.\n\n[13] C.E. Rasmussen and C.K.I. Williams. Gaussian processes for machine learning. MIT Press, 2006.\n\n[14] D.B. Nelson. Conditional heteroskedasticity in asset returns: A new approach. Econometrica, 59(2):347\u2013370, 1991.\n\n[15] L.R. Glosten, R. Jagannathan, and D.E. Runkle. On the relation between the expected value and the volatility of the nominal excess return on stocks. The Journal of Finance, 48(5):1779\u20131801, 1993.\n\n[16] J. Ko and D. Fox. GP-BayesFilters: Bayesian filtering using Gaussian process prediction and observation models. Autonomous Robots, 27(1):75\u201390, 2009.\n\n[17] M. P. Deisenroth, M. F. Huber, and U. D. Hanebeck. Analytic moment-based Gaussian process filtering. In ICML, pages 225\u2013232. ACM, 2009.\n\n[18] M. Deisenroth and S. Mohamed. Expectation Propagation in Gaussian Process Dynamical Systems. In Advances in Neural Information Processing Systems 25, pages 2618\u20132626, 2012.\n\n[19] R. Frigola, F. Lindsten, T. B. Sch\u00f6n, and C. E. Rasmussen. Bayesian inference and learning in Gaussian process state-space models with particle MCMC. In NIPS, pages 3156\u20133164, 2013.\n\n[20] F. Lindsten, M. Jordan, and T. Sch\u00f6n. Ancestor Sampling for Particle Gibbs. In Advances in Neural Information Processing Systems 25, pages 2600\u20132608, 2012.\n\n[21] L.E. Baum and T. Petrie. 
Statistical inference for probabilistic functions of finite state Markov chains. The Annals of Mathematical Statistics, 37(6):1554\u20131563, 1966.\n\n[22] A. Doucet, N. De Freitas, and N. Gordon. Sequential Monte Carlo methods in practice. Springer Verlag, 2001.\n\n[23] R. D. Turner, M. P. Deisenroth, and C. E. Rasmussen. State-space inference and learning with Gaussian processes. In AISTATS, pages 868\u2013875, 2010.\n\n[24] C. Andrieu, A. Doucet, and R. Holenstein. Particle Markov chain Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(3):269\u2013342, 2010.\n\n[25] J. Liu and M. West. Combined parameter and state estimation in simulation-based filtering. Institute of Statistics and Decision Sciences, Duke University, 1999.\n\n[26] M.K. Pitt and N. Shephard. Filtering via simulation: Auxiliary particle filters. Journal of the American Statistical Association, pages 590\u2013599, 1999.\n\n[27] J. Dem\u0161ar. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7:1\u201330, 2006.\n\n[28] J. Qui\u00f1onero-Candela and C.E. Rasmussen. A unifying view of sparse approximate Gaussian process regression. The Journal of Machine Learning Research, 6:1939\u20131959, 2005.\n", "award": [], "sourceid": 623, "authors": [{"given_name": "Yue", "family_name": "Wu", "institution": "Cambridge"}, {"given_name": "Jos\u00e9 Miguel", "family_name": "Hern\u00e1ndez-Lobato", "institution": "Harvard University"}, {"given_name": "Zoubin", "family_name": "Ghahramani", "institution": "University of Cambridge"}]}