{"title": "Scalable Bayesian dynamic covariance modeling with variational Wishart and inverse Wishart processes", "book": "Advances in Neural Information Processing Systems", "page_first": 4582, "page_last": 4592, "abstract": "We implement gradient-based variational inference routines for Wishart and inverse Wishart processes, which we apply as Bayesian models for the dynamic, heteroskedastic covariance matrix of a multivariate time series. The Wishart and inverse Wishart processes are constructed from i.i.d. Gaussian processes, existing variational inference algorithms for which form the basis of our approach. These methods are easy to implement as a black-box and scale favorably with the length of the time series, however, they fail in the case of the Wishart process, an issue we resolve with a simple modification into an additive white noise parameterization of the model. This modification is also key to implementing a factored variant of the construction, allowing inference to additionally scale to high-dimensional covariance matrices. Through experimentation, we demonstrate that some (but not all) model variants outperform multivariate GARCH when forecasting the covariances of returns on financial instruments.", "full_text": "Scalable Bayesian dynamic covariance modeling with\nvariational Wishart and inverse Wishart processes\n\nCreighton Heaukulani\n\nNo Af\ufb01liation\n\nBangkok, Thailand\n\nc.k.heaukulani@gmail.com\n\nMark van der Wilk\n\nPROWLER.io\n\nCambridge, United Kingdom\n\nmark@prowler.io\n\nAbstract\n\nWe implement gradient-based variational inference routines for Wishart and in-\nverse Wishart processes, which we apply as Bayesian models for the dynamic,\nheteroskedastic covariance matrix of a multivariate time series. The Wishart and\ninverse Wishart processes are constructed from i.i.d. Gaussian processes, existing\nvariational inference algorithms for which form the basis of our approach. 
These methods are easy to implement as a black box and scale favorably with the length of the time series; however, they fail in the case of the Wishart process, an issue we resolve with a simple modification into an additive white noise parameterization of the model. This modification is also key to implementing a factored variant of the construction, allowing inference to additionally scale to high-dimensional covariance matrices. Through experimentation, we demonstrate that some (but not all) model variants outperform multivariate GARCH when forecasting the covariances of returns on financial instruments.

1 Introduction

Estimating the (time series of) covariance matrices between the variables in a multivariate time series is a principal problem of interest in many domains, including the construction of financial trading portfolios [15] and the study of brain activity measurements in neurological studies [7]. Estimating the entries of the covariance matrices is a challenging problem, however, because there are O(N D^2) parameters to estimate for a time series of length N with D variables, yet we only record a single observation of the time series consisting of O(N D) data points. Bayesian models (and their corresponding inference procedures) often perform well in these overparameterized problems; indeed, Fox and West [7], Wilson and Ghahramani [24], and Fox and Dunson [6] show that Bayesian approaches based on the Wishart and inverse Wishart processes produce better estimates of dynamic covariance matrices than the venerable multivariate GARCH approaches [3, 4].

The Wishart and inverse Wishart processes are two related stochastic processes in the state space of symmetric, positive definite matrices, making them appropriate models for (heteroskedastic) time series of covariance matrices. They are themselves constructed from i.i.d. 
Gaussian processes, in analogy to the construction of Wishart or inverse Wishart random variables from i.i.d. Gaussian random variables. Exact posterior inference for these models is intractable, and so previous authors have suggested approximate inference routines based on Markov chain Monte Carlo (MCMC) algorithms. We instead propose a gradient-based variational inference routine, derived from approaches to approximate inference with (sparse and/or multi-output) Gaussian process models. Taking the variational approach has several advantages, including a simple, black-box implementation and the ability to scale down the computational cost of inference with respect to N, the length of the time series, if required. Furthermore, we derive a factored variant of the model that may additionally scale inference to large numbers of variables D, i.e., the dimensionality of the covariance matrix. In our experiments, we will see that our black-box, scalable, gradient-based variational inference routines have predictive performance that is competitive with multivariate GARCH.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

We start by considering variational inference routines for the model presented by Wilson and Ghahramani [24]; our approach is a gradient-based analogue of the coordinate ascent algorithms for variational inference on Wishart processes presented by van der Wilk et al. [23]. The factored variants of the model that we build toward in Section 5 end up being a reparameterization of the construction by Fox and Dunson [6], and so our work provides gradient-based variational inference routines for their model class as well. Alternatively, Fox and West [7] construct inverse Wishart processes that are autoregressive (as opposed to the full process dependence assumed by the Gaussian processes); Wu et al. 
[26] model time series of (univariate) variances with Gaussian process models; and Wu et al. [25] consider generalizing multivariate GARCH by modeling the transition matrices of the process with autoregressive structures. All of these references elect MCMC-based inference and emphasize that Bayesian inference of (co)variances dominates non-Bayesian approaches.

2 Wishart and inverse Wishart processes

Let Y := (Y_n, n ≥ 1) denote a sequence of measurements in R^D, which will be regressed upon a corresponding sequence of input locations (i.e., covariates) in R^p denoted by X := (X_n, n ≥ 1). In our applications, we will take X_n to be a univariate (so p = 1), real-valued representation of the "time" at which the measurement Y_n was taken. For example, in a dataset of daily stock returns, the vector Y_n can record the returns for D stocks on day n, for n ≤ N, where the points in X can be linearly spaced in some fixed interval like (0, 1), and individual points of X may be altered to account for any irregular spacing, such as weekends or the removal of special trading days.

We let the conditional likelihood of Y_n be given by the multivariate Gaussian density

Y_n | µ_n, Σ_n ∼ N(µ_n, Σ_n),  n ≥ 1,  (1)

for a sequence µ_1, µ_2, . . . of elements in R^D and a sequence Σ_1, Σ_2, . . . of (random) positive definite matrices, which we note may depend on X. In modern portfolio theory [15], where Y_n is a sequence of financial returns, predictions for the mean process µ_1, µ_2, . . . are used in conjunction with predictions for the covariances of the residuals Σ_1, Σ_2, . . . to construct a portfolio that maximizes expected return while minimizing risk. In this article, we will focus on modeling the process Σ_1, Σ_2, . . . 
, and we henceforth assume that Y_n is mean zero (i.e., µ_n = 0, the zero vector), for n ≤ N.

Bayesian models for the sequence Σ := (Σ_1, Σ_2, . . . ) include the Wishart and inverse Wishart processes. In analogy to the construction of Wishart and inverse Wishart random variables from i.i.d. collections of Gaussian random variables, we may construct Wishart and inverse Wishart processes from i.i.d. collections of Gaussian processes as follows. Let

f_{d,k} ∼ GP(0, κ(·, ·; θ)),  d ≤ D, k ≤ ν,  (2)

be i.i.d. Gaussian processes with zero mean function and (shared) kernel function κ(·, ·; θ), where θ denotes any parameters of the kernel function, and the positive integer-valued ν ≥ D will be called the degrees of freedom parameter. Let F_{n,d,k} := f_{d,k}(X_n), and let F_n := (F_{n,d,k}, d ≤ D, k ≤ ν) denote the D × ν matrix of collected function values, for every n ≥ 1. Construct

Σ_n = A F_n F_n^T A^T,  n ≥ 1,  (3)

where A ∈ R^{D×D} satisfies the condition that the symmetric matrix A A^T is positive definite.¹ So constructed, Σ_n is (marginally) Wishart distributed, and Σ := (Σ_1, Σ_2, . . . ) is correspondingly called a Wishart process with degrees of freedom ν and scale matrix A A^T. Alternatively, if we instead construct the precision matrix

Σ_n^{-1} = A F_n F_n^T A^T,  n ≥ 1,  (4)

then Σ_n is inverse Wishart distributed, and Σ is called an inverse Wishart process (with degrees of freedom ν and scale matrix A A^T). 
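As a concrete illustration, the construction in Eq. (3) can be simulated in a few lines. The following is a minimal NumPy sketch, not the paper's GPflow implementation; the squared-exponential kernel and its lengthscale setting here are arbitrary choices for illustration:

```python
import numpy as np

def rbf_kernel(x, z, lengthscale=0.2, variance=1.0):
    """Squared-exponential kernel matrix between input vectors x and z."""
    sq = (x[:, None] - z[None, :]) ** 2
    return variance * np.exp(-0.5 * sq / lengthscale**2)

def sample_wishart_process(X, A, nu, rng, jitter=1e-8):
    """Draw one realization of Sigma_1..Sigma_N from the construction in Eq. (3).

    Each of the D * nu functions f_{d,k} is an i.i.d. GP draw over the inputs X;
    Sigma_n = A F_n F_n^T A^T is then (marginally) Wishart with scale A A^T.
    """
    N, D = len(X), A.shape[0]
    Kxx = rbf_kernel(X, X) + jitter * np.eye(N)
    L = np.linalg.cholesky(Kxx)
    # F[n] is the D x nu matrix of GP function values at input X_n.
    F = np.einsum('nm,mdk->ndk', L, rng.standard_normal((N, D, nu)))
    return np.stack([A @ F[n] @ F[n].T @ A.T for n in range(N)])

rng = np.random.default_rng(0)
X = np.linspace(0.0, 1.0, 50)
D = 3
Sigmas = sample_wishart_process(X, np.eye(D), nu=D, rng=rng)
# With nu >= D, each Sigma_n is symmetric and (almost surely) full rank,
# i.e., positive semi-definite up to numerical precision.
assert np.allclose(Sigmas, np.transpose(Sigmas, (0, 2, 1)))
```

Because the same GP draw underlies consecutive time steps, nearby Σ_n vary smoothly, which is the dynamic behavior the model is after.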
The dynamics of the process of covariance matrices Σ are inherited from the Gaussian processes, and are perhaps best controlled through the kernel function κ(·, ·; θ).

¹ Alternatively, we may take A to be the (triangular) Cholesky factor of a positive definite matrix A A^T.

The posterior distribution for Σ is difficult to evaluate, and so previous MCMC-based approaches to approximate inference typically utilize conjugacy results between the (inverse) Wishart distribution and the likelihood function in Eq. (1). In contrast, the "black-box" variational inference routines that we suggest only require evaluations of the log conditional likelihood function, dramatically simplifying their implementation. For the Wishart process case, we have

log p(Y_n | F_n) = −(D/2) log(2π) − (1/2) log |A F_n F_n^T A^T| − (1/2) Y_n^T (A F_n F_n^T A^T)^{-1} Y_n,  (5)

and for the inverse Wishart case, we have

log p(Y_n | F_n) = −(D/2) log(2π) + (1/2) log |A F_n F_n^T A^T| − (1/2) Y_n^T A F_n F_n^T A^T Y_n.  (6)

Changing our implementation between these two likelihood models only requires changing the line(s) of code computing these expressions, highlighting the ease of the black-box approach. Other likelihoods may be considered; for example, Eq. (5) and Eq. (6) may be replaced by the likelihood function for a multivariate t-distribution (as done by Wu et al. [25]), a popular heavy-tailed model.

3 Inducing points and variational inference

A popular approach to variational inference with Gaussian processes is based on the introduction of M inducing points Z := (Z_1, . . . , Z_M), taking values in the same space as the inputs X, upon which we assume the dependence of the function values F_n decouples during inference [1, 9, 18]. 
In particular, for every d ≤ D and k ≤ ν, let U_{m,d,k} := f_{d,k}(Z_m), for m ≤ M, denote the evaluations of the Gaussian process at the inducing points, and collectively denote U_{d,k} := (U_{m,d,k}, m ≤ M) and F_{d,k} := (F_{n,d,k}, n ≤ N). By independence, and with well-known properties of the Gaussian distribution, we may write

p(Y, F, U) = [∏_{n=1}^N p(Y_n | F_n)] ∏_{d=1}^D ∏_{k=1}^ν [p(F_{d,k} | U_{d,k}) p(U_{d,k})],  (7)

where

p(F_{d,k} | U_{d,k}) = N(F_{d,k}; K_xz K_zz^{-1} U_{d,k}, K_xx − K_xz K_zz^{-1} K_xz^T),  (8)
p(U_{d,k}) = N(U_{d,k}; 0, K_zz),  (9)

and where the N × N matrix K_xx has (n, n′)-th element κ(X_n, X_{n′}; θ), the N × M matrix K_xz has (n, m)-th element κ(X_n, Z_m; θ), and the M × M matrix K_zz has (m, m′)-th element κ(Z_m, Z_{m′}; θ). Following Hensman et al. [10], we introduce a variational approximation to the posterior distribution of the latent variables that takes the following form: Independently for every d ≤ D and k ≤ ν, let

q(F_{d,k}, U_{d,k}) = p(F_{d,k} | U_{d,k}) q(U_{d,k}),  where  q(U_{d,k}) = N(U_{d,k}; µ_{d,k}, S_{d,k}),  (10)

for some variational parameters µ_{d,k} ∈ R^M and S_{d,k} ∈ R^{M×M} a real, symmetric, positive definite matrix. It follows that

q(F_{d,k}) = ∫ p(F_{d,k} | U_{d,k}) q(U_{d,k}) dU_{d,k} = N(F_{d,k}; K̃ µ_{d,k}, K_xx + K̃ (S_{d,k} − K_zz) K̃^T),  (11)

where K̃ := K_xz K_zz^{-1}. That is, the variational approximation q(U_{d,k}) induces the approximation q(F_{d,k}). We may then lower bound the log marginal likelihood of the data as follows:

log p(Y) ≥ ∑_{n=1}^N E_{q(F_n)}[log p(Y_n | F_n)] − ∑_{d=1}^D ∑_{k=1}^ν KL[q(U_{d,k}) || p(U_{d,k})],  (12)

where KL[q || p] denotes the Kullback–Leibler divergence from q to p. 
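The induced marginal q(F_{d,k}) in Eq. (11) is a direct matrix computation. Below is a minimal NumPy sketch, again assuming an RBF kernel as a stand-in for κ, with a sanity check that placing the inducing points at the inputs (Z = X) and setting q(U) = p(U) (µ = 0, S = K_zz) recovers the GP prior:

```python
import numpy as np

def rbf(x, z, ell=0.2):
    """Toy RBF kernel standing in for kappa(., .; theta)."""
    return np.exp(-0.5 * (x[:, None] - z[None, :]) ** 2 / ell**2)

def q_f_marginal(X, Z, mu, S, jitter=1e-8):
    """Mean and covariance of q(F_{d,k}) in Eq. (11):
    q(F) = N(Kt mu, Kxx + Kt (S - Kzz) Kt^T), with Kt = Kxz Kzz^{-1}."""
    Kxx = rbf(X, X)
    Kxz = rbf(X, Z)
    Kzz = rbf(Z, Z) + jitter * np.eye(len(Z))
    Kt = Kxz @ np.linalg.inv(Kzz)
    mean = Kt @ mu
    cov = Kxx + Kt @ (S - Kzz) @ Kt.T
    return mean, cov

# Sanity check: with Z = X and q(U) = p(U), q(F) recovers the prior N(0, Kxx).
X = np.linspace(0, 1, 20)
Kzz = rbf(X, X) + 1e-8 * np.eye(20)
mean, cov = q_f_marginal(X, X, np.zeros(20), Kzz)
assert np.allclose(mean, 0.0)
assert np.allclose(cov, rbf(X, X), atol=1e-6)
```

With M ≪ N, the same computation yields the sparse approximation discussed later in this section.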
To perform inference, we maximize the evidence lower bound on the right-hand side of Eq. (12) (which we note depends on the parameters to be optimized, Θ := {Z, µ, S, θ}, through the variational distribution q) via gradient ascent. The terms KL[q(U_{d,k}) || p(U_{d,k})] may be analytically evaluated, and so their gradients (w.r.t. the optimization parameters) are straightforward to compute. Because q(F) is not conjugate to the likelihood p(Y | F), we cannot analytically evaluate the term E_{q(F_n)}[log p(Y_n | F_n)]. We therefore follow Salimans and Knowles [21] and Kingma and Welling [12] to approximate the gradients of (Monte Carlo estimates of) this expression by "differentiating through" random samples from q as follows. Independently for every d ≤ D and k ≤ ν, produce the R ≥ 1 Monte Carlo samples

F_{d,k}^{(r)} = ψ_{d,k}(w_{d,k}^{(r)}; Θ),  w_{d,k}^{(r)} ∼ N(0, I_N),  r = 1, . . . , R,  (13)

where ψ_{d,k}(w; Θ) := B_{d,k} w + K̃ µ_{d,k}, for the matrix B_{d,k} ∈ R^{N×N} that satisfies B_{d,k} B_{d,k}^T = K_xx + K̃ (S_{d,k} − K_zz) K̃^T, as given by the Cholesky factor. Note then that the samples generated according to Eq. (13) have distribution q(F_{d,k}). 
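The reparameterization in Eq. (13) can be sketched as follows; this is a toy illustration rather than the paper's implementation, and the Monte Carlo check at the end simply confirms the samples have the intended moments:

```python
import numpy as np

def sample_q_f(mean, cov, R, rng, jitter=1e-8):
    """Reparameterized draws from q(F) as in Eq. (13): F = B w + mean,
    where B is the Cholesky factor of the q(F) covariance, so that gradients
    can flow through the deterministic map psi(w) = B w + mean."""
    N = len(mean)
    B = np.linalg.cholesky(cov + jitter * np.eye(N))
    w = rng.standard_normal((R, N))   # w ~ N(0, I_N), independent of Theta
    return mean[None, :] + w @ B.T

rng = np.random.default_rng(1)
mean = np.array([0.5, -1.0, 2.0])
cov = np.array([[1.0, 0.3, 0.0],
                [0.3, 1.0, 0.2],
                [0.0, 0.2, 1.0]])
F = sample_q_f(mean, cov, R=200_000, rng=rng)
# Empirical moments recover (mean, cov) as R grows.
assert np.allclose(F.mean(axis=0), mean, atol=0.02)
assert np.allclose(np.cov(F.T), cov, atol=0.02)
```

In an autodiff framework, differentiating the samples with respect to (µ, S, Z, θ) gives exactly the gradient estimators of the next display.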
Form the Monte Carlo approximations

∇_{(µ_{d,k}, S_{d,k})} E_{q(F_n)}[log p(Y_n | F_n)] ≈ (1/R) ∑_{r=1}^R [∇_{F_{d,k}} log p(Y_n | F_n^{(r)}) ∘ ∇_{(µ_{d,k}, S_{d,k})} ψ_{d,k}(w_{d,k}^{(r)}; Θ)],

for every d ≤ D and k ≤ ν, where ∘ denotes the element-wise product, and

∇_{(Z,θ)} E_{q(F_n)}[log p(Y_n | F_n)] ≈ (1/R) ∑_{r=1}^R ∑_{d=1}^D ∑_{k=1}^ν [∇_{F_{d,k}} log p(Y_n | F_n^{(r)}) ∘ ∇_{(Z,θ)} ψ_{d,k}(w_{d,k}^{(r)}; Θ)].

These unbiased estimates often have low enough variance that a single Monte Carlo sample suffices for the approximation [21]; however, we will see in Section 4 that this is not the case with the Wishart process, where a numerical instability renders these estimates useless. Finally, the gradients of the lower bound on the right-hand side of Eq. (12) with respect to µ_{d,k} and S_{d,k} may then be approximated by the unbiased estimator

(N/|B|) ∑_{n∈B} [∇_{(µ_{d,k}, S_{d,k})} E_{q(F_n)}[log p(Y_n | F_n)]] − ∇_{(µ_{d,k}, S_{d,k})} KL[q(U_{d,k}) || p(U_{d,k})],  (14)

where B ⊆ {(X_n, Y_n) : n ≤ N} is a minibatch of the datapoints. Likewise, the gradients with respect to Z and θ may be approximated by

(N/|B|) ∑_{n∈B} [∇_{(Z,θ)} E_{q(F_n)}[log p(Y_n | F_n)]] − ∑_{d=1}^D ∑_{k=1}^ν ∇_{(Z,θ)} KL[q(U_{d,k}) || p(U_{d,k})].  (15)

With the gradient approximations in Eqs. (14) and (15), gradient ascent may now be carried out with a Robbins–Monro stochastic approximation routine.

We can see that an immediate benefit of taking this black-box variational approach is the ease of switching between the Wishart and inverse Wishart processes, requiring only a switch between the appropriate log conditional likelihood function log p(Y_n | F_n), given by Eq. (5) or Eq. 
(6), in the subroutine computing Eqs. (14) and (15). This implementation is particularly easy with GPflow [16], a Gaussian process toolbox built on TensorFlow, as demonstrated with a code snippet in the Appendix.

Note that choosing M = N and fixing the locations of the inducing points Z at the inputs X results in a Wishart or inverse Wishart process model that captures full temporal dependence among the N measurements. In this case, the evaluations of log p(Y_n | F_n) in Eqs. (5) and (6) have computational complexity and memory requirements with respect to N scaling in O(N^3) and O(N^2), respectively. However, another (equally important) advantage of the inducing point formulation and the variational approach to inference is the ability to reduce this computational burden with respect to N, if needed. In particular, by selecting M ≪ N, we end up with sparse approximations to the Gaussian processes. For simplicity, assume that ν = D. In this case, producing a Monte Carlo sample from q(F_{d,k}) for the minibatch B scales in O(N_b^3 + N_b M^2 + M^3) time and O(N_b^2 + N_b M + M^2) space, where N_b := |B|. Producing this for every d ≤ D, k ≤ ν, together with the computation of log p(Y_n | F_n), results in an overall computation in O(N_b D^3 + D^2(N_b^3 + N_b M^2 + M^3)) time and O(D^2(N_b^2 + N_b M + M^2)) space. Note that all of these complexities scale linearly with the number of samples R used for the Monte Carlo approximations in Eqs. (14) and (15).

4 The additive white noise model

In our initial experiments, we found that the inverse Wishart parameterization successfully moved the parameters into a good region of the state space, whereas the Wishart process failed to move

(a) Wishart process (horizontal axis on a log scale)  (b) Additive white noise Wishart process

Figure 1: Histograms of 1,000 Monte Carlo samples of the gradient with respect to the variable F_{1,1,1} in a univariate model. Fig. 
1(a) shows an extremely skewed distribution in the case of the Wishart process, and Fig. 1(b) shows its correction under the additive white noise reparameterization. The mean of each distribution is shown as a red vertical line.

the parameters in the correct direction (based on traceplots of parameters and validation metrics). It appears that this failure is due to extremely high variance of the Monte Carlo gradient approximation routine. By studying the log-likelihood function for the Wishart process in Eq. (5), we hypothesize that evaluating the inverse in the final term, −(1/2) Y_n^T (A F_n F_n^T A^T)^{-1} Y_n, on Monte Carlo samples of F_n (as required by the procedure described in Section 3) is problematic because those samples can often be close to the origin, resulting in this quantity being extremely large in magnitude. For example, in the case of a univariate output, i.e., D = 1, and corresponding unit scale A = 1, the likelihood involves computation of the scalar term −(1/2) y_n^2/f_n^2, which is large in magnitude for samples when f_n is closer to zero than the data point y_n, a problem that is exacerbated by the quadratic scales.

To visualize this issue, consider a bivariate output Y_n with constant covariance matrix Σ_n = [[2.0, 1.9], [1.9, 2.0]] and A = [[1, 0], [0, 1]]. We let ν = D = 2 and simulated a dataset Y_n at input locations X_n, for n ≤ 30, which together with some inducing points Z_m, m ≤ 10, are sampled uniformly in (0, 1). As described in Section 3, we compute the following 1,000 samples:

∇_{F_{d,k}} log p(Y_n | F_n^{(r)}),  F_{d,k}^{(r)} ∼ q(F_{d,k}),  d ≤ 2, k ≤ 2, n ≤ 30, r ≤ 1000,  (16)

and we display a histogram of the samples corresponding to the variable F_{1,1,1} in Fig. 1(a), where the horizontal axis is on a log scale. 
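The instability is easy to reproduce in the univariate case described above. In this toy sketch (our own illustration, with an assumed data point y = 2 and an assumed noise variance λ = 0.1), the gradient of log N(y; 0, f^2) with respect to f blows up for samples of f near zero, while the gradient under the additive-noise variance f^2 + λ stays bounded:

```python
import numpy as np

rng = np.random.default_rng(0)
y = 2.0            # a fixed "data point" (assumed value)
lam = 0.1          # additive white noise variance (assumed value)
f = rng.normal(0.1, 0.5, size=1000)   # Monte Carlo samples of the GP value

# Gradient of log N(y; 0, f^2) w.r.t. f: -1/f + y^2 / f^3, which
# explodes as f approaches zero.
grad_raw = -1.0 / f + y**2 / f**3

# Gradient of log N(y; 0, f^2 + lam) w.r.t. f:
# -f/(f^2 + lam) + y^2 f/(f^2 + lam)^2, bounded because the
# denominator is shifted away from zero by lam.
grad_noise = -f / (f**2 + lam) + y**2 * f / (f**2 + lam) ** 2

# Samples of f near zero blow up the raw gradient by orders of magnitude.
assert np.abs(grad_raw).max() > 100 * np.abs(grad_noise).max()
```

This is the one-dimensional analogue of the skewed histogram in Fig. 1(a) and its correction in Fig. 1(b).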
The distribution is extremely skewed; the mean of these samples, at around 2.5 × 10^9, is plotted as a red vertical line, and the standard deviation is 7.9 × 10^10!

To resolve this issue, consider once again the case when D = 1 with unit scale A = 1. We can modify the previously problematic scalar term to be −(1/2) y_n^2/(f_n^2 + λ), where the denominator is shifted away from zero by a parameter λ > 0. More generally, we can accomplish this with a slight generalization of the construction studied by van der Wilk et al. [23]: Construct the covariance matrix of y_n as

Σ_n := A F_n F_n^T A^T + Λ,  n ≥ 1,  (17)

where Λ is a diagonal D × D matrix with positive (diagonal) entries. To interpret this modification, note that the model in Section 2 may be alternatively written as y_n = A F_n z_n, where z_n ∼ N(0, I_ν), for n ≥ 1, so that Cov(y_n | F_n) = A F_n F_n^T A^T. The modified construction may be instead written as

y_n = A F_n z_n + ε_n,  z_n ∼ N(0, I_ν),  ε_n ∼ N(0, Λ),  n ≥ 1,  (18)

and so Cov(y_n | F_n) = A F_n F_n^T A^T + Λ. This modification may therefore be interpreted as introducing white (or observational) noise to the model. The log conditional likelihood in Eq. (5) is replaced by

log p(Y_n | F_n) = −(D/2) log(2π) − (1/2) log |A F_n F_n^T A^T + Λ| − (1/2) Y_n^T (A F_n F_n^T A^T + Λ)^{-1} Y_n.  (19)

The approximated gradients may now be stably computed: In Fig. 1(b), we plot a histogram of the samples of the gradients in Eq. 
(16) for this modified model, where Λ = [[0.01, 0.0], [0.0, 0.01]].

While the inverse Wishart case does not suffer such computational issues, we will see in Section 5 that this additive white noise modification is the key to a factored variant of both the Wishart and inverse Wishart processes, inference for which is tractable for high-dimensional covariance matrices. In the inverse Wishart case, however, a useful additive white noise modification is not easy to implement. We consider instead the following construction for the precision matrix

Σ_n^{-1} := A F_n F_n^T A^T + Λ^{-1},  n ≥ 1,  (20)

where, as a diagonal matrix, Λ^{-1} contains the inverted elements on the diagonal of Λ. If the variables in y_n are independent, then the elements of Λ retain their interpretation as (the variances of) additive white noise. More generally, they have an interpretation as additive terms to the partial variances of the variables in y_n. The log conditional likelihood in Eq. (6) is now replaced by

log p(Y_n | F_n) = −(D/2) log(2π) + (1/2) log |A F_n F_n^T A^T + Λ^{-1}| − (1/2) Y_n^T (A F_n F_n^T A^T + Λ^{-1}) Y_n.  (21)

The elements of Λ share an inverse gamma prior distribution: Λ^{-1}_{d,d} ∼ Gamma(a, b), d ≤ D, for some a, b > 0. We fit a mean-field variational approximation with an analogous approach to the methods in Section 3, for gamma random variables as described by Figurnov et al. [5]. (Alternative approaches were described by Knowles [13] and Ruiz et al. [20].) We fit a and b by maximum likelihood.

5 Factored covariance models

The computational and memory requirements of inference in the models so far presented scale with respect to D in O(D^3) and O(D^2), respectively, since we must invert (or take the determinant of) a D × D matrix. 
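A sketch of the modified likelihood in Eq. (19), illustrating that the added diagonal Λ keeps the inverse and log-determinant finite even for a degenerate F; the values of A and Λ below are arbitrary choices for illustration:

```python
import numpy as np

def log_lik_wishart_noise(y, F, A, Lam):
    """Log conditional likelihood of Eq. (19): y ~ N(0, A F F^T A^T + Lam).

    Lam is the diagonal additive white noise covariance that keeps the
    inverse and log-determinant stable even when F is close to zero.
    """
    D = len(y)
    Sigma = A @ F @ F.T @ A.T + Lam
    sign, logdet = np.linalg.slogdet(Sigma)
    assert sign > 0  # Sigma is positive definite for positive Lam
    quad = y @ np.linalg.solve(Sigma, y)
    return -0.5 * D * np.log(2 * np.pi) - 0.5 * logdet - 0.5 * quad

rng = np.random.default_rng(2)
D = nu = 2
y = rng.standard_normal(D)
A = np.eye(D)
Lam = 0.01 * np.eye(D)
# Even the worst-case sample F = 0 now yields a finite likelihood,
# whereas Eq. (5) would be undefined there.
ll = log_lik_wishart_noise(y, np.zeros((D, nu)), A, Lam)
assert np.isfinite(ll)
```

At F = 0 the expression reduces to the log density of N(0, Λ), which is exactly the white-noise interpretation of Eq. (18).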
This will become intractable for even moderate values of D, which is particularly troublesome in applications like finance, where D could, for example, represent the number of financial instruments in a large stock market index like the S&P 500. To reduce this complexity, consider fixing some K ≪ D and reducing F_n to be of size K × ν, for some ν ≥ K. The matrix F_n F_n^T is a K × K Wishart-distributed matrix. Let A now be of size D × K. Then by a scaling property of the Wishart distribution [19, p. 535], the D × D matrix A F_n F_n^T A^T is also Wishart-distributed. This factor-like, low-rank model has significantly fewer parameters than those in Sections 2 and 4.

Consider applying this construction to the additive white noise model for the Wishart process described in Section 4, where Σ_n = A F_n F_n^T A^T + Λ. Recalling that Λ is diagonal, the log conditional likelihood function in Eq. (19) may be computed efficiently with the Woodbury matrix identities as

log p(Y_n | F_n) = −(D/2) log(2π) − (1/2) ∑_{d=1}^D log Λ_{d,d} − (1/2) log |I_ν + F_n^T A^T Λ^{-1} A F_n| − (1/2) Y_n^T Λ^{-1} Y_n + (1/2) Y_n^T Λ^{-1} A F_n (I_ν + F_n^T A^T Λ^{-1} A F_n)^{-1} F_n^T A^T Λ^{-1} Y_n.  (22)

In the inverse Wishart case, we have Σ_n^{-1} = A F_n F_n^T A^T + Λ^{-1}, which we note is a reparameterization of the construction by Fox and Dunson [6]. The log conditional likelihood function in this case is

log p(Y_n | F_n) = −(D/2) log(2π) − (1/2) ∑_{d=1}^D log Λ_{d,d} + (1/2) log |I_ν + F_n^T A^T Λ A F_n| − (1/2) Y_n^T Λ^{-1} Y_n − (1/2) Y_n^T A F_n F_n^T A^T Y_n.  (23)

For simplicity, assume ν = K. Then these log conditional likelihood functions may be computed (with respect to D and K) in O(D K^2) time and O(D K) space. 
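The Woodbury form of Eq. (22) is easy to check against the dense computation of Eq. (19). A NumPy sketch, with arbitrary illustrative dimensions:

```python
import numpy as np

def log_lik_factored(y, F, A, lam_diag):
    """Eq. (22) via the Woodbury identity: O(D K^2) instead of O(D^3).

    lam_diag holds the diagonal of Lam; F is K x nu, A is D x K.
    """
    D, nu = len(y), F.shape[1]
    AF = A @ F                                         # D x nu
    M = np.eye(nu) + AF.T @ (AF / lam_diag[:, None])   # I + F^T A^T Lam^{-1} A F
    v = AF.T @ (y / lam_diag)                          # F^T A^T Lam^{-1} y
    logdet = np.sum(np.log(lam_diag)) + np.linalg.slogdet(M)[1]
    quad = y @ (y / lam_diag) - v @ np.linalg.solve(M, v)
    return -0.5 * D * np.log(2 * np.pi) - 0.5 * logdet - 0.5 * quad

rng = np.random.default_rng(3)
D, K = 50, 5
y = rng.standard_normal(D)
A = rng.standard_normal((D, K))
F = rng.standard_normal((K, K))
lam_diag = np.full(D, 0.01)

# Cross-check against the direct dense computation of Eq. (19).
Sigma = A @ F @ F.T @ A.T + np.diag(lam_diag)
direct = (-0.5 * D * np.log(2 * np.pi)
          - 0.5 * np.linalg.slogdet(Sigma)[1]
          - 0.5 * y @ np.linalg.solve(Sigma, y))
assert np.isclose(log_lik_factored(y, F, A, lam_diag), direct)
```

Only K × K systems are ever solved, which is what makes the factored model viable for D in the hundreds.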
With the black-box approach to variational inference, we need only drop the expressions in Eqs. (22) and (23) into the subroutines computing the gradient estimates in Eqs. (14) and (15). The overall complexity then reduces to computations in O(N_b D K^2 + K^2(N_b^3 + N_b M^2 + M^3)) time and O(D K + K^2(N_b^2 + N_b M + M^2)) space. This model and inference procedure is therefore scalable to both large N and large D regimes.

6 Experiments on financial returns

We implement our variational inference routines on the model variants applied to three datasets of financial returns, which are denoted as follows (note that we take the log returns, defined at time t + 1 as log(P_{t+1}/P_t), where P_t is the price of the instrument at time t):

Dow 30: Intraday returns on the components of the Dow 30 Industrial Average (as of the changes on Jun. 8, 2009), taken at the close of every five-minute interval from Nov. 17, 2017 through Dec. 6, 2017. The resulting dataset size is N = 978, D = 30. The raw data was from Marjanovic [14].

FX: Daily foreign exchange rates for 20 currency pairs taken from Wu et al. [26]. The dataset size is N = 1,565, D = 20.

S&P 500: Daily returns on the closing prices of the components of the S&P 500 index between Feb. 8, 2013 and Feb. 7, 2018, taken from Nugent [17]. Missing prices are forward-filled. The resulting dataset size is N = 1,258, D = 505 (there are 505 names in the index).

The simplest baseline is univariate ARCH (applied to each variable independently), implemented through the Python package arch [22]. The MGARCH variants we compare to are the dynamic conditional correlation model (DCC) [3] with Gaussian and multivariate-t likelihoods, and the generalized orthogonal GARCH model (GO-GARCH) [2], a competitive variant of the BEKK MGARCH specification. These baselines are among the dominant MGARCH modeling approaches and were implemented through the R package rmgarch [8]. 
The MGARCH baselines do not scale to the S&P 500 dataset, and there are no ubiquitous baselines in this large covariance regime.

We used a diagonal matrix A for the full-rank (non-factored) covariance models. The parameters in A and Λ are inferred by maximum likelihood. The values of Λ were initialized to Λ_{d,d} = 0.001, d ≤ D. The degrees of freedom parameter ν is set to the number of variables D, or to the number of factors K in the factored covariance cases. We did not find performance to be sensitive to this choice. We used M = 300 inducing points, R = 2 variational samples for the Monte Carlo approximations, and a minibatch size of 300. The gradient ascent step sizes were scheduled according to Adam [11]. We selected the stopping times and an exponential learning rate decay schedule via cross validation, choosing the setting that maximized the test log-likelihood metric (see below) on a validation set. The validation sets were the final 2%, 5%, and 5% of the measurements in just one of the training sets for the Dow 30, FX, and S&P 500 datasets, respectively.

For each dataset, we created 10 evenly-sized training and testing sets with a sliding window, where each test set comprises the 10 consecutive measurements following its training set (we may therefore consider a 10-step-ahead forecasting task), and no testing sets overlap. To evaluate the models, we forecast the covariance matrix, say Σ*_t at horizon t, for t ≤ 10, and compute the log-likelihood of the corresponding test measurement Y_t under a mean-zero Gaussian distribution with covariance Σ*_t. The prediction is formed by Monte Carlo estimation with 300 samples from the fitted variational distribution. 
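The sliding-window evaluation described above can be sketched as follows; the exact windowing used in the paper may differ, so the split construction here is a labeled assumption:

```python
import numpy as np

def sliding_splits(N, n_splits=10, horizon=10):
    """Hypothetical sketch of the evaluation splits: evenly-sized training
    windows slide forward, each test set is the `horizon` consecutive
    indices following its training window, and no two test sets overlap."""
    train_len = N - n_splits * horizon
    splits = []
    for i in range(n_splits):
        test_end = N - (n_splits - 1 - i) * horizon
        test = np.arange(test_end - horizon, test_end)
        train = np.arange(test_end - horizon - train_len, test_end - horizon)
        splits.append((train, test))
    return splits

splits = sliding_splits(N=1258)   # e.g., the S&P 500 dataset length
tests = [te for _, te in splits]
# Test sets are disjoint, each of length 10, and follow their training set.
assert all(len(te) == 10 for te in tests)
assert len(np.unique(np.concatenate(tests))) == 100
assert all(tr[-1] + 1 == te[0] for tr, te in splits)
```

Keeping the test windows disjoint is what makes the 100 per-horizon scores (10 horizons × 10 splits) comparable across models.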
The parameter settings producing the prediction with the highest training log-likelihood from among a window of 300 steps following the stopping time are kept for testing.

In order to visualize our experimental setup, we display the results for the FX dataset in Fig. 2 as a series of grouped histograms. The horizontal axis represents the forecast horizon; at each horizon, the boxplots of test log-likelihoods (over the 10 training/testing sets) are displayed for each model. The Wishart and inverse Wishart process variants are denoted by 'wp' and 'iwp', respectively. If the additive white noise parameterization described in Section 4 is used (with a non-factored covariance model), we prepend the model name with 'n-'. The factored model variants, described in Section 5, have model names prepended with 'f[K]-', where [K] is the number of factors. We used a Gaussian process covariance kernel composed as the sum of a Matérn 3/2 kernel, a rational quadratic kernel, a radial basis function kernel, and a periodic component, which is itself composed as the product of a periodic kernel and a radial basis function kernel (see the code snippet in the Appendix). The ARCH baseline is denoted 'arch', the DCC baselines with a multivariate normal and multivariate-t likelihood are denoted 'dcc' and 'dcc-t', respectively, and the GO-GARCH baseline is denoted 'go-garch'.

We compare the collections of the log-likelihood scores for each of the 10 forecast horizons in each of the 10 test sets (resulting in populations of 100 scores each). In Table 1, we report the mean score ± one standard deviation for each model and dataset. 
For each of our model variants, we provide in brackets the p-value of a Wilcoxon signed-rank test comparing the performance of the model against the highest performing MGARCH baseline (which was always either go-garch or dcc-t), or against the ARCH baseline in the case of the S&P 500 dataset. We bold the highest performing model on each dataset, and we highlight any improvements with a * if significant at a 0.05 level.

The Wishart process variants score highest on each dataset; it is notable that they consistently outperform their inverse Wishart process counterparts. In fact, the inverse Wishart process appears to have unreliable performance; while the iwp variants outperform MGARCH on the Dow 30 dataset,

Figure 2: Example display of the results for the FX dataset. A set of boxplots reporting the test log-likelihoods of the predictions is displayed for each step of the 10-step forecast horizon (indicated on the horizontal axis). Each boxplot contains the scores from the 10 training/testing splits.

Table 1: Test log-likelihood metrics across 10-step forecast horizons in 10 test splits. We display the mean over the 100 scores, along with ± one std. dev. The p-value of a Wilcoxon signed-rank test comparing our models to the highest performing MGARCH/ARCH baseline is displayed in brackets. The highest score is bolded. 
Signi\ufb01cant improvements at a 0.05 level are highlighted with a \u2217.\n\nDow 30\n142.47 \u00b1 17.97\n162.70 \u00b1 42.98\n162.64 \u00b1 42.86\n163.59 \u00b1 52.65\n164.09 \u00b1 26.47\u2217 (1.71e-8)\n164.49 \u00b1 19.82\u2217 (1.13e-9)\n165.98 \u00b1 23.23\u2217 (1.03e-6)\n162.28 \u00b1 22.91\u2217 (4.67e-11)\n165.39 \u00b1 30.89\u2217 (2.31e-5)\n\u2013\n\u2013\n\nFX\n68.24 \u00b1 7.55\n82.52 \u00b1 4.55\n82.54 \u00b1 4.56\n82.43 \u00b1 4.85\n81.42 \u00b1 4.12 (8.15e-8)\n82.10 \u00b1 3.72 (1.62e-3)\n82.69 \u00b1 4.15 (5.11e-2)\n77.76 \u00b1 3.94 (5.99e-17)\n81.12 \u00b1 3.59 (2.62e-10)\n\u2013\n\u2013\n\nS&P 500\n1358.23 \u00b1 355.12\n\u2013\n\u2013\n\u2013\n\u2013\n\u2013\n\u2013\n1275.27 \u00b1 264.48 (4.14e-18)\n1423.31 \u00b1 132.19\u2217 (1.48e-13)\n1047.73 \u00b1 1436.16 (3.90e-18)\n1438.40 \u00b1 130.14\u2217 (1.54e-15)\n\narch\ndcc\ndcc-t\ngo-garch\niwp\nn-iwp\nn-wp\nf10-iwp\nf10-wp\nf30-iwp\nf30-wp\n\nthey perform poorly on the FX and S&P 500 datasets. On the Dow 30 dataset, not only does the\n(additive noise, full covariance) Wishart process (denoted n-wp) signi\ufb01cantly outperform the best\nperforming MGARCH baseline (go-garch, in this case), but every other one of our full covariance\nmodels and f10-wp does as well. The f10-iwp model variant is the only one of our models that\nunderperforms go-garch, further emphasizing that the inverse Wishart process should be avoided.\nWhile n-wp attains the highest score on the FX dataset, it is not deemed signi\ufb01cant over the scores for\nthe highest performing MGARCH baseline (dcc-t in this case), according to the Wilcoxon signed-rank\ntest. However, we may take some comfort in the fact that the p-value of 5.11e-2 (comparing the\nscores of n-wp and dcc-t) is very close to signi\ufb01cance at this level. We note, however, that every other\none of our models under-performs compared to dcc-t. 
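The significance comparisons reported in Table 1 amount to a standard two-sided Wilcoxon signed-rank test on the 100 paired scores (10 horizons × 10 splits). Below is a hedged sketch using scipy.stats.wilcoxon; the score arrays are synthetic placeholders for illustration only, not our actual results:

```python
import numpy as np
from scipy.stats import wilcoxon

# Synthetic stand-ins for the 100 paired scores (10 horizons x 10 splits)
# of one model variant and the best-performing baseline. These numbers are
# illustrative placeholders, not the values behind Table 1.
rng = np.random.default_rng(0)
baseline_scores = rng.normal(loc=82.5, scale=4.5, size=100)               # e.g. dcc-t
model_scores = baseline_scores + rng.normal(loc=0.2, scale=1.0, size=100)  # e.g. n-wp

# Two-sided Wilcoxon signed-rank test on the paired differences.
stat, p_value = wilcoxon(model_scores, baseline_scores)
significant = p_value < 0.05  # the threshold used for the * markers
```

The test operates on paired differences, which is appropriate here because each model and baseline is scored on the same forecast horizon and test split.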
The MGARCH baselines cannot scale to the S&P 500 dataset, and so we may only compare our factored covariance models against the diagonal ARCH baseline. The f30-wp and f10-wp models both significantly outperform the ARCH baseline; however, worryingly, the f30-iwp and f10-iwp model variants significantly under-perform the ARCH baseline, giving yet another example of the unreliable performance of the inverse Wishart process. With this evidence, we recommend that practitioners always implement the Wishart process instead of the inverse Wishart process. Further study should be undertaken to understand why the performance of these two model variants differs. Unsurprisingly, the additive noise parameterization always improves performance (n-iwp always outperforms iwp), which we may attribute to the additional noise parameters afforded to the model in this case. While we see the factored Wishart process perform competitively with its full covariance counterpart on the Dow 30 dataset, this was not the case on the FX dataset, and so (unsurprisingly) a full covariance model should be preferred if enough computational resources are available.

7 Conclusion

We conclude that the black-box variational approach to inference significantly eases the implementation of the various Wishart and inverse Wishart process models that we have presented. If needed, the computational burden of inference with respect to both the length of the time series and the dimensionality of the covariance matrix may be reduced. We hope that the initial failure of the black-box variational approach in the case of the Wishart process provides a warning to practitioners that these methods cannot always be trusted to work out of the box. We recommend that practitioners always implement the (additive noise) Wishart process instead of the inverse Wishart process.
When the dimensionality of the covariance matrix is large, one may use the factored model; however, a full covariance model should be preferred if computational resources allow it.

Appendix: Implementation in GPflow

To demonstrate the ease of implementing our methods, we provide the following 25 lines of Python code implementing the inverse Wishart process in GPflow [16] (version 1.3.0):

import numpy as np
import tensorflow as tf
from gpflow import models, likelihoods, kernels, params, transforms, decors


class InvWishartProcessLikelihood(likelihoods.Likelihood):

    def __init__(self, D, R=1):
        super().__init__()
        self.R, self.D = R, D
        self.A_diag = params.Parameter(np.ones(D), transform=transforms.positive)

    @decors.params_as_tensors  # decorator translating TF tensors for GPflow
    def variational_expectations(self, mu, S, Y):
        N, D = tf.shape(Y)
        W = tf.random_normal([self.R, N, tf.shape(mu)[1]])  # samples through which
        F = W * (S ** 0.5) + mu                             # TF automatically differentiates

        # compute the (mean of the) likelihood
        AF = self.A_diag[:, None] * tf.reshape(F, [self.R, N, D, -1])
        yffy = tf.reduce_sum(tf.einsum('jk,ijkl->ijl', Y, AF) ** 2.0, axis=-1)
        chols = tf.cholesky(tf.matmul(AF, AF, transpose_b=True))  # cholesky of precision
        logp = tf.reduce_sum(tf.log(tf.matrix_diag_part(chols)), axis=2) - 0.5 * yffy
        return tf.reduce_mean(logp, axis=0)


class InvWishartProcess(models.svgp.SVGP):

    def __init__(self, X, Y, Z, minibatch_size=None, nu=None):
        D = Y.shape[1]
        nu = D if nu is None else nu  # degrees of freedom

        # create a compositional kernel function
        kern = kernels.Matern32(1) + kernels.RationalQuadratic(1) + kernels.RBF(1) \
            + kernels.PeriodicKernel(1) * kernels.RBF(1)

        # almost all work is done by SVGP!  notation as in the paper
        super().__init__(X, Y, Z=Z, kern=kern,
                         likelihood=InvWishartProcessLikelihood(D, R=10),  # 10 MCMC samples
                         num_latent=D * nu,  # number of outputs (multi-output GP)
                         minibatch_size=minibatch_size)

GPflow's abstract class gpflow.models.svgp.SVGP is designed to automate gradient-based variational inference with (sparse) Gaussian process models. Our only model-specific computation is the Monte Carlo approximation of the log-likelihood expression in Eq. (6), which is carried out by the method InvWishartProcessLikelihood.variational_expectations. The TensorFlow backend automatically differentiates through these expressions to obtain the gradients described in Section 3. Finally, note that the kernel function is defined as a composition of several simpler kernel functions, demonstrating one of the many utilities of GPflow; this is the particular composition used in our experiments in Section 6.

Acknowledgements

We thank anonymous reviewers for feedback.
All funding for the experiments was personally provided by CH, who does not have an affiliation for this work.

References

[1] M. Bauer, M. van der Wilk, and C. E. Rasmussen. Understanding probabilistic sparse Gaussian process approximations. In NIPS, 2016.

[2] R. van der Weide. GO-GARCH: a multivariate generalized orthogonal GARCH model. Journal of Applied Econometrics, 17(5):549–564, 2002.

[3] R. Engle. Dynamic conditional correlation: A simple class of multivariate generalized autoregressive conditional heteroskedasticity models. Journal of Business & Economic Statistics, 20(3):339–350, 2002.

[4] R. F. Engle and K. F. Kroner. Multivariate simultaneous generalized ARCH. Econometric Theory, 11(1):122–150, 1995.

[5] M. Figurnov, S. Mohamed, and A. Mnih. Implicit reparameterization gradients. In NIPS, 2018.

[6] E. B. Fox and D. B. Dunson. Bayesian nonparametric covariance regression. Journal of Machine Learning Research, 16:2501–2542, 2015.

[7] E. B. Fox and M. West. Autoregressive models for variance matrices: Stationary inverse Wishart processes. arXiv preprint arXiv:1107.5239, 2011.

[8] A. Ghalanos. rmgarch: Multivariate GARCH models, 2014. URL https://cran.r-project.org/web/packages/rmgarch. R package version 1.2-8.

[9] J. Hensman, N. Fusi, and N. D. Lawrence. Gaussian processes for big data. In UAI, 2013.

[10] J. Hensman, A. G. de G. Matthews, and Z. Ghahramani. Scalable variational Gaussian process classification. In AISTATS, 2015.

[11] D. P. Kingma and J. Ba. Adam: a method for stochastic optimization. In ICLR, 2015.

[12] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In ICLR, 2014.

[13] D. A. Knowles. Stochastic gradient variational Bayes for gamma approximating distributions. arXiv preprint arXiv:1509.01631, 2015.

[14] B. Marjanovic. Huge stock market dataset. Kaggle.com, Nov. 2017.
URL https://www.kaggle.com/borismarjanovic/price-volume-data-for-all-us-stocks-etfs. Version 3. Last updated 11/10/2017.

[15] H. Markowitz. Portfolio selection. The Journal of Finance, 7(1):77–91, 1952.

[16] A. G. de G. Matthews, M. van der Wilk, T. Nickson, K. Fujii, A. Boukouvalas, P. León-Villagrá, Z. Ghahramani, and J. Hensman. GPflow: A Gaussian process library using TensorFlow. Journal of Machine Learning Research, 18(40):1–6, 2017.

[17] C. Nugent. S&P 500 stock data. Kaggle.com, Feb. 2018. URL https://www.kaggle.com/camnugent/sandp500. Version 4.

[18] J. Quiñonero-Candela and C. E. Rasmussen. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6:1939–1959, 2005.

[19] C. R. Rao. Linear statistical inference and its applications, volume 2. Wiley, New York, 1973.

[20] F. R. Ruiz, M. Titsias, and D. Blei. The generalized reparameterization gradient. In NIPS, 2016.

[21] T. Salimans and D. A. Knowles. Fixed-form variational posterior approximation through stochastic linear regression. Bayesian Analysis, 8(4):837–882, 2013.

[22] K. Sheppard. arch, 2014. URL https://github.com/bashtage/arch. Python package version 4.3.1.

[23] M. van der Wilk, A. G. Wilson, and C. E. Rasmussen. Variational inference for latent variable modelling of correlation structure. In NIPS 2014 Workshop on Advances in Variational Inference, 2014.

[24] A. G. Wilson and Z. Ghahramani. Generalised Wishart processes. In UAI, 2010.

[25] Y. Wu, J. M. Hernández-Lobato, and Z. Ghahramani. Dynamic covariance models for multivariate financial time series. In ICML, 2013.

[26] Y. Wu, J. M. Hernández-Lobato, and Z. Ghahramani. Gaussian process volatility model.
In NIPS, 2014.