{"title": "Sparse Approximate Manifolds for Differential Geometric MCMC", "book": "Advances in Neural Information Processing Systems", "page_first": 2879, "page_last": 2887, "abstract": "One of the enduring challenges in Markov chain Monte Carlo methodology is the development of proposal mechanisms to make moves distant from the current point, that are accepted with high probability and at low computational cost. The recent introduction of locally adaptive MCMC methods based on the natural underlying Riemannian geometry of such models goes some way to alleviating these problems for certain classes of models for which the metric tensor is analytically tractable, however computational efficiency is not assured due to the necessity of potentially high-dimensional matrix operations at each iteration. In this paper we firstly investigate a sampling-based approach for approximating the metric tensor and suggest a valid MCMC algorithm that extends the applicability of Riemannian Manifold MCMC methods to statistical models that do not admit an analytically computable metric tensor. Secondly, we show how the approximation scheme we consider naturally motivates the use of l1 regularisation to improve estimates and obtain a sparse approximate inverse of the metric, which enables stable and sparse approximations of the local geometry to be made. We demonstrate the application of this algorithm for inferring the parameters of a realistic system of ordinary differential equations using a biologically motivated robust student-t error model, for which the expected Fisher Information is analytically intractable.", "full_text": "Sparse Approximate Manifolds for\n\nDifferential Geometric MCMC\n\nBen Calderhead\u2217\n\nCoMPLEX\n\nUniversity College London\nLondon, WC1E 6BT, UK\n\nb.calderhead@ucl.ac.uk\n\nM\u00e1ty\u00e1s A. 
Sustik\n\nDepartment of Computer Sciences\n\nUniversity of Texas at Austin\n\nAustin, TX 78712, USA\n\nsustik@cs.utexas.edu\n\nAbstract\n\nOne of the enduring challenges in Markov chain Monte Carlo methodology is\nthe development of proposal mechanisms to make moves distant from the current\npoint, that are accepted with high probability and at low computational cost. The\nrecent introduction of locally adaptive MCMC methods based on the natural un-\nderlying Riemannian geometry of such models goes some way to alleviating these\nproblems for certain classes of models for which the metric tensor is analytically\ntractable, however computational ef\ufb01ciency is not assured due to the necessity of\npotentially high-dimensional matrix operations at each iteration.\nIn this paper we \ufb01rstly investigate a sampling-based approach for approximating\nthe metric tensor and suggest a valid MCMC algorithm that extends the appli-\ncability of Riemannian Manifold MCMC methods to statistical models that do\nnot admit an analytically computable metric tensor. Secondly, we show how the\napproximation scheme we consider naturally motivates the use of (cid:96)1 regularisa-\ntion to improve estimates and obtain a sparse approximate inverse of the metric,\nwhich enables stable and sparse approximations of the local geometry to be made.\nWe demonstrate the application of this algorithm for inferring the parameters of a\nrealistic system of ordinary differential equations using a biologically motivated\nrobust Student-t error model, for which the Expected Fisher Information is ana-\nlytically intractable.\n\n1\n\nIntroduction\n\nThe use of Markov chain Monte Carlo methods can be extremely challenging in many modern day\napplications. 
This difficulty arises from the more frequent use of complex and nonlinear statistical models that induce strong correlation structures in their often high-dimensional parameter spaces. The exact structure of the target distribution is generally not known in advance and local correlation structure between different parameters may vary across the space, particularly as the chain moves from the transient phase, exploring areas of negligible probability mass, to the stationary phase exploring higher density regions [1].

Constructing a Markov chain that adapts to the target distribution while still drawing samples from the correct stationary distribution is challenging, although much research over the last 15 years has resulted in a variety of approaches and theoretical results. Adaptive MCMC, for example, allows for global adaptation based on the partial or full history of a chain; this breaks its Markov property, although it has been shown that, subject to some technical conditions [2,3], the resulting chain will still converge to the desired stationary distribution. Most recently, advances in Riemannian Manifold MCMC allow locally changing, position-specific proposals to be made based on the underlying geometry of the target distribution [1].

* http://www.2020science.net/people/ben-calderhead

In practice,\nthere are many commonly used models for which the Expected Fisher Information is not analytically\ntractable, such as when a robust Student-t error model is employed to construct the likelihood.\nIn this paper we propose the use of a locally adaptive MCMC algorithm that approximates the local\nRiemannian geometry at each point in the target space. This extends the applicability of Riemannian\nManifold MCMC to a much wider class of statistical models than at present. In particular, we do so\nby estimating the covariance structure of the tangent vectors at a point on the Riemannian manifold\ninduced by the statistical model. Considering this geometric problem as one of inverse covariance\nestimation naturally leads us to the use of an (cid:96)1 regularised maximum likelihood estimator. This\napproximate inverse approach allows the required geometry to be estimated with few samples, en-\nabling good proposals for the Markov chain while inducing a natural sparsity in the inverse metric\ntensor that reduces the associated computational cost.\nWe \ufb01rst give a brief characterisation of current adaptive approaches to MCMC, making a distinc-\ntion between locally and globally adaptive methods, since these two approaches have very different\nrequirements in terms of proving convergence to the stationary distribution. We then discuss the\nuse of geometry in MCMC and the interpretation of such methods as being locally adaptive, before\ngiving the necessary background on Riemannian geometry and MCMC algorithms de\ufb01ned on in-\nduced Riemannian manifolds. We focus on the manifold MALA sampler, which is derived from a\nLangevin diffusion process that takes into account local non-Euclidean geometry, and we discuss\nsimpli\ufb01cations that may be made for computational ef\ufb01ciency. 
Finally we present a valid MCMC algorithm that estimates the Riemannian geometry at each iteration based on covariance estimates of random vectors tangent to the manifold at the chain's current point. We demonstrate the use of ℓ1 regularisation to calculate sparse approximate inverses of the metric tensor and investigate the sampling properties of the algorithm on an extremely challenging statistical model for which the Expected Fisher Information is analytically intractable.

2 Background

We wish to sample from some arbitrary target density π(x) defined on a continuous state space X^D, which may be high-dimensional. We may define a Markov chain that converges to the correct stationary distribution in the usual manner by proposing a new position x* from the current position x_n via some fixed proposal distribution q(x*|x_n); we accept the new move, setting x_{n+1} = x*, with probability

\alpha(x^*|x_n) = \min\left( \frac{\pi(x^*)\, q(x_n|x^*)}{\pi(x_n)\, q(x^*|x_n)},\ 1 \right)

and set x_{n+1} = x_n otherwise. In a Bayesian context, we will often have a posterior distribution as our target π(x) = p(θ|y), where y is the data and θ are the parameters of a statistical model. The choice of proposal distribution is the critical factor in determining how efficiently the Markov chain can explore the space and whether new moves will be accepted with high probability and be sufficiently far from the current point to keep autocorrelation of the samples to a minimum.
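As a minimal illustration (ours, not part of the original presentation), the accept/reject rule above can be sketched for a random-walk proposal on a hypothetical standard-normal target; with a symmetric proposal the q-ratio cancels, and all names below are our own:

```python
import numpy as np

def metropolis_hastings(log_target, x0, n_iters, step=1.0, seed=0):
    """Random-walk Metropolis: the proposal is symmetric, so the q-ratio cancels."""
    rng = np.random.default_rng(seed)
    x = x0
    samples = np.empty(n_iters)
    for n in range(n_iters):
        x_star = x + step * rng.standard_normal()      # propose x* ~ q(.|x_n)
        # alpha = min(1, pi(x*) / pi(x_n)), evaluated on the log scale
        if np.log(rng.uniform()) < log_target(x_star) - log_target(x):
            x = x_star                                 # accept: x_{n+1} = x*
        samples[n] = x                                 # otherwise x_{n+1} = x_n
    return samples

# Hypothetical target: standard normal, log pi(x) = -x^2/2 up to a constant
samples = metropolis_hastings(lambda x: -0.5 * x**2, x0=0.0, n_iters=20000)
```

The chain's sample mean and variance should approach those of the target as the number of iterations grows.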
There is a lot of flexibility in the choice of proposal distribution, in that it may depend on the current point in a deterministic manner.

We note that Adaptive MCMC approaches attempt to change their proposal mechanism throughout the running of the algorithm, and for the purpose of proving convergence to the stationary distribution it is useful to categorise them as follows: locally adaptive MCMC methods make proposals based only on the current position of the chain, whereas globally adaptive MCMC methods use previously collected samples in the chain's history to generate a new proposal mechanism. This is an important distinction since globally adaptive methods lose their Markov property and convergence to the stationary distribution must be proven in an alternative manner. It has been shown that such chains may still be usefully employed as long as they satisfy some technical conditions, namely diminishing adaptation and bounded convergence [2]. In practice these algorithms represent a step towards MCMC as a "black box" method and may be very useful for sampling from target distributions for which there is no derivative or higher order geometric information available, however there are simple examples of standard Adaptive MCMC methods requiring hundreds of thousands of iterations in higher dimensions before adapting to a suitable proposal distribution [3]. In addition, if there is more information about the target density available, then there seems little point in trying to guess the geometric structure when it may be calculated directly. In this paper we focus on locally adaptive methods that employ proposals constructed deterministically from information at the current position of the Markov chain.

2.1 Locally Adaptive MCMC

Many geometric-based MCMC methods may be categorised as being locally adaptive.
When\nthe derivative of the target density is available, MCMC methods such as the Metropolis-adjusted\nLangevin Algorithm (MALA) [4] allow local adaptation based on the geometry at the current point,\nbut unlike globally adaptive MCMC, they retain their Markovian property and therefore converge to\nthe correct stationary distribution using a standard Metropolis-Hastings step and without the need to\nsatisfy further technical conditions.\nIn general, we can de\ufb01ne position-speci\ufb01c proposal densities based on deterministic functions that\ndepend only on the current point. This idea has been previously employed to develop approaches for\nsampling multimodal distributions whereby large initial jumps followed by deterministic optimisa-\ntion functions were used to create mode-jumping proposal mechanisms [5]. In some instances, the\nuse of \ufb01rst order geometric information may drastically speed up the convergence to a stationary dis-\ntribution, however in other cases such algorithms exhibit very slow convergence, due to the gradients\nnot being isotropic in magnitude [6]; in practice gradients may vary greatly in different directions\nand the rate of exploration of the target density may in addition be dependent on the problem-speci\ufb01c\nchoice of parameterisation [1]. Methods using the standard gradient implicitly assume that the slope\nin each direction is approximately constant over a small distance, when in fact these gradients may\nrapidly change over short distances. Incorporating higher order geometry often helps although at an\nincreased computational cost.\nA number of Hessian-based MCMC methods have been proposed as a solution [7]. 
While such approaches have been shown to work well for selected problems, this use of geometry suffers from a number of drawbacks: ad hoc methods are often necessary to deal with the fact that the Hessian might not be everywhere positive-definite, and second derivatives can be challenging and costly to compute. We can also exploit higher order information through the use of Riemannian geometry. Using a metric tensor instead of a Hessian matrix confers useful properties, such as invariance to reparameterisation of our statistical model, and positive-definiteness is also assured. Riemannian geometry has been useful in a variety of other machine learning and statistical contexts [8]; however, the limiting factor is usually analytic or computational tractability.

3 Differential Geometric MCMC

During the 1940s, Jeffreys and Rao demonstrated that the Expected Fisher Information has the same properties as a metric tensor and indeed induces a natural Riemannian structure for a statistical model [11, 10], providing a fascinating link between statistics and differential geometry. Much work has been done since then elucidating the relationship between statistics and Riemannian geometry, in particular examining geometric concepts such as distance, curvature and geodesics on statistical manifolds, within a field that has become known as Information Geometry [6]. We first provide an overview of Riemannian geometry and MCMC algorithms defined on Riemannian manifolds. We then describe a sampling scheme that allows the local geometry to be estimated at each iteration for statistical models that do not admit an analytically tractable metric tensor.

3.1 Riemannian Geometry

Informally, a manifold is an n-dimensional space that is locally Euclidean; it is locally equivalent to R^n via some smooth transformation. At each point θ ∈ R^n on a Riemannian manifold M there exists a tangent space, which we denote as T_θM.
We can think of this as a linear approximation to the Riemannian manifold at the point θ; it is simply a standard vector space, whose origin is the current point on the manifold and whose vectors are tangent to this point. The vector space T_θM is spanned by the differential operators [∂/∂θ_1, …, ∂/∂θ_n], which act on functions defining paths on the underlying manifold [9]. In the context of MCMC we can consider the target density as the log-likelihood of a statistical model given some data, such that at a particular point θ the derivatives of the log-likelihood are tangent to the manifold; these are just the score vectors at θ, ∇_θL = [∂L/∂θ_1, …, ∂L/∂θ_n]. The tangent space at each point θ arises when we equip a differentiable manifold with an inner product at each point, which we can use to measure distance and angles between vectors. This inner product is defined in terms of a metric tensor, G_θ, which defines a basis on each tangent space T_θM. The tangent space is therefore a linear approximation of the manifold at a given point and it has the same dimensionality. A natural inner product for this vector space is given by the covariance of the basis score vectors, since the covariance function satisfies the same properties as a metric tensor, namely symmetry, bilinearity and positive-definiteness [9].
This inner product then turns out to be equivalent to the Expected Fisher Information, following from the fact that the expectation of the score is zero, with the [i, j]th component of the tensor given by

G_{i,j} = \mathrm{Cov}\left( \frac{\partial L}{\partial \theta_i},\ \frac{\partial L}{\partial \theta_j} \right) = \mathbb{E}_{p(x|\theta)}\left[ \frac{\partial L}{\partial \theta_i}^{\!T} \frac{\partial L}{\partial \theta_j} \right] = -\mathbb{E}_{p(x|\theta)}\left[ \frac{\partial^2 L}{\partial \theta_i \partial \theta_j} \right]   (1)

Each tangent vector t_1 ∈ T_θM at a point on the manifold θ ∈ M has a length ||t_1|| ∈ R_+, whose square is given by the inner product, such that ||t_1||²_{G_θ} = ⟨t_1, t_1⟩_θ = t_1^T G_θ t_1. This squared distance is known as the first fundamental form in Riemannian geometry [9], is invariant to reparameterisations of the coordinates, and importantly for MCMC provides a local measure of distance that takes into account the local 2nd order sensitivity of the statistical model. We note that when the metric tensor is constant for all values of θ then the Riemannian manifold is equivalent to a vector space with constant inner product; further, if the metric tensor is an identity matrix then the manifold simply becomes a Euclidean space.

3.2 Manifold MCMC

We consider the manifold version of the MALA sampling algorithm, which proposes moves based on a stochastic differential equation defining a Langevin diffusion [4]. It turns out we can also define such a diffusion on a Riemannian manifold [12], and so in a similar manner we can derive a sampling algorithm that takes the underlying geometric structure into account when making proposals. It is based on the Laplace-Beltrami operator, which simply measures the divergence of a vector field on a manifold.
The stochastic differential equation defining the Langevin diffusion on a Riemannian manifold is dθ(t) = ½ ∇̃_θL(θ(t)) dt + db̃(t), where the natural gradient [6] is the gradient of a function transformed into the tangent space at the current point by a linear transformation using the basis defined by the metric tensor, such that ∇̃_θL(θ(t)) = G^{-1}(θ(t)) ∇_θL(θ(t)), and the Brownian motion on the Riemannian manifold is defined as

d\tilde{b}_i(t) = |G(\theta(t))|^{-\frac{1}{2}} \sum_{j=1}^{D} \frac{\partial}{\partial \theta_j}\left( G^{-1}(\theta(t))_{ij}\, |G(\theta(t))|^{\frac{1}{2}} \right) dt + \left( \sqrt{G^{-1}(\theta(t))}\, db(t) \right)_i   (2)

The first part of the right hand side of Equation 2 represents the 1st order terms of the Laplace-Beltrami operator and these relate to the local curvature of the manifold, reducing to zero if the metric is everywhere constant. The second term on the right hand side provides a position-specific linear transformation of the Brownian motion b(t) based on the local metric.
Employing a first order Euler integrator, the discrete form of the Langevin diffusion on a Riemannian manifold follows as

\theta_i^{n+1} = \theta_i^n + \frac{\epsilon^2}{2}\left( G^{-1}(\theta^n)\nabla_\theta L(\theta^n) \right)_i - \epsilon^2 \sum_{j=1}^{D} \left( G^{-1}(\theta^n) \frac{\partial G(\theta^n)}{\partial \theta_j} G^{-1}(\theta^n) \right)_{ij}
    + \frac{\epsilon^2}{2} \sum_{j=1}^{D} \left( G^{-1}(\theta^n) \right)_{ij} \mathrm{Tr}\left( G^{-1}(\theta^n) \frac{\partial G(\theta^n)}{\partial \theta_j} \right) + \left( \epsilon \sqrt{G^{-1}(\theta^n)}\, z^n \right)_i
  = \mu(\theta^n, \epsilon)_i + \left( \epsilon \sqrt{G^{-1}(\theta^n)}\, z^n \right)_i   (3)

which defines a proposal mechanism with density q(θ*|θ^n) = N(θ* | µ(θ^n, ε), ε² G^{-1}(θ^n)) and acceptance probability min{1, p(θ*) q(θ^n|θ*) / [p(θ^n) q(θ*|θ^n)]} to ensure convergence to the invariant density p(θ). We note that this deterministically defines a position-specific proposal distribution at each point on the manifold; we may categorise this as another locally adaptive MCMC method and convergence to the invariant density follows from using the standard Metropolis-Hastings ratio.

It may be computationally expensive to calculate the 3rd order derivatives needed for working out the rate of change of the metric tensor, and so an obvious approximation is to assume these derivatives are zero for each step. In other words, for each step we can assume that the metric is locally constant. Of course even if the curvature of the manifold is not constant, this simplified proposal mechanism still defines a correct MCMC method which converges to the target measure, as we accept or reject moves using a Metropolis-Hastings ratio.
This is equivalent to a position-specific pre-conditioned MALA proposal, where the pre-conditioning is dependent on the current parameter values

\theta^{n+1} = \theta^n + \frac{\epsilon^2}{2} G^{-1}(\theta^n) \nabla_\theta L(\theta^n) + \epsilon \sqrt{G^{-1}(\theta^n)}\, z^n   (4)

For a manifold whose metric tensor is globally constant, this reduces further to a pre-conditioned MALA proposal, where the pre-conditioning is effectively independent of the current parameter values. In this context, such pre-conditioning no longer needs to be chosen arbitrarily, but rather it may be informed by the geometry of the distribution we are exploring.

We point out that any approximations of the metric tensor would be best employed in the simplified mMALA scheme, defining the covariance of the proposal distribution, or as a flat approximation to a manifold. In the case of full mMALA, or even Hamiltonian Monte Carlo defined on a Riemannian manifold [1], Christoffel symbols are also used, incorporating the derivatives of the metric tensor as it changes across the surface of the manifold; in many cases the extra expense of computing or estimating such higher order information is not sufficiently supported by the increase in sampling efficiency [1], and for this reason we do not consider such methods further.

In the next section we consider the representation of the metric tensor as the covariance of the tangent vectors at each point. We consider a method of estimating this such that convergence is guaranteed by extending the state-space and introducing auxiliary variables that are conditioned on the current point, and we demonstrate its potential within a Riemannian geometric context.

4 Approximate Geometry for MCMC Proposals

We first derive an acceptance ratio on an extended state-space that enables convergence to the stationary distribution before describing the implications for developing new differential geometric MCMC methods.
Following [13, 14] we can employ the oft-used trick of defining an extended state space X × D. We may of course choose D to be of any size, however in our particular case we shall choose D to be R^{m×s}, where m is the dimension of the data and s is the number of samples; the reasons for this shall become clear. We therefore sample from this extended state space, whose joint distribution follows as π* = π(x) π̂(d|x). Given the current states [x_n, d_n], we may propose a new state q(x*|x_n, d_n) and the MCMC algorithm will satisfy detailed balance, and hence converge to the stationary distribution, if we accept joint proposals with Metropolis-Hastings probability ratio

\alpha(x^*, d^* | x_n, d_n) = \min\left(1,\ \frac{\pi^*(x^*, d^*)}{\pi^*(x_n, d_n)} \frac{\hat{\pi}(d_n|x_n)\, q(x_n|x^*, d^*)}{\hat{\pi}(d^*|x^*)\, q(x^*|x_n, d_n)}\right)
  = \min\left(1,\ \frac{\pi(x^*)}{\pi(x_n)} \frac{\hat{\pi}(d^*|x^*)}{\hat{\pi}(d_n|x_n)} \frac{\hat{\pi}(d_n|x_n)}{\hat{\pi}(d^*|x^*)} \frac{q(x_n|x^*, d^*)}{q(x^*|x_n, d_n)}\right)
  = \min\left(1,\ \frac{\pi(x^*)}{\pi(x_n)} \frac{q(x_n|x^*, d^*)}{q(x^*|x_n, d_n)}\right)   (5)

This is a reversible transition on π(x, d), from which we can sample to obtain π(x) as the marginal distribution. The key point here is that we may define our proposal distribution q(x*|x_n, d_n) in almost any deterministic manner we wish. In particular, choosing π̂(d|x) to be the same distribution as the log-likelihood for our statistical model, the s samples from the extended state space D may be thought of as pseudo-data, from which we can deterministically calculate an estimate of the Expected Fisher Information to use as the covariance of a proposal distribution.
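To make the construction concrete, the following sketch (our own illustration, not the paper's code) estimates the metric as the empirical covariance of score vectors computed from pseudo-data. We use a simple Gaussian model, chosen because its Expected Fisher Information is known in closed form and the estimate can therefore be checked; all function names are hypothetical:

```python
import numpy as np

# Hypothetical model: m i.i.d. observations x_i ~ N(mu, sigma^2).
# The exact Expected Fisher Information for (mu, sigma) is diag(m/sigma^2, 2m/sigma^2).
def score(x, mu, sigma):
    """Score vector dL/d(mu, sigma) of the Gaussian log-likelihood for one dataset x."""
    d_mu = np.sum(x - mu) / sigma**2
    d_sigma = np.sum((x - mu)**2 / sigma**3 - 1.0 / sigma)
    return np.array([d_mu, d_sigma])

def estimate_metric(mu, sigma, m, s, seed=0):
    """Draw s pseudo-datasets at the current parameters and return the empirical
    covariance of their score vectors, i.e. the metric tensor estimate."""
    rng = np.random.default_rng(seed)
    scores = np.stack([score(rng.normal(mu, sigma, size=m), mu, sigma)
                       for _ in range(s)])
    return np.cov(scores, rowvar=False)   # E[score] = 0, so Cov(score) approximates G

G_hat = estimate_metric(mu=0.5, sigma=1.0, m=10, s=20000)
```

For m = 10 and σ = 1 the exact metric is diag(10, 20), which the estimate approaches as s grows; the estimate is positive semi-definite by construction and could serve as the pre-conditioning matrix in the simplified proposal of Equation 4.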
Specifically, each sampled pseudo-data can be used deterministically to give a sample of ∂L/∂θ given the current θ, all of which may then be used deterministically to obtain an approximation of the covariance of tangent vectors at the current point. This approximation, unlike the Hessian, will always be positive definite, and gives us an approximation of the metric tensor defining the local geometry. Further, we may use additional deterministic procedures, given x_n and d_n, to construct better proposals; we consider a sparsity inducing approach in the next section.

5 Stability and Sparsity via ℓ1 Regularisation

We have two motivations for using an ℓ1 regularisation approach for computing the inverse of the metric tensor; firstly, since the metric is equivalent to the covariance of tangent vectors, we may obtain more stable estimates of the inverse metric tensor using smaller numbers of samples, and secondly, it induces a natural sparsity in the inverse metric, which may be exploited to decrease the computational cost associated with repeated Cholesky factorisations and matrix-vector multiplications. We adopted the graphical lasso [15, 16], in which the maximum likelihood solution results in the matrix optimisation problem

\arg\min_{A \succ 0} \left\{ -\log\det(A) + \mathrm{tr}(AG) + \gamma \sum_{i \neq j} |A_{ij}| \right\}   (6)

where G is an empirical covariance matrix and γ is a regularisation parameter. This convex optimisation problem aims to find A, the regularised maximum likelihood estimate for the inverse of the covariance matrix. Importantly, the optimisation algorithm we employ is deterministic given our tangent vectors, and therefore does not affect the validity of our MCMC algorithm; indeed we note that we may use any deterministic sparse matrix inverse estimation approach within this MCMC algorithm.
The use of the ℓ1 regularisation promotes sparsity [23]; larger values of the regularisation parameter γ result in a solution that is more sparse, while as γ approaches zero the solution converges to the inverse of G (assuming it exists). It is also worth noting that the ℓ1 regularisation helps to recover a sparse structure in a high dimensional setting where the number of samples is less than the number of parameters [17].

In order to achieve sufficiently fast computation we carefully implemented the graphical lasso algorithm tailored to this problem. We used no penalisation for the diagonal and a uniform regularisation parameter value for the off-diagonal elements. The motivation for not penalising the diagonal is that it has been shown in the covariance estimation setting that the true inverse is approached as the number of samples is increased [18], and the structure is learned more accurately [19]. The simple regularisation structure allowed code simplification and reduction in memory use. We refactored the graphical lasso algorithm of [15] and implemented it directly in FORTRAN, which we then called from MATLAB, making sure to minimise matrix copying due to MATLAB processing. This code is available as a software package, GLASSOFAST [20].

In the current context, the use of this approach allows us to obtain sparse approximations to the inverse metric tensor, which may then be used in an MCMC proposal. Indeed, even if we have access to an analytic metric tensor we need not use the full inverse for our proposals; we could still obtain an approximate sparse representation, which may be beneficial computationally.
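As a self-contained illustration of problem (6) with an unpenalised diagonal, the following sketch solves it by ADMM rather than by the coordinate-descent graphical lasso used in GLASSOFAST; the solver and its parameter names are our own, not the paper's implementation:

```python
import numpy as np

def sparse_precision(G, gamma, rho=1.0, n_iter=500):
    """ADMM for argmin_{A > 0} -logdet(A) + tr(AG) + gamma * sum_{i != j} |A_ij|,
    with the diagonal left unpenalised, as in the text."""
    p = G.shape[0]
    Z = np.eye(p)
    U = np.zeros((p, p))
    off_diag = ~np.eye(p, dtype=bool)
    for _ in range(n_iter):
        # A-update: solve rho*A - A^{-1} = rho*(Z - U) - G via eigendecomposition;
        # the eigenvalue map below keeps A positive definite by construction.
        lam, Q = np.linalg.eigh(rho * (Z - U) - G)
        a = (lam + np.sqrt(lam**2 + 4.0 * rho)) / (2.0 * rho)
        A = (Q * a) @ Q.T
        # Z-update: soft-threshold only the off-diagonal entries of A + U
        Z = A + U
        Z[off_diag] = np.sign(Z[off_diag]) * np.maximum(
            np.abs(Z[off_diag]) - gamma / rho, 0.0)
        # dual variable update
        U += A - Z
    return A

G = np.array([[2.0, 0.8, 0.0],
              [0.8, 2.0, 0.8],
              [0.0, 0.8, 2.0]])            # a hypothetical empirical covariance
A_dense  = sparse_precision(G, gamma=1e-8)  # ~ G^{-1}
A_sparse = sparse_precision(G, gamma=0.5)   # shrunk off-diagonals
```

As γ → 0 the solution approaches G^{-1}, while larger γ shrinks (and eventually zeroes) the off-diagonal entries, mirroring the behaviour described above.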
The metric tensor varies smoothly across a Riemannian manifold and, theoretically, if we are calculating the inverses of two metric tensors that are close to each other, they may be numerically similar enough for the solution of one to speed up convergence of the solution for the other, although in the simulations in this paper we found no benefit in doing so, i.e. the metric tensor varied too much as the MCMC sampler took large steps across the manifold.

6 Simulation Study

We consider a challenging class of statistical models that severely tests the sampling capability of MCMC methods; in particular, two examples based on nonlinear differential equations using a biologically motivated robust Student-t likelihood, which renders the metric tensor analytically intractable.

Figure 1: (a) Exact full inverse; (b) Approximate sparse inverse. In this comparison we plotted the exact and the sparse approximate inverses of a typical metric tensor G; we note that only subsets of parameters are typically strongly correlated in the statistical models we consider here and that the sparse approximation still captures the main correlation structure present. Here the dimension is p = 25, and the regularisation parameter γ is 0.05 · ||G||∞.

Table 1: Summary of results for the Fitzhugh-Nagumo model with 10 runs of each parameter sampling scheme and 5000 posterior samples.

Sampling Method   Time (s)   Mean ESS (a, b, c)    Total Time/(Min mean ESS)   Relative Speed
Metropolis        14.5       139, 18.2, 23.4       0.80                        ×1.1
MALA              24.9       119.3, 28.7, 52.3     0.87                        ×1.0
mMALA Simp.       35.9       283.4, 136.6, 173.7   0.26                        ×3.4
We examine the efficiency of our MCMC method with approximate metric on a well studied toy example, the Fitzhugh-Nagumo model, before examining a realistic, nonlinear and highly challenging example describing enzymatic circadian control in the plant Arabidopsis thaliana [22].

6.1 Nonlinear Ordinary Differential Equations

Statistical modelling using systems of nonlinear ordinary differential equations plays a vital role in unravelling the structure and behaviour of biological processes at a molecular level. The well-used Gaussian error model however is often inappropriate, particularly in molecular biology where limited measurements may not be repeated under exactly the same conditions and are susceptible to bias and systematic errors. The use of a Student-t distribution as a likelihood may help the robustness of the model with respect to possible outliers in the data. This presents a problem for standard manifold MCMC algorithms as it makes the metric tensor analytically intractable. We consider first the Fitzhugh-Nagumo model [1]. This synthetic dataset consisted of 200 time points simulated from the model between t = [0, 20] with parameters [a, b, c] = [0.2, 0.2, 3], to which Gaussian distributed noise was added with variance σ² = 0.25. We employed a Student-t likelihood with scaling parameter v = 3, and compared M-H and MALA (both employing scaled isotropic covariances), and simplified mMALA with approximate metric. The stepsize for each was automatically adjusted during the burn-in phase to obtain the theoretically optimal acceptance rate.

Table 1 shows the results including time-normalised effective sample size (ESS) as a measure of sampling efficiency excluding burn-in [1].
The approximate manifold sampler offers a modest improvement on the other two samplers; despite taking longer to run because of the computational cost of estimating the metric, the samples it draws exhibit lower autocorrelation, and as such the approximate manifold sampler offers the highest time-normalised ESS.

The toy Fitzhugh-Nagumo model is however rather simple, and despite being a popular example is rather unlike many realistic models used nowadays in the molecular modelling community. As such we consider another larger model that describes the enzymatic control of the circadian networks in Arabidopsis thaliana [21]. This is an extremely challenging, highly nonlinear model. We consider inferring the 6 rate parameters that control production and decay of proteins in the nucleus and cytoplasm (see [22] for the equations and full details of the model), again employing a Student-t likelihood for which the Expected Fisher Information is analytically intractable.

Table 2: Comparison of pseudodata sample size on the quality of metric tensor estimation, and hence on sampling efficiency, using the circadian network example model, with 10 runs and 10,000 posterior samples.

Number of Samples   Time (s)   Min Mean ESS   Total Time/(Min mean ESS)   Relative Speed
10                  155.6      85.1           1.90                        ×1.0
20                  163.2      171.9          0.95                        ×2.0
30                  168.9      209.1          0.81                        ×2.35
40                  175.2      208.3          0.84                        ×2.26

Table 3: Summary of results for the circadian network model with 10 runs of each parameter sampling scheme and 10,000 posterior samples.

Sampling Method   Time (s)   Min Mean ESS   Total Time/(Min mean ESS)   Relative Speed
Metropolis        37.1       6.0            6.2                         ×4.4
MALA              101.3      3.7            27.4                        ×1.0
Adaptive MCMC     110.4      46.7           2.34                        ×11.7
mMALA Simp.       168.9      209.1          0.81                        ×33.8
We used parameter values from [22] to simulate observations for each of the six species at 48 time points, representing 48 hours in the model. Student-t distributed noise was then added to obtain the data for inference. We first investigated the effect that the tangent vector sample size for covariance estimation has on the sampling efficiency of simplified mMALA. The results in Table 2 show that there is a threshold above which a more accurate estimate of the metric tensor yields no additional sampling advantage; for this particular example model the threshold is around 30 pseudodata samples. Table 3 shows the time-normalised statistical efficiency of each of the sampling methods; this time we also compare an Adaptive MCMC algorithm [2] with M-H, MALA, and simplified mMALA with approximate geometry. Both the M-H and MALA algorithms fail to explore the target distribution and have severe difficulties with the extreme scalings and nonlinear correlation structure present in the manifold. The Adaptive MCMC method works reasonably well after taking 2000 samples to learn the covariance structure, although its performance is still poorer than that of the simplified mMALA scheme, which converges almost immediately with no adaptation time required; the approximation mMALA makes of the local geometry allows it to deal adequately with the different scalings and correlations that occur in different parts of the space.

7 Conclusions

Riemannian geometry can be extremely useful for enabling efficient sampling from arbitrary probability densities. The metric tensor may be used to create position-specific proposal mechanisms that allow MCMC methods to adapt automatically to the local correlation structure induced by the sensitivities of the parameters of a statistical model.
The metric tensor may conveniently be defined as the Expected Fisher Information; however, this quantity is often difficult or impossible to compute analytically. We have presented a sampling scheme that approximates the Expected Fisher Information by estimating the covariance structure of the tangent vectors at each point on the manifold. Considering this problem as one of inverse covariance estimation naturally led us to the use of ℓ1 regularisation to improve the estimation procedure. This had the added benefit of inducing sparsity in the metric tensor, which may offer computational advantages when proposing MCMC moves across the manifold. For future work it will be exciting to investigate the potential impact of approximate, sparse metric tensors for high-dimensional problems.

Acknowledgements

Ben Calderhead gratefully acknowledges his Research Fellowship through the 2020 Science programme, funded by EPSRC grant number EP/I017909/1 and supported by Microsoft Research.

References

[1] M. Girolami and B. Calderhead, Riemann Manifold Langevin and Hamiltonian Monte Carlo Methods (with discussion), Journal of the Royal Statistical Society: Series B, 73:123-214, 2011
[2] H. Haario, E. Saksman and J. Tamminen, An Adaptive Metropolis Algorithm, Bernoulli, 7(2):223-242, 2001
[3] G. Roberts and J. Rosenthal, Examples of Adaptive MCMC, Journal of Computational and Graphical Statistics, 18(2), 2009
[4] G. Roberts and O. Stramer, Langevin diffusions and Metropolis-Hastings algorithms, Methodol. Comput. Appl. Probab., 4:337-358, 2003
[5] H. Tjelmeland and B. Hegstad, Mode Jumping Proposals in MCMC, Scandinavian Journal of Statistics, 28(1), 2001
[6] S. Amari and H. Nagaoka, Methods of Information Geometry, Oxford University Press, 2000
[7] Y. Qi and T. Minka, Hessian-based Markov Chain Monte Carlo Algorithms, 1st Cape Cod Workshop on Monte Carlo Methods, 2002
[8] A. Honkela, T. Raiko, M. Kuusela, M. Tornio and J. Karhunen, Approximate Riemannian conjugate gradient learning for fixed-form variational Bayes, JMLR, 11:3235-3268, 2010
[9] M. K. Murray and J. W. Rice, Differential Geometry and Statistics, Chapman and Hall, 1993
[10] C. R. Rao, Information and accuracy attainable in the estimation of statistical parameters, Bull. Calc. Math. Soc., 37:81-91, 1945
[11] H. Jeffreys, Theory of Probability, 1st ed., The Clarendon Press, Oxford, 1939
[12] J. Kent, Time reversible diffusions, Adv. Appl. Probab., 10:819-835, 1978
[13] J. Besag, P. Green, D. Higdon and K. Mengersen, Bayesian Computation and Stochastic Systems, Statistical Science, 10(1):3-41, 1995
[14] A. Doucet, P. Jacob and A. Johansen, Discussion of Riemann Manifold Langevin and Hamiltonian Monte Carlo Methods, Journal of the Royal Statistical Society: Series B, 73:162, 2011
[15] J. Friedman, T. Hastie and R. Tibshirani, Sparse inverse covariance estimation with the graphical lasso, Biostatistics, 9(3):432-441, 2008
[16] O. Banerjee, L. El Ghaoui and A. d'Aspremont, Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data, JMLR, 9(6), 2008
[17] P. Ravikumar, M. J. Wainwright, G. Raskutti and B. Yu, Model selection in Gaussian graphical models: High-dimensional consistency of ℓ1-regularized MLE, NIPS 21, 2008
[18] A. J. Rothman, P. J. Bickel, E. Levina and J. Zhu, Sparse permutation invariant covariance estimation, Electronic Journal of Statistics, 2:494-515, 2008
[19] J. Duchi, S. Gould and D. Koller, Projected Subgradient Methods for Learning Sparse Gaussians, Conference on Uncertainty in Artificial Intelligence, 2008
[20] M. A. Sustik and B. Calderhead, GLASSOFAST: An efficient GLASSO implementation, Technical Report TR-12-29, Computer Science Department, University of Texas at Austin, 2012
[21] J. C. W. Locke, A. Millar and M. Turner, Modelling genetic networks with noisy and varied experimental data: the circadian clock in Arabidopsis thaliana, J. Theor. Biol., 234:383-393, 2005
[22] B. Calderhead and M. Girolami, Statistical analysis of nonlinear dynamical systems using differential geometric sampling methods, Journal of the Royal Society Interface Focus, 1(6), 2011
[23] R. Tibshirani, Regression shrinkage and selection via the lasso, Journal of the Royal Statistical Society: Series B, 58:267-288, 1996