{"title": "Heavy-Tailed Process Priors for Selective Shrinkage", "book": "Advances in Neural Information Processing Systems", "page_first": 2406, "page_last": 2414, "abstract": "Heavy-tailed distributions are often used to enhance the robustness of regression and classification methods to outliers in output space. Often, however, we are confronted with ``outliers'' in input space, which are isolated observations in sparsely populated regions. We show that heavy-tailed process priors (which we construct from Gaussian processes via a copula), can be used to improve robustness of regression and classification estimators to such outliers by selectively shrinking them more strongly in sparse regions than in dense regions. We carry out a theoretical analysis to show that selective shrinkage occurs provided the marginals of the heavy-tailed process have sufficiently heavy tails. The analysis is complemented by experiments on biological data which indicate significant improvements of estimates in sparse regions while producing competitive results in dense regions.", "full_text": "Heavy-Tailed Process Priors for Selective Shrinkage\n\nFabian L. Wauthier\n\nUniversity of California, Berkeley\n\nflw@cs.berkeley.edu\n\nMichael I. Jordan\n\nUniversity of California, Berkeley\njordan@cs.berkeley.edu\n\nAbstract\n\nHeavy-tailed distributions are often used to enhance the robustness of regression\nand classi\ufb01cation methods to outliers in output space. Often, however, we are con-\nfronted with \u201coutliers\u201d in input space, which are isolated observations in sparsely\npopulated regions. We show that heavy-tailed stochastic processes (which we con-\nstruct from Gaussian processes via a copula), can be used to improve robustness\nof regression and classi\ufb01cation estimators to such outliers by selectively shrinking\nthem more strongly in sparse regions than in dense regions. 
We carry out a theo-\nretical analysis to show that selective shrinkage occurs when the marginals of the\nheavy-tailed process have suf\ufb01ciently heavy tails. The analysis is complemented\nby experiments on biological data which indicate signi\ufb01cant improvements of es-\ntimates in sparse regions while producing competitive results in dense regions.\n\n1\n\nIntroduction\n\nGaussian process classi\ufb01ers (GPCs) [12] provide a Bayesian approach to nonparametric classi\ufb01ca-\ntion with the key advantage of producing predictive class probabilities. Unfortunately, when training\ndata are unevenly sampled in input space, GPCs tend to over\ufb01t in the sparsely populated regions.\nOur work is motivated by an application to protein folding where this presents a major dif\ufb01culty.\nIn particular, while Nature provides samples of protein con\ufb01gurations near the global minima of\nfree energy functions, protein-folding algorithms, which imitate Nature by minimizing an estimated\nenergy function, necessarily explore regions far from the minimum. If the estimate of free energy is\npoor in those sparsely-sampled regions then the algorithm has a poor guide towards the minimum.\nMore generally this problem can be viewed as one of \u201ccovariate shift,\u201d where the sampling pattern\ndiffers in the training and testing phase.\nIn this paper we investigate a GPC-based approach that addresses over\ufb01tting by shrinking predictive\nclass probabilities towards conservative values. For an unevenly sampled input space it is natural\nto consider a selective shrinkage strategy: we wish to shrink probability estimates more strongly in\nsparse regions than in dense regions. To this end several approaches could be considered. If sparse\nregions can be readily identi\ufb01ed, selective shrinkage could be induced by tailoring the Gaussian\nprocess (GP) kernel to re\ufb02ect that information. 
In the absence of such knowledge, Goldberg and\nWilliams [5] showed that Gaussian process regression (GPR) can be augmented with a GP on the\nlog noise level. More recent work has focused on partitioning input space into discrete regions\nand de\ufb01ning different kernel functions on each. Treed Gaussian process regression [6] and Treed\nGaussian process classi\ufb01cation [1] represent advanced variations of this theme that de\ufb01ne a prior\ndistribution over partitions and their respective kernel hyperparameters. Another line of research\nwhich could be adapted to this problem posits that the covariate space is a nonlinear deformation\nof another space on which a Gaussian process prior is placed [3, 13]. Instead of directly modifying\nthe kernel matrix, the observed non-uniformity of measurements is interpreted as being caused by\nthe spatial deformation. A dif\ufb01culty with all these approaches is that posterior inference is based on\nMCMC, which can be overly slow for the large-scale problems that we aim to address.\n\n1\n\n\fThis paper shows that selective shrinkage can be more elegantly introduced by replacing the Gaus-\nsian process underlying GPC with a stochastic process that has heavy-tailed marginals (e.g., Laplace,\nhyperbolic secant, or Student-t). While heavy-tailed marginals are generally viewed as providing ro-\nbustness to outliers in the output space (i.e., the response space), selective shrinkage can be viewed\nas a form of robustness to outliers in the input space (i.e., the covariate space). Indeed, selective\nshrinkage means the data points that are far from other data points in the input space are regularized\nmore strongly. We provide a theoretical analysis and empirical results to show that inference based\non stochastic processes with heavy-tailed marginals yields precisely this kind of shrinkage.\nThe paper is structured as follows: Section 2 provides background on GPCs and highlights how\nselective shrinkage can arise. 
We present a construction of heavy-tailed processes in Section 3 and show that inference reduces to standard computations in a Gaussian process. An analysis of our approach is presented in Section 4 and details on inference algorithms are presented in Section 5. Experiments on biological data in Section 6 demonstrate that heavy-tailed process classification substantially outperforms GPC in sparse regions while performing competitively in dense regions. The paper concludes with an overview of related research and final remarks in Sections 7 and 8.

2 Gaussian process classification and shrinkage

A Gaussian process (GP) [12] is a prior on functions z : X → R defined through a mean function (usually identically zero) and a symmetric positive semidefinite kernel k(·, ·). For a finite set of locations X = (x1, . . . , xn) we write z(X) ∼ p(z(X)) = N(0, K(X, X)) as a random variable distributed according to the GP with finite-dimensional kernel matrix [K(X, X)]i,j = k(xi, xj). Let y denote an n-vector of binary class labels associated with measurement locations X.¹ For Gaussian process classification (GPC) [12] the probability that a test point x∗ is labeled as class y∗ = 1, given training data (X, y), is computed as

$$p(y_* = 1 \mid X, y, x_*) = \mathbb{E}_{p(z(x_*) \mid X, y, x_*)}\!\left[\frac{1}{1 + \exp\{-z(x_*)\}}\right], \qquad (1)$$

$$p(z(x_*) \mid X, y, x_*) = \int p(z(x_*) \mid X, z(X), x_*)\, p(z(X) \mid X, y)\, dz(X).$$

The predictive distribution p(z(x∗)|X, y, x∗) represents a regression on z(x∗) with a complicated observation model y|z. The central observation from Eq.
(1) is that we could selectively shrink the prediction p(y∗ = 1|X, y, x∗) towards a conservative value 1/2 by selectively shrinking p(z(x∗)|X, y, x∗) closer to a point mass at zero.

3 Heavy-tailed process priors via the Gaussian copula

In this section we construct the heavy-tailed stochastic process by transforming a GP. As with the GP, we will treat the new process as a prior on functions. Suppose that diag(K(X, X)) = σ²1. We define the heavy-tailed process f(X) with marginal c.d.f. Gb as

$$z(X) \sim N(0, K(X, X)) \qquad (2)$$
$$u(X) = \Phi_{0,\sigma^2}(z(X))$$
$$f(X) = G_b^{-1}(u(X)) = G_b^{-1}(\Phi_{0,\sigma^2}(z(X))). \qquad (3)$$

Here the function Φ_{0,σ²}(·) is the c.d.f. of a centered Gaussian with variance σ². Presently, we only consider the case when Gb is the (continuous) c.d.f. of a heavy-tailed density gb with scale parameter b that is symmetric about the origin. Examples include the Laplace, hyperbolic secant and Student-t distribution. We note that other authors have considered asymmetric or even discrete distributions [2, 11, 16] while Snelson et al. [15] use arbitrary monotonic transformations in place of G_b^{-1}(Φ_{0,σ²}(·)). The process u(X) has the density of a Gaussian copula [10, 16] and is critical in transferring the correlation structure encoded by K(X, X) from z(X) to f(X). If we define

¹To improve the clarity of exposition, we only deal with binary classification for now.
A full multiclass classification model is used in our experiments.

z(f(X)) = Φ⁻¹_{0,σ²}(Gb(f(X))), it is well known [7, 9, 11, 15, 16] that the density of f(X) satisfies

$$p(f(X)) = \frac{\prod_{i=1}^n g_b(f(x_i))}{|K(X, X)/\sigma^2|^{1/2}} \exp\left\{-\frac{1}{2}\, z(f(X))^\top \left[K(X, X)^{-1} - \frac{I}{\sigma^2}\right] z(f(X))\right\}. \qquad (4)$$

Observe that if K(X, X) = σ²I then p(f(X)) = ∏ᵢ₌₁ⁿ gb(f(xi)). Also note that if Gb were chosen to be Gaussian, we would recover the Gaussian process. The predictive distribution p(f(x∗)|X, f(X), x∗) can be interpreted as a heavy-tailed process regression (HPR). It is easy to see that its computation can be reduced to standard computations in a Gaussian model by nonlinearly transforming observations f(X) into z-space. The predictive distribution in z-space satisfies

$$p(z(x_*) \mid X, f(X), x_*) = N(\mu_*, \Sigma_*) \qquad (5)$$
$$\mu_* = K(x_*, X) K(X, X)^{-1} z(f(X)) \qquad (6)$$
$$\Sigma_* = K(x_*, x_*) - K(x_*, X) K(X, X)^{-1} K(X, x_*). \qquad (7)$$

The corresponding distribution in f-space follows by another change of variables. Having defined the heavy-tailed stochastic process in general we now turn to an analysis of its shrinkage properties.

4 Selective shrinkage

By "selective shrinkage" we mean that the degree of shrinkage applied to a collection of estimators varies across estimators.
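As a concrete illustration of the construction in Eqs. (2)-(3) and the z-space predictive of Eqs. (5)-(7), the following sketch samples a heavy-tailed process with Laplace marginals and performs HPR prediction. The squared-exponential kernel, lengthscale, and data here are illustrative assumptions, not choices made in the paper:

```python
import numpy as np
from scipy.stats import norm, laplace

def se_kernel(A, B, sigma2=1.0, ell=0.3):
    # Squared-exponential kernel; diag(K(X, X)) = sigma2 * 1, as assumed above.
    return sigma2 * np.exp(-0.5 * (A[:, None] - B[None, :]) ** 2 / ell**2)

rng = np.random.default_rng(0)
X = np.linspace(0.0, 1.0, 5)
sigma2, b = 1.0, 2.0
K = se_kernel(X, X, sigma2) + 1e-8 * np.eye(5)   # small jitter for stability

# Eqs. (2)-(3): z ~ N(0, K), u = Phi_{0,sigma^2}(z), f = G_b^{-1}(u).
z = rng.multivariate_normal(np.zeros(5), K)
u = norm.cdf(z, scale=np.sqrt(sigma2))
f = laplace.ppf(u, scale=b)                      # heavy-tailed marginals

# Eqs. (5)-(7): map observations back to z-space, then do a GP prediction.
x_star = np.array([0.37])                        # arbitrary test location
z_f = norm.ppf(laplace.cdf(f, scale=b), scale=np.sqrt(sigma2))
Ks = se_kernel(x_star, X, sigma2)
mu_star = Ks @ np.linalg.solve(K, z_f)                                  # Eq. (6)
Sigma_star = se_kernel(x_star, x_star, sigma2) - Ks @ np.linalg.solve(K, Ks.T)  # Eq. (7)
```

The round trip f → z_f recovers z up to floating point, which is exactly why HPR prediction reduces to standard Gaussian computations.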
As motivated in Section 2, we are specifically interested in selectively shrinking posterior distributions near isolated observations more strongly than in dense regions. This section shows that we can achieve this by changing the form of the prior marginals (heavy-tailed instead of Gaussian) and that this induces stronger selective shrinkage than any GPR could induce. Since HPR uses a GP in its construction, which can induce some selective shrinkage on its own, care must be taken to investigate only the additional benefits the transformation G_b^{-1}(Φ_{0,σ²}(·)) has on shrinkage. For this reason we assume a particular GP prior which leads to a special type of shrinkage in GPR and then check how an HPR model built on top of that GP changes the observed behavior.

In this section we provide an idealized analysis that allows us to compare the selective shrinkage obtained by GPR and HPR. Note that we focus on regression in this section so that we can obtain analytical results. We work with n measurement locations, X = (x1, . . . , xn), whose index set {1, . . . , n} can be partitioned into a "dense" set D with |D| = n − 1 and a single "sparse" index s ∉ D. Assume that x_d = x_{d′}, ∀d, d′ ∈ D, so that we may let (without loss of generality) K̃(x_d, x_{d′}) = 1, ∀d ≠ d′ ∈ D. We also assert that x_d ≠ x_s ∀d ∈ D and let K̃(x_d, x_s) = K̃(x_s, x_d) = 0 ∀d ∈ D. Assuming that n > 2 we fix the remaining entry K̃(x_s, x_s) = ϵ/(ϵ + n − 2), for some ϵ > 0. We interpret ϵ as a noise variance and let K = K̃ + ϵI.

Denote any distributions computed under the GPR model by pgp(·) and those computed in HPR by php(·). Using K(X, X) = K, define z(X) as in Eq. (2). Let y denote a vector of real-valued measurements for a regression task.
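The idealized kernel just described is easy to build numerically, and one can check directly the fact the argument relies on: the GPR posterior variances at the dense and sparse locations coincide for this choice of K. A sketch (the values of n and ϵ are arbitrary):

```python
import numpy as np

def idealized_kernel(n, eps):
    # n-1 "dense" locations with pairwise kernel value 1, plus one "sparse"
    # location s with K~(x_d, x_s) = 0 and K~(x_s, x_s) = eps/(eps + n - 2).
    Kt = np.ones((n, n))
    Kt[:-1, -1] = Kt[-1, :-1] = 0.0
    Kt[-1, -1] = eps / (eps + n - 2)
    return Kt, Kt + eps * np.eye(n)   # K = K~ + eps * I

def posterior_var(i, Kt, K):
    # sigma_i^2 = K(x_i, x_i) - K~(x_i, X) K(X, X)^{-1} K~(X, x_i)
    return K[i, i] - Kt[i] @ np.linalg.solve(K, Kt[:, i])

n, eps = 5, 0.5
Kt, K = idealized_kernel(n, eps)
sigma_d2 = posterior_var(0, Kt, K)      # any dense location
sigma_s2 = posterior_var(n - 1, Kt, K)  # the sparse location
```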
The posterior distribution of z(xi) given y, with xi ∈ X, is derived by standard Gaussian computations as

$$p_{gp}(z(x_i) \mid X, y) = N(\mu_i, \sigma_i^2)$$
$$\mu_i = \tilde{K}(x_i, X) K(X, X)^{-1} y$$
$$\sigma_i^2 = K(x_i, x_i) - \tilde{K}(x_i, X) K(X, X)^{-1} \tilde{K}(X, x_i).$$

For our choice of K(X, X) one can show that σ_d² = σ_s² for d ∈ D. To ensure that the posterior distributions agree at the two locations we require μ_d = μ_s, which holds if measurements y satisfy

$$y \in Y_{gp} \triangleq \left\{ y \,\middle|\, \left(\tilde{K}(x_d, X) - \tilde{K}(x_s, X)\right) K(X, X)^{-1} y = 0 \right\} = \left\{ y \,\middle|\, \sum_{d \in D} y_d = y_s \right\}.$$

A similar analysis can be carried out for the induced HPR model. By Eqs. (5)-(7) HPR inference leads to identical distributions php(z(x_d)|X, y′) = php(z(x_s)|X, y′) with d ∈ D if measurements y′ in f-space satisfy

$$y' \in Y_{hp} \triangleq \left\{ y' \,\middle|\, \left(\tilde{K}(x_d, X) - \tilde{K}(x_s, X)\right) K(X, X)^{-1} \Phi_{0,\sigma^2}^{-1}(G_b(y')) = 0 \right\} = \left\{ y' = G_b^{-1}(\Phi_{0,\sigma^2}(y)) \,\middle|\, y \in Y_{gp} \right\}.$$

Figure 1: Illustration of G_b^{-1}(Φ_{0,σ²}(x)), for σ² = 1.0, with Gb the c.d.f. of (a) the Laplace distribution, gb(x) = (1/(2b)) exp{−|x|/b}; (b) the hyperbolic secant distribution, gb(x) = (1/(2b)) sech(πx/(2b)); (c) a Student-t inspired distribution, gb(x) = 1/(b(2 + (x/b)²)^{3/2}); all with scale parameter b. Each plot shows three samples (dotted, dashed, solid) for growing b. As b increases
As b increases\nthe distributions become heavy-tailed and the gradient of G\u22121\n\nb (\u03a60,\u03c32(x)) increases.\n\nTo compare the shrinkage properties of GPR and HPR we analyze select pairs of measurements\nb (\u03a60,\u03c32(\u00b7)) is strongly concave on (\u2212\u221e, 0],\nin Ygp and Yhp. The derivation requires that G\u22121\nstrongly convex on [0, +\u221e) and has gradient > 1 on R. To see intuitively why this should hold,\nnote that for Gb with fatter tails than a Gaussian, |G\u22121\nb (\u03a60,\u03c32(x))| should eventually dominate\n|\u03a6\u22121\n0,b2(\u03a60,\u03c32(x))| = (b/\u03c3)|x|. Figure 1 demonstrates graphically that the assumption holds for sev-\neral choices of Gb, provided b is large enough, i.e., that gb has suf\ufb01ciently heavy tails. Indeed, it can\nb (\u03a60,\u03c32(\u00b7)) scale lin-\nbe shown that for scale parameters b > 0, the \ufb01rst and second derivatives of G\u22121\nearly with b. Consider a measurement 0 (cid:54)= y \u2208 Ygp with sign (y(xd)) = sign (y(xd(cid:48))) ,\u2200d, d(cid:48) \u2208 D.\nAnalyzing such y is relevant, as we are most interested in comparing how multiple reinforcing ob-\nservations at clustered locations and a single isolated observation are absorbed during inference. By\nde\ufb01nition of Ygp, for d\u2217 = argmaxd\u2208D|yd| we have |yd\u2217| < |ys| as long as n > 2. 
The corresponding element y′ = G_b^{-1}(Φ_{0,σ²}(y)) ∈ Y_hp then satisfies

$$|y'(x_s)| = \left|G_b^{-1}(\Phi_{0,\sigma^2}(y(x_s)))\right| > \left|\frac{y(x_s)}{y(x_{d^*})}\, G_b^{-1}(\Phi_{0,\sigma^2}(y(x_{d^*})))\right| = \left|\frac{y(x_s)}{y(x_{d^*})}\, y'(x_{d^*})\right|. \qquad (8)$$

Thus HPR inference leads to identical predictive distributions in f-space at the two locations even though the isolated observation y′(x_s) has disproportionately larger magnitude than y′(x_{d∗}), relative to the GPR measurements y(x_s) and y(x_{d∗}). As this statement holds for any y ∈ Y_gp satisfying our earlier sign requirement, it indicates that HPR systematically shrinks isolated observations more strongly than GPR. Since the second derivative of G_b^{-1}(Φ_{0,σ²}(·)) scales linearly with scale b > 0, an intuitive connection suggests itself when looking at inequality (8): the heavier the marginal tails, the stronger the inequality and thus the stronger the selective shrinkage effect.

The previous derivation exemplifies in an idealized setting that HPR leads to improved shrinkage of predictive distributions near isolated observations. More generally, because GPR transforms measurements only linearly, while HPR additionally pre-transforms measurements nonlinearly, our analysis suggests that for any GPR we can find an HPR model which leads to stronger selective shrinkage. The result has intuitive parallels to the parametric case: just as ℓ1-regularization improves shrinkage of parametric estimators, heavy-tailed processes improve shrinkage of nonparametric estimators.
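Inequality (8) can be illustrated numerically in the idealized setting of this section. The sketch below builds K = K̃ + ϵI, picks a y ∈ Y_gp whose dense entries sum to the sparse entry (so the GPR posterior means agree at all locations), maps it through G_b^{-1}(Φ(·)) with Laplace marginals, and compares the relative magnification of the isolated measurement against the largest clustered one. All numeric values are arbitrary choices for the demonstration:

```python
import numpy as np
from scipy.stats import norm, laplace

n, eps, b = 5, 1.0, 3.0
# Idealized kernel: dense block of ones, isolated location s at the last index.
Kt = np.ones((n, n))
Kt[:-1, -1] = Kt[-1, :-1] = 0.0
Kt[-1, -1] = eps / (eps + n - 2)
K = Kt + eps * np.eye(n)

# y in Y_gp: the dense measurements sum to the sparse one, which makes the
# GPR posterior means at all locations agree.
y = np.array([0.5, 0.7, 0.6, 0.4, 2.2])   # 2.2 = 0.5 + 0.7 + 0.6 + 0.4
mu = Kt @ np.linalg.solve(K, y)           # GPR posterior means

# Corresponding HPR measurement y' = G_b^{-1}(Phi(y)), Laplace marginals.
yp = laplace.ppf(norm.cdf(y), scale=b)
d_star, s = 1, n - 1                       # argmax_d |y_d| and the sparse index
ratio_sparse = abs(yp[s] / y[s])           # magnification at x_s
ratio_dense = abs(yp[d_star] / y[d_star])  # magnification at x_{d*}
```

The isolated measurement is magnified by a strictly larger factor, which is the content of inequality (8).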
We note that although our analysis kept K(X, X) fixed for GPR and HPR, in practice we are free to tune the kernel to yield a desired scale of predictive distributions. The above analysis has been carried out for regression, but motivates us to now explore heavy-tailed processes in the classification case.

5 Heavy-tailed process classification

The derivation of heavy-tailed process classification (HPC) is similar to that of standard multiclass GPC with Laplace approximation in Rasmussen and Williams [12]. However, due to the nonlinear transformations involved, some nice properties of their derivation are lost. We revert notation and let y denote a vector of class labels. For a C-class classification problem with n training points we introduce a vector of nC latent function measurements

$$(f_1^1, \ldots, f_n^1, f_1^2, \ldots, f_n^2, \ldots, f_1^C, \ldots, f_n^C)^\top.$$

For each block c ∈ {1, . . . , C} of n variables we define an independent heavy-tailed process prior using Eq. (4) with kernel matrix K_c. Equivalently, we can define the prior jointly on f by letting K be a block-diagonal kernel matrix with blocks K_1, . . . , K_C. Each kernel matrix K_c is defined by a (possibly different) symmetric positive semidefinite kernel with its own set of parameters. The following construction relaxes the earlier condition that diag(K) = σ²1 and instead views Φ_{0,σ²}(·) as some nonlinear transformation with parameter σ². By this relaxation we effectively adopt Liu et al.'s [9] interpretation that Eq. (4) defines the copula. The scale parameters b could in principle vary across the nC variables, but we keep them constant at least within each block of n. Labels y are represented in a 1-of-n form and generated by the following observation model

$$p(y_i^c = 1 \mid f_i) = \pi_i^c = \frac{\exp\{f_i^c\}}{\sum_{c'} \exp\{f_i^{c'}\}}. \qquad (9)$$

For inference we are ultimately interested in computing

$$p(y_*^c = 1 \mid X, y, x_*) = \mathbb{E}_{p(f_* \mid X, y, x_*)}\!\left[\frac{\exp\{f_*^c\}}{\sum_{c'} \exp\{f_*^{c'}\}}\right], \qquad (10)$$

where f∗ = (f∗¹, . . . , f∗^C)ᵀ. The previous section motivates that improved selective shrinkage will occur in p(f∗|X, y, x∗), provided the prior marginals have sufficiently heavy tails.

5.1 Inference

As in GPC, most of the intractability lies in computing the predictive distribution p(f∗|X, y, x∗). We use the Laplace approximation to address this issue: a Gaussian approximation to p(z|X, y) is found and then combined with the Gaussian p(z∗|X, z, x∗) to give us an approximation to p(z∗|X, y, x∗). This is then transformed to a (typically non-Gaussian) distribution in f-space using a change of variables. Hence we first seek to find a mode and corresponding Hessian matrix of the log posterior log p(z|X, y). Recalling the relation f = G_b^{-1}(Φ_{0,σ²}(z)), the log posterior can be written as

$$J(z) \triangleq \log p(y \mid z) + \log p(z) = y^\top f - \sum_i \log \sum_c \exp\{f_i^c\} - \frac{1}{2} z^\top K^{-1} z - \frac{1}{2} \log |K| + \text{const}.$$

Let Π be an nC × n matrix of stacked diagonal matrices diag(π^c) for n-subvectors π^c of π.
With W = diag(π) − ΠΠᵀ, the gradients are

$$\nabla J(z) = \mathrm{diag}\!\left(\frac{df}{dz}\right)(y - \pi) - K^{-1} z$$

$$\nabla^2 J(z) = \mathrm{diag}\!\left(\frac{d^2 f}{dz^2}\right)\mathrm{diag}(y - \pi) - \mathrm{diag}\!\left(\frac{df}{dz}\right) W \,\mathrm{diag}\!\left(\frac{df}{dz}\right) - K^{-1}.$$

Unlike in Rasmussen and Williams [12], −∇²J(z) is not generally positive definite owing to its first term. For that reason we cannot use a Newton step to find the mode and instead resort to a simpler gradient method. Once the mode ẑ has been found we approximate the posterior as

$$p(z \mid X, y) \approx q(z \mid X, y) = N\!\left(\hat{z},\, -\nabla^2 J(\hat{z})^{-1}\right),$$

and use this to approximate the predictive distribution by

$$q(z_* \mid X, y, x_*) = \int p(z_* \mid X, z, x_*)\, q(z \mid X, y)\, dz.$$

Since we arranged for both distributions in the integral to be Gaussian, the resulting Gaussian can be straightforwardly evaluated. Finally, to approximate the one-dimensional integral with respect to p(f∗|X, y, x∗) in Eq. (10) we could either use a quadrature method, or generate samples from q(z∗|X, y, x∗), convert them to f-space using G_b^{-1}(Φ_{0,σ²}(·)) and then approximate the expectation by an average. We have compared predictions of the latter method with those of a Gibbs sampler; the Laplace approximation matched Gibbs results well, while being much faster to compute.

Figure 2: (a) Schematic of a protein segment. The backbone is the sequence of C′, N, Cα, C′, N atoms.
An amino-acid-specific sidechain extends from the Cα atom at one of three discrete angles known as "rotamers," r ∈ {1, 2, 3}. (b) Ramachandran plot of 400 (Φ, Ψ) measurements and corresponding rotamers (by shapes/colors) for the amino-acid arginine (arg). The dark shading indicates the sparse region we considered in producing results in Figure 3. Progressively lighter shadings indicate how the sparse region was grown to produce Figure 4.

5.2 Parameter estimation

Using a derivation similar to that in [12], we have for f̂ = G_b^{-1}(Φ_{0,σ²}(ẑ)) that the Laplace approximation of the marginal log likelihood is

$$\log p(y \mid X) \approx \log q(y \mid X) = J(\hat{z}) - \frac{1}{2} \log \left|-\nabla^2 J(\hat{z})/(2\pi)\right|$$
$$= y^\top \hat{f} - \sum_i \log \sum_c \exp\{\hat{f}_i^c\} - \frac{1}{2} \hat{z}^\top K^{-1} \hat{z} - \frac{1}{2} \log |K| - \frac{1}{2} \log \left|-\nabla^2 J(\hat{z})\right| + \text{const}. \qquad (11)$$

We optimize kernel parameters θ by taking gradient steps on log q(y|X). The derivative needs to take into account that perturbing the parameters can also perturb the mode ẑ found for the Laplace approximation. At an optimum ∇J(ẑ) must be zero, so that

$$\hat{z} = K \,\mathrm{diag}\!\left(\frac{d\hat{f}}{d\hat{z}}\right)(y - \hat{\pi}), \qquad (12)$$

where π̂ is defined as in Eq. (9) but using f̂ rather than f. Taking derivatives of this equation allows us to compute the gradient dẑ/dθ.
Differentiating the marginal likelihood we have

$$\frac{d \log q(y \mid X)}{d\theta} = (y - \hat{\pi})^\top \mathrm{diag}\!\left(\frac{d\hat{f}}{d\hat{z}}\right) \frac{d\hat{z}}{d\theta} - \left(\frac{d\hat{z}}{d\theta}\right)^{\!\top} K^{-1} \hat{z} + \frac{1}{2}\, \hat{z}^\top K^{-1} \frac{dK}{d\theta} K^{-1} \hat{z} - \frac{1}{2} \mathrm{tr}\!\left(K^{-1} \frac{dK}{d\theta}\right) - \frac{1}{2} \mathrm{tr}\!\left(\nabla^2 J(\hat{z})^{-1} \frac{d \nabla^2 J(\hat{z})}{d\theta}\right).$$

The remaining gradient computations are straightforward, albeit tedious. In addition to optimizing the kernel parameters, it may also be of interest to optimize the scale parameter b of the marginals Gb. Again, differentiating Eq. (12) with respect to b allows us to compute dẑ/db. We note that when perturbing b we change f̂ by changing the underlying mode ẑ as well as by changing the parameter b which is used to compute f̂ from ẑ. Suppressing the detailed computations, the derivative of the marginal log likelihood with respect to b is

$$\frac{d \log q(y \mid X)}{db} = (y - \hat{\pi})^\top \frac{d\hat{f}}{db} - \left(\frac{d\hat{z}}{db}\right)^{\!\top} K^{-1} \hat{z} - \frac{1}{2} \mathrm{tr}\!\left(\nabla^2 J(\hat{z})^{-1} \frac{d \nabla^2 J(\hat{z})}{db}\right).$$

Figure 3: Rotamer prediction rates in percent in (a) sparse and (b) dense regions. Both flavors of HPC (hyperbolic secant and Laplace marginals) significantly outperform GPC in sparse regions while performing competitively in dense regions.

6 Experiments

To a first approximation, the three-dimensional structure of a folded protein is defined by pairs of continuous backbone angles (Φ, Ψ), one pair for each amino-acid, as well as discrete angles, so-called rotamers, that define the conformations of the amino-acid sidechains that extend from the backbone. The geometry is outlined in Figure 2(a).
There is a strong dependence between backbone angles (Φ, Ψ) and rotamer values; this is illustrated in the "Ramachandran plot" shown in Figure 2(b), which plots the backbone angles for each rotamer (indicated by the shapes/colors). The dependence is exploited in computational approaches to protein structure prediction, where estimates of rotamer probabilities given backbone angles are used as one term in an energy function that models native protein states as minima of the energy. Poor estimates of rotamer probabilities in sparse regions can derail the prediction procedure. Indeed, sparsity has been a serious problem in state-of-the-art rotamer models based on kernel density estimates (Roland Dunbrack, personal communication). Unfortunately, we have found that GPC is not immune to the sparsity problem.

To evaluate our algorithm we consider rotamer-prediction tasks on the 17 amino-acids (out of 20) that have three rotamers at the first dihedral angle along the sidechain.² Our previous work thus applies with the number of classes C = 3 and the covariates being (Φ, Ψ) angle pairs. Since the input space is a torus we defined GPC and HPC using the following von Mises-inspired kernel for d-dimensional angular data:

$$k(x_i, x_j) = \sigma^2 \exp\left\{\lambda \left(\sum_{k=1}^{d} \cos(x_{i,k} - x_{j,k}) - d\right)\right\},$$

where x_{i,k}, x_{j,k} ∈ [0, 2π] and σ², λ ≥ 0.³ To find good GPC kernel parameters we optimize an ℓ2-regularized version of the Laplace approximation to the log marginal likelihood reported in Eq. 3.44 of [12]. For HPC we let Gb be either the centered Laplace distribution or the hyperbolic secant distribution with scale parameter b. We estimate HPC kernel parameters as well as b by similarly maximizing an ℓ2-regularized form of Eq. (11).
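A direct implementation of this kernel is straightforward; the sketch below (with arbitrary σ², λ values and random angle pairs, not the paper's fitted parameters) also checks symmetry and positive semidefiniteness of the resulting Gram matrix numerically:

```python
import numpy as np

def von_mises_kernel(A, B, sigma2=1.0, lam=1.0):
    # k(x_i, x_j) = sigma^2 * exp{ lam * (sum_k cos(x_ik - x_jk) - d) }
    # for angular inputs in [0, 2*pi]^d; k(x, x) = sigma^2 on the diagonal.
    d = A.shape[1]
    s = np.cos(A[:, None, :] - B[None, :, :]).sum(axis=2)
    return sigma2 * np.exp(lam * (s - d))

rng = np.random.default_rng(1)
X = rng.uniform(0.0, 2 * np.pi, size=(30, 2))   # e.g. (Phi, Psi) angle pairs
K = von_mises_kernel(X, X, sigma2=2.0, lam=0.5)
min_eigenvalue = np.linalg.eigvalsh(K).min()
```

Positive semidefiniteness follows from the feature-map argument in footnote 3; the eigenvalue check here is only a numerical sanity test.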
In both cases we restricted the algorithms to training sets of only 100 datapoints. Since good regularization parameters for the objectives are not known a priori, we train and test with parameters on a grid for each of the 17 rotameric residues in ten-fold cross-validation. To find good regularization parameters for a particular residue we look up the combination which, averaged over the ten folds of the remaining 16 residues, produced the best test results. Having chosen the regularization constants we report average test results computed in ten-fold cross-validation.

²Residues alanine and glycine are non-discrete while proline has two rotamers at the first dihedral angle.
³The function cos(x_{i,k} − x_{j,k}) = [cos(x_{i,k}), sin(x_{i,k})][cos(x_{j,k}), sin(x_{j,k})]ᵀ is a symmetric positive semi-definite kernel. By Propositions 3.22 (i) and (ii) and Proposition 3.25 in Shawe-Taylor and Cristianini [14], so is k(x_i, x_j) above.

Figure 4: Average rotamer prediction rate in the sparse region for two flavors of HPC, standard GPC, as well as CTGP [1], as a function of the average number of points per residue in the sparse region.

We evaluate the algorithms on predefined sparse and dense regions in the Ramachandran plot, as indicated by the background shading in Figure 2(b). Across the 17 residues the sparse regions usually contained more than 70 measurements (and often more than 150), each of which appears in one of the 10 cross-validations. Figure 3 compares the label prediction rates on the dense and sparse regions. Averaged over all 17 residues HPC outperforms GPC by 5.79% with Laplace and 7.89% with hyperbolic secant marginals.
With Laplace marginals HPC underperforms GPC on only two\nresidues in sparse regions: by 8.22% on glutamine (gln), and by 2.53% on histidine (his). On\ndense regions HPC lies within 0.5% on 16 residues and only degrades once by 3.64% on his.\nUsing hyperbolic secant marginals HPC often improves GPC by more than 10% on sparse regions\nand degrades by more than 5% only on cysteine (cys) and his. On dense regions HPC usually\nperforms within 1.5% of GPC. In Figure 4 we show how the average rotamer prediction rate across\n17 residues changes for HPC, GPC, as well as CTGP [1] as we grow the sparse region to include\nmore measurements from dense regions. The growth of the sparse region is indicated by progres-\nsively lighter shadings in Figure 2(b). As more points are included the signi\ufb01cant advantage of HPC\nlessens. Eventually GPC does marginally better than HPC and much better than CTGP. The values\nreported in Figure 3 correspond to the dark shaded region, with an average of 155 measurements.\n\n7 Related research\n\nCopulas [10] allow convenient modelling of multivariate correlation structures as separate from\nmarginal distributions. Early work by Song [16] used the Gaussian copula to generate complex\nmultivariate distributions by complementing a simple copula form with marginal distributions of\nchoice. Popularity of the Gaussian copula in the \ufb01nancial literature is generally credited to Li [8]\nwho used it to model correlation structure for pairs of random variables with known marginals. More\nrecently, the Gaussian process has been modi\ufb01ed in a similar way to ours by Snelson et al. [15].\nThey demonstrate that posterior distributions can better approximate the true noise distribution if\nthe transformation de\ufb01ning the warped process is learned. 
Jaimungal and Ng [7] have extended this work to model multiple parallel time series with marginally non-Gaussian stochastic processes. Their work uses a "binding copula" to combine several subordinate copulas into a joint model. Bayesian approaches focusing on estimating the Gaussian copula covariance matrix for a given dataset are given in [4, 11]. Research has also focused on estimation in high-dimensional settings [9].

8 Conclusions

This paper analyzed learning scenarios where outliers are observed in the input space, rather than in the output space as commonly discussed in the literature. We presented heavy-tailed processes as a straightforward extension of GPs and an economical way to improve the robustness of estimators in sparse regions beyond that of GP-based methods. Importantly, because these processes are based on a GP, they inherit many of its favorable computational properties; predictive inference in regression, for instance, is straightforward. Moreover, because heavy-tailed processes have a parsimonious representation, they can be used as building blocks in more complicated models where currently GPs are used. In this way the benefits of heavy-tailed processes extend to any GP-based model that struggles with covariate shift.

Acknowledgements

We thank Roland Dunbrack for helpful discussions and for providing access to the rotamer datasets.

[Figure 4: average prediction rate (0.45–0.65) against the 'density of test data' (155 to 3906 points per residue) for HPC Hyp. sec., HPC Laplace, CTGP, and GPC.]

References

[1] Tamara Broderick and Robert B. Gramacy. Classification and Categorical Inputs with Treed Gaussian Process Models. Journal of Classification. To appear.
[2] Wei Chu and Zoubin Ghahramani. Gaussian Processes for Ordinal Regression. Journal of Machine Learning Research, 6:1019–1041, 2005.
[3] Doris Damian, Paul D. Sampson, and Peter Guttorp.
Bayesian Estimation of Semi-Parametric Non-Stationary Spatial Covariance Structures. Environmetrics, 12:161–178, 2001.
[4] Adrian Dobra and Alex Lenkoski. Copula Gaussian Graphical Models. Technical report, Department of Statistics, University of Washington, 2009.
[5] Paul W. Goldberg, Christopher K. I. Williams, and Christopher M. Bishop. Regression with Input-dependent Noise: A Gaussian Process Treatment. In Advances in Neural Information Processing Systems, volume 10, pages 493–499. MIT Press, 1998.
[6] Robert B. Gramacy and Herbert K. H. Lee. Bayesian Treed Gaussian Process Models with an Application to Computer Modeling. Journal of the American Statistical Association, 2007.
[7] Sebastian Jaimungal and Eddie K. Ng. Kernel-based Copula Processes. In Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases, pages 628–643. Springer-Verlag, 2009.
[8] David X. Li. On Default Correlation: A Copula Function Approach. Technical Report 99-07, RiskMetrics Group, New York, April 2000.
[9] Han Liu, John Lafferty, and Larry Wasserman. The Nonparanormal: Semiparametric Estimation of High Dimensional Undirected Graphs. Journal of Machine Learning Research, 10:1–37, 2009.
[10] Roger B. Nelsen. An Introduction to Copulas. Springer, 1999.
[11] Michael Pitt, David Chan, and Robert J. Kohn. Efficient Bayesian Inference for Gaussian Copula Regression Models. Biometrika, 93(3):537–554, 2006.
[12] Carl E. Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
[13] Alexandra M. Schmidt and Anthony O'Hagan. Bayesian Inference for Nonstationary Spatial Covariance Structure via Spatial Deformations. Journal of the Royal Statistical Society, Series B, 65(3):743–758, 2003.
[14] John Shawe-Taylor and Nello Cristianini. Kernel Methods for Pattern Analysis.
Cambridge University Press, 2004.
[15] Ed Snelson, Carl E. Rasmussen, and Zoubin Ghahramani. Warped Gaussian Processes. In Advances in Neural Information Processing Systems, volume 16, pages 337–344, 2004.
[16] Peter Xue-Kun Song. Multivariate Dispersion Models Generated From Gaussian Copula. Scandinavian Journal of Statistics, 27(2):305–320, 2000.