{"title": "Worst-Case Bounds for Gaussian Process Models", "book": "Advances in Neural Information Processing Systems", "page_first": 619, "page_last": 626, "abstract": "", "full_text": "Worst-Case Bounds for Gaussian Process Models\n\nSham M. Kakade\n\nUniversity of Pennsylvania\n\nMatthias W. Seeger\n\nUC Berkeley\n\nDean P. Foster\n\nUniversity of Pennsylvania\n\nAbstract\n\nWe present a competitive analysis of some non-parametric Bayesian al-\ngorithms in a worst-case online learning setting, where no probabilistic\nassumptions about the generation of the data are made. We consider\nmodels which use a Gaussian process prior (over the space of all func-\ntions) and provide bounds on the regret (under the log loss) for com-\nmonly used non-parametric Bayesian algorithms \u2014 including Gaussian\nregression and logistic regression \u2014 which show how these algorithms\ncan perform favorably under rather general conditions. These bounds ex-\nplicitly handle the in\ufb01nite dimensionality of these non-parametric classes\nin a natural way. We also make formal connections to the minimax and\nminimum description length (MDL) framework. Here, we show precisely\nhow Bayesian Gaussian regression is a minimax strategy.\n\nIntroduction\n\n1\nWe study an online (sequential) prediction setting in which, at each timestep, the learner is\ngiven some input from the set X , and the learner must predict the output variable from the\nset Y. The sequence {(xt, yt)| t = 1, . . . , T} is chosen by Nature (or by an adversary), and\nimportantly, we do not make any statistical assumptions about its source: our statements\nhold for all sequences. 
Our goal is to sequentially code the next label yt, given that we have observed x≤t and y<t.

for all f ∈ H,

    −log Pbayes(y≤T|x≤T) ≤ −log P(y≤T|x≤T, f(·)) + (1/2)||f||_K^2 + (1/2) log|I + cK|

where c > 0 is a constant such that for all yt ∈ y≤T,

    −(d²/du²) log P(yt|u) ≤ c    for all u ∈ R.

The proof of this theorem parallels that provided by Kakade and Ng [2004], with a number of added complexities for handling GP priors. For the special case of Gaussian regression where c = σ⁻², the following theorem shows the stronger result that the bound is satisfied with an equality for all sequences.

Theorem 3.2: Assume P(yt|u(xt)) = N(yt|u(xt), σ²) and that Y = R. Let (x≤T, y≤T) be a sequence from (X × Y)^T. Then

    −log Pbayes(y≤T|x≤T) = min_{f∈H} { −log P(y≤T|x≤T, f(·)) + (1/2)||f||_K^2 } + (1/2) log|I + σ⁻²K|    (2)

and the minimum is attained for a kernel expansion over x≤T.

This equality has important implications in our minimax theory (in Corollary 4.4, we make this precise). It is not hard to see that the equality does not hold for other likelihoods.

3.1 Interpretation

The regret bound depends on two terms, ||f||_K^2 and log|I + cK|. We discuss each in turn.
The dependence on ||f||_K^2 states the intuitive fact that a meaningful bound can only be obtained under smoothness assumptions on the set of experts. The more complicated f is (as measured by ||·||_K), the higher the regret may be. The equality in Theorem 3.2 shows that this dependence is unavoidable. We come back to this dependence in Section 4.
Let us now interpret the log|I + cK| term, which we refer to as the regret term. The constant c, which bounds the curvature of the likelihood, exists for most commonly used exponential family likelihoods. For logistic regression, we have c = 1/4, and for Gaussian regression, we have c = σ⁻².
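As a quick numerical sanity check (not taken from the paper's text), the logistic value c = 1/4 can be verified directly: for P(y = 1|u) = σ(u), the negated second derivative of the log likelihood is σ(u)(1 − σ(u)), which never exceeds 1/4.

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def neg_log_lik_curvature(u):
    # closed form of -d^2/du^2 log sigma(u); the same expression
    # arises for the label y = 0, so it bounds both cases
    s = sigmoid(u)
    return s * (1.0 - s)

grid = [i / 10.0 for i in range(-100, 101)]
print(max(neg_log_lik_curvature(u) for u in grid) <= 0.25)  # True
```

The maximum 1/4 is attained at u = 0, where σ(u) = 1/2.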
Also, interestingly, while f is an arbitrary function in H, this regret term depends on K only at the sequence points x≤T.
For most infinite-dimensional kernels and without strong restrictions on the inputs, the regret term can be as large as Ω(T) — the sequence can be chosen such that K ≈ c₀I, which implies that log|I + cK| ≈ T log(1 + cc₀). For example, for an isotropic kernel (which is a function of the norm ||x − x'||₂) we can choose the xt to be mutually far from each other. For kernels which barely enforce smoothness — e.g. the Ornstein-Uhlenbeck kernel exp(−b||x − x'||₁) — the regret term can easily be Ω(T). The cases we are interested in are those where the regret term is o(T), in which case the average regret tends to 0 with time.
A spectral interpretation of this term helps us understand the behavior. If we let λ1, λ2, ..., λT be the eigenvalues of K, then

    log|I + cK| = Σ_{t=1}^T log(1 + cλt) ≤ c tr K

where tr K is the trace of K. This last quantity is closely related to the "degrees of freedom" in a system (see Hastie et al. [2001]). Clearly, if the sum of the eigenvalues has a sublinear growth rate of o(T), then the average regret tends to 0. Also, if one assumes that the input sequence x≤T is i.i.d., then the above eigenvalues are essentially the process eigenvalues. In a forthcoming longer version, we explore this spectral interpretation in more detail and provide a case using the exponential kernel in which the regret grows as O(poly(log T)). We now review the parametric case.

3.2 The Parametric Case

Here we obtain a slight generalization of the result in Kakade and Ng [2004] as a special case.
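The determinant identity behind this reduction — log|I_T + cK| = log|I_d + cX^T X| when K is the linear kernel (Sylvester's determinant identity) — can be checked numerically. The data below are hypothetical, chosen only to exercise the identity with T = 3 points in R^2.

```python
import math

def det(m):
    """Determinant by Laplace expansion (fine for tiny matrices)."""
    n = len(m)
    if n == 1:
        return m[0][0]
    return sum((-1) ** j * m[0][j] *
               det([row[:j] + row[j + 1:] for row in m[1:]])
               for j in range(n))

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def add_identity(m, c):
    """Return I + c * m."""
    return [[(1.0 if i == j else 0.0) + c * m[i][j]
             for j in range(len(m))] for i in range(len(m))]

X = [[1.0, 0.5], [0.2, -1.0], [0.3, 0.4]]   # T = 3 inputs in R^2
Xt = [list(col) for col in zip(*X)]
K = matmul(X, Xt)                            # linear kernel matrix, T x T
G = matmul(Xt, X)                            # X^T X, d x d
c = 4.0
lhs = math.log(det(add_identity(K, c)))      # T x T regret term
rhs = math.log(det(add_identity(G, c)))      # equivalent d x d quantity
print(abs(lhs - rhs) < 1e-9)  # True
```

For d much smaller than T, evaluating the d x d side is the reason the parametric regret term stays O(log T).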
Namely, the familiar linear model — with u(x) = θ·x, θ, x ∈ R^d, and Gaussian prior θ ∼ N(0, I) — can be seen as a GP model with the linear kernel K(x, x') = x·x'. With X = (x1, ..., xT)^T we have that a kernel expansion f(x) = Σ_i αi xi·x = θ·x with θ = X^T α, and ||f||_K^2 = α^T X X^T α = ||θ||₂², so that H = {θ·x | θ ∈ R^d}, and so

    log|I + cK| = log|I + cX^T X|.

Therefore, our result gives an input-dependent version of the result of Kakade and Ng [2004]. If we make the further assumption that ||x||₂ ≤ 1 (as done in Kakade and Ng [2004]), then we can obtain exactly their regret term:

    log|I + cK| ≤ d log(1 + cT/d)

which can be seen by rotating K into a diagonal matrix and maximizing the expression subject to the constraint that ||x||₂ ≤ 1 (i.e. that the eigenvalues must sum to at most T).
In general, this example shows that if K is a finite-dimensional kernel, such as the linear or the polynomial kernel, then the regret term is only O(log T).

4 Relationships to Minimax Procedures and MDL

This section builds the framework for understanding the minimax property of Gaussian regression. We start by reviewing Shtarkov's theorem, which shows that a certain normalized maximum likelihood density is the minimax strategy (when using the log loss). In many cases, this minimax strategy does not exist — in those cases where the minimax regret is infinite. We then propose a different, penalized notion of regret, and show that a certain normalized maximum a posteriori density is the minimax strategy here. Our main result (Corollary 4.4) shows that for Gaussian regression the Bayesian strategy is precisely this minimax strategy.

4.1 Normalized Maximum Likelihood

Here, let us assume that there are no inputs — sequences consist of only yt ∈ Y.
Given a measurable space with base measure μ, we employ a countable number of random variables yt in Y. Fix the sequence length T and define the model class F = {Q(·|θ) | θ ∈ Θ}, where Q(·|θ) denotes a joint probability density over Y^T with respect to μ.
We assume that for our model class there exists a parameter, θml(y≤T), maximizing the likelihood Q(y≤T|θ) over Θ for all y≤T ∈ Y^T. We make this assumption to make the connections to maximum likelihood (and, later, MAP) estimation clear. Define the regret of a joint density P on y≤T as:

    R(y≤T, P, Θ) = −log P(y≤T) − inf_{θ∈Θ} {−log Q(y≤T|θ)}    (3)
                 = −log P(y≤T) + log Q(y≤T|θml(y≤T)),    (4)

where the latter step uses our assumption on the existence of θml(y≤T).
Define the minimax regret with respect to Θ as:

    R(Θ) = inf_P sup_{y≤T ∈ Y^T} R(y≤T, P, Θ)

where the inf is over all probability densities on Y^T.
The following theorem due to Shtarkov [1987] characterizes the minimax strategy.

Theorem 4.1: [Shtarkov, 1987] If the following density exists (i.e. if it has a finite normalization constant), then define it to be the normalized maximum likelihood (NML) density:

    Pml(y≤T) = Q(y≤T|θml(y≤T)) / ∫ Q(y≤T|θml(y≤T)) dμ(y≤T).    (5)

If Pml exists, it is a minimax strategy, i.e. for all y≤T, the regret R(y≤T, Pml, Θ) does not exceed R(Θ).

Note that this density exists only if the normalizing constant is finite, which is not the case in general.
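For a small finite class the normalizer is finite and the NML construction is easy to evaluate exhaustively. The sketch below (a Bernoulli family; this example is illustrative and not from the paper) computes the Shtarkov normalizer and checks that the NML density has the same regret, log Z, on every sequence.

```python
import math
from itertools import product

def max_lik(seq):
    """Q(seq | theta_ml(seq)) for an i.i.d. Bernoulli(theta) class,
    where theta_ml is the empirical frequency of ones."""
    T, k = len(seq), sum(seq)
    p = k / T
    return (p ** k) * ((1 - p) ** (T - k))  # 0**0 == 1 handles p in {0, 1}

T = 3
seqs = list(product([0, 1], repeat=T))
Z = sum(max_lik(s) for s in seqs)            # Shtarkov normalizer
# regret of NML on each sequence: -log(max_lik/Z) - (-log max_lik) = log Z
regrets = [-math.log(max_lik(s) / Z) + math.log(max_lik(s)) for s in seqs]
print(all(abs(r - math.log(Z)) < 1e-12 for r in regrets))  # True
```

This constant-regret ("equalizer") behavior is exactly the property used in the proof that follows.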
The proof is straightforward using the fact that the NML density is an equalizer — meaning that it has constant regret on all sequences.
Proof: First note that the regret R(y≤T, Pml, Θ) is the constant log ∫ Q(y≤T|θml(y≤T)) dμ(y≤T). To see this, simply substitute Eq. 5 into Eq. 4 and simplify.
For convenience, define the regret of any P as R(P, Θ) = sup_{y≤T ∈ Y^T} R(y≤T, P, Θ). For any P ≠ Pml (differing on a set with positive measure), there exists some y≤T such that P(y≤T) < Pml(y≤T), since the densities are normalized. This implies that

    R(P, Θ) ≥ R(y≤T, P, Θ) > R(y≤T, Pml, Θ) = R(Pml, Θ)

where the first step follows from the definition of R(P, Θ), the second from −log P(y≤T) > −log Pml(y≤T), and the last from the fact that Pml is an equalizer (its regret is constant on all sequences). Hence, P has a strictly larger regret, implying that Pml is the unique minimax strategy. □
Unfortunately, in many important model classes, the minimax regret R(Θ) is not finite, and the NML density does not exist. We now provide one example (see Grunwald [2005] for further discussion).

Example 4.2: Consider a model which assumes the sequence is generated i.i.d. from a Gaussian with unknown mean and unit variance. Specifically, let Θ = R, Y = R, and P(y≤T|θ) be the product Π_{t=1}^T N(yt; θ, 1). It is easy to see that for this class the minimax regret is infinite and Pml does not exist (see Grunwald [2005]). This example can be generalized to the Gaussian regression model (if we know the sequence x≤T in advance). For this problem, if one modifies the space of allowable sequences (i.e. Y^T is modified), then one can obtain finite regret, such as those in Barron et al. [1998], Foster and Stine [2001].
This technique may not be appropriate in general.

4.2 Normalized Maximum a Posteriori

To remedy this problem, consider placing some structure on the model class F = {Q(·|θ) | θ ∈ Θ}. The idea is to penalize Q(·|θ) ∈ F based on this structure. The motivation is similar to that of structural risk minimization [Vapnik, 1998]. Assume that Θ is a measurable space and place a prior distribution with density function q on Θ. Define the penalized regret of P on y≤T as:

    Rq(y≤T, P, Θ) = −log P(y≤T) − inf_{θ∈Θ} {−log Q(y≤T|θ) − log q(θ)}.

Note that −log Q(y≤T|θ) − log q(θ) can be viewed as a "two part" code, in which we first code θ under the prior q and then code y≤T under the likelihood Q(·|θ). Unlike the standard regret, the penalized regret can be viewed as a comparison to an actual code. These two part codes are common in the MDL literature (see Grunwald [2005]). However, in MDL, one considers minimax schemes (via Pml) for the likelihood part of the code, while we consider minimax schemes for this penalized regret.
Again, for clarity, assume there exists a parameter, θmap(y≤T), maximizing log Q(y≤T|θ) + log q(θ). Notice that this is just the maximum a posteriori (MAP) parameter, if one were to use a Bayesian strategy with the prior q (since the posterior density would be proportional to Q(y≤T|θ)q(θ)). Here,

    Rq(y≤T, P, Θ) = −log P(y≤T) + log Q(y≤T|θmap(y≤T)) + log q(θmap(y≤T)).

Similarly, with respect to Θ, define the minimax penalized regret as:

    Rq(Θ) = inf_P sup_{y≤T ∈ Y^T} Rq(y≤T, P, Θ)

where again the inf is over all densities on Y^T.
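For a finite Θ these quantities can be evaluated exhaustively. The sketch below (a hypothetical three-element Bernoulli family with a prior q, not an example from the paper) computes the worst-case penalized regret of the Bayes mixture: it is non-positive, since the mixture Σ_θ q(θ)Q(y|θ) always dominates any single term q(θ)Q(y|θ).

```python
import math
from itertools import product

thetas = [0.2, 0.5, 0.8]   # hypothetical Bernoulli parameters (Theta)
q = [0.5, 0.3, 0.2]        # prior on Theta

def lik(seq, th):
    """Q(seq | theta) for i.i.d. Bernoulli(theta)."""
    return math.prod(th if y == 1 else 1 - th for y in seq)

T = 4
# penalized regret of the Bayes mixture on sequence s:
#   -log P_bayes(s) + max_theta [log Q(s|theta) + log q(theta)]
worst = max(
    -math.log(sum(qi * lik(s, th) for qi, th in zip(q, thetas)))
    + max(math.log(lik(s, th) * qi) for qi, th in zip(q, thetas))
    for s in product([0, 1], repeat=T)
)
print(worst < 0.0)  # True: a sum of positive terms exceeds its maximum
```

This makes concrete why the Bayes procedure compares favorably against two-part codes, even before asking whether it attains the minimax penalized regret.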
If \u0398 is \ufb01nite or countable and Q(\u00b7|\u03b8) > 0\nfor all \u03b8, then the Bayes procedure has the desirable property of having penalized regret\nwhich is non-positive.2 However, in general, the Bayes procedure does not achieve the\nminimax penalized regret, Rq(\u0398), which is what we desire \u2014 though, for one case, we\nshow that it does (in the next section).\nWe now characterize this minimax strategy in general.\n\nTheorem 4.3: De\ufb01ne the normalized maximum a posteriori (NMAP) density, if it exists, as:\n\nPmap(y\u2264T ) =\n\nR Q(y\u2264T|\u03b8map(y\u2264T ))q(\u03b8map(y\u2264T )) d\u00b5(y\u2264T ) .\n\nQ(y\u2264T|\u03b8map(y\u2264T ))q(\u03b8map(y\u2264T ))\n\n(6)\n\nIf Pmap exists, it is a minimax strategy for the penalized regret, i.e. for all y\u2264T , the penalized\nregret Rq(y\u2264T , Pmap, \u0398) does not exceed Rq(\u0398).\n\nThe proof relies on Pmap being an equalizer for the penalized regret and is identical to that\nof Theorem 4.1 \u2014 just replace all quantities with their penalized equivalents.\n\n4.3 Bayesian Gaussian Regression as a Minimax Procedure\n\nWe now return to the setting with inputs and show how the Bayesian strategy for the Gaus-\nsian regression model is a minimax strategy for all input sequences x\u2264T . If we \ufb01x the input\nsequence x\u2264T , we can consider the competitor class to be F = {P (y\u2264T|x\u2264T , \u03b8)| \u03b8 \u2208\n\u0398)}. In other words, we make the more stringent comparison against a model class which\nhas full knowledge of the input sequence in advance. Importantly, note that the learner only\nobserves the past inputs x