{"title": "Online Bounds for Bayesian Algorithms", "book": "Advances in Neural Information Processing Systems", "page_first": 641, "page_last": 648, "abstract": null, "full_text": "Online Bounds for Bayesian Algorithms\n\nSham M. Kakade\nComputer and Information Science Department\nUniversity of Pennsylvania\n\nAndrew Y. Ng\nComputer Science Department\nStanford University\n\nAbstract\n\nWe present a competitive analysis of Bayesian learning algorithms in the online learning setting and show that many simple Bayesian algorithms (such as Gaussian linear regression and Bayesian logistic regression) perform favorably when compared, in retrospect, to the single best model in the model class. The analysis does not assume that the Bayesian algorithms’ modeling assumptions are “correct,” and our bounds hold even if the data is adversarially chosen. For Gaussian linear regression (using logloss), our error bounds are comparable to the best bounds in the online learning literature, and we also provide a lower bound showing that Gaussian linear regression is optimal in a certain worst-case sense. We also give bounds for some widely used maximum a posteriori (MAP) estimation algorithms, including regularized logistic regression.\n\n1 Introduction\n\nThe last decade has seen significant progress in online learning algorithms that perform well even in adversarial settings (e.g. the “expert” algorithms of Cesa-Bianchi et al. (1997)). In the online learning framework, one makes minimal assumptions on the data presented to the learner, and the goal is to obtain good (relative) performance on arbitrary sequences. In statistics, this philosophy has been espoused by Dawid (1984) in the prequential approach. We study the performance of Bayesian algorithms in this adversarial setting, in which the process generating the data is not restricted to come from the prior: data sequences may be arbitrary. 
Our motivation is similar to that given in the online learning literature and the MDL literature (see Grunwald, 2005): namely, that models are often chosen to balance realism with computational tractability, and often assumptions made by the Bayesian are not truly believed to hold (e.g. i.i.d. assumptions). Our goal is to study the performance of Bayesian algorithms in the worst case, where all modeling assumptions may be violated.\n\nWe consider the widely used class of generalized linear models, focusing on Gaussian linear regression and logistic regression, and provide relative performance bounds (comparing to the best model in our model class) when the cost function is the logloss. Though the regression problem has been studied in a competitive framework and, indeed, many ingenious algorithms have been devised for it (e.g., Foster, 1991; Vovk, 2001; Azoury and Warmuth, 2001), our goal here is to study how the more widely used, and often simpler, Bayesian algorithms fare. Our bounds for linear regression are comparable to the best bounds in the literature (though we use the logloss as opposed to the square loss).\n\nThe competitive approach to regression started with Foster (1991), who provided competitive bounds for a variant of the ridge regression algorithm (under the square loss). Vovk (2001) presents many competitive algorithms and provides bounds for linear regression (under the square loss) with an algorithm that differs slightly from the Bayesian one. Azoury and Warmuth (2001) rederive Vovk’s bound with a different analysis based on Bregman distances. Our work differs from these in that we consider Bayesian Gaussian linear regression, while previous work typically used more complex, cleverly devised algorithms that are either variants of a MAP procedure (as in Vovk, 2001) or involve other steps such as “clipping” predictions (as in Azoury and Warmuth, 2001). 
These distinctions are discussed in more detail in Section 3.1.\n\nWe should also note that when the loss function is the logloss, multiplicative weights algorithms are sometimes identical to Bayes rule with particular choices of the parameters (see Freund and Schapire, 1999). Furthermore, Bayesian algorithms have been used in some online learning settings, such as the sleeping experts setting of Freund et al. (1997) and the online boolean prediction setting of Cesa-Bianchi et al. (1998). Ng and Jordan (2001) also analyzed an online Bayesian algorithm but assumed that the data generation process was not too different from the model prior. To our knowledge, there have been no studies of Bayesian generalized linear models in an adversarial online learning setting (though many variants have been considered, as discussed above).\n\nWe also examine maximum a posteriori (MAP) algorithms for both Gaussian linear regression (i.e., ridge regression) and (regularized) logistic regression. These algorithms are often used in practice, particularly in logistic regression, where Bayesian model averaging is computationally expensive but the MAP algorithm requires only solving a convex problem. As expected, MAP algorithms are somewhat less competitive than full Bayesian model averaging, though not unreasonably so.\n\n2 Bayesian Model Averaging\n\nWe now consider the Bayesian model averaging (BMA) algorithm and give a bound on its worst-case online loss. We start with some preliminaries. Let x ∈ Rn denote the inputs of a learning problem and y ∈ R the outputs. Consider a model from the generalized linear model family (see McCullagh and Nelder, 1989) that can be written p(y|x, θ) = p(y|θ^T x), where θ ∈ Rn are the parameters of our model (θ^T denotes the transpose of θ). Note that the predicted distribution of y depends only on θ^T x, which is linear in θ. 
For example, in the case of Gaussian linear regression, we have\n\np(y|x, θ) = (1/√(2πσ^2)) exp(−(θ^T x − y)^2 / (2σ^2)),   (1)\n\nwhere σ^2 is a fixed, known constant that is not a parameter of our model. In logistic regression, we would have\n\nlog p(y|x, θ) = y log(1/(1 + exp(−θ^T x))) + (1 − y) log(1 − 1/(1 + exp(−θ^T x))),   (2)\n\nwhere we assume y ∈ {0, 1}.\n\nLet S = {(x(1), y(1)), (x(2), y(2)), . . . , (x(T), y(T))} be an arbitrary sequence of examples, possibly chosen by an adversary. We also use St to denote the subsequence consisting of only the first t examples. We assume throughout this paper that ||x(t)|| ≤ 1 (where || · || denotes the L2 norm).\n\nAssume that we are going to use a Bayesian algorithm to make our online predictions. Specifically, assume that we have a Gaussian prior on the parameters:\n\np(θ) = N(θ; 0, ν^2 In),\n\nwhere In is the n-by-n identity matrix, N(·; μ, Σ) is the Gaussian density with mean μ and covariance Σ, and ν^2 > 0 is some fixed constant governing the prior variance. Also, let\n\np_t(θ) = p(θ|St) = (∏_{i=1}^t p(y(i)|x(i), θ)) p(θ) / ∫_θ (∏_{i=1}^t p(y(i)|x(i), θ)) p(θ) dθ\n\nbe the posterior distribution over θ given the first t training examples. We have that p_0(θ) = p(θ) is just the prior distribution.\n\nOn iteration t, we are given the input x(t), and our algorithm makes a prediction using the posterior distribution over the outputs:\n\np(y|x(t), St−1) = ∫_θ p(y|x(t), θ) p(θ|St−1) dθ.\n\nWe are then given the true label y(t), and we suffer logloss − log p(y(t)|x(t), St−1). 
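The predict-then-update loop just described has a simple closed form for the Gaussian linear regression model. The following is a minimal sketch (our own illustration, not code from the paper; the function name and the use of NumPy are assumptions): the posterior stays Gaussian, the predictive distribution at round t is N(y; μ_{t−1}^T x(t), x(t)^T Σ_{t−1} x(t) + σ^2), and the logloss is accumulated as it goes.

```python
import numpy as np

def bma_gaussian_online(X, y, sigma2=1.0, nu2=1.0):
    """Online Bayesian model averaging (BMA) for Gaussian linear regression.

    Keeps the Gaussian posterior N(theta; mu_t, Sigma_t) in closed form.
    At each round it predicts with the posterior predictive
    N(y; mu^T x, x^T Sigma x + sigma2), accumulates the logloss
    -log p(y_t | x_t, S_{t-1}), and then performs the conjugate update.
    """
    n = X.shape[1]
    A = np.eye(n) / nu2          # posterior precision, initialized at the prior
    b = np.zeros(n)              # posterior mean is A^{-1} b
    loss = 0.0
    for x, yt in zip(X, y):
        Sigma = np.linalg.inv(A)
        mean = (Sigma @ b) @ x                 # predictive mean
        var = x @ Sigma @ x + sigma2           # predictive variance s_t^2
        loss += 0.5 * np.log(2 * np.pi * var) + (yt - mean) ** 2 / (2 * var)
        A += np.outer(x, x) / sigma2           # conjugate posterior update
        b += yt * x / sigma2
    return loss, np.linalg.solve(A, b)         # cumulative logloss, final mean
```

Because the updates are conjugate, each round costs one n-by-n inversion, and the accumulated `loss` is the cumulative logloss of BMA on the given sequence.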
We define the cumulative loss of the BMA algorithm after T rounds to be\n\nL_BMA(S) = ∑_{t=1}^T − log p(y(t)|x(t), St−1).\n\nImportantly, note that even though the algorithm we consider is a Bayesian one, our theoretical results do not assume that the data comes from any particular probabilistic model. In particular, the data may be chosen by an adversary.\n\nWe are interested in comparing against the loss of any “expert” that uses some fixed parameters θ ∈ Rn. Define ℓ_θ(t) = − log p(y(t)|x(t), θ), and let\n\nL_θ(S) = ∑_{t=1}^T ℓ_θ(t) = ∑_{t=1}^T − log p(y(t)|x(t), θ).\n\nSometimes, we also wish to compare against distributions over experts. Given a distribution Q over θ, define ℓ_Q(t) = ∫_θ −Q(θ) log p(y(t)|x(t), θ) dθ, and\n\nL_Q(S) = ∑_{t=1}^T ℓ_Q(t) = ∫_θ Q(θ) L_θ(S) dθ.\n\nThis is the expected logloss incurred by a procedure that first samples some θ ∼ Q and then uses this θ for all its predictions. Here, the expectation is over the random θ, not over the sequence of examples. Note that the expectation is of the logloss, which is a different type of averaging than in BMA, which had the expectation and the log in the reverse order.\n\n2.1 A Useful Variational Bound\n\nThe following lemma provides a worst-case bound on the loss incurred by Bayesian algorithms and will be useful for deriving our main result in the next section. A result very similar to this (for finite model classes) is given by Freund et al. (1997). For completeness, we prove the result here in its full generality, though our proof is similar to theirs. As usual, define KL(q||p) = ∫_θ q(θ) log (q(θ)/p(θ)) dθ.\n\nLemma 2.1: Let Q be any distribution over θ. 
Then for all sequences S,\n\nL_BMA(S) ≤ L_Q(S) + KL(Q||p_0).\n\nProof: Let Y = {y(1), . . . , y(T)} and X = {x(1), . . . , x(T)}. The chain rule of conditional probabilities implies that L_BMA(S) = − log p(Y|X) and L_θ(S) = − log p(Y|X, θ). So\n\nL_BMA(S) − L_Q(S) = − log p(Y|X) + ∫_θ Q(θ) log p(Y|X, θ) dθ = ∫_θ Q(θ) log (p(Y|X, θ)/p(Y|X)) dθ.\n\nBy Bayes rule, we have that p_T(θ) = p(Y|X, θ) p_0(θ) / p(Y|X). Continuing,\n\n∫_θ Q(θ) log (p(Y|X, θ)/p(Y|X)) dθ = ∫_θ Q(θ) log (p_T(θ)/p_0(θ)) dθ\n= ∫_θ Q(θ) log (Q(θ)/p_0(θ)) dθ − ∫_θ Q(θ) log (Q(θ)/p_T(θ)) dθ\n= KL(Q||p_0) − KL(Q||p_T).\n\nTogether with the fact that KL(Q||p_T) ≥ 0, this proves the lemma. □\n\n2.2 An Upper Bound for Generalized Linear Models\n\nFor the theorem that we shortly present, we need one new definition. Let f_y(z) = − log p(y|θ^T x = z). Thus, f_{y(t)}(θ^T x(t)) = ℓ_θ(t). Note that for linear regression (as defined in Equation 1), we have that for all y\n\n|f''_y(z)| = 1/σ^2,   (3)\n\nand for logistic regression (as defined in Equation 2), we have that for y ∈ {0, 1}\n\n|f''_y(z)| ≤ 1.\n\nTheorem 2.2: Suppose f_y(z) is continuously differentiable. Let S be a sequence such that ||x(t)|| ≤ 1 and such that for some constant c, |f''_{y(t)}(z)| ≤ c (for all z). Then for all θ*,\n\nL_BMA(S) ≤ L_{θ*}(S) + (1/(2ν^2)) ||θ*||^2 + (n/2) log(1 + T c ν^2 / n).   (4)\n\nThe ||θ*||^2/(2ν^2) term can be interpreted as a penalty term from our prior. The log term is how fast our loss could grow in comparison to the best θ*. 
Importantly, this extra loss is only logarithmic in T in this adversarial setting.\n\nThis bound is almost identical to those provided by Vovk (2001), Azoury and Warmuth (2001), and Foster (1991) for the linear regression case (under the square loss); the only difference is that in their bounds, the last term is multiplied by an upper bound on y(t). In contrast, we require no bound on y(t) in the Gaussian linear regression case, due to the fact that we deal with the logloss (also recall |f''_y(z)| = 1/σ^2 for all y).\n\nProof: We use Lemma 2.1 with Q(θ) = N(θ; θ*, ε^2 In) being a normal distribution with mean θ* and covariance ε^2 In. Here, ε^2 is a variational parameter that we later tune to get the tightest possible bound. Letting H(Q) = (n/2) log(2πe ε^2) be the entropy of Q, we have\n\nKL(Q||p_0) = ∫_θ Q(θ) log [ (2π)^{−n/2} |ν^2 In|^{−1/2} exp(−θ^T θ / (2ν^2)) ]^{−1} dθ − H(Q)\n= n log ν + (1/(2ν^2)) ∫_θ Q(θ) θ^T θ dθ − n/2 − n log ε\n= n log ν + (1/(2ν^2)) (||θ*||^2 + n ε^2) − n/2 − n log ε.   (5)\n\nTo prove the result, we also need to relate the error of L_Q to that of L_{θ*}. By taking a Taylor expansion of f_y (assume y ∈ S), we have that\n\nf_y(z) = f_y(z*) + f'_y(z*)(z − z*) + f''_y(ξ(z)) (z − z*)^2 / 2\n\nfor some appropriate function ξ. 
Thus, if z is a random variable with mean z*, we have\n\nE_z[f_y(z)] = f_y(z*) + f'_y(z*) · 0 + E_z[ f''_y(ξ(z)) (z − z*)^2 / 2 ]\n≤ f_y(z*) + c E_z[ (z − z*)^2 / 2 ]\n= f_y(z*) + (c/2) Var(z).\n\nConsider a single example (x, y). We can apply the argument above with z* = θ*^T x and z = θ^T x, where θ ∼ Q. Note that E[z] = z* since Q has mean θ*. Also, Var(θ^T x) = x^T (ε^2 In) x = ||x||^2 ε^2 ≤ ε^2 (because we previously assumed that ||x|| ≤ 1). Thus, we have\n\nE_{θ∼Q}[f_y(θ^T x)] ≤ f_y(θ*^T x) + c ε^2 / 2.\n\nSince ℓ_Q(t) = E_{θ∼Q}[f_{y(t)}(θ^T x(t))] and ℓ_{θ*}(t) = f_{y(t)}(θ*^T x(t)), we can sum both sides from t = 1 to T to obtain\n\nL_Q(S) ≤ L_{θ*}(S) + T c ε^2 / 2.\n\nPutting this together with Lemma 2.1 and Equation 5, we find that\n\nL_BMA(S) ≤ L_{θ*}(S) + T c ε^2 / 2 + n log(ν/ε) + (1/(2ν^2)) (||θ*||^2 + n ε^2) − n/2.\n\nFinally, by choosing ε^2 = nν^2 / (n + T c ν^2) and simplifying, Theorem 2.2 follows. □\n\n2.3 A Lower Bound for Gaussian Linear Regression\n\nThe following lower bound shows that, for linear regression, no other prediction scheme is better than Bayes in the worst case (when our penalty term is ||θ*||^2). Here, we compare to an arbitrary predictive distribution q(y|x(t), St−1) for prediction at time t, which suffers an instant loss ℓ_q(t) = − log q(y(t)|x(t), St−1). 
In the theorem, ⌊·⌋ denotes the floor function.\n\nTheorem 2.3: Let L_{θ*}(S) be the loss under the Gaussian linear regression model using the parameter θ*, and let ν^2 = σ^2 = 1. For any set of predictive distributions q(y|x(t), St−1), there exists an S with ||x(t)|| ≤ 1 such that\n\n∑_{t=1}^T ℓ_q(t) ≥ inf_{θ*} ( L_{θ*}(S) + (1/2) ||θ*||^2 ) + (n/2) log(1 + ⌊T/n⌋).\n\nProof: (sketch) If n = 1 and if S is such that x(t) = 1, one can show the equality\n\nL_BMA(S) = inf_{θ*} ( L_{θ*}(S) + (1/2) ||θ*||^2 ) + (1/2) log(1 + T).\n\nLet Y = {y(1), . . . , y(T)} and X = {1, . . . , 1}. By the chain rule of conditional probabilities, L_BMA(S) = − log p(Y|X) (where p is the Gaussian linear regression model), and q’s loss is ∑_{t=1}^T ℓ_q(t) = − log q(Y|X). For any predictive distribution q that differs from p, there must exist some sequence S such that − log q(Y|X) is greater than − log p(Y|X) (since probabilities are normalized). Such a sequence proves the result for n = 1.\n\nThe modification for n dimensions follows: S is broken into ⌊T/n⌋ subsequences, where in every subsequence only one dimension has x(t)_k = 1 (and the other dimensions are set to 0). The result follows due to the additivity of the losses on these subsequences. □\n\n3 MAP Estimation\n\nWe now present bounds for MAP algorithms for both Gaussian linear regression (i.e., ridge regression) and logistic regression. These algorithms use the maximum θ̂_{t−1} of p_{t−1}(θ) to form their predictive distribution p(y|x(t), θ̂_{t−1}) at time t, as opposed to BMA’s predictive distribution p(y|x(t), St−1). 
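For concreteness, the online MAP predictor for the Gaussian model can be sketched as follows (our own illustration, not code from the paper; names are assumptions). It tracks the same posterior statistics as BMA but predicts with the single point estimate θ̂_{t−1} and suffers the squared error, with no predictive-variance reweighting:

```python
import numpy as np

def map_ridge_online(X, y, sigma2=1.0, nu2=1.0):
    """Online MAP (ridge regression) prediction.

    Tracks the same Gaussian-posterior statistics as BMA, but at round t
    predicts with the point estimate hat_theta_{t-1} = A_{t-1}^{-1} b_{t-1}
    and suffers the square loss (1/2)(y_t - hat_theta_{t-1}^T x_t)^2.
    """
    n = X.shape[1]
    A = np.eye(n) / nu2          # precision A_t; only the prior term at t = 0
    b = np.zeros(n)
    sq_loss = 0.0
    for x, yt in zip(X, y):
        theta_hat = np.linalg.solve(A, b)      # MAP estimate hat_theta_{t-1}
        sq_loss += 0.5 * (yt - theta_hat @ x) ** 2
        A += np.outer(x, x) / sigma2
        b += yt * x / sigma2
    return sq_loss
```

Unlike BMA, the squared term here is not weighted by the inverse predictive variance, which is exactly the gap that the bounds of this section quantify.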
As expected, these bounds are weaker than BMA’s, though perhaps not unreasonably so.\n\n3.1 The Square Loss and Ridge Regression\n\nBefore we provide the MAP bound, let us first present the form of the posteriors and the predictions for Gaussian linear regression. Define A_t = (1/ν^2) In + (1/σ^2) ∑_{i=1}^t x(i) x(i)^T and b_t = (1/σ^2) ∑_{i=1}^t x(i) y(i). We now have that\n\np_t(θ) = p(θ|St) = N(θ; θ̂_t, Σ̂_t),   (6)\n\nwhere θ̂_t = A_t^{−1} b_t and Σ̂_t = A_t^{−1}. Also, the predictions at time t + 1 are given by\n\np(y(t+1)|x(t+1), St) = N(y(t+1); ŷ_{t+1}, s^2_{t+1}),   (7)\n\nwhere ŷ_{t+1} = θ̂_t^T x(t+1) and s^2_{t+1} = x(t+1)^T Σ̂_t x(t+1) + σ^2. In contrast, the prediction of a fixed expert using parameter θ* would be\n\np(y(t)|x(t), θ*) = N(y(t); y*_t, σ^2),   (8)\n\nwhere y*_t = θ*^T x(t). Now the BMA loss is\n\nL_BMA(S) = ∑_{t=1}^T (1/(2 s^2_t)) (y(t) − θ̂_{t−1}^T x(t))^2 + log √(2π s^2_t).   (9)\n\nImportantly, note how Bayes is adaptively weighting the squared term with the inverse variances 1/s^2_t (which depend on the current observation x(t)). The logloss of using a fixed expert θ* is just\n\nL_{θ*}(S) = ∑_{t=1}^T (1/(2σ^2)) (y(t) − θ*^T x(t))^2 + log √(2πσ^2).   (10)\n\nThe MAP procedure (referred to as ridge regression) uses p(y|x(t), θ̂_{t−1}), which has a fixed variance. 
Hence, the MAP loss is essentially the square loss, and we define it as such:\n\nL̃_MAP(S) = ∑_{t=1}^T (1/2) (y(t) − θ̂_{t−1}^T x(t))^2,   L̃_{θ*}(S) = ∑_{t=1}^T (1/2) (y(t) − θ*^T x(t))^2,   (11)\n\nwhere θ̂_t is the MAP estimate (see Equation 6).\n\nCorollary 3.1: Let γ^2 = σ^2 + ν^2. For all S such that ||x(t)|| ≤ 1 and for all θ*, we have\n\nL̃_MAP(S) ≤ (γ^2/σ^2) L̃_{θ*}(S) + (γ^2/(2ν^2)) ||θ*||^2 + (γ^2 n / 2) log(1 + T ν^2 / (σ^2 n)).\n\nProof: Using Equations (9, 10) and Theorem 2.2 (with c = 1/σ^2), we have\n\n∑_{t=1}^T (1/(2 s^2_t)) (y(t) − θ̂_{t−1}^T x(t))^2 ≤ ∑_{t=1}^T (1/(2σ^2)) (y(t) − θ*^T x(t))^2 + (1/(2ν^2)) ||θ*||^2 + (n/2) log(1 + T ν^2 / (σ^2 n)) + ∑_{t=1}^T log (√(2πσ^2) / √(2π s^2_t)).\n\nEquations (6, 7) imply that σ^2 ≤ s^2_t ≤ σ^2 + ν^2. Using this, the result follows by noting that the last term is negative and by multiplying both sides of the equation by σ^2 + ν^2. □\n\nWe might have hoped that MAP were more competitive in that the leading coefficient, in front of the L̃_{θ*}(S) term in the bound, be 1 (similar to Theorem 2.2) rather than γ^2/σ^2. Crudely, the reason that MAP is not as effective as BMA is that MAP does not take into account the uncertainty in its predictions; thus the squared terms cannot be reweighted to take variance into account (compare Equations 9 and 11).\n\nSome previous (non-Bayesian) algorithms did in fact have bounds with this coefficient being unity. 
Vovk (2001) provides such an algorithm, though this algorithm differs from MAP in that its predictions at time t are a nonlinear function of x(t) (it uses A_t instead of A_{t−1} at time t). Foster (1991) provides a bound with this coefficient being 1 under more restrictive assumptions. Azoury and Warmuth (2001) also provide a bound with a coefficient of 1 by using a MAP procedure with “clipping.” (Their algorithm thresholds the prediction ŷ_t = θ̂_{t−1}^T x(t) if it is larger than some upper bound. Note that we do not assume any upper bound on y(t).)\n\nAs the following lower bound shows, it is not possible for the MAP linear regression algorithm to have a coefficient of 1 for L̃_{θ*}(S) with a reasonable regret bound. A similar lower bound is in Vovk (2001), which doesn’t apply to our setting, where we have the additional constraint ||x(t)|| ≤ 1.\n\nTheorem 3.2: Let γ^2 = σ^2 + ν^2. There exists a sequence S with ||x(t)|| ≤ 1 such that\n\nL̃_MAP(S) ≥ inf_{θ*} ( L̃_{θ*}(S) + (1/2) ||θ*||^2 ) + Ω(T).\n\nProof: (sketch) Let S be a length T + 1 sequence, with n = 1, where for the first T steps, x(t) = 1/√T and y(t) = 1, and at T + 1, x(T+1) = 1 and y(T+1) = 0. Here, one can show that inf_{θ*} ( L̃_{θ*}(S) + (1/2) ||θ*||^2 ) = T/4 and L̃_MAP(S) ≥ 3T/8, and the result follows. □\n\n3.2 Logistic Regression\n\nMAP estimation is often used for regularized logistic regression, since it requires only solving a convex program (while BMA has to deal with a high-dimensional integral over θ that is intractable to compute exactly). Letting θ̂_{t−1} be the maximum of the posterior p_{t−1}(θ), define L_MAP(S) = ∑_{t=1}^T − log p(y(t)|x(t), θ̂_{t−1}). 
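The convex program just mentioned is straightforward to solve directly. Below is an illustrative sketch (our own; the solver choice, step size, and names are assumptions, not from the paper) that computes the MAP estimate by gradient descent on the negative log posterior under the N(0, ν^2 In) prior, here with ν = 0.5 (so ν^2 = 0.25), the largest prior standard deviation allowed by the theorem that follows:

```python
import numpy as np

def logistic_map_estimate(X, y, nu2=0.25, steps=500, lr=0.1):
    """MAP estimate for Bayesian logistic regression.

    Minimizes the convex negative log posterior
        sum_t [ -y_t log s(theta^T x_t) - (1 - y_t) log(1 - s(theta^T x_t)) ]
        + ||theta||^2 / (2 * nu2),
    where s is the sigmoid, by plain gradient descent. This is a sketch,
    not a tuned solver; in practice Newton or L-BFGS would be preferred.
    """
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ theta)))   # predicted P(y = 1 | x)
        grad = X.T @ (p - y) + theta / nu2       # gradient of the objective
        theta -= lr * grad
    return theta
```

The strong convexity contributed by the prior term (Hessian at least (1/ν^2) In) is the same curvature property the proof of the logistic regression bound exploits.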
As with the square loss case, the bound we present is multiplicatively worse (by a factor of 4).\n\nTheorem 3.3: In the logistic regression model with ν ≤ 0.5, we have that for all sequences S such that ||x(t)|| ≤ 1 and y(t) ∈ {0, 1} and for all θ*,\n\nL_MAP(S) ≤ 4 L_{θ*}(S) + (2/ν^2) ||θ*||^2 + 2n log(1 + T ν^2 / n).\n\nProof: (sketch) Assume n = 1 (the general case is analogous). The proof consists of showing that ℓ_{θ̂_{t−1}}(t) = − log p(y(t)|x(t), θ̂_{t−1}) ≤ 4 ℓ_BMA(t). Without loss of generality, assume y(t) = 1 and x(t) ≥ 0, and for convenience, we just write x instead of x(t). Now the BMA prediction is ∫_θ p(1|θ, x) p_{t−1}(θ) dθ, and ℓ_BMA(t) is the negative log of this. Note that θ = ∞ gives probability 1 for y(t) = 1 (and this setting of θ minimizes the loss at time t).\n\nSince we do not have a closed form solution of the posterior p_{t−1}, let us work with another distribution q(θ) in lieu of p_{t−1}(θ) that satisfies certain properties. Define p_q = ∫_θ p(1|θ, x) q(θ) dθ, which can be viewed as the prediction using q rather than the posterior. We choose q to be the rectification of the Gaussian N(θ; θ̂_{t−1}, ν^2 In), such that there is positive probability only for θ ≥ θ̂_{t−1} (and the distribution is renormalized). With this choice, we first show that the loss of q, − log p_q, is less than or equal to ℓ_BMA(t). 
Then we complete the proof by showing that ℓ_{θ̂_{t−1}}(t) ≤ −4 log p_q, since − log p_q ≤ ℓ_BMA(t).\n\nConsider the q which maximizes p_q subject to the following constraints: let q(θ) have its maximum at θ̂_{t−1}; let q(θ) = 0 if θ < θ̂_{t−1} (intuitively, mass to the left of θ̂_{t−1} is just making p_q smaller); and impose the constraint that −(log q(θ))'' ≥ 1/ν^2. We now argue that for such a q, − log p_q ≤ ℓ_BMA(t). First note that due to the Gaussian prior p_0, it is straightforward to show that −(log p_{t−1})''(θ) ≥ 1/ν^2 (the prior imposes some minimum curvature). Now if this posterior p_{t−1} were rectified (with support only for θ ≥ θ̂_{t−1}) and renormalized, then such a modified distribution clearly satisfies the aforementioned constraints, and it has loss less than the loss of p_{t−1} itself (since the rectification only increases the prediction). Hence, the maximizer, q, of p_q subject to the constraints has loss less than that of p_{t−1}, i.e. − log p_q ≤ ℓ_BMA(t).\n\nWe now show that such a maximal q is the (renormalized) rectification of the Gaussian N(θ; θ̂_{t−1}, ν^2 In), such that there is positive probability only for θ > θ̂_{t−1}. Assume some other q2 satisfied these constraints and maximized p_q. It cannot be that q2(θ̂_{t−1}) < q(θ̂_{t−1}), else one can show q2 would not be normalized (since with q2(θ̂_{t−1}) < q(θ̂_{t−1}), the curvature constraint imposes that this q2 cannot cross q). It also cannot be that q2(θ̂_{t−1}) > q(θ̂_{t−1}). 
To see this, note that normalization and curvature imply that q2 must cross q only once. Now a sufficiently slight perturbation of this crossing point to the left, by shifting more mass from the left to the right side of the crossing point, would not violate the curvature constraint and would result in a new distribution with larger p_q, contradicting the maximality of q2. Hence, we have that q2(θ̂_{t−1}) = q(θ̂_{t−1}). This, along with the curvature constraint and normalization, imply that the rectified Gaussian, q, is the unique solution.\n\nTo complete the proof, we show ℓ_{θ̂_{t−1}}(t) = − log p(1|x, θ̂_{t−1}) ≤ −4 log p_q. We consider two cases, θ̂_{t−1} < 0 and θ̂_{t−1} ≥ 0. We start with the case θ̂_{t−1} < 0. Using the boundedness of the derivative |∂ log p(1|x, θ)/∂θ| < 1 and that q only has support for θ ≥ θ̂_{t−1}, we have\n\np_q = ∫_θ exp(log p(1|x, θ)) q(θ) dθ ≤ ∫_θ exp( log p(1|x, θ̂_{t−1}) + θ − θ̂_{t−1} ) q(θ) dθ ≤ 1.6 p(1|x, θ̂_{t−1}),\n\nwhere we have used that ∫_θ exp(θ − θ̂_{t−1}) q(θ) dθ < 1.6 (which can be verified numerically using the definition of q with ν ≤ 0.5). Now observe that for θ̂_{t−1} ≤ 0, we have the lower bound − log p(1|x, θ̂_{t−1}) ≥ log 2. Hence, − log p_q ≥ − log p(1|x, θ̂_{t−1}) − log 1.6 ≥ (− log p(1|x, θ̂_{t−1}))(1 − log 1.6 / log 2) ≥ 0.3 ℓ_{θ̂_{t−1}}(t), which shows ℓ_{θ̂_{t−1}}(t) ≤ −4 log p_q.\n\nNow for the case θ̂_{t−1} ≥ 0. Let σ be the sigmoid function, so p(1|x, θ) = σ(θx) and p_q = ∫_θ σ(xθ) q(θ) dθ. 
Since the sigmoid is concave for θ > 0 and, for this case, q only has support on positive θ, we have that p_q ≤ σ( x ∫_θ θ q(θ) dθ ). Using the definition of q, we then have that p_q ≤ σ(x(θ̂_{t−1} + ν)) ≤ σ(θ̂_{t−1} + ν), where the last inequality follows from θ̂_{t−1} + ν > 0 and x ≤ 1. Using properties of σ, one can show |(log σ)'(z)| < − log σ(z) (for all z). Hence, for all θ ≥ θ̂_{t−1}, |(log σ)'(θ)| < − log σ(θ) ≤ − log σ(θ̂_{t−1}). Using this derivative condition along with the previous bound on p_q, we have that − log p_q ≥ − log σ(θ̂_{t−1} + ν) ≥ (− log σ(θ̂_{t−1}))(1 − ν) = ℓ_{θ̂_{t−1}}(t)(1 − ν), which shows that ℓ_{θ̂_{t−1}}(t) ≤ −4 log p_q (since ν ≤ 0.5). This proves the claim when θ̂_{t−1} ≥ 0. □\n\nAcknowledgments. We thank Dean Foster for numerous helpful discussions. This work was supported by the Department of the Interior/DARPA under contract NBCHD030010.\n\nReferences\n\nAzoury, K. S. and Warmuth, M. (2001). Relative loss bounds for on-line density estimation with the exponential family of distributions. Machine Learning, 43(3).\n\nCesa-Bianchi, N., Freund, Y., Haussler, D., Helmbold, D., Schapire, R., and Warmuth, M. (1997). How to use expert advice. J. ACM, 44.\n\nCesa-Bianchi, N., Helmbold, D., and Panizza, S. (1998). On Bayes methods for on-line boolean prediction. Algorithmica, 22.\n\nDawid, A. (1984). Statistical theory: The prequential approach. J. Royal Statistical Society.\n\nFoster, D. P. (1991). Prediction in the worst case. Annals of Statistics, 19.\n\nFreund, Y. and Schapire, R. (1999). 
Adaptive game playing using multiplicative weights. Games and Economic Behavior, 29:79-103.\n\nFreund, Y., Schapire, R., Singer, Y., and Warmuth, M. (1997). Using and combining predictors that specialize. In STOC.\n\nGrunwald, P. (2005). A tutorial introduction to the minimum description length principle.\n\nMcCullagh, P. and Nelder, J. A. (1989). Generalized Linear Models (2nd ed.). Chapman and Hall.\n\nNg, A. Y. and Jordan, M. (2001). Convergence rates of the voting Gibbs classifier, with application to Bayesian feature selection. In Proceedings of the 18th Int'l Conference on Machine Learning.\n\nVovk, V. (2001). Competitive on-line statistics. International Statistical Review, 69.\n", "award": [], "sourceid": 2637, "authors": [{"given_name": "Sham", "family_name": "Kakade", "institution": null}, {"given_name": "Andrew", "family_name": "Ng", "institution": null}]}