{"title": "A Statistical Mechanics Approach to Approximate Analytical Bootstrap Averages", "book": "Advances in Neural Information Processing Systems", "page_first": 343, "page_last": 350, "abstract": null, "full_text": "A Statistical Mechanics Approach to Approximate Analytical Bootstrap Averages\n\nD\u00f6rthe Malzahn\n\nManfred Opper\n\nInformatics and Mathematical Modelling, Technical University of Denmark,\n\nR.-Petersens-Plads Building 321, DK-2800 Lyngby, Denmark\n\nNeural Computing Research Group, School of Engineering and Applied Science,\n\nAston University, Birmingham B4 7ET, United Kingdom\n\ndm@imm.dtu.dk\n\nopperm@aston.ac.uk\n\nAbstract\n\nWe apply the replica method of Statistical Physics combined with a variational\nmethod to the approximate analytical computation of bootstrap\naverages for estimating the generalization error. We demonstrate our approach\non regression with Gaussian processes and compare our results\nwith averages obtained by Monte-Carlo sampling.\n\n1 Introduction\n\nThe application of tools from Statistical Mechanics to analyzing the average-case performance\nof learning algorithms has a long tradition in the Neural Computing and Machine\nLearning community [1, 2]. When data are generated from a highly symmetric distribution\nand the dimension of the data space is large, methods of statistical mechanics of disordered\nsystems allow for the computation of learning curves for a variety of interesting and\nnontrivial models ranging from simple perceptrons to Support-vector Machines. Unfortunately,\nthe specific power of this approach, namely its ability to give explicit distribution-dependent\nresults, is also a major drawback for practical applications. 
In general,\ndata distributions are unknown, and replacing them by simple model distributions might\nreveal only some qualitative behavior of the true learning performance.\n\nIn this paper we suggest a novel application of Statistical Mechanics techniques to\na topic within Machine Learning for which the distribution over data is well known and\ncontrolled by the experimenter. It is given by the resampling of an existing dataset in the\nso-called bootstrap approach [3]. Creating bootstrap samples of the original dataset by\nrandom resampling with replacement and retraining the statistical model on each bootstrap\nsample is a widely applicable statistical technique. By replacing averages over the true\nunknown distribution of data with suitable averages over the bootstrap samples, one can\nestimate various properties such as the bias, the variance and the generalization error of a\nstatistical model.\n\nWhile in general bootstrap averages can be approximated by Monte-Carlo sampling, it is\nalso useful to have analytical approximations which avoid the time-consuming retraining\nof the model for each sample. Existing analytical approximations (based on asymptotic\ntechniques) such as the delta method and the saddle point method (see e.g. [5]) usually\nrequire explicit analytical formulas for the estimators of the parameters for a trained model.\nThese may not be easily obtained for more complex models in Machine Learning. In this\npaper, we discuss an application of the replica method of Statistical Physics [4] which,\ncombined with a variational method [6], can produce approximate averages over the random\ndrawings of bootstrap samples. 
Explicit formulas for parameter estimates are avoided and\nreplaced by the implicit condition that such estimates are expectations with respect to a\ncertain Gibbs distribution to which the methods of Statistical Physics can be well applied.\nWe demonstrate the method for the case of regression with Gaussian processes (GP) (which\nis a kernel method that has gained high popularity in the Machine Learning community in\nrecent years [7]) and compare our analytical results with results obtained by Monte-Carlo\nsampling.\n\n2 Basic setup and Gibbs distribution\n\nWe will keep the notation in this section fairly general, indicating that most of the theory\n\nis modeled by a likelihood of the type\n\n\u0006\u000f\u000e\u0011\u0010\n\nstands for a function\n\n\u0010\u0017\u0016\u0019\u0018\u001b\u001a\u001d\u001c\n\n\u0010)(\n\u0005,+\n\b.-\n. In this case,\u0015\n\n(which can be a \ufb01nite or even\nin\ufb01nite dimensional object) which must be estimated from the data. We will later specialize\n\nwhich parameterize such functions. We will later apply our approach to the mean square\nerror given by\n\n(2)\nThe \ufb01rst basic ingredient of our approach is the assumption that the estimator for the un-\n\ncan be developed for a broader class of models. We assume that a \ufb01xed set of data\u0002\u0001\u0004\u0003\n\u0005\u0007\u0006\n\u0005\n\b\f\u000b\r\u000b\r\u000b\r\b\n\u0001\t\b\n\u001f! \n\u0014\u0013\n\u0001'&\n#%$\nis parametrized by a parameter\u0015\nwhere the \u201ctraining error\u201d&\n\u0010 consists of an input+\nto supervised learning problems where each data point\u0006\n(usually a \ufb01nite dimensional vector) and a real label-\n\u0005,+'\u0010 which models the outputs, or for the parameters (like the weights of a neural network)\n\u0005\u0007+\nknown \u201ctrue\u201d function\u0015 can be represented as the mean with respect to a posterior distri-\nbution over all possible\u0015 \u2019s. 
This avoids the problem of writing down explicit, complicated\nformulas for estimators. To be precise, we assume that the statistical estimator 2\n\u0015\n354 (which\nis based on the training set6\u0001 ) can be represented as the expectation of\u0015 with respect to\n\u001f\u0017 \n\u0010)(\n#%$\nwhich is constructed from a suitable prior distribution<\n9 and the likelihood (1).\n#%$\n\ndenotes a normalizing partition function. Our choice of (3) does not mean that we restrict\nourselves to Bayesian estimators. By introducing speci\ufb01c (\u201ctemperature\u201d like) parameters\nin the prior and the likelihood, the measure (3) can be strongly concentrated at its mean\nsuch that maximum likelihood/MAP estimators can be included in our framework.\n\n0\n1\n\n\u0018\u001b\u001a\u001d\u001c\n\u0018\u001b\u001a\u001d\u001c\n\n758\n\n\u0015!\u0013\n\n;=<\n\n\u0001:9\n\u0003@?BAC<\n\n(1)\n\n(3)\n\n(4)\n\nthe measure\n\n\u0001>&\n\n3 Bootstrap averages\n\nWe will explain our analytical approximation to resampling averages for the case of super-\nvised learning problems. If we are interested in, say, estimating the expected error on test\n\n\u0006\n\u0012\n\u0005\n\u0015\n\u001e\n\u000e\n\"\n\u0005\n\u0015\n\b\n\u0006\n#\n*\n\u0003\n\u0015\n&\n\u0005\n\u0015\n\b\n\u0006\n#\n\u0010\n\u0003\n/\n\u0005\n\u0005\n\u0015\n#\n\u0010\n \n-\n#\n\u0010\n\u0005\n\u000b\n\n\u0003\n/\n8\n\u0015\n9\n\u001e\n\u000e\n\"\n\u0005\n\u0015\n\b\n\u0006\n#\n*\n8\n\u0015\n;\n8\n\u0015\n9\n\u001e\n\u001f\n \n\u000e\n\"\n\u0001\n&\n\u0005\n\u0015\n\b\n\u0006\n#\n\u0010\n(\n*\n\fEfron\u2019s estimator for the bootstrap generalization error is\n\n\u0005\u0016 \n\n\u0002\u001f\u001e\n\nwhich do not contain the test point. Introducing the abbreviation\n\n.\n\nAC<\n\nand if we have no hold\ndata from\n.\n\nby the vector of \u201coccupation\u201d\nis the number of times example\n,\n\nall. 
A proxy for the true average test error can be obtained by retraining the model on each\n, calculating the test error only on those points which are not con-\n\nwhere we specialized to the square error for testing. Eq.(5) computes the average bootstrap\nfor\ncontribute\n\nmain importance, but we will also allow for estimating a lager part of the \u201clearning curve\u201d\n. We will not discuss the statistical properties of such\nbootstrap estimates and their re\ufb01nements (such as Efron\u2019s .632 estimate) in this paper, but\nrefer the reader to the standard literature [3, 5].\n\npoints 1 which are not contained in the training set=\u0001 of size\nout data, we can create arti\ufb01cial data sets\nby resampling (with replacement)\u0001\n\u0001 , where each data point\u0006\u0003\u0002\u0005\u0004\nthe original set\nHence, some of the\u0006\u0003\u0002 \u2019s will appear several times in the bootstrap sample and others not at\nis taken with equal probability/\nbootstrap training set\nand \ufb01nally averaging over many sets\ntained in\n\u0003\u0007 maybe of\n. In practice, the case\u0001\nand\u0001\t\u000b\u0007\nby allowing for\u0001\t\b\n\n\u0001 , we represent a bootstrap sample\nFor any given set\n\u0010 with\r\n\u0001\u000f\b\f\u000b\r\u000b\r\u000b\f\b\n\u0003\u000e\u0001\nnumbers\f\n\u0006\u000f\u0002 appears in the set\n. Denoting the expectation over random bootstrap samples by\u0010\n\u0012\u0014\u0013\u0016\u0015\u0018\u0017\u001a\u0019\n\u0005\u0007+\u001d\u0002)\u0010\n\u0001\u001c\u001b\ntest error at each data point! from6\u0001 . 
The Kronecker symbol, de\ufb01ned by\u0013\n\u0003#\" and$ else, guarantees that only realizations of bootstrap training sets\n\u0015\t3\n(which is a linear function of\u0015 ), and using the de\ufb01nition of the estimator 2\nof\u0015 \u2019s over the Gibbs distribution (3), the bootstrap estimate (5) can be rewritten as\n\u0012\u0014\u0013\u0016\u0015\u0018\u0017\u001a\u0019\n\u0010\u0003'\n\u0006\u000f\u0002\n\u0006&\u0002\n\u0010.\u0010\n\u0010)(\n'6\u0018\u001b\u001a\u001d\u001c\n*+*\n#%$\nwhich involves0 copies (or replicas)\u0015\n\u0005 of the variable\u0015\n\u0001 and\u0015\nof test errors which are polynomials or can be approximated by polynomials in 2\n\u0015\t3\nrewritten in a similar way, involving more replicas of the variable\u0015\n\u0002 \u2019s is multinomial. It is simpler (and does not make a big\n, the distribution of\u0001\nFor \ufb01xed\u0001\ndifference when\u0001\nsize of the set with\u0001\nfor the occupation numbers\u0001\n\n\u0015.\u00170/2143\n\u000205\n. With Eq. (8) follows\u0010\n\nis suf\ufb01ciently large) when we work with a Poisson distribution for the\nas the mean number of data points in the sample. In this case we\n\n4 Analytical averages using the \u201creplica trick\u201d\n\n. 
More complicated types\ncan be\n\n.\n\n(5)\n\n(6)\n\n(7)\n\n(8)\n\n/6173 .\n\nget the simpler, factorizing joint distribution\n\n1The average is over the unknown distribution of training data sets.\n\n\u0002 where<\n\n\u0003\u0007\u0001\n\n\u0005\u0007+\n\n\u0003@\u0015\n\nas an average\n\n\n\u0001\n\u0006\n\n\u0003\n\u0005\n\u0001\n\u0001\n\u000e\n\u000e\n\u0002\n$\n\u0001\n\u0001\n\u0002\n\u0001\n\u0002\n3\n\u0011\n\u0005\n\u0001\n\u0010\n\u000b\n\u0003\n/\n\n\u000e\n\"\n\u0002\n$\n\u0001\n\u0010\n3\n2\n\u0015\n3\n \n-\n\u0010\n3\n8\n\u0013\n\u0015\n\u0017\n\u0019\n\u0001\n9\n\u0002\n\u0019\n#\n\u0003\n/\n!\n%\n\u0005\n\u0015\n\b\n\u0006\n\u0002\n\u0010\n\u0002\n\u0010\n \n-\n\u0002\n\u0011\n\u0005\n\u0001\n\u0010\n\u0003\n/\n\n\u000e\n\"\n\u0002\n$\n\u0001\n/\n\u0010\n3\n8\n\u0013\n\u0015\n\u0017\n\u0019\n\u0001\n9\n\u0010\n3\n\u0001\n/\n;\n\u0005\n?\nA\n<\n8\n\u0015\n\u0001\n9\n8\n\u0015\n\u0005\n9\n%\n\u0005\n\u0015\n\u0001\n\b\n\u0010\n%\n\u0005\n\u0015\n\u0005\n\b\n\u001e\n\u001f\n \n\u000e\n\"\n\u0001\n\u0001\n#\n\u0005\n&\n\u0005\n\u0015\n\u0001\n\b\n\u0006\n#\n&\n\u0005\n\u0015\n\u0005\n\b\n\u0006\n#\n(\n,\n\u000b\n\u0012\n\u0005\n\f\n\u0010\n\u0003\n\u000e\n-\n\u0002\n$\n\u0001\n<\n\u0001\n\u0006\n\n3\n8\n\u0013\n\u0015\n\u0017\n\u0019\n\u0001\n9\n\u0003\n\f17\u0015\n\nreplicas of the original\n\n\u0006\u000f\u0002\n\n\u0005\n\b\n\n\u0006&\u0002\n\u0010.\u0010\n\n\u0010\u001e\u001d\n\n(9)\n\n(10)\n\n\u000b\n\n\u0018\u001b\u001a\u001d\u001c\n\n(which is the \u201cquenched disorder\u201d in\n\nthe language of Statistical Physics) it is necessary to introduce the auxiliary quantity\n\nTo enable the analytical average over the vector\f\n\u0012\u0014\u0013\u0016\u0015\u0018\u0017\u001a\u0019\n\u0011\u0001\n\u0018\u001b\u001a\u001d\u001c\n#%$\n\u0010 . 
The advantage of this de\ufb01nition\nreal, which allows to write\u0011\n\u0003\u0005\u0004\u0007\u0006\t\b\n\u0010 can be represented in terms of\u0003\n0 ,\u0011\u0001\nfor\u0003\nis that for integers\u0003\n\u0002 \u2019s is possible. At the end of all calculations\nvariable\u0015\nfor which an explicit average over\u0001\n$ must be performed. Using\nan analytical continuation to arbitrary real\u0003 and the limit\u0003\r\f\nthe de\ufb01nition of the partition function (4), we get for integer\u0003\n14\u0015\n\u0001\u000f\b\n#%$\nExchanging the expectation over datasets with the expectation over\u0015 \u2019s and using the ex-\n14\u0015\n\u0017\u0016\u0019\u0018\n\u0003\u0011\u0010\n\u0001\u0013\u0012\n\u000b\f\u000b\r\u000b! denote an average with respect to a Gibbs measure for replicas\nwhere the brackets\u001f\n\b\r\u000b\r\u000b\f\u000b\r\b\n1#\u0014\n has been introduced for convenience to normalize the\nand where the partition function\u0010\n$ . In most nontrivial cases, averages with respect to the measure (12)\nmeasure for\u0003'&\n\ncan not be calculated exactly. Hence, we have to apply a sensible approximation. Our idea\nis to use techniques which have been frequently applied to probabilistic models [10] such\nas the variational approximation, the mean \ufb01eld approximation and the TAP approach. In\nthis paper, we restrict ourself to a variational Gaussian approximation. More advanced\napproximations will be given elsewhere.\n\n\u0003\u001c\u001b\n\u0018\u001b\u001a\u001d\u001c\n\t\u0016\n\n\u001a%$\n\nplicit form of the distribution (8) we obtain\n\n(11)\n\n(12)\n\n(13)\n\n\u0001\n\b\n\n \u0013\"\n\nwhich is given by\n\nwhere\n\n\u0018\u001b\u001a\u001d\u001c\n\n1\u0015\u0014\n\n#%$\n\n \u0013\"\n\n\u0018\u001b\u001a\u001d\u001c\n\n5 Variational approximation\n\nA method, frequently used in Statistical Physics which has also attracted considerable in-\nterest in the Machine Learning community, is the variational approximation [8]. 
Its goal is\nto replace an intractable distribution like (12) by a different, suf\ufb01ciently close distribution\nfrom a tractable class which we will write in the form\n\n(14)\n\n\u0005\n\u0001\n\u0010\n\u000b\n\u0003\n/\n/\n\u0002\n\u000e\n\n\u000e\n\"\n\u0002\n$\n\u0001\n\u0010\n3\n\u0001\n;\n\n1\n\u0005\n?\nA\n<\n8\n\u0015\n\u0001\n9\nA\n<\n8\n\u0015\n\u0005\n9\n%\n\u0005\n\u0015\n\u0001\n\b\n\u0010\n%\n\u0005\n\u0015\n\u0005\n\b\n\u0010\n'\n'\n\u001e\n\u001f\n \n\u000e\n\"\n\u0001\n\u0001\n#\n\u0005\n&\n\u0005\n\u0015\n\u0001\n\b\n\u0006\n#\n\u0010\n(\n&\n\u0005\n\u0015\n\u0005\n\b\n\u0006\n#\n(\n*\n*\n,\n\u0005\n\u0001\n\u0010\n\u0001\n\u0011\n\n\u0005\n\u0001\n\u000b\n\u0005\n\u0001\n\u000b\n0\n\u0011\n\n\u0005\n\u0001\n\u0010\n\u0003\n/\n/\n\u0002\n\u000e\n\n\u000e\n\"\n\u0002\n$\n\u0001\n\u0010\n3\n\u000e\n\u0013\n\u0015\n\u0017\n\u0019\n\u0001\n?\n\n-\n\u000f\n$\n\u0001\nA\n<\n8\n\u0015\n\u000f\n9\n%\n\u0005\n\u0015\n\u0006\n\u0002\n\u0010\n%\n\u0005\n\u0015\n\u0006\n\u0002\n\u0010\n'\n'\n\u001e\n\u001f\n \n\u000e\n\"\n\u0001\n\u0001\n#\n\n\"\n\u000f\n$\n\u0001\n&\n\u0005\n\u0015\n\u000f\n\b\n\u0006\n#\n\u0010\n(\n*\n*\n,\n\u000b\n\u0011\n\n\u0005\n\u0001\n\u0010\n\n/\n/\n\u0002\n\u000e\n\n\u000e\n\"\n\u0002\n$\n\u000e\n \n\u0001\n\n\n-\n\u000f\n$\n\u0001\n/\n\u0019\n\u001a\n\u0017\n%\n\u0005\n\u0015\n\u0006\n\u0002\n\u0010\n%\n\u0005\n\u0015\n\u0005\n\b\n\u0006\n\u0002\n7\n\n\u0005\n\u0015\n\u0001\n\u0015\n\n\u0010\n\u0003\n/\n\u0010\n\n\n-\n\u000f\n$\n\u0001\n<\n8\n\u0015\n\u000f\n9\n8\n\n9\n\"\n\n\u0003\n\u0001\n \n\u0001\n\n\u000e\n\"\n\u0001\n\n-\n\u000f\n$\n\u0001\n/\n\u0018\n\u0019\n\u0003\n\u0003\n7\n\u0001\n\n\u0016\n\n-\n\u000f\n$\n\u0001\n<\n8\n\u0015\n\u000f\n9\n8\n\u0001\n\n9\n\u000b\n\f\u0005,+\n\n\u0005,+\n\n\u0005,+\n\nbe replica symmetric, i.e. 
we set\n\naveraged posterior covariance.\n\nis the covariance kernel of the GP.\n\n, we choose the quadratic expression\n\nof the variational free energy\n\nresulting in a minimization\n\ndenote averages with respect to the variational distribution (14).\n\nas a suitable trial Hamiltonian, leading to a Gaussian distribution (14). The functions\n\nFor our application to Gaussian process models, we will now specialize to Gaussian priors\n\n, we assume that the optimal parameters should\nfor\n\n will be used in (11) instead of7\nto approximate the average.\"\n and7\ne.g. [10]) to minimize the relative entropy between7\n will be chosen (see\n \u0013\"\n\u0018\u001b\u001a\u001d\u001c\nAC<\n\u0004\u0002\u0001\nbeing an upper bound to the true free energy \n\u000b\f\u000b\r\u000b! \n\u0004\u0002\u0001\nfor any integer\u0003\n. The brackets\u001f\n9 . For\"\n\u0005\u0007+\n\u0010)(\n\u0005,+\n\u0010)(\n\u0005,+\n#%$\n\u0010 and 2\n\u0005\u0007+\n\u0005,+\n\u0010 are the variational parameters to be optimized. To continue the\n\u0010 as well as 2\n\u0005,+\n\u0005,+\n\u0005,+\nvariational solutions to arbitrary real\u0003\n\u0005,+\n\u0005,+\n\u0010 . The variational free energy can then be expressed by\n\u0003\b\u0007 and 2\n\u000f\u001e\u000f\n\u0005\u0007+\n\u0005,+\n\u0005,+\n\u0005\u0007+\r\u000b\n\u0005,+\f\u000b\n\u0005\u0007+\f\u000b\n\u0005,+\n\u0005,+\f\u000b\nthe local moments (\u201corder parameters\u201d in the language of Statistical Physics)\t\n\u0003\u000e\u0007 and\u000f\n\u0001 ,\n\n\u0005\u0007+\n\u0005,+\n\u0005,+\n\u0005\u0007+\nfor\u0006\n\u0001 which have the same replica symmetric structure. Since\n\u0003 matrices (such as2\neach of the\u0003\n\u0003 ) are assumed to have only two types of entries, it\nis possible to obtain variational equations which contain the number\u0003 of replicas as a sim-\n$ can be explicitly performed (see appendix). 
In\n\u0010 are found to have simple inter-\n\u0010 ,\n\n\u0005\u0007+\nple parameter for which the limit\u0003\n\u0005,+\n\u0010 with respect to\nthis limit, the limiting order parameters\t\n\u0010 becomes the (approximate) bootstrap\n\u0005\u0007+\u0010\u000b\nthe average over bootstrap data sets while\u000f\ncase, the prior measure<\n9 can be simply represented by an\n\u0010.\u0010 having zero mean and covariance matrix\n\u0005\u0007+\n\u0005\u0007+\ndistribution for the vector\u0005\n\b\f\u000b\r\u000b\f\u000b\r\b\n\u0010 , where\u0011\n\u0005\u0007+\n\u0005\u0007+\n\b.-\n by7\n$ ) values of order parameters, and by approximating7\nUsing the limiting (for\u0003\n143\n\u0005\u0007+\ngeneral loss function of the type\u0015\n143\n\u0017\u0016\n143\nA\u0019\u0018\n\nWe consider a GP model for regression with training energy given by Eq. (2). In this\ndimensional Gaussian\n\nin Eq.(11), the explicit result for the bootstrap mean square generalization error is found to\nbe\n\n\u0010 . The result is\n\u0005\u0007+\u001d\u0002\n\n6 Explicit results for regression with Gaussian processes\n\nThe entire analysis can be repeated for testing (keeping the training energy \ufb01xed) with a\n\npretations as the (approximate) mean and variance of the predictor\n\n\u0005,+\n\n\u0005,+7\u0002\n\u0002\u0007\u0005\n\n\u0015.\u0017\n/61\n\u001a\u001c\u001b\n0\u001f\u001e\n\n\u0005,+7\u0002\n\n+\u001d\u0002\n\n\u0002\u001d(\n\n\u0005,+\n\n\u0018\"!\n\n\u0015\t3\n\n\u0005,+\n\n\u0005\u0007+\n\n\u0005,+7\u0002\n\n(15)\n\n(16)\n\n(17)\n\n(18)\n\n7\n\u0001\n\n\u0001\n\u0001\n\n\n\n\u000b\n\u0003\n \n?\n\n-\n\u000f\n$\n\u0001\n8\n\u0015\n\u000f\n9\n8\n\u0001\n\n9\n(\n\u001f\n\"\n\n \n\"\n\u0001\n\n 
\n\u0001\n\u0010\n\n\u0001\n<\n8\n\u0015\n\u0001\n\n\"\n\u0001\n\n\u0003\n/\n\n\u000e\n\"\n\u0001\n\u001e\n\u001f\n/\n0\n\n\"\n\u000f\n\u0019\n\u0003\n$\n\u0001\n2\n\u0004\n\u000f\n\u0003\n#\n\u0010\n\u0015\n\u000f\n#\n\u0010\n\u0015\n\u0003\n#\n\n\"\n\u000f\n$\n\u0001\n2\n\u0005\n\u000f\n#\n\u0010\n\u0015\n\u000f\n#\n*\n2\n\u0004\n\u000f\n\u0003\n#\n\u0005\n\u000f\n#\n2\n\u0005\n\u000f\n#\n\u0010\n\u0003\n2\n\u0005\n#\n\u0004\n\u000f\n\u0003\n#\n\u0010\n\u0003\n2\n\u0004\n#\n\u0010\n\u0006\n&\n\u0004\n#\n\u0010\n\u0003\n2\n\u0004\n\u0001\n#\n#\n\u0010\n\u000b\n\u0003\n\u001f\n\u0015\n\u000f\n#\n\u0010\n \n\b\n+\n#\n\u0010\n\u000b\n\u0003\n\u001f\n\u0015\n\u000f\n\u0010\n\u0015\n\u0003\n#\n\u0010\n \n\u0001\n \n\u001f\n\u0015\n\u000f\n\u0010\n \n\u0001\n\u001f\n\u0015\n\u0003\n#\n\u0010\n \n\u0001\n&\n\b\n+\n#\n\u0010\n\u000b\n\u0003\n\u001f\n\u0015\n\u000f\n\u000b\n\u0010\n\u0015\n\u000f\n#\n\u0010\n \n\u0001\n \n\u001f\n\u0015\n\u000f\n\u000b\n\u0010\n\u0015\n\u0003\n#\n\u0010\n \n'\n\u0004\n\u000f\n\f\n#\n#\n\b\n+\n#\n2\n#\n\b\n+\n#\n8\n\u0015\n\u0015\n\u0001\n\u0010\n\u0015\n\u000e\n\u0011\n\u0002\n\b\n+\n#\n\u0010\n\f\n\u0001\n\n\u0011\n\u0005\n\u0001\n\u0010\n\u0003\n/\n/\n\n\u000e\n\"\n\u0002\n$\n\u0001\n8\n\u0005\n\t\n\u0002\n\u0010\n \n-\n\u0002\n\u0010\n\u0005\n(\n\n\u0002\n\b\n+\n\u0002\n\u0010\n9\n\u0012\n\"\n\u0013\n$\n\u0001\n\u0005\n \n<\n\u0010\n\u0013\n\u0014\n5\n/\n\u0005\n/\n(\n\u0014\n\u000f\n\u0002\n\b\n+\n\u0002\n\u0010\n\u0006\n1\n\u0005\n\u0010\n\u0005\n\u000b\n\u0005\n2\n\u0015\n3\n\u0010\n \n-\n\u0011\n\u0003\n\u0005\n\u0001\n\u0010\n\u0003\n/\n/\n\n\u0010\n3\n\u000e\n\u000e\n\"\n\u0002\n$\n\u0001\n\u0013\n\u0019\n\u0001\n\u0015\n\u0005\n2\n\u0015\n3\n\u0010\n \n-\n\u0002\n\u0010\n\u001b\n\u0003\n/\n/\n\n\u000e\n\"\n\u0002\n$\n\u0001\n?\n\u001d\n\u0012\n\"\n\u0013\n$\n\u0001\n\u0005\n \n<\n\u0010\n\u0013\n\u0014\n5\n\u0015\n \n\t\n\u0010\n 
\n-\n\n\b\n\u0010\n/\n(\n\u0014\n\u000f\n\u0002\n\b\n+\n\u0002\n\u0010\n\u0006\n1\n\u0005\n#\n\u000b\n\fr\no\nr\nr\nE\n\n \n\n \nt\ns\ne\nT\np\na\nr\nt\ns\nt\no\no\nB\n\n8\n\n7\n\n6\n\n5\n\n4\n0\n\nSimulation\nTheory\n\n} N=1000\n\nm=N\n\n500\n\n1000\n\n1500\n\n2000\n\nSize m of Bootstrap Sample\n\nr\no\nr\nr\nE\n\n \n\n \nt\ns\ne\nT\np\na\nr\nt\ns\nt\no\no\nB\n\n2.0\n\n1.9\n\n1.8\n\n1.7\n\n1.6\n\n1.5\n\n1.4\n0\n\nSimulation\nTheory\n\n}\n\nN=1000\n\nm=N\n\n500\n\n1000\n\n1500\n\n2000\n\nSize m of Bootstrap Sample\n\n0\n\n\u0002\u0001\n\nif\nif\nif\n\n(19)\n\nFigure 1: Average bootstrapped generalization error on Abalone data using square error\nloss (left) and epsilon insensitive loss (right). Simulation (circles) and theory (lines) based\n\nWe have applied our theory to the Abalone data set [11] where we have computed the\napproximate bootstrapped generalization errors for the square error loss and the so-called\n\non the same data set\n\u0001 with\n$6$2$ data points. The GP model uses an RBF kernel\n+\u0001\n+\u0001\n\u0018\r\u001a\u001d\u001c\n\u0010 with\u0002\n\u0005\u0007+\n0\u0003\u0002\n\u0003\u0005\u0004 on whitened inputs. For the data noise we\nset1\n/ .\n \u000b\n\n\u0006 -insensitive loss which is de\ufb01ned by\n1\u0001\u0010\n \u0014\n\n\u0003\u0012\u0011\n\r\f\n\u000e\u000f\f\n\b\u0016\u0015\n\u0002 . We have set\n\nwith\u0013\u0018\u0017\n\u0005,+\n\u0019 and\u0006\n\u0015\t3\n/ . The bootstrap average from our\nsquare error loss (Eq.(17), left panel) as well as the one measured by the \u0006 -insensitive loss\non Monte-Carlo sampling averages that were computed using the same data set\u0002\u0001 having\n$2$6$ . The Monte-Carlo training sets\n\u0001 with replacement. 
We \ufb01nd a good agreement between theory and simulations in the\nregion were\u0001\n\u0010 over the\nwhole data set\n$2$2$ , and compares our theory (line) with simulations (circles)\n\n, however, the agreement is\nnot so good and corrections to our variational Gaussian approximation would be required.\nFigure 2 shows the bootstrap average of the posterior variance \u0001\n\nof size\u0001\n. When we oversample the data set\u0001\n\nwhich were based on Monte-Carlo sampling averages. The overall approximation looks\nbetter than for the bootstrap generalization error.\n\ntheory is obtained from Eq.(18). Figure 1 shows the generalization error measured by the\n\n(right panel). Our theory (line) is compared with simulations (circles) which were based\n\nFinally, it is important to note that all displayed theoretical learning curves have been ob-\ntained computationally much faster than their respective simulated learning curves.\n\n7 Outlook\n\nThe replica approach to bootstrap averages can be extended in a variety of different di-\nrections. Besides the average generalization error, one can compute its bootstrap sample\n\ufb02uctuations by introducing more complicated replica expressions. It is also straightfor-\nward to apply the approach to more complex problems in supervised learning which are\nrelated to Gaussian processes, such as GP classi\ufb01ers or Support-vector Machines. 
Since\n\nare obtained by sampling from\n\n\u0005,+7\u0002\n\n+\u001d\u0002\n\n\u0001 ,\n\n\u0003\n/\n\u0011\n\b\n\u0010\n\u0003\n\u0005\n \n\u0013\n\u0013\n+\n \n\u0013\n\u0013\n\u0005\n\u0006\n\u0005\n\u0005\n\u0005\n\u0003\n$\n\u000b\n\u0015\n\u0005\n\u0013\n\u0010\n\u0003\n\u0007\n\b\n\t\n\u0013\n\u0013\n\u0013\n\u0004\n8\n$\n\b\n\u0005\n/\n\u0010\n\u0006\n9\n1\n\u0003\n\u001b\n\u0013\n\u0010\n\u0011\n\u0013\n\u0013\n\u0013\n\u0004\n8\n\u0005\n/\n\u0010\n\u0006\n\b\n\u0005\n/\n(\n\n\u0010\n\u0006\n9\n\u0013\n\u0013\n\u0013\n \n\u0006\n\u0013\n\u0013\n\u0013\n\u0004\n8\n\u0005\n/\n(\n\n\u0010\n\u0006\n9\n\u0003\n2\n\u0002\n\u0010\n \n-\n\u0003\n$\n\u000b\n\u0003\n$\n\u000b\n\n\u0003\n/\n\n\b\n\n\u000b\n\n\u000e\n\n\u000e\n\u0002\n$\n\u0001\n\u000f\n\b\n\u0003\n/\n\fSimulation \n\nTheory } N=1000\n\ne\nc\nn\na\ni\nr\na\nV\n\n \nr\no\ni\nr\ne\nt\ns\no\nP\n\n10-1\n\n10-2\n\n0\n\n500\n\n1000\n\n1500\n\n2000\n\nSize m of Bootstrap Sample\n\nFigure 2: Bootstrap averaged posterior variance for Abalone data. Simulation (circles) and\n\ntheory (line) based on the same data set6\u0001 with\n\n$2$6$ data points.\n\nour method requires the solution of a set of variational equations of the size of the original\ntraining set, we can expect that its computational complexity should be similar to the one\nneeded for making the actual predictions with the basic model. This will also apply to the\nproblem of very large datasets, where one may use a variety of well known sparse approx-\nimations (see e.g. [9] and references therein). It will also be important to assess the quality\nof the approximation introduced by the variational method and compare it to alternative\napproximation techniques in the computation of the replica average (11), such as the mean\n\ufb01eld method and its more complex generalizations (see e.g. [10]).\n\nAcknowledgement\n\nWe would like to thank Lars Kai Hansen for stimulating discussions. 
DM thanks the Copen-\nhagen Image and Signal Processing Graduate School for \ufb01nancial support.\n\nAppendix: Variational equations\n\nFor reference, we will give the explicit form of the equations for variational and order\n\n$ . The derivations will be given elsewhere. We obtain\n\nThe order parameter equations Eqs.(20-22) must be solved together with the variational\nequations which are given by\n\n\u0005,+\n\u0005,+\n\nparameters in the limit\u0003\r\f\n\u0005,+\u001d\u0002\n\u0005\u0007+\n\u0005,+7\u0002\nwhere the matrix\u000f\n\u0005,+\nwhere\u0011\n\n#%$\n#%$\n\nis given by\n\n\u0005,+7\u0002\n\u0005,+\n(\u0003\u0002\u0005\u0004\nis the kernel matrix. Finally \u0002\n\u0010.\u0010\n\n\u0003\u0001\n\u0005,+\u001d\u0002\n\n+\u001d\u0002\n\n\u0005\u0007+\n\n\u0005\u0007+\n\n\u0005\u0007+\n\n(20)\n\n(21)\n\n(22)\n\n(23)\n\n\u0005,+\n\n\u0010.\u0010 .\n\n\u0003\n/\n\t\n\u0010\n\u0003\n \n/\n\n\u000e\n\"\n\u0001\n\u000f\n\b\n+\n#\n\u0010\n2\n\u0005\n#\n\u0010\n\n\u0002\n\b\n+\n\u0013\n\u0010\n\u0003\n \n/\n\n\u000e\n\"\n\u0001\n\u000f\n\u0002\n\b\n+\n#\n\u0010\n\u000f\n\u0013\n\b\n+\n#\n\u0010\n2\n\u0004\n#\n\u0010\n\b\n+\n#\n\u0010\n\u000f\n\u0011\n1\n\u0001\n1\n\u0001\n\u0002\n#\n\u0003\n\u0011\n\u0002\n\b\n+\n#\n\u0010\n\u0002\n#\n\u0003\n\u0001\n\u000e\n\u0013\n\u0002\n#\n\u0005\n2\n\u0004\n\u0001\n\u0002\n\u0010\n \n2\n\u0004\n\u0002\n\u0006\n2\n\u0004\n\u0002\n\u0010\n\u0003\n\u0001\n\u0005\n1\n\u0005\n(\n\u000f\n\b\n\fwith \u0006\n\n\u0005,+\n\u0005\u0007+\n\u0005,+\n\u0005\u0007+\u001d\u0002)\u0010\n\u0005\u0007+\u001d\u0002\n+7\u0002\n\u0010.\u0010 .\n\u0005\u0007+\u001d\u0002\n\u0005,+7\u0002\n\u0005\u0007+\u001d\u0002\n\u0010 . Its iterative solution (based\nCombining Eqs.(22) and (23), a self consistent matrix equation\u000f\n\u0010 ) requires usually only a few iterations. 
The order\n\u0005\u0007+\nobtained where \u0002 depends on the diagonal elements\u000f\n+\u001d\u0002\n\u0010 and\n\n\u0005\u0007+\u001d\u0002\n\u0010 can then be solved subsequently using Eq.(20,21) with\non a good initial guess for\u000f\nparameters\t\n\n\u0005,+\u001d\u0002\n\u0005\u0007+\n\n\u0005,+7\u0002\n\n(24,25).\n\n(24)\n\n(25)\n\nis\n\n\u0010%\u0010\n\n\u0005\u0001\n\n\u0002'\u0010\n\nReferences\n\n[1] A. Engel and C. Van den Broeck, Statistical Mechanics of Learning (Cambridge University Press, 2001).\n\n[2] H. Nishimori, Statistical Physics of Spin Glasses and Information Processing (Oxford Science Publications, 2001).\n\n[3] B. Efron and R. J. Tibshirani, An Introduction to the Bootstrap, Monographs on Statistics and Applied Probability 57 (Chapman & Hall, 1993).\n\n[4] M. M\u00e9zard, G. Parisi, and M. A. Virasoro, Spin Glass Theory and Beyond, Lecture Notes in Physics 9 (World Scientific, 1987).\n\n[5] J. Shao and D. Tu, The Jackknife and Bootstrap, Springer Series in Statistics (Springer Verlag, 1995).\n\n[6] D. Malzahn and M. Opper, A variational approach to learning curves, NIPS 14, Editors: T.G. Dietterich, S. Becker, Z. Ghahramani, (MIT Press, 2002).\n\n[7] R. Neal, Bayesian Learning for Neural Networks, Lecture Notes in Statistics 118 (Springer, 1996).\n\n[8] R. P. Feynman and A. R. Hibbs, Quantum Mechanics and Path Integrals (McGraw-Hill Inc., 1965).\n\n[9] L. Csat\u00f3 and M. Opper, Sparse Gaussian Processes, Neural Computation 14, No 3, 641 - 668 (2002).\n\n[10] M. Opper and D. Saad (editors), Advanced Mean Field Methods: Theory and Practice, (MIT Press, 2001).\n\n[11] From http://www1.ics.uci.edu/~mlearn/MLSummary.html. The data set contains 4177 examples. 
We used a representative fraction (the fourth\nblock of 1000 data points from the list).\n\n2\n\u0005\n\u0002\n\u0010\n\u0003\n \n-\n\u0002\n\u0006\n2\n\u0004\n\u0002\n\u0010\n2\n\u0004\n\u0003\n \n8\n\u0005\n\t\n\u0010\n \n-\n\u0002\n\u0010\n\u0005\n(\n\n\b\n\u0010\n9\n\u0005\n\u0006\n2\n\u0004\n\u0002\n\u0005\n\u0001\n2\n\u0004\n\u0010\n\u0003\n\u0005\n2\n\u0004\n\u0001\n\u0010\n \n2\n\u0004\n\u0003\n(\n\u0011\n1\n\u0001\n\u0011\n\u0002\n\b\n+\n\u0002\n\u0002\n\b\n+\n\u0002\n\b\n\u0002\n\u0003\n\f", "award": [], "sourceid": 2185, "authors": [{"given_name": "D\u00f6rthe", "family_name": "Malzahn", "institution": null}, {"given_name": "Manfred", "family_name": "Opper", "institution": null}]}