{"title": "Generalization Performance of Some Learning Problems in Hilbert Functional Spaces", "book": "Advances in Neural Information Processing Systems", "page_first": 543, "page_last": 550, "abstract": null, "full_text": "Generalization Performance of Some Learning\n\nProblems in Hilbert Functional Spaces\n\nTong Zhang\n\nIBM T.J. Watson Research Center\n\nYorktown Heights, NY 10598\n\ntzhang@watson.ibm.com\n\nAbstract\n\nWe investigate the generalization performance of some learning prob-\nlems in Hilbert functional Spaces. We introduce a notion of convergence\nof the estimated functional predictor to the best underlying predictor, and\nobtain an estimate on the rate of the convergence. This estimate allows\nus to derive generalization bounds on some learning formulations.\n\n1 Introduction\n\nas possible:\n\nthat is problem dependent. In machine learning,\n\n\b . Usually the quality of the predictor \u0004\u000b\u0006\n\nIn machine learning, our goal is often to predict an unobserved output value  based on an\n. This requires us to estimate a functional relationship \u0003\u0002\u0005\u0004\u0007\u0006\nobserved input vector \u0001\n\u0001\t\b\nfrom a set of example pairs of \u0006\n\u0001\t\b can be\n\u0001\u0007\n\n\u0006\r\u0004\u0007\u0006\n\u0001\u000e\b\u000f\n\nmeasured by a loss function \f\nwe assume that the data \u0006\n\b are drawn from an underlying distribution \u0011 which is not\n\u0001\u0010\n\nknown. Our goal is to \ufb01nd\u0004\u000b\u0006\n\u0001\u000e\b so that the expected true loss of\u0004 given below is as small\n\u0006\u0013\u0004\u0007\u0006\u0015\u0014\n\b\u0016\b\u0018\u0017\u001a\u0019\u001c\u001b\u001e\u001d \u001f\n\u0001\u000e\b from a set of training data \u0006\nIn order to estimate a good predictor\u0004\u0007\u0006\n\u0001\u0010\n\nfrom\u0011\nwe consider models that are subsets in some Hilbert functional spaces*\n\u0004\u000b\u0006\n\u0004\u0007\u0006\n, we consider models in the set ./\u001710\nthe norm in *\n\u0001\u000e\b324*657+\nwould like to \ufb01nd the best model in . which is given by:\n\u0006\r\u0004\u000b\u0006\n\b()\n\u0001\u000e\b\u000f\n\n!8\"O\f\nBy introducing a non-negative Lagrangian multiplier P\u0005QSR , we may rewrite the above\n\n\b randomly drawn\n. Denote by+\n+-,\n\u0001\t\b8+\n,:9\u0005;=< , where\n\nis a parameter that can be used to control the size of the underlying model family. We\n\n, it is necessary to start with a model of the functional relationship. In this paper,\n\n\b?\u0017\u001a@ ACB7DFEHG\nIKJML\n\n!\u000f\"$#&%'\f\n\n\u0019N\u001bH\u001d \u001f\n\n\u0006\u0013\u0004\u0007\u0006\n\n\u0001\t\b(\n\nproblem as:\n\n\u0004\u0007\u0006\u0016\u0014\n\n\b\u000f)\n\n(1)\n\n(2)\n\nWe shall only consider this equivalent formulation in this paper. 
In addition, for technical reasons, we also assume that $L(v, y)$ is a convex function of $v$.

Given $n$ training examples $(x_1, y_1), \ldots, (x_n, y_n)$, we consider the following estimation method to approximate the optimal predictor $\hat p$:

$\hat p_n = \arg\min_{p \in H} \Big[ \frac{1}{n}\sum_{i=1}^n L(p(x_i), y_i) + \frac{\lambda}{2}\|p\|_H^2 \Big]. \qquad (3)$

The goal of this paper is to show that $\|\hat p_n - \hat p\|_H \to 0$ in probability as $n \to \infty$ under appropriate regularity conditions. Furthermore, we obtain an estimate on the rate of the convergence. Consequences of this result in some specific learning formulations are examined.
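To make the estimation method (3) concrete, the following sketch (not from the original paper; the RBF kernel, the function names, and all parameter values are illustrative assumptions) solves (3) for the squared loss $L(v, y) = (v - y)^2$ in a reproducing kernel Hilbert space, where the minimizer has the finite representation $\hat p_n = \sum_j \alpha_j k(x_j, \cdot)$ and the coefficients solve a linear system:

# A minimal sketch of (3) for squared loss in an RKHS; hypothetical helper
# names, not the paper's notation. With p = sum_j alpha_j k(x_j, .), the
# objective (1/n)||K a - y||^2 + (lam/2) a' K a is minimized by any a with
# ((2/n) K + lam I) a = (2/n) y (after cancelling a common factor of K).
import numpy as np

def rbf_kernel(X1, X2, bandwidth=1.0):
    # k(x, x') = exp(-||x - x'||^2 / (2 bandwidth^2)); q_x is k(x, .) here
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * bandwidth ** 2))

def fit(X, y, lam=0.1):
    n = len(X)
    K = rbf_kernel(X, X)
    return np.linalg.solve((2.0 / n) * K + lam * np.eye(n), (2.0 / n) * y)

def predict(alpha, X_train, X_test):
    return rbf_kernel(X_test, X_train) @ alpha

As $\lambda$ grows, the fitted coefficients are pulled toward zero, mirroring the role of the Lagrangian multiplier in (2).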
2 Convergence of the estimated predictor

Assume that the input $x$ belongs to a set $X$. We make the reasonable assumption that each $p \in H$ is pointwise continuous under the $\|\cdot\|_H$ topology: for all $x \in X$, $p_k \to p$ in $H$ implies $p_k(x) \to p(x)$. This assumption is equivalent to $\sup_{\|p\|_H \le 1} |p(x)| < \infty$ for all $x \in X$. The condition implies that each data point $x$ can be regarded as a bounded linear functional $q_x$ on $H$ such that $q_x(p) = p(x)$ for all $p \in H$. Since a Hilbert space is self-dual, we can represent $q_x$ by an element in $H$; for notational simplicity, we shall also denote this element by $q_x$, defined by $p(x) = q_x \cdot p$ for all $p \in H$, where $\cdot$ denotes the inner product of $H$.

It is clear that $q_x$ can be regarded as a representing feature vector of $x$ in $H$. This representation can be computed as follows. Let $p_x = \arg\max_{\|p\|_H \le 1} p(x)$; then it is not difficult to see that $q_x = p_x(x)\, p_x$ and $\|q_x\|_H = p_x(x)$. Note that this method of computing $q_x$ is not important for the purpose of this paper.

Since $p(x) = q_x \cdot p$, $p$ can now be considered as a linear functional using the feature space representation $q_x$ of $x$ in $H$. Following [6], using the linear representation of $p(x)$, we differentiate (2) at the optimal solution $\hat p$, which leads to the following first order condition:

$\mathbf{E}_{(x,y)}\, L_1'(\hat p(x), y)\, q_x + \lambda\, \hat p = 0, \qquad (4)$

where $L_1'(v, y)$ is the derivative of $L(v, y)$ with respect to $v$ if $L$ is smooth; it denotes a subgradient (see [4]) otherwise. Since we have assumed that $L(v, y)$ is a convex function of $v$, we know that $L(v', y) - L(v, y) \ge L_1'(v, y)(v' - v)$. This implies the following inequality:

$\frac{1}{n}\sum_{i=1}^n L(\hat p_n(x_i), y_i) \ge \frac{1}{n}\sum_{i=1}^n L(\hat p(x_i), y_i) + \frac{1}{n}\sum_{i=1}^n L_1'(\hat p(x_i), y_i)\big(\hat p_n(x_i) - \hat p(x_i)\big).$

Note that we have used $p(x_i) = q_{x_i} \cdot p$ to evaluate the predictors at $x_i$. Also note that by the definition of $\hat p_n$,

$\frac{1}{n}\sum_{i=1}^n L(\hat p_n(x_i), y_i) + \frac{\lambda}{2}\|\hat p_n\|_H^2 \le \frac{1}{n}\sum_{i=1}^n L(\hat p(x_i), y_i) + \frac{\lambda}{2}\|\hat p\|_H^2.$

Therefore by comparing the above two inequalities, we obtain:

$\frac{\lambda}{2}\big(\|\hat p_n\|_H^2 - \|\hat p\|_H^2\big) \le -\Big[\frac{1}{n}\sum_{i=1}^n L_1'(\hat p(x_i), y_i)\, q_{x_i}\Big] \cdot \big(\hat p_n - \hat p\big).$

Since $\|\hat p_n\|_H^2 - \|\hat p\|_H^2 = \|\hat p_n - \hat p\|_H^2 + 2\,\hat p \cdot (\hat p_n - \hat p)$, this implies that

$\frac{\lambda}{2}\|\hat p_n - \hat p\|_H^2 \le -\Big[\frac{1}{n}\sum_{i=1}^n L_1'(\hat p(x_i), y_i)\, q_{x_i} + \lambda\, \hat p\Big] \cdot \big(\hat p_n - \hat p\big),$

and hence, by the Cauchy-Schwarz inequality,

$\|\hat p_n - \hat p\|_H \le \frac{2}{\lambda}\,\Big\|\frac{1}{n}\sum_{i=1}^n L_1'(\hat p(x_i), y_i)\, q_{x_i} - \mathbf{E}_{(x,y)}\, L_1'(\hat p(x), y)\, q_x\Big\|_H. \qquad (5)$

Note that the last step, the substitution $\lambda\,\hat p = -\mathbf{E}_{(x,y)}\, L_1'(\hat p(x), y)\, q_x$, follows from the first order condition (4). This is the only place in which the condition is used. In (5), we have already bounded the convergence of $\hat p_n$ to $\hat p$ in terms of the convergence of the empirical expectation of the random vector $\xi = L_1'(\hat p(x), y)\, q_x$ to its mean. The latter is often easier to estimate.
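The empirical analogue of the first order condition (4), namely $\frac{1}{n}\sum_i L_1'(\hat p_n(x_i), y_i)\, q_{x_i} + \lambda\, \hat p_n = 0$, can be checked numerically. The sketch below (an illustration under assumed names, using the squared loss and a linear kernel so that everything reduces to matrix algebra) verifies the coefficient form of the condition, $(2/n)(K\alpha - y) + \lambda\alpha = 0$:

# Numerical check of the empirical first order condition for (3) with
# squared loss and the linear kernel k(x, x') = x . x' (so q_x ~ x).
import numpy as np

rng = np.random.default_rng(0)
n, lam = 40, 0.5
X = rng.normal(size=(n, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)

K = X @ X.T                                   # Gram matrix of the q_{x_i}
alpha = np.linalg.solve((2.0 / n) * K + lam * np.eye(n), (2.0 / n) * y)

# residual of (2/n)(K alpha - y) + lam * alpha; ~1e-15 up to round-off
residual = (2.0 / n) * (K @ alpha - y) + lam * alpha
print(np.abs(residual).max())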
If, for example, the variance of this random vector can be bounded, then we may use the Chebyshev inequality to obtain a probability bound. In this paper, we are interested in obtaining an exponential probability bound. In order to do so, similar to the analysis in [6], we use the following form of concentration inequality, which can be found in [5], page 95:

Theorem 2.1 ([5]) Let $\xi_1, \ldots, \xi_n$ be zero-mean independent random vectors in a Hilbert space $H$. If there exist $b, M > 0$ such that for all natural numbers $l \ge 2$:

$\frac{1}{n}\sum_{i=1}^n \mathbf{E}\,\|\xi_i\|_H^l \le \frac{l!}{2}\, b^2 M^{l-2},$

then for all $\delta > 0$:

$P\Big(\Big\|\frac{1}{n}\sum_{i=1}^n \xi_i\Big\|_H \ge \delta\Big) \le 2\exp\Big(-\frac{n\delta^2}{2(b^2 + M\delta)}\Big).$

We may now use the following form of Jensen's inequality to bound the moments of the zero-mean random vector $\xi - \mathbf{E}\,\xi$:

$\mathbf{E}\,\|\xi - \mathbf{E}\,\xi\|_H^l \le 2^{l-1}\big(\mathbf{E}\,\|\xi\|_H^l + \|\mathbf{E}\,\xi\|_H^l\big) \le 2^l\, \mathbf{E}\,\|\xi\|_H^l.$

From inequality (5) and Theorem 2.1, we immediately obtain the following bound:

Theorem 2.2 If there exist $b, M > 0$ such that for all natural numbers $l \ge 2$: $\mathbf{E}\,\|L_1'(\hat p(x), y)\, q_x\|_H^l \le \frac{l!}{2}\, b^2 M^{l-2}$, then for all $\delta > 0$:

$P\Big(\|\hat p_n - \hat p\|_H \ge \frac{2\delta}{\lambda}\Big) \le 2\exp\Big(-\frac{n\delta^2}{8b^2 + 4M\delta}\Big).$

Although Theorem 2.2 is quite general, the quantities $b$ and $M$ on the right hand side of the bound depend on the optimal predictor $\hat p$, which itself requires to be estimated. In order to obtain a bound that does not require any knowledge of the true distribution $D$, we may impose the following assumptions: both $|L_1'(v, y)|$ and $\|q_x\|_H$ are bounded. Observe that $\|q_x\|_H = \sup_{\|p\|_H \le 1} p(x)$; we obtain the following result:

Corollary 2.1 Assume that $\sup_{x \in X} \sup_{\|p\|_H \le 1} p(x) \le B$. Also assume that the loss function $L(v, y)$ satisfies the condition that $|L_1'(v, y)| \le \gamma$ for all $v$ and $y$. Then for all $\delta > 0$:

$P\Big(\|\hat p_n - \hat p\|_H \ge \frac{2\delta}{\lambda}\Big) \le 2\exp\Big(-\frac{n\delta^2}{8\gamma^2 B^2 + 4\gamma B\,\delta}\Big).$
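The dimension-free character of Theorem 2.1 can be illustrated numerically: for bounded zero-mean vectors, the deviation of the empirical mean decays like $1/\sqrt{n}$ regardless of the ambient dimension. The following Monte Carlo sketch (sizes and names are our illustrative choices, not the paper's) uses vectors uniform on the unit sphere, for which the moment condition holds with $b = M = 1$:

# Monte Carlo: || (1/n) sum_i xi_i || for xi_i uniform on the unit sphere
# in R^d (a finite-dimensional stand-in for H). The average deviation is
# close to 1/sqrt(n) and does not grow with d, matching the exponential
# tail that Theorem 2.1 provides.
import numpy as np

rng = np.random.default_rng(1)
d, trials = 100, 100
for n in (50, 200, 800):
    xi = rng.normal(size=(trials, n, d))
    xi /= np.linalg.norm(xi, axis=-1, keepdims=True)   # ||xi_i|| = 1, mean 0
    dev = np.linalg.norm(xi.mean(axis=1), axis=-1)
    print(n, dev.mean(), 1.0 / np.sqrt(n))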
3 Generalization performance

We study some consequences of Corollary 2.1, which bounds the convergence rate of the estimated predictor to the best predictor.

3.1 Regression

We consider the following type of Huber's robust loss function:

$L(p(x), y) = (p(x) - y)^2$ if $|p(x) - y| \le \beta$; $L(p(x), y) = 2\beta\,|p(x) - y| - \beta^2$ if $|p(x) - y| > \beta$. $\qquad (6)$

It is clear that $L$ is continuously differentiable and $|L_1'(v, y)| \le 2\beta$ for all $v$ and $y$. It is also not hard to check that for all $v'$ and $v$:

$L(v', y) - L(v, y) \le L_1'(v, y)(v' - v) + (v' - v)^2.$
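Both properties just stated, the derivative bound $|L_1'| \le 2\beta$ and the quadratic upper bound, are easy to spot-check numerically. A small sketch (function and variable names are ours, not the paper's):

# Huber loss (6), its derivative, and a random spot-check of the two
# properties: |L'_1(v, y)| <= 2*beta, and
# L(v', y) - L(v, y) <= L'_1(v, y)(v' - v) + (v' - v)^2.
import numpy as np

BETA = 1.0

def huber(v, y, beta=BETA):
    r = np.abs(v - y)
    return np.where(r <= beta, r ** 2, 2.0 * beta * r - beta ** 2)

def huber_grad(v, y, beta=BETA):
    # derivative in v: 2(v - y), clipped to [-2 beta, 2 beta]
    return np.clip(2.0 * (v - y), -2.0 * beta, 2.0 * beta)

rng = np.random.default_rng(2)
v, vp, y = rng.uniform(-5.0, 5.0, size=(3, 100000))
assert np.all(np.abs(huber_grad(v, y)) <= 2.0 * BETA)
gap = huber(vp, y) - huber(v, y) - huber_grad(v, y) * (vp - v) - (vp - v) ** 2
assert gap.max() <= 1e-9
print("derivative bound and quadratic upper bound hold on all samples")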
Using the quadratic upper bound above and (4), we obtain:

$\mathbf{E}\,L(\hat p_n(x), y) - \mathbf{E}\,L(\hat p(x), y) \le \mathbf{E}\,L_1'(\hat p(x), y)\big(\hat p_n(x) - \hat p(x)\big) + \mathbf{E}\,\big(\hat p_n(x) - \hat p(x)\big)^2 = -\lambda\,\hat p \cdot (\hat p_n - \hat p) + \mathbf{E}\,\big(q_x \cdot (\hat p_n - \hat p)\big)^2.$

Since $-\hat p \cdot (\hat p_n - \hat p) = \frac{1}{2}\big(\|\hat p\|_H^2 - \|\hat p_n\|_H^2 + \|\hat p_n - \hat p\|_H^2\big)$ and $\mathbf{E}\,(q_x \cdot (\hat p_n - \hat p))^2 \le B^2\|\hat p_n - \hat p\|_H^2$, this gives the following inequality:

$\mathbf{E}\,L(\hat p_n(x), y) + \frac{\lambda}{2}\|\hat p_n\|_H^2 \le \mathbf{E}\,L(\hat p(x), y) + \frac{\lambda}{2}\|\hat p\|_H^2 + \Big(B^2 + \frac{\lambda}{2}\Big)\|\hat p_n - \hat p\|_H^2. \qquad (7)$

Using Corollary 2.1 with $\gamma = 2\beta$, we obtain the following bound:

Theorem 3.1 Using loss function (6) in (3), and assuming that $\sup_x \|q_x\|_H \le B$, we have for all $\delta > 0$, with probability of at least $1 - 2\exp(-n\delta^2/(32\beta^2 B^2 + 8\beta B\,\delta))$:

$\mathbf{E}\,L(\hat p_n(x), y) + \frac{\lambda}{2}\|\hat p_n\|_H^2 \le \mathbf{E}\,L(\hat p(x), y) + \frac{\lambda}{2}\|\hat p\|_H^2 + \Big(B^2 + \frac{\lambda}{2}\Big)\frac{4\delta^2}{\lambda^2}.$

Theorem 3.1 compares the performance of the computed function with that of the optimal predictor in (1). This style of analysis has been extensively used in the literature; for example, see [3] and references therein. In order to compare with their results, we can rewrite Theorem 3.1 in another form as: with probability of at least $1 - \eta$,

$\mathbf{E}\,L(\hat p_n(x), y) \le \inf_{\|p\|_H \le A} \mathbf{E}\,L(p(x), y) + \frac{\lambda}{2}A^2 + O\Big(\frac{\beta^2 B^2 (B^2 + \lambda)}{\lambda^2}\,\frac{\ln(2/\eta)}{n}\Big).$

It is clear that the right hand side of the above inequality does not depend on the unobserved optimal predictor $\hat p$.

In [3], the authors employed a covering number analysis which led to a bound of the form (for squared loss)

$\mathbf{E}\,L(\hat p_n(x), y) \le \inf_{\|p\|_H \le A} \mathbf{E}\,L(p(x), y) + O\Big(\frac{v\,\ln n + \ln(1/\eta)}{n}\Big)$
for finite dimensional problems. Note that the constant in their $O(\cdot)$ term depends on the pseudo-dimension $v$, which can be infinite for problems considered in this paper. It is possible to employ their analysis using some covering number bounds for general Hilbert spaces. However, such an analysis would have led to a result of the following form for our problems:

$\mathbf{E}\,L(\hat p_n(x), y) \le \inf_{\|p\|_H \le A} \mathbf{E}\,L(p(x), y) + O\Big(\sqrt{\frac{\ln n + \ln(1/\eta)}{n}}\Big).$

It is also interesting to compare Theorem 3.1 with the leave-one-out analysis in [7]. There, the generalization error averaged over all training examples for squared loss can be bounded as

$\mathbf{E}_S\,\mathbf{E}_{(x,y)}\,L(\hat p_n(x), y) \le \mathbf{E}_{(x,y)}\,L(\hat p(x), y) + \lambda\|\hat p\|_H^2 + O\Big(\frac{\beta^2 B^2}{\lambda n}\Big),$

where $\mathbf{E}_S$ denotes the expectation over the random training sample. This result is not directly comparable with Theorem 3.1, since the right hand side includes an extra term proportional to $\lambda\|\hat p\|_H^2$. Using the analysis in this paper, we may obtain a similar result from (7), which leads to an average bound of the form:

$\mathbf{E}_S\,\mathbf{E}_{(x,y)}\,L(\hat p_n(x), y) \le \mathbf{E}_{(x,y)}\,L(\hat p(x), y) + \lambda\|\hat p\|_H^2 + O\Big(\frac{\beta^2 B^2 (B^2 + \lambda)}{\lambda^2 n}\Big).$

It is clear that the term $O(1/(\lambda^2 n))$ resulting from our analysis is not as good as the term $O(1/(\lambda n))$ from [7]. However, the analysis in this paper leads to probability bounds, while the leave-one-out analysis in [7] only gives average bounds.
It is also worth mentioning that it is possible to refine the analysis presented in this section to obtain a probability bound which, when averaged, gives a bound with the correct term of $O(1/(\lambda n))$, rather than the $O(1/(\lambda^2 n))$ term in the current analysis. However, due to the space limitation, we shall skip this more elaborate derivation.

In addition to the above style of bounds, it is also interesting to compare the generalization performance of the computed function to the empirical error of the computed function. Such results have occurred, for example, in [1]. In order to obtain a comparable result, we may use a derivation similar to that of (7), together with the first order condition of (3), which reads:

$\frac{1}{n}\sum_{i=1}^n L_1'(\hat p_n(x_i), y_i)\, q_{x_i} + \lambda\,\hat p_n = 0.$

This leads to a bound of the form:

$\mathbf{E}\,L(\hat p_n(x), y) \le \frac{1}{n}\sum_{i=1}^n L(\hat p_n(x_i), y_i) + \Big(B^2 + \frac{\lambda}{2}\Big)\|\hat p_n - \hat p\|_H^2 + \Big[\mathbf{E}\,L(\hat p(x), y) - \frac{1}{n}\sum_{i=1}^n L(\hat p(x_i), y_i)\Big].$

Combining the above inequality and (7), we obtain the following theorem:

Theorem 3.2 Using loss function (6) in (3), and assuming that $\sup_x \|q_x\|_H \le B$, we have for all $\delta > 0$, with probability of at least $1 - 2\exp(-n\delta^2/(32\beta^2 B^2 + 8\beta B\,\delta))$:

$\mathbf{E}\,L(\hat p_n(x), y) \le \frac{1}{n}\sum_{i=1}^n L(\hat p_n(x_i), y_i) + \Big(B^2 + \frac{\lambda}{2}\Big)\frac{4\delta^2}{\lambda^2} + \Big[\mathbf{E}\,L(\hat p(x), y) - \frac{1}{n}\sum_{i=1}^n L(\hat p(x_i), y_i)\Big].$

Unlike Theorem 3.1, the bound given in Theorem 3.2 contains a term $\big[\mathbf{E}\,L(\hat p(x), y) - \frac{1}{n}\sum_i L(\hat p(x_i), y_i)\big]$ which relies on the unknown optimal predictor $\hat p$. From Theorem 3.1, we know that this term does not affect the performance of the estimated function $\hat p_n$ when compared with the performance of $\hat p$.
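The quantity controlled by Theorem 3.2, the gap between the expected loss of $\hat p_n$ and its empirical loss on the training sample, can be estimated directly by Monte Carlo with a large held-out sample standing in for the expectation. A self-contained sketch (squared loss and a linear kernel, with purely illustrative names and sizes):

# Estimate E L(p_n(x), y) - (1/n) sum_i L(p_n(x_i), y_i) for the squared
# loss instance of (3), using a large held-out sample as a proxy for the
# expectation over the unknown distribution D.
import numpy as np

rng = np.random.default_rng(3)

def sample(m):
    X = rng.normal(size=(m, 3))
    y = X @ np.array([1.0, -1.0, 2.0]) + 0.5 * rng.normal(size=m)
    return X, y

n, lam = 100, 0.5
X, y = sample(n)
K = X @ X.T
alpha = np.linalg.solve((2.0 / n) * K + lam * np.eye(n), (2.0 / n) * y)

X_hold, y_hold = sample(50000)
train_loss = np.mean((K @ alpha - y) ** 2)
hold_loss = np.mean((X_hold @ (X.T @ alpha) - y_hold) ** 2)
print(hold_loss - train_loss)    # the gap that Theorem 3.2 bounds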
In order for us to compare with the bound in [1], obtained from an algorithmic stability point of view, we make the additional assumption that $|L(\hat p(x), y)| \le c$ for all $(x, y)$; note that this assumption is also required in [1]. Using Hoeffding's inequality, we obtain that with probability of at most $\eta$:

$\mathbf{E}\,L(\hat p(x), y) - \frac{1}{n}\sum_{i=1}^n L(\hat p(x_i), y_i) \ge c\,\sqrt{\frac{\ln(1/\eta)}{2n}}.$

Together with Theorem 3.2, we have with probability of at least $1 - \eta - 2\exp(-n\delta^2/(32\beta^2 B^2 + 8\beta B\,\delta))$:

$\mathbf{E}\,L(\hat p_n(x), y) \le \frac{1}{n}\sum_{i=1}^n L(\hat p_n(x_i), y_i) + \Big(B^2 + \frac{\lambda}{2}\Big)\frac{4\delta^2}{\lambda^2} + c\,\sqrt{\frac{\ln(1/\eta)}{2n}}.$

This compares very favorably to the following bound in [1], whose deviation term carries an extra factor of order $\beta^2 B^2/\lambda$:

$\mathbf{E}\,L(\hat p_n(x), y) \le \frac{1}{n}\sum_{i=1}^n L(\hat p_n(x_i), y_i) + O\Big(\frac{\beta^2 B^2}{\lambda n}\Big) + O\Big(\Big(\frac{\beta^2 B^2}{\lambda} + c\Big)\sqrt{\frac{\ln(1/\eta)}{n}}\Big).$

(In [1], there was a small error after equation (11); as a result, the original bound in their paper was in a form equivalent to the one we cite here with $\sqrt{\ln(1/\eta)}$ replaced by $\ln(1/\eta)$.)

3.2 Binary classification

In binary classification, the output value $y \in \{\pm 1\}$ is a discrete variable. Given a continuous model $p(x)$, we consider the following prediction rule: predict $y = 1$ if $p(x) \ge 0$, and predict $y = -1$ otherwise. The classification error (we shall ignore the point $p(x) = 0$, which is assumed to occur rarely) is

$\mathrm{err}(p) = P\big(p(x)\,y \le 0\big).$

Unfortunately, this classification error function is not convex, and hence cannot be handled in our formulation. In fact, even in many other popular methods, such as logistic regression and support vector machines, some kind of convex formulation has to be employed. We shall thus consider the following soft-margin SVM style loss as an illustration:

$L(p(x), y) = 1 - p(x)\,y$ if $p(x)\,y \le 1$; $L(p(x), y) = 0$ if $p(x)\,y > 1$. $\qquad (8)$

Note that the separable case of this loss was investigated in [6]. In this case, $L$ is non-smooth: at $p(x)\,y = 1$, $L_1'$ denotes a subgradient rather than a gradient. We take $L_1'(v, y) = -y$ when $vy < 1$ and $L_1'(v, y) = 0$ when $vy \ge 1$, so that $|L_1'(v, y)| \le 1$.
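The loss (8), the induced classifier, and the empirical margin-percentile quantity that appears in the margin bounds below can be written down directly; the following sketch uses our own illustrative names:

# SVM-style loss (8), the sign prediction rule, and the empirical margin
# percentile (1/n) |{i : p(x_i) y_i <= theta}| used in the margin bounds.
import numpy as np

def svm_loss(v, y):
    # L(v, y) = 1 - v*y if v*y <= 1, else 0; a subgradient in v is
    # -y when v*y < 1 and 0 when v*y >= 1, so |L'_1| <= 1
    return np.maximum(0.0, 1.0 - v * y)

def classify(v):
    # predict y = 1 if p(x) >= 0 and y = -1 otherwise
    return np.where(v >= 0.0, 1, -1)

def margin_percentile(v, y, theta):
    return np.mean(v * y <= theta)

In Theorem 3.3 below, the true error $\mathrm{err}(\hat p_n)$ is bounded by this margin percentile evaluated at a threshold proportional to $B\delta/\lambda$, plus an exponentially small deviation term.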
Since $|L_1'(v, y)| \le 1$ for the loss (8), Corollary 2.1 applies with $\gamma = 1$. Moreover, if $\|\hat p_n - \hat p\|_H \le 2\delta/\lambda$, then $|\hat p_n(x) - \hat p(x)| \le \|q_x\|_H\,\|\hat p_n - \hat p\|_H \le 2B\delta/\lambda$ for all $x$. Using the standard Hoeffding's inequality, we have with probability of at most $\eta$:

$\frac{1}{n}\sum_{i=1}^n \mathbf{1}\big(\hat p(x_i)\,y_i \le \theta\big) \le P\big(\hat p(x)\,y \le \theta\big) - \sqrt{\frac{\ln(1/\eta)}{2n}}.$

When $P(\hat p(x)\,y \le \theta)$ is close to zero, it is usually better to use a different (multiplicative) form of Hoeffding's inequality, which implies that with probability of at most $\eta$:

$\frac{1}{n}\sum_{i=1}^n \mathbf{1}\big(\hat p(x_i)\,y_i \le \theta\big) \le P\big(\hat p(x)\,y \le \theta\big) - \sqrt{\frac{2\,P\big(\hat p(x)\,y \le \theta\big)\ln(1/\eta)}{n}}.$

Together with Corollary 2.1, we obtain the following margin-percentile result:
Theorem 3.3 Using loss function (8) in (3), and assuming that $\sup_x \|q_x\|_H \le B$, we have for all $\delta > 0$, with probability of at least $1 - 2\exp(-n\delta^2/(8B^2 + 4B\delta))$:

$\mathrm{err}(\hat p_n) \le P\big(\hat p(x)\,y \le 2B\delta/\lambda\big).$

We also have, with probability of at least $1 - \eta - 2\exp(-n\delta^2/(8B^2 + 4B\delta))$:

$\mathrm{err}(\hat p_n) \le \frac{1}{n}\sum_{i=1}^n \mathbf{1}\big(\hat p_n(x_i)\,y_i \le 4B\delta/\lambda\big) + \sqrt{\frac{\ln(1/\eta)}{2n}}.$

We may obtain from Theorem 3.3 the following result: with probability of at least $1 - 2\eta$,

$\mathrm{err}(\hat p_n) \le \frac{1}{n}\sum_{i=1}^n \mathbf{1}\big(\hat p_n(x_i)\,y_i \le \gamma(\eta)\big) + \sqrt{\frac{\ln(1/\eta)}{2n}}, \qquad \gamma(\eta) = O\Big(\frac{B^2}{\lambda}\sqrt{\frac{\ln(2/\eta)}{n}}\Big),$

where $\gamma(\eta)$ is obtained by choosing $\delta$ so that $2\exp(-n\delta^2/(8B^2 + 4B\delta)) \le \eta$.

It is interesting to compare this result with margin percentile style bounds from VC analysis. For example, Theorem 4.19 in [2] implies that there exists a constant $C$ such that with probability of at least $1 - \eta$, for all margins $\gamma > 0$:

$\mathrm{err}(\hat p_n) \le \frac{1}{n}\sum_{i=1}^n \mathbf{1}\big(\hat p_n(x_i)\,y_i \le \gamma\big) + \sqrt{\frac{C}{n}\Big(\frac{B^2\|\hat p_n\|_H^2}{\gamma^2}\ln^2 n + \ln\frac{1}{\eta}\Big)}.$

We can see that if we assume that $\lambda$ is small and the margin $\gamma = \gamma(\eta)$ is also small, then the above bound with this choice of $\gamma$ is inferior to the bound in Theorem 3.3. Clearly, this implies that our analysis has some advantages over VC analysis, due to the fact that we directly analyze the numerical formulation of support vector classification.

4 Conclusion

In this paper, we have introduced a notion of the convergence of the estimated predictor to the best underlying predictor for some learning problems in Hilbert functional spaces. This generalizes an earlier study in [6]. We derived generalization bounds for some regression and classification problems. We have shown that results from our analysis compare favorably with a number of earlier studies. This indicates that the concept introduced in this paper can lead to valuable insights into certain numerical formulations of learning problems.

References

[1] Olivier Bousquet and André Elisseeff. Algorithmic stability and generalization performance. In Todd K. Leen, Thomas G. Dietterich, and Volker Tresp, editors, Advances in Neural Information Processing Systems 13, pages 196-202.
MIT Press, 2001.

[2] Nello Cristianini and John Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press, 2000.

[3] Wee Sun Lee, Peter L. Bartlett, and Robert C. Williamson. The importance of convexity in learning with squared loss. IEEE Trans. Inform. Theory, 44(5):1974-1980, 1998.

[4] R. Tyrrell Rockafellar. Convex Analysis. Princeton University Press, Princeton, NJ, 1970.

[5] Vadim Yurinsky. Sums and Gaussian Vectors. Springer-Verlag, Berlin, 1995.

[6] Tong Zhang. Convergence of large margin separable linear classification. In Advances in Neural Information Processing Systems 13, pages 357-363, 2001.

[7] Tong Zhang. A leave-one-out cross validation bound for kernel methods with applications in learning. In 14th Annual Conference on Computational Learning Theory, pages 427-443, 2001.", "award": [], "sourceid": 2136, "authors": [{"given_name": "T.", "family_name": "Zhang", "institution": null}]}