{"title": "Permutation Complexity Bound on Out-Sample Error", "book": "Advances in Neural Information Processing Systems", "page_first": 1531, "page_last": 1539, "abstract": "We define a data dependent permutation complexity for a hypothesis set \\math{\\hset}, which is similar to a Rademacher complexity or maximum discrepancy. The permutation complexity is based like the maximum discrepancy on (dependent) sampling. We prove a uniform bound on the generalization error, as well as a concentration result which means that the permutation estimate can be efficiently estimated.", "full_text": "Permutation Complexity Bound on Out-Sample Error\n\nMalik Magdon-Ismail\n\nComputer Science Department\nRensselaer Ploytechnic Institute\n\n110 8th Street, Troy, NY 12180, USA\n\nmagdon@cs.rpi.edu\n\nAbstract\n\nWe de\ufb01ne a data dependent permutation complexity for a hypothesis set H, which\nis similar to a Rademacher complexity or maximum discrepancy. The permutation\ncomplexity is based (like the maximum discrepancy) on dependent sampling. We\nprove a uniform bound on the generalization error, as well as a concentration result\nwhich means that the permutation estimate can be ef\ufb01ciently estimated.\n\n1 Introduction\n\n1\n\ni=1(1 \u2212 yih(xi)). The out-sample error eout(h) = 1\n\nAssume a standard setting with data D = {(xi, yi)}n\ni=1, where (xi, yi) are sampled iid from the joint\ndistribution p(x, y) on Rd\u00d7{\u00b11}. Let H = {h : Rd 7\u2192 {\u00b11}} be a learning model which produces\na hypothesis g \u2208 H when given D (we use g for the hypothesis returned by the learning algorithm\nand h for a generic hypothesis in H). We assume the 0-1 loss, so the in-sample error is ein(h) =\n2nPn\nE [(1 \u2212 yh(x))]; the expectation is over\nthe joint distribution p(x, y). We wish to bound eout(g). 
To do so, we will bound $|e_{\mathrm{out}}(h) - e_{\mathrm{in}}(h)|$ uniformly over $\mathcal{H}$ for all distributions $p(x, y)$; however, the bound itself will depend on the data, and hence the distribution. The classic distribution independent bound is the VC-bound (Vapnik and Chervonenkis, 1971); the hope is that by taking the data into account one can get a tighter bound. The data dependent permutation complexity[1] for $\mathcal{H}$ is defined by:
\[ P_{\mathcal{H}}(n, \mathcal{D}) = \mathbb{E}_\pi\left[\max_{h \in \mathcal{H}} \frac{1}{n}\sum_{i=1}^n y_{\pi_i} h(x_i)\right]. \]
Here, $\pi$ is a uniformly random permutation of $\{1, \ldots, n\}$. $P_{\mathcal{H}}(n, \mathcal{D})$ is an intuitively plausible measure of the complexity of a model, measuring its ability to correlate with a random permutation of the target values. The difficulty in analyzing $P_{\mathcal{H}}$ is that $\{y_{\pi_i}\}$ is an ordered random sample from $y = [y_1, \ldots, y_n]$, sampled without replacement; as such, it is a dependent sample from a data driven distribution. Analogously, we may define the bootstrap complexity, using the bootstrap distribution $B$ on $y$, where each sample $y^B_i$ is independent and uniformly random over $y_1, \ldots, y_n$:
\[ B_{\mathcal{H}}(n, \mathcal{D}) = \mathbb{E}_B\left[\max_{h \in \mathcal{H}} \frac{1}{n}\sum_{i=1}^n y^B_i h(x_i)\right]. \]
When the average $y$-value $\bar{y} = 0$, the bootstrap complexity is exactly the Rademacher complexity (Bartlett and Mendelson, 2002; Fromont, 2007; Kääriäinen and Elomaa, 2003; Koltchinskii, 2001; Koltchinskii and Panchenko, 2000; Lozano, 2000; Lugosi and Nobel, 1999; Massart, 2000):
\[ R_{\mathcal{H}}(n, \mathcal{D}) = \mathbb{E}_r\left[\max_{h \in \mathcal{H}} \frac{1}{n}\sum_{i=1}^n r_i h(x_i)\right], \]
where $r$ is a random vector of i.i.d. fair $\pm 1$'s.

[1] For simplicity, we assume that $\mathcal{H}$ is closed under negation; in general, all the results hold with the complexities defined using absolute values, so for example $P_{\mathcal{H}}(n, \mathcal{D}) = \mathbb{E}_\pi\big[\max_{h \in \mathcal{H}} \big|\frac{1}{n}\sum_{i=1}^n y_{\pi_i} h(x_i)\big|\big]$.
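To make the definition concrete, $P_{\mathcal{H}}(n, \mathcal{D})$ can be estimated by Monte Carlo whenever the inner maximization is tractable. The sketch below (ours, not from the paper; the function names are hypothetical) does this for one-dimensional decision stumps, a class closed under negation:

```python
import numpy as np

def stump_sup_correlation(x, y):
    """sup over 1-D threshold stumps (and their negations) of
    |(1/n) * sum_i y_i h(x_i)|, where h(x) = sign(x - t)."""
    n = len(x)
    ys = y[np.argsort(x)]
    total = ys.sum()
    # prefix[k] = sum of the labels of the k smallest x's; the stump that
    # predicts -1 there and +1 elsewhere has correlation (total - 2*prefix[k])/n
    prefix = np.concatenate(([0.0], np.cumsum(ys)))
    return float(np.abs(total - 2.0 * prefix).max() / n)

def permutation_complexity(x, y, n_perms=200, seed=0):
    """Monte Carlo estimate of P_H(n, D): average, over random permutations
    pi, of the sup-correlation with the permuted targets y_pi."""
    rng = np.random.default_rng(seed)
    n = len(x)
    vals = [stump_sup_correlation(x, y[rng.permutation(n)])
            for _ in range(n_perms)]
    return float(np.mean(vals))
```

After sorting once, each permutation costs $O(n)$ via the prefix-sum scan, so even a few hundred Monte Carlo permutations are cheap for this class.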
The maximum discrepancy complexity measure $\Delta_{\mathcal{H}}(n, \mathcal{D})$ is similar to the Rademacher complexity, with the expectation over $r$ restricted to those $r$ satisfying $\sum_{i=1}^n r_i = 0$:
\[ \Delta_{\mathcal{H}}(n, \mathcal{D}) = \mathbb{E}_r\left[\max_{h \in \mathcal{H}} \frac{1}{n}\sum_{i=1}^n r_i y_i h(x_i)\right]. \]
When $\bar{y} = 0$, the permutation complexity is the maximum discrepancy; the permutation complexity is to the maximum discrepancy as the bootstrap complexity is to the Rademacher complexity. The permutation complexity retains a little more information about the distribution. Indeed, we prove a uniform bound very similar to the uniform bound obtained using the Rademacher complexity:

Theorem 1 With probability at least $1 - \delta$, for every $h \in \mathcal{H}$,
\[ e_{\mathrm{out}}(h) \le e_{\mathrm{in}}(h) + P_{\mathcal{H}}(n, \mathcal{D}) + 13\sqrt{\frac{1}{2n}\ln\frac{6}{\delta}}. \]

The probability in this theorem is with respect to the data distribution. The challenge in proving this theorem is to accommodate samples $(y_{\pi_i})$ constructed according to the data, and in a dependent way. Using the same proof technique, one can also obtain a similar uniform bound with the bootstrap complexity, where the samples are independent, but still constructed according to the data. The proof starts with the standard ghost sample and symmetrization argument. We then need to handle the data dependent sampling in the complexity measure, and this is done by introducing a second ghost data set to govern the sampling. The crucial aspect of sampling according to a second ghost data set is that the samples are now independent of the data; this is acceptable provided the two methods of sampling are close enough, and establishing that closeness constitutes the meat of the proof given in Section 2.2.

For a given permutation $\pi$, one can compute $\max_{h \in \mathcal{H}} \frac{1}{n}\sum_{i=1}^n y_{\pi_i} h(x_i)$ using an empirical risk minimization; however, computing the expectation over all $n!$ permutations exactly is an exponential task, which needless to say is not feasible.
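The empirical risk minimization route can be made explicit: since $e_{\mathrm{in}}(h) = \frac{1}{2}(1 - \frac{1}{n}\sum_i y_i h(x_i))$, maximizing the correlation with the permuted targets is the same as minimizing in-sample error on them. A self-contained numerical check of this identity on a small finite hypothesis class (an illustrative stand-in for $\mathcal{H}$, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 40, 16
# a finite hypothesis class: m fixed +-1 prediction vectors (rows),
# made closed under negation by stacking -H
H = rng.choice([-1.0, 1.0], size=(m, n))
H = np.vstack([H, -H])
y = rng.choice([-1.0, 1.0], size=n)

y_perm = y[rng.permutation(n)]      # permuted targets y_pi
corr = (H @ y_perm) / n             # (1/n) sum_i y_pi_i h(x_i), one entry per h
ein_perm = 0.5 * (1.0 - corr)       # in-sample error of each h on y_pi
# maximizing correlation with y_pi == minimizing in-sample error on y_pi
assert np.isclose(corr.max(), 1.0 - 2.0 * ein_perm.min())
```

So any learner that can do (approximate) empirical risk minimization on relabeled data can evaluate the inner maximum.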
Fortunately, we can establish that the permutation complexity is concentrated around its expectation, which means that in principle a single permutation suffices to compute it. Let $\pi$ be a single random permutation.

Theorem 2 For an absolute constant $c \le 6 + \sqrt{2/\ln 2}$, with probability at least $1 - \delta$,
\[ P_{\mathcal{H}}(n, \mathcal{D}) \le \sup_{h \in \mathcal{H}} \frac{1}{n}\sum_{i=1}^n y_{\pi_i} h(x_i) + c\sqrt{\frac{1}{2n}\ln\frac{3}{\delta}}. \]

The probability here is with respect to random permutations (i.e., it holds for any data set). It is easy to show concentration of the bootstrap complexity about its expectation -- this follows from McDiarmid's inequality because the samples are independent. The complication with the permutation complexity is that the samples are not independent. Nevertheless, we can show the concentration indirectly, by first relating the two complexities for any data set, and then using the concentration of the bootstrap complexity (see Section 2.3).

Empirical Results. For a single random permutation, with probability at least $1 - \delta$,
\[ e_{\mathrm{out}}(h) \le e_{\mathrm{in}}(h) + \sup_{h \in \mathcal{H}} \frac{1}{n}\sum_{i=1}^n y_{\pi_i} h(x_i) + O\left(\sqrt{\frac{1}{n}\ln\frac{1}{\delta}}\right). \]

Asymptotically, one random permutation suffices; in practice, one should average over a few. Indeed, a permutation based validation estimate for model selection has been extensively tested (see Magdon-Ismail and Mertsalov (2010) for details); for classification, this permutation estimate is the permutation complexity after removing a bias term. It outperformed LOO cross validation and the Rademacher complexity on real data. We restate those results here, comparing model selection using the permutation estimate versus the Rademacher complexity (using real data sets from the UCI Machine Learning Repository (Asuncion and Newman, 2007)).
The performance metric is the regret when compared to oracle model selection on a held out set (lower regret is better). We considered two model selection tasks: choosing the number of leaves in a decision tree, and selecting $k$ in the $k$-nearest neighbor method. The results reported here are averaged over several (10,000 or more) random splits of the data into a training set and a held out set. We define a learning episode as an empirical risk minimization on $O(n)$ data points.

                        ------- 10 Learning Episodes -------   ------ 100 Learning Episodes ------
                        Decision Trees        k-NN             Decision Trees        k-NN
Data          n         Perm.    Rad.     Perm.    Rad.        Perm.    Rad.     Perm.    Rad.
Abalone       3,132     0.02     0.02     0.09     0.12        0.02     0.02     0.04     0.04
Ionosphere    263       0.18     0.19     0.75     0.84        0.16     0.17     0.70     0.83
M.Mass        667       0.06     0.06     0.11     0.12        0.05     0.05     0.11     0.11
Parkinsons    144       0.34     0.40     0.32     0.44        0.34     0.41     0.33     0.43
Pima Ind.     576       0.07     0.07     0.12     0.15        0.07     0.07     0.11     0.14
Spambase      3,450     0.07     0.07     0.43     0.54        0.06     0.07     0.43     0.55
Transfusion   561       0.08     0.09     0.12     0.19        0.08     0.09     0.12     0.19
WDBC          426       0.24     0.37     0.33     0.50        0.23     0.34     0.34     0.51
Diffusion     2,665     0.02     0.03     0.04     0.06        0.02     0.03     0.03     0.06

The permutation complexity appears to dominate most of the time (especially when $n$ is small); and when it fails to dominate, it is as good as or only slightly worse than the Rademacher estimate. It is not surprising that as $n$ increases, the performances of the various complexities converge. Asymptotically, one can deduce several relationships between them; for example, the maximum discrepancy can be asymptotically bounded from above and below by the Rademacher complexity. Similarly (see Lemma 5), the bootstrap and permutation complexities are asymptotically equal. The small sample performance of the complexities as bounding tools is not easy to discern theoretically, which is where the empirical results come in.
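The mechanics of penalized model selection can be sketched in a few lines (our illustrative code with hypothetical helper names; the paper's experiments use decision trees and $k$-NN, not the toy finite classes here): each candidate class is scored by its ERM in-sample error plus a Monte Carlo permutation penalty, and the class with the smallest penalized score wins.

```python
import numpy as np

def perm_penalty(H, y, n_perms=100, seed=0):
    """Monte Carlo permutation complexity of a finite class H
    (rows are +-1 prediction vectors; absolute values cover negations)."""
    rng = np.random.default_rng(seed)
    n = H.shape[1]
    return float(np.mean([np.abs(H @ y[rng.permutation(n)]).max() / n
                          for _ in range(n_perms)]))

def select_model(models, y):
    """Pick the class minimizing ERM in-sample error + permutation penalty."""
    n = len(y)
    scores = []
    for H in models:
        ein = 0.5 * (1.0 - (H @ y).max() / n)   # ERM over the finite class
        scores.append(ein + perm_penalty(H, y))
    return int(np.argmin(scores)), scores
```

For nested classes the penalty is monotone by construction: the supremum over a superset dominates the supremum over a subset for every permutation.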
An intuition for why the permutation complexity performs relatively well is that it maintains more of the true data distribution. Indeed, the permutation method for validation was found to work well empirically even in regression (Magdon-Ismail and Mertsalov, 2010); however, our permutation complexity bound only applies to classification.

Open Questions. Can the permutation complexity bound be extended beyond classification to (for example) regression with bounded loss? The permutation complexity displays a bias for severely unbalanced data; can this bias be removed? We conjecture that it should be possible to get a better uniform bound in terms of $\mathbb{E}_\pi\big[\max_{h \in \mathcal{H}} \frac{1}{n}\sum_{i=1}^n (y_{\pi_i} - \bar{y}) h(x_i)\big]$.

1.1 Related Work

Out-sample error estimation has extensive coverage in both the statistics and learning communities.

(i) Statistical methods try to estimate the out-sample error asymptotically in $n$, and give consistent estimates under certain model assumptions; examples include the final prediction error (FPE) (Akaike, 1974), Generalized Cross Validation (GCV) (Craven and Wahba, 1979), and covariance-type penalties (Efron, 2004; Wang and Shen, 2006). Statistical methods tend to work well when the model is well specified. Such methods are not our primary focus.

(ii) Sampling methods, such as leave-one-out cross validation (LOO-CV), try to estimate the out-sample error directly. Cross validation is perhaps the most used validation method, dating as far back as 1931 (Larson, 1931; Wherry, 1931, 1951; Katzell, 1951; Cureton, 1951; Mosier, 1951; Stone, 1974). The permutation complexity uses a "sampled" data set on which to compute the complexity; other than this superficial similarity, the estimates are inherently different.

(iii) Bounds.
The most celebrated uniform bound on generalization error is the distribution independent bound of Vapnik and Chervonenkis (the VC-bound) (Vapnik and Chervonenkis, 1971). Since the VC-dimension may be hard to compute, empirical estimates have been suggested (Vapnik et al., 1994). The VC-bound is optimal among distribution independent bounds; however, for a particular distribution, it can be sub-optimal. Several data dependent bounds have already been mentioned, which can typically be estimated in-sample via optimization: the maximum discrepancy (Bartlett et al., 2002); Rademacher-style penalties (Bartlett and Mendelson, 2002; Fromont, 2007; Kääriäinen and Elomaa, 2003; Koltchinskii, 2001; Koltchinskii and Panchenko, 2000; Lozano, 2000; Lugosi and Nobel, 1999; Massart, 2000); and margin based bounds, for example (Shawe-Taylor et al., 1998). Generalizations to Gaussian and to symmetric, bounded variance $r$ have also been suggested (Bartlett and Mendelson, 2002; Fromont, 2007). One main application of such bounds is that any approximate estimate of the out-sample error (which satisfies some bound of the form of the permutation complexity bound) can be used for model selection, after adding a (small) penalty for the "complexity of model selection" (see Bartlett et al. (2002)). In practice, this penalty for the complexity of model selection is ignored (as in Bartlett et al. (2002)).

(iv) Permutation Methods are not new to statistics (Good, 2005; Golland et al., 2005; Wiklund et al., 2007). Golland et al. (2005) show concentration for a permutation based test of significance for the improved performance of a more complex model, using the Rademacher complexity. We directly give a uniform bound for the out-sample error in terms of a permutation complexity, answering a question posed in Golland et al. (2005), which asks whether there is a direct link between permutation statistics and generalization error.
Indeed, Magdon-Ismail and Mertsalov (2010) construct a permutation estimate for validation which they empirically test in both classification and regression problems. For classification, their estimate is related to the permutation complexity.

Most relevant to this work are Rademacher penalties and the corresponding (sampling without replacement) maximum discrepancy. Bartlett et al. (2002) give a uniform bound using the maximum discrepancy, which is in some sense a uniform bound based on sampling without replacement (dependent sampling); however, the sampling distribution is fixed, independent of the data. It is illustrative to briefly sketch the derivation of the maximum discrepancy bound. Adapting the proof in Bartlett et al. (2002), and ignoring terms which are $O\big((\frac{1}{n}\ln\frac{1}{\delta})^{1/2}\big)$, with probability at least $1 - \delta$:
\begin{align*}
e_{\mathrm{out}}(h) &\le e_{\mathrm{in}}(h) + \sup_{h \in \mathcal{H}}\{e_{\mathrm{out}}(h) - e_{\mathrm{in}}(h)\} \\
&\overset{(a)}{\le} e_{\mathrm{in}}(h) + \mathbb{E}_{\mathcal{D}}\sup_{h \in \mathcal{H}}\{e_{\mathrm{out}}(h) - e_{\mathrm{in}}(h)\} \\
&\overset{(b)}{=} e_{\mathrm{in}}(h) + \mathbb{E}_{\mathcal{D}}\sup_{h \in \mathcal{H}}\left\{\mathbb{E}_{\mathcal{D}'}\left[\frac{1}{2n}\sum_{i=1}^n \big(y_i h(x_i) - y'_i h(x'_i)\big)\right]\right\} \\
&\overset{(c)}{\le} e_{\mathrm{in}}(h) + \mathbb{E}_{\mathcal{D},\mathcal{D}'}\max_{h \in \mathcal{H}}\left\{\frac{1}{2n}\sum_{i=1}^n \big(y_i h(x_i) - y'_i h(x'_i)\big)\right\} \\
&\overset{(d)}{\le} e_{\mathrm{in}}(h) + \mathbb{E}_{\mathcal{D},\mathcal{D}'}\max_{h \in \mathcal{H}}\left\{\frac{1}{n}\sum_{i=1}^{n/2} \big(y_i h(x_i) - y'_i h(x'_i)\big)\right\} \;=\; e_{\mathrm{in}}(h) + \mathbb{E}_{\mathcal{D}}\,\Delta_{\mathcal{H}}(n, \mathcal{D}) \\
&\overset{(e)}{\le} e_{\mathrm{in}}(h) + \Delta_{\mathcal{H}}(n, \mathcal{D}).
\end{align*}
(a) follows from McDiarmid's inequality, because $e_{\mathrm{out}}(h) - e_{\mathrm{in}}(h)$ is stable to a single point perturbation for every $h$, hence the supremum is also stable; in (b) a ghost data set appears, and (c) follows by convexity of the supremum; in (d) we break the sum into two equal parts, which adds the factor of two; finally, (e) follows again by McDiarmid's inequality, because $\Delta_{\mathcal{H}}$ is stable to single point perturbations.
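The restricted expectation defining $\Delta_{\mathcal{H}}$ can be sampled directly: a uniform $r$ with $\sum_i r_i = 0$ is just a random arrangement of $n/2$ plus ones and $n/2$ minus ones. A minimal Monte Carlo sketch (ours, with a hypothetical function name) for a finite class represented by its $\pm 1$ prediction vectors:

```python
import numpy as np

def max_discrepancy(H, y, n_draws=200, seed=0):
    """Monte Carlo estimate of Delta_H(n, D): expectation, over +-1 vectors r
    with sum(r) = 0, of max_h |(1/n) sum_i r_i y_i h(x_i)| for a finite class
    H whose rows are +-1 prediction vectors (abs covers negations)."""
    rng = np.random.default_rng(seed)
    n = H.shape[1]
    assert n % 2 == 0, "balanced sign vectors need even n"
    base = np.array([1.0] * (n // 2) + [-1.0] * (n // 2))
    vals = [np.abs(H @ (base[rng.permutation(n)] * y)).max() / n
            for _ in range(n_draws)]
    return float(np.mean(vals))
```

The only difference from a Rademacher estimate is the balanced sign vector; replacing `base[rng.permutation(n)]` with i.i.d. signs would give the Rademacher penalty instead.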
The discrepancy automatically drops out from using the ghost sample; this does not happen with data dependent permutation sampling, which is where the difficulty lies.

2 Permutation Complexity Uniform Bound

We now give the proof of Theorem 1. We adapt the standard ghost sample approach in VC-type proofs and the symmetrization trick of Giné and Zinn (1984), which has greatly simplified VC-style proofs. In general, high probability results are with respect to the distribution over data sets. Our main bounding tool is McDiarmid's inequality:

Lemma 1 (McDiarmid (1989)) Let $X_i \in A_i$ be independent; suppose $f : \prod_i A_i \mapsto \mathbb{R}$ satisfies
\[ \sup_{(x_1, \ldots, x_n) \in \prod_i A_i,\; z \in A_j} \big|f(x) - f(x_1, \ldots, x_{j-1}, z, x_{j+1}, \ldots, x_n)\big| \le c_j, \]
for $j = 1, \ldots, n$. Then, with probability at least $1 - \delta$,
\[ f(X_1, \ldots, X_n) \le \mathbb{E} f(X_1, \ldots, X_n) + \sqrt{\frac{1}{2}\sum_{i=1}^n c_i^2 \ln\frac{1}{\delta}}. \]

We also obtain $\mathbb{E} f \le f + \sqrt{\frac{1}{2}\sum_{i=1}^n c_i^2 \ln\frac{1}{\delta}}$ by using $-f$ in McDiarmid's inequality.

2.1 Permutation Complexity

The out-sample permutation complexity of a model is
\[ P_{\mathcal{H}}(n) = \mathbb{E}_{\mathcal{D}} P_{\mathcal{H}}(n, \mathcal{D}) = \mathbb{E}_{\mathcal{D},\pi}\left[\max_{h \in \mathcal{H}} \frac{1}{n}\sum_{i=1}^n y_{\pi_i} h(x_i)\right], \]
where the expectation is over the data $\mathcal{D} = (x_1, y_1), \ldots, (x_n, y_n)$ and a random permutation $\pi$.
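Lemma 1 is easy to exercise numerically. A minimal sketch (the helper name is ours): for the sample mean of fair $\pm 1$ variables, each coordinate has bounded difference $c_i = 2/n$ and $\mathbb{E} f = 0$, so the observed violation rate of the resulting bound should not exceed $\delta$:

```python
import numpy as np

def mcdiarmid_deviation(c, delta):
    """Deviation t with P[f >= E f + t] <= delta for a function f whose
    bounded differences are c = (c_1, ..., c_n)  (Lemma 1)."""
    c = np.asarray(c, dtype=float)
    return float(np.sqrt(0.5 * np.sum(c ** 2) * np.log(1.0 / delta)))

# sanity check on the sample mean of fair +-1 variables: changing one
# variable moves the mean by at most c_i = 2/n, and E f = 0
rng = np.random.default_rng(0)
n, trials, delta = 100, 2000, 0.1
t = mcdiarmid_deviation([2.0 / n] * n, delta)
means = rng.choice([-1.0, 1.0], size=(trials, n)).mean(axis=1)
violation_rate = float(np.mean(means > t))
```

In this special case the bound reduces to Hoeffding's inequality, so the empirical violation rate is typically far below $\delta$.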
Let $\mathcal{D}'$ differ from $\mathcal{D}$ in only one example, $(x_j, y_j) \to (x'_j, y'_j)$.

Lemma 2 $|P_{\mathcal{H}}(n, \mathcal{D}) - P_{\mathcal{H}}(n, \mathcal{D}')| \le \frac{4}{n}$.

Proof: For any permutation $\pi$ and every $h \in \mathcal{H}$, the sum $\sum_{i=1}^n y_{\pi_i} h(x_i)$ changes by at most 4 in going from $\mathcal{D}$ to $\mathcal{D}'$; thus, the maximum over $h \in \mathcal{H}$ changes by at most 4.

Lemma 2 together with McDiarmid's inequality implies a concentration of $P_{\mathcal{H}}(n, \mathcal{D})$ about $P_{\mathcal{H}}(n)$, which means we can work with $P_{\mathcal{H}}(n, \mathcal{D})$ instead of the unknown $P_{\mathcal{H}}(n)$.

Corollary 1 With probability at least $1 - \delta$, $\;P_{\mathcal{H}}(n) \le P_{\mathcal{H}}(n, \mathcal{D}) + 4\sqrt{\frac{1}{2n}\ln\frac{1}{\delta}}$.

Since $e_{\mathrm{in}}(h) = \frac{1}{2}\big(1 - \frac{1}{n}\sum_{i=1}^n y_i h(x_i)\big)$, the empirical risk minimizer $g_\pi$ on the permuted targets $y_\pi$ can be used to compute $P_{\mathcal{H}}(n, \mathcal{D})$ for a particular permutation $\pi$.

2.2 Bounding the Out-Sample Error

To bound $\sup_{h \in \mathcal{H}}\{e_{\mathrm{out}}(h) - e_{\mathrm{in}}(h)\}$, we first use the standard ghost sample and symmetrization arguments typical of modern generalization error proofs (see, for example, Bartlett and Mendelson (2002); Shawe-Taylor and Cristianini (2004)). Let $r'' = [r''_1, \ldots, r''_n]$ be a $\pm 1$ sequence.

Lemma 3 With probability at least $1 - \delta$:
\[ \sup_{h \in \mathcal{H}}\{e_{\mathrm{out}}(h) - e_{\mathrm{in}}(h)\} \le \mathbb{E}_{\mathcal{D},\mathcal{D}'}\left[\sup_{h \in \mathcal{H}} \frac{1}{2n}\sum_{i=1}^n r''_i\big(y_i h(x_i) - y'_i h(x'_i)\big)\right] + \sqrt{\frac{1}{2n}\ln\frac{1}{\delta}}. \]

Proof: We proceed as in the proof of the maximum discrepancy bound in Section 1.1:
\begin{align*}
\sup_{h \in \mathcal{H}}\{e_{\mathrm{out}}(h) - e_{\mathrm{in}}(h)\}
&\overset{(a)}{\le} \mathbb{E}_{\mathcal{D},\mathcal{D}'}\left[\sup_{h \in \mathcal{H}} \frac{1}{2n}\sum_{i=1}^n \big(y_i h(x_i) - y'_i h(x'_i)\big)\right] + \sqrt{\frac{1}{2n}\ln\frac{1}{\delta}} \\
&\overset{(b)}{=} \mathbb{E}_{\mathcal{D},\mathcal{D}'}\left[\sup_{h \in \mathcal{H}} \frac{1}{2n}\sum_{i=1}^n r''_i\big(y_i h(x_i) - y'_i h(x'_i)\big)\right] + \sqrt{\frac{1}{2n}\ln\frac{1}{\delta}}.
\end{align*}
In (a), the $O\big((\frac{1}{n}\ln\frac{1}{\delta})^{1/2}\big)$ term comes from applying McDiarmid's inequality: $e_{\mathrm{in}}(h)$ changes by at most $\frac{1}{n}$ if one data point changes, and so the supremum changes by at most that much; (b) follows because $r''_i = -1$ corresponds to exchanging $x_i$ and $x'_i$ in the expectation, which does not change the expectation (it amounts to a relabeling of random variables).

Lemma 3 holds for an arbitrary sequence $r''$ which is independent of $\mathcal{D}, \mathcal{D}'$; we can take the expectation with respect to $r''$, for arbitrarily distributed $r''$, as long as $r''$ is independent of $\mathcal{D}, \mathcal{D}'$.

2.2.1 Generating Permutations with $\pm 1$ Sequences

Fix $y$; for a given permutation $\pi$, define a corresponding $\pm 1$ sequence $r^\pi$ by $r^\pi_i = y_{\pi_i} y_i$; then $y_{\pi_i} = r^\pi_i y_i$. Thus, given $y$, for each of the $n!$ permutations $\pi_1, \ldots, \pi_{n!}$, we have a corresponding $\pm 1$ sequence $r^{\pi_k}$; we thus obtain a multiset of sequences $S_y = \{r^{\pi_1}, \ldots, r^{\pi_{n!}}\}$ (there may be repetitions, as two different permutations may result in the same sequence of $\pm 1$ values); we thus have a mapping from permutations to the $\pm 1$ sequences in $S_y$. If $r$, a random vector of $\pm 1$'s, is uniform on $S_y$, then $r \cdot y$ (componentwise product) is uniform over the permutations of $y$.
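The correspondence between permutations and $\pm 1$ sequences can be checked by brute force for tiny $n$ (the enumeration is $n!$, so this is purely illustrative; the function name is ours):

```python
import itertools
import numpy as np

def sign_sequences(y):
    """The multiset S_y: for each permutation pi of {0,...,n-1}, the
    sequence r with r_i = y_{pi(i)} * y_i.  Enumeration is n!, so this
    is feasible only for tiny n."""
    y = np.asarray(y, dtype=float)
    return [y[list(pi)] * y for pi in itertools.permutations(range(len(y)))]

y = np.array([1.0, 1.0, -1.0, -1.0])
S_y = sign_sequences(y)
for r in S_y:
    # r * y (componentwise) recovers exactly a permutation of y
    assert sorted(r * y) == sorted(y)
```

Since $y_i^2 = 1$, each product $r \cdot y$ is literally $y_\pi$, and repeated sequences in the multiset correspond to distinct permutations producing the same relabeling.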
We say that $S_y$ generates the permutations of $y$. Similarly, we can define $S_{y'}$, the generator of permutations of $y'$. Unfortunately, $S_y$ and $S_{y'}$ depend on $\mathcal{D}, \mathcal{D}'$, and so we cannot take the expectation uniformly over (for example) $r \in S_y$. We can overcome this by introducing a second ghost sample $\mathcal{D}''$ to "approximately" generate the permutations for $y, y'$, ultimately allowing us to prove the main result.

Theorem 3 With probability at least $1 - 5\delta$,
\[ \sup_{h \in \mathcal{H}}\{e_{\mathrm{out}}(h) - e_{\mathrm{in}}(h)\} \le P_{\mathcal{H}}(n) + 9\sqrt{\frac{1}{2n}\ln\frac{1}{\delta}}. \]

We obtain Theorem 1 by combining Theorem 3 with Corollary 1.

2.2.2 Proof of Theorem 3

Let $\mathcal{D}''$ be a second, independent ghost sample, and $S_{y''}$ the generator of permutations for $y''$. In Lemma 3, take the expectation over $r''$ uniform on $S_{y''}$. The first term on the RHS becomes
\[ \mathbb{E}_{\mathcal{D},\mathcal{D}',\mathcal{D}''}\, \frac{1}{n!}\sum_\pi \left[\sup_{h \in \mathcal{H}} \frac{1}{2n}\sum_{i=1}^n r''_i(\pi)\big(y_i h(x_i) - y'_i h(x'_i)\big)\right], \qquad (1) \]
where each permutation $\pi$ induces a particular sequence $r''(\pi) \in S_{y''}$ (previously we wrote $r^\pi_i$, which is now $r''_i(\pi)$). Consider the sequences $r, r'$ corresponding to the permutations of $y$ and $y'$. The next lemma will ultimately relate the expectation over permutations in the second ghost data set to the permutations over $\mathcal{D}, \mathcal{D}'$.

Lemma 4 With probability at least $1 - 2\delta$, there is a one-to-one mapping from the sequences in $S_{y''} = \{r''(\pi)\}_\pi$ to $S_y = \{r(\pi)\}_\pi$ such that
\[ \left|\frac{1}{2n}\sum_{i=1}^n \big(r''_i - r_i(r'')\big) y_i h(x_i)\right| \le \sqrt{\frac{8}{n}\ln\frac{1}{\delta}} \]
for every $r'' \in S_{y''}$ and every $h \in \mathcal{H}$ (we write $r(r'')$ to denote the sequence $r \in S_y$ to which $r''$ is mapped). Similarly, there exists such a mapping from $S_{y''}$ to $S_{y'}$.

The probability here is with respect to $y$, $y'$ and $y''$. This lemma says that the permutation generating sets $S_{y''}$, $S_{y'}$, and $S_y$ are essentially equivalent.

Proof: We can (without loss of generality) reorder the points in $\mathcal{D}''$ so that the first $k''$ are $+1$, i.e., $y''_1 = \cdots = y''_{k''} = +1$, and the remaining are $-1$. Similarly, we can order the points in $\mathcal{D}$ so that the first $k$ are $+1$, i.e., $y_1 = \cdots = y_k = +1$. We now construct the mapping from $S_{y''}$ to $S_y$ as follows. For a given permutation $\pi$, we map $r''(\pi) \in S_{y''}$ to $r(\pi) \in S_y$. This mapping is clearly bijective, since every permutation corresponds uniquely to a sequence in $S_y$ (and in $S_{y''}$).

Let $r_i = y_{\pi_i} y_i$ and $r''_i = y''_{\pi_i} y''_i$. If $r_i \ne r''_i$, then either $y_{\pi_i} \ne y''_{\pi_i}$ or $y_i \ne y''_i$. Since $y$ and $y''$ disagree in exactly $|k - k''|$ locations (and similarly for $y_\pi$ and $y''_\pi$), the number of locations where $r$ and $r''$ disagree is therefore at most $2|k - k''|$. Thus, for any $r''$ and any $h \in \mathcal{H}$,
\[ \left|\frac{1}{2n}\sum_{i=1}^n \big(r''_i - r_i(r'')\big) y_i h(x_i)\right| \le \frac{1}{2n}\sum_{i=1}^n |r''_i - r_i(r'')|\,|y_i h(x_i)| = \frac{1}{2n}\sum_{i=1}^n |r''_i - r_i(r'')| \le \frac{2|k - k''|}{n}. \]
We observe that $\sum_{i=1}^n (y_i - y''_i) = 2(k - k'')$, and so
\[ \frac{2|k - k''|}{n} = \left|\frac{1}{n}\sum_{i=1}^n (y_i - y''_i)\right| = \left|\frac{1}{n}\sum_{i=1}^n z_i\right|, \]
where $z_i = y_i - y''_i$. Since $y$ and $y''$ are identically distributed, the $z_i$ are independent and zero mean. We consider the function $f(z_1, \ldots, z_n) = \frac{1}{n}\sum_{i=1}^n z_i$. Since $z_i \in \{0, \pm 2\}$, if you change one of the $z_i$, $f$ changes by at most $\frac{4}{n}$, and so the conditions hold to apply McDiarmid's inequality to $f$. Thus, using the symmetry of the $z_i$, with probability at least $1 - 2\delta$, $\big|\frac{1}{n}\sum_{i=1}^n z_i\big| \le \sqrt{\frac{8}{n}\ln\frac{1}{\delta}}$.

Given $\mathcal{D}, \mathcal{D}', \mathcal{D}''$, assume the mappings which are known to exist by the previous lemma are $r(r'')$ and $r'(r'')$. We can rewrite the internal summand in the expression of Equation (1) using the equality
\[ r''_i\big(y_i h(x_i) - y'_i h(x'_i)\big) = \big(r''_i - r_i(r'') + r_i(r'')\big) y_i h(x_i) - \big(r''_i - r'_i(r'') + r'_i(r'')\big) y'_i h(x'_i). \]
Using Lemma 4, we can, with probability at least $1 - 2\delta$, bound the term which involves $(r''_i - r_i(r''))$ in Equation (1); and, similarly, with probability at least $1 - 2\delta$, we bound the term involving $(r''_i - r'_i(r''))$.
Thus, with probability at least $1 - 4\delta$, the expression in Equation (1) is bounded by:
\[ \mathbb{E}_{\mathcal{D},\mathcal{D}',\mathcal{D}''}\, \frac{1}{n!}\sum_\pi \left[\sup_{h \in \mathcal{H}} \frac{1}{2n}\sum_{i=1}^n \big(r_i(r'')\,y_i h(x_i) - r'_i(r'')\,y'_i h(x'_i)\big)\right] + 2\sqrt{\frac{8}{n}\ln\frac{1}{\delta}}, \]
where $r''(\pi)$ cycles through the sequences in $S_{y''}$. Since the mappings $r(r'')$ and $r'(r'')$ are one-to-one, $r(r'') \cdot y$ cycles through the permutations of $y$, and similarly for $r'(r'') \cdot y'$. Since $\mathcal{H}$ is closed under negation, we finally obtain the bound
\[ \mathbb{E}_{\mathcal{D}}\, \frac{1}{n!}\sum_\pi \left[\sup_{h \in \mathcal{H}} \frac{1}{2n}\sum_{i=1}^n y_{\pi_i} h(x_i)\right] + \mathbb{E}_{\mathcal{D}'}\, \frac{1}{n!}\sum_\pi \left[\sup_{h \in \mathcal{H}} \frac{1}{2n}\sum_{i=1}^n y'_{\pi_i} h(x'_i)\right] + 2\sqrt{\frac{8}{n}\ln\frac{1}{\delta}}; \]
each bracketed term equals $\frac{1}{2} P_{\mathcal{H}}(n)$. Using this in Lemma 3, with probability at least $1 - 5\delta$,
\[ \sup_{h \in \mathcal{H}}\{e_{\mathrm{out}}(h) - e_{\mathrm{in}}(h)\} \le P_{\mathcal{H}}(n) + 9\sqrt{\frac{1}{2n}\ln\frac{1}{\delta}}. \]

Commentary. (i) The permutation complexity bound needs empirical risk minimization, which is notoriously hard; however, if the same algorithm is used for learning as well as for computing $P_{\mathcal{H}}$, we can view it as optimization over a constrained hypothesis set (this is especially so with regularization), and the bounds then hold. (ii) The same proof technique can be used to get a bootstrap complexity bound; the result is similar. (iii) One could bound $P_{\mathcal{H}}$ for VC function classes, showing that this data dependent bound is asymptotically no worse than a VC-type bound.
Bounding permutation complexity on specific domains could follow the methods in Bartlett and Mendelson (2002).

2.3 Estimating $P_{\mathcal{H}}(n, \mathcal{D})$ Using a Single Permutation

We now prove Theorem 2, which states that one can essentially estimate $P_{\mathcal{H}}(n, \mathcal{D})$ (an average over all permutations) by $\sup_{h \in \mathcal{H}} \frac{1}{n}\sum_{i=1}^n y_{\pi_i} h(x_i)$, using just a single randomly selected permutation $\pi$. Our proof is indirect: we will link $P_{\mathcal{H}}$ to the bootstrap complexity $B_{\mathcal{H}}$. The bootstrap complexity is concentrated via an easy application of McDiarmid's inequality, which will ultimately allow us to conclude that the permutation estimate is also concentrated. The bootstrap distribution $B$ constructs a random sequence $y^B$ of $n$ independent uniform samples from $y_1, \ldots, y_n$; the key requirement is that the $y^B_i$ are independent samples. There are $n^n$ (not distinct) possible bootstrap sequences.

Lemma 5 $|B_{\mathcal{H}}(n, \mathcal{D}) - P_{\mathcal{H}}(n, \mathcal{D})| \le \frac{1}{\sqrt{n}}$.

Proof: Let $k$ be the number of $y_i$ which are $+1$; we condition on $\kappa$, the number of $+1$'s in the bootstrap sample. $B|\kappa$ samples uniformly among all sequences with $\kappa$ entries equal to $+1$. The key observation is that we can generate all samples uniformly according to $B|\kappa$ by first generating a random permutation and then selecting randomly $|k - \kappa|$ of the $+1$'s (or $-1$'s) to flip, so:
\[ B_{\mathcal{H}}(n, \mathcal{D}) = \mathbb{E}_\kappa\, \mathbb{E}_{B|\kappa}\left[\sup_{h \in \mathcal{H}} \frac{1}{n}\sum_{i=1}^n y^B_i h(x_i)\,\Bigg|\,\kappa\right], \quad \mathbb{E}_{B|\kappa}\left[\sup_{h \in \mathcal{H}} \frac{1}{n}\sum_{i=1}^n y^B_i h(x_i)\,\Bigg|\,\kappa\right] = \mathbb{E}_{F}\, \mathbb{E}_\pi\left[\sup_{h \in \mathcal{H}} \frac{1}{n}\sum_{i=1}^n y^F_{\pi_i} h(x_i)\right] \]
($F$ denotes the flipping random process.)
Since $y^F_{\pi_i}$ differs from $y_{\pi_i}$ in exactly $|k - \kappa|$ positions,
\[ \sup_{h \in \mathcal{H}} \frac{1}{n}\sum_{i=1}^n y_{\pi_i} h(x_i) - \frac{2|k - \kappa|}{n} \;\le\; \sup_{h \in \mathcal{H}} \frac{1}{n}\sum_{i=1}^n y^F_{\pi_i} h(x_i) \;\le\; \sup_{h \in \mathcal{H}} \frac{1}{n}\sum_{i=1}^n y_{\pi_i} h(x_i) + \frac{2|k - \kappa|}{n}. \]
Thus,
\[ |B_{\mathcal{H}}(n, \mathcal{D}) - P_{\mathcal{H}}(n, \mathcal{D})| \le \frac{2}{n}\,\mathbb{E}_\kappa\big[|k - \kappa|\big]. \]
Since $\mathbb{E}_\kappa[|k - \kappa|] \le \sqrt{\mathrm{Var}[k - \kappa]} \le \frac{1}{2}\sqrt{n}$ (because $\kappa$ is binomial), the result follows.

In addition to furthering our cause toward the proof of Theorem 2, Lemma 5 is interesting in its own right, because it says that permutation and bootstrap sampling are asymptotically similar. The nice thing about the bootstrap estimate is that the expectation is over independent $y^B_1, \ldots, y^B_n$. Since the bootstrap complexity changes by at most $\frac{2}{n}$ if you change one sample, by McDiarmid's inequality:

Lemma 6 For a random bootstrap sample $B$, with probability at least $1 - \delta$,
\[ B_{\mathcal{H}}(n, \mathcal{D}) \le \sup_{h \in \mathcal{H}} \frac{1}{n}\sum_{i=1}^n y^B_i h(x_i) + 2\sqrt{\frac{1}{2n}\ln\frac{1}{\delta}}. \]

We now prove concentration for estimating $P_{\mathcal{H}}(n, \mathcal{D})$. As in the proof of Lemma 5, generate $y^B$ in two steps. First generate $\kappa$, the number of $+1$'s in $y^B$; $\kappa$ is binomial. Now generate a random permutation $y_\pi$, and flip (as appropriate) a randomly selected $|k - \kappa|$ entries, where $k$ is the number of $+1$'s in $y$. If we apply McDiarmid's inequality to the function which equals the number of $+1$'s, we immediately get that with probability at least $1 - 2\delta$, $|\kappa - k| \le (\frac{1}{2} n \ln\frac{1}{\delta})^{1/2}$. Thus, with probability at least $1 - 2\delta$, $y^B$ differs from $y_\pi$ in at most $(2n \ln\frac{1}{\delta})^{1/2}$ positions.
Each flip changes the normalized supremum by at most $\frac{2}{n}$; hence, with probability at least $1 - 2\delta$,
\[ \sup_{h \in \mathcal{H}} \frac{1}{n}\sum_{i=1}^n y^B_i h(x_i) \le \sup_{h \in \mathcal{H}} \frac{1}{n}\sum_{i=1}^n y_{\pi_i} h(x_i) + 4\sqrt{\frac{1}{2n}\ln\frac{1}{\delta}}. \]
We conclude that for a random permutation $\pi$, with probability at least $1 - 3\delta$,
\[ B_{\mathcal{H}}(n, \mathcal{D}) \le \sup_{h \in \mathcal{H}} \frac{1}{n}\sum_{i=1}^n y_{\pi_i} h(x_i) + 6\sqrt{\frac{1}{2n}\ln\frac{1}{\delta}}. \]
Now, combining with Lemma 5, we obtain Theorem 2 after a little algebra, because $\delta < 1$.

We have not only established that $P_{\mathcal{H}}$ is concentrated, but we have also established a general connection between the permutation and bootstrap based estimates. In this particular case, we see that sampling with and without replacement are very closely related. In practice, sampling without replacement can be very different, because one is never in the truly asymptotic regime. Along that vein, even though we have concentration, it pays to take the average over a few permutations.

References

Akaike, H. (1974). A new look at the statistical model identification. IEEE Trans. Aut. Cont., 19, 716-723.
The estimation of prediction error: Covariance penalties and cross-validation. Journal of the American Statistical Association, 99(467), 619-632.

Fromont, M. (2007). Model selection by bootstrap penalization for classification. Machine Learning, 66(2-3), 165-207.

Giné, E. and Zinn, J. (1984). Some limit theorems for empirical processes. Annals of Prob., 12, 929-989.

Golland, P., Liang, F., Mukherjee, S., and Panchenko, D. (2005). Permutation tests for classification. Learning Theory, pages 501-515.

Good, P. (2005). Permutation, parametric, and bootstrap tests of hypotheses. Springer.

Kääriäinen, M. and Elomaa, T. (2003). Rademacher penalization over decision tree prunings. In Proc. 14th European Conference on Machine Learning, pages 193-204.

Katzell, R. A. (1951). Symposium: The need and means of cross-validation: III cross validation of item analyses. Education and Psychology Measurement, 11, 16-22.

Koltchinskii, V. (2001). Rademacher penalties and structural risk minimization. IEEE Transactions on Information Theory, 47(5), 1902-1914.

Koltchinskii, V. and Panchenko, D. (2000). Rademacher processes and bounding the risk of function learning. In E. Giné, D. Mason, and J. Wellner, editors, High Dimensional Prob. II, volume 47, pages 443-459.

Larson, S. C. (1931). The shrinkage of the coefficient of multiple correlation. Journal of Education Psychology, 22, 45-55.

Lozano, F. (2000). Model selection using Rademacher penalization. In Proc. 2nd ICSC Symp. on Neural Comp.

Lugosi, G. and Nobel, A. (1999). Adaptive model selection using empirical complexities. Annals of Statistics, 27, 1830-1864.

Magdon-Ismail, M. and Mertsalov, K. (2010). A permutation approach to validation. In Proc. 10th SIAM International Conference on Data Mining (SDM).

Massart, P. (2000).
Some applications of concentration inequalities to statistics. Annales de la Faculté des Sciences de Toulouse, X, 245-303.

McDiarmid, C. (1989). On the method of bounded differences. In Surveys in Combinatorics, pages 148-188. Cambridge University Press.

Mosier, C. I. (1951). Symposium: The need and means of cross-validation: I problem and designs of cross validation. Education and Psychology Measurement, 11, 5-11.

Shawe-Taylor, J. and Cristianini, N. (2004). Kernel Methods for Pattern Analysis. Camb. Univ. Press.

Shawe-Taylor, J., Bartlett, P. L., Williamson, R. C., and Anthony, M. (1998). Structural risk minimization over data dependent hierarchies. IEEE Transactions on Information Theory, 44, 1926-1940.

Stone, M. (1974). Cross validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society, 36(2), 111-147.

Vapnik, V. N. and Chervonenkis, A. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16, 264-280.

Vapnik, V. N., Levin, E., and Le Cun, Y. (1994). Measuring the VC-dimension of a learning machine. Neural Computation, 6(5), 851-876.

Wang, J. and Shen, X. (2006). Estimation of generalization error: random and fixed inputs. Statistica Sinica, 16, 569-588.

Wherry, R. J. (1931). A new formula for predicting the shrinkage of the multiple correlation coefficient. Annals of Mathematical Statistics, 2, 440-457.

Wherry, R. J. (1951). Symposium: The need and means of cross-validation: III comparison of cross validation with statistical inference of betas and multiple r from a single sample. Education and Psychology Measurement, 11, 23-28.

Wiklund, S., Nilsson, D., Eriksson, L., Sjostrom, M., Wold, S., and Faber, K. (2007). A randomization test for PLS component selection.
Journal of Chemometrics, 21(10-11), 427-439.