{"title": "Algorithmic Stability and Uniform Generalization", "book": "Advances in Neural Information Processing Systems", "page_first": 19, "page_last": 27, "abstract": "One of the central questions in statistical learning theory is to determine the conditions under which agents can learn from experience. This includes the necessary and sufficient conditions for generalization from a given finite training set to new observations. In this paper, we prove that algorithmic stability in the inference process is equivalent to uniform generalization across all parametric loss functions. We provide various interpretations of this result. For instance, a relationship is proved between stability and data processing, which reveals that algorithmic stability can be improved by post-processing the inferred hypothesis or by augmenting training examples with artificial noise prior to learning. In addition, we establish a relationship between algorithmic stability and the size of the observation space, which provides a formal justification for dimensionality reduction methods. Finally, we connect algorithmic stability to the size of the hypothesis space, which recovers the classical PAC result that the size (complexity) of the hypothesis space should be controlled in order to improve algorithmic stability and improve generalization.", "full_text": "Algorithmic Stability and Uniform Generalization\n\nIbrahim Alabdulmohsin\n\nKing Abdullah University of Science and Technology\n\nThuwal 23955, Saudi Arabia\n\nibrahim.alabdulmohsin@kaust.edu.sa\n\nAbstract\n\nOne of the central questions in statistical learning theory is to determine the con-\nditions under which agents can learn from experience. This includes the neces-\nsary and suf\ufb01cient conditions for generalization from a given \ufb01nite training set\nto new observations. 
In this paper, we prove that algorithmic stability in the inference process is equivalent to uniform generalization across all parametric loss functions. We provide various interpretations of this result. For instance, a relationship is proved between stability and data processing, which reveals that algorithmic stability can be improved by post-processing the inferred hypothesis or by augmenting training examples with artificial noise prior to learning. In addition, we establish a relationship between algorithmic stability and the size of the observation space, which provides a formal justification for dimensionality reduction methods. Finally, we connect algorithmic stability to the size of the hypothesis space, which recovers the classical PAC result that the size (complexity) of the hypothesis space should be controlled in order to improve algorithmic stability and improve generalization.\n\n1 Introduction\n\nOne fundamental goal of any learning algorithm is to strike the right balance between underfitting and overfitting. In mathematical terms, this is often translated into two separate objectives. First, we would like the learning algorithm to produce a hypothesis that is reasonably consistent with the empirical evidence (i.e. to have a small empirical risk). Second, we would like to guarantee that the empirical risk (training error) is a valid estimate of the true unknown risk (test error). The former condition protects against underfitting while the latter condition protects against overfitting.\nThe rationale behind these two objectives can be understood if we define the generalization risk Rgen as the absolute difference between the empirical and true risks: Rgen = |Remp \u2212 Rtrue|. Then, it is elementary to observe that the true risk Rtrue is bounded from above by the sum Remp + Rgen. 
Hence, by minimizing both the empirical risk (underfitting) and the generalization risk (overfitting), one obtains an inference procedure whose true risk is minimal.\nMinimizing the empirical risk alone can be carried out using the empirical risk minimization (ERM) procedure [1] or some approximations to it. However, the generalization risk is often impossible to deal with directly. Instead, it is common practice to bound it analytically so that we can establish conditions under which it is guaranteed to be small. By establishing conditions for generalization, one hopes to design better learning algorithms that both perform well empirically and generalize well to novel observations in the future. A prominent example of such an approach is the Support Vector Machines (SVM) algorithm for binary classification [2].\nHowever, bounding the generalization risk is quite intricate because it can be approached from various angles. In fact, several methods have been proposed in the past to prove generalization bounds, including uniform convergence, algorithmic stability, Rademacher and Gaussian complexities, generic chaining bounds, the PAC-Bayesian framework, and robustness-based analysis [1, 3, 4, 5, 6, 7, 8, 9]. Concentration of measure inequalities form the building blocks of these rich theories.\nThe proliferation of generalization bounds can be understood if we look into the general setting of learning introduced by Vapnik [1]. In this setting, we have an observation space Z and a hypothesis space H. A learning algorithm, henceforth denoted L : \u222a_{m=1}^{\u221e} Z^m \u2192 H, uses a finite set of observations to infer a hypothesis H \u2208 H. In the general setting, the end-to-end inference process is influenced by three key factors: (1) the nature of the observation space Z, (2) the nature of the hypothesis space H, and (3) the details of the learning algorithm L. 
By imposing constraints on any of these three components, one may be able to derive new generalization bounds. For example, the Vapnik-Chervonenkis (VC) theory derives generalization bounds by assuming constraints on H, while stability bounds, e.g. [6, 10, 11, 12], are derived by assuming constraints on L.\nGiven that different generalization bounds can be established by imposing constraints on any of Z, H, or L, it is intriguing to ask if there exists a single view of generalization that ties all of these different components together. In this paper, we answer this question in the affirmative by establishing that algorithmic stability alone is equivalent to uniform generalization. Informally speaking, an inference process is said to generalize uniformly if the generalization risk vanishes uniformly across all bounded parametric loss functions in the limit of large training sets. A more precise definition will be presented in the sequel. We will show why constraints that are imposed on either H, Z, or L to improve uniform generalization can be interpreted as methods of improving the stability of the learning algorithm L. This is similar in spirit to a result by Kearns and Ron, who showed that having a finite VC dimension in the hypothesis space H implies a certain notion of algorithmic stability in the inference process [13]. Our statement, however, is more general as it applies to all learning algorithms that fall under Vapnik's general setting of learning, well beyond uniform convergence.\nThe rest of the paper is organized as follows. First, we review the current literature on algorithmic stability, generalization, and learnability. Then, we introduce key definitions that will be repeatedly used throughout the paper. 
Next, we prove the central theorem, which reveals that algorithmic stability is equivalent to uniform generalization, and provide various interpretations of this result afterward.\n\n2 Related Work\n\nPerhaps the two most fundamental concepts in statistical learning theory are those of learnability and generalization [12, 14]. The two concepts are distinct from each other. As will be discussed in more detail next, whereas learnability is concerned with measuring the excess risk within a hypothesis space, generalization is concerned with estimating the true risk.\nIn order to define learnability and generalization, suppose we have an observation space Z, a probability distribution of observations P(z), and a bounded stochastic loss function L(\u00b7; H) : Z \u2192 [0, 1], where H \u2208 H is an inferred hypothesis. Note that L is implicitly a function of (parameterized by) H as well. We define the true risk of a hypothesis H \u2208 H by the risk functional:\n\nRtrue(H) = E_{Z\u223cP(z)} [L(Z; H)]    (1)\n\nThen, a learning algorithm is called consistent if the true risk of its inferred hypothesis H converges to the optimal true risk within the hypothesis space H in the limit of large training sets m \u2192 \u221e. A problem is called learnable if it admits a consistent learning algorithm [14]. It has been known that learnability for supervised classification and regression problems is equivalent to uniform convergence [3, 14]. However, Shalev-Shwartz et al. recently showed that uniform convergence is not necessary in Vapnik's general setting of learning and proposed algorithmic stability as an alternative key condition for learnability [14].\nUnlike learnability, the question of generalization is concerned primarily with how representative the empirical risk Remp is of the true risk Rtrue. To elaborate, suppose we have a finite training set Sm = {Zi}_{i=1,..,m}, which comprises m i.i.d. observations Zi \u223c P(z). 
We define the empirical risk of a hypothesis H with respect to Sm by:\n\nRemp(H; Sm) = (1/m) \u2211_{Zi\u2208Sm} L(Zi; H)    (2)\n\nWe also let Rtrue(H) be the true risk as defined in Eq. (1). Then, a learning algorithm L is said to generalize if the empirical risk of its inferred hypothesis converges to its true risk as m \u2192 \u221e.\nSimilar to learnability, uniform convergence is, by definition, sufficient for generalization [1], but it is not necessary because the learning algorithm can always restrict its search space to a smaller subset of H (artificially, so to speak). By contrast, it is not known whether algorithmic stability is necessary for generalization. It has been shown that various notions of algorithmic stability can be defined that are sufficient for generalization [6, 10, 11, 12, 15, 16]. However, it is not known whether an appropriate notion of algorithmic stability can be defined that is both necessary and sufficient for generalization in Vapnik's general setting of learning. In this paper, we answer this question by showing that stability in the inference process is not only sufficient for generalization, but is, in fact, equivalent to uniform generalization, which is a notion of generalization stronger than the one traditionally considered in the literature.\n\n3 Preliminaries\n\nTo simplify the discussion, we will always assume that all sets are countable, including the observation space Z and the hypothesis space H. This is similar to the assumptions used in some previous works such as [6]. However, the main results, which are presented in Section 4, can be readily generalized. In addition, we assume that all learning algorithms are invariant to permutations of the training set. 
Hence, the order of training examples is irrelevant.\nMoreover, if X \u223c P(x) is a random variable drawn from the alphabet X and f(X) is a function of X, we write E_{X\u223cP(x)} f(X) to mean \u2211_{x\u2208X} P(x) f(x). Often, we will simply write EX f(X) to mean E_{X\u223cP(x)} f(X) if the distribution of X is clear from the context. If X takes its values from a finite set S uniformly at random, we write X \u223c S to denote this distribution of X. If X is a boolean random variable, then I{X} = 1 if and only if X is true, otherwise I{X} = 0. In general, random variables are denoted with capital letters, instances of random variables are denoted with small letters, and alphabets are denoted with calligraphic typeface. Also, given two probability mass functions P and Q defined on the same alphabet A, we will write \u27e8P, Q\u27e9 to denote the overlapping coefficient, i.e. intersection, between P and Q. That is, \u27e8P, Q\u27e9 = \u2211_{a\u2208A} min{P(a), Q(a)}. Note that \u27e8P, Q\u27e9 = 1 \u2212 ||P, Q||T, where ||P, Q||T is the total variation distance. Last, we will write B(k; \u03c6, n) = (n choose k) \u03c6^k (1 \u2212 \u03c6)^{n\u2212k} to denote the binomial distribution.\nIn this paper, we consider the general setting of learning introduced by Vapnik [1]. To reiterate, we have an observation space Z and a hypothesis space H. Our learning algorithm L receives a set of m observations Sm = {Zi}_{i=1,..,m} \u2208 Z^m generated i.i.d. from a fixed unknown distribution P(z), and picks a hypothesis H \u2208 H with probability PL(H = h|Sm). Formally, L : \u222a_{m=1}^{\u221e} Z^m \u2192 H is a stochastic map. In this paper, we allow the hypothesis H to be any summary statistic of the training set. It can be a measure of central tendency, as in unsupervised learning, or it can be a mapping from an input space to an output space, as in supervised learning. In fact, we even allow H to be a subset of the training set itself. 
In formal terms, L is a stochastic map between the two random variables H \u2208 H and Sm \u2208 Z^m, where the exact interpretation of those random variables is irrelevant.\nIn any learning task, we assume a non-negative bounded loss function L(Z; H) : Z \u2192 [0, 1] is used to measure the quality of the inferred hypothesis H \u2208 H on the observation Z \u2208 Z. Most importantly, we assume that L(\u00b7; H) : Z \u2192 [0, 1] is parametric:\nDefinition 1 (Parametric Loss Functions). A loss function L(\u00b7; H) : Z \u2192 [0, 1] is called parametric if it is independent of the training set Sm given the inferred hypothesis H. That is, a parametric loss function satisfies the Markov chain: Sm \u2192 H \u2192 L(\u00b7; H).\nFor any fixed hypothesis H \u2208 H, we define its true risk Rtrue(H) by Eq. (1), and define its empirical risk on a training set Sm, denoted Remp(H; Sm), by Eq. (2). We also define the true and empirical risks of the learning algorithm L by the expected risk of its inferred hypothesis:\n\n\u02c6Rtrue(L) = ESm E_{H\u223cPL(h|Sm)} Rtrue(H)    (3)\n\u02c6Remp(L) = ESm E_{H\u223cPL(h|Sm)} Remp(H; Sm)    (4)\n\nTo simplify notation, we will write \u02c6Rtrue and \u02c6Remp instead of \u02c6Rtrue(L) and \u02c6Remp(L). We will consider the following definition of generalization:\nDefinition 2 (Generalization). A learning algorithm L : \u222a_{m=1}^{\u221e} Z^m \u2192 H with a parametric loss function L(\u00b7; H) : Z \u2192 [0, 1] generalizes if for any distribution P(z) on Z, we have lim_{m\u2192\u221e} |\u02c6Remp \u2212 \u02c6Rtrue| = 0, where \u02c6Rtrue and \u02c6Remp are given in Eq. (3) and Eq. (4) respectively.\nIn other words, a learning algorithm L generalizes according to Definition 2 if its empirical performance (training loss) becomes an unbiased estimator of the true risk as m \u2192 \u221e. 
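To make Eqs. (1)-(4) concrete, here is a minimal Monte Carlo sketch; the setup (Bernoulli observations, the empirical mean as the hypothesis, squared error as the bounded parametric loss) is our own toy choice, not one from the paper, and the constants phi, m, and trials are arbitrary assumptions.

```python
import random

random.seed(0)
phi = 0.3     # true P(Z = 1); an assumption for this toy example
m = 50        # training-set size
trials = 2000

def loss(z, h):
    """A bounded parametric loss on Z = {0, 1}: squared error against h."""
    return (z - h) ** 2

gap = 0.0
for _ in range(trials):
    S = [1 if random.random() < phi else 0 for _ in range(m)]
    h = sum(S) / m                                      # hypothesis: empirical mean
    R_emp = sum(loss(z, h) for z in S) / m              # Eq. (2)
    R_true = phi * loss(1, h) + (1 - phi) * loss(0, h)  # Eq. (1) in closed form
    gap += R_true - R_emp
gap /= trials
print(abs(gap))  # small, and shrinks roughly like 1/m
```

Averaging over many training sets approximates the expectations in Eqs. (3) and (4); the printed gap is the bias of the training loss as an estimator of the true risk.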
Next, we define uniform generalization:\nDefinition 3 (Uniform Generalization). A learning algorithm L : \u222a_{m=1}^{\u221e} Z^m \u2192 H generalizes uniformly if for any \u03b5 > 0, there exists m0(\u03b5) > 0 such that for all distributions P(z) on Z, all parametric loss functions, and all sample sizes m > m0(\u03b5), we have |\u02c6Remp(L) \u2212 \u02c6Rtrue(L)| \u2264 \u03b5.\nUniform generalization is stronger than the original notion of generalization in Definition 2. In particular, if a learning algorithm generalizes uniformly, then it generalizes according to Definition 2 as well. The converse, however, is not true. Even though uniform generalization appears, at first sight, to be quite a strong condition, a key contribution of this paper is to show that it is not: it is equivalent to a simple condition, namely algorithmic stability.\n\n4 Main Results\n\nBefore we prove that algorithmic stability is equivalent to uniform generalization, we introduce a probabilistic notion of mutual stability between two random variables. In order to abstract away any labeling information the random variables might possess, e.g. the observation space may or may not be a metric space, we define stability by the impact of observations on probability distributions:\nDefinition 4 (Mutual Stability). Let X \u2208 X and Y \u2208 Y be two random variables. Then, the mutual stability between X and Y is defined by:\n\nS(X; Y) = \u27e8P(X) P(Y), P(X, Y)\u27e9 = EX \u27e8P(Y), P(Y|X)\u27e9 = EY \u27e8P(X), P(X|Y)\u27e9\n\nIf we recall that 0 \u2264 \u27e8P, Q\u27e9 \u2264 1 is the overlapping coefficient between the two probability distributions P and Q, we see that S(X; Y) given by Definition 4 is indeed a probabilistic measure of mutual stability. 
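The equivalent forms in Definition 4 are easy to verify numerically. The sketch below uses a toy joint distribution of our own choosing (not from the paper) and computes S(X; Y) both as the overlap between the product and joint pmfs and as EX ⟨P(Y), P(Y|X)⟩.

```python
from itertools import product

# A toy joint pmf P(X, Y) on {0,1} x {0,1}; any joint distribution works here.
P = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

Px = {x: P[x, 0] + P[x, 1] for x in (0, 1)}   # marginal of X
Py = {y: P[0, y] + P[1, y] for y in (0, 1)}   # marginal of Y

# S(X; Y) as the overlap between the product pmf P(X)P(Y) and the joint pmf.
S1 = sum(min(Px[x] * Py[y], P[x, y]) for x, y in product((0, 1), (0, 1)))

# S(X; Y) as E_X <P(Y), P(Y|X)>, the second form in Definition 4.
S2 = sum(Px[x] * sum(min(Py[y], P[x, y] / Px[x]) for y in (0, 1))
         for x in (0, 1))

print(S1)  # equals 1.0 only when X and Y are independent
```

For this joint pmf both forms give the same value (0.8), and 1 − S(X; Y) is exactly the total variation distance between the joint and product distributions.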
It measures how stable the distribution of Y is before and after observing an instance of X, and vice versa. A small value of S(X; Y) means that the probability distribution of X or Y is heavily perturbed by a single observation of the other random variable. Perfect mutual stability is achieved when the two random variables are independent of each other.\nWith this probabilistic notion of mutual stability in mind, we define the stability of a learning algorithm L by the mutual stability between its inferred hypothesis and a random training example.\nDefinition 5 (Algorithmic Stability). Let L : \u222a_{m=1}^{\u221e} Z^m \u2192 H be a learning algorithm that receives a finite set of training examples Sm = {Zi}_{i=1,..,m} \u2208 Z^m drawn i.i.d. from a fixed distribution P(z). Let H \u223c PL(h|Sm) be the hypothesis inferred by L, and let Ztrn \u223c Sm be a single random training example. We define the stability of L by: S(L) = inf_{P(z)} S(H; Ztrn), where the infimum is taken over all possible distributions of observations P(z). A learning algorithm is called algorithmically stable if lim_{m\u2192\u221e} S(L) = 1.\nNote that the above definition of algorithmic stability is rather weak; it only requires that the contribution of any single training example to the overall inference process become increasingly negligible as the sample size increases. In addition, it is well-defined even if the learning algorithm is deterministic because the hypothesis H, if it is a deterministic function of an entire training set of m observations, remains a stochastic function of any individual observation. We illustrate this concept with the following example:\nExample 1. Suppose that observations Zi \u2208 {0, 1} are i.i.d. Bernoulli trials with P(Zi = 1) = \u03c6, and that the hypothesis produced by L is the empirical average H = (1/m) \u2211_{i=1}^{m} Zi. 
Because P(H = k/m | Ztrn = 1) = B(k \u2212 1; \u03c6, m \u2212 1) and P(H = k/m | Ztrn = 0) = B(k; \u03c6, m \u2212 1), it can be shown using Stirling's approximation [17] that the algorithmic stability of this learning algorithm is asymptotically given by S(L) \u223c 1 \u2212 1/\u221a(2\u03c0m), which is achieved when \u03c6 = 1/2. A more general statement will be proved later in Section 5.\nNext, we show that the notion of algorithmic stability in Definition 5 is equivalent to the notion of uniform generalization in Definition 3. Before we do that, we first state the following lemma.\nLemma 1 (Data Processing Inequality). Let A, B, and C be three random variables that satisfy the Markov chain A \u2192 B \u2192 C. Then: S(A; B) \u2264 S(A; C).\n\nProof. The proof consists of two steps.\u00b9 First, we note that because the Markov chain implies that P(C|B, A) = P(C|B), we have S(A; (B, C)) = S(A; B) by direct substitution into Definition 5. Second, similar to the information-cannot-hurt inequality in information theory [18], it can be shown that S(A; (B, C)) \u2264 S(A; C) for any random variables A, B and C. This is proved using some algebraic manipulation and the fact that the minimum of the sums is always larger than the sum of minimums, i.e. min{\u2211_i \u03b1_i, \u2211_i \u03b2_i} \u2265 \u2211_i min{\u03b1_i, \u03b2_i}. Combining both results yields S(A; B) = S(A; (B, C)) \u2264 S(A; C), which is the desired result.\n\nNow, we are ready to state the main result of this paper.\nTheorem 1. For any learning algorithm L : \u222a_{m=1}^{\u221e} Z^m \u2192 H, algorithmic stability as given in Definition 5 is both necessary and sufficient for uniform generalization (see Definition 3). 
In addition, |\u02c6Rtrue \u2212 \u02c6Remp| \u2264 1 \u2212 S(H; Ztrn) \u2264 1 \u2212 S(L), where \u02c6Rtrue and \u02c6Remp are the true and empirical risks of the learning algorithm defined in Eq. (3) and (4) respectively.\n\nProof. Here is an outline of the proof. First, because a parametric loss function L(\u00b7; H) : Z \u2192 [0, 1] is itself a random variable that satisfies the Markov chain Sm \u2192 H \u2192 L(\u00b7; H), it is not independent of Ztrn \u223c Sm. Hence, the empirical risk is given by \u02c6Remp = E_{L(\u00b7;H)} E_{Ztrn|L(\u00b7;H)} L(Ztrn; H). By contrast, the true risk is given by \u02c6Rtrue = E_{L(\u00b7;H)} E_{Ztrn\u223cP(z)} L(Ztrn; H). The difference is:\n\n\u02c6Rtrue \u2212 \u02c6Remp = E_{L(\u00b7;H)} [E_{Ztrn} L(Ztrn; H) \u2212 E_{Ztrn|L(\u00b7;H)} L(Ztrn; H)]\n\nTo sandwich the right-hand side between an upper and a lower bound, we note that if P1(z) and P2(z) are two distributions defined on the same alphabet Z and F(\u00b7) : Z \u2192 [0, 1] is a bounded loss function, then |E_{Z\u223cP1(z)} F(Z) \u2212 E_{Z\u223cP2(z)} F(Z)| \u2264 ||P1(z), P2(z)||T, where ||P, Q||T is the total variation distance. The proof of this result can be immediately deduced by considering the two regions {z \u2208 Z : P1(z) > P2(z)} and {z \u2208 Z : P1(z) < P2(z)} separately. This is, then, used to deduce the inequalities:\n\n|\u02c6Rtrue \u2212 \u02c6Remp| \u2264 1 \u2212 S(L(\u00b7; H); Ztrn) \u2264 1 \u2212 S(H; Ztrn) \u2264 1 \u2212 S(L),\n\nwhere the second inequality follows by the data processing inequality in Lemma 1, whereas the last inequality follows by definition of algorithmic stability (see Definition 5). This proves that if L is algorithmically stable, i.e. S(L) \u2192 1 as m \u2192 \u221e, then |\u02c6Rtrue \u2212 \u02c6Remp| converges to zero uniformly across all parametric loss functions. Therefore, algorithmic stability is sufficient for uniform generalization. The converse is proved by showing that for any \u03b4 > 0, there exists a bounded parametric loss and a distribution P\u03b4(z) such that 1 \u2212 S(L) \u2212 \u03b4 \u2264 |\u02c6Rtrue \u2212 \u02c6Remp| \u2264 1 \u2212 S(L). Therefore, algorithmic stability is also necessary for uniform generalization.\n\n5 Interpreting Algorithmic Stability and Uniform Generalization\n\nIn this section, we provide several interpretations of algorithmic stability and uniform generalization. In addition, we show how Theorem 1 recovers some classical results in learning theory.\n\n5.1 Algorithmic Stability and Data Processing\n\nThe relationship between algorithmic stability and data processing is presented in Lemma 1. Given the random variables A, B, and C and the Markov chain A \u2192 B \u2192 C, we always have S(A; B) \u2264 S(A; C). This presents us with qualitative insights into the design of machine learning algorithms.\nFirst, suppose we have two different hypotheses H1 and H2. We will say that H2 contains less information than H1 if the Markov chain Sm \u2192 H1 \u2192 H2 holds. For example, if observations Zi \u2208 {0, 1} are Bernoulli trials, then H1 \u2208 R can be the empirical average as given in Example 1 while H2 \u2208 {0, 1} can be the label that occurs most often in the training set. Because H2 = I{H1 \u2265 1/2}, the hypothesis H2 contains strictly less information about the original training set than H1. Formally, we have Sm \u2192 H1 \u2192 H2. In this case, H2 enjoys a better uniform generalization bound than H1 because of data processing. Intuitively, we know that such a result should hold because H2 is less tied to the original training set than H1. 
This brings us to the following remark.\n\n\u00b9Detailed proofs are available in the supplementary file.\n\nRemark 1. We can improve the uniform generalization bound (or equivalently algorithmic stability) of a learning algorithm by post-processing its inferred hypothesis H in a manner that is conditionally independent of the original training set given H.\nExample 2. Post-processing hypotheses is a common technique used in machine learning. This includes sparsifying the coefficient vector w \u2208 R^d in linear methods, where wj is set to zero if it has a small absolute magnitude. It also includes methods that have been proposed to reduce the number of support vectors in SVM by exploiting linear dependence [19]. By the data processing inequality, such methods improve algorithmic stability and uniform generalization.\n\nNeedless to mention, better generalization does not immediately translate into a smaller true risk. This is because the empirical risk itself may increase when the inferred hypothesis is post-processed independently of the original training set.\nSecond, if the Markov chain A \u2192 B \u2192 C holds, we also obtain S(A; C) \u2265 S(B; C) by applying the data processing inequality to the reverse Markov chain C \u2192 B \u2192 A. As a result, we can improve algorithmic stability by contaminating training examples with artificial noise prior to learning. This is because if \u02c6Sm is a perturbed version of a training set Sm, then Sm \u2192 \u02c6Sm \u2192 H implies that S(Ztrn; H) \u2265 S(\u02c6Ztrn; H), where Ztrn \u223c Sm and \u02c6Ztrn \u223c \u02c6Sm are random training examples drawn uniformly at random from each training set respectively. This brings us to the following remark:\nRemark 2. We can improve the algorithmic stability of a learning algorithm by introducing artificial noise to training examples, and applying the learning algorithm on the perturbed training set.\nExample 3. 
Corrupting training examples with artificial noise, such as the recent dropout method, is a popular technique in neural networks to improve generalization [20]. By the data processing inequality, such methods indeed improve algorithmic stability and uniform generalization.\n\n5.2 Algorithmic Stability and the Size of the Observation Space\n\nNext, we look into how the size of the observation space Z influences algorithmic stability. First, we start with the following definition:\nDefinition 6 (Lazy Learning). A learning algorithm L is called lazy if its hypothesis H \u2208 H is mapped one-to-one with the training set Sm, i.e. the mapping H \u2192 Sm is injective.\nA lazy learner is so named because its hypothesis is equivalent to the original training set in its information content. Hence, no learning actually takes place. One example is instance-based learning with H = Sm. Despite their simple nature, lazy learners are useful in practice. They are useful theoretical tools as well. In particular, because of the equivalence H \u2261 Sm and the data processing inequality, the algorithmic stability of a lazy learner provides a lower bound on the stability of any possible learning algorithm. Therefore, we can relate algorithmic stability (uniform generalization) to the size of the observation space by quantifying the algorithmic stability of lazy learners. Because the size of Z is usually infinite, however, we introduce the following definition of effective set size.\nDefinition 7. In a countable space Z endowed with a probability mass function P(z), the effective size of Z w.r.t. P(z) is defined by: Ess [Z; P(z)] = 1 + (\u2211_{z\u2208Z} \u221a(P(z)(1 \u2212 P(z))))\u00b2.\nAt one extreme, if P(z) is uniform over a finite alphabet Z, then Ess [Z; P(z)] = |Z|. At the other extreme, if P(z) is a Kronecker delta distribution, then Ess [Z; P(z)] = 1. 
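The effective set size of Definition 7 is straightforward to compute; the sketch below (our own check, not from the paper) verifies the two extremes just stated, together with the Bernoulli value 1 + 4φ(1 − φ) discussed alongside Theorem 2.

```python
from math import sqrt

def ess(p):
    """Effective set size (Definition 7): 1 + (sum_z sqrt(P(z)(1 - P(z))))^2."""
    return 1 + sum(sqrt(q * (1 - q)) for q in p) ** 2

uniform = [0.25] * 4            # uniform over an alphabet with |Z| = 4
delta = [1.0, 0.0, 0.0, 0.0]    # Kronecker delta distribution
phi = 0.3
bernoulli = [phi, 1 - phi]      # Bernoulli(phi) observations

print(ess(uniform))    # -> |Z| = 4 (up to floating point)
print(ess(delta))      # -> 1
print(ess(bernoulli))  # -> approx 1 + 4*phi*(1 - phi) = 1.84
```

The uniform distribution maximizes the effective size at |Z|, which is what makes the finite-alphabet Corollary 1 a worst-case specialization of Theorem 2.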
As proved next,\nthis notion of effective set size determines the rate of convergence of an empirical probability mass\nfunction to its true distribution when the distance is measured in the total variation sense. As a result,\nit allows us to relate algorithmic stability to a property of the observation space Z.\nTheorem 2. Let Z be a countable space endowed with a probability mass function P(z). Let Sm\n(cid:113) Ess [Z; P(z)]\u22121\nbe a set of m i.i.d. samples Zi \u223c P(z). De\ufb01ne PSm(z) to be the empirical probability mass func-\ntion induced by drawing samples uniformly at random from Sm. Then: ESm ||P(z), PSm(z)||T =\nm), where 1 \u2264 Ess [Z; P(z)] \u2264 |Z| is the effective size of Z (see Def-\n1 \u2212(cid:113) Ess [Z; P(z)]\u22121\nm=1 Z m \u2192 H, we have S(H; Ztrn) \u2265\n\u221a\n\u2212 o(1/\nm), where the bound is achieved by lazy learners (see De\ufb01nition 6)2.\ni.i.d. Bernoulli trials with a probability of success \u03c6 converges to the true mean at a rate of(cid:112)2\u03c6(1 \u2212 \u03c6)/(\u03c0m)\n\n2A special case of Theorem 2 was proved by de Moivre in the 1730s, who showed that the empirical mean of\n\ninition 7). In addition, for any learning algorithm L : \u222a\u221e\n\n(cid:112)P(z) (1 \u2212 P(z))(cid:1)2.\n\n2 \u03c0 m\n\n+ o(1/\n\n= 1 +(cid:0)(cid:80)\n\n.\n\nz\u2208Z\n\n\u221a\n\n2 \u03c0 m\n\n6\n\n\f(cid:88)\n(cid:88)\n\nk=1,2,...\n\n(cid:114)\n\nProof. Here is an outline of the proof. First, we know that P(Sm) =(cid:0)\n(cid:0)\u00b7\n(cid:1) is the multinomial coef\ufb01cient. Using the relation ||P, Q||T = 1\n\n\u00b7\u00b7\u00b7 , where\n2||P \u2212 Q||1, the multinomial\n\u00b7\nseries, and De Moivre\u2019s formula for the mean deviation of the binomial random variable [22], it can\nbe shown with some algebraic manipulations that:\n\n(cid:1) pm1\n\n1 pm2\n\nm1, m2, ...\n\nm\n\n2\n\nESm ||P(z), PSm(z)||T =\n\n1\nm\n\n(1 \u2212 pk)(1\u2212pk)mp1+mpk\n\nk\n\nm!\n\n(pkm)! 
((1 \u2212 pk)m \u2212 1)!\n\n(cid:114)Ess [Z; P(z)] \u2212 1\n\n2\u03c0m\n\n,\n\nUsing Stirling\u2019s approximation to the factorial [17], we obtain the simple asymptotic expression:\n\nESm ||P(z), PSm(z)||T \u223c 1\n2\n\nk=1,2,3,...\n\n2pk(1 \u2212 pk)\n\n\u03c0m\n\n= 1 \u2212\n\nwhich is tight due to the tightness of the Stirling approximation. The rest of the theorem follows\nfrom the Markov chain Sm \u2192 Sm \u2192 H, the data processing inequality, and De\ufb01nition 6.\nCorollary 1. Given the conditions of Theorem 2, if Z is in addition \ufb01nite (i.e. |Z| < \u221e), then for\n\nany learning algorithm L, we have: S(L) \u2265 1 \u2212(cid:113)|Z|\u22121\n\n\u221a\n\nm)\n\n2\u03c0m \u2212 o(1/\n\nProof. Because in a \ufb01nite observation space Z, the maximum effective set size (see De\ufb01nition 7) is\n|Z|, which is attained at the uniform distribution P(z) = 1/|Z|.\n\nIntuitively speaking, Theorem 2 and its corollary state that in order to guarantee good uniform\ngeneralization for all possible learning algorithms, the number of observations must be suf\ufb01ciently\nlarge to cover the entire effective size of the observation space Z. Needless to mention, this is\ndif\ufb01cult to achieve in practice so the algorithmic stability of machine learning algorithms must be\ncontrolled in order to guarantee a good generalization from a few empirical observations. Similarly,\nthe uniform generalization bound can be improved by reducing the effective size of the observation\nspace, such as by using dimensionality reduction methods.\n\n5.3 Algorithmic Stability and the Complexity of the Hypothesis Space\n\nFinally, we look into the hypothesis space and how it in\ufb02uences algorithmic stability. First, we look\ninto the role of the size of the hypothesis space. This is formalized in the following theorem.\nTheorem 3. Denote by H \u2208 H the hypothesis inferred by a learning algorithm L : \u222a\u221e\nm=1 Z m \u2192\nH. 
Then, the following bound on algorithmic stability always holds:

$$S(L) \;\ge\; 1 - \sqrt{\frac{H(H)}{2m}} \;\ge\; 1 - \sqrt{\frac{\log |\mathcal{H}|}{2m}},$$

where H is the Shannon entropy measured in nats (i.e. using natural logarithms).

Proof. The proof is information-theoretic. If we let I(X; Y) be the mutual information between the r.v.'s X and Y and let $S_m = \{Z_1, Z_2, \ldots, Z_m\}$ be a random choice of a training set, we have:

$$I(S_m; H) = H(S_m) - H(S_m \mid H) = \Big[\sum_{i=1}^m H(Z_i)\Big] - \Big[H(Z_1 \mid H) + H(Z_2 \mid Z_1, H) + \cdots\Big]$$

Because conditioning reduces entropy, i.e. $H(A \mid B) \le H(A)$ for any r.v.'s A and B, we have:

$$I(S_m; H) \;\ge\; \sum_{i=1}^m \big[H(Z_i) - H(Z_i \mid H)\big] = m\,\big[H(Z_{trn}) - H(Z_{trn} \mid H)\big]$$

Therefore:

$$I(Z_{trn}; H) \;\le\; \frac{I(S_m; H)}{m} \qquad (5)$$

Next, we use Pinsker's inequality [18], which states that for any probability distributions P and Q: $\|P, Q\|_T \le \sqrt{D(P \,\|\, Q)/2}$, where $\|P, Q\|_T$ is the total variation distance and $D(P \,\|\, Q)$ is the Kullback-Leibler divergence measured in nats (i.e. using natural logarithms). If we recall that $S(Z_{trn}; H) = 1 - \|P(Z_{trn})\,P(H),\, P(Z_{trn}, H)\|_T$ while the mutual information is $I(Z_{trn}; H) = D(P(Z_{trn}, H) \,\|\, P(Z_{trn})\,P(H))$, we deduce from Pinsker's inequality and Eq. (5):

$$S(Z_{trn}; H) = 1 - \|P(Z_{trn})\,P(H),\, P(Z_{trn}, H)\|_T \;\ge\; 1 - \sqrt{\frac{I(Z_{trn}; H)}{2}} \;\ge\; 1 - \sqrt{\frac{I(S_m; H)}{2m}} \;\ge\; 1 - \sqrt{\frac{H(H)}{2m}} \;\ge\; 1 - \sqrt{\frac{\log |\mathcal{H}|}{2m}}$$

In the last two steps, we used the fact that $I(X; Y) \le H(X)$ for any random variables X and Y, and that $H(H) \le \log |\mathcal{H}|$.

²(cont.) This is believed to be the first appearance of the square-root law in statistical inference in the literature [21]. Because the effective set size of the Bernoulli distribution is, according to Definition 7, equal to $1 + 4\phi(1-\phi)$, Theorem 2 agrees with, and in fact generalizes, de Moivre's result.

Theorem 3 re-establishes the classical PAC result on the finite hypothesis space [23]. In terms of algorithmic stability, a learning algorithm will enjoy high stability if the size of the hypothesis space is small. In terms of uniform generalization, it states that the generalization risk of a learning algorithm is bounded from above, uniformly across all parametric loss functions, by $\sqrt{H(H)/(2m)} \le \sqrt{\log|\mathcal{H}|/(2m)}$, where H(H) is the Shannon entropy of H.

Next, we relate algorithmic stability to the Vapnik-Chervonenkis (VC) dimension. Despite the fact that the VC dimension is defined on binary-valued functions whereas algorithmic stability is a functional of probability distributions, there exists a connection between the two concepts. To show this, we first introduce a notion of an induced concept class that exists for any learning algorithm L:

Definition 8. The concept class C induced by a learning algorithm $L : \cup_{m=1}^\infty Z^m \to \mathcal{H}$ is defined to be the set of total Boolean functions $c(z) = \mathbb{I}\{P(Z_{trn} = z \mid H) \ge P(Z_{trn} = z)\}$ for all $H \in \mathcal{H}$.

Intuitively, every hypothesis $H \in \mathcal{H}$ induces a total partition on the observation space Z given by the Boolean function in Definition 8. That is, H splits Z into two disjoint sets: the set of values in Z that are, a posteriori, less likely to have been present in the training set than before, given that the inferred hypothesis is H, and the set of all other values. The complexity (richness) of the induced concept class C is related to algorithmic stability via the VC dimension.

Theorem 4.
Let $L : \cup_{m=1}^\infty Z^m \to \mathcal{H}$ be a learning algorithm with an induced concept class C. Let $d_{VC}(C)$ be the VC dimension of C. Then, the following bound holds if $m > d_{VC}(C) + 1$:

$$S(L) \;\ge\; 1 - \frac{4 + \sqrt{d_{VC}(C)\,\big(1 + \log(2m)\big)}}{\sqrt{2m}}$$

In particular, L is algorithmically stable if its induced concept class C has a finite VC dimension.

Proof. The proof relies on the fact that algorithmic stability S(L) is bounded from below by

$$1 - \sup_{P(z)} \Big\{ \mathbb{E}_{S_m} \sup_{h \in \mathcal{H}} \big| \mathbb{E}_{Z \sim P(z)}\, c_h(Z) - \mathbb{E}_{Z \sim S_m}\, c_h(Z) \big| \Big\},$$

where $c_h(z) = \mathbb{I}\{P(Z_{trn} = z \mid H = h) \ge P(Z_{trn} = z)\}$. The final bound follows by applying uniform convergence results [23].

6 Conclusions

In this paper, we showed that a probabilistic notion of algorithmic stability is equivalent to uniform generalization. In informal terms, a learning algorithm is called algorithmically stable if the impact of a single training example on the probability distribution of the final hypothesis always vanishes at the limit of large training sets. In other words, the inference process never depends heavily on any single training example. If algorithmic stability holds, then the learning algorithm generalizes well regardless of the choice of the parametric loss function. We also provided several interpretations of this result. For instance, the relationship between algorithmic stability and data processing reveals that algorithmic stability can be improved either by post-processing the inferred hypothesis or by augmenting the training examples with artificial noise prior to learning. In addition, we established a relationship between algorithmic stability and the effective size of the observation space, which provides a formal justification for dimensionality reduction methods.
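As an aside, the effective-set-size rate of Theorem 2 is easy to verify numerically. The following Python sketch (illustrative only and not part of the paper; the uniform distribution, K = 10, m = 1000, and the trial count are arbitrary choices) Monte Carlo estimates $\mathbb{E}_{S_m}\|P(z), P_{S_m}(z)\|_T$ and compares it against $\sqrt{(\mathrm{Ess}[Z; P(z)] - 1)/(2\pi m)}$:

```python
import math
import random
from collections import Counter

def tv_distance(p, counts, m):
    """Total variation between the true pmf p and the empirical pmf of m samples."""
    return 0.5 * sum(abs(p[z] - counts.get(z, 0) / m) for z in p)

def effective_size(p):
    """Effective set size (Definition 7): 1 + (sum_z sqrt(P(z)(1 - P(z))))^2."""
    return 1.0 + sum(math.sqrt(q * (1.0 - q)) for q in p.values()) ** 2

random.seed(0)
K, m, trials = 10, 1000, 2000
p = {z: 1.0 / K for z in range(K)}  # uniform pmf, for which Ess[Z; P(z)] = K

# Monte Carlo estimate of E_Sm ||P(z), P_Sm(z)||_T
avg_tv = sum(
    tv_distance(p, Counter(random.choices(range(K), k=m)), m)
    for _ in range(trials)
) / trials

# Asymptotic rate predicted by Theorem 2
predicted = math.sqrt((effective_size(p) - 1.0) / (2.0 * math.pi * m))
print(f"empirical: {avg_tv:.4f}   predicted: {predicted:.4f}")
```

For the uniform pmf the effective size attains its maximum $|Z| = K$, so the comparison also exercises the worst case of Corollary 1; the two printed values should agree to within the $o(1/\sqrt{m})$ correction.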
Finally, we connected algorithmic stability to the complexity (richness) of the hypothesis space, which re-establishes the classical PAC result that the complexity of the hypothesis space should be controlled in order to improve stability and, hence, improve generalization.

References

[1] V. N. Vapnik, "An overview of statistical learning theory," IEEE Transactions on Neural Networks, vol. 10, September 1999.

[2] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, pp. 273-297, 1995.

[3] A. Blumer, A. Ehrenfeucht, D. Haussler, and M. K. Warmuth, "Learnability and the Vapnik-Chervonenkis dimension," Journal of the ACM (JACM), vol. 36, no. 4, pp. 929-965, 1989.

[4] M. Talagrand, "Majorizing measures: the generic chaining," The Annals of Probability, vol. 24, no. 3, pp. 1049-1103, 1996.

[5] D. A. McAllester, "PAC-Bayesian stochastic model selection," Machine Learning, vol. 51, pp. 5-21, 2003.

[6] O. Bousquet and A. Elisseeff, "Stability and generalization," The Journal of Machine Learning Research (JMLR), vol. 2, pp. 499-526, 2002.

[7] P. L. Bartlett and S. Mendelson, "Rademacher and Gaussian complexities: Risk bounds and structural results," The Journal of Machine Learning Research (JMLR), vol. 3, pp. 463-482, 2002.

[8] J.-Y. Audibert and O. Bousquet, "Combining PAC-Bayesian and generic chaining bounds," The Journal of Machine Learning Research (JMLR), vol. 8, pp. 863-889, 2007.

[9] H. Xu and S. Mannor, "Robustness and generalization," Machine Learning, vol. 86, no. 3, pp. 391-423, 2012.

[10] A. Elisseeff, M. Pontil, et al., "Leave-one-out error and stability of learning algorithms with applications," NATO-ASI Series on Learning Theory and Practice, Science Series Sub Series III: Computer and Systems Sciences, 2002.

[11] S. Kutin and P. Niyogi, "Almost-everywhere algorithmic stability and generalization error," in Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence (UAI), 2002.

[12] T. Poggio, R. Rifkin, S. Mukherjee, and P. Niyogi, "General conditions for predictivity in learning theory," Nature, vol. 428, pp. 419-422, 2004.

[13] M. Kearns and D. Ron, "Algorithmic stability and sanity-check bounds for leave-one-out cross-validation," Neural Computation, vol. 11, no. 6, pp. 1427-1453, 1999.

[14] S. Shalev-Shwartz, O. Shamir, N. Srebro, and K. Sridharan, "Learnability, stability and uniform convergence," The Journal of Machine Learning Research (JMLR), vol. 11, pp. 2635-2670, 2010.

[15] L. Devroye, L. Györfi, and G. Lugosi, A Probabilistic Theory of Pattern Recognition. Springer, 1996.

[16] V. Vapnik and O. Chapelle, "Bounds on error expectation for support vector machines," Neural Computation, vol. 12, no. 9, pp. 2013-2036, 2000.

[17] H. Robbins, "A remark on Stirling's formula," American Mathematical Monthly, pp. 26-29, 1955.

[18] T. M. Cover and J. A. Thomas, Elements of Information Theory. Wiley & Sons, 1991.

[19] T. Downs, K. E. Gates, and A. Masters, "Exact simplification of support vector solutions," JMLR, vol. 2, pp. 293-297, 2002.

[20] S. Wager, S. Wang, and P. S. Liang, "Dropout training as adaptive regularization," in NIPS, pp. 351-359, 2013.

[21] S. M. Stigler, The History of Statistics: The Measurement of Uncertainty before 1900. Harvard University Press, 1986.

[22] P. Diaconis and S. Zabell, "Closed form summation for classical distributions: Variations on a theme of de Moivre," Statistical Science, vol. 6, no. 3, pp. 284-302, 1991.

[23] S. Shalev-Shwartz and S. Ben-David, Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.