{"title": "McDiarmid-Type Inequalities for Graph-Dependent Variables and Stability Bounds", "book": "Advances in Neural Information Processing Systems", "page_first": 10890, "page_last": 10901, "abstract": "A crucial assumption in most statistical learning theory is that samples are independently and identically distributed (i.i.d.). However, for many real applications, the i.i.d. assumption does not hold. We consider learning problems in which examples are dependent and their dependency relation is characterized by a graph. To establish algorithm-dependent generalization theory for learning with non-i.i.d. data, we first prove novel McDiarmid-type concentration inequalities for Lipschitz functions of graph-dependent random variables. We show that concentration relies on the forest complexity of the graph, which characterizes the strength of the dependency. We demonstrate that for many types of dependent data, the forest complexity is small and thus implies good concentration. Based on our new inequalities we are able to build stability bounds for learning from graph-dependent data.", "full_text": "McDiarmid-Type Inequalities for Graph-Dependent\n\nVariables and Stability Bounds\n\nRui (Ray) Zhang (cid:3)\nSchool of Mathematics\n\nMonash University\n\nrui.zhang@monash.edu\n\nYuyi Wang\n\nETH Zurich, Switzerland\n\nX-Order Lab, China\n\nyuyiwang920@gmail.com\n\nXingwu Liu y\n\nInstitute of Computing Technology,\n\nChinese Academy of Sciences.\n\nUniversity of Chinese Academy of Sciences\n\nliuxingwu@ict.ac.cn\n\nLiwei Wang\n\nKey Laboratory of Machine Perception, MOE,\n\nSchool of EECS, Peking University\n\nCenter for Data Science, Peking University\n\nwanglw@cis.pku.edu.cn\n\nAbstract\n\nA crucial assumption in most statistical learning theory is that samples are inde-\npendently and identically distributed (i.i.d.). However, for many real applications,\nthe i.i.d. assumption does not hold. 
We consider learning problems in which examples are dependent and their dependency relation is characterized by a graph. To establish algorithm-dependent generalization theory for learning with non-i.i.d. data, we first prove novel McDiarmid-type concentration inequalities for Lipschitz functions of graph-dependent random variables. We show that concentration relies on the forest complexity of the graph, which characterizes the strength of the dependency. We demonstrate that for many types of dependent data, the forest complexity is small and thus implies good concentration. Based on our new inequalities, we establish stability bounds for learning from graph-dependent data.

1 Introduction

Generalization theory is at the foundation of machine learning. It quantifies how accurately a model predicts on test data that the learning algorithm cannot access during training. It usually relies on a crucial assumption: the data are independently and identically distributed (i.i.d.). The i.i.d. assumption allows one to use many powerful tools from probability to prove strong generalization error bounds. However, in real applications, the data are often non-i.i.d., i.e., the collected data can be dependent. There have been extensive discussions on why and how data are dependent; we refer the readers to [1, 2].

Establishing generalization theory for dependent data has received a lot of attention [3, 4, 5, 6, 7]. A major line of research in this direction models the data dependency by various types of mixing, such as α-mixing [8], β-mixing [9], φ-mixing [10], η-mixing [11], etc.
Mixing models have been used in statistical learning theory to establish generalization error bounds based on Rademacher complexity [4, 6, 12] or algorithmic stability [3, 12, 13], via concentration results [14] or the independent blocking technique [15].

*This work was done when this author was a master student at the Institute of Computing Technology, Chinese Academy of Sciences and University of Chinese Academy of Sciences. This research forms part of Rui (Ray) Zhang's master thesis submitted to the University of Chinese Academy of Sciences in May 2019.
†Corresponding author

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

In these models, the mixing coefficients measure the extent to which the data are dependent on each other. Similarly to the mixing models, learning under Dobrushin's condition [16] has also been investigated via concentration results [17, 18, 19] using Dobrushin's interaction matrix [20]. Although the results under the various mixing conditions and Dobrushin's condition are fruitful, they face difficulties in application: it is sometimes hard to determine the quantitative dependency among data points. On the other hand, determining whether two data points are dependent is often much easier. In this paper, we focus on such qualitative dependency of data. We use simple graphs as a natural tool to describe the dependency among data, and establish generalization theory for such graph-dependent data.

A basic building block of generalization theory is the concentration inequality. Different settings and different assumptions require different concentration tools: the less we assume, the more powerful the tools we need. In order to establish generalization theory for dependent data, standard concentration results for i.i.d. data no longer apply.
One must develop concentration inequalities for dependent data, which is a very challenging task.

In his seminal work [21], Janson proved an elegant concentration inequality for graph-dependent data. The inequality is a beautiful extension of Hoeffding's inequality: it bounds the probability that the sum of graph-dependent random variables deviates from its expected value, in terms of the fractional coloring number of the dependency graph. Janson's inequality has been extended to any function that decomposes into a sum of functions of independent random variables [22]. This extension makes it possible to establish generalization error bounds for graph-dependent data via fractional Rademacher complexity.

In [5], PAC-Bayes bounds for classification with non-i.i.d. data are obtained based on fractional colorings of graphs. The results also hold for specific learning settings such as ranking and learning from stationary β-mixing distributions. In [23], Ralaivola and Amini established new concentration inequalities for fractionally sub-additive and fractionally self-bounding functions of dependent variables. Their results are based on fractional chromatic numbers and the entropy method. In [24, 25], Wang et al. used hypergraphs to model dependent random variables that are generated from independent ones. Leveraging the notion of fractional matching, they also established concentration inequalities of Hoeffding or Bernstein type.

Though fundamental and elegant, the above generalization bounds are algorithm-independent: they consider the complexity of the hypothesis space and the data distribution, but do not involve the learning algorithm. To derive better generalization bounds, there is growing interest in developing algorithm-dependent generalization theories. This line of research relies heavily on algorithmic stability.
A key advantage of stability bounds is that they are tailored to specific learning algorithms, exploiting their particular properties.

How can we establish algorithmic stability theory for graph-dependent data? Note that even under the i.i.d. assumption, a Hoeffding-type concentration inequality, which bounds the deviation of the sample average from its expectation, is not strong enough to prove stability-based generalization. In contrast, McDiarmid's inequality characterizes the concentration of general Lipschitz functions of i.i.d. random variables, and hence serves as the key tool for proving the stability theory. Therefore, to build algorithmic stability theory for non-i.i.d. samples, one has to develop McDiarmid-type concentration for graph-dependent random variables.

In this paper, we prove the first McDiarmid-type concentration inequality for graph-dependent random variables, in terms of a new notion called forest complexity, which measures the strength of the dependency. It turns out that for various dependency graphs, the forest complexity is easy to estimate. The proposed concentration inequality enables us to prove stability-based generalization bounds for graph-dependent data. Our results provide basic tools for understanding learning with overparameterized models.

The rest of the paper is organized as follows. In Section 2, we briefly introduce the notation and related results. In Section 3, we establish McDiarmid-type inequalities for acyclic dependency graphs and extend the concentration results to general dependency graphs. In Section 4, we apply our concentration results to learning theory, establish generalization error bounds for learning from graph-dependent data via algorithmic stability, and provide an application to learning from m-dependent data.
Section 5 concludes the paper and points out future research directions. The supplementary materials can be found in [26].

2 Preliminaries

In this section, we present the notation and the basic McDiarmid inequality for i.i.d. random variables.

Throughout this paper, let n be a positive integer with [n] standing for the set {1, 2, …, n}. Let Ω_i be a Polish space for each i ∈ [n], let Ω = ∏_{i∈[n]} Ω_i be the product space, let R be the set of real numbers, R_+ the set of non-negative real numbers, and N_+ the set of non-negative integers.

Concentration inequalities are fundamental tools in statistical learning theory. They are essentially tail probability bounds indicating how much a function of random variables deviates from some value, usually its expectation. Among the most powerful ones is McDiarmid's inequality, which establishes a sharp (in some cases even tight) concentration bound when the function satisfies a c-Lipschitz condition (bounded differences condition), namely, does not depend too much on any individual variable.

Definition 2.1 (c-Lipschitz). Given a vector c = (c_1, …, c_n) ∈ R_+^n, a function f : Ω → R is said to be c-Lipschitz if for any x = (x_1, …, x_n), x′ = (x′_1, …, x′_n) ∈ Ω, it satisfies

|f(x) − f(x′)| ≤ Σ_{i=1}^n c_i · 1{x_i ≠ x′_i},

where c_i is called the i-th Lipschitz coefficient of f.

Theorem 2.2 (McDiarmid's inequality [27]). Suppose f : Ω → R is c-Lipschitz, and X = (X_1, …, X_n) is a vector of independent random variables with each X_i taking values in Ω_i.
Then for any t > 0, the tail probability satisfies

Pr(f(X) − E[f(X)] ≥ t) ≤ exp( −2t² / ‖c‖₂² ).   (1)

Notice that McDiarmid's inequality works for independent random variables. Janson's Hoeffding-type inequality [21] for graph-dependent random variables is a special case of a McDiarmid-type inequality in which the function is a sum. Specifically, when f(X) = Σ_{i=1}^n X_i with each X_i ranging over an interval of length c_i,

Pr( Σ_{i=1}^n X_i − E[Σ_{i=1}^n X_i] ≥ t ) ≤ exp( −2t² / ( χ*(G) ‖c‖₂² ) ),   (2)

where c = (c_1, …, c_n) and χ*(G) is the fractional coloring number of a dependency graph G of the random variables X.

3 McDiarmid Concentration for Graph-dependent Random Variables

In this section we present our first set of main results, the McDiarmid-type concentration inequalities (i.e., concentration of Lipschitz functions) for graph-dependent random variables. The results in this section will serve as the tools for developing learning theory for dependent data.

We start from the simplest case, in which the dependency graph is acyclic, i.e., a tree or a forest. We prove McDiarmid-type concentration bounds of very simple forms for trees and forests. These inequalities are then extended to general graphs. To this end, we introduce the notion of forest complexity, which characterizes how well a general graph can be approximated by a forest. We prove a McDiarmid-type concentration inequality for general graph-dependent random variables in terms of the forest complexity.
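As a point of reference for the bounds that follow, the i.i.d. inequality (1) is straightforward to evaluate numerically. The sketch below (our illustration, not part of the paper) takes f to be the empirical mean of n i.i.d. Bernoulli(1/2) variables, so every Lipschitz coefficient is c_i = 1/n, and compares the right-hand side of (1) with a Monte Carlo estimate of the left-hand side.

```python
import math
import random

def mcdiarmid_bound(c, t):
    """Right-hand side of (1): exp(-2 t^2 / ||c||_2^2)."""
    return math.exp(-2.0 * t * t / sum(ci * ci for ci in c))

n, t = 200, 0.1
c = [1.0 / n] * n  # changing one coordinate moves the mean by at most 1/n

# Monte Carlo estimate of Pr(mean - E[mean] >= t) for Bernoulli(1/2) samples
random.seed(0)
trials = 20000
hits = sum(
    sum(random.random() < 0.5 for _ in range(n)) / n - 0.5 >= t
    for _ in range(trials)
)
empirical_tail = hits / trials

print(mcdiarmid_bound(c, t))  # equals exp(-2 n t^2) here, since ||c||_2^2 = 1/n
print(empirical_tail)         # should never exceed the bound
```

As expected, the simulated tail probability sits well below the exponential bound exp(−4) ≈ 0.018.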
Finally, we demonstrate that for many important classes of graphs, the forest complexity is easy to estimate.

Below we first define the notion of dependency graphs, a widely used model in probability, statistics, and combinatorics; see [28, 29, 30, 31, 32] for examples.

Definition 3.1 (Dependency Graphs). An undirected graph G is called a dependency graph of a random vector X = (X_1, …, X_n) if

1. V(G) = [n];
2. if I, J ⊂ [n] are non-adjacent in G, then {X_i}_{i∈I} and {X_j}_{j∈J} are independent.

3.1 McDiarmid Concentration for Acyclic Graph-dependent Variables

Our first result is for the case that the dependency graph is a tree.

Theorem 3.2. Suppose that f : Ω → R is a c-Lipschitz function and G is a dependency graph of a random vector X that takes values in Ω. If G is a tree, then for any t > 0, the following inequality holds:

Pr(f(X) − E[f(X)] ≥ t) ≤ exp( −2t² / ( Σ_{⟨i,j⟩∈E(G)} (c_i + c_j)² + c_min² ) ),   (3)

where c_min is the minimum entry of c.

The proof of this theorem relies on decomposing f(X) − E[f(X)] into the sum Σ_{i=1}^n V_i with V_i := E[f(X) | X_1, …, X_i] − E[f(X) | X_1, …, X_{i−1}]. We show that each V_i ranges in an interval of length at most c_i + c_j, where j is the parent of i in the tree (in the proof, we root the tree at the vertex with the minimum Lipschitz coefficient). The theorem is then proved by applying the Chernoff–Cramér technique to Σ_{i=1}^n V_i. For details, please refer to Subsection A.1 in the supplementary materials.

Like McDiarmid's inequality, Theorem 3.2 also gives a deviation probability bound that decays exponentially. The decay rate is determined by two interplaying factors. One is the Lipschitz coefficients, which are inherent to the function. The other is the pattern of the dependency, namely, which
The other is the pattern of the dependency, namely, which\nrandom variables are dependent and connected by an edge.\nWe then generalize the above result to the case where dependency graph G is a forest.\nTheorem 3.3. Suppose that f : \u2126 ! R is a c-Lipschitz function and G is a dependency graph of a\nrandom vector X that takes values in \u2126. If G is a forest consisting of trees fTigi2[k], then for any\nt > 0, the following inequality holds:\n\nn\n\n;\n\n\u2211\n\n)\n\nPr(f (X) (cid:0) E[f (X)] (cid:21) t) (cid:20) exp\n\n(\n\n\u2211\n\n(cid:0)\n\n\u2211\n\n2t2\n\n\u27e8i;j\u27e92E(G)(ci + cj)2 +\n\nk\ni=1 c2\n\nmin;i\n\n;\n\n(4)\n\nwhere cmin;i = minfcj : j 2 V (Ti)g.\nTheorem 3.3 can be proved in a similar way as Theorem 3.2. The detailed proof is presented in\nSubsection A.2 of the supplementary materials.\nWe point out that Theorem 3.3 is a strict generalization of the McDiarmid\u2019s inequality for i.i.d.\nrandom variables. If all the random variables are independent, i.e., there is no edge in the dependency\ngraph, then it is clear that Eq. (4) degenerates exactly to Eq. (1).\nTheorem 3.3 also clearly demonstrates how dependency between random variables affects concen-\ntration. The decay rate of the probability that f (X) deviates from its expectation is approximately\nreversely proportional to the number of edges in the dependency graph.\n\n3.2 McDiarmid Concentration for General Graphs\n\nIn this subsection, we consider general graphs. Our basic idea for handling general graphs is to use\na forest to approximate the graph. Speci\ufb01cally, we partition the variables into groups so that the\ndependency graph of these groups is a forest. We try to \ufb01nd the optimal forest approximation, which\nleads to the notion of forest complexity. 
We then prove a McDiarmid-type concentration inequality for general graph-dependent random variables in terms of the forest complexity; the inequality has a very simple form.

We first define the concept of forest approximation.

Definition 3.4 (Forest Approximation). Given a graph G, a forest F, and a mapping φ : V(G) → V(F), if φ(u) = φ(v) or ⟨φ(u), φ(v)⟩ ∈ E(F) for any ⟨u, v⟩ ∈ E(G), we say that (φ, F) is a forest approximation of G. Let Φ(G) denote the set of forest approximations of G.

Intuitively, a forest approximation transforms a graph into a forest by merging vertices and removing the incurred self-loops and multi-edges. In this way, we rule out the redundant variables that depend heavily on others and thus contribute little to concentration.

Based on forest approximation, we define the notion of forest complexity of a graph, which intuitively measures how much the graph looks like a forest.

Definition 3.5 (Forest Complexity). Given a graph G and any forest approximation (φ, F) ∈ Φ(G) with F consisting of trees {T_i}_{i∈[k]}, let

λ(φ, F) = Σ_{⟨u,v⟩∈E(F)} ( |φ⁻¹(u)| + |φ⁻¹(v)| )² + Σ_{i=1}^k min_{u∈V(T_i)} |φ⁻¹(u)|².

We call

Λ(G) = min_{(φ,F)∈Φ(G)} λ(φ, F)

the forest complexity of the graph G.

Now we are ready to state our McDiarmid-type concentration inequality for general graph-dependent random variables.

Theorem 3.6. Suppose that f : Ω → R is a c-Lipschitz function and G is a dependency graph of a random vector X that takes values in Ω.
For any t > 0, the following inequality holds:

Pr(f(X) − E[f(X)] ≥ t) ≤ exp( −2t² / ( Λ(G) ‖c‖_∞² ) ).

With the tool of forest approximation, we reduce the concentration problem defined on graphs to that defined on forests. Basically, we use a new variable to represent each set of the original variables that are merged together by the forest approximation. The function can be equivalently transformed into a function of the new variables whose dependency graph is the forest. The proof is completed by applying Theorem 3.3 to the new function. For details, please refer to Subsection A.3 in the supplementary materials.

Like the above theorems, Theorem 3.6 also establishes an exponentially decaying probability of deviation. The decay rate is determined entirely by the Lipschitz coefficients of the function and the forest complexity of the variables' dependency graph. Intuitively, the more the dependency graph looks like a forest, the faster the deviation probability decays. This uncovers how the dependencies among random variables influence concentration.

3.3 Illustrations and Examples

This subsection consists of two parts. In the first part, we review a widely studied random process that generates dependent data whose dependency graph can be naturally constructed. In the second part, we work through some dependency graphs to show that in many cases the forest complexity is small and easy to estimate.

Consider a data-generating procedure modeled by the spatial Poisson point process, i.e., a Poisson point process on R² (see [33, 34] for discussions of using this process to model data collection in various machine learning applications). The number of points in each finite region follows a Poisson distribution, and the numbers of points in disjoint regions are independent.
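Definition 3.5 translates directly into code. The sketch below (our illustration, not part of the paper) computes λ(φ, F) from the sizes of the preimages φ⁻¹(u), the forest's edges, and its partition into trees; for a tree on n vertices approximated by itself, it reproduces the value 4n − 3 derived in Example 3.7 below.

```python
def lam(pre, edges, trees):
    """lambda(phi, F) of Definition 3.5.

    pre   : dict mapping each forest vertex u to the preimage size |phi^{-1}(u)|
    edges : list of (u, v) pairs, the edges of the forest F
    trees : list of vertex lists, one per tree of F
    """
    term_edges = sum((pre[u] + pre[v]) ** 2 for u, v in edges)
    term_trees = sum(min(pre[u] for u in tree) ** 2 for tree in trees)
    return term_edges + term_trees

# Example 3.7: a tree (here a path) on n = 5 vertices mapped identically to itself.
n = 5
pre = {u: 1 for u in range(n)}
path_edges = [(u, u + 1) for u in range(n - 1)]
print(lam(pre, path_edges, [list(range(n))]))  # 4n - 3 = 17
```

The same function reproduces the cycle value of Example 3.8: approximating C6 by a 4-vertex path with preimage sizes (1, 2, 2, 1) gives λ = 9 + 16 + 9 + 1 = 35 = 8·6 − 13.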
Given a finite set I = {I_i}_{i=1}^n of regions in R², let X_i be the number of points in region I_i, 1 ≤ i ≤ n. Then the graph G = ([n], {⟨i, j⟩ : I_i ∩ I_j ≠ ∅}) is a dependency graph of the random variables {X_i}_{i=1}^n.

We present three examples to demonstrate that estimating the forest complexity Λ(G) is usually easy. All the examples can naturally appear in the above process.

Example 3.7 (G is a tree). In this case, the identity map between G and itself is a forest approximation of G. Then Λ(G) ≤ |E(G)|(1 + 1)² + 1 = 4n − 3 = O(n). We get an upper bound on Λ(G) that is linear in the number of variables, which is almost tight compared with Hoeffding's inequality or Janson's result (see (2) with χ*(G) = 2).

Example 3.8 (G is a cycle C_n). If n is even, a forest approximation is illustrated in Figure 1, where the cycle is approximated by a path F of length n/2. The approximation φ maps any vertex of G to the vertex of F having the same shape, so each gray belt stands for a preimage set of φ; we keep this convention in the rest of this section. By the illustrated forest approximation, Λ(G) ≤ 2(1 + 2)² + (n/2 − 2)(2 + 2)² + 1 = 8n − 13 = O(n). When n is odd, according to the forest approximation shown in Figure 2, Λ(G) ≤ (1 + 2)² + ((n − 1)/2 − 1)(2 + 2)² + 1 = 8n − 14 = O(n). Since χ*(G) is 2 or 3, our bound is again very tight compared with Janson's result.

Figure 1: A forest approximation of C6.   Figure 2: A forest approximation of C5.

Example 3.9 (G is a grid). Suppose G is a two-dimensional (m × m)-grid. Then n = m².
Considering the forest approximation illustrated in Figure 3, Λ(G) ≤ 2[3² + 5² + … + (2m − 1)²] + 1 = (2m(2m + 1)(2m − 1) − 3)/3 = O(m³) = O(n^{3/2}).

Figure 3: A forest approximation of the (4 × 4)-grid.

4 Generalization Theory for Learning from Graph-dependent Data

This section establishes stability-based generalization error bounds for learning from graph-dependent data, using the concentration inequalities derived in the last section.

Consider the supervised learning setting: let S = ((x_1, y_1), …, (x_n, y_n)) ∈ (X × Y)^n be a training sample of size n, where X is the input space and Y is the output space. Let D be the underlying distribution of the data on X × Y. Assume that all the training data points (x_i, y_i) have the same marginal distribution D and that G is a dependency graph of S.

Throughout this section, fix a non-negative loss function ℓ : Y × Y → R. For any hypothesis f : X → Y, the empirical error on sample S is

R̂(f) = (1/n) Σ_{i=1}^n ℓ(y_i, f(x_i)).

For learning from dependent data, the generalization error can be defined in various ways. We adopt the following widely used one [35, 36, 37, 38]:

R(f) = E_{(x,y)∼D}[ℓ(y, f(x))],   (5)

which assumes that the test set is independent of the training set.

4.1 Bounding Generalization Error via Algorithmic Stability

Algorithmic stability has been used in the study of classification and regression to derive generalization bounds [39, 40, 41, 42, 43, 44]. A key advantage of stability bounds is that they are designed for
Introduced 17 years\nago, uniform stability [45] is now among the most widely used notions of algorithmic stability.\nGiven a training sample S of size n and i 2 [n], remove the i-th element from S, resulting in a sample\nof size n (cid:0) 1, which is denoted by S\nni = ((x1; y1); : : : ; (xi(cid:0)1; yi(cid:0)1); (xi+1; yi+1) : : : ; (xn; yn)).\nFor a learning algorithm A, de\ufb01ne f\nS : X ! Y to be the the hypothesis that A has learned from\nA\nthe sample S.\nDe\ufb01nition 4.1 (Uniform Stability [45]). Given integer n > 0, the learning algorithm A is called\n(cid:12)n-uniformly stable with respect to the loss function \u2113, if for any i 2 [n], S 2 (X (cid:2) Y)n, and\n(x; y) 2 X (cid:2) Y, it holds that\n\nj\u2113(y; f\n\nS (x)) (cid:0) \u2113(y; f\nA\n\nSni (x))j (cid:20) (cid:12)n:\nA\n\nIntuitively, the stability of a leaning algorithm means that any small perturbation of training samples\nhas little effect on the result of learning.\nA\nNow, we begin our analysis with studying the distribution of (cid:8)A(S) = R(f\nS ), namely, the\ndifference between the empirical and the generalization errors. The mapping (cid:8)A : (X (cid:2) Y)n ! R\nA\nS ) via stability. We \ufb01rst show that the deviation of (cid:8)A(S)\nwill play a critical role in estimating R(f\nfrom its expectation can be bounded with high probability (Lemma 4.2), and then upper bound the\nexpected value of (cid:8)A(S) in Lemma 4.3.\nLemma 4.2. Given a sample S of size n with dependency graph G, assume that the learning algo-\nrithm A is (cid:12)n-uniformly stable. Suppose the loss function \u2113 is bounded by M. Then for any t > 0,\nit holds that\n\nS )(cid:0)bR(f\n\n(\n\n)\n\nA\n\nPr((cid:8)A(S) (cid:0) E[(cid:8)A(S)] (cid:21) t) (cid:20) exp\n\n(cid:0)\n\n2n2t2\n\n(cid:3)(G)(4n(cid:12)n + M )2\n\n:\n\nLemma 4.2 is proved in two steps. First, we treat (cid:8)A((cid:1)) as an n-ary function and show that its\nLipschitz coef\ufb01cients are all bounded by 4(cid:12)n + M=n. 
Second, regarding S as a random vector, we apply Theorem 3.6 to Φ_A(S). For details, see Subsection B.1 of the supplementary materials.

Lemma 4.3. Given a sample S of size n with dependency graph G, assume that the learning algorithm A is β_i-uniformly stable for any i ≤ n. Suppose the maximum degree of G is ∆. Let β_{n,∆} = max_{i∈[0,∆]} β_{n−i}. It holds that

E[Φ_A(S)] ≤ 2β_{n,∆}(∆ + 1).

The proof of the lemma is based on iterative perturbations of the training sample S. A perturbation is essentially removing a data point from, or adding a data point to, S. The uniform stability of the algorithm guarantees that each perturbation causes a discrepancy of at most β_{n,∆}, and in total 2(∆ + 1) perturbations have to be made in order to eliminate the dependency between a data point and the others. For details, please refer to Subsection B.2 of the supplementary materials.

Combining Lemma 4.2 and Lemma 4.3, we immediately have

Theorem 4.4. Given a sample S of size n with dependency graph G, assume that the learning algorithm A is β_i-uniformly stable for any i ≤ n. Suppose the maximum degree of G is ∆, and the loss function ℓ is bounded by M. Let β_{n,∆} = max_{i∈[0,∆]} β_{n−i}. For any δ ∈ (0, 1), with probability at least 1 − δ, it holds that

R(f_S^A) ≤ R̂(f_S^A) + 2β_{n,∆}(∆ + 1) + ((4nβ_n + M)/n) · √( Λ(G) ln(1/δ) / 2 ).

Remark 4.5. It is well known that for many learning algorithms β_n = O(1/n) [45]. Thus, we often have β_{n,∆}(∆ + 1) ≤ β_{n−∆}(∆ + 1) = O(∆/(n − ∆)), which vanishes asymptotically if ∆ = o(n). The term O(√(Λ(G))/n) also vanishes asymptotically if Λ(G) = o(n²).
As a result, in the case of weak dependence such as the examples in Subsection 3.3, the generalization error is almost upper-bounded by the empirical error. We also observe that if the training data are i.i.d., Theorem 4.4 degenerates to the standard stability bound in [45], by setting ∆ = 0, β_{n,∆} = β_n, and Λ(G) = n.

4.2 Application: Learning from m-dependent Data

We present a practical application in machine learning. Suppose there are linearly aligned locations, for example, real estates along a street. Let y_i be the observation at location i, e.g., the house price, and let x_i stand for the random variable modeling the geographical effect at location i. Suppose that the x's are mutually independent and each y_i is geographically influenced by a neighborhood of size at most 2q + 1. One hopes to learn a model of y from a sample {((x_{i−q}, …, x_i, …, x_{i+q}), y_i)}_{i∈[n]}, where n is the size of the sample. This model accounts for the impact of local locations on house prices. Similar scenarios are frequently considered in spatial econometrics; see [46] for more examples.

This application is a special case of m-dependence, an important statistical model introduced by Hoeffding in [47]. m-dependence has been studied extensively in probability, statistics, and combinatorics [48, 49, 50].

Definition 4.6 (m-dependence [47]). For some m, n ∈ N_+, a sequence of random variables {X_i}_{i=1}^n is called m-dependent if for any i ∈ [n − m − 1], {X_j}_{j=1}^i is independent of {X_j}_{j=i+m+1}^n.

The upper part of Figure 4 illustrates a dependency graph of a 2-dependent sequence {X_i}_{i=1}^n. As illustrated in Figure 4, we divide an m-dependent sequence into blocks of size m, and sequentially map the blocks to the vertices of a path with ⌈n/m⌉ vertices.
This forest approximation leads to

Λ(G) ≤ ( ⌈n/m⌉ − 1 )(m + m)² + m² ≤ 4mn = O(mn).

Figure 4: A forest approximation of a 2-dependent sequence. The approximation φ maps any vertex of G to the vertex of F having the same shape, so each gray belt stands for a preimage set of φ.

Combining Theorem 4.4 and the estimated forest complexity, we have

Corollary 4.7. Given an m-dependent sequence S of length n as a training sample, assume that the learning algorithm A is β_i-uniformly stable for any i ≤ n. Suppose the loss function ℓ is bounded by M. For any δ ∈ (0, 1), with probability at least 1 − δ, it holds that

R(f_S^A) ≤ R̂(f_S^A) + 2β_{n,2m}(2m + 1) + (4nβ_n + M) · √( 2m ln(1/δ) / n ).

Choose any uniformly stable learning algorithm A in [45] with β_n = O(1/n), such as regularization algorithms in an RKHS, and apply it to the house-price prediction problem above. Then for any fixed q, with high probability, Corollary 4.7 yields R(f_S^A) ≤ R̂(f_S^A) + O(√(ln(1/δ)/n)) for sufficiently large n, matching the stability bound of the i.i.d. case in [45].

5 Conclusion and Future Work

In this paper, we establish McDiarmid-type concentration inequalities for general functions of graph-dependent random variables. We apply our concentration results to obtain a stability-based generalization error bound for learning from graph-dependent samples. There are several possible extensions of this work.

• We provide upper bounds on the forest complexity for several classes of graphs. It is an interesting algorithmic problem to efficiently estimate the forest complexity.
One heuristic way to do this on a connected graph is via the graph diameter: merge vertices at the same distance from a peripheral vertex, resulting in a path as long as the diameter.

• If more information about the dependency structure is known, e.g., dependency hypergraphs [24, 25], can we obtain better concentration inequalities and generalization bounds?

• In [3, 12, 6], the generalization error is defined differently from that in this paper. The differences between the two definitions have been discussed in [3, 12]. It is a natural question whether our results can be adapted to that definition.

• There are some newly introduced dependency graph models, such as thresholded dependency graphs [51] and weighted dependency graphs [52, 53]. Can the problems in this paper be solved under these new models?

Acknowledgments

Rui (Ray) Zhang would like to thank Nick Wormald for valuable comments on an early version of this paper. Yuyi Wang would like to thank Ondřej Kuželka for very helpful discussions. Liwei Wang would like to thank Yunchang Yang for very helpful discussions. Xingwu Liu's work is partially supported by the National Key Research and Development Program of China (Grant No. 2016YFB1000201), the National Natural Science Foundation of China (61420106013), the State Key Laboratory of Computer Architecture Open Fund (CARCH3410), and the Youth Innovation Promotion Association of the Chinese Academy of Sciences.

References

[1] Herold Dehling and Walter Philipp. Empirical process techniques for dependent data. In Empirical Process Techniques for Dependent Data, pages 3–113. Springer, 2002.

[2] Massih-Reza Amini and Nicolas Usunier. Learning with Partially Labeled and Interdependent Data. Springer, 2015.

[3] Mehryar Mohri and Afshin Rostamizadeh. Stability bounds for non-i.i.d. processes.
In Advances in Neural Information Processing Systems, pages 1025–1032, 2008.

[4] Mehryar Mohri and Afshin Rostamizadeh. Rademacher complexity bounds for non-i.i.d. processes. In Advances in Neural Information Processing Systems, pages 1097–1104, 2009.

[5] Liva Ralaivola, Marie Szafranski, and Guillaume Stempfel. Chromatic PAC-Bayes bounds for non-i.i.d. data: Applications to ranking and stationary β-mixing processes. Journal of Machine Learning Research, 11(Jul):1927–1956, 2010.

[6] Vitaly Kuznetsov and Mehryar Mohri. Generalization bounds for non-stationary mixing processes. Machine Learning, 106(1):93–117, 2017.

[7] Hao Yi, Alon Orlitsky, and Venkatadheeraj Pichapati. On learning Markov chains. In Advances in Neural Information Processing Systems, pages 646–655, 2018.

[8] Murray Rosenblatt. A central limit theorem and a strong mixing condition. Proceedings of the National Academy of Sciences of the United States of America, 42(1):43, 1956.

[9] V. A. Volkonskii and Yu. A. Rozanov. Some limit theorems for random functions. I. Theory of Probability & Its Applications, 4(2):178–197, 1959.

[10] Ildar A. Ibragimov. Some limit theorems for stationary processes. Theory of Probability & Its Applications, 7(4):349–382, 1962.

[11] Leonid Kontorovich. Measure Concentration of Strongly Mixing Processes with Applications. PhD thesis, Carnegie Mellon University, 2007.

[12] Mehryar Mohri and Afshin Rostamizadeh. Stability bounds for stationary φ-mixing and β-mixing processes. Journal of Machine Learning Research, 11(Feb):789–814, 2010.

[13] Fangchao He, Ling Zuo, and Hong Chen. Stability analysis for ranking with stationary φ-mixing samples. Neurocomputing, 171:1556–1562, 2016.

[14] Leonid Aryeh Kontorovich and Kavita Ramanan. Concentration inequalities for dependent random variables via the martingale method.
The Annals of Probability, 36(6):2126–2158, 2008.

[15] Bin Yu. Rates of convergence for empirical processes of stationary mixing sequences. The Annals of Probability, pages 94–116, 1994.

[16] Yuval Dagan, Constantinos Daskalakis, Nishanth Dikkala, and Siddhartha Jayanti. Learning from weakly dependent data under Dobrushin's condition. In Proceedings of the Thirty-Second Conference on Learning Theory, volume 99 of Proceedings of Machine Learning Research, pages 914–928, Phoenix, USA, 25–28 Jun 2019. PMLR.

[17] Christof Külske. Concentration inequalities for functions of Gibbs fields with application to diffraction and random Gibbs measures. Communications in Mathematical Physics, 239(1-2):29–51, 2003.

[18] Sourav Chatterjee. Concentration inequalities with exchangeable pairs. PhD thesis, arXiv preprint math/0507526, 2005.

[19] Aryeh Kontorovich and Maxim Raginsky. Concentration of measure without independence: a unified approach via the martingale method. In Convexity and Concentration, pages 183–210. Springer, 2017.

[20] P. L. Dobruschin. The description of a random field by means of conditional probabilities and conditions of its regularity. Theory of Probability & Its Applications, 13(2):197–224, 1968.

[21] Svante Janson. Large deviations for sums of partly dependent random variables. Random Structures & Algorithms, 24(3):234–248, 2004.

[22] Nicolas Usunier, Massih-Reza Amini, and Patrick Gallinari. Generalization error bounds for classifiers trained with interdependent data. In Advances in Neural Information Processing Systems, pages 1369–1376, 2006.

[23] Liva Ralaivola and Massih-Reza Amini. Entropy-based concentration inequalities for dependent variables. In International Conference on Machine Learning, pages 2436–2444, 2015.

[24] Yuyi Wang, Zheng-Chu Guo, and Jan Ramon. Learning from networked examples.
In International Conference on Algorithmic Learning Theory, ALT 2017, 15–17 October 2017, Kyoto University, Kyoto, Japan, pages 641–666, 2017.

[25] Yuanhong Wang, Yuyi Wang, Xingwu Liu, and Juhua Pu. On the ERM principle with networked data. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

[26] Rui (Ray) Zhang, Xingwu Liu, Yuyi Wang, and Liwei Wang. McDiarmid-type inequalities for graph-dependent variables and stability bounds. arXiv preprint arXiv:1909.02330, 2019.

[27] Colin McDiarmid. On the method of bounded differences. Surveys in Combinatorics, 141(1):148–188, 1989.

[28] Paul Erdős and László Lovász. Problems and results on 3-chromatic hypergraphs and some related questions. Infinite and Finite Sets, 10(2):609–627, 1975.

[29] Svante Janson, Tomasz Łuczak, and Andrzej Ruciński. An exponential bound for the probability of nonexistence of a specified subgraph in a random graph. Institute for Mathematics and its Applications (USA), 1988.

[30] Louis H. Y. Chen. Two central limit problems for dependent random variables. Probability Theory and Related Fields, 43(3):223–243, 1978.

[31] Pierre Baldi and Yosef Rinott. On normal approximations of distributions in terms of dependency graphs. The Annals of Probability, 17(4):1646–1650, 1989.

[32] Svante Janson, Tomasz Łuczak, and Andrzej Ruciński. Random Graphs, volume 45. John Wiley & Sons, 2011.

[33] Scott Linderman and Ryan Adams. Discovering latent network structure in point process data. In International Conference on Machine Learning, pages 1413–1421, 2014.

[34] Alisa Kirichenko and Harry van Zanten. Optimality of Poisson processes intensity learning with Gaussian processes. The Journal of Machine Learning Research, 16(1):2909–2919, 2015.

[35] Ron Meir. Nonparametric time series prediction through adaptive model selection.
Machine Learning, 39(1):5–34, 2000.

[36] Aurélie C. Lozano, Sanjeev R. Kulkarni, and Robert E. Schapire. Convergence and consistency of regularized boosting algorithms with stationary β-mixing observations. In Advances in Neural Information Processing Systems, pages 819–826, 2006.

[37] Ingo Steinwart and Andreas Christmann. Fast learning from non-i.i.d. observations. In Advances in Neural Information Processing Systems, pages 1768–1776, 2009.

[38] Hanyuan Hang and Ingo Steinwart. Fast learning from α-mixing observations. Journal of Multivariate Analysis, 127:184–199, 2014.

[39] William H. Rogers and Terry J. Wagner. A finite sample distribution-free performance bound for local discrimination rules. The Annals of Statistics, pages 506–514, 1978.

[40] Luc Devroye and Terry Wagner. Distribution-free performance bounds for potential function rules. IEEE Transactions on Information Theory, 25(5):601–604, 1979.

[41] Michael Kearns and Dana Ron. Algorithmic stability and sanity-check bounds for leave-one-out cross-validation. Neural Computation, 11(6):1427–1453, 1999.

[42] Samuel Kutin and Partha Niyogi. Almost-everywhere algorithmic stability and generalization error. In Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence, pages 275–282. Morgan Kaufmann Publishers Inc., 2002.

[43] Wenlong Mou, Yuchen Zhou, Jun Gao, and Liwei Wang. Dropout training, data-dependent regularization, and generalization bounds. In International Conference on Machine Learning, pages 3642–3650, 2018.

[44] Wenlong Mou, Liwei Wang, Xiyu Zhai, and Kai Zheng. Generalization bounds of SGLD for non-convex learning: Two theoretical viewpoints. arXiv preprint arXiv:1707.05947, 2017.

[45] Olivier Bousquet and André Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2(Mar):499–526, 2002.

[46] Luc Anselin.
Spatial Econometrics: Methods and Models, volume 4. Springer Science & Business Media, 2013.

[47] Wassily Hoeffding and Herbert Robbins. The central limit theorem for dependent random variables. Duke Mathematical Journal, 15(3):773–780, 1948.

[48] P. H. Diananda and M. S. Bartlett. Some probability limit theorems with statistical applications. In Mathematical Proceedings of the Cambridge Philosophical Society, volume 49, pages 239–246. Cambridge University Press, 1953.

[49] Pranab Kumar Sen. Asymptotic normality of sample quantiles for m-dependent processes. The Annals of Mathematical Statistics, pages 1724–1730, 1968.

[50] Louis H. Y. Chen and Qi-Man Shao. Stein's method for normal approximation. An Introduction to Stein's Method, 4:1–59, 2005.

[51] Christoph H. Lampert, Liva Ralaivola, and Alexander Zimin. Dependency-dependent bounds for sums of dependent random variables. arXiv preprint arXiv:1811.01404, 2018.

[52] Jehanne Dousse and Valentin Féray. Weighted dependency graphs and the Ising model. arXiv preprint arXiv:1610.05082, 2016.

[53] Valentin Féray. Weighted dependency graphs. Electronic Journal of Probability, 23, 2018.