{"title": "Extending Gossip Algorithms to Distributed Estimation of U-statistics", "book": "Advances in Neural Information Processing Systems", "page_first": 271, "page_last": 279, "abstract": "Efficient and robust algorithms for decentralized estimation in networks are essential to many distributed systems. Whereas distributed estimation of sample mean statistics has been the subject of a good deal of attention, computation of U-statistics, relying on more expensive averaging over pairs of observations, is a less investigated area. Yet, such data functionals are essential to describe global properties of a statistical population, with important examples including Area Under the Curve, empirical variance, Gini mean difference and within-cluster point scatter. This paper proposes new synchronous and asynchronous randomized gossip algorithms which simultaneously propagate data across the network and maintain local estimates of the U-statistic of interest. We establish convergence rate bounds of O(1 / t) and O(log t / t) for the synchronous and asynchronous cases respectively, where t is the number of iterations, with explicit data and network dependent terms. Beyond favorable comparisons in terms of rate analysis, numerical experiments provide empirical evidence the proposed algorithms surpasses the previously introduced approach.", "full_text": "Extending Gossip Algorithms to\n\nDistributed Estimation of U-Statistics\n\nIgor Colin, Joseph Salmon, St\u00b4ephan Cl\u00b4emenc\u00b8on\n\nLTCI, CNRS, T\u00b4el\u00b4ecom ParisTech\n\nUniversit\u00b4e Paris-Saclay\n\n75013 Paris, France\n\nfirst.last@telecom-paristech.fr\n\nAur\u00b4elien Bellet\nMagnet Team\n\nINRIA Lille - Nord Europe\n\n59650 Villeneuve d\u2019Ascq, France\naurelien.bellet@inria.fr\n\nAbstract\n\nEf\ufb01cient and robust algorithms for decentralized estimation in networks are es-\nsential to many distributed systems. 
Whereas distributed estimation of sample mean statistics has been the subject of a good deal of attention, computation of U-statistics, relying on more expensive averaging over pairs of observations, is a less investigated area. Yet, such data functionals are essential to describe global properties of a statistical population, with important examples including Area Under the Curve, empirical variance, Gini mean difference and within-cluster point scatter. This paper proposes new synchronous and asynchronous randomized gossip algorithms which simultaneously propagate data across the network and maintain local estimates of the U-statistic of interest. We establish convergence rate bounds of O(1/t) and O(log t/t) for the synchronous and asynchronous cases respectively, where t is the number of iterations, with explicit data and network dependent terms. Beyond favorable comparisons in terms of rate analysis, numerical experiments provide empirical evidence that the proposed algorithms surpass the previously introduced approach.

1 Introduction

Decentralized computation and estimation have many applications in sensor and peer-to-peer networks, as well as for extracting knowledge from massive information graphs such as interlinked Web documents and on-line social media. Algorithms running on such networks must often operate under tight constraints: the nodes forming the network cannot rely on a centralized entity for communication and synchronization, may not know the global network topology, and often have limited resources (computational power, memory, energy). Gossip algorithms [19, 18, 5], where each node exchanges information with at most one of its neighbors at a time, have emerged as a simple yet powerful technique for distributed computation in such settings. 
Given a data observation on each node, gossip algorithms can be used to compute averages or sums of functions of the data that are separable across observations (see for example [10, 2, 15, 11, 9] and references therein). Unfortunately, these algorithms cannot be used to efficiently compute quantities that take the form of an average over pairs of observations, also known as U-statistics [12]. Classical U-statistics used in machine learning and data mining include the sample variance, the Area Under the Curve (AUC) of a classifier on distributed data, the Gini mean difference, the Kendall tau rank correlation coefficient, the within-cluster point scatter, and several statistical hypothesis test statistics such as Wilcoxon Mann-Whitney [14].
In this paper, we propose randomized synchronous and asynchronous gossip algorithms to efficiently compute a U-statistic, in which each node maintains a local estimate of the quantity of interest throughout the execution of the algorithm. Our methods rely on two types of iterative information exchange in the network: propagation of local observations across the network, and averaging of local estimates. We show that the local estimates generated by our approach converge in expectation to the value of the U-statistic at rates of O(1/t) and O(log t/t) for the synchronous and asynchronous versions respectively, where t is the number of iterations. These convergence bounds feature data-dependent terms that reflect the hardness of the estimation problem, and network-dependent terms related to the spectral gap of the network graph [3], showing that our algorithms are faster on well-connected networks. The proofs rely on an original reformulation of the problem using "phantom nodes", i.e., on additional nodes that account for data propagation in the network. 
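To fix ideas, a degree-two U-statistic is simply a double average of a symmetric kernel over all pairs of observations (the formal definition appears in Section 2). The following minimal centralized sketch illustrates the target quantity; the variance kernel used here is one of the classical examples listed above, and the specific sample values are chosen purely for illustration.

```python
# Centralized computation of a degree-two U-statistic with 1/n^2
# normalization. With the kernel H(x, y) = (x - y)^2 / 2 it coincides
# with the biased empirical variance (1/n) * sum_i (x_i - mean)^2.

def u_statistic(xs, H):
    n = len(xs)
    return sum(H(xi, xj) for xi in xs for xj in xs) / n ** 2

def var_kernel(x, y):
    return 0.5 * (x - y) ** 2

xs = [1.0, 2.0, 4.0, 7.0]           # toy sample for illustration
u = u_statistic(xs, var_kernel)     # equals the biased variance of xs
```

The equality with the biased variance is a simple algebraic identity: expanding (xi − xj)²/2 and averaging over all n² pairs cancels the cross terms.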
Our results largely improve upon those presented in [17]: in particular, we achieve faster convergence together with lower memory and communication costs. Experiments conducted on AUC and within-cluster point scatter estimation using real data confirm the superiority of our approach.
The rest of this paper is organized as follows. Section 2 introduces the problem of interest as well as relevant notation. Section 3 provides a brief review of the related work in gossip algorithms. We then describe our approach along with the convergence analysis in Section 4, both in the synchronous and asynchronous settings. Section 5 presents our numerical results.

2 Background

2.1 Definitions and Notations
For any integer p > 0, we denote by [p] the set {1, . . . , p} and by |F| the cardinality of any finite set F. We represent a network of size n > 0 as an undirected graph G = (V, E), where V = [n] is the set of vertices and E ⊆ V × V the set of edges. We denote by A(G) the adjacency matrix related to the graph G, that is, for all (i, j) ∈ V², [A(G)]ij = 1 if and only if (i, j) ∈ E. For any node i ∈ V, we denote its degree by di = |{j : (i, j) ∈ E}|. We denote by L(G) the graph Laplacian of G, defined by L(G) = D(G) − A(G), where D(G) = diag(d1, . . . , dn) is the matrix of degrees. A graph G = (V, E) is said to be connected if for all (i, j) ∈ V² there exists a path connecting i and j; it is bipartite if there exist S, T ⊂ V such that S ∪ T = V, S ∩ T = ∅ and E ⊆ (S × T) ∪ (T × S).
A matrix M ∈ R^{n×n} is nonnegative (resp. positive) if and only if for all (i, j) ∈ [n]², [M]ij ≥ 0 (resp. [M]ij > 0). We write M ≥ 0 (resp. M > 0) when this holds. The transpose of M is denoted by M⊤. A matrix P ∈ R^{n×n} is stochastic if and only if P ≥ 0 and P1n = 1n, where 1n = (1, . . . , 1)⊤ ∈ R^n. 
The matrix P ∈ R^{n×n} is bi-stochastic if and only if P and P⊤ are stochastic. We denote by In the identity matrix in R^{n×n}, by (e1, . . . , en) the standard basis in R^n, by I{E} the indicator function of an event E, and by ‖·‖ the usual ℓ2 norm.

2.2 Problem Statement
Let X be an input space and (X1, . . . , Xn) ∈ X^n a sample of n ≥ 2 points in that space. We assume X ⊆ R^d for some d > 0 throughout the paper, but our results straightforwardly extend to the more general setting. We denote as X = (X1, . . . , Xn)⊤ the design matrix. Let H : X × X → R be a measurable function, symmetric in its two arguments and with H(X, X) = 0 for all X ∈ X. We consider the problem of estimating the following quantity, known as a degree two U-statistic [12]:¹

  Ûn(H) = (1/n²) ∑_{i,j=1}^n H(Xi, Xj).   (1)

In this paper, we illustrate the interest of U-statistics on two applications, among many others. The first one is the within-cluster point scatter [4], which measures the clustering quality of a partition P of X as the average distance between points in each cell C ∈ P. It is of the form (1) with

  H_P(X, X′) = ‖X − X′‖ · ∑_{C∈P} I{(X, X′) ∈ C²}.   (2)

We also study the AUC measure [8]. For a given sample (X1, ℓ1), . . . , (Xn, ℓn) on X × {−1, +1}, the AUC measure of a linear classifier θ ∈ R^{d−1} is given by:

  AUC(θ) = ( ∑_{1≤i,j≤n} (1 − ℓiℓj) I{ℓi(θ⊤Xi) > −ℓj(θ⊤Xj)} ) / ( 4 (∑_{1≤i≤n} I{ℓi=1}) (∑_{1≤i≤n} I{ℓi=−1}) ).   (3)

This score is the probability for a classifier to rank a positive observation higher than a negative one.

¹We point out that the usual definition of U-statistic differs slightly from (1) by a factor of n/(n − 1).

Algorithm 1 GoSta-sync: a synchronous gossip algorithm for computing a U-statistic
Require: Each node k holds observation Xk
1: Each node k initializes its auxiliary observation Yk = Xk and its estimate Zk = 0
2: for t = 1, 2, . . . do
3:   for p = 1, . . . , n do
4:     Set Zp ← ((t − 1)/t) Zp + (1/t) H(Xp, Yp)
5:   end for
6:   Draw (i, j) uniformly at random from E
7:   Set Zi, Zj ← (1/2)(Zi + Zj)
8:   Swap auxiliary observations of nodes i and j: Yi ↔ Yj
9: end for

We focus here on the decentralized setting, where the data sample is partitioned across a set of nodes in a network. For simplicity, we assume V = [n] and each node i ∈ V only has access to a single data observation Xi.² We are interested in estimating (1) efficiently using a gossip algorithm.

3 Related Work

Gossip algorithms have been extensively studied in the context of decentralized averaging in networks, where the goal is to compute the average of n real numbers (X = R):

  X̄n = (1/n) ∑_{i=1}^n Xi = (1/n) X⊤1n.   (4)

One of the earliest works on this canonical problem is due to [19], but more efficient algorithms have recently been proposed, see for instance [10, 2]. Of particular interest to us is the work of [2], which introduces a randomized gossip algorithm for computing the empirical mean (4) in a context where nodes wake up asynchronously and simply average their local estimate with that of a randomly chosen neighbor. The communication probabilities are given by a stochastic matrix P, where pij is the probability that node i selects neighbor j at a given iteration. As long as the network graph is connected and non-bipartite, the local estimates converge to (4) at a rate O(e^{−ct}), where the constant c can be tied to the spectral gap of the network graph [3], showing faster convergence for well-connected networks.³ Such algorithms can be extended to compute other functions such as maxima and minima, or sums of the form ∑_{i=1}^n f(Xi) for some function f : X → R (as done for instance in [15]). Some work has also gone into developing faster gossip algorithms for poorly connected networks, assuming that nodes know their (partial) geographic location [6, 13]. For a detailed account of the literature on gossip algorithms, we refer the reader to [18, 5].
However, existing gossip algorithms cannot be used to efficiently compute (1), as it depends on pairs of observations. To the best of our knowledge, this problem has only been investigated in [17]. Their algorithm, coined U2-gossip, achieves an O(1/t) convergence rate but has several drawbacks. First, each node must store two auxiliary observations, and two pairs of nodes must exchange an observation at each iteration. For high-dimensional problems (large d), this leads to a significant memory and communication load. Second, the algorithm is not asynchronous, as every node must update its estimate at each iteration. Consequently, nodes must have access to a global clock, which is often unrealistic in practice. In the next section, we introduce new synchronous and asynchronous algorithms with faster convergence as well as smaller memory and communication cost per iteration.

4 GoSta Algorithms

In this section, we introduce gossip algorithms for computing (1). Our approach is based on the observation that Ûn(H) = (1/n) ∑_{i=1}^n hi, with hi = (1/n) ∑_{j=1}^n H(Xi, Xj), and we write h = (h1, . . . , hn)⊤. 
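As a point of reference, the plain randomized averaging scheme of [2] discussed in Section 3 can be simulated in a few lines. This is a sketch only: uniform edge sampling, a small odd ring, and a fixed seed are assumptions of the example, not part of the original algorithm's specification.

```python
import random

# Pairwise randomized gossip averaging in the spirit of [2]: at each tick
# a uniformly drawn edge (i, j) replaces both endpoint values by their
# average. On a connected, non-bipartite graph all local values converge
# to the global mean (4); pairwise averaging preserves the sum.

def gossip_average(values, edges, n_iter, seed=0):
    rng = random.Random(seed)
    x = list(values)
    for _ in range(n_iter):
        i, j = rng.choice(edges)
        x[i] = x[j] = (x[i] + x[j]) / 2.0
    return x

values = [0.0, 1.0, 2.0, 3.0, 4.0]
# ring of 5 nodes: connected, and non-bipartite since the cycle is odd
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
estimates = gossip_average(values, edges, n_iter=2000)
mean = sum(values) / len(values)
```

After a few thousand pairwise exchanges on this small graph, every local estimate is essentially at the global mean.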
The goal is thus similar to the usual distributed averaging problem (4), with the key difference that each local value hi is itself an average depending on the entire data sample. Consequently, our algorithms will combine two steps at each iteration: a data propagation step to allow each node i to estimate hi, and an averaging step to ensure convergence to the desired value Ûn(H). We first present the algorithm and its analysis for the (simpler) synchronous setting in Section 4.1, before introducing an asynchronous version (Section 4.2).

²Our results generalize to the case where each node holds a subset of the observations (see Section 4).
³For the sake of completeness, we provide an analysis of this algorithm in the supplementary material.

Figure 1: Comparison of original network and "phantom network". (a) Original graph G; (b) new graph G̃.

4.1 Synchronous Setting

In the synchronous setting, we assume that the nodes have access to a global clock so that they can all update their estimate at each time instance. We stress that the nodes need not be aware of the global network topology, as they will only interact with their direct neighbors in the graph.
Let us denote by Zk(t) the (local) estimate of Ûn(H) by node k at iteration t. In order to propagate data across the network, each node k maintains an auxiliary observation Yk, initialized to Xk. Our algorithm, coined GoSta, goes as follows. At each iteration, each node k updates its local estimate by taking the running average of Zk(t) and H(Xk, Yk). Then, an edge of the network is drawn uniformly at random, and the corresponding pair of nodes average their local estimates and swap their auxiliary observations. The observations are thus each performing a random walk (albeit coupled) on the network graph. 
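The update just described can be sketched directly in Python. This is an illustrative simulation under the assumption of uniform random edge selection, not the authors' implementation; the kernel, graph, and sample below are placeholders chosen so that the target value is known in closed form.

```python
import random

# Sketch of the GoSta-sync iteration: every node folds H(X[k], Y[k]) into
# its running average, then one random edge averages the two estimates
# and swaps the auxiliary observations (which thus perform random walks).

def gosta_sync(X, H, edges, n_iter, seed=0):
    rng = random.Random(seed)
    n = len(X)
    Y = list(X)        # auxiliary observations, initialized to the data
    Z = [0.0] * n      # local estimates of U_n(H)
    for t in range(1, n_iter + 1):
        for k in range(n):                      # local running-average step
            Z[k] = (t - 1) / t * Z[k] + H(X[k], Y[k]) / t
        i, j = rng.choice(edges)                # draw one edge uniformly
        Z[i] = Z[j] = (Z[i] + Z[j]) / 2.0       # average the two estimates
        Y[i], Y[j] = Y[j], Y[i]                 # swap auxiliary observations
    return Z

X = [0.5, 1.0, 2.5, 4.0, 6.0]                   # toy scalar sample
H = lambda x, y: 0.5 * (x - y) ** 2             # variance kernel
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
target = sum(H(a, b) for a in X for b in X) / len(X) ** 2  # exact U_n(H)
Z = gosta_sync(X, H, edges, n_iter=20000)       # all Z[k] approach target
```

With this variance kernel the exact value equals the biased sample variance of X, so convergence of the local estimates is easy to check numerically.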
The full procedure is described in Algorithm 1.
In order to prove the convergence of Algorithm 1, we consider an equivalent reformulation of the problem which allows us to model the data propagation and the averaging steps separately. Specifically, for each k ∈ V, we define a phantom Gk = (Vk, Ek) of the original network G, with Vk = {v_i^k ; 1 ≤ i ≤ n} and Ek = {(v_i^k, v_j^k) ; (i, j) ∈ E}. We then create a new graph G̃ = (Ṽ, Ẽ) where each node k ∈ V is connected to its counterpart v_k^k ∈ Vk:

  Ṽ = V ∪ (∪_{k=1}^n Vk),
  Ẽ = E ∪ (∪_{k=1}^n Ek) ∪ {(k, v_k^k) ; k ∈ V}.

The construction of G̃ is illustrated in Figure 1. In this new graph, the nodes V from the original network will hold the estimates Z1(t), . . . , Zn(t) as described above. The role of each Gk is to simulate the data propagation in the original graph G. For i ∈ [n], v_i^k ∈ Vk initially holds the value H(Xk, Xi). At each iteration, we draw a random edge (i, j) of G and nodes v_i^k and v_j^k swap their value for all k ∈ [n]. To update its estimate, each node k will use the current value at v_k^k.
We can now represent the system state at iteration t by a vector S(t) = (S1(t)⊤, S2(t)⊤)⊤ ∈ R^{n+n²}. The first n coefficients, S1(t), are associated with nodes in V and correspond to the estimate vector Z(t) = [Z1(t), . . . , Zn(t)]⊤. The last n² coefficients, S2(t), are associated with nodes in (Vk)_{1≤k≤n} and represent the data propagation in the network. Their initial value is set to S2(0) = (e1⊤H, . . . , en⊤H)⊤, so that for any (k, l) ∈ [n]², node v_l^k initially stores the value H(Xk, Xl).
Remark 1. The "phantom network" G̃ is of size O(n²), but we stress the fact that it is used solely as a tool for the convergence analysis: Algorithm 1 operates on the original graph G.

The transition matrix of this system accounts for three events: the averaging step (the action of G on itself), the data propagation (the action of Gk on itself for all k ∈ V) and the estimate update (the action of Gk on node k for all k ∈ V). At a given step t > 0, we are interested in characterizing the transition matrix M(t) such that E[S(t + 1)] = M(t)E[S(t)]. For the sake of clarity, we write M(t) as an upper block-triangular (n + n²) × (n + n²) matrix:

  M(t) = ( M1(t)  M2(t) ; 0  M3(t) ),   (5)

with M1(t) ∈ R^{n×n}, M2(t) ∈ R^{n×n²} and M3(t) ∈ R^{n²×n²}. The bottom left part is necessarily 0, because G does not influence any Gk. The upper left block M1(t) corresponds to the averaging step; therefore, for any t > 0, we have:

  M1(t) = ((t − 1)/t) · (1/|E|) ∑_{(i,j)∈E} ( In − (1/2)(ei − ej)(ei − ej)⊤ ) = ((t − 1)/t) W2(G),

where for any α ≥ 1, Wα(G) is defined by:

  Wα(G) = (1/|E|) ∑_{(i,j)∈E} ( In − (1/α)(ei − ej)(ei − ej)⊤ ) = In − (2/(α|E|)) L(G).   (6)

Furthermore, M2(t) and M3(t) are defined as follows: M2(t) = (1/t) B, where B ∈ R^{n×n²} is the block diagonal matrix with blocks e1⊤, . . . , en⊤, implementing the estimate update (each node k reads off the current value held by v_k^k); and M3(t) = C, where C ∈ R^{n²×n²} has diagonal blocks W1(G), . . . , W1(G), corresponding to the observations being propagated within each Gk. Note that M3(t) = W1(G) ⊗ In, where ⊗ is the Kronecker product.
We can now describe the expected state evolution. At iteration t = 0, one has:

  E[S(1)] = M(1)E[S(0)] = M(1)S(0) = ( 0  B ; 0  C ) ( 0 ; S2(0) ) = ( BS2(0) ; CS2(0) ).   (7)

Using recursion, we can write:

  E[S(t)] = M(t)M(t − 1) · · · M(1)S(0) = ( (1/t) ∑_{s=1}^t W2(G)^{t−s} B C^{s−1} S2(0) ; C^t S2(0) ).   (8)

Therefore, in order to prove the convergence of Algorithm 1, one needs to show that lim_{t→+∞} (1/t) ∑_{s=1}^t W2(G)^{t−s} B C^{s−1} S2(0) = Ûn(H) 1n. We state this precisely in the next theorem.
Theorem 1. Let G be a connected and non-bipartite graph with n nodes, X ∈ R^{n×d} a design matrix and (Z(t)) the sequence of estimates generated by Algorithm 1. For all k ∈ [n], we have:

  lim_{t→+∞} E[Zk(t)] = (1/n²) ∑_{1≤i,j≤n} H(Xi, Xj) = Ûn(H).

Moreover, for any t > 0,

  ‖E[Z(t)] − Ûn(H)1n‖ ≤ (1/(ct)) ‖h − Ûn(H)1n‖ + ( 2/(ct) + e^{−ct} ) ‖H − h1n⊤‖,   (9)

where c = c(G) := 1 − λ2(2) and λ2(2) is the second largest eigenvalue of W2(G).
Proof. See supplementary material.
Theorem 1 shows that the local estimates generated by Algorithm 1 converge to Ûn(H) at a rate O(1/t). 
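The network-dependent constant c(G) in the bound is easy to evaluate numerically for small graphs. A sketch (numpy is assumed): build W2(G) as the expected pairwise-averaging matrix under uniform edge sampling and read off its second largest eigenvalue. The edge list below stores each undirected edge once, so the averaging weight 1/(2 · number of edges) corresponds to the 2/(α|E|) factor with E counting ordered pairs; for the complete graph this evaluates to 1/(n − 1), consistent with the roughly 1/n values reported for complete graphs in Table 1.

```python
import numpy as np

# Numerical sketch of c(G) = 1 - lambda_2(2): the spectral gap of the
# expected pairwise-averaging matrix W_2(G) under uniform edge sampling.

def contraction_constant(n, edges):
    W = np.zeros((n, n))
    for i, j in edges:
        e = np.zeros(n)
        e[i], e[j] = 1.0, -1.0
        W += np.eye(n) - 0.5 * np.outer(e, e)  # I - (1/2)(ei-ej)(ei-ej)^T
    W /= len(edges)
    lam = np.sort(np.linalg.eigvalsh(W))       # ascending eigenvalues
    return 1.0 - lam[-2]                       # 1 minus second largest

n = 5
complete = [(i, j) for i in range(n) for j in range(i + 1, n)]
ring = [(k, (k + 1) % n) for k in range(n)]
c_complete = contraction_constant(n, complete)  # well connected: larger c
c_ring = contraction_constant(n, ring)          # poorly connected: smaller c
```

The comparison between the complete graph and the ring illustrates the point made below: better-connected networks have a larger spectral gap and hence a faster rate.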
Furthermore, the constants reveal the rate dependency on the particular problem instance. Indeed, the two norm terms are data-dependent and quantify the difficulty of the estimation problem itself through a dispersion measure. In contrast, c(G) is a network-dependent term, since 1 − λ2(2) = β_{n−1}/|E|, where β_{n−1} is the second smallest eigenvalue of the graph Laplacian L(G) (see Lemma 1 in the supplementary material). The value β_{n−1} is also known as the spectral gap of G, and graphs with a larger spectral gap typically have better connectivity [3]. This will be illustrated in Section 5.

Algorithm 2 GoSta-async: an asynchronous gossip algorithm for computing a U-statistic
Require: Each node k holds observation Xk and pk = 2dk/|E|
1: Each node k initializes Yk = Xk, Zk = 0 and mk = 0
2: for t = 1, 2, . . . do
3:   Draw (i, j) uniformly at random from E
4:   Set mi ← mi + 1/pi and mj ← mj + 1/pj
5:   Set Zi, Zj ← (1/2)(Zi + Zj)
6:   Set Zi ← (1 − 1/(pi mi)) Zi + (1/(pi mi)) H(Xi, Yi)
7:   Set Zj ← (1 − 1/(pj mj)) Zj + (1/(pj mj)) H(Xj, Yj)
8:   Swap auxiliary observations of nodes i and j: Yi ↔ Yj
9: end for

Comparison to U2-gossip. To estimate Ûn(H), U2-gossip [17] does not use averaging. Instead, each node k requires two auxiliary observations Yk(1) and Yk(2), both initialized to Xk. At each iteration, each node k updates its local estimate by taking the running average of Zk and H(Yk(1), Yk(2)). Then, two random edges are selected: the nodes connected by the first (resp. second) edge swap their first (resp. second) auxiliary observations. A precise statement of the algorithm is provided in the supplementary material. 
U2-gossip has several drawbacks compared to GoSta: it requires initiating communication between two pairs of nodes at each iteration, and the amount of communication and memory required is higher (especially when data is high-dimensional). Furthermore, applying our convergence analysis to U2-gossip, we obtain the following refined rate:⁴

  ‖E[Z(t)] − Ûn(H)1n‖ ≤ (√n / t) ( (2/(1 − λ2(1))) ‖h − Ûn(H)1n‖ + (1/(1 − λ2(1)²)) ‖H − h1n⊤‖ ),   (10)

where 1 − λ2(1) = 2(1 − λ2(2)) = 2c(G) and λ2(1) is the second largest eigenvalue of W1(G). The advantage of propagating two observations in U2-gossip is seen in the 1/(1 − λ2(1)²) term; however, the absence of averaging leads to an overall √n factor. Intuitively, this is because nodes do not benefit from each other's estimates. In practice, λ2(2) and λ2(1) are close to 1 for reasonably-sized networks (for instance, λ2(2) = 1 − 1/n for the complete graph), so the square term does not provide much gain and the √n factor dominates in (10). We thus expect U2-gossip to converge slower than GoSta, which is confirmed by the numerical results presented in Section 5.

4.2 Asynchronous Setting

In practical settings, nodes may not have access to a global clock to synchronize the updates. In this section, we remove the global clock assumption and propose a fully asynchronous algorithm where each node has a local clock, ticking according to a rate-1 Poisson process. Yet, local clocks are i.i.d., 
so one can use an equivalent model with a global clock ticking according to a rate-n Poisson process and a random edge draw at each iteration, as in the synchronous setting (one may refer to [2] for more details on clock modeling). However, at a given iteration, the estimate update step now only involves the selected pair of nodes. Therefore, the nodes need to maintain an estimate of the current iteration number to ensure convergence to an unbiased estimate of Ûn(H). Hence, for all k ∈ [n], let pk ∈ [0, 1] denote the probability of node k being picked at any iteration. With our assumption that nodes activate with a uniform distribution over E, pk = 2dk/|E|. Moreover, the number of times a node k has been selected at a given iteration t > 0 follows a binomial distribution with parameters t and pk. Let us define mk(t) such that mk(0) = 0 and, for t > 0:

  mk(t) = mk(t − 1) + 1/pk   if k is picked at iteration t,
  mk(t) = mk(t − 1)          otherwise.   (11)

For any k ∈ [n] and any t > 0, one has E[mk(t)] = t × pk × 1/pk = t. Therefore, given that every node knows its degree and the total number of edges in the network, the iteration estimates are unbiased. We can now give an asynchronous version of GoSta, as stated in Algorithm 2.
To show that local estimates converge to Ûn(H), we use a similar model as in the synchronous setting. The time dependency of the transition matrix is more complex; so is the upper bound.

⁴The proof can be found in the supplementary material.

  Dataset                   | Complete graph | Watts-Strogatz | 2d-grid graph
  Wine Quality (n = 1599)   | 6.26 · 10⁻⁴    | 2.72 · 10⁻⁵    | 3.66 · 10⁻⁶
  SVMguide3 (n = 1260)      | 7.94 · 10⁻⁴    | 5.49 · 10⁻⁵    | 6.03 · 10⁻⁶

Table 1: Value of 1 − λ2(2) for each network.

Theorem 2. 
Let G be a connected and non-bipartite graph with n nodes, X ∈ R^{n×d} a design matrix and (Z(t)) the sequence of estimates generated by Algorithm 2. For all k ∈ [n], we have:

  lim_{t→+∞} E[Zk(t)] = (1/n²) ∑_{1≤i,j≤n} H(Xi, Xj) = Ûn(H).   (12)

Moreover, there exists a constant c′(G) > 0 such that, for any t > 1,

  ‖E[Z(t)] − Ûn(H)1n‖ ≤ c′(G) · (log t / t) · ‖H‖.   (13)

Proof. See supplementary material.
Remark 2. Our methods can be extended to the situation where nodes contain multiple observations: when drawn, a node will pick a random auxiliary observation to swap. Similar convergence results are achieved by splitting each node into a set of nodes, each containing only one observation, with the new edges weighted judiciously.

5 Experiments

In this section, we present two applications on real datasets: the decentralized estimation of the Area Under the ROC Curve (AUC) and of the within-cluster point scatter. We compare the performance of our algorithms to that of U2-gossip [17] (see supplementary material for additional comparisons to some baseline methods). We perform our simulations on the three types of network described below (corresponding values of 1 − λ2(2) are shown in Table 1).
• Complete graph: This is the case where all nodes are connected to each other. It is the ideal situation in our framework, since any pair of nodes can communicate directly. For a complete graph G of size n > 0, 1 − λ2(2) = 1/n; see [1, Ch.9] or [3, Ch.1] for details.
• Two-dimensional grid: Here, nodes are located on a 2D grid, and each node is connected to its four neighbors on the grid. This network offers a regular graph with isotropic communication, but its diameter (√n) is quite high, especially in comparison to usual scale-free networks.
• Watts-Strogatz: This random network generation technique, introduced in [20], allows us to create networks with various communication properties. It relies on two parameters: the average degree of the network k and a rewiring probability p. In expectation, the higher the rewiring probability, the better the connectivity of the network. Here, we use k = 5 and p = 0.3 to achieve a connectivity compromise between the complete graph and the two-dimensional grid.

AUC measure. We first focus on the AUC measure of a linear classifier θ as defined in (3). We use the SVMguide3 binary classification dataset, which contains n = 1260 points in d = 23 dimensions.⁵ We set θ to the difference between the class means. For each generated network, we perform 50 runs of GoSta-sync (Algorithm 1) and U2-gossip. The top row of Figure 2 shows the evolution over time of the average relative error and the associated standard deviation across nodes for both algorithms on each type of network. On average, GoSta-sync outperforms U2-gossip on every network. The variance of the estimates across nodes is also lower due to the averaging step. Interestingly, the performance gap between the two algorithms widens considerably early on, presumably because the exponential term in the convergence bound of GoSta-sync is significant in the first steps.

Within-cluster point scatter. We then turn to the within-cluster point scatter defined in (2). 
We use the Wine Quality dataset, which contains n = 1599 points in d = 12 dimensions, with a total of K = 11 classes.⁶ We focus on the partition P associated to class centroids and run the aforementioned methods 50 times. The results are shown in the bottom row of Figure 2. As in the case of AUC, GoSta-sync achieves better performance on all types of networks, both in terms of average error and variance. In Figure 3a, we show the average time needed to reach a 0.2 relative error on a complete graph ranging from n = 50 to n = 1599. As predicted by our analysis, the performance gap widens in favor of GoSta as the size of the graph increases. Finally, we compare the performance of GoSta-sync and GoSta-async (Algorithm 2) in Figure 3b. Despite the slightly worse theoretical convergence rate for GoSta-async, both algorithms have comparable performance in practice.

⁵This dataset is available at http://mldata.org/repository/data/viewslug/svmguide3/
⁶This dataset is available at https://archive.ics.uci.edu/ml/datasets/Wine

Figure 2: Evolution of the average relative error (solid line) and its standard deviation (filled area) with the number of iterations for U2-gossip (red) and Algorithm 1 (blue) on the SVMguide3 dataset (top row) and the Wine Quality dataset (bottom row).

Figure 3: Panel (a) shows the average number of iterations needed to reach a relative error below 0.2 ("20% error reaching time"), for several network sizes n ∈ [50, 1599]. Panel (b) compares the average relative error (solid line) and its standard deviation (filled area) of the synchronous (blue) and asynchronous (red) versions of GoSta.

6 Conclusion

We have introduced new synchronous and asynchronous randomized gossip algorithms to compute statistics that depend on pairs of observations (U-statistics). 
We have proved the convergence rate in both settings, and numerical experiments confirm the practical interest of the proposed algorithms. In future work, we plan to investigate whether adaptive communication schemes (such as those of [6, 13]) can be used to speed up our algorithms. Our contribution could also be used as a building block for decentralized optimization of U-statistics, extending for instance the approaches of [7, 16].

Acknowledgements This work was supported by the chair Machine Learning for Big Data of Télécom ParisTech, and was conducted when A. Bellet was affiliated with Télécom ParisTech.

References
[1] Béla Bollobás. Modern Graph Theory, volume 184. Springer, 1998.
[2] Stephen P. Boyd, Arpita Ghosh, Balaji Prabhakar, and Devavrat Shah. Randomized gossip algorithms. IEEE Transactions on Information Theory, 52(6):2508–2530, 2006.
[3] Fan R. K. Chung. Spectral Graph Theory, volume 92. American Mathematical Society, 1997.
[4] Stéphan Clémençon. On U-processes and clustering performance. In Advances in Neural Information Processing Systems 24, pages 37–45, 2011.
[5] Alexandros G. Dimakis, Soummya Kar, José M. F. Moura, Michael G. Rabbat, and Anna Scaglione. Gossip Algorithms for Distributed Signal Processing. Proceedings of the IEEE, 98(11):1847–1864, 2010.
[6] Alexandros G. Dimakis, Anand D. Sarwate, and Martin J. Wainwright. Geographic Gossip: Efficient Averaging for Sensor Networks. IEEE Transactions on Signal Processing, 56(3):1205–1216, 2008.
[7] John C. Duchi, Alekh Agarwal, and Martin J. Wainwright. Dual Averaging for Distributed Optimization: Convergence Analysis and Network Scaling. IEEE Transactions on Automatic Control, 57(3):592–606, 2012.
[8] James A. Hanley and Barbara J. McNeil. The meaning and use of the area under a receiver operating characteristic (ROC) curve. 
Radiology, 143(1):29–36, 1982.
[9] Richard Karp, Christian Schindelhauer, Scott Shenker, and Berthold Vöcking. Randomized rumor spreading. In Symposium on Foundations of Computer Science, pages 565–574. IEEE, 2000.
[10] David Kempe, Alin Dobra, and Johannes Gehrke. Gossip-Based Computation of Aggregate Information. In Symposium on Foundations of Computer Science, pages 482–491. IEEE, 2003.
[11] Wojtek Kowalczyk and Nikos A. Vlassis. Newscast EM. In Advances in Neural Information Processing Systems, pages 713–720, 2004.
[12] Alan J. Lee. U-Statistics: Theory and Practice. Marcel Dekker, New York, 1990.
[13] Wenjun Li, Huaiyu Dai, and Yanbing Zhang. Location-Aided Fast Distributed Consensus in Wireless Networks. IEEE Transactions on Information Theory, 56(12):6208–6227, 2010.
[14] Henry B. Mann and Donald R. Whitney. On a Test of Whether one of Two Random Variables is Stochastically Larger than the Other. Annals of Mathematical Statistics, 18(1):50–60, 1947.
[15] Damon Mosk-Aoyama and Devavrat Shah. Fast distributed algorithms for computing separable functions. IEEE Transactions on Information Theory, 54(7):2997–3007, 2008.
[16] Angelia Nedic and Asuman Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48–61, 2009.
[17] Kristiaan Pelckmans and Johan Suykens. Gossip Algorithms for Computing U-Statistics. In IFAC Workshop on Estimation and Control of Networked Systems, pages 48–53, 2009.
[18] Devavrat Shah. Gossip Algorithms. Foundations and Trends in Networking, 3(1):1–125, 2009.
[19] John N. Tsitsiklis. Problems in decentralized decision making and computation. PhD thesis, Massachusetts Institute of Technology, 1984.
[20] Duncan J. Watts and Steven H. Strogatz. Collective dynamics of 'small-world' networks. 
Nature,\n\n393(6684):440\u2013442, 1998.\n\n9\n\n\f", "award": [], "sourceid": 169, "authors": [{"given_name": "Igor", "family_name": "Colin", "institution": "T\u00e9l\u00e9com ParisTech"}, {"given_name": "Aur\u00e9lien", "family_name": "Bellet", "institution": "Telecom ParisTech"}, {"given_name": "Joseph", "family_name": "Salmon", "institution": "T\u00e9l\u00e9com ParisTech"}, {"given_name": "St\u00e9phan", "family_name": "Cl\u00e9men\u00e7on", "institution": "Telecom ParisTech"}]}