{"title": "Order Optimal One-Shot Distributed Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 2168, "page_last": 2177, "abstract": "We consider distributed statistical optimization in one-shot setting, where there are $m$ machines each observing $n$ i.i.d samples. Based on its observed samples, each machine then sends an $O(\\log(mn))$-length message to a server, at which a parameter minimizing an expected loss is to be estimated. We propose an algorithm called Multi-Resolution Estimator (MRE) whose expected error is no larger than $\\tilde{O}( m^{-1/\\max(d,2)} n^{-1/2})$, where $d$ is the dimension of the parameter space. This error bound meets existing lower bounds up to poly-logarithmic factors, and is thereby order optimal. The expected error of MRE, unlike existing algorithms, tends to zero as the number of machines ($m$) goes to infinity, even when the number of samples per machine ($n$) remains upper bounded by a constant. This property of the MRE algorithm makes it applicable in new machine learning paradigms where $m$ is much larger than $n$.", "full_text": "Order Optimal One-Shot Distributed Learning\n\nArsalan Sharifnassab, Saber Salehkaleybar, S. Jamaloddin Golestani\n\nDepartment of Electrical Engineering, Sharif University of Technology, Tehran, Iran\n\na.sharifnassab@gmail.com, saleh@sharif.edu, golestani@sharif.edu\n\nAbstract\n\nWe consider distributed statistical optimization in one-shot setting, where there\nare m machines each observing n i.i.d. samples. Based on its observed samples,\neach machine then sends an O(log(mn))-length message to a server, at which a\nparameter minimizing an expected loss is to be estimated. We propose an algorithm\ncalled Multi-Resolution Estimator (MRE) whose expected error is no larger than\n\n\u02dcO(cid:0)m\u22121/max(d,2)n\u22121/2(cid:1), where d is the dimension of the parameter space. 
This error bound meets existing lower bounds up to poly-logarithmic factors, and is thereby order optimal. The expected error of MRE, unlike existing algorithms, tends to zero as the number of machines (m) goes to infinity, even when the number of samples per machine (n) remains upper bounded by a constant. This property of the MRE algorithm makes it applicable in new machine learning paradigms where m is much larger than n.

1 Introduction

The rapid growth in the size of datasets has given rise to distributed models for statistical learning, in which data is not stored on a single machine. In several recent learning applications, it is commonplace to distribute data across multiple machines, each of which processes its own data and communicates with other machines to carry out a learning task. The main bottleneck in such distributed settings is often the communication between machines, and several recent works have focused on designing communication-efficient algorithms for different machine learning applications [Duchi et al., 2012, Braverman et al., 2016, Chang et al., 2017, Diakonikolas et al., 2017, Lee et al., 2017].

In this paper, we consider the problem of statistical optimization in a distributed setting as follows. Consider an unknown distribution P over a collection, F, of differentiable convex functions with Lipschitz first order derivatives, defined on a convex region in R^d. There are m machines, each observing n i.i.d. sample functions from P. Each machine processes its observed data, and transmits a signal of certain length to a server. The server then collects all the signals and outputs an estimate of the parameter θ* that minimizes the expected loss, i.e., min_θ E_{f∼P}[f(θ)]. See Fig. 1 for an illustration of the system model.

We focus on the distributed aspect of the problem, considering an arbitrarily large number of machines (m), and

a) we present an order optimal algorithm with b = O(log mn) bits per transmission, whose estimation error is no larger than Õ(m^{-1/max(d,2)} n^{-1/2}), meeting the lower bound in [Salehkaleybar et al., 2019] up to a poly-logarithmic factor (cf. Theorem 1);

b) we present an algorithm with a single bit per message, with expected error no larger than Õ(m^{-1/2} + n^{-1/2}) (cf. Proposition 1).

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

1.1 Background

The distributed setting considered here has recently been employed in a new machine learning paradigm called Federated Learning [Konečný et al., 2015]. In this framework, training data is kept in users' computing devices due to privacy concerns, and the users participate in the training process without revealing their data. As an example, Google has been working on this paradigm in their recent project, Gboard [McMahan and Ramage, 2017], the Google keyboard. Besides communication constraints, one of the main challenges in this paradigm is that each machine has a small amount of data. In other words, the system operates in a regime where m is much larger than n [Chen et al., 2017].

A large body of the distributed statistical optimization/estimation literature considers the "one-shot" setting, in which each machine communicates with the server merely once [Zhang et al., 2013]. In these works, the main objective is to minimize the number of transmitted bits, while keeping the estimation error as low as the error of a centralized estimator, in which the entire data is co-located in the server.

If we impose no limit on the communication budget, then each machine can encode its entire data into a single message and send it to the server. In this case, the server acquires the entire data from all machines, and the distributed problem reduces to a centralized problem.
We call the sum of the observed functions at all machines the centralized empirical loss, and refer to its minimizer as the centralized solution. It is part of the folklore that the centralized solution is order optimal and that its expected error is Θ(1/√(mn)) [Lehmann and Casella, 2006, Zhang et al., 2013]. Clearly, no algorithm can beat the performance of the best centralized estimator.

Zhang et al. [2012] studied a simple averaging method where each machine obtains the empirical minimizer of its observed functions and sends this minimizer to the server through an O(log mn)-bit message. The output of the server is then the average of all received empirical minimizers. Zhang et al. [2012] showed that the expected error of this algorithm is no larger than O(1/√(mn) + 1/n), provided that: 1- all functions are convex and twice differentiable with Lipschitz continuous second derivatives, and 2- the objective function E_{f∼P}[f(θ)] is strongly convex at θ*. Under the extra assumption that the functions are three times differentiable with Lipschitz continuous third derivatives, Zhang et al. [2012] also present a bootstrap method whose expected error is O(1/√(mn) + 1/n^{1.5}).

It is easy to see that, under the above assumptions, the averaging method and the bootstrap method achieve the performance of the centralized solution if m ≤ n and m ≤ n², respectively. Recently, Jordan et al. [2018] proposed to optimize a surrogate loss function using a Taylor series expansion. This expansion can be constructed at the server by communicating O(m) d-dimensional vectors. Under similar assumptions on the loss function as in [Zhang et al., 2012], they showed that the expected error of their method is no larger than O(1/√(mn) + 1/n^{9/4}). It therefore achieves the performance of the centralized solution for m ≤ n^{3.5}.
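The averaging method discussed above can be sketched in a few lines. The quadratic sample losses, sample sizes, and constants below are illustrative assumptions, chosen only so that each machine's local empirical minimizer has a closed form; this is a sketch, not the exact setting of Zhang et al. [2012].

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy instance of the averaging method: machine i holds n samples
# x_ij ~ N(theta_star, 1) of the quadratic loss f(t) = (t - x_ij)^2,
# so its local empirical minimizer is simply the mean of its samples.
m, n, theta_star = 1000, 25, 0.7
data = rng.normal(theta_star, 1.0, size=(m, n))

# Each machine sends its local empirical minimizer to the server,
# and the server outputs the average of the received minimizers.
local_minimizers = data.mean(axis=1)
theta_hat = local_minimizers.mean()

print(abs(theta_hat - theta_star))
```

For quadratic losses the averaged estimate happens to coincide with the centralized solution; the O(1/n) bias term of the averaging method only shows up for losses whose local minimizers are nonlinear in the data, as in the footnoted n = 1 example of Section 2.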
However, note that when n is fixed, all aforementioned bounds remain lower bounded by a positive constant, even when m goes to infinity.

For the problem of sparse linear regression, Braverman et al. [2016] proved that any algorithm that achieves optimal minimax squared error requires communicating Ω(m × min(n, d)) bits in total from the machines to the server. Later, Lee et al. [2017] proposed an algorithm that achieves optimal mean squared error for the problem of sparse linear regression when d < n.

Recently, Salehkaleybar et al. [2019] studied the impact of communication constraints on the expected error, over a class of first order differentiable functions with Lipschitz continuous derivatives. In parts of their results, they showed that under the assumptions of Section 2 of this paper, in the case of a log mn bits communication budget, the expected error of any estimator is lower bounded by Ω̃(m^{-1/max(d,2)} n^{-1/2}). They also showed that if the number of bits per message is bounded by a constant and n is fixed, then the expected error remains lower bounded by a constant, even when the number of machines goes to infinity.

Other than one-shot communication, there is another major communication model that allows for several transmissions back and forth between the machines and the server. Most existing works of this type [Bottou, 2010, Lian et al., 2015, Zhang et al., 2015, McMahan et al., 2017] involve variants of stochastic gradient descent, in which the server queries at each iteration the gradient of the empirical loss at certain points from the machines. The gradient vectors are then aggregated in the server to update the model's parameters.
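One round of this query-and-aggregate loop can be sketched as follows; the quadratic sample losses, step size, and iteration count are illustrative assumptions, not taken from any of the cited works.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: m machines, each holding n samples x_ij ~ N(theta_star, 1)
# of the quadratic loss f(t) = (t - x_ij)^2, so the gradient of machine
# i's empirical loss at t is 2 * (t - mean of its samples).
m, n, theta_star = 50, 20, 0.7
data = rng.normal(theta_star, 1.0, size=(m, n))

theta, lr = 0.0, 0.25
for k in range(40):
    # The server queries each machine for the gradient of its empirical
    # loss at the current iterate, averages the replies, and steps.
    local_grads = 2.0 * (theta - data.mean(axis=1))
    theta -= lr * local_grads.mean()

# theta now approximates the minimizer of the centralized empirical
# loss, i.e., the grand mean of all m*n samples.
print(abs(theta - data.mean()))
```

Each iteration costs one message per machine, which is exactly the communication that the one-shot model of this paper forbids.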
The expected error of such algorithms typically scales as O(1/k), where k is the number of iterations.

1.2 Our contributions

We study the problem of one-shot distributed learning under milder assumptions than previously available in the literature. We assume that the loss functions, f ∈ F, are convex and differentiable with Lipschitz continuous first order derivatives. This is in contrast to the works of [Zhang et al., 2012] and [Jordan et al., 2018] that assume Lipschitz continuity of the second or third derivatives. The reader should keep these model differences in mind when comparing our bounds with the existing results.

Unlike existing works, our results concern the regime where the number of machines m is large, and our bounds tend to zero as m goes to infinity, even if the number of per-machine observations n is bounded by a constant. This is contrary to the algorithms in [Zhang et al., 2012], whose errors tend to zero only when n goes to infinity. In fact, when n = 1, a simple example¹ shows that the expected errors of the simple averaging and bootstrap algorithms in [Zhang et al., 2012] remain lower bounded by a constant, for all values of m. The algorithm in [Jordan et al., 2018] suffers from the same problem and its expected error may not go to zero when n = 1.

In this work, we present an algorithm with O(log(mn)) bits per message, which we call the Multi-Resolution Estimator for Convex landscapes and log mn bits communication budget (MRE-C-log) algorithm. We show that the estimation error of the MRE-C-log algorithm meets the aforementioned lower bound up to a poly-logarithmic factor. More specifically, we prove that the expected error of the MRE-C-log algorithm is no larger than Õ(m^{-1/max(d,2)} n^{-1/2}). In this algorithm, each machine reports not only its empirical minimizer, but also some information about the derivative of its empirical loss at some randomly chosen point in a neighborhood of this minimizer. To provide insight into the underlying idea behind the MRE-C-log algorithm, we also present a simple naive approach whose error tends to zero as the number of machines goes to infinity. Comparing with the lower bound in [Salehkaleybar et al., 2019], the expected error of the MRE-C-log algorithm meets the lower bound up to a poly-logarithmic factor. Moreover, for the case of having constant bits per message, we present a simple algorithm whose error goes to zero at rate Õ(m^{-1/2} + n^{-1/2}), when m and n go to infinity simultaneously. We evaluate the performance of the MRE-C-log algorithm on two different machine learning tasks and compare with the existing methods in [Zhang et al., 2012]. We show via experiments, for the n = 1 regime, that the MRE-C-log algorithm outperforms these algorithms. The observations are also in line with the expected error bounds we give in this paper and those previously available. In particular, in the n = 1 regime, the expected error of the MRE-C-log algorithm goes to zero as the number of machines increases, while the expected errors of the previously available estimators remain lower bounded by a constant.

1.3 Outline

The paper is organized as follows. We begin with a detailed model and problem definition in Section 2. In Section 3, we present our algorithms and main upper bounds. We then report our numerical experiments in Section 4. Finally, in Section 5 we discuss our results and present open problems and directions for future research.
The proofs of the main results and of the optimality of the MRE-C-log algorithm are given in the appendix.

2 Problem Definition

Consider a positive integer d and a collection F of real-valued convex functions over [-1, 1]^d. Let P be an unknown probability distribution over the functions in F. Consider the expected loss function

F(θ) = E_{f∼P}[f(θ)],  θ ∈ [-1, 1]^d.  (1)

Our goal is to learn a parameter θ* that minimizes F:

θ* = argmin_{θ ∈ [-1,1]^d} F(θ).  (2)

¹Consider two convex functions f₀(θ) = θ² + θ³/6 and f₁(θ) = (θ - 1)² + (θ - 1)³/6 over [0, 1]. Consider a distribution P that associates probability 1/2 to each function. Then, E_P[f(θ)] = f₀(θ)/2 + f₁(θ)/2, and the optimal solution is θ* = (√15 - 3)/2 ≈ 0.436. On the other hand, in the averaging method proposed in [Zhang et al., 2012], assuming n = 1, the empirical minimizer of each machine is either 0 if it observes f₀, or 1 if it observes f₁. Therefore, the server receives messages 0 and 1 with equal probability, and E[θ̂] = 1/2. Hence, E[|θ̂ - θ*|] > 0.06, for all values of m.

Figure 1: A distributed system of m machines, each having access to n independent sample functions from an unknown distribution P. Each machine sends a signal to a server based on its observations. The server receives all signals and outputs an estimate θ̂ for the optimization problem in (2).

The expected loss is to be minimized in a distributed fashion, as follows. We consider a distributed system comprising m identical machines and a server. Each machine i has access to a set of n independently and identically distributed samples {f_1^i, ..., f_n^i} drawn from the probability distribution P. Based on these observed functions, machine i then sends a signal Y^i to the server. We assume that the length of each signal is limited to b bits. The server then collects the signals Y^1, ..., Y^m and outputs an estimate of θ*, which we denote by θ̂. See Fig. 1 for an illustration of the system model.²

Assumption 1 We let the following assumptions on F and P be in effect throughout the paper.

• Every f ∈ F is once differentiable and convex.

• Each f ∈ F has bounded and Lipschitz continuous derivatives. More concretely, for any f ∈ F and any θ, θ′ ∈ [-1, 1]^d, we have |f(θ)| ≤ √d, ‖∇f(θ)‖ ≤ 1, and ‖∇f(θ) - ∇f(θ′)‖ ≤ ‖θ - θ′‖.

• Distribution P is such that F (defined in (1)) is strongly convex. More specifically, there is a constant λ > 0 such that for any θ₁, θ₂ ∈ [-1, 1]^d, we have F(θ₂) ≥ F(θ₁) + ∇F(θ₁)ᵀ(θ₂ - θ₁) + λ‖θ₂ - θ₁‖².

• The minimizer of F lies in the interior of the cube [-1, 1]^d. Equivalently, there exists θ* ∈ (-1, 1)^d such that ∇F(θ*) = 0.

3 Algorithms and Main Results

In this section, we propose estimators to minimize the expected loss, organized in a sequence of three subsections. In the first subsection, we consider the case of constant bits per signal transmission, whereas in the last two subsections we allow for log mn bits per signal transmission. For the latter regime, we first present, in Subsection 3.2, a simple naive approach whose estimation error goes to zero for large values of m, even when n = 1.
Afterwards, in Subsection 3.3, we describe our main estimator, establish an upper bound on its estimation error, and show that it is order optimal.

3.1 Constant number of bits per transmission

Here, we consider a simple case with a one-dimensional domain (d = 1) and a one-bit signal per transmission (b = 1). We show that the expected error can be made arbitrarily small as m and n go to infinity simultaneously.

Proposition 1 Suppose that d = 1 and b = 1. There exists a randomized estimator θ̂ such that

E[(θ̂ - θ*)²]^{1/2} = O(1/√m + 1/√n).

²The considered model here is similar to the one in [Salehkaleybar et al., 2019].

The proof is given in Appendix A. There, we assume for simplicity that the domain is the [0, 1] interval and propose a simple randomized algorithm in which each machine i first computes an O(1/√n)-accurate estimate θᵢ based on its observed functions. It then sends a Y^i = 1 signal with probability θᵢ. The server then outputs the average of the received signals as the final estimate.

Based on Proposition 1, there is an algorithm that achieves any desired accuracy, even with a budget of one bit, provided that m and n go to infinity simultaneously. On the contrary, it was shown in Proposition 1 of [Salehkaleybar et al., 2019] that no estimator yields error better than a constant if n = 1 and the number of bits per transmission is a constant independent of m. We conjecture that the bound in Proposition 1 is tight.
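The randomized one-bit scheme behind Proposition 1 admits a short sketch. The quadratic sample losses, noise level, and machine count below are illustrative assumptions, not the exact construction of Appendix A.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy instance over the [0, 1] interval: machine i holds n samples
# x_ij ~ N(theta_star, 0.2) of the loss f(t) = (t - x_ij)^2, whose
# local empirical minimizer (the sample mean) is an O(1/sqrt(n))-
# accurate estimate of theta_star.
m, n, theta_star = 50_000, 4, 0.436
samples = rng.normal(theta_star, 0.2, size=(m, n))
local_est = np.clip(samples.mean(axis=1), 0.0, 1.0)

# Each machine sends a single bit Y_i = 1 with probability equal to
# its local estimate; the server outputs the average of the bits,
# an unbiased estimate of the average local estimate.
bits = rng.random(m) < local_est
theta_hat = bits.mean()

print(abs(theta_hat - theta_star))
```

Since E[Y^i] equals the expectation of the local estimate, the server's average concentrates around θ* with error of order 1/√m + 1/√n, in line with Proposition 1.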
More concretely, we conjecture that for a constant number of bits per transmission and any randomized estimator θ̂, we have E[(θ̂ - θ*)²]^{1/2} = Ω̃(1/√n + 1/√m).

3.2 A simple naive approach with log mn bits per transmission

We now consider the case where the number of bits per transmission is O(log m). In order to set the stage for our main algorithm, given in the next subsection, here we present a simple algorithm and show that its estimation error decays as Õ(m^{-1/3}). The underlying idea is that, unlike in existing estimators, in this algorithm each machine encodes in its signal some information about the shape of its observed functions at a point that is not necessarily close to its own private optimum. To simplify the presentation, here we confine our setting to a one-dimensional domain (d = 1), with each machine observing a single sample function (n = 1). The algorithm is as follows:

Consider a regular grid of size m^{1/3}/log(m) over the [-1, 1] interval. Each machine i selects a grid point θᵢ uniformly at random. The machine then forms a signal comprising two parts: 1- the location of θᵢ, and 2- the derivative of its observed function f^i at θᵢ. In other words, the signal Y^i of the i-th machine is an ordered pair of the form (θᵢ, f′^i(θᵢ)), where f′^i(θᵢ) is the derivative of f^i at θᵢ. In this encoding, we use O(log m) bits to represent both θᵢ and f′^i(θᵢ). In the server, for each grid point θ, the average of f′^i is computed over all machines i with θᵢ = θ. We denote this average by F̂′(θ). The server then outputs a point θ that minimizes |F̂′(θ)|.

This algorithm learns an estimate of the derivative of F, and finds a point that minimizes the magnitude of this derivative.
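The steps above can be sketched as follows; the quadratic sample losses and the fixed grid size are illustrative assumptions (the algorithm's grid has size m^{1/3}/log(m)).

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy instance of the naive grid estimator (d = 1, n = 1). Each machine
# observes one quadratic loss f_i(t) = (t - x_i)^2 with derivative
# f_i'(t) = 2 * (t - x_i), where x_i ~ N(theta_star, 0.3); the expected
# loss F is then minimized at theta_star.
m, theta_star = 100_000, 0.3
x = rng.normal(theta_star, 0.3, size=m)
grid = np.linspace(-1.0, 1.0, 47)  # illustrative fixed grid size

# Each machine picks a grid point uniformly at random and reports the
# derivative of its observed function at that point.
idx = rng.integers(len(grid), size=m)
deriv = 2.0 * (grid[idx] - x)

# The server averages the reported derivatives per grid point and
# outputs the grid point whose averaged derivative has the smallest
# magnitude.
F_prime = np.array([deriv[idx == j].mean() for j in range(len(grid))])
theta_hat = grid[np.argmin(np.abs(F_prime))]

print(abs(theta_hat - theta_star))
```

With roughly m divided by the grid size reports per grid point, each averaged derivative concentrates around F′ at that point, so the returned grid point lies within about one grid spacing of θ*.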
The following proposition shows that the estimation error of this algorithm is Õ(1/m^{1/3}). The proof is given in Appendix B.

Proposition 2 Let θ̂ be the output of the above estimator. For any α > 1,

Pr( |θ̂ - θ*| > 3α log(m) / (λ m^{1/3}) ) = O( exp(-α² log³ m) ).

Consequently, for any k ≥ 1, we have E[|θ̂ - θ*|^k] = O( (log(m)/m^{1/3})^k ).

We now turn to the general case with arbitrary values of d and n, and present our main estimator.

3.3 The Main Algorithm

In this part, we propose our main algorithm and an upper bound on its estimation error. In the proposed algorithm, the transmitted signals are designed such that the server can construct a multi-resolution view of the gradient of the function F(θ) around a promising grid point. We therefore call the proposed algorithm "Multi-Resolution Estimator for Convex landscapes with log mn bits communication budget (MRE-C-log)". The description of MRE-C-log is as follows:

Each machine i observes n functions and sends a signal Y^i comprising three parts of the form (s, p, Δ). The signals are of length O(log(mn)) bits, and the three parts s, p, and Δ are as follows.

Figure 2: An illustration of the grid G and the cube C_s centered at point s for d = 2. The point p belongs to G̃_s² and p′ is the parent of p.

• Part s: Consider a grid G with resolution log(mn)/√n over the d-dimensional cube.
Each machine i computes the minimizer of the average of its first n/2 observed functions,

θᵢ = argmin_{θ ∈ [-1,1]^d} Σ_{j=1}^{n/2} f_j^i(θ).  (3)

It then lets s be the closest grid point to θᵢ.

• Part p: Let

δ ≜ 4 ( log⁵(mn) / (√d m) )^{1/max(d,2)}.  (4)

Note that δ = Õ(m^{-1/max(d,2)}). Let t = log(1/δ). Without loss of generality, we assume that t is an integer. Let C_s be a d-dimensional cube with edge size 2 log(mn)/√n centered at s. Consider a sequence of t + 1 grids on C_s as follows. For each l = 0, ..., t, we partition the cube C_s into 2^{ld} smaller equal sub-cubes with edge size 2^{-l+1} log(mn)/√n. The l-th grid G̃_s^l comprises the centers of these smaller cubes. Then, each G̃_s^l has 2^{ld} grid points.

For any point p′ in G̃_s^l, we say that p′ is the parent of all 2^d points in G̃_s^{l+1} that are in the (2^{-l} × (2 log mn)/√n)-cube centered at p′ (see Fig. 2). Thus, each point of G̃_s^l (l < t) has 2^d children.

To select p, we randomly choose an l from 0, ..., t with probability 2^{(d-2)l} / ( Σ_{j=0}^{t} 2^{(d-2)j} ). We then let p be a uniformly chosen random grid point in G̃_s^l. Note that O(d log(1/δ)) = O(d log(mn)) bits suffice to identify p uniquely.

• Part Δ: We let

F̂^i(θ) ≜ (2/n) Σ_{j=n/2+1}^{n} f_j^i(θ),  (5)

and refer to it as the empirical function of the i-th machine. If the p selected in the previous part is in G̃_s⁰, i.e., p = s, then we set Δ to the gradient of F̂^i at θ = s. Otherwise, if p is in G̃_s^l for l ≥ 1, we let

Δ ≜ ∇F̂^i(p) - ∇F̂^i(p′),

where p′ ∈ G̃_s^{l-1} is the parent of p. Note that Δ is a d-dimensional vector whose entries are in the range (2^{-l} √d log(mn)/√n) × [-1, +1]. This is due to the Lipschitz continuity of the derivatives of the functions in F (cf. Assumption 1) and the fact that ‖p - p′‖ = 2^{-l} √d log(mn)/√n. Hence, we can use O(d log(mn)) bits to represent Δ within accuracy 2δ log(mn)/√n.

At the server, we choose an s* ∈ G that has the largest number of occurrences in the received signals. Then, based on the signals corresponding to G̃_{s*}⁰, we approximate the gradient of F at s* as

∇̂F(s*) = (1/N_{s*}) Σ_{signals of the form Y^i=(s*, s*, Δ)} Δ,

where N_{s*} is the number of signals containing s* in the part p. Then, for any point p ∈ G̃_{s*}^l with l ≥ 1, we compute

∇̂F(p) = ∇̂F(p′) + (1/N_p) Σ_{signals of the form Y^i=(s*, p, Δ)} Δ,  (6)

where N_p is the number of signals having point p in their second argument. Finally, the server lets θ̂ be a grid point p in G̃_{s*}^t with the smallest ‖∇̂F(p)‖.

In the MRE-C-log algorithm, the signals are of length d/(d+1) log m + d log n bits, which is no larger than d log mn. Please refer to Section 5 for discussions on how the MRE-C-log algorithm can be extended to work under more general communication constraints.

Theorem 1 Let θ̂ be the output of the above algorithm.
Then,

Pr( ‖θ̂ - θ*‖ > 8d log^{5/max(d,2)+1}(mn) / (λ m^{1/max(d,2)} n^{1/2}) ) = exp( -Ω(log²(mn)) ).

The proof is given in Appendix C. The proof goes by first showing that s* is a closest grid point of G to θ* with high probability. We then show that for any l ≤ t and any p ∈ G̃_{s*}^l, the number of received signals corresponding to p is large enough so that the server obtains a good approximation of ∇F at p. Once we have a good approximation ∇̂F of ∇F at all points of G̃_{s*}^t, a point at which ∇̂F has the minimum norm lies close to the minimizer of F.

Corollary 1 Let θ̂ be the output of the above algorithm. There is a constant η > 0 such that for any k ∈ N,

E[‖θ̂ - θ*‖^k] < η ( 8d log^{5/max(d,2)+1}(mn) / (λ m^{1/max(d,2)} n^{1/2}) )^k.

Moreover, η can be chosen arbitrarily close to 1, for large enough values of mn.

The upper bound in Theorem 1 matches the lower bound in Theorem 2 of [Salehkaleybar et al., 2019] up to a polylogarithmic factor. In this view, the MRE-C-log algorithm has order optimal error. Moreover, as we show in Appendix C, in the course of the computations, the server obtains an approximation F̂ of F such that for any θ in the cube C_{s*}, we have ‖∇F̂(θ) - ∇F(θ)‖ = Õ(m^{-1/d} n^{-1/2}). Therefore, the server not only finds the minimizer of F, but also obtains an approximation of F at all points inside C_{s*}. In the special case that n = 1, we have C_{s*} = [-1, 1]^d, and as a result, the server acquires an approximation of F over the entire domain. This observation suggests the following insight: in the extreme distributed case (n = 1), finding an O(m^{-1/d})-accurate minimizer of ∇F is as hard as finding an O(m^{-1/d})-accurate approximation of F at all points in the domain.

4 Experiments

We evaluated the performance of MRE-C-log on two learning tasks and compared it with the averaging method (AVGM) in [Zhang et al., 2012]. Recall that in AVGM, each machine sends the empirical risk minimizer of its own data to the server, and the average of the received parameters is returned as the output.

The first experiment concerns the problem of ridge regression. Here, each sample (X, Y) is generated based on a linear model Y = XᵀΘ* + E, where X, E, and θ* are sampled from N(0, I_{d×d}), N(0, 0.01), and the uniform distribution over [0, 1]^d, respectively. We consider the square loss function with l₂ norm regularization: f(θ) = (θᵀX - Y)² + 0.1‖θ‖₂². In the second experiment, we perform a logistic regression task, with the sample vector X generated according to N(0, I_{d×d}) and labels Y randomly drawn from {-1, 1} with probability Pr(Y = 1 | X, θ*) = 1/(1 + exp(-Xᵀθ*)). In both experiments, we consider a two-dimensional domain (d = 2) and assume that each machine has access to one sample (n = 1).

Figure 3: The average error of the MRE-C-log and AVGM algorithms versus the number of machines in two different learning tasks. (a) Ridge regression. (b) Logistic regression.

In Fig. 3, the average of ‖θ̂ - θ*‖₂ is computed over 100 instances for different numbers of machines in the range [10⁴, 10⁶]. Both experiments suggest that the average error of MRE-C-log keeps decreasing as the number of machines increases.
This is consistent with the result in Theorem 1, according to which the expected error of MRE-C-log is upper bounded by Õ(1/√(mn)). It is evident from the error curves that MRE-C-log outperforms the AVGM algorithm in both tasks. This is because, when m is much larger than n, the expected error of the AVGM algorithm typically scales as O(1/n), independently of m.

5 Discussion

We studied the problem of statistical optimization in a distributed system with one-shot communications. We proposed an algorithm, called MRE-C-log, with O(log(mn)) bits per message, and showed that its expected error is optimal up to a poly-logarithmic factor. Aside from being order optimal, the MRE-C-log algorithm has the advantage over the existing estimators that its error tends to zero as the number of machines goes to infinity, even when the number of samples per machine is upper bounded by a constant. This property is in line with the out-performance of the MRE-C-log algorithm in the m ≫ n regime, as discussed in our experimental results.

The main idea behind the MRE-C-log algorithm is that it essentially computes, in an efficient way, an approximation of the gradient of the expected loss over the entire domain. It then outputs a norm-minimizer of this approximate gradient, as an estimate of the minimizer of the expected loss. Therefore, MRE-C-log carries out the intricate and seemingly redundant task of approximating the loss function at all points in the domain, in order to resolve the apparently much easier problem of finding a single approximate minimizer of the loss function. In this view, it is quite counter-intuitive that such an algorithm is order optimal in terms of expected error and sample complexity.
This observation provides the interesting insight that, in a distributed system with one-shot communication, finding an approximate minimizer is as hard as finding an approximation of the function's derivatives at all points in the domain.

Our algorithms and bounds are designed and derived for a broader class of functions, with Lipschitz continuous first order derivatives, compared to the previous works that consider function classes with Lipschitz continuous second or third order derivatives. This assumption is both practically important and technically challenging. For example, it is well known that the loss landscapes involved in learning applications and neural networks are highly non-smooth. Therefore, relaxing assumptions on higher order derivatives is a practically important improvement over the previous works. On the other hand, assuming Lipschitzness only for the first order derivative renders the problem considerably more difficult. To see this, note that when n > m, the existing upper bound O(1/√(mn) + 1/n) for the case of Lipschitz second derivatives goes below the Ω̃(m^{-1/d} n^{-1/2}) lower bound for the case of Lipschitz first derivatives.

A drawback of the MRE-C-log algorithm is that each machine needs to know m in order to set the number of levels for the grids. This, however, can be resolved by considering an infinite number of levels, and letting the probability that p is chosen from level l decrease exponentially with l. Moreover, although the communication budget of the MRE-C-log algorithm is O(d log mn) bits per signal, the algorithm can be extended to work under more general communication constraints, by dividing each signal into subsignals of length O(d log mn), each containing an independent signal of the MRE-C-log algorithm.
The expected error of this modified algorithm can be shown to still match the existing lower bounds up to logarithmic factors. Please refer to Salehkaleybar et al. [2019] for a thorough treatment.

We also proposed, for d = 1, an algorithm with a communication budget of one bit per transmission, whose error tends to zero at a rate of O(1/√m + 1/√n) as m and n go to infinity simultaneously. We conjecture that this algorithm is order-optimal, in the sense that no randomized constant-bit algorithm has expected error smaller than O(1/√m + 1/√n).

There are several open problems and directions for future research. The first group of problems involves the constant-bit regime. It would be interesting to verify whether or not the bound in Proposition 1 is order optimal. Moreover, the constant-bit algorithm in Subsection 3.1 is designed for one-dimensional domains and one bit per transmission. Decent extensions of this algorithm to higher dimensions with vanishing error under the one-bit-per-transmission constraint appear to be non-trivial. Investigating the power of more bits per transmission (constants larger than one bit) in reducing the expected error is another interesting direction.

Another important group of problems concerns the more restricted class of functions with Lipschitz continuous second order derivatives. Despite several attempts in the literature, the optimal scaling of the expected error for this class of functions in the m ≫ n regime is still an open problem.

Acknowledgments

This research was supported by the Iran National Science Foundation (INSF) under contract No. 97012846.

References

Léon Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT, pages 177–186. Springer, 2010.

Mark Braverman, Ankit Garg, Tengyu Ma, Huy L Nguyen, and David P Woodruff.
Communication lower bounds for statistical estimation problems via a distributed data processing inequality. In Proceedings of the Forty-Eighth Annual ACM Symposium on Theory of Computing, pages 1011–1020. ACM, 2016.

Xiangyu Chang, Shao-Bo Lin, and Ding-Xuan Zhou. Distributed semi-supervised learning with kernel ridge regression. The Journal of Machine Learning Research, 18(1):1493–1514, 2017.

Yudong Chen, Lili Su, and Jiaming Xu. Distributed statistical machine learning in adversarial settings: Byzantine gradient descent. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 1(2):44, 2017.

Ilias Diakonikolas, Elena Grigorescu, Jerry Li, Abhiram Natarajan, Krzysztof Onak, and Ludwig Schmidt. Communication-efficient distributed learning of discrete distributions. In Advances in Neural Information Processing Systems, pages 6391–6401, 2017.

John C Duchi, Alekh Agarwal, and Martin J Wainwright. Dual averaging for distributed optimization: convergence analysis and network scaling. IEEE Transactions on Automatic Control, 57(3):592–606, 2012.

Michael I Jordan, Jason D Lee, and Yun Yang. Communication-efficient distributed statistical inference. Journal of the American Statistical Association, pages 1–14, 2018.

Jakub Konečný, Brendan McMahan, and Daniel Ramage. Federated optimization: distributed optimization beyond the datacenter. arXiv preprint arXiv:1511.03575, 2015.

Jason D Lee, Qiang Liu, Yuekai Sun, and Jonathan E Taylor. Communication-efficient sparse regression. The Journal of Machine Learning Research, 18(1):115–144, 2017.

Erich L Lehmann and George Casella. Theory of Point Estimation. Springer Science & Business Media, 2006.

Xiangru Lian, Yijun Huang, Yuncheng Li, and Ji Liu. Asynchronous parallel stochastic gradient for nonconvex optimization.
In Advances in Neural Information Processing Systems, pages 2737–2745, 2015.

Brendan McMahan and Daniel Ramage. Federated learning: collaborative machine learning without centralized training data. Google Research Blog, 3, 2017.

Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pages 1273–1282, 2017.

Rajeev Motwani and Prabhakar Raghavan. Randomized Algorithms. Cambridge University Press, 1995.

Saber Salehkaleybar, Arsalan Sharifnassab, and S. Jamaloddin Golestani. One-shot federated learning: theoretical limits and algorithms to achieve them. arXiv preprint arXiv:1905.04634v1, 2019.

Sixin Zhang, Anna E Choromanska, and Yann LeCun. Deep learning with elastic averaging SGD. In Advances in Neural Information Processing Systems, pages 685–693, 2015.

Yuchen Zhang, Martin J Wainwright, and John C Duchi. Communication-efficient algorithms for statistical optimization. In Advances in Neural Information Processing Systems, pages 1502–1510, 2012.

Yuchen Zhang, John Duchi, Michael I Jordan, and Martin J Wainwright. Information-theoretic lower bounds for distributed statistical estimation with communication constraints. In Advances in Neural Information Processing Systems, pages 2328–2336, 2013.