{"title": "Federated Multi-Task Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 4424, "page_last": 4434, "abstract": "Federated learning poses new statistical and systems challenges in training machine learning models over distributed networks of devices. In this work, we show that multi-task learning is naturally suited to handle the statistical challenges of this setting, and propose a novel systems-aware optimization method, MOCHA, that is robust to practical systems issues. Our method and theory for the first time consider issues of high communication cost, stragglers, and fault tolerance for distributed multi-task learning. The resulting method achieves significant speedups compared to alternatives in the federated setting, as we demonstrate through simulations on real-world federated datasets.", "full_text": "Federated Multi-Task Learning\n\nVirginia Smith\n\nStanford\n\nChao-Kai Chiang\u2217\n\nMaziar Sanjabi\u2217\n\nsmithv@stanford.edu\n\nchaokaic@usc.edu\n\nmaziarsanjabi@gmail.com\n\ntalwalkar@cmu.edu\n\nUSC\n\nUSC\n\nAmeet Talwalkar\n\nCMU\n\nAbstract\n\nFederated learning poses new statistical and systems challenges in training machine\nlearning models over distributed networks of devices. In this work, we show that\nmulti-task learning is naturally suited to handle the statistical challenges of this\nsetting, and propose a novel systems-aware optimization method, MOCHA, that is\nrobust to practical systems issues. Our method and theory for the \ufb01rst time consider\nissues of high communication cost, stragglers, and fault tolerance for distributed\nmulti-task learning. The resulting method achieves signi\ufb01cant speedups compared\nto alternatives in the federated setting, as we demonstrate through simulations on\nreal-world federated datasets.\n\n1\n\nIntroduction\n\nMobile phones, wearable devices, and smart homes are just a few of the modern distributed networks\ngenerating massive amounts of data each day. 
Due to the growing storage and computational power of devices in these networks, it is increasingly attractive to store data locally and push more network computation to the edge. The nascent field of federated learning explores training statistical models directly on devices [37]. Examples of potential applications include: learning sentiment, semantic location, or activities of mobile phone users; predicting health events like low blood sugar or heart attack risk from wearable devices; or detecting burglaries within smart homes [3, 39, 42]. Following [25, 36, 26], we summarize the unique challenges of federated learning below.

1. Statistical Challenges: The aim in federated learning is to fit a model to data, {X_1, . . . , X_m}, generated by m distributed nodes. Each node, t ∈ [m], collects data in a non-IID manner across the network, with data on each node being generated by a distinct distribution X_t ∼ P_t. The number of data points on each node, n_t, may also vary significantly, and there may be an underlying structure present that captures the relationship amongst nodes and their associated distributions.

2. Systems Challenges: There are typically a large number of nodes, m, in the network, and communication is often a significant bottleneck. Additionally, the storage, computational, and communication capacities of each node may differ due to variability in hardware (CPU, memory), network connection (3G, 4G, WiFi), and power (battery level).
These systems challenges, compounded with unbalanced data and statistical heterogeneity, make issues such as stragglers and fault tolerance significantly more prevalent than in typical data center environments.

In this work, we propose a modeling approach that differs significantly from prior work on federated learning, where the aim thus far has been to train a single global model across the network [25, 36, 26]. Instead, we address statistical challenges in the federated setting by learning separate models for each node, {w_1, . . . , w_m}. This can be naturally captured through a multi-task learning (MTL) framework, where the goal is to consider fitting separate but related models simultaneously [14, 2, 57, 28]. Unfortunately, current multi-task learning methods are not suited to handle the systems challenges that arise in federated learning, including high communication cost, stragglers, and fault tolerance. Addressing these challenges is therefore a key component of our work.

*Authors contributed equally.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

1.1 Contributions

We make the following contributions. First, we show that MTL is a natural choice to handle statistical challenges in the federated setting. Second, we develop a novel method, MOCHA, to solve a general MTL problem. Our method generalizes the distributed optimization method COCOA [22, 31] in order to address systems challenges associated with network size and node heterogeneity. Third, we provide convergence guarantees for MOCHA that carefully consider these unique systems challenges and provide insight into practical performance. Finally, we demonstrate the superior empirical performance of MOCHA with a new benchmarking suite of federated datasets.

2 Related Work

Learning Beyond the Data Center.
Computing SQL-like queries across distributed, low-powered nodes is a decades-long area of research that has been explored under the purview of query processing in sensor networks, computing at the edge, and fog computing [32, 12, 33, 8, 18, 15]. Recent works have also considered training machine learning models centrally but serving and storing them locally, e.g., this is a common approach in mobile user modeling and personalization [27, 43, 44]. However, as the computational power of the nodes within distributed networks grows, it is possible to do even more work locally, which has led to recent interest in federated learning.2 In contrast to our proposed approach, existing federated learning approaches [25, 36, 26, 37] aim to learn a single global model across the data.3 This limits their ability to deal with non-IID data and structure amongst the nodes. These works also come without convergence guarantees, and have not addressed practical issues of stragglers or fault tolerance, which are important characteristics of the federated setting. The work proposed here is, to the best of our knowledge, the first federated learning framework to consider these challenges, theoretically and in practice.

Multi-Task Learning. In multi-task learning, the goal is to learn models for multiple related tasks simultaneously. While the MTL literature is extensive, most MTL modeling approaches can be broadly categorized into two groups based on how they capture relationships amongst tasks. The first (e.g., [14, 4, 11, 24]) assumes that a clustered, sparse, or low-rank structure between the tasks is known a priori. A second group instead assumes that the task relationships are not known beforehand and can be learned directly from the data (e.g., [21, 57, 16]). In this work, we focus our attention on this latter group, as task relationships may not be known beforehand in real-world settings.
In comparison to learning a single global model, these MTL approaches can directly capture relationships amongst non-IID and unbalanced data, which makes them particularly well-suited for the statistical challenges of federated learning. We demonstrate this empirically on real-world federated datasets in Section 5. However, although MTL is a natural modeling choice to address the statistical challenges of federated learning, currently proposed methods for distributed MTL (discussed below) do not adequately address the systems challenges associated with federated learning.

Distributed Multi-Task Learning. Distributed multi-task learning is a relatively new area of research, in which the aim is to solve an MTL problem when data for each task is distributed over a network. While several recent works [1, 35, 54, 55] have considered the issue of distributed MTL training, the proposed methods do not allow for flexibility of communication versus computation. As a result, they are unable to efficiently handle concerns of fault tolerance and stragglers, the latter of which stems from both data and system heterogeneity. The works of [23] and [7] allow for asynchronous updates to help mitigate stragglers, but do not address fault tolerance. Moreover, [23] provides no convergence guarantees, and the convergence of [7] relies on a bounded delay assumption that is impractical for the federated setting, where delays may be significant and devices may drop out completely. Finally, [30] proposes a method and setup leveraging the distributed framework COCOA [22, 31], which we show in Section 4 to be a special case of the more general approach in this work.
However, the authors in [30] do not explore the federated setting, and their assumption that the same amount of work is done locally on each node is prohibitive in federated settings, where unbalance is common due to data and system variability.

2 The term on-device learning has been used to describe both the task of model training and of model serving. Due to the ambiguity of this phrase, we exclusively use the term federated learning.

3 While not the focus of our work, we note privacy is an important concern in the federated setting, and that the privacy benefits associated with global federated learning (as discussed in [36]) also apply to our approach.

3 Federated Multi-Task Learning

In federated learning, the aim is to learn a model over data that resides on, and has been generated by, m distributed nodes. As a running example, consider learning the activities of mobile phone users in a cell network based on their individual sensor, text, or image data. Each node (phone), t ∈ [m], may generate data via a distinct distribution, and so it is natural to fit separate models, {w_1, . . . , w_m}, to the distributed data, one for each local dataset. However, structure between models frequently exists (e.g., people may behave similarly when using their phones), and modeling these relationships via multi-task learning is a natural strategy to improve performance and boost the effective sample size for each node [10, 2, 5]. In this section, we suggest a general MTL framework for the federated setting, and propose a novel method, MOCHA, to handle the systems challenges of federated MTL.

3.1 General Multi-Task Learning Setup

Given data X_t ∈ R^{d×n_t} from m nodes, multi-task learning fits separate weight vectors w_t ∈ R^d to the data for each task (node) through arbitrary convex loss functions ℓ_t (e.g., the hinge loss for SVM models).
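This per-task setup can be made concrete with a small sketch: a toy NumPy illustration of m tasks with unbalanced data sizes and hinge losses (synthetic data and dimensions chosen here for illustration, not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 10, 5                       # feature dimension, number of tasks (nodes)
n_t = rng.integers(20, 100, m)     # unbalanced per-task sample sizes

# Per-task data X_t in R^{d x n_t} and labels y_t in {-1, +1}
X = [rng.standard_normal((d, n)) for n in n_t]
y = [rng.choice([-1.0, 1.0], size=n) for n in n_t]
W = rng.standard_normal((d, m))    # column t is the weight vector w_t

def total_hinge_loss(W, X, y):
    """Sum over tasks t and points i of the hinge loss on y_t^i * w_t^T x_t^i,
    i.e., the data-fitting term of an MTL objective with separate models."""
    total = 0.0
    for t in range(len(X)):
        margins = y[t] * (W[:, t] @ X[t])
        total += np.maximum(0.0, 1.0 - margins).sum()
    return total
```

Each task keeps its own column of W, so the data term decomposes across nodes; only the regularizer (introduced next) couples the tasks.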
Many MTL problems can be captured via the following general formulation:

$$\min_{W,\,\Omega} \;\left\{ \sum_{t=1}^{m} \sum_{i=1}^{n_t} \ell_t\big(w_t^T x_t^i,\, y_t^i\big) + R(W, \Omega) \right\}, \qquad (1)$$

where W := [w_1, . . . , w_m] ∈ R^{d×m} is a matrix whose t-th column is the weight vector for the t-th task. The matrix Ω ∈ R^{m×m} models relationships amongst tasks, and is either known a priori or estimated while simultaneously learning task models. MTL problems differ based on their assumptions on R, which takes Ω as input and promotes some suitable structure amongst the tasks. As an example, several popular MTL approaches assume that tasks form clusters based on whether or not they are related [14, 21, 57, 58]. This can be expressed via the following bi-convex formulation:

$$R(W, \Omega) = \lambda_1 \operatorname{tr}\big(W \Omega W^T\big) + \lambda_2 \|W\|_F^2, \qquad (2)$$

with constants λ_1, λ_2 > 0, and where the second term performs L2 regularization on each local model. We use a similar formulation (14) in our experiments in Section 5, and provide details on other common classes of MTL models that can be formulated via (1) in Appendix B.

3.2 MOCHA: A Framework for Federated Multi-Task Learning

In the federated setting, the aim is to train statistical models directly on the edge, and thus we solve (1) while assuming that the data {X_1, . . . , X_m} is distributed across m nodes or devices. Before proposing our federated method for solving (1), we make the following observations:

• Observation 1: In general, (1) is not jointly convex in W and Ω, and even in the cases where (1) is convex, solving for W and Ω simultaneously can be difficult [5].
• Observation 2: When fixing Ω, updating W depends on both the data X, which is distributed across the nodes, and the structure Ω, which is known centrally.
• Observation 3: When fixing W, optimizing for Ω only depends on W and not on the data X.

Based on these observations, it is natural to propose an alternating optimization approach to solve problem (1), in which at each iteration we fix either W or Ω and optimize over the other, alternating until convergence is reached. Note that solving for Ω is not dependent on the data and therefore can be computed centrally; as such, we defer to prior work for this step [58, 21, 57, 16]. In Appendix B, we discuss updates to Ω for several common MTL models.

In this work, we focus on developing an efficient distributed optimization method for the W step. In traditional data center environments, the task of distributed training is a well-studied problem, and various communication-efficient frameworks have been recently proposed, including the state-of-the-art primal-dual COCOA framework [22, 31]. Although COCOA can be extended directly to update W in a distributed fashion across the nodes, it cannot handle the unique systems challenges of the federated environment, such as stragglers and fault tolerance, as discussed in Section 3.4. To this end, we extend COCOA and propose a new method, MOCHA, for federated multi-task learning. Our method is given in Algorithm 1 and described in detail in Sections 3.3 and 3.4.

Algorithm 1 MOCHA: Federated Multi-Task Learning Framework
1: Input: Data X_t from t = 1, . . . , m tasks, stored on one of m nodes, and initial matrix Ω^0
2: Starting point α^(0) := 0 ∈ R^n, v^(0) := 0 ∈ R^b
3: for iterations i = 0, 1, . . . do
4:   Set subproblem parameter σ′ and number of federated iterations, H_i
5:   for iterations h = 0, 1, · · · , H_i do
6:     for tasks t ∈ {1, 2, . . . , m} in parallel over m nodes do
7:       call local solver, returning θ^h_t-approximate solution Δα_t of the local subproblem (4)
8:       update local variables α_t ← α_t + Δα_t
9:       return updates Δv_t := X_t Δα_t
10:    reduce: v_t ← v_t + Δv_t
11:  Update Ω centrally based on w(α) for latest α
12: Central node computes w = w(α) based on the latest α
13: return: W := [w_1, . . . , w_m]

3.3 Federated Update of W

To update W in the federated setting, we begin by extending works on distributed primal-dual optimization [22, 31, 30] to apply to the generalized multi-task framework (1). This involves deriving the appropriate dual formulation, subproblems, and problem parameters, as we detail below.

Dual problem. Considering the dual formulation of (1) will allow us to better separate the global problem into distributed subproblems for federated computation across the nodes. Let n := Σ_{t=1}^m n_t and X := Diag(X_1, · · · , X_m) ∈ R^{md×n}. With Ω fixed, the dual of problem (1), defined with respect to dual variables α ∈ R^n, is given by:

$$\min_{\alpha} \;\left\{ D(\alpha) := \sum_{t=1}^{m} \sum_{i=1}^{n_t} \ell_t^*\big(-\alpha_t^i\big) + R^*(X\alpha) \right\}, \qquad (3)$$

where ℓ*_t and R* are the conjugate dual functions of ℓ_t and R, respectively, and α^i_t is the dual variable for the data point (x^i_t, y^i_t). Note that R* depends on Ω, but for the sake of simplicity, we have removed this in our notation. To derive distributed subproblems from this global dual, we make an assumption described below on the regularizer R.

Assumption 1. Given Ω, we assume that there exists a symmetric positive definite matrix M ∈ R^{md×md}, depending on Ω, for which the function R is strongly convex with respect to M^{-1}. Note that this corresponds to assuming that R* will be smooth with respect to matrix M.

Remark 1. We can reformulate the MTL regularizer in the form of R̄(w, Ω̄) = R(W, Ω), where w ∈ R^{md} is a vector containing the columns of W and Ω̄ := Ω ⊗ I_{d×d} ∈ R^{md×md}. For example, we can rewrite the regularizer in (2) as R̄(w, Ω̄) = tr(w^T(λ_1 Ω̄ + λ_2 I)w). Writing the regularizer in this form, it is clear that it is strongly convex with respect to matrix M^{-1} = λ_1 Ω̄ + λ_2 I.

Data-local quadratic subproblems. To solve (1) across distributed nodes, we define the following data-local subproblems, which are formed via a careful quadratic approximation of the dual problem (3) to separate computation across the nodes. These subproblems find updates Δα_t ∈ R^{n_t} to the dual variables in α corresponding to a single node t, and only require accessing data which is available locally, i.e., X_t for node t. The t-th subproblem is given by:

$$\min_{\Delta\alpha_t} \; G_t^{\sigma'}(\Delta\alpha_t; v_t, \alpha_t) := \sum_{i=1}^{n_t} \ell_t^*\big(-\alpha_t^i - \Delta\alpha_t^i\big) + \big\langle w_t(\alpha),\, X_t \Delta\alpha_t \big\rangle + \frac{\sigma'}{2} \|X_t \Delta\alpha_t\|_{M_t}^2 + c(\alpha), \qquad (4)$$

where c(α) := (1/m) R*(Xα), and M_t ∈ R^{d×d} is the t-th diagonal block of the symmetric positive definite matrix M.
Given dual variables α, corresponding primal variables can be found via w(α) = ∇R*(Xα), where w_t(α) is the t-th block in the vector w(α). Note that computing w(α) requires the vector v = Xα. The t-th block of v, v_t ∈ R^d, is the only information that must be communicated between nodes at each iteration. Finally, σ′ > 0 measures the difficulty of the data partitioning, and helps to relate progress made to the subproblems to the global dual problem. It can be easily selected based on M for many applications of interest; we provide details in Lemma 9 of the Appendix.

3.4 Practical Considerations

During MOCHA's federated update of W, the central node requires a response from all workers before performing a synchronous update. In the federated setting, a naive execution of this communication protocol could introduce dramatic straggler effects due to node heterogeneity. To avoid stragglers, MOCHA provides the t-th node with the flexibility to approximately solve its subproblem G^{σ′}_t(·), where the quality of the approximation is controlled by a per-node parameter θ^h_t. The following factors determine the quality of the t-th node's solution to its subproblem:

1. Statistical challenges, such as the size of X_t and the intrinsic difficulty of subproblem G^{σ′}_t(·).
2. Systems challenges, such as the node's storage, computational, and communication capacities due to hardware (CPU, memory), network connection (3G, 4G, WiFi), and power (battery level).
3. A global clock cycle imposed by the central node specifying a deadline for receiving updates.

We define θ^h_t as a function of these factors, and assume that each node has a controller that may derive θ^h_t from the current clock cycle and statistical/systems setting. θ^h_t ranges from zero to one, where θ^h_t = 0 indicates an exact solution to G^{σ′}_t(·) and θ^h_t = 1 indicates that node t made no progress during iteration h (which we refer to as a dropped node). For instance, a node may 'drop' if it runs out of battery, or if its network bandwidth deteriorates during iteration h and it is thus unable to return its update within the current clock cycle. A formal definition of θ^h_t is provided in (5) of Section 4.

MOCHA mitigates stragglers by enabling the t-th node to define its own θ^h_t. On every iteration h, the local updates that a node performs and sends in a clock cycle will yield a specific value for θ^h_t. As discussed in Section 4, MOCHA is additionally robust to a small fraction of nodes periodically dropping and performing no local updates (i.e., θ^h_t := 1) under suitable conditions, as defined in Assumption 2. In contrast, prior work of COCOA may suffer from severe straggler effects in federated settings, as it requires a fixed θ^h_t = θ across all nodes and all iterations while still maintaining synchronous updates, and it does not allow for the case of dropped nodes (θ := 1).

Finally, we note that asynchronous updating schemes are an alternative approach to mitigate stragglers. We do not consider these approaches in this work, in part due to the fact that the bounded-delay assumptions associated with most asynchronous schemes limit fault tolerance. However, it would be interesting to further explore the differences and connections between asynchronous methods and approximation-based, synchronous methods like MOCHA in future work.

4 Convergence Analysis

MOCHA is based on a bi-convex alternating approach, which is guaranteed to converge [17, 45] to a stationary solution of problem (1).
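Schematically, the bi-convex alternating structure looks as follows. This is a toy control-flow sketch only: a per-node quadratic stands in for subproblem (4), the trace-normalized Ω update stands in for the MTL-specific updates deferred to prior work, and the variable local step counts mimic per-node approximation quality θ^h_t (all names and values here are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)
m, d = 4, 6
targets = rng.standard_normal((d, m))  # toy stand-in for each node's local data
W = np.zeros((d, m))
Omega = np.eye(m)

def node_update(w_t, target, n_local_steps, lr=0.1):
    # Placeholder for an approximate local solve: gradient steps on a
    # per-node quadratic. n_local_steps == 0 models a dropped node
    # (theta = 1, no progress this round).
    for _ in range(n_local_steps):
        w_t = w_t - lr * (w_t - target)
    return w_t

for _round in range(50):
    # W-step (federated): nodes do variable amounts of local work
    for t in range(m):
        local_steps = int(rng.integers(0, 10))   # systems heterogeneity
        W[:, t] = node_update(W[:, t], targets[:, t], local_steps)
    # Omega-step (central): depends only on W, not on the distributed data
    C = W.T @ W + 1e-3 * np.eye(m)
    Omega = C / np.trace(C)          # a stand-in update, not the rule from [57]
```

Despite the uneven (occasionally zero) local work per round, every node still makes progress in expectation, which is the behavior the analysis below formalizes.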
In the case where this problem is jointly convex with respect to W and Ω, such a solution is also optimal. In the rest of this section, we therefore focus on the convergence of solving the W update of MOCHA in the federated setting. Following the discussion in Section 3.4, we first introduce the following per-node, per-round approximation parameter.

Definition 1 (Per-Node-Per-Iteration-Approximation Parameter). At each iteration h, we define the accuracy level of the solution calculated by node t to its subproblem (4) as:

$$\theta_t^h := \frac{G_t^{\sigma'}\big(\Delta\alpha_t^{(h)}; v^{(h)}, \alpha_t^{(h)}\big) - G_t^{\sigma'}\big(\Delta\alpha_t^\star; v^{(h)}, \alpha_t^{(h)}\big)}{G_t^{\sigma'}\big(0; v^{(h)}, \alpha_t^{(h)}\big) - G_t^{\sigma'}\big(\Delta\alpha_t^\star; v^{(h)}, \alpha_t^{(h)}\big)}, \qquad (5)$$

where Δα*_t is the minimizer of subproblem G^{σ′}_t(· ; v^(h), α^(h)_t). We allow this value to vary between [0, 1], with θ^h_t := 1 meaning that no updates to subproblem G^{σ′}_t are made by node t at iteration h.

While the flexible per-node, per-iteration approximation parameter θ^h_t in (5) allows the consideration of stragglers and fault tolerance, these additional degrees of freedom also pose new challenges in providing convergence guarantees for MOCHA. We introduce the following assumption on θ^h_t to provide our convergence guarantees.

Assumption 2. Let H_h := (α^(h), α^(h−1), · · · , α^(1)) be the dual vector history until the beginning of iteration h, and define Θ^h_t := E[θ^h_t | H_h] and Θ̂^h_t := E[θ^h_t | H_h, θ^h_t < 1]. For all tasks t and all iterations h, we assume p^h_t := P[θ^h_t = 1] ≤ p_max < 1 and Θ̂^h_t ≤ Θ_max < 1.

This assumption states that at each iteration, the probability of a node sending a result is non-zero, and that the quality of the returned result is, on average, better than the previous iterate. Compared to [49, 30] which assumes θ^h_t = θ < 1, our assumption is significantly less restrictive and better models the federated setting, where nodes are unreliable and may periodically drop out.

Using Assumption 2, we derive the following theorem, which characterizes the convergence of the federated update of MOCHA in finite horizon when the losses ℓ_t in (1) are smooth.

Theorem 1. Assume that the losses ℓ_t are (1/μ)-smooth. Then, under Assumptions 1 and 2, there exists a constant s ∈ (0, 1] such that for any given convergence target ε_D, choosing H such that

$$H \geq \frac{1}{(1 - \bar{\Theta})\,s} \log \frac{n}{\epsilon_D}, \qquad (6)$$

will satisfy E[D(α^(H)) − D(α*)] ≤ ε_D.

Here, Θ̄ := p_max + (1 − p_max)Θ_max < 1. While Theorem 1 is concerned with finite horizon convergence, it is possible to get asymptotic convergence results, i.e., H → ∞, with milder assumptions on the stragglers; see Corollary 8 in the Appendix for details.

When the loss functions are non-smooth, e.g., the hinge loss for SVM models, we provide the following sub-linear convergence for L-Lipschitz losses.

Theorem 2.
If the loss functions ℓ_t are L-Lipschitz, then there exists a constant σ, defined in (24), such that for any given ε_D > 0, if we choose

$$H \geq H_0 + \max\left\{ \left\lceil \frac{1}{1-\bar{\Theta}} \right\rceil,\; \frac{16 L^2 \sigma \sigma'}{(1-\bar{\Theta})\, n^2 \epsilon_D} \right\}, \qquad (7)$$

$$\text{with } H_0 \geq h_0 + \left\lceil \frac{2}{1-\bar{\Theta}} \left( 1 + \frac{2 L^2 \sigma \sigma'}{n^2 \epsilon_D} \right) \right\rceil, \qquad h_0 = \left\lceil \max\!\left(1, \frac{1}{1-\bar{\Theta}}\right) \log\!\left( \frac{2 n^2 \big(D(\alpha^\star) - D(\alpha^0)\big)}{4 L^2 \sigma \sigma'} \right) \right\rceil_+,$$

then ᾱ := (1/(H − H_0)) Σ_{h=H_0+1}^{H} α^(h) will satisfy E[D(ᾱ) − D(α*)] ≤ ε_D.

These theorems guarantee that MOCHA will converge in the federated setting, under mild assumptions on stragglers and capabilities of the nodes. While these results consider convergence in terms of the dual, we show that they hold analogously for the duality gap. We provide all proofs in Appendix C.

Remark 2. Following from the discussion in Section 3.4, our method and theory generalize the results in [22, 31]. In the limiting case that all θ^h_t are identical, our results extend the results of COCOA to the multi-task framework described in (1).

Remark 3. Note that the methods in [22, 31] have an aggregation parameter γ ∈ (0, 1]. Though we prove our results for a general γ, we simplify the method and results here by setting γ := 1, which has been shown to have the best performance, both theoretically and empirically [31].

5 Simulations

In this section we validate the empirical performance of MOCHA. First, we introduce a benchmarking suite of real-world federated datasets and show that multi-task learning is well-suited to handle the statistical challenges of the federated setting.
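To illustrate the bound in (6): with hypothetical values (none taken from the paper) p_max = 0.1, Θ_max = 0.5, s = 1, n = 10^4, and ε_D = 10^-3, a few lines of arithmetic give the required number of iterations:

```python
import math

# Hypothetical values, not from the paper's experiments
p_max, theta_max = 0.1, 0.5
theta_bar = p_max + (1 - p_max) * theta_max   # = 0.55 < 1

s = 1.0                  # constant from Theorem 1, taken at its upper bound
n, eps_D = 10_000, 1e-3
H = math.ceil(1 / ((1 - theta_bar) * s) * math.log(n / eps_D))
# More drops (larger p_max) or worse local solves (larger theta_max)
# shrink 1 - theta_bar and increase the required iterations H.
```

The dependence on Θ̄ makes the straggler trade-off explicit: low-quality but cheap local updates raise the iteration count while lowering per-round cost.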
Next, we demonstrate MOCHA's ability to handle stragglers, both from statistical and systems heterogeneity. Finally, we explore the performance of MOCHA when devices periodically drop out. Our code is available at: github.com/gingsmith/fmtl.

5.1 Federated Datasets

In our simulations, we use several real-world datasets that have been generated in federated settings. We provide additional details in the Appendix, including information about data sizes, n_t.

• Google Glass (GLEAM)4: This dataset consists of two hours of high resolution sensor data collected from 38 participants wearing Google Glass for the purpose of activity recognition. Following [41], we featurize the raw accelerometer, gyroscope, and magnetometer data into 180 statistical, spectral, and temporal features. We model each participant as a separate task, and predict between eating and other activities (e.g., walking, talking, drinking).

4 http://www.skleinberg.org/data/GLEAM.tar.gz

• Human Activity Recognition5: Mobile phone accelerometer and gyroscope data collected from 30 individuals, performing one of six activities: {walking, walking-upstairs, walking-downstairs, sitting, standing, lying-down}. We use the provided 561-length feature vectors of time and frequency domain variables generated for each instance [3]. We model each individual as a separate task and predict between sitting and the other activities.

• Vehicle Sensor6: Acoustic, seismic, and infrared sensor data collected from a distributed network of 23 sensors, deployed with the aim of classifying vehicles driving by a segment of road [13]. Each instance is described by 50 acoustic and 50 seismic features.
We model each sensor as a separate task and predict between AAV-type and DW-type vehicles.

5.2 Multi-Task Learning for the Federated Setting

We demonstrate the benefits of multi-task learning for the federated setting by comparing the error rates of a multi-task model to that of a fully local model (i.e., learning a model for each task separately) and a fully global model (i.e., combining the data from all tasks and learning one single model). Work on federated learning thus far has been limited to the study of fully global models [25, 36, 26]. We use a cluster-regularized multi-task model [57], as described in Section 3.1. For each dataset from Section 5.1, we randomly split the data into 75% training and 25% testing, and learn multi-task, local, and global support vector machine models, selecting the best regularization parameter, λ ∈ {1e-5, 1e-4, 1e-3, 1e-2, 0.1, 1, 10}, for each model using 5-fold cross-validation. We repeat this process 10 times and report the average prediction error across tasks, averaged across these 10 trials.

Table 1: Average prediction error: Means and standard errors over 10 random shuffles.

Model  | Human Activity | Google Glass | Vehicle Sensor
Global | 2.23 (0.30)    | 5.34 (0.26)  | 13.4 (0.26)
Local  | 1.34 (0.21)    | 4.92 (0.26)  | 7.81 (0.13)
MTL    | 0.46 (0.11)    | 2.02 (0.15)  | 6.59 (0.21)

In Table 1, we see that for each dataset, multi-task learning significantly outperforms the other models in terms of achieving the lowest average error across tasks. The global model, as proposed in [25, 36, 26], performs the worst, particularly for the Human Activity and Vehicle Sensor datasets. Although the datasets are already somewhat unbalanced, we note that a global modeling approach may benefit tasks with a very small number of instances, as information can be shared across tasks.
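The evaluation protocol above (75/25 split, 5-fold cross-validation over the λ grid) can be sketched as follows. This stand-in uses a regularized least-squares classifier on synthetic data rather than the paper's SVM solvers and federated datasets:

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic stand-in for a single task's data
X = rng.standard_normal((200, 10))
w_true = rng.standard_normal(10)
y = np.sign(X @ w_true)

# 75% train / 25% test split
perm = rng.permutation(len(X))
split = int(0.75 * len(X))
train_idx, test_idx = perm[:split], perm[split:]

lambdas = [1e-5, 1e-4, 1e-3, 1e-2, 0.1, 1, 10]   # the grid used in the paper

def fit(X, y, lam):
    # Regularized least squares as a simple stand-in classifier
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * len(X) * np.eye(d), X.T @ y)

def error_rate(w, X, y):
    return float(np.mean(np.sign(X @ w) != y))

# 5-fold cross-validation on the training set to pick lambda
folds = np.array_split(rng.permutation(train_idx), 5)
cv_err = []
for lam in lambdas:
    errs = []
    for k in range(5):
        trn = np.concatenate([folds[j] for j in range(5) if j != k])
        errs.append(error_rate(fit(X[trn], y[trn], lam), X[folds[k]], y[folds[k]]))
    cv_err.append(np.mean(errs))

best_lam = lambdas[int(np.argmin(cv_err))]
w_hat = fit(X[train_idx], y[train_idx], best_lam)
test_err = error_rate(w_hat, X[test_idx], y[test_idx])
```

The same skeleton applies per model type (global, local, MTL); only the fitting routine changes.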
For this reason, we additionally explore the performance of global, local, and multi-task modeling for highly skewed data in Table 4 of the Appendix. Although the performance of the global model improves slightly relative to local modeling in this setting, the global model still performs the worst for the majority of the datasets, and MTL still significantly outperforms both global and local approaches.

5.3 Straggler Avoidance

Two challenges that are prevalent in federated learning are stragglers and high communication. Stragglers can occur when a subset of the devices take much longer than others to perform local updates, which can be caused either by statistical or systems heterogeneity. Communication can also exacerbate poor performance, as it can be slower than computation by many orders of magnitude in typical cellular or wireless networks [52, 20, 48, 9, 38]. In our experiments below, we simulate the time needed to run each method by tracking the operations and communication complexities, and scaling the communication cost relative to computation by one, two, or three orders of magnitude, respectively. These numbers correspond roughly to the clock rate vs. network bandwidth/latency (see, e.g., [52]) for modern cellular and wireless networks. Details are provided in Appendix E.

5 https://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones
6 http://www.ecs.umass.edu/~mduarte/Software.html

Figure 1: The performance of MOCHA compared to other distributed methods for the W update of (1). While increasing communication tends to decrease the performance of the mini-batch methods, MOCHA performs well in high communication settings. In all settings, MOCHA with varied approximation values, θ^h_t, performs better than without (i.e., naively generalizing COCOA), as it avoids stragglers from statistical heterogeneity.

Statistical Heterogeneity.
We explore the effect of statistical heterogeneity on stragglers for various methods and communication regimes (3G, LTE, WiFi). For a fixed communication network, we compare MOCHA to COCOA, which has a single Θ parameter, and to mini-batch stochastic gradient descent (Mb-SGD) and mini-batch stochastic dual coordinate ascent (Mb-SDCA), which have limited communication flexibility depending on the batch size. We tune all compared methods for best performance, as we detail in Appendix E. In Figure 1, we see that while the performance of the mini-batch methods degrades in high-communication regimes, MOCHA and COCOA are robust to high communication. However, COCOA is significantly affected by stragglers: because Θ is fixed across nodes and rounds, difficult subproblems adversely impact convergence. In contrast, MOCHA performs well regardless of communication cost and is robust to statistical heterogeneity.

Systems Heterogeneity. MOCHA is also equipped to handle heterogeneity from changing systems environments, such as battery power, memory, or network connection, as we show in Figure 2. In particular, we simulate systems heterogeneity by randomly choosing the number of local iterations for MOCHA or the mini-batch size for the mini-batch methods: from between 10% and 100% of the minimum number of local data points for high-variability environments, to between 90% and 100% for low variability (see Appendix E for full details). We do not vary the performance of COCOA, as the impact from statistical heterogeneity alone significantly reduces performance. However, adding systems heterogeneity would reduce performance even further, as the maximum Θ value across all nodes would only increase if additional systems challenges were introduced.

5.4 Tolerance to Dropped Nodes

Finally, we explore the effect of dropped nodes on the performance of MOCHA.
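This dropped-node setting can be illustrated with a toy simulation (our own sketch, not MOCHA itself: plain distributed gradient descent on a shared quadratic, where a node that drops in round h simply contributes no update, loosely in the spirit of setting Θ_t^h := 1):

```python
import numpy as np

rng = np.random.default_rng(1)
m, d, n = 4, 5, 40
# Per-node quadratic objectives f_t(w) = 0.5 * ||A_t w - b_t||^2.
A = [rng.normal(size=(n, d)) for _ in range(m)]
b = [rng.normal(size=n) for _ in range(m)]

# Reference point: exact minimizer of the summed objective over all nodes.
w_star = np.linalg.lstsq(np.vstack(A), np.concatenate(b), rcond=None)[0]

def run(p_drop, rounds=4000, lr=5e-4):
    """Gradient descent where node t's update is dropped with probability p_drop[t]."""
    w = np.zeros(d)
    for _ in range(rounds):
        g = np.zeros(d)
        for t in range(m):
            if rng.random() >= p_drop[t]:      # node t participates this round
                g += A[t].T @ (A[t] @ w - b[t])
        w -= lr * g
    return float(np.linalg.norm(w - w_star))   # distance to the joint solution

dist_half = run([0.5] * m)                 # every node drops half the rounds
dist_never = run([1.0] + [0.0] * (m - 1))  # node 0 never sends updates
```

With moderate drop probabilities the iterates still reach a neighborhood of the joint solution, whereas a node that never participates leaves the method solving the wrong problem, consistent with the behavior described for Figure 3 and with Assumption 2's requirement that p never equal 1.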
We do not draw comparisons to other methods, as to the best of our knowledge, no other methods for distributed multi-task learning directly address fault tolerance. In MOCHA, we incorporate this setting by allowing Θ_t^h := 1, as explored theoretically in Section 4. In Figure 3, we look at the performance of MOCHA, either for one fixed W update or for the entire MOCHA method, as the probability that nodes drop at each iteration (p_t^h in Assumption 2) increases. We see that the performance of MOCHA is robust to relatively high values of p_t^h, both during a single update of W and in terms of the overall method. However, as intuition would suggest, if one of the nodes never sends updates (i.e., p_1^h := 1 for all h, green dotted line), the method does not converge to the correct solution. This provides validation for our Assumption 2.

Figure 2: MOCHA can handle variability from systems heterogeneity.

Figure 3: The performance of MOCHA is robust to nodes periodically dropping out (fault tolerance).

[Figure panels omitted; each plots primal sub-optimality vs. estimated time for MOCHA, CoCoA, Mb-SDCA, and Mb-SGD: Human Activity statistical heterogeneity (WiFi, LTE, 3G), Vehicle Sensor systems heterogeneity (low, high), and Google Glass fault tolerance (W step, full method).]

6 Discussion

To address the statistical and systems challenges of the burgeoning federated learning setting, we have presented MOCHA, a novel systems-aware optimization framework for federated multi-task learning. Our method and theory for the first time consider issues of high communication cost, stragglers, and fault tolerance for multi-task learning in the federated environment. While MOCHA does not apply to non-convex deep learning models in its current form, we note that there may be natural connections between this approach and "convexified" deep learning models [6, 34, 51, 56] in the context of kernelized federated multi-task learning.

Acknowledgements

We thank Brendan McMahan, Chloé Kiddon, Jakub Konečný, Evan Sparks, Xinghao Pan, Lisha Li, and Hang Qi for valuable discussions and feedback.

References

[1] A. Ahmed, A. Das, and A. J. Smola. Scalable hierarchical multitask learning algorithms for conversion optimization in display advertising. In Conference on Web Search and Data Mining, 2014.
[2] R. K. Ando and T. Zhang. A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 6:1817–1853, 2005.
[3] D. Anguita, A. Ghio, L. Oneto, X. Parra, and J. L. Reyes-Ortiz. A public domain dataset for human activity recognition using smartphones. In European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, 2013.
[4] A. Argyriou, T. Evgeniou, and M. Pontil. Multi-task feature learning. In Neural Information Processing Systems, 2007.
[5] A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Machine Learning, 73(3):243–272, 2008.
[6] Ö. Aslan, X. Zhang, and D. Schuurmans. Convex deep learning via normalized kernels. In Advances in Neural Information Processing Systems, 2014.
[7] I. M. Baytas, M. Yan, A. K. Jain, and J. Zhou.
Asynchronous multi-task learning. In International Conference on Data Mining, 2016.
[8] F. Bonomi, R. Milito, J. Zhu, and S. Addepalli. Fog computing and its role in the internet of things. In SIGCOMM Workshop on Mobile Cloud Computing, 2012.
[9] A. Carroll and G. Heiser. An analysis of power consumption in a smartphone. In USENIX Annual Technical Conference, 2010.
[10] R. Caruana. Multitask learning. Machine Learning, 28:41–75, 1997.
[11] J. Chen, J. Zhou, and J. Ye. Integrating low-rank and group-sparse structures for robust multi-task learning. In Conference on Knowledge Discovery and Data Mining, 2011.
[12] A. Deshpande, C. Guestrin, S. R. Madden, J. M. Hellerstein, and W. Hong. Model-based approximate querying in sensor networks. VLDB Journal, 14(4):417–443, 2005.
[13] M. F. Duarte and Y. H. Hu. Vehicle classification in distributed sensor networks. Journal of Parallel and Distributed Computing, 64(7):826–838, 2004.
[14] T. Evgeniou and M. Pontil. Regularized multi-task learning. In Conference on Knowledge Discovery and Data Mining, 2004.
[15] P. Garcia Lopez, A. Montresor, D. Epema, A. Datta, T. Higashino, A. Iamnitchi, M. Barcellos, P. Felber, and E. Riviere. Edge-centric computing: Vision and challenges. SIGCOMM Computer Communication Review, 45(5):37–42, 2015.
[16] A. R. Gonçalves, F. J. Von Zuben, and A. Banerjee. Multi-task sparse structure learning with gaussian copula models. Journal of Machine Learning Research, 17(33):1–30, 2016.
[17] J. Gorski, F. Pfeuffer, and K. Klamroth. Biconvex sets and optimization with biconvex functions: a survey and extensions. Mathematical Methods of Operations Research, 66(3):373–407, 2007.
[18] K. Hong, D. Lillethun, U. Ramachandran, B. Ottenwälder, and B. Koldehofe. Mobile fog: A programming model for large-scale applications on the internet of things.
In SIGCOMM Workshop on Mobile Cloud Computing, 2013.
[19] C.-J. Hsieh, M. A. Sustik, I. S. Dhillon, and P. Ravikumar. Sparse inverse covariance matrix estimation using quadratic approximation. In Neural Information Processing Systems 27, 2014.
[20] J. Huang, F. Qian, Y. Guo, Y. Zhou, Q. Xu, Z. M. Mao, S. Sen, and O. Spatscheck. An in-depth study of LTE: Effect of network protocol and application behavior on performance. In ACM SIGCOMM Conference, 2013.
[21] L. Jacob, J.-P. Vert, and F. R. Bach. Clustered multi-task learning: A convex formulation. In Neural Information Processing Systems, 2009.
[22] M. Jaggi, V. Smith, J. Terhorst, S. Krishnan, T. Hofmann, and M. I. Jordan. Communication-efficient distributed dual coordinate ascent. In Neural Information Processing Systems, 2014.
[23] X. Jin, P. Luo, F. Zhuang, J. He, and Q. He. Collaborating between local and global learning for distributed online multiple tasks. In Conference on Information and Knowledge Management, 2015.
[24] S. Kim and E. P. Xing. Statistical estimation of correlated genome associations to a quantitative trait network. PLoS Genet, 5(8):e1000587, 2009.
[25] J. Konečný, H. B. McMahan, and D. Ramage. Federated optimization: Distributed optimization beyond the datacenter. arXiv:1511.03575, 2015.
[26] J. Konečný, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon. Federated learning: Strategies for improving communication efficiency. arXiv:1610.05492, 2016.
[27] T. Kuflik, J. Kay, and B. Kummerfeld. Challenges and solutions of ubiquitous user modeling. In Ubiquitous display environments, pages 7–30. Springer, 2012.
[28] A. Kumar and H. Daumé. Learning task grouping and overlap in multi-task learning. In International Conference on Machine Learning, 2012.
[29] S. L. Lauritzen. Graphical Models, volume 17. Clarendon Press, 1996.
[30] S. Liu, S. J. Pan, and Q. Ho.
Distributed multi-task relationship learning. In Conference on Knowledge Discovery and Data Mining, 2017.
[31] C. Ma, V. Smith, M. Jaggi, M. I. Jordan, P. Richtárik, and M. Takáč. Adding vs. averaging in distributed primal-dual optimization. In International Conference on Machine Learning, 2015.
[32] S. Madden, M. J. Franklin, J. M. Hellerstein, and W. Hong. TAG: A tiny aggregation service for ad-hoc sensor networks. In Symposium on Operating Systems Design and Implementation, 2002.
[33] S. Madden, M. J. Franklin, J. M. Hellerstein, and W. Hong. TinyDB: An acquisitional query processing system for sensor networks. ACM Transactions on Database Systems, 30(1):122–173, 2005.
[34] J. Mairal, P. Koniusz, Z. Harchaoui, and C. Schmid. Convolutional kernel networks. In Neural Information Processing Systems, 2014.
[35] D. Mateos-Núñez and J. Cortés. Distributed optimization for multi-task learning via nuclear-norm approximation. In IFAC Workshop on Distributed Estimation and Control in Networked Systems, 2015.
[36] H. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas. Communication-efficient learning of deep networks from decentralized data. In Conference on Artificial Intelligence and Statistics, 2017.
[37] H. B. McMahan and D. Ramage. http://www.googblogs.com/federated-learning-collaborative-machine-learning-without-centralized-training-data/. Google, 2017.
[38] A. P. Miettinen and J. K. Nurminen. Energy efficiency of mobile clients in cloud computing. In USENIX Conference on Hot Topics in Cloud Computing, 2010.
[39] A. Pantelopoulos and N. G. Bourbakis. A survey on wearable sensor-based systems for health monitoring and prognosis. IEEE Transactions on Systems, Man, and Cybernetics, 40(1):1–12, 2010.
[40] H. Qi, E. R. Sparks, and A. Talwalkar.
Paleo: A performance model for deep neural networks. In International Conference on Learning Representations, 2017.
[41] S. A. Rahman, C. Merck, Y. Huang, and S. Kleinberg. Unintrusive eating recognition using google glass. In Conference on Pervasive Computing Technologies for Healthcare, 2015.
[42] P. Rashidi and D. J. Cook. Keeping the resident in the loop: Adapting the smart home to the user. IEEE Transactions on Systems, Man, and Cybernetics, 39(5):949–959, 2009.
[43] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. In European Conference on Computer Vision, 2016.
[44] S. Ravi. https://research.googleblog.com/2017/02/on-device-machine-intelligence.html. Google, 2017.
[45] M. Razaviyayn, M. Hong, and Z.-Q. Luo. A unified convergence analysis of block successive minimization methods for nonsmooth optimization. SIAM Journal on Optimization, 23(2):1126–1153, 2013.
[46] S. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos: Primal Estimated sub-GrAdient SOlver for SVM. In International Conference on Machine Learning, June 2007.
[47] S. Shalev-Shwartz and T. Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research, 14:567–599, 2013.
[48] D. Singelée, S. Seys, L. Batina, and I. Verbauwhede. The communication and computation cost of wireless security. In ACM Conference on Wireless Network Security, 2011.
[49] V. Smith, S. Forte, C. Ma, M. Takáč, M. I. Jordan, and M. Jaggi. CoCoA: A general framework for communication-efficient distributed optimization. arXiv:1611.02189, 2016.
[50] M. Takáč, A. Bijral, P. Richtárik, and N. Srebro. Mini-batch primal and dual methods for SVMs. In International Conference on Machine Learning, 2013.
[51] C.-Y. Tsai, A. M. Saxe, and D. Cox. Tensor switching networks.
In Neural Information Processing Systems, 2016.
[52] C. Van Berkel. Multi-core for mobile phones. In Proceedings of the Conference on Design, Automation and Test in Europe, pages 1260–1265. European Design and Automation Association, 2009.
[53] H. Wang, A. Banerjee, C.-J. Hsieh, P. K. Ravikumar, and I. S. Dhillon. Large scale distributed sparse precision estimation. In Neural Information Processing Systems, 2013.
[54] J. Wang, M. Kolar, and N. Srebro. Distributed multi-task learning. In Conference on Artificial Intelligence and Statistics, 2016.
[55] J. Wang, M. Kolar, and N. Srebro. Distributed multi-task learning with shared representation. arXiv:1603.02185, 2016.
[56] Y. Zhang, P. Liang, and M. J. Wainwright. Convexified convolutional neural networks. In International Conference on Machine Learning, 2017.
[57] Y. Zhang and D.-Y. Yeung. A convex formulation for learning task relationships in multi-task learning. In Conference on Uncertainty in Artificial Intelligence, 2010.
[58] J. Zhou, J. Chen, and J. Ye. Clustered multi-task learning via alternating structure optimization. In Neural Information Processing Systems, 2011.