{"title": "GIANT: Globally Improved Approximate Newton Method for Distributed Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 2332, "page_last": 2342, "abstract": "For distributed computing environment, we consider the empirical risk minimization problem and propose a distributed and communication-efficient Newton-type optimization method. At every iteration, each worker locally finds an Approximate NewTon (ANT) direction, which is sent to the main driver. The main driver, then, averages all the ANT directions received from workers to form a Globally Improved ANT (GIANT) direction. GIANT is highly communication efficient and naturally exploits the trade-offs between local computations and global communications in that more local computations result in fewer overall rounds of communications. Theoretically, we show that GIANT enjoys an improved convergence rate as compared with first-order methods and existing distributed Newton-type methods. Further, and in sharp contrast with many existing distributed Newton-type methods, as well as popular first-order methods, a highly advantageous practical feature of GIANT is that it only involves one tuning parameter. We conduct large-scale experiments on a computer cluster and, empirically, demonstrate the superior performance of GIANT.", "full_text": "GIANT: Globally Improved Approximate Newton\n\nMethod for Distributed Optimization\n\nPeng Xu\n\nStanford University\n\npengxu@stanford.edu\n\nShusen Wang\n\nStevens Institute of Technology\nshusen.wang@stevens.edu\n\nFarbod Roosta-Khorasani\nUniversity of Queensland\nfred.roosta@uq.edu.au\n\nMichael W. Mahoney\n\nUniversity of California at Berkeley\nmmahoney@stat.berkeley.edu\n\nAbstract\n\nFor distributed computing environment, we consider the empirical risk minimiza-\ntion problem and propose a distributed and communication-ef\ufb01cient Newton-type\noptimization method. 
At every iteration, each worker locally \ufb01nds an Approximate\nNewTon (ANT) direction, which is sent to the main driver. The main driver, then,\naverages all the ANT directions received from workers to form a Globally Improved\nANT (GIANT) direction. GIANT is highly communication ef\ufb01cient and naturally\nexploits the trade-offs between local computations and global communications\nin that more local computations result in fewer overall rounds of communica-\ntions. Theoretically, we show that GIANT enjoys an improved convergence rate as\ncompared with \ufb01rst-order methods and existing distributed Newton-type methods.\nFurther, and in sharp contrast with many existing distributed Newton-type methods,\nas well as popular \ufb01rst-order methods, a highly advantageous practical feature\nof GIANT is that it only involves one tuning parameter. We conduct large-scale\nexperiments on a computer cluster and, empirically, demonstrate the superior\nperformance of GIANT.\n\n1\n\nIntroduction\n\nThe large-scale nature of many modern \u201cbig-data\u201d problems, arising routinely in science, engineering,\n\ufb01nancial markets, Internet and social media, etc., poses signi\ufb01cant computational as well as storage\nchallenges for machine learning procedures. For example, the scale of data gathered in many\napplications nowadays typically exceeds the memory capacity of a single machine, which, in turn,\nmakes learning from data ever more challenging. In this light, several modern parallel (or distributed)\ncomputing architectures, e.g., MapReduce [4], Apache Spark [44, 19], GraphLab [14], and Parameter\nServer [11], have been designed to operate on and learn from data at massive scales. Despite the\nfact that, when compared to a single machine, distributed systems tremendously reduce the storage\nand (local) computational costs, the inevitable cost of communications across the network can often\nbe the bottleneck of distributed computations. 
As a result, designing methods that can strike an appropriate balance between the cost of computations and that of communications is increasingly desired.

The desire to reduce communication costs is even more pronounced in the federated learning framework [8, 9, 1, 18, 37]. Similarly to typical settings of distributed computing, federated learning assumes data are distributed over a network across nodes that enjoy reasonable computational resources, e.g., mobile phones, wearable devices, and smart homes. However, the network has severely limited bandwidth and high latency. As a result, it is imperative to reduce the communications between the center and a node, or between two nodes. In such settings, the preferred methods are those which can perform expensive local computations with the aim of reducing the overall communications across the network.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Optimization algorithms designed for the distributed setting abound in the literature. First-order methods, i.e., those that rely solely on gradient information, are often embarrassingly parallel and easy to implement. Examples of such methods include distributed variants of stochastic gradient descent (SGD) [17, 27, 47], accelerated SGD [35], variance-reduced SGD [10, 28], stochastic coordinate descent methods [5, 13, 20, 31], and dual coordinate ascent algorithms [30, 43, 46]. The common denominator in all of these methods is that they significantly reduce the amount of local computation. But this blessing comes with an inevitable curse: they may, in turn, require a far greater number of iterations and hence incur more communications overall. Indeed, as a result of their highly iterative nature, many of these first-order methods require several rounds of communications and, potentially, synchronizations in every iteration, and they must do so for many iterations. 
In a computer cluster, due to limitations on the network's bandwidth and latency as well as software system overhead, communications across the nodes can oftentimes be the critical bottleneck of distributed optimization. Such overheads are increasingly exacerbated by the growing number of compute nodes in the network, limiting the scalability of any distributed optimization method that requires many communication-intensive iterations.

To remedy the drawback of requiring a high number of iterations, communication-efficient second-order methods, i.e., those that, in addition to the gradient, incorporate curvature information, have also recently been considered [16, 36, 29, 45, 7, 15, 38]; see also Section 1.1. The common feature of all of these methods is that they increase the local computations with the aim of reducing the overall number of iterations, and hence lowering the communications. In other words, these methods are designed to perform as much local computation as possible before making any communications across the network. Pursuing similar objectives, in this paper, we propose a Globally Improved Approximate NewTon (GIANT) method and establish its improved theoretical convergence properties as compared with other similar second-order methods. We also showcase the superior empirical performance of GIANT through several numerical experiments.

The rest of this paper is organized as follows. Section 1.1 briefly reviews the prior works most closely related to this paper. Section 1.2 gives a summary of our main contributions. The formal description of the distributed empirical risk minimization problem is given in Section 2, followed by the derivation of the various steps of GIANT in Section 3. Section 4 presents the theoretical guarantees. The most commonly used notation is listed in Table 1. 
Due to the page limit, Section 5 provides only a summary of our experiments; the full experiments and all proofs can be found in the long version [41].

Table 1: Commonly used notation.

  Notation   Definition
  n          total number of samples
  d          number of features (attributes)
  m          number of partitions
  f          objective function
  γ          regularization parameter
  w_t        the variable at iteration t
  w*         the variable that minimizes f
  κ          some condition number

Table 2: The number of communications (proportional to the number of iterations) required for the ridge regression problem. Here κ is the condition number of the Hessian matrix, µ is the matrix coherence, and Õ conceals constants (analogous to µ) and logarithmic factors.

  Method              #Iterations                                                              Metric
  GIANT [this work]   t = O( log(dκ/ε) / log(n/(µdm)) )                                        ‖w_t − w*‖₂ ≤ ε
  DiSCO [45]          t = Õ( (d κ^{1/2} m^{3/4}/n^{3/4} + κ^{1/2} m^{1/4}/n^{1/4}) log(1/ε) )  f(w_t) − f(w*) ≤ ε
  DANE [36]           t = Õ( (κ² m / n) log(1/ε) )                                             f(w_t) − f(w*) ≤ ε
  AIDE [29]           t = Õ( (κ^{1/2} m^{1/4}/n^{1/4}) log(1/ε) )                              f(w_t) − f(w*) ≤ ε
  CoCoA [38]          t = O( (n + 1/γ) log(n/ε) )                                              f(w_t) − f(w*) ≤ ε
  AGD                 t = O( κ^{1/2} log(d/ε) )                                                ‖w_t − w*‖₂ ≤ ε

1.1 Related Work

Among the existing distributed second-order optimization methods, the most notable are DANE [36], AIDE [29], and DiSCO [45]. Another similar method is CoCoA [7, 15, 38], which is analogous to second-order methods in that it involves sub-problems which are local quadratic approximations to the dual objective function. 
However, despite the fact that CoCoA makes use of the smoothness condition, it does not exploit any explicit second-order information.

We can evaluate the theoretical properties of the above-mentioned methods by comparing them with optimal first-order methods, i.e., accelerated gradient descent (AGD) methods [22, 23], because AGD methods are mostly embarrassingly parallel and can be regarded as the baseline for distributed optimization. Recall that AGD methods, being optimal in the worst-case analysis sense [21], are guaranteed to converge to ε-precision in O(√κ log(1/ε)) iterations [23], where κ can be thought of as the condition number of the problem. Each iteration of AGD has two rounds of communications: the broadcast or aggregation of a vector.

In Table 2, we compare the communication costs of these methods for the ridge regression problem: min_w (1/(2n)) ‖Xw − y‖₂² + (γ/2) ‖w‖₂².¹ The communication cost of GIANT has a mere logarithmic dependence on the condition number κ; in contrast, the other methods have at least a square-root dependence on κ. Even if κ is assumed to be small, say κ = O(√n), an assumption made by [45], GIANT's bound is better than those of the compared methods in terms of the dependence on the number of partitions, m.

Our GIANT method is motivated by the subsampled Newton method [33, 42, 25]. Later on, we realized that a similar idea had been proposed in DANE [36]; GIANT and DANE are identical for quadratic programming but differ for general convex problems. Nevertheless, we show better convergence bounds than DANE, even for quadratic programming. Our improvement over DANE is obtained by sharper bounds on the Hessian approximation and a refined analysis of the convex optimization.

GIANT also bears a resemblance to FADL [16], but we show better convergence bounds. Mahajan et al. 
[16] have conducted comprehensive empirical comparisons among many distributed computing methods and concluded that local quadratic approximation, which is very similar to GIANT, is the method they ultimately recommend.

Figure 1: One iteration of GIANT. Here X and y are respectively the features and labels; Xi and yi denote the blocks of X and y, respectively. Each one-to-all operation is a Broadcast and each all-to-one operation is a Reduce.

¹As for general convex problems, it is hard to present the comparison in an easily understandable way; this is why we do not compare convergence rates for general convex optimization.

1.2 Contributions

In this paper, we consider the problem of empirical risk minimization with a smooth and strongly convex objective function (the same setting considered in the prior works on DANE, AIDE, and DiSCO). In this context, we propose a Globally Improved Approximate NewTon (GIANT) method and establish its theoretical and empirical properties as follows.
• For quadratic objectives, we establish global convergence of GIANT. To attain a fixed precision, the number of iterations of GIANT (which is proportional to the communication complexity) has a mere logarithmic dependence on the condition number. In contrast, the prior works have at least a square-root dependence. In fact, for quadratic problems, GIANT and DANE [36] can be shown to be identical. 
In this light, for such problems, our work improves upon the convergence of DANE.
• For more general problems, GIANT has linear-quadratic convergence in the vicinity of the optimal solution, which we refer to as "local convergence".² The advantage of GIANT mainly manifests in big-data regimes where there are many data points available. In other words, when the number of data points is much larger than the number of features, the theoretical convergence of GIANT enjoys a significant improvement over other similar methods.
• In addition to its theoretical features, GIANT also exhibits desirable practical advantages. For example, in sharp contrast with many existing distributed Newton-type methods, as well as popular first-order methods, GIANT involves only one tuning parameter, namely the maximum number of iterations of its sub-problem solver, which makes GIANT easy to implement in practice. Furthermore, our experiments on a computer cluster show that GIANT consistently outperforms AGD, L-BFGS, and DANE.

2 Problem Formulation

In this paper, we consider the distributed variant of empirical risk minimization, a supervised-learning problem arising very often in machine learning and data analysis [34]. More specifically, let x_1, ..., x_n ∈ R^d be the input feature vectors and y_1, ..., y_n ∈ R be the corresponding responses. The goal of supervised learning is to compute a model from the training data, which can be achieved by minimizing an empirical risk function, i.e.,

    min_{w ∈ R^d}  { f(w) ≜ (1/n) Σ_{j=1}^n ℓ_j(w^⊤ x_j) + (γ/2) ‖w‖₂² },    (1)

where ℓ_j : R → R is convex, twice differentiable, and smooth. We further assume that f is strongly convex, which, in turn, implies the uniqueness of the minimizer of (1), denoted throughout the text by w*. Note that y_j is implicitly captured by ℓ_j. 
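To make the last point concrete, the following minimal NumPy sketch (illustrative only; the function names and the choice of squared loss are ours, not from the paper) evaluates the objective in (1) with each ℓ_j implemented as a closure that captures its own y_j:

```python
import numpy as np

def make_losses(y):
    # One loss l_j per sample; each closure captures its own label y_j,
    # which is how y_j is "implicitly captured by l_j" in problem (1).
    # Squared loss l_j(z) = 0.5 * (z - y_j)^2 is used here for illustration.
    return [lambda z, yj=yj: 0.5 * (z - yj) ** 2 for yj in y]

def objective(w, X, losses, gamma):
    # f(w) = (1/n) * sum_j l_j(w^T x_j) + (gamma/2) * ||w||_2^2, as in (1)
    n = X.shape[0]
    z = X @ w
    return sum(l(zj) for l, zj in zip(losses, z)) / n + 0.5 * gamma * (w @ w)
```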
Examples of the loss function ℓ_j appearing in (1) include

    linear regression:    ℓ_j(z_j) = (1/2) (z_j − y_j)²,
    logistic regression:  ℓ_j(z_j) = log(1 + e^{−z_j y_j}).

Suppose the n feature vectors and loss functions (x_1, ℓ_1), ..., (x_n, ℓ_n) are partitioned among m worker machines, and let s ≜ n/m be the local sample size. Our theory requires s > d; nevertheless, GIANT empirically works well even for s < d.

We consider solving (1) in the regime where n ≫ d. We assume that the data points {x_j}_{j=1}^n are partitioned among m machines, with possible overlaps, such that the number of local samples is larger than d. Otherwise, if n ≪ d, we can consider the dual problem and partition the features instead. If the dual problem is also decomposable, smooth, strongly convex, and unconstrained, e.g., ridge regression, then our approach directly applies.

3 Algorithm Description

In this section, we present the algorithm derivation and complexity analysis. GIANT is a centralized and synchronous method; one iteration of GIANT is depicted in Figure 1. The key idea of GIANT is to avoid forming the exact Hessian matrices H_t ∈ R^{d×d}, thereby avoiding expensive communications.

²Second-order methods typically come with only local convergence guarantees. Global convergence of GIANT can be trivially established by following [32]; however, the resulting rate is not very interesting, as it is worse than that of first-order methods.

3.1 Gradient and Hessian

GIANT iterations require the exact gradient, which in the t-th iteration can be written as

    g_t = ∇f(w_t) = (1/n) Σ_{j=1}^n ℓ′_j(w_t^⊤ x_j) x_j + γ w_t ∈ R^d.    (2)

The gradient g_t can be computed embarrassingly in parallel. The driver Broadcasts w_t to all the worker machines. 
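As a sketch of this Broadcast/Reduce pattern (a single-process NumPy simulation with hypothetical helper names, assuming the logistic loss from the examples above), each worker computes its partial sum of (2) and the driver reduces them:

```python
import numpy as np

def local_gradient(w, Xi, yi):
    # Worker i: partial sum  sum_{j in J_i} l'_j(w^T x_j) x_j  for the
    # logistic loss, whose derivative is l'_j(z) = -y_j / (1 + exp(y_j z)).
    z = Xi @ w
    return Xi.T @ (-yi / (1.0 + np.exp(yi * z)))

def global_gradient(w, parts, n, gamma):
    # Driver: Reduce (sum) the local partial sums, scale by 1/n, and add
    # the regularization term, matching g_t in (2).
    g = sum(local_gradient(w, Xi, yi) for Xi, yi in parts)
    return g / n + gamma * w
```

The result coincides (up to floating-point summation order) with the single-machine gradient, since (2) is a plain sum over samples.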
Each machine then uses its own {(x_j, ℓ_j)} to compute its local gradient. Subsequently, the driver performs a Reduce operation to sum up the local gradients and obtain g_t. The per-iteration communication complexity is Õ(d) words, where Õ hides the dependence on m (which can be m or log m, depending on the network structure).

Similarly, in the t-th iteration, the Hessian matrix at w_t ∈ R^d can be written as

    H_t = ∇²f(w_t) = (1/n) Σ_{j=1}^n ℓ″_j(w_t^⊤ x_j) · x_j x_j^⊤ + γ I_d.    (3)

To compute the exact Hessian, the driver must aggregate the m local Hessian matrices (each of size d × d) by one Reduce operation, which has Õ(d²) communication complexity and is obviously impractical when d is in the thousands. The Hessian approximation developed in this paper has a mere Õ(d) communication complexity, the same as that of first-order methods.

3.2 Approximate NewTon (ANT) Directions

Assume each worker machine locally holds s random samples drawn from {(x_j, ℓ_j)}_{j=1}^n.³ Let J_i be the set containing the indices of the samples held by the i-th machine, and s = |J_i| denote its size. Each worker machine can use its local samples to form a local Hessian matrix

    H̃_{t,i} = (1/s) Σ_{j ∈ J_i} ℓ″_j(w_t^⊤ x_j) · x_j x_j^⊤ + γ I_d.

Clearly, E[H̃_{t,i}] = H_t. We define the Approximate NewTon (ANT) direction by p̃_{t,i} = H̃_{t,i}^{−1} g_t. The cost of computing the ANT direction p̃_{t,i} in this way involves O(sd²) time to form the d × d dense matrix H̃_{t,i} and O(d³) time to invert it. To reduce the computational cost, we opt to compute the ANT direction by the conjugate gradient (CG) method [24]. Let

    a_j = √(ℓ″_j(w_t^⊤ x_j)) · x_j ∈ R^d,    A_t = [a_1^⊤; ···; a_n^⊤] ∈ R^{n×d},    (4)

and A_{t,i} ∈ R^{s×d} contain the rows of A_t indexed by the set J_i. 
Using matrix notation, we can write the local Hessian matrix as

    H̃_{t,i} = (1/s) A_{t,i}^⊤ A_{t,i} + γ I_d.    (5)

Employing CG, it is unnecessary to explicitly form H̃_{t,i}. Indeed, one can approximately solve

    ( (1/s) A_{t,i}^⊤ A_{t,i} + γ I_d ) p = g_t    (6)

in a "Hessian-free" manner, i.e., by employing only Hessian-vector products in the CG iterations. In each round of GIANT, the local computational cost of a worker machine is O(q · nnz(A_{t,i})), where q is the number of CG iterations, specified by the user and typically set to a few tens.

³If the samples themselves are drawn i.i.d. from some distribution, then a data-independent partition is equivalent to uniform sampling. Otherwise, the system can Shuffle the data.

3.3 Globally Improved ANT (GIANT) Direction

Using random matrix concentration, we can show that for sufficiently large s, the local Hessian matrix H̃_{t,i} is a spectral approximation to H_t. Now let p̃_{t,i} be an ANT direction. The Globally Improved ANT (GIANT) direction is defined as

    p̃_t = (1/m) Σ_{i=1}^m p̃_{t,i} = (1/m) Σ_{i=1}^m H̃_{t,i}^{−1} g_t = H̃_t^{−1} g_t.    (7)

Interestingly, here H̃_t is the harmonic mean defined as H̃_t ≜ ( (1/m) Σ_{i=1}^m H̃_{t,i}^{−1} )^{−1}, whereas the true Hessian H_t is the arithmetic mean defined as H_t ≜ (1/m) Σ_{i=1}^m H̃_{t,i}. If the data is incoherent, that is, the "information" is spread out rather than concentrated in a small fraction of samples, then the harmonic mean and the arithmetic mean are very close to each other, and thereby the GIANT direction p̃_t = H̃_t^{−1} g_t approximates the true Newton direction H_t^{−1} g_t very well. This is the intuition
This is the intuition\nThe motivation of using the harmonic mean, (cid:101)Ht, to approximate the arithmetic mean (the true Hessian\nmatrix), Ht, is the communication cost. Computing the arithmetic mean Ht (cid:44) 1\ni=1 (cid:101)Ht,i would\nrequire the communication of d \u00d7 d matrices which is very expensive. In contrast, computing \u02dcpt\nmerely requires the communication of d-dimensional vectors.\n\n\u22121\nt,i )\u22121, whereas the true\n\nof our global improvement.\n\nm(cid:80)m\n\n3.4 Time and Communication Complexities\n\nFor each worker machine, the per-iteration time complexity is O(sdq), where s is the local sample\nsize and q is the number of CG iterations for (approximately) solving (6). (See Proposition 5 for the\nsetting of q.) If the feature matrix X \u2208 Rn\u00d7d has a sparsity of \u0001 = nnz(X)/(nd) < 1, the expected\nper-iteration time complexity is then O(\u0001sdq).\nEach iteration of GIANT has four rounds of communications: two Broadcast for sending and two\nReduce for aggregating some d-dimensional vector. If the communication is in a tree fashion, the\nper-iteration communication complexity is then \u02dcO(d) words, where \u02dcO hides the factor involving\nm which can be m or log m. In contrast, the naive Newton\u2019s method has \u02dcO(d2) communication\ncomplexity, because the system sends and receives d \u00d7 d Hessian matrices.\n\n4 Theoretical Analysis\n\nIn this section, we formally present the convergence guarantees of GIANT. Section 4.1 focuses on\nquadratic loss and treats the global convergence of GIANT. This is then followed by local convergence\nproperties of GIANT for more general non-quadratic loss in Section 4.2. 
For the results of Sections 4.1 and 4.2, we require that the local linear system defining the local Newton direction be solved exactly. Section 4.3 then relaxes this requirement to allow for inexactness in the solution and establishes convergence rates similar to those of the exact variants.

For our analysis, we frequently make use of the notion of matrix row coherence, defined as follows. This notion has been used in compressed sensing [3], matrix completion [2], and randomized linear algebra [6, 40, 39].

Definition 1 (Coherence). Let A ∈ R^{n×d} be any matrix and U ∈ R^{n×d} be its column orthonormal bases. The row coherence of A is µ(A) = (n/d) max_j ‖u_j‖₂² ∈ [1, n/d].

Remark 1. Our work assumes that A_t ∈ R^{n×d}, defined in (4), is incoherent, namely that µ(A_t) is small. The prior works DANE, AIDE, and DiSCO did not use the notion of incoherence; instead, they assume that ∇²_w ℓ_j(w^⊤ x_j)|_{w=w_t} = a_j a_j^⊤ is upper bounded for all j ∈ [n] and w_t ∈ R^d, where a_j ∈ R^d is the j-th row of A_t. Such an assumption is different from, but has a similar implication as, our incoherence assumption; under either assumption, it can be shown that the Hessian matrix can be approximated using a subset of samples selected uniformly at random.

4.1 Quadratic Loss

In this section, we consider a special case of (1) with ℓ_i(z) = (z − y_i)²/2, i.e., the quadratic optimization problem

    f(w) = (1/(2n)) ‖Xw − y‖₂² + (γ/2) ‖w‖₂².    (8)

The Hessian matrix is given by ∇²f(w) = (1/n) X^⊤ X + γ I_d, which does not depend on w. Theorem 1 describes the convergence of the error in the iterates, Δ_t ≜ w_t − w*.

Theorem 1. Let µ be the row coherence of X ∈ R^{n×d} and m be the number of partitions. Assume the local sample size satisfies s ≥ (3µd/η²) log(md/δ) for some η, δ ∈ (0, 1). It holds with probability 1 − δ that

    ‖Δ_t‖₂ ≤ α^t √κ ‖Δ_0‖₂,

where α = η/√m + η² and κ is the condition number of ∇²f(w) = (1/n) X^⊤ X + γ I_d.

Remark 2. The theorem can be interpreted in the following way. Assume the total number of samples, n, is at least 3µdm log(md). Then

    ‖Δ_t‖₂ ≤ ( 3µdm log(md/δ)/n + √(3µd log(md/δ)/n) )^t √κ ‖Δ_0‖₂

holds with probability at least 1 − δ.

If the total number of samples, n, is substantially bigger than µdm, then GIANT converges in a very small number of iterations. Furthermore, to reach a fixed precision, say ‖Δ_t‖₂ ≤ ε, the number of iterations, t, has a mere logarithmic dependence on the condition number, κ.

4.2 General Smooth Loss

For more general (not necessarily quadratic) but smooth loss, GIANT has linear-quadratic local convergence, which is formally stated in Theorem 2 and Corollary 3. Let H* = ∇²f(w*) and H_t = ∇²f(w_t). For this general case, we assume the Hessian is L-Lipschitz, which is a standard assumption in analyzing second-order methods.

Assumption 1. The Hessian matrix is L-Lipschitz continuous, i.e., ‖∇²f(w) − ∇²f(w′)‖₂ ≤ L ‖w − w′‖₂ for all w and w′.

Theorem 2 establishes the linear-quadratic convergence of Δ_t ≜ w_t − w*. Recall that A_t ∈ R^{n×d} is defined in (4) (thus (1/n) A_t^⊤ A_t + γ I_d = H_t). Note that, unlike in Section 4.1, the coherence of A_t, denoted µ_t, changes across iterations.

Theorem 2. 
Let µ_t ∈ [1, n/d] be the coherence of A_t and m be the number of partitions. Assume the local sample size satisfies s_t ≥ (3µ_t d/η²) log(md/δ) for some η, δ ∈ (0, 1). Under Assumption 1, it holds with probability 1 − δ that

    ‖Δ_{t+1}‖₂ ≤ max{ α √(σ_max(H_t)/σ_min(H_t)) ‖Δ_t‖₂,  (2L/σ_min(H_t)) ‖Δ_t‖₂² },

where α = η/√m + η².

Remark 3. The standard Newton's method is well known to have local quadratic convergence; the quadratic term in Theorem 2 is the same as for Newton's method. The quadratic term is caused by the non-quadratic objective function. The linear term arises from the Hessian approximation. For a large sample size s, equivalently a small η, the linear term is small.

Note that in Theorem 2 the convergence depends on the condition number of the Hessian at every point. Due to the Lipschitz assumption on the Hessian, it is easy to see that the condition number of the Hessian in a neighborhood of w* is close to κ(H*). This simple observation implies Corollary 3, in which the dependence of the local convergence of GIANT on the iterates via H_t is removed.

Assumption 2. Assume w_t is close to w* in that ‖Δ_t‖₂ ≤ σ_min(H*)/(3L), where L is defined in Assumption 1.

Corollary 3. Under the same setting as Theorem 2 and under Assumption 2, it holds with probability 1 − δ that

    ‖Δ_{t+1}‖₂ ≤ max{ 2α √κ ‖Δ_t‖₂,  (3L/σ_min(H*)) ‖Δ_t‖₂² },

where κ is the condition number of the Hessian matrix at w*.

4.3 Inexact Solutions to Local Sub-Problems

In the t-th iteration, the i-th worker locally computes p̃_{t,i} by solving H̃_{t,i} p = g_t, where H̃_{t,i} is the i-th local Hessian matrix defined in (5). In high-dimensional problems, say d ≥ 10⁴, the exact formation of H̃_{t,i} ∈ R^{d×d} and its inversion are impractical. Instead, we can employ iterative linear system solvers, such as CG, to inexactly solve the linear system in (6). Let p̃′_{t,i} be an inexact solution close to p̃_{t,i} ≜ H̃_{t,i}^{−1} g_t in the sense that

    ‖H̃_{t,i}^{1/2} (p̃′_{t,i} − p̃_{t,i})‖₂ ≤ (ε₀/2) ‖H̃_{t,i}^{1/2} p̃_{t,i}‖₂    (9)

for some ε₀ ∈ (0, 1). GIANT then takes p̃′_t = (1/m) Σ_{i=1}^m p̃′_{t,i} as the approximate Newton direction in lieu of p̃_t. In this case, as long as ε₀ is of the same order as η/√m + η², the convergence rate of this inexact variant of GIANT remains similar to that of the exact algorithm in which the local linear systems are solved exactly. Theorem 4 makes the convergence properties of inexact GIANT explicit.

Theorem 4. Suppose the inexact local solution to (6), denoted p̃′_{t,i}, satisfies (9). 
Then Theorem 1, Theorem 2, and Corollary 3 all continue to hold with α = (η/√m + η²) + ε₀.

Proposition 5 gives conditions that guarantee (9), which is, in turn, required for Theorem 4.

Proposition 5. To compute an inexact local Newton direction from the sub-problem (6), suppose each worker performs

    q = log(8/ε₀²) / log( (√κ̃_t + 1)/(√κ̃_t − 1) ) ≈ ((√κ_t − 1)/2) log(8/ε₀²)

iterations of CG, initialized at zero, where κ̃_t and κ_t are, respectively, the condition numbers of H̃_{t,i} and H_t. Then requirement (9) is satisfied.

5 A Summary of the Empirical Study

Due to the page limit, the experiments are not included in this paper; please refer to the long version [41]. The Apache Spark code is available at https://github.com/wangshusen/SparkGiant.git. Here we briefly describe our results.

We implement GIANT, Accelerated Gradient Descent (AGD) [23], Limited-memory BFGS (L-BFGS) [12], and Distributed Approximate NEwton (DANE) [36] in Scala and Apache Spark [44]. We empirically study the ℓ2-regularized logistic regression problem (which satisfies our assumptions):

    min_w (1/n) Σ_{j=1}^n log(1 + exp(−y_j x_j^⊤ w)) + (γ/2) ‖w‖₂².    (10)

We conduct large-scale experiments on the Cori supercomputer maintained by NERSC, a Cray XC40 system with 1632 compute nodes, each of which has two 2.3GHz 16-core Haswell processors and 128GB of DRAM. We use up to 375 nodes (12,000 CPU cores).

For logistic regression, we use three binary classification datasets: MNIST8M (digit "4" versus "9", thus n = 2M and d = 784), Covtype (n = 581K and d = 54), and Epsilon (n = 500K and d = 2K), all available at the LIBSVM website. 
We randomly hold out 80% for training and the rest for testing. To increase the size of the data, we generate $10^4$ random Fourier features [26] and use them in lieu of the original features in the logistic regression problem.

For the four methods, we use different settings of the parameters and report the best convergence curve; we do not count the cost of parameter tuning. (This actually favors AGD and DANE, because they have more tuning parameters than GIANT and L-BFGS.) Using the same amount of wall-clock time, GIANT consistently converges faster than AGD, DANE, and L-BFGS in terms of both training objective value and test classification error (see the figures in [41]).

Our theory requires the local sample size $s = n/m$ to be larger than $d$. But in practice, GIANT converges even if $s$ is smaller than $d$. In this set of experiments, we set $m = 89$, and thus $s$ is about half of $d$. Nevertheless, GIANT converges in all of our experiments. Our empirical results may imply that the theoretical sample complexity can potentially be improved.

We further use data augmentation (i.e., adding random noise to the feature vectors) to increase $n$ to 5 and 25 times its original size. In this way, the feature matrices are all dense, and the largest feature matrix we use is about 1TB. As we increase both $n$ and the number of compute nodes, the advantage of GIANT further increases, which means GIANT is more scalable than the compared methods. This is because, as we increase the number of samples and the number of nodes by the same factor, the local computation remains the same, but the communication and synchronization costs increase, which favors communication-efficient methods; see the figures and explanations in [41].
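The method evaluated above — each worker solving its own local Hessian system against the global gradient, and the driver averaging the resulting ANT directions — can be sketched on a single machine. This is a simplified illustration for objective (10), not the authors' distributed Spark code: the in-process partitioning, the function name, and the direct solve via `np.linalg.solve` (in place of the CG solver of Section 4.3) are our simplifications.

```python
import numpy as np

def giant_step(w, X, y, gamma, m):
    """One GIANT iteration for l2-regularized logistic regression, with the
    m workers simulated by partitioning the rows of X on one machine."""
    n, d = X.shape
    margins = y * (X @ w)
    # Global gradient (in the distributed setting this costs one communication round).
    g = X.T @ (-y / (1.0 + np.exp(margins))) / n + gamma * w
    directions = []
    for idx in np.array_split(np.arange(n), m):
        Xi, mi = X[idx], margins[idx]
        prob = 1.0 / (1.0 + np.exp(mi))
        D = prob * (1.0 - prob)          # per-sample logistic Hessian weights
        # Local Hessian built from this worker's s = n/m samples only.
        Hi = Xi.T @ (Xi * D[:, None]) / len(idx) + gamma * np.eye(d)
        directions.append(np.linalg.solve(Hi, g))   # local ANT direction
    p = np.mean(directions, axis=0)      # Globally Improved ANT direction
    return w - p
```

Note that only the $d$-dimensional gradient and direction vectors would cross the network; each $d \times d$ local Hessian stays on its worker, which is the source of the $\tilde{O}(d)$ per-round communication cost.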
First, GIANT is guaranteed to converge to high precision in a small number of iterations, provided that the number of training samples, $n$, is sufficiently large relative to $dm$, where $d$ is the number of features and $m$ is the number of partitions. Second, GIANT is very communication-efficient in that each iteration requires four or six rounds of communication, each with a complexity of merely $\tilde{O}(d)$. Third, in contrast to the other alternatives, GIANT is easy to use, as it involves only one tuning parameter. Empirical studies also showed the superior performance of GIANT as compared with several other methods.

GIANT has been developed only for unconstrained problems with smooth and strongly convex objective functions. However, we believe that similar ideas can be naturally extended to projected Newton methods for constrained problems, proximal Newton methods for non-smooth regularization, and trust-region methods for nonconvex problems. Establishing strong convergence bounds for these extensions appears nontrivial and is left for future work.

Acknowledgements

We thank Kimon Fountoulakis, Alex Gittens, Jey Kottalam, Zirui Liu, Hao Ren, Sathiya Selvaraj, Zebang Shen, and Haishan Ye for their helpful suggestions. The four authors would like to acknowledge ARO, DARPA, Cray, and NSF for providing partial support of this work. Farbod Roosta-Khorasani was partially supported by the Australian Research Council through a Discovery Early Career Researcher Award (DE180100923).

References

[1] Keith Bonawitz, Vladimir Ivanov, Ben Kreuter, Antonio Marcedone, H Brendan McMahan, Sarvar Patel, Daniel Ramage, Aaron Segal, and Karn Seth. Practical secure aggregation for privacy preserving machine learning. IACR Cryptology ePrint Archive, 2017:281, 2017.

[2] Emmanuel J Candes and Benjamin Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717, 2009.

[3] Emmanuel J Candes and Terence Tao.
Near-optimal signal recovery from random projections: Universal encoding strategies? IEEE Transactions on Information Theory, 52(12):5406–5425, 2006.

[4] Jeffrey Dean and Sanjay Ghemawat. MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.

[5] Olivier Fercoq and Peter Richtárik. Optimization in high dimensions via accelerated, parallel, and proximal coordinate descent. SIAM Review, 58(4):739–771, 2016.

[6] Alex Gittens and Michael W Mahoney. Revisiting the Nyström method for improved large-scale machine learning. Journal of Machine Learning Research, 17(1):3977–4041, 2016.

[7] Martin Jaggi, Virginia Smith, Martin Takác, Jonathan Terhorst, Sanjay Krishnan, Thomas Hofmann, and Michael I Jordan. Communication-efficient distributed dual coordinate ascent. In Advances in Neural Information Processing Systems (NIPS), 2014.

[8] Jakub Konecný, H Brendan McMahan, Daniel Ramage, and Peter Richtárik. Federated optimization: distributed machine learning for on-device intelligence. arXiv preprint arXiv:1610.02527, 2016.

[9] Jakub Konecný, H Brendan McMahan, Felix X Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon. Federated learning: strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492, 2016.

[10] Jason D Lee, Qihang Lin, Tengyu Ma, and Tianbao Yang. Distributed stochastic variance reduced gradient methods and a lower bound for communication complexity. arXiv preprint arXiv:1507.07595, 2015.

[11] Mu Li, David G Andersen, Jun Woo Park, Alexander J Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J Shekita, and Bor-Yiing Su. Scaling distributed machine learning with the parameter server. In USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2014.

[12] Dong C. Liu and Jorge Nocedal.
On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45(1-3):503–528, 1989.

[13] Ji Liu, Stephen J Wright, Christopher Ré, Victor Bittorf, and Srikrishna Sridhar. An asynchronous parallel stochastic coordinate descent algorithm. Journal of Machine Learning Research, 16(285-322):1–5, 2015.

[14] Yucheng Low, Danny Bickson, Joseph Gonzalez, Carlos Guestrin, Aapo Kyrola, and Joseph M. Hellerstein. Distributed GraphLab: A framework for machine learning and data mining in the cloud. Proceedings of the VLDB Endowment, 2012.

[15] Chenxin Ma, Virginia Smith, Martin Jaggi, Michael Jordan, Peter Richtarik, and Martin Takac. Adding vs. averaging in distributed primal-dual optimization. In International Conference on Machine Learning (ICML), 2015.

[16] Dhruv Mahajan, Nikunj Agrawal, S Sathiya Keerthi, S Sundararajan, and Léon Bottou. An efficient distributed learning algorithm based on effective local functional approximations. arXiv preprint arXiv:1310.8418, 2013.

[17] Dhruv Mahajan, S Sathiya Keerthi, S Sundararajan, and Léon Bottou. A parallel SGD method with strong convergence. arXiv preprint arXiv:1311.0636, 2013.

[18] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2017.

[19] Xiangrui Meng, Joseph Bradley, Burak Yavuz, Evan Sparks, Shivaram Venkataraman, Davies Liu, Jeremy Freeman, DB Tsai, Manish Amde, Sean Owen, et al. MLlib: machine learning in Apache Spark. Journal of Machine Learning Research, 17(34):1–7, 2016.

[20] Ion Necoara and Dragos Clipici. Parallel random coordinate descent method for composite minimization: Convergence analysis and error bounds. SIAM Journal on Optimization, 26(1):197–226, 2016.

[21] A.S. Nemirovskii and D.B. Yudin.
Problem Complexity and Method Efficiency in Optimization. A Wiley-Interscience publication. Wiley, 1983.

[22] Yurii Nesterov. A method of solving a convex programming problem with convergence rate O(1/k²). In Soviet Mathematics Doklady, volume 27, pages 372–376, 1983.

[23] Yurii Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2013.

[24] Jorge Nocedal and Stephen Wright. Numerical optimization. Springer Science & Business Media, 2006.

[25] Mert Pilanci and Martin J Wainwright. Newton sketch: A near linear-time optimization algorithm with linear-quadratic convergence. SIAM Journal on Optimization, 27(1):205–245, 2017.

[26] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems (NIPS), 2007.

[27] Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. Hogwild: a lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems (NIPS), 2011.

[28] Sashank J Reddi, Ahmed Hefny, Suvrit Sra, Barnabas Poczos, and Alexander J Smola. On variance reduction in stochastic gradient descent and its asynchronous variants. In Advances in Neural Information Processing Systems (NIPS), 2015.

[29] Sashank J Reddi, Jakub Konecný, Peter Richtárik, Barnabás Póczós, and Alex Smola. AIDE: fast and communication efficient distributed optimization. arXiv preprint arXiv:1608.06879, 2016.

[30] Peter Richtárik and Martin Takác. Distributed coordinate descent method for learning with big data. Journal of Machine Learning Research, 17(1):2657–2681, 2016.

[31] Peter Richtárik and Martin Takáč. Parallel coordinate descent methods for big data optimization. Mathematical Programming, 156(1-2):433–484, 2016.

[32] Farbod Roosta-Khorasani and Michael W Mahoney.
Sub-sampled Newton methods I: globally convergent\n\nalgorithms. arXiv preprint arXiv:1601.04737, 2016.\n\n[33] Farbod Roosta-Khorasani and Michael W Mahoney. Sub-sampled Newton methods II: Local convergence\n\nrates. arXiv preprint arXiv:1601.04738, 2016.\n\n[34] Shai Shalev-Shwartz and Shai Ben-David. Understanding machine learning: from theory to algorithms.\n\nCambridge University Press, 2014.\n\n[35] Ohad Shamir and Nathan Srebro. Distributed stochastic optimization and learning. In Annual Allerton\n\nConference on Communication, Control, and Computing, 2014.\n\n[36] Ohad Shamir, Nati Srebro, and Tong Zhang. Communication-ef\ufb01cient distributed optimization using an\n\napproximate Newton-type method. In International conference on machine learning (ICML), 2014.\n\n10\n\n\f[37] Virginia Smith, Chao-Kai Chiang, Maziar Sanjabi, and Ameet Talwalkar. Federated multi-task learning.\n\narXiv preprint arXiv:1705.10467, 2017.\n\n[38] Virginia Smith, Simone Forte, Chenxin Ma, Martin Takac, Michael I Jordan, and Martin Jaggi. CoCoA: A\ngeneral framework for communication-ef\ufb01cient distributed optimization. arXiv preprint arXiv:1611.02189,\n2016.\n\n[39] Shusen Wang, Alex Gittens, and Michael W. Mahoney. Sketched ridge regression: Optimization perspective,\nstatistical perspective, and model averaging. In International Conference on Machine Learning (ICML),\n2017.\n\n[40] Shusen Wang, Luo Luo, and Zhihua Zhang. SPSD matrix approximation vis column selection: Theories,\n\nalgorithms, and extensions. Journal of Machine Learning Research, 17(49):1\u201349, 2016.\n\n[41] Shusen Wang, Farbod Roosta-Khorasani, Peng Xu, and Michael W. Mahoney. GIANT: Globally improved\n\napproximate Newton method for distributed optimization. arXiv:1709.03528, 2018.\n\n[42] Peng Xu, Jiyan Yang, Farbod Roosta-Khorasani, Christopher R\u00e9, and Michael W Mahoney. Sub-sampled\nNewton methods with non-uniform sampling. 
In Advances in Neural Information Processing Systems (NIPS), 2016.

[43] Tianbao Yang. Trading computation for communication: distributed stochastic dual coordinate ascent. In Advances in Neural Information Processing Systems (NIPS), 2013.

[44] Matei Zaharia, Mosharaf Chowdhury, Michael J Franklin, Scott Shenker, and Ion Stoica. Spark: Cluster computing with working sets. HotCloud, 10(10-10):95, 2010.

[45] Yuchen Zhang and Xiao Lin. DiSCO: distributed optimization for self-concordant empirical loss. In International Conference on Machine Learning (ICML), 2015.

[46] Shun Zheng, Fen Xia, Wei Xu, and Tong Zhang. A general distributed dual coordinate optimization framework for regularized loss minimization. arXiv preprint arXiv:1604.03763, 2016.

[47] Martin Zinkevich, Markus Weimer, Lihong Li, and Alex J Smola. Parallelized stochastic gradient descent. In Advances in Neural Information Processing Systems (NIPS), 2010.