{"title": "DINGO: Distributed Newton-Type Method for Gradient-Norm Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 9498, "page_last": 9508, "abstract": "For optimization of a large sum of functions in a distributed computing environment, we present a novel communication efficient Newton-type algorithm that enjoys a variety of advantages over similar existing methods. Our algorithm, DINGO, is derived by optimization of the gradient's norm as a surrogate function. DINGO does not impose any specific form on the underlying functions and its application range extends far beyond convexity and smoothness. The underlying sub-problems of DINGO are simple linear least-squares, for which a plethora of efficient algorithms exist. DINGO involves a few hyper-parameters that are easy to tune and we theoretically show that a strict reduction in the surrogate objective is guaranteed, regardless of the selected hyper-parameters.", "full_text": "DINGO: Distributed Newton-Type Method for\n\nGradient-Norm Optimization\n\nRixon Crane\n\nUniversity of Queensland\n\nr.crane@uq.edu.au\n\nFred Roosta\n\nUniversity of Queensland\nfred.roosta@uq.edu.au\n\nAbstract\n\nFor optimization of a large sum of functions in a distributed computing environment,\nwe present a novel communication ef\ufb01cient Newton-type algorithm that enjoys a\nvariety of advantages over similar existing methods. Our algorithm, DINGO, is\nderived by optimization of the gradient\u2019s norm as a surrogate function. DINGO does\nnot impose any speci\ufb01c form on the underlying functions and its application range\nextends far beyond convexity and smoothness. The underlying sub-problems of\nDINGO are simple linear least-squares, for which a plethora of ef\ufb01cient algorithms\nexist. 
DINGO involves a few hyper-parameters that are easy to tune, and we theoretically show that a strict reduction in the surrogate objective is guaranteed, regardless of the selected hyper-parameters.

1 Introduction

Consider the optimization problem

  min_{w ∈ ℝ^d} { f(w) ≜ (1/m) Σ_{i=1}^m f_i(w) },    (1)

in a centralized distributed computing environment involving one driver machine and m worker machines, in which the i-th worker can only locally access the i-th component function, f_i. Such distributed computing settings arise increasingly frequently as a result of technological and communication advancements that have enabled the collection of, and access to, large-scale datasets. As a concrete example, take a data-fitting application in which, given n data points {x_i}_{i=1}^n and their corresponding losses ℓ_i(w; x_i), parameterized by w, the goal is to minimize the overall loss as min_{w ∈ ℝ^d} (1/n) Σ_{i=1}^n ℓ_i(w; x_i). Such problems appear frequently in machine learning, e.g., [1, 2, 3], and scientific computing, e.g., [4, 5, 6]. However, in "big data" regimes where n ≫ 1, lack of adequate computational resources, in particular storage, can severely limit, or even prevent, any attempt at solving such optimization problems in a traditional stand-alone way, e.g., using a single machine. This can be remedied through distributed computing, in which resources across a network of stand-alone computational nodes are "pooled" together so as to scale to the problem at hand [7]. In such a setting, where the n data points are distributed across m workers, one can instead consider (1) with

  f_i(w) ≜ (1/|S_i|) Σ_{j ∈ S_i} ℓ_j(w; x_j),  i = 1, 2, …, m,    (2)

where S_i ⊆ {1, 2, …
, n}, with cardinality denoted by |S_i|, correspond to the distribution of data across the nodes, i.e., the i-th node has access to a portion of the data indexed by the set S_i.

In distributed settings, the amount of communication, i.e., the messages exchanged across the network, is often considered a major bottleneck of computation (often more so than local computation times), as communication can be expensive in terms of both physical resources and time through latency [8, 9]. First-order methods [10], e.g., stochastic gradient descent (SGD) [11], rely solely on gradient information and as a result are rather easy to implement in distributed settings. They often require many computationally inexpensive iterations, which can be suitable for execution on a single machine. However, as a direct consequence, they can incur excessive communication costs in distributed environments and, hence, might not be able to take full advantage of the available distributed computational resources.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

By employing curvature information in the form of the Hessian matrix, second-order methods aim at transforming the gradient so that it is a more suitable direction to follow. Compared with first-order alternatives, although second-order methods perform more computation per iteration, they often require far fewer iterations to achieve similar results. In distributed settings, this feature can directly translate to significantly lower communication costs. As a result, distributed second-order methods have the potential to become the method of choice for distributed optimization tasks.

Notation

We let ⟨·, ·⟩ denote the common Euclidean inner product, defined by ⟨x, y⟩ = xᵀy for x, y ∈ ℝ^d. Given a vector v and a matrix A, we denote their vector ℓ₂ norm and matrix spectral norm by ‖v‖ and ‖A‖, respectively. For x, z ∈ ℝ^d we let [x, z] ≜ {x + τ(z − x) | 0 ≤ τ ≤ 1}. The range and null space of a matrix A are denoted by R(A) and N(A), respectively. The Moore–Penrose inverse [12] of A is denoted by A†. We let w_t ∈ ℝ^d denote the point at iteration t. For notational convenience, we denote g_{t,i} ≜ ∇f_i(w_t), H_{t,i} ≜ ∇²f_i(w_t), g_t ≜ ∇f(w_t) and H_t ≜ ∇²f(w_t). We also let

  H̃_{t,i} ≜ (H_{t,i}; φI) ∈ ℝ^{2d×d}  and  g̃_t ≜ (g_t; 0) ∈ ℝ^{2d},    (3)

where φ > 0, I is the identity matrix, and 0 is the zero vector, i.e., H̃_{t,i} stacks H_{t,i} on top of φI, and g̃_t stacks g_t on top of the zero vector.

Related Work and Contributions

Owing to the above-mentioned potential, many distributed second-order optimization algorithms have recently emerged to solve (1). Among them, most notable are GIANT [13], DiSCO [9], DANE [14], InexactDANE and AIDE [15]. While having many advantages, each of these methods comes with several disadvantages that can limit its applicability in certain regimes. Namely, some rely on rather stringent (strong) convexity assumptions, while for others the underlying sub-problems involve non-linear optimization problems that are themselves non-trivial to solve. A subtle, yet potentially severe, drawback of many of the above-mentioned methods is that their performance can be sensitive to, and severely affected by, the choice of their corresponding hyper-parameters. Here, we present a novel communication-efficient distributed second-order optimization method that aims to alleviate many of the aforementioned disadvantages.
Our approach is inspired by, and follows many ideas of, recent results on Newton-MR [16], which extends the application range of the classical Newton-CG beyond (strong) convexity and smoothness. More specifically, our algorithm, named DINGO for "DIstributed Newton-type method for Gradient-norm Optimization", is derived by optimization of the gradient's norm as a surrogate function for (1), i.e.,

  min_{w ∈ ℝ^d} { (1/2)‖∇f(w)‖² = (1/(2m²)) ‖ Σ_{i=1}^m ∇f_i(w) ‖² }.    (4)

When f is invex [17, 18], the problems (1) and (4) have the same solutions. Recall that invexity is a generalization of convexity, which extends the sufficiency of the first-order optimality condition, e.g., the Karush–Kuhn–Tucker conditions, to a broader class of problems than simple convex programming. In other words, invexity is a special case of non-convexity which subsumes convexity as a sub-class. In this light, unlike DiSCO and GIANT, by considering the surrogate function (4), DINGO's application range and theoretical guarantees extend far beyond convex settings to invex problems. Naturally, by considering (4), DINGO may converge to a local maximum or saddle point of non-invex problems. Similar to GIANT and DiSCO, and in contrast to DANE, InexactDANE and AIDE, our algorithm involves a few hyper-parameters that are easy to tune, and the underlying sub-problems are simple linear least-squares, for which a plethora of efficient algorithms exist.

Table 1: Comparison of problem class, function form and data distribution. Note that DINGO does not assume invexity in its analysis; rather, it is suited to invex problems in practice.

  Method      | Problem Class   | Function Form                          | Data Distribution
  DINGO       | Invex           | Any                                    | Any
  GIANT       | Strongly Convex | ℓ_j(w; x_j) = ψ_j(⟨w, x_j⟩) + γ‖w‖²    | |S_i| > d
  DiSCO       | Strongly Convex | Any                                    | Any
  InexactDANE | Non-Convex      | Any                                    | Any
  AIDE        | Non-Convex      | Any                                    | Any

Table 2: Comparison of the number of sub-problem hyper-parameters and communication rounds per iteration. Under inexact update, the choice of sub-problem solver will determine additional hyper-parameters. Most communication rounds of DiSCO arise when iteratively solving its sub-problem. We assume DINGO and GIANT use two communication rounds for line search per iteration.

  Method      | Sub-Problem Hyper-Parameters (Exact Update) | Communication Rounds Per Iteration (Inexact Update)
  DINGO       | 2 | ≤ 8
  GIANT       | 0 | 6
  DiSCO       | 0 | 2 + 2 · (sub-problem iterations)
  InexactDANE | 2 | 4
  AIDE        | 3 | 4 · (inner InexactDANE iterations)

However, the theoretical analysis of both GIANT and DiSCO is limited to the case where each f_i is strongly convex, and for GIANT they are also of the special form where in (2) we have ℓ_j(w; x_j) = ψ_j(⟨w, x_j⟩) + γ‖w‖², where γ > 0 is a regularization parameter and ψ_j is convex, e.g., linear predictor models. In contrast, DINGO does not impose any specific form on the underlying functions. Also, unlike GIANT, we allow for |S_i| < d in (2). Moreover, we theoretically show that DINGO is not too sensitive to the choice of its hyper-parameters, in that a strict reduction in the gradient norm is guaranteed regardless of the selected hyper-parameters. See Tables 1 and 2 for a summary of high-level algorithm properties. Finally, we note that, unlike GIANT, DiSCO, InexactDANE and AIDE, our theoretical analysis requires exact solutions to the sub-problems.
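To make the setup in (1) and (2) concrete, here is a minimal NumPy sketch (toy sizes, a squared loss standing in for ℓ_j; all names and dimensions are hypothetical illustrations, not part of the paper): the n data points are partitioned into equally sized index sets S_i, and f is formed as the average of the worker-local objectives f_i.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, d = 12, 3, 4                                  # toy sizes: n points, m workers
X = rng.standard_normal((n, d))                     # data points x_j as rows
y = rng.standard_normal(n)
w = rng.standard_normal(d)

def local_objective(w, idx):
    """f_i(w) = (1/|S_i|) * sum_{j in S_i} l_j(w; x_j), here with a squared loss."""
    r = X[idx] @ w - y[idx]
    return 0.5 * np.mean(r ** 2)

# Randomly partition {0, ..., n-1} into equally sized S_1, ..., S_m, as in (2).
parts = np.array_split(rng.permutation(n), m)
f_locals = [local_objective(w, S) for S in parts]   # each worker evaluates its f_i
f = np.mean(f_locals)                               # f(w) = (1/m) * sum_i f_i(w), as in (1)
```

With equally sized S_i, the average of the local averages equals the overall average loss over all n points; with unequal |S_i| the two generally differ, which is why (2) normalizes each f_i by |S_i|.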
Despite the fact that the sub-problems of DINGO are simple ordinary least-squares, and that DINGO performs well in practice with very crude solutions, this is admittedly a theoretical restriction, which we aim to address in future work.

The distributed computing environment that we consider is also assumed by GIANT, DiSCO, DANE, InexactDANE and AIDE. Moreover, as with these methods, we restrict communication to vectors of size linear in d, i.e., O(d). A communication round is performed when the driver uses a broadcast operation to send information to one or more workers in parallel, or uses a reduce operation to receive information from one or more workers in parallel. For example, computing the gradient at iteration t, namely g_t = (1/m) Σ_{i=1}^m g_{t,i}, requires two communication rounds, i.e., the driver broadcasts w_t to all workers and then, by a reduce operation, receives g_{t,i} for all i. We further remind that the distributed computational model considered here is such that the main bottleneck involves the communications across the network.

2 DINGO

In this section, we describe the derivation of DINGO, as depicted in Algorithm 1. Each iteration t involves the computation of two main ingredients: an update direction p_t, and an appropriate step-size α_t. As usual, our next iterate is then set as w_{t+1} = w_t + α_t p_t.

Update Direction

We begin iteration t by distributively computing the gradient g_t. Thereafter, we distributively compute the Hessian-gradient product H_t g_t = (1/m) Σ_{i=1}^m H_{t,i} g_t, as well as the vectors (1/m) Σ_{i=1}^m H†_{t,i} g_t and (1/m) Σ_{i=1}^m H̃†_{t,i} g̃_t. Computing the update direction p_t involves three cases, all of which involve simple linear least-squares sub-problems.

Case 1. If ⟨(1/m) Σ_{i=1}^m H†_{t,i} g_t, H_t g_t⟩ ≥ θ‖g_t‖², where θ is as in Algorithm 1, then we let p_t = (1/m) Σ_{i=1}^m p_{t,i}, with p_{t,i} = −H†_{t,i} g_t. Here, we check that the potential update direction "−(1/m) Σ_{i=1}^m H†_{t,i} g_t" yields suitable descent.
Namely, if ⟨(1/m) Σ_{i=1}^m H†_{t,i} g_t, H_t g_t⟩ ≥ θ‖g_t‖², then "−(1/m) Σ_{i=1}^m H†_{t,i} g_t" is a suitable descent direction for our surrogate objective (4). We do this since we have not imposed any restrictive assumptions on (1), e.g., strong convexity of each f_i, that would automatically guarantee descent; see Lemma 1 for an example of such restrictive assumptions.

Case 2. If Case 1 fails, we include regularization and check again that the new potential update direction yields suitable descent. Namely, if ⟨(1/m) Σ_{i=1}^m H̃†_{t,i} g̃_t, H_t g_t⟩ ≥ θ‖g_t‖², then we let p_t = (1/m) Σ_{i=1}^m p_{t,i}, with p_{t,i} = −H̃†_{t,i} g̃_t.

Case 3. If all else fails, we enforce descent in the norm of the gradient. More specifically, as Case 2 does not hold, the set

  I_t ≜ { i = 1, 2, …, m | ⟨H̃†_{t,i} g̃_t, H_t g_t⟩ < θ‖g_t‖² }    (5)

is non-empty. In parallel, the driver broadcasts H_t g_t to each worker i ∈ I_t and has it locally compute the solution to

  argmin_{p_{t,i}} (1/2)‖H_{t,i} p_{t,i} + g_t‖² + (φ²/2)‖p_{t,i}‖²  such that  ⟨p_{t,i}, H_t g_t⟩ ≤ −θ‖g_t‖²,

where φ is as in (3).
It is easy to show that the solution to this problem is

  p_{t,i} = −H̃†_{t,i} g̃_t − λ_{t,i} (H̃ᵀ_{t,i} H̃_{t,i})⁻¹ H_t g_t,
  λ_{t,i} = ( −gᵀ_t H_t H̃†_{t,i} g̃_t + θ‖g_t‖² ) / ( gᵀ_t H_t (H̃ᵀ_{t,i} H̃_{t,i})⁻¹ H_t g_t ).    (6)

The term λ_{t,i} in (6) is positive by the definition of I_t and is well-defined by Assumption 5, which implies that for g_t ≠ 0 we have H_t g_t ≠ 0. In conclusion, for Case 3, each worker i ∈ I_t computes (6) and, using a reduce operation, the driver then computes the update direction p_t = (1/m) Σ_{i=1}^m p_{t,i}, which by construction yields descent in the surrogate objective (4). Note that p_{t,i} = −H̃†_{t,i} g̃_t for all i ∉ I_t have already been obtained as part of Case 2.

Remark 1. The three cases help avoid the need for any unnecessary assumptions on data distribution or the knowledge of any practically unknowable constants. In fact, given Lemma 1, which imposes a certain assumption on the data distribution, we could have stated our algorithm in its simplest form, i.e., with only Case 1. This would be more in line with some prior works, e.g., GIANT, but it would have naturally restricted the applicability of our method in terms of data distributions.

Remark 2. In practice, like GIANT and DiSCO, our method DINGO never requires the computation or storage of an explicitly formed Hessian. Instead, it only requires Hessian-vector products, which can be computed at a similar cost to computing the gradient itself. Computing matrix pseudo-inverse and vector products, e.g., H†_{t,i} g_t, constitutes the sub-problems of our algorithm. This, in turn, is done by solving least-squares problems using iterative methods that only require matrix-vector products (see Section 4 for some such methods).
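The three-case construction above can be sketched end-to-end in NumPy. This is a toy single-process simulation, not the distributed implementation: the local Hessians are small dense stand-ins, and `pinv`/`solve` take the place of the iterative least-squares solvers mentioned above. Every branch returns a p_t satisfying ⟨p_t, H_t g_t⟩ ≤ −θ‖g_t‖², the descent property the line search relies on; in Case 3, λ_{t,i} from (6) makes the constraint hold with equality for each i ∈ I_t.

```python
import numpy as np

rng = np.random.default_rng(1)
d, m, phi, theta = 4, 3, 0.1, 0.05        # toy dimensions and hyper-parameters

# Stand-ins for the local Hessians H_{t,i} (symmetric, possibly indefinite).
Hs = []
for _ in range(m):
    B = rng.standard_normal((d, d))
    Hs.append((B + B.T) / 2)
H = sum(Hs) / m                           # full Hessian H_t
g = rng.standard_normal(d)                # full gradient g_t
Hg = H @ g
thr = theta * (g @ g)                     # theta * ||g_t||^2

eye = np.eye(d)

def reg_pinv_dir(Hi):
    """H~_{t,i}^† g~_t = (H_i^T H_i + phi^2 I)^{-1} H_i^T g, with H~, g~ as in (3)."""
    return np.linalg.solve(Hi.T @ Hi + phi ** 2 * eye, Hi.T @ g)

u1 = np.mean([np.linalg.pinv(Hi) @ g for Hi in Hs], axis=0)
if u1 @ Hg >= thr:                        # Case 1
    p = -u1
else:
    q = [reg_pinv_dir(Hi) for Hi in Hs]
    u2 = np.mean(q, axis=0)
    if u2 @ Hg >= thr:                    # Case 2
        p = -u2
    else:                                 # Case 3
        ps = []
        for Hi, qi in zip(Hs, q):
            if qi @ Hg >= thr:            # i not in I_t: keep the Case 2 direction
                ps.append(-qi)
            else:                         # i in I_t: correct with lambda_{t,i} from (6)
                r = np.linalg.solve(Hi.T @ Hi + phi ** 2 * eye, Hg)
                lam = (thr - Hg @ qi) / (Hg @ r)   # positive for i in I_t
                ps.append(-qi - lam * r)
        p = np.mean(ps, axis=0)

# By construction, p is a descent direction for the surrogate objective (4).
assert p @ Hg <= -thr + 1e-9
```

For i ∈ I_t the per-worker inner product is exactly −θ‖g_t‖² (substituting λ_{t,i} cancels the other terms), and for i ∉ I_t it is at most −θ‖g_t‖², so the average p_t satisfies the descent condition in every branch.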
Thus DINGO is suitable for large dimension d in (1).

Line Search

After computing the update direction p_t, DINGO computes the next iterate w_{t+1} by moving along p_t by an appropriate step-size α_t and forming w_{t+1} = w_t + α_t p_t. We use an Armijo-type line search to choose this step-size. Specifically, as we are minimizing the norm of the gradient as a surrogate function, we choose the largest α_t ∈ (0, 1] such that

  ‖g_{t+1}‖² ≤ ‖g_t‖² + 2 α_t ρ ⟨p_t, H_t g_t⟩,    (7)

for some constant ρ ∈ (0, 1). By construction of p_t, we always have ⟨p_t, H_t g_t⟩ ≤ −θ‖g_t‖². Therefore, after each iteration we strictly decrease the norm of the gradient, and the line search guarantees that this occurs irrespective of all hyper-parameters of DINGO, i.e., θ, φ and ρ.

Algorithm 1 DINGO
1: input: initial point w_0 ∈ ℝ^d, gradient tolerance δ ≥ 0, maximum iterations T, line search parameter ρ ∈ (0, 1), parameter θ > 0, and regularization parameter φ > 0 as in (3).
2: for t = 0, 1, 2, …
, T − 1 do
3:   Distributively compute the full gradient g_t.
4:   if ‖g_t‖ ≤ δ then
5:     return w_t
6:   else
7:     The driver broadcasts g_t and, in parallel, each worker i computes H_{t,i} g_t, H†_{t,i} g_t and H̃†_{t,i} g̃_t.
8:     By a reduce operation, the driver computes H_t g_t = (1/m) Σ_{i=1}^m H_{t,i} g_t, (1/m) Σ_{i=1}^m H†_{t,i} g_t and (1/m) Σ_{i=1}^m H̃†_{t,i} g̃_t.
9:     if ⟨(1/m) Σ_{i=1}^m H†_{t,i} g_t, H_t g_t⟩ ≥ θ‖g_t‖² then
10:      Let p_t = (1/m) Σ_{i=1}^m p_{t,i}, with p_{t,i} = −H†_{t,i} g_t.
11:    else if ⟨(1/m) Σ_{i=1}^m H̃†_{t,i} g̃_t, H_t g_t⟩ ≥ θ‖g_t‖² then
12:      Let p_t = (1/m) Σ_{i=1}^m p_{t,i}, with p_{t,i} = −H̃†_{t,i} g̃_t.
13:    else
14:      The driver computes p_{t,i} = −H̃†_{t,i} g̃_t for all i such that ⟨H̃†_{t,i} g̃_t, H_t g_t⟩ ≥ θ‖g_t‖².
15:      The driver broadcasts H_t g_t to each worker i such that ⟨H̃†_{t,i} g̃_t, H_t g_t⟩ < θ‖g_t‖² and, in parallel, they compute p_{t,i} = −H̃†_{t,i} g̃_t − λ_{t,i} (H̃ᵀ_{t,i} H̃_{t,i})⁻¹ H_t g_t, with λ_{t,i} = (−gᵀ_t H_t H̃†_{t,i} g̃_t + θ‖g_t‖²) / (gᵀ_t H_t (H̃ᵀ_{t,i} H̃_{t,i})⁻¹ H_t g_t), as in (6).
16:      Using a reduce operation, the driver computes p_t = (1/m) Σ_{i=1}^m p_{t,i}.
17:    end if
18:    Choose the largest α_t ∈ (0, 1] such that ‖∇f(w_t + α_t p_t)‖² ≤ ‖g_t‖² + 2 α_t ρ ⟨p_t, H_t g_t⟩.
19:    The driver computes w_{t+1} = w_t + α_t p_t.
20:  end if
21: end for
22: return w_T.

3 Theoretical Analysis

In this section, we present convergence results for DINGO. The reader can find proofs of lemmas and theorems in the supplementary material. For notational convenience, in our analysis we let

  C_1 ≜ { t | ⟨(1/m) Σ_{i=1}^m H†_{t,i} g_t, H_t g_t⟩ ≥ θ‖g_t‖² },
  C_2 ≜ { t | ⟨(1/m) Σ_{i=1}^m H̃†_{t,i} g̃_t, H_t g_t⟩ ≥ θ‖g_t‖², t ∉ C_1 },
  C_3 ≜ { t | t ∉ (C_1 ∪ C_2) },

which are the sets indexing iterations t that fall in Case 1, Case 2 and Case 3, respectively. The convergence analyses under these cases are treated separately in Sections 3.2, 3.3 and 3.4. The unifying result is then simply given in Corollary 1. We begin, in Section 3.1, by establishing the general underlying assumptions of our analysis. The analyses of Case 1 and Case 3 require their own specific assumptions, which are discussed in Sections 3.2 and 3.4, respectively.

Remark 3. As long as the presented assumptions are satisfied, our algorithm converges for any choice of θ and φ, i.e., these hyper-parameters do not require knowledge of the practically unknowable parameters from these assumptions. However, in Lemma 3 we give qualitative guidelines for a better choice of θ and φ to avoid Case 2 and Case 3, which are shown to be less desirable than Case 1.

3.1 General Assumptions

As DINGO makes use of Hessian-vector products, we make the following straightforward assumption.

Assumption 1 (Twice Differentiability). The functions f_i in (1) are twice differentiable.

Notice that we do not require each f_i to be twice continuously differentiable. In particular, our analysis carries through even if the Hessian is discontinuous. This is in sharp contrast to the popular belief that the application of a non-smooth Hessian can hurt more than it helps, e.g., [19].
Note that, even if the Hessian is discontinuous, Assumption 1 is sufficient to ensure that H_{t,i} is symmetric for all t and i [20]. Following [16], we also make the following general assumption on f.

Assumption 2 (Moral-Smoothness [16]). For all iterations t, there exists a constant L ∈ (0, ∞) such that ‖∇²f(w)∇f(w) − ∇²f(w_t)∇f(w_t)‖ ≤ L‖w − w_t‖ for all w ∈ [w_t, w_t + p_t], where p_t is the update direction of DINGO at iteration t.

As discussed in [16] with explicit examples, Assumption 2 is strictly weaker than the common assumption that the gradient and Hessian are both Lipschitz continuous. Using [16, Lemma 10], it follows from Assumptions 1 and 2 that

  ‖∇f(w_t + αp_t)‖² ≤ ‖g_t‖² + 2α⟨p_t, H_t g_t⟩ + α²L‖p_t‖²,    (8)

for all α ∈ [0, 1] and all iterations t.

3.2 Analysis of Case 1

In this section, we analyze the convergence of iterations of DINGO that fall under Case 1. For such iterations, we make the following assumption about the action of the pseudo-inverse of H_{t,i} on g_t.

Assumption 3 (Pseudo-Inverse Regularity of H_{t,i} on g_t). For all t ∈ C_1 and all i = 1, 2, …, m, there exist constants γ_i ∈ (0, ∞) such that ‖H†_{t,i} g_t‖ ≤ γ_i‖g_t‖.

Assumption 3 may appear unconventional. However, it may be seen as more general than the following assumption.

Assumption 4 (Pseudo-Inverse Regularity of H_t on its Range Space [16]).
There exists a constant γ ∈ (0, ∞) such that, for all iterates w_t, we have ‖H_t p‖ ≥ γ‖p‖ for all p ∈ R(H_t).

Assumption 4 implies ‖H†_t g_t‖ = ‖H†_t (U_t Uᵀ_t + U⊥_t (U⊥_t)ᵀ) g_t‖ = ‖H†_t U_t Uᵀ_t g_t‖ ≤ γ⁻¹‖g_t‖, where U_t and U⊥_t denote arbitrary orthonormal bases for R(H_t) and R(H_t)⊥, respectively, and R(H_t)⊥ = N(Hᵀ_t) = N(H†_t). Recall that Assumption 4 is a significant relaxation of strong convexity. As an example, an under-determined least-squares problem f(w) = ‖Aw − b‖²/2, which is clearly not strongly convex, satisfies Assumption 4 with γ = σ²_min(A), where σ_min(A) is the smallest non-zero singular value of A.

Theorem 1 (Convergence Under Case 1). Suppose we run DINGO. Then, under Assumptions 1, 2 and 3, for all t ∈ C_1 we have ‖g_{t+1}‖² ≤ (1 − 2τ_1 ρθ)‖g_t‖², where τ_1 = min{1, 2(1 − ρ)θ/(Lγ²)}, γ = (1/m) Σ_{i=1}^m γ_i, L is as in Assumption 2, the γ_i are as in Assumption 3, and ρ and θ are as in Algorithm 1.

From the proof of Theorem 1, it is easy to see that, for all t ∈ C_1, we are guaranteed that 0 < 1 − 2τ_1 ρθ < 1. In Theorem 1, the term γ is the average of the γ_i. This is beneficial as it "smooths out" non-uniformity in the γ_i; for example, γ ≥ min_i γ_i. Under specific assumptions on (1), we can theoretically guarantee that t ∈ C_1 for all iterations t. The following lemma provides one such example.

Lemma 1. Suppose Assumption 1 holds and that we run DINGO. Furthermore, suppose that, for all iterations t and all i = 1, 2, …
, m, the Hessian matrix H_{t,i} is invertible and there exist constants ε_i ∈ [0, ∞) and ν_i ∈ (0, ∞) such that ‖H_{t,i} − H_t‖ ≤ ε_i and ν_i‖g_t‖ ≤ ‖H_{t,i} g_t‖. If (1/m) Σ_{i=1}^m (1 − ε_i/ν_i) ≥ θ, then t ∈ C_1 for all t, where θ is as in Algorithm 1.

As an example, the assumptions of Lemma 1 trivially hold if each f_i is strongly convex and we assume a certain data distribution. Under the assumptions of Lemma 1, if the Hessian matrix of each worker is on average a reasonable approximation to the full Hessian, i.e., ε_i is on average sufficiently small so that Σ_{i=1}^m ε_i/ν_i < m, then we can choose θ small enough to ensure that t ∈ C_1 for all t. In other words, for the iterates to stay in C_1, we do not require the Hessian matrix of each individual worker to be a high-quality approximation to the full Hessian (which could indeed be hard to enforce in many practical applications). As long as the data is distributed in such a way that the Hessian matrices are on average reasonable approximations, we can guarantee that t ∈ C_1 for all t.

3.3 Analysis of Case 2

We now analyze the convergence of DINGO for iterations that fall under Case 2. For this case, we do not require any assumptions in addition to Assumptions 1 and 2. Instead, we use the upper bound ‖H̃†_{t,i}‖ ≤ 1/φ for all iterations t and all i = 1, 2, …, m, where φ is as in Algorithm 1; see Lemma 4 in the supplementary material for a proof of this upper bound.

Theorem 2 (Convergence Under Case 2). Suppose we run DINGO.
Then, under Assumptions 1 and 2, for all t ∈ C_2 we have ‖g_{t+1}‖² ≤ (1 − 2τ_2 ρθ)‖g_t‖², where τ_2 = min{1, 2(1 − ρ)φ²θ/L}, L is as in Assumption 2, and ρ, θ and φ are as in Algorithm 1.

In our experience, we have found that Case 2 does not occur frequently in practice. It serves more of a theoretical purpose and is used to identify when Case 3 is required. Case 2 may be thought of as a specific instance of Case 3 in which I_t is empty. However, it merits its own case as, in the analysis, it does not require assumptions in addition to Assumptions 1 and 2, and, in practice, it may avoid an additional two communication rounds. If we were to bypass Case 2 to Case 3 and allow I_t to be empty, then Theorem 3 of Section 3.4 with |I_t| = 0, which states the convergence for Case 3, indeed coincides with Theorem 2.

3.4 Analysis of Case 3

We now turn to the final case and analyze the convergence of iterations of DINGO that fall under Case 3. For such iterations, we make the following assumption.

Assumption 5. For all t ∈ C_3 and all i = 1, 2, …, m, there exist constants δ_i ∈ (0, ∞) such that ‖(H̃ᵀ_{t,i})† H_t g_t‖ ≥ δ_i‖g_t‖.

Assumption 5, like Assumption 3, may appear unconventional. In Lemma 2 we show how Assumption 5 is implied by three other reasonable assumptions, one of which is as follows.

Assumption 6 (Gradient-Hessian Null-Space Property [16]). There exists a constant ν ∈ (0, 1] such that ‖(U⊥_w)ᵀ ∇f(w)‖² ≤ (1 − ν)ν⁻¹ ‖Uᵀ_w ∇f(w)‖² for all w ∈ ℝ^d, where U_w and U⊥_w denote any orthonormal bases for R(∇²f(w)) and its orthogonal complement, respectively.

Assumption 6 implies that, as the iterations progress, the gradient will not become arbitrarily orthogonal to the range space of the Hessian matrix. As an example, any least-squares problem f(w) = ‖Aw − b‖²/2 satisfies Assumption 6 with ν = 1; see [16] for detailed discussion and many more examples of Assumption 6.

Lemma 2. Suppose Assumptions 4 and 6 hold and that ‖H_{t,i}‖² ≤ τ_i for all t ∈ C_3 and all i = 1, 2, …, m, with τ_i ∈ (0, ∞), i.e., the local Hessians are bounded. Then Assumption 5 holds with δ_i = γ√(ν/(τ_i + φ²)), where φ is as in Algorithm 1, and γ and ν are as in Assumptions 4 and 6, respectively.

The following theorem provides convergence properties for iterations of DINGO that are in Case 3.

Theorem 3 (Convergence Under Case 3). Suppose we run DINGO.
Then, under Assumptions 1, 2 and 5, for all t ∈ C_3 we have ‖g_{t+1}‖² ≤ (1 − 2ω_t ρθ)‖g_t‖² ≤ (1 − 2τ_3 ρθ)‖g_t‖², where ω_t = min{1, 2(1 − ρ)θ/(Lc_t²)}, τ_3 = min{1, 2(1 − ρ)θ/(Lc²)},

  c_t = (1/(mφ)) ( m + |I_t| + θ Σ_{i ∈ I_t} 1/δ_i )  and  c = 2/φ + (θ/(mφ)) Σ_{i=1}^m 1/δ_i,

L is as in Assumption 2, the δ_i are as in Assumption 5, I_t is as in (5), and ρ, θ and φ are as in Algorithm 1.

Note that the convergence in Theorem 3 is given in both an iteration-dependent and an iteration-independent form, since the former explicitly relates the convergence rate to the size of I_t, while the latter simply upper-bounds this, and hence is qualitatively less informative.

Comparing Theorems 2 and 3, iterations of DINGO should have slower convergence if they are in Case 3 rather than Case 2. By Theorem 3, if an iteration t resorts to Case 3 then we may have slower convergence for larger |I_t|. Moreover, this iteration would require two more communication rounds than if it were to stop in Case 1 or Case 2. Therefore, one may wish to choose θ and φ appropriately to reduce the chance that iteration t falls in Case 3 or that |I_t| is large. Under this consideration, Lemma 3 presents a necessary condition on a relationship between θ and φ.

Lemma 3. Suppose we run DINGO. Under Assumption 1, if |I_t| < m for some iteration t, then θφ ≤ ‖H_t g_t‖/‖g_t‖.

(a) 10 Workers  (b) 100 Workers  (c) 1000 Workers  (d) 10000 Workers

Figure 1: Softmax regression problem on the CIFAR10 dataset. All algorithms are initialized at w_0 = 0. In all plots, Sync-SGD has a learning rate of 10⁻².
Async-SGD has a learning rate of 10⁻³ in 1(a), 10⁻⁴ in 1(b) and 1(c), and 10⁻⁵ in 1(d). SVRG has a learning rate of 10⁻³ in 1(a) and 1(d), and 10⁻² in 1(b) and 1(c). AIDE has τ = 100 in 1(a) and 1(d), τ = 1 in 1(b), and τ = 10 in 1(c). The number of workers is the value of m in (1) and (2).

Lemma 3 suggests that we should pick θ and φ such that their product θφ is small. Clearly, choosing a smaller θ will increase the chance of an iteration of DINGO being in Case 1 or Case 2. However, this also gives a lower rate of convergence in Theorems 1 and 2. Choosing a smaller φ will preserve more curvature information of the Hessian H_{t,i} in H̃†_{t,i}. However, φ should still be reasonably large, as making φ smaller also makes some of the sub-problems of DINGO more ill-conditioned. There is a non-trivial trade-off between φ and θ, and Lemma 3 gives an appropriate way to set them.

We can finally present a unifying result on the overall worst-case linear convergence rate of DINGO.

Corollary 1 (Overall Linear Convergence of DINGO). Suppose we run DINGO. Then, under Assumptions 1, 2, 3 and 5, for all iterations t we have ‖g_{t+1}‖² ≤ (1 − 2τρθ)‖g_t‖², with τ = min{τ_1, τ_2, τ_3}, where τ_1, τ_2 and τ_3 are as in Theorems 1, 2 and 3, respectively, and ρ and θ are as in Algorithm 1.

From Corollary 1, DINGO can achieve ‖g_t‖ ≤ ε with O(log(1/ε)/(τρθ)) communication rounds. Moreover, the term τ is a lower bound on the step-size under all cases, which can determine the maximum communication cost needed during line search.
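As a concrete sketch of such a backtracking search over step-sizes {1, 2⁻¹, 2⁻², …} against condition (7) (a toy single-machine illustration on a small quadratic; the data and function names are hypothetical, not part of the paper): for f(w) = ‖Aw − b‖²/2 the Newton direction p = −H⁻¹g satisfies ⟨p, Hg⟩ = −‖g‖², so it is a valid trial direction, and (7) accepts α = 1 whenever ρ < 1/2.

```python
import numpy as np

def linesearch_gradnorm(grad, w, p, Hg, rho=0.25, max_halvings=50):
    """Largest alpha in {1, 1/2, 1/4, ...} satisfying condition (7):
    ||grad(w + alpha p)||^2 <= ||grad(w)||^2 + 2 alpha rho <p, Hg>."""
    g = grad(w)
    gg = g @ g
    alpha = 1.0
    for _ in range(max_halvings):
        g_new = grad(w + alpha * p)
        if g_new @ g_new <= gg + 2.0 * alpha * rho * (p @ Hg):
            return alpha
        alpha /= 2.0
    return alpha

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 4))
b = rng.standard_normal(8)
grad = lambda w: A.T @ (A @ w - b)        # gradient of f(w) = ||Aw - b||^2 / 2
H = A.T @ A                               # Hessian of the toy quadratic

w = rng.standard_normal(4)
g = grad(w)
p = -np.linalg.solve(H, g)                # Newton direction: <p, Hg> = -||g||^2
alpha = linesearch_gradnorm(grad, w, p, H @ g)
w_next = w + alpha * p                    # one step on this toy problem
```

On this quadratic the Newton step lands on the minimizer, so the first trial α = 1 is accepted and the new gradient norm is (numerically) zero; bound (8) is what guarantees the loop also terminates in the general case.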
For example, knowing $\tau$ could determine the number of step-sizes used in backtracking line-search for DINGO in Section 4.

4 Experiments

In this section, we evaluate the empirical performance of DINGO, GIANT, DiSCO, InexactDANE, AIDE, Asynchronous SGD (Async-SGD) and Synchronous SGD (Sync-SGD) [11] on the strongly convex problem of regularized softmax cross-entropy minimization on the CIFAR10 dataset [21]; see Figure 1. This dataset has 50000 training samples and 10000 test samples, and each data point $x_i \in \mathbb{R}^{3072}$ has a label $y_i \in \{1, 2, \ldots, 10\}$. This problem has dimension $d = 27648$. In the supplementary material, the reader can find additional experiments on another softmax regression problem, as well as on a Gaussian mixture model and an autoencoder problem.

[Figure 1 plots, for each worker count, the objective $f(w)$, the gradient norm $\|\nabla f(w)\|$, the line-search step-size $\alpha$, and the test classification accuracy (%) against communication rounds, for AIDE, Async-SGD, DINGO, DiSCO, GIANT, InexactDANE and Sync-SGD.]

Figure 2: Softmax regression problem on the CIFAR10 dataset. We compare DINGO with $\theta = 10^{-4}, 10^{-1}, 1, 10, 100$. All iterations are in Case 1 with $\theta = 10^{-4}$, which implies the same plot would occur for all $\theta \le 10^{-4}$. Case 1 and Case 3 iterations occur when $\theta = 10^{-1}, 1$. All iterations under $\theta = 10, 100$ are in Case 3.

In all experiments we consider (1) with (2), where the sets $S_1, S_2, \ldots, S_m$ randomly partition the index set $\{1, 2, \ldots, n\}$, each having equal size $s = n/m$. Code is available at https://github.com/RixonC/DINGO.
We now describe some implementation details. All sub-problem solvers are limited to 50 iterations and do not employ preconditioning. For DINGO, we use the sub-problem solvers MINRES-QLP [22], LSMR [23] and CG [24] when computing $H_{t,i}^{\dagger} g_t$, $\tilde{H}_{t,i}^{\dagger} \tilde{g}_t$ and $(\tilde{H}_{t,i}^{T} \tilde{H}_{t,i})^{-1} H_t g_t$, respectively. We choose CG for the latter problem as its approximation $x$ of $(\tilde{H}_{t,i}^{T} \tilde{H}_{t,i})^{-1} H_t g_t$ is guaranteed to satisfy $\langle H_t g_t, x \rangle > 0$ regardless of the number of CG iterations performed. For DINGO, unless otherwise stated, we set $\theta = 10^{-4}$ and $\phi = 10^{-6}$. We use backtracking line-search for DINGO and GIANT to select the largest step-size in $\{1, 2^{-1}, 2^{-2}, \ldots, 2^{-50}\}$ which passes, with an Armijo line-search parameter of $10^{-4}$. For InexactDANE, we set $\eta = 1$ and $\mu = 0$, as in [15], and use SVRG [25] as a local solver with the best learning rate from $\{10^{-6}, 10^{-5}, \ldots, 10^{6}\}$. We have each iteration of AIDE invoke one iteration of InexactDANE, with the same parameters as in the stand-alone InexactDANE method, and use the best catalyst acceleration parameter $\tau \in \{10^{-6}, 10^{-5}, \ldots, 10^{6}\}$, as in [15]. For Async-SGD and Sync-SGD we report the best learning rate from $\{10^{-6}, 10^{-5}, \ldots$
, $10^{6}\}$ and each worker uses a mini-batch of size $n/(5m)$.
DiSCO has consistent performance, regardless of the number of workers, due to its distributed PCG algorithm. This essentially allows DiSCO to perform Newton's method over the full dataset. When $s$ is reasonably large, this is unnecessarily costly in terms of communication rounds, and we accordingly see DiSCO perform comparatively poorly in Plots 1(a), 1(b) and 1(c). DiSCO outperforms GIANT and DINGO in Plot 1(d). This is likely because the local directions ($-H_{t,i}^{-1} g_t$ and $p_{t,i}$ for GIANT and DINGO, respectively) give poor updates, as they are calculated using very small subsets of data: in Plot 1(d) each worker has access to only 5 data points, while $d = 27648$.
A significant advantage of DINGO over InexactDANE, AIDE, Async-SGD and Sync-SGD is that its hyper-parameters are relatively easy to tune. Namely, while bad choices of $\rho$, $\theta$ and $\phi$ in DINGO will give sub-optimal performance, DINGO is still theoretically guaranteed to strictly decrease the norm of the gradient. In contrast, some choices of hyper-parameters in InexactDANE, AIDE, Async-SGD and Sync-SGD will cause divergence, and these choices can be problem-specific. Moreover, these methods can be very sensitive to the chosen hyper-parameters, some of which are very difficult to select. For example, the acceleration parameter $\tau$ in AIDE was found to be difficult and time-consuming to tune, and the performance of AIDE was sensitive to it; notice the variation in the selected $\tau$ in Figure 1. This difficulty was also observed in [13, 15]. We found that simply choosing $\rho$, $\theta$ and $\phi$ to be small in DINGO gave high performance. Figure 2 compares different values of $\theta$.

5 Future Work

The following is left for future work. First, extending the analysis of DINGO to include convergence results under inexact updates.
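For concreteness, the backtracking scheme used in the experiments of Section 4 (the largest step-size in $\{1, 2^{-1}, \ldots, 2^{-50}\}$ passing an Armijo test with parameter $10^{-4}$) can be sketched as follows. Since DINGO decreases the gradient norm rather than the objective, the acceptance test below is phrased on $\|g\|^2$; the exact form of this test, the function name, and the `grad_fn` callback are assumptions for illustration, not the paper's verbatim implementation.

```python
import numpy as np

def backtracking_step(w, p, grad, hess_grad, grad_fn, rho=1e-4, max_halvings=50):
    """Return the largest alpha in {1, 1/2, ..., 2**-max_halvings} satisfying
    ||g(w + alpha*p)||^2 <= ||g(w)||^2 + 2*alpha*rho*<p, H(w) g(w)>,
    an Armijo-type decrease condition on the surrogate ||g||^2.

    grad      : gradient g(w) at the current iterate
    hess_grad : Hessian-gradient product H(w) g(w)
    grad_fn   : callable returning the gradient at a trial point
    """
    g_sq = grad @ grad
    slope = p @ hess_grad  # <p_t, H_t g_t>; descent requires this to be negative
    alpha = 1.0
    for _ in range(max_halvings + 1):
        g_trial = grad_fn(w + alpha * p)
        if g_trial @ g_trial <= g_sq + 2.0 * alpha * rho * slope:
            break
        alpha *= 0.5
    return alpha

# On f(w) = ||w||^2 / 2 (so g = w, H = I) with Newton direction p = -w,
# the full step alpha = 1 is accepted immediately.
```

Note that each trial step requires evaluating the gradient at the trial point, which in the distributed setting is exactly the communication cost discussed below.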
Second, finding more efficient line-search methods, for practical implementations of DINGO, than backtracking line-search. With backtracking line-search, GIANT communicates a constant number of scalars per iteration, whereas DINGO communicates a constant number of vectors; hence DINGO may transmit a large amount of data over the network, while still requiring only two communication rounds of line-search per iteration. Lastly, considering modifications to DINGO that prevent convergence to a local maximum or saddle point in non-invex problems.

[Figure 2 plots the objective $f(w)$, the gradient norm $\|\nabla f(w)\|$, the test classification accuracy (%), and the line-search step-size $\alpha$ against communication rounds, for DINGO with $\theta \in \{10^{-4}, 10^{-1}, 1, 10, 100\}$.]

Acknowledgments

Both authors gratefully acknowledge the generous support of the Australian Research Council (ARC) Centre of Excellence for Mathematical & Statistical Frontiers (ACEMS). Fred Roosta was partially supported by DARPA, as well as by ARC through a Discovery Early Career Researcher Award (DE180100923). Part of this work was done while Fred Roosta was visiting the Simons Institute for the Theory of Computing.

References

[1] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, Cambridge, 2014.

[2] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The Elements of Statistical Learning, volume 1. Springer Series in Statistics, New York, NY, USA, 2001.

[3] Mehryar Mohri, Afshin Rostamizadeh, Ameet Talwalkar, and Francis Bach. Foundations of Machine Learning. MIT Press, Cambridge, 2012.

[4] Nan Ye, Farbod Roosta-Khorasani, and Tiangang Cui. Optimization Methods for Inverse Problems, volume 2 of MATRIX Book Series. Springer, 2017.
arXiv:1712.00154.

[5] Farbod Roosta-Khorasani, Kees van den Doel, and Uri Ascher. Stochastic algorithms for inverse problems involving PDEs and many measurements. SIAM Journal on Scientific Computing, 36(5):S3–S22, 2014.

[6] Farbod Roosta-Khorasani, Kees van den Doel, and Uri Ascher. Data completion and stochastic algorithms for PDE inversion problems with many measurements. Electronic Transactions on Numerical Analysis, 42:177–196, 2014.

[7] Rasul Tutunov, Haitham Bou-Ammar, and Ali Jadbabaie. Distributed Newton method for large-scale consensus optimization. IEEE Transactions on Automatic Control, 64(10):3983–3994, 2019.

[8] Ron Bekkerman, Mikhail Bilenko, and John Langford. Scaling up Machine Learning: Parallel and Distributed Approaches. Cambridge University Press, Cambridge, 2012.

[9] Yuchen Zhang and Xiao Lin. DiSCO: distributed optimization for self-concordant empirical loss. In International Conference on Machine Learning, pages 362–370, 2015.

[10] Amir Beck. First-Order Methods in Optimization. SIAM, 2017.

[11] Jianmin Chen, Rajat Monga, Samy Bengio, and Rafal Jozefowicz. Revisiting distributed synchronous SGD. In International Conference on Learning Representations Workshop Track, 2016.

[12] Roger Penrose. A generalized inverse for matrices. Mathematical Proceedings of the Cambridge Philosophical Society, 51(3):406–413, 1955.

[13] Shusen Wang, Farbod Roosta-Khorasani, Peng Xu, and Michael W. Mahoney. GIANT: globally improved approximate Newton method for distributed optimization. In Advances in Neural Information Processing Systems, pages 2338–2348, 2018.

[14] Ohad Shamir, Nati Srebro, and Tong Zhang. Communication-efficient distributed optimization using an approximate Newton-type method. In International Conference on Machine Learning, pages 1000–1008, 2014.

[15] Sashank J. Reddi, Jakub Konečný, Peter Richtárik, Barnabás Póczós, and Alex Smola. AIDE: fast and communication efficient distributed optimization. arXiv preprint arXiv:1608.06879, 2016.

[16] Fred Roosta, Yang Liu, Peng Xu, and Michael W. Mahoney. Newton-MR: Newton's method without smoothness or convexity. arXiv preprint arXiv:1810.00303, 2018.

[17] A. Ben-Israel and B. Mond. What is invexity? The ANZIAM Journal, 28(1):1–9, 1986.

[18] Shashi K. Mishra and Giorgio Giorgi. Invexity and Optimization. Springer, Berlin, Heidelberg, 2008.

[19] Yossi Arjevani, Ohad Shamir, and Ron Shiff. Oracle complexity of second-order methods for smooth convex optimization. Mathematical Programming, pages 1–34, 2017.

[20] John H. Hubbard and Barbara Burke Hubbard. Vector Calculus, Linear Algebra, and Differential Forms: a Unified Approach. Matrix Editions, 5th edition, 2015.

[21] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.

[22] Sou-Cheng T. Choi, Christopher C. Paige, and Michael A. Saunders. MINRES-QLP: a Krylov subspace method for indefinite or singular symmetric systems. SIAM Journal on Scientific Computing, 33(4):1810–1836, 2011.

[23] David Chin-Lung Fong and Michael Saunders. LSMR: an iterative algorithm for sparse least-squares problems. SIAM Journal on Scientific Computing, 33(5):2950–2971, 2011.

[24] Jorge Nocedal and Stephen Wright. Numerical Optimization. Springer Science & Business Media, 2006.

[25] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323, 2013.