{"title": "Distributed estimation of the inverse Hessian by determinantal averaging", "book": "Advances in Neural Information Processing Systems", "page_first": 11405, "page_last": 11415, "abstract": "In distributed optimization and distributed numerical linear algebra,\nwe often encounter an inversion bias: if we want to compute a\nquantity that depends on the inverse of a sum of distributed matrices,\nthen the sum of the inverses does not equal the inverse of the sum. \nAn example of this occurs in distributed Newton's method, where we\nwish to compute (or implicitly work with) the inverse Hessian\nmultiplied by the gradient. \nIn this case, locally computed estimates are biased, and so taking a\nuniform average will not recover the correct solution. \nTo address this, we propose determinantal averaging, a new\napproach for correcting the inversion bias. \nThis approach involves reweighting the local estimates of the Newton\nstep proportionally to the determinant of the local Hessian estimate,\nand then averaging them together to obtain an improved global\nestimate. This method provides the first known distributed Newton step that is\nasymptotically consistent, i.e., it recovers the exact step in\nthe limit as the number of distributed partitions grows to infinity. \nTo show this, we develop new expectation identities and moment bounds\nfor the determinant and adjugate of a random matrix. \nDeterminantal averaging can be applied not only to Newton's method,\nbut to computing any quantity that is a linear transformation of a\nmatrix inverse, e.g., taking a trace of the inverse covariance matrix,\nwhich is used in data uncertainty quantification.", "full_text": "Distributed estimation of the inverse Hessian by\n\ndeterminantal averaging\n\nMichał Dereziński\n\nDepartment of Statistics\n\nUniversity of California, Berkeley\n\nmderezin@berkeley.edu\n\nMichael W. 
Mahoney

ICSI and Department of Statistics
University of California, Berkeley
mmahoney@stat.berkeley.edu

Abstract

In distributed optimization and distributed numerical linear algebra, we often encounter an inversion bias: if we want to compute a quantity that depends on the inverse of a sum of distributed matrices, then the sum of the inverses does not equal the inverse of the sum. An example of this occurs in distributed Newton's method, where we wish to compute (or implicitly work with) the inverse Hessian multiplied by the gradient. In this case, locally computed estimates are biased, and so taking a uniform average will not recover the correct solution. To address this, we propose determinantal averaging, a new approach for correcting the inversion bias. This approach involves reweighting the local estimates of the Newton step proportionally to the determinant of the local Hessian estimate, and then averaging them together to obtain an improved global estimate. This method provides the first known distributed Newton step that is asymptotically consistent, i.e., it recovers the exact step in the limit as the number of distributed partitions grows to infinity. To show this, we develop new expectation identities and moment bounds for the determinant and adjugate of a random matrix. Determinantal averaging can be applied not only to Newton's method, but to computing any quantity that is a linear transformation of a matrix inverse, e.g., taking a trace of the inverse covariance matrix, which is used in data uncertainty quantification.

1 Introduction

Many problems in machine learning and optimization require that we produce an accurate estimate of a square matrix H (such as the Hessian of a loss function or a sample covariance), while having access to many copies of some unbiased estimator of H, i.e., a random matrix Ĥ such that E[Ĥ] = H. In these cases, taking a uniform average of those independent copies provides a natural strategy for boosting the estimation accuracy, essentially by making use of the law of large numbers: (1/m) \sum_{t=1}^m Ĥ_t → H.

For many other problems, however, we are more interested in the inverse (Hessian/covariance) matrix H^{-1}, and it is necessary or desirable to work with Ĥ^{-1} as the estimator. Here, a naïve averaging approach has certain fundamental limitations (described in more detail below). The basic reason for this is that E[Ĥ^{-1}] ≠ H^{-1}, i.e., that there is what may be called an inversion bias.

In this paper, we propose a method to address this inversion bias challenge. The method uses a weighted average, where the weights are carefully chosen to compensate for and correct the bias. Our motivation comes from distributed Newton's method (explained shortly), where combining independent estimates of the inverse Hessian is desired, but our method is more generally applicable, and so we first state our key ideas in a more general context.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Theorem 1 Let s_i be independent random variables and Z_i be fixed square rank-1 matrices. If Ĥ = \sum_i s_i Z_i is invertible almost surely, then the inverse of the matrix H = E[Ĥ] can be expressed as:

    H^{-1} = E[det(Ĥ) Ĥ^{-1}] / E[det(Ĥ)].

To demonstrate the implications of Theorem 1, suppose that our goal is to estimate F(H^{-1}) for some linear function F. For example, in the case of Newton's method, F(H^{-1}) = H^{-1}g, where g is the gradient and H is the Hessian. Another example would be F(H^{-1}) = tr(H^{-1}), where H is the covariance matrix of a dataset and tr(·) is the matrix trace, which is useful for uncertainty quantification.
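Theorem 1 is easy to check numerically. The following sketch (our illustration, not from the paper; the exponential weights s_i, the dimensions, and the Monte Carlo sample count are arbitrary choices of independent weights that keep Ĥ almost surely invertible) estimates the ratio E[det(Ĥ)Ĥ^{-1}] / E[det(Ĥ)] by simulation and compares it against H^{-1}, alongside the naïve average of inverses, which exhibits the inversion bias:

```python
import numpy as np

# Monte Carlo check of Theorem 1: E[det(H_hat) H_hat^{-1}] / E[det(H_hat)] = H^{-1}.
# Weights s_i ~ Exp(1) (mean 1) are an arbitrary choice; Z_i = x_i x_i^T are fixed rank-1.
rng = np.random.default_rng(0)
d, n, trials = 2, 8, 200_000
X = rng.standard_normal((n, d))
Z = np.einsum('ni,nj->nij', X, X)            # fixed rank-1 matrices Z_i
H = Z.sum(axis=0)                            # H = E[H_hat], since E[s_i] = 1

S = rng.exponential(1.0, size=(trials, n))   # independent random weights s_i
H_hats = np.einsum('tn,nij->tij', S, Z)      # H_hat = sum_i s_i Z_i, one per trial
dets = np.linalg.det(H_hats)
invs = np.linalg.inv(H_hats)

exact = np.linalg.inv(H)
detavg = np.einsum('t,tij->ij', dets, invs) / dets.sum()  # the ratio in Theorem 1
naive = invs.mean(axis=0)                    # plain average of inverses (biased)

err_det = np.linalg.norm(detavg - exact) / np.linalg.norm(exact)
err_naive = np.linalg.norm(naive - exact) / np.linalg.norm(exact)
print(f"determinant-weighted error: {err_det:.3f}, naive error: {err_naive:.3f}")
```

With these settings the determinant-weighted ratio matches H^{-1} up to Monte Carlo noise, while the uniform average of inverses stabilizes around a visibly biased matrix, which is precisely the inversion bias that determinantal averaging corrects.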
For these and other cases, consider the following estimation of F(H^{-1}), which takes an average of the individual estimates F(Ĥ_t^{-1}), each weighted by the determinant of Ĥ_t, i.e.,

Determinantal Averaging:    F̂_m = ( \sum_{t=1}^m a_t F(Ĥ_t^{-1}) ) / ( \sum_{t=1}^m a_t ),    where a_t = det(Ĥ_t).

By applying the law of large numbers (separately to the numerator and the denominator), Theorem 1 easily implies that if Ĥ_1, ..., Ĥ_m are i.i.d. copies of Ĥ, then this determinantal averaging estimator is asymptotically consistent, i.e., F̂_m → F(H^{-1}) almost surely. This determinantal averaging estimator is particularly useful when problem constraints do not allow us to compute F(((1/m) \sum_t Ĥ_t)^{-1}), e.g., when the matrices are distributed and not easily combined.

To establish finite sample convergence guarantees for estimators obtained via determinantal averaging, we establish the following matrix concentration result. We state it separately since it is technically interesting and since its proof requires novel bounds for the higher moments of the determinant of a random matrix, which are likely to be of independent interest. Below and throughout the paper, C denotes an absolute constant and "⪯" is the Löwner order on positive semi-definite (psd) matrices.

Theorem 2 Let Ĥ = (1/k) \sum_{i=1}^n b_i Z_i + B and H = E[Ĥ], where B is a positive definite d × d matrix and the b_i are i.i.d. Bernoulli(k/n). Moreover, assume that all Z_i are psd, d × d and rank-1. If k ≥ C μ d² η^{-2} log³(d/δ) for η ∈ (0, 1) and μ = (1/d) max_i ||H^{-1/2} Z_i H^{-1/2}||, then

    (1 − η/√m) · H^{-1}  ⪯  ( \sum_{t=1}^m a_t Ĥ_t^{-1} ) / ( \sum_{t=1}^m a_t )  ⪯  (1 + η/√m) · H^{-1}    with probability 1 − δ,

where Ĥ_1, ..., Ĥ_m are i.i.d. copies of Ĥ and a_t = det(Ĥ_t).

1.1 Distributed Newton's method

To illustrate how determinantal averaging can be useful in the context of distributed optimization, consider the task of batch minimization of a convex loss over vectors w ∈ R^d, defined as follows:

    L(w) := (1/n) \sum_{i=1}^n ℓ_i(w^T x_i) + (λ/2) ||w||²,    (1)

where λ > 0, and the ℓ_i are convex, twice differentiable and smooth. Given a vector w, Newton's method dictates that the correct way to move towards the optimum is to perform an update w̃ = w − p, with p = ∇^{-2}L(w) ∇L(w), where ∇^{-2}L(w) = (∇²L(w))^{-1} denotes the inverse Hessian of L at w.¹ Here, the Hessian and gradient are:

    ∇²L(w) = (1/n) \sum_i ℓ''_i(w^T x_i) x_i x_i^T + λI,    and    ∇L(w) = (1/n) \sum_i ℓ'_i(w^T x_i) x_i + λw.

For our distributed Newton application, we study a distributed computational model, where a single machine has access to a subsampled version of L with sample size parameter k ≪ n:

    L̂(w) := (1/k) \sum_{i=1}^n b_i ℓ_i(w^T x_i) + (λ/2) ||w||²,    where    b_i ∼ Bernoulli(k/n).    (2)

¹Clearly, one would not actually compute the inverse of the Hessian explicitly [XRKM17, YXRKM18]. We describe it this way for simplicity. Our results hold whether or not the inverse operator is computed explicitly.

Note that L̂ accesses on average k loss components ℓ_i (k is the expected local sample size), and moreover, E[L̂(w)] = L(w) for any w. The goal is to compute local estimates of the Newton step p in a communication-efficient manner (i.e., by only sending O(d) parameters from/to a single machine), and then combine them into a better global estimate. The gradient has size O(d), so it can be computed exactly within this communication budget (e.g., via map-reduce); however, the Hessian has to be approximated locally by each machine.
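To make this setup concrete, here is a small simulation (ours, not from the paper; the logistic loss, data, and the sizes n, d, k, m, λ are arbitrary illustrative choices). Each of m machines forms the Hessian of its own subsampled loss (2), solves for a local Newton step against the exact global gradient, and the local steps are then combined either uniformly or with determinant weights:

```python
import numpy as np

# Toy distributed Newton step estimation for the regularized logistic loss
# l_i(u) = log(1 + exp(-y_i u)); all sizes here are arbitrary illustrative choices.
rng = np.random.default_rng(1)
n, d, k, m, lam = 500, 5, 25, 6000, 0.1
X = rng.standard_normal((n, d))
y = rng.choice([-1.0, 1.0], size=n)
w = 0.1 * rng.standard_normal(d)

sig = 1.0 / (1.0 + np.exp(-y * (X @ w)))     # sigma(y_i w.x_i)
lpp = sig * (1.0 - sig)                      # l_i''(w.x_i)
grad = -(X * ((1.0 - sig) * y)[:, None]).mean(axis=0) + lam * w
H = (X * lpp[:, None]).T @ X / n + lam * np.eye(d)
p = np.linalg.solve(H, grad)                 # exact Newton step p

steps = np.empty((m, d))
dets = np.empty(m)
for t in range(m):                           # one iteration = one machine
    b = rng.random(n) < k / n                # Bernoulli(k/n) subsample, as in (2)
    Ht = (X[b] * lpp[b][:, None]).T @ X[b] / k + lam * np.eye(d)
    steps[t] = np.linalg.solve(Ht, grad)     # local Newton step estimate
    dets[t] = np.linalg.det(Ht)              # local determinantal weight

uniform = steps.mean(axis=0)
detavg = (dets[:, None] * steps).sum(axis=0) / dets.sum()
err_u = np.linalg.norm(uniform - p)
err_d = np.linalg.norm(detavg - p)
print(f"uniform error: {err_u:.4f}, determinantal error: {err_d:.4f}")
```

With k held fixed, the uniform average plateaus at the level of the inversion bias no matter how large m gets, whereas the determinant-weighted combination keeps improving as m grows.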
Note that other computational models can be considered, such as those where the global gradient is not computed (and local gradients are used instead).

Under the constraints described above, the most natural strategy is to use directly the Hessian of the locally subsampled loss L̂ (see, e.g., GIANT [WRKXM18]), resulting in the approximate Newton step p̂ = ∇^{-2}L̂(w) ∇L(w). Suppose that we independently construct m i.i.d. copies of this estimate: p̂_1, ..., p̂_m (here, m is the number of machines). Then, for sufficiently large m, taking a simple average of the estimates will stop converging to p because of the inversion bias: (1/m) \sum_{t=1}^m p̂_t → E[p̂] ≠ p. Figure 1 shows this by plotting the estimation error (in Euclidean distance) of the averaged Newton step estimators, when the weights are uniform and determinantal (for more details and plots, see Appendix C).

Figure 1: Newton step estimation error versus number of machines, averaged over 100 runs (shading is standard error) for a libsvm dataset [CL11]. More plots in Appendix C.

The only way to reduce the estimation error beyond a certain point is to increase the local sample size k (thereby reducing the inversion bias), which raises the computational cost per machine. Determinantal averaging corrects the inversion bias, so that the estimation error can always be decreased by adding more machines without increasing the local sample size. From the preceding discussion we can easily show that determinantal averaging leads to an asymptotically consistent estimator. This is a corollary of Theorem 1, as proven in Section 2.

Corollary 3 Let {L̂_t}_{t=1}^∞ be i.i.d. samples of (2) and define a_t = det(∇²L̂_t(w)). Then:

    ( \sum_{t=1}^m a_t p̂_t ) / ( \sum_{t=1}^m a_t )  →  p    almost surely as m → ∞,

where p̂_t = ∇^{-2}L̂_t(w) ∇L(w) and p = ∇^{-2}L(w) ∇L(w).

The (unnormalized) determinantal weights can be computed locally in the same time as it takes to compute the Newton estimates, so they do not add to the overall cost of the procedure. While this result is only an asymptotic statement, it holds with virtually no assumptions on the loss function (other than twice-differentiability) or on the expected local sample size k. However, with some additional assumptions we will now establish a convergence guarantee with a finite number of machines m, by bounding the estimation error for the determinantal averaging estimator of the Newton step.

In the next result, we use the Mahalanobis distance, denoted ||v||_M = √(v^T M v), to measure the error of the Newton step estimate (i.e., the deviation from the optimum step p), with M chosen as the Hessian of L. This choice is motivated by the standard convergence analysis of Newton's method, discussed next. This is a corollary of Theorem 2, as explained in Section 3.

Corollary 4 For any δ, η ∈ (0, 1), if the expected local sample size satisfies k ≥ C η^{-2} μ d² log³(d/δ), then

    || ( \sum_{t=1}^m a_t p̂_t ) / ( \sum_{t=1}^m a_t ) − p ||_{∇²L(w)}  ≤  (η/√m) · ||p||_{∇²L(w)}    with probability 1 − δ,

where μ = (1/d) max_i ℓ''_i(w^T x_i) ||x_i||²_{∇^{-2}L(w)}, and a_t, p̂_t and p are defined as in Corollary 3.

We next establish how this error bound impacts the convergence guarantees offered by Newton's method. Note that under our assumptions L is strongly convex, so there is a unique minimizer w* = argmin_w L(w). We ask how the distance from the optimum, ||w − w*||, changes after we make an update w̃ = w − p̂. For this, we have to assume that the Hessian matrix is L-Lipschitz as a function of w. After this standard assumption, a classical analysis of Newton's method reveals that Corollary 4 implies the following Corollary 6 (proof in Appendix B).

Assumption 5 The Hessian is L-Lipschitz: ||∇²L(w) − ∇²L(w̃)|| ≤ L ||w − w̃|| for any w, w̃ ∈ R^d.

Corollary 6 For any δ, η ∈ (0, 1), if the expected local sample size satisfies k ≥ C η^{-2} μ d² log³(d/δ), then under Assumption 5 it holds with probability at least 1 − δ that

    ||w̃ − w*|| ≤ max{ (η/√m) √κ ||w − w*|| ,  (2L/λ_min) ||w − w*||² }    for    w̃ = w − ( \sum_{t=1}^m a_t p̂_t ) / ( \sum_{t=1}^m a_t ),

where C, μ, a_t and p̂_t are defined as in Corollaries 3 and 4, while κ and λ_min are the condition number and smallest eigenvalue of ∇²L(w), respectively.

The bound is a maximum of a linear and a quadratic convergence term. As m goes to infinity and/or η goes to 0, the approximation coefficient α = η/√m in the linear term disappears and we obtain the exact Newton's method, which exhibits quadratic convergence (at least locally around w*). However, decreasing η means increasing k, and with it the average computational cost per machine. Thus, to preserve the quadratic convergence while maintaining a computational budget per machine, as the optimization progresses we have to increase the number of machines m while keeping k fixed. This is only possible when we correct for the inversion bias, which is done by determinantal averaging.

1.2 Distributed data uncertainty quantification

Here, we consider another example where computing a compressed linear representation of an inverse matrix is important. Let X be an n × d matrix whose rows x_i^T represent samples drawn from a population for statistical analysis. The sample covariance matrix Σ = (1/n) X^T X holds the information about the relations between the features. Assuming that Σ is invertible, the matrix Σ^{-1}, also known as the precision matrix, is often used to establish a degree of confidence we have in the data collection [KBCG13].
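Here is a toy numerical sketch (ours; the data and the sizes n, d, k, m, η are arbitrary illustrative choices, with F chosen as the trace) of the scheme described in this subsection: each batch computes the determinant and the trace of the inverse of a ridge-regularized local covariance, and the trace estimates are then averaged with determinant weights:

```python
import numpy as np

# Toy distributed estimation of tr(Sigma^{-1}) by determinantal averaging;
# all sizes and the ridge level eta are arbitrary illustrative choices.
rng = np.random.default_rng(2)
n, d, k, m, eta = 400, 4, 25, 5000, 0.5
X = rng.standard_normal((n, d))
Sigma = X.T @ X / n
exact = np.trace(np.linalg.inv(Sigma))       # target quantity tr(Sigma^{-1})

ridge = eta / np.sqrt(m)                     # vanishing ridge, ensures invertibility
vals = np.empty(m)
dets = np.empty(m)
for t in range(m):                           # one iteration = one data batch
    b = rng.random(n) < k / n                # Bernoulli(k/n) subsample
    S = X[b].T @ X[b] / k + ridge * np.eye(d)
    vals[t] = np.trace(np.linalg.inv(S))     # local compressed uncertainty info
    dets[t] = np.linalg.det(S)               # determinantal weight

uniform = vals.mean()
detavg = (dets * vals).sum() / dets.sum()
print(f"exact: {exact:.3f}, uniform: {uniform:.3f}, determinantal: {detavg:.3f}")
```

The uniform average of the local traces overestimates tr(Σ^{-1}) because of the inversion bias, while the determinant-weighted average lands close to the exact value, with the residual error shrinking as m grows.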
The diagonal elements of Σ^{-1} are particularly useful, since they hold the variance information of each individual feature. Thus, efficiently estimating either the entire diagonal, its trace, or some subset of its entries is of practical interest [Ste97, WLK+16, BCF09]. We consider the distributed setting where the data is separately stored in batches and each local covariance is modeled as:

    Σ̂ = (1/k) \sum_{i=1}^n b_i x_i x_i^T,    where    b_i ∼ Bernoulli(k/n).

For each of the local covariances Σ̂_1, ..., Σ̂_m, we compute its compressed uncertainty information: F((Σ̂_t + (η/√m) I)^{-1}), where we added a small amount of ridge to ensure invertibility.² Here, F(·) may for example denote the trace or the vector of diagonal entries. We arrive at the following asymptotically consistent estimator for F(Σ^{-1}):

    F̂_m = ( \sum_{t=1}^m a_{t,m} F((Σ̂_t + (η/√m) I)^{-1}) ) / ( \sum_{t=1}^m a_{t,m} ),    where    a_{t,m} = det(Σ̂_t + (η/√m) I).

Note that the ridge term (η/√m) I decays to zero as m goes to infinity, which is why F̂_m → F(Σ^{-1}). Even though this limit holds for any local sample size k, in practice we should choose k sufficiently large so that Σ̂ is well-conditioned. In particular, Theorem 2 implies that if k ≥ 2C η^{-2} μ d² log³(d/δ), where μ = (1/d) max_i ||x_i||²_{Σ^{-1}}, then for F(·) = tr(·) we have |F̂_m − tr(Σ^{-1})| ≤ (η/√m) · tr(Σ^{-1}) with probability 1 − δ.

1.3 Related work

Many works have considered averaging strategies for combining distributed estimates, particularly in the context of statistical learning and optimization. This research is particularly important in federated learning [KBRR16, KBY+16], where data are spread out across a large network of devices with small local storage and severely constrained communication bandwidth.
Using averaging to combine local estimates has been studied in a number of learning settings [MMS+09, MHM10] as well as for first-order stochastic optimization [ZWLS10, AD11]. For example, [ZDW13] examine the effectiveness of simple uniform averaging of empirical risk minimizers and also propose a bootstrapping technique to reduce the bias.

²Since the ridge term vanishes as m goes to infinity, we are still estimating the ridge-free quantity F(Σ^{-1}).

More recently, distributed averaging methods gained considerable attention in the context of second-order optimization, where the Hessian inversion bias is of direct concern. [SSZ14] propose a distributed approximate Newton-type method (DANE) which under certain assumptions exhibits low bias. This was later extended and improved upon by [ZL15, RKR+16]. The GIANT method of [WRKXM18] most closely follows our setup from Section 1.1, providing non-trivial guarantees for uniform averaging of the Newton step estimates p̂_t (except they use with-replacement uniform sampling, whereas we use without-replacement, but that is typically a negligible difference). A related analysis of this approach is provided in the context of ridge regression by [WGM17]. Finally, [ABH17, MLR17, BJKJ17] propose different estimates of the Newton step which exhibit low bias under certain additional assumptions.

Our approach is related to recent developments in determinantal subsampling techniques (e.g., volume sampling), which have been shown to correct the inversion bias in the context of least squares regression [DW17, DWH19]. However, despite recent progress [DW18, DWH18], volume sampling is still far too computationally expensive to be feasible for distributed optimization. 
Indeed, often uniform sampling is the only practical choice in this context.

With the exception of the expensive volume sampling-based methods, all of the approaches discussed above, even under favorable conditions, use biased estimates of the desired solution (e.g., the exact Newton step). Thus, when the number of machines grows sufficiently large, with fixed local sample size, the averaging no longer provides any improvement. This is in contrast to our determinantal averaging, which converges exactly to the desired solution and requires no expensive subsampling. Therefore, it can scale with an arbitrarily large number of machines.

2 Expectation identities for determinants and adjugates

In this section, we prove Theorem 1 and Corollary 3, establishing that determinantal averaging is asymptotically consistent. To achieve this, we establish a lemma involving two expectation identities. For a square n × n matrix A, we use adj(A) to denote its adjugate, defined as an n × n matrix whose (i, j)th entry is (−1)^{i+j} det(A_{−j,−i}), where A_{−j,−i} denotes A without its jth row and ith column. The adjugate matrix provides a key connection between the inverse and the determinant, because for any invertible matrix A we have adj(A) = det(A) A^{-1}. In the following lemma, we will also use a formula called Sylvester's theorem, relating the adjugate and the determinant:

    det(A + u v^T) = det(A) + v^T adj(A) u.

Lemma 7 For A = \sum_i s_i Z_i, where the s_i are independently random and the Z_i are square and rank-1,

    (a) E[det(A)] = det(E[A])    and    (b) E[adj(A)] = adj(E[A]).

Proof We use induction over the number of components in the sum. If there is only one component, i.e., A = sZ, then det(A) = 0 a.s. unless Z is 1 × 1, in which case (a) is trivial, and (b) follows similarly. Now, suppose we have shown the hypothesis when the number of components is n, and let A = \sum_{i=1}^{n+1} s_i Z_i. Setting Z_{n+1} = u v^T, we have:

    E[det(A)] = E[ det( \sum_{i=1}^n s_i Z_i + s_{n+1} u v^T ) ]
    (Sylvester's theorem)     = E[ det( \sum_{i=1}^n s_i Z_i ) + s_{n+1} v^T adj( \sum_{i=1}^n s_i Z_i ) u ]
    (inductive hypothesis)    = det( E[ \sum_{i=1}^n s_i Z_i ] ) + E[s_{n+1}] v^T adj( E[ \sum_{i=1}^n s_i Z_i ] ) u
    (Sylvester's theorem)     = det( E[ \sum_{i=1}^n s_i Z_i ] + E[s_{n+1}] u v^T ) = det(E[A]),

showing (a). Finally, (b) follows by applying (a) to each entry adj(A)_{ij} = (−1)^{i+j} det(A_{−j,−i}).

Similar expectation identities for the determinant have been given before [vdV65, DWH19, Der19]. None of them, however, apply to the random matrix A as defined in Lemma 7, or even to the special case we use for analyzing distributed Newton's method. Also, our proof method is quite different from, and somewhat simpler than, those used in prior work. To our knowledge, the extension of determinantal expectation to the adjugate matrix has not previously been pointed out.

We next prove Theorem 1 and Corollary 3 as consequences of Lemma 7.

Proof of Theorem 1 When A is invertible, its adjugate is given by adj(A) = det(A) A^{-1}, so the lemma implies that

    E[det(A)] E[A]^{-1} = det(E[A]) E[A]^{-1} = adj(E[A]) = E[adj(A)] = E[det(A) A^{-1}],

from which Theorem 1 follows immediately.

Proof of Corollary 3 The subsampled Hessian matrix used in Corollary 3 can be written as:

    ∇²L̂(w) = (1/k) \sum_i b_i ℓ''_i(w^T x_i) x_i x_i^T + λ \sum_{i=1}^d e_i e_i^T = Ĥ,

so, letting Ĥ_t = ∇²L̂_t(w), Corollary 3 follows from Theorem 1 and the law of large numbers:

    ( \sum_{t=1}^m a_t p̂_t ) / ( \sum_{t=1}^m a_t ) = ( (1/m) \sum_{t=1}^m det(Ĥ_t) Ĥ_t^{-1} ∇L(w) ) / ( (1/m) \sum_{t=1}^m det(Ĥ_t) )  →  ( E[det(Ĥ) Ĥ^{-1}] / E[det(Ĥ)] ) ∇L(w) = ∇^{-2}L(w) ∇L(w)    as m → ∞,

which concludes the proof.

3 Finite-sample convergence analysis

In this section, we prove Theorem 2 and Corollary 4, establishing that determinantal averaging exhibits a 1/√m convergence rate, where m is the number of sampled matrices (or 
the number of machines in distributed Newton's method). For this, we need a tool from random matrix theory.

Lemma 8 (Matrix Bernstein [Tro12]) Consider a finite sequence {X_i} of independent, random, self-adjoint matrices with dimension d such that E[X_i] = 0 and λ_max(X_i) ≤ R almost surely. If the sequence satisfies \sum_i E[X_i²] ⪯ σ², then the following inequality holds for all x ≥ 0:

    Pr( λ_max( \sum_i X_i ) ≥ x )  ≤  d e^{−x²/(4σ²)}  for x ≤ σ²/R;    d e^{−x/(4R)}  for x ≥ σ²/R.

The key component of our analysis is bounding the moments of the determinant and adjugate of a certain class of random matrices. This has to be done carefully, because higher moments of the determinant grow more rapidly than, e.g., those of a sub-gaussian random variable. For this result, we do not require that the individual components Z_i of matrix A be rank-1, but we impose several additional boundedness assumptions. In the proof below we apply the concentration inequality of Lemma 8 twice: first to the random matrix A itself, and then also to its trace, which allows finer control over the determinant.

Lemma 9 Let A = (1/γ) \sum_i b_i Z_i + B, where the b_i ∼ Bernoulli(γ) are independent, whereas the Z_i and B are d × d psd matrices such that ||Z_i|| ≤ ε for all i and E[A] = I. If γ ≥ 8εd η^{-2}(p + ln d) for 0 < η ≤ 0.25 and p ≥ 2, then

    (a) E[ |det(A) − 1|^p ]^{1/p} ≤ 5η    and    (b) E[ ||adj(A) − I||^p ]^{1/p} ≤ 9η.

Proof We start by proving (a). Let X = det(A) − 1 and denote by 1_{[a,b]} the indicator variable of the event that X ∈ [a, b]. Since det(A) ≥ 0, we have:

    E[|X|^p] = E[(−X)^p · 1_{[−1,0]}] + E[X^p · 1_{[0,∞]}]
             ≤ η^p + \int_η^1 p x^{p−1} Pr(−X ≥ x) dx + \int_0^∞ p x^{p−1} Pr(X ≥ x) dx.    (3)

Thus it suffices to bound the two integrals. We will start with the first one. 
Let Xi = (1 bi\nuse the matrix Bernstein inequality to control the extreme eigenvalues of the matrix I A =Pi Xi\n(note that matrix B cancels out because I = E[A] = Pi Zi + B). To do this, observe that\n )2\u21e4 = 1\nkXik \uf8ff \u270f/ and, moreover, E\u21e5(1 bi\ni \uf8ff\n \u00b7Xi\ni \uf8ff\ni ] =Xi\n )2\u21e4Z2\nE\u21e5(1 bi\n, 1\u21e4:\nThus, applying Lemma 8 we conclude that for any z 2\u21e5 \u2318p2d\nPr\u21e3kI Ak z\u2318 \uf8ff 2d e z2\n\u23182 (p+ln d) \uf8ff 2ez2 2dp\n4\u270f \uf8ff 2eln(d)z2 2d\n(4)\nConditioning on the high-probability event given by (4) leads to the lower bound det(A) (1 z)d\nwhich is very loose. To improve on it, we use the following inequality, where 1, . . . , d denote the\neigenvalues of I A:\n\n \u00b7Xi\n\nZi \uf8ff\n\n\u23182 .\n\nZ2\n\n\u270f\n\n\n\u270f\n\n.\n\ndet(A)etr(IA) =Yi\n\n(1 i)ei Yi\n\n(1 i)(1 + i) =Yi\n\n(1 2\ni ).\n\n\u270f\n\n\u23182\n\n\u23182\n\n(5)\n\ney 2dp\n\n \uf8ff \u23182\n\nfor y \uf8ff d;\nfor y d.\n\nThus we obtain a tighter bound when det(A) is multiplied by etr(IA), and now it suf\ufb01ces to\nupper bound the latter. This is a simple application of the scalar Bernstein\u2019s inequality (Lemma\n8 with d = 1) for the random variables Xi = tr(Xi) \uf8ff \u270f/ \uf8ff \u23182\ni ] \uf8ff\n trPi Zi \uf8ff \u270fd\n\n8dp, which satisfyPi E[X 2\n\n8p . Thus the scalar Bernstein\u2019s inequality states that\n\nmaxn Prtr(A I) y, Prtr(A I) \uf8ff yo \uf8ff (ey2 2p\n2 and z =p x\n\n2d and taking a union bound over the appropriate high-probability events\n\nSetting y = x\ngiven by (4) and (5), we conclude that for any x 2 [\u2318, 1]:\ndet(A) (1 z2)d exptr(A I) 1 x\n2e x\nThus, for X = det(A)1 and x 2 [\u2318, 1] we obtain that Pr(X x) \uf8ff 3ex2 p\np \u00b7Z 1\nZ 1\npxp1 Pr X xdx \uf8ff 3pZ 1\n2\u23182 dx \uf8ff 3pq\u21e1 2\u23182\n2 = 3p2\u21e1p \u00b7 \u2318p.\np p p1\n\n2 1 x, with prob. 1 3ex2 p\n1|x|p1 ex2 p\np2\u21e1\u23182/p\nWe now move on to bounding the remaining integral from (3). 
Since determinant is the product of\neigenvalues, we have det(A) = det(I + A I) \uf8ff etr(AI), so we can use the Bernstein bound of\n(5) w.r.t. A I. It follows that:\npxp1 Pr(X x)dx \uf8ffZ 1\nZ 1\n\uf8ffZ ed1\n\uf8ffZ e1\n\npxp1 Pretr(AI) 1 + xdx\n\u23182 dx + Z 1\n\u23182 dx + Z 1\n\n\uf8ff 3p2\u21e1\u23182p \u00b7 \u23182\n\npxp1e ln2(1+x) 2p\n\npxp1e ln2(1+x) 2p\n\npxp1e ln(1+x) 2dp\n\npxp1e ln(1+x) 2p\n\nbecause ln2(1 + x) ln(1 + x) for x e 1. Note that ln2(1 + x) x2/4 for x 2 [0, e 1], so\n\nxp1 ex2 p\n\n2\u23182 , and consequently,\n\n\u23182 dx,\n\n\u23182 dx\n\ned1\n\ne1\n\n2\u23182 .\n\ndx\n\n2\u23182\n\n\u2318\n\n\u2318\n\n0\n\n0\n\n0\n\n0\n\n\u23182 dx \uf8ffZ e1\n\n0\n\nZ e1\n\n0\n\npxp1e ln2(1+x) 2p\nIn the interval x 2 [e 1,1], we have:\n\u23182 dx = pZ 1\nZ 1\n\uf8ff pZ 1\n1 1\n1+x p\n\npxp1e ln(1+x) 2p\n\ne1\n\ne1\n\ne(p1) ln(x)ln(1+x) 2p\n\npxp1ex2 p\n\n2\u23182 dx \uf8ffp2\u21e1p \u00b7 \u2318p.\n\ne ln(1+x) p\n\n\u23182 dx \uf8ff pZ 1\n\u23182 1 \uf8ff p \u00b7\u21e3 1\n\u23182\u2318p\n2 1\n2 p\n\u23182 1 1\n\ne1\n\np\n\np\n\n\u23182 dx\n\n\uf8ff p \u00b7\u23182p,\n\n\u23182 dx =\n\n7\n\n\f1\n\n1\n\n\u23182 \uf8ff \u23182. Noting that (1 + 4p2\u21e1p + p)\np \uf8ff 5 for any\nwhere the last inequality follows because ( 1\n2 )\np 2 concludes the proof of (a). The proof of (b), given in Appendix A, follows similarly as above\nbecause for any positive de\ufb01nite A we have det(A)\nmax(A) \u00b7 I adj(A) det(A)\nHaving obtained bounds on the higher moments, we can now convert them to convergence with high\nprobability for the average of determinants and the adjugates. Since determinant is a scalar variable,\nthis follows by using standard arguments. On the other hand, for the adjugate matrix we require a\nsomewhat less standard matrix extension of the Khintchine/Rosenthal inequalities (see Appendix A).\nCorollary 10 There is C > 0 s.t. 
for A as in Lemma 9 with all Zi rank-1 and C\u270fd\u2318 2 log3 d\n ,\npm\u25c6 \uf8ff ,\n\npm\u25c6 \uf8ff \n\n\u00b7 I.\n\n1\nm\n\n1\nm\n\nand\n\nmin\n\n\u2318\n\n\u2318\n\n(a) Pr\u2713\n\nmXt=1\n\ndet(At) 1 \n\nwhere A1, . . . , Am are independent copies of A.\n\n(b) Pr\u2713\n\nmXt=1\n\nadj(At) I \n\nWe are ready to show the convergence rate of determinantal averaging, which follows essentially by\nupper/lower bounding the enumerator and denominator separately, using Corollary 10.\n\n1\n\n1\n\nZ\n\n=\n\ndef\n=\n\ndet(H)\n\n2 . Note that\n\n2bHtH 1\n\nn H 1\nn d\u23182 log3 d\n\nProof of Theorem 2 We will apply Corollary 10 to the matrices At = H 1\nkPi bieZi + H1, where eacheZi = 1\n2 satis\ufb01es keZik \uf8ff \u00b5 \u00b7 d/n. Therefore,\n2 ZiH 1\nAt = n\nn C \u00b5d\n , with probability 1 the following average\nCorollary 10 guarantees that for k\nof determinants is concentrated around 1:\nmXt\nmXt\ndet(bHt)\ndetH 1\n2 2 [1 \u21b5, 1 + \u21b5]\n2bHtH 1\nPm\nt=1 det(At) I \uf8ff\nadjAt Z I /Z\nmXt\nPm\nadjAt I +\n1 \u21b5\nmXt\n(Corollary 10a) \uf8ff\n(Corollary 10b) \uf8ff\n\nalong with a corresponding bound for the adjugate matrices. We obtain that with probability 1 2,\n\n\u21b5\n1 \u21b5\nIt remains to multiply the above expressions by H 1\n\n2 from both sides to recover the desired estimator:\n\n\u21b5\n1 \u21b5\n\n\u21b5\n1 \u21b5\n\nt=1 adj(At)\n\n\u2318\npm\n\nfor \u21b5 =\n\n+\n\n1\n\n1\n\n1\n\n.\n\n,\n\nt\n\nPm\nt=1 det(bHt)bH1\nPm\nt=1 det(bHt)\n\n2 Pm\nPm\n\n= H 1\n\nt=1 adj(At)\nt=1 det(At)\n\nH 1\n\n2 H 1\n\n21 + 2\u21b5\n\n1\u21b5 I H 1\n\n2 =1 + 2\u21b5\n\n1\u21b5H1,\n\nand the lower bound follows identically. Appropriately adjusting the constants concludes the proof.\nAs an application of the above result, we show how this allows us to bound the estimation error in\ndistributed Newton\u2019s method, when using determinantal averaging.\n\nProof of Corollary 4 Follows from Theorem 2 by setting Zi = `00i (w>xi)xix>i and B = I. 
Note\nthat the assumptions imply that kZik \uf8ff \u00b5, so invoking the theorem and denoting g as rL(w), with\nprobability 1 we have\nPm\nt=1 at pH\n=H\nt=1 atbpt\nPm\n\uf8ffH\n(Theorem 2) \uf8ffH\n\n2 g\n2\u2713Pm\nt=1 det(bHt) H1\u25c6H\nt=1 det(bHt)bH1\nPm\n2 \u00b7H 1\n2\u2713Pm\nt=1 det(bHt) H1\u25c6H\nt=1 det(bHt)bH1\n2 g\nPm\n2 \u00b7k pkH = \u2318pm \u00b7k pkH,\n\nwhich completes the proof of the corollary.\n\n2 \u2318pm H1H\n\n2 H 1\n\nt\n\nt\n\n1\n\n1\n\n1\n\n1\n\n1\n\n1\n\n8\n\n\f4 Conclusions and future directions\n\nWe proposed a novel method for correcting the inversion bias in distributed Newton\u2019s method.\nOur approach, called determinantal averaging, can also be applied more broadly to distributed\nestimation of other linear functions of the inverse Hessian or an inverse covariance matrix. We\nshow that estimators produced by determinantal averaging are asymptotically consistent, and we\nprovide bounds on the estimation error by developing new moment bounds on the determinant of a\nrandom matrix.\nFurther empirical evaluation of determinantal averaging, both in the context of distributed optimization\nand other tasks involving inverse estimation, is an important direction for future work. Our preliminary\nexperiments suggest that the bias-correction of determinantal averaging comes at a price of additional\nvariance in the estimators. This leads to a natural open problem: \ufb01nd the optimal balance between bias\nand variance in weighted averaging for distributed inverse estimation. Finally, note that we construct\nour Newton estimates using local Hessian and global gradient. In some settings it is more practical\nto use local approximations for both the Hessian and the gradient. Whether or not determinantal\naveraging corrects the bias in this case remains open.\n\nAcknowledgements\nMWM would like to acknowledge ARO, DARPA, NSF and ONR for providing partial support of this\nwork. 
Also, MWM and MD thank the NSF for funding via the NSF TRIPODS program. Part of this work was done while MD and MWM were visiting the Simons Institute for the Theory of Computing.
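To make the estimator concrete, the following minimal numpy sketch (ours, not from the paper) contrasts uniform averaging of local Newton steps with determinantal averaging on synthetic data: rank-one Hessian contributions $z_i z_i^\top$ are Bernoulli-subsampled into $m$ local Hessian estimates, and the local steps $\widehat{H}_t^{-1}g$ are combined either uniformly or weighted by $\det(\widehat{H}_t)$. All dimensions, sample sizes, and variable names are illustrative assumptions.

```python
import numpy as np

# Synthetic comparison of uniform vs. determinantal averaging of local
# Newton steps (illustrative sketch; sizes chosen for a quick demo).
rng = np.random.default_rng(0)
d, n, k, m = 3, 200, 20, 2000

Z = rng.normal(size=(n, d))        # rows z_i give rank-1 terms z_i z_i^T
H = Z.T @ Z                        # global Hessian H = sum_i z_i z_i^T
g = rng.normal(size=d)             # global gradient
p = np.linalg.solve(H, g)          # exact Newton step p = H^{-1} g

gamma = k / n                      # Bernoulli subsampling rate
num, den = np.zeros(d), 0.0        # sums of det(H_t) * H_t^{-1} g and det(H_t)
unif, cnt = np.zeros(d), 0         # plain average of H_t^{-1} g
for _ in range(m):
    mask = rng.random(n) < gamma
    Ht = Z[mask].T @ Z[mask] / gamma   # local Hessian estimate, E[Ht] = H
    w = np.linalg.det(Ht)
    if w <= 0:                     # skip the (rare) singular local estimates
        continue
    pt = np.linalg.solve(Ht, g)    # local Newton step estimate
    num += w * pt
    den += w
    unif += pt
    cnt += 1

p_det, p_unif = num / den, unif / cnt
err_det = np.linalg.norm(p_det - p) / np.linalg.norm(p)
err_unif = np.linalg.norm(p_unif - p) / np.linalg.norm(p)
print(f"uniform error: {err_unif:.3f}   determinantal error: {err_det:.3f}")
```

As the analysis above predicts, the uniform average retains a constant inversion bias as $m$ grows, while the determinantal weights shrink the error toward zero at the $1/\sqrt{m}$ rate.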