{"title": "Solving Non-smooth Constrained Programs with Lower Complexity than $\\mathcal{O}(1/\\varepsilon)$: A Primal-Dual Homotopy Smoothing Approach", "book": "Advances in Neural Information Processing Systems", "page_first": 3995, "page_last": 4005, "abstract": "We propose a new primal-dual homotopy smoothing algorithm for a linearly constrained convex program, where neither the primal nor the dual function has to be smooth or strongly convex. The best known iteration complexity solving such a non-smooth problem is $\\mathcal{O}(\\varepsilon^{-1})$. In this paper, \nwe show that by leveraging a local error bound condition on the dual function, the proposed algorithm can achieve a better primal convergence time of $\\mathcal{O}\\left(\\varepsilon^{-2/(2+\\beta)}\\log_2(\\varepsilon^{-1})\\right)$, where $\\beta\\in(0,1]$ is a local error bound parameter. \nAs an example application, we show that the distributed geometric median problem, which can be formulated as a constrained convex program, has its dual function non-smooth but satisfying the aforementioned local error bound condition with $\\beta=1/2$, therefore enjoying a convergence time of $\\mathcal{O}\\left(\\varepsilon^{-4/5}\\log_2(\\varepsilon^{-1})\\right)$. This result improves upon the $\\mathcal{O}(\\varepsilon^{-1})$ convergence time bound achieved by existing distributed optimization algorithms. Simulation experiments also demonstrate the performance of our proposed algorithm.", "full_text": "Solving Non-smooth Constrained Programs with Lower Complexity than $\\mathcal{O}(1/\\varepsilon)$: A Primal-Dual Homotopy Smoothing Approach\n\nXiaohan Wei\n\nDepartment of Electrical Engineering\n\nUniversity of Southern California\nLos Angeles, CA, USA, 90089\n\nxiaohanw@usc.edu\n\nHao Yu\n\nAlibaba Group (U.S.) Inc.\nBellevue, WA, USA, 98004\nhao.yu@alibaba-inc.com\n\nQing Ling\n\nSchool of Data and Computer Science\n\nSun Yat-Sen University\n\nGuangzhou, China, 510006\n\nlingqing556@mail.sysu.edu.cn\n\nMichael J. 
Neely\n\nDepartment of Electrical Engineering\n\nUniversity of Southern California\nLos Angeles, CA, USA, 90089\n\nmikejneely@gmail.com\n\nAbstract\n\nWe propose a new primal-dual homotopy smoothing algorithm for a linearly constrained convex program, where neither the primal nor the dual function has to be smooth or strongly convex. The best known iteration complexity for solving such a non-smooth problem is $\\mathcal{O}(\\varepsilon^{-1})$. In this paper, we show that by leveraging a local error bound condition on the dual function, the proposed algorithm can achieve a better primal convergence time of $\\mathcal{O}\\left(\\varepsilon^{-2/(2+\\beta)}\\log_2(\\varepsilon^{-1})\\right)$, where $\\beta\\in(0,1]$ is a local error bound parameter. As an example application of the general algorithm, we show that the distributed geometric median problem, which can be formulated as a constrained convex program, has a dual function that is non-smooth but satisfies the aforementioned local error bound condition with $\\beta=1/2$, and therefore enjoys a convergence time of $\\mathcal{O}\\left(\\varepsilon^{-4/5}\\log_2(\\varepsilon^{-1})\\right)$. This result improves upon the $\\mathcal{O}(\\varepsilon^{-1})$ convergence time bound achieved by existing distributed optimization algorithms. Simulation experiments also demonstrate the performance of our proposed algorithm.\n\n1 Introduction\n\nWe consider the following linearly constrained convex optimization problem:\n\n$$\\min\\ f(x) \\qquad (1)$$\n$$\\text{s.t.}\\ Ax - b = 0,\\ x \\in \\mathcal{X}, \\qquad (2)$$\n\nwhere $\\mathcal{X} \\subseteq \\mathbb{R}^d$ is a compact convex set, $f : \\mathbb{R}^d \\to \\mathbb{R}$ is a convex function, $A \\in \\mathbb{R}^{N\\times d}$, and $b \\in \\mathbb{R}^N$. Such an optimization problem has been studied in numerous works under various application scenarios such as machine learning (Yurtsever et al. (2015)), signal processing (Ling and Tian (2010)) and communication networks (Yu and Neely (2017a)). 
The goal of this work is to design new algorithms for (1-2) achieving an $\\varepsilon$ approximation with better convergence time than $\\mathcal{O}(1/\\varepsilon)$.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n1.1 Optimization algorithms related to constrained convex program\n\nSince enforcing the constraint $Ax - b = 0$ generally requires a significant amount of computation in large scale systems, the majority of the scalable algorithms solving problem (1-2) are of primal-dual type. Generally, the efficiency of these algorithms depends on two key properties of the dual function of (1-2), namely, the Lipschitz gradient and strong convexity. When the dual function of (1-2) is smooth, primal-dual type algorithms with Nesterov\u2019s acceleration on the dual of (1-2) can achieve a convergence time of $\\mathcal{O}(1/\\sqrt{\\varepsilon})$ (e.g. Yurtsever et al. (2015); Tran-Dinh et al. (2018)).$^1$ When the dual function has both the Lipschitz continuous gradient and the strongly convex property, algorithms such as dual subgradient and ADMM enjoy a linear convergence $\\mathcal{O}(\\log(1/\\varepsilon))$ (e.g. Yu and Neely (2018); Deng and Yin (2016)). However, when neither of the properties is assumed, the basic dual-subgradient type algorithm gives a relatively worse $\\mathcal{O}(1/\\varepsilon^2)$ convergence time (e.g. Wei et al. (2015); Wei and Neely (2018)), while its improved variants yield a convergence time of $\\mathcal{O}(1/\\varepsilon)$ (e.g. Lan and Monteiro (2013); Deng et al. (2017); Yu and Neely (2017b); Yurtsever et al. (2018); Gidel et al. (2018)).\n\nMore recently, several works seek to achieve a better convergence time than $\\mathcal{O}(1/\\varepsilon)$ under weaker assumptions than Lipschitz gradient and strong convexity of the dual function. Specifically, building upon the recent progress on the gradient type methods for optimization with H\u00f6lder continuous gradient (e.g. Nesterov (2015a,b)), the work Yurtsever et al. 
(2015) develops a primal-dual gradient method solving (1-2), which achieves a convergence time of $\\mathcal{O}(1/\\varepsilon^{\\frac{1+\\nu}{1+3\\nu}})$, where $\\nu$ is the modulus of H\u00f6lder continuity on the gradient of the dual function of the formulation (1-2).$^2$ On the other hand, the work Yu and Neely (2018) shows that when the dual function has Lipschitz continuous gradient and satisfies a locally quadratic property (i.e. a local error bound with $\\beta = 1/2$, see Definition 2.1 for details), which is weaker than strong convexity, one can still obtain a linear convergence with a dual subgradient algorithm. A similar result has also been proved for ADMM in Han et al. (2015).\n\nIn the current work, we aim to address the following question: Can one design a scalable algorithm with lower complexity than $\\mathcal{O}(1/\\varepsilon)$ solving (1-2), when both the primal and the dual functions are possibly non-smooth? More specifically, we look at a class of problems with dual functions satisfying only a local error bound, and show that indeed one is able to obtain a faster primal convergence via a primal-dual homotopy smoothing method under a local error bound condition on the dual function.\n\nHomotopy methods were first developed in the statistics literature in relation to the model selection problem for LASSO, where, instead of computing a single solution for LASSO, one computes a complete solution path by varying the regularization parameter from large to small (e.g. Osborne et al. (2000); Xiao and Zhang (2013)).$^3$ On the other hand, the smoothing technique for minimizing a non-smooth convex function of the following form was first considered in Nesterov (2005):\n\n$$\\Psi(x) = g(x) + h(x),\\ x \\in \\Omega_1, \\qquad (3)$$\n\nwhere $\\Omega_1 \\subseteq \\mathbb{R}^d$ is a closed convex set, $h(x)$ is a convex smooth function, and $g(x)$ can be explicitly written as\n\n$$g(x) = \\max_{u\\in\\Omega_2}\\ \\langle Ax, u\\rangle - \\phi(u), \\qquad (4)$$\n\nwhere for any two vectors $a, b \\in \\mathbb{R}^d$, $\\langle a, b\\rangle = a^T b$, $\\Omega_2 \\subseteq \\mathbb{R}^d$ is a closed convex set, and $\\phi(u)$ is a convex function. By adding a strongly concave proximal function of $u$ with a smoothing parameter $\\mu > 0$ into the definition of $g(x)$, one can obtain a smoothed approximation of $\\Psi(x)$ with smooth modulus $\\mu$. Then, Nesterov (2005) employs the accelerated gradient method on the smoothed approximation (which delivers a $\\mathcal{O}(1/\\sqrt{\\varepsilon})$ convergence time for the approximation), and sets the parameter to be $\\mu = \\mathcal{O}(\\varepsilon)$, which gives an overall convergence time of $\\mathcal{O}(1/\\varepsilon)$. An important follow-up question is whether or not such a smoothing technique can also be applied to solve (1-2) with the same primal convergence time. This question is answered in subsequent works Necoara and Suykens (2008); Li et al. (2016); Tran-Dinh et al. (2018), which show that indeed one can also obtain an $\\mathcal{O}(1/\\varepsilon)$ primal convergence time for the problem (1-2) via smoothing.\n\n$^1$ Our convergence time to achieve within $\\varepsilon$ of optimality is in terms of number of (unconstrained) maximization steps $\\arg\\max_{x\\in\\mathcal{X}}[-\\lambda^T(Ax-b) - f(x) - \\frac{\\mu}{2}\\|x-\\tilde{x}\\|^2]$ where constants $\\lambda, A, \\tilde{x}, \\mu$ are known. This is a standard measure of convergence time for Lagrangian-type algorithms that turn a constrained problem into a sequence of unconstrained problems.\n\n$^2$ The gradient of function $g(\\cdot)$ is H\u00f6lder continuous with modulus $\\nu \\in (0, 1]$ on a set $\\mathcal{X}$ if $\\|\\nabla g(x) - \\nabla g(y)\\| \\leq L_\\nu\\|x-y\\|^\\nu$, $\\forall x, y \\in \\mathcal{X}$, where $\\|\\cdot\\|$ is the vector 2-norm and $L_\\nu$ is a constant depending on $\\nu$.\n\n$^3$ The word \u201chomotopy\u201d, which was adopted in Osborne et al. (2000), refers to the fact that the mapping from regularization parameters to the set of solutions of the LASSO problem is a continuous piece-wise linear function.\n\nCombining the homotopy method with a smoothing technique to solve problems of the form (3) has been considered by a series of works including Yang and Lin (2015), Xu et al. (2016) and Xu et al. (2017). Specifically, the works Yang and Lin (2015) and Xu et al. (2016) consider a multi-stage algorithm which starts from a large smoothing parameter $\\mu$ and then decreases this parameter over time. They show that when the function $\\Psi(x)$ satisfies a local error bound with parameter $\\beta \\in (0, 1]$, such a combination gives an improved convergence time of $\\mathcal{O}(\\log(1/\\varepsilon)/\\varepsilon^{1-\\beta})$ minimizing the unconstrained problem (3). The work Xu et al. (2017) shows that the homotopy method can also be combined with ADMM to achieve a faster convergence solving problems of the form\n\n$$\\min_{x\\in\\Omega_1}\\ f(x) + \\psi(Ax - b),$$\n\nwhere $\\Omega_1$ is a closed convex set, $f, \\psi$ are both convex functions with $f(x) + \\psi(Ax - b)$ satisfying the local error bound, and the proximal operator of $\\psi(\\cdot)$ can be easily computed. However, due to the restrictions on the function $\\psi$ in the paper, it cannot be extended to handle problems of the form (1-2).$^4$\n\nContributions: In the current work, we show a multi-stage homotopy smoothing method enjoys a primal convergence time $\\mathcal{O}\\left(\\varepsilon^{-2/(2+\\beta)}\\log_2(\\varepsilon^{-1})\\right)$ solving (1-2) when the dual function satisfies a local error bound condition with $\\beta \\in (0, 1]$. 
Our convergence time to achieve within $\\varepsilon$ of optimality is in terms of number of (unconstrained) maximization steps $\\arg\\max_{x\\in\\mathcal{X}}[-\\lambda^T(Ax-b) - f(x) - \\frac{\\mu}{2}\\|x-\\tilde{x}\\|^2]$, where constants $\\lambda, A, \\tilde{x}, \\mu$ are known, which is a standard measure of convergence time for Lagrangian-type algorithms that turn a constrained problem into a sequence of unconstrained problems. The algorithm essentially restarts a weighted primal averaging process at each stage using the last Lagrange multiplier computed. This result improves upon the earlier $\\mathcal{O}(1/\\varepsilon)$ result by Necoara and Suykens (2008) and Li et al. (2016), and at the same time extends the scope of homotopy smoothing method to solve a new class of problems involving constraints (1-2). It is worth mentioning that a similar restarted smoothing strategy is proposed in a recent work Tran-Dinh et al. (2018) to solve problems including (1-2), where they show that, empirically, restarting the algorithm from the Lagrange multiplier computed from the last stage improves the convergence time. Here, we give one theoretical justification of such an improvement.\n\n1.2 The distributed geometric median problem\n\nThe geometric median problem, also known as the Fermat-Weber problem, has a long history (e.g. see Weiszfeld and Plastria (2009) for more details). Given a set of $n$ points $b_1, b_2, \\cdots, b_n \\in \\mathbb{R}^d$, we aim to find one point $x^* \\in \\mathbb{R}^d$ so as to minimize the sum of the Euclidean distances, i.e.\n\n$$x^* \\in \\mathrm{argmin}_{x\\in\\mathbb{R}^d}\\ \\sum_{i=1}^n \\|x - b_i\\|, \\qquad (5)$$\n\nwhich is a non-smooth convex optimization problem. It can be shown that the solution to this problem is unique as long as $b_1, b_2, \\cdots, b_n \\in \\mathbb{R}^d$ are not co-linear. Linear convergence time algorithms solving (5) have also been developed in several works (e.g. Xue and Ye (1997), Parrilo and Sturmfels (2003), Cohen et al. (2016)). 
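As a concrete illustration of problem (5), the following is a minimal sketch of the classical Weiszfeld fixed-point iteration. It is a centralized textbook method and not the primal-dual homotopy algorithm proposed in this paper; the function names, the starting point (the coordinate-wise mean) and the early exit at a data point are assumptions made for the example.

```python
import math

def geometric_median_cost(x, points):
    # Objective of (5): sum of Euclidean distances from x to each point.
    return sum(math.dist(x, b) for b in points)

def weiszfeld(points, num_iters=200, tol=1e-12):
    # Classical Weiszfeld iteration (illustrative, centralized).
    dim = len(points[0])
    # Assumed starting point: the coordinate-wise mean of the data.
    x = [sum(b[k] for b in points) / len(points) for k in range(dim)]
    for _ in range(num_iters):
        dists = [math.dist(x, b) for b in points]
        # The update is undefined at a data point; stopping there is a
        # simplification assumed for brevity.
        if min(dists) < tol:
            break
        weights = [1.0 / d for d in dists]
        total = sum(weights)
        x_new = [sum(w * b[k] for w, b in zip(weights, points)) / total
                 for k in range(dim)]
        moved = math.dist(x, x_new)
        x = x_new
        if moved < tol:
            break
    return x

# Four corners of a square: by symmetry the geometric median is the center.
square = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
median = weiszfeld(square)

# A less symmetric point set; the iteration should not increase the cost.
scattered = [(0.0, 0.0), (4.0, 0.0), (0.0, 3.0), (10.0, 10.0)]
mean_point = [3.5, 3.25]
scattered_median = weiszfeld(scattered)
```

Each iterate is a weighted average of the data points with weights inversely proportional to the current distances, which is why the method needs all points in one place and motivates the distributed formulation studied later.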
Our motivation for studying this problem is driven by its recent application in distributed statistical estimation, in which data are assumed to be randomly spread across multiple connected computational agents that produce intermediate estimators, and then these intermediate estimators are aggregated in order to compute some statistics of the whole data set. Arguably one of the most widely used aggregation procedures is computing the geometric median of the local estimators (see, for example, Duchi et al. (2014), Minsker et al. (2014), Minsker and Strawn (2017), Yin et al. (2018)). It can be shown that the geometric median is robust against arbitrary corruptions of local estimators in the sense that the final estimator is stable as long as at least half of the nodes in the system perform as expected.\n\n$^4$ The result in Xu et al. (2017) heavily depends on the assumption that the subgradient of $\\psi(\\cdot)$ is defined everywhere over the set $\\Omega_1$ and uniformly bounded by some constant $\\rho$, which excludes the choice of indicator functions necessary to deal with constraints in the ADMM framework.\n\nContributions: As an example application of our general algorithm, we look at the problem of computing the solution to (5) in a distributed scenario over a network of $n$ agents without any central controller, where each agent holds a local vector $b_i$. Remarkably, we show theoretically that such a problem, when formulated as (1-2), has its dual function non-smooth but locally quadratic. Therefore, applying our proposed primal-dual homotopy smoothing method gives a convergence time of $\\mathcal{O}\\left(\\varepsilon^{-4/5}\\log_2(\\varepsilon^{-1})\\right)$. This result improves upon the performance bounds of the previously known decentralized optimization algorithms (e.g. PG-EXTRA Shi et al. (2015) and decentralized ADMM Shi et al. (2014)), which do not take into account the special structure of the problem and only obtain a convergence time of $\\mathcal{O}(1/\\varepsilon)$. 
Simulation experiments also demonstrate the superior ergodic convergence time of our algorithm compared to other algorithms.\n\n2 Primal-dual Homotopy Smoothing\n\n2.1 Preliminaries\n\nThe Lagrange dual function of (1-2) is defined as follows:$^5$\n\n$$F(\\lambda) := \\max_{x\\in\\mathcal{X}}\\ \\{-\\langle\\lambda, Ax - b\\rangle - f(x)\\}, \\qquad (6)$$\n\nwhere $\\lambda \\in \\mathbb{R}^N$ is the dual variable, $\\mathcal{X}$ is a compact convex set and the minimum of the dual function is $F^* := \\min_{\\lambda\\in\\mathbb{R}^N} F(\\lambda)$. For any closed set $\\mathcal{K} \\subseteq \\mathbb{R}^d$ and $x \\in \\mathbb{R}^d$, define the distance function of $x$ to the set $\\mathcal{K}$ as\n\n$$\\mathrm{dist}(x, \\mathcal{K}) := \\min_{y\\in\\mathcal{K}}\\ \\|x - y\\|,$$\n\nwhere $\\|x\\| := \\sqrt{\\sum_{i=1}^d x_i^2}$. For a convex function $F(\\lambda)$, the $\\delta$-sublevel set $\\mathcal{S}_\\delta$ is defined as\n\n$$\\mathcal{S}_\\delta := \\{\\lambda \\in \\mathbb{R}^N : F(\\lambda) - F^* \\leq \\delta\\}. \\qquad (7)$$\n\nFurthermore, for any matrix $A \\in \\mathbb{R}^{N\\times d}$, we use $\\sigma_{\\max}(A^T A)$ to denote the largest eigenvalue of $A^T A$. Let\n\n$$\\Lambda^* := \\{\\lambda^* \\in \\mathbb{R}^N : F(\\lambda^*) \\leq F(\\lambda),\\ \\forall\\lambda \\in \\mathbb{R}^N\\} \\qquad (8)$$\n\nbe the set of optimal Lagrange multipliers. Note that if the constraint $Ax = b$ is feasible, then $\\lambda^* \\in \\Lambda^*$ implies $\\lambda^* + v \\in \\Lambda^*$ for any $v$ that satisfies $A^T v = 0$. The following definition introduces the notion of local error bound.\n\nDefinition 2.1. Let $F(\\lambda)$ be a convex function over $\\lambda \\in \\mathbb{R}^N$. Suppose $\\Lambda^*$ is non-empty. The function $F(\\lambda)$ is said to satisfy the local error bound with parameter $\\beta \\in (0, 1]$ if $\\exists\\delta > 0$ such that for any $\\lambda \\in \\mathcal{S}_\\delta$,\n\n$$\\mathrm{dist}(\\lambda, \\Lambda^*) \\leq C_\\delta (F(\\lambda) - F^*)^\\beta, \\qquad (9)$$\n\nwhere $C_\\delta$ is a positive constant possibly depending on $\\delta$. In particular, when $\\beta = 1/2$, $F(\\lambda)$ is said to be locally quadratic and when $\\beta = 1$, it is said to be locally linear.\n\nRemark 2.1. Indeed, a wide range of popular optimization problems satisfy the local error bound condition. The work Tseng (2010) shows that if $\\mathcal{X}$ is a polyhedron, $f(\\cdot)$ has Lipschitz continuous gradient and is strongly convex, then the dual function of (1-2) is locally linear. The work Burke and Tseng (1996) shows that when the objective is linear and $\\mathcal{X}$ is a convex cone, the dual function is also locally linear. The values of $\\beta$ have also been computed for several other problems (e.g. Pang (1997); Yang and Lin (2015)).\n\nDefinition 2.2. Given an accuracy level $\\varepsilon > 0$, a vector $x_0 \\in \\mathcal{X}$ is said to achieve an $\\varepsilon$ approximate solution regarding problem (1-2) if\n\n$$f(x_0) - f^* \\leq \\mathcal{O}(\\varepsilon),\\quad \\|Ax_0 - b\\| \\leq \\mathcal{O}(\\varepsilon),$$\n\nwhere $f^*$ is the optimal primal objective of (1-2).\n\nThroughout the paper, we adopt the following assumptions:\n\nAssumption 2.1.\n(a) The feasible set $\\{x \\in \\mathcal{X} : Ax - b = 0\\}$ is nonempty and non-singleton.\n(b) The set $\\mathcal{X}$ is bounded, i.e. $\\sup_{x,y\\in\\mathcal{X}} \\|x - y\\| \\leq D$, for some positive constant $D$. Furthermore, the function $f(x)$ is also bounded, i.e. $\\max_{x\\in\\mathcal{X}} |f(x)| \\leq M$, for some positive constant $M$.\n(c) The dual function defined in (6) satisfies the local error bound for some parameter $\\beta \\in (0, 1]$ and some level $\\delta > 0$.\n(d) Let $P_A$ be the projection operator onto the column space of $A$. There exists a unique vector $\\nu^* \\in \\mathbb{R}^N$ such that for any $\\lambda^* \\in \\Lambda^*$, $P_A\\lambda^* = \\nu^*$, i.e. $\\Lambda^* = \\{\\lambda^* \\in \\mathbb{R}^N : P_A\\lambda^* = \\nu^*\\}$.\n\nNote that assumptions (a) and (b) are very mild and quite standard. For most applications, it is enough to check (c) and (d). We will show, for example, in Section 4 that the distributed geometric median problem satisfies all the assumptions. Finally, we say a function $g : \\mathcal{X} \\to \\mathbb{R}$ is smooth with modulus $L > 0$ if\n\n$$\\|\\nabla g(x) - \\nabla g(y)\\| \\leq L\\|x - y\\|,\\ \\forall x, y \\in \\mathcal{X}.$$\n\n$^5$ Usually, the Lagrange dual is defined as $\\min_{x\\in\\mathcal{X}} \\langle\\lambda, Ax - b\\rangle + f(x)$. Here, we flip the sign and take the maximum for no reason other than being consistent with the form (4).\n\n2.2 Primal-dual homotopy smoothing algorithm\n\nThis section introduces our proposed algorithm for optimization problem (1-2) satisfying Assumption 2.1. 
The idea of smoothing is to introduce a smoothed Lagrange dual function $F_\\mu(\\lambda)$ that approximates the original possibly non-smooth dual function $F(\\lambda)$ defined in (6).\n\nFor any constant $\\mu > 0$, define\n\n$$f_\\mu(x) = f(x) + \\frac{\\mu}{2}\\|x - \\tilde{x}\\|^2, \\qquad (10)$$\n\nwhere $\\tilde{x}$ is an arbitrary fixed point in $\\mathcal{X}$. For simplicity of notation, we drop the dependency on $\\tilde{x}$ in the definition of $f_\\mu(x)$. Then, by the boundedness assumption of $\\mathcal{X}$, we have $f(x) \\leq f_\\mu(x) \\leq f(x) + \\frac{\\mu}{2}D^2$, $\\forall x \\in \\mathcal{X}$. For any $\\lambda \\in \\mathbb{R}^N$, define\n\n$$F_\\mu(\\lambda) = \\max_{x\\in\\mathcal{X}}\\ -\\langle\\lambda, Ax - b\\rangle - f_\\mu(x) \\qquad (11)$$\n\nas the smoothed dual function. The fact that $F_\\mu(\\lambda)$ is indeed smooth with modulus $\\mu$ follows from Lemma 6.1 in the Supplement. Thus, one is able to apply an accelerated gradient descent algorithm on this modified Lagrange dual function, which is detailed in Algorithm 1 below, starting from an initial primal-dual pair $(\\tilde{x}, \\widetilde{\\lambda}) \\in \\mathbb{R}^d \\times \\mathbb{R}^N$.\n\nAlgorithm 1 Primal-Dual Smoothing: PDS$(\\widetilde{\\lambda}, \\tilde{x}, \\mu, T)$\nLet $\\lambda_0 = \\lambda_{-1} = \\widetilde{\\lambda}$ and $\\theta_0 = \\theta_{-1} = 1$.\nFor $t = 0$ to $T - 1$ do\n\u2022 Compute a tentative dual multiplier: $\\widehat{\\lambda}_t = \\lambda_t + \\theta_t(\\theta_{t-1}^{-1} - 1)(\\lambda_t - \\lambda_{t-1})$.\n\u2022 Compute the primal update: $x(\\widehat{\\lambda}_t) = \\mathrm{argmax}_{x\\in\\mathcal{X}}\\ -\\langle\\widehat{\\lambda}_t, Ax - b\\rangle - f(x) - \\frac{\\mu}{2}\\|x - \\tilde{x}\\|^2$.\n\u2022 Compute the dual update: $\\lambda_{t+1} = \\widehat{\\lambda}_t + \\mu(Ax(\\widehat{\\lambda}_t) - b)$.\n\u2022 Update the stepsize: $\\theta_{t+1} = \\frac{\\sqrt{\\theta_t^4 + 4\\theta_t^2} - \\theta_t^2}{2}$.\nend for\nOutput: $\\overline{x}_T = \\frac{1}{S_T}\\sum_{t=0}^{T-1}\\frac{1}{\\theta_t}x(\\widehat{\\lambda}_t)$ and $\\lambda_T$, where $S_T = \\sum_{t=0}^{T-1}\\frac{1}{\\theta_t}$.\n\nOur proposed algorithm runs Algorithm 1 in multiple stages, which is detailed in Algorithm 2 below.\n\nAlgorithm 2 Homotopy Method:\nLet $\\varepsilon_0$ be a fixed constant and $\\varepsilon < \\varepsilon_0$ be the desired accuracy. Set $\\mu_0 = \\frac{\\varepsilon_0}{D^2}$, $\\lambda^{(0)} = 0$, $x^{(0)} \\in \\mathcal{X}$, the number of stages $K \\geq \\lceil\\log_2(\\varepsilon_0/\\varepsilon)\\rceil + 1$, and the time horizon during each stage $T \\geq 1$.\nFor $k = 1$ to $K$ do\n\u2022 Let $\\mu_k = \\mu_{k-1}/2$.\n\u2022 Run the primal-dual smoothing algorithm $(\\lambda^{(k)}, x^{(k)}) = \\mathrm{PDS}(\\lambda^{(k-1)}, x^{(k-1)}, \\mu_k, T)$.\nend for\nOutput: $x^{(K)}$.\n\n3 Convergence Time Results\n\nWe start by defining the set of optimal Lagrange multipliers for the smoothed problem:$^6$\n\n$$\\Lambda^*_\\mu := \\{\\lambda^*_\\mu \\in \\mathbb{R}^N : F_\\mu(\\lambda^*_\\mu) \\leq F_\\mu(\\lambda),\\ \\forall\\lambda \\in \\mathbb{R}^N\\} \\qquad (12)$$\n\n$^6$ By Assumption 2.1(a) and Farkas\u2019 Lemma, this is non-empty.\n\nOur convergence time analysis involves two steps. The first step is to derive a primal convergence time bound for Algorithm 1, which involves the location information of the initial Lagrange multiplier at the beginning of this stage. The details are given in Supplement 6.2.\n\nTheorem 3.1. Suppose Assumption 2.1(a)(b) holds. For any $T \\geq 1$ and any initial vector $(\\tilde{x}, \\widetilde{\\lambda}) \\in \\mathbb{R}^d \\times \\mathbb{R}^N$, we have the following performance bound regarding Algorithm 1,\n\n$$f(\\overline{x}_T) - f^* \\leq \\|P_A\\widetilde{\\lambda}^*\\| \\cdot \\|A\\overline{x}_T - b\\| + \\frac{\\sigma_{\\max}(A^T A)}{2\\mu S_T}\\|\\widetilde{\\lambda}^* - \\widetilde{\\lambda}\\|^2 + \\frac{\\mu D^2}{2}, \\qquad (13)$$\n$$\\|A\\overline{x}_T - b\\| \\leq \\frac{2\\sigma_{\\max}(A^T A)}{\\mu S_T}\\left(\\|\\widetilde{\\lambda}^* - \\widetilde{\\lambda}\\| + \\mathrm{dist}(\\lambda^*_\\mu, \\Lambda^*)\\right), \\qquad (14)$$\n\nwhere $\\widetilde{\\lambda}^* \\in \\mathrm{argmin}_{\\lambda^*\\in\\Lambda^*}\\|\\lambda^* - \\widetilde{\\lambda}\\|$, $\\overline{x}_T := \\frac{1}{S_T}\\sum_{t=0}^{T-1}\\frac{1}{\\theta_t}x(\\widehat{\\lambda}_t)$, $S_T = \\sum_{t=0}^{T-1}\\frac{1}{\\theta_t}$ and $\\lambda^*_\\mu$ is any point in $\\Lambda^*_\\mu$ defined in (12).\n\nAn inductive argument shows that $\\theta_t \\leq 2/(t + 2)$ $\\forall t \\geq 0$. Thus, Theorem 3.1 already gives an $\\mathcal{O}(1/\\varepsilon)$ convergence time by setting $\\mu = \\varepsilon$ and $T = 1/\\varepsilon$. Note that this is the best trade-off we can get from Theorem 3.1 when simply bounding the terms $\\|\\widetilde{\\lambda}^* - \\widetilde{\\lambda}\\|$ and $\\mathrm{dist}(\\lambda^*_\\mu, \\Lambda^*)$ by constants. To see how this bound leads to an improved convergence time when running in multiple rounds, suppose the computation from the last round gives a $\\widetilde{\\lambda}$ that is close enough to the optimal set $\\Lambda^*$; then $\\|\\widetilde{\\lambda}^* - \\widetilde{\\lambda}\\|$ would be small. When the local error bound condition holds, one can show that $\\mathrm{dist}(\\lambda^*_\\mu, \\Lambda^*) \\leq \\mathcal{O}(\\mu^\\beta)$. 
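The stepsize recursion of Algorithm 1 and the bound theta_t <= 2/(t + 2) quoted above can be checked numerically. The sketch below assumes the recursion theta_{t+1} = (sqrt(theta_t^4 + 4 theta_t^2) - theta_t^2) / 2 with theta_0 = 1, which is our reading of the stepsize update; the function name and the horizon T = 1000 are chosen only for illustration.

```python
import math

def stepsizes(num_steps):
    # Assumed recursion: theta_0 = 1 and
    # theta_{t+1} = (sqrt(theta_t^4 + 4 theta_t^2) - theta_t^2) / 2.
    thetas = [1.0]
    for _ in range(num_steps - 1):
        th = thetas[-1]
        thetas.append((math.sqrt(th ** 4 + 4.0 * th ** 2) - th ** 2) / 2.0)
    return thetas

T = 1000
thetas = stepsizes(T)

# Bound quoted in the text: theta_t <= 2 / (t + 2) for all t >= 0.
bound_holds = all(th <= 2.0 / (t + 2) for t, th in enumerate(thetas))

# Averaging weight S_T = sum_t 1 / theta_t. The bound above implies
# S_T >= sum_t (t + 2) / 2 = T (T + 3) / 4, so S_T grows like T^2.
S_T = sum(1.0 / th for th in thetas)
S_T_lower = T * (T + 3) / 4.0
```

Since S_T grows quadratically in T, the terms of Theorem 3.1 that are divided by mu times S_T scale like 1/(mu T^2), which is the trade-off between mu and T discussed in the surrounding text.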
As a consequence, one is able to choose $\\mu$ smaller than $\\varepsilon$ and get a better trade-off. Formally, we have the following overall performance bound. The proof is given in Supplement 6.3.\n\nTheorem 3.2. Suppose Assumption 2.1 holds, $\\varepsilon_0 \\geq \\max\\{2M, 1\\}$, $0 < \\varepsilon \\leq \\min\\{\\delta/2, 2M, 1\\}$, $T \\geq \\frac{2DC_\\delta\\sqrt{\\sigma_{\\max}(A^T A)}(2M)^{\\beta/2}}{\\varepsilon^{2/(2+\\beta)}}$. The proposed homotopy method achieves the following objective and constraint violation bound:\n\nf (x(K)) f\u21e4 \uf8ff\u2713 24kPA\u21e4k(1 + C)\nkAx(K) bk \uf8ff\n\n24(1 + C)\n (2M ) \",\nC2\n\n (2M )2\n\nC2\n\n+\n\n6\n\n (2M )2 +\nC2\n\n1\n\n4\u25c6 \",\n\nwith running time $\\frac{2DC_\\delta\\sqrt{\\sigma_{\\max}(A^T A)}(2M)^{\\beta/2}}{\\varepsilon^{2/(2+\\beta)}}(\\lceil\\log_2(\\varepsilon_0/\\varepsilon)\\rceil + 1)$, i.e. the algorithm achieves an $\\varepsilon$ approximation with convergence time $\\mathcal{O}\\left(\\varepsilon^{-2/(2+\\beta)}\\log_2(\\varepsilon^{-1})\\right)$.\n\n4 Distributed Geometric Median\n\nConsider the problem of computing the geometric median over a connected network $(\\mathcal{V}, \\mathcal{E})$, where $\\mathcal{V} = \\{1, 2, \\cdots, n\\}$ is a set of $n$ nodes, $\\mathcal{E} = \\{e_{ij}\\}_{i,j\\in\\mathcal{V}}$ is a collection of undirected edges, $e_{ij} = 1$ if there exists an undirected edge between node $i$ and node $j$, and $e_{ij} = 0$ otherwise. Furthermore, $e_{ii} = 1$, $\\forall i \\in \\{1, 2, \\cdots, n\\}$. Since the graph is undirected, we always have $e_{ij} = e_{ji}$, $\\forall i, j \\in \\{1, 2, \\cdots, n\\}$. Two nodes $i$ and $j$ are said to be neighbors of each other if $e_{ij} = 1$. Each node $i$ holds a local vector $b_i \\in \\mathbb{R}^d$, and the goal is to compute the solution to (5) without having a central controller, i.e. each node can only communicate with its neighbors.\n\nComputing geometric median over a network has been considered in several works previously and various distributed algorithms have been developed such as decentralized subgradient method (DSM, Nedic and Ozdaglar (2009); Yuan et al. (2016)), PG-EXTRA (Shi et al. (2015)) and ADMM (Shi et al. (2014); Deng et al. (2017)). The best known convergence time for this problem is $\\mathcal{O}(1/\\varepsilon)$. In this section, we will show that it can be written in the form of problem (1-2), has its Lagrange dual function locally quadratic and optimal Lagrange multiplier unique up to the null space of $A$, thereby satisfying Assumption 2.1.\n\nThroughout this section, we assume that $n \\geq 3$, $b_1, b_2, \\cdots, b_n \\in \\mathbb{R}^d$ are not co-linear and they are distinct (i.e. $b_i \\neq b_j$ if $i \\neq j$). We start by defining a mixing matrix $\\widetilde{W} \\in \\mathbb{R}^{n\\times n}$ with respect to this network. The mixing matrix will have the following properties:\n\n1. Decentralization: The $(i, j)$-th entry $\\widetilde{w}_{ij} = 0$ if $e_{ij} = 0$.\n2. Symmetry: $\\widetilde{W} = \\widetilde{W}^T$.\n3. The null space of $I_{n\\times n} - \\widetilde{W}$ satisfies $\\mathcal{N}(I_{n\\times n} - \\widetilde{W}) = \\{c\\mathbf{1},\\ c \\in \\mathbb{R}\\}$, where $\\mathbf{1}$ is the all-ones vector in $\\mathbb{R}^n$.\n\nThese conditions are rather mild and satisfied by most doubly stochastic mixing matrices used in practice. Some specific examples are Markov transition matrices of max-degree chain and Metropolis-Hastings chain (see Boyd et al. (2004) for detailed discussions). Let $x_i \\in \\mathbb{R}^d$ be the local variable on the node $i$. Define\n\n$$x := \\begin{bmatrix} x_1 \\\\ x_2 \\\\ \\vdots \\\\ x_n \\end{bmatrix} \\in \\mathbb{R}^{nd},\\quad b := \\begin{bmatrix} b_1 \\\\ b_2 \\\\ \\vdots \\\\ b_n \\end{bmatrix} \\in \\mathbb{R}^{nd},\\quad A = \\begin{bmatrix} W_{11} & \\cdots & W_{1n} \\\\ \\vdots & \\ddots & \\vdots \\\\ W_{n1} & \\cdots & W_{nn} \\end{bmatrix} \\in \\mathbb{R}^{(nd)\\times(nd)},$$\n\nwhere\n\n$$W_{ij} = \\begin{cases} (1 - \\widetilde{w}_{ij})I_{d\\times d}, & \\text{if } i = j, \\\\ -\\widetilde{w}_{ij}I_{d\\times d}, & \\text{if } i \\neq j, \\end{cases}$$\n\nand $\\widetilde{w}_{ij}$ is the $ij$-th entry of the mixing matrix $\\widetilde{W}$. By the aforementioned null space property of the mixing matrix $\\widetilde{W}$, it is easy to see that the null space of the matrix $A$ is\n\n$$\\mathcal{N}(A) = \\{u \\in \\mathbb{R}^{nd} : u = [u_1^T, \\cdots, u_n^T]^T,\\ u_1 = u_2 = \\cdots = u_n\\}. \\qquad (15)$$\n\nThen, because of the null space property (15), one can equivalently write problem (5) in a \u201cdistributed fashion\u201d as follows:\n\n$$\\min\\ \\sum_{i=1}^n \\|x_i - b_i\\| \\qquad (16)$$\n$$\\text{s.t.}\\ Ax = 0,\\ \\|x_i - b_i\\| \\leq D,\\ i = 1, 2, \\cdots, n, \\qquad (17)$$\n\nwhere we set the constant $D$ to be large enough so that the solution belongs to the set $\\mathcal{X} := \\{x \\in \\mathbb{R}^{nd} : \\|x_i - b_i\\| \\leq D,\\ i = 1, 2, \\cdots, n\\}$. This is in the same form as (1-2) with this choice of $\\mathcal{X}$.\n\n4.1 Distributed implementation\n\nIn this section, we show how to implement the proposed algorithm to solve (16-17) in a distributed way. Let $\\lambda_t = [\\lambda_{t,1}^T, \\lambda_{t,2}^T, \\cdots, \\lambda_{t,n}^T] \\in \\mathbb{R}^{nd}$, $\\widehat{\\lambda}_t = [\\widehat{\\lambda}_{t,1}^T, \\widehat{\\lambda}_{t,2}^T, \\cdots, \\widehat{\\lambda}_{t,n}^T] \\in \\mathbb{R}^{nd}$ be the vectors of Lagrange multipliers defined in Algorithm 1, where each $\\lambda_{t,i}, \\widehat{\\lambda}_{t,i} \\in \\mathbb{R}^d$. Then, each agent $i \\in \\{1, 2, \\cdots, n\\}$ in the network is responsible for updating the corresponding Lagrange multipliers $\\lambda_{t,i}$ and $\\widehat{\\lambda}_{t,i}$ according to Algorithm 1, which has the initial values $\\lambda_{0,i} = \\lambda_{-1,i} = \\widetilde{\\lambda}_i$. Note that the first, third and fourth steps in Algorithm 1 are naturally separable regarding each agent. It remains to check if the second step can be implemented in a distributed way.\n\nNote that in the second step, we obtain the primal update $x(\\widehat{\\lambda}_t) = [x_1(\\widehat{\\lambda}_t)^T, \\cdots, x_n(\\widehat{\\lambda}_t)^T] \\in \\mathbb{R}^{nd}$ by solving the following problem:\n\n$$x(\\widehat{\\lambda}_t) = \\mathrm{argmax}_{x:\\|x_i - b_i\\| \\leq D,\\ i=1,2,\\cdots,n}\\ -\\langle\\widehat{\\lambda}_t, Ax\\rangle - \\sum_{i=1}^n\\left(\\|x_i - b_i\\| + \\frac{\\mu}{2}\\|x_i - \\tilde{x}_i\\|^2\\right),$$\n\nwhere $\\tilde{x}_i \\in \\mathbb{R}^d$ is a fixed point in the feasible set. We separate the maximization according to different agent $i \\in \\{1, 2, \\cdots, n\\}$:\n\n$$x_i(\\widehat{\\lambda}_t) = \\mathrm{argmax}_{x_i:\\|x_i - b_i\\| \\leq D}\\ -\\sum_{j=1}^n\\langle\\widehat{\\lambda}_{t,j}, W_{ji}x_i\\rangle - \\|x_i - b_i\\| - \\frac{\\mu}{2}\\|x_i - \\tilde{x}_i\\|^2.$$\n\nNote that according to the definition of $W_{ji}$, it is equal to 0 if agent $j$ is not the neighbor of agent $i$. More specifically, let $\\mathcal{N}_i$ be the set of neighbors of agent $i$ (including the agent $i$ itself); then the above maximization problem can be equivalently written as\n\n$$\\mathrm{argmax}_{x_i:\\|x_i - b_i\\| \\leq D}\\ -\\sum_{j\\in\\mathcal{N}_i}\\langle\\widehat{\\lambda}_{t,j}, W_{ji}x_i\\rangle - \\|x_i - b_i\\| - \\frac{\\mu}{2}\\|x_i - \\tilde{x}_i\\|^2$$\n$$= \\mathrm{argmax}_{x_i:\\|x_i - b_i\\| \\leq D}\\ -\\left\\langle\\sum_{j\\in\\mathcal{N}_i}W_{ji}\\widehat{\\lambda}_{t,j},\\ x_i\\right\\rangle - \\|x_i - b_i\\| - \\frac{\\mu}{2}\\|x_i - \\tilde{x}_i\\|^2,\\quad i \\in \\{1, 2, \\cdots, n\\},$$\n\nwhere we used the fact that $W_{ji}^T = W_{ji}$. Solving this problem only requires the local information from each agent. Completing the squares gives\n\n$$x_i(\\widehat{\\lambda}_t) = \\mathrm{argmax}_{\\|x_i - b_i\\| \\leq D}\\ -\\frac{\\mu}{2}\\left\\|x_i - \\left(\\tilde{x}_i - \\frac{1}{\\mu}\\sum_{j\\in\\mathcal{N}_i}W_{ji}\\widehat{\\lambda}_{t,j}\\right)\\right\\|^2 - \\|x_i - b_i\\|. \\qquad (18)$$\n\nThe solution to such a subproblem has a closed form, as is shown in the following lemma (the proof is given in Supplement 6.4):\n\nLemma 4.1. Let $a_i = \\tilde{x}_i - \\frac{1}{\\mu}\\sum_{j\\in\\mathcal{N}_i}W_{ji}\\widehat{\\lambda}_{t,j}$; then the solution to (18) has the following closed form:\n\n$$x_i(\\widehat{\\lambda}_t) = \\begin{cases} b_i, & \\text{if } \\|b_i - a_i\\| \\leq 1/\\mu, \\\\ b_i - \\frac{b_i - a_i}{\\|b_i - a_i\\|}\\left(\\|b_i - a_i\\| - \\frac{1}{\\mu}\\right), & \\text{if } \\frac{1}{\\mu} < \\|b_i - a_i\\| \\leq \\frac{1}{\\mu} + D, \\\\ b_i - \\frac{b_i - a_i}{\\|b_i - a_i\\|}D, & \\text{otherwise.} \\end{cases}$$\n\n4.2 Local error bound condition\n\nThe proof of the following theorem is given in Supplement 6.5.\n\nTheorem 4.1. The Lagrange dual function of (16-17) is non-smooth and given by the following\n\n$$F(\\lambda) = -\\langle A^T\\lambda, b\\rangle + D\\sum_{i=1}^n\\left(\\|A_{[i]}^T\\lambda\\| - 1\\right) \\cdot I\\left(\\|A_{[i]}^T\\lambda\\| > 1\\right),$$\n\nwhere $A_{[i]} = [W_{1i}\\ W_{2i}\\ \\cdots\\ W_{ni}]^T$ is the $i$-th column block of the matrix $A$, and $I\\left(\\|A_{[i]}^T\\lambda\\| > 1\\right)$ is the indicator function which takes 1 if $\\|A_{[i]}^T\\lambda\\| > 1$ and 0 otherwise. Let $\\Lambda^*$ be the set of optimal Lagrange multipliers defined according to (8). Suppose $D \\geq 2n \\cdot \\max_{i,j\\in\\mathcal{V}}\\|b_i - b_j\\|$; then, for any $\\delta > 0$, there exists a $C_\\delta > 0$ such that\n\n$$\\mathrm{dist}(\\lambda, \\Lambda^*) \\leq C_\\delta(F(\\lambda) - F^*)^{1/2},\\ \\forall\\lambda \\in \\mathcal{S}_\\delta.$$\n\nFurthermore, there exists a unique vector $\\nu^* \\in \\mathbb{R}^{nd}$ s.t. $P_A\\lambda^* = \\nu^*$, $\\forall\\lambda^* \\in \\Lambda^*$, i.e. Assumption 2.1(d) holds. 
Thus, applying the proposed method gives the convergence time $\\mathcal{O}\\left(\\varepsilon^{-4/5}\\log_2(\\varepsilon^{-1})\\right)$.\n\n5 Simulation Experiments\n\nIn this section, we conduct simulation experiments on the distributed geometric median problem. Each vector $b_i \\in \\mathbb{R}^{100}$, $i \\in \\{1, 2, \\cdots, n\\}$ is sampled from the uniform distribution in $[0, 10]^{100}$, i.e. each entry of $b_i$ is independently sampled from the uniform distribution on $[0, 10]$. We compare our algorithm with DSM (Nedic and Ozdaglar (2009)), P-EXTRA (Shi et al. (2015)), Jacobian parallel ADMM (Deng et al. (2017)) and Smoothing (Necoara and Suykens (2008)) under different network sizes ($n = 20, 50, 100$). Each network is randomly generated with a particular connectivity ratio$^7$, and the mixing matrix is chosen to be the Metropolis-Hastings Chain (Boyd et al. (2004)), which can be computed in a distributed manner. We use the relative error as the performance metric, which is defined as $\\|x_t - x^*\\|/\\|x_0 - x^*\\|$ for each iteration $t$. The vector $x_0 \\in \\mathbb{R}^{nd}$ is the initial primal variable. The vector $x^* \\in \\mathbb{R}^{nd}$ is the optimal solution computed by CVX (Grant et al. (2008)). For our proposed algorithm, $x_t$ is the restarted primal average up to the current iteration. For all other algorithms, $x_t$ is the primal average up to the current iteration. The results are shown below. We see that in all cases, our proposed algorithm is comparable to, if not much better than, the other algorithms. 
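For readers reproducing the setup, two building blocks mentioned above can be sketched as follows: a Metropolis-Hastings mixing matrix and the per-agent closed-form primal update of Lemma 4.1. This is only a sketch under stated assumptions. The weight rule min(1/d_i, 1/d_j) on each edge, with d_i the number of neighbors of node i excluding itself, is the common Metropolis-Hastings definition and is assumed here rather than taken from this paper; the function names and the toy path graph are hypothetical.

```python
import math

def metropolis_hastings_matrix(n, edges):
    # edges: undirected pairs (i, j) with i != j; self-loops are implicit.
    degree = [0] * n
    for i, j in edges:
        degree[i] += 1
        degree[j] += 1
    W = [[0.0] * n for _ in range(n)]
    for i, j in edges:
        # Assumed Metropolis-Hastings rule on each edge.
        W[i][j] = W[j][i] = min(1.0 / degree[i], 1.0 / degree[j])
    for i in range(n):
        # The remaining mass stays on the node, so every row sums to one.
        W[i][i] = 1.0 - sum(W[i][j] for j in range(n) if j != i)
    return W

def local_primal_update(a_i, b_i, mu, D):
    # Closed form of Lemma 4.1 for one agent, with a_i as defined there.
    gap = math.dist(b_i, a_i)
    if gap <= 1.0 / mu:
        return list(b_i)
    # Move from b_i towards a_i by the shrunken distance, capped at D.
    step = min(gap - 1.0 / mu, D)
    return [b - (b - a) / gap * step for a, b in zip(a_i, b_i)]

# Toy path graph 0 - 1 - 2 - 3 (hypothetical example network).
path_edges = [(0, 1), (1, 2), (2, 3)]
W = metropolis_hastings_matrix(4, path_edges)

# Middle case of Lemma 4.1: shrink the distance to a_i by 1 / mu.
x_mid = local_primal_update([3.0, 0.0], [0.0, 0.0], mu=1.0, D=10.0)
# Last case of Lemma 4.1: the ball constraint of radius D is active.
x_cap = local_primal_update([3.0, 0.0], [0.0, 0.0], mu=1.0, D=1.0)
```

The resulting matrix is symmetric, has zero entries for non-neighbors, and has rows summing to one, so the all-ones vector lies in the null space of the identity minus W; on a connected graph this matches the three mixing-matrix properties listed in Section 4.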
For detailed simulation setups and additional simulation results, see Supplement 6.6.\n\n[Figure 1 plots: three panels (a), (b), (c), each showing relative error versus number of iterations ($\\times 10^4$) for DSM, EXTRA, Jacobian-ADMM, Smoothing and the proposed algorithm.]\n\nFigure 1: Comparison of different algorithms on networks of different sizes. (a) n = 20, connectivity ratio=0.15. (b) n = 50, connectivity ratio=0.13. (c) n = 100, connectivity ratio=0.1.\n\nAcknowledgments\nThe authors thank Stanislav Minsker and Jason D. Lee for helpful discussions related to the geometric median problem. Qing Ling\u2019s research is supported in part by the National Science Foundation China under Grant 61573331 and Guangdong IIET Grant 2017ZT07X355. Qing Ling is also affiliated with Guangdong Province Key Laboratory of Computational Science. Michael J. Neely\u2019s research is supported in part by the National Science Foundation under Grant CCF-1718477.\n\nReferences\nBeck, A., A. Nedic, A. Ozdaglar, and M. Teboulle (2014). An o(1/k) gradient method for network resource allocation problems. IEEE Transactions on Control of Network Systems 1(1), 64\u201373.\n\nBertsekas, D. P. (1999). Nonlinear programming. Athena Scientific Belmont.\n\nBertsekas, D. P. (2009). Convex optimization theory. Athena Scientific Belmont.\n\nBoyd, S., P. 
Diaconis, and L. Xiao (2004). Fastest mixing Markov chain on a graph. SIAM Review 46(4), 667\u2013689.\n\nBurke, J. V. and P. Tseng (1996). A unified analysis of Hoffman\u2019s bound via Fenchel duality. SIAM Journal on Optimization 6(2), 265\u2013282.\n\nCohen, M. B., Y. T. Lee, G. Miller, J. Pachocki, and A. Sidford (2016). Geometric median in nearly linear time. In Proceedings of the forty-eighth annual ACM symposium on Theory of Computing, pp. 9\u201321.\n\nDeng, W., M.-J. Lai, Z. Peng, and W. Yin (2017). Parallel multi-block ADMM with O(1/k) convergence. Journal of Scientific Computing 71(2), 712\u2013736.\n\nDeng, W. and W. Yin (2016). On the global and linear convergence of the generalized alternating direction method of multipliers. Journal of Scientific Computing 66(3), 889\u2013916.\n\nDuchi, J. C., M. I. Jordan, M. J. Wainwright, and Y. Zhang (2014). Optimality guarantees for distributed statistical estimation. arXiv preprint arXiv:1405.0782.\n\nGidel, G., F. Pedregosa, and S. Lacoste-Julien (2018). Frank-Wolfe splitting via augmented Lagrangian method. arXiv preprint arXiv:1804.03176.\n\nGrant, M., S. Boyd, and Y. Ye (2008). CVX: Matlab software for disciplined convex programming.\n\n[Footnote 7] The connectivity ratio is defined as the number of edges divided by the total number of possible edges $n(n + 1)/2$.\n\nHan, D., D. Sun, and L. Zhang (2015). Linear rate convergence of the alternating direction method of multipliers for convex composite quadratic and semi-definite programming. arXiv preprint arXiv:1508.02134.\n\nLan, G. and R. D. Monteiro (2013). Iteration-complexity of first-order penalty methods for convex programming. Mathematical Programming 138(1-2), 115\u2013139.\n\nLi, J., G. Chen, Z. Dong, and Z. Wu (2016). A fast dual proximal-gradient method for separable convex optimization with linear coupled constraints. Computational Optimization and Applications 64(3), 671\u2013697.\n\nLing, Q. 
and Z. Tian (2010). Decentralized sparse signal recovery for compressive sleeping wireless sensor networks. IEEE Transactions on Signal Processing 58(7), 3816\u20133827.\n\nLuo, X.-D. and Z.-Q. Luo (1994). Extension of Hoffman\u2019s error bound to polynomial systems. SIAM Journal on Optimization 4(2), 383\u2013392.\n\nMinsker, S., S. Srivastava, L. Lin, and D. B. Dunson (2014). Robust and scalable Bayes via a median of subset posterior measures. arXiv preprint arXiv:1403.2660.\n\nMinsker, S. and N. Strawn (2017). Distributed statistical estimation and rates of convergence in normal approximation. arXiv preprint arXiv:1704.02658.\n\nMotzkin, T. (1952). Contributions to the theory of linear inequalities. D.R. Fulkerson (Transl.) (Santa Monica: RAND Corporation). RAND Corporation Translation 22.\n\nNecoara, I. and J. A. Suykens (2008). Application of a smoothing technique to decomposition in convex optimization. IEEE Transactions on Automatic Control 53(11), 2674\u20132679.\n\nNedic, A. and A. Ozdaglar (2009). Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control 54(1), 48\u201361.\n\nNesterov, Y. (2005). Smooth minimization of non-smooth functions. Mathematical Programming 103(1), 127\u2013152.\n\nNesterov, Y. (2015a). Complexity bounds for primal-dual methods minimizing the model of objective function. Mathematical Programming, 1\u201320.\n\nNesterov, Y. (2015b). Universal gradient methods for convex optimization problems. Mathematical Programming 152(1-2), 381\u2013404.\n\nOsborne, M. R., B. Presnell, and B. A. Turlach (2000). A new approach to variable selection in least squares problems. IMA Journal of Numerical Analysis 20(3), 389\u2013403.\n\nPang, J.-S. (1997, Oct). Error bounds in mathematical programming. Mathematical Programming 79(1), 299\u2013332.\n\nParrilo, P. A. and B. Sturmfels (2003). Minimizing polynomial functions. 
Algorithmic and quantitative real algebraic geometry, DIMACS Series in Discrete Mathematics and Theoretical Computer Science 60, 83\u201399.\n\nShi, W., Q. Ling, G. Wu, and W. Yin (2015). A proximal gradient algorithm for decentralized composite optimization. IEEE Transactions on Signal Processing 63(22), 6013\u20136023.\n\nShi, W., Q. Ling, K. Yuan, G. Wu, and W. Yin (2014). On the linear convergence of the ADMM in decentralized consensus optimization. IEEE Transactions on Signal Processing 62(7), 1750\u20131761.\n\nTran-Dinh, Q., O. Fercoq, and V. Cevher (2018). A smooth primal-dual optimization framework for nonsmooth composite convex minimization. SIAM Journal on Optimization 28(1), 96\u2013134.\n\nTseng, P. (2010). Approximation accuracy, gradient methods, and error bound for structured convex optimization. Mathematical Programming 125(2), 263\u2013295.\n\nWang, T. and J.-S. Pang (1994). Global error bounds for convex quadratic inequality systems. Optimization 31(1), 1\u201312.\n\nWei, X. and M. J. Neely (2018). Primal-dual Frank-Wolfe for constrained stochastic programs with convex and non-convex objectives. arXiv preprint arXiv:1806.00709.\n\nWei, X., H. Yu, and M. J. Neely (2015). A probabilistic sample path convergence time analysis of drift-plus-penalty algorithm for stochastic optimization. arXiv preprint arXiv:1510.02973.\n\nWeiszfeld, E. and F. Plastria (2009). On the point for which the sum of the distances to n given points is minimum. Annals of Operations Research 167(1), 7\u201341.\n\nXiao, L. and T. Zhang (2013). A proximal-gradient homotopy method for the sparse least-squares problem. SIAM Journal on Optimization 23(2), 1062\u20131091.\n\nXu, Y., M. Liu, Q. Lin, and T. Yang (2017). ADMM without a fixed penalty parameter: Faster convergence with new adaptive penalization. In Advances in Neural Information Processing Systems, pp. 1267\u20131277.\n\nXu, Y., Y. Yan, Q. Lin, and T. Yang (2016). 
Homotopy smoothing for non-smooth problems with lower complexity than $O(1/\\epsilon)$. In Advances in Neural Information Processing Systems, pp. 1208\u20131216.\n\nXue, G. and Y. Ye (1997). An efficient algorithm for minimizing a sum of Euclidean norms with applications. SIAM Journal on Optimization 7(4), 1017\u20131036.\n\nYang, T. and Q. Lin (2015). RSG: Beating subgradient method without smoothness and strong convexity. arXiv preprint arXiv:1512.03107.\n\nYin, D., Y. Chen, K. Ramchandran, and P. Bartlett (2018). Byzantine-robust distributed learning: Towards optimal statistical rates. arXiv preprint arXiv:1803.01498.\n\nYu, H. and M. J. Neely (2017a). A new backpressure algorithm for joint rate control and routing with vanishing utility optimality gaps and finite queue lengths. In INFOCOM 2017 - IEEE Conference on Computer Communications, pp. 1\u20139. IEEE.\n\nYu, H. and M. J. Neely (2017b). A simple parallel algorithm with an O(1/t) convergence rate for general convex programs. SIAM Journal on Optimization 27(2), 759\u2013783.\n\nYu, H. and M. J. Neely (2018). On the convergence time of dual subgradient methods for strongly convex programs. IEEE Transactions on Automatic Control.\n\nYuan, K., Q. Ling, and W. Yin (2016). On the convergence of decentralized gradient descent. SIAM Journal on Optimization 26(3), 1835\u20131854.\n\nYurtsever, A., Q. T. Dinh, and V. Cevher (2015). A universal primal-dual convex optimization framework. In Advances in Neural Information Processing Systems, pp. 3150\u20133158.\n\nYurtsever, A., O. Fercoq, F. Locatello, and V. Cevher (2018). A conditional gradient framework for composite convex minimization with applications to semidefinite programming. 
arXiv preprint\narXiv:1804.08544.", "award": [], "sourceid": 1983, "authors": [{"given_name": "Xiaohan", "family_name": "Wei", "institution": "USC"}, {"given_name": "Hao", "family_name": "Yu", "institution": "Alibaba Group (US) Inc"}, {"given_name": "Qing", "family_name": "Ling", "institution": "Sun Yat-Sen University"}, {"given_name": "Michael", "family_name": "Neely", "institution": "USC"}]}