{"title": "Asynchronous Coordinate Descent under More Realistic Assumptions", "book": "Advances in Neural Information Processing Systems", "page_first": 6182, "page_last": 6190, "abstract": "Asynchronous-parallel algorithms have the potential to vastly speed up algorithms by eliminating costly synchronization. However, our understanding of these algorithms is limited because the current convergence theory of asynchronous block coordinate descent algorithms is based on somewhat unrealistic assumptions. In particular, the age of the shared optimization variables being used to update blocks is assumed to be independent of the block being updated. Additionally, it is assumed that the updates are applied to randomly chosen blocks. In this paper, we argue that these assumptions either fail to hold or will imply less efficient implementations. We then prove the convergence of asynchronous-parallel block coordinate descent under more realistic assumptions, in particular, always without the independence assumption. The analysis permits both the deterministic (essentially) cyclic and random rules for block choices. Because a bound on the asynchronous delays may or may not be available, we establish convergence for both bounded delays and unbounded delays. The analysis also covers nonconvex, weakly convex, and strongly convex functions. The convergence theory involves a Lyapunov function that directly incorporates both objective progress and delays. 
A continuous-time ODE is provided to motivate the construction at a high level.", "full_text": "Asynchronous Coordinate Descent under More Realistic Assumptions∗

Tao Sun
National University of Defense Technology
Changsha, Hunan 410073, China
nudtsuntao@163.com

Robert Hannah
University of California, Los Angeles
Los Angeles, CA 90095, USA
RobertHannah89@math.ucla.edu

Wotao Yin
University of California, Los Angeles
Los Angeles, CA 90095, USA
wotaoyin@math.ucla.edu

Abstract

Asynchronous-parallel algorithms have the potential to vastly speed up algorithms by eliminating costly synchronization. However, our understanding of these algorithms is limited because the current convergence theory of asynchronous block coordinate descent algorithms is based on somewhat unrealistic assumptions. In particular, the age of the shared optimization variables being used to update blocks is assumed to be independent of the block being updated. Additionally, it is assumed that the updates are applied to randomly chosen blocks.

In this paper, we argue that these assumptions either fail to hold or will imply less efficient implementations. We then prove the convergence of asynchronous-parallel block coordinate descent under more realistic assumptions, in particular, always without the independence assumption. The analysis permits both the deterministic (essentially) cyclic and random rules for block choices. Because a bound on the asynchronous delays may or may not be available, we establish convergence for both bounded delays and unbounded delays. The analysis also covers nonconvex, weakly convex, and strongly convex functions. The convergence theory involves a Lyapunov function that directly incorporates both objective progress and delays.
A continuous-time ODE is provided to motivate the construction at a high level.

1 Introduction

In this paper, we consider the asynchronous-parallel block coordinate descent (async-BCD) algorithm for solving the unconstrained minimization problem

min_{x∈R^N} f(x) = f(x_1, . . . , x_N), (1)

where f is a differentiable function and ∇f is L-Lipschitz continuous. Async-BCD [14, 13, 16] has virtually the same implementation as regular BCD. The difference is that the threads doing the parallel computation do not wait for all others to finish and share their updates before starting the next iteration, but merely continue to update with the most recent solution-vector information available².

∗The work is supported in part by the National Key R&D Program of China 2017YFB0202902, China Scholarship Council, NSF DMS-1720237, and ONR N000141712162.
²Additionally, the step size needs to be modified to ensure convergence results hold. However, in practice traditional step sizes appear to still allow convergence, barring extreme circumstances.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

In traditional algorithms, latency, bandwidth limits, and unexpected drains on resources that delay the update of even a single thread will cause the entire system to wait. By eliminating this costly idle time, asynchronous algorithms can be much faster than traditional ones.

In async-BCD, each agent continually updates the solution vector, one block at a time, leaving all other blocks unchanged. Each block update is a read-compute-update cycle. It begins with an agent reading x from shared memory or a parameter server and saving it in a local cache as x̂. The agent then computes −(1/L)∇_i f(x̂), a block partial gradient³. The final step of the cycle depends on the parallel system setup.
In a shared memory setup, the agent reads block x_i again and writes x_i − (γ_k/L)∇_i f(x̂) to x_i (where γ_k is the step size). In a parameter-server setup, the agent can send −(1/L)∇_i f(x̂) and let the server update x_i. Other setups are possible, too. The iteration counter k increments upon the completion of any block update, and the updating block is denoted as i_k.

Many iterations may occur between the time a computing node reads the solution vector x̂ into memory and the time that the node's corresponding update is applied to the shared solution vector. Because of this, the iteration of async-BCD is modeled [14] as

x^{k+1}_{i_k} = x^k_{i_k} − (γ_k/L)∇_{i_k} f(x̂^k), (2)

where x̂^k is a potentially outdated version of x^k, and x^{k+1}_j = x^k_j for all non-updating blocks j ≠ i_k. The convergence behavior of this algorithm depends on the sequence of updated blocks i_k, the step size sequence γ_k, and the ages of the blocks of x̂^k relative to x^k. We define the delay vector j⃗(k) = (j(k, 1), j(k, 2), . . . , j(k, N)) ∈ Z^N, which represents how outdated each of the blocks is. Specifically, we define

x̂^k = (x^{k−j(k,1)}_1, x^{k−j(k,2)}_2, . . . , x^{k−j(k,N)}_N). (3)

The k'th delay (or current delay) is j(k) = max_{1≤i≤N} {j(k, i)}.

1.1 Dependence between delays and blocks

In previous analyses [13, 14, 16, 9], it is assumed that the block index i_k and the delay j⃗(k) are independent sequences. This simplifies proofs, for example, giving E_{i_k}(P_{i_k}∇f(x̂^k)) = (1/N)∇f(x̂^k) when i_k is chosen at random, where P_i denotes the projection to the ith block. Without independence, j⃗(k) will depend on i_k, causing the distribution of x̂^k to be different for each possible i_k, thus breaking the previous equality.
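The read-compute-update cycle of iteration (2) can be sketched in a small sequential simulation. This is a toy sketch only: the quadratic objective, the random delay model, and the constants are illustrative assumptions, not the paper's experimental setup; the step size takes the form 2c/(2τ+1) used later in Lemma 1.

```python
import numpy as np

# Toy sequential simulation of the async-BCD iteration (2): each update applies
# a block partial gradient evaluated at a stale read x̂^k whose per-block ages
# are at most tau. All problem data here are illustrative assumptions.
rng = np.random.default_rng(0)
N = 8                              # number of blocks (single coordinates here)
L = 1.0                            # Lipschitz constant of the gradient of f
tau = 5                            # bound on the delays
gamma = 2 * 0.9 / (2 * tau + 1)    # step size of the form 2c/(2*tau+1), c = 0.9

target = np.arange(N, dtype=float)
grad = lambda z: z - target        # f(x) = 0.5*||x - target||^2

x = np.zeros(N)
history = [x.copy()]               # recent iterates; stale reads come from here
for k in range(2000):
    # per-block delays j(k, n) <= tau, as in the delay vector (3)
    delays = rng.integers(0, min(tau, len(history) - 1) + 1, size=N)
    xhat = np.array([history[-1 - delays[n]][n] for n in range(N)])
    i = rng.integers(N)            # stochastic block rule: uniform block choice
    x = x.copy()
    x[i] -= (gamma / L) * grad(xhat)[i]   # update only block i_k, as in (2)
    history.append(x.copy())
    history = history[-(tau + 1):]        # keep just enough history for reads

print(float(np.max(np.abs(x - target))))  # close to 0: converges despite staleness
```

Even though every gradient is evaluated at a stale iterate, the run drives the error to (numerically) zero, which is the qualitative behavior the convergence theory below makes precise.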
However, the independence assumption is unrealistic in practice. Consider a problem where some blocks are more expensive to update than others⁴. Blocks that take longer to update should have greater delays when they are updated because more other updates will have occurred between the time that x̂ is read and when the update is applied. For the same reason, updates on blocks assigned to slower or busier agents will generally have greater delays. Indeed, this turns out to be the case in practice. Experiments were performed on a cluster with 2 nodes, each with 16 threads running on an Intel Xeon CPU E5-2690 v2. The algorithm was applied to the logistic regression problem on the "news20" data set from LIBSVM, with 64 contiguous coordinate blocks of equal size. Over 2000 epochs, blocks 0, 1, and 15 had average delays of 351, 115, and 28, respectively. Async-BCD completed this over 7x faster than the corresponding synchronous algorithm using the same computing resources, with a nearly equal decrease in objective function.

Even when blocks have balanced difficulty and the computing nodes have equal computing power, this dependence persists. We assigned 20 threads to each core, with each thread assigned to a block of 40 coordinates with an equal number of nonzeros. The mean delay varied from 29 to 50 over the threads. This may be due to the cluster scheduler or issues of data locality, which were hard to examine. Clearly, there is strong dependence of the delays j⃗(k) on the updated block i_k.

1.2 Stochastic and deterministic block rules

This paper considers two different block rules: deterministic and stochastic. For the stochastic block rule, at each update, a block is chosen from {1, 2, . . . , N} uniformly at random⁵, for instance in [14, 13, 16]. For the deterministic rule, i_k is an arbitrary sequence that is assumed to be essentially cyclic. That is, there is an N′ ∈ N, N′ ≥ N, such that each block i ∈ {1, 2, . . . , N} is updated at least once in a window of N′, that is:

For each t ∈ Z₊, there exists an integer K(i, t) ∈ {tN′, tN′ + 1, . . . , (1 + t)N′ − 1} such that i_{K(i,t)} = i.

This encompasses different kinds of cyclic rules such as fixed ordering, random permutation, and greedy selection. The stochastic block rule is easier to analyze because taking expectation will yield a good approximation to the full gradient. It ensures that every block is updated at the specified frequency. However, it can be expensive or even infeasible to implement for the following reasons. In the shared memory setup, stochastic block rules require random data access, which is not only significantly slower than sequential data access but also causes frequent cache misses (waiting for data to be fetched from slower cache or the main memory). The cyclic rules clearly avoid these issues since data requirements are predictable. In the parameter-server setup where workers update randomly assigned blocks at each step, each worker must either store all the problem data necessary to update any block (which may mean massive storage requirements) or read the required data from the server at every step (which may mean massive bandwidth requirements). Clearly, permanently assigning blocks to agents avoids these issues.

³The computing can start before the reading is completed. If ∇_i f(x̂) does not require all components of x̂, only the required ones are read.
⁴Say, because they are larger, bear more nonzero entries in the training set, or suffer poorer data locality.
⁵The distribution doesn't have to be uniform. We need only assume that every block has a nonzero probability of being updated. It is easy to adjust our analysis to this case.
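The essentially cyclic condition above can be checked mechanically. The helper below is hypothetical (not from the paper) and uses 0-based block indices, where the paper uses {1, . . . , N}.

```python
import random

def is_essentially_cyclic(blocks, N, N_prime):
    """Check the essentially cyclic condition: every block in {0, ..., N-1}
    appears at least once in each disjoint window of N_prime consecutive steps."""
    assert N_prime >= N
    windows = [blocks[t:t + N_prime]
               for t in range(0, len(blocks) - N_prime + 1, N_prime)]
    return all(set(range(N)) <= set(w) for w in windows)

# Fixed cyclic ordering and random permutations both satisfy the condition:
random.seed(1)
cyclic = [k % 3 for k in range(12)]                            # 0,1,2,0,1,2,...
perms = [i for _ in range(4) for i in random.sample(range(3), 3)]
print(is_essentially_cyclic(cyclic, N=3, N_prime=3))           # True
print(is_essentially_cyclic(perms, N=3, N_prime=3))            # True
print(is_essentially_cyclic([0, 0, 1, 2, 1, 2], N=3, N_prime=3))  # False
```

The last call fails because the first window (0, 0, 1) never touches block 2, which is exactly the situation the window condition rules out.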
On the other hand, the analysis of cyclic rules generally has to consider the worst-case ordering and necessarily gives worse performance in the worst case [19]. In practice, worst-case behavior is rare, and cyclic rules often lead to good performance [7, 8, 3].

1.3 Bounded and unbounded delays

We consider different delay assumptions as well. Bounded delay is when j(k) ≤ τ for some fixed τ ∈ Z₊ and all iterations k, while unbounded delay allows sup_k {j(k)} = +∞. Bounded and unbounded delays can be further divided into deterministic and stochastic. Deterministic delays refer to a sequence of delay vectors j⃗(0), j⃗(1), j⃗(2), . . . that is arbitrary or follows an unknown distribution and so is treated as arbitrary. Our stochastic delay results apply to distributions that decay faster than O(k⁻³). Deterministic unbounded delays apply to the case when async-BCD runs on unfamiliar hardware platforms. For convergence, we require a finite lim inf_k {j(k)} and the current step size γ_k to be adaptively chosen according to the current delay j(k), which must be measured or overestimated. Bounded delays and stochastic unbounded delays apply when the user can provide a bound or a delay distribution, respectively. The user can obtain these from previous experience or by running a pilot test. In return, a fixed step size allows convergence, and measuring the current delay is not needed.

1.4 Contributions

Our contributions are mainly convergence results for three kinds of delays: bounded, stochastic unbounded, and deterministic unbounded, obtained without the artificial independence assumption between the block index and the delay. The results are provided for nonconvex, convex, and strongly convex functions with Lipschitz gradients. Sublinear rates and linear rates are provided, which match the rates for the corresponding synchronous algorithms in terms of order of magnitude.
Due to space limitations, we restrict ourselves to Lipschitz differentiable functions and leave out nonsmooth proximable functions. Like many analyses of asynchronous algorithms, our proofs are built on the construction of Lyapunov functions. We provide a simple ODE-based (i.e., continuous-time) construction for bounded delays to motivate the construction of the Lyapunov function in the standard discrete setting. Our analysis brings great news to the practitioner. Roughly speaking, in a variety of settings, even when there is no load balancing (thus the delays may depend on the block index) or bound on the delays, convergence of async-BCD can be assured by using our provided step sizes.

Our proofs do not treat asynchronicity as noise, as many papers do⁶, because modelling delays in this way appears to destroy valuable information and leads to inequalities that are too blunt to obtain stronger results. This is why sublinear and linear rates can be established for weakly and strongly convex problems, respectively, even when delays depend on the blocks and are potentially unbounded. Our main focus is to prove new convergence results in a new setting, not to obtain the best possible rates. Space limitations make this difficult, and we leave it for future work. The main message is that even without the independence assumption, convergence of the same order as for the corresponding synchronous algorithm occurs. The step sizes and rates obtained may be overly pessimistic for the practitioner to use. In practice, we find that using the standard synchronous step size results in convergence, and the observed rate of convergence is extremely close to that of the synchronous counterpart.

⁶See, for example, (5.1) and (A.10) in [18], and (14) and Lemma 4 in [6].
With the independence assumption, convergence rates for asynchronous algorithms have recently been proven to be asymptotically the same as their synchronous counterparts [10].

1.5 Related work

Our work extends the theory on asynchronous BCD algorithms such as [18, 14, 13]. However, their analyses rely on the independence assumption and assume bounded delays. The bounded delay assumption was weakened by recent papers [9, 17], but independence and random blocks were still needed. Recently, [12] proposed (in the SGD setting) an innovative "read after" sequence relabeling technique to create the independence. However, enforcing independence in this way creates other artificial implementation requirements that may be problematic: for instance, agents must read "all shared data parameters and historical gradients before starting iterations", even if not all of this is required to compute updates. Our analysis does not require these kinds of implementation fixes. It also works for unbounded delays and deterministic block choices.

Related recent works also include [1, 2], which solve our problem with additional convex block-separable terms in the objective. In the first paper [1], independence between blocks and delays is avoided. However, they require a step size that diminishes at 1/k and that the sequence of iterates is bounded (which in general may not be true). The second paper [2] relaxes independence by using a different set of assumptions. In particular, their assumption D3 assumes that, regardless of the previous updates, there is a universally positive chance for every block to be updated in the next step. This Markov-type assumption relaxes the independence assumption but does not avoid it. Paper [15] addressed this issue by decoupling the parameters read by each core from the virtual parameters on which progress is actually defined.
Based on the idea of [16], [12] addressed the dependence problem in related work. In the convex case with a bounded delay τ, the step size in paper [14] is O(1/(τ²/N)). In their proofs, the Lyapunov function is based on ‖x^k − x∗‖²₂. Our analysis uses a Lyapunov function consisting of both the function value and the sequence history, where the latter vanishes when delays vanish. If τ is much larger than the number of blocks of the problem, our step size O(1/τ) is better even under our much weaker conditions. The step size bound in [16, 9, 4] is O(1/(1 + 2τ/√N)), which is better than ours, but they need the independence assumption and the stochastic block rule. Recently, [20] introduced an asynchronous primal-dual method for a problem similar to ours but having additional affine linear constraints. The analysis assumes bounded delays, random blocks, and independence.

1.6 Notation

We let x∗ denote any minimizer of f. For the update in (2), we use the following notation:

Δ^k := x^{k+1} − x^k = −(γ_k/L)∇_{i_k} f(x̂^k),   d^k := x^k − x̂^k. (4)

We also use the convention Δ^k := 0 if k < 0. Let χ_k be the sigma algebra generated by {x^0, x^1, . . . , x^k}. Let E_{j⃗(k)} denote the expectation over the value of j⃗(k) (when it is a random variable). E denotes the expectation over all random variables.

2 Bounded delays

In this part, we present convergence results for bounded delays. If the gradient of the function is L-Lipschitz (even if the function is nonconvex), we prove convergence for both the deterministic and stochastic block rules. If the function is convex, we can obtain a sublinear convergence rate. Further, if the function is restricted strongly convex, a linear convergence rate is obtained.

2.1 Continuous-time analysis

Let t be time in this subsection.
Consider the ODE

ẋ(t) = −η∇f(x̂(t)), (5)

where η > 0. If we set x̂(t) ≡ x(t), this system describes a gradient flow, which monotonically decreases f(x(t)), and its discretization is the gradient descent iteration. Indeed, we then have (d/dt) f(x(t)) = ⟨∇f(x(t)), ẋ(t)⟩ = −(1/η)‖ẋ(t)‖²₂ by (5). Instead, we allow delays (i.e., x̂(t) ≠ x(t)) and impose the bound c > 0 on the delays:

‖x̂(t) − x(t)‖₂ ≤ ∫_{t−c}^{t} ‖ẋ(s)‖₂ ds. (6)

The delays introduce inexactness to the gradient flow, and we lose monotonicity. Indeed,

(d/dt) f(x(t)) = ⟨∇f(x(t)), ẋ(t)⟩ = ⟨∇f(x̂(t)), ẋ(t)⟩ + ⟨∇f(x(t)) − ∇f(x̂(t)), ẋ(t)⟩
  ≤(a) −(1/η)‖ẋ(t)‖²₂ + L‖x(t) − x̂(t)‖₂ · ‖ẋ(t)‖₂
  ≤(b) −(1/(2η))‖ẋ(t)‖²₂ + (ηcL²/2) ∫_{t−c}^{t} ‖ẋ(s)‖²₂ ds.

Here a) is from (5) and the Lipschitzness of ∇f, and b) is from the Cauchy-Schwarz inequality L‖x(t) − x̂(t)‖₂ · ‖ẋ(t)‖₂ ≤ ‖ẋ(t)‖²₂/(2η) + ηL²‖x(t) − x̂(t)‖²₂/2 together with ‖x(t) − x̂(t)‖²₂ ≤ c∫_{t−c}^{t} ‖ẋ(s)‖²₂ ds, which follows from (6). The inequalities are generally unavoidable. Therefore, we design an energy function with both f and a weighted total kinetic term, where γ > 0 will be decided below:

ξ(t) = f(x(t)) + γ ∫_{t−c}^{t} (s − (t − c))‖ẋ(s)‖²₂ ds. (7)

By substituting the bound on (d/dt) f(x(t)) in (7), we get the time derivative:

ξ̇(t) = (d/dt) f(x(t)) + γc‖ẋ(t)‖²₂ − γ ∫_{t−c}^{t} ‖ẋ(s)‖²₂ ds
  ≤ −(1/(2η) − γc)‖ẋ(t)‖²₂ − (γ − ηcL²/2) ∫_{t−c}^{t} ‖ẋ(s)‖²₂ ds. (8)

As long as η < 1/(Lc), there exists γ > 0 such that (1/(2η) − γc) > 0 and (γ − ηcL²/2) > 0, so ξ(t) is monotonically nonincreasing. Assume min f is finite. Since ξ(t) is lower bounded by min f, ξ(t) must converge; hence ξ̇ → 0 and ẋ(t) → 0 by (8). Then ∇f(x̂(t)) → 0 by (5), and x̂(t) − x(t) → 0 by (6). The last two results further yield ∇f(x(t)) → 0.

2.2 Discrete analysis

The analysis for our discrete iteration (2) is based on the following Lyapunov function:

ξ_k := f(x^k) + (L/(2ε)) Σ_{i=k−τ}^{k−1} (i − (k − τ) + 1)‖Δ^i‖²₂, (10)

for some ε > 0 determined later based on the step size and on τ, the bound on the delays. The constant ε is not an algorithm parameter.
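The role of the history term in (10) can be illustrated numerically. The sketch below is a toy experiment under explicit assumptions: a random positive definite quadratic, the particular choice ε = 1 (the analysis only asserts that some ε > 0 works), and a step size of the form 2c/(2τ+1) as in Lemma 1 below. Along a simulated async-BCD run, ξ_k decreases at every step even though f(x^k) alone need not.

```python
import numpy as np

# Numerical illustration of the Lyapunov function (10) on a toy quadratic.
# eps = 1 is an illustrative choice; the paper only asserts some eps > 0 works.
rng = np.random.default_rng(0)
N, tau, eps = 4, 3, 1.0
B = rng.standard_normal((N, N))
A = B.T @ B + np.eye(N)                  # f(x) = 0.5 x^T A x, positive definite
L = float(np.linalg.eigvalsh(A).max())   # Lipschitz constant of the gradient
gamma = 2 * 0.9 / (2 * tau + 1)          # step size 2c/(2*tau+1), c = 0.9
f = lambda z: 0.5 * z @ A @ z

x = np.ones(N)
history = [x.copy()]
deltas = []                              # Delta^i := x^{i+1} - x^i
xis = []
for k in range(400):
    # xi_k from (10), with the convention Delta^i = 0 for i < 0:
    s = sum((i - (k - tau) + 1) * deltas[i] @ deltas[i]
            for i in range(max(k - tau, 0), k))
    xis.append(f(x) + (L / (2 * eps)) * s)
    # one async-BCD step with per-block delays at most tau:
    delays = rng.integers(0, min(tau, len(history) - 1) + 1, size=N)
    xhat = np.array([history[-1 - delays[n]][n] for n in range(N)])
    i = rng.integers(N)
    x_new = x.copy()
    x_new[i] -= (gamma / L) * (A @ xhat)[i]
    deltas.append(x_new - x)
    history.append(x_new.copy())
    history = history[-(tau + 1):]
    x = x_new

print(all(b <= a + 1e-9 for a, b in zip(xis, xis[1:])))  # True: xi_k nonincreasing
```

The weighted sum of recent ‖Δ^i‖² terms is exactly what absorbs the error introduced by stale reads, which is why ξ_k decreases monotonically while the raw objective may fluctuate.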
In the lemma below, we present a fundamental inequality, which states that, regardless of which block i_k is updated and which x̂^k is used to compute the update in (2), there is a sufficient descent in our Lyapunov function.

Lemma 1 (sufficient descent for bounded delays) Conditions: Let f be a function (possibly nonconvex) with L-Lipschitz gradient and finite min f. Let (x^k)_{k≥0} be generated by the async-BCD algorithm (2), and let the delays be bounded by τ. Choose the step size γ_k ≡ γ = 2c/(2τ + 1) for arbitrary fixed 0 < c < 1. Result: we can choose ε > 0 to obtain

ξ_k − ξ_{k+1} ≥ (1/2)(1/γ − 1/2 − τ)L · ‖Δ^k‖²₂. (11)

Consequently,

lim_k ‖Δ^k‖₂ = 0 (12)   and   min_{1≤i≤k} ‖Δ^i‖₂ = o(1/√k). (13)

So we have that the smallest gradient obtained by step k decays faster than O(1/√k). Based on the lemma, we obtain a very general result for nonconvex problems.

Theorem 1 Assume the conditions of Lemma 1, for f that may be nonconvex. Under the deterministic block rule, we have

lim_k ‖∇f(x^k)‖₂ = 0,   min_{1≤i≤k} ‖∇f(x^i)‖₂ = o(1/√k). (14)

This rate has the same order of magnitude as standard gradient descent for a nonconvex function.

2.3 Stochastic block rule

Under the stochastic block rule, an agent picks a block from {1, 2, . . . , N} uniformly at random at the beginning of each update. For the kth completed update, the index of the chosen block is i_k. Our result in this subsection relies on the following assumption on the random variable i_k:

E_{i_k}(‖∇_{i_k} f(x^{k−τ})‖² | χ_{k−τ}) = (1/N) Σ_{i=1}^{N} ‖∇_i f(x^{k−τ})‖², (15)

where χ_k = σ(x^0, x^1, . . .
, x^k, j⃗(0), j⃗(1), . . . , j⃗(k)), k = 0, 1, . . ., is the filtration that represents the information that is accumulated as our algorithm runs. It is important to note that (15) uses x^{k−τ} instead of x̂^k because x̂^k may depend on i_k. This condition essentially states that, given the information at iteration k − τ and earlier, i_k is uniform at step k. We can relax (15) to nearly-uniform distributions. Indeed, Theorem 2 below only needs that every block have a nonzero probability of being updated given χ_{k−τ}, that is,

E(‖∇_{i_k} f(x^{k−τ})‖² | χ_{k−τ}) ≥ (ε̄/N) Σ_{i=1}^{N} ‖∇_i f(x^{k−τ})‖², (16)

for some universal ε̄ > 0. The interpretation is that though i_k and ∇f(x^{k−τ}) are dependent, since τ iterations have passed, ∇f(x^{k−τ}) has a limited influence on the distribution of i_k: there is a minimum probability that each index is chosen given sufficient time. For convenience and simplicity, we assume (15) instead of (16).

Next, we present a general result for a possibly nonconvex objective f.

Theorem 2 Assume the conditions of Lemma 1. Under the stochastic block rule and assumption (15), we have:

lim_k E‖∇f(x^k)‖₂ = 0,   min_{1≤i≤k} E‖∇f(x^i)‖²₂ = o(1/k). (17)

2.3.1 Sublinear rate under convexity

When the function f is convex, we can obtain convergence rates, for which we need a slightly modified Lyapunov function

F_k := f(x^k) + δ · Σ_{i=k−τ}^{k−1} (i − (k − τ) + 1)‖Δ^i‖²₂, (18)

where δ := [1 + (ε/(2τ))(1/γ − 1/2 − τ)] L/(2ε). Here, we assume τ ≥ 1. Since τ is just an upper bound of the delays, the delays can be 0. We also define π_k := E(F_k − min f) and S(k, τ) := Σ_{i=k−τ}^{k−1} ‖Δ^i‖²₂.

Lemma 2 Assume the conditions of Lemma 1. Furthermore, let f be convex and use the stochastic block rule. Let x̄^k denote the projection of x^k to argmin f, assumed to exist, and let β := max{8NL²/γ², (12N + 2)L²τ + δτ} and α := β/[(L/(4τ))(1/γ − 1/2 − τ)]. Then we have:

(π_k)² ≤ α(π_k − π_{k+1}) · (δτ E S(k, τ) + E‖x^k − x̄^k‖²₂). (19)

When τ = 1 (nearly no delay), we can obtain β = O(NL²/γ²) and α = O(βγ/L) = O(NL/γ), which matches the result of standard BCD. This is used to prove sublinear convergence.

Theorem 3 Assume the conditions of Lemma 1. Furthermore, let f be convex and coercive⁷, and use the stochastic block rule. Then we have:

E(f(x^k) − min f) = O(1/k). (20)

⁷A function f is coercive if ‖x‖ → ∞ implies f(x) → ∞.

2.3.2 Linear rate under convexity

We next consider when f is ν-restricted strongly convex⁸ in addition to having L-Lipschitz gradient. That is, for x ∈ dom(f), ⟨∇f(x), x − Proj_{argmin f}(x)⟩ ≥ ν · dist²(x, argmin f).

Theorem 4 Assume the conditions of Lemma 1. Furthermore, let f be ν-restricted strongly convex, and use the stochastic block rule. Then we have:

E(f(x^k) − min f) = O(c^k), (21)

where c := (α/min{ν, 1}) / (1 + α/min{ν, 1}) < 1 for α given in Lemma 2.

3 Stochastic unbounded delay

In this part, the delay vector j⃗(k) is allowed to be an unbounded random variable.
Under some mild restrictions on the distribution of j⃗(k), we can still establish convergence. In light of our continuous-time analysis, we must develop a new bound for the last inner product in (7), which requires the tail distribution of j(k) to decay sufficiently fast. Specifically, we define a sequence of fixed parameters p_j such that p_j ≥ P(j(k) = j) for all k, and set s_l = Σ_{j=l}^{+∞} j p_j and c_i = Σ_{l=i}^{+∞} s_l. Clearly, c_0 is larger than c_1, c_2, . . ., and we need c_0 to be finite. Distributions with p_j = O(j^{−t}) for t > 3 and exponential-decay distributions satisfy this requirement. Define the Lyapunov function G_k as

G_k := f(x^k) + δ̄ · Σ_{i=0}^{k−1} c_{k−1−i}‖Δ^i‖²₂,

where δ̄ := L/(2ε) + (1/γ − 1/2)(L/c_0) − L/√c_0. To simplify the presentation, we define R(k) := Σ_{i=0}^{k} c_{k−i} E‖Δ^i‖²₂.

Lemma 3 (sufficient descent for stochastic unbounded delays) Conditions: Let f be a function (which may be nonconvex) with L-Lipschitz gradient and finite min f. Let the delays be stochastic unbounded. Use the step size γ_k ≡ γ = 2c/(2√c_0 + 1) for arbitrary fixed 0 < c < 1. Results: we can set ε > 0 to ensure sufficient descent:

E[G_k − G_{k+1}] ≥ (L/c_0)(1/γ − 1/2 − √c_0) R(k). (22)

And we have

lim_k E‖Δ^k‖₂ = 0 and lim_k E‖d^k‖₂ = 0. (23)

3.1 Deterministic block rule

Theorem 5 Let the conditions of Lemma 3 hold for f. Under the deterministic block rule (§1.2), we have:

lim_k E‖∇f(x^k)‖₂ = 0. (24)

3.2 Stochastic block rule

Recall that under the stochastic block rule, the block to update is selected uniformly at random from {1, 2, . . . , N}.
The previous assumption (15), which was made for bounded delays, needs to be updated to the following assumption for unbounded delays:

E_{i_k}(‖∇_{i_k} f(x^{k−j(k)})‖²₂) = (1/N) Σ_{i=1}^{N} ‖∇_i f(x^{k−j(k)})‖²₂, (25)

where j(k) is still a variable on both sides. As argued below (15), the uniform distribution can easily be relaxed to a nearly-uniform distribution, but we use the former for simplicity.

Theorem 6 Let the conditions of Lemma 3 hold. Under the stochastic block rule and assumption (25), we have

lim_k E‖∇f(x^k)‖₂ = 0. (26)

3.2.1 Convergence rate

When f is convex, we can derive convergence rates for φ_k := E(G_k − min f).

Lemma 4 Let the conditions of Lemma 3 hold, and let f be convex. Let x̄^k denote the projection of x^k to argmin f. Let β̄ = max{8NL²/(γ²c_0), (12N + 2)L² + δ̄} and ᾱ = β̄/[(L/2)(1/γ − 1/2 − √c_0)]. Then we have

(φ_k)² ≤ ᾱ(φ_k − φ_{k+1}) · (δ̄R(k) + E‖x^k − x̄^k‖²₂). (27)

A sublinear convergence rate can be obtained if sup_k {E‖x^k − x̄^k‖²₂} < +∞, which can be ensured by adding a projection to a large artificial box set that surely contains the solution. Here we only present a linear convergence result.

Theorem 7 Let the conditions of Lemma 3 hold. In addition, let f be ν-restricted strongly convex and set the step size γ_k ≡ γ < 2/(2√c_0 + 1). Then, with c = (ᾱ max{1, 1/ν}) / (1 + ᾱ max{1, 1/ν}) < 1,

E(f(x^k) − min f) = O(c^k). (28)

⁸A condition weaker than ν-strong convexity and useful for problems involving an underdetermined linear mapping Ax; see [11, 13].

4 Deterministic unbounded delays

In this part, we consider deterministic unbounded delays, which require delay-adaptive step sizes. Set a positive sequence (ε_i)_{i≥0} (which can be optimized later given actual delays) such that κ_i := Σ_{j=i}^{+∞} ε_j obeys κ_1 < +∞. Set D_j := 1/2 + κ_1/2 + Σ_{i=1}^{j} 1/(2ε_i). We use a new Lyapunov function

H_k := f(x^k) + (L/2) Σ_{i=1}^{+∞} κ_i‖Δ^{k−i}‖²₂.

Let T ≥ lim inf j(k), and let Q_T be the subsequence of N where the current delay is less than T. We prove convergence on the family of subsequences x^k, k ∈ Q_T. The algorithm is independent of the choice of T: the algorithm is run as before, and after completion, an arbitrarily large T ≥ lim inf j(k) can be chosen. Extending the result to standard sequence convergence has proven intractable.

Lemma 5 (sufficient descent for unbounded deterministic delays) Conditions: Let f be a function (which may be nonconvex) with L-Lipschitz gradient and finite min f. The delays j(k) are deterministic and obey lim inf j(k) < ∞. Use the step size γ_k = c/D_{j(k)} for arbitrary fixed 0 < c < 1. Results: We have

H_k − H_{k+1} ≥ L(1/γ_k − D_{j(k)})‖Δ^k‖²₂,   lim_k ‖Δ^k‖₂ = 0. (29)

On any subsequence Q_T (for arbitrarily large T ≥ lim inf j(k)), we have:

lim_{(k∈Q_T)→∞} ‖∇_{i_k} f(x̂^k)‖₂ = 0,   lim_{(k∈Q_T)→∞} ‖d^k‖₂ = 0.

To prove our next result, we need a new assumption: essentially cyclically semi-unbounded delay (ECSD), which is slightly stronger than the essentially cyclic assumption. In every window of N′ steps, every index i is updated at least once with a delay less than B (at iteration K(i, t)). The number B just needs to exist and can be arbitrarily large. It does not affect the step size.
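The delay-adaptive step size of this section can be sketched concretely. The sketch follows our reading of the definitions above (κ_i as a tail sum of a summable sequence ε_i, D_j built from κ_1 and the partial sums of 1/(2ε_i), and γ_k = c/D_{j(k)} for the measured current delay); the concrete choice ε_i = 1/i² is an illustrative assumption.

```python
import math

# Delay-adaptive step size sketch: gamma_k = c / D_{j(k)}.
# Assumed choice eps_i = 1/i**2, which is summable, so kappa_1 is finite.
EPS = lambda i: 1.0 / i**2

def kappa(i, terms=100_000):
    # truncated tail sum kappa_i = sum_{j >= i} eps_j; sum_{j>=1} 1/j^2 = pi^2/6
    return sum(EPS(j) for j in range(i, terms))

def D(j):
    # D_j = 1/2 + kappa_1/2 + sum_{i=1}^{j} 1/(2*eps_i), per our reading above
    return 0.5 + kappa(1) / 2 + sum(1.0 / (2 * EPS(i)) for i in range(1, j + 1))

def step_size(current_delay, c=0.9):
    # larger measured delay -> larger D_{j(k)} -> smaller step
    return c / D(current_delay)

for j in (0, 1, 5, 20):
    print(j, step_size(j))
```

The printed values shrink as the observed delay grows, which is the intended behavior: the step size adapts downward whenever a stale update arrives, without ever needing a global bound on the delays.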
In every window of N′ steps, every index i is updated at least once with a delay less than B (at iteration K(i, t)). The number B merely needs to exist and can be arbitrarily large; it does not affect the step size.

Theorem 8 Let the conditions of Lemma 5 hold. For the deterministic index rule under the ECSD assumption, for $T \ge B$, we have

    $\lim_{(k\in Q_T)\to\infty} \|\nabla f(x^k)\|_2 = 0$.    (30)

5 Conclusion

In summary, we have proven a selection of convergence results for async-BCD under bounded and unbounded delays, and under stochastic and deterministic block choices. These results do not require the independence assumption that appears in the vast majority of other work so far, and therefore better model the behavior of real asynchronous solvers. They were obtained with Lyapunov-function techniques that treat the delays directly rather than modeling them as noise. Future work may involve obtaining a more exhaustive list of convergence results, sharper convergence rates, and an extension to asynchronous stochastic gradient descent-like algorithms, such as SDCA.

References

[1] Loris Cannelli, Francisco Facchinei, Vyacheslav Kungurtsev, and Gesualdo Scutari. Asynchronous parallel algorithms for nonconvex big-data optimization: Model and convergence. arXiv preprint arXiv:1607.04818, 2016.

[2] Loris Cannelli, Francisco Facchinei, Vyacheslav Kungurtsev, and Gesualdo Scutari. Asynchronous parallel algorithms for nonconvex big-data optimization. Part II: Complexity and numerical results. arXiv preprint arXiv:1701.04900, 2017.

[3] Yat Tin Chow, Tianyu Wu, and Wotao Yin. Cyclic coordinate update algorithms for fixed-point problems: Analysis and applications. SIAM Journal on Scientific Computing, accepted, 2017.

[4] Damek Davis. The asynchronous PALM algorithm for nonsmooth nonconvex problems. arXiv preprint arXiv:1604.00526, 2016.

[5] Damek Davis and Wotao Yin.
Convergence rate analysis of several splitting schemes. In Splitting Methods in Communication, Imaging, Science, and Engineering, pages 115–163. Springer, 2016.

[6] Christopher M. De Sa, Ce Zhang, Kunle Olukotun, and Christopher Ré. Taming the wild: A unified analysis of Hogwild-style algorithms. In Advances in Neural Information Processing Systems, pages 2674–2682, 2015.

[7] Jerome Friedman, Trevor Hastie, Holger Höfling, and Robert Tibshirani. Pathwise coordinate optimization. The Annals of Applied Statistics, 1(2):302–332, 2007.

[8] Jerome Friedman, Trevor Hastie, and Rob Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1, 2010.

[9] Robert Hannah and Wotao Yin. On unbounded delays in asynchronous parallel fixed-point algorithms. arXiv preprint arXiv:1609.04746, 2016.

[10] Robert Hannah and Wotao Yin. More Iterations per Second, Same Quality – Why Asynchronous Algorithms may Drastically Outperform Traditional Ones. arXiv preprint arXiv:1708.05136, 2017.

[11] Ming-Jun Lai and Wotao Yin. Augmented ℓ1 and nuclear-norm models with a globally linearly convergent algorithm. SIAM Journal on Imaging Sciences, 6(2):1059–1091, 2013.

[12] Rémi Leblond, Fabian Pedregosa, and Simon Lacoste-Julien. ASAGA: Asynchronous parallel SAGA. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, pages 46–54, 2017.

[13] Ji Liu and Stephen J. Wright. Asynchronous stochastic coordinate descent: Parallelism and convergence properties. SIAM Journal on Optimization, 25(1):351–376, 2015.

[14] Ji Liu, Stephen J. Wright, Christopher Ré, Victor Bittorf, and Srikrishna Sridhar. An asynchronous parallel stochastic coordinate descent algorithm. J. Mach. Learn.
Res., 16(1):285–322, 2015.

[15] Horia Mania, Xinghao Pan, Dimitris Papailiopoulos, Benjamin Recht, Kannan Ramchandran, and Michael I. Jordan. Perturbed iterate analysis for asynchronous stochastic optimization. arXiv preprint arXiv:1507.06970, 2015.

[16] Zhimin Peng, Yangyang Xu, Ming Yan, and Wotao Yin. ARock: An algorithmic framework for asynchronous parallel coordinate updates. SIAM Journal on Scientific Computing, 38(5):A2851–A2879, 2016.

[17] Zhimin Peng, Yangyang Xu, Ming Yan, and Wotao Yin. On the convergence of asynchronous parallel iteration with arbitrary delays. arXiv preprint arXiv:1612.04425, 2016.

[18] Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. Hogwild!: A lock-free approach to parallelizing stochastic gradient descent. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 693–701. Curran Associates, Inc., 2011.

[19] Ruoyu Sun and Yinyu Ye. Worst-case complexity of cyclic coordinate descent: O(n²) gap with randomized version. arXiv preprint arXiv:1604.07130, 2017.

[20] Yangyang Xu. Asynchronous parallel primal-dual block update methods. arXiv preprint arXiv:1705.06391, 2017.