{"title": "The Sound of APALM Clapping: Faster Nonsmooth Nonconvex Optimization with Stochastic Asynchronous PALM", "book": "Advances in Neural Information Processing Systems", "page_first": 226, "page_last": 234, "abstract": "We introduce the Stochastic Asynchronous Proximal Alternating Linearized Minimization (SAPALM) method, a block coordinate stochastic proximal-gradient method for solving nonconvex, nonsmooth optimization problems. SAPALM is the first asynchronous parallel optimization method that provably converges on a large class of nonconvex, nonsmooth problems. We prove that SAPALM matches the best known rates of convergence --- among synchronous or asynchronous methods --- on this problem class. We provide upper bounds on the number of workers for which we can expect to see a linear speedup, which match the best bounds known for less complex problems, and show that in practice SAPALM achieves this linear speedup. We demonstrate state-of-the-art performance on several matrix factorization problems.", "full_text": "The Sound of APALM Clapping: Faster Nonsmooth\n\nNonconvex Optimization with Stochastic\n\nAsynchronous PALM\n\nDamek Davis and Madeleine Udell\n\nCornell University\n\n{dsd95,mru8}@cornell.edu\n\nBrent Edmunds\n\nUniversity of California, Los Angeles\nbrent.edmunds@math.ucla.edu\n\nAbstract\n\nWe introduce the Stochastic Asynchronous Proximal Alternating Linearized Min-\nimization (SAPALM) method, a block coordinate stochastic proximal-gradient\nmethod for solving nonconvex, nonsmooth optimization problems. SAPALM is the\n\ufb01rst asynchronous parallel optimization method that provably converges on a large\nclass of nonconvex, nonsmooth problems. We prove that SAPALM matches the\nbest known rates of convergence \u2014 among synchronous or asynchronous methods\n\u2014 on this problem class. 
We provide upper bounds on the number of workers\nfor which we can expect to see a linear speedup, which match the best bounds\nknown for less complex problems, and show that in practice SAPALM achieves\nthis linear speedup. We demonstrate state-of-the-art performance on several matrix\nfactorization problems.\n\n1\n\nIntroduction\n\nParallel optimization algorithms often feature synchronization steps: all processors wait for the last to\n\ufb01nish before moving on to the next major iteration. Unfortunately, the distribution of \ufb01nish times is\nheavy tailed. Hence as the number of processors increases, most processors waste most of their time\nwaiting. A natural solution is to remove any synchronization steps: instead, allow each idle processor\nto update the global state of the algorithm and continue, ignoring read and write con\ufb02icts whenever\nthey occur. Occasionally one processor will erase the work of another; the hope is that the gain from\nallowing processors to work at their own paces offsets the loss from a sloppy division of labor.\nThese asynchronous parallel optimization methods can work quite well in practice, but it is dif\ufb01cult\nto tune their parameters: lock-free code is notoriously hard to debug. For these problems, there\nis nothing as practical as a good theory, which might explain how to set these parameters so as to\nguarantee convergence.\nIn this paper, we propose a theoretical framework guaranteeing convergence of a class of asynchronous\nalgorithms for problems of the form\n\nm(cid:88)\n\nj=1\n\nminimize\n\n(x1,...,xm)\u2208H1\u00d7...\u00d7Hm\n\nf (x1, . . . , xm) +\n\nrj(xj),\n\n(1)\n\nwhere f is a continuously differentiable (C 1) function with an L-Lipschitz gradient, each rj is a lower\nsemicontinuous (not necessarily convex or differentiable) function, and the sets Hj are Euclidean\nspaces (i.e., Hj = Rnj for some nj \u2208 N). 
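As a concrete toy instance of problem (1) (our own illustration, not from the paper: f is a least-squares loss, which is C^1 with Lipschitz gradient, and each r_j = λ‖·‖_1 is a nonsmooth regularizer; the sizes, λ, and the two-block split are arbitrary), the basic block proximal-gradient update that PALM-type methods build on can be sketched as:

```python
# A toy instance of problem (1): f(x) = 0.5*||Ax - b||^2 (smooth loss) and
# r_j = lam*||.||_1 (nonsmooth regularizer). Everything here -- sizes, lam,
# the two-block split -- is an illustration, not from the paper.
import numpy as np

def soft_threshold(v, t):
    # Proximal operator of t*||.||_1: shrink each entry toward zero by t.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def make_problem(seed=0, n=40, d=20):
    rng = np.random.default_rng(seed)
    return rng.standard_normal((n, d)), rng.standard_normal(n)

def objective(A, b, x, lam=0.1):
    # f(x) + sum_j r_j(x_j), the composite objective of (1).
    return 0.5 * np.linalg.norm(A @ x - b) ** 2 + lam * np.abs(x).sum()

def block_prox_gradient_step(A, b, x, block, gamma, lam=0.1):
    # One block update: x_j <- prox_{gamma*r_j}(x_j - gamma*grad_j f(x)).
    g = A.T @ (A @ x - b)
    x = x.copy()
    x[block] = soft_threshold(x[block] - gamma * g[block], gamma * lam)
    return x
```

Cycling this update over blocks with γ ≤ 1/L is the synchronous (PALM-style) baseline; the asynchronous methods studied below additionally allow the gradient to be evaluated at a stale, delayed iterate.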
This problem class includes many (convex and nonconvex)\nsignal recovery problems, matrix factorization problems, and, more generally, any generalized low\nrank model [20]. Following terminology from these domains, we view f as a loss function and each\nrj as a regularizer. For example, f might encode the mis\ufb01t between the observations and the model,\nwhile the regularizers rj encode structural constraints on the model such as sparsity or nonnegativity.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n\fMany synchronous parallel algorithms have been proposed to solve (1), including stochastic proximal-\ngradient and block coordinate descent methods [22, 3]. Our asynchronous variants build on these\nsynchronous methods, and in particular on proximal alternating linearized minimization (PALM) [3].\nThese asynchronous variants depend on the same parameters as the synchronous methods, such as\na step size parameter, but also new ones, such as the maximum allowable delay. Our contribution\nhere is to provide a convergence theory to guide the choice of those parameters within our control\n(such as the stepsize) in light of those out of our control (such as the maximum delay) to ensure\nconvergence at the rate guaranteed by theory. We call this algorithm the Stochastic Asynchronous\nProximal Alternating Linearized Minimization method, or SAPALM for short.\nLock-free optimization is not a new idea. Many of the \ufb01rst theoretical results for such algorithms\nappear in the textbook [2], written over a generation ago. But within the last few years, asynchronous\nstochastic gradient and block coordinate methods have become newly popular, and enthusiasm in\npractice has been matched by progress in theory. 
Guaranteed convergence for these algorithms has been established for convex problems; see, for example, [13, 15, 16, 12, 11, 4, 1]. Asynchrony has also been used to speed up algorithms for nonconvex optimization, in particular, for learning deep neural networks [6] and completing low-rank matrices [23]. In contrast to the convex case, the existing asynchronous convergence theory for nonconvex problems is limited to the following four scenarios: stochastic gradient methods for smooth unconstrained problems [19, 10]; block coordinate methods for smooth problems with separable, convex constraints [18]; block coordinate methods for the general problem (1) [5]; and deterministic distributed proximal-gradient methods for smooth nonconvex loss functions with a single nonsmooth, convex regularizer [9]. A general block-coordinate stochastic gradient method with nonsmooth, nonconvex regularizers is still missing from the theory. We aim to fill this gap.

Contributions. We introduce SAPALM, the first asynchronous parallel optimization method that provably converges for all nonconvex, nonsmooth problems of the form (1). SAPALM is a block coordinate stochastic proximal-gradient method that generalizes the deterministic PALM method of [5, 3]. When applied to problem (1), we prove that SAPALM matches the best known rates of convergence, due to [8] in the case where each r_j is convex and m = 1: that is, asynchrony carries no theoretical penalty for convergence speed. We test SAPALM on a few example problems and compare to a synchronous implementation, showing a linear speedup.

Notation. Let m ∈ N denote the number of coordinate blocks. We let H = H_1 × ... × H_m. For every x ∈ H, each partial gradient ∇_j f(x_1, ..., x_{j−1}, ·, x_{j+1}, ..., x_m) : H_j → H_j is L_j-Lipschitz continuous; we let L̲ := min_j{L_j} ≤ max_j{L_j} =: L̄. The number τ ∈ N is the maximum allowable delay. Define the aggregate regularizer r : H → (−∞, ∞] as r(x) = Σ_{j=1}^m r_j(x_j). For each j ∈ {1, ..., m}, y ∈ H_j, and γ > 0, define the proximal operator

prox_{γ r_j}(y) := argmin_{x_j ∈ H_j} { r_j(x_j) + (1/(2γ)) ‖x_j − y‖² }

For convex r_j, prox_{γ r_j}(y) is uniquely defined, but for nonconvex problems, it is, in general, a set. We make the mild assumption that for all y ∈ H_j, we have prox_{γ r_j}(y) ≠ ∅. A slight technicality arises from our ability to choose among multiple elements of prox_{γ r_j}(y), especially in light of the stochastic nature of SAPALM. Thus, for all y, j and γ > 0, we fix an element

ζ_j(y, γ) ∈ prox_{γ r_j}(y).    (2)

By [17, Exercise 14.38], we can assume that ζ_j is measurable, which enables us to reason with expectations wherever they involve ζ_j. As shorthand, we use prox_{γ r_j}(y) to denote the (unique) choice ζ_j(y, γ). For any random variable or vector X, we let E_k[X] = E[X | x^k, ..., x^0, ν^k, ..., ν^0] denote the conditional expectation of X with respect to the sigma algebra generated by the history of SAPALM.

2 Algorithm Description

Algorithm 1 displays the SAPALM method. We highlight a few features of the algorithm which we discuss in more detail below.

Algorithm 1 SAPALM [Local view]
Input: x ∈ H
1: All processors in parallel do
2: loop
3:   Randomly select a coordinate block j ∈ {1, ..., m}
4:   Read x from shared memory
5:   Compute g = ∇_j f(x) + ν_j
6:   Choose stepsize γ_j ∈ R++    ▷ According to Assumption 3
7:   x_j ← prox_{γ_j r_j}(x_j − γ_j g)    ▷ According to (2)

• Inconsistent iterates. Other processors may write updates to x in the time required to read x from memory.
• Coordinate blocks. When the coordinate blocks x_j are low dimensional, it reduces the likelihood that one update will be immediately erased by another, simultaneous update.
• Noise. The noise ν ∈ H is a random variable that we use to model injected noise. It can be set to 0, or chosen to accelerate each iteration, or to avoid saddle points.

Algorithm 1 has an equivalent (mathematical) description which we present in Algorithm 2, using an iteration counter k which is incremented each time a processor completes an update. This iteration counter is not required by the processors themselves to compute the updates. In Algorithm 1, a processor might not have access to the shared-memory's global state, x^k, at iteration k. Rather, because all processors can continuously update the global state while other processors are reading, local processors might only read the inconsistently delayed iterate x^{k−d_k} = (x_1^{k−d_{k,1}}, ..., x_m^{k−d_{k,m}}), where the delays d_k are integers less than τ, and x^l = x^0 when l < 0.

Algorithm 2 SAPALM [Global view]
Input: x^0 ∈ H
1: for k ∈ N do
2:   Randomly select a coordinate block j_k ∈ {1, ..., m}
3:   Read x^{k−d_k} = (x_1^{k−d_{k,1}}, ..., x_m^{k−d_{k,m}}) from shared memory
4:   Compute g^k = ∇_{j_k} f(x^{k−d_k}) + ν^k_{j_k}
5:   Choose stepsize γ^k_{j_k} ∈ R++    ▷ According to Assumption 3
6:   for j = 1, ..., m do
7:     if j = j_k then
8:       x^{k+1}_{j_k} ← prox_{γ^k_{j_k} r_{j_k}}(x^k_{j_k} − γ^k_{j_k} g^k)    ▷ According to (2)
9:     else
10:      x^{k+1}_j ← x^k_j

2.1 Assumptions on the Delay, Independence, Variance, and Stepsizes
Assumption 1 (Bounded Delay). There exists some τ ∈ N such that, for all k ∈ N, the sequence of coordinate delays lie within d_k ∈ {0, ..., τ}^m.
Assumption 2 (Independence).
The indices {j_k}_{k∈N} are uniformly distributed and collectively IID. They are independent from the history of the algorithm x^k, ..., x^0, ν^k, ..., ν^0 for all k ∈ N.
We employ two possible restrictions on the noise sequence ν^k and the sequence of allowable stepsizes γ^k_j, all of which lead to different convergence rates:
Assumption 3 (Noise Regimes and Stepsizes). Let σ_k² := E_k[‖ν^k‖²] denote the expected squared norm of the noise, and let a ∈ (1, ∞). Assume that E_k[ν^k] = 0 and that there is a sequence of weights {c_k}_{k∈N} ⊆ [1, ∞) such that

(∀k ∈ N), (∀j ∈ {1, ..., m})    γ^k_j := 1 / (a c_k (L_j + 2L̄τ m^{−1/2})),

which we choose using the following two rules, both of which depend on the growth of σ_k:

Summable.    Σ_{k=0}^∞ σ_k² < ∞  ⟹  c_k ≡ 1;
α-Diminishing. (α ∈ (0, 1))    σ_k² = O((k + 1)^{−α})  ⟹  c_k = Θ((k + 1)^{(1−α)}).

More noise, measured by σ_k, results in worse convergence rates and stricter requirements regarding which stepsizes can be chosen. We provide two stepsize choices which, depending on the noise regime, interpolate between Θ(1) and Θ(k^{1−α}) for any α ∈ (0, 1). Larger stepsizes lead to convergence rates of order O(k^{−1}), while smaller ones lead to order O(k^{−α}).

2.2 Algorithm Features

Inconsistent Asynchronous Reading. SAPALM allows asynchronous access patterns. A processor may, at any time, and without notifying other processors:

1. Read. While other processors are writing to shared-memory, read the possibly out-of-sync, delayed coordinates x_1^{k−d_{k,1}}, ..., x_m^{k−d_{k,m}}.
2. Compute. Locally, compute the partial gradient ∇_{j_k} f(x_1^{k−d_{k,1}}, ..., x_m^{k−d_{k,m}}).
3. Write. After computing the gradient, replace the j_k-th coordinate with

x^{k+1}_{j_k} ∈ argmin_y { r_{j_k}(y) + ⟨∇_{j_k} f(x^{k−d_k}) + ν^k_{j_k}, y − x^k_{j_k}⟩ + (1/(2γ^k_{j_k})) ‖y − x^k_{j_k}‖² }.

Uncoordinated access eliminates waiting time for processors, which speeds up computation. The processors are blissfully ignorant of any conflict between their actions, and the paradoxes these conflicts entail: for example, the states x_1^{k−d_{k,1}}, ..., x_m^{k−d_{k,m}} need never have simultaneously existed in memory. Although we write the method with a global counter k, the asynchronous processors need not be aware of it; and the requirement that the delays d_k remain bounded by τ does not demand coordination, but rather serves only to define τ.

What Does the Noise Model Capture? SAPALM is the first asynchronous PALM algorithm to allow and analyze noisy updates. The stochastic noise, ν^k, captures three phenomena:

1. Computational Error. Noise due to random computational error.
2. Avoiding Saddles. Noise deliberately injected for the purpose of avoiding saddles, as in [7].
3. Stochastic Gradients. Noise due to stochastic approximations of delayed gradients.

Of course, the noise model also captures any combination of the above phenomena. The last one is, perhaps, the most interesting: it allows us to prove convergence for a stochastic- or minibatch-gradient version of APALM, rather than requiring processors to compute a full (delayed) gradient.
Stochastic gradients can be computed faster than their batch counterparts, allowing more frequent updates.

2.3 SAPALM as an Asynchronous Block Mini-Batch Stochastic Proximal-Gradient Method
In Algorithm 1, any stochastic estimator ∇f(x^{k−d_k}; ξ) of the gradient may be used, as long as E_k[∇f(x^{k−d_k}; ξ)] = ∇f(x^{k−d_k}), and E_k[‖∇f(x^{k−d_k}; ξ) − ∇f(x^{k−d_k})‖²] ≤ σ². In particular, if problem (1) takes the form

minimize_{x∈H}  E_ξ[f(x_1, ..., x_m; ξ)] + Σ_{j=1}^m r_j(x_j),

then, in Algorithm 2, the stochastic mini-batch estimator g^k = m_k^{−1} Σ_{i=1}^{m_k} ∇f(x^{k−d_k}; ξ_i), where the ξ_i are IID, may be used in place of ∇f(x^{k−d_k}) + ν^k. A quick calculation shows that E_k[‖g^k − ∇f(x^{k−d_k})‖²] = O(m_k^{−1}). Thus, any increasing batch size m_k = Ω((k + 1)^α), with α ∈ (0, 1), conforms to Assumption 3.
When nonsmooth regularizers are present, all known convergence rate results for nonconvex stochastic gradient algorithms require the use of increasing, rather than fixed, minibatch sizes; see [8, 22] for analogous, synchronous algorithms.

3 Convergence Theorem

Measuring Convergence for Nonconvex Problems. For nonconvex problems, it is standard to measure convergence (to a stationary point) by the expected violation of stationarity, which for us is the (deterministic) quantity:

S_k := E[ Σ_{j=1}^m ‖ (1/γ^k_j)(x^k_j − w^k_j) + ν^k_j ‖² ],  where  (∀j ∈ {1, ..., m})  w^k_j = prox_{γ^k_j r_j}(x^k_j − γ^k_j (∇_j f(x^{k−d_k}) + ν^k_j)).    (3)

A reduction to the case r ≡ 0 and d_k = 0 reveals that w^k_j = x^k_j − γ^k_j ∇_j f(x^k) − γ^k_j ν^k_j and, hence, S_k = E[‖∇f(x^k)‖²]. More generally, w^k_j − x^k_j + γ^k_j ν^k_j ∈ −γ^k_j (∂_L r_j(w^k_j) + ∇_j f(x^{k−d_k})), where ∂_L r_j is the limiting subdifferential of r_j [17] which, if r_j is convex, reduces to the standard convex subdifferential familiar from [14]. A messy but straightforward calculation shows that our convergence rates for S_k can be converted to convergence rates for elements of ∂_L r(w^k) + ∇f(w^k).
We present our main convergence theorem now and defer the proof to Section 4.
Theorem 1 (SAPALM Convergence Rates). Let {x^k}_{k∈N} ⊆ H be the SAPALM sequence created by Algorithm 2. Then, under Assumption 3 the following convergence rates hold: for all T ∈ N, if {ν^k}_{k∈N} is

1. Summable, then

min_{k=0,...,T} S_k ≤ E_{k∼P_T}[S_k] = O( m(L̄ + 2L̄τ m^{−1/2}) / (T + 1) );

2. α-Diminishing, then

min_{k=0,...,T} S_k ≤ E_{k∼P_T}[S_k] = O( (m(L̄ + 2L̄τ m^{−1/2}) + m log(T + 1)) / (T + 1)^α );

where, for all T ∈ N, P_T is the distribution on {0, ..., T} such that P_T(X = k) ∝ c_k^{−1}.

Effects of Delay and Linear Speedups. The m^{−1/2} term in the convergence rates presented in Theorem 1 prevents the delay τ from dominating our rates of convergence. In particular, as long as τ = O(√m), the convergence rates in the synchronous (τ = 0) and asynchronous cases are within a small constant factor of each other. In that case, because the work per iteration in the synchronous and asynchronous versions of SAPALM is the same, we expect a linear speedup: SAPALM with p processors will converge nearly p times faster than PALM, since the iteration counter will be updated p times as often. As a rule of thumb, τ is roughly proportional to the number of processors. Hence we can achieve a linear speedup on as many as O(√m) processors.

3.1 The Asynchronous Stochastic Block Gradient Method

If the regularizer r is identically zero, then the noise ν^k need not vanish in the limit. The following theorem guarantees convergence of asynchronous stochastic block gradient descent with a constant minibatch size. See the supplemental material for a proof.
Theorem 2 (SAPALM Convergence Rates (r ≡ 0)). Let {x^k}_{k∈N} ⊆ H be the SAPALM sequence created by Algorithm 2 in the case that r ≡ 0. If, for all k ∈ N, {E_k[‖ν^k‖²]}_{k∈N} is bounded (not necessarily diminishing) and

(∃a ∈ (1, ∞)), (∀k ∈ N), (∀j ∈ {1, ..., m})    γ^k_j := 1 / (a√k (L_j + 2Mτ m^{−1/2})),

then for all T ∈ N, we have

min_{k=0,...,T} S_k ≤ E_{k∼P_T}[S_k] = O( (m(L̄ + 2L̄τ m^{−1/2}) + m log(T + 1)) / √(T + 1) ),

where P_T is the distribution on {0, ..., T} such that P_T(X = k) ∝ k^{−1/2}.

4 Convergence Analysis

4.1 The Asynchronous Lyapunov Function
Key to the convergence of SAPALM is the following Lyapunov function, defined on H^{1+τ}, which aggregates not only the current state of the algorithm, as is common in synchronous algorithms, but also the history of the algorithm over the delayed time steps: (∀x(0), x(1), ..., x(τ) ∈ H)

Φ(x(0), x(1), ..., x(τ)) = f(x(0)) + r(x(0)) + (L̄ / (2√m)) Σ_{h=1}^τ (τ − h + 1) ‖x(h) − x(h − 1)‖².

This Lyapunov function appears in our convergence analysis through the following inequality, which is proved in the supplemental material.
Lemma 1 (Lyapunov Function Supermartingale Inequality). For all k ∈ N, let z^k = (x^k, ..., x^{k−τ}) ∈ H^{1+τ}. Then for all ε > 0, we have

E_k[Φ(z^{k+1})] ≤ Φ(z^k) − (1/(2m)) Σ_{j=1}^m ( 1/γ^k_j − (1 + ε)(L_j + 2L̄τ m^{−1/2}) ) E_k[‖w^k_j − x^k_j + γ^k_j ν^k_j‖²]
+ (1/(2m)) Σ_{j=1}^m γ^k_j ( 1 + γ^k_j (1 + ε^{−1})(L_j + 2L̄τ m^{−1/2}) ) E_k[‖ν^k_j‖²],

where for all j ∈ {1, ..., m}, we have w^k_j = prox_{γ^k_j r_j}(x^k_j − γ^k_j (∇_j f(x^{k−d_k}) + ν^k_j)). In particular, for σ_k = 0, we can take ε = 0 and assume the last line is zero.

Notice that if σ_k = ε = 0 and γ^k_j is chosen as suggested in Algorithm 2, the (conditional) expected value of the Lyapunov function is strictly decreasing.
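The stepsize rule that the analysis above relies on can be computed concretely. A minimal sketch, assuming the L_j, the maximum delay τ, and the noise regime are known, and taking the hidden constant in the Θ((k + 1)^{1−α}) weight rule to be 1 for illustration:

```python
# Stepsize schedule of Assumption 3:
#   gamma_j^k = 1 / (a * c_k * (L_j + 2*L_bar*tau/sqrt(m))),
# with c_k = 1 in the summable-noise regime and c_k growing like (k+1)^(1-alpha)
# in the alpha-diminishing regime (constant factor chosen as 1 here).
import math

def c_k(k, regime="summable", alpha=0.5):
    if regime == "summable":
        return 1.0
    return (k + 1) ** (1.0 - alpha)  # Theta((k+1)^(1-alpha)), constant = 1

def sapalm_stepsize(k, L_j, L_bar, tau, m, a=2.0, regime="summable", alpha=0.5):
    assert a > 1.0  # Assumption 3 requires a in (1, infinity)
    return 1.0 / (a * c_k(k, regime, alpha) * (L_j + 2.0 * L_bar * tau / math.sqrt(m)))
```

In the summable regime the stepsize is constant; in the α-diminishing regime it shrinks like (k + 1)^{−(1−α)}, which is what trades the O(k^{−1}) rate for the O(k^{−α}) rate.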
If σ_k is nonzero, the factor ε will be used in concert with the stepsize γ^k_j to ensure that noise does not cause the algorithm to diverge.

4.2 Proof of Theorem 1

For either noise regime, we define, for all k ∈ N and j ∈ {1, ..., m}, the factor ε := 2^{−1}(a − 1). With the assumed choice of γ^k_j and ε, Lemma 1 implies that the expected Lyapunov function decreases, up to a summable residual: with A^k_j := w^k_j − x^k_j + γ^k_j ν^k_j, we have

E[Φ(z^{k+1})] ≤ E[Φ(z^k)] − E[ (1/(2m)) Σ_{j=1}^m (1/γ^k_j)( 1 − (1 + ε)/(a c_k) ) ‖A^k_j‖² ]
+ (1/(2m)) Σ_{j=1}^m γ^k_j ( 1 + γ^k_j (1 + ε^{−1})(L_j + 2L̄τ m^{−1/2}) ) E[ E_k[‖ν^k_j‖²] ].    (4)

Two upper bounds follow from the definition of γ^k_j, the lower bound c_k ≥ 1, and the straightforward inequalities (a c_k)^{−1}(L̲ + 2L̄τ m^{−1/2})^{−1} ≥ γ^k_j ≥ (a c_k)^{−1}(L̄ + 2L̄τ m^{−1/2})^{−1}:

(1/(2m)) Σ_{j=1}^m γ^k_j ( 1 + γ^k_j (1 + ε^{−1})(L_j + 2L̄τ m^{−1/2}) ) E_k[‖ν^k_j‖²] ≤ (1 + (a c_k)^{−1}(1 + ε^{−1})) (σ_k²/c_k) / (2a(L̲ + 2L̄τ m^{−1/2}))

and

E[ (1/(2m)) Σ_{j=1}^m (1/γ^k_j)( 1 − (1 + ε)/(a c_k) ) ‖A^k_j‖² ] ≥ (1 − (1 + ε)a^{−1}) S_k / (2ma(L̄ + 2L̄τ m^{−1/2}) c_k).

Now rearrange (4), use E[Φ(z^{k+1})] ≥ inf_{x∈H}{f(x) + r(x)} and E[Φ(z^0)] = f(x^0) + r(x^0), and sum (4) over k to get

(Σ_{k=0}^T c_k^{−1} S_k) / (Σ_{k=0}^T c_k^{−1}) ≤ (2ma(L̄ + 2L̄τ m^{−1/2}) / (1 − (1 + ε)a^{−1})) · ( f(x^0) + r(x^0) − inf_{x∈H}{f(x) + r(x)} + Σ_{k=0}^T (1 + (a c_k)^{−1}(1 + ε^{−1})) (σ_k²/c_k) / (2a(L̲ + 2L̄τ m^{−1/2})) ) / (Σ_{k=0}^T c_k^{−1}).

The left hand side of this inequality is bounded from below by min_{k=0,...,T} S_k and is precisely the term E_{k∼P_T}[S_k]. What remains to be shown is an upper bound on the right hand side, which we will now call R_T.

If the noise is summable, then c_k ≡ 1, so Σ_{k=0}^T c_k^{−1} = (T + 1) and Σ_{k=0}^T σ_k²/c_k < ∞, which implies that R_T = O(m(L̄ + 2L̄τ m^{−1/2})(T + 1)^{−1}). If the noise is α-diminishing, then c_k = Θ(k^{(1−α)}), so Σ_{k=0}^T c_k^{−1} = Θ((T + 1)^α) and, because σ_k²/c_k = O(k^{−1}), there exists a B > 0 such that Σ_{k=0}^T σ_k²/c_k ≤ Σ_{k=0}^T B k^{−1} = O(log(T + 1)), which implies that R_T = O((m(L̄ + 2L̄τ m^{−1/2}) + m log(T + 1))(T + 1)^{−α}).

5 Numerical Experiments

In this section, we present numerical results to confirm that SAPALM delivers the expected performance gains over PALM. We confirm two properties: 1) SAPALM converges to values nearly as low as PALM given the same number of iterations, 2) SAPALM exhibits a near-linear speedup as the number of workers increases. All experiments use an Intel Xeon machine with 2 sockets and 10 cores per socket.
We use two different nonconvex matrix factorization problems to exhibit these properties, to which we apply two different SAPALM variants: one without noise, and one with stochastic gradient noise. For each of our examples, we generate a matrix A ∈ R^{n×n} with iid standard normal entries, where n = 2000.
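The near-linear speedup claimed here can be checked directly against the wall-clock times the paper reports for the sparse PCA experiment (Table 1, d = 10 column): the speedup at p workers is the single-worker time divided by the p-worker time. A minimal sketch, with the timings copied from Table 1:

```python
# Speedup from reported timings: S_k(p) = T_k(1) / T_k(p).
# The timings below are the d = 10 column of Table 1 (sparse PCA, 16 iterations).
timings_d10 = {1: 65.9972, 2: 33.464, 4: 17.5415, 8: 9.2376, 16: 4.934}

def speedup(timings, p):
    # Ratio of single-worker time to p-worker time for the same iterate count.
    return timings[1] / timings[p]

speedups = {p: round(speedup(timings_d10, p), 4) for p in timings_d10}
```

Dividing out the timings this way reproduces the d = 10 column of Table 2 (1, 1.9722, 3.7623, 7.1444, 13.376) up to rounding of the reported times.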
Although SAPALM is intended for use on much larger problems, using a small problem size makes write conflicts more likely, and so serves as an ideal setting to understand how asynchrony affects convergence.

1. Sparse PCA with Asynchronous Block Coordinate Updates. We minimize

argmin_{X,Y}  (1/2) ‖A − XᵀY‖_F² + λ‖X‖_1 + λ‖Y‖_1,    (5)

where X ∈ R^{d×n} and Y ∈ R^{d×n} for some d ∈ N. We solve this problem using SAPALM with no noise ν^k = 0.

2. Quadratically Regularized Firm Thresholding PCA with Asynchronous Stochastic Gradients. We minimize

argmin_{X,Y}  (1/2) ‖A − XᵀY‖_F² + λ(‖X‖_Firm + ‖Y‖_Firm) + (μ/2)(‖X‖_F² + ‖Y‖_F²),    (6)

where X ∈ R^{d×n}, Y ∈ R^{d×n}, and ‖·‖_Firm is the firm thresholding penalty proposed in [21]: a nonconvex, nonsmooth function whose proximal operator truncates small values to zero and preserves large values. We solve this problem using the stochastic gradient SAPALM variant from Section 2.3.

In both experiments X and Y are treated as coordinate blocks. Notice that for this problem, the SAPALM update decouples over the entries of each coordinate block. Each worker updates its coordinate block (say, X) by cycling through the coordinates of X and updating each in turn, restarting at a random coordinate after each cycle.
In Figures (1a) and (1c), we see objective function values plotted by iteration. By this metric, SAPALM performs as well as PALM, its single threaded variant; for the second problem, the curves for different thread counts all overlap. Note, in particular, that SAPALM does not diverge.
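The firm thresholding operator used in (6) truncates small values to zero and preserves large values, as described above. A sketch of that operator, assuming the classical firm-shrinkage form with two thresholds 0 < λ < μ (the closed form is an assumption here, since the paper defers to [21] for the definition):

```python
# Firm-thresholding operator, sketched under the assumption that it takes the
# classical firm-shrinkage form with thresholds 0 < lam < mu: entries below lam
# are truncated to zero, entries above mu pass through unchanged, and the middle
# range interpolates linearly between the two (so the map is continuous).
def firm_threshold(x, lam, mu):
    ax = abs(x)
    if ax <= lam:
        return 0.0        # truncate small values to zero
    if ax >= mu:
        return x          # preserve large values exactly
    sign = 1.0 if x > 0 else -1.0
    return sign * mu * (ax - lam) / (mu - lam)  # linear interpolation
```

Unlike soft thresholding, this map introduces no bias on large entries, which is what makes the penalty nonconvex.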
But SAPALM can add additional workers to increment the iteration counter more quickly, as seen in Figure 1b, allowing SAPALM to outperform its single threaded variant.
We measure the speedup S_k(p) of SAPALM by the (relative) time for p workers to produce k iterates

S_k(p) = T_k(1) / T_k(p),    (7)

where T_k(p) is the time to produce k iterates using p workers. Table 2 shows that SAPALM achieves near linear speedup for a range of variable sizes d. (Dashes — denote experiments not run.)

Figure 1: Sparse PCA ((1a) and (1b)) and Firm Thresholding PCA ((1c) and (1d)) tests for d = 10. (a) Iterates vs. objective; (b) Time (s) vs. objective; (c) Iterates vs. objective; (d) Time (s) vs. objective.

Table 1: Sparse PCA timing (seconds) for 16 iterations by problem size and thread count.
threads    d=10        d=20       d=100
1          65.9972     253.387    6144.9427
2          33.464      127.8973   –
4          17.5415     67.3267    –
8          9.2376      34.5614    833.5635
16         4.934       17.4362    416.8038

Table 2: Sparse PCA speedup for 16 iterations by problem size and thread count.
threads    d=10      d=20      d=100
1          1         1         1
2          1.9722    1.9812    –
4          3.7623    3.7635    –
8          7.1444    7.3315    7.3719
16         13.376    14.5322   14.743

Deviations from linearity can be attributed to a breakdown in the abstraction of a "shared memory" computer: as each worker modifies the "shared" variables X and Y, some communication is required to maintain cache coherency across all cores and processors. In addition, Intel Xeon processors share L3 cache between all cores on the processor. All threads compete for the same L3 cache space, slowing down each iteration. For small d, write conflicts are more likely; for large d, communication to maintain cache coherency dominates.

6 Discussion

A few straightforward generalizations of our work are possible; we omit them to simplify notation.

Removing the log factors.
The log factors in Theorem 1 can easily be removed by fixing a maximum number of iterations for which we plan to run SAPALM and adjusting the c_k factors accordingly, as in [14, Equation (3.2.10)].

Cluster points of {x^k}_{k∈N}. Using the strategy employed in [5], it is possible to show that all cluster points of {x^k}_{k∈N} are (almost surely) stationary points of f + r.

Weakened Assumptions on Lipschitz Constants. We can weaken our assumptions to allow L_j to vary: we can assume L_j(x_1, ..., x_{j−1}, ·, x_{j+1}, ..., x_m)-Lipschitz continuity of each partial gradient ∇_j f(x_1, ..., x_{j−1}, ·, x_{j+1}, ..., x_m) : H_j → H_j, for every x ∈ H.

7 Conclusion

This paper presented SAPALM, the first asynchronous parallel optimization method that provably converges on a large class of nonconvex, nonsmooth problems. We provide a convergence theory for SAPALM, and show that with the parameters suggested by this theory, SAPALM achieves a near linear speedup over serial PALM. As a special case, we provide the first convergence rate for (synchronous or asynchronous) stochastic block proximal gradient methods for nonconvex regularizers. These results give specific guidance to ensure fast convergence of practical asynchronous methods on a large class of important, nonconvex optimization problems, and pave the way towards a deeper understanding of stability of these methods in the presence of noise.

References
[1] A. Agarwal and J. C. Duchi. Distributed delayed stochastic optimization. In 2012 IEEE 51st IEEE Conference on Decision and Control (CDC), pages 5451–5452, Dec 2012.
[2] D. P. Bertsekas and J. N. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods, volume 23. Prentice-Hall, 1989.
[3] J. Bolte, S. Sabach, and M. Teboulle.
Proximal alternating linearized minimization for nonconvex and\n\nnonsmooth problems. Mathematical Programming, 146(1-2):459\u2013494, 2014.\n\n[4] D. Davis. SMART: The Stochastic Monotone Aggregated Root-Finding Algorithm. arXiv preprint\n\narXiv:1601.00698, 2016.\n\n[5] D. Davis. The Asynchronous PALM Algorithm for Nonsmooth Nonconvex Problems. arXiv preprint\n\narXiv:1604.00526, 2016.\n\n[6] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang,\nQ. V. Le, and A. Y. Ng. Large Scale Distributed Deep Networks. In F. Pereira, C. J. C. Burges, L. Bottou,\nand K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1223\u20131231.\nCurran Associates, Inc., 2012.\n\n[7] R. Ge, F. Huang, C. Jin, and Y. Yuan. Escaping from saddle points\u2014online stochastic gradient for tensor\n\ndecomposition. In Proceedings of The 28th Conference on Learning Theory, pages 797\u2013842, 2015.\n\n[8] S. Ghadimi, G. Lan, and H. Zhang. Mini-batch stochastic approximation methods for nonconvex stochastic\n\ncomposite optimization. Mathematical Programming, 155(1):267\u2013305, 2016.\n\n[9] M. Hong. A distributed, asynchronous and incremental algorithm for nonconvex optimization: An admm\n\nbased approach. arXiv preprint arXiv:1412.6058, 2014.\n\n[10] X. Lian, Y. Huang, Y. Li, and J. Liu. Asynchronous Parallel Stochastic Gradient for Nonconvex Optimiza-\n\ntion. In Advances in Neural Information Processing Systems, pages 2719\u20132727, 2015.\n\n[11] J. Liu, S. J. Wright, C. R\u00e9, V. Bittorf, and S. Sridhar. An Asynchronous Parallel Stochastic Coordinate\n\nDescent Algorithm. Journal of Machine Learning Research, 16:285\u2013322, 2015.\n\n[12] J. Liu, S. J. Wright, and S. Sridhar. An Asynchronous Parallel Randomized Kaczmarz Algorithm. arXiv\n\npreprint arXiv:1401.4780, 2014.\n\n[13] H. Mania, X. Pan, D. Papailiopoulos, B. Recht, K. Ramchandran, and M. I. Jordan. 
Perturbed Iterate\n\nAnalysis for Asynchronous Stochastic Optimization. arXiv preprint arXiv:1507.06970, 2015.\n\n[14] Y. Nesterov. Introductory Lectures on Convex Optimization : A Basic Course. Applied optimization.\n\nKluwer Academic Publ., Boston, Dordrecht, London, 2004.\n\n[15] Z. Peng, Y. Xu, M. Yan, and W. Yin. ARock: an Algorithmic Framework for Asynchronous Parallel\n\nCoordinate Updates. arXiv preprint arXiv:1506.02396, 2015.\n\n[16] B. Recht, C. Re, S. Wright, and F. Niu. Hogwild: A Lock-Free Approach to Parallelizing Stochastic\n\nGradient Descent. In Advances in Neural Information Processing Systems, pages 693\u2013701, 2011.\n\n[17] R. T. Rockafellar and R. J.-B. Wets. Variational Analysis, volume 317. Springer Science & Business\n\nMedia, 2009.\n\n[18] P. Tseng. On the Rate of Convergence of a Partially Asynchronous Gradient Projection Algorithm. SIAM\n\nJournal on Optimization, 1(4):603\u2013619, 1991.\n\n[19] J. Tsitsiklis, D. Bertsekas, and M. Athans. Distributed asynchronous deterministic and stochastic gradient\n\noptimization algorithms. IEEE Transactions on Automatic Control, 31(9):803\u2013812, Sep 1986.\n\n[20] M. Udell, C. Horn, R. Zadeh, and S. Boyd. Generalized Low Rank Models. arXiv preprint arXiv:1410.0342,\n\n2014.\n\n[21] J. Woodworth and R. Chartrand. Compressed sensing recovery via nonconvex shrinkage penalties. arXiv\n\npreprint arXiv:1504.02923, 2015.\n\n[22] Y. Xu and W. Yin. Block Stochastic Gradient Iteration for Convex and Nonconvex Optimization. SIAM\n\nJournal on Optimization, 25(3):1686\u20131716, 2015.\n\n[23] H. Yun, H.-F. Yu, C.-J. Hsieh, S. V. N. Vishwanathan, and I. Dhillon. NOMAD: Non-locking, Stochastic\nMulti-machine Algorithm for Asynchronous and Decentralized Matrix Completion. Proc. 
VLDB Endow., 7(11):975–986, July 2014.
", "award": [], "sourceid": 155, "authors": [{"given_name": "Damek", "family_name": "Davis", "institution": "Cornell University"}, {"given_name": "Brent", "family_name": "Edmunds", "institution": "University of California"}, {"given_name": "Madeleine", "family_name": "Udell", "institution": "Cornell University"}]}