{"title": "Multi-Step Stochastic ADMM in High Dimensions: Applications to Sparse Optimization and Matrix Decomposition", "book": "Advances in Neural Information Processing Systems", "page_first": 2771, "page_last": 2779, "abstract": "In this paper, we consider a multi-step version of the stochastic ADMM method with efficient guarantees for high-dimensional problems. We first analyze the simple setting, where the optimization problem consists of a loss function and a single regularizer (e.g. sparse optimization), and then extend to the multi-block setting with multiple regularizers and multiple variables (e.g. matrix decomposition into sparse and low rank components). For the sparse optimization problem, our method achieves the minimax rate of $O(s\\log d/T)$ for $s$-sparse problems in $d$ dimensions in $T$ steps, and is thus, unimprovable by any method up to constant factors. For the matrix decomposition problem with a general loss function, we analyze the multi-step ADMM with multiple blocks. We establish $O(1/T)$ rate and efficient scaling as the size of matrix grows. For natural noise models (e.g. independent noise), our convergence rate is minimax-optimal. Thus, we establish tight convergence guarantees for multi-block ADMM in high dimensions. Experiments show that for both sparse optimization and matrix decomposition problems, our algorithm outperforms the state-of-the-art methods.", "full_text": "Multi-Step Stochastic ADMM in High Dimensions:\n\nApplications to Sparse Optimization\n\nand Matrix Decomposition\n\nHanie Sedghi\n\nUniv. of Southern California\n\nLos Angeles, CA 90089\nhsedghi@usc.edu\n\nAnima Anandkumar\nUniversity of California\n\nIrvine, CA 92697\n\na.anandkumar@uci.edu\n\nEdmond Jonckheere\n\nUniv. of Southern California\n\nLos Angeles, CA 90089\njonckhee@usc.edu\n\nAbstract\n\nIn this paper, we consider a multi-step version of the stochastic ADMM method\nwith ef\ufb01cient guarantees for high-dimensional problems. 
We \ufb01rst analyze the\nsimple setting, where the optimization problem consists of a loss function and\na single regularizer (e.g. sparse optimization), and then extend to the multi-block\nsetting with multiple regularizers and multiple variables (e.g. matrix decomposi-\ntion into sparse and low rank components). For the sparse optimization problem,\nour method achieves the minimax rate of O(s log d/T ) for s-sparse problems in\nd dimensions in T steps, and is thus, unimprovable by any method up to constant\nfactors. For the matrix decomposition problem with a general loss function, we\nanalyze the multi-step ADMM with multiple blocks. We establish O(1/T ) rate\nand ef\ufb01cient scaling as the size of matrix grows. For natural noise models (e.g.\nindependent noise), our convergence rate is minimax-optimal. Thus, we establish\ntight convergence guarantees for multi-block ADMM in high dimensions. Experi-\nments show that for both sparse optimization and matrix decomposition problems,\nour algorithm outperforms the state-of-the-art methods.\n\n1\n\nIntroduction\n\nStochastic optimization techniques have been extensively employed for online machine learning\non data which is uncertain, noisy or missing. Typically it involves performing a large number of\ninexpensive iterative updates, making it scalable for large-scale learning. In contrast, traditional\nbatch-based techniques involve far more expensive operations for each update step. Stochastic opti-\nmization has been analyzed in a number of recent works.\nThe alternating direction method of multipliers (ADMM) is a popular method for online and dis-\ntributed optimization on a large scale [1], and is employed in many applications. It can be viewed as\na decomposition procedure where solutions to sub-problems are found locally, and coordinated via\nconstraints to \ufb01nd the global solution. Speci\ufb01cally, it is a form of augmented Lagrangian method\nwhich applies partial updates to the dual variables. 
ADMM is often applied to solve regularized problems, where the function optimization and regularization can be carried out locally, and then coordinated globally via constraints. Regularized optimization problems are especially relevant in the high-dimensional regime, since regularization is a natural mechanism to overcome ill-posedness and to encourage parsimony in the optimal solution, e.g., sparsity and low rank. Due to the efficiency of ADMM in solving regularized problems, we employ it in this paper.
We consider a simple modification to the (inexact) stochastic ADMM method [2] by incorporating multiple steps or epochs, which can be viewed as a form of annealing. We establish that this simple modification has huge implications in achieving tight bounds on the convergence rate as the dimensions of the problem instances scale. In each iteration, we employ projections onto certain norm balls of appropriate radii, and we decrease the radii in epochs over time. For instance, for the sparse optimization problem, we constrain the optimal solution at each step to be within an ℓ1-norm ball of the initial estimate, obtained at the beginning of each epoch. At the end of the epoch, an average is computed and passed on to the next epoch as its initial estimate. Note that the ℓ1 projection can be solved efficiently in linear time, and can also be parallelized easily [3]. For matrix decomposition with a general loss function, the ADMM method requires multiple blocks for updating the low rank and sparse components. We apply the same principle and project the sparse and low rank estimates onto ℓ1 and nuclear norm balls, and these projections can be computed efficiently.
Theoretical implications: The above simple modifications to ADMM have huge implications for high-dimensional problems. For sparse optimization, our convergence rate is O(s log d/T) for s-sparse problems in d dimensions in T steps. Our bound has the best of both worlds: efficient high-dimensional scaling (as log d) and efficient convergence rate (as 1/T). This also matches the minimax rate for the linear model and square loss function [4], which implies that our guarantee is unimprovable by any (batch or online) algorithm (up to constant factors). For matrix decomposition, our convergence rate is O((s + r)β²(p) log p/T) + O(max{s + r, p}/p²) for a p × p input matrix in T steps, where the sparse part has s non-zero entries and the low rank part has rank r. For many natural noise models (e.g. independent noise, linear Bayesian networks), β²(p) = p, and the resulting convergence rate is minimax-optimal. Note that our bound is not only on the reconstruction error, but also on the error in recovering the sparse and low rank components. These are the first convergence guarantees for online matrix decomposition in high dimensions. Moreover, our convergence rate holds with high probability when noisy samples are input, in contrast to the expected convergence rate typically analyzed in the literature. See Tables 1 and 2 for a comparison of this work with related frameworks. Proofs of all results and implementation details can be found in the longer version [5].
Practical implications: The proposed algorithms provide significantly faster convergence in high dimension and better robustness to noise. For sparse optimization, our method has significantly better accuracy compared to the stochastic ADMM method and better performance than RADAR, based on multi-step dual averaging [6]. For matrix decomposition, we compare our method with the state-of-the-art inexact ALM [7] method. While both methods have similar reconstruction performance, our method has significantly better accuracy in recovering the sparse and low rank components.
Related Work: ADMM: Existing online ADMM-based methods lack high-dimensional guarantees.
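The ℓ1-ball projection used at each step has a simple closed form. Below is a minimal NumPy sketch for illustration; it uses a sort-based O(d log d) variant rather than the expected-linear-time algorithm of [3], and the function name and signature are ours, not the paper's implementation.

```python
import numpy as np

def project_l1_ball(v, radius):
    """Euclidean projection of v onto the l1 ball {x : ||x||_1 <= radius}.

    Sort-based O(d log d) variant; [3] gives an expected-linear-time version.
    """
    if np.sum(np.abs(v)) <= radius:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]                 # sorted magnitudes, descending
    cssv = np.cumsum(u)
    # largest index k with u_k * (k+1) > cssv_k - radius
    k = np.nonzero(u * np.arange(1, v.size + 1) > cssv - radius)[0][-1]
    theta = (cssv[k] - radius) / (k + 1.0)       # soft-threshold level
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)
```

The projection soft-thresholds all coordinates by a common level, so sparse iterates come for free.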
They scale poorly with the data dimension (as O(d²)), and also have slow convergence for general problems (as O(1/√T)). Under strong convexity, the convergence rate can be improved to O(1/T), but only in expectation: such analyses ignore the per-sample error and consider only the expected convergence rate (see Table 1). In contrast, our bounds hold with high probability. Some stochastic ADMM methods, Goldstein et al. [8], Deng [9] and Luo [10], provide faster rates for stochastic ADMM than the rate noted in Table 1. However, they require strong conditions which are not satisfied for the optimization problems considered here; e.g., Goldstein et al. [8] require both the loss function and the regularizer to be strongly convex.
Related Work: Sparse Optimization: For the sparse optimization problem, ℓ1 regularization is employed and the underlying true parameter is assumed to be sparse. This is a well-studied problem in a number of works (for details, refer to [6]). Agarwal et al. [6] propose an efficient online method based on dual averaging, which achieves the same optimal rates as the ones derived in this paper. The main difference is that our ADMM method is capable of solving the problem for multiple random variables and multiple conditions, while their method cannot incorporate these extensions.
Related Work: Matrix Decomposition: To the best of our knowledge, online guarantees for high-dimensional matrix decomposition have not been provided before. Wang et al. [12] propose a multi-block ADMM method for the matrix decomposition problem, but only provide convergence rate analysis in expectation, and it has poor high-dimensional scaling (as O(p⁴) for a p × p matrix) without further modifications.
Note that they only provide a convergence rate on the difference between the loss function and the optimal loss, whereas we provide the convergence rate on the individual errors of the sparse and low rank components, ‖S̄(T) − S*‖²F, ‖L̄(T) − L*‖²F. See Table 2 for a comparison of guarantees for the matrix decomposition problem.
Notation: In the sequel, we use lower case letters for vectors and upper case letters for matrices. Moreover, X ∈ R^{p×p}. ‖x‖1, ‖x‖2 refer to the ℓ1, ℓ2 vector norms respectively. The term ‖X‖* stands for the nuclear norm of X. In addition, ‖X‖2, ‖X‖F denote the spectral and Frobenius norms respectively. We use vectorized ℓ1, ℓ∞ norms for matrices, i.e., ‖X‖1 = Σ_{i,j}|Xij|, ‖X‖∞ = max_{i,j}|Xij|.

Method                 | Assumptions           | Convergence rate
ST-ADMM [2]            | L, convexity          | O(d²/√T)
ST-ADMM [2]            | SC, E                 | O(d² log T/T)
BADMM [11]             | convexity, E          | O(d²/√T)
RADAR [6]              | LSC, LL               | O(s log d/T)
REASON 1 (this paper)  | LSC, LL               | O(s log d/T)
Minimax bound [4]      | Eigenvalue conditions | O(s log d/T)

Table 1: Comparison of online sparse optimization methods under s sparsity level for the optimal parameter, d dimensional space, and T number of iterations. SC = Strong Convexity, LSC = Local Strong Convexity, LL = Local Lipschitz, L = Lipschitz property, E = in Expectation. The last row provides the minimax-optimal rate for any method. The results hold with high probability.

Method                 | Assumptions  | Convergence rate
Multi-block-ADMM [12]  | L, SC, E     | O(p⁴/T)
Batch method [13]      | LL, LSC, DF  | O((s log p + rp)/T) + O(s/p²)
REASON 2 (this paper)  | LSC, LL, DF  | O((s + r)β²(p) log p/T) + O(max{s + r, p}/p²)
Minimax bound [13]     | ℓ2, IN, DF   | O((s log p + rp)/T) + O(s/p²)

Table 2: Comparison of optimization methods for sparse+low rank matrix decomposition for a p × p matrix under s sparsity level and r rank matrices, where T is the number of samples. Abbreviations are as in Table 1; IN = Independent noise model, DF = diffuse low rank matrix under the optimal parameter. β(p) ranges between Ω(√p) and O(p), and its value depends on the model. The last row provides the minimax-optimal rate for any method under the independent noise model. The results hold with high probability unless otherwise mentioned. For Multi-block-ADMM [12] the convergence rate is on the difference of the loss function from the optimal loss; for the rest of the works in the table, the convergence rate is on the individual estimates of the sparse and low rank components: ‖S̄(T) − S*‖²F + ‖L̄(T) − L*‖²F.

2 ℓ1 Regularized Stochastic Optimization

We consider the optimization problem θ* ∈ arg min E[f(θ, x)], θ ∈ Ω, where θ* is a sparse vector. The loss function f(θ, xk) is a function of a parameter θ ∈ R^d and samples xk. In the stochastic setting, we do not have access to E[f(θ, x)] nor to its subgradients. In each iteration we have access to one noisy sample. In order to impose sparsity we use regularization. Thus, we solve a sequence

θk ∈ arg min_{θ ∈ Ω′} f(θ, xk) + λ‖θ‖1,  Ω′ ⊂ Ω,  (1)

where the regularization parameter λ > 0 and the constraint sets Ω′ change from epoch to epoch.

2.1 Epoch-based Stochastic ADMM Algorithm

We now describe the modified inexact ADMM algorithm for the sparse optimization problem in (1), and refer to it as REASON 1; see Algorithm 1.
We consider an epoch length T0, and in each epoch i, we project the optimal solution onto an ℓ1 ball with radius Ri centered around θ̃i, which is the initial estimate of θ* at the start of the epoch. The θ-update is given by

θ_{k+1} = arg min_{‖θ − θ̃i‖1 ≤ Ri} { ⟨∇f(θk), θ − θk⟩ − ⟨zk, θ − yk⟩ + (ρ/2)‖θ − yk‖²2 + (ρx/2)‖θ − θk‖²2 }.  (2)

Note that this is an inexact update since we employ the gradient ∇f(·) rather than optimize directly on the loss function f(·), which is expensive. The above program can be solved efficiently since it is a projection onto the ℓ1 ball, whose complexity is linear in the sparsity level of the gradient when performed serially, and O(log d) when performed in parallel using d processors [3]. For the regularizer, we introduce the variable y, and the y-update is y_{k+1} = arg min { λi‖y‖1 − ⟨zk, θ_{k+1} − y⟩ + (ρ/2)‖θ_{k+1} − y‖²2 }. This update can be simplified to the form given in REASON 1, where Shrink_κ(·) is the soft-thresholding or shrinkage function [1]. Thus, each step in the update is extremely simple to implement.

Algorithm 1: Regularized Epoch-based Admm for Stochastic Opt. in high-dimensioN 1 (REASON 1)
Input: ρ, ρx, epoch length T0, initial prox center θ̃1, initial radius R1, regularization parameters {λi}, i = 1, ..., kT.
Define Shrink_κ(a) = (a − κ)+ − (−a − κ)+.
for each epoch i = 1, 2, ..., kT do
  Initialize θ0 = y0 = θ̃i
  for each iteration k = 0, 1, ..., T0 − 1 do
    θ_{k+1} = arg min_{‖θ − θ̃i‖1 ≤ Ri} { ⟨∇f(θk), θ − θk⟩ − ⟨zk, θ − yk⟩ + (ρ/2)‖θ − yk‖²2 + (ρx/2)‖θ − θk‖²2 }
    y_{k+1} = Shrink_{λi/ρ}(θ_{k+1} − zk/ρ)
    z_{k+1} = zk − τ(θ_{k+1} − y_{k+1})
  Return: θ(Ti) := (1/T0) Σ_{k=0}^{T0−1} θk for epoch i, and θ̃_{i+1} = θ(Ti).
  Update: R²_{i+1} = R²_i/2.
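To make the inner loop concrete, here is a minimal NumPy sketch of one REASON 1 epoch, assuming a generic stochastic gradient oracle; the names (`reason1_epoch`, `grad_f`) and the toy parameter choices are ours, not the paper's implementation. Since the quadratic terms in the θ-update are isotropic, the constrained minimizer is simply the Euclidean projection of the unconstrained minimizer onto the ℓ1 ball.

```python
import numpy as np

def shrink(a, kappa):
    # soft-thresholding: (a - kappa)_+ - (-a - kappa)_+
    return np.maximum(a - kappa, 0.0) - np.maximum(-a - kappa, 0.0)

def project_l1(v, center, radius):
    # Euclidean projection of v onto {x : ||x - center||_1 <= radius} (sort-based)
    d = v - center
    if np.abs(d).sum() <= radius:
        return v.copy()
    u = np.sort(np.abs(d))[::-1]
    cssv = np.cumsum(u)
    k = np.nonzero(u * np.arange(1, d.size + 1) > cssv - radius)[0][-1]
    t = (cssv[k] - radius) / (k + 1.0)
    return center + np.sign(d) * np.maximum(np.abs(d) - t, 0.0)

def reason1_epoch(grad_f, theta_init, R, lam, rho, rho_x, tau, T0):
    """One epoch of the inexact-ADMM inner loop (illustrative parameters)."""
    theta, y, z = theta_init.copy(), theta_init.copy(), np.zeros_like(theta_init)
    avg = np.zeros_like(theta_init)
    for _ in range(T0):
        # inexact theta-update: minimize the linearized augmented Lagrangian,
        # i.e. project its unconstrained minimizer onto the l1 ball
        g = grad_f(theta)
        theta_star = (rho * y + z - g + rho_x * theta) / (rho + rho_x)
        theta = project_l1(theta_star, theta_init, R)
        y = shrink(theta - z / rho, lam / rho)   # closed-form y-update
        z = z - tau * (theta - y)                # dual step
        avg += theta / T0
    return avg                                   # next epoch's prox center
```

Every iterate stays inside the ℓ1 ball around the epoch center, so the returned average does too.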
When an epoch is complete, we carry over the average θ(Ti) as the next epoch center and reset the other variables.

2.2 High-dimensional Guarantees

We now provide convergence guarantees for the proposed method under the following assumptions.
Assumption A1: Local strong convexity (LSC): The function f : S → R satisfies an R-local form of strong convexity (LSC) if there is a non-negative constant γ = γ(R) such that for any θ1, θ2 ∈ S with ‖θ1‖1 ≤ R and ‖θ2‖1 ≤ R, f(θ1) ≥ f(θ2) + ⟨∇f(θ2), θ1 − θ2⟩ + (γ/2)‖θ2 − θ1‖²2.
Note that the notion of strong convexity leads to faster convergence rates in general. Intuitively, strong convexity is a measure of curvature of the loss function, which relates the reduction in the loss function to closeness in the variable domain. Assuming that the function f is twice continuously differentiable, it is strongly convex if and only if its Hessian is positive definite for all feasible θ. However, in the high-dimensional regime, where there are fewer samples than the data dimension, the Hessian matrix is often singular and we do not have global strong convexity. A solution is to impose local strong convexity, which allows us to provide guarantees for high dimensional problems.
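The failure of global strong convexity in the sample-starved regime can be checked directly for the square loss; the sketch below (with illustrative sizes of our choosing) shows the Hessian is singular whenever n < d.

```python
import numpy as np

# Square loss f(theta) = (1/2n)||X theta - y||_2^2 has Hessian X^T X / n.
# Its smallest eigenvalue is the global strong-convexity constant gamma;
# with n < d the Hessian is rank-deficient, so gamma is zero and only a
# *local* notion of strong convexity can hold.
rng = np.random.default_rng(0)
n, d = 5, 20                       # fewer samples than dimensions
X = rng.standard_normal((n, d))
H = X.T @ X / n                    # Hessian of the square loss
gamma_global = np.linalg.eigvalsh(H).min()
rank_H = np.linalg.matrix_rank(H)  # at most n, hence < d
```

Here `gamma_global` is numerically zero and `rank_H` is at most n = 5, illustrating why the analysis restricts curvature to an ℓ1 ball around θ*.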
This notion has been exploited before in a number of works on high dimensional analysis, e.g., [14, 13, 6]. It holds for various loss functions such as the square loss.
Assumption A2: Sub-Gaussian stochastic gradients: Let ek(θ) := ∇f(θ, xk) − E[∇f(θ, xk)]. There is a constant σ = σ(R) such that for all k > 0, E[exp(‖ek(θ)‖²∞/σ²)] ≤ exp(1), for all θ such that ‖θ − θ*‖1 ≤ R.
Remark: The bound holds with σ = O(√log d) whenever each component of the error vector has sub-Gaussian tails [6].
Assumption A3: Local Lipschitz condition: For each R > 0, there is a constant G = G(R) such that |f(θ1) − f(θ2)| ≤ G‖θ1 − θ2‖1, for all θ1, θ2 ∈ S such that ‖θ1 − θ*‖1 ≤ R and ‖θ2 − θ*‖1 ≤ R.
The design parameters are as below, where λi is the regularization for the ℓ1 term in epoch i, ρ and ρx are the penalties in the θ-update as in (2), and τ is the step size for the dual update:

λ²i = (γ Ri/(s√T0)) √( log d + G²(ρ + ρx)²/T²0 + σ²i log(3/δi) ),  ρ ∝ √(T0 log d)/Ri,  ρx > 0,  τ ∝ √T0/Ri.  (3)

Theorem 1.
Under Assumptions A1 − A3, with λi as in (3) and fixed epoch lengths T0 = T log d/kT, where T is the total number of iterations and

kT = log2( γ²R²1 T / (s²(log d + (γ/s)G + 12σ² log(6/δ))) ),

and T0 satisfies T0 = O(log d), for any θ* with sparsity s, with probability at least 1 − δ we have

‖θ̄T − θ*‖²2 = O( (s/T)(log d/kT)( log d + (γ/s)G + (log(1/δ) + log(kT/log d))σ² ) ),

where θ̄T is the average for the last epoch for a total of T iterations.

Improvement of log d factor: The above theorem covers the practical case where the epoch length T0 is fixed. We can improve the above results using varying epoch lengths (which depend on the problem parameters) such that ‖θ̄T − θ*‖²2 = O(s log d/T). The details can be found in the longer version [5]. This convergence rate of O(s log d/T) matches the minimax lower bounds for sparse estimation [4]. This implies that our guarantees are unimprovable up to constant factors.

3 Extension to Doubly Regularized Stochastic Optimization

We consider the optimization problem M* ∈ arg min E[f(M, X)], where we want to decompose M into a sparse matrix S ∈ R^{p×p} and a low rank matrix L ∈ R^{p×p}. f(M, Xk) is a function of a parameter M and samples Xk. Xk can be a matrix (e.g. independent noise model) or a vector (e.g. Gaussian graphical model). In the stochastic setting, we do not have access to E[f(M, X)] nor to its subgradients. In each iteration, we have access to one noisy sample and update our estimate based on that. We impose the desired properties with regularization. Thus, we solve a sequence

M̂k := arg min { f̂(M, Xk) + λn‖S‖1 + µn‖L‖* }  s.t.
M = S + L,  ‖L‖∞ ≤ α/p.  (4)

We propose an online program based on the multi-block ADMM algorithm. In addition to tailoring the projection ideas employed for the sparse case, we impose an ℓ∞ constraint of α/p on each entry of L. This constraint is also imposed for the batch version of the problem (4) in [13], and we assume that the true matrix L* satisfies this constraint. Intuitively, the ℓ∞ constraint controls the "spikiness" of L*. If α ≈ 1, then the entries of L are O(1/p), i.e. they are "diffuse" or "non-spiky", and no entry is too large. When the low rank matrix L* has diffuse entries, it cannot be a sparse matrix, and thus can be separated from the sparse S* efficiently. In fact, the ℓ∞ constraint is a weaker form of the incoherence-type assumptions needed to guarantee identifiability [15] for sparse+low rank decomposition. For more discussion, see Section 3.2.

3.1 Epoch-based Multi-Block ADMM Algorithm

We now extend the ADMM method proposed in REASON 1 to multi-block ADMM. The details are in Algorithm 2, and we refer to it as REASON 2. Recall that the matrix decomposition setting assumes that the true matrix M* = S* + L* is a combination of a sparse matrix S* and a low rank matrix L*. In REASON 2, the updates for the matrices M, S, L are done independently at each step. The updates follow the definition of ADMM and the ideas presented in Section 2. We consider epochs of length T0. We do not need to project the update of the matrix M. The update rules for S, L are the result of an inexact proximal update obtained by considering them as a single block, which can then be decoupled. We impose an ℓ1-norm projection for the sparse estimate S around the epoch initialization S̃i.
For the low rank estimate L, we impose a nuclear norm projection around the epoch initialization L̃i. Intuitively, the nuclear norm projection, which is an ℓ1 projection on the singular values, encourages sparsity in the spectral domain, leading to low rank estimates. We also require an ℓ∞ constraint on L. Thus, the update rule for L has two projections, i.e. infinity and nuclear norm projections. We decouple it into ADMM updates L, Y with a dual variable U corresponding to this decomposition.

3.2 High-dimensional Guarantees

We now prove that REASON 2 recovers both the sparse and low rank estimates in high dimensions efficiently. We need the following assumptions, in addition to Assumptions A2, A3.
Assumption A4: Spectral bound on the gradient error: Let Ek(M, Xk) := ∇f(M, Xk) − E[∇f(M, Xk)]; then ‖Ek‖2 ≤ β(p)σ, where σ := ‖Ek‖∞.
Recall from Assumption A2 that σ = O(log p), under sub-Gaussianity.
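Both projections in the L-update above have simple forms: the nuclear norm projection is an ℓ1 projection of the singular values, and the ℓ∞ constraint is entrywise clipping. A minimal NumPy sketch (function names are illustrative, not the paper's implementation):

```python
import numpy as np

def project_simplex_like(s, radius):
    # Euclidean projection of a nonnegative vector s onto {x >= 0 : sum(x) <= radius}
    if s.sum() <= radius:
        return s.copy()
    u = np.sort(s)[::-1]
    cssv = np.cumsum(u)
    k = np.nonzero(u * np.arange(1, s.size + 1) > cssv - radius)[0][-1]
    return np.maximum(s - (cssv[k] - radius) / (k + 1.0), 0.0)

def project_nuclear_ball(L, center, radius):
    """Projection onto {X : ||X - center||_* <= radius}: an l1 projection
    applied to the singular values of L - center (spectral sparsity <=> low rank)."""
    U, s, Vt = np.linalg.svd(L - center, full_matrices=False)
    return center + (U * project_simplex_like(s, radius)) @ Vt

def project_linf(L, bound):
    # the l_inf constraint ||L||_inf <= bound is entrywise clipping
    return np.clip(L, -bound, bound)
```

Soft-thresholding the spectrum this way zeroes out the smallest singular values first, which is exactly the low-rank-encouraging behavior described above.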
Here, we require spectral bounds in addition to the ‖·‖∞ bound in A2.
Assumption A5: Bound on spikiness of the low-rank matrix: ‖L*‖∞ ≤ α/p, as discussed before.
Assumption A6: Local strong convexity (LSC): The function f : R^{d1×d2} → R satisfies an R-local form of strong convexity (LSC) if there is a non-negative constant γ = γ(R) such that f(B1) ≥ f(B2) + Tr(∇f(B2)(B1 − B2)) + (γ/2)‖B2 − B1‖²F, for any ‖B1‖ ≤ R and ‖B2‖ ≤ R, which is essentially the matrix version of Assumption A1.
We choose the algorithm parameters as below, where λi, µi are the regularization for the ℓ1 and nuclear norms respectively, ρ, ρx correspond to the penalty terms in the M-update, and τ is the dual update step size:

λ²i = (γ√(R²i + R̃²i)/((s + r)√T0)) √( log p + G²(ρ + ρx)²/T²0 + β²(p)σ²i log(3/δi) ) + α²/p² + (β²(p)σ²/T0)(log p + log(1/δ)),
µ²i = c_µ λ²i,  ρ ∝ √(T0 log p/(R²i + R̃²i)),  ρx > 0,  τ ∝ √(T0/(R²i + R̃²i)).  (5)

Theorem 2. Under Assumptions A2 − A6 and the parameter settings (5), let T denote the total number of iterations and T0 = T log p/kT, where

kT ≃ −log( ((s + r)²/(γ²R²1 T))( log p + G/(s + r) + β²(p)σ²[(1 + G)(log(6/δ) + log kT) + log p] ) ),

and T0 satisfies T0 = O(log p). Then, with probability at least 1 − δ, we have

‖S̄(T) − S*‖²F + ‖L̄(T) − L*‖²F = O( ((s + r)/T)(log p/kT)[ log p + G + β²(p)σ²((1 + G)(log(6/δ) + log(kT/log p)) + log p) ] ) + (1 + (s + r)/(γ²p)) α²/p.

Improvement of log p factor: The above result can be improved by a log p factor by considering varying epoch lengths (which depend on the problem parameters). The resulting convergence rate is O((s + r)p log p/T + α²/p). The details can be found in the longer version [5].
Scaling of β(p): We have the bounds Θ(√p) ≤ β(p) ≤ Θ(p). This implies that the convergence rate (with varying epoch lengths) is O((s + r)p log p/T + α²/p) when β(p) = Θ(√p), and O((s + r)p² log p/T + α²/p) when β(p) = Θ(p). The upper bound on β(p) arises trivially by converting the max-norm bound ‖Ek‖∞ ≤ σ to a bound on the spectral norm ‖Ek‖2. In many interesting scenarios, the lower bound on β(p) is achieved, as outlined below in Section 3.2.1.
Comparison with the batch result: Agarwal et al. [13] consider the batch version of the same problem (4), and provide a convergence rate of O((s log p + rp)/T + sα²/p²). This is also the minimax lower bound under the independent noise model. With respect to the convergence rate, we match their results with respect to the scaling of s and r, and also obtain a 1/T rate.
We match the scaling with respect to p (up to a log factor) when β(p) = Θ(√p), which attains the lower bound, and we discuss a few such instances below. Otherwise, we are worse by a factor of p compared to the batch version. Intuitively, this is because we require different bounds on the error terms Ek in the online and the batch settings. The batch setting considers an empirical estimate, and hence operates on the averaged error, whereas in the online setting we suffer from the per-sample error. Efficient concentration bounds exist for the batch case [16], while for the online case no such bounds exist in general. Hence, we conjecture that our bounds in Theorem 2 are unimprovable in the online setting.
Approximation Error: Note that the optimal decomposition M* = S* + L* is not identifiable in general without incoherence-style conditions [15, 17]. In this paper, we provide efficient guarantees without assuming such strong incoherence constraints. This implies that there is an approximation error which is incurred even in the noiseless setting due to model non-identifiability.

Algorithm 2: Regularized Epoch-based Admm for Stochastic Opt.
in high-dimensioN 2 (REASON 2)
Input: ρ, ρx, epoch length T0, regularization parameters {λi, µi}, i = 1, ..., kT, initial prox centers S̃1, L̃1, initial radii R1, R̃1.
Define Shrink_κ(a) as the shrinkage operator in REASON 1, and G_Mk = M_{k+1} − Sk − Lk − (1/ρ)Zk.
for each epoch i = 1, 2, ..., kT do
  Initialize S0 = S̃i, L0 = L̃i, M0 = S0 + L0.
  for each iteration k = 0, 1, ..., T0 − 1 do
    M_{k+1} = (−∇f(Mk) + Zk + ρ(Sk + Lk) + ρx Mk)/(ρ + ρx)
    S_{k+1} = arg min_{‖S − S̃i‖1 ≤ Ri} { λi‖S‖1 + (ρ/(2τk))‖S − (Sk + τk G_Mk)‖²F }
    L_{k+1} = arg min_{‖L − L̃i‖* ≤ R̃i} { µi‖L‖* + (ρ/(2τk))‖L − (Lk + τk G_Mk)‖²F + (ρ/2)‖L − Yk − Uk/ρ‖²F }
    Y_{k+1} = arg min_{‖Y‖∞ ≤ α/p} (ρ/2)‖L_{k+1} − Y − Uk/ρ‖²F
    Z_{k+1} = Zk − τ(M_{k+1} − (S_{k+1} + L_{k+1}))
    U_{k+1} = Uk − τ(L_{k+1} − Y_{k+1})
  Set: S̃_{i+1} = (1/T0) Σ_{k=0}^{T0−1} Sk and L̃_{i+1} = (1/T0) Σ_{k=0}^{T0−1} Lk
  if R²i > 2(s + r + (s + r)²/(pγ²)) α²/p then update R²_{i+1} = R²i/2, R̃²_{i+1} = R̃²i/2;
  else STOP.

Dimension, Run Time | Method   | error at 0.02T | error at 0.2T | error at T
d = 20000, T = 50   | ST-ADMM  | 1.022          | 1.002         | 0.996
                    | RADAR    | 0.116          | 2.10e-03      | 6.26e-05
                    | REASON   | 1.5e-03        | 2.20e-04      | 1.07e-08
d = 2000, T = 5     | ST-ADMM  | 0.794          | 0.380         | 0.348
                    | RADAR    | 0.103          | 4.80e-03      | 1.53e-04
                    | REASON   | 0.001          | 2.26e-04      | 1.58e-08
d = 20, T = 0.2     | ST-ADMM  | 0.212          | 0.092         | 0.033
                    | RADAR    | 0.531          | 4.70e-03      | 4.91e-04
                    | REASON   | 0.100          | 2.02e-04      | 1.09e-08

Table 3: Least square regression problem, epoch size Ti = 2000,
Error = ‖θ − θ*‖2/‖θ*‖2.

Agarwal et al. [13] achieve an approximation error of sα²/p² for their batch algorithm. Our online algorithm has an approximation error of max{s + r, p}α²/p², which decays with p. It is not clear whether this bound can be improved by any other online algorithm.

3.2.1 Optimal Guarantees for Various Statistical Models

We now list some statistical models under which we achieve the batch-optimal rate for sparse+low rank decomposition.
1) Independent Noise Model: Assume we sample i.i.d. matrices Xk = S* + L* + Nk, where the noise Nk has independent bounded sub-Gaussian entries with max_{i,j} Var(Nk(i, j)) = σ². We consider the square loss function, ‖Xk − S − L‖²F. Hence Ek = Xk − S* − L* = Nk. From [Thm. 1.1][18], we have w.h.p. ‖Nk‖ = O(σ√p). We match the batch bound in [13] in this setting. Moreover, Agarwal et al. [13] provide a minimax lower bound for this model, and we match it as well. Thus, we achieve the optimal convergence rate for online matrix decomposition for this model.
2) Linear Bayesian Network: Consider a p-dimensional vector y = Ah + n, where h ∈ R^r with r ≤ p, and n ∈ R^p. The variable h is hidden, and y is the observed variable.
We assume that the vectors h and n are each zero-mean sub-Gaussian vectors with i.i.d. entries, and are independent of one another. Let σ_h^2 and σ_n^2 be the variances of the entries of h and n respectively. Without loss of generality, we assume that the columns of A are normalized, as we can always rescale A and σ_h appropriately to obtain the same model. Let Σ*_{y,y} be the true covariance matrix of y. From the independence assumptions, we have Σ*_{y,y} = S* + L*, where S* = σ_n^2 I is a diagonal matrix and L* = σ_h^2 AA^T has rank at most r.

            T = 50 sec                                        T = 150 sec
            ‖M*−S−L‖_F/‖M*‖_F  ‖S−S*‖_F/‖S*‖_F  ‖L*−L‖_F/‖L*‖_F   ‖M*−S−L‖_F/‖M*‖_F  ‖S−S*‖_F/‖S*‖_F  ‖L*−L‖_F/‖L*‖_F
REASON 2    2.20e-03           0.004            0.01              5.55e-05           1.50e-04         3.25e-04
IALM        5.11e-05           0.12             0.27              8.76e-09           0.12             0.27

Table 4: REASON 2 and inexact ALM, matrix decomposition problem. p = 2000, η^2 = 0.01.

In each step k, we obtain a sample y_k from the Bayesian network. For the square loss function f, we have the error E_k = y_k y_k^T − Σ*_{y,y}. Applying [19, Cor. 5.50], we have w.h.p. ‖n_k n_k^T − σ_n^2 I‖_2 = O(√p σ_n^2) and ‖h_k h_k^T − σ_h^2 I‖_2 = O(√p σ_h^2). We thus have, with probability 1 − T e^{−cp},

‖E_k‖_2 ≤ O(√p (‖A‖_2^2 σ_h^2 + σ_n^2)),  ∀ k ≤ T.

When ‖A‖_2 is bounded, we obtain the optimal bound in Theorem 2, which matches the batch bound. If the entries of A are generically drawn (e.g., from a Gaussian distribution), we have ‖A‖_2 = O(1 + √(r/p)). Moreover, such generic matrices A are also "diffuse", and thus the low-rank matrix L* satisfies Assumption A5 with α ∼ polylog(p). Intuitively, when A is generically drawn, there are diffuse connections from the hidden to the observed variables, and we have efficient guarantees under this setting.

4 Experiments

REASON 1: For the sparse optimization problem, we compare REASON 1 with RADAR and ST-ADMM under the least-squares regression setting. Samples (x_t, y_t) are generated such that x_t ∈ Unif[−B, B] and y_t = ⟨θ*, x_t⟩ + n_t, where θ* is s-sparse with s = ⌈log d⌉ and n_t ∼ N(0, η^2), with η^2 = 0.5 in all cases. We consider d = 20, 2000, 20000 and s = 1, 3, 5 respectively. The experiments are performed on a 2.5 GHz Intel Core i5 laptop with 8 GB RAM. See Table 3 for the results. It should be noted that RADAR is provided with information about θ* for epoch design and re-centering. In addition, both RADAR and REASON 1 have the same initial radius. Nevertheless, REASON 1 reaches better accuracy within the same run time, even for small time frames. In addition, we compare the relative error ‖θ − θ*‖_2/‖θ*‖_2 of REASON 1 and ST-ADMM in the first epoch. We observe that in higher dimensions the error fluctuations of ADMM increase noticeably (see Figure 1).
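For concreteness, the sampling scheme used in these regression experiments can be sketched as follows (a minimal NumPy sketch; the covariate bound B is not specified in the text, so B = 1 is an assumed value, and (d, s) = (2000, 3) is one of the three configurations listed):

```python
import numpy as np

rng = np.random.default_rng(0)
d, s = 2000, 3                 # one of the (d, s) settings listed above
B = 1.0                        # covariate bound (assumed; not given in the text)
eta2 = 0.5                     # noise variance, as in all experiments

# s-sparse ground truth theta*.
theta_star = np.zeros(d)
support = rng.choice(d, size=s, replace=False)
theta_star[support] = rng.standard_normal(s)

def sample(n):
    """Draw n pairs (x_t, y_t): x_t ~ Unif[-B, B]^d, y_t = <theta*, x_t> + n_t."""
    X = rng.uniform(-B, B, size=(n, d))
    y = X @ theta_star + rng.normal(0.0, np.sqrt(eta2), size=n)
    return X, y

def rel_err(theta):
    """Relative error ||theta - theta*||_2 / ||theta*||_2, as in Figure 1."""
    return np.linalg.norm(theta - theta_star) / np.linalg.norm(theta_star)

X, y = sample(5)
```

Any of the compared solvers can then be run on streams produced by `sample`, with `rel_err` tracking the error curve plotted in Figure 1.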
Therefore, the projections in REASON 1 play an important role in denoising and obtaining good accuracy.

REASON 2: We compare REASON 2 with the state-of-the-art inexact ALM method for the matrix decomposition problem (the ALM code was downloaded from [20]). Table 4 shows that, for equal run time, inexact ALM reaches a smaller combined error ‖M*−S−L‖_F/‖M*‖_F, while in fact this does not correspond to a good decomposition. In contrast, REASON 2 attains useful individual errors. Experiments with η^2 ∈ [0.01, 1] show similar results. Similar experiments with exact ALM show worse performance than inexact ALM.

Figure 1: Least-squares regression, Error = ‖θ − θ*‖_2/‖θ*‖_2 vs. iteration number, for d_1 = 20 and d_2 = 20000.

Acknowledgment

We acknowledge detailed discussions with Majid Janzamin and thank him for valuable comments on sparse and low-rank recovery. The authors thank Alekh Agarwal for detailed discussions of his work and the minimax bounds. A. Anandkumar is supported in part by a Microsoft Faculty Fellowship, NSF Career award CCF-1254106, NSF Award CCF-1219234, and ARO YIP Award W911NF-13-1-0084.

References

[1] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.

[2] H. Ouyang, N. He, L. Tran, and A. G. Gray. Stochastic alternating direction method of multipliers. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 80–88, 2013.

[3] J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra. Efficient projections onto the ℓ1-ball for learning in high dimensions.
In Proceedings of the 25th International Conference on Machine Learning, pages 272–279. ACM, 2008.

[4] G. Raskutti, M. J. Wainwright, and B. Yu. Minimax rates of estimation for high-dimensional linear regression over ℓq-balls. IEEE Transactions on Information Theory, 57(10):6976–6994, October 2011.

[5] H. Sedghi, A. Anandkumar, and E. Jonckheere. Guarantees for multi-step stochastic ADMM in high dimensions. arXiv preprint arXiv:1402.5131, 2014.

[6] A. Agarwal, S. Negahban, and M. J. Wainwright. Stochastic optimization and sparse statistical recovery: Optimal algorithms for high dimensions. In NIPS, pages 1547–1555, 2012.

[7] Z. Lin, M. Chen, and Y. Ma. The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices. arXiv preprint arXiv:1009.5055, 2010.

[8] T. Goldstein, B. O'Donoghue, and S. Setzer. Fast alternating direction optimization methods. CAM report, pages 12–35, 2012.

[9] W. Deng and W. Yin. On the global and linear convergence of the generalized alternating direction method of multipliers. Technical report, DTIC Document, 2012.

[10] Z.-Q. Luo. On the linear convergence of the alternating direction method of multipliers. arXiv preprint arXiv:1208.3922, 2012.

[11] H. Wang and A. Banerjee. Bregman alternating direction method of multipliers. arXiv preprint arXiv:1306.3203, 2013.

[12] X. Wang, M. Hong, S. Ma, and Z.-Q. Luo. Solving multiple-block separable convex minimization problems using two-block alternating direction method of multipliers. arXiv preprint arXiv:1308.5294, 2013.

[13] A. Agarwal, S. Negahban, and M. J. Wainwright. Noisy matrix decomposition via convex relaxation: Optimal rates in high dimensions. The Annals of Statistics, 40(2):1171–1197, 2012.

[14] S. Negahban, P. Ravikumar, M. J. Wainwright, and B. Yu. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers.
Statistical Science, 27(4):538–557, 2012.

[15] V. Chandrasekaran, S. Sanghavi, P. A. Parrilo, and A. S. Willsky. Rank-sparsity incoherence for matrix decomposition. SIAM Journal on Optimization, 21(2):572–596, 2011.

[16] J. Tropp. User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics, 12(4):389–434, 2012.

[17] D. Hsu, S. M. Kakade, and T. Zhang. Robust matrix decomposition with sparse corruptions. IEEE Transactions on Information Theory, 57(11):7221–7234, 2011.

[18] V. H. Vu. Spectral norm of random matrices. In Proceedings of the Thirty-Seventh Annual ACM Symposium on Theory of Computing, pages 423–430. ACM, 2005.

[19] R. Vershynin. Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027, 2010.

[20] Low-rank matrix recovery and completion via convex optimization. http://perception.csl.illinois.edu/matrix-rank/home.html. Accessed: 2014-05-02.