{"title": "Efficient Stochastic Gradient Hard Thresholding", "book": "Advances in Neural Information Processing Systems", "page_first": 1984, "page_last": 1993, "abstract": "Stochastic gradient hard thresholding methods have recently been shown to work favorably in solving large-scale empirical risk minimization problems under sparsity or rank constraint. Despite the improved iteration complexity over full gradient methods, the gradient evaluation and hard thresholding complexity of the existing stochastic algorithms usually scales linearly with data size, which could still be expensive when data is huge and the hard thresholding step could be as expensive as singular value decomposition in rank-constrained problems. To address these deficiencies, we propose an efficient hybrid stochastic gradient hard thresholding (HSG-HT) method that can be provably shown to have sample-size-independent gradient evaluation and hard thresholding complexity bounds. Specifically, we prove that the stochastic gradient evaluation complexity of HSG-HT scales linearly with inverse of sub-optimality and its hard thresholding complexity scales logarithmically. By applying the heavy ball acceleration technique, we further propose an accelerated variant of HSG-HT which can be shown to have improved factor dependence on restricted condition number. Numerical results confirm our theoretical affirmation and demonstrate the computational efficiency of the proposed methods.", "full_text": "Ef\ufb01cient Stochastic Gradient Hard Thresholding\n\nPan Zhou\u2217\n\nXiao-Tong Yuan\u2020\n\nJiashi Feng\u2217\n\n\u2217 Learning & Vision Lab, National University of Singapore, Singapore\n\n\u2020 B-DAT Lab, Nanjing University of Information Science & Technology, Nanjing, China\n\npzhou@u.nus.edu\n\nxtyuan@nuist.edu.cn\n\nelefjia@nus.edu.sg\n\nAbstract\n\nStochastic gradient hard thresholding methods have recently been shown to work\nfavorably in solving large-scale empirical risk minimization problems under sparsi-\nty or rank constraint. Despite the improved iteration complexity over full gradient\nmethods, the gradient evaluation and hard thresholding complexity of the existing\nstochastic algorithms usually scales linearly with data size, which could still be\nexpensive when data is huge and the hard thresholding step could be as expensive\nas singular value decomposition in rank-constrained problems. To address these\nde\ufb01ciencies, we propose an ef\ufb01cient hybrid stochastic gradient hard thresholding\n(HSG-HT) method that can be provably shown to have sample-size-independent\ngradient evaluation and hard thresholding complexity bounds. Speci\ufb01cally, we\nprove that the stochastic gradient evaluation complexity of HSG-HT scales linearly\nwith inverse of sub-optimality and its hard thresholding complexity scales logarith-\nmically. By applying the heavy ball acceleration technique, we further propose\nan accelerated variant of HSG-HT which can be shown to have improved factor\ndependence on restricted condition number in the quadratic case. Numerical results\ncon\ufb01rm our theoretical af\ufb01rmation and demonstrate the computational ef\ufb01ciency\nof the proposed methods.\n\n1\n\nIntroduction\n\n(cid:88)n\n\ni=1\n\nWe consider the following sparsity- or rank-constrained \ufb01nite-sum minimization problems which are\nwidely applied in high-dimensional statistical estimation:\n\n1\nn\n\ns.t. (cid:107)x(cid:107)0 \u2264 k or\n\nrank (x) \u2264 k,\n\nmin\n\nx\n\nfi(x),\n\nf (x) :=\n\n(1)\nwhere each individual loss fi(x) is associated with the i-th sample, (cid:107)x(cid:107)0 denotes the number of\nnonzero entries in x as a vector variable, rank (x) denotes the rank of x as a matrix variable, and k\nrepresents the sparsity/low-rankness level. Such a formulation encapsulates several important prob-\nlems, including (cid:96)0-constrained linear/logistic regression [1, 2, 3], sparse graphical model learning [4],\nand low-rank multivariate and multi-task regression [5, 6, 7, 8, 9], to name a few.\nWe are particularly interested in gradient hard thresholding methods [10, 11, 12, 13] which are\npopular and effective for solving problem (1). The common theme of this class of methods is to\niterate between gradient descent and hard thresholding to maintain sparsity/low-rankness of solution\nwhile minimizing the loss function. In our problem setting, a plain gradient hard thresholding iteration\nis given by xt+1 = \u03a6k(xt \u2212 \u03b7\u2207f (xt)), where \u03a6k(\u00b7), as de\ufb01ned in Section 2, denotes the hard\nthresholding operation that preserves the top k entries (in magnitude) of a vector or produces an\noptimal rank-k approximation to a matrix via singular value decomposition (SVD). When considering\ngradient hard thresholding methods, two main sources of computational complexity are at play:\nthe gradient evaluation complexity and the hard thresholding complexity. As the per-iteration hard\nthresholding can be as expensive as SVD in rank-constrained problems, our goal is to develop methods\nthat iterate and converge quickly while using a minimal number of hard thresholding operations.\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fTable 1: Comparison of different hard thresholding algorithms for sparsity- and rank-constrained\nproblem (1). Both computational complexity and statistical error are evaluated w.r.t. the estimation\nerror (cid:107) \u02dcx \u2212 x\u2217(cid:107) between the k-sparse/rank estimator \u02dcx and the k\u2217-sparse/rank optimum x\u2217. Both\n\n\u03bas and \u03ba(cid:98)s denote the restricted condition numbers with s = 2k + k\u2217 and(cid:98)s = 3k + k\u2217. (cid:101)I =\nsupp(\u03a62k(\u2207f (x\u2217)))\u222asupp(x\u2217) and(cid:98)I = supp(\u03a63k(\u2207f (x\u2217)))\u222asupp(x\u2217) are two support sets.\n\nThe results of AHSG-HT are established for quadratic loss functions.\n\n(cid:1)\n\non \u03bas\n\nSG-HT [20]\n\nRestriction Required\nvalue of k\n\nComputational Complexity\n#IFO\n\n#Hard Thresholding Sparsity-constrained Problem1\n\nk + k\u2217(cid:107)\u2207f (x\u2217)(cid:107)\u221e(cid:1)\nsk\u2217) O(cid:0)n\u03bas log(cid:0) 1\nO(cid:0)\u221a\nO(cid:0)\u03bas log(cid:0) 1\n(cid:1)(cid:1)\n(cid:1)(cid:1)\n(cid:1)\n(cid:1)(cid:1)\nO(cid:0) 1\nO(cid:0)\u03bas log(cid:0) 1\n(cid:1)(cid:1)\nsk\u2217) O(cid:0)\u03bas log(cid:0) 1\n(cid:80)n\n(cid:1)(cid:1) O(cid:0)\u221a\nsk\u2217) O(cid:0)(n+\u03bas)log(cid:0) 1\n(cid:1)(cid:1) O(cid:0)(n+\u03bas)log(cid:0) 1\ni=1 (cid:107)\u2207fi(x\u2217)(cid:107)2\nO(cid:0) \u03bas\nO(cid:0)(cid:107)\u2207(cid:101)If (x\u2217)(cid:107)2\n(cid:1)\n(cid:1)\nO(cid:0)\u03bas log(cid:0) 1\n(cid:1)(cid:1)\ns(cid:107)\u2207f (x\u2217)(cid:107)\u221e+(cid:107)\u2207(cid:101)If (x\u2217)(cid:107)2\n(cid:1)\nO(cid:0)(cid:107)\u2207(cid:98)If (x\u2217)(cid:107)2\n\u03ba(cid:98)s log(cid:0) 1\nO(cid:0) \u221a\n(cid:1)\nO(cid:0)\u221a\n(cid:1)(cid:1)\nsk\u2217)\n\u03ba(cid:98)s\n\u2126(\u03ba(cid:98)sk\u2217)\n\nHSG-HT\nAHSG-HT\n1 For general rank-constrained problem, the statistic error is not explicitly provided in FG-HT, SG-HT and\n\nFG-HT [12, 13] \u2014\n\u2264 4\n3\nSVRG-HT [21] \u2014\n\u2014\n\u2014\n\n\u2126(\u03ba2\n\u2126(\u03ba2\n\u2126(\u03ba2\n\u2126(\u03ba2\n\nStatistical Error on\n\nn\n\n\u0001\n\n\u0001\n\n\u0001\n\n\u0001\n\n\u0001\n\n\u0001\n\n\u0001\n\n\u0001\n\n\u0001\n\n\u0001\n\nSVRG-HT while is given in our Theorem 1 for HSG-HT and Theorem 3 for AHSG-HT.\n\nFull gradient hard thresholding. The plain form of full gradient hard thresholding (FG-HT)\nalgorithm has been extensively studied in compressed sensing and sparse learning [10, 12, 13, 14].\nAt each iteration, FG-HT \ufb01rst updates the variable x by using full gradient descent and then performs\nhard thresholding on the updated variable. Theoretical results show that FG-HT converges linearly\ntowards a proper nominal solution with high estimation accuracy [12, 13, 15]. Besides, compared\nwith the algorithms adopting (cid:96)1- or nuclear-norm convex relaxation (e.g., [16, 17, 18, 19]), directly\nsolving problem (1) via FG-HT often exhibits similar accuracy guarantee but is more computationally\nef\ufb01cient. However, despite these desirable properties, FG-HT needs to compute the full gradient at\neach iteration which can be expensive in large-scale problems. If the restricted condition number is\n\n(cid:1)(cid:1) iterations are needed to attain an \u0001-suboptimal solution (up to a statistical\n\n\u03bas, then O(cid:0)\u03bas log(cid:0) 1\n(IFO, see De\ufb01nition 1), is O(cid:0)n\u03bas log(cid:0) 1\n\n(cid:1)(cid:1) which scales linearly with n\u03bas.\n\nerror), and thus the sample-wise gradient evaluation complexity, or incremental \ufb01rst order oracle\n\n\u0001\n\n\u0001\n\n\u0001\n\nStochastic gradient hard thresholding. To improve computational ef\ufb01ciency, stochastic hard thresh-\nolding algorithms [20, 21, 22] have recently been developed via leveraging the \ufb01nite-sum structure\nof problem (1). For instance, Nguyen et al. [20] proposed a stochastic gradient hard thresholding\n(SG-HT) algorithm for solving problem (1). At each iteration, SG-HT only evaluates gradient of one\n(or a mini-batch) randomly selected sample for variable update and hard thresholding. It was shown\n\nthat the IFO complexity and hard thresholding complexity of SG-HT are both O(cid:0)\u03bas log(cid:0) 1\n\n(cid:1)(cid:1) which\n\nis independent on n. However, SG-HT can only be shown to converge to a sub-optimal statistical esti-\nmation accuracy (see Table 1) which is inferior to that of the full-gradient methods. Another limitation\nof SG-HT is that it requires the restricted condition number \u03bas to be not larger than 4/3 which is hard\nto meet in realistic high-dimensional sparse estimation problems such as sparse linear regression [13].\nTo overcome these issues, the stochastic variance reduced gradient hard thresholding (SVRG-HT)\nalgorithm [21, 22] was developed as an adaptation of SVRG [23] to problem (1). Bene\ufb01ting from\nthe variance reduced technique, SVRG-HT can converge more stably and ef\ufb01ciently while having\nbetter estimation accuracy than SG-HT. Also different from SG-HT, the convergence analysis for\nSVRG-HT allows arbitrary bounded restricted condition number. As shown in Table 1, both the IFO\n\ncomplexity and hard thresholding complexity of SVRG-HT are O(cid:0)(n + \u03bas) log(cid:0) 1\n\n(cid:1)(cid:1). Although the\n\nIFO complexity of SVRG-HT substantially improves over FG-HT, the overall complexity still scale\nlinearly with respect to the sample size n. Therefore, when the data-scale is huge (e.g., n (cid:29) \u03bas) and\nthe per-iteration hard thresholding operation is expensive, SVRG-HT could still be computationally\ninef\ufb01cient in practice. Later, Chen et al. [24] proposed a stochastic variance-reduced block coordinate\ndescent algorithm. But its overall complexity still scale linearly with respect to the sample size and\nthus it faces the same challenge as SVRG-HT in computation.\nOverview of our approach. The method we propose can be viewed as a simple yet ef\ufb01cient extension\nof the hybrid stochastic gradient descent (HSGD) method [25, 26, 27] from unconstrained \ufb01nite-sum\nminimization to the cardinality-constrained \ufb01nite-sum problem (1). The core idea of HSGD is to\niteratively sample an evolving mini-batch of terms in the \ufb01nite-sum for gradient estimation. This\nstyle of incremental gradient method has been shown, both in theory and practice, to bridge smoothly\nthe gap between deterministic and stochastic gradient methods [26]. Inspired by the success of\n\n\u0001\n\n2\n\n\fHSGD, we propose the hybrid stochastic gradient hard thresholding (HSG-HT) method which has\nthe following variable update form:\n\n(cid:0)xt \u2212 \u03b7gt(cid:1) , with gt =\n\n(cid:88)\n\nxt+1 = \u03a6k\n\n1\nst\n\n\u2207fit(xt),\n\nit\u2208St\n\n(cid:0)xt \u2212 \u03b7gt + \u03bd(xt \u2212 xt\u22121)(cid:1) .\n\nwhere \u03b7 is the learning rate and St is the set of st selected samples. In early stage of iterations,\nHSG-HT selects a few samples to estimate the full gradient; and along with more iterations, st\nincreases, giving more accurate full gradient estimation. Such a mechanism allows it to enjoy\nthe merits of both SG-HT and FG-HT, i.e. the low iteration complexity of SG-HT and the steady\nconvergence rate of FG-HT with constant learning rate \u03b7. Given a k\u2217-sparse/low-rank target solution\nx\u2217, for objective function with restricted condition number \u03bas and s = 2k + k\u2217, we show that\n\nO(cid:0) \u03bas\n(cid:1)(cid:1) steps of hard thresholding operation are suf\ufb01cient\nfor HSG-HT to \ufb01nd \u02dcx such that (cid:107) \u02dcx \u2212 x\u2217(cid:107)2 \u2264 \u0001 + O(cid:0)s(cid:107)\u2207f (x\u2217)(cid:107)2\u221e(cid:1). In this way, HSG-HT exhibits\n\n(cid:1) rounds of IFO update and O(cid:0)\u03bas log(cid:0) 1\n\nsample-size-independent IFO and hard thresholding complexity. Another attractiveness of HSG-HT\nis that it can be readily accelerated via applying the heavy ball acceleration technique [28, 29, 30]. To\nthis end, we modify the iteration of HSG-HT by adding a small momentum \u03bd(xt \u2212 xt\u22121) for some\n\u03bd > 0 to the gradient descent step:\n\n\u0001\n\n\u0001\n\nxt+1 = \u03a6k\n\nWe call the above modi\ufb01ed version as accelerated HSG-HT (AHSG-HT). For quadratic problems,\n\u03ba(cid:98)s\n\nwe prove that such a simple momentum strategy boosts the IFO complexity of HSG-HT to O(cid:0)\u221a\n(cid:1),\n(cid:1)(cid:1), where(cid:98)s = 3k + k\u2217. To the best of our\nand the hard thresholding complexity to O(cid:0)\u221a\n\n\u03ba(cid:98)s log(cid:0) 1\n\nknowledge, AHSG-HT is the \ufb01rst momentum based algorithm that can be provably shown to have\nsuch an improved complexity for stochastic gradient hard thresholding.\nHighlight of results and contribution. Table 1 summarizes our main results on computational\ncomplexity and statistical estimation accuracy of HSG-HT and AHSG-HT, along with the results\nfor the above mentioned state-of-the-art gradient hard thresholding algorithms. From this table we\ncan observe that our methods have several theoretical advantages over the considered prior methods,\nwhich are highlighted in the next few paragraphs.\nOn sparsity/low-rankness level constraint condition. AHSG-HT in the quadratic case substantially\n\nimproves the bounding condition on the sparsity/low-rankness level k: it only requires k = \u2126(\u03ba(cid:98)sk\u2217),\nwhile the other considered algorithms with optimal statistical estimation accuracy all require k =\nsk\u2217). Moreover, both HSG-HT and AHSG-HT get rid of the restrictive condition \u03bas \u2264 4/3\n\u2126(\u03ba2\nrequired in SG-HT.\nOn statistical estimation accuracy. For sparsity-constrained problem, the statistical estimation\n(cid:80)n\naccuracy of HSG-HT is comparable to that in FG-HT and is better than those in SVRG and SG-HT, as\n(cid:107)\u2207(cid:101)If (x\u2217)(cid:107)2 in HSG-HT is usually superior to the error\ns(cid:107)\u2207f (x\u2217)(cid:107)\u221e in SVRG-HT and is much\ni=1 (cid:107)\u2207fi(x\u2217)(cid:107)2 in SG-HT. AHSG-HT has even smaller estimation error\ncardinality of the support set(cid:98)I, especially when the restrictive condition number is sensitive to k.\nthan HSG-HT since it allows smaller sparsity/low-rankness level k = \u2126(\u03ba(cid:98)sk\u2217) and thus a smaller\n\nsmaller than the one 1\nn\n\n\u221a\n\n\u0001\n\n\u0001\n\nOn computational complexity. Both HSG-HT and AHSG-HT enjoy sample-size-independent IFO\nand hard thresholding complexity. To compare the IFO complexity, our methods will be cheaper\nthan FG-HT and SVRG-HT when n dominates 1\n\u0001 . This suggests that HSG-HT and AHSG-HT are\nmore suitable for handling large-scale data. SG-HT has the lowest IFO complexity, which however\nis obtained at the price of severely deteriorated statistical estimation accuracy. In terms of hard\nthresholding complexity, AHSG-HT is the best one and HSG-HT matches FG-HT and SG-HT.\nLast but not least, we highlight that AHSG-HT, to our best knowledge, for the \ufb01rst time provides\nimproved convergence guarantees for momentum based stochastic gradient hard thresholding methods.\nWhile in convex problems the momentum based methods such as heavy ball and Nesterov\u2019s methods\nhave long been known to work favorably for accelerating full/stochastic gradient methods [28, 31, 32,\n33], it still remains largely unknown if it is possible to accelerate gradient hard thresholding methods\nfor solving the non-convex \ufb01nite-sum problem (1). There is a recent attempt at understanding a\nNesterov\u2019s momentum full gradient hard thresholding method [34]. Although showing to attain\nimproved rate of convergence under certain conditions, the iteration complexity bound established\nin [34] still does not exhibit better dependence on restricted condition number than the plain FG-HT.\nIn contrast, at least in the quadratic case, AHSG-HT can be shown to have improved dependence on\ncondition number than the existing gradient hard thresholding methods.\n\n3\n\n\f2 Preliminaries\nThroughout this paper, we use (cid:107)x(cid:107) to denote the Euclidean norm for vector x \u2208 Rd and the Frobenius\nnorm for matrix x \u2208 Rd1\u00d7d2. (cid:107)x(cid:107)\u221e denotes the largest absolute entry in x. The hard thresholding\noperation \u03a6k(x) preserves the k largest entries of x in magnitude for vector x, and for matrix x it\nk , where Hk and Vk are\nonly preserves the top k-top singular values. Namely, \u03a6k(x) = Hk\u03a3kV T\nrespectively the top-k left and right singular vectors of x, \u03a3k is the diagonal matrix of the top-k\nsingular values of x. We use supp(x) to denote the support set of x. Speci\ufb01cally, for vector x,\nsupp(x) indexes its nonzero entries; and for matrix x \u2208 Rd1\u00d7d2, it indexes the subspace U that is a\nset of singular vectors spanning the column space of x. For vector variable x, \u2207If (x) preserves the\nentries in \u2207f (x) indexed by the support set I and sets the remaining entries to be zero; while for\nmatrix variable x, \u2207If (x) with I = I1 \u222a I2 projects \u2207f (x) into the subspace indexed by I1 \u222a I2,\nnamely \u2207If (x) = (U1U T\n2 )\u2207f (x), where U1 and U2 respectively span\n1 + U2U T\nthe subspaces indexed by I1 and I2.\nWe assume the objective function in (1) to have restricted strong convexity (RSC) and restricted strong\nsmoothness (RSS). For both sparsity- and rank-constrained problems, the RSC and RSS conditions\nare commonly used in analyzing hard thresholding algorithms [12, 13, 21, 22, 20].\nAssumption 1 (Restricted strong convexity condition, RSC). A differentiable function f (x) is\nrestricted \u03c1s-strongly convex with parameter s if there exists a generic constant \u03c1s > 0 such that for\nany x, x(cid:48) with (cid:107)x \u2212 x(cid:48)(cid:107)0 \u2264 s or rank (x \u2212 x(cid:48)) \u2264 s,\n\n2 \u2212 U1U T\n\n1 U2U T\n\nf (x) \u2212 f (x(cid:48)) \u2212 (cid:104)\u2207f (x(cid:48)), x \u2212 x(cid:48)(cid:105) \u2265 \u03c1s\n2\n\n(cid:107)x \u2212 x(cid:48)(cid:107)2.\n\nAssumption 2 (Restricted strong smoothness condition, RSS). For each fi(x), it is said to be\nrestricted (cid:96)s-strongly smooth with parameter s if there exists a generic constant (cid:96)s > 0 such that for\nany x, x(cid:48) with (cid:107)x \u2212 x(cid:48)(cid:107)0 \u2264 s or rank (x \u2212 x(cid:48)) \u2264 s,\n\nfi(x) \u2212 fi(x(cid:48)) \u2212 (cid:104)\u2207fi(x(cid:48)), x \u2212 x(cid:48)(cid:105) \u2264 (cid:96)s\n2\n\n(cid:107)x \u2212 x(cid:48)(cid:107)2.\n\nWe also need to impose the following boundness assumption on the variance of stochastic gradient.\nAssumption 3 (Bounded stochastic gradient variance). For any x and each loss fi(x), the distance\nbetween \u2207fi(x) and the full gradient \u2207f (x) is upper bounded as maxi (cid:107)\u2207fi(x)\u2212\u2207f (x)(cid:107)\u2264 B.\nSimilar to [21, 23, 35, 36], the incremental \ufb01rst order oracle (IFO) complexity is adopted as the\ncomputational complexity metric for solving \ufb01nite-sum problem (1). In high-dimensional sparse\nlearning and low-rank matrix recovery problems, the per-iteration hard thresholding operation can\nbe equally time-consuming or even more expensive than gradient evaluation. For instance, in rank-\nconstrained problems, hard thresholding operation can be as expensive as top-k SVD for a matrix.\nTherefore we also need to take the computational complexity of hard thresholding into our account.\nDe\ufb01nition 1 (IFO and Hard Thresholding Complexity). For f (x) in problem (1), an IFO takes an\nindex i \u2208 [n] and a point x, and returns the pair (fi(x),\u2207fi(x)). In a hard thresholding operation,\nwe feed x into \u03a6k(\u00b7) and obtain the output \u03a6k(x).\nThe IFO and hard thresholding complexity as a whole can more comprehensively re\ufb02ect the overall\ncomputational performance of a \ufb01rst-order hard thresholding algorithm, as objective value, gradient\nevaluation and hard thresholding operation usually dominate the per-iteration computation.\n\n3 Hybrid Stochastic Gradient Hard Thresholding\n\nIn this section, we \ufb01rst introduce the Hybrid Stochastic Gradient Hard Thresholding (HSG-HT)\nalgorithm and then analyze its convergence performance for sparsity- and rank-constrained problems.\n\n3.1 The HSG-HT Algorithm\n\nThe HSG-HT algorithm is outlined in Algorithm 1. At the t-th iteration, it \ufb01rst uniformly randomly\nselects st samples St from all data and evaluates the approximated gradient gt = 1\n\u2207fit (xt).\n\n(cid:80)\n\nst\n\nit\u2208St\n\n4\n\n\fAlgorithm 1: (Accelerated) Hybrid Stochastic Gradient Hard Thresholding\nInput\nfor t = 1, 2, ..., T \u2212 1 do\n\n:Initial point x0, sample index set S = {1,\u00b7\u00b7\u00b7, n}, learning rate \u03b7, momentum strength \u03bd,\nmini-batch sizes {st}.\n\nUniformly randomly select st samples St from S\nCompute the approximate gradient gt = 1\nit\u2208St\nst\nUpdate xt+1 using either of the following two options:\n\n\u2207fit(xt)\n(O1) xt+1 = \u03a6k (xt \u2212 \u03b7gt); /* for plain HSG-HT */\n(O2) xt+1 = \u03a6k\n\n(cid:0)xt \u2212 \u03b7gt + \u03bd(xt \u2212 xt\u22121)(cid:1); /* for accelerated HSG-HT */\n\n(cid:80)\n\nend\nOutput : xT .\n\n(cid:0)xt \u2212 \u03b7gt + \u03bd(xt \u2212 xt\u22121)(cid:1), leading to an accelerated variant of HSG-HT.\n\nThen, there are two options for variable update. The \ufb01rst option is to update xt+1 with a standard\nlocal descent step along gt followed by a hard threholding step, giving the plain update procedure\nxt+1 = \u03a6k (xt \u2212 \u03b7gt) in option O1. The other option O2 is to update xt+1 based on a momentum\nformulation xt+1 = \u03a6k\nThe plain update O1 is actually a special case of the momentum based update in O2 with strength\n\u03bd = 0. In early stage of iteration when the mini-batch size st is relatively small, HSG-HT performs\nmore like SG-HT with low per-iteration gradient evaluation cost. Along with more iterations, st\nincreases and HSG-HT performs like full gradient hard thresholding methods. Next, we analyze the\nparameter estimation accuracy and the objective value convergence of HSG-HT. The analysis of the\naccelerated version will be presented in Section 4.\n\n3.2 Statistical Estimation Analysis\n\nWe \ufb01rst analyze the parameter estimation performance of HSG-HT by characterizing the distance\nbetween the output of Algorithm 1 and the optimum x\u2217. Such an analysis is helpful in understanding\nthe convergence behavior and statistical estimation accuracy of the computed solution. We summarize\nthe main result for both sparsity- and rank-constrained problems in Theorem 1.\nTheorem 1. Suppose the objective function f (x) is \u03c1s-strongly convex and each individual fi(x)\n\u221a\nis (cid:96)s-strongly smooth with parameter s = 2k + k\u2217. Let \u03bas = (cid:96)s\nk\u2217\u221a\nk\u2212k\u2217 . Assume the\nand the mini-batch\n3\u03c1s(cid:96)s(cid:107)x0\u2212x\u2217(cid:107)2 . Then the output xT of HSG-HT satis\ufb01es\n\nsparsity/low-rankness level k \u2265(cid:0)1 + 712\u03ba2\n\n\u03c9t with \u03c9 = 1 \u2212 1\n\n(cid:1) k\u2217. Set the learning rate \u03b7 = 1\n(cid:1)T /2(cid:107)x0 \u2212 x\u2217(cid:107) +\n(cid:112)12(1 \u2212 \u03b2)\n\nE(cid:107)xT \u2212 x\u2217(cid:107) \u2264(cid:0)1 \u2212 1\n\n(cid:1),(cid:101)I = supp(\u03a62k(\u2207f (x\u2217))) \u222a supp(x\u2217), and T is the number of iterations.\n\n480\u03bas\n\nwhere \u03b2 = \u03b1(cid:0)1 \u2212 1\n\n(cid:107)\u2207(cid:101)If (x\u2217)(cid:107),\n\nand \u03b1 = 1 + 2\n\nsize st = \u03c4\n\n\u221a\n\n\u03b1\n\n480\u03bas\n\ns\n\nand \u03c4 \u2265\n\n40\u03b1B\n\n12\u03bas\n\n6(cid:96)s\n\n\u03c1s\n\n(cid:96)s\n\n480\u03bas\n\n\u03c9 with \u03c9 = 1 \u2212 1\n\nA proof of Theorem 1 is given in Appendix B.1. Theorem 1 shows that for both sparsity- and rank-\nsk\u2217) and gradually expanding the\nconstrained problem, if using sparsity/low-rankness level k = \u2126(\u03ba2\n, then in expectation the sequence\nmini-batch size at an exponential rate of 1\n{xt} generated by HSG-HT converges linearly towards x\u2217 at the rate of (1\u2212 1\n) 1\n2 . This indicates\nthat HSG-HT enjoys a similar fast and steady convergence rate just like the deterministic FG-HT [13].\nAs the condition number \u03bas = (cid:96)s/\u03c1s is usually large in realistic problems, the exponential rate 1\n\u03c9\nis actually only a slightly larger than one. This means even a moderate-scale dataset will allow\nHSGD-HT to iterate suf\ufb01ciently for decreasing the loss, as illustrated in Figure 1 and 2 in Section 5.\nOne can also observe that the estimation error of E(cid:107)xt \u2212 x\u2217(cid:107) is controlled by the multiplier of\n(cid:107)\u2207(cid:101)If (x\u2217)(cid:107) which usually represents the statistical error of model. For sparsity-constrained prob-\nthan the error bound O(cid:0)\u221a\n(cid:1) with s = 2k + k\u2217 in SVRG-HT [21]\n(cid:1) for\nsince (cid:107)\u2207(cid:101)If (x\u2217)(cid:107)2 \u2264 \u221a\nSG-HT [20], the error term of HSG-HT is signi\ufb01cantly smaller. This is because the magnitude\n(cid:107)\u2207f (x\u2217)(cid:107)2 of the full gradient is usually small when sample size is large, while the individual\n(or small mini-batch) gradient norm (cid:107)\u2207fi(x\u2217)(cid:107)2 could still have relatively large magnitude. For\n\ns(cid:107)\u2207f (x\u2217)(cid:107)\u221e. Compared with the error O(cid:0) 1\ns(cid:107)\u2207f (x\u2217)(cid:107)\u221e + (cid:107)\u2207(cid:101)If (x\u2217)(cid:107)2\n\nlem, such a statistical error bound matches that established in FG-HT [13], and is usually better\n\n(cid:80)n\ni=1 (cid:107)\u2207fi(x\u2217)(cid:107)2\n\n480\u03bas\n\nn\n\n5\n\n\fexample, in sparse linear regression problems the difference could be as signi\ufb01cant as O((cid:112)log(d)/n)\n(in HSG-HT) versus O((cid:112)log(d)) (in SG-HT). Notice, for the general rank-constrained problem,\n\nFG-HT, SG-HT and SVRG-HT do not explicitly provide the statistical error as given by HSG-HT.\nIndeed, SG-HT and SVRG-HT only considered a low-rank matrix linear model which is a special\ncase of the general rank-constrained problem (1). Moreover, to guarantee convergence, SG-HT\nrequires the restrictive condition \u03bas \u2264 4/3, while our analysis removes such a condition and allows\nfor an arbitrarily large \u03bas as long as it is bounded.\nBased on Theorem 1, we can derive the IFO and hard thresholding complexity of HSG-HT for\nproblem (1) in Corollary 1 with proof in Appendix B.2. For fairness, here we follow the convention\n\n, the IFO complexity of HSG-HT in Algorithm 1 is O(cid:0) \u03bas\n\nin [13, 20, 21, 22] to use E(cid:107)x \u2212 x\u2217(cid:107) \u2264 \u221a\n\u0001 + statistical error as the measure of \u0001-suboptimality.\n(cid:1) and the hard threshold-\nCorollary 1. Suppose the conditions in Theorem 1 hold. To achieve E(cid:107)xT \u2212 x\u2217(cid:107) \u2264 \u221a\ning complexity is O(cid:0)\u03bas log(cid:0) 1\n\u221a\n\u03b1(cid:107)\u2207(cid:101)I f (x\u2217)(cid:107)\n\u221a\n12(1\u2212\u03b2)\nCompared with FG-HT [13] and SVRG-HT [21, 22] whose IFO complexity are O(cid:0)n\u03bas log(cid:0) 1\n(cid:1)(cid:1) and\nO(cid:0)(n + \u03bas) log(cid:0) 1\n(cid:1)(cid:1) respectively, HSG-HT is more computationally ef\ufb01cient in IFO than FG-HT\nHSG-HT shares the same complexity O(cid:0)\u03bas log(cid:0) 1\n(cid:1)(cid:1) with FG-HT, which is considerably cheaper\n(cid:1)(cid:1) hard thresholding complexity of SVRG-HT when data scale is large.\nthan the O(cid:0)(n + \u03bas) log(cid:0) 1\n\nand SVRG-HT when sample size n dominates 1\n\u0001 . This is usually the case when the data scale is\nhuge while the desired accuracy \u0001 is moderately small. Concerning the hard thresholding complexity,\n\nOverall, HSG-HT is able to achieve better trade-off between IFO and hard thresholding complexity\nthan FG-HT and SVRG-HT when n is much larger than 1\n\n\u0001 in large-scale learning problems.\n\n(cid:1)(cid:1).\n\n\u0001 +\n\n(cid:96)s\n\n\u0001\n\n\u0001\n\n\u0001\n\n\u0001\n\n\u0001\n\n\u0001\n\n3.3 Convergence Analysis\n\nFor sparsity-constrained problem, we further investigate the convergence behavior of HSG-HT in\nterms of the objective value f (x) towards the optimal loss f (x\u2217). The main result is summarized in\nthe following theorem, whose proof is deferred to Appendix B.3.\nTheorem 2. Suppose f (x) is \u03c1s-strongly convex and each individual component fi(x) is (cid:96)s-strongly\nsmooth with parameter s = 2k + k\u2217. Let \u03bas = (cid:96)s\nBy setting the learning rate \u03b7 = 1\n2(cid:96)s\n\u03c4 \u2265\n\nand\n\u03c1s[f (x0)\u2212f (x\u2217)] , then for sparsity-constrained problem, the output xT of Algorithm 1 satis\ufb01es\n\nand the sparsity level k \u2265 (cid:0)1 + 64\u03ba2\n(cid:1)T\n\nE[f (xT ) \u2212 f (x\u2217)] \u2264(cid:0)1 \u2212 1\n\n\u03c9t with \u03c9 = 1 \u2212 1\n\nand the mini-batch size st = \u03c4\n\n[f (x0) \u2212 f (x\u2217)].\n\n(cid:1) k\u2217.\n\n148B\u03ba2\ns\n\n16\u03bas\n\n\u03c1s\n\ns\n\n16\u03bas\n\nTheorem 2 shows that for sparsity-constrained problem, HSG-HT in expectation also enjoys linear\nconvergence in terms of the objective value by gradually exponentially expanding the mini-batch\nsize. The result in Theorem 2 also implies that the expected value of f (xt) can be arbitrarily close to\nthe k\u2217-sparse target value f (x\u2217) as long as the iteration number is suf\ufb01ciently large. This property\nis important, since in realistic problems, such as classi\ufb01cation or regression problems, if f (x) is\nmore close to the optimum f (x\u2217), then the prediction result can be better. FG-HT [13] also enjoys\nsuch a good property. In contrast, for SVRG-HT [21], the convergence bound is known to be\nE[f (xt) \u2212 f (x\u2217)] \u2264 O (\u03b6 t +\ns(cid:107)\u2207f (x\u2217)(cid:107)\u221e) for some shrinkage rate \u03b6 \u2208 (0, 1). That result is\ninferior to ours due to the presence of a non-vanishing statistical barrier term\n\ns(cid:107)\u2207f (x\u2217)(cid:107)\u221e.\n\n\u221a\n\n\u221a\n\n4 Acceleration via Heavy-Ball Method\n\n(cid:0)xt \u2212 \u03b7gt + \u03bd(xt \u2212 xt\u22121)(cid:1). The following result con\ufb01rms that such an accelerated variant,\n\nIn this section, we show that HSG-HT can be effectively accelerated by applying the heavy ball\ntechnique [28, 29]. As proposed in the option O2 in Algorithm 1, the idea is to use the integration\nof the estimated gradient gt and a small momentum \u03bd(xt \u2212 xt\u22121) to modify the update as xt+1 =\n\u03a6k\ni.e. AHSG-HT, can signi\ufb01cantly improve the ef\ufb01ciency of HSG-HT for quadratic loss functions. A\nproof of this result can be found in Appendix C.1.\n\n6\n\n\fconditions with parameter(cid:98)s = 3k + k\u2217. Let \u03ba(cid:98)s = (cid:96)(cid:98)s\nk \u2265 (1 + 16\u03ba(cid:98)s) k\u2217. Set the learning rate \u03b7 =\n\u03c9 = (1 \u2212 1\n\u221a\nthe output xT of AHSG-HT in Algorithm 1 satis\ufb01es\n\nTheorem 3. Suppose the objective function f (x) is quadratic and it satis\ufb01es the RSC and RSS\n\u03c1(cid:98)s\n. Assume the sparsity/low-rankness level\n(cid:96)(cid:98)s)4(cid:107)x0\u2212x\u2217(cid:107)2 , the momentum parameter \u03bd =(cid:0)\u221a\n\u221a\n\u03c1(cid:98)s)2 , the mini-batch size st = \u03c4\n(cid:96)(cid:98)s+\n\u03c9t where\n4\n\u03ba(cid:98)s\u22121\n\u221a\n\u03ba(cid:98)s+1\n(cid:1)T(cid:107)x0 \u2212 x\u2217(cid:107) +\n(cid:107)\u2207(cid:98)If (x\u2217)(cid:107),\n\nE(cid:107)xT \u2212 x\u2217(cid:107) \u2264 2(cid:0)1 \u2212 1\n\nwhere(cid:98)I = supp(\u03a63k(\u2207f (x\u2217))) \u222a supp(x\u2217) and T is the number of iterations.\n\n(cid:1)2. Then\n\n\u221a\n\u03ba(cid:98)s\n\u221a\n8\n\u03c1(cid:98)s +\n\n81B\u03ba(cid:98)s\n\u221a\n\u03c1(cid:98)s+\n\n)2 and \u03c4 \u2265\n\n(cid:96)(cid:98)s)2\n\n\u221a\n(\n\n\u221a\n(\n\n\u03ba(cid:98)s\n\n\u221a\n\n\u03bas\n\n18\n\n\u221a\n\n18\n\n4(\n\n2\n\n18\n\n480\u03bas\n\n\u03ba(cid:98)s\n\nFrom this result, we can observe that for both sparsity- and rank-constrained quadratic loss mini-\nmization problems, AHSG-HT has a faster convergence rate (1\u2212 1\n\u221a\n) 1\nsmaller than \u03bas since the factor k in(cid:98)s = 3k + k\u2217 is allowed to be smaller than that in s = 2k + k\u2217\nof HSG-HT. This is because the restricted condition number \u03ba(cid:98)s is usually comparable to or even\n(explained below). Also, such an acceleration relaxes the restriction on the sparsity/low-rankness level\nk: AHSG-HT allows k = \u2126(\u03ba(cid:98)sk\u2217) which is considerably superior to the condition of k = \u2126(\u03ba2\nsk\u2217)\nas required in the analysis of other hard thresholding algorithms including HSG-HT, FG-HT and\nSVRG-HT. As a direct consequence, the statistical error bound O((cid:107)\u2207(cid:98)If (x\u2217)(cid:107)) of AHSG-HT can\nbe improved in the sense that the cardinality |(cid:98)I| = 3k + k\u2217 has better dependency on the restricted\n\n) than the rate (1\u2212 1\n\n, the IFO complexity of AHSG-HT in Algorithm 1 is O(cid:0)\u221a\n\ncondition number \u03bas.\nTo better illustrate the boosted ef\ufb01ciency, we establish the computational complexity of AHSG-HT in\nIFO and hard thresholding in Corollary 2, whose proof is given in Appendix C.2.\n\u221a\n8\n\n(cid:1) and the hard threshold-\nCorollary 2. Suppose the conditions in Theorem 3 hold. To achieve E(cid:107)xT \u2212 x\u2217(cid:107) \u2264 \u221a\n(cid:1)(cid:1).\ning complexity is O(cid:0)\u221a\n\u03ba(cid:98)s log(cid:0) 1\n\u03ba(cid:98)s(cid:107)\u2207(cid:98)I f (x\u2217)(cid:107)\n\u221a\n\u221a\n\u03c1(cid:98)s+\n(cid:96)(cid:98)s)2\n(cid:1), and its hard thresholding complexity can\nquadratic case can be reduced from O(cid:0) \u03bas\nbe reduced from O(cid:0)\u03bas log(cid:0) 1\n(cid:1)(cid:1) to O(cid:0)\u221a\n(cid:1)(cid:1). Such an improvement in the dependency on\n\nCorollary 2 shows that equipped with heavy ball acceleration, the IFO complexity of HSG-HT in the\n\n(cid:1) to O(cid:0)\u221a\n\u03ba(cid:98)s log(cid:0) 1\n\nrestricted condition number is noteworthy in large-scale ill-conditioned learning problems.\n\n\u03ba(cid:98)s\n\n\u03ba(cid:98)s\n\n\u0001 +\n\n\u0001\n\n\u0001\n\n\u0001\n\n\u0001\n\n(\n\n\u0001\n\n\u0001\n\nFigure 1: Single-epoch processing: comparison among hard thresholding algorithms for a single pass\nover data on sparse logistic regression with regularization parameter \u03bb = 10\u22125.\n\n5 Experiments\n\nWe now compare the numerical performance of HSG-HT and AHSG-HT to several state-of-the-\nart algorithms, including FG-HT [13], SG-HT [20] and SVRG-HT [21, 22]. We evaluate all the\nconsidered algorithms on two sets of learning tasks. The \ufb01rst set contains two sparsity-constrained\nproblems: logistic regression with fi(x) = log(1+exp(\u2212bia(cid:62)\n2 and multi-class softmax\n(cid:80)c\nexp(a(cid:62)\ni xj )\nl=1exp(a(cid:62)\n(cid:3),\n\nregression with fi(x) =(cid:80)c\nn(cid:88)\n\n(cid:2) \u03bb\n2c(cid:107)xj(cid:107)2\n(cid:2)(cid:107)bi \u2212 (cid:104)x, ai(cid:105)(cid:107)2\n\n(cid:3), where bi is the target output\n\nof ai and c is the class number. The second one is a rank-constrained linear regression problem:\n\ns.t. rank (x) \u2264 k,\n\n2\u22121{bi = j} log\n\n2(cid:107)x(cid:107)2\n\ni x))+ \u03bb\n\n(cid:107)x(cid:107)2\n\nmin\n\ni xl)\n\nj=1\n\n2 +\n\nF\n\n1\nn\n\nx\n\n\u03bb\n2\n\ni=1\n\n7\n\n01000020000\u221210\u22128\u22126\u22124\u2212202468#IFOObjective Distance log(f \u2212 f*) FG\u2212HTSG\u2212HTSVRG\u2212HTHSG\u2212HTAHSG\u2212HT01000020000\u221210\u22128\u22126\u22124\u2212202468#Hard ThresholdingObjective Distance log(f \u2212 f*) FG\u2212HTSG\u2212HTSVRG\u2212HTHSG\u2212HTAHSG\u2212HTrcv1, k=2000200004000060000\u221210\u22128\u22126\u22124\u221220246#IFOObjective Distance log(f \u2212 f*) FG\u2212HTSG\u2212HTSVRG\u2212HTHSG\u2212HTAHSG\u2212HT0200004000060000\u221210\u22128\u22126\u22124\u221220246#Hard ThresholdingObjective Distance log(f \u2212 f*) FG\u2212HTSG\u2212HTSVRG\u2212HTHSG\u2212HTAHSG\u2212HTreal\u2212sim, k=500\fwhich has several important applications including multi-class classi\ufb01cation and multi-task regression\nfor simultaneously learning shared characteristics of all classes/tasks [37], as well as high dimensional\nimage and \ufb01nancial data modeling [5, 8]. We run simulations on six datasets, including rcv1, real-sim,\nmnist, news20, coil100 and caltech256. The details of these data sets are described in Appendix E.\nFor HSG-HT and AHSG-HT, we follow our theory to exponentially expand the mini-batch size st but\nwith small exponential rate, with \u03c4 = 1. Since there is no ground truth on real data, we run FG-HT\nsuf\ufb01ciently long until (cid:107)xt \u2212 xt+1(cid:107)/(cid:107)xt(cid:107) \u2264 10\u22126, and then use the output f (xt) as the approximate\noptimal value f\u2217 for sub-optimality estimation in Figure 1 and Figure 2.\nSingle-epoch evaluation results. We \ufb01rst consider the sparse logistic regression problem with single-\nepoch processing. As demonstrated in Figure 1 (more experiments in Appendix E) that HSG-HT\nand AHSG-HT converge signi\ufb01cantly faster than the other considered algorithms in one pass over\ndata. This con\ufb01rms our theoretical predictions in Corollary 1 and 2 that HSG-HT and AHSG-HT are\ncheaper in IFO complexity than the sample-size-dependent algorithms when the desired accuracy is\nmoderately small and data scale is large. In view of the hard thresholding complexity, AHSG-HT and\nHSG-HT are comparable to FG-HT and they all require much fewer hard thresholding operations\nthan SG-HT and SVRG-HT to reach the same accuracy. This also well aligns with our theory: in one\n\npass setting, roughly speaking, AHSG-HT and HSG-HT respectively need O(cid:0)\u221a\nO(cid:0)\u03bas log(cid:0) n\n\n(cid:1)(cid:1) and\n(cid:1)(cid:1) steps of hard thresholding which are both much less than the O(n) complexity of\n\n\u03ba(cid:98)s log(cid:0) n\n\n\u03ba(cid:98)s\n\n\u03bas\n\nSG-HT and SVRG-HT. From Figure 1 and the magnifying \ufb01gures in Appendix E for better displaying\nobjective loss decrease along with hard thresholding iteration, one can observe that AHSG-HT has\nshaper convergence behavior than HSG-HT, which demonstrates the acceleration power of AHSG-HT.\nMulti-epoch evaluation results. We further evaluate the considered algorithms on sparsity-\nconstrained softmax regression and rank-constrained linear regression problems, for which an\napproach usually needs multiple cycles of data processing to reach high accuracy solution. In\nour implementation, HSG-HT (and AHSG-HT) degenerates to plain (and accelerated) FG-HT when\nthe mini-batch size exceeds data size. The degeneration case, however, does not happen in our\nexperiment with the speci\ufb01ed small expanding rate. The corresponding results are illustrated in\nFigure 2, from which we can observe that HSG-HT and AHSG-HT exhibit much shaper convergence\ncurves and lower hard thresholding complexity than other considered hard thresholding algorithms.\n\n(a) Sparsity-constrained softmax problem\n\n(b) Rank-constrained linear regression problem\n\nFigure 2: Multi-epochs processing: comparison among hard thresholding algorithms with multiple\npasses over data for sparse softmax regression and rank-constrained linear regression problems, both\nwith regularization parameters \u03bb = 10\u22125.\n\n6 Conclusions\n\nIn this paper, we proposed HSG-HT as a hybrid stochastic gradient hard thresholding method for\nsparsity/rank-constrained empirical risk minimization problems. We proved that HSG-HT enjoys the\n\n8\n\n01020304050\u22127\u22126\u22125\u22124\u22123\u22122\u22121012#IFO/nObjective Distance log(f \u2212 f*) FG\u2212HTSG\u2212HTSVRG\u2212HTHSG\u2212HTAHSG\u2212HT01020304050\u22127\u22126\u22125\u22124\u22123\u22122\u22121012#Hard Thresholding/nObjective Distance log(f \u2212 f*) FG\u2212HTSG\u2212HTSVRG\u2212HTHSG\u2212HTAHSG\u2212HTmnist, k=20001020304050\u22124\u22123\u22122\u221210123#IFO/nObjective Distance log(f \u2212 f*) FG\u2212HTSG\u2212HTSVRG\u2212HTHSG\u2212HTAHSG\u2212HT01020304050\u22124\u22123\u22122\u221210123#Hard Thresholding/nObjective Distance log(f \u2212 f*) FG\u2212HTSG\u2212HTSVRG\u2212HTHSG\u2212HTAHSG\u2212HTnew20, k=200001020304050\u22124\u22123\u22122\u2212101#IFO/nObjective Distance log(f \u2212 f*) FG\u2212HTSG\u2212HTSVRG\u2212HTHSG\u2212HTAHSG\u2212HT01020304050\u22124\u22123\u22122\u2212101#Hard Thresholding/nObjective Distance log(f \u2212 f*) FG\u2212HTSG\u2212HTSVRG\u2212HTHSG\u2212HTAHSG\u2212HTcoil100, k=10001020304050\u22124\u22123\u22122\u22121012#IFO/nObjective Distance log(f \u2212 f*) FG\u2212HTSG\u2212HTSVRG\u2212HTHSG\u2212HTAHSG\u2212HT01020304050\u22124\u22123\u22122\u22121012#Hard Thresholding/nObjective Distance log(f \u2212 f*) FG\u2212HTSG\u2212HTSVRG\u2212HTHSG\u2212HTAHSG\u2212HTcaltech256, k=100\fO(cid:0)\u03bas log(cid:0) 1\nindependent IFO complexity of O(cid:0) \u03bas\n\n(cid:1)(cid:1) hard thresholding complexity like full gradient methods, while possessing sample-size-\n(cid:1). Compared to the existing variance-reduced hard thresholding\n\n\u0001\n\n\u0001\n\nalgorithms, HSG-HT enjoys lower overall computational cost when sample size is large and the\naccuracy is moderately small. Furthermore, we showed that HSG-HT can be effectively accelerated\nvia applying the heavy ball acceleration technique to attain improved dependency on restricted\ncondition number. The provable ef\ufb01ciency of HSGT-HT and its accelerated variant has been con\ufb01rmed\nby extensive numerical evaluation with comparison against the prior state-of-the-art algorithms.\n\nAcknowledgements\n\nJiashi Feng was partially supported by NUS startup R-263-000-C08-133, MOE Tier-I R-263-000-\nC21-112, NUS IDS R-263-000-C67-646, ECRA R-263-000-C87-133 and MOE Tier-II R-263-000-\nD17-112. Xiao-Tong Yuan was supported in part by Natural Science Foundation of China (NSFC)\nunder Grant 61522308 and Grant 61876090, and in part by Tencent AI Lab Rhino-Bird Joint Research\nProgram No.JR201801.\n\nReferences\n[1] D. Donoho. Compressed sensing. IEEE Trans. on Information Theory, 52(4):1289\u20131306, 2006. 1\n\n[2] J. Tropp and A. Gilbert. Signal recovery from random measurements via orthogonal matching pursuit.\n\nIEEE Trans. on Information Theory, 53(12):4655\u20134666, 2007. 1\n\n[3] S. Bahmani, B. Raj, and P. Boufounos. Greedy sparsity-constrained optimization. Journal of Machine\n\nLearning Research, 14:807\u2013841, 2013. 1\n\n[4] A. Jalali, C. Johnson, and P. Ravikumar. On learning discrete graphical models using greedy methods. In\n\nProc. Conf. Neutral Information Processing Systems, pages 1935\u20131943, 2011. 1\n\n[5] S. Negahban and M. Wainwright. Estimation of (near) low-rank matrices with noise and high-dimensional\n\nscaling. The Annals of Statistics, pages 1069\u20131097, 2011. 1, 8\n\n[6] P. Zhou and J. Feng. Outlier-robust tensor PCA. In Proc. IEEE Conf. Computer Vision and Pattern\n\nRecognition, pages 1\u20139, 2017. 1\n\n[7] Y. Wang, C. Xu, C. Xu, and D. Tao. Beyond RPCA: Flattening complex noise in the frequency domain. In\n\nAAAI Conf. Arti\ufb01cial Intelligence, 2017. 1\n\n[8] A. Rohde and A. Tsybakov. Estimation of high-dimensional low-rank matrices. The Annals of Statistics,\n\n39(2):887\u2013930, 2011. 1, 8\n\n[9] P. Zhou, C. Lu, Z. Lin, and C. Zhang. Tensor factorization for low-rank tensor completion.\n\nTransactions on Image Processing, 27(3):1152 \u2013 1163, 2017. 1\n\nIEEE\n\n[10] T. Blumensath and M. Davies. Iterative hard thresholding for compressed sensing. Applied and computa-\n\ntional harmonic analysis, 27(3):265\u2013274, 2009. 1, 2\n\n[11] S. Foucart. Hard thresholding pursuit: an algorithm for compressive sensing. SIAM Journal on Numerical\n\nAnalysis, 49(6):2543\u20132563, 2011. 1\n\n[12] X. Yuan, P. Li, and T. Zhang. Gradient hard thresholding pursuit. Journal of Machine Learning Research,\n\n18(166):1\u201343, 2018. 1, 2, 4\n\n[13] P. Jain, A. Tewari, and P. Kar. On iterative hard thresholding methods for high-dimensional M-estimation.\n\nIn Proc. Conf. Neutral Information Processing Systems, pages 685\u2013693, 2014. 1, 2, 4, 5, 6, 7\n\n[14] R. Garg and Rohit R. Khandekar. Gradient descent with sparsi\ufb01cation: an iterative algorithm for sparse\nrecovery with restricted isometry property. In Proc. Int\u2019l Conf. Machine Learning, pages 337\u2013344. ACM,\n2009. 2\n\n[15] X. Yuan, P. Li, and T. Zhang. Exact recovery of hard thresholding pursuit.\n\nInformation Processing Systems, pages 3558\u20133566, 2016. 2\n\nIn Proc. Conf. Neutral\n\n[16] R. Tibshirani. Regression shrinkage and selection via the LASSO. Journal of the Royal Statistical Society.\n\nSeries B (Methodological), pages 267\u2013288, 1996. 2\n\n9\n\n\f[17] S. Van de Geer. High-dimensional generalized linear models and the LASSO. The Annals of Statistics,\n\n36(2):614\u2013645, 2008. 2\n\n[18] B. Recht, M. Fazel, and P. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via\n\nnuclear norm minimization. SIAM review, 52(3):471\u2013501, 2010. 2\n\n[19] N. Srebro, J. Rennie, and T. Jaakkola. Maximum-margin matrix factorization. In Proc. Conf. Neutral\n\nInformation Processing Systems, pages 1329\u20131336, 2005. 2\n\n[20] N. Nguyen, D. Needell, and T. Woolf. Linear convergence of stochastic iterative greedy algorithms with\n\nsparse constraints. IEEE Trans. on Information Theory, 63(11):6869\u20136895, 2017. 2, 4, 5, 6, 7\n\n[21] X. Li, R. Arora, H. Liu, J. Haupt, and T. Zhao. Nonconvex sparse learning via stochastic optimization with\n\nprogressive variance reduction. Proc. Int\u2019l Conf. Machine Learning, 2016. 2, 4, 5, 6, 7\n\n[22] J. Shen and P. Li. A tight bound of hard thresholding. Journal of Machine Learning Research, 18(208):1\u201342,\n\n2018. 2, 4, 6, 7\n\n[23] R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In\n\nProc. Conf. Neutral Information Processing Systems, pages 315\u2013323, 2013. 2, 4\n\n[24] J. Chen and Q. Gu. Accelerated stochastic block coordinate gradient descent for sparsity constrained\n\nnonconvex optimization. In Proc. Conf. Uncertainty in Arti\ufb01cial Intelligence, 2016. 2\n\n[25] D. P. Bertsekas. A new class of incremental gradient methods for least squares problems. SIAM Journal on\n\nOptimization, 7(4):913\u2013926, 1997. 2\n\n[26] M. Friedlander and M. Schmidt. Hybrid deterministic-stochastic methods for data \ufb01tting. SIAM Journal\n\non Scienti\ufb01c Computing, 34(3):A1380\u2013A1405, 2012. 2\n\n[27] P. Zhou, X. Yuan, and J. Feng. New insight into hybrid stochastic gradient descent: Beyond with-\nIn Proc. Conf. Neutral Information Processing Systems, 2018.\n\nreplacement sampling and convexity.\n2\n\n[28] B. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational\n\nMathematics and Mathematical Physics, 4(5):1\u201317, 1964. 3, 6\n\n[29] A. Govan. Introduction to optimization. In North Carolina State University, SAMSI NDHS, Undergraduate\n\nworkshop, 2006. 3, 6\n\n[30] N. Qian. On the momentum term in gradient descent learning algorithms. Neural networks, 12(1):145\u2013151,\n\n1999. 3\n\n[31] Y. Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science &\n\nBusiness Media, 2013. 3\n\n[32] P. Jain, S. Kakade, R. Kidambi, P. Netrapalli, and A. Sidford. Accelerating stochastic gradient descent.\n\narXiv preprint arXiv:1704.08227, 2017. 3\n\n[33] S. Reddi, S. Kale, and S. Kumar. On the convergence of adam and beyond. In Proc. Int\u2019l Conf. Learning\n\nRepresentations, 2018. 3\n\n[34] K. Rajiv and K. Anastasios. IHT dies hard: Provable accelerated iterative hard thresholding. In Proc. Int\u2019l\n\nConf. Arti\ufb01cial Intelligence and Statistics, volume 84, pages 188\u2013198, 2018. 3\n\n[35] Y. Zhang and L. Xiao. Stochastic primal-dual coordinate method for regularized empirical risk minimization.\n\nIn Proc. Int\u2019l Conf. Machine Learning, pages 353\u2013361, 2015. 4\n\n[36] Q. Lin, Z. Lu, and L. Xiao. An accelerated proximal coordinate gradient method. In Proc. Conf. Neutral\n\nInformation Processing Systems, pages 3059\u20133067, 2014. 4\n\n[37] Y. Amit, M. Fink, N. Srebro, and S. Ullman. Uncovering shared structures in multiclass classi\ufb01cation. In\n\nProc. Int\u2019l Conf. Machine Learning, pages 17\u201324. ACM, 2007. 8\n\n10\n\n\f", "award": [], "sourceid": 997, "authors": [{"given_name": "Pan", "family_name": "Zhou", "institution": "National University of Singapore"}, {"given_name": "Xiaotong", "family_name": "Yuan", "institution": "Nanjing University of Information Science and Technology"}, {"given_name": "Jiashi", "family_name": "Feng", "institution": "National University of Singapore"}]}