{"title": "Theoretical Linear Convergence of Unfolded ISTA and Its Practical Weights and Thresholds", "book": "Advances in Neural Information Processing Systems", "page_first": 9061, "page_last": 9071, "abstract": "In recent years, unfolding iterative algorithms as neural networks has become an empirical success in solving sparse recovery problems. However, its theoretical understanding is still immature, which prevents us from fully utilizing the power of neural networks. In this work, we study unfolded ISTA (Iterative Shrinkage Thresholding Algorithm) for sparse signal recovery. We introduce a weight structure that is necessary for asymptotic convergence to the true sparse signal. With this structure, unfolded ISTA can attain a linear convergence, which is better than the sublinear convergence of ISTA/FISTA in general cases. Furthermore, we propose to incorporate thresholding in the network to perform support selection, which is easy to implement and able to boost the convergence rate both theoretically and empirically. Extensive simulations, including sparse vector recovery and a compressive sensing experiment on real image data, corroborate our theoretical results and demonstrate their practical usefulness. 
We have made our codes publicly available: https://github.com/xchen-tamu/linear-lista-cpss.", "full_text": "Theoretical Linear Convergence of Unfolded ISTA\n\nand its Practical Weights and Thresholds\n\nXiaohan Chen\u2217\n\nDepartment of Computer Science and Engineering\n\nTexas A&M University\n\nCollege Station, TX 77843, USA\n\nchernxh@tamu.edu\n\nZhangyang Wang\n\nDepartment of Computer Science and Engineering\n\nTexas A&M University\n\nCollege Station, TX 77843, USA\n\natlaswang@tamu.edu\n\nJialin Liu\u2217\n\nDepartment of Mathematics\n\nUniversity of California, Los Angeles\n\nLos Angeles, CA 90095, USA\nliujl11@math.ucla.edu\n\nWotao Yin\n\nDepartment of Mathematics\n\nUniversity of California, Los Angeles\n\nLos Angeles, CA 90095, USA\nwotaoyin@math.ucla.edu\n\nAbstract\n\nIn recent years, unfolding iterative algorithms as neural networks has become an\nempirical success in solving sparse recovery problems. However, its theoretical\nunderstanding is still immature, which prevents us from fully utilizing the power\nof neural networks. In this work, we study unfolded ISTA (Iterative Shrinkage\nThresholding Algorithm) for sparse signal recovery. We introduce a weight struc-\nture that is necessary for asymptotic convergence to the true sparse signal. With this\nstructure, unfolded ISTA can attain a linear convergence, which is better than the\nsublinear convergence of ISTA/FISTA in general cases. Furthermore, we propose\nto incorporate thresholding in the network to perform support selection, which\nis easy to implement and able to boost the convergence rate both theoretically\nand empirically. Extensive simulations, including sparse vector recovery and a\ncompressive sensing experiment on real image data, corroborate our theoretical\nresults and demonstrate their practical usefulness. 
We have made our codes publicly available.²

1 Introduction

This paper aims to recover a sparse vector x* from its noisy linear measurements:

    b = Ax* + ε,    (1)

where b ∈ ℝ^m, x* ∈ ℝ^n, A ∈ ℝ^{m×n}, ε ∈ ℝ^m is additive Gaussian white noise, and we have m ≪ n. (1) is an ill-posed, highly under-determined system. However, it becomes easier to solve if x* is assumed to be sparse, i.e., the cardinality of the support of x*, S = {i | x*_i ≠ 0}, is small compared to n. A popular approach is to model the problem as the LASSO formulation (λ is a scalar):

    minimize_x (1/2)‖b − Ax‖₂² + λ‖x‖₁    (2)

and solve it using iterative algorithms such as the iterative shrinkage thresholding algorithm (ISTA) [1]:

    x^{k+1} = η_{λ/L}( x^k + (1/L) Aᵀ(b − Ax^k) ),  k = 0, 1, 2, ...,    (3)

where η_θ is the soft-thresholding function³ and L is usually taken as the largest eigenvalue of AᵀA. In general, ISTA converges sublinearly for any given and fixed dictionary A and sparse code x* [2].

In [3], inspired by ISTA, the authors proposed a learning-based model named Learned ISTA (LISTA). They view ISTA as a recurrent neural network (RNN), illustrated in Figure 1(a), with W₁ = (1/L)Aᵀ, W₂ = I − (1/L)AᵀA, and θ = λ/L. LISTA, illustrated in Figure 1(b), unrolls the RNN and truncates it into K iterations:

    x^{k+1} = η_{θ^k}( W₁^k b + W₂^k x^k ),  k = 0, 1, ..., K − 1,    (4)

leading to a K-layer feed-forward neural network with side connections.

Different from ISTA, where no parameter is learnable (except the hyperparameter λ to be tuned), LISTA is treated as a specially structured neural network and trained using stochastic gradient descent (SGD) over a given training dataset {(x*_i, b_i)}_{i=1}^N sampled from some distribution P(x, b). All the parameters Θ = {(W₁^k, W₂^k, θ^k)}_{k=0}^{K−1} are subject to learning. The training is modeled as:

    minimize_Θ  E_{x*,b} ‖ x^K(Θ, b, x⁰) − x* ‖₂².    (5)

Many empirical results, e.g., [3–7], show that a trained K-layer LISTA (with K usually set to 10–20) or its variants can generalize more than well to unseen samples (x′, b′) from the same P(x, b) and recover x′ from b′ to the same accuracy within one or two orders of magnitude fewer iterations than the original ISTA. Moreover, the accuracies of the outputs {x^k} of the layers k = 1, ..., K gradually improve.

Figure 1: Diagrams of ISTA and LISTA. (a) RNN structure of ISTA. (b) Unfolded learned ISTA network.

1.1 Related Works

Many recent works [8, 9, 4, 10, 11] followed the idea of [3] to construct feed-forward networks by unfolding and truncating iterative algorithms, as fast trainable regressors to approximate the solutions of sparse coding models. On the other hand, progress has been slow towards understanding the efficient approximation from a theoretical perspective.

* These authors contributed equally and are listed alphabetically.
² https://github.com/xchen-tamu/linear-lista-cpss

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.
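To make the updates (3) and (4) concrete, here is a minimal NumPy sketch of soft-thresholding and plain ISTA; the function names and the small test problem are our own illustration, not code from the released repository:

```python
import numpy as np

def soft_threshold(x, theta):
    # Component-wise soft-thresholding: eta_theta(x) = sign(x) * max(0, |x| - theta).
    return np.sign(x) * np.maximum(np.abs(x) - theta, 0.0)

def ista(A, b, lam, num_iters=100):
    # Plain ISTA for the LASSO problem (2), following the update (3):
    # x^{k+1} = eta_{lam/L}( x^k + (1/L) A^T (b - A x^k) ).
    L = np.linalg.norm(A, ord=2) ** 2      # largest eigenvalue of A^T A
    x = np.zeros(A.shape[1])               # x^0 = 0
    for _ in range(num_iters):
        x = soft_threshold(x + A.T @ (b - A @ x) / L, lam / L)
    return x
```

LISTA (4) keeps exactly this computational graph but makes W₁^k, W₂^k, and θ^k trainable per layer instead of fixing them to (1/L)Aᵀ, I − (1/L)AᵀA, and λ/L.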
The most relevant works are discussed below.

[12] attempted to explain the mechanism of LISTA by re-factorizing the Gram matrix of the dictionary, trying to nearly diagonalize the Gram matrix with a basis that produces a small perturbation of the ℓ₁ ball. They re-parameterized LISTA into a new factorized architecture that achieved similar acceleration gain to LISTA. Using an "indirect" proof, [12] was able to show that LISTA can converge faster than ISTA, but still sublinearly. Lately, [13] tried to relate LISTA to projected gradient descent (PGD) relying on inaccurate projections, where a trade-off between approximation error and convergence speed was made possible.

[14] investigated the convergence property of a sibling architecture to LISTA, proposed in [4], which was obtained by instead unfolding/truncating the iterative hard thresholding (IHT) algorithm rather than ISTA. The authors argued that they can use data to train a transformation of the dictionary that improves its restricted isometry property (RIP) constant when the original dictionary is highly correlated, a situation that easily causes IHT to fail. They moreover showed it beneficial to allow the weights to decouple across layers. However, the analysis in [14] cannot be straightforwardly extended to ISTA, although IHT is linearly convergent [15] under rather strong assumptions.

In [16], a similar learning-based model inspired by another iterative algorithm solving LASSO, approximate message passing (AMP), was studied. The idea was advanced in [17] to substitute the AMP proximal operator (soft-thresholding) with a learnable Gaussian denoiser.

³ The soft-thresholding function is defined in a component-wise way: η_θ(x) = sign(x) max(0, |x| − θ).
The resulting model, called Learned Denoising AMP (L-DAMP), has theoretical guarantees under the asymptotic assumption named "state evolution." While the assumption is common in analyzing AMP algorithms, the tool is not directly applicable to ISTA. Moreover, [16] shows L-DAMP is MMSE optimal, but there is no result on its convergence rate. Besides, we also note the empirical effort in [18] that introduces an Onsager correction to LISTA to make it resemble AMP.

1.2 Motivations and Contributions

We attempt to answer the following questions, which are not fully addressed in the literature yet:

• Rather than training LISTA as a conventional "black-box" network, can we benefit from exploiting certain dependencies among its parameters {(W₁^k, W₂^k, θ^k)}_{k=0}^{K−1} to simplify the network and improve the recovery results?

• Obtained with sufficiently many training samples from the target distribution P(x, b), LISTA works very well. So, we wonder whether there is a theoretical guarantee to ensure that LISTA (4) converges⁴ faster and/or produces a better solution than ISTA (3) when its parameters are ideal. If the answer is affirmative, can we quantify the amount of acceleration?

• Can some of the acceleration techniques, such as support detection, that were developed for LASSO also be used to improve LISTA?

Our Contributions: this paper aims to introduce more theoretical insights for LISTA and to further unleash its power. To our best knowledge, this is the first attempt to establish a theoretical convergence rate (upper bound) of LISTA directly. We also observe that the weight structure and the thresholds can speed up the convergence of LISTA:

• We give a result on asymptotic coupling between the weight matrices W₁^k and W₂^k. This result leads us to eliminate one of them, thus reducing the number of trainable parameters. This elimination still retains the theoretical and experimental performance of LISTA.

• ISTA is generally sublinearly convergent before its iterates settle on a support. We prove that, however, there exists a sequence of parameters that makes LISTA converge linearly from its first iteration. Our numerical experiments support this theoretical result.

• Furthermore, we introduce a thresholding scheme for support selection, which is extremely simple to implement and significantly boosts the practical convergence. The linear convergence results are extended to support detection with an improved rate.

Detailed discussions of the above three points will follow after Theorems 1, 2 and 3, respectively. Our proofs do not rely on any indirect resemblance, e.g., to AMP [18] or PGD [13]. The theories are supported by extensive simulation experiments, and substantial performance improvements are observed when applying the weight coupling and support selection schemes. We also evaluated LISTA equipped with those proposed techniques in an image compressive sensing task, obtaining superior performance over several of the state-of-the-arts.

2 Algorithm Description

We first establish the necessary condition for LISTA convergence, which implies a partial weight coupling structure for training LISTA. We then describe the support-selection technique.

2.1 Necessary Condition for LISTA Convergence and Partial Weight Coupling

Assumption 1 (Basic assumptions).
The signal x* and the observation noise ε are sampled from the following set:

    (x*, ε) ∈ X(B, s, σ) ≜ { (x*, ε) : |x*_i| ≤ B ∀i, ‖x*‖₀ ≤ s, ‖ε‖₁ ≤ σ }.    (6)

In other words, x* is bounded and s-sparse⁵ (s ≥ 2), and ε is bounded.

Theorem 1 (Necessary Condition). Given {W₁^k, W₂^k, θ^k}_{k=0}^∞ and x⁰ = 0, let b be observed by (1) and {x^k}_{k=1}^∞ be generated layer-wise by LISTA (4). If the following holds uniformly for any (x*, ε) ∈ X(B, s, 0) (no observation noise):

    x^k( {W₁^τ, W₂^τ, θ^τ}_{τ=0}^{k−1}, b, x⁰ ) → x*,  as k → ∞,

and {W₂^k}_{k=1}^∞ are bounded,

    ‖W₂^k‖₂ ≤ B_W,  ∀k = 0, 1, 2, ...,

then {W₁^k, W₂^k, θ^k}_{k=0}^∞ must satisfy

    W₂^k − (I − W₁^k A) → 0,  as k → ∞,    (7)
    θ^k → 0,  as k → ∞.    (8)

⁴ The convergence of ISTA/FISTA measures how fast the k-th iterate proceeds; the convergence of LISTA measures how fast the output of the k-th layer proceeds as k increases.
⁵ A signal is s-sparse if it has no more than s non-zero entries.

Proofs of the results throughout this paper can be found in the supplementary.
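As a quick sanity check on (7): ISTA's own parameterization from Section 1, W₁ = (1/L)Aᵀ and W₂ = I − (1/L)AᵀA, satisfies the coupling with residual exactly zero. A small NumPy verification (the matrix sizes here are arbitrary, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 20, 40
A = rng.standard_normal((m, n)) / np.sqrt(m)
L = np.linalg.norm(A, ord=2) ** 2        # largest eigenvalue of A^T A

W1 = A.T / L                             # ISTA's W_1
W2 = np.eye(n) - (A.T @ A) / L           # ISTA's W_2

# Residual of the necessary condition (7); zero up to floating-point error.
residual = np.linalg.norm(W2 - (np.eye(n) - W1 @ A), ord=2)
print(residual)
```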
The conclusion (7) demonstrates that the weights {W₁^k, W₂^k}_{k=0}^∞ in LISTA asymptotically satisfy the following partial weight coupling structure:

    W₂^k = I − W₁^k A.    (9)

We adopt the above partial weight coupling for all layers, letting W^k = (W₁^k)ᵀ ∈ ℝ^{m×n}, thus simplifying LISTA (4) to:

    x^{k+1} = η_{θ^k}( x^k + (W^k)ᵀ(b − Ax^k) ),  k = 0, 1, ..., K − 1,    (10)

where {W^k, θ^k}_{k=0}^{K−1} remain as free parameters to train. Empirical results in Fig. 3 illustrate that the structure (9), though having fewer parameters, improves the performance of LISTA.

The coupled structure (9) for soft-thresholding based algorithms was empirically studied in [16]. A similar structure was also theoretically studied in Proposition 1 of [14] for IHT algorithms using fixed-point theory, but they let all layers share the same weights, i.e., W₂^k = W₂, W₁^k = W₁, ∀k.

2.2 LISTA with Support Selection

We introduce a special thresholding scheme to LISTA, called support selection, which is inspired by "kicking" [19] in linearized Bregman iteration. This technique shows advantages in recoverability and convergence. Its impact on improving LISTA's convergence rate and reducing recovery errors will be analyzed in Section 3. With support selection, at each LISTA layer before applying soft-thresholding, we will select a certain percentage of entries with the largest magnitudes, trust them as "true support," and not pass them through thresholding. Those entries that do not go through thresholding will be directly fed into the next layer, together with the other thresholded entries.

Assume we select p^k% of entries as the trusted support at layer k. LISTA with support selection can be generally formulated as

    x^{k+1} = η^{ss}_{p^k, θ^k}( W₁^k b + W₂^k x^k ),  k = 0, 1, ..., K − 1,    (11)

where η^{ss} is the thresholding operator with support selection, formally defined as:

    (η^{ss}_{p^k, θ^k}(v))_i =
        v_i,          if v_i > θ^k and i ∈ S^{p^k}(v),
        v_i − θ^k,    if v_i > θ^k and i ∉ S^{p^k}(v),
        0,            if −θ^k ≤ v_i ≤ θ^k,
        v_i + θ^k,    if v_i < −θ^k and i ∉ S^{p^k}(v),
        v_i,          if v_i < −θ^k and i ∈ S^{p^k}(v),

where S^{p^k}(v) includes the elements with the largest p^k% magnitudes in the vector v:

    S^{p^k}(v) = { i₁, i₂, ..., i_{p^k} : |v_{i₁}| ≥ |v_{i₂}| ≥ ... ≥ |v_{i_{p^k}}| ≥ ... ≥ |v_{i_n}| }.    (12)

To clarify, in (11), p^k is a hyperparameter to be manually tuned, and θ^k is a parameter to train. We use an empirical formula to select p^k for layer k: p^k = min(p · k, p_max), where p is a positive constant and p_max is an upper bound on the percentage of the support cardinality. Here p and p_max are both hyperparameters to be manually tuned.

If we adopt the partial weight coupling in (9), then (11) is modified as

    x^{k+1} = η^{ss}_{p^k, θ^k}( x^k + (W^k)ᵀ(b − Ax^k) ),  k = 0, 1, ..., K − 1.    (13)

Algorithm abbreviations. For simplicity, hereinafter we will use the abbreviation "CP" for the partial weight coupling in (9), and "SS" for the support selection technique. LISTA-CP denotes the LISTA model with weight coupling (10).
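The operator (12) and the coupled update (13) are straightforward to implement. Below is a NumPy sketch (our own illustrative code, not from the released repository; ties in the top-p% selection are broken arbitrarily by the sort):

```python
import numpy as np

def soft_threshold(x, theta):
    # Component-wise soft-thresholding: eta_theta(x) = sign(x) * max(0, |x| - theta).
    return np.sign(x) * np.maximum(np.abs(x) - theta, 0.0)

def eta_ss(v, theta, p):
    # Soft-thresholding with support selection, as in (12): the largest p%
    # of entries by magnitude are trusted as support and passed through
    # unthresholded; all other entries are soft-thresholded as usual.
    k = int(np.ceil(v.size * p / 100.0))   # number of trusted entries
    out = soft_threshold(v, theta)
    if k > 0:
        trusted = np.argsort(-np.abs(v))[:k]
        out[trusted] = v[trusted]
    return out

def lista_cpss_layer(x, b, A, W, theta, p):
    # One LISTA-CPSS layer (13): weight coupling (10) + support selection.
    # W plays the role of W^k (shape m-by-n), theta of theta^k, p of p^k.
    return eta_ss(x + W.T @ (b - A @ x), theta, p)
```

Setting `p = 0` recovers the plain coupled update (10), so the support-selection variant strictly generalizes it.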
LISTA-SS denotes the LISTA model with support selection (11). Similarly, LISTA-CPSS stands for a model using both techniques (13), which has the best performance. Unless otherwise specified, LISTA refers to the baseline LISTA (4).

3 Convergence Analysis

In this section, we formally establish the impacts of (10) and (13) on LISTA's convergence. The output of the k-th layer x^k depends on the parameters {W^τ, θ^τ}_{τ=0}^{k−1}, the observed measurement b, and the initial point x⁰. Strictly speaking, x^k should be written as x^k( {W^τ, θ^τ}_{τ=0}^{k−1}, b, x⁰ ). By the observation model b = Ax* + ε, since A is given and x⁰ can be taken as 0, x^k therefore depends on {(W^τ, θ^τ)}_{τ=0}^{k−1}, x*, and ε. So, we can write x^k( {W^τ, θ^τ}_{τ=0}^{k−1}, x*, ε ). For simplicity, we instead just write x^k(x*, ε).

Theorem 2 (Convergence of LISTA-CP). Given {W^k, θ^k}_{k=0}^∞ and x⁰ = 0, let {x^k}_{k=1}^∞ be generated by (10). If Assumption 1 holds and s is sufficiently small, then there exists a sequence of parameters {W^k, θ^k} such that, for all (x*, ε) ∈ X(B, s, σ), we have the error bound:

    ‖x^k(x*, ε) − x*‖₂ ≤ sB exp(−ck) + Cσ,  ∀k = 1, 2, ...,    (14)

where c > 0, C > 0 are constants that depend only on A and s. Recall that s (sparsity of the signals) and σ (noise level) are defined in (6).

If σ = 0 (noiseless case), (14) reduces to

    ‖x^k(x*, 0) − x*‖₂ ≤ sB exp(−ck).    (15)

The recovery error converges to 0 at a linear rate as the number of layers goes to infinity. Combined with Theorem 1, we see that the partial weight coupling structure (10) is both necessary and sufficient to guarantee convergence in the noiseless case. Fig.
3 validates (14) and (15) directly.

Discussion: The bound (15) also explains why LISTA (or its variants) can converge faster than ISTA and fast ISTA (FISTA) [2]. With a proper λ (see (2)), ISTA converges at an O(1/k) rate and FISTA converges at an O(1/k²) rate [2]. With a large enough λ, ISTA achieves a linear rate [20, 21]. With x̄(λ) being the solution of LASSO (noiseless case), these results can be summarized as follows: before the iterates x^k settle on a support⁶,

    x^k → x̄(λ) sublinearly, ‖x̄(λ) − x*‖ = O(λ),  for any λ > 0;
    x^k → x̄(λ) linearly,    ‖x̄(λ) − x*‖ = O(λ),  for λ large enough.

Based on the choice of λ in LASSO, the above observation reflects an inherent trade-off between convergence rate and approximation accuracy in solving the problem (1); see a similar conclusion in [13]: a larger λ leads to faster convergence but a less accurate solution, and vice versa.

However, if λ is not constant throughout all iterations/layers, but instead chosen adaptively for each step, a more promising trade-off can arise⁷. LISTA and LISTA-CP, with the thresholds {θ^k}_{k=0}^{K−1} free to train, actually adopt this idea because {θ^k}_{k=0}^{K−1} corresponds to a path of LASSO parameters {λ^k}_{k=0}^{K−1}. With extra free trainable parameters, {W^k}_{k=0}^{K−1} (LISTA-CP) or {W₁^k, W₂^k}_{k=0}^{K−1} (LISTA), learning-based algorithms are able to converge to an accurate solution at a fast convergence rate. Theorem 2 demonstrates the existence of such a sequence {W^k, θ^k}_k in LISTA-CP (10). The experiment results in Fig. 4 show that such {W^k, θ^k}_k can be obtained by training.

⁶ After x^k settles on a support, i.e., when k is large enough that support(x^k) is fixed, even with small λ, ISTA reduces to a linear iteration, which has a linear convergence rate [22].
⁷ This point was studied in [23, 24] with classical compressive sensing settings, while our learning settings can learn a good path of parameters without a complicated thresholding rule or any manual tuning.

Assumption 2. The signal x* and observation noise ε are sampled from the following set:

    (x*, ε) ∈ X̄(B, B̲, s, σ) ≜ { (x*, ε) : |x*_i| ≤ B ∀i, ‖x*‖₁ ≥ B̲, ‖x*‖₀ ≤ s, ‖ε‖₁ ≤ σ }.    (16)

Theorem 3 (Convergence of LISTA-CPSS). Given {W^k, θ^k}_{k=0}^∞ and x⁰ = 0, let {x^k}_{k=1}^∞ be generated by (13). With the same assumption and parameters as in Theorem 2, the approximation error can be bounded for all (x*, ε) ∈ X(B, s, σ):

    ‖x^k(x*, ε) − x*‖₂ ≤ sB exp( −Σ_{t=0}^{k−1} c_ss^t ) + C_ss σ,  ∀k = 1, 2, ...,    (17)

where c_ss^k ≥ c for all k and C_ss ≤ C.

If Assumption 2 holds, s is small enough, and B̲ ≥ 2Cσ (the SNR is not too small), then there exists another sequence of parameters {W̃^k, θ̃^k} that yields the following improved error bound: for all (x*, ε) ∈ X̄(B, B̲, s, σ),

    ‖x^k(x*, ε) − x*‖₂ ≤ sB exp( −Σ_{t=0}^{k−1} c̃_ss^t ) + C̃_ss σ,  ∀k = 1, 2, ...,    (18)

where c̃_ss^k ≥ c for all k, c̃_ss^k > c for large enough k, and C̃_ss < C.

The bound in (17) ensures that, with the same assumptions
and parameters, LISTA-CPSS is at least no worse than LISTA-CP. The bound in (18) shows that, under stronger assumptions, LISTA-CPSS can be strictly better than LISTA-CP in both respects: c̃_ss^k > c is the better convergence rate of LISTA-CPSS; C̃_ss < C means that LISTA-CPSS can achieve a smaller approximation error than the minimum error that LISTA can achieve.

4 Numerical Results

For all the models reported in this section, including the baseline LISTA and LAMP models, we adopt a stage-wise training strategy with learning rate decaying to stabilize the training and to get better performance, which is discussed in the supplementary.

4.1 Simulation Experiments

Experiments Setting. We choose m = 250, n = 500. We sample the entries of A i.i.d. from the standard Gaussian distribution, A_ij ∼ N(0, 1/m), and then normalize its columns to have unit ℓ₂ norm. We fix a matrix A in each setting where different networks are compared. To generate sparse vectors x*, we decide each of its entries to be non-zero following the Bernoulli distribution with p_b = 0.1. The values of the non-zero entries are sampled from the standard Gaussian distribution. A test set of 1000 samples generated in the above manner is fixed for all tests in our simulations.

All the networks have K = 16 layers. In LISTA models with support selection, we add p% of entries into the support and maximally select p_max% in each layer. We manually tune the values of p and p_max for the best final performance. With p_b = 0.1 and K = 16, we choose p = 1.2 for all models in the simulation experiments and p_max = 12 for LISTA-SS but p_max = 13 for LISTA-CPSS. The recovery performance is evaluated by NMSE (in dB):

    NMSE(x̂, x*) = 10 log₁₀( E‖x̂ − x*‖₂² / E‖x*‖₂² ),

where x* is the ground truth and x̂ is the estimate obtained by the recovery algorithms (ISTA, FISTA, LISTA, etc.).

Validation of Theorem 1. In Fig. 2, we report two values, ‖W₂^k − (I − W₁^k A)‖₂ and θ^k, obtained by the baseline LISTA model (4) trained under the noiseless setting. The plot clearly demonstrates that W₂^k → I − W₁^k A and θ^k → 0 as k → ∞. Theorem 1 is directly validated.

Figure 2: Validation of Theorem 1. (a) Weight W₂^k → I − W₁^k A as k → ∞. (b) The threshold θ^k → 0.

Validation of Theorem 2. We report the test-set NMSE of LISTA-CP (10) in Fig. 3. Although (10) fixes the structure between W₁^k and W₂^k, the final performance remains the same as the baseline LISTA (4), and outperforms AMP, in both noiseless and noisy cases. Moreover, the outputs of the interior layers of LISTA-CP are even better than those of the baseline LISTA. In the noiseless case, NMSE converges exponentially to 0; in the noisy case, NMSE converges to a stationary level related to the noise level. This supports Theorem 2: there indeed exists a sequence of parameters {(W^k, θ^k)}_{k=0}^{K−1} leading to linear convergence for LISTA-CP, and they can be obtained by data-driven learning.

Figure 3: Validation of Theorem 2. (a) SNR = ∞. (b) SNR = 30.

Validation of Discussion after Theorem 2. In Fig. 4, we compare LISTA-CP and ISTA with different λs (see the LASSO problem (2)), as well as an adaptive threshold rule similar to one in [23], which is described in the supplementary. As we have discussed after Theorem 2, LASSO has an inherent trade-off based on the choice of λ. A smaller λ leads to a more accurate solution but slower convergence.
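For reference, the simulation data model and the NMSE metric above can be sketched as follows (our own illustrative code; `nmse_db` here is computed over whatever batch it is given, whereas the paper averages over the fixed 1000-sample test set):

```python
import numpy as np

def sample_problem(m=250, n=500, pb=0.1, seed=0):
    # A ~ i.i.d. N(0, 1/m), columns then normalized to unit l2 norm;
    # x* is Bernoulli(pb)-Gaussian; noiseless measurement b = A x*.
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((m, n)) / np.sqrt(m)
    A /= np.linalg.norm(A, axis=0, keepdims=True)
    x = rng.standard_normal(n) * (rng.random(n) < pb)
    return A, x, A @ x

def nmse_db(x_hat, x_true):
    # NMSE(x_hat, x*) = 10 log10( ||x_hat - x*||^2 / ||x*||^2 ), in dB.
    return 10.0 * np.log10(np.sum((x_hat - x_true) ** 2) / np.sum(x_true ** 2))
```

By construction the all-zero estimate scores 0 dB, so any useful recovery should report a negative NMSE, matching the curves in Figs. 3-6.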
The adaptive thresholding rule fixes this issue: it uses a large λ^k for small k, and gradually reduces it as k increases to improve the accuracy [23]. Besides adaptive thresholds {θ^k}_k (θ^k corresponds to λ^k in LASSO), LISTA-CP also has adaptive weights {W^k}_k, which further greatly accelerate the convergence. Note that we only ran ISTA and FISTA for 16 iterations, just enough and fair to compare them with the learned models. The number of iterations is so small that the difference between ISTA and FISTA is not quite observable.

Figure 4: Validating Discussion after Theorem 2 (SNR = ∞).

Validation of Theorem 3. We compare the recovery NMSEs of LISTA-CP (10) and LISTA-CPSS (13) in Fig. 5. The result of the noiseless case (Fig. 5(a)) shows that the recovery error of LISTA-SS converges to 0 at a faster rate than that of LISTA-CP. The difference is significant when the number of layers k ≥ 10, which supports our theoretical result "c̃_ss^k > c as k large enough" in Theorem 3. The result of the noisy case (Fig. 5(b)) shows that LISTA-CPSS has a better recovery error than LISTA-CP. This point supports C̃_ss < C in Theorem 3. Notably, LISTA-CPSS also outperforms LAMP [16] when k > 10 in the noiseless case, and even earlier as the SNR becomes lower.

Performance with Ill-Conditioned Matrix. We train LISTA, LAMP, and LISTA-CPSS with ill-conditioned matrices A of condition numbers κ = 5, 30, 50. As shown in Fig. 6, as κ increases, the performance of LISTA remains stable while LAMP becomes worse, and eventually inferior to LISTA when κ = 50.
Although our LISTA-CPSS also suffers from ill-conditioning, its performance always stays much better than LISTA and LAMP.

Figure 5: Validation of Theorem 3. (a) Noiseless case. (b) Noisy case: SNR = 40dB. (c) Noisy case: SNR = 30dB. (d) Noisy case: SNR = 20dB.

Figure 6: Performance in ill-conditioned situations (SNR = ∞). (a) κ = 5. (b) κ = 30. (c) κ = 50.

4.2 Natural Image Compressive Sensing

Experiments Setting. We perform a compressive sensing (CS) experiment on natural images (patches). We divide the BSD500 [25] set into a training set of 400 images, a validation set of 50 images, and a test set of 50 images. For training, we extract 10,000 patches f ∈ ℝ^{16×16} at random positions of each image, with all means removed. We then learn a dictionary D ∈ ℝ^{256×512} from them, using a block proximal gradient method [26]. For each testing image, we divide it into non-overlapping 16 × 16 patches. A Gaussian sensing matrix Φ ∈ ℝ^{m×256} is created in the same manner as in Sec. 4.1, where m/256 is the CS ratio.

Since f is typically not exactly sparse under the dictionary D, Assumptions 1 and 2 no longer strictly hold. The primary goal of this experiment is thus to show that our proposed techniques remain robust and practically useful in non-ideal conditions, rather than beating all CS state-of-the-arts.

Network Extension. In the real data case, we have no ground-truth sparse code available as the regression target for the loss function (5). In order to bypass pre-computing sparse codes of f over D on the training set, we are inspired by [11]: first use layer-wise pre-training with a reconstruction loss w.r.t. the dictionary D plus an ℓ₁ loss, shown in (19), where k is the layer index and Θ^k denotes all parameters in the k-th and previous layers; then append another learnable fully-connected layer (initialized by D) to LISTA-CPSS and perform an end-to-end training with the cost function (20):

    L_k(Θ^k) = Σ_{i=1}^N ‖f_i − D · x_i^k(Θ^k)‖₂² + λ‖x_i^k(Θ^k)‖₁,    (19)

    L(Θ, W_D) = Σ_{i=1}^N ‖f_i − W_D · x_i^K(Θ)‖₂² + λ‖x_i^K(Θ)‖₁.    (20)

Table 1: The average PSNR (dB) for the Set 11 test images with CS ratios ranging from 0.2 to 0.6.

    Algorithm    | 20%   30%   40%   50%   60%
    TVAL3        | 25.37 28.39 29.76 31.51 33.16
    Recon-Net    | 27.18 29.11 30.49 31.39 32.44
    LIHT         | 25.83 27.83 29.93 31.73 34.00
    LISTA        | 28.17 30.43 32.75 34.26 35.99
    LISTA-CPSS   | 28.25 30.54 32.87 34.60 36.39

Results. The results are reported in Table 1. We build CS models at the sample rates of 20%, 30%, 40%, 50%, 60% and test on the standard Set 11 images as in [27]. We compare our
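Written out, the two objectives have the following shape; this is a NumPy sketch for a batch of patches stored column-wise (the names are ours), not the actual training code:

```python
import numpy as np

def layerwise_loss(f_batch, xk_batch, D, lam=0.2):
    # Layer-wise pre-training loss (19): reconstruction w.r.t. the fixed
    # dictionary D plus an l1 penalty on the k-th layer's codes.
    recon = np.sum((f_batch - D @ xk_batch) ** 2)
    return recon + lam * np.sum(np.abs(xk_batch))

def end_to_end_loss(f_batch, xK_batch, WD, lam=0.2):
    # End-to-end loss (20): same form, but the decoder W_D (initialized
    # to D) is itself learnable together with the LISTA-CPSS layers.
    return np.sum((f_batch - WD @ xK_batch) ** 2) + lam * np.sum(np.abs(xK_batch))
```

The two losses differ only in whether the decoder is the fixed dictionary D or the trainable W_D, which is what makes the second stage end-to-end.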
We compare our\nresults with three baselines: the classical iterative CS solver, TVAL3 [28]; the \u201cblack-box\u201d deep\nlearning CS solver, Recon-Net [27];a l0-based network unfolded from IHT algorithm [15], noted as\nLIHT; and the baseline LISTA network, in terms of PSNR (dB)8. We build 16-layer LIHT, LISTA\nand LISTA-CPSS networks and set \u03bb = 0.2. For LISTA-CPSS, we set p% = 0.4% more entries\ninto the support in each layer for support selection. We also select support w.r.t. a percentage of the\nlargest magnitudes within the whole batch rather than within a single sample as we do in theorems\nand simulated experiments, which we emprically \ufb01nd is bene\ufb01cial to the recovery performance. Table\n1 con\ufb01rms LISTA-CPSS as the best performer among all. The advantage of LISTA-CPSS and LISTA\nover Recon-Net also endorses the incorporation of the unrolled sparse solver structure into deep\nnetworks.\n5 Conclusions\nIn this paper, we have introduced a partial weight coupling structure to LISTA, which reduces the\nnumber of trainable parameters but does not hurt the performance. With this structure, unfolded ISTA\ncan attain a linear convergence rate. We have further proposed support selection, which improves\nthe convergence rate both theoretically and empirically. Our theories are endorsed by extensive\nsimulations and a real-data experiment. We believe that the methodology in this paper can be\nextended to analyzing and enhancing other unfolded iterative algorithms.\n\nAcknowledgments\n\nThe work by X. Chen and Z. Wang is supported in part by NSF RI-1755701. The work by J. Liu and\nW. Yin is supported in part by NSF DMS-1720237 and ONR N0001417121. We would also like to\nthank all anonymous reviewers for their tremendously useful comments to help improve our work.\n\nReferences\n[1] Thomas Blumensath and Mike E Davies. 
Iterative thresholding for sparse approximations. Journal of Fourier Analysis and Applications, 14(5-6):629–654, 2008.

[2] Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.

[3] Karol Gregor and Yann LeCun. Learning fast approximations of sparse coding. In Proceedings of the 27th International Conference on Machine Learning, pages 399–406. Omnipress, 2010.

[4] Zhangyang Wang, Qing Ling, and Thomas Huang. Learning deep l0 encoders. In AAAI Conference on Artificial Intelligence, pages 2194–2200, 2016.

[5] Zhangyang Wang, Ding Liu, Shiyu Chang, Qing Ling, Yingzhen Yang, and Thomas S Huang. D3: Deep dual-domain based fast restoration of JPEG-compressed images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2764–2772, 2016.

[6] Zhangyang Wang, Shiyu Chang, Jiayu Zhou, Meng Wang, and Thomas S Huang. Learning a task-specific deep architecture for clustering. In Proceedings of the 2016 SIAM International Conference on Data Mining, pages 369–377. SIAM, 2016.

8 We applied TVAL3, LISTA and LISTA-CPSS on 16 × 16 patches to be fair. For Recon-Net, we used their default setting working on 33 × 33 patches, which was verified to perform better than using smaller patches.

[7] Zhangyang Wang, Yingzhen Yang, Shiyu Chang, Qing Ling, and Thomas S Huang. Learning a deep ℓ∞ encoder for hashing. pages 2174–2180, 2016.

[8] Pablo Sprechmann, Alexander M Bronstein, and Guillermo Sapiro. Learning efficient sparse and low rank models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015.

[9] Zhaowen Wang, Jianchao Yang, Haichao Zhang, Zhangyang Wang, Yingzhen Yang, Ding Liu, and Thomas S Huang. Sparse Coding and its Applications in Computer Vision.
World Scientific.

[10] Jian Zhang and Bernard Ghanem. ISTA-Net: Interpretable optimization-inspired deep network for image compressive sensing. In IEEE CVPR, 2018.

[11] Joey Tianyi Zhou, Kai Di, Jiawei Du, Xi Peng, Hao Yang, Sinno Jialin Pan, Ivor W Tsang, Yong Liu, Zheng Qin, and Rick Siow Mong Goh. SC2Net: Sparse LSTMs for sparse coding. In AAAI Conference on Artificial Intelligence, 2018.

[12] Thomas Moreau and Joan Bruna. Understanding trainable sparse coding with matrix factorization. In ICLR, 2017.

[13] Raja Giryes, Yonina C Eldar, Alex Bronstein, and Guillermo Sapiro. Tradeoffs between convergence speed and reconstruction accuracy in inverse problems. IEEE Transactions on Signal Processing, 2018.

[14] Bo Xin, Yizhou Wang, Wen Gao, David Wipf, and Baoyuan Wang. Maximal sparsity with deep networks? In Advances in Neural Information Processing Systems, pages 4340–4348, 2016.

[15] Thomas Blumensath and Mike E Davies. Iterative hard thresholding for compressed sensing. Applied and Computational Harmonic Analysis, 27(3):265–274, 2009.

[16] Mark Borgerding, Philip Schniter, and Sundeep Rangan. AMP-inspired deep networks for sparse linear inverse problems. IEEE Transactions on Signal Processing, 2017.

[17] Christopher A Metzler, Ali Mousavi, and Richard G Baraniuk. Learned D-AMP: Principled neural network based compressive image recovery. In Advances in Neural Information Processing Systems, pages 1770–1781, 2017.

[18] Mark Borgerding and Philip Schniter. Onsager-corrected deep learning for sparse linear inverse problems. In 2016 IEEE Global Conference on Signal and Information Processing (GlobalSIP).

[19] Stanley Osher, Yu Mao, Bin Dong, and Wotao Yin. Fast linearized Bregman iteration for compressive sensing and sparse denoising. Communications in Mathematical Sciences, 2010.

[20] Kristian Bredies and Dirk A Lorenz. Linear convergence of iterative soft-thresholding.
Journal of Fourier Analysis and Applications, 14(5-6):813–837, 2008.

[21] Lufang Zhang, Yaohua Hu, Chong Li, and Jen-Chih Yao. A new linear convergence result for the iterative soft thresholding algorithm. Optimization, 66(7):1177–1189, 2017.

[22] Shaozhe Tao, Daniel Boley, and Shuzhong Zhang. Local linear convergence of ISTA and FISTA on the LASSO problem. SIAM Journal on Optimization, 26(1):313–336, 2016.

[23] Elaine T. Hale, Wotao Yin, and Yin Zhang. Fixed-point continuation for ℓ1-minimization: methodology and convergence. SIAM Journal on Optimization, 19(3):1107–1130, 2008.

[24] Lin Xiao and Tong Zhang. A proximal-gradient homotopy method for the sparse least-squares problem. SIAM Journal on Optimization, 23(2):1062–1091, 2013.

[25] David Martin, Charless Fowlkes, Doron Tal, and Jitendra Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proceedings of the International Conference on Computer Vision, volume 2, pages 416–423, 2001.

[26] Yangyang Xu and Wotao Yin. A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor factorization and completion. SIAM Journal on Imaging Sciences, 6(3):1758–1789, 2013.

[27] Kuldeep Kulkarni, Suhas Lohit, Pavan Turaga, Ronan Kerviche, and Amit Ashok. Recon-Net: Non-iterative reconstruction of images from compressively sensed measurements. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.

[28] Chengbo Li, Wotao Yin, Hong Jiang, and Yin Zhang. An efficient augmented Lagrangian method with applications to total variation minimization. Computational Optimization and Applications, 56(3):507–530, 2013.

[29] Dimitris Bertsimas and John N Tsitsiklis. Introduction to linear optimization.
Athena Scientific, Belmont, MA, 1997.