{"title": "Learned D-AMP: Principled Neural Network based Compressive Image Recovery", "book": "Advances in Neural Information Processing Systems", "page_first": 1772, "page_last": 1783, "abstract": "Compressive image recovery is a challenging problem that requires fast and accurate algorithms. Recently, neural networks have been applied to this problem with promising results. By exploiting massively parallel GPU processing architectures and oodles of training data, they can run orders of magnitude faster than existing techniques. However, these methods are largely unprincipled black boxes that are difficult to train and often-times specific to a single measurement matrix. It was recently demonstrated that iterative sparse-signal-recovery algorithms can be ``unrolled\u2019' to form interpretable deep networks. Taking inspiration from this work, we develop a novel neural network architecture that mimics the behavior of the denoising-based approximate message passing (D-AMP) algorithm. We call this new network {\\em Learned} D-AMP (LDAMP). The LDAMP network is easy to train, can be applied to a variety of different measurement matrices, and comes with a state-evolution heuristic that accurately predicts its performance. Most importantly, it outperforms the state-of-the-art BM3D-AMP and NLR-CS algorithms in terms of both accuracy and run time. At high resolutions, and when used with sensing matrices that have fast implementations, LDAMP runs over $50\\times$ faster than BM3D-AMP and hundreds of times faster than NLR-CS.", "full_text": "Learned D-AMP: Principled Neural Network Based\n\nCompressive Image Recovery\n\nChristopher A. Metzler\n\nRice University\n\nchris.metzler@rice.edu\n\nAli Mousavi\nRice University\n\nali.mousavi@rice.edu\n\nRichard G. Baraniuk\n\nRice University\nrichb@rice.edu\n\nAbstract\n\nCompressive image recovery is a challenging problem that requires fast and accu-\nrate algorithms. 
Recently, neural networks have been applied to this problem with promising results. By exploiting massively parallel GPU processing architectures and oodles of training data, they can run orders of magnitude faster than existing techniques. However, these methods are largely unprincipled black boxes that are difficult to train and oftentimes specific to a single measurement matrix.\n\nIt was recently demonstrated that iterative sparse-signal-recovery algorithms can be \u201cunrolled\u201d to form interpretable deep networks. Taking inspiration from this work, we develop a novel neural network architecture that mimics the behavior of the denoising-based approximate message passing (D-AMP) algorithm. We call this new network Learned D-AMP (LDAMP).\n\nThe LDAMP network is easy to train, can be applied to a variety of different measurement matrices, and comes with a state-evolution heuristic that accurately predicts its performance. Most importantly, it outperforms the state-of-the-art BM3D-AMP and NLR-CS algorithms in terms of both accuracy and run time. At high resolutions, and when used with sensing matrices that have fast implementations, LDAMP runs over 50\u00d7 faster than BM3D-AMP and hundreds of times faster than NLR-CS.\n\n1 Introduction\n\nOver the last few decades, computational imaging systems have proliferated in a host of different imaging domains, from synthetic aperture radar to functional MRI and CT scanners. The majority of these systems capture linear measurements y \u2208 R^m of the signal of interest x \u2208 R^n via y = Ax + \u03b5, where A \u2208 R^(m\u00d7n) is a measurement matrix and \u03b5 \u2208 R^m is noise.\n\nGiven the measurements y and the measurement matrix A, a computational imaging system seeks to recover x. When m < n this problem is underdetermined, and prior knowledge about x must be used to recover the signal. 
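To make the measurement model concrete, here is a minimal NumPy sketch of forming the measurements y = Ax + \u03b5; the dimensions, noise level, and Gaussian A are illustrative stand-ins, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

n = 1024      # signal dimension (e.g., a vectorized 32x32 image patch)
m = 256       # number of measurements; m < n makes the problem underdetermined
sigma = 0.01  # measurement-noise standard deviation (illustrative)

x = rng.standard_normal(n)                    # stand-in for the true signal x
A = rng.standard_normal((m, n)) / np.sqrt(m)  # i.i.d. Gaussian measurement matrix
epsilon = sigma * rng.standard_normal(m)      # additive measurement noise

y = A @ x + epsilon  # the m linear measurements the recovery algorithm observes
```

Because m < n, infinitely many signals are consistent with y, which is why the priors discussed next are essential.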
This problem is broadly referred to as compressive sampling (CS) [1; 2]. There are myriad ways to use priors to recover an image x from compressive measurements. In the following, we briefly describe some of these methods. Note that the ways in which these algorithms use priors span a spectrum, from simple hand-designed models to completely data-driven methods (see Figure 1).\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nFigure 1: The spectrum of compressive signal recovery algorithms.\n\n1.1 Hand-designed recovery methods\n\nThe vast majority of CS recovery algorithms can be considered \u201chand-designed\u201d in the sense that they use some sort of expert knowledge, i.e., a prior, about the structure of x. The most common signal prior is that x is sparse in some basis. Algorithms using sparsity priors include CoSaMP [3], ISTA [4], approximate message passing (AMP) [5], and VAMP [6], among many others. Researchers have also developed priors and algorithms that more accurately describe the structure of natural images, such as minimal total variation, e.g., TVAL3 [7], Markov-tree models on the wavelet coefficients, e.g., Model-CoSaMP [8], and nonlocal self-similarity, e.g., NLR-CS [9]. Off-the-shelf denoising and compression algorithms have also been used to impose priors on the reconstruction, e.g., Denoising-based AMP (D-AMP) [10], D-VAMP [11], and C-GD [12]. When applied to natural images, algorithms using advanced priors outperform simple priors, like wavelet sparsity, by a large margin [10].\n\nThe appeal of hand-designed methods is that they are based on interpretable priors and often have well-understood behavior. Moreover, when they are set up as convex optimization problems, they often have theoretical convergence guarantees. Unfortunately, among the algorithms that use accurate priors on the signal, even the fastest is too slow for many real-time applications [10]. 
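As a concrete instance of the sparsity-prior algorithms mentioned above, a bare-bones ISTA loop pairs a gradient step with soft-thresholding. This sketch is illustrative; the regularization weight, step size, and problem sizes are chosen for the demo, not taken from any of the cited works:

```python
import numpy as np

def soft_threshold(v, lam):
    """Proximal operator of lam * ||.||_1: shrinks each entry toward zero."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def ista(y, A, lam=0.01, iters=500):
    """Minimize 0.5*||y - A x||^2_2 + lam*||x||_1 by iterative soft-thresholding."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2  # 1/L, with L the gradient's Lipschitz constant
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        x = soft_threshold(x + step * (A.T @ (y - A @ x)), lam * step)
    return x

# Tiny demo: recover a 5-sparse vector from m = 80 < n = 200 Gaussian measurements.
rng = np.random.default_rng(1)
n_demo, m_demo = 200, 80
x_true = np.zeros(n_demo)
x_true[rng.choice(n_demo, 5, replace=False)] = 1.0
A = rng.standard_normal((m_demo, n_demo)) / np.sqrt(m_demo)
x_hat = ista(A @ x_true, A)
```

The denoising-based methods discussed below keep exactly this gradient-step/shrinkage structure but replace the scalar soft-threshold with a full image denoiser.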
More importantly,\nthese algorithms do not take advantage of potentially available training data. As we will see, this\nleaves much room for improvement.\n\n1.2 Data-driven recovery methods\n\nAt the other end of the spectrum are data-driven (often deep learning-based) methods that use no\nhand-designed models whatsoever. Instead, researchers provide neural networks (NNs) vast amounts\nof training data, and the networks learn how to best use the structure within the data [13\u201316].\nThe \ufb01rst paper to apply this approach was [13], where the authors used stacked denoising autoen-\ncoders (SDA) [17] to recover signals from their undersampled measurements. Other papers in this\nline of work have used either pure convolutional layers (DeepInverse [15]) or a combination of\nconvolutional and fully connected layers (DR2-Net [16] and ReconNet [14]) to build deep learning\nframeworks capable of solving the CS recovery problem. As demonstrated in [13], these methods can\ncompete with state-of-the-art methods in terms of accuracy while running thousands of times faster.\nUnfortunately, these methods are held back by the fact that there exists almost no theory governing\ntheir performance and that, so far, they must be trained for speci\ufb01c measurement matrices and noise\nlevels.\n\n1.3 Mixing hand-designed and data-driven methods for recovery\n\nThe third class of recovery algorithms blends data-driven models with hand-designed algorithms.\nThese methods \ufb01rst use expert knowledge to set up a recovery algorithm and then use training\ndata to learn priors within this algorithm. Such methods bene\ufb01t from the ability to learn more\nrealistic signal priors from the training data, while still maintaining the interpretability and guarantees\nthat made hand-designed methods so appealing. Algorithms of this class can be divided into two\nsubcategories. 
The first subcategory uses a black box neural network that performs some function within the algorithm, such as the proximal mapping. The second subcategory explicitly unrolls an iterative algorithm and turns it into a deep NN. Following this unrolling, the network can be tuned with training data. Our LDAMP algorithm uses ideas from both of these camps.\n\nBlack box neural nets. The simplest way to use a NN in a principled way to solve the CS problem is to treat it as a black box that performs some function, such as computing a posterior probability.\n\nFigure 2: Reconstruction behavior of D-IT (a, left) and D-AMP (b, right) with an idealized denoiser. Because D-IT allows bias to build up over iterations of the algorithm, its denoiser becomes ineffective at projecting onto the set C of all natural images. The Onsager correction term enables D-AMP to avoid this issue. Figure adapted from [10].\n\nExamples of this approach include RBM-AMP and its generalizations [18–20], which use Restricted Boltzmann Machines to learn non-i.i.d. priors; RIDE-CS [21], which uses the RIDE [22] generative model to compute the probability of a given estimate of the image; and OneNet [23], which uses a NN as a proximal mapping/denoiser.\n\nUnrolled algorithms. The second way to use a NN in a principled way to solve the CS problem is to simply take a well-understood iterative recovery algorithm and unroll/unfold it. This method is best illustrated by the LISTA [24; 25] and LAMP [26] NNs. In these works, the authors simply unroll the iterative ISTA [4] and AMP [5] algorithms, respectively, and then treat parameters of the algorithm as weights to be learned. Following the unrolling, training data can be fed through the network, and stochastic gradient descent can be used to update and optimize its parameters. Unrolling was recently applied to the ADMM algorithm to solve the CS-MRI problem [27]. 
The resulting network, ADMM-Net, uses training data to learn filters, penalties, simple nonlinearities, and multipliers. Moving beyond CS, the unrolling principle has been applied successfully in speech enhancement [28], non-negative matrix factorization applied to music transcription [29], and beyond. In these applications, unrolling and training significantly improve both the quality and speed of signal reconstruction.\n\n2 Learned D-AMP\n\n2.1 D-IT and D-AMP\n\nLearned D-AMP (LDAMP) is a mixed hand-designed/data-driven compressive signal recovery framework that builds on the D-AMP algorithm [10]. We describe D-AMP now, as well as the simpler denoising-based iterative thresholding (D-IT) algorithm. For concreteness, but without loss of generality, we focus on image recovery.\n\nA compressive image recovery algorithm solves the ill-posed inverse problem of finding the image x given the low-dimensional measurements y = Ax by exploiting prior information on x, such as the fact that x ∈ C, where C is the set of all natural images. 
A natural optimization formulation reads\n\nargmin_x ‖y − Ax‖²₂ subject to x ∈ C. (1)\n\nWhen no measurement noise ε is present, a compressive image recovery algorithm should return the (hopefully unique) image xo at the intersection of the set C and the affine subspace {x | y = Ax} (see Figure 2).\n\nThe premise of D-IT and D-AMP is that high-performance image denoisers Dσ, such as BM3D [30], are high-quality approximate projections onto the set C of natural images.1,2 That is, suppose xo + σz is a noisy observation of a natural image, with xo ∈ C and z ∼ N(0, I). An ideal denoiser Dσ would simply find the point in the set C that is closest to the observation xo + σz:\n\nDσ(xo + σz) = argmin_x ‖xo + σz − x‖²₂ subject to x ∈ C. (2)\n\n1 The notation Dσ indicates that the denoiser can be parameterized by the standard deviation of the noise σ.\n2 Denoisers can also be thought of as a proximal mapping with respect to the negative log likelihood of natural images [31] or as taking a gradient step with respect to the data generating function of natural images [32; 33].\n\nCombining (1) and (2) leads naturally to the D-IT algorithm, presented in (3) and illustrated in Figure 2(a). Starting from x0 = 0, D-IT takes a gradient step towards the affine subspace {x | y = Ax} and then applies the denoiser Dσ to move to x1 in the set C of natural images. Gradient stepping and denoising are repeated for t = 1, 2, . . . until convergence.\n\nD-IT Algorithm:\n\nzt = y − Axt,\nxt+1 = Dˆσt(xt + AH zt). (3)\n\nLet νt = xt + AH zt − xo denote the difference between xt + AH zt and the true signal xo at each iteration; νt is known as the effective noise. At each iteration, D-IT denoises xt + AH zt = xo + νt, i.e., the true signal plus the effective noise. 
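In code, D-IT from (3) is a strict alternation of gradient steps and denoising. The sketch below uses soft-thresholding as a stand-in denoiser (a sparsity prior rather than a natural-image prior like BM3D) and, borrowing the LDIT setting reported in the experiments, a noise-level estimate of 2‖zt‖2/√m; all problem sizes are illustrative:

```python
import numpy as np

def d_it(y, A, denoise, iters=50):
    """D-IT: z_t = y - A x_t;  x_{t+1} = D_sigma(x_t + A^H z_t)."""
    m = len(y)
    x = np.zeros(A.shape[1])                  # start from x^0 = 0
    for _ in range(iters):
        z = y - A @ x                         # residual in measurement space
        pseudo_data = x + A.conj().T @ z      # gradient step toward {x | y = A x}
        sigma_hat = 2 * np.linalg.norm(z) / np.sqrt(m)  # effective-noise estimate
        x = denoise(pseudo_data, sigma_hat)   # approximate projection onto the signal set
    return x

# Stand-in denoiser: soft-thresholding at the estimated noise level.
soft = lambda v, s: np.sign(v) * np.maximum(np.abs(v) - s, 0.0)

rng = np.random.default_rng(2)
n, m = 400, 160
x_true = np.zeros(n)
x_true[rng.choice(n, 8, replace=False)] = 2.0
A = rng.standard_normal((m, n)) / np.sqrt(m)
x_hat = d_it(A @ x_true, A, soft)
```

Swapping `denoise` for a trained image denoiser turns this same loop into the D-IT (and, with the correction term below, D-AMP) image recovery algorithm.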
Most denoisers are designed to work with νt as additive white Gaussian noise (AWGN). Unfortunately, as D-IT iterates, the denoiser biases the intermediate solutions, and νt soon deviates from AWGN. Consequently, the denoising iterations become less effective [5; 10; 26], and convergence slows.\n\nD-AMP differs from D-IT in that it corrects for the bias in the effective noise at each iteration t = 0, 1, . . . using an Onsager correction term bt.\n\nD-AMP Algorithm:\n\nbt = zt−1 div Dˆσt−1(xt−1 + AH zt−1) / m,\nzt = y − Axt + bt,\nˆσt = ‖zt‖₂/√m,\nxt+1 = Dˆσt(xt + AH zt). (4)\n\nThe Onsager correction term removes the bias from the intermediate solutions so that the effective noise νt follows the AWGN model expected by typical image denoisers. For more information on the Onsager correction, its origins, and its connection to the Thouless-Anderson-Palmer equations [34], see [5] and [35]. Note that ‖zt‖₂/√m serves as a useful and accurate estimate of the standard deviation of νt [36]. Typically, D-AMP algorithms use a Monte-Carlo approximation for the divergence div D(·), which was first introduced in [37; 10].\n\n2.2 Denoising convolutional neural network\n\nNNs have a long history in signal denoising; see, for instance, [38]. However, only recently have they begun to significantly outperform established methods like BM3D [30]. In this section we review the recently developed Denoising Convolutional Neural Network (DnCNN) image denoiser [39], which is both more accurate and far faster than competing techniques.\n\nThe DnCNN neural network consists of 16 to 20 convolutional layers, organized as follows. The first convolutional layer uses 64 different 3 × 3 × c filters (where c denotes the number of color channels) and is followed by a rectified linear unit (ReLU) [40]. 
The next 14 to 18 convolutional layers each use 64 different 3 × 3 × 64 filters, each followed by batch-normalization [41] and a ReLU. The final convolutional layer uses c separate 3 × 3 × 64 filters to reconstruct the signal. The parameters are learned via residual learning [42].\n\n2.3 Unrolling D-IT and D-AMP into networks\n\nFigure 3: Two layers of the LDAMP neural network. When used with the DnCNN denoiser, each denoiser block is a 16 to 20 convolutional-layer neural network. The forward and backward operators are represented as the matrices A and AH; however, function handles work as well.\n\nThe central contribution of this work is to apply the unrolling ideas described in Section 1.3 to D-IT and D-AMP to form the LDIT and LDAMP neural networks. The LDAMP network, presented in (5) and illustrated in Figure 3, consists of 10 AMP layers where each AMP layer contains two denoisers with tied weights. One denoiser is used to update xl, and the other is used to estimate the divergence using the Monte-Carlo approximation from [37; 10]. The LDIT network is nearly identical but does not compute an Onsager correction term and hence only applies one denoiser per layer. One of the few challenges to unrolling D-IT and D-AMP is that, to enable training, we must use a denoiser that easily propagates gradients; a black box denoiser like BM3D will not work. 
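As an aside, the Monte-Carlo divergence approximation from [37; 10] that the second denoiser in each layer supports needs only one extra denoiser evaluation. A sketch, with an illustrative probe scale and a simple linear test denoiser whose true divergence is known:

```python
import numpy as np

def mc_divergence(denoise, x, eps=1e-3, rng=None):
    """Estimate div D(x) = sum_i dD(x)_i / dx_i with one random probe:
    div D(x) ~ (1/eps) * eta^T (D(x + eps*eta) - D(x)), eta ~ N(0, I)."""
    rng = rng or np.random.default_rng()
    eta = rng.standard_normal(x.shape)
    return float(eta @ (denoise(x + eps * eta) - denoise(x))) / eps

# Sanity check on the linear "denoiser" D(x) = 0.5 x, whose exact divergence is 0.5 * n.
n = 10_000
estimate = mc_divergence(lambda v: 0.5 * v, np.zeros(n), rng=np.random.default_rng(3))
```

For a single probe the relative error shrinks like O(1/√n), which is why one evaluation suffices on full-size images.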
This restricts us to denoisers such as DnCNN, which, fortunately, offers improved performance.\n\nLDAMP Neural Network:\n\nbl = zl−1 div Dl−1 wl−1(ˆσl−1)(xl−1 + AH zl−1) / m,\nzl = y − Axl + bl,\nˆσl = ‖zl‖₂/√m,\nxl+1 = Dl wl(ˆσl)(xl + AH zl). (5)\n\nWithin (5), we use the slightly cumbersome notation Dl wl(ˆσl) to indicate that layer l of the network uses denoiser Dl, that this denoiser depends on its weights/biases wl, and that these weights may be a function of the estimated standard deviation of the noise ˆσl. During training, the only free parameters we learn are the denoiser weights w1, . . . , wL. This is distinct from the LISTA and LAMP networks, where the authors decouple and learn the A and AH matrices used in the network [24; 26].\n\n3 Training the LDIT and LDAMP networks\n\nWe experimented with three different methods to train the LDIT and LDAMP networks. Here we describe and compare these training methods at a high level; the details are described in Section 5.\n\n• End-to-end training: We train all the weights of the network simultaneously. This is the standard method of training a neural network.\n\n• Layer-by-layer training: We train a 1-AMP-layer network (which itself contains a 16-20 layer denoiser) to recover the signal, fix these weights, add an AMP layer, train the second layer of the resulting 2-layer network to recover the signal, fix these weights, and repeat until we have trained a 10-layer network.\n\n• Denoiser-by-denoiser training: We decouple the denoisers from the rest of the network and train each on AWGN denoising problems at different noise levels. During inference, the network uses its estimate of the standard deviation of the noise to select which set of denoiser weights to use. 
Note that, in selecting which denoiser weights to use, we must discretize the expected range of noise levels; e.g., if ˆσ = 35, then we use the denoiser for noise standard deviations between 20 and 40.\n\nFigure 4: Average PSNRs4 of 100 40 × 40 image reconstructions with i.i.d. Gaussian measurements, trained at a sampling rate of m/n = 0.20 and tested at sampling rates of m/n = 0.20 (a) and m/n = 0.05 (b).\n\n(a) Tested at m/n = 0.20 | LDIT | LDAMP\nEnd-to-end | 32.1 | 33.1\nLayer-by-layer | 26.1 | 33.1\nDenoiser-by-denoiser | 28.0 | 31.6\n\n(b) Tested at m/n = 0.05 | LDIT | LDAMP\nEnd-to-end | 8.0 | 18.7\nLayer-by-layer | -2.6 | 18.7\nDenoiser-by-denoiser | 22.1 | 25.9\n\nComparing Training Methods. Stochastic gradient descent theory suggests that layer-by-layer and denoiser-by-denoiser training should sacrifice performance as compared to end-to-end training [43]. In Section 4.2 we will prove that this is not the case for LDAMP: for LDAMP, layer-by-layer and denoiser-by-denoiser training are minimum-mean-squared-error (MMSE) optimal. These theoretical results are borne out experimentally in Tables 4(a) and 4(b). Each of the networks tested in this section consists of 10 unrolled DAMP/DIT layers that each contain a 16-layer DnCNN denoiser.\n\nTable 4(a) demonstrates that, as suggested by theory, layer-by-layer training of LDAMP is optimal; additional end-to-end training does not improve the performance of the network. In contrast, the table demonstrates that layer-by-layer training of LDIT, which represents the behavior of a typical neural network, is suboptimal; additional end-to-end training dramatically improves its performance.\n\nDespite the theoretical result that denoiser-by-denoiser training is optimal, Table 4(a) shows that LDAMP trained denoiser-by-denoiser performs slightly worse than the end-to-end and layer-by-layer trained networks. This gap in performance is likely due to the discretization of the noise levels, which is not modeled in our theory. 
This gap can be reduced by using a finer discretization of the noise levels or by using deeper denoiser networks that can better handle a range of noise levels [39].\n\nIn Table 4(b) we report on the performance of the two networks when trained at one sampling rate and tested at another. LDIT and LDAMP networks trained end-to-end and layer-by-layer at a sampling rate of m/n = 0.2 perform poorly when tested at a sampling rate of m/n = 0.05. In contrast, the denoiser-by-denoiser trained networks, which were not trained at a specific sampling rate, generalize well to different sampling rates.\n\n4 Theoretical analysis of LDAMP\n\nThis section makes two theoretical contributions. First, we show that the state-evolution (S.E.), a framework that predicts the performance of AMP/D-AMP, holds for LDAMP as well.5 Second, we use the S.E. to prove that layer-by-layer and denoiser-by-denoiser training of LDAMP are MMSE optimal.\n\n4.1 State-evolution\n\nIn the context of LAMP and LDAMP, the S.E. equations predict the intermediate mean squared error (MSE) of the network over each of its layers [26]. Starting from θ0 = ‖xo‖²₂/n, the S.E. generates a sequence of numbers through the following iteration:\n\nθl+1(xo, δ, σ_ε²) = (1/n) E_ε‖Dl wl(σ)(xo + σl ε) − xo‖²₂, (6)\n\nwhere (σl)² = θl(xo, δ, σ_ε²)/δ + σ_ε², the scalar σ_ε is the standard deviation of the measurement noise ε, and the expectation is with respect to ε ∼ N(0, I). The notation θl+1(xo, δ, σ_ε²) is used to emphasize that θl may depend on the signal xo, the under-determinacy δ, and the measurement noise.\n\nLet xl denote the estimate at layer l of LDAMP. Our empirical findings, illustrated in Figure 5, show that the MSE of LDAMP is predicted accurately by the S.E. 
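The S.E. recursion (6) is cheap to evaluate numerically. The sketch below runs it by Monte-Carlo with a soft-thresholding denoiser standing in for the trained, layer-dependent denoisers; the signal, under-determinacy δ, and trial count are illustrative:

```python
import numpy as np

def state_evolution(x0, delta, sigma_eps, denoise, layers=10, trials=20, rng=None):
    """Iterate theta_{l+1} = (1/n) E || D(x0 + sigma_l * eps) - x0 ||^2_2,
    with (sigma_l)^2 = theta_l / delta + sigma_eps^2 and theta_0 = ||x0||^2_2 / n."""
    rng = rng or np.random.default_rng()
    n = x0.size
    theta = np.linalg.norm(x0) ** 2 / n
    history = [theta]
    for _ in range(layers):
        sigma_l = np.sqrt(theta / delta + sigma_eps ** 2)
        errs = [np.linalg.norm(denoise(x0 + sigma_l * rng.standard_normal(n), sigma_l) - x0) ** 2
                for _ in range(trials)]
        theta = np.mean(errs) / n   # Monte-Carlo estimate of the expectation in (6)
        history.append(theta)
    return history

soft = lambda v, s: np.sign(v) * np.maximum(np.abs(v) - s, 0.0)  # stand-in denoiser

rng = np.random.default_rng(4)
n = 2000
x0 = np.zeros(n)
x0[rng.choice(n, 40, replace=False)] = 1.0   # a sparse stand-in "image"
predicted_mse = state_evolution(x0, delta=0.2, sigma_eps=0.0, denoise=soft, rng=rng)
```

Each entry of `predicted_mse` is the S.E. prediction of the network's MSE after the corresponding layer, computed without ever running the recovery algorithm itself.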
We formally state our finding.\n\nFinding 1. If the LDAMP network starts from x0 = 0, then for large values of m and n, the S.E. predicts the mean square error of LDAMP at each layer, i.e., θl(xo, δ, σ_ε²) ≈ (1/n)‖xl − xo‖²₂, if the following conditions hold: (i) The elements of the matrix A are i.i.d. Gaussian (or subgaussian) with mean zero and standard deviation 1/√m. (ii) The noise w is also i.i.d. Gaussian. (iii) The denoisers Dl at each layer are Lipschitz continuous.6\n\nFigure 5: The MSE of intermediate reconstructions of the Boat test image across different layers for the DnCNN variants of LDAMP and LDIT alongside their predicted S.E. The image was sampled with Gaussian measurements at a rate of m/n = 0.1. Note that LDAMP is well predicted by the S.E., whereas LDIT is not.\n\n4 PSNR = 10 log10(255² / mean((x̂ − xo)²)) when the pixel range is 0 to 255.\n5 For D-AMP and LDAMP, the S.E. is entirely observational; no rigorous theory exists. For AMP, the S.E. has been proven asymptotically accurate for i.i.d. Gaussian measurements [44].\n\n4.2 Layer-by-layer and denoiser-by-denoiser training is optimal\n\nThe S.E. framework enables us to prove the following results: layer-by-layer and denoiser-by-denoiser training of LDAMP are MMSE optimal. Both of these results rely upon the following lemma.\n\nLemma 1. Suppose that D1, D2, . . . , DL are monotone denoisers in the sense that, for l = 1, 2, . . . , L, inf_wl E_ε‖Dl wl(σ)(xo + σε) − xo‖²₂ is a non-decreasing function of σ. If the weights w1 of D1 are set to minimize Ex0[θ1] and fixed; and then the weights w2 of D2 are set to minimize Ex0[θ2] and fixed, . . . 
and then the weights wL of DL are set to minimize Ex0[θL], then together they minimize Ex0[θL].\n\nLemma 1 can be derived using the proof technique for Lemma 3 of [10], but with θl replaced by Ex0[θl] throughout. It leads to the following two results.\n\nCorollary 1. Under the conditions in Lemma 1, layer-by-layer training of LDAMP is MMSE optimal.\n\nThis result follows from Lemma 1 and the equivalence between Ex0[θl] and Ex0[(1/n)‖xl − xo‖²₂].\n\nCorollary 2. Under the conditions in Lemma 1, denoiser-by-denoiser training of LDAMP is MMSE optimal.\n\nThis result follows from Lemma 1 and the equivalence between Ex0[θl] and Ex0[(1/n) E_ε‖Dl wl(σ)(xo + σl ε) − xo‖²₂].\n\n5 Experiments\n\nDatasets. Training images were pulled from Berkeley’s BSD-500 dataset [46]. From this dataset, we used 400 images for training, 50 for validation, and 50 for testing. For the results presented in Section 3, the training images were cropped, rescaled, flipped, and rotated to form a set of 204,800 overlapping 40 × 40 patches. The validation images were cropped to form 1,000 non-overlapping 40 × 40 patches. We used 256 non-overlapping 40 × 40 patches for test. For the results presented in this section, we used 382,464 50 × 50 patches for training, 6,528 50 × 50 patches for validation, and seven standard test images, illustrated in Figure 6 and rescaled to various resolutions, for test.\n\nImplementation. We implemented LDAMP and LDIT, using the DnCNN denoiser [39], in both TensorFlow and MatConvNet [47], which is a toolbox for Matlab. Public implementations of both versions of the algorithm are available at https://github.com/ricedsp/D-AMP_Toolbox.\n\n6 A denoiser is said to be L-Lipschitz continuous if for every x1, x2 ∈ C we have ‖D(x1) − D(x2)‖²₂ ≤ 
L‖x1 − x2‖²₂. While we did not find it necessary in practice, weight clipping and gradient norm penalization can be used to ensure Lipschitz continuity of the convolutional denoiser [45].\n\nFigure 6: The seven test images: (a) Barbara, (b) Boat, (c) Couple, (d) Peppers, (e) Fingerprint, (f) Mandrill, (g) Bridge.\n\nTraining parameters. We trained all the networks using the Adam optimizer [48] with a learning rate of 0.001, which we dropped to 0.0001 and then 0.00001 when the validation error stopped improving. We used mini-batches of 32 to 256 patches, depending on network size and memory usage. For layer-by-layer and denoiser-by-denoiser training, we used a different randomly generated measurement matrix for each mini-batch. Training generally took between 3 and 5 hours per denoiser on an Nvidia Pascal Titan X. Results in this section are for denoiser-by-denoiser trained networks, which consist of 10 unrolled DAMP/DIT layers that each contain a 20-layer DnCNN denoiser.\n\nCompetition. We compared the performance of LDAMP to three state-of-the-art image recovery algorithms: TVAL3 [7], NLR-CS [9], and BM3D-AMP [10]. We also include a comparison with LDIT to demonstrate the benefits of the Onsager correction term. Our results do not include comparisons with any other NN-based techniques. While many NN-based methods are very specialized and only work for fixed matrices [13–16; 27], the recently proposed OneNet [23] and RIDE-CS [21] methods can be applied more generally. Unfortunately, we were unable to train and test the OneNet code in time for this submission. While RIDE-CS code was available, the implementation requires the measurement matrices to have orthonormalized rows. When tested on matrices without orthonormal rows, RIDE-CS performed significantly worse than the other methods.\n\nAlgorithm parameters. All algorithms used their default parameters. 
However, NLR-CS was initialized using 8 iterations of BM3D-AMP, as described in [10]. BM3D-AMP was run for 10 iterations. LDIT and LDAMP used 10 layers. LDIT had its per-layer noise standard deviation estimate ˆσ set to 2‖zl‖₂/√m, as was done with D-IT in [10].\n\nTesting setup. We tested the algorithms with i.i.d. Gaussian measurements and with measurements from a randomly sampled coded diffraction pattern [49]. The coded diffraction pattern forward operator was formed as a composition of three steps: randomly (uniformly) change the phase, take a 2D FFT, and then randomly (uniformly) subsample. Except for the results in Figure 7, we tested the algorithms with 128 × 128 images (n = 128²). We report recovery accuracy in terms of PSNR. We report run times in seconds. Results broken down by image are provided in the supplement.\n\nGaussian measurements. With noise-free Gaussian measurements, the LDAMP network produces the best reconstructions at every sampling rate on every image except Fingerprints, which looks very unlike the natural images the network was trained on. With noise-free Gaussian measurements, LDIT and LDAMP produce reconstructions significantly faster than the competing methods. Note that, despite having to perform twice as many denoising operations, at a sampling rate of m/n = 0.25 the LDAMP network is only about 25% slower than LDIT. This indicates that matrix multiplies, not denoising operations, are the dominant source of computation. Average recovery PSNRs and run times are reported in Table 1. With noisy Gaussian measurements, LDAMP uniformly outperformed the other methods; these results can be found in the supplement.\n\nCoded diffraction measurements. With noise-free coded diffraction measurements, the LDAMP network again produces the best reconstructions on every image except Fingerprints. 
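The three-step coded diffraction forward operator described above (random phase modulation, 2D FFT, random subsampling) and its adjoint can both be applied in O(n log n) with function handles. A sketch; the specific random pattern and the sampling rate are illustrative:

```python
import numpy as np

rng = np.random.default_rng(5)
side = 128
n = side * side
m = n // 4                                             # sampling rate m/n = 0.25

phase = np.exp(2j * np.pi * rng.random((side, side)))  # step 1: uniform random phase
keep = rng.choice(n, size=m, replace=False)            # step 3: uniform random subsampling

def forward(x_img):
    """A x: modulate the phase, take a 2D FFT, then subsample the result."""
    return np.fft.fft2(phase * x_img, norm="ortho").ravel()[keep]

def adjoint(y_vec):
    """A^H y: zero-fill the kept frequencies, inverse FFT, undo the phase modulation."""
    full = np.zeros(n, dtype=complex)
    full[keep] = y_vec
    return np.conj(phase) * np.fft.ifft2(full.reshape(side, side), norm="ortho")

x = rng.standard_normal((side, side))
y = forward(x)
x_back = adjoint(y)   # used wherever A^H appears in the D-IT/D-AMP iterations
```

Since the phase modulation and the orthonormal FFT are unitary and subsampling is a selection, the identity ⟨Ax, Ax⟩ = ⟨x, A^H A x⟩ holds exactly, which is a quick way to verify an adjoint pair before handing it to a recovery algorithm.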
With coded diffraction measurements, LDIT and LDAMP produce reconstructions significantly faster than competing methods. Note that because the coded diffraction measurement forward and backward operators can be applied in O(n log n) operations, denoising becomes the dominant source of computation: LDAMP, which has twice as many denoising operations as LDIT, takes roughly 2× longer to complete. Average recovery PSNRs and run times are reported in Table 2. We end this section with a visual comparison of 512 × 512 reconstructions from TVAL3, BM3D-AMP, and LDAMP, presented in Figure 7.\n\nTable 1: PSNRs and run times (sec) of 128 × 128 reconstructions with i.i.d. Gaussian measurements and no measurement noise at various sampling rates.\n\nMethod | m/n = 0.10 | m/n = 0.15 | m/n = 0.20 | m/n = 0.25 (PSNR / Time)\nTVAL3 | 21.5 / 2.2 | 22.8 / 2.9 | 24.0 / 3.6 | 25.0 / 4.3\nBM3D-AMP | 23.1 / 4.8 | 25.1 / 4.4 | 26.6 / 4.2 | 27.9 / 4.1\nLDIT | 20.1 / 0.3 | 20.7 / 0.4 | 21.1 / 0.4 | 21.7 / 0.5\nLDAMP | 23.7 / 0.4 | 25.7 / 0.5 | 27.2 / 0.5 | 28.5 / 0.6\nNLR-CS | 23.2 / 85.9 | 25.2 / 104.0 | 26.8 / 124.4 | 28.2 / 146.3\n\nTable 2: PSNRs and run times (sec) of 128 × 128 reconstructions with coded diffraction measurements and no measurement noise at various sampling rates.\n\nMethod | m/n = 0.10 | m/n = 0.15 | m/n = 0.20 | m/n = 0.25 (PSNR / Time)\nTVAL3 | 24.0 / 0.52 | 26.0 / 0.46 | 27.9 / 0.43 | 29.7 / 0.41\nBM3D-AMP | 23.8 / 4.55 | 25.7 / 4.29 | 27.5 / 3.67 | 29.1 / 3.40\nLDIT | 22.9 / 0.14 | 25.6 / 0.14 | 27.4 / 0.14 | 28.9 / 0.14\nLDAMP | 25.3 / 0.26 | 27.4 / 0.26 | 28.9 / 0.27 | 30.5 / 0.26\nNLR-CS | 21.6 / 87.82 | 22.8 / 87.43 | 25.1 / 87.18 | 26.4 / 86.87\n\n
At high resolutions, the LDAMP reconstructions are incrementally better than those of BM3D-AMP, yet computed over 60× faster.\n\nFigure 7: Reconstructions of the 512 × 512 Boat test image sampled at a rate of m/n = 0.05 using coded diffraction pattern measurements and no measurement noise: (a) original image; (b) TVAL3 (26.4 dB, 6.85 sec); (c) BM3D-AMP (27.2 dB, 75.04 sec); (d) LDAMP (28.1 dB, 1.22 sec). LDAMP’s reconstructions are noticeably cleaner and far faster than the competing methods.\n\n6 Conclusions\n\nIn this paper, we have developed, analyzed, and validated a novel neural network architecture that mimics the behavior of the powerful D-AMP signal recovery algorithm. The LDAMP network is easy to train, can be applied to a variety of different measurement matrices, and comes with a state-evolution heuristic that accurately predicts its performance. Most importantly, LDAMP outperforms the state-of-the-art BM3D-AMP and NLR-CS algorithms in terms of both accuracy and run time.\n\nLDAMP represents the latest example in a trend towards using training data (and lots of offline computation) to improve the performance of iterative algorithms. The key idea behind this paper is that, rather than training a fairly arbitrary black box to learn to recover signals, we can unroll a conventional iterative algorithm and treat the result as a NN, which produces a network with well-understood behavior, performance guarantees, and predictable shortcomings. It is our hope that this paper highlights the benefits of this approach and encourages future research in this direction.\n\nAcknowledgements\n\nThis work was supported in part by DARPA REVEAL grant HR0011-16-C-0028, DARPA OMNISCIENT grant G001534-7500, ONR grant N00014-15-1-2735, ARO grant W911NF-15-1-0316, ONR grant N00014-17-1-2551, and NSF grant CCF-1527501. In addition, C. Metzler was supported in part by the NSF GRFP.\n\nReferences\n\n[1] E. J. Candes, J. 
Romberg, and T. Tao, \u201cRobust uncertainty principles: Exact signal reconstruction from\nhighly incomplete frequency information,\u201d IEEE Trans. Inform. Theory, vol. 52, no. 2, pp. 489\u2013509, Feb.\n2006.\n\n[2] R. G. Baraniuk, \u201cCompressive sensing [lecture notes],\u201d IEEE Signal Processing Mag., vol. 24, no. 4, pp.\n\n118\u2013121, 2007.\n\n[3] D. Needell and J. A. Tropp, \u201cCoSaMP: Iterative signal recovery from incomplete and inaccurate samples,\u201d\n\nAppl. Comput. Harmon. Anal., vol. 26, no. 3, pp. 301\u2013321, 2009.\n\n[4] I. Daubechies, M. Defrise, and C. D. Mol, \u201cAn iterative thresholding algorithm for linear inverse problems\n\nwith a sparsity constraint,\u201d Comm. on Pure and Applied Math., vol. 75, pp. 1412\u20131457, 2004.\n\n[5] D. L. Donoho, A. Maleki, and A. Montanari, \u201cMessage passing algorithms for compressed sensing,\u201d Proc.\n\nNatl. Acad. Sci., vol. 106, no. 45, pp. 18 914\u201318 919, 2009.\n\n[6] S. Rangan, P. Schniter, and A. Fletcher, \u201cVector approximate message passing,\u201d arXiv preprint\n\narXiv:1610.03082, 2016.\n\n[7] C. Li, W. Yin, and Y. Zhang, \u201cUser\u2019s guide for TVAL3: TV minimization by augmented Lagrangian and\n\nalternating direction algorithms,\u201d Rice CAAM Department report, vol. 20, pp. 46\u201347, 2009.\n\n[8] R. G. Baraniuk, V. Cevher, M. F. Duarte, and C. Hegde, \u201cModel-based compressive sensing,\u201d IEEE Trans.\n\nInform. Theory, vol. 56, no. 4, pp. 1982 \u20132001, Apr. 2010.\n\n[9] W. Dong, G. Shi, X. Li, Y. Ma, and F. Huang, \u201cCompressive sensing via nonlocal low-rank regularization,\u201d\n\nIEEE Trans. Image Processing, vol. 23, no. 8, pp. 3618\u20133632, 2014.\n\n[10] C. A. Metzler, A. Maleki, and R. G. Baraniuk, \u201cFrom denoising to compressed sensing,\u201d IEEE Trans.\n\nInform. Theory, vol. 62, no. 9, pp. 5117\u20135144, 2016.\n\n[11] P. Schniter, S. Rangan, and A. 
Fletcher, \u201cDenoising based vector approximate message passing,\u201d arXiv\n\npreprint arXiv:1611.01376, 2016.\n\n[12] S. Beygi, S. Jalali, A. Maleki, and U. Mitra, \u201cAn ef\ufb01cient algorithm for compression-based compressed\n\nsensing,\u201d arXiv preprint arXiv:1704.01992, 2017.\n\n[13] A. Mousavi, A. B. Patel, and R. G. Baraniuk, \u201cA deep learning approach to structured signal recovery,\u201d\n\nProc. Allerton Conf. Communication, Control, and Computing, pp. 1336\u20131343, 2015.\n\n[14] K. Kulkarni, S. Lohit, P. Turaga, R. Kerviche, and A. Ashok, \u201cReconnet: Non-iterative reconstruction of\nimages from compressively sensed measurements,\u201d Proceedings of the IEEE Conference on Computer\nVision and Pattern Recognition, pp. 449\u2013458, 2016.\n\n[15] A. Mousavi and R. G. Baraniuk, \u201cLearning to invert: Signal recovery via deep convolutional networks,\u201d\n\nProc. IEEE Int. Conf. Acoust., Speech, and Signal Processing (ICASSP), pp. 2272\u20132276, 2017.\n\n[16] H. Yao, F. Dai, D. Zhang, Y. Ma, S. Zhang, and Y. Zhang, \u201cDR2-net: Deep residual reconstruction network\n\nfor image compressive sensing,\u201d arXiv preprint arXiv:1702.05743, 2017.\n\n[17] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol, \u201cStacked denoising autoencoders:\nLearning useful representations in a deep network with a local denoising criterion,\u201d J. Machine Learning\nResearch, vol. 11, pp. 3371\u20133408, 2010.\n\n[18] E. W. Tramel, A. Dr\u00e9meau, and F. Krzakala, \u201cApproximate message passing with restricted Boltzmann\nmachine priors,\u201d Journal of Statistical Mechanics: Theory and Experiment, vol. 2016, no. 7, p. 073401,\n2016.\n\n10\n\n\f[19] E. W. Tramel, A. Manoel, F. Caltagirone, M. Gabri\u00e9, and F. Krzakala, \u201cInferring sparsity: Compressed\nsensing using generalized restricted Boltzmann machines,\u201d Proc. IEEE Information Theory Workshop\n(ITW), pp. 265\u2013269, 2016.\n\n[20] E. W. Tramel, M. 
Gabri\u00e9, A. Manoel, F. Caltagirone, and F. Krzakala, \u201cA deterministic and gen-\neralized framework for unsupervised learning with restricted Boltzmann machines,\u201d arXiv preprint\narXiv:1702.03260, 2017.\n\n[21] A. Dave, A. K. Vadathya, and K. Mitra, \u201cCompressive image recovery using recurrent generative model,\u201d\n\narXiv preprint arXiv:1612.04229, 2016.\n\n[22] L. Theis and M. Bethge, \u201cGenerative image modeling using spatial LSTMs,\u201d Proc. Adv. in Neural\n\nProcessing Systems (NIPS), pp. 1927\u20131935, 2015.\n\n[23] J. Rick Chang, C.-L. Li, B. Poczos, B. Vijaya Kumar, and A. C. Sankaranarayanan, \u201cOne network to solve\nthem all\u2013Solving linear inverse problems using deep projection models,\u201d Proc. IEEE Int. Conf. Comp.\nVision, and Pattern Recognition, pp. 5888\u20135897, 2017.\n\n[24] K. Gregor and Y. LeCun, \u201cLearning fast approximations of sparse coding,\u201d Proc. Int. Conf. Machine\n\nLearning, pp. 399\u2013406, 2010.\n\n[25] U. S. Kamilov and H. Mansour, \u201cLearning optimal nonlinearities for iterative thresholding algorithms,\u201d\n\nIEEE Signal Process. Lett., vol. 23, no. 5, pp. 747\u2013751, 2016.\n\n[26] M. Borgerding and P. Schniter, \u201cOnsager-corrected deep networks for sparse linear inverse problems,\u201d\n\narXiv preprint arXiv:1612.01183, 2016.\n\n[27] Y. Yang, J. Sun, H. Li, and Z. Xu, \u201cDeep ADMM-net for compressive sensing MRI,\u201d Proc. Adv. in Neural\n\nProcessing Systems (NIPS), vol. 29, pp. 10\u201318, 2016.\n\n[28] J. R. Hershey, J. L. Roux, and F. Weninger, \u201cDeep unfolding: Model-based inspiration of novel deep\n\narchitectures,\u201d arXiv preprint arXiv:1409.2574, 2014.\n\n[29] T. B. Yakar, P. Sprechmann, R. Litman, A. M. Bronstein, and G. Sapiro, \u201cBilevel sparse models for\n\npolyphonic music transcription.\u201d ISMIR, pp. 65\u201370, 2013.\n\n[30] K. Dabov, A. Foi, V. Katkovnik, and K. 
Egiazarian, \u201cImage denoising by sparse 3-d transform-domain\n\ncollaborative \ufb01ltering,\u201d IEEE Trans. Image Processing, vol. 16, no. 8, pp. 2080\u20132095, Aug. 2007.\n\n[31] S. V. Venkatakrishnan, C. A. Bouman, and B. Wohlberg, \u201cPlug-and-play priors for model based reconstruc-\n\ntion,\u201d Proc. Global Conf. on Signal and Inform. Processing (GlobalSIP), pp. 945\u2013948, 2013.\n\n[32] G. Alain and Y. Bengio, \u201cWhat regularized auto-encoders learn from the data-generating distribution,\u201d J.\n\nMachine Learning Research, vol. 15, no. 1, pp. 3563\u20133593, 2014.\n\n[33] C. K. S\u00f8nderby, J. Caballero, L. Theis, W. Shi, and F. Husz\u00e1r, \u201cAmortised map inference for image\n\nsuper-resolution,\u201d Proc. Int. Conf. on Learning Representations (ICLR), 2017.\n\n[34] D. J. Thouless, P. W. Anderson, and R. G. Palmer, \u201cSolution of \u2018Solvable model of a spin glass\u2019,\u201d Philos.\n\nMag., vol. 35, no. 3, pp. 593\u2013601, 1977.\n\n[35] M. M\u00e9zard and A. Montanari, Information, Physics, Computation: Probabilistic Approaches. Cambridge\n\nUniversity Press, 2008.\n\n[36] A. Maleki, \u201cApproximate message passing algorithm for compressed sensing,\u201d Stanford University PhD\n\nThesis, Nov. 2010.\n\n[37] S. Ramani, T. Blu, and M. Unser, \u201cMonte-Carlo sure: A black-box optimization of regularization parameters\n\nfor general denoising algorithms,\u201d IEEE Trans. Image Processing, pp. 1540\u20131554, 2008.\n\n[38] H. C. Burger, C. J. Schuler, and S. Harmeling, \u201cImage denoising: Can plain neural networks compete with\n\nBM3D?\u201d Proc. IEEE Int. Conf. Comp. Vision, and Pattern Recognition, pp. 2392\u20132399, 2012.\n\n[39] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang, \u201cBeyond a Gaussian denoiser: Residual learning of\n\ndeep CNN for image denoising,\u201d IEEE Trans. Image Processing, 2017.\n\n[40] A. Krizhevsky, I. Sutskever, and G. E. 
Hinton, \u201cImagenet classi\ufb01cation with deep convolutional neural\n\nnetworks,\u201d Proc. Adv. in Neural Processing Systems (NIPS), pp. 1097\u20131105, 2012.\n\n11\n\n\f[41] S. Ioffe and C. Szegedy, \u201cBatch normalization: Accelerating deep network training by reducing internal\n\ncovariate shift,\u201d arXiv preprint arXiv:1502.03167, 2015.\n\n[42] K. He, X. Zhang, S. Ren, and J. Sun, \u201cDeep residual learning for image recognition,\u201d Proc. IEEE Int. Conf.\n\nComp. Vision, and Pattern Recognition, pp. 770\u2013778, 2016.\n\n[43] F. J. \u00b4Smieja, \u201cNeural network constructive algorithms: Trading generalization for learning ef\ufb01ciency?\u201d\n\nCircuits, Systems, and Signal Processing, vol. 12, no. 2, pp. 331\u2013374, 1993.\n\n[44] M. Bayati and A. Montanari, \u201cThe dynamics of message passing on dense graphs, with applications to\n\ncompressed sensing,\u201d IEEE Trans. Inform. Theory, vol. 57, no. 2, pp. 764\u2013785, 2011.\n\n[45] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville, \u201cImproved training of Wasserstein\n\nGANs,\u201d arXiv preprint arXiv:1704.00028, 2017.\n\n[46] D. Martin, C. Fowlkes, D. Tal, and J. Malik, \u201cA database of human segmented natural images and its\napplication to evaluating segmentation algorithms and measuring ecological statistics,\u201d Proc. Int. Conf.\nComputer Vision, vol. 2, pp. 416\u2013423, July 2001.\n\n[47] A. Vedaldi and K. Lenc, \u201cMatconvnet \u2013 Convolutional neural networks for MATLAB,\u201d Proc. ACM Int.\n\nConf. on Multimedia, 2015.\n\n[48] D. Kingma and J. Ba, \u201cAdam: A method for stochastic optimization,\u201d arXiv preprint arXiv:1412.6980,\n\n2014.\n\n[49] E. J. Candes, X. Li, and M. Soltanolkotabi, \u201cPhase retrieval from coded diffraction patterns,\u201d Appl. Comput.\n\nHarmon. Anal., vol. 39, no. 2, pp. 
277\u2013299, 2015.\n\n12\n\n\f", "award": [], "sourceid": 1119, "authors": [{"given_name": "Chris", "family_name": "Metzler", "institution": "Rice University"}, {"given_name": "Ali", "family_name": "Mousavi", "institution": "Rice University"}, {"given_name": "Richard", "family_name": "Baraniuk", "institution": "Rice University"}]}
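The unrolling idea described in the conclusions can be sketched concretely. The toy implementation below is our illustration, not the paper's code: it runs the D-AMP loop with a soft-thresholding denoiser standing in for the learned CNN, using a Monte Carlo estimate of the denoiser's divergence for the Onsager correction term. Unrolling the algorithm into LDAMP amounts to fixing the number of loop iterations as network layers and replacing `soft_threshold` with a trained denoiser.

```python
import numpy as np

rng = np.random.default_rng(1)

def soft_threshold(r, sigma):
    """Stand-in denoiser; LDAMP replaces this with a learned CNN denoiser."""
    return np.sign(r) * np.maximum(np.abs(r) - sigma, 0.0)

def damp(y, A, At, n, iters=10, denoise=soft_threshold):
    """D-AMP: each pass of this loop becomes one layer of the unrolled network."""
    m = y.size
    x = np.zeros(n)
    z = y.copy()
    for _ in range(iters):
        sigma = np.linalg.norm(z) / np.sqrt(m)  # noise-level estimate
        r = x + At(z)                           # effective noisy image
        x_new = denoise(r, sigma)
        # Monte Carlo estimate of the denoiser's divergence (for the Onsager term)
        eps = max(np.abs(r).max(), 1e-3) / 1000.0
        b = rng.standard_normal(n)
        div = b @ (denoise(r + eps * b, sigma) - x_new) / eps
        z = y - A(x_new) + (div / m) * z        # residual with Onsager correction
        x = x_new
    return x

# Toy example: recover a sparse vector from i.i.d. Gaussian measurements.
n, m, k = 400, 200, 10
A_mat = rng.standard_normal((m, n)) / np.sqrt(m)
x_true = np.zeros(n)
x_true[rng.choice(n, size=k, replace=False)] = 1.0
y = A_mat @ x_true
x_hat = damp(y, lambda v: A_mat @ v, lambda v: A_mat.T @ v, n, iters=30)
```

Dropping the `(div / m) * z` term turns this loop into plain denoising-based iterative thresholding, the analogue of LDIT; the extra denoiser call inside the divergence estimate is why D-AMP (and LDAMP) performs twice as many denoising operations per iteration.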