{"title": "Algorithmic Guarantees for Inverse Imaging with Untrained Network Priors", "book": "Advances in Neural Information Processing Systems", "page_first": 14832, "page_last": 14842, "abstract": "Deep neural networks as image priors have been recently introduced for problems\nsuch as denoising, super-resolution and inpainting with promising performance\ngains over hand-crafted image priors such as sparsity. Unlike learned generative\npriors they do not require any training over large datasets. However, few theoretical guarantees exist in the scope of using untrained network priors for inverse imaging problems. We explore new applications and theory for untrained neural network priors. Specifically, we consider the problem of solving linear inverse problems, such as compressive sensing, as well as non-linear problems, such as compressive phase retrieval. We model images to lie in the range of an untrained deep generative network with a fixed seed. We further present a projected gradient descent scheme that can be used for both compressive sensing and phase retrieval and provide rigorous theoretical guarantees for its convergence. We also show both theoretically as well as empirically that with deep neural network priors, one can achieve better compression rates for the same image quality as compared to when hand crafted priors are used.", "full_text": "Algorithmic Guarantees for Inverse Imaging\n\nwith Untrained Network Priors\n\nGauri Jagatap\n\nNew York University\n\ngauri.jagatap@nyu.edu\n\nChinmay Hegde\n\nNew York University\nchinmay.h@nyu.edu\n\nAbstract\n\nDeep neural networks as image priors have been recently introduced for problems\nsuch as denoising, super-resolution and inpainting with promising performance\ngains over hand-crafted image priors such as sparsity. Unlike learned generative\npriors they do not require any training over large datasets. 
However, few theoretical guarantees exist in the scope of using untrained network priors for inverse imaging problems. We explore new applications and theory for untrained neural network priors. Specifically, we consider the problem of solving linear inverse problems, such as compressive sensing, as well as non-linear problems, such as compressive phase retrieval. We model images to lie in the range of an untrained deep generative network with a fixed seed. We further present a projected gradient descent scheme that can be used for both compressive sensing and phase retrieval and provide rigorous theoretical guarantees for its convergence. We also show both theoretically as well as empirically that with deep neural network priors, one can achieve better compression rates for the same image quality as compared to when hand-crafted priors are used.\n\n1 Introduction\n\n1.1 Motivation\n\nDeep neural networks have led to unprecedented success in solving several problems, specifically in the domain of inverse imaging. Image denoising [1], super-resolution [2], inpainting and compressed sensing [3], and phase retrieval [4] are among the many imaging applications that have benefited from the usage of deep convolutional networks (CNNs) trained with thousands of images.\nApart from supervised learning, deep CNN models have also been used in unsupervised setups, such as Generative Adversarial Networks (GANs). Here, image priors based on a generative model [5] are learned from training data. In this context, neural networks emulate the probability distribution of the data inputs. GANs have been used to model signal priors by learning the distribution of training data. Such learned priors have replaced hand-crafted priors with high success rates [3, 6, 7, 8]. However, the main challenge with these approaches is the requirement of massive amounts of training data. 
For instance, super-resolution CNN [2] uses ImageNet, which contains millions of images. Moreover, convergence guarantees for training such networks are limited [7].\nIn contrast, there has been recent interest in using untrained neural networks as an image prior. Deep Image Prior [9] and variants such as Deep Decoder [10] are capable of solving linear inverse imaging problems with no training data whatsoever, while merely imposing an auto-encoder [9] and decoder [10] architecture as a structural prior. For denoising, inpainting and super-resolution, deep image priors have shown superior reconstruction performance as compared to conventional methodologies such as basis pursuit denoising (BPDN) [11], BM3D [12] as well as convolutional sparse coding [13]. Similar empirical results have been claimed very recently in the context of time-series data for audio applications [14, 15].\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nThe theme in all of these approaches is the same: to design a prior that exploits local image correlation, instead of global statistics, and find a good low-dimensional neural representation of natural images. However, most of these works have very limited [16, 10] or no theoretical guarantees.\nNeural network priors for compressive imaging have only recently been explored. In the context of compressive sensing (CS), [17] uses Deep Image Prior along with learned regularization for reconstructing images from compressive measurements [18]. However, the model described still relies on training data for learning appropriate regularization parameters. For the problem of compressive sensing, priors such as sparsity [19] and structured sparsity [20] have traditionally been used.\nPhase retrieval, an inverse imaging problem arising in several Fourier imaging applications, involves reconstructing images from magnitude-only measurements. 
Compressive phase retrieval (CPR) models use sparse priors for reducing sample requirements; however, standard techniques from recent literature [21] suggest a quadratic dependence of the number of measurements on the sparsity level for recovering sparse images from magnitude-only Gaussian measurements, along with the design of a smart initialization scheme [22, 21]. If a prior is learned via a GAN [7, 23], then this requirement can be brought down; however, one requires sufficient training data, which can be prohibitively expensive to obtain in domains such as medical or astronomical imaging.\n\n1.2 Our contributions\n\nIn this paper, we explore, in depth, the use of untrained deep neural networks as an image prior for inverting images from under-sampled linear and non-linear measurements. Specifically, we assume that the image x∗ ∈ Rd has d pixels. We further assume that the image x∗ belongs to the range spanned by the weights of a deep under-parameterized untrained neural network G(w; z), which we denote by S, where w is the set of weights of the deep network and z is the latent code. The compressive measurements are stored in a vector y = f(x∗), where f embeds either compressive linear (defined by operator A(·)) or compressive magnitude-only (defined by operator |A(·)|) measurements. The task is to reconstruct an image x̂ which corresponds to small measurement error min_{x∈S} ‖f(x) − y‖₂². With this setup, we establish theoretical guarantees for successful image reconstruction from both measurement schemes under untrained network priors.\nOur specific contributions are as follows:\n• We first present a new variant of the Restricted Isometry Property (RIP) [18] via a covering number argument for the range of images S spanned by a deep untrained neural network. 
We use this result to guarantee unique image reconstruction for two different compressive imaging schemes.\n• We propose a projected gradient descent (PGD) algorithm for solving the problem of compressive sensing with a deep untrained network prior. To our knowledge this is the first paper to use deep neural network priors for compressive sensing¹ that relies on no training data². We analyze the conditions under which PGD provably converges and report the corresponding sample complexity requirements. We also show superior performance of this framework via empirical results.\n• We are the first to use deep network priors in the context of phase retrieval. We introduce a novel formulation to solve compressive phase retrieval with fewer measurements as compared to the state of the art. We further provide preliminary guarantees for the convergence of a projected gradient descent scheme to solve the problem of compressive phase retrieval. We empirically show significant improvements in image reconstruction quality as compared to prior works.\nWe note that our sample complexity results rely on the number of parameters of the assumed deep network prior. Therefore, to get meaningful bounds, our network priors are under-parameterized, in that the total number of unknown parameters of the deep network is smaller than the dimension of the image. To ensure this, we build upon the formulation of the deep decoder [10], which is a special network architecture resembling the decoder of an autoencoder (or the generator of a GAN). The requirement of under-parameterization of deep network priors is natural; the goal is to design priors that concisely represent natural images. Moreover, this also ensures that the network does not fit noise [10]. 
Due to these merits, we select the deep decoder architecture for all analyses in this paper.\n\n¹We note recent concurrent work in [24] which explores a similar approach for compressive sensing; however, our paper focuses on theoretical guarantees rooted in an algorithmic procedure.\n²[17] requires training data for learning a regularization function.\n\n1.3 Prior work\n\nSparsifying transforms have long been used to constrain the solutions of inverse imaging problems in the context of denoising or inpainting. Conventional approaches to solving these problems include Basis Pursuit Denoising (BPDN) or Lasso [11] and TVAL3 [25], which rely on using ℓ0, ℓ1 and total variation (TV) regularizations on the image to be recovered. Sparsity-based priors are highly effective and dataset-independent; however, they rely heavily on choosing a good sparsifying basis [26].\nInstead of hand-picking the sparsifying transform, in dictionary learning one learns both the sparsifying transform and the sparse code [27]. The dictionary captures global statistics of a given dataset³. Multi-layer convolutional sparse coding [16] is an extension of sparse coding which models a given dataset as a product of several linear dictionaries, all of which are convolutional in nature; the resulting problem is challenging to solve.\nGenerative adversarial networks (GANs) [5] have been used to generate photo-realistic images in an unsupervised fashion. The generator consists of stacked convolutions and maps random low-dimensional noise vectors to full-sized images. GAN priors have been successfully used for inverse imaging problems [6, 7, 28, 29, 8]. 
The shortcomings of this approach are two-fold: test images are strictly restricted to the range of a trained generator, and sufficient training data is required.\nSparse signal recovery from linear compressive measurements [18] as well as magnitude-only compressive measurements [21] has been extensively studied, with several algorithmic approaches [19, 21]. In all of these approaches, modeling the low-dimensional embedding is challenging and may not be captured correctly using simple hand-crafted priors such as structured sparsity [20]. Since it is hard to estimate these hyper-parameters accurately, the number of samples required to reconstruct the image is often much higher than information-theoretic limits [30, 6].\nThe problem of compressive phase retrieval, specifically, is even more challenging because it is non-convex. Several papers in recent literature [31, 32, 21] rely on the design of a spectral initialization scheme which ensures that one can subsequently optimize over a convex ball of the problem. However, this initialization requirement results in high sample requirements and is a bottleneck in achieving information-theoretically optimal sample complexity.\nDeep Image Prior [9] (DIP) primarily uses an encoder-decoder as a prior on the image, alongside an early stopping condition, for inverse imaging problems such as denoising, super-resolution and inpainting. Deep Decoder [10] (DD) improves upon DIP, providing a much simpler, under-parameterized architecture, to learn a low-dimensional manifold (latent code) and a decoding operation from this latent code to the full image. Because it is under-parameterized, the deep decoder does not fit noise, and therefore does not require early stopping.\nDeep network priors in the context of compressive imaging have only recently been explored [17], and only in the context of compressive sensing. 
In contrast with [17], which extends the idea of a Deep Image Prior to incorporate learned regularizations, in this paper we focus more on theoretical aspects of the problem and also explore applications in compressive phase retrieval. To our knowledge the application of deep network priors to compressive phase retrieval is novel.\n\n2 Notation\n\nThroughout the paper, lower-case letters denote vectors, such as v, and upper-case letters denote matrices, such as M. A set of variables subscripted with different indices is represented with bold-faced shorthand of the following form: w := {W1, W2, . . . , WL}. The neural network consists of L layers, each denoted Wl, with l ∈ {1, . . . , L}, and all layers are 1 × 1 convolutional. Up-sampling operators are denoted by Ul. Vectorization of a matrix is written as vec(·). The activation function considered is the Rectified Linear Unit (ReLU), denoted as σ(·). The Hadamard or element-wise product is denoted by ◦. The element-wise absolute-valued vector is denoted by |v|. Unless mentioned otherwise, ‖v‖ denotes the vector ℓ2-norm and ‖M‖ denotes the spectral norm ‖M‖2.\n\n³Local structural information from a single image can also be used to learn dictionaries, by constructing several overlapping crops or patches of a single image.\n\n3 Problem setup\n\n3.1 Deep neural network priors\n\nIn this paper we discuss the problem of inverting a mapping x → y of the form:\n\ny = f(x)\n\nwhere x = vec(X) ∈ Rdk is the vectorization of a d-pixel signal (image) X ∈ Rd×k with k channels, and f : x → y ∈ Rn captures a compressive measurement procedure, such as a linear operator A(·) or magnitude-only measurements |A(·)|, with n < dk. We elaborate further on the exact structure of f in the next subsection (Section 3.2). 
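To make the two measurement models concrete, here is a minimal NumPy sketch (our illustration, not from the paper; the dimensions, seed and variable names are assumptions chosen for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 784, 78  # image dimension and number of measurements (n < d)
A = rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, d))  # A_ij ~ N(0, 1/n)

def f_linear(x):
    """Compressive linear measurements y = Ax."""
    return A @ x

def f_magnitude(x):
    """Compressive magnitude-only measurements y = |Ax| (phase retrieval)."""
    return np.abs(A @ x)

x = rng.uniform(size=d)  # stand-in for a vectorized image
y_lin, y_mag = f_linear(x), f_magnitude(x)
```

Both maps compress a d-dimensional image to n measurements; the magnitude-only map additionally discards the sign (phase) of each measurement.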
The task of reconstructing the image x from measurements y can be formulated as an optimization problem of the form:\n\nmin_{x∈S} ‖y − f(x)‖²   (1)\n\nwhere we have chosen the ℓ2-squared loss function and where S captures the prior on the image.\nIf the image x can be represented as the action of a deep generative network G(w; z) with weights w on some latent code z, such that x = G(w; z), then the set S captures the characteristics of G(w; z). The latent code z := vec(Z1) with Z1 ∈ Rd1×k1 is a low-dimensional embedding with dimension d1k1 ≪ dk and its elements are generated from a uniform random distribution.\nWhen the network G(·) and its weights w := {W1, . . . , WL} are known (from pre-training a generative network over large datasets) and fixed, the task is to obtain an estimate x̂ = G(w; ẑ), which indirectly translates to finding the optimal latent-space encoding ẑ. This problem has been studied in [6, 7] in the form of using learned GAN priors for inverse imaging.\nIn this paper however, the weights of the generator w are not pre-trained; rather, the task is to estimate the image x̂ = G(ŵ; z) ≈ G(w∗; z) = x∗ and corresponding weights ŵ, for a fixed seed z, where x∗ is assumed to be the true image and the true weights w∗ (possibly non-unique) satisfy w∗ = arg min_w ‖x∗ − G(w; z)‖₂². Note that the optimization in Eq. 1 is equivalent to substituting the surjective mapping G : w → x, and optimizing over w:\n\nmin_w ‖y − f(G(w; z))‖²,   (2)\n\nto estimate weights ŵ and the corresponding image x̂.\nSpecifically, the untrained network G(w; z) takes the form of an expansive neural network: a decoder architecture similar to the one in [10]⁴. The neural network is composed of L weight layers Wl, indexed by l ∈ {1, . . . , L}, which are 1 × 1 convolutions, upsampling operators Ul for l ∈ {1, . . . , L − 1}, and ReLU activations σ(·), and is expressed as follows:\n\nx = G(w; z) = UL−1 σ(ZL−1 WL−1) WL = ZL WL,   (3)\n\nwhere σ(·) represents the action of the ReLU operation, Zi ∈ Rdi×ki with Zi = Ui−1 σ(Zi−1 Wi−1) for i = 2, . . . , L, z = vec(Z1), dL = d, and WL ∈ RkL×k.\n\n⁴Alternatively, one may assume the architecture of the generator of a DCGAN [33, 17].\n\nTo capture the range of images spanned by the deep neural network architecture described above, we formally introduce the main assumption in our paper through Definition 1. Without loss of generality, we set k = 1 for the rest of this paper, while noting that the techniques carry over to general k.\nDefinition 1. A given image x ∈ Rd is said to obey an untrained neural network prior if it belongs to a set S defined as:\n\nS := {x | x = G(w; z)}\n\nwhere z is a (randomly chosen, fixed) latent code vector and G(w; z) has the form in Eq. 3.\n\n3.2 Observation models and assumptions\n\nWe now discuss the compressive measurement setup in more detail. Compressive measurement schemes were developed in [18] for efficient imaging and storage of images, and work only as long as certain structural assumptions on the signal (or image) are met. The optimization problem in Eq. 1 is non-convex in general, partly dictated by the non-convexity of the set S. Moreover, in the case of phase retrieval, the loss function is itself non-convex. Therefore unique signal recovery for either problem is not guaranteed without making specific assumptions on the measurement setup.\nIn this paper, we assume that the measurement operation can be represented by the action of a Gaussian matrix A which is rank-deficient (n < d). 
The entries of this matrix are such that Aij ∼ N(0, 1/n). Linear compressive measurements take the form y = Ax and magnitude-only measurements take the form y = |Ax|. We formally discuss the two different imaging schemes in the next two sections. We also present algorithms and theoretical guarantees for their convergence. For both algorithms, we require that a special (S, γ, β)-RIP holds for the measurement matrix A, which is defined below.\nDefinition 2. (S, γ, β)-RIP: Set-Restricted Isometry Property with parameters γ, β: For parameters γ, β > 0, a matrix A ∈ Rn×d satisfies (S, γ, β)-RIP if, for all x ∈ S,\n\nγ‖x‖₂ ≤ ‖Ax‖₂ ≤ β‖x‖₂.\n\nWe refer to the left (lower) inequality as (S, γ)-RIP and the right (upper) inequality as (S, β)-RIP.\nThe (S, 1 − α, 1 + α)-RIP is achieved by a Gaussian matrix A under certain assumptions, which we state and prove via Lemma 1 as follows.\nLemma 1. If an image x ∈ Rd has a decoder prior (captured in set S), where the decoder consists of weights w and piece-wise linear activation (ReLU), a random Gaussian matrix A ∈ Rn×d with elements from N(0, 1/n) satisfies (S, 1 − α, 1 + α)-RIP with probability 1 − e^{−cα²n}, as long as\n\nn = O((k1/α²) Σ_{l=2}^{L} kl log d),\n\nfor a small constant c and 0 < α < 1.\nProof sketch: We use a union-of-subspaces model, similar to the one developed for GAN priors in [6], to capture the range of a deep untrained network.\nOur method uses a linearization principle. If the output sign of each ReLU activation σ(·) on its inputs were known a priori, then the mapping x = G(w; z) becomes a product of linear weight matrices and linear upsampling operators acting on the latent code z. 
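This linearization can be illustrated numerically: once the ReLU sign pattern is fixed, x = G(w; z) collapses to a linear map of w, so x lies in a k1-dimensional subspace on which a Gaussian A acts as a near-isometry. The following sketch (our illustration; all dimensions and the factors U, Z are arbitrary stand-ins, not the paper's architecture) checks the (1 − α, 1 + α) concentration empirically:

```python
import numpy as np

rng = np.random.default_rng(1)
d, k1, n = 256, 8, 120                   # illustrative dimensions, n << d
U = rng.normal(size=(d, 64))             # stand-in for the absorbed upsampling operators
Z = rng.normal(size=(64, k1))            # fixed, known latent-code factor
A = rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, d))  # A_ij ~ N(0, 1/n)

# For a fixed sign pattern, x = U Z w ranges over a k1-dimensional subspace;
# record ||Ax|| / ||x|| over random w to observe the near-isometry.
ratios = []
for _ in range(200):
    w = rng.normal(size=k1)
    x = U @ Z @ w
    ratios.append(np.linalg.norm(A @ x) / np.linalg.norm(x))
```

With n on the order of k1/α², the ratios cluster around 1; the union bound over the N possible sign patterns is what introduces the extra log d factor in Lemma 1.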
The bulk of the proof relies on constructing a counting argument for the number of such linearized networks; call that number N. For a fixed linear subspace, the image x has a representation of the form x = UZw, where U absorbs all upsampling operations, Z is the latent code which is fixed and known, and w is the direct product of all weight matrices, with w ∈ Rk1. An oblivious subspace embedding (OSE) of x takes the form\n\n(1 − α)‖x‖₂ ≤ ‖Ax‖₂ ≤ (1 + α)‖x‖₂,\n\nwhere A is a Gaussian matrix, and holds for all k1-dimensional vectors w with high probability as long as n = O(k1/α²). We further require taking a union bound over all possible such linearized networks, whose number is given by N. The sample complexity corresponding to this bound is then computed to complete the set-restricted RIP result. The complete proof can be found in Appendix D and a discussion on the sample complexity is presented in Appendix B.\n\n4 Linear compressive sensing with deep network prior\n\nWe now analyze linear compressed Gaussian measurements of a vectorized image x, with a deep network prior. The reconstruction problem assumes the following form:\n\nmin_x ‖y − Ax‖²  s.t.  x = G(w; z),   (4)\n\nwhere A ∈ Rn×d is a Gaussian matrix with n < d, the weight matrices w are unknown, and the latent code z is fixed. We solve this problem via Algorithm 1, Network Projected Gradient Descent (Net-PGD) for compressed sensing recovery.\nSpecifically, we break down the minimization into two parts; we first solve an unconstrained minimization of the objective function in Eq. 4 by implementing one step of gradient descent in Step 3 of Algorithm 1. 
The update vt typically does not adhere to the deep network prior constraint, i.e., vt ∉ S. To enforce this constraint, we solve a projection step in Line 4 of Algorithm 1, which is equivalent to fitting a deep network prior to a noisy image. We iterate through this procedure in an alternating fashion until the estimates xt converge to x∗ within error factor ε.\nWe further establish convergence guarantees for Algorithm 1 in Theorem 1.\n\nAlgorithm 1 Net-PGD for compressed sensing recovery.\n1: Input: y, A, z = vec(Z1), η, T = log(1/ε)\n2: for t = 1, · · · , T do\n3:   vt ← xt − ηA⊤(Axt − y)   {gradient step for least squares}\n4:   wt ← arg min_w ‖vt − G(w; z)‖   {projection to range of deep network}\n5:   xt+1 ← G(wt; z)\n6: end for\n7: Output: x̂ ← xT.\n\nTheorem 1. Suppose the sampling matrix A ∈ Rn×d satisfies (S, 1 − α, 1 + α)-RIP with high probability; then Algorithm 1, with η small enough, produces x̂ such that ‖x̂ − x∗‖ ≤ ε, and requires T ∝ log(1/ε) iterations.\nProof sketch: The proof of this theorem predominantly relies on our new set-restricted RIP result and uses standard techniques from compressed sensing theory. Denoting the loss function in Eq. 4 by L(xt) = ‖y − Axt‖², we aim to establish a contraction of the form L(xt+1) < νL(xt), with ν < 1. To achieve this, we combine the projection criterion in Step 4 of Algorithm 1, which strictly implies that\n\n‖xt+1 − vt‖ ≤ ‖x∗ − vt‖,\n\nwith vt = xt − ηA⊤(Axt − y) from Step 3 of Algorithm 1, where η is chosen appropriately. 
Therefore,\n\n‖xt+1 − xt + ηA⊤A(xt − x∗)‖² ≤ ‖x∗ − xt + ηA⊤A(xt − x∗)‖².\n\nFurthermore, we utilize the (S, 1 − α, 1 + α)-RIP and its Corollary 1 (refer to Appendix D), which apply to x∗, xt, xt+1 ∈ S, to show that\n\nL(xt+1) ≤ νL(xt)\n\nand subsequently the error contraction ‖xt+1 − x∗‖ ≤ νo‖xt − x∗‖, with ν, νo < 1, to guarantee linear convergence of Net-PGD for compressed sensing recovery. This convergence result implies that Net-PGD requires T ∝ log(1/ε) iterations to produce x̂ within ε-accuracy of x∗. The complete proof of Theorem 1 can be found in Appendix D. In Appendix A we provide some exposition on the projection step (Line 4 of Algorithm 1).\n\n5 Compressive phase retrieval under deep image prior\n\nIn compressive phase retrieval, one wants to reconstruct a signal x ≈ x∗ ∈ S from measurements of the form y = |Ax∗|, and therefore the objective is to minimize the following:\n\nmin_x ‖y − |Ax|‖²  s.t.  x = G(w; z),   (5)\n\nwhere n < d, A is Gaussian, z is a fixed seed, and the weights w need to be estimated. We propose a Network Projected Gradient Descent (Net-PGD) for compressive phase retrieval to solve this problem, which is presented in Algorithm 2.\nAlgorithm 2 broadly consists of two parts. In the first part, in Line 3 we estimate the phase of the current estimate, and in Line 4 we use this to compute the Wirtinger gradient [31] and execute one step of solving an unconstrained phase retrieval problem with gradient descent. The second part of the algorithm (Line 5) estimates the weights of the deep network prior with noisy input vt. 
This is the projection step and ensures that the output wt, and subsequently the image estimate xt+1 = G(wt; z), lies in the range of the decoder G(·) outlined by the set S.\nWe highlight that the problem in Eq. 5 is significantly more challenging than the one in Eq. 4. The difficulty hinges on estimating the missing phase information accurately. For real-valued vectors, there are 2^n different phase vectors p = sign(Ax) for a fixed choice of x which satisfy y = |Ax|; moreover, the entries of p are restricted to {1, −1}. Hence, phase estimation is a non-convex problem. Therefore, with Algorithm 2 the problem in Eq. 5 can only be solved to convergence locally; an initialization scheme is required to establish global convergence guarantees. We highlight the guarantees of Algorithm 2 in Theorem 2.\n\nAlgorithm 2 Net-PGD for compressive phase retrieval.\n1: Input: A, z = vec(Z1), η, T = log(1/ε), x0 s.t. ‖x0 − x∗‖ ≤ δi‖x∗‖.\n2: for t = 1, · · · , T do\n3:   pt ← sign(Axt)   {phase estimation}\n4:   vt ← xt − ηA⊤(Axt − y ◦ pt)   {gradient step for phase retrieval}\n5:   wt ← arg min_w ‖vt − G(w; z)‖   {projection to range of deep network}\n6:   xt+1 ← G(wt; z)\n7: end for\n8: Output: x̂ ← xT.\n\nTheorem 2. Suppose the sampling matrix A ∈ Rn×d with Gaussian entries satisfies (S, 1 − α, 1 + α)-RIP with high probability; then Algorithm 2 solves Eq. 5 with η small enough, such that ‖x̂ − x∗‖ ≤ ε, as long as the weights are initialized appropriately and the number of measurements is n = O(k1 Σ_{l=2}^{L} kl log d).\nProof sketch: The proof of Theorem 2 relies on two important results: the (S, 1 − α, 1 + α)-RIP, and Lemma 2, which establishes a bound on the phase estimation error. 
Formally, the update in Step 4 of Algorithm 2 can be re-written as\n\nvt+1 = xt − ηA⊤(Axt − Ax∗ ◦ sign(Ax∗) ◦ sign(Axt)) = xt − ηA⊤(Axt − Ax∗) − ηε_p^t,\n\nwhere ε_p^t := A⊤(Ax∗ ◦ (1 − sign(Ax∗) ◦ sign(Axt))) is the phase estimation error.\nIf sign(Ax∗) ≈ sign(Axt), then the above resembles the gradient step from the linear compressive sensing formulation. Thus, if x0 is initialized well, the error due to phase mismatch ε_p^t can be bounded, and subsequently, a convergence result can be formulated.\nNext, Step 5 of Algorithm 2 learns weights wt that produce xt+1 = G(wt; z), such that\n\n‖xt+1 − vt+1‖ ≤ ‖x∗ − vt+1‖\n\nfor t = {1, 2, . . . , T}. Then, the above projection rule yields:\n\n‖xt+1 − x∗‖ = ‖xt+1 − vt+1 + vt+1 − x∗‖ ≤ ‖xt+1 − vt+1‖ + ‖x∗ − vt+1‖ ≤ 2‖x∗ − vt+1‖.\n\nUsing the update rule from Eq. 12 and plugging in for vt+1:\n\n(1/2)‖xt+1 − x∗‖ ≤ ‖(I − ηA⊤A)ht‖ + η‖ε_p^t‖,\n\nwhere ht := xt − x∗ and η is chosen appropriately. 
The rest of the proof relies on bounding the first term via matrix norm inequalities, using Corollary 2 (in Appendix D) of the (S, 1 − α, 1 + α)-RIP, as ‖(I − ηA⊤A)ht‖ ≤ ρo‖ht‖, while the second term is bounded via Lemma 2 as ‖ε_p^t‖ ≤ δo‖xt − x∗‖, as long as ‖x0 − x∗‖ ≤ δi‖x∗‖. Hence we obtain a convergence criterion of the form\n\n‖xt+1 − x∗‖ ≤ 2(ρo + ηδo)‖xt − x∗‖ =: ρ‖xt − x∗‖,\n\nwhere ρ < 1. Note that this proof relies on a bound on the phase error ‖ε_p^t‖, which is established via Lemma 2. The complete proof of Theorem 2 can be found in Appendix D. In Appendix A we provide some exposition on the projection step (Line 5 of Algorithm 2). In our experiments (Section 6) we note that a uniform random initialization of the weights w0 (which is common in training neural networks), to yield x0 = G(w0; z), is sufficient for Net-PGD to succeed for compressive phase retrieval. In Appendix C we show experimental evidence to support this claim.\n\n6 Experimental results\n\nDataset: We use images from the MNIST database and the CelebA database to test our algorithms, and reconstruct 6 grayscale (MNIST, 28 × 28 pixels (d = 784)) and 5 RGB (CelebA) images. The CelebA dataset images are center-cropped to size 64 × 64 × 3 (d = 12288). 
The pixel values of all images are scaled to lie between 0 and 1.\n\nFigure 1: (CS) Reconstructed images from linear measurements (at compression rate n/d = 0.1) with (a) n = 78 measurements for examples from MNIST, (b) n = 1228 measurements for examples from CelebA, and (c) nMSE at different compression rates f = n/d for MNIST (comparing Net-GD, Net-PGD, Lasso and TVAL3).\n\nFigure 2: (CPR) Reconstructed images from magnitude-only measurements (a) at compression rate of n/d = 0.3 for MNIST, (b) at compression rates of n/d = 0.1, 0.5 for CelebA with (rows 1, 3) Net-GD and (rows 2, 4) Net-PGD, and (c) nMSE at different compression rates f = n/d for MNIST (comparing Net-GD, Net-PGD and Sparta).\n\nDeep network architecture: We first optimize the deep network architecture to fit our example images such that x∗ ≈ G(w∗; z) (referred to as the “compressed” image). For MNIST images, the architecture was fixed to a 2-layer configuration with k1 = 15, k2 = 15, k3 = 10, and for CelebA images, a 3-layer configuration with k1 = 120, k2 = 15, k3 = 15, k4 = 10. Both architectures use bilinear upsampling operations. Further details on this setup can be found in Appendix C.\nMeasurement setup: We use a Gaussian measurement matrix of size n × d with n varied such that (i) n/d = 0.08, 0.1, 0.15, 0.2, 0.25, 0.3 for compressive sensing and (ii) n/d = 0.1, 0.2, 0.3, 0.5, 1, 3 for compressive phase retrieval. 
The elements of A are picked such that $A_{i,j} \sim \mathcal{N}(0, 1/n)$, and we report reconstruction error values averaged over 10 different instantiations of A for a fixed image (the digit '0' from MNIST), network configuration, and compression ratio n/d.

6.1 Compressive sensing

Algorithms and baselines: We implement four schemes based on untrained priors for solving CS: (i) gradient descent with a deep network prior, which solves Eq. 2 (we call this Net-GD), similar to [17] but without learned regularization; (ii) Net-PGD; (iii) Lasso ($\ell_1$ regularization) with a sparse prior in the DCT basis; and finally (iv) TVAL3 [25] (total variation regularization). The TVAL3 code only works for grayscale images, so we do not use it for the CelebA examples. The reconstructions are shown in Figure 1 for images from the (a) MNIST and (b) CelebA datasets. Implementation details can be found in Appendix C.

Performance metrics: We compare reconstruction quality using the normalized mean-squared error (nMSE), calculated as $\|\hat{x} - x^*\|^2 / \|x^*\|^2$. We plot the variation of the nMSE with the compression rate f = n/d for all algorithms tested, averaged over all trials, for MNIST in Figure 1(c). We note that both Net-GD and Net-PGD produce superior reconstructions compared to the state of the art. Running time performance is reported in Appendix C.

6.2 Compressive phase retrieval

Algorithms and baselines: We implement three schemes based on untrained priors for solving CPR: (i) Net-GD, (ii) Net-PGD, and finally (iii) Sparse Truncated Amplitude Flow (Sparta) [22], with a sparse prior in the DCT basis, for both datasets. The reconstructions are shown in Figure 2 for the (a) MNIST and (b) CelebA datasets. We plot the nMSE at varying compression rates for all algorithms, averaged over all trials, for MNIST in Figure 2(c), and note that both Net-GD and Net-PGD outperform Sparta.
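To make this experimental setup concrete, the following NumPy sketch builds a small untrained decoder $G(w; z)$ in the MNIST configuration, forms Gaussian linear and magnitude-only measurements, and computes the nMSE metric. The layer pattern (1×1 convolutions alternating with ReLU and bilinear upsampling, with a final sigmoid) and the 7×7 random seed tensor are illustrative assumptions; the exact architecture is specified in Appendix C.

```python
import numpy as np

rng = np.random.default_rng(0)

def bilinear_upsample(t):
    """Double the spatial size of an (H, W, C) tensor by bilinear interpolation."""
    h, w, _ = t.shape
    ys = np.linspace(0, h - 1, 2 * h)
    xs = np.linspace(0, w - 1, 2 * w)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, h - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None, None]
    wx = (xs - x0)[None, :, None]
    return (t[y0][:, x0] * (1 - wy) * (1 - wx) + t[y1][:, x0] * wy * (1 - wx)
            + t[y0][:, x1] * (1 - wy) * wx + t[y1][:, x1] * wy * wx)

def decoder(weights, z):
    """Untrained generator G(w; z): 1x1 convolutions (per-pixel channel mixing)
    alternating with ReLU and bilinear upsampling; sigmoid keeps pixels in [0, 1]."""
    t = z
    for W in weights[:-1]:
        t = bilinear_upsample(np.maximum(t @ W, 0.0))
    return 1.0 / (1.0 + np.exp(-(t @ weights[-1])))

# MNIST-like configuration: a fixed 7x7 seed with k1 channels; two upsampling
# stages give a 28x28 output (d = 784).
k1, k2, k3 = 15, 15, 10
z = rng.standard_normal((7, 7, k1))
weights = [rng.standard_normal((k1, k2)),
           rng.standard_normal((k2, k3)),
           rng.standard_normal((k3, 1))]
x_star = decoder(weights, z).ravel()
d = x_star.size                                   # 784

# Gaussian measurement matrix with A_ij ~ N(0, 1/n), compression ratio n/d = 0.1.
n = int(0.1 * d)                                  # 78 measurements
A = rng.standard_normal((n, d)) / np.sqrt(n)
y_cs = A @ x_star                                 # compressive sensing: y = Ax
y_cpr = np.abs(A @ x_star)                        # compressive phase retrieval: y = |Ax|

def nmse(x_hat, x_true):
    """Normalized mean-squared error ||x_hat - x*||^2 / ||x*||^2."""
    return np.linalg.norm(x_hat - x_true) ** 2 / np.linalg.norm(x_true) ** 2
```

Net-GD and Net-PGD would then recover an estimate $\hat{x}$ in the range of `decoder` from `y_cs` or `y_cpr`; a perfect reconstruction scores an nMSE of 0, while the all-zeros image scores 1.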
Running time performance, as well as the quality of the random initialization scheme, is discussed in Appendix C.

7 Acknowledgments

This work was supported in part by NSF grants CAREER CCF-2005804, CCF-1815101, and a faculty fellowship from the Black and Veatch Foundation.

References

[1] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P. Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11(Dec):3371–3408, 2010.

[2] C. Dong, C. Loy, K. He, and X. Tang. Image super-resolution using deep convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(2):295–307, 2016.

[3] J. Chang, C. Li, B. Póczos, and B. Kumar. One network to solve them all—solving linear inverse problems using deep projection models. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 5889–5898. IEEE, 2017.

[4] C. Metzler, P. Schniter, A. Veeraraghavan, and R. Baraniuk. prDeep: Robust phase retrieval with a flexible deep network. In International Conference on Machine Learning, pages 3498–3507, 2018.

[5] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

[6] A. Bora, A. Jalal, E. Price, and A. Dimakis. Compressed sensing using generative models. In Proceedings of the 34th International Conference on Machine Learning, pages 537–546. JMLR.org, 2017.

[7] P. Hand, O. Leong, and V. Voroninski. Phase retrieval under a generative prior. In Advances in Neural Information Processing Systems, pages 9136–9146, 2018.

[8] Y. Wu, M. Rosca, and T. Lillicrap. Deep compressed sensing. arXiv preprint arXiv:1905.06723, 2019.

[9] D. Ulyanov, A. Vedaldi, and V. Lempitsky. Deep image prior.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9446–9454, 2018.

[10] R. Heckel and P. Hand. Deep decoder: Concise image representations from untrained non-convolutional networks. In International Conference on Learning Representations, 2018.

[11] S. Chen, D. Donoho, and M. Saunders. Atomic decomposition by basis pursuit. SIAM Review, 43(1):129–159, 2001.

[12] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian. Image denoising with block-matching and 3D filtering. In Image Processing: Algorithms and Systems, Neural Networks, and Machine Learning, volume 6064, page 606414. International Society for Optics and Photonics, 2006.

[13] V. Papyan, Y. Romano, J. Sulam, and M. Elad. Convolutional dictionary learning via local processing. In Proceedings of the IEEE International Conference on Computer Vision, pages 5296–5304, 2017.

[14] S. Ravula and A. Dimakis. One-dimensional deep image prior for time series inverse problems. arXiv preprint arXiv:1904.08594, 2019.

[15] M. Michelashvili and L. Wolf. Audio denoising with deep network priors. arXiv preprint arXiv:1904.07612, 2019.

[16] J. Sulam, V. Papyan, Y. Romano, and M. Elad. Multilayer convolutional sparse modeling: Pursuit and dictionary learning. IEEE Transactions on Signal Processing, 66(15):4090–4104, 2018.

[17] D. Van Veen, A. Jalal, E. Price, S. Vishwanath, and A. Dimakis. Compressed sensing with deep image prior and learned regularization. arXiv preprint arXiv:1806.06438, 2018.

[18] D. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289–1306, 2006.

[19] D. Needell and J. Tropp. CoSaMP: Iterative signal recovery from incomplete and inaccurate samples. Applied and Computational Harmonic Analysis, 26(3):301–321, 2009.

[20] R. Baraniuk, V. Cevher, M. Duarte, and C. Hegde. Model-based compressive sensing.
IEEE Transactions on Information Theory, 56:1982–2001, 2010.

[21] G. Jagatap and C. Hegde. Fast, sample-efficient algorithms for structured phase retrieval. In Advances in Neural Information Processing Systems, pages 4917–4927, 2017.

[22] G. Wang, L. Zhang, G. Giannakis, M. Akçakaya, and J. Chen. Sparse phase retrieval via truncated amplitude flow. IEEE Transactions on Signal Processing, 66(2):479–491, 2017.

[23] F. Shamshad, F. Abbas, and A. Ahmed. Deep Ptych: Subsampled Fourier ptychography using generative priors. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7720–7724. IEEE, 2019.

[24] R. Heckel. Regularizing linear inverse problems with convolutional neural networks. arXiv preprint arXiv:1907.03100, 2019.

[25] C. Li, W. Yin, and Y. Zhang. User's guide for TVAL3: TV minimization by augmented Lagrangian and alternating direction algorithms.

[26] S. Mallat. A Wavelet Tour of Signal Processing. Elsevier, 1999.

[27] M. Aharon, M. Elad, and A. Bruckstein. K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Transactions on Signal Processing, 54(11):4311, 2006.

[28] R. Hyder, V. Shah, C. Hegde, and S. Asif. Alternating phase projected gradient descent with generative priors for solving compressive phase retrieval. arXiv preprint arXiv:1903.02707, 2019.

[29] V. Shah and C. Hegde. Solving linear inverse problems using GAN priors: An algorithm with provable guarantees. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4609–4613. IEEE, 2018.

[30] G. Jagatap and C. Hegde. Sample-efficient algorithms for recovering structured signals from magnitude-only measurements. IEEE Transactions on Information Theory, 2019.

[31] Y. Chen and E. Candès. Solving random quadratic systems of equations is nearly as easy as solving linear systems.
In Advances in Neural Information Processing Systems, pages 739–747, 2015.

[32] T. Cai, X. Li, and Z. Ma. Optimal rates of convergence for noisy sparse phase retrieval via thresholded Wirtinger flow. The Annals of Statistics, 44(5):2221–2251, 2016.

[33] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.

[34] S. Oymak and M. Soltanolkotabi. Towards moderate overparameterization: Global convergence guarantees for training shallow neural networks. arXiv preprint arXiv:1902.04674, 2019.

[35] S. Du, X. Zhai, B. Poczos, and A. Singh. Gradient descent provably optimizes over-parameterized neural networks. In International Conference on Learning Representations, 2018.

[36] H. Zhang and Y. Liang. Reshaped Wirtinger flow for solving quadratic system of equations. In Advances in Neural Information Processing Systems, pages 2622–2630, 2016.

[38] T. Sarlós. Improved approximation algorithms for large matrices via random projections. In 2006 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS'06), pages 143–152, 2006.