{"title": "Training Image Estimators without Image Ground Truth", "book": "Advances in Neural Information Processing Systems", "page_first": 2439, "page_last": 2449, "abstract": "Deep neural networks have been very successful in compressive-sensing and image restoration applications, as a means to estimate images from partial, blurry, or otherwise degraded measurements. These networks are trained on a large number of corresponding pairs of measurements and ground-truth images, and thus implicitly learn to exploit domain-specific image statistics. But unlike measurement data, it is often expensive or impractical to collect a large training set of ground-truth images in many application settings. In this paper, we introduce an unsupervised framework for training image estimation networks, from a training set that contains only measurements---with two varied measurements per image---but no ground-truth for the full images desired as output. We demonstrate that our framework can be applied for both regular and blind image estimation tasks, where in the latter case parameters of the measurement model (e.g., the blur kernel) are unknown: during inference, and potentially, also during training. We evaluate our framework for training networks for compressive-sensing and blind deconvolution, considering both non-blind and blind training for the latter. Our framework yields models that are nearly as accurate as those from fully supervised training, despite not having access to any ground-truth images.", "full_text": "Training Image Estimators\nwithout Image Ground-Truth\n\nZhihao Xia\n\nWashington University in St. Louis\n\n1 Brookings Dr., St. Louis, MO 63130\n\nzhihao.xia@wustl.edu\n\nAyan Chakrabarti\n\nWashington University in St. Louis\n\n1 Brookings Dr., St. 
Louis, MO 63130\n\nayan@wustl.edu\n\nAbstract\n\nDeep neural networks have been very successful in image estimation applications\nsuch as compressive-sensing and image restoration, as a means to estimate images\nfrom partial, blurry, or otherwise degraded measurements. These networks are\ntrained on a large number of corresponding pairs of measurements and ground-truth\nimages, and thus implicitly learn to exploit domain-speci\ufb01c image statistics. But\nunlike measurement data, it is often expensive or impractical to collect a large\ntraining set of ground-truth images in many application settings. In this paper, we\nintroduce an unsupervised framework for training image estimation networks, from\na training set that contains only measurements\u2014with two varied measurements per\nimage\u2014but no ground-truth for the full images desired as output. We demonstrate\nthat our framework can be applied for both regular and blind image estimation\ntasks, where in the latter case parameters of the measurement model (e.g., the\nblur kernel) are unknown: during inference, and potentially, also during training.\nWe evaluate our method for training networks for compressive-sensing and blind\ndeconvolution, considering both non-blind and blind training for the latter. Our\nunsupervised framework yields models that are nearly as accurate as those from\nfully supervised training, despite not having access to any ground-truth images.\n\n1\n\nIntroduction\n\nReconstructing images from imperfect observations is a classic inference task in many imaging\napplications. 
In compressive sensing [8], a sensor makes partial measurements for ef\ufb01cient acquisition.\nThese measurements correspond to a low-dimensional projection of the higher-dimensional image\nsignal, and the system relies on computational inference for recovering the full-dimensional image.\nIn other cases, cameras capture degraded images that are low-resolution, blurry, etc., and require a\nrestoration algorithm [10, 29, 34] to recover a corresponding un-corrupted image. Deep convolutional\nneural networks (CNNs) have recently emerged as an effective tool for such image estimation\ntasks [4, 6, 7, 12, 27, 30, 31]. Speci\ufb01cally, a CNN for a given application is trained on a large dataset\nthat consists of pairs of ground-truth images and observed measurements (in many cases where\nthe measurement or degradation process is well characterized, having a set of ground-truth images\nis suf\ufb01cient to generate corresponding measurements). This training set allows the CNN to learn\nto exploit the expected statistical properties of images in that application domain, to solve what is\nessentially an ill-posed inverse problem.\nBut for many domains, it is impractical or prohibitively expensive to capture full-dimensional or\nun-corrupted images, and construct such a large representative training set. Unfortunately, it is often\nin such domains that a computational imaging solution is most useful. Recently, Lehtinen et al. [14]\nproposed a solution to this issue for denoising, with a method that trains with only pairs of noisy\nobservations. While their method yields remarkably accurate network models without needing any\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fFigure 1: Unsupervised Training from Measurements. Our method allows training image estima-\ntion networks f (\u00b7) from sets of pairs of varied measurements, but without the underlying ground-truth\nimages. 
(Top Right) We supervise training by requiring that network predictions from one measure-\nment be consistent with the other, when measured with the corresponding parameter. (Bottom) In the\nblind training setting, when both the image and measurement parameters are unavailable, we also\ntrain a parameter estimator g(\u00b7). Here, we generate a proxy training set from the predictions of the\nmodel (as it is training), and use synthetic measurements from these proxies to supervise training of\nthe parameter estimator g(\u00b7), and augment training of the image estimator f (\u00b7).\n\nground-truth images for training, it is applicable only to the speci\ufb01c case of estimation from noisy\nmeasurements\u2014when each image intensity is observed as a sample from a (potentially unknown)\ndistribution with mean or mode equal to its corresponding true value.\nIn this work, we introduce an unsupervised method for training image estimation networks that can\nbe applied to a general class of observation models\u2014where measurements are a linear function of the\ntrue image, potentially with additive noise. As training data, it only requires two observations for the\nsame image but not the underlying image itself1. The two measurements in each pair are made with\ndifferent parameters (such as different compressive measurement matrices or different blur kernels),\nand these parameters vary across different pairs. Collecting such a training set provides a practical\nalternative to the more laborious one of collecting full image ground-truth. 
Given these measurements,\nour method trains an image estimation network by requiring that its prediction from one measurement\nof a pair be consistent with the other measurement, when observed with the corresponding parameter.\nWith suf\ufb01cient diversity in measurement parameters for different training pairs, we show this is\nsuf\ufb01cient to train an accurate network model despite lacking direct ground-truth supervision.\nWhile our method requires knowledge of the measurement model (e.g., blur by convolution), it also\nincorporates a novel mechanism to handle the blind setting during training\u2014when the measurement\nparameters (e.g., the blur kernels) for training observations are unknown. To be able to enforce\nconsistency as above, we use an estimator for measurement parameters that is trained simultaneously\nusing a \u201cproxy\u201d training set. This set is created on-the-\ufb02y by taking predictions from the image\nnetwork even as it trains, and pairing them with observations synthetically created using randomly\nsampled, and thus known, parameters. The proxy set provides supervision for training the parameter\nestimator, and to augment training of the image estimator as well. This mechanism allows our method\nto nearly match the accuracy of fully supervised training on image and parameter ground-truth.\nWe validate our method with experiments on image reconstruction from compressive measurements\nand on blind deblurring of face images, with blind and non-blind training for the latter, and compare to\nfully-supervised baselines with state-of-the-art performance. The supervised baselines use a training\nset of ground-truth images and generate observations with random parameters on the \ufb02y in each\nepoch, to create a much larger number of effective image-measurement pairs. 
In contrast, our method is trained with only two measurements per image from the same training set (but not the image itself), with the pairs kept fixed through all epochs of training. Despite this, our unsupervised training method yields models with test accuracy close to that of the supervised baselines, and thus presents a practical way to train CNNs for image estimation when lacking access to image ground truth.\n\n¹Note that at test time, the trained network only requires one observation as input as usual.\n\n2 Related Work\n\nCNN-based Image Estimation. Many imaging tasks require inverting the measurement process to obtain a clean image from the partial or degraded observations: denoising [3], deblurring [29], super-resolution [10], compressive sensing [8], etc. While traditionally solved using statistical image priors [9, 25, 34], CNN-based estimators have been successfully employed for many of these tasks. Most methods [4, 6, 7, 12, 22, 27, 30, 31] learn a network to map measurements to corresponding images from a large training set of pairs of measurements and ideal ground-truth images. Some learn CNN-based image priors, as denoisers [5, 23, 31] or GANs [1], that are agnostic to the inference task (denoising, deblurring, etc.) but still tailored to a chosen class of images. All these methods require access to a large domain-specific dataset of ground-truth images for training. However, capturing image ground-truth is burdensome or simply infeasible in many settings (e.g., for MRI scans [18] and other biomedical imaging applications). In such settings, our method provides a practical alternative by allowing estimation networks to be trained from measurement data alone.\nUnsupervised Learning. Unsupervised learning for CNNs is broadly useful in many applications where large-scale training data is hard to collect. 
Accordingly, researchers have proposed unsu-\npervised and weakly-supervised methods for such applications, such as depth estimation [11, 32],\nintrinsic image decomposition [16, 19], etc. However, these methods are closely tied to their speci\ufb01c\napplications. In this work, we seek to enable unsupervised learning for image estimation networks.\nIn the context of image modeling, Bora et al. [2] propose a method to learn a GAN model from only\ndegraded observations. Their method, like ours, includes a measurement model with its discriminator\nfor training (but requires knowledge of measurement parameters, while we are able to handle the\nblind setting). Their method proves successful in training a generator for ideal images. We seek a\nsimilar unsupervised means for training image reconstruction and restoration networks.\nThe closest work to ours is the recent Noise2Noise method of Lehtinen et al. [14], who propose an\nunsupervised framework for training denoising networks by training on pairs of noisy observations\nof the same image. In their case, supervision comes from requiring the denoised output from one\nobservation be close to the other. This works surprisingly well, but is based on the assumption that the\nexpected or median value of the noisy observations is the image itself. We focus on a more general\nclass of observation models, which requires injecting the measurement process in loss computation.\nWe also introduce a proxy training approach to handle blind image estimation applications.\nAlso related are the works of Metzler et al. [21] and Zhussip et al. [33], that use Stein\u2019s unbiased\nrisk estimator for unsupervised training from only measurement data, for applications in compressive\nsensing. However, these methods are speci\ufb01c to estimators based on D-AMP estimation [20],\nsince they essentially train denoiser networks for use in unrolled AMP iterations for recovery from\ncompressive measurements. 
In contrast, ours is a more general framework that can be used to train generic neural network estimators.\n\n3 Proposed Approach\n\nGiven a measurement y ∈ R^M of an ideal image x ∈ R^N, related as\n\ny = θ x + ε,   (1)\n\nour goal is to train a CNN to produce an estimate x̂ of the image from y. Here, ε ∼ p_ε is random noise with distribution p_ε(·) that is assumed to be zero-mean and independent of the image x, and the parameter θ is an M × N matrix that models the linear measurement operation. Often, the measurement matrix θ is structured with fewer than MN degrees of freedom based on the measurement model: e.g., it is block-Toeplitz for deblurring, with entries defined by the blur kernel. We consider both non-blind estimation, when the measurement parameter θ is known for a given measurement during inference, and the blind setting, where θ is unavailable but we know the distribution p_θ(·). For blind estimators, we address both non-blind and blind training: when θ is known for each measurement in the training set but not at test time, and when it is unknown during training as well.\nSince (1) is typically non-invertible, image estimation requires reasoning with the statistical distribution p_x(·) of images for the application domain, and conventionally, this is provided by a large training set of typical ground-truth images x. In particular, CNN-based image estimation methods train a network f : y → x̂ on a large training set {(x_t, y_t)}_{t=1}^{T} of pairs of corresponding images and measurements, based on a loss that measures the error ρ(x̂_t − x_t) between predicted and true images across the training set. 
In the non-blind setting, the measurement parameter θ is known and provided as input to the network f (we omit this in the notation for convenience), while in the blind setting, the network must also reason about the unknown measurement parameter θ.\nTo avoid the need for a large number of ground-truth training images, we propose an unsupervised learning method that is able to train an image estimation network using measurements alone. Specifically, we assume we are given a training set of two measurements (y_{t:1}, y_{t:2}) for each image x_t:\n\ny_{t:1} = θ_{t:1} x_t + ε_{t:1},   y_{t:2} = θ_{t:2} x_t + ε_{t:2},   (2)\n\nbut not the images {x_t} themselves. We require the corresponding measurement parameters θ_{t:1} and θ_{t:2} to be different for each pair, and further, to also vary across different training pairs. These parameters are assumed to be known for the non-blind training setting, but not for blind training.\n\n3.1 Unsupervised Training for Non-Blind Image Estimation\n\nWe begin with the simpler case of non-blind estimation, when the parameter θ for a given measurement y is known, both during inference and training. 
Given pairs of measurements with known parameters, our method trains the network f(·) using a "swap-measurement" loss on each pair, defined as:\n\nL_swap = (1/T) Σ_t [ ρ(θ_{t:2} f(y_{t:1}) − y_{t:2}) + ρ(θ_{t:1} f(y_{t:2}) − y_{t:1}) ].   (3)\n\nThis loss evaluates the accuracy of the full images predicted by the network from each measurement in a pair, by comparing it to the other measurement, using an error function ρ(·), after simulating observation with the corresponding measurement parameter. Note that Noise2Noise [14] can be seen as a special case of (3) for measurements that are degraded only by noise, with θ_{t:1} = θ_{t:2} = I.\nWhen the parameters θ_{t:1}, θ_{t:2} used to acquire the training set are sufficiently diverse and statistically independent for each underlying x_t, this loss provides sufficient supervision to train the network f(·). To see this, we consider using the L2 distance for the error function ρ(z) = ‖z‖², and note that (3) represents an empirical approximation of the expected loss over image, parameter, and noise distributions. Assuming the training measurement pairs are obtained using (2) with x_t ∼ p_x, θ_{t:1}, θ_{t:2} ∼ p_θ, and ε_{t:1}, ε_{t:2} ∼ p_ε drawn i.i.d. from their respective distributions, we have\n\nL_swap ≈ 2 E_{x∼p_x} E_{θ_1,θ_2∼p_θ} E_{ε_1,ε_2∼p_ε} ‖θ_2 f(θ_1 x + ε_1) − (θ_2 x + ε_2)‖²\n       = 2σ_ε² + 2 E_{x∼p_x} E_{θ∼p_θ} E_{ε∼p_ε} (f(θx + ε) − x)^T Q (f(θx + ε) − x),   Q = E_{θ'∼p_θ}[θ'^T θ'].   (4)\n\nTherefore, because the measurement matrices are independent, we find that in expectation the swap-measurement loss is equivalent to supervised training against the true image x, with an L2 loss that is weighted by the N × N matrix Q (up to an additive constant given by the noise variance). When the matrix Q is full-rank, the swap-measurement loss will provide supervision along all image dimensions, and will reach its theoretical minimum (2σ_ε²) iff the network makes exact predictions.\nThe requirement that Q be full-rank implies that the distribution p_θ of measurement parameters must be sufficiently diverse, such that the full set of parameters {θ} used for training measurements together spans the entire domain R^N of full images. Therefore, even though measurements made by individual θ, and even pairs (θ_{t:1}, θ_{t:2}), are incomplete, our method relies on the fact that the full set of measurement parameters used during training is complete. Indeed, for Q to be full-rank, it is important that there be no systematic deficiency in p_θ (e.g., no vector direction in R^N left unobserved by all measurement parameters used in training). Also note that while we derived (4) for the L2 loss, the argument applies to any error function ρ(·) that is minimized only when its input is 0.\nIn addition to the swap loss, we also find it useful to train with an additional "self-measurement" loss that measures consistency between an image prediction and its own corresponding input measurement:\n\nL_self = (1/T) Σ_t [ ρ(θ_{t:1} f(y_{t:1}) − y_{t:1}) + ρ(θ_{t:2} f(y_{t:2}) − y_{t:2}) ].   (5)\n\nWhile not sufficient by itself, we find the additional supervision it provides to be practically useful in yielding more accurate network models, since it provides more direct supervision for each training sample. Therefore, our overall unsupervised training objective is a weighted combination of the two losses, L_swap + γ L_self, with the weight γ chosen on a validation set.\n\n3.2 Unsupervised Training for Blind Image Estimation\n\nWe next consider the more challenging case of blind estimation, when the measurement parameter θ for an observation y is unknown, and specifically, the blind training setting, when it is unknown even during training. The blind training setting complicates the use of our unsupervised losses in (3) and (5), since the values of θ_{t:1} and θ_{t:2} used there are unknown. Also, blind estimation tasks often have a more diverse set of possible parameters θ. 
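To make the non-blind objective of Section 3.1 concrete before turning to the blind setting, here is a minimal numpy sketch (ours, not the paper's implementation) of the swap- and self-measurement losses of (3) and (5). The pseudo-inverse `f` is only a stand-in for the trained image network, and all sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 16, 8          # image and measurement dimensions (toy sizes)

def f(y, theta):
    # Stand-in "image estimator": minimum-norm least-squares reconstruction.
    # In the paper, f is a trained CNN (with theta as an extra input in the
    # non-blind setting); the losses below only need f's predictions.
    return np.linalg.pinv(theta) @ y

def rho(z):
    return float(np.sum(z ** 2))   # L2 error, rho(z) = ||z||^2

def swap_self_losses(pairs):
    """Eqs. (3) and (5): `pairs` is a list of ((y1, th1), (y2, th2))."""
    L_swap = L_self = 0.0
    for (y1, th1), (y2, th2) in pairs:
        x1, x2 = f(y1, th1), f(y2, th2)
        L_swap += rho(th2 @ x1 - y2) + rho(th1 @ x2 - y1)   # swap-measurement
        L_self += rho(th1 @ x1 - y1) + rho(th2 @ x2 - y2)   # self-measurement
    return L_swap / len(pairs), L_self / len(pairs)

# Toy training set per Eq. (2): two noiseless measurements of each image,
# taken with different randomly drawn measurement matrices.
pairs = []
for _ in range(4):
    x = rng.standard_normal(N)
    th1, th2 = rng.standard_normal((M, N)), rng.standard_normal((M, N))
    pairs.append(((th1 @ x, th1), (th2 @ x, th2)))

L_swap, L_self = swap_self_losses(pairs)
total = L_swap + 0.05 * L_self    # overall objective L_swap + gamma * L_self
```

Note how, with this pseudo-inverse stand-in, the self loss is trivially near zero (each prediction exactly reproduces its own measurement) while the swap loss remains positive: it is the swap term that carries the real supervision.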
While supervised training methods with access to ground-truth images can generate a very large database of synthetic image-measurement pairs by pairing the same image with many different θ (assuming p_θ(·) is known), our unsupervised framework has access only to two measurements per image.\nHowever, in many blind estimation applications (such as deblurring), the parameter θ has comparatively limited degrees of freedom and the distribution p_θ(·) is known. Consequently, it is feasible to train estimators for θ from an observation y with sufficient supervision. With these assumptions, we propose a "proxy training" approach for unsupervised training of blind image estimators. This approach treats estimates from our network during training as a source of image ground-truth to train an estimator g : y → θ̂ for measurement parameters. We use the image network's predictions to construct synthetic observations as:\n\nx+_{t:i} ← f(y_{t:i}),   θ+_{t:i} ∼ p_θ,   ε+_{t:i} ∼ p_ε,   y+_{t:i} = θ+_{t:i} x+_{t:i} + ε+_{t:i},   for i ∈ {1, 2},   (6)\n\nwhere θ+_{t:i} and ε+_{t:i} are sampled on the fly from the parameter and noise distributions, and ← indicates an assignment with a "stop-gradient" operation (to prevent loss gradients on the proxy images from affecting the image estimator f(·)). We use these synthetic observations y+_{t:i}, with known sampled parameters θ+_{t:i}, to train the parameter estimation network g(·) based on the loss:\n\nL_prox:θ = (1/T) Σ_t Σ_{i=1}^{2} ρ(g(y+_{t:i}) − θ+_{t:i}).   (7)\n\nAs the parameter network g(·) trains with augmented data, we simultaneously use it to compute estimates of parameters for the original observations, θ̂_{t:i} ← g(y_{t:i}) for i ∈ {1, 2}, and compute the swap- and self-measurement losses in (3) and (5) on the original observations using these estimated, instead of true, parameters. Notice that we use a stop-gradient here as well, since we do not wish to train the parameter estimator g(·) based on the swap- or self-measurement losses: the behavior observed in (4) no longer holds in this case, and we empirically observe that removing the stop-gradient leads to instability and often causes training to fail.\nIn addition to training the parameter estimator g(·), the proxy training data in (6) can be used to augment training for the image estimator f(·), now with full supervision from the proxy images as:\n\nL_prox:x = (1/T) Σ_t Σ_{i=1}^{2} ρ(f(y+_{t:i}) − x+_{t:i}).   (8)\n\nThis loss can be used even in the non-blind training setting, and provides a means of generating additional training data with more pairings of image and measurement parameters. Also note that although our proxy images x+_{t:i} are approximate estimates of the true images, they represent the ground-truth for the synthetically generated observations y+_{t:i}. Hence, the losses L_prox:θ and L_prox:x are approximate only in the sense that they are based on images that are not sampled from the true image distribution p_x(·). 
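As an illustration, the proxy-set construction of (6) and the losses (7) and (8) can be sketched as follows. This is a schematic numpy mock-up of our own: the fixed linear maps `f` and `g` and the sampling routines are stand-ins for the trained networks and the true p_θ and p_ε, which is why the stop-gradient is only noted in a comment:

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 16, 8

# Stand-in estimators: in the paper f and g are CNNs trained jointly; here
# they are fixed maps, just to show how the proxy set and losses are formed.
W = 0.1 * rng.standard_normal((N, M))
f = lambda y: W @ y                  # blind image estimator, f(y) -> x_hat
theta_bar = rng.standard_normal((M, N))
g = lambda y: theta_bar              # parameter estimator, g(y) -> theta_hat

def rho(z):
    return float(np.sum(z ** 2))

def sample_theta():                  # stand-in for drawing theta+ ~ p_theta
    return theta_bar + 0.1 * rng.standard_normal((M, N))

def sample_noise():                  # stand-in for drawing eps+ ~ p_eps
    return 0.01 * rng.standard_normal(M)

def proxy_losses(ys):
    """Build the proxy set of Eq. (6) and compute Eqs. (7) and (8).
    In a deep-learning framework, x_plus would be detached (stop-gradient)."""
    Lp_theta = Lp_x = 0.0
    for y in ys:
        x_plus = f(y)                         # proxy "ground truth" image
        th_plus, eps_plus = sample_theta(), sample_noise()
        y_plus = th_plus @ x_plus + eps_plus  # synthetic observation, Eq. (6)
        Lp_theta += rho(g(y_plus) - th_plus)  # supervises g, Eq. (7)
        Lp_x += rho(f(y_plus) - x_plus)       # augments f's training, Eq. (8)
    return Lp_theta / len(ys), Lp_x / len(ys)

ys = [rng.standard_normal(M) for _ in range(4)]
Lp_theta, Lp_x = proxy_losses(ys)
```

The key design point is visible in the structure: the sampled θ+ is known exactly for each synthetic observation, so both losses have full supervision even though no true image is ever seen.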
And the effect of this approximation diminishes as training progresses and the image estimation network produces better image predictions (especially on the training set).\nOur overall method randomly initializes the weights of the image and parameter networks f(·) and g(·), and then trains them with a weighted combination of all losses, L_swap + γ L_self + α L_prox:θ + β L_prox:x, where the scalar weights α, β, γ are hyper-parameters determined on a validation set. For non-blind training (of blind estimators), only the image estimator f(·) needs to be trained, and α can be set to 0.\n\nTable 1: Performance (in PSNR dB) of various methods for compressive measurement reconstruction, on BSD68 and Set11 images for different compression ratios.\n\nMethod                                      | Supervised | BSD68 1% | BSD68 4% | BSD68 10% | Set11 1% | Set11 4% | Set11 10%\nTVAL3 [15]                                  | ✗          | -        | -        | -         | 16.43    | 18.75    | 22.99\nBM3D-AMP [20] (patch-wise)                  | ✗          | -        | -        | -         | 5.21     | 18.40    | 22.64\nBM3D-AMP [20] (full-image)                  | ✗          | -        | -        | -         | 5.59     | 17.18    | 23.07\nReconNet [12]                               | ✓          | -        | 21.66    | 24.15     | 17.27    | 20.63    | 24.28\nISTA-Net+ [30]                              | ✓          | 19.14    | 22.17    | 25.33     | 17.34    | 21.31    | 26.64\nSupervised Baseline (Ours)                  | ✓          | 19.74    | 22.94    | 25.57     | 17.88    | 22.61    | 26.74\nUnsupervised Training (Ours)                | ✗          | 19.67    | 22.78    | 25.40     | 17.84    | 22.20    | 26.33\nUnsupervised Training (Ours), w/o self-loss | ✗          | 19.59    | 22.73    | 25.32     | 17.80    | 22.10    | 26.16\n\n[Figure 2 shows reconstructions of two example images, with columns: Ground truth, ReconNet [12] (21.89 / 21.29 dB), ISTA-Net+ [30] (23.61 / 23.66 dB), Supervised Baseline (Ours) (24.34 / 24.37 dB), and Unsupervised Training (Ours) (24.03 / 24.17 dB).]\nFigure 2: Images reconstructed by various methods from compressive measurements (at 10% ratio).\n\n4 Experiments\n\nWe evaluate our framework on two well-established tasks: non-blind image reconstruction from compressive measurements, and blind deblurring of face images. 
These tasks were chosen since large training sets of ground-truth images are available in both cases, which allows us to demonstrate the effectiveness of our approach through comparisons to fully supervised baselines. The source code of our implementation is available at https://projects.ayanc.org/unsupimg/.\n\n4.1 Reconstruction from Compressive Measurements\n\nWe consider the task of training a CNN to reconstruct images from compressive measurements. We follow the measurement model of [12, 30], where all non-overlapping 33 × 33 patches in an image are measured individually by the same low-dimensional orthonormal matrix. Like [12, 30], we train CNN models that operate on individual patches at a time, and assume ideal observations without noise (the supplementary includes additional results for noisy measurements). We train models for compression ratios of 1%, 4%, and 10% (using the corresponding matrices provided by [12]).\nWe generate a training and validation set, of 100k and 256 images respectively, by taking 363 × 363 crops from images in the ImageNet database [26]. We use a CNN architecture that stacks two U-Nets [24], with a residual connection between the two (see supplementary). We begin by training our architecture with full supervision, using all overlapping patches from the training images, and an L2 loss between the network's predictions and the ground-truth image patches. For unsupervised training with our approach, we create two partitions of the original image, each containing non-overlapping patches. The partitions themselves overlap, with patches in one partition being shifted from those in the other (see supplementary). We measure patches in both partitions with the same measurement matrix, to yield two sets of measurements. These provide the diversity required by our method, as each pixel is measured with a different patch in the two partitions. 
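For illustration, the per-patch measurement scheme with two shifted partitions can be sketched as below. This is our own mock-up, not the paper's code: the random orthonormal matrix, the helper `measure_partition`, and the specific shift amounts are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
P = 33                          # patch size used in [12, 30]
ratio = 0.10                    # 10% compression ratio
M = int(round(ratio * P * P))   # measurements per patch

# Random matrix with orthonormal rows, as a stand-in for the fixed
# measurement matrices provided by [12].
A = np.linalg.qr(rng.standard_normal((P * P, M)))[0].T   # shape (M, P^2)

def measure_partition(img, dy, dx):
    """Measure every non-overlapping P x P patch of the partition of `img`
    whose patch grid is offset by (dy, dx)."""
    H, W = img.shape
    ys = []
    for i in range(dy, H - P + 1, P):
        for j in range(dx, W - P + 1, P):
            patch = img[i:i + P, j:j + P].reshape(-1)
            ys.append(A @ patch)            # y = A x for each patch
    return np.stack(ys)

img = rng.random((99, 99))                  # toy 3x3-patch image
y1 = measure_partition(img, 0, 0)           # first partition
y2 = measure_partition(img, 7, 11)          # second, shifted partition
```

Because each pixel falls inside a different patch in the two partitions, the pair (y1, y2) provides exactly the measurement diversity that the swap loss needs.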
Moreover, this measurement scheme can simply be implemented in practice by camera translation. The shifts for each image are randomly selected, but kept fixed throughout training. Since the network operates independently on patches, it can be used on measurements from both partitions. To compute the swap-measurement loss, we take the network's individual patch predictions from one partition, arrange them to form the image, and extract and then apply the measurement matrix to the shifted patches corresponding to the other partition. The weight γ for the self-measurement loss is set to 0.05 based on the validation set.\nIn Table 1, we report results for existing compressive sensing methods that use supervised training [12, 30], as well as two methods that do not require any training [15, 20]. We report numbers for these methods from the evaluation in [30] that, like us, reconstructs each patch in an image individually. We also report results for the algorithm in [20] by running it on entire images (i.e., using the entire image for regularization while still using the per-patch measurement model). Note that [20] is a D-AMP-based estimator (and while slower, performs similarly to the learned D-AMP estimators proposed in [21, 33] as per their own evaluation).\nEvaluating our fully supervised baseline against these methods, we find that it achieves state-of-the-art performance. We then report results for training with our unsupervised framework, and find that this leads to accurate models that lag our supervised baseline by only 0.4 dB or less in terms of average PSNR on both test sets, and in most cases actually outperform previous methods. This is despite the fact that these models have been trained without any access to ground-truth images. 
In addition to our full unsupervised method with both the self- and swap-losses, Table 1 also contains an ablation without using the self-loss, which is found to lead to a slight drop in performance. Figure 2 provides example reconstructions for some images, and we find that results from our unsupervised method are extremely close in visual quality to those of the baseline model trained with full supervision.\n\n4.2 Blind Face Image Deblurring\n\nWe next consider the problem of blind motion deblurring of face images. Like [27], we consider the problem of restoring 128 × 128 aligned and cropped face images that have been affected by motion blur, through convolution with motion blur kernels of size up to 27 × 27, and Gaussian noise with a standard deviation of two gray levels. We use all 160k images in the CelebA training set [17] and 1.8k images from the Helen training set [13] to construct our training set, and 2k images from CelebA val and 200 from the Helen training set for our validation set. We use a set of 18k and 2k random motion kernels for training and validation respectively, generated using the method described in [4]. We evaluate our method on the official blurred test images provided by [27] (derived from the CelebA and Helen test sets). Note that unlike [27], we do not use any semantic labels for training.\nIn this case, we use a single U-Net architecture to map blurry observations to sharp images. We again train a model for this architecture with full supervision, generating blurry-sharp training pairs on the fly by pairing random blur kernels from the training set with the sharp images. Then, for unsupervised training with our approach, we choose two kernels for each training image to form a training set of measurement pairs that are kept fixed (including the added Gaussian noise) across all epochs of training. 
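The construction of blurred measurement pairs can be sketched as follows. Note this is an illustrative mock-up of our own: the paper samples realistic motion kernels with the method of [4], whereas `random_kernel` here is just a normalized random filter, and the small direct convolution is only meant to make the measurement model explicit:

```python
import numpy as np

rng = np.random.default_rng(3)

def conv2d_same(img, kernel):
    """Direct zero-padded 'same'-size linear blur (cross-correlation, i.e.
    convolution with a flipped kernel); fine for a small sketch."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(img, ((ph, ph), (pw, pw)))
    out = np.zeros_like(img)
    for dy in range(kh):
        for dx in range(kw):
            out += kernel[dy, dx] * padded[dy:dy + img.shape[0],
                                           dx:dx + img.shape[1]]
    return out

def random_kernel(size=9):
    # Illustrative nonnegative kernel normalized to sum 1 (a real motion
    # kernel from [4] would be sparse along a trajectory).
    k = rng.random((size, size))
    return k / k.sum()

def make_pair(sharp, k1, k2, sigma=2.0 / 255.0):
    """Two blurred, noisy observations of one sharp image, per Eq. (2);
    sigma corresponds to a noise std. of two gray levels."""
    y1 = conv2d_same(sharp, k1) + sigma * rng.standard_normal(sharp.shape)
    y2 = conv2d_same(sharp, k2) + sigma * rng.standard_normal(sharp.shape)
    return y1, y2

sharp = rng.random((128, 128))
y1, y2 = make_pair(sharp, random_kernel(), random_kernel())
```

In the actual training set, such a pair is generated once per image and then kept fixed, noise included, for all epochs.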
We first consider non-blind training, using the true blur kernels to compute the swap- and self-measurement losses. Here, we consider training with and without the proxy loss L_prox:x for the network. Then, we consider the blind training case, where we also learn an estimator for blur kernels and use its predictions to compute the measurement losses. Instead of training an entirely separate network, we share the initial layers with the image U-Net, and form a separate decoder path going from the bottleneck to the blur kernel. The weights α, β, γ are all set to one in this case.\nWe report results for all versions of our method in Table 2, and compare them to [27], as well as to a traditional deblurring method that is not trained on face images [28]. We find that with full supervision, our architecture achieves state-of-the-art performance. Then, with non-blind training, we find that our method is able to come close to supervised performance when using the proxy loss, but does worse without it, highlighting its utility even in the non-blind setting. Finally, we note that models derived using blind training with our approach are also able to produce results nearly as accurate as those trained with full supervision, despite lacking access both to ground-truth image data and to knowledge of the blur kernels in their training measurements. Figure 3 illustrates this performance qualitatively, with example deblurred results from various models on the official test images. We also visualize the blur kernel estimator learned during blind training with our approach in Fig. 4, on images from our validation set. Additional results, including those on real images, are included in the supplementary.\n\nTable 2: Performance of various methods on blind face deblurring on test images from [27].\n\nMethod                                            | Supervised | Helen PSNR | Helen SSIM | CelebA PSNR | CelebA SSIM\nXu et al. [28]                                    | ✗          | 20.11      | 0.711      | 18.93       | 0.685\nShen et al. [27]                                  | ✓          | 25.99      | 0.871      | 25.05       | 0.879\nSupervised Baseline (Ours)                        | ✓          | 26.13      | 0.886      | 25.20       | 0.892\nUnsupervised Non-blind (Ours)                     | ✗          | 25.95      | 0.878      | 25.09       | 0.885\nUnsupervised Non-blind (Ours), without proxy loss | ✗          | 25.47      | 0.867      | 24.64       | 0.873\nUnsupervised Blind (Ours)                         | ✗          | 25.93      | 0.876      | 25.06       | 0.883\n\n[Figure 3 shows deblurring results for four example images, with columns: Ground truth, Blurred input, Shen et al. [27] (22.69 / 26.83 / 26.59 / 22.36 dB), Supervised (Ours) (24.61 / 28.18 / 28.29 / 23.50 dB), Non-blind (Ours) (25.16 / 28.27 / 27.42 / 22.84 dB), and Blind (Ours) (25.19 / 28.16 / 26.77 / 22.94 dB).]\nFigure 3: Blind face deblurring results using various methods. Results from our unsupervised approach, with both non-blind and blind training, nearly match the quality of the supervised baseline.\n\n[Figure 4 shows, for validation images, Ground Truth, Blurred, and Prediction panels for both the image and the blur kernel.]\nFigure 4: Image and kernel predictions on validation images. We show outputs of our model's kernel estimator, which is learned as part of blind training to compute the swap- and self-measurement losses.\n\n5 Conclusion\n\nWe presented an unsupervised method to train image estimation networks from only measurement pairs, without access to ground-truth images, and, in blind settings, without knowledge of measurement parameters. 
We validated this approach on well-established tasks where sufficient ground-truth data (for natural and face images) was available, since this allowed us to compare to training with full supervision and to study the performance gap between the supervised and unsupervised settings. But we believe that our method's real utility will be in opening up the use of CNNs for image estimation to new domains, such as medical imaging and applications in astronomy, where such use has so far been infeasible due to the difficulty of collecting large ground-truth datasets.

Acknowledgments. This work was supported by the NSF under award no. IIS-1820693.

References

[1] Rushil Anirudh, Jayaraman J. Thiagarajan, Bhavya Kailkhura, and Timo Bremer. An unsupervised approach to solving inverse problems using generative adversarial networks. arXiv preprint arXiv:1805.07281, 2018.

[2] Ashish Bora, Eric Price, and Alexandros G. Dimakis. AmbientGAN: Generative models from lossy measurements. In International Conference on Learning Representations (ICLR), 2018.

[3] Antoni Buades, Bartomeu Coll, and J.-M. Morel. A non-local algorithm for image denoising. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), volume 2, pages 60–65. IEEE, 2005.

[4] Ayan Chakrabarti. A neural approach to blind motion deblurring. In European Conference on Computer Vision, pages 221–235. Springer, 2016.

[5] Jen-Hao Rick Chang, Chun-Liang Li, Barnabas Poczos, B. V. K. Vijaya Kumar, and Aswin C. Sankaranarayanan. One network to solve them all: Solving linear inverse problems using deep projection models. In Proc. ICCV, 2017.

[6] Yunjin Chen and Thomas Pock. Trainable nonlinear reaction diffusion: A flexible framework for fast and effective image restoration.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6):1256–1272, 2017.

[7] Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. Image super-resolution using deep convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(2):295–307, 2015.

[8] David L. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289–1306, 2006.

[9] Mário A. T. Figueiredo, Robert D. Nowak, and Stephen J. Wright. Gradient projection for sparse reconstruction: Application to compressed sensing and other inverse problems. IEEE Journal of Selected Topics in Signal Processing, 1(4):586–597, 2007.

[10] William T. Freeman, Thouis R. Jones, and Egon C. Pasztor. Example-based super-resolution. IEEE Computer Graphics and Applications, (2):56–65, 2002.

[11] Clément Godard, Oisin Mac Aodha, and Gabriel J. Brostow. Unsupervised monocular depth estimation with left-right consistency. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 270–279, 2017.

[12] Kuldeep Kulkarni, Suhas Lohit, Pavan Turaga, Ronan Kerviche, and Amit Ashok. ReconNet: Non-iterative reconstruction of images from compressively sensed measurements. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 449–458, 2016.

[13] Vuong Le, Jonathan Brandt, Zhe Lin, Lubomir Bourdev, and Thomas S. Huang. Interactive facial feature localization. In ECCV, 2012.

[14] Jaakko Lehtinen, Jacob Munkberg, Jon Hasselgren, Samuli Laine, Tero Karras, Miika Aittala, and Timo Aila. Noise2Noise: Learning image restoration without clean data. arXiv preprint arXiv:1803.04189, 2018.

[15] Chengbo Li, Wotao Yin, Hong Jiang, and Yin Zhang. An efficient augmented Lagrangian method with applications to total variation minimization.
Computational Optimization and Applications, 56(3):507–530, 2013.

[16] Zhengqi Li and Noah Snavely. Learning intrinsic image decomposition from watching the world. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9039–9048, 2018.

[17] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of the International Conference on Computer Vision (ICCV), 2015.

[18] Michael Lustig, David L. Donoho, Juan M. Santos, and John M. Pauly. Compressed sensing MRI. IEEE Signal Processing Magazine, 25(2):72, 2008.

[19] Wei-Chiu Ma, Hang Chu, Bolei Zhou, Raquel Urtasun, and Antonio Torralba. Single image intrinsic decomposition without a single intrinsic image. In Proceedings of the European Conference on Computer Vision (ECCV), pages 201–217, 2018.

[20] Christopher A. Metzler, Arian Maleki, and Richard G. Baraniuk. From denoising to compressed sensing. IEEE Transactions on Information Theory, 62(9):5117–5144, 2016.

[21] Christopher A. Metzler, Ali Mousavi, Reinhard Heckel, and Richard G. Baraniuk. Unsupervised learning with Stein's unbiased risk estimator. arXiv preprint arXiv:1805.10531, 2018.

[22] Seungjun Nah, Tae Hyun Kim, and Kyoung Mu Lee. Deep multi-scale convolutional neural network for dynamic scene deblurring. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 257–265, 2017.

[23] Yaniv Romano, Michael Elad, and Peyman Milanfar. The little engine that could: Regularization by denoising (RED). SIAM Journal on Imaging Sciences, 10(4):1804–1844, 2017.

[24] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, 2015.

[25] Stefan Roth and Michael J. Black. Fields of experts.
IJCV, 2009.

[26] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 2015.

[27] Ziyi Shen, Wei-Sheng Lai, Tingfa Xu, Jan Kautz, and Ming-Hsuan Yang. Deep semantic face deblurring. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8260–8269, 2018.

[28] Li Xu, Shicheng Zheng, and Jiaya Jia. Unnatural L0 sparse representation for natural image deblurring. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1107–1114, 2013.

[29] Lu Yuan, Jian Sun, Long Quan, and Heung-Yeung Shum. Image deblurring with blurred/noisy image pairs. ACM Transactions on Graphics (TOG), 26(3):1, 2007.

[30] Jian Zhang and Bernard Ghanem. ISTA-Net: Interpretable optimization-inspired deep network for image compressive sensing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1828–1837, 2018.

[31] Kai Zhang, Wangmeng Zuo, Shuhang Gu, and Lei Zhang. Learning deep CNN denoiser prior for image restoration. In Proc. CVPR, 2017.

[32] Tinghui Zhou, Matthew Brown, Noah Snavely, and David G. Lowe. Unsupervised learning of depth and ego-motion from video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1851–1858, 2017.

[33] Magauiya Zhussip, Shakarim Soltanayev, and Se Young Chun. Training deep learning based image denoisers from undersampled measurements without ground truth and without image prior. In Proc. CVPR, 2019.

[34] Daniel Zoran and Yair Weiss. From learning models of natural image patches to whole image restoration. In 2011 International Conference on Computer Vision, pages 479–486.
IEEE, 2011.