{"title": "Training deep learning based denoisers without ground truth data", "book": "Advances in Neural Information Processing Systems", "page_first": 3257, "page_last": 3267, "abstract": "Recently developed deep-learning-based denoisers often outperform state-of-the-art conventional denoisers, such as the BM3D. They are typically trained to minimize the mean squared error (MSE) between the output image of a deep neural network and a ground truth image. In deep learning based denoisers, it is important to use high quality noiseless ground truth data for high performance, but it is often challenging or even infeasible to obtain noiseless images in application areas such as hyperspectral remote sensing and medical imaging. In this article, we propose a method based on Stein\u2019s unbiased risk estimator (SURE) for training deep neural network denoisers only based on the use of noisy images. We demonstrate that our SURE-based method, without the use of ground truth data, is able to train deep neural network denoisers to yield performances close to those networks trained with ground truth, and to outperform the state-of-the-art denoiser BM3D. Further improvements were achieved when noisy test images were used for training of denoiser networks using our proposed SURE-based method.", "full_text": "Training deep learning based denoisers\n\nwithout ground truth data\n\nShakarim Soltanayev\n\nSe Young Chun\n\nUlsan National Institute of Science and Technology (UNIST), Republic of Korea\n\nDepartment of Electrical Engineering\n\n{shakarim,sychun}@unist.ac.kr\n\nAbstract\n\nRecently developed deep-learning-based denoisers often outperform state-of-the-art\nconventional denoisers, such as the BM3D. They are typically trained to minimize\nthe mean squared error (MSE) between the output image of a deep neural network\nand a ground truth image. 
In deep learning based denoisers, it is important to\nuse high quality noiseless ground truth data for high performance, but it is often\nchallenging or even infeasible to obtain noiseless images in application areas such\nas hyperspectral remote sensing and medical imaging. In this article, we propose a\nmethod based on Stein\u2019s unbiased risk estimator (SURE) for training deep neural\nnetwork denoisers only based on the use of noisy images. We demonstrate that our\nSURE-based method, without the use of ground truth data, is able to train deep\nneural network denoisers to yield performances close to those networks trained\nwith ground truth, and to outperform the state-of-the-art denoiser BM3D. Further\nimprovements were achieved when noisy test images were used for training of\ndenoiser networks using our proposed SURE-based method. Code is available at\nhttps://github.com/Shakarim94/Net-SURE.\n\n1\n\nIntroduction\n\nDeep learning has been successful in various high-level computer vision tasks [1], such as image\nclassi\ufb01cation [2, 3], object detection [4, 5], and semantic segmentation [6, 7]. Deep learning has also\nbeen investigated for low-level computer vision tasks, such as image denoising [8, 9, 10, 11, 12],\nimage inpainting [13], and image restoration [14, 15, 16].\nIn particular, image denoising is a\nfundamental computer vision task that yields images with reduced noise, and improves the execution\nof other tasks, such as image classi\ufb01cation [8] and image restoration [16].\nDeep learning based image denoisers [9, 11, 12] have yielded performances that are equivalent\nto or better than those of conventional state-of-the-art denoising techniques such as BM3D [17].\nThese deep denoisers typically train their networks by minimizing the mean-squared error (MSE)\nbetween the output of a network and the ground truth (noiseless) image. Thus, it is crucial to have\nhigh quality noiseless images for high performance deep learning denoisers. 
Thus far, deep neural\nnetwork denoisers have been successful since high quality camera sensors and abundant light allow\nthe acquisition of high quality, almost noiseless 2D images in daily environment tasks. Acquiring\nsuch high quality photographs is quite cheap with the use of smart phones and digital cameras.\nHowever, it is challenging to apply currently developed deep learning based image denoisers with\nminimum MSE to some application areas, such as hyperspectral remote sensing and medical imaging,\nwhere the acquisition of noiseless ground truth data is expensive, or sometimes even infeasible. For\nexample, hyperspectral imaging contains hundreds of spectral information per pixel, often leading to\nincreased noise in hyperspectral imaging sensors [18]. Long acquisitions may improve image quality,\nbut it is challenging to perform them with spaceborne or airborne hyperspectral imaging. Similarly,\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fin medical imaging, ultra high resolution 3D MRI (sub-millimeter resolution) often requires several\nhours of acquisition time for a single, high quality volume, but reducing acquisition time leads to\nincreased noise. In X-ray CT, image noise can be substantially reduced by increasing the radiation\ndose. Recent studies on deep learning based image denoisers [19, 20] used CT images generated\nwith normal doses as the ground truth so that denoising networks would be able to be trained to\nyield excellent performance. However, increased radiation dose leads to harmful effects in scanned\nsubjects, while excessively high doses may saturate the CT detectors (e.g., in a similar manner to the\nacquisition of a photo of the sun without any \ufb01lter). 
Thus, acquiring ground truth data with newly\ndeveloped CT scanners seems challenging without compromising the subjects\u2019 safety.\nConventional denoising methods do not usually require noiseless ground truth images to perform\ndenoising, but often require them for tuning parameters of image \ufb01lters to elicit the best possible\nresults (minimum MSE). In order to identify the optimal parameters of conventional denoisers\nwithout ground truth data, several works have been conducted [21, 22] using Stein\u2019s unbiased risk\nestimator (SURE) [23], which is an unbiased MSE estimator. For the popular non-local means (NLM)\n\ufb01lter [24], the analytical form of SURE was used to optimize the denoiser performance [25, 26, 27].\nFor denoisers whose analytical forms of SURE are not available, Ramani et al. [28] proposed a Monte-\nCarlo-based SURE (MC-SURE) method to determine near-optimal denoising parameters based on\nthe brute-force search of the parameter space. Deledalle et al. [29] investigated the approximation of\na weak gradient of SURE to optimize parameters using the quasi-Newton algorithm. However, since\nthis method requires the computation of the weak Jacobian, it is not applicable to high dimensional\nparameter spaces, such as deep neural networks.\nWe propose a SURE-based training method for deep neural network denoisers without ground\ntruth data. In Section 2, we review key results elicited from SURE and MC-SURE. Subsequently,\nin Section 3, we describe our proposed method using MC-SURE and a stochastic gradient for\ntraining deep learning based image denoisers. 
In Section 4, simulation results are presented for (a) a conventional state-of-the-art denoiser (BM3D), (b) a deep learning based denoiser trained with BM3D output as the ground truth, (c) the same deep neural network denoiser with the proposed SURE training without the ground truth, and (d) the same denoiser network trained with ground truth data as a reference. Section 5 concludes this article by discussing several potential issues for further studies.

2 Background

2.1 Stein's unbiased risk estimator

A signal (or image) with Gaussian noise can be modeled as

y = x + n    (1)

where x ∈ R^K is an unknown signal distributed according to x ∼ p(x), y ∈ R^K is a known measurement, n ∈ R^K is i.i.d. Gaussian noise such that n ∼ N(0, σ²I), and I is an identity matrix. We denote n ∼ N(0, σ²I) as n ∼ N_{0,σ²}. An estimator of x from y (or denoiser) can be defined as a function of y such that

h(y) = y + g(y)    (2)

where h, g are functions from R^K to R^K. Accordingly, the SURE for h(y) can be derived as follows,

η(h(y)) = σ² + (1/K)‖g(y)‖² + (2σ²/K) Σ_{i=1}^{K} ∂g_i(y)/∂y_i = (1/K)‖y − h(y)‖² − σ² + (2σ²/K) Σ_{i=1}^{K} ∂h_i(y)/∂y_i    (3)

where η : R^K → R and y_i is the ith element of y. For a fixed x, the following theorem holds:

Theorem 1 ([23, 30]). The random variable η(h(y)) is an unbiased estimator of MSE(h(y)), or

MSE(h(y)) = E_{n∼N_{0,σ²}} { (1/K)‖x − h(y)‖² } = E_{n∼N_{0,σ²}} { η(h(y)) }    (4)

where E_{n∼N_{0,σ²}}{·} is the expectation operator in terms of the random vector n. 
Note that in Theorem 1, x is treated as a fixed, deterministic vector.

In practice, σ² can be estimated [28] and ‖y − h(y)‖² only requires the output of the estimator (or denoiser). The last divergence term of (3) can be obtained analytically in some special cases, such as in linear or NLM filters [26]. However, it is challenging to calculate this term analytically for more general denoising methods.

2.2 Monte-Carlo Stein's unbiased risk estimator

Ramani et al. [28] introduced a fast Monte-Carlo approximation of the divergence term in (3) for general denoisers. For a fixed unknown true image x, the following theorem is valid:

Theorem 2 ([28]). Let ñ ∼ N_{0,1} ∈ R^K be independent of n, y. Then,

Σ_{i=1}^{K} ∂h_i(y)/∂y_i = lim_{ε→0} E_ñ { ñ^t ( (h(y + εñ) − h(y)) / ε ) }    (5)

provided that h(y) admits a well-defined second-order Taylor expansion. If not, this is still valid in the weak sense provided that h(y) is tempered.

Based on Theorem 2, the divergence term in (3) can be approximated by one realization of ñ ∼ N_{0,1} and a fixed small positive value ε:

(1/K) Σ_{i=1}^{K} ∂h_i(y)/∂y_i ≈ (1/(εK)) ñ^t (h(y + εñ) − h(y))    (6)

where t is the transpose operator. 
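The one-sample approximation in (6) amounts to a single extra forward pass through the denoiser. As a minimal illustration (a NumPy sketch of ours, not the authors' released TensorFlow code), the function below estimates the divergence for an arbitrary denoiser passed in as a Python callable; the linear shrinkage denoiser h(y) = a·y used in the sanity check is a hypothetical example whose exact divergence is aK.

```python
import numpy as np

def mc_divergence(h, y, eps=1e-4, seed=None):
    """Monte-Carlo estimate of sum_i dh_i(y)/dy_i as in Eq. (6):
    (1/eps) * n_tilde^t (h(y + eps*n_tilde) - h(y)),
    using a single realization n_tilde ~ N(0, I)."""
    rng = np.random.default_rng(seed)
    n_tilde = rng.standard_normal(y.shape)
    diff = h(y + eps * n_tilde) - h(y)  # one extra forward pass
    return float(n_tilde.ravel() @ diff.ravel()) / eps

# Sanity check: for the linear shrinkage denoiser h(y) = a*y the exact
# divergence is a*K, and the estimate concentrates around it for large K.
K, a = 4096, 0.7
y = np.random.default_rng(0).standard_normal(K)
est = mc_divergence(lambda v: a * v, y, seed=1)  # concentrates near a*K
```

For a linear h the estimate equals a·‖ñ‖², so its relative error shrinks as 1/√K; for nonlinear denoisers an additional O(ε) Taylor error appears, which is what motivates keeping ε small.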
This expression has been shown to yield accurate unbiased estimates of MSE for many conventional denoising methods h(y) [28].

3 Method

In this section, we develop our proposed MC-SURE-based method for training deep learning based denoisers without noiseless ground truth images, assuming the Gaussian noise model in (1).

3.1 Training denoisers using the stochastic gradient method

A typical risk for image denoisers with the signal generation model (1) is

E_{x∼p(x), n∼N_{0,σ²}} ‖x − h(y; θ)‖²    (7)

where h(y; θ) is a deep learning based denoiser parametrized with a large-scale vector θ. It is usually infeasible to calculate (7) exactly due to the expectation operator. Thus, the empirical risk for (7) is used as a cost function as follows:

(1/N) Σ_{j=1}^{N} ‖h(y(j); θ) − x(j)‖²    (8)

where {(x(1), y(1)), · · · , (x(N), y(N))} are the N pairs of a training dataset, sampled from the joint distribution of x(j) ∼ p(x) and n(j) ∼ N_{0,σ²}. Note that (8) is an unbiased estimator of (7).

To train the deep learning network h(y; θ) with respect to θ, a gradient-based optimization algorithm is used, such as stochastic gradient descent (SGD) [31], momentum, Nesterov momentum [32], or the Adam optimization algorithm [33]. For any gradient-based optimization method, it is essential to calculate the gradient of (7) with respect to θ as follows,

E_{x∼p(x), n∼N_{0,σ²}} 2∇_θ h(y; θ)^t (h(y; θ) − x).    (9)

Therefore, it is sufficient to calculate the gradient of the empirical risk (8) to approximate (9) for any gradient-based optimization.

In practice, calculating the gradient of (8) for large N is inefficient, since a small amount of well-shuffled training data can often approximate the gradient of (8) accurately. Thus, a mini-batch is typically used for efficient deep neural network training by calculating the mini-batch empirical risk as follows:

(1/M) Σ_{j=1}^{M} ‖h(y(j); θ) − x(j)‖²    (10)

where M is the size of one mini-batch. Equation (10) is still an unbiased estimator of (7) provided that the training data is randomly permuted every epoch.

3.2 Proposed training method for deep learning based denoisers

To incorporate MC-SURE into a stochastic gradient-based optimization algorithm for training, such as the SGD or the Adam optimization algorithms, we rewrite the risk (7) as

E_{x∼p(x)} [ E_{n∼N_{0,σ²}} ( ‖x − h(y; θ)‖² | x ) ]    (11)

where (11) is equivalent to (7) owing to conditioning.

From Theorem 1, an unbiased estimator for E_{n∼N_{0,σ²}} ( ‖x − h(y; θ)‖² | x ) can be derived as Kη(h(y; θ)) such that, for a fixed x,

E_{n∼N_{0,σ²}} ( ‖x − h(y; θ)‖² | x ) = E_{n∼N_{0,σ²}} ‖x − h(y; θ)‖² = K E_{n∼N_{0,σ²}} η(h(y; θ)).    (12)

Thus, using the empirical risk expression in (10), an unbiased estimator for (7) is

(1/M) Σ_{j=1}^{M} { ‖y(j) − h(y(j); θ)‖² − Kσ² + 2σ² Σ_{i=1}^{K} ∂h_i(y(j); θ)/∂y_i }    (13)

noting that no noiseless ground truth data x(j) were used in (13).

Finally, the last divergence term in (13) can be approximated using MC-SURE, so that the final unbiased risk estimator for (7) will be

(1/M) Σ_{j=1}^{M} { ‖y(j) − h(y(j); θ)‖² − Kσ² + (2σ²/ε) (ñ(j))^t ( h(y(j) + εñ(j); θ) − h(y(j); θ) ) }    (14)

where ε is a small
fixed positive number and ñ(j) is a single realization from the standard normal distribution for each training sample j. To ensure that the estimator (14) remains unbiased, the order of the y(j) should be randomly permuted and a new set of ñ(j) should be generated at every epoch.

A deep learning based image denoiser with the cost function (14) can be implemented in a deep learning development framework, such as TensorFlow [34], by properly defining the cost function; the gradient of (14) is then calculated automatically during training. One potential advantage of our SURE-based training method is that we can use all the available data without noiseless ground truth images. In other words, we can train denoising deep neural networks with both training and testing data. This advantage may further improve the performance of deep learning based denoisers.

Lastly, almost any deep neural network denoiser can utilize our MC-SURE-based training by modifying the cost function from (10) to (14), as long as it satisfies the condition in Theorem 2. Many deep learning based denoisers with differentiable activation functions (e.g., sigmoid) comply with this condition. Some denoisers with piecewise differentiable activation functions (e.g., ReLU) can still utilize Theorem 2 in the weak sense since

‖h(y; θ)‖ ≤ C0(1 + ‖y‖^{n0})

for some n0 > 1 and C0 > 0. 
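Concretely, the per-image (M = 1) cost inside (14) can be written in a few lines. The NumPy sketch below is our own illustration under the Gaussian model (1), not the released code; `sure_loss` and the identity denoiser used in the check are hypothetical names. For the identity mapping h(y) = y the true risk E‖x − h(y)‖² is exactly Kσ², which the returned value should approximate without ever seeing x.

```python
import numpy as np

def sure_loss(h, y, sigma, eps=1e-4, seed=None):
    """Single-image (M = 1) instance of the SURE cost in Eq. (14):
    ||y - h(y)||^2 - K*sigma^2
      + (2*sigma^2/eps) * n_tilde^t (h(y + eps*n_tilde) - h(y))."""
    rng = np.random.default_rng(seed)
    n_tilde = rng.standard_normal(y.shape)
    K = y.size
    fidelity = float(np.sum((y - h(y)) ** 2))
    div = float(n_tilde.ravel() @ (h(y + eps * n_tilde) - h(y)).ravel())
    return fidelity - K * sigma ** 2 + (2.0 * sigma ** 2 / eps) * div

# Check against a case with a known risk: for h(y) = y (identity),
# E||x - h(y)||^2 = K*sigma^2, and x is used only to synthesize y.
K, sigma = 100_000, 25.0
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 255.0, K)
y = x + sigma * rng.standard_normal(K)
val = sure_loss(lambda v: v, y, sigma, seed=1)  # concentrates near K*sigma**2
```

In a framework such as TensorFlow, the same expression written on tensors is differentiable with respect to θ, so its automatically computed gradient can drive SGD or Adam directly, as described above.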
Therefore, we expect that our proposed method should work for most deep learning image denoisers [8, 9, 10, 11, 12].

4 Simulation results

In this section, denoising simulation results are presented with the MNIST dataset using a simple stacked denoising autoencoder (SDA) [8], and a large-scale natural image dataset using a deep convolutional neural network (CNN) image denoiser (DnCNN) [11].

All of the networks presented in this section (denoted by NET, which can be either SDA or DnCNN) were trained using one of the following two optimization objectives: (MSE) the minimum MSE between a denoised image and its ground truth image in (10), and (SURE) our proposed minimum MC-SURE without ground truth in (14). NET-MSE methods generated noisy training images at every epoch in accordance with [11], while our proposed NET-SURE methods used only noisy images obtained before training. We also propose the SURE-T method, which utilizes noisy test images together with noisy training images, still without ground truth data. Table 1 summarizes all simulated configurations, including the conventional state-of-the-art image denoiser BM3D [17], which did not require any training or the use of any ground truth data. Code is available at https://github.com/Shakarim94/Net-SURE.

Table 1: Summary of simulated denoising methods. NET can be either SDA or DnCNN.

Method — Description
BM3D — Conventional state-of-the-art method
NET-BM3D — Optimizing MSE with BM3D output as ground truth
NET-SURE — Optimizing SURE without ground truth
NET-SURE-T — Optimizing SURE without ground truth, but with noisy test data
NET-MSE-GT — Optimizing MSE with ground truth

4.1 Results: MNIST dataset

We performed denoising simulations with the MNIST dataset. Noisy images were generated based on model (1) with two noise levels (σ = 25 and σ = 50).
For the experiments on\nthe MNIST dataset which comprised 28 \u00d7 28 pixels, a simple SDA network was chosen [8]. Decoder\nand encoder networks each consisted of two convolutional layers (kernel size 3 \u00d7 3) with sigmoid\nactivation functions, each of which had a stride of two (both conv and conv transposed). Thus, a\ntraining sample with a size of 28 \u00d7 28 is downsampled to 7 \u00d7 7, and then upsampled to 28 \u00d7 28.\nSDA was trained to output a denoised image using a set of 55,000 training and 5,000 validation\nimages. The performance of the model was tested with 100 images chosen randomly from the default\ntest set of 10,000 images. For all cases, SDA was trained with the Adam optimization algorithm [33]\nwith the learning rate of 0.001 for 100 epochs. The batch size was set to 200 (bigger batch sizes did\nnot improve the performance). The \u0001 value in (6) was set to 0.0001.\nOur proposed methods SDA-SURE, SDA-SURE-T yielded a comparable performance to SDA-MSE-\nGT (only 0.01-0.04 dB difference) and outperformed the conventional BM3D for all simulated noise\nlevels, \u03c3 = 25, 50, as shown in Table 2.\n\nTable 2: Results of denoisers for MNIST (performance in dB). Means of 10 experiments are reported.\n\nMethods BM3D SDA-REG SDA-SURE\n\nSDA-SURE-T\n\nSDA-MSE-GT\n\n\u03c3 = 25\n\u03c3 = 50\n\n27.53\n21.82\n\n25.07\n19.85\n\n28.35\n25.23\n\n28.39\n25.24\n\n28.35\n25.24\n\nFigure 1 illustrates the visual quality of the outputs of the simulated denoising methods at high noise\nlevels (\u03c3 = 50). All SDA-based methods clearly outperform the conventional BM3D method based\non visual inspection (BM3D image looks blurry compared to other SDA-based results), while it is\nindistinguishable for the simulation results among all SDA methods with different cost functions and\ntraining sets. These observations were con\ufb01rmed by the quantitative results shown in Table 2. 
All SDA-based methods outperformed BM3D significantly, but there were very small differences among all the SDA methods, even when noisy test data were used.

Figure 1: Denoising results of SDA with various methods for the MNIST dataset at a noise level of σ = 50: (a) noisy image, (b) BM3D, (c) SDA-MSE-GT, (d) SDA-SURE, (e) SDA-SURE-T, (f) SDA-REG.

4.2 Regularization effect of deep neural network denoisers

Parametrization of deep neural networks with different numbers of parameters and structures may introduce a regularization effect in training denoisers. We further investigated this regularization effect by training the SDA to minimize the MSE between the output of the SDA and the input noisy image (SDA-REG). In the case of a noise level of σ = 50, an early stopping rule was applied when the network started to overfit the noisy dataset after the first few epochs. The performance of this method was significantly worse than those of all other methods, with PSNR values of 25.07 dB (σ = 25) and 19.85 dB (σ = 50), as shown in Table 2. These values are approximately 2 dB lower than the PSNRs of BM3D, and noise patterns are visible, as shown in Figure 1. This shows that the good performance of SDA is not attributed to its structure only, but also depends on the optimization of MSE or SURE.

4.3 Accuracy of MC-SURE approximation

A small value must be assigned to ε in (6) for accurate estimation of SURE. Ramani et al. [28] observed that ε can take a wide range of values and that its choice is not critical. According to our preliminary experiments for the SDA with the MNIST dataset, any choice of ε in the range [10−7, 10−2] worked well, so that the SURE approximation closely matches the MSE during training, as illustrated in Figure 2 (middle). Extremely small values ε < 10−8 resulted in numerical instabilities, as shown in Figure 2 (right). 
On the contrary, when ε > 10−1, the approximation in (6) becomes substantially inaccurate. Figure 3 illustrates how the performance of SDA-SURE is affected by the ε value. However, note that these values apply only to the SDA trained on the MNIST dataset; the admissible range of ε depends on h_i(y; θ). For example, we observed that a suitable ε value must be carefully selected in other cases, such as DnCNN, with large-scale parameters and high resolution images, for improved performance.

Figure 2: Loss curves for the training of SDA with MSE (blue) and its corresponding MC-SURE (red) using different ε values: ε = 1 (left), ε = 10−5 (middle), and ε = 10−9 (right). MC-SURE accurately approximates the true MSE for a wide range of ε.

The accuracy of MC-SURE also depends on the noise level σ. It was observed that the SURE loss curves become noisier than the MSE loss curves as σ increases. However, they still followed similar trends and yielded similar average PSNRs on the MNIST dataset, as shown in Figure 4. We observed that beyond σ = 350, the SURE loss curves became too noisy and deviated from the trends of their corresponding MSE loss curves. 
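The insensitivity to ε reported above can be reproduced in a toy setting. The sketch below is our own illustration, not the paper's experiment: it uses a soft-thresholding "denoiser" as a hypothetical stand-in for a trained network (piecewise linear, like a ReLU network) and compares the per-pixel SURE value of (14) against the realized MSE over several ε values.

```python
import numpy as np

def soft_threshold(y, t=20.0):
    # Stand-in denoiser with piecewise-linear behavior (like ReLU nets).
    return np.sign(y) * np.maximum(np.abs(y) - t, 0.0)

rng = np.random.default_rng(0)
K, sigma = 100_000, 25.0
x = rng.uniform(0.0, 255.0, K) * (rng.uniform(size=K) < 0.5)  # sparse signal
y = x + sigma * rng.standard_normal(K)

true_mse = float(np.mean((x - soft_threshold(y)) ** 2))
rel_errs = []
for eps in (1e-6, 1e-4, 1e-2):
    n_tilde = rng.standard_normal(K)
    div = float(n_tilde @ (soft_threshold(y + eps * n_tilde)
                           - soft_threshold(y))) / eps
    sure = float(np.mean((y - soft_threshold(y)) ** 2)) - sigma ** 2 \
        + (2.0 * sigma ** 2 / K) * div
    rel_errs.append(abs(sure - true_mse) / true_mse)
# rel_errs stays small across this range of eps (SURE and the realized
# MSE agree in expectation, and both concentrate for large K)
```

Pushing ε much lower runs into floating-point cancellation in h(y + εñ) − h(y), while much larger ε breaks the first-order Taylor argument behind (6), mirroring the behavior seen in Figure 2.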
Conversely, noise levels \u03c3 > 300 were too high\nfor both SDA-based denoisers and BM3D, so that they were not able to output recognizable digits.\nTherefore, SDA-SURE can be trained effectively on adequately high noise levels so that it can yield a\nperformance that is comparable to SDA-MSE-GT and can consistently outperform BM3D.\n\n6\n\n\fFigure 3: Performance of SDA-SURE for dif-\nferent \u0001 values at \u03c3 = 25.\n\nFigure 4: Performance of denoising methods\nat different \u03c3 values.\n\n4.4 Results: high resolution natural images dataset\n\nTo demonstrate the capabilities of our SURE-based deep learning denoisers, we investigated a deeper\nand more powerful denoising network called DnCNN [11] using high resolution images. DnCNN\nconsisted of 17 layers of CNN with batch normalization and ReLU activation functions. Each\nconvolutional layer had 64 \ufb01lters with sizes of 3 \u00d7 3. Similar to [11], the network was trained with\n400 images with matrix sizes of 180 \u00d7 180 pixels. In total, 1772 \u00d7 128 image patches with sizes\nof 40 \u00d7 40 pixels were extracted randomly from these images. Two test sets were used to evaluate\nperformance: one set consisted of 12 widely used images (Set12) [17], and the other was a BSD68\ndataset. For DnCNN-SURE-T, additional 808 \u00d7 128 image patches were extracted from these noisy\ntest images, and were then added to the training dataset. For all cases, the network was trained\nwith 50 epochs using the Adam optimization algorithm with an initial learning rate of 0.001, which\neventually decayed to 0.0001 after 40 epochs. The batch size was set to 128 (note that bigger batch\nsizes did not improve performance). Images were corrupted at three noise levels (\u03c3 = 25, 50, 75).\nDnCNN used residual learning [11] whereby the network was forced to learn the difference between\nnoisy and ground truth images. The output residual image was then subtracted from the input noisy\nimage to yield the estimated image. 
In other words, our network was trained with SURE as\n\nh(y; \u03b8) = y \u2212 CNN\u03b8(y)\n\n(15)\nwhere CNN\u03b8(.) is the DnCNN that is being trained using residual learning. For DnCNN, selecting\nan appropriate \u0001 value in (6) turned out to be important for a good denoising performance. To achieve\nstable training with good performance, \u0001 had to be tuned for each of the chosen noise levels of \u03c3 =\n25, 50, 75. We observed that the optimal value for \u0001 was proportional to \u03c3 as shown in [29]. All the\nexperiments were performed with the setting of \u0001 = \u03c3 \u00d7 1.4 \u00d7 10\u22124.\nWith the use of an NVidia Titan X GPU, the training process took approximately 7 hours for DnCNN-\nMSE-GT and approximately 11 hours for DnCNN-SURE. SURE based methods took more training\n\nTable 3: Results of denoising methods on 12 widely used images (Set12) (performance in dB).\nIMAGE\n\nSTARFISH MONARCH AIRPLANE\n\nLENA BARBARA BOAT MAN\n\nC. MAN HOUSE\n\nPEPPERS\n\nPARROT\n\nCOUPLE Average\n\nBM3D\nDNCNN-BM3D\nDNCNN-SURE\nDNCNN-SURE-T\nDNCNN-MSE-GT\n\nBM3D\nDNCNN-BM3D\nDNCNN-SURE\nDNCNN-SURE-T\nDNCNN-MSE-GT\n\nBM3D\nDNCNN-BM3D\nDNCNN-SURE\nDNCNN-SURE-T\nDNCNN-MSE-GT\n\n29.47\n29.34\n29.80\n29.86\n30.14\n\n26.00\n25.76\n26.48\n26.47\n27.03\n\n24.58\n24.11\n24.65\n24.82\n25.46\n\n33.00\n31.99\n32.70\n32.73\n33.16\n\n29.51\n28.43\n29.14\n29.20\n29.92\n\n27.45\n27.02\n27.16\n27.34\n28.04\n\n30.23\n30.13\n30.58\n30.57\n30.84\n\n26.58\n26.5\n26.77\n26.78\n27.27\n\n24.69\n24.48\n24.49\n24.58\n25.22\n\n28.58\n28.38\n29.08\n29.11\n29.4\n\n25.01\n24.9\n25.38\n25.39\n25.65\n\n23.19\n23.09\n23.25\n23.34\n23.62\n\n29.35\n29.21\n30.11\n30.13\n30.45\n\n25.78\n25.66\n26.50\n26.53\n26.95\n\n23.81\n23.73\n24.10\n24.25\n24.81\n\n\u03c3 = 25\n\n28.37\n28.46\n28.94\n28.93\n29.11\n\n\u03c3 = 50\n\n25.15\n25.15\n25.66\n25.65\n25.93\n\n\u03c3 = 
75\n\n23.38\n23.40\n23.52\n23.56\n23.97\n\n7\n\n28.89\n28.91\n29.17\n29.26\n29.36\n\n25.98\n25.82\n26.21\n26.21\n26.43\n\n24.22\n24.06\n24.13\n24.44\n24.71\n\n32.06\n31.53\n32.06\n32.08\n32.44\n\n28.93\n28.36\n28.79\n28.81\n29.31\n\n27.14\n27.11\n26.92\n27.03\n27.60\n\n30.64\n28.89\n29.16\n29.44\n29.91\n\n27.19\n25.3\n24.86\n25.23\n26.17\n\n25.08\n23.80\n23.02\n23.07\n23.88\n\n29.78\n29.6\n29.84\n29.86\n30.11\n\n26.62\n26.5\n26.78\n26.79\n27.12\n\n25.05\n24.84\n25.09\n25.17\n25.53\n\n29.60\n29.52\n29.89\n29.91\n30.08\n\n26.79\n26.6\n26.97\n26.97\n27.22\n\n25.30\n25.19\n25.37\n25.45\n25.68\n\n29.70\n29.54\n29.76\n29.78\n30.06\n\n26.46\n26.17\n26.51\n26.48\n26.94\n\n24.73\n24.59\n24.70\n24.78\n25.13\n\n29.97\n29.63\n30.09\n30.14\n30.42\n\n26.67\n26.26\n26.67\n26.71\n27.16\n\n24.89\n24.62\n24.70\n24.82\n25.30\n\n\fTable 4: Results of denoising methods on BSD68 dataset (performance in dB).\n\nMethods\n\nBM3D\n\nDnCNN-BM3D\n\nDnCNN-SURE\n\nDnCNN-SURE-T\n\nDnCNN-MSE-GT\n\n\u03c3 = 25\n\u03c3 = 50\n\u03c3 = 75\n\n28.56\n25.62\n24.20\n\n28.54\n25.44\n24.09\n\n28.97\n25.93\n24.31\n\n29.00\n25.95\n24.37\n\n29.20\n26.22\n24.66\n\ntime than MSE based methods because of the additional divergence calculations executed to optimize\nthe MC-SURE cost function. For the DnCNN-SURE-T method, it took approximately 15 hours to\ncomplete the training owing to the larger dataset.\nTables 3 and 4 present denoising performance data using (a) the BM3D denoiser [17], (b) a state-of-\nthe-art deep CNN (DnCNN) image denoiser trained with MSE [11], and (c) the same DnCNN image\ndenoiser trained with SURE without the use of noiseless ground truth images, for different dataset\nvariations (as shown in Table 1). 
The MSE-based DnCNN image denoiser with ground truth data,\nDnCNN-MSE-GT, yielded the best denoising performance compared to other methods, such as the\nBM3D, which is consistent with the results in [11].\nAs seen in Table 3, for the Set12 dataset, SURE-based denoisers achieved performances comparable\nto or better than that for BM3D for noise levels \u03c3 = 25 and 50. In contrast, for higher noise levels\n(\u03c3 = 75), DnCNN-SURE and DnCNN-SURE-T yielded lower average PSNR values by 0.19 dB\nand 0.07 dB than BM3D. DnCNN-SURE-T outperformed DnCNN-SURE in all cases, and had\nconsiderably better performance on some images, such as \u201cBarbara.\u201d BM3D had exceptionally good\ndenoising performance on the \u201cBarbara\u201d image (up to 2.33 dB better PSNR), and even outperformed\nthe DnCNN-MSE-GT method.\nIn the case of the BSD68 dataset in Table 4, SURE-based methods outperformed BM3D for all the\nnoise levels. Unlike the case of the Set12 images, we observed that DnCNN-SURE had a signi\ufb01cantly\nbetter performance than BM3D, and yielded increased average PSNR values by 0.11 - 0.41 dB. It was\nalso observed that DnCNN-SURE-T bene\ufb01ted from the utilization of noisy test images and improved\nthe average PSNR of DnCNN-SURE.\nDifferences among the performances of denoisers in Tables 3 and 4 can be explained by the working\nprinciple of BM3D. Since BM3D looks for similar image patches for denoising, repeated patterns\n(as in the \u201cBarbara\u201d image) and \ufb02at areas (as in \u201cHouse\u201d image) can be key factors to generating\nimproved denoising results. One of the advantages of DnCNN-SURE over BM3D is that it does\nnot suffer from rare patch effects. 
If the test image is relatively detailed and does not contain many\nsimilar patterns, BM3D will have poorer performance than the proposed DnCNN-SURE method.\nNote that the DnCNN-BM3D method that trains networks by optimizing MSE with BM3D denoised\nimages as the ground truth yielded slightly worse performance than the BM3D itself (Tables 3, 4).\nFigure 5 illustrates the denoised results for an image from the BSD68 dataset. Visual quality assess-\nment indicated that BM3D yielded blurrier images and thus yielded worse PSNR compared to the\nresults generated by deep neural network denoisers. DnCNN-MSE-GT had the best denoised image\nwith the highest PSNR of 26.85 dB, while both SURE methods yielded very similar performances in\naccordance with PSNR and visual quality assessment.\n\n(a) Noisy image / 14.76dB\n\n(b) BM3D / 26.14dB\n\n(c) SURE / 26.46dB\n\n(d) SURE-T / 26.46dB\n\n(e) MSE / 26.85dB\n\nFigure 5: Denoising results of an image from the BSD68 dataset for \u03c3=50\n\n8\n\n\f5 Discussion\n\nIt has been shown that a single deep denoiser network, such as DnCNN, can be trained to deal\nwith multiple noise levels (e.g. \u03c3 = [0, 55]) [11]. Thus, our work can be easily generalized to\ntrain denoisers for multiple noise levels by modifying (14) to have \u03c3(j) and \u0001(j) for the SURE risk\nof the jth image patch. For example, \u0001 values for image patches can be speci\ufb01ed by our formula\n\u0001(j) = \u03c3(j) \u00d7 1.4 \u00d7 10\u22124 for DnCNN. In fact, SURE based DnCNN networks were trained to handle\na wide range of noise levels (blind denoising) in [35].\nSURE-based methods used noisy training images only, but SURE-T methods used both noisy training\nand test images. SURE-T methods yielded a slightly better denoising performance than SURE-based\nmethods (approximately 0.02 - 0.06 dB and 0.04 - 0.12 dB for BSD68 and Set12 datasets, respectively)\nwith considerably increased the overall inference time. 
Thus, at this moment, SURE-T methods do not seem to offer considerable benefits over SURE-based methods. However, a hybrid method that first trains networks using SURE with training data, and then fine-tunes the network using SURE with testing data, could potentially be useful for reducing the overall inference time. Finding a connection between SURE-T and the "deep image prior," which also used noisy test images for denoising [36], would constitute interesting future work.

Our proposed SURE-based deep learning denoiser can be useful for applications with considerably large amounts of noisy images but few, or expensive, noiseless images. Deep learning based denoising research is still evolving, and it may even be possible for our SURE-based training method to achieve significantly better performance than BM3D or other conventional state-of-the-art denoisers when applied to novel deep neural network denoisers. Further investigation will be needed to obtain high performance denoising networks for synthetic and real noise.

In this work, Gaussian noise with known variance was assumed in all simulations. However, there are several noise estimation methods that can be used with SURE (see [28] for details). SURE can also accommodate a variety of noise distributions other than Gaussian. For example, SURE has been used for parameter selection of conventional filters under a Poisson distribution [29]. Generalized SURE for exponential families has been proposed [37], so other common noise types in imaging systems can potentially be handled by SURE-based methods. It should be noted that SURE does not require any prior knowledge of the images. Thus, it can potentially be applied in the measurement domain for different applications, such as medical imaging.
Owing to noise correlation (colored noise) in the image domain (e.g., due to the Radon transform in the case of CT or PET), further investigation will be necessary to apply our proposed method directly to the image domain.

Note that, unlike (7), the existence of the minimizer of (14) should be considered with care, since it is theoretically possible for (14) to become negative infinity due to its divergence term. In practice, however, this issue can easily be addressed: by introducing a regularizer (weight decay) with the deep neural network structure so that the denoiser imposes regularity conditions on the function h (e.g., a bounded norm of ∇h), by choosing an adequate ε value, or by using proper training data. Lastly, note that we derived (14), an unbiased estimator of the MSE, assuming a fixed θ. Thus, there is no guarantee that the resulting estimator (denoiser) tuned by SURE will be unbiased [38].

6 Conclusion

We proposed an MC-SURE based training method for general deep learning denoisers.
Our proposed method trained deep neural network denoisers without noiseless ground truth data such that they yielded denoising performances comparable to those of the same denoisers trained with noiseless ground truth data, and outperformed the conventional state-of-the-art BM3D. Our SURE-based training method worked successfully both with the simple SDA [8] and with the state-of-the-art DnCNN [11], without the use of ground truth images.

Acknowledgments

This work was supported partly by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (NRF-2017R1D1A1B05035810), and partly by the Technology Innovation Program or Industrial Strategic Technology Development Program (10077533, Development of robotic manipulation algorithm for grasping/assembling with the machine learning using visual and tactile sensing information) funded by the Ministry of Trade, Industry & Energy (MOTIE, Korea).

References

[1] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, May 2015.

[2] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS) 25, pages 1097–1105, 2012.

[3] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.

[4] R. Girshick, J. Donahue, and T. Darrell. Rich feature hierarchies for accurate object detection and semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 580–587, 2014.

[5] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks.
In Advances in Neural Information Processing Systems (NIPS) 28, pages 91–99, 2015.

[6] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected CRFs. In International Conference on Learning Representations (ICLR), 2015.

[7] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3431–3440, 2015.

[8] Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11:3371–3408, December 2010.

[9] Harold C. Burger, Christian J. Schuler, and Stefan Harmeling. Image denoising: Can plain neural networks compete with BM3D? In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2392–2399, 2012.

[10] Yi-Qing Wang and Jean-Michel Morel. Can a single image denoising neural network handle all levels of Gaussian noise? IEEE Signal Processing Letters, 21(9):1150–1153, May 2014.

[11] Kai Zhang, Wangmeng Zuo, Yunjin Chen, Deyu Meng, and Lei Zhang. Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising. IEEE Transactions on Image Processing, 26(7):3142–3155, May 2017.

[12] Stamatios Lefkimmiatis. Non-local color image denoising with convolutional neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5882–5891, 2017.

[13] Junyuan Xie, Linli Xu, and Enhong Chen. Image denoising and inpainting with deep neural networks. In Advances in Neural Information Processing Systems (NIPS) 25, pages 341–349, 2012.

[14] Xiao-Jiao Mao, Chunhua Shen, and Yu-Bin Yang.
Image restoration using very deep convolutional encoder-decoder networks with symmetric skip connections. In Advances in Neural Information Processing Systems (NIPS) 29, pages 2810–2818, 2016.

[15] Ruohan Gao and Kristen Grauman. On-demand learning for deep image restoration. In IEEE International Conference on Computer Vision (ICCV), 2017.

[16] Kai Zhang, Wangmeng Zuo, Shuhang Gu, and Lei Zhang. Learning deep CNN denoiser prior for image restoration. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3929–3938, 2017.

[17] Kostadin Dabov, Alessandro Foi, Vladimir Katkovnik, and Karen Egiazarian. Image denoising by sparse 3-D transform-domain collaborative filtering. IEEE Transactions on Image Processing, 16(8):2080–2095, August 2007.

[18] Minchao Ye, Yuntao Qian, and Jun Zhou. Multitask sparse nonnegative matrix factorization for joint spectral–spatial hyperspectral imagery denoising. IEEE Transactions on Geoscience and Remote Sensing, 53(5):2621–2639, December 2014.

[19] Hu Chen, Yi Zhang, Mannudeep K. Kalra, Feng Lin, Yang Chen, Peixi Liao, Jiliu Zhou, and Ge Wang. Low-dose CT with a residual encoder-decoder convolutional neural network. IEEE Transactions on Medical Imaging, 36(12):2524–2535, November 2017.

[20] Eunhee Kang, Junhong Min, and Jong Chul Ye. A deep convolutional neural network using directional wavelets for low-dose X-ray CT reconstruction. Medical Physics, 44(10):e360–e375, October 2017.

[21] David L. Donoho and Iain M. Johnstone. Adapting to unknown smoothness via wavelet shrinkage. Journal of the American Statistical Association, 90(432):1200–1224, 1995.

[22] Xiao-Ping Zhang and Mita D. Desai. Adaptive denoising based on SURE risk. IEEE Signal Processing Letters, 5(10):265–267, 1998.

[23] C. M. Stein. Estimation of the mean of a multivariate normal distribution.
The Annals of Statistics, 9(6):1135–1151, November 1981.

[24] A. Buades, B. Coll, and J. M. Morel. A review of image denoising algorithms, with a new one. Multiscale Modeling & Simulation, 4(2):490–530, January 2005.

[25] J. Salmon. On two parameters for denoising with non-local means. IEEE Signal Processing Letters, 17(3):269–272, March 2010.

[26] D. Van De Ville and M. Kocher. SURE-based non-local means. IEEE Signal Processing Letters, 16(11):973–976, November 2009.

[27] Minh Phuong Nguyen and Se Young Chun. Bounded self-weights estimation method for non-local means image denoising using minimax estimators. IEEE Transactions on Image Processing, 26(4):1637–1649, February 2017.

[28] S. Ramani, T. Blu, and M. Unser. Monte-Carlo SURE: A black-box optimization of regularization parameters for general denoising algorithms. IEEE Transactions on Image Processing, 17(9):1540–1554, August 2008.

[29] Charles-Alban Deledalle, Samuel Vaiter, Jalal Fadili, and Gabriel Peyré. Stein unbiased gradient estimator of the risk (SUGAR) for multiple parameter selection. SIAM Journal on Imaging Sciences, 7(4):2448–2487, 2014.

[30] T. Blu and F. Luisier. The SURE-LET approach to image denoising. IEEE Transactions on Image Processing, 16(11):2778–2786, October 2007.

[31] Léon Bottou. Online learning and stochastic approximations. In On-line Learning in Neural Networks, pages 9–42. Cambridge University Press, New York, NY, USA, 1998.

[32] Yurii Nesterov. A method of solving a convex programming problem with convergence rate O(1/k²). In Soviet Mathematics Doklady, 1983.

[33] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.

[34] Martín Abadi et al. TensorFlow: A system for large-scale machine learning.
In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, pages 265–283, 2016.

[35] Magauiya Zhussip and Se Young Chun. Simultaneous compressive image recovery and deep denoiser learning from undersampled measurements. arXiv preprint arXiv:1806.00961, 2018.

[36] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Deep image prior. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

[37] Y. C. Eldar. Generalized SURE for exponential families: Applications to regularization. IEEE Transactions on Signal Processing, 57(2):471–481, January 2009.

[38] Ryan J. Tibshirani and Saharon Rosset. Excess optimism: How biased is the apparent error of an estimator tuned by SURE? arXiv preprint arXiv:1612.09415, 2016.