{"title": "Natural Image Denoising with Convolutional Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 769, "page_last": 776, "abstract": "We present an approach to low-level vision that combines two main ideas: the use of convolutional networks as an image processing architecture and an unsupervised learning procedure that synthesizes training samples from specific noise models. We demonstrate this approach on the challenging problem of natural image denoising. Using a test set with a hundred natural images, we find that convolutional networks provide comparable and in some cases superior performance to state of the art wavelet and Markov random field (MRF) methods. Moreover, we find that a convolutional network offers similar performance in the blind denoising setting as compared to other techniques in the non-blind setting. We also show how convolutional networks are mathematically related to MRF approaches by presenting a mean field theory for an MRF specially designed for image denoising. Although these approaches are related, convolutional networks avoid computational difficulties in MRF approaches that arise from probabilistic learning and inference. This makes it possible to learn image processing architectures that have a high degree of representational power (we train models with over 15,000 parameters), but whose computational expense is significantly less than that associated with inference in MRF approaches with even hundreds of parameters.", "full_text": "Natural Image Denoising with\n\nConvolutional Networks\n\nViren Jain1\n\n1Brain & Cognitive Sciences\n\nMassachusetts Institute of Technology\n\nH. 
Sebastian Seung1,2\n\n2Howard Hughes Medical Institute\n\nMassachusetts Institute of Technology\n\nAbstract\n\nWe present an approach to low-level vision that combines two main ideas: the\nuse of convolutional networks as an image processing architecture and an unsu-\npervised learning procedure that synthesizes training samples from speci\ufb01c noise\nmodels. We demonstrate this approach on the challenging problem of natural\nimage denoising. Using a test set with a hundred natural images, we \ufb01nd that con-\nvolutional networks provide comparable and in some cases superior performance\nto state of the art wavelet and Markov random \ufb01eld (MRF) methods. Moreover,\nwe \ufb01nd that a convolutional network offers similar performance in the blind de-\nnoising setting as compared to other techniques in the non-blind setting. We also\nshow how convolutional networks are mathematically related to MRF approaches\nby presenting a mean \ufb01eld theory for an MRF specially designed for image denois-\ning. Although these approaches are related, convolutional networks avoid compu-\ntational dif\ufb01culties in MRF approaches that arise from probabilistic learning and\ninference. This makes it possible to learn image processing architectures that have\na high degree of representational power (we train models with over 15,000 param-\neters), but whose computational expense is signi\ufb01cantly less than that associated\nwith inference in MRF approaches with even hundreds of parameters.\n\n1 Background\n\nLow-level image processing tasks include edge detection, interpolation, and deconvolution. These\ntasks are useful both in themselves, and as a front-end for high-level visual tasks like object recog-\nnition. 
This paper focuses on the task of denoising, de\ufb01ned as the recovery of an underlying image\nfrom an observation that has been subjected to Gaussian noise.\nOne approach to image denoising is to transform an image from pixel intensities into another rep-\nresentation where statistical regularities are more easily captured. For example, the Gaussian scale\nmixture (GSM) model introduced by Portilla and colleagues is based on a multiscale wavelet de-\ncomposition that provides an effective description of local image statistics [1, 2].\nAnother approach is to try to capture statistical regularities of pixel intensities directly using\nMarkov random \ufb01elds (MRFs) to de\ufb01ne a prior over the image space. Initial work used hand-\ndesigned settings of the parameters, but recently there has been increasing success in learning the\nparameters of such models from databases of natural images [3, 4, 5, 6, 7, 8]. Prior models can be\nused for tasks such as image denoising by augmenting the prior with a noise model.\nAlternatively, an MRF can be used to model the probability distribution of the clean image condi-\ntioned on the noisy image. This conditional random \ufb01eld (CRF) approach is said to be discrimi-\nnative, in contrast to the generative MRF approach. Several researchers have shown that the CRF\napproach can outperform generative learning on various image restoration and labeling tasks [9, 10].\nCRFs have recently been applied to the problem of image denoising as well [5].\n\nThe present work is most closely related to the CRF approach. Indeed, certain special cases of con-\nvolutional networks can be seen as performing maximum likelihood inference on a CRF [11]. The\nadvantage of the convolutional network approach is that it avoids a general dif\ufb01culty with applying\nMRF-based methods to image analysis: the computational expense associated with both parameter\nestimation and inference in probabilistic models. 
For example, naive methods of learning MRF-\nbased models involve calculation of the partition function, a normalization factor that is generally\nintractable for realistic models and image dimensions. As a result, a great deal of research has\nbeen devoted to approximate MRF learning and inference techniques that ameliorate computational\ndif\ufb01culties, generally at the cost of either representational power or theoretical guarantees [12, 13].\nConvolutional networks largely avoid these dif\ufb01culties by posing the computational task within the\nstatistical framework of regression rather than density estimation. Regression is a more tractable\ncomputation and therefore permits models with greater representational power than methods based\non density estimation. This claim will be supported by empirical results on the denoising problem,\nas well as by mathematical connections between MRF and convolutional network approaches.\n\n2 Convolutional Networks\n\nConvolutional networks have been extensively applied to visual object recognition using architec-\ntures that accept an image as input and, through alternating layers of convolution and subsampling,\nproduce one or more output values that are thresholded to yield binary predictions regarding object\nidentity [14, 15]. In contrast, we study networks that accept an image as input and produce an entire\nimage as output. Previous work has used such architectures to produce images with binary targets\nin image restoration problems for specialized microscopy data [11, 16]. Here we show that similar\narchitectures can also be used to produce images with the analog \ufb02uctuations found in the intensity\ndistributions of natural images.\n\nNetwork Dynamics and Architecture\n\nA convolutional network is an alternating sequence of linear \ufb01ltering and nonlinear transformation\noperations. 
The input and output layers include one or more images, while intermediate layers\ncontain \u201chidden\u201d units with images called feature maps that are the internal computations of the\nalgorithm. The activity of feature map a in layer k is given by\n\nI_{k,a} = f( \u03a3_b w_{k,ab} \u2297 I_{k\u22121,b} \u2212 \u03b8_{k,a} )    (1)\n\nwhere I_{k\u22121,b} are feature maps that provide input to I_{k,a}, and \u2297 denotes the convolution operation.\nThe function f is the sigmoid f(x) = 1/(1 + e^{\u2212x}) and \u03b8_{k,a} is a bias parameter.\nWe restrict our experiments to monochrome images and hence the networks contain a single image\nin the input layer. It is straightforward to extend this approach to color images by assuming an input\nlayer with multiple images (e.g., RGB color channels). For numerical reasons, it is preferable to\nuse input and target values in the range of 0 to 1, and hence the 8-bit integer intensity values of the\ndataset (values from 0 to 255) were normalized to lie between 0 and 1. We also explicitly encode\nthe border of the image by padding an area surrounding the image with values of \u22121.\n\nLearning to Denoise\n\nParameter learning can be performed with a modi\ufb01cation of the backpropagation algorithm for feed-\nforward neural networks that takes into account the weight-sharing structure of convolutional net-\nworks [14]. However, several issues have to be addressed in order to learn the architecture in Figure\n1 for the task of natural image denoising.\nFirstly, the image denoising task must be formulated as a learning problem in order to train the\nconvolutional network. Since we assume access to a database of only clean, noiseless images, we\nimplicitly specify the desired image processing task by integrating a noise process into the training\nprocedure. In particular, we assume a noise process n(x) that operates on an image x_i drawn from a\ndistribution of natural images X. 
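As a concrete illustration, the per-map computation of Eq. 1 can be sketched as follows. This is a minimal NumPy version with hypothetical shapes, not the authors' MATLAB implementation; border handling and multiple output maps are omitted:

```python
import numpy as np

def sigmoid(x):
    # f(x) = 1 / (1 + e^-x), the nonlinearity used in Eq. 1
    return 1.0 / (1.0 + np.exp(-x))

def conv2d_valid(image, kernel):
    # Plain 2-D convolution over the "valid" region (kernel flipped)
    kh, kw = kernel.shape
    h, w = image.shape
    flipped = kernel[::-1, ::-1]
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * flipped)
    return out

def feature_map(prev_maps, kernels, theta):
    # Eq. 1: I_{k,a} = f( sum_b w_{k,ab} (conv) I_{k-1,b} - theta_{k,a} )
    total = sum(conv2d_valid(I, w) for I, w in zip(prev_maps, kernels))
    return sigmoid(total - theta)
```

Note that with all-zero kernels and zero bias every output unit sits at f(0) = 0.5, and each "valid" convolution with a 5 × 5 filter shrinks the map by 4 pixels per dimension.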
If we consider the entire convolutional network to be some function\nF_\u03c6 with free parameters \u03c6, then the parameter estimation problem is to minimize the reconstruction\nerror of the images subject to the noise process: min_\u03c6 \u03a3_i (x_i \u2212 F_\u03c6(n(x_i)))^2.\n\nFigure 1: Architecture of convolutional network used for denoising. The network has 4 hidden layers and 24\nfeature maps in each hidden layer. In layers 2, 3, and 4, each feature map is connected to 8 randomly chosen\nfeature maps in the previous layer. Each arrow represents a single convolution associated with a 5 \u00d7 5 \ufb01lter,\nand hence this network has 15,697 free parameters and requires 624 convolutions to process its forward pass.\n\nSecondly, it is inef\ufb01cient to use batch learning in this context. The training sets used in the ex-\nperiments have millions of pixels, and it is not practical to perform both a forward and backward\npass on the entire training set when gradient learning requires many tens of thousands of updates to\nconverge to a reasonable solution. Stochastic online gradient learning is a more ef\ufb01cient learning\nprocedure that can be adapted to this problem. Typically, this procedure selects a small number of\nindependent examples from the training set and averages together their gradients to perform a single\nupdate. We compute a gradient update from 6 \u00d7 6 patches randomly sampled from six different\nimages in the training set. Using a localized image patch violates the independence assumption in\nstochastic online learning, but combining the gradient from six separate images yields a 6 \u00d7 6 \u00d7 6\ncube that in practice is a suf\ufb01cient approximation of the gradient to be effective. Larger patches (we\ntried 8 \u00d7 8 and 10 \u00d7 10) reduce correlations in the training sample but do not improve accuracy. 
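The training-set synthesis and patch-based objective described above can be sketched as follows. This is a toy NumPy version under stated assumptions: the patch size and noise level are illustrative, and the identity "denoiser" in the usage note stands in for the trained network F_φ:

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy(x, sigma):
    # The assumed noise process n(x): additive i.i.d. Gaussian noise
    return x + rng.normal(0.0, sigma, size=x.shape)

def sample_patches(images, patch=6, count=6):
    # One stochastic update draws `count` patches from different images
    out = []
    for img in images[:count]:
        i = rng.integers(0, img.shape[0] - patch + 1)
        j = rng.integers(0, img.shape[1] - patch + 1)
        out.append(img[i:i + patch, j:j + patch])
    return out

def reconstruction_error(clean_patches, denoiser, sigma):
    # The objective from the text: min_phi sum_i (x_i - F_phi(n(x_i)))^2
    return sum(float(np.sum((x - denoiser(noisy(x, sigma))) ** 2))
               for x in clean_patches)
```

A perfect denoiser drives the objective to zero; with σ = 0, even the identity map does.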
This\nscheme is especially ef\ufb01cient because most of the computation for a local patch is shared.\nWe found that training time is minimized and generalization accuracy is maximized by incrementally\nlearning each layer of weights. Greedy, layer-wise training strategies have recently been explored\nin the context of unsupervised initialization of multi-layer networks, which are usually \ufb01ne tuned\nfor some discriminative task with a different cost function [17, 18, 19]. We maintain the same cost\nfunction throughout. This procedure starts by training a network with a single hidden layer. After\nthirty epochs, the weights from the \ufb01rst hidden layer are copied to a new network with two hidden\nlayers; the weights connecting the hidden layer to the output layer are discarded. The two hidden\nlayer network is optimized for another thirty epochs, and the procedure is repeated for N layers.\nFinally, when learning networks with two or more hidden layers it was important to use a very small\nlearning rate for the \ufb01nal layer (0.001) and a larger learning rate (0.1) in all other layers.\n\nImplementation\n\nConvolutional network inference and learning can be implemented in just a few lines of MATLAB\ncode using multi-dimensional convolution and cross-correlation routines. This also makes the ap-\nproach especially easy to optimize using parallel computing or GPU computing strategies.\n\n3 Experiments\n\nWe derive training and test sets for our experiments from natural images in the Berkeley segmenta-\ntion database, which has been previously used to study denoising [20, 4]. We restrict our experiments\nto the case of monochrome images; color images in the Berkeley dataset are converted to grayscale\nby averaging the color channels. The test set consists of 100 images, 77 with dimensions 321 \u00d7 481\nand 23 with dimensions 481 \u00d7 321. 
Quantitative comparisons are performed using the Peak Signal\nto Noise Ratio (PSNR): 20 log10(255/\u03c3e), where \u03c3e is the standard deviation of the error. PSNR\nhas been widely used to evaluate denoising performance [1, 4, 2, 5, 6, 7].\n\nFigure 2: Denoising results as measured by peak signal to noise ratio (PSNR) for 3 different noise levels. In\neach case, results are the average denoised PSNR of the hundred images in the test set. CN1 and CNBlind are\nlearned using the same forty image training set as the Field of Experts model (FoE). CN2 is learned using a\ntraining set with an additional sixty images. BLS-GSM1 and BLS-GSM2 are two different parameter settings\nof the algorithm in [1]. All methods except CNBlind assume a known noise distribution.\n\nDenoising with known noise conditions\n\nIn this task it is assumed that images have been subjected to Gaussian noise of known variance.\nWe use this noise model during the training process and learn a \ufb01ve-layer network for each noise\nlevel. Both the Bayes Least Squares-Gaussian Scale Mixture (BLS-GSM) and Field of Experts\n(FoE) methods also optimize the denoising process based on a speci\ufb01ed noise level.\nWe learn two sets of networks for this task that differ in their training set. In one set of networks,\nwhich we refer to as CN1, the training set is the same subset of the Berkeley database used to learn\nthe FoE model [4]. In another set of networks, called CN2, this training set is augmented by an\nadditional sixty images from the Berkeley database. The architecture of these networks is shown in\nFig. 1. Quantitative results from both networks under three different noise levels are shown in Fig.\n2, along with results from the FoE and BLS-GSM methods (BLS-GSM 1 uses the same settings as\nin [1], while BLS-GSM 2 uses the default settings in the code provided by the authors). 
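For reference, the PSNR criterion can be computed as follows; this is a sketch in which σe is taken as the root-mean-square error, the usual convention for this measure:

```python
import numpy as np

def psnr(clean, denoised, peak=255.0):
    # PSNR = 20 * log10(peak / sigma_e), with sigma_e the RMS of the error
    err = np.asarray(clean, dtype=float) - np.asarray(denoised, dtype=float)
    sigma_e = np.sqrt(np.mean(err ** 2))
    return 20.0 * np.log10(peak / sigma_e)
```

An error of RMS 25.5 on 8-bit images gives 20 log10(255/25.5) = 20 dB, and halving the error adds about 6 dB.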
For the FoE\nresults, the number of iterations and magnitude of the step size are optimized for each noise level\nusing a grid search on the training set. A visual comparison of these results is shown in Fig. 3.\nWe \ufb01nd that the convolutional network has the highest average PSNR using either training set,\nalthough by a margin that is not statistically signi\ufb01cant when standard error is computed from\nthe distribution of PSNR values of the entire image. However, we believe this is a conservative\nestimate of the standard error, which is much smaller when measured on a pixel or patch-wise basis.\n\nBlind denoising\n\nIn this task it is assumed that images have been subjected to Gaussian noise of unknown variance.\nDenoising in this context is a more dif\ufb01cult problem than in the non-blind situation. We train a single\nsix-layer network, which we refer to as CNBlind, by randomly varying the amount of noise added to\neach example in the training process, in the range \u03c3 \u2208 [0, 100]. During inference, the noise level\nis unknown and only the image is provided as input. We use the same training set as the FoE model\nand CN1. The architecture is the same as that shown in Fig. 1 except with 5 hidden layers instead\nof 4. Results for 3 noise levels are shown in Fig. 2. We \ufb01nd that a convolutional network trained for\nblind denoising performs well even compared to the other methods under non-blind conditions. In\nFig. 4, we show \ufb01lters that were learned for this network.\n\n[Figure 2 axes: noise \u03c3 \u2208 {25, 50, 100}; y-axis: average PSNR of denoised images; legend: FoE, BLS-GSM 1, BLS-GSM 2, CN1, CN2, CNBlind]\n\nFigure 3: Denoising results on an image from the test set. The noisy image was generated by adding Gaussian\nnoise with \u03c3 = 50 to the clean image. Non-blind denoising results for the BLS-GSM, FoE, and convolutional\nnetwork methods are shown. 
The lower left panel shows results for the outlined region in the upper left panel.\nThe zoomed-in region shows that in some areas CN2 output has less severe artifacts than the wavelet-based\nresults and is sharper than the FoE results. CN1 results (PSNR=24.12) are visually similar to those of CN2.\n\n4 Relationship between MRF and Convolutional Network Approaches\n\nIn the introduction, we claim that convolutional networks have similar or even greater representa-\ntional power compared to MRFs. To support this claim, we will show that special cases of convolu-\ntional networks correspond to mean \ufb01eld inference for an MRF. This does not rigorously prove that\nconvolutional networks have representational power greater than or equal to MRFs, since mean \ufb01eld\ninference is an approximation. However, it is plausible that this is the case.\nPrevious work has pointed out that the Field of Experts MRF can be interpreted as a convolutional\nnetwork (see [21]) and that MRFs with an Ising-like prior can be related to convolutional networks\n(see [11]). Here, we analyze a different MRF that is specially designed for image denoising and\nshow that it is closely related to the convolutional network in Figure 1. In particular, we consider an\nMRF that de\ufb01nes a distribution over analog \u201cvisible\u201d variables v and binary \u201chidden\u201d variables h:\n\nP(v, h) = (1/Z) exp( \u2212(1/2\u03c3^2) \u03a3_i v_i^2 + (1/\u03c3^2) \u03a3_{ia} h_i^a (w^a \u2297 v)_i + (1/2) \u03a3_{iab} h_i^a (w^{ab} \u2297 h^b)_i )    (2)\n\nwhere v_i and h_i^a correspond to the ith pixel location in the image, Z is the partition function, and \u03c3 is\nthe known standard deviation of the Gaussian noise. 
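To make the model concrete, the exponent of Eq. 2 can be evaluated numerically. The sketch below is a 1-D toy with a single hidden map, where np.convolve with mode='same' stands in for ⊗; the filters and sizes are illustrative assumptions:

```python
import numpy as np

def log_unnorm_prob(v, h, w_a, w_aa, sigma):
    # Exponent of Eq. 2 (i.e., log P(v, h) + log Z), 1-D, one hidden map:
    #   -(1/2 sigma^2) sum_i v_i^2
    #   + (1/sigma^2)  sum_i h_i (w ⊗ v)_i
    #   + (1/2)        sum_i h_i (w' ⊗ h)_i
    assert w_aa[len(w_aa) // 2] == 0.0  # no self-interaction: w^{aa}_0 = 0
    term_v = -np.sum(v ** 2) / (2.0 * sigma ** 2)
    term_vh = np.sum(h * np.convolve(v, w_a, mode='same')) / sigma ** 2
    term_hh = 0.5 * np.sum(h * np.convolve(h, w_aa, mode='same'))
    return term_v + term_vh + term_hh
```

Setting all hidden units to zero leaves only the quadratic term in v, as Eq. 2 implies; turning hidden units on with positive filters raises the exponent.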
[Figure 3 panels: CLEAN; NOISY, PSNR=14.96; CN2, PSNR=24.25; BLS-GSM, PSNR=23.78; FoE, PSNR=23.02]\n\nFigure 4: Filters learned for the \ufb01rst 2 hidden layers (Layer 1 and Layer 2) of network CNBlind. The second hidden layer has 192\n\ufb01lters (24 feature maps \u00d7 8 \ufb01lters per map). The \ufb01rst layer has recognizable structure in the \ufb01lters, including\nboth derivative \ufb01lters as well as high frequency \ufb01lters similar to those learned by the FoE model [4, 6].\n\nNote that by symmetry we have w^{ab}_{i\u2212j} = w^{ba}_{j\u2212i}, and we assume w^{aa}_0 = 0 so there is no self-interaction in the model (if this were not the case, one\ncould always transfer this to a term that is linear in h_i^a, which would lead to an additional bias term\nin the mean \ufb01eld approximation). Hence, P(v, h) constitutes an undirected graphical model which\ncan be conceptualized as having separate layers for the visible and hidden variables. There are no\nintralayer interactions in the visible layer; there is convolutional structure (instead of full connectivity) in\nthe intralayer interactions between hidden variables and in the interlayer interactions between the visible\nand hidden layers.\nFrom the de\ufb01nition of P(v, h) it follows that the conditional distribution\n\nP(v | h) \u221d exp( \u2212(1/2\u03c3^2) \u03a3_i ( v_i \u2212 \u03a3_a (w^a \u2297 h^a)_i )^2 )    (3)\n\nis Gaussian with mean v\u0304_i = \u03a3_a (w^a \u2297 h^a)_i. This is also equal to the conditional expectation E[v | h].\nWe can use this model for denoising by \ufb01xing the visible variables to the noisy image, computing\nthe most likely hidden variables h* by MAP inference, and regarding the conditional expectation of\nP(v | h*) as the denoised image. 
To do inference we would like to calculate max_h P(h | v), but this is\ndif\ufb01cult because of the partition function. However, we can consider the mean \ufb01eld approximation\n\nh_i^a = f( (1/\u03c3^2) (w^a \u2297 v)_i + \u03a3_b (w^{ab} \u2297 h^b)_i ),    (4)\n\nwhich can be solved by regarding the equation as a dynamics and iterating it. If we compare this to\nEq. 1, we \ufb01nd that this is equivalent to a convolutional network in which each hidden layer has the\nsame weights and each feature map directly receives input from the image.\nThese results suggest that certain convolutional networks can be interpreted as performing approx-\nimate inference on MRF models designed for denoising. In practice, the convolutional network ar-\nchitectures we train are not exactly related to such MRF models because the weights of each hidden\nlayer are not constrained to be the same, nor is the image an input to any feature map except those\nin the \ufb01rst layer. An interesting question for future research is how these additional architectural\nconstraints would affect performance of the convolutional network approach.\nFinally, although the special case of non-blind Gaussian denoising allows for direct integration of the\nnoise model into the MRF equations, our empirical results on blind denoising suggest that the con-\nvolutional network approach is adaptable to more general and complex noise models when speci\ufb01ed\nimplicitly through the learning cost function.\n\n5 Discussion\n\nPrior versus learned structure\n\nBefore learning, the convolutional network has little structure specialized to natural images. In\ncontrast, the GSM model uses a multi-scale wavelet representation that is known for its suitability in\nrepresenting natural image statistics. 
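Iterating Eq. 4 as a dynamics and then reading out the conditional mean of Eq. 3 gives a complete toy denoiser. The sketch below is a 1-D version with a single hidden map (so a = b), with illustrative filters and iteration count; np.convolve with mode='same' plays the role of ⊗:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mean_field_denoise(v, w_a, w_aa, sigma, n_iter=50):
    # Eq. 4: h_i = f( (1/sigma^2)(w ⊗ v)_i + (w' ⊗ h)_i ), iterated as a dynamics
    h = np.zeros_like(v)
    for _ in range(n_iter):
        h = sigmoid(np.convolve(v, w_a, mode='same') / sigma ** 2
                    + np.convolve(h, w_aa, mode='same'))
    # Denoised output: the conditional mean of Eq. 3, sum_a (w^a ⊗ h^a)_i
    return np.convolve(h, w_a, mode='same')
```

Because every iteration reuses the same filters and feeds the image into each update, this loop is exactly the restricted convolutional network described in the text: identical weights at every hidden layer, with the image as input to each.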
Moreover, inference in the FoE model uses a procedure similar\nto non-linear diffusion methods, which have been previously used for natural image processing\nwithout learning. The architecture of the FoE MRF is so well chosen that even random settings of\nthe free parameters can provide impressive performance [21].\nRandom parameter settings of the convolutional networks do not produce any clearly useful com-\nputation. If the parameters of CN2 are randomized in just the last layer, denoising performance for\nthe image in Fig. 3 drops from PSNR=24.25 to 14.87. Random parameters in all layers yields even\nworse results. This is consistent with the idea that nothing in CN2\u2019s representation is specialized to\nnatural images before training, other than the localized receptive \ufb01eld structure of convolutions. Our\napproach instead relies on a gradient learning algorithm to tune thousands of parameters using ex-\namples of natural images. One might assume this approach would require vastly more training data\nthan other methods with more prior structure. However, we obtain good generalization performance\nusing the same training set as that used to learn the Field of Experts model, which has many fewer\ndegrees of freedom. The disadvantage of this approach is that it produces an architecture whose\nperformance is more dif\ufb01cult to understand due to its numerous free parameters. The advantage of\nthis approach is that it may lead to more accurate performance, and can be applied to novel forms of\nimagery that have very different statistics than natural images or any previously studied dataset (an\nexample of this is the specialized image restoration problem studied in [11]).\n\nNetwork architecture and using more image context\n\nThe amount of image context the convolutional network uses to produce an output value for a spe-\nci\ufb01c image location is determined by the number of layers in the network and size of \ufb01lter in each\nlayer. 
For example, the 5 and 6-layer networks explored here respectively use a 20\u00d7 20 and 24\u00d7 24\nimage patch. This is a relatively small amount of context compared to that used by the FoE and BLS-\nGSM models, both of which permit correlations to extend over the entire image. It is surprising that\ndespite this major difference, the convolutional network approach still provides good performance.\nOne explanation could be that the scale of objects in the chosen image dataset may allow for most\nrelevant information to be captured in a relatively small \ufb01eld of view.\nNonetheless, it is of interest for denoising as well as other applications to increase the amount of\ncontext used by the network. A simple strategy is to further increase the number of layers; however,\nthis becomes computationally intensive and may be an inef\ufb01cient way to exploit the multi-scale\nproperties of natural images. Adding additional machinery in the network architecture may work\nbetter. Integrating the operations of sub-sampling and super-sampling would allow a network to\nprocess the image at multiple scales, while still being entirely amenable to gradient learning.\n\nComputational ef\ufb01ciency\n\nWith many free parameters, convolutional networks may seem like a computationally expensive\nimage processing architecture. On the contrary, the 5-layer CN1 and CN2 architecture (Fig. 1)\nrequires only 624 image convolutions to process an image. In comparison, the FoE model performs\ninference by means of a dynamic process that can require several thousand iterations. One-thousand\niterations of these dynamics requires 48,000 convolutions (for an FoE model with 24 \ufb01lters).\nWe also report wall-clock speed by denoising a 512 \u00d7 512 pixel image on a 2.16Ghz Intel Core\n2 processor. Averaged over 10 trials, CN1/CN2 requires 38.86 \u00b1 0.1 sec., 1,000 iterations of the\nFoE requires 1664.35 \u00b1 30.23 sec. 
(using code from the authors of [4]), the BLS-GSM model with\nparameter setting \u201c1\u201d requires 51.86 \u00b1 0.12 sec., and parameter setting \u201c2\u201d requires 26.51 \u00b1 0.15\nsec. (using code from the authors of [1]). All implementations are in MATLAB.\nIt is true, however, that training the convolutional network architecture requires substantial computa-\ntion. As gradient learning can require many thousands of updates to converge, training the denoising\nnetworks required a parallel implementation that utilized a dozen processors for a week. While this\nis a signi\ufb01cant amount of computation, it can be performed off-line.\n\nLearning more complex image transformations and generalized image attractors models\n\nIn this work we have explored an image processing task which can be easily formulated as a learning\nproblem by synthesizing training examples from abundantly available noiseless natural images. Can\nthis approach be extended to tasks in which the noise model has a more variable or complex form?\nOur results on blind denoising, in which the amount of noise may vary from little to severe, provide\nsome evidence that it can. Preliminary experiments on image inpainting are also encouraging.\nThat said, a major virtue of the image prior approach is the ability to easily reuse a single image\nmodel in novel situations by simply augmenting the prior with the appropriate observation model.\nThis is possible because the image prior and the observation model are decoupled. Yet explicit prob-\nabilistic modeling is computationally dif\ufb01cult and makes learning even simple models challenging.\nConvolutional networks forgo probabilistic modeling and, as developed here, focus on speci\ufb01c\nimage-to-image transformations as a regression problem. 
It will be interesting to combine the two\napproaches to learn models that are \u201cunnormalized priors\u201d in the sense of energy-based image at-\ntractors; regression can then be used as a tool for unsupervised learning by capturing dependencies\nbetween variables within the same distribution [22].\nAcknowledgements: we are grateful to Ted Adelson, Ce Liu, Srinivas Turaga, and Yair Weiss for\nhelpful discussions. We also thank the authors of [1] and [4] for making code available.\n\nReferences\n[1] J. Portilla, V. Strela, M.J. Wainwright, E.P. Simoncelli. Image denoising using scale mixtures of Gaussians\nin the wavelet domain. IEEE Trans. Image Proc., 2003.\n[2] S. Lyu, E.P. Simoncelli. Statistical modeling of images with \ufb01elds of Gaussian scale mixtures. NIPS* 2006.\n[3] S. Geman, D. Geman. Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images.\nPattern Analysis and Machine Intelligence, 1984.\n[4] S. Roth, M.J. Black. Fields of Experts: a framework for learning image priors. CVPR 2005.\n[5] M.F. Tappen, C. Liu, E.H. Adelson, W.T. Freeman. Learning Gaussian Conditional Random Fields for\nLow-Level Vision. CVPR 2007.\n[6] Y. Weiss, W.T. Freeman. What makes a good model of natural images? CVPR 2007.\n[7] P. Gehler, M. Welling. Product of \u201cedge-perts\u201d. NIPS* 2005.\n[8] S.C. Zhu, Y. Wu, D. Mumford. Filters, Random Fields and Maximum Entropy (FRAME): Towards a\nUni\ufb01ed Theory for Texture Modeling. International Journal of Computer Vision, 1998.\n[9] S. Kumar, M. Hebert. Discriminative \ufb01elds for modeling spatial dependencies in natural images. NIPS*\n2004.\n[10] X. He, R. Zemel, M.C. Perpinan. Multiscale conditional random \ufb01elds for image labeling. CVPR 2004.\n[11] V. Jain, J.F. Murray, F. Roth, S. Turaga, V. Zhigulin, K.L. Briggman, M.N. Helmstaedter, W. Denk, H.S.\nSeung. Supervised Learning of Image Restoration with Convolutional Networks. ICCV 2007.\n[12] S. Parise, M. Welling. 
Learning in Markov random \ufb01elds: An empirical study. Joint Stat. Meeting, 2005.\n[13] R. Szeliski, R. Zabih, D. Scharstein, O. Veksler, V. Kolmogorov, A. Agarwala, M. Tappen, C. Rother. A\ncomparative study of energy minimization methods for Markov random \ufb01elds. ECCV 2006.\n[14] Y. LeCun, B. Boser, J.S. Denker, D. Henderson, R.E. Howard, W. Hubbard, L.D. Jackel. Backpropagation\nApplied to Handwritten Zip Code Recognition. Neural Computation, 1989.\n[15] Y. LeCun, F.J. Huang, L. Bottou. Learning methods for generic object recognition with invariance to pose\nand lighting. CVPR 2004.\n[16] F. Ning, D. Delhomme, Y. LeCun, F. Piano, L. Bottou, P.E. Barbano. Toward Automatic Phenotyping of\nDeveloping Embryos From Videos. IEEE Trans. Image Proc., 2005.\n[17] G. Hinton, R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 2006.\n[18] M. Ranzato, Y.L. Boureau, Y. LeCun. Sparse feature learning for deep belief networks. NIPS* 2007.\n[19] Y. Bengio, P. Lamblin, D. Popovici, H. Larochelle. Greedy Layer-Wise Training of Deep Networks.\nNIPS* 2006.\n[20] D. Martin, C. Fowlkes, D. Tal, J. Malik. A database of human segmented natural images and its application\nto evaluating segmentation algorithms and measuring ecological statistics. ICCV 2001.\n[21] S. Roth. High-order Markov random \ufb01elds for low-level vision. PhD Thesis, Brown Univ., 2007.\n[22] H.S. Seung. Learning continuous attractors in recurrent networks. NIPS* 1997.\n", "award": [], "sourceid": 31, "authors": [{"given_name": "Viren", "family_name": "Jain", "institution": null}, {"given_name": "Sebastian", "family_name": "Seung", "institution": null}]}