{"title": "Regularized estimation of image statistics by Score Matching", "book": "Advances in Neural Information Processing Systems", "page_first": 1126, "page_last": 1134, "abstract": "Score Matching is a recently-proposed criterion for training high-dimensional density models for which maximum likelihood training is intractable. It has been applied to learning natural image statistics but has so-far been limited to simple models due to the difficulty of differentiating the loss with respect to the model parameters. We show how this differentiation can be automated with an extended version of the double-backpropagation algorithm. In addition, we introduce a regularization term for the Score Matching loss that enables its use for a broader range of problem by suppressing instabilities that occur with finite training sample sizes and quantized input values. Results are reported for image denoising and super-resolution.", "full_text": "Regularized estimation of image statistics\n\nby Score Matching\n\nDiederik P. Kingma\n\nDepartment of Information and Computing Sciences\n\nUniversiteit Utrecht\n\nd.p.kingma@students.uu.nl\n\nYann LeCun\n\nCourant Institute of Mathematical Sciences\n\nNew York University\nyann@cs.nyu.edu\n\nAbstract\n\nScore Matching is a recently-proposed criterion for training high-dimensional\ndensity models for which maximum likelihood training is intractable. It has been\napplied to learning natural image statistics but has so-far been limited to simple\nmodels due to the dif\ufb01culty of differentiating the loss with respect to the model\nparameters. We show how this differentiation can be automated with an extended\nversion of the double-backpropagation algorithm. In addition, we introduce a reg-\nularization term for the Score Matching loss that enables its use for a broader\nrange of problem by suppressing instabilities that occur with \ufb01nite training sam-\nple sizes and quantized input values. 
Results are reported for image denoising and super-resolution.

1 Introduction

Consider the subject of density estimation for high-dimensional continuous random variables, like images. Approaches for normalized density estimation, like mixture models, often suffer from the curse of dimensionality. An alternative approach is Product-of-Experts (PoE) [7], where we model the density as a product, rather than a sum, of component (expert) densities. The multiplicative nature of PoE models makes them able to form complex densities: in contrast to mixture models, each expert has the ability to have a strongly negative influence on the density at any point by assigning it a very low component density. However, Maximum Likelihood Estimation (MLE) of the model requires differentiation of a normalizing term, which is infeasible even for low data dimensionality.

A recently introduced estimation method is Score Matching [10], which involves minimizing the square distance between the model log-density slope (score) and the data log-density slope, a distance which is independent of the normalizing term. Unfortunately, applications of SM estimation have thus far been limited. Besides ICA models, SM has been applied to Markov Random Fields [14] and a multi-layer model [13], but reported results on real-world data have been of a qualitative, rather than quantitative, nature. Differentiating the SM loss with respect to the parameters can be very challenging, which somewhat complicates the use of SM in many situations. Furthermore, the proof of the SM estimator [10] requires certain conditions that are often violated, like a smooth underlying density or an infinite number of samples.

Other estimation methods are Contrastive Divergence [8] (CD), Basis Rotation [23] and Noise-Contrastive Estimation [6] (NCE). 
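As a minimal illustration of why the Score Matching objective sidesteps the normalizing term (a toy sketch under simple assumptions, not an example from the paper): for a one-dimensional Gaussian energy E(x; w) = w x²/2, the sample SM loss defined later in equation (2) reduces to the mean of ½w²x² − w, which is minimized at w* = 1/mean(x²), i.e. the inverse sample second moment, with no partition function ever computed.

```python
import numpy as np

# Toy example (not from the paper): score matching for a zero-mean
# Gaussian energy E(x; w) = w * x**2 / 2, whose score is -w*x.
# Sample SM loss (cf. eq. 2): J(w) = mean( 0.5*(dE/dx)^2 - d2E/dx2 )
#                                  = mean( 0.5*w^2*x^2 - w )
rng = np.random.default_rng(0)
sigma = 2.0
x = rng.normal(0.0, sigma, size=100_000)

def sm_loss(w, x):
    dE = w * x          # first derivative of the energy w.r.t. x
    d2E = w             # second derivative (constant for this energy)
    return np.mean(0.5 * dE**2 - d2E)

# Closed-form minimizer: w* = 1 / mean(x^2); for samples from
# N(0, sigma^2) this recovers the true precision 1/sigma^2,
# without ever evaluating the normalizer Z(w).
w_star = 1.0 / np.mean(x**2)
print(w_star)  # close to 1/sigma^2 = 0.25
```

The minimizer coincides with the maximum likelihood estimate here because the model is Gaussian; the point of the construction is that only derivatives of the energy enter the loss.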
CD is an MCMC method that has been successfully applied to Restricted Boltzmann Machines (RBMs) [8], overcomplete Independent Component Analysis (ICA) [9], and convolutional variants of ICA and RBMs [21, 19]. Basis Rotation [23] works by restricting weight updates such that they are probability mass-neutral. SM and NCE are consistent estimators [10, 6], while CD estimation has been shown to be generally asymptotically biased [4]. No consistency results are known for Basis Rotation, to our knowledge. NCE is a promising method, but unfortunately too new to be included in our experiments. CD and Basis Rotation estimation will be used as a basis for comparison.

In section 2 a regularizer is proposed that makes Score Matching applicable to a much broader class of problems. In section 3 we show how computation and differentiation of the SM loss can be performed in automated fashion. In section 4 we report encouraging quantitative experimental results.

2 Regularized Score Matching

Consider an energy-based [17] model E(x; w), where “energy” is the unnormalized negative log-density, such that the pdf is p(x; w) = e^{−E(x; w)}/Z(w), where Z(w) is the normalizing constant. In other words, low energies correspond to high probability density, and high energies correspond to low probability density.

Score Matching works by fitting the slope (score) of the model density to the slope of the true, underlying density at the data points, which is obviously independent of the vertical offset of the log-density (the normalizing constant). Hyvärinen [10] shows that under some conditions, this objective is equivalent to minimizing the following expression, which involves only first and second partial derivatives of the model density:

J(w) = ∫_{x∈ℝ^N} p_x(x) Σ_{i=1}^N ( ½ (∂E(x; w)/∂x_i)² − ∂²E(x; w)/(∂x_i)² ) dx + const    (1)

with N-dimensional data vector x, weight vector w, and true underlying pdf p_x(x). Among the conditions¹ are (1) that p_x(x) is differentiable, and (2) that the log-density is finite everywhere. In practice, the true pdf is unknown, and we have a finite sample of T discrete data points. The sample version of the SM loss function is:

J^S(w) = (1/T) Σ_{t=1}^T Σ_{i=1}^N ( ½ (∂E(x(t); w)/∂x_i)² − ∂²E(x(t); w)/(∂x_i)² )    (2)

which is asymptotically equivalent to equation (1) as T approaches infinity, by the law of large numbers. This loss function was used in previous publications on SM [10, 12, 13, 15].

2.1 Issues

Should these conditions be violated, then (theoretically) the pdf cannot be estimated using equation (1). Only some specific special-case solutions exist, e.g. for non-negative data [11]. Unfortunately, situations where the mentioned conditions are violated are not rare. The distribution of quantized data (like images) is discontinuous, hence not differentiable, since the data points are concentrated at a finite number of discrete positions. Moreover, the fact that equation (2) is only equivalent to equation (1) as T approaches infinity may cause problems: the distribution of any finite training set of discrete data points is discrete, hence not differentiable. For proper estimation with SM, data can be smoothed by whitening; however, common whitening methods (such as PCA or SVD) are computationally infeasible for large data dimensionality, and generally destroy the local structure of spatial and temporal data such as images and audio. Some previous publications on Score Matching apply zero-phase whitening (ZCA) [13], which computes a weighted sum over an input patch, removes some of the original quantization, and can potentially be applied convolutionally. 
However, the amount of information removed from the input by such whitening is not parameterized and potentially large.

¹ The conditions are: the true (underlying) pdf p_x(x) is differentiable, the expectations E[‖∂ log p_x(x)/∂x‖²] and E[‖∂E(x; w)/∂x‖²] w.r.t. x are finite for any w, and p_x(x) ∂E(x; w)/∂x goes to zero for any w when ‖x‖ → ∞.

2.2 Proposed solution

Our proposed solution is the addition of a regularization term to the loss, approximately equivalent to replacing each data point x with a Gaussian cloud of virtual datapoints (x + ε) with i.i.d. Gaussian noise ε ∼ N(0, σ²I). By this replacement, the sample pdf becomes smooth and the conditions for proper SM estimation become satisfied. The expected value of the sample loss is:

E[J^S(x + ε; w)] = ½ Σ_{i=1}^N E[(∂E(x + ε; w)/∂(x_i + ε_i))²] − Σ_{i=1}^N E[∂²E(x + ε; w)/(∂(x_i + ε_i))²]    (3)

We approximate the first and second terms with a simple first-order Taylor expansion. Recall that since the noise is i.i.d. Gaussian, E[ε_i] = 0, E[ε_i ε_j] = E[ε_i] E[ε_j] = 0 if i ≠ j, and E[ε_i²] = σ².

The expected value of the first term is:

½ Σ_{i=1}^N E[(∂E(x + ε; w)/∂(x_i + ε_i))²]
  = ½ Σ_{i=1}^N E[( ∂E(x; w)/∂x_i + Σ_{j=1}^N (∂²E(x; w)/∂x_i∂x_j) ε_j + O(ε_i²) )²]
  = ½ Σ_{i=1}^N (∂E(x; w)/∂x_i)² + ½ σ² Σ_{i=1}^N Σ_{j=1}^N (∂²E(x; w)/∂x_i∂x_j)² + Ô(ε_i²)    (4)

The expected value of the second term is:

Σ_{i=1}^N E[∂²E(x + ε; w)/(∂(x_i + ε_i))²]
  = Σ_{i=1}^N E[ ∂²E(x; w)/(∂x_i)² + Σ_{j=1}^N (∂³E(x; w)/∂x_i∂x_i∂x_j) ε_j + O(ε_i²) ]
  = Σ_{i=1}^N ∂²E(x; w)/(∂x_i)² + O(ε_i²)    (5)

Putting the terms back together, we have:

E[J^S(x + ε; w)] = ½ Σ_{i=1}^N (∂E/∂x_i)² − Σ_{i=1}^N ∂²E/(∂x_i)² + ½ σ² Σ_{i=1}^N Σ_{j=1}^N (∂²E/∂x_i∂x_j)² + Ô(ε²)    (6)

where E = E(x; w). This is the full regularized Score Matching loss. While minimization of the above loss may be feasible in some situations, in general it requires differentiation of the full Hessian w.r.t. x, which scales like O(W²). However, the off-diagonal elements of the Hessian are often dominated by the diagonal. 
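The smoothing identity behind equation (6) can be checked numerically on a toy quadratic energy, where the Taylor expansion is exact (a sketch under our own assumptions, not code from the paper):

```python
import numpy as np

# Check of the identity behind eq. (6) on a toy quadratic energy
# E(x) = 0.5 * x^T A x (hypothetical example; the Hessian is A, so
# the first-order Taylor expansion of the noised loss is exact):
#   E_eps[ J_S(x + eps) ] = J_S(x) + 0.5 * sigma^2 * sum_ij A_ij^2
A = np.array([[2.0, 0.5, 0.0],
              [0.5, 1.5, 0.3],
              [0.0, 0.3, 1.0]])        # symmetric Hessian of the energy
x = np.array([0.5, -1.0, 0.25])
sigma = 0.1

def sm_loss(x):
    grad = A @ x                        # dE/dx_i
    return 0.5 * np.sum(grad**2) - np.trace(A)   # eq. (2) for one sample

rng = np.random.default_rng(1)
eps = rng.normal(0.0, sigma, size=(500_000, 3))  # Gaussian cloud around x
grads = (x + eps) @ A                   # A is symmetric: rows are A(x+eps)
mc = np.mean(0.5 * np.sum(grads**2, axis=1)) - np.trace(A)
reg = sm_loss(x) + 0.5 * sigma**2 * np.sum(A**2)  # regularized loss, eq. (6)
print(mc, reg)  # Monte-Carlo estimate matches the regularized loss
```

For non-quadratic energies the match only holds to the order of the Taylor remainder, which is what the Ô(ε²) term in equation (6) records.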
Therefore, we will use the diagonal approximation:

J_reg(x; w; λ) = J^S(x; w) + λ Σ_{i=1}^N (∂²E/(∂x_i)²)²    (7)

where λ sets the regularization strength and is related to (but not exactly equal to) ½σ² in equation (6). This regularized loss is computationally convenient: the added complexity is almost negligible, since differentiation of the second-derivative terms ∂²E/(∂x_i)² w.r.t. the weights is already required for unregularized Score Matching. The regularizer is related to Tikhonov regularization [22] and curvature-driven smoothing [2], where the square of the curvature of the energy surface at the data points is also penalized. However, its application has been limited since (contrary to our case) in the general case it adds considerable computational cost.

Figure 1: Illustration of local computational flow around some node j. Black lines: computation of the quantities δ_j = ∂E/∂g_j, δ′_j = ∂²E/(∂g_j)² and the SM loss J(x; w). Red lines indicate computational flow for differentiation of the Score Matching loss: computation of e.g. ∂J/∂δ_j and ∂J/∂g_j. The influence of the weights, for which the derivatives are computed in the last step, is not shown.

3 Automatic differentiation of J(x; w)

In most optimization methods for energy-based models [17], the sample loss is defined in terms of quantities readily obtained by forward inference in the model. In such situations, the required derivatives w.r.t. 
the weights can be obtained in a straightforward and efficient fashion by standard application of the backpropagation algorithm.

For Score Matching, the situation is more complex, since the (regularized) loss (equations 2, 7) is defined in terms of {∂E/∂x_i} and {∂²E/(∂x_i)²}, each term being some function of x and w. In earlier publications on Score Matching for continuous variables [10, 12, 13, 15], the authors rewrote {∂E/∂x_i} and {∂²E/(∂x_i)²} to their explicit forms in terms of x and w by manually differentiating the energy². Subsequently, derivatives of the loss w.r.t. the weights can be found. This manual differentiation was repeated for different models, and is arguably a rather inflexible approach. A procedure that could automatically (1) compute and (2) differentiate the loss would make SM estimation more accessible and flexible in practice.

A large class of models (e.g. ICA, Product-of-Experts and Fields-of-Experts) can be interpreted as a form of feed-forward neural network. Consequently, the terms {∂E/∂x_i} and {∂²E/(∂x_i)²} can be efficiently computed using a forward and a backward pass: the first pass performs forward inference (computation of E(x; w)) and the second pass applies the backpropagation algorithm [3] to obtain the derivatives of the energy w.r.t. the data point ({∂E/∂x_i} and {∂²E/(∂x_i)²}). However, only the loss J(x; w) is obtained by these two steps. For differentiation of this loss, one must perform an additional forward and backward pass.

3.1 Obtaining the loss

Consider a feed-forward neural network with input vector x, weights w and an ordered set of nodes indexed 1 . . . N, each node j with child nodes i ∈ children(j) with j < i and parent nodes k ∈ parents(j) with k < j. The first D < N nodes are input nodes, for which the activation value is g_j = x_j. For the other nodes (hidden units and the output unit), the activation value is determined by a differentiable scalar function g_j({g_i}_{i∈parents(j)}, w). The network's “output” (energy) is determined as the activation of the last node: E(x; w) = g_N(.). The values δ_j = ∂E/∂g_j are efficiently computed by backpropagation. However, backpropagation of the full Hessian scales like O(W²), where W is the number of model weights. Here, we limit backpropagation to the diagonal approximation, which scales like O(W) [1]. This will still result in the correct gradients ∂²E/(∂x_j)² for one-layer models and the models considered in this paper. Rewriting the equations for the full Hessian is a straightforward exercise. For brevity, we write δ′_j = ∂²E/(∂g_j)². The SM loss is split into two terms: J(x; w) = K + L with K = ½ Σ_{j=1}^D δ_j² and L = Σ_{j=1}^D ( −δ′_j + λ(δ′_j)² ). The equations for inference and backpropagation are given as the first two for-loops in Algorithm 1.

² Most previous publications do not express the unnormalized neg. 
log-density as “energy”.

Algorithm 1: Compute ∇_w J. See sections 3.1 and 3.2 for context.

    Input: x, w (data and weight vectors)
    // Forward propagation
    for j ← D+1 to N do
        compute g_j(.)
        for i ∈ parents(j) do
            compute ∂g_j/∂g_i, ∂²g_j/(∂g_i)², ∂³g_j/(∂g_i)³
    // Backpropagation
    δ_N ← 1;  δ′_N ← 0
    for j ← N−1 to 1 do
        δ_j ← Σ_{k∈children(j)} δ_k ∂g_k/∂g_j
        δ′_j ← Σ_{k∈children(j)} δ_k ∂²g_k/(∂g_j)² + δ′_k (∂g_k/∂g_j)²
    // SM forward propagation
    for j ← 1 to D do
        ∂K/∂δ_j ← δ_j;  ∂L/∂δ_j ← 0;  ∂L/∂δ′_j ← −1 + 2λδ′_j
    for j ← D+1 to N do
        ∂K/∂δ_j ← Σ_{i∈parents(j)} (∂K/∂δ_i) ∂g_j/∂g_i
        ∂L/∂δ_j ← Σ_{i∈parents(j)} (∂L/∂δ′_i) ∂²g_j/(∂g_i)² + (∂L/∂δ_i) ∂g_j/∂g_i
        ∂L/∂δ′_j ← Σ_{i∈parents(j)} (∂L/∂δ′_i) (∂g_j/∂g_i)²
    // SM backward propagation
    for j ← N to D+1 do
        ∂K/∂g_j ← Σ_{k∈children(j)} (∂K/∂g_k) ∂g_k/∂g_j + (∂K/∂δ_j) δ_k ∂²g_k/(∂g_j)²
        ∂L/∂g_j ← Σ_{k∈children(j)} (∂L/∂g_k) ∂g_k/∂g_j + (∂L/∂δ_j) δ_k ∂²g_k/(∂g_j)²
                  + 2 (∂L/∂δ′_j) δ′_k (∂g_k/∂g_j) ∂²g_k/(∂g_j)² + (∂L/∂δ′_j) δ_k ∂³g_k/(∂g_j)³
    // Derivatives w.r.t. weights
    for w ∈ w do
        ∂J/∂w ← Σ_{j=D+1}^N (∂K/∂g_j + ∂L/∂g_j) ∂g_j/∂w + (∂K/∂δ_j + ∂L/∂δ_j) ∂δ_j/∂w + (∂L/∂δ′_j) ∂δ′_j/∂w

3.2 Differentiating the loss

Since the computation of the loss J(x; w) is performed by a deterministic forward-backward mechanism, this two-step computation can be interpreted as a combination of two networks: the original network for computing {g_j} and E(x; w), and an appended network for computing {δ_j}, {δ′_j} and eventually J(x; w). See figure 1. The combined network can be differentiated by an extended version of the double-backpropagation procedure [5], with the main difference that the appended network not only computes {δ_j}, but also {δ′_j}. Automatic differentiation of the combined network consists of two phases, corresponding to reverse traversal of the appended and original network respectively: (1) obtaining ∂K/∂δ_j, ∂L/∂δ_j and ∂L/∂δ′_j for each node j in order 1 to N; (2) obtaining ∂J/∂g_j for each node j in order N to D+1. These procedures are given as the last two for-loops in Algorithm 1. The complete algorithm scales like O(W).

4 Experiments

Consider the following Product-of-Experts (PoE) model:

E(x; W, α) = Σ_{i=1}^M α_i g(w_i^T x)    (8)

where M is the number of experts, w_i (the i-th row of W) is an image filter, and the α_i are scaling parameters. As in [10], the filters are L2-normalized to prevent a large portion of them from vanishing. We use a slightly modified Student's t-distribution (g(z) = log((cz)²/2 + 1)) for the latent space, so this is also a Product of Student's t-distributions model [24]. The parameter c is a non-learnable horizontal scaling parameter, set to e^{1.5}. 
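To make concrete what the quantities {∂E/∂x_i} and {∂²E/(∂x_i)²} look like for the PoE energy of equation (8), here is a sketch with toy sizes and random filters (our own illustrative assumptions), verified against finite differences rather than the paper's automatic-differentiation algorithm:

```python
import numpy as np

# Sketch of the PoE energy of eq. (8) with g(z) = log((c z)^2/2 + 1),
# and the derivative quantities that the SM loss (eqs. 2, 7) is built
# from. Sizes, filters and scales below are hypothetical.
rng = np.random.default_rng(2)
M, N = 6, 4                                     # experts, input dimension
W = rng.normal(size=(M, N))
W /= np.linalg.norm(W, axis=1, keepdims=True)   # L2-normalized filters
alpha = np.exp(rng.normal(size=M))              # positive scales
c = np.exp(1.5)

g   = lambda z: np.log((c * z)**2 / 2 + 1)
gp  = lambda z: c**2 * z / ((c * z)**2 / 2 + 1)                      # g'
gpp = lambda z: (c**2 * ((c*z)**2/2 + 1) - (c**2 * z)**2) / ((c*z)**2/2 + 1)**2  # g''

def energy(x):
    return np.sum(alpha * g(W @ x))

def energy_dx(x):                  # dE/dx_i
    return W.T @ (alpha * gp(W @ x))

def energy_d2x_diag(x):            # d2E/dx_i^2 (Hessian diagonal)
    return (W**2).T @ (alpha * gpp(W @ x))

x = rng.normal(size=N)
I = np.eye(N)
h1, h2 = 1e-5, 1e-4                # finite-difference step sizes
fd_grad = np.array([(energy(x + h1*I[i]) - energy(x - h1*I[i])) / (2*h1)
                    for i in range(N)])
fd_diag = np.array([(energy(x + h2*I[i]) - 2*energy(x) + energy(x - h2*I[i])) / h2**2
                    for i in range(N)])
# both maximum differences should be near zero
print(np.max(np.abs(fd_grad - energy_dx(x))),
      np.max(np.abs(fd_diag - energy_d2x_diag(x))))
```

Note that the Hessian diagonal separates expert-wise, which is why the diagonal approximation of section 3 is cheap for this model class.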
The vertical scaling parameters α_i are restricted to be positive by setting α_i = exp β_i, where β_i is the actual weight.

4.1 MNIST

The first task is to estimate a density model of the MNIST handwritten digits [16]. Since a large number of models need to be learned, a 2× downsampled version of MNIST was used. The MNIST dataset is highly non-smooth: for each pixel, the extreme values (0 and 1) are highly frequent, leading to sharp discontinuities in the data density at these points. It is well known that for models with a square weight matrix W, normalized g(.) (meaning ∫_{−∞}^{∞} exp(−g(x)) dx = 1) and α_i = 1, the normalizing constant can be computed [10]: Z(w) = |det W|. For this special case, models can be compared by computing the log-likelihood of the training and test sets. Unregularized and regularized models for different choices of λ were estimated and log-likelihood values were computed. Subsequently, these models were compared on a classification task. For each MNIST digit class, a small sample of 100 data points was converted to internal features by the different models. These features, combined with the original class labels, were subsequently used to train a logistic regression classifier for each model. For the PoE model, the “activations” g(w_i^T x) were used as features. Classification error on the test set was compared against reported results for optimal RBM and SESM models [20].

Results. As expected, unregularized estimation did not result in an accurate model. Figure 2 shows how the log-likelihood of the training and test sets is optimal at λ* ≈ 0.01, and decreases for smaller λ. Coincidentally, the classification performance is optimal for the same choice of λ.

4.2 Denoising

Consider grayscale natural image data from the Berkeley dataset [18]. The data is quantized and therefore non-smooth, so regularization is potentially beneficial. In order to estimate the correct regularization magnitude, we again estimated a PoE model as in equation (8) with square W, such that Z(w) = |det W|, and computed the log-likelihood of 10,000 random patches under different regularization levels. We found that λ* ≈ 10⁻⁵ for maximum likelihood (see figure 2d). This value is lower than for MNIST data since natural image data is “less unsmooth”. Subsequently, a convolutional PoE model known as Fields-of-Experts [21] (FoE) was estimated using regularized SM:

E(x; W, α) = Σ_p Σ_{i=1}^M α_i g(w_i^T x(p))    (9)

where p runs over image positions, and x(p) is a square image patch at p. The first model has the same architecture as the CD-1 trained model in [21]: 5 × 5 receptive fields, 24 experts (M = 24), and α_i and g(.) as in our PoE model. Note that qualitative results of a similar model estimated with SM have been reported earlier [15]. We found that for best performance, the model is learned on images “whitened” with a 5 × 5 Laplacian kernel. This is approximately equivalent to the ZCA whitening used in [15].

Models are evaluated by means of Bayesian denoising using maximum a posteriori (MAP) estimation. As in a general Bayesian image restoration framework, the goal is to estimate the original input x given a noisy image y using the Bayesian proportionality p(x|y) ∝ p(y|x)p(x). The assumption is white Gaussian noise, such that the likelihood is p(y|x) = N(y; x, σ²I). The energy E(x; w) = −log p(x; w) − log Z(w) serves as our prior. 
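The MAP scheme described above can be sketched as gradient ascent on the log-posterior. The following toy version (our own assumptions: a 1-D signal and a hypothetical quadratic smoothness energy standing in for the learned FoE prior, with the paper's annealed step-size schedule) shows the mechanics; any differentiable E(x; w) slots into `energy_grad`:

```python
import numpy as np

# Minimal MAP-denoising sketch. A quadratic smoothness energy
# E(x) = 0.5*lam*sum((x[i+1]-x[i])^2) is a hypothetical stand-in
# for the learned FoE prior of the paper.
rng = np.random.default_rng(3)
n = 200
clean = np.sin(np.linspace(0, 4*np.pi, n))
sigma = 0.2
y = clean + rng.normal(0, sigma, n)            # noisy observation

lam = 5.0
def energy_grad(x):                            # dE/dx for the toy prior
    g = np.zeros_like(x)
    d = np.diff(x)
    g[:-1] -= lam * d
    g[1:]  += lam * d
    return g

x = y.copy()                                   # initialize at the noisy signal
for step in range(300):                        # ascend log p(x|y)
    alpha = 2e-2 * (5e-4 / 2e-2)**(step/299)   # anneal 2e-2 -> 5e-4, as in the paper
    x += alpha * (-energy_grad(x) + (y - x) / sigma**2)

print(np.mean((y - clean)**2), np.mean((x - clean)**2))  # denoised MSE is lower
```

The update combines the prior's energy gradient with the Gaussian data term (y − x)/σ², exactly the two forces in the log-posterior gradient.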
The gradient of the log-posterior is:

∇_x log p(x|y) = −∇_x E(x; w) − (1/(2σ²)) ∇_x Σ_{i=1}^N (y_i − x_i)²    (10)

Denoising is performed by initializing x to the noisy image, followed by 300 steps of steepest ascent x′ ← x + α ∇_x log p(x|y), with α annealed from 2·10⁻² to 5·10⁻⁴. For comparison, we ran the same denoising procedure with models estimated by CD-1 and Basis Rotation, from [21] and [23] respectively. Note that the CD-1 model is trained using PCA whitening. The CD-1 model has been extensively applied to denoising before [21] and shown to compare favourably to specialized denoising methods.

Results. Training of the convolutional model took about 1 hour on a 2 GHz machine. Regularization turns out to be important for optimal denoising (see figure 2[e–g]). See table 1 for the denoising performance of the optimal model on specific standard images. Our model performed significantly better than the Basis Rotation model and slightly better than the CD-1 model. As reported earlier in [15], we can verify that the filters are completely intuitive (Gabor filters with different phase, orientation and scale), unlike the filters of the CD-1 and Basis Rotation models (see figure 2[i–l]).

Figure 2: (a) Top: selection of downsampled MNIST datapoints. Middle and bottom: random sample of filters from unregularized and regularized (λ = 0.01) models, respectively. (b) Average log-likelihood of MNIST digits in training and test sets for choices of λ. Note that λ* ≈ 0.01, both for maximum likelihood and optimal classification. (c) Test set error of a logistic regression classifier learned on top of features, with only 100 samples per class, for different choices of λ. Optimal error rates of SESM and RBM (figure 1a in [20]) are shown for comparison. (d) Log-likelihood of 10,000 random natural image patches for the complete model, for different choices of λ. (e–g) PSNR of 500 denoised images, for different levels of noise (σ_noise = 1/256, 5/256, 15/256) and choices of λ. Note that λ* ≈ 10⁻⁵, both for maximum likelihood and best denoising performance. (h) Some natural images from the Berkeley dataset. (i) Filters of a model with 5 × 5 × 24 weights learned with CD-1 [21], (j) filters of our model with 5 × 5 × 24 weights, (k) random selection of filters from the Basis Rotation [23] model with 15 × 15 × 25 weights, (l) random selection of filters from our model with 8 × 8 × 64 weights. (m) Detail of the original Lena image. (n) Detail with noise added (σ_noise = 5/256). (o) Denoised with the model learned with CD-1 [21], (p) Basis Rotation [23], (q) Score Matching with (near) optimal regularization.

Table 1: Peak signal-to-noise ratio (PSNR) of denoised images with σ_noise = 5/256. Shown errors are aggregated over different noisy images.

Image   | CD-1 (5×5)×24 | Basis Rotation (15×15)×25 | Our model (5×5)×24
Barbara | 37.30±0.01    | 37.08±0.02                | 37.31±0.01
Peppers | 37.63±0.01    | 37.09±0.02                | 37.41±0.03
House   | 37.85±0.02    | 37.73±0.03                | 38.03±0.04
Lena    | 38.16±0.02    | 37.97±0.01                | 38.19±0.01
Boat    | 36.33±0.01    | 36.21±0.01                | 36.53±0.01

4.3 Super-resolution

In addition, models are compared with respect to their performance on a simple version of super-resolution, as follows. An original image x_orig is downsampled to an image x_small by averaging blocks of 2 × 2 pixels into a single pixel. A first approximation x is computed by linearly scaling up x_small and subsequently applying a low-pass filter to remove false high-frequency information. The image is then fine-tuned by 200 repetitions of two subsequent steps: (1) refining the image slightly using x′ ← x − α ∇_x E(x; w), with α annealed from 2·10⁻² to 5·10⁻⁴; (2) updating each k × k block of pixels such that their average corresponds to the down-sampled value. Note: the simple block-downsampling results in serious aliasing artifacts in the Barbara image, so the Castle image is used instead.

Results. 
PSNR values for standard images are shown in table 2. The considered models give only slight improvements in PSNR over the initial low-pass-filtered solution. Still, our model did slightly better than the CD-1 and Basis Rotation models.

Table 2: Peak signal-to-noise ratio (PSNR) of super-resolved images for different models.

Image   | Low-pass filter | CD-1 (5×5)×24 | Basis Rotation (15×15)×25 | Our model (5×5)×24
Peppers | 27.54           | 29.11         | 27.69                     | 29.76
House   | 33.15           | 33.53         | 33.41                     | 33.48
Lena    | 32.39           | 33.31         | 33.07                     | 33.46
Boat    | 29.20           | 30.81         | 30.77                     | 30.82
Castle  | 24.19           | 24.15         | 24.26                     | 24.31

5 Conclusion

We have shown how the addition of a principled regularization term to the Score Matching loss lifts continuity assumptions on the data density, such that the estimation method becomes more generally applicable. The effectiveness of the regularizer was verified on the discontinuous MNIST and Berkeley datasets, with respect to the likelihood of test data under the model. For both datasets, the optimal regularization parameter is approximately equal for likelihood and for the subsequent classification and denoising tasks. In addition, we showed how computation and differentiation of the Score Matching loss can be automated using an efficient algorithm.

References

[1] S. Becker and Y. LeCun. Improving the convergence of back-propagation learning with second-order methods. In D. Touretzky, G. Hinton, and T. Sejnowski, editors, Proc. of the 1988 Connectionist Models Summer School, pages 29–37, San Mateo, 1989. Morgan Kaufmann.

[2] C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, Oxford, UK, 1996.

[3] A. E. Bryson and Y. C. Ho. Applied Optimal Control: Optimization, Estimation, and Control. Blaisdell Pub. Co., Waltham, Massachusetts, 1969.

[4] M. A. Carreira-Perpinan and G. E. Hinton. 
On contrastive divergence learning. In Artificial Intelligence and Statistics, 2005.

[5] H. Drucker and Y. LeCun. Improving generalization performance using double backpropagation. IEEE Transactions on Neural Networks, 3(6):991–997, 1992.

[6] M. Gutmann and A. Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proc. Int. Conf. on Artificial Intelligence and Statistics (AISTATS 2010), 2010.

[7] G. E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800, 2002.

[8] G. E. Hinton, S. Osindero, and Y. W. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, 2006.

[9] G. E. Hinton, S. Osindero, M. Welling, and Y. W. Teh. Unsupervised discovery of non-linear structure using contrastive backpropagation. Cognitive Science, 30(4):725–731, 2006.

[10] A. Hyvärinen. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6:695–709, 2005.

[11] A. Hyvärinen. Some extensions of score matching. Computational Statistics & Data Analysis, 51(5):2499–2512, 2007.

[12] A. Hyvärinen. Optimal approximation of signal priors. Neural Computation, 20:3087–3110, 2008.

[13] U. Köster and A. Hyvärinen. A two-layer ICA-like model estimated by score matching. In J. M. de Sá, L. A. Alexandre, W. Duch, and D. P. Mandic, editors, ICANN (2), volume 4669 of Lecture Notes in Computer Science, pages 798–807. Springer, 2007.

[14] U. Köster, J. T. Lindgren, and A. Hyvärinen. Estimating Markov random field potentials for natural images. Proc. Int. Conf. on Independent Component Analysis and Blind Source Separation (ICA 2009), 2009.

[15] U. Köster, J. T. Lindgren, and A. Hyvärinen. Estimating Markov random field potentials for natural images. In T. Adali, C. Jutten, J. M. T. Romano, and A. K. Barros, editors, ICA, volume 5441 of Lecture Notes in Computer Science, pages 515–522. Springer, 2009.

[16] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, pages 2278–2324, 1998.

[17] Y. LeCun, S. Chopra, R. Hadsell, M. Ranzato, and F. Huang. A tutorial on energy-based learning. In G. Bakir, T. Hofman, B. Schölkopf, A. Smola, and B. Taskar, editors, Predicting Structured Data. MIT Press, 2006.

[18] D. Martin, C. Fowlkes, D. Tal, and J. Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proc. 8th Int'l Conf. Computer Vision, volume 2, pages 416–423, July 2001.

[19] S. Osindero and G. E. Hinton. Modeling image patches with a directed hierarchy of Markov random fields. In J. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 1121–1128. MIT Press, Cambridge, MA, 2008.

[20] M. Ranzato, Y. Boureau, and Y. LeCun. Sparse feature learning for deep belief networks. In Advances in Neural Information Processing Systems (NIPS 2007), 2007.

[21] S. Roth and M. J. Black. Fields of experts. International Journal of Computer Vision, 82(2):205–229, 2009.

[22] A. N. Tikhonov. On the stability of inverse problems. Dokl. Akad. Nauk SSSR, (39):176–179, 1943.

[23] Y. Weiss and W. T. Freeman. What makes a good model of natural images? In CVPR 2007: Proceedings of the 2007 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 1–8, 2007.

[24] M. Welling, G. E. Hinton, and S. Osindero. Learning sparse topographic representations with products of Student-t distributions. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, pages 1359–1366. MIT Press, Cambridge, MA, 2003.
", "award": [], "sourceid": 1300, "authors": [{"given_name": "Durk", "family_name": "Kingma", "institution": null}, {"given_name": "Yann", "family_name": "Cun", "institution": null}]}