{"title": "Wasserstein Training of Restricted Boltzmann Machines", "book": "Advances in Neural Information Processing Systems", "page_first": 3718, "page_last": 3726, "abstract": "Boltzmann machines are able to learn highly complex, multimodal, structured and multiscale real-world data distributions. Parameters of the model are usually learned by minimizing the Kullback-Leibler (KL) divergence from training samples to the learned model. We propose in this work a novel approach for Boltzmann machine training which assumes that a meaningful metric between observations is known. This metric between observations can then be used to define the Wasserstein distance between the distribution induced by the Boltzmann machine on the one hand, and that given by the training sample on the other hand. We derive a gradient of that distance with respect to the model parameters. Minimization of this new objective leads to generative models with different statistical properties. We demonstrate their practical potential on data completion and denoising, for which the metric between observations plays a crucial role.", "full_text": "Wasserstein Training of\n\nRestricted Boltzmann Machines\n\nGr\u00e9goire Montavon\n\nTechnische Universit\u00e4t Berlin\n\nKlaus-Robert M\u00fcller\u2217\n\nTechnische Universit\u00e4t Berlin\n\ngregoire.montavon@tu-berlin.de\n\nklaus-robert.mueller@tu-berlin.de\n\nMarco Cuturi\n\nCREST, ENSAE, Universit\u00e9 Paris-Saclay\n\nmarco.cuturi@ensae.fr\n\nAbstract\n\nBoltzmann machines are able to learn highly complex, multimodal, structured\nand multiscale real-world data distributions. Parameters of the model are usually\nlearned by minimizing the Kullback-Leibler (KL) divergence from training samples\nto the learned model. We propose in this work a novel approach for Boltzmann\nmachine training which assumes that a meaningful metric between observations is\nknown. 
This metric between observations can then be used to define the Wasserstein distance between the distribution induced by the Boltzmann machine on the one hand, and that given by the training sample on the other hand. We derive a gradient of that distance with respect to the model parameters. Minimization of this new objective leads to generative models with different statistical properties. We demonstrate their practical potential on data completion and denoising, for which the metric between observations plays a crucial role.

1 Introduction

Boltzmann machines [1] are powerful generative models that can be used to approximate a large class of real-world data distributions, such as handwritten characters [9], speech segments [7], or multimodal data [16]. Boltzmann machines share similarities with neural networks in their capability to extract features at multiple scales, and to build well-generalizing hierarchical data representations [15, 13]. The restricted Boltzmann machine (RBM) is a special type of Boltzmann machine composed of one layer of latent variables, and defining a probability distribution p_θ(x) over a set of d binary observed variables whose state is represented by the binary vector x ∈ {0,1}^d, and with a parameter vector θ to be learned.

Given an empirical probability distribution p̂(x) = (1/N) Σ_{n=1}^N δ_{x_n}(x), where (x_n)_n is a list of N observations in {0,1}^d, an RBM can be trained using information-theoretic divergences (see for example [12]) by minimizing with respect to θ a divergence ∆(p̂, p_θ) between the sample empirical measure p̂ and the modeled distribution p_θ.
When ∆ is for instance the KL divergence, this approach results in the well-known Maximum Likelihood Estimator (MLE), which yields gradients for θ of the form

∇_θ KL(p̂ ‖ p_θ) = −(1/N) Σ_{n=1}^N ∇_θ log p_θ(x_n) = −⟨∇_θ log p_θ(x)⟩_p̂,    (1)

where the bracket notation ⟨·⟩_p indicates an expectation with respect to p. Alternative choices for ∆ are the Bhattacharyya/Hellinger and Euclidean distances between distributions, or more generally F-divergences or M-estimators [10]. They all result in comparable gradient terms, that try to adjust θ so that the fitting terms p_θ(x_n) grow as large as possible.

*Also with the Department of Brain and Cognitive Engineering, Korea University.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

We explore in this work a different scenario: what if θ is chosen so that p_θ(x) is large, on average, when x is close to a data point x_n in some sense, but not necessarily when x coincides exactly with x_n? To adopt such a geometric criterion, we must first define what closeness between observations means. In almost all applications of Boltzmann machines, such a metric between observations is readily available: One can for example consider the Hamming distance between binary vectors, or any other metric motivated by practical considerations². This being done, the geometric criterion we have drawn can be materialized by considering for ∆ the Wasserstein distance [20] (a.k.a. the Kantorovich or the earth mover's distance [14]) between measures. This choice was considered in theory by [2], who proved its statistical consistency, but was never considered practically to the best of our knowledge.
This paper describes a practical derivation of a minimum Kantorovich distance estimator [2] for Boltzmann machines, which can scale up to tens of thousands of observations. As will be described in this paper, recent advances in the fast approximation of Wasserstein distances [5] and their derivatives [6] play an important role in the practical implementation of these computations. Before describing this approach in detail, we would like to emphasize that measuring goodness-of-fit with the Wasserstein distance results in a considerably different perspective than that provided by a Kullback-Leibler/MLE approach. This difference is illustrated in Figure 1, where a probability p_θ can be close from a KL perspective to a given empirical measure p̂, but far from the same measure in the Wasserstein sense. Conversely, a different probability p_θ′ can miss the mark from a KL viewpoint but achieve a low Wasserstein distance to p̂. Before proceeding to the rest of this paper, let us mention that Wasserstein distances have a broad appeal for machine learning. That distance was for instance introduced in the context of supervised inference by [8], who used it to compute a predictive loss between the output of a multilabel classifier against its ground truth, or for domain adaptation, by [4].

Figure 1: Empirical distribution p̂(x) (gray) defined on the set of states {0,1}^d with d = 3, superposed to two possible models of it defined on the same set of states. The size of the circles indicates the probability mass for each state. For each model, we show its KL and Wasserstein divergences from p̂(x), and an explanation of why the divergences are low or high: a large/small overlap with p̂(x), or a large/small distance from p̂(x).

2 Minimum Wasserstein Distance Estimation

Consider two probabilities p, q in P(X), the set of probabilities on X = {0,1}^d.
Namely, two maps p, q in R_+^X such that Σ_x p(x) = Σ_x q(x) = 1, where we omit x ∈ X under the summation sign. Consider a cost function defined on X × X, typically a distance D : X × X → R_+. Given a constant γ ≥ 0, the γ-smoothed Wasserstein distance [5] is equal to

W_γ(p, q) = min_{π ∈ Π(p,q)} ⟨D(x, x′)⟩_π − γ H(π),    (2)

where Π(p, q) is the set of joint probabilities π on X × X such that Σ_{x′} π(x, x′) = p(x), Σ_x π(x, x′) = q(x′), and H(π) = −Σ_{xx′} π(x, x′) log π(x, x′) is the Shannon entropy of π. This optimization problem, a strictly convex program, has an equivalent dual formulation [6] which involves instead two real-valued functions α, β in R^X and which plays an important role in this paper:

W_γ(p, q) = max_{α,β ∈ R^X} ⟨α(x)⟩_p + ⟨β(x′)⟩_q − γ Σ_{xx′} e^{(α(x)+β(x′)−D(x,x′))/γ − 1}.    (3)

²When using the MLE principle, metric considerations play a key role to define densities p_θ, e.g. the reliance of Gaussian densities on Euclidean distances. This is the kind of metric we take for granted in this work.

Smooth Wasserstein Distances  The "true" Wasserstein distance corresponds to the case γ = 0, that is, when Equation (2) is stripped of the entropic term. One can easily verify that this definition matches the usual linear program used to describe Wasserstein/EMD distances [14].
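Equations (2) and (3) can be evaluated numerically by simple matrix scaling. The sketch below (NumPy; the function and variable names are ours and not taken from the paper's implementation) computes W_γ(p, q) for discrete distributions given a cost matrix D, following the Sinkhorn iterations of [5]:

```python
import numpy as np

def smoothed_wasserstein(p, q, D, gamma=0.1, n_iter=500):
    """gamma-smoothed Wasserstein distance of Eq. (2): <D>_pi - gamma * H(pi),
    with pi obtained by Sinkhorn scaling of the kernel K = exp(-D / gamma)."""
    K = np.exp(-D / gamma)
    u = np.ones_like(p)
    for _ in range(n_iter):
        v = q / (K.T @ u)          # enforce the column marginal q
        u = p / (K @ v)            # enforce the row marginal p
    pi = u[:, None] * K * v[None, :]              # optimal transport plan
    entropy = -np.sum(pi * np.log(pi + 1e-300))   # Shannon entropy H(pi)
    return float(np.sum(pi * D) - gamma * entropy)
```

As γ → 0 the value approaches the unregularized earth mover's distance, but smaller γ requires more iterations and, in practice, log-domain stabilization, which is omitted in this sketch.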
When γ → 0 in Equation (3), one also recovers the Kantorovich dual formulation, because the rightmost regularizer converges to the indicator function of the feasible set of the dual optimal transport problem, α(x) + β(x′) ≤ D(x, x′). We consider in this paper the case γ > 0 because it was shown in [5] to considerably facilitate computations, and in [6] to result in a divergence W_γ(p, q) which, unlike the case γ = 0, is differentiable w.r.t. the first variable. Looking at the dual formulation in Equation (3), one can see that this gradient is equal to α*, the centered optimal dual variable (the centering step for α* ensures the orthogonality with respect to the simplex constraint).

Sensitivity analysis gives a clear interpretation to the quantity α*(x): it measures the cost of each unit of mass placed by p at x when computing the Wasserstein distance W_γ(p, q). To decrease W_γ(p, q), it might thus be favorable to transfer mass in p from points where α(x) is high, and to place it on points where α(x) is low. This idea can be used, by a simple application of the chain rule, to minimize, given a fixed target probability p, the quantity W_γ(p_θ, p) with respect to θ.

Proposition 1. Let p_θ(x) = (1/Z) e^{−F_θ(x)} be a parameterized family of probability distributions, where F_θ(x) is a differentiable function of θ ∈ Θ, and write G_θ = ⟨∇_θ F_θ(x)⟩_{p_θ}. Let α* be the centered optimal dual solution of W_γ(p_θ, p) as described in Equation (3).
The gradient of the smoothed Wasserstein distance with respect to θ is given by

∇_θ W_γ(p_θ, p) = ⟨α*(x)⟩_{p_θ} G_θ − ⟨α*(x) ∇_θ F_θ(x)⟩_{p_θ}.    (4)

Proof. This result is a direct application of the chain rule: We have

∇_θ W_γ(p_θ, p) = (∂p_θ/∂θ)^T ∂W_γ(p_θ, q)/∂p_θ.

As mentioned in [6], the rightmost term is the optimal dual variable (the Kantorovich potential) ∂W_γ(p_θ, q)/∂p_θ = α*. The Jacobian (∂p_θ/∂θ) is a linear map Θ → X. For a given x′,

∂p_θ(x′)/∂θ = p_θ(x′) G_θ − ∇F_θ(x′) p_θ(x′).

As a consequence, (∂p_θ/∂θ)^T α* is the integral w.r.t. x′ of the term above multiplied by α*(x′), which results in Equation (4).

Comparison with the KL Fitting Error  The target distribution p plays a direct role in the formation of the gradient of KL(p̂ ‖ p_θ) w.r.t. θ through the term ⟨∇_θ F_θ(x)⟩_p in Equation (1). The Wasserstein gradient incorporates the knowledge of p in a different way, by considering, on the support of p_θ only, points x that correspond to high potentials (costs) α(x) when computing the distance of p_θ to p. A high potential at x means that the probability p_θ(x) should be lowered if one were to decrease W_γ(p_θ, p), by varying θ accordingly.

Sampling Approximation  The gradient in Equation (4) is intractable, since it involves solving an optimal (smoothed) transport problem over probabilities defined on 2^d states.
In practice, we replace expectations w.r.t. p_θ by an empirical distribution formed by sampling from the model p_θ (e.g. the PCD sample [18]). Given a sample (x̃_n)_n of size Ñ generated by the model, we define p̂_θ = (1/Ñ) Σ_{n=1}^{Ñ} δ_{x̃_n}. The tilde is used to differentiate the sample generated by the model from the empirical observations. Because the dual potential α* is centered and p̂_θ is a measure with uniform weights, ⟨α*(x)⟩_{p̂_θ} = 0, which simplifies the approximation of the gradient to

∇̂_θ W_γ(p_θ, p̂) = −(1/Ñ) Σ_{n=1}^{Ñ} α̂*(x̃_n) ∇_θ F_θ(x̃_n),    (5)

where α̂* is the solution of the discrete smooth Wasserstein dual between the two empirical distributions p̂ and p̂_θ, which have respectively supports of size N and Ñ. In practical terms, α̂* is a vector of size Ñ, one coefficient for each PCD sample, which can be computed by following the algorithm below [6]. To keep notations simple, we describe it in terms of generic probabilities p and q, having in mind these are in practice the training and simulated empirical measures p̂ and p̂_θ.

Computing α*  When γ > 0, the optimal variable α* corresponding to W_γ(p, q) can be recovered through the Sinkhorn algorithm with a cost which grows as the product |p||q| of the sizes of the supports of p and q, where |p| = Σ_x 1_{p(x)>0}. The algorithm is well known, but we adapt it here to our setting; see [6, Alg. 3] for a more precise description. To ease notations, we consider an arbitrary ordering of X, a set of cardinal 2^d, and identify its elements with indices 1 ≤ i ≤ 2^d.
Let I = (i_1, ..., i_{|p|}) be the ordered family of indices in the set {i | p(i) > 0}, and define J accordingly for q. I and J have respective lengths |p| and |q|. Form the matrix K = [e^{−D(i,j)/γ}]_{i∈I, j∈J} of size |p| × |q|. Choose now two positive vectors u ∈ R^{|p|}_{++} and v ∈ R^{|q|}_{++} at random, and repeat until u, v converge in some metric the operations u ← p/(Kv), v ← q/(K^T u). Upon convergence, the optimal variable α* is zero everywhere except for α*(i_a) = γ log(u_a/ũ), 1 ≤ a ≤ |p|, where ũ is the geometric mean of the vector u (which ensures that α* is centered).

3 Application to Restricted Boltzmann Machines

The restricted Boltzmann machine (RBM) is a generative model of binary data that is composed of d binary observed variables and h binary explanatory variables. The vector x ∈ {0,1}^d represents the state of observed variables, and the vector y ∈ {0,1}^h represents the state of explanatory variables. The RBM associates to each configuration x of observed variables a probability p_θ(x) defined as p_θ(x) = Σ_y e^{−E_θ(x,y)}/Z_θ, where E_θ(x,y) = −a^T x − Σ_{j=1}^h y_j (w_j^T x + b_j) is called the energy, and Z_θ = Σ_{x,y} e^{−E_θ(x,y)} is the partition function that normalizes the probability distribution to one. The parameters θ = (a, {w_j, b_j}_{j=1}^h) of the RBM are learned from the data. Knowing the state x of the observed variables, the explanatory variables are independent Bernoulli-distributed with Pr(y_j = 1 | x) = σ(w_j^T x + b_j), where σ is the logistic map z ↦ (1 + e^{−z})^{−1}. Conversely, knowing the state y of the explanatory variables, the observed variables on which the probability distribution is defined can also be sampled independently, leading to an efficient alternate Gibbs sampling procedure for p_θ. In this RBM model, explanatory variables can be analytically marginalized, which yields the following probability model:

p_θ(x) = e^{−F_θ(x)}/Z′_θ,

where F_θ(x) = −a^T x − Σ_{j=1}^h log(1 + exp(w_j^T x + b_j)) is the free energy associated to this model and Z′_θ = Σ_x e^{−F_θ(x)} is the partition function.

Wasserstein Gradient of the RBM  Having written the RBM in its free energy form, the Wasserstein gradient can be obtained by computing the gradient of F_θ(x) and injecting it in Equation (5):

∇̂_{w_j} W_γ(p̂, p_θ) = ⟨α*(x) σ(z_j) x⟩_{p̂_θ},

where z_j = w_j^T x + b_j. Gradients with respect to parameters a and {b_j}_j can also be obtained by the same means. In comparison, the gradient of the KL divergence is given by

∇̂_{w_j} KL(p̂ ‖ p_θ) = ⟨σ(z_j) x⟩_{p̂_θ} − ⟨σ(z_j) x⟩_p̂.

While the Wasserstein gradient can, in the same way as the KL gradient, be expressed in a very simple form, the first one is not sum-decomposable. A simple manifestation of the non-decomposability occurs for Ñ = 1 (smallest possible sample size): in that case, α(x̃_n) = 0 due to the centering constraint (see Section 2), thus making the gradient zero.

Stability and KL Regularization  Unlike the KL gradient, the Wasserstein gradient depends only indirectly on the data distribution p̂. This is a problem when the sample p̂_θ generated by the model strongly differs from the examples coming from p̂, because there is no weighting (α(x̃_n))_n of the generated sample that can represent the desired direction in the parameter space Θ. In that case, the Wasserstein gradient will point to a bad local minimum. Closeness between the two empirical samples from this optimization perspective can be ensured by adding a regularization term to the objective that incorporates both the usual quadratic containment term and the KL term, which forces proximity to p̂ due to the direct dependence of its gradient on it. The optimization problem becomes:

min_{θ∈Θ} W_γ(p̂, p_θ) + λ · Ω(θ)    with    Ω(θ) = KL(p̂ ‖ p_θ) + η · (‖a‖² + Σ_j ‖w_j‖²),

starting at the point θ₀ = arg min_{θ∈Θ} Ω(θ), and where λ, η are two regularization hyperparameters that must be selected. Determining the starting point θ₀ is analogous to having an initial pretraining phase. Thus, the proposed Wasserstein procedure can also be seen as finetuning a standard RBM, and forcing the finetuning not to deviate too much from the pretrained solution.

4 Experiments

We perform several experiments that demonstrate that Wasserstein-trained RBMs learn distributions that are better from a metric perspective. First, we explore what are the main characteristics of a learned distribution that optimizes the Wasserstein objective. Then, we investigate the usefulness of these learned models on practical problems, such as data completion and denoising, where the metric between observations occurs in the performance evaluation.
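As an illustration of Section 3, the free energy F_θ(x) and the Wasserstein weight gradient ⟨α*(x) σ(z_j) x⟩_{p̂_θ} can be sketched as follows (NumPy; the function names and batch conventions are ours, and the dual weights α* are assumed to come from the Sinkhorn procedure of Section 2 — this is a sketch, not the authors' implementation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def free_energy(X, a, W, b):
    """F_theta(x) = -a^T x - sum_j log(1 + exp(w_j^T x + b_j)).
    X: (N, d) batch of binary states, a: (d,), W: (h, d), b: (h,)."""
    Z = X @ W.T + b                   # row n, column j holds z_j = w_j^T x_n + b_j
    return -X @ a - np.logaddexp(0.0, Z).sum(axis=1)

def wasserstein_grad_W(alpha_star, X_model, W, b):
    """Estimate of <alpha*(x) sigma(z_j) x>_{p_theta_hat}: an (h, d) gradient,
    averaging over the model (PCD) sample with dual weights alpha_star."""
    S = sigmoid(X_model @ W.T + b)    # (N, h) hidden unit probabilities
    return (alpha_star[:, None, None] * S[:, :, None]
            * X_model[:, None, :]).mean(axis=0)
```

The gradients with respect to a and b follow the same pattern, since ∇_a F_θ(x) = −x and ∇_{b_j} F_θ(x) = −σ(z_j).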
We consider three datasets: MNIST-small, a subsampled version of the original MNIST dataset [11] with only the digits "0" retained, a subset of the UCI PLANTS dataset [19] containing the geographical spread of plant species, and MNIST-code, 128-dimensional code vectors associated to each MNIST digit (additional details in the supplement).

4.1 Training, Validation and Evaluation

All RBM models that we investigate are trained in full batch mode, using for p̂_θ the PCD approximation [18] of p_θ, where the sample is refreshed at each gradient update by one step of alternate Gibbs sampling, starting from the sample at the previous time step. We choose a PCD sample of the same size as the training set (N = Ñ). The coefficients α₁, ..., α_Ñ occurring in the Wasserstein gradient are obtained by solving the smoothed Wasserstein dual between p̂ and p̂_θ, with smoothing parameter γ = 0.1 and distance D(x, x′) = H(x, x′)/⟨H(x, x′)⟩_p̂, where H denotes the Hamming distance between two binary vectors. We use the centered parameterization of the RBM for gradient descent [13, 3]. We perform holdout validation on the quadratic containment coefficient η ∈ {10⁻⁴, 10⁻³, 10⁻²}, and on the KL weighting coefficient λ ∈ {0, 10⁻¹, 10⁰, 10¹, ∞}. The number of hidden units of the RBM is set heuristically to 400 for all datasets. The learning rate is set heuristically to 0.01 · λ⁻¹ during the pretraining phase and modified to 0.01 · min(1, λ⁻¹) when training on the final objective. The Wasserstein distance W_γ(p̂_θ, p̂) is computed between the whole test distribution and the PCD sample at the end of the training procedure.
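The rescaled Hamming cost D(x, x′) = H(x, x′)/⟨H(x, x′)⟩_p̂ used above can be sketched as below (NumPy; the function name is ours, and we read ⟨H⟩_p̂ as the mean pairwise Hamming distance within the data sample, which is our assumption about the normalizer):

```python
import numpy as np

def hamming_cost_matrix(X_data, X_model):
    """Pairwise Hamming distances between two binary samples, rescaled so that
    the mean pairwise distance within the data sample equals 1."""
    H = (X_data[:, None, :] != X_model[None, :, :]).sum(axis=2).astype(float)
    H_data = (X_data[:, None, :] != X_data[None, :, :]).sum(axis=2).astype(float)
    return H / H_data.mean()   # note: the normalizer includes the zero diagonal
```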
This sample is a fast approximation of the true unbiased sample, which would otherwise have to be generated by annealing or enumeration of the states (see the supplement for a comparison of PCD and AIS samples).

4.2 Results and Analysis

The contour plots of Figure 2 show the effect of hyperparameters λ and η on the Wasserstein distance. For λ = ∞, only the KL regularizer is active, which is equivalent to minimizing a standard RBM. As we reduce the amount of regularization, the Wasserstein distance becomes effectively minimized and thus smaller. If λ is chosen too small, the Wasserstein distance increases again, for the stability reasons mentioned in Section 3. In all our experiments, we observed that KL pretraining was necessary in order to reach low Wasserstein distance. Not doing so leads to degenerate solutions. The relation between hyperparameters and minimization criteria is consistent across datasets: in all cases, the Wasserstein RBM produces lower Wasserstein distance than a standard RBM.

Figure 2: Wasserstein distance as a function of hyperparameters λ and η, for the MNIST-small, PLANTS and MNIST-code datasets. The best RBMs in the Wasserstein sense (RBM-W) are shown in red. The best RBMs in the standard sense (i.e. with λ forced to +∞, and minimum KL) are shown in blue.

Samples generated by the standard RBM and the Wasserstein RBM (more precisely their PCD approximation) are shown in Figure 3. The RBM-W produces a reduced set of clean prototypical examples, with less noise than those produced by a regular RBM. All zeros generated by RBM-W have well-defined contours and a round shape but do not reproduce the variety of shapes present in the data.
Similarly, the plants territorial spreads generated by the RBM-W form compact and contiguous regions that are prototypical of real spreads, but are less diverse than the data or the sample generated by the standard RBM. Finally, the RBM-W generates codes that, when decoded, are closer to actual MNIST digits.

Figure 3: Examples generated by the standard RBM and the Wasserstein RBM on MNIST-small, PLANTS and MNIST-code. (Images for the PLANTS dataset are automatically generated from the Wikimedia Commons template https://commons.wikimedia.org/wiki/File:BlankMap-USA-states-Canada-provinces.svg created by user Lokal_Profil.) Images for MNIST-code are produced by the decoders shown on the right.

The PCA plots of Figure 4 superimpose to the true data distribution (in gray) the distributions generated by the standard RBM (in blue) and the Wasserstein RBM (in red). In particular, the plots show the projected distributions on the first two PCA components of the true distribution. While the standard RBM distribution uniformly covers the data, the one generated by the RBM-W consists of a finite set of small dense clusters that are scattered across the input distribution. In other words, the Wasserstein model is biased towards these clusters, and systematically ignores other regions.

Figure 4: Top: Two-dimensional PCA comparison of distributions learned by the RBM and the RBM-W with smoothing parameter γ = 0.1, on MNIST-small, PLANTS and MNIST-code. Plots are obtained by projecting the learned distributions on the first two components of the true distribution.
Bottom: RBM-W distributions obtained by varying the parameter γ (from small to large).

At the bottom of Figure 4, we analyze the effect of the Wasserstein smoothing parameter γ on the learned distribution, with γ = 0.025, 0.05, 0.1, 0.2, 0.4. We observe on all datasets that the stronger the smoothing, the stronger the shrinkage effect. Although the KL-generated distributions shown in blue may look better (the red distribution strongly departs visually from the data distribution), the red distribution is actually superior if considering the smooth Wasserstein distance as a performance metric, as shown in Figure 2.

4.3 Validating the Shrinkage Effect

To verify that the shrinkage effect observed in Figure 4 is not a training artefact, but a truly expected property of the modeled distribution, we analyze this effect for a simple distribution for which the parameter space can be enumerated. Figure 5 plots the Wasserstein distance between samples of size 100 of a 10-dimensional Gaussian distribution p ~ N(0, I), and a parameterized model of that distribution p_θ ~ N(0, θ²I), where θ ∈ [0, 1]. The parameter θ can be interpreted as a shrinkage parameter.
The Wasserstein distance is computed using the cityblock or Euclidean metric, both rescaled such that the expected distance between pairs of points is 1.

Figure 5: Wasserstein distance between a sample p̂ ~ N(0, I) and a sample p̂_θ ~ N(0, θ²I), for various model parameters θ ∈ [0, 1] and smoothing γ, using the cityblock or the Euclidean metric.

Interestingly, for all choices of Wasserstein smoothing parameter γ, and even for the true Wasserstein distance (γ = 0, computed here using the OpenCV library), the best model p_θ in the empirical Wasserstein sense is a shrunken version of p (i.e. with θ < 1). When the smoothing is strong enough, the best parameter becomes θ = 0 (i.e. a Dirac distribution located at the origin). Overall, this experiment gives a training-independent validation of our observation that Wasserstein RBMs learn shrunken, cluster-like distributions. Note that the finite sample size prevents the Wasserstein distance from reaching zero, and always favors shrunken models.

4.4 Data Completion and Denoising

In order to demonstrate the practical relevance of Wasserstein distance minimization, we apply the learned models to the tasks of data completion and data denoising, for which the use of a metric is crucial: data completion and data denoising performance is generally measured in terms of distance between the true data and the completed or denoised data (e.g. Euclidean distance for real-valued data, or Hamming distance H for binary data). Remotely located probability mass that may result from simple KL minimization would incur a severe penalty on the completion and denoising performance metric. Both tasks have useful practical applications: data completion can be used as a first step when applying discriminative learning (e.g. 
neural networks or SVM) to data with missing features. Data denoising can be used as a dimensionality reduction step before training a supervised model. Let the input x = [v, h] be composed of d − k visible variables v and k hidden variables h.

Data Completion  The setting of the data completion experiment is illustrated in Figure 6 (top). The distribution p_θ(x|v) over possible reconstructions can be sampled from using an alternate Gibbs sampler, or by enumeration. The expected Hamming distance between the true state x* and the reconstructed state modeled by the distribution p_θ(x|v) is given by iterating over the 2^k possible reconstructions:

E = Σ_h p_θ(x | v) · H(x, x*),

where h ∈ {0,1}^k. Since the reconstruction is a probability distribution, we can compute the expected Hamming error, but also its bias-variance decomposition. On MNIST-small, we hide randomly located image patches of size 3 × 3 (i.e. k = 9). On PLANTS and MNIST-code, we hide random subsets of k = 9 variables. Results are shown in Figure 7 (left), where we compare three types of models: kernel density estimation (KDE), standard RBM (RBM) and Wasserstein RBM (RBM-W). The KDE model uses a Gaussian kernel, with the Gaussian scale parameter chosen such that the KL divergence of the model from the validation data is minimized. The RBM-W is better than or comparable to the other models. Of particular interest is the structure of the expected Hamming error: for the standard RBM, a large part of the error comes from the variance (or entropy), while for the Wasserstein RBM, the bias term is the most contributing.
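The expected Hamming error E above can be computed by explicit enumeration of the 2^k completions. A minimal sketch (names are ours; free_energy stands for any callable returning F_θ on a batch of states, and p_θ(x|v) is obtained by normalizing e^{−F_θ(x)} over the candidate set):

```python
import itertools
import numpy as np

def expected_hamming_error(x_star, hidden_idx, free_energy):
    """E = sum_h p_theta(x | v) * H(x, x_star), enumerating the 2^k completions.
    x_star: (d,) true binary vector; hidden_idx: indices of the k hidden variables."""
    k = len(hidden_idx)
    X = np.tile(x_star.astype(float), (2 ** k, 1))
    for row, bits in enumerate(itertools.product([0.0, 1.0], repeat=k)):
        X[row, hidden_idx] = bits             # one candidate reconstruction per row
    F = free_energy(X)
    p = np.exp(-(F - F.min()))                # e^{-F} up to a constant factor
    p /= p.sum()                              # p_theta(x | v) on the candidate set
    H = np.abs(X - x_star).sum(axis=1)        # Hamming distance to the true state
    return float(p @ H)
```

The bias-variance decomposition mentioned above can be obtained from the same enumeration, since the full distribution over reconstructions is available.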
This can be related to what is\nobserved in Figure 4: For a data point outside the area covered by the red points, the reconstruction is\nsystematically redirected towards the nearest red cluster, thus, incurring a systematic bias.\n\nData Denoising Here, we consider a simple noise process where for a prede\ufb01ned subset of k\nvariables, denoted by h a known number l of bits \ufb02ips occur randomly. Remaining d \u2212 k variables\noriginal and(cid:101)x its noisy version resulting from \ufb02ipping l variables of h, the expected Hamming error\nare denoted by v. The setting of the experiment is illustrated in Figure 6 (bottom). Calling x(cid:63) the\n\n7\n\n0.00.20.40.60.81.0modelparameter\u03b80.40.50.60.70.80.91.0W\u03b3(\u02c6p\u03b8,\u02c6p)(metric=cityblock)\u03b3=1.00\u03b3=0.32\u03b3=0.10\u03b3=0.03\u03b3=0.01\u03b3=0.000.00.20.40.60.81.0modelparameter\u03b80.50.60.70.80.91.0W\u03b3(\u02c6p\u03b8,\u02c6p)(metric=euclidean)\u03b3=1.00\u03b3=0.32\u03b3=0.10\u03b3=0.03\u03b3=0.01\u03b3=0.00\fFigure 6: Illustration of the completion and denoising setup. For each image, we select a known\nsubset of pixels, that we hide (or corrupt with noise). Each possible reconstruction has a particular\nHamming distance to the original example. The expected Hamming error is computed by weighting\nthe Hamming distances by the probability that the model assigns to the reconstructions.\n\n\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014 Completion \u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\nMNIST-code\nMNIST-small\n\nPLANTS\n\n\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014 Denoising \u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\nMNIST-code\nMNIST-small\n\nPLANTS\n\ne\nc\nn\na\n\nt\ns\nd\n\ni\n\ni\n\ng\nn\nm\nm\na\nH\n\nFigure 7: Performance on the completion and denoising tasks of the kernel density estimation, the\nstandard RBM and the Wasserstein RBM. 
The total length of the bars is the expected Hamming error. Dark gray and light gray sections of the bars give the bias-variance decomposition.

The expected Hamming error is given by iterating over the (k choose l) states x that share the same visible variables v and are at distance l from x̃:

    E = ∑_h p_θ(x | v, H(x, x̃) = l) · H(x, x⋆)

where h ∈ {0,1}^k. Note that the original example x⋆ is necessarily part of this set of states under the noise model assumption. For the MNIST-small data, we choose randomly located image patches of size 4 × 3 or 3 × 4 (i.e. k = 12), and generate l = 4 random bit flips within the selected patch. For PLANTS and MNIST-code, we generate l = 4 bit flips in k = 12 randomly preselected input variables. Figure 7 (right) shows the denoising error in terms of expected Hamming distance on the same datasets. The RBM-W is better than or comparable to the other models. As for the completion task, the main difference between the two RBMs is the bias/variance ratio, where again the Wasserstein RBM tends to have larger bias. This experiment considered a very simple noise model consisting of a fixed number l of random bit flips over a small predefined subset of variables. Denoising highly corrupted complex data will, however, require combining Wasserstein models with more flexible noise models such as those proposed by [17].

5 Conclusion

We have introduced a new objective for restricted Boltzmann machines (RBM) based on the smooth Wasserstein distance. We derived the gradient of the Wasserstein distance from its dual formulation, and used it to effectively train an RBM. Unlike the usual Kullback-Leibler (KL) divergence, our Wasserstein objective takes into account the metric of the data.
In all considered scenarios, the Wasserstein RBM produced distributions that strongly departed from those of standard RBMs, and outperformed them on practical tasks such as completion and denoising. More generally, we demonstrated empirically that, when learning probability densities, reliance on distributions that incorporate the desired metric only indirectly can be replaced by training procedures that make the desired metric a direct part of the learning objective. Thus, Wasserstein training can be seen as a more direct approach to density estimation than regularized KL training. Future work will aim to further explore the interface between Boltzmann learning and Wasserstein minimization, with the aim of scaling the newly proposed learning technique to larger and more complex data distributions.

Acknowledgements

This work was supported by the Brain Korea 21 Plus Program through the National Research Foundation of Korea funded by the Ministry of Education. This work was also supported by the grant DFG (MU 987/17-1). M. Cuturi gratefully acknowledges the support of JSPS young researcher A grant 26700002. Correspondence to GM, KRM and MC.

References

[1] D. H. Ackley, G. E. Hinton, and T. J. Sejnowski. A learning algorithm for Boltzmann machines. Cognitive Science, 9(1):147–169, 1985.

[2] F. Bassetti, A. Bodini, and E. Regazzini. On minimum Kantorovich distance estimators. Statistics & Probability Letters, 76(12):1298–1302, 2006.

[3] K. Cho, T. Raiko, and A. Ilin.
Enhanced gradient for training restricted Boltzmann machines. Neural Computation, 25(3):805–831, 2013.

[4] N. Courty, R. Flamary, D. Tuia, and A. Rakotomamonjy. Optimal transport for domain adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016.

[5] M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems 26, pages 2292–2300, 2013.

[6] M. Cuturi and A. Doucet. Fast computation of Wasserstein barycenters. In Proceedings of the 31st International Conference on Machine Learning (ICML), pages 685–693, 2014.

[7] G. E. Dahl, M. Ranzato, A. Mohamed, and G. E. Hinton. Phone recognition with the mean-covariance restricted Boltzmann machine. In Advances in Neural Information Processing Systems 23, pages 469–477, 2010.

[8] C. Frogner, C. Zhang, H. Mobahi, M. Araya, and T. Poggio. Learning with a Wasserstein loss. In Advances in Neural Information Processing Systems 28, pages 2044–2052, 2015.

[9] G. E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800, 2002.

[10] P. J. Huber. Robust Statistics. Springer, 2011.

[11] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[12] B. M. Marlin, K. Swersky, B. Chen, and N. de Freitas. Inductive principles for restricted Boltzmann machine learning. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS), pages 509–516, 2010.

[13] G. Montavon and K.-R. Müller. Deep Boltzmann machines and the centering trick. In Neural Networks: Tricks of the Trade, Second Edition, LNCS, pages 621–637. Springer, 2012.

[14] Y. Rubner, L. Guibas, and C. Tomasi. The earth mover's distance, multi-dimensional scaling, and color-based image retrieval.
In Proceedings of the ARPA Image Understanding Workshop, pages 661–668, 1997.

[15] R. Salakhutdinov and G. E. Hinton. Deep Boltzmann machines. In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics (AISTATS), pages 448–455, 2009.

[16] N. Srivastava and R. Salakhutdinov. Multimodal learning with deep Boltzmann machines. Journal of Machine Learning Research, 15(1):2949–2980, 2014.

[17] Y. Tang, R. Salakhutdinov, and G. E. Hinton. Robust Boltzmann machines for recognition and denoising. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2264–2271, 2012.

[18] T. Tieleman. Training restricted Boltzmann machines using approximations to the likelihood gradient. In Proceedings of the Twenty-Fifth International Conference on Machine Learning (ICML), pages 1064–1071, 2008.

[19] United States Department of Agriculture. The PLANTS Database, 2012.

[20] C. Villani. Optimal Transport: Old and New, volume 338. Springer Verlag, 2009.