{"title": "The Convergence of Contrastive Divergences", "book": "Advances in Neural Information Processing Systems", "page_first": 1593, "page_last": 1600, "abstract": null, "full_text": " The Convergence of Contrastive Divergences\n\n\n\n Alan Yuille\n Department of Statistics\n University of California at Los Angeles\n Los Angeles, CA 90095\n yuille@stat.ucla.edu\n\n\n\n\n Abstract\n\n\n This paper analyses the Contrastive Divergence algorithm for learning\n statistical parameters. We relate the algorithm to the stochastic approxi-\n mation literature. This enables us to specify conditions under which the\n algorithm is guaranteed to converge to the optimal solution (with proba-\n bility 1). This includes necessary and sufficient conditions for the solu-\n tion to be unbiased.\n\n\n\n\n1 Introduction\n\n\nMany learning problems can be reduced to statistical inference of parameters. But inference\nalgorithms for this task tend to be very slow. Recently Hinton proposed a new algorithm\ncalled contrastive divergences (CD) [1]. Computer simulations show that this algorithm\ntends to converge, and to converge rapidly, although not always to the correct solution [2].\nTheoretical analysis shows that CD can fail but does not give conditions which guarantee\nconvergence [3,4].\n\nThis paper relates CD to the stochastic approximation literature [5,6] and hence derives\nelementary conditions which ensure convergence (with probability 1). We conjecture that\nfar stronger results can be obtained by applying more advanced techniques such as those\ndescribed by Younes [7]. We also give necessary and sufficient conditions for the solution\nof CD to be unbiased.\n\nSection (2) describes CD and shows that it is closely related to a class of stochastic ap-\nproximation algorithms for which convergence results exist. 
In section (3) we state and give a proof of a simple convergence theorem for stochastic approximation algorithms. Section (4) applies the theorem to give sufficient conditions for convergence of CD.

2 Contrastive Divergence and its Relations

The task of statistical inference is to estimate the model parameters θ which minimize the Kullback-Leibler divergence D(P0(x)||P(x|θ)) between the empirical distribution function of the observed data P0(x) and the model P(x|θ). It is assumed that the model distribution is of the form P(x|θ) = e^{-E(x;θ)}/Z(θ).

Estimating the model parameters is difficult. For example, it is natural to try performing steepest descent on D(P0(x)||P(x|θ)). The steepest descent algorithm can be expressed as:

    θ_{t+1} - θ_t = η_t { -Σ_x P0(x) ∂E(x;θ)/∂θ + Σ_x P(x|θ) ∂E(x;θ)/∂θ },  (1)

where the {η_t} are constants.

Unfortunately steepest descent is usually computationally intractable because of the need to compute the second term on the right hand side of equation (1). This is extremely difficult because of the need to evaluate the normalization term Z(θ) of P(x|θ).

Moreover, steepest descent also risks getting stuck in a local minimum. There is, however, an important exception: if we can express E(x;θ) in the special form E(x;θ) = θ·φ(x), for some function φ(x), then D(P0(x)||P(x|θ)) is convex and so steepest descent is guaranteed to converge to the global minimum. But the difficulty of evaluating Z(θ) remains.

The CD algorithm is formally similar to steepest descent, but it avoids the need to evaluate Z(θ). Instead it approximates the second term on the right hand side of the steepest descent equation (1) by a stochastic term. This approximation is done by defining, for each θ, a Markov Chain Monte Carlo (MCMC) transition kernel K_θ(x,y) whose invariant distribution is P(x|θ) (i.e. Σ_x P(x|θ)K_θ(x,y) = P(y|θ)).

Then the CD algorithm can be expressed as:

    θ_{t+1} - θ_t = η_t { -Σ_x P0(x) ∂E(x;θ)/∂θ + Σ_x Q_θ(x) ∂E(x;θ)/∂θ },  (2)

where Q_θ(x) is the empirical distribution function on the samples obtained by initializing the chain at the data samples P0(x) and running the Markov chain forward for m steps (the value of m is a design choice).

We now observe that CD is similar to a class of stochastic approximation algorithms which also use MCMC methods to stochastically approximate the second term on the right hand side of the steepest descent equation (1). These algorithms are reviewed in [7] and have been used, for example, to learn probability distributions for modelling image texture [8].

A typical algorithm of this type introduces a state vector S_t(x) which is initialized by setting S_{t=0}(x) = P0(x). Then S_t(x) and θ_t are updated sequentially as follows. S_t(x) is obtained by sampling with the transition kernel K_{θ_t}(x,y) using S_{t-1}(x) as the initial state for the chain. Then θ_{t+1} is computed by replacing the second term in equation (1) by the expectation with respect to S_t(x). From this perspective, we can obtain CD by having a state vector S_t(x) (= Q_θ(x)) which gets re-initialized to P0(x) at each time step.

This stochastic approximation algorithm, and its many variants, have been extensively studied and convergence results have been obtained (see [7]). The convergence results are based on stochastic approximation theorems [6] whose history starts with the analysis of the Robbins-Monro algorithm [5]. Precise conditions can be specified which guarantee convergence in probability. In particular, Kushner [9] has proven convergence to global optima.
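To make the update of equation (2) concrete, here is a minimal sketch of CD learning for a single binary variable with energy E(x;θ) = θx and a Metropolis kernel run for m steps from each data point. The model, the data mean of 0.7, the chain and step counts, and the schedule η_t = 1/t are illustrative assumptions, not taken from the paper.

```python
import math
import random

def cd_learn(data_mean=0.7, n_chains=500, steps=3000, m=1, seed=0):
    # Contrastive Divergence for P(x|theta) = exp(-theta*x)/Z(theta), x in {0, 1}.
    # Each iteration: start a chain at every data sample, run m Metropolis steps
    # to draw from Q_theta, then apply the update of equation (2) with eta_t = 1/t.
    rng = random.Random(seed)
    data = [1 if rng.random() < data_mean else 0 for _ in range(n_chains)]
    theta = 0.0
    for t in range(1, steps + 1):
        q_sum = 0
        for x in data:
            for _ in range(m):
                # Metropolis step: propose flipping x, accept with min(1, e^{-dE}).
                dE = theta * ((1 - x) - x)
                if rng.random() < min(1.0, math.exp(-dE)):
                    x = 1 - x
            q_sum += x
        # Equation (2) with dE(x;theta)/dtheta = x: the update is the difference
        # between the m-step reconstruction mean and the data mean.
        theta += (1.0 / t) * (-sum(data) / len(data) + q_sum / len(data))
    return theta

theta_hat = cd_learn()  # should approach theta* = log(0.3/0.7), about -0.85
```

For this one-variable chain the expected update vanishes exactly when the m-step reconstruction mean matches the data mean, which happens at the maximum-likelihood parameter, so the CD fixed point is unbiased here.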
Within the NIPS community, Orr and Leen [10] have studied the ability of these algorithms to escape from local minima by basin hopping.

3 Stochastic Approximation Algorithms and Convergence

The general stochastic approximation algorithm is of the form:

    θ_{t+1} = θ_t - η_t S(θ_t, N_t),  (3)

where N_t is a random variable sampled from a distribution P_n(N), η_t is the damping coefficient, and S(·,·) is an arbitrary function.

We now state a theorem which gives sufficient conditions to ensure that the stochastic approximation algorithm (3) converges to a (solution) state θ*. The theorem is chosen because of the simplicity of its proof and we point out that a large variety of alternative results are available, see [6,7,9] and the references they cite.

The theorem involves three basic concepts. The first is a function L(θ) = (1/2)|θ - θ*|² which is a measure of the distance of the current state θ from the solution state θ* (in the next section we will require θ* = arg min_θ D(P0(x)||P(x|θ))). The second is the expected value Σ_N P_n(N)S(θ,N) of the update term in the stochastic approximation algorithm (3). The third is the expected squared magnitude ⟨|S(θ,N)|²⟩ of the update term.

The theorem states that the algorithm will converge provided three conditions are satisfied. These conditions are fairly intuitive. The first condition requires that the expected update Σ_N P_n(N)S(θ,N) has a large component towards the solution (i.e. in the direction of the negative gradient of L(θ)). The second condition requires that the expected squared magnitude ⟨|S(θ,N)|²⟩ is bounded, so that the \"noise\" in the update is not too large. The third condition requires that the damping coefficients η_t decrease with time t, so that the algorithm eventually settles down into a fixed state.
This condition is satisfied by setting η_t = 1/t, ∀t (which is the fastest fall-off rate consistent with the SAC theorem).

We now state the theorem and briefly sketch the proof, which is based on martingale theory (for an introduction, see [11]).

Stochastic Approximation Convergence (SAC) Theorem. Consider the stochastic approximation algorithm, equation (3), and let L(θ) = (1/2)|θ - θ*|². Then the algorithm will converge to θ* with probability 1 provided: (1) ∇_θL(θ) · Σ_N P_n(N)S(θ,N) ≥ K1 L(θ) for some constant K1 > 0, (2) ⟨|S(θ,N)|²⟩_t ≤ K2(1 + L(θ)), where K2 is some constant and the expectation ⟨·⟩_t is taken with respect to all the data prior to time t, and (3) Σ_{t=1}^∞ η_t = ∞ and Σ_{t=1}^∞ η_t² < ∞.

Proof. The proof [12] is a consequence of the supermartingale convergence theorem [11]. This theorem states that if X_t, Y_t, Z_t are positive random variables obeying Σ_{t=0}^∞ Y_t < ∞ with probability one and ⟨X_{t+1}⟩ ≤ X_t + Y_t - Z_t, ∀t, then X_t converges with probability 1 and Σ_{t=0}^∞ Z_t < ∞. To apply the theorem, set X_t = (1/2)|θ_t - θ*|², set Y_t = (1/2)K2 η_t², and Z_t = -X_t(K2 η_t² - K1 η_t) (Z_t is positive for sufficiently large t). Since Σ_t Z_t < ∞ while Σ_t η_t = ∞, conditions 1 and 2 imply that X_t can only converge to 0. The result follows after some algebra.

4 CD and SAC

The CD algorithm can be expressed as a stochastic approximation algorithm by setting:

    S(θ_t, N_t) = Σ_x P0(x) ∂E(x;θ)/∂θ - Σ_x Q_θ(x) ∂E(x;θ)/∂θ,  (4)

where the random variable N_t corresponds to the MCMC sampling used to obtain Q_θ(x) (substituting equation (4) into equation (3) recovers the CD update, equation (2)).

We can now apply the SAC to give three conditions which guarantee convergence of the CD algorithm. The third condition can be satisfied by setting η_t = 1/t, ∀t. We can satisfy the second condition by requiring that the gradient of E(x;θ) with respect to θ is bounded, see equation (4).
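The interplay of the three SAC conditions can be checked explicitly on a minimal Robbins-Monro style iteration. The choice S(θ,N) = θ - N and the Gaussian sampler below are illustrative assumptions, not from the paper.

```python
import random

def stochastic_approximation(sample, steps=200000, theta0=0.0, seed=1):
    # Iterate equation (3): theta_{t+1} = theta_t - eta_t * S(theta_t, N_t),
    # with S(theta, N) = theta - N, whose expected value is theta - E[N].
    # Condition (1): grad L . E[S] = (theta - E[N])^2 = 2 L(theta), so K1 = 2.
    # Condition (2): E[|S|^2] = (theta - E[N])^2 + Var[N], bounded by K2 (1 + L).
    # Condition (3): eta_t = 1/t gives sum eta_t = inf and sum eta_t^2 < inf.
    rng = random.Random(seed)
    theta = theta0
    for t in range(1, steps + 1):
        theta -= (1.0 / t) * (theta - sample(rng))
    return theta

# Estimate the mean of a N(3, 1) source; with eta_t = 1/t the iterate is exactly
# the running sample mean, so it converges to 3 with probability 1.
theta_hat = stochastic_approximation(lambda rng: rng.gauss(3.0, 1.0))
```

With η_t = 1/t the recursion telescopes to the sample average, which is the classical Robbins-Monro estimate of E[N].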
We conjecture that weaker conditions, such as requiring only that the gradient of E(x;θ) be bounded by a function linear in θ, can be obtained using the more sophisticated martingale analysis described in [7].

It remains to understand the first condition and to determine whether the solution is unbiased. These require studying the expected CD update:

    Σ_{N_t} P_n(N_t)S(θ_t, N_t) = Σ_x P0(x) ∂E(x;θ)/∂θ - Σ_{y,x} P0(y)K^m_θ(y,x) ∂E(x;θ)/∂θ,  (5)

which is derived using the fact that the expected value of Q_θ(x) is Σ_y P0(y)K^m_θ(y,x) (where the superscript m indicates running the transition kernel m times).

We now re-express this expected CD update in two different ways, Results 1 and 2, which give alternative ways of understanding it. We then proceed to Results 3 and 4 which give conditions for convergence and unbiasedness of CD.

But we must first introduce some background material from Markov Chain theory [13].

We choose the transition kernel K_θ(x,y) to satisfy detailed balance so that P(x|θ)K_θ(x,y) = P(y|θ)K_θ(y,x). Detailed balance is obeyed by many MCMC algorithms and, in particular, is always satisfied by Metropolis-Hastings algorithms. It implies that P(x|θ) is the invariant distribution of K_θ(x,y), so that Σ_x P(x|θ)K_θ(x,y) = P(y|θ) (all transition kernels satisfy Σ_y K_θ(x,y) = 1, ∀x).

Detailed balance implies that the matrix Q_θ(x,y) = P(x|θ)^{1/2} K_θ(x,y) P(y|θ)^{-1/2} is symmetric and hence has orthogonal eigenvectors and eigenvalues {e_α(x), λ_α}. The eigenvalues are ordered by magnitude (largest to smallest). The first eigenvalue is λ_1 = 1 (so |λ_α| < 1, ∀α ≥ 2). By standard linear algebra, we can write Q_θ(x,y) in terms of its eigenvectors and eigenvalues, Q_θ(x,y) = Σ_α λ_α e_α(x)e_α(y), which implies that we can express the transition kernel applied m times by:

    K^m_θ(x,y) = Σ_α {λ_α}^m {P(x|θ)}^{-1/2} e_α(x) {P(y|θ)}^{1/2} e_α(y) = Σ_α {λ_α}^m u_α(x) v_α(y),  (6)

where the {v_α(x)} and {u_α(x)} are the left and right eigenvectors of the transition kernel K_θ(x,y).
They are defined by:

    v_α(x) = e_α(x){P(x|θ)}^{1/2},   u_α(x) = e_α(x){P(x|θ)}^{-1/2},   ∀α,  (7)

and it can be verified that Σ_x v_α(x)K_θ(x,y) = λ_α v_α(y), and Σ_y K_θ(x,y)u_α(y) = λ_α u_α(x), ∀α. In addition, the left and right eigenvectors are mutually orthonormal so that Σ_x v_α(x)u_β(x) = δ_{αβ}, where δ_{αβ} is the Kronecker delta function. This implies that we can express any function f(x) in equivalent expansions,

    f(x) = Σ_α {Σ_y f(y)u_α(y)} v_α(x),   f(x) = Σ_α {Σ_y f(y)v_α(y)} u_α(x).  (8)

Moreover, the first left and right eigenvectors can be calculated explicitly to give:

    v_1(x) = P(x|θ),   u_1(x) = 1, ∀x,   λ_1 = 1,  (9)

which follows because P(x|θ) is the (unique) invariant distribution of the transition kernel K_θ(x,y) and hence is the first left eigenvector.

We now have sufficient background to state and prove our first result.

Result 1. The expected CD update corresponds to replacing the update term Σ_x P(x|θ) ∂E(x;θ)/∂θ in the steepest descent equation (1) by:

    Σ_x P(x|θ) ∂E(x;θ)/∂θ + Σ_{α≥2} {λ_α}^m {Σ_y P0(y)u_α(y)} {Σ_x v_α(x) ∂E(x;θ)/∂θ},  (10)

where {v_α(x), u_α(x)} are the left and right eigenvectors of K_θ(x,y) with eigenvalues {λ_α}.

Proof. The expected CD update replaces Σ_x P(x|θ) ∂E(x;θ)/∂θ by Σ_{y,x} P0(y)K^m_θ(y,x) ∂E(x;θ)/∂θ, see equation (5). We use the eigenvector expansion of the transition kernel, equation (6), to express this as Σ_{y,x,α} P0(y){λ_α}^m u_α(y)v_α(x) ∂E(x;θ)/∂θ. The result follows using the specific forms of the first eigenvectors, see equation (9).

Result 1 demonstrates that the expected update of CD is similar to the steepest descent rule, see equations (1,10), but with an additional term Σ_{α≥2} {λ_α}^m {Σ_y P0(y)u_α(y)} {Σ_x v_α(x) ∂E(x;θ)/∂θ} which will be small provided the magnitudes of the eigenvalues {λ_α} are small for α ≥ 2 (or if the transition kernel can be chosen so that Σ_y P0(y)u_α(y) is small for α ≥ 2).

We now give a second form for the expected update rule. To do this, we define a new variable g(x;θ).
This is chosen so that Σ_x P(x|θ)g(x;θ) = 0, and the extrema of the Kullback-Leibler divergence occur when Σ_x P0(x)g(x;θ) = 0.

Result 2. Let g(x;θ) = ∂E(x;θ)/∂θ - Σ_x P(x|θ) ∂E(x;θ)/∂θ. Then Σ_x P(x|θ)g(x;θ) = 0, the extrema of the Kullback-Leibler divergence occur when Σ_x P0(x)g(x;θ) = 0, and the expected update rule can be written as:

    θ_{t+1} = θ_t - η_t {Σ_x P0(x)g(x;θ) - Σ_{y,x} P0(y)K^m_θ(y,x)g(x;θ)}.  (11)

Proof. The first result follows directly. The second follows because Σ_x P0(x)g(x;θ) = Σ_x P0(x) ∂E(x;θ)/∂θ - Σ_x P(x|θ) ∂E(x;θ)/∂θ, which is the gradient of the Kullback-Leibler divergence. To get the third we substitute the definition of g(x;θ) into the expected update equation (5). The result follows using the standard property of transition kernels that Σ_y K^m_θ(x,y) = 1, ∀x.

We now use Results 1 and 2 to understand the fixed points of the CD algorithm and determine whether it is biased.

Result 3. The fixed points θ* of the CD algorithm are true (unbiased) extrema of the KL divergence (i.e. Σ_x P0(x)g(x;θ*) = 0) if, and only if, we also have Σ_{y,x} P0(y)K^m_{θ*}(y,x)g(x;θ*) = 0. A sufficient condition is that P0(y) and g(x;θ*) lie in orthogonal eigenspaces of K_{θ*}(y,x). This includes the (known) special case when there exists θ* such that P(x|θ*) = P0(x) (see [2]).

Proof. The first part follows directly from equation (11) in Result 2. The second part can be obtained by the eigenspace analysis in Result 1. Suppose P0(x) = P(x|θ*). Recall that v_1(x) = P(x|θ*), and so Σ_y P0(y)u_α(y) = 0, ∀α ≠ 1, by the orthonormality Σ_x v_α(x)u_β(x) = δ_{αβ}. Moreover, Σ_x v_1(x)g(x;θ*) = Σ_x P(x|θ*)g(x;θ*) = 0. Hence P0(x) and g(x;θ*) lie in orthogonal eigenspaces of K_{θ*}(y,x).

Result 3 shows that whether CD converges to an unbiased estimate usually depends on the specific form of the MCMC transition matrix K_θ(y,x). But there is an intuitive argument why the bias term Σ_{y,x} P0(y)K^m_θ(y,x)g(x;θ) may tend to be small at places θ where Σ_x P0(x)g(x;θ) = 0. This is because for small m, Σ_y P0(y)K^m_θ(y,x) ≈ P0(x), which satisfies Σ_x P0(x)g(x;θ) = 0.
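The eigenstructure of equations (6)-(9), and the geometric decay of the bias term with m, can be verified numerically on a two-state chain. The Metropolis kernel, the stationary distribution (0.3, 0.7), the data distribution P0 = (0.5, 0.5), and the energy E(x;θ) = θx are illustrative assumptions, not from the paper.

```python
# Two-state model P(x|theta) = exp(-theta*x)/Z(theta) with theta = log(3/7),
# so that P = (0.3, 0.7); Metropolis kernel with flip proposals.
P = [0.3, 0.7]
a01 = min(1.0, P[1] / P[0])            # acceptance probability for 0 -> 1
a10 = min(1.0, P[0] / P[1])            # acceptance probability for 1 -> 0
K = [[1.0 - a01, a01], [a10, 1.0 - a10]]

# Detailed balance: P(x) K(x,y) = P(y) K(y,x).
assert abs(P[0] * K[0][1] - P[1] * K[1][0]) < 1e-12

# First left eigenvector (equation (9)): v1 = P, with eigenvalue lambda_1 = 1.
v1K = [sum(P[x] * K[x][y] for x in range(2)) for y in range(2)]
assert all(abs(v1K[y] - P[y]) < 1e-12 for y in range(2))

# Second eigenvalue of a 2x2 stochastic matrix: lambda_2 = trace(K) - 1.
lam2 = K[0][0] + K[1][1] - 1.0

# g(x;theta) = dE/dtheta - sum_x P(x|theta) dE/dtheta = x - E_P[x],
# so sum_x P(x|theta) g(x;theta) = 0 as in Result 2.
g = [x - P[1] for x in (0, 1)]
P0 = [0.5, 0.5]                        # data distribution, different from P(x|theta)

def bias(m):
    # Bias term of equation (11): sum_{y,x} P0(y) K^m(y,x) g(x;theta).
    dist = P0[:]
    for _ in range(m):                 # dist <- dist K
        dist = [sum(dist[x] * K[x][y] for x in range(2)) for y in range(2)]
    return sum(dist[x] * g[x] for x in range(2))

# The v1 component of P0 contributes nothing (since P . g = 0), so the bias is a
# pure lambda_2 mode and shrinks geometrically: bias(m+1) / bias(m) = lambda_2.
ratios = [bias(m + 1) / bias(m) for m in range(1, 6)]
```

Here λ_2 = -3/7, so the bias alternates in sign and its magnitude decays exponentially in m, in line with the eigenvalue analysis of Result 1.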
Moreover, for large m, Σ_y P0(y)K^m_θ(y,x) ≈ P(x|θ), and we also have Σ_x P(x|θ)g(x;θ) = 0.

Alternatively, using Result 1, the bias term Σ_{y,x} P0(y)K^m_θ(y,x)g(x;θ) can be expressed as Σ_{α≥2} {λ_α}^m {Σ_y P0(y)u_α(y)} {Σ_x v_α(x) ∂E(x;θ)/∂θ}. This will tend to be small provided the eigenvalue moduli |λ_α| are small for α ≥ 2 (i.e. the standard conditions for a well defined Markov chain). In general the bias term should decrease exponentially as |λ_2|^m. Clearly it is also desirable to define the transition kernels K_θ(x,y) so that the right eigenvectors {u_α(y) : α ≥ 2} are as orthogonal as possible to the observed data P0(y).

The practicality of CD depends on whether we can find an MCMC sampler such that the bias term Σ_{y,x} P0(y)K^m_θ(y,x)g(x;θ) is small for most θ. If not, then the alternative stochastic algorithms may be preferable.

Finally we give convergence conditions for the CD algorithm.

Result 4. CD will converge with probability 1 to state θ* provided η_t = 1/t, ∂E/∂θ is bounded, and

    (θ - θ*) · {Σ_x P0(x)g(x;θ) - Σ_{y,x} P0(y)K^m_θ(y,x)g(x;θ)} ≥ K1 |θ - θ*|²,  (12)

for some K1 > 0.

Proof. This follows from the SAC theorem and Result 2. The boundedness of ∂E/∂θ is required to ensure that the \"update noise\" is bounded in order to satisfy the second condition of the SAC theorem.

Results 3 and 4 can be combined to ensure that CD converges (with probability 1) to the correct (unbiased) solution. This requires specifying that θ* in Result 4 also satisfies the conditions Σ_x P0(x)g(x;θ*) = 0 and Σ_{y,x} P0(y)K^m_{θ*}(y,x)g(x;θ*) = 0.

5 Conclusion

The goal of this paper was to relate the Contrastive Divergence (CD) algorithm to the stochastic approximation literature. This enables us to give convergence conditions which ensure that CD will converge to the parameters that minimize the Kullback-Leibler divergence D(P0(x)||P(x|θ)). The analysis also gives necessary and sufficient conditions to determine whether the solution is unbiased.
For more recent results, see Carreira-Perpiñán and Hinton (in preparation).

The results in this paper are elementary and preliminary. We conjecture that far more powerful results can be obtained by adapting the convergence theorems in the literature [6,7,9]. In particular, Younes [7] gives convergence results when the gradient of the energy ∂E(x;θ)/∂θ is bounded by a term that is linear in θ (and hence unbounded). He is also able to analyze the asymptotic behaviour of these algorithms. But adapting his mathematical techniques to Contrastive Divergence is beyond the scope of this paper.

Finally, the analysis in this paper does not seem to capture many of the intuitions behind Contrastive Divergence [1]. But we hope that the techniques described in this paper may also stimulate research in this direction.

Acknowledgements

I thank Geoff Hinton, Max Welling and Yingnian Wu for stimulating conversations and feedback. Yingnian provided guidance to the stochastic approximation literature and Max gave useful comments on an early draft. This work was partially supported by an NSF SLC catalyst grant \"Perceptual Learning and Brain Plasticity\" NSF SBE-0350356.

References

[1]. G. Hinton. \"Training Products of Experts by Minimizing Contrastive Divergence\". Neural Computation. 14, pp 1771-1800. 2002.

[2]. Y.W. Teh, M. Welling, S. Osindero and G.E. Hinton. \"Energy-Based Models for Sparse Overcomplete Representations\". Journal of Machine Learning Research. To appear. 2003.

[3]. D. MacKay. \"Failures of the one-step learning algorithm\". Available electronically at http://www.inference.phy.cam.ac.uk/mackay/abstracts/gbm.html. 2001.

[4]. C.K.I. Williams and F.V. Agakov. \"An Analysis of Contrastive Divergence Learning in Gaussian Boltzmann Machines\". Technical Report EDI-INF-RR-0120. Institute for Adaptive and Neural Computation. University of Edinburgh. 2002.

[5]. H. Robbins and S. Monro.
\"A Stochastic Approximation Method\". Annals of Mathematical Statistics. Vol. 22, pp 400-407. 1951.

[6]. H.J. Kushner and D.S. Clark. Stochastic Approximation for Constrained and Unconstrained Systems. New York. Springer-Verlag. 1978.

[7]. L. Younes. \"On the Convergence of Markovian Stochastic Algorithms with Rapidly Decreasing Ergodicity Rates\". Stochastics and Stochastics Reports, 65, 177-228. 1999.

[8]. S.C. Zhu and X. Liu. \"Learning in Gibbsian Fields: How Accurate and How Fast Can It Be?\". IEEE Trans. Pattern Analysis and Machine Intelligence. Vol. 24, No. 7. July 2002.

[9]. H.J. Kushner. \"Asymptotic Global Behaviour for Stochastic Approximation and Diffusions with Slowly Decreasing Noise Effects: Global Minimization via Monte Carlo\". SIAM J. Appl. Math. 47:169-185. 1987.

[10]. G.B. Orr and T.K. Leen. \"Weight Space Probability Densities in Stochastic Learning: II. Transients and Basin Hopping Times\". Advances in Neural Information Processing Systems, 5. Eds. Giles, Hanson, and Cowan. Morgan Kaufmann, San Mateo, CA. 1993.

[11]. G.R. Grimmett and D. Stirzaker. Probability and Random Processes. Oxford University Press. 2001.

[12]. B. Van Roy. Course notes. Stanford. (www.stanford.edu/class/msande339/notes/lecture6.ps).

[13]. P. Bremaud. Markov Chains: Gibbs Fields, Monte Carlo Simulation, and Queues. Springer. New York. 1999.
", "award": [], "sourceid": 2617, "authors": [{"given_name": "Alan", "family_name": "Yuille", "institution": null}]}