{"title": "The Rectified Gaussian Distribution", "book": "Advances in Neural Information Processing Systems", "page_first": 350, "page_last": 356, "abstract": null, "full_text": "The Rectified Gaussian Distribution \n\nN. D. Socci, D. D. Lee and H. S. Seung \n\nBell Laboratories, Lucent Technologies \n\nMurray Hill, NJ 07974 \n\n{ndslddleelseung}~bell-labs.com \n\nAbstract \n\nA simple but powerful modification of the standard Gaussian dis(cid:173)\ntribution is studied. The variables of the rectified Gaussian are \nconstrained to be nonnegative, enabling the use of nonconvex en(cid:173)\nergy functions. Two multimodal examples, the competitive and \ncooperative distributions, illustrate the representational power of \nthe rectified Gaussian. Since the cooperative distribution can rep(cid:173)\nresent the translations of a pattern, it demonstrates the potential \nof the rectified Gaussian for modeling pattern manifolds. \n\n1 \n\nINTRODUCTION \n\nThe rectified Gaussian distribution is a modification of the standard Gaussian in \nwhich the variables are constrained to be nonnegative. This simple modification \nbrings increased representational power, as illustrated by two multimodal examples \nof the rectified Gaussian, the competitive and the cooperative distributions. The \nmodes of the competitive distribution are well-separated by regions of low probabil(cid:173)\nity. The modes of the cooperative distribution are closely spaced along a nonlinear \ncontinuous manifold. Neither distribution can be accurately approximated by a \nsingle standard Gaussian. In short, the rectified Gaussian is able to represent both \ndiscrete and continuous variability in a way that a standard Gaussian cannot. \nThis increased representational power comes at the price of increased complexity. \nWhile finding the mode of a standard Gaussian involves solution of linear equations, \nfinding the modes of a rectified Gaussian is a quadratic programming problem. 
\nSampling from a standard Gaussian can be done by generating one dimensional \nnormal deviates, followed by a linear transformation. Sampling from a rectified \nGaussian requires Monte Carlo methods. Mode-finding and sampling algorithms \nare basic tools that are important in probabilistic modeling. \nLike the Boltzmann machine[l], the rectified Gaussian is an undirected graphical \nmodel. The rectified Gaussian is a better representation for probabilistic modeling \n\n\fThe Rectified Gaussian Distribution \n\n351 \n\n(a) \n\n(c) \n\nFigure 1: Three types of quadratic energy functions. (a) Bowl (b) Trough (c) Saddle \n\nof continuous-valued data. It is unclear whether learning will be more tractable for \nthe rectified Gaussian than it is for the Boltzmann machine. \nA different version of the rectified Gaussian was recently introduced by Hinton and \nGhahramani[2, 3]. Their version is for a single variable, and has a singularity at \nthe origin designed to produce sparse activity in directed graphical models. Our \nversion lacks this singularity, and is only interesting in the case of more than one \nvariable, for it relies on undirected interactions between variables to produce the \nmultimodal behavior that is of interest here. \nThe present work is inspired by biological neural network models that use contin(cid:173)\nuous dynamical attractors[4]. In particular, the energy function of the cooperative \ndistribution was previously studied in models of the visual cortex[5], motor cortex[6], \nand head direction system[7]. \n\n2 ENERGY FUNCTIONS: BOWL, TROUGH, AND \n\nSADDLE \n\nThe standard Gaussian distribution P(x) is defined as \n\nP(x) \nE(x) = \n\ne \n\nZ -l -{3E(;r:) \n, \n1 _xT Ax - bTx \n2 \n\n. \n\n(1) \n\n(2) \n\nThe symmetric matrix A and vector b define the quadratic energy function E(x). \nThe parameter (3 = lIT is an inverse temperature. Lowering the temperature \nconcentrates the distribution at the minimum of the energy function. 
The prefactor Z normalizes the integral of P(x) to unity. \nDepending on the matrix A, the quadratic energy function E(x) can have different types of curvature. The energy function shown in Figure 1(a) is convex. The minimum of the energy corresponds to the peak of the distribution. Such a distribution is often used in pattern recognition applications, when patterns are well modeled as a single prototype corrupted by random noise. \nThe energy function shown in Figure 1(b) is flattened in one direction. Patterns generated by such a distribution come with roughly equal likelihood from anywhere along the trough, so the direction of the trough corresponds to the invariances of the pattern. Principal component analysis can be thought of as a procedure for learning distributions of this form. \nThe energy function shown in Figure 1(c) is saddle-shaped. It cannot be used in a Gaussian distribution, because the energy decreases without limit down the sides of the saddle, leading to a non-normalizable distribution. However, certain saddle-shaped energy functions can be used in the rectified Gaussian distribution, which is defined over vectors x whose components are all nonnegative. The energy functions that can be used are those whose matrix A has the property xᵀAx > 0 for all x > 0, a condition known as copositivity. Note that this set of matrices is larger than the set of positive definite matrices that can be used with a standard Gaussian. The nonnegativity constraints block the directions in which the energy diverges to negative infinity. Some concrete examples will be discussed shortly. The energy functions for these examples will have multiple minima, and the corresponding distributions will be multimodal, which is not possible with a standard Gaussian.
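The gap between copositivity and positive definiteness can be checked numerically. The following is a minimal sketch, assuming NumPy; the matrix is an illustrative saddle-type example (xᵀAx > 0 for all nonnegative x ≠ 0, yet A is indefinite):

```python
import numpy as np

# An indefinite but copositive matrix: x^T A x = x^2 + y^2 + 4xy,
# which is negative along (1, -1) but positive on the nonnegative orthant.
A = np.array([[1.0, 2.0],
              [2.0, 1.0]])

# Not positive definite: one eigenvalue is negative.
eigenvalues = np.linalg.eigvalsh(A)
print(eigenvalues)  # one negative, one positive

# Copositivity check on a grid of nonnegative unit vectors:
# the quadratic form stays strictly positive for all x >= 0, x != 0.
t = np.linspace(0.0, np.pi / 2, 1000)
xs = np.stack([np.cos(t), np.sin(t)], axis=1)  # nonnegative unit vectors
quad = np.einsum('ij,jk,ik->i', xs, A, xs)
print(quad.min())  # stays strictly positive
```

The grid check works here because in two dimensions the nonnegative unit vectors form a quarter circle; in general, verifying copositivity is itself a hard problem.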
\n\n3 MODE-FINDING \n\nBefore defining some example distributions, we must introduce some tools for an(cid:173)\nalyzing them. The modes of a rectified Gaussian are the minima of the energy \nfunction (2), subject to nonnegativity constraints. At low temperatures, the modes \nof the distribution characterize much of its behavior. \nFinding the modes of a rectified Gaussian is a problem in quadratic programming. \nAlgorithms for quadratic programming are particularly simple for the case of non(cid:173)\nnegativity constraints. Perhaps the simplest algorithm is the projected gradient \nmethod, a discrete time dynamics consisting of a gradient step followed by a recti(cid:173)\nfication \n\n(3) \nThe rectification [x]+ = max(x, 0) keeps x within the nonnegative orthant (x ~ 0). \nIf the step size 7J is chosen correctly, this algorithm can provably be shown to \nconverge to a stationary point of the energy function[8]. In practice, this stationary \npoint is generally a local minimum. \n\nNeural networks can also solve quadratic programming problems. We define the \nsynaptic weight matrix W = I - A, and a continuous time dynamics \n\nx+x = [b+ Wx]+ \n\n(4) \n\nFor any initial condition in the nonnegative orthant, the dynamics remains in the \nnonnegative orthant, and the quadratic function (2) is a Lyapunov function of the \ndynamics. \nBoth of these methods converge to a stationary point of the energy. The gradient \nof the energy is given by 9 = Ax - b. According to the Kiihn-Tucker conditions, a \nstationary point must satisfy the conditions that for all i, either gi = 0 and Xi > 0, \nor gi > 0 and Xi = O. The intuitive explanation is that in the interior of the \nconstraint region, the gradient must vanish, while at the boundary, the gradient \nmust point toward the interior. For a stationary point to be a local minimum, the \nKiihn-Tucker conditions must be augmented by the condition that the Hessian of \nthe nonzero variables be positive definite. 
\nBoth methods are guaranteed to find a global minimum only in the case where A is \npositive definite, so that the energy function (2) is convex. This is because a convex \nenergy function has a unique minimum. Convex quadratic programming is solvable \nin polynomial time. In contrast, for a nonconvex energy function (indefinite A), it \nis not generally possible to find the global minimum in polynomial time, because of \nthe possible presence of local minima. In many practical situations, however, it is \nnot too difficult to find a reasonable solution. \n\n\fThe Rectified Gaussian Distribution \n\n353 \n\n(a) \n\n(b) \n\nFigure 2: The competitive distribution for two variables. (a) A non-convex energy \nfunction with two constrained minima on the x and y axes. Shown are contours of \nconstant energy, and arrows that represent the negative gradient of the energy. (b) \nThe rectified Gaussian distribution has two peaks. \n\nThe rectified Gaussian happens to be most interesting in the nonconvex case, pre(cid:173)\ncisely because of the possibility of multiple minima. The consequence of multiple \nminima is a multimodal distribution, which cannot be well-approximated by a stan(cid:173)\ndard Gaussian. We now consider two examples of a multimodal rectified Gaussian. \n\n4 COMPETITIVE DISTRIBUTION \n\nThe competitive distribution is defined by \n\nAij \n\n-dij + 2 \n\nbi = 1; \n\nWe first consider the simple case N = 2. Then the energy function given by \n\nE(x,y)=-\n\nX2 +y2 \n\n2 +(x+y)2_(x+y) \n\n(5) \n(6) \n\n(7) \n\nhas two constrained minima at (1,0) and (0,1) and is shown in figure 2(a). It \ndoes not lead to a normalizable distribution unless the nonnegativity constraints are \nimposed. The two constrained minima of this nonconvex energy function correspond \nto two peaks in the distribution (fig 2(b)). 
While such a bimodal distribution could be approximated by a mixture of two standard Gaussians, a single Gaussian distribution cannot approximate it. In particular, the reduced probability density between the two peaks would not be representable at all with a single Gaussian. \nThe competitive distribution gets its name because its energy function is similar to the ones that govern winner-take-all networks[9]. When N becomes large, the N global minima of the energy function are singleton vectors (fig 3), with one component equal to unity and the rest zero. This is due to a competitive interaction between the components. The mean of the zero temperature distribution is given by \n\n⟨xᵢ⟩ = 1/N.     (8) \n\nThe eigenvalues of the covariance \n\n⟨xᵢxⱼ⟩ − ⟨xᵢ⟩⟨xⱼ⟩ = δᵢⱼ/N − 1/N²     (9) \n\nare all equal to 1/N, except for a single zero mode. The zero mode is 1, the vector of all ones, and the other eigenvectors span the (N − 1)-dimensional space perpendicular to 1. \n\nFigure 3: The competitive distribution for N = 10 variables. (a) One mode (zero temperature state) of the distribution. The strong competition between the variables results in only one variable on. There are N modes of this form, each with a different winner variable. (b) A sample at finite temperature (β ≈ 10) using Monte Carlo sampling. There is still a clear winner variable. (c) Sample from a standard Gaussian with matched mean and covariance. Even if we cut off the negative values, this sample still bears little resemblance to the states shown in (a) and (b), since there is no clear winner variable.
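Equations (8) and (9) follow from averaging over the N singleton modes, each occurring with probability 1/N at zero temperature. A quick numerical check, assuming NumPy:

```python
import numpy as np

N = 10
modes = np.eye(N)                       # the N singleton ground states

mean = modes.mean(axis=0)               # <x_i> = 1/N
cov = modes.T @ modes / N - np.outer(mean, mean)  # delta_ij/N - 1/N^2

print(mean[0])            # 1/N
# Eigenvalues: a single zero mode along the all-ones vector 1,
# and N-1 eigenvalues equal to 1/N in the space perpendicular to 1.
eigvals = np.linalg.eigvalsh(cov)
print(eigvals)
print(cov @ np.ones(N))   # the all-ones vector is annihilated
```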
Figure 3 shows two samples: one (b) drawn at finite temperature from the competitive distribution, and the other (c) drawn from a standard Gaussian distribution with the same mean and covariance. Even if the sample from the standard Gaussian is cut so that negative values are set to zero, it does not look at all like a sample from the original distribution. Most importantly, a standard Gaussian will never be able to capture the strongly competitive character of this distribution. \n\n5 COOPERATIVE DISTRIBUTION \n\nTo define the cooperative distribution on N variables, an angle θᵢ = 2πi/N is associated with each variable xᵢ, so that the variables can be regarded as sitting on a ring. The energy function is defined by \n\nAᵢⱼ = δᵢⱼ + 1/N − (4/N) cos(θᵢ − θⱼ),     (10) \nbᵢ = 1.     (11) \n\nThe coupling Aᵢⱼ between xᵢ and xⱼ depends only on the separation θᵢ − θⱼ between them on the ring. \nThe minima, or ground states, of the energy function can be found numerically by the methods described earlier. An analytic calculation of the ground states in the large N limit is also possible[5]. As shown in Figure 4(a), each ground state is a lump of activity centered at some angle on the ring. This delocalized pattern of activity is different from the singleton modes of the competitive distribution, and arises from the cooperative interactions between neurons on the ring. Because the distribution is invariant to rotations of the ring (cyclic permutations of the variables xᵢ), there are N ground states, each with the lump at a different angle. \nThe mean and the covariance of the cooperative distribution are given by \n\n⟨xᵢ⟩ = const,     (12) \n⟨xᵢxⱼ⟩ − ⟨xᵢ⟩⟨xⱼ⟩ = C(θᵢ − θⱼ).     (13) \n\nA given sample of x, shown in Figure 4(a), does not look anything like the mean, which is completely uniform. \n
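The coupling matrix (10) is circulant: it depends only on separations on the ring, so the energy is exactly invariant under cyclic shifts of x. This degeneracy is what produces the N rotated ground states. A minimal sketch, assuming NumPy (the lump used here is an illustrative configuration, not the exact analytic ground state):

```python
import numpy as np

N = 25
theta = 2 * np.pi * np.arange(N) / N
# A_ij = delta_ij + 1/N - (4/N) cos(theta_i - theta_j), b_i = 1
A = np.eye(N) + 1.0 / N - (4.0 / N) * np.cos(theta[:, None] - theta[None, :])
b = np.ones(N)

def energy(x):
    return 0.5 * x @ A @ x - b @ x

# Every cyclic permutation of a configuration has exactly the same energy,
# because the coupling depends only on the separation on the ring.
x = np.maximum(np.cos(theta), 0.0)     # a lump of activity on the ring
energies = [energy(np.roll(x, k)) for k in range(N)]
print(max(energies) - min(energies))   # ~ 0 up to rounding
```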
Samples generated from a Gaussian distribution with the same mean and covariance look completely different from the ground states of the cooperative distribution (fig 4(c)). \n\nFigure 4: The cooperative distribution for N = 25 variables. (a) Zero temperature state. A cooperative interaction between the variables leads to a delocalized pattern of activity that can sit at different locations on the ring. (b) A finite temperature (β = 50) sample. (c) A sample from a standard Gaussian with matched mean and covariance. \n\nThese deviations from standard Gaussian behavior reflect fundamental differences in the underlying energy function. Here the energy function has N discrete minima arranged along a ring. In the limit of large N, the barriers between these minima become quite small. A reasonable approximation is to regard the energy function as having a continuous line of minima with a ring geometry[5]. In other words, the energy surface looks like a curved trough, similar to the bottom of a wine bottle. The mean is the centroid of the ring and is not close to any minimum. \nThe cooperative distribution is able to model the set of all translations of the lump pattern of activity. This suggests that the rectified Gaussian may be useful in invariant object recognition, in cases where a continuous manifold of instantiations of an object must be modeled. One such case is visual object recognition, where the images of an object from different viewpoints form a continuous manifold. \n\n6 SAMPLING \n\nFigures 3 and 4 depict samples drawn from the competitive and cooperative distributions. These samples were generated using the Metropolis Monte Carlo algorithm.
\nSince full descriptions of this algorithm can be found elsewhere, we give only a brief \ndescription of the particular features used here. The basic procedure is to generate \na new configuration of the system and calculate the change in energy (given by \neq. 2). If the energy decreases, one accepts the new configuration unconditionally. \nIf it increases then the new configuration is accepted with probability e-{3AE. \nIn our sampling algorithm one variable is updated at a time (analogous to single \nspin flips). The acceptance ratio is much higher this way than if we update all the \nspins simultaneously. However, for some distributions the energy function may have \napproximately marginal directions; directions in which there is little or no barrier. \nThe cooperative distribution has this property. We can expect critical slowing down \ndue to this and consequently some sort of collective update (analogous to multi-spin \nupdates or cluster updates) might make sampling more efficient. However, the type \nof update will depend on the specifics of the energy function and is not easy to \ndetermine. \n\n\f356 \n\nN D. Socci, D. D. Lee and H. S. Seung \n\n7 DISCUSSION \n\nThe competitive and cooperative distributions are examples of rectified Gaussians \nfor which no good approximation by a standard Gaussian is possible. However, \nboth distributions can be approximated by mixtures of standard Gaussians. The \ncompetitive distribution can be approximated by a mixture of N Gaussians, one \nfor each singleton state. The cooperative distribution can also be approximated by \na mixture of N Gaussians, one for each location of the lump on the ring. A more \neconomical approximation would reduce the number of Gaussians in the mixture, \nbut .make each one anisotropic[IO]. 
\nWhether the rectified Gaussian is superior to these mixture models is an empirical \nquestion that should be investigated empirically with specific real-world probabilis(cid:173)\ntic modeling tasks. Our intuition is that the rectified Gaussian will turn out to be \na good representation for nonlinear pattern manifolds, and the aim of this paper \nhas been to make this intuition concrete. \n\nTo make the rectified Gaussian useful in practical applications, it is critical to \nfind tractable learning algorithms. It is not yet clear whether learning will be more \ntractable for the rectified Gaussian than it was for the Boltzmann machine. Perhaps \nthe continuous variables of the rectified Gaussian may be easier to work with than \nthe binary variables of the Boltzmann machine. \n\nAcknowledgments We would like to thank P. Mitra, L. Saul, B. Shraiman and \nH. Sompolinsky for helpful discussions. Work on this project was supported by Bell \nLaboratories, Lucent Technologies. \n\nReferences \n\n[1] D. H. Ackley, G. E. Hinton, and T. J. Sejnowski. A learning algorithm for \n\nBoltzmann machines. Cognitive Science, 9:147-169, 1985. \n\n[2] G. E. Hinton and Z. Ghahramani. Generative models for discovering sparse \n\ndistributed representations. Phil. Trans. Roy. Soc., B352:1177-90, 1997. \n\n[3] Z. Ghahramani and G. E. Hinton. Hierarchical non-linear factor analysis and \n\ntopographic maps. Adv. Neural Info. Proc. Syst., 11, 1998. \n\n[4] H. S. Seung. How the brain keeps the eyes still. Proc. Natl. Acad. Sci. USA, \n\n93:13339-13344, 1996. \n\n[5] R. Ben-Yishai, R. L. Bar-Or, and H. Sompolinsky. Theory of orientation tuning \n\nin visual cortex. Proc. Nat. Acad. Sci. USA, 92:3844-3848, 1995. \n\n[6] A. P. Georgopoulos, M. Taira, and A. Lukashin. Cognitive neurophysiology of \n\nthe motor cortex. Science, 260:47-52, 1993. \n\n[7] K. Zhang. Representation of spatial orientation by the intrinsic dynamics of \n\nthe head-direction cell ensemble: a theory. J. 
Neurosci., 16:2112-2126, 1996. \n[8] D. P. Bertsekas. Nonlinear Programming. Athena Scientific, Belmont, MA, 1995. \n[9] S. Amari and M. A. Arbib. Competition and cooperation in neural nets. In J. Metzler, editor, Systems Neuroscience, pages 119-165. Academic Press, New York, 1977. \n[10] G. E. Hinton, P. Dayan, and M. Revow. Modeling the manifolds of images of handwritten digits. IEEE Trans. Neural Networks, 8:65-74, 1997. \n", "award": [], "sourceid": 1402, "authors": [{"given_name": "Nicholas", "family_name": "Socci", "institution": null}, {"given_name": "Daniel", "family_name": "Lee", "institution": null}, {"given_name": "H. Sebastian", "family_name": "Seung", "institution": null}]}