{"title": "A Locally Adaptive Normal Distribution", "book": "Advances in Neural Information Processing Systems", "page_first": 4251, "page_last": 4259, "abstract": "The multivariate normal density is a monotonic function of the distance to the mean, and its ellipsoidal shape is due to the underlying Euclidean metric. We suggest to replace this metric with a locally adaptive, smoothly changing (Riemannian) metric that favors regions of high local density. The resulting locally adaptive normal distribution (LAND) is a generalization of the normal distribution to the \"manifold\" setting, where data is assumed to lie near a potentially low-dimensional manifold embedded in R^D. The LAND is parametric, depending only on a mean and a covariance, and is the maximum entropy distribution under the given metric. The underlying metric is, however, non-parametric. We develop a maximum likelihood algorithm to infer the distribution parameters that relies on a combination of gradient descent and Monte Carlo integration. We further extend the LAND to mixture models, and provide the corresponding EM algorithm. We demonstrate the efficiency of the LAND to fit non-trivial probability distributions over both synthetic data, and EEG measurements of human sleep.", "full_text": "A Locally Adaptive Normal Distribution\n\nGeorgios Arvanitidis, Lars Kai Hansen and S\u00f8ren Hauberg\n\nTechnical University of Denmark, Lyngby, Denmark\n\nDTU Compute, Section for Cognitive Systems\n\n{gear,lkai,sohau}@dtu.dk\n\nAbstract\n\nThe multivariate normal density is a monotonic function of the distance to the mean,\nand its ellipsoidal shape is due to the underlying Euclidean metric. We suggest to\nreplace this metric with a locally adaptive, smoothly changing (Riemannian) metric\nthat favors regions of high local density. 
The resulting locally adaptive normal\ndistribution (LAND) is a generalization of the normal distribution to the \u201cmanifold\u201d\nsetting, where data is assumed to lie near a potentially low-dimensional manifold\nembedded in RD. The LAND is parametric, depending only on a mean and a\ncovariance, and is the maximum entropy distribution under the given metric. The\nunderlying metric is, however, non-parametric. We develop a maximum likelihood\nalgorithm to infer the distribution parameters that relies on a combination of\ngradient descent and Monte Carlo integration. We further extend the LAND to\nmixture models, and provide the corresponding EM algorithm. We demonstrate\nthe ef\ufb01ciency of the LAND to \ufb01t non-trivial probability distributions over both\nsynthetic data, and EEG measurements of human sleep.\n\n1\n\nIntroduction\n\nThe multivariate normal distribution is a fundamental building block in many machine learning\nalgorithms, and its well-known density can compactly be written as\n\np(x | \u00b5, \u03a3) \u221d exp\n\ndist2\n\n\u03a3(\u00b5, x)\n\n,\n\n(1)\n\n(cid:18)\n\n\u2212 1\n2\n\n(cid:19)\n\nwhere dist2\n\u03a3(\u00b5, x) denotes the Mahalanobis distance for covariance matrix \u03a3. This distance measure\ncorresponds to the length of the straight line connecting \u00b5 and x, and consequently the normal\ndistribution is often used to model linear phenomena. When data lies near a nonlinear manifold\nembedded in RD the normal distribution becomes inadequate due to its linear metric. We investigate\nif a useful distribution can be constructed by replacing the linear distance function with a nonlinear\ncounterpart. This is similar in spirit to Isomap [21] that famously replace the linear distance with a\ngeodesic distance measured over a neighborhood graph spanned by the data, thereby allowing for\na nonlinear model. This is, however, a discrete distance measure that is only well-de\ufb01ned over the\ntraining data. 
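To make Eq. 1 concrete, here is a minimal numpy sketch (ours, not part of the paper) of the squared Mahalanobis distance and the resulting unnormalized density; the function name is an assumption for illustration:

```python
import numpy as np

def mahalanobis_sq(x, mu, Sigma):
    """Squared Mahalanobis distance dist^2_Sigma(mu, x) of Eq. 1."""
    d = x - mu
    # solve() avoids forming Sigma^{-1} explicitly
    return float(d @ np.linalg.solve(Sigma, d))

mu = np.zeros(2)
Sigma = np.array([[2.0, 0.0],
                  [0.0, 0.5]])
x = np.array([1.0, 1.0])

# Unnormalized normal density of Eq. 1.
p_unnorm = np.exp(-0.5 * mahalanobis_sq(x, mu, Sigma))
```

Replacing this linear distance with a geodesic distance is exactly the modification studied in the remainder of the paper.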
For a generative model, we need a continuously de\ufb01ned metric over the entire RD.\nFollowing Hauberg et al. [9] we learn a smoothly changing metric that favors regions of high density\ni.e., geodesics tend to move near the data. Under this metric, the data space is interpreted as a\nD-dimensional Riemannian manifold. This \u201cmanifold learning\u201d does not change dimensionality, but\nmerely provides a local description of the data. The Riemannian view-point, however, gives a strong\nmathematical foundation upon which the proposed distribution can be developed. Our work, thus,\nbridges work on statistics on Riemannian manifolds [15, 23] with manifold learning [21].\nWe develop a locally adaptive normal distribution (LAND) as follows: First, we construct a metric\nthat captures the nonlinear structure of the data and enables us to compute geodesics; from this, an\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n\fFigure 1: Illustration of the LAND using MNIST images of the digit 1 projected onto the \ufb01rst 2\nprincipal components. Left: comparison of the geodesic and the linear distance. Center: the proposed\nlocally adaptive normal distribution. Right: the Euclidean normal distribution.\n\nunnormalized density is trivially de\ufb01ned. Second, we propose a scalable Monte Carlo integration\nscheme for normalizing the density with respect to the measure induced by the metric. Third, we\ndevelop a gradient-based algorithm for maximum likelihood estimation on the learned manifold. We\nfurther consider a mixture of LANDs and provide the corresponding EM algorithm. The usefulness\nof the model is veri\ufb01ed on both synthetic data and EEG measurements of human sleep stages.\nNotation: all points x \u2208 RD are considered as column vectors, and they are denoted with bold\nlowercase characters. S D\n++ represents the set of symmetric D \u00d7 D positive de\ufb01nite matrices. 
The learned Riemannian manifold is denoted M, and its tangent space at x ∈ M is denoted T_x M.

2 A Brief Summary of Riemannian Geometry

We start our exposition with a brief review of Riemannian manifolds [6]. These smooth manifolds are naturally equipped with a distance measure, and are commonly used to model physical phenomena such as dynamical or periodic systems, and many problems that have a smooth behavior.

Definition 1. A smooth manifold M together with a Riemannian metric M : M → S^D_++ is called a Riemannian manifold. The Riemannian metric M encodes a smoothly changing inner product ⟨u, M(x)v⟩ between tangent vectors u, v ∈ T_x M at each point x ∈ M.

Remark 1. The Riemannian metric M(x) acts on tangent vectors, and may, thus, be interpreted as a standard Mahalanobis metric restricted to an infinitesimal region around x.

The local inner product based on M is a suitable model for capturing the local behavior of data, i.e. manifold learning. From the inner product, we can define geodesics as length-minimizing curves connecting two points x, y ∈ M, i.e.

    \hat{\gamma} = \operatorname*{argmin}_{\gamma} \int_0^1 \sqrt{\langle \gamma'(t), \mathbf{M}(\gamma(t))\, \gamma'(t) \rangle}\, dt, \quad \text{s.t.} \quad \gamma(0) = x,\; \gamma(1) = y.    (2)

Here M(γ(t)) is the metric tensor at γ(t), and the tangent vector γ' denotes the derivative (velocity) of γ. The distance between x and y is defined as the length of the geodesic. A standard result from differential geometry is that the geodesic can be found as the solution to a system of 2nd order ordinary differential equations (ODEs) [6, 9]:

    \gamma''(t) = -\frac{1}{2} \mathbf{M}^{-1}(\gamma(t)) \left[ \frac{\partial\, \mathrm{vec}[\mathbf{M}(\gamma(t))]}{\partial \gamma(t)} \right]^\top \big( \gamma'(t) \otimes \gamma'(t) \big),    (3)

subject to γ(0) = x, γ(1) = y.
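The geodesic ODE (Eq. 3) can be integrated numerically. The sketch below, which assumes scipy is available and approximates the Jacobian of vec[M] by central finite differences, illustrates one way an initial value problem solver could be set up; the function names and tolerances are our assumptions, not the paper's implementation:

```python
import numpy as np
from scipy.integrate import solve_ivp

def geodesic_ivp(metric, x, v, t_max=1.0):
    """Integrate the geodesic ODE (Eq. 3) as an initial value problem.

    metric(p) -> (D, D) metric tensor M(p). The Jacobian of vec[M]
    (column-stacking) is approximated by finite differences.
    Returns the end-point of the geodesic, i.e. Exp_x(t_max * v).
    """
    D = x.size

    def dvecM_dx(p, eps=1e-5):
        # (D*D, D) matrix of partial derivatives of vec[M(p)].
        J = np.zeros((D * D, D))
        for d in range(D):
            dp = np.zeros(D)
            dp[d] = eps
            J[:, d] = (metric(p + dp) - metric(p - dp)).ravel(order="F") / (2 * eps)
        return J

    def ode(t, state):
        g, dg = state[:D], state[D:]
        # Right-hand side of Eq. 3.
        rhs = -0.5 * np.linalg.solve(metric(g), dvecM_dx(g).T @ np.kron(dg, dg))
        return np.concatenate([dg, rhs])

    sol = solve_ivp(ode, (0.0, t_max), np.concatenate([x, v]), rtol=1e-8)
    return sol.y[:D, -1]
```

For a constant (e.g. Euclidean) metric the Jacobian term vanishes and the geodesic reduces to a straight line, which is a useful sanity check.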
Here vec[·] stacks the columns of a matrix into a vector and ⊗ is the Kronecker product.
This differential equation allows us to define basic operations on the manifold. The exponential map at a point x takes a tangent vector v ∈ T_x M to y = Exp_x(v) ∈ M such that the curve γ(t) = Exp_x(t · v) is a geodesic originating at x with initial velocity v and length ‖v‖. The inverse mapping, which takes y to T_x M, is known as the logarithm map and is denoted Log_x(y). By definition ‖Log_x(y)‖ corresponds to the geodesic distance from x to y. These operations are illustrated in Fig. 2. The exponential and the logarithm map can be computed by solving Eq. 3 numerically, as an initial value problem (IVP) or a boundary value problem (BVP) respectively. In practice the IVPs are substantially faster to compute than the BVPs.

Figure 2: An illustration of the exponential and logarithmic maps.

The Mahalanobis distance is naturally extended to Riemannian manifolds as dist²_Σ(x, y) = ⟨Log_x(y), Σ⁻¹ Log_x(y)⟩. From this, Pennec [15] considered the Riemannian normal distribution

    p_{\mathcal{M}}(x \mid \mu, \Sigma) = \frac{1}{C} \exp\left( -\frac{1}{2} \langle \mathrm{Log}_{\mu}(x), \Sigma^{-1} \mathrm{Log}_{\mu}(x) \rangle \right), \quad x \in \mathcal{M},    (4)

and showed that it is the manifold-valued distribution with maximum entropy subject to a known mean and covariance. This distribution is an instance of Eq. 1 and is the distribution we consider in this paper. Next, we consider standard "intrinsic least squares" estimates of μ and Σ.

2.1 Intrinsic Least Squares Estimators

Let the data be generated from an unknown probability distribution q_M(x) on a manifold.
Then it is common [15] to define the intrinsic mean of the distribution as the point that minimizes the variance

    \hat{\mu} = \operatorname*{argmin}_{\mu \in \mathcal{M}} \int_{\mathcal{M}} \mathrm{dist}^2(\mu, x)\, q_{\mathcal{M}}(x)\, d\mathcal{M}(x),    (5)

where dM(x) is the measure (or infinitesimal volume element) induced by the metric. Based on the mean, a covariance matrix can be defined as

    \hat{\Sigma} = \int_{\mathcal{D}(\hat{\mu})} \mathrm{Log}_{\hat{\mu}}(x)\, \mathrm{Log}_{\hat{\mu}}(x)^\top q_{\mathcal{M}}(x)\, d\mathcal{M}(x),    (6)

where D(μ̂) is the domain over which T_μ̂ M is well-defined. For the manifolds we consider, the domain D(μ̂) is R^D. Practical estimators of μ̂ rely on gradient-based optimization to find a local minimizer of Eq. 5, which is well-defined [12]. For finite data {x_n}, n = 1, ..., N, the descent direction is proportional to v̂ = Σ_{n=1}^N Log_μ(x_n) ∈ T_μ M, and the updated mean is a point on the geodesic curve γ(t) = Exp_μ(t · v̂). After estimating the mean, the empirical covariance matrix is estimated as Σ̂ = (1 / (N−1)) Σ_{n=1}^N Log_μ̂(x_n) Log_μ̂(x_n)^⊤. It is worth noting that even though these estimators are natural, they are not maximum likelihood estimates for the Riemannian normal distribution (4).
In practice, the intrinsic mean often falls in regions of low data density [8]. For instance, for data distributed uniformly on the equator of a sphere, the optimum of Eq. 5 is one of the poles. Consequently, the empirical covariance is often overestimated.

3 A Locally Adaptive Normal Distribution

We now have the tools to define a locally adaptive normal distribution (LAND): we replace the linear Euclidean distance with a locally adaptive Riemannian distance and study the corresponding Riemannian normal distribution (4).
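The gradient-based estimator of the intrinsic mean (Eq. 5) can be sketched generically, assuming the exponential and logarithm maps are supplied as callables; this interface and the function name are ours, chosen for illustration (in Euclidean space the two maps reduce to addition and subtraction):

```python
import numpy as np

def intrinsic_mean(points, exp_map, log_map, step=1.0, iters=100, tol=1e-9):
    """Gradient descent for the intrinsic (Karcher) mean of Eq. 5.

    exp_map(mu, v) and log_map(mu, x) are assumed user-supplied.
    """
    mu = points[0].copy()
    for _ in range(iters):
        # Descent direction: average of the log-mapped data (a tangent vector).
        v = np.mean([log_map(mu, x) for x in points], axis=0)
        if np.linalg.norm(v) < tol:
            break
        # Walk along the geodesic with initial velocity v.
        mu = exp_map(mu, step * v)
    return mu
```

With Euclidean maps this recovers the arithmetic mean; on a curved manifold the fixed point depends on the metric, which is exactly why the equator example above can place the mean at a pole.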
By learning a Riemannian manifold and using its structure to\nestimate distributions of the data, we provide a new and useful link between Riemannian statistics\nand manifold learning.\n\n3.1 Constructing a Metric\n\nIn the context of manifold learning, Hauberg et al. [9] suggest to model the local behavior of the data\nmanifold via a locally-de\ufb01ned Riemannian metric. Here we propose to use a local covariance matrix\nto represent the local structure of the data. We only consider diagonal covariances for computational\nef\ufb01ciency and to prevent the over\ufb01tting. The locality of the covariance is de\ufb01ned via an isotropic\nGaussian kernel of size \u03c3. Thus, the metric tensor at x \u2208 M is de\ufb01ned as the inverse of a local\ndiagonal covariance matrix with entries\n\n(cid:33)\u22121\n\n, with wn(x) = exp\n\n3\n\n(cid:32)\n\n(cid:33)\n\n\u2212(cid:107)xn \u2212 x(cid:107)2\n\n2\n\n2\u03c32\n\n.\n\n(7)\n\n(cid:32) N(cid:88)\n\nn=1\n\nMdd(x) =\n\nwn(x)(xnd \u2212 xd)2 + \u03c1\n\n\fHere xnd is the dth dimension of the nth observation, and \u03c1 a regularization parameter to avoid\nsingular covariances. This de\ufb01nes a smoothly changing (hence Riemannian) metric that captures the\nlocal structure of the data. It is easy to see that if x is outside of the support of the data, then the\nmetric tensor is large. Thus, geodesics are \u201cpulled\u201d towards the data where the metric is small. Note\nthat the proposed metric is not invariant to linear transformations.While we restrict our attention to\nthis particular choice, other learned metrics are equally applicable, c.f. [22, 9].\n\n(cid:90)\n\n(cid:90)\n\n3.2 Estimating the Normalization Constant\n\nThe normalization constant of Eq. 
4 is by de\ufb01nition\n\n(cid:18)\n\n\u2212 1\n2\n\n(cid:19)\n(cid:104)Log\u00b5(x), \u03a3\u22121Log\u00b5(x)(cid:105)\n\ndM(x),\n\n(cid:113)(cid:12)(cid:12)M(Exp\u00b5(v))(cid:12)(cid:12) exp\n\n(cid:18)\n\nC(\u00b5, \u03a3) =\n\nexp\n\n(8)\nwhere dM(x) denotes the measure induced by the Riemannian metric. The constant C(\u00b5, \u03a3) depends\nnot only on the covariance matrix, but also on the mean of the distribution, and the curvature of the\nmanifold (captured by the logarithm map). For a general learned manifold, C(\u00b5, \u03a3) is inaccessible in\nclosed-form and we resort to numerical techniques. We start by rewriting Eq. 8 as\n\nM\n\nC(\u00b5, \u03a3) =\n\n(cid:104)v, \u03a3\u22121v(cid:105)\n\n\u2212 1\n2\n\ndv.\n\nT\u00b5M\n\n(9)\nIn effect, we integrate the distribution over the tangent space T\u00b5M instead of directly over the\nmanifold. This transformation relies on the fact that the volume of an in\ufb01nitely small area on\nthe manifold can be computed in the tangent space if we take the deformation of the metric into\naccount [15]. This deformation is captured by the measure which, in the tangent space, is dM(x) =\n\n(cid:113)(cid:12)(cid:12)M(Exp\u00b5(v))(cid:12)(cid:12)dv. For notational simplicity we de\ufb01ne the function m(\u00b5, v) =\ntion constant of the Euclidean normal distribution Z =(cid:112)(2\u03c0)D |\u03a3|.\n\nwhich intuitively captures the cost for a point to be outside the data support (m is large in low density\nareas and small where the density is high).\nWe estimate the normalization constant (9) using Monte Carlo inte-\ngration. 
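As an illustration, the local metric of Eq. 7 and the Monte Carlo estimator of the normalization constant (Eqs. 9–10 below) might be sketched as follows. The `metric` and `exp_map` callables, the function names, and the default parameter values are our assumptions, not the paper's code:

```python
import numpy as np

def local_metric(x, data, sigma=0.5, rho=1e-3):
    """Diagonal metric tensor of Eq. 7: inverse local (diagonal) covariance.

    M_dd(x) = (sum_n w_n(x) (x_nd - x_d)^2 + rho)^(-1),
    with Gaussian kernel weights w_n(x) of bandwidth sigma.
    """
    w = np.exp(-np.sum((data - x) ** 2, axis=1) / (2 * sigma ** 2))
    local_var = w @ (data - x) ** 2 + rho   # per-dimension weighted variance
    return np.diag(1.0 / local_var)         # large entries far from the data

def estimate_C(mu, Sigma, metric, exp_map, S=3000, rng=None):
    """Monte Carlo estimate of the normalization constant (Eqs. 9-10).

    m(mu, v) is the square-root metric volume at Exp_mu(v).
    """
    rng = np.random.default_rng(rng)
    D = mu.size
    Z = np.sqrt((2 * np.pi) ** D * np.linalg.det(Sigma))  # Euclidean constant
    vs = rng.multivariate_normal(np.zeros(D), Sigma, size=S)
    m = [np.sqrt(np.linalg.det(metric(exp_map(mu, v)))) for v in vs]
    return Z * np.mean(m)
```

For a Euclidean (identity) metric m ≡ 1, so the estimate collapses to the familiar constant Z, which is a convenient sanity check.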
We \ufb01rst multiply and divide the integral with the normaliza-\n\n(cid:113)(cid:12)(cid:12)M(Exp\u00b5(v))(cid:12)(cid:12),\n\n(cid:19)\n\nThen, the integral becomes an expectation estimation problem\nC(\u00b5, \u03a3) = Z \u00b7 EN (0,\u03a3)[m(\u00b5, v)], which can be estimated numer-\nically as\n\nS(cid:88)\n\nC(\u00b5, \u03a3) (cid:39) Z\n\nS\n\ns=1\n\nm(\u00b5, vs), where vs \u223c N (0, \u03a3)\n\n(10)\nand S is the number of samples on T\u00b5M. The computationally\nexpensive element is to evaluate m, which in turn requires evaluating\nExp\u00b5(v). This amounts to solving an IVP numerically, which is\nfairly fast. Had we performed the integration directly on the manifold (8) we would have had to\nevaluate the logarithm map, which is a much more expensive BVP. The tangent space integration,\nthus, scales better.\n\nFigure 3: Comparison of\nLAND and intrinsic least\nsquares means.\n\nInferring Parameters\n\n3.3\nAssuming an independent and identically distributed dataset {xn}N\n\ndistribution as pM(x1, . . . , xN ) =(cid:81)N\n\nn=1, we can write their joint\nn=1 pM(xn | \u00b5, \u03a3). 
We find parameters μ and Σ by maximum likelihood, which we implement by minimizing the mean negative log-likelihood

    \{\hat{\mu}, \hat{\Sigma}\} = \operatorname*{argmin}_{\mu \in \mathcal{M},\, \Sigma \in \mathcal{S}^D_{++}} \phi(\mu, \Sigma) = \operatorname*{argmin}_{\mu \in \mathcal{M},\, \Sigma \in \mathcal{S}^D_{++}} \frac{1}{2N} \sum_{n=1}^{N} \langle \mathrm{Log}_{\mu}(x_n), \Sigma^{-1} \mathrm{Log}_{\mu}(x_n) \rangle + \log\big( C(\mu, \Sigma) \big).    (11)

The first term of the objective function φ : M × S^D_++ → R is a data-fitting term, while the second can be seen as a force that both pulls the mean closer to the high density areas and shrinks the covariance. Specifically, when the mean is in low density areas, as well as when the covariance gives significant probability to those areas, the value of m(μ, v) will by construction be large. Consequently, C(μ, Σ) will increase and these solutions will be penalized. In practice, we find that the maximum likelihood LAND mean generally avoids low density regions, which is in contrast to the standard intrinsic least squares mean (5); see Fig. 3.

Figure 3: Comparison of LAND and intrinsic least squares means.

In practice we optimize φ using block coordinate descent: we optimize the mean keeping the covariance fixed and vice versa. Unfortunately, both of the sub-problems are non-convex, and unlike the linear normal distribution, they lack a closed-form solution. Since the logarithm map is a differentiable function, we can use gradient-based techniques to infer μ and Σ.
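The objective of Eq. 11 can be written down compactly, again assuming a `log_map` callable and a precomputed normalization constant C (a hypothetical interface, for illustration only):

```python
import numpy as np

def neg_log_likelihood(X, mu, Sigma, log_map, C):
    """Mean negative log-likelihood phi(mu, Sigma) of Eq. 11.

    log_map(mu, x) and the normalization constant C are assumed given.
    """
    L = np.array([log_map(mu, x) for x in X])   # N x D tangent vectors
    # Quadratic form <Log_mu(x_n), Sigma^{-1} Log_mu(x_n)> for each n.
    quad = np.einsum('nd,de,ne->n', L, np.linalg.inv(Sigma), L)
    return 0.5 * np.mean(quad) + np.log(C)
```

Note that C depends on both μ and Σ, so it must be re-estimated whenever either parameter changes; this is what couples the two block coordinate updates below.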
Below we give the descent directions for μ and Σ; the corresponding optimization scheme is given in Algorithm 1. Initialization is discussed in the supplements.

Optimizing μ: the objective function is differentiable with respect to μ [6], and using that ∂/∂μ ⟨Log_μ(x), Σ⁻¹ Log_μ(x)⟩ = −2 Σ⁻¹ Log_μ(x), we get the gradient

    \nabla_{\mu} \phi(\mu, \Sigma) = -\Sigma^{-1} \left[ \frac{1}{N} \sum_{n=1}^{N} \mathrm{Log}_{\mu}(x_n) - \frac{Z}{C(\mu, \Sigma) \cdot S} \sum_{s=1}^{S} m(\mu, v_s)\, v_s \right].    (12)

It is easy to see that this gradient is highly dependent on the condition number of Σ. We find that this, at times, makes the gradient unstable, and we therefore use the steepest descent direction instead of the gradient direction; this equals d_μ φ(μ, Σ) = −Σ ∇_μ φ(μ, Σ) (see supplements).

Optimizing Σ: since the covariance matrix is by definition constrained to the space S^D_++, a common trick is to decompose it as Σ⁻¹ = A^⊤ A and optimize the objective with respect to A. The gradient of this factor is (see supplements for a derivation)

    \nabla_{A} \phi(\mu, \Sigma) = A \left[ \frac{1}{N} \sum_{n=1}^{N} \mathrm{Log}_{\mu}(x_n) \mathrm{Log}_{\mu}(x_n)^\top - \frac{Z}{C(\mu, \Sigma) \cdot S} \sum_{s=1}^{S} m(\mu, v_s)\, v_s v_s^\top \right].    (13)

Here the first term fits the given data by increasing the size of the covariance matrix, while the second term regularizes the covariance towards a small matrix.

Algorithm 1 LAND maximum likelihood
Input: the data {x_n}, n = 1, ..., N, step sizes α_μ, α_A
Output: the estimated μ̂, Σ̂, Ĉ(μ̂, Σ̂)
1: initialize μ⁰, Σ⁰ and t ← 0
2: repeat
3:   estimate C(μᵗ, Σᵗ) using Eq. 10
4:   compute d_μ φ(μᵗ, Σᵗ) using Eq. 12
5:   μᵗ⁺¹ ← Exp_{μᵗ}(α_μ d_μ φ(μᵗ, Σᵗ))
6:   estimate C(μᵗ⁺¹, Σᵗ) using Eq. 10
7:   compute ∇_A φ(μᵗ⁺¹, Σᵗ) using Eq. 13
8:   Aᵗ⁺¹ ← Aᵗ − α_A ∇_A φ(μᵗ⁺¹, Σᵗ)
9:   Σᵗ⁺¹ ← [(Aᵗ⁺¹)^⊤ Aᵗ⁺¹]⁻¹
10:  t ← t + 1
11: until |φ(μᵗ⁺¹, Σᵗ⁺¹) − φ(μᵗ, Σᵗ)| ≤ ε

3.4 Mixture of LANDs

At this point we can find maximum likelihood estimates of the LAND model. We can easily extend this to mixtures of LANDs: following the derivation of the standard Gaussian mixture model [3], the objective function for inferring the parameters of the LAND mixture model is

    \psi(\Theta) = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \left[ \frac{1}{2} \langle \mathrm{Log}_{\mu_k}(x_n), \Sigma_k^{-1} \mathrm{Log}_{\mu_k}(x_n) \rangle + \log\big( C(\mu_k, \Sigma_k) \big) - \log(\pi_k) \right],    (14)

where Θ = {μ_k, Σ_k}, k = 1, ..., K, and r_nk = π_k p_M(x_n | μ_k, Σ_k) / Σ_{l=1}^K π_l p_M(x_n | μ_l, Σ_l) is the probability that x_n is generated by the kth component, with Σ_{k=1}^K π_k = 1 and π_k ≥ 0. The corresponding EM algorithm is given in the supplements.

4 Experiments

In this section we present both synthetic and real experiments to demonstrate the advantages of the LAND. We compare our model with both the Gaussian mixture model (GMM) and a mixture of LANDs using least squares (LS) estimators (5, 6). Since the latter are not maximum likelihood estimates, we use a Riemannian K-means algorithm to find cluster centers. In all experiments we use S = 3000 samples in the Monte Carlo integration. This choice is investigated empirically in the supplements.
Furthermore, we choose σ as small as possible, while ensuring that the manifold is smooth enough that geodesics can be computed numerically.

4.1 Synthetic Data Experiments

As a first experiment, we generate a nonlinear data-manifold by sampling from a mixture of 20 Gaussians positioned along a half-ellipsoidal curve (see left panel of Fig. 5). We generate 10 datasets with 300 points each, and for each dataset we fit the three models with K = 1, ..., 4 components. Then, we generate 10000 samples from each fitted model, and compute the mean negative log-likelihood of the true generative distribution using these samples. Fig. 4 shows that the LAND learns the underlying true distribution faster than the GMM. Moreover, the LAND performs better than the least squares estimators, which overestimate the covariance. In the supplements we show, using the standard AIC and BIC criteria, that the optimal LAND is achieved for K = 1, while for the least squares estimators and the GMM the optimum is achieved for K = 3 and K = 4 respectively.
In addition, in Fig. 5 we show the contours of the LAND and the GMM for K = 2. We observe that the LAND indeed adapts locally to the data and reveals their underlying nonlinear structure. This is particularly evident near the "boundaries" of the data-manifold.

Figure 4: The mean negative log-likelihood experiment.

Figure 5: Synthetic data and the fitted models. Left: the given data; the intensity of the geodesics represents the responsibility of the point to the corresponding cluster. Center: the contours of the LAND mixture model. Right: the contours of the Gaussian mixture model.

We extend this experiment to a clustering task (see left panel of Fig. 6 for the data). The center and right panels of Fig.
6 show the contours of the LAND and Gaussian mixtures, and it is evident that the\nLAND is substantially better at capturing non-ellipsoidal clusters. Due to space limitations, we move\nfurther illustrative experiments to the supplementary material and continue with real data.\n\n4.2 Modeling Sleep Stages\n\nWe consider electro-encephalography (EEG) measurements of human sleep from 10 subjects, part of\nthe PhysioNet database [11, 7, 5]. For each subject we get EEG measurements during sleep from\ntwo electrodes on the front and the back of the head, respectively. Measurements are sampled at\nfs = 100Hz, and for each 30 second window a so-called sleep stage label is assigned from the set\n{1, 2, 3, 4, REM, awake}. Rapid eye movement (REM) sleep is particularly interesting, characterized\nby having EEG patterns similar to the awake state but with a complex physiological pattern, involving\ne.g., reduced muscle tone, rolling eye movements and erection [16]. Recent evidence points to the\nimportance of REM sleep for memory consolidation [4]. Periods in which the sleeper is awake are\ntypically happening in or near REM intervals. Thus we here consider the characterization of sleep in\nterms of three categories REM, awake, and non-REM, the latter a merger of sleep stages 1 \u2212 4.\nWe extract features from EEG measurements as follows: for each subject we subdivide the 30 second\nwindows to 10 seconds, and apply a short-time-Fourier-transform to the EEG signal of the frontal\nelectrode with 50% overlapping windows. From this we compute the log magnitude of the spectrum\nlog(1 + |f|) of each window. The resulting data matrix is decomposed using Non-Negative Matrix\nFactorization (10 random starts) into \ufb01ve factors, and we use the coef\ufb01cients as 5D features. In Fig. 
7 we illustrate the nonlinear manifold structure based on a three factor analysis.

Figure 6: The clustering problem for two synthetic datasets. Left: the given data; the intensity of the geodesics represents the responsibility of the point to the corresponding cluster. Center: the LAND mixture model. Right: the Gaussian mixture model.

We perform clustering on the data and evaluate the alignment between cluster labels and sleep stages using the F-measure [14]. The LAND depends on the parameter σ to construct the metric tensor, and in this experiment it is less straightforward to select σ because of significant intersubject variability. First, we fixed σ = 1 for all subjects. From the results in Table 1 we observe that for σ = 1 the LAND(1) generally outperforms the GMM and achieves much better alignment. To further illustrate the effect of σ we fitted a LAND for σ = [0.5, 0.6, ..., 1.5] and present the best result achieved by the LAND. Selecting σ this way indeed leads to higher degrees of alignment, further underlining that the conspicuous manifold structure and the rather compact sleep stage distributions in Fig.
7 are both captured better with the LAND representation than with a linear GMM.

Figure 7: The 3 leading factors for subject "s151".

Table 1: The F-measure result for 10 subjects (the closer to 1 the better).

           s001   s011   s042   s062   s081   s141   s151   s161   s162   s191
LAND(1)    0.831  0.701  0.670  0.740  0.804  0.870  0.820  0.780  0.747  0.786
GMM        0.812  0.690  0.675  0.651  0.798  0.870  0.794  0.775  0.747  0.776
LAND       0.831  0.716  0.695  0.740  0.818  0.874  0.830  0.783  0.750  0.787

5 Related Work

We are not the first to consider Riemannian normal distributions, e.g. Pennec [15] gives a theoretical analysis of the distribution, and Zhang and Fletcher [23] consider the Riemannian counterpart of probabilistic PCA. Both consider the scenario where the manifold is known a priori. We adapt the distribution to the "manifold learning" setting by constructing a Riemannian metric that adapts to the data. This is our overarching contribution.
Traditionally, manifold learning is seen as an embedding problem where a low-dimensional representation of the data is sought. This is useful for visualization [21, 17, 18, 1], clustering [13], semi-supervised learning [2] and more. However, in embedding approaches, the relation between a new point and the embedded points is less well-defined, and consequently these approaches are less suited for building generative models. In contrast, the Riemannian approach gives the ability to measure continuous geodesics that follow the structure of the data. This makes the learned Riemannian manifold a suitable space for a generative model.
Simo-Serra et al.
[19] consider mixtures of Riemannian normal distributions on manifolds that\nare known a priori. Structurally, their EM algorithm is similar to ours, but they do not account\nfor the normalization constants for different mixture components. Consequently, their approach is\ninconsistent with the probabilistic formulation. Straub et al. [20] consider data on spherical manifolds,\nand further consider a Dirichlet process prior for determining the number of components. Such a\nprior could also be incorporated in our model. The key difference to our work is that we consider\nlearned manifolds as well as the following complications.\n\n6 Discussion\n\nIn this paper we have introduced a parametric locally adaptive normal distribution. The idea is to\nreplace the Euclidean distance in the ordinary normal distribution with a locally adaptive nonlinear\ndistance measure. In principle, we learn a non-parametric metric space, by constructing a smoothly\nchanging metric that induces a Riemannian manifold, where we build our model. As such, we propose\na parametric model over a non-parametric space.\nThe non-parametric space is constructed using a local metric that is the inverse of a local covariance\nmatrix. Here locality is de\ufb01ned via a Gaussian kernel, such that the manifold learning can be seen\nas a form of kernel smoothing. This indicates that our scheme for learning a manifold might not\nscale to high-dimensional input spaces. In these cases it may be more practical to learn the manifold\nprobabilistically [22] or as a mixture of metrics [9]. This is feasible as the LAND estimation procedure\nis agnostic to the details of the learned manifold as long as exponential and logarithm maps can be\nevaluated.\nOnce a manifold is learned, the LAND is simply a Riemannian normal distribution. This is a natural\nmodel, but more intriguing, it is a theoretical interesting model since it is the maximum entropy\ndistribution for a \ufb01xed mean and covariance [15]. 
It is generally dif\ufb01cult to build locally adaptive\ndistributions with maximum entropy properties, yet the LAND does this in a fairly straight-forward\nmanner. This is, however, only a partial truth as the distribution depends on the non-parametric space.\nThe natural question, to which we currently do not have an answer, is whether a suitable maximum\nentropy manifold exist?\nAlgorithmically, we have proposed a maximum likelihood estimation scheme for the LAND. This\ncombines a gradient-based optimization with a scalable Monte Carlo integration method. Once\nexponential and logarithm maps are available, this procedure is surprisingly simple to implement. We\nhave demonstrated the algorithm on both real and synthetic data and results are encouraging. We\nalmost always improve upon a standard Gaussian mixture model as the LAND is better at capturing\nthe local properties of the data.\nWe note that both the manifold learning aspect and the algorithmic aspect of our work can be improved.\nIt would be of great value to learn the parameter \u03c3 used for smoothing the Riemannian metric, and in\ngeneral, more adaptive learning schemes are of interest. Computationally, the bottleneck of our work\nis evaluating the logarithm maps. This may be improved by specialized solvers, e.g. probabilistic\nsolvers [10], or manifold-speci\ufb01c heuristics.\nThe ordinary normal distribution is a key element in many machine learning algorithms. We expect\nthat many fundamental generative models can be extended to the \u201cmanifold\u201d setting simply by\nreplacing the normal distribution with a LAND. Examples of this idea include Na\u00efve Bayes, Linear\nDiscriminant Analysis, Principal Component Analysis and more. Finally we note that standard\nhypothesis tests also extend to Riemannian normal distributions [15] and hence also to the LAND.\nAcknowledgements. 
LKH was funded in part by the Novo Nordisk Foundation Interdisciplinary\nSynergy Program 2014, \u2019Biophysically adjusted state-informed cortex stimulation (BASICS)\u2019. SH\nwas funded in part by the Danish Council for Independent Research, Natural Sciences.\n\nReferences\n[1] M. Belkin and P. Niyogi. Laplacian Eigenmaps for Dimensionality Reduction and Data Representation. Neural Computation, 15(6):1373\u20131396, June 2003.\n[2] M. Belkin, P. Niyogi, and V. Sindhwani. Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples. JMLR, 7:2399\u20132434, Dec. 2006.\n[3] C. M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.\n[4] R. Boyce, S. D. Glasgow, S. Williams, and A. Adamantidis. Causal evidence for the role of REM sleep theta rhythm in contextual memory consolidation. Science, 352(6287):812\u2013816, 2016.\n[5] A. Delorme and S. Makeig. EEGLAB: an open source toolbox for analysis of single-trial EEG dynamics including independent component analysis. J. Neurosci. Methods, page 21, 2004.\n[6] M. do Carmo. Riemannian Geometry. Mathematics (Boston, Mass.). Birkh\u00e4user, 1992.\n[7] A. L. Goldberger, L. A. N. Amaral, L. Glass, J. M. Hausdorff, P. C. Ivanov, R. G. Mark, J. E. Mietus, G. B. Moody, C.-K. Peng, and H. E. Stanley. PhysioBank, PhysioToolkit, and PhysioNet: Components of a New Research Resource for Complex Physiologic Signals. Circulation, 101(23):e215\u2013e220, June 2000.\n[8] S. Hauberg. Principal Curves on Riemannian Manifolds. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2016.\n[9] S. Hauberg, O. Freifeld, and M. J. Black. A Geometric Take on Metric Learning. In Advances in Neural Information Processing Systems (NIPS) 25, pages 2033\u20132041, 2012.\n[10] P. Hennig and S. Hauberg. 
Probabilistic Solutions to Differential Equations and their Application to Riemannian Statistics. In Proceedings of the 17th International Conference on Artificial Intelligence and Statistics (AISTATS), volume 33, 2014.\n[11] S. A. Imtiaz and E. Rodriguez-Villegas. An open-source toolbox for standardized use of PhysioNet Sleep EDF Expanded Database. In 2015 37th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pages 6014\u20136017, Aug. 2015.\n[12] H. Karcher. Riemannian center of mass and mollifier smoothing. Communications on Pure and Applied Mathematics, 30(5):509\u2013541, 1977.\n[13] U. von Luxburg. A Tutorial on Spectral Clustering. Statistics and Computing, 17(4):395\u2013416, Dec. 2007.\n[14] R. Marxer, H. Purwins, and A. Hazan. An f-measure for evaluation of unsupervised clustering with non-determined number of clusters. Report of the EmCAP project (European Commission FP6-IST), pages 1\u20133, 2008.\n[15] X. Pennec. Intrinsic Statistics on Riemannian Manifolds: Basic Tools for Geometric Measurements. Journal of Mathematical Imaging and Vision, 25(1):127\u2013154, July 2006.\n[16] D. Purves, G. Augustine, D. Fitzpatrick, W. Hall, A. LaMantia, J. McNamara, and L. White. Neuroscience. De Boeck, Sinauer, Sunderland, Mass., 2008.\n[17] S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290:2323\u20132326, 2000.\n[18] B. Sch\u00f6lkopf, A. Smola, and K.-R. M\u00fcller. Kernel principal component analysis. In Advances in Kernel Methods - Support Vector Learning, pages 327\u2013352, 1999.\n[19] E. Simo-Serra, C. Torras, and F. Moreno-Noguer. Geodesic Finite Mixture Models. In Proceedings of the British Machine Vision Conference. BMVA Press, 2014.\n[20] J. Straub, J. Chang, O. Freifeld, and J. W. Fisher III. A Dirichlet Process Mixture Model for Spherical Data. 
In International Conference on Artificial Intelligence and Statistics (AISTATS), 2015.\n[21] J. B. Tenenbaum, V. de Silva, and J. C. Langford. A Global Geometric Framework for Nonlinear Dimensionality Reduction. Science, 290(5500):2319, 2000.\n[22] A. Tosi, S. Hauberg, A. Vellido, and N. D. Lawrence. Metrics for Probabilistic Geometries. In The Conference on Uncertainty in Artificial Intelligence (UAI), July 2014.\n[23] M. Zhang and P. Fletcher. Probabilistic Principal Geodesic Analysis. In Advances in Neural Information Processing Systems (NIPS) 26, pages 1178\u20131186, 2013.\n