{"title": "Robust Low Rank Kernel Embeddings of Multivariate Distributions", "book": "Advances in Neural Information Processing Systems", "page_first": 3228, "page_last": 3236, "abstract": "Kernel embedding of distributions has led to many recent advances in machine learning. However, latent and low rank structures prevalent in real world distributions have rarely been taken into account in this setting. Furthermore, no prior work in kernel embedding literature has addressed the issue of robust embedding when the latent and low rank information are misspecified. In this paper, we propose a hierarchical low rank decomposition of kernels embeddings which can exploit such low rank structures in data while being robust to model misspecification. We also illustrate with empirical evidence that the estimated low rank embeddings lead to improved performance in density estimation.", "full_text": "Robust Low Rank Kernel Embeddings of\n\nMultivariate Distributions\n\nLe Song, Bo Dai\n\nCollege of Computing, Georgia Institute of Technology\nlsong@cc.gatech.edu, bodai@gatech.edu\n\nAbstract\n\nKernel embedding of distributions has led to many recent advances in machine\nlearning. However, latent and low rank structures prevalent in real world distri-\nbutions have rarely been taken into account in this setting. Furthermore, no prior\nwork in kernel embedding literature has addressed the issue of robust embedding\nwhen the latent and low rank information are misspeci\ufb01ed.\nIn this paper, we\npropose a hierarchical low rank decomposition of kernels embeddings which can\nexploit such low rank structures in data while being robust to model misspeci-\n\ufb01cation. 
We also illustrate with empirical evidence that the estimated low rank embeddings lead to improved performance in density estimation.

1 Introduction

Many applications of machine learning, ranging from computer vision to computational biology, require the analysis of large volumes of high-dimensional continuous-valued measurements. Complex statistical features are commonplace, including multi-modality, skewness, and rich dependency structures. Kernel embedding of distributions is an effective framework for addressing challenging problems in this regime [1, 2]. Its key idea is to implicitly map distributions into potentially infinite dimensional feature spaces using kernels, such that subsequent comparison and manipulation of these distributions can be achieved via feature space operations (e.g., inner product, distance, projection and spectral analysis). This new framework has led to many recent advances in machine learning such as the kernel independence test [3] and kernel belief propagation [4].

However, algorithms designed with kernel embeddings have rarely taken into account the latent and low rank structures prevalent in high dimensional data arising from applications such as gene expression analysis. While this information has been extensively exploited in other learning contexts such as graphical models and collaborative filtering, its use in kernel embeddings remains scarce and challenging. Intuitively, these intrinsically low dimensional structures of the data should reduce the effective number of parameters in kernel embeddings, and allow us to obtain a better estimator when facing high dimensional problems.

As a demonstration of this intuition, we illustrate the behavior of low rank kernel embeddings (which we will explain later in more detail) when applied to density estimation (Figure 1). 100 data points are sampled i.i.d. 
from a mixture of 2 spherical Gaussians, where the latent variable is the cluster indicator. The fitted density based on an ordinary kernel density estimator has quite different contours from the ground truth (Figure 1(b)), while those provided by low rank embeddings appear to be much closer to the ground truth (Figure 1(c)). Essentially, the low rank approximation step endows kernel embeddings with an additional mechanism to smooth the estimator, which can be beneficial when the number of data points is small and there are clusters in the data. In our later, more systematic experiments, we show that low rank embeddings lead to density estimators which can significantly improve over alternative approaches in terms of held-out likelihood.

While there are a handful of exceptions [5, 6] in the kernel embedding literature which have exploited latent and low rank information, these algorithms are not robust in the sense that, when such information is misspecified, no performance guarantee can be provided and these algorithms can fail drastically. The hierarchical low rank kernel embeddings we propose in this paper can be

[Figure 1: three contour plots over the region [-3, 3] x [-3, 3]: (a) Ground truth, (b) Ordinary KDE, (c) Low Rank KDE]

Figure 1: We draw 100 samples from a mixture of 2 spherical Gaussians with equal mixing weights. (a) the contour plot for the ground truth density, (b) for the ordinary kernel density estimator (KDE), (c) for the low rank KDE. We used cross-validation to find the best kernel bandwidth for both the KDE and low rank KDE. 
The latter produces a density which is visibly closer to the ground truth, and in terms of the integrated squared error, it is smaller than that of the KDE (0.0092 vs. 0.012).

considered as a kernel generalization of the discrete-valued tree-structured latent variable models studied in [7]. The objective of the current paper is to address previous limitations of kernel embeddings as applied to graphical models and make them more practically useful. Furthermore, we will provide both theoretical and empirical support for the new approach.

Another key contribution of the paper is a novel view of kernel embedding of multivariate distributions as infinite dimensional higher order tensors, and of the low rank structure of these tensors in the presence of latent variables. This novel view allows us to introduce modern multi-linear algebra and tensor decomposition tools to address challenging problems at the interface between kernel methods and latent variable models. We believe our work will play a synergistic role in bridging together largely separate areas in machine learning research, including kernel methods, latent variable models, and tensor data analysis.

In the remainder of the paper, we will first present the tensor view of kernel embeddings of multivariate distributions and its low rank structure in the presence of latent variables. Then we will present our algorithm for hierarchical low rank decomposition of kernel embeddings, which makes use of a sequence of nested kernel singular value decompositions. Last, we will provide both theoretical and empirical support for our proposed approach.

2 Kernel Embeddings of Distributions

We will focus on continuous domains, and denote by X a random variable with domain Ω and density p(X). The instantiations of X are denoted by lower case characters, x. A reproducing kernel Hilbert space (RKHS) F on Ω with a kernel k(x, x') is a Hilbert space of functions f : Ω → 
R with inner product ⟨·,·⟩_F. Its element k(x,·) satisfies the reproducing property, ⟨f(·), k(x,·)⟩_F = f(x), and consequently ⟨k(x,·), k(x',·)⟩_F = k(x, x'), meaning that we can view the evaluation of a function f at any point x ∈ Ω as an inner product. Alternatively, k(x,·) can be viewed as an implicit feature map φ(x) where k(x, x') = ⟨φ(x), φ(x')⟩_F. For simplicity of notation, we assume that the domains of all variables are the same and that the same kernel function is applied to all variables.

A kernel embedding represents a density by its expected features, i.e., µ_X := E_X[φ(X)] = ∫_Ω φ(x) p(x) dx, a point in the potentially infinite-dimensional and implicit feature space of a kernel [8, 1, 2]. The embedding µ_X has the property that the expectation of any RKHS function f ∈ F can be evaluated as an inner product in F, ⟨µ_X, f⟩_F := E_X[f(X)]. Kernel embeddings can be readily generalized to the joint density of d variables, X_1, ..., X_d, using the d-th order tensor product feature space F^d. In this feature space, the feature map is defined as ⊗_{i=1}^d φ(x_i) := φ(x_1) ⊗ φ(x_2) ⊗ ... ⊗ φ(x_d), and the inner product in this space satisfies ⟨⊗_{i=1}^d φ(x_i), ⊗_{i=1}^d φ(x'_i)⟩_{F^d} = ∏_{i=1}^d ⟨φ(x_i), φ(x'_i)⟩_F = ∏_{i=1}^d k(x_i, x'_i). Then we can embed a joint density p(X_1, ..., X_d) into the tensor product feature space F^d by

  C_{X_{1:d}} := E_{X_{1:d}}[⊗_{i=1}^d φ(X_i)] = ∫_{Ω^d} (⊗_{i=1}^d φ(x_i)) p(x_1, ..., x_d) ∏_{i=1}^d dx_i,   (1)

where we used X_{1:d} to denote the set of variables {X_1, ..., X_d}.

Kernel embeddings can also be generalized to conditional densities p(X|z) [9]:

  µ_{X|z} := E_{X|z}[φ(X)] = ∫_Ω φ(x) p(x|z) dx.   (2)

Given this embedding, the conditional expectation of a function f ∈ F can be computed as E_{X|z}[f(X)] = ⟨f, µ_{X|z}⟩_F. 
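The expectation property above can be checked numerically with the empirical embedding µ̂_X = (1/n) Σ_i φ(x_i): for any RKHS function f = Σ_j α_j k(y_j, ·), the inner product ⟨µ̂_X, f⟩_F is just an average of kernel evaluations. A minimal sketch (not from the paper; the Gaussian kernel, sample sizes, and variable names are our own choices):

```python
import numpy as np

def gauss_kernel(X, Y, s=1.0):
    # k(x, y) = exp(-||x - y||^2 / (2 s^2)) for all pairs of rows
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * s ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 1))      # samples defining the embedding mu_X
Y = rng.normal(size=(5, 1))         # centers defining f = sum_j alpha_j k(y_j, .)
alpha = rng.normal(size=5)

# <mu_X, f> = (1/n) sum_i f(x_i) = (1/n) sum_i sum_j alpha_j k(y_j, x_i)
inner = gauss_kernel(X, Y).mean(axis=0) @ alpha
# direct Monte Carlo estimate of E[f(X)] on fresh samples
Xnew = rng.normal(size=(2000, 1))
mc = float((gauss_kernel(Xnew, Y) @ alpha).mean())
print(abs(inner - mc))              # small: both estimate E[f(X)]
```

Both quantities are Monte Carlo estimates of the same expectation, so they agree up to sampling error.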
Unlike the ordinary embeddings, an embedding of a conditional distribution is not a single element in the RKHS, but instead sweeps out a family of points in the RKHS, each indexed by a fixed value z of the conditioning variable Z. It is only by fixing Z to a particular value z that we obtain a single RKHS element, µ_{X|z} ∈ F. In other words, the conditional embedding is an operator, denoted C_{X|Z}, which takes as input a z and outputs an embedding, i.e., µ_{X|z} = C_{X|Z}(z). Likewise, kernel embeddings of conditional distributions can also be generalized to joint distributions of d variables.

We will represent an observation from a discrete variable Z taking r possible values using the standard basis in R^r (the one-of-r representation). That is, when z takes the i-th value, the i-th dimension of the vector z is 1 and the other dimensions are 0. For instance, when r = 3, Z can take three possible values: (1,0,0)^T, (0,1,0)^T and (0,0,1)^T. In this case, we let φ(Z) = Z and use the linear kernel k(Z, Z') = Z^T Z'. Then the conditional embedding operator reduces to a separate embedding µ_{X|z} for each conditional density p(X|z). Conceptually, we can concatenate these µ_{X|z} for the different values of z in columns, C_{X|Z} := (µ_{X|z=(1,0,0)^T}, µ_{X|z=(0,1,0)^T}, µ_{X|z=(0,0,1)^T}). The operation C_{X|Z}(z) then essentially picks out the corresponding embedding (column).

3 Kernel Embeddings as Infinite Dimensional Higher Order Tensors

The above kernel embedding C_{X_{1:d}} can also be viewed as a multi-linear operator (tensor) of order d mapping from F × ... × F to R. (For a generic introduction to tensors and tensor notation, please see [10].) The operator is linear in each argument (mode) when fixing the other arguments. 
Furthermore, the application of the operator to a set of elements {f_i ∈ F}_{i=1}^d can be defined using the inner product from the tensor product feature space, i.e.,

  C_{X_{1:d}} •_1 f_1 •_2 ... •_d f_d := ⟨C_{X_{1:d}}, ⊗_{i=1}^d f_i⟩_{F^d} = E_{X_{1:d}}[∏_{i=1}^d ⟨φ(X_i), f_i⟩_F],   (3)

where •_i means applying f_i to the i-th argument of C_{X_{1:d}}. Furthermore, we can define the generalized Frobenius norm ‖·‖_• of C_{X_{1:d}} as

  ‖C_{X_{1:d}}‖_•^2 = Σ_{i_1=1}^∞ ... Σ_{i_d=1}^∞ (C_{X_{1:d}} •_1 e_{i_1} •_2 ... •_d e_{i_d})^2,

using an orthonormal basis {e_i}_{i=1}^∞ ⊂ F. We can also define the inner product for the space of such operators with ‖C_{X_{1:d}}‖_• < ∞ as

  ⟨C_{X_{1:d}}, C̃_{X_{1:d}}⟩_• = Σ_{i_1=1}^∞ ... Σ_{i_d=1}^∞ (C_{X_{1:d}} •_1 e_{i_1} •_2 ... •_d e_{i_d})(C̃_{X_{1:d}} •_1 e_{i_1} •_2 ... •_d e_{i_d}).   (4)

When C_{X_{1:d}} has the form E_{X_{1:d}}[⊗_{i=1}^d φ(X_i)], the above inner product reduces to E_{X_{1:d}}[C̃_{X_{1:d}} •_1 φ(X_1) •_2 ... •_d φ(X_d)].

In this paper, the ordering of the tensor modes is not essential, so we simply label them using the corresponding random variables. We can reshape a higher order tensor into a lower order tensor by partitioning its modes into several disjoint groups. For instance, let I_1 = {X_1, ..., X_s} be the set of modes corresponding to the first s variables and I_2 = {X_{s+1}, ..., X_d}. Similarly to the Matlab function, we can obtain a 2nd order tensor by

  C_{I_1;I_2} = reshape(C_{X_{1:d}}, I_1, I_2) : F^s × F^{d-s} → R.   (5)

In the reverse direction, we can also reshape a lower order tensor into a higher order one by further partitioning a mode of the tensor. For instance, we can partition I_1 into I'_1 = {X_1, ..., X_t} and I''_1 = {X_{t+1}, ..., X_s}, and turn C_{I_1;I_2} into a 3rd order tensor by

  C_{I'_1;I''_1;I_2} = reshape(C_{I_1;I_2}, I'_1, I''_1, I_2) : F^t × F^{s-t} × F^{d-s} → R.   (6)

Note that given an orthonormal basis {e_i}_{i=1}^∞ ⊂ F, we can readily obtain an orthonormal basis for, e.g., F^t, as {e_{i_1} ⊗ ... ⊗ e_{i_t}}_{i_1,...,i_t=1}^∞, and hence define the generalized Frobenius norm for C_{I_1;I_2} and C_{I'_1;I''_1;I_2}. 
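For a finite-dimensional analogue, mode grouping is literally numpy's reshape, and the generalized Frobenius norm is invariant under it. A small sketch (our own finite-dimensional illustration, not the paper's infinite-dimensional operators):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 4, 3                        # a 4th order tensor, each mode of size m
C = rng.normal(size=(m,) * d)      # stand-in for C_{X_{1:4}}

# group modes {X1, X2} -> I1 and {X3, X4} -> I2: a 2nd order tensor
C_I1_I2 = C.reshape(m * m, m * m)
# split I1 back into {X1} and {X2}: a 3rd order tensor
C_split = C_I1_I2.reshape(m, m, m * m)

# the (generalized) Frobenius norm is the same for every reshaping
norms = [np.linalg.norm(T) for T in (C, C_I1_I2, C_split)]
print(norms)  # three identical values
```

Reshaping only regroups the index set of the coefficients, so the sum of squared entries, and hence the norm, cannot change.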
This also implies that the generalized Frobenius norms are the same for all these reshaped tensors, i.e., ‖C_{X_{1:d}}‖_• = ‖C_{I_1;I_2}‖_• = ‖C_{I'_1;I''_1;I_2}‖_•.

[Figure 2: three latent tree diagrams: (a) X_1 and X_2 as leaves of a single latent Z; (b) X_1, X_2 attached to Z_1 and X_3, X_4 attached to Z_2; (c) a caterpillar tree (hidden Markov model) with latent chain Z_1, Z_2, ..., Z_d and leaves X_1, X_2, ..., X_d]

Figure 2: Three latent variable models with different tree topologies: (a) X_1 ⊥ X_2 | Z, (b) X_{1:2} ⊥ X_{3:4} | Z_{1:2}, (c) caterpillar tree (hidden Markov model).

The 2nd order tensor C_{I_1;I_2} can also be viewed as the cross-covariance operator between the two sets of variables I_1 and I_2. In this case, we can essentially use the notation and operations of matrices. For instance, we can perform a singular value decomposition, C_{I_1;I_2} = Σ_{i=1}^∞ s_i (u_i ⊗ v_i), where the s_i ∈ R are ordered in a nonincreasing manner, and {u_i}_{i=1}^∞ ⊂ F^s and {v_i}_{i=1}^∞ ⊂ F^{d-s} are singular vectors. The rank of C_{I_1;I_2} is the smallest r such that s_i = 0 for i > r. In this case, we will also define U_r = (u_1, u_2, ..., u_r), V_r = (v_1, v_2, ..., v_r) and S_r = diag(s_1, s_2, ..., s_r), and denote the low rank approximation as C_{I_1;I_2} ≈ U_r S_r V_r^T. Finally, a 1st order tensor, reshape(C_{X_{1:d}}, {X_{1:d}}, ∅), is simply a vector, for which we will use vector notation.

4 Low Rank Kernel Embeddings Induced by Latent Variables

In the presence of latent variables, the kernel embedding C_{X_{1:d}} will be low rank. For example, the two observed variables X_1 and X_2 in the example of Figure 1 are conditionally independent given the latent cluster indicator variable Z. That is, the joint density factorizes as p(X_1, X_2) = Σ_z p(z) p(X_1|z) p(X_2|z) (see Figure 2(a) for the graphical model). Throughout the paper, we assume that z is discrete and takes r possible values. Then the embedding C_{X_1 X_2} of p(X_1, X_2) has rank at most r. Let z be represented as the standard basis in R^r. 
Then

  C_{X_1 X_2} = E_Z[ E_{X_1|Z}[φ(X_1)] ⊗ E_{X_2|Z}[φ(X_2)] ] = C_{X_1|Z} E_Z[Z ⊗ Z] C_{X_2|Z}^T,   (7)

where E_Z[Z ⊗ Z] is an r × r matrix, hence restricting the rank of C_{X_1 X_2} to be at most r.

In our second example, four observed variables are connected via two latent variables Z_1 and Z_2, each taking r possible values. The conditional independence structure implies that the density p(X_1, X_2, X_3, X_4) factorizes as Σ_{z_1, z_2} p(X_1|z_1) p(X_2|z_1) p(z_1, z_2) p(X_3|z_2) p(X_4|z_2) (see Figure 2(b) for the graphical model). Reshaping its kernel embedding C_{X_{1:4}}, we obtain C_{X_{1:2};X_{3:4}} = reshape(C_{X_{1:4}}, {X_{1:2}}, {X_{3:4}}), which factorizes as

  E_{X_{1:2}|Z_1}[φ(X_1) ⊗ φ(X_2)] E_{Z_1 Z_2}[Z_1 ⊗ Z_2] E_{X_{3:4}|Z_2}[φ(X_3) ⊗ φ(X_4)]^T,   (8)

where E_{Z_1 Z_2}[Z_1 ⊗ Z_2] is an r × r matrix. Hence the intrinsic "rank" of the reshaped embedding is only r, although the original kernel embedding C_{X_{1:4}} is a 4th order tensor with infinite dimensions.

In general, for a latent variable model p(X_1, ..., X_d) whose conditional independence structure is a tree T, various reshapings of the kernel embedding C_{X_{1:d}} according to edges of the tree will be low rank. More specifically, each edge in the latent tree corresponds to a pair of latent variables (Z_s, Z_t) (or an observed and a hidden variable (X_s, Z_t)), which induces a partition of the observed variables into two groups, I_1 and I_2. One can imagine splitting the latent tree into two subtrees by cutting the edge. One group of variables resides in the first subtree, and the other group in the second subtree. If we reshape the tensor according to this partitioning, then

Theorem 1 Assume that all observed variables are leaves in the latent tree structure, and that all latent variables take r possible values. Then rank(C_{I_1;I_2}) ≤ r.

Proof Due to the conditional independence structure induced by the latent tree, p(X_1, ..., X_d) = Σ_{z_s} Σ_{z_t} p(I_1|z_s) p(z_s, z_t) p(I_2|z_t). 
Then its embedding can be written as

  C_{I_1;I_2} = C_{I_1|Z_s} E_{Z_s Z_t}[Z_s ⊗ Z_t] C_{I_2|Z_t}^T,   (9)

where C_{I_1|Z_s} and C_{I_2|Z_t} are the conditional embedding operators for p(I_1|z_s) and p(I_2|z_t) respectively. Since E_{Z_s Z_t}[Z_s ⊗ Z_t] is an r × r matrix, rank(C_{I_1;I_2}) ≤ r.

Theorem 1 implies that, given a latent tree model, we obtain a collection of low rank reshapings {C_{I_1;I_2}} of the kernel embedding C_{X_{1:d}}, each corresponding to an edge (Z_s, Z_t) of the tree. We will denote by H(T, r) the class of kernel embeddings C_{X_{1:d}} whose various reshapings according to the latent tree T have rank at most r.¹ We will also write C_{X_{1:d}} ∈ H(T, r) to indicate such a relation.

In practice, the latent tree model assumption may be misspecified for a joint density p(X_1, ..., X_d), and consequently the various reshapings of its kernel embedding C_{X_{1:d}} are only approximately low rank. In this case, we will instead impose a (potentially misspecified) latent structure T and a fixed rank r on the data and obtain an approximate low rank decomposition of the kernel embedding. The goal is to obtain a low rank embedding C̃_{X_{1:d}} ∈ H(T, r), while at the same time ensuring that ‖C̃_{X_{1:d}} − C_{X_{1:d}}‖_• is small. In the following, we present such a decomposition algorithm.

5 Low Rank Decomposition of Kernel Embeddings

For simplicity of exposition, we will focus on the case where the latent tree structure T has a caterpillar shape (Figure 2(c)). This decomposition can be viewed as a kernel generalization of the hierarchical tensor decompositions in [11, 12, 7]. The decomposition proceeds by reshaping the kernel embedding C_{X_{1:d}} according to the first edge (Z_1, Z_2), resulting in A_1 := C_{X_1;X_{2:d}}. Then we perform a rank r approximation of it, A_1 ≈ U_r S_r V_r^T. This leads to the first intermediate tensor G_1 = U_r, and we reshape S_r V_r^T and recursively decompose it. 
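In finite dimensions, the recursion just described is a sequence of reshapes and truncated SVDs (essentially a tensor-train decomposition). A sketch under our own small dimensions, applied to an exactly low rank tensor built from a 2-state hidden Markov chain, so the rank r truncation is lossless:

```python
import numpy as np

def caterpillar_decompose(C, r):
    """Split off one mode at a time and truncate each cut to rank r,
    returning intermediate tensors G_1, ..., G_d."""
    dims, d, G = C.shape, C.ndim, []
    B = C.reshape(C.shape[0], -1)                 # A_1 = reshape(C, {X1}, {X2:d})
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    G.append(U[:, :r])                            # G_1
    B = s[:r, None] * Vt[:r]                      # B_1 = S_r V_r^T
    for j in range(1, d - 1):
        A = B.reshape(r * dims[j], -1)            # A_j = reshape(B, {Z_{j-1}, X_j}, rest)
        U, s, Vt = np.linalg.svd(A, full_matrices=False)
        G.append(U[:, :r].reshape(r, dims[j], r)) # G_j
        B = s[:r, None] * Vt[:r]
    G.append(B)                                   # G_d
    return G

def reconstruct(G):
    T = G[0]
    for Gj in G[1:-1]:
        T = np.tensordot(T, Gj, axes=([T.ndim - 1], [0]))
    return np.tensordot(T, G[-1], axes=([T.ndim - 1], [0]))

rng = np.random.default_rng(0)
r, m = 2, 5
T = rng.random((r, r)); T /= T.sum(1, keepdims=True)   # transition matrix
O = rng.random((r, m)); O /= O.sum(1, keepdims=True)   # emission matrix
C = np.einsum('a,ai,ab,bj,bc,ck,cd,dl->ijkl',
              np.full(r, 1 / r), O, T, O, T, O, T, O)  # length-4 chain joint
G = caterpillar_decompose(C, r)
err = np.linalg.norm(C - reconstruct(G))
print(err)  # ~0: every cut of this model has rank at most 2
```

Because each cut of the caterpillar tree separates the observed variables through a 2-state latent variable, every reshaping has rank at most 2, and the recursion recovers the tensor exactly.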
We note that Algorithm 1 contains only pseudo code, and is not implementable in practice, since the kernel embedding to be decomposed can have infinite dimensions. We will design a practical kernel algorithm in the next section.

Algorithm 1 Low Rank Decomposition of Kernel Embeddings
In: A kernel embedding C_{X_{1:d}}, the caterpillar tree T, and the desired rank r
Out: A low rank embedding C̃_{X_{1:d}} ∈ H(T, r), as intermediate tensors {G_1, ..., G_d}
1: A_1 = reshape(C_{X_{1:d}}, {X_1}, {X_{2:d}}) according to tree T.
2: A_1 ≈ U_r S_r V_r^T, approximating A_1 using its r leading singular vectors.
3: G_1 = U_r, and B_1 = S_r V_r^T. G_1 can be viewed as a model with two variables, X_1 and Z_1; and B_1 as a new caterpillar tree model T_1 with variable X_1 removed from T.
4: for j = 2, ..., d − 1 do
5:   A_j = reshape(B_{j−1}, {Z_{j−1}, X_j}, {X_{j+1:d}}) according to tree T_{j−1}.
6:   A_j ≈ U_r S_r V_r^T, approximating A_j using its r leading singular vectors.
7:   G_j = reshape(U_r, {Z_{j−1}}, {X_j}, {Z_j}), and B_j = S_r V_r^T. G_j can be viewed as a model with three variables, X_j, Z_j and Z_{j−1}; and B_j as a new caterpillar tree model T_j with variables Z_{j−1} and X_j removed from T_{j−1}.
8: end for
9: G_d = B_{d−1}

Once we finish the decomposition, we obtain the low rank representation of the kernel embedding as a set of intermediate tensors {G_1, ..., G_d}. In particular, we can think of G_1 as a second order tensor with dimension ∞ × r, G_d as a second order tensor with dimension r × ∞, and G_j for 2 ≤ j ≤ d − 1 as a third order tensor with dimension r × ∞ × r. Then we can apply the low rank kernel embedding C̃_{X_{1:d}} to a set of elements {f_i ∈ F}_{i=1}^d as C̃_{X_{1:d}} •_1 f_1 •_2 ... •_d f_d = (G_1 •_1 f_1)^T (G_2 •_2 f_2) ... (G_{d−1} •_2 f_{d−1})(G_d •_2 f_d). Based on this decomposition, one can obtain a low rank density estimate by p̃(X_1, ..., X_d) = C̃_{X_{1:d}} •_1 φ(X_1) •_2 ... •_d φ(X_d). 
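Given the intermediate tensors, evaluating C̃_{X_{1:d}} •_1 f_1 ... •_d f_d never forms the full tensor: it is a chain of r-dimensional matrix-vector products, costing O(d r² m) per query instead of exponential in d. A sketch reusing finite-dimensional G_j of our own making:

```python
import numpy as np

def apply_low_rank(G, fs):
    """Compute C~ ._1 f_1 ... ._d f_d from intermediate tensors G_1..G_d.
    Shapes: G[0] is (m, r), middle G[j] are (r, m, r), G[-1] is (r, m)."""
    v = G[0].T @ fs[0]                            # (G_1 ._1 f_1)^T, an r-vector
    for Gj, f in zip(G[1:-1], fs[1:-1]):
        v = v @ np.einsum('imk,m->ik', Gj, f)     # r x r matrix per middle mode
    return v @ (G[-1] @ fs[-1])                   # scalar

rng = np.random.default_rng(1)
m, r, d = 5, 2, 4
G = ([rng.normal(size=(m, r))] +
     [rng.normal(size=(r, m, r)) for _ in range(d - 2)] +
     [rng.normal(size=(r, m))])
fs = [rng.normal(size=m) for _ in range(d)]

chain = apply_low_rank(G, fs)
# compare against contracting the fully materialized tensor
full = np.einsum('ia,ajb,bkc,cl->ijkl', *G)
direct = np.einsum('ijkl,i,j,k,l->', full, *fs)
print(np.isclose(chain, direct))  # True
```

The same chain of contractions, with f_j = φ(x_j), is how the density estimate p̃(x_1, ..., x_d) is evaluated.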
We can also compute the difference between C̃_{X_{1:d}} and the operator C_{X_{1:d}} using the generalized Frobenius norm, ‖C̃_{X_{1:d}} − C_{X_{1:d}}‖_•.

¹One can readily generalize this notation to decompositions where different reshapings have different ranks.

6 Kernel Algorithm

In practice, we are only provided with a finite number of samples {(x^i_1, ..., x^i_d)}_{i=1}^n drawn i.i.d. from p(X_1, ..., X_d), and we want to obtain an empirical low rank decomposition of the kernel embedding. In this case, we will perform a low rank decomposition of the empirical kernel embedding C̄_{X_{1:d}} = (1/n) Σ_{i=1}^n ⊗_{j=1}^d φ(x^i_j). Although the empirical kernel embedding still has infinite dimensions, we will show that we can carry out the decomposition using just kernel matrices. Let us denote the kernel matrix for dimension j of the data by K_j, where j ∈ {1, ..., d}. The (i, i')-th entry of K_j can be computed as K^{ii'}_j = k(x^i_j, x^{i'}_j). Alternatively, one can think of implicitly forming the feature matrix Φ_j = (φ(x^1_j), ..., φ(x^n_j)), so that the kernel matrix is K_j = Φ_j^T Φ_j. Furthermore, we denote the tensor feature matrix formed from dimensions j+1 to d of the data as Ψ_j = (⊗_{j'=j+1}^d φ(x^1_{j'}), ..., ⊗_{j'=j+1}^d φ(x^n_{j'})). The corresponding kernel matrix is L_j = Ψ_j^T Ψ_j, with the (i, i')-th entry of L_j defined as L^{ii'}_j = ∏_{j'=j+1}^d k(x^i_{j'}, x^{i'}_{j'}).

Step 1-3 in Algorithm 1. The key building block of the algorithm is a kernel singular value decomposition (Algorithm 2), which we explain in more detail using the example of step 2 of Algorithm 1. Using the implicitly defined feature matrices, A_1 can be expressed as A_1 = (1/n) Φ_1 Ψ_1^T. For the low rank approximation A_1 ≈ U_r S_r V_r^T via singular value decomposition, the leading r singular vectors U_r = (u_1, ..., u_r) will lie in the span of Φ_1, i.e., U_r = Φ_1(β_1, ..., β_r), where β ∈ R^n. 
Then we can transform the singular value decomposition problem for an infinite dimensional matrix into a generalized eigenvalue problem involving kernel matrices: A_1 A_1^T u = λ u becomes (1/n²) K_1 L_1 K_1 β = λ K_1 β. Let the Cholesky decomposition of K_1 be R^T R; then the generalized eigenvalue problem can be solved by redefining β̃ = R β, and solving the ordinary eigenvalue problem

  (1/n²) R L_1 R^T β̃ = λ β̃, and obtaining β = R^† β̃.   (10)

The resulting singular vectors satisfy u_l^T u_{l'} = β_l^T Φ_1^T Φ_1 β_{l'} = β_l^T K_1 β_{l'} = β̃_l^T β̃_{l'} = δ_{ll'}. Then we can obtain B_1 := S_r V_r^T = U_r^T A_1 by projecting the columns of A_1 onto the singular vectors U_r,

  B_1 = (1/n) (β_1, ..., β_r)^T Φ_1^T Φ_1 Ψ_1^T = (1/n) (β_1, ..., β_r)^T K_1 Ψ_1^T =: (1/n) (γ^1, ..., γ^n) Ψ_1^T,   (11)

where γ ∈ R^r can be treated as the reduced r-dimensional feature representation for each feature mapped data point φ(x^i_1). We also have the first intermediate tensor G_1 = U_r = Φ_1(β_1, ..., β_r) =: Φ_1(θ^1, ..., θ^n)^T, where θ ∈ R^r. The kernel singular value decomposition can then be carried out recursively on the reshaped tensor B_1.

Step 5-7 in Algorithm 1. When j = 2, we first reshape B_1 = S_r V_r^T to obtain A_2 = (1/n) Φ̃_2 Ψ_2^T, where Φ̃_2 = (γ^1 ⊗ φ(x^1_2), ..., γ^n ⊗ φ(x^n_2)). Then we can carry out a similar singular value decomposition as before, and obtain U_r = Φ̃_2(β_1, ..., β_r) =: Φ̃_2(θ^1, ..., θ^n)^T. This gives the second intermediate tensor G_2 = Σ_{i=1}^n γ^i ⊗ φ(x^i_2) ⊗ θ^i. Last, we define B_2 := S_r V_r^T = U_r^T A_2 as

  B_2 = (1/n) (θ^1, ..., θ^n) Φ̃_2^T Φ̃_2 Ψ_2^T = (1/n) (θ^1, ..., θ^n)(Γ ∘ K_2) Ψ_2^T =: (1/n) (γ^1, ..., γ^n) Ψ_2^T,   (12)

where Γ is the matrix with entries Γ^{ii'} = γ^{iT} γ^{i'} and ∘ denotes the elementwise (Hadamard) product, and carry out the recursive decomposition further.

The result of the algorithm is an empirical low rank kernel embedding, Ĉ_{X_{1:d}}, represented as a collection of intermediate tensors {G_1, ..., G_d}. 
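The finite-sample computation above can be written directly with kernel matrices. A sketch of the Cholesky-plus-eigenproblem step (our own minimal version; the small ridge added for a stable Cholesky, and the use of a linear kernel so the result can be checked against an explicit SVD, are our choices, not the paper's):

```python
import numpy as np

def kernel_svd(K, L, r, ridge=1e-8):
    """Solve (1/n^2) R L R^T b~ = lam b~ with K = R^T R, and return the
    leading eigenvalues and the coefficients beta with U_r = Phi beta."""
    n = K.shape[0]
    R = np.linalg.cholesky(K + ridge * np.eye(n)).T   # K = R^T R
    M = (R @ L @ R.T) / n ** 2
    lam, E = np.linalg.eigh((M + M.T) / 2)            # symmetric eigenproblem
    lam, E = lam[::-1][:r], E[:, ::-1][:, :r]         # leading r pairs
    beta = np.linalg.pinv(R) @ E                      # beta_l = R^+ e~_l
    return lam, beta

rng = np.random.default_rng(0)
n = 60
X = rng.normal(size=(n, 3))        # explicit features for the first variable
Y = rng.normal(size=(n, 4))        # explicit features for the remaining block
K, L = X @ X.T, Y @ Y.T            # linear-kernel Gram matrices
lam, beta = kernel_svd(K, L, r=3)

# with explicit features, A_1 = (1/n) X^T Y; its squared singular values
# should match the eigenvalues recovered through the kernel route
s = np.linalg.svd(X.T @ Y / n, compute_uv=False)
print(np.round(lam, 6), np.round(s[:3] ** 2, 6))  # matching leading values
```

The linear kernel is only for verification; with a Gaussian kernel the same routine runs unchanged on the Gram matrices K_j and L_j.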
The overall algorithm is summarized in Algorithm 3. More details about the derivation can be found in Appendix A.

The application of the set of intermediate tensors {G_1, ..., G_d} to a set of elements {f_i ∈ F} can be expressed as kernel operations. For instance, we can obtain a density estimate by p̂(x_1, ..., x_d) = Ĉ_{X_{1:d}} •_1 φ(x_1) •_2 ... •_d φ(x_d) = Σ_{z_1,...,z_{d−1}} g_1(x_1, z_1) g_2(z_1, x_2, z_2) ... g_d(z_{d−1}, x_d), where (see Appendix A for more details)

  g_1(x_1, z_1) = G_1 •_1 φ(x_1) •_2 z_1 = Σ_{i=1}^n (z_1^T θ^i) k(x^i_1, x_1),   (13)
  g_j(z_{j−1}, x_j, z_j) = G_j •_1 z_{j−1} •_2 φ(x_j) •_3 z_j = Σ_{i=1}^n (z_{j−1}^T γ^i) k(x^i_j, x_j)(z_j^T θ^i),   (14)
  g_d(z_{d−1}, x_d) = G_d •_1 z_{d−1} •_2 φ(x_d) = Σ_{i=1}^n (z_{d−1}^T γ^i) k(x^i_d, x_d).   (15)

In the above formulas, each term is a weighted combination of kernel functions, and the weighting is determined by the kernel singular value decompositions and the values of the latent variables {z_j}.

Algorithm 2 KernelSVD(K, L, r)
Out: A collection of vectors (θ^1, ..., θ^n)
1: Perform the Cholesky decomposition K = R^T R
2: Solve the eigendecomposition (1/n²) R L R^T β̃ = λ β̃, and keep the leading r eigenvectors (β̃_1, ..., β̃_r)
3: Compute β_1 = R^† β̃_1, ..., β_r = R^† β̃_r, and reorganize (θ^1, ..., θ^n)^T = (β_1, ..., β_r)

Algorithm 3 Kernel Low Rank Decomposition of Empirical Embedding C̄_{X_{1:d}}
In: A sample {(x^i_1, ..., x^i_d)}_{i=1}^n, desired rank r, a query point (x_1, ..., x_d)
Out: A low rank embedding Ĉ_{X_{1:d}} ∈ H(T, r) as intermediate operators {G_1, ..., G_d}
1: L_d = 1 1^T (the all-ones matrix)
2: for j = d, d − 1, ..., 1 do
3:   Compute the matrix K_j with K^{ii'}_j = k(x^i_j, x^{i'}_j); furthermore, if j < d, then L_j = L_{j+1} ∘ K_{j+1}
4: end for
5: (θ^1, ..., θ^n) = KernelSVD(K_1, L_1, r)
6: G_1 = Φ_1(θ^1, ..., θ^n)^T, and compute (γ^1, ..., γ^n) = (θ^1, ..., θ^n) K_1
7: for j = 2, ..., d − 1 do
8:   Γ = (γ^1, ..., γ^n)^T (γ^1, ..., γ^n), and compute (θ^1, ..., θ^n) = KernelSVD(Γ ∘ K_j, L_j, r)
9:   G_j = Σ_{i=1}^n γ^i ⊗ φ(x^i_j) ⊗ θ^i, and compute (γ^1, ..., γ^n) = (θ^1, ..., θ^n)(Γ ∘ K_j)
10: end for
11: G_d = (γ^1, ..., γ^n) Φ_d^T

7 Performance Guarantees

As we mentioned in the introduction, the latent structure imposed in the low rank decomposition of kernel embeddings may be misspecified, and the decomposition of empirical embeddings may suffer from sampling error. In this section, we provide finite-sample guarantees for Algorithm 3 even when the latent structures are misspecified. More specifically, we will bound, in terms of the generalized Frobenius norm, the difference ‖C_{X_{1:d}} − Ĉ_{X_{1:d}}‖_• between the true kernel embedding and the low rank kernel embedding estimated from a set of n i.i.d. samples {(x^i_1, ..., x^i_d)}_{i=1}^n. First, we observe that the difference can be decomposed into two terms:

  ‖C_{X_{1:d}} − Ĉ_{X_{1:d}}‖_• ≤ ‖C_{X_{1:d}} − C̃_{X_{1:d}}‖_• (E1: model error) + ‖C̃_{X_{1:d}} − Ĉ_{X_{1:d}}‖_• (E2: estimation error),   (16)

where the first term is due to the fact that the latent structures may be misspecified, while the second term is due to estimation from a finite number of data points. We will bound these two sources of error separately (the proof is deferred to Appendix B).

Theorem 2 Suppose each reshaping C_{I_1;I_2} of C_{X_{1:d}} according to an edge in the latent tree structure has a rank r approximation U_r S_r V_r^T with error ‖C_{I_1;I_2} − U_r S_r V_r^T‖_• ≤ ε. Then the low rank decomposition C̃_{X_{1:d}} from Algorithm 1 satisfies ‖C_{X_{1:d}} − C̃_{X_{1:d}}‖_• ≤ √(d−1) ε.

Although previous works [5, 6] have also used hierarchical decompositions for kernel embeddings, their decompositions make the strong assumption that the latent tree models are correctly specified. When the models are misspecified, these algorithms have no guarantees whatsoever, and may fail drastically as we show in later experiments. In contrast, the decomposition we propose here is robust in the sense that, even when the latent tree structure is misspecified, we can still provide an approximation guarantee for the algorithm. Furthermore, when the latent tree structure is correctly specified and the rank r is also correct, then C_{I_1;I_2} has rank r, hence ε = 0 and our decomposition algorithm does not incur any modeling error.

Next, we provide a bound for the estimation error. The estimation error arises from decomposing the empirical estimate C̄_{X_{1:d}} of the kernel embedding, and the error can accumulate as we combine the intermediate tensors {G_1, ..., G_d} to form the final low rank kernel embedding. More specifically, we have the following bound (the proof is deferred to Appendix C).

Theorem 3 Suppose the r-th singular value of each reshaping C_{I_1;I_2} of C_{X_{1:d}} according to an edge in the latent tree structure is lower bounded by λ. Then, with probability at least 1 − δ,

  ‖C̃_{X_{1:d}} − Ĉ_{X_{1:d}}‖_• ≤ ((1+λ)^{d−2} / λ^{d−2}) ‖C_{X_{1:d}} − C̄_{X_{1:d}}‖_• ≤ (1+λ)^{d−2} c / (λ^{d−2} √n),

with some constant c associated with the kernel and the probability δ.

From the above theorem, we can see that the smaller the r-th singular value, the more difficult it is to estimate the low rank kernel embedding. Although in the bound the error grows exponentially in 1/λ^{d−2}, in our experiments we did not observe such exponential degradation of performance, even on relatively high dimensional datasets.

8 Experiments

Besides the synthetic dataset shown in Figure 1, where low rank kernel embeddings lead to significant improvement in estimating the density, we also experimented with real world datasets from the UCI data repository. We take 11 datasets with varying dimensions and numbers of data points, and the attributes of the datasets are continuous-valued. 
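The bandwidth selection used for the baselines in these experiments can be mimicked for a plain KDE: grid the bandwidth over multiples of the median pairwise distance and keep the value with the best held-out log-likelihood. A self-contained sketch (our own simplified, single-split version of the nested cross-validation described below, on synthetic two-cluster data):

```python
import numpy as np

def kde_loglik(train, test, s):
    # average log density of test points under a Gaussian KDE fit on train
    d2 = ((test[:, None, :] - train[None, :, :]) ** 2).sum(-1)
    dim = train.shape[1]
    log_norm = -0.5 * dim * np.log(2 * np.pi * s ** 2)
    lk = log_norm - d2 / (2 * s ** 2)          # log N(test | train_i, s^2 I)
    m = lk.max(axis=1, keepdims=True)          # stable log-mean-exp over i
    return float(np.mean(m.squeeze(1) + np.log(np.exp(lk - m).mean(axis=1))))

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2, 1, size=(150, 2)),
                       rng.normal(+2, 1, size=(150, 2))])
rng.shuffle(data)
train, heldout = data[:200], data[200:]

# median pairwise distance heuristic, then the grid used in the experiments
d2 = ((train[:, None, :] - train[None, :, :]) ** 2).sum(-1)
med = np.sqrt(np.median(d2[d2 > 0]))
grid = [med * c for c in (2 ** -3, 2 ** -2, 2 ** -1, 1, 2, 4, 8)]
best = max(grid, key=lambda s: kde_loglik(train, heldout, s))
print(best / med)  # the selected multiple of the median distance
```

The same held-out criterion extends to the rank parameter r and the number of mixture components by adding them to the grid.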
We whiten the data and compare low rank kernel embeddings (Low Rank) obtained from Algorithm 3 to 3 other alternatives for continuous density estimation, namely, a mixture of Gaussians with full covariance matrix, the ordinary kernel density estimator (KDE), and the kernel spectral algorithm for latent trees (Spectral) [6]. We use the Gaussian kernel k(x, x') = (1/(√(2π) s)) exp(−‖x − x'‖² / (2s²)) for KDE, Spectral and our method (Low Rank). We split each dataset into 10 subsets, and use nested cross-validation based on held-out likelihood to choose hyperparameters: the kernel parameter s for KDE, Spectral and Low Rank ({2^{−3}, 2^{−2}, 2^{−1}, 1, 2, 4, 8} times the median pairwise distance), the rank parameter r for Spectral and Low Rank (ranging from 2 to 30), and the number of components in the Gaussian mixture (ranging from 2 to # Sample / 30). For both Spectral and Low Rank, we use the caterpillar tree in Figure 2(c) as the structure for the latent variable model.

From Table 1, we can see that low rank kernel embeddings provide the best or comparable held-out negative log-likelihood across the datasets we experimented with. In some datasets, low rank kernel embeddings lead to drastic improvement over the alternatives. For instance, in the datasets "sonar" and "yeast", the improvement is dramatic. The Spectral approach sometimes performs even worse. This makes sense, since the caterpillar tree supplied to the algorithm may be far from reality, and Spectral is not robust to model misspecification. 
Meanwhile, the Spectral algorithm also caused numerical problems in practice. In contrast, our method Low Rank uses the same latent structure, but achieves much more robust results.

Table 1: Negative log-likelihood on held-out data (the lower the better).

Data Set     # Sample   Dim.   Gaussian mixture   KDE             Spectral        Low rank
australian   690        14     17.97 ± 0.26       15.88 ± 0.11    33.50 ± 2.17    18.32 ± 0.64
bupa         345        6      8.17 ± 0.30        7.57 ± 0.14     25.01 ± 0.66    8.36 ± 0.17
german       1000       24     31.14 ± 0.41       30.57 ± 0.15    28.40 ± 11.64   22.89 ± 0.26
heart        270        13     17.72 ± 0.23       21.50 ± 2.39    18.23 ± 0.18    16.95 ± 0.13
ionosphere   351        34     47.60 ± 1.77       35.84 ± 1.00    54.91 ± 1.35    43.53 ± 1.25
pima         768        8      11.78 ± 0.04       10.38 ± 0.19    31.42 ± 2.40    10.07 ± 0.11
parkinsons   195        22     30.13 ± 0.24       33.20 ± 0.70    30.65 ± 0.66    28.19 ± 0.37
sonar        208        60     107.06 ± 1.36      89.26 ± 2.75    96.17 ± 0.27    57.96 ± 2.67
wpbc         198        33     50.75 ± 1.11       48.66 ± 2.56    49.48 ± 0.64    40.78 ± 0.86
wine         178        13     19.59 ± 0.14       19.25 ± 0.58    19.56 ± 0.56    18.67 ± 0.17
yeast        208        79     146.11 ± 5.36      137.15 ± 1.80   76.58 ± 2.24    72.67 ± 4.05

9 Discussion and Conclusion
In this paper, we presented a robust kernel embedding algorithm which can make use of the low rank structure of the data, and we provided both theoretical and empirical support for it. However, a number of issues still deserve further research. First, the algorithm requires a sequence of kernel singular value decompositions, which can be computationally intensive for high dimensional and large datasets; developing efficient algorithms that retain the theoretical guarantees will be interesting future research. Second, the statistical analysis could be sharpened.
For the moment, the analysis does not seem to suggest that the estimator obtained by our algorithm is better than ordinary KDE. Third, it will be interesting to explore other applications of low rank kernel embeddings in empirical work, such as kernel two-sample tests, kernel independence tests, and kernel belief propagation.

References
[1] A. J. Smola, A. Gretton, L. Song, and B. Schölkopf. A Hilbert space embedding for distributions. In Proceedings of the International Conference on Algorithmic Learning Theory, volume 4754, pages 13–31. Springer, 2007.
[2] B. Sriperumbudur, A. Gretton, K. Fukumizu, G. Lanckriet, and B. Schölkopf. Injective Hilbert space embeddings of probability measures. In Proc. Annual Conf. Computational Learning Theory, pages 111–122, 2008.
[3] A. Gretton, K. Fukumizu, C.-H. Teo, L. Song, B. Schölkopf, and A. J. Smola. A kernel statistical test of independence. In Advances in Neural Information Processing Systems 20, pages 585–592, Cambridge, MA, 2008. MIT Press.
[4] L. Song, A. Gretton, D. Bickson, Y. Low, and C. Guestrin. Kernel belief propagation. In Proc. Intl. Conference on Artificial Intelligence and Statistics, volume 10 of JMLR Workshop and Conference Proceedings, 2011.
[5] L. Song, B. Boots, S. Siddiqi, G. Gordon, and A. J. Smola. Hilbert space embeddings of hidden Markov models. In International Conference on Machine Learning, 2010.
[6] L. Song, A. Parikh, and E. P. Xing. Kernel embeddings of latent tree graphical models. In Advances in Neural Information Processing Systems, volume 25, 2011.
[7] L. Song, M. Ishteva, H. Park, A. Parikh, and E. Xing. Hierarchical tensor decomposition of latent tree graphical models. In International Conference on Machine Learning (ICML), 2013.
[8] A. Berlinet and C. Thomas-Agnan. Reproducing Kernel Hilbert Spaces in Probability and Statistics. Kluwer, 2004.
[9] L. Song, J. Huang, A. J. Smola, and K. Fukumizu.
Hilbert space embeddings of conditional distributions. In Proceedings of the International Conference on Machine Learning, 2009.
[10] Tamara G. Kolda and Brett W. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455–500, 2009.
[11] L. Grasedyck. Hierarchical singular value decomposition of tensors. SIAM Journal on Matrix Analysis and Applications, 31(4):2029–2054, 2010.
[12] I. Oseledets. Tensor-train decomposition. SIAM Journal on Scientific Computing, 33(5):2295–2317, 2011.
[13] L. Rosasco, M. Belkin, and E. D. Vito. On learning with integral operators. Journal of Machine Learning Research, 11:905–934, 2010.