{"title": "Spectral Methods for Indian Buffet Process Inference", "book": "Advances in Neural Information Processing Systems", "page_first": 1484, "page_last": 1492, "abstract": "The Indian Buffet Process is a versatile statistical tool for modeling distributions over binary matrices. We provide an efficient spectral algorithm as an alternative to costly Variational Bayes and sampling-based algorithms. We derive a novel tensorial characterization of the moments of the Indian Buffet Process proper and for two of its applications. We give a computationally efficient iterative inference algorithm, concentration of measure bounds, and reconstruction guarantees. Our algorithm provides superior accuracy and cheaper computation than comparable Variational Bayesian approach on a number of reference problems.", "full_text": "Spectral Methods for Indian Buffet Process Inference\n\nHsiao-Yu Fish Tung\n\nMachine Learning Department\nCarnegie Mellon University\n\nPittsburgh, PA 15213\n\nAlexander J. Smola\n\nMachine Learning Department\n\nCarnegie Mellon University and Google\n\nPittsburgh, PA 15213\n\nAbstract\n\nThe Indian Buffet Process is a versatile statistical tool for modeling distributions\nover binary matrices. We provide an ef\ufb01cient spectral algorithm as an alternative\nto costly Variational Bayes and sampling-based algorithms. We derive a novel\ntensorial characterization of the moments of the Indian Buffet Process proper and\nfor two of its applications. We give a computationally ef\ufb01cient iterative inference\nalgorithm, concentration of measure bounds, and reconstruction guarantees. Our\nalgorithm provides superior accuracy and cheaper computation than comparable\nVariational Bayesian approach on a number of reference problems.\n\n1\n\nIntroduction\n\nInferring the distributions of latent variables is a key tool in statistical modeling. 
It has a rich history dating back over a century to mixture models for identifying crabs [27], and it has served as a key tool for describing diverse sets of distributions ranging from text [10] to images [1] and user behavior [4]. In recent years spectral methods have become a credible alternative to sampling [19] and variational methods [9, 13] for the inference of such structures. In particular, the work of [6, 5, 11, 21, 29] demonstrates that it is possible to infer latent variable structure accurately, despite the problem being nonconvex and thus exhibiting many local minima. A particularly attractive aspect of spectral methods is that they allow for efficient means of inferring the model complexity in the same way as the remaining parameters, simply by appropriately thresholding an eigenvalue decomposition. This makes them well suited to nonparametric Bayesian approaches.

While the issue of spectral inference for the Dirichlet distribution is largely settled [6, 7], the domain of nonparametric tools is much richer, and it is therefore desirable to see whether the methods can be extended to other models such as the Indian Buffet Process (IBP). This is the main topic of our paper. We provide a full analysis of the tensors arising from the IBP and of how spectral algorithms need to be modified, since a degeneracy in the third order tensor requires fourth order terms. To recover the parameters and latent factors, we use Excess Correlation Analysis (ECA) [8] to whiten the higher order tensors and to reduce their dimensionality. Subsequently we employ the power method to obtain a symmetric factorization of the higher-order terms. The method provided in this work is simple to implement and highly efficient in recovering the latent factors and related parameters. We demonstrate how this approach can be used to infer an IBP structure in the models discussed in [18] and [24].
Moreover, we show empirically that the spectral algorithm provides higher accuracy and lower runtime than variational methods [14]. Statistical guarantees for recovery and stability of the estimates conclude the paper.

Outline: Section 2 gives a brief primer on the IBP. Section 3 derives the lower-order moments of the IBP and their application to two different models. Section 4 applies Excess Correlation Analysis to these moments and provides the basic structure of the algorithm. Section 5 discusses concentration of measure for the moments. Section 6 shows the empirical performance of our algorithm. Due to space constraints we relegate most derivations and proofs to the appendix.

2 The Indian Buffet Process

The Indian Buffet Process defines a distribution over equivalence classes of binary matrices Z with a finite number of rows and a (potentially) infinite number of columns [17, 18]. The idea is that this allows for automatic adjustment of the number of binary entries, corresponding to the number of independent sources, underlying causes, etc. This is a very useful strategy, and it has led to many applications including structuring Markov transition matrices [15], learning hidden causes with a bipartite graph [30] and finding latent features in link prediction [26]. Denote by n ∈ N the number of rows of Z, i.e. the number of customers sampling dishes from the "Indian Buffet", let m_k be the number of customers who have sampled dish k, let K_+ be the total number of dishes sampled, and denote by K_h the number of dishes with a particular selection history h ∈ {0,1}^n. That is, K_h > 1 only if there are two or more dishes that have been selected by exactly the same set of customers.
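The buffet metaphor above translates directly into a sampler. In the following minimal sketch (the function name and defaults are ours, not from the paper), customer j takes each previously sampled dish k with probability m_k / j and then tries a Poisson(α/j) number of new dishes:

```python
import numpy as np

def sample_ibp(n, alpha, rng=None):
    """Draw a binary matrix Z ~ IBP(alpha): customer j takes a previously
    sampled dish k with probability m_k / j, then tries Poisson(alpha / j)
    new dishes."""
    rng = np.random.default_rng(rng)
    dishes = []                       # dishes[k] = customers who took dish k
    for j in range(1, n + 1):
        for takers in dishes:         # existing dishes first
            if rng.random() < len(takers) / j:
                takers.append(j - 1)
        for _ in range(rng.poisson(alpha / j)):
            dishes.append([j - 1])    # brand-new dishes for customer j
    Z = np.zeros((n, len(dishes)), dtype=int)
    for k, takers in enumerate(dishes):
        Z[takers, k] = 1
    return Z

Z = sample_ibp(n=200, alpha=3.0, rng=0)   # E[K+] = alpha * H_n, about 17.6 here
```

The number of sampled columns K_+ concentrates around α·H_n, consistent with α governing the expected number of nonzero columns below.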
Then the probability of generating a particular matrix Z is given by [18]

p(Z) = (α^{K_+} / ∏_h K_h!) · exp(−α Σ_{j=1}^n 1/j) · ∏_{k=1}^{K_+} (n − m_k)! (m_k − 1)! / n!    (1)

Here α is a parameter determining the expected number of nonzero columns in Z. Due to the conjugacy of the prior, an alternative way of viewing p(Z) is that each column (aka dish) contains nonzero entries Z_ij that are drawn from the Bernoulli distribution Z_ij ∼ Bernoulli(π_i). That is, if we knew K_+, i.e. if we knew how many nonzero features Z contains, and if we knew the probabilities π_i, we could draw Z efficiently. We take this approach in our analysis: determine K_+ and infer the probabilities π_i directly from the data. This is more reminiscent of the model used to derive the IBP, a hierarchical Beta-Binomial model, albeit with a variable number of entries (graphically: α → π_i → Z_ij, with plates over i ∈ [K_+] and j ∈ [n]).

In general, the binary attributes Z_ij are not observed. Instead, they capture auxiliary structure pertinent to a statistical model of interest. To make matters more concrete, consider the following two models proposed by [18] and [24]. They also serve to showcase the algorithm design in our paper.

Linear Gaussian Latent Feature Model [18]. The assumption is that we observe vectorial data x, generated by a linear combination of dictionary atoms A and an associated unknown number of binary causes z, all corrupted by additive noise ε. That is, we assume that

x = Az + ε where ε ∼ N(0, σ²·1) and z ∼ IBP(α).    (2)

The dictionary matrix A is considered to be fixed but unknown. In this model our goal is to infer A, σ², and the probabilities π_i associated with the IBP model. Given that, a maximum-likelihood
Given that, a maximum-likelihood\nestimate of Z can be obtained ef\ufb01ciently.\n\nIn\ufb01nite Sparse Factor Analysis [24]. A second model is that of sparse independent component\nanalysis. In a way, it extends (2) by replacing binary attributes with sparse attributes. That is, instead\nof z we use the entry-wise product z.\u2217y. This leads to the model\n\nx = A(z.\u2217y) + \u0001 where \u0001 \u223c N (0, \u03c321) , z \u223c IBP(\u03b1) and yi \u223c p(y)\n\n(3)\nAgain, the goal is to infer A, the probabilities \u03c0i and then to associate likely values of Zij and Yij\nwith the data. In particular, [24] make a number of alternative assumptions on p(y), namely either\nthat it is iid Gaussian or that it is iid Laplacian. Note that the scale of y itself is not so important\nsince an equivalent model can always be found by rescaling A suitably.\nNote that in (3) we used the shorthand .\u2217 to denote point-wise multiplication of two vectors in\n\u2019Matlab\u2019 notation. While (2) and (3) appear rather similar, the latter model is considerably more\ncomplex since it not only amounts to a sparse signal but also to an additional multiplicative scale.\n[24] refer to the model as In\ufb01nite Sparse Factor Analysis (isFA) or In\ufb01nite Independent Component\nAnalysis (iICA) depending on the choice of p(y) respectively.\n\n2\n\n\f3 Spectral Characterization\n\nWe are now in a position to de\ufb01ne the moments of the associated binary matrix. In our approach\nwe assume that Z \u223c IBP(\u03b1). We assume that the number of nonzero attributes k is unknown\n(but \ufb01xed). Our analysis begins by deriving moments for the IBP proper. Subsequently we apply\nthis to the two models described above. All proofs are deferred to the Appendix. For notational\nconvenience we denote by S the symmetrized version of a tensor where care is taken to ensure\nthat existing multiplicities are satis\ufb01ed. 
That is, for a generic third order tensor we set S_6[A]_{ijk} = A_{ijk} + A_{kij} + A_{jki} + A_{jik} + A_{kji} + A_{ikj}. However, if e.g. A = B ⊗ c with B_{ij} = B_{ji}, we only need S_3[A]_{ijk} = A_{ijk} + A_{kij} + A_{jki} to obtain a symmetric tensor.

3.1 Tensorial Moments for the IBP

A degeneracy in the third order tensor requires that we also compute a fourth order moment. We can exclude the cases π_i = 0 and π_i = 1, since the former amounts to a nonexistent feature and the latter to a constant offset. We use M_i to denote moments of order i and S_i to denote diagonal(izable) tensors of order i. Finally, we use π ∈ R^{K_+} to denote the vector of probabilities π_i.

Order 1. This is straightforward, since we have

M_1 := E_z[z] = π =: S_1.    (4)

Order 2. The second order tensor is given by

M_2 := E_z[z ⊗ z] = π ⊗ π + diag(π − π²) = S_1 ⊗ S_1 + diag(π − π²).    (5)

Solving for the diagonal tensor we have

S_2 := M_2 − S_1 ⊗ S_1 = diag(π − π²).    (6)

The degeneracies {0, 1} of π − π² = π(1 − π) can be ignored, since they amount to nonexistent and degenerate probability distributions.

Order 3. The third order moments yield

M_3 := E_z[z ⊗ z ⊗ z] = π ⊗ π ⊗ π + S_3[π ⊗ diag(π − π²)] + diag(π − 3π² + 2π³)
     = S_1 ⊗ S_1 ⊗ S_1 + S_3[S_1 ⊗ S_2] + diag(π − 3π² + 2π³).    (7)

Solving for the diagonal tensor again gives

S_3 := M_3 − S_3[S_1 ⊗ S_2] − S_1 ⊗ S_1 ⊗ S_1 = diag(π − 3π² + 2π³).    (8)

Note that the polynomial π − 3π² + 2π³ = π(2π − 1)(π − 1) vanishes for π = 1/2. This is undesirable for the power method, since features with π_i = 1/2 become invisible in S_3; we therefore compute a fourth order tensor to exclude this case.

Order 4. The fourth order moments are

M_4 := E_z[z ⊗ z ⊗ z ⊗ z] = S_1 ⊗ S_1 ⊗ S_1 ⊗ S_1 + S_6[S_2 ⊗ S_1 ⊗ S_1] + S_3[S_2 ⊗ S_2] + S_4[S_3 ⊗ S_1] + diag(π − 7π² + 12π³ − 6π⁴)    (9)

S_4 := M_4 − S_1 ⊗ S_1 ⊗ S_1 ⊗ S_1 − S_6[S_2 ⊗ S_1 ⊗ S_1] − S_3[S_2 ⊗ S_2] − S_4[S_3 ⊗ S_1] = diag(π − 7π² + 12π³ − 6π⁴).    (10)

The roots of this polynomial are {0, 1/2 − 1/√12, 1/2 + 1/√12, 1}. Hence every latent factor and its corresponding π_k can be inferred from either S_3 or S_4.

3.2 Applications of the IBP

The above derivation showed that if we were able to access z directly, we could infer π by reading off terms from a diagonal tensor. Unfortunately, this is not quite so easy in practice, since z generally acts as a latent attribute in a more complex model. In the following we show how the models of (2) and (3) can be converted into spectral form. We need some notation for multiplications of a tensor M of order k by a set of matrices A_i:

[T(M, A_1, …, A_k)]_{i_1,…,i_k} := Σ_{j_1,…,j_k} M_{j_1,…,j_k} [A_1]_{i_1 j_1} ⋯ [A_k]_{i_k j_k}.    (11)

Note that this includes matrix multiplication: for instance, A_1^⊤ M A_2 = T(M, A_1, A_2). Also note that in the special case where the matrices A_i are vectors, this amounts to a reduction to a scalar; any such reduced dimensions are assumed to be dropped implicitly. The latter will become useful in the context of the tensor power method of [6].

Linear Gaussian Latent Feature Model. When dealing with (2) our goal is to infer both A and π.
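Before turning to the observed-data setting, the diagonal identities of Section 3.1 are easy to verify numerically on raw draws of z. A small sketch (the probabilities π are illustrative; π = 1/2 is included deliberately to exhibit the third-order degeneracy):

```python
import numpy as np

rng = np.random.default_rng(1)
pi = np.array([0.2, 0.5, 0.8])           # illustrative; pi = 0.5 on purpose
z = (rng.random((200_000, 3)) < pi).astype(float)

S1 = z.mean(axis=0)                      # (4): approx pi
M2 = z.T @ z / len(z)
S2 = M2 - np.outer(S1, S1)               # (6): approx diag(pi - pi^2)
# Diagonal of S3 from (8); z is binary, so E[z^3] = E[z] = S1:
d3 = S1 - 3 * S1 * np.diag(S2) - S1**3   # approx pi - 3 pi^2 + 2 pi^3
```

The entry of d3 at π = 1/2 is (up to sampling noise) zero, since π(2π − 1)(π − 1) vanishes there; this is exactly the degeneracy that forces the fourth-order tensor.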
The main difference is that rather than observing z we observe Az; hence all tensors are colored. Moreover, we also need to deal with the terms arising from the additive noise ε. This yields

S_1 := M_1 = T(π, A)    (12)
S_2 := M_2 − S_1 ⊗ S_1 − σ²·1 = T(diag(π − π²), A, A)    (13)
S_3 := M_3 − S_1 ⊗ S_1 ⊗ S_1 − S_3[S_1 ⊗ S_2] − S_3[m_1 ⊗ 1] = T(diag(π − 3π² + 2π³), A, A, A)    (14)
S_4 := M_4 − S_1 ⊗ S_1 ⊗ S_1 ⊗ S_1 − S_6[S_2 ⊗ S_1 ⊗ S_1] − S_3[S_2 ⊗ S_2] − S_4[S_3 ⊗ S_1] − σ² S_6[S_2 ⊗ 1] − m_4 S_3[1 ⊗ 1]
     = T(diag(π − 7π² + 12π³ − 6π⁴), A, A, A, A)    (15)

Here we used the auxiliary statistics m_1 and m_4. Denote by v the eigenvector corresponding to the smallest eigenvalue of the covariance matrix of x. Then the auxiliary variables are defined as

m_1 := E_x[ x ⟨v, x − E[x]⟩² ] = σ² T(π, A)    (16)
m_4 := E_x[ ⟨v, x − E[x]⟩⁴ ] / 3 = σ⁴.    (17)

These terms are used in a tensor power method to infer both A and π (Appendix A has a derivation).

Infinite Sparse Factor Analysis. Using the model of (3), it follows that z .∗ y has a symmetric distribution with mean 0 provided that p(y) has this property. From this it follows that the first and third order moments and tensors vanish, i.e. S_1 = 0 and S_3 = 0. We have the following statistics:

S_2 := M_2 − σ²·1 = T(c · diag(π), A, A)    (18)
S_4 := M_4 − S_3[S_2 ⊗ S_2] − σ² S_6[S_2 ⊗ 1] − m_4 S_3[1 ⊗ 1] = T(diag(f(π)), A, A, A, A).    (19)

Here m_4 is defined as in (17).
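To make (16) and (17) concrete, both auxiliary statistics can be estimated by projecting centered samples onto the minimal-variance direction v, assuming K < d so that the noise floor is identifiable. A sketch with illustrative (not paper-specified) dimensions and parameters:

```python
import numpy as np

rng = np.random.default_rng(2)
d, K, n, sigma = 8, 3, 400_000, 0.3             # illustrative; needs K < d
A = rng.normal(size=(d, K))
pi = np.array([0.2, 0.4, 0.7])
z = (rng.random((n, K)) < pi).astype(float)
x = z @ A.T + sigma * rng.normal(size=(n, d))   # model (2)

mu = x.mean(axis=0)
w, V = np.linalg.eigh(np.cov(x, rowvar=False))  # eigenvalues ascending
sigma2_hat = w[0]                               # noise floor sigma^2
v = V[:, 0]                                     # minimal-variance direction

proj = (x - mu) @ v
m1_hat = (x * proj[:, None] ** 2).mean(axis=0)  # (16): approx sigma^2 * A @ pi
m4_hat = (proj ** 4).mean() / 3                 # (17): approx sigma^4
```

Since v is (approximately) orthogonal to the range of A, the projection ⟨v, x − E[x]⟩ only sees the Gaussian noise, which is what makes both identities hold.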
Whenever p(y) in (3) is Gaussian, we have c = 1 and f(π) = π − π². Whenever p(y) follows the Laplace distribution, we have c = 2 and f(π) = 24π − 12π².

Lemma 1 Any linear model of the form (2) or (3) with the property that ε is symmetric and satisfies E[ε²] = E[ε_Gauss²] and E[ε⁴] = E[ε_Gauss⁴], with the same properties for y, will yield the same moments.

Proof This follows directly from the fact that z, ε and y are independent and that the latter two have zero mean and are symmetric. Hence the expectations carry through regardless of the actual underlying distribution.

4 Parameter Inference

Having derived symmetric tensors that contain both A and polynomials of π, we need to separate those two factors and the additive noise, as appropriate. In a nutshell, the approach is as follows: we first identify the noise floor using the assumption that the number of nonzero probabilities in π is lower than the dimensionality of the data. Secondly, we use the noise-corrected second order tensor to whiten the data. This is akin to methods used in ICA [12]. Finally, we perform power iterations on the data to obtain S_3 and S_4, or rather, their applications to data. Note that the eigenvalues in the re-scaled tensors differ slightly, since we use S_2^{†1/2} x directly rather than x.

Robust Tensor Power Method. Our reasoning follows that of [6]. It is our goal to obtain an orthogonal decomposition of the tensors S_i into an orthogonal matrix V together with a set of corresponding eigenvalues λ such that S_i = T[diag(λ), V^⊤, …, V^⊤]. This is accomplished by generalizing the Rayleigh quotient and power iterations as described in [6, Algorithm 1]:

θ ← T[S, 1, θ, …, θ] and θ ← ‖θ‖^{−1} θ.    (20)

Algorithm 1 Excess Correlation Analysis for the Linear-Gaussian model with IBP prior
Inputs: the moments M_1, M_2, M_3, M_4.
1: Infer K and σ²:
2: Optionally find a subspace R ∈ R^{d×K'} with K < K' by random projection, such that Range(R) = Range(M_2 − M_1 ⊗ M_1), and project down to R
3: Set σ² := λ_min(M_2 − M_1 ⊗ M_1)
4: Set S_2 = (M_2 − M_1 ⊗ M_1 − σ²·1)_ε by truncating to eigenvalues larger than ε
5: Set K = rank S_2
6: Set W = U Σ^{−1/2}, where [U, Σ] = svd(S_2)
7: Whitening (best carried out by preprocessing x):
8: Set W_3 := T(S_3, W, W, W)
9: Set W_4 := T(S_4, W, W, W, W)
10: Tensor Power Method:
11: Compute generalized eigenvalues and eigenvectors of W_3
12: Keep all K_1 ≤ K (eigenvalue, eigenvector) pairs (λ_i, v_i) of W_3
13: Deflate W_4 with (λ_i, v_i) for all i ≤ K_1
14: Keep all K − K_1 (eigenvalue, eigenvector) pairs (λ_i, v_i) of the deflated W_4
15: Reconstruction: with the corresponding eigenvalues {λ_1, …, λ_K}, return the set

A = { (1/Z_i) (W^†)^⊤ v_i : v_i ∈ Λ },    (21)

where Z_i = √(π_i − π_i²) with π_i = f^{−1}(λ_i); here f(π) = (−2π + 1)/√(π − π²) if i ∈ [K_1] and f(π) = (6π² − 6π + 1)/(π − π²) otherwise. (The proof of Equation (21) is provided in the Appendix.)

In a nutshell, we use a suitable number l of random initializations, perform a few (v) iterations for each, and then proceed with the most promising candidate for another d iterations. The rationale for picking the best among l candidates is that we need a high probability guarantee that the selected initialization is non-degenerate.
After finding a good candidate and normalizing its length, we deflate (i.e. subtract) the term from the tensor S.

Excess Correlation Analysis (ECA). The algorithm for recovering A is shown in Algorithm 1. We first present the method for inferring the number of latent features, K, which can be viewed as the rank of the covariance matrix. An efficient way of avoiding an eigendecomposition of a d × d matrix is to find a low-rank approximation R ∈ R^{d×K'} such that K < K' ≪ d and R spans the same space as the covariance matrix. One efficient way to find such a matrix is to set

R = (M_2 − M_1 ⊗ M_1) Θ,    (22)

where Θ ∈ R^{d×K'} is a random matrix with entries sampled independently from a standard normal. This is described, e.g., by [20]. Since there is noise in the data, we cannot expect exactly K nonzero eigenvalues with the remainder constant at the noise floor σ². An alternative strategy to thresholding by σ² is to determine K by seeking the largest slope on the curve of sorted eigenvalues.

Next, we whiten the observations by multiplying the data with W ∈ R^{d×K}. This is computationally efficient, since we can apply W directly to x, thus yielding third and fourth order tensors W_3 and W_4 of size K. Moreover, approximately factorizing S_2 is a consequence of the decomposition and random projection techniques of [20].

To find the singular vectors of W_3 and W_4 we use the robust tensor power method, as described above. From the eigenvectors found in the last step, A can be recovered with Equation (21). The fact that this algorithm only needs projected tensors makes it very efficient.
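The iteration θ ← T[S, 1, θ, θ] of (20) can be run against data directly, since T(M_3, 1, Wu, Wu) reduces to an average of rank-one terms (1/m) Σ_i x_i ⟨x_i, Wu⟩². A sketch of the restart-and-deflate loop on already-whitened data (function names are ours; the shift terms that turn M_3 into S_3 are omitted for brevity):

```python
import numpy as np

def power_step(Xw, u, pairs):
    """One step theta <- T[W3, I, theta, theta] via the implicit contraction
    mean_i (W^T x_i) <W^T x_i, u>^2, deflating already-extracted pairs."""
    t = (Xw * (Xw @ u)[:, None] ** 2).mean(axis=0)
    for lam, v in pairs:
        t = t - lam * (v @ u) ** 2 * v     # subtract lam * <v, u>^2 * v
    return t / np.linalg.norm(t)

def robust_tpm(Xw, n_factors, n_restarts=20, n_iter=60, rng=None):
    """Extract (eigenvalue, eigenvector) pairs of the empirical third moment
    of the whitened data Xw (m x K), keeping the best of several restarts."""
    rng = np.random.default_rng(rng)
    pairs = []
    for _ in range(n_factors):
        best = None
        for _ in range(n_restarts):        # random initializations
            u = rng.normal(size=Xw.shape[1])
            u /= np.linalg.norm(u)
            for _ in range(n_iter):
                u = power_step(Xw, u, pairs)
            lam = ((Xw @ u) ** 3).mean() - sum(l * (v @ u) ** 3
                                               for l, v in pairs)
            if best is None or abs(lam) > abs(best[0]):
                best = (lam, u)
        pairs.append(best)
    return pairs
```

In practice Xw would be the whitened data XW with W from step 6 of Algorithm 1; only matrix-vector products with the data are ever formed, never a K × K × K tensor.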
Streaming variants of the robust tensor power method are a subject of future research.

Further details on the projected tensor power method. Explicitly calculating the tensors M_2, M_3, M_4 is not practical for high dimensional data. It may not even be desirable to compute the projected variants of M_3 and M_4, that is, W_3 and W_4 (after suitable shifts). Instead, we can use the analog of a kernel trick to simplify the tensor power iterations to

W^⊤ T(M_l, 1, Wu, …, Wu) = (1/m) Σ_{i=1}^m W^⊤ x_i ⟨x_i, Wu⟩^{l−1} = (1/m) Σ_{i=1}^m (W^⊤ x_i) ⟨W^⊤ x_i, u⟩^{l−1}.

By using incomplete expansions, memory and storage are reduced to O(d) per term. Moreover, precomputation is O(d²) and can be accomplished in the first pass through the data.

5 Concentration of Measure Bounds

There exist a number of concentration of measure inequalities for specific statistical models using rather specific moments [8]. In the following we derive a general tool for bounding such quantities, both for the case where the statistics are bounded and for unbounded quantities alike. Our analysis borrows from [3] for the bounded case, and from the average-median theorem, see e.g. [2], otherwise.

5.1 Bounded Moments

We begin with the analysis for bounded moments. Denote by φ : X → F a set of statistics on X and let φ_l be the l-fold tensorial moments

φ_1(x) := φ(x); φ_2(x) := φ(x) ⊗ φ(x); … ; φ_l(x) := φ(x) ⊗ ⋯ ⊗ φ(x).    (23)

In this case we can define inner products via

k_l(x, x') := ⟨φ_l(x), φ_l(x')⟩ = T[φ_l(x), φ(x'), …, φ(x')] = ⟨φ(x), φ(x')⟩^l = k^l(x, x')

as reductions of the statistics of order l for a kernel k(x, x') := ⟨φ(x), φ(x')⟩. Finally, denote by

M_l := E_{x∼p(x)}[φ_l(x)] and M̂_l := (1/m) Σ_{j=1}^m φ_l(x_j)    (24)

the expectation and empirical average of φ_l. Note that these terms are identical to the statistics used in [16] whenever a polynomial kernel is used. It is therefore not surprising that a concentration of measure inequality analogous to the one proven by [3] holds:

Theorem 2 Assume that the sufficient statistics are bounded via ‖φ(x)‖ ≤ R for all x ∈ X. Then the following guarantee holds:

Pr{ sup_{u: ‖u‖ ≤ 1} |T(M_l, u, …, u) − T(M̂_l, u, …, u)| > ε_l } ≤ δ, where ε_l ≤ (2 + √(−2 log δ)) R^l / √m.

Using Lemma 1, this means that we immediately have concentration of measure for the moments S_1, …, S_4. Details are provided in the Appendix. In particular, we need a chaining result (Lemma 4) that allows us to compute bounds for products of terms efficiently. By utilizing an approach similar to [8], overall guarantees for reconstruction accuracy can be derived.

5.2 Unbounded Moments

We are interested in proving concentration of the four tensors in (13), (14), (15) and one scalar in (27). Whenever the statistics are unbounded, concentration of moment bounds are less trivial and require the use of subgaussian and Gaussian inequalities [22]. We derive a bound for fourth-order subgaussian random variables (previous work only derived bounds up to third order). Lemmas 5 and 6 detail how to obtain such guarantees.
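The √m rate of Theorem 2 can be visualized with a toy simulation (ours, not one of the paper's experiments) for l = 2, taking φ(x) uniform on a radius-R sphere so that M_2 = (R²/d)·1; quadrupling m should roughly halve the worst-case quadratic-form deviation.

```python
import numpy as np

rng = np.random.default_rng(3)
d, R = 5, 1.0

def spectral_dev(m):
    """sup_{||u||<=1} |u^T (M2_hat - M2) u| for phi(x) uniform on the
    radius-R sphere; the true second moment is (R^2 / d) * I."""
    X = rng.normal(size=(m, d))
    X *= R / np.linalg.norm(X, axis=1, keepdims=True)   # project to sphere
    return np.linalg.norm(X.T @ X / m - (R**2 / d) * np.eye(d), 2)

devs = [np.median([spectral_dev(m) for _ in range(20)])
        for m in (100, 400, 1600)]       # deviation shrinks like 1/sqrt(m)
```

The median deviation drops by roughly a factor of two at each step, matching the R^l/√m bound.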
We further obtain bounds for the tensors from the concentration of the moments in Lemmas 7 and 8, and we provide bounds for the reconstruction accuracy of our algorithm. The full proof is in the Appendix.

Theorem 3 (Reconstruction Accuracy) Let ς_k[S_2] be the k-th largest singular value of S_2. Define

π_min = argmax_{i∈[K]} |π_i − 0.5|, π_max = argmax_{i∈[K]} π_i and π̃ = ∏_{i: π_i ≤ 0.5} π_i · ∏_{i: π_i > 0.5} (1 − π_i).

Pick any δ, ε ∈ (0, 1). There exists a polynomial poly(·) such that if the sample size m satisfies

m ≥ poly( d, K, 1/ε, log(1/δ), 1/π̃, ς_1[S_2]/ς_K[S_2], (Σ_{i=1}^K ‖A_i‖²₂)/ς_K[S_2], 1/√(π_min − π_min²), π_max/√(π_max − π_max²), σ²/ς_K[S_2] ),

then with probability greater than 1 − δ there is a permutation τ on [K] such that the Â returned by Algorithm 1 satisfies

‖Â_{τ(i)} − A_i‖ ≤ (‖A_i‖₂ + √(ς_1[S_2])) ε for all i ∈ [K].

6 Experiments

We evaluate the algorithm on a number of problems suitable for the two models of (2) and (3). The problems are largely identical to those put forward in [18], in order to keep our results comparable with a more traditional inference approach. We demonstrate that our algorithm is faster, simpler, and achieves comparable or superior accuracy.

Synthetic data. Our goal is to demonstrate the ability to recover the latent structure of generated data. Following [18] we generate images via linear noisy combinations of 6 × 6 templates; that is, we use the binary additive model of (2). The goal is to recover both the templates and to assess their respective presence in observed data. Using an additive noise variance of σ² = 0.5 we are able to recover the original signal quite accurately (from left to right: true signal, signal inferred from 100 samples, signal inferred from 500 samples). Furthermore, as the second row indicates, our algorithm also correctly infers the attributes present in the images.

For a more quantitative evaluation we compared our results to the infinite variational algorithm of [14]. The data is generated using σ ∈ {0.1, 0.2, 0.3, 0.4, 0.5} and with sample sizes n ∈ {100, 200, 300, 400, 500}. Figure 1 shows that our algorithm is faster and comparably accurate.

Figure 1: Comparison to the infinite variational approach. The first plot compares the test negative log likelihood, trained on N = 500 samples with different σ. The second plot shows CPU time against data size N for the two methods.

Image Source Recovery. We repeated the same test using 100 photos from [18]. We first reduce the dimensionality of the data set by representing the images with 100 principal components and apply our algorithm to the 100-dimensional dataset (see Algorithm 1 for details). Figure 2 shows the result. We used 10 initial iterations, 50 random seeds, and 30 final iterations in the Robust Tensor Power Method. The total runtime was 0.2788s.

Figure 2: Results of modeling 100 images from [18] of size 240 × 320 by model (2).
Row 1: four sample images, each containing up to four objects ($20 bill, Klein bottle, prehistoric handaxe, cellular phone). Each object appears in essentially the same location, but small variations arise because the items are placed into the scene by hand. Row 2: independent attributes, as determined by the infinite variational inference of [14] (note that the results in [18] are black and white only). Row 3: independent attributes, as determined by the spectral IBP. Row 4: reconstruction of the images via the spectral IBP. The binary superscripts indicate the items identified in each image.

Figure 3: Recovery of the source matrix A in model (3), comparing MCMC sampling and spectral methods. MCMC sampling required 1.72 seconds and yielded a Frobenius distance ‖A − A_MCMC‖_F = 0.77. Our spectral algorithm required 0.77 seconds to achieve a distance ‖A − A_Spectral‖_F = 0.31.

Figure 4: Gene signatures derived by the spectral IBP. They show that there are common hidden causes in the observed expression levels, thus offering a considerably simplified representation.

Gene Expression Data. As a first sanity check of the feasibility of our model for (3), we generated synthetic data using x ∈ R^7 with k = 4 sources and n = 500 samples, as shown in Figure 3. For a more realistic analysis we used a microarray dataset. The data consist of 587 mouse liver samples measuring 8565 gene probes, available as dataset GSE2187 in NCBI's Gene Expression Omnibus, www.ncbi.nlm.nih.gov/geo. There are four main types of treatments: toxicant, statin, fibrate and azole. Figure 4 shows the inferred latent factors arising from expression levels of samples on 10 derived gene signatures.
According to the result, the group of\n\ufb01brate-induced samples and a small group of toxicant-induced samples can be classi\ufb01ed accurately\nby the special patterns. Azole-induced samples have strong positive signals on gene signatures 4\nand 8, while statin-induced samples have strong positive signals only on the 9 gene signatures.\nSummary\nIn this paper we introduced a spectral approach to inferring latent parameters in the\nIndian Buffet Process. We derived tensorial moments for a number of models, provided an ef\ufb01cient\ninference algorithm, concentration of measure theorems and reconstruction guarantees. All this is\nbacked up by experiments comparing spectral and MCMC methods.\nWe believe that this is a \ufb01rst step towards expanding spectral nonparametric tools beyond the\nmore common Dirichlet Process representations. Applications to more sophisticated models, larger\ndatasets and ef\ufb01cient implementations are subject for future work.\n\n8\n\nOriginal GSpectral isFAMCMC\fReferences\n[1] R. Adams, Z. Ghahramani, and M. Jordan. Tree-structured stick breaking for hierarchical data. In Neural\n\nInformation Processing Systems, pages 19\u201327, 2010.\n\n[2] N. Alon, Y. Matias, and M. Szegedy. The space complexity of approximating the frequency moments.\n\nJournal of Computers and System Sciences, 58(1):137\u2013147, 1999.\n\n[3] Y. Altun and A. J. Smola. Unifying divergence minimization and statistical inference via convex duality.\nIn H.U. Simon and G. Lugosi, editors, Proc. Annual Conf. Computational Learning Theory, LNCS, pages\n139\u2013153. Springer, 2006.\n\n[4] M. Aly, A. Hatch, V. Josifovski, and V.K. Narayanan. Web-scale user modeling for targeting. In Confer-\n\nence on World Wide Web, pages 3\u201312. ACM, 2012.\n\n[5] A. Anandkumar, K. Chaudhuri, D. Hsu, S. Kakade, L. Song, and T. Zhang. Spectral methods for learning\n\nmultivariate latent tree structure. In Neural Information Processing Systems, 2011.\n\n[6] A. Anandkumar, R. Ge, D. 
Hsu, S. M. Kakade, and M. Telgarsky. Tensor decompositions for learning latent variable models. arXiv preprint arXiv:1210.7559, 2012.

[7] A. Anandkumar, R. Ge, D. Hsu, and S. M. Kakade. A tensor spectral approach to learning mixed membership community models. In Proc. Annual Conf. Computational Learning Theory, 2013.

[8] A. Anandkumar, D. P. Foster, D. Hsu, S. M. Kakade, and Y.-K. Liu. Two SVDs suffice: Spectral decompositions for probabilistic topic modeling and latent Dirichlet allocation. CoRR, abs/1204.6703, 2012.

[9] D. Blei and M. Jordan. Variational inference for Dirichlet process mixtures. Bayesian Analysis, 1:121-144, 2005.

[10] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993-1022, January 2003.

[11] B. Boots, A. Gretton, and G. J. Gordon. Hilbert space embeddings of predictive state representations. In Conference on Uncertainty in Artificial Intelligence, 2013.

[12] J.-F. Cardoso. Blind signal separation: statistical principles. Proceedings of the IEEE, 90(8):2009-2026, 1998.

[13] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39(1):1-22, 1977.

[14] F. Doshi, K. Miller, J. Van Gael, and Y. W. Teh. Variational inference for the Indian buffet process. Journal of Machine Learning Research - Proceedings Track, 5:137-144, 2009.

[15] E. B. Fox, E. B. Sudderth, M. I. Jordan, and A. S. Willsky. Sharing features among dynamical systems with beta processes. In Neural Information Processing Systems 22, 2010.

[16] A. Gretton, K. Borgwardt, M. Rasch, B. Schoelkopf, and A. Smola. A kernel two-sample test. Journal of Machine Learning Research, 13:723-773, 2012.

[17] T. Griffiths and Z. Ghahramani. Infinite latent feature models and the Indian buffet process. In
Advances in Neural Information Processing Systems 18, pages 475-482, 2006.

[18] T. Griffiths and Z. Ghahramani. The Indian buffet process: An introduction and review. Journal of Machine Learning Research, 12:1185-1224, 2011.

[19] T. L. Griffiths and M. Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences, 101:5228-5235, 2004.

[20] N. Halko, P. G. Martinsson, and J. A. Tropp. Finding structure with randomness: Stochastic algorithms for constructing approximate matrix decompositions, 2009. oai:arXiv.org:0909.4061.

[21] D. Hsu, S. Kakade, and T. Zhang. A spectral algorithm for learning hidden Markov models. In Proc. Annual Conf. Computational Learning Theory, 2009.

[22] D. Hsu, S. Kakade, and T. Zhang. Tail inequalities for sums of random matrices that depend on the intrinsic dimension. Electron. Commun. Probab., 17:13, 2012.

[23] D. Hsu and S. M. Kakade. Learning mixtures of spherical Gaussians: moment methods and spectral decompositions. 2012.

[24] D. Knowles and Z. Ghahramani. Infinite sparse factor analysis and infinite independent components analysis. In International Conference on Independent Component Analysis and Signal Separation, 2007.

[25] C. McDiarmid. On the method of bounded differences. In Surveys in Combinatorics, pages 148-188. Cambridge University Press, 1989.

[26] K. T. Miller, T. L. Griffiths, and M. I. Jordan. Latent feature models for link prediction. In Snowbird, 2 pages, 2009.

[27] K. Pearson. Contributions to the mathematical theory of evolution. Philosophical Transactions of the Royal Society, pages 71-71, 1894.

[28] G. Pisier. The Volume of Convex Bodies and Banach Space Geometry. Cambridge University Press, Cambridge, 1989.

[29] L. Song, B. Boots, S. Siddiqi, G. Gordon, and A. J. Smola. Hilbert space embeddings of hidden Markov models.
In International Conference on Machine Learning, 2010.

[30] F. Wood, T. L. Griffiths, and Z. Ghahramani. A non-parametric Bayesian method for inferring hidden causes. In Conference on Uncertainty in Artificial Intelligence, 2006.