{"title": "Bayesian Probabilistic Co-Subspace Addition", "book": "Advances in Neural Information Processing Systems", "page_first": 1403, "page_last": 1411, "abstract": "For modeling data matrices, this paper introduces the Probabilistic Co-Subspace Addition (PCSA) model, which simultaneously captures the dependent structures among both rows and columns. Briefly, PCSA assumes that each entry of a matrix is generated by the additive combination of the linear mappings of two features, which lie in the row-wise and column-wise latent subspaces. Consequently, it captures the dependencies among entries intricately, and is able to model non-Gaussian and heteroscedastic densities. Variational inference is proposed on PCSA for approximate Bayesian learning, where the posterior updating is formulated as the problem of solving Sylvester equations. Furthermore, PCSA is extended to tackle and fill missing values, to adapt its sparseness, and to model tensor data. In comparison with several state-of-the-art approaches, experiments demonstrate the effectiveness and efficiency of Bayesian (sparse) PCSA on modeling matrix (tensor) data and filling missing values.", "full_text": "Bayesian Probabilistic Co-Subspace Addition

Lei Shi

Baidu.com, Inc

shilei06@baidu.com

Abstract

For modeling data matrices, this paper introduces the Probabilistic Co-Subspace Addition (PCSA) model, which simultaneously captures the dependent structures among both rows and columns. Briefly, PCSA assumes that each entry of a matrix is generated by the additive combination of the linear mappings of two low-dimensional features, which lie in the row-wise and column-wise latent subspaces respectively. In consequence, PCSA captures the dependencies among entries intricately, and is able to handle non-Gaussian and heteroscedastic densities. 
By formulating the posterior updating as the task of solving Sylvester equations, we propose an efficient variational inference algorithm. Furthermore, PCSA is extended to tackle and fill missing values, to adapt model sparseness, and to model tensor data. In comparison with several state-of-the-art methods, experiments demonstrate the effectiveness and efficiency of Bayesian (sparse) PCSA on modeling matrix (tensor) data and filling missing values.

1 Introduction

This paper focuses on modeling data matrices by simultaneously capturing the dependent structures among both rows and columns, which is especially useful for filling missing values. Using Gaussian Processes (GPs), Xu et al. [25] modified the kernel to incorporate relational information and drew outputs from GPs. Widely used in geostatistics, the Linear Model of Coregionalization (LMC) [5] learns the covariance structures over the vectorized data matrix. In [12, 16], Bayesian probabilistic matrix factorization (PMF) is investigated by modeling the row-wise and column-wise specific variances, with inference based on suitable priors. Probabilistic Matrix Addition (PMA) [1] describes the covariance structures among rows and columns, showing promising results compared with GP regression, PMF and LMC. However, both LMC and PMA are inefficient on large-scale matrices. On high dimensional data, subspace structures are often built into statistical models to reduce the number of free parameters, improving both learning efficiency and accuracy [3, 11, 24]. Equipping PMA with such subspace structures, this paper proposes a simple yet novel generative Probabilistic Co-Subspace Addition (PCSA) model, which, as its name suggests, assumes that all entries of a matrix come from the sums of linear mappings of latent features in row-wise and column-wise hidden subspaces. 
Including many existing models as special cases (see Section 2.1), PCSA is able to capture the dependencies among entries intricately, fit non-Gaussian and heteroscedastic densities, and extract the hidden features in the co-subspaces.

We propose a variational Bayesian algorithm for inferring both the parameters and the latent dimensionalities of PCSA. For quick and stable convergence, we formulate the posterior updating procedure as solving Sylvester equations [10]. Furthermore, Bayesian PCSA is implemented in three extensions. First, missing values in data matrices are easily tackled and filled by iterating with the variational inference. Second, with a Jeffreys prior, Bayesian sparse PCSA is implemented with adaptive model sparseness [4]. Finally, we extend PCSA from matrix data (i.e., 2nd-order tensors) to PCSA-k for modeling tensor data of an arbitrary order k.

On the task of filling missing values in matrix data, we compare (sparse) PCSA with several state-of-the-art models/approaches, including PMA, Robust Bayesian PMF and Bayesian GPLVM [21]. The datasets under consideration range over multi-label classification data, user-item rating data for collaborative filtering, and face images. Further, on tensor-structured face image data, PCSA is compared with the M2SA method [6], which uses consecutive SVDs on all modes of the tensor. Although simple and not designed for any particular application, through experiments PCSA shows results promisingly comparable to or better than the competing approaches.

2 PCSA Model and Variational Bayesian Inference

2.1 Probabilistic Co-Subspace Addition (PCSA)

The PCSA model defines distributions over real-valued matrices. Letting $X \in R^{D_1 \times D_2}$ be an observed matrix with $D_1 \leq D_2$ without loss of generality (footnote 1), we start by outlining a generative model for $X$. 
Consider two hidden variables $y \sim N(y|0_{d_1}, I_{d_1})$ and $z \sim N(z|0_{d_2}, I_{d_2})$ with $d_1 < D_1$ and $d_2 < D_2$, where $0_d$ denotes the $d$-dimensional zero vector and $I_d$ the $d \times d$ identity matrix. Using Matlab's concatenation notation, two matrices of hidden factors $Y = [y_{*1}, \ldots, y_{*D_2}] \in R^{d_1 \times D_2}$ and $Z = [z_{*1}, \ldots, z_{*D_1}] \in R^{d_2 \times D_1}$ are generated column-wise independently. Through two linear mapping matrices $A \in R^{D_1 \times d_1}$ and $B \in R^{D_2 \times d_2}$, each entry $x_{ij}$ of $X$ is independent given $Y$ and $Z$, with $x_{ij} = a_{i*} y_{*j} + b_{j*} z_{*i} + e_{ij}$, where $a_{i*}$ is the $i$-th row of $A$. Each $e_{ij} \sim N(e_{ij}|0, 1/\tau)$ is independently Gaussian distributed and independent of $Y$ and $Z$. The generative process of $X$ thus is:

• Get $Y$ by independently drawing each vector $y_{*j} \sim N(y_{*j}|0_{d_1}, I_{d_1})$ for $j = 1, \ldots, D_2$;
• Get $Z$ by independently drawing each vector $z_{*i} \sim N(z_{*i}|0_{d_2}, I_{d_2})$ for $i = 1, \ldots, D_1$;
• Get $E \in R^{D_1 \times D_2}$ by independently drawing each element $e_{ij} \sim N(e_{ij}|0, 1/\tau)$ for all $i, j$;
• Get $X = AY + (BZ)^\top + E$ given $Y$ and $Z$, i.e., additively combine the co-subspaces.

Given parameters $\theta = \{A, B, \tau\}$, the joint distribution of $X$, $Y$ and $Z$ is

$$p(X, Y, Z|\theta) = \Big[\prod_{j=1}^{D_2} N(y_{*j}|0_{d_1}, I_{d_1})\Big] \cdot \Big[\prod_{i=1}^{D_1} N(z_{*i}|0_{d_2}, I_{d_2})\Big] \cdot \Big[\prod_{i=1}^{D_1}\prod_{j=1}^{D_2} N(x_{ij}|a_{i*}y_{*j} + b_{j*}z_{*i}, 1/\tau)\Big]. \quad (1)$$

Properties and relations to existing work. Despite its simple generative process, PCSA has meaningful properties and can be viewed as an extension of several existing models.

• Intricate dependencies between entries in X. Although each entry $x_{ij}$ of $X$ is independent given $Y$ and $Z$, the PCSA model captures the dependencies along rows as well as columns in the joint $X$. 
Particularly, assuming $D_1$ is the data dimensionality and $D_2$ the sample size, the samples (column vectors) in $X$ are dependent on each other under PCSA. When $B$ is constrained to $0$, PCSA degenerates to Probabilistic PCA (PPCA) [3], which insists on the i.i.d. sample assumption.

• Non-Gaussianity and heteroscedasticity. If we again take $D_1$ as the data dimensionality and $D_2$ as the sample size, the PCSA model handles non-Gaussianity across the samples of $X$. As an extreme example, if all columns of $Z^\top B^\top$ are discretized to take values from a set of $n$ vectors, PCSA degenerates to a Mixture of PPCA [7, 20] with $n$ components whose subspace loadings are shared; learning such a PCSA model actually implements group PPCA [24] across the components. Also, marginalizing $Z$ describes the column samples of $X$ with a dependent heteroscedasticity.

• Co-subspace feature extraction. Although able to describe the row-wise and column-wise covariances, PMA [1] requires estimating and inverting two (large) kernel matrices of sizes $D_1 \times D_1$ and $D_2 \times D_2$ respectively, which is intractable for many real-world applications. In contrast, PCSA has $(D_1 d_1 + D_2 d_2 + 1)$ free parameters, inverts smaller matrices, and recovers PMA when $d_1 = D_1$ and $d_2 = D_2$. Moreover, PCSA extracts the hidden features $Y$ and $Z$ simultaneously.

Footnote 1: Otherwise, we can transpose X. This assumption is for efficient Sylvester equation solving in the sequel.

2.2 Variational Bayesian Inference

Given $X$ and the hidden dimensionalities $(d_1, d_2)$, we could estimate PCSA's parameters $\theta = \{A, B, \tau\}$ by maximizing the likelihood $p(X|\theta)$. However, capacity control is essential for generalization, so we proceed to derive a variational Bayesian inference for PCSA. Introducing hyper-parameters $\varsigma = [\varsigma_1, \ldots, \varsigma_{d_1}]^\top$ and $\phi = [\phi_1, \ldots, \phi_{d_2}]^\top$ for hierarchical Normal-Gamma priors on $(A, \varsigma)$ and $(B, \phi)$ respectively [3, 7, 20], the prior $p(\theta)$ is

$$p(\theta, \varsigma, \phi) = p(\tau)\, p(A, \varsigma)\, p(B, \phi), \qquad p(\tau) = \Gamma(\tau|u^\tau, v^\tau),$$
$$p(A, \varsigma) = p(A|\varsigma)p(\varsigma), \quad p(A|\varsigma) = \prod_{i=1}^{d_1} N(a_{*i}|0_{D_1}, I_{D_1}/\varsigma_i), \quad p(\varsigma) = \prod_{i=1}^{d_1} \Gamma(\varsigma_i|u^\varsigma_i, v^\varsigma_i),$$
$$p(B, \phi) = p(B|\phi)p(\phi), \quad p(B|\phi) = \prod_{i=1}^{d_2} N(b_{*i}|0_{D_2}, I_{D_2}/\phi_i), \quad p(\phi) = \prod_{i=1}^{d_2} \Gamma(\phi_i|u^\phi_i, v^\phi_i), \quad (2)$$

where $\Gamma(\cdot|u, v)$ denotes a Gamma distribution with shape parameter $u$ and inverse scale parameter $v$. Each column $a_{*i}$ of the mapping matrix $A$ a priori independently follows a spherical Gaussian with precision scalar $\varsigma_i$, i.e., an automatic relevance determination (ARD) type prior [14]. Each precision $\varsigma_i$ further follows a Gamma prior, completing the specification of the Bayesian model. It is computationally intractable to evaluate the marginal likelihood $p(X) = \int p(X|\Theta)p(\Theta)\, d\Theta$, where $\Theta = \{Z, Y, \theta, \varsigma, \phi\}$ collects all parameters and latent variables. Since MCMC samplers are inefficient on high dimensional data, this paper adopts variational inference instead [11], which introduces a distribution $Q(\Theta)$ and approximates maximizing the log marginal likelihood $\log p(X)$ by maximizing the lower bound $L(Q) = \int Q(\Theta) \log \frac{p(X, \Theta)}{Q(\Theta)}\, d\Theta$. 
For tractability, $Q(\Theta)$ is factorized into the following mean-field form:

$$Q(\Theta) = Q(Y)Q(Z)Q(A)Q(B)Q(\tau)Q(\varsigma)Q(\phi),$$
$$Q(Y) = \prod_{j=1}^{D_2} Q(y_{*j}), \quad Q(Z) = \prod_{i=1}^{D_1} Q(z_{*i}), \quad Q(A) = \prod_{i=1}^{d_1} Q(a_{*i}), \quad Q(B) = \prod_{i=1}^{d_2} Q(b_{*i}), \quad Q(\varsigma) = \prod_{i=1}^{d_1} Q(\varsigma_i), \quad Q(\phi) = \prod_{i=1}^{d_2} Q(\phi_i). \quad (3)$$

Maximizing $L(Q)$ w.r.t. each factor $Q(\vartheta)$, $\vartheta \in \Theta$, leads to the following explicit conjugate forms:

$$Q(y_{*t}) = N(y_{*t}|\bar{y}_{*t}, \bar{\Sigma}_Y), \quad Q(z_{*i}) = N(z_{*i}|\bar{z}_{*i}, \bar{\Sigma}_Z), \quad Q(a_{*i}) = N(a_{*i}|\bar{a}_{*i}, \psi_A I_{D_1}), \quad Q(b_{*i}) = N(b_{*i}|\bar{b}_{*i}, \psi_B I_{D_2}),$$
$$Q(\tau) = \Gamma(\tau|\bar{u}^\tau, \bar{v}^\tau), \quad Q(\varsigma_i) = \Gamma(\varsigma_i|\bar{u}^\varsigma_i, \bar{v}^\varsigma_i), \quad Q(\phi_i) = \Gamma(\phi_i|\bar{u}^\phi_i, \bar{v}^\phi_i). \quad (4)$$

For notational simplicity, we denote $\bar{A} = [\bar{a}_{*1}, \ldots, \bar{a}_{*d_1}]$ and similarly for $\bar{B}$, $\bar{Y}$ and $\bar{Z}$. During the maximization of $L(Q)$, the solutions of $\bar{Y}$ and $\bar{Z}$ are coupled and conditional on each other:

$$\bar{Y} = \langle\tau\rangle S_A \bar{A}^\top (X - \bar{Z}^\top \bar{B}^\top), \quad \bar{\Sigma}_Y = S_A, \quad S_A = \big[(1 + \langle\tau\rangle D_1 \psi_A) I_{d_1} + \langle\tau\rangle \bar{A}^\top \bar{A}\big]^{-1},$$
$$\bar{Z} = \langle\tau\rangle S_B \bar{B}^\top (X - \bar{A}\bar{Y})^\top, \quad \bar{\Sigma}_Z = S_B, \quad S_B = \big[(1 + \langle\tau\rangle D_2 \psi_B) I_{d_2} + \langle\tau\rangle \bar{B}^\top \bar{B}\big]^{-1}, \quad (5)$$

where $\langle\cdot\rangle$ denotes expectation and $\langle\tau\rangle = \bar{u}^\tau/\bar{v}^\tau$. Directly iterating the above converges neither quickly nor stably. Instead, after substituting one equation into the other, we attain a Sylvester equation [10] and can solve it efficiently with many standard tools. 
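As an illustration of this step, an equation of the form $L_1 Z L_2 - Z + L_3 = 0$ can be reduced to the standard Sylvester form $AX + XB = Q$ handled by off-the-shelf solvers: assuming $L_1$ is invertible, left-multiplying gives $L_1^{-1} Z - Z L_2 = L_1^{-1} L_3$. The sketch below uses random stand-in coefficients with the shapes of Eq.(6), not the actual variational quantities:

```python
import numpy as np
from scipy.linalg import solve_sylvester

rng = np.random.default_rng(1)
d2, D1 = 4, 6  # illustrative sizes; Z-bar is d2 x D1

# stand-in coefficient matrices with the shapes from Eq.(6)
G = rng.standard_normal((d2, d2))
L1 = G @ G.T + d2 * np.eye(d2)          # symmetric positive definite => invertible
L2 = 0.1 * rng.standard_normal((D1, D1))
L3 = rng.standard_normal((d2, D1))

# L1 Z L2 - Z + L3 = 0  <=>  inv(L1) Z + Z (-L2) = inv(L1) L3
L1_inv = np.linalg.inv(L1)
Z = solve_sylvester(L1_inv, -L2, L1_inv @ L3)

residual = L1 @ Z @ L2 - Z + L3          # should vanish at the solution
assert np.allclose(residual, 0, atol=1e-7)
```

`scipy.linalg.solve_sylvester(A, B, Q)` solves $AX + XB = Q$ via the Bartels-Stewart algorithm; a unique solution requires that $A$ and $-B$ share no eigenvalue, which holds generically for the matrices above.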
Concretely, $\bar{Z}$ is obtained by solving

$$L^Z_1 \bar{Z} L^Z_2 - \bar{Z} + L^Z_3 = 0, \quad \text{with} \quad L^Z_1 = \langle\tau\rangle^2 S_B \bar{B}^\top \bar{B}, \quad L^Z_2 = \bar{A} S_A \bar{A}^\top, \quad L^Z_3 = \langle\tau\rangle S_B \bar{B}^\top X^\top \big(I_{D_1} - \langle\tau\rangle \bar{A} S_A \bar{A}^\top\big), \quad (6)$$

whose solution is then put back into Eq.(5) to update $\bar{Y}$ (footnote 2). Given $\bar{Y}$ and $\bar{Z}$, updating $(\bar{A}, \bar{B})$ is similar. The remaining factors of $Q(\Theta)$ in Eq.(4) are updated as

$$\psi_A = d_1/\mathrm{tr}(S_Y^{-1}), \quad \psi_B = d_2/\mathrm{tr}(S_Z^{-1}), \quad S_Y = \big[\langle\tau\rangle K_Y + \mathrm{diag}(\langle\varsigma\rangle)\big]^{-1}, \quad S_Z = \big[\langle\tau\rangle K_Z + \mathrm{diag}(\langle\phi\rangle)\big]^{-1},$$
$$K_Y = \bar{Y}\bar{Y}^\top + D_2 \bar{\Sigma}_Y, \quad K_Z = \bar{Z}\bar{Z}^\top + D_1 \bar{\Sigma}_Z,$$
$$\bar{u}^\tau = u^\tau + \tfrac{D_1 D_2}{2}, \quad \bar{u}^\varsigma = u^\varsigma + \tfrac{D_1}{2} 1_{d_1}, \quad \bar{u}^\phi = u^\phi + \tfrac{D_2}{2} 1_{d_2},$$
$$\bar{v}^\tau = v^\tau + \tfrac{1}{2}\big\|X - \bar{A}\bar{Y} - \bar{Z}^\top\bar{B}^\top\big\|_F^2 + \tfrac{D_2}{2}\mathrm{tr}(\bar{\Sigma}_Y \bar{A}^\top\bar{A}) + \tfrac{D_1\psi_A}{2}\mathrm{tr}(K_Y) + \tfrac{D_1}{2}\mathrm{tr}(\bar{\Sigma}_Z \bar{B}^\top\bar{B}) + \tfrac{D_2\psi_B}{2}\mathrm{tr}(K_Z),$$
$$\bar{v}^\varsigma = v^\varsigma + \tfrac{1}{2}\mathrm{diag}(\bar{A}^\top\bar{A}) + \tfrac{D_1\psi_A}{2} 1_{d_1}, \quad \bar{v}^\phi = v^\phi + \tfrac{1}{2}\mathrm{diag}(\bar{B}^\top\bar{B}) + \tfrac{D_2\psi_B}{2} 1_{d_2}, \quad (7)$$

where $\langle\varsigma\rangle = [\bar{u}^\varsigma_1/\bar{v}^\varsigma_1, \ldots, \bar{u}^\varsigma_{d_1}/\bar{v}^\varsigma_{d_1}]^\top$, $\langle\phi\rangle = [\bar{u}^\phi_1/\bar{v}^\phi_1, \ldots, \bar{u}^\phi_{d_2}/\bar{v}^\phi_{d_2}]^\top$, $\mathrm{tr}(\cdot)$ stands for the trace, $\mathrm{diag}(\cdot)$ inter-converts between a vector and a diagonal matrix, and $\|\cdot\|_F$ is the Frobenius norm.

Footnote 2: The choice of computing $\bar{Z}$ first is based on the assumption $D_1 \leq D_2$ for learning efficiency.

In implementation, all Gamma priors in Eq.(2) are set vague as $\Gamma(\cdot|10^{-3}, 10^{-3})$. During learning, redundant columns of $\bar{A}$ and $\bar{B}$ are pushed towards zero, which effectively performs Bayesian model selection on the hidden dimensionalities $d_1$ and $d_2$.

3 Extensions

3.1 Filling Missing Values

In many real applications, $X$ is only partially observed, with some entries missing. The goal here is to infer not only the PCSA model but also the missing values in $X$ based on the model structure. Similar to the setting of PMA in [1], we begin with a full matrix $\tilde{X}$ in which the missing values are randomly filled, and denote by $M = \{(i,j) : \tilde{x}_{ij} \text{ is missing}\}$ the index set of the missing values. In each iteration, we "pretend" that $\tilde{X}$ is the observed matrix and update $Q(\Theta)$ by Eqs.(6-7). Then, given $Q(\Theta)$, the missing entries $\{\tilde{x}_{ij} : (i,j) \in M\}$ are updated by maximizing $L(Q)$, i.e., $\tilde{x}_{ij} = \bar{x}_{ij}$ with $\bar{X} = \arg\max_X L(Q) = \bar{A}\bar{Y} + \bar{Z}^\top\bar{B}^\top$. This updating manner plays the role of an adaptive regularization [2], and performs well in the experiments of Section 4. Moreover, filling missing values in PMA [1] requires inferring the column and row factors by either Gibbs sampling or MAP. 
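The iterate-and-refill scheme just described can be mimicked with a simple point-estimate stand-in: since $AY + (BZ)^\top$ has rank at most $d_1 + d_2$, a truncated-SVD reconstruction can play the role of $\bar{A}\bar{Y} + \bar{Z}^\top\bar{B}^\top$ in each iteration. This is only a sketch of the iteration scheme (all sizes below are arbitrary), not the Bayesian update itself:

```python
import numpy as np

rng = np.random.default_rng(2)
D1, D2, d1, d2 = 15, 25, 2, 3
A = rng.standard_normal((D1, d1)); Y = rng.standard_normal((d1, D2))
B = rng.standard_normal((D2, d2)); Z = rng.standard_normal((d2, D1))
X_true = A @ Y + (B @ Z).T            # noise-free PCSA matrix, rank <= d1 + d2
X = X_true.copy()
mask = rng.random(X.shape) < 0.2      # index set M of missing entries
X[mask] = 0.0                         # arbitrary initial fill
err0 = np.abs(X - X_true)[mask].mean()

r = d1 + d2                           # model rank of AY + (BZ)^T
for _ in range(300):
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    X_bar = (U[:, :r] * s[:r]) @ Vt[:r]   # low-rank "reconstruction" step
    X[mask] = X_bar[mask]                 # refill only the missing entries

err = np.abs(X - X_true)[mask].mean()
assert err < 0.1 * err0               # the refill loop shrinks the error
```

The key point the sketch shares with Bayesian PCSA is that only the entries in $M$ are ever overwritten; the observed entries act as a fixed anchor in every iteration.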
In contrast, PCSA directly employs the factors $\bar{Y}$ and $\bar{Z}$ already estimated during variational inference, and thus saves computing cost.

3.2 Bayesian Sparse PCSA

As discussed above, PCSA describes observations by mapping the hidden features $Y$ and $Z$ in the co-subspaces through $A$ and $B$ respectively; that is, $A$ and $B$ play a role similar to the transformation matrix in Factor Analysis and PPCA. For high dimensional data, the parameters $A$ and $B$ are prone to inaccurate estimation and are difficult to interpret. Sparsification is a popular method for improving model interpretability in the literature, and in this part we provide a Bayesian treatment of the sparse PCSA model. LASSO [19] encourages model sparseness by adding an $\ell_1$ regularizer, which is equivalent to a Laplacian prior. In [4], sparseness is adaptively controlled by assigning a hierarchical Normal-Jeffreys (NJ) prior, and [9] showed that the NJ prior performs better than the Laplacian on sparse PPCA. We therefore adopt the NJ prior for learning a sparse PCSA model. Different from Eq.(2), each column of $A$ and $B$ follows a hierarchical Normal-Jeffreys prior:

$$p(A|\alpha^A) = \prod_{i=1}^{d_1} N(a_{*i}|0, \alpha^A_i I_{D_1}), \quad p(\alpha^A) \propto \prod_{i=1}^{d_1} \frac{1}{\alpha^A_i}, \quad \text{with } \alpha^A = [\alpha^A_1, \ldots, \alpha^A_{d_1}]^\top,$$
$$p(B|\alpha^B) = \prod_{i=1}^{d_2} N(b_{*i}|0, \alpha^B_i I_{D_2}), \quad p(\alpha^B) \propto \prod_{i=1}^{d_2} \frac{1}{\alpha^B_i}, \quad \text{with } \alpha^B = [\alpha^B_1, \ldots, \alpha^B_{d_2}]^\top, \quad (8)$$

which likewise encourages the variances in $\alpha^A$ and $\alpha^B$ of redundant dimensions to approach zero. The prior on $\tau$ remains the same as in Eq.(2). Still within the variational inference framework, we now let $\Theta = \{Z, Y, \theta\}$, and $Q(\Theta) = Q(Y)Q(Z)Q(\theta)$ takes the same conjugate form as in Eq.(4). 
In consequence, we optimize $L(Q; \alpha^A, \alpha^B) = \int Q(\Theta) \log \frac{p(X, \Theta|\alpha^A, \alpha^B)}{Q(\Theta)}\, d\Theta$ w.r.t. $Q(\Theta)$, $\alpha^A$ and $\alpha^B$, where $L(Q; \alpha^A, \alpha^B) \leq \log p(X|\alpha^A, \alpha^B)$. The posterior inference remains the same as above, except that every appearance of $\langle\varsigma\rangle$ and $\langle\phi\rangle$ is replaced with $[1/\alpha^A_1, \ldots, 1/\alpha^A_{d_1}]^\top$ and $[1/\alpha^B_1, \ldots, 1/\alpha^B_{d_2}]^\top$ respectively. Then, given $Q(\Theta)$, the variances $\alpha^A$ and $\alpha^B$ are updated via

$$\alpha^A = \tfrac{1}{D_1+2}\big[\mathrm{diag}(\bar{A}^\top\bar{A}) + \psi_A\big] \quad \text{and} \quad \alpha^B = \tfrac{1}{D_2+2}\big[\mathrm{diag}(\bar{B}^\top\bar{B}) + \psi_B\big].$$

3.3 Modeling High-Order Tensor Data

Up till now we have been modeling $X$ as a matrix; this part extends the PCSA model and its Bayesian inference to the case where $X$ is structured as a tensor. Tensors are higher-order generalizations of vectors (1st-order tensors) and matrices (2nd-order tensors) [6]. Each dimension of a tensor is called a mode, and the order of a tensor is the number of its modes. We denote tensors by open-face uppercase letters (e.g., X, Y, Z), in contrast to the bold uppercase letters (e.g., X, Y, Z) for matrices. A $k$th-order tensor X can be written as $X \in R^{D_1 \times D_2 \times \ldots \times D_k}$, with dimensionalities $D_1, D_2, \ldots, D_k$ in the respective modes. An element and a (1st-mode) vector of X are denoted by $x_{j_1 j_2 \ldots j_k}$ and $x_{*j_2 \ldots j_k}$ respectively, where $1 \leq j_i \leq D_i$ for each $i = 1, \ldots, k$. Moreover, the 1st-mode flattening transform of X, denoted $F(X) \in R^{D_1 \times (D_2 D_3 \cdots D_k)}$, is obtained by concatenating all the (1st-mode) vectors of X. Vice versa, a $([D_1, \ldots, D_k])$-tensorization of a matrix $X \in R^{D_1 \times (D_2 \cdots D_k)}$ is defined as $T(X, [D_1, \ldots, D_k]) \in R^{D_1 \times D_2 \times \ldots \times D_k}$, so that $T(F(X), [D_1, \ldots, D_k]) = X$. An $i$th mode-shift transform is defined as $M(X, i) \in R^{D_i \times D_{i+1} \times \ldots \times D_k \times D_1 \times \ldots \times D_{i-1}}$, which shifts the modes cyclically until the $i$th mode of X becomes the 1st mode of $M(X, i)$.

Based on these definitions, the PCSA model describes a $k$th-order tensor $X \in R^{D_1 \times \ldots \times D_k}$ through the following generative process: (i) for each mode $i$, all elements of the hidden tensor $Y^{(i)} \in R^{d_i \times D_{i+1} \times \ldots \times D_k \times D_1 \times \ldots \times D_{i-1}}$ are i.i.d. drawn from $N(y^{(i)}_{j_i j_{i+1} \ldots j_k j_1 \ldots j_{i-1}}|0, 1)$; (ii) each element is drawn as $x_{j_1 \ldots j_k} \sim N\big(x_{j_1 \ldots j_k}\,\big|\,\sum_{i=1}^k a^{(i)}_{j_i *}\, y^{(i)}_{* j_{i+1} \ldots j_k j_1 \ldots j_{i-1}},\, 1/\tau\big)$, i.e., X is actually generated by a mode-shifted co-subspace addition:

$$X = E + \sum_{i=1}^k M\big(T(\bar{X}^{(i)}, [D_i, \ldots, D_k, D_1, \ldots, D_{i-1}]),\, k+2-i\big), \quad (9)$$

where each $\bar{X}^{(i)} = A^{(i)} F(Y^{(i)})$ and the matrix $A^{(i)} \in R^{D_i \times d_i}$ maps $Y^{(i)}$ to X. Shortly named PCSA-$k$, this model has latent tensors $\{Y^{(i)}\}_{i=1}^k$ and parameters $\theta = \{\tau\} \cup \{A^{(i)}\}_{i=1}^k$ with latent scales $\{d_i\}_{i=1}^k$. When $k = 2$, PCSA-2 is exactly the PCSA of Section 2.1 on matrix data; it can also be seen as a kind of group Factor Analysis [24]. As in Eq.(2), each column of $A^{(i)}$ takes a hierarchical Normal-Gamma prior, and the Bayesian inference of Section 2.2 extends straightforwardly to cover the PCSA-$k$ model. Please see the details in the supplementary materials. 
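The three tensor operators above map directly onto NumPy reshapes and transposes. The sketch below fixes one consistent (C-order) layout convention, which may differ from the paper's exact ordering of the flattened columns; the tensor sizes are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(3)
D = (4, 3, 5)                          # a 3rd-order tensor, D1 x D2 x D3
X = rng.standard_normal(D)

def flatten1(T):
    """1st-mode flattening F(T): D1 x (D2*...*Dk)."""
    return T.reshape(T.shape[0], -1)

def tensorize(M, dims):
    """([D1,...,Dk])-tensorization T(M, dims), the inverse of flatten1."""
    return M.reshape(dims)

def mode_shift(T, i):
    """M(T, i): cyclically shift modes so mode i (1-based) comes first."""
    k = T.ndim
    return np.transpose(T, tuple(range(i - 1, k)) + tuple(range(i - 1)))

assert np.array_equal(tensorize(flatten1(X), D), X)   # T(F(X), [D1,...,Dk]) = X
assert mode_shift(X, 2).shape == (3, 5, 4)            # mode 2 moved to the front
```

With these helpers, the $i$th summand of Eq.(9) would be `mode_shift(tensorize(A_i @ flatten1(Y_i), shifted_dims), k + 2 - i)` for suitably shaped stand-ins `A_i` and `Y_i`.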
Except for the involvement of the tensor structure and its operators, there is one more difference from the variational posterior updating on a matrix $X$. Recall that the pairs $(Q(Y), Q(Z))$ and $(Q(A), Q(B))$ could be decoupled and updated by solving Sylvester equations; for general $k > 2$, neither $\{Q(Y^{(i)})\}_i$ nor $\{Q(A^{(i)})\}_i$ can be decoupled into Sylvester equations. Instead, sequentially for each $i = 1, \ldots, k$, we update only $Q(Y^{(i)})$ (or $Q(A^{(i)})$) while keeping the remaining $\{Q(Y^{(u)})\}_{u \neq i}$ (or $\{Q(A^{(u)})\}_{u \neq i}$) fixed.

4 Experimental Results

4.1 Predicting Missing Entries in Weight Matrices

On Emotions and CAL500 Data. The proposed PCSA model can be viewed as a rather direct extension of the PMA model, which showed advantages over GPR, LMC and PMF in [1]. Following [1], the first experiment compares PCSA with PMA on filling the missing entries of a truncated log-odds matrix in multi-label classification. For $n$ samples and $m$ classes, the class memberships can be represented as an $n \times m$ binary matrix $G$. A truncated log-odds matrix $X$ is constructed with $x_{ij} = c$ if $g_{ij} = 1$ and $x_{ij} = -c$ if $g_{ij} = 0$, where $c$ is a nonzero constant. In the experiments, certain entries $x_{ij}$ are assumed missing and filled as $\tilde{x}_{ij}$ by an algorithm, and performance is evaluated by the class membership prediction accuracy based on $\mathrm{sign}(\tilde{x}_{ij})$. Two multi-label classification datasets are under consideration, namely Emotions [22] and CAL500 [23]. Already used in [1], Emotions contains 593 samples with 72 numeric attributes in 6 classes, and the number of classes each sample belongs to ranges from 1 to 3. The constructed X for Emotions is thus 593 × 6. CAL500 contains 502 samples with 68 numeric attributes in 174 classes, and the min and max numbers of classes each sample belongs to are 13 and 48 respectively. 
The constructed X for CAL500 is thus 502 × 174, i.e., larger and more balanced in size than the one for Emotions.

To test the capability of dealing with missing values, the proportion of missing labels is increased from 10% to 50% in steps of 5%. For a fair comparison, MAP inference (instead of Gibbs sampling) is used in the PMA implementation. Over 10 independent runs on each dataset, Fig.1 reports the error rates for recovering the missing labels in the truncated log-odds matrices by Bayesian PCSA, Bayesian sparse PCSA and PMA. On the relatively unbalanced Emotions data, PCSA outperforms sparse PCSA when the missing proportion is no larger than 40%, while sparse PCSA takes over the advantage when too many entries are missing, owing to the increasing importance of model sparsity. On the more balanced CAL500 data, sparse PCSA keeps a slight edge over PCSA, again due to sparsity. Moreover, PCSA and sparse PCSA always perform considerably better than PMA on both datasets. Table 1 reports the average time cost, where sparse PCSA converges a little faster than PCSA; both are much quicker than PMA, since they need neither invert large covariances nor infer the factors while filling missing values (see Section 3.1).

Figure 1: Error rates of 10 independent runs for recovering the missing labels in Emotions (left) and CAL500 (right) data.

Table 1: Average time cost (in seconds) on each dataset over 10 independent runs and all missing proportions.

dataset:       Emotions   CAL500
PCSA               4.0      17.3
sparse PCSA        3.5      11.6
PMA               22.9     198.3

On MovieLens and JesterJoke Data. In many real applications, e.g. collaborative filtering, the size of the matrix X is much larger than the above. We proceed to consider two larger weight datasets: the MovieLens100K data (footnote 3) and the JesterJoke3 data [8]. 
Particularly, the MovieLens100K dataset contains 100K ratings of 943 users on 1682 movies, given as ordinal values on the scale [1, 5]. The JesterJoke3 data contains ratings from 24983 users who have each rated between 15 and 35 of the 100 jokes in total, with ratings continuous in [-10.0, 10.0]. Recently in [12], Robust Bayesian Matrix Factorization (RBMF) was proposed by adopting a Student-t prior in probabilistic matrix factorization, and showed promising results on predicting entries of both MovieLens100K and JesterJoke3. Following [12], in each run we randomly choose 70% of the ratings for training, and use the remaining ratings as the missing values for testing. Given the true test ratings $\{r_t\}_{t=1}^T$ and the predictions $\{\tilde{r}_t\}_{t=1}^T$, performance is evaluated by the rooted mean squared error, $\mathrm{RMSE} = \sqrt{\frac{1}{T}\sum_{t=1}^T (r_t - \tilde{r}_t)^2}$, and the mean absolute error, $\mathrm{MAE} = \frac{1}{T}\sum_{t=1}^T |r_t - \tilde{r}_t|$. After 10 independent runs, the average RMSE and MAE values obtained by (sparse) PCSA are reported in Table 2, in comparison with the best results of RBMF (i.e., RBMF-RR) collected from [12]. Since PMA runs inefficiently on high dimensional data (cf. Table 1), it is not considered for filling the ratings in this experiment. It is observed that the performance of PCSA on predicting ratings is comparable with RBMF. 
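The two evaluation metrics just defined are one-liners in NumPy; the toy rating vectors below are made-up values for illustration only:

```python
import numpy as np

r = np.array([4.0, 3.0, 5.0, 2.0])        # true test ratings (toy values)
r_hat = np.array([3.5, 3.0, 4.0, 2.5])    # predicted ratings (toy values)

rmse = np.sqrt(np.mean((r - r_hat) ** 2))  # rooted mean squared error
mae = np.mean(np.abs(r - r_hat))           # mean absolute error; here 0.5

assert rmse >= mae  # by Jensen's inequality, RMSE always dominates MAE
```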
On both RMSE and MAE scores, sparse PCSA further improves the accuracy and performs similarly to or better than RBMF.

Table 2: Average RMSE and MAE on MovieLens100K (left) and JesterJoke3 (right).

MovieLens100K:
model           RMSE    MAE
PCSA           0.903  0.708
sparse PCSA    0.898  0.706
RBMF-RR [12]   0.900  0.705

JesterJoke3:
model           RMSE    MAE
PCSA           4.446  3.447
sparse PCSA    4.413  3.434
RBMF-RR [12]   4.454  3.439

Footnote 3: Downloaded from www.grouplens.org/node/73.

4.2 Completing Partially Observed Images

We consider two greyscale face image datasets, namely Frey [15] and ORL [17]. Specifically, Frey has 1965 images of size 28 × 20 taken from one person, so the data X is a 560 × 1965 matrix; ORL has 400 images of size 64 × 64 taken from 40 persons (10 images per person), so the data X is a 4096 × 400 matrix. Applied to these matrices, the PCSA model is expected to extract the latent correlations among pixels and among images. In [13], Neil Lawrence proposed the Gaussian process latent variable model (GPLVM) for modeling and visualizing high dimensional data. Recently, a Bayesian GPLVM [21] was developed and showed much improved performance on filling pixels of partially observed Frey faces. This experiment compares PCSA with Bayesian GPLVM (footnote 4). While PCSA can utilize the partially observed samples, Bayesian GPLVM cannot. Thus in each run we randomly pick $n_f$ images as fully observed, and half of the pixels of each remaining image are randomly chosen as missing. As in [21], Bayesian GPLVM uses the $n_f$ images for training and then infers the missing pixels; in contrast, (sparse) PCSA uses all images as one matrix. To test robustness, $n_f$ is decreased gradually from 1000 to 200 for Frey, and from 300 to 50 for ORL. 
Performance is evaluated by the correlation coefficient (CORR) and the MAE between the filled pixels and the ground truth. On the Frey and ORL data respectively, Fig.2 and Fig.3 report the CORR and MAE values of 10 independent runs by PCSA, sparse PCSA and Bayesian GPLVM. Both PCSA and sparse PCSA complete the missing pixels more accurately than Bayesian GPLVM, with PCSA giving the best match. Also, (sparse) PCSA shows promising stability against the decreasing number $n_f$ of fully observed samples, and this tendency persists even when all images are partially observed (i.e., $n_f = 0$), as exemplified by Fig.4. The results of Bayesian GPLVM deteriorate more markedly, because the partially observed images contribute nothing during its learning. Furthermore, the advantage of PCSA becomes more significant as we move from the Frey data of a single person to the ORL data of multiple persons, which indirectly reflects the importance of extracting the correlations among different images rather than keeping them independent. Sparse PCSA performs worse than PCSA in this task, mainly because it yields slightly too many sparse dimensions.

Figure 2: Results of 10 runs on Frey faces.

Figure 3: Results of 10 runs on ORL faces.

Figure 4: Reconstruction examples by PCSA when all images are partially observed: Frey (left) and ORL (right). The three rows from top are the true, observed, and reconstructed images, respectively.

4.3 Completing a Partially Observed Image Tensor

We proceed to model face image data arranged in a tensor. The dataset under consideration is a subset of the CMU PIE database [18], with 5100 face images in total from 30 individuals. Each person's face exhibits 170 images corresponding to 170 different pose-and-illumination combinations. 
Each normalized image has 32 × 32 greyscale pixels, so the dataset forms a tensor $X \in R^{1024 \times 30 \times 170}$, whose three modes correspond to pixel, identity, and pose/illumination, respectively. Figure 5 shows some image examples of two persons. The PCSA-$k$ model of Section 3.3 (with $k = 3$ on the 3rd-order tensor X) is expected to extract the co-subspace structures (i.e., the correlations among pixels, identities, and poses/illuminations respectively) and fill the missing values in X. In [6], the M2SA method was proposed to conduct multilinear subspace analysis with missing values on tensor data, via consecutive SVD dimension reductions on each mode.

Footnote 4: We use the code in http://staffwww.dcs.shef.ac.uk/people/N.Lawrence/vargplvm/.

Figure 5: Typical normalized face images from the CMU PIE database.

Figure 6: Typical missing images filled by PCSA-3. Original images (in the odd rows) are randomly picked and removed, and PCSA-3 fills the images in the even rows.

Table 3: Average CORR (left) and MAE (right) of 10 runs by PCSA-3 and M2SA on the CMU PIE data.

CORR:
missing proportion:    10%    20%    30%
PCSA-3               0.937  0.926  0.908
M2SA                 0.928  0.914  0.893

MAE:
missing proportion:    10%    20%    30%
PCSA-3                14.6   17.8   21.5
M2SA                  18.3   21.9   24.8

Here, the randomly drawn missing values are not pixels as in Section 4.2 but whole images. The goodness of the filled missing images against the true ones is again evaluated by CORR and MAE. To test the capability of dealing with missing values, the proportion of missing images is set to 10%, 20% and 30%, respectively. After 10 independent runs for each proportion, the average CORR and average MAE of filling the missing images by PCSA-3 and M2SA are compared in Table 3. 
When implementing M2SA, the ratio of the subspace rank to the original rank is set to 0.6, following Fig.9 in [6]. As shown in Table 3, PCSA-3 achieves better performance in all cases. For demonstration, Fig.6 shows some filled missing images when the missing proportion is 20%; they match the original images consistently well.

5 Concluding Remarks

We have introduced the Probabilistic Co-Subspace Addition (PCSA) model, which simultaneously captures the dependent structures among both the rows and columns of data matrices (tensors). Variational inference is proposed on PCSA for approximate Bayesian learning, and the posteriors can be efficiently and stably updated by solving Sylvester equations. Capable of filling missing values, PCSA is extended not only to sparse PCSA with the help of a Jeffreys prior, but also to PCSA-k, which models arbitrary kth-order tensor data. Although simple and not designed for any particular application, PCSA demonstrates in the experiments its effectiveness and efficiency in modeling matrix (tensor) data and filling missing values. The performance of PCSA may be further improved by considering nonlinear mappings with the kernel trick, which, however, is not straightforward due to the coupled inner products between the co-subspaces.

Acknowledgments

The author would like to thank the anonymous reviewers for their useful comments on this paper.

References
[1] A. Agovic, A. Banerjee, and S. Chatterjee. Probabilistic matrix addition. In Proc. ICML, pages 1025–1032, 2011.
[2] C. M. Bishop. Training with noise is equivalent to Tikhonov regularization. Neural Computation, 7(1):108–116, 1995.
[3] C. M. Bishop. Variational principal components. In Proc. ICANN'1999, volume 1, pages 509–514, 1999.
[4] M. A. T. Figueiredo. Adaptive sparseness using Jeffreys prior. In Advances in NIPS, volume 14, pages 679–704.
MIT Press, Cambridge, MA, 2002.
[5] A. E. Gelfand and S. Banerjee. Multivariate spatial process models. In A. E. Gelfand, P. Diggle, P. Guttorp, and M. Fuentes, editors, Handbook of Spatial Statistics. CRC Press, 2010.
[6] X. Geng, K. Smith-Miles, Z.-H. Zhou, and L. Wang. Face image modeling by multilinear subspace analysis with missing values. IEEE Trans. Syst., Man, Cybern. B, Cybern., 41(3):881–892, 2011.
[7] Z. Ghahramani and G. Hinton. The EM algorithm for mixtures of factor analyzers. Technical Report CRG-TR-96-1, Department of Computer Science, University of Toronto, Toronto, Canada, 1997.
[8] K. Goldberg, T. Roeder, D. Gupta, and C. Perkins. Eigentaste: A constant time collaborative filtering algorithm. Information Retrieval, 4(2):133–151, 2001.
[9] Y. Guan and J. Dy. Sparse probabilistic principal component analysis. In Proc. AISTATS'2009, JMLR W&CP, volume 5, pages 185–192. 2009.
[10] D. Y. Hu and L. Reichel. Krylov-subspace methods for the Sylvester equation. Linear Algebra and Its Applications, 172:283–313, 1992.
[11] M. I. Jordan, editor. Learning in Graphical Models. MIT Press, Cambridge, MA, 1999.
[12] B. Lakshminarayanan, G. Bouchard, and C. Archambeau. Robust Bayesian matrix factorisation. In Proc. AISTATS'2011, JMLR W&CP, volume 15, pages 425–433. 2011.
[13] N. D. Lawrence. Gaussian process latent variable models for visualisation of high dimensional data. In Advances in NIPS, volume 16, pages 329–336. MIT Press, Cambridge, MA, 2003.
[14] R. M. Neal. Bayesian Learning for Neural Networks. Springer-Verlag, New York, 1996.
[15] S. Roweis, L. K. Saul, and G. Hinton. Global coordination of local linear models. In Advances in NIPS, volume 14, pages 889–896. MIT Press, Cambridge, MA, 2002.
[16] R. Salakhutdinov and A. Mnih. Probabilistic matrix factorization. In Advances in NIPS, volume 20, pages 1257–1264.
MIT Press, Cambridge, MA, 2008.
[17] F. Samaria and A. Harter. Parameterisation of a stochastic model for human face identification. In Proc. 2nd IEEE Workshop on Applications of Computer Vision, pages 138–142, 1994.
[18] T. Sim, S. Baker, and M. Bsat. The CMU pose, illumination, and expression database. IEEE Trans. Pattern Anal. Mach. Intell., 25(12):1615–1618, 2003.
[19] R. Tibshirani. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. B, 58(1):267–288, 1996.
[20] M. E. Tipping and C. M. Bishop. Mixtures of probabilistic principal component analyzers. Neural Computation, 11(2):443–482, 1999.
[21] M. Titsias and N. Lawrence. Bayesian Gaussian process latent variable model. In Proc. AISTATS'2010, JMLR W&CP, volume 9, pages 844–851. 2010.
[22] K. Trohidis, G. Tsoumakas, G. Kalliris, and I. Vlahavas. Multilabel classification of music into emotions. In Proc. Intl. Conf. on Music Information Retrieval (ISMIR), pages 325–330, 2008.
[23] D. Turnbull, L. Barrington, D. Torres, and G. Lanckriet. Semantic annotation and retrieval of music and sound effects. IEEE Trans. Audio, Speech and Lang. Process., 16(2):467–476, 2008.
[24] S. Virtanen, A. Klami, S. A. Khan, and S. Kaski. Bayesian group factor analysis. In Proc. AISTATS'2012, JMLR W&CP, volume 22, pages 1269–1277. 2012.
[25] Z. Xu, K. Kersting, and V. Tresp. Multi-relational learning with Gaussian processes. In Proc. IJCAI'2009, pages 1309–1314, 2009.