{"title": "Global Solution of Fully-Observed Variational Bayesian Matrix Factorization is Column-Wise Independent", "book": "Advances in Neural Information Processing Systems", "page_first": 208, "page_last": 216, "abstract": "Variational Bayesian matrix factorization (VBMF) efficiently   approximates the posterior distribution of factorized matrices   by assuming matrix-wise independence of the two factors.  A recent study on fully-observed VBMF  showed that, under a stronger assumption that the two factorized matrices are  column-wise independent,  the global optimal solution can be analytically computed.  However, it was not clear how restrictive  the column-wise independence assumption is.  In this paper, we prove that the global solution under matrix-wise  independence is actually column-wise independent,  implying that the column-wise independence assumption is harmless.  A practical consequence of our theoretical finding is that  the global solution under matrix-wise independence (which is a standard setup)  can be obtained analytically in a computationally very efficient way  without any iterative algorithms.  We experimentally illustrate advantages of using our analytic solution  in probabilistic principal component analysis.", "full_text": "Global Solution of Fully-Observed\n\nVariational Bayesian Matrix Factorization\n\nis Column-Wise Independent\n\nShinichi Nakajima\nNikon Corporation\n\nTokyo, 140-8601, Japan\n\nMasashi Sugiyama\n\nTokyo Institute of Technology\n\nTokyo 152-8552, Japan\n\nnakajima.s@nikon.co.jp\n\nsugi@cs.titech.ac.jp\n\nDerin Babacan\n\nUniversity of Illinois at Urbana-Champaign\n\nUrbana, IL 61801, USA\n\ndbabacan@illinois.edu\n\nAbstract\n\nVariational Bayesian matrix factorization (VBMF) ef\ufb01ciently approximates the\nposterior distribution of factorized matrices by assuming matrix-wise indepen-\ndence of the two factors. 
A recent study on fully-observed VBMF showed that, under a stronger assumption that the two factorized matrices are column-wise independent, the global optimal solution can be analytically computed. However, it was not clear how restrictive the column-wise independence assumption is. In this paper, we prove that the global solution under matrix-wise independence is actually column-wise independent, implying that the column-wise independence assumption is harmless. A practical consequence of our theoretical finding is that the global solution under matrix-wise independence (which is a standard setup) can be obtained analytically in a computationally very efficient way without any iterative algorithms. We experimentally illustrate advantages of using our analytic solution in probabilistic principal component analysis.

1 Introduction

The goal of matrix factorization (MF) is to approximate an observed matrix by a low-rank one. In this paper, we consider fully-observed MF where the observed matrix has no missing entry¹. This formulation includes classical multivariate analysis techniques based on singular-value decomposition such as principal component analysis (PCA) [9] and canonical correlation analysis [10].
In the framework of probabilistic MF [20, 17, 19], posterior distributions of factorized matrices are considered. Since exact inference is computationally intractable, the Laplace approximation [3], Markov chain Monte Carlo sampling [3, 18], and the variational Bayesian (VB) approximation [4, 13, 16, 15] were used for approximate inference in practice. Among them, the VB approximation seems to be a popular choice due to its high accuracy and computational efficiency.
In the original VBMF [4, 13], the factored matrices are assumed to be matrix-wise independent, and a local optimal solution is computed by an iterative algorithm.
A simplified variant of VBMF (simpleVBMF) was also proposed [16], which assumes a stronger constraint that the factored matrices are column-wise independent. A notable advantage of simpleVBMF is that the global optimal solution can be computed analytically in a computationally very efficient way [15].
Intuitively, it is suspected that simpleVBMF only possesses weaker approximation ability due to its stronger column-wise independence assumption. However, it was reported that no clear performance degradation was observed in experiments [14]. Thus, simpleVBMF would be a practically useful approach. Nevertheless, the influence of the stronger column-wise independence assumption was not elucidated beyond this empirical evaluation.
The main contribution of this paper is to theoretically show that the column-wise independence assumption does not degrade the performance. More specifically, we prove that a global optimal solution of the original VBMF is actually column-wise independent. Thus, a global optimal solution of the original VBMF can be obtained by the analytic-form solution of simpleVBMF; no computationally-expensive iterative algorithm is necessary.

¹This excludes the collaborative filtering setup, which is aimed at imputing missing entries of an observed matrix [12, 7].
We show the usefulness of the analytic-form solution through experiments on probabilistic PCA.

2 Formulation

In this section, we first formulate the problem of probabilistic MF, and then introduce the VB approximation and its simplified variant.

2.1 Probabilistic Matrix Factorization

The probabilistic MF model is given as follows [19]:

p(Y|A, B) ∝ exp( −(1/(2σ²)) ‖Y − BA⊤‖²_Fro ),   (1)
p(A) ∝ exp( −(1/2) tr(A C_A⁻¹ A⊤) ),   p(B) ∝ exp( −(1/2) tr(B C_B⁻¹ B⊤) ),   (2)

where Y ∈ R^{L×M} is an observed matrix, A ∈ R^{M×H} and B ∈ R^{L×H} are parameter matrices to be estimated, and σ² is the noise variance. Here, we denote by ⊤ the transpose of a matrix or vector, by ‖·‖_Fro the Frobenius norm, and by tr(·) the trace of a matrix. We assume that the prior covariance matrices C_A and C_B are diagonal and positive definite, i.e.,

C_A = diag(c²_a1, ..., c²_aH),   C_B = diag(c²_b1, ..., c²_bH)

for c_ah, c_bh > 0, h = 1, ..., H. Without loss of generality, we assume that the diagonal entries of the product C_A C_B are arranged in non-increasing order, i.e., c_ah c_bh ≥ c_ah′ c_bh′ for any pair h < h′.
Throughout the paper, we denote a column vector of a matrix by a bold small letter, and a row vector by a bold small letter with a tilde, namely,

A = (a_1, ..., a_H) = (ã_1, ..., ã_M)⊤ ∈ R^{M×H},   B = (b_1, ..., b_H) = (b̃_1, ..., b̃_L)⊤ ∈ R^{L×H}.

2.2 Variational Bayesian Approximation

The Bayes posterior is written as

p(A, B|Y) = p(Y|A, B) p(A) p(B) / Z(Y),   (3)

where Z(Y) = ⟨p(Y|A, B)⟩_{p(A)p(B)} is the marginal likelihood.
Here, ⟨·⟩_p denotes the expectation over the distribution p. Since the Bayes posterior (3) is computationally intractable, the VB approximation was proposed [4, 13, 16, 15].
Let r(A, B), or r for short, be a trial distribution. The following functional with respect to r is called the free energy:

F(r|Y) = ⟨ log( r(A, B) / (p(Y|A, B) p(A) p(B)) ) ⟩_{r(A,B)} = ⟨ log( r(A, B) / p(A, B|Y) ) ⟩_{r(A,B)} − log Z(Y).   (4)

In the last equation, the first term is the Kullback-Leibler (KL) distance from the trial distribution to the Bayes posterior, and the second term is a constant. Therefore, minimizing the free energy (4) amounts to finding the distribution closest to the Bayes posterior in the sense of the KL distance. In the VB approximation, the free energy (4) is minimized over some restricted function space.
A standard constraint for the MF model is matrix-wise independence [4, 13], i.e.,

r^VB(A, B) = r^VB_A(A) r^VB_B(B).   (5)

This constraint breaks off the entanglement between the parameter matrices A and B, and leads to a computationally-tractable iterative algorithm. Using the variational method, we can show that, under the constraint (5), the VB posterior minimizing the free energy (4) is written as

r^VB(A, B) = ∏_{m=1}^M N_H(ã_m; ẫ_m, Σ_A) · ∏_{l=1}^L N_H(b̃_l; b̂̃_l, Σ_B),

where the parameters Â = (ẫ_1, ..., ẫ_M)⊤ and B̂ = (b̂̃_1, ..., b̂̃_L)⊤ satisfy

Â = Y⊤ B̂ Σ_A / σ²,   Σ_A = σ² ( B̂⊤B̂ + L Σ_B + σ² C_A⁻¹ )⁻¹,   (6)
B̂ = Y Â Σ_B / σ²,   Σ_B = σ² ( Â⊤Â + M Σ_A + σ² C_B⁻¹ )⁻¹.   (7)

Here, N_d(·; µ, Σ) denotes the d-dimensional Gaussian distribution with mean µ and covariance matrix Σ.
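As a concrete reference, the fixed-point updates (6) and (7) can be sketched in NumPy. This is our own minimal rendering, not the authors' code; the function name, random initialization, and fixed iteration count are illustrative choices:

```python
import numpy as np

def vbmf_iterate(Y, H, ca2, cb2, sigma2, n_iter=200, seed=0):
    """Iterate the VB fixed-point updates (Eqs. (6) and (7)).

    Y: (L, M) observed matrix; ca2, cb2: length-H arrays of prior
    variances c_ah^2, c_bh^2; sigma2: noise variance (kept fixed here).
    Returns posterior means A_hat (M, H), B_hat (L, H) and covariances.
    """
    L, M = Y.shape
    rng = np.random.default_rng(seed)
    A_hat = rng.standard_normal((M, H))
    B_hat = rng.standard_normal((L, H))
    Sigma_A = np.eye(H)
    Sigma_B = np.eye(H)
    inv_CA = np.diag(1.0 / np.asarray(ca2))
    inv_CB = np.diag(1.0 / np.asarray(cb2))
    for _ in range(n_iter):
        # Eq. (6): Sigma_A = sigma^2 (B^T B + L Sigma_B + sigma^2 C_A^{-1})^{-1}
        Sigma_A = sigma2 * np.linalg.inv(B_hat.T @ B_hat + L * Sigma_B + sigma2 * inv_CA)
        A_hat = Y.T @ B_hat @ Sigma_A / sigma2
        # Eq. (7): Sigma_B = sigma^2 (A^T A + M Sigma_A + sigma^2 C_B^{-1})^{-1}
        Sigma_B = sigma2 * np.linalg.inv(A_hat.T @ A_hat + M * Sigma_A + sigma2 * inv_CB)
        B_hat = Y @ A_hat @ Sigma_B / sigma2
    return A_hat, B_hat, Sigma_A, Sigma_B
```

Each sweep is a coordinate-descent step on the free energy (4), so the iteration is monotone but, as the paper emphasizes, only locally convergent.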
Iteratively updating the parameters Â, Σ_A, B̂, and Σ_B by Eqs. (6) and (7) until convergence gives a local minimum of the free energy (4).
When the noise variance σ² is unknown, it can also be estimated based on the free energy minimization. The update rule for σ² is given by

σ² = ( ‖Y‖²_Fro − 2 tr(Y⊤ B̂ Â⊤) + tr( (Â⊤Â + M Σ_A)(B̂⊤B̂ + L Σ_B) ) ) / (LM).   (8)

Furthermore, in the empirical Bayesian scenario, the hyperparameters C_A and C_B are also estimated from data. In this scenario, C_A and C_B are updated in each iteration by the following formulas:

c²_ah = ‖â_h‖²/M + (Σ_A)_hh,   c²_bh = ‖b̂_h‖²/L + (Σ_B)_hh.   (9)

2.3 SimpleVB Approximation

A simplified variant, called the simpleVB approximation, assumes column-wise independence of each matrix [16, 15], i.e.,

r^simpleVB(A, B) = ∏_{h=1}^H r^simpleVB_ah(a_h) · ∏_{h=1}^H r^simpleVB_bh(b_h).   (10)

This constraint restricts the covariances Σ_A and Σ_B to be diagonal, and thus the necessary memory storage and computational cost are substantially reduced [16]. The simpleVB posterior can be written as

r^simpleVB(A, B) = ∏_{h=1}^H N_M(a_h; â_h, σ²_ah I_M) N_L(b_h; b̂_h, σ²_bh I_L),

where the parameters satisfy

â_h = (σ²_ah/σ²) ( Y − ∑_{h′≠h} b̂_h′ â_h′⊤ )⊤ b̂_h,   σ²_ah = σ² ( ‖b̂_h‖² + L σ²_bh + σ² c_ah⁻² )⁻¹,   (11)
b̂_h = (σ²_bh/σ²) ( Y − ∑_{h′≠h} b̂_h′ â_h′⊤ ) â_h,   σ²_bh = σ² ( ‖â_h‖² + M σ²_ah + σ² c_bh⁻² )⁻¹.   (12)

Here, I_d denotes the d-dimensional identity matrix.
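The column-wise updates (11) and (12) admit a similarly compact sketch (again our own minimal NumPy rendering with hypothetical names, not the authors' code). Only scalar variances are stored per component, which is where the memory saving comes from:

```python
import numpy as np

def simple_vb_iterate(Y, H, ca2, cb2, sigma2, n_iter=200, seed=0):
    """Column-wise simpleVB updates (Eqs. (11) and (12)).

    The covariances are scalars sa2[h], sb2[h] (i.e., Sigma_A and Sigma_B
    are forced diagonal), so no H x H matrix inversion is needed.
    """
    L, M = Y.shape
    rng = np.random.default_rng(seed)
    A_hat = rng.standard_normal((M, H))   # columns a_h
    B_hat = rng.standard_normal((L, H))   # columns b_h
    sa2 = np.ones(H)
    sb2 = np.ones(H)
    for _ in range(n_iter):
        for h in range(H):
            # residual with component h excluded: Y - sum_{h' != h} b_h' a_h'^T
            R = Y - B_hat @ A_hat.T + np.outer(B_hat[:, h], A_hat[:, h])
            # Eq. (11)
            sa2[h] = sigma2 / (B_hat[:, h] @ B_hat[:, h] + L * sb2[h] + sigma2 / ca2[h])
            A_hat[:, h] = sa2[h] / sigma2 * (R.T @ B_hat[:, h])
            # Eq. (12)
            sb2[h] = sigma2 / (A_hat[:, h] @ A_hat[:, h] + M * sa2[h] + sigma2 / cb2[h])
            B_hat[:, h] = sb2[h] / sigma2 * (R @ A_hat[:, h])
    return A_hat, B_hat, sa2, sb2
```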
Iterating Eqs. (11) and (12) until convergence, we can obtain a local minimum of the free energy. Eqs. (8) and (9) are similarly applied if the noise variance σ² is unknown and in the empirical Bayesian scenario, respectively.
A recent study derived the analytic solution for simpleVB when the observed matrix has no missing entry [15]. This work made simpleVB more attractive, because it not only provided a substantial reduction of computation costs, but also guaranteed the global optimality of the solution. However, it was not clear how restrictive the column-wise independence assumption is, beyond its experimental success [14]. In the next section, we theoretically show that the column-wise independence assumption is actually harmless.

3 Analytic Solution of VBMF under Matrix-wise Independence

Under the matrix-wise independence constraint (5), the free energy (4) can be written as

F = ⟨ log( r(A) r(B) / (p(Y|A, B) p(A) p(B)) ) ⟩_{r(A)r(B)}
  = (LM/2) log σ² + ‖Y‖²_Fro/(2σ²) + (M/2) log(|C_A|/|Σ_A|) + (L/2) log(|C_B|/|Σ_B|)
    + (1/2) tr{ C_A⁻¹(Â⊤Â + M Σ_A) + C_B⁻¹(B̂⊤B̂ + L Σ_B) + σ⁻²( −2 Â⊤Y⊤B̂ + (Â⊤Â + M Σ_A)(B̂⊤B̂ + L Σ_B) ) } + const.   (13)

Note that Eqs. (6) and (7) together form the stationarity condition of Eq. (13) with respect to Â, B̂, Σ_A, and Σ_B.
Below, we show that a global solution of Σ_A and Σ_B is diagonal. When the product C_A C_B is non-degenerate (i.e., c_ah c_bh > c_ah′ c_bh′ for any pair h < h′), the global solution is unique and diagonal. On the other hand, when C_A C_B is degenerate, the global solutions are not unique, because arbitrary rotation in the degenerate subspace is possible without changing the free energy. However, one of the equivalent solutions is still always diagonal.

Theorem 1 Diagonal Σ_A and Σ_B minimize the free energy (13).

The basic idea of our proof is that, since minimizing the free energy (13) with respect to Â, B̂, Σ_A, and Σ_B is too complicated, we focus on a restricted space written in a particular form that includes the optimal solution. From necessary conditions for optimality, we can deduce that the solutions Σ_A and Σ_B are diagonal.
Below, we describe the outline of the proof for non-degenerate C_A C_B. The complete proof for general cases is omitted because of the page limit.
(Sketch of proof of Theorem 1) Assume that (A*, B*, Σ*_A, Σ*_B) is a minimizer of the free energy (13), and consider the following set of parameters specified by an H × H orthogonal matrix Ω:

Â = A* C_A^{−1/2} Ω⊤ C_A^{1/2},   Σ_A = C_A^{1/2} Ω C_A^{−1/2} Σ*_A C_A^{−1/2} Ω⊤ C_A^{1/2},
B̂ = B* C_A^{1/2} Ω⊤ C_A^{−1/2},   Σ_B = C_A^{−1/2} Ω C_A^{1/2} Σ*_B C_A^{1/2} Ω⊤ C_A^{−1/2}.

Note that B̂Â⊤ is invariant with respect to Ω, and (Â, B̂, Σ_A, Σ_B) = (A*, B*, Σ*_A, Σ*_B) holds if Ω = I_H. Then, as a function of Ω, the free energy (13) can be simplified as

F(Ω) = (1/2) tr{ C_A^{1/2} Ω⊤ C_A⁻¹ C_B⁻¹ Ω C_A^{1/2} ( B*⊤B* + L Σ*_B ) } + const.

This is necessarily minimized at Ω = I_H, because we assumed that (A*, B*, Σ*_A, Σ*_B) is a minimizer. We can show that F(Ω) is minimized at Ω = I_H only if B*⊤B* + L Σ*_B is diagonal.
This implies that Σ*_A (see Eq. (6)) should be diagonal.
Similarly, we consider another set of parameters specified by an H × H orthogonal matrix Ω′:

Â = A* C_B^{1/2} Ω′⊤ C_B^{−1/2},   Σ_A = C_B^{−1/2} Ω′ C_B^{1/2} Σ*_A C_B^{1/2} Ω′⊤ C_B^{−1/2},
B̂ = B* C_B^{−1/2} Ω′⊤ C_B^{1/2},   Σ_B = C_B^{1/2} Ω′ C_B^{−1/2} Σ*_B C_B^{−1/2} Ω′⊤ C_B^{1/2}.

Then, as a function of Ω′, the free energy (13) can be expressed as

F(Ω′) = (1/2) tr{ C_B^{1/2} Ω′⊤ C_A⁻¹ C_B⁻¹ Ω′ C_B^{1/2} ( A*⊤A* + M Σ*_A ) } + const.

Similarly, this is minimized at Ω′ = I_H only if A*⊤A* + M Σ*_A is diagonal. Thus, Σ*_B should be diagonal (see Eq. (7)). □

The result that Σ_A and Σ_B become diagonal would be natural because we assumed the independent Gaussian prior on A and B: the fact that any Y can be decomposed into orthogonal components may imply that the observation Y cannot convey any preference for singular-component-wise correlation. Note, however, that Theorem 1 does not necessarily hold when the observed matrix has missing entries.
Theorem 1 implies that the stronger column-wise independence constraint (10) does not degrade approximation accuracy, and the VB solution under matrix-wise independence (5) essentially agrees with the simpleVB solution.
Consequently, we can obtain a global analytic solution for VB by combining Theorem 1 above with Theorem 1 in [15]:

Corollary 1 Let γ_h (≥ 0) be the h-th largest singular value of Y, and let ω_ah and ω_bh be the associated right and left singular vectors:

Y = ∑_{h=1}^L γ_h ω_bh ω_ah⊤.

Let γ̂_h be the second largest real solution of the following quartic equation with respect to t:

f_h(t) := t⁴ + ξ₃t³ + ξ₂t² + ξ₁t + ξ₀ = 0,   (14)

where the coefficients are defined by

ξ₃ = (L − M)² γ_h / (LM),
ξ₂ = −( ξ₃ γ_h + (L² + M²) η²_h / (LM) + 2σ⁴/(c²_ah c²_bh) ),
ξ₁ = ξ₃ √ξ₀,
ξ₀ = ( η²_h − σ⁴/(c²_ah c²_bh) )²,
η²_h = ( 1 − σ²L/γ²_h )( 1 − σ²M/γ²_h ) γ²_h.

Let

γ̃_h = √( (L + M)σ²/2 + σ⁴/(2c²_ah c²_bh) + √( ( (L + M)σ²/2 + σ⁴/(2c²_ah c²_bh) )² − LMσ⁴ ) ).

Then, the global VB solution under matrix-wise independence (5) can be expressed as

Û^VB ≡ ⟨BA⊤⟩_{r^VB(A,B)} = B̂Â⊤ = ∑_{h=1}^H γ̂^VB_h ω_bh ω_ah⊤,   where γ̂^VB_h = γ̂_h if γ_h > γ̃_h, and 0 otherwise.   (15)

Theorem 1 holds also in the empirical Bayesian scenario, where the hyperparameters (C_A, C_B) are also estimated from observation. Accordingly, the empirical VB solution also agrees with the empirical simpleVB solution, whose analytic form is given in Corollary 5 in [15].
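Corollary 1 translates directly into a short numerical routine. The following is our own minimal NumPy sketch (not the authors' implementation); it assumes H ≤ min(L, M) and uses a simple tolerance to pick out the real roots of the quartic (14):

```python
import numpy as np

def vb_global_solution(Y, ca2, cb2, sigma2):
    """Analytic global VB solution (Corollary 1), per-component shrinkage.

    ca2, cb2: arrays of prior variances c_ah^2, c_bh^2 for h = 1..H,
    with H <= min(L, M). Returns the low-rank estimate U_hat = <B A^T>.
    """
    L, M = Y.shape
    U, gammas, Vt = np.linalg.svd(Y, full_matrices=False)
    H = len(ca2)
    gh_vb = np.zeros(H)
    for h in range(H):
        g = gammas[h]
        c2 = ca2[h] * cb2[h]
        # threshold gamma_tilde: components at or below it are pruned to 0
        t0 = (L + M) * sigma2 / 2 + sigma2**2 / (2 * c2)
        g_tilde = np.sqrt(t0 + np.sqrt(t0**2 - L * M * sigma2**2))
        if g <= g_tilde:
            continue
        eta2 = (1 - sigma2 * L / g**2) * (1 - sigma2 * M / g**2) * g**2
        xi3 = (L - M)**2 * g / (L * M)
        xi0 = (eta2 - sigma2**2 / c2)**2
        xi2 = -(xi3 * g + (L**2 + M**2) * eta2 / (L * M) + 2 * sigma2**2 / c2)
        xi1 = xi3 * np.sqrt(xi0)
        roots = np.roots([1.0, xi3, xi2, xi1, xi0])
        # keep numerically real roots (tolerance relative to root magnitude)
        real = np.sort(roots.real[np.abs(roots.imag) < 1e-6 * (1.0 + np.abs(roots))])[::-1]
        gh_vb[h] = real[1]          # second largest real solution of (14)
    # Eq. (15): shrunk singular values on the original singular vectors
    return (U[:, :H] * gh_vb) @ Vt[:H]
```

Since only a singular-value decomposition and H quartic root-findings are required, this runs in a fraction of the cost of one sweep of the iterative algorithm.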
Thus, we obtain the global analytic solution for empirical VB:

Corollary 2 The global empirical VB solution under matrix-wise independence (5) is given by

Û^EVB = ∑_{h=1}^H γ̂^EVB_h ω_bh ω_ah⊤,   where γ̂^EVB_h = γ̆^VB_h if γ_h > γ̲_h and Δ_h ≤ 0, and 0 otherwise.   (16)

Here,

γ̲_h = (√L + √M) σ,
c̆²_h = (1/(2LM)) ( γ²_h − (L + M)σ² + √( (γ²_h − (L + M)σ²)² − 4LMσ⁴ ) ),   (17)
Δ_h = M log( (γ_h/(Mσ²)) γ̆^VB_h + 1 ) + L log( (γ_h/(Lσ²)) γ̆^VB_h + 1 ) + (1/σ²)( −2γ_h γ̆^VB_h + LM c̆²_h ),   (18)

and γ̆^VB_h is the VB solution for c_ah c_bh = c̆_h.

When we calculate the empirical VB solution, we first check whether γ_h > γ̲_h holds; if not, γ̂^EVB_h = 0. If it holds, we compute γ̆^VB_h by using Eq. (17) and Corollary 1, and finally check whether Δ_h ≤ 0 holds by using Eq. (18).
When the noise variance σ² is unknown, it is optimized by a naive 1-dimensional search to minimize the free energy [15]. To evaluate the free energy (13), we need the covariances Σ_A and Σ_B, which neither Corollary 1 nor Corollary 2 provides. The following corollary, which gives the complete information on the VB posterior, is obtained by combining Theorem 1 above with Corollary 2 in [15]:

Corollary 3 The VB posteriors under matrix-wise independence (5) are given by

r^VB_A(A) = ∏_{h=1}^H N_M(a_h; â_h, σ²_ah I_M),   r^VB_B(B) = ∏_{h=1}^H N_L(b_h; b̂_h, σ²_bh I_L),

where, for γ̂^VB_h being the solution given by Corollary 1,

â_h = ±√( γ̂^VB_h δ̂_h ) · ω_ah,   b̂_h = ±√( γ̂^VB_h δ̂_h⁻¹ ) · ω_bh,
σ²_ah = ( −(η̂²_h − σ²(M − L)) + √( (η̂²_h − σ²(M − L))² + 4Mσ² η̂²_h ) ) / ( 2M( γ̂^VB_h δ̂_h⁻¹ + σ² c_ah⁻² ) ),
σ²_bh = ( −(η̂²_h + σ²(M − L)) + √( (η̂²_h + σ²(M − L))² + 4Lσ² η̂²_h ) ) / ( 2L( γ̂^VB_h δ̂_h + σ² c_bh⁻² ) ),
δ̂_h = ( (M − L)(γ_h − γ̂^VB_h) + √( (M − L)²(γ_h − γ̂^VB_h)² + 4σ⁴LM/(c²_ah c²_bh) ) ) / ( 2σ²M c_ah⁻² ),
η̂²_h = η²_h if γ_h > γ̃_h, and σ⁴/(c²_ah c²_bh) otherwise.

Note that the ratio c_ah/c_bh is arbitrary in empirical VB, so we can fix it to, e.g., c_ah/c_bh = 1 without loss of generality [15].

4 Experimental Results

In this section, we first introduce probabilistic PCA as a probabilistic MF model. Then, we show experimental results on artificial and benchmark datasets, which illustrate practical advantages of using our analytic solution.

4.1 Probabilistic PCA

In probabilistic PCA [20], the observation y ∈ R^L is assumed to be driven by a latent vector ã ∈ R^H in the following form:

y = B ã + ε.

Here, B ∈ R^{L×H} specifies the linear relationship between ã and y, and ε ∈ R^L is a Gaussian noise subject to N_L(0, σ²I_L). Suppose that we are given M observed samples {y_1, ..., y_M} generated from the latent vectors {ã_1, ..., ã_M}, and each latent vector is subject to ã ∼ N_H(0, I_H). Then, the probabilistic PCA model is written as Eqs. (1) and (2) with C_A = I_H.
If we apply Bayesian inference, the intrinsic dimension H is automatically selected without predetermination [4, 14]. This useful property is called automatic dimensionality selection (ADS).

4.2 Experiment on Artificial Data

We compare the iterative algorithm and the analytic solution in the empirical VB scenario with unknown noise variance, i.e., the hyperparameters (C_A, C_B) and the noise variance σ² are also estimated from observation. We use the full-rank model (i.e., H = min(L, M)), and expect the ADS effect to automatically find the true rank H*.

[Figure 1: Experimental results for the Artificial1 dataset, where the data dimension is L = 100, the number of samples is M = 300, and the true rank is H* = 20. Panels: (a) free energy, (b) computation time, (c) estimated rank.]
[Figure 2: Experimental results for the Artificial2 dataset (L = 70, M = 300, and H* = 40). Panels: (a) free energy, (b) computation time, (c) estimated rank.]

Figure 1 shows the free energy, the computation time, and the estimated rank over iterations for an artificial (Artificial1) dataset with L = 100, M = 300, and H* = 20. We randomly created true matrices A* ∈ R^{M×H*} and B* ∈ R^{L×H*} so that each entry of A* and B* follows N_1(0, 1). An observed matrix Y was created by adding a noise subject to N_1(0, 1) to each entry of B*A*⊤.
The iterative algorithm consists of the update rules (6)-(9). Initial values were set in the following way: Â and B̂ are randomly created so that each entry follows N_1(0, 1), and the other variables are set to Σ_A = Σ_B = C_A = C_B = I_H and σ² = 1. Note that we rescale Y so that ‖Y‖²_Fro/(LM) = 1 before starting iteration.
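The Artificial1 setup described above can be reproduced with a few lines. This is a sketch under the stated setup; `make_artificial` and the returned rescaled ground truth are our own conveniences, not from the paper:

```python
import numpy as np

def make_artificial(L=100, M=300, H_true=20, seed=0):
    """Generate data as in Section 4.2: Y = B* A*^T + N(0,1) noise,
    then rescale Y so that ||Y||_Fro^2 / (L M) = 1 before iteration.
    Returns the rescaled Y and the identically rescaled ground truth."""
    rng = np.random.default_rng(seed)
    A_true = rng.standard_normal((M, H_true))
    B_true = rng.standard_normal((L, H_true))
    U_true = B_true @ A_true.T
    Y = U_true + rng.standard_normal((L, M))
    scale = np.sqrt(np.sum(Y**2) / (L * M))
    return Y / scale, U_true / scale
```

Either the iterative updates or the analytic solution can then be run on the returned Y and compared against the rescaled ground truth.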
We ran the iterative algorithm 10 times, starting from different initial points, and each trial is plotted by a solid line in Figure 1. The analytic solution consists of applying Corollary 2 combined with a naive 1-dimensional search for noise variance σ² estimation [15]. The analytic solution is plotted by the dashed line. We see that the analytic solution estimates the true rank Ĥ = H* = 20 immediately (∼0.1 sec on average over 10 trials), while the iterative algorithm does not converge in 60 sec.
Figure 2 shows experimental results on another artificial dataset (Artificial2), where L = 70, M = 300, and H* = 40. In this case, all the 10 trials of the iterative algorithm are trapped at local minima. We empirically observed a tendency that the iterative algorithm suffers from the local minima problem when H* is large (close to H).

4.3 Experiment on Benchmark Data

Figures 3 and 4 show experimental results on the Satellite and the Spectf datasets available from the UCI repository [1], showing similar tendencies to Figures 1 and 2.
We also conducted experiments on various benchmark datasets, and found that the iterative algorithm typically converges slowly and sometimes suffers from the local minima problem, while our analytic form gives the global solution immediately.

[Figure 3: Experimental results for the Sat dataset (L = 36, M = 6435). Panels: (a) free energy, (b) computation time, (c) estimated rank.]
[Figure 4: Experimental results for the Spectf dataset (L = 44, M = 267). Panels: (a) free energy, (b) computation time, (c) estimated rank.]

5 Conclusion and Discussion

In this paper, we have analyzed the fully-observed variational Bayesian matrix factorization (VBMF) under matrix-wise independence. We have shown that the VB solution under matrix-wise independence essentially agrees with the simplified VB (simpleVB) solution under column-wise independence. As a consequence, we can obtain the global VB solution under matrix-wise independence analytically in a computationally very efficient way.
Our analysis assumed uncorrelated priors. With correlated priors, the posterior is no longer uncorrelated, and thus it is not straightforward to obtain a global solution analytically.
Nevertheless, there exists a situation where an analytic solution can be easily obtained: suppose there exists an H × H non-singular matrix T such that both C′_A = T C_A T⊤ and C′_B = (T⁻¹)⊤ C_B T⁻¹ are diagonal. We can show that the free energy (13) is invariant under the following transformation for any such T:

A → A T⊤,   Σ_A → T Σ_A T⊤,   C_A → T C_A T⊤,
B → B T⁻¹,   Σ_B → (T⁻¹)⊤ Σ_B T⁻¹,   C_B → (T⁻¹)⊤ C_B T⁻¹.

Accordingly, the following procedure gives the global solution analytically: the analytic solution given the diagonal (C′_A, C′_B) is first computed, and the above transformation is then applied.
We have demonstrated the usefulness of our analytic solution in probabilistic PCA. On the other hand, robust PCA has gathered a great deal of attention recently [5], and its Bayesian variant has been proposed [2]. We expect that our analysis can handle more structured sparsity, in addition to the current low-rank-inducing sparsity. Extension of the current work along this line will allow us to give more theoretical insights into robust PCA and provide computationally efficient algorithms.
Finally, a more challenging direction is to handle priors correlated over the rows of A and B. This allows us to model correlations in the observation space and capture, e.g., short-term correlation in time-series data and correlation between neighboring pixels in image data. Analyzing such a situation, as well as missing-value imputation and tensor factorization [11, 6, 8, 21], is our important future work.

Acknowledgments

The authors thank anonymous reviewers for helpful comments. Masashi Sugiyama was supported by the FIRST program.
Derin Babacan was supported by a Beckman Postdoctoral Fellowship.

References
[1] A. Asuncion and D. J. Newman. UCI machine learning repository, 2007.
[2] D. Babacan, M. Luessi, R. Molina, and A. Katsaggelos. Sparse Bayesian methods for low-rank matrix estimation. arXiv:1102.5288v1 [stat.ML], 2011.
[3] C. M. Bishop. Bayesian principal components. In Advances in NIPS, volume 11, pages 382-388, 1999.
[4] C. M. Bishop. Variational principal components. In Proc. of ICANN, volume 1, pages 509-514, 1999.
[5] E. J. Candès, X. Li, Y. Ma, and J. Wright. Robust principal component analysis? CoRR, abs/0912.3599, 2009.
[6] J. D. Carroll and J. J. Chang. Analysis of individual differences in multidimensional scaling via an N-way generalization of 'Eckart-Young' decomposition. Psychometrika, 35:283-319, 1970.
[7] S. Funk. Try this at home. http://sifter.org/~simon/journal/20061211.html, 2006.
[8] R. A. Harshman. Foundations of the PARAFAC procedure: Models and conditions for an "explanatory" multimodal factor analysis. UCLA Working Papers in Phonetics, 16:1-84, 1970.
[9] H. Hotelling. Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 24:417-441, 1933.
[10] H. Hotelling. Relations between two sets of variates. Biometrika, 28(3-4):321-377, 1936.
[11] T. G. Kolda and B. W. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455-500, 2009.
[12] J. A. Konstan, B. N. Miller, D. Maltz, J. L. Herlocker, L. R. Gordon, and J. Riedl.
GroupLens: Applying collaborative filtering to Usenet news. Communications of the ACM, 40(3):77-87, 1997.
[13] Y. J. Lim and Y. W. Teh. Variational Bayesian approach to movie rating prediction. In Proceedings of KDD Cup and Workshop, 2007.
[14] S. Nakajima, M. Sugiyama, and D. Babacan. On Bayesian PCA: Automatic dimensionality selection and analytic solution. In Proceedings of the 28th International Conference on Machine Learning (ICML 2011), Bellevue, WA, USA, Jun. 28-Jul. 2, 2011.
[15] S. Nakajima, M. Sugiyama, and R. Tomioka. Global analytic solution for variational Bayesian matrix factorization. In J. Lafferty, C. K. I. Williams, R. Zemel, J. Shawe-Taylor, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 1759-1767, 2010.
[16] T. Raiko, A. Ilin, and J. Karhunen. Principal component analysis for large scale problems with lots of missing values. In J. Kok, J. Koronacki, R. Lopez de Mantaras, S. Matwin, D. Mladenic, and A. Skowron, editors, Proceedings of the 18th European Conference on Machine Learning, volume 4701 of Lecture Notes in Computer Science, pages 691-698, Berlin, 2007. Springer-Verlag.
[17] S. Roweis and Z. Ghahramani. A unifying review of linear Gaussian models. Neural Computation, 11:305-345, 1999.
[18] R. Salakhutdinov and A. Mnih. Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In International Conference on Machine Learning, 2008.
[19] R. Salakhutdinov and A. Mnih. Probabilistic matrix factorization. In J. C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 1257-1264, Cambridge, MA, 2008. MIT Press.
[20] M. E. Tipping and C. M. Bishop. Probabilistic principal component analysis. Journal of the Royal Statistical Society, 61:611-622, 1999.
[21] L. R. Tucker. Some mathematical notes on three-mode factor analysis.
Psychometrika, 31:279-311, 1966.", "award": [], "sourceid": 160, "authors": [{"given_name": "Shinichi", "family_name": "Nakajima", "institution": null}, {"given_name": "Masashi", "family_name": "Sugiyama", "institution": null}, {"given_name": "S.", "family_name": "Babacan", "institution": null}]}