{"title": "Global Solver and Its Efficient Approximation for Variational Bayesian Low-rank Subspace Clustering", "book": "Advances in Neural Information Processing Systems", "page_first": 1439, "page_last": 1447, "abstract": "When a probabilistic model and its  prior are given, Bayesian learning offers inference with automatic parameter tuning. However, Bayesian learning is often obstructed by computational difficulty: the rigorous Bayesian learning is  intractable in many models, and its variational Bayesian (VB) approximation is prone to suffer from local minima. In this paper, we overcome this difficulty for low-rank subspace clustering (LRSC) by providing an exact global solver and its efficient approximation. LRSC extracts a low-dimensional structure of data by embedding  samples into the union of low-dimensional subspaces, and its variational Bayesian variant has shown good performance. We first prove a key property that the VB-LRSC model is highly redundant. Thanks to this property, the optimization problem of VB-LRSC can be separated into small subproblems, each of which has only a small number of unknown variables. Our exact global solver relies on another key property that the stationary condition of each subproblem is written as a set of polynomial equations, which is solvable with the homotopy method. For further computational efficiency,  we also propose an efficient approximate variant, of which the stationary condition can be written as a polynomial equation with a single variable. Experimental results show the usefulness of our approach.", "full_text": "Global Solver and Its Ef\ufb01cient Approximation for\nVariational Bayesian Low-rank Subspace Clustering\n\nShinichi Nakajima\nNikon Corporation\n\nTokyo, 140-8601 Japan\n\nAkiko Takeda\n\nThe University of Tokyo\nTokyo, 113-8685 Japan\n\nnakajima.s@nikon.co.jp\n\ntakeda@mist.i.u-tokyo.ac.jp\n\nS. 
Derin Babacan\n\nGoogle Inc.\n\nMountain View, CA 94043 USA\n\nMasashi Sugiyama\n\nTokyo Institute of Technology\n\nTokyo 152-8552, Japan\n\ndbabacan@gmail.com\n\nsugi@cs.titech.ac.jp\n\nIchiro Takeuchi\n\nNagoya Institute of Technology\n\nAichi, 466-8555, Japan\n\ntakeuchi.ichiro@nitech.ac.jp\n\nAbstract\n\nWhen a probabilistic model and its prior are given, Bayesian learning offers infer-\nence with automatic parameter tuning. However, Bayesian learning is often ob-\nstructed by computational dif\ufb01culty: the rigorous Bayesian learning is intractable\nin many models, and its variational Bayesian (VB) approximation is prone to suf-\nfer from local minima. In this paper, we overcome this dif\ufb01culty for low-rank\nsubspace clustering (LRSC) by providing an exact global solver and its ef\ufb01cient\napproximation. LRSC extracts a low-dimensional structure of data by embedding\nsamples into the union of low-dimensional subspaces, and its variational Bayesian\nvariant has shown good performance. We \ufb01rst prove a key property that the VB-\nLRSC model is highly redundant. Thanks to this property, the optimization prob-\nlem of VB-LRSC can be separated into small subproblems, each of which has\nonly a small number of unknown variables. Our exact global solver relies on an-\nother key property that the stationary condition of each subproblem consists of\na set of polynomial equations, which is solvable with the homotopy method. For\nfurther computational ef\ufb01ciency, we also propose an ef\ufb01cient approximate variant,\nof which the stationary condition can be written as a polynomial equation with a\nsingle variable. Experimental results show the usefulness of our approach.\n\n1 Introduction\n\nPrincipal component analysis (PCA) is a widely-used classical technique for dimensionality reduc-\ntion. This amounts to globally embedding the data points into a low-dimensional subspace. 
As more\nflexible models, the sparse subspace clustering (SSC) [7, 20] and the low-rank subspace clustering\n(LRSC) [8, 13, 15, 14] were proposed. By inducing sparsity and low-rankness, respectively, SSC\nand LRSC locally embed the data into the union of subspaces. This paper discusses a probabilistic\nmodel for LRSC.\nJust as the classical PCA requires users to pre-determine the dimensionality of the subspace, LRSC\nrequires manual parameter tuning for adjusting the low-rankness of the solution. On the other hand,\nBayesian formulations enable us to estimate all unknown parameters without manual parameter\ntuning [5, 4, 17]. However, in many problems, the rigorous application of Bayesian inference is\ncomputationally intractable. To overcome this difficulty, the variational Bayesian (VB) approximation\nwas proposed [1]. A Bayesian formulation and its variational inference have been proposed for\nLRSC [2]. There, to avoid computing the inverse of a prohibitively large matrix, the posterior is\napproximated with the matrix-variate Gaussian (MVG) [11].\nTypically, the VB solution is computed by the iterated conditional modes (ICM) algorithm [3, 5],\nwhich is derived through the standard procedure for the VB approximation. Since the objective\nfunction for the VB approximation is generally non-convex, the ICM algorithm is prone to suffer\nfrom local minima. So far, the global solution for the VB approximation has not been attainable except\nfor PCA (or the fully-observed matrix factorization), for which the global VB solution has been analytically\nobtained [17]. 
This paper makes LRSC another exception with proposed global VB solvers.\nTwo common factors make the global VB solution attainable in PCA and LRSC: first, a large portion\nof the degrees of freedom that the VB approximation learns are irrelevant, and the optimization\nproblem can be decomposed into subproblems, each of which has only a small number of unknown\nvariables; second, the stationary condition of each subproblem is written as a polynomial system (a\nset of polynomial equations).\nBased on these facts, we propose an exact global VB solver (EGVBS) and an approximate global\nVB solver (AGVBS). EGVBS finds all stationary points by solving the polynomial system with the\nhomotopy method [12, 10], and outputs the one giving the lowest free energy. Although EGVBS\nsolves subproblems with much less complexity than the original VB problem, it is still not efficient\nenough for handling large-scale data. For further computational efficiency, we propose AGVBS, of\nwhich the stationary condition is written as a polynomial equation with a single variable. Our\nexperiments on artificial and benchmark datasets show that AGVBS provides a more accurate solution\nthan the MVG approximation [2] with much less computation time.\n\n2 Background\n\nIn this section, we introduce the low-rank subspace clustering and its variational Bayesian\nformulation.\n\n2.1 Subspace Clustering Methods\nLet $Y = (\\mathbf{y}_1, \\ldots, \\mathbf{y}_M) \\in \\mathbb{R}^{L \\times M}$ be $L$-dimensional observed samples of size $M$. We generally\ndenote a column vector of a matrix by a bold-faced small letter. We assume that each $\\mathbf{y}_m$ is\napproximately expressed as a linear combination of $M'$ words in a dictionary, $D = (\\mathbf{d}_1, \\ldots, \\mathbf{d}_{M'})$,\ni.e.,\n\n$$Y = DX + E,$$\n\nwhere $X \\in \\mathbb{R}^{M' \\times M}$ is a matrix of unknown coefficients, and $E \\in \\mathbb{R}^{L \\times M}$ is noise. In subspace clustering,\nthe observed matrix $Y$ itself is often used as a dictionary $D$. 
The convex formulation of the sparse\nsubspace clustering (SSC) [7, 20] is given by\n\n$$\\min_X \\|Y - YX\\|_{\\mathrm{Fro}}^2 + \\lambda \\|X\\|_1, \\quad \\text{s.t. } \\mathrm{diag}(X) = 0, \\tag{1}$$\n\nwhere $X \\in \\mathbb{R}^{M \\times M}$ is a parameter to be estimated, and $\\lambda > 0$ is a regularization coefficient to be\nmanually tuned. $\\|\\cdot\\|_{\\mathrm{Fro}}$ and $\\|\\cdot\\|_1$ are the Frobenius norm and the (element-wise) $\\ell_1$-norm of a\nmatrix, respectively. The first term in Eq.(1) requires that each data point $\\mathbf{y}_m$ can be expressed as\na linear combination of a small set of other data points $\\{\\mathbf{d}_{m'}\\}$ for $m' \\neq m$. This smallness of the\nset is enforced by the second ($\\ell_1$-regularization) term, and leads to the low-dimensionality of each\nobtained subspace. After the minimizer $\\hat{X}$ is obtained, $\\mathrm{abs}(\\hat{X}) + \\mathrm{abs}(\\hat{X}^\\top)$, where $\\mathrm{abs}(\\cdot)$ takes the\nabsolute value element-wise, is regarded as an affinity matrix, and a spectral clustering algorithm,\nsuch as normalized cuts [19], is applied to obtain clusters.\nIn the low-rank subspace clustering (LRSC) or low-rank representation [8, 13, 15, 14],\nlow-dimensional subspaces are sought by enforcing the low-rankness of $X$:\n\n$$\\min_X \\|Y - YX\\|_{\\mathrm{Fro}}^2 + \\lambda \\|X\\|_{\\mathrm{tr}}. \\tag{2}$$\n\nThanks to its simplicity, the global solution of Eq.(2) has been analytically obtained [8].\n\n2.2 Variational Bayesian Low-rank Subspace Clustering\n\nWe formulate the probabilistic model of LRSC, so that the maximum a posteriori (MAP) estimator\ncoincides with the solution of the problem (2) under a certain hyperparameter setting:\n\n$$p(Y | A', B') \\propto \\exp\\left(-\\frac{1}{2\\sigma^2} \\|Y - D B' A'^\\top\\|_{\\mathrm{Fro}}^2\\right), \\tag{3}$$\n\n$$p(A') \\propto \\exp\\left(-\\frac{1}{2} \\mathrm{tr}(A' C_A^{-1} A'^\\top)\\right), \\quad p(B') \\propto \\exp\\left(-\\frac{1}{2} \\mathrm{tr}(B' C_B^{-1} B'^\\top)\\right). \\tag{4}$$\n\nHere, we factorized $X$ as $X = B' A'^\\top$, as in [2], to induce low-rankness through the model-induced\nregularization mechanism [17]. In this formulation, $A' \\in \\mathbb{R}^{M \\times H}$ and $B' \\in \\mathbb{R}^{M \\times H}$ for $H \\le \\min(L, M)$ are the parameters to be 
estimated. We assume that hyperparameters\n\n$$C_A = \\mathrm{diag}(c_{a_1}^2, \\ldots, c_{a_H}^2), \\quad C_B = \\mathrm{diag}(c_{b_1}^2, \\ldots, c_{b_H}^2)$$\n\nare diagonal and positive definite. The dictionary $D$ is treated as a constant, and set to $D = Y$, once\n$Y$ is observed.1\nThe Bayes posterior is written as\n\n$$p(A', B' | Y) = \\frac{p(Y | A', B') p(A') p(B')}{p(Y)}, \\tag{5}$$\n\nwhere $p(Y) = \\langle p(Y | A', B') \\rangle_{p(A') p(B')}$ is the marginal likelihood. Here, $\\langle \\cdot \\rangle_p$ denotes the expectation\nover the distribution $p$. Since the Bayes posterior (5) is computationally intractable, we adopt\nthe variational Bayesian (VB) approximation [1, 5].\nLet $r(A', B')$, or $r$ for short, be a trial distribution. The following functional with respect to $r$ is\ncalled the free energy:\n\n$$F(r) = \\left\\langle \\log \\frac{r(A', B')}{p(Y | A', B') p(A') p(B')} \\right\\rangle_{r(A', B')} = \\left\\langle \\log \\frac{r(A', B')}{p(A', B' | Y)} \\right\\rangle_{r(A', B')} - \\log p(Y). \\tag{6}$$\n\nIn the last equation of Eq.(6), the first term is the Kullback-Leibler (KL) distance from the trial\ndistribution to the Bayes posterior, and the second term is a constant. Therefore, minimizing the free\nenergy (6) amounts to finding a distribution closest to the Bayes posterior in the sense of the KL\ndistance. 
In the VB approximation, the free energy (6) is minimized over some restricted function\nspace.\n\n2.2.1 Standard VB (SVB) Iteration\n\nThe standard procedure for the VB approximation imposes the following constraint on the posterior:\n\n$$r(A', B') = r(A') r(B').$$\n\nBy using the variational method, we can show that the VB posterior is Gaussian, and has the\nfollowing form:\n\n$$r(A') \\propto \\exp\\left(-\\frac{\\mathrm{tr}\\left((A' - \\hat{A}') \\Sigma_{A'}^{-1} (A' - \\hat{A}')^\\top\\right)}{2}\\right), \\quad r(B') \\propto \\exp\\left(-\\frac{(\\breve{\\mathbf{b}}' - \\hat{\\breve{\\mathbf{b}}}')^\\top \\breve{\\Sigma}_{B'}^{-1} (\\breve{\\mathbf{b}}' - \\hat{\\breve{\\mathbf{b}}}')}{2}\\right), \\tag{7}$$\n\nwhere $\\breve{\\mathbf{b}}' = \\mathrm{vec}(B') \\in \\mathbb{R}^{MH}$. The means and the covariances satisfy the following equations:\n\n$$\\hat{A}' = \\frac{1}{\\sigma^2} Y^\\top Y \\hat{B}' \\Sigma_{A'}, \\quad \\Sigma_{A'} = \\sigma^2 \\left(\\langle B'^\\top Y^\\top Y B' \\rangle_{r(B')} + \\sigma^2 C_A^{-1}\\right)^{-1}, \\tag{8}$$\n\n$$\\hat{\\breve{\\mathbf{b}}}' = \\frac{\\breve{\\Sigma}_{B'}}{\\sigma^2} \\mathrm{vec}\\left(Y^\\top Y \\hat{A}'\\right), \\quad \\breve{\\Sigma}_{B'} = \\sigma^2 \\left((\\hat{A}'^\\top \\hat{A}' + M \\Sigma_{A'}) \\otimes Y^\\top Y + \\sigma^2 (C_B^{-1} \\otimes I_M)\\right)^{-1}, \\tag{9}$$\n\nwhere $\\otimes$ denotes the Kronecker product of matrices, and $I_M$ is the $M$-dimensional identity matrix.\n\n1 Our formulation is slightly different from the one proposed in [2], where a clean version of $Y$ is introduced\nas an additional parameter to cope with outliers. Since we focus on the LRSC model without outliers in this\npaper, we simplified the model. 
In our formulation, the clean dictionary corresponds to $Y B A^\\top (B A^\\top)^\\dagger$, where\n$\\dagger$ denotes the pseudo-inverse of a matrix.\n\nFor empirical VB learning, where the hyperparameters are also estimated from observation, the\nfollowing are obtained from the derivatives of the free energy (6):\n\n$$c_{a_h}^2 = \\|\\hat{\\mathbf{a}}'_h\\|^2 / M + \\sigma_{a'_h}^2, \\quad c_{b_h}^2 = \\left(\\|\\hat{\\mathbf{b}}'_h\\|^2 + \\sum_{m=1}^{M} \\sigma_{B'_{m,h}}^2\\right) / M, \\tag{10}$$\n\n$$\\sigma^2 = \\frac{\\mathrm{tr}\\left(Y^\\top Y \\left(I_M - 2 \\hat{B}' \\hat{A}'^\\top + \\langle B' (\\hat{A}'^\\top \\hat{A}' + M \\Sigma_{A'}) B'^\\top \\rangle_{r(B')}\\right)\\right)}{LM}, \\tag{11}$$\n\nwhere $(\\sigma_{a'_1}^2, \\ldots, \\sigma_{a'_H}^2)$ and $((\\sigma_{B'_{1,1}}^2, \\ldots, \\sigma_{B'_{M,1}}^2), \\ldots, (\\sigma_{B'_{1,H}}^2, \\ldots, \\sigma_{B'_{M,H}}^2))$ are the diagonal entries\nof $\\Sigma_{A'}$ and $\\breve{\\Sigma}_{B'}$, respectively. Eqs.(8)\u2013(11) form an ICM algorithm, which we call the standard VB\n(SVB) iteration.\n\n2.2.2 Matrix-Variate Gaussian Approximate (MVGA) Iteration\n\nActually, the SVB iteration cannot be applied to a large-scale problem, because Eq.(9) requires the\ninversion of a huge $MH \\times MH$ matrix. 
This difficulty can be avoided by restricting $r(B')$ to be\nthe matrix-variate Gaussian (MVG) [11], i.e.,\n\n$$r(B') \\propto \\exp\\left(-\\frac{1}{2} \\mathrm{tr}\\left(\\Theta_{B'}^{-1} (B' - \\hat{B}') \\Sigma_{B'}^{-1} (B' - \\hat{B}')^\\top\\right)\\right). \\tag{12}$$\n\nUnder this additional constraint, a gradient-based computationally tractable algorithm can be derived\n[2], which we call the MVG approximate (MVGA) iteration.\n\n3 Global Variational Bayesian Solvers\n\nIn this section, we first show that a large portion of the degrees of freedom in the expression (7)\nare irrelevant, which significantly reduces the complexity of the optimization problem without the\nMVG approximation. Then, we propose an exact global VB solver and its approximation.\n\n3.1 Irrelevant Degrees of Freedom of VB-LRSC\n\nConsider the following transforms:\n\n$$A = \\Omega_Y^{\\mathrm{right}\\top} A', \\quad B = \\Omega_Y^{\\mathrm{right}\\top} B', \\tag{13}$$\n\nwhere\n\n$$Y = \\Omega_Y^{\\mathrm{left}} \\Gamma_Y \\Omega_Y^{\\mathrm{right}\\top}$$\n\nis the singular value decomposition (SVD) of $Y$. Then, we obtain the following theorem:\n\nTheorem 1 The global minimum of the VB free energy (6) is achieved with a solution such that\n$\\hat{A}, \\hat{B}, \\Sigma_A, \\breve{\\Sigma}_B$ are diagonal.\n\n(Sketch of proof) After the transform (13), we can regard the observed matrix as a diagonal matrix,\ni.e., $Y \\to \\Gamma_Y$. Since we assume Gaussian priors with no correlation, the solution $\\hat{B} \\hat{A}^\\top$ is naturally\nexpected to be diagonal. To prove this intuition, we apply a similar approach to [17], where the\ndiagonalities of the VB posterior covariances were proved in fully-observed matrix factorization by\ninvestigating perturbations around any solution. We first show that $\\hat{A}'^\\top \\hat{A}' + M \\Sigma_{A'}$ is diagonal, with\nwhich Eq.(9) implies the diagonality of $\\breve{\\Sigma}_B$. 
Other diagonalities can be shown similarly. \u25a1\nTheorem 1 not only greatly reduces the complexity of the optimization problem, but also makes\nthe problem separable, as shown in the following.\n\n3.2 Exact Global VB Solver (EGVBS)\n\nThanks to Theorem 1, the free energy minimization problem can be decomposed as follows:\n\nLemma 1 Let $J (\\le \\min(L, M))$ be the rank of $Y$, $\\gamma_m$ be the $m$-th largest singular value of $Y$, and\n$(\\hat{a}_1, \\ldots, \\hat{a}_H)$, $(\\sigma_{a_1}^2, \\ldots, \\sigma_{a_H}^2)$, $(\\hat{b}_1, \\ldots, \\hat{b}_H)$, $((\\sigma_{B_{1,1}}^2, \\ldots, \\sigma_{B_{M,1}}^2), \\ldots, (\\sigma_{B_{1,H}}^2, \\ldots, \\sigma_{B_{M,H}}^2))$\nbe the diagonal entries of $\\hat{A}, \\Sigma_A, \\hat{B}, \\breve{\\Sigma}_B$, respectively. Then, the free energy (6) is written as\n\n$$F = \\frac{1}{2} \\left(LM \\log(2\\pi\\sigma^2) + \\frac{\\sum_{h=1}^{J} \\gamma_h^2}{\\sigma^2} + \\sum_{h=1}^{H} 2F_h\\right), \\tag{14}$$\n\nwhere\n\n$$2F_h = M \\log \\frac{c_{a_h}^2}{\\sigma_{a_h}^2} + \\sum_{m=1}^{J} \\log \\frac{c_{b_h}^2}{\\sigma_{B_{m,h}}^2} - (M + J) + \\frac{\\hat{a}_h^2 + M \\sigma_{a_h}^2}{c_{a_h}^2} + \\frac{\\hat{b}_h^2 + \\sum_{m=1}^{J} \\sigma_{B_{m,h}}^2}{c_{b_h}^2} + \\frac{1}{\\sigma^2} \\left\\{\\gamma_h^2 \\left(-2 \\hat{a}_h \\hat{b}_h + \\hat{b}_h^2 (\\hat{a}_h^2 + M \\sigma_{a_h}^2)\\right) + \\sum_{m=1}^{J} \\gamma_m^2 \\sigma_{B_{m,h}}^2 (\\hat{a}_h^2 + M \\sigma_{a_h}^2)\\right\\}, \\tag{15}$$\n\nand its stationary condition is given as 
follows: for each $h = 1, \\ldots, H$,\n\n$$\\hat{a}_h = \\frac{\\gamma_h^2}{\\sigma^2} \\hat{b}_h \\sigma_{a_h}^2, \\quad \\sigma_{a_h}^2 = \\sigma^2 \\left(\\gamma_h^2 \\hat{b}_h^2 + \\sum_{m=1}^{J} \\gamma_m^2 \\sigma_{B_{m,h}}^2 + \\frac{\\sigma^2}{c_{a_h}^2}\\right)^{-1}, \\tag{16}$$\n\n$$\\hat{b}_h = \\frac{\\gamma_h^2}{\\sigma^2} \\hat{a}_h \\sigma_{B_{h,h}}^2, \\quad \\sigma_{B_{m,h}}^2 = \\begin{cases} \\sigma^2 \\left(\\gamma_m^2 (\\hat{a}_h^2 + M \\sigma_{a_h}^2) + \\frac{\\sigma^2}{c_{b_h}^2}\\right)^{-1} & (m \\le J), \\\\ c_{b_h}^2 & (m > J), \\end{cases} \\tag{17}$$\n\n$$c_{a_h}^2 = \\hat{a}_h^2 / M + \\sigma_{a_h}^2, \\quad c_{b_h}^2 = \\left(\\hat{b}_h^2 + \\sum_{m=1}^{J} \\sigma_{B_{m,h}}^2\\right) / J. \\tag{18}$$\n\nIf no stationary point gives $F_h < 0$, the global solution is given by\n\n$$\\hat{a}_h = \\hat{b}_h = 0, \\quad \\sigma_{a_h}^2, \\sigma_{B_{m,h}}^2, c_{a_h}^2, c_{b_h}^2 \\to 0 \\quad \\text{for } m = 1, \\ldots, M. \\tag{19}$$\n\nTaking account of the trivial relations $c_{b_h}^2 = \\sigma_{B_{m,h}}^2$ for $m > J$, Eqs.(16)\u2013(18) for each $h$ can be seen\nas a polynomial system with $5 + J$ unknown variables, i.e., $(\\hat{a}_h, \\sigma_{a_h}^2, c_{a_h}^2, \\hat{b}_h, \\{\\sigma_{B_{m,h}}^2\\}_{m=1}^{J}, c_{b_h}^2)$.\nThus, Lemma 1 has decomposed the original problem (8)\u2013(10) with $O(M^2 H^2)$ unknown variables\ninto $H$ subproblems with $O(J)$ variables each.\nGiven the noise variance $\\sigma^2$, our exact global VB solver (EGVBS) finds all stationary points that\nsatisfy the polynomial system (16)\u2013(18) by using the homotopy method [12, 10].2 After that, it discards\nthe prohibitive solutions with complex numbers or with negative variances, and then selects the one\ngiving the smallest $F_h$, defined by Eq.(15). The global solution is the selected stationary point if\nit satisfies $F_h < 0$, or the null solution (19) otherwise. Algorithm 1 summarizes the procedure of\nEGVBS. 
If $\\sigma^2$ is unknown, we conduct a naive 1-D search by iteratively applying EGVBS, as for\nVB matrix factorization [17].\n\nAlgorithm 1 Exact Global VB Solver (EGVBS) for LRSC.\n1: Calculate the SVD of $Y = \\Omega_Y^{\\mathrm{left}} \\Gamma_Y \\Omega_Y^{\\mathrm{right}\\top}$.\n2: for $h = 1$ to $H$ do\n3:   Find all the solutions of the polynomial system (16)\u2013(18) by the homotopy method.\n4:   Discard prohibitive solutions with complex numbers or with negative variances.\n5:   Select the stationary point giving the smallest $F_h$ (defined by Eq.(15)).\n6:   The global solution for $h$ is the selected stationary point if it satisfies $F_h < 0$, otherwise the\n     null solution (19).\n7: end for\n8: Calculate $\\hat{X} = \\Omega_Y^{\\mathrm{right}} \\hat{B} \\hat{A}^\\top \\Omega_Y^{\\mathrm{right}\\top}$.\n9: Apply spectral clustering with the affinity matrix equal to $\\mathrm{abs}(\\hat{X}) + \\mathrm{abs}(\\hat{X}^\\top)$.\n\n3.3 Approximate Global VB Solver (AGVBS)\n\nAlthough Lemma 1 significantly reduced the complexity of the optimization problem, EGVBS is\nnot applicable to large-scale data, since the homotopy method is not guaranteed to find all the\nsolutions in polynomial time in $J$, when the polynomial system involves $O(J)$ unknown variables. For\nlarge-scale data, we propose a scalable approximation by introducing an additional constraint that\n$\\gamma_m^2 \\sigma_{B_{m,h}}^2$ are constant over $m = 1, \\ldots, J$, i.e.,\n\n$$\\gamma_m^2 \\sigma_{B_{m,h}}^2 = \\sigma_{b_h}^2 \\quad \\text{for all } m \\le J. \\tag{20}$$\n\n2 The homotopy method is a reliable and efficient numerical method to solve a polynomial system [6, 9]. It\nprovides all the isolated solutions to a system of $n$ polynomials $f(x) \\equiv (f_1(x), \\ldots, f_n(x)) = 0$ by defining a\nsmooth set of homotopy systems with a parameter $t \\in [0, 1]$: $g(x, t) \\equiv (g_1(x, t), g_2(x, t), \\ldots, g_n(x, t)) = 0$,\nsuch that one can continuously trace the solution path from the easiest ($t = 0$) to the target ($t = 1$). 
We use\nHOM4PS-2.0 [12], which is one of the most successful polynomial system solvers.\n\n= (cid:27)2\nbh\n\nBm;h\n\nfor all m (cid:20) J:\n\n5\n\n(16)\n\n(17)\n\n(18)\n\n(19)\n\n)\n\n\fUnder this constraint, we obtain the following theorem (the proof is omitted):\n\n\ufb01es the following polynomial equation with a single variablebb(cid:13)h:\n\nTheorem 2 Under the constraint (20), any stationary point of the free energy (15) for each h satis-\n\nbb(cid:13)\n\nbb(cid:13)\n\nbb(cid:13)\n\nbb(cid:13)h + (cid:24)0 = 0;\n\nbb(cid:13)\n\nbb(cid:13)\n\n(cid:24)6\n\n6\nh + (cid:24)5\n\n5\nh + (cid:24)4\n\n4\nh + (cid:24)3\n\n3\nh + (cid:24)2\n\n2\nh + (cid:24)1\n\nwhere\n\n;\n\n(cid:13)3\nh\n\n(cid:13)h\n\n(cid:13)4\nh\n\n(cid:13)h\n\nh)\n\n(cid:13)3\nh\n\n(cid:13)2\nh\n\nh)\n\n(cid:13)h\n\n(cid:13)2\nh\n\n(cid:13)h\n\n(cid:13)2\nh\nh)\n\n(cid:24)4 = \u03d52\n\n+ 1 + \u03d52\n\nHere, (cid:13) = (\n\n+ \u03d5hM J(cid:27)4\n\n(cid:24)5 = (cid:0)2 \u03d52\n\n(cid:0) \u03d5hM (cid:27)2((M +J)(cid:27)2(cid:0)(cid:13)2\n\nhM (cid:27)2\n+ 2\u03d5h\n(cid:13)3\n(cid:13)h\nh\n(cid:0) 2(M(cid:0)J)(cid:27)2\n\n(cid:24)6 = \u03d52\nh\n(cid:13)2\nh\n(cid:24)3 = 2\u03d5hM (M(cid:0)J)(cid:27)4\n(cid:24)2 = (M(cid:0)J)2(cid:27)4\n\u2211\n(cid:24)1 = (cid:0) (M(cid:0)J)(cid:27)2((M +J)(cid:27)2(cid:0)(cid:13)2\nb(cid:13)h =bb(cid:13) + (cid:13)h (cid:0) M (cid:27)2\n; b(cid:20) = (cid:13)2\n\u221ab(cid:20)2 (cid:0) 4M J(cid:27)4\n(b(cid:20) +\nb(cid:28) = 1\n)\n(bah; (cid:27)2\n;bbh; (cid:27)2\n\n(cid:13)h\n(cid:0)2\nm =J)\n\n;\n1 (cid:0) (cid:13)2\n\n+ ((M + J)(cid:27)2 (cid:0) (cid:13)2\n\n(cid:0) 2\u03d5h(2M(cid:0)J)(cid:27)2\n(cid:0) \u03d52\n\nh(M (cid:27)2(cid:0)(cid:13)2\nh)\nhM (cid:27)2(M (cid:27)2(cid:0)(cid:13)2\n+ \u03d5h(M (cid:27)2(cid:0)(cid:13)2\nh)\nh) (cid:0) \u03d5h(M(cid:0)J)(cid:27)2(M (cid:27)2(cid:0)(cid:13)2\n\nhM 2(cid:27)4\n;\n+ \u03d5h((M +J)(cid:27)2(cid:0)(cid:13)2\n)\n(\n. 
For each real solutionbb(cid:13)h such that\n(cid:0) (M + J)(cid:27)2 (cid:0)(\n))\n)(cid:0)1\n(\n; b(cid:14)h = (cid:27)2pb(cid:28)\n(cid:0)b(cid:13)h\nbb(cid:13)\n)\n(\u221ab(cid:13)b(cid:14); (cid:27)2b(cid:14)h\n\u221ab(cid:13)=b(cid:14)=(cid:13)h;\npb(cid:28) =(cid:13)2\npb(cid:28) ;\n\nbb(cid:13)\n\u03d5h\n(cid:13)h (cid:0) M (cid:27)2\n\nM (cid:27)2 (cid:0) (cid:13)2\n\nare real and positive, the corresponding stationary point candidate is given by\n\n(cid:0)1 and \u03d5h =\n\n(cid:24)0 = M J(cid:27)4:\n\n)\n(\n\nJ\nm=1 (cid:13)\n\n1 + \u03d5h\n\n;\n\n(cid:13)h\n\nh)\n\n;\n\n2M J\n\nh\n(cid:13)2\n\n(cid:13)2\nh\n\n; c2\nah\n\nah\n\n; c2\nbh\n\nbh\n\n=\n\n(cid:13)h\n\nh\n\n(cid:13)h\n\n;\n\n(cid:13)h\n\n(cid:13)h\n\n;\n\n(cid:13)2\nh\n\nh\n\nb(cid:14)h(cid:0)\u03d5h\n\n(cid:27)2\n\n(cid:13)h\n\n;\n\n(cid:27)2pb(cid:28)\n\nh)\n\n(21)\n\n;\n\n(22)\n\n;\n\n(23)\n\n(24)\n\n(25)\n\n(26)\n\n(27)\n\n:\n\n(28)\n\nh\n\nGiven the noise variance (cid:27)2, obtaining the coef\ufb01cients (22)\u2013(25) is straightforward. Our approx-\nimate global VB solver (AGVBS) solves the sixth-order polynomial equation (21), e.g., by the\n\u2018roots\u2019 function in MATLAB R\u20dd, and obtain all candidate stationary points by using Eqs.(26)\u2013(28).\nThen, it selects the one giving the smallest Fh, and the global solution is the selected stationary point\nif it satis\ufb01es Fh < 0, otherwise the null solution (19). Note that, although a solution of Eq.(21) is not\nnecessarily a stationary point, selection based on the free energy discards all non-stationary points\nand local maxima. As in EGVBS, a naive 1-D search is conducted for estimating (cid:27)2.\nIn Section 4, we show that AGVBS is practically a good alternative to the MVGA iteration in terms\nof accuracy and computation time.\n\n4 Experiments\n\nIn this section, we experimentally evaluate the proposed EGVBS and AGVBS. 
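As a rough illustration of the AGVBS procedure of Section 3.3 (solve a single-variable polynomial, keep the real roots, then select by free energy), the following Python sketch uses NumPy's `numpy.roots` in place of the MATLAB 'roots' function. The helper names `real_roots` and `select_candidate` and the tolerance are our own assumptions for illustration. The actual coefficients (22)-(25), the positivity checks attached to Eqs.(26)-(27), and the free energy (15) are not implemented here and must be supplied by the caller.

```python
import numpy as np


def real_roots(coeffs, tol=1e-9):
    """Real roots of a polynomial, coefficients given highest degree first.

    A root is kept when its imaginary part is negligible relative to its size.
    """
    roots = np.roots(coeffs)
    scale = np.maximum(1.0, np.abs(roots))
    is_real = np.abs(roots.imag) <= tol * scale
    return np.sort(roots[is_real].real)


def select_candidate(candidates, free_energy):
    """Return (best candidate, its free energy).

    `free_energy` is a caller-supplied function (hypothetical stand-in for F_h).
    None represents the null solution, which is kept unless some candidate
    attains a strictly negative free energy (the "F_h < 0" rule in the text).
    """
    best, best_f = None, 0.0
    for candidate in candidates:
        f = free_energy(candidate)
        if f < best_f:
            best, best_f = candidate, f
    return best, best_f


if __name__ == "__main__":
    # Toy sixth-order polynomial (x-1)(x-2)(x^2+1)(x^2+4); NOT the paper's Eq.(21).
    toy_coeffs = np.real(np.poly([1, 2, 1j, -1j, 2j, -2j]))
    candidates = real_roots(toy_coeffs)
    # Toy objective used only to exercise the selection rule.
    print(select_candidate(candidates, lambda g: (g - 2.0) ** 2 - 0.5))
```

Treating the null solution as the default (rather than as one more candidate) mirrors the rule that a stationary point is accepted only when its free energy is negative, so no special value has to be assigned to the limit in Eq.(19).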
We assume that the\nhyperparameters $(C_A, C_B)$ and the noise variance $\\sigma^2$ are unknown and estimated from observations.\nWe use the full-rank model (i.e., $H = \\min(L, M)$), and expect VB-LRSC to automatically find the\ntrue rank without any parameter tuning.\nWe first conducted an experiment with a small artificial dataset (\u2018artificial small\u2019), on which the\nexact algorithms, i.e., the SVB iteration (Section 2.2.1) and EGVBS (Section 3.2), are computationally\ntractable. Through this experiment, we can measure the accuracy of the efficient approximate\nvariants, i.e., the MVGA iteration (Section 2.2.2) and AGVBS (Section 3.3). We randomly created\n$M = 75$ samples in $L = 10$ dimensional space. We assumed $K = 2$ clusters: $M^{(1)*} = 50$\nsamples lie in a $H^{(1)*} = 3$ dimensional subspace, and the other $M^{(2)*} = 25$ samples lie in a\n$H^{(2)*} = 1$ dimensional subspace. For each cluster $k$, we independently drew $M^{(k)*}$ samples from\n$\\mathcal{N}_{H^{(k)*}}(0, 10 I_{H^{(k)*}})$, where $\\mathcal{N}_d(\\mu, \\Sigma)$ denotes the $d$-dimensional Gaussian, and projected them\ninto the observed $L$-dimensional space by $R^{(k)} \\in \\mathbb{R}^{L \\times H^{(k)*}}$, each entry of which follows $\\mathcal{N}_1(0, 1)$.\nThus, we obtained a noiseless matrix $Y^{(k)*} \\in \\mathbb{R}^{L \\times M^{(k)*}}$ for the $k$-th cluster. Concatenating all\nclusters, $Y^* = (Y^{(1)*}, \\ldots, Y^{(K)*})$, and adding random noise subject to $\\mathcal{N}_1(0, 1)$ to each entry gave\nan artificial observed matrix $Y \\in \\mathbb{R}^{L \\times M}$, where $M = \\sum_{k=1}^{K} M^{(k)*} = 75$. The true rank of $Y^*$\nis given by $H^* = \\min(\\sum_{k=1}^{K} H^{(k)*}, L, M) = 4$. Note that $H^*$ is different from the rank $J$ of the\nobserved matrix $Y$, which is almost surely equal to $\\min(L, M) = 10$ under the Gaussian noise.\n\nFigure 1: Results on the \u2018artificial small\u2019 dataset ($L = 10$, $M = 75$, $H^* = 4$), with panels (a) Free energy, (b) Computation time, and (c) Estimated rank. 
The clustering errors\nwere 1.3% for EGVBS, AGVBS, and the SVB iteration, and 2.4% for the MVGA iteration.\n\nFigure 2: Results on the \u2018artificial large\u2019 dataset ($L = 50$, $M = 225$, $H^* = 5$), with panels (a) Free energy, (b) Computation time, and (c) Estimated rank. The clustering errors\nwere 4.0% for AGVBS and 11.2% for the MVGA iteration.\n\nFigure 3: Results on the \u20181R2RC\u2019 sequence ($L = 59$, $M = 459$) of the Hopkins 155 motion\ndatabase, with panels (a) Free energy, (b) Computation time, and (c) Estimated rank. The clustering errors are shown in Figure 4.\n\nFigure 1 shows the free energy, the computation time, and the estimated rank of $\\hat{X} = \\hat{B}' \\hat{A}'^\\top$\nover iterations. For the iterative methods, we show the results of 10 trials starting from different\nrandom initializations. We can see that AGVBS gives almost the same free energy as the exact\nmethods (EGVBS and the SVB iteration). The exact methods require a large computation cost:\nEGVBS took 621 sec to obtain the global solution, and the SVB iteration took $\\sim 100$ sec to achieve\nalmost the same free energy. The approximate methods are much faster: AGVBS took less than 1\nsec, and the MVGA iteration took $\\sim 10$ sec. Since the MVGA iteration had not converged after\n250 iterations, we continued the MVGA iteration until 2500 iterations, and found that the MVGA\niteration sometimes converges to a local solution with significantly higher free energy than the other\nmethods. EGVBS, AGVBS, and the SVB iteration successfully found the true rank $H^* = 4$, while\nthe MVGA iteration sometimes failed. 
This difference is actually reflected in the clustering error,\ni.e., the misclassification rate with all possible cluster correspondences taken into account, after\nspectral clustering [19] is performed: 1.3% for EGVBS, AGVBS, and the SVB iteration, and 2.4%\nfor the MVGA iteration.\nNext we conducted the same experiment with a larger artificial dataset (\u2018artificial large\u2019)\n($L = 50$, $K = 4$, $(M^{(1)*}, \\ldots, M^{(K)*}) = (100, 50, 50, 25)$, $(H^{(1)*}, \\ldots, H^{(K)*}) = (2, 1, 1, 1)$), on which\nEGVBS and the SVB iteration are computationally intractable. Figure 2 shows results with AGVBS\nand the MVGA iteration. An advantage in computation time is clear: AGVBS took $\\sim 0.1$ sec, while\nthe MVGA iteration took more than 100 sec. The clustering errors were 4.0% for AGVBS and\n11.2% for the MVGA iteration.\nFinally, we applied AGVBS and the MVGA iteration to the Hopkins 155 motion database [21].\nIn this dataset, each sample corresponds to a trajectory of a point in a video, and clustering the\ntrajectories amounts to finding a set of rigid bodies. 
Figure 3 shows the results on the \u20181R2RC\u2019\n($L = 59$, $M = 459$) sequence.3 We see that AGVBS gave a lower free energy with much less\ncomputation time than the MVGA iteration. Figure 4 shows the clustering errors over the first 20\nsequences. We find that AGVBS generally outperforms the MVGA iteration. Figure 4 also shows\nthe results with MAP estimation (2) with the tuning parameter $\\lambda$ optimized over the 20 sequences\n(we performed MAP with different values for $\\lambda$, and selected the one giving the lowest average\nclustering error). We see that AGVBS performs comparably to MAP with optimized $\\lambda$, which\nimplies that VB estimates the hyperparameters and the noise variance reasonably well.\n\nFigure 4: Clustering errors on the first 20 sequences of Hopkins 155 dataset.\n\n5 Conclusion\n\nIn this paper, we proposed a global variational Bayesian (VB) solver for low-rank subspace\nclustering (LRSC), and its approximate variant. 
The key property that enabled us to obtain a global\nsolver is that we can signi\ufb01cantly reduce the degrees of freedom of the VB-LRSC model, and the\noptimization problem is separable.\nOur exact global VB solver (EGVBS) provides the global solution of a non-convex minimization\nproblem by using the homotopy method, which solves the stationary condition written as a poly-\nnomial system. On the other hand, our approximate global VB solver (AGVBS) \ufb01nds the roots of\na polynomial equation with a single unknown variable, and provides the global solution of an ap-\nproximate problem. We experimentally showed advantages of AGVBS over the previous scalable\nmethod, called the matrix-variate Gaussian approximate (MVGA) iteration, in terms of accuracy and\ncomputational ef\ufb01ciency. In AGVBS, SVD dominates the computation time. Accordingly, applying\nadditional tricks, e.g., parallel computation and approximation based on random projection, to the\nSVD calculation would be a vital option for further computational ef\ufb01ciency.\nLRSC can be equipped with an outlier term, which enhances robustness [7, 8, 2]. With the outlier\nterm, much better clustering error on Hopkins 155 dataset was reported [2]. Our future work is to\nextend our approach to such robust variants. Theorem 2 enables us to construct the mean update\n(MU) algorithm [16], which \ufb01nds the global solution with respect to a large number of unknown\nvariables in each step. We expect that the MU algorithm tends to converge to a better solution than\nthe standard VB iteration, as in robust PCA and its extensions. EGVBS and AGVBS cannot be\napplied to the applications where Y has missing entries. 
Even in such cases, Theorem 2 might be\nused to derive a better algorithm, just as the VB global solution of fully-observed matrix factorization\n(MF) was used as a subroutine for partially-observed MF [18].\nIn many probabilistic models, Bayesian learning is intractable, and its VB approximation\nhas to rely on a local search algorithm. Exceptions are fully-observed MF, for which an analytic\nform of the global solution has been derived [17], and LRSC, for which this paper provided global\nVB solvers. As in EGVBS, the homotopy method can solve a stationary condition if it can be written\nas a polynomial system. We expect that such a tool will extend the attainability of global solutions\nof non-convex problems, which machine learners often face.\n\nAcknowledgments\n\nThe authors thank the reviewers for helpful comments. SN, MS, and IT acknowledge support from\nMEXT KAKENHI 23120004, the FIRST program, and MEXT KAKENHI 23700165, respectively.\n\n3 Peaks in free energy curves are due to pruning, which is necessary for the gradient-based MVGA iteration.\nThe free energy can jump just after pruning, but immediately gets lower than the value before pruning.\n\n[Figure 4 plot: clustering error (axis ticks 0 to 0.6) on the sequences 1R2RC, 1R2RCR, 1R2RCR_g12, 1R2RCR_g13, 1R2RCR_g23, 1R2RCT_A, 1R2RCT_A_g12, 1R2RCT_A_g13, 1R2RCT_A_g23, 1R2RCT_B, 1R2RCT_B_g12, 1R2RCT_B_g13, 1R2RCT_B_g23, 1R2RC_g12, 1R2RC_g13, 1R2RC_g23, 1R2TCR, 1R2TCRT, 1R2TCRT_g12, and 1R2TCRT_g13, comparing MAP (with optimized lambda), AGVBS, and the MVGA iteration.]\n\nReferences\n[1] H. Attias. Inferring parameters and structure of latent variable models by variational Bayes. In Proc. of UAI, pages 21\u201330, 1999.\n[2] S. D. Babacan, S. Nakajima, and M. N. Do. Probabilistic low-rank subspace clustering. In Advances in Neural Information Processing Systems 25, pages 2753\u20132761, 2012.\n[3] J. Besag. On the statistical analysis of dirty pictures. Journal of the Royal Statistical Society B, 48:259\u2013302, 1986.\n[4] C. M. Bishop. Variational principal components. In Proc. 
of International Conference on Artificial Neural Networks, volume 1, pages 509\u2013514, 1999.\n[5] C. M. Bishop. Pattern Recognition and Machine Learning. Springer, New York, NY, USA, 2006.\n[6] F. J. Drexler. A homotopy method for the calculation of all zeros of zero-dimensional polynomial ideals. In H. J. Wacker, editor, Continuation methods, pages 69\u201393, New York, 1978. Academic Press.\n[7] E. Elhamifar and R. Vidal. Sparse subspace clustering. In Proc. of CVPR, pages 2790\u20132797, 2009.\n[8] P. Favaro, R. Vidal, and A. Ravichandran. A closed form solution to robust subspace estimation and clustering. In Proceedings of CVPR, pages 1801\u20131807, 2011.\n[9] C. B. Garcia and W. I. Zangwill. Determining all solutions to certain systems of nonlinear equations. Mathematics of Operations Research, 4:1\u201314, 1979.\n[10] T. Gunji, S. Kim, M. Kojima, A. Takeda, K. Fujisawa, and T. Mizutani. PHoM\u2014a polyhedral homotopy continuation method. Computing, 73:57\u201377, 2004.\n[11] A. K. Gupta and D. K. Nagar. Matrix Variate Distributions. Chapman and Hall/CRC, 1999.\n[12] T. L. Lee, T. Y. Li, and C. H. Tsai. HOM4PS-2.0: a software package for solving polynomial systems by the polyhedral homotopy continuation method. Computing, 83:109\u2013133, 2008.\n[13] G. Liu, Z. Lin, and Y. Yu. Robust subspace segmentation by low-rank representation. In Proc. of ICML, pages 663\u2013670, 2010.\n[14] G. Liu, H. Xu, and S. Yan. Exact subspace segmentation and outlier detection by low-rank representation. In Proc. of AISTATS, 2012.\n[15] G. Liu and S. Yan. Latent low-rank representation for subspace segmentation and feature extraction. In Proc. of ICCV, 2011.\n[16] S. Nakajima, M. Sugiyama, and S. D. Babacan. Variational Bayesian sparse additive matrix factorization. Machine Learning, 92:319\u2013347, 2013.\n[17] S. Nakajima, M. Sugiyama, S. D. Babacan, and R. Tomioka. 
Global analytic solution of fully-observed variational Bayesian matrix factorization. Journal of Machine Learning Research, 14:1\u201337, 2013.\n[18] M. Seeger and G. Bouchard. Fast variational Bayesian inference for non-conjugate matrix factorization models. In Proceedings of International Conference on Artificial Intelligence and Statistics, La Palma, Spain, 2012.\n[19] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Machine Intell., 22(8):888\u2013905, 2000.\n[20] M. Soltanolkotabi and E. J. Cand\u00e8s. A geometric analysis of subspace clustering with outliers. CoRR, 2011.\n[21] R. Tron and R. Vidal. A benchmark for the comparison of 3-D motion segmentation algorithms. In Proc. of CVPR, 2007.\n", "award": [], "sourceid": 720, "authors": [{"given_name": "Shinichi", "family_name": "Nakajima", "institution": "Nikon"}, {"given_name": "Akiko", "family_name": "Takeda", "institution": "University of Tokyo"}, {"given_name": "S. Derin", "family_name": "Babacan", "institution": "Google Research"}, {"given_name": "Masashi", "family_name": "Sugiyama", "institution": "Tokyo Institute of Technology"}, {"given_name": "Ichiro", "family_name": "Takeuchi", "institution": "Nagoya Institute of Technology"}]}