{"title": "Divide-and-Conquer Learning by Anchoring a Conical Hull", "book": "Advances in Neural Information Processing Systems", "page_first": 1242, "page_last": 1250, "abstract": "We reduce a broad class of machine learning problems, usually addressed by EM or sampling, to the problem of finding the $k$ extremal rays spanning the conical hull of a data point set. These $k$ ``anchors'' lead to a global solution and a more interpretable model that can even outperform EM and sampling on generalization error. To find the $k$ anchors, we propose a novel divide-and-conquer learning scheme ``DCA'' that distributes the problem to $\mathcal O(k\log k)$ same-type sub-problems on different low-D random hyperplanes, each of which can be solved by any solver. For the 2D sub-problem, we present a non-iterative solver that only needs to compute an array of cosine values and its max/min entries. DCA also provides a faster subroutine for other methods to check whether a point is covered in a conical hull, which improves algorithm design in multiple dimensions and brings significant speedup to learning. We apply our method to GMM, HMM, LDA, NMF and subspace clustering, then show its competitive performance and scalability over other methods on rich datasets.", "full_text": "Divide-and-Conquer Learning by Anchoring a Conical Hull\n\nTianyi Zhou\u2020, Jeff Bilmes\u2021, Carlos Guestrin\u2020\n\u2020Computer Science & Engineering, \u2021Electrical Engineering, University of Washington, Seattle\n{tianyizh, bilmes, guestrin}@u.washington.edu\n\nAbstract\n\nWe reduce a broad class of fundamental machine learning problems, usually addressed by EM or sampling, to the problem of finding the k extreme rays spanning the conical hull of a data point set. These k \u201canchors\u201d lead to a global solution and a more interpretable model that can even outperform EM and sampling on generalization error. 
To find the k anchors, we propose a novel divide-and-conquer learning scheme "DCA" that distributes the problem to O(k log k) same-type sub-problems on different low-D random hyperplanes, each of which can be solved independently by any existing solver. For the 2D sub-problem, we instead present a non-iterative solver that only needs to compute an array of cosine values and its max/min entries. DCA also provides a faster subroutine inside other algorithms to check whether a point is covered in a conical hull, and thus improves these algorithms by providing significant speedups. We apply our method to GMM, HMM, LDA, NMF and subspace clustering, then show its competitive performance and scalability over other methods on large datasets.

1 Introduction

Expectation-maximization (EM) [10], sampling methods [13], and matrix factorization [20, 25] are three methods commonly used to produce maximum likelihood (or maximum a posteriori (MAP)) estimates of models with latent variables/factors, and thus are used in a wide range of applications such as clustering, topic modeling, collaborative filtering, structured prediction, feature engineering, and time series analysis. However, their learning procedures rely on alternating optimization/updates between parameters and latent variables, a process that suffers from local optima. Hence, their quality greatly depends on initialization and on using a large number of iterations for proper convergence [24].

The method of moments [22, 6, 17], by contrast, solves m equations relating the first m moments of an observation x ∈ R^p to the m model parameters, and thus yields a consistent estimator with a global solution. In practice, however, sample moments usually suffer from unbearably large variance, which easily leads to failure of the final estimation, especially when m or p is large. 
Although recent spectral methods [8, 18, 15, 1] reduce m to 2 or 3 when estimating O(p) ≫ m parameters [2], by relating the eigenspace of lower-order moments to the parameters in matrix form up to column scale, the variance of sample moments is still sensitive to large p or data noise, which may result in poor estimation. Moreover, although spectral methods using SVDs or tensor decomposition evidently simplify learning, the computation can still be expensive for big data. In addition, recovering a parameter matrix with uncertain column scale might not be feasible for some applications.

In this paper, we reduce the learning in a rich class of models (e.g., matrix factorization and latent variable models) to finding the extreme rays of a conical hull from a finite set of real data points. This is obtained by applying a general separability assumption to either the data matrix in matrix factorization or the 2nd/3rd order moments in latent variable models. Separability posits that a ground set of n points, as rows of matrix X, can be represented by X = F X_A, where the rows (bases) in X_A are a subset A ⊂ V = [n] of the rows in X; these rows are called "anchors" and are interesting to various models when |A| = k ≪ n.

Figure 1: Geometry of the general minimum conical hull problem and the basic idea of divide-and-conquer anchoring (DCA).

This property was introduced in [11] to establish the uniqueness of non-negative matrix factorization (NMF) under simplex constraints, and was later [19, 14] extended to non-negative constraints. We generalize it further to the model X = F Y_A for two (possibly distinct) finite sets of points X and Y, and build a new theory for the identifiability of A. This generalization enables us to apply it to more general models (ref. Table 1) besides NMF. 
More interestingly, it leads to a learning method with much higher tolerance to the variance of sample moments or data noise, a unique global solution, and a more interpretable model.

Another primary contribution of this paper is a distributed learning scheme, "divide-and-conquer anchoring" (DCA), for finding an anchor set A such that X = F Y_A by solving same-type sub-problems on only O(k log k) randomly drawn low-dimensional (low-D) hyperplanes. Each sub-problem is of the form (XΦ) = F · (Y Φ)_A with random projection matrix Φ, and can easily be handled by most solvers due to the low dimension. This is based on the observation that the geometry of the original conical hull is partially preserved after a random projection. We analyze the probability of success for each sub-problem to recover part of A, and then study the number of sub-problems needed to recover the whole A with high probability (w.h.p.). In particular, we propose a very fast non-iterative solver for sub-problems on the 2D plane, which requires computing only an array of cosines and its max/min values, and thus results in learning algorithms with speedups of tens to hundreds of times. DCA improves multiple aspects of algorithm design since: 1) its idea of divide-and-conquer randomization gives rise to distributed learning that can reduce the original problem to multiple extremely low-D sub-problems that are much easier and faster to solve; and 2) it provides a fast subroutine, checking if a point is covered by a conical hull, which can be embedded into other solvers.

We apply both the conical hull anchoring model and DCA to five learning models: Gaussian mixture models (GMM) [27], hidden Markov models (HMM) [5], latent Dirichlet allocation (LDA) [7], NMF [20], and subspace clustering (SC) [12]. The resulting models and algorithms show significant improvement in efficiency. 
On generalization performance, they consistently outperform spectral methods and matrix factorization, and are comparable to or even better than EM and sampling.

In the following, we first generalize the separability assumption and the minimum conical hull problem arising from NMF in §2, and then show how to reduce more general learning models to a (general) minimum conical hull problem in §3. §4 presents a divide-and-conquer learning scheme that can quickly locate the anchors of the conical hull by solving the same problem in multiple extremely low-D spaces. Comprehensive experiments and comparisons can be found in §5.

2 General Separability Assumption and Minimum Conical Hull Problem

The original separability property [11] is defined on the convex hull of a set of data points, namely that each point can be represented as a convex combination of certain subsets of the vertices that define the convex hull. Later works on separable NMF [19, 14] extended it to the conical hull case, replacing convex combinations with conical combinations. Given the definition of (convex) cone and conical hull, the separability assumption can be defined both geometrically and algebraically.

Definition 1 (Cone & conical hull). A (convex) cone is a non-empty convex set that is closed with respect to conical combinations of its elements. 
In particular, cone(R) can be defined by its k generators (or rays) R = {r_i}_{i=1}^k such that

cone(R) = { Σ_{i=1}^k α_i r_i | r_i ∈ R, α_i ∈ R_+ ∀i }.    (1)

See [29] for the original separability assumption and the equivalence between separable NMF and the minimum conical hull problem, which is defined as a submodular set cover problem.

2.1 General Separability Assumption and General Minimum Conical Hull Problem

By generalizing the separability assumption, we obtain a general minimum conical hull problem that can reduce more general learning models besides NMF, e.g., latent variable models and matrix factorization, to finding a set of "anchors" on the extreme rays of a conical hull.

Definition 2 (General separability assumption). All the n data points (rows) in X are covered in a finitely generated and pointed cone (i.e., if x ∈ cone(Y_A) then −x ∉ cone(Y_A)) whose generators form a subset A ⊆ [m] of the data points in Y such that ∄ i ≠ j with Y_{A_i} = a · Y_{A_j}. Geometrically, it says

∀i ∈ [n], X_i ∈ cone(Y_A), Y_A = {y_i}_{i∈A}.    (2)

An equivalent algebraic form is X = F Y_A, where |A| = k and F′ ∈ S ⊆ R_+^{(n−k)×k}. When X = Y and S = R_+^{(n−k)×k}, it degenerates to the original separability assumption given in [29]. We generalize the minimum conical hull problem from [29]. Under the general separability assumption, it aims to find the anchor set A from the points in Y rather than X.

Definition 3 (General Minimum Conical Hull Problem). Given a finite set of points X and a set Y having an index set V = [m] of its rows, the general minimum conical hull problem finds the subset of rows in Y that define a super-cone for all the rows in X. 
That is, find A ∈ 2^V that solves

min_{A⊂V} |A|,  s.t.  cone(Y_A) ⊇ cone(X),    (3)

where cone(Y_A) is the cone induced by the rows A of Y.

When X = Y, this also degenerates to the original minimum conical hull problem defined in [29]. A critical question is whether/when the solution A is unique. When X = Y and X = F X_A, by following the analysis of the separability assumption in [29], we can prove that A is unique and identifiable given X. However, when X ≠ Y and X = F Y_A, it is clear that there could be multiple legal choices of A (e.g., there could be multiple layers of conical hulls containing a conical hull covering all points in X). Fortunately, when the rows of Y are rank-one matrices after vectorization (concatenating all columns into one long vector), which is the common case in most latent variable models in §3.2, A can be uniquely determined once X has at least two (non-identical) rows.

Lemma 1 (Identifiability). If X = F Y_A with the additional structure Y_s = vec(O_i^s ⊗ O_j^s), where O_i is a p_i × k matrix and O_i^s is its s-th column, then under the general separability assumption in Definition 2, two (non-identical) rows in X are sufficient to exactly recover the unique A, O_i and O_j.

See [29] for the proof and additional uniqueness conditions when applied to latent variable models.

3 Minimum Conical Hull Problem for General Learning Models

Table 1: Summary of reducing NMF, SC, GMM, HMM and LDA to a conical hull anchoring model X = F Y_A in §3, and their learning algorithms achieved by A = DCA(X, Y, k, M) in Algorithm 1. The minimal conical hull A = MCH(X, Y) is defined in Definition 4. vec(·) denotes the vectorization of a matrix. For GMM and HMM, X_i ∈ R^{n×p_i} is the data matrix for view i (i.e., a subset of features) and the i-th observation of all triples of sequential observations, respectively. 
X_{t,i} is the t-th row of X_i and is associated with point/triple t. η_t is a vector uniformly drawn from the unit sphere. More details are given in [29].

Model | X in conical hull problem | Y in conical hull problem | k in conical hull problem | Interpretation of anchors indexed by A
NMF | data matrix X ∈ R_+^{n×p} | Y := X | # of factors | basis X_A are real data points
SC | data matrix X ∈ R^{n×p} | Y := X | # of bases from all clusters | cluster i is a cone cone(X_{A_i})
GMM | [vec[X_1^T X_2]; [vec[X_1^T Diag(X_3 η_t) X_2]]_{t∈[q]}]/n | [vec(X_{t,1} ⊗ X_{t,2})]_{t∈[n]} | # of components/clusters | centers [X_{A,i}]_{i∈[3]} from real data
HMM | [vec[X_2^T X_3]; [vec[X_2^T Diag(X_1 η_t) X_3]]_{t∈[q]}]/n | [vec(X_{t,2} ⊗ X_{t,3})]_{t∈[n]} | # of hidden states | emission matrix O = X_{A,2}
LDA | word-word co-occurrence matrix X ∈ R_+^{p×p} | Y := X | # of topics | anchor word for topic i (topic prob. F_i)

Algo | Each sub-problem in DCA | Post-processing after A := ∪_i Ã_i
NMF | Ã = MCH(XΦ, XΦ), solvable by (10) | solving F in X = F X_A
SC | Ã = anchors of clusters achieved by meanshift(∠(XΦ)φ) | clustering anchors X_A
GMM | Ã = MCH(XΦ, Y Φ), solvable by (10) | N/A
HMM | Ã = MCH(XΦ, Y Φ), solvable by (10) | solving T in OT = X_{A,3}
LDA | Ã = MCH(XΦ, XΦ), solvable by (10) | col-normalize {F : X = F X_A}

In this section, we discuss how to reduce the learning of general models such as matrix factorization and latent variable models to the (general) minimum conical hull problem. Five examples are given in Table 1 to show how this general technique can be applied to specific models.

3.1 Matrix Factorization

Besides NMF, we consider more general matrix factorization (MF) models that can operate on negative features and specify a complicated structure of F. 
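Once the anchor set A is found, the post-processing step "solving F in X = F X_A" from Table 1 is a row-wise nonnegative least squares problem. The following is a minimal sketch, assuming NumPy/SciPy; the function name `recover_F` is ours, not from the paper.

```python
import numpy as np
from scipy.optimize import nnls


def recover_F(X, A):
    """Given anchor row-indices A, solve each row of F >= 0 in X ~ F X_A."""
    B = X[A]                                  # k x p anchor basis (real data rows)
    F = np.zeros((X.shape[0], len(A)))
    for i, x in enumerate(X):
        F[i], _ = nnls(B.T, x)                # min ||B^T f - x||_2 subject to f >= 0
    return F
```

Each row is independent, so the loop parallelizes trivially; when the anchors are linearly independent and the data are exactly separable, the recovered F matches the generating weights.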
The MF X = F W is a deterministic latent variable model where F and W are deterministic latent factors. By assigning a likelihood p(X_{i,j} | F_i, (W^T)_j) and priors p(F) and p(W), its optimization model can be derived from maximum likelihood or MAP estimation. The resulting objective is usually a loss function ℓ(·) of X − F W plus regularization terms for F and W, i.e., min ℓ(X, F W) + R_F(F) + R_W(W).

Similar to separable NMF, minimizing the objective of general MF can be reduced to a minimum conical hull problem that selects the subset A with X = F X_A. In this setting, R_W(W) = Σ_{i=1}^k g(W_i), where g(w) = 0 if w = X_i for some i and g(w) = ∞ otherwise. This is equivalent to applying to each row of W a prior p(W_i) whose finite support is the set of rows of X. In addition, the regularization of F can be transformed into geometric constraints between points in X and in X_A. Since F_{i,j} is the conical combination weight of X_{A_j} in recovering X_i, a large F_{i,j} intuitively indicates a small angle between X_{A_j} and X_i, and vice versa. For example, the sparse and graph Laplacian priors on the rows of F in subspace clustering can be reduced to "cone clustering" for finding A. See [29] for an example of reducing subspace clustering to the general minimum conical hull problem.

3.2 Latent Variable Model

Different from deterministic MF, we build a system of equations from the moments of probabilistic latent variable models, and then formulate it as a general minimum conical hull problem, rather than solving it directly. Let the generative model be h ∼ p(h; α) and x ∼ p(x | h; θ), where h is a latent variable, x stands for an observation, and {α, θ} are parameters. In a variety of graphical models such as GMMs and HMMs, we need to model conditional independence between groups of features. This is also known as the multi-view assumption. 
W.l.o.g., we assume that x is composed of three groups (views) of features {x_i}_{i∈[3]} such that ∀i ≠ j, x_i ⊥⊥ x_j | h. We further assume the dimension k of h is smaller than p_i, the dimension of x_i. Since the goal is learning {α, θ}, decomposing the moments of x rather than the data matrix X helps us get rid of the latent variable h and thus avoid alternating minimization between {α, θ} and h. When E(x_i | h) = h^T O_i^T (linearity assumption), the second and third order moments can be written in the form of matrix operators:

E(x_i ⊗ x_j) = E[E(x_i|h) ⊗ E(x_j|h)] = O_i E(h ⊗ h) O_j^T,
E(x_i ⊗ x_j · ⟨η, x_l⟩) = O_i [E(h ⊗ h ⊗ h) ×_3 (O_l η)] O_j^T,    (4)

where A ×_n U denotes the n-mode product of a tensor A by a matrix U, ⊗ is the outer product, and the operator parameter η can be any vector. We will mainly focus on the models in which {α, θ} can be exactly recovered from the conditional mean vectors {O_i}_{i∈[3]} and E(h ⊗ h)¹, because they cover most popular models such as GMMs and HMMs in real applications.

The left hand sides (LHS) of both equations in (4) can be directly estimated from training data, while their right hand sides (RHS) can be written in a unified matrix form O_i D O_j^T with O_i ∈ R^{p_i×k} and D ∈ R^{k×k}. By using different η, we can obtain 2 ≤ q ≤ p_l + 1 independent equations, which compose a system of equations for O_i and O_j. Given the LHS, we can obtain the column spaces of O_i and O_j, which respectively equal the column and row space of O_i D O_j^T, a low-rank matrix when p_i > k. In order to further determine O_i and O_j, our discussion splits into two cases according to the type of D.

When D is a diagonal matrix. This happens when ∀i ≠ j, E(h_i h_j) = 0. A common example is that h is a label/state indicator such that h = e_i for class/state i, e.g., h in GMM and HMM. 
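The left hand sides of (4) have direct empirical estimates from the views' data matrices: averages of per-sample outer products, optionally weighted by the contraction ⟨η, x_l⟩. A minimal NumPy sketch (function names are ours):

```python
import numpy as np


def pairwise_moment(Xi, Xj):
    """Empirical E(x_i (outer) x_j): average outer product over the n samples."""
    return Xi.T @ Xj / Xi.shape[0]


def contracted_triple_moment(Xi, Xj, Xl, eta):
    """Empirical E(x_i (outer) x_j * <eta, x_l>) = Xi^T Diag(Xl eta) Xj / n."""
    w = Xl @ eta                            # <eta, x_l> for every sample
    return Xi.T @ (Xj * w[:, None]) / Xi.shape[0]
```

Both estimates cost one matrix product, so no p_i × p_j × p_l tensor is ever materialized; the η-contraction is applied per sample before the outer products are averaged.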
In this case, the two D matrices in the RHS of (4) are

E(h ⊗ h) = Diag(→E(h_i²)),
E(h ⊗ h ⊗ h) ×_3 (O_l η) = Diag(→E(h_i³) · O_l η),    (5)

where →E(h_i^t) denotes the vector [E(h_1^t), ..., E(h_k^t)] and the product with O_l η is entrywise. So either matrix on the LHS of (4) can be written as a sum of k rank-one matrices, i.e., Σ_{s=1}^k σ(s) O_i^s ⊗ O_j^s, where O_i^s is the s-th column of O_i.

The general separability assumption posits that the set of k rank-one basis matrices constructing the RHS of (4) is a unique subset A ⊆ [n] of the n samples of x_i ⊗ x_j constructing the left hand sides, i.e., O_i^s ⊗ O_j^s = [x_i ⊗ x_j]_{A_s} = X_{A_s,i} ⊗ X_{A_s,j}, the outer product of x_i and x_j in the (A_s)-th data point.

Therefore, by applying q − 1 different η to (4), we obtain a system of q equations of the following form, where Y^(t) is the estimate of the LHS of the t-th equation from training data:

∀t ∈ [q], Y^(t) = Σ_{s=1}^k σ_{t,s} [x_i ⊗ x_j]_{A_s}  ⇔  [vec(Y^(t))]_{t∈[q]} = σ [vec(X_{t,i} ⊗ X_{t,j})]_{t∈A}.    (6)

The right equation in (6) is an equivalent matrix representation of the left one. Its LHS is a q × p_i p_j matrix, and its RHS is the product of a q × k matrix σ and a k × p_i p_j matrix. 

¹Note our method can also handle more complex models that violate the linearity assumption and need higher order moments for parameter estimation. By replacing x_i in (4) with vec(x_i^⊗n), the vectorization of the n-th tensor power of x_i, O_i can contain n-th order moments for p(x_i | h; θ). However, since higher order moments are either not necessary or difficult to estimate due to high sample complexity, we will not study them in this paper.
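Concretely, the two point sets of the conical hull problem in (6) can be assembled from three views as in the GMM row of Table 1. A sketch assuming NumPy (names are ours); the q moment estimates become rows of one matrix, and the n per-sample outer products become rows of the other:

```python
import numpy as np


def build_system(X1, X2, X3, q, rng=None):
    """Form the two point sets of (6): rows of Xmat are vec'd empirical
    moments Y^(t); rows of Ymat are vec'd per-sample outer products
    x_{t,1} (outer) x_{t,2} (Table 1, GMM row)."""
    rng = np.random.default_rng(rng)
    n = X1.shape[0]
    lhs = [(X1.T @ X2 / n).ravel()]                    # pairwise moment, t = 1
    for _ in range(q - 1):
        eta = rng.standard_normal(X3.shape[1])
        eta /= np.linalg.norm(eta)                     # uniform direction on the sphere
        w = X3 @ eta
        lhs.append((X1.T @ (X2 * w[:, None]) / n).ravel())
    Xmat = np.stack(lhs)                               # q x (p1*p2)
    Ymat = np.stack([np.outer(X1[t], X2[t]).ravel() for t in range(n)])  # n x (p1*p2)
    return Xmat, Ymat
```

By construction the first row of Xmat is exactly the average of the rows of Ymat, i.e., every moment estimate lies in the conical hull of the per-sample outer products.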
By letting X ← [vec(Y^(t))]_{t∈[q]}, F ← σ and Y ← [vec(X_{t,i} ⊗ X_{t,j})]_{t∈[n]}, we can fit (6) to X = F Y_A in Definition 2. Therefore, learning {O_i}_{i∈[3]} is reduced to selecting k rank-one matrices from {X_{t,i} ⊗ X_{t,j}}_{t∈[n]}, indexed by A, whose conical hull covers the q matrices {Y^(t)}_{t∈[q]}. Given the anchor set A, we have Ô_i = X_{A,i} and Ô_j = X_{A,j} by assigning the real data points indexed by A to the columns of O_i and O_j. Given O_i and O_j, σ can be estimated by solving (6). In many models, a few rows of σ are sufficient to recover α. See [29] for a practical acceleration trick based on matrix completion.

When D is a symmetric matrix with nonzero off-diagonal entries. This happens in "admixture" models: e.g., h can be a general binary vector h ∈ {0, 1}^k or a vector on the probability simplex, and the conditional mean E(x_i | h) is a mixture of columns in O_i. The best known example is LDA, in which each document is generated by multiple topics.

We apply the general separability assumption by using only the first equation in (4), and treating the matrix on its LHS as X in X = F X_A. When the data are extremely sparse, which is common for text, selecting rows of the denser second order moment as bases is a more reasonable and effective assumption than selecting sparse data points. In this case, the p rows of F contain the k unit vectors {e_i}_{i∈[k]}. This leads to the natural "anchor word" assumption for LDA [3].

See [29] for examples of reducing the multi-view mixture model, HMM, and LDA to the general minimum conical hull problem. 
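The step "σ can be estimated by solving (6)" above amounts, for a given anchor set A, to an ordinary least squares solve against the anchored rank-one basis. A minimal sketch, assuming NumPy (the helper name is ours):

```python
import numpy as np


def recover_sigma(Xmat, Ymat, A):
    """Solve sigma in Xmat = sigma @ Ymat[A] (eq. (6)) by least squares,
    where Xmat stacks the vec'd moment estimates and Ymat[A] stacks the
    vec'd anchor outer products."""
    B = Ymat[list(A)]                                   # k x (p_i p_j) anchor basis
    sigma_T, *_ = np.linalg.lstsq(B.T, Xmat.T, rcond=None)
    return sigma_T.T                                    # q x k
```

When the k anchor rows are linearly independent (the generic case here, since they are distinct rank-one matrices), the least squares solution is unique and recovers σ exactly on noiseless data.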
It is also worth noting that, when applied to LDA, our method yields the same results as, but runs faster than, the Bayesian inference method of [3]; see Theorem 4 in [29].

4 Algorithms for the Minimum Conical Hull Problem

4.1 Divide-and-Conquer Anchoring (DCA) for General Minimum Conical Hull Problems

The key insights of DCA come from two observations on the geometry of a convex cone. First, projecting a conical hull to a lower-D hyperplane partially preserves its geometry. This enables us to distribute the original problem over a few much smaller sub-problems, each handled by a solver for the minimum conical hull problem. Secondly, there exists a very fast anchoring algorithm for a sub-problem on the 2D plane, which picks two anchor points based only on their angles to an axis, without iterative optimization or greedy pursuit. This results in a significantly efficient DCA algorithm that can be used on its own, or embedded as a subroutine checking whether a point is covered in a conical hull.

4.2 Distributing the Conical Hull Problem to Sub-problems in Low Dimensions

Due to the convexity of cones, a low-D projection of a conical hull is still a conical hull: it covers the projections of the same points covered in the original conical hull, and is generated by the projections of a subset of the anchors on the extreme rays of the original conical hull.

Lemma 2. For an arbitrary point x ∈ cone(Y_A) ⊂ R^p, where A is the index set of the k anchors (generators) selected from Y, and for any Φ ∈ R^{p×d} with d ≤ p, we have

∃ Ã ⊆ A : xΦ ∈ cone(Y_Ã Φ).    (7)

Since only a subset of A remains as anchors after projection, solving a minimum conical hull problem on a single low-D hyperplane rarely returns all the anchors in A. However, the whole set A can be recovered from the anchors detected on multiple low-D hyperplanes. By sampling the projection matrix Φ from a random ensemble M, it can be proved that w.h.p. 
solving only s = O((3k/c) log k) sub-problems is sufficient to find all anchors in A. Note c/k is a lower bound on the angle α − 2β in Theorem 1, so a large c indicates a less flat conical hull. See [29] for our method's robustness to failures in identifying "flat" anchors.

For the special case of NMF when X = F X_A, the above result is proven in [28]. However, the analysis cannot be trivially extended to the general conical hull problem when X = F Y_A (see Figure 1). A critical reason is that the converse of Lemma 2 does not hold: the uniqueness of the anchor set Ã on a low-D hyperplane could be violated, because non-anchors in Y may have non-zero probability of being projected as low-D anchors. Fortunately, we can achieve a unique Ã by defining a "minimal conical hull" on a low-D hyperplane. Proposition 1 then reveals when, w.h.p., such an Ã is a subset of A.

Algorithm 1 DCA(X, Y, k, M)
  Input: Two sets of points (rows) X ∈ R^{n×p} and Y ∈ R^{m×p} in matrix form (ref. Table 1 for the X and Y of different models), number of latent factors/variables k, random matrix ensemble M;
  Output: Anchor set A ⊆ [m] such that ∀i ∈ [n], X_i ∈ cone(Y_A);
  Divide Step (in parallel):
  for t = 1 → s := O(k log k) do
    Randomly draw a matrix Φ ∈ R^{p×d} from M;
    Solve the sub-problem Ã_t = MCH(XΦ, Y Φ) by any solver, e.g., (10);
  end for
  Conquer Step:
  ∀i ∈ [m], compute ĝ(Y_i) = (1/s) Σ_{t=1}^s 1_{Ã_t}(Y_i);
  Return A as the index set of the k points with the largest ĝ(Y_i).

Definition 4 (Minimal conical hull). 
Given two sets of points (rows) X and Y, the conical hull spanned by anchors (generators) Y_A is the minimal conical hull covering all points in X iff

∀{i, j, s} ∈ { i, j, s | i ∈ A^C = [m] \ A, j ∈ A, s ∈ [n], X_s ∈ cone(Y_A) ∩ cone(Y_{i∪(A\j)}) }    (8)

we have ∠X_sY_i > ∠X_sY_j, where ∠xy denotes the angle between two vectors x and y. The solution of the minimal conical hull problem is denoted by A = MCH(X, Y).

It is easy to verify that the minimal conical hull is unique, and that the general minimum conical hull problem X = F Y_A under the general separability assumption (which leads to the identifiability of A) is a special case of A = MCH(X, Y). In DCA, on each low-D hyperplane H_i, the associated sub-problem aims to find the anchor set Ã_i = MCH(XΦ_i, Y Φ_i). The following proposition gives the probability of Ã_i ⊆ A in a sub-problem solution.

Figure 2: Proposition 1.

Proposition 1 (Probability of success in a sub-problem). As defined in Figure 2, A_i ∈ A signifies an anchor point in Y_A, C_i ∈ X signifies a point in X ∈ R^{n×p}, and B_i ∈ A^C signifies a non-anchor point in Y ∈ R^{m×p}; the green ellipse marks the intersection hyperplane between cone(Y_A) and the unit sphere S^{p−1}, and the superscript ′ denotes the projection of a point onto the intersection hyperplane. Define d-dim (d ≤ p) hyperplanes {H_i}_{i∈[4]} such that C′_1C′_2 ⊥ H_1, A′_1A′_2 ⊥ H_2, B′_1B′_2 ⊥ H_3 and A′_3A′_1 ⊥ H_4 (as shown in Figure 2); let α = ∠H_1H_2 be the angle between hyperplanes H_1 and H_2, and β = ∠H_3H_4 the angle between H_3 and H_4. If H, with associated projection matrix Φ ∈ R^{p×d}, is a d-dim hyperplane uniformly drawn from the Grassmannian manifold Gr(d, p), and Ã = MCH(XΦ, Y Φ) is the solution to the minimal conical hull problem, then

Pr(B_1 ∈ Ã) = β/(2π),  Pr(A_2 ∈ Ã) = (α − β)/(2π).    (9)

See [29] for the proof, discussion, and analysis of robustness to unimportant "flat" anchors and data noise.

Theorem 1 (Probability bound). Following the notation of Proposition 1, suppose p** = min_{A1,A2,A3,B1,C1}(α − 2β) ≥ c/k > 0. Then with probability at least 1 − k exp(−cs/(3k)), DCA successfully identifies all k anchors in A, where s is the number of sub-problems solved.

See [29] for the proof. Given Theorem 1, we immediately obtain the following corollary on the number of sub-problems that guarantees the success of DCA in finding A.

Corollary 1 (Number of sub-problems). With probability 1 − δ, DCA can correctly recover the anchor set A by solving Ω((3k/c) log(k/δ)) sub-problems.

See [29] for the idea of divide-and-conquer randomization in DCA, and its advantage over methods based on the Johnson-Lindenstrauss (JL) Lemma.

4.3 Anchoring on the 2D Plane

Although DCA can invoke any solver for the sub-problem on any low-D hyperplane, a very fast solver for the 2D sub-problem consistently shows high accuracy in locating anchors when embedded into DCA. Its motivation comes from the geometry of a conical hull on a 2D plane, which is a special case of a d-dim hyperplane H in the sub-problem of DCA. 
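As an illustration of how the divide and conquer steps fit together, here is a simplified sketch of DCA for the X = Y case (e.g., separable NMF) with a 2D extreme-angle sub-solver. It is not the paper's exact implementation: for simplicity we draw nonnegative projection directions so that, for nonnegative data, all projected points stay in the first quadrant and the min/max angles are well defined. Function names are ours; assumes NumPy.

```python
import numpy as np


def anchors_2d(P):
    """Extreme-angle points of a 2D point set in the first quadrant:
    the non-iterative sub-problem solver for the X = Y case."""
    ang = np.arctan2(P[:, 1], P[:, 0])
    return {int(np.argmin(ang)), int(np.argmax(ang))}


def dca_self(X, k, s=60, rng=None):
    """Divide-and-conquer anchoring for X = Y: vote over s random 2D
    projections (divide step), return the k most-voted rows (conquer step)."""
    rng = np.random.default_rng(rng)
    votes = np.zeros(X.shape[0])
    for _ in range(s):
        # nonnegative directions keep projections in the first quadrant
        # (a simplification of ours, not the paper's Gaussian ensemble)
        Phi = np.abs(rng.standard_normal((X.shape[1], 2)))
        for i in anchors_2d(X @ Phi):
            votes[i] += 1.0
    return set(np.argsort(-votes)[:k].tolist())
```

Because each projected sub-problem only ever elects true extreme rays, non-anchors collect no votes, and the conquer step is a simple top-k over the vote counts.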
It leads to a non-iterative algorithm for A = MCH(X, Y) on the 2D plane, which only requires computing n + m cosine values, finding the min/max of the n values, and comparing the remaining m values with that min/max.

According to Figure 1, the two anchors Y_Ã Φ on a 2D plane have the min/max angle (to either axis) among points in Y Φ that is larger/smaller than all angles of points in XΦ, respectively. This leads to the following closed form of Ã:

Ã = { argmin_{i∈[m]} ( ∠(Y_iΦ)φ − max_{j∈[n]} ∠(X_jΦ)φ )_+ ,  argmin_{i∈[m]} ( min_{j∈[n]} ∠(X_jΦ)φ − ∠(Y_iΦ)φ )_+ },    (10)

where (x)_+ = x if x ≥ 0 and ∞ otherwise, and φ can be either the vertical or the horizontal axis of the 2D plane. By plugging (10) into DCA as the solver for the s sub-problems on random 2D planes, we obtain an extremely fast learning algorithm.

Note that for the special case when X = Y, (10) degenerates to finding the two points in XΦ with the smallest and largest angles to an axis φ, i.e., Ã = { argmin_{i∈[n]} ∠(X_iΦ)φ, argmax_{i∈[n]} ∠(X_iΦ)φ }. This is used in matrix factorization and in the latent variable model with nonzero off-diagonal D.

See [29] for embedding DCA as a fast subroutine into other methods, and for detailed off-the-shelf DCA algorithms for NMF, SC, GMM, HMM and LDA. A brief summary is in Table 1.

5 Experiments

See [29] for a complete experimental section with results of DCA for NMF, SC, GMM, HMM, and LDA, and comparisons to other methods on more synthetic and real datasets.

Figure 3: Separable NMF on a randomly generated 300 × 500 matrix; each point on each curve is the average of 10 independent random trials. SFO is a greedy algorithm for the submodular set cover problem. LP-test is the backward removal algorithm from [4]. 
LEFT: accuracy of anchor detection (higher is better). MIDDLE: negative relative l2 recovery error of anchors (higher is better). RIGHT: CPU seconds.

Figure 4: Clustering accuracy (higher is better) and CPU seconds vs. number of clusters for the Gaussian mixture model on the CMU-PIE (left) and YALE (right) human face datasets. We randomly split the raw pixel features into 3 groups, each associated with one view in our multi-view model.

Figure 5: Likelihood (higher is better) and CPU seconds vs. number of states for an HMM modeling the stock prices of 2 companies from 01/01/1995 to 05/18/2014, collected from Yahoo Finance. Since no ground-truth labels are given, we measure likelihood on the training data.

DCA for Non-negative Matrix Factorization on Synthetic Data. The experimental comparison results are shown in Figure 3. The greedy algorithms SPA [14], XRAY [19] and SFO achieve the best accuracy and smallest recovery error when the noise level is above 0.2, but XRAY and SFO are the two slowest methods. SPA is slightly faster, but still much slower than DCA. DCA with different numbers of sub-problems is slightly less accurate than the greedy algorithms, but the difference is acceptable; considering its significant acceleration, DCA offers an advantageous trade-off. LP-test [4] comes with an exact-solution guarantee, but it is neither robust to noise nor fast. Therefore, DCA provides a much faster and more practical NMF algorithm with performance comparable to the best methods.

DCA for Gaussian Mixture Model on the CMU-PIE and YALE Face Datasets. The experimental comparison results are shown in Figure 4. DCA consistently outperforms the other methods (k-means, EM, and the spectral method [1]) on accuracy, and shows speedups in the range 20-2000. Increasing the number of sub-problems improves the accuracy of DCA. Note that the number of pixels in the face images always exceeds 1000, which makes the pairwise-distance computations required by the other clustering methods slow.
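The 2D sub-problem behind these results is simple enough to sketch in a few lines of NumPy. The sketch below illustrates the degenerate X = Y form of (10) with a vote-based conquer step; the function name dca_anchors, the voting aggregation, and the use of nonnegative random projections (so that nonnegative data stays in the first quadrant and angles never wrap around) are simplifying assumptions for illustration, not the authors' exact implementation:

```python
import numpy as np

def dca_anchors(X, k, s, seed=0):
    """Sketch of DCA for the X = Y case: each of the s sub-problems projects
    the rows of X onto a random 2D plane and keeps the two points with the
    smallest/largest angle to the horizontal axis (the degenerate form of
    (10)); the conquer step returns the k rows detected most often."""
    n, p = X.shape
    rng = np.random.default_rng(seed)
    votes = np.zeros(n, dtype=int)
    for _ in range(s):
        # Nonnegative random 2D hyperplane: a simplifying assumption that
        # keeps nonnegative data in the first quadrant (angles in [0, pi/2]).
        Phi = np.abs(rng.standard_normal((p, 2)))
        P = X @ Phi                             # projected points (n x 2)
        ang = np.arctan2(P[:, 1], P[:, 0])      # angle to the horizontal axis
        votes[np.argmin(ang)] += 1              # one extreme ray of the cone
        votes[np.argmax(ang)] += 1              # the other extreme ray
    return np.sort(np.argsort(votes)[::-1][:k])  # k most frequently detected

# Separable toy data: rows 0-2 are the anchors (scaled basis vectors),
# the remaining rows are conical combinations of them.
A = np.eye(3, 6)
W = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.5, 0.3],
              [0.4, 0.4, 0.2]])
X = np.vstack([A, W @ A])
print(dca_anchors(X, k=3, s=100))  # recovers the three anchor rows
```

On separable data like this, a strict conical combination of the anchors always projects to an angle strictly between the extreme anchor angles, so only anchor rows ever receive votes; the cost per sub-problem is one n x 2 projection and one pass over n angles, independent of the ambient dimension p.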
DCA exhibits the fastest speed because the number of sub-problems s = O(k log k) does not depend on the feature dimension, so merely 171 2D random projections suffice to obtain a promising clustering result. The spectral method performs worse than DCA due to the large variance of the sample moments; DCA uses the separability assumption when estimating the eigenspace of the moment, and thus effectively reduces this variance.

Table 2: Motion prediction accuracy (higher is better) on the test set for 6 motion-capture sequences from the CMU-mocap dataset. The motion in each frame was manually labeled by the authors of [16]. In the table, s13s29(39/63) means that we split sequence 29 of subject 13 into sub-sequences of 63 frames each, of which the first 39 are for training and the rest for testing. Time is measured in ms.

Sequence         | s13s29(39/63) | s13s30(25/51) | s13s31(25/50) | s14s06(24/40) | s14s14(29/43) | s14s20(29/43)
Measure          | Acc    Time   | Acc    Time   | Acc    Time   | Acc    Time   | Acc    Time   | Acc    Time
Baum-Welch (EM)  | 0.50   383    | 0.50   140    | 0.46   148    | 0.34   368    | 0.62   529    | 0.77   345
Spectral Method  | 0.20   80     | 0.25   43     | 0.13   58     | 0.29   66     | 0.63   134    | 0.59   70
DCA-HMM (s=9)    | 0.33   3.3    | 0.92   1      | 0.19   1.5    | 0.29   4.8    | 0.79   3      | 0.28   3
DCA-HMM (s=26)   | 0.50   3.3    | 1.00   1      | 0.65   1.6    | 0.60   4.8    | 0.45   3      | 0.89   3
DCA-HMM (s=52)   | 0.50   3.4    | 0.50   1.1    | 0.43   1.6    | 0.48   4.9    | 0.80   3.2    | 0.78   3.1
DCA-HMM (s=78)   | 0.66   3.4    | 0.93   1.1    | 0.41   1.6    | 0.51   4.9    | 0.80   6.7    | 0.83   3.2

Figure 6: LEFT: perplexity (smaller is better) on the test set and CPU seconds vs. number of topics for LDA on the NIPS1-17 dataset; we randomly selected 70% of the documents for training and used the remaining 30% for testing. RIGHT: mutual information (higher is better) and CPU seconds vs. number of clusters for subspace clustering on the COIL-100 dataset.

DCA for Hidden Markov Model on Stock Price and Motion Capture Data.
The experimental comparison results for stock-price modeling and motion segmentation are shown in Figure 5 and Table 2, respectively. For stock prices, DCA always achieves slightly lower but comparable likelihood to the Baum-Welch (EM) method [5], while the spectral method [2] performs worse and less stably. DCA shows a significant speed advantage over the other methods, and is thus preferable in practice. For motion segmentation, we evaluate prediction accuracy on the test set, where the regularization induced by the separability assumption gives DCA both the highest accuracy and the fastest speed.

DCA for Latent Dirichlet Allocation on the NIPS1-17 Dataset. The experimental comparison results for topic modeling are shown in Figure 6. Compared to both traditional EM and Gibbs sampling [23], DCA not only achieves the smallest perplexity (highest likelihood) on the test set and the highest speed, but also the most stable performance as the number of topics increases. In addition, the "anchor words" found by DCA yield more interpretable topics than the other methods.

DCA for Subspace Clustering on the COIL-100 Dataset. The experimental comparison results for subspace clustering are shown in Figure 6. DCA provides a much more practical algorithm that achieves comparable mutual information at a more than 1000x speedup over state-of-the-art SC algorithms such as SCC [9], SSC [12], LRR [21], and RSC [26].

Acknowledgments: We would like to thank the MELODI lab members for proof-reading and the anonymous reviewers for their helpful comments. This work is supported by the TerraSwarm research center administered by the STARnet phase of the Focus Center Research Program (FCRP) sponsored by MARCO and DARPA, by the National Science Foundation under Grant No.
(IIS-1162606), by Google, Microsoft, and Intel research awards, and by the Intel Science and Technology Center for Pervasive Computing.

References

[1] A. Anandkumar, D. P. Foster, D. Hsu, S. Kakade, and Y. Liu. A spectral algorithm for latent Dirichlet allocation. In NIPS, 2012.
[2] A. Anandkumar, D. Hsu, and S. M. Kakade. A method of moments for mixture models and hidden Markov models. In COLT, 2012.
[3] S. Arora, R. Ge, Y. Halpern, D. M. Mimno, A. Moitra, D. Sontag, Y. Wu, and M. Zhu. A practical algorithm for topic modeling with provable guarantees. In ICML, 2013.
[4] S. Arora, R. Ge, R. Kannan, and A. Moitra. Computing a nonnegative matrix factorization - provably. In STOC, 2012.
[5] L. E. Baum and T. Petrie. Statistical inference for probabilistic functions of finite state Markov chains. Annals of Mathematical Statistics, 37:1554-1563, 1966.
[6] M. Belkin and K. Sinha. Polynomial learning of distribution families. In FOCS, 2010.
[7] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research (JMLR), 3:993-1022, 2003.
[8] J. T. Chang. Full reconstruction of Markov models on evolutionary trees: Identifiability and consistency. Mathematical Biosciences, 137(1):51-73, 1996.
[9] G. Chen and G. Lerman. Spectral curvature clustering (SCC). International Journal of Computer Vision (IJCV), 81(3):317-330, 2009.
[10] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1-38, 1977.
[11] D. Donoho and V. Stodden. When does non-negative matrix factorization give a correct decomposition into parts? In NIPS, 2003.
[12] E. Elhamifar and R. Vidal. Sparse subspace clustering. In CVPR, 2009.
[13] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 6(6):721-741, 1984.
[14] N. Gillis and S. A. Vavasis. Fast and robust recursive algorithms for separable nonnegative matrix factorization. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 36(4):698-714, 2014.
[15] D. Hsu, S. M. Kakade, and T. Zhang. A spectral algorithm for learning hidden Markov models. In COLT, 2009.
[16] M. C. Hughes, E. B. Fox, and E. B. Sudderth. Effective split-merge Monte Carlo methods for nonparametric models of sequential data. In NIPS, 2012.
[17] A. T. Kalai, A. Moitra, and G. Valiant. Efficiently learning mixtures of two Gaussians. In STOC, 2010.
[18] R. Kannan, H. Salmasian, and S. Vempala. The spectral method for general mixture models. In COLT, 2005.
[19] A. Kumar, V. Sindhwani, and P. Kambadur. Fast conical hull algorithms for near-separable nonnegative matrix factorization. In ICML, 2013.
[20] D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401:788-791, 1999.
[21] G. Liu, Z. Lin, and Y. Yu. Robust subspace segmentation by low-rank representation. In ICML, 2010.
[22] K. Pearson. Contributions to the mathematical theory of evolution. Philosophical Transactions of the Royal Society of London. A, 185:71-110, 1894.
[23] I. Porteous, D. Newman, A. Ihler, A. Asuncion, P. Smyth, and M. Welling. Fast collapsed Gibbs sampling for latent Dirichlet allocation. In SIGKDD, pages 569-577, 2008.
[24] R. A. Redner and H. F. Walker. Mixture densities, maximum likelihood and the EM algorithm. SIAM Review, 26(2):195-239, 1984.
[25] R. Salakhutdinov and A. Mnih. Probabilistic matrix factorization. In NIPS, 2008.
[26] M. Soltanolkotabi, E. Elhamifar, and E. J. Candes. Robust subspace clustering. arXiv:1301.2603, 2013.
[27] D. Titterington, A. Smith, and U. Makov. Statistical Analysis of Finite Mixture Distributions. Wiley, New York, 1985.
[28] T. Zhou, W. Bian, and D. Tao. Divide-and-conquer anchoring for near-separable nonnegative matrix factorization and completion in high dimensions. In ICDM, 2013.
[29] T. Zhou, J. Bilmes, and C. Guestrin. Extended version of "Divide-and-conquer learning by anchoring a conical hull". Extended version of an accepted NIPS 2014 paper, 2014.