{"title": "Scalable Methods for Nonnegative Matrix Factorizations of Near-separable Tall-and-skinny Matrices", "book": "Advances in Neural Information Processing Systems", "page_first": 945, "page_last": 953, "abstract": "Numerous algorithms are used for nonnegative matrix factorization under the assumption that the matrix is nearly separable. In this paper, we show how to make these algorithms scalable for data matrices that have many more rows than columns, so-called tall-and-skinny matrices.\" One key component to these improved methods is an orthogonal matrix transformation that preserves the separability of the NMF problem. Our final methods need to read the data matrix only once and are suitable for streaming, multi-core, and MapReduce architectures. We demonstrate the efficacy of these algorithms on terabyte-sized matrices from scientific computing and bioinformatics.\"", "full_text": "Scalable Methods for Nonnegative Matrix\n\nFactorizations of Near-separable Tall-and-skinny\n\nMatrices\n\nAustin R. Benson\n\nICME\n\nStanford University\n\nStanford, CA\n\narbenson@stanford.edu\n\nJason D. Lee\n\nICME\n\nStanford University\n\nStanford, CA\n\njdl17@stanford.edu\n\nBartek Rajwa\n\nBindley Biosciences Center\n\nPurdue University\nWest Lafeyette, IN\n\nbrajwa@purdue.edu\n\nDavid F. Gleich\n\nComputer Science Department\n\nPurdue University\nWest Lafeyette, IN\n\ndgleich@purdue.edu\n\nAbstract\n\nNumerous algorithms are used for nonnegative matrix factorization under the as-\nsumption that the matrix is nearly separable.\nIn this paper, we show how to\nmake these algorithms scalable for data matrices that have many more rows than\ncolumns, so-called \u201ctall-and-skinny matrices.\u201d One key component to these im-\nproved methods is an orthogonal matrix transformation that preserves the separa-\nbility of the NMF problem. 
Our final methods need to read the data matrix only once and are suitable for streaming, multi-core, and MapReduce architectures. We demonstrate the efficacy of these algorithms on terabyte-sized matrices from scientific computing and bioinformatics.

1 Nonnegative matrix factorizations at scale

A nonnegative matrix factorization (NMF) for an m × n matrix X with real-valued, nonnegative entries is

X = WH,    (1)

where W is m × r, H is r × n, r < min(m, n), and both factors have nonnegative entries. While there are already standard dimension reduction techniques for general matrices such as the singular value decomposition, the advantage of NMF is in the interpretability of the data. A common example is facial image decomposition [17]. If the columns of X are pixels of a facial image, the columns of W may be facial features such as eyes or ears, and the coefficients in H represent the intensity of these features. For this reason, among a host of other reasons, NMF is used in a broad range of applications including graph clustering [21], protein sequence motif discovery [20], and hyperspectral unmixing [18].

An important property of matrices in these applications and other massive scientific data sets is that they have many more rows than columns (m ≫ n). For example, this matrix structure is common in big data applications with hundreds of millions of samples and a small set of features—see, e.g., Section 4.2 for a bioinformatics application where the data matrix has 1.6 billion rows and 25 columns. We call matrices with many more rows than columns tall-and-skinny. The number of columns of these matrices is small, so there is no problem storing or manipulating them.
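To fix notation, here is a tiny numpy sketch of the shapes in Equation (1); the sizes are arbitrary illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 10_000, 20, 5        # tall-and-skinny: many more rows than columns

# A product of entrywise-nonnegative factors is nonnegative with rank at most r.
W = rng.random((m, r))         # m x r, nonnegative
H = rng.random((r, n))         # r x n, nonnegative
X = W @ H                      # the m x n data matrix

assert X.shape == (m, n)
assert np.all(X >= 0)
assert np.linalg.matrix_rank(X) == r
```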
Our use of NMF is then to uncover the hidden structure in the data rather than for dimension reduction or compression.

In this paper, we present scalable and computationally efficient NMF algorithms for tall-and-skinny matrices, as prior work has not taken advantage of this structure for large-scale factorizations. The advantages of our method are: we preserve the geometry of the problem, we only read the data matrix once, and we can test several different nonnegative ranks (r) with negligible cost. Furthermore, we show that these methods can be implemented in parallel (Section 3) to handle large data sets. In Section 2.3, we present a new dimension reduction technique using orthogonal transformations. These transformations are particularly effective for tall-and-skinny matrices and lead to algorithms that only need to read the data matrix once. We compare this method with a Gaussian projection technique from the hyperspectral unmixing community [5, 7]. We test our algorithms on data sets from two scientific applications, heat transfer simulations and flow cytometry, in Section 4. Our new dimension reduction technique outperforms Gaussian projections on these data sets. In the remainder of the introduction, we review the state of the art for computing non-negative matrix factorizations.

1.1 Separable NMF

We first turn to the issue of how to practically compute the factorization in Equation (1). Unfortunately, for a fixed non-negative rank r, finding the factors W and H for which the residual ||X − WH|| is minimized is NP-complete [26]. To make the problem tractable, we make assumptions about the data. In particular, we require a separability condition on the matrix. A nonnegative matrix X is separable if

X = X(:,K)H,

where K is an index set with |K| = r and X(:,K) is Matlab notation for the matrix X restricted to the columns indexed by K.
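A minimal synthetic instance of this definition (the index set K and all sizes are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
m, r, n = 5_000, 3, 8
K = [0, 3, 6]                      # the planted extreme columns

# Build X = X(:, K) H: set H(:, K) = I so the columns indexed by K appear
# verbatim in X, and every other column is a nonnegative combination of them.
W = rng.random((m, r))
H = rng.random((r, n))
H[:, K] = np.eye(r)
X = W @ H

assert np.allclose(X[:, K], W)      # X restricted to K recovers the factor
assert np.allclose(X, X[:, K] @ H)  # the separability condition holds exactly
```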
Since the coefficients of H are nonnegative, all columns of X live in the conical hull of the “extreme” columns indexed by K. The idea of separability was developed by Donoho and Stodden [15], and recent work has produced tractable NMF algorithms by assuming that X almost satisfies a separability condition [3, 6].

A matrix X is noisy r-separable or near-separable if X = X(:,K)H + N, where N is a noise matrix whose entries are small. Near-separability means that all data points approximately live in the conical hull of the extreme columns. The algorithms for near-separable NMF are typically based on convex geometry (see Section 2.1) and can be described by the same two-step approach:

1. Determine the extreme columns, indexed by K, and let W = X(:,K).
2. Solve H = arg min_{Y ∈ R_+^{r×n}} ||X − WY||.

The bulk of the literature is focused on the first step. In Section 3, we show how to implement both steps in a single pass over the data and provide the details of a MapReduce implementation. We note that separability (or near-separability) is a severe and restrictive assumption. The tradeoff is that our algorithms are extremely scalable and provably correct under this assumption. In big data applications, scalability is at a premium, and this provides some justification for using separability as a tool for exploratory data analysis. Furthermore, our experiments on real scientific data sets in Section 4 under the separability assumption lead to new insights.

1.2 Alternative NMF algorithms and related work

There are several approaches to solving Equation (1) that do not assume the separability condition. These algorithms typically employ block coordinate descent, optimizing over W and H while keeping one factor fixed. Examples include the seminal work by Lee and Seung [23], alternating least squares [10], and fast projection-based least squares [19].
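As a point of reference for the alternating approach just described, the multiplicative updates of Lee and Seung [23] fit in a few lines. This is a minimal dense numpy sketch; the iteration count, initialization, and lack of a convergence test are illustrative simplifications, not the cited implementations.

```python
import numpy as np

def nmf_multiplicative(X, r, iters=1000, eps=1e-12):
    """Alternating multiplicative updates for min ||X - WH||_F with W, H >= 0."""
    rng = np.random.default_rng(0)
    m, n = X.shape
    W = rng.random((m, r))
    H = rng.random((r, n))
    for _ in range(iters):
        H *= (W.T @ X) / (W.T @ W @ H + eps)  # update H with W held fixed
        W *= (X @ H.T) / (W @ H @ H.T + eps)  # update W with H held fixed
    return W, H

rng = np.random.default_rng(4)
X = rng.random((300, 3)) @ rng.random((3, 10))  # exactly rank-3, nonnegative
W, H = nmf_multiplicative(X, r=3)
rel = np.linalg.norm(X - W @ H) / np.linalg.norm(X)
assert W.min() >= 0 and H.min() >= 0   # the updates preserve nonnegativity
assert rel < 0.2                       # the residual shrinks, if slowly
```

Note that every entry of W and H is touched on every iteration, which is exactly the per-step cost that becomes problematic when one factor is tall.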
Some of these methods are used in MapReduce architectures at scale [24].

Alternating methods require updating the entire factor W or H after each optimization step. When one of the factors is large, repeated updates can be prohibitively expensive. The problem is exacerbated in Hadoop MapReduce, where intermediate results are written to disk. In addition, alternating methods can take an intolerable number of iterations to converge. Regardless of the approach or computing platform, these algorithms are too slow when the matrices cannot fit in main memory. In contrast, we show in Sections 2 and 3 that the separability assumption leads to algorithms that do not require updates to large matrices. This approach is scalable for large tall-and-skinny matrices in big data problems.

2 Algorithms and dimension reduction for near-separable NMF

There are several popular algorithms for near-separable NMF, and they are motivated by convex geometry. The goal of this section is to show that when X is tall-and-skinny, we can apply dimension reduction techniques so that established algorithms can execute on n × n matrices, rather than the original m × n. Our new dimension reduction technique in Section 2.3 is also motivated by convex geometry. In Section 3, we leverage the dimension reduction into scalable algorithms.

2.1 Geometric algorithms

There are two geometric strategies typically employed for near-separable NMF. The first deals with conical hulls. A cone C ⊂ R^m is a non-empty convex set C = {Σ_i α_i x_i | α_i ∈ R_+, x_i ∈ R^m}. The x_i are generating vectors. In separable NMF,

X = X(:,K)H

implies that all columns of X lie in the cone generated by the columns indexed by K. For any k ∈ K, {αX(:, k) | α ∈ R_+} is an extreme ray of this cone. In other words, the set of columns indexed by K is the set of extreme rays of the cone.
The goal of the XRAY algorithm [22] is to find these extreme rays (i.e., to find K). In particular, the greedy variant of XRAY selects the maximum column norm arg max_j ||R^T X(:, j)||_2 / ||X(:, j)||_2, where R is a residual matrix that gets updated with each new extreme column.

The second approach deals with convex hulls, where the columns of X are ℓ1-normalized. If D is a diagonal matrix with D_ii = ||X(:, i)||_1 and X is separable, then

XD^{-1} = X(:,K)D(K,K)^{-1}D(K,K)HD^{-1} = (XD^{-1})(:,K)H̃.

Thus, XD^{-1} is also separable (in fact, this holds for any nonsingular diagonal matrix D). Since the columns are ℓ1-normalized, the columns of H̃ have non-negative entries and sum to one. In other words, all columns of XD^{-1} are in the convex hull of the columns indexed by K. The problem of determining K is reduced to finding the extreme points of a convex hull. Popular approaches in the context of NMF include the Successive Projection Algorithm (SPA, [2]) and its generalization [16]. Another alternative, based on linear programming, is Hott Topixx [6]. Other geometric approaches had good heuristic performance [9, 25] before the more recent theoretical work. As an example of the particulars of one such method, SPA, which we will use in Section 4, finds extreme points by computing arg max_j ||R(:, j)||_2^2, where R is a residual matrix related to the data matrix X.

In any algorithm, we call the columns indexed by K extreme columns. The next two subsections are devoted to dimension reduction techniques for finding the extreme columns in the case when X is tall-and-skinny.

2.2 Gaussian projection

A common dimension reduction technique is random Gaussian projections, and the idea has been used in hyperspectral unmixing problems [5].
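The SPA selection rule just described can be made concrete in a few lines (a simplified sketch under the exact separability assumption; the cited implementations add safeguards for noisy data):

```python
import numpy as np

def spa(X, r):
    """Pick r extreme columns by successive projection, SPA-style."""
    R = X / X.sum(axis=0)                        # l1-normalize (X is nonnegative)
    K = []
    for _ in range(r):
        j = int(np.argmax((R * R).sum(axis=0)))  # largest squared column 2-norm
        K.append(j)
        u = R[:, j] / np.linalg.norm(R[:, j])
        R = R - np.outer(u, u @ R)               # project out the chosen direction
    return K

# On an exactly separable matrix, SPA recovers the planted extreme columns.
rng = np.random.default_rng(2)
W = rng.random((2_000, 3))
H = rng.random((3, 9))
H[:, [0, 4, 8]] = np.eye(3)
X = W @ H
assert sorted(spa(X, 3)) == [0, 4, 8]
```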
In the hyperspectral unmixing literature, the separability is referred to as the pure-pixel assumption, and the random projections are motivated by convex geometry [7]. In particular, given a matrix G ∈ R^{m×k} with Gaussian i.i.d. entries, the extreme columns of X are taken as the extreme columns of G^T X, which is of dimension k × n. Recent work shows that when X is nearly r-separable and k = O(r log r), then all of the extreme columns are found with high probability [13].

2.3 Orthogonal transformations

Our new alternative dimension reduction technique is also motivated by convex geometry. Consider a cone C ⊂ R^m and a nonsingular m × m matrix M. It is easily shown that x is an extreme ray of C if and only if Mx is an extreme ray of MC = {Mz | z ∈ C}. Similarly, for any convex set, invertible transformations preserve extreme points.

We take advantage of these facts by applying specific orthogonal transformations as the nonsingular matrix M. Let X = QR̃ and X = UΣ̃V^T be the full QR factorization and singular value decomposition (SVD) of X, so that Q and U are m × m orthogonal (and hence nonsingular) matrices. Then

Q^T X = [R; 0],   U^T X = [ΣV^T; 0],

where R and Σ are the top n × n blocks of R̃ and Σ̃, and 0 is an (m − n) × n matrix of zeroes. The zero rows provide no information on which columns of Q^T X or U^T X are extreme rays or extreme points. Thus, we can restrict ourselves to finding the extreme columns of R and ΣV^T. These matrices are n × n, and we have significantly reduced the dimension of the problem.
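This invariance is easy to check numerically: a thin QR stands in for the full Q^T (the discarded rows are zero), and a norm-based greedy selector picks identical columns before and after the transformation. A small illustrative sketch, not the paper's MapReduce code:

```python
import numpy as np

def select(M, r):
    """Greedy norm-based column selection (SPA-style), used here only to compare inputs."""
    R = M.astype(float).copy()
    K = []
    for _ in range(r):
        j = int(np.argmax((R * R).sum(axis=0)))
        K.append(j)
        u = R[:, j] / np.linalg.norm(R[:, j])
        R -= np.outer(u, u @ R)
    return K

rng = np.random.default_rng(3)
W = rng.random((20_000, 4))
H = rng.random((4, 12))
H[:, [1, 5, 7, 10]] = np.eye(4)
X = W @ H                                  # tall-and-skinny, exactly separable

Q, R = np.linalg.qr(X)                     # thin QR: R is 12 x 12
# Q^T preserves column 2-norms, so the selected columns are identical.
assert np.allclose(np.linalg.norm(X, axis=0), np.linalg.norm(R, axis=0))
assert select(X, 4) == select(R, 4)
```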
In fact, if X = X(:,K)H is a separable representation, we immediately have separated representations for R and ΣV^T:

R = R(:,K)H,   ΣV^T = (ΣV^T)(:,K)H.

We note that, although any invertible transformation preserves extreme columns, many transformations will destroy the geometric structure of the data. However, orthogonal transformations are either rotations or reflections, and they preserve the data's geometry. Also, although Q^T and U^T are m × m, we will only apply them implicitly (see Section 3.1), i.e., these matrices are never formed or computed.

This dimension reduction technique is exact when X is r-separable, and the results will be the same for the orthogonal transformations Q^T and U^T. This is a consequence of the transformed data having the same separability as the original data. The SPA and XRAY algorithms briefly described in Section 2.1 only depend on computing column 2-norms, which are preserved under orthogonal transformations. For these algorithms, applying Q^T or U^T preserves the column 2-norms of the data, and the selected extreme columns are the same. However, other NMF algorithms do not possess this invariance. For this reason, we present both of the orthogonal transformations.

Finally, we highlight an important benefit of this dimension reduction technique. In many applications, the data is noisy and the separation rank (r in Equation (1)) is not known a priori. In Section 2.4, we show that the H factor can be computed in the small dimension. Thus, it is viable to try several different values of the separation rank and pick the best one. This idea is extremely useful for the applications presented in Section 4, where we do not have a good estimate of the separability of the data.

2.4 Computing H

Selecting the extreme columns indexed by K completes one half of the NMF factorization in Equation (1). How do we compute H? We want H = arg min_{Y ∈ R_+^{r×n}} ||X − X(:,K)Y|| for some norm. Choosing the Frobenius norm results in a set of n nonnegative least squares (NNLS) problems:

H(:, i) = arg min_{y ∈ R_+^r} ||X(:,K)y − X(:, i)||_2^2,   i = 1, . . . , n.

Let X = QR̃ with R the upper n × n block of R̃. Then H(:, i) is computed by finding y ∈ R_+^r that minimizes

||X(:,K)y − X(:, i)||_2^2 = ||Q^T(X(:,K)y − X(:, i))||_2^2 = ||R(:,K)y − R(:, i)||_2^2.

Thus, we can solve the NNLS problem with matrices of size n × n. After computing just the small R factor from the QR factorization, we can compute the entire nonnegative matrix factorization by working with matrices of size n × n. Analogous results hold for the SVD, where we replace Q by U, the left singular vectors. In Section 3, we show that these computations are simple and scalable. Since m ≫ n, computations on O(n^2) data are fast, even in serial. Finally, note that we can also compute the residual in this reduced space, i.e.:

min_{y ∈ R_+^r} ||X(:,K)y − X(:, i)||_2^2 = min_{y ∈ R_+^r} ||R(:,K)y − R(:, i)||_2^2.

This simple fact is significant in practice. When there are several candidate sets of extreme columns K, the residual error for each set can be computed quickly. In Section 4, we compute many residual errors for different sets K in order to choose an optimal separation rank.

We have now shown how to use dimension reduction techniques for tall-and-skinny matrix data in near-separable NMF algorithms. Following the same strategy as many NMF algorithms, we first compute extreme columns and then solve for the coefficient matrix H.
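These identities can be sanity-checked in memory. The sketch below assumes an exactly separable X, so the unconstrained least-squares solution is already nonnegative and coincides with the NNLS solution; genuinely near-separable data would need a true NNLS solver.

```python
import numpy as np

rng = np.random.default_rng(5)
m, r, n = 30_000, 3, 8
K = [0, 3, 6]
W = rng.random((m, r))
H = rng.random((r, n))
H[:, K] = np.eye(r)
X = W @ H                                   # exactly separable, tall-and-skinny

Q, R = np.linalg.qr(X)                      # thin QR: R is n x n

# The objective is identical in the full and reduced spaces for any y.
y = rng.random(r)
i = 4
full = np.linalg.norm(X[:, K] @ y - X[:, i])
small = np.linalg.norm(R[:, K] @ y - R[:, i])
assert np.isclose(full, small)

# Solving in the n x n space recovers the true (nonnegative) coefficients.
h = np.linalg.lstsq(R[:, K], R[:, i], rcond=None)[0]
assert np.allclose(h, H[:, i])
assert np.allclose(X[:, K] @ h, X[:, i])
```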
Fortunately, once the upfront cost of the orthogonal transformation is complete, both steps can be computed using O(n^2) data.

3 Implementation

Remarkably, when the matrix is tall-and-skinny, we only need to read the data matrix once. The reads can be performed in parallel, and computing platforms such as MapReduce, Spark, distributed memory MPI, and GPUs can all achieve optimal parallel communication. For our implementation, we use Hadoop MapReduce for convenience.1 While all of the algorithms use sophisticated computation, these routines are only ever invoked with matrices of size n × n. Furthermore, the local memory requirements of these algorithms are only O(n^2). Thus, we get extremely scalable implementations. We note that, using MapReduce, computing G^T X for the Gaussian projection technique is a simple variation of standard methods to compute X^T X [4].

3.1 TSQR and R-SVD

The thin QR factorization of an m × n real-valued matrix X with m > n is X = QR, where Q is an m × n orthogonal matrix and R is an n × n upper triangular matrix. This is precisely the factorization we need in Section 2. For our purposes, Q^T is applied implicitly, and we only need to compute R. When m ≫ n, communication-optimal algorithms for computing the factorization are referred to as TSQR [14]. Implementations and specializations of the TSQR ideas are available in several environments, including MapReduce [4, 11], distributed memory MPI [14], and GPUs [1]. All of these methods avoid computing X^T X and hence are numerically stable.

The thin SVD used in Section 2.3 is a small extension of the thin QR factorization. The thin SVD is X = UΣV^T, where U is m × n and orthogonal, Σ is diagonal with decreasing, nonnegative diagonal entries, and V is n × n and orthogonal. Let X = QR be the thin QR factorization of X and R = U_R ΣV^T be the SVD of R. Then X = (QU_R)ΣV^T = UΣV^T.
The matrix U = QU_R is m × n and orthogonal, so this is the thin SVD of X. The dimension of R is n × n, so computing its SVD takes O(n^3) floating point operations (flops), a trivial cost when n is small. When m ≫ n, this method for computing the SVD is called the R-SVD [8]. Both TSQR and R-SVD require O(mn^2) flops. However, the dominant cost is data I/O, and TSQR only reads the data matrix once.

3.2 Column normalization

The convex hull algorithms from Section 2.1 and the Gaussian projection algorithm from Section 2.2 require the columns of the data matrix X to be normalized. A naive implementation of the column normalization in a MapReduce environment is: (1) read X and compute the column norms; (2) read X, normalize the columns, and write the normalized data to disk; (3) use TSQR on the normalized matrix. This requires reading the data matrix twice and writing O(mn) data to disk once just to normalize the columns. The better approach is a single step: use TSQR on the unnormalized data X and simultaneously compute the column norms. If D is the diagonal matrix of column norms, then

X = QR → XD^{-1} = Q(RD^{-1}).

The matrix R̂ = RD^{-1} is upper triangular, so QR̂ is the thin QR factorization of the column-normalized data. This approach reads the data once and only writes O(n^2) data. The same idea applies to Gaussian projection since G^T(XD^{-1}) = (G^T X)D^{-1}. Thus, our algorithms only need to read the data matrix once in all cases. (We refer to the algorithm output as selecting the columns and computing the matrix H, which is typically what is used in practice. Retrieving the entries of the columns of X indexed by K does require a subsequent pass.)

4 Applications

In this section, we test our dimension reduction technique on massive scientific data sets. The data are nonnegative, but we do not know a priori that the data is separable.
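Looking back at Section 3, both the R-SVD and the one-pass normalization identity can be verified with a small in-memory sketch, with numpy's qr standing in for the distributed TSQR:

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.random((50_000, 5))                # tall-and-skinny, nonnegative

# R-SVD: factor X = QR, then take the SVD of the small n x n factor R.
Q, R = np.linalg.qr(X)
Ur, S, Vt = np.linalg.svd(R)
U = Q @ Ur                                 # left singular vectors of X
assert np.allclose(S, np.linalg.svd(X, compute_uv=False))
assert np.allclose(U @ np.diag(S) @ Vt, X)

# Column normalization without a second pass over X:
d = X.sum(axis=0)                          # l1 column norms (entries are >= 0)
assert np.allclose(Q @ (R / d), X / d)     # Q (R D^{-1}) equals X D^{-1}
```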
Experiments on synthetic data sets are provided in an online version of this paper and show that our algorithms are effective and correct on near-separable data sets.2

1The code is available at https://github.com/arbenson/mrnmf.

All experiments were conducted on a 10-node, 40-core MapReduce cluster. Each node has 6 2-TB disks, 24 GB of RAM, and a single Intel Core i7-960 3.2 GHz processor. They are connected via Gigabit ethernet. We test the following three algorithms: (1) dimension reduction with the SVD followed by SPA; (2) dimension reduction with the SVD followed by the greedy variant of the XRAY algorithm; (3) Gaussian projection (GP) as described in Section 2.2. We note that the greedy variant of XRAY is not exact in the separable case but works well in practice [22].

Using our dimension reduction technique, all three algorithms require reading the data only once. The algorithms were selected to be a representative set of the approaches in the literature, and we will refer to the three algorithms as SPA, XRAY, and GP. As discussed in Section 2.3, the choice of QR or SVD does not matter for these algorithms (although it may matter for other NMF algorithms). Thus, we only consider the SVD transformation in the subsequent numerical experiments.

4.1 Heat transfer simulation

The heat transfer simulation data contains the simulated heat in a high-conductivity stainless steel block with a low-conductivity foam bubble inserted in the block [12].3 Each column of the matrix corresponds to simulation results for a foam bubble of a different radius. Several simulations for random foam bubble locations are included in a column. Each row corresponds to a three-dimensional spatial coordinate, a time step, and a bubble location. An entry of the matrix is the temperature of the block at a single spatial location, time step, bubble location, and bubble radius. The matrix is constructed such that columns near 64 have far more variability in the data; this is then responsible for additional “rank-like” structure. Thus, we would intuitively expect the NMF algorithms to select additional columns closer to the end of the matrix. (And indeed, this is what we will see shortly.) In total, the matrix has approximately 4.9 billion rows and 64 columns and occupies a little more than 2 TB on the Hadoop Distributed File System (HDFS).

The left plot of Figure 1 shows the relative error for varying separation ranks. The relative error is defined as ||X − X(:,K)H||_F^2 / ||X||_F^2. Even a small separation rank (r = 4) results in a small residual. SPA has the smallest residuals, and XRAY and GP are comparable. An advantage of our projection method is that we can quickly test many values of r. For the heat transfer simulation data, we choose r = 10 for further experiments. This value is near an “elbow” in the residual plot for the GP curve. We note that the original SPA and XRAY algorithms would achieve the same reconstruction error if applied to the entire data set. Our dimension reduction technique allows us to accelerate these established methods for this large problem.

The middle plot of Figure 1 shows the columns selected by each algorithm. Columns 5 through 30 are not extreme in any algorithm. Both SPA and GP select at least one column in indices one through four. Columns 41 through 64 have the highest density of extreme columns for all algorithms. Although the extreme columns are different for the algorithms, the coefficient matrix H exhibits remarkably similar characteristics in all cases. Figure 2 visualizes the matrix H for each algorithm. Each non-extreme column is expressed as a conic combination of only two extreme columns. In general, the two extreme columns corresponding to column i are j1 = arg max{j ∈ K | j < i} and j2 = arg min{j ∈ K | j > i}.
In other words, a non-extreme column is a conic combination of the two extreme columns that “sandwich” it in the data matrix. Furthermore, when the index i is closer to j1, the coefficient for j1 is larger and the coefficient for j2 is smaller. This phenomenon is illustrated in the right plot of Figure 1.

4.2 Flow cytometry

The flow cytometry (FC) data represent abundances of fluorescent molecules labeling antibodies that bind to specific targets on the surface of blood cells.4 The phenotype and function of individual cells can be identified by decoding these label combinations. The analyzed data set contains measurements of 40,000 single cells. The measured fluorescence intensity conveying the abundance information was collected at five different bands corresponding to the FITC, PE, ECD, PC5, and PC7 fluorescent labels tagging antibodies against CD4, CD8, CD19, CD45, and CD3 epitopes. The measurements are represented as the data matrix A of size 40,000 × 5. Our interest in the presented analysis was to study pairwise interactions in the data (cell vs. cell, and marker vs. marker). Thus, we are interested in the matrix X = A ⊗ A, the Kronecker product of A with itself. Each row of X corresponds to a pair of cells and each column to a pair of marker abundance values. X has dimension 40,000^2 × 5^2 and occupies 345 GB on HDFS.

2http://arxiv.org/abs/1402.6964.
3The heat transfer simulation data is available at https://www.opensciencedatacloud.org.
4The FC data is available at https://github.com/arbenson/mrnmf/tree/master/data.

Figure 1: (Left) Relative error in the separable factorization as a function of separation rank (r) for the heat transfer simulation data. Our dimension reduction technique lets us test all values of r quickly. (Middle) The first 10 extreme columns selected by SPA, XRAY, and GP. We choose 10 columns as there is an “elbow” in the GP curve there (left plot). The columns with larger indices are more extreme, but the algorithms still select different columns. (Right) Values of H(K^{-1}(1), j) and H(K^{-1}(34), j) computed by SPA for j = 2, . . . , 33, where K^{-1}(1) and K^{-1}(34) are the indices of the extreme columns 1 and 34 in W (X = WH). Columns 2 through 33 of X are roughly convex combinations of columns 1 and 34, and are not selected as extreme columns by SPA. As j increases, H(K^{-1}(1), j) decreases and H(K^{-1}(34), j) increases.

Figure 2: Coefficient matrix H for SPA, XRAY, and GP for the heat transfer simulation data when r = 10. In all cases, the non-extreme columns are conic combinations of two of the selected columns, i.e., each column in H has at most two non-zero values. Specifically, the non-extreme columns are conic combinations of the two extreme columns that “sandwich” them in the matrix. See the right plot of Figure 1 for a closer look at the coefficients.

The left plot of Figure 3 shows the residuals for the three algorithms applied to the FC data for varying values of the separation rank. In contrast to the heat transfer simulation data, the relative errors are quite large for small r. In fact, SPA has large relative error until nearly all columns are selected (r = 22). XRAY has the smallest residual for any value of r. The right plot of Figure 3 shows the columns selected when r = 16. XRAY and GP only disagree on one column. SPA chooses different columns, which is not surprising given the relative residual error. Interestingly, the columns involving the second marker defining the phenotype (columns 2, 6, 7, 8, 9, 10, 12, 17, 22) are underrepresented in all the choices. This suggests that the information provided by the second marker may be redundant. In biological terms, it may indicate that the phenotypes of the individual cells can be inferred from a smaller number of markers.
Consequently, this opens the possibility that under modified experimental conditions, the FC researchers may omit this particular label and still be able to recover the complete phenotypic information. Owing to the preliminary nature of these studies, a more in-depth analysis involving multiple similar blood samples would be desirable in order to confirm this hypothesis.

Figure 3: (Left) Relative error in the separable factorization as a function of nonnegative rank (r) for the flow cytometry data. (Right) The first 16 extreme columns selected by SPA, XRAY, and GP. We choose 16 columns since the XRAY and GP curves level off for larger r (left plot).

Figure 4: Coefficient matrix H for SPA, XRAY, and GP for the flow cytometry data when r = 16. The coefficients tend to be clustered near the diagonal. This is remarkably different from the coefficients for the heat transfer simulation data in Figure 2.

Finally, Figure 4 shows the coefficient matrix H. The coefficients are larger on the diagonal, which means that the non-extreme columns are composed of nearby extreme columns in the matrix.

5 Discussion

We have shown how to compute nonnegative matrix factorizations at scale for near-separable tall-and-skinny matrices. Our main tool was TSQR, and our algorithms only needed to read the data matrix once. By reducing the dimension of the problem, we can easily compute the efficacy of factorizations for several values of the separation rank r. With these tools, we have computed the largest separable nonnegative matrix factorizations to date. Furthermore, our algorithms provide new insights into massive scientific data sets. The coefficient matrix H exposed structure in the results of heat transfer simulations. Extreme column selection in flow cytometry showed that one of the labels used in measurements may be redundant.
In future work, we would like to analyze\nadditional large-scale scienti\ufb01c data sets. We also plan to test additional NMF algorithms.\nThe practical limits of our algorithm are imposed by the tall-and-skinny requirement where we\nassume that it is easy to manipulate n \u00d7 n matrices. The synthetic examples we explored used up\nto 200 columns, and regimes up to 5000 columns have been explored in prior work [11]. A rough\nrule of thumb is that our implementations should be possible as long as an n \u00d7 n matrix \ufb01ts into\nmain memory. This means that implementations based on our work will scale up to 30, 000 columns\non machines with more than 8 GB of memory; although at this point communication begins to\ndominate. Solving these problems with more columns is a challenging opportunity for the future.\n\nAcknowledgments\n\nARB and JDL are supported by an Of\ufb01ce of Technology Licensing Stanford Graduate Fellowship.\nJDL is also supported by a NSF Graduate Research Fellowship. DFG is supported by NSF CAREER\naward CCF-1149756. BR is supported by NIH grant 1R21EB015707-01.\n\n8\n\n\fReferences\n[1] M. Anderson, G. Ballard, J. Demmel, and K. Keutzer. Communication-avoiding QR decomposition for\n\nGPUs. In IPDPS, pages 48\u201358, 2011.\n\n[2] M. Ara\u00b4ujo et al. The successive projections algorithm for variable selection in spectroscopic multicom-\n\nponent analysis. Chemometrics and Intelligent Laboratory Systems, 57(2):65\u201373, 2001.\n\n[3] S. Arora, R. Ge, R. Kannan, and A. Moitra. Computing a nonnegative matrix factorization\u2013provably. In\n\nProceedings of the 44th symposium on Theory of Computing, pages 145\u2013162. ACM, 2012.\n\n[4] A. R. Benson, D. F. Gleich, and J. Demmel. Direct QR factorizations for tall-and-skinny matrices in\n\nMapReduce architectures. In 2013 IEEE International Conference on Big Data, pages 264\u2013272, 2013.\n\n[5] J. M. Bioucas-Dias and A. Plaza. 
An overview on hyperspectral unmixing: geometrical, statistical, and sparse regression based approaches. In IGARSS, pages 1135–1138, 2011.

[6] V. Bittorf, B. Recht, C. Re, and J. A. Tropp. Factoring nonnegative matrices with linear programs. In NIPS, pages 1223–1231, 2012.

[7] J. W. Boardman et al. Automating spectral unmixing of AVIRIS data using convex geometry concepts. In 4th Annu. JPL Airborne Geoscience Workshop, volume 1, pages 11–14. JPL Publication 93-26, 1993.

[8] T. F. Chan. An improved algorithm for computing the singular value decomposition. ACM Trans. Math. Softw., 8(1):72–83, Mar. 1982.

[9] M. T. Chu and M. M. Lin. Low-dimensional polytope approximation and its applications to nonnegative matrix factorization. SIAM Journal on Scientific Computing, 30(3):1131–1155, 2008.

[10] A. Cichocki and R. Zdunek. Regularized alternating least squares algorithms for non-negative matrix/tensor factorization. In Advances in Neural Networks–ISNN 2007, pages 793–802. Springer, 2007.

[11] P. G. Constantine and D. F. Gleich. Tall and skinny QR factorizations in MapReduce architectures. In Second International Workshop on MapReduce and its Applications, pages 43–50. ACM, 2011.

[12] P. G. Constantine, D. F. Gleich, Y. Hou, and J. Templeton. Model reduction with MapReduce-enabled tall and skinny singular value decomposition. SIAM J. Sci. Comput., to appear, 2014.

[13] A. Damle and Y. Sun. Random projections for non-negative matrix factorization. arXiv:1405.4275, 2014.

[14] J. Demmel, L. Grigori, M. Hoemmen, and J. Langou. Communication-optimal parallel and sequential QR and LU factorizations. SIAM J. Sci. Comp., 34, Feb. 2012.

[15] D. Donoho and V. Stodden. When does non-negative matrix factorization give a correct decomposition into parts? In NIPS, 2003.

[16] N. Gillis and S. Vavasis.
Fast and robust recursive algorithms for separable nonnegative matrix factorization. Pattern Analysis and Machine Intelligence, IEEE Transactions on, PP(99):1–1, 2013.

[17] D. Guillamet and J. Vitrià. Non-negative matrix factorization for face recognition. In Topics in Artificial Intelligence, pages 336–344. Springer, 2002.

[18] S. Jia and Y. Qian. Constrained nonnegative matrix factorization for hyperspectral unmixing. Geoscience and Remote Sensing, IEEE Transactions on, 47(1):161–173, 2009.

[19] D. Kim, S. Sra, and I. S. Dhillon. Fast projection-based methods for the least squares nonnegative matrix approximation problem. Statistical Analysis and Data Mining, 1(1):38–51, 2008.

[20] W. Kim, B. Chen, J. Kim, Y. Pan, and H. Park. Sparse nonnegative matrix factorization for protein sequence motif discovery. Expert Systems with Applications, 38(10):13198–13207, 2011.

[21] D. Kuang, H. Park, and C. H. Ding. Symmetric nonnegative matrix factorization for graph clustering. In SDM, volume 12, pages 106–117, 2012.

[22] A. Kumar, V. Sindhwani, and P. Kambadur. Fast conical hull algorithms for near-separable non-negative matrix factorization. In ICML, 2013.

[23] D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. In NIPS, 2000.

[24] C. Liu, H.-C. Yang, J. Fan, L.-W. He, and Y.-M. Wang. Distributed nonnegative matrix factorization for web-scale dyadic data analysis on MapReduce. In WWW, pages 681–690. ACM, 2010.

[25] C. Thurau, K. Kersting, and C. Bauckhage. Yes we can: simplex volume maximization for descriptive web-scale matrix factorization. In CIKM, pages 1785–1788. ACM, 2010.

[26] S. Vavasis. On the complexity of nonnegative matrix factorization.
SIAM Journal on Optimization, 20(3):1364–1377, 2010.