{"title": "DFacTo: Distributed Factorization of Tensors", "book": "Advances in Neural Information Processing Systems", "page_first": 1296, "page_last": 1304, "abstract": "We present a technique for significantly speeding up Alternating Least Squares (ALS) and Gradient Descent (GD), two widely used algorithms for tensor factorization. By exploiting properties of the Khatri-Rao product, we show how to efficiently address a computationally challenging sub-step of both algorithms. Our algorithm, DFacTo, only requires two matrix-vector products and is easy to parallelize. DFacTo is not only scalable but also on average 4 to 10 times faster than competing algorithms on a variety of datasets. For instance, DFacTo only takes 480 seconds on 4 machines to perform one iteration of the ALS algorithm and 1,143 seconds to perform one iteration of the GD algorithm on a 6.5 million x 2.5 million x 1.5 million dimensional tensor with 1.2 billion non-zero entries.", "full_text": "DFacTo: Distributed Factorization of Tensors\n\nJoon Hee Choi\n\nElectrical and Computer Engineering\n\nPurdue University\n\nWest Lafayette, IN 47907\nchoi240@purdue.edu\n\nS. V. N. Vishwanathan\n\nComputer Science\n\nUniversity of California Santa Cruz\n\nSanta Cruz, CA 95064\nvishy@ucsc.edu\n\nAbstract\n\nWe present a technique for signi\ufb01cantly speeding up Alternating Least Squares\n(ALS) and Gradient Descent (GD), two widely used algorithms for tensor fac-\ntorization. By exploiting properties of the Khatri-Rao product, we show how to\nef\ufb01ciently address a computationally challenging sub-step of both algorithms. Our\nalgorithm, DFacTo, only requires two sparse matrix-vector products and is easy\nto parallelize. DFacTo is not only scalable but also on average 4 to 10 times faster\nthan competing algorithms on a variety of datasets. 
For instance, DFacTo only\ntakes 480 seconds on 4 machines to perform one iteration of the ALS algorithm\nand 1,143 seconds to perform one iteration of the GD algorithm on a 6.5 million\n\u00d7 2.5 million \u00d7 1.5 million dimensional tensor with 1.2 billion non-zero entries.\n\n1\n\nIntroduction\n\nTensor data appears naturally in a number of applications [1, 2]. For instance, consider a social\nnetwork evolving over time. One can form a users \u00d7 users \u00d7 time tensor which contains snapshots\nof interactions between members of the social network [3]. As another example consider an online\nstore such as Amazon.com where users routinely review various products. One can form a users \u00d7\nitems \u00d7 words tensor from the review text [4]. Similarly a tensor can be formed by considering the\nvarious contexts in which a user has interacted with an item [5]. Finally, consider data collected by\nthe Never Ending Language Learner from the Read the Web project which contains triples of noun\nphrases and the context in which they occur, such as, (\u201cGeorge Harrison\u201d, \u201cplays\u201d, \u201cguitars\u201d) [6].\nWhile matrix factorization and matrix completion have become standard tools that are routinely used\nby practitioners, unfortunately, the same cannot be said about tensor factorization. The reasons are\nnot very hard to see: There are two popular algorithms for tensor factorization namely Alternating\nLeast Squares (ALS) (Appendix B), and Gradient Descent (GD) (Appendix C). The key step in\nboth algorithms is to multiply a matricized tensor and a Khatri-Rao product of two matrices (line 4\nof Algorithm 2 and line 4 of Algorithm 3). However, this process leads to a computationally-\nchallenging, intermediate data explosion problem. 
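To make the scale of this intermediate blow-up concrete, here is a back-of-the-envelope calculation (the arithmetic is ours) using the Amazon.com tensor from the abstract, whose second and third modes have about 2.5 million and 1.5 million dimensions:

```latex
C \odot B \in \mathbb{R}^{JK \times R}, \qquad
JK \approx (2.5 \times 10^{6}) \times (1.5 \times 10^{6}) \approx 3.8 \times 10^{12},
```

so even at rank R = 1 the explicit Khatri-Rao product would hold roughly 3.8 × 10^12 entries (about 30 TB at 8 bytes per entry), dwarfing the 1.2 × 10^9 non-zero entries of the tensor itself.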
This problem is exacerbated when the dimensions of the tensor we need to factorize are very large (of the order of hundreds of thousands or millions), or when sparse tensors contain millions to billions of non-zero entries. For instance, a tensor we formed using review text from Amazon.com has dimensions of 6.5 million × 2.5 million × 1.5 million and contains approximately 1.2 billion non-zero entries.\nSome studies have identified this intermediate data explosion problem and have suggested ways of addressing it. First, the Tensor Toolbox [7] uses the method of reducing indices of the tensor for sparse datasets and entrywise multiplication of vectors and matrices for dense datasets. However, it is not clear how to store data or how to distribute the tensor factorization computation to multiple machines (see Appendix D). That is, there is a lack of distributable algorithms in existing studies. Another possible strategy to solve the data explosion problem is to use GigaTensor [8]. Unfortunately, while GigaTensor does address the problem of parallel computation, it is relatively slow. To summarize, existing algorithms for tensor factorization such as the excellent Tensor Toolbox of [7], or the Map-Reduce based GigaTensor algorithm of [8], often do not scale to large problems.\nIn this paper, we introduce an efficient, scalable and distributed algorithm, DFacTo, that addresses the data explosion problem. Since most large-scale real datasets are sparse, we will focus exclusively on sparse tensors. This is well justified because previous studies have shown that designing specialized algorithms for sparse tensors can yield significant speedups [7]. We show that DFacTo can be applied to both ALS and GD, and naturally lends itself to a distributed implementation. Therefore, it can be applied to massive real datasets which cannot be stored and manipulated on a single machine. 
For ALS, DFacTo is on average around 5 times faster than GigaTensor and around\n10 times faster than the Tensor Toolbox on a variety of datasets. In the case of GD, DFacTo is\non average around 4 times faster than CP-OPT [9] from the Tensor Toolbox. On the Amazon.com\nreview dataset, DFacTo only takes 480 seconds on 4 machines to perform one iteration of ALS and\n1,143 seconds to perform one iteration of GD.\nAs with any algorithm, there is a trade-off: DFacTo uses 3 times more memory than the Tensor\nToolbox, since it needs to store 3 \ufb02attened matrices as opposed to a single tensor. However, in\nreturn, our algorithm only requires two sparse matrix-vector multiplications, making DFacTo easy to\nimplement using any standard sparse linear algebra library. Therefore, there are two merits of using\nour algorithm: 1) computations are distributed in a natural way; and 2) only standard operations are\nrequired.\n\n2 Notation and Preliminaries\n\nOur notation is standard, and closely follows [2]. Also see [1]. Lower case letters such as x denote\nscalars, bold lower case letters such as x denote vectors, bold upper case letters such as X represent\nmatrices, and calligraphic letters such as X denote three-dimensional tensors.\nThe i-th element of a vector x is written as xi. In a similar vein, the (i, j)-th entry of a matrix\nX is denoted as xi,j and the (i, j, k)-th entry of a tensor X is written as xi,j,k. Furthermore, xi,:\n(resp. x:,i) denotes the i-th row (resp. column) of X. We will use X\u2126,: (resp. X:,\u2126) to denote the\nsub-matrix of X which contains the rows (resp. columns) indexed by the set \u2126. For instance, if\n\u2126 = {2, 4}, then X\u2126,: is a matrix which contains the second and fourth rows of X. Extending the\nabove notation to tensors, we will write Xi,:,:, X:,j,: and X:,:,k to respectively denote the horizontal,\nlateral and frontal slices of a third-order tensor X. 
The column, row, and tube fibers of X are given by x_{:,j,k}, x_{i,:,k}, and x_{i,j,:} respectively.\nSometimes a matrix or tensor may not be fully observed. We will use ΩX to denote the set of indices corresponding to the observed (or equivalently non-zero) entries in a matrix X or a tensor X. Extending this notation, ΩX_{i,:} (resp. ΩX_{:,j}) denotes the set of column (resp. row) indices corresponding to the observed entries in the i-th row (resp. j-th column) of X. We define ΩX_{i,:,:}, ΩX_{:,j,:}, and ΩX_{:,:,k} analogously as the sets of indices corresponding to the observed entries of the i-th horizontal, j-th lateral, and k-th frontal slices of X. Also, nnzr(X) (resp. nnzc(X)) denotes the number of rows (resp. columns) of X which contain at least one non-zero element.\nX^T denotes the transpose, X† denotes the Moore-Penrose pseudo-inverse, and ‖X‖ (resp. ‖X‖) denotes the Frobenius norm of a matrix X (resp. tensor X) [10]. Given a matrix A ∈ R^{n×m}, the linear operator vec(A) yields a vector x ∈ R^{nm}, which is obtained by stacking the columns of A. On the other hand, given a vector x ∈ R^{nm}, the operator unvec_{(n,m)}(x) yields a matrix A ∈ R^{n×m}. A ⊗ B denotes the Kronecker product, A ⊙ B the Khatri-Rao product, and A ∗ B the Hadamard product of matrices A and B. The outer product of vectors a and b is written as a ∘ b (see e.g., [11]). Definitions of these standard matrix products can be found in Appendix A.\n\n2.1 Flattening Tensors\nJust like the vec(·) operator flattens a matrix, a tensor X may also be unfolded or flattened into a matrix in three ways, namely by stacking the horizontal, lateral, and frontal slices. 
We use Xn to denote the n-mode flattening of a third-order tensor X ∈ R^{I×J×K}; X1 is of size I × JK, X2 is of size J × KI, and X3 is of size K × IJ. The following relationships hold between the entries of X and its unfolded versions (see Appendix A.1 for an illustrative example):\n\nx_{i,j,k} = x^1_{i, j+(k-1)J} = x^2_{j, k+(i-1)K} = x^3_{k, i+(j-1)I}.   (1)\n\nWe can view X1 as consisting of K stacked frontal slices of X, each of size I × J. Similarly, X2 consists of I slices of size J × K and X3 is made up of J slices of size K × I. If we use X^{n,m} to denote the m-th slice in the n-mode flattening of X, then observe that the following holds:\n\nx^1_{i, j+(k-1)J} = x^{1,k}_{i,j},   x^2_{j, k+(i-1)K} = x^{2,i}_{j,k},   x^3_{k, i+(j-1)I} = x^{3,j}_{k,i}.   (2)\n\nOne can state a relationship between the rows and columns of the various flattenings of a tensor, which will be used to derive our distributed tensor factorization algorithm in Section 3. The proof of the lemma below is in Appendix A.2.\n\nLemma 1 Let (n, n′) ∈ {(2, 1), (3, 2), (1, 3)}, and let Xn and Xn′ be the n and n′-mode flattenings respectively of a tensor X. Moreover, let X^{n,m} be the m-th slice in Xn, and x^{n′}_{m,:} be the m-th row of Xn′. Then, vec(X^{n,m}) = x^{n′}_{m,:}.\n\n3 DFacTo\n\nRecall that the main challenge of implementing ALS or GD for solving tensor factorization lies in multiplying a matricized tensor and a Khatri-Rao product of two matrices: X1 (C ⊙ B).¹ If B is of size J × R and C is of size K × R, explicitly forming (C ⊙ B) requires O(JKR) memory and is infeasible when J and K are large. This is called the intermediate data explosion problem in the literature [8]. The lemma below will be used to derive our efficient algorithm, which avoids this problem. 
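For readers who prefer code to index algebra, the correspondences in (1) can be written down directly. The following C++ sketch is our own illustration (not from the paper), using 0-based indices, so the paper's j + (k−1)J becomes j + kJ; it computes the column at which entry (i, j, k) lands in each mode-n flattening:

```cpp
#include <cstddef>

// 0-based version of Eq. (1): entry (i, j, k) of an I x J x K tensor lands at
//   column j + k*J of X1 (size I x JK),
//   column k + i*K of X2 (size J x KI),
//   column i + j*I of X3 (size K x IJ).
struct Dims { std::size_t I, J, K; };

std::size_t mode1_col(const Dims& d, std::size_t, std::size_t j, std::size_t k) {
    return j + k * d.J;  // row of X1 is simply i
}
std::size_t mode2_col(const Dims& d, std::size_t i, std::size_t, std::size_t k) {
    return k + i * d.K;  // row of X2 is simply j
}
std::size_t mode3_col(const Dims& d, std::size_t i, std::size_t j, std::size_t) {
    return i + j * d.I;  // row of X3 is simply k
}
```

For a 2 × 3 × 4 tensor, entry (1, 2, 3) lands in column 2 + 3·3 = 11 of X1, column 3 + 1·4 = 7 of X2, and column 1 + 2·2 = 5 of X3.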
Although the proof can be inferred using results in [2], we give an elementary proof for completeness.\n\nLemma 2 The r-th column of X1 (C ⊙ B) can be computed as\n\n[X1 (C ⊙ B)]_{:,r} = [unvec_{(K,I)}((X2)^T b_{:,r})]^T c_{:,r}.   (3)\n\nProof We need to show that\n\n[X1 (C ⊙ B)]_{:,r} = [b_{:,r}^T X^{2,1} c_{:,r}, ..., b_{:,r}^T X^{2,I} c_{:,r}]^T.\n\nOr equivalently, it suffices to show that [X1 (C ⊙ B)]_{i,r} = b_{:,r}^T X^{2,i} c_{:,r}. Observe that b_{:,r}^T X^{2,i} c_{:,r} is a scalar. Moreover, using Lemma 1 we can write vec(X^{2,i}) = x^1_{i,:}. This allows us to rewrite b_{:,r}^T X^{2,i} c_{:,r} using (13) as\n\nb_{:,r}^T X^{2,i} c_{:,r} = vec(b_{:,r}^T X^{2,i} c_{:,r}) = (c_{:,r}^T ⊗ b_{:,r}^T) vec(X^{2,i}) = x^1_{i,:} (c_{:,r} ⊗ b_{:,r}) = [X1 (C ⊙ B)]_{i,r},   (4)\n\nwhich completes the proof.\n\nUnfortunately, a naive computation of [X1 (C ⊙ B)]_{:,r} by using (3) does not solve the intermediate data explosion problem. This is because (X2)^T b_{:,r} produces a KI-dimensional vector, which is then reshaped by the unvec_{(K,I)}(·) operator into a K × I matrix. However, as the next lemma asserts, only a small number of entries of (X2)^T b_{:,r} are non-zero. For convenience, let v_{:,r} denote the vector produced by (X2)^T b_{:,r} and Mr the matrix produced by [unvec_{(K,I)}(v_{:,r})]^T.\n\n¹ We mainly concentrate on the update to A since the updates to B and C are analogous.\n\nLemma 3 The number of non-zeros in v_{:,r} is at most nnzr((X2)^T) = nnzc(X2).\nProof Multiplying an all-zero row of (X2)^T with b_{:,r} produces zero. Therefore, the number of non-zeros in v_{:,r} is at most the number of rows of (X2)^T that contain at least one non-zero element, and by definition nnzr((X2)^T) is equal to nnzc(X2).\n\nAs a consequence of the above lemma, we only need to explicitly compute the non-zero entries of v_{:,r}. However, the problem of reshaping v_{:,r} via the [unvec_{(K,I)}(·)]^T operator still remains. The next lemma shows how to overcome this difficulty.\n\nLemma 4 The location of the non-zero entries of Mr depends on (X2)^T and is independent of b_{:,r}.\nProof The product of the (k + (i−1)K)-th row of (X2)^T and b_{:,r} is the (k + (i−1)K)-th element of v_{:,r}, and this element is the (i, k)-th entry of Mr by definition of [unvec_{(K,I)}(·)]^T. Therefore, if all the entries in the (k + (i−1)K)-th row of (X2)^T are zero, then the (i, k)-th entry of Mr is zero regardless of b_{:,r}. Consequently, the location of the non-zero entries of Mr is independent of b_{:,r}, and is only determined by (X2)^T.\n\nGiven X one can compute (X2)^T to know the locations of the non-zero entries of Mr. In other words, we can infer the non-zero pattern and therefore preallocate memory for Mr. We will show below how this allows us to perform the [unvec_{(K,I)}(·)]^T operation for free.\nRecall the Compressed Sparse Row (CSR) format, which stores a sparse matrix as three arrays, namely values, columns, and rows. 
Here, values represents the non-zero values of the matrix, while columns stores the column indices of the non-zero values. Also, rows stores the indices of the columns array where each row starts. For example, if a sparse matrix Mr is\n\nMr = [ 1  0  2\n       0  3  4 ],\n\nthen the CSR representation of Mr is\n\nvalue(Mr) = [ 1  2  3  4 ]\ncol(Mr)   = [ 0  2  1  2 ]\nrow(Mr)   = [ 0  2  4 ].\n\nDifferent matrices with the same sparsity pattern can be represented by simply changing the entries of the value array. For our particular case, what this means is that we can pre-compute col(Mr) and row(Mr) and pre-allocate value(Mr). By writing the non-zero entries of v_{:,r} into value(Mr) we can “reshape” v_{:,r} into Mr.\nLet (X̂2)^T denote the matrix obtained by removing the all-zero rows of (X2)^T. Then, Algorithm 1 shows the DFacTo algorithm for computing N := X1 (C ⊙ B). Here, the inputs are (X̂2)^T, B, C, and Mr preallocated in CSR format. By storing the result of the product of (X̂2)^T and b_{:,r} directly into value(Mr), we obtain Mr, because Mr was preallocated in CSR format. Then, the product of Mr and c_{:,r} yields the r-th column of N. We obtain the output N by repeating these two sparse matrix-vector products R times.\n\nAlgorithm 1: DFacTo algorithm for Tensor Factorization\nInput: (X̂2)^T, B, C, value(Mr), col(Mr), row(Mr)\nOutput: N\n1: for r = 1, 2, . . . , R do\n2:     value(Mr) ← (X̂2)^T b_{:,r}\n3:     n_{:,r} ← Mr c_{:,r}\n4: end\n\nIt is immediately obvious that using the above lemmas to compute N requires no extra memory other than storing Mr, which contains at most nnzc(X2) ≤ |ΩX| non-zero entries. Therefore, we completely avoid the intermediate data explosion problem. 
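Algorithm 1 is just two sparse matrix-vector products once the CSR pattern of Mr has been preallocated. The following self-contained C++ sketch (our own minimal CSR type, not the authors' Eigen-based implementation) computes one column of N; it assumes the rows of (X̂2)^T are kept in increasing order of their original row index k + iK, so that writing the matvec output into value(Mr) performs the unvec/transpose implicitly:

```cpp
#include <cstddef>
#include <vector>

// Minimal CSR sparse matrix (0-based). A sketch of DFacTo's inner loop; the
// type and function names are ours.
struct CSR {
    std::size_t nrows, ncols;
    std::vector<std::size_t> row_ptr;  // size nrows + 1
    std::vector<std::size_t> col;      // column index of each non-zero
    std::vector<double> val;           // non-zero values
};

// y = A x (sparse matrix-vector product).
std::vector<double> spmv(const CSR& A, const std::vector<double>& x) {
    std::vector<double> y(A.nrows, 0.0);
    for (std::size_t r = 0; r < A.nrows; ++r)
        for (std::size_t p = A.row_ptr[r]; p < A.row_ptr[r + 1]; ++p)
            y[r] += A.val[p] * x[A.col[p]];
    return y;
}

// One column of N = X1 (C Khatri-Rao B), following Algorithm 1:
//   value(Mr) <- (X2-hat)^T b ;  n <- Mr c.
// X2t_hat holds the non-empty rows of (X2)^T in increasing order of their
// original row index; Mr is the preallocated I x K CSR pattern whose p-th
// value slot corresponds to the p-th kept row, so copying the matvec output
// into Mr.val "reshapes" the vector for free.
std::vector<double> dfacto_column(const CSR& X2t_hat, CSR& Mr,
                                  const std::vector<double>& b,
                                  const std::vector<double>& c) {
    Mr.val = spmv(X2t_hat, b);  // one value per kept row of (X2)^T
    return spmv(Mr, c);         // the r-th column of N
}
```

For instance, for a 2 × 2 × 2 tensor with non-zeros x(0,0,0) = 1, x(0,1,1) = 2, x(1,0,1) = 3, the kept rows of (X2)^T are 0, 1, 3 and Mr has pattern {(0,0), (0,1), (1,1)}; running dfacto_column with b = (1, 2) and c = (1, 2) returns n = (9, 6), matching the naive n_i = Σ_{j,k} x_{i,j,k} b_j c_k.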
Moreover, the same subroutine can be used for both ALS and GD (see Appendix E for detailed pseudo-code).\n\n3.1 Distributed Memory Implementation\n\nOur algorithm is easy to parallelize using a master-slave architecture with MPI (Message Passing Interface). At every iteration, the master transmits A, B, and C to the slaves. Each slave holds a fraction of the rows of X2, using which a fraction of the rows of N is computed. By performing a synchronization step, the slaves can exchange rows of N. In ALS, this N is used to compute A, which is transmitted back to the master. Then, the master updates A, and the iteration proceeds. In GD, the slaves transmit N back to the master, which computes ∇A. Then, the master computes the step size by a line search algorithm, updates A, and the iteration proceeds.\n\n3.2 Complexity Analysis\n\nA naive computation of N requires (JK + |ΩX|) R flops: forming C ⊙ B requires JKR flops, and performing the matrix-matrix multiplication X1 (C ⊙ B) requires |ΩX| R flops. Our algorithm requires only (nnzc(X2) + |ΩX|) R flops: |ΩX| R flops for computing v_{:,r} and nnzc(X2) R flops for computing Mr c_{:,r}. Note that, typically, nnzc(X2) ≪ both JK and |ΩX| (see Table 1). In terms of memory, the naive algorithm requires O(JKR) extra memory, while our algorithm only requires nnzc(X2) extra space to store Mr.\n\n4 Related Work\n\nTwo papers that are most closely related to our work are the GigaTensor algorithm proposed by [8] and the Sparse Tensor Toolbox of [7]. As discussed above, both algorithms attack the problem of computing N efficiently. 
In order to compute n_{:,r}, GigaTensor computes two intermediate matrices N1 := X1 ∗ (1_I (1_K ⊗ b_{:,r})^T) and N2 := bin(X1) ∗ (1_I (c_{:,r} ⊗ 1_J)^T). Next, N3 := N1 ∗ N2 is computed, and n_{:,r} is obtained by computing N3 1_{JK}. As reported in [8], GigaTensor uses 2|ΩX| extra storage and 5|ΩX| flops to compute one column of N. The Sparse Tensor Toolbox stores a tensor as a vector of non-zero values and a matrix of corresponding indices. Entries of B and C are replicated appropriately to create intermediate vectors. A Hadamard product is computed between the non-zero entries of the matrix and the intermediate vectors, and a selected set of entries is summed to form the columns of N. The algorithm uses 2|ΩX| extra storage and 5|ΩX| flops to compute one column of N. See Appendix D for a detailed illustrative example which shows all the intermediate calculations performed by our algorithm as well as the algorithms of [8] and [7].\nAlso, [9] suggests gradient-based optimization of CANDECOMP/PARAFAC (CP) using the same method as [7] to compute X1 (C ⊙ B). [9] refers to this gradient-based optimization algorithm as CPOPT and to the ALS algorithm for CP using the method of [7] as CPALS. Following [9], we use these names, CPALS and CPOPT.\n\n5 Experimental Evaluation\n\nOur experiments are designed to study the scaling behavior of DFacTo on both publicly available real-world datasets as well as synthetically generated data. We contrast the performance of DFacTo (ALS) with GigaTensor [8] as well as with CPALS [7], while the performance of DFacTo (GD) is compared with CPOPT [9]. We also present results to show the scaling behavior of DFacTo when data is distributed across multiple machines.\n\nDatasets See Table 1 for a summary of the real-world datasets we used in our experiments. 
The NELL-1 and NELL-2 datasets are from [8] and consist of (noun phrase 1, context, noun phrase 2) triples from the “Read the Web” project [6]. NELL-2 is a version of NELL-1, which is obtained by removing entries whose values are below a threshold.\nThe Yelp Phoenix dataset is from the Yelp Data Challenge², while Cellartracker, Ratebeer, Beeradvocate and Amazon.com are from the Stanford Network Analysis Project (SNAP) home page. All these datasets consist of product or business reviews. We converted them into a users × items × words tensor by first splitting the text into words, removing stop words, using Porter stemming [12], and then removing user-item pairs which did not have any words associated with them. In addition, for the Amazon.com dataset we filtered out words that appeared fewer than 5 times or in fewer than 5 documents. Note that the number of dimensions as well as the number of non-zero entries reported in Table 1 differ from those reported in [4] because of our pre-processing.\n\nDataset        I        J        K        |ΩX̂|     nnzc(X1)  nnzc(X2)  nnzc(X3)\nYelp           45.97K   11.54K   84.52K   9.85M    6.11M     4.32M     229.83K\nCellartracker  36.54K   412.36K  163.46K  25.02M   5.88M     19.23M    1.32M\nNELL-2         12.09K   28.82K   9.18K    76.88M   21.48M    16.56M    337.37K\nBeeradvocate   33.37K   66.06K   204.08K  78.77M   12.05M    18.98M    1.57M\nRatebeer       29.07K   110.30K  294.04K  77.13M   7.84M     22.40M    2.85M\nNELL-1         2.90M    2.14M    25.50M   143.68M  113.30M   119.13M   17.37M\nAmazon         6.64M    2.44M    1.68M    1.22B    525.25M   389.64M   29.91M\n\nTable 1: Summary statistics of the datasets used in our experiments.\n\nWe also generated the following two kinds of synthetic data for our experiments:\n\n• the number of non-zero entries in the tensor is held fixed but we vary I, J, and K.\n• the dimensions I, J, and K are held fixed but the number of non-zero entries varies.\n\nTo 
simulate power law behavior, both of the above datasets were generated using the following preferential attachment model [13]: the probability that a non-zero entry is added at index (i, j, k) is given by p_i × p_j × p_k, where p_i (resp. p_j and p_k) is proportional to the number of non-zero entries at index i (resp. j and k).\n\nImplementation and Hardware All experiments were conducted on a computing cluster where each node has two 2.1 GHz 12-core AMD 6172 processors with 48 GB physical memory per node. Our algorithms are implemented in C++ using the Eigen library³ and compiled with the Intel Compiler. We downloaded Version 2.5 of the Tensor Toolbox, which is implemented in MATLAB⁴. Since open source code for GigaTensor is not freely available, we developed our own version in C++ following the description in [8]. Also, we used MPICH2⁵ in order to distribute the tensor factorization computation to multiple machines. All our codes are available for download under an open source license from http://www.joonheechoi.com/research.\n\nScaling on Real-World Datasets Both CPALS and our implementation of GigaTensor are uniprocessor codes. Therefore, for this experiment we restricted ourselves to datasets which can fit on a single machine. When initialized with the same starting point, DFacTo and its competing algorithms will converge to the same solution. Therefore, we only compare the CPU time per iteration of the different algorithms. The results are summarized in Table 2. On many datasets DFacTo (ALS) is around 5 times faster than GigaTensor and 10 times faster than CPALS; the differences are more pronounced on the larger datasets. Also, DFacTo (GD) is around 4 times faster than CPOPT.\nThe differences in performance between DFacTo (ALS) and CPALS and between DFacTo (GD) and CPOPT can partially be explained by the fact that DFacTo (ALS, GD) is implemented in C++ while CPALS and CPOPT use MATLAB. 
However, it must be borne in mind that both MATLAB and our implementation use an optimized BLAS library to perform their computationally intensive numerical linear algebra operations.\n\n² https://www.yelp.com/dataset_challenge/dataset\n³ http://eigen.tuxfamily.org\n⁴ http://www.sandia.gov/~tgkolda/TensorToolbox/\n⁵ http://www.mpich.org/static/downloads/\n\nDataset         DFacTo (ALS)  GigaTensor  CPALS   DFacTo (GD)  CPOPT\nYelp Phoenix    9.52          26.82       46.52   13.57        45.9\nCellartracker   23.89         80.65       118.25  35.82        130.32\nNELL-2          32.59         186.30      376.10  80.79        386.25\nBeeradvocate    43.84         224.29      364.98  94.85        481.06\nRatebeer        44.20         240.80      396.63  87.36        349.18\nNELL-1          322.45        772.24      -       742.67       -\n\nTable 2: Time per iteration (in seconds) of DFacTo (ALS), GigaTensor, CPALS, DFacTo (GD), and CPOPT on datasets which can fit in a single machine (R=10).\n\n            DFacTo (ALS)                      DFacTo (GD)\n            NELL-1           Amazon           NELL-1           Amazon\nMachines    Iter.    CPU     Iter.    CPU     Iter.    CPU     Iter.    CPU\n1           322.45   322.45  -        -       742.67   104.23  -        -\n2           205.07   167.29  -        -       492.38   55.11   -        -\n4           141.02   101.58  480.21   376.71  322.65   28.55   1143.7   127.57\n8           86.09    62.19   292.34   204.41  232.41   16.24   727.79   62.61\n16          81.24    46.25   179.23   98.07   178.92   9.70    560.47   28.61\n32          90.31    34.54   142.69   54.60   209.39   7.45    471.91   15.78\n\nTable 3: Total time and CPU time per iteration (in seconds) as a function of the number of machines for the NELL-1 and Amazon datasets (R=10).\n\nCompared to the Map-Reduce version implemented in Java and used for the experiments reported in [8], our C++ implementation of GigaTensor is significantly faster and more optimized. As per [8], the Java implementation took approximately 10,000 seconds per iteration to handle a tensor with around 10^9 non-zero entries, when using 35 machines. 
In contrast, the C++ version was able to handle one iteration of the ALS algorithm on the NELL-1 dataset on a single machine in 772 seconds. However, because DFacTo (ALS) uses a better algorithm, it is able to handsomely outperform GigaTensor and only takes 322 seconds per iteration.\nAlso, the execution time of DFacTo (GD) is longer than that of DFacTo (ALS) because DFacTo (GD) spends more time in the line search used to obtain an appropriate step size.\n\nScaling across Machines Our goal is to study the scaling behavior of the time per iteration as datasets are distributed across different machines. Towards this end we worked with two datasets. NELL-1 is a moderate-size dataset which our algorithm can handle on a single machine, while Amazon is a large dataset which does not fit on a single machine. Table 3 shows that the iteration time decreases as the number of machines increases on the NELL-1 and Amazon datasets. While the decrease in iteration time is not completely linear, the computation time excluding both synchronization and line search time decreases linearly. The Y-axis in Figure 1 indicates T4/Tn, where Tn is the single iteration time with n machines on the Amazon dataset.\n\nFigure 1: The scalability of DFacTo with respect to the number of machines on the Amazon dataset; (a) DFacTo (ALS), (b) DFacTo (GD).\n\nSynthetic Data Experiments We perform two experiments with synthetically generated tensor data. In the first experiment we fix the number of non-zero entries to be 10^6, let I = J = K, and vary the dimensions of the tensor. For the second experiment we again let I = J = K vary, and the number of non-zero entries is set to be 2I. The scaling behavior of the algorithms on these two datasets is summarized in Table 4. Since we used a preferential attachment model to generate the datasets, the non-zero indices exhibit a power law behavior. 
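The preferential attachment generator described above can be sketched in a few lines of C++. This is our own simplified reading of the model: along each mode, an index is drawn with probability proportional to the number of non-zeros already placed there, plus a smoothing constant alpha so that unused indices stay reachable; alpha and the RNG seed are our choices, not specified in the paper:

```cpp
#include <cstddef>
#include <random>
#include <vector>

struct Triple { std::size_t i, j, k; };

class PrefAttachTensor {
public:
    PrefAttachTensor(std::size_t I, std::size_t J, std::size_t K,
                     double alpha = 1.0, unsigned seed = 42)
        : rng_(seed) {
        // Each mode starts with a uniform pseudo-count of alpha per index.
        cnt_[0].assign(I, alpha);
        cnt_[1].assign(J, alpha);
        cnt_[2].assign(K, alpha);
    }

    // Sample the index of one new non-zero entry; the chosen indices become
    // proportionally more likely in later draws ("rich get richer"), which
    // yields the power-law behavior mentioned in the text.
    Triple sample() {
        Triple t{draw(0), draw(1), draw(2)};
        cnt_[0][t.i] += 1.0;
        cnt_[1][t.j] += 1.0;
        cnt_[2][t.k] += 1.0;
        return t;
    }

private:
    std::size_t draw(int mode) {
        std::discrete_distribution<std::size_t> d(cnt_[mode].begin(),
                                                  cnt_[mode].end());
        return d(rng_);
    }
    std::vector<double> cnt_[3];
    std::mt19937 rng_;
};
```

Repeatedly calling sample() and inserting the resulting (i, j, k) triples (with, say, random values) produces a sparse tensor whose per-index non-zero counts follow a heavy-tailed distribution.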
Consequently, the number of columns with non-zero elements (nnzc(·)) for X1, X2 and X3 is very close to the total number of non-zero entries in the tensor. Therefore, as predicted by theory, DFacTo (ALS, GD) does not enjoy significant speedups when compared to GigaTensor, CPALS and CPOPT. However, it must be noted that DFacTo (ALS) is faster than either GigaTensor or CPALS in all but one case, and DFacTo (GD) is faster than CPOPT in all cases. We attribute this to better memory locality, which arises as a consequence of reusing the memory for N as discussed in Section 3.\n\nI = J = K  Non-zeros  DFacTo (ALS)  GigaTensor  CPALS   DFacTo (GD)  CPOPT\n10^4       10^6       1.14          2.80        5.10    2.32         5.21\n10^5       10^6       2.72          6.71        6.11    5.87         11.70\n10^6       10^6       7.26          11.86       16.54   16.51        29.13\n10^7       10^6       41.64         38.19       175.57  121.30       202.71\n10^4       2 × 10^4   0.05          0.09        0.52    0.09         0.57\n10^5       2 × 10^5   0.92          1.61        1.50    1.81         2.98\n10^6       2 × 10^6   12.06         22.08       15.84   21.74        26.04\n10^7       2 × 10^7   144.48        251.89      214.37  275.19       324.2\n\nTable 4: Time per iteration (in seconds) on synthetic datasets (non-zeros = 10^6 or 2I, R=10)\n\nRank Variation Experiments Table 5 shows the time per iteration for various ranks (R) on the NELL-2 dataset. We see that the computation time of our algorithm increases linearly in R, as predicted by the time complexity analysis in Section 3.2.\n\nR        5      10     20     50      100     200     500\nNELL-2   15.84  31.92  58.71  141.43  298.89  574.63  1498.68\n\nTable 5: Time per iteration (in seconds) for various R\n\n6 Discussion and Conclusion\n\nWe presented a technique for significantly speeding up the Alternating Least Squares (ALS) and the Gradient Descent (GD) algorithms for tensor factorization by exploiting properties of the Khatri-Rao product. Not only is our algorithm, DFacTo, computationally attractive, but it is also more memory efficient compared to existing algorithms. 
Furthermore, we presented a strategy for distributing the computations across multiple machines.\nWe hope that the availability of a scalable tensor factorization algorithm will enable practitioners to work on more challenging tensor datasets, and therefore lead to advances in the analysis and understanding of tensor data. Towards this end we intend to make our code freely available for download under a permissive open source license.\nAlthough we mainly focused on tensor factorization using ALS and GD, it is worth noting that one can extend the basic ideas behind DFacTo to other related problems such as joint matrix completion and tensor factorization. We present such a model in Appendix F. In fact, we believe that this joint matrix completion and tensor factorization model by itself is somewhat new and interesting in its own right, despite its resemblance to other joint models including tensor factorization such as [14]. In our joint model, we are given a user × item ratings matrix Y, and some side information such as a user × item × words tensor X. Preliminary experimental results suggest that jointly factorizing Y and X outperforms vanilla matrix completion. Please see Appendix F for details of the algorithm and some experimental results.\n\nReferences\n[1] Age Smilde, Rasmus Bro, and Paul Geladi. Multi-way Analysis with Applications in the Chemical Sciences. John Wiley and Sons, Ltd, 2004.\n[2] Tamara G. Kolda and Brett W. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455–500, 2009.\n[3] Jure Leskovec, Jon M. Kleinberg, and Christos Faloutsos. Graphs over time: densification laws, shrinking diameters and possible explanations. In KDD, pages 177–187, 2005.\n[4] J. McAuley and J. Leskovec. 
Hidden factors and hidden topics: understanding rating dimensions with review text. In Proceedings of the 7th ACM Conference on Recommender Systems, pages 165–172, 2013.\n[5] Alexandros Karatzoglou, Xavier Amatriain, Linas Baltrunas, and Nuria Oliver. Multiverse recommendation: N-dimensional tensor factorization for context-aware collaborative filtering. In Proceedings of the 4th ACM Conference on Recommender Systems (RecSys), 2010.\n[6] A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E.R. Hruschka Jr., and T.M. Mitchell. Toward an architecture for never-ending language learning. In Proceedings of the Conference on Artificial Intelligence (AAAI), 2010.\n[7] Brett W. Bader and Tamara G. Kolda. Efficient MATLAB computations with sparse and factored tensors. SIAM Journal on Scientific Computing, 30(1):205–231, 2007.\n[8] U. Kang, Evangelos E. Papalexakis, Abhay Harpale, and Christos Faloutsos. GigaTensor: scaling tensor analysis up by 100 times - algorithms and discoveries. In Conference on Knowledge Discovery and Data Mining, pages 316–324, 2012.\n[9] Evrim Acar, Daniel M. Dunlavy, and Tamara G. Kolda. A scalable optimization approach for fitting canonical tensor decompositions. Journal of Chemometrics, 25(2):67–86, February 2011.\n[10] R. A. Horn and C. R. Johnson. Matrix Analysis. Cambridge Univ Press, 1990.\n[11] Dennis S. Bernstein. Matrix Mathematics. Princeton University Press, 2005.\n[12] M. Porter. An algorithm for suffix stripping. Program, 14(3):130–137, 1980.\n[13] A. Barabasi and R. Albert. Emergence of scaling in random networks. Science, 286:509–512, 1999.\n[14] Evrim Acar, Tamara G. Kolda, and Daniel M. Dunlavy. All-at-once optimization for coupled matrix and tensor factorizations. 
In MLG\u201911: Proceedings of Mining and Learning with Graphs, August 2011.\n", "award": [], "sourceid": 730, "authors": [{"given_name": "Joon Hee", "family_name": "Choi", "institution": "Purdue University"}, {"given_name": "S.", "family_name": "Vishwanathan", "institution": "Purdue University"}]}