{"title": "Low-Rank Tucker Decomposition of Large Tensors Using TensorSketch", "book": "Advances in Neural Information Processing Systems", "page_first": 10096, "page_last": 10106, "abstract": "We propose two randomized algorithms for low-rank Tucker decomposition of tensors. The algorithms, which incorporate sketching, only require a single pass of the input tensor and can handle tensors whose elements are streamed in any order. To the best of our knowledge, ours are the only algorithms which can do this. We test our algorithms on sparse synthetic data and compare them to multiple other methods. We also apply one of our algorithms to a real dense 38 GB tensor representing a video and use the resulting decomposition to correctly classify frames containing disturbances.", "full_text": "Low-Rank Tucker Decomposition of Large Tensors\n\nUsing TensorSketch\n\nOsman Asif Malik\n\nDepartment of Applied Mathematics\n\nUniversity of Colorado Boulder\nosman.malik@colorado.edu\n\nStephen Becker\n\nDepartment of Applied Mathematics\n\nUniversity of Colorado Boulder\n\nstephen.becker@colorado.edu\n\nAbstract\n\nWe propose two randomized algorithms for low-rank Tucker decomposition of\ntensors. The algorithms, which incorporate sketching, only require a single pass\nof the input tensor and can handle tensors whose elements are streamed in any\norder. To the best of our knowledge, ours are the only algorithms which can do\nthis. We test our algorithms on sparse synthetic data and compare them to multiple\nother methods. We also apply one of our algorithms to a real dense 38 GB tensor\nrepresenting a video and use the resulting decomposition to correctly classify\nframes containing disturbances.\n\n1\n\nIntroduction\n\nMany real datasets have more than two dimensions and are therefore better represented using tensors,\nor multi-way arrays, rather than matrices. 
In the same way that methods such as the singular value\ndecomposition (SVD) can help in the analysis of data in matrix form, tensor decompositions are\nimportant tools when working with tensor data. As multidimensional datasets grow larger and larger,\nthere is an increasing need for methods that can handle them, even on modest hardware. One approach\nto the challenge of handling big data, which has proven to be very fruitful in the past, is the use of\nrandomization. In this paper, we present two algorithms for computing the Tucker decomposition of a\ntensor which incorporate random sketching. A key challenge to incorporating sketching in the Tucker\ndecomposition is that the relevant design matrices are Kronecker products of the factor matrices. This\nmakes them too large to form and store in RAM, which prohibits the application of standard sketching\ntechniques. Recent work [26, 27, 2, 10] has led to a new technique called TENSORSKETCH which is\nideally suited for sketching Kronecker products. It is based on this technique that we develop our\nalgorithms. Our algorithms, which are single pass and can handle streamed data, are suitable when\nthe decomposition we seek is of low-rank. When we say that our algorithms can handle streamed\ndata, we mean that they can decompose a tensor whose elements are revealed one at a time and then\ndiscarded, no matter which order this is done in. These streaming properties of our methods follow\ndirectly from the streaming properties of TENSORSKETCH.\nIn some applications, such as the compression of scienti\ufb01c data produced by high-\ufb01delity simulations,\nthe data tensors can be very large (see e.g. the recent work [1]). Since such data frequently is produced\nincrementally, e.g. by stepping forward in time, a compression algorithm which is one-pass and can\nhandle the tensor elements being streamed would make it possible to compress the data without ever\nhaving to store it in full. 
Our algorithms have these properties.

In summary, our paper makes the following algorithmic contributions:

• We propose two algorithms for Tucker decomposition which incorporate TENSORSKETCH. They are intended to be used for low-rank decompositions.

• We propose an idea for defining the sketch operators upfront. In addition to increasing accuracy and reducing run time, it allows us to make several other improvements. These include only requiring a single pass of the data, and being able to handle tensors whose elements are streamed. To the best of our knowledge, ours are the only algorithms which can do this.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

1.1 A brief introduction to tensors and the Tucker decomposition

We use the same notation and definitions as in the review paper by Kolda and Bader [18]. Due to limited space, we only explain our notation here, with definitions given in Section S1 of the supplementary material. A tensor X ∈ R^(I1×I2×···×IN) is an array of dimension N, also called an N-way tensor. Boldface Euler script letters, e.g. X, denote tensors of dimension 3 or greater; bold capital letters, e.g. X, denote matrices; bold lowercase letters, e.g. x, denote vectors; and lowercase letters, e.g. x, denote scalars. For scalars indicating dimension size, uppercase letters, e.g. I, will be used. "⊗" and "⊙" denote the Kronecker and Khatri-Rao products, respectively. The mode-n matricization of a tensor X ∈ R^(I1×I2×···×IN) is denoted by X_(n) ∈ R^(In × ∏_{i≠n} Ii). Similarly, x(:) ∈ R^(∏_n In) denotes the vectorization of X into a column vector. The n-mode tensor-times-matrix (TTM) product of X and a matrix A ∈ R^(J×In) is denoted by X ×_n A ∈ R^(I1×···×I(n−1)×J×I(n+1)×···×IN).
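As a concrete illustration of this notation (our own NumPy sketch with arbitrary small sizes, not code from the paper), the following checks the standard identity relating a TTM chain, matricization, and Kronecker products, which underlies the least-squares formulations used throughout the paper: (G ×_1 A(1) ×_2 A(2) ×_3 A(3))_(1) = A(1) G_(1) (A(3) ⊗ A(2))^T.

```python
import numpy as np

rng = np.random.default_rng(0)
I, R = (5, 6, 7), (2, 3, 4)                     # arbitrary small sizes
G = rng.standard_normal(R)                      # 3-way core tensor
A = [rng.standard_normal((I[n], R[n])) for n in range(3)]

# X = G x_1 A(1) x_2 A(2) x_3 A(3), written as one contraction
X = np.einsum('abc,ia,jb,kc->ijk', G, A[0], A[1], A[2])

def unfold(T, n):
    # mode-n matricization T_(n) in the Kolda-Bader column ordering
    return np.reshape(np.moveaxis(T, n, 0), (T.shape[n], -1), order='F')

# matricization identity: X_(1) = A(1) G_(1) (A(3) kron A(2))^T
assert np.allclose(unfold(X, 0), A[0] @ unfold(G, 0) @ np.kron(A[2], A[1]).T)
```

The Kronecker product on the right-hand side is exactly the structure that makes the design matrices in the decomposition problem too large to form explicitly, and that the sketching developed below exploits.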
The norm of X is defined as ‖X‖ = ‖x(:)‖_2. For a positive integer n, we use the notation [n] := {1, 2, ..., n}.

There are multiple tensor decompositions. In this paper, we consider the Tucker decomposition. A Tucker decomposition of a tensor X ∈ R^(I1×I2×···×IN) is

   X = G ×_1 A(1) ×_2 A(2) ··· ×_N A(N) =: ⟦G; A(1), A(2), ..., A(N)⟧,   (1)

where G ∈ R^(R1×R2×···×RN) is called the core tensor and each A(n) ∈ R^(In×Rn) is called a factor matrix. Without loss of generality, the factor matrices can be assumed to have orthonormal columns, which we will assume as well. We say that X in (1) is a rank-(R1, R2, ..., RN) tensor.

The Tucker decomposition problem of decomposing a data tensor Y ∈ R^(I1×I2×···×IN) can be formulated as

   arg min_{G, A(1), ..., A(N)} { ‖Y − ⟦G; A(1), ..., A(N)⟧‖ : G ∈ R^(R1×···×RN), A(n) ∈ R^(In×Rn) for n ∈ [N] }.   (2)

The standard approach to this problem is to use alternating least squares (ALS). By rewriting the objective function appropriately (use e.g. Proposition 3.7 in [17]), we get the following steps, which are repeated until convergence:

1. For n = 1, ..., N, update

   A(n) = arg min_{A ∈ R^(In×Rn)} ‖ (A(N) ⊗ ··· ⊗ A(n+1) ⊗ A(n−1) ⊗ ··· ⊗ A(1)) G_(n)^T A^T − Y_(n)^T ‖_F^2.   (3)

2.
Update

   G = arg min_{Z ∈ R^(R1×···×RN)} ‖ (A(N) ⊗ A(N−1) ⊗ ··· ⊗ A(1)) z(:) − y(:) ‖_2^2.   (4)

One can show that the solution for the nth factor matrix A(n) in (3) is given by the Rn leading left singular vectors of the mode-n matricization of Y ×_1 A(1)^T ··· ×_{n−1} A(n−1)^T ×_{n+1} A(n+1)^T ··· ×_N A(N)^T. Since each A(i) has orthonormal columns, it turns out that the solution to (4) is given by G = Y ×_1 A(1)^T ×_2 A(2)^T ··· ×_N A(N)^T. These insights lead to Algorithm 1, which we will refer to as TUCKER-ALS. It is also frequently called higher-order orthogonal iteration (HOOI), and is more accurate than the higher-order SVD (HOSVD), another standard algorithm for Tucker decomposition. More details can be found in [18].

Algorithm 1: TUCKER-ALS (aka HOOI)
input: Y, target rank (R1, R2, ..., RN)
output: Rank-(R1, R2, ..., RN) Tucker decomposition ⟦G; A(1), ..., A(N)⟧ of Y
1 Initialize A(2), A(3), ..., A(N)
2 repeat
3   for n = 1, ..., N do
4     Z = Y ×_1 A(1)^T ··· ×_{n−1} A(n−1)^T ×_{n+1} A(n+1)^T ··· ×_N A(N)^T
5     A(n) = Rn leading left singular vectors of Z_(n)   /* Solves Eq. (3) */
6   end
7 until termination criteria met
8 G = Y ×_1 A(1)^T ×_2 A(2)^T ··· ×_N A(N)^T   /* Solves Eq. (4) */
9 return G, A(1), ..., A(N)

1.2 A brief introduction to TensorSketch

In this paper, we apply TENSORSKETCH to approximate the solution of large overdetermined least-squares problems, and to approximate chains of TTM products similar to those in (1). TENSORSKETCH is a randomized method which allows us to reduce the cost and memory usage of these computations in exchange for somewhat reduced accuracy. It can be seen as a specialized version of another sketching method called COUNTSKETCH, which was introduced in [7] and further analyzed in [8]. One way to define a COUNTSKETCH operator S : R^I → R^J is as S = PD, where

• P ∈ R^(J×I) is a matrix with p_{h(i),i} = 1, and all other entries equal to 0;
• h : [I] → [J] is a random map such that (∀i ∈ [I])(∀j ∈ [J]) P(h(i) = j) = 1/J; and
• D ∈ R^(I×I) is a diagonal matrix, with each diagonal entry equal to +1 or −1 with equal probability.

Due to the special structure of S, it is inefficient to store it as a full matrix. When applying S to a matrix A, it is better to do this implicitly, which costs only O(nnz(A)) and avoids storing S as a full matrix. Here, nnz(A) denotes the number of nonzero elements of A.

TENSORSKETCH was first introduced in 2013 in [26], where it is applied to compressed matrix multiplication. In [27], it is used for approximating support vector machine polynomial kernels efficiently. Avron et al. [2] show that TENSORSKETCH provides an oblivious subspace embedding. Diao et al. [10] provide theoretical guarantees which we will rely on in this paper. Below is an informal summary of those results we will use; for further details, see the paper by Diao et al., especially Theorem 3.1 and Lemma B.1. Let A ∈ R^(L×M) be a matrix, where L ≫ M.
Like other classes of sketches, an instantiation of TENSORSKETCH is a linear map T : R^L → R^J, where J ≪ L, such that, if y ∈ R^L and x̃ := arg min_x ‖TAx − Ty‖_2, then for J sufficiently large (depending on ε > 0), with high probability ‖Ax̃ − y‖_2 ≤ (1 + ε) min_x ‖Ax − y‖_2.

The distinguishing feature of TENSORSKETCH is that if the matrix A is of the form A = A(N) ⊗ A(N−1) ⊗ ··· ⊗ A(1), where each A(n) ∈ R^(I×R) with I ≫ R, then the cost of computing TA can be shown to be O(NIR + JR^N), excluding log factors, whereas naive matrix multiplication would cost O(JI^N R^N). Moreover, TA can be computed without ever forming the full matrix A. One can show that this is achievable by first applying an independent COUNTSKETCH operator S(n) ∈ R^(J×I) to each factor matrix A(n) and then computing the full TENSORSKETCH using the fast Fourier transform (FFT). The formula for this is

   TA = T (A(N) ⊗ ··· ⊗ A(1)) = FFT^(−1)( ( (FFT(S(N)A(N)))^T ⊙ ··· ⊙ (FFT(S(1)A(1)))^T )^T ).   (5)

These results generalize to the case when the factor matrices are of different sizes. In Section S2 of the supplementary material, we provide a more thorough introduction to COUNTSKETCH and TENSORSKETCH, including how to arrive at the formula (5).

2 Related work

Randomized algorithms have been applied to tensor decompositions before. Wang et al. [31] and Battaglino et al. [5] apply sketching techniques to the CANDECOMP/PARAFAC (CP) decomposition. Drineas and Mahoney [11], Zhou and Cichocki [32], Da Costa et al. [9] and Tsourakakis [30] propose different randomized methods for computing the HOSVD. The method in [30], which is called MACH, is also extended to computing HOOI. Mahoney et al.
[23] and Caiafa and Cichocki [6] present results\n\n3\n\n\fthat extend the CUR factorization for matrices to tensors. Other decomposition methods that only\nconsider a small number of the tensor entries include those by Oseledets et al. [25] and Friedland et\nal. [13].\nAnother approach to decomposing large tensors is to use memory ef\ufb01cient and distributed methods.\nKolda and Sun [19] introduce the Memory Ef\ufb01cient Tucker (MET) decomposition for sparse tensors\nas a solution to the so called intermediate blow-up problem which occurs when computing the chain\nof TTM products in HOOI. Other papers that use memory ef\ufb01cient and distributed methods include\n[4, 20, 21, 22, 28, 15, 1, 16, 24].\nOther research focuses on handling streamed tensor data. Sun et al. [29] introduce a framework for\nincremental tensor analysis. The basic idea of their method is to \ufb01nd one set of factor matrices which\nworks well for decomposing a sequence of tensors that arrive over time. Fanaee-T and Gama [12]\nintroduce multi-aspect-streaming tensor analysis which is based on the histogram approximation\nconcept rather than linear algebra techniques. Neither of these methods correspond to Tucker\ndecomposition of a tensor whose elements are streamed. Gujral et al. [14] present a method for\nincremental CP decomposition.\nWe compare our algorithms to TUCKER-ALS and MET in Tensor Toolbox version 2.6 [3, 19],\nFSTD1 with adaptive index selection from [6], as well as the HOOI version of the MACH algorithm\nin [30].1 TUCKER-ALS and MET, which are mathematically equivalent, provide good accuracy, but\nrun out of memory as the tensor size increases. MACH scales somewhat better, but also runs out of\nmemory for larger tensors. Its accuracy is also lower than that of TUCKER-ALS/MET. None of these\nalgorithms are one-pass. FSTD1 scales well, but has accuracy issues on very sparse tensors. 
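Before describing our algorithms, the FFT identity behind formula (5) is worth seeing in action. The snippet below (our own illustration; sizes and seeds are arbitrary) forms the TENSORSKETCH of x ⊗ y directly, using the combined hash h(i, j) = (h1(i) + h2(j)) mod J and sign s1(i)·s2(j), and checks that it matches the FFT-based circular convolution of the two COUNTSKETCHes, which never forms the Kronecker product:

```python
import numpy as np

rng = np.random.default_rng(0)
I, J = 30, 16

def countsketch(x, h, s, J):
    """Apply S = PD implicitly: O(nnz(x)) time, S is never formed."""
    out = np.zeros(J)
    np.add.at(out, h, s * x)   # accumulate signed entries into buckets
    return out

h1, h2 = rng.integers(0, J, I), rng.integers(0, J, I)
s1, s2 = rng.choice([-1.0, 1.0], I), rng.choice([-1.0, 1.0], I)
x, y = rng.standard_normal(I), rng.standard_normal(I)

# direct TensorSketch: CountSketch of x kron y with the combined hash/sign
h = (h1[:, None] + h2[None, :]).reshape(-1) % J
s = (s1[:, None] * s2[None, :]).reshape(-1)
direct = countsketch(np.kron(x, y), h, s, J)

# FFT route of Eq. (5): sketch the factors, multiply in the Fourier domain
fast = np.fft.ifft(np.fft.fft(countsketch(x, h1, s1, J)) *
                   np.fft.fft(countsketch(y, h2, s2, J))).real

assert np.allclose(direct, fast)
```

The I^2-length Kronecker product appears above only for verification; the FFT route touches each factor just once, which is the source of the O(NIR + JR^N) cost quoted in Section 1.2.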
FSTD1 does not need to access all elements of the tensor and is one-pass, but since the entire tensor needs to remain accessible, the method cannot handle streamed data.

3 Tucker decomposition using TensorSketch

We now present our proposed algorithms. More detailed versions of them can be found in Section S3 of the supplement. A Matlab implementation of our algorithms can be found at https://github.com/OsmanMalik/tucker-tensorsketch.

3.1 First proposed algorithm: TUCKER-TS

For our first algorithm, we TENSORSKETCH both the least-squares problems in (3) and (4), and then solve the smaller resulting problems. We give an algorithm for this approach in Algorithm 2. We call it TUCKER-TS, where "TS" stands for TENSORSKETCH. The core tensor and factor matrices on line 1 are initialized randomly with each element i.i.d. Uniform(−1, 1). The factor matrices are subsequently orthogonalized. On line 2 we define TENSORSKETCH operators of appropriate size. This is done by first defining COUNTSKETCH operators S_1^(n) ∈ R^(J1×In) and S_2^(n) ∈ R^(J2×In) for n ∈ [N], as explained in Section 1.2. Then each operator T(n), for n ∈ [N], is defined as in (5) but based on {S_1^(n)}_{n∈[N]} and with the nth term excluded from the Kronecker and Khatri-Rao products. T(N+1) is defined similarly, but based on {S_2^(n)}_{n∈[N]} and without excluding any terms from the Kronecker and Khatri-Rao products. The reason we use two different sets {S_1^(n)}_{n∈[N]} and {S_2^(n)}_{n∈[N]} of COUNTSKETCH operators, with different target sketch dimensions J1 and J2, respectively, is that the design matrix in (4) has more rows than that in (3). In practice, this means that we choose J2 > J1. In Section 4 we provide some guidance on how to choose J1 and J2. We also want to point out that none of the sketch operators are stored explicitly as matrices in our implementation.
Instead, we only generate and store the function h and the diagonal of D, which were defined in Section 1.2, for each COUNTSKETCH operator. We then use the formula in (5) when applying one of the TENSORSKETCH operators to a Kronecker product matrix. The computations T(n) Y_(n)^T and T(N+1) y(:) on lines 5 and 7 cannot be done using the formula in (5), but are still computed implicitly without forming any full sketching matrices.

1 For FSTD1, we use the Matlab code from the website of one of the authors (http://ccaiafa.wixsite.com/cesar). For MACH, we adapted the Python code provided on the author's website (https://tsourakakis.com/mining-tensors/) to Matlab. MACH requires an algorithm for computing the HOOI decomposition of the sparsified tensor. For this, we use TUCKER-ALS and then switch to higher orders of MET as necessary when we run out of memory. As recommended in [30], we keep each nonzero entry in the original tensor with probability 0.1 when using MACH.

Since all sketch operators used on line 5 are defined in terms of the same set {S_1^(n)}_{n∈[N]}, the least-squares problem on all iterations of that line except the first will depend in some way on the sketch operator T(n) being applied. A similar dependence will exist between the least-squares problem on line 7 and T(N+1) beyond the first iteration. It is important to note that the guarantees for TENSORSKETCHED least squares in [10] hold when the random sketch is independent of the least-squares problem it is applied to. For these guarantees to hold, we would need to define a new TENSORSKETCH operator each time a least-squares problem is solved in Algorithm 2. In all of our experiments, we observe that our approach of instead defining the sketch operators upfront leads to a substantial reduction in the error for the algorithm as a whole (see Figure 1).
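The sketch-and-solve pattern that lines 5 and 7 apply to (3) and (4) can be seen on a plain tall least-squares problem. The snippet below is a simplified stand-in of our own (a single COUNTSKETCH rather than the Kronecker-structured TENSORSKETCH, and all sizes are arbitrary): the sketched solution comes close to the full solution while solving a system with far fewer rows.

```python
import numpy as np

rng = np.random.default_rng(1)
L, M, J = 20000, 10, 400                 # tall problem, much smaller sketch
A = rng.standard_normal((L, M))
x_true = rng.standard_normal(M)
y = A @ x_true + 0.01 * rng.standard_normal(L)

# apply a CountSketch S = PD implicitly to A and y (S is never formed)
h = rng.integers(0, J, L)
s = rng.choice([-1.0, 1.0], L)
SA = np.zeros((J, M)); np.add.at(SA, h, s[:, None] * A)
Sy = np.zeros(J);      np.add.at(Sy, h, s * y)

x_sketch = np.linalg.lstsq(SA, Sy, rcond=None)[0]  # solve the J-row problem
x_full   = np.linalg.lstsq(A,  y,  rcond=None)[0]  # solve the L-row problem
gap = np.linalg.norm(x_sketch - x_full)
```

In runs like this the gap is typically orders of magnitude smaller than ‖x_full‖; Algorithm 2 plays the same game with T(n) in place of S and a Kronecker-structured design matrix it never materializes.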
We have not yet been able to provide theoretical justification for why this is.

The following proposition shows that the normal-equations formulation of the least-squares problem on line 7 in Algorithm 2 is well-conditioned with high probability if J2 is sufficiently large, and can therefore be solved efficiently using the conjugate gradient (CG) method. This holds because the factor matrices are orthogonal; it does not hold for the smaller system on line 5, so that system we solve via direct methods. In our experiments, for an accuracy of 1e-6, CG takes about 15 iterations regardless of I. A proof of Proposition 3.1 is provided in Section S4 of the supplementary material.

Proposition 3.1. Assume T(N+1) is defined as on line 2 in Algorithm 2. Let

   M := (T(N+1) (A(N) ⊗ ··· ⊗ A(1)))^T (T(N+1) (A(N) ⊗ ··· ⊗ A(1))),

where all A(n) have orthonormal columns, and suppose ε, δ ∈ (0, 1). If J2 ≥ (∏_n Rn)^2 (2 + 3^N)/(ε^2 δ), then the 2-norm condition number of M satisfies κ(M) ≤ (1 + ε)^2/(1 − ε)^2 with probability at least 1 − δ.

Remark 3.2. Defining the sketching operators upfront allows us to make the following improvements:

(a) Since Y remains unchanged throughout the algorithm, the N + 1 sketches of Y only need to be computed once, which we do upfront in a single pass over the data (using a careful implementation). This can also be done if elements of Y are streamed.

(b) Since the same COUNTSKETCH is applied to each A(n) when sketching the Kronecker product in the inner loop, we can compute the quantity Â_s1^(n) := (FFT(S_1^(n) A(n)))^T after updating A(n) and reuse it when computing other factor matrices until A(n) is updated again.

(c) When In ≥ J1 + J2 for some n ∈ [N], we can reduce the size of the least-squares problem on line 5.
Note that the full matrix A(n) is not needed until the return statement; only the sketches S_1^(n) A(n) and S_2^(n) A(n) are necessary to compute the different TENSORSKETCHES. Replacing T(n) Y_(n)^T on line 5 with T(n) [Y_(n)^T S_1^(n)T, Y_(n)^T S_2^(n)T], which also can be computed upfront, we get a smaller least-squares problem which has the solution [S_1^(n) A(n), S_2^(n) A(n)]. Before the return statement, we then compute the full factor matrix A(n). With this adjustment, we cannot orthogonalize the factor matrices on each iteration, and therefore Proposition 3.1 does not apply. In this case, we therefore use a dense method instead of CG when computing G in Algorithm 2.

3.2 Second proposed algorithm: TUCKER-TTMTS

We can rewrite the TTM product on line 4 of Algorithm 1 as Z_(n) = Y_(n) (A(N) ⊗ ··· ⊗ A(n+1) ⊗ A(n−1) ⊗ ··· ⊗ A(1)). We TENSORSKETCH this formulation as follows:

   Z̃_(n) = (T(n) Y_(n)^T)^T T(n) (A(N) ⊗ ··· ⊗ A(n+1) ⊗ A(n−1) ⊗ ··· ⊗ A(1)),   n ∈ [N],

where each T(n) ∈ R^(J1 × ∏_{i≠n} Ii) is a TENSORSKETCH operator with target dimension J1. We can similarly sketch the computation on line 8 in Algorithm 1 using a TENSORSKETCH operator T(N+1) ∈ R^(J2 × ∏_i Ii) with target dimension J2. Replacing the computations on lines 4 and 8 in Algorithm 1 with these sketched computations, we get our second algorithm, which we call TUCKER-TTMTS, where "TTMTS" stands for "TTM TENSORSKETCH." The algorithm is given in Algorithm 3. The initialization of the factor matrices on line 1a and the definition of the sketching operators on line 1b are done in the same way as in Algorithm 2. Since the sketch operators are defined upfront here as well, the same caveat applies here as for Algorithm 2.
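The flavor of Proposition 3.1 above is easy to reproduce numerically. Below we sketch a matrix with orthonormal columns, standing in for the Kronecker factor ⊗_i A(i), with a plain COUNTSKETCH (our own simplification of the proposition's TENSORSKETCH setting; all sizes are arbitrary) and observe that the normal-equations matrix stays well-conditioned, which is what makes CG effective:

```python
import numpy as np

rng = np.random.default_rng(2)
I, R, J = 5000, 5, 2000              # sketch dimension well above R^2

# U plays the role of the design matrix with orthonormal columns
U = np.linalg.qr(rng.standard_normal((I, R)))[0]

# implicit CountSketch of U
h = rng.integers(0, J, I)
s = rng.choice([-1.0, 1.0], I)
SU = np.zeros((J, R)); np.add.at(SU, h, s[:, None] * U)

M = SU.T @ SU                        # normal-equations matrix, approx. identity
cond = np.linalg.cond(M)             # close to 1, so CG converges quickly
```

Shrinking J toward R degrades the conditioning, which is the regime the proposition's lower bound on J2 rules out.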
The main benefit of TUCKER-TTMTS over TUCKER-TS is that it scales better with the target rank (see Section 3.4).

Algorithm 2: TUCKER-TS (proposal)
input: Y, target rank (R1, R2, ..., RN), sketch dimensions (J1, J2)
output: Rank-(R1, R2, ..., RN) Tucker decomposition ⟦G; A(1), ..., A(N)⟧ of Y
1 Initialize G, A(2), A(3), ..., A(N)
2 Define TENSORSKETCH operators T(n) ∈ R^(J1 × ∏_{i≠n} Ii), for n ∈ [N], and T(N+1) ∈ R^(J2 × ∏_i Ii)
3 repeat
4   for n = 1, ..., N do
5     A(n) = arg min_A ‖ T(n) (A(N) ⊗ ··· ⊗ A(n+1) ⊗ A(n−1) ⊗ ··· ⊗ A(1)) G_(n)^T A^T − T(n) Y_(n)^T ‖_F^2
6   end
7   G = arg min_Z ‖ T(N+1) (A(N) ⊗ ··· ⊗ A(1)) z(:) − T(N+1) y(:) ‖_2^2
8   Orthogonalize each A(i) and update G
9 until termination criteria met
10 return G, A(1), ..., A(N)

The following informal proposition shows that the error for each sketched computation in TUCKER-TTMTS is additive rather than multiplicative, as it is for TUCKER-TS. A formal statement and proof are given in Section S5 of the supplementary material.

Proposition 3.3 (TUCKER-TTMTS (informal)). Assume each TENSORSKETCH operator is redefined prior to being used. Let OBJ denote the objective function in (3), and let Ã(n) be the Rn leading left singular vectors of Z_(n) defined on line 4 in Algorithm 3. Under certain conditions, Ã(n) satisfies OBJ(Ã(n)) ≤ min_A OBJ(A) + εC with high probability if J1 is sufficiently large, where C depends on Y, the target rank, and the other factor matrices. A similar result holds for the update on line 8 and the objective function in (4).

Algorithm 3: TUCKER-TTMTS (proposal)
/* Identical to Algorithm 1, except for the lines below */
1a Initialize A(2), A(3), ..., A(N)
1b Define TENSORSKETCH operators T(n) ∈ R^(J1 × ∏_{i≠n} Ii), for n ∈ [N], and T(N+1) ∈ R^(J2 × ∏_i Ii)
4 Z_(n) = (T(n) Y_(n)^T)^T (T(n) (A(N) ⊗ ··· ⊗ A(n+1) ⊗ A(n−1) ⊗ ··· ⊗ A(1)))
8 g(:) = (T(N+1) (A(N) ⊗ ··· ⊗ A(1)))^T T(N+1) y(:)

3.3 Stopping conditions and orthogonalization

Unless stated otherwise, we stop after 50 iterations or when the change in ‖G‖ is less than 1e-3. The same type of convergence criteria are used in [19]. In Algorithm 2, we orthogonalize the factor matrices and update G using the reduced QR factorization. If we use the improvement in Remark 3.2 (c), we need to approximate G. This is discussed in Section S3.1 of the supplementary material. In Algorithm 3, we compute an estimate of G using the same formula as on line 8, but with the smaller sketch dimension J1 instead. Unlike in TUCKER-ALS, the objective is not guaranteed to decrease on each iteration of our algorithms. Despite this, the only practical difference between our algorithms and TUCKER-ALS is that the tolerance may need to be set differently.

We would like to point out that we cannot provide global convergence guarantees for our algorithms. Although a global analysis would be desirable, it is important to note that such an analysis is difficult even for TUCKER-ALS. Indeed, TUCKER-ALS is not guaranteed to converge to the global optimum or even a stationary point (see Section 4.2 in [18]).

3.4 Complexity analysis

We compare the complexity of Algorithms 1-3, FSTD1 and MACH for the case N = 3. We assume that In = I and Rn = R < I for all n ∈ [N].
Furthermore, we assume that J1 = KR^(N−1) and J2 = KR^N for some constant K > 1, which is a choice that works well in practice. Table 1 shows the complexity when each of the variables I and R is assumed to be large. A more detailed complexity analysis of the proposed algorithms is given in Section S3.2 of the supplementary material.

Table 1: Leading-order computational complexity, ignoring log factors and assuming K = O(1), where #iter is the number of main-loop iterations and Y is the 3-way data tensor we decompose.

Algorithm                    | I = size of fiber assumed large | R = rank assumed large
T.-ALS (Alg. 1)              | (#iter + 1) · RI^3              | (#iter + 1) · RI^3
FSTD1 [6]                    | IR^4                            | R^5
MACH [30]                    | (#iter + 1) · RI^3              | (#iter + 1) · RI^3
T.-TS (proposal, Alg. 2)     | nnz(Y) + IR^4                   | R^3 + #iter · R^6
T.-TTMTS (proposal, Alg. 3)  | nnz(Y) + IR^4 + #iter · IR^4    | R^6 + #iter · R^4

The main benefit of our proposed algorithms is reducing the O(I^N) complexity of Algorithm 1 to R^O(N) complexity due to the sketching, since typically R ≪ I. The complexity of MACH is the same as that of TUCKER-ALS, but with a smaller constant factor.

4 Experiments

In this section we present results from experiments. Our Matlab implementation, linked at the beginning of Section 3, comes with demo script files for running experiments similar to those presented here. All synthetic results are averages over ten runs in an environment using four cores of an Intel Xeon E5-2680 v3 @ 2.50 GHz CPU and 21 GB of RAM. For Algorithms 2 and 3, the sketch dimensions J1 and J2 must be chosen. We have found that the choice J1 = KR^(N−1) and J2 = KR^N, for a constant K > 4, works well in practice. Figure 1 shows examples of how the error of TUCKER-TS and TUCKER-TTMTS, relative to that of TUCKER-ALS, changes with K.
It also shows results for variants of each algorithm in which the TENSORSKETCH operator is redefined each time it is used (called "multi-pass" in the figure). For both algorithms, defining the TENSORSKETCH operators upfront leads to higher accuracy than redefining them before each application. In subsequent experiments, we always define the sketch operators upfront (i.e., as written in Algorithms 2 and 3) and, unless stated otherwise, always use K = 10.

Figure 1: Errors of TUCKER-TS and TUCKER-TTMTS, relative to that of TUCKER-ALS, for different values of the sketch dimension parameter K. For both plots, the tensor size is 500 × 500 × 500 with nnz(Y) ≈ 1e+6 and true rank (15, 15, 15). The algorithms use a target rank of (10, 10, 10).

4.1 Sparse synthetic data

In this subsection, we apply our algorithms to synthetic sparse tensors. For all synthetic data we use In = I and Rn = R for all n ∈ [N]. The sparse tensors are each created from a random dense core tensor and random sparse factor matrices, where the sparsity of the factor matrices is chosen
However, when Y is very\nsparse, it frequently happens that whole \ufb01bers in the residual tensor are zero. In those cases, the\nalgorithm fails to \ufb01nd a good set of indices. This explains its poor accuracy in our experiments. We\nsee that TUCKER-TS performs very well when Y truly is low-rank and we use that same rank for\nreconstruction. TUCKER-TTMTS in general has a larger error than TUCKER-TS, but scales better\nwith higher target rank. Moreover, when the true rank of the input tensor is greater than the target\nrank (Figure 3), which is closer to what real data might look like, the error of TUCKER-TTMTS is\nmuch closer to that of TUCKER-TS.\n\nFigure 2: Relative error and run time for random sparse 3-way tensors with varying dimension size I\nand nnz(Y) \u2248 1e+6. Both the true and target ranks are (10, 10, 10).\n\nFigure 3: Relative error and run time for random sparse 3-way tensors with varying dimension size\nI and nnz(Y) \u2248 1e+6. The true rank is (15, 15, 15) and target rank is (10, 10, 10). A convergence\ntolerance of 1e-1 is used for these experiments.\n\nFigure 4: Relative error and run time for random sparse 3-way tensors with dimension size I = 1e+4\nand varying number of nonzeros. Both the true and target ranks are (10, 10, 10).\n\n4.2 Dense real-world data\n\nIn this section we apply TUCKER-TTMTS to a real dense tensor representing a grayscale video.\nThe video consists of 2,200 frames, each of size 1,080 by 1,980 pixels. 
Figure 5: Relative error and run time for random sparse 3-way tensors with dimension size I = 1e+4 and nnz(Y) ≈ 1e+7. The true and target ranks are (R, R, R), with R varying.

The whole tensor, which requires 38 GB of RAM, is too large to load into memory all at once. Instead, it is loaded in pieces which are sketched and then added together. The video shows a natural scene, which is disturbed by a person passing by the camera twice. Since the camera is in a fixed position, we can expect this tensor to be compressible. We compute a rank-(10, 10, 10) Tucker decomposition of the tensor using TUCKER-TTMTS with the sketch dimension parameter set to K = 100 and a maximum of 30 iterations.
We then apply k-means clustering to the factor matrix A(3) ∈ R^{2200×10} corresponding to the time dimension, classifying each frame using the corresponding row of A(3) as a feature vector. We find that using three clusters works better than using two. We believe this is because the light intensity changes throughout the video as clouds pass, which introduces a third time-varying factor. Figure 6 shows five sample frames with their assigned clusters. With few exceptions, the frames which contain a disturbance are correctly grouped together into class 3, with the remaining frames grouped into classes 1 and 2. The video experiment is online; a link to it is provided at https://github.com/OsmanMalik/tucker-tensorsketch.

Figure 6: Five sample frames (frames 500, 1450, 1650, 1850, and 2000) with their assigned classes. Frames (b) and (d) contain a disturbance.

5 Conclusion

We have proposed two algorithms for low-rank Tucker decomposition which incorporate TENSORSKETCH and can handle streamed data. Experiments corroborate our complexity analysis, which shows that the algorithms scale well both with dimension size and density.
TUCKER-TS, and to a lesser extent TUCKER-TTMTS, scale poorly with target rank, so they are most useful when R ≪ I.

Acknowledgments

We would like to thank the reviewers for their many helpful comments and suggestions, which helped improve this paper.

This material is based upon work supported by the National Science Foundation under Grant No. 1810314.

This work utilized the RMACC Summit supercomputer, which is supported by the National Science Foundation (awards ACI-1532235 and ACI-1532236), the University of Colorado Boulder, and Colorado State University. The Summit supercomputer is a joint effort of the University of Colorado Boulder and Colorado State University.