{"title": "Legendre Decomposition for Tensors", "book": "Advances in Neural Information Processing Systems", "page_first": 8811, "page_last": 8821, "abstract": "We present a novel nonnegative tensor decomposition method, called Legendre decomposition, which factorizes an input tensor into a multiplicative combination of parameters. Thanks to the well-developed theory of information geometry, the reconstructed tensor is unique and always minimizes the KL divergence from an input tensor. We empirically show that Legendre decomposition can more accurately reconstruct tensors than other nonnegative tensor decomposition methods.", "full_text": "Legendre Decomposition for Tensors\n\nMahito Sugiyama\n\nNational Institute of Informatics\n\nJST, PRESTO\n\nHiroyuki Nakahara\n\nRIKEN Center for Brain Science\n\nhiro@brain.riken.jp\n\nKoji Tsuda\n\nThe University of Tokyo\n\nNIMS; RIKEN AIP\n\nmahito@nii.ac.jp\n\ntsuda@k.u-tokyo.ac.jp\n\nAbstract\n\nWe present a novel nonnegative tensor decomposition method, called Legendre\ndecomposition, which factorizes an input tensor into a multiplicative combination\nof parameters. Thanks to the well-developed theory of information geometry, the\nreconstructed tensor is unique and always minimizes the KL divergence from an in-\nput tensor. 
We empirically show that Legendre decomposition can more accurately reconstruct tensors than other nonnegative tensor decomposition methods.

1 Introduction

Matrix and tensor decomposition is a fundamental technique in machine learning; it is used to analyze data represented in the form of multi-dimensional arrays, and it appears in a wide range of applications such as computer vision (Vasilescu and Terzopoulos, 2002, 2007), recommender systems (Symeonidis, 2016), signal processing (Cichocki et al., 2015), and neuroscience (Beckmann and Smith, 2005). The current standard approaches include nonnegative matrix factorization (NMF; Lee and Seung, 1999, 2001) for matrices and CANDECOMP/PARAFAC (CP) decomposition (Harshman, 1970) or Tucker decomposition (Tucker, 1966) for tensors. CP decomposition compresses an input tensor into a sum of rank-one components, and Tucker decomposition approximates an input tensor by a core tensor multiplied by matrices. To date, matrix and tensor decomposition has been extensively analyzed, and there are a number of variations of such decomposition (Kolda and Bader, 2009), where the common goal is to approximate a given tensor by a smaller number of components, or parameters, in an efficient manner.
However, despite recent advances in decomposition techniques, a learning theory that can systematically define decomposition for tensors of any order, including vectors and matrices, is still under development. Moreover, it is well known that CP and Tucker decomposition involve non-convex optimization, so global convergence is not guaranteed.
Although there are a number of extensions that transform the problem into a convex one (Liu et al., 2013; Tomioka and Suzuki, 2013), they require additional assumptions on the data, such as bounded variance.
Here we present a new paradigm of matrix and tensor decomposition, called Legendre decomposition, which is based on information geometry (Amari, 2016) and solves the above open problems of matrix and tensor decomposition. In our formulation, a tensor of arbitrary order is treated as a discrete probability distribution in a statistical manifold as long as it is nonnegative, and Legendre decomposition is realized as a projection of the input tensor onto a submanifold composed of reconstructable tensors. The key to this formulation is the partial order (Davey and Priestley, 2002; Gierz et al., 2003) on indices, which allows us to treat tensors of any order as probability distributions within the information geometric framework.
Legendre decomposition has the following remarkable property: it always finds the unique tensor that minimizes the Kullback–Leibler (KL) divergence from an input tensor. This is because Legendre decomposition is formulated as convex optimization; hence we can directly use gradient descent, which guarantees global convergence, and the optimization can be further sped up by using a natural gradient (Amari, 1998), as demonstrated in our experiments.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Figure 1: (a) Overview of Legendre decomposition. (b) Illustration of θ and η for a second-order tensor (matrix) when B = [I1] × [I2].
Moreover, Legendre decomposition is flexible: it can decompose sparse tensors by removing arbitrary entries beforehand, for example zeros or missing entries.
Furthermore, our formulation has a close relationship with statistical models and can be interpreted as an extension of the learning of Boltzmann machines (Ackley et al., 1985). This interpretation gives new insight into the relationship between tensor decomposition and graphical models (Chen et al., 2018; Yılmaz et al., 2011; Yılmaz and Cemgil, 2012), as well as the relationship between tensor decomposition and energy-based models (LeCun et al., 2007). In addition, we show that the proposed formulation belongs to the exponential family, where the parameter θ used in our decomposition is the natural parameter and η, used to obtain the gradient of θ, is the expectation parameter of the exponential family.
The remainder of this paper is organized as follows. We introduce Legendre decomposition in Section 2: we define the decomposition in Section 2.1, formulate it as convex optimization in Section 2.2, describe algorithms in Section 2.3, and discuss the relationship with other statistical models in Section 2.4. We empirically examine the performance of our method in Section 3 and summarize our contribution in Section 4.

2 The Legendre Decomposition

We introduce Legendre decomposition for tensors. We begin with a nonnegative Nth-order tensor X ∈ ℝ^{I1 × I2 × ⋯ × IN}_{≥0}. To simplify the notation, we write the entry x_{i1 i2 … iN} as x_v with the index vector v = (i1, i2, …, iN) ∈ [I1] × [I2] × ⋯ × [IN], where each [Ik] = {1, 2, …, Ik}. To treat X as a discrete probability mass function in our formulation, we normalize X by dividing each entry by the sum of all entries, yielding P = X / ∑_v x_v.
In the following, we always work with the normalized tensor P.

2.1 Definition

Legendre decomposition always finds the best approximation of a given tensor P. Our strategy is to choose an index set B ⊆ [I1] × [I2] × ⋯ × [IN] as a decomposition basis, where we assume (1, 1, …, 1) ∉ B for a technical reason, and to approximate the normalized tensor P by a multiplicative combination of parameters associated with B.
First we introduce the relation "≤" between index vectors u = (j1, …, jN) and v = (i1, …, iN) as u ≤ v if j1 ≤ i1, j2 ≤ i2, …, jN ≤ iN. It is clear that this relation gives a partial order (Gierz et al., 2003); that is, "≤" satisfies the following three properties for all u, v, w: (1) v ≤ v (reflexivity), (2) u ≤ v and v ≤ u imply u = v (antisymmetry), and (3) u ≤ v and v ≤ w imply u ≤ w (transitivity). Each tensor is treated as a discrete probability mass function with the sample space Ω ⊆ [I1] × ⋯ × [IN]. While it is natural to set Ω = [I1] × ⋯ × [IN], our formulation allows us to use any subset Ω ⊆ [I1] × ⋯ × [IN]. Hence, for example, we can remove unnecessary indices, such as missing or zero entries of an input tensor P, from Ω.
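To make the setup concrete, the following Python sketch normalizes a small nonnegative matrix into a probability mass function and implements the partial order "≤" on index vectors together with the down-sets it induces. This is illustrative only: names such as `leq` and `down_set` are ours, not from the authors' implementation, and indices are 0-based here rather than the paper's 1-based convention.

```python
import itertools
import numpy as np

# A toy 3x3 nonnegative input matrix (a second-order tensor).
X = np.array([[4.0, 2.0, 1.0],
              [2.0, 2.0, 1.0],
              [1.0, 1.0, 1.0]])

# Normalize so that P is a discrete probability mass function: P = X / sum_v x_v.
P = X / X.sum()

# Sample space Omega: all index vectors (0-based here).
Omega = list(itertools.product(range(3), range(3)))

def leq(u, v):
    """Partial order: u <= v iff u_k <= v_k in every mode k."""
    return all(j <= i for j, i in zip(u, v))

def down_set(v, S):
    """{ u in S | u <= v }."""
    return [u for u in S if leq(u, v)]
```

For instance, `down_set((1, 1), Omega)` collects exactly the four index vectors dominated by (1, 1), which is the set the decomposition in Section 2.1 multiplies over.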
We define Ω⁺ = Ω ∖ {(1, 1, …, 1)}.

We define a tensor Q ∈ ℝ^{I1 × I2 × ⋯ × IN}_{≥0} to be fully decomposable with B ⊆ Ω⁺ if each entry q_v with v ∈ Ω is represented in the form

  q_v = (1 / exp(ψ(θ))) ∏_{u ∈ ↓v} exp(θ_u),   ↓v = { u ∈ B | u ≤ v },   (1)

using |B| parameters (θ_v)_{v∈B} with θ_v ∈ ℝ and the normalizer ψ(θ) ∈ ℝ, which is always uniquely determined from the parameters (θ_v)_{v∈B} as

  ψ(θ) = log ∑_{v∈Ω} ∏_{u ∈ ↓v} exp(θ_u).

This normalization does not have any effect on the decomposition performance; rather, it is needed to formulate our decomposition as an information geometric projection, as shown in the next subsection. There are two extreme cases for the choice of a basis B: if B = ∅, a fully decomposable Q is always uniform, that is, q_v = 1/|Ω| for all v ∈ Ω. In contrast, if B = Ω⁺, any input P itself becomes decomposable.
We now define Legendre decomposition as follows: given an input tensor P ∈ ℝ^{I1 × I2 × ⋯ × IN}_{≥0}, the sample space Ω ⊆ [I1] × [I2] × ⋯ × [IN], and a parameter basis B ⊆ Ω⁺, Legendre decomposition finds the fully decomposable tensor Q ∈ ℝ^{I1 × I2 × ⋯ × IN}_{≥0} with B that minimizes the Kullback–Leibler (KL) divergence D_KL(P, Q) = ∑_{v∈Ω} p_v log(p_v / q_v) (Figure 1[a]).
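Equation (1) can be sketched directly: given parameters θ on a basis B, each q_v multiplies exp(θ_u) over the down-set of v restricted to B, and exp(ψ(θ)) normalizes the result. A minimal Python illustration (0-based indices; `decomposable_tensor` is a hypothetical helper name, not the authors' API):

```python
import itertools
import math

Omega = list(itertools.product(range(3), range(3)))  # 3x3 sample space, 0-based

def leq(u, v):
    return all(j <= i for j, i in zip(u, v))

def decomposable_tensor(theta, Omega):
    """Eq. (1): q_v = exp( sum of theta_u over {u in B | u <= v} - psi(theta) )."""
    log_q = {v: sum(t for u, t in theta.items() if leq(u, v)) for v in Omega}
    psi = math.log(sum(math.exp(s) for s in log_q.values()))  # normalizer psi(theta)
    return {v: math.exp(s - psi) for v, s in log_q.items()}

# A basis with two parameters; (0, 0) plays the role of (1, ..., 1) and is excluded.
theta = {(1, 0): 0.5, (0, 1): -0.3}
Q = decomposable_tensor(theta, Omega)
```

By construction Q sums to one, and the multiplicative form implies identities such as q_(0,0) q_(1,1) = q_(1,0) q_(0,1) on this basis.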
In the next subsection, we introduce additional parameters (η_v)_{v∈B} and show that this decomposition is always possible via the dual parameters ((θ_v)_{v∈B}, (η_v)_{v∈B}) with information geometric analysis. Since θ and η are connected via the Legendre transformation, we call our method Legendre decomposition.
Legendre decomposition for second-order tensors (that is, matrices) can be viewed as a low rank approximation not of an input matrix P but of its entry-wise logarithm log P. To see this, consider the parameter matrix T ∈ ℝ^{I1×I2} such that t_v = θ_v if v ∈ B, t_(1,1) = −ψ(θ), and t_v = 0 otherwise; that is, we fill zeros into the entries in Ω⁺ ∖ B. Then we have log q_v = ∑_{u ≤ v} t_u, meaning that the rank of log Q coincides with the rank of T. Therefore, if we use a decomposition basis B that includes only l rows (or columns), rank(log Q) ≤ l always holds.

2.2 Optimization

We solve the Legendre decomposition by formulating it as a convex optimization problem. Let us first assume that B = Ω⁺ = Ω ∖ {(1, 1, …, 1)}, which means that any tensor is fully decomposable. Our definition in Equation (1) can be re-written as

  log q_v = ∑_{u ∈ Ω⁺} ζ(u, v)θ_u − ψ(θ) = ∑_{u ∈ Ω} ζ(u, v)θ_u,   ζ(u, v) = 1 if u ≤ v and 0 otherwise,   (2)

with −ψ(θ) = θ_(1,1,…,1), and the sample space Ω is a poset (partially ordered set) with respect to the partial order "≤" with the least element ⊥ = (1, 1, …, 1). Therefore our model belongs to the log-linear model on posets introduced by Sugiyama et al. (2016, 2017), which is an extension of the information geometric hierarchical log-linear model (Amari, 2001; Nakahara and Amari, 2002).
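The low-rank-of-the-logarithm claim for matrices can be checked numerically. Under our illustrative 0-based indexing, t_(0,0) plays the role of t_(1,1) = −ψ(θ), and the 2-D cumulative sum log q_v = ∑_{u≤v} t_u can be written as L1 T L2ᵀ with lower-triangular all-ones matrices L1, L2, so rank(log Q) = rank(T). A sketch under the assumption that the basis is confined to the first l rows:

```python
import numpy as np

I1, I2, l = 4, 4, 2
rng = np.random.default_rng(0)

# Parameter matrix T: theta on a basis confined to the first l rows, zero elsewhere.
T = np.zeros((I1, I2))
T[:l, :] = 0.1 * rng.normal(size=(l, I2))
T[0, 0] = 0.0  # reserved for -psi(theta), set below

# log q_v = sum_{u <= v} t_u is a 2-D cumulative sum: log Q = L1 @ T @ L2.T,
# where L1 and L2 are lower-triangular all-ones matrices.
L1 = np.tril(np.ones((I1, I1)))
L2 = np.tril(np.ones((I2, I2)))
log_unnorm = L1 @ T @ L2.T

# Normalizer psi(theta); since (0,0) <= v for every v, setting t_(0,0) = -psi
# subtracts psi from every entry of log Q.
psi = np.log(np.exp(log_unnorm).sum())
T[0, 0] = -psi
logQ = L1 @ T @ L2.T
Q = np.exp(logQ)
```

Because L1 and L2 are invertible, rank(logQ) equals rank(T), which is at most l here, while Q still sums to one.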
Each entry q_v and the parameters (θ_v)_{v∈Ω⁺} in Equation (2) directly correspond to those in Equation (8) in Sugiyama et al. (2017). According to Theorem 2 in Sugiyama et al. (2017), if we introduce (η_v)_{v∈Ω⁺} such that

  η_v = ∑_{u ∈ ↑v} q_u = ∑_{u ∈ Ω} ζ(v, u) q_u,   ↑v = { u ∈ Ω | u ≥ v },   (3)

for each v ∈ Ω⁺ (see Figure 1[b]), then the pair ((θ_v)_{v∈Ω⁺}, (η_v)_{v∈Ω⁺}) is always a dual coordinate system of the set of normalized tensors S = { P | 0 < p_v < 1 and ∑_{v∈Ω} p_v = 1 } with respect to the sample space Ω, as the two coordinates are connected via the Legendre transformation. Hence S becomes a dually flat manifold (Amari, 2009).
Here we formulate Legendre decomposition as a projection of a tensor onto a submanifold. Suppose that B ⊆ Ω⁺ and let S_B be the submanifold such that

  S_B = { Q ∈ S | θ_v = 0 for all v ∈ Ω⁺ ∖ B },

which is the set of fully decomposable tensors with B; it is an e-flat submanifold, as it has constraints on the θ coordinate (Amari, 2016, Chapter 2.4).

Figure 2: Projection in statistical manifold.
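Equation (3) is an up-set sum and is easy to state in code. A small illustrative sketch (again 0-based; `eta_from_q` is our own helper name): for the uniform distribution on a 2 × 2 sample space, η_(0,0) sums the whole tensor and each other η collects the entries above its index.

```python
import itertools

def geq(u, v):
    """u >= v in the componentwise partial order."""
    return all(j >= i for j, i in zip(u, v))

def eta_from_q(Q, Omega):
    """Eq. (3): eta_v = sum of q_u over the up-set { u in Omega | u >= v }."""
    return {v: sum(q for u, q in Q.items() if geq(u, v)) for v in Omega}

# Uniform toy distribution on a 2x2 sample space (0-based indices).
Omega = list(itertools.product(range(2), range(2)))
Q = {v: 0.25 for v in Omega}
eta = eta_from_q(Q, Omega)
```

Here eta[(0, 0)] = 1, eta[(0, 1)] = eta[(1, 0)] = 0.5, and eta[(1, 1)] = 0.25, matching the hand computation.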
Furthermore, we introduce another submanifold S_P for a tensor P ∈ S and A ⊆ Ω⁺ such that

  S_P = { Q ∈ S | η_v = η̂_v for all v ∈ A },

where η̂_v is given by Equation (3) with q_u replaced by p_u; this is an m-flat submanifold, as it has constraints on the η coordinate.
The dually flat structure of S with the dual coordinate systems ((θ_v)_{v∈Ω⁺}, (η_v)_{v∈Ω⁺}) leads to the following strong property: if A = B, that is, (Ω⁺ ∖ B) ∪ A = Ω⁺ and (Ω⁺ ∖ B) ∩ A = ∅, the intersection S_B ∩ S_P is always a singleton; that is, the tensor Q ∈ S_B ∩ S_P always uniquely exists, and Q is the minimizer of the KL divergence from P (Amari, 2009, Theorem 3):

  Q = argmin_{R ∈ S_B} D_KL(P, R).   (4)

The transition from P to Q is called the m-projection of P onto S_B, and Legendre decomposition coincides with this m-projection (Figure 2). In contrast, if some fully decomposable tensor R ∈ S_B is given, finding the intersection Q ∈ S_B ∩ S_P is called the e-projection of R onto S_P. In practice, we use the e-projection because the number of parameters to be optimized is |B| in the e-projection while it is |Ω ∖ B| in the m-projection, and |B| ≤ |Ω ∖ B| usually holds.
The e-projection is always convex optimization, as the e-flat submanifold S_B is convex with respect to (θ_v)_{v∈Ω⁺}. More precisely,

  D_KL(P, Q) = ∑_{v∈Ω} p_v log(p_v / q_v) = ∑_{v∈Ω} p_v log p_v − ∑_{v∈Ω} p_v log q_v = −∑_{v∈Ω} p_v log q_v − H(P),

where H(P) is the entropy of P and is independent of (θ_v)_{v∈Ω⁺}.
Since we have

  −∑_{v∈Ω} p_v log q_v = ∑_{v∈Ω} p_v ( ∑_{u∈B} ζ(u, v)(−θ_u) + ψ(θ) ),   ψ(θ) = log ∑_{v∈Ω} exp( ∑_{u∈B} ζ(u, v)θ_u ),

ψ(θ) is convex, and hence D_KL(P, Q) is also convex with respect to (θ_v)_{v∈Ω⁺}.

2.3 Algorithm

Here we present two gradient-based optimization algorithms to solve the KL divergence minimization problem in Equation (4). Since the KL divergence D_KL(P, Q) is convex with respect to each θ_v, the standard gradient descent shown in Algorithm 1 can always find the global optimum, where ε > 0 is a learning rate. Starting with some initial parameter set (θ_v)_{v∈B}, the algorithm iteratively updates the set until convergence. The gradient with respect to θ_w for each w ∈ B is obtained as

  ∂/∂θ_w D_KL(P, Q) = −∂/∂θ_w ∑_{v∈Ω} p_v log q_v = −∂/∂θ_w ∑_{v∈Ω} p_v ( ∑_{u∈B} ζ(u, v)θ_u ) + ∂/∂θ_w ∑_{v∈Ω} p_v ψ(θ)
                    = −∑_{v∈Ω} p_v ζ(w, v) + ∑_{v∈Ω} p_v ∂ψ(θ)/∂θ_w = η_w − η̂_w,

Algorithm 1: Legendre decomposition by gradient descent
1 GradientDescent(P, B)
2   Initialize (θ_v)_{v∈B};   // e.g. θ_v = 0 for all v ∈ B
3   repeat
4     Compute Q using the current parameter (θ_v)_{v∈B};
5     Compute (η_v)_{v∈B} from Q;
6     foreach v ∈ B do
7       θ_v ← θ_v − ε(η_v − η̂_v);
8   until convergence of (θ_v)_{v∈B};

Algorithm 2: Legendre decomposition by natural gradient
1 NaturalGradient(P, B)
2   Initialize (θ_v)_{v∈B};   // e.g. θ_v = 0 for all v ∈ B
3   repeat
4     Compute Q using the current parameter (θ_v)_{v∈B};
5     Compute (η_v)_{v∈B} from Q and Δη ← η − η̂;
6     Compute the inverse G⁻¹ of the Fisher information matrix G using Equation (5);
7     θ ← θ − G⁻¹Δη;
8   until convergence of (θ_v)_{v∈B};

where the last equation uses the fact that ∂ψ(θ)/∂θ_w = η_w (Theorem 2 in Sugiyama et al., 2017). This equation also shows that the KL divergence D_KL(P, Q) is minimized if and only if η_v = η̂_v for all v ∈ B. The time complexity of each iteration is O(|Ω||B|), as that of computing Q from (θ_v)_{v∈B} (line 4 in Algorithm 1) is O(|Ω||B|) and that of computing (η_v)_{v∈B} from Q (line 5 in Algorithm 1) is O(|Ω|). Thus the total complexity is O(h|Ω||B|) with the number of iterations h until convergence.
Although gradient descent is an efficient approach, Legendre decomposition needs to repeat "decoding" from (θ_v)_{v∈B} and "encoding" to (η_v)_{v∈B} in each iteration, which may lead to a loss of efficiency if the number of iterations is large. To reduce the number of iterations and gain efficiency, we propose to use a natural gradient (Amari, 1998), a second-order optimization method, shown in Algorithm 2. Again, since the KL divergence D_KL(P, Q) is convex with respect to (θ_v)_{v∈B}, the natural gradient can always find the global optimum. More precisely, our natural gradient algorithm is an instance of the Bregman algorithm applied to a convex region, which is well known to always converge to the global solution (Censor and Lent, 1981). Let B = {v1, v2, …, v_|B|}, θ = (θ_{v1}, …, θ_{v_|B|})ᵀ, and η = (η_{v1}, …, η_{v_|B|})ᵀ.
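Before turning to the natural-gradient update, note that Algorithm 1 fits on a page once Equations (1) and (3) are in hand. The sketch below is our own illustrative code, not the authors' released C++ implementation; indices are 0-based, so (0, 0) plays the role of the least element (1, 1). On a 2 × 2 example, the basis B = {(0, 1), (1, 0)} makes Q the independence model with the same marginals as P.

```python
import itertools
import math

def leq(u, v):
    return all(j <= i for j, i in zip(u, v))

def q_from_theta(theta, Omega):
    """Eq. (1): q_v = exp( sum_{u in B, u <= v} theta_u - psi(theta) )."""
    log_q = {v: sum(t for u, t in theta.items() if leq(u, v)) for v in Omega}
    psi = math.log(sum(math.exp(s) for s in log_q.values()))
    return {v: math.exp(s - psi) for v, s in log_q.items()}

def eta_on(P, basis):
    """Eq. (3): eta_v = sum_{u >= v} p_u, evaluated on the given index set."""
    return {v: sum(p for u, p in P.items() if leq(v, u)) for v in basis}

def legendre_decomposition_gd(P, Omega, B, eps=0.1, tol=1e-10, max_iter=100_000):
    """Algorithm 1: gradient descent on theta to minimize KL(P, Q)."""
    eta_hat = eta_on(P, B)             # target expectations from the input tensor
    theta = {v: 0.0 for v in B}        # initialization: theta_v = 0 for all v in B
    for _ in range(max_iter):
        Q = q_from_theta(theta, Omega)
        eta = eta_on(Q, B)             # current model expectations
        grad = {v: eta[v] - eta_hat[v] for v in B}   # dKL/dtheta_v = eta_v - eta_hat_v
        for v in B:
            theta[v] -= eps * grad[v]
        if max(abs(g) for g in grad.values()) < tol:
            break
    return q_from_theta(theta, Omega)

# 2x2 toy input (0-based indices, so (0, 0) plays the role of (1, 1)).
Omega = list(itertools.product(range(2), range(2)))
P = {(0, 0): 0.4, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.1}
B = [(0, 1), (1, 0)]                   # with this basis, Q is the independence model
Q = legendre_decomposition_gd(P, Omega, B)
```

At convergence η matches η̂ on B, so Q reproduces the row and column marginals of P (0.7/0.3 and 0.6/0.4), i.e. Q approaches the outer product of the marginals.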
In each update of the current θ to θ_next, the natural gradient method uses the relationship

  Δη = −GΔθ,   Δη = η − η̂,   Δθ = θ_next − θ,

which leads to the update formula

  θ_next = θ − G⁻¹Δη,

where G = (g_uv) ∈ ℝ^{|B|×|B|} is the Fisher information matrix such that

  g_uv(θ) = ∂η_u/∂θ_v = E[ (∂ log p_w / ∂θ_u)(∂ log p_w / ∂θ_v) ] = ∑_{w∈Ω} ζ(u, w)ζ(v, w)p_w − η_u η_v,   (5)

as given in Theorem 3 in Sugiyama et al. (2017). Note that the natural gradient coincides with Newton's method in our case, as the Fisher information matrix G corresponds to the Hessian matrix of the objective:

  ∂²/(∂θ_u ∂θ_v) D_KL(P, Q) = ∂η_u/∂θ_v = g_uv.

The time complexity of each iteration is O(|Ω||B| + |B|³), where O(|Ω||B|) is needed to compute Q from θ and O(|B|³) to compute the inverse of G, resulting in the total complexity O(h′|Ω||B| + h′|B|³) with the number of iterations h′ until convergence.

2.4 Relationship to Statistical Models

We demonstrate interesting relationships between Legendre decomposition and statistical models, including the exponential family and the Boltzmann (Gibbs) distributions, and show that our decomposition method can be viewed as a generalization of Boltzmann machine learning (Ackley et al., 1985). Although the connection between tensor decomposition and graphical models has been analyzed by Chen et al. (2018); Yılmaz et al.
(2011); and Yılmaz and Cemgil (2012), our analysis adds a new insight, as we focus not on the graphical model itself but on the sample space of the distributions generated by the model.

2.4.1 Exponential Family

We show that the set of normalized tensors S = { P ∈ ℝ^{I1×I2×⋯×IN}_{>0} | ∑_{v∈Ω} p_v = 1 } is included in the exponential family. The exponential family is defined as

  p(x; θ) = exp( ∑_i θ_i k_i(x) + r(x) − C(θ) )

for natural parameters θ. Since our model in Equation (1) can be written as

  p_v = (1 / exp(ψ(θ))) ∏_{u∈↓v} exp(θ_u) = exp( ∑_{u∈Ω⁺} θ_u ζ(u, v) − ψ(θ) )

with θ_u = 0 for u ∈ Ω⁺ ∖ B, it is clearly in the exponential family, where ζ and ψ(θ) correspond to k and C(θ), respectively, and r(x) = 0. Thus, the (θ_v)_{v∈B} used in Legendre decomposition are interpreted as natural parameters of the exponential family. Moreover, we can obtain (η_v)_{v∈B} by taking the expectation of ζ(u, v):

  E[ ζ(u, v) ] = ∑_{v∈Ω} ζ(u, v) p_v = η_u.

Thus Legendre decomposition of P is understood to find a fully decomposable Q that has the same expectation as P with respect to a basis B.

2.4.2 Boltzmann Machines

A Boltzmann machine is represented as an undirected graph G = (V, E) with a vertex set V and an edge set E ⊆ V × V, where we assume that V = [N] = {1, 2, …, N} without loss of generality. This V is the set of indices of N binary variables.
A Boltzmann machine G defines a probability distribution P, where the probability of each N-dimensional binary vector x ∈ {0, 1}^N is given as

  p(x) = (1 / Z(θ)) ∏_{i∈V} exp(θ_i x_i) ∏_{{i,j}∈E} exp(θ_ij x_i x_j),

where θ_i is a bias, θ_ij is a weight, and Z(θ) is the partition function.
To translate a Boltzmann machine into our formulation, let Ω = {1, 2}^N and suppose that

  B(V) = { (i^a_1, …, i^a_N) ∈ Ω | a ∈ V },   i^a_l = 2 if l = a and 1 otherwise,
  B(E) = { (i^{ab}_1, …, i^{ab}_N) ∈ Ω | {a, b} ∈ E },   i^{ab}_l = 2 if l ∈ {a, b} and 1 otherwise.

Then it is clear that the set of probability distributions, or Gibbs distributions, that can be represented by the Boltzmann machine G is exactly the same as S_B with B = B(V) ∪ B(E) and exp(ψ(θ)) = Z(θ); that is, it is the set of fully decomposable Nth-order tensors defined by Equation (1) with the basis B(V) ∪ B(E) (Figure 3). Moreover, let a given Nth-order tensor P ∈ ℝ^{2×2×⋯×2}_{≥0} be an empirical distribution estimated from data, where each p_v is the probability of the binary vector v − (1, …, 1) ∈ {0, 1}^N. The tensor Q obtained by Legendre decomposition with B = B(V) ∪ B(E) coincides with the distribution learned by the Boltzmann machine G = (V, E). The condition η_v = η̂_v in the optimization of the Legendre decomposition corresponds to the well-known learning

Figure 3: Boltzmann machine with V = {1, 2, 3} and E = {{1, 2}, {2, 3}} (left) and its sample space (center), which corresponds to a tensor (right).
Grayed circles are the domain of the parameters θ.

equation of Boltzmann machines, where η̂ and η correspond to the expectation of the data distribution and that of the model distribution, respectively.
Therefore our Legendre decomposition is a generalization of Boltzmann machine learning in the following three aspects:
1. The domain is not limited to binary but can be ordinal; that is, {0, 1}^N is extended to [I1] × [I2] × ⋯ × [IN] for any I1, I2, …, IN ∈ ℕ.
2. The basis B with which the parameters θ are associated is not limited to B(V) ∪ B(E) but can be any subset of [I1] × ⋯ × [IN], meaning that higher-order interactions (Sejnowski, 1986) can be included.
3. The sample space of probability distributions is not limited to {0, 1}^N but can be any subset of [I1] × [I2] × ⋯ × [IN], which enables us to perform efficient computation by removing unnecessary entries such as missing values.
Hidden variables are often used in Boltzmann machines to increase the representation power, as in restricted Boltzmann machines (RBMs; Smolensky, 1986; Hinton, 2002) and deep Boltzmann machines (DBMs; Salakhutdinov and Hinton, 2009, 2012). In Legendre decomposition, including a hidden variable corresponds to including an additional dimension; hence, if we include H hidden variables, the fully decomposable tensor Q has order N + H. This is an interesting extension of our method and an ongoing research topic, but it is not a focus of this paper, since our main aim is to find a lower dimensional representation of a given tensor P.
In the learning process of Boltzmann machines, approximation techniques for the partition function Z(θ) are usually required, such as annealed importance sampling (AIS; Salakhutdinov and Murray, 2008) or variational techniques (Salakhutdinov, 2008).
This requirement arises because the exact computation of the partition function requires summation over all probabilities of the sample space Ω, which is always fixed to 2^V with the set V of variables in the learning process of Boltzmann machines, and which is not tractable. Our method does not require such techniques, as Ω is a subset of the indices of an input tensor and the partition function can always be computed directly.

3 Experiments

We empirically examine the efficiency and the effectiveness of Legendre decomposition using synthetic and real-world datasets. We used Amazon Linux AMI release 2018.03 and ran all experiments on a 2.3 GHz Intel Xeon CPU E5-2686 v4 with 256 GB of memory. Legendre decomposition was implemented in C++ and compiled with icpc 18.0.0.¹
Throughout the experiments, we focused on the decomposition of third-order tensors and used three types of decomposition bases of the form

  B1 = {v | i1 = i2 = 1} ∪ {v | i2 = i3 = 1} ∪ {v | i1 = i3 = 1},
  B2(l) = {v | i1 = 1, i2 ∈ C2(l)} ∪ {v | i1 ∈ C1(l), i2 = 1} with Ck(l) = { c⌊Ik/l⌋ | c ∈ [l] },
  B3(l) = {v | (i1, i2) ∈ H_{i3}(l)} with H_{i3}(l) being the set of indices of the top l elements of the i3-th frontal slice in terms of probability.

Among these bases, B1 works as a normalizer for each mode, B2 works as a normalizer for the rows and columns of each slice, and B3 highlights entries with high probabilities. We always assume that (1, …, 1) is not included in the above bases.

¹ Implementation is available at: https://github.com/mahito-sugiyama/Legendre-decomposition

Figure 4: Experimental results on synthetic data.
(a, b) Comparison of natural gradient (Algorithm 2) and gradient descent (Algorithm 1), where both algorithms produce exactly the same results. (c) Comparison of Legendre decomposition (natural gradient) and other tensor decomposition methods.

The cardinality of a basis corresponds to the number of parameters used in the decomposition, and we used l to vary the number of parameters in our experiments.
To examine the efficiency and the effectiveness of tensor decomposition, we compared Legendre decomposition with two standard nonnegative tensor decomposition techniques, nonnegative Tucker decomposition (Kim and Choi, 2007) and nonnegative CANDECOMP/PARAFAC (CP) decomposition (Shashua and Hazan, 2005). Since both of these methods are based on least squares objective functions (Lee and Seung, 1999), we also included a variant of CP decomposition, CP-Alternating Poisson Regression (CP-APR; Chi and Kolda, 2012), which uses the KL divergence as its objective function. We used the TensorLy implementation (Kossaifi et al., 2016) for the nonnegative Tucker and CP decompositions and the tensor toolbox (Bader et al., 2017; Bader and Kolda, 2007) for CP-APR. For nonnegative Tucker decomposition, we always employed rank-(m, m, m) Tucker decomposition with a single number m, and we used rank-n decomposition for nonnegative CP decomposition and CP-APR. Thus rank-(m, m, m) Tucker decomposition has (I1 + I2 + I3)m + m³ parameters, and rank-n CP decomposition and CP-APR have (I1 + I2 + I3)n parameters.

Results on Synthetic Data. First we compared our two algorithms, gradient descent in Algorithm 1 and natural gradient in Algorithm 2, to evaluate the efficiency of these optimization algorithms. We randomly generated a third-order tensor of size 20 × 20 × 20 from the uniform distribution and measured the running time and the number of iterations.
We set B = B3(l) and varied the number of parameters |B| by increasing l. In Algorithm 2, we used the outer loop (from line 3 to line 8) as one iteration for fair comparison, and we fixed the learning rate ε = 0.1.
Results are plotted in Figure 4(a, b); they clearly show that the natural gradient is dramatically faster than gradient descent. When the number of parameters is around 400, the natural gradient is more than six orders of magnitude faster than gradient descent. The increased speed comes from the reduction in iterations: the natural gradient requires only two or three iterations until convergence in all cases, while gradient descent requires more than 10⁵ iterations to reach the same result. In the following, we consistently use the natural gradient for Legendre decomposition.
Next we examined the scalability compared to other tensor decomposition methods. We used the same synthetic datasets and increased the tensor size from 20 × 20 × 20 to 500 × 500 × 500. Results are plotted in Figure 4(c). Legendre decomposition is slower than the Tucker and CP decompositions as the tensors get larger, while the plots show that the running time of Legendre decomposition is linear in the tensor size. Moreover, Legendre decomposition is faster than CP-APR if the tensor size is not large.

Results on Real Data. Next we demonstrate the effectiveness of Legendre decomposition on real-world datasets of third-order tensors. We evaluated the quality of a decomposition by the root mean squared error (RMSE) between the input and the reconstructed tensors.
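As a concrete reference for the evaluation protocol, the RMSE between an input tensor X and its reconstruction can be sketched as follows (an illustrative helper of our own, not part of the released implementation):

```python
import numpy as np

def rmse(X, X_hat):
    """Root mean squared error between an input tensor and its reconstruction."""
    return float(np.sqrt(np.mean((X - X_hat) ** 2)))

# Hand-checked toy example: one entry off by 2, so RMSE = sqrt((0+0+0+4)/4) = 1.
X = np.array([[1.0, 2.0], [3.0, 4.0]])
X_hat = np.array([[1.0, 2.0], [3.0, 6.0]])
```

The same formula applies to tensors of any order, since the mean runs over all entries.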
We also examined the scalability of our method in terms of the number of parameters.

Figure 5: Experimental results on the face image dataset (a, b) and MNIST (c, d).

² This dataset is originally distributed at http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html and is also available from the R rTensor package (https://CRAN.R-project.org/package=rTensor).

First we examine Legendre decomposition and the three competing methods on the face image dataset.² We picked up the first entry of the fourth mode (which corresponds to lighting) from the dataset and the
In terms of running time, it is slower than Tucker and CP decomposition as the number of parameters increases, while it is still faster than CP-APR.
Next we used the MNIST dataset (LeCun et al., 1998), which consists of images of handwritten digits and has served as a standard benchmark in a number of recent studies, including deep learning. We took the first 500 images of each digit, resulting in 10 third-order tensors of size 28 × 28 × 500, where the first two modes correspond to image pixels. In Legendre decomposition, we simply set the decomposition basis B = B_3(l) and removed zero entries from Ω. Again, for every decomposition method, we varied the number of parameters by increasing l, m, and n and evaluated the performance in terms of RMSE.
Means ± standard error of the mean (SEM) across all digits from “0” to “9” are plotted in Figure 5(c, d). Results for all digits are presented in the supplementary material. Legendre decomposition clearly shows the smallest RMSE, and the difference is larger when the number of parameters is smaller. The reason is that Legendre decomposition ignores zero entries and decomposes only the nonzero entries, which is not possible with the other methods. Running time shows the same trend as on the face dataset; that is, Legendre decomposition is slower than the other methods as the number of parameters increases.

4 Conclusion

In this paper, we have proposed Legendre decomposition, which incorporates tensor structure into information geometry. A given tensor is converted into the dual parameters (θ, η) connected via the Legendre transformation, and the optimization is performed in the parameter space instead of treating the tensors directly.
We have theoretically shown the desired properties of Legendre decomposition, namely, that its result is well-defined, unique, and globally optimal, in that it always finds the decomposable tensor that minimizes the KL divergence from the input tensor. We have also shown the connection between Legendre decomposition and Boltzmann machine learning.
We have experimentally shown that Legendre decomposition can reconstruct input tensors more accurately than three standard tensor decomposition methods (nonnegative Tucker decomposition, nonnegative CP decomposition, and CP-APR) using the same number of parameters. Since the shape of the decomposition basis B is arbitrary, Legendre decomposition has the potential to achieve even more accurate decomposition; for example, one can incorporate domain knowledge into the set B when specific entries of the input tensor are known to dominate the other entries.
Our work opens the door both to further theoretical investigation of information-geometric algorithms for tensor analysis and to a number of practical applications such as missing value imputation.

Acknowledgments
This work was supported by JSPS KAKENHI Grant Numbers JP16K16115 and JP16H02870, and JST PRESTO Grant Number JPMJPR1855, Japan (M.S.); JSPS KAKENHI Grant Numbers 26120732 and 16H06570 (H.N.); and JST CREST JPMJCR1502 (K.T.).

References
D. H. Ackley, G. E. Hinton, and T. J. Sejnowski. A learning algorithm for Boltzmann machines. Cognitive Science, 9(1):147–169, 1985.
S. Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.
S. Amari. Information geometry on hierarchy of probability distributions. IEEE Transactions on Information Theory, 47(5):1701–1711, 2001.
S. Amari. Information geometry and its applications: Convex function and dually flat manifold. In F. Nielsen, editor, Emerging Trends in Visual Computing: LIX Fall Colloquium, ETVC 2008, Revised Invited Papers, pages 75–102. Springer, 2009.
S. Amari. Information Geometry and Its Applications. Springer, 2016.
B. W. Bader and T. G. Kolda. Efficient MATLAB computations with sparse and factored tensors. SIAM Journal on Scientific Computing, 30(1):205–231, 2007.
B. W. Bader, T. G. Kolda, et al. MATLAB tensor toolbox version 3.0-dev, 2017.
C. F. Beckmann and S. M. Smith. Tensorial extensions of independent component analysis for multisubject FMRI analysis. NeuroImage, 25(1):294–311, 2005.
Y. Censor and A. Lent. An iterative row-action method for interval convex programming. Journal of Optimization Theory and Applications, 34(3):321–353, 1981.
J. Chen, S. Cheng, H. Xie, L. Wang, and T. Xiang. Equivalence of restricted Boltzmann machines and tensor network states. Physical Review B, 97:085104, 2018.
E. C. Chi and T. G. Kolda. On tensors, sparsity, and nonnegative factorizations. SIAM Journal on Matrix Analysis and Applications, 33(4):1272–1299, 2012.
A. Cichocki, D. Mandic, L. De Lathauwer, G. Zhou, Q. Zhao, C. Caiafa, and H. A. Phan. Tensor decompositions for signal processing applications: From two-way to multiway component analysis. IEEE Signal Processing Magazine, 32(2):145–163, 2015.
B. A. Davey and H. A. Priestley. Introduction to Lattices and Order. Cambridge University Press, 2nd edition, 2002.
G. Gierz, K. H. Hofmann, K. Keimel, J. D. Lawson, M. Mislove, and D. S. Scott. Continuous Lattices and Domains. Cambridge University Press, 2003.
R. A. Harshman. Foundations of the PARAFAC procedure: Models and conditions for an “explanatory” multi-modal factor analysis. Technical report, UCLA Working Papers in Phonetics, 1970.
G. E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800, 2002.
Y. D. Kim and S. Choi. Nonnegative Tucker decomposition. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8, 2007.
T. G. Kolda and B. W. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455–500, 2009.
J. Kossaifi, Y. Panagakis, and M. Pantic. TensorLy: Tensor learning in Python. arXiv:1610.09555, 2016.
Y. LeCun, C. Cortes, and C. J. C. Burges. The MNIST database of handwritten digits, 1998. URL http://yann.lecun.com/exdb/mnist/.
Y. LeCun, S. Chopra, R. Hadsell, M. Ranzato, and F. J. Huang. A tutorial on energy-based learning. In G. Bakir, T. Hofmann, B. Schölkopf, A. J. Smola, B. Taskar, and S. V. N. Vishwanathan, editors, Predicting Structured Data. The MIT Press, 2007.
D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788–791, 1999.
D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. In Advances in Neural Information Processing Systems 13, pages 556–562, 2001.
J. Liu, P. Musialski, P. Wonka, and J. Ye. Tensor completion for estimating missing values in visual data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):208–220, 2013.
H. Nakahara and S. Amari. Information-geometric measure for neural spikes. Neural Computation, 14(10):2269–2316, 2002.
R. Salakhutdinov. Learning and evaluating Boltzmann machines. UTML TR 2008-002, 2008.
R. Salakhutdinov and G. E. Hinton. Deep Boltzmann machines. In Proceedings of the 12th International Conference on Artificial Intelligence and Statistics, pages 448–455, 2009.
R. Salakhutdinov and G. E. Hinton. An efficient learning procedure for deep Boltzmann machines. Neural Computation, 24(8):1967–2006, 2012.
R. Salakhutdinov and I. Murray. On the quantitative analysis of deep belief networks. In Proceedings of the 25th International Conference on Machine Learning, pages 872–879, 2008.
T. J. Sejnowski. Higher-order Boltzmann machines. In AIP Conference Proceedings, volume 151, pages 398–403, 1986.
A. Shashua and T. Hazan. Non-negative tensor factorization with applications to statistics and computer vision. In Proceedings of the 22nd International Conference on Machine Learning, pages 792–799, 2005.
P. Smolensky. Information processing in dynamical systems: Foundations of harmony theory. In D. E. Rumelhart, J. L. McClelland, and PDP Research Group, editors, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1, pages 194–281. MIT Press, 1986.
M. Sugiyama, H. Nakahara, and K. Tsuda. Information decomposition on structured space. In 2016 IEEE International Symposium on Information Theory, pages 575–579, 2016.
M. Sugiyama, H. Nakahara, and K. Tsuda. Tensor balancing on statistical manifold. In Proceedings of the 34th International Conference on Machine Learning, pages 3270–3279, 2017.
P. Symeonidis. Matrix and tensor decomposition in recommender systems. In Proceedings of the 10th ACM Conference on Recommender Systems, pages 429–430, 2016.
R. Tomioka and T. Suzuki. Convex tensor decomposition via structured Schatten norm regularization. In Advances in Neural Information Processing Systems 26, pages 1331–1339, 2013.
L. R. Tucker. Some mathematical notes on three-mode factor analysis. Psychometrika, 31(3):279–311, 1966.
M. A. O. Vasilescu and D. Terzopoulos. Multilinear analysis of image ensembles: TensorFaces. In Proceedings of the 7th European Conference on Computer Vision (ECCV), volume 2350 of LNCS, pages 447–460, 2002.
M. A. O. Vasilescu and D. Terzopoulos. Multilinear (tensor) image synthesis, analysis, and recognition [exploratory DSP]. IEEE Signal Processing Magazine, 24(6):118–123, 2007.
K. Y. Yılmaz, A. T. Cemgil, and U. Simsekli. Generalised coupled tensor factorisation. In Advances in Neural Information Processing Systems 24, pages 2151–2159, 2011.
Y. K. Yılmaz and A. T. Cemgil. Algorithms for probabilistic latent tensor factorization. Signal Processing, 92(8):1853–1863, 2012.