{"title": "Supervised Learning with Tensor Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 4799, "page_last": 4807, "abstract": "Tensor networks are approximations of high-order tensors which are efficient to work with and have been very successful for physics and mathematics applications. We demonstrate how algorithms for optimizing tensor networks can be adapted to supervised learning tasks by using matrix product states (tensor trains) to parameterize non-linear kernel learning models. For the MNIST data set we obtain less than 1% test set classification error. We discuss an interpretation of the additional structure imparted by the tensor network to the learned model.", "full_text": "Supervised Learning with Tensor Networks\n\nE. M. Stoudenmire\n\nPerimeter Institute for Theoretical Physics\n\nWaterloo, Ontario, N2L 2Y5, Canada\n\nDavid J. Schwab\n\nDepartment of Physics\n\nNorthwestern University, Evanston, IL\n\nAbstract\n\nTensor networks are approximations of high-order tensors which are efficient to work with and have been very successful for physics and mathematics applications. We demonstrate how algorithms for optimizing tensor networks can be adapted to supervised learning tasks by using matrix product states (tensor trains) to parameterize non-linear kernel learning models. For the MNIST data set we obtain less than 1% test set classification error. We discuss an interpretation of the additional structure imparted by the tensor network to the learned model.\n\n1 Introduction\n\nRecently there has been growing appreciation for tensor methods in machine learning. Tensor decompositions can solve non-convex optimization problems [1, 2] and be used for other important tasks such as extracting features from input data and parameterizing neural nets [3, 4, 5]. 
Tensor methods have also become prominent in the field of physics, especially the use of tensor networks, which accurately capture very high-order tensors while avoiding the curse of dimensionality through a particular geometry of low-order contracted tensors [6]. The most successful use of tensor networks in physics has been to approximate exponentially large vectors arising in quantum mechanics [7, 8].\n\nAnother context where very large vectors arise is non-linear kernel learning, where input vectors x are mapped into a higher dimensional space via a feature map \u03a6(x) before being classified by a decision function\n\nf(x) = W \u00b7 \u03a6(x) .   (1)\n\nThe feature vector \u03a6(x) and weight vector W can be exponentially large or even infinite. One approach to dealing with such large vectors is the well-known kernel trick, which only requires working with scalar products of feature vectors [9].\n\nIn what follows we propose a rather different approach. For certain learning tasks and a specific class of feature map \u03a6, we find that the optimal weight vector W can be approximated as a tensor network\u2014a contracted sequence of low-order tensors. Representing W as a tensor network and optimizing it directly (without passing to the dual representation) has many interesting consequences. Training the model scales only linearly in the training set size; the evaluation cost for a test input is independent of training set size. Tensor networks are also adaptive: dimensions of tensor indices internal to the network grow and shrink during training to concentrate resources on the particular correlations within the data most useful for learning. The tensor network form of W presents opportunities to extract information hidden within the trained model and to accelerate training by optimizing different internal tensors in parallel [10]. 
Finally, the tensor network form is an additional type of regularization beyond the choice of feature map, and could have interesting consequences for generalization.\n\nOne of the best understood types of tensor networks is the matrix product state (MPS) [11, 8], also known as the tensor train decomposition [12]. Though MPS are best at capturing one-dimensional correlations, they are powerful enough to be applied to distributions with higher-dimensional correlations as well. MPS have been very useful for studying quantum systems, and have recently been investigated for machine learning applications such as learning features by decomposing tensor representations of data [4] and compressing the weight layers of neural networks [5].\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nFigure 1: The matrix product state (MPS) decomposition, also known as a tensor train. (Lines represent tensor indices and connecting two lines implies summation.)\n\nWhile applications of MPS to machine learning have been a success, one aim of the present work is to have tensor networks play a more central role in developing learning models; another is to more easily incorporate powerful algorithms and tensor networks which generalize MPS, developed by the physics community for studying higher dimensional and critical systems [13, 14, 15]. But in what follows, we only consider the case of MPS tensor networks as a proof of principle.\n\nThe MPS decomposition is an approximation of an order-N tensor by a contracted chain of N lower-order tensors shown in Fig. 1. (Throughout we will use tensor diagram notation: shapes represent tensors and lines emanating from them are tensor indices; connecting two lines implies contraction of a pair of indices. We emphasize that tensor diagrams are not merely schematic, but have a rigorous algorithmic interpretation. 
For a helpful review of this notation, see Cichocki [16].)\n\nRepresenting the weights W of Eq. (1) as an MPS allows one to efficiently optimize these weights and adaptively change their number by varying W locally a few tensors at a time, in close analogy to the density matrix renormalization group (DMRG) algorithm used in physics [17, 8]. Similar alternating least squares methods for tensor trains have been explored more recently in applied mathematics [18].\n\nThis paper is organized as follows: first we propose our general approach and describe an algorithm for optimizing the weight vector W in MPS form. Then we test our approach on the MNIST handwritten digit set and find very good performance for remarkably small MPS bond dimensions. Finally, we discuss the structure of the functions realized by our proposed models.\n\nFor researchers interested in reproducing our results, we have made our codes publicly available at: https://github.com/emstoudenmire/TNML. The codes are based on the ITensor library [19].\n\n2 Encoding Input Data\n\nTensor networks in physics are typically used in a context where combining N independent systems corresponds to taking a tensor product of a vector describing each system. With the goal of applying similar tensor networks to machine learning, we choose a feature map of the form\n\n\u03a6^{s_1 s_2 \u00b7\u00b7\u00b7 s_N}(x) = \u03c6^{s_1}(x_1) \u2297 \u03c6^{s_2}(x_2) \u2297 \u00b7\u00b7\u00b7 \u2297 \u03c6^{s_N}(x_N) .   (2)\n\nThe tensor \u03a6^{s_1 s_2 \u00b7\u00b7\u00b7 s_N} is the tensor product of a local feature map \u03c6^{s_j}(x_j) applied to each input component x_j of the N-dimensional vector x (where j = 1, 2, . . . , N). The indices s_j run from 1 to d, where d is known as the local dimension and is a hyper-parameter defining the classification model. Though one could use a different local feature map for each input component x_j, we will only consider the case of homogeneous inputs with the same local map applied to each x_j. 
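As an illustrative aside, the product feature map of Eq. (2), together with the cos/sin local map introduced below in Eq. (3), can be sketched in a few lines of NumPy (a hypothetical stand-in; the authors' released codes use the ITensor library instead):

```python
import numpy as np

def local_feature(x):
    # Local map of Eq. (3): a pixel value x in [0, 1] becomes a normalized
    # two-component "spin"-like vector, so the local dimension is d = 2.
    return np.array([np.cos(np.pi * x / 2.0), np.sin(np.pi * x / 2.0)])

def full_feature(xs):
    # Feature map of Eq. (2): the tensor product of the local maps, stored
    # implicitly as a list of d-dimensional vectors, since materializing
    # the full order-N tensor would cost d**N memory.
    return [local_feature(x) for x in xs]

phis = full_feature([0.0, 0.5, 1.0])
norms = [float(v @ v) for v in phis]  # each local vector has unit norm
```

Because each local vector is normalized, the rank-1 feature tensor they generate also has unit norm, matching the normalization shown in Fig. 2.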
Thus each x_j is mapped to a d-dimensional vector, and the full feature map \u03a6(x) can be viewed as a vector in a d^N-dimensional space or as an order-N tensor. The tensor diagram for \u03a6(x) is shown in Fig. 2. This type of tensor is said to be rank-1 since it is manifestly the product of N order-1 tensors.\n\nFor a concrete example of this type of feature map, which we will use later, consider inputs which are grayscale images with N pixels, where each pixel value ranges from 0.0 for white to 1.0 for black. If the grayscale value of pixel number j is x_j \u2208 [0, 1], a simple choice for the local map is\n\n\u03c6^{s_j}(x_j) = [cos((\u03c0/2) x_j), sin((\u03c0/2) x_j)]   (3)\n\nand is illustrated in Fig. 3. The full image is represented as a tensor product of these local vectors. The above feature map is somewhat ad-hoc, and is motivated by \u201cspin\u201d vectors encountered in quantum systems. More research is needed to understand the best choices for \u03c6^{s}(x), but the most crucial property seems to be that \u03c6(x) \u00b7 \u03c6(x\u2032) is a smooth and slowly varying function of x and x\u2032, and induces a distance metric in feature space that tends to cluster similar images together.\n\nFigure 2: Input data is mapped to a normalized order N tensor with a rank-1 product structure.\n\nFigure 3: For the case of a grayscale image and d = 2, each pixel value is mapped to a normalized two-component vector. The full image is mapped to the tensor product of all the local pixel vectors as shown in Fig. 2.\n\nThe feature map Eq. (2) defines a kernel which is the product of N local kernels, one for each component x_j of the input data. Kernels of this type have been discussed previously in Vapnik [20, p. 193] and have been argued by Waegeman et al. 
[21] to be useful for data where no relationship is assumed between different components of the input vector prior to learning.\n\n3 Classification Model\n\nIn what follows we are interested in classifying data with pre-assigned hidden labels, for which we choose a \u201cone-versus-all\u201d strategy, which we take to mean optimizing a set of functions indexed by a label \u2113\n\nf^{\u2113}(x) = W^{\u2113} \u00b7 \u03a6(x)   (4)\n\nand classifying an input x by choosing the label \u2113 for which |f^{\u2113}(x)| is largest.\n\nSince we apply the same feature map \u03a6 to all input data, the only quantity that depends on the label \u2113 is the weight vector W^{\u2113}. Though one can view W^{\u2113} as a collection of vectors labeled by \u2113, we will prefer to view W^{\u2113} as an order N + 1 tensor where \u2113 is a tensor index and f^{\u2113}(x) is a function mapping inputs to the space of labels. The tensor diagram for evaluating f^{\u2113}(x) for a particular input is depicted in Fig. 4.\n\nBecause the weight tensor W^{\u2113}_{s_1 s_2 \u00b7\u00b7\u00b7 s_N} has N_L \u00b7 d^N components, where N_L is the number of labels, we need a way to regularize and optimize this tensor efficiently. The strategy we will use is to represent W^{\u2113} as a tensor network, namely as an MPS, which has the key advantage that methods for manipulating and optimizing MPS are well understood and highly efficient. An MPS decomposition of the weight tensor W^{\u2113} has the form\n\nW^{\u2113}_{s_1 s_2 \u00b7\u00b7\u00b7 s_N} = \u2211_{{\u03b1}} A^{\u03b1_1}_{s_1} A^{\u03b1_1 \u03b1_2}_{s_2} \u00b7\u00b7\u00b7 A^{\u2113; \u03b1_j \u03b1_{j+1}}_{s_j} \u00b7\u00b7\u00b7 A^{\u03b1_{N\u22121}}_{s_N}   (5)\n\nFigure 4: The overlap of the weight tensor W^{\u2113} with a specific input vector \u03a6(x) defines the decision function f^{\u2113}(x). 
The label \u2113 for which f^{\u2113}(x) has maximum magnitude is the predicted label for x.\n\nFigure 5: Approximation of the weight tensor W^{\u2113} by a matrix product state. The label index \u2113 is placed arbitrarily on one of the N tensors but can be moved to other locations.\n\nand is illustrated in Fig. 5. Each A tensor has d\u00b7m^2 elements which are the latent variables parameterizing the approximation of W; the A tensors are in general not unique and can be constrained to bestow nice properties on the MPS, like making the A tensors partial isometries.\n\nThe dimensions of each internal index \u03b1_j of an MPS are known as the bond dimensions and are the (hyper) parameters controlling complexity of the MPS approximation. For sufficiently large bond dimensions an MPS can represent any tensor [22]. The name matrix product state refers to the fact that any specific component of the full tensor W^{\u2113}_{s_1 s_2 \u00b7\u00b7\u00b7 s_N} can be recovered efficiently by summing over the {\u03b1_j} indices from left to right via a sequence of matrix products (the term \u201cstate\u201d refers to the original use of MPS to describe quantum states of matter).\n\nIn the above decomposition Eq. (5), the label index \u2113 was arbitrarily placed on the tensor at some position j, but this index can be moved to any other tensor of the MPS without changing the overall W^{\u2113} tensor it represents. To do so, one contracts the tensor at position j with one of its neighbors, then decomposes this larger tensor using a singular value decomposition such that \u2113 now belongs to the neighboring tensor\u2014see Fig. 
7(a).\n\n4 \u201cSweeping\u201d Optimization Algorithm\n\nInspired by the very successful DMRG algorithm developed for physics applications [17, 8], here we propose a similar algorithm which \u201csweeps\u201d back and forth along an MPS, iteratively minimizing the cost function defining the classification task.\n\nTo describe the algorithm in concrete terms, we wish to optimize the quadratic cost C = (1/2) \u2211_{n=1}^{N_T} \u2211_{\u2113} (f^{\u2113}(x_n) \u2212 y^{\u2113}_n)^2 where n runs over the N_T training inputs and y^{\u2113}_n is the vector of desired outputs for input n. If the correct label of x_n is L_n, then y^{L_n}_n = 1 and y^{\u2113}_n = 0 for all other labels \u2113 (i.e. a one-hot encoding).\n\nOur strategy for minimizing this cost function will be to vary only two neighboring MPS tensors at a time within the approximation Eq. (5). We could conceivably just vary one at a time, but varying two tensors makes it simple to adaptively change the MPS bond dimension.\n\nSay we want to improve the tensors at sites j and j + 1. Assume we have moved the label index \u2113 to the MPS tensor at site j. First we combine the MPS tensors A^{\u2113}_{s_j} and A_{s_{j+1}} into a single \u201cbond tensor\u201d B^{\u03b1_{j\u22121} \u2113 \u03b1_{j+1}}_{s_j s_{j+1}} by contracting over the index \u03b1_j as shown in Fig. 6(a).\n\nNext we compute the derivative of the cost function C with respect to the bond tensor B^{\u2113} in order to update it using a gradient descent step. Because the rest of the MPS tensors are kept fixed, let us show that to compute the gradient it suffices to feed, or project, each input x_n through the fixed \u201cwings\u201d of the MPS as shown on the left-hand side of Fig. 6(b) (connected lines in the diagram indicate sums over pairs of indices). The result is a projected, four-index version of the input \u02dc\u03a6_n shown on the right-hand side of Fig. 6(b). 
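The left-to-right contraction strategy underlying this projection can be sketched in NumPy (a hypothetical toy with made-up shapes and random tensors, not the authors' implementation, which uses the ITensor library): each site tensor is contracted with its local feature vector and the results are multiplied as small matrices, so the cost stays linear in N rather than d^N.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, m, n_labels = 6, 2, 4, 3
center = 3  # site carrying the label index, as in Eq. (5)

# A random MPS standing in for W^l: end tensors (d, m) and (m, d),
# interior tensors (m, d, m), with an extra label leg at `center`.
mps = [rng.normal(size=(d, m))]
for j in range(1, N - 1):
    shape = (m, n_labels, d, m) if j == center else (m, d, m)
    mps.append(rng.normal(size=shape))
mps.append(rng.normal(size=(m, d)))

def f_mps(mps, phis):
    # Feed the input through the fixed "wings" of the MPS, one site at a
    # time; every step is a small matrix product.
    left = phis[0] @ mps[0]                                  # shape (m,)
    for j in range(1, center):
        left = left @ np.einsum('s,asb->ab', phis[j], mps[j])
    scores = np.einsum('a,alsb,s->lb', left, mps[center], phis[center])
    for j in range(center + 1, N - 1):
        scores = scores @ np.einsum('s,asb->ab', phis[j], mps[j])
    return scores @ (mps[-1] @ phis[-1])                     # one value per label

phis = [np.array([np.cos(np.pi * x / 2), np.sin(np.pi * x / 2)])
        for x in rng.uniform(size=N)]
scores = f_mps(mps, phis)
predicted = int(np.argmax(np.abs(scores)))  # label with largest |f^l(x)|
```

For these tiny sizes the result can be checked against brute-force contraction of the full order-(N+1) weight tensor, which is exactly what the MPS form avoids at scale.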
The current decision function can be efficiently computed from this projected input \u02dc\u03a6_n and the current bond tensor B^{\u2113} as\n\nf^{\u2113}(x_n) = \u2211_{\u03b1_{j\u22121} \u03b1_{j+1}} \u2211_{s_j s_{j+1}} B^{\u03b1_{j\u22121} \u2113 \u03b1_{j+1}}_{s_j s_{j+1}} (\u02dc\u03a6_n)^{s_j s_{j+1}}_{\u03b1_{j\u22121} \u03b1_{j+1}}   (6)\n\nor as illustrated in Fig. 6(c). The gradient update to the tensor B^{\u2113} can be computed as\n\n\u0394B^{\u2113} = \u2212\u2202C/\u2202B^{\u2113} = \u2211_{n=1}^{N_T} (y^{\u2113}_n \u2212 f^{\u2113}(x_n)) \u02dc\u03a6_n .   (7)\n\nFigure 6: Steps leading to computing the gradient of the bond tensor B^{\u2113} at bond j: (a) forming the bond tensor; (b) projecting a training input into the \u201cMPS basis\u201d at bond j; (c) computing the decision function in terms of a projected input; (d) the gradient correction to B^{\u2113}. The dark shaded circular tensors in step (b) are \u201ceffective features\u201d formed from m different linear combinations of many original features.\n\nThe tensor diagram for \u0394B^{\u2113} is shown in Fig. 6(d).\n\nHaving computed the gradient, we use it to make a small update to B^{\u2113}, replacing it with B^{\u2113} + \u03b7\u0394B^{\u2113} for some small \u03b7. Having obtained our improved B^{\u2113}, we must decompose it back into separate MPS tensors to maintain efficiency and apply our algorithm to the next bond. Assume the next bond we want to optimize is the one to the right (bond j + 1). Then we can compute a singular value decomposition (SVD) of B^{\u2113}, treating it as a matrix with a collective row index (\u03b1_{j\u22121}, s_j) and collective column index (\u2113, \u03b1_{j+1}, s_{j+1}) as shown in Fig. 7(a). 
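This SVD split, with the truncation to the largest singular values that the text describes next, can be sketched as follows (an illustrative NumPy fragment; the bond tensor here is a random, already-matricized stand-in):

```python
import numpy as np

rng = np.random.default_rng(2)
d, m, n_labels = 2, 8, 3

# Stand-in for the improved bond tensor B, reshaped to a matrix with row
# index (alpha_{j-1}, s_j) and column index (l, alpha_{j+1}, s_{j+1}).
B = rng.normal(size=(m * d, n_labels * m * d))

U, S, Vt = np.linalg.svd(B, full_matrices=False)

m_new = 5  # keep only the m_new largest singular values
B_trunc = (U[:, :m_new] * S[:m_new]) @ Vt[:m_new]

# Eckart-Young: the truncated SVD is the optimal rank-m_new approximation
# of B, with error equal to the norm of the discarded singular values.
err = np.linalg.norm(B - B_trunc)
expected = np.sqrt(np.sum(S[m_new:] ** 2))
```

The retained columns of U become the new left-orthogonal site tensor, while S and the rows of Vt are absorbed into the neighboring site, carrying the label index with them.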
Computing the SVD this way restores the MPS form, but with the \u2113 index moved to the tensor on site j + 1. If the SVD of B^{\u2113} is given by\n\nB^{\u03b1_{j\u22121} \u2113 \u03b1_{j+1}}_{s_j s_{j+1}} = \u2211_{\u03b1\u2032_j \u03b1_j} U^{\u03b1_{j\u22121}}_{s_j \u03b1\u2032_j} S^{\u03b1\u2032_j \u03b1_j} V^{\u03b1_j \u2113 \u03b1_{j+1}}_{s_{j+1}} ,   (8)\n\nthen to proceed to the next step we define the new MPS tensor at site j to be A\u2032_{s_j} = U_{s_j} and the new tensor at site j + 1 to be A\u2032^{\u2113}_{s_{j+1}} = S V^{\u2113}_{s_{j+1}}, where a matrix multiplication over the suppressed \u03b1 indices is implied. Crucially at this point, only the m largest singular values in S are kept and the rest are truncated (along with the corresponding columns of U and V\u2020) in order to control the computational cost of the algorithm. Such a truncation is guaranteed to produce an optimal approximation of the tensor B^{\u2113} (it minimizes the norm of the difference before and after truncation); furthermore if all of the MPS tensors to the left and right of B^{\u2113} are formed from (possibly truncated) unitary matrices similar to the definition of A\u2032_{s_j} above, then the optimality of the truncation of B^{\u2113} applies globally to the entire MPS as well. For further background reading on these technical aspects of MPS, see Refs. [8] and [16].\n\nFinally, when proceeding to the next bond, it would be inefficient to fully project each training input over again into the configuration in Fig. 6(b). Instead it is only necessary to advance the projection by one site using the MPS tensor set from a unitary matrix after the SVD as shown in Fig. 7(b). This allows the cost of each local step of the algorithm to remain independent of the size of the input space, making the total algorithm scale only linearly with input space size (i.e. the number of components of an input vector x).\n\nThe above algorithm highlights a key advantage of MPS and tensor networks relevant to machine learning applications. 
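As a toy illustration of the update in Eqs. (6) and (7), the sketch below flattens each projected input \u02dc\u03a6_n into a vector and runs a few gradient steps on a random stand-in for the bond tensor (all sizes and tensors are made-up assumptions, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(1)
n_train, dim, n_labels = 20, 12, 3

# Stand-in projected inputs: each training input, pushed through the fixed
# MPS "wings", becomes a small four-index tensor; flattened here to a vector.
phi_t = rng.normal(size=(n_train, dim))
y = np.zeros((n_train, n_labels))
y[np.arange(n_train), rng.integers(0, n_labels, n_train)] = 1.0  # one-hot

B = rng.normal(size=(n_labels, dim)) * 0.1  # bond tensor, flattened

def cost(B):
    f = phi_t @ B.T                 # f^l(x_n) as in Eq. (6)
    return 0.5 * np.sum((f - y) ** 2)

def grad_step(B, eta):
    f = phi_t @ B.T
    delta = (y - f).T @ phi_t       # Eq. (7): sum_n (y_n - f(x_n)) phi_n
    return B + eta * delta          # small step along -dC/dB

c0 = cost(B)
for _ in range(50):
    B = grad_step(B, 0.01)
c1 = cost(B)  # decreases for a small enough step size eta
```

Since the rest of the MPS is frozen, the cost is quadratic in B, so plain gradient descent with a small step size decreases it monotonically in this toy.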
Following the SVD of the improved bond tensor B\u2032^{\u2113}, the dimension of the new MPS bond can be chosen adaptively based on the number of large singular values encountered in the SVD (defined by a threshold chosen in advance). Thus the MPS form of W^{\u2113} can be compressed as much as possible, and by different amounts on each bond, while still ensuring an accurate approximation of the optimal decision function.\n\nFigure 7: Restoration (a) of MPS form, and (b) advancing a projected training input before optimizing the tensors at the next bond. In diagram (a), if the label index \u2113 was on the site j tensor before forming B^{\u2113}, then the operation shown moves the label to site j + 1.\n\nThe scaling of the above algorithm is d^3 m^3 N N_L N_T, where recall m is the typical MPS bond dimension; N the number of components of input vectors x; N_L the number of labels; and N_T the size of the training data set. Thus the algorithm scales linearly in the training set size: a major improvement over kernel-trick methods, which typically scale at least as N_T^2 without specialized techniques [23]. This scaling assumes that the MPS bond dimension m needed is independent of N_T, which should be satisfied once N_T is a large, representative sample.\n\nIn practice, the training cost is dominated by the large size of the training set N_T, so it would be very desirable to reduce this cost. One solution could be to use stochastic gradient descent, but our experiments at blending this approach with the MPS sweeping algorithm did not match the accuracy of using the full, or batch, gradient. Mixing stochastic gradient descent with MPS sweeping thus appears to be non-trivial but is a promising direction for further research.\n\n5 MNIST Handwritten Digit Test\n\nTo test the tensor network approach on a realistic task, we used the MNIST data set [24]. 
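The image preprocessing used in this section, averaging non-overlapping 2 \u00d7 2 pixel blocks to reduce 28 \u00d7 28 grayscale images to 14 \u00d7 14, can be sketched as (an illustrative NumPy helper, not taken from the released codes):

```python
import numpy as np

def downscale(img):
    # Average non-overlapping 2x2 pixel blocks: (28, 28) -> (14, 14).
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

# A synthetic grayscale image with values in [0, 1], standing in for MNIST.
img = np.arange(28 * 28, dtype=float).reshape(28, 28) / (28 * 28 - 1)
small = downscale(img)
order = small.reshape(-1)  # row-by-row raster over the 14x14 grid
```

The flattened `order` vector follows a row-by-row raster: the first row of pixels gives the first 14 entries, the second row the next 14, and so on, in the spirit of the one-dimensional ordering described in the text.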
Each image was scaled down from 28 \u00d7 28 to 14 \u00d7 14 by averaging clusters of four pixels; otherwise we performed no further modifications to the training or test sets. Working with smaller images reduced the time needed for training, with the tradeoff of having less information available for learning.\n\nWhen approximating the weight tensor as an MPS, one must choose a one-dimensional ordering of the local indices s_1, s_2, . . . , s_N. We chose a \u201czig-zag\u201d ordering, meaning the first row of pixels is mapped to the first 14 external MPS indices; the second row to the next 14 MPS indices; etc. We then mapped each grayscale image x to a tensor \u03a6(x) using the local map Eq. (3).\n\nUsing the sweeping algorithm in Section 4 to optimize the weights, we found the algorithm quickly converged after a few passes, or sweeps, over the MPS. Typically five or fewer sweeps were needed to see good convergence, with test error rates changing only by hundredths of a percent thereafter.\n\nTest error rates also decreased rapidly with the maximum MPS bond dimension m. For m = 10 we found both a training and test error of about 5%; for m = 20 the error dropped to only 2%. The largest bond dimension we tried was m = 120, where after three sweeps we obtained a test error of 0.97%; the corresponding training set error was 0.05%. MPS bond dimensions in physics applications can reach many hundreds or even thousands, so it is remarkable to see such small classification errors for only m = 120.\n\n6 Interpreting Tensor Network Models\n\nA natural question is which set of functions of the form f^{\u2113}(x) = W^{\u2113} \u00b7 \u03a6(x) can be realized when using a tensor-product feature map \u03a6(x) of the form Eq. (2) and a tensor-network decomposition of W^{\u2113}. 
As we will argue, the possible set of functions is quite general, but taking the tensor network structure into account provides additional insights, such as determining which features the model actually uses to perform classification.\n\nFigure 8: (a) Decomposition of W^{\u2113} as an MPS with a central tensor and orthogonal site tensors. (b) Orthogonality conditions for U and V type site tensors. (c) Transformation defining a reduced feature map \u02dc\u03a6(x).\n\n6.1 Representational Power\n\nTo simplify the question of which decision functions can be realized for a tensor-product feature map of the form Eq. (2), let us fix \u2113 to a single label and omit it from the notation. We will also temporarily consider W to be a completely general order-N tensor with no tensor network constraint. Then f(x) is a function of the form\n\nf(x) = \u2211_{{s}} W_{s_1 s_2 \u00b7\u00b7\u00b7 s_N} \u03c6^{s_1}(x_1) \u2297 \u03c6^{s_2}(x_2) \u2297 \u00b7\u00b7\u00b7 \u2297 \u03c6^{s_N}(x_N) .   (9)\n\nIf the functions {\u03c6^{s}(x)}, s = 1, 2, . . . , d form a basis for a Hilbert space of functions over x \u2208 [0, 1], then the tensor product basis \u03c6^{s_1}(x_1) \u2297 \u03c6^{s_2}(x_2) \u2297 \u00b7\u00b7\u00b7 \u2297 \u03c6^{s_N}(x_N) forms a basis for a Hilbert space of functions over x \u2208 [0, 1]^{\u00d7N}. Moreover, in the limit that the basis {\u03c6^{s}(x)} becomes complete, the tensor product basis would also be complete and f(x) could be any square integrable function; however, practically reaching this limit would eventually require prohibitively large tensor dimensions.\n\n6.2 Implicit Feature Selection\n\nOf course we have not been considering an arbitrary weight tensor W^{\u2113} but instead approximating the weight tensor as an MPS tensor network. 
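For contrast with the MPS-constrained case, the fully general evaluation of Eq. (9) can be written out for a tiny example, making the exponential sum over the d^N tensor-product basis explicit (illustrative NumPy with d = 2 and N = 3; the weight tensor is random):

```python
import numpy as np

rng = np.random.default_rng(4)
N, d = 3, 2

def phi(x):
    # Local basis functions of Eq. (3).
    return np.array([np.cos(np.pi * x / 2), np.sin(np.pi * x / 2)])

W = rng.normal(size=(d,) * N)  # general order-N weight tensor, no MPS constraint
x = rng.uniform(size=N)

# Eq. (9) as one contraction: f(x) = sum_s W_s phi^{s_1}(x_1)...phi^{s_N}(x_N)
f = np.einsum('pqr,p,q,r->', W, phi(x[0]), phi(x[1]), phi(x[2]))

# The same number via an explicit sum over all d**N tensor-product basis terms.
f_sum = 0.0
for s in np.ndindex(*W.shape):
    term = W[s]
    for j, sj in enumerate(s):
        term = term * phi(x[j])[sj]
    f_sum += term
```

The explicit loop touches all d^N basis functions, which is exactly the cost the MPS parameterization avoids while restricting W to a structured subset of tensors.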
The MPS form implies that the decision function f^{\u2113}(x) has interesting additional structure. One way to analyze this structure is to separate the MPS into a central tensor, or core tensor, C^{\u03b1_i \u2113 \u03b1_{i+1}} on some bond i and constrain all MPS site tensors to be left orthogonal for sites j \u2264 i or right orthogonal for sites j \u2265 i. This means W^{\u2113} has the decomposition\n\nW^{\u2113}_{s_1 s_2 \u00b7\u00b7\u00b7 s_N} = \u2211_{{\u03b1}} U^{\u03b1_1}_{s_1} \u00b7\u00b7\u00b7 U^{\u03b1_i}_{\u03b1_{i\u22121} s_i} C^{\u2113}_{\u03b1_i \u03b1_{i+1}} V^{\u03b1_{i+1}}_{s_{i+1} \u03b1_{i+2}} \u00b7\u00b7\u00b7 V^{\u03b1_{N\u22121}}_{s_N}   (10)\n\nas illustrated in Fig. 8(a). To say the U and V tensors are left or right orthogonal means that when viewed as matrices U_{\u03b1_{j\u22121} s_j}^{\u03b1_j} and V^{\u03b1_{j\u22121}}_{s_j \u03b1_j}, these tensors have the property U\u2020U = I and V V\u2020 = I where I is the identity; these orthogonality conditions can be understood more clearly in terms of the diagrams in Fig. 8(b). Any MPS can be brought into the form Eq. (10) through an efficient sequence of tensor contractions and SVD operations similar to the steps in Fig. 7(a).\n\nThe form in Eq. (10) suggests an interpretation where the decision function f^{\u2113}(x) acts in three stages. First, an input x is mapped into the d^N-dimensional feature space defined by \u03a6(x), which is exponentially larger than the dimension N of the input space. Next, the feature vector \u03a6 is mapped into a much smaller m^2-dimensional space by contraction with all the U and V site tensors of the MPS. This second step defines a new feature map \u02dc\u03a6(x) with m^2 components, as illustrated in Fig. 
8(c). Finally, f^{\u2113}(x) is computed by contracting \u02dc\u03a6(x) with C^{\u2113}.\n\nTo justify calling \u02dc\u03a6(x) a feature map, it follows from the left- and right-orthogonality conditions of the U and V tensors of the MPS Eq. (10) that the indices \u03b1_i and \u03b1_{i+1} of the core tensor C label an orthonormal basis for a subspace of the original feature space. The vector \u02dc\u03a6(x) is the projection of \u03a6(x) into this subspace.\n\nThe above interpretation implies that training an MPS model uncovers a relatively small set of important features and simultaneously trains a decision function using only these reduced features. The feature selection step occurs when computing the SVD in Eq. (8), where any basis elements \u03b1_j which do not contribute meaningfully to the optimal bond tensor are discarded. (In our MNIST experiment the first and last tensors of the MPS completely factorized during training, implying they were not useful for classification, as the pixels at the corners of each image were always white.) Such a picture is roughly similar to popular interpretations of simultaneously training the hidden and output layers of shallow neural network models [25]. (MPS were first proposed for learning features in Bengua et al. [4], but with a different, lower-dimensional data representation than what is used here.)\n\n7 Discussion\n\nWe have introduced a framework for applying quantum-inspired tensor networks to supervised learning tasks. While using an MPS ansatz for the model parameters worked well even for the two-dimensional data in our MNIST experiment, other tensor networks such as PEPS [6], which are explicitly designed for two-dimensional systems, or MERA tensor networks [15], which have a multi-scale structure and can capture power-law correlations, may be more suitable and offer superior performance. 
Much work remains to determine the best tensor network for a given domain.\n\nThere is also much room to improve the optimization algorithm by incorporating standard techniques such as mini-batches, momentum, or adaptive learning rates. It would be especially interesting to investigate unsupervised techniques for initializing the tensor network. Additionally, while the tensor network parameterization of a model clearly regularizes it in the sense of reducing the number of parameters, it would be helpful to understand the consequences of this regularization for specific learning tasks. It could also be fruitful to include standard regularizations of the parameters of the tensor network, such as weight decay or L1 penalties. We were surprised to find good generalization without using explicit parameter regularization.\n\nWe anticipate models incorporating tensor networks will continue to be successful for quite a large variety of learning tasks because of their treatment of high-order correlations between features and their ability to be adaptively optimized. With the additional opportunities they present for interpretation of trained models due to the internal, linear tensor network structure, we believe there are many promising research directions for tensor network models.\n\nNote: while we were preparing our final manuscript, Novikov et al. [26] published a related framework for using MPS (tensor trains) to parameterize supervised learning models.\n\nReferences\n\n[1] Animashree Anandkumar, Rong Ge, Daniel Hsu, Sham M. Kakade, and Matus Telgarsky. Tensor decompositions for learning latent variable models. Journal of Machine Learning Research, 15:2773\u20132832, 2014.\n\n[2] Animashree Anandkumar, Rong Ge, Daniel Hsu, and Sham M. Kakade. A tensor approach to learning mixed membership community models. J. Mach. Learn. Res., 15(1):2239\u20132312, January 2014. ISSN 1532-4435.\n\n[3] Anh Huy Phan and Andrzej Cichocki. 
Tensor decompositions for feature extraction and classification of high dimensional datasets. Nonlinear theory and its applications, IEICE, 1(1):37\u201368, 2010.\n\n[4] J.A. Bengua, H.N. Phien, and H.D. Tuan. Optimal feature extraction and classification of tensors via matrix product state decomposition. In 2015 IEEE Intl. Congress on Big Data (BigData Congress), pages 669\u2013672, June 2015.\n\n[5] Alexander Novikov, Dmitry Podoprikhin, Anton Osokin, and Dmitry Vetrov. Tensorizing neural networks. arxiv:1509.06569, 2015.\n\n[6] Glen Evenbly and Guifr\u00e9 Vidal. Tensor network states and geometry. Journal of Statistical Physics, 145:891\u2013918, 2011.\n\n[7] Jacob C. Bridgeman and Christopher T. Chubb. Hand-waving and interpretive dance: An introductory course on tensor networks. arxiv:1603.03039, 2016.\n\n[8] U. Schollw\u00f6ck. The density-matrix renormalization group in the age of matrix product states. Annals of Physics, 326(1):96\u2013192, 2011.\n\n[9] K. R. Muller, S. Mika, G. Ratsch, K. Tsuda, and B. Scholkopf. An introduction to kernel-based learning algorithms. IEEE Transactions on Neural Networks, 12(2):181\u2013201, Mar 2001.\n\n[10] E. M. Stoudenmire and Steven R. White. Real-space parallel density matrix renormalization group. Phys. Rev. B, 87:155137, Apr 2013.\n\n[11] Stellan \u00d6stlund and Stefan Rommer. Thermodynamic limit of density matrix renormalization. Phys. Rev. Lett., 75(19):3537\u20133540, Nov 1995.\n\n[12] I. Oseledets. Tensor-train decomposition. SIAM Journal on Scientific Computing, 33(5):2295\u20132317, 2011.\n\n[13] F. Verstraete and J. I. Cirac. Renormalization algorithms for quantum-many body systems in two and higher dimensions. cond-mat/0407066, 2004.\n\n[14] Guifr\u00e9 Vidal. Entanglement renormalization. Phys. Rev. Lett., 99(22):220405, Nov 2007.\n\n[15] Glen Evenbly and Guifr\u00e9 Vidal. Algorithms for entanglement renormalization. Phys. Rev. 
B, 79:144108, Apr 2009.\n\n[16] Andrzej Cichocki. Tensor networks for big data analytics and large-scale optimization problems. arxiv:1407.3124, 2014.\n\n[17] Steven R. White. Density matrix formulation for quantum renormalization groups. Phys. Rev. Lett., 69(19):2863\u20132866, 1992.\n\n[18] Sebastian Holtz, Thorsten Rohwedder, and Reinhold Schneider. The alternating linear scheme for tensor optimization in the tensor train format. SIAM Journal on Scientific Computing, 34(2):A683\u2013A713, 2012.\n\n[19] ITensor Library (version 2.0.11). http://itensor.org/.\n\n[20] Vladimir Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag New York, 2000.\n\n[21] W. Waegeman, T. Pahikkala, A. Airola, T. Salakoski, M. Stock, and B. De Baets. A kernel-based framework for learning graded relations from data. Fuzzy Systems, IEEE Transactions on, 20(6):1090\u20131101, Dec 2012.\n\n[22] F. Verstraete, D. Porras, and J. I. Cirac. Density matrix renormalization group and periodic boundary conditions: A quantum information perspective. Phys. Rev. Lett., 93(22):227205, Nov 2004.\n\n[23] N. Cesa-Bianchi, Y. Mansour, and O. Shamir. On the complexity of learning with kernels. Proceedings of The 28th Conference on Learning Theory, pages 297\u2013325, 2015.\n\n[24] Yann LeCun, Corinna Cortes, and Christopher J.C. Burges. MNIST handwritten digit database. http://yann.lecun.com/exdb/mnist/.\n\n[25] Michael Nielsen. Neural Networks and Deep Learning. Determination Press, 2015.\n\n[26] Alexander Novikov, Mikhail Trofimov, and Ivan Oseledets. Exponential machines. arxiv:1605.03795, 2016.", "award": [], "sourceid": 2436, "authors": [{"given_name": "Edwin", "family_name": "Stoudenmire", "institution": "Univ of California Irvine"}, {"given_name": "David", "family_name": "Schwab", "institution": "Northwestern University"}]}