{"title": "Pre-training of Recurrent Neural Networks via Linear Autoencoders", "book": "Advances in Neural Information Processing Systems", "page_first": 3572, "page_last": 3580, "abstract": "We propose a pre-training technique for recurrent neural networks based on linear autoencoder networks for sequences, i.e. linear dynamical systems modelling the target sequences. We start by giving a closed form solution for the definition of the optimal weights of a linear autoencoder given a training set of sequences. This solution, however, is computationally very demanding, so we suggest a procedure to get an approximate solution for a given number of hidden units. The weights obtained for the linear autoencoder are then used as initial weights for the input-to-hidden connections of a recurrent neural network, which is then trained on the desired task. Using four well known datasets of sequences of polyphonic music, we show that the proposed pre-training approach is highly effective, since it allows to largely improve the state of the art results on all the considered datasets.", "full_text": "Pre-training of Recurrent Neural Networks via\n\nLinear Autoencoders\n\nLuca Pasa, Alessandro Sperduti\n\nDepartment of Mathematics\nUniversity of Padova, Italy\n\n{pasa,sperduti}@math.unipd.it\n\nAbstract\n\nWe propose a pre-training technique for recurrent neural networks based on linear\nautoencoder networks for sequences, i.e. linear dynamical systems modelling the\ntarget sequences. We start by giving a closed form solution for the de\ufb01nition of\nthe optimal weights of a linear autoencoder given a training set of sequences. This\nsolution, however, is computationally very demanding, so we suggest a procedure\nto get an approximate solution for a given number of hidden units. The weights\nobtained for the linear autoencoder are then used as initial weights for the input-\nto-hidden connections of a recurrent neural network, which is then trained on the\ndesired task. 
Using four well known datasets of sequences of polyphonic music,\nwe show that the proposed pre-training approach is highly effective, since it allows\nto largely improve the state of the art results on all the considered datasets.\n\n1 Introduction\n\nRecurrent Neural Networks (RNN) constitute a powerful computational tool for sequence modelling and prediction [1]. However, training an RNN is not an easy task, mainly because of the well known vanishing gradient problem, which makes it difficult to learn long-term dependencies [2]. Although alternative architectures, e.g. LSTM networks [3], and more efficient training procedures, such as Hessian Free Optimization [4], have been proposed to circumvent this problem, reliable and effective training of RNNs is still an open problem.\nThe vanishing gradient problem is also an obstacle to Deep Learning, e.g., [5, 6, 7]. In that context, there is growing evidence that effective learning should be based on relevant and robust internal representations developed autonomously by the learning system. This is usually achieved in vectorial spaces by exploiting nonlinear autoencoder networks to learn rich internal representations of input data, which are then used as input to shallow neural classifiers or predictors (see, for example, [8]). The importance of starting gradient-based learning from a good initial point in the parameter space has also been pointed out in [9]. The relationship between autoencoder networks and Principal Component Analysis (PCA) [10] has been well known since the late 1980s, especially in the case of linear hidden units [11, 12]. 
More recently, linear autoencoder networks for structured data have been studied in [13, 14, 15], where an exact closed-form solution for the weights is given in the case of a number of hidden units equal to the rank of the full data matrix.\nIn this paper, we borrow the conceptual framework presented in [13, 16] to devise an effective pre-training approach, based on linear autoencoder networks for sequences, to get a good starting point in the weight space of an RNN, which can then be successfully trained even in the presence of long-term dependencies. Specifically, we revise the theoretical approach presented in [13] by: i) giving a simpler and more direct solution to the problem of devising an exact closed-form solution (full rank case) for the weights of a linear autoencoder network for sequences, highlighting the relationship between the proposed solution and PCA of the input data; ii) introducing a new formulation of the autoencoder learning problem able to return an optimal solution also when the number of hidden units is less than the rank of the full data matrix; iii) proposing a procedure for approximate learning of the autoencoder network weights under the scenario of very large sequence datasets. More importantly, we show how to use the linear autoencoder network solution to derive a good initial point in the RNN weight space, and how the proposed approach returns quite impressive results when applied to prediction tasks involving long sequences of polyphonic music.\n\n2 Linear Autoencoder Networks for Sequences\n\nIn [11, 12] it is shown that the principal directions of a set of vectors $x_i \\in \\mathbb{R}^k$ are related to solutions obtained by training linear autoencoder networks\n\n$$o_i = W_{\\text{output}} W_{\\text{hidden}} x_i, \\quad i = 1, \\ldots, n, \\qquad (1)$$\n\nwhere $W_{\\text{hidden}} \\in \\mathbb{R}^{p \\times k}$, $W_{\\text{output}} \\in \\mathbb{R}^{k \\times p}$, $p \\ll k$, and the network is trained so as to get $o_i = x_i$, $\\forall i$. When considering a temporal sequence $x_1, x_2, \\ldots, x_t, \\ldots$ 
of input vectors, where $t$ is a discrete time index, a linear autoencoder can be defined by considering the coupled linear dynamical systems\n\n$$y_t = A x_t + B y_{t-1} \\qquad (2)$$\n\n$$\\begin{bmatrix} x_t \\\\ y_{t-1} \\end{bmatrix} = C y_t \\qquad (3)$$\n\nIt should be noticed that eqs. (2) and (3) extend the linear transformation defined in eq. (1) by introducing a memory term involving matrix $B \\in \\mathbb{R}^{p \\times p}$. In fact, $y_{t-1}$ is inserted in the right-hand side of equation (2) to keep track of the input history through time: this is done by exploiting a state space representation. Eq. (3) represents the decoding part of the autoencoder: when a state $y_t$ is multiplied by $C$, the observed input $x_t$ at time $t$ and the state at time $t-1$, i.e. $y_{t-1}$, are generated. Decoding can then continue from $y_{t-1}$. This formulation has been proposed, for example, in [17], where an iterative procedure to learn weight matrices $A$ and $B$, based on Oja\u2019s rule, is presented. No proof of convergence for the proposed procedure is however given. More recently, an exact closed-form solution for the weights has been given in the case of a number of hidden units equal to the rank of the full data matrix (full rank case) [13, 16]. In this section, we revise this result. In addition, we give an exact solution also for the case in which the number of hidden units is strictly less than the rank of the full data matrix.\nThe basic idea of [13, 16] is to look for directions of high variance in the state space of the dynamical linear system (2). Let us start by considering a single sequence $x_1, x_2, \\ldots, x_t, \\ldots, x_n$ and the state vectors of the corresponding induced state sequence collected as rows of a matrix $Y = [y_1, y_2, y_3, \\cdots, y_n]^T$. 
By using the initial condition $y_0 = 0$ (the null vector), and the dynamical linear system (2), we can rewrite the $Y$ matrix as\n\n$$Y = \\underbrace{\\begin{bmatrix} x_1^T & 0 & 0 & \\cdots & 0 & 0 \\\\ x_2^T & x_1^T & 0 & \\cdots & 0 & 0 \\\\ x_3^T & x_2^T & x_1^T & \\cdots & 0 & 0 \\\\ \\vdots & \\vdots & \\vdots & \\ddots & \\vdots & \\vdots \\\\ x_n^T & x_{n-1}^T & x_{n-2}^T & \\cdots & x_2^T & x_1^T \\end{bmatrix}}_{\\Xi} \\underbrace{\\begin{bmatrix} A^T \\\\ A^T B^T \\\\ A^T B^{2T} \\\\ \\vdots \\\\ A^T B^{(n-1)T} \\end{bmatrix}}_{\\Omega}$$\n\nwhere, given $s = kn$, $\\Xi \\in \\mathbb{R}^{n \\times s}$ is a data matrix collecting all the (inverted) input subsequences (including the whole sequence) as rows, and $\\Omega$ is the parameter matrix of the dynamical system.\nNow, we are interested in using a state space of dimension $p \\ll n$, i.e. $y_t \\in \\mathbb{R}^p$, such that as much information as contained in $\\Omega$ is preserved. We start by factorizing $\\Xi$ using SVD, obtaining $\\Xi = V \\Lambda U^T$ where $V \\in \\mathbb{R}^{n \\times n}$ is a unitary matrix, $\\Lambda \\in \\mathbb{R}^{n \\times s}$ is a rectangular diagonal matrix with nonnegative real numbers on the diagonal with $\\lambda_{1,1} \\geq \\lambda_{2,2} \\geq \\cdots \\geq \\lambda_{n,n}$ (the singular values), and $U^T \\in \\mathbb{R}^{s \\times s}$ is a unitary matrix.\nIt is important to notice that columns of $U^T$ which correspond to nonzero singular values, apart from some mathematical technicalities, basically correspond to the principal directions of data, i.e. PCA. If the rank of $\\Xi$ is $p$, then only the first $p$ elements of the diagonal of $\\Lambda$ are not null, and the above decomposition can be reduced to $\\Xi = V^{(p)} \\Lambda^{(p)} U^{(p)T}$ where $V^{(p)} \\in \\mathbb{R}^{n \\times p}$, $\\Lambda^{(p)} \\in \\mathbb{R}^{p \\times p}$, and $U^{(p)T} \\in \\mathbb{R}^{p \\times s}$. 
Now we can observe that $U^{(p)T} U^{(p)} = I$ (where $I$ is the identity matrix of dimension $p$), since by definition the columns of $U^{(p)}$ are orthogonal, and by imposing $\\Omega = U^{(p)}$, we can derive \u201coptimal\u201d matrices $A \\in \\mathbb{R}^{p \\times k}$ and $B \\in \\mathbb{R}^{p \\times p}$ for our dynamical system, which will have corresponding state space matrix $Y^{(p)} = \\Xi \\Omega = \\Xi U^{(p)} = V^{(p)} \\Lambda^{(p)} U^{(p)T} U^{(p)} = V^{(p)} \\Lambda^{(p)}$. Thus, if we represent $U^{(p)}$ as composed of $n$ submatrices $U_i^{(p)}$, each of size $k \\times p$, the problem reduces to finding matrices $A$ and $B$ such that\n\n$$\\Omega = \\begin{bmatrix} A^T \\\\ A^T B^T \\\\ A^T B^{2T} \\\\ \\vdots \\\\ A^T B^{(n-1)T} \\end{bmatrix} = \\begin{bmatrix} U_1^{(p)} \\\\ U_2^{(p)} \\\\ U_3^{(p)} \\\\ \\vdots \\\\ U_n^{(p)} \\end{bmatrix} = U^{(p)}. \\qquad (4)$$\n\nThe reason to impose $\\Omega = U^{(p)}$ is to get a state space where the coordinates are uncorrelated, so as to diagonalise the empirical sample covariance matrix of the states. Note that in this way each state (i.e., row of the $Y$ matrix) corresponds to a row of the data matrix $\\Xi$, i.e. the unrolled (sub)sequence read up to a given time $t$. If the rows of $\\Xi$ were vectors, this would correspond to computing PCA, keeping only the first $p$ principal directions.\nIn the following, we demonstrate that there exists a solution to the above equation. We start by observing that $\\Xi$ has a special structure, i.e. given $\\Xi = [\\Xi_1 \\; \\Xi_2 \\; \\cdots \\; \\Xi_n]$, where $\\Xi_i \\in \\mathbb{R}^{n \\times k}$, then for $i = 1, \\ldots, n-1$,\n\n$$\\Xi_{i+1} = R_n \\Xi_i = \\begin{bmatrix} 0_{1 \\times (n-1)} & 0_{1 \\times 1} \\\\ I_{(n-1) \\times (n-1)} & 0_{(n-1) \\times 1} \\end{bmatrix} \\Xi_i,$$\n\nand $R_n \\Xi_n = 0$, i.e. the null matrix of size $n \\times k$. Moreover, by singular value decomposition, we have $\\Xi_i = V^{(p)} \\Lambda^{(p)} U_i^{(p)T}$, for $i = 1, \\ldots, n$. Using the fact that $V^{(p)T} V^{(p)} = I$, and combining the above equations, we get $U_{i+t}^{(p)} = U_i^{(p)} Q^t$, for $i = 1, \\ldots, n-1$, and $t = 1, \\ldots, n-i$, where $Q = \\Lambda^{(p)} V^{(p)T} R_n^T V^{(p)} \\Lambda^{(p)-1}$. Moreover, we have that $U_n^{(p)} Q = 0$ since\n\n$$U_n^{(p)} Q = U_n^{(p)} \\Lambda^{(p)} V^{(p)T} R_n^T V^{(p)} \\Lambda^{(p)-1} = (\\underbrace{R_n \\Xi_n}_{=0})^T V^{(p)} \\Lambda^{(p)-1}.$$\n\nThus, eq. (4) is satisfied by $A = U_1^{(p)T}$ and $B = Q^T$. It is interesting to note that the original data $\\Xi$ can be recovered by computing $Y^{(p)} U^{(p)T} = V^{(p)} \\Lambda^{(p)} U^{(p)T} = \\Xi$, which can be achieved by running the system\n\n$$\\begin{bmatrix} x_t \\\\ y_{t-1} \\end{bmatrix} = \\begin{bmatrix} A^T \\\\ B^T \\end{bmatrix} y_t$$\n\nstarting from $y_n$, i.e. $\\begin{bmatrix} A^T \\\\ B^T \\end{bmatrix}$ is the matrix $C$ defined in eq. (3).\n\nFinally, it is important to remark that the above construction works not only for a single sequence, but also for a set of sequences of different length. For example, let us consider the two sequences $(x_1^a, x_2^a, x_3^a)$ and $(x_1^b, x_2^b)$. Then, we have\n\n$$\\Xi_a = \\begin{bmatrix} x_1^{aT} & 0 & 0 \\\\ x_2^{aT} & x_1^{aT} & 0 \\\\ x_3^{aT} & x_2^{aT} & x_1^{aT} \\end{bmatrix} \\quad \\text{and} \\quad \\Xi_b = \\begin{bmatrix} x_1^{bT} & 0 \\\\ x_2^{bT} & x_1^{bT} \\end{bmatrix}$$\n\nwhich can be collected together to obtain $\\Xi = \\begin{bmatrix} \\Xi_a \\\\ \\Xi_b \\;\\; 0_{2 \\times 1} \\end{bmatrix}$, and $R = \\begin{bmatrix} R_4 \\\\ R_2 \\;\\; 0_{2 \\times 1} \\end{bmatrix}$.\n\nAs a final remark, it should be stressed that the above construction only works if $p$ is equal to the rank of $\\Xi$. In the next section, we treat the case in which $p < \\mathrm{rank}(\\Xi)$.\n\n2.1 Optimal solution for low dimensional autoencoders\n\nWhen $p < \\mathrm{rank}(\\Xi)$ the solution given above breaks down because $\\tilde{\\Xi}_i = V^{(p)} \\Lambda^{(p)} U_i^{(p)T} \\neq \\Xi_i$, and consequently $\\tilde{\\Xi}_{i+1} \\neq R_n \\tilde{\\Xi}_i$. 
So the question is whether the proposed solutions for $A$ and $B$ still yield the best reconstruction error when $p < \\mathrm{rank}(\\Xi)$.\nIn this paper, we answer this question in the negative by resorting to a new formulation of our problem where we introduce slack-like matrices $E_i^{(p)} \\in \\mathbb{R}^{k \\times p}$, $i = 1, \\ldots, n+1$, collecting the reconstruction errors, which need to be minimised:\n\n$$\\min_{Q \\in \\mathbb{R}^{p \\times p},\\, E_i^{(p)}} \\sum_{i=1}^{n+1} \\|E_i^{(p)}\\|_F^2 \\quad \\text{subject to:} \\quad \\begin{bmatrix} U_1^{(p)} + E_1^{(p)} \\\\ U_2^{(p)} + E_2^{(p)} \\\\ U_3^{(p)} + E_3^{(p)} \\\\ \\vdots \\\\ U_n^{(p)} + E_n^{(p)} \\end{bmatrix} Q = \\begin{bmatrix} U_2^{(p)} + E_2^{(p)} \\\\ U_3^{(p)} + E_3^{(p)} \\\\ \\vdots \\\\ U_n^{(p)} + E_n^{(p)} \\\\ E_{n+1}^{(p)} \\end{bmatrix} \\qquad (5)$$\n\nNotice that the problem above is convex both in the objective function and in the constraints; thus it only has global optimal solutions $E_i^*$ and $Q^*$, from which we can derive $A^T = U_1^{(p)} + E_1^*$ and $B^T = Q^*$. Specifically, when $p = \\mathrm{rank}(\\Xi)$, $R_{s,k}^T U^{(p)}$ is in the span of $U^{(p)}$ and the optimal solution is given by $E_i^* = 0_{k \\times p}$ $\\forall i$, and $Q^* = U^{(p)T} R_{s,k}^T U^{(p)}$, i.e. the solution we have already described. If $p < \\mathrm{rank}(\\Xi)$, the optimal solution cannot have $E_i^* = 0_{k \\times p}$ $\\forall i$. However, it is not difficult to devise an iterative procedure to reach the minimum. Since in the experimental section we do not exploit the solution to this problem for reasons that we will explain later, here we just sketch such a procedure. It helps to observe that, given a fixed $Q$, the optimal solution for $E_i^{(p)}$ is given by\n\n$$[\\tilde{E}_1^{(p)}, \\tilde{E}_2^{(p)}, \\ldots, \\tilde{E}_{n+1}^{(p)}] = [U_1^{(p)} Q - U_2^{(p)}, \\; U_1^{(p)} Q^2 - U_3^{(p)}, \\; U_1^{(p)} Q^3 - U_4^{(p)}, \\ldots] \\, M_Q^+$$\n\nwhere $M_Q^+$ is the pseudoinverse of\n\n$$M_Q = \\begin{bmatrix} -Q & -Q^2 & -Q^3 & \\cdots \\\\ I & 0 & 0 & \\cdots \\\\ 0 & I & 0 & \\cdots \\\\ 0 & 0 & I & \\cdots \\\\ \\vdots & \\vdots & \\vdots & \\ddots \\end{bmatrix}.$$\n\nIn general, $\\tilde{E}^{(p)} = [\\tilde{E}_1^{(p)T}, \\tilde{E}_2^{(p)T}, \\tilde{E}_3^{(p)T}, \\cdots, \\tilde{E}_n^{(p)T}]^T$ can be decomposed into a component in the span of $U^{(p)}$ and a component $E^{(p)\\perp}$ orthogonal to it. Notice that $E^{(p)\\perp}$ cannot be reduced, while (part of) the other component can be absorbed into $Q$ by defining $\\tilde{U}^{(p)} = U^{(p)} + E^{(p)\\perp}$ and taking\n\n$$\\tilde{Q} = (\\tilde{U}^{(p)})^+ [\\tilde{U}_2^{(p)T}, \\tilde{U}_3^{(p)T}, \\cdots, \\tilde{U}_n^{(p)T}, E_{n+1}^{(p)T}]^T.$$\n\nGiven $\\tilde{Q}$, the new optimal values for $E_i^{(p)}$ are obtained and the process is iterated till convergence.\n\n3 Pre-training of Recurrent Neural Networks\n\nHere we define our pre-training procedure for recurrent neural networks with one hidden layer of $p$ units, and $O$ output units:\n\n$$o_t = \\sigma(W_{\\text{output}} h(x_t)) \\in \\mathbb{R}^O, \\quad h(x_t) = \\sigma(W_{\\text{input}} x_t + W_{\\text{hidden}} h(x_{t-1})) \\in \\mathbb{R}^p \\qquad (6)$$\n\nwhere $W_{\\text{output}} \\in \\mathbb{R}^{O \\times p}$, $W_{\\text{input}} \\in \\mathbb{R}^{p \\times k}$, $W_{\\text{hidden}} \\in \\mathbb{R}^{p \\times p}$, for a vector $z \\in \\mathbb{R}^m$, $\\sigma(z) = [\\sigma(z_1), \\ldots, \\sigma(z_m)]^T$, and here we consider the symmetric sigmoid function $\\sigma(z_i) = \\frac{1 - e^{-z_i}}{1 + e^{-z_i}}$.\nThe idea is to exploit the hidden state representation obtained by eq. (2) as initial hidden state representation for the RNN described by eq. (6). This is implemented by initialising the weight matrices $W_{\\text{input}}$ and $W_{\\text{hidden}}$ of (6) by using the matrices that jointly solve eq. (2) and eq. (3), i.e. $A$ and $B$ (since $C$ is a function of $A$ and $B$). Specifically, we initialise $W_{\\text{input}}$ with $A$, and $W_{\\text{hidden}}$ with $B$. 
Moreover, the use of symmetrical sigmoidal functions, which give a very good approximation of the identity function around the origin, allows a good transfer of the linear dynamics into the RNN. As for $W_{\\text{output}}$, we initialise it by using the best possible solution, i.e. the pseudoinverse of $H$ times the target matrix $T$, which minimises the output squared error. Learning is then used to introduce nonlinear components that improve the performance of the model.\nMore formally, let us consider a prediction task where for each sequence $s^q \\equiv (x_1^q, x_2^q, \\ldots, x_{l_q}^q)$ of length $l_q$ in the training set, a sequence $t^q$ of target vectors is defined, i.e. a training sequence is given by $\\langle s^q, t^q \\rangle \\equiv \\langle (x_1^q, t_1^q), (x_2^q, t_2^q), \\ldots, (x_{l_q}^q, t_{l_q}^q) \\rangle$, where $t_i^q \\in \\mathbb{R}^O$. Given a training set with $N$ sequences, let us define the target matrix $T \\in \\mathbb{R}^{L \\times O}$, where $L = \\sum_{q=1}^{N} l_q$, as $T = [t_1^1, t_2^1, \\ldots, t_{l_1}^1, t_1^2, \\ldots, t_{l_N}^N]^T$. The input matrix $\\Xi$ will have size $L \\times k$. Let $p^*$ be the desired number of hidden units for the recurrent neural network (RNN). Then the pre-training procedure can be defined as follows: i) compute the linear autoencoder for $\\Xi$ using $p^*$ principal directions, obtaining the optimal matrices $A^* \\in \\mathbb{R}^{p^* \\times k}$ and $B^* \\in \\mathbb{R}^{p^* \\times p^*}$; ii) set $W_{\\text{input}} = A^*$ and $W_{\\text{hidden}} = B^*$; iii) run the RNN over the training sequences, collecting the hidden activity vectors (computed using symmetrical sigmoidal functions) over time as rows of matrix $H \\in \\mathbb{R}^{L \\times p^*}$; iv) set $W_{\\text{output}} = H^+ T$, where $H^+$ is the (left) pseudoinverse of $H$.\n\n3.1 Computing an approximate solution for large datasets\n\nIn real world scenarios the application of our approach may become difficult because of the size of the data matrix. In fact, stable computation of principal directions is usually obtained by SVD decomposition of the data matrix $\\Xi$, which in typical application domains involves a number of rows and columns easily of the order of hundreds of thousands. Unfortunately, the computational complexity of SVD decomposition is basically cubic in the smallest of the matrix dimensions. Memory consumption is also an important issue. Algorithms for approximate computation of SVD have been suggested (e.g., [18]); however, since for our purposes we just need matrices $V$ and $\\Lambda$ with a predefined number of columns (i.e. $p$), here we present an ad-hoc algorithm for approximate computation of these matrices. Our solution is based on the following four main ideas: i) divide $\\Xi$ into slices of $k$ (i.e., size of input at time $t$) columns, so as to exploit SVD decomposition at each slice separately; ii) compute approximate $V$ and $\\Lambda$ matrices, with $p$ columns, incrementally via truncated SVD of temporary matrices obtained by concatenating the current approximation of $V\\Lambda$ with a new slice; iii) compute the SVD decomposition of a temporary matrix via either its kernel or covariance matrix, depending on which is smaller, the number of rows or the number of columns of the temporary matrix; iv) exploit QR decomposition to compute SVD decomposition.\nAlgorithm 1 shows in pseudo-code the main steps of our procedure. It maintains a temporary matrix $T$ which is used to collect incrementally an approximation of the principal subspace of dimension $p$ of $\\Xi$. Initially (line 4) $T$ is set equal to the last slices of $\\Xi$, in a number sufficient to get a number of columns larger than $p$ (line 2). 
Matrices V and \u039b from the p-truncated SVD decomposition of T are computed (line 5) via the KECO procedure, described in Algorithm 2, and used to define a new T matrix by concatenation with the last unused slice of \u039e. When all slices are processed, the current V and \u039b matrices are returned. The KECO procedure, described in Algorithm 2, reduces the computational burden by computing the p-truncated SVD decomposition of the input matrix M via its kernel matrix (lines 3-4) if the number of rows of M is no larger than the number of columns; otherwise the covariance matrix is used (lines 6-8). In both cases, the p-truncated SVD decomposition is implemented via QR decomposition by the INDIRECTSVD procedure described in Algorithm 3. This reduces computation time when large matrices must be processed [19]. Finally, matrices V and $S_{sqr}^{1/2}$ (both kernel and covariance matrices have the squared singular values of M) are returned.\nWe process the slices of \u039e in reverse order since, moving towards columns with larger indices, the rank as well as the norm of the slices become smaller and smaller, thus giving less and less contribution to the principal subspace of dimension p. This should reduce the approximation error accumulated by dropping the components from p + 1 to p + k during computation [20]. As a final remark, we stress that since we compute an approximate solution for the principal directions of \u039e, it does not make much sense to solve the problem given in eq. (5): learning will quickly compensate for the approximations and/or sub-optimality of A and B obtained from matrices V and \u039b returned by Algorithm 1. 
Thus, these are the matrices we have used for the experiments described in the next section.\n\nAlgorithm 1 Approximated V and \u039b with p components\n 1: function SVFORBIGDATA(\u039e, k, p)\n 2:     nStart = \u2308p/k\u2309    \u25b7 Number of starting slices\n 3:     nSlice = (\u039e.columns/k) \u2212 nStart    \u25b7 Number of remaining slices\n 4:     T = \u039e[:, k \u2217 nSlice : \u039e.columns]\n 5:     V, \u039b = KECO(T, p)    \u25b7 Computation of V and \u039b for starting slices\n 6:     for i in REVERSED(range(nSlice)) do\n 7:         T = [\u039e[:, i \u2217 k : (i + 1) \u2217 k], V\u039b]\n 8:         V, \u039b = KECO(T, p)    \u25b7 Computation of V and \u039b for remaining slices\n 9:     end for\n10:     return V, \u039b\n11: end function\n\nAlgorithm 2 Kernel vs covariance computation\n 1: function KECO(M, p)\n 2:     if M.rows <= M.columns then\n 3:         K = M M^T\n 4:         V, S_sqr, U^T = INDIRECTSVD(K, p)\n 5:     else\n 6:         C = M^T M\n 7:         V, S_sqr, U^T = INDIRECTSVD(C, p)\n 8:         V = M U^T S_sqr^(\u22121/2)\n 9:     end if\n10:     return V, S_sqr^(1/2)\n11: end function\n\nAlgorithm 3 Truncated SVD by QR\n1: function INDIRECTSVD(M, p)\n2:     Q, R = QR(M)\n3:     V_r, S, U^T = SVD(R)\n4:     V = Q V_r\n5:     S = S[1 : p, 1 : p]\n6:     V = V[1 : p, :]\n7:     U^T = U^T[:, 1 : p]\n8:     return V, S, U^T\n9: end function\n\n4 Experiments\n\nIn order to evaluate our pre-training approach, we decided to use the four polyphonic music sequence datasets used in [21] for assessing the prediction abilities of the RNN-RBM model. The prediction task consists in predicting the notes played at time t given the sequence of notes played till time t \u2212 1. The RNN-RBM model achieves state-of-the-art results on this demanding prediction task. As performance measure we adopted the accuracy measure used in [21] and described in [22]. Each dataset is split into a training set, a validation set, and a test set. Statistics on the datasets, including largest sequence length, are given in columns 2-4 of Table 1. 
Each sequence in the dataset represents a song having a maximum polyphony of 15 notes (average 3.9); each time step input spans the whole range of the piano from A0 to C8 and is represented by using 88 binary values (i.e. k = 88).\nOur pre-training approach (PreT-RNN) has been assessed by using a different number of hidden units (i.e., p is set in turn to 50, 100, 150, 200, 250) and 5000 epochs of RNN training (footnote 1) using the Theano-based stochastic gradient descent software available at [23].\nRandom initialisation (Rnd) has also been used for networks with the same number of hidden units. Specifically, for networks with 50 hidden units, we have evaluated the performance of 6 different random initialisations. Finally, in order to verify that the nonlinearity introduced by the RNN is actually useful to solve the prediction task, we have also evaluated the performance of a network with linear units (250 hidden units) initialised with our pre-training procedure (PreT-Lin250).\nTo give an idea of the time performance of pre-training with respect to the training of an RNN, in column 5 of Table 1 we have reported the time in seconds needed to compute the pre-training matrices (Pre-) (on an Intel\u00ae Xeon\u00ae CPU E5-2670 @2.60GHz with 128 GB) and to perform training of an RNN with p = 50 for 5000 epochs (on an NVidia K20 GPU). 
Note that for larger values of p, the increase in computation time of pre-training is smaller than the increment in computation time needed for training an RNN.\n\nFootnote 1: Due to early overfitting, for the Muse dataset we used 1000 epochs.\n\nDataset | Set | # Samples | Max length | (Pre-)Training Time | Model | ACC% [21]\nNottingham | Training (39165 \u00d7 56408) | 195 | 641 | (226) 5837 seconds, p = 50, 5000 epochs | RNN (w. HF) | 62.93 (66.64)\nNottingham | Test | 170 | 1495 | | RNN-RBM | 75.40\nNottingham | Validation | 173 | 1229 | | PreT-RNN | 75.23 (p = 250)\nNottingham | | | | | PreT-Lin250 | 73.19\nPiano-midi.de | Training (70672 \u00d7 387640) | 87 | 4405 | (2971) 4147 seconds, p = 50, 5000 epochs | RNN (w. HF) | 19.33 (23.34)\nPiano-midi.de | Test | 25 | 2305 | | RNN-RBM | 28.92\nPiano-midi.de | Validation | 12 | 1740 | | PreT-RNN | 37.74 (p = 250)\nPiano-midi.de | | | | | PreT-Lin250 | 16.87\nMuseData | Training (248479 \u00d7 214192) | 524 | 2434 | (7338) 4190 seconds, p = 50, 5000 epochs | RNN (w. HF) | 23.25 (30.49)\nMuseData | Test | 25 | 2305 | | RNN-RBM | 34.02\nMuseData | Validation | 135 | 2523 | | PreT-RNN | 57.57 (p = 200)\nMuseData | | | | | PreT-Lin250 | 3.56\nJSB Chorales | Training (27674 \u00d7 22792) | 229 | 259 | (79) 6411 seconds, p = 50, 5000 epochs | RNN (w. HF) | 28.46 (29.41)\nJSB Chorales | Test | 77 | 320 | | RNN-RBM | 33.12\nJSB Chorales | Validation | 76 | 289 | | PreT-RNN | 65.67 (p = 250)\nJSB Chorales | | | | | PreT-Lin250 | 38.32\n\nTable 1: Datasets statistics including data matrix size for the training set (columns 2-4), computational times in seconds to perform pre-training and training for 5000 epochs with p = 50 (column 5), and accuracy results for state-of-the-art models [21] vs our pre-training approach (columns 6-7). The acronym (w. HF) is used to identify an RNN trained by Hessian Free Optimization [4].\n\nTraining and test curves for all the models described above are reported in Figure 1. It is evident that random initialisation does not allow the RNN to improve its performance in a reasonable number of epochs. 
Specifically, for random initialisation with p = 50 (Rnd 50), we have reported the average and range of variation over the 6 different trials: different initial points do not substantially change the performance of the RNN. Increasing the number of hidden units allows the RNN to slightly increase its performance. Using pre-training, on the other hand, allows the RNN to start training from a quite favourable point, as demonstrated by an early sharp improvement in performance. Moreover, the more hidden units are used, the larger the improvement in performance, till overfitting is observed. In particular, early overfitting occurs for the Muse dataset. It can be noticed that the linear model (Linear) reaches performances which are in some cases better than those of the RNN without pre-training. However, it is important to notice that while it achieves good results on the training set (e.g. JSB and Piano-midi), the corresponding performance on the test set is poor, showing clear evidence of overfitting. Finally, in column 7 of Table 1, we have reported the accuracy obtained after validation on the number of hidden units and number of epochs for our approaches (PreT-RNN and PreT-Lin250) versus the results reported in [21] for RNN (also using Hessian Free Optimization) and RNN-RBM. In all cases, the use of pre-training largely improves the performance over the standard RNN (with or without Hessian Free Optimization). Moreover, with the exception of the Nottingham dataset, the proposed approach outperforms the state-of-the-art results achieved by RNN-RBM. Large improvements are observed for the Muse and JSB datasets. Performance for the Nottingham dataset is basically equivalent to the one obtained by RNN-RBM. For this dataset, the linear model with pre-training also achieves quite good results, which seems to suggest that the prediction task for this dataset is much easier than for the other datasets. 
The linear model outperforms the RNN without pre-training on the Nottingham and JSB datasets, but shows problems with the Muse dataset.\n\n5 Conclusions\n\nWe have proposed a pre-training technique for RNN based on linear autoencoders for sequences. For this kind of autoencoder it is possible to give a closed form solution for the definition of the \u201coptimal\u201d weights, which, however, entails the computation of the SVD decomposition of the full data matrix. For large data matrices exact SVD decomposition cannot be achieved, so we proposed a computationally efficient procedure to get an approximation that turned out to be effective for our goals. Experimental results for a prediction task on datasets of sequences of polyphonic music show the usefulness of the proposed pre-training approach, since it allows to largely improve the state of the art results on all the considered datasets by using simple stochastic gradient descent for learning. Even if the results are very encouraging, the method needs to be assessed on data from other application domains. Moreover, it is interesting to understand whether the analysis performed in [24] on linear deep networks for vectors can be extended to recurrent architectures for sequences and, in particular, to our method.\n\nFigure 1: Training (left column) and test (right column) curves for the assessed approaches on the four datasets. 
Curves are sampled at each epoch till epoch 100, and at steps of 100 epochs afterwards.\n\n[Figure 1 plots: Accuracy versus Epoch. Panels: Nottingham Training Set, Nottingham Test Set, Piano-Midi.de Training Set, Piano-Midi.de Test Set, Muse Dataset Training Set, Muse Dataset Test Set, JSB Chorales Training Set, JSB Chorales Test Set. Legend: Rnd 50 (6 trials), Rnd 100, Rnd 150, Rnd 200, Rnd 250, Linear 250, PreT 50, PreT 100, PreT 150, PreT 200, PreT 250.]\n\nReferences\n[1] S. C. Kremer. Field Guide to Dynamical Recurrent Networks. Wiley-IEEE Press, 2001.\n[2] Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157\u2013166, 1994.\n[3] S. Hochreiter and J. Schmidhuber. LSTM can solve hard long time lag problems. In NIPS, pages 473\u2013479, 1996.\n[4] J. Martens and I. Sutskever. Learning recurrent neural networks with Hessian-free optimization. In ICML, pages 1033\u20131040, 2011.\n[5] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504\u2013507, July 2006.\n[6] G. E. Hinton, S. Osindero, and Y. W. Teh. A fast learning algorithm for deep belief nets. 
Neural Computation, 18(7):1527\u20131554, 2006.\n[7] P. di Lena, K. Nagata, and P. Baldi. Deep architectures for protein contact map prediction. Bioinformatics, 28(19):2449\u20132457, 2012.\n[8] Y. Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1\u2013127, 2009.\n[9] I. Sutskever, J. Martens, G. E. Dahl, and G. E. Hinton. On the importance of initialization and momentum in deep learning. In ICML (3), pages 1139\u20131147, 2013.\n[10] I. T. Jolliffe. Principal Component Analysis. Springer-Verlag New York, Inc., 2002.\n[11] H. Bourlard and Y. Kamp. Auto-association by multilayer perceptrons and singular value decomposition. Biological Cybernetics, 59(4-5):291\u2013294, 1988.\n[12] P. Baldi and K. Hornik. Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks, 2(1):53\u201358, 1989.\n[13] A. Sperduti. Exact solutions for recursive principal components analysis of sequences and trees. In ICANN (1), pages 349\u2013356, 2006.\n[14] A. Micheli and A. Sperduti. Recursive principal component analysis of graphs. In ICANN (2), pages 826\u2013835, 2007.\n[15] A. Sperduti. Efficient computation of recursive principal component analysis for structured input. In ECML, pages 335\u2013346, 2007.\n[16] A. Sperduti. Linear autoencoder networks for structured data. In NeSy\u201913: Ninth International Workshop on Neural-Symbolic Learning and Reasoning, 2013.\n[17] T. Voegtlin. Recursive principal components analysis. Neural Netw., 18(8):1051\u20131063, 2005.\n[18] G. Martinsson et al. Randomized methods for computing the singular value decomposition (SVD) of very large matrices. In Works. on Alg. for Modern Mass. Data Sets, Palo Alto, 2010.\n[19] E. Rabani and S. Toledo. Out-of-core SVD and QR decompositions. In PPSC, 2001.\n[20] Z. Zhang and H. Zha. 
Structure and perturbation analysis of truncated SVDs for column-partitioned matrices. SIAM J. on Mat. Anal. and Appl., 22(4):1245\u20131262, 2001.\n[21] N. Boulanger-Lewandowski, Y. Bengio, and P. Vincent. Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription. In ICML, 2012.\n[22] M. Bay, A. F. Ehmann, and J. S. Downie. Evaluation of multiple-F0 estimation and tracking systems. In ISMIR, pages 315\u2013320, 2009.\n[23] https://github.com/gwtaylor/theano-rnn.\n[24] A. M. Saxe, J. L. McClelland, and S. Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120, 2013.\n", "award": [], "sourceid": 1884, "authors": [{"given_name": "Luca", "family_name": "Pasa", "institution": "Universit\u00e0 degli Studi di Padova"}, {"given_name": "Alessandro", "family_name": "Sperduti", "institution": "Universit\u00e0 degli Studi di Padova"}]}