Preventing Gradient Explosions in Gated Recurrent Units

Advances in Neural Information Processing Systems (NIPS 2017), pages 435-444

Sekitoshi Kanai, Yasuhiro Fujiwara, Sotetsu Iwamura
{kanai.sekitoshi, fujiwara.yasuhiro, iwamura.sotetsu}@lab.ntt.co.jp
NTT Software Innovation Center
3-9-11, Midori-cho, Musashino-shi, Tokyo

Abstract

A gated recurrent unit (GRU) is a successful recurrent neural network architecture for time-series data. The GRU is typically trained using a gradient-based method, which is subject to the exploding gradient problem, in which the gradient increases significantly. This problem is caused by an abrupt change in the dynamics of the GRU due to a small variation in the parameters. In this paper, we find a condition under which the dynamics of the GRU change drastically and propose a learning method that addresses the exploding gradient problem. Our method constrains the dynamics of the GRU so that they do not change drastically.
We evaluated our method in experiments on language modeling and polyphonic music modeling. Our experiments showed that our method can prevent the exploding gradient problem and improve modeling accuracy.

1 Introduction

Recurrent neural networks (RNNs) can handle time-series data in many applications, such as speech recognition [14, 1], natural language processing [26, 30], and handwriting recognition [13]. Unlike feed-forward neural networks, RNNs have recurrent connections and states with which to represent the data. Backpropagation through time (BPTT) is the standard approach to training RNNs. BPTT propagates the gradient of the cost function with respect to the parameters, such as the weight matrices, at each layer and at each time step by unfolding the recurrent connections through time. The parameters are updated using the gradient in a way that minimizes the cost function. The cost function is selected according to the task, such as classification or regression.

Although RNNs are used in many applications, they have the problem that the gradient can become extremely small or large; these are called the vanishing gradient and exploding gradient problems [5, 28]. If the gradient is extremely small, RNNs cannot learn data with long-term dependencies [5]. On the other hand, if the gradient is extremely large, it moves the RNN parameters far away and disrupts the learning process. To handle the vanishing gradient problem, previous studies [18, 8] proposed sophisticated RNN architectures. One successful model is the long short-term memory (LSTM). However, the LSTM has a complex structure and numerous parameters with which to learn long-term dependencies. As a way of reducing the number of parameters while avoiding the vanishing gradient problem, the gated recurrent unit (GRU) was proposed in [8]; the GRU has only two gate functions, which hold or update the state that summarizes the past information.
In addition, Tang et al. [33] showed that the GRU is more robust to noise than the LSTM and that it outperforms the LSTM in several tasks [9, 20, 33, 10].

Gradient clipping is a popular approach to addressing the exploding gradient problem [26, 28]. This method rescales the gradient so that its norm is always less than a threshold. Although gradient clipping is a very simple method and can be used with GRUs, it is heuristic and does not analytically derive an appropriate threshold. The threshold has to be manually tuned to the data and tasks by trial and error. Therefore, a learning method is required that more effectively addresses the exploding gradient problem in the training of GRUs.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

In this paper, we propose a learning method for GRUs that addresses the exploding gradient problem. The method is based on an analysis of the dynamics of GRUs. GRUs suffer from gradient explosions due to the nonlinear dynamics [11, 28, 17, 3] that enable them to represent time-series data. The dynamics can change drastically when the parameters cross certain values, called bifurcation points [36], during the learning process. Consequently, the gradient of the state with respect to the parameters can increase drastically at a bifurcation point. This paper presents an analysis of the dynamics of GRUs and proposes a learning method that prevents the parameters from crossing a bifurcation point. It describes evaluations of this method through language modeling and polyphonic music modeling experiments. The experiments demonstrate that our method can train GRUs without gradient clipping and that it can improve the accuracy of GRUs.

The rest of this paper is organized as follows: Section 2 briefly explains the GRU, dynamical systems, and the exploding gradient problem; it also outlines related work.
The dynamics of GRUs are analyzed and our training approach is presented in Section 3. The experiments that verified our method are discussed in Section 4. The paper concludes in Section 5. Proofs of the lemmas are given in the supplementary material.

2 Preliminaries

2.1 Gated Recurrent Unit

Time-series data often have long- and short-term dependencies. In order to model long- and short-term behavior, a GRU is designed to properly keep and forget past information. The GRU controls the past information with two gates: an update gate and a reset gate. The update gate z_t ∈ R^{n×1} at time step t is expressed as

z_t = sigm(W_xz x_t + W_hz h_{t−1}),   (1)

where x_t ∈ R^{m×1} is the input vector, h_t ∈ R^{n×1} is the state vector, and W_xz ∈ R^{n×m} and W_hz ∈ R^{n×n} are weight matrices. sigm(·) represents the element-wise logistic sigmoid function. The reset gate r_t ∈ R^{n×1} is expressed as

r_t = sigm(W_xr x_t + W_hr h_{t−1}),   (2)

where W_xr ∈ R^{n×m} and W_hr ∈ R^{n×n} are weight matrices. The activation of the state h_t is expressed as

h_t = z_t ⊙ h_{t−1} + (1 − z_t) ⊙ h̃_t,   (3)

where 1 is the vector of all ones and ⊙ denotes the element-wise product. h̃_t is a candidate for the new state, expressed as

h̃_t = tanh(W_xh x_t + W_hh (r_t ⊙ h_{t−1})),   (4)

where tanh(·) is the element-wise hyperbolic tangent, and W_xh ∈ R^{n×m} and W_hh ∈ R^{n×n} are weight matrices.
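As a concrete illustration, the state update of eqs. (1)-(4) can be sketched in a few lines of NumPy. This is a minimal sketch, not the paper's implementation; the toy dimensions and the small random initialization are assumptions for illustration only:

```python
import numpy as np

def sigm(a):
    # Element-wise logistic sigmoid, sigm(.) in eqs. (1)-(2).
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, W_xz, W_hz, W_xr, W_hr, W_xh, W_hh):
    """One GRU state update following eqs. (1)-(4)."""
    z_t = sigm(W_xz @ x_t + W_hz @ h_prev)                  # update gate, eq. (1)
    r_t = sigm(W_xr @ x_t + W_hr @ h_prev)                  # reset gate, eq. (2)
    h_tilde = np.tanh(W_xh @ x_t + W_hh @ (r_t * h_prev))   # candidate state, eq. (4)
    return z_t * h_prev + (1.0 - z_t) * h_tilde             # new state, eq. (3)

# Toy dimensions (illustrative): n = 4 state units, m = 3 inputs.
rng = np.random.default_rng(0)
n, m = 4, 3
Ws = [0.1 * rng.standard_normal(s) for s in
      [(n, m), (n, n), (n, m), (n, n), (n, m), (n, n)]]
h = np.zeros(n)                  # h_0 = 0, as in the paper
x = rng.standard_normal(m)
h = gru_step(x, h, *Ws)
print(h.shape)  # (4,)
```

Note that with a zero input and a zero previous state, eq. (3) returns the zero vector again, which is exactly the fixed point h* = 0 discussed in Section 3.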
The initial value of h_t is h_0 = 0, where 0 represents the vector of all zeros; the GRU completely forgets the past information when h_t becomes 0.

The training of a GRU can be formulated as the following optimization problem:

min_θ (1/N) Σ_{j=1}^{N} C(x^(j), y^(j); θ),   (5)

where θ, x^(j), y^(j), C(x^(j), y^(j); θ), and N are all the parameters of the model (e.g., the elements of W_hh), the j-th training input, the j-th training output, the loss function for the j-th data point (e.g., mean squared error or cross entropy), and the number of training data, respectively. This optimization problem is usually solved through stochastic gradient descent (SGD). SGD iteratively updates the parameters according to the gradient of a mini-batch, which is data randomly sampled from the training data. The parameter update at step τ is

θ^(τ) = θ^(τ−1) − η ∇_θ (1/|D_τ|) Σ_{(x^(j),y^(j))∈D_τ} C(x^(j), y^(j); θ),   (6)

where D_τ, |D_τ|, and η represent the τ-th mini-batch, the size of the mini-batch, and the learning rate of SGD, respectively. In gradient clipping, the norm of ∇_θ (1/|D_τ|) Σ_{(x^(j),y^(j))∈D_τ} C(x^(j), y^(j); θ) is clipped at a specified threshold. The number of parameters θ is 3(n^2 + mn) + α, where α is the number of parameters outside the GRU, because the sizes of the six weight matrices W_* in eqs. (1)-(4) are n×n or n×m.
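For later comparison with our method, the clipped SGD update of eq. (6) can be sketched as follows. A toy quadratic loss stands in for the mini-batch loss C, and all numbers are illustrative assumptions:

```python
import numpy as np

def clip_by_norm(grad, threshold):
    # Rescale the gradient so that its norm never exceeds the threshold;
    # below the threshold the gradient is left untouched.
    norm = np.linalg.norm(grad)
    if norm > threshold:
        return grad * (threshold / norm)
    return grad

# Toy stand-in for the mini-batch gradient in eq. (6): here the "loss" is
# C(theta) = 0.5 * ||theta||^2, whose gradient is simply theta itself.
theta = np.array([30.0, 40.0])
grad = theta.copy()                      # ||grad|| = 50
eta, threshold = 0.1, 10.0
clipped = clip_by_norm(grad, threshold)  # rescaled to norm 10
theta = theta - eta * clipped            # SGD step of eq. (6) with clipping
print(theta)                             # [29.4 39.2]
```

The rescaling preserves the gradient direction and only bounds its magnitude, which is why the threshold must be tuned to the data, as noted above.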
Therefore, the computational cost of gradient clipping is O(n^2 + mn + α).

2.2 Dynamical System and Gradient Explosion

An RNN is a nonlinear dynamical system that can be represented as follows:

h_t = f(h_{t−1}, θ),   (7)

where h_t is the state vector at time step t, θ is a parameter vector, and f is a nonlinear vector function. The state evolves over time according to eq. (7). If the state h_{t*} at some time step t* satisfies h_{t*} = f(h_{t*}, θ), i.e., the new state equals the previous state, the state never changes until an external input is applied to the system. Such a state point is called a fixed point h*. The state converges to or moves away from the fixed point h* depending on f and θ; this property is important and is called stability [36]. The fixed point h* is said to be locally stable if there exists a constant ε such that, for h_t whose initial value h_0 satisfies |h_0 − h*| < ε, lim_{t→∞} |h_t − h*| = 0 holds. In this case, the set of points h_0 such that |h_0 − h*| < ε is called a basin of attraction of the fixed point. Conversely, if h* is not stable, the fixed point is said to be unstable. The stability and the behavior of h_t near a fixed point (e.g., converging or diverging) can be qualitatively changed by a smooth variation in θ. This phenomenon is called a local bifurcation, and the parameter value at which a bifurcation occurs is called a bifurcation point [36].

Doya [11], Pascanu et al. [28], and Baldi and Hornik [3] pointed out that gradient explosions are due to bifurcations. The training of an RNN involves iteratively updating its parameters.
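These definitions can be made concrete with a one-dimensional example of our own (not from the paper): for the scalar system h_t = w · tanh(h_{t−1}), the fixed point h* = 0 is locally stable for |w| < 1 and loses stability as w crosses the bifurcation point w = 1.

```python
import numpy as np

def iterate(w, h0=0.1, steps=2000):
    # Evolve the scalar dynamical system h_t = w * tanh(h_{t-1}),
    # a one-dimensional instance of eq. (7).
    h = h0
    for _ in range(steps):
        h = w * np.tanh(h)
    return h

# Below the bifurcation point w = 1: the state decays to the fixed point 0.
print(abs(iterate(0.95)) < 1e-6)   # True
# Slightly above it: h* = 0 is now unstable, and the state settles on a
# new, nonzero fixed point -- a qualitative change caused by a small
# parameter variation, i.e., a local bifurcation.
print(abs(iterate(1.05)) > 0.3)    # True
```

A gradient-based learner moving w smoothly across 1 would see the long-run state, and hence the gradient, change abruptly, which is the mechanism behind the gradient explosions discussed next.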
This process causes a bifurcation: a small change in the parameters can result in a drastic change in the behavior of the state. As a result, the gradient increases at a bifurcation point.

2.3 Related Work

Kuan et al. [23] established a learning method to avoid the exploding gradient problem. This method restricts the dynamics of an RNN so that the state remains stable. Yu [37] proposed a learning rate for stable training based on Lyapunov functions. However, these methods mainly target Jordan and Elman networks, called simple RNNs, which, unlike GRUs, have difficulty learning long-term dependencies. In addition, they assume that the mean squared error is used as the loss function. By contrast, our method targets the GRU, a more sophisticated model, and can be used regardless of the loss function. Doya [11] showed that bifurcations cause gradient explosions and that real-time recurrent learning (RTRL) can train an RNN without gradient explosions. However, RTRL has a high computational cost: O((n + u)^4) for each update step, where u is the number of output units [19]. More recently, Arjovsky et al. [2] proposed unitary matrix constraints in order to prevent the gradient from vanishing and exploding. Vorontsov et al. [35], however, showed that maintaining hard constraints on matrix orthogonality can be detrimental.

Previous studies analyzed the dynamics of simple RNNs [12, 4, 31, 16, 27]. Barabanov and Prokhorov [4] analyzed the absolute stability of multi-layer simple RNNs. Haschke and Steil [16] presented a bifurcation analysis of a simple RNN in which the inputs are regarded as the bifurcation parameter. Few studies have analyzed the dynamics of modern RNN models. Talathi and Vartak [32] analyzed the nonlinear dynamics of an RNN with a ReLU nonlinearity. Laurent and von Brecht [24] empirically revealed that LSTMs and GRUs can exhibit chaotic behavior and proposed a novel model that has stable dynamics.
To the best of our knowledge, our study is the first to analyze the stability of GRUs.

3 Proposed Method

As mentioned in Section 2, a bifurcation makes the gradient explode. In this section, through an analysis of the dynamics of GRUs, we devise a training method that avoids bifurcations and prevents the gradient from exploding.

3.1 Formulation of Proposed Training

In Section 3.1, we formulate our training approach. For the sake of clarity, we first explain the formulation for a one-layer GRU; then, we apply the method to a multi-layer GRU.

3.1.1 One-Layer GRU

The training of a GRU is formulated as eq. (5). This training with SGD can be disrupted by a gradient explosion. To prevent the gradient from exploding, we formulate the training of a one-layer GRU as the following constrained optimization:

min_θ (1/N) Σ_{j=1}^{N} C(x^(j), y^(j); θ),  s.t. σ_1(W_hh) < 2,   (8)

where σ_i(·) is the i-th largest singular value of a matrix, and σ_1(·) is called the spectral norm. This constrained optimization problem keeps the one-layer GRU locally stable and prevents the gradient from exploding due to a bifurcation of the fixed point, on the basis of the following theorem:

Theorem 1. When σ_1(W_hh) < 2, a one-layer GRU is locally stable at the fixed point h* = 0.

We give the proof of this theorem below. The theorem indicates that our training approach of eq. (8) maintains the stability of the fixed point h* = 0. Therefore, our approach prevents the gradient explosion caused by a bifurcation of the fixed point h*. In order to prove this theorem, we use the following three lemmas:

Lemma 1. A one-layer GRU has a fixed point at h* = 0.

Lemma 2. Let I be the n×n identity matrix, λ_i(·) be the eigenvalue with the i-th largest absolute value, and J = (1/4)W_hh + (1/2)I.
When the spectral radius^1 |λ_1(J)| < 1, a one-layer GRU without input can be approximated near h_t = 0 by the following linearized GRU:

h_t = J h_{t−1},   (9)

and the fixed point h* = 0 of the one-layer GRU is locally stable.

Lemma 2 indicates that we can prevent a change in local stability by exploiting the constraint |λ_1((1/4)W_hh + (1/2)I)| < 1. This constraint can be represented as a bilinear matrix inequality (BMI) constraint [7]. However, an optimization problem with a BMI constraint is NP-hard [34]. Therefore, we relax the optimization problem to one with a singular value constraint, as in eq. (8), by using the following lemma:

Lemma 3. When σ_1(W_hh) < 2, we have |λ_1((1/4)W_hh + (1/2)I)| < 1.

By exploiting Lemmas 1, 2, and 3, we can prove Theorem 1 as follows:

Proof. From Lemma 1, there exists a fixed point h* = 0 of a one-layer GRU. From Lemma 2, this fixed point is locally stable when |λ_1((1/4)W_hh + (1/2)I)| < 1. From Lemma 3, |λ_1((1/4)W_hh + (1/2)I)| < 1 holds when σ_1(W_hh) < 2. Therefore, when σ_1(W_hh) < 2, the one-layer GRU is locally stable at the fixed point h* = 0.

Lemma 1 indicates that a one-layer GRU has a fixed point. Lemma 2 gives the condition under which this fixed point remains stable. Lemma 3 shows that we can use a singular value constraint instead of an eigenvalue constraint. These lemmas prove Theorem 1, and the theorem ensures that our method prevents the gradient from exploding because of a local bifurcation.

In our method of eq. (8), h* = 0 is a fixed point. This fixed point is important since the initial value of the state h_0 is 0, and the GRU forgets all the past information when the state is reset to 0, as described in Section 2.
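Lemmas 2 and 3 are easy to check numerically. The sketch below (sizes and scaling are arbitrary choices of ours) rescales a random W_hh so that σ_1(W_hh) < 2 and confirms both that the spectral radius of J = (1/4)W_hh + (1/2)I is below one and that the linearized state contracts to h* = 0:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50

# Random recurrent weights rescaled so that sigma_1(W_hh) = 1.8 < 2.
W_hh = rng.standard_normal((n, n))
W_hh = 1.8 * W_hh / np.linalg.svd(W_hh, compute_uv=False)[0]

# Lemma 3: sigma_1(W_hh) < 2 implies |lambda_1(J)| < 1 for
# J = (1/4) W_hh + (1/2) I.
J = 0.25 * W_hh + 0.5 * np.eye(n)
spectral_radius = np.max(np.abs(np.linalg.eigvals(J)))
print(spectral_radius < 1.0)      # True

# Lemma 2: near h = 0, the linearized GRU h_t = J h_{t-1} of eq. (9)
# contracts to the fixed point h* = 0.
h = 0.01 * rng.standard_normal(n)
for _ in range(500):
    h = J @ h
print(np.linalg.norm(h) < 1e-6)   # True
```

Here the bound is immediate: σ_1(J) ≤ (1/4)σ_1(W_hh) + 1/2 < 1, and the spectral radius never exceeds the spectral norm.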
If h* = 0 is stable, a state vector near 0 asymptotically converges to 0. This means that the state vector h_t can be reset to 0 after a sufficient time in the absence of input; i.e., the GRU can forget the past information entirely. On the other hand, when |λ_1(J)| becomes greater than one, the fixed point at 0 becomes unstable. This means that the state vector h_t never resets to 0; i.e., the GRU cannot forget all the past information until we manually reset the state. In this case, the forget gate and reset gate may not work effectively. In addition, Laurent and von Brecht [24] showed that an RNN model whose state asymptotically converges to zero achieves a level of performance comparable to that of LSTMs and GRUs. Therefore, our constraint that the GRU be locally stable at h* = 0 is effective for learning.

3.1.2 Multi-Layer GRU

Here, we extend our method to the multi-layer GRU. An L-layer GRU is represented as follows:

h_{1,t} = f_1(h_{1,t−1}, x_t),  h_{2,t} = f_2(h_{2,t−1}, h_{1,t}),  …,  h_{L,t} = f_L(h_{L,t−1}, h_{L−1,t}),

where h_{l,t} ∈ R^{n_l×1} is the state vector of length n_l at the l-th layer, and f_l represents a GRU corresponding to eqs. (1)-(4) at the l-th layer. In the same way as for the one-layer GRU, h_t = [h_{1,t}^T, …, h_{L,t}^T]^T = 0 is a fixed point, and we have the following lemma:

Lemma 4. When |λ_1((1/4)W_{l,hh} + (1/2)I)| < 1 for l = 1, …, L, the fixed point h* = 0 of a multi-layer GRU is locally stable.

From Lemma 3, we have |λ_1((1/4)W_{l,hh} + (1/2)I)| < 1 when σ_1(W_{l,hh}) < 2. Thus, we formulate our training of a multi-layer GRU to prevent gradient explosions as

min_θ (1/N) Σ_{j=1}^{N} C(x^(j), y^(j); θ),  s.t. σ_1(W_{l,hh}) < 2, σ_1(W_{l,xh}) ≤ 2 for l = 1, …, L.   (10)

We added the constraint σ_1(W_{l,xh}) ≤ 2 in order to prevent the input from pushing the state out of the basin of attraction of the fixed point h* = 0. This constrained optimization problem keeps a multi-layer GRU locally stable.

^1 The spectral radius is the maximum of the absolute values of the eigenvalues.

3.2 Algorithm

The optimization method for eq. (8) needs to find the optimal parameters within the feasible set, in which the parameters satisfy the constraint {W_hh | W_hh ∈ R^{n×n}, σ_1(W_hh) < 2}. Here, we modify SGD in order to solve eq. (8). Our method updates the parameters as follows:

θ_{−W_hh}^(τ) = θ_{−W_hh}^(τ−1) − η ∇_θ C_{D_τ}(θ),  W_hh^(τ) = P_δ(W_hh^(τ−1) − η ∇_{W_hh} C_{D_τ}(θ)),   (11)

where C_{D_τ}(θ) represents (1/|D_τ|) Σ_{(x^(j),y^(j))∈D_τ} C(x^(j), y^(j); θ), and θ_{−W_hh} represents the parameters except for W_hh^(τ). In eq. (11), we compute P_δ(·) by the following procedure:

Step 1. Decompose Ŵ_hh^(τ) := W_hh^(τ−1) − η ∇_{W_hh} C_{D_τ}(θ) by singular value decomposition (SVD):

Ŵ_hh^(τ) = U Σ V.   (12)

Step 2. Replace the singular values that are greater than the threshold 2 − δ:

Σ̄ = diag(min(σ_1, 2 − δ), …, min(σ_n, 2 − δ)).   (13)

Step 3. Reconstruct W_hh^(τ) using U, V, and Σ̄ from Steps 1 and 2:

W_hh^(τ) ← U Σ̄ V.   (14)

By this procedure, W_hh is guaranteed to have a spectral norm less than or equal to 2 − δ. When 0 < δ < 2, our method therefore constrains σ_1(W_hh) to be less than 2. P_δ(·) brings the parameters back into the feasible set when they leave it after an SGD step.
Our procedure P_δ(·) is an optimal projection onto the feasible set, as shown by the following lemma:

Lemma 5. The weight matrix W_hh^(τ) obtained by P_δ(·) is a solution of the following optimization:

min_{W_hh^(τ)} ||Ŵ_hh^(τ) − W_hh^(τ)||_F^2,  s.t. σ_1(W_hh^(τ)) ≤ 2 − δ,

where ||·||_F represents the Frobenius norm.

Lemma 5 indicates that our method brings the weight matrix back into the feasible set with minimal variation in the parameters. Therefore, our procedure P_δ(·) has minimal impact on the minimization of the loss function. Note that our method does not depend on the learning rate schedule, and an adaptive learning rate method (such as Adam [21]) can be used with it.

3.3 Computational Cost

Let n be the length of the state vector h_t; a naive implementation of SVD needs O(n^3) time. Here, we propose an efficient method to reduce the computational cost. First, let us reconsider the computation of P_δ(·). Equations (12)-(14) can be represented as follows:

W_hh^(τ) = Ŵ_hh^(τ) − Σ_{i=1}^{s} [σ_i(Ŵ_hh^(τ)) − (2 − δ)] u_i v_i^T,   (15)

where s is the number of singular values greater than 2 − δ, and u_i and v_i are the i-th left and right singular vectors, respectively. Eq. (15) shows that our method only needs the singular values and vectors such that σ_i(Ŵ_hh^(τ)) > 2 − δ. In order to reduce the computational cost of our method, we use the truncated SVD [15] to efficiently compute the top s singular values in O(n^2 log(s)) time, where s is the specified number of singular values. Since the truncated SVD requires s to be set beforehand, we need to efficiently estimate the number of singular values that meet the condition σ_i(Ŵ_hh^(τ)) > 2 − δ.
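A direct implementation of Steps 1-3 with a full SVD might look as follows (a minimal sketch; the matrix size and δ are illustrative assumptions). Note that NumPy returns V^T, which plays the role of V in eqs. (12)-(14):

```python
import numpy as np

def project_spectral_norm(W_hat, delta):
    """P_delta: clip the singular values of W_hat at 2 - delta (Steps 1-3)."""
    U, s, Vt = np.linalg.svd(W_hat, full_matrices=False)  # Step 1: SVD
    s_bar = np.minimum(s, 2.0 - delta)                    # Step 2: clip at 2 - delta
    return U @ np.diag(s_bar) @ Vt                        # Step 3: reconstruct

rng = np.random.default_rng(0)
W_hat = rng.standard_normal((6, 6))   # stand-in for W_hh after an SGD step
delta = 0.2
W = project_spectral_norm(W_hat, delta)

# The spectral norm is now at most 2 - delta, so sigma_1(W_hh) < 2 holds.
print(np.linalg.svd(W, compute_uv=False)[0] <= 2.0 - delta + 1e-9)  # True
```

Clipping the singular values is exactly the Frobenius-optimal projection of Lemma 5, and applying P_δ to an already feasible matrix leaves it unchanged.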
Therefore, we compute upper bounds of the singular values that meet the condition, on the basis of the following lemma:

Lemma 6. The singular values of Ŵ_hh^(τ) are bounded by the following inequality: σ_i(Ŵ_hh^(τ)) ≤ σ_i(W_hh^(τ−1)) + |η| ||∇_{W_hh} C_{D_τ}(θ)||_F.

Using this upper bound, we can estimate s as the number of singular values whose upper bounds are greater than 2 − δ. This upper bound can be computed in O(n^2) time, since the size of ∇_{W_hh} C_{D_τ}(θ) is n×n and σ_i(W_hh^(τ−1)) has already been obtained at step τ. If we did not compute the singular values in the previous steps τ − K through τ − 1, we compute the upper bound of σ_i(Ŵ_hh^(τ)) as σ_i(W_hh^(τ−K−1)) + Σ_{k=0}^{K} |η| ||∇_{W_hh} C_{D_{τ−k}}(θ)||_F from Lemma 6. Since our training originally constrains σ_1(W_hh^(τ)) < 2, as described in eq. (8), we can redefine s as the number of singular values such that σ_i(Ŵ_hh^(τ)) > 2, instead of σ_i(Ŵ_hh^(τ)) > 2 − δ. This modification further reduces the computational cost without disrupting the training. In summary, our method can efficiently estimate the number of singular values needed in O(n^2) time, and we compute the truncated SVD in O(n^2 log(s)) time only when we need to compute singular values, by using Lemma 6.

4 Experiments

4.1 Experimental Conditions

To evaluate the effectiveness of our method, we conducted experiments on language modeling and polyphonic music modeling. We trained the GRU and examined the successful training rate, as well as the average and standard deviation of the loss.
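Assuming SciPy is available for the truncated SVD, eq. (15) combined with the Lemma 6 estimate can be sketched as follows. The construction of W_prev with a single large singular value is our illustrative setup, not the paper's:

```python
import numpy as np
from scipy.sparse.linalg import svds

def project_truncated(W_hat, delta, s_est):
    # Eq. (15): subtract rank-1 corrections for the singular values that
    # exceed 2 - delta; only the top s_est triplets are computed.
    if s_est == 0:
        return W_hat
    U, s, Vt = svds(W_hat, k=s_est)
    W = W_hat.copy()
    for i in range(len(s)):
        if s[i] > 2.0 - delta:
            W -= (s[i] - (2.0 - delta)) * np.outer(U[:, i], Vt[i, :])
    return W

rng = np.random.default_rng(0)
n, delta, eta = 30, 0.2, 1e-3

# Illustrative W_hh at step tau-1 with known singular values (1.9, 0.5, ..., 0.5).
Q1, _ = np.linalg.qr(rng.standard_normal((n, n)))
Q2, _ = np.linalg.qr(rng.standard_normal((n, n)))
W_prev = Q1 @ np.diag([1.9] + [0.5] * (n - 1)) @ Q2.T

grad = rng.standard_normal((n, n))   # stand-in for the mini-batch gradient
W_hat = W_prev - eta * grad          # SGD step before the projection

# Lemma 6: sigma_i(W_hat) <= sigma_i(W_prev) + |eta| * ||grad||_F, so count
# how many singular values *might* exceed 2 - delta.
bounds = np.linalg.svd(W_prev, compute_uv=False) + abs(eta) * np.linalg.norm(grad)
s_est = int(np.sum(bounds > 2.0 - delta))

W = project_truncated(W_hat, delta, s_est)
print(s_est, np.linalg.svd(W, compute_uv=False)[0] <= 2.0 - delta + 1e-6)
```

In this toy setup only one singular value can exceed the threshold, so the truncated SVD with k = s_est = 1 suffices, which is exactly the saving the paper exploits when s is much smaller than n.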
We defined successful training as training in which the validation loss at each epoch is never greater than the initial value. The experimental conditions of each modeling task are explained below.

4.1.1 Language Modeling

Penn Treebank (PTB) [25] is a widely used dataset for evaluating the performance of RNNs. PTB is split into training, validation, and test sets, which are composed of 930 k, 74 k, and 80 k tokens, respectively. This experiment used a 10 k word vocabulary, and all words outside the vocabulary were mapped to a special token. The experimental conditions were based on the previous paper [38]. Our model architecture was as follows: the first layer was a 650×10,000 linear layer without bias that converts the one-hot input vector into a dense vector, and we multiplied the output of the first layer by 0.01 because our method assumes small inputs. The second layer was a GRU layer with 650 units, and we used the softmax function as the output layer. We applied 50 % dropout to the output of each layer except for the recurrent connections [38]. We unfolded the GRU for 35 time steps in BPTT and set the mini-batch size to 20. We trained the GRU with SGD for 75 epochs, since in preliminary experiments the models trained with Adam and RMSprop performed worse than the one trained with SGD, and Zaremba et al. [38] used SGD. The results and conditions of the preliminary experiments are in the supplementary material. We set the learning rate to one for the first 10 epochs and then divided it by 1.1 after each epoch. In our method, δ was set to [0.2, 0.5, 0.8, 1.1, 1.4]. In gradient clipping, a heuristic for setting the threshold is to look at the average norm of the gradient [28]. We evaluated gradient clipping based on the gradient norm by following that study [28].
In the supplementary material, we also evaluate element-wise gradient clipping, which is often used in practice. Since the average norm of the gradient was about 10, we set the threshold to [5, 10, 15, 20]. We initialized the weight matrices except for W_hh with the normal distribution N(0, 1/650), and W_hh as an orthogonal matrix composed of the left singular vectors of a random matrix [29, 8]. After each epoch, we evaluated the validation loss. The model that achieved the lowest validation loss was evaluated on the test set.

4.1.2 Polyphonic Music Modeling

In this modeling task, we predicted the MIDI note numbers at the next time step given the observed notes of the previous time steps. We used the Nottingham dataset, a MIDI file collection containing 1200 folk tunes [6]. We represented the notes at each time step as a 93-dimensional binary vector. This dataset is split into training, validation, and test sets [6]. The experimental conditions were based on the previous study [20]. Our model architecture was as follows: the first layer was a 200×93 linear layer without bias, and the output of the first layer was multiplied by 0.01.
The second and third layers were GRU layers with 200 units per layer, and we used the logistic function as the output layer. 50 % dropout was applied to the non-recurrent connections. We unfolded the GRUs for 35 time steps and set the mini-batch size to 20. We used SGD with a learning rate of 0.1 and divided the learning rate by 1.25 whenever we observed no improvement over 10 consecutive epochs. We repeated this procedure until the learning rate became smaller than 10^−4. In our method, δ was set to [0.2, 0.5, 0.8, 1.1, 1.4]. In gradient clipping, the threshold was set to [15, 30, 45, 60], since the average norm of the gradient was about 30.

Table 1: Language modeling results: success rate and perplexity.

Our method:
| Delta           | 0.2       | 0.5       | 0.8       | 1.1        | 1.4        |
| Success Rate    | 100 %     | 100 %     | 100 %     | 100 %      | 100 %      |
| Validation Loss | 102.0±0.3 | 102.8±0.3 | 103.7±0.2 | 105.2±0.2  | 107.0±0.4  |
| Test Loss       | 97.6±0.4  | 98.4±0.3  | 99.0±0.4  | 100.3±0.2  | 102.1±0.2  |

Gradient clipping:
| Threshold       | 5         | 10        | 15   | 20   |
| Success Rate    | 100 %     | 40 %      | 0 %  | 0 %  |
| Validation Loss | 109.3±0.4 | 103.1±0.4 | N/A  | N/A  |
| Test Loss       | 106.9±0.4 | 100.4±0.5 | N/A  | N/A  |

Table 2: Music modeling results: success rate and negative log-likelihood.

Our method:
| Delta           | 0.2       | 0.5       | 0.8      | 1.1      | 1.4      |
| Success Rate    | 100 %     | 100 %     | 100 %    | 100 %    | 100 %    |
| Validation Loss | 3.46±0.05 | 3.47±0.07 | 3.59±0.1 | 4.58±0.2 | 4.64±0.2 |
| Test Loss       | 3.53±0.04 | 3.53±0.04 | 3.64±0.2 | 4.56±0.2 | 4.62±0.2 |

Gradient clipping:
| Threshold       | 15        | 30       | 45       | 60     |
| Success Rate    | 100 %     | 100 %    | 100 %    | 100 %  |
| Validation Loss | 3.57±0.01 | 3.61±0.2 | 3.88±0.2 | 5.26±3 |
| Test Loss       | 3.64±0.04 | 3.64±0.2 | 3.89±0.2 | 5.36±3 |
We initialized the weight matrices except for W_hh with the normal distribution N(0, 10^−4/200), and W_hh as an orthogonal matrix. After each epoch, we evaluated the validation loss, and the model that achieved the lowest validation loss was evaluated on the test set.

4.2 Success Rate and Accuracy

Tables 1 and 2 list the success rates of language modeling and music modeling, respectively. These tables also list the averages and standard deviations of the loss in each task, showing that our method outperforms gradient clipping. In these tables, "Threshold" means the threshold of gradient clipping, and "Delta" means δ in our method.

As shown in Table 1, in language modeling, gradient clipping failed to train the GRU even when its parameter was set to 10, which is the average norm of the gradient, as recommended by Pascanu et al. [28]. Although gradient clipping successfully trained the GRU when its threshold was five, it failed to learn the model effectively with this setting; a threshold of 10 achieved lower perplexity than a threshold of five. As shown in Table 2, in music modeling, gradient clipping successfully trained the GRU. However, the standard deviation of the loss was high when the threshold was set to 60 (double the average norm). On the other hand, our method successfully trained the GRU in both tasks. Tables 1 and 2 show that our approach achieved lower perplexity and negative log-likelihood than gradient clipping, while constraining the GRU to be locally stable. This is because our approach of constraining stability improves the performance of the GRU. A previous study [22] showed that stabilizing the activations of an RNN can improve performance on several tasks. In addition, Bengio et al. [5] showed that an RNN is robust to noise when the state remains in a basin of attraction.
Using our method, the state of the GRU tends to remain in the basin of attraction of h* = 0. Therefore, our method can improve robustness against noise, which is an advantage of the GRU [33].

As shown in Table 2, when δ was set to 1.1 or 1.4, the performance of the GRU deteriorated. This is because the convergence speed of the state depends on δ. As mentioned in Section 3.2, the spectral norm of W_hh is less than or equal to 2 − δ. This spectral norm gives an upper bound on |λ_1(J)|, and |λ_1(J)| gives the rate of convergence of the linearized GRU (eq. (9)), which approximates the GRU near h_t = 0 when |λ_1(J)| < 1. Therefore, the state of the GRU near h_t = 0 tends to converge quickly if δ is set close to two. In this case, the GRU becomes robust to noise, since a state affected by past noise converges to zero quickly, but the GRU loses its effectiveness for long-term dependencies. We can thus tune δ according to the characteristics of the data: if the data have long-term dependencies, we should set δ small, whereas for noisy data we should set δ large.

The threshold in gradient clipping is unbounded, and hence it is difficult to tune. Although the threshold can be set heuristically on the basis of the average gradient norm, this may not be effective in language modeling with the GRU, as shown in Table 1.
In contrast, the hyper-parameter of our method is bounded, i.e., 0 < δ < 2, and its effect is easy to understand, as mentioned above.

(a) Gradient clipping (threshold of 5). (b) Our method (delta of 0.2).

Figure 1: Gradient explosion in language modeling.

Table 3: Computation time in language modeling (delta of 0.2, threshold of 5).

                       Naive SVD     Truncated SVD   Gradient clipping
Computation time (s)   5.02 × 10^4   4.55 × 10^4     4.96 × 10^4

4.3 Relation between Gradient and Spectral Radius

Our method of constraining the GRU to be locally stable is based on the hypothesis that a change in stability causes the exploding gradient problem. To confirm this hypothesis, we examined (i) the norm of the gradient before clipping and (ii) the spectral radius of J (in Lemma 2), which determines local stability, versus the number of iterations up to the 500th iteration in Fig. 1. Figs. 1(a) and 1(b) show the results of gradient clipping with a threshold of 5 and our method with δ of 0.2, respectively. Each norm of the gradient was normalized so that its maximum value was one. The norm of the gradient significantly increased when the spectral radius crossed one, such as at the 63rd, 79th, and 141st iterations (Fig. 1(a)). In addition, the spectral radius decreased to less than one after each gradient explosion; i.e., when a gradient explosion occurred, the gradient pointed in the direction of decreasing spectral radius. In contrast, our method kept the spectral radius less than one by constraining the spectral norm of Whh (Fig. 1(b)). Therefore, our method can prevent the gradient from exploding and effectively train the GRU.

4.4 Computation Time

We evaluated the computation time of the language modeling experiment. The detailed experimental setup is described in the supplementary material.
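The advantage of the truncated SVD comes from the fact that only the few leading singular values can exceed the clipping threshold, so the full decomposition is unnecessary. A small sketch using scipy.sparse.linalg.svds; the routine and the sizes below are our own choices for illustration, not necessarily those used in the experiments:

```python
import numpy as np
from scipy.sparse.linalg import svds

# A full SVD of an n x n matrix costs O(n^3).  When only the k
# leading singular values can exceed the clipping threshold, a
# truncated SVD that extracts just those k triplets is cheaper.
rng = np.random.default_rng(1)
n, k = 200, 5
W = rng.normal(size=(n, n)) * 0.05

U_k, s_k, Vt_k = svds(W, k=k)                 # k largest singular triplets
s_full = np.linalg.svd(W, compute_uv=False)   # all n singular values

# The truncated factorization should recover the top-k singular
# values of the full decomposition (svds returns them unordered,
# so we sort in descending order before comparing).
print(np.sort(s_k)[::-1], s_full[:k])
```

Randomized algorithms such as those of Halko et al. [15] provide a similar speedup when only a few leading singular values are needed.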
Table 3 lists the computation time of the whole learning process using gradient clipping and our method with the naive SVD and with the truncated SVD. The table shows that the computation time of our method is comparable to that of gradient clipping. As mentioned in Section 2.1, the computational cost of gradient clipping is proportional to the number of parameters, including the weight matrices of the input and output layers. In language modeling, the input and output layers tend to be large because of the large vocabulary size. On the other hand, the computational cost of our method depends only on the length of the state vector, and our method can be efficiently computed if the number of singular values greater than 2 is small, as described in Section 3.3. As a result, our method could reduce the computation time compared with gradient clipping.

5 Conclusion

We analyzed the dynamics of GRUs and devised a learning method that prevents the exploding gradient problem. Our analysis of stability provides new insight into the behavior of GRUs. Our method constrains GRUs so that states near 0 asymptotically converge to 0.
Through language and music modeling experiments, we confirmed that our method can successfully train GRUs and found that our method can improve their performance.

References

[1] Dario Amodei, Rishita Anubhai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Jingdong Chen, Mike Chrzanowski, Adam Coates, Greg Diamos, Erich Elsen, Jesse Engel, Linxi Fan, Christopher Fougner, Awni Hannun, Billy Jun, Tony Han, Patrick LeGresley, Xiangang Li, Libby Lin, Sharan Narang, Andrew Ng, Sherjil Ozair, Ryan Prenger, Sheng Qian, Jonathan Raiman, Sanjeev Satheesh, David Seetapun, Shubho Sengupta, Chong Wang, Yi Wang, Zhiqian Wang, Bo Xiao, Yan Xie, Dani Yogatama, Jun Zhan, and Zhenyao Zhu. Deep speech 2: End-to-end speech recognition in English and Mandarin. In Proc. ICML, pages 173–182, 2016.

[2] Martin Arjovsky, Amar Shah, and Yoshua Bengio. Unitary evolution recurrent neural networks. In Proc. ICML, pages 1120–1128, 2016.

[3] Pierre Baldi and Kurt Hornik. Universal approximation and learning of trajectories using oscillators. In Proc. NIPS, pages 451–457, 1996.

[4] Nikita E. Barabanov and Danil V. Prokhorov. Stability analysis of discrete-time recurrent neural networks. IEEE Transactions on Neural Networks, 13(2):292–303, 2002.

[5] Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166, 1994.

[6] Nicolas Boulanger-Lewandowski, Yoshua Bengio, and Pascal Vincent. Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription. In Proc. ICML, pages 1159–1166, 2012.

[7] Mahmoud Chilali and Pascal Gahinet.
H∞ design with pole placement constraints: an LMI approach. IEEE Transactions on Automatic Control, 41(3):358–367, 1996.

[8] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proc. EMNLP, pages 1724–1734. ACL, 2014.

[9] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.

[10] Jasmine Collins, Jascha Sohl-Dickstein, and David Sussillo. Capacity and trainability in recurrent neural networks. In Proc. ICLR, 2017.

[11] Kenji Doya. Bifurcations in the learning of recurrent neural networks. In Proc. ISCAS, volume 6, pages 2777–2780. IEEE, 1992.

[12] Bernard Doyon, Bruno Cessac, Mathias Quoy, and Manuel Samuelides. Destabilization and route to chaos in neural networks with random connectivity. In Proc. NIPS, pages 549–555, 1993.

[13] Alex Graves and Jürgen Schmidhuber. Offline handwriting recognition with multidimensional recurrent neural networks. In Proc. NIPS, pages 545–552, 2009.

[14] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. Speech recognition with deep recurrent neural networks. In Proc. ICASSP, pages 6645–6649. IEEE, 2013.

[15] N. Halko, P. G. Martinsson, and J. A. Tropp. Finding structure with randomness: Stochastic algorithms for constructing approximate matrix decompositions. arXiv preprint arXiv:0909.4061, 2009.

[16] Robert Haschke and Jochen J. Steil. Input space bifurcation manifolds of recurrent neural networks. Neurocomputing, 64:25–38, 2005.

[17] Michiel Hermans and Benjamin Schrauwen. Training and analysing deep recurrent neural networks. In Proc. NIPS, pages 190–198, 2013.

[18] Sepp Hochreiter and Jürgen Schmidhuber.
Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

[19] Herbert Jaeger. Tutorial on training recurrent neural networks, covering BPPT, RTRL, EKF and the "echo state network" approach. GMD-Forschungszentrum Informationstechnik, 2002.

[20] Rafal Jozefowicz, Wojciech Zaremba, and Ilya Sutskever. An empirical exploration of recurrent network architectures. In Proc. ICML, pages 2342–2350, 2015.

[21] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Proc. ICLR, 2015.

[22] David Krueger and Roland Memisevic. Regularizing RNNs by stabilizing activations. In Proc. ICLR, 2016.

[23] Chung-Ming Kuan, Kurt Hornik, and Halbert White. A convergence result for learning in recurrent neural networks. Neural Computation, 6(3):420–440, 1994.

[24] Thomas Laurent and James von Brecht. A recurrent neural network without chaos. In Proc. ICLR, 2017.

[25] Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330, 1993.

[26] Tomas Mikolov. Statistical language models based on neural networks. PhD thesis, Brno University of Technology, 2012.

[27] Hiroyuki Nakahara and Kenji Doya. Dynamics of attention as near saddle-node bifurcation behavior. In Proc. NIPS, pages 38–44, 1996.

[28] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. In Proc. ICML, pages 1310–1318, 2013.

[29] Andrew M. Saxe, James L. McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In Proc. ICLR, 2014.

[30] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In Proc. NIPS, pages 3104–3112, 2014.

[31] Johan A. K. Suykens, Bart De Moor, and Joos Vandewalle.
Robust local stability of multilayer recurrent neural networks. IEEE Transactions on Neural Networks, 11(1):222–229, 2000.

[32] Sachin S. Talathi and Aniket Vartak. Improving performance of recurrent neural network with ReLU nonlinearity. arXiv preprint arXiv:1511.03771, 2015.

[33] Zhiyuan Tang, Ying Shi, Dong Wang, Yang Feng, and Shiyue Zhang. Memory visualization for gated recurrent neural networks in speech recognition. In Proc. ICASSP, pages 2736–2740. IEEE, 2017.

[34] Onur Toker and Hitay Ozbay. On the NP-hardness of solving bilinear matrix inequalities and simultaneous stabilization with static output feedback. In Proc. of American Control Conference, volume 4, pages 2525–2526. IEEE, 1995.

[35] Eugene Vorontsov, Chiheb Trabelsi, Samuel Kadoury, and Chris Pal. On orthogonality and learning recurrent networks with long term dependencies. In Proc. ICML, 2017.

[36] Stephen Wiggins. Introduction to applied nonlinear dynamical systems and chaos, volume 2. Springer Science & Business Media, 2003.

[37] Wen Yu. Nonlinear system identification using discrete-time recurrent neural networks with stable learning algorithms. Information Sciences, 158:131–147, 2004.

[38] Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329, 2014.