{"title": "Path-Normalized Optimization of Recurrent Neural Networks with ReLU Activations", "book": "Advances in Neural Information Processing Systems", "page_first": 3477, "page_last": 3485, "abstract": "We investigate the parameter-space geometry of recurrent neural networks (RNNs), and develop an adaptation of path-SGD optimization method, attuned to this geometry, that can learn plain RNNs with ReLU activations. On several datasets that require capturing long-term dependency structure, we show that path-SGD can significantly improve trainability of ReLU RNNs compared to RNNs trained with SGD, even with various recently suggested initialization schemes.", "full_text": "Path-Normalized Optimization of Recurrent Neural\n\nNetworks with ReLU Activations\n\nBehnam Neyshabur\u2217\n\nToyota Technological Institute at Chicago\n\nbneyshabur@ttic.edu\n\nYuhuai Wu\u2217\n\nUniversity of Toronto\nywu@cs.toronto.edu\n\nRuslan Salakhutdinov\n\nCarnegie Mellon University\nrsalakhu@cs.cmu.edu\n\nNathan Srebro\n\nToyota Technological Institute at Chicago\n\nnati@ttic.edu\n\nAbstract\n\nWe investigate the parameter-space geometry of recurrent neural networks (RNNs),\nand develop an adaptation of path-SGD optimization method, attuned to this\ngeometry, that can learn plain RNNs with ReLU activations. On several datasets\nthat require capturing long-term dependency structure, we show that path-SGD can\nsigni\ufb01cantly improve trainability of ReLU RNNs compared to RNNs trained with\nSGD, even with various recently suggested initialization schemes.\n\nIntroduction\n\n1\nRecurrent Neural Networks (RNNs) have been found to be successful in a variety of sequence learning\nproblems [4, 3, 9], including those involving long term dependencies (e.g., [1, 23]). However, most\nof the empirical success has not been with \u201cplain\u201d RNNs but rather with alternate, more complex\nstructures, such as Long Short-Term Memory (LSTM) networks [7] or Gated Recurrent Units (GRUs)\n[3]. 
Much of the motivation for these more complex models is not so much because of their modeling\nrichness, but perhaps more because they seem to be easier to optimize. As we discuss in Section\n3, training plain RNNs using gradient-descent variants seems problematic, and the choice of the\nactivation function could cause a problem of vanishing gradients or of exploding gradients.\nIn this paper our goal is to better understand the geometry of plain RNNs, and develop better\noptimization methods, adapted to this geometry, that directly learn plain RNNs with ReLU activations.\nOne motivation for insisting on plain RNNs, as opposed to LSTMs or GRUs, is because they\nare simpler and might be more appropriate for applications that require low-complexity design\nsuch as in mobile computing platforms [22, 5]. In other applications, it might be better to solve\noptimization issues by better optimization methods rather than reverting to more complex models.\nBetter understanding optimization of plain RNNs can also assist us in designing, optimizing and\nintelligently using more complex RNN extensions.\nImproving training RNNs with ReLU activations has been the subject of some recent attention,\nwith most research focusing on different initialization strategies [12, 22]. While initialization can\ncertainly have a strong effect on the success of the method, it generally can at most delay the problem\nof gradient explosion during optimization. In this paper we take a different approach that can be\ncombined with any initialization choice, and focus on the dynamics of the optimization itself.\nAny local search method is inherently tied to some notion of geometry over the search space (e.g.\nthe space of RNNs). For example, gradient descent (including stochastic gradient descent) is tied to\nthe Euclidean geometry and can be viewed as steepest descent with respect to the Euclidean norm.\nChanging the norm (even to a different quadratic norm, e.g. 
by representing the weights with respect to a different basis in parameter space) results in different optimization dynamics. We build on prior work on the geometry and optimization in feed-forward networks, which uses the path-norm [16] (defined in Section 4) to determine a geometry leading to the path-SGD optimization method. To do so, we investigate the geometry of RNNs as feedforward networks with shared weights (Section 2) and extend a line of work on Path-Normalized optimization to include networks with shared weights. We show that the resulting algorithm (Section 4) has similar invariance properties on RNNs as those of standard path-SGD on feedforward networks, and can result in better optimization with less sensitivity to the scale of the weights.\n\n\u2217Contributed equally.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n                 FF (shared weights)                              RNN notation\nInput nodes      h_v = x[v]                                       h^0_t = x_t,  h^i_0 = 0\nInternal nodes   h_v = [ \u2211_{(u\u2192v)\u2208E} w_{u\u2192v} h_u ]_+            h^i_t = [ W^i_in h^{i\u22121}_t + W^i_rec h^i_{t\u22121} ]_+\nOutput nodes     h_v = \u2211_{(u\u2192v)\u2208E} w_{u\u2192v} h_u                  h^d_t = W_out h^{d\u22121}_t\n\nTable 1: Forward computations for feedforward nets with shared weights.\n\n2 Recurrent Neural Nets as Feedforward Nets with Shared Weights\n\nWe view Recurrent Neural Networks (RNNs) as feedforward networks with shared weights. A general feedforward network with ReLU activations and shared weights is denoted by N(G, \u03c0, p), where G(V, E) is a directed acyclic graph over the set of nodes V that corresponds to units v \u2208 V in the network, including special subsets of input and output nodes V_in, V_out \u2282 V ; p \u2208 R^m is a parameter vector; and \u03c0 : E \u2192 {1, . . . , m} is a mapping from edges in G to parameter indices. 
For any edge e \u2208 E, the weight of edge e is indicated by w_e = p_{\u03c0(e)}. We refer to the set of edges that share the ith parameter p_i by E_i = {e \u2208 E | \u03c0(e) = i}. That is, for any e_1, e_2 \u2208 E_i, \u03c0(e_1) = \u03c0(e_2) and hence w_{e_1} = w_{e_2} = p_{\u03c0(e_1)}.\n\nSuch a feedforward network represents a function f_{N(G,\u03c0,p)} : R^{|V_in|} \u2192 R^{|V_out|} as follows: For any input node v \u2208 V_in, its output h_v is the corresponding coordinate of the input vector x \u2208 R^{|V_in|}. For each internal node v, the output is defined recursively as h_v = [ \u2211_{(u\u2192v)\u2208E} w_{u\u2192v} \u00b7 h_u ]_+, where [z]_+ = max(z, 0) is the ReLU activation function2. For output nodes v \u2208 V_out, no non-linearity is applied and their output h_v = \u2211_{(u\u2192v)\u2208E} w_{u\u2192v} \u00b7 h_u determines the corresponding coordinate of the computed function f_{N(G,\u03c0,p)}(x). Since we will fix the graph G and the mapping \u03c0 and learn the parameters p, we use the shorthand f_p = f_{N(G,\u03c0,p)} to refer to the function implemented by parameters p. The goal of training is to find parameters p that minimize some error functional L(f_p) that depends on p only through the function f_p. E.g., in supervised learning L(f) = E[loss(f(x), y)], and this is typically done by minimizing an empirical estimate of this expectation.\n\nIf the mapping \u03c0 is a one-to-one mapping, then there is no weight sharing and it corresponds to standard feedforward networks. On the other hand, weight sharing exists if \u03c0 is a many-to-one mapping. Two well-known examples of feedforward networks with shared weights are convolutional and recurrent networks. 
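To make the shared-weights notation concrete, here is a minimal sketch of the forward computation of N(G, \u03c0, p); the function and variable names are ours, not from the paper, and edges are assumed to be given in topological order:

```python
import numpy as np

def forward(graph, pi, p, x, inputs, outputs):
    """Forward pass of a ReLU feedforward net N(G, pi, p) with shared weights.

    graph:   list of edges (u, v), topologically ordered
    pi:      dict mapping each edge to a parameter index, so w_e = p[pi[e]]
    p:       parameter vector, shared across edges with the same index
    inputs:  list of input nodes; outputs: list of output nodes (linear, no ReLU)
    """
    h = {v: x[i] for i, v in enumerate(inputs)}
    pre = {}
    for (u, v) in graph:
        # Accumulate the weighted sum into v; because edges are topologically
        # ordered, h[u] is final before any edge out of u is processed.
        pre[v] = pre.get(v, 0.0) + p[pi[(u, v)]] * h[u]
        h[v] = pre[v] if v in outputs else max(pre[v], 0.0)  # ReLU on internal nodes
    return np.array([h[v] for v in outputs])
```

In a toy graph where edges (1, 2) and (0, 2) are both mapped to the same parameter index, this reproduces exactly the many-to-one situation described above.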
We mostly use the general notation of feedforward networks with shared weights throughout the paper, as this is more general and simplifies the development and notation. However, when focusing on RNNs, it is helpful to discuss them using a more familiar notation, which we briefly introduce next.\n\nRecurrent Neural Networks   Time-unfolded RNNs are feedforward networks with shared weights that map an input sequence to an output sequence. Each input node corresponds to either a coordinate of the input vector at a particular time step or a hidden unit at time 0. Each output node also corresponds to a coordinate of the output at a specific time step. Finally, each internal node refers to some hidden unit at time t \u2265 1. When discussing RNNs, it is useful to refer to different layers and the values calculated at different time-steps. We use a notation for RNN structures in which the nodes are partitioned into layers and h^i_t denotes the output of the nodes in layer i at time step t. Let x = (x_1, . . . , x_T) be the input at different time steps, where T is the maximum number of propagations through time; we refer to T as the length of the RNN. For 0 \u2264 i < d, let W^i_in and W^i_rec be the input and recurrent parameter matrices of layer i, and let W_out be the output parameter matrix. Table 1 shows the forward computations for RNNs. The output of the function implemented by the RNN can then be calculated as f_{W,t}(x) = h^d_t. Note that in this notation, the weight matrices W_in, W_rec and W_out correspond to \u201cfree\u201d parameters of the model that are shared across different time steps.\n\n2The bias terms can be modeled by having an additional special node v_bias that is connected to all internal and output nodes, where h_{v_bias} = 1.\n\n3 Non-Saturating Activation Functions\n\nThe choice of activation function for neural networks can have a large impact on optimization. 
We are particularly concerned with the distinction between \u201csaturating\u201d and \u201cnon-saturating\u201d activation functions. We consider only monotone activation functions and say that a function is \u201csaturating\u201d if it is bounded\u2014this includes, e.g., sigmoid, hyperbolic tangent and the piecewise-linear ramp activation functions. Boundedness necessarily implies that the function values converge to finite values at negative and positive infinity, and hence asymptote to horizontal lines on both sides. That is, the derivative of the activation converges to zero as the input goes to both \u2212\u221e and +\u221e. Networks with saturating activations therefore have a major shortcoming: the vanishing gradient problem [6]. The problem here is that the gradient disappears when the magnitude of the input to an activation is large (whether the unit is very \u201cactive\u201d or very \u201cinactive\u201d), which makes the optimization very challenging.\n\nWhile sigmoid and hyperbolic tangent have historically been popular choices for fully connected feedforward and convolutional neural networks, more recent works have shown undeniable advantages of non-saturating activations such as the ReLU, which is now the standard choice for fully connected and convolutional networks [15, 10]. Non-saturating activations, including the ReLU, are typically still bounded from below and asymptote to a horizontal line, with a vanishing derivative, at \u2212\u221e. But they are unbounded from above, enabling their derivative to remain bounded away from zero as the input goes to +\u221e. 
Using ReLUs enables gradients to not vanish along activated paths and thus can provide a stronger signal for training.\n\nHowever, for recurrent neural networks, using ReLU activations is challenging in a different way, as even a small change in the direction of the leading eigenvector of the recurrent weights could get amplified and potentially lead to explosion in forward or backward propagation [1].\n\nTo understand this, consider a long path from an input in the first element of the sequence to an output of the last element, which passes through the same RNN edge at each step (i.e. through many edges in some E_i in the shared-parameter representation). The length of this path, and the number of times it passes through edges associated with a single parameter, is proportional to the sequence length, which could easily be a few hundred or more. The effect of this parameter on the path is therefore exponentiated by the sequence length, as are gradient updates for this parameter, which could lead to parameter explosion unless an extremely small step size is used.\n\nUnderstanding the geometry of RNNs with ReLUs could help us deal with the above issues more effectively. We next investigate some properties of the geometry of RNNs with ReLU activations.\n\nInvariances in Feedforward Nets with Shared Weights\n\nFeedforward networks (with or without shared weights) are highly over-parameterized, i.e. there are many parameter settings p that represent the same function f_p. Since our true object of interest is the function f, and not the identity p of the parameters, it would be beneficial if optimization depended only on f_p and did not get \u201cdistracted\u201d by differences in p that do not affect f_p. It is therefore helpful to study the transformations on the parameters that do not change the function represented by the network, and to come up with methods whose performance is not affected by such transformations.\n\nDefinition 1. 
We say a network N is invariant to a transformation T if for any parameter setting p, f_p = f_{T(p)}. Similarly, we say an update rule A is invariant to T if for any p, f_{A(p)} = f_{A(T(p))}.\n\nInvariances have also been studied as different mappings from the parameter space to the same function space [19], while we define the transformation as a mapping inside a fixed parameter space. A very important invariance in feedforward networks is node-wise rescaling [17]. For any internal node v and any scalar \u03b1 > 0, we can multiply all incoming weights into v (i.e. w_{u\u2192v} for any (u \u2192 v) \u2208 E) by \u03b1 and all the outgoing weights (i.e. w_{v\u2192u} for any (v \u2192 u) \u2208 E) by 1/\u03b1 without changing the function computed by the network. Not all node-wise rescaling transformations can be applied in feedforward nets with shared weights. This is due to the fact that some weights are forced to be equal and we are therefore only allowed to change them by the same scaling factor.\n\nDefinition 2. Given a network N, we say an invariant transformation \u02dcT that is defined over edge weights (rather than parameters) is feasible for the parameter mapping \u03c0 if the shared weights remain equal after the transformation, i.e. for any i and for any e, e\u2032 \u2208 E_i, \u02dcT(w)_e = \u02dcT(w)_{e\u2032}.\n\nFigure 1: An example of invariances in an RNN with two hidden layers, each of which has 2 hidden units. The dashed lines correspond to recurrent weights. The network on the left hand side is equivalent (i.e. represents the same function) to the network on the right for any nonzero \u03b1^1_1 = c, \u03b1^2_1 = a, \u03b1^1_2 = b, \u03b1^2_2 = d.\n\nTherefore, it is helpful to understand what the feasible node-wise rescalings for RNNs are. In the following theorem, we characterize all feasible node-wise invariances in RNNs.\n\nTheorem 1. 
For any \u03b1 such that \u03b1^i_j > 0, any Recurrent Neural Network with ReLU activations is invariant to the transformation T_\u03b1([W_in, W_rec, W_out]) = [T_{in,\u03b1}(W_in), T_{rec,\u03b1}(W_rec), T_{out,\u03b1}(W_out)], where for any i, j, k:\n\nT_{in,\u03b1}(W_in)^i[j, k] = \u03b1^i_j W^i_in[j, k] if i = 1, and (\u03b1^i_j / \u03b1^{i\u22121}_k) W^i_in[j, k] if 1 < i < d;\nT_{rec,\u03b1}(W_rec)^i[j, k] = (\u03b1^i_j / \u03b1^i_k) W^i_rec[j, k];\nT_{out,\u03b1}(W_out)[j, k] = (1 / \u03b1^{d\u22121}_k) W_out[j, k].   (1)\n\nFurthermore, any feasible node-wise rescaling transformation can be presented in the above form.\n\nThe proofs of all theorems and lemmas are given in the supplementary material. The above theorem shows that there are many transformations under which RNNs represent the same function. An example of such invariances is shown in Fig. 1. Therefore, we would like to have optimization algorithms that are invariant to these transformations, and in order to do so, we need to look at measures that are invariant to such mappings.\n\n4 Path-SGD for Networks with Shared Weights\n\nAs we discussed, optimization is inherently tied to a choice of geometry, here represented by a choice of complexity measure or \u201cnorm\u201d3. Furthermore, we prefer using an invariant measure, which can then lead to an invariant optimization method. In Section 4.1 we introduce the path-regularizer and in Section 4.2 the derived Path-SGD optimization algorithm for standard feed-forward networks. Then in Section 4.3 we extend these notions to networks with shared weights, including RNNs, and present two invariant optimization algorithms based on them. In Section 4.4 we show how these can be implemented efficiently using forward and backward propagations.\n\n4.1 Path-regularizer\n\nThe path-regularizer is the sum over all paths from input nodes to output nodes of the product of squared weights along the path. 
To define it formally, let P be the set of directed paths from input to output units, so that for any path \u03b6 = (\u03b6_0, . . . , \u03b6_{len(\u03b6)}) \u2208 P of length len(\u03b6), we have \u03b6_0 \u2208 V_in, \u03b6_{len(\u03b6)} \u2208 V_out, and for any 0 \u2264 i \u2264 len(\u03b6) \u2212 1, (\u03b6_i \u2192 \u03b6_{i+1}) \u2208 E. We also abuse notation and write e \u2208 \u03b6 if for some i, e = (\u03b6_i, \u03b6_{i+1}). Then the path-regularizer can be written as:\n\n\u03b3^2_net(w) = \u2211_{\u03b6\u2208P} \u220f_{i=0}^{len(\u03b6)\u22121} w^2_{\u03b6_i\u2192\u03b6_{i+1}}   (2)\n\nEquivalently, the path-regularizer can be defined recursively on the nodes of the network as:\n\n\u03b3^2_v(w) = \u2211_{(u\u2192v)\u2208E} \u03b3^2_u(w) w^2_{u\u2192v},   \u03b3^2_net(w) = \u2211_{u\u2208V_out} \u03b3^2_u(w)   (3)\n\n3The path-norm which we define is a norm on functions, not on weights, but as we prefer not getting into this technical discussion here, we use the term \u201cnorm\u201d very loosely to indicate some measure of magnitude [18].\n\n4.2 Path-SGD for Feedforward Networks\n\nPath-SGD is an approximate steepest descent step with respect to the path-norm. 
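The sum-over-paths form (2) and the recursive form (3) compute the same quantity; the recursion needs only one pass over the edges instead of an enumeration of all paths. A small numerical sanity check (the helper names are ours, and the edge list is assumed to be topologically ordered):

```python
def gamma2_paths(edges, w, inputs, outputs):
    """Path-regularizer via explicit path enumeration, as in Eq. (2)."""
    adj = {}
    for (u, v) in edges:
        adj.setdefault(u, []).append(v)

    def walk(v, prod):
        # A completed input-to-output path contributes the product of squared weights.
        if v in outputs:
            return prod
        return sum(walk(nxt, prod * w[(v, nxt)] ** 2) for nxt in adj.get(v, []))

    return sum(walk(s, 1.0) for s in inputs)

def gamma2_recursive(edges, w, inputs, outputs):
    """Same quantity via the node recursion of Eq. (3): one forward pass over squared weights."""
    g = {v: 1.0 for v in inputs}  # gamma^2 of an input node is 1 (empty product)
    for (u, v) in edges:  # edges in topological order
        g[v] = g.get(v, 0.0) + g[u] * w[(u, v)] ** 2
    return sum(g[v] for v in outputs)
```

On a small diamond-shaped graph both functions return the same value, which is exactly the equivalence of (2) and (3).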
More formally, for a network without shared weights, where the parameters are the weights themselves, consider the diagonal quadratic approximation of the path-regularizer about the current iterate w^(t):\n\n\u02c6\u03b3^2_net(w^(t) + \u2206w) = \u03b3^2_net(w^(t)) + \u27e8\u2207\u03b3^2_net(w^(t)), \u2206w\u27e9 + (1/2) \u2206w^T diag(\u2207^2\u03b3^2_net(w^(t))) \u2206w   (4)\n\nUsing the corresponding quadratic norm \u2016w \u2212 w\u2032\u2016^2_{\u02c6\u03b3^2_net(w^(t)+\u2206w)} = (1/2) \u2211_{e\u2208E} (\u2202^2\u03b3^2_net/\u2202w_e^2)(w_e \u2212 w\u2032_e)^2, we can define an approximate steepest descent step as:\n\nw^(t+1) = min_w  \u03b7\u27e8\u2207L(w), w \u2212 w^(t)\u27e9 + \u2016w \u2212 w^(t)\u2016^2_{\u02c6\u03b3^2_net(w^(t)+\u2206w)}.   (5)\n\nSolving (5) yields the update:\n\nw^(t+1)_e = w^(t)_e \u2212 (\u03b7/\u03ba_e(w^(t))) \u00b7 \u2202L/\u2202w_e(w^(t)),   where \u03ba_e(w) = (1/2) \u2202^2\u03b3^2_net(w)/\u2202w_e^2.   (6)\n\nThe stochastic version that uses a subset of training examples to estimate \u2202L/\u2202w(w^(t)) is called Path-SGD [16]. We now show how Path-SGD can be extended to networks with shared weights.\n\n4.3 Extending to Networks with Shared Weights\n\nWhen the network has shared weights, the path-regularizer is a function of the parameters p, and therefore the quadratic approximation should also be with respect to the iterate p^(t) instead of w^(t), which results in the following update rule:\n\np^(t+1) = min_p  \u03b7\u27e8\u2207L(p), p \u2212 p^(t)\u27e9 + \u2016p \u2212 p^(t)\u2016^2_{\u02c6\u03b3^2_net(p^(t)+\u2206p)}   (7)\n\nwhere \u2016p \u2212 p\u2032\u2016^2_{\u02c6\u03b3^2_net(p^(t)+\u2206p)} = (1/2) \u2211_{i=1}^{m} (\u2202^2\u03b3^2_net/\u2202p_i^2)(p_i \u2212 p\u2032_i)^2. 
Solving (7) gives the following update:\n\np^(t+1)_i = p^(t)_i \u2212 (\u03b7/\u03ba_i(p^(t))) \u00b7 \u2202L/\u2202p_i(p^(t)),   where \u03ba_i(p) = (1/2) \u2202^2\u03b3^2_net(p)/\u2202p_i^2.   (8)\n\nThe second derivative terms \u03ba_i are specified in terms of their path structure as follows:\n\nLemma 1. \u03ba_i(p) = \u03ba^(1)_i(p) + \u03ba^(2)_i(p), where\n\n\u03ba^(1)_i(p) = \u2211_{e\u2208E_i} \u2211_{\u03b6\u2208P} 1_{e\u2208\u03b6} \u220f_{j=0, e\u2260(\u03b6_j\u2192\u03b6_{j+1})}^{len(\u03b6)\u22121} p^2_{\u03c0(\u03b6_j\u2192\u03b6_{j+1})} = \u2211_{e\u2208E_i} \u03ba_e(w),   (9)\n\n\u03ba^(2)_i(p) = p_i^2 \u2211_{e_1,e_2\u2208E_i, e_1\u2260e_2} \u2211_{\u03b6\u2208P} 1_{e_1,e_2\u2208\u03b6} \u220f_{j=0, e_1\u2260(\u03b6_j\u2192\u03b6_{j+1}), e_2\u2260(\u03b6_j\u2192\u03b6_{j+1})}^{len(\u03b6)\u22121} p^2_{\u03c0(\u03b6_j\u2192\u03b6_{j+1})},   (10)\n\nand \u03ba_e(w) is defined in (6).\n\nThe second term \u03ba^(2)_i(p) measures the effect of interactions between edges corresponding to the same parameter (edges from the same E_i) on the same path from input to output. In particular, if for any path from an input unit to an output unit, no two edges along the path share the same parameter, then \u03ba^(2)(p) = 0. For example, for any feedforward or convolutional neural network, \u03ba^(2)(p) = 0. But for RNNs, there certainly are multiple edges sharing a single parameter on the same path, and so we could have \u03ba^(2)(p) \u2260 0.\n\nThe above lemma gives us a precise update rule for the approximate steepest descent with respect to the path-regularizer. The following theorem confirms that steepest descent with respect to this regularizer is also invariant to all feasible node-wise rescalings for networks with shared weights.\n\nTheorem 2. 
For any feedforward network with shared weights, the update (8) is invariant to all feasible node-wise rescalings. Moreover, a simpler update rule that only uses \u03ba^(1)_i(p) in place of \u03ba_i(p) is also invariant to all feasible node-wise rescalings.\n\nEquations (9) and (10) involve a sum over all paths in the network, which is exponential in the depth of the network. However, we next show that both of these equations can be calculated efficiently.\n\nFigure 2: Path-SGD with/without the second term in word-level language modeling on PTB (left plot: training error; right plot: test error). We use the standard split (929k training, 73k validation and 82k test) and a vocabulary size of 10k words. We initialize the weights by sampling from the uniform distribution with range [\u22120.1, 0.1]. The table on the left shows the ratio of the magnitudes of the first and second terms for different lengths T and numbers of hidden units H. The plots compare the training and test errors using a mini-batch of size 32, backpropagating through T = 20 time steps, where the step-size is chosen by a grid search.\n\n4.4 Simple and Efficient Computations for RNNs\n\nWe show how to calculate \u03ba^(1)_i(p) and \u03ba^(2)_i(p) by considering a network with the same architecture but with squared weights:\n\nTheorem 3. For any network N(G, \u03c0, p), consider N(G, \u03c0, \u02dcp) where for any i, \u02dcp_i = p_i^2. Define the function g : R^{|V_in|} \u2192 R to be the sum of the outputs of this network: g(x) = \u2211_{i=1}^{|V_out|} f_{\u02dcp}(x)[i]. Then \u03ba^(1) and \u03ba^(2) can be calculated as follows, where 1 is the all-ones input vector:\n\n\u03ba^(1)(p) = \u2207_{\u02dcp} g(1),   \u03ba^(2)_i(p) = \u02dcp_i \u2211_{(u\u2192v),(u\u2032\u2192v\u2032)\u2208E_i, (u\u2192v)\u2260(u\u2032\u2192v\u2032)} (\u2202g(1)/\u2202h_{v\u2032}(\u02dcp)) (\u2202h_{u\u2032}(\u02dcp)/\u2202h_v(\u02dcp)) h_u(\u02dcp).   (11)\n\nIn the process of calculating the gradient \u2207_{\u02dcp} g(1), we need to calculate h_u(\u02dcp) and \u2202g(1)/\u2202h_v(\u02dcp) for any u, v. Therefore, the only remaining term to calculate (besides \u2207_{\u02dcp} g(1)) is \u2202h_{u\u2032}(\u02dcp)/\u2202h_v(\u02dcp). Recall that T is the length (maximum number of propagations through time) and d is the number of layers in an RNN. Let H be the number of hidden units in each layer and B be the size of the mini-batch. Then calculating the gradient of the loss at all points in the mini-batch (the standard work required for any mini-batch gradient approach) requires time O(BdTH^2). In order to calculate \u03ba^(1)_i(p), we need to calculate the gradient \u2207_{\u02dcp} g(1) of a similar network at a single input\u2014so the time complexity is just an additional O(dTH^2). The second term \u03ba^(2)(p) can also be calculated for RNNs in O(dTH^2(T + H)). For an RNN, \u03ba^(2)(W_in) = 0 and \u03ba^(2)(W_out) = 0 because only recurrent weights are shared multiple times along an input-output path. \u03ba^(2)(W_rec) can be written and calculated in matrix form:\n\n\u03ba^(2)(W^i_rec) = W\u2032^i_rec \u2299 [ \u2211_{t_1=0}^{T\u22123} ((W\u2032^i_rec)^{t_1})^T \u2211_{t_2=2}^{T\u2212t_1\u22121} (\u2202g(1)/\u2202h^i_{t_1+t_2+1}(\u02dcp)) (h^i_{t_2}(\u02dcp))^T ]\n\nwhere for any i, j, k we have W\u2032^i_rec[j, k] = (W^i_rec[j, k])^2. The only terms that require extra computation are the powers of W_rec, which can be done in O(dTH^3); the rest of the matrix computations need O(dT^2H^2). Therefore, the ratios of the time complexity of calculating the first and second terms to that of the gradient over a mini-batch are O(1/B) and O((T + H)/B) respectively. Calculating only \u03ba^(1)_i(p) is therefore very cheap, with minimal per-minibatch cost, while calculating \u03ba^(2)_i(p) might be expensive for large networks. Beyond the low computational cost, calculating \u03ba^(1)_i(p) is also very easy to implement, as it requires only taking the gradient with respect to a standard feed-forward calculation in a network with slightly modified weights\u2014with most deep learning libraries it can be implemented very easily with only a few lines of code.\n\n5 Experiments\n\n5.1 The Contribution of the Second Term\n\nAs we discussed in Section 4.4, the second term \u03ba^(2) in the update rule can be computationally expensive for large networks. In this section we investigate the significance of the second term and show that, at least in our experiments, its contribution is negligible. To compare the two terms \u03ba^(1) and \u03ba^(2), we train a single-layer RNN with H = 200 hidden units for the task of word-level language modeling on the Penn Treebank (PTB) Corpus [13]. Fig. 2 compares the performance of SGD vs. Path-SGD with/without \u03ba^(2). We clearly see that both versions of Path-SGD perform very similarly and both of them outperform SGD significantly. The results in Fig. 2 suggest that the first term is more significant and that we can therefore ignore the second term.\n\nTo better understand the importance of the two terms, we compared the ratio of the norms \u2016\u03ba^(2)\u2016_2 / \u2016\u03ba^(1)\u2016_2 for different RNN lengths T and numbers of hidden units H. The table in Fig. 2 shows that the contribution of the second term is bigger when the network has fewer hidden units and the length of the RNN is larger (H is small and T is large). However, in many cases, it appears that the first term has a much bigger contribution in the update step and hence the second term can be safely ignored. Therefore, in the rest of our experiments, we calculate the Path-SGD updates only using the first term \u03ba^(1).\n\nFigure 3: Test errors for the addition problem of different lengths.\n\n5.2 Synthetic Problems with Long-term Dependencies\n\nTraining Recurrent Neural Networks is known to be hard for modeling long-term dependencies due to the gradient vanishing/exploding problem [6, 2]. In this section, we consider synthetic problems that are specifically designed to test the ability of a model to capture long-term dependency structure. Specifically, we consider the addition problem and the sequential MNIST problem.\n\nAddition problem: The addition problem was introduced in [7]. 
Here, each input consists of two sequences of length T, one of which includes numbers sampled from the uniform distribution with range [0, 1], while the other serves as a mask that is filled with zeros except for two entries. These two entries indicate which of the two numbers in the first sequence we need to add, and the task is to output the result of this addition.\n\nSequential MNIST: In sequential MNIST, each digit image is reshaped into a sequence of length 784, turning the digit classification task into sequence classification with long-term dependencies [12, 1].\n\nFor both tasks, we closely follow the experimental protocol in [12]. We train a single-layer RNN consisting of 100 hidden units with path-SGD, referred to as RNN-Path. We also train an RNN of the same size with identity initialization, as was proposed in [12], using SGD as our baseline model, referred to as IRNN. We performed a grid search for the learning rates over {10\u22122, 10\u22123, 10\u22124} for both our model and the baseline. Non-recurrent weights were initialized from the uniform distribution with range [\u22120.01, 0.01]. Similar to [1], we found the IRNN to be fairly unstable (with SGD optimization typically diverging). Therefore, for IRNN, we ran 10 different initializations and picked the one that did not explode to show its performance.\n\nIn our first experiment, we evaluate Path-SGD on the addition problem. The results are shown in Fig. 3 as the length T of the sequence increases: {100, 400, 750}. We note that this problem becomes much harder as T increases, because the dependency between the output (the sum of two numbers) and the corresponding inputs becomes more distant. We also compare RNN-Path with previously published results, including the identity-initialized RNN [12] (IRNN), the unitary RNN [1] (uRNN), and the np-RNN4 introduced by [22]. 
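For concreteness, the adding-problem inputs described above can be generated as follows (a hypothetical data helper written for illustration, not the authors' code):

```python
import numpy as np

def addition_batch(batch_size, T, rng=None):
    """Generate one batch of the adding problem described above.

    Each input is two length-T sequences: uniform[0, 1] values and a 0/1 mask
    marking the two positions whose values must be added. The target is the
    sum of the two marked values.
    """
    rng = rng or np.random.default_rng(0)
    values = rng.uniform(0.0, 1.0, size=(batch_size, T))
    mask = np.zeros((batch_size, T))
    for b in range(batch_size):
        i, j = rng.choice(T, size=2, replace=False)  # two marked positions
        mask[b, i] = mask[b, j] = 1.0
    targets = (values * mask).sum(axis=1)
    x = np.stack([values, mask], axis=-1)  # shape (batch, T, 2)
    return x, targets
```

Note that the target is a sum of two independent uniform[0, 1] draws, so always predicting its mean (1.0) already achieves an MSE of about 0.167; a model must beat this trivial baseline.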
Table 2 shows the effectiveness of using Path-SGD. Perhaps more surprisingly, with the help of path-normalization, a simple RNN with the identity initialization is able to achieve a 0% error on the sequences of length 750, whereas all the other methods, including LSTMs, fail. This shows that Path-SGD may help stabilize the training and alleviate the gradient problem, so as to perform well on longer sequences. We next tried to model sequences of length 1000, but we found that for such very long sequences RNNs, even with Path-SGD, fail to learn.\n\n4The original paper does not include any result for 750, so we implemented np-RNN for comparison. However, in our implementation the np-RNN is not able to even learn sequences of length 200. Thus we put \u201c>2\u201d for length 750.\n\nTable 2: Test error (MSE) for the adding problem with different input sequence lengths and test classification error for the sequential MNIST.\n\n              Adding 100   Adding 400   Adding 750   sMNIST\nIRNN [12]         0           16.7         16.7        5.0\nuRNN [1]          0            3           16.7        4.9\nLSTM [1]          0            2           16.7        1.8\nnp-RNN [22]       0            2            >2         3.1\nIRNN              0            0           16.7        7.1\nRNN-Path          0            0            0          3.1\n\nTable 3: Test BPC for PTB and text8.\n\n                        PTB     text8\nRNN+smoothReLU [20]      -       1.55\nHF-MRNN [14]            1.42     1.54\nRNN-ReLU [11]           1.65      -\nRNN-tanh [11]           1.55      -\nTRec, \u03b2 = 500 [11]      1.48      -\nRNN-ReLU                1.55     1.65\nRNN-tanh                1.58     1.70\nRNN-Path                1.47     1.58\nLSTM                    1.41     1.52\n\nNext, we evaluate Path-SGD on the sequential MNIST problem. Table 2, right column, reports test error rates achieved by RNN-Path compared to previously published results. Clearly, using Path-SGD helps RNNs achieve better generalization. 
In many cases, RNN-Path outperforms the other RNN methods (except for LSTMs), even for such a long-term dependency problem.
5.3 Language Modeling Tasks
In this section we evaluate Path-SGD on language modeling tasks. We consider two datasets, Penn Treebank (PTB-c) and text8.⁵ PTB-c: We performed experiments on a tokenized Penn Treebank corpus, following the experimental protocol of [11]. The training, validation, and test data contain 5017k, 393k, and 442k characters, respectively. The alphabet size is 50, and each training sequence is of length 50. text8: The text8 dataset contains 100M characters from Wikipedia with an alphabet size of 27. We follow the data partition of [14], where each training sequence has a length of 180. Performance is evaluated using the bits-per-character (BPC) metric, which is log2 of the perplexity.
As in the experiments on the synthetic datasets, for both tasks we train a single-layer RNN consisting of 2048 hidden units with path-SGD (RNN-Path). Due to the large dimension of the hidden space, SGD can take a fairly long time to converge. Instead, we use the Adam optimizer [8] to help speed up training, simply feeding the path-SGD gradient to Adam in place of the ordinary gradient.
We also train three additional baseline models: a ReLU RNN with 2048 hidden units, a tanh RNN with 2048 hidden units, and an LSTM with 1024 hidden units, all trained using Adam. We performed a grid search over learning rates {10⁻³, 5·10⁻⁴, 10⁻⁴} for all of our models. For ReLU RNNs, we initialize the recurrent matrices from uniform[−0.01, 0.01], and non-recurrent weights from uniform[−0.2, 0.2]. For LSTMs, we use orthogonal initialization [21] for the recurrent matrices and uniform[−0.01, 0.01] for non-recurrent weights.
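Bits-per-character is the average negative log-probability of the next character measured in bits, i.e., log2 of the per-character perplexity. A minimal sketch of the metric (this is a generic illustration, not the paper's evaluation code):

```python
import numpy as np

def bits_per_character(char_probs):
    """Compute BPC from the probability the model assigned to each
    observed character: the mean of -log2 p over the sequence.

    Perplexity is then 2**BPC, so BPC = log2(perplexity).
    """
    char_probs = np.asarray(char_probs, dtype=float)
    return float(np.mean(-np.log2(char_probs)))
```

For calibration, a model that spreads its mass uniformly over a 27-character alphabet (as in text8) scores log2(27) ≈ 4.75 BPC, so values near 1.5 in Table 3 reflect substantial compression.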
The results are summarized in Table 3. We also compare our results to an RNN with a hidden-activation regularizer [11] (TRec, β = 500), multiplicative RNNs trained by Hessian-free methods [14] (HF-MRNN), and an RNN with a smooth version of ReLU [20]. Table 3 shows that path-normalization is able to outperform RNN-ReLU and RNN-tanh, while at the same time narrowing the performance gap between plain RNNs and more complicated models (e.g., closing the gap to LSTMs by 57% on PTB and 54% on text8). This demonstrates the efficacy of path-normalized optimization for training RNNs with ReLU activations.
6 Conclusion
We investigated the geometry of RNNs as part of a broader class of feedforward networks with shared weights, and showed how understanding this geometry can lead to significant improvements on different learning tasks. By designing an optimization algorithm with a geometry well-suited to RNNs, we closed over half of the performance gap between vanilla RNNs and LSTMs. This is particularly useful for applications that require compressed models with fast prediction time and minimal storage, and it is also a step toward bridging the gap between LSTMs and plain RNNs.
Acknowledgments
This research was supported in part by NSF RI/AF grant 1302662, an Intel ICRI-CI award, ONR Grant N000141310721, and ADeLAIDE grant FA8750-16C-0130-001. We thank Saizheng Zhang for sharing a base code for RNNs.

⁵http://mattmahoney.net/dc/textdata

References
[1] Martin Arjovsky, Amar Shah, and Yoshua Bengio. Unitary evolution recurrent neural networks. arXiv preprint arXiv:1511.06464, 2015.
[2] Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166, 1994.
[3] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio.
Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, 2014.
[4] Alex Graves and Navdeep Jaitly. Towards end-to-end speech recognition with recurrent neural networks. In Proceedings of the International Conference on Machine Learning (ICML), pages 1764–1772, 2014.
[5] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. In Proceedings of the International Conference on Learning Representations, 2016.
[6] Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6(02), 1998.
[7] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8), 1997.
[8] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations, 2015.
[9] Ryan Kiros, Ruslan Salakhutdinov, and Richard S Zemel. Unifying visual-semantic embeddings with multimodal neural language models. Transactions of the Association for Computational Linguistics, 2015.
[10] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), pages 1097–1105, 2012.
[11] David Krueger and Roland Memisevic. Regularizing RNNs by stabilizing activations. In Proceedings of the International Conference on Learning Representations, 2016.
[12] Quoc V Le, Navdeep Jaitly, and Geoffrey E Hinton. A simple way to initialize recurrent networks of rectified linear units.
arXiv preprint arXiv:1504.00941, 2015.
[13] Mitchell P Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 19(2):313–330, 1993.
[14] Tomáš Mikolov, Ilya Sutskever, Anoop Deoras, Hai-Son Le, Stefan Kombrink, and J Cernocky. Subword language modeling with neural networks. (http://www.fit.vutbr.cz/~imikolov/rnnlm/char.pdf), 2012.
[15] Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the International Conference on Machine Learning (ICML), pages 807–814, 2010.
[16] Behnam Neyshabur, Ruslan Salakhutdinov, and Nathan Srebro. Path-SGD: Path-normalized optimization in deep neural networks. In Advances in Neural Information Processing Systems (NIPS), 2015.
[17] Behnam Neyshabur, Ryota Tomioka, Ruslan Salakhutdinov, and Nathan Srebro. Data-dependent path normalization in neural networks. In Proceedings of the International Conference on Learning Representations, 2016.
[18] Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. Norm-based capacity control in neural networks. In Proceedings of the 28th Conference on Learning Theory (COLT), 2015.
[19] Yann Ollivier. Riemannian metrics for neural networks II: recurrent networks and learning symbolic data sequences. Information and Inference, page iav007, 2015.
[20] Marius Pachitariu and Maneesh Sahani. Regularization and nonlinearities for neural language models: when are they needed? arXiv preprint arXiv:1301.5650, 2013.
[21] Andrew M Saxe, James L McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In Proceedings of the International Conference on Learning Representations, 2014.
[22] Sachin S. Talathi and Aniket Vartak. Improving performance of recurrent neural network with ReLU nonlinearity.
In Proceedings of the International Conference on Learning Representations workshop track, 2014.
[23] Saizheng Zhang, Yuhuai Wu, Tong Che, Zhouhan Lin, Roland Memisevic, Ruslan Salakhutdinov, and Yoshua Bengio. Architectural complexity measures of recurrent neural networks. arXiv preprint arXiv:1602.08210, 2016.