{"title": "A Mean Field Theory of Quantized Deep Networks: The Quantization-Depth Trade-Off", "book": "Advances in Neural Information Processing Systems", "page_first": 7038, "page_last": 7048, "abstract": "Reducing the precision of weights and activation functions in neural network training, with minimal impact on performance, is essential for the deployment of these models in resource-constrained environments. We apply mean field techniques to networks with quantized activations in order to evaluate the degree to which quantization degrades signal propagation at initialization. We derive initialization schemes which maximize signal propagation in such networks, and suggest why this is helpful for generalization. Building on these results, we obtain a closed form implicit equation for $L_{\\max}$, the maximal trainable depth (and hence model capacity), given $N$, the number of quantization levels in the activation function. Solving this equation numerically, we obtain asymptotically: $L_{\\max}\\propto N^{1.82}$.", "full_text": "A Mean Field Theory of Quantized Deep Networks:\n\nThe Quantization-Depth Trade-Off\n\nYaniv Blumenfeld\nTechnion, Israel\n\nyanivblm6@gmail.com\n\nDar Gilboa\n\nColumbia University\n\ndargilboa@gmail.com\n\nDaniel Soudry\nTechnion, Israel\n\ndaniel.soudry@gmail.com\n\nAbstract\n\nReducing the precision of weights and activation functions in neural network train-\ning, with minimal impact on performance, is essential for the deployment of these\nmodels in resource-constrained environments. We apply mean \ufb01eld techniques\nto networks with quantized activations in order to evaluate the degree to which\nquantization degrades signal propagation at initialization. We derive initialization\nschemes which maximize signal propagation in such networks, and suggest why\nthis is helpful for generalization. 
Building on these results, we obtain a closed form implicit equation for $L_{\max}$, the maximal trainable depth (and hence model capacity), given $N$, the number of quantization levels in the activation function. Solving this equation numerically, we obtain asymptotically: $L_{\max} \propto N^{1.82}$.

1 Introduction

As neural networks are increasingly trained and deployed on-device in settings with memory and space constraints [12, 5], a better understanding of the trade-offs involved in the choice of architecture and training procedure is gaining in importance. One widely used method to conserve resources is the quantization (discretization) of the weights and/or activation functions during training [6, 25, 13, 3]. When choosing a quantized architecture, it is natural to expect depth to increase the flexibility of the model class, yet choosing a deeper architecture can make the training process more difficult. Additionally, due to resource constraints, when using a quantized activation function whose image is a finite set of size $N$, one would like to choose the smallest possible $N$ such that the model is trainable and performance is minimally affected. There is thus a trade-off between the capacity of the network, which depends on its depth, and the ability to train it efficiently on the one hand, and the parsimony of the activation function used on the other.
We quantify this trade-off between capacity/trainability and the degree of quantization by an analysis of wide neural networks at initialization. This is achieved by studying signal propagation in deep quantized networks, using techniques introduced in [24, 26] that have been applied to numerous architectures. Signal propagation will refer to the propagation of correlations between inputs into the hidden states of a deep network.
Additionally, we consider the dynamics of training in this regime and the effect of signal propagation on the change in generalization error during training.
In this paper,

• We suggest (section 3.2) that if the signal propagation conditions do not hold, generalization error in early stages of training should not decrease at a typical test point, potentially explaining the empirically observed benefit of signal propagation to generalization. This is done using an analysis of learning dynamics in wide neural networks, and corroborated numerically.
• We obtain (section 4.2) initialization schemes that maximize signal propagation in certain classes of feed-forward networks with quantized activations.
• Combining these results, we obtain an expression for the trade-off between the quantization level and the maximal trainable depth of the network (eq. 18), in terms of the depth scale of signal propagation. We experimentally corroborate these predictions (Figure 3).

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

2 Related work

Several works have shown that training at 16-bit numerical precision is sufficient for most machine learning applications [10, 7], with little to no cost to model accuracy. Since then, many more aggressive quantization schemes have been suggested [13, 18, 22, 21], ranging from the extreme usage of 1-bit representations and math operations [25, 6] to a more conservative usage of 8 bits [3, 28], all in an effort to minimize the computational cost with minimal loss to model accuracy. Theoretically, it is well known that a small amount of imprecision can significantly degrade the representational capacity of a model. For example, an infinite precision recurrent neural network can simulate a universal Turing machine [27]. However, any numerical imprecision reduces the representational power of these models to that of finite automata [19].
In this paper, we focus on the effects of quantization on training. So far, these effects have typically been quantified empirically, though some theoretical work has been done in this direction (e.g. [17, 1, 36, 33]).
Signal propagation in wide neural networks has been the subject of recent work for fully-connected [24, 26, 23, 31], convolutional [30] and recurrent architectures [4, 8]. These works study the evolution of covariances between the hidden states of the network and the stability of the gradients. These depend only on the leading moments of the weight distributions and the nonlinearities in the infinite width limit, greatly simplifying the analysis. They identify critical initialization schemes that allow training of very deep networks (or recurrent networks on long time sequence tasks) without performing costly hyperparameter searches. While the analytical results in these works assume that the layer widths are taken to infinity sequentially (which we will refer to as the sequential limit), the predictions prove accurate when applied to networks with layers of equal width once the width is of the order of hundreds of neurons. For fully connected networks it was also shown, using an application of the Central Limit Theorem for exchangeable random variables, that the asymptotic behavior at infinite width is independent of the order of limits [20].

3 Preliminaries: the mean field approach

3.1 Signal propagation in feed-forward networks

We now review the analysis of signal propagation in feed-forward networks performed in [24, 26]. The network function $f : \mathbb{R}^{n_0} \to \mathbb{R}^{n_{L+1}}$ is given by

$$\phi(\alpha^{(0)}(x)) = x, \qquad \alpha^{(l)}(x) = W^{(l)}\phi(\alpha^{(l-1)}(x)) + b^{(l)}, \quad l = 1, ..., L, \qquad f(x) = \alpha^{(L+1)}(x), \quad (1)$$

for input $x \in \mathbb{R}^{n_0}$, weight matrices $W^{(l)} \in \mathbb{R}^{n_l \times n_{l-1}}$ and nonlinearity $\phi : \mathbb{R} \to \mathbb{R}$.
The weights are initialized using $W^{(l)}_{ij} \sim \mathcal{N}(0, \frac{\sigma_w^2}{n_{l-1}})$, $b^{(l)}_i \sim \mathcal{N}(0, \sigma_b^2)$, so that the variance of the neurons at every layer is independent of the layer widths^1.
According to Theorem 4 in [20], under a mild condition on the activation function that is satisfied by any saturating nonlinearity, the pre-activations $\alpha^{(l)}(x)$ converge in distribution to a multivariate Gaussian as the layer widths $n_1, ..., n_L$ are taken to infinity in any order (with $n_0, n_{L+1}$ finite)^2. In the physics literature the approximation obtained by taking this limit is known as the mean field approximation.

^1 In principle the following results should hold more generally under mild moment conditions alone.
^2 When taking the sequential limit, asymptotic normality is a consequence of repeated application of the Central Limit Theorem [24].

The covariance of this Gaussian at a given layer is then obtained by the recursive formula

$$\mathbb{E}\,\alpha^{(l)}_i(x)\alpha^{(l)}_j(x') = \mathbb{E}\left[\sum_{k,k'=1}^{n_{l-1}} W^{(l)}_{ik}W^{(l)}_{jk'}\,\phi(\alpha^{(l-1)}_k(x))\,\phi(\alpha^{(l-1)}_{k'}(x')) + b^{(l)}_i b^{(l)}_j\right] = \left(\sigma_w^2\,\mathbb{E}\,\phi(\alpha^{(l-1)}_i(x))\,\phi(\alpha^{(l-1)}_i(x')) + \sigma_b^2\right)\delta_{ij}. \quad (2)$$

Omitting the dependence on the inputs $x, x'$ in the RHS below, we define $Q^{(l)}$ and $C^{(l)}$ through

$$\begin{pmatrix} \mathbb{E}\,\alpha^{(l)}_i(x)\alpha^{(l)}_i(x) & \mathbb{E}\,\alpha^{(l)}_i(x)\alpha^{(l)}_i(x') \\ \mathbb{E}\,\alpha^{(l)}_i(x')\alpha^{(l)}_i(x) & \mathbb{E}\,\alpha^{(l)}_i(x')\alpha^{(l)}_i(x') \end{pmatrix} = Q^{(l)}\begin{pmatrix} 1 & C^{(l)} \\ C^{(l)} & 1 \end{pmatrix} = \Sigma(Q^{(l)}, C^{(l)}). \quad (3)$$

Combining eqs. 
2 and 3 we obtain the following two-dimensional dynamical system:

$$\begin{pmatrix} Q^{(l)} \\ C^{(l)} \end{pmatrix} = \begin{pmatrix} \sigma_w^2\,\mathbb{E}_{u\sim\mathcal{N}(0,Q^{(l-1)})}\,\phi^2(u) + \sigma_b^2 \\ \frac{1}{Q^{(l-1)}}\left[\sigma_w^2\,\mathbb{E}_{(u_1,u_2)\sim\mathcal{N}(0,\Sigma(Q^{(l-1)},C^{(l-1)}))}\,\phi(u_1)\phi(u_2) + \sigma_b^2\right] \end{pmatrix} \equiv M\left[\begin{pmatrix} Q^{(l-1)} \\ C^{(l-1)} \end{pmatrix}\right], \quad (4)$$

where $M$ depends on the nonlinearity and the initialization hyperparameters $\sigma_w^2, \sigma_b^2$, and the initial conditions $(Q^{(0)}, C^{(0)})^T$ depend also on $x, x'$. See Figure 1 for a visualization of the covariance propagation.
Once the above dynamical system converges to a fixed point $(Q^*, C^*)$, or at least approaches it to within numerical precision, information about the initial conditions is lost. As argued in [26], this is detrimental to learning, as inputs in different classes can no longer be distinguished in terms of the network output (assuming the fixed point $C^*$ is independent of $C^{(0)}$; see Lemma 1). The convergence rate to the fixed point can be obtained by linearizing the dynamics around it. This can be done for the two-dimensional system as a whole, yet in [26] it was also shown that, for any monotonically increasing nonlinearity, convergence of this linearized dynamical system in the direction $C^{(l)}$ cannot be faster than convergence in the $Q^{(l)}$ direction, and thus studying convergence can be reduced to the simpler one-dimensional system $C^{(l)} = M_{Q^*}(C^{(l-1)})$ that is obtained by assuming $Q^{(l)}$ has already converged, an assumption we review in Appendix K. The convergence rate is given by the following known results of [26, 8], which we recapitulate for completeness:

Lemma 1. 
[26, 8] Defining $\Sigma(Q, C) = Q\begin{pmatrix} 1 & C \\ C & 1 \end{pmatrix}$ for $Q \ge 0$, $C \in [-1, 1]$, the dynamical system

$$M_{Q^*}(C) = \frac{1}{Q^*}\left[\sigma_w^2\,\mathbb{E}_{(u_1,u_2)\sim\mathcal{N}(0,\Sigma(Q^*,C))}\,\phi(u_1)\phi(u_2) + \sigma_b^2\right], \quad (5)$$

when linearized around a fixed point $C^*$, converges at a rate

$$\chi = \left.\frac{\partial M_{Q^*}(C)}{\partial C}\right|_{C^*} = \sigma_w^2\,\mathbb{E}_{(u_1,u_2)\sim\mathcal{N}(0,\Sigma(Q^*,C^*))}\,\phi'(u_1)\phi'(u_2). \quad (6)$$

Additionally, $M_{Q^*}(C)$ has at most one stable fixed point in the range $[0, 1]$ for any choice of $\phi$ such that $\phi$ is odd or $\phi''$ is non-negative.

Proof: See Appendix A.
We subsequently drop the subscript in $M_{Q^*}(C)$ to lighten notation. The corresponding time scale of convergence in the linearized regime is

$$\xi = -\frac{1}{\log \chi}. \quad (7)$$

$\chi$ depends on the initialization hyperparameters and choice of nonlinearity, and it follows from the considerations above that signal propagation from the inputs to the outputs of a deep network is facilitated by a choice of hyperparameters such that $\xi$ diverges, which occurs as $\chi$ approaches 1 from below. Indeed, as observed empirically across multiple architectures and tasks [30, 4, 8, 31], up to a constant factor $\xi$ typically gives the maximal depth up to which a network is trainable. These calculations motivate initialization schemes that satisfy

$$\chi = 1$$

in order to train very deep networks. We will show shortly that this condition is unattainable for a large class of quantized activation functions.^3
For networks with continuous activations, the analysis of forward signal propagation in the sense described above is also related to the stability of the gradients [26]. 
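To make eqs. 4-7 concrete, the fixed point and convergence rate can be evaluated numerically for a smooth nonlinearity such as tanh. Below is a minimal sketch; the use of Gauss-Hermite quadrature and all parameter values are our choices, not prescribed by the paper:

```python
import numpy as np

def chi_and_depth_scale(phi, dphi, sigma_w, sigma_b, n_grid=60, n_iter=300):
    # Fixed point (Q*, C*) of eq. 4, the rate chi of eq. 6, depth scale xi of eq. 7.
    # Gauss-Hermite nodes/weights for expectations over N(0, 1)
    z, w = np.polynomial.hermite_e.hermegauss(n_grid)
    w = w / w.sum()
    # Q*: iterate the variance map Q -> sigma_w^2 E phi(u)^2 + sigma_b^2, u ~ N(0, Q)
    q = 1.0
    for _ in range(n_iter):
        q = sigma_w**2 * np.sum(w * phi(np.sqrt(q) * z)**2) + sigma_b**2
    # C*: iterate M_{Q*}(C) of eq. 5, with (u1, u2) ~ N(0, Sigma(Q*, C))
    c = 0.5
    u1 = np.sqrt(q) * z[:, None]
    for _ in range(n_iter):
        u2 = np.sqrt(q) * (c * z[:, None] + np.sqrt(1 - c**2) * z[None, :])
        c = (sigma_w**2 * np.sum(w[:, None] * w[None, :] * phi(u1) * phi(u2))
             + sigma_b**2) / q
    # chi (eq. 6): slope of M at the fixed point, via E phi'(u1) phi'(u2)
    u2 = np.sqrt(q) * (c * z[:, None] + np.sqrt(1 - c**2) * z[None, :])
    chi = sigma_w**2 * np.sum(w[:, None] * w[None, :] * dphi(u1) * dphi(u2))
    xi = -1.0 / np.log(chi)  # eq. 7
    return q, c, chi, xi

q, c, chi, xi = chi_and_depth_scale(np.tanh, lambda u: 1 / np.cosh(u)**2,
                                    sigma_w=1.5, sigma_b=0.3)
```

At criticality one would tune $(\sigma_w, \sigma_b)$ so that the returned $\chi$ approaches 1 from below, making the depth scale $\xi$ diverge.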
The connection is obtained by relating the rate of convergence $\chi$ to the first moment of the state-to-state Jacobian

$$J = \lim_{l\to\infty} \frac{\partial \hat\alpha^{(l)}}{\partial \hat\alpha^{(l-1)}}. \quad (8)$$

Taking all the layer widths to be equal to $n$, the first moment is given by

$$m_{JJ^T} = \frac{1}{n}\,\mathbb{E}\,\mathrm{tr}\left(JJ^T\right). \quad (9)$$

Since high powers of this matrix will appear in the gradient, controlling its spectrum can prevent the gradient from exploding or vanishing. In the case of quantized activations, however, the relationship between the Jacobian and the convergence rate $\chi$ no longer holds, since the gradients vanish almost surely and modified weight update schemes such as the Straight-Through Estimator (STE) [11, 13] are used instead. However, one can define a modified Jacobian $J_{STE}$ that takes the modified update scheme into account and control its moments instead.

Figure 1: Propagation of empirical covariance between hidden states at different layers, in quantized feed-forward networks with $N = 16$, varying the standard deviation of the weights $\sigma_w$. $\Delta\theta$ is the angle between two normalized inputs. Signal propagation is maximized when $\sigma_w = \sigma_w^{opt}$, and degrades as $\sigma_w$ deviates from it.

3.2 Signal propagation may improve generalization

The argument that a network will be untrainable if signals cannot propagate from the inputs to the loss, corresponding to rapid convergence of the dynamical system in eq. 4, has empirical support across numerous architectures. A choice of initialization hyperparameters that facilitates signal propagation has also been shown to lead to slight improvements in generalization error, yet an understanding of this was beyond the scope of the existing analysis. Indeed, there is also empirical evidence that when training very deep networks only the generalization error is impacted adversely, while the training error is not [30]. 
Additionally, one may wonder whether a deep network could still be trainable despite a lack of signal propagation. On the one hand, rapid convergence of the correlation map between the pre-activations is equivalent to the distance between $f(x), f(x')$ converging to a value that is independent of the distance between $x, x'$. On the other, since deep networks can fit random inputs and labels [34], this convergence may not impede training.
To understand the effect of signal propagation on generalization, we consider the dynamics of learning for wide, deep neural networks in the setting studied in [14, 16]. We note that this setting introduces an unconventional scaling of the weights. Despite this, it should be a good approximation for the early stages of learning in networks with standard initialization, as long as the weights do not change too much from their initial values. In this regime, the function implemented by the network evolves linearly in time, with the dynamics determined by the Neural Tangent Kernel (NTK). We argue that rapid convergence of eq. 4 in deep networks implies that the error at a typical test point should not decrease during training, since the resulting form of the NTK will be independent of the label of the test point. Conversely, this effect will be mitigated with a choice of hyperparameters that maximizes signal propagation, which could explain the beneficial effect on generalization error that is observed empirically. 

^3 It will at times be convenient to consider the dynamics of the correlations of the post-activations $\hat\alpha^{(l)} = \phi(\alpha^{(l)})$, which we denote by $\hat M(\hat C)$. The rates of convergence are identical in both cases, as shown in Appendix B.
We provide details and empirical evidence in support of this claim for networks with both quantized and continuous activation functions in Appendix M.

4 Mean field theory of signal propagation with quantized activations

In this section, we explore the effects of using a quantized activation function on signal propagation in feed-forward networks. We start by developing the mean field equations for sign activations, then consider more general activation functions, and establish a theory that predicts the relationship between the number of quantization states, the initialization parameters, and the feed-forward network depth.

4.1 Warm-up: sign activations

We begin by considering signal propagation in the network in eq. 1 with $\phi(x) = \mathrm{sign}(x)$. Substituting $\phi(x) = \mathrm{sign}(x)$, $\phi'(u) = 2\delta(u)$ in eqs. 4 and 6 gives

$$Q^* = \sigma_w^2 + \sigma_b^2, \qquad \chi = 4\sigma_w^2\,\mathbb{E}_{(u_1,u_2)\sim\mathcal{N}(0,\Sigma(Q^*,C^*))}\,\delta(u_1)\delta(u_2). \quad (10)$$

As shown in Appendix C, we obtain

$$\chi = \frac{2\sigma_w^2}{\pi\left(\sigma_w^2 + \sigma_b^2\right)\sqrt{1-(C^*)^2}}, \quad (11)$$

$$M(C) = \frac{\frac{2\sigma_w^2}{\pi}\arcsin(C) + \sigma_b^2}{\sigma_w^2 + \sigma_b^2}. \quad (12)$$

The closed-form expressions 11 and 12, which are not available for more complex architectures, expose the main challenge to signal propagation. It is clear from these expressions that the derivative of $M(C)$ diverges at 1, and since $M(C)$ is differentiable and convex, it can have no stable fixed point in $[0, 1]$ that satisfies the signal propagation condition $\chi = 1$. In fact, as we show in Appendix L.1, the maximal value of $\chi$ for this architecture is achieved when $\sigma_b = 0$, and is bounded from above by $\chi_{\max} = \frac{2}{\pi}$ for all choices of the initialization hyperparameters. 
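These closed forms are easy to verify numerically; the sketch below iterates the map in eq. 12 to its stable fixed point and evaluates eq. 11 there (with $\sigma_b = 0$, so the stable fixed point is $C^* = 0$):

```python
import numpy as np

def M(c, sw2, sb2):
    # Correlation map for sign activations, eq. 12
    return (2 * sw2 / np.pi * np.arcsin(c) + sb2) / (sw2 + sb2)

def chi(c_star, sw2, sb2):
    # Fixed-point slope for sign activations, eq. 11
    return 2 * sw2 / (np.pi * (sw2 + sb2) * np.sqrt(1 - c_star**2))

sw2, sb2 = 1.0, 0.0
c = 0.9
for _ in range(200):          # iterate the map to its stable fixed point
    c = M(c, sw2, sb2)
x = chi(c, sw2, sb2)          # 2/pi, the maximal achievable slope
xi = -1 / np.log(x)           # depth scale, eq. 7: about 2.21
```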
The corresponding depth scale is bounded by $\xi_{\max} < 3$.
Additionally, one may wonder if using stochastic binary activations [13] might improve signal propagation. In Appendix D we show this is not the case: we consider a stochastic rounding quantization scheme and show that stochastic rounding can only further decrease the signal propagation depth scale.

4.2 General quantized activations

We consider a general activation function $\phi_N : \mathbb{R} \to S$, where $S$ is a finite set of real numbers of size $|S| = N$. To obtain a flexible class of non-decreasing functions of this form, we define

$$\phi_N(x) = A + \sum_{i=1}^{N-1} H(x - g_i)\,h_i, \quad (13)$$

where $A \in \mathbb{R}$, $\forall i \in \{1, 2, ..., N-1\}$, $g_i \in \mathbb{R}$, $h_i \in \mathbb{R}_{>0}$, and $H : \mathbb{R} \to \mathbb{R}$ is the Heaviside function. This activation function can be thought of as a \"stairs\" function, going from the minimum state of $A$ to the maximum state $A + \sum_{i=1}^{N-1} h_i$, over $N-1$ stairs, with stair $i$ located at an offset $g_i$ with a height $h_i$. We will assume that the offsets $g_i$ are ordered, for simplicity. 
The development of the mean field equations for this activation function is located in Appendix E, where we find that:

$$\hat Q^{(l)} = \sum_{i=1}^{N-1}\sum_{j=1}^{N-1} h_i h_j\,\Phi\left(-\frac{\max(g_i, g_j)}{\sqrt{Q^{(l)}}}\right)\Phi\left(\frac{\min(g_i, g_j)}{\sqrt{Q^{(l)}}}\right), \qquad Q^{(l+1)} = \sigma_w^2 \hat Q^{(l)} + \sigma_b^2, \quad (14)$$

$$\chi = \frac{\sigma_w^2}{2\pi Q^*\sqrt{1-(C^*)^2}} \sum_{i=1}^{N-1}\sum_{j=1}^{N-1} h_i h_j \exp\left[-\frac{g_i^2 - 2C^* g_i g_j + g_j^2}{2Q^*\left(1-(C^*)^2\right)}\right], \quad (15)$$

where $\Phi$ is the Gaussian CDF and $\hat Q^{(l)}$ is the hidden state covariance, as explained in Appendix B.
This expression diverges as $C^* \to 1$, since all the summands are non-negative and the diagonal ones simplify to $\frac{\sigma_w^2 h_i^2}{2\pi Q^*\sqrt{1-(C^*)^2}} \exp\left[-\frac{g_i^2}{Q^*(1+C^*)}\right]$. Since $M(C)$ is convex (see Lemma 1), we find that, as in the case of sign activation, $\chi = 1$ is not achievable for any choice of a quantized activation function.
To optimize the signal propagation for any given number of states, we would like to find the parameters that bring the fixed point slope $\chi$ as close as possible to 1. For simplicity, we will henceforth use the initialization $\sigma_b = 0$, which is quite common [9]. Empirical evidence in Appendix F suggests that using $\sigma_b > 0$ is sub-optimal, which is not very surprising, given our similar (exact) results for sign activation. For $\sigma_b = 0$, $C = 0$ becomes a fixed point. We eliminate the direct dependency of eq. 15 on $Q^*$ by defining normalized offsets $\tilde g \equiv \frac{g}{\sqrt{Q^*}}$. 
By moving to normalized offsets, substituting $C^* = 0$ and the remaining $Q^*$ by eq. 14, our expression for the fixed point slope becomes:

$$\chi = \frac{\frac{1}{2\pi}\sum_{i=1}^{N-1}\sum_{j=1}^{N-1} \exp\left[-\frac{1}{2}\left(\tilde g_i^2 + \tilde g_j^2\right)\right] h_i h_j}{\sum_{i=1}^{N-1}\sum_{j=1}^{N-1} \Phi\left(-\max(\tilde g_i, \tilde g_j)\right)\Phi\left(\min(\tilde g_i, \tilde g_j)\right) h_i h_j}. \quad (16)$$

Eq. 16 provides us with a way to determine the quality of any quantized activation function with regard to signal propagation, without concerning ourselves with the initialization parameters, which only have a linear effect on the offsets. Since the normalized offsets are sufficient to determine $\hat Q, Q$ using eq. 14, moving from normalized offsets to actual offsets becomes trivial.
To measure the relation between the number of states and the depth scale, we use eq. 16 over a limited set of constant-spaced activations, where we choose $A < 0$, $\forall i \in \{1, .., N-1\}$, $h_i = \mathrm{const.}$, and the offsets are evenly spaced and centered around zero, with $D$ defined as the distance between two sequential offsets, so that $g_i = D\left(i - \frac{N}{2}\right)$, and $\tilde D$ defined as $\tilde D = \frac{D}{\sqrt{Q^*}}$. We view this configuration as the most obvious selection of activation function, where the 'stairs' are evenly spaced between the minimal and the maximal state. Using eq. 16 on an activation in this set, we get:

$$\chi = \frac{\frac{1}{2\pi}\sum_{i\in K}\sum_{j\in K} \exp\left[-\frac{1}{2}\left(i^2 + j^2\right)\tilde D^2\right]}{\sum_{i\in K}\sum_{j\in K} \Phi\left(-\max(i, j)\,\tilde D\right)\Phi\left(\min(i, j)\,\tilde D\right)}, \quad (17)$$

where $K = \left\{k - \frac{N}{2} \mid k \in \mathbb{N}, k < N\right\}$. A numeric analysis of eq. 17 is presented in Figure 2, and reveals a clear logarithmic relation between the level of quantization and both the optimal fixed point slope and the normalized spacing required to reach this optimal configuration. 
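Eq. 17 can be evaluated directly; the following sketch scans the normalized spacing $\tilde D$ for a fixed quantization level (the scan grid is our choice):

```python
import math

def Phi(x):
    # standard normal CDF
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def chi_uniform(N, D):
    # Fixed-point slope of eq. 17 for an evenly spaced N-state activation:
    # normalized offsets D*(k - N/2) for k = 1, ..., N-1, equal step heights.
    g = [D * (k - N / 2.0) for k in range(1, N)]
    num = sum(math.exp(-0.5 * (gi**2 + gj**2)) for gi in g for gj in g) / (2 * math.pi)
    den = sum(Phi(-max(gi, gj)) * Phi(min(gi, gj)) for gi in g for gj in g)
    return num / den

# crude scan over D to approximate chi_max(N) and D_opt(N) for N = 16
chi_max, d_opt = max((chi_uniform(16, i * 0.01), i * 0.01) for i in range(5, 200))
```

For $N = 2$ the sum collapses to the sign-activation result $\chi = \frac{2}{\pi}$ independently of $\tilde D$, which is a useful sanity check.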
By extrapolating the numerical results, as seen in the right panels of Figure 2, we find good approximations for the maximal achievable slope at any quantization level, $\chi_{\max}(N)$, and the corresponding normalized spacing $\tilde D_{opt}(N)$. Using those extrapolated values, we predict the depth scale of a quantized feed-forward network to be:

$$\xi_N = -\frac{1}{\log(\chi_{\max}(N))} \simeq -\frac{1}{\log\left(1 - e^{0.71}\,(N+1)^{-1.82}\right)} \simeq \frac{1}{2}\,(N+1)^{1.82}, \quad (18)$$

where the latter approximation is valid for large $N$. While the depth scale in eq. 18 is applicable to uniformly spaced quantized activations, numerical results presented in Appendix G suggest that using more complex activations with the same quantization level will not produce better results.
In their work regarding mean field theory of convolutional neural networks, the authors of [30] show that the dynamics of hidden-layer correlations in CNNs decouple into independently evolving Fourier modes near the fixed point, each with a corresponding fixed-point slope of $\chi_c\lambda_i$, with $\chi_c$ depending on the initialization hyperparameters and equivalent to the fixed point slope as calculated for fully connected networks, and $\lambda_i \le 1$ being a frequency-dependent modifier corresponding to mode $i$. While the exact dynamics in this case may depend on the decomposition of the input into Fourier modes, it is apparent that the maximal depth scale of each mode cannot exceed the depth scale calculated for the fully-connected case, and thus our upper limit on the number of layers holds for the case of CNNs. Similarly, following [4] and [8], our results can be easily extended to single layer RNNs, LSTMs and GRUs, in which case the limitation applies to 
the timescale of the network memory.

Figure 2: Numerical analysis of the covariance propagation fixed point slope for quantized activation functions. Left: The convergence rate in eq. 17 of the covariances of the hidden states as a function of the normalized spacing between offsets $\tilde D$, for activations with different levels of quantization $N$. Top Right: The difference between 1 and the maximal achievable convergence rate $\chi_{\max}$ as a function of $N$. Bottom Right: The normalized spacing between states $\tilde D$ corresponding to $\chi_{\max}$ as a function of $N$. We find that the dependence of $1 - \chi_{\max}$ on $N$ is approximated well by a power law.

5 Experimental results

To visualize the covariance propagation in eq. 2 we reconstruct an experiment presented in [24], and apply it to untrained quantized neural networks. We consider a neural network with $L = 100$ fully-connected layers, all of width $n = 1000$. We draw two orthonormal vectors $u^0, u^1 \in \mathbb{R}^{1000}$ and generate the one-dimensional manifold $U = \left\{u_i = \sqrt{Q_s^*}\left(u^0\cos(\theta) + u^1\sin(\theta)\right) \mid i \in \{0, \frac{1}{r}, \frac{2}{r}, .., \frac{r-1}{r}\}, \theta = 2\pi i\right\}$, where $r = 500$ is the number of samples and $Q_s^*$ is the fixed point, calculated numerically. After initializing the neural network, we use the manifold values as inputs to the neural network and measure the covariance in all hidden layers. We then plot in Figure 1 the empirical covariance of the hidden states as a function of the difference in the angle $\theta$ of their corresponding inputs. The reason for multiplying the initial values by $\sqrt{Q_s^*}$ is so we can isolate the convergence of the off-diagonal correlations from that of the diagonal.
To test the predictions of the theory, we constructed an experiment similar to the one described in [26], training neural networks of varying depths on the MNIST dataset. 
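The visualization experiment above can be sketched as follows (sizes reduced from the paper's $n = 1000$, $r = 500$; the rounding-based quantizer and the $\sqrt{n}$ input scaling are our simplifications, standing in for the $\sqrt{Q_s^*}$ normalization):

```python
import numpy as np

rng = np.random.default_rng(0)

# Feed a circle of normalized inputs through a random quantized network and
# track the empirical covariance of the hidden states across inputs.
n, L, r, N = 500, 30, 100, 16
sw, sb = 1.0, 0.0
theta = 2 * np.pi * np.arange(r) / r
u0, u1 = np.linalg.qr(rng.standard_normal((n, 2)))[0].T        # orthonormal pair
X = np.outer(u0, np.cos(theta)) + np.outer(u1, np.sin(theta))  # n x r manifold

def phi_N(x):
    # evenly spaced N-state quantizer with states in [-1, 1]
    levels = np.linspace(-1.0, 1.0, N)
    return levels[np.clip(np.round((x + 1) * (N - 1) / 2), 0, N - 1).astype(int)]

H = np.sqrt(n) * X    # scale inputs so the per-neuron variance is O(1)
for _ in range(L):
    W = rng.standard_normal((n, n)) * sw / np.sqrt(n)
    H = W @ phi_N(H) + sb * rng.standard_normal((n, 1))
C = (H.T @ H) / n     # r x r empirical covariance between inputs
```

Plotting a row of `C` against the angle difference reproduces the qualitative shape of Figure 1: the covariance between hidden states of orthogonal inputs decays with depth while the diagonal stays finite.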
We study how the maximal trainable depth of a quantized-activation fully-connected network depends on the weight variance $\sigma_w^2$ and the number of states in the activation function $N$. For our quantized activations, we used the constant-spaced activations analyzed in section 4.2:

$$\phi_N(x) = -1 + \sum_{i=1}^{N-1} \frac{2}{N-1}\,H\left(x - \frac{2}{N-1}\left(i - \frac{N}{2}\right)\right),$$

which describes an activation function with a distance of $D = \frac{2}{N-1}$ between offsets, and with states ranging between -1 and 1.
To find the best initialization parameters for each activation function, we first used eq. 14 to compute $\hat Q^*$, assuming our normalized spacing $\frac{D}{\sqrt{Q^*}}$ is optimized ($\tilde D_{opt}$, computed using the linear regression parameters of Figure 2, bottom right panel). Then, we picked $\sigma_b = 0$, $\sigma_w = \frac{1}{\sqrt{\hat Q^*}}$, and thus ensured that the normalized offsets are indeed optimal. Gradients are computed using the

[Figure 2, fit annotations: $\log(1 - \chi_{\max}) = -1.82\log(N+1) + 0.71$, $R^2 = 0.9998$; $\log(\tilde D_{opt}) = -0.88\log(N+1) + 1.40$, $R^2 = 0.99996$.]

Figure 3: Test accuracy of feed-forward networks of different depth with quantized activation functions trained on MNIST classification after 1600 training steps, compared with the theoretical depth scale predictions (eq. 7). Up to a constant factor, the theoretical depth scale predicts the phase transition between a regime where a network is trainable and one where training fails. Left: Networks with a 10-state activation function and different values of the weight variance. 
Right: Networks with different quantization levels (number of states), with variances adjusted to allow optimal signal propagation.

Straight-Through Estimator (STE) [13]:

$$\Delta_{\mathrm{input}} = \begin{cases} \Delta_{\mathrm{output}} & |\mathrm{input}| < 1 \\ 0 & \mathrm{else} \end{cases}, \quad (19)$$

where $\Delta_{\mathrm{output}}$ is the gradient we get from the next layer and $\Delta_{\mathrm{input}}$ is the gradient we pass to the preceding layer. The conditions required for allowing the gradient information to propagate backward are discussed in Appendix J. Those conditions are not enforced in this experiment, as they have no significant effect on the results, as shown in Appendix H, where we add more results that isolate the forward pass from the backward pass. Also included in Appendix H are results that show the evolution of the training and test accuracy over training time. A simplified initialization scheme for the use of practitioners is included in Appendix I.
We set the hidden layer width to 2048. We use SGD for training, with a learning rate of $10^{-3}$ for networks with 10-90 layers, and a learning rate of $5\times10^{-4}$ when training 100-220 layers. Those parameters were selected to match those reported in [26], with the second learning rate adjusted to fit our area of search. We also use a batch size of 32, and a standard preprocessing of the MNIST input^4.
Figure 3 shows that initializing the network with the parameters suggested by our theory achieves optimal trainability when the number of layers is high. When measuring test accuracy at an early stage of training, we can see that the accuracy is high when the network has $\sim 4\xi$ layers or less. 
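The evenly spaced activation used in the experiments and the STE backward rule of eq. 19 can be sketched as follows (a numpy sketch; the rounding-based forward pass is our implementation and matches $\phi_N$ away from the step boundaries):

```python
import numpy as np

def phi_N(x, N):
    # Evenly spaced N-state activation from the experiments: states on
    # [-1, 1], step offsets at (2/(N-1)) * (i - N/2), i = 1..N-1.
    levels = np.linspace(-1.0, 1.0, N)
    idx = np.clip(np.round((x + 1.0) * (N - 1) / 2.0), 0, N - 1).astype(int)
    return levels[idx]

def ste_backward(x, grad_output):
    # Straight-Through Estimator, eq. 19: pass the incoming gradient
    # through unchanged where |input| < 1, zero it elsewhere.
    return np.where(np.abs(x) < 1.0, grad_output, 0.0)

x = np.array([-2.0, -0.6, -0.1, 0.1, 0.6, 2.0])
y = phi_N(x, 3)                       # three states: -1, 0, 1
g = ste_backward(x, np.ones_like(x))  # gradient is masked outside (-1, 1)
```

In a framework with automatic differentiation the same rule is implemented by overriding the backward pass of the quantizer, so that the Heaviside steps (whose true derivative is zero almost everywhere) do not kill the gradient.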
As demonstrated by the advanced training stage results shown in Appendix H, and by the results of [26], networks of depth exceeding $\sim 6\xi$ appear to be untrainable.

^4 The code for running this experiment and more is provided in https://github.com/yanivbl6/quantized_meanfield.

6 Discussion

In this paper, we study the effect of using quantized activations on the propagation of signals in deep neural networks, from the inputs to the outputs. We focus on quantized activations, which map their input to a finite set of $N$ possible outputs. Our analysis suggests an initialization scheme that improves network trainability, and shows that fully-connected and convolutional networks become untrainable when the number of layers exceeds $L_{\max} \sim 3\,(N+1)^{1.82}$.
Additionally, we propose a possible explanation for the improved generalization observed when training networks that are initialized to enable stable signal propagation. While the motivation for the critical initialization has been improved trainability [26], empirically these initialization schemes were shown to improve generalization as well, an observation that was beyond the scope of the analysis which motivated them. By considering the dynamics of learning in wide networks that exhibit poor signal propagation, we find that generalization error in the early stages of training will typically not improve. This effect will be minimized when using a critical initialization.
The limitations of poor signal propagation can perhaps be overcome with certain modifications to the architecture or training procedure. Residual connections, for example, can be initialized [35] to maintain the signal propagation conditions even when the full-network depth exceeds our theoretical limit [31]. 
Another possible modification is batch normalization, which we did not consider in our analysis. While batch normalization by itself has been shown to have negative side effects on signal propagation [32], other studies [3, 6, 13] have suggested that applying proper batch normalization is key when training quantized feed-forward networks. There are, however, cases where batch normalization does not work well, such as recurrent neural networks. We expect our findings to have increased significance if generalized to such architectures, as was done previously for continuous activations [4, 8].

Acknowledgements

The work of DS was supported by the Israel Science Foundation (grant No. 31/1031), the Taub Foundation, and used a Titan Xp donated by the NVIDIA Corporation. The work of DG was supported by the NSF NeuroNex Award DBI-1707398 and the Gatsby Charitable Foundation. The work of DG and DS was done in part while the authors were visiting the Simons Institute for the Theory of Computing.

References

[1] Alexander G. Anderson and Cory P. Berg. The high-dimensional geometry of binary neural networks. In International Conference on Learning Representations, 2018.

[2] Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, Ruslan Salakhutdinov, and Ruosong Wang. On exact computation with an infinitely wide neural net. arXiv preprint arXiv:1904.11955, 2019.

[3] Ron Banner, Itay Hubara, Elad Hoffer, and Daniel Soudry. Scalable methods for 8-bit training of neural networks. In Advances in Neural Information Processing Systems, pages 5145–5153, 2018.

[4] Minmin Chen, Jeffrey Pennington, and Samuel S Schoenholz. Dynamical isometry and a mean field theory of RNNs: Gating enables signal propagation in recurrent neural networks. arXiv preprint arXiv:1806.05394, 2018.

[5] Wenlin Chen, James Wilson, Stephen Tyree, Kilian Weinberger, and Yixin Chen. Compressing neural networks with the hashing trick.
In International Conference on Machine Learning, pages 2285–2294, 2015.

[6] Matthieu Courbariaux, Itay Hubara, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks. In Advances in Neural Information Processing Systems, 2016.

[7] Dipankar Das, Naveen Mellempudi, Dheevatsa Mudigere, Dhiraj Kalamkar, Sasikanth Avancha, Kunal Banerjee, Srinivas Sridharan, Karthik Vaidyanathan, Bharat Kaul, Evangelos Georganas, et al. Mixed precision training of convolutional neural networks using integer operations. arXiv preprint arXiv:1802.00930, 2018.

[8] Dar Gilboa, Bo Chang, Minmin Chen, Greg Yang, Samuel S Schoenholz, Ed H Chi, and Jeffrey Pennington. Dynamical isometry and a mean field theory of LSTMs and GRUs. arXiv preprint arXiv:1901.08987, 2019.

[9] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249–256, 2010.

[10] Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. Deep learning with limited numerical precision. In International Conference on Machine Learning, pages 1737–1746, 2015.

[11] Geoffrey Hinton. Neural networks for machine learning. Coursera [video lectures], 2012.

[12] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.

[13] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. The Journal of Machine Learning Research, 18(1):6869–6898, 2017.

[14] Arthur Jacot, Franck Gabriel, and Clément Hongler.
Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems, 2018.

[15] Jaehoon Lee, Yasaman Bahri, Roman Novak, Samuel S Schoenholz, Jeffrey Pennington, and Jascha Sohl-Dickstein. Deep neural networks as Gaussian processes. arXiv preprint arXiv:1711.00165, 2017.

[16] Jaehoon Lee, Lechao Xiao, Samuel S Schoenholz, Yasaman Bahri, Jascha Sohl-Dickstein, and Jeffrey Pennington. Wide neural networks of any depth evolve as linear models under gradient descent. arXiv preprint arXiv:1902.06720, 2019.

[17] Hao Li, Soham De, Zheng Xu, Christoph Studer, Hanan Samet, and Tom Goldstein. Training quantized nets: A deeper understanding. In Advances in Neural Information Processing Systems, 2017.

[18] Xiaofan Lin, Cong Zhao, and Wei Pan. Towards accurate binary convolutional neural network. In Advances in Neural Information Processing Systems, pages 345–353, 2017.

[19] Wolfgang Maass and Pekka Orponen. On the effect of analog noise in discrete-time analog computations. Neural Computation, 10(5):1071–1095, 1998.

[20] Alexander G de G Matthews, Mark Rowland, Jiri Hron, Richard E Turner, and Zoubin Ghahramani. Gaussian process behaviour in wide deep neural networks. arXiv preprint arXiv:1804.11271, 2018.

[21] Asit Mishra, Eriko Nurvitadhi, Jeffrey J Cook, and Debbie Marr. WRPN: Wide reduced-precision networks. arXiv preprint arXiv:1709.01134, 2017.

[22] Daisuke Miyashita, Edward H Lee, and Boris Murmann. Convolutional neural networks using logarithmic data representation. arXiv preprint arXiv:1603.01025, 2016.

[23] Jeffrey Pennington, Samuel Schoenholz, and Surya Ganguli. Resurrecting the sigmoid in deep learning through dynamical isometry: Theory and practice. In Advances in Neural Information Processing Systems, pages 4785–4795, 2017.

[24] Ben Poole, Subhaneil Lahiri, Maithra Raghu, Jascha Sohl-Dickstein, and Surya Ganguli. Exponential expressivity in deep neural networks through transient chaos.
In Advances in Neural Information Processing Systems, pages 3360–3368, 2016.

[25] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. In European Conference on Computer Vision, pages 525–542. Springer, 2016.

[26] Samuel S Schoenholz, Justin Gilmer, Surya Ganguli, and Jascha Sohl-Dickstein. Deep information propagation. arXiv preprint arXiv:1611.01232, 2016.

[27] Hava T. Siegelmann and Eduardo D. Sontag. Turing computability with neural nets. Applied Mathematics Letters, 4(6):77–80, 1991.

[28] Naigang Wang, Jungwook Choi, Daniel Brand, Chia-Yu Chen, and Kailash Gopalakrishnan. Training deep neural networks with 8-bit floating point numbers. In Advances in Neural Information Processing Systems, pages 7675–7684, 2018.

[29] Anqi Wu, Sebastian Nowozin, Edward Meeds, Richard E. Turner, Jose Miguel Hernandez-Lobato, and Alexander L. Gaunt. Deterministic variational inference for robust Bayesian neural networks. In International Conference on Learning Representations, 2019.

[30] Lechao Xiao, Yasaman Bahri, Jascha Sohl-Dickstein, Samuel S Schoenholz, and Jeffrey Pennington. Dynamical isometry and a mean field theory of CNNs: How to train 10,000-layer vanilla convolutional neural networks. arXiv preprint arXiv:1806.05393, 2018.

[31] Ge Yang and Samuel Schoenholz. Mean field residual networks: On the edge of chaos. In Advances in Neural Information Processing Systems, pages 7103–7114, 2017.

[32] Greg Yang, Jeffrey Pennington, Vinay Rao, Jascha Sohl-Dickstein, and Samuel S Schoenholz. A mean field theory of batch normalization. arXiv preprint arXiv:1902.08129, 2019.

[33] Penghang Yin, Jiancheng Lyu, Shuai Zhang, Stanley Osher, Yingyong Qi, and Jack Xin. Understanding straight-through estimator in training activation quantized neural nets.
In International Conference on Learning Representations, 2019.

[34] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.

[35] Hongyi Zhang, Yann N Dauphin, and Tengyu Ma. Fixup initialization: Residual learning without normalization. arXiv preprint arXiv:1901.09321, 2019.

[36] Yiren Zhou, Seyed-Mohsen Moosavi-Dezfooli, Ngai-Man Cheung, and Pascal Frossard. Adaptive quantization for deep neural network. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.