{"title": "Training Quantized Nets: A Deeper Understanding", "book": "Advances in Neural Information Processing Systems", "page_first": 5811, "page_last": 5821, "abstract": "Currently, deep neural networks are deployed on low-power portable devices by first training a full-precision model using powerful hardware, and then deriving a corresponding low-precision model for efficient inference on such systems. However, training models directly with coarsely quantized weights is a key step towards learning on embedded platforms that have limited computing resources, memory capacity, and power consumption.  Numerous recent publications have studied methods for training quantized networks, but these studies have mostly been empirical. In this work, we investigate training methods for quantized neural networks from a theoretical viewpoint.  We first explore accuracy guarantees for training methods under convexity assumptions.  We then look at the behavior of these algorithms for non-convex problems, and show that training algorithms that exploit high-precision representations have an important greedy search phase that purely quantized training methods lack, which explains the difficulty of training using low-precision arithmetic.", "full_text": "Training Quantized Nets: A Deeper Understanding\n\nHao Li1\u2217, Soham De1\u2217, Zheng Xu1, Christoph Studer2, Hanan Samet1, Tom Goldstein1\n\n{haoli,sohamde,xuzh,hjs,tomg}@cs.umd.edu, studer@cornell.edu\n\n1Department of Computer Science, University of Maryland, College Park\n\n2School of Electrical and Computer Engineering, Cornell University\n\nAbstract\n\nCurrently, deep neural networks are deployed on low-power portable devices by \ufb01rst training\na full-precision model using powerful hardware, and then deriving a corresponding low-\nprecision model for ef\ufb01cient inference on such systems. 
However, training models directly\nwith coarsely quantized weights is a key step towards learning on embedded platforms that\nhave limited computing resources, memory capacity, and power consumption. Numerous\nrecent publications have studied methods for training quantized networks, but these studies\nhave mostly been empirical. In this work, we investigate training methods for quantized neu-\nral networks from a theoretical viewpoint. We \ufb01rst explore accuracy guarantees for training\nmethods under convexity assumptions. We then look at the behavior of these algorithms for\nnon-convex problems, and show that training algorithms that exploit high-precision repre-\nsentations have an important greedy search phase that purely quantized training methods\nlack, which explains the dif\ufb01culty of training using low-precision arithmetic.\n\n1\n\nIntroduction\n\nDeep neural networks are an integral part of state-of-the-art computer vision and natural language\nprocessing systems. Because of their high memory requirements and computational complexity,\nnetworks are usually trained using powerful hardware. There is an increasing interest in training\nand deploying neural networks directly on battery-powered devices, such as cell phones or other\nplatforms. Such low-power embedded systems are memory and power limited, and in some cases\nlack basic support for \ufb02oating-point arithmetic.\nTo make neural nets practical on embedded systems, many researchers have focused on training nets\nwith coarsely quantized weights. For example, weights may be constrained to take on integer/binary\nvalues, or may be represented using low-precision (8 bits or less) \ufb01xed-point numbers. Quantized nets\noffer the potential of superior memory and computation ef\ufb01ciency, while achieving performance that\nis competitive with state-of-the-art high-precision nets. 
Quantized weights can dramatically reduce\nmemory size and access bandwidth, increase power ef\ufb01ciency, exploit hardware-friendly bitwise\noperations, and accelerate inference throughput [1\u20133].\nHandling low-precision weights is dif\ufb01cult and motivates interest in new training methods. When\nlearning rates are small, stochastic gradient methods make small updates to weight parameters.\nBinarization/discretization of weights after each training iteration \u201crounds off\u201d these small updates\nand causes training to stagnate [1]. Thus, the na\u00efve approach of quantizing weights using a rounding\nprocedure yields poor results when weights are represented using a small number of bits. Other\napproaches include classical stochastic rounding methods [4], as well as schemes that combine\nfull-precision \ufb02oating-point weights with discrete rounding procedures [5]. While some of these\nschemes seem to work in practice, results in this area are largely experimental, and little work has\nbeen devoted to explaining the excellent performance of some methods, the poor performance of\nothers, and the important differences in behavior between these methods.\n\n\u2217Equal contribution. Author ordering determined by a cryptographically secure random number generator.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fContributions This paper studies quantized training methods from a theoretical perspective, with\nthe goal of understanding the differences in behavior, and reasons for success or failure, of various\nmethods. In particular, we present a convergence analysis showing that classical stochastic rounding\n(SR) methods [4] as well as newer and more powerful methods like BinaryConnect (BC) [5] are\ncapable of solving convex discrete problems up to a level of accuracy that depends on the quantization\nlevel. 
We then address the issue of why algorithms that maintain \ufb02oating-point representations, like\nBC, work so well, while fully quantized training methods like SR stall before training is complete.\nWe show that the long-term behavior of BC has an important annealing property that is needed for\nnon-convex optimization, while classical rounding methods lack this property.\n\n2 Background and Related Work\n\nThe arithmetic operations of deep networks can be truncated down to 8-bit \ufb01xed-point without\nsigni\ufb01cant deterioration in inference performance [4, 6\u20139]. The most extreme scenario of quantization\nis binarization, in which only 1-bit (two states) is used for weight representation [10, 5, 1, 3, 11, 12].\nPrevious work on obtaining a quantized neural network can be divided into two categories: quantizing\npre-trained models with or without retraining [7, 13, 6, 14, 15], and training a quantized model from\nscratch [4, 5, 3, 1, 16]. We focus on approaches that belong to the second category, as they can be\nused for both training and inference under constrained resources.\nFor training quantized NNs from scratch, many authors suggest maintaining a high-precision \ufb02oating\npoint copy of the weights while feeding quantized weights into backprop [5, 11, 3, 16], which results\nin good empirical performance. There are limitations in using such methods on low-power devices,\nhowever, where \ufb02oating-point arithmetic is not always available or not desirable. Another widely\nused solution using only low-precision weights is stochastic rounding [17, 4]. Experiments show\nthat networks using 16-bit \ufb01xed-point representations with stochastic rounding can deliver results\nnearly identical to 32-bit \ufb02oating-point computations [4], while lowering the precision down to 3-bit\n\ufb01xed-point often results in a signi\ufb01cant performance degradation [18]. Bayesian learning has also\nbeen applied to train binary networks [19, 20]. 
A more comprehensive review can be found in [3].\n\n3 Training Quantized Neural Nets\n\nWe consider empirical risk minimization problems of the form:\n\n$$\\min_{w \\in W} F(w) := \\frac{1}{m} \\sum_{i=1}^{m} f_i(w), \\tag{1}$$\n\nwhere the objective function decomposes into a sum over many functions $f_i : \\mathbb{R}^d \\to \\mathbb{R}$. Neural\nnetworks have objective functions of this form where each $f_i$ is a non-convex loss function. When\n\ufb02oating-point representations are available, the standard method for training neural networks is\nstochastic gradient descent (SGD), which on each iteration selects a function $\\tilde{f}$ randomly from\n$\\{f_1, f_2, \\ldots, f_m\\}$, and then computes\n\n$$\\text{SGD:} \\quad w^{t+1} = w^t - \\alpha_t \\nabla \\tilde{f}(w^t), \\tag{2}$$\n\nfor some learning rate $\\alpha_t$. In this paper, we consider the problem of training convolutional neural\nnetworks (CNNs). Convolutions are computationally expensive; low precision weights can be used\nto accelerate them by replacing expensive multiplications with ef\ufb01cient addition and subtraction\noperations [3, 9] or bitwise operations [11, 16].\nTo train networks using a low-precision representation of the weights, a quantization function $Q(\\cdot)$\nis needed to convert a real-valued number w into a quantized/rounded version $\\hat{w} = Q(w)$. We use\nthe same notation for quantizing vectors, where we assume Q acts on each dimension of the vector.\nDifferent quantized optimization routines can be de\ufb01ned by selecting different quantizers, and also\nby selecting when quantization happens during optimization. The common options are:\n\nDeterministic Rounding (R) A basic uniform or deterministic quantization function snaps a\n\ufb02oating point value to the closest quantized value as:\n\n$$Q_d(w) = \\mathrm{sign}(w) \\cdot \\Delta \\cdot \\left\\lfloor \\frac{|w|}{\\Delta} + \\frac{1}{2} \\right\\rfloor, \\tag{3}$$\n\nwhere \u2206 denotes the quantization step or resolution, i.e., the smallest positive number that is\nrepresentable. 
One exception to this de\ufb01nition is when we consider binary weights, where all weights\nare constrained to have two values $w \\in \\{-1, 1\\}$ and uniform rounding becomes $Q_d(w) = \\mathrm{sign}(w)$.\nThe deterministic rounding SGD maintains quantized weights with updates of the form:\n\n$$\\text{Deterministic Rounding:} \\quad w_b^{t+1} = Q_d\\big(w_b^t - \\alpha_t \\nabla \\tilde{f}(w_b^t)\\big), \\tag{4}$$\n\nwhere $w_b$ denotes the low-precision weights, which are quantized using $Q_d$ immediately after applying\nthe gradient descent update. If gradient updates are signi\ufb01cantly smaller than the quantization step,\nthis method loses gradient information and weights may never be modi\ufb01ed from their starting values.\n\nStochastic Rounding (SR) The quantization function for stochastic rounding is de\ufb01ned as:\n\n$$Q_s(w) = \\Delta \\cdot \\begin{cases} \\lfloor \\frac{w}{\\Delta} \\rfloor + 1 & \\text{for } p \\le \\frac{w}{\\Delta} - \\lfloor \\frac{w}{\\Delta} \\rfloor, \\\\ \\lfloor \\frac{w}{\\Delta} \\rfloor & \\text{otherwise,} \\end{cases} \\tag{5}$$\n\nwhere $p \\in [0, 1]$ is produced by a uniform random number generator. This operator is non-deterministic,\nand rounds its argument up with probability $w/\\Delta - \\lfloor w/\\Delta \\rfloor$, and down otherwise.\nThis quantizer satis\ufb01es the important property $E[Q_s(w)] = w$. Similar to the deterministic rounding\nmethod, the SR optimization method also maintains quantized weights with updates of the form:\n\n$$\\text{Stochastic Rounding:} \\quad w_b^{t+1} = Q_s\\big(w_b^t - \\alpha_t \\nabla \\tilde{f}(w_b^t)\\big). \\tag{6}$$\n\nBinaryConnect (BC) The BinaryConnect algorithm [5] accumulates gradient updates using a\nfull-precision buffer $w_r$, and quantizes weights just before gradient computations as follows.\n\n$$\\text{BinaryConnect:} \\quad w_r^{t+1} = w_r^t - \\alpha_t \\nabla \\tilde{f}\\big(Q(w_r^t)\\big). \\tag{7}$$\n\nEither stochastic rounding $Q_s$ or deterministic rounding $Q_d$ can be used for quantizing the weights\n$w_r$, but in practice, $Q_d$ is the common choice. The original BinaryConnect paper constrains the\nlow-precision weights to be $\\{-1, 1\\}$, which can be generalized to $\\{-\\Delta, \\Delta\\}$. A more recent method,\nBinary-Weights-Net (BWN) [3], allows different \ufb01lters to have different scales for quantization,\nwhich often results in better performance on large datasets.\n\nNotation For the rest of the paper, we use Q to denote both $Q_s$ and $Q_d$ unless the situation requires\nthis to be distinguished. We also drop the subscripts on $w_r$ and $w_b$, and simply write w.\n\n4 Convergence Analysis\n\nWe now present convergence guarantees for the Stochastic Rounding (SR) and BinaryConnect\n(BC) algorithms, with updates of the form (6) and (7), respectively. For the purposes of deriving\ntheoretical guarantees, we assume each $f_i$ in (1) is differentiable and convex, and the domain\nW is convex and has dimension d. We consider both the case where F is $\\mu$-strongly convex:\n$\\langle \\nabla F(w'), w - w' \\rangle \\le F(w) - F(w') - \\frac{\\mu}{2}\\|w - w'\\|^2$, as well as where F is weakly convex. We also\nassume the (stochastic) gradients are bounded: $E\\|\\nabla \\tilde{f}(w^t)\\|^2 \\le G^2$. Some results below also assume\nthe domain of the problem is \ufb01nite. In this case, the rounding algorithm clips values that leave the\ndomain. For example, in the binary case, rounding returns bounded values in $\\{-1, 1\\}$.\n\n4.1 Convergence of Stochastic Rounding (SR)\n\nWe can rewrite the update rule (6) as:\n\n$$w^{t+1} = w^t - \\alpha_t \\nabla \\tilde{f}(w^t) + r^t,$$\n\nwhere $r^t = Q_s(w^t - \\alpha_t \\nabla \\tilde{f}(w^t)) - w^t + \\alpha_t \\nabla \\tilde{f}(w^t)$ denotes the quantization error on the t-th\niteration. We want to bound this error in expectation. To this end, we present the following lemma.\nLemma 1. 
The stochastic rounding error $r^t$ on each iteration can be bounded, in expectation, as:\n\n$$E\\|r^t\\|^2 \\le \\sqrt{d}\\,\\Delta\\,\\alpha_t G,$$\n\nwhere d denotes the dimension of w.\n\nProofs for all theoretical results are presented in the Appendices. From Lemma 1, we see that\nthe rounding error per step decreases as the learning rate $\\alpha_t$ decreases. This is intuitive since the\nprobability of an entry in $w^{t+1}$ differing from $w^t$ is small when the gradient update is small relative\nto \u2206. Using the above lemma, we now present convergence rate results for Stochastic Rounding (SR)\nin both the strongly-convex case and the non-strongly convex case. Our error estimates are ergodic,\ni.e., they are in terms of $\\bar{w}^T = \\frac{1}{T} \\sum_{t=1}^{T} w^t$, the average of the iterates.\nTheorem 1. Assume that F is $\\mu$-strongly convex and the learning rates are given by $\\alpha_t = \\frac{1}{\\mu(t+1)}$.\nConsider the SR algorithm with updates of the form (6). Then, we have:\n\n$$E[F(\\bar{w}^T) - F(w^\\star)] \\le \\frac{(1 + \\log(T+1))\\,G^2}{2\\mu T} + \\frac{\\sqrt{d}\\,\\Delta G}{2},$$\n\nwhere $w^\\star = \\arg\\min_w F(w)$.\nTheorem 2. Assume the domain has \ufb01nite diameter D, and learning rates are given by $\\alpha_t = \\frac{c}{\\sqrt{t}}$, for\na constant c. Consider the SR algorithm with updates of the form (6). Then, we have:\n\n$$E[F(\\bar{w}^T) - F(w^\\star)] \\le \\frac{1}{c\\sqrt{T}} D^2 + \\frac{\\sqrt{T+1}}{2T}\\, c\\,G^2 + \\frac{\\sqrt{d}\\,\\Delta G}{2}.$$\n\nWe see that in both cases, SR converges until it reaches an \u201caccuracy \ufb02oor.\u201d As the quantization\nbecomes more \ufb01ne grained, our theory predicts that the accuracy of SR approaches that of high-precision\n\ufb02oating point at a rate linear in \u2206. 
This extra term caused by the discretization is unavoidable\nsince this method maintains quantized weights.\n\n4.2 Convergence of Binary Connect (BC)\n\nWhen analyzing the BC algorithm, we assume that the Hessian satis\ufb01es the Lipschitz bound:\n$\\|\\nabla^2 f_i(x) - \\nabla^2 f_i(y)\\| \\le L_2 \\|x - y\\|$ for some $L_2 \\ge 0$. While this is a slightly non-standard\nassumption, we will see that it enables us to gain better insights into the behavior of the algorithm.\nThe results here hold for both stochastic and uniform rounding. In this case, the quantization error r\ndoes not approach 0 as in SR-SGD. Nonetheless, the effect of this rounding error diminishes with\nshrinking $\\alpha_t$ because $\\alpha_t$ multiplies the gradient update, and thus implicitly the rounding error as well.\nTheorem 3. Assume F is L-Lipschitz smooth, the domain has \ufb01nite diameter D, and learning rates\nare given by $\\alpha_t = \\frac{c}{\\sqrt{t}}$. Consider the BC-SGD algorithm with updates of the form (7). Then, we have:\n\n$$E[F(\\bar{w}^T) - F(w^\\star)] \\le \\frac{1}{2c\\sqrt{T}} D^2 + \\frac{\\sqrt{T+1}}{2T}\\, c\\,G^2 + \\sqrt{d}\\,\\Delta L D.$$\n\nAs with SR, BC can only converge up to an error \ufb02oor. So far this looks a lot like the convergence\nguarantees for SR. However, things change when we assume strong convexity and bounded Hessian.\nTheorem 4. Assume that F is $\\mu$-strongly convex and the learning rates are given by $\\alpha_t = \\frac{1}{\\mu(t+1)}$.\nConsider the BC algorithm with updates of the form (7). Then we have:\n\n$$E[F(\\bar{w}^T) - F(w^\\star)] \\le \\frac{(1 + \\log(T+1))\\,G^2}{2\\mu T} + \\frac{D L_2 \\sqrt{d}\\,\\Delta}{2}.$$\n\nNow, the error \ufb02oor is determined by both \u2206 and $L_2$. For a quadratic least-squares problem, the\ngradient of F is linear and the Hessian is constant. Thus, $L_2 = 0$ and we get the following corollary.\nCorollary 1. Assume that F is quadratic and the learning rates are given by $\\alpha_t = \\frac{1}{\\mu(t+1)}$. 
The BC\nalgorithm with updates of the form (7) yields\n\n$$E[F(\\bar{w}^T) - F(w^\\star)] \\le \\frac{(1 + \\log(T+1))\\,G^2}{2\\mu T}.$$\n\nWe see that the real-valued weights accumulated in BC can converge to the true minimizer of quadratic\nlosses. Furthermore, this suggests that, when the function behaves like a quadratic on the distance\nscale \u2206, one would expect BC to perform fundamentally better than SR. While this may seem\nlike a restrictive condition, there is evidence that even non-convex neural networks become well\napproximated as a quadratic in the later stages of optimization within a neighborhood of a local\nminimum [21].\n\nFigure 1: The SR method starts at some location x (in this case 0), adds a perturbation to x, and then rounds.\nAs the learning rate \u03b1 gets smaller, the distribution of the perturbation gets \u201csquished\u201d near the origin, making\nthe algorithm less likely to move. The \u201csquishing\u201d effect is the same for the part of the distribution lying to the\nleft and to the right of x, and so it does not affect the relative probability of moving left or right.\n\nNote, our convergence results on BC are for $w_r$ instead of $w_b$, and these measures of convergence are\nnot directly comparable. It is not possible to bound $w_b$ when BC is used, as the values of $w_b$ may\nnot converge in the usual sense (e.g., in the +/-1 binary case $w_r$ might converge to 0, in which case\narbitrarily small perturbations to $w_r$ might send $w_b$ to +1 or -1).\n\n5 What About Non-Convex Problems?\n\nThe global convergence results presented above for convex problems show that, in general, both\nthe SR and BC algorithms converge to within $O(\\Delta)$ accuracy of the minimizer (in expected value).\nHowever, these results do not explain the large differences between these methods when applied to\nnon-convex neural nets. We now study how the long-term behavior of SR differs from BC. 
Note\nthat this section makes no convexity assumptions, and the proposed theoretical results are directly\napplicable to neural networks.\nTypical (continuous-valued) SGD methods have an important exploration-exploitation tradeoff. When\nthe learning rate is large, the algorithm explores by moving quickly between states. Exploitation\nhappens when the learning rate is small. In this case, noise averaging causes the algorithm to more\ngreedily pursue local minimizers with lower loss values. Thus, the distribution of iterates produced\nby the algorithm becomes increasingly concentrated near minimizers as the learning rate vanishes\n(see, e.g., the large-deviation estimates in [22]). BC maintains this property as well; indeed, we saw\nin Corollary 1 a class of problems for which the iterates concentrate on the minimizer for small $\\alpha_t$.\nIn this section, we show that the SR method lacks this important tradeoff: as the stepsize gets small\nand the algorithm slows down, the quality of the iterates produced by the algorithm does not improve,\nand the algorithm does not become progressively more likely to produce low-loss iterates. This\nbehavior is illustrated in Figures 1 and 2.\nTo understand this problem conceptually, consider the simple case of a one-variable optimization\nproblem starting at $x^0 = 0$ with $\\Delta = 1$ (Figure 1). On each iteration, the algorithm computes a\nstochastic approximation $\\nabla \\tilde{f}$ of the gradient by sampling from a distribution, which we call p. This\ngradient is then multiplied by the stepsize to get $\\alpha \\nabla \\tilde{f}$. The probability of moving to the right (or\nleft) is then roughly proportional to the magnitude of $\\alpha \\nabla \\tilde{f}$. Note the random variable $\\alpha \\nabla \\tilde{f}$ has\ndistribution $p_\\alpha(z) = \\alpha^{-1} p(z/\\alpha)$.\nNow, suppose that \u03b1 is small enough that we can neglect the tails of $p_\\alpha(z)$ that lie outside the interval\n$[-1, 1]$. 
The probability of transitioning from $x^0 = 0$ to $x^1 = 1$ using stochastic rounding, denoted\nby $T_\\alpha(0, 1)$, is then\n\n$$T_\\alpha(0, 1) \\approx \\int_0^1 z\\, p_\\alpha(z)\\, dz = \\frac{1}{\\alpha} \\int_0^1 z\\, p(z/\\alpha)\\, dz = \\alpha \\int_0^{1/\\alpha} p(x)\\, x\\, dx \\approx \\alpha \\int_0^{\\infty} p(x)\\, x\\, dx,$$\n\nwhere the \ufb01rst approximation is because we neglected the unlikely case that $\\alpha \\nabla \\tilde{f} > 1$, and the\nsecond approximation appears because we added a small tail probability to the estimate. These\napproximations get more accurate for small \u03b1. We see that, assuming the tails of p are \u201clight\u201d enough,\nwe have $T_\\alpha(0, 1) \\sim \\alpha \\int_0^{\\infty} p(x)\\, x\\, dx$ as $\\alpha \\to 0$. Similarly, $T_\\alpha(0, -1) \\sim \\alpha \\int_{-\\infty}^{0} p(x)\\, x\\, dx$ as $\\alpha \\to 0$.\n\nFigure 2: Effect of shrinking the learning rate in SR vs BC on a toy problem. The left \ufb01gure plots the objective\nfunction (8). Histograms plot the distribution of the quantized weights over $10^6$ iterations. The top row of plots\ncorresponds to BC, while the bottom row is SR, for different learning rates \u03b1. As the learning rate \u03b1 shrinks, the\nBC distribution concentrates on a minimizer, while the SR distribution stagnates. Panels: (a) \u03b1 = 1.0, (b) \u03b1 = 0.1, (c) \u03b1 = 0.01, (d) \u03b1 = 0.001.\n\nWhat does this observation mean for the behavior of SR? First of all, the probability of leaving $x^0$ on\nan iteration is\n\n$$T_\\alpha(0, -1) + T_\\alpha(0, 1) \\approx \\alpha \\left[ \\int_0^{\\infty} p(x)\\, x\\, dx + \\int_{-\\infty}^{0} p(x)\\, x\\, dx \\right],$$\n\nwhich vanishes for small \u03b1. This means the algorithm slows down as the learning rate drops off,\nwhich is not surprising. However, the conditional probability of ending up at $x^1 = 1$ given that the\nalgorithm did leave $x^0$ is\n\n$$T_\\alpha(0, 1 \\mid x^1 \\neq x^0) \\approx \\frac{T_\\alpha(0, 1)}{T_\\alpha(0, -1) + T_\\alpha(0, 1)} = \\frac{\\int_0^{\\infty} p(x)\\, x\\, dx}{\\int_{-\\infty}^{0} p(x)\\, x\\, dx + \\int_0^{\\infty} p(x)\\, x\\, dx},$$\n\nwhich does not depend on \u03b1. In other words, provided \u03b1 is small, SR, on average, makes the same\ndecisions/transitions with learning rate \u03b1 as it does with learning rate \u03b1/10; it just takes 10 times\nlonger to make those decisions when \u03b1/10 is used. In this situation, there is no exploitation bene\ufb01t in\ndecreasing \u03b1.\n\n5.1 Toy Problem\n\nTo gain more intuition about the effect of shrinking the learning rate in SR vs BC, consider the\nfollowing simple 1-dimensional non-convex problem:\n\n$$\\min_w f(w) := \\begin{cases} w^2 + 2, & \\text{if } w < 1, \\\\ (w - 2.5)^2 + 0.75, & \\text{if } 1 \\le w < 3.5, \\\\ (w - 4.75)^2 + 0.19, & \\text{if } w \\ge 3.5. \\end{cases} \\tag{8}$$\n\nFigure 2 shows a plot of this loss function. To visualize the distribution of iterates, we initialize at\nw = 4.0, and run SR and BC for $10^6$ iterations using a quantization resolution of 0.5.\nFigure 2 shows the distribution of the quantized weight parameters w over the iterations when\noptimized with SR and BC for different learning rates \u03b1. As we shift from \u03b1 = 1 to \u03b1 = 0.001, the\ndistribution of BC iterates transitions from a wide/explorative distribution to a narrow distribution\nin which iterates aggressively concentrate on the minimizer. In contrast, the distribution produced\nby SR concentrates only slightly and then stagnates; the iterates are spread widely even when the\nlearning rate is small.\n\n5.2 Asymptotic Analysis of Stochastic Rounding\n\nThe above argument is intuitive, but also informal. To make these statements rigorous, we interpret\nthe SR method as a Markov chain. 
On each iteration, SR starts at some state (iterate) x, and moves to\na new state y with some transition probability $T_\\alpha(x, y)$ that depends only on x and the learning rate\n\u03b1. For \ufb01xed \u03b1, this is clearly a Markov process with transition matrix\u00b2 $T_\\alpha(x, y)$.\nThe long-term behavior of this Markov process is determined by the stationary distribution of\n$T_\\alpha(x, y)$. We show below that for small \u03b1, the stationary distribution of $T_\\alpha(x, y)$ is nearly invariant\nto \u03b1, and thus decreasing \u03b1 below some threshold has virtually no effect on the long term behavior of\nthe method. This happens because, as \u03b1 shrinks, the relative transition probabilities remain the same\n(conditioned on the fact that the parameters change), even though the absolute probabilities decrease\n(see Figure 3). In this case, there is no exploitation bene\ufb01t to decreasing \u03b1.\n\nFigure 3: Markov chain example with 3 states. In the right \ufb01gure, we halved each transition probability for\nmoving between states, with the remaining probability put on the self-loop. Notice that halving all the transition\nprobabilities would not change the equilibrium distribution, and instead would only increase the mixing time of\nthe Markov chain.\n\nTheorem 5. Let $p_{x,k}$ denote the probability distribution of the kth entry in $\\nabla \\tilde{f}(x)$, the stochastic\ngradient estimate at x. Assume there is a constant $C_1$ such that for all x, k, and $\\nu$ we have\n$\\int_{\\nu}^{\\infty} p_{x,k}(z)\\, dz \\le \\frac{C_1}{\\nu^2}$, and some $C_2$ such that both $\\int_0^{C_2} p_{x,k}(z)\\, dz > 0$ and $\\int_{-C_2}^{0} p_{x,k}(z)\\, dz > 0$.\nDe\ufb01ne the matrix\n\n$$\\tilde{U}(x, y) = \\begin{cases} \\int_0^{\\infty} p_{x,k}(z)\\, \\frac{z}{\\Delta}\\, dz, & \\text{if } x \\text{ and } y \\text{ differ only at coordinate } k, \\text{ and } y_k = x_k + \\Delta, \\\\ \\int_{-\\infty}^{0} p_{x,k}(z)\\, \\frac{z}{\\Delta}\\, dz, & \\text{if } x \\text{ and } y \\text{ differ only at coordinate } k, \\text{ and } y_k = x_k - \\Delta, \\\\ 0, & \\text{otherwise,} \\end{cases}$$\n\nand the associated Markov chain transition matrix\n\n$$\\tilde{T}_{\\alpha_0} = I - \\alpha_0 \\cdot \\mathrm{diag}(\\mathbf{1}^T \\tilde{U}) + \\alpha_0 \\tilde{U}, \\tag{9}$$\n\nwhere $\\alpha_0$ is the largest constant that makes $\\tilde{T}_{\\alpha_0}$ non-negative. Suppose $\\tilde{T}_\\alpha$ has a stationary distribution,\ndenoted $\\tilde{\\pi}$. Then, for suf\ufb01ciently small \u03b1, $T_\\alpha$ has a stationary distribution $\\pi_\\alpha$, and\n\n$$\\lim_{\\alpha \\to 0} \\pi_\\alpha = \\tilde{\\pi}.$$\n\nFurthermore, this limiting distribution satis\ufb01es $\\tilde{\\pi}(x) > 0$ for any state x, and is thus not concentrated\non local minimizers of f.\n\nWhile the long term stationary behavior of SR is relatively insensitive to \u03b1, the convergence speed\nof the algorithm is not. To measure this, we consider the mixing time of the Markov chain. Let $\\pi_\\alpha$\ndenote the stationary distribution of a Markov chain. We say that the $\\epsilon$-mixing time of the chain is\n$M_\\epsilon$ if $M_\\epsilon$ is the smallest integer such that [23]\n\n$$|P(x^{M_\\epsilon} \\in A \\mid x^0) - \\pi(A)| \\le \\epsilon, \\quad \\text{for all } x^0 \\text{ and all subsets of states } A \\subseteq X. \\tag{10}$$\n\nWe show below that the mixing time of the Markov chain gets large for small \u03b1, which means\nexploration slows down, even though no exploitation gain is being realized.\nTheorem 6. Let $p_{x,k}$ satisfy the assumptions of Theorem 5. Choose some $\\epsilon$ suf\ufb01ciently small that\nthere exists a proper subset of states $A \\subset X$ with stationary probability $\\pi_\\alpha(A)$ greater than $\\epsilon$. 
Let\n$M_\\epsilon(\\alpha)$ denote the $\\epsilon$-mixing time of the chain with learning rate \u03b1. Then,\n\n$$\\lim_{\\alpha \\to 0} M_\\epsilon(\\alpha) = \\infty.$$\n\n\u00b2 Our analysis below does not require the state space to be \ufb01nite, so $T_\\alpha(x, y)$ may be a linear operator rather\nthan a matrix. Nonetheless, we use the term \u201cmatrix\u201d as it is standard.\n\nTable 1: Top-1 test error after training with full-precision (ADAM), binarized weights (R-ADAM, SR-ADAM,\nBC-ADAM), and binarized weights with big batch size (Big SR-ADAM).\n\nMethod        CIFAR-10                                CIFAR-100   ImageNet\n              VGG-9   VGG-BC   ResNet-56   WRN-56-2   ResNet-56   ResNet-18\nADAM          7.97    7.12     8.10        6.62       33.98       36.04\nBC-ADAM       10.36   8.21     8.83        7.17       35.34       52.11\nBig SR-ADAM   16.95   16.77    19.84       16.04      50.79       77.68\nSR-ADAM       23.33   20.56    26.49       21.58      58.06       88.86\nR-ADAM        23.99   21.88    33.56       27.90      68.39       91.07\n\n6 Experiments\n\nTo explore the implications of the theory above, we train both VGG-like networks [24] and Residual\nnetworks [25] with binarized weights on image classi\ufb01cation problems. On CIFAR-10, we train\nResNet-56, wide ResNet-56 (WRN-56-2, with 2X more \ufb01lters than ResNet-56), VGG-9, and the\nhigh capacity VGG-BC network used for the original BC model [5]. We also train ResNet-56 on\nCIFAR-100, and ResNet-18 on ImageNet [26].\nWe use Adam [27] as our baseline optimizer as we found it to frequently give better results than\nwell-tuned SGD (an observation that is consistent with previous papers on quantized models [1\u20135]),\nand we train with the three quantized algorithms mentioned in Section 3, i.e., R-ADAM, SR-ADAM\nand BC-ADAM. 
The image pre-processing and data augmentation procedures are the same as [25].\nFollowing [3], we only quantize the weights in the convolutional layers, but not linear layers, during\ntraining (See Appendix H.1 for a discussion of this issue, and a detailed description of experiments).\nWe set the initial learning rate to 0.01 and decrease the learning rate by a factor of 10 at epochs 82 and\n122 for CIFAR-10 and CIFAR-100 [25]. For ImageNet experiments, we train the model for 90 epochs\nand decrease the learning rate at epochs 30 and 60. See Appendix H for additional experiments.\nResults The overall results are summarized in Table 1. The binary model trained by BC-ADAM\nhas comparable performance to the full-precision model trained by ADAM. SR-ADAM outperforms\nR-ADAM, which veri\ufb01es the effectiveness of Stochastic Rounding. There is a performance gap\nbetween SR-ADAM and BC-ADAM across all models and datasets. This is consistent with our\ntheoretical results in Sections 4 and 5, which predict that keeping track of the real-valued weights as\nin BC-ADAM should produce better minimizers.\nExploration vs exploitation tradeoffs Section 5 discusses the exploration/exploitation tradeoff\nof continuous-valued SGD methods and predicts that fully discrete methods like SR are unable to\nenter a greedy phase. To test this effect, we plot the percentage of changed weights (signs different\nfrom the initialization) as a function of the training epochs (Figures 4 and 5). SR-ADAM explores\naggressively; it changes more weights in the conv layers than both R-ADAM and BC-ADAM, and\nkeeps changing weights until nearly 40% of the weights differ from their starting values (in a binary\nmodel, randomly re-assigning weights would result in 50% change). The BC method never changes\nmore than 20% of the weights (Fig 4(b)), indicating that it stays near a local minimizer and explores\nless. 
Interestingly, we see that the weights of the conv layers were not changed at all by R-ADAM;\nwhen the tails of the stochastic gradient distribution are light, this method is ineffective.\n\n6.1 A Way Forward: Big Batch Training\n\nWe saw in Section 5 that SR is unable to exploit local minima because, for small learning rates,\nshrinking the learning rate does not produce additional bias towards moving downhill. This was\nillustrated in Figure 1. If this is truly the cause of the problem, then our theory predicts that we can\nimprove the performance of SR for low-precision training by increasing the batch size. This shrinks\nthe variance of the gradient distribution in Figure 1 without changing the mean and concentrates\nmore of the gradient distribution towards downhill directions, making the algorithm more greedy.\n\nFigure 4: Percentage of weight changes during training of VGG-BC on CIFAR-10. Panels: (a) R-ADAM, (b) BC-ADAM, (c) SR-ADAM.\n\nFigure 5: Effect of batch size on SR-ADAM when tested with ResNet-56 on CIFAR-10. (a) BC-ADAM vs SR-ADAM: test error vs epoch.\nTest error is reported with dashed lines, train error with solid lines. (b) Percentage of weight changes since\ninitialization. (c) Percentage of weight changes every 5 epochs.\n\nTo verify this, we tried different batch sizes for SR including 128, 256, 512 and 1024, and found that\nthe larger the batch size, the better the performance of SR. Figure 5(a) illustrates the effect of a batch\nsize of 1024 for BC and SR methods. We \ufb01nd that the BC method, like classical SGD, performs best\nwith a small batch size. However, a large batch size is essential for the SR method to perform well.\nFigure 5(b) shows the percentage of weights changed by SR and BC during training. We see that the\nlarge batch methods change the weights less aggressively than the small batch methods, indicating\nless exploration. 
Figure 5(c) shows the percentage of weights changed during every 5 epochs of\ntraining. It is clear that small-batch SR changes weights much more frequently than big-batch SR.\nThis property of big batch training clearly benefits SR; we see in Figure 5(a) and Table 1 that big\nbatch training improved performance over SR-ADAM consistently.\nIn addition to providing a means of improving fixed-point training, this suggests that recently\nproposed methods using big batches [28, 29] may be able to exploit lower levels of precision to\nfurther accelerate training.\n\n7 Conclusion\n\nThe training of quantized neural networks is essential for deploying machine learning models\non portable and ubiquitous devices. We provide a theoretical analysis to better understand the\nBinaryConnect (BC) and Stochastic Rounding (SR) methods for training quantized networks. We\nproved convergence results for BC and SR methods that predict an accuracy bound that depends\non the coarseness of discretization. For general non-convex problems, we proved that SR differs\nfrom conventional stochastic methods in that it is unable to exploit greedy local search. Experiments\nconfirm these findings, and show that the mathematical properties of SR are indeed observable (and\nvery important) in practice.\n\nAcknowledgments\n\nT. Goldstein was supported in part by the US National Science Foundation (NSF) under grant CCF-\n1535902, by the US Office of Naval Research under grant N00014-17-1-2078, and by the Sloan\nFoundation. C. Studer was supported in part by Xilinx, Inc. and by the US NSF under grants\nECCS-1408006, CCF-1535897, and CAREER CCF-1652065. H. 
Samet was supported in part by the\nUS NSF under grant IIS-13-20791.\n\n[Figure 4 plots, panels (a)-(c): x-axis Epochs; y-axis Percentage of changed weights (%); curves conv_1 to conv_6 and linear_1 to linear_3.]\n[Figure 5 plots: x-axis Epochs; y-axis Error (%) in panel (a) and Percentage of changed weights (%) in panels (b) and (c); curves BC-ADAM 128, BC-ADAM 1024, SR-ADAM 128, SR-ADAM 1024.]\n\nReferences\n\n[1] Courbariaux, M., Hubara, I., Soudry, D., El-Yaniv, R., Bengio, Y.: Binarized neural networks: Training\ndeep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830\n(2016)\n\n[2] Marchesi, M., Orlandi, G., Piazza, F., Uncini, A.: Fast neural networks without multipliers. IEEE\nTransactions on Neural Networks 4(1) (1993) 53\u201362\n\n[3] Rastegari, M., Ordonez, V., Redmon, J., Farhadi, A.: XNOR-Net: ImageNet Classification Using Binary\nConvolutional Neural Networks. ECCV (2016)\n\n[4] Gupta, S., Agrawal, A., Gopalakrishnan, K., Narayanan, P.: Deep learning with limited numerical precision.\nIn: ICML. (2015)\n\n[5] Courbariaux, M., Bengio, Y., David, J.P.: BinaryConnect: Training deep neural networks with binary\nweights during propagations. In: NIPS. (2015)\n\n[6] Lin, D., Talathi, S., Annapureddy, S.: Fixed point quantization of deep convolutional networks. In: ICML.\n(2016)\n\n[7] Hwang, K., Sung, W.: Fixed-point feedforward deep neural network design using weights +1, 0, and -1.\nIn: IEEE Workshop on Signal Processing Systems (SiPS). 
(2014)\n\n[8] Lin, Z., Courbariaux, M., Memisevic, R., Bengio, Y.: Neural networks with few multiplications. ICLR\n(2016)\n\n[9] Li, F., Zhang, B., Liu, B.: Ternary weight networks. arXiv preprint arXiv:1605.04711 (2016)\n\n[10] Kim, M., Smaragdis, P.: Bitwise neural networks. In: ICML Workshop on Resource-Efficient Machine\nLearning. (2015)\n\n[11] Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., Bengio, Y.: Quantized neural networks: Training\nneural networks with low precision weights and activations. arXiv preprint arXiv:1609.07061 (2016)\n\n[12] Baldassi, C., Ingrosso, A., Lucibello, C., Saglietti, L., Zecchina, R.: Subdominant dense clusters allow for\nsimple learning and high computational performance in neural networks with discrete synapses. Physical\nReview Letters 115(12) (2015) 128101\n\n[13] Anwar, S., Hwang, K., Sung, W.: Fixed point optimization of deep convolutional neural networks for\nobject recognition. In: ICASSP, IEEE (2015)\n\n[14] Zhu, C., Han, S., Mao, H., Dally, W.J.: Trained ternary quantization. ICLR (2017)\n\n[15] Zhou, A., Yao, A., Guo, Y., Xu, L., Chen, Y.: Incremental network quantization: Towards lossless CNNs\nwith low-precision weights. ICLR (2017)\n\n[16] Zhou, S., Wu, Y., Ni, Z., Zhou, X., Wen, H., Zou, Y.: DoReFa-Net: Training low bitwidth convolutional\nneural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160 (2016)\n\n[17] H\u00f6hfeld, M., Fahlman, S.E.: Probabilistic rounding in neural network learning with limited precision.\nNeurocomputing 4(6) (1992) 291\u2013299\n\n[18] Miyashita, D., Lee, E.H., Murmann, B.: Convolutional neural networks using logarithmic data representation.\narXiv preprint arXiv:1603.01025 (2016)\n\n[19] Soudry, D., Hubara, I., Meir, R.: Expectation backpropagation: Parameter-free training of multilayer\nneural networks with continuous or discrete weights. In: NIPS. 
(2014)\n\n[20] Cheng, Z., Soudry, D., Mao, Z., Lan, Z.: Training binary multilayer neural networks for image classification\nusing expectation backpropagation. arXiv preprint arXiv:1503.03562 (2015)\n\n[21] Martens, J., Grosse, R.: Optimizing neural networks with Kronecker-factored approximate curvature. In:\nInternational Conference on Machine Learning. (2015) 2408\u20132417\n\n[22] Lan, G., Nemirovski, A., Shapiro, A.: Validation analysis of mirror descent stochastic approximation\nmethod. Mathematical Programming 134(2) (2012) 425\u2013458\n\n[23] Levin, D.A., Peres, Y., Wilmer, E.L.: Markov chains and mixing times. American Mathematical Soc.\n(2009)\n\n[24] Simonyan, K., Zisserman, A.: Very Deep Convolutional Networks for Large-Scale Image Recognition. In:\nICLR. (2015)\n\n[25] He, K., Zhang, X., Ren, S., Sun, J.: Deep Residual Learning for Image Recognition. In: CVPR. (2016)\n\n[26] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A.,\nBernstein, M., et al.: ImageNet Large Scale Visual Recognition Challenge. IJCV (2015)\n\n[27] Kingma, D., Ba, J.: Adam: A method for stochastic optimization. ICLR (2015)\n\n[28] De, S., Yadav, A., Jacobs, D., Goldstein, T.: Big batch SGD: Automated inference using adaptive batch\nsizes. arXiv preprint arXiv:1610.05792 (2016)\n\n[29] Goyal, P., Doll\u00e1r, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., He, K.:\nAccurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677 (2017)\n\n[30] Lax, P.: Linear Algebra and Its Applications. Number v. 10 in Linear algebra and its applications. Wiley\n(2007)\n\n[31] Krizhevsky, A.: Learning multiple layers of features from tiny images. (2009)\n\n[32] Zagoruyko, S., Komodakis, N.: Wide residual networks. 
arXiv preprint arXiv:1605.07146 (2016)\n\n[33] Collobert, R., Kavukcuoglu, K., Farabet, C.: Torch7: A matlab-like environment for machine learning. In:\nBigLearn, NIPS Workshop. (2011)\n\n[34] Ioffe, S., Szegedy, C.: Batch Normalization: Accelerating Deep Network Training by Reducing Internal\nCovariate Shift. (2015)", "award": [], "sourceid": 2965, "authors": [{"given_name": "Hao", "family_name": "Li", "institution": "University of Maryland, College Park"}, {"given_name": "Soham", "family_name": "De", "institution": "University of Maryland, College Park"}, {"given_name": "Zheng", "family_name": "Xu", "institution": "University of Maryland, College Park"}, {"given_name": "Christoph", "family_name": "Studer", "institution": "Cornell University"}, {"given_name": "Hanan", "family_name": "Samet", "institution": "University of Maryland at College Park"}, {"given_name": "Tom", "family_name": "Goldstein", "institution": "University of Maryland"}]}