{"title": "Flexpoint: An Adaptive Numerical Format for Efficient Training of Deep Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 1742, "page_last": 1752, "abstract": "Deep neural networks are commonly developed and trained in 32-bit floating point format. Significant gains in performance and energy efficiency could be realized by training and inference in numerical formats optimized for deep learning. Despite advances in limited precision inference in recent years, training of neural networks in low bit-width remains a challenging problem. Here we present the Flexpoint data format, aiming at a complete replacement of 32-bit floating point format training and inference, designed to support modern deep network topologies without modifications. Flexpoint tensors have a shared exponent that is dynamically adjusted to minimize overflows and maximize available dynamic range. We validate Flexpoint by training AlexNet, a deep residual network and a generative adversarial network, using a simulator implemented with the \\emph{neon} deep learning framework. We demonstrate that 16-bit Flexpoint closely matches 32-bit floating point in training all three models, without any need for tuning of model hyperparameters. Our results suggest Flexpoint as a promising numerical format for future hardware for training and inference.", "full_text": "Flexpoint: An Adaptive Numerical Format for\nEf\ufb01cient Training of Deep Neural Networks\n\nUrs K\u00f6ster\u2217\u2020, Tristan J. Webb\u2217, Xin Wang\u2217, Marcel Nassar\u2217, Arjun K. Bansal, William H.\nConstable, O\u02d8guz H. Elibol, Scott Gray\u2021, Stewart Hall\u2020, Luke Hornof, Amir Khosrowshahi,\n\nCarey Kloss, Ruby J. Pai, Naveen Rao\n\nArti\ufb01cial Intelligence Products Group, Intel Corporation\n\nAbstract\n\nDeep neural networks are commonly developed and trained in 32-bit \ufb02oating point\nformat. 
Signi\ufb01cant gains in performance and energy ef\ufb01ciency could be realized\nby training and inference in numerical formats optimized for deep learning. De-\nspite advances in limited precision inference in recent years, training of neural\nnetworks in low bit-width remains a challenging problem. Here we present the\nFlexpoint data format, aiming at a complete replacement of 32-bit \ufb02oating point\nformat training and inference, designed to support modern deep network topologies\nwithout modi\ufb01cations. Flexpoint tensors have a shared exponent that is dynami-\ncally adjusted to minimize over\ufb02ows and maximize available dynamic range. We\nvalidate Flexpoint by training AlexNet [1], a deep residual network [2, 3] and a\ngenerative adversarial network [4], using a simulator implemented with the neon\ndeep learning framework. We demonstrate that 16-bit Flexpoint closely matches\n32-bit \ufb02oating point in training all three models, without any need for tuning of\nmodel hyperparameters. Our results suggest Flexpoint as a promising numerical\nformat for future hardware for training and inference.\n\n1\n\nIntroduction\n\nDeep learning is a rapidly growing \ufb01eld that achieves state-of-the-art performance in solving many\nkey data-driven problems in a wide range of industries. With major chip makers\u2019 quest for novel\nhardware architectures for deep learning, the next few years will see the advent of new computing\ndevices optimized for training and inference of deep neural networks with increasing performance at\ndecreasing cost.\nTypically deep learning research is done on CPU and/or GPU architectures that offer native 64-bit,\n32-bit or 16-bit \ufb02oating point data format and operations. Substantial improvements in hardware\nfootprint, power consumption, speed, and memory requirements could be obtained with more ef\ufb01cient\ndata formats. 
This calls for innovations in numerical representations and operations speci\ufb01cally tailored for deep learning needs.\nRecently, inference with low bit-width \ufb01xed point data formats has seen signi\ufb01cant advances, whereas low bit-width training remains an open challenge [5, 6, 7]. Because training in low precision reduces memory footprint and increases the computational density of the deployed hardware infrastructure, it is crucial for ef\ufb01cient and scalable deep learning applications.\n\n\u2217Equal contribution.\n\u2020Currently with Cerebras Systems, work done while at Nervana Systems and Intel Corporation.\n\u2021Currently with OpenAI, work done while at Nervana Systems.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nIn this paper, we present Flexpoint, a \ufb02exible low bit-width numerical format that faithfully maintains algorithmic parity with full-precision \ufb02oating point training and supports a wide range of deep network topologies, while substantially reducing consumption of computational resources, making it amenable to specialized training hardware optimized for \ufb01eld deployment of already existing deep learning models.\nThe remainder of this paper is structured as follows. In Section 2, we review relevant work in the literature. In Section 3, we present the Flexpoint numerical format along with an exponent management algorithm that tracks the statistics of tensor extrema and adjusts tensor scales on a per-minibatch basis. In Section 4, we show results from training several deep neural networks in Flexpoint, showing close parity to \ufb02oating point performance: AlexNet and a deep residual network (ResNet) for image classi\ufb01cation, and the recently published Wasserstein GAN. 
In Section 5, we discuss speci\ufb01c advantages and limitations of Flexpoint, and compare its merits to those of competing low-precision training schemes.\n\n2 Related Work\n\nIn 2011, Vanhoucke et al. \ufb01rst showed that inference and training of deep neural networks are feasible with values of certain tensors quantized to a low-precision \ufb01xed point format [8]. More recently, an increasing number of studies have demonstrated low-precision inference with substantially reduced computation. These studies involve, usually in a model-dependent manner, quantization of speci\ufb01c tensors into low-precision \ufb01xed point formats. These include quantization of weights and/or activations to 8-bit [8, 9, 10, 11], down to 4-bit, 2-bit [12, 13] or ternary [10], and ultimately all binary [7, 14, 5, 6]. Weights trained at full precision are commonly converted from \ufb02oating point values, and bit-widths of component tensors are either pre-determined based on the characteristics of the model, or optimized per layer [11]. Low-precision inference has already made its way into production hardware such as Google\u2019s tensor processing unit (TPU) [15].\nOn the other hand, reasonable successes in low-precision training have been obtained with binarized [13, 16, 17, 5] or ternarized weights [18], or binarized gradients in the case of stochastic gradient descent [19], while accumulation of activations and gradients is usually at higher precision. Motivated by the non-uniform distribution of weights and activations, Miyashita et al. [20] used a logarithmic quantizer to quantize the parameters and gradients to 6 bits without signi\ufb01cant loss in performance. XNOR-nets focused on speeding up neural network computations by parametrizing the activations and weights as rank-1 products of binary tensors and higher precision scalar values [7]. 
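As context for the XNOR-and-bit-count kernels used by such binary networks, the core trick can be sketched in a few lines of Python (an illustrative sketch under the usual bit-packing convention, not the XNOR-Net implementation; `binary_dot` and `pack_signs` are our names):

```python
# Sketch: dot product of two {-1, +1} vectors encoded as bit masks
# (bit = 1 encodes +1, bit = 0 encodes -1). Matching bits contribute
# +1 and mismatching bits -1, so  dot(a, b) = n - 2 * popcount(a XOR b).

def binary_dot(a_bits: int, b_bits: int, n: int) -> int:
    """Dot product of two n-element sign vectors packed as integers."""
    return n - 2 * bin(a_bits ^ b_bits).count("1")

def pack_signs(v):
    """Pack a list of +1/-1 values into an integer bit mask."""
    bits = 0
    for i, s in enumerate(v):
        if s > 0:
            bits |= 1 << i
    return bits

a = [+1, -1, -1, +1]
b = [+1, +1, -1, -1]
# Reference dot product: 1 - 1 + 1 - 1 = 0
print(binary_dot(pack_signs(a), pack_signs(b), len(a)))  # -> 0
```

A convolution then reduces to many such popcount dot products, which is where the speed-up discussed next comes from.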
This enables the use of kernels composed of XNOR and bit-count operations to perform highly ef\ufb01cient convolutions. However, additional high-precision multipliers are still needed to perform the scaling after each convolution, which limits performance. Quantized Neural Networks (QNNs), and their binary version (Binarized Nets), successfully perform low-precision inference (down to 1-bit) by keeping real-valued weights and quantizing them only for computing gradients and forward inference [17, 5]. In experiments with specialized GPU kernels, Hubara et al. found that low-precision networks coupled with ef\ufb01cient bit-shift-based operations resulted in computational speed-ups. DoReFa-Nets utilize similar ideas to QNNs and quantize the gradients to 6 bits to achieve similar performance [6]. The authors also trained the deepest ResNet so far (18 layers) in limited precision.\nThe closest work related to this manuscript is by Courbariaux et al. [21], who used a dynamic \ufb01xed point (DFXP) format in training a number of benchmark models. In their study, tensors are polled periodically for the fraction of over\ufb02owed entries in a given tensor: if that number exceeds a certain threshold the exponent is incremented to extend the dynamic range, and vice versa. The main drawback is that this update mechanism only passively reacts to over\ufb02ows rather than anticipating and preemptively avoiding over\ufb02ows; this turns out to be catastrophic for maintaining convergence of the training.\n\nFigure 1: Diagrams of bit representations of different tensorial numerical formats. Red, green and blue shading signify mantissa, exponent, and sign bits, respectively. In both (a) IEEE 754 32-bit \ufb02oating point and (b) IEEE 754 16-bit \ufb02oating point, a portion of the bit string is allocated to specify exponents. 
(c) illustrates a Flexpoint tensor with 16-bit mantissa and 5-bit shared exponent.\n\n3 Flexpoint\n\n3.1 The Flexpoint Data Format\n\nFlexpoint is a data format that combines the advantages of \ufb01xed point and \ufb02oating point arithmetic. By using a common exponent for integer values in a tensor, Flexpoint reduces computational and memory requirements while automatically managing the exponent of each tensor in a user-transparent manner.\nFlexpoint is based on tensors with an N-bit mantissa storing an integer value in two\u2019s complement form, and an M-bit exponent e, shared across all elements of a tensor. This format is denoted as flexN+M. Fig. 1 shows an illustration of a Flexpoint tensor with a 16-bit mantissa and 5-bit exponent, i.e. flex16+5, compared to 32-bit and 16-bit \ufb02oating point tensors. In contrast to \ufb02oating point, the exponent is shared across tensor elements, and, unlike \ufb01xed point, the exponent is updated automatically every time a tensor is written.\nCompared to 32-bit \ufb02oating point, Flexpoint reduces both memory and bandwidth requirements in hardware, as storage and communication of the exponent can be amortized over the entire tensor. Power and area requirements are also reduced due to simpler multipliers compared to \ufb02oating point. Speci\ufb01cally, multiplication of entries of two separate tensors can be computed as a \ufb01xed point operation since the common exponent is identical across all the output elements. For the same reason, addition across elements of the same tensor can also be implemented as a \ufb01xed point operation. This essentially turns the majority of computations of deep neural networks into \ufb01xed point operations.\n\n3.2 Exponent Management\n\nThese remarkable advantages come at the cost of added complexity of exponent management and dynamic range limitations imposed by sharing a single exponent. 
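To make the shared-exponent representation concrete, the following is a minimal NumPy sketch of quantizing a nonzero float tensor to a flex16+5-style pair (our illustration, not Intel's implementation; here the exponent is derived directly from the current tensor, whereas Autoflex, described below, predicts it instead):

```python
import numpy as np

N = 16  # mantissa bits (two's complement), as in flex16+5

def to_flex(x):
    """Quantize a float tensor to (int16 mantissas, shared scale kappa).

    Sketch only: the shared scale kappa = 2**-e is chosen so the largest
    magnitude just fits in the N-bit mantissa; values are clipped if
    rounding would overflow.
    """
    max_mag = float(np.max(np.abs(x)))          # assumes a nonzero tensor
    e = (N - 1) - int(np.ceil(np.log2(max_mag)))  # kappa = 2**-e
    kappa = 2.0 ** -e
    mantissa = np.clip(np.round(x / kappa),
                       -(2 ** (N - 1)), 2 ** (N - 1) - 1).astype(np.int16)
    return mantissa, kappa

def from_flex(mantissa, kappa):
    """Recover a float32 tensor from mantissas and the shared scale."""
    return mantissa.astype(np.float32) * kappa

x = np.array([0.02, -0.5, 0.25, 0.7], dtype=np.float32)
m, kappa = to_flex(x)
x_hat = from_flex(m, kappa)  # elementwise error is at most kappa / 2
```

Because every element shares one scale, multiplying two such tensors reduces to integer multiplies of mantissas with a known output scale, which is the source of the hardware savings described above.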
Other authors have reported on the range of values contained within tensors during neural network training: \u201cthe activations, gradients and parameters have very different ranges\u201d and \u201cgradients ranges slowly diminish during the training\u201d [21]. These observations are promising indicators of the viability of numerical formats based on tensor-shared exponents. Fig. 2 shows histograms of values from different types of tensors taken from a 110-layer ResNet trained on CIFAR-10 using 32-bit \ufb02oating point.\nIn order to preserve a faithful representation of \ufb02oating point, tensors with a shared exponent must have a suf\ufb01ciently narrow dynamic range such that mantissa bits alone can encode variability. As suggested by Fig. 2, 16 bits of mantissa are suf\ufb01cient to cover the majority of values of a single tensor. For performing operations such as adding gradient updates to weights, there must be suf\ufb01cient mantissa overlap between tensors, putting additional requirements on the number of bits needed to represent values in training, as compared to inference. Establishing that deep learning tensors conform to these requirements during training is a key \ufb01nding in our present results. An alternative approach to this problem is stochastic rounding [22].\nFinally, to implement Flexpoint ef\ufb01ciently in hardware, the output exponent has to be determined before the operation is actually performed. Otherwise the intermediate result needs to be stored in high precision, before reading the new exponent and quantizing the result, which would negate much of the potential savings in hardware. Therefore, intelligent management of the exponents is required.\n\n3.3 Exponent Management Algorithm\n\nWe propose an exponent management algorithm called Auto\ufb02ex, designed for iterative optimizations, such as stochastic gradient descent, where tensor operations, e.g. 
matrix multiplication, are performed repeatedly and outputs are stored in hardware buffers. Auto\ufb02ex predicts an optimal exponent for the output of each tensor operation based on tensor-wide statistics gathered from values computed in previous iterations.\nThe success of training deep neural networks in Flexpoint hinges on the assumption that ranges of values in the network change suf\ufb01ciently slowly, such that exponents can be predicted with high accuracy based on historical trends. If the input data is independently and identically distributed, tensors in the network, such as weights, activations and deltas, will have slowly changing exponents. Fig. 3 shows an example of training a deep neural network model.\nThe Auto\ufb02ex algorithm tracks the maximum absolute value \u0393 of the mantissa of every tensor, by using a dequeue to store a bounded history of these values. Intuitively, it is then possible to estimate a trend in the stored values based on a statistical model, use it to anticipate an over\ufb02ow, and increase the exponent preemptively to prevent over\ufb02ow. Similarly, if the trend of \u0393 values decreases, the exponent can be decreased to better utilize the available range.\nWe formalize our terminology as follows. After each kernel call, statistics are stored as the \ufb02oating point representation \u03c6 of the maximum absolute value of a tensor, obtained as \u03c6 = \u0393\u03ba by multiplying the maximum absolute mantissa value \u0393 with the scale factor \u03ba. 
This scale factor is related\nto the exponent e by the relation \u03ba = 2\u2212e.\n\nFigure 2: Distributions of values for (a) weights, (b) activations and (c) weight updates, all during the\n\ufb01rst epoch (blue) and last epoch (purple) of training a ResNet trained on CIFAR-10 for 165 epochs.\nThe horizontal axis covers the entire range of values that can be represented in 16-bit Flexpoint, with\nthe horizontal bars indicating the dynamic range covered by the 16-bit mantissa. All tensors have\na narrow peak close to the right edge of the horizontal bar, where values have close to the same\nprecision as if the elements had individual exponents.\n\n4\n\n\fIf the same tensor is reused for different computations in the network, we track the exponent e and\nthe statistics of \u03c6 separately for each use. This allows the underlying memory for the mantissa to be\nshared across different uses, without disrupting the exponent management.\n\n3.4 Auto\ufb02ex Initialization\n\nAt the beginning of training, the statistics queue is empty, so we use a simple trial-and-error scheme\ndescribed in Algorithm 1 to initialize the exponents. We perform each operation in a loop, inspecting\nthe output value of \u0393 for over\ufb02ows or underutilization, and repeat until the target exponent is found.\n\nAlgorithm 1 Auto\ufb02ex initialization algorithm. 
Scales are initialized by repeatedly performing the operation and adjusting the scale \u03ba up in case of over\ufb02ows or down if not all bits are utilized.\n1: initialized \u2190 False\n2: \u03ba \u2190 1\n3: procedure INITIALIZE SCALE\n4:   while not initialized do\n5:     \u0393 \u2190 returned by kernel call\n6:     if \u0393 \u2265 2^(N\u22121) \u2212 1 then  \u25b7 over\ufb02ow: increase scale \u03ba\n7:       \u03ba \u2190 \u03ba \u00d7 2^\u230a(N\u22121)/2\u230b\n8:     else if \u0393 < 2^(N\u22122) then  \u25b7 under\ufb02ow: decrease scale \u03ba\n9:       \u03ba \u2190 \u03ba \u00d7 2^(\u2308log2 max(\u0393, 1)\u2309 \u2212 (N\u22122))  \u25b7 jump directly to target exponent\n10:      if \u0393 > 2^(\u230a(N\u22121)/2\u230b \u2212 2) then  \u25b7 ensure enough bits for a reliable jump\n11:        initialized \u2190 True\n12:    else\n13:      initialized \u2190 True  \u25b7 scale \u03ba is correct\n\n3.5 Auto\ufb02ex Exponent Prediction\n\nAfter the network has been initialized by running the initialization procedure for each computation in the network, we train the network in conjunction with a scale update (Algorithm 2) executed twice per minibatch, once after forward activation and once after backpropagation, for each tensor/computation in the network. We maintain a \ufb01xed-length dequeue f of the maximum \ufb02oating point values encountered in the previous l iterations, and predict the expected maximum value for the next iteration based on the maximum and standard deviation of values stored in the dequeue. If an over\ufb02ow is encountered, the history of statistics is reset and the exponent is increased by one additional bit.\n\nAlgorithm 2 Auto\ufb02ex scaling algorithm. Hyperparameters are the multiplicative headroom factor \u03b1 = 2, number of standard deviations \u03b2 = 3, and additive constant \u03b3 = 100. Statistics are computed over a moving window of length l = 16. 
Returns the scale \u03ba for the next iteration.\n1: f \u2190 stats dequeue of length l\n2: \u0393 \u2190 maximum absolute value of mantissa, returned by kernel call\n3: \u03ba \u2190 previous scale value \u03ba\n4: procedure ADJUST SCALE\n5:   if \u0393 \u2265 2^(N\u22121) \u2212 1 then  \u25b7 over\ufb02ow: add one bit and clear stats\n6:     clear f\n7:     \u0393 \u2190 2\u0393\n8:   f \u2190 [f, \u0393\u03ba]  \u25b7 extend dequeue\n9:   \u03c7 \u2190 \u03b1 [max(f) + \u03b2 std(f) + \u03b3\u03ba]  \u25b7 predicted maximum value for next iteration\n10:  \u03ba \u2190 2^(\u2308log2 \u03c7\u2309 \u2212 N + 1)  \u25b7 nearest power of two\n\n3.6 Auto\ufb02ex Example\n\nWe illustrate the algorithm by training a small 2-layer perceptron for 400 iterations on the CIFAR-10 dataset. During training, \u03ba and \u0393 values are stored at each iteration, as shown in Fig. 3, for a linear layer\u2019s weight, activation, and update tensors. Fig. 3(a) shows the weight tensor, which is highly stable as it is only updated with small gradient steps. \u0393 slowly approaches its maximum\n\nFigure 3: Evolution of different tensors during training with corresponding mantissa and exponent values. The second row shows the scale \u03ba, adjusted to keep the maximum absolute mantissa values (\u0393, \ufb01rst row) at the top of the dynamic range without over\ufb02owing. As the product of the two (\u03a6, third row) is anticipated to cross a power of two boundary, the scale is changed so as to keep the mantissa in the correct range. 
(a) shows this process for a weight tensor, which is very stable and slowly changing. The black arrow indicates how scale changes are synchronized with crossings of the exponent boundary. (b) shows an activation tensor with a noisier sequence of values. (c) shows a tensor of updates, which typically displays the most frequent exponent changes. In each case the Auto\ufb02ex estimate (green line) crosses the exponent boundary (gray horizontal line) before the actual data (red) does, which means that exponent changes are predicted before an over\ufb02ow occurs.\n\nvalue of 2^14, at which point the \u03ba value is updated, and \u0393 drops by one bit. Shown below is the corresponding \ufb02oating point representation of the statistics computed from \u03a6, which is used to perform the exponent prediction. Using a sliding window of 16 values, the predicted maximum is computed and used to set the exponent for the next iteration. In Fig. 3(a), the prediction crosses the exponent boundary of 2^3 about 20 iterations before the value itself does, safely preventing an over\ufb02ow. Tensors with more variation across epochs are shown in Fig. 3(b) (activations) and Fig. 3(c) (updates). The standard deviation across iterations is higher, therefore the algorithm leaves about half a bit and one bit of headroom, respectively. Even as the tensor \ufb02uctuates in magnitude by more than a factor of two, the maximum absolute value of the mantissa \u0393 is safely prevented from over\ufb02owing. The cost of this approach is that in the last example \u0393 reaches 3 bits below the cutoff, leaving the top bits zero and using only 13 of the 16 bits for representing data.\n\n3.7 Simulation on GPU\n\nThe experiments described below were performed on Nvidia GPUs using the neon deep learning framework4. 
In order to simulate the flex16+5 data format we stored tensors using an int16 type.\nComputations such as convolution and matrix multiplication were performed with a set of GPU\nkernels which convert the underlying int16 data format to float32 by multiplying with \u03ba, perform\noperations in \ufb02oating point, and convert back to int16 before returning the result as well as \u0393. The\nkernels also have the ability to compute only \u0393 without writing any outputs, to prevent writing invalid\ndata during exponent initialization. The computational performance of the GPU kernels is comparable\nto pure \ufb02oating point kernels, so training models in this Flexpoint simulator adds little overhead.\n\n4 Experimental Results\n\n4.1 Convolutional Networks\n\nWe trained two convolutional networks in flex16+5, using float32 as a benchmark: AlexNet [1],\nand a ResNet [2, 3]. The ResNet architecture is composed of modules with shortcuts in the data\ufb02ow\n\n4Available at https://github.com/NervanaSystems/neon.\n\n6\n\n\fgraph, a key feature that makes effective end-to-end training of extremely deep networks possible.\nThese multiple divergent and convergent \ufb02ows of tensor values at potentially disparate scales might\npose unique challenges for training in \ufb01xed point numerical format.\nWe built a ResNet following the design as described in [3]. The network has 12 blocks of residual\nmodules consisting of convolutional stacks, making a deep network of 110 layers in total. We trained\nthis model on the CIFAR-10 dataset [1] with float32 and flex16+5 data formats for 165 epochs.\nFig. 4 shows misclassi\ufb01cation error on the validation set plotted over the course of training. Learning\ncurves match closely between float32 and flex16+5 for both networks. 
In contrast, models trained in float16 without any changes in hyperparameter values substantially underperformed those trained in float32 and flex16+5.\n\n4.2 Generative Adversarial Networks\n\nNext, we validate training a generative adversarial network (GAN) in flex16+5. By virtue of an adversarial (two-player game) training process, GAN models provide a principled way of unsupervised learning using deep neural networks. The unique characteristics of GAN training, namely separate data \ufb02ows through two components (generator and discriminator) of the network, in addition to feeds of alternating batches of real and generated data of drastically different statistics to the discriminator at early stages of the training, pose signi\ufb01cant challenges to \ufb01xed point numerical representations.\nWe built a Wasserstein GAN (WGAN) model [4], which has the advantage of a metric, namely the Wasserstein-1 distance, that is indicative of generator performance and can be estimated from discriminator output during training. We trained a WGAN model with the LSUN [23] bedroom dataset in float32, flex16+5 and float16 formats with exactly the same hyperparameter settings. As shown in Fig. 5(a), estimates of the Wasserstein distance in flex16+5 training and in float32 training closely tracked each other. In float16 training the distance deviated signi\ufb01cantly from baseline float32, starting with an initially undertrained discriminator. Further, we found no differences in the quality of generated images between float32 and flex16+5 at speci\ufb01c stages of the training (Fig. 5(b)), as quanti\ufb01ed by the Fr\u00e9chet Inception Distance (FID) [24]. Generated images from float16 training had lower quality (signi\ufb01cantly higher FIDs, Fig. 5(b)), with noticeably more saturated patches; examples are illustrated in Fig. 
5(c), 5(d) and 5(e).\n\nFigure 4: Convolutional networks trained in flex16+5 and float32 numerical formats. (a) AlexNet trained on ImageNet1k, graph showing top-5 misclassi\ufb01cation on the validation set. (b) ResNet of 110 layers trained on CIFAR-10, graph showing top-1 misclassi\ufb01cation on the validation set.\n\n5 Discussion\n\nIn the present work, we show that a Flexpoint data format, flex16+5, can adequately support training of modern deep learning models without any modi\ufb01cations of model topology or hyperparameters, achieving numerical performance on par with float32, the conventional data format widely used in deep learning research and development. Our results suggest a potential gain in ef\ufb01ciency and performance of future hardware architectures specialized for deep neural network training.\nAlternatives, i.e. schemes that more aggressively quantize tensor values to lower bit precisions, have also made signi\ufb01cant progress recently. Here we list major advantages and limitations of Flexpoint, and make a detailed comparison with competing methods in the following sections.\nDistinct from very low precision (below 8-bit) \ufb01xed point quantization schemes, which signi\ufb01cantly alter the quantitative behavior of the original model and thus require completely different training algorithms, Flexpoint\u2019s philosophy is to maintain numerical parity with the original network training behavior in high-precision \ufb02oating point. This brings about a number of advantages. 
First, all prior knowledge of network design and hyperparameter tuning for ef\ufb01cient training can still be fully leveraged. Second, networks trained in high-precision \ufb02oating point formats can be readily deployed in Flexpoint hardware for inference, or as a component of a bigger network for training. Third, no re-tuning of hyperparameters is necessary for training in Flexpoint: what works with \ufb02oating point simply works in Flexpoint. Fourth, the training procedure remains exactly the same, eliminating the need for intermediate high-precision representations, with the only exception of the intermediate higher-precision accumulation commonly needed for multipliers and adders.\n\nFigure 5: Training performance of WGAN in flex16+5, float32 and float16 data formats. (a) Learning curves, i.e. the estimated Wasserstein distance, obtained from the negative discriminator cost function by median \ufb01ltering (kernel length 100 [4]) and down-sampling (plotting every 100th value). (b) Fr\u00e9chet Inception Distance (FID) estimated from 5000 samples of the generator, as in [24]. Examples of images generated by the WGAN trained in (c) float32, (d) flex16+5 and (e) float16 for 16 epochs.\n\n
Fifth, all Flexpoint tensors are managed in exactly the same way by the Auto\ufb02ex algorithm, which is designed to be hidden from the user, eliminating the need to remain cognizant of different types of tensors being quantized into different bit-widths. And \ufb01nally, the Auto\ufb02ex algorithm is robust enough to accommodate diverse deep network topologies, without the need for model-speci\ufb01c tuning of its hyperparameters.\nDespite these advantages, the same design philosophy of Flexpoint likely imposes a limitation on performance and ef\ufb01ciency, especially when compared to more aggressive quantization schemes, e.g. Binarized Networks, Quantized Networks and the DoReFa-Net. However, we believe Flexpoint strikes a desirable balance between aggressive extraction of performance and support for a wide collection of existing models. Furthermore, the potential and hardware-architecture implications of other data formats in the Flexpoint family, namely flexN+M for certain (N, M), are yet to be explored in future investigations.\nLow-precision data formats: TensorFlow provides tools to quantize networks into 8-bit for inference [9]. TensorFlow\u2019s numerical format shares some common features with Flexpoint: each tensor has two variables that encode the range of the tensor\u2019s values; this is similar to the Auto\ufb02ex \u03ba (although it uses fewer bits to encode the exponent). An integer value is then used to represent the dynamic range with a dynamic precision.\nThe dynamic \ufb01xed point (DFXP) numerical format, proposed in [25], has a similar representation to Flexpoint: a tensor consists of mantissa bits and values share a common exponent. This format was used by [21] to train various neural nets in low precision with limited success (with dif\ufb01culty matching CIFAR-10 maxout nets trained in float32). 
DFXP diverges signi\ufb01cantly from Flexpoint in automatic\nexponent management: DFXP only updates the shared exponent at intervals speci\ufb01ed by the user\n(e.g. per 100 minibatches) and solely based on the number of over\ufb02ows occurring. Flexpoint is more\nsuitable for training modern networks where the dynamics of the tensors might change rapidly.\nLow-precision networks: While allowing for very ef\ufb01cient forward inference, the low-precision\nnetworks discussed in Section 2 share the following shortcomings when it comes to neural network\ntraining. These methods utilize an intermediate \ufb02oating point weight representation that is also\nupdated in \ufb02oating point. This requires special hardware to perform these operations in addition to\nincreasing the memory footprint of the models. In addition, these low-precision quantizations render\nthe models so different from the exact same networks trained in high-precision \ufb02oating point formats\nthat there is often no parity at the algorithmic level, which requires completely distinct training\nalgorithms to be developed and optimized for these low-precision training schemes.\n\n6 Conclusion\n\nTo further scale up deep learning the future will require custom hardware that offers greater com-\npute capability, supports ever-growing workloads, and minimizes memory and power consumption.\nFlexpoint is a numerical format designed to complement such specialized hardware.\nWe have demonstrated that Flexpoint with a 16-bit mantissa and a 5-bit shared exponent achieved\nnumerical parity with 32-bit \ufb02oating point in training several deep learning models without modifying\nthe models or their hyperparameters, outperforming 16-bit \ufb02oating point under the same conditions.\nThus, speci\ufb01cally designed formats, like Flexpoint, along with supporting algorithms, such as\nAuto\ufb02ex, go beyond current standards and present a promising ground for future research.\n\n7 Acknowledgements\n\nWe thank Dr. 
Evren Tumer for his insightful comments and feedback.

References

[1] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

[2] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016. URL http://arxiv.org/abs/1512.03385.

[3] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision, pages 630–645. Springer, 2016. URL http://arxiv.org/abs/1603.05027.

[4] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017. URL http://arxiv.org/abs/1701.07875.

[5] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. arXiv preprint arXiv:1609.07061, 2016. URL http://arxiv.org/abs/1609.07061.

[6] Shuchang Zhou, Zekun Ni, Xinyu Zhou, He Wen, Yuxin Wu, and Yuheng Zou. DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160, 2016. URL http://arxiv.org/abs/1606.06160.

[7] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. In European Conference on Computer Vision, pages 525–542. Springer, 2016.

[8] Vincent Vanhoucke, Andrew Senior, and Mark Z. Mao. Improving the speed of neural networks on CPUs. In Advances in Neural Information Processing Systems (workshop on deep learning), 2011. URL http://research.google.com/pubs/archive/37631.pdf.

[9] Martín Abadi et al.
TensorFlow: Large-scale machine learning on heterogeneous distributed systems, 2016. URL http://arxiv.org/abs/1603.04467.

[10] Naveen Mellempudi, Abhisek Kundu, Dipankar Das, Dheevatsa Mudigere, and Bharat Kaul. Mixed low-precision deep learning inference using dynamic fixed point. arXiv preprint arXiv:1701.08978, 2017. URL http://arxiv.org/abs/1701.08978.

[11] Darryl D. Lin, Sachin S. Talathi, and V. Sreekanth Annapureddy. Fixed point quantization of deep convolutional networks. In International Conference on Machine Learning, pages 2849–2858, 2016.

[12] Ganesh Venkatesh, Eriko Nurvitadhi, and Debbie Marr. Accelerating deep convolutional networks using low-precision and sparsity. arXiv preprint arXiv:1610.00324, 2016. URL http://arxiv.org/abs/1610.00324.

[13] Zhouhan Lin, Matthieu Courbariaux, Roland Memisevic, and Yoshua Bengio. Neural networks with few multiplications. arXiv preprint arXiv:1510.03009, 2015. URL http://arxiv.org/abs/1510.03009.

[14] Minje Kim and Paris Smaragdis. Bitwise neural networks. arXiv preprint arXiv:1601.06071, 2016. URL http://arxiv.org/abs/1601.06071.

[15] Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, et al. In-datacenter performance analysis of a tensor processing unit. arXiv preprint arXiv:1704.04760, 2017. URL http://arxiv.org/abs/1704.04760.

[16] Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. BinaryConnect: Training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems, pages 3123–3131, 2015. URL http://arxiv.org/abs/1511.00363.

[17] Matthieu Courbariaux and Yoshua Bengio. BinaryNet: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830, 2016. URL http://arxiv.org/abs/1602.02830.

[18] Kyuyeon Hwang and Wonyong Sung.
Fixed-point feedforward deep neural network design using weights +1, 0, and -1. In Signal Processing Systems (SiPS), 2014 IEEE Workshop on, pages 1–6. IEEE, 2014.

[19] Frank Seide, Hao Fu, Jasha Droppo, Gang Li, and Dong Yu. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs. In Interspeech, pages 1058–1062, 2014.

[20] Daisuke Miyashita, Edward H. Lee, and Boris Murmann. Convolutional neural networks using logarithmic data representation. arXiv preprint arXiv:1603.01025, 2016. URL http://arxiv.org/abs/1603.01025.

[21] Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. Training deep neural networks with low precision multiplications. In International Conference on Learning Representations (workshop contribution), 2014. URL http://arxiv.org/abs/1412.7024.

[22] Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. Deep learning with limited numerical precision. arXiv preprint arXiv:1502.02551, 2015. URL http://arxiv.org/abs/1502.02551.

[23] Fisher Yu, Yinda Zhang, Shuran Song, Ari Seff, and Jianxiong Xiao. LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015.

[24] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, Günter Klambauer, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a Nash equilibrium. arXiv preprint arXiv:1706.08500, 2017. URL http://arxiv.org/abs/1706.08500.

[25] Darrell Williamson. Dynamically scaled fixed point arithmetic. In Communications, Computers and Signal Processing, 1991., IEEE Pacific Rim Conference on, pages 315–318.
IEEE, 1991.