Training DNNs with Hybrid Block Floating Point

Advances in Neural Information Processing Systems (NeurIPS 2018), pages 453–463.

Mario Drumond, Tao Lin, Martin Jaggi, Babak Falsafi
Ecocloud, EPFL
{mario.drumond, tao.lin, martin.jaggi, babak.falsafi}@epfl.ch

Abstract

The wide adoption of DNNs has given birth to unrelenting computing requirements, forcing datacenter operators to adopt domain-specific accelerators to train them. These accelerators typically employ densely packed full-precision floating-point arithmetic to maximize performance per area. Ongoing research efforts seek to further increase that performance density by replacing floating-point with fixed-point arithmetic. However, a significant roadblock for these attempts has been fixed point's narrow dynamic range, which is insufficient for DNN training convergence. We identify block floating point (BFP) as a promising alternative representation since it exhibits wide dynamic range and enables the majority of DNN operations to be performed with fixed-point logic. Unfortunately, BFP alone introduces several limitations that preclude its direct applicability. In this work, we introduce HBFP, a hybrid BFP-FP approach, which performs all dot products in BFP and other operations in floating point. HBFP delivers the best of both worlds: the high accuracy of floating point at the superior hardware density of fixed point.
For a wide variety of models, we show that HBFP matches floating point's accuracy while enabling hardware implementations that deliver up to 8.5× higher throughput.

1 Introduction

Today's online services are ubiquitous, offering custom-tailored content to billions of daily users. Service customization is often provided using deep neural networks (DNNs) deployed at a massive scale in datacenters. Delivering faster DNN inference and more accurate training is often limited by the arithmetic density of the underlying hardware platform. Most service providers resort to GPUs as the platform of choice for training neural networks because they offer higher arithmetic density per silicon area than CPUs, through full-precision floating-point (FP32) units. However, the computational power required by DNNs has been increasing so quickly that even traditional GPUs cannot satisfy the demand, pushing both accelerators and high-end GPUs towards narrow arithmetic to improve logic density. For instance, NVIDIA's Volta [1] GPU employs half-precision floating-point (FP16) arithmetic, while Google employs a custom 16-bit floating-point representation in the second and third versions of the TPU [2] architecture.

Following the same approach, there have been research efforts to replace narrow floating point with even denser fixed-point representations. Fixed-point arithmetic promises excellent gains in both speed and density. For instance, 8-bit fixed-point multipliers occupy 5.8× less area and consume 5.5× less energy than their FP16 counterparts [3]. Unfortunately, training with fixed point, or even with FP16, has yielded mixed results due to the limited range inherent in these representations [4].

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

(a) FP repr.
with an exponent per tensor element. (b) BFP repr. with an exponent per tensor.
Figure 1: An n-element tensor in BFP and FP representations. BFP tensors save space and simplify computations by sharing exponents across tensors.

Block floating point (BFP) is an alternative representation that strikes a balance between logic density and representable range. Signal processing platforms have historically resorted to BFP to optimize for both performance and density, but BFP has not been thoroughly investigated in the context of DNN training. Figure 1 highlights the difference between the BFP and FP representations. Floating point encodes numbers with one exponent per value (Figure 1a), requiring complex hardware structures to manage mantissa alignment and exponent values. In contrast, BFP (Figure 1b) shares exponents across blocks of numbers (or tensors), which enables dense fixed-point logic for multiply-and-accumulate operations. In the past, signal processors leveraged BFP to convert common algorithms (e.g., FFT) to fixed-point arithmetic hardware. DNN computations, like signal processing, consist mostly of MAC-based operations (i.e., dot products) and therefore can benefit from BFP's arithmetic density.

While promising, replacing floating point with BFP for DNN training faces three significant challenges. First, although BFP dot products are very area-efficient, other BFP operations may not be as efficient, leading to hardware with floating-point-like arithmetic density. Second, exponent sharing may lead to data loss if exponent values are too large or too small, making the exponent selection policy a crucial design choice in BFP-based systems.
Finally, BFP may incur data loss if the tensors' value distributions are too wide to be captured by its mantissa bits.

In this paper, we target the three aforementioned challenges with a hybrid approach, performing all dot products in BFP and other operations in floating point. First, we observe that, since dot products are prevalent in DNNs, performing only the other operations with floating point incurs low area overhead, unlike the overhead incurred by supporting arbitrary BFP operations.

Solving the second problem of selecting the right exponents is equivalent to selecting good scales for quantization points. Prior work [5, 6] introduced coarse-grained exponent selection algorithms. These algorithms train DNNs with fixed point and adjust the exponents a few times per training batch or epoch. Unfortunately, the coarse-grained approaches fail to accommodate drastic exponent changes, resulting in data loss and hurting convergence. One solution to that problem is to use wider mantissas paired with conservatively large exponents so that there is some headroom in case tensor values are unexpectedly large. These strategies result in less dense hardware due to the wide arithmetic used, and also introduce more hyperparameters, further complicating the training process. We argue that a more aggressive approach to exponent selection, with exponents chosen before each dot product takes place, leads to convergence with narrower mantissas.

Finally, we target the third challenge by only converting values to BFP right before dot products, leveraging the fact that dot products are resilient to the input data loss incurred by BFP.
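The conversion just described — take the exponent of the largest value, then round every element onto that shared fixed-point grid — can be sketched in a few lines of Python. `to_bfp` and `from_bfp` are hypothetical helper names for illustration, not the paper's implementation:

```python
import math

def to_bfp(values, mant_bits):
    """Quantize a block of floats to BFP: one shared exponent taken from
    the largest magnitude, plus a signed fixed-point mantissa per element.
    Hypothetical sketch, not the paper's implementation."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return 0, [0] * len(values)
    exp = math.frexp(max_abs)[1]            # max_abs lies in [2**(exp-1), 2**exp)
    lsb = 2.0 ** (exp - (mant_bits - 1))    # value of one mantissa LSB
    mants = [int(round(v / lsb)) for v in values]
    return exp, mants

def from_bfp(exp, mants, mant_bits):
    """Reconstruct the real values from a shared exponent and mantissas."""
    lsb = 2.0 ** (exp - (mant_bits - 1))
    return [m * lsb for m in mants]

# A tensor with a wide value distribution: the tiny element is rounded
# away, while the large elements survive almost unchanged.
exp, mants = to_bfp([3.7, -1.2, 1e-6], mant_bits=8)
restored = from_bfp(exp, mants, mant_bits=8)
```

This is exactly the loss pattern the text argues dot products tolerate: the largest inputs, which dominate a reduction, are preserved, while values near the bottom of the distribution vanish.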
Other operations take floating point as inputs, enabling the accurate representation of arbitrary value distributions.

This paper's contributions are: (1) a hybrid BFP-FP (HBFP) DNN training framework that maximizes fixed-point arithmetic and minimizes the mantissa width requirements while preserving convergence; (2) two optimizations to BFP, namely tiling and wide weight storage, to improve BFP's precision with modest area and memory bandwidth overhead; (3) an exploration of the HBFP design space showing that DNNs trained on BFP with 12- and 8-bit mantissas match FP32 accuracy, serving as a drop-in replacement for this representation; and (4) an FPGA prototype showing that HBFP exhibits arithmetic density similar to that of fixed-point hardware with the accuracy of FP32 hardware.

2 Related Work

Training and inference in DNNs with narrow representations are well-studied subjects. In this section, we review prior work.

Hybrid accelerators. The separation between dot products and other operations already exists in commodity hardware, in NVIDIA Volta's FP16 Tensor Cores [1] and in Google's Tensor Processing Unit [7] architecture. We take one step further and use different numeric representations for these different operations, enabling training with dense fixed-point arithmetic.

Inference with reduced precision. Quantization [8] is a widely used technique for DNN inference. BFP [9] has also been proposed for inference. These techniques quantize the weights of DNNs trained with full-precision floating point to use fixed-point logic during inference. We consider the more challenging task of training DNNs with arithmetic density that matches quantized inference.

Binarized and ternary neural networks.
Binarized [10] and ternary [11, 12] neural networks are another way to compress models. Although these networks enable inference with hardware that is orders of magnitude more efficient than floating-point hardware, they are trained like traditional neural networks, with both activations and parameters represented in floating point. Therefore, these approaches are orthogonal to BFP-based training. Other work [13, 14] uses binary operations for forward and backward passes but not for weight gradient calculation and accumulation. The new training algorithm is not transparent to users, requiring redesign of networks with the numeric representation in mind. In contrast, our approach is backwards compatible with FP32 models.

Training with end-to-end low precision. ZipML [15], DoReFa [6], and Flexpoint [5] train DNNs with end-to-end low precision. They use fixed-point arithmetic to represent weights and activations during forward and backward passes, and introduce various algorithms and restrictions to control the numeric range of activations or to select quantization points for the fixed-point representations. DoReFa [6] requires techniques to control the activations' magnitudes, and is unable to quantize the first and last layers of networks. Others [15, 16] take a more theoretical approach to find the optimal quantization points for each dataset, performing both computation and communication using fixed-point arithmetic. We use BFP instead, effectively computing quantization points by choosing exponents at a finer granularity, before every dot product.

Flexpoint [5] performs all computations in fixed point. It uses the Autoflex algorithm twice per minibatch to predict the occurrence of overflows and adjust the tensor exponents accordingly. They leverage the slowly changing nature of gradient exponents to minimize the number of exponent updates.
However, to minimize overflows, they end up requiring conservatively large exponents, leaving the higher bits of the mantissas unused and increasing mantissa width. Furthermore, Autoflex adds an artificial dependency between computations when it collects tensor value statistics, making it unsuitable for DNNs that employ dynamic dataflow and limiting training scalability, since it restricts the way DNNs can be sliced for distributed training. Our approach computes exponents more frequently, and it does so in-device, without requiring any additional statistics collection, accommodating dynamic dataflows naturally. We observe that, as long as a dot product calculation's intermediate values remain in fixed-point-like representations, conversions are infrequent enough that the hardware area dedicated to them accounts for an insignificant fraction of the total accelerator area.

3 Specialized Arithmetic for DNNs

Due to the massive computational requirements of DNNs employed in datacenter-scale online services, operators such as Google have started adopting specialized numeric representations for DNNs. So far, accelerators have employed fixed-point representations for inference [7], and narrow floating-point representations [1, 17] for training. From a hardware design perspective, the use of reduced-precision arithmetic allows silicon designers to improve logic density and energy efficiency, while minimizing the number of bits used to represent models relaxes demands on both memory capacity and bandwidth.
From the user's perspective, arithmetic representations must be easy to use, without sacrificing accuracy or requiring any algorithmic techniques to recover performance.

Table 1: Validation test error of ResNet-20 on CIFAR-10 with narrow FP representations.

Mantissa bit-width | 2   | 4      | 8     | 24
Test error         | N/A | 9.77%  | 8.05% | 8.42%

Exponent bit-width | 2   | 6      | 8
Test error         | N/A | 14.67% | 8.42%

FP32 representations are easy to use but inefficient. They represent numbers with a 24-bit mantissa and an 8-bit exponent. In terms of precision, the 24-bit mantissa is overkill for DNNs. Table 1 shows the validation error obtained when training ResNet-20 models on CIFAR-10 using floating-point representations with various mantissa and exponent widths. We observed convergence without loss of precision with 8-bit mantissas, convergence with a small loss of precision with 4-bit mantissas, and divergence only when using 2-bit mantissas. Exponent width, however, cannot be reduced because of its impact on numeric range. We observed diminished validation precision when reducing the exponent width from 8 to 6 bits, and divergence when using 2-bit exponents.

Hardware developers have made the same observation, leading the state of the art to quickly drift towards narrow floating-point representations. One prominent example is FP16. FP16 is denser than FP32, employing 11-bit mantissas and 5-bit exponents. However, FP16's logic overhead is still high compared to that of fixed point. For instance, although the area of an FP16 multiplier is 4.7× smaller than that of an FP32 multiplier, it is still 5.8× larger than its 8-bit fixed-point counterpart [3]. FP16 also introduces complexity for users, as the 5-bit exponent results in a narrow range that is not sufficient to represent gradients throughout the training process [4].
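The range limits quoted above follow directly from the exponent width: an e-bit IEEE-style exponent field (bias 2^(e−1)−1) bounds the smallest normal value a format can hold. A quick back-of-the-envelope check, using a hypothetical helper and ignoring subnormals for simplicity:

```python
def smallest_normal(exp_bits):
    """Smallest positive normal value of an IEEE-style format with
    `exp_bits` exponent bits (bias = 2**(exp_bits-1) - 1).
    Simplified sketch: subnormals are ignored."""
    bias = 2 ** (exp_bits - 1) - 1
    return 2.0 ** (1 - bias)

fp16_min = smallest_normal(5)   # FP16's 5-bit exponent: 2**-14
wide_min = smallest_normal(8)   # 8-bit exponent (FP32-like range): 2**-126

grad = 1e-6  # a plausible late-training gradient magnitude
fp16_underflows = grad < fp16_min   # FP16 cannot hold it as a normal value
wide_ok = grad > wide_min           # an 8-bit exponent can
```

This is the arithmetic behind the gradient-underflow problem [4]: a 1e-6 gradient sits far below FP16's 2^-14 normal floor but comfortably above the 2^-126 floor of an 8-bit-exponent format.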
DNN training requires numeric representations with wide range because, as the loss value and the learning rate decrease, the gradient values also decrease, often by several orders of magnitude. To mitigate this issue, Google has moved to a 16-bit floating-point [2] representation that employs 8 bits for both the mantissa and the exponent, to improve the dynamic range.

Given these requirements, we identify block floating point (BFP) as the ideal numeric representation for DNNs. Like floating point, BFP represents numbers with mantissas and an exponent, and therefore exhibits a wide dynamic range. However, BFP logic is denser because exponents are shared across entire tensors, resulting in dot products that can be computed entirely in fixed-point logic. Because the vast majority of the arithmetic operations executed by DNN training and inference are dot products, we are able to fold almost all the training computation into fixed-point logic.

4 DNN Training With BFP Arithmetic

Equation (1) computes the real value $a_i$ of an element $i$ of a BFP tensor $a$ with mantissa $m^a_i$ and exponent $e^a$:

$$a_i = m^a_i \times 2^{e^a} \quad (1)$$

BFP can only represent $a$ accurately if the value distribution of $a$ is not too wide to be captured by $m^a$ and the exponent $e^a$ is representative of said value distribution. If $e^a$ is too large, then small values are lost and the most significant bits of the mantissas are wasted. If $e^a$ is too small, then the larger values in $a$ will be saturated, leading to data loss.

Equation (2) calculates the dot product between BFP tensors $a$ and $b$, each with $N$ elements:

$$a \cdot b = \sum_{i=1}^{N} \left( (m^a_i \times 2^{e^a}) \times (m^b_i \times 2^{e^b}) \right) = 2^{e^a + e^b} \times (m^a \cdot m^b) \quad (2)$$

The dot product $m^a \cdot m^b$ is computed entirely in fixed-point arithmetic, without alignment of intermediate values, since all elements $m^a_i$ and $m^b_i$ are fixed point. In a matrix multiplication $A \times B$, it is enough for $A$ to have one exponent per row, and $B$ one exponent per column. BFP matrix multiplications can also be tiled. With tiled matrices, tile multiplications are performed in fixed point, and their results are accumulated in floating-point arithmetic, requiring mantissa realignment.

4.1 Hybrid Block Floating Point (HBFP) DNN Training

We propose the use of BFP for all dot product computations, with other operations performed in floating-point representations. This configuration enables the bulk of the DNN operations to be performed in fixed-point logic and facilitates the use of various activation functions or techniques like batch normalization without the restrictions imposed by BFP.

HBFP is superior to pure BFP for two reasons. First, using BFP for all operations may lead to divergence unless wide mantissas are employed. DNN operations often result in tensors with value distributions that can be too wide for BFP, leading to loss of values at the edges of the distributions (i.e., values that are too small or too large). Thus, most operations cannot tolerate taking BFP values as inputs, as they may change the value distributions in non-trivial ways, with both small and large values having an impact on the results. Dot products, however, do not face this problem. Dot products are reductions, and thus the input tensors' largest values dominate the sum, with small values having little impact on the final result. Consequently, BFP dot products tolerate data loss as long as tensor exponents are large enough to avoid saturation.

The second reason is the area overhead of general-purpose BFP operations.
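Equation (2) is the crux of BFP's hardware appeal: the whole reduction runs on integer multiply-accumulates, and the two shared exponents enter exactly once, after the sum. A minimal Python sketch (`bfp_dot` is a hypothetical name; mantissas are signed integers whose LSB is worth 2^(e−(mant_bits−1)), consistent with one shared exponent per block):

```python
def bfp_dot(exp_a, mants_a, exp_b, mants_b, mant_bits):
    """Dot product of two BFP tensors per Equation (2): the mantissa dot
    product is pure integer arithmetic; the shared exponents are applied
    once, after the reduction. Illustrative sketch only."""
    acc = 0  # wide fixed-point accumulator
    for ma, mb in zip(mants_a, mants_b):
        acc += ma * mb  # integer multiply-and-accumulate, no realignment
    # 2^(ea+eb) from Equation (2), corrected for the two mantissa LSB weights.
    return acc * 2.0 ** (exp_a + exp_b - 2 * (mant_bits - 1))

# a = [6.0, -2.0] encoded as mantissas [96, -32] with exponent 3 (LSB 2**-4);
# b = [1.0, 0.75] encoded as mantissas [64, 48] with exponent 1 (LSB 2**-6).
result = bfp_dot(3, [96, -32], 1, [64, 48], mant_bits=8)  # 6*1 + (-2)*0.75
```

Note that the loop body touches no exponents at all — this is the work a fixed-point MAC array performs, with the single exponent addition amortized over the entire reduction.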
BFP can lead to costly floating-point-like hardware in the general case, since it may lead to numerous mantissa realignments and expensive exponent computations to ensure that tensors' exponents match value distributions. BFP dot products are denser because the overhead of exponent calculations and mantissa realignments is amortized over the reduction. For instance, in a dot product between two tensors with N elements, BFP leads to one mantissa realignment for every 2 × N operations, while a BFP ReLU operation, for instance, requires one mantissa realignment per operation.

We propose to use BFP in all dot-product-based operations present in DNNs (i.e., convolutions, matrix multiplications, and outer products), and floating-point representations for all other operations (i.e., activations, regularizations, etc.). We store long-lasting model state (i.e., weights) in BFP and transient activation values in floating point. We convert tensors to BFP before every dot product, using the exponent of the largest tensor value, and convert the result back to floating point afterwards.

4.2 Minimizing BFP Data Loss

The amount of data loss incurred by BFP is determined by two factors: the size of the tensors that share exponents and the width of the mantissas. We devise two optimizations to mitigate each of these issues with modest silicon area and memory bandwidth overhead: tiling and wide weight storage.

Tiling: Matrix multiplications are often tiled to improve locality in intermediate caching storage. We observe that BFP can also benefit from tiling. More specifically, we divide the weight matrices into tiles of a predefined size and share exponents within tiles. Tiling bounds the number of values that share exponents, reducing data loss. This optimization incurs some silicon density penalty because the resulting tiles need to be accumulated using floating-point arithmetic.
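The data-loss benefit of tiling can be seen in a toy example, assuming the same largest-value exponent policy described above (`quantize_block` and `quantize_tiled` are hypothetical names for illustration):

```python
import math

def quantize_block(values, mant_bits):
    """Round a block of floats onto one shared-exponent fixed-point grid
    (a single BFP block). Hypothetical sketch."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return list(values)
    lsb = 2.0 ** (math.frexp(max_abs)[1] - (mant_bits - 1))
    return [round(v / lsb) * lsb for v in values]

def quantize_tiled(values, mant_bits, tile):
    """Share exponents only within tiles of `tile` elements each."""
    out = []
    for i in range(0, len(values), tile):
        out.extend(quantize_block(values[i:i + tile], mant_bits))
    return out

# One outlier forces a large shared exponent on everything, wiping out
# the small values; tiling confines that effect to the outlier's tile.
data = [512.0, 0.01, 0.02, 0.03]
whole = quantize_block(data, mant_bits=8)
tiled = quantize_tiled(data, mant_bits=8, tile=2)
```

With a single shared exponent, every small value collapses to zero; with two-element tiles, only the value co-located with the outlier is lost.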
Nevertheless, the overhead is small: for a tile size of N × N, tiling incurs one extra floating-point operation every 2 × N operations. For large tiles, the area of the extra floating-point adders is negligible compared to the area of the N multiply-and-accumulate units.

Wide weight storage: To minimize data loss in long-lasting training state, we store weights with wider mantissas. All operations are still executed using the original mantissa; only weight updates use the wider mantissa. Therefore, we still reduce the memory bandwidth requirements for the forward and backward passes, during which only the most significant bits of the weights are accessed. The least significant bits of the weight matrices are only accessed by weight updates.

5 Methodology

5.1 HBFP Simulation on GPU

We train DNNs with the proposed HBFP approach, using BFP in the compute-intensive operations (matrix multiplications, convolutions, and their backward passes) and FP32 in the other operations. We simulate BFP dot products on GPUs by modifying PyTorch's [18] linear and convolution layers to reproduce the behaviour of BFP matrix multipliers. We redefined PyTorch's convolution and linear modules using its autograd.function feature to create new modules that process the inputs and outputs of both the forward and backward passes to simulate BFP. In the forward pass, we convert the activations to BFP, giving the x tensor one exponent per training input. Then we execute the target operation in native floating-point arithmetic. In the backward pass, we perform the same pre-/post-processing of the inputs/outputs of the x derivative.

We handle the weights in the optimizer. We created a shell optimizer that takes the original optimizer, performs its update function in FP32, and converts the weights to two BFP formats: one with wide and another with narrow mantissas.
The former is used in future weight updates while the latter is used in forward and backward passes. We also use this same mechanism to simulate different tile sizes for weight matrices. Finally, for convolutional layers, we tile the two outer feature map dimensions of the weight matrices.

5.2 Evaluation Setup

Datasets. We experiment with a set of popular image classification tasks on the CIFAR-100 [19], SVHN [20], and ImageNet [21] datasets. We used standard data augmentation [22, 23] for CIFAR-100 and no augmentation for SVHN. We also evaluate language modeling tasks with the Penn Tree Bank (PTB) dataset [24].

Evaluation metrics. To evaluate the impact of HBFP and explore the design space of different BFP implementations, we tune the models using FP32, and then train the same models from scratch with the same hyperparameters in HBFP. For the image classification experiments, we report training loss and validation top-1 error. For the language modeling models, we report training loss and validation perplexity.

Training. We use a WideResNet [25] trained on CIFAR-100 to explore the BFP design space, evaluating models trained with various mantissa widths and various tile sizes. To show that HBFP is a viable alternative to FP32, we train a wide range of models using various datasets. We train ResNet [22], WideResNet [25], and DenseNet [26] models on the CIFAR-100 and SVHN datasets; a ResNet model on ImageNet; and the LSTM from [27] on PTB. We trained all models using the same hyperparameters reported in their respective original papers.

5.3 Hardware Prototype Implementation

HBFP accelerators exhibit arithmetic density that is similar to their fixed-point counterparts. To further illustrate this point, we synthesized a proof-of-concept FPGA-based accelerator. Figure 2 shows the block diagram of the accelerator.
Grey boxes and arrows indicate buffers, units, and dataflow in BFP format, while other colors correspond to FP. We implemented the basic operations needed for neural network training (i.e., matrix multiplication, transpose, convolutions, outer product, weight update, and data movement operations) using a dataflow similar to [28]. We employ a matrix multiplication (MatMul) unit followed by an activation/loss unit, sized to maximize resource utilization in the FPGA. The MatMul output width matches the activation/loss units' input width to avoid backpressure. The FP-to-BFP unit detects the maximum exponent of incoming FP tensors and normalizes their mantissas accordingly, while the BFP-to-FP unit takes the results computed in the wide accumulators present in the MatMul unit, normalizes and truncates their mantissas, and computes their exponents. Hence, the MatMul unit never causes overflows or saturation. We employ stochastic rounding [29] during mantissa truncation, using a Xorshift random number generator [30]. Xorshift is a very small random number generator, employing three constant shifts and three xor operations, and it has been shown [31] to work well for stochastic rounding. Finally, weight updates are done entirely in the activation unit, in floating point. The proof-of-concept accelerator operates with both weights and activations stored on-chip.

Figure 2: HBFP accelerator with BFP.

Table 2: Test error of image classification models. RN, WRN and DN indicate ResNet, WideResNet and DenseNet, respectively. hbfpX_Y indicates an experiment with X-bit mantissas, Y-bit weight storage, and a tile size of 24.
All dot product operations are performed in X-bit arithmetic.

           | CIFAR-100                 | SVHN                     | ImageNet
           | RN-50   WRN-28-10  DN-40  | RN-50  WRN-16-8  DN-40   | RN-50
fp32       | 26.07%  20.35%     26.03% | 1.89%  1.80%     2.00%   | 23.64%
hbfp8_16   | 25.12%  20.78%     26.27% | 1.98%  1.79%     1.98%   | 23.88%
hbfp12_16  | 25.10%  20.78%     25.82% | 1.96%  1.85%     1.94%   | 23.58%

6 Evaluation

We now evaluate DNN training with HBFP. We explore the design space of BFP, finding its best-performing configurations. We vary both the mantissa width and the tile sizes. Then we move on to evaluate HBFP on various datasets and tasks, to show that HBFP is indeed a drop-in replacement for FP32. Finally, we evaluate the throughput gains obtained with HBFP using our hardware prototype.

BFP design space: We train WideResNet-28-10 models on CIFAR-100 using various HBFP configurations. To experiment with the mantissa width, we train models with 4-, 8-, 12- and 16-bit wide mantissas. All models with mantissas of 8 bits or wider reach a final validation error within 1% of the FP32 baseline; only 4-bit mantissas show a large accuracy gap, with 4.1% higher error. We also evaluate models with 8- and 12-bit mantissas paired with 16-bit weight storage. We observe small accuracy improvements of 0.21% and 0.43% over their counterparts with narrow weight storage. We observe similar trends on other models.

We also train HBFP with various tile sizes. Tile sizes of 24 × 24 and 64 × 64 yield accuracy similar to FP32, with errors within 0.5% of the baseline. HBFP without tiles results in a larger error increase, of 0.8% over FP32, because it often forces large weight matrices to share exponents. Again, we observe similar trends on other models.

The sweet spot in the design space is HBFP with 8- to 12-bit mantissas, 16-bit weight storage, and a tile size of 24.
This configuration matches FP32 quality while improving arithmetic density and reducing memory bandwidth requirements. Using 8-bit mantissas reduces the memory bandwidth requirements of the forward and backward passes by up to 4× compared to FP32. HBFP stores activations in floating-point format. While doing so may increase bandwidth requirements, we observe that these activations can be stored in narrow floating-point representations or even in summarized formats (e.g., for ReLU, only a single bit per value needs to be saved for the backward pass). Furthermore, activations account for a small fraction of the memory traffic when training DNNs. While activation traffic is dwarfed by weight traffic in fully connected layers, in convolutional layers the computation-to-communication ratio is so high that the memory traffic incurred by activations is not a significant throughput factor.

Table 3: Perplexity of language modeling models. hbfpX_Y indicates an experiment with X-bit mantissas, Y-bit weight storage, and a tile size of 24. All dot product operations are performed in X-bit arithmetic.

HBFP vs. FP32: Table 2 reports the validation error for all the image classification models, and Table 3 reports the validation perplexity of the language modeling model. In addition, Figure 3 illustrates the training process for three of the evaluated models: a WideResNet-28-10 trained on CIFAR-100, a ResNet-50 trained on ImageNet, and an LSTM trained on PTB. HBFP matches the performance of FP32 in all the models and datasets tested.
We conclude that HBFP is indeed a drop-in replacement for FP32 for a wide set of tasks, leading to models that are more compact and enabling hardware accelerators that use fixed-point arithmetic for most of the DNN computations.

LSTM-PTB validation perplexity (Table 3): fp32: 61.31; hbfp8_16: 61.86; hbfp12_16: 61.35.

Figure 3: Comparison between HBFP and FP32. hbfpX_Y indicates an experiment with X-bit mantissas, Y-bit weight storage, and a tile size of 24. All dot product operations are performed with X-bit arithmetic. (a) WideResNet-28-10 trained on CIFAR-100 for 250 epochs. (b) ResNet-50 trained on ImageNet for 90 epochs. (c) LSTM trained on PTB for 500 epochs.

HBFP silicon density and performance estimation: We synthesize the accelerator on a Stratix V 5SGSD5 FPGA at a clock rate of 200 MHz. We achieve a maximum throughput of 1 TOp/s using 8-bit wide multiply-and-add units in the matrix multiplier and floating-point activations (with 8-bit mantissas plus 8-bit exponents). The activation units occupy less than 10% of the FPGA resources, resulting in an 8.5× throughput improvement over a variant of the accelerator that employs FP16 multiply-and-add units on the same FPGA. Finally, the conversion units occupy less than 1% of the FPGA resources and incur no performance overhead.

7 Conclusion

DNNs have become ubiquitous in datacenter settings, forcing operators to adopt specialized hardware to execute and train them. However, DNN training still depends on floating-point representations for convergence, severely limiting the efficiency of accelerators. In this paper, we propose HBFP, a hybrid BFP-FP number representation for DNN training. We show that HBFP leads to efficient hardware, with the bulk of the silicon real estate spent on efficient fixed-point logic.
Finally, we evaluate HBFP and show that, for all models evaluated, BFP-FP training matches its FP32 counterparts while resulting in 2× more compact models. HBFP also leads to faster accelerators, with 8-bit BFP achieving 8.5× higher throughput compared with FP16. Higher throughput leads to faster and more energy-efficient DNN training/inference, while model compression leads to lower bandwidth requirements for off-chip memory, lower capacity requirements for on-chip memory, and lower communication bandwidth requirements for distributed training.

Acknowledgements

The authors thank the anonymous reviewers, Mark Sutherland, Siddharth Gupta, and Alexandros Daglis for their precious comments and feedback. We also thank Ryota Tomioka and Eric Chung for many inspiring conversations on low-precision DNN processing. This work has been partially funded by the ColTraIn project of the Microsoft-EPFL Joint Research Center and by SNSF grant 200021_175796.