{"title": "Latent Weights Do Not Exist: Rethinking Binarized Neural Network Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 7533, "page_last": 7544, "abstract": "Optimization of Binarized Neural Networks (BNNs) currently relies on real-valued latent weights to accumulate small update steps. In this paper, we argue that these latent weights cannot be treated analogously to weights in real-valued networks. Instead their main role is to provide inertia during training. We interpret current methods in terms of inertia and provide novel insights into the optimization of BNNs. We subsequently introduce the first optimizer specifically designed for BNNs, Binary Optimizer (Bop), and demonstrate its performance on CIFAR-10 and ImageNet. Together, the redefinition of latent weights as inertia and the introduction of Bop enable a better understanding of BNN optimization and open up the way for further improvements in training methodologies for BNNs.", "full_text": "Latent Weights Do Not Exist: Rethinking Binarized\n\nNeural Network Optimization\n\nKoen Helwegen1, James Widdicombe1, Lukas Geiger1, Zechun Liu2, Kwang-Ting Cheng2,\n\nand Roeland Nusselder1\n\n1Plumerai Research\n\n{koen, james, lukas, roeland}@plumerai.com\n2Hong Kong University of Science and Technology\nzliubq@connect.ust.hk, timcheng@ust.hk\n\nAbstract\n\nOptimization of Binarized Neural Networks (BNNs) currently relies on real-valued\nlatent weights to accumulate small update steps. In this paper, we argue that these\nlatent weights cannot be treated analogously to weights in real-valued networks.\nInstead their main role is to provide inertia during training. We interpret current\nmethods in terms of inertia and provide novel insights into the optimization of\nBNNs. We subsequently introduce the \ufb01rst optimizer speci\ufb01cally designed for\nBNNs, Binary Optimizer (Bop), and demonstrate its performance on CIFAR-10 and\nImageNet. 
Together, the rede\ufb01nition of latent weights as inertia and the introduction\nof Bop enable a better understanding of BNN optimization and open up the way\nfor further improvements in training methodologies for BNNs. Code is available\nat: https://github.com/plumerai/rethinking-bnn-optimization.\n\n1 Introduction\n\nSociety can be transformed by utilizing the power of deep learning outside of data centers: self-driving\ncars, mobile-based neural networks, smart edge devices, and autonomous drones all have the potential\nto revolutionize everyday lives. However, existing neural networks have an energy budget which\nis far beyond the scope of many of these applications. Binarized Neural Networks (BNNs) have\nemerged as a promising solution to this problem. In these networks both weights and activations are\nrestricted to {\u22121, +1}, resulting in models which are dramatically less computationally expensive,\nhave a far lower memory footprint, and when executed on specialized hardware yield a stunning\nreduction in energy consumption. After the pioneering work on BinaryNet [1] demonstrated such\nnetworks could be trained on a large task like ImageNet [2], numerous papers have explored new\narchitectures [3\u20137], improved training methods [8] and sought to develop a better understanding of\ntheir properties [9].\nThe understanding of BNNs, in particular of their training algorithms, has been strongly in\ufb02uenced\nby knowledge of real-valued networks. Critically, all existing methods use \u201clatent\u201d real-valued\nweights during training in order to apply traditional optimization techniques. However, many insights\nand intuitions inspired by real-valued networks do not directly translate to BNNs. 
Overemphasizing the connection between BNNs and their real-valued counterparts may result in cumbersome\nmethodologies that obscure the training process and hinder a deeper understanding.\nIn this paper we develop an alternative interpretation of existing training algorithms for BNNs and\nsubsequently argue that latent weights are not necessary for gradient-based optimization of BNNs.\nWe introduce a new optimizer based on these insights, which, to the best of our knowledge, is the \ufb01rst\noptimizer designed speci\ufb01cally for BNNs, and empirically demonstrate its performance on CIFAR-10\n[10] and ImageNet.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nAlthough we study the case where both activations and weights are binarized,\nthe ideas and techniques developed here concern only the binary weights and make no assumptions\nabout the activations, and hence can be applied to networks with activations of arbitrary precision.\nThe paper is organized as follows. In Section 2 we review existing training methods for BNNs. In\nSection 3 we give a novel explanation of why these techniques work as well as they do and suggest\nan alternative approach in Section 4. In Section 5 we give empirical results of our new optimizer on\nCIFAR-10 and ImageNet. We end by discussing promising directions in which BNN optimization\nmay be further improved in Section 6.\n\n2 Background: Training BNNs with Latent Weights\nConsider a neural network, y = f (x, w), with weights, w \u2208 R^n, and a loss function, L(y, ylabel),\nwhere ylabel is the correct prediction corresponding to sample x. We are interested in \ufb01nding a binary\nweight vector, w*bin, that minimizes the expected loss:\n\nw*bin = argmin_{wbin \u2208 {\u22121,+1}^n} E_{x,y}[L(f(x, wbin), ylabel)].   (1)\n\nIn contrast to traditional, real-valued supervised learning, Equation 1 adds the additional constraint\nthat the solution be a binary vector. 
Usually, a global optimum cannot be found; in real-valued\nnetworks, an approximate solution is instead obtained via Stochastic Gradient Descent (SGD) based methods.\nThis is where training BNNs becomes challenging. Suppose that we can evaluate the gradient \u2202L/\u2202w for\na given tuple (x, w, y). The question then is: how can we use this gradient signal to update w if w is\nrestricted to binary values?\nCurrently, this problem is resolved by introducing an additional real-valued vector \u02dcw during training.\nWe call these latent weights. During the forward pass we binarize the latent weights, \u02dcw,\ndeterministically such that\n\nwbin = sign(\u02dcw)   (forward pass).   (2)\n\nThe gradient of the sign operation vanishes almost everywhere, so we rely on a \u201cpseudo-gradient\u201d to\nget a gradient signal on the latent weights, \u02dcw [1, 11]. In the simplest case this pseudo-gradient, \u03a6, is\nobtained by replacing the binarization during the backward pass with the identity:\n\n\u03a6(L, \u02dcw) := \u2202L/\u2202wbin \u2248 \u2202L/\u2202\u02dcw   (backward pass).   (3)\n\nThis simple case is known as the \u201cStraight-Through Estimator\u201d (STE) [12, 11]. The full optimization\nprocedure is outlined in Algorithm 1. The combination of pseudo-gradient and latent weights makes it\npossible to apply a wide range of known methods to BNNs, including various optimizers (Momentum,\nAdam, etc.) and regularizers (L2-regularization, weight decay) [13\u201315].\nLatent weights introduce an additional layer to the problem and make it harder to reason about the\neffects of different optimization techniques in the context of BNNs. A better understanding of latent\nweights will aid the deployment of existing optimization techniques and can guide the development\nof novel methods.\nFor the sake of completeness we should mention there exists a closely related line of research which\nconsiders stochastic BNNs [16, 17]. 
These networks fall outside the scope of the current work and in\nthe remainder of this paper we focus exclusively on fully deterministic BNNs.\n\nAlgorithm 1: Training procedure for BNNs using latent weights. Note that the optimizer A may\nbe stateful, although we have suppressed the state in our notation for simplicity.\ninput: Loss function L(f(x; w), y), Batch size K\ninput: Optimizer A : g \u21a6 \u03b4w, learning rate \u03b1\ninput: Pseudo-gradient \u03a6 : L(f(x; w), y) \u21a6 g \u2208 R^n\ninitialize \u02dcw \u2190 \u02dcw0 \u2208 R^n;\nwhile stopping criterion not met do\n  Sample minibatch {x(1), ..., x(K)} with labels y(k);\n  Perform forward pass using wbin = sign(\u02dcw);\n  Compute gradient: g \u2190 (1/K) \u03a6(\u2211_k L(f(x(k); wbin), y(k)));\n  Update latent weights: \u02dcw \u2190 \u02dcw + \u03b1 \u00b7 A(g);\nend\n\n3 The Role of Latent Weights\n\nLatent weights absorb network updates. However, due to the binarization function, modi\ufb01cations do\nnot alter the behavior of the network unless a sign change occurs. For this reason, we suggest that the\nlatent weight can be better understood when thinking of its sign and magnitude separately:\n\n\u02dcw = sign(\u02dcw) \u00b7 |\u02dcw| =: wbin \u00b7 m,   wbin \u2208 {\u22121, +1},   m \u2208 [0, \u221e).   (4)\n\nThe role of the magnitude of the latent weights, m, is to provide inertia to the network. As the inertia\ngrows, a stronger gradient signal is required to make the corresponding binary weight \ufb02ip. Each\nbinary weight, wbin, can build up inertia, m, over time as the magnitude of the corresponding latent\nweight increases. 
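To make the latent-weight procedure of Algorithm 1 concrete, the update with the STE pseudo-gradient and vanilla SGD as the optimizer A can be sketched in a few lines of plain Python. This is an illustrative toy sketch only: the fixed gradient values and function names are ours, and a real implementation would obtain the gradients by backpropagation through the network.

```python
def sign(x):
    return 1.0 if x >= 0 else -1.0  # sign(0) := +1, a common convention in BNNs

def latent_sgd_step(latent, grad_wrt_binary, lr, clip=1.0):
    """One step of Algorithm 1 with vanilla SGD as the optimizer A.

    The STE passes the gradient w.r.t. the binary weight straight through
    to the latent weight; the latent weight accumulates the small updates
    and is clipped to [-1, 1], as is common practice in the literature.
    """
    updated = []
    for w, g in zip(latent, grad_wrt_binary):
        w = w - lr * g                # accumulate a small real-valued step
        w = max(-clip, min(clip, w))  # clipping caps the accumulated inertia
        updated.append(w)
    return updated

# A binary weight only flips once enough consistent updates have
# accumulated to change the sign of its latent weight:
latent = [0.3, -0.2]
for _ in range(5):
    latent = latent_sgd_step(latent, grad_wrt_binary=[0.1, -0.1], lr=1.0)
binary = [sign(w) for w in latent]  # both weights have now flipped: [-1.0, 1.0]
```

Note how the intermediate updates change nothing about the forward pass until a latent sign crosses zero, which is precisely the inertia effect discussed above.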
Therefore, latent weights are not weights at all: they encode both the binary weight,\nwbin, and a corresponding inertia, m, which is really an optimizer variable much like momentum.\nWe contrast this inertia-based view with the common perception in the literature, which is to see the\nbinary weights as an approximation to the real-valued weight vector. In the original BinaryConnect\npaper, the authors describe the binary weight vector as a discretized version of the latent weight, and\ndraw an analogy to Dropout in order to explain why this may work [18, 19]. Anderson and Berg\nargue that binarization works because the angle between the binarized vector and the weight vector is\nsmall [9]. Li et al. prove that, for a quadratic loss function, in BinaryConnect the real-valued weights\nconverge to the global minimum, and argue this explains why the method outperforms Stochastic\nRounding [20]. Merolla et al. challenge the view of approximation by demonstrating that many\nprojections, onto the binary space and other spaces, achieve good results [21].\nA simple experiment suggests the approximation viewpoint is problematic. After training the BNN,\nwe can evaluate the network using the real-valued latent weights instead of the binarized weights,\nwhile keeping the binarization of the activations. If the approximation view is correct, using the\nreal-valued weights should result in a higher accuracy than using the binary weights. We \ufb01nd this is\nnot the case. 
Instead, we consistently see a comparable or lower train and validation accuracy when\nusing the latent weights, even after retraining the batch statistics.\nThe concept of inertia enables us to better understand what happens during the optimization of BNNs.\nBelow we review some key aspects of the optimization procedure from the perspective of inertia.\nFirst and foremost, we see that in the context of BNNs, the optimizer is mostly changing the inertia\nof the network rather than the binary weights themselves. The inertia variables have a stabilizing\neffect: after being pushed in one direction for some time, a stronger signal in the reverse direction is\nrequired to make the weight \ufb02ip. Meanwhile, clipping of latent weights, as is common practice in the\nliterature, in\ufb02uences training by capping the inertia that can be accumulated.\nIn the optimization procedure de\ufb01ned by Algorithm 1, scaling of the learning rate does not have the\nrole one may expect, as is made clear by the following theorem:\nTheorem 1. The binary weight vector generated by Algorithm 1 is invariant under scaling of the\nlearning rate, \u03b1, provided the initial conditions are scaled accordingly and the pseudo-gradient, \u03a6,\ndoes not depend on |\u02dcw|.\nThe proof of Theorem 1 is presented in the appendix. An immediate corollary is that in this setting\nwe can set an arbitrary learning rate for every individual weight as long as we scale the initialization\naccordingly.\nWe should emphasize that the conditions of Theorem 1 are rarely met: usually latent weights are clipped,\nand many pseudo-gradients depend on the magnitude of the latent weight. Nevertheless, in experiments\nwe have observed that the advantages of various learning rates can also be achieved by scaling\nthe initialization. 
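Theorem 1 is easy to verify numerically when its conditions hold. The toy sketch below (our own illustration, not from the paper) runs Algorithm 1 with plain SGD, no clipping, and random stand-in pseudo-gradients that do not depend on |˜w|; scaling the learning rate and the initialization by the same power of two (chosen so the floating-point comparison is exact) yields an identical sequence of binary weight vectors.

```python
import random

def sign(x):
    return 1.0 if x >= 0 else -1.0

def binary_trajectory(latent0, lr, steps, seed):
    """Algorithm 1 with vanilla SGD and no clipping.

    The pseudo-gradients are random stand-ins, independent of the latent
    magnitude as Theorem 1 requires; the same seed reproduces them.
    """
    rng = random.Random(seed)
    latent = list(latent0)
    trajectory = []
    for _ in range(steps):
        grads = [rng.gauss(0.0, 1.0) for _ in latent]
        latent = [w - lr * g for w, g in zip(latent, grads)]
        trajectory.append([sign(w) for w in latent])
    return trajectory

# Scaling the learning rate by c while scaling the initialization by c
# leaves the binary weight trajectory unchanged:
w0, c = [0.5, -0.25, 0.125], 4.0
t1 = binary_trajectory(w0, lr=0.25, steps=100, seed=42)
t2 = binary_trajectory([c * w for w in w0], lr=0.25 * c, steps=100, seed=42)
```

Dyadic constants are used so both runs round identically in floating point; with arbitrary constants the two trajectories agree up to rounding.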
For example, when using SGD and Glorot initialization [11] a learning rate of 1\nperforms much better than 0.01; but when we multiply the initialized weights by 0.01 before starting\ntraining, we obtain the same improvement in performance.\nTheorem 1 also helps to understand why reducing the learning rate after training for some time helps:\nit effectively increases the already accumulated inertia, thus reducing noise during training. Other\ntechniques that modify the magnitude of update-steps, such as the normalizing aspect of Adam and\nthe layerwise scaling of learning rates introduced in [1], should be understood in similar terms. Note\nthat the ceiling on inertia introduced by weight clipping may also play a role, and a full explanation\nrequires further analysis.\nClearly, the bene\ufb01ts of using Momentum and Adam over vanilla-SGD that have been observed for\nBNNs [8] cannot be explained in terms of characteristics of the loss landscape (curvature, critical\npoints, etc.) as is common in the real-valued context [22, 14, 23, 24]. We hypothesize that the main\neffect of using Momentum is to reduce noisy behavior when the latent weight is close to zero. As\nthe latent weight changes signs, the direction of the gradient may reverse. In such a situation, the\npresence of momentum may avoid a rapid sign change of the binary weight.\n\n4 Bop: a Latent-Free Optimizer for BNNs\n\nIn this section we introduce the Binary Optimizer, referred to as Bop, which is, to the best of our\nknowledge, the \ufb01rst optimizer designed speci\ufb01cally for BNNs. It is based on three key ideas.\nFirst, the optimizer has only a single action available: \ufb02ipping weights. Any concept used in the\nalgorithm (latent weights, learning rates, update steps, momentum, etc.) only matters insofar as it\naffects weight \ufb02ips. 
In the end, any gradient-based optimization procedure boils down to a single\nquestion: how do we decide whether to \ufb02ip a weight or not, based on a sequence of gradients? A\ngood BNN optimizer provides a concise answer to this question and all concepts it introduces should\nhave a clear relation to weight \ufb02ips.\nSecond, it is necessary to take into account past gradient information when determining weight \ufb02ips:\nit matters that a signal is consistent. We de\ufb01ne a gradient signal as the average gradient over a number\nof training steps. We say a signal is more consistent if it is present in longer time windows. The\noptimizer must pay attention to consistency explicitly because the weights are binary. There is no\naccumulation of update steps.\nThird, in addition to consistency, there is meaningful information in the strength of the gradient signal.\nHere we de\ufb01ne strength as the absolute value of the gradient signal. As compared to real-valued\nnetworks, in BNNs there is only a weak relation between the gradient signal and the change in loss\nthat results from a \ufb02ip, which makes the optimization process more noisy. By \ufb01ltering out weak\nsignals, especially during the \ufb01rst phases of training, we can reduce this noisiness.\nIn Bop, which is described in full in Algorithm 2, we implement these ideas as follows. We select\nconsistent signals by looking at an exponential moving average of gradients:\n\nm_t = (1 \u2212 \u03b3) m_{t\u22121} + \u03b3 g_t = \u03b3 \u2211_{r=0}^{t} (1 \u2212 \u03b3)^{t\u2212r} g_r,   (5)\n\nwhere g_t is the gradient at time t, m_t is the exponential moving average and \u03b3 is the adaptivity rate.\nA high \u03b3 leads to quick adaptation of the exponential moving average to changes in the distribution\nof the gradient.\nIt is easy to see that if the gradient g^i_t for some weight i is sampled from a stable distribution, m^i_t\nconverges to the expectation of that distribution. 
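This convergence is easy to check numerically. In the toy sketch below (ours; the "gradients" are i.i.d. Gaussian samples standing in for minibatch gradients), the moving average of Equation (5) settles near the true mean, with residual noise that shrinks as γ decreases:

```python
import random

# Exponential moving average (Equation 5) applied to noisy stand-in
# "gradients" drawn i.i.d. from a distribution with mean 0.3: the average
# m approaches that mean, up to residual noise of order sqrt(gamma / 2).
rng = random.Random(0)
gamma, m, true_mean = 0.005, 0.0, 0.3
for _ in range(3000):
    g = rng.gauss(true_mean, 1.0)    # one sampled gradient
    m = (1 - gamma) * m + gamma * g  # EMA update of Equation (5)
# m now lies close to true_mean; a larger gamma tracks changes in the
# distribution faster but leaves more residual noise around the mean.
```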
By using this parametrization, \u03b3 becomes, to an extent, analogous to the learning rate: reducing \u03b3 increases the consistency that is required for a\nsignal to lead to a weight \ufb02ip.\nWe compare the exponential moving average with a threshold \u03c4 to determine whether to \ufb02ip each\nweight:\n\nw^i_t = \u2212w^i_{t\u22121}   if |m^i_t| \u2265 \u03c4 and sign(m^i_t) = sign(w^i_{t\u22121}),\nw^i_t = w^i_{t\u22121}   otherwise.   (6)\n\nThis allows us to control the strength of selected signals in an effective manner. The use of a threshold\nhas no analogue in existing methods. However, similar to using Momentum or Adam to update latent\nweights, a non-zero threshold avoids rapid back-and-forth of weights when the gradient reverses on a\nweight \ufb02ip. Observe that a high \u03c4 can result in weights never \ufb02ipping despite a consistent gradient\npressure to do so, if that signal is too weak.\n\nAlgorithm 2: Bop, an optimizer for BNNs.\ninput: Loss function L(f(x; w), y), Batch size K\ninput: Threshold \u03c4, adaptivity rate \u03b3\ninitialize w \u2190 w0 \u2208 {\u22121, 1}^n, m \u2190 m0 \u2208 R^n;\nwhile stopping criterion not met do\n  Sample minibatch {x(1), ..., x(K)} with labels y(k);\n  Compute gradient: g \u2190 (1/K) \u2202/\u2202w \u2211_k L(f(x(k); w), y(k));\n  Update momentum: m \u2190 (1 \u2212 \u03b3) m + \u03b3 g;\n  for i \u2190 1 to n do\n    if |m^i| > \u03c4 and sign(m^i) = sign(w^i) then\n      w^i \u2190 \u2212w^i;\n    end\n  end\nend\n\nBoth hyperparameters, the adaptivity rate \u03b3 and threshold \u03c4, can be understood directly in terms\nof the consistency and strength of gradient signals that lead to a \ufb02ip. A higher \u03b3 results in a more\nadaptive moving average: if a new gradient signal pressures a weight to \ufb02ip, it will require fewer time\nsteps to do so, leading to faster but more noisy learning. 
A higher \u03c4 on the other hand makes the\noptimizer less sensitive: a stronger gradient signal is required to \ufb02ip a weight, reducing noise at the\nrisk of \ufb01ltering out valuable smaller signals.\nAs compared to existing methods, Bop drastically reduces the number of hyperparameters, and the\ntwo remaining hyperparameters have a clear relation to weight \ufb02ips. Currently, one has to decide on an\ninitialization scheme for the latent weights, an optimizer and its hyperparameters, and optionally\nconstraints or regularizations on the latent weights. The relation between many of these choices and\nweight \ufb02ipping (the only thing that matters) is not at all obvious. Furthermore, Bop reduces the\nmemory requirements during training: it requires only one real-valued variable per weight, while the\nlatent-variable approach with Momentum and Adam requires two and three, respectively.\nNote that the concept of consistency here is closely related to the concept of inertia introduced\nin the previous section. If we initialize the latent weights in Algorithm 1 at zero, they contain a\nsum, weighted by the learning rate, over all gradients. Therefore, their sign is equal to the sign of\nthe weighted average of past gradients. This introduces an undue dependency on old information.\nClipping of the latent weights can be seen as an ad-hoc solution to this problem. By using an\nexponential moving average, we eliminate the need for latent weights, a learning rate and arbitrary\nclipping; at the same time we gain \ufb01ne-grained control over the importance assigned to past gradients\nthrough \u03b3.\nWe believe Bop should be viewed as a basic binary optimizer, similar to SGD in real-valued training.\nWe see many research opportunities both in the direction of hyperparameter schedules and in adaptive\nvariants of Bop. 
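The per-weight update of Algorithm 2 can be sketched in plain Python as follows (an illustrative sketch of ours; the toy gradient sequences stand in for minibatch gradients):

```python
def bop_step(w, m, grads, gamma, tau):
    """One Bop step: update the exponential moving average of gradients
    (Equation 5), then flip a binary weight when the average is strong
    enough (|m| >= tau) and pushes against the weight's current sign."""
    new_w, new_m = [], []
    for wi, mi, gi in zip(w, m, grads):
        mi = (1 - gamma) * mi + gamma * gi
        if abs(mi) >= tau and (mi >= 0) == (wi >= 0):  # flip rule, Equation (6)
            wi = -wi
        new_w.append(wi)
        new_m.append(mi)
    return new_w, new_m

# Weight 0 receives a consistent gradient and flips (exactly once);
# weight 1 receives an alternating gradient whose moving average never
# reaches the threshold, so it is never flipped:
w, m = [1.0, 1.0], [0.0, 0.0]
for t in range(200):
    g = [0.5, -0.5 if t % 2 == 0 else 0.5]
    w, m = bop_step(w, m, g, gamma=0.01, tau=0.003)
# w == [-1.0, 1.0]
```

With τ = 0 and the same moving average, weight 1 would flip back and forth as the small ripple in m changes sign, which is exactly the noisy behavior the threshold filters out.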
In the next section, we explore some basic properties of the optimizer.\n\n5 Empirical Analysis\n\n5.1 Hyperparameters\n\nWe start by investigating the effect of different choices for \u03b3 and \u03c4. To better understand the behavior\nof the optimizer, we monitor the accuracy of the network and the ratio of weights \ufb02ipped at each step\nusing the following metric:\n\n\u03c0_t = log( (Number of \ufb02ipped weights at time t) / (Total number of weights) + e^\u22129 ).   (7)\n\nHere e^\u22129 is added to avoid log(0) in the case of no weight \ufb02ips.\nThe results are shown in Figure 1. We see the expected patterns in noisiness: both a higher \u03b3 and a\nlower \u03c4 increase the number of weight \ufb02ips per time step.\n\nFigure 1: Comparison of training for different values of \u03b3 and \u03c4 in Bop for BinaryNet on CIFAR-10.\nThe upper panels show accuracy (train: \u2014, validation: \u2013 \u2013). The lower panels show \u03c0_t, as de\ufb01ned in\nEquation (7), for the last layer of the network. On the left side we compare three values for \u03b3, while\nkeeping \u03c4 \ufb01xed at 10^\u22126. On the right we compare three values for \u03c4, while keeping \u03b3 \ufb01xed at 10^\u22123.\nWe see that both high \u03b3 and low \u03c4 lead to rapid initial learning but result in high \ufb02ip rates, while low\n\u03b3 and high \u03c4 result in slow learning and near-zero \ufb02ip rates.\n\nMore interesting is the corresponding pattern in accuracy. For both hyperparameters, we \ufb01nd there is\na \u201csweet spot\u201d. Choosing a very low \u03b3 and high \u03c4 leads to extremely slow learning. 
On the other\nhand, overly aggressive hyperparameter settings (high \u03b3 and low \u03c4) result in rapid initial learning that\nquickly levels off at a suboptimal training accuracy: it appears the noisiness prevents further learning.\nIf we look at the validation accuracy for the two aggressive settings ((\u03b3, \u03c4) = (10^\u22122, 10^\u22126) and\n(\u03b3, \u03c4) = (10^\u22123, 0)), we see the validation accuracy becomes highly volatile in both cases, and\ndeteriorates substantially over time in the case of \u03c4 = 0. This suggests that by learning from weak\ngradient signals the model becomes more prone to over\ufb01tting. The observed over\ufb01tting cannot simply be\nexplained by a higher sensitivity to gradients from a single example or batch, because then we would\nexpect to observe a similarly poor generalization for high \u03b3.\nThese empirical results validate the theoretical considerations that informed the design of the optimizer\nin the previous section. The behavior of Bop can be easily understood in terms of weight \ufb02ips. The\npoor results for high \u03b3 con\ufb01rm the need to favor consistent signals, while our results for \u03c4 = 0\ndemonstrate that \ufb01ltering out weak signals can greatly improve optimization.\n\n5.2 CIFAR-10\n\nWe use a VGG-inspired network architecture [25], identical to the implementation used by Courbariaux\net al. [1]. We scale the RGB images to the interval [\u22121, +1], and use the following data augmentation\nduring training to improve generalization (as \ufb01rst observed in [26] for CIFAR datasets): 4 pixels are\npadded on each side, a random 32 \u00d7 32 crop is applied, followed by a random horizontal \ufb02ip. During\ntest time the scaled images are used without any augmentation. The experiments were conducted\nusing TensorFlow [27] and NVIDIA Tesla V100 GPUs.\nIn assessing the new optimizer, we are interested in both the \ufb01nal test accuracy and the number\nof epochs it requires to achieve this. 
As discussed in [8], the training time for BNNs is currently\nfar longer than what one would expect for the real-valued case, and is on the order of 500 epochs,\ndepending on the optimizer. To benchmark Bop we train for 500 epochs with threshold \u03c4 = 10^\u22128,\nadaptivity rate \u03b3 = 10^\u22124 decayed by 0.1 every 100 epochs, batch size 50, and use Adam with the\nrecommended defaults for \u03b21, \u03b22, \u03b5 [14] and an initial learning rate of \u03b1 = 10^\u22122 to update the\nreal-valued variables in the Batch Normalization layers [28]. We use Adam with latent real-valued\nweights as a baseline, training for 500 epochs with Xavier learning rate scaling [18] (as recommended\nin [8]) using the recommended defaults for \u03b21, \u03b22 and \u03b5, learning rate 10^\u22123, decayed by 0.1 every\n100 epochs, and batch size 50.\nThe results for the top-1 training and test accuracy are summarized in Figure 2. Compared to the\nbaseline test accuracy of 90.9%, Bop reaches 91.3%. The baseline accuracy was highly tuned using a\n\nFigure 2: Training and test accuracy history for the models trained. One can see that the results for\nBop are competitive with the baseline.\n\nTable 1: Accuracies for Bop on ImageNet for three common BNNs. 
Results for latent weights are\ncited from the relevant literature [29, 6, 4].\n\nModel      | Bop (ours)      | Latent weights\n           | top-1  | top-5  | top-1  | top-5\nBinaryNet  | 41.1%  | 65.4%  | 40.1%  | 66.3%\nXNOR-Net   | 45.9%  | 70.0%  | 44.2%  | 69.2%\nBiReal-Net | 56.6%  | 79.4%  | 56.4%  | 79.5%\n\nextensive random search for the initial learning rate, and the learning rate schedule, and improves the\nresult found in [1] by 1.0%.\n\n5.3 ImageNet\n\nWe test Bop on ImageNet by training three well-known binarized networks from scratch: BinaryNet, a\nbinarized version of AlexNet [29]; XNOR-Net, an improved version of BinaryNet that uses real-valued\nscaling factors and real-valued \ufb01rst and last layers [6]; and BiReal-Net, which introduced real-valued\nshortcuts to binarized networks and achieves drastically better accuracy [4].\nWe train BinaryNet and BiReal-Net for 150 epochs and XNOR-Net for 100 epochs. We use a batch\nsize of 1024 and standard preprocessing with random \ufb02ip and resize but no further augmentation. For\nall three networks we use the same optimizer hyperparameters. We set the threshold to 1 \u00b7 10^\u22128 and\ndecay the adaptivity rate linearly from 1 \u00b7 10^\u22124 to 1 \u00b7 10^\u22126. For the real-valued variables, we use\nAdam with a linearly decaying learning rate from 2.5 \u00b7 10^\u22123 to 5 \u00b7 10^\u22126 and otherwise default settings\n(\u03b21 = 0.9, \u03b22 = 0.999 and \u03b5 = 1 \u00b7 10^\u22127). After observing over\ufb01tting for XNOR-Net we introduce a\nsmall L2-regularization of 5 \u00b7 10^\u22127 on the (real-valued) \ufb01rst and last layer for this network only. For\nbinarization of the activation we use the STE in BinaryNet and XNOR-Net and the ApproxSign for\nBiReal-Net, following Liu et al. [4]. Note that as the weights are not binarized in the forward pass,\nno pseudo-gradient for the backward pass needs to be de\ufb01ned. 
Moreover, whereas XNOR-Net and\nBiReal-Net effectively binarize to {\u2212\u03b1, \u03b1} by introducing scaling factors, we learn strictly binary\nweight kernels.\nThe results are shown in Table 1. We obtain competitive results for all three networks. We emphasize\nthat while each of these papers introduces a variety of tricks, such as layer-wise scaling of learning\nrates in [1], scaled binarization in [6] and a multi-stage training protocol in [4], we use almost identical\noptimizer settings for all three networks. Moreover, our improvement on XNOR-Net demonstrates\nscaling factors are not necessary to train BNNs to high accuracies, which is in line with earlier\nobservations [30].\n\n6 Discussion\n\nIn this paper we offer a new interpretation of existing deterministic BNN training methods which\nexplains latent real-valued weights as encoding inertia for the binary weights. Using the concept of\ninertia, we gain a better understanding of the role of the optimizer, various hyperparameters, and\nregularization. Furthermore, we formulate the key requirements for a gradient-based optimization\nprocedure for BNNs and guided by these requirements we introduce Bop, the \ufb01rst optimizer designed\nfor BNNs. With this new optimizer, we have exceeded the state-of-the-art result for BinaryNet on\nCIFAR-10 and achieved a competitive result on ImageNet for three well-known binarized networks.\nOur interpretation of latent weights as inertia differs from the common view of BNNs, which treats\nbinary weights as an approximation to latent weights. We argue that the real-valued magnitudes of\nlatent weights should not be viewed as weights at all: changing the magnitudes does not alter the\nbehavior of the network in the forward pass. 
Instead, the optimization procedure has to be understood\nby considering under what circumstances it \ufb02ips the binary weights.\nThe approximation viewpoint has not only shaped understanding of BNNs but has also guided efforts\nto improve them. Numerous papers aim at reducing the difference between the binarized network and\nits real-valued counterpart. For example, both the scaling introduced by XNOR-Net (see eq. (2) in\n[6]) and DoReFa (eq. (7) in [31]), as well as the magnitude-aware binarization introduced in Bi-Real\nNet (eq. (6) in [4]) aim at bringing the binary vector closer to the latent weight. ABC-Net maintains\na single real-valued weight vector that is projected onto multiple binary vectors (eq. (4) in [5]) in\norder to get a more accurate approximation (eq. (1) in [5]). Although many of these papers achieve\nimpressive results, our work shows that improving the approximation is not the only option. Instead\nof improving BNNs by reducing the difference with real-valued networks during training, it may be\nmore fruitful to modify the optimization method in order to better suit the BNN.\nBop is the \ufb01rst step in this direction. As we have demonstrated, it is conceptually simpler than current\nmethods and requires less memory during training. Apart from this conceptual simpli\ufb01cation, the\nmost novel aspect of Bop is the introduction of a threshold \u03c4. We note that when setting \u03c4 = 0, Bop\nis mathematically similar to the latent weight approach with SGD, where the moving averages m\nnow play the role of latent variables.\nThe threshold that is used introduces a dependency on the absolute magnitude of the gradients. We\nhypothesize the threshold helps training by selecting the most important signals and avoiding rapid\nchanges of a single weight. However, a \ufb01xed threshold for all layers and weights may not be the\noptimal choice. 
The success of Adam for real-valued methods and the invariance of latent-variable\nmethods to the scale of the update step (see Theorem 1) suggest some form of normalization may be\nuseful.\nWe see at least two possible ways to modify thresholding in Bop. First, one could consider layer-wise\nnormalization of the exponential moving averages. This would allow selection of important signals\nwithin each layer, thus avoiding situations in which some layers are noisy and other layers barely\ntrain at all. A second possibility is to introduce a second moving average that tracks the magnitude of\nthe gradients, similar to Adam.\nAnother direction in which Bop may be improved is the exploration of hyperparameter schedules.\nThe adaptivity rate, \u03b3, may be viewed as analogous to the learning rate in real-valued optimization.\nIndeed, if we view the moving averages, m, as analogous to latent weights, lowering \u03b3 is analogous\nto decreasing the learning rate, which by Theorem 1 increases inertia. Reducing \u03b3 over time therefore\nseems like a sensible approach. However, any analogy to the real-valued setting is imperfect, and it\nwould be interesting to explore different schedules.\nHyperparameter schedules could also target the threshold, \u03c4, (or an adaptive variation of \u03c4). We\nhypothesize one should select for strong signals (i.e. high \u03c4) in the \ufb01rst stages of training, and make\ntraining more sensitive by lowering \u03c4 over time, perhaps while simultaneously lowering \u03b3. However,\nwe stress once again that such intuitions may prove unreliable in this unexplored context.\nMore broadly, the shift in perspective presented here opens up many opportunities to further improve\noptimization methods for BNNs. We see two areas that are especially promising. The \ufb01rst is\nregularization. As we have argued, it is not clear that applying L2-regularization or weight decay\nto the latent weights should lead to any regularization at all. 
Applying Dropout to BNNs is also problematic. Either the zeros introduced by dropout are projected onto {−1, +1}, which is likely to result in a bias, or zeros appear in the convolution, which would violate the basic principle of BNNs. It would be interesting to see custom regularization techniques for BNNs. One very interesting work in this direction is [32].

The second area where we anticipate further improvements in BNN optimization is knowledge distillation [33]. One way people currently translate knowledge from the real-valued network to the BNN is through initialization of the latent weights, which is becoming increasingly sophisticated [4, 8, 34]. Several works have started to apply other techniques of knowledge distillation to low-precision networks [34–36].

The authors are excited to see how the concept of inertia introduced in this paper will influence the future development of the field.

References

[1] Matthieu Courbariaux, Itay Hubara, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. “Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1”. In: arXiv preprint arXiv:1602.02830 (2016).

[2] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. “ImageNet Large Scale Visual Recognition Challenge”. In: International Journal of Computer Vision (IJCV) 115.3 (2015), pp. 211–252.

[3] Shilin Zhu, Xin Dong, and Hao Su. “Binary Ensemble Neural Network: More Bits per Network or More Networks per Bit?” In: arXiv preprint arXiv:1806.07550 (2018).

[4] Zechun Liu, Baoyuan Wu, Wenhan Luo, Xin Yang, Wei Liu, and Kwang-Ting Cheng. “Bi-Real Net: Enhancing the Performance of 1-bit CNNs with Improved Representational Capability and Advanced Training Algorithm”.
In: ECCV (2018).

[5] Xiaofan Lin, Cong Zhao, and Wei Pan. “Towards Accurate Binary Convolutional Neural Network”. In: NIPS (2017).

[6] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. “XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks”. In: ECCV (2016).

[7] Bohan Zhuang, Chunhua Shen, Mingkui Tan, Lingqiao Liu, and Ian Reid. “Structured Binary Neural Networks for Accurate Image Classification and Semantic Segmentation”. In: CVPR (2019).

[8] Milad Alizadeh, Javier Fernandez-Marques, Nicholas D. Lane, and Yarin Gal. “An Empirical Study of Binary Neural Networks’ Optimisation”. In: ICLR (2019).

[9] Alexander G. Anderson and Cory P. Berg. “The High-Dimensional Geometry of Binary Neural Networks”. In: arXiv preprint arXiv:1705.07199 (2017).

[10] Alex Krizhevsky. Learning multiple layers of features from tiny images. Tech. rep. 2009.

[11] Yoshua Bengio, Nicholas Léonard, and Aaron Courville. “Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation”. In: arXiv preprint arXiv:1308.3432 (2013).

[12] Geoffrey Hinton. “Neural networks for machine learning”. In: Coursera Video Lectures (2012).

[13] Ning Qian. “On the momentum term in gradient descent learning algorithms”. In: Neural Networks 12.1 (1999), pp. 145–151.

[14] Diederik P. Kingma and Jimmy Ba. “Adam: A Method for Stochastic Optimization”. In: arXiv preprint arXiv:1412.6980 (2014).

[15] Ilya Loshchilov and Frank Hutter. “Fixing Weight Decay Regularization in Adam”. In: arXiv preprint arXiv:1711.05101 (2017).

[16] Jorn W. T. Peters and Max Welling. “Probabilistic Binary Neural Networks”. In: arXiv preprint arXiv:1809.03368 (2018).

[17] Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. “Deep Learning with Limited Numerical Precision”. In: ICML (2015).

[18] Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. “BinaryConnect: Training Deep Neural Networks with binary weights during propagations”. In: NIPS (2015).

[19] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. “Dropout: A Simple Way to Prevent Neural Networks from Overfitting”. In: Journal of Machine Learning Research 15 (2014), pp. 1929–1958.

[20] Hao Li, Soham De, Zheng Xu, Christoph Studer, Hanan Samet, and Tom Goldstein. “Training Quantized Nets: A Deeper Understanding”. In: NIPS (2017).

[21] Paul Merolla, Rathinakumar Appuswamy, John Arthur, Steve K. Esser, and Dharmendra Modha. “Deep neural networks are robust to weight binarization and other non-linear distortions”. In: arXiv preprint arXiv:1606.01981 (2016).

[22] Gabriel Goh. “Why Momentum Really Works”. In: Distill (2017).

[23] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. “On the importance of initialization and momentum in deep learning”. In: ICML (2013).

[24] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT Press, 2016.

[25] K. Simonyan and A. Zisserman. “Very Deep Convolutional Networks for Large-Scale Image Recognition”. In: International Conference on Learning Representations. 2015.

[26] Benjamin Graham. “Spatially-sparse convolutional neural networks”. In: arXiv preprint arXiv:1409.6070 (2014).

[27] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S.
Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. Software available from tensorflow.org. 2015.

[28] Sergey Ioffe and Christian Szegedy. “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift”. In: arXiv preprint arXiv:1502.03167 (2015).

[29] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. “Binarized Neural Networks”. In: NIPS (2016).

[30] Joseph Bethge, Haojin Yang, Marvin Bornstein, and Christoph Meinel. “Back to Simplicity: How to Train Accurate BNNs from Scratch?” In: arXiv preprint arXiv:1906.08637 (2019).

[31] Shuchang Zhou, Yuxin Wu, Zekun Ni, Xinyu Zhou, He Wen, and Yuheng Zou. “DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients”. In: arXiv preprint arXiv:1606.06160 (2016).

[32] Ruizhou Ding, Ting-Wu Chin, Zeye Liu, and Diana Marculescu. “Regularizing Activation Distribution for Training Binarized Deep Networks”. In: arXiv preprint arXiv:1904.02823 (2019).

[33] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. “Distilling the Knowledge in a Neural Network”. In: arXiv preprint arXiv:1503.02531 (2015).

[34] Adrian Bulat, Georgios Tzimiropoulos, Jean Kossaifi, and Maja Pantic. “Improved training of binary networks for human pose estimation and image recognition”.
In: arXiv preprint arXiv:1904.05868 (2019).

[35] Antonio Polino, Razvan Pascanu, and Dan Alistarh. “Model Compression via Distillation and Quantization”. In: arXiv preprint arXiv:1802.05668 (2018).

[36] Asit Mishra and Debbie Marr. “Apprentice: Using Knowledge Distillation Techniques to Improve Low-Precision Network Accuracy”. In: arXiv preprint arXiv:1711.05852 (2017).

A Proof for Theorem 1

Proof. Consider a single weight. Let w̃_t be the latent weight at time t, g_t the pseudo-gradient and δ_t the update step generated by the optimizer A. Then:

w̃_{t+1} = w̃_t + α δ_t.

Now take some positive scalar C by which we scale the learning rate, and replace the weight by ṽ_t = C w̃_t. Since sign(ṽ_t) = sign(w̃_t), the binary weight is unaffected. Therefore the forward pass at time t is unchanged and we obtain an identical pseudo-gradient g_t and update step δ_t. We see:

ṽ_{t+1} = ṽ_t + Cα δ_t = C · (w̃_t + α δ_t) = C w̃_{t+1}.

Thus sign(ṽ_{t+1}) = sign(w̃_{t+1}). By induction, this holds for all t′ > t, and we conclude that the BNN is unaffected by the change in learning rate.
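As a sanity check, Theorem 1 can also be verified numerically. The toy sketch below uses plain SGD as the optimizer A and a pseudo-gradient that, as in the proof, depends only on the binary weight and the step index; the particular gradient function and initial weights are arbitrary illustrations. Choosing C as a power of two makes the scaling exact in floating-point arithmetic, so the two binary weight trajectories coincide exactly.

```python
import numpy as np

def sign_trajectory(latent, lr, steps=100):
    """Run plain latent-weight SGD and record the binary weights over time.

    The toy pseudo-gradient is a deterministic function of the *binary*
    weight and the step index only, mirroring the proof: the forward pass,
    and hence the gradient, cannot see the latent magnitude.
    """
    w = latent.astype(np.float64).copy()
    trajectory = []
    for t in range(steps):
        b = np.sign(w)
        g = b * 0.1 + np.cos(t + np.arange(w.size))  # toy pseudo-gradient
        w -= lr * g
        trajectory.append(np.sign(w).copy())
    return trajectory

w0 = np.array([0.05, -0.4, 0.7, -0.01])
C = 8.0  # power of two: multiplying by C is exact in binary floating point
run_a = sign_trajectory(w0, lr=0.1)
run_b = sign_trajectory(C * w0, lr=C * 0.1)
assert all((a == b).all() for a, b in zip(run_a, run_b))
```

Rescaling the latent weights and the learning rate together changes nothing the network can observe, which is exactly the inertia argument of the main text.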