{"title": "The Synthesis of XNOR Recurrent Neural Networks with Stochastic Logic", "book": "Advances in Neural Information Processing Systems", "page_first": 8444, "page_last": 8454, "abstract": "The emergence of XNOR networks seek to reduce the model size and computational cost of neural networks for their deployment on specialized hardware requiring real-time processes with limited hardware resources. In XNOR networks, both weights and activations are binary, bringing great benefits to specialized hardware by replacing expensive multiplications with simple XNOR operations. Although XNOR convolutional and fully-connected neural networks have been successfully developed during the past few years, there is no XNOR network implementing commonly-used variants of recurrent neural networks such as long short-term memories (LSTMs). The main computational core of LSTMs involves vector-matrix multiplications followed by a set of non-linear functions and element-wise multiplications to obtain the gate activations and state vectors, respectively. Several previous attempts on quantization of LSTMs only focused on quantization of the vector-matrix multiplications in LSTMs while retaining the element-wise multiplications in full precision. In this paper, we propose a method that converts all the multiplications in LSTMs to XNOR operations using stochastic computing. To this end, we introduce a weighted finite-state machine and its synthesis method to approximate the non-linear functions used in LSTMs on stochastic bit streams. 
Experimental results show that the proposed XNOR LSTMs reduce the computational complexity of their quantized counterparts by a factor of 86× without any sacrifice in latency while achieving better accuracy across various temporal tasks.", "full_text": "The Synthesis of XNOR Recurrent Neural Networks\n\nwith Stochastic Logic\n\nArash Ardakani, Zhengyun Ji, Amir Ardakani, Warren J. Gross\n\nDepartment of Electrical and Computer Engineering, McGill University, Montreal, Canada\n\n{arash.ardakani, zhengyun.ji, amir.ardakani}@mail.mcgill.ca\n\nwarren.gross@mcgill.ca\n\nAbstract\n\nThe emergence of XNOR networks seeks to reduce the model size and computational cost of neural networks for their deployment on specialized hardware requiring real-time processing with limited hardware resources. In XNOR networks, both weights and activations are binary, bringing great benefits to specialized hardware by replacing expensive multiplications with simple XNOR operations. Although XNOR convolutional and fully-connected neural networks have been successfully developed during the past few years, there is no XNOR network implementing commonly-used variants of recurrent neural networks such as long short-term memories (LSTMs). The main computational core of LSTMs involves vector-matrix multiplications followed by a set of non-linear functions and element-wise multiplications to obtain the gate activations and state vectors, respectively. Several previous attempts at quantizing LSTMs focused only on the vector-matrix multiplications while retaining the element-wise multiplications in full precision. 
In this paper, we propose a method that converts\nall the multiplications in LSTMs to XNOR operations using stochastic computing.\nTo this end, we introduce a weighted \ufb01nite-state machine and its synthesis method\nto approximate the non-linear functions used in LSTMs on stochastic bit streams.\nExperimental results show that the proposed XNOR LSTMs reduce the compu-\ntational complexity of their quantized counterparts by a factor of 86\u00d7 without\nany sacri\ufb01ce on latency while achieving a better accuracy across various temporal\ntasks.\n\n1 Introduction\n\nRecurrent neural networks (RNNs) have exhibited state-of-the-art performance across different\ntemporal tasks that require processing variable-length sequences such as image captioning [1], speech\nrecognition [2] and natural language processing [3]. Despite the remarkable success of RNNs on a\nwide range of complex sequential problems, they suffer from the exploding gradient problem that\noccurs when learning long-term dependencies [4, 5]. Therefore, various RNN architectures such\nas long short-term memories (LSTMs) [6] and gated recurrent units (GRUs) [7] have emerged to\nmitigate the exploding gradient problem. Due to the prevalent use of LSTMs in both academia and\nindustry, we mainly focus on the LSTM architecture in this work. The recurrent transition in LSTM is\nperformed in two stages: the \ufb01rst stage performing gate computations and the second one performing\nstate computations. The gate computations are described as\n\nft = \u03c3 (Wf hht\u22121 + Wf xxt + bf ) , it = \u03c3(Wihht\u22121 + Wixxt + bi),\not = \u03c3(Wohht\u22121 + Woxxt + bo), gt = tanh(Wghht\u22121 + Wgxxt + bg),\n\n(1)\nwhere {Wf h, Wih, Woh, Wgh} \u2208 Rdh\u00d7dh, {Wf x, Wix, Wox, Wgx} \u2208 Rdx\u00d7dh and {bf , bi, bo,\nbg} \u2208 Rdh denote the recurrent weights and bias. 
The input vector x ∈ Rdx denotes input temporal features, whereas the hidden state h ∈ Rdh retains the temporal state of the network.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nThe logistic sigmoid and hyperbolic tangent functions are denoted as σ and tanh, respectively. The updates of the LSTM parameters are regulated through a set of gates: ft, it, ot and gt. The state computations are then performed as\n\nct = ft ⊗ ct−1 + it ⊗ gt, ht = ot ⊗ tanh(ct), (2)\n\nwhere the parameter c ∈ Rdh is the cell state. The operator ⊗ denotes the Hadamard product.\nThe first computational stage of LSTM is structurally similar to a fully-connected layer as it only involves several vector-matrix multiplications. Therefore, LSTMs are memory intensive, similar to fully-connected layers [8]. LSTMs are also computationally intensive due to their recursive nature [9]. These limitations make LSTM models difficult to deploy on specialized hardware requiring real-time processing with limited hardware resources and a tight power budget. Several techniques have been introduced in the literature to alleviate the computational complexity and memory footprint of neural networks, such as low-rank approximation [10], weight/activation pruning [11, 12, 13, 14] and quantization [15, 16, 17]. Among these solutions, quantization methods, specifically binarization methods, bring significant benefits to dedicated hardware since they reduce the required memory footprint and implementation cost by constraining both weights and activations to only two values (i.e., -1 or 1) and replacing multiplications with simple XNOR operations, respectively [17]. As a result, several attempts were reported in the literature to binarize LSTM models during the past few years [18, 19, 20]. 
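As a concrete reference for the two-stage recurrence in Eqs. (1) and (2), a single full-precision LSTM step can be sketched in NumPy as follows. The weight names mirror Eq. (1); the shapes, initialization and scale are illustrative assumptions, not values from the paper:

```python
import numpy as np

def lstm_step(x, h, c, W, b):
    """One LSTM step: gate computations (Eq. 1), then state computations (Eq. 2).

    W maps each gate name to a pair (W_gh, W_gx); b maps it to a bias vector.
    """
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    # Gate computations: vector-matrix multiplications + non-linearities (Eq. 1).
    pre = {g: h @ W[g][0] + x @ W[g][1] + b[g] for g in ("f", "i", "o", "g")}
    f, i, o = sigmoid(pre["f"]), sigmoid(pre["i"]), sigmoid(pre["o"])
    g = np.tanh(pre["g"])
    # State computations: element-wise (Hadamard) products (Eq. 2).
    c_new = f * c + i * g
    h_new = o * np.tanh(c_new)
    return h_new, c_new

rng = np.random.default_rng(0)
dx, dh = 4, 8  # illustrative sizes
W = {g: (rng.standard_normal((dh, dh)) * 0.1,
         rng.standard_normal((dx, dh)) * 0.1) for g in ("f", "i", "o", "g")}
b = {g: np.zeros(dh) for g in ("f", "i", "o", "g")}
h, c = np.zeros(dh), np.zeros(dh)
h, c = lstm_step(rng.standard_normal(dx), h, c, W, b)
```

The element-wise products in the last two lines of `lstm_step` are exactly the state computations that prior quantization work left in full precision.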
However, all the existing methods only focused on the gate computations of LSTMs by binarizing either the weights or both the weights and the hidden vector h while retaining the state computations in full precision (FP). Although the recurrent computations of LSTM models are dominated by the gate computations, using full-precision multipliers is inevitable for the state computations when designing dedicated hardware for LSTM models, making the existing binarized LSTM models unsuitable for embedded systems with limited hardware resources and a tight power budget. It is worth mentioning that a full-precision multiplier requires 200× more Xilinx FPGA slices than an XNOR gate [17]. Therefore, an XNOR-LSTM model that can perform the multiplications of both the gate and the state computations using XNOR operations is missing in the literature.\nIn this paper, we first extend an existing LSTM model with binary weights to binarize the hidden state vector h. In this way, the multiplications of the gate computations can be performed using XNOR operations. We then propose a method to binarize the state computations using stochastic computing (SC) [21]. More precisely, we show that the binarized weights and the hidden state vector h can be represented as stochastic bit streams, allowing us to perform the gate computations using stochastic logic and to implement the non-linear activation functions (i.e., the sigmoid and hyperbolic tangent functions) using finite-state machines (FSMs). We also introduce a new FSM topology and its synthesis method to accurately approximate the nonlinear functions of LSTM. We show that the proposed FSM outputs a binary stream whose expected value approximates the nonlinear activation functions of LSTMs. 
Ultimately, we use the binary streams generated by the FSMs to replace the full-precision multipliers required for the state computations with XNOR gates, forming an XNOR-LSTM model.\n\n2 Related Work\n\nIn the binarization process, the full-precision weight matrix W ∈ RdI×dJ is estimated using a binary weight matrix Wb ∈ {−1, 1}dI×dJ and a scaling factor α ∈ R+ such that W ≈ αWb. In [15], the sign function was used as the transformation function to obtain the binary weight matrix (i.e., Wb = sign(W)) while using a fixed scaling factor for all the weights. Lin et al. [22] introduced a ternarization method that reduces the accuracy loss of the binarization process by clamping values hesitating between 1 and -1 to zero. Some methods [23, 24] were then proposed to improve upon the ternarization method by learning the scaling factor α. Zhou et al. [25] proposed a method that quantizes the weights, activations and gradients of neural networks using different bitwidths. Rastegari et al. [16] and Lin et al. [17] proposed binary neural networks (BNNs) in which both the weights and activations of convolutional neural networks (CNNs) are represented in binary. Despite the great performance of the aforementioned works in the quantization of CNNs, they fail to work well on RNNs [26]. As a result, recent studies mainly attempted to quantize RNNs, in particular LSTMs.\nHou et al. [19] introduced the loss-aware binarization method (LAB) that uses the proximal Newton algorithm to minimize the loss w.r.t. the binarized weights. The LAB method was further extended in [27] to support different bitwidths for the weights.\n\na : 1 0 0 0 1 0 0 0 (2/8)\nb : 0 0 1 0 1 1 0 1 (4/8)\ny : 0 0 0 0 1 0 0 0 (1/8)\n(a)\n\na : 0 0 0 0 1 0 1 0 (−4/8)\nb : 1 0 1 0 1 0 0 1 (0/8)\ny : 0 1 0 1 1 1 0 0 (0/8)\n(b)\n\nFigure 1: Stochastic multiplications using bit-wise operations in (a) unipolar and (b) bipolar formats.\n\n
Both of these methods were tested on LSTM\nmodels performing character-level language modeling experiments. Xu et al. [18] presented the\nalternating multi-bit quantization (AMQ) method that uses a binary search tree to obtain optimal\nquantization coef\ufb01cients for LSTM models. Wang et al. [26] proposed a ternary RNN, called\nHitNet, which exploits a hybrid of different quantization methods to quantize the weights and the\nhidden state vector based on their statistical characteristics. Both HitNet and alternating multi-\nbit quantization method were tested on RNNs performing word-language modeling experiments.\nRecently, Ardakani et al. [20] leveraged batch normalization in both the input-to-hidden and the\nhidden-to-hidden transformations of LSTMs to binarize/ternarize the recurrent weights. This method\nwas tested on various sequential tasks, such as sequence classi\ufb01cation, language modeling, and\nreading comprehension. While all the aforementioned approaches successfully managed to quantize\nthe weights and the hidden state vector (i.e., the gate computations) of LSTM models, the state\ncomputations were retained in full precision. More precisely, no attempt was reported to binarize\nboth the gate and state computations of LSTMs. Motivated by this observation, we propose the\n\ufb01rst XNOR-LSTM model in literature, performing all the recurrent multiplications with XNOR\noperations.\n\n3 Preliminaries\n\n3.1 Stochastic Computing\n\nStochastic computing is a well-known technique to obtain ultra low-cost hardware implementations\nfor various applications [28]. In SC, continuous values are represented as sequences of random bits,\nallowing complex computations to be computed by simple bit-wise operations on the bit streams.\nMore precisely, the statistics of the bits determine the information content of the stream. 
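The bit-level encoding just described can be modeled in a few lines. This is a software sketch of a comparator-based stochastic number generator, not the paper's LFSR hardware; the stream length and seed are arbitrary choices:

```python
import numpy as np

def sng_unipolar(value, length, rng):
    """Encode value in [0, 1] as a Bernoulli bit stream with E[bit] = value."""
    return (rng.random(length) < value).astype(np.int8)

def decode_unipolar(stream):
    """Recover the encoded value as the mean of the bits."""
    return stream.mean()

rng = np.random.default_rng(42)
stream = sng_unipolar(0.75, 4096, rng)
estimate = decode_unipolar(stream)  # close to 0.75 for long streams
```

The longer the stream, the smaller the variance of the estimate, which is the usual accuracy/latency trade-off of stochastic computing.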
For example, a real number a ∈ [0, 1] is represented as the sequence a ∈ {0, 1}l in SC's unipolar format such that\n\nE[a] = a, (3)\n\nwhere E[a] and l denote the expected value of the Bernoulli random vector a and the length of the sequence, respectively. Another well-known SC representation format is the bipolar format, where a ∈ [−1, 1] is represented as\n\nE[a] = (a + 1)/2. (4)\n\nTo represent any real number using these two formats, we need to scale it down to fit within the appropriate interval (i.e., either [0, 1] or [−1, 1]). It is worth mentioning that the stochastic stream a is generated using a linear feedback shift register (LFSR) and a comparator in custom hardware [28], referred to as a stochastic number generator (SNG).\n\n3.1.1 Multiplication and Addition in SC\n\nThe multiplication of two stochastic streams a and b in the unipolar format is performed as\n\ny = a · b, (5)\n\nwhere "·" denotes the bit-wise AND operation; E[y] = E[a] × E[b] if and only if the input stochastic streams (i.e., a and b) are independent. In the bipolar format, however, this multiplication is computed using an XNOR gate as\n\ny = a ⊙ b, (6)\n\nwhere "⊙" denotes the bit-wise XNOR operation. Similarly, if the input sequences are independent, we have\n\n2 × E[y] − 1 = (2 × E[a] − 1) × (2 × E[b] − 1). (7)\n\nThe FSM approximating tanh is a saturating counter over the states C0, C1, . . . , Cn/2−1, Cn/2, . . . , Cn−2, Cn−1: an input bit aj = 1 moves the state one step to the right and aj = 0 one step to the left, while the output bit is yj = 0 in states C0 to Cn/2−1 and yj = 1 in states Cn/2 to Cn−1.\n\nFigure 2: State transition diagram of the FSM implementing tanh, where aj and yj denote the jth entries of the input stream a ∈ {0, 1}l and the output stream y ∈ {0, 1}l for j ∈ {1, 2, . . . , l}, respectively.\n\nFigure 1 shows an example of a multiplication in the stochastic domain using both the unipolar and bipolar formats. 
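A small simulation of Eqs. (5)-(7) makes the AND/XNOR correspondence concrete (a sketch; the stream length, seed and operand values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
l = 1 << 14  # stream length

# Unipolar: independent streams, bit-wise AND multiplies the encoded values.
a = (rng.random(l) < 0.25).astype(np.int8)
b = (rng.random(l) < 0.50).astype(np.int8)
y_uni = a & b                  # Eq. (5)
prod_uni = y_uni.mean()        # ≈ 0.25 × 0.50 = 0.125

# Bipolar: value v is encoded with E[bit] = (v + 1)/2; bit-wise XNOR
# multiplies the encoded values.
va, vb = -0.5, 0.4
ab = (rng.random(l) < (va + 1) / 2).astype(np.int8)
bb = (rng.random(l) < (vb + 1) / 2).astype(np.int8)
y_bi = 1 - (ab ^ bb)           # Eq. (6): XNOR as negated XOR
prod_bi = 2 * y_bi.mean() - 1  # ≈ (-0.5) × 0.4 = -0.2 (Eq. 7)
```

Both estimates converge to the true products as `l` grows, provided the two input streams are independent, which is exactly the condition stated above.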
Since stochastic numbers are represented as probabilities falling into the interval [0, 1] in the unipolar format, additions in SC are performed using the scaled adder that fits the result of the addition into the [0, 1] interval [28]. The addition of two stochastic streams a and b is computed as\n\ny = a · c + b · (1 − c), (8)\n\nwhere the signal c is a stochastic stream with a probability of 0.5 (i.e., E[c] = 0.5). The scaled adder is implemented using a multiplexer in which the stream c is used as its selector signal. The aforementioned discussion on the stochastic addition also holds true for the bipolar format.\n\n3.1.2 FSM-Based Functions in SC\n\nIn SC, non-linear functions such as the hyperbolic tangent, sigmoid and exponentiation functions can be performed on stochastic bit streams using FSMs [29]. An FSM in SC can be viewed as a saturating counter that does not increment beyond its maximum value or decrement below its minimum value. For example, the FSM-based transfer function "Stanh" that approximates the hyperbolic tangent function is constructed such that\n\ntanh(na/2) ≈ 2 × E[Stanh(n, a)] − 1, (9)\n\nwhere n denotes the number of states in the FSM. Figure 2 illustrates the state transitions of the FSM-based transfer function approximating the hyperbolic tangent function when using a set of states C0 → Cn−1. Since the sigmoid function is obtained from the hyperbolic tangent function, the transfer function Stanh is also used to approximate the sigmoid function, that is,\n\nσ(na) = (1 + tanh(na/2))/2 ≈ E[Stanh(n, a)]. (10)\n\n3.2 Integral Stochastic Computing\n\nIn integral stochastic computing (ISC), a real value s ∈ [0, m] in the unipolar format (or s ∈ [−m, m] in the bipolar format) is represented as a sequence of integer numbers [30]. 
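The Stanh saturating counter of Section 3.1.2 can be modeled directly; the sketch below checks Eq. (9) empirically (the state count, stream length and input value are arbitrary choices):

```python
import numpy as np

def stanh(n, a_stream):
    """FSM-based Stanh: saturating counter with n states C0..C(n-1).

    An input bit of 1 increments the state and a 0 decrements it,
    saturating at the ends; the output bit is 1 when the state is in
    the upper half of the state set.
    """
    state = n // 2
    out = np.empty_like(a_stream)
    for j, bit in enumerate(a_stream):
        state = min(state + 1, n - 1) if bit else max(state - 1, 0)
        out[j] = 1 if state >= n // 2 else 0
    return out

rng = np.random.default_rng(7)
n, l = 16, 1 << 15
a_val = 0.4                                          # bipolar value of the input
a = (rng.random(l) < (a_val + 1) / 2).astype(np.int8)
y = stanh(n, a)
approx = 2 * y.mean() - 1                            # left side of Eq. (9)
target = np.tanh(n * a_val / 2)                      # tanh(n*a/2)
```

For this saturating input the agreement is close; nearer the origin the FSM approximation is coarser, which is one motivation for the integral and weighted variants that follow.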
In this way, each element of the sequence s ∈ {0, 1, . . . , m}l in the unipolar format (or s ∈ {−m, −m + 1, . . . , m}l in the bipolar format) is represented using the two's complement format, where l denotes the length of the stochastic stream. The integral stochastic stream s is obtained by the element-wise addition of m binary stochastic streams as follows\n\ns = Σ_{j=1}^{m} aj, (11)\n\nwhere the expected value of each binary stochastic stream, denoted as aj, is equal to s/m. With this definition, we have\n\nE[s] = Σ_{j=1}^{m} E[aj] = Σ_{j=1}^{m} s/m = s. (12)\n\nFor instance, the element-wise addition of two binary stochastic streams, {0, 1, 1, 1, 1, 0, 1, 1} and {0, 1, 1, 1, 0, 1, 1, 1}, each representing the real value of 0.75, results in the integral stochastic stream of {0, 2, 2, 2, 1, 1, 2, 2} representing the real value of 1.5 for m = 2 and l = 8. We hereafter refer to the integral stochastic number generator function as ISNG.\nAdditions in ISC are performed using conventional binary-radix adders, retaining all the input information, as opposed to the scaled adders that decrease the precision of the output streams [30]. Multiplications are also implemented using the binary-radix multiplier in ISC. The main advantage of ISC lies in its FSM-based functions that take integral stochastic streams and output binary stochastic streams, allowing the rest of the computations to be performed with simple bit-wise operations in binary SC. The approximate transfer function of the hyperbolic tangent and sigmoid, referred to as IStanh, is defined as\n\ntanh(ns/2) ≈ 2 × E[IStanh(n × m, s)] − 1, (13)\n\nσ(ns) ≈ E[IStanh(n × m, s)]. (14)\n\nIStanh outputs zero when the state counter is less than n × m/2; otherwise, it outputs one. Considering kj as an entry of the state counter vector k ∈ {0, 1, . . . , n × m − 1}l and yj as an entry of IStanh's output vector y ∈ {0, 1}l, we have\n\nyj = 0 if kj < n × m/2, and yj = 1 otherwise, (15)\n\nwhere j ∈ {1, . . . , l}. As opposed to the FSM-based functions in binary SC, in which the state counter is incremented or decremented only by 1, the state counter of the FSM-based functions in ISC is increased or decreased according to the integer input value. In fact, the maximum possible transition at each time slot is equal to m in ISC. Moreover, the FSM-based functions in ISC require m times more states than the ones in SC. Despite the complexity of the FSM-based functions in ISC, they are more accurate than their counterparts in SC [30].\n\n4 Synthesis of XNOR RNNs\n\n4.1 Binarization of the Hidden State\n\nIn [20], the recurrent weights of LSTMs and GRUs were binarized using batch normalization in both the input-to-hidden and hidden-to-hidden transformations. More specifically, the recurrent computation of the gate ft is performed as\n\nft = σ(BN(Wb_fh ht−1; φfh, 0) + BN(Wb_fx xt; φfx, 0) + bf), (16)\n\nwhere Wb_fh and Wb_fx are the binarized weights obtained by sampling from the Bernoulli distribution as follows\n\nWb = 2 × Bernoulli(P(W = 1)) − 1. (17)\n\nBN denotes the batch normalization transfer function such that\n\nBN(u; φ, γ) = γ + φ ⊗ (u − E(u))/√(V(u) + ε), (18)\n\nwhere u is the unnormalized vector and V(u) denotes its variance. The model parameters γ and φ determine the mean and variance of the normalized vector. The rest of the gate computations (i.e., it, ot and gt) are binarized in a similar fashion. So far, we have reviewed the method introduced in [20] to binarize the recurrent weights. We now extend this method to also binarize the hidden state vector h. To this end, we use the sign function. 
However, the derivative of the sign function is zero during backpropagation, making the gradients of the loss w.r.t. the parameters before the quantization function zero [17]. To address this issue, we estimate the derivative of the sign function as\n\n∂sign(h)/∂h ≈ 1 if |h| < 1, and 0 otherwise, (19)\n\nsimilar to [17]. In this way, the gradient information is preserved. Training LSTMs with this method allows us to perform the matrix-vector multiplications of the gate computations using XNOR operations. We use the extended LSTM (ELSTM) with binary weights and a binary hidden state vector as our baseline for the rest of this paper.\n\n4.2 Stochastic Representation of Gate Computations\n\nLet us only consider the recurrent computations for a single neuron of a baseline's gate as\n\ny = αh Σ_{j=1}^{dh} wh_j ⊙ hj + αx Σ_{j=1}^{dx} wx_j × xj + b, (20)\n\nwhere wh_j, wx_j, hj and xj are the element entries of the hidden-to-hidden weight vector wh ∈ {−1, 1}dh, the input-to-hidden weight vector wx ∈ {−1, 1}dx, the hidden vector h ∈ {−1, 1}dh and the input vector x ∈ Rdx, respectively. The bias is denoted as b ∈ R. The parameters αh ∈ R and αx ∈ R denote the scaling factors dictated by the binarization process. Note that the batch normalization processes are absorbed into the parameters αh, αx and b in Eq. (20). In most temporal tasks, the input vector x is one-hot encoded, replacing the vector-vector multiplication wx x with a simple indexing operation implemented by a lookup table. As such, let us merge this vector-vector multiplication into the bias as follows\n\ny = αh Σ_{j=1}^{dh} wh_j ⊙ hj + b. (21)\n\nConsidering the linear property of the expected value operator, we can rewrite Eq. 
(21) as follows\n\ny = αh dh Σ_{j=1}^{dh} (wh_j ⊙ hj)/dh + b = αh dh E[wh ⊙ h] + b = E[αh dh (wh ⊙ h) + b] = E[y]. (22)\n\nSo far, we have represented the output y ∈ R as a sequence of real numbers (i.e., y ∈ Rdh) where each entry of the vector y is either αh dh + b or −αh dh + b. Passing the vector y into the ISNG function generates the integral stochastic stream yISC such that\n\ny = E[y] = E[ISNG(y)] = E[yISC]. (23)\n\nNote that the integer range of the integral stream is equal to ⌈|αh dh| + |b|⌉. For instance, considering αh = 0.2, dh = 10, b = 0.5 and wh ⊙ h = {1, −1, 1, 1, 1, −1, 1, −1, −1, 1}, yISC = {3, −2, 2, 3, 2, −1, 3, −1, −2, 2} is an integral stochastic representation of y = {2.5, −1.5, 2.5, 2.5, 2.5, −1.5, 2.5, −1.5, −1.5, 2.5}, resulting in y = 0.9. To guarantee the stochasticity of the sequence yISC, we can permute the reading addresses of the memories storing the weights and the hidden state vector h. Note that Eq. (20) with an input vector x that is not one-hot encoded can also be represented as a stochastic stream by equalizing the vector lengths dx and dh. Assuming that dh > dx and dh is a multiple of dx, this can simply be obtained by repeating the input vector (i.e., x) dh/dx times, as the mean of the repeated vector remains unchanged. Of course, dh is a design parameter and can take any arbitrary value.\n\n4.3 Weighted FSM-Based Function\n\nSo far, we have shown that the output of each neuron can be represented as an integral stochastic stream, allowing us to perform the nonlinear functions using the FSM-based IStanh function. However, our experiments show that the IStanh function fails to resemble the hyperbolic tangent and sigmoid functions (see Figures 3(a) and 3(b)). 
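The worked example above (αh = 0.2, dh = 10, b = 0.5) can be checked numerically. The `isng` model below is our sketch of an integral stochastic number generator: it rounds each real entry to a neighboring integer at random so that the expected value is preserved, which is one simple way to realize the behavior the text describes:

```python
import numpy as np

rng = np.random.default_rng(3)

# Worked example from the text: each entry of y is αh*dh + b or -αh*dh + b.
alpha_h, dh, bias = 0.2, 10, 0.5
wh_h = np.array([1, -1, 1, 1, 1, -1, 1, -1, -1, 1])  # the vector wh ⊙ h
y = alpha_h * dh * wh_h + bias                       # entries: 2.5 or -1.5
mean_y = y.mean()                                    # 0.9, as in the text

def isng(y, rng):
    """ISNG sketch (assumption): map each real entry to floor(y) or
    floor(y) + 1 with probabilities chosen so that E[output] = y."""
    lo = np.floor(y)
    return (lo + (rng.random(y.shape) < (y - lo))).astype(int)

y_isc = isng(y, rng)  # integral stochastic stream, entries in {-2,-1} or {2,3}
```

Each 2.5 entry becomes a 2 or 3 and each −1.5 entry a −2 or −1, matching the integer range of the example stream yISC = {3, −2, 2, 3, 2, −1, 3, −1, −2, 2}.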
We attribute this problem to the even distribution of positive and negative integer elements in the vector yISC for both positive and negative values of y. More precisely, the vector yISC contains almost the same number of positive and negative integer entries since the expected value (i.e., the mean) of the vector wh ⊙ h is a small number. However, integral stochastic streams representing positive and negative real values are more likely to have more positive and negative entries, respectively. To address this issue, we propose a weighted FSM-based function, referred to as WIStanh, in which each state is associated with a weight. In the weighted FSM-based function, we use the same FSM that is used in the IStanh function. However, the output is determined by sampling from the weights associated with the states as follows\n\nyFSM_j = Bernoulli((w_kj + 1)/2), (24)\n\nwhere w_kj, kj and yFSM_j are entries of the weight vector w ∈ Rnm, the state counter vector k ∈ {0, 1, . . . , m × n − 1}dh and the WIStanh's output vector yFSM ∈ {0, 1}dh for j ∈ {1, . . . , dh}.\n\nFigure 3: The IStanh function approximating (a) the tanh and (b) the sigmoid function, and the WIStanh function approximating (c) the tanh and (d) the sigmoid function. Panels (a) and (c) plot 2 × IStanh(128, y) − 1 and 2 × WIStanh(128, y) − 1 against tanh(y); panels (b) and (d) plot IStanh(128, y) and WIStanh(128, y) against σ(y). The results were obtained by measuring the output of a single neuron for 12K input samples taken from the test set of the Penn Treebank dataset when performing character-level language modeling.\n\nTo obtain the weights approximating the FSM as the tanh function, we use linear regression such that\n\ntanh(y) = Σ_{q=0}^{m×n−1} pCq × wq, (25)\n\nwhere pCq denotes the probability of the occurrence of the state Cq (i.e., the qth state in the state set C0 → Cn×m−1). The sigmoid function can be obtained in a similar fashion. Note that we constrain the weight values to lie in the interval [−1, 1]. Figures 3(c) and 3(d) show the tanh and sigmoid functions implemented using the proposed WIStanh function, where the FSM was trained on the Penn Treebank dataset [31] when performing the character-level language modeling task. The early states of the trained FSM mainly contain values near −1 in the bipolar format (or zero in the unipolar format) whereas the weight values of the later states are close to 1 (see Figure 4), complying with the state values of the conventional integral stochastic FSMs. Note that we fine-tune the weights of our baseline model (i.e., ELSTM) with the proposed stochastic functions to account for the approximation error.\n\n4.4 XNOR LSTM\n\nLet us rewrite the gate computations of LSTMs using the proposed stochastic representation as\n\nFs_t = WIStanh(ISNG(Wb_fh hb_t−1 + Wb_fx xt + bf)),\nIs_t = WIStanh(ISNG(Wb_ih hb_t−1 + Wb_ix xt + bi)),\nOs_t = WIStanh(ISNG(Wb_oh hb_t−1 + Wb_ox xt + bo)),\nGs_t = WIStanh(ISNG(Wb_gh hb_t−1 + Wb_gx xt + bg)), (26)\n\nwhere Fs_t ∈ {0, 1}dh×dh, Is_t ∈ {0, 1}dh×dh, Os_t ∈ {0, 1}dh×dh and Gs_t ∈ {0, 1}dh×dh denote the stochastic representations of the gate vectors (i.e., ft, it, ot and gt), in which each entry of the vectors is represented as a binary stochastic stream generated by the WIStanh function. 
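One way to realize the synthesis step of Eq. (25) in software is to estimate the state-occupancy probabilities pCq empirically and fit the state weights by least squares. The sketch below uses this approach; the FSM model, the crude integral-stream generator, the sample counts and the post-hoc clipping are our assumptions, not the paper's exact training setup:

```python
import numpy as np

def state_occupancy(n_states, stream):
    """Run the saturating-counter FSM on an integer stream and return the
    fraction of time spent in each state (an empirical pCq for Eq. 25)."""
    counts = np.zeros(n_states)
    state = n_states // 2
    for step in stream:
        state = int(np.clip(state + step, 0, n_states - 1))
        counts[state] += 1
    return counts / len(stream)

rng = np.random.default_rng(5)
m, n = 4, 16
n_states = m * n
l = 2048

# Training pairs: integer streams with known mean y_val and target tanh(y_val).
rows, targets = [], []
for y_val in np.linspace(-3, 3, 25):
    stream = rng.integers(-m, m + 1, size=l) + 0.0   # crude integral stream
    stream += y_val - stream.mean()                  # center its mean on y_val
    stream = np.rint(stream).astype(int)
    rows.append(state_occupancy(n_states, stream))
    targets.append(np.tanh(y_val))

# Eq. (25): tanh(y) ≈ Σ_q pCq * wq  →  solve for w, then clip to [-1, 1].
P = np.array(rows)
w, *_ = np.linalg.lstsq(P, np.array(targets), rcond=None)
w = np.clip(w, -1.0, 1.0)
```

The fitted weights play the role of the state coefficients of Figure 4: sampling the output bit as Bernoulli((w[k] + 1)/2) per Eq. (24) then yields a binary stream whose bipolar mean tracks tanh.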
More precisely, the expected value of the gate matrices Fs_t, Is_t, Os_t and Gs_t over their second dimension is equal to the gate vectors ft, it, ot and gt, respectively. This stochastic representation of the gates allows us to perform the Hadamard products of the state computations using XNOR operations. More precisely, we can formulate the state computations as\n\nCs_t = Fs_t ⊙ SNG(ct−1) + Is_t ⊙ Gs_t, ct = IS2B(Cs_t), ht = S2B(Os_t ⊙ WIStanh(Cs_t)), (27)\n\nwhere Cs_t ∈ {0, 1, . . . , m′}dh×dh is an integral stochastic representation of the cell state vector ct. The SNG function generates a binary stochastic stream in the bipolar format (see Section 3.1), allowing us to replace the Hadamard products with XNOR operations. The S2B and IS2B functions convert binary and integral stochastic streams into a real number. In other words, these two functions find the expected value (i.e., the mean) of the stochastic streams by accumulating the stream entries and dividing the accumulated value by the stream length l = dh. Of course, setting the stream length dh to a power of two replaces the division with a simple shift operation.\n\nFigure 4 plots the weight value assigned to each of the 128 states, ranging from −1 to 1 over states 0 to 127, for the hyperbolic tangent and sigmoid functions.\n\nFigure 4: The weight values assigned for each state of the WIStanh to implement the tanh and σ functions.\n\nFigure 5: The main computational core of (a) the gate computations and (b) the state computations of the conventional binarized LSTM. The main computational core of (c) the gate computations and (d) the state computations in the proposed XNOR LSTM.\n\nWith our stochastic representation, the accumulations of the vector-matrix multiplication at the gate computations are now shifted to the end of the state computations (see Figure 5). Moreover, both the gate and state computations now involve several vector-vector products, performed using XNOR operators. As opposed to stochastic computing systems requiring a long latency to generate the stochastic streams, the stochastic bit streams of the proposed XNOR LSTM are already generated by the binarization of the gate computations. Therefore, the computational latency of the proposed XNOR LSTM is either the same as or even less than that of conventional quantized LSTMs, since using simple operators allows us to run the XNOR LSTM at higher frequencies.\n\n5 Experimental Results\n\nIn this section, we evaluate the performance of the proposed XNOR LSTM across different temporal tasks including character-level/word-level language modeling and question answering (QA). Note that the length of all stochastic streams (i.e., the parameter l) in our proposed method is equal to the size of the LSTMs (i.e., the parameter dh). For the character-level and word-level language modeling, we conduct our experiments on the Penn Treebank (PTB) [31] corpus. For the character-level language modeling (CLLM) experiment, we use an LSTM layer of size 1,000 on a sequence length of 100. We set the training parameters similar to [31]. The performance of CLLM models is evaluated in bits per character (BPC). For the word-level language modeling (WLLM) task, we train one layer of LSTM with 300 units on a sequence length of 35 while applying a dropout rate of 0.5. The performance of WLLM models is measured in terms of perplexity per word (PPW). 
For the QA task, we perform our experiment on the CNN corpus [32]. We also adopt the LSTM-based Attentive Reader architecture and its training parameters, introduced in [32]. We measure the performance of the QA task as an error rate (ER). Note that lower BPC, PPW and ER values indicate better performance. For a fair comparison with prior works, our XNOR-LSTM model for each task contains the same number of parameters as its previous counterparts. Table 1 summarizes the performance of our XNOR-LSTM models. We consider a typical semi-parallel architecture of LSTMs, in which each neuron is implemented using a multiply-and-accumulate (MAC) unit, to obtain the implementation cost reported in Table 1. Depending on the precision used for the gate and state computations, we replace the multiplier inside the MAC unit with simpler logic and report the implementation cost in terms of XNOR counts. In fact, we approximate the cost of a ternary/2-bit multiplication as two XNOR gates and the cost of a full-precision multiplication as 200 XNOR gates [17].

Table 1: Performance of the proposed XNOR-LSTM models vs. their quantized counterparts.

| Task | Metric | Baseline | LAB (ICLR'17 [19]) | AMQ (ICLR'18 [18]) | HitNet (NeurIPS'18 [26]) | ELSTM (ours) | XNOR (ours) |
|------|--------|----------|--------------------|--------------------|--------------------------|--------------|-------------|
|      | Precision of Gate Computations | FP | Binary | 2 bits | Ternary | Binary | Binary |
|      | Precision of State Computations | FP | FP | FP | FP | FP | Binary |
| CLLM | Accuracy (BPC) | 1.39 | 1.56 | NA | NA | 1.47 | 1.52 |
| CLLM | Size (MByte) | 16.8 | 0.525 | NA | NA | 0.525 | 0.525 |
| CLLM | Cost (XNOR count) | 1,400,000 | 604,000 | NA | NA | 604,000 | 7,000 |
| CLLM | Cost (No. clock cycles) | 1,000 | 1,000 | NA | NA | 1,000 | 1,000 |
| WLLM | Accuracy (PPW) | 91.5 | NA | 95.8 | 110.3 | 93.5 | 95.5 |
| WLLM | Size (KByte) | 2,880 | NA | 180 | 180 | 90 | 90 |
| WLLM | Cost (XNOR count) | 420,000 | NA | 182,400 | 182,400 | 181,200 | 2,100 |
| WLLM | Cost (No. clock cycles) | 300 | NA | 300 | 300 | 300 | 300 |
| QA   | Accuracy (ER) | 40.19 | NA | NA | NA | 40.4 | 43.8 |
| QA   | Size (MByte) | 7,471 | NA | NA | NA | 233 | 233 |
| QA   | Cost (XNOR count) | 1,433,600 | NA | NA | NA | 618,496 | 7,168 |
| QA   | Cost (No. clock cycles) | 256 | NA | NA | NA | 256 | 256 |

The experimental results show that our XNOR-LSTM models outperform the previous quantized LSTMs in terms of accuracy while requiring 86× fewer XNOR gates to perform the recurrent computations. While all the LSTM models in Table 1 require the same number of clock cycles to perform the recurrent computations, the inference time of our XNOR LSTMs is less than that of the other quantized works when running at higher frequencies due to their simpler operators. As a final note, the small gap between the XNOR and ELSTM models shows the approximation error caused by the use of stochastic computing.

6 Discussion

In Section 5, we only considered the implementation cost of our method in terms of XNOR operations, since our main focus was to replace the costly multipliers with simple XNOR gates while the rest of the computing elements (i.e., the adders and look-up tables) remain almost the same (see Figure 5). Note that since SNG and ISNG can be easily implemented with magnetic tunnel junction (MTJ) devices, which come almost at no cost compared to CMOS technologies [33], we excluded them from the implementation cost in Table 1. However, even if we include these units in our cost model, our stochastic-based implementation is still superior to its conventional binary-radix counterpart.
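The XNOR counts in Table 1 follow from the per-multiplier costs stated above, and the cost model can be sketched in a few lines. The per-multiplier costs (binary: 1 XNOR, ternary/2-bit: 2, full precision: 200) come from the text; the assumption of four gate multipliers and three element-wise state multipliers per neuron is our reading of the semi-parallel MAC architecture, not a figure stated in the paper.

```python
# Approximate per-multiplier costs in XNOR-gate equivalents (from the text).
XNOR_PER_MULT = {"binary": 1, "ternary": 2, "2-bit": 2, "fp": 200}

def xnor_cost(neurons, gate_prec, state_prec, gate_mults=4, state_mults=3):
    """Total XNOR count for a semi-parallel LSTM with one MAC per neuron.

    gate_mults/state_mults (four gate multipliers and three element-wise
    state multipliers per neuron) are our assumption for illustration.
    """
    per_neuron = (gate_mults * XNOR_PER_MULT[gate_prec]
                  + state_mults * XNOR_PER_MULT[state_prec])
    return neurons * per_neuron

# CLLM column of Table 1 (1,000-unit LSTM):
print(xnor_cost(1000, "fp", "fp"))          # 1,400,000 (Baseline)
print(xnor_cost(1000, "binary", "fp"))      # 604,000 (LAB / ELSTM)
print(xnor_cost(1000, "binary", "binary"))  # 7,000 (XNOR, ours)
# The 86x reduction over the binarized-gate, full-precision-state design:
print(xnor_cost(1000, "binary", "fp") // xnor_cost(1000, "binary", "binary"))
```

Under these assumptions the model reproduces the CLLM, WLLM (300 units) and QA (1,024 units) cost entries of Table 1 exactly.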
To this end, we have implemented both the non-stochastic binarized method (e.g., [26]) and our proposed method on a Xilinx Virtex-7 FPGA device, where each architecture contains 300 neurons. The implementation of our proposed method requires 66K FPGA slices while yielding a throughput of 3.2 TOPS @ 934 MHz, whereas the implementation of the non-stochastic binarized method requires 1.1M FPGA slices while yielding a throughput of 1.8 TOPS @ 515 MHz. Therefore, our proposed method outperforms its binarized counterpart by factors of 16.7× and 1.8× in terms of area and throughput, respectively, while accounting for all the required logic such as the SNG, ISNG and look-up tables. Note that the number of occupied slices denotes the area of the implemented design. Also, the implementation of our proposed method runs at a higher frequency since its critical path is shorter than that of the conventional method, owing to the simpler hardware of XNOR gates versus multipliers. Therefore, to the best of our knowledge, this work is the first successful application of SC where the SC-based implementation outperforms its conventional binary-radix counterpart in terms of both computational latency and area.

7 Conclusion

In this paper, we presented a method to synthesize XNOR LSTMs. To this end, we first represented the gate computations of LSTMs with binary weights and a binary hidden state vector $h$ in the stochastic computing domain, allowing us to replace the non-linear activation functions with stochastic FSM-based functions. We then proposed a new FSM-based function and its synthesis method to approximate the hyperbolic tangent and sigmoid functions. In this way, the gate activation values are represented as stochastic binary streams, allowing us to perform the multiplications of the state computations using simple XNOR gates.
To the best of our knowledge, this paper is the first to perform the multiplications of both the gate and state computations with XNOR operations.

References

[1] A. Karpathy and L. Fei-Fei, "Deep Visual-Semantic Alignments for Generating Image Descriptions," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 4, pp. 664–676, Apr. 2017. [Online]. Available: https://doi.org/10.1109/TPAMI.2016.2598339

[2] A. Graves and J. Schmidhuber, "Framewise Phoneme Classification with Bidirectional LSTM Networks," in Proceedings. 2005 IEEE International Joint Conference on Neural Networks, vol. 4, July 2005, pp. 2047–2052.

[3] T. Mikolov, M. Karafiát, L. Burget, J. Cernocký, and S. Khudanpur, "Recurrent Neural Network Based Language Model." ISCA, 2010, pp. 1045–1048. [Online]. Available: http://dblp.uni-trier.de/db/conf/interspeech/interspeech2010.html#MikolovKBCK10

[4] Y. Bengio, P. Simard, and P. Frasconi, "Learning Long-term Dependencies with Gradient Descent is Difficult," Trans. Neur. Netw., vol. 5, no. 2, pp. 157–166, Mar. 1994. [Online]. Available: http://dx.doi.org/10.1109/72.279181

[5] R. Pascanu, T. Mikolov, and Y. Bengio, "On the Difficulty of Training Recurrent Neural Networks," in Proceedings of the 30th International Conference on Machine Learning - Volume 28, ser. ICML'13. JMLR.org, 2013, pp. III-1310–III-1318. [Online]. Available: http://dl.acm.org/citation.cfm?id=3042817.3043083

[6] S. Hochreiter and J. Schmidhuber, "Long Short-Term Memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.

[7] K. Cho, B. van Merriënboer, Ç. Gülçehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, "Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Doha, Qatar: Association for Computational Linguistics, Oct. 2014, pp. 1724–1734. [Online]. Available: http://www.aclweb.org/anthology/D14-1179

[8] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, "EIE: Efficient Inference Engine on Compressed Deep Neural Network," in 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), June 2016, pp. 243–254.

[9] A. Ardakani, Z. Ji, and W. J. Gross, "Learning to Skip Ineffectual Recurrent Computations in LSTMs," in 2019 Design, Automation & Test in Europe Conference & Exhibition (DATE), March 2019, pp. 1427–1432.

[10] T. N. Sainath, B. Kingsbury, V. Sindhwani, E. Arisoy, and B. Ramabhadran, "Low-Rank Matrix Factorization for Deep Neural Network Training with High-Dimensional Output Targets." IEEE, 2013, pp. 6655–6659. [Online]. Available: http://dblp.uni-trier.de/db/conf/icassp/icassp2013.html#SainathKSAR13

[11] B. Liu, M. Wang, H. Foroosh, M. Tappen, and M. Penksy, "Sparse Convolutional Neural Networks," in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015, pp. 806–814.

[12] S. Han, J. Pool, J. Tran, and W. J. Dally, "Learning both Weights and Connections for Efficient Neural Networks," CoRR, vol. abs/1506.02626, 2015. [Online]. Available: http://arxiv.org/abs/1506.02626

[13] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, "Learning Structured Sparsity in Deep Neural Networks," in Proceedings of the 30th International Conference on Neural Information Processing Systems, ser. NIPS'16. USA: Curran Associates Inc., 2016, pp. 2082–2090. [Online]. Available: http://dl.acm.org/citation.cfm?id=3157096.3157329

[14] A. Ardakani, C. Condo, and W. J. Gross, "Sparsely-Connected Neural Networks: Towards Efficient VLSI Implementation of Deep Neural Networks," 5th International Conference on Learning Representations (ICLR), 2016. [Online]. Available: https://arxiv.org/abs/1611.01427

[15] M. Courbariaux, Y. Bengio, and J. David, "BinaryConnect: Training Deep Neural Networks with binary weights during propagations," CoRR, vol. abs/1511.00363, 2015.

[16] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, "XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks," CoRR, vol. abs/1603.05279, 2016. [Online]. Available: http://dblp.uni-trier.de/db/journals/corr/corr1603.html#RastegariORF16

[17] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, "Binarized Neural Networks," in Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, Eds. Curran Associates, Inc., 2016, pp. 4107–4115. [Online]. Available: http://papers.nips.cc/paper/6573-binarized-neural-networks.pdf

[18] C. Xu, J. Yao, Z. Lin, W. Ou, Y. Cao, Z. Wang, and H. Zha, "Alternating Multi-bit Quantization for Recurrent Neural Networks," in International Conference on Learning Representations, 2018. [Online]. Available: https://openreview.net/forum?id=S19dR9x0b

[19] L. Hou, Q. Yao, and J. T. Kwok, "Loss-aware Binarization of Deep Networks," CoRR, vol. abs/1611.01600, 2016. [Online]. Available: http://arxiv.org/abs/1611.01600

[20] A. Ardakani, Z. Ji, S. C. Smithson, B. H. Meyer, and W. J. Gross, "Learning Recurrent Binary/Ternary Weights," in International Conference on Learning Representations, 2019. [Online]. Available: https://openreview.net/forum?id=HkNGYjR9FX

[21] B. R. Gaines, Stochastic Computing Systems. Boston, MA: Springer US, 1969, pp. 37–172. [Online]. Available: https://doi.org/10.1007/978-1-4899-5841-9_2

[22] Z. Lin, M. Courbariaux, R. Memisevic, and Y. Bengio, "Neural Networks with Few Multiplications," CoRR, vol. abs/1510.03009, 2015.

[23] C. Zhu, S. Han, H. Mao, and W. J. Dally, "Trained Ternary Quantization," CoRR, vol. abs/1612.01064, 2016. [Online]. Available: http://arxiv.org/abs/1612.01064

[24] F. Li and B. Liu, "Ternary Weight Networks," CoRR, vol. abs/1605.04711, 2016. [Online]. Available: http://arxiv.org/abs/1605.04711

[25] S. Zhou, Z. Ni, X. Zhou, H. Wen, Y. Wu, and Y. Zou, "DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients," CoRR, vol. abs/1606.06160, 2016. [Online]. Available: http://arxiv.org/abs/1606.06160

[26] P. Wang, X. Xie, L. Deng, G. Li, D. Wang, and Y. Xie, "HitNet: Hybrid Ternary Recurrent Neural Network," in Advances in Neural Information Processing Systems 31. Curran Associates, Inc., 2018, pp. 604–614. [Online]. Available: http://papers.nips.cc/paper/7341-hitnet-hybrid-ternary-recurrent-neural-network.pdf

[27] L. Hou and J. T. Kwok, "Loss-aware Weight Quantization of Deep Networks," in International Conference on Learning Representations, 2018. [Online]. Available: https://openreview.net/forum?id=BkrSv0lA-

[28] A. Alaghi and J. P. Hayes, "Survey of Stochastic Computing," ACM Trans. Embed. Comput. Syst., vol. 12, no. 2s, pp. 92:1–92:19, May 2013. [Online]. Available: http://doi.acm.org/10.1145/2465787.2465794

[29] B. D. Brown and H. C. Card, "Stochastic Neural Computation. I. Computational Elements," IEEE Transactions on Computers, vol. 50, no. 9, pp. 891–905, Sep. 2001.

[30] A. Ardakani, F. Leduc-Primeau, N. Onizawa, T. Hanyu, and W. J. Gross, "VLSI Implementation of Deep Neural Network Using Integral Stochastic Computing," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 25, no. 10, pp. 2688–2699, Oct. 2017.

[31] M. P. Marcus, M. A. Marcinkiewicz, and B. Santorini, "Building a Large Annotated Corpus of English: The Penn Treebank," Comput. Linguist., vol. 19, no. 2, pp. 313–330, June 1993. [Online]. Available: http://dl.acm.org/citation.cfm?id=972470.972475

[32] K. M. Hermann, T. Kocisky, E. Grefenstette, L. Espeholt, W. Kay, M. Suleyman, and P. Blunsom, "Teaching Machines to Read and Comprehend," in Advances in Neural Information Processing Systems 28. Curran Associates, Inc., 2015, pp. 1693–1701.

[33] N. Onizawa, D. Katagiri, W. J. Gross, and T. Hanyu, "Analog-to-stochastic converter using magnetic tunnel junction devices for vision chips," IEEE Transactions on Nanotechnology, vol. 15, no. 5, pp. 705–714, Sep. 2016.