{"title": "On-Chip Compensation of Device-Mismatch Effects in Analog VLSI Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 441, "page_last": 448, "abstract": null, "full_text": "     On-Chip Compensation of Device-Mismatch\n        Effects in Analog VLSI Neural Networks\n\n\n\n                                        Miguel Figueroa\n             Department of Electrical Engineering, Universidad de Concepcion\n                           Casilla 160-C, Correo 3, Concepcion, Chile\n                                     mfigueroa@die.udec.cl\n\n                                 Seth Bridges and Chris Diorio\n                    Computer Science & Engineering, University of Washington\n                           Box 352350, Seattle, WA 98195-2350, USA\n                                {seth, diorio}@cs.washington.edu\n\n\n                                           Abstract\n          Device mismatch in VLSI degrades the accuracy of analog arithmetic\n          circuits and lowers the learning performance of large-scale neural net-\n          works implemented in this technology. We show compact, low-power\n          on-chip calibration techniques that compensate for device mismatch. Our\n          techniques enable large-scale analog VLSI neural networks with learn-\n          ing performance on the order of 10 bits. We demonstrate our techniques\n          on a 64-synapse linear perceptron learning with the Least-Mean-Squares\n          (LMS) algorithm, and fabricated in a 0.35µm CMOS process.\n\n\n1    Introduction\n\nModern embedded and portable electronic systems operate in unknown and mutating envi-\nronments, and use adaptive filtering and machine learning techniques to discover the statis-\ntics of the data and continuously optimize their performance. 
Artificial neural networks\nare an attractive substrate for implementing these techniques, because their regular com-\nputation and communication structures make them a good match for custom VLSI imple-\nmentations. Portable systems operate under severe power dissipation and space constraints,\nand VLSI implementations provide a good tradeoff between computational throughput and\npower/area cost. More specifically, analog VLSI neural networks perform their computa-\ntion using the physical properties of transistors with orders of magnitude less power and\ndie area than their digital counterparts. Therefore, they could enable large-scale real-time\nadaptive signal processing systems on a single die with minimal power dissipation.\n\nDespite the promises delivered by analog VLSI, an important factor has prevented the suc-\ncess of large-scale neural networks using this technology: device mismatch. Gradients in\nthe parameters of the fabrication process create variations in the physical properties of sili-\ncon devices across a single chip. These variations translate into gain and offset mismatches\nin the arithmetic blocks, which severely limit the overall performance of the system. As\na result, the accuracy of analog implementations rarely exceeds 5-6 bits, even for small-\nscale networks. This limitation renders these implementations useless for many important\napplications. Although it is possible to combat some of these effects using careful design\ntechniques, they come at the cost of increased power and area, making an analog solution\nless attractive.\n\n\f\n          (a) Single-layer LMS perceptron.                (b) Block diagram for the synapse.\n\nFigure 1: A single-layer perceptron and synapse. (a) The output z of the perceptron is the inner\nproduct between the input and weight vectors. The LMS algorithm updates the weights based on the\ninputs and an error signal e. (b) The synapse stores the weight in an analog memory cell. 
A Gilbert\nmultiplier computes the product between the input and the weight and outputs a differential current.\nThe LMS block updates the weight.\n\n\nWe have built a 64-synapse analog neural network with a learning performance of 10 bits,\nrepresenting an improvement of more than one order of magnitude over that of traditional\nanalog designs, with a modest increase in power and die area. We fabricated our network\nusing a double-poly, 4-metal 0.35µm CMOS process available from MOSIS. We achieve\nthis performance by locally calibrating the critical analog blocks after circuit fabrication\nusing a combination of one-time (or periodic) and continuous calibration using the same\nfeedback as the network's learning algorithm. We chose the Least Mean Squares (LMS) al-\ngorithm because of its simplicity and wide applicability in supervised learning techniques\nsuch as adaptive filtering, adaptive inverse control, and noise canceling. Moreover, sev-\neral useful unsupervised-learning techniques, such as adaptive orthogonalization, princi-\npal components analysis (PCA), independent components analysis (ICA) and decision-\nfeedback learning, use simple generalizations of LMS.\n\n\n2         A linear LMS perceptron\n\nFig. 1(a) shows our system architecture, a linear perceptron with scalar output that performs\nthe function:\n\n                  z(i) = b w_0(i) + Σ_{j=1}^{N} x_j(i) w_j(i)                         (1)\n\nwhere i represents time, z(i) is the output, xj(i) are the inputs, wj(i) are the synaptic\nweights, and b is a constant bias input. We clarify the role of b in Section 3.1. After each\npresentation of the input, the LMS algorithm updates the weights using the learning rule:\n\n          w_j(i + 1) = w_j(i) + η x_j(i) e(i),     j = 0 . . . N,   x_0(i) = b         (2)\n\nwhere η is a constant learning rate, and e(i) is the error between the output and a reference\nsignal r(i) such that e(i) = r(i) - z(i).\n\n\n3         The synapse\n\nFig. 1(b) shows a block diagram of our synapse. We store the synaptic weights in a memory\ncell that implements nonvolatile analog storage with linear updates. A circuit transforms\nthe single-ended voltage output of the memory cell (Vw) into a differential voltage signal\n(V_w^+, V_w^-), with a constant common mode. A Gilbert multiplier computes the 4-quadrant\nproduct between this signal and the input (also represented as a differential voltage V_x^+,\nV_x^-). The output is a differential analog current pair (I_o^+, I_o^-), which we sum across all\nsynapses by connecting them to common wires.\n\n\f\n       (a) Measured output vs. input value.                    (b) Measured output vs. weight value.\n\nFigure 2: Gilbert multiplier response for 8 synapses. (a) Our multiplier maximizes the linearity of\nxj, achieving a linear range of 600mV differential. Gain mismatch is 2:1 and offset mismatch is up to\n200mV. (b) Our multiplier maximizes weight range at the cost of weight linearity (1V single-ended,\n2V differential). The gain variation is lower, but the offset mismatch exceeds 60% of the range.\n\n\nBecause we represent the perceptron's output and the reference with differential currents,\nwe can easily compute the error using simple current addition. We then transform (off-chip\nin our current implementation) the resulting analog error signal using a pulse-density mod-\nulation (PDM) representation [1]. 
In this scheme, the value of the error is represented as\nthe difference between the density (frequency) of two fixed-width, fixed-amplitude digital\npulse trains (P_e^+ and P_e^- in Fig. 1(b)). These properties make the PDM representation\nlargely immune to amplitude and jitter noise. The performance of the perceptron is highly\nsensitive to the resolution of the error signal; therefore the PDM representation is a good\nmatch for it. The LMS block in the synapse takes the error and input values and computes\nupdate pulses (also using PDM) according to Eqn. 2.\n\nIn the rest of this section, we analyze the effects of device mismatch on the performance\nof the major blocks, discuss their impact on overall system performance, and present the\ntechniques that we developed to deal with them. We illustrate with experimental results\ntaken from a silicon implementation of the perceptron in a 0.35µm CMOS process. All data\npresented in this paper, unless otherwise stated, comes from this silicon implementation.\n\n3.1    Multiplier\n\nA Gilbert multiplier implements a nonlinear function of the product between two differ-\nential voltages. Device mismatch in the multiplier has two main effects: First, it creates\noffsets in the inputs. Second, mismatch across the entire perceptron creates variations in\nthe offsets, gain, and linearity of the product. Thus, Eqn. 
1 becomes:\n\n     z(i) = Σ_{j=0}^{N} a_j f_j^x(x_j(i) - d_j^x) f_j^w(w_j(i) - d_j^w),     x_0(i) = b              (3)\n\nwhere a_j represents the gain mismatch between multipliers, f_j^x and f_j^w are the nonlineari-\nties applied to the inputs and weights (also mismatched across the perceptron), and d_j^x and\nd_j^w are the mismatched offsets of the inputs and weights.\n\nOur analysis and simulations of the LMS algorithm [2] determine that the performance of\nthe algorithm is much more sensitive to the linearity of f_j^x than to the linearity of f_j^w, be-\ncause the inputs vary over their dynamic range with a large bandwidth, while the bandwidth\nof the weights is much lower than the adaptation time-constant. Therefore, the adaptation\ncompensates for mild nonlinearities in the weights as long as f_j^w remains a monotonic odd\nfunction [2]. Consequently, we sized the transistors in the Gilbert multiplier to maximize\nthe linearity of f_j^x, but paid less attention (in order to minimize size and power) to f_j^w.\nFig. 2(a) shows the output of 8 synapses in the system as a function of the input value. The\n\n\f\n                (a) Memory cell circuit.                       
(b) Measured weight updates\n\nFigure 3: A simple PDM analog memory cell. (a) We store each weight as nonvolatile analog charge\non the floating gate FG. The weight increments and decrements are proportional to the density of the\npulses on Pinc and Pdec. (b) Memory updates as a function of the increment and decrement pulse\ndensities for 8 synapses. The updates show excellent linearity (10 bits), but also poor matching both\nwithin a synapse and between synapses.\n\n\nresponse is highly linear. The gain mismatch is about 2:1, but the LMS algorithm naturally\nabsorbs it into the learned weight value. Fig. 2(b) shows the multiplier output as a function\nof the single-ended weight value Vw. The linearity is visibly worse in this case, but the\nLMS algorithm compensates for it.\n\nThe graphs in Fig. 2 also show the input and weight offsets. Because of the added\nmismatch in the single-ended to differential converter, the weights present an offset of up\nto 300mV, or 30% of the weight range. The LMS algorithm will also compensate for this\noffset by absorbing it into the weight, as shown in the analysis of [3] for backpropagation\nneural networks. However, this will only occur if the weight range is large enough to\naccommodate the offset mismatch. Consequently, we sacrifice weight linearity to increase\nthe weight range. Input offsets pose a harder problem, though. The offsets are small (up to\n100mV), but because of the restricted input range (to maximize linearity), they are large\nenough to dramatically affect the learning performance of the perceptron. Our solution was\nto use the bias synapse w0 to compensate for the accumulated input offset. Assuming that\nthe multiplier is linear, offsets translate into nonzero-mean inputs, which a bias synapse\ntrained with LMS can remove as demonstrated in [4]. 
To guarantee sufficient gain, we\nprovide a stronger bias current to the multiplier in the bias synapse.\n\n3.2    Memory cell\n\nA synapse transistor [5] is a silicon device that provides compact, accurate, nonvolatile\nanalog storage as charge on its floating gate. Fowler-Nordheim tunneling adds charge to\nthe floating gate and hot-electron injection removes charge. Both mechanisms can be used\nto accurately update the stored value during normal device operation. Because of these\nproperties, synapse transistors have been a popular choice for weight storage in recent\nsilicon learning systems [6, 7].\n\nDespite the advantages listed above, it is hard to implement linear learning rules such as\nLMS using tunneling and injection. This is because their dynamics are exponential with re-\nspect to their control variables (floating-gate voltage, tunneling voltage and injection drain\ncurrent), which naturally lead to weight-dependent nonlinear update rules. This is an im-\nportant problem because the learning performance of the perceptron is strongly dependent\non the accuracy of the weight updates; therefore distortions in the learning rule will degrade\nperformance. The initial design of our memory cell, shown in Fig. 3(a) and based on the\nwork presented in [8], solves this problem: We store the analog weight as charge on the\nfloating gate FG of synapse transistor M1. Pulses on Pdec and Pinc activate tunneling and\ninjection and add or remove charge from the floating gate, respectively. The operational\n\n\f\n        (a) Calibrated memory cell circuit.           (b) Measured calibrated weight updates.\n\nFigure 4: PDM memory cell with local calibration. (a) We first match the tunneling rate across all\nsynapses by locally changing the voltage at the floating gate FGdec. Then, we modify the injection\nrate to match the local tunneling rate using the floating gate FGinc. 
(b) The calibrated updates are\nsymmetric and uniform within 9-10 bits.\n\n\namplifier sets the floating-gate voltage at the global voltage Vbias. Capacitor Cw integrates\nthe charge updates, changing the output Vout by ΔVout = ΔQ/C. Because the floating-\ngate voltage is constant and so are the pulse widths and amplitudes, the magnitude of the\nupdates depends on the density of the pulses Pinc and Pdec. Fig. 3(b) shows the magnitude\nof the weight updates as a function of the density of pulses in Pinc (positive slopes) and\nPdec (negative slopes) for 8 synapses. The linearity of the updates, measured as the integral\nnonlinearity (INL) of the transfer functions depicted in Fig. 3(b), exceeds 10 bits.\n\nFig. 3(b) highlights an important problem caused by device mismatch: the strengths of\ntunneling and injection are poorly balanced within a synapse (the slopes show up to a 4:1\nmismatch). Moreover, they show a variation of more than 3:1 across different synapses\nin the perceptron. This translates into asymmetric update rules that are also nonuniform\nacross synapses. The local asymmetry of the learning rate translates into offsets between\nthe learned and target weights, degrading the learning performance of the perceptron. The\nnonuniformity between learning rates across the perceptron changes Eqn. 2 into:\n\n          w_j(i + 1) = w_j(i) + η_j x_j(i) e(i),     j = 0 . . . N,   x_0(i) = b              (4)\n\nwhere η_j are the different learning rates for each synapse. Generalizing the conventional\nstability analysis of LMS [9], we can show that the condition for the stability of the weight\nvector is: 0 < η_max < 1/λ_max, where λ_max is the maximal eigenvalue of the input's\ncorrelation matrix and η_max = max_j(η_j). 
Therefore, learning rate mismatch does not\naffect the accuracy of the learned weights, but it does slow down convergence because we\nneed to scale all learning rates globally to limit the value of the maximal rate.\n\nTo maintain good learning performance and convergence speed, we need to make learning\nrates symmetric and uniform across the perceptron. We modified the design of the memory\ncell to incorporate local calibration mechanisms that achieve this goal. Fig. 4(a) shows our\nnew design. The first step is to equalize tunneling rates: The voltage at the new floating gate\nFGdec sets the voltage at the floating-gate FG and controls the ratio between the strength\nof tunneling and injection onto FG: Raising the voltage at FGdec increases the drain-to-\nchannel voltage and reduces the gate-to-tunneling-junction voltage at M1, thus increasing\ninjection efficiency and reducing tunneling strength [5]. We set the voltage at FGdec by first\ntunneling using the global line erase dec, and then injecting on transistor M3 by lowering\nthe local line set dec to equalize the tunneling rates across all synapses. To compare the\ntunneling rates, we issue a fixed number of pulses at Pdec and compare the memory cell\noutputs using a double-sampling comparator (off-chip in the current implementation). To\ncontrol the injection rate, we add transistor M2, which limits the current through M1 and\n\n\f\n                      (a) LMS block.                             (b) Measured RMS error.\n\nFigure 5: LMS block at each synapse. (a) The difference between the densities of Pinc and Pdec\nis proportional to the product between the input and the error, and thus constitutes an LMS update\nrule. (b) RMS error for a single synapse with a constant input and reference, including a calibrated\nmemory cell with symmetric updates, a simple synapse with asymmetric updates, and a simulated\nideal synapse.\n\n\nthus the injection strength of the pulse at Pinc. 
We control the current limit with the voltage\nat the new floating gate FGinc: we first remove electrons from the floating gate using the\nglobal line erase inc. Then we inject on transistor M4 by lowering the local line set inc to\nmatch the injection rates across all synapses. The entire process is controlled by a simple\nstate machine (also currently off-chip). Fig. 4(b) shows the tunneling and injection rates\nafter calibration as a function of the density of pulses Pinc and Pdec. Comparing the graph\nto Fig. 3(b), it is clear that the update rates are now symmetric and uniform across all\nsynapses (they match within 9-10 bits). Note that we could also choose to calibrate just for\nlearning rate symmetry and not uniformity across synapses, thus eliminating the floating\ngate FGinc and its associated circuitry. This optimization would result in approximately a\n25% reduction in memory cell area (6% reduction in total synapse area), but would also\ncause an increase of more than 200% in convergence time, as illustrated in Section 4.\n\n3.3    The LMS block\n\nFig. 5(a) shows a block diagram of the LMS-update circuit at each synapse. A pulse-\ndensity modulator [10] transforms the synaptic input into a pair of digital pulse-trains of\nfixed width (P_x^+, P_x^-). The value of the input is represented as the difference between the\ndensity (frequency) of the pulse trains. We implement the memory updates of Eqn. 
2 by\ndigitally combining the input and error pulses (P_e^+, P_e^-) such that:\n\n                       Pinc    =    (P_x^+ AND P_e^+) OR (P_x^- AND P_e^-)                  (5)\n                       Pdec    =    (P_x^+ AND P_e^-) OR (P_x^- AND P_e^+)                  (6)\n\nThis technique was used previously in a synapse-transistor based circuit that learns corre-\nlations between signals [11], and to multiply and add signals [1]. If the pulse trains are\nasynchronous and sparse, then using Eqn. 5 and Eqn. 6 to increment and decrement the\nsynaptic weight implements the LMS learning rule of Eqn. 2.\n\nTo validate our design, we first trained a single synapse with a DC input to learn a constant\nreference. Because the input is constant, the linearity and offsets in the input signal do not\naffect the learning performance; therefore this experiment tests the resolution of the feed-\nback path (LMS circuit and memory cell) isolated from the analog multipliers. Fig. 5(b)\nshows the evolution of the RMS value of the error for a synapse using the original and\ncalibrated memory cells. The resolution of the pulse-density modulators is about 8 bits,\nwhich limits the resolution of the error signal. We also show the RMS error for a sim-\nulated (ideal) synapse learning from the same error. We plot the results in a logarithmic\nscale to highlight the differences between the three curves. The RMS error of the cali-\nbrated synapse converges to about 0.1nA. Computing the equivalent resolution in bits as\n\n\f\n             (a) Measured RMS error.                       (b) Measured weight evolution.\n\nFigure 6: Results for 64-synapse experiment. 
(a) Asymmetric learning rates and multiplier offsets\nlimit the output resolution to around 3 bits. Symmetric learning rates and a bias synapse bring\nthe resolution up to more than 10 bits, and uniform updates reduce convergence time. (b) Synapse 4\nshows a larger mismatch than synapse 1 and therefore it deviates from its theoretical target value\nto compensate. The bias synapse in the VLSI perceptron converges to a value that compensates for\noffsets in the inputs xj to the multipliers.\n\n\nrb = -log2(RMS error / (0.5 × output range)), we find that for a 2µA output range, this error represents an\noutput resolution of about 13 bits. The difference with the simulated synapse is due to the\ndiscrete weight updates in the PDM memory cell. Without calibration, the RMS error con-\nverges to 0.4nA (or about 11 bits), due to the offset in the learned weights introduced by the\nasymmetry in the learning rate. As discussed in Section 4, the degradation of the learning\nperformance in a larger-scale system due to asymmetric learning rates is drastically larger.\n\n\n4    A 64-synapse perceptron\n\nTo test our techniques in a larger-scale system, we fabricated a 64-synapse linear percep-\ntron in a 0.35µm CMOS process. The circuit uses 0.25mm² of die area and dissipates\n200µW. Fig. 6(a) shows the RMS error of the output in a logarithmic scale as we introduce\ndifferent compensation techniques. We used random zero-mean inputs selected from a uni-\nform distribution over the entire input range, and trained the network using the response\nfrom a simulated perceptron with ideal multipliers and fixed weights as a reference. In\nour first experiments, we trained the network without using any compensation. The error\nsettles to 10µA RMS, which corresponds to an output resolution of about 3 bits for a full\nrange of 128µA differential. 
Calibrating the synapses for symmetric learning rates only\nimproves the RMS error to 5µA (4 bits), but the error introduced by the multiplier offsets\nstill dominates the residual error. Introducing the bias synapse and keeping the learning\nrates symmetric (but nonuniform across the perceptron) compensates for the offsets and\nbrings the error down to 60nA RMS, corresponding to an output resolution better than 10\nbits. Further calibrating the synapses to achieve uniform, symmetric learning rates main-\ntains the same learning performance, but reduces convergence time to less than one half, as\npredicted by the analysis in Section 3.2. A simulated software perceptron with ideal multi-\npliers and LMS updates that uses an error signal of the same resolution as our experiments\ngives an upper bound of just under 12 bits for the learning performance.\n\nFig. 6(b) depicts the evolution of selected weights in the silicon perceptron with on-chip\ncompensation and the software version. The graph shows that synapse 1 in our VLSI im-\nplementation suffers from little mismatch, and therefore its weight virtually converges to\nthe theoretical value given by the software implementation. Because the PDM updates are\ndiscrete, the weight shows a larger oscillation around its target value than the software ver-\nsion. Synapse 4 shows a larger mismatch; therefore it converges to a visibly different value\nfrom the theoretical in order to compensate for it. The bias weight in the software percep-\n\n\f\ntron converges to zero because the inputs have zero mean. In the VLSI perceptron, input\noffsets in the multipliers create nonzero-mean inputs; therefore the bias synapse converges\nto a value that compensates for the aggregated effect of the offsets. 
The normalized value\nof -1.2 reflects the gain boost given to this multiplier to increase its dynamic range.\n\n\n5    Conclusions\n\nDevice mismatch prevents analog VLSI neural networks from delivering good learning\nperformance for large-scale applications. We identified the key effects of mismatch and\npresented on-chip compensation techniques. Our techniques rely both on one-time (or\nperiodic) calibration, and on the adaptive operation of the system to achieve continuous\ncalibration. Combining these techniques with careful circuit design enables an improve-\nment of more than one order of magnitude in accuracy compared to traditional analog\ndesigns, at the cost of an off-line calibration phase and a modest increase in die area and\npower. We illustrated our techniques with a 64-synapse analog-VLSI linear perceptron that\nadapts using the LMS algorithm. Future work includes extending these techniques to un-\nsupervised learning algorithms such as adaptive orthogonalization, principal components\nanalysis (PCA) and independent components analysis (ICA).\n\nAcknowledgements\n\nThis work was financed in part by the Chilean government through FONDECYT grant\n#1040617. We fabricated our chips through MOSIS.\n\n\nReferences\n\n [1] Y. Hirai and K. Nishizawa, \"Hardware implementation of a PCA learning network by an asyn-\n     chronous PDM digital circuit,\" in IEEE-INNS-ENNS International Joint Conference on Neural\n     Networks (IJCNN), vol. 2, pp. 65-70, 2000.\n\n [2] M. Figueroa, Adaptive Signal Processing and Correlational Learning in Mixed-Signal VLSI.\n     Ph.D. Thesis, University of Washington, 2005.\n\n [3] B. K. Dolenko and H. C. Card, \"Tolerance to analog hardware of on-chip learning in back-\n     propagation networks,\" IEEE Transactions on Neural Networks, vol. 6, no. 5, pp. 1045-1052,\n     1995.\n\n [4] F. Palmieri, J. Zhu, and C. 
Chang, \"Anti-Hebbian learning in topologically constrained linear\n     networks: A tutorial,\" IEEE Transactions on Neural Networks, vol. 4, no. 5, pp. 748-761, 1993.\n\n [5] C. Diorio, P. Hasler, B. Minch, and C. Mead, \"A complementary pair of four-terminal silicon\n     synapses,\" Analog Integrated Circuits and Signal Processing, vol. 13, no. 1/2, pp. 153-166,\n     1997.\n\n [6] C. Diorio, D. Hsu, and M. Figueroa, \"Adaptive CMOS: from biological inspiration to systems-\n     on-a-chip,\" Proceedings of the IEEE, vol. 90, no. 3, pp. 345-357, 2002.\n\n [7] J. Dugger and P. Hasler, \"Improved correlation learning rule in continuously adapting floating-\n     gate arrays using logarithmic pre-distortion of input and learning signals,\" in IEEE Intl. Sympo-\n     sium on Circuits and Systems (ISCAS), vol. 2, pp. 536-539, 2002.\n\n [8] C. Diorio, S. Mahajan, P. Hasler, B. A. Minch, and C. Mead, \"A high-resolution nonvolatile\n     analog memory cell,\" in IEEE Intl. Symp. on Circuits and Systems, vol. 3, pp. 2233-2236,\n     1995.\n\n [9] B. Widrow and E. Walach, Adaptive Inverse Control. Upper Saddle River, NJ: Prentice-Hall,\n     1996.\n\n[10] C. Mead, Analog VLSI and Neural Systems. Reading, MA: Addison-Wesley, 1989.\n\n[11] A. Shon, D. Hsu, and C. Diorio, \"Learning spike-based correlations and conditional probabili-\n     ties in silicon,\" in Neural Information Processing Systems (NIPS), (Vancouver, BC), 2001.\n\n\f\n", "award": [], "sourceid": 2690, "authors": [{"given_name": "Miguel", "family_name": "Figueroa", "institution": null}, {"given_name": "Seth", "family_name": "Bridges", "institution": null}, {"given_name": "Chris", "family_name": "Diorio", "institution": null}]}