{"title": "Adaptive Quantization and Density Estimation in Silicon", "book": "Advances in Neural Information Processing Systems", "page_first": 1107, "page_last": 1114, "abstract": "", "full_text": " \n\nAdaptive Quantization and Density \n\nEstimation in Silicon \n\n \n\nDavid Hsu Seth Bridges Miguel Figueroa Chris Diorio \n\n University of Washington \n 114 Sieg Hall, Box 352350 \n\n Department of Computer Science and Engineering \n \n \n \n {hsud, seth, miguel, diorio}@cs.washington.edu \n\n Seattle, WA 98195-2350 USA \n\n \n\n \n\nAbstract \n\nWe present the bump mixture model, a statistical model for analog \ndata where the probabilistic semantics, inference, and learning \nrules derive from low-level transistor behavior. The bump mixture \nmodel relies on translinear circuits to perform probabilistic infer-\nence, and floating-gate devices to perform adaptation. This system \nis low power, asynchronous, and fully parallel, and supports vari-\nous on-chip learning algorithms. In addition, the mixture model can \nperform several tasks such as probability estimation, vector quanti-\nzation, classification, and clustering. We tested a fabricated system \non clustering, quantization, and classification of handwritten digits \nand show performance comparable to the E-M algorithm on mix-\ntures of Gaussians. \n\n1 Introduction \n\nMany system-on-a-chip applications, such as data compression and signal process-\ning, use online adaptation to improve or tune performance. These applications can \nbenefit from the low-power compact design that analog VLSI learning systems can \noffer. Analog VLSI learning systems can benefit immensely from flexible learning \nalgorithms that take advantage of silicon device physics for compact layout, and that \nare capable of a variety of learning tasks. One learning paradigm that encompasses a \nwide variety of learning tasks is density estimation, learning the probability \ndistribution over the input data. 
A silicon density estimator can provide a basic template for VLSI systems for feature extraction, classification, adaptive vector quantization, and more. \n\nIn this paper, we describe the bump mixture model, a statistical model that describes the probability distribution function of analog variables using low-level transistor equations. We intend the bump mixture model to be the silicon version of the mixture of Gaussians [1], one of the most widely used statistical methods for modeling the probability distribution of a collection of data. Mixtures of Gaussians appear in many contexts, from radial basis functions [1] to hidden Markov models [2]. In the bump mixture model, probability computations derive from translinear circuits [3] and learning derives from floating-gate device equations [4]. The bump mixture model can perform different functions such as quantization, probability estimation, and classification. In addition, this VLSI mixture model can implement multiple learning algorithms using different peripheral circuitry. Because the equations for system operation and learning derive from natural transistor behavior, we can build large bump mixture models with millions of parameters on a single chip. We have fabricated a bump mixture model and tested it on clustering, classification, and vector quantization of handwritten digits. The results show that the fabricated system performs comparably to mixtures of Gaussians trained with the E-M algorithm [1]. \n\nOur work builds upon several trends of research in the VLSI community. The results in this paper complement recent work on probability propagation in analog VLSI [5-7]. These previous systems, intended for decoding applications in communication systems, model special forms of probability distributions over discrete variables, and do not incorporate learning. 
In contrast, the bump mixture model performs inference and learning on probability distributions over continuous variables. The bump mixture model significantly extends previous results on floating-gate circuits [4]. Our system is a fully realized floating-gate learning algorithm that can be used for vector quantization, probability estimation, clustering, and classification. Finally, the mixture model\u2019s architecture is similar to many previous VLSI vector quantizers [8, 9]. We can view the bump mixture model as a VLSI vector quantizer with well-defined probabilistic semantics. Computations such as probability estimation and maximum-likelihood classification have a natural statistical interpretation under the mixture model. In addition, because we rely on floating-gate devices, the mixture model does not require a refresh mechanism, unlike previous learning VLSI quantizers. \n\n2 The adaptive bump circuit \n\nThe adaptive bump circuit [4], depicted in Fig.1(a-b), forms the basis of the bump mixture model. This circuit is slightly different from previous versions reported in the literature. Nevertheless, the high-level functionality remains the same: the adaptive bump circuit computes the similarity between a stored variable and an input, and adapts to increase the similarity between the stored variable and the input. \n\nFig.1(a) shows the computation portion of the circuit. The bump circuit takes as input a differential voltage signal (+Vin, -Vin) around a DC bias, and computes the similarity between Vin and a stored value m. We represent the stored memory m as a voltage: \n\nm = (Vw- - Vw+) / 2 \n\n(1) \n\nwhere Vw+ and Vw- are the gate-offset voltages stored on capacitors C1 and C2. Because C1 and C2 isolate the gates of transistors M1 and M2 respectively, these transistors are floating-gate devices. Consequently, the stored voltages Vw+ and Vw- are nonvolatile. 
We can express the floating-gate voltages Vfg1 and Vfg2 as Vfg1 = Vin + Vw+ and Vfg2 = Vw- - Vin, and the output of the bump circuit as [10]: \n\nIout = Ib / cosh^2((k/(4SUt)) (Vfg2 - Vfg1)) = Ib / cosh^2((k/(2SUt)) (Vin - m)) \n\n(2) \n\nwhere Ib is the bias current, k is the gate-coupling coefficient, Ut is the thermal voltage, and S depends on the transistor sizes. Fig.1(c) shows Iout for three different stored values of m. As the data show, different m\u2019s shift the location of the peak response of the circuit. \n\nFigure 1. The adaptive bump circuit. (a) The original bump circuit, augmented by capacitors C1 and C2 and cascode transistors (driven by Vcasc). (b) The adaptation subcircuit: M3 and M4 control injection on the floating gates, and M5 and M6 control tunneling. (c) Measured output current Iout (nA) of a bump circuit versus Vin, i.e. the bump circuit\u2019s transfer function, for three programmed memories m1, m2, and m3. \n\nFig.1(b) shows the circuit that implements learning in the adaptive bump circuit. We implement learning through Fowler-Nordheim tunneling [11] on tunneling junctions M5-M6 and hot electron injection [12] on the floating-gate transistors M3-M4. Transistors M3 and M5 control injection and tunneling on M1\u2019s floating gate. Transistors M4 and M6 control injection and tunneling on M2\u2019s floating gate. We activate tunneling and injection by a high Vtun and a low Vinj respectively. 
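The sech-squared response of eq.2 can be sketched numerically; the bias current, gate-coupling coefficient, size factor, and thermal voltage below are illustrative values, not measurements from the fabricated chip:

```python
import math

def bump_response(v_in, m, i_b=8e-9, k=0.7, s=1.0, u_t=0.025):
    """Sech^2 ("bump") similarity between input v_in and stored mean m (cf. eq. 2).

    All device parameters here are illustrative stand-ins; the real circuit's
    constants depend on process and transistor sizing.
    """
    return i_b / math.cosh((k / (2 * s * u_t)) * (v_in - m)) ** 2

# The response peaks at v_in == m, where Iout == Ib / cosh^2(0) == Ib,
# and falls off symmetrically, giving the Gaussian-like shape of Fig. 1(c).
peak = bump_response(0.0, 0.0)
off_peak = bump_response(0.1, 0.0)
```

Shifting `m` slides the peak along the Vin axis without changing its shape, which is exactly the behavior of the three programmed memories in Fig.1(c).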
In the adaptive bump circuit, both processes increase the similarity between Vin and m. In addition, the magnitude of the update does not depend on the sign of (Vin - m) because the differential input provides common-mode rejection to the input differential pair. \n\nThe similarity function, as seen in Fig.1(c), has a Gaussian-like shape. Consequently, we can equate the output current of the bump circuit with the probability of the input under a distribution parameterized by mean m: \n\nP(Vin | m) = Iout \n\n(3) \n\nIn addition, increasing the similarity between Vin and m is equivalent to increasing P(Vin | m). Consequently, the adaptive bump circuit adapts to maximize the likelihood of the present input under the circuit\u2019s probability distribution. \n\n3 The bump mixture model \n\nWe now describe the computations and learning rule implemented by the bump mixture model. A mixture model is a general class of statistical models that approximates the probability of an analog input as the weighted sum of the probability of the input under several simple distributions. The bump mixture model comprises a set of Gaussian-like probability density functions, each parameterized by a mean vector, mi. Denoting the jth dimension of the mean of the ith density as mij, we express the probability of an input vector x as: \n\nP(x) = (1/N) Σi P(x|i) = (1/N) Σi Πj P(xj|mij) \n\n(4) \n\nwhere N is the number of densities in the model and i denotes the ith density. P(x|i) is the product of one-dimensional densities P(xj|mij) that depend on the jth dimension of the ith mean, mij. We derive each one-dimensional probability distribution from the output current of a single bump circuit. 
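The mixture probability of eq.4 can be sketched in software; the Gaussian-like one-dimensional density below is an idealized stand-in for a bump circuit's output current, with an illustrative variance:

```python
import math

def bump_density(x_j, m_ij, sigma=0.1):
    # Idealized Gaussian-like stand-in for one bump circuit's output (eq. 3);
    # sigma is an assumed common variance, not a measured circuit value.
    return math.exp(-((x_j - m_ij) ** 2) / (2 * sigma ** 2))

def mixture_probability(x, means, sigma=0.1):
    """Eq. 4: P(x) = (1/N) * sum over densities i of prod over dims j of P(x_j|m_ij)."""
    total = 0.0
    for m_i in means:
        p_x_given_i = 1.0
        for x_j, m_ij in zip(x, m_i):
            p_x_given_i *= bump_density(x_j, m_ij, sigma)
        total += p_x_given_i
    return total / len(means)
```

Each row of bump circuits plays the role of one `m_i`; the product over `j` is what the log-domain circuit of Section 4 computes as a sum.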
The bump mixture model makes two assumptions: (1) the component densities are equally likely, and (2) within each component density, the input dimensions are independent and have equal variance. Despite these restrictions, this mixture model can, in principle, approximate any probability density function [1]. \n\nThe bump mixture model adapts all mi to maximize the likelihood of the training data. Learning in the bump mixture model is based on the E-M algorithm, the standard algorithm for training Gaussian mixture models. The E-M algorithm comprises two steps. The E-step computes the conditional probability of each density given the input, P(i|x). The M-step updates the parameters of each distribution to increase the likelihood of the data, using P(i|x) to scale the magnitude of each parameter update. In the online setting, the learning rule is: \n\nΔmij = η P(i|x) ∂/∂mij log P(xj|mij) = η [P(x|i) / Σk P(x|k)] ∂/∂mij log P(xj|mij) \n\n(5) \n\nwhere η is a learning rate and k denotes component densities. Because the adaptive bump circuit already adapts to increase the likelihood of the present input, we approximate E-M by modulating injection and tunneling in the adaptive bump circuit by the conditional probability: \n\nΔmij = η P(i|x) f(xj) \n\n(6) \n\nwhere f() is the parameter update implemented by the bump circuit. We can modulate the learning update in (6) with other competitive factors instead of the conditional probability to implement a variety of learning rules such as online K-means. \n\n4 Silicon implementation \n\nWe now describe a VLSI system that implements the silicon mixture model. The high-level organization of the system, detailed in Fig.2, is similar to VLSI vector quantization systems. 
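The modulated online update of eqs.5-6 can be sketched in software. The Gaussian-like likelihood, the learning rate, and the choice of (x_j - m_ij) as a stand-in for the circuit's update f(x_j) are all idealizations; the silicon update only matches this in sign and rough magnitude:

```python
import math

def conditional_probs(x, means, sigma=0.1):
    """E-step of eq. 5: P(i|x) = P(x|i) / sum_k P(x|k), with equal priors."""
    likelihoods = []
    for m_i in means:
        sq = sum((x_j - m_ij) ** 2 for x_j, m_ij in zip(x, m_i))
        likelihoods.append(math.exp(-sq / (2 * sigma ** 2)))
    z = sum(likelihoods)
    return [l / z for l in likelihoods]

def online_update(x, means, eta=0.05, sigma=0.1):
    """Eq. 6: move each mean toward x, scaled by its responsibility P(i|x).

    (x_j - m_ij) is an idealized stand-in for the bump circuit's f(x_j).
    """
    p = conditional_probs(x, means, sigma)
    return [[m_ij + eta * p_i * (x_j - m_ij)
             for x_j, m_ij in zip(x, m_i)]
            for p_i, m_i in zip(p, means)]
```

Replacing the soft responsibilities with a hard winner-take-all indicator turns this into the online K-means rule used in the experiments of Section 5.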
The heart of the mixture model is a matrix of adaptive bump circuits, where the ith row of bump circuits corresponds to the ith component density. In addition, the periphery of the matrix comprises a set of inhibitory circuits for performing probability estimation, inference, quantization, and generating feedback for learning. \n\nWe send each dimension of an input x down a single column. Unity-gain inverting amplifiers (not pictured) at the boundary of the matrix convert each single-ended voltage input into a differential signal. Each bump circuit computes a current that represents (P(xj|mij))^σ, where σ is the common variance of the one-dimensional densities. The mixture model computes P(x|i) along the ith row, and inhibitory circuits perform inference, estimation, or quantization. We utilize translinear devices [3] to perform all of these computations. Translinear devices, such as the subthreshold MOSFET and the bipolar transistor, exhibit an exponential relationship between gate voltage and source current. This property allows us to establish a power-law relationship between currents and probabilities (i.e. a linear relationship between gate voltages and log-probabilities). \n\nFigure 2. Bump mixture model architecture. The system comprises a matrix of adaptive bump circuits where each row computes the probability P(x|mi). Inhibitory circuits transform the output of each row into system outputs. Spike generators also transform inhibitory circuit outputs into rate-coded feedback for learning. \n\nWe compute the multiplication of the probabilities in each row of Fig.2 as addition in the log domain using the circuit in Fig.3(a). 
This circuit first converts each bump circuit\u2019s current into a voltage using a diode (e.g. M1). M2\u2019s capacitive divider computes Vavg as the average of the scalar log probabilities, log P(xj|mij): \n\nVavg = (σ/N) Σj log P(xj|mij) \n\n(7) \n\nwhere σ is the variance, N is the number of input dimensions, and voltages are in units of k/Ut (Ut is the thermal voltage and k is the transistor gate-coupling coefficient). Transistors M2-M5 mirror Vavg to the gate of M5. We define the drain voltage of M5 as log P(x|i) (up to an additive constant) and compute: \n\nlog P(x|i) = ((C1 + C2)/C1) Vavg = ((C1 + C2)σ/(C1 N)) Σj log P(xj|mij) + k \n\n(8) \n\nwhere k is a constant dependent on Vg (the control gate voltage on M5), and C1 and C2 are capacitances. From eq.8 we can derive the variance as: \n\nσ = C1 N / (C1 + C2) \n\n(9) \n\nThe system computes different output functions and feedback signals for learning by operating on the log probabilities of eq.8. Fig.3(b) shows a circuit that computes P(i|x) for each distribution. The circuit is a k-input differential pair where the bias transistor M0 normalizes the currents representing the probabilities P(x|i) at the ith leg. Fig.3(c) shows a circuit that computes P(x). The ith transistor exponentiates log P(x|i), and a single wire sums the currents. We can also apply other inhibitory circuits to the log probabilities, such as winner-take-all (WTA) circuits [13] and resistive networks [14]. In our fabricated chip, we implemented probability estimation, conditional probability computation, and WTA. 
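The log-domain row computation of eqs.7-9 and the normalization of Fig.3(b) can be sketched numerically. The capacitor values are illustrative, and the normalizing differential pair is modeled here, in effect, as a softmax over row log-probabilities:

```python
import math

def log_row_probability(log_probs, c1=1.0, c2=6.0):
    """Eqs. 7-8: the capacitive divider averages the per-dimension
    log-probabilities (eq. 7), and the (C1+C2)/C1 gain rescales the average
    (eq. 8, additive constant dropped). With sigma = C1*N/(C1+C2) (eq. 9),
    the result is exactly sum_j log P(x_j|m_ij). C1, C2 are illustrative.
    """
    n = len(log_probs)
    sigma = c1 * n / (c1 + c2)              # eq. 9
    v_avg = (sigma / n) * sum(log_probs)    # eq. 7
    return ((c1 + c2) / c1) * v_avg         # eq. 8

def conditional_from_logs(log_rows):
    """Fig. 3(b): the bias transistor normalizes the row currents, which
    amounts to a softmax over log P(x|i), yielding P(i|x)."""
    exps = [math.exp(v) for v in log_rows]
    z = sum(exps)
    return [e / z for e in exps]
```

Note how σ cancels in `log_row_probability`: the capacitor ratio that sets the divider gain is the same ratio that defines the effective variance, which is the point of eq.9.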
The WTA outputs the index of the most likely component distribution for the present input, and can be used to implement vector quantization and to produce feedback for an online K-means learning rule. \n\nAt each synapse, the system combines a feedback signal, such as the conditional probability P(i|x) computed at the matrix periphery, with the adaptive bump circuit to implement learning. We trigger adaptation at each bump circuit by a rate-coded spike signal generated from the inhibitory circuit\u2019s current outputs. We generate this spike train with a current-to-spike converter based on Lazzaro\u2019s low-powered spiking neuron [15]. This rate-coded signal toggles Vtun and Vinj at each bump circuit. Consequently, adaptation is proportional to the frequency of the spike train, which is in turn a linear function of the inhibitory feedback signal. The alternative to the rate code would be to transform the inhibitory circuit\u2019s output directly into analog Vtun and Vinj signals. Because injection and tunneling are highly nonlinear functions of Vinj and Vtun respectively, implementing updates that are linear in the inhibitory feedback signal is quite difficult using this approach. \n\nFigure 3. (a) Circuit for computing log P(x|i). (b) Circuit for computing P(i|x). The current through the ith leg represents P(i|x). (c) Circuit for computing P(x). \n\n5 Experimental Results and Conclusions \n\nWe fabricated an 8 x 8 mixture model (8 probability distribution functions with 8 dimensions each) in a TSMC 0.35\u00b5m CMOS process available through MOSIS, and tested the chip on synthetic data and a handwritten digits dataset. 
In our tests, we found that, due to a design error, one of the input dimensions coupled to the other inputs. Consequently, we held that input fixed throughout the tests, effectively reducing the input to 7 dimensions. In addition, we found that the learning rule in eq.6 produced poor performance because the variance of the bump distributions was too large. Consequently, in our learning experiments, we used the hard winner-take-all circuit to control adaptation, resulting in a K-means learning rule. We trained the chip to perform different tasks on handwritten digits from the MNIST dataset [16]. To prepare the data, we first performed PCA to reduce the 784-pixel images to seven-dimensional vectors, and then sent the data on-chip. \n\nWe first tested the circuit on clustering handwritten digits. We trained the chip on 1000 examples of each of the digits 1-8. Fig.4(a) shows reconstructions of the eight means before and after training. We compute each reconstruction by multiplying the means by the seven principal eigenvectors of the dataset. The data show that the means diverge to associate with different digits. The chip learns to associate most digits with a single probability distribution. The lone exception is digit 5, which doesn\u2019t clearly associate with one distribution. We speculate that the reason is that 3\u2019s, 5\u2019s, and 8\u2019s are very similar in our training data\u2019s seven-dimensional representation. Gaussian mixture models trained with the E-M algorithm also demonstrate similar results, recovering only seven out of the eight digits. \n\nWe next evaluated the same learned means on vector quantization of a set of test digits (4400 examples of each digit). We compare the chip\u2019s learned means with means learned by the batch E-M algorithm on mixtures of Gaussians (with σ = 0.01), a mismatch E-M algorithm that models chip nonidealities, and a non-adaptive baseline quantizer. 
The purpose of the mismatch E-M algorithm was to assess the effect of nonuniform injection and tunneling strengths in floating-gate transistors. Because tunneling and injection magnitudes can vary by a large amount on different floating-gate transistors, the adaptive bump circuits can learn a mean that is somewhat off-center. We measured the offset of each bump circuit when adapting to a constant input, and constructed the mismatch E-M algorithm by altering the learned means during the M-step by the measured offset. We constructed the baseline quantizer by selecting, at random, an example of each digit for the quantizer codebook. For each quantizer, we computed the reconstruction error on the digit\u2019s seven-dimensional representation when we represent each test digit by the closest mean. \n\nFigure 4. (a) Reconstruction of chip means before and after training with handwritten digits. (b) Comparison of average quantization error (average squared error per digit) on unseen handwritten digits, for the chip\u2019s learned means and mixture models trained by standard algorithms. (c) Plot of probability of unseen examples of 7\u2019s and 9\u2019s under two bump mixture models trained solely on each digit (probability under each model in \u00b5A). \n\nThe results in Fig.4(b) show that for most of the digits the chip\u2019s learned means perform as well as the E-M algorithm, and better than the baseline quantizer in all cases. 
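The reconstruction-error measure used for Fig.4(b) can be sketched directly: each test vector is represented by its closest codebook mean, and the squared errors are averaged.

```python
def quantization_error(test_vectors, codebook):
    """Average squared reconstruction error when each test vector is
    represented by the closest codebook mean (the measure in Fig. 4(b))."""
    def sq_dist(x, m):
        return sum((x_j - m_j) ** 2 for x_j, m_j in zip(x, m))

    total = 0.0
    for x in test_vectors:
        total += min(sq_dist(x, m) for m in codebook)
    return total / len(test_vectors)
```

The same function evaluates any of the four quantizers compared here; only the codebook (chip means, E-M means, mismatch E-M means, or random examples) changes.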
The one digit where the chip\u2019s performance is far from the E-M algorithm\u2019s is the digit \u201c1\u201d. Upon examination of the E-M algorithm\u2019s results, we found that it associated two means with the digit \u201c1\u201d, whereas the chip allocated two means to the digit \u201c3\u201d. Over all the digits, the E-M algorithm exhibited a quantization error of 9.98, the mismatch E-M algorithm gave a quantization error of 10.9, the chip\u2019s error was 11.6, and the baseline quantizer\u2019s error was 15.97. The data show that mismatch is a significant factor in the difference between the bump mixture model\u2019s performance and the E-M algorithm\u2019s performance on quantization tasks. \n\nFinally, we used the mixture model to classify handwritten digits. If we train a separate mixture model for each class of data, we can classify an input by comparing the probabilities of the input under each model. In our experiment, we trained two separate mixture models: one on examples of the digit 7, and the other on examples of the digit 9. We then applied both mixtures to a set of unseen examples of digits 7 and 9, and recorded the probability score of each unseen example under each mixture model. We plot the resulting data in Fig.4(c). Each axis represents the probability under a different class. The data show that the model probabilities provide a good metric for classification. Assigning each test example to the class model that outputs the highest probability results in an accuracy of 87% on 2000 unseen digits. Additional software experiments show that mixtures of Gaussians (σ = 0.01) trained by the batch E-M algorithm provide an accuracy of 92.39% on this task. \n\nOur test results show that the bump mixture model\u2019s performance on several learning tasks is comparable to standard mixtures of Gaussians trained by E-M. 
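The per-class classification scheme of the 7-vs-9 experiment can be sketched in software; the Gaussian-like mixture likelihood and variance below are idealized stand-ins for the chip's per-class probability outputs:

```python
import math

def log_likelihood(x, means, sigma=0.5):
    """log P(x) under an equal-weight Gaussian-like mixture (cf. eq. 4),
    computed with the log-sum-exp trick for numerical stability."""
    logs = [-sum((a - b) ** 2 for a, b in zip(x, m)) / (2 * sigma ** 2)
            for m in means]
    mx = max(logs)
    return mx + math.log(sum(math.exp(v - mx) for v in logs)) - math.log(len(means))

def classify(x, class_models):
    """Assign x to the class whose mixture model gives it the highest
    probability, as in the 7-vs-9 experiment of Fig. 4(c)."""
    return max(class_models, key=lambda c: log_likelihood(x, class_models[c]))
```

In Fig.4(c) this decision rule corresponds to splitting the scatter plot along the diagonal where the two models assign equal probability.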
These experiments give further evidence that floating-gate circuits can be used to build effective learning systems, even though their learning rules derive from silicon physics instead of statistical methods. The bump mixture model also represents a basic building block that we can use to build more complex silicon probability models over analog variables. This work can be extended in several ways. We can build distributions that have parameterized covariances in addition to means. In addition, we can build more complex, adaptive probability distributions in silicon by combining the bump mixture model with silicon probability models over discrete variables [5-7] and spike-based floating-gate learning circuits [4]. \n\nAcknowledgments \n\nThis work was supported by NSF under grants BES 9720353 and ECS 9733425, and by Packard Foundation and Sloan Fellowships. \n\nReferences \n\n[1] C. M. Bishop, Neural Networks for Pattern Recognition. Oxford, UK: Clarendon Press, 1995. \n\n[2] L. R. Rabiner, \"A tutorial on hidden Markov models and selected applications in speech recognition,\" Proceedings of the IEEE, vol. 77, pp. 257-286, 1989. \n\n[3] B. A. Minch, \"Analysis, Synthesis, and Implementation of Networks of Multiple-Input Translinear Elements,\" California Institute of Technology, 1997. \n\n[4] C. Diorio, D. Hsu, and M. Figueroa, \"Adaptive CMOS: from biological inspiration to systems-on-a-chip,\" Proceedings of the IEEE, vol. 90, pp. 345-357, 2002. \n\n[5] T. Gabara, J. Hagenauer, M. Moerz, and R. Yan, \"An analog 0.25\u00b5m BiCMOS tailbiting MAP decoder,\" IEEE International Solid State Circuits Conference (ISSCC), 2000. \n\n[6] J. Dai, S. Little, C. Winstead, and J. K. Woo, \"Analog MAP decoder for (8,4) Hamming code in subthreshold CMOS,\" Advanced Research in VLSI (ARVLSI), 2001. \n\n[7] M. Helfenstein, H.-A. Loeliger, F. Lustenberger, and F. 
Tarkoy, \"Probability propagation and decoding in analog VLSI,\" IEEE Transactions on Information Theory, vol. 47, pp. 837-843, 2001. \n\n[8] W. C. Fang, B. J. Sheu, O. Chen, and J. Choi, \"A VLSI neural processor for image data compression using self-organization neural networks,\" IEEE Transactions on Neural Networks, vol. 3, pp. 506-518, 1992. \n\n[9] J. Lubkin and G. Cauwenberghs, \"A learning parallel analog-to-digital vector quantizer,\" Journal of Circuits, Systems, and Computers, vol. 8, pp. 604-614, 1998. \n\n[10] T. Delbruck, \"Bump circuits for computing similarity and dissimilarity of analog voltages,\" California Institute of Technology, CNS Memo 26, 1993. \n\n[11] M. Lenzlinger and E. H. Snow, \"Fowler-Nordheim tunneling into thermally grown SiO2,\" Journal of Applied Physics, vol. 40, pp. 278-283, 1969. \n\n[12] E. Takeda, C. Yang, and A. Miura-Hamada, Hot Carrier Effects in MOS Devices. San Diego, CA: Academic Press, 1995. \n\n[13] J. Lazzaro, S. Ryckebusch, M. Mahowald, and C. A. Mead, \"Winner-take-all networks of O(n) complexity,\" in Advances in Neural Information Processing, vol. 1, D. Touretzky, Ed.: MIT Press, 1989, pp. 703-711. \n\n[14] K. Boahen and A. Andreou, \"A contrast sensitive silicon retina with reciprocal synapses,\" in Advances in Neural Information Processing Systems 4, J. Moody, S. Hanson, and R. Lippmann, Eds.: MIT Press, 1992, pp. 764-772. \n\n[15] J. Lazzaro, \"Low-power silicon spiking neurons and axons,\" IEEE International Symposium on Circuits and Systems, 1992. \n\n[16] Y. 
Lecun, \"The MNIST database of handwritten digits,\" http://yann_lecun.com/exdb/mnist. \n", "award": [], "sourceid": 2344, "authors": [{"given_name": "David", "family_name": "Hsu", "institution": ""}, {"given_name": "Seth", "family_name": "Bridges", "institution": ""}, {"given_name": "Miguel", "family_name": "Figueroa", "institution": ""}, {"given_name": "Chris", "family_name": "Diorio", "institution": ""}]}