{"title": "An Analog VLSI Splining Network", "book": "Advances in Neural Information Processing Systems", "page_first": 1008, "page_last": 1014, "abstract": null, "full_text": "An Analog VLSI Splining Network \n\nDaniel B. Schwartz and Vijay K. Samalam \n\nGTE Laboratories, Inc. \n\n40 Sylvan Rd. \n\nWaltham, MA 02254 \n\nAbstract \n\nWe have produced a VLSI circuit capable of learning to approximate ar(cid:173)\nbitrary smooth of a single variable using a technique closely related to \nsplines. The circuit effectively has 512 knots space on a uniform grid and \nhas full support for learning. The circuit also can be used to approximate \nmulti-variable functions as sum of splines. \n\nAn interesting, and as of yet, nearly untapped set of applications for VLSI imple(cid:173)\nmentation of neural network learning systems can be found in adaptive control and \nnon-linear signal processing. In most such applications, the learning task consists \nof approximating a real function of a small number of continuous variables from \ndiscrete data points. Special purpose hardware is especially interesting for applica(cid:173)\ntions of this type since they generally require real time on-line learning and there \ncan be stiff constraints on the power budget and size of the hardware. Frequently, \nthe already difficult learning problem is made more complex by the non-stationary \nnature of the underlying process. \nConventional feed-forward networks with sigmoidal units are clearly inappropriate \nfor applications of this type. Although they have exhibited remarkable performance \nin some types of time series prediction problems (for example, Wiegend, 1990 and \nAtlas, 1990), their learning rates in general are too slow for on-line learning. On-line \nperformance can be improved most easily by using networks with more constrained \narchitecture, effectively making the learning problem easier by giving the network a \nhint about the learning task. 
Networks that build local representations of the data, such as radial basis functions, are excellent candidates for this type of problem. One great advantage of such networks is that they require only a single layer of units. If the position and width of the units are fixed, the learning problem is linear in the coefficients and local. By local we mean that the computation of a weight change requires only information that is locally available to each weight, a highly desirable property for VLSI implementation. If the learning algorithm is allowed to adjust both the position and width of the units then many of the advantages of locally tuned units are lost. \n\nA number of techniques have been proposed for the determination of the width and placement of the units. One of the most direct is to center a unit at every data point and to adjust the widths of the units so the receptive fields overlap with those of neighboring data points (Broomhead, 1988). The proliferation of units can be limited by using unsupervised clustering techniques to clump the data, followed by the allocation of units to fit the clumps (Moody, 1989). Others have advocated assigning new units only when the error on a new data point is larger than a threshold, and otherwise making small adjustments in the weights and parameters of the existing units (Platt, 1990). All of these methods suffer from the common problem of requiring an indeterminate quantity of resources, in contrast with the fixed resources available from most VLSI circuits. Even worse, when used with non-stationary processes a mechanism is needed to deallocate units as well as to allocate them. The resource allocation/deallocation problem is a serious barrier to implementing these algorithms as autonomous VLSI microsystems. 
\n\nA Splining Network \n\nTo avoid the resource allocation problem we propose a network that uses all of its weights and units regardless of the problem. We avoid over-parameterization of the training data by building constraints on smoothness into the network, thus reducing the number of degrees of freedom available to the training process. In its simplest guise, the network approximates arbitrary smooth 1-d functions with a linear superposition of locally tuned units spaced on a uniform grid, \n\ng(x) = Σ_i w_i f_σ(x − iΔx)   (1) \n\nwhere σ is the radius of the unit's receptive field and the w_i are the weights. f_σ is a bump of width σ such as a gaussian or a cubic spline basis function. Mathematically the network is closely related to function approximation using B-splines (Lancaster, 1986) with uniformly spaced knots. However, in B-spline interpolation the overlap of the basis functions is normally determined by the degree of the spline, whereas we use the degree of overlap as a free parameter to constrain the smoothness of the network's output. As mentioned earlier, the network is linear in its weights, so gradient descent with a quadratic cost function (LMS) is an effective training procedure. \n\nThe weights needed for this network can easily be implemented in CMOS with an array of transconductance amplifiers. The amplifiers are wired as voltage followers with their outputs tied together and the weights are represented by voltages V_i at the non-inverting inputs of the amplifiers. If the outputs of the locally tuned units are represented by unipolar currents I_i, these currents can be used to bias the transconductance amplifiers and the result is (Mead, 1989) \n\nV_out = Σ_i I_i V_i / Σ_i I_i \n\nprovided that care is taken to control the non-linearities of the amplifiers. However, while the weights have a simple implementation in analog VLSI circuitry, the input units do not. 
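As an aside for the modern reader, equation (1) with the normalized follower-aggregation readout and LMS training is easy to sketch in software. The Python fragment below is a minimal illustration, not a circuit model: the 512-knot uniform grid matches the chip, but the gaussian bump, the [0, 1) input range, and the learning rate ETA are assumed values.

```python
import numpy as np

# Minimal software sketch of the splining network of equation (1) with the
# normalized (follower-aggregation style) readout and LMS training.
# The 512-knot grid matches the chip; SIGMA and ETA are illustrative.

N_KNOTS = 512
SIGMA = 4.0        # receptive-field radius in knot units (free parameter)
ETA = 1.0          # LMS learning rate (assumed)

def activations(x):
    """Receptive-field currents I_i for a scalar input x in [0, 1)."""
    knots = np.arange(N_KNOTS)
    return np.exp(-0.5 * ((x * N_KNOTS - knots) / SIGMA) ** 2)

def g(x, w):
    """Network output: bump-weighted mean of the stored weights."""
    a = activations(x)
    return (a @ w) / a.sum()

def lms_step(x, target, w):
    """LMS: each weight moves in proportion to its own activation."""
    a = activations(x)
    w += ETA * (target - g(x, w)) * a / a.sum()
    return w

# Learn a smooth 1-d function from random samples.
rng = np.random.default_rng(0)
w = np.zeros(N_KNOTS)
for _ in range(20000):
    x = rng.random()
    w = lms_step(x, np.sin(2 * np.pi * x), w)
```

Because the model is linear in the weights, this converges without the plateaus typical of sigmoidal networks; the overlap SIGMA plays the same smoothness-constraining role as in the text.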
A number of circuits exist whose transfer characteristics can be shaped to be a suitable bump, but none of those known to the authors allow the width of the bump to be adjusted over a wide range without the use of resistors. \n\nGenerating the Receptive Fields \n\nInput units with tunable receptive fields can be generated quite efficiently by breaking them up into two layers of circuitry as shown in figure 1. The input layer place encodes the input signal - i.e. only one or perhaps a small cluster of units is active at a time. The output of the place encoding units either injects or controls the injection of current into the laterally connected spreading layer. \n\nFigure 1: An architecture that allows the width and shape of the receptive fields to be varied over a wide range. The elements of the 'spreading layer' are passive and can sink current to ground. \n\nThe elements in the spreading layer all contain ground terminals and the current sunk by each one determines the bias current applied to the associated weight. Clearly, the distribution of currents flowing to ground through the spreading layer forms a smooth bump such that when excitation is applied to tap j of the spreading layer, \n\nI_i = I_0 f(i − j) \n\nwhere f is the bump called for by equation 1. In our earliest realizations of this network the input layer was a crude flash A-to-D converter and the input to the circuit was analog. In the current generation the input is digital, with the place encoding performed by a conventional address decoder. If desired, input quantization can be avoided by using a layer of amplifiers that generate smooth bumps of fixed width to generate the input place encoding. 
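A rough numerical analogy (not a device-level simulation) of this two-layer decomposition: the place encoder activates a single tap, and the passive spreading layer diffuses that excitation into a bump of bias currents whose width is set by a single parameter, playing the role of the pass-transistor gate bias. The exponential profile assumed below matches the sharp-peaked bumps reported later for the all-n-channel spreading layer; the tap count and decay lengths are arbitrary.

```python
import numpy as np

# Numerical analogy of the two-layer receptive-field generator: excitation
# at one tap j spreads laterally into a bump of bias currents I_i. The
# decay length 'ell' stands in for the pass-transistor gate bias that sets
# the bump width; the exponential shape is an assumption matching the
# sharp-peaked bumps measured for n-channel pass transistors.

N_TAPS = 64

def spreading_layer(j, ell):
    """Bias currents I_i = exp(-|i - j| / ell) for excitation at tap j."""
    i = np.arange(N_TAPS)
    return np.exp(-np.abs(i - j) / ell)

narrow = spreading_layer(32, ell=2.0)
wide = spreading_layer(32, ell=8.0)
```

A single knob widens or narrows every receptive field at once, which is exactly the resistor-free width control the text asks of the input units.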
\n\nThe simplest candidate to implement the spreading layer in conventional CMOS is a set of diode-connected n-channel transistors laterally connected by n-channel pass transistors. The gate voltages of the diode-connected transistors determine the bias currents I_i of the weights. Ignoring the body effect and assuming weak inversion in the current sink, this type of network tends to give bumps with rather sharp peaks, I_i ≈ Σ_j I_0 e^{−α|i−j|}, where |i − j| is the distance from the point where the excitation is applied. Figure 2 shows a more sophisticated version of this circuit in which the output of the place encoding units applies excitation to the spreading network through a p-channel transistor. \n\nFigure 2: A schematic of a section of the spreading layer. Roughly speaking, the n-channel pass transistor controls the extent of the tails of the bumps and the p-channel pass transistor and the cascode transistor control its width. \n\nThe shape of the bumps can be softened by limiting the amount of current drawn by the current sinks with an n-channel cascode transistor in series with the current sink. Some experimental results for this type of circuit are shown in figure 3a. More control can be obtained by using complementary pass transistors. The use of p-channel pass transistors alone unexpectedly results in bumps that are nearly square (figure 3b). These can be smoothed by using both flavors of pass transistor simultaneously (figure 3c). \n\nThe Weights \n\nAs described earlier, the implementation of the output weights is based on the computation of means by the well-known follower-aggregation circuit. With typical transconductance amplifiers, this averaging is linear only when the voltages being averaged are distributed over a voltage range of no more than a few times U_0 = kT/e in weak inversion. 
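The follower-aggregation readout can be checked numerically with a textbook amplifier idealization (after Mead, 1989), not the actual chip circuit: each amplifier sources a current I_i tanh((V_i − V_out)/2U_0), and the common output node settles where the currents sum to zero. U_0 = kT/e is taken at room temperature; the solver and example values are assumptions.

```python
import numpy as np

# Sketch of follower aggregation: weight voltages V_i drive transconductance
# amplifiers biased by receptive-field currents I_i, outputs tied together.
# Each amp sources I_i * tanh((V_i - V_out) / (2*U0)); the node settles where
# the currents sum to zero. Within the linear range of the tanh this is the
# I-weighted mean of the V_i.

U0 = 0.025  # kT/e at room temperature, in volts

def follower_aggregate(V, I):
    """Solve sum_i I_i * tanh((V_i - v) / (2*U0)) = 0 for v by bisection."""
    lo, hi = V.min(), V.max()
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if np.sum(I * np.tanh((V - mid) / (2 * U0))) > 0:
            lo = mid          # net current still positive: root lies above
        else:
            hi = mid
    return 0.5 * (lo + hi)

I = np.array([1.0, 2.0, 1.0])
V_small = np.array([0.00, 0.01, 0.02])    # spread well inside a few U0
out = follower_aggregate(V_small, I)
mean = np.sum(I * V_small) / np.sum(I)    # ideal weighted mean
```

Widening the spread of the V_i well beyond a few U_0 makes `out` deviate from `mean`, which is the non-linearity the circuit design below works to suppress.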
In the circuits described here the linear range has been widened to nearly a volt by reducing the transconductance of the readout amplifiers through the combination of low width-to-length ratio input transistors and relatively large tail currents. \n\nThe weights V_i are stored on MOS capacitors and are programmed by the gated transconductance amplifier shown in figure 4. \n\nFigure 3: Experimental measurements of the receptive field shapes obtained from different types of networks. (a) n-channel transistors for several gate voltages. (b) p-channel transistors for several gate voltages. (c) Both n-channel and p-channel pass transistors. \n\nFigure 4: Schematic of an output weight including the circuitry to generate weight updates. To minimize leakage and charge injection simultaneously, the pass transistors used to gate the weight change amplifier are of minimum size and a separate transistor turns off the output transistors of the amplifier. \n\nSince this amplifier computes the difference between the target voltage and the actual output of the network, the learning rule is just LMS, \n\nΔV_i = (g_i T / C)(V_target − V_out) \n\nwhere C is the capacitance of the storage capacitor and T is the duration of weight changes. The transconductance g_i of the weight change amplifier is determined by the strength of the excitation current from the spreading layer, g_i ∝ I_i in weak inversion. Since the weight changes are governed by the strengths of the excitation currents from the spreading layer, clusters of weights are changed at a time. 
This enhances the fault tolerance of the circuit since the group of weights surrounding a bad one can compensate for it. \n\nExperimental Evaluation \n\nSeveral different chips have been fabricated in 2μ p-well CMOS and tested to evaluate the principles described here. The most recent of these has 512 weights arranged in a 64 x 8 matrix connected to form a one-dimensional array. The active area of this chip is 4.1mm x 3.7mm. The input signal is digital, with the place encoding performed by a conventional address decoder. To maximize the flexibility of the chip, the excitation is applied to the spreading layer by a register located in each cell. By writing to multiple registers between resets, the spreading layer can be excited at multiple points simultaneously. This feature allows the chip to be treated as a single 1-dimensional spline with 512 weights or, for example, as the sum of four distinct 1-dimensional splines each made up of 128 weights. One of the most noticeable virtues of this design is the simplicity of the layout, due to the absence of any clear distinction between 'weights' and 'units'. The primitive cell consists of a register, a piece of the spreading network, a weight change amplifier, a storage capacitor and an output amplifier. All but a tiny fraction of the chip is a tiling of this primitive cell. The excess circuitry consists of the address decoders, a timing circuit to control the duration of weight changes and some biasing circuitry for the spreading layer. \n\nTo execute LMS learning, the user need only provide a sequence of target voltages and a current proportional to the duration of weight changes. Under reasonable operating conditions a weight update cycle takes less than 1μs, implying a weight change rate of 5 x 10^8 connections/second. The response of the chip to a single weight change after initialization is shown in figure 5a. 
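The sum-of-splines mode just described - several 1-d splines sharing one output node and one error signal - can be sketched as follows. The two 128-knot splines mirror the chip's four-way split option; the additively separable target, the bump shape and the learning rate are illustrative assumptions.

```python
import numpy as np

# Sketch of the chip's sum-of-splines mode: exciting the spreading layer at
# several points at once makes the output a sum of independent 1-d splines,
# which can fit additively separable functions f(x, y) ~ s1(x) + s2(y).
# Both splines receive the same LMS error, as on the chip. Illustrative only.

N, SIGMA, ETA = 128, 3.0, 0.5

def acts(z):
    """Normalized receptive-field activations for a scalar z in [0, 1)."""
    k = np.arange(N)
    a = np.exp(-0.5 * ((z * N - k) / SIGMA) ** 2)
    return a / a.sum()

target = lambda x, y: np.sin(2 * np.pi * x) + y   # additively separable

w1, w2 = np.zeros(N), np.zeros(N)
rng = np.random.default_rng(1)
for _ in range(20000):
    x, y = rng.random(2)
    a1, a2 = acts(x), acts(y)
    err = target(x, y) - (a1 @ w1 + a2 @ w2)
    w1 += ETA * err * a1      # one shared error signal drives both splines
    w2 += ETA * err * a2
```

The individual splines are only determined up to a shared constant offset, but the summed prediction is unambiguous, which is all the shared-error training needs.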
One feature of this plot is striking - even though the distribution of offsets in the individual amplifiers has a variance of 13mV, the ripple in the output of the chip is about 1mV. For some computations, it appears the limiting factor on the accuracy of the chip is the rate of weight decay, about 10mV/s. \n\nAs a more strenuous test of the functionality of the chip we trained it to predict chaotic time series generated by the well-known logistic equation, \n\nx_{t+1} = 4 a x_t (1 − x_t), a < 1. \n\nSome experimental results for the mean prediction error are shown in figure 5b. In these experiments, a mean prediction error of 3% is achieved, which is well above the intrinsic accuracy of the circuit. A detailed examination of the error rate as a function of the size and shape of the bumps indicates that the problem lies in the long tails exhibited by the spreading layer when the n-channel pass transistors are turned on. This tail falls off very slowly due to the body effect. One remedy to this problem is to actively bias the gates of the n-channel pass transistors to be a programmed offset above their source voltages (Mead, 1989). A simpler solution is to subtract a fixed current from each of the bias currents defined by the spreading layer. This solution costs a mere 4 transistors and has the added benefit of guaranteeing that the bumps will always have finite support. \n\nFigure 5: Some experimental results from a splining circuit. (a) The response of the circuit to learning one data point after initialization of the weights to a constant value. (b) Experimental mean prediction error while learning a chaotic time series. \n\nConclusion \n\nWe have demonstrated that neural network learning can be efficiently mapped onto analog VLSI provided that the network architecture and training procedure are tailored to match the constraints imposed by VLSI. Besides the computational speed and low power consumption (300μA) that follow directly from this mapping onto VLSI, the circuit also demonstrates intrinsic fault tolerance to defects in the weights. \n\nAcknowledgements \n\nThis work was initially inspired by a discussion with A. G. Barto and R. S. Sutton. A discussion with J. Moody was also helpful. \n\nReferences \n\n[1] L. Atlas, R. Cole, Y. Muthusamy, A. Lippman, J. Connor, D. Park, M. El-Sharkawi, and R. J. Marks II. A performance comparison of trained multi-layer perceptrons and trained classification trees. IEEE Proceedings, 1990. \n\n[2] D. S. Broomhead and D. Lowe. Multivariable function interpolation and adaptive networks. Complex Systems, 2:321-355, 1988. \n\n[3] P. Lancaster and K. Salkauskas. Curve and Surface Fitting. Academic Press, 1986. \n\n[4] C. Mead. Analog VLSI and Neural Systems. Addison-Wesley, 1989. \n\n[5] J. Moody and C. J. Darken. Fast learning in networks of locally-tuned processing units. Neural Computation, 1(2), 1989. \n\n[6] J. Platt. A resource-allocating neural network for function interpolation. In Richard P. Lippmann, John Moody, and David S. Touretzky, editors, Advances in Neural Information Processing Systems 3, 1991. \n\n[7] A. S. Weigend, B. A. Huberman, and D. E. Rumelhart. Predicting the future: A connectionist approach. International Journal of Neural Systems, 3, 1990. 
\n\n\f", "award": [], "sourceid": 342, "authors": [{"given_name": "Daniel", "family_name": "Schwartz", "institution": null}, {"given_name": "Vijay", "family_name": "Samalam", "institution": null}]}