{"title": "An Analog Visual Pre-Processing Processor Employing Cyclic Line Access in Only-Nearest-Neighbor-Interconnects Architecture", "book": "Advances in Neural Information Processing Systems", "page_first": 971, "page_last": 978, "abstract": null, "full_text": "An Analog Visual Pre-Processing Processor Employing Cyclic Line Access in Only-Nearest-Neighbor-Interconnects Architecture\n\nYusuke Nakashita\nDepartment of Frontier Informatics\nSchool of Frontier Sciences\nThe University of Tokyo\n5-1-5 Kashiwanoha, Kashiwa-shi, Chiba\n277-8561, Japan\nyusuke@else.k.u-tokyo.ac.jp\n\nYoshio Mita\nDepartment of Electrical Engineering\nSchool of Engineering\nThe University of Tokyo\n7-3-1 Hongo, Bunkyo-ku, Tokyo\n113-8656, Japan\nmita@ee.t.u-tokyo.ac.jp\n\nTadashi Shibata\nDepartment of Frontier Informatics\nSchool of Frontier Sciences\nThe University of Tokyo\n5-1-5 Kashiwanoha, Kashiwa-shi, Chiba\n277-8561, Japan\nshibata@ee.t.u-tokyo.ac.jp\n\nAbstract\n\nAn analog focal-plane processor with a 128\u00d7128 photodiode array has been developed for directional edge filtering. It performs a 4\u00d74-pixel kernel convolution over all pixels in only 256 steps of simple analog processing. A newly developed cyclic line access and row-parallel processing scheme, in conjunction with the \u201conly-nearest-neighbor interconnects\u201d architecture, has enabled a very simple implementation. A proof-of-concept chip was fabricated in a 0.35-\u00b5m 2-poly 3-metal CMOS technology, and edge filtering at a rate of 200 frames/sec has been experimentally demonstrated.\n\n1 Introduction\n\nDirectional edge detection in an input image is the most essential operation in early visual processing [1, 2]. Such spatial filtering operations are carried out by taking the convolution between a block of pixels and a weight matrix, which requires a number of multiply-and-accumulate operations. 
Since the convolution operation must be repeated pixel by pixel to scan the entire image, the computation is very expensive, and software solutions are not compatible with real-time applications. Hardware implementation of focal-plane parallel processing is therefore in high demand. However, there exists a hard problem, which we call the interconnects explosion, as illustrated in Fig. 1.\n\nFigure 1: (a) Interconnects from nearest-neighbor (N.N.) and second N.N. pixels to a single pixel at the center. (b) N.N. and second N.N. interconnects for pixels in two rows, an illustrative example of the interconnects explosion.\n\nIn carrying out a filtering operation for one pixel, the luminance data must be gathered from the nearest-neighbor and second nearest-neighbor pixels. The interconnects necessary for this are illustrated in Fig. 1(a). If such wiring is formed for two rows of pixels, excessively dense overlapping interconnects are required. If we extend this to an entire chip, it is impossible to form the wiring even with the most advanced VLSI interconnect technology. Biology has solved the problem with truly three-dimensional interconnect structures. Since VLSI technology allows only two-dimensional layouts with a limited number of stacked layers, the missing dimension is crucial. We must overcome the difficulty by introducing new architectures.\n\nIn order to achieve real-time performance in image filtering, a number of VLSI chips have been developed in both digital [3, 4] and analog [5, 6, 7] technologies. A flash-convolution processor [4] allows a single 5\u00d75-pixel convolution operation in a single clock cycle by introducing a subtle memory access scheme. However, for an N\u00d7M-pixel image, it takes N\u00d7M clock cycles to complete the processing. In the line-parallel processing scheme employed in [7], both row-parallel and column-parallel processing scan the target image several times, and the entire filtering finishes in O(N+M) steps. (A single step includes several clock cycles to control the analog processing.)\n\nThe purpose of this work is to present an analog focal-plane CMOS image sensor chip which carries out the directional edge filtering convolution for an N\u00d7M-pixel image in only M (or N) steps. In order to achieve efficient processing, two key technologies have been introduced: the \u201conly-nearest-neighbor interconnects\u201d architecture and \u201ccyclic line access and row-parallel processing\u201d. The former was first developed in [8], and has enabled convolution including second-nearest-neighbor luminance data using only nearest-neighbor interconnects, thus greatly reducing the interconnect complexity. However, the fill factor was sacrificed due to the pixel-parallel organization. This problem has been resolved in the present work by \u201ccyclic line access and row-parallel processing.\u201d Namely, the processing elements are separated from the array of photodiodes, and the \u201conly-nearest-neighbor interconnects\u201d architecture is realized as a separate module of row-parallel processing elements. The cyclic line access scheme, first introduced in the present work, eliminates redundant data readout operations from the photodiode array and establishes very efficient processing. As a result, it has become possible to complete the edge filtering for a 128\u00d7128-pixel image in only 128\u00d72 steps. A proof-of-concept chip was fabricated in a 0.35-\u00b5m 2-poly 3-metal CMOS technology, and edge detection at a rate of 200 frames/sec 
has been experimentally demonstrated.\n\nFigure 2: Edge filtering in the \u201conly-nearest-neighbor interconnects\u201d architecture: (a) first step; (b) second step; (c) all interconnects necessary for pixel-parallel processing; (d) PD\u2019s involved in the convolution.\n\n(a)\n0 +1 +1 0\n-1 0 +2 +1\n-1 -2 0 +1\n0 -1 -1 0\n\n(b)\n0 +1 +1 0\n+1 +2 0 -1\n+1 0 -2 -1\n0 -1 -1 0\n\n(c)\n0 +1 +1 0\n+1 +2 +2 +1\n-1 -2 -2 -1\n0 -1 -1 0\n\n(d)\n0 +1 -1 0\n+1 +2 -2 -1\n+1 +2 -2 -1\n0 +1 -1 0\n\nFigure 3: Edge filtering kernels realized in the \u201conly-nearest-neighbor interconnects\u201d architecture: (a) +45 degree; (b) \u221245 degree; (c) horizontal; (d) vertical.\n\n2 System Organization\n\nThe two key technologies employed in the present work are explained in the following.\n\n2.1 \u201cOnly-Nearest-Neighbor Interconnects\u201d Architecture\n\nThis architecture was first proposed in [8] and experimentally verified with small-scale test circuits (7\u00d77 processing elements without photodiodes). The key feature of the architecture is that photodiodes (PD\u2019s) are placed at the four corners of each processing element (PE), and that the luminance data of each PD are shared by four PE\u2019s as shown in Fig. 2.\n\nThe edge filtering is carried out as explained below. First, as shown in Fig. 2(a), pre-processing is carried out in each PE using the luminance data taken from the four PD\u2019s located at its corners. Then, the result is transferred to the center PE as shown in Fig. 2(b), and the necessary computation is carried out. This accomplishes the filtering for one half of all the pixels. Then the roles of the pre-processing PE\u2019s and the center PE\u2019s are interchanged, and the same procedure follows to complete the processing for the rest of the pixels. 
The interconnects necessary for the entire parallel processing are shown in Fig. 2(c). In this manner, every PE can gather all data necessary for the processing from its nearest-neighbor and second nearest-neighbor pixels without complicated crossover interconnects. The kernels illustrated in Fig. 3 have all been realized in this architecture. The luminance data from the 12 PD\u2019s enclosed in Fig. 2(d) are utilized to detect the edge information at the center location.\n\nFigure 4: Block diagram of the chip (a), and organization of the row-parallel processing module (b). x in (b) represents the row number, 1 to 131. (c) shows the readout circuit of a photodiode.\n\n2.2 Cyclic Line Access and Row-Parallel Processing\n\nA block diagram of the analog edge-filtering processor is given in Fig. 4(a). It consists of an array of 131\u00d7131 photodiodes (PD\u2019s) and a module for row-parallel processing placed at the bottom of the PD array. Figure 4(b) illustrates the organization of the row-parallel processing module, which is composed of four rows of 130 PE\u2019s and five rows of 131 analog memory cells that temporarily store the luminance data read out from the PD array. It should be noted that only three rows of PE\u2019s and four rows of PD\u2019s are sufficient to carry out the processing of a single row, as explained in reference to Fig. 
2(d). However, one extra row of PE\u2019s and one extra row of analog memories for PD data storage were included in the row-parallel processing module. This is essential to carry out seamless data readout from the PD array and computation without shifting analog data within the processing module. The chip yields the kernel convolution results for one of the rows in the PD array as 128 parallel outputs.\n\nFigure 5: \u201cCyclic line access and row-parallel processing\u201d scheme. Rows held in the five analog PD memories, top to bottom: (a) 1, 2, 3, 4, 1; (b) 5, 2, 3, 4, 5; (c) 5, 6, 3, 4, 5; (d) 5, 6, 7, 4, 5.\n\nNow, the operation of the row-parallel processing module is explained with reference to Fig. 4(b) and Fig. 5. In order to carry out the convolution for the data in Rows 1 to 4, the PD data are temporarily stored in the analog memory array as shown in Fig. 5(a). It is important to note that the data from Row 1 are duplicated at the bottom. The convolution operation proceeds using the upper four rows of data as shown in Fig. 5(a). In the next step, the data from Row 5 overwrite the sites of the Row 1 data as shown in Fig. 5(b). The operation proceeds using the lower four rows of data, and the second set of outputs is produced. In the third step, the data from Row 6 overwrite the sites of the Row 2 data (Fig. 5(c)), and the convolution is taken using the data in the enclosure. Although a part of the data (the top two rows) is separated from the rest, the topology of the hardware computation is identical to that explained in Fig. 5(a). This is because the same set of data is stored in both the top and bottom PD memories, and the top and bottom PE\u2019s are connected by the \u201ccyclic connection\u201d as illustrated in Fig. 4(b). 
By introducing one extra row of PD memories and one extra row of PE\u2019s with cyclic interconnections, row-parallel processing can be performed seamlessly with the download of only a single row of PD data at each step.\n\n3 Circuit Configurations\n\nIn this architecture, we need only two arithmetic operations, i.e., the sum of four inputs and the subtraction.\n\nFigure 6(a) shows the adder circuit using the multiple-input floating-gate source follower [9]. The substrate of M1 is connected to the source to avoid the body effect. The transistor M2 operates as a current source for fast output voltage stabilization as well as to achieve good linearity. Due to the charge redistribution in the floating gate, the average of the four input voltages appears at the output as\n\n$$V_{out} = (V_1 + V_2 + V_3 + V_4)/4 + |V_{th}|,$$\n\nwhere $V_{th}$ represents the threshold voltage of M1. Here, the four coupling capacitors connected to the floating gate of M1 are identical, and the capacitive coupling between the floating gate and ground is assumed to be zero for simplicity. The electrical charge in the floating gate is initialized periodically using the reset switch (M3). The coupling capacitors themselves are also utilized as temporary memories for the PD data read out from the PD array.\n\nFigure 6(b) shows the subtraction circuit, where the same source follower was used. 
When SW1 and SW2 are turned on and SW3 is turned off, the following voltage difference is developed across the capacitor C5:\n\n$$[(V_{in,p1} + V_{in,p2}) - (V_{in,m1} + V_{in,m2})]/4.$$\n\nThen, SW1 and SW2 are turned off, and SW3 is turned on. As a result, the output voltage $V_{out}$ becomes\n\n$$V_{out} = [(V_{in,p1} + V_{in,p2}) - (V_{in,m1} + V_{in,m2})]/4 + V_{ref} + |V_{th}|,$$\n\nwhere $V_{th}$ represents the threshold voltage of M1.\n\nFigure 6: Adder circuit (a) and subtraction circuit (b) using floating-gate MOS technology.\n\n4 Experimental Results\n\nA proof-of-concept chip was designed and fabricated in a 0.35-\u00b5m 2-poly 3-metal CMOS technology. Figure 7 shows the photomicrograph of the chip, and the chip specifications are given in Table 1. Since the pitch of a single PE unit is larger than the pitch of the PD array, 130 PE units are laid out as two separate L-shaped blocks at the periphery of the PD array, as seen in the chip photomicrograph. Successful operation of the chip was experimentally verified.\n\nAn example is shown in Fig. 8, where the experimental results for 45-degree edge filtering are demonstrated. Since the thresholding circuitry was not implemented in the present chip, only the convolution results are shown. 
The 128 parallel outputs from the test chip were multiplexed for observation using external multiplexers mounted on a milled printed circuit board. The vertical stripes observed in the result are due to resistance variation in the external interconnects, which were poorly produced on the milled printed circuit board.\n\nIt was experimentally confirmed that the chip operates at 1000 frames/sec. However, the operation is limited by the integration time of the PD\u2019s, and typical motion images are processed at about 200 frames/sec. The power dissipation in the PE\u2019s was 25 mW and that in the PD array was 40 mW.\n\nFigure 7: Chip photomicrograph.\n\nTable 1: Chip Specifications.\nProcess Technology: 0.35 \u00b5m CMOS, 2-Poly, 3-Metal\nDie Size: 9.8 mm x 9.8 mm\nVoltage Supply: 3.3 V\nOperating Frequency: 50 MHz\nPower Dissipation: 25 mW (PE Array)\nPE Operation: 1000 Frames/sec\nTypical Frame Rate: 200 Frames/sec (limited by PD integration time)\n\nFigure 8: Experimental setup (a), and measurement results of 45 degree edge filtering convolution (b).\n\n5 Conclusions\n\nAn analog edge-filtering processor has been developed based on two key technologies: the \u201conly-nearest-neighbor interconnects\u201d architecture and \u201ccyclic line access and row-parallel processing\u201d. As a result, the convolution operation involving second nearest-neighbor pixel data for an N\u00d7N-pixel image can be performed in only 2N steps. The edge filtering operation for 128\u00d7128-pixel images at 200 frames/sec 
has been experimentally demonstrated. The chip meets the requirements of low-power and real-time-response applications.\n\n6 Acknowledgments\n\nThe VLSI chip in this study was fabricated in the chip fabrication program of the VLSI Design and Education Center (VDEC), the University of Tokyo, in collaboration with Rohm Corporation and Toppan Printing Corporation. The work is partially supported by the Ministry of Education, Science, Sports, and Culture under Grant-in-Aid for Scientific Research (No. 14205043).\n\nReferences\n\n[1] D. H. Hubel and T. N. Wiesel, \u201cReceptive fields of single neurons in the cat\u2019s striate cortex,\u201d Journal of Physiology, vol. 148, pp. 574-591, 1959.\n\n[2] M. Yagi and T. Shibata, \u201cAn image representation algorithm compatible with neural-associative-processor-based hardware recognition systems,\u201d IEEE Trans. Neural Networks, vol. 14(5), pp. 1144-1161, 2003.\n\n[3] J. C. Gealow and C. G. Sodini, \u201cA pixel parallel-processor using logic pitch-matched to dynamic memory,\u201d IEEE J. Solid-State Circuits, vol. 34, pp. 831-839, 1999.\n\n[4] K. Ito, M. Ogawa and T. Shibata, \u201cA variable-kernel flash-convolution image filtering processor,\u201d Dig. Tech. Papers of Int. Solid-State Circuits Conf., pp. 470-471, 2003.\n\n[5] L. D. McIlrath, \u201cA CCD/CMOS focal plane array edge detection processor implementing the multiscale veto algorithm,\u201d IEEE J. Solid-State Circuits, vol. 31(9), pp. 1239-1247, 1996.\n\n[6] R. Etienne-Cummings, Z. K. Kalayjian and D. Cai, \u201cA programmable focal plane MIMD image processor chip,\u201d IEEE J. Solid-State Circuits, vol. 36(1), pp. 64-73, 2001.\n\n[7] T. Taguchi, M. Ogawa and T. Shibata, \u201cAn Analog Image Processing LSI Employing Scanning Line Parallel Processing,\u201d Proc. 29th European Solid-State Circuits Conference (ESSCIRC 2003), pp. 65-68, 2003.\n\n[8] Y. Nakashita, Y. Mita and T. 
Shibata, \u201cAn Analog Edge-Filtering Processor Employing Only-Nearest-Neighbor Interconnects,\u201d Ext. Abstracts of the International Conference on Solid State Devices and Materials (SSDM \u201904), pp. 356-357, 2004.\n\n[9] T. Shibata and T. Ohmi, \u201cA Functional MOS Transistor Featuring Gate-Level Weighted Sum and Threshold Operations,\u201d IEEE Trans. Electron Devices, vol. 39(6), pp. 1444-1455, 1992.", "award": [], "sourceid": 2930, "authors": [{"given_name": "Yusuke", "family_name": "Nakashita", "institution": null}, {"given_name": "Yoshio", "family_name": "Mita", "institution": null}, {"given_name": "Tadashi", "family_name": "Shibata", "institution": null}]}