{"title": "A Simple and Fast Neural Network Approach to Stereovision", "book": "Advances in Neural Information Processing Systems", "page_first": 808, "page_last": 814, "abstract": "", "full_text": "A Simple and Fast Neural Network Approach to Stereovision \n\nRolf D. Henkel \n\nInstitute of Theoretical Physics \n\nUniversity of Bremen \n\nP.O. Box 330 440, D-28334 Bremen \n\nhttp://axon.physik.uni-bremen.de/~rdh \n\nAbstract \n\nA neural network approach to stereovision is presented, based on aliasing effects of simple disparity estimators and a fast coherence-detection scheme. Within a single network structure, a dense disparity map with an associated validation map and, additionally, the fused cyclopean view of the scene are available. The network operations are based on simple, biologically plausible circuitry; the algorithm is fully parallel and non-iterative. \n\n1 Introduction \n\nHumans experience the three-dimensional world not as it is seen by either their left or right eye, but from the position of a virtual cyclopean eye, located midway between the two real eye positions. The different perspectives of the left and right eyes cause slight relative displacements of objects in the two retinal images (disparities), which make a simple superposition of both images without diplopia impossible. Proper fusion of the retinal images into the cyclopean view requires the registration of both images to a common coordinate system, which in turn requires the calculation of disparities for all image areas which are to be fused. \n\n1.1 The Problems with Classical Approaches \n\nThe estimation of disparities turns out to be a difficult task, since various random and systematic image variations complicate it. 
Several different techniques have been proposed over time, which can be loosely grouped into feature-, area- and phase-based approaches. All these algorithms have a number of computational problems directly linked to the very assumptions inherent in these approaches. \n\nIn feature-based stereo, intensity data is first converted to a set of features assumed to be a more stable image property than the raw image intensities. Matching primitives used include zero-crossings, edges and corner points (Frisby, 1991), or higher-order primitives like topological fingerprints (see for example Fleck, 1991). Generally, the set of feature classes is discrete, causing the two primary problems of feature-based stereo algorithms: the famous \"false-matches\" problem and the problem of missing disparity estimates. \n\nFalse matches are caused by the fact that a single feature in the left image can potentially be matched with every feature of the same class in the right image. This problem is basic to all feature-based stereo algorithms and can only be solved by the introduction of additional constraints on the solution. In conjunction with the extracted features, these constraints define a complicated error measure which can be minimized by cooperative processes (Marr, 1979) or by direct (Ohta, 1985) or stochastic search techniques (Yuille, 1991). While cooperative processes and stochastic search techniques can be realized easily on a neural basis, it is not immediately clear how to implement the more complicated algorithmic structures of direct search techniques neuronally. Cooperative processes and stochastic search techniques also turn out to be slow, needing many iterations to converge to a local minimum of the error measure. 
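To make the nature of such an error-measure minimization concrete, here is a minimal toy sketch. It is an illustration only, not the method of any work cited here: the function name `icm_stereo`, the weights and the 1D setting are invented for the example. A matching cost plus a smoothness term over a one-dimensional disparity field is minimized by repeated local updates.

```python
import numpy as np

def icm_stereo(left, right, max_disp, lam=0.1, iters=5):
    """Toy minimization of sum_x |left[x] - right[x+d_x]| + lam * sum_x |d_x - d_{x+1}|
    by iterated local updates (a stand-in for the cooperative / search
    minimization discussed in the text)."""
    n = len(left)
    BIG = 1e9
    # unary matching cost for every pixel/disparity candidate
    cost = np.full((n, max_disp + 1), BIG)
    for d in range(max_disp + 1):
        cost[:n - d, d] = np.abs(left[:n - d] - right[d:])
    disp = cost.argmin(axis=1)  # greedy initialisation
    for _ in range(iters):      # many sweeps may be needed in general
        for x in range(n):
            smooth = np.zeros(max_disp + 1)
            for d in range(max_disp + 1):
                if x > 0:
                    smooth[d] += abs(d - disp[x - 1])
                if x < n - 1:
                    smooth[d] += abs(d - disp[x + 1])
            disp[x] = np.argmin(cost[x] + lam * smooth)
    return disp
```

Each sweep only moves the field toward a local minimum of the error measure, which illustrates the slowness noted in the text: such iterative schemes must be repeated until convergence.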
\n\nThe requirement that features be a stable image property causes the second problem of feature-based stereo: stable features can be detected only in a fraction of the whole image area, leading to missing disparity estimates for most of the image. For those image parts, disparity estimates can only be guessed. \n\nDense disparity maps can be obtained with area-based approaches, where a suitably chosen correlation measure is maximized between small image patches of the left and right view. However, a neuronally plausible implementation of this does not seem to be readily available. Furthermore, the maximization turns out to be a computationally expensive process, since extensive search is required in configuration space. \n\nHierarchical processing schemes can be utilized for speed-up, by using information obtained at coarse spatial scales to restrict searching at finer scales. But, for general image data, it is not guaranteed that the disparity information obtained at some coarse scale is valid. The disparity data might be wrong, might have a different value than at finer scales, or might not be present at all. Furthermore, by processing data from coarse to fine spatial scales, hierarchical processing schemes are intrinsically sequential. This creates additional algorithmic overhead which is again difficult to realize with neuronal structures. \n\nThe same comments apply to phase-based approaches, where a locally extracted Fourier-phase value is used for matching. Phase values are only defined modulo 2π, and this wrap-around makes the use of hierarchical processing essential for these types of algorithms. Moreover, since data is analyzed in different spatial frequency channels, it is nearly certain that some phase values will be undefined at intermediate scales, due to missing signal energy in the corresponding frequency band (Fleet, 1993). 
Thus, in addition to hierarchical processing, some kind of exception handling is needed with these approaches. \n\n2 Stereovision by Coherence Detection \n\nIn summary, classical approaches to stereovision seem to have difficulties with the fast calculation of dense disparity maps, at least with plausible neural circuitry. In the following, a neural network implementation will be described which solves this task by using simple disparity estimators based on motion-energy mechanisms (Adelson, 1985; Qian, 1997), closely resembling responses of complex cells in visual cortex (DeAngelis, 1991). Disparity units of this type belong to a class of disparity estimators which can be derived from optical flow methods (Barron, 1994). Clearly, disparity calculations and optical flow estimation share many similarities. The two stereo views of a (static) scene can be considered as two time-slices cut out of the space-time intensity pattern which would be recorded by an imaginary camera moving from the position of the left to the position of the right eye. However, compared to optical flow, disparity estimation is complicated by the fact that only two discrete \"time\"-samples are available, namely the images of the left and right view positions. \n\n
There is a relation between the smallest wavelength (λ_min) present in the data and the largest disparity which can be estimated reliably (Henkel, 1997): \n\nd < π/k_max = λ_min/2 .  (1) \n\nA well-known example of the size-disparity scaling expressed in equation (1) is found in the context of the spatial frequency channels assumed to exist in the visual cortex. Cortical cells respond to spatial wavelengths down to about half their peak wavelength λ_opt; therefore, they can reliably estimate only disparities less than 1/4 λ_opt. This is known as Marr's quarter-cycle limit (Blake, 1991). \n\nEquation (1) immediately suggests a way to extend the limited working range of disparity estimators: spatial smoothing of the image data before or during disparity calculation reduces k_max, and in turn increases the disparity range. However, spatial smoothing also reduces the spatial resolution of the resulting disparity map. Another way of modifying the usable range of disparity estimators is the application of a fixed preshift to the input data before disparity calculation. This would require prior knowledge of the correct preshift to be applied, which is a nontrivial problem. One could resort to hierarchical coarse-to-fine schemes, but the difficulties with hierarchical schemes have already been elaborated. \n\nThe aliasing effects discussed are a general feature of sampling visual space with only two eyes; instead of counteracting them, one can exploit them in a simple coherence-detection scheme, where the multi-unit activity in stacks of disparity detectors tuned to a common view direction is analyzed. \n\nAssuming that all disparity units i in a stack have random preshifts or presmoothing applied to their input data, these units will have different, but slightly overlapping working ranges D_i = [d_i^min, d_i^max] for valid disparity estimates. 
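The preshifted stack, and the coherence detection performed over it as described next, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the tolerance `tol`, and the interval form of D_i are assumptions made for the example.

```python
import numpy as np

# Sketch: assume unit i sees the input preshifted by s_i, so by equation (1)
# its valid working range is roughly D_i = [s_i - lambda_min/2, s_i + lambda_min/2].
def working_ranges(preshifts, lambda_min):
    """Valid disparity interval for each unit in a stack."""
    half = lambda_min / 2.0  # d < pi/k_max = lambda_min/2
    return [(s - half, s + half) for s in preshifts]

def coherent_estimate(estimates, tol=0.5):
    """Coherence detection within one stack: find the largest cluster of
    mutually agreeing estimates d_i ~ d_j and average over it."""
    estimates = np.asarray(estimates, dtype=float)
    best = np.array([])
    for d in estimates:
        cluster = estimates[np.abs(estimates - d) < tol]
        if cluster.size > best.size:
            best = cluster
    return best.mean()
```

Units outside their working range produce essentially random estimates, so only the coherent cluster survives the vote; this is the non-iterative step that replaces explicit search.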
An object with true disparity d, seen in the common view direction of such a stack, will therefore split the stack into two disjoint classes: the class C of estimators with d ∈ D_i for all i ∈ C, and the rest of the stack, C̄, with d ∉ D_i. All disparity estimators in C will code more or less the true disparity, d_i ≈ d, but the estimates of units belonging to C̄ will be subject to the random aliasing effects discussed, depending in a complicated way on image content and the disparity range D_i of the unit. \n\nWe will thus have d_i ≈ d ≈ d_j whenever units i and j belong to C, and random relationships otherwise. A simple coherence detection within each stack, i.e. searching for all units with d_i ≈ d_j and extracting the largest cluster found, will be sufficient to single out C. The true disparity d in the view direction of the stack can then simply be estimated as an average over all coherently coding units: \n\nd ≈ 1/N(C) Σ_{i ∈ C} d_i . \n\n3 Neural Network Implementation \n\nRepeating this coherence-detection scheme in every view direction results in a fully parallel network structure for disparity calculation. Neighboring disparity stacks responding to different view directions estimate disparity values independently from each other, and within each stack, disparity units operate independently from each other. Since coherence detection is an opportunistic scheme, extensions of the basic algorithm to multiple spatial scales and combinations of different types of disparity estimators are trivial: additional units are simply included in the appropriate coherence stacks. The coherence scheme will combine only the information from the coherently coding units and ignore the rest of the data. For this reason, the scheme also turns out to be extremely robust against single-unit failures. \n\n
Figure 2: The network structure for a single horizontal scan-line (left). The view directions of the disparity stacks split the angle between the left and right lines of sight in half, both in the network and in 3D-space, therefore analyzing space along the cyclopean view directions (right). \n\nIn the current implementation (Fig. 2), disparity units at a single spatial scale are arranged into horizontal disparity layers. Left and right image data is fed into this network along diagonally running data lines. This causes every disparity layer to receive the stereo data with a certain fixed preshift applied, leading to the required, slightly different working ranges of neighboring layers. Disparity units stacked vertically above each other are collected into a single disparity stack, which is then analyzed for coherent activity. \n\n4 Results \n\nThe new stereo network performs comparably on several standard test image sets (Fig. 3). The calculated disparity maps are similar to maps obtained by classical area-based approaches, but they display subpixel precision. Since no smoothing or regularization is performed by the coherence-based stereo algorithm, sharp disparity edges can be observed at object borders. \n\nFigure 3: Disparity maps for some standard test images (small insets), calculated by the coherence-based stereo algorithm. \n\nFigure 4: The performance of coherence-based stereo on a difficult scene with specular highlights, transparency and repetitive structures (left). The disparity map (middle) is dense and correct, except for a few structure-less image regions. These regions, as well as most object borders, are indicated in the validation map (right) with a low [dark] validation count. \n\nWithin the network, a simple validation map is available locally. A measure of local 
coherence can be obtained by calculating the relative number of coherently acting disparity units in each stack, i.e. the ratio N(C)/N(C ∪ C̄), where N(C) is the number of units in class C. In most cases, this validation map clearly marks image areas where the disparity calculations failed (for various reasons, notably at occlusions caused by object borders, or in large structure-less image regions, where no reliable matching can be obtained; compare Fig. 4). \n\nClose inspection of the disparity and validation maps reveals that these image maps are not aligned with the left or the right view of the scene. Instead, both maps are registered with the cyclopean view. This is caused by the structural arrangement of data lines and disparity stacks in the network. Reprojecting data lines and stacks back into 3D-space shows that the stacks analyze three-dimensional space along lines splitting the angle between the left and right view directions in half. This is the cyclopean view direction as defined by Hering (1879). \n\nIt is easy to obtain the cyclopean view of the scene itself. With I_i^l and I_i^r denoting the left and right input data at the position of disparity unit i, a summation over all coherently coding disparity units in a stack, i.e., \n\nI^C = 1/(2N(C)) Σ_{i ∈ C} (I_i^l + I_i^r) , \n\ngives the image intensity I^C in the cyclopean view direction of this stack. Collecting I^C from all disparity stacks gives the complete cyclopean view as the third co-registered map of the network (Fig. 5). \n\nFigure 5: A simple superposition of the left and right stereo images results in diplopia (left). By using a vergence system, the two stereo images can be aligned better (middle), but diplopia is still prominent in most areas of the visual field. The fused cyclopean view of the scene (right) was calculated by the coherence-based stereo network. \n\nAcknowledgements \n\nThanks to Helmut Schwegler and Robert P. 
O'Shea for interesting discussions. Image data courtesy of G. Medioni, USC Institute for Robotics & Intelligent Systems, B. Bolles, AIC, SRI International, and G. Sommer, Kiel Cognitive Systems Group, Christian-Albrechts-Universität Kiel. An internet-based implementation of the algorithm presented in this paper is available at http://axon.physik.uni-bremen.de/~rdh/online_calc/stereo/. \n\nReferences \n\nAdelson, E.H. & Bergen, J.R. (1985): Spatiotemporal Energy Models for the Perception of Motion. J. Opt. Soc. Am. A 2: 284-299. \n\nBarron, J.L., Fleet, D.J. & Beauchemin, S.S. (1994): Performance of Optical Flow Techniques. Int. J. Comp. Vis. 12: 43-77. \n\nBlake, R. & Wilson, H.R. (1991): Neural Models of Stereoscopic Vision. TINS 14: 445-452. \n\nDeAngelis, G.C., Ohzawa, I. & Freeman, R.D. (1991): Depth is Encoded in the Visual Cortex by a Specialized Receptive Field Structure. Nature 352: 156-159. \n\nFleck, M.M. (1991): A Topological Stereo Matcher. Int. J. Comp. Vis. 6: 197-226. \n\nFleet, D.J. & Jepson, A.D. (1993): Stability of Phase Information. IEEE PAMI 15: 1253-1268. \n\nFrisby, J.P. & Pollard, S.B. (1991): Computational Issues in Solving the Stereo Correspondence Problem. In: Computational Models of Visual Processing, eds. M.S. Landy and J.A. Movshon, pp. 331, MIT Press, Cambridge 1991. \n\nHenkel, R.D. (1997): Fast Stereovision by Coherence Detection. In: Proc. of CAIP'97, Kiel, eds. G. Sommer, K. Daniilidis and J. Pauli, LNCS 1296, pp. 297, Springer, Heidelberg 1997. \n\nHering, E. (1879): Der Raumsinn und die Bewegungen des Auges. In: Handbuch der Physiologie, ed. L. Hermann, Band 3, Teil 1, Vogel, Leipzig 1879. \n\nMarr, D. & Poggio, T. (1979): A Computational Theory of Human Stereo Vision. Proc. R. Soc. Lond. B 204: 301-328. \n\nOhta, Y. & Kanade, T. (1985): Stereo by Intra- and Inter-Scanline Search Using Dynamic Programming. IEEE PAMI 7: 139-154. \n\nQian, N. & Zhu, Y. 
(1997): Physiological Computation of Binocular Disparity, to appear in Vision Research. \n\nYuille, A.L., Geiger, D. & Bülthoff, H.H. (1991): Stereo Integration, Mean Field Theory and Psychophysics. Network 2: 423-442. \n\n", "award": [], "sourceid": 1352, "authors": [{"given_name": "Rolf", "family_name": "Henkel", "institution": null}]}