{"title": "A Hierarchical Model of Complex Cells in Visual Cortex for the Binocular Perception of Motion-in-Depth", "book": "Advances in Neural Information Processing Systems", "page_first": 1271, "page_last": 1278, "abstract": "", "full_text": "A hierarchical model of complex cells in \nvisual cortex for the binocular perception \n\nof motion-in-depth \n\nSilvio P. Sabatini, Fabio Solari, Giulia Andreani, \n\nChiara Bartolozzi, and Giacomo M. Bisio \n\nDepartment of Biophysical and Electronic Engineering \n\nUniversity of Genoa, 1-16145 Genova, ITALY \n\nsilvio@dibe.unige.it \n\nAbstract \n\nA cortical model for motion-in-depth selectivity of complex cells in \nthe visual cortex is proposed. The model is based on a time ex(cid:173)\ntension of the phase-based techniques for disparity estimation. We \nconsider the computation of the total temporal derivative of the \ntime-varying disparity through the combination of the responses of \ndisparity energy units. To take into account the physiological plau(cid:173)\nsibility, the model is based on the combinations of binocular cells \ncharacterized by different ocular dominance indices. The resulting \ncortical units of the model show a sharp selectivity for motion-in(cid:173)\ndepth that has been compared with that reported in the literature \nfor real cortical cells. \n\n1 \n\nIntroduction \n\nThe analysis of a dynamic scene implies estimates of motion parameters to infer \nspatio-temporal information about the visual world. In particular, the perception \nof motion-in-depth (MID), i.e. \nthe capability of discriminating between forward \nand backward movements of objects from an observer, has important implications \nfor navigation in dynamic environments. In general, a reliable estimate of motion(cid:173)\nin-depth can be gained by considering the dynamic stereo correspondence problem \nin the stereo image signals acquired by a binocular vision system. Fig. 
1 shows the relationships between an object moving in 3-D space and its geometrical projections onto the right and left retinas. To a first approximation, the positions of corresponding points are related by a 1-D horizontal shift, the disparity, along the direction of the epipolar lines. Formally, the left and right observed intensities from the two eyes, respectively I^L(x) and I^R(x), are related as I^L(x) = I^R[x + δ(x)], where δ(x) is the horizontal binocular disparity. If an object moves from P to Q, its disparity changes and it projects different velocities (v_L, v_R) onto the retinas.

[Figure 1: The dynamic stereo correspondence problem. A moving object in 3-D space projects different trajectories onto the left and right retinas. The differences between the two trajectories carry information about motion-in-depth. The boxed relations read: δ(t+Δt) = (x_QL − x_QR) ≈ a(D − Z_Q)/D²; Δδ/Δt = [δ(t+Δt) − δ(t)]/Δt = [(x_QL − x_PL) − (x_QR − x_PR)]/Δt; V_Z ≈ (Δδ/Δt)·D²/a ≈ (v_L − v_R)·D²/a.]

Thus, the Z component of the object's motion (i.e., its motion-in-depth) V_Z can be approximated in two ways [1]: (1) by the rate of change of disparity, and (2) by the difference between retinal velocities, as evidenced in the box in Fig. 1. The predominance of one measure over the other corresponds to different hypotheses about the architectural solutions adopted by visual cortical cells to encode dynamic 3-D visual information. Recently, numerous experimental and computational studies (see, e.g., [2] [3] [4] [5]) addressed this issue by analyzing the binocular spatio-temporal properties of simple and complex cells.
The fact that the resulting disparity tuning does not vary with time, and that most of the cells in the primary visual cortex have the same motion preference for the two eyes, led to the conclusion that these cells are not tuned to motion-in-depth. In this paper, we demonstrate that, within a phase-based disparity encoding scheme, such cells relay phase temporal derivative components that can be combined, at a higher level, to yield a specific motion-in-depth selectivity. The rationale of this statement relies upon analytical considerations on phase-based dynamic stereopsis, as a time extension of the well-known phase-based techniques for disparity estimation [6] [7]. The resulting model is based on the computation of the total temporal derivative of the disparity through the combination of the outputs of binocular disparity energy units [4] [5] characterized by different ocular dominance indices. Since each energy unit is essentially a binocular Adelson and Bergen motion detector, this establishes a link between the information contained in the total rate of change of the binocular disparity and that held by the interocular velocity differences.

2 Phase-based dynamic stereopsis

In recent decades, a computational approach to stereopsis that relies on the phase information contained in the spectral components of the stereo image pair has been proposed [6] [7]. Spatially localized phase measures on the left and right images can be obtained by filtering with a complex-valued quadrature pair of Gabor filters h(x, k₀) = e^(−x²/2σ²) e^(jk₀x), where k₀ is the peak frequency of the filter and σ relates to its spatial extension.
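As a concrete illustration of this phase-difference scheme, the following sketch (Python/NumPy; the parameter values k0, sigma and the stimulus are hypothetical, chosen for illustration and not taken from the paper) recovers a constant disparity from a narrowband 1-D stereo pair:

```python
import numpy as np

def gabor_phase(signal, k0, sigma):
    """Local phase from a complex Gabor filter h(x) = exp(-x^2/2s^2) exp(j k0 x)."""
    x = np.arange(-4 * sigma, 4 * sigma + 1)
    h = np.exp(-x**2 / (2 * sigma**2)) * np.exp(1j * k0 * x)
    q = np.convolve(signal, h, mode="same")   # Q(x) = C(x) + jS(x)
    return np.angle(q)                        # phi(x) = arctan(S/C)

k0, sigma, true_disp = 0.5, 8.0, 2.0
x = np.arange(256, dtype=float)
left = np.cos(k0 * x)                  # I_L(x)
right = np.cos(k0 * (x - true_disp))   # I_R(x) = I_L(x - delta), i.e. I_L(x) = I_R(x + delta)

phi_L = gabor_phase(left, k0, sigma)
phi_R = gabor_phase(right, k0, sigma)
# delta(x) = (phi_L - phi_R)/k0, wrapping the phase difference into (-pi, pi]
delta = np.angle(np.exp(1j * (phi_L - phi_R))) / k0
print(round(delta[128], 3))
```

For this narrowband stimulus the instantaneous frequency equals k₀, so the recovered disparity matches true_disp away from the image borders; wrapping the difference avoids explicit phase unwrapping.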
The resulting convolutions with the left and right binocular signals can be expressed as Q(x) = ρ(x)e^(jφ(x)) = C(x) + jS(x), where ρ(x) = √(C²(x) + S²(x)) and φ(x) = arctan(S(x)/C(x)) denote their amplitude and phase components, respectively, and C(x) and S(x) are the responses of the quadrature pair of filters. Hence, binocular disparity can be predicted by δ(x) = [φ^L(x) − φ^R(x)]/k(x), where k(x) = [φ_x^L(x) + φ_x^R(x)]/2, with φ_x the spatial derivative of the phase φ, is the average instantaneous frequency of the bandpass signal, which, under a linear phase model, can be approximated by the peak frequency k₀ of the Gabor filter. Extending to the time domain, the disparity of a point moving with the motion field can be estimated by:

δ[x(t), t] = (φ^L[x(t), t] − φ^R[x(t), t]) / k₀    (1)

where the phase components are computed from the spatiotemporal convolutions of the stereo image pair Q(x, t) = C(x, t) + jS(x, t) with directionally tuned Gabor filters with central frequency p = (k₀, ω₀). For spatiotemporal locations where the linear phase approximation still holds (φ ≈ k₀x + ω₀t), the phase differences in Eq. (1) provide only spatial information, useful for reliable disparity estimates.

2.1 Motion-in-depth

If disparity is defined with respect to the spatial coordinate x_L, by differentiating with respect to time, its total rate of variation can be written as

dδ/dt = ∂δ/∂t + (v_L/k₀)(φ_x^L − φ_x^R)    (2)

where v_L is the horizontal component of the velocity signal on the left retina. Considering the conservation property of local phase measurements [8], image velocities can be computed from the temporal evolution of constant-phase contours, and thus:

v_L = −φ_t^L/φ_x^L   and   v_R = −φ_t^R/φ_x^R    (3)

with φ_t = ∂φ/∂t. Combining Eq. (3) with Eq.
(2) we obtain dδ/dt = (v_R − v_L)·φ_x^R/k₀, where (v_R − v_L) is the phase-based interocular velocity difference along the epipolar lines. When the spatial tuning frequency k₀ of the Gabor filter approaches the instantaneous spatial frequency of the left and right convolution signals, one can derive the following approximated expressions:

dδ/dt ≈ ∂δ/∂t = (φ_t^L − φ_t^R)/k₀ ≈ v_R − v_L    (4)

The partial derivative of the disparity can be computed directly from the convolutions (S, C) of the stereo image pairs and from their temporal derivatives (S_t, C_t):

∂δ/∂t = [ (S_t^L C^L − S^L C_t^L)/((S^L)² + (C^L)²) − (S_t^R C^R − S^R C_t^R)/((S^R)² + (C^R)²) ] · 1/k₀    (5)

thus avoiding the explicit calculation and differentiation of phase, and the attendant problem of phase unwrapping. Considering that, to a first approximation, (S^L)² + (C^L)² ≈ (S^R)² + (C^R)², and that these terms are scarcely discriminative for motion-in-depth, we can formulate the cortical model taking into account the numerator terms only.

2.2 The cortical model

If one prefilters the image signal to extract some temporal frequency sub-band, S(x, t) ≈ g ∗ S(x, t) and C(x, t) ≈ g ∗ C(x, t), and evaluates the temporal changes in that sub-band, differentiation can be attained by convolving the data with appropriate bandpass temporal filters:

S'(x, t) = g' ∗ S(x, t) ;  C'(x, t) = g' ∗ C(x, t).

S' and C' approximate S_t and C_t, respectively, if g and g' are a quadrature pair of temporal filters, e.g., g(t) = e^(−t/τ) sin ω₀t and g'(t) = e^(−t/τ) cos ω₀t. From a modeling perspective, this approximation allows us to express derivative operations in terms of convolutions with a set of spatio-temporal filters whose shapes resemble those of simple-cell receptive fields (RFs) of the primary visual cortex. However, it is worth noting that a direct interpretation of the computational model is not biologically plausible.
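Before turning to the biologically plausible rewriting of this scheme, Eq. (5) can be checked numerically. In the sketch below (Python/NumPy; the quadrature responses are written analytically for an ideal narrowband drifting pair rather than computed by actual filtering, and all parameter values are hypothetical), the left and right phases drift at different speeds v_L and v_R, and the combination of convolutions recovers v_R − v_L without any phase unwrapping:

```python
import numpy as np

k0, vL, vR = 0.5, 0.6, 1.0
x = np.arange(256, dtype=float)

# Idealized quadrature responses at t = 0 for phases phi = k0*(x - v*t):
phi_L, phi_R = k0 * x, k0 * x
CL, SL = np.cos(phi_L), np.sin(phi_L)
CR, SR = np.cos(phi_R), np.sin(phi_R)
# Temporal derivatives: d/dt cos(phi) = -phi_t*sin(phi), with phi_t = -k0*v
CLt, SLt = k0 * vL * np.sin(phi_L), -k0 * vL * np.cos(phi_L)
CRt, SRt = k0 * vR * np.sin(phi_R), -k0 * vR * np.cos(phi_R)

# Eq. (5): partial derivative of disparity, no explicit phase needed
ddelta_dt = ((SLt * CL - SL * CLt) / (SL**2 + CL**2)
             - (SRt * CR - SR * CRt) / (SR**2 + CR**2)) / k0
print(ddelta_dt[:4])  # uniformly equal to vR - vL = 0.4
```

Each bracketed ratio is exactly φ_t for one eye (since S_t C − S C_t = φ_t(C² + S²)), so the expression reduces to (φ_t^L − φ_t^R)/k₀ = v_R − v_L, in agreement with Eq. (4).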
Indeed, in the computational scheme (see Eq. (5)), the temporal variations of the phases are obtained by processing the monocular images separately; the resulting signals are then binocularly combined to give an estimate of motion-in-depth at each spatial location. To employ binocular RFs from the beginning, as they exist for most of the cells in the visual cortex, we manipulated the numerator by rewriting it as a combination of terms characterized by a dominant contribution from the ipsilateral eye and a non-dominant contribution from the contralateral eye. These contributions are referable to binocular disparity energy units [5] built from two pairs of binocular direction-selective simple cells with left and right RFs weighted by an ocular dominance index α ∈ [0, 1]. The "tilted" spatio-temporal RFs of the simple cells of the model are obtained by combining separable RFs according to an Adelson and Bergen scheme [9]. It can be demonstrated that the information about motion-in-depth can be obtained with a minimum of eight binocular simple cells, four with a left and four with a right ocular dominance, respectively (see Fig. 2):

S1 = (1 − α)(C_t^L + S^L) − α(C^R − S_t^R)      S2 = (1 − α)(C^L + S_t^L) + α(C_t^R + S^R)
S3 = (1 − α)(C_t^L − S^L) − α(C^R + S_t^R)      S4 = (1 − α)(C^L + S_t^L) + α(C_t^R − S^R)
S5 = α(C_t^L + S^L) − (1 − α)(C^R − S_t^R)      S6 = α(C^L − S_t^L) + (1 − α)(C_t^R + S^R)
S7 = α(C_t^L − S^L) − (1 − α)(C^R + S_t^R)      S8 = α(C^L + S_t^L) + (1 − α)(C_t^R − S^R)

C11 = S1² + S2² ; C12 = S5² + S6² ; C13 = S3² + S4² ; C14 = S7² + S8²

C21 = C12 − C11 ; C22 = C13 − C14

C3 = (1 − 2α)(S_t^L C^L − S^L C_t^L − S_t^R C^R + S^R C_t^R).

The output of the higher complex cell in the hierarchy (C3) truly encodes motion-in-depth information. It is worth noting that for a balanced ocular dominance (α = 0.5) the cell loses its selectivity.
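The behavior of the output cell can be illustrated with the same idealized drifting inputs as before. This sketch (Python/NumPy; signal values are hypothetical and only the closed-form C3 expression above is taken from the text, not the full eight-cell hierarchy) shows that C3 scales with (1 − 2α)·k₀·(v_R − v_L) and vanishes for balanced ocular dominance:

```python
import numpy as np

def c3(CL, SL, CLt, SLt, CR, SR, CRt, SRt, alpha):
    """Output cell: C3 = (1 - 2a)(StL*CL - SL*CtL - StR*CR + SR*CtR)."""
    return (1 - 2 * alpha) * ((SLt * CL - SL * CLt) - (SRt * CR - SR * CRt))

k0, vL, vR = 0.5, 0.6, 1.0
x = np.arange(64, dtype=float)
phi = k0 * x                        # both phase profiles coincide at t = 0
CL, SL = np.cos(phi), np.sin(phi)
CR, SR = np.cos(phi), np.sin(phi)
CLt, SLt = k0 * vL * np.sin(phi), -k0 * vL * np.cos(phi)
CRt, SRt = k0 * vR * np.sin(phi), -k0 * vR * np.cos(phi)

left_dominant = c3(CL, SL, CLt, SLt, CR, SR, CRt, SRt, alpha=0.2)
balanced      = c3(CL, SL, CLt, SLt, CR, SR, CRt, SRt, alpha=0.5)
# left_dominant equals (1 - 2*0.2)*k0*(vR - vL) at every position;
# balanced is identically zero, as stated in the text for alpha = 0.5.
```

For these narrowband inputs each parenthesized term equals −k₀v for the corresponding eye, so the response is spatially uniform and its sign tracks the sign of (1 − 2α).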
\n\n3 Results \n\nTo assess model performances we derived cells' responses to drifting sinusoidal grat(cid:173)\nings with different speeds in the left and right eye. The spatial frequency of the \ngratings has been chosen as central to the RF's bandwidth. For each layer, the \ntuning characteristics of the cells are analyzed as sensitivity maps in the (XL - XR) \nand (VL - VR) domains for the static and dynamic properties, respectively. The \n(XL - XR) represents the binocular RF [5] of a cell, evidencing its disparity tuning. \nThe (v L - v R) response represents the binocular tuning curve of the velocities along \nthe epipolar lines. To better evidence motion-in-depth sensitivity, we represent as \npolar plots, the responses of the model cells with respect to the interocular veloc(cid:173)\nities ratio for 12 different motion trajectories in depth (labeled 1 to 12) [10]. The \ncells of the cortical model exhibit properties and typical profiles similar to those \nobserved in the visual cortex [5] [10]. The middle two layers (see insets A and B \nin Fig. 2) exhibit a strong selectivity to static disparity, but no specific tuning to \nmotion-in-depth. On the contrary, the output cell C3 shows a narrow tuning to the \nZ direction of the object's motion, while lacking disparity tuning (see inset C in \nFig. 2). \n\nTo consider more biologically plausible RFs for the simple cells, we included a \ncoefficient f3 in the scheme used to obtain tilted RFs in the space-time domain (e.g. \nC + f3St). This coefficient takes into account the simple cell response to the non(cid:173)\npreferred direction. We analytically demonstrated (results not shown here) that \nthe resulting effect is a constant term that multiplies the cortical model output. \nIn this way, the model is based on more realistic simple cells without lacking its \nfunctionality, provided that the basic direction selective units maintain a significant \ndirection selective index. 
To analyze the effect of the architectural parameters on the model performance, we systematically varied the ocular dominance index α and introduced a weight γ representing the inhibition strength of the afferent signals to the complex cells in layer 2. The resulting direction-in-depth polar plots are shown in Fig. 3. The α parameter has a strong effect on the response profile: if α = 0.5 there is no direction-in-depth selectivity; depending on whether α > 0.5 or α < 0.5, cells exhibit a tuning to opposite directions in depth. As α approaches the boundary values 0 or 1, the binocular model reduces to a monocular one. A decrease of the inhibition strength γ yields cells characterized by a less selective response to direction-in-depth, whereas an increase of γ diminishes their response amplitude.

4 Discussion and conclusions

There are at least two binocular cues that can be used to determine the MID [1]: the binocular combination of monocular velocity signals, or the rate of change of retinal disparity. Assuming a phase-based disparity encoding scheme [6], we demonstrated that the information held in the interocular velocity difference is the same as