{"title": "Managing Uncertainty in Cue Combination", "book": "Advances in Neural Information Processing Systems", "page_first": 869, "page_last": 878, "abstract": null, "full_text": "Managing Uncertainty in Cue Combination \n\nZhiyong Yang \n\nRichard S. Zemel \n\nDeparbnent of Neurobiology, Box 3209 \n\nDuke University Medical Center \n\nDurham, NC 27710 \nzhyyang@duke.edu \n\nDeparbnent of Psychology \n\nUniversity of Arizona \n\nTucson, AZ 85721 \n\nzemel@u.arizona.edu \n\nAbstract \n\nWe develop a hierarchical generative model to study cue combi(cid:173)\nnation. The model maps a global shape parameter to local cue(cid:173)\nspecific parameters, which in tum generate an intensity image. \nInferring shape from images is achieved by inverting this model. \nInference produces a probability distribution at each level; using \ndistributions rather than a single value of underlying variables at \neach stage preserves information about the validity of each local \ncue for the given image. This allows the model, unlike standard \ncombination models, to adaptively weight each cue based on gen(cid:173)\neral cue reliability and specific image context. We describe the \nresults of a cue combination psychophysics experiment we con(cid:173)\nducted that allows a direct comparison with the model. The model \nprovides a good fit to our data and a natural account for some in(cid:173)\nteresting aspects of cue combination. \n\nUnderstanding cue combination is a fundamental step in developing computa(cid:173)\ntional models of visual perception, because many aspects of perception naturally \ninvolve multiple cues, such as binocular stereo, motion, texture, and shading. It is \noften formulated as a problem of inferring or estimating some relevant parameter, \ne.g., depth, shape, position, by combining estimates from individual cues. 
 \nAn important finding of psychophysical studies of cue combination is that cues vary in the degree to which they are used in different visual environments. Weights assigned to estimates derived from a particular cue seem to reflect its estimated reliability in the current scene and viewing conditions. For example, motion and stereo are weighted approximately equally at near distances, but motion is weighted more at far distances, presumably due to distance limits on binocular disparity [3]. Experiments have also found these weightings sensitive to image manipulations; if a cue is weakened, such as by adding noise, then the uncontaminated cue is utilized more in making depth judgments [9]. A recent study [2] has shown that observers can adjust the weighting they assign to a cue based on its relative utility for a particular task. From these and other experiments, we can identify two types of information that determine relative cue weightings: (1) cue reliability: its relative utility in the context of the task and general viewing conditions; and (2) region informativeness: cue information available locally in a given image. \n\nA central question in computational models of cue combination then concerns how these forms of uncertainty can be combined. We propose a hierarchical generative model. Generative models have a rich history in cue combination, as they underlie models of Bayesian perception that have been developed in this area [10, 1]. The novelty in the generative model proposed here lies in its hierarchical nature and use of distributions throughout, which allows for both context-dependent and image-specific uncertainty to be combined in a principled manner.
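The reliability-sensitive weighting described above is often modeled, as a baseline, by linear inverse-variance weighting of independent Gaussian cue estimates (this is the standard scheme the paper's distribution-based model generalizes, not the paper's own method; the numbers below are made up for illustration):

```python
import numpy as np

def combine_estimates(means, variances):
    """Reliability-weighted (inverse-variance) combination of independent
    Gaussian cue estimates: weight w_i is proportional to 1/variance_i."""
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    w = (1.0 / variances) / np.sum(1.0 / variances)
    combined_mean = np.sum(w * means)
    # Precisions add, so the combined variance never exceeds any single cue's.
    combined_var = 1.0 / np.sum(1.0 / variances)
    return combined_mean, combined_var, w

# Equally reliable cues are weighted equally ...
m, v, w = combine_estimates([1.0, 2.0], [0.1, 0.1])
# ... but degrading one cue (e.g., by adding noise) shifts weight
# toward the uncontaminated cue, as in the experiments cited above.
m2, v2, w2 = combine_estimates([1.0, 2.0], [0.4, 0.1])
```

This captures cue reliability but not region informativeness; a single weight per cue ignores how informative the cue is in each local image region.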
 \nOur aims in this paper are dual: to develop a combination model that incorporates cue reliability and region informativeness (estimated across and within images), and to use this model to account for data and provide predictions for psychophysical experiments. Another motivation for the approach here stems from our recent probabilistic framework [11], which posits that every step of processing entails the representation of an entire probability distribution, rather than just a single value of the relevant underlying variable(s). Here we use separate local probability distributions for each cue estimated directly from an image. Combination then entails transforming representations and integrating distributions across both space and cues, taking across- and within-image uncertainty into account. \n\n1 IMAGE GENERATION \n\nIn this paper we study the case of combining shading and texture. Standard shape-from-shading models exclude texture [1, 8], while standard shape-from-texture models exclude shading [7]. Experimental results and computational arguments have supported a strong interaction between these cues [10], but no model accounting for this interaction has yet been worked out. \n\nThe shape used in our experiments is a simple surface: \n\nZ = B(1 - x^2), |x| <= 1, |y| <= 1     (1) \n\nwhere Z is the height from the xy plane and B is the only shape parameter. \n\nOur image formation model is a hierarchical generative model (see Figure 1). The top layer contains the global parameter B. The second layer contains local shading and texture parameters S, T = {S_i, T_i}, where i indexes image regions. The generation of local cues from a global parameter is intended to allow local uncertainties to be introduced separately into the cues.
This models specific conditions in realistic images, such as shading uncertainty due to shadows or specularities, and texture uncertainty when prior assumptions such as isotropy are violated [4]. Here we introduce uncertainty by adding independent local noise to the underlying shape parameter; this manipulation is less realistic but easier to control. \n\nFigure 1: Left: The generative model of image formation, mapping the global shape B to local shading parameters {S} and local texture parameters {T}, which together generate the image I. Right: Two sample images generated by the image formation procedure. B = 1.4 in both. Left: σ_s = 0.05, σ_t = 0. Right: σ_s = 0, σ_t = 0.05. \n\nThe local cues are sampled from Gaussian distributions: P(S_i|B) = N(f(B); σ_s); P(T_i|B) = N(g(B); σ_t). f(B), g(B) describe how the local cue parameters depend on the shape parameter B, while σ_s and σ_t represent the degree of noise in each cue. In this paper, to simplify the generation process we set f(B) = g(B) = B. \n\nFrom {S_i} and {T_i}, two surfaces are generated; these are essentially two separate noisy local versions of B. The intensity image combines these surfaces. A set of same-intensity texels sampled from a uniform distribution are mapped onto the texture surface, and then projected onto the image plane under orthographic projection. The intensities of surface pixels not contained within these texels are generated from the shading surface using Lambertian shading. Each image is composed of 10 x 10 non-overlapping regions, and contains 400 x 400 pixels. Figure 1 shows two images generated by this procedure. \n\n2 COMBINATION MODEL \n\nWe create a combination, or recognition, model by inverting the generative model of Figure 1 to infer the shape parameter B from the image. An important aspect of the combination model is the use of distributions to represent parameter estimates at each stage.
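The cue-generation step of Section 1 can be sketched as follows (a minimal sketch assuming f(B) = g(B) = B as in the text; the rendering of texels and Lambertian shading into a 400 x 400 image is omitted, and the parameter values mirror Figure 1):

```python
import numpy as np

def generate_local_cues(B=1.4, sigma_s=0.05, sigma_t=0.0,
                        n_regions=(10, 10), seed=0):
    """Sample local shading and texture parameters S_i, T_i ~ N(B, sigma)
    independently for each of the 10 x 10 image regions."""
    rng = np.random.default_rng(seed)
    S = B + sigma_s * rng.standard_normal(n_regions)
    T = B + sigma_t * rng.standard_normal(n_regions)
    return S, T

def surface_height(B, x):
    """The underlying shape of Equation 1: Z = B * (1 - x^2), |x| <= 1."""
    return B * (1.0 - x ** 2)

# With sigma_t = 0, the texture surface is an exact copy of B,
# while the shading surface is a noisy local version of it.
S, T = generate_local_cues()
```

Sampling the two cue surfaces separately is what lets the model inject shading-only or texture-only noise, as in the two sample images of Figure 1.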
 This preserves uncertainty information at each level, and allows it to play a role in subsequent inference. \n\nThe overall goal of combination is to infer an estimate of B given some image I. We derive our main inference equation using a Bayesian integration over distributions: \n\nP(B|I) = ∫ P(B|S,T) P(S,T|I) dS dT     (2) \n\nP(S,T|I) ∝ ∏_i P(S_i|I) P(T_i|I)     (3) \n\nP(B|S,T) = P(B) P(S,T|B) / ∫ P(B) P(S,T|B) dB ∝ ∏_i P(S_i|B) P(T_i|B)     (4) \n\nTo simplify the two components we have assumed that the prior over B is uniform, and that the S, T are conditionally independent given B, and given the image. This third assumption is dubious but is not essential in the model, as discussed below. We now consider these two components in turn. \n\n2.1 Obtaining local cue-specific representations from an image \n\nOne component in the inference equation, P(S,T|I), describes local cue-dependent information in the particular image I. We first define intermediate representations S, T that are dependent on shading and texture cues, respectively. The shading representation is the curvature of a horizontal section: S = f(B) = 2B(1 + 4x^2 B^2)^(-3/2). The texture representation is the cosine of the surface slant: T = g(B) = (1 + 4x^2 B^2)^(-1/2). Note that these S, T variables do not match those used in the generative model; ideally we could have used these cue-dependent variables, but generating images from them proved difficult. \n\nSome image pre-processing must take place in order to estimate values and uncertainties for these particular local variables. The approach we adopt involves a simple statistical matching procedure, similar to k-nearest neighbors, applied to local image patches. After applying Gaussian smoothing and band-pass filtering to the image, two representations of each patch are obtained using separate shading and texture filters.
For shading, image patches are represented by forming a histogram of ∇I; for texture, the patch is represented by the mean and standard deviation of the amplitude of Gabor filter responses at 4 scales and orientations. This representation of a shading patch is then compared to a database of similar patch representations. Entries in the shading database are formed by first selecting a particular value of B and σ_s, generating an image patch, and applying the appropriate filters. Thus S = f(B) and the noise level σ_s are known for each entry, allowing an estimate of these variables for the new patch to be formed as a linear combination of the entries with similar representations. An analogous procedure, utilizing a separate database, allows T and an uncertainty estimate to be derived for texture. Both databases have 60 different (B, σ) pairs, and 10 samples of each pair. \n\nBased on this procedure we obtain for each image patch mean values M_i^s, M_i^t and uncertainty values V_i^s, V_i^t for S_i, T_i. These determine P(I|S_i), P(I|T_i), which are approximated as Gaussians. Taking into account the Gaussian priors for S_i, T_i, \n\nP(S_i|I) ∝ P(I|S_i) P(S_i) ∝ exp(-(S - M_i^s)^2 / (2 V_i^s)) exp(-(S - M_0^s)^2 / (2 V_0^s))     (5) \n\nP(T_i|I) ∝ P(I|T_i) P(T_i) ∝ exp(-(T - M_i^t)^2 / (2 V_i^t)) exp(-(T - M_0^t)^2 / (2 V_0^t))     (6) \n\nNote that the independence assumption of Equation 3 is not necessary, as the matching procedure could use a single database indexed by both the shading and texture representations of a patch.
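The products in Equations 5 and 6 are each a product of two Gaussians (patch likelihood times prior), which is itself Gaussian with the precisions added. A small sketch of that closed form (the numeric values are hypothetical, standing in for a patch estimate M_i, V_i and a prior M_0, V_0):

```python
def gaussian_product(m1, v1, m2, v2):
    """Product of two Gaussian densities N(m1, v1) * N(m2, v2) is, up to
    normalization, Gaussian: precisions add, and the mean is the
    precision-weighted average of the two means."""
    v = 1.0 / (1.0 / v1 + 1.0 / v2)
    m = v * (m1 / v1 + m2 / v2)
    return m, v

# Patch likelihood says S is near 1.2 with variance 0.04;
# the prior centers S at 1.0 with a broader variance of 0.25.
m, v = gaussian_product(1.2, 0.04, 1.0, 0.25)
# The posterior mean sits close to the sharper (more informative) factor,
# and its variance is smaller than either factor's.
```

An uninformative patch (large V_i) leaves the prior nearly unchanged, which is exactly how region informativeness enters the model.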
 \n\n2.2 Transforming and combining cue-specific local representations \n\nThe other component of the inference equation describes the relationship between the intermediate, cue-specific representations S, T and the shape parameter B: \n\nP(S|B) ∝ exp(-(S - f(B))^2 / (2 V_b^s)) ;  P(T|B) ∝ exp(-(T - g(B))^2 / (2 V_b^t))     (7) \n\nThe two parameters V_b^s, V_b^t in this equation describe the uncertainty in the relationship between the intermediate parameters S, T and B; they are invariant across space. These two, along with the parameters of the priors (M_0^s, M_0^t, V_0^s, V_0^t), are the free parameters of this model. Note that this combination model neatly accounts for both types of cue validity we identified: the variance in P(S|B) describes the general uncertainty of a given cue, while the local variance in P(S_i|I) describes the image-specific uncertainty of the cue. \n\nCombining Equations 3-7, and completing the integral in Equation 2, we have: \n\nP(B|I) ∝ exp[ -(1/2) ∑_i ( a_s f(B)^2 + a_t g(B)^2 - 2 b_i^s f(B) - 2 b_i^t g(B) ) ]     (8) \n\nwhere the coefficients a_s, a_t, b_i^s, b_i^t collect the inverse-variance and mean terms from Equations 5-7. Thus our model infers from any image a mean U and variance Σ^2 for B as nonlinear combinations of the cue estimates, taking into account the various forms of uncertainty. \n\n3 A CUE COMBINATION PSYCHOPHYSICS EXPERIMENT \n\nWe have conducted psychophysical experiments using stimuli generated by the procedure described above. In each experimental trial, a stimulus image and four views of a mesh surface are displayed side-by-side on a computer screen. The subject's task is to manipulate the curvature of the mesh to match the stimulus. The final shape of the mesh surface describes the subject's estimate of the shape parameter B on that trial. The subject's variance is computed across repeated trials with an identical stimulus.
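The inference of Equations 2-8 can be approximated numerically on a grid of B values. The sketch below uses the f, g of Section 2.1, a uniform prior over B, and a single shared variance per cue in place of the patch-estimated V_i^s, V_i^t and priors of Equations 5-6 (a deliberate simplification); the observations and noise levels are hypothetical:

```python
import numpy as np

def f(B, x):
    """Shading representation: curvature of a horizontal section."""
    return 2 * B / (1 + 4 * x ** 2 * B ** 2) ** 1.5

def g(B, x):
    """Texture representation: cosine of the surface slant."""
    return 1.0 / np.sqrt(1 + 4 * x ** 2 * B ** 2)

def posterior_over_B(S_obs, T_obs, xs, Vs, Vt, B_grid):
    """Grid approximation to P(B|I) under a uniform prior:
    P(B|I) is proportional to the product over regions of the
    Gaussian terms P(S_i|B) P(T_i|B) of Equation 7."""
    log_p = np.zeros_like(B_grid)
    for s, t, x in zip(S_obs, T_obs, xs):
        log_p += -(s - f(B_grid, x)) ** 2 / (2 * Vs)
        log_p += -(t - g(B_grid, x)) ** 2 / (2 * Vt)
    p = np.exp(log_p - log_p.max())
    p /= p.sum()
    mean = np.sum(B_grid * p)               # the inferred mean U
    var = np.sum((B_grid - mean) ** 2 * p)  # the inferred variance
    return p, mean, var

# Hypothetical noiseless observations generated at B = 1.4:
B_true = 1.4
xs = np.linspace(-0.9, 0.9, 10)
S_obs, T_obs = f(B_true, xs), g(B_true, xs)
B_grid = np.linspace(0.5, 2.5, 401)
p, U, var = posterior_over_B(S_obs, T_obs, xs, Vs=0.05, Vt=0.05,
                             B_grid=B_grid)
```

With noiseless observations the posterior concentrates near the true B; injecting noise into S_obs or T_obs shifts and broadens it, which is the behavior compared against the psychophysics below.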
 In a given block of trials, the stimulus may contain only shading information (no texture elements), only texture information (uniform shading), or both. The local cue noise (σ_s, σ_t) is zero in some blocks, non-zero in others. The primary experimental findings (see Figure 2) are: \n\n• Shape from shading alone produces underestimates of B. Shape from texture alone also leads to underestimation, but to a lesser degree. \n\n• Shape from both cues leads to almost perfect estimation, with smaller variance than shape from either cue alone. Thus cue enhancement (more accurate and robust judgements for stimuli containing multiple cues than just individual cues) applies to this paradigm. \n\n• The variance of a subject's estimation increases with B. \n\n• Noise in either shading or texture systematically biases the estimation from the true values: the greater the noise level, the greater the bias. \n\n• Shape from both cues is more robust against noise than shape from either cue alone, providing evidence of another form of cue enhancement. \n\n[Figure 2: subjects' estimates of B under the different cue and noise conditions]
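The cue-enhancement findings are what pooling independent Gaussian evidence predicts. Under the generative-side simplification f(B) = g(B) = B, posterior precisions simply add across regions and cues, so the two-cue posterior variance is always below either single-cue variance (a toy illustration; the variances and region count below are made up):

```python
def posterior_variance(V_cue, n_regions):
    """Posterior variance of B from one cue, given n independent Gaussian
    local estimates of variance V_cue (flat prior, f(B) = g(B) = B)."""
    return V_cue / n_regions

def posterior_variance_both(Vs, Vt, n_regions):
    """With both cues, precisions add across regions and across cues."""
    return 1.0 / (n_regions / Vs + n_regions / Vt)

n = 100  # a 10 x 10 grid of regions, as in the stimuli
v_shading = posterior_variance(0.05, n)
v_texture = posterior_variance(0.08, n)
v_both = posterior_variance_both(0.05, 0.08, n)
# v_both is smaller than either single-cue variance: cue enhancement.
```

The full model makes the sharper prediction that the two-cue estimate is also more robust to noise in either cue, since a noisier cue contributes less precision rather than a fixed weight.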