{"title": "Probabilistic Image Sensor Fusion", "book": "Advances in Neural Information Processing Systems", "page_first": 824, "page_last": 830, "abstract": null, "full_text": "Probabilistic Image Sensor Fusion \n\nRavi K. Sharma1 , Todd K. Leen 2 and Misha Pavel1 \n\n1 Department of Electrical and Computer Engineering \n2Department of Computer Science and Engineering \nOregon Graduate Institute of Science and Technology \n\nP.O. Box 91000 , Portland , OR 97291-1000 \n\nEmail: {ravi,pavel} @ece.ogi.edu, tleen@cse.ogi.edu \n\nAbstract \n\nWe present a probabilistic method for fusion of images produced \nby multiple sensors. The approach is based on an image formation \nmodel in which the sensor images are noisy, locally linear functions \nof an underlying, true scene. A Bayesian framework then provides \nfor maximum likelihood or maximum a posteriori estimates of the \ntrue scene from the sensor images. Maximum likelihood estimates \nof the parameters of the image formation model involve (local) \nsecond order image statistics, and thus are related to local principal \ncomponent analysis. We demonstrate the efficacy of the method \non images from visible-band and infrared sensors. \n\n1 \n\nIntroduction \n\nAdvances in sensing devices have fueled the deployment of multiple sensors in several \ncomputational vision systems [1, for example]. Using multiple sensors can increase \nreliability with respect to single sensor systems. This work was motivated by a \nneed for an aircraft autonomous landing guidance (ALG) system [2, 3] that uses \nvisible-band, infrared (IR) and radar-based imaging sensors to provide guidance \nto pilots for landing aircraft in low visibility. IR is suitable for night operation, \nwhereas radar can penetrate fog. The application requires fusion algorithms [4] to \ncombine the different sensor images . \n\nImages from different sensors have different characteristics arising from the varied \nphysical imaging processes. 
Local contrast may be polarity reversed between visible-band \nand IR images [5, 6]. A particular sensor image may contain local features \nnot found in another sensor image, i.e., sensors may report complementary features. \nFinally, individual sensors are subject to noise. Fig. 1(a) and 1(b) are visible-band \nand IR images respectively, of a runway scene showing polarity reversed (rectangle) \nand complementary (circle) features. These effects pose difficulties for fusion. \n\nAn obvious approach to fusion is to average the pixel intensities from different \nsensors. Averaging, Fig. 1(c), increases the signal to noise ratio, but reduces the \ncontrast where there are polarity reversed or complementary features [7]. \n\nTransform-based fusion methods [8, 5, 9] select from one sensor or another for fusion. \nThey consist of three steps: (i) decompose the sensor images using a specified \ntransform, e.g., a multiresolution Laplacian pyramid, (ii) fuse at each level of the \npyramid by selecting the highest energy transform coefficient, and (iii) invert the \ntransform to synthesize the fused image. Since features are selected rather than \naveraged, they are rendered at full contrast, but the methods are sensitive to sensor \nnoise; see Fig. 1(d). \n\nTo overcome the limitations of averaging or selection methods, and to put sensor fusion \non firm theoretical grounds, we explicitly model the production of sensor images \nfrom the true scene, including the effects of sensor noise. From the model and the \nsensor images, one can ask: What is the most probable true scene? This forms \nthe basis for fusing the sensor images. Our technique uses the Laplacian pyramid \nrepresentation [5], with step (ii) above replaced by our probabilistic fusion. A \nsimilar probabilistic framework for sensor fusion is discussed in [10]. 
\n\n2 The Image Formation Model \nThe true scene, denoted s, gives rise to a sensor image through a noisy, non-linear \ntransformation. For ALG, s would be an image of the landing scene under conditions \nof uniform lighting, unlimited visibility, and perfect sensors. We model the map from \nthe true scene to a sensor image by a noisy, locally affine transformation whose \nparameters are allowed to vary across the image (actually across the Laplacian \npyramid) \n\na_i(r, t) = β_i(r, t) s(r, t) + α_i(r, t) + ε_i(r, t)    (1) \n\nwhere s is the true scene, a_i is the ith sensor image, r ≡ (x, y, k) is the hyperpixel \nlocation, with x, y the pixel coordinates and k the level of the pyramid, t is the \ntime, α is the sensor offset, β is the sensor gain (which includes the effects of local \npolarity reversals and complementarity), and ε is the (zero-mean) sensor noise. To \nsimplify notation, we adopt the matrix form \n\na = β s + α + ε    (2) \n\nwhere a = [a_1, a_2, ..., a_q]^T, β = [β_1, β_2, ..., β_q]^T, α = [α_1, α_2, ..., α_q]^T, s is a \nscalar and ε = [ε_1, ε_2, ..., ε_q]^T, and we have dropped the reference to location and \ntime. \nSince the image formation parameters β, α, and the sensor noise covariance Σ_ε can \nvary from hyperpixel to hyperpixel, the model can express local polarity reversals, \ncomplementary features, spatial variation of sensor gain, and noise. \n\nWe do assume, however, that the image formation parameters and sensor noise \ndistribution vary slowly with location 1. Hence, a particular set of parameters is \nconsidered to hold true over a spatial region of several square hyperpixels. We will \nuse this assumption implicitly when we estimate these parameters from data. \n\nThe model (2) fits the framework of the factor analysis model in statistics [11, 12]. 
Here the hyperpixel values of the true scene s are the latent variables or \ncommon factors, β contains the factor loadings, and the sensor noise ε values are \nthe independent factors. Estimation of the true scene is equivalent to estimating \nthe common factors from the observations a. \n\n1 Specifically, the parameters vary slowly on the spatio-temporal scales over which the \ntrue scene s may exhibit large variations. \n\n3 Bayesian Fusion \nGiven the sensor intensities a, we will estimate the true scene s by appeal to a \nBayesian framework. We assume that the probability density function of the latent \nvariable s is a Gaussian with local mean s_0(r, t) and local variance σ_s²(r, t). An \nattractive benefit of this setup is that the prior mean s_0 might be obtained from \nknowledge in the form of maps, or clear-weather images of the scene. Thus, such \ndatabase information can be folded into the sensor fusion in a natural way. \n\nThe density on the sensor images conditioned on the true scene, P(a|s), is normal \nwith mean β s + α and covariance Σ_ε = diag[σ_ε1², σ_ε2², ..., σ_εq²]. The marginal density \nP(a) is normal with mean μ_a = β s_0 + α and covariance \n\nC = Σ_ε + σ_s² β β^T    (3) \n\nFinally, the posterior density on s, given the sensor data a, P(s|a), is also normal, \nwith mean M^{-1} (β^T Σ_ε^{-1} (a − α) + s_0/σ_s²) and covariance M^{-1} = (β^T Σ_ε^{-1} β + 1/σ_s²)^{-1}. \nGiven these densities, there are two obvious candidates for probabilistic fusion: \nmaximum likelihood (ML), ŝ = argmax_s P(a|s), and maximum a posteriori (MAP), \nŝ = argmax_s P(s|a). \n\nThe MAP fusion estimate is simply the posterior mean \n\nŝ = [β^T Σ_ε^{-1} β + 1/σ_s²]^{-1} (β^T Σ_ε^{-1} (a − α) + s_0/σ_s²)    (4) \n\nor, equivalently, \n\nŝ = s_0 + σ_s² β^T C^{-1} (a − α − β s_0)    (5) \n\nTo obtain the ML fusion estimate we take the limit σ_s² → ∞ in either (4) or (5). 
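As a concrete illustration (ours, not part of the original paper), the MAP and ML fusion rules for a single hyperpixel can be sketched as below, assuming the diagonal sensor-noise covariance Σ_ε = diag[σ_ε1², ..., σ_εq²] of the model; the function names are ours.

```python
import numpy as np

def map_fuse(a, beta, alpha, noise_var, s0, prior_var):
    """MAP estimate of the true scene s at one hyperpixel.

    a, beta, alpha, noise_var: length-q arrays (one entry per sensor).
    s0, prior_var: scalar prior mean s_0 and variance sigma_s^2 of s.
    Implements s_hat = M^{-1} (beta^T Sigma_eps^{-1} (a - alpha) + s0/sigma_s^2)
    with M = beta^T Sigma_eps^{-1} beta + 1/sigma_s^2, for diagonal Sigma_eps.
    """
    a, beta, alpha, noise_var = map(np.asarray, (a, beta, alpha, noise_var))
    w = beta / noise_var                      # beta^T Sigma_eps^{-1} (diagonal noise)
    M = w @ beta + 1.0 / prior_var
    return (w @ (a - alpha) + s0 / prior_var) / M

def ml_fuse(a, beta, alpha, noise_var):
    """ML estimate: the prior_var -> infinity limit of map_fuse."""
    a, beta, alpha, noise_var = map(np.asarray, (a, beta, alpha, noise_var))
    w = beta / noise_var
    return (w @ (a - alpha)) / (w @ beta)
```

With noiseless data from two sensors, one of them polarity reversed (β = [1, −1]), `ml_fuse` recovers s exactly; `map_fuse` with a finite prior variance shrinks the estimate toward s_0, as the text describes.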
\nFor both ML and MAP, the fused image ŝ is a locally linear combination of the sensor \nimages that can, through the spatio-temporal variations in β, α, and Σ_ε, properly \nrespond to changes in the sensor characteristics that tax averaging or selection \nschemes. For example, if the second sensor has a polarity reversal relative to the \nfirst, then β_2 is negative and the two sensor contributions are properly subtracted. \nIf the first sensor has high noise (large σ_ε1²), its contribution to the fused image is \nattenuated. Finally, a feature missing from sensor 1 corresponds to β_1 = 0. The \nmodel compensates by accentuating the contribution from sensor 2. \n\n4 Model Parameter Estimates \nWe need to estimate the local image formation model parameters α(r, t), β(r, t) and \nthe local sensor noise covariance Σ_ε(r, t). We estimate the latter from successive, \nmotion compensated video frames from each sensor. First we estimate the average \nvalue at each hyperpixel, ⟨a_i(t)⟩, and the average square, ⟨a_i²(t)⟩, by exponential \nmoving averages. We next estimate the noise variance by the difference σ_εi²(t) = \n⟨a_i²(t)⟩ − ⟨a_i(t)⟩². \nTo estimate β and α, we assume that β, α, Σ_ε, s_0 and σ_s² are nearly constant \nover small spatial regions (5 x 5 blocks) surrounding the hyperpixel for which the \nparameters are desired. Essentially we are invoking a spatial analog of ergodicity, \nwhere ensemble averages are replaced by spatial averages, carried out locally over \nregions in which the statistics are approximately constant. \n\nTo form a maximum likelihood (ML) estimate of α, we extremize the data log-likelihood \nL = Σ_{n=1}^{N} log[P(a_n)] with respect to α to obtain \n\nα_ML = μ_a − β s_0    (6) \n\nwhere μ_a is the data mean, computed over a 5 x 5 hyperpixel local region (N = 25 \npoints). 
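The exponential-moving-average noise estimate described at the start of this section can be sketched as follows (our illustration; the smoothing constant `rho` is an assumed value, not specified in the text):

```python
def update_noise_stats(frame, mean, mean_sq, rho=0.1):
    """One exponential-moving-average update of per-hyperpixel noise statistics
    from a motion-compensated sensor frame.

    frame, mean, mean_sq: hyperpixel value(s) and their running first and
    second moments; rho is the (assumed) EMA smoothing constant.
    Returns updated (mean, mean_sq, noise_var), where
    noise_var = <a_i^2> - <a_i>^2.
    """
    mean = (1.0 - rho) * mean + rho * frame
    mean_sq = (1.0 - rho) * mean_sq + rho * frame**2
    noise_var = mean_sq - mean**2
    return mean, mean_sq, noise_var
```

A constant (noise-free) signal drives the variance estimate to zero, while frame-to-frame fluctuations about a steady mean yield a positive variance of the right scale.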
\nTo obtain a ML estimate of β, we set the derivatives of L with respect to β equal \nto zero and recover \n\n(C − Σ_a) C^{-1} β = 0    (7) \n\nwhere Σ_a is the data covariance matrix, also computed over a 5 x 5 hyperpixel local \nregion. The only non-trivial solution to (7) is \n\nβ_ML = r Σ_ε^{1/2} Ũ √(λ̃ − 1) / σ_s    (8) \n\nwhere Ũ, λ̃ are the principal eigenvector and eigenvalue of the weighted data covariance \nmatrix, Σ̃_a ≡ Σ_ε^{-1/2} Σ_a Σ_ε^{-1/2}, and r = ±1. \nAn alternative to maximum likelihood estimation is the least squares (LS) approach [11]. \nWe obtain the LS estimate α_LS by minimizing \n\nE_α = ||μ_a − β s_0 − α||²    (9) \n\nwith respect to α. This gives \n\nα_LS = μ_a − β s_0    (10) \n\nThe least squares estimate β_LS is obtained by minimizing \n\nE_β = ||Σ_a − C||²    (11) \n\nwith respect to β. The solution to this minimization is \n\nβ_LS = r U √λ / σ_s    (12) \n\nwhere U, λ are the principal eigenvector and eigenvalue of the noise-corrected covariance \nmatrix (Σ_a − Σ_ε), and r = ±1. 2 \n\n2 The least squares and maximum likelihood solutions are identical when the model is \nexact, Σ_a = C, i.e. the observed data covariance is exactly of the form dictated by the \nmodel. Under this condition, Ũ = (U^T Σ_ε^{-1} U)^{-1/2} Σ_ε^{-1/2} U and (λ̃ − 1) = λ (U^T Σ_ε^{-1} U). \nThe LS and ML solutions are also identical when the noise covariance is homoscedastic, \nΣ_ε = σ_ε² I, even if the model is not exact. \n\nThe estimation procedures cannot provide values of the priors σ_s² and s_0. Were we \ndealing with a single global model, this would pose no problem. But we must impose \na constraint in order to smoothly piece together our local models. We impose that \n||β|| = 1 everywhere, or by (12) σ_s² = λ. Recall that λ is the leading eigenvalue of \nΣ_a − Σ_ε, and thus captures the scale of variations in a that arise from variations in \ns. Thus we would expect λ ∝ σ_s². Our constraint ensures that the proportionality \nconstant be the same for each local model. Next, note that changing s_0 causes a shift \nin s. To maintain consistency between local regions, we take s_0 = 0 everywhere. \nThese choices for σ_s² and s_0 constrain the parameter estimates to \n\nβ_LS = r U and α_LS = μ_a    (13) \n\nIn (5), σ_s² and s_0 are defined at each hyperpixel. However, to estimate β and α, \nwe used spatial averages to compute the sample mean and covariance. This is \nsomewhat inconsistent, since the spatial variation of s_0 (e.g. when there are edges \nin the scene) is not explicitly captured in the model mean and covariance. These \nvariations are, instead, attributed to σ_s², resulting in overestimation of the latter. \nA more complete model would explicitly model the spatial variations of s_0, though \nwe expect this will produce only minor changes in the results. \n\nFinally, the sign parameter r is not specified. In order to properly piece together \nour local models, we must choose r at each hyperpixel in such a way that β changes \ndirection slowly as we move from hyperpixel to hyperpixel and encounter changes \nin the local image statistics. That is, large direction changes due to arbitrary sign \nreversals are not allowed. We use a simple heuristic to accomplish this. \n\n5 Relation to PCA \nThe MAP and ML fusion rules are closely related to PCA. To see this, assume that \nthe noise is homoscedastic, Σ_ε = σ_ε² I, and use the parameter estimates (13) in the \nMAP fusion rule (5), reducing the latter to \n\nŝ = [1/(1 + σ_ε²/σ_s²)] V_a^T (a − μ_a) + [1/(1 + σ_s²/σ_ε²)] s_0    (14) \n\nwhere V_a is the principal eigenvector of the data covariance matrix Σ_a. The MAP \nestimate ŝ is simply a scaled and shifted local PCA projection of the sensor data. \nBoth the scaling and shift arise because the prior distribution on s tends to bias ŝ \ntowards s_0. 
When the prior is flat (σ_s² → ∞, or equivalently when using the ML \nfusion estimate), or when the noise variance vanishes, the fused image is given by a \nsimple local PCA projection \n\nŝ = V_a^T (a − μ_a)    (15) \n\n6 Experiments and Results \nWe applied our fusion method to visible-band and IR runway images, Fig. 1, containing \nadditive Gaussian noise. Fig. 1(e) shows the result of ML fusion with β \nand α estimated using (13). ML fusion performs better than either averaging or \nselection in regions that contain local polarity reversals or complementary features. \nML fusion gives higher weight to IR in regions where the features in the two images \nare common, thus reducing the effects of noise in the visible-band image. ML \nfusion gives higher weight to the appropriate sensor in regions with complementary \nfeatures. \nFig. 1(f) shows the result of MAP fusion (5) with the priors σ_s² and s_0 those dictated \nby the consistency requirements discussed in Section 4. Clearly, the MAP image is \nless noisy than the ML image. In regions of low sensor image contrast, σ_s² is low \n(since λ is low), thus the contribution from the sensor images is attenuated compared \nto the ML fusion rule. Hence the noise is attenuated. In regions containing features \nsuch as edges, σ_s² is high (since λ is high); hence the contribution from the sensor \nimages is similar to that in ML fusion. \n\nFigure 1: Fusion of visible-band and IR images containing additive Gaussian noise. \nPanels: (a) visible-band image, (b) IR image, (c) averaging, (d) selection, (e) ML, (f) MAP. \n\nIn Fig. 2 we demonstrate the use of a database image for fusion. Fig. 2(a) and 2(b) \nare simulated noisy sensor images from visible-band and IR, that depict a runway \nwith an aircraft on it. Fig. 2(c) is an image of the same scene as might be obtained \nfrom a terrain database. 
Although this image is clean, it does not show the actual \nsituation on the runway. One can use the database image pixel intensities as the \nprior mean s_0 in the MAP fusion rule (5). The prior variance σ_s² in (5) can be \nregarded as a measure of confidence in the database image; its value controls the \nrelative contribution of the sensors vs. the database image in the fused image. (The \nparameters β and α, and the sensor noise covariance Σ_ε, were estimated exactly \nas before.) Fig. 2(d), 2(e) and 2(f) show the MAP-fused image as a function of \nincreasing σ_s². Higher values of σ_s² accentuate the contribution of the sensor images, \nwhereas lower values of σ_s² accentuate the contribution of the database. \n\n7 Discussion \n\nWe presented a model-based probabilistic framework for fusion of images from multiple \nsensors and exercised the approach on visible-band and IR images. The approach \nprovides both a rigorous framework for PCA-like fusion rules, and a principled way \nto combine information from a terrain database with sensor images. \n\nWe envision several refinements of the approach given here. Writing new image \nformation models at each hyperpixel produces an overabundance of models. Early \nexperiments show that this can be relaxed by using the same model parameters over \nregions of several square hyperpixels, rather than recalculating for each hyperpixel. \nA further refinement could be provided by adopting a mixture of linear models to \nbuild up the non-linear image formation model. Finally, we have used multiple \nframes from a video sequence to obtain ML and MAP fused sequences, and one \nshould be able to produce superior parameter estimates by suitable use of the video \nsequence. \n\nFigure 2: Fusion of simulated visible-band and IR images using a database image. \nPanels: (a) visible-band image, (b) IR image, (c) database image. \n\nAcknowledgments - This work was supported by NASA Ames Research Center \ngrant NCC2-S11. TKL was partially supported by NSF grant ECS-9704094. \n\nReferences \n[1] L. A. Klein. Sensor and Data Fusion Concepts and Applications. SPIE, 1993. \n[2] J. R. Kerr, D. P. Pond, and S. Inman. Infrared-optical multisensor for autonomous \nlanding guidance. Proceedings of SPIE, 2463:38-45, 1995. \n[3] B. Roberts and P. Symosek. Image processing for flight crew situation awareness. \nProceedings of SPIE, 2220:246-255, 1994. \n[4] M. Pavel and R. K. Sharma. Model-based sensor fusion for aviation. In J. G. Verly, \neditor, Enhanced and Synthetic Vision 1997, volume 3088, pages 169-176. SPIE, 1997. \n[5] P. J. Burt and R. J. Kolczynski. Enhanced image capture through fusion. In Fourth \nInt. Conf. on Computer Vision, pages 173-182. IEEE Comput. Soc., 1993. \n[6] H. Li and Y. Zhou. Automatic visual/IR image registration. Optical Engineering, \n35(2):391-400, 1996. \n[7] M. Pavel, J. Larimer, and A. Ahumada. Sensor fusion for synthetic vision. In Proceedings \nof the Society for Information Display, pages 475-478. SPIE, 1992. \n[8] P. Burt. A gradient pyramid basis for pattern-selective image fusion. In Proceedings \nof the Society for Information Display, pages 467-470. SPIE, 1992. \n[9] A. Toet. Hierarchical image fusion. Machine Vision and Applications, 3:1-11, 1990. \n[10] J. J. Clark and A. L. Yuille. Data Fusion for Sensory Information Processing Systems. \nKluwer, Boston, 1990. \n[11] A. Basilevsky. Statistical Factor Analysis and Related Methods. Wiley, 1994. \n[12] M. E. Tipping and C. M. Bishop. Probabilistic principal component analysis. Technical \nReport NCRG/97/010, Neural Computing Research Group, Aston University, \nUK, 1997. 
\n\n\f", "award": [], "sourceid": 1516, "authors": [{"given_name": "Ravi", "family_name": "Sharma", "institution": null}, {"given_name": "Todd", "family_name": "Leen", "institution": null}, {"given_name": "Misha", "family_name": "Pavel", "institution": null}]}