{"title": "Learning to Estimate Scenes from Images", "book": "Advances in Neural Information Processing Systems", "page_first": 775, "page_last": 781, "abstract": null, "full_text": "Learning to estimate scenes from images \n\nWilliam T. Freeman and Egon C. Pasztor \nMERL, Mitsubishi Electric Research Laboratory \n\n201 Broadway; Cambridge, MA 02139 \n\nfreeman@merl.com, pasztor@merl.com \n\nAbstract \n\nWe seek the scene interpretation that best explains image data. \nFor example, we may want to infer the projected velocities (scene) \nwhich best explain two consecutive image frames (image). From \nsynthetic data , we model the relationship between image and scene \npatches, and between a scene patch and neighboring scene patches. \nGiven' a new image, we propagate likelihoods in a Markov network \n(ignoring the effect of loops) to infer the underlying scene. This \nyields an efficient method to form low-level scene interpretations. \nWe demonstrate the technique for motion analysis and estimating \nhigh resolution images from low-resolution ones. \n\n1 \n\nIntroduction \n\nThere has been recent interest in studying the statistical properties of the visual \nworld. Olshausen and Field [23J and Bell and Sejnowski [2J have derived VI-like \nreceptive fields from ensembles of images; Simon celli and Schwartz [30J account for \ncontrast normalization effects by redundancy reduction. Li and Atick [1 J explain \nretinal color coding by information processing arguments. Various research groups \nhave developed realistic texture synthesis methods by studying the response statis(cid:173)\ntics of VI-like multi-scale, oriented receptive fields [12, 7, 33, 29J. These methods \nhelp us understand the early stages of image representation and processing in the \nbrain. \n\nUnfortunately, they don't address how a visual system might interpret images, i.e., \nestimate the underlying scene. 
In this work, we study the statistical properties of a labelled visual world, images together with scenes, in order to infer scenes from images. The image data might be single or multiple frames; the scene quantities to be estimated could be projected object velocities, surface shapes, reflectance patterns, or colors. \n\nWe ask: can a visual system correctly interpret a visual scene if it models (1) the probability that any local scene patch generated the local image, and (2) the probability that any local scene is the neighbor to any other? The first probabilities allow making scene estimates from local image data, and the second allow these local estimates to propagate. This leads to a Bayesian method for low-level vision problems, constrained by Markov assumptions. We describe this method, and show it working for two low-level vision problems. \n\n2 Markov networks for scene estimation \n\nFirst, we synthetically generate images and their underlying scene representations, using computer graphics. The synthetic world should typify the visual world in which the algorithm will operate. \n\nFor example, for the motion estimation problem of Sect. 3, our training images were irregularly shaped blobs, which could occlude each other, moving in randomized directions at speeds up to 2 pixels per frame. The contrast values of the blobs and the background were randomized. The image data were the concatenated image intensities from two successive frames of an image sequence. The scene data were the velocities of the visible objects at each pixel in the two frames. \n\nSecond, we place the image and scene data in a Markov network [24]. We break the images and scenes into localized patches, where image patches connect with underlying scene patches; scene patches also connect with neighboring scene patches. 
The neighbor relationship can be with regard to position, scale, orientation, etc. \n\nFor the motion problem, we represented both the images and the velocities in 4-level Gaussian pyramids [6], to efficiently communicate across space. Each scene patch then additionally connects with the patches at neighboring resolution levels. Figure 2 shows the multiresolution representation (at one time frame) for images and scenes (footnote 1). \n\nThird, we propagate probabilities. Weiss showed the advantage of belief propagation over regularization methods for several 1-d problems [31]; we apply related methods to our 2-d problems. Let the ith and jth image and scene patches be yi and xj, respectively. For the MAP estimate [3] of the scene data (footnote 2), we want to find argmax_{x1,x2,...,xN} P(x1, x2, ..., xN | y1, y2, ..., yM), where N and M are the number of scene and image patches. Because the joint probability is simpler to compute, we find, equivalently, argmax_{x1,x2,...,xN} P(x1, x2, ..., xN, y1, y2, ..., yM). \n\nThe conditional independence assumptions of the Markov network let us factorize the desired joint probability into quantities involving only local measurements and calculations [24, 32]. Consider the two-patch system of Fig. 1. We can factorize P(x1, x2, y1, y2) in three steps: (1) P(x1, x2, y1, y2) = P(x2, y1, y2 | x1) P(x1) (by elementary probability); (2) P(x2, y1, y2 | x1) = P(y1 | x1) P(x2, y2 | x1) (by conditional independence); (3) P(x2, y2 | x1) = P(x2 | x1) P(y2 | x2) (by elementary probability and the Markov assumption). \n\nFootnote 1: To maintain the desired conditional independence relationships, we appended the image data to the scenes. This provided the scene elements with image contrast information, which they would otherwise lack. \nFootnote 2: Related arguments follow for the MMSE or other estimators. 
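The three factorization steps above are what let a global MAP search be replaced by local computations. A minimal numeric sketch for a two-patch network, using small random tables as illustrative stand-ins for the learned probabilities (the names `prior`, `trans`, `lik1`, `lik2` are ours, not the paper's), confirms that maximizing locally recovers the same x1 as brute force:

```python
import itertools
import random

random.seed(0)
S = 4  # number of candidate states per scene node

def normalize(v):
    t = sum(v)
    return [a / t for a in v]

# Toy stand-ins for the learned quantities: P(x1), P(x2|x1), and the
# likelihoods P(y1|x1), P(y2|x2) evaluated at the observed image patches.
prior = normalize([random.random() for _ in range(S)])
trans = [normalize([random.random() for _ in range(S)]) for _ in range(S)]
lik1 = [random.random() for _ in range(S)]
lik2 = [random.random() for _ in range(S)]

def joint(x1, x2):
    # P(x1, x2, y1, y2) = P(x1) P(y1|x1) P(x2|x1) P(y2|x2), per steps (1)-(3)
    return prior[x1] * lik1[x1] * trans[x1][x2] * lik2[x2]

# Global MAP estimate of x1 by brute force over all (x1, x2) pairs.
best_pair = max(itertools.product(range(S), repeat=2), key=lambda p: joint(*p))
global_x1 = best_pair[0]

# Local computation: node 2 sends the message max_{x2} P(x2|x1) P(y2|x2),
# and node 1 maximizes P(x1) P(y1|x1) times that incoming message.
msg = [max(trans[x1][x2] * lik2[x2] for x2 in range(S)) for x1 in range(S)]
local_x1 = max(range(S), key=lambda x1: prior[x1] * lik1[x1] * msg[x1])

assert global_x1 == local_x1
```

The agreement holds for any tables with a unique maximizer, which is why the same message-passing pattern extends to any loop-free network.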
To estimate just x1 at node 1, the argmax over x2 becomes max_{x2}, and then slides over constants, giving terms involving only local computations at each node: \n\nargmax_{x1} max_{x2} P(x1, x2, y1, y2) = argmax_{x1} [P(x1) P(y1|x1) max_{x2} [P(x2|x1) P(y2|x2)]].   (1) \n\nThis factorization generalizes to any network structure without loops. We use a different factorization at each scene node: we turn the initial joint probability into a conditional by factoring out that node's prior, P(xj), then proceeding analogously to the example above. The resulting factorized computations give local propagation rules, similar to those of [24, 32]: Each node, j, receives a message from each neighbor, k, which is an accumulated likelihood function, L_{kj} = P(yk ... yz | xj), where yk ... yz are all image nodes that lie at or beyond scene node k, relative to scene node j. At each iteration, more image nodes y enter that likelihood function. After each iteration, the MAP estimate at node j is argmax_{xj} P(xj) P(yj|xj) Prod_k L_{kj}, where k runs over all scene node neighbors of node j. We calculate L_{kj} from: \n\nL_{kj} = max_{xk} P(xk|xj) P(yk|xk) Prod_{l != j} L~_{lk},   (2) \n\nwhere L~_{lk} is L_{lk} from the previous iteration. The initial L~_{lk}'s are 1. \n\nFigure 1: Markov network nodes used in example. \n\nUsing the factorization rules described above, one can verify that the local computations will compute argmax_{x1,x2,...,xN} P(x1, x2, ..., xN | y1, y2, ..., yM), as desired. To learn the network parameters, we measure P(xj), P(yj|xj), and P(xk|xj), directly from the synthetic training data. \n\nIf the network contains loops, the above factorization does not hold. Both learning and inference then require more computationally intensive methods [15]. Alternatively, one can use multi-resolution quad-tree networks [20], for which the factorization rules apply, to propagate information spatially. 
However, this gives results with artifacts along quad-tree boundaries, statistical boundaries in the model not present in the real problem. We found good results by including the loop-causing connections between adjacent nodes at the same tree level but applying the factorized propagation rules anyway. Others have obtained good results using the same approach for inference [8, 21, 32]; Weiss provides theoretical arguments why this works for certain cases [32]. \n\n3 Discrete Probability Representation (motion example) \n\nWe applied the training method and propagation rules to motion estimation, using a vector code representation [11] for both images and scenes. We wrote a tree-structured vector quantizer, to code 4 by 4 pixel by 2 frame blocks of image data for each pyramid level into one of 300 codes for each level. We also coded scene patches into one of 300 codes. \n\nDuring training, we presented approximately 200,000 examples of irregularly shaped moving blobs, some overlapping, of a contrast with the background randomized to one of 4 values. Using co-occurrence histograms, we measured the statistical relationships that embody our algorithm: P(x), P(y|x), and P(xn|x), for scene xn neighboring scene x. \n\nFigure 2 shows an input test image, (a) before and (b) after vector quantization. The true underlying scene, the desired output, is shown (c) before and (d) after vector quantization. Figure 3 shows six iterations of the algorithm (Eq. 2) as it converges to a good estimate for the underlying scene velocities. The local probabilities we learned (P(x), P(y|x), and P(xn|x)) lead to figure/ground segmentation, aperture problem constraint propagation, and filling-in (see caption). \n\nFigure 2: (a) First of two frames of image data (in Gaussian pyramid), and (b) vector quantized. (c) The optical flow scene information, and (d) vector quantized. 
Large arrow added to show small vectors' orientation. \n\nFigure 3: The most probable scene code for Fig. 2b at first 6 iterations of Bayesian belief propagation. (a) Note initial motion estimates occur only at edges. Due to the "aperture problem", initial estimates do not agree. (b) Filling-in of motion estimate occurs. Cues for figure/ground determination may include edge curvature, and information from lower resolution levels. Both are included implicitly in the learned probabilities. (c) Figure/ground still undetermined in this region of low edge curvature. (d) Velocities have filled-in, but do not yet all agree. (e) Velocities have filled-in, and agree with each other and with the correct velocity direction, shown in Fig. 2. \n\n4 Density Representation (super-resolution example) \n\nFor super-resolution, the input "image" is the high-frequency components (sharpest details) of a sub-sampled image. The "scene" to be estimated is the high-frequency components of the full-resolution image, Fig. 4. \n\nWe improved our method for this second problem. A faithful image representation requires so many vector codes that it becomes infeasible to measure the prior and co-occurrence statistics (note unfaithful fit of Fig. 2). On the other hand, a discrete representation allows fast propagation. We developed a hybrid method that allows both good fitting and fast propagation. \n\nWe describe the image and scene patches as vectors in a continuous space, and first modelled the probability densities, P(x), P(y, x), and P(xn, x), as Gaussian mixtures [4]. (We reduced the dimensionality some by principal components analysis [4].) We then evaluated the prior and conditional distributions of Eq. 2 only at a discrete set of scene values, different for each node. (This sample-based approach relates to [14, 7].) The scenes were a sampling of those scenes which render to the image at that node. This focuses the computation on the locally feasible scene interpretations. P(xk|xj) in Eq. 2 becomes the ratios of the Gaussian mixtures P(xk, xj) and P(xj), evaluated at the scene samples at nodes k and j, respectively. P(yk|xk) is P(yk, xk)/P(xk) evaluated at the scene samples of node k. \n\nTo select the scene samples, we could condition the mixture P(y, x) on the y observed at each node, and sample x's from the resulting mixture of Gaussians. We obtained somewhat better results by using the scenes from the training set whose images most closely matched the image observed at that node (thus avoiding one Gaussian mixture modeling step). \n\nUsing 40 scene samples per node, setting up the P(xk|xj) matrix for each link took several minutes for 96x96 pixel images. The scene (high resolution) patch size was 3x3; the image (low resolution) patch size was 7x7. We didn't feel long-range scene propagation was critical here, so we used a flat, not a pyramid, node structure. Once the matrices were computed, the iterations of Eq. 2 were completed within seconds. \n\nFigure 4 shows the results. The training images were randomly shaded and painted blobs such as the test image shown. 
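The sample-selection step, keeping as each node's candidate scenes the training scenes whose images best match the observed patch, can be sketched as follows. This is an illustrative reading of the procedure: the patch dimensions, the squared-Euclidean match criterion, and all variable names are our assumptions, not the paper's code:

```python
import random

random.seed(0)
D_IMG, D_SCENE = 7 * 7, 3 * 3   # low-res image patch and high-res scene patch sizes
N_TRAIN, N_SAMPLES = 500, 40    # training pairs; scene samples kept per node

# Hypothetical training set of paired image-patch / scene-patch vectors.
train_images = [[random.gauss(0, 1) for _ in range(D_IMG)] for _ in range(N_TRAIN)]
train_scenes = [[random.gauss(0, 1) for _ in range(D_SCENE)] for _ in range(N_TRAIN)]

def sq_dist(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))

def scene_candidates(observed, k=N_SAMPLES):
    # Keep the scenes of the k training examples whose image patches lie
    # closest to the observed patch; these become the node's discrete set
    # of scene values at which the propagation rules are evaluated.
    order = sorted(range(N_TRAIN), key=lambda i: sq_dist(train_images[i], observed))
    return [train_scenes[i] for i in order[:k]]

observed_patch = [random.gauss(0, 1) for _ in range(D_IMG)]
candidates = scene_candidates(observed_patch)
assert len(candidates) == N_SAMPLES
```

Because each node keeps only a small candidate set, the compatibility tables between neighboring nodes stay 40x40 regardless of how finely the continuous densities were modeled.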
After 5 iterations, the synthesized maximum likelihood estimate of the high resolution image is visually close to the actual high frequency image (top row). (Including P(x) gave too flat results, we suspect due to errors modeling that highly peaked distribution.) The dominant structures are all in approximately the correct position. This may enable high quality zooming of low-resolution images, attempted with limited success by others [28, 25]. \n\nFigure 4: Superresolution example. Top row: input and desired output (contrast normalized, only those orientations around vertical): the sub-sampled image, the zoomed high frequencies of the sub-sampled image (algorithm input), the full-detail image, and the high frequencies of the full-detail image (desired output). Bottom row: algorithm output at iterations 0, 1, and 5, and comparison of the image with and without the estimated high vertical frequencies. \n\n5 Discussion \n\nIn related applications of Markov random fields to vision, researchers typically use relatively simple, heuristically derived expressions (rather than learned) for the likelihood function P(y|x) or for the spatial relationships in the prior term on scenes [10, 26, 9, 17, 5, 20, 19, 27]. Some researchers have applied related learning approaches to low-level vision problems, but restricted themselves to linear models [18, 13]. For other learning or constraint propagation approaches in motion analysis, see [20, 22, 16]. \n\nIn summary, we have developed a principled and practical learning based method for low-level vision problems. Markov assumptions lead to factorizing the posterior probability. 
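For the discrete (motion) representation, learning those parameters amounts to normalizing co-occurrence counts over vector-quantized codes. A minimal sketch with synthetic correlated code triples standing in for real quantized training data (the code count and correlation pattern are illustrative assumptions):

```python
import random
from collections import Counter

random.seed(0)
CODES = 5
N = 20000

# Synthetic training triples: scene code x, image code y, neighbor scene code
# xn, with y and xn correlated with x as a stand-in for real quantized patches.
triples = []
for _ in range(N):
    x = random.randrange(CODES)
    y = (x + random.choice([0, 0, 0, 1])) % CODES
    xn = (x + random.choice([0, 0, 1])) % CODES
    triples.append((x, y, xn))

count_x = Counter(x for x, _, _ in triples)
count_xy = Counter((x, y) for x, y, _ in triples)
count_xxn = Counter((x, xn) for x, _, xn in triples)

# Normalized histograms give the three learned quantities:
# P(x), P(y|x), and P(xn|x).
P_x = {x: count_x[x] / N for x in range(CODES)}
P_y_given_x = {(x, y): c / count_x[x] for (x, y), c in count_xy.items()}
P_xn_given_x = {(x, xn): c / count_x[x] for (x, xn), c in count_xxn.items()}

# Each conditional sums to 1 over y (resp. xn) for every scene code x.
for x in range(CODES):
    s = sum(p for (xx, _), p in P_y_given_x.items() if xx == x)
    assert abs(s - 1.0) < 1e-9
```

With 300 codes per patch and three such tables per link, this is exactly the kind of counting that stays tractable in the discrete case but breaks down when a faithful continuous representation is needed.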
The parameters of our Markov random field are probabilities specified by the training data. For our two examples (programmed in C and Matlab, respectively), the training can take several hours but the running takes only several minutes. Scene estimation by Markov networks may be useful for other low-level vision problems, such as extracting intrinsic images from line drawings or photographs. \n\nAcknowledgements We thank E. Adelson, J. Tenenbaum, P. Viola, and Y. Weiss for helpful discussions. \n\nReferences \n[1] J. J. Atick, Z. Li, and A. N. Redlich. Understanding retinal color coding from first principles. Neural Computation, 4:559-572, 1992. \n[2] A. J. Bell and T. J. Sejnowski. The independent components of natural scenes are edge filters. Vision Research, 37(23):3327-3338, 1997. \n[3] J. O. Berger. Statistical decision theory and Bayesian analysis. Springer, 1985. \n[4] C. M. Bishop. Neural networks for pattern recognition. Oxford, 1995. \n[5] M. J. Black and P. Anandan. A framework for the robust estimation of optical flow. In Proc. 4th Intl. Conf. Computer Vision, pages 231-236. IEEE, 1993. \n[6] P. J. Burt and E. H. Adelson. The Laplacian pyramid as a compact image code. IEEE Trans. Comm., 31(4):532-540, 1983. \n[7] J. S. DeBonet and P. Viola. Texture recognition using a non-parametric multi-scale statistical model. In Proc. IEEE Computer Vision and Pattern Recognition, 1998. \n[8] B. J. Frey. Bayesian networks for pattern classification. MIT Press, 1997. \n[9] D. Geiger and F. Girosi. Parallel and deterministic algorithms from MRF's: surface reconstruction. IEEE Pattern Analysis and Machine Intelligence, 13(5):401-412, May 1991. \n[10] S. Geman and D. Geman. Stochastic relaxation, Gibbs distribution, and the Bayesian restoration of images. 
IEEE Pattern Analysis and Machine Intelligence, 6:721-741, 1984. \n[11] R. M. Gray, P. C. Cosman, and K. L. Oehler. Incorporating visual factors into vector quantizers for image compression. In A. B. Watson, editor, Digital images and human vision. MIT Press, 1993. \n[12] D. J. Heeger and J. R. Bergen. Pyramid-based texture analysis/synthesis. In ACM SIGGRAPH, pages 229-236, 1995. In Computer Graphics Proceedings, Annual Conference Series. \n[13] A. C. Hurlbert and T. A. Poggio. Synthesizing a color algorithm from examples. Science, 239:482-485, 1988. \n[14] M. Isard and A. Blake. Contour tracking by stochastic propagation of conditional density. In Proc. European Conf. on Computer Vision, pages 343-356, 1996. \n[15] M. I. Jordan, editor. Learning in graphical models. MIT Press, 1998. \n[16] S. Ju, M. J. Black, and A. D. Jepson. Skin and bones: Multi-layer, locally affine, optical flow and regularization with transparency. In Proc. IEEE Computer Vision and Pattern Recognition, pages 307-314, 1996. \n[17] D. Kersten. Transparency and the cooperative computation of scene attributes. In M. S. Landy and J. A. Movshon, editors, Computational Models of Visual Processing, chapter 15. MIT Press, Cambridge, MA, 1991. \n[18] D. Kersten, A. J. O'Toole, M. E. Sereno, D. C. Knill, and J. A. Anderson. Associative learning of scene parameters from images. Applied Optics, 26(23):4999-5006, 1987. \n[19] D. Knill and W. Richards, editors. Perception as Bayesian inference. Cambridge Univ. Press, 1996. \n[20] M. R. Luettgen, W. C. Karl, and A. S. Willsky. Efficient multiscale regularization with applications to the computation of optical flow. IEEE Trans. Image Processing, 3(1):41-64, 1994. \n[21] D. J. C. MacKay and R. M. Neal. Good error-correcting codes based on very sparse matrices. In Cryptography and coding - LNCS 1025, 1995. \n[22] S. Nowlan and T. J. Sejnowski. A selection model for motion processing in area MT of primates. J. 
Neuroscience, 15:1195-1214, 1995. \n[23] B. A. Olshausen and D. J. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381:607-609, 1996. \n[24] J. Pearl. Probabilistic reasoning in intelligent systems: networks of plausible inference. Morgan Kaufmann, 1988. \n[25] A. Pentland and B. Horowitz. A practical approach to fractal-based image compression. In A. B. Watson, editor, Digital images and human vision. MIT Press, 1993. \n[26] T. Poggio, V. Torre, and C. Koch. Computational vision and regularization theory. Nature, 317(26):314-319, 1985. \n[27] E. Saund. Perceptual organization of occluding contours of opaque surfaces. In CVPR '98 Workshop on Perceptual Organization, Santa Barbara, CA, 1998. \n[28] R. R. Schultz and R. L. Stevenson. A Bayesian approach to image expansion for improved definition. IEEE Trans. Image Processing, 3(3):233-242, 1994. \n[29] E. P. Simoncelli. Statistical models for images: Compression, restoration and synthesis. In 31st Asilomar Conf. on Sig., Sys. and Computers, Pacific Grove, CA, 1997. \n[30] E. P. Simoncelli and O. Schwartz. Modeling surround suppression in V1 neurons with a statistically-derived normalization model. In Adv. in Neural Information Processing Systems, volume 11, 1999. \n[31] Y. Weiss. Interpreting images by propagating Bayesian beliefs. In Adv. in Neural Information Processing Systems, volume 9, pages 908-915, 1997. \n[32] Y. Weiss. Belief propagation and revision in networks with loops. Technical Report 1616, AI Lab Memo, MIT, Cambridge, MA 02139, 1998. \n[33] S. C. Zhu and D. Mumford. Prior learning and Gibbs reaction-diffusion. IEEE Pattern Analysis and Machine Intelligence, 19(11), 1997. 
", "award": [], "sourceid": 1629, "authors": [{"given_name": "William", "family_name": "Freeman", "institution": "MERL, Mitsubishi Electric Research Laboratory"}, {"given_name": "Egon", "family_name": "Pasztor", "institution": "MERL, Mitsubishi Electric Research Laboratory"}]}