{"title": "The Manhattan World Assumption: Regularities in Scene Statistics which Enable Bayesian Inference", "book": "Advances in Neural Information Processing Systems", "page_first": 845, "page_last": 851, "abstract": null, "full_text": "The Manhattan World Assumption: \nRegularities in scene statistics which \nenable Bayesian inference \n\nJames M. Coughlan \nSmith-Kettlewell Eye Research Inst. \n2318 Fillmore St. \nSan Francisco, CA 94115 \ncoughlan@ski.org \n\nA.L. Yuille \nSmith-Kettlewell Eye Research Inst. \n2318 Fillmore St. \nSan Francisco, CA 94115 \nyuille@ski.org \n\nAbstract \n\nPreliminary work by the authors made use of the so-called \"Manhattan \nworld\" assumption about the scene statistics of city and \nindoor scenes. This assumption stated that such scenes were built \non a Cartesian grid which led to regularities in the image edge gradient \nstatistics. In this paper we explore the general applicability \nof this assumption and show that, surprisingly, it holds in a large \nvariety of less structured environments including rural scenes. This \nenables us, from a single image, to determine the orientation of the \nviewer relative to the scene structure and also to detect target objects \nwhich are not aligned with the grid. These inferences are \nperformed using a Bayesian model with probability distributions \n(e.g. on the image gradient statistics) learnt from real data. \n\n1 Introduction \n\nIn recent years, there has been growing interest in the statistics of natural images \n(see Huang and Mumford [4] for a recent review). Our focus, however, is on the \ndiscovery of scene statistics which are useful for solving visual inference problems. \nFor example, in related work [5] we have analyzed the statistics of filter responses \non and off edges and hence derived effective edge detectors. 
\n\nIn this paper we present results on statistical regularities of the image gradient \nresponses as a function of the global scene structure. This builds on preliminary \nwork [2] on city and indoor scenes, which observed that such scenes are based \non a Cartesian coordinate system that puts (probabilistic) constraints on the image \ngradient statistics. \n\nOur current work shows that this so-called \"Manhattan world\" assumption about \nthe scene statistics applies far more generally than to urban scenes. Many rural scenes \ncontain sufficient structure in the distribution of edges to provide a natural Cartesian \nreference frame for the viewer. The viewer's orientation relative to this frame can \nbe determined by Bayesian inference. In addition, certain structures in the scene \nstand out by being unaligned to this natural reference frame. In our theory such \nstructures appear as \"outlier\" edges, which makes them easier to detect. Informal \nevidence that human observers use a form of the Manhattan world assumption is \nprovided by the Ames room illusion, see Figure 6, where the observers appear \nto erroneously make this assumption, thereby grotesquely distorting the sizes of \nobjects in the room. \n\n2 Previous Work and Three-Dimensional Geometry \n\nOur preliminary work on city scenes was presented in [2]. There is related work in \ncomputer vision on the detection of vanishing points in 3-d scenes [1], [6] (which \nproceeds through the stages of edge detection, grouping by Hough transforms, and \nfinally the estimation of the geometry). \n\nWe refer the reader to [3] for details on the geometry of the Manhattan world \nand report only the main results here. Briefly, we calculate expressions for the \norientations of x, y, z lines imaged under perspective projection in terms of the \norientation of the camera relative to the x, y, z axes. 
The camera orientation relative \nto the xyz axis system may be specified by three Euler angles: the azimuth (or \ncompass angle) α, corresponding to rotation about the z axis, the elevation β above \nthe xy plane, and the twist γ about the camera's line of sight. We use Ψ = (α, β, γ) \nto denote all three Euler angles of the camera orientation. Our previous work [2] \nassumed that the elevation and twist were both zero, an assumption that turned out to be invalid \nfor many of the images presented in this paper. \n\nWe can then compute the normal orientation of lines parallel to the x, y, z axes, \nmeasured in the image plane, as a function of film coordinates (u, v) and the camera \norientation Ψ. We express the results in terms of orthogonal unit camera axes 𝐚, 𝐛 \nand 𝐜, which are aligned to the body of the camera and are determined by Ψ. For \nx lines (see Figure 1, left panel) we have tan θ_x = -(u c_x + f a_x)/(v c_x + f b_x), where \nθ_x is the normal orientation of the x line at film coordinates (u, v) and f is the focal \nlength of the camera. Similarly, tan θ_y = -(u c_y + f a_y)/(v c_y + f b_y) for y lines and \ntan θ_z = -(u c_z + f a_z)/(v c_z + f b_z) for z lines. In the next section we will see how to \nrelate the normal orientation of an object boundary (such as x, y, z lines) at a point \n(u, v) to the magnitude and direction of the image gradient at that location. \n\nFigure 1: (Left) Geometry of an x line projected onto the (u, v) image plane. θ is the \nnormal orientation of the line in the image. (Right) Histogram of edge orientation \nerror (displayed modulo 180°). Observe the strong peak at 0°, indicating that \nthe image gradient direction at an edge is usually very close to the true normal \norientation of the edge. \n\n3 P_on and P_off: Characterizing Edges Statistically \n\nSince we do not know where the x, y, z lines are in the image, we have to infer their \nlocations and orientations from image gradient information. 
This inference is done \nusing a purely local statistical model of edges. A key element of our approach is that \nit allows the model to infer camera orientation without having to group pixels into \nx, y, z lines. Most grouping procedures rely on the use of binary edge maps which \noften make premature decisions based on too little information. Edge detection is \nmade particularly difficult by the poor quality of some of the images (underexposed \nor overexposed) and by the fact that some of the images lack x, y, z lines \nthat are long enough to group reliably. \n\nFollowing work by Konishi et al. [5], we determine probability distributions P_on(E_𝐮) and \nP_off(E_𝐮) of the image gradient magnitude E_𝐮 at position 𝐮 \nin the image, conditioned on whether we are on or off an edge. These distributions \nquantify the tendency for the image gradient to be high on object boundaries and \nlow off them, see Figure 2. They were learned by Konishi et al. for the Sowerby \nimage database which contains one hundred presegmented images. \n\nFigure 2: P_off(y) (left) and P_on(y) (right), the empirical histograms of edge \nresponses off and on edges, respectively. Here the response y = |∇I| is quantized to \ntake 20 values and is shown on the horizontal axis. Note that the peak of P_off(y) \noccurs at a lower edge response than the peak of P_on(y). \n\nWe extend the work of Konishi et al. by putting probability distributions on how \naccurately the image gradient direction estimates the true normal direction of the \nedge. These were learned for this dataset by measuring the true orientations of the \nedges and comparing them to those estimated from the image gradients. \n\nThis gives us distributions on the magnitude and direction of the intensity gradient \nP_on(𝐄_𝐮 | θ), P_off(𝐄_𝐮), where 𝐄_𝐮 = (E_𝐮, φ_𝐮), θ is the true normal orientation of the \nedge, and φ_𝐮 is the gradient direction measured at point 𝐮 = (u, v). 
We make a \nfactorization assumption that P_on(𝐄_𝐮 | θ) = P_on(E_𝐮) P_ang(φ_𝐮 - θ) and P_off(𝐄_𝐮) = \nP_off(E_𝐮) U(φ_𝐮). P_ang(.) (with argument evaluated modulo 2π and normalized to \n1 over the range 0 to 2π) is based on experimental data, see Figure 1 (right), and \nis peaked about 0 and π. \nIn practice, we use a simple box-shaped function to \nmodel the distribution: P_ang(δθ) = (1 - ε)/(4τ) if δθ is within angle τ of 0 or π, and \nε/(2π - 4τ) otherwise (i.e. the chance of an angular error greater than ±τ is ε). \nIn our experiments ε = 0.1 and τ = 4° for indoors and 6° outdoors. By contrast, \nU(.) = 1/(2π) is the uniform distribution. \n\n4 Bayesian Model \n\nWe devised a Bayesian model which combines knowledge of the three-dimensional \ngeometry of the Manhattan world with statistical knowledge of edges in images. The \nmodel assumes that, while the majority of pixels in the image convey no information \nabout camera orientation, most of the pixels with high edge responses arise from \nthe presence of x, y, z lines in the three-dimensional scene. An important feature of \nthe Bayesian model is that it does not force us to decide prematurely which pixels \nare on and off an object boundary (or whether an on pixel is due to x, y, or z), but \nallows us to sum over all possible interpretations of each pixel. \n\nThe image data 𝐄_𝐮 at a single pixel 𝐮 is explained by one of five models m_𝐮: \nm_𝐮 = 1, 2, 3 mean the data is generated by an edge due to an x, y, z line, respectively, \nin the scene; m_𝐮 = 4 means the data is generated by an outlier edge (not due to an \nx, y, z line); and m_𝐮 = 5 means the pixel is off-edge. The prior probability P(m_𝐮) \nof each of the edge models was estimated empirically to be 0.02, 0.02, 0.02, 0.04, 0.9 \nfor m_𝐮 = 1, 2, ..., 5. 
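
As a rough illustration of the box-shaped angular model and the model priors described above, the following Python sketch may help. The names p_ang and PRIOR, and the choice to pass the half-width τ in degrees, are assumptions made for this example only; the learned histograms P_on and P_off are not reproduced here.

```python
import math

# Prior probabilities of the five pixel models, as quoted in the text:
# x line, y line, z line, outlier edge, off-edge.
PRIOR = (0.02, 0.02, 0.02, 0.04, 0.9)


def p_ang(delta_theta, eps=0.1, tau_deg=4.0):
    # Box-shaped density of the angular error (radians) between the measured
    # gradient direction and the true edge normal. It is peaked about 0 and pi
    # and integrates to 1 over the range 0 to 2*pi.
    tau = math.radians(tau_deg)
    d = delta_theta % math.pi        # fold the two peaks (0 and pi) together
    dist = min(d, math.pi - d)       # angular distance to the nearest peak
    if dist <= tau:
        return (1.0 - eps) / (4.0 * tau)
    return eps / (2.0 * math.pi - 4.0 * tau)
```

With the default values (ε = 0.1, τ = 4°) the density is high inside the two narrow windows around 0 and π and low elsewhere, so a total mass of ε is spread over angular errors larger than τ.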
\n\nUsing the factorization assumption mentioned before, we assume the probability of \nthe image data 𝐄_𝐮 has two factors, one for the magnitude of the edge strength and \nanother for the edge direction: \n\nP(𝐄_𝐮 | m_𝐮, Ψ, 𝐮) = P(E_𝐮 | m_𝐮) P(φ_𝐮 | m_𝐮, Ψ, 𝐮)    (1) \n\nwhere P(E_𝐮 | m_𝐮) equals P_off(E_𝐮) if m_𝐮 = 5 or P_on(E_𝐮) if m_𝐮 ≠ 5. Also, \nP(φ_𝐮 | m_𝐮, Ψ, 𝐮) equals P_ang(φ_𝐮 - θ(Ψ, m_𝐮, 𝐮)) if m_𝐮 = 1, 2, 3 or U(φ_𝐮) if m_𝐮 = 4, 5. \nHere θ(Ψ, m_𝐮, 𝐮) is the predicted normal orientation of lines determined by the \nequation tan θ_x = -(u c_x + f a_x)/(v c_x + f b_x) for x lines, tan θ_y = -(u c_y + f a_y)/(v c_y + \nf b_y) for y lines, and tan θ_z = -(u c_z + f a_z)/(v c_z + f b_z) for z lines. \n\nIn summary, the edge strength probability is modeled by P_on for models 1 through \n4 and by P_off for model 5. For models 1, 2 and 3 the edge orientation is modeled by \na distribution which is peaked about the appropriate orientation of an x, y, z line \npredicted by the camera orientation at pixel location 𝐮; for models 4 and 5 the edge \norientation is assumed to be uniformly distributed from 0 through 2π. \n\nRather than decide on a particular model at each pixel, we marginalize over all five \npossible models (i.e. creating a mixture model): \n\nP(𝐄_𝐮 | Ψ, 𝐮) = Σ_{m_𝐮=1}^{5} P(𝐄_𝐮 | m_𝐮, Ψ, 𝐮) P(m_𝐮)    (2) \n\nNow, to combine evidence over all pixels in the image, denoted by {𝐄_𝐮}, we assume \nthat the image data is conditionally independent across all pixels, given the camera \norientation Ψ: \n\nP({𝐄_𝐮} | Ψ) = Π_𝐮 P(𝐄_𝐮 | Ψ, 𝐮)    (3) \n\n(Although the conditional independence assumption neglects the coupling of gradients \nat neighboring pixels, it is a useful approximation that makes the model \ncomputationally tractable.) Thus the posterior distribution on the camera orientation \nis given by Π_𝐮 P(𝐄_𝐮 | Ψ, 𝐮) P(Ψ)/Z, where Z is a normalization factor and P(Ψ) \nis a uniform prior on the camera orientation. 
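
The mixture of equation (2) and the product of equation (3) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the values of P_on and P_off at each pixel are supplied by the caller because the learned histograms are not given here, and the camera axes a, b, c are passed in directly as (x, y, z) component triples rather than derived from the Euler angles, whose conventions are detailed in [3].

```python
import math

PRIOR = (0.02, 0.02, 0.02, 0.04, 0.9)   # models 1..5, as quoted in the text
UNIFORM = 1.0 / (2.0 * math.pi)          # U(.) over gradient directions


def p_ang(delta_theta, eps=0.1, tau_deg=4.0):
    # Box-shaped angular error density, peaked about 0 and pi.
    tau = math.radians(tau_deg)
    d = delta_theta % math.pi
    dist = min(d, math.pi - d)
    if dist <= tau:
        return (1.0 - eps) / (4.0 * tau)
    return eps / (2.0 * math.pi - 4.0 * tau)


def predicted_normal(u, v, f, a_k, b_k, c_k):
    # Angle theta with tan(theta) = -(u*c_k + f*a_k) / (v*c_k + f*b_k).
    # atan2 picks one of the two solutions, which differ by pi; p_ang does
    # not distinguish between them.
    return math.atan2(-(u * c_k + f * a_k), v * c_k + f * b_k)


def pixel_likelihood(p_on, p_off, phi, u, v, f, a, b, c, eps=0.1, tau_deg=4.0):
    # Equation (2): sum over the five models at one pixel.
    # p_on and p_off are P_on and P_off evaluated at this pixel's gradient
    # magnitude (hypothetical inputs here); phi is the gradient direction.
    total = 0.0
    for k in range(3):                       # models 1-3: x, y, z lines
        theta = predicted_normal(u, v, f, a[k], b[k], c[k])
        total += PRIOR[k] * p_on * p_ang(phi - theta, eps, tau_deg)
    total += PRIOR[3] * p_on * UNIFORM       # model 4: outlier edge
    total += PRIOR[4] * p_off * UNIFORM      # model 5: off-edge
    return total


def log_likelihood(pixels, f, a, b, c):
    # Logarithm of equation (3); pixels holds (p_on, p_off, phi, u, v) tuples.
    return sum(
        math.log(pixel_likelihood(p_on, p_off, phi, u, v, f, a, b, c))
        for (p_on, p_off, phi, u, v) in pixels
    )
```

Because the prior on the camera orientation is uniform, comparing log_likelihood across candidate sets of camera axes is equivalent to comparing their posterior probabilities.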
\n\nTo find the MAP (maximum a posteriori) estimate, our algorithm maximizes the \nlog posterior term log[P({𝐄_𝐮} | Ψ) P(Ψ)] = log P(Ψ) + Σ_𝐮 log[Σ_{m_𝐮} P(𝐄_𝐮 | m_𝐮, Ψ, 𝐮) P(m_𝐮)] \nnumerically by searching over a quantized set of compass directions Ψ in a certain \nrange. For details on this procedure, as well as coarse-to-fine techniques for speeding \nup the search, see [3]. \n\n5 Experimental Results \n\nThis section presents results on the domains for which the viewer orientation relative \nto the scene can be detected using the Manhattan world assumption. In particular, \nwe demonstrate results for: (I) indoor and outdoor scenes (as reported in [2]), (II) \nrural English road scenes, (III) rural English fields, (IV) a painting of the French \ncountryside, (V) a field of broccoli in the American mid-west, (VI) the Ames room, \nand (VII) ruins of the Parthenon (in Athens). The results show strong success \nfor inference using the Manhattan world assumption even for domains in which \nit might seem unlikely to apply. (Some examples of failure are given in [3]; for \nexample, in an image of a helicopter in a hilly scene the algorithm mistakenly interprets the \nhill silhouettes as horizontal lines.) \n\nThe first set of images were of city and indoor scenes in San Francisco with images \ntaken by the second author [2]. We include four typical results, see Figure 3, for \ncomparison with the results on other domains. \n\nFigure 3: Estimates of the camera orientation obtained by our algorithm for two \nindoor scenes (left) and two outdoor scenes (right). The estimated orientations of \nthe x, y lines, derived for the estimated camera orientation Ψ, are indicated by the \nblack line segments drawn on the input image. (The z line orientations have been \nomitted for clarity.) At each point on a subgrid two such segments are drawn - one \nfor x and one for y. 
In the image on the far left, observe how the x directions align \nwith the wall on the right hand side and with features parallel to this wall. The y \nlines align with the wall on the left (and objects parallel to it). \n\nWe now extend this work to less structured scenes in the English countryside. \nFigure 4 shows two images of roads in rural scenes and two fields. These images \ncome from the Sowerby database. The next three images were either downloaded \nfrom the web or digitized (the painting). These are the mid-west broccoli field, the \nParthenon ruins, and the painting of the French countryside. \n\n6 Detecting Objects in Manhattan world \n\nWe now consider applying the Manhattan assumption to the alternative problem of \ndetecting target objects in background clutter. To perform such a task effectively \nrequires modelling the properties of the background clutter in addition to those of \nthe target object. It has recently been appreciated that good statistical modelling \nof the image background can improve the performance of target recognition [7]. \n\nThe Manhattan world assumption gives an alternative way of probabilistically modelling \nbackground clutter. The background clutter will correspond to the regular \nstructure of buildings and roads and its edges will be aligned to the Manhattan \ngrid. The target object, however, is assumed to be unaligned (at least, in part) to \nthis grid. Therefore many of the edges of the target object will be assigned to model \n4 by the algorithm. (Note the algorithm first finds the MAP estimate Ψ* of the \n\nFigure 4: Results on rural images in England without strong Manhattan structure. \nSame conventions as before. Two images of roads in the countryside (left panels) \nand two images of fields (right panels). \n\nFigure 5: Results on an American mid-west broccoli field, the ruins of the \nParthenon, and a digitized painting of the French countryside. 
\n\ncompass orientation, see Section 4, and then estimates the model by doing MAP \nof P(m_𝐮 | 𝐄_𝐮, Ψ*, 𝐮) to estimate m_𝐮 for each pixel 𝐮.) This enables us to significantly \nsimplify the detection task by removing all edges in the images except those \nassigned to model 4. \n\nThe Ames room, see Figure 6, is a geometrically distorted room which is constructed \nso as to give the false impression that it is built on a Cartesian coordinate \nframe when viewed from a special viewpoint. Human observers assume that the \nroom is indeed Cartesian despite all other visual cues to the contrary. This distorts \nthe apparent size of objects so that, for example, humans in different parts of the \nroom appear to have very different sizes. In fact, a human walking across the room \nwill appear to change size dramatically. Our algorithm, like human observers, interprets \nthe room as being Cartesian and helps identify the humans in the room as \noutlier edges which are unaligned to the Cartesian reference system. \n\n7 Summary and Conclusions \n\nWe have demonstrated that the Manhattan world assumption applies to a range \nof images, rural and otherwise, in addition to urban scenes. We demonstrated a \nBayesian model which used this assumption to infer the orientation of the viewer \nrelative to this reference frame and which could also detect outlier edges which are \nunaligned to the reference frame. A key element of this approach is the use of image \ngradient statistics, learned from image datasets, which quantify the distribution of \nthe image gradient magnitude and direction on and off object boundaries. We \nexpect that there are many further image regularities of this type which can be \nused for building effective artificial vision systems and which are possibly made use \nof by biological vision systems. \n\nFigure 6: Detecting people in Manhattan world. 
The left images (top and bottom) \nshow the estimated scene structure. The right images show that people stand out \nas residual edges which are unaligned to the Manhattan grid. The Ames room (top \npanel) violates the Manhattan assumption but human observers, and our algorithm, \ninterpret it as if it satisfied the assumption. In fact, despite appearances, the two \npeople in the Ames room are really the same size. \n\nAcknowledgments \n\nWe want to acknowledge funding from NSF with award number IRI-9700446, support \nfrom the Smith-Kettlewell core grant, and from the Center for Imaging Sciences \nwith Army grant ARO DAAH049510494. This work was also supported by \nthe National Institutes of Health (NEI) with grant number R01-EY 12691-01. It is a \npleasure to acknowledge email conversations with Song Chun Zhu about scene clutter. \nWe gratefully acknowledge the use of the Sowerby image dataset from Sowerby \nResearch Centre, British Aerospace. \n\nReferences \n\n[1] B. Brillault-O'Mahony. \"New Method for Vanishing Point Detection\". Computer Vision, \nGraphics, and Image Processing. 54(2). pp 289-300. 1991. \n\n[2] J. Coughlan and A.L. Yuille. \"Manhattan World: Compass Direction from a Single Image \nby Bayesian Inference\". Proceedings International Conference on Computer Vision \nICCV'99. Corfu, Greece. 1999. \n\n[3] J. Coughlan and A.L. Yuille. \"Manhattan World: Orientation and Outlier Detection \nby Bayesian Inference\". Submitted to International Journal of Computer Vision. 2000. \n\n[4] J. Huang and D. Mumford. \"Statistics of Natural Images and Models\". In Proceedings \nComputer Vision and Pattern Recognition CVPR'99. Fort Collins, Colorado. 1999. \n\n[5] S. Konishi, A. L. Yuille, J. M. Coughlan, and S. C. Zhu. \"Fundamental Bounds on \nEdge Detection: An Information Theoretic Evaluation of Different Edge Cues\". Proc. \nInt'l Conf. on Computer Vision and Pattern Recognition, 1999. \n\n[6] E. Lutton, H. 
Maitre, and J. Lopez-Krahe. \"Contribution to the determination of vanishing \npoints using Hough transform\". IEEE Trans. on Pattern Analysis and Machine \nIntelligence. 16(4). pp 430-438. 1994. \n\n[7] S. C. Zhu, A. Lanterman, and M. I. Miller. \"Clutter Modeling and Performance Analysis \nin Automatic Target Recognition\". In Proceedings Workshop on Detection and \nClassification of Difficult Targets. Redstone Arsenal, Alabama. 1998. \n", "award": [], "sourceid": 1804, "authors": [{"given_name": "James", "family_name": "Coughlan", "institution": null}, {"given_name": "Alan", "family_name": "Yuille", "institution": null}]}