{"title": "Hierarchical Image Probability (H1P) Models", "book": "Advances in Neural Information Processing Systems", "page_first": 848, "page_last": 854, "abstract": null, "full_text": "Hierarchical Image Probability (HIP) Models \n\nClay D. Spence and Lucas Parra \n\nSarnoff Corporation \n\nCN5300 \n\nPrinceton, NJ 08543-5300 \n\n{ cspence, lparra} @samoff.com \n\nAbstract \n\nWe formulate a model for probability distributions on image spaces. We \nshow that any distribution of images can be factored exactly into condi(cid:173)\ntional distributions of feature vectors at one resolution (pyramid level) \nconditioned on the image information at lower resolutions. We would \nlike to factor this over positions in the pyramid levels to make it tractable, \nbut such factoring may miss long-range dependencies. To fix this, we in(cid:173)\ntroduce hidden class labels at each pixel in the pyramid. The result is \na hierarchical mixture of conditional probabilities, similar to a hidden \nMarkov model on a tree. The model parameters can be found with max(cid:173)\nimum likelihood estimation using the EM algorithm. We have obtained \nencouraging preliminary results on the problems of detecting various ob(cid:173)\njects in SAR images and target recognition in optical aerial images. \n\n1 Introduction \nMany approaches to object recognition in images estimate Pr(class I image). By con(cid:173)\ntrast, a model of the probability distribution of images, Pr(image), has many attrac(cid:173)\ntive features. We could use this for object recognition in the usual way by training \na distribution for each object class and using Bayes' rule to get Pr(class I image) = \nPr(image I class) Pr(class)J Pr{image). Clearly there are many other benefits of having a \nmodel of the distribution of images, since any kind of data analysis task can be approached \nusing knowledge of the distribution of the data. 
For classification we could attempt to detect unusual examples and reject them, rather than trusting the classifier's output. We could also compress, interpolate, suppress noise, extend resolution, fuse multiple images, etc. \n\nMany image analysis algorithms use probability concepts, but few treat the distribution of images. Zhu, Wu and Mumford [9] do this by computing the maximum entropy distribution given a set of statistics for some features. This seems to work well for textures, but it is not clear how well it will model the appearance of more structured objects. \n\nThere are several algorithms for modeling the distributions of features extracted from the image, instead of the image itself. The Markov random field (MRF) models are an example of this line of development; see, e.g., [5, 4]. Unfortunately they tend to be very expensive computationally. \n\nFigure 1: Pyramids and feature notation. \n\nIn De Bonet and Viola's flexible histogram approach [2, 1], features are extracted at multiple image scales, and the resulting feature vectors are treated as a set of independent samples drawn from a distribution. They then model this distribution of feature vectors with Parzen windows. This has given good results, but the feature vectors from neighboring pixels are treated as independent when in fact they share exactly the same components from lower resolutions. To fix this we might want to build a model in which the features at one pixel of one pyramid level condition the features at each of several child pixels at the next higher-resolution pyramid level. The multiscale stochastic process (MSP) methods do exactly that. Luettgen and Willsky [7], for example, applied a scale-space auto-regression (AR) model to texture discrimination. 
They use a quadtree or quadtree-like organization of the pixels in an image pyramid, and model the features in the pyramid as a stochastic process from coarse to fine levels along the tree. The variables in the process are hidden, and the observations are sums of these hidden variables plus noise. The Gaussian distributions are a limitation of MSP models. The result is also a model of the probability of the observations on the tree, not of the image. \n\nAll of these methods seem well-suited for modeling texture, but it is unclear how we might build the models to capture the appearance of more structured objects. We will argue below that the presence of objects in images can make local conditioning like that of the flexible histogram and MSP approaches inappropriate. In the following we present a model for probability distributions of images, in which we try to move beyond texture modeling. This hierarchical image probability (HIP) model is similar to a hidden Markov model on a tree, and can be learned with the EM algorithm. In preliminary tests of the model on classification tasks the performance was comparable to that of other algorithms. \n\n2 Coarse-to-fine factoring of image distributions \n\nOur goal will be to write the image distribution in a form similar to Pr(I) ~ Pr(F_0 | F_1) Pr(F_1 | F_2) ..., where F_l is the set of feature images at pyramid level l. We expect that the short-range dependencies can be captured by the model's distribution of individual feature vectors, while the long-range dependencies can be captured somehow at low resolution. The large-scale structures affect finer scales by the conditioning. \n\nIn fact we can prove that a coarse-to-fine factoring like this is correct. From an image I we build a Gaussian pyramid (repeatedly blur-and-subsample, with a Gaussian filter). Call the l-th level I_l; e.g., the original image is I_0 (Figure 1). 
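The blur-and-subsample construction of the Gaussian pyramid can be sketched as follows. The 5-tap binomial filter and the level count are illustrative choices, not taken from the paper.

```python
import numpy as np

def blur_and_subsample(img):
    """One pyramid step: separable [1 4 6 4 1]/16 blur, then keep every
    second row and column (subsampling by two in each direction)."""
    k = np.array([1.0, 4.0, 6.0, 4.0, 1.0]) / 16.0
    # Blur rows, then columns; 'same' keeps the size, zero-padding the edges.
    blurred = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, img)
    blurred = np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, blurred)
    return blurred[::2, ::2]

def gaussian_pyramid(img, levels):
    """I_0 is the original image; I_{l+1} = blur_and_subsample(I_l)."""
    pyr = [img]
    for _ in range(levels):
        pyr.append(blur_and_subsample(pyr[-1]))
    return pyr

pyr = gaussian_pyramid(np.random.default_rng(1).standard_normal((32, 32)), 3)
```

Each level has half the linear resolution of the one below it, so a 32x32 image yields levels of size 32, 16, 8, and 4 on a side.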
From each Gaussian level I_l we extract some set of feature images F_l. Subsample these to get feature images G_l. Note that the images in G_l have the same dimensions as I_{l+1}. We denote by 𝒢_l the set containing I_{l+1} and the images in G_l. We further denote the mapping from I_l to 𝒢_l by g_l. \n\nSuppose now that g_0 : I_0 ↦ 𝒢_0 is invertible. Then we can think of g_0 as a change of variables. If we have a distribution on a space, its expressions in two different coordinate systems are related by multiplying by the Jacobian. In this case we get Pr(I_0) = |g_0| Pr(𝒢_0). Since 𝒢_0 = (G_0, I_1), we can factor Pr(𝒢_0) to get Pr(I_0) = |g_0| Pr(G_0 | I_1) Pr(I_1). If g_l is invertible for all l ∈ {0, ..., L-1} then we can simply repeat this change of variables and factoring procedure to get \n\nPr(I) = [ \\prod_{l=0}^{L-1} |g_l| Pr(G_l | I_{l+1}) ] Pr(I_L)    (1) \n\nThis is a very general result, valid for all Pr(I), no doubt with some rather mild restrictions to make the change of variables valid. The restriction that g_l be invertible is strong, but many such feature sets are known to exist, e.g., most wavelet transforms on images. We know of a few ways that this condition can be relaxed, but further work is needed here. \n\n3 The need for hidden variables \n\nFor the sake of tractability we want to factor Pr(G_l | I_{l+1}) over positions, something like Pr(I) ~ \\prod_l \\prod_{x \\in I_{l+1}} Pr(g_l(x) | f_{l+1}(x)), where g_l(x) and f_{l+1}(x) are the feature vectors at position x. The dependence of g_l on f_{l+1} expresses the persistence of image structures across scale; e.g., an edge is usually detectable as such in several neighboring pyramid levels. The flexible histogram and MSP methods share this structure. While it may be plausible that f_{l+1}(x) has a strong influence on g_l(x), we argue now that this factorization and conditioning is not enough to capture some properties of real images. 
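Under this position-factored approximation the log-likelihood decomposes into a sum over levels and positions. The sketch below makes that decomposition concrete; the linear-Gaussian conditional for Pr(g_l(x) | f_{l+1}(x)) is a hypothetical stand-in, since the paper does not commit to a parametric form at this point.

```python
import numpy as np

def factored_log_likelihood(g_levels, f_levels, cond_logpdf):
    """Sum of log Pr(g_l(x) | f_{l+1}(x)) over levels l and positions x.

    g_levels[l] : list of child feature vectors g_l(x) at level l.
    f_levels[l] : list of parent feature vectors f_{l+1}(x), same positions.
    cond_logpdf(g, f) : log density of g given f.
    """
    total = 0.0
    for g_l, f_l in zip(g_levels, f_levels):
        for g, f in zip(g_l, f_l):
            total += cond_logpdf(g, f)
    return total

# Hypothetical conditional: g is Gaussian around a linear prediction W @ f.
W = np.array([[0.5]])
def cond_logpdf(g, f, var=1.0):
    r = g - W @ f
    return float(-0.5 * (r @ r) / var - 0.5 * len(g) * np.log(2 * np.pi * var))

# One level, two positions, one-dimensional features.
g_levels = [[np.array([0.5]), np.array([0.2])]]
f_levels = [[np.array([1.0]), np.array([0.4])]]
ll = factored_log_likelihood(g_levels, f_levels, cond_logpdf)
```

Here both children sit exactly on the linear prediction from their parents, so the total reduces to the two Gaussian normalization terms.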
\n\nObjects in the world cause correlations and non-local dependencies in images. For example, the presence of a particular object might cause a certain kind of texture to be visible at level l. Usually local features f_{l+1} by themselves will not contain enough information to infer the object's presence, but the entire image I_{l+1} at that layer might. Thus g_l(x) is influenced by more of I_{l+1} than the local feature vector. \n\nSimilarly, objects create long-range dependencies. For example, an object class might result in a kind of texture across a large area of the image. If an object of this class is always present, the distribution may factor, but if such objects aren't always present and can't be inferred from lower-resolution information, the presence of the texture at one location affects the probability of its presence elsewhere. \n\nWe introduce hidden variables to represent the non-local information that is not captured by local features. They should also constrain the variability of features at the next finer scale. Denoting them collectively by A, we assume that conditioning on A allows the distributions over feature vectors to factor. In general, the distribution over images becomes \n\nPr(I) \\propto \\sum_A [ \\prod_{l=0}^{L-1} \\prod_{x \\in I_{l+1}} Pr(g_l(x) | f_{l+1}(x), A) ] Pr(A | I_L) Pr(I_L)    (2) \n\nAs written this is absolutely general, so we need to be more specific. In particular we would like to preserve the conditioning of higher-resolution information on coarser-resolution information, and the ability to factor over positions. \n\nFigure 2: Tree structure of the conditional dependency between hidden variables in the HIP model. With subsampling by two, this is sometimes called a quadtree structure. \n\nAs a first model we have chosen the following structure for our HIP model: \n\nPr(I) \\propto \\sum_A \\prod_{l=0}^{L-1} \\prod_{x \\in I_{l+1}} Pr(g_l(x) | f_{l+1}(x), a_l(x)) Pr(a_l(x) | a_{l+1}(x)) 
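The effect of summing out tree-structured hidden labels can be illustrated on a tiny two-level example: one parent label conditions the labels of two child positions, each child label selects a component for its feature value, and all labels are marginalized. The two-component mixture, the transition matrix, and all numbers below are made up for illustration; this is not the fitted HIP model.

```python
import numpy as np

# Toy tree: one parent pixel with label a1, two child pixels with labels a0.
prior_a1 = np.array([0.5, 0.5])               # Pr(a1)
trans = np.array([[0.9, 0.1], [0.1, 0.9]])    # Pr(a0 | a1)

def comp_logpdf(g, a0):
    """log Pr(g | a0): unit-variance Gaussian whose mean depends on the label."""
    mu = [0.0, 3.0][a0]
    return -0.5 * (g - mu) ** 2 - 0.5 * np.log(2 * np.pi)

def marginal_likelihood(children):
    """Pr(g) = sum_{a1} Pr(a1) * prod_x [ sum_{a0} Pr(a0 | a1) Pr(g_x | a0) ]."""
    total = 0.0
    for a1 in (0, 1):
        per_child = [
            sum(trans[a1, a0] * np.exp(comp_logpdf(g, a0)) for a0 in (0, 1))
            for g in children
        ]
        total += prior_a1[a1] * np.prod(per_child)
    return total

# The shared parent label couples the children: two children near mean 3 are
# jointly more likely than one near 3 and one near 0.
p_both_high = marginal_likelihood([3.0, 3.0])
p_mixed = marginal_likelihood([3.0, 0.0])
```

Although the children factor given the labels, marginalizing the labels leaves a dependence between them, which is exactly the long-range coupling the hidden variables are meant to carry.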
Figure 6: Empirical and HIP estimates of the distribution of a feature g_l(x) conditioned on its parent feature f_{l+1}(x). \n\ncomponents. The resulting model is somewhat like a hidden Markov model on a tree. Our early results on classification problems showed good performance. \n\nAcknowledgements \n\nWe thank Jeremy De Bonet and John Fisher for kindly answering questions about their work and experiments. Supported by the United States Government. \n\nReferences \n\n[1] J. S. De Bonet, P. Viola, and J. W. Fisher III. Flexible histograms: A multiresolution target discrimination model. In E. G. Zelnio, editor, Proceedings of SPIE, volume 3370, 1998. \n\n[2] Jeremy S. De Bonet and Paul Viola. Texture recognition using a non-parametric multi-scale statistical model. In Conference on Computer Vision and Pattern Recognition. IEEE, 1998. \n\n[3] Robert W. Buccigrossi and Eero P. Simoncelli. Image compression via joint statistical characterization in the wavelet domain. Technical Report 414, U. Penn. GRASP Laboratory, 1998. Available at ftp://ftp.cis.upenn.edu/pub/eero/buccigrossi97.ps.gz. \n\n[4] Rama Chellappa and S. Chatterjee. Classification of textures using Gaussian Markov random fields. IEEE Trans. ASSP, 33:959-963, 1985. \n\n[5] Stuart Geman and Donald Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. PAMI, PAMI-6(6):721-741, November 1984. \n\n[6] Michael I. Jordan, editor. Learning in Graphical Models, volume 89 of NATO Science Series D: Behavioral and Brain Sciences. Kluwer Academic, 1998. \n\n[7] Mark R. Luettgen and Alan S. Willsky. Likelihood calculation for a class of multiscale stochastic models, with application to texture discrimination. IEEE Trans. Image Proc., 4(2):194-207, 1995. \n\n[8] Clay D. Spence and Paul Sajda. Applications of multi-resolution neural networks to mammography. In Michael S. 
Kearns, Sara A. Solla, and David A. Cohn, editors, NIPS 11, pages 981-988, Cambridge, MA, 1998. MIT Press. \n\n[9] Song Chun Zhu, Ying Nian Wu, and David Mumford. Minimax entropy principle and its application to texture modeling. Neural Computation, 9(8):1627-1660, 1997. \n", "award": [], "sourceid": 1784, "authors": [{"given_name": "Clay", "family_name": "Spence", "institution": null}, {"given_name": "Lucas", "family_name": "Parra", "institution": null}]}