{"title": "A Non-Parametric Multi-Scale Statistical Model for Natural Images", "book": "Advances in Neural Information Processing Systems", "page_first": 773, "page_last": 779, "abstract": "", "full_text": "A Non-parametric Multi-Scale Statistical Model for Natural Images \n\nJeremy S. De Bonet & Paul Viola \n\nLearning & Vision Group \n\nArtificial Intelligence Laboratory \n\nMassachusetts Institute of Technology \n\n545 Technology Square \n\nCambridge, MA 02139 \n\nEMAIL: jsd@ai.mit.edu & viola@ai.mit.edu \n\nHOMEPAGE: http://www.ai.mit.edu/projects/lv \n\nAbstract \n\nThe observed distribution of natural images is far from uniform. On the contrary, real images have complex and important structure that can be exploited for image processing, recognition and analysis. There have been many proposed approaches to the principled statistical modeling of images, but each has been limited in either the complexity of the models or the complexity of the images. We present a non-parametric multi-scale statistical model for images that can be used for recognition, image de-noising, and in a \"generative mode\" to synthesize high quality textures. \n\n1 Introduction \n\nIn this paper we describe a multi-scale statistical model which can capture the structure of natural images across many scales. Once trained on example images, it can be used to recognize novel images, or to generate new images. Each of these tasks is reasonably efficient, requiring no more than a few seconds or minutes on a workstation. \n\nThe statistical modeling of images is an endeavor which reaches back to the 60's and 70's (Duda and Hart, 1973). Statistical approaches are alluring because they provide a unified view of learning, classification and generation. To date however, a generic, efficient and unified statistical model for natural images has yet to appear. 
\nNevertheless, many approaches have shown significant competence in specific areas. \n\n\f774 \n\nJ. S. D. Bonet and P. A. Viola \n\nPerhaps the most influential statistical model for generic images is the Markov random field (MRF) (Geman and Geman, 1984). MRFs define a distribution over images that is based on simple and local interactions between pixels. Though MRFs have been used very successfully for restoration of images, their generative properties are weak. This is due to the inability of MRFs to capture long range (low frequency) interactions between pixels. Recently there has been a great deal of interest in hierarchical models such as the Helmholtz machine (Hinton et al., 1995; Dayan et al., 1995). Though the Helmholtz machine can be trained to discover long range structure, it is not easily applied to natural images. \n\nMulti-scale wavelet models have emerged as an effective technique for modeling realistic natural images. These techniques hypothesize that the wavelet transform measures the underlying causes of natural images, which are assumed to be statistically independent. The primary evidence for this conjecture is that the coefficients of wavelet transformed images are uncorrelated and low in entropy (hence the success of wavelet compression). These insights have been used for noise reduction (Donoho and Johnstone, 1993; Simoncelli and Adelson, 1996), and example driven texture synthesis (Heeger and Bergen, 1995). The main drawback of wavelet algorithms is the assumption of complete independence between coefficients. We conjecture that in fact there is strong cross-scale dependence between the wavelet coefficients of an image, which is consistent with observations in (De Bonet, 1997) and (Buccigrossi and Simoncelli, 1997). 
\n\n2 Multi-scale Statistical Models \n\nMulti-scale wavelet techniques assume that images are a linear transform of a collection of statistically independent random variables: $I = W^{-1} C$, where $I$ is an image, $W^{-1}$ is the inverse wavelet transform, and $C = \\{C_k\\}$ is a vector of random variable \"causes\" which are assumed to be independent. The distribution of each cause $C_k$ is $p_k(\\cdot)$, and since the $C_k$'s are independent it follows that $p(C) = \\prod_k p_k(C_k)$. Various wavelet transforms have been developed, but all share the same type of multi-scale structure: each row of the wavelet matrix $W$ is a spatially localized filter that is a shifted and scaled version of a single basis function. \n\nThe wavelet transform is most efficiently computed as an iterative convolution using a bank of filters. First a \"pyramid\" of low frequency downsampled images is created: $G_0 = I$, $G_1 = 2{\\downarrow}(g \\otimes G_0)$, and $G_{i+1} = 2{\\downarrow}(g \\otimes G_i)$, where $2{\\downarrow}$ downsamples an image by a factor of 2 in each dimension, $\\otimes$ is the convolution operation, and $g$ is a low pass filter. At each level a series of filter functions is applied: $F_j^i = f_i \\otimes G_j$, where the $f_i$'s are various types of filters. Computation of the $F_j^i$'s is a linear transformation that can be thought of as a single matrix $W$. With careful selection of $g$ and $f_i$ this matrix can be constructed so that $W^{-1} = W^T$ (Simoncelli et al., 1992)$^1$. Where convenient we will combine the pixels of the feature images $F_j^i(x, y)$ into a single cause vector $C$. \n\nThe expected distribution of causes, $C_k$, is a function of the image classes that are being modeled. For example it is possible to attempt to model the space of all natural images. In that case it appears as though the most accurate $p_k(\\cdot)$'s are highly kurtotic, which indicates that the $C_k$'s are most often zero but in rare cases take on very large values (Donoho and Johnstone, 1993; Simoncelli and Adelson, 1996). 
This is in direct contrast to the distribution of $C_k$'s for white-noise images, which is Gaussian. The difference in these distributions can be used as the basis of noise reduction algorithms, by reducing the wavelet coefficients which are more likely to be noise than signal. \n\n$^1$Computation of the inverse wavelet transform is algorithmically similar to the computation of the forward wavelet transform. \n\n\fNon-Parametric Multi-Scale Statistical Image Model \n\n775 \n\nSpecific image classes can be modeled using similar methods (Heeger and Bergen, 1995)$^2$. For a given set of input images the empirical distribution of the $C_k$'s is observed. To generate a novel example of a texture a new set of causes, $C'$, is sampled from the assumed independent empirical distributions $p_k(\\cdot)$. The generated images are computed using the inverse wavelet transform: $I' = W^{-1} C'$. Bergen and Heeger have used this approach to build a probabilistic model of a texture from a single example image. To do this they assume that textures are spatially ergodic, that is, that the expected distribution is not a function of position in the image. As a result the pixels in any one feature image, $F_j^i(x, y)$, are samples from the same distribution and can be combined$^3$. \n\nHeeger and Bergen's work is at or near the current state of the art in texture generation. Figure 1 contains some example textures. Notice, however, that this technique is much better at generating smooth or noise-like textures than those with well defined structure. Image structures, such as the sharp edges at the border of the tiles in the rightmost texture, cannot be modeled with their approach. These image features directly contradict the assumption that the wavelet coefficients, or causes, of the image are independent. \n\nFor many types of natural images the coefficients of the wavelet transform are not independent, for example images which contain long edges. 
While wavelets are local both in frequency and space, a long edge is local in neither frequency nor space. As a result the wavelet representation of such a feature requires many coefficients. The high frequencies of the edge are captured by many small high frequency wavelets. The long scale is captured by a number of larger low frequency wavelets. A model which assumes these coefficients are independent can never accurately model images which contain these non-local features. Conversely a model which captures the conditional dependencies between coefficients will be much more effective. We chose to approximate the joint distribution of coefficients as a chain, in which coefficients that occur higher in the wavelet pyramid condition the distribution of coefficients at lower levels (i.e. low frequencies condition the generation of higher frequencies). \n\nFor every pixel in an image define the parent vector of that pixel: \n\n$$\\vec{V}(x, y) = \\Big[ F_0^0(x, y), F_0^1(x, y), \\ldots, F_0^N(x, y), F_1^0(\\lfloor \\frac{x}{2} \\rfloor, \\lfloor \\frac{y}{2} \\rfloor), F_1^1(\\lfloor \\frac{x}{2} \\rfloor, \\lfloor \\frac{y}{2} \\rfloor), \\ldots, F_1^N(\\lfloor \\frac{x}{2} \\rfloor, \\lfloor \\frac{y}{2} \\rfloor), \\ldots, F_M^0(\\lfloor \\frac{x}{2^M} \\rfloor, \\lfloor \\frac{y}{2^M} \\rfloor), F_M^1(\\lfloor \\frac{x}{2^M} \\rfloor, \\lfloor \\frac{y}{2^M} \\rfloor), \\ldots, F_M^N(\\lfloor \\frac{x}{2^M} \\rfloor, \\lfloor \\frac{y}{2^M} \\rfloor) \\Big] \\qquad (1)$$ \n\nwhere $M$ is the top level of the pyramid and $N$ is the number of features. Rather than generating each of these coefficients independently, we define a chain across scale. In this chain the generation of the lower levels depends on the higher levels: \n\n$$p(\\vec{V}(x, y)) = p(\\vec{V}_M(x, y)) \\times p(\\vec{V}_{M-1}(x, y) \\mid \\vec{V}_M(x, y)) \\times p(\\vec{V}_{M-2}(x, y) \\mid \\vec{V}_{M-1}(x, y), \\vec{V}_M(x, y)) \\times \\cdots \\times p(\\vec{V}_0(x, y) \\mid \\vec{V}_1(x, y), \\ldots, \\vec{V}_{M-1}(x, y), \\vec{V}_M(x, y)) \\qquad (2)$$ \n\n$^2$See (Zhu, Wu and Mumford, 1996) for a related but more formal model. \n\n$^3$Their generation process is slightly more complex than this, involving an iteration designed to match the pixel histogram. 
The implementation used for generating the images in Figure 1 incorporates this, but we do not discuss it here. \n\nFigure 1: Synthesis results for the Heeger and Bergen (1995) model. Top: Input textures. Bottom: Synthesis results. This technique is much better at generating fine or noisy textures than it is at generating textures which require co-occurrence of wavelets at multiple scales. \n\nFigure 2: Synthesis results using our technique for the input textures shown in Figure 1 (Top). \n\nIn (2), $\\vec{V}_l(x, y)$ is the subset of the elements of $\\vec{V}(x, y)$ computed from $G_l$. Usually we will assume ergodicity, i.e. that $p(\\vec{V}(x, y))$ is independent of $x$ and $y$. The generative process starts from the top of the pyramid, choosing values for the $\\vec{V}_M(x, y)$ at all points. Once these are generated the values at the next level, $\\vec{V}_{M-1}(x, y)$, are generated. The process continues until all of the wavelet coefficients are generated. Finally the image is computed using the inverse wavelet transform. \n\nIt is important to note that this probabilistic model is not made up of a collection of independent chains, one for each $\\vec{V}(x, y)$. Parent vectors for neighboring pixels have substantial overlap, as coefficients in the higher pyramid levels (which are lower resolution) are shared by neighboring pixels at lower pyramid levels. Thus, the generation of nearby pixels will be strongly dependent. In a related approach a similar arrangement of generative chains has been termed a Markov tree (Basseville et al., 1992). \n\n2.1 Estimating the Conditional Distributions \n\nThe additional descriptive power of our generative model does not come without cost. The conditional distributions that appear in (2) must be estimated from observations. We choose to do this directly from the data in a non-parametric fashion. 
Given a sample of parent vectors $\\{\\vec{S}(x, y)\\}$ from an example image we estimate the conditional distribution as a ratio of Parzen window density estimators: \n\n$$p(\\vec{V}_l(x, y) \\mid \\vec{V}_{l+1}^M(x, y)) = \\frac{p(\\vec{V}_l^M(x, y))}{p(\\vec{V}_{l+1}^M(x, y))} \\approx \\frac{\\sum_{x', y'} R(\\vec{V}_l^M(x, y), \\vec{S}_l^M(x', y'))}{\\sum_{x', y'} R(\\vec{V}_{l+1}^M(x, y), \\vec{S}_{l+1}^M(x', y'))} \\qquad (3)$$ \n\nwhere $\\vec{V}_l^k(x, y)$ is a subset of the parent vector $\\vec{V}(x, y)$ that contains information from level $l$ to level $k$, and $R(\\cdot)$ is a function of two vectors that returns maximal values when the vectors are similar and smaller values when the vectors are dissimilar. We have explored various $R(\\cdot)$ functions. In the results presented the $R(\\cdot)$ function returns a fixed constant $1/z$ if all of the coefficients of the vectors are within some threshold $\\theta$ and zero otherwise. Given this simple definition for $R(\\cdot)$, sampling from $p(\\vec{V}_l(x, y) \\mid \\vec{V}_{l+1}^M(x, y))$ is very straightforward: find all $x', y'$ such that $R(\\vec{S}_{l+1}^M(x', y'), \\vec{S}_{l+1}^M(x, y)) = 1/z$ and pick from among them to set $\\vec{V}_l(x, y) = \\vec{S}_l(x', y')$. \n\n3 Experiments \n\nWe have applied this approach to the problems of texture generation, texture recognition, target recognition, and signal de-noising. In each case our results are competitive with the best published approaches. \n\nIn Figure 2 we show the results of our technique on the textures from Figure 1. For these textures we are better able to model features which are caused by a conjunction of wavelets. This is especially striking in the rightmost texture where the geometrical tiling is almost, but not quite, preserved. In our model, knowledge of the joint distribution provides constraints which are critical in the overall perceived appearance of the synthesized texture. \n\nUsing this same model, we can measure the textural similarity between a known and novel image. 
We do this by measuring the likelihood of generating the parent vectors in the novel image under the chain model of the known image. On \"easy\" data sets, such as the MeasTex Brodatz texture test suite, performance is slightly higher than that of other techniques: our approach achieved 100% correct classification compared to 97% achieved by a Gaussian MRF approach (Chellappa and Chatterjee, 1985). The MeasTex lattice test suite is slightly more difficult because each texture is actually a composition of textures containing different spatial frequencies. Our approach achieved 97% while the best alternate method, in this case the Gabor Convolution Energy method (Fogel and Sagi, 1989), achieved 89%. Gaussian MRFs explicitly assume that the texture is a unimodal distribution and as a result achieve only 79% correct recognition. We also measured performance on a set of 20 types of natural texture and compared the classification power of this model to that of human observers (humans discriminate textures extremely accurately). On this test, humans achieved 86% accuracy, our approach achieved an accuracy of 81%, and GMRFs achieved 68%. \n\nFigure 3: (Original) the original image; (Noised) the image corrupted with white Gaussian noise (SNR 8.9 dB); (Denoise Shrinkage) the results of de-noising using wavelet shrinkage or coring (Donoho and Johnstone, 1993; Simoncelli and Adelson, 1996) (SNR 9.8 dB); (Shrinkage Residual) the residual error between the shrinkage de-noised result and the original (notice that the error contains a great deal of interpretable structure); (Denoise Ours) our de-noising approach (SNR 13.2 dB); and (Our Residual) the residual error of our approach (these errors are much less structured). 
\n\nA strong probabilistic model for images can be used to perform a variety of image processing tasks including de-noising and sharpening. De-noising of an observed image $\\hat{I}$ can be performed by Monte Carlo averaging: draw a number of sample images according to the prior density $P(I)$, compute the likelihood of the noise for each image $P(\\nu = (\\hat{I} - I))$, and then find the weighted average over these images. The weighted average is the estimated mean over all possible ways that the image might have been generated given the observation. \n\nImage de-noising frequently relies on generic image models which simply enforce image smoothness. These priors either leave a lot of residual noise or remove much of the original image. In contrast, we construct a probability density model from the noisy image itself. In effect we assume that the image is redundant, containing many examples of the same visual structures, as if it were a texture. The value of this approach is directly related to the redundancy in the image. If the redundancy in the image is very low, then the parent structures will be everywhere different, and the only resampled image with significant likelihood will be the original image. But if there is some redundancy in the image, as might arise from a regular texture or smoothly varying patch, the resampling will freely average across these similar regions. This will have the effect of reducing noise in these images. In Figure 3 we show results of this de-noising approach. \n\n4 Conclusions \n\nWe have presented a statistical model of texture which can be trained using example images. The form of the model is a conditional chain across scale on a pyramid of wavelet coefficients. The cross-scale conditional distributions are estimated non-parametrically. 
This is important because many of the observed conditional distributions are complex and contain multiple modes. We believe that there are two main weaknesses of the current approach: i) the tree on which the distributions are defined is fixed and non-overlapping; and ii) the conditional distributions are estimated from a small number of samples. We hope to address these limitations in future work. \n\nAcknowledgments \n\nIn this research, Jeremy De Bonet is supported by the DOD Multidisciplinary Research Program of the University Research Initiative, and Paul Viola by Office of Naval Research Grant No. N00014-96-1-0311. \n\nReferences \n\nBasseville, M., Benveniste, A., Chou, K. C., Golden, S. A., Nikoukhah, R., and Willsky, A. S. (1992). Modeling and estimation of multiresolution stochastic processes. IEEE Transactions on Information Theory, 38(2):766-784. \n\nBuccigrossi, R. W. and Simoncelli, E. P. (1997). Progressive wavelet image coding based on a conditional probability model. In Proceedings ICASSP-97, Munich, Germany. \n\nChellappa, R. and Chatterjee, S. (1985). Classification of textures using Gaussian Markov random fields. In Proceedings of the International Joint Conference on Acoustics, Speech and Signal Processing, volume 33, pages 959-963. \n\nDayan, P., Hinton, G., Neal, R., and Zemel, R. (1995). The Helmholtz machine. Neural Computation, 7:1022-1037. \n\nDe Bonet, J. S. (1997). Multiresolution sampling procedure for analysis and synthesis of texture images. In Computer Graphics. ACM SIGGRAPH. \n\nDonoho, D. L. and Johnstone, I. M. (1993). Adaptation to unknown smoothness via wavelet shrinkage. Technical report, Stanford University, Department of Statistics. Also Tech. Report 425. \n\nDuda, R. and Hart, P. (1973). Pattern Classification and Scene Analysis. John Wiley and Sons. \n\nFogel, I. and Sagi, D. (1989). Gabor filters as texture discriminator. Biological Cybernetics, 61:103-113. 
\n\nGeman, S. and Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6:721-741. \n\nHeeger, D. J. and Bergen, J. R. (1995). Pyramid-based texture analysis/synthesis. In Computer Graphics Proceedings, pages 229-238. \n\nHinton, G., Dayan, P., Frey, B., and Neal, R. (1995). The \"wake-sleep\" algorithm for unsupervised neural networks. Science, 268:1158-1161. \n\nSimoncelli, E. P. and Adelson, E. H. (1996). Noise removal via Bayesian wavelet coring. In IEEE Third Int'l Conf on Image Processing, Lausanne, Switzerland. IEEE. \n\nSimoncelli, E. P., Freeman, W. T., Adelson, E. H., and Heeger, D. J. (1992). Shiftable multiscale transforms. IEEE Transactions on Information Theory, 38(2):587-607. \n\nZhu, S. C., Wu, Y., and Mumford, D. (1996). Filters, random fields and maximum entropy (FRAME): To a unified theory for texture modeling. To appear in Int'l Journal of Computer Vision.", "award": [], "sourceid": 1450, "authors": [{"given_name": "Jeremy", "family_name": "De Bonet", "institution": null}, {"given_name": "Paul", "family_name": "Viola", "institution": null}]}