{"title": "Scaling Laws in Natural Scenes and the Inference of 3D Shape", "book": "Advances in Neural Information Processing Systems", "page_first": 1089, "page_last": 1096, "abstract": null, "full_text": "Scaling Laws in Natural Scenes and the Inference of 3D Shape\n\nBrian Potetz Department of Computer Science Center for the Neural Basis of Cognition Carnegie Mellon University Pittsburgh, PA 15213 bpotetz@cs.cmu.edu\n\nTai Sing Lee Department of Computer Science Center for the Neural Basis of Cognition Carnegie Mellon University Pittsburgh, PA 15213 tai@cnbc.cmu.edu\n\nAbstract\nThis paper explores the statistical relationship between natural images and their underlying range (depth) images. We look at how this relationship changes over scale, and how this information can be used to enhance low resolution range data using a full resolution intensity image. Based on our findings, we propose an extension to an existing technique known as shape recipes [3], and the success of the two methods are compared using images and laser scans of real scenes. Our extension is shown to provide a two-fold improvement over the current method. Furthermore, we demonstrate that ideal linear shape-from-shading filters, when learned from natural scenes, may derive even more strength from shadow cues than from the traditional linear-Lambertian shading cues.\n\n1\n\nIntroduction\n\nThe inference of depth information from single images is typically performed by devising models of image formation based on the physics of light interaction and then inverting these models to solve for depth. Once inverted, these models are highly underconstrained, requiring many assumptions such as Lambertian surface reflectance, smoothness of surfaces, uniform albedo, or lack of cast shadows. Little is known about the relative merits of these assumptions in real scenes. 
A statistical understanding of the joint distribution of real images and their underlying 3D structure would allow us to replace these assumptions and simplifications with probabilistic priors based on real scenes. Furthermore, statistical studies may uncover entirely new sources of information that are not obvious from physical models. Real scenes are shaped by many regularities in the environment, such as the natural geometry of objects, the arrangements of objects in space, natural distributions of light, and regularities in the position of the observer. Few current shape inference algorithms make use of these trends. Despite the potential usefulness of statistical models and the growing success of statistical methods in vision, few studies have examined the statistical relationship between images and range (depth) images. Those studies that have examined this relationship in nature have uncovered meaningful and exploitable statistical trends in real scenes, which may be useful for designing new surface inference algorithms and for understanding how humans perceive depth in real scenes [6, 4, 8].

In this paper, we explore some of the properties of the statistical relationship between images and their underlying range (depth) images in real scenes, using images acquired by laser scanner in natural environments. Specifically, we will examine the cross-covariance between images and range images, and how this structure changes over scale. We then illustrate how our statistical findings can be applied to inference problems by analyzing and extending the shape recipe depth inference algorithm.

2 Shape recipes

We will motivate our statistical study with an application. Often, we may have a high-resolution color image of a scene, but only a low spatial resolution range image (range images record the 3D distance between the scene and the camera at each pixel).
This often happens when the range image was acquired by applying a stereo depth inference algorithm: stereo algorithms rely on smoothness constraints, either explicitly or implicitly, and so the high-frequency components of the resulting range image are not reliable [1, 7]. Low-resolution range data may also be the output of a laser range scanner, if the scanner is inexpensive or if the scan must be acquired quickly (range scanners typically acquire each pixel sequentially, taking up to several minutes for a high-resolution scan). It should be possible to improve our estimate of the high spatial frequencies of the range image by using monocular cues from the high-resolution intensity (or color) image. Shape recipes [3, 9] provide one way of doing this. The basic principle of shape recipes is that a relationship between shape and light intensity can be learned from the low-resolution image pair, and then extrapolated and applied to the high-resolution intensity image to infer the high spatial frequencies of the range image. One advantage of this approach is that hidden variables important to inference from monocular cues, such as illumination direction and material reflectance properties, might be implicitly learned from the low-resolution range and intensity images. However, for this approach to work, we require some model of how the relationship between shape and intensity changes over scale, which we discuss below. For shape recipes, both the high-resolution intensity image and the low-resolution range image are decomposed into steerable wavelet filter pyramids, linearly breaking each image down according to scale and orientation [2]. Linear regression is then used between the highest-frequency band of the available low-resolution range image and the corresponding band of the intensity image, to learn a linear filter that best predicts the range band from the image band.
The hypothesis of the model is that this filter can then be used to predict high-frequency range bands from the high-frequency image bands. We describe the implementation in more detail below. Let i_{m,θ} and z_{m,θ} be steerable filter pyramid subbands of the intensity and range image respectively, at spatial resolution m and orientation θ (both are integers). Number the band levels so that m = 0 is the highest-frequency subband of the intensity image, and m = n is the highest available frequency subband of the low-resolution range image. Thus, higher level numbers correspond to lower spatial frequencies. Shape recipes work by learning a linear filter k_{n,θ} at level n that minimizes the sum-squared error Σ (z_{n,θ} - k_{n,θ} * i_{n,θ})², where * denotes convolution. Higher-resolution subbands of the range image are inferred by:

  ẑ_{m,θ} = (1/c^{n-m}) (k_{n,θ} * i_{m,θ})   (1)

where c = 2. The choice of c = 2 in the shape recipe model is motivated by the linear Lambertian shading model [9]. We will discuss this choice of constant in section 3. The underlying assumption of shape recipes is that the convolution kernel k_{m,θ} should be roughly constant over the four highest-resolution bands of the steerable filter pyramid. This is based on the idea that shape recipe kernels should vary slowly over scale. In this section, we show mathematically that this model is internally inconsistent. To do this, we first re-express the shape recipe process in the Fourier domain. The operations of shape recipes (pyramid decomposition, convolution, and image reconstruction) are all linear, and so they can be combined into a single linear convolution. In other words, we can think of shape recipes as inferring the high-resolution range data z_high via a single convolution:

  Z_high(u, v) = I(u, v) K_recipe(u, v)   (2)

where I is the Fourier transform of the intensity image i. (In general, we will use capital letters to denote functions in the Fourier domain.)
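The band-level regression and prediction of equation 1 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the steerable-pyramid decomposition is omitted (bands are passed in as plain arrays), the filter is learned as a correlation kernel, and all function names are ours.

```python
import numpy as np

def learn_kernel(i_band, z_band, size=7):
    """Least-squares filter k minimizing, over all interior pixels,
    (z[y, x] - sum(k * patch of i centered at (y, x)))^2."""
    h = size // 2
    H, W = i_band.shape
    A, b = [], []  # one row per pixel: flattened intensity patch -> range value
    for y in range(h, H - h):
        for x in range(h, W - h):
            A.append(i_band[y - h:y + h + 1, x - h:x + h + 1].ravel())
            b.append(z_band[y, x])
    k, *_ = np.linalg.lstsq(np.asarray(A), np.asarray(b), rcond=None)
    return k.reshape(size, size)

def apply_kernel(k, band):
    """Sliding-window correlation with k (borders left at zero)."""
    h = k.shape[0] // 2
    H, W = band.shape
    out = np.zeros_like(band)
    for y in range(h, H - h):
        for x in range(h, W - h):
            out[y, x] = np.sum(k * band[y - h:y + h + 1, x - h:x + h + 1])
    return out

def infer_band(k, i_band, n_minus_m, c=2.0):
    """Equation 1: apply the level-n recipe to the finer band at level m,
    scaled down by c^(n-m)."""
    return apply_kernel(k, i_band) / c ** n_minus_m
```

In practice the regression is done once on the coarsest available range band, and `infer_band` is then applied to the finer intensity bands with n_minus_m = 1, 2, ...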
K_recipe is a filter in the Fourier domain, of the same size as the image, whose construction is discussed below. Note that K_recipe is zero in the low-frequency bands where Z_low is available. Once z_high (the inverse Fourier transform of Z_high) is estimated, it can be combined with the known low-resolution range data simply by adding the two together: z_recipe(x, y) = z_low(x, y) + z_high(x, y). For shorthand, we will write I(u, v) conj(I(u, v)) as II(u, v) and Z(u, v) conj(I(u, v)) as ZI(u, v), where conj denotes complex conjugation. II is also known as the power spectrum, and it is the Fourier transform of the autocorrelation of the intensity image. ZI is the Fourier transform of the cross-correlation between the intensity and range images, and it has both real and imaginary parts. Let K = ZI/II. Observe that I·K is a perfect reconstruction of the original high-resolution range image (as long as II(u, v) ≠ 0). Because we do not have the full-resolution range image, we can only compute the low spatial frequencies of ZI(u, v). Let K_low = ZI_low/II, where ZI_low is the Fourier transform of the cross-correlation between the low-resolution range image and a low-resolution version of the intensity image. K_low is zero in the high-frequency bands. We can then think of K_recipe as an approximation of K = ZI/II formed by extrapolating K_low into the higher spatial frequencies. In the appendix, we show that shape recipes implicitly perform this extrapolation by learning the highest available frequency octave of K_low, and duplicating this octave into all successive octaves of K_recipe, multiplied by a scale factor. However, there is a problem with this approach. First, there is no reason to expect that features in the range/intensity relationship should repeat once every octave. Figure 1a shows a plot of ZI from a scene in our database of ground-truth range data (to be described in section 3). The fine structures in real[K] do not duplicate themselves every octave.
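The quantities just defined, and the exact-reconstruction property of K = ZI/II, can be computed directly. A minimal NumPy sketch; the small eps regularizer is our addition to avoid dividing by near-zero bins of the power spectrum:

```python
import numpy as np
from numpy.fft import fft2, ifft2

def spectra(i_img, z_img, eps=1e-8):
    """Return II (power spectrum), ZI (cross-spectrum), and K = ZI / II."""
    I = fft2(i_img - i_img.mean())
    Z = fft2(z_img - z_img.mean())
    II = (I * np.conj(I)).real
    ZI = Z * np.conj(I)
    # With the regularizer, I*K = Z * II/(II + eps): exact wherever II >> eps.
    K = ZI / (II + eps)
    return II, ZI, K

def reconstruct(i_img, K, z_mean=0.0):
    """Recover the range image from the intensity image and K."""
    I = fft2(i_img - i_img.mean())
    return ifft2(I * K).real + z_mean
```

When K is computed from the full-resolution pair, `reconstruct` returns the original range image (up to bins where II ≈ 0); the shape recipe problem is precisely that only the low-frequency part of ZI, and hence of K, is available.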
Second and more importantly, octave duplication violates Freeman and Torralba's assumption that shape recipe kernels should change slowly over scale, which we take to mean over all scales, not just over successive octaves. Even if octave 2 of K is made identical to octave 1, it is mathematically impossible for fractional octaves of K, such as octave 1.5, to also be identical unless ZI/II is completely smooth and devoid of fine structure. The fine structures in K therefore cannot possibly generalize over all scales. In the next section, we use laser scans of real scenes to study the joint statistics of range and intensity images in greater detail, and use our results to form a statistically motivated model of ZI. We believe that a greater understanding of the joint distribution of natural images and their underlying 3D structure will have a broad impact on the development of robust depth inference algorithms, and also on the understanding of human depth perception. More immediately, our statistical observations lead to a more accurate way to extrapolate K_low, which in turn results in a more accurate shape recipe method.

3 Scaling laws in natural scene statistics

To study the correlational structures between depth and intensity in natural scenes, we have collected a database of coregistered intensity and high-resolution range images (corresponding pixels of the two images correspond to the same point in space). Scans were collected using the Riegl LMS-Z360 laser range scanner with integrated color photosensor.

Figure 1: a) A log-log polar plot of |real[ZI]| from a scene in our database. ZI contains extensive fine structures that do not repeat at each octave. However, along all orientations, the general form of |real[ZI]| is a power-law.
|imag[ZI]| similarly obeys a power-law. b) A plot of B_K(θ) for the scene in figure 2; real[B_K(θ)] is drawn in black and imag[B_K(θ)] in grey. This plot is typical of most scenes in our database. As predicted by equation 4, imag[B_K(θ)] reaches its minimum at the illumination direction (in this case, to the extreme left, almost 180°). Also typical is that real[B_K(θ)] is uniformly negative, most likely caused by cast shadows in object concavities [6].

Scans were taken of a variety of rural and urban scenes. All images were taken outdoors, under sunny conditions, while the scanner was level with the ground. The shape recipe model was intended for scenes with homogeneous albedo and surface material. To test the algorithm on real scenes of this type, we selected 28 single-texture image sections from our database. These textures include statue surfaces and faceted building exteriors, such as archways and church facades (12 scenes), rocky terrain and rock piles (8), and leafy foliage (8). No logarithm or other transformation was applied to the intensity or range data (measured in meters), as this would interfere with the Lambertian model that motivates the shape recipe technique. The average size of these textures was 172,669 pixels per image. We show a log-log polar plot of |real[ZI(r, θ)]| from one image in our database in figure 1a. As can be seen in the figure, this structure appears to closely follow a power-law. We claim that ZI can be reasonably modeled by B(θ)/r^α, where r is spatial frequency in polar coordinates, and B(θ) is a parameter of the model (with both real and imaginary parts) that depends only on polar angle θ. We test this claim by dividing the Fourier plane into four 45° octants (vertical, forward diagonal, horizontal, and backward diagonal), and measuring the drop-off rate α in each octant separately. For each octant, we average over the octant's included orientations and fit the result to a power-law.
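This fitting step amounts to a linear regression in log-log coordinates on radially averaged spectra. A simplified sketch that averages over all orientations at once rather than per 45° octant; the bin edges and radius range are illustrative choices of ours:

```python
import numpy as np

def powerlaw_exponent(spec, r_min=2, r_max=None):
    """Fit |spec(r)| ~ B / r^alpha by least squares in log-log coordinates,
    after averaging |spec| over annuli of radius [b, b+1)."""
    H, W = spec.shape
    u = np.fft.fftfreq(H)[:, None] * H   # integer frequency coordinates
    v = np.fft.fftfreq(W)[None, :] * W
    r = np.hypot(u, v)
    if r_max is None:
        r_max = min(H, W) // 2
    bins = np.arange(r_min, r_max)
    radial = np.array([np.abs(spec)[(r >= b) & (r < b + 1)].mean()
                       for b in bins])
    slope, _ = np.polyfit(np.log(bins + 0.5), np.log(radial), 1)
    return -slope  # alpha: the power-law drop-off rate
```

Restricting the mask to one octant of orientations before averaging gives the per-octant exponents reported below.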
The resulting values of α (averaged over all 28 images) are listed in the table below:

               horizontal    forward diag.  vertical      backward diag.  mean
  α(II)        2.47 ± 0.10   2.61 ± 0.11    2.76 ± 0.11   2.56 ± 0.09     2.60 ± 0.10
  α(real[ZI])  3.61 ± 0.18   3.67 ± 0.17    3.62 ± 0.15   3.69 ± 0.17     3.65 ± 0.14
  α(imag[ZI])  3.84 ± 0.19   3.95 ± 0.17    3.61 ± 0.24   3.84 ± 0.23     3.87 ± 0.16
  α(ZZ)        2.84 ± 0.11   2.92 ± 0.11    2.89 ± 0.11   2.86 ± 0.10     2.88 ± 0.10

For each octant, the correlation coefficient between the power-law fit and the actual spectrum ranged from 0.91 to 0.99, demonstrating that each octant is well fit by a power-law (note that averaging over orientation smooths out some fine structures in each spectrum). Furthermore, α varies little across orientations, showing that our model fits ZI closely. The above findings predict that K = ZI/II also obeys a power-law. Subtracting α(II) from α(real[ZI]) and α(imag[ZI]), we find that real[K] drops off as 1/r^1.1 and imag[K] drops off as 1/r^1.2. Thus, we have that K(r, θ) ≈ B_K(θ)/r.

Now that we know that K can be fit (roughly) by a 1/r power-law, we can offer some insight into why K tends to approximate this general form. The 1/r drop-off in the imaginary part of K can be explained by the linear Lambertian model of shading under oblique lighting conditions. This argument was used by Freeman and Torralba [9] in their theoretical motivation for choosing c = 2. The linear Lambertian model is obtained by taking only the linear terms of the Taylor series of the Lambertian equation. Under this model, if constant albedo is assumed and no occlusion is present, then with lighting from above, i(x, y) = a ∂z/∂y, where a is some constant. In the Fourier domain, I(u, v) = a 2πjv Z(u, v), where j = √-1. Thus, we have that

  ZI(r, θ) = (-j / (a 2π r sin θ)) II(r, θ)   (3)

  K(r, θ) = -j (1/r) (1 / (a 2π sin θ))   (4)

In other words, under this model, K obeys a 1/r power-law. This means that each octave of K is half of the octave before it.
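Equations 3 and 4 are easy to check numerically: if an image is synthesized as the spectral y-derivative of a random surface (the linear-Lambertian model with lighting from above), the cross-spectrum ZI comes out purely imaginary, and K = ZI/II equals -j/(a·2π·v). A small sketch under exactly those idealized assumptions (the grid size and the constant a are arbitrary):

```python
import numpy as np
from numpy.fft import fft2, ifft2, fftfreq

rng = np.random.default_rng(2)
N, a = 32, 0.7                       # grid size and shading constant
z = rng.standard_normal((N, N))      # random "surface"
fy = fftfreq(N)[:, None]             # frequency along the vertical (y) axis

# Linear Lambertian model, lighting from above: i = a * dz/dy, computed
# spectrally so that I = a * 2*pi*j*fy * Z holds exactly.
Z = fft2(z)
I = a * 2j * np.pi * fy * Z
i = ifft2(I).real                    # the synthesized shaded image

ZI = Z * np.conj(I)                  # cross-spectrum (eq. 3 predicts: imaginary)
II = (I * np.conj(I)).real           # power spectrum

# Only bins where the model is defined (fy != 0, II not vanishingly small).
mask = np.broadcast_to(np.abs(fy) > 0, (N, N)) & (II > 1e-12)
K = ZI[mask] / II[mask]
with np.errstate(divide="ignore", invalid="ignore"):
    K_pred = np.broadcast_to(-1j / (a * 2 * np.pi * fy), (N, N))[mask]
```

Under this model the real part of ZI vanishes identically, which is exactly the prediction the next paragraph contrasts with the data.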
Our empirical finding that the imaginary part of K obeys a 1/r power-law confirms Freeman and Torralba's reasoning behind choosing c = 2 for shape recipes. However, the linear Lambertian shading model predicts that only the imaginary part of ZI should obey a power-law. In fact, according to equation 3, this model predicts that the real part of ZI should be zero. Yet, in our database, the real part of ZI was typically stronger than the imaginary part. The real part of ZI is the Fourier transform of the even-symmetric part of the cross-correlation function, and it includes the direct correlation cov[i, z]. In a previous study of the statistics of natural range images [6], we found that darker pixels in the image tend to be farther away, resulting in a significantly negative cov[i, z]. We attributed this phenomenon to cast shadows in complex scenes: object interiors and concavities are farther away than object exteriors, and these regions are the most likely to be in shadow. This effect can be observed wherever shadows are found, such as in the crevices of figure 2a. However, the effect appears strongest in complex objects with many shadows and concavities, like folds of cloth or foliage. We found that the real part of ZI is especially likely to be strongly negative in images of foliage. Such correlation between depth and darkness has been predicted theoretically for diffuse lighting conditions, such as cloudy days, when viewed from directly above [5]. The fact that all of our images were taken under cloudless, sunny conditions and with oblique lighting from above suggests that this cue may be more important than first realized. Psychophysical experiments have demonstrated that in the absence of all other cues, darker image regions appear farther away, suggesting that the human visual system makes use of this cue for depth inference (see [6] for a review; see also [10]).
We believe that the 1/r drop-off rate observed in real[K] is due to the fact that concavities with smaller apertures but equal depths tend to be darker. In other words, for a given level of darkness, a smaller aperture corresponds to a shallower hole.

4 Inference using power-law models

Armed with a better understanding of the statistics of real scenes, we are better prepared to develop successful depth inference algorithms. We now know that fine details in ZI/II do not generalize across scales, but that its coarse structure roughly follows a 1/r power-law. We can exploit this statistical trend directly: we simply fit our B_K(θ)/r power-law to ZI_low/II, and then use this estimate of K to reconstruct the high-frequency range data. Specifically, from the low-resolution range and intensity images, we compute low-resolution spectra of ZI and II. From the highest-frequency octave of the low-resolution images, we estimate B_II(θ) and B_ZI(θ). Any standard interpolation method will work to estimate these functions; we chose a cos³(θ + π/4) basis function based on steerable filters [2].

Figure 2: a) An example intensity image from our database. b) A Lambertian rendering of the corresponding low-resolution range image. c) Power-law method output. Shape recipe reconstructions show a similar amount of texture, but tests show that texture generated by the power-law method is more highly correlated with the true texture. d) The imaginary part of K_recipe and e) of K_powerlaw for the same scene. Dark regions are negative, light regions are positive. The grey center region in each estimate of K corresponds to the low spatial frequencies, where range data is not inferred because it is already known. Notice that K_recipe oscillates over scale.

We can now estimate the high spatial frequencies of the range image, z.
Define

  K_powerlaw(r, θ) = F_high(r) (B_ZI(θ) / B_II(θ)) / r   (5)

  Z_powerlaw = Z_low + I · K_powerlaw   (6)

where F_high is the high-pass filter associated with the two highest-resolution bands of the steerable filter pyramid of the full-resolution image.

5 Empirical evaluation

In this section, we compare the performance of shape recipes with our new approach, using our ground-truth database of high-resolution range and intensity image pairs described in section 3. For each range image in our database, a low-resolution (but still full-sized) range image, z_low, was generated by setting the top two steerable filter pyramid levels to zero. Both algorithms accepted as input the low-resolution range image and the high-resolution intensity image, and the output was compared with the original high-resolution range image. The high-resolution output corresponds to a 4-fold increase in spatial resolution (or a 16-fold increase in total size). Although encouraging enhancements of stereo output were reported by the authors, shape recipes have not previously been evaluated against real, ground-truth high-resolution range data. To maximize its performance, we implemented shape recipes using ridge regression, with the ridge coefficient obtained by cross-validation. Linear kernels were learned (and the output evaluated) over a region of the image at least 21 pixels from the image border. For each high-resolution output, we measured the sum-squared error between the reconstruction (z_recipe or z_powerlaw) and the original range image (z). We compared this with the sum-squared error of the low-resolution range image z_low to get the percent reduction in sum-squared error:

  error reduction_recipe = (err_low - err_recipe) / err_low

This measure of error reflects the performance of the method independently of the variance or absolute depth of the range image. On average, shape recipe reconstructions had 1.3% less mean-squared error than z_low. Shape recipes improved 21 of the 28 images.
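The power-law reconstruction of equations 5 and 6 can be sketched as follows. This is our simplification, not the paper's implementation: B_ratio(θ) stands for the fitted, complex-valued B_ZI(θ)/B_II(θ), and F_high is idealized as a sharp radial cutoff rather than the steerable-pyramid high-pass.

```python
import numpy as np
from numpy.fft import fft2, ifft2, fftfreq

def powerlaw_reconstruct(i_img, z_low, B_ratio, r_cut):
    """Eq. 5: K_powerlaw = F_high(r) * B_ratio(theta) / r  (F_high: r > r_cut).
    Eq. 6: z = z_low + ifft(I * K_powerlaw)."""
    H, W = i_img.shape
    u = fftfreq(H)[:, None] * H
    v = fftfreq(W)[None, :] * W
    r = np.hypot(u, v)
    theta = np.arctan2(v, u)
    with np.errstate(divide="ignore", invalid="ignore"):
        K = np.where(r > r_cut, B_ratio(theta) / r, 0.0)   # eq. 5
    z_high = ifft2(fft2(i_img) * K).real
    return z_low + z_high                                   # eq. 6
```

Choosing B_ratio(θ) proportional to -j·sin(θ) reduces this to the linear-Lambertian form of equation 4; the fitted B_ratio from real scenes also carries a negative real part from shadow cues.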
Our new approach had 2.2% less mean-squared error than z_low, and improved 26 of the 28 images. We cannot expect the error reduction values to be very high, partly because our images are highly complex natural scenes, and also because some noise was present in both the range and intensity images. Therefore, it is difficult to assess how much of the remaining error could be recovered by a superior algorithm, and how much is simply due to sensor noise. As a comparison, we generated an optimal linear reconstruction, z_optlin, by learning 11 × 11 shape recipe kernels for the two high-resolution pyramid bands directly from the ground-truth high-resolution range image. This reconstruction provides a loose upper bound on the degree of improvement possible by linear shape methods. We then measured the percentage of linearly achievable improvement for each image:

  improvement_recipe = (err_low - err_recipe) / (err_low - err_optlin)

Shape recipes yielded an average improvement of 23%. Our approach achieved an improvement of 44%, nearly a two-fold enhancement over shape recipes.

6 The relative strengths of shading and shadow cues

Earlier we showed that Lambertian shading alone predicts that the real part of ZI in natural scenes is empty of useful correlations between images and range images. Yet in our database, the real part of ZI, which we believe is related to shadow cues, was often stronger than the imaginary component. Our depth-inference algorithm offers an opportunity to compare the performance of shading cues versus shadow cues. We ran our algorithm again, except that we set the real part of K_powerlaw to zero. This yielded only a 12% improvement. However, when we instead ran the algorithm after setting imag[K] to zero, a 32% improvement was achieved. Thus, 72% of the algorithm's total improvement was due to shadow cues.
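The two evaluation measures used in sections 5 and 6 can be written directly; a small sketch with illustrative numbers (the function and variable names are ours, not the paper's):

```python
import numpy as np

def error_reduction(z_true, z_low, z_rec):
    """Percent reduction in sum-squared error relative to the low-res input:
    (err_low - err_rec) / err_low."""
    err_low = np.sum((z_true - z_low) ** 2)
    err_rec = np.sum((z_true - z_rec) ** 2)
    return (err_low - err_rec) / err_low

def achievable_improvement(z_true, z_low, z_rec, z_optlin):
    """Fraction of the linearly achievable improvement actually obtained:
    (err_low - err_rec) / (err_low - err_optlin)."""
    err_low = np.sum((z_true - z_low) ** 2)
    err_rec = np.sum((z_true - z_rec) ** 2)
    err_opt = np.sum((z_true - z_optlin) ** 2)
    return (err_low - err_rec) / (err_low - err_opt)
```

Both measures are scale-free: multiplying all four range images by a constant leaves them unchanged, which is why they can be averaged across scenes of very different depth ranges.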
When the database is broken down into categories, the real part of ZI is responsible for 96% of total improvement in foliage scenes, 76% in rocky terrain scenes, and 35% in urban scenes (statue surfaces and building facades). As expected, the algorithm relies more heavily on the real part of ZI in environments rich in cast shadows. These results show that shadow cues are far more useful than was previously expected, and also that they can be exploited more easily than was previously thought possible, using only simple linear relationships that might easily be incorporated into linear shape-from-shading techniques. We feel that these insights into natural scene statistics are the most important contributions of this paper.

7 Discussion

The power-law extension to shape recipes not only offers a substantial improvement in performance, but also greatly reduces the number of parameters that must be learned. The original shape recipes required one 11 × 11 kernel, or 121 parameters, for each orientation of the steerable filters. The new algorithm requires only two parameters for each orientation (the real and imaginary parts of B_K(θ)). This suggests that the new approach has captured only those components of K that generalize across scales, disregarding all others. While it is encouraging that the power-law algorithm is highly parsimonious, it also means that fewer scene properties are encoded in the shape recipe kernels than was previously hoped [3]. For example, complex properties of the material and surface reflectance cannot be encoded. We believe that the B_K(θ) parameter of the power-law model can be determined almost entirely by the direction of illumination and the prominence of cast shadows (see figure 1b). This suggests that the power-law algorithm of this paper would work equally well for scenes with multiple materials. To capture more complex material properties, nonlinear and probabilistic methods may achieve greater success.
However, when designing these more sophisticated methods, care must be taken to avoid the same pitfall encountered by shape recipes: not all properties of a scene can be scale-invariant simultaneously.

8 Appendix

Shape recipes infer each high-resolution band of the range image using equation 1. Let σ = 2^{n-m}. If we take the Fourier transform of equation 1, we get

  Z_high F_{m,θ} = (1/c^{n-m}) K_{n,θ}(σu, σv) (I F_{m,θ})   (7)

where F_{m,θ} is the Fourier transform of the steerable filter at level m and orientation θ, and Z_high is the inferred high spatial frequency component of the range image. If we take the steerable pyramid decomposition of Z_high and then transform it back, we get Z_high again, and so: