Part of Advances in Neural Information Processing Systems 17 (NIPS 2004)

*Matthias Franz, Bernhard Schölkopf*

The computation of classical higher-order statistics such as higher-order moments or spectra is difficult for images due to the huge number of terms to be estimated and interpreted. We propose an alternative ap- proach in which multiplicative pixel interactions are described by a se- ries of Wiener functionals. Since the functionals are estimated implicitly via polynomial kernels, the combinatorial explosion associated with the classical higher-order statistics is avoided. First results show that image structures such as lines or corners can be predicted correctly, and that pixel interactions up to the order of five play an important role in natural images.

Most of the interesting structure in a natural image is characterized by its higher-order statistics. Arbitrarily oriented lines and edges, for instance, cannot be described by the usual pairwise statistics such as the power spectrum or the autocorrelation function: From knowing the intensity of one point on a line alone, we cannot predict its neighbouring intensities. This would require knowledge of a second point on the line, i.e., we have to consider some third-order statistics which describe the interactions between triplets of points. Analogously, the prediction of a corner neighbourhood needs at least fourth-order statistics, and so on.

In terms of Fourier analysis, higher-order image structures such as edges or corners are described by phase alignments, i.e. phase correlations between several Fourier components of the image. Classically, harmonic phase interactions are measured by higher-order spectra [4]. Unfortunately, the estimation of these spectra for high-dimensional signals such as images involves the estimation and interpretation of a huge number of terms. For instance, a sixth-order spectrum of a 1616 sized image contains roughly 1012 coefficients, about 1010 of which would have to be estimated independently if all symmetries in the spectrum are considered. First attempts at estimating the higher-order structure of natural images were therefore restricted to global measures such as skewness or kurtosis [8], or to submanifolds of fourth-order spectra [9].

Here, we propose an alternative approach that models the interactions of image points in a series of Wiener functionals. A Wiener functional of order n captures those image components that can be predicted from the multiplicative interaction of n image points. In contrast to higher-order spectra or moments, the estimation of a Wiener model does not require the estimation of an excessive number of terms since it can be computed implicitly

via polynomial kernels. This allows us to decompose an image into components that are characterized by interactions of a given order.

In the next section, we introduce the Wiener expansion and discuss its capability of model- ing higher-order pixel interactions. The implicit estimation method is described in Sect. 2, followed by some examples of use in Sect. 3. We conclude in Sect. 4 by briefly discussing the results and possible improvements.

1 Modeling pixel interactions with Wiener functionals

For our analysis, we adopt a prediction framework: Given a d d neighbourhood of an image pixel, we want to predict its gray value from the gray values of the neighbours. We are particularly interested to which extent interactions of different orders contribute to the overall prediction. Our basic assumption is that the dependency of the central pixel value y on its neighbours xi, i = 1, . . . , m = d2 - 1 can be modeled as a series

```
y = H0[x] + H1[x] + H2[x] + + Hn[x] + (1)
```

of discrete Volterra functionals

```
m m H0[x] = h0 = const. and Hn[x] = h(n) x . . . x . (2) i i i i 1 ...i 1 n n 1 =1 i =1 n
```

Here, we have stacked the grayvalues of the neighbourhood into the vector x = (x1, . . . , xm) Rm. The discrete nth-order Volterra functional is, accordingly, a linear combination of all ordered nth-order monomials of the elements of x with mn coefficients h(n) . Volterra functionals provide a controlled way of introducing multiplicative inter- i1...in actions of image points since a functional of order n contains all products of the input of order n. In terms of higher-order statistics, this means that we can control the order of the statistics used since an nth-order Volterra series leads to dependencies between maximally n + 1 pixels.

Unfortunately, Volterra functionals are not orthogonal to each other, i.e., depending on the input distribution, a functional of order n generally leads to additional lower-order interac- tions. As a result, the output of the functional will contain components that are proportional to that of some lower-order monomials. For instance, the output of a second-order Volterra functional for Gaussian input generally has a mean different from zero [5]. If one wants to estimate the zeroeth-order component of an image (i.e., the constant component created without pixel interactions) the constant component created by the second-order interactions needs to be subtracted. For general Volterra series, this correction can be achieved by de- composing it into a new series y = G0[x] + G1[x] + + Gn[x] + of functionals Gn[x] that are uncorrelated, i.e., orthogonal with respect to the input. The resulting Wiener functionals1 Gn[x] are linear combinations of Volterra functionals up to order n. They are computed from the original Volterra series by a procedure akin to Gram-Schmidt or- thogonalization [5]. It can be shown that any Wiener expansion of finite degree minimizes the mean squared error between the true system output and its Volterra series model [5]. The orthogonality condition ensures that a Wiener functional of order n captures only the component of the image created by the multiplicative interaction of n pixels. In contrast to general Volterra functionals, a Wiener functional is orthogonal to all monomials of lower order [5].

So far, we have not gained anything compared to classical estimation of higher-order mo- ments or spectra: an nth-order Volterra functional contains the same number of terms as

```
1Strictly speaking, the term Wiener functional is reserved for orthogonal Volterra functionals with respect to Gaussian input. Here, the term will be used for orthogonalized Volterra functionals with arbitrary input distributions.
```

the corresponding n + 1-order spectrum, and a Wiener functional of the same order has an even higher number of coefficients as it consists also of lower-order Volterra functionals. In the next section, we will introduce an implicit representation of the Wiener series using polynomial kernels which allows for an efficient computation of the Wiener functionals.

2 Estimating Wiener series by regression in RKHS

Volterra series as linear functionals in RKHS. The nth-order Volterra functional is a weighted sum of all nth-order monomials of the input vector x. We can interpret the evaluation of this functional for a given input x as a map n defined for n = 0, 1, 2, . . . as

```
0(x) = 1 and n(x) = (xn1, xn-1 1 x2, . . . , x1xn-1 2 , xn 2 , . . . , xn ) m (3)
```

such that n maps the input x Rm into a vector n(x) Fn = Rmn containing all mn ordered monomials of degree n. Using n, we can write the nth-order Volterra functional in Eq. (2) as a scalar product in Fn,

```
Hn[x] = n n(x), (4)
```

with the coefficients stacked into the vector n = (h(n) 1,1,..1, h(n) 1,2,..1, h(n) 1,3,..1, . . . ) Fn. The same idea can be applied to the entire pth-order Volterra series. By stacking the maps n into a single map (p)(x) = (0(x), 1(x), . . . , p(x)) , one obtains a mapping from Rm into F(p) = R Rm Rm2 . . . Rmp = RM with dimensionality M = 1-mp+1 . The 1-m entire pth-order Volterra series can be written as a scalar product in F(p)

```
p Hn[x] = ((p)) (p)(x) (5) n=0
```

with (p) F(p). Below, we will show how we can express (p) as an expansion in terms of the training points. This will dramatically reduce the number of parameters we have to estimate.

This procedure works because the space Fn of nth-order monomials has a very special property: it has the structure of a reproducing kernel Hilbert space (RKHS). As a conse- quence, the dot product in Fn can be computed by evaluating a positive definite kernel function kn(x1, x2). For monomials, one can easily show that (e.g., [6])

```
n(x1) n(x2) = (x1 x2)n =: kn(x1, x2). (6)
```

Since F(p) is generated as a direct sum of the single spaces Fn, the associated scalar product is simply the sum of the scalar products in the Fn:

```
p (p)(x1) (p)(x2) = (x1 x2)n = k(p)(x1, x2). (7) n=0
```

Thus, we have shown that the discretized Volterra series can be expressed as a linear func- tional in a RKHS2.

Linear regression in RKHS. For our prediction problem (1), the RKHS property of the Volterra series leads to an efficient solution which is in part due to the so called repre- senter theorem (e.g., [6]). It states the following: suppose we are given N observations

```
2A similar approach has been taken by [1] using the inhomogeneous polynomial kernel k(p) ( inh x1, x2) = (1 + x1 x2)p. This kernel implies a map inh into the same space of monomi- als, but it weights the degrees of the monomials differently as can be seen by applying the binomial theorem.
```

(x1, y1), . . . , (xN , yN ) of the function (1) and an arbitrary cost function c, is a nonde- creasing function on R>0 and . F is the norm of the RKHS associated with the kernel k. If we minimize an objective function

```
c((x1, y1, f (x1)), . . . , (xN , yN , f (xN ))) + ( f F), (8)
```

over all functions in the RKHS, then an optimal solution3 can be expressed as

```
N f (x) = ajk(x, xj), aj R. (9) j=1
```

In other words, although we optimized over the entire RKHS including functions which are defined for arbitrary input points, it turns out that we can always express the solution in terms of the observations xj only. Hence the optimization problem over the extremely large number of coefficients (p) in Eq. (5) is transformed into one over N variables aj.

Let us consider the special case where the cost function is the mean squared error, c(( N x1, y1, f (x1)), . . . , (xN , yN , f (xN ))) = 1 (f (x N j=1 j ) - yj )2, and the regularizer is zero4. The solution for a = (a1, . . . , aN ) is readily computed by setting the derivative of (8) with respect to the vector a equal to zero; it takes the form a = K -1y with the Gram matrix defined as Kij = k(xi, xj), hence5

```
y = f (x) = a z(x) = y K-1z(x), (10)
```

where z(x) = (k(x, x1), k(x, x2), . . . k(x, xN )) RN .

Implicit Wiener series estimation. As we stated above, the pth-degree Wiener expan- sion is the pth-order Volterra series that minimizes the squared error. This can be put into the regression framework: since any finite Volterra series can be represented as a linear functional in the corresponding RKHS, we can find the pth-order Volterra series that min- imizes the squared error by linear regression. This, by definition, must be the pth-degree Wiener series since no other Volterra series has this property6. From Eqn. (10), we obtain the following expressions for the implicit Wiener series 1 p p G0[x] = y 1, G H N n[x] = n[x] = y K-1 p z(p)(x) (11) n=0 n=0

where the Gram matrix Kp and the coefficient vector z(p)(x) are computed using the kernel from Eq. (7) and 1 = (1, 1, . . . ) RN . Note that the Wiener series is represented only implicitly since we are using the RKHS representation as a sum of scalar products with the training points. Thus, we can avoid the "curse of dimensionality", i.e., there is no need to compute the possibly large number of coefficients explicitly.

The explicit Volterra and Wiener expansions can be recovered from Eq. (11) by collecting all terms containing monomials of the desired order and summing them up. The individual nth-order Volterra functionals in a Wiener series of degree p > 0 are given implicitly by

```
Hn[x] = y K-1 p zn(x) (12)
```

with zn(x) = ((x1 x)n, (x2 x)n, . . . , (x x)n) . For p = 0 the only term is the N constant zero-order Volterra functional H0[x] = G0[x]. The coefficient vector n = (h(n) 1,1,...1, h(n) 1,2,...1, h(n) 1,3,...1, . . . ) of the explicit Volterra functional is obtained as

```
n = K-1 n p y (13)
3for conditions on uniqueness of the solution, see [6] 4Note that this is different from the regularized approach used by [1]. If is not zero, the resulting Volterra series are different from the Wiener series since they are not orthogonal with respect to the input. 5If K is not invertible, K-1 denotes the pseudo-inverse of K. 6assuming symmetrized Volterra kernels which can be obtained from any Volterra expanson.
```

using the design matrix n = (n(x1) , n(x1) , . . . , n(x1) ) . The individual Wiener functionals can only be recovered by applying the regression procedure twice. If we are interested in the nth-degree Wiener functional, we have to compute the solution for the kernels k(n)(x1, x2) and k(n-1)(x1, x2). The Wiener functional for n > 0 is then obtained from the difference of the two results as

```
n n-1 Gn[x] = Gi[x] - Gi[x] = y K-1 n z(n)(x) - K-1 n-1 z(n-1)(x) . (14) i=0 i=0
```

The corresponding ith-order Volterra functionals of the nth-degree Wiener functional are computed analogously to Eqns. (12) and (13) [3].

Orthogonality. The resulting Wiener functionals must fulfill the orthogonality condition which in its strictest form states that a pth-degree Wiener functional must be orthogonal to all monomials in the input of lower order. Formally, we will prove the following

Theorem 1 The functionals obtained from Eq. (14) fulfill the orthogonality condition

```
E [m(x)Gp[x]] = 0 (15)
```

where E denotes the expectation over the input distribution and m(x) an arbitrary ith- order monomial with i < p.

We will show that this a consequence of the least squares fit of any linear expansion in a set of basis functions of the form y = M j=1 j j (x). In the case of the Wiener and Volterra expansions, the basis functions j(x) are monomials of the components of x.

We denote the error of the expansion as e(x) = y - M j=1 j j (xi). The minimum of the expected quadratic loss L with respect to the expansion coefficient k is given by

```
L = E e(x) 2 = -2E [ k (x)e(x)] = 0. (16) k k
```

This means that, for an expansion in a set of basis functions minimizing the squared error, the error is orthogonal to all basis functions used in the expansion.

Now let us assume we know the Wiener series expansion (which minimizes the mean squared error) of a system up to degree p - 1. The approximation error is given by the sum of the higher-order Wiener functionals e(x) = G n=p n[x], so Gp[x] is part of the error. As a consequence of the linearity of the expectation, Eq. (16) implies

```
E [k(x)Gn[x]] = 0 and E [k(x)Gn[x]] = 0 (17) n=p n=p+1
```

for any k of order less than p. The difference of both equations yields E [k(x)Gp[x]] = 0, so that Gp[x] must be orthogonal to any of the lower order basis functions, namely to all monomials with order smaller than p.

3 Experiments

Toy examples. In our first experiment, we check whether our intuitions about higher-order statistics described in the introduction are captured by the proposed method. In particular, we expect that arbitrarily oriented lines can only be predicted using third-order statistics. As a consequence, we should need at least a second-order Wiener functional to predict lines correctly.

Our first test image (size 80 110, upper row in Fig. 1) contains only lines of varying orientations. Choosing a 5 5 neighbourhood, we predicted the central pixel using (11).

original image 0th-order 1st-order 1st-order 2nd-order 2nd-order 3rd-order 3rd-order component/ reconstruction component reconstruction component reconstruction component reconstruction

```
mse = 583.7 mse = 0.006 mse = 0
mse = 1317 mse = 37.4 mse = 0.001
mse = 1845 mse = 334.9 mse = 19.0
```

Figure 1: Higher-order components of toy images. The image components of different orders are created by the corresponding Wiener functionals. They are added up to obtain the different orders of reconstruction. Note that the constant 0-order component and reconstruction are identical. The reconstruction error (mse) is given as the mean squared error between the true grey values of the image and the reconstruction. Although the linear first-order model seems to reconstruct the lines, this is actually not true since the linear model just smoothes over the image (note its large reconstruction error). A correct prediction is only obtained by adding a second-order component to the model. The third-order component is only significant at crossings, corners and line endings.

Models of orders 0 . . . 3 were learned from the image by extracting the maximal training set of 76 106 patches of size 5 57. The corresponding image components of order 0 to 3 were computed according to (14). Note the different components generated by the Wiener functionals can also be negative. In Fig. 1, they are scaled to the gray values [0..255]. The behaviour of the models conforms to our intuition: the linear model cannot capture the line structure of the image thus leading to a large reconstruction error which drops to nearly zero when a second-order model is used. The additional small correction achieved by the third-order model is mainly due to discretization effects.

Similar to lines, we expect that we need at least a third-order model to predict crossings or corners correctly. This is confirmed by the second and third test image shown in the corresponding row in Fig. 1. Note that the third-order component is only significant at crossings, corners and line endings. The fourth- and fifth-order terms (not shown) have only negligible contributions. The fact that the reconstruction error does not drop to zero for the third image is caused by the line endings which cannot be predicted to a higher accuracy than one pixel.

Application to natural images. Are there further predictable structures in natural images that are not due to lines, crossings or corners? This can be investigated by applying our method to a set of natural images (an example of size 80 110 is depicted in Fig. 2). Our

7In contrast to the usual setting in machine learning, training and test set are identical in our case since we are not interested in generalization to other images, but in analyzing the higher-order components of the image at hand.

original image 0th-order 1st-order 1st-order 2nd-order 2nd-order component/ reconstruction component reconstruction component reconstruction mse = 1070 mse = 957.4

```
3rd-order 3rd-order 4th-order 4th-order 5th-order 5th-order reconstruction component reconstruction component reconstruction component mse = 414.6 mse = 98.5 mse = 18.5
6th-order 6th-order 7th-order 7th-order 8th-order 8th-order reconstruction component reconstruction component reconstruction component mse = 4.98 mse = 1.32 mse = 0.41
```

Figure 2: Higher-order components and reconstructions of a photograph. Interactions up to the fifth order play an important role. Note that significant components become sparser with increasing model order.

results on a set of 10 natural images of size 50 70 show an an approximately exponential decay of the reconstruction error when more and more higher-order terms are added to the reconstruction (Fig. 3). Interestingly, terms up to order 5 still play a significant role, although the image regions with a significant component become sparser with increasing model order (see Fig. 2). Note that the nonlinear terms reduce the reconstruction error to almost 0. This suggests a high degree of higher-order redundancy in natural images that cannot be exploited by the usual linear prediction models.

4 Conclusion

The implicit estimation of Wiener functionals via polynomial kernels opens up new pos- sibilities for the estimation of higher-order image statistics. Compared to the classical methods such as higher-order spectra, moments or cumulants, our approach avoids the combinatorial explosion caused by the exponential increase of the number of terms to be estimated and interpreted. When put into a predictive framework, multiplicative pixel inter- actions of different orders are easily visualized and conform to the intuitive notions about image structures such as edges, lines, crossings or corners.

There is no one-to-one mapping between the classical higher-order statistics and multi- plicative pixel interactions. Any nonlinear Wiener functional, for instance, creates infinitely many correlations or cumulants of higher order, and often also of lower order. On the other

700 Figure 3: Mean square reconstruction error of 600 models of different order for a set of 10 natural images. 500

400

mse 300

200

100

```
00 1 2 3 4 5 6 7 model order
```

hand, a Wiener functional of order n produces only harmonic phase interactions up to order n + 1, but sometimes also of lower orders. Thus, when one analyzes a classical statistic of a given order, one often cannot determine by which order of pixel interaction it was created. In contrast, our method is able to isolate image components that are created by a single order of interaction.

Although of preliminary nature, our results on natural images suggest an important role of statistics up to the fifth order. Most of the currently used low-level feature detectors such as edge or corner detectors maximally use third-order interactions. The investigation of fourth- or higher-order features is a field that might lead to new insights into the nature and role of higher-order image structures.

As often observed in the literature (e.g. [2][7]), our results seem to confirm that a large proportion of the redundancy in natural images is contained in the higher-order pixel in- teractions. Before any further conclusions can be drawn, however, our study needs to be extended in several directions: 1. A representative image database has to be analyzed. The images must be carefully calibrated since nonlinear statistics can be highly calibration- sensitive. In addition, the contribution of image noise has to be investigated. 2. Currently, only images up to 9000 pixels can be analyzed due to the matrix inversion required by Eq. 11. To accomodate for larger images, our method has to be reformulated in an iterative algorithm. 3. So far, we only considered 5 5-patches. To systematically investigate patch size effects, the analysis has to be conducted in a multi-scale framework.

Do not remove: This comment is monitored to verify that the site is working properly