{"title": "Iterative Construction of Sparse Polynomial Approximations", "book": "Advances in Neural Information Processing Systems", "page_first": 1064, "page_last": 1071, "abstract": null, "full_text": "Iterative Construction of Sparse Polynomial Approximations \n\nTerence D. Sanger \nMassachusetts Institute of Technology \nRoom E25-534 \nCambridge, MA 02139 \ntds@ai.mit.edu \n\nRichard S. Sutton \nGTE Laboratories Incorporated \n40 Sylvan Road \nWaltham, MA 02254 \nsutton@gte.com \n\nChristopher J. Matheus \nGTE Laboratories Incorporated \n40 Sylvan Road \nWaltham, MA 02254 \nmatheus@gte.com \n\nAbstract \n\nWe present an iterative algorithm for nonlinear regression based on construction of sparse polynomials. Polynomials are built sequentially from lower to higher order. Selection of new terms is accomplished using a novel look-ahead approach that predicts whether a variable contributes to the remaining error. The algorithm is based on the tree-growing heuristic in LMS Trees, which we have extended to approximation of arbitrary polynomials of the input features. In addition, we provide a new theoretical justification for this heuristic approach. The algorithm is shown to discover a known polynomial from samples, and to make accurate estimates of pixel values in an image-processing task. \n\n1 INTRODUCTION \n\nLinear regression attempts to approximate a target function by a model that is a linear combination of the input features. Its approximation ability is thus limited by the available features. We describe a method for adding new features that are products or powers of existing features. Repeated addition of new features leads to the construction of a polynomial in the original inputs, as in (Gabor 1961). 
\nBecause there is an infinite number of possible product terms, we have developed a new method for predicting the usefulness of entire classes of features before they are included. The resulting nonlinear regression will be useful for approximating functions that can be described by sparse polynomials. \n\nFigure 1: Network depiction of linear regression on a set of features x_i. \n\n2 THEORY \n\nLet {x_i}_{i=1}^n be the set of features already included in a model that attempts to predict the function f. The output of the model is a linear combination \n\nf_hat = sum_{i=1}^n c_i x_i \n\nwhere the c_i's are coefficients determined using linear regression. The model can also be depicted as a single-layer network, as in figure 1. The approximation error is e = f - f_hat, and we will attempt to minimize E[e^2], where E is the expectation operator. \n\nThe algorithm incrementally creates new features that are products of existing features. At each step, the goal is to select two features x_p and x_q already in the model and create a new feature x_p x_q (see figure 2). Even if x_p x_q does not decrease the approximation error, it is still possible that x_p x_q x_r will decrease it for some x_r. So in order to decide whether to create a new feature that is a product with x_p, the algorithm must \"look ahead\" to determine whether there exists any polynomial a in the x_i's such that inclusion of a x_p would significantly decrease the error. If no such polynomial exists, then we do not need to consider adding any features that are products with x_p. \n\nDefine the inner product between two polynomials a and b as (a|b) = E[ab], where the expected value is taken with respect to a probability measure mu over the (zero-mean) input values. The induced norm is ||a||^2 = E[a^2], and let P be the set of polynomials with finite norm. 
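As an illustration of the inner product just defined, the following sketch (ours, not from the paper) estimates (a|b) = E[ab] by Monte Carlo over samples of a zero-mean input; the particular polynomials and the standard-normal measure are our own illustrative choices.

```python
import numpy as np

# Monte-Carlo estimate of the inner product (a|b) = E[ab] between two
# polynomials of a zero-mean input x. Illustrative sketch: the polynomials
# a, b and the standard-normal measure mu are our own choices.
rng = np.random.default_rng(0)
x = rng.normal(size=100_000)   # samples drawn from the measure mu

def inner(a_vals, b_vals):
    # sample average approximating E[ab]
    return float(np.mean(a_vals * b_vals))

a = x**2         # polynomial a(x) = x^2
b = x**2 - 1.0   # polynomial b(x) = x^2 - 1

# For standard-normal x, E[x^4] = 3 and E[x^2] = 1, so:
print(inner(a, a))   # squared norm ||a||^2, close to 3
print(inner(a, b))   # close to 3 - 1 = 2
```

Any polynomial with finite sample moments can be plugged in the same way, which is all the algorithm below needs.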
{P, (.|.)} is then an infinite-dimensional linear vector space. The Weierstrass approximation theorem shows that P is dense in the set of all square-integrable functions over mu, and thus justifies the assumption that any function of interest can be approximated by a member of P. \n\nAssume that the error e is a polynomial in P. In order to test whether a x_p participates in e for any polynomial a in P, we write \n\ne = a_p x_p + b_p \n\nFigure 2: Incorporation of a new product term into the model. \n\nwhere a_p and b_p are polynomials, and a_p is chosen to minimize ||a_p x_p - e||^2 = E[(a_p x_p - e)^2]. The orthogonality principle then shows that a_p x_p is the projection of the polynomial e onto the linear subspace of polynomials x_p P. Therefore b_p is orthogonal to x_p P, so that E[b_p g] = 0 for all g in x_p P. \n\nWe now write \n\nE[e^2] = E[a_p^2 x_p^2] + 2 E[a_p x_p b_p] + E[b_p^2] = E[a_p^2 x_p^2] + E[b_p^2] \n\nsince E[a_p x_p b_p] = 0 by orthogonality. If a_p x_p were included in the model, it would thus reduce E[e^2] by E[a_p^2 x_p^2], so we wish to choose x_p to maximize E[a_p^2 x_p^2]. Unfortunately, we have no direct measurement of a_p. \n\n3 METHODS \n\nAlthough E[a_p^2 x_p^2] cannot be measured directly, Sanger (1991) suggests choosing x_p to maximize E[e^2 x_p^2] instead, which is directly measurable. Moreover, note that \n\nE[e^2 x_p^2] = E[a_p^2 x_p^4] + 2 E[a_p x_p^3 b_p] + E[x_p^2 b_p^2] >= E[a_p^2 x_p^4] \n\nsince E[a_p x_p^3 b_p] = 0 by the same orthogonality, and thus E[e^2 x_p^2] is related to the desired but unknown value E[a_p^2 x_p^2]. Perhaps better would be to use \n\nE[e^2 x_p^2] / E[x_p^2] >= E[a_p^2 x_p^4] / E[x_p^2] \n\nwhich can be thought of as the regression of (a_p^2 x_p^2) x_p against x_p. \n\nMore recently, Sutton and Matheus (1991) suggest using the regression coefficients of e^2 against x_i^2 for all i as the basis for comparison. The regression coefficients w_i are called \"potentials\", and lead to a linear approximation of the squared error: \n\ne^2 ~ sum_i w_i x_i^2. (1) \n\nIf a new term a_p x_p were included in the model of f, then the error would become b_p, which is orthogonal to any polynomial in x_p P and in particular to x_p^2. Thus the coefficient of x_p^2 in (1) would be zero after inclusion of a_p x_p, and w_p E[x_p^2] is an approximation to the decrease in mean-squared error E[e^2] - E[b_p^2] which we can expect from inclusion of a_p x_p. We thus choose x_p by maximizing w_p E[x_p^2]. \n\nThis procedure is a form of look-ahead which allows us to predict the utility of a high-order term a_p x_p without actually including it in the regression. This is perhaps most useful when the term is predicted to make only a small contribution for the optimal a_p, because in this case we can drop from consideration any new features that include x_p. \n\nWe can choose a different variable x_q similarly, and test the usefulness of incorporating the product x_p x_q by computing a \"joint potential\" w_pq, which is the regression of the squared error against the model including a new term x_p^2 x_q^2. The joint potential attempts to predict the magnitude of the term E[a_pq^2 x_p^2 x_q^2]. \n\nWe now use this method to choose a single new feature x_p x_q to include in the model. For all pairs x_i x_j such that x_i and x_j individually have high potentials, we perform a third regression to determine the joint potentials of the product terms x_i x_j. Any term with a high joint potential is likely to participate in f. We choose to include the new term x_p x_q with the largest joint potential. In the network model, this results in the construction of a new unit that computes the product of x_p and x_q, as in figure 2. 
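The potential computation and pair selection can be sketched numerically. This is our reconstruction under simplifying assumptions (batch least squares rather than the paper's LMS version, independent Gaussian inputs, helper names ours): for a target f = 3 x1 x2, no single input correlates with f, yet the potentials w_i obtained by regressing e^2 on the x_i^2 single out x1 and x2 as the pair worth multiplying.

```python
import numpy as np

# Sketch (our reconstruction) of the look-ahead potentials. The target
# f = 3*x1*x2 has no linear structure, so the residual e equals f, but the
# regression of e^2 on the x_i^2 flags x1 and x2 as a promising product pair.
rng = np.random.default_rng(1)
n_samples, n_inputs = 50_000, 4
X = rng.normal(size=(n_samples, n_inputs))
f = 3.0 * X[:, 0] * X[:, 1]

def lstsq_coeffs(A, y):
    # ordinary least-squares regression coefficients
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

# Linear model on the raw features (with bias); its residual is the error e.
A = np.column_stack([np.ones(n_samples), X])
e = f - A @ lstsq_coeffs(A, f)

# Potentials w_i: regression coefficients of e^2 against the x_i^2.
A2 = np.column_stack([np.ones(n_samples), X**2])
w = lstsq_coeffs(A2, e**2)[1:]        # drop the bias coefficient

scores = w * np.mean(X**2, axis=0)    # w_p * E[x_p^2], predicted MSE decrease
ranking = np.argsort(scores)[::-1]
print(ranking[:2])                    # the top pair is features 0 and 1
```

A joint potential for the selected pair would then be obtained by one further regression of e^2 against the model extended with x_p^2 x_q^2.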
The new unit is incorporated into the regression, and the resulting error e will be orthogonal to this unit and all previous units. Iteration of this technique leads to the successive addition of new regression terms and the successive decrease in mean-squared error E[e^2]. The process stops when the residual mean-squared error drops below a chosen threshold, and the final model consists of a sparse polynomial in the original inputs. \n\nWe have implemented this algorithm both in a non-iterative version that computes coefficients and potentials based on a fixed data set, and in an iterative version that uses the LMS algorithm (Widrow and Hoff 1960) to compute both coefficients and potentials incrementally in response to continually arriving data. In the iterative version, new terms are added at fixed intervals and are chosen by maximizing over the potentials approximated by the LMS algorithm. The growing polynomial is efficiently represented as a tree structure, as in (Sanger 1991a). \n\nAlthough the algorithm involves three separate regressions, each is over only O(n) terms, and thus the iterative version of the algorithm is only of O(n) complexity per input pattern processed. \n\n4 RELATION TO OTHER ALGORITHMS \n\nApproximation of functions over a fixed monomial basis is not a new technique (Gabor 1961, for example). However, it performs very poorly for high-dimensional input spaces, since the set of all monomials (even of very low order) can be prohibitively large. This has led to a search for methods which allow the generation of sparse polynomials. A recent example and bibliography are provided in (Grigoriev et al. 1990), which describes an algorithm applicable to finite fields (but not to real-valued random variables). \n\nFigure 3: Products of hidden units in a sigmoidal feedforward network lead to a polynomial in the hidden units themselves. 
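Returning to the iterative version described in Section 3: the per-pattern LMS updates can be sketched as below. This is our illustration, not the paper's code; the target function, step sizes, and variable names are assumptions, and only the coefficient and potential updates are shown (term creation is omitted).

```python
import numpy as np

# LMS (Widrow-Hoff) sketch of the iterative version: one O(n) update of the
# model coefficients c and the potentials w per input pattern. The target
# f = 2*x1 and the step sizes are our illustrative choices.
rng = np.random.default_rng(2)
n_inputs = 3
c = np.zeros(n_inputs)    # model coefficients for f_hat = c . x
w = np.zeros(n_inputs)    # potentials: e^2 ~ w . x^2
eta_c, eta_w = 0.01, 0.001

for _ in range(50_000):
    x = rng.normal(size=n_inputs)
    f = 2.0 * x[0]                       # target (representable here)
    e = f - c @ x                        # prediction error
    c += eta_c * e * x                   # LMS update of the coefficients
    x2 = x * x
    w += eta_w * (e * e - w @ x2) * x2   # LMS update of the potentials

print(np.round(c, 2))   # converges close to [2, 0, 0]
```

Both updates touch only the n current terms, which is the source of the O(n)-per-pattern cost claimed above.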
\n\nThe GMDH algorithm (Ivakhnenko 1971, Ikeda et al. 1976, Barron et al. 1984) incrementally adds new terms to a polynomial by forming a second (or higher) order polynomial in 2 (or more) of the current terms, and including this polynomial as a new term if it correlates with the error. Since GMDH does not use look-ahead, it risks avoiding terms which would be useful at future steps. For example, if the polynomial to be approximated is xyz, where all three variables are independent, then no polynomial in x and y alone will correlate with the error, and thus the term xy may never be included. However, x^2 y^2 does correlate with x^2 y^2 z^2, so the look-ahead algorithm presented here would include the term xy, even though the error would not decrease until a later step. Although GMDH can be extended to test polynomials of more than 2 variables, it will always be testing a finite-order polynomial in a finite number of variables, so there will always exist target functions which it will not be able to approximate. \n\nAlthough look-ahead avoids this problem, it is not always useful. For practical purposes, we may be interested in the best Nth-order approximation to a function, so it may not be helpful to include terms which participate in monomials of order greater than N, even if these monomials would cause a large decrease in error. For example, the best 2nd-order approximation to x^2 + y^1000 + z^1000 may be x^2, even though the other two terms contribute more to the error. In practice, some combination of both infinite look-ahead and GMDH-type heuristics may be useful. \n\n5 APPLICATION TO OTHER STRUCTURES \n\nThese methods have a natural application to other network structures. The inputs to the polynomial network can be sinusoids (leading to high-dimensional Fourier representations), Gaussians (leading to high-dimensional radial basis functions), or other appropriate functions (Sanger 1991a, Sanger 1991b). 
Polynomials can even be applied with sigmoidal networks as input, so that \n\nx_i = sigma( sum_j s_ij z_j ) \n\nwhere the z_j's are the original inputs, and the s_ij's are the weights to a sigmoidal hidden unit whose value is the polynomial term x_i. The last layer of hidden units in a multilayer network is considered to be the set of input features x_i to a linear output unit, and we can compute the potentials of these features to determine the hidden unit x_p that would most decrease the error if a_p x_p were included in the model (for the optimal polynomial a_p). But a_p can now be approximated using a subnetwork of any desired type. This subnetwork is used to add a new hidden unit a_hat_p x_p that is the product of x_p with the subnetwork output a_hat_p, as in figure 3. \n\nIn order to train the a_hat_p subnetwork iteratively using gradient descent, we need to compute the effect of changes in a_hat_p on the network error eps = E[(f - f_hat)^2]. We have \n\nd(eps)/d(a_hat_p) = -2 E[e s x_p] \n\nwhere s is the weight from the new hidden unit to the output. Without loss of generality we can set s = 1 by including this factor within a_hat_p. Thus the error term for iteratively training the subnetwork a_hat_p is \n\ne x_p \n\nwhich can be used to drive a standard backpropagation-type gradient descent algorithm. This gives a method for constructing new hidden nodes and a learning algorithm for training these nodes. The same technique can be applied to deeper layers in a multilayer network. \n\n6 EXAMPLES \n\nWe have applied the algorithm to approximation of known polynomials in the presence of irrelevant noise variables, and to a simple image-processing task. \n\nFigure 4 shows the results of applying the algorithm to 200 samples of the polynomial 2 + 3 x1 x2 + 4 x3 x4 x5 with 4 irrelevant noise variables. The algorithm correctly finds the true polynomial in 4 steps, requiring about 5 minutes on a Symbolics Lisp Machine. 
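The subnetwork training rule of Section 5 can be sketched as follows. This is our construction, not the paper's implementation: the subnetwork a_hat is taken to be linear, and the residual and step size are invented; the only element taken from the text is the gradient signal e x_p.

```python
import numpy as np

# Sketch (our construction) of training a subnetwork a_hat so that the new
# hidden unit a_hat(x) * x_p fits a residual error. The gradient signal for
# the subnetwork is e * x_p, as in Section 5; a_hat is linear here, and the
# residual 4*x_p*x_q is an invented example.
rng = np.random.default_rng(3)
v = np.zeros(2)     # weights of the linear subnetwork a_hat(x) = v . x
eta = 0.005

for _ in range(50_000):
    x = rng.normal(size=2)            # x[0] plays x_p, x[1] plays x_q
    resid = 4.0 * x[0] * x[1]         # residual left by the current model
    a_hat = v @ x                     # subnetwork output
    e = resid - a_hat * x[0]          # error with the unit a_hat*x_p included
    v += eta * e * x[0] * x           # gradient step driven by e * x_p

print(np.round(v, 2))   # a_hat(x) learns about 4*x_q, i.e. v near [0, 4]
```

With a nonlinear subnetwork, the same signal e x_p would simply be backpropagated through it instead of applied to v directly.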
Note that although the error did not decrease after cycle 1, the term x4 x5 was incorporated, since it would be useful in a later step to reduce the error as part of x3 x4 x5 in cycle 2. \n\nThe image-processing task is to predict a pixel value on the succeeding scan line from a 2x5 block of pixels on the preceding 2 scan lines. If successful, the resulting polynomial can be used as part of a DPCM image-coding strategy. The network was trained on random blocks from a single face image, and tested on a different image. Figure 5 shows the original training and test images, the pixel predictions, and the remaining error. Figure 6 shows the resulting 55-term polynomial. Learning this polynomial required less than 10 minutes on a Sun Sparcstation 1. \n\nFigure 4: A simple example of polynomial learning. 200 samples of y = 2 + 3 x1 x2 + 4 x3 x4 x5, with 4 additional irrelevant inputs x6-x9; original MSE 1.0. Cycle 1: MSE 0.967, new term x10 = x4 x5. Cycle 2: MSE 0.966, new term x11 = x10 x3 = x3 x4 x5. Cycle 3: MSE 0.349, new term x12 = x1 x2. Cycle 4: MSE 0.000, with coefficients 4.00 on x11, 3.00 on x12, and all other coefficients 0.00; solution 2 + 3 x1 x2 + 4 x3 x4 x5. (Per-cycle coefficient and potential tables not reproduced.) \n\nFigure 5: Original, predicted, and error images. The top row is the training image (RMS error 8.4), and the bottom row is the test image (RMS error 9.4). \n\nFigure 6: The 55-term polynomial in inputs z0-z9 used to generate figure 5. (Coefficients not reproduced.) \n\nAcknowledgments \n\nWe would like to thank Richard Brandau for his helpful comments and suggestions on an earlier draft of this paper. This report describes research done both at GTE Laboratories Incorporated, in Waltham, MA, and at the laboratory of Dr. Emilio Bizzi in the Department of Brain and Cognitive Sciences at MIT. T. Sanger was supported during this work by a National Defense Science and Engineering Graduate Fellowship, and by NIH grants 5R37AR26710 and 5R01NS09343 to Dr. Bizzi. \n\nReferences \n\nBarron R. L., Mucciardi A. N., Cook F. J., Craig J. N., Barron A. R., 1984, Adaptive learning networks: Development and application in the United States of algorithms related to GMDH, in Farlow S. J., ed., Self-Organizing Methods in Modeling, pages 25-65, Marcel Dekker, New York. 
\nGabor D., 1961, A universal nonlinear filter, predictor, and simulator which optimizes itself by a learning process, Proc. IEE, 108B:422-438. \n\nGrigoriev D. Y., Karpinski M., Singer M. F., 1990, Fast parallel algorithms for sparse polynomial interpolation over finite fields, SIAM J. Computing, 19(6):1059-1063. \n\nIkeda S., Ochiai M., Sawaragi Y., 1976, Sequential GMDH algorithm and its application to river flow prediction, IEEE Trans. Systems, Man, and Cybernetics, SMC-6(7):473-479. \n\nIvakhnenko A. G., 1971, Polynomial theory of complex systems, IEEE Trans. Systems, Man, and Cybernetics, SMC-1(4):364-378. \n\nSanger T. D., 1991a, Basis-function trees as a generalization of local variable selection methods for function approximation, in Lippmann R. P., Moody J. E., Touretzky D. S., eds., Advances in Neural Information Processing Systems 3, pages 700-706, Morgan Kaufmann, Proc. NIPS'90, Denver CO. \n\nSanger T. D., 1991b, A tree-structured adaptive network for function approximation in high dimensional spaces, IEEE Trans. Neural Networks, 2(2):285-293. \n\nSutton R. S., Matheus C. J., 1991, Learning polynomial functions by feature construction, in Proc. Eighth Intl. Workshop on Machine Learning, Chicago. \n\nWidrow B., Hoff M. E., 1960, Adaptive switching circuits, in IRE WESCON Conv. Record, Part 4, pages 96-104. \n", "award": [], "sourceid": 538, "authors": [{"given_name": "Terence", "family_name": "Sanger", "institution": null}, {"given_name": "Richard", "family_name": "Sutton", "institution": null}, {"given_name": "Christopher", "family_name": "Matheus", "institution": null}]}