{"title": "Encoding Geometric Invariances in Higher-Order Neural Networks", "book": "Neural Information Processing Systems", "page_first": 301, "page_last": 309, "abstract": null, "full_text": "301 \n\nENCODING GEOMETRIC INVARIANCES IN \n\nHIGHER-ORDER NEURAL NETWORKS \n\nAir Force Office of Scientific Research, Bolling AFB, DC 20332 \n\nC.L. Giles \n\nNaval Research Laboratory, Washington, DC \n\nR.D. Griffin \n\n20375-5000 \n\nSachs-Freeman Associates, Landover, MD 20785 \n\nT. Maxwell \n\nABSTRACT \n\nWe describe a method of constructing higher-order neural \n\nnetworks that respond invariantly under geometric transformations on \nthe input space. By requiring each unit to satisfy a set of \nconstraints on the interconnection weights, a particular structure is \nimposed on the network. A network built using such an architecture \nmaintains its invariant performance independent of the values the \nweights assume, of the learning rules used, and of the form of the \nnonlinearities in the network. The invariance exhibited by a first(cid:173)\norder network is usually of a trivial sort, e.g., responding only to \nthe average input in the case of translation invariance, whereas \nhigher-order networks can perform useful functions and still exhibit \nthe invariance. We derive the weight constraints for translation, \nrotation, scale, and several combinations of these transformations, \nand report results of simulation studies. \n\nINTRODUCTION \n\nA persistent difficulty for pattern recognition systems is the \n\nrequirement that patterns or objects be recognized independent of \nirrelevant parameters or distortions such as orientation (position, \nrotation, aspect), scale or size, background or context, doppler \nshift, time of occurrence, or signal duration. The remarkable \nperformance of humans and other animals on this problem in the visual \nand auditory realms is often taken for granted, until one tries to \nbuild a machine with similar performance. 
Though many methods have been developed for dealing with these problems, we have classified them into two categories: 1) preprocessing or transformation (inherent) approaches, and 2) case-specific or \"brute force\" (learned) approaches. Common transformation techniques include: Fourier, Hough, and related transforms; moments; and Fourier descriptors of the input signal. In these approaches the signal is usually transformed so that the subsequent processing ignores arbitrary parameters such as scale, translation, etc. In addition, these techniques are usually computationally expensive and are sensitive to noise in the input signal. The \"brute force\" approach is exemplified by training a device, such as a perceptron, to classify a pattern independent of its position by presenting the training pattern at all possible positions. MADALINE machines2 have been shown to perform well using such techniques. Often, this type of invariance is pattern specific, does not easily generalize to other patterns, and depends on the type of learning algorithm employed. Furthermore, a great deal of time and energy is spent on learning the invariance, rather than on learning the signal. We describe a method that has the advantage of inherent invariance but uses a higher-order neural network approach that must learn only the desired signal. Higher-order units have been shown to have unique computational strengths and are quite amenable to the encoding of a priori knowledge.3-7\n\n© American Institute of Physics 1988\n\nMATHEMATICAL DEVELOPMENT\n\nOur approach is similar to the group invariance approach,8,10 although we make no appeal to group theory to obtain our results. We begin by selecting a transformation on the input space, then require the output of the unit to be invariant to the transformation. 
The resulting equations yield constraints on the interconnection weights, and thus imply a particular form or structure for the network architecture.\n\nFor the i-th unit yi of order M defined on a discrete input space, let the output be given by\n\n  yi[wiM(x),p(x)] = f( wi0 + Σ wi1(x1) p(x1)\n    + ΣΣ wi2(x1,x2) p(x1) p(x2) + ...\n    + Σ...Σ wiM(x1,...,xM) p(x1)...p(xM) ),   (1)\n\nwhere p(x) is the input pattern or signal function (sometimes called a pixel) evaluated at position vector x, wim(x1,...,xm) is the weight of order m connecting the outputs of units at x1, x2, ..., xm to the i-th unit, i.e., it correlates m values, f(u) is some threshold or sigmoid output function, and the summations extend over the input space. wiM(x) represents the entire set of weights associated with the i-th unit. These units are equivalent to the sigma-pi units(a) defined by Rumelhart, Hinton, and Williams.7 Systems built from these units suffer from a combinatorial explosion of terms, hence are more complicated to build and train. To reduce the severity of this problem, one can limit the range of the interconnection weights or the number of orders, or impose various other constraints. We find that, in addition to the advantages of inherent invariance, imposing an invariance constraint on Eq. (1) reduces the number of allowed weights, thus simplifying the architecture and shortening the training time.\n\n(a) The sigma-pi neural networks are multi-layer networks with higher-order terms in any layer. As such, most of the neural networks described here can be considered as a special case of the sigma-pi units. However, the sigma-pi units as originally formulated did not have invariant weight terms, though it is quite simple to incorporate such invariances in these units.\n\nWe now define what we mean by invariance. 
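As a concrete sketch of Eq. (1) truncated at second order (M = 2), the following minimal example evaluates one such unit numerically. The function name, the use of numpy, and the choice of tanh for f are our illustrative assumptions, not specifics from the paper:

```python
import numpy as np

def sigma_pi_output(p, w0, w1, w2, f=np.tanh):
    # Eq. (1) truncated at M = 2 for a single unit:
    #   y = f( w0 + sum_i w1[i] p[i] + sum_ij w2[i,j] p[i] p[j] )
    # p: input pattern, shape (N,); w1: first-order weights, shape (N,);
    # w2: second-order weights, shape (N, N); f: threshold or sigmoid.
    return f(w0 + w1 @ p + p @ w2 @ p)

rng = np.random.default_rng(0)
N = 8
p = rng.normal(size=N)
y = sigma_pi_output(p, 0.1, rng.normal(size=N), rng.normal(size=(N, N)))
```

With a full second-order weight matrix, the unit uses order N**2 parameters; the invariance constraints derived below shrink that count.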
The output of a unit is invariant with respect to the transformation T on the input pattern if9\n\n  yi[wiM(x), Tp(x)] = yi[wiM(x), p(x)].   (2)\n\nAn example of the class of invariant response defined by Eq. (2) would be invariant detection of an object in the receptive field of a panning or zooming camera. An example of a different class would be invariant detection of an object that is moving within the field of a fixed camera. One can think of this latter case as consisting of a fixed field of \"noise\" plus a moving field that contains only the object of interest. If the detection system does not respond to the fixed field, then this latter case is included in Eq. (2).\n\nTo illustrate our method we derive the weight constraints for one-dimensional translation invariance. We will first switch to a continuous formulation, however, for reasons of simplicity and generality, and because it is easier to grasp the physical significance of the results, although any numerical simulation requires a discrete formulation and has significant implications for the implementation of our results. Instead of an index i, we now keep track of our units with the continuous variable u. With these changes Eq. (1) becomes\n\n  y[u;wM(x),p(x)] = f( w0 + ∫ dx1 w1(u;x1) p(x1) + ...\n    + ∫...∫ dx1...dxM wM(u;x1,...,xM) p(x1)...p(xM) ).   (3)\n\nThe limits on the integrals are defined by the problem and are crucial in what follows. Let T be a translation of the input pattern by -x0, so that\n\n  T[p(x)] = p(x+x0),   (4)\n\nwhere x0 is the translation of the input pattern. 
Then, from Eq. (2),\n\n  Ty[u;wM(x),p(x)] = y[u;wM(x),p(x+x0)] = y[u;wM(x),p(x)].   (5)\n\nSince p(x) is arbitrary we must impose term-by-term equality in the argument of the threshold function; i.e.,\n\n  ∫ dx1 w1(u;x1) p(x1) = ∫ dx1 w1(u;x1) p(x1+x0),   (5a)\n\n  ∫∫ dx1 dx2 w2(u;x1,x2) p(x1) p(x2) = ∫∫ dx1 dx2 w2(u;x1,x2) p(x1+x0) p(x2+x0),   (5b)\n\netc. Making the substitutions x1 → x1-x0, x2 → x2-x0, etc., we find that\n\n  ∫ dx1 w1(u;x1) p(x1) = ∫ dx1 w1(u;x1-x0) p(x1),   (6a)\n\n  ∫∫ dx1 dx2 w2(u;x1,x2) p(x1) p(x2) = ∫∫ dx1 dx2 w2(u;x1-x0,x2-x0) p(x1) p(x2),   (6b)\n\netc. Note that the limits of the integrals on the right-hand side must be adjusted to satisfy the change of variables. If the limits on the integrals are infinite or if one imposes some sort of periodic boundary condition, the limits of the integrals on both sides of the equation can be set equal. We will assume in the remainder of this paper that these conditions can be met; normally this means the limits of the integrals extend to infinity. (In an implementation, it is usually impractical or even impossible to satisfy these requirements, but our simulation results indicate that these networks perform satisfactorily even though the regions of integration are not identical. This question must be addressed for each class of transformation; it is an integral part of the implementation design.) Since the functions p(x) are arbitrary and the regions of integration are the same, the weight functions must be equal. This imposes a constraint on the functional form of the weight functions or, in the discrete implementation, limits the allowed connections and thus the number of weights. In the case of translation invariance, the constraint on the functional form of the weight functions requires that\n\n  w1(u;x1) = w1(u;x1-x0),   (7a)\n\n  w2(u;x1,x2) = w2(u;x1-x0,x2-x0),   (7b)\n\netc. 
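A minimal numerical sketch of the discrete, periodic-boundary version of this constraint: choosing w2(x1,x2) = v[(x1-x2) mod N] satisfies Eq. (7b) for every shift x0, so the second-order sum is unchanged under circular shifts of the pattern. The names quad_term and v, and the use of numpy with a circular boundary, are our assumptions for illustration:

```python
import numpy as np

def quad_term(p, v):
    # Second-order sum with weights constrained as in Eq. (7b):
    # w2[i, j] = v[(i - j) mod N], so only N weights instead of N**2.
    N = len(p)
    idx = (np.arange(N)[:, None] - np.arange(N)[None, :]) % N
    return p @ v[idx] @ p

rng = np.random.default_rng(1)
N = 16
v = rng.normal(size=N)             # the N difference weights
p = rng.normal(size=N)             # an arbitrary input pattern
y0 = quad_term(p, v)
y1 = quad_term(np.roll(p, 5), v)   # same pattern, circularly shifted: y1 == y0
```

The invariance holds for any v and any shift, independent of how v was learned, which is the point of building the constraint into the architecture.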
These equations imply that the first-order weight is independent of input position, and depends only on the output position u. The second-order weight is a function only of vector differences,10 i.e.,\n\n  w1(u;x1) = w1(u),   (8a)\n\n  w2(u;x1,x2) = w2(u;x1-x2).   (8b)\n\nFor a discrete implementation with N input units (pixels) fully connected to an output unit, this requirement reduces the number of second-order weights from order N^2 to order N, i.e., only weights for differences of indexes are needed rather than all unique pair combinations. Of course, this advantage is multiplied as the number of fully-connected output units increases.\n\nFURTHER EXAMPLES\n\nWe have applied these techniques to several other transformations of interest. For the case of transformation of scale, define the scale operator S such that\n\n  Sp(x) = a^n p(ax),   (9)\n\nwhere a is the scale factor, and x is a vector of dimension n. The factor a^n is used for normalization purposes, so that a given figure always contains the same \"energy\" regardless of its scale. Application of the same procedure to this transformation leads to the following constraints on the weights:\n\n  w1(u;x1/a) = w1(u;x1),   (10a)\n\n  w2(u;x1/a,x2/a) = w2(u;x1,x2),   (10b)\n\n  w3(u;x1/a,x2/a,x3/a) = w3(u;x1,x2,x3), etc.   (10c)\n\nConsider a two-dimensional problem viewed in polar coordinates (r,t). A set of solutions to these constraints is\n\n  w1(u;r1,t1) = w1(u;t1),   (11a)\n\n  w2(u;r1,r2;t1,t2) = w2(u;r1/r2;t1,t2),   (11b)\n\n  w3(u;r1,r2,r3;t1,t2,t3) = w3(u;(r1-r2)/r3;t1,t2,t3).   (11c)\n\nNote that with increasing order comes increasing freedom in the selection of the functional form of the weights. Any solution that satisfies the constraint may be used. This gives the designer additional freedom to limit the connection complexity, or to encode special behavior into the net architecture. 
An example of this is given later when we discuss combining translation and scale invariance in the same network.\n\nNow consider a change of scale for a two-dimensional system in rectangular coordinates, and consider only the second-order weights. A set of solutions to the weight constraint is:\n\n  w2(u;x1,y1;x2,y2) = w2(u;x1/y1;x2/y2),   (12a)\n\n  w2(u;x1,y1;x2,y2) = w2(u;x1/x2;y1/y2),   (12b)\n\n  w2(u;x1,y1;x2,y2) = w2(u;(x1-x2)/(y1-y2)), etc.   (12c)\n\nWe have done a simulation using the form of Eq. (12b). The simulation was done using a small input space (8x8) and one output unit. A simple least-mean-square (back-propagation) algorithm was used for training the network. When taught to distinguish the letters T and C at one scale, it distinguished them at changes of scale of up to 4X with about 15 percent maximum degradation in the output strength. These results are quite encouraging because no special effort was required to make the system work, and no corrections or modifications were made to account for the boundary condition requirements as discussed near Eq. (6). This and other simulations are discussed further later.\n\nAs a third example of a geometric transformation, consider the case of rotation about the origin for a two-dimensional space in polar coordinates. One can readily show that the weight constraints are satisfied if\n\n  w1(u;r1,t1) = w1(u;r1),   (13a)\n\n  w2(u;r1,r2;t1,t2) = w2(u;r1,r2;t1-t2), etc.   (13b)\n\nThese results are reminiscent of the results for translation invariance. This is not uncommon: seemingly different problems often have similar constraint requirements if the proper change of variable is made. This can be used to advantage when implementing such networks, but we will not discuss it further here.\n\nAn interesting case arises when one considers combinations of invariances, e.g., scale and translation. 
This raises the question of the effect of the order of the transformations, i.e., is scale followed by translation equivalent to translation followed by scale? The obvious answer is no, yet for certain cases the order is unimportant. Consider first the case of change of scale by a, followed by a translation x0; the constraints on the weights up to second order are:\n\n  w1(u;x1) = w1(u;(x1-x0)/a),   (14a)\n\n  w2(u;x1,x2) = w2(u;(x1-x0)/a,(x2-x0)/a),   (14b)\n\nand for translation followed by scale the constraints are:\n\n  w1(u;x1) = w1(u;(x1/a)-x0), and   (15a)\n\n  w2(u;x1,x2) = w2(u;(x1/a)-x0,(x2/a)-x0).   (15b)\n\nConsider only the second-order weights for the two-dimensional case. Choose rectangular coordinate variables (x,y) so that the translation is given by (x0,y0). Then\n\n  w2(u;x1,y1;x2,y2) = w2(u;(x1/a)-x0,(y1/a)-y0;(x2/a)-x0,(y2/a)-y0),   (16a)\n\nor\n\n  w2(u;x1,y1;x2,y2) = w2(u;(x1-x0)/a,(y1-y0)/a;(x2-x0)/a,(y2-y0)/a).   (16b)\n\nIf we take as our solution\n\n  w2(u;x1,y1;x2,y2) = w2(u;(x1-x2)/(y1-y2)),   (17)\n\nthen w2 is invariant to scale and translation, and the order is unimportant. With higher-order weights one can be even more adventurous.\n\nAs a final example consider the case of a change of scale by a factor a and rotation about the origin by an amount t0 for a two-dimensional system in polar coordinates. (Note that the order of transformation makes no difference.) The weight constraints up to second order are:\n\n  w1(u;r1,t1) = w1(u;r1/a,t1-t0),   (18a)\n\n  w2(u;r1,r2;t1,t2) = w2(u;r1/a,r2/a;t1-t0,t2-t0).   (18b)\n\nThe first-order constraint requires that w1 be independent of the input variables, but for the second-order term one can obtain a more useful solution:\n\n  w2(u;r1,r2;t1,t2) = w2(u;r1/r2;t1-t2).   (19)\n\nThis implies that with second-order weights, one can construct a unit that is insensitive to changes in scale and rotation of the input space. How useful it is depends upon the application. 
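A quick numerical check of Eq. (17): the argument (x1-x2)/(y1-y2) is unchanged when all coordinates are translated by (x0,y0) and scaled by a, in either order. The helper name w2_arg and the sample values below are ours, chosen only to make the check concrete:

```python
def w2_arg(x1, y1, x2, y2):
    # Argument of the solution of Eq. (17): invariant to joint
    # translation and scaling of all four coordinates.
    return (x1 - x2) / (y1 - y2)

a, x0, y0 = 2.5, 1.0, -3.0            # an arbitrary scale factor and translation
x1, y1, x2, y2 = 0.3, 1.1, 2.0, 4.7   # two arbitrary input points
orig = w2_arg(x1, y1, x2, y2)
# scale then translate (the Eq. (15) form): x -> (x/a) - x0
st = w2_arg(x1/a - x0, y1/a - y0, x2/a - x0, y2/a - y0)
# translate then scale (the Eq. (14) form): x -> (x - x0)/a
ts = w2_arg((x1 - x0)/a, (y1 - y0)/a, (x2 - x0)/a, (y2 - y0)/a)
```

Both transformed arguments equal the original because the offsets cancel in the differences and the scale factor cancels in the ratio, which is why the order of the two transformations does not matter for this solution.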
\n\nSIMULATION RESULTS\n\nWe have constructed several higher-order neural networks that demonstrated invariant response to transformations of scale and of translation of the input patterns. The systems were small, consisting of less than 100 input units, were constructed from second- and first-order units, and contained only one, two, or three layers. We used a back-propagation algorithm modified for the higher-order (sigma-pi) units. The simulation studies are still in the early stages, so the performance of the networks has not been thoroughly investigated. It seems safe to say, however, that there is much to be gained by a thorough study of these systems. For example, we have demonstrated that a small system of second-order units trained to distinguish the letters T and C at one scale can continue to distinguish them over changes in scale of factors of at least four without retraining and with satisfactory performance. Similar performance has been obtained for the case of translation invariance.\n\nEven at this stage, some interesting facets of this approach are becoming clear: 1) Even with the constraints imposed by the invariance, it is usually necessary to limit the range of connections in order to restrict the complexity of the network. This is often cited as a problem with higher-order networks, but we take the view that one can learn a great deal more about the nature of a problem by examining it at this level rather than by simply training a network that has a general-purpose architecture. 2) The higher-order networks seem to solve problems in an elegant and simple manner. However, unless one is careful in the design of the network, it performs worse than a simpler conventional network when there is noise in the input field. 3) Learning is often \"quicker\" than in a conventional approach, although this is highly dependent on the specific problem and implementation design. 
It seems that a tradeoff can be made: either faster learning but less noise robustness, or slower learning with more robust performance.\n\nDISCUSSION\n\nWe have shown a simple way to encode geometric invariances into neural networks (instead of training them), though to be useful the networks must be constructed of higher-order units. The invariant encoding is achieved by restricting the allowable network architectures and is independent of learning rules and the form of the sigmoid or threshold functions. The invariance encoding is normally for an entire layer, although it can be on an individual unit basis. It is easy to build one or more invariant layers into a multi-layer net, and different layers can satisfy different invariance requirements. This is useful for operating on internal features or representations in an invariant manner. For learning in such a net, a multi-layered learning rule such as generalized back-propagation7 must be used. In our simulations we have used a generalized back-propagation learning rule to train a two-layer system consisting of a second-order, translation-invariant input layer and a first-order output layer. Note that we have not shown that one cannot encode invariances into layered first-order networks, but the analysis in this paper implies that such invariance would be dependent on the form of the sigmoid function.\n\nWhen invariances are encoded into higher-order neural networks, the number of interconnections required is usually reduced by orders of powers of N, where N is the size of the input. For example, a fully connected, first-order, single-layer net with a single output unit would have order N interconnections; a similar second-order net, order N^2. If this second-order net (or layer) is made shift invariant, the order is reduced to N. The number of multiplies and adds is still of order N^2. 
\n\nIf this second-order net (or layer) is made shift \n\nWe have limited our discussion in this paper to geometric \n\ninvariances, but there seems to be no reason why temporal or other \ninvariances could not be encoded in a similar manner. \n\nREFERENCES \n\n1. \n\n2. \n\n3. \n\n4. \n\n5. \n\n6. \n\n7. \n\nD.H. Ballard and C.M. Brown, Computer Vision (Prentice-Hall, \nEnglewood Cliffs, NJ, 1982). \n\nB. Widrow, IEEE First Int1. Conf. on Neural Networks, 87TH019l-\n7, Vol. 1, p. 143, San Diego, CA, June 1987. \n\nJ.A. Feldman, Biological Cybernetics 46, 27 (1982). \n\nC.L. Giles and T. Maxwell, App1. Optics 26, 4972 (1987). \n\nG.E. Hinton, Proc. 7th IntI. Joint Conf. on Artificial \nIntelligence, ed. A. Drina, 683 (1981). \n\nY.C. Lee, G. Doolen, H.H. Chen, G.Z. Sun, T. Maxwell, H.Y. Lee, \nC.L. Giles, Physica 22D, 276 (1986). \n\nD.E. Rume1hart, G.E. Hinton, and R.J. Williams, Parallel \nDistributed Processing, Vol. 1, Ch. 8, D.E. Rume1hart and J.L. \nMcClelland, eds., (MIT Press, Cambridge, 1986). \n\n\f309 \n\n8. \n\n9. \n\nT . Maxwell, C.L. Giles, Y.C. Lee, and H.H. Chen, Proc. IEEE \nIntI. Conf. on Systems, Man, and Cybernetics, 86CH2364-8, p. \n627, Atlanta, GA, October 1986. \n\nW. Pitts and W.S. McCulloch, Bull. Math. Biophys. 9, 127 \n(1947). \n\n10. M. Minsky and S, Papert, Perceptrons (MIT Press, Cambridge, \n\nMass., 1969). \n\n\f", "award": [], "sourceid": 14, "authors": [{"given_name": "C.", "family_name": "Giles", "institution": null}, {"given_name": "R.", "family_name": "Griffin", "institution": null}, {"given_name": "T.", "family_name": "Maxwell", "institution": null}]}