{"title": "On the Use of Projection Pursuit Constraints for Training Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 3, "page_last": 10, "abstract": null, "full_text": "On the Use of Projection Pursuit Constraints for \n\nTraining Neural Networks \n\nNathan Illtl'ator'\" \n\nComput.er Science Department \n\nTel-Aviv Universit.y \n\nRamat.-A viv, 69978 ISRAEL \n\nand \n\nInst.itute for Brain and Neural Systems, \n\nBrown University \n\nnin~math,tau.ac.il \n\nAbstract \n\n\\Ve present a novel classifica t.ioll and regression met.hod that com(cid:173)\nbines exploratory projection pursuit. (unsupervised traiuing) with pro(cid:173)\njection pursuit. regression (supervised t.raining), t.o yield a. nev,,' family of \ncost./complexity penalLy terms . Some improved generalization properties \nare demonstrat.ed on real \\vorld problems. \n\n1 \n\nIntroduction \n\nParameter estimat.ion becomes difficult. in high-dimensional spaces due t.o the in(cid:173)\ncreasing sparseness of t.he dat.a. Therefore. when a low dimensional representation \nis embedded in t.he da.t.a. dimensionality l'eJuction methods become useful. One \nsuch met.hod - projection pursuit. regression (Friedman and St.uet.zle, 1981) (PPR) \nis capable of performing dimensionality reduct.ion by composit.ion, namely, it con(cid:173)\nstructs an approximat.ion to the desired response function using a composition of \nlower dimensional smooth functions, These functions depend on low dimensional \nprojections t.hrough t.he data . \n\n\u2022 Research was support.ed by the N at.ional Science Foundat.ion. the Army Research Of(cid:173)\n\nfice, and the Office of Naval Researclr . \n\n3 \n\n\f4 \n\nIntrator \n\nWhen the dimensionality of the problem is in the thousands, even projection pur(cid:173)\nsuit methods are almost alwa.ys over-parametrized, t.herefore, additional smoothing \nis needed for low variance estimation. 
Exploratory Projection Pursuit (Friedman and Tukey, 1974; Friedman, 1987) (EPP) may be useful for that. It searches in a high dimensional space for structure in the form of (semi) linear projections with constraints characterized by a projection index. The projection index may be considered as a universal prior for a large class of problems, or may be tailored to a specific problem based on prior knowledge.

In this paper, the general form of exploratory projection pursuit is formulated to be an additional constraint for projection pursuit regression. In particular, a hybrid combination of supervised and unsupervised artificial neural networks (ANN) is described as a special case. In addition, a specific projection index that is particularly useful for classification (Intrator, 1990; Intrator and Cooper, 1992) is introduced in this context. A more detailed discussion appears in Intrator (1993).

2 Brief Description of Projection Pursuit Regression

Let (X, Y) be a pair of random variables, X ∈ R^d, and Y ∈ R. The problem is to approximate the d-dimensional surface

    f(x) = E[Y | X = x]

from n observations (x_1, y_1), ..., (x_n, y_n).

PPR tries to approximate a function f by a sum of ridge functions (functions that are constant along lines)

    f(x) ≈ Σ_{j=1}^{l} g_j(a_j^T x).

The fitting procedure alternates between an estimation of a direction a and an estimation of a smooth function g, such that at iteration j, the square average of the residuals

    r_{ij}(x_i) = r_{i,j-1} - g_j(a_j^T x_i)

is minimized. This process is initialized by setting r_{i0} = y_i. Usually, the initial values of a_j are taken to be the first few principal components of the data.

Estimation of the ridge functions can be achieved by various nonparametric smoothing techniques such as locally linear functions (Friedman and Stuetzle, 1981), k-nearest neighbors (Hall,
1989b), splines or variable degree polynomials. The smoothness constraint imposed on g_j implies that the actual projection pursuit is achieved by minimizing at iteration j the sum

    Σ_{i=1}^{n} r_{ij}^2(x_i) + C(g_j)

for some smoothness measure C.

Although PPR converges to the desired response function (Jones, 1987), the use of non-parametric function estimation is likely to lead to overfitting. Recent results (Hornik, 1991) suggest that a feed-forward network architecture with a single hidden layer and a rather general fixed activation function is a universal approximator. Therefore, the use of non-parametric single ridge function estimation can be avoided. It is thus appropriate to concentrate on the estimation of good projections. In the next section we present a general framework of the PPR architecture, and in section 4 we restrict it to a feed-forward architecture with sigmoidal hidden units.

3 Estimating The Projections Using Exploratory Projection Pursuit

Exploratory projection pursuit is based on seeking interesting projections of high dimensional data points (Kruskal, 1969; Switzer, 1970; Kruskal, 1972; Friedman and Tukey, 1974; Friedman, 1987; Jones and Sibson, 1987; Hall, 1988; Huber, 1985, for review). The notion of interesting projections is motivated by the observation that for most high-dimensional data clouds, most low-dimensional projections are approximately normal (Diaconis and Freedman, 1984). This finding suggests that the important information in the data is conveyed in those directions whose single dimensional projected distribution is far from Gaussian. Various projection indices (measures for the goodness of a projection) differ on the assumptions about the nature of deviation from normality, and in their computational efficiency.
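As a concrete toy illustration of a projection index, squared excess kurtosis is near zero for Gaussian projections and large for strongly non-Gaussian (e.g. bimodal) ones. This particular index is an illustrative stand-in chosen for brevity, not one of the specific indices advocated in the paper:

```python
import numpy as np

def projection_index(X, a):
    """Toy projection index: squared excess kurtosis of the projection X @ a.
    Near zero for Gaussian projections, large for heavy-tailed or strongly
    bimodal ones (a bimodal mixture has strongly negative excess kurtosis,
    which the squaring turns into a large index value)."""
    z = X @ (a / np.linalg.norm(a))
    z = (z - z.mean()) / z.std()
    return (np.mean(z ** 4) - 3.0) ** 2

rng = np.random.default_rng(1)
n = 5000
# column 0 is a bimodal mixture; the remaining columns are Gaussian noise
bimodal = np.concatenate([rng.normal(-2, 0.5, n // 2), rng.normal(2, 0.5, n // 2)])
X = np.column_stack([bimodal] + [rng.normal(size=n) for _ in range(4)])
e0, e1 = np.eye(5)[0], np.eye(5)[1]
print(projection_index(X, e0) > projection_index(X, e1))  # the bimodal axis wins
```

A search over directions a maximizing such an index would recover the "interesting" bimodal axis while ignoring the purely Gaussian ones.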
They can be considered as different priors motivated by specific assumptions on the underlying model.

To partially decouple the search for a projection vector from the search for a nonparametric ridge function, we propose to add a penalty term, which is based on a projection index, to the energy minimization associated with the estimation of the ridge functions and the projections. Specifically, let ρ(a) be a projection index which is minimized for projections with a certain deviation from normality. At the j'th iteration, we minimize the sum

    Σ_i r_{ij}^2(x_i) + C(g_j) + ρ(a_j).

When a concurrent minimization over several projections/functions is practical, we get a penalty term of the form

    B(f) = Σ_j [C(g_j) + ρ(a_j)].

Since C and ρ may not be linear, the more general measure that does not assume a stepwise approach, but instead seeks l projections and ridge functions concurrently, is given by

    B(f) = C(g_1, ..., g_l) + ρ(a_1, ..., a_l).

In practice, ρ depends implicitly on the training data (the empirical density) and is therefore replaced by its empirical measure ρ̂.

3.1 Some Possible Measures

Some applicable projection indices are discussed in (Huber, 1985; Jones and Sibson, 1987; Friedman, 1987; Hall, 1989a; Intrator, 1990). Probably, all the possible measures should emphasize some form of deviation from normality, but the specific type may depend on the problem at hand. For example, a measure based on the Karhunen-Loève expansion (Mougeot et al., 1991) may be useful for image compression with autoassociative networks, since in this case one is interested in minimizing the L2 norm of the distance between the reconstructed image and the original one, and under mild conditions, the Karhunen-Loève expansion gives the optimal solution.
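The stepwise objective Σ_i r_ij^2(x_i) + C(g_j) + ρ(a_j) can be sketched numerically for a single ridge term. The concrete choices below are illustrative assumptions, not the paper's: the ridge function g_j is a polynomial, C is a second-difference roughness penalty on g_j, and ρ̂ is squared excess kurtosis of the projected sample:

```python
import numpy as np

def penalized_objective(X, resid, a, coef, lam=0.1, mu=0.1):
    """Evaluate  sum_i r_ij^2(x_i) + C(g_j) + rho(a_j)  for one ridge term.
    Hypothetical choices: g_j is the polynomial with coefficients `coef`,
    C penalizes second differences of g_j on a grid (roughness), and
    rho is the empirical squared excess kurtosis of the projection."""
    a = a / np.linalg.norm(a)
    z = X @ a
    r = resid - np.polyval(coef, z)            # residuals after this ridge term
    # C(g_j): roughness of g_j sampled over the projection's range
    grid = np.linspace(z.min(), z.max(), 100)
    g = np.polyval(coef, grid)
    C = lam * np.sum(np.diff(g, 2) ** 2)
    # rho(a_j): empirical deviation of the projected distribution from normality
    zs = (z - z.mean()) / z.std()
    rho = mu * (np.mean(zs ** 4) - 3.0) ** 2
    return np.sum(r ** 2) + C + rho

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))
y = np.sin(X @ np.array([1.0, 0.0, 0.0]))      # response depends on one projection
a = np.array([1.0, 0.0, 0.0])
coef = np.polyfit(X @ a, y, 5)                  # fit g_j on the true direction
print(penalized_objective(X, y, a, coef))
```

In a full fitting loop, both a_j and the coefficients of g_j would be adjusted to drive this sum down, trading data fit against smoothness of g_j and "interestingness" of the projection.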
A different type of prior knowledge is required for classification problems. The underlying assumption then is that the data is clustered (when projecting in the right directions) and that the classification may be achieved by some (nonlinear) mapping of these clusters. In such a case, the projection index should emphasize multi-modality as a specific deviation from normality. A projection index that emphasizes multimodality in the projected distribution (without relying on the class labels) has recently been introduced (Intrator, 1990) and implemented efficiently using a variant of a biologically motivated unsupervised network (Intrator and Cooper, 1992). Its integration into a back-propagation classifier will be discussed below.

3.2 Adding EPP constraints to a back-propagation network

One way of adding some prior knowledge into the architecture is by minimizing the effective number of parameters using weight sharing, in which a single weight is shared among many connections in the network (Waibel et al., 1989; Le Cun et al., 1989). An extension of this idea is "soft weight sharing", which favors irregularities in the weight distribution in the form of multimodality (Nowlan and Hinton, 1992). This penalty improved the generalization results obtained by a weight elimination penalty. Both these methods make an explicit assumption about the structure of the weight space, but with no regard to the structure of the input space.

As described in the context of projection pursuit regression, a penalty term may be added to the energy functional minimized by error back-propagation, for the purpose of measuring directly the goodness of the projections sought by the network. Since our main interest is in reducing overfitting for high dimensional problems, our underlying assumption is that the surface
function to be estimated can be faithfully represented using a low dimensional composition of sigmoidal functions, namely, using a back-propagation network in which the number of hidden units is much smaller than the number of input units. Therefore, the penalty term may be added only to the hidden layer. The synaptic modification equations of the hidden units' weights become

    ∂w_ij/∂t = -ε [ ∂E(w, x)/∂w_ij + ∂ρ(w_1, ..., w_n)/∂w_ij + (contribution of cost/complexity terms) ].

An approach of this type has been used in image compression, with a penalty aimed at minimizing the entropy of the projected distribution (Bichsel and Seitz, 1989). This penalty certainly measures deviation from normality, since entropy is maximized for a Gaussian distribution.

4 Projection Index for Classification: The Unsupervised BCM Neuron

Intrator (1990) has recently shown that a variant of the Bienenstock, Cooper and Munro neuron (Bienenstock et al., 1982) performs exploratory projection pursuit using a projection index that measures multi-modality. This neuron version allows theoretical analysis of some visual deprivation experiments (Intrator and Cooper, 1992), and is in agreement with the vast experimental results on visual cortical plasticity (Clothiaux et al., 1991). A network implementation which can find several projections in parallel while retaining its computational efficiency was found to be applicable for extracting features from very high dimensional vector spaces (Intrator and Gold, 1993; Intrator et al., 1991; Intrator, 1992).

The activity of neuron k in the network is c_k = Σ_i x_i w_ik + w_0k. The inhibited activity and threshold of the k'th neuron are given by

    c̃_k = σ(c_k − η Σ_{j≠k} c_j),    Θ_k^m = E[c̃_k^2].

The threshold Θ_k^m is the point at which the modification function
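The network activities and thresholds defined above can be sketched numerically. Two assumptions are made for concreteness: σ is taken to be the logistic function (the text only requires a sigmoidal), and the expectation E[·] in the threshold is replaced by a batch average:

```python
import numpy as np

def bcm_activities(X, W, b, eta=0.1):
    """Inhibited activities of the network: c_k = x . w_k + w_0k, then
    c~_k = sigma(c_k - eta * sum_{j != k} c_j).
    The logistic sigma is an assumption; any sigmoidal would do."""
    C = X @ W + b                          # linear activities c_k, shape (n, K)
    total = C.sum(axis=1, keepdims=True)
    inhibited = C - eta * (total - C)      # c_k - eta * sum over the other neurons
    return 1.0 / (1.0 + np.exp(-inhibited))

def bcm_thresholds(C_tilde):
    """Theta^m_k = E[c~_k^2], estimated here by the sample mean over the batch."""
    return np.mean(C_tilde ** 2, axis=0)

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 8))              # 100 inputs of dimension 8
W = rng.normal(scale=0.1, size=(8, 3))     # 3 neurons searching 3 projections
b = np.zeros(3)
Ct = bcm_activities(X, W, b)
theta = bcm_thresholds(Ct)
print(Ct.shape, theta.shape)               # (100, 3) (3,)
```

In a training loop, W would be updated by the penalized gradient rule of section 3.2, with the BCM projection index supplying the ∂ρ/∂w term.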