{"title": "Function Approximation with the Sweeping Hinge Algorithm", "book": "Advances in Neural Information Processing Systems", "page_first": 535, "page_last": 541, "abstract": "", "full_text": "Function Approximat.ion with the \n\nSweeping Hinge Algorithm \n\nDon R. Hush, Fernando Lozano \n\nDept. of Elec. and Compo Engg. \n\nUniversity of New Mexico \nAlbuquerque, NM 87131 \n\nBill Horne \n\nMakeWaves, Inc. \n832 Valley Road \n\nWatchung, NJ 07060 \n\nAbstract \n\nWe present a computationally efficient algorithm for function ap(cid:173)\nproximation with piecewise linear sigmoidal nodes. A one hidden \nlayer network is constructed one node at a time using the method of \nfitting the residual. The task of fitting individual nodes is accom(cid:173)\nplished using a new algorithm that searchs for the best fit by solving \na sequence of Quadratic Programming problems. This approach of(cid:173)\nfers significant advantages over derivative-based search algorithms \n(e.g. backpropagation and its extensions). Unique characteristics \nof this algorithm include: finite step convergence, a simple stop(cid:173)\nping criterion, a deterministic methodology for seeking \"good\" local \nminima, good scaling properties and a robust numerical implemen(cid:173)\ntation. \n\n1 \n\nIntroduction \n\nThe learning algorithm developed in this paper is quite different from the tradi(cid:173)\ntional family of derivative-based descent methods used to train Multilayer Percep(cid:173)\ntrons (MLPs) for function approximation. First, a constructive approach is used, \nwhich builds the network one node at a time. Second, and more importantly, we \nuse piecewise linear sigmoidal nodes instead of the more popular (continuously dif(cid:173)\nferentiable) logistic nodes. These two differences change the nature of the learning \nproblem entirely. It becomes a combinatorial problem in the sense that the number \nof feasible solutions that must be considered in the search is finite. 
We show that this number is exponential in the input dimension, and that the problem of finding the global optimum admits no polynomial-time solution. We then proceed to develop a heuristic algorithm that produces good approximations with reasonable efficiency. This algorithm has a simple stopping criterion, and very few user-specified parameters. In addition, it produces solutions that are comparable to (and sometimes better than) those produced by local descent methods, and it does so using a deterministic methodology, so that the results are independent of initial conditions. \n\n2 Background and Motivation \n\nWe wish to approximate an unknown continuous function f(x) over a compact set with a one-hidden-layer network described by \n\nf_n(x) = a_0 + sum_{i=1}^{n} a_i sigma(x, w_i)    (1) \n\nwhere n is the number of hidden layer nodes (basis functions), x in R^d is the input vector, and {sigma(x, w)} are sigmoidal functions parameterized by a weight vector w. A set of example data, S = {x_i, y_i}, with a total of N samples is available for training and test. \n\nThe models in (1) have been shown to be universal approximators. More importantly, (Barron, 1993) has shown that for a special class of continuous functions, Gamma_C, the generalization error satisfies \n\nE[||f - f_{n,N}||^2] <= ||f - f_n||^2 + E[||f_n - f_{n,N}||^2] = O(1/n) + O(nd log(N) / N) \n\nwhere ||.|| is the appropriate two-norm, f_n is the best n-node approximation to f, and f_{n,N} is the approximation that best fits the samples in S. In this equation ||f - f_n||^2 and E[||f_n - f_{n,N}||^2] correspond to the approximation and estimation error respectively. Of particular interest is the O(1/n) bound on approximation error, which for fixed basis functions is of the form O(1/n^(2/d)) (Barron, 1993). 
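As a concrete illustration of the model in (1), a one-hidden-layer network with a ramp-shaped node can be evaluated as follows. This is our own sketch, not the paper's implementation; the clipped-linear sigma anticipates the piecewise linear nodes defined in Section 3, and all names here are ours.

```python
def ramp_sigmoid(x, w):
    """A sigmoidal basis sigma(x, w), here a clipped-linear ('ramp') node.

    w = (wl, w_plus, w_minus): linear weights plus upper/lower clip levels.
    x is augmented with a leading 1 before the inner product is taken."""
    wl, w_plus, w_minus = w
    z = sum(wi * xi for wi, xi in zip(wl, [1.0] + list(x)))
    return max(w_minus, min(w_plus, z))

def f_n(x, a0, nodes):
    """One-hidden-layer model of equation (1): f_n(x) = a0 + sum_i a_i sigma(x, w_i).

    nodes is a list of (a_i, w_i) pairs."""
    return a0 + sum(a * ramp_sigmoid(x, w) for a, w in nodes)
```

The model is linear in the output weights a_i once the node parameters w_i are fixed, which is what makes the node-at-a-time construction below attractive.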
Barron's result tells us that the (tunable) sigmoidal bases are able to avoid the curse of dimensionality (for functions in Gamma_C). Further, it has been shown that the O(1/n) bound can be achieved constructively (Jones, 1992), that is, by designing the basis functions (nodes) one at a time. The proof of this result is itself constructive, and thus provides a framework for the development of an algorithm which can (in principle) achieve this bound. One manifestation of this algorithm is shown in Figure 1. We call this the iterative approximation algorithm (IIA) because it builds the approximation by iterating on the residual (i.e. the unexplained portion of the function) at each step. This is the same algorithmic strategy used to form bases in numerous other settings, e.g. Gram-Schmidt, Conjugate Gradient, and Projection Pursuit. The difficult part of the IIA algorithm is in the determination of the best fitting basis function sigma_n in step 2. This is the focus of the remainder of this paper. \n\n3 Algorithmic Development \n\nWe begin by defining the hinging sigmoid (HS) node on which our algorithms are based. An HS node performs the function \n\nsigma_h(x, w) = { w_+,      if w_l^T x >= w_+ \n                  w_l^T x,  if w_- <= w_l^T x <= w_+    (2) \n                  w_-,      if w_l^T x <= w_- \n\nwhere w^T = [w_l^T w_+ w_-] and x is an augmented input vector with a 1 in the first component. An example of the surface formed by an HS node on a two-dimensional input is shown in Figure 2. It is comprised of three hyperplanes joined pairwise continuously at two hinge locations. \n\nInitialization: f_0(x) = 0 \nfor n = 1 to n_max do \n  1. Compute Residual:  e_n(x) = f(x) - f_{n-1}(x) \n  2. Fit Residual:      sigma_n(x) = argmin_{sigma in Sigma} ||e_n(x) - sigma(x)|| \n  3. Update Estimate:   f_n(x) = alpha f_{n-1}(x) + beta sigma_n(x) \n     where alpha and beta are chosen to minimize ||f(x) - f_n(x)|| \nendloop \n\nFigure 1: Iterative Approximation Algorithm (IIA). 
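The IIA loop of Figure 1 can be sketched in Python as follows. This is our own illustrative rendering, not the paper's code: fit_residual is a placeholder for the node-fitting step developed below, and least_squares_2d (which finds alpha and beta in step 3 from the samples) is our own helper.

```python
def iterative_approximation(xs, ys, fit_residual, n_max):
    """Sketch of the IIA of Figure 1 on a sample set.

    fit_residual(xs, es) must return a callable sigma approximating the
    residual values es at the inputs xs (step 2; the hard part the paper
    addresses).  alpha, beta of step 3 are chosen by least squares."""
    fn = [0.0] * len(ys)                         # f_{n-1} at the samples
    sigmas, coeffs = [], []
    for _ in range(n_max):
        es = [y - f for y, f in zip(ys, fn)]     # step 1: residual
        sigma = fit_residual(xs, es)             # step 2: fit residual
        sv = [sigma(x) for x in xs]
        alpha, beta = least_squares_2d(fn, sv, ys)   # step 3
        fn = [alpha * f + beta * s for f, s in zip(fn, sv)]
        sigmas.append(sigma)
        coeffs.append((alpha, beta))
    return fn, sigmas, coeffs

def least_squares_2d(u, v, y):
    """Solve min ||y - a*u - b*v|| for (a, b) via the 2x2 normal equations."""
    uu = sum(x * x for x in u)
    vv = sum(x * x for x in v)
    uv = sum(a * b for a, b in zip(u, v))
    uy = sum(a * b for a, b in zip(u, y))
    vy = sum(a * b for a, b in zip(v, y))
    det = uu * vv - uv * uv
    if abs(det) < 1e-12:                  # degenerate (e.g. f_0 = 0):
        return 0.0, (vy / vv if vv else 0.0)  # project onto sigma alone
    return (uy * vv - vy * uv) / det, (vy * uu - uy * uv) / det
```

Note that the update keeps f_n in the span of the nodes fitted so far, which is what lets the O(1/n) rate be achieved one node at a time.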
Figure 2: A Sigmoid Hinge function in two dimensions. \n\nThe upper and middle hyperplanes are joined at \"Hinge 1\" and the lower and middle hyperplanes are joined at \"Hinge 2\". These hinges induce linear partitions on the input space that divide the space into three regions, and the samples in S into three subsets, \n\nS_+ = {(x_i, y_i): w_l^T x_i >= w_+} \nS_l = {(x_i, y_i): w_- <= w_l^T x_i <= w_+}    (3) \nS_- = {(x_i, y_i): w_l^T x_i <= w_-} \n\nThese subsets, and the corresponding regions of the input space, are referred to as the PLUS, LINEAR and MINUS subsets/regions respectively. We refer to this type of partition as a sigmoidal partition. A sigmoidal partition of S will be denoted P = {S_+, S_l, S_-}, and the set of all such partitions will be denoted Pi = {P_k}. Input samples that fall on the boundary between two regions can be assigned to the set on either side. These points are referred to as hinge samples and play a crucial role in subsequent development. Note that once a weight vector w is specified, the partition P is completely determined, but the reverse is not necessarily true. That is, there are generally an infinite number of weight vectors that induce the same partition. \n\nWe begin our quest for a learning algorithm with the development of an expression for the empirical risk. The empirical risk (squared error over the sample set) is defined \n\nE_P(w) = (1/2) sum_{i=1}^{N} (y_i - sigma_h(x_i, w))^2    (4) \n\nThis expression can be expanded into three terms, one for each set in the partition, \n\nE_P(w) = (1/2) sum_{S_-} (y_i - w_-)^2 + (1/2) sum_{S_+} (y_i - w_+)^2 + (1/2) sum_{S_l} (y_i - w_l^T x_i)^2    (5) \n\nAfter further expansion and rearrangement of terms we obtain \n\nE_P(w) = (1/2) w^T R w - w^T r + s_y    (6) \n\nwhere \n\nR = [ R_l 0 0 ; 0 N_+ 0 ; 0 0 N_- ]    r = [ r_l ; s_y^+ ; s_y^- ]    (7) \n\nR_l = sum_{S_l} x_i x_i^T    r_l = sum_{S_l} x_i y_i    s_y = (1/2) sum_{S} y_i^2    s_y^+ = sum_{S_+} y_i    s_y^- = sum_{S_-} y_i    (8) \n\nand N_+, N_l and N_- are the number of samples in S_+, S_l and S_- respectively. 
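The HS node of (2), the sigmoidal partition of (3), and the per-subset risk of (5) can be rendered directly in code. This is an illustrative sketch with our own names; inputs are assumed already augmented with a leading 1.

```python
def hs_node(x_aug, wl, w_plus, w_minus):
    """Hinging sigmoid node of equation (2): the linear response wl^T x,
    clipped at w_plus above and w_minus below."""
    z = sum(a * b for a, b in zip(wl, x_aug))
    return max(w_minus, min(w_plus, z))

def partition(samples, wl, w_plus, w_minus):
    """Sigmoidal partition (3): split samples into the PLUS, LINEAR and
    MINUS subsets according to where wl^T x falls relative to the hinges."""
    s_plus, s_lin, s_minus = [], [], []
    for x_aug, y in samples:
        z = sum(a * b for a, b in zip(wl, x_aug))
        if z >= w_plus:
            s_plus.append((x_aug, y))
        elif z <= w_minus:
            s_minus.append((x_aug, y))
        else:
            s_lin.append((x_aug, y))
    return s_plus, s_lin, s_minus

def empirical_risk(samples, wl, w_plus, w_minus):
    """Empirical risk, written as in (5): one squared-error term per
    subset of the partition (equivalent to half the total squared error)."""
    s_plus, s_lin, s_minus = partition(samples, wl, w_plus, w_minus)
    e = sum((y - w_plus) ** 2 for _, y in s_plus)
    e += sum((y - w_minus) ** 2 for _, y in s_minus)
    e += sum((y - sum(a * b for a, b in zip(wl, x))) ** 2 for x, y in s_lin)
    return 0.5 * e
```

Grouping the error by subset is what makes the quadratic form in (6) possible: the PLUS and MINUS terms depend only on the scalars w_+ and w_-, and the LINEAR term only on w_l.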
The subscript P is used to emphasize that this criterion is dependent on the partition (i.e. P is required to form R and r). In fact, the nature of the partition plays a critical role in determining the properties of the solution. When R is positive definite (i.e. full rank), P is referred to as a stable partition, and when R has reduced rank P is referred to as an unstable partition. A stable partition requires that R_l > 0. For purposes of algorithm development we will assume that R_l > 0 when |S_l| >= N_min, where N_min is a suitably chosen value greater than or equal to d + 1. With this, a necessary condition for a stable partition is that there be at least one sample in S_+ and S_-, and N_l >= N_min. When seeking a minimizing solution for E_P(w) we restrict ourselves to stable partitions because of the potential nonuniqueness associated with solutions to unstable partitions. \n\nDetermining a weight vector that simultaneously minimizes E_P(w) and preserves the current partition can be posed as a constrained optimization problem. This problem takes on the form \n\nmin (1/2) w^T R w - w^T r    subject to A w >= 0    (9) \n\nwhere the inequality constraints are designed to maintain the current partition defined by (3). This is a Quadratic Programming problem with inequality constraints, and because R > 0 it has a unique global minimum. The general Quadratic Programming problem is NP-hard and also hard to approximate (Bellare and Rogaway, 1993). However, the convex case to which we restrict ourselves here (i.e. R > 0) admits a polynomial time solution. In this paper we use the active set algorithm (Luenberger, 1984) to solve (9). With the proper implementation, this algorithm runs in O(k(d^2 + Nd)) time, where k is typically on the order of d or less. \n\nThe solution to the quadratic programming problem in (9) is only as good as the current partition allows. 
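The fixed-partition subproblem (9) can be sketched in pure Python as follows. This is our own simplified illustration, not the paper's active-set solver: we compute the unconstrained minimizer of (1/2) w^T R w - w^T r by solving R w = r with Gaussian elimination, then check the partition constraints A w >= 0. A full active-set method as in (Luenberger, 1984) would add violated constraints as equalities and re-solve; that loop is omitted here.

```python
def solve_linear(R, r):
    """Solve R w = r by Gaussian elimination with partial pivoting.
    R is assumed positive definite, as it is for a stable partition."""
    n = len(r)
    M = [row[:] + [ri] for row, ri in zip(R, r)]   # augmented matrix
    for c in range(n):
        p = max(range(c, n), key=lambda i: abs(M[i][c]))
        M[c], M[p] = M[p], M[c]
        for i in range(c + 1, n):
            f = M[i][c] / M[c][c]
            for j in range(c, n + 1):
                M[i][j] -= f * M[c][j]
    w = [0.0] * n
    for i in reversed(range(n)):
        w[i] = (M[i][n] - sum(M[i][j] * w[j] for j in range(i + 1, n))) / M[i][i]
    return w

def fixed_partition_step(R, r, A):
    """One step toward problem (9): the unconstrained minimizer of
    (1/2) w^T R w - w^T r, plus a feasibility check of A w >= 0.
    An active-set method would iterate on any violated constraints."""
    w = solve_linear(R, r)
    feasible = all(sum(a * b for a, b in zip(row, w)) >= -1e-9 for row in A)
    return w, feasible
```

When the unconstrained minimizer is feasible it is the global solution of (9), since the problem is convex; otherwise the constrained optimum lies on the boundary of the partition, i.e. some samples sit exactly on a hinge.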
The more challenging aspect of minimizing E_P(w) is in the search for a good partition. Unfortunately there is no ordering or arrangement of partitions that is convex in E_P(w), so the search for the optimal partition will be a computationally challenging problem. An exhaustive search is usually out of the question because of the prohibitively large number of partitions, as given by the following lemma. \n\nLemma 1: Let S contain a total of N samples in R^d that lie in general position. Then the number of sigmoidal partitions defined in (3) is Theta(N^(d+1)). \n\nProof: A detailed proof is beyond the scope of this paper, but an intuitive argument follows. It is well-known that the number of linear dichotomies of N points in d dimensions is Theta(N^d) (Edelsbrunner, 1987). Each sigmoidal partition is comprised of two linear dichotomies, one formed by Hinge 1 and the other by Hinge 2, and these dichotomies are constrained to be simple translations of one another. Thus, to enumerate all sigmoidal partitions we note that one of the hinges, say Hinge 1, can take on Theta(N^d) different positions. For each of these the other hinge can occupy only ~N unique positions. The total is therefore Theta(N^(d+1)). \n\nThe search algorithm developed here employs a Quadratic Programming (QP) algorithm at each new partition to determine the optimal weight vector for that partition (i.e. the optimal orientation for the separating hyperplanes). Transitions are made from one partition to the next by allowing hinge samples to flip from one side of the hinge boundary to the other. The search is terminated when a minimum value of E_P(w) is found (i.e. it can no longer be reduced by flipping hinge samples). Such an algorithm is shown in Figure 3. We call this the HingeDescent algorithm because it allows the hinges to \"walk across\" the data in a manner that descends the E_P(w) criterion. 
Note that provisions are made within the algorithm to avoid unstable partitions. Note also that it is easy to modify this algorithm to descend only one hinge at a time, simply by omitting one of the blocks of code that flips samples across the corresponding hinge boundary. \n\n{This routine is invoked with a stable feasible solution W = {w, R, r, A, S_+, S_l, S_-}.} \nprocedure HingeDescent(W) \n  {Allow hinges to walk across the data until a minimizing partition is found.} \n  E = (1/2) w^T R w - w^T r \n  do \n    E_min = E \n    {Flip Hinge 1 Samples.} \n    for each ((x_i, y_i) on Hinge 1) do \n      if ((x_i, y_i) in S_+ and N_+ > 1) then \n        Move (x_i, y_i) from S_+ to S_l, and update R, r, and A \n      elseif ((x_i, y_i) in S_l and N_l > N_min) then \n        Move (x_i, y_i) from S_l to S_+, and update R, r, and A \n      endif \n    endloop \n    {Flip Hinge 2 Samples.} \n    for each ((x_i, y_i) on Hinge 2) do \n      if ((x_i, y_i) in S_- and N_- > 1) then \n        Move (x_i, y_i) from S_- to S_l, and update R, r, and A \n      elseif ((x_i, y_i) in S_l and N_l > N_min) then \n        Move (x_i, y_i) from S_l to S_-, and update R, r, and A \n      endif \n    endloop \n    {Compute optimal solution for new partition.} \n    W = QPSolve(W); \n    E = (1/2) w^T R w - w^T r \n  while (E < E_min); \n  return(W); \nend; {HingeDescent} \n\nFigure 3: The HingeDescent Algorithm. \n\nLemma 2: When started at a stable partition, the HingeDescent algorithm will converge to a stable partition of E_P(w) in a finite number of steps. \n\nProof: First note that when R > 0, a QP solution can always be found in a finite number of steps. The proof of this result is beyond the scope of this paper, but can easily be found in the literature (Luenberger, 1984). Now, by design, HingeDescent always moves from one stable partition to the next, maintaining the R > 0 property at each step so that all QP solutions can be produced in a finite number of steps. 
In addition, E_P(w) is reduced at each step (except the last one) so no partitions are revisited, and since there are a finite number of partitions (see Lemma 1) this algorithm must terminate in a finite number of steps. QED. \n\nAssume that QPSolve runs in O(k(d^2 + Nd)) time as previously stated. Then the run time of HingeDescent is given by O(N_p((k + N_h)d^2 + kNd)), where N_h is the number of samples flipped at each step and N_p is the total number of partitions explored. Typical values for k and N_h are on the order of d, simplifying this expression to O(N_p(d^3 + Nd^2)). N_p can vary widely, but is often substantially less than N. \n\nHingeDescent seeks a local minimum over Pi, and may produce a poor solution, depending on the starting partition. One way to remedy this is to start from several different initial partitions, and then retain the best solution overall. We take a different approach here, one that always starts with the same initial condition, visits several local minima along the way, and always ends up with the same final solution each time. \n\nThe SweepingHinge algorithm works as follows. It starts by placing one of the hinges, say Hinge 1, at the outer boundary of the data. It then sweeps this hinge across the data, M samples at a time (e.g. M = 1), allowing the other hinge (Hinge 2) to descend to an optimal position at each step. The initial hinge locations are determined as follows. A linear fit is formed to the entire data set and the hinges are positioned at opposite ends of the data so that the PLUS and MINUS regions meet the LINEAR region at the two data samples on either end. After the initial linear fit, the hinges are allowed to descend to a local minimum using HingeDescent. Then Hinge 1 is swept across the data M samples at a time. Mechanically this is achieved by moving M additional samples from S_l to S_+ at each step. 
Hinge 2 is allowed to descend to an optimal position at each of these steps using the Hinge2Descent algorithm. This algorithm is identical to HingeDescent except that the code that flips samples across Hinge 1 is omitted. The best overall solution from the sweep is retained and \"fine-tuned\" with one final pass through the HingeDescent algorithm to produce the final solution. \n\nThe run time of SweepingHinge is no worse than N/M times that of HingeDescent. Given this, an upper bound on the (typical) run time for this algorithm (with M = 1) is O(N N_p(d^3 + Nd^2)). Consequently, SweepingHinge scales reasonably well in both N and d, considering the nature of the problem it is designed to solve. \n\n4 Empirical Results \n\nThe following experiment was adapted from (Breiman, 1993). The function f(x) = e^(-||x||^2) is sampled at 100d points {x_i} such that ||x|| <= 3 and ||x|| is uniform on [0, 3]. The dimension d is varied from 4 to 10 (in steps of 2) and models of size 1 to 20 nodes are trained using the IIA/SweepingHinge algorithm. The number of samples traversed at each step of the sweep in SweepingHinge was set to M = 10. N_min was set equal to 3d throughout. A refitting pass was employed after each new node was added in the IIA. The refitting algorithm used HingeDescent to \"fine-tune\" each node before adding the next node. The average sum of squared error, e^2, was computed for both the training data and an independent set of test data of size 200d. Plots of 1/e^2 versus the number of nodes are shown in Figure 4. \n\nFigure 4: Plots of 1/e^2 versus number of nodes for d = 4, 6, 8, 10. Upper (lower) curves are for training (test) data. 
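The experimental data above can be generated as follows. This is our own sketch of the setup, under the assumption that "||x|| uniform on [0, 3]" means a uniformly distributed radius applied to a uniformly random direction; the function and sample counts are as stated in the text, the names are ours.

```python
import math
import random

def sample_breiman(d, n, rng):
    """Generate n points x in R^d with ||x|| uniform on [0, 3] (direction
    drawn uniformly on the sphere via normalized Gaussians), together with
    targets y = f(x) = exp(-||x||^2)."""
    xs, ys = [], []
    for _ in range(n):
        g = [rng.gauss(0.0, 1.0) for _ in range(d)]       # random direction
        norm = math.sqrt(sum(v * v for v in g)) or 1.0
        radius = rng.uniform(0.0, 3.0)                    # ||x|| ~ U[0, 3]
        xs.append([radius * v / norm for v in g])
        ys.append(math.exp(-radius * radius))
    return xs, ys
```

With n = 100d training points and 200d test points this matches the sizes reported for the experiment.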
The curves for the training data are clearly bounded below by a linear function of n (as suggested by inverting the O(1/n) result of Barron's). More importantly however, they show no significant dependence on the dimension d. The curves for the test data show the effect of the estimation error as they start to \"bend over\" around n = 10 nodes. Again however, they show no dependence on dimension. \n\nAcknowledgements \n\nThis work was inspired by the theoretical results of (Barron, 1993) for sigmoidal networks as well as the \"Hinging Hyperplanes\" work of (Breiman, 1993), and the \"Ramps\" work of (Breiman and Friedman, 1994). This work was supported in part by ONR grant number N00014-95-1-1315. \n\nReferences \n\nBarron, A.R. (1993) Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory 39(3):930-945. \n\nBellare, M. & Rogaway, P. (1993) The complexity of approximating a nonlinear program. In P.M. Pardalos (ed.), Complexity in Numerical Optimization, pp. 16-32, World Scientific Pub. Co. \n\nBreiman, L. (1993) Hinging hyperplanes for regression, classification and function approximation. IEEE Transactions on Information Theory 39(3):999-1013. \n\nBreiman, L. & Friedman, J.H. (1994) Function approximation using RAMPS. Snowbird Workshop on Machines that Learn. \n\nEdelsbrunner, H. (1987) Algorithms in Combinatorial Geometry. EATCS Monographs on Theoretical Computer Science, Vol. 10. Springer-Verlag. \n\nJones, L.K. (1992) A simple lemma on greedy approximation in Hilbert space and convergence rates for projection pursuit regression and neural network training. The Annals of Statistics, 20:608-613. \n\nLuenberger, D.G. (1984) Introduction to Linear and Nonlinear Programming. Addison-Wesley. 
\n\n\f", "award": [], "sourceid": 1397, "authors": [{"given_name": "Don", "family_name": "Hush", "institution": null}, {"given_name": "Fernando", "family_name": "Lozano", "institution": null}, {"given_name": "Bill", "family_name": "Horne", "institution": null}]}