{"title": "488 Solutions to the XOR Problem", "book": "Advances in Neural Information Processing Systems", "page_first": 410, "page_last": 416, "abstract": null, "full_text": "488 Solutions to the XOR Problem \n\nFrans M. Coetzee * \neoetzee@eee.emu.edu \n\nVirginia L. Stonick \n\nginny@eee.emu.edu \n\nDepartment of Electrical Engineering \n\nDepartment of Electrical Engineering \n\nCarnegie Mellon University \n\nPittsburgh, PA 15213 \n\nCarnegie Mellon University \n\nPittsburgh, PA 15213 \n\nAbstract \n\nA globally convergent homotopy method is defined that is capable \nof sequentially producing large numbers of stationary points of the \nmulti-layer perceptron mean-squared error surface. Using this al(cid:173)\ngorithm large subsets of the stationary points of two test problems \nare found. It is shown empirically that the MLP neural network \nappears to have an extreme ratio of saddle points compared to \nlocal minima, and that even small neural network problems have \nextremely large numbers of solutions. \n\n1 \n\nIntroduction \n\nThe number and type of stationary points of the error surface provide insight into \nthe difficulties of finding the optimal parameters ofthe network, since the stationary \npoints determine the degree of the system[l]. Unfortunately, even for the small \ncanonical test problems commonly used in neural network studies, it is still unknown \nhow many stationary points there are, where they are, and how these are divided \ninto minima, maxima and saddle points. \n\nSince solving the neural equations explicitly is currently intractable, it is of interest \nto be able to numerically characterize the error surfaces of standard test problems. \nTo perform such a characterization is non-trivial, requiring methods that reliably \nconverge and are capable of finding large subsets of distinct solutions. 
It can be shown [2] that methods which produce only one solution set on a given trial become inefficient (at a factorial rate) at finding large sets of multiple distinct solutions, since the same solutions are found repeatedly. This paper presents the first provably globally convergent homotopy methods capable of finding large subsets of the stationary points of the neural network error surface. These methods are used to empirically quantify not only the number but also the type of solutions for some simple neural networks. \n\n* Currently with Siemens Corporate Research, Princeton NJ 08540 \n\n1.1 Sequential Neural Homotopy Approach Summary \n\nWe briefly acquaint the reader with the principles of homotopy methods, since these approaches differ significantly from standard descent procedures. \n\nHomotopy methods solve systems of nonlinear equations by mapping the known solutions of an initial system to the desired solutions of the unsolved system of equations. The basic method is as follows: given a final set of equations f(x) = 0, x ∈ D ⊆ R^n, whose solution is sought, a homotopy function h : D × T → R^n is defined in terms of a parameter τ ∈ T ⊂ R, such that \n\nh(x, τ) = g(x) when τ = 0, and h(x, τ) = f(x) when τ = 1, \n\nwhere the initial system of equations g(x) = 0 has a known solution. For optimization problems f(x) = ∇x ε²(x), where ε²(x) is the error measure. Conceptually, h(x, τ) = 0 is solved numerically for x for increasing values of τ, starting at τ = 0 at the known solution, and incrementally varying τ and correcting the solution x until τ = 1, thereby tracing a path from the initial to the final solutions. \n\nThe power and the problems of homotopy methods lie in constructing a suitable function h. 
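\nThe basic incremental-τ-plus-correction loop just described can be sketched in a few lines. The following is a deliberately naive illustration on a toy scalar system (fixed τ grid, finite-difference slope, arbitrary example functions), not the paper's provably convergent path-tracking machinery:

```python
import numpy as np

def trace_homotopy(f, g, x0, steps=200, newton_iters=25, tol=1e-12):
    """Trace the convex homotopy h(x, t) = (1 - t)*g(x) + t*f(x) = 0
    from the known root of g at t = 0 to a root of f at t = 1,
    correcting with Newton's method after each small step in t."""
    x = x0
    for t in np.linspace(0.0, 1.0, steps + 1):
        for _ in range(newton_iters):
            h = (1.0 - t) * g(x) + t * f(x)
            if abs(h) < tol:
                break
            eps = 1e-7  # forward-difference slope of h with respect to x
            dh = ((1.0 - t) * (g(x + eps) - g(x)) + t * (f(x + eps) - f(x))) / eps
            x -= h / dh
    return x

# Deform the trivial system g(x) = x - 1 into f(x) = x**3 - 2x - 5;
# the root is carried from x = 1 to the real root of f near 2.0946.
root = trace_homotopy(lambda x: x**3 - 2.0*x - 5.0, lambda x: x - 1.0, x0=1.0)
```

Each solve at a new τ is warm-started at the previous root, which is what makes the incremental scheme cheap when the path is well behaved. \n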
Unfortunately, for a given f most choices of h will fail, and, with the exception of polynomial systems, no guaranteed procedures for selecting h exist. Paths generally do not connect the initial and final solutions, either due to non-existence of solutions, or due to paths diverging to infinity. However, if a theoretical proof of existence of a suitable trajectory can be constructed, well-established numerical procedures exist that reliably track the trajectory. \n\nThe following theorem, proved in [2], establishes that a suitable homotopy exists for the standard feed-forward backpropagation neural networks: \n\nTheorem 1.1 Let ε² be the unregularized mean square error (MSE) problem for the multi-layer perceptron network, with weights β ∈ R^n. Let β0 ∈ U ⊂ R^n and a ∈ V ⊂ R^n, where U and V are open bounded sets. Then, except for a set of measure zero of (β0, a) ∈ U × V, the solutions (β, τ) of the set of equations \n\nh(β, τ) = (1 − τ)(β − β0) + τ ∇β (ε² + μ ψ(||β − a||²)) = 0    (1) \n\nwhere μ > 0 and ψ : R → R satisfies 2ψ″(a²)a² + ψ′(a²) > 0 as a → ∞, form non-crossing one-dimensional trajectories for all τ ∈ R, which are bounded for all τ ∈ [0, 1]. Furthermore, the path through (β0, 0) connects to at least one solution (β*, 1) of the regularized MSE error problem \n\n∇β (ε² + μ ψ(||β − a||²)) = 0    (2) \n\nOn τ ∈ [0, 1] the approach corresponds to a pseudo-quadratic error surface being deformed continuously into the final neural network error surface¹. Multiple solutions can be obtained by choosing different initial values β0. Every desired solution β* is accessible via an appropriate choice of a, since β0 = β* suffices. \n\n¹ The common engineering heuristic whereby some arbitrary error surface is relaxed into another error surface generally does not yield well defined trajectories. 
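\nThe homotopy map of equation (1) is cheap to evaluate for a small network. The sketch below does so with finite-difference gradients on an illustrative stand-in problem; the data, the weight layout, and the choices ψ(u) = u, a = 0 are assumptions made here for illustration, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in: 4 random input/target pairs and a 3-input,
# 2-hidden-node tanh network (weight layout assumed here).
X = rng.standard_normal((4, 3))
y = rng.uniform(-0.8, 0.8, size=4)
H = 2
n = 3 * H + H                         # hidden weights plus output weights

def mse(beta):
    W = beta[:3 * H].reshape(H, 3)    # input -> hidden weights
    v = beta[3 * H:]                  # hidden -> output weights
    out = np.tanh(np.tanh(X @ W.T) @ v)
    return np.mean((out - y) ** 2)

def grad(f, x, eps=1e-6):             # central finite-difference gradient
    g = np.empty_like(x)
    for i in range(x.size):
        e = np.zeros_like(x); e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

mu = 0.05                             # with psi(u) = u and a = 0

def h(beta, tau, beta0):
    """Homotopy map of equation (1): (1-tau)(beta-beta0) + tau*grad of
    the regularized MSE."""
    reg_err = lambda b: mse(b) + mu * np.sum(b ** 2)
    return (1 - tau) * (beta - beta0) + tau * grad(reg_err, beta)

beta0 = rng.standard_normal(n)
# At tau = 0 the map reduces to beta - beta0, so beta0 is the known root;
# at tau = 1 it is the gradient of the regularized MSE, as in equation (2).
assert np.allclose(h(beta0, 0.0, beta0), 0.0)
```

This makes concrete why any random β0 can start a path: the τ = 0 system is solved trivially by β0 itself. \n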
\nFigure 1 qualitatively illustrates typical paths obtained for this homotopy². The paths typically contain only a few solutions, are disconnected, and diverge to infinity. A novel two-stage homotopy [2, 3] is used to overcome these problems by constructing and solving two homotopy equations. The first homotopy system is as described above. A synthetic second homotopy solves an auxiliary set of equations on a non-Euclidean compact manifold (S^n(0; R) × A, where A is a compact subset of R) and is used to move between the disconnected trajectories of the first homotopy. The method makes use of the topological properties of the compact manifold to ensure that the secondary homotopy paths do not diverge. \n\nFigure 1: (a) Typical homotopy trajectories, illustrating divergence of paths and multiple solutions occurring on one path. (b) Plot of two-dimensional vectors used as training data for the second test problem (Yin-Yang problem). \n\n2 Test Problems \n\nThe test problems described in this paper are small, (i) to allow for a large number of repeated runs, and (ii) to make it possible to numerically distinguish between solutions. Classification problems were used since these present the only interesting small problems, even though the MSE criterion is not necessarily best for classification. Unlike in most classification tasks, all algorithms were forced to approximate the stationary point accurately by requiring the l1 norm of the gradient to be less than 10^-10, and ensuring that solutions differed in the l1 norm by more than 0.01. \n\nThe historical XOR problem is considered first. The data points (−1, −1), (1, 1), (−1, 1) and (1, −1) were trained to the target values −0.8, −0.8, 0.8 and 0.8. 
A network with three inputs (one constant), two hidden layer nodes and one output node was used, with hyperbolic tangent transfer functions on the hidden and final nodes. The regularization used μ = 0.05, ψ(x) = x and a = 0 (no bifurcations were found for this value during simulations). This problem was chosen since it is small enough to serve as a benchmark for comparing the convergence and performance of the different algorithms. The second problem, referred to as the Yin-Yang problem, is shown in Figure 1. The problem has 23 and 22 data points in classes one and two respectively, and target values ±0.7. Empirical evidence indicates that the smallest single hidden layer network capable of solving the problem has five hidden nodes. We used a net with three inputs, five hidden nodes and one output. This problem is interesting since relatively high classification accuracy is obtained using only a single neuron, but 100% classification performance requires at least five hidden nodes and one of only a few global weight solutions. \n\n² Note that the homotopy equation and its trajectories exist outside the interval τ ∈ [0, 1]. \n\nThe stationary points form equivalence classes under renumbering of the weights or appropriate interchange of weight signs. For the XOR problem each solution class contains up to 2² · 2! = 8 distinct solutions; for the Yin-Yang network, there are 2⁵ · 5! = 3840 symmetries. The equivalence classes are reported in the following sections. \n\n3 Test Results \n\nA Polak-Ribière conjugate gradient (CG) method was used as a control, since this method can find only minima, in contrast to the other algorithms, all of which are attracted by all stationary points. In the second algorithm, the homotopy equation (1) was solved by following the main path until divergence. 
A damped Newton (DN) method and the two-stage homotopy method completed the set of four algorithms considered. The different algorithms were initialized with the same random weights β0 ∈ S^(n−1)(0; √(2n)). \n\n3.1 Control - The XOR problem \n\nThe total number and classification of the solutions obtained for 250 iterations of each algorithm are shown in Table 1. \n\nTable 1: Number of equivalence class solutions obtained. XOR problem. \n\nAlgorithm | # Solutions | # Minima | # Maxima | # Saddle Points \nCG | 17 | 17 | 0 | 0 \nDN | 44 | 6 | 0 | 38 \nOne Stage | 28 | 16 | 0 | 12 \nTwo Stage | 61 | 17 | 0 | 44 \nTotal Distinct | 61 | 17 | 0 | 44 \n\nThe probability of finding a given solution on a trial is shown in Figure 2. The two-stage homotopy method finds almost every solution from every initial point. In contrast to the homotopy approaches, the Newton method exhibits poor convergence, even when heavily damped. The sets of saddle points found by the DN algorithm and the homotopy algorithms are to a large extent disjoint, even though the same initial weights were used. For the Newton method, solutions close to the initial point are typically obtained, while the initial point for the homotopy algorithm might differ significantly from the final solution. This difference illustrates that homotopy arrives at solutions in a fundamentally different way than descent approaches. \n\nFigure 2: Probability of finding equivalence class i on a trial. Solutions have been sorted based on the percentage of the training set correctly classified. Dark bars indicate local minima, light bars saddle points. Panels, top to bottom: conjugate gradient, Newton, single stage homotopy, two stage homotopy. XOR problem. \n\nTable 2: Number of solutions correctly classifying x% of target data. \n\nClassification | 25% | 50% | 75% | 100% \nMinimum | 17 | 17 | 4 | 4 \nSaddle | 44 | 44 | 20 | 0 \nTotal Distinct | 61 | 61 | 24 | 4 \n\nBased on these results we conclude that the two-stage homotopy meets its objective of significantly increasing the number of solutions produced on a single trial. The homotopy algorithms converge more reliably than Newton methods, in theory and in practice. These properties make homotopy attractive for characterizing error surfaces. Finally, due to the large number of trials and the significant overlap between the solution sets of very different algorithms, we believe that Tables 1-2 represent accurate estimates of the number and types of solutions to the regularized XOR problem. \n\n3.2 Results on the Yin-Yang problem \n\nThe first three algorithms for the Yin-Yang problem were evaluated for 100 trials. The conjugate gradient method showed excellent stability, while the Newton method exhibited serious convergence problems, even with heavy damping. The two-stage algorithm was still producing solutions when the runs were terminated after multiple weeks of computer time, allowing evaluation of only ten different initial points. \n\nTable 3: Number of equivalence class solutions obtained. 
Yin-Yang problem. \n\nAlgorithm | # Solutions | # Minima | # Maxima | # Saddle Points \nConjugate Gradient | 14 | 14 | 0 | 0 \nDamped Newton | 10 | 0 | 0 | 10 \nOne Stage Homotopy | 78 | 15 | 0 | 63 \nTwo Stage Homotopy | 1633 | 12 | 0 | 1621 \nTotal Distinct | 1722 | 28 | 0 | 1694 \n\nTable 4: Number of solutions correctly classifying x% of target data. \n\nClassification | 75% | 80% | 90% | 95% | 96% | 97% | 98% | 99% | 100% \nMinimum | 28 | 28 | 28 | 26 | 26 | 5 | 5 | 2 | 2 \nSaddle | 1694 | 1694 | 1682 | 400 | 400 | 13 | 13 | 3 | 3 \nTotal Distinct | 1722 | 1722 | 1710 | 426 | 426 | 18 | 18 | 5 | 5 \n\nThe results in Tables 3-4 for the number of minima are believed to be accurate, due to the verification provided by the conjugate gradient method. The number of saddle points should be seen as a lower bound. The regularization ensured that the saddle points were well conditioned, i.e. the Hessian was not rank deficient, and these solutions are therefore distinct point solutions. \n\n4 Conclusions \n\nThe homotopy methods introduced in this paper overcome the difficulties of poor convergence and the problem of repeatedly finding the same solutions. The use of these methods therefore produces significant new empirical insight into some extraordinary unsuspected properties of the neural network error surface. \n\nThe error surface appears to consist of relatively few minima, separated by an extraordinarily large number of saddle points. While one recent paper by Goffe et al. [4] gave numerical estimates from which it was concluded that a large number of minima exist in neural nets (they did not find a significant number of these), this extreme ratio of saddle points to minima appears to be unexpected. No maxima were discovered in the above runs; in fact none appear to exist within the sphere where solutions were sought (this seems likely given the regularization). 
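\nThe distinct-class totals in Tables 1 and 3 can be expanded into raw solution counts using the equivalence-class sizes given in Section 2. A quick check, assuming the classes are complete:

```python
from math import factorial

def symmetries(h):
    """Equivalence-class size for one hidden layer of h tanh nodes:
    2**h sign flips (tanh is odd) times h! permutations of the nodes."""
    return 2 ** h * factorial(h)

# XOR network (2 hidden nodes): 61 distinct classes, 17 of them minima (Table 1)
assert symmetries(2) == 8
print(61 * symmetries(2))      # -> 488 stationary points
print(17 * symmetries(2))      # -> 136 minima

# Yin-Yang network (5 hidden nodes): 1722 distinct classes, 28 minima (Table 3)
assert symmetries(5) == 3840
print(1722 * symmetries(5))    # -> 6,612,480 stationary points
print(28 * symmetries(5))      # -> 107,520 minima
```

These products are exactly the headline figures quoted in the conclusions. \n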
\nThe numerical results reveal astounding complexity in the neural network problem. If the equivalence classes are complete, then 488 solutions for the XOR problem are implied, of which 136 are minima. For the Yin-Yang problem, 6,600,000+ solutions and 107,520+ minima were characterized. For the simple architectures considered, these numbers appear extremely high. We are unaware of any other system of equations having these remarkable properties. \n\nFinally, it should be noted that the large number of saddle points and the small ratio of minima to saddle points in neural problems can create tremendous computational difficulties for approaches which produce stationary points rather than simple minima. The efficiency of any such algorithm at producing solutions will be negated by the fact that, from an optimization perspective, most of these solutions will be useless. \n\nAcknowledgements. The partial support of the National Science Foundation under grant MIP-9157221 is gratefully acknowledged. \n\nReferences \n\n[1] E. H. Rothe, Introduction to Various Aspects of Degree Theory in Banach Spaces. Mathematical Surveys and Monographs (23), Providence, Rhode Island: American Mathematical Society, 1986. ISBN 0-8218-1522-9. \n\n[2] F. M. Coetzee, Homotopy Approaches for the Analysis and Solution of Neural Network and Other Nonlinear Systems of Equations. PhD thesis, Carnegie Mellon University, Pittsburgh, PA, May 1995. \n\n[3] F. M. Coetzee and V. L. Stonick, \"Sequential homotopy-based computation of multiple solutions to nonlinear equations,\" in Proc. IEEE ICASSP, (Detroit), IEEE, May 1995. \n\n[4] W. L. Goffe, G. D. Ferrier, and J. Rogers, \"Global optimization of statistical functions with simulated annealing,\" Jour. Econometrics, vol. 60, no. 1-2, pp. 65-99, 1994. 
\n\n\f", "award": [], "sourceid": 1298, "authors": [{"given_name": "Frans", "family_name": "Coetzee", "institution": null}, {"given_name": "Virginia", "family_name": "Stonick", "institution": null}]}