{"title": "Back Propagation is Sensitive to Initial Conditions", "book": "Advances in Neural Information Processing Systems", "page_first": 860, "page_last": 867, "abstract": null, "full_text": "Back Propagation is Sensitive to Initial Conditions \n\nJohn F. Kolen \n\nJordan B. Pollack \n\nLaboratory for Artificial Intelligence Research \n\nThe Ohio State University \nColumbus. OH 43210. USA \nkolen-j@cis.ohio-state.edu \npollack@cis.ohio-state.edu \n\nAbstract \n\nfunctions  with \n\nlearning  simple \n\nThis paper explores the effect of initial weight selection on feed-forward \nthe  back-propagation \nnetworks \ntechnique.  We  first  demonstrate.  through  the  use  of  Monte  Carlo \ntechniques. that the magnitude of the initial condition vector (in weight \nspace) is a very significant parameter in convergence time variability. In \norder  to  further  understand \nthis  result.  additional  deterministic \nexperiments  were  performed.  The  results  of  these  experiments \ndemon~trate the extreme sensitivity of back propagation to initial weight \nconfiguration. \n\nINTRODUCTION \n\n1 \nBack Propagation (Rwnelhart et al .\u2022  1986) is  the network training method  of choice for \nmany neural network projects. and for good reason. Like other weak methods, it is simple \nto implement, faster than  many  other \"general\" approaches. well-tested by  the field.  and \neasy  to  mold  (with domain knowledge  encoded in the  learning  environment) into  very \nspecific and efficient algorithms. \nRumelhart  et al.  made  a  confident statement:  for  many  tasks.  \"the  network:  rarely  gets \nstuck in poor local mininla that are significantly worse than the global minima. \"(p.  536) \nAccording to them. initial weights  of exactly 0 cannot be  used. since symmetries in the \nenvironment  are  not sufficient  to break  symmetries  in initial weights.  Since  their paper \nwas  published.  
the convention in the field has been to choose initial weights with a uniform distribution between plus and minus ρ, usually set to 0.5 or less. \nThe convergence claim was based solely upon their empirical experience with the back propagation technique. Since then, Minsky & Papert (1988) have argued that there exists no proof of convergence for the technique, and several researchers (e.g. Judd 1988) have found that the convergence time must be related to the difficulty of the problem, otherwise an unsolved computer science question (whether P = NP) would finally be answered. We do not wish to make claims about convergence of the technique in the limit (with vanishing step-size), or the relationship between task and performance, but wish to talk about a pervasive behavior of the technique which has gone unnoticed for several years: the sensitivity of back propagation to initial conditions. \n\n2 THE MONTE-CARLO EXPERIMENT \nInitially, we performed empirical studies to determine the effect of learning rate, momentum rate, and the range of initial weights on t-convergence (Kolen and Goel, to appear). We use the term t-convergence to refer to whether or not a network, starting at a precise initial configuration, could learn to separate the input patterns according to a boolean function (correct outputs above or below 0.5) within t epochs. The experiment consisted of training a 2-2-1 network on exclusive-or while varying three independent variables in 114 combinations: learning rate, η, equal to 1.0 or 2.0; momentum rate, α, equal to 0.0, 0.5, or 0.9; and initial weight range, ρ, equal to 0.1 to 0.9 in 0.1 increments, and 1.0 to 10.0 in 1.0 increments. Each combination of parameters was used to initialize and train a number of networks.¹ 
Figure 1 plots the percentage of t-convergent (where t = 50,000 epochs of 4 presentations) initial conditions for the 2-2-1 network trained on the exclusive-or problem. From the figure we conclude that the choice of ρ ≤ 0.5 is more than a convenient symmetry-breaking default; it is quite necessary to obtain low levels of nonconvergent behavior. \n\n[Figure 1: Percentage T-Convergence vs. Initial Weight Range. Percent nonconvergence after 50,000 trials, plotted against ρ from 0.0 to 10.0, for the settings η=1.0/α=0.0, η=1.0/α=0.5, η=1.0/α=0.9, and η=2.0/α=0.9.] \n\n3 SCENES FROM EXCLUSIVE-OR \nWhy do networks exhibit the behavior illustrated in Figure 1? While some might argue that very high initial weights (i.e. ρ > 10.0) lead to very long convergence times since the derivative of the semi-linear sigmoid function is effectively zero for large weights, this \n\n1. Numbers ranged from 8 to 8355, depending on availability of computational resources. Those data points calculated with small samples were usually surrounded by data points with larger samples. 
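The t-convergence experiment described above can be sketched as follows. This is a minimal reconstruction for illustration, not the authors' original code; the function names (t_convergent, nonconvergence_rate), the batch-mode update, and the weight layout are our assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def t_convergent(w, eta, alpha, t_max):
    # 2-2-1 network; w holds 9 weights: 4 input->hidden, 2 hidden
    # biases, 2 hidden->output, 1 output bias (layout is our choice).
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    T = np.array([0, 1, 1, 0], dtype=float)   # exclusive-or targets
    W1 = w[0:4].reshape(2, 2).copy()
    b1 = w[4:6].copy()
    W2 = w[6:8].copy()
    b2 = float(w[8])
    dW1 = np.zeros_like(W1); db1 = np.zeros_like(b1)
    dW2 = np.zeros_like(W2); db2 = 0.0
    for _ in range(t_max):
        H = sigmoid(X @ W1 + b1)              # hidden activations (4, 2)
        O = sigmoid(H @ W2 + b2)              # outputs (4,)
        if np.array_equal(O > 0.5, T > 0.5):
            return True                       # correct side of 0.5: t-convergent
        # batch back-propagation with momentum
        dO = (T - O) * O * (1.0 - O)
        dH = np.outer(dO, W2) * H * (1.0 - H)
        dW2 = eta * (H.T @ dO) + alpha * dW2
        db2 = eta * dO.sum() + alpha * db2
        dW1 = eta * (X.T @ dH) + alpha * dW1
        db1 = eta * dH.sum(axis=0) + alpha * db1
        W2 += dW2; b2 += db2; W1 += dW1; b1 += db1
    return False

def nonconvergence_rate(rho, eta, alpha, n_nets, t_max, seed=0):
    # draw initial weights uniformly from [-rho, rho], as in the paper
    rng = np.random.default_rng(seed)
    fails = sum(1 for _ in range(n_nets)
                if not t_convergent(rng.uniform(-rho, rho, 9), eta, alpha, t_max))
    return fails / n_nets
```

Sweeping rho over the grid of (η, α) settings listed above and plotting nonconvergence_rate against ρ reproduces the kind of curve shown in Figure 1.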
\n\n[Figures 2 through 13. Figure 2: schematic network; the numbers on the links and in the nodes identify the weights. Figure 3: (-5-3+3+6Y-1-6+7X), η=3.25 α=0.40. Figure 4: (+4-7+6+0-3Y+1X+1), η=2.75 α=0.00. Figure 5: (-5+5+1-6+3XY+8+3), η=2.75 α=0.80. Figure 6: (YX-3+6+8+3+1+7-3), η=3.25 α=0.00. Figure 7: (Y+3-9-2+6+7-3X+7), η=3.25 α=0.60. Figure 8: (-6-4XY-6-6+9-4-9), η=3.00 α=0.50. Figure 9: (-2+1+9-1X-3+8Y-4), η=2.75 α=0.20. Figure 10: (+1+8-3-6X-1+1+8Y), η=3.50 α=0.90. Figure 11: (+7+4-9-9-5Y-3+9X), η=3.00 α=0.70. Figure 12: (-9.0, -1.8), step 0.018. Figure 13: (-6.966, -0.500), step 0.004.] \n\ndoes not explain the fact that when ρ is between 2.0 and 4.0, the non-t-convergence rate varies from 5 to 50 percent. \nThus, we decided to utilize a more deterministic approach for eliciting the structure of initial conditions giving rise to t-convergence. Unfortunately, most networks have many weights, and thus many dimensions in initial-condition space. We can, however, examine 2-dimensional slices through the space in great detail. A slice is specified by an origin and two orthogonal directions (the X and Y axes). In the figures below, we vary the initial weights regularly throughout the plane formed by the axes (with the origin in the lower left-hand corner) and collect the results of running back-propagation to a particular time limit for each initial condition. The map is displayed with grey-level linearly related to time of convergence: black meaning not t-convergent and white representing the fastest convergence time in the picture. Figure 2 is a schematic representation of the networks used in this and the following experiment. The numbers on the links and in the nodes will be used for identification purposes. 
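The slice-scanning procedure just described can be sketched as a generic grid scan. This is our illustrative reconstruction, and the trainer callback (returning the epoch of t-convergence, or -1 when the run is not t-convergent) is an assumed interface rather than the authors' code.

```python
import numpy as np

def convergence_map(base_w, xi, yi, lo, hi, step, trainer):
    # Scan a 2-D slice of initial-condition space: weights xi and yi of
    # base_w are varied over [lo, hi) while the other weights stay fixed.
    # trainer(w) returns the epoch of t-convergence, or -1 if none.
    xs = np.arange(lo, hi, step)
    grid = np.full((len(xs), len(xs)), -1, dtype=int)
    for i, x in enumerate(xs):
        for j, y in enumerate(xs):
            w = np.array(base_w, dtype=float)
            w[xi], w[yi] = x, y
            grid[j, i] = trainer(w)   # row = Y value, column = X value
    return grid
```

Rendering grey levels from the returned grid (black for -1, white for the fastest time in the picture) yields the kind of map shown in the figures.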
Figures 3 through 11 show several interesting \"slices\" of the initial condition space for 2-2-1 networks trained on exclusive-or. Each slice is compactly identified by its 9-dimensional weight vector and associated learning/momentum rates. For instance, the vector (-3+2+7-4X+5-2-6Y) describes a network with an initial weight of -0.3 between the left hidden unit and the left input unit. Likewise, \"+5\" in the sixth position represents an initial bias of 0.5 to the right hidden unit. The letters \"X\" and \"Y\" indicate that the corresponding weight is varied along the X- or Y-axis from -10.0 to +10.0 in steps of 0.1. All the figures in this paper contain the results of 40,000 runs of back-propagation (i.e. 200 pixels by 200 pixels) for up to 200 epochs (where an epoch consists of 4 training examples). \nFigures 12 and 13 present a closer look at the sensitivity of back-propagation to initial conditions. These figures zoom into a complex region of Figure 11; the captions list the location of the origin and step size used to generate each picture. \nSensitivity behavior can also be demonstrated with even simpler functions. Take the case of a 2-2-1 network learning the or function. Figure 14 shows the effect of learning \"or\" on networks (+5+5-1X+5-1Y+3-1), varying weights 4 (X-axis) and 7 (Y-axis) from -20.0 to 20.0 in steps of 0.2. Figure 15 shows the same region, except that it partitions the display according to equivalent solution networks after t-convergence (200 epoch limit), rather than the time to convergence. Two networks are considered equivalent² if their weights have the same sign. Since there are 9 weights, there are 512 (2^9) possible network equivalence classes. 
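The sign-based equivalence relation used for rendering can be computed directly. The function below is a sketch under our own naming, encoding the sign pattern of the 9 weights as an integer class label.

```python
import numpy as np

def equivalence_class(weights):
    # Two networks are equivalent when their weights agree in sign, so a
    # 9-weight network falls into one of 2**9 = 512 classes.  Encode the
    # sign pattern as a 9-bit integer (non-negative weight -> bit 1).
    bits = (np.asarray(weights, dtype=float) >= 0).astype(int)
    return int(''.join(str(b) for b in bits), 2)
```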
Figures 16 through 25 show successive zooms into the central swirl, identified by the XY coordinate of the lower-left corner and pixel step size. After 200 iterations, the resulting networks could be partitioned into 37 (both convergent and nonconvergent) classes. Obviously, the smooth behavior of the t-convergence plots can be deceiving, since two initial conditions, arbitrarily alike, can obtain quite different final network configurations. \n\n2. For rendering purposes only. It is extremely difficult to know precisely the equivalence classes of solutions, so we approximated. \n\nNote the triangles appearing in Figures 19, 21, 23 and the mosaic in Figure 25 corresponding to the area which did not converge in 200 iterations in Figure 24. The triangular boundaries are similar to fractal structures generated under iterated function systems (Barnsley 1988); in this case, the iterated function is the back propagation 
learning method. We propose that these fractal-like boundaries arise in back-propagation due to the existence of multiple solutions (attractors), the non-zero learning parameters, and the non-linear deterministic nature of the gradient descent approach. \n\n[Figures 14 through 25. Figure 14: (-20.000000, -20.000000), step 0.200000. Figure 15: solution networks. Figure 16: (-4.500000, -4.500000), step 0.030000. Figure 17: solution networks. Figure 18: (-1.680000, -1.350000), step 0.002400. Figure 19: solution networks. Figure 20: (-1.536000, -1.197000), step 0.000780. Figure 21: solution networks. Figure 22: (-1.472820, -1.145520), step 0.000070. Figure 23: solution networks. Figure 24: (-1.467150, -1.140760), step 0.000016. Figure 25: solution networks.] \n\nTable 1: Network Weights for Figures 26 through 30 \nWeight     Figure 26     Figure 28     Figures 27, 29, 30 \nWeight 1   -0.34959000   -0.34959000   -0.34959000 \nWeight 2   0.00560000    0.00560000    0.00560000 \nWeight 3   -0.26338813   0.39881098    0.65060705 \nWeight 4   0.75501968    -0.16718577   0.75501968 \nWeight 5   0.47040862    -0.28598450   0.91281711 \nWeight 6   -0.18438011   -0.18438011   -0.19279729 \nWeight 7   0.46700363    -0.06778983   0.56181073 \nWeight 8   -0.48619500   0.66061292    0.20220653 \nWeight 9   0.62821201    -0.39539510   0.11201949 \nWeight 10  -0.90039973   0.55021922    0.67401200 \nWeight 11  0.48940201    0.35141364    -0.54978875 \nWeight 12  -0.70239312   -0.17438740   -0.69839197 \nWeight 13  -0.95838741   -0.07619988   -0.19659844 \nWeight 14  0.46940394    0.88460041    0.89221204 \nWeight 15  -0.73719884   0.67141031    -0.56879740 \nWeight 16  0.96140103    -0.10578894   0.20201484 \n\n
When more than one hidden unit is utilized, or when an environment has internal symmetry or is very underconstrained, then there will be multiple attractors corresponding to the large number of hidden-unit permutations which form equivalence classes of functionality. As the number of solutions available to the gradient descent method increases, the more complicated the non-local interactions between them become. This explains the puzzling result that several researchers have noted, that as more hidden units are added, instead of speeding up, back-propagation slows down (e.g. Lippman and Gold, 1987). Rather than a hill-climbing metaphor with local peaks to get stuck on, we should instead think of a many-body metaphor: the existence of many bodies does not imply that a particle will take a simple path to land on one. From this view, we see that Rumelhart et al.'s claim of back-propagation usually converging is due to a very tight focus inside the \"eye of the storm\". \nCould learning and momentum rates also be involved in the storm? Such a question prompted another study, this time focused on the interaction of learning and momentum rates. Rather than alter the initial weights of a set of networks, we varied the learning rate along the X axis and momentum rate along the Y axis. Figures 26, 27, and 28 were produced by training a 3-3-1 network on 3-bit parity until t-convergence (250 epoch limit). Table 1 lists the initial weights of the networks trained in Figures 26 through 31. Examination of the fuzzy area in Figure 26 shows how small changes in learning and/or momentum rate can drastically affect t-convergence (Figures 30 and 31). 
\n\n[Figure 26: η=(0.0, 4.0), α=(0.0, 1.25). Figure 27: η=(0.0, 4.0), α=(0.0, 1.25). Figure 28: η=(0.0, 4.0), α=(0.0, 1.25). Figure 29: η=(3.456, 3.504), α=(0.835, 0.840). Figure 30: η=(3.84, 3.936), α=(0.59, 0.62).] \n\n4 DISCUSSION \nChaotic behavior has been carefully circumvented by many neural network researchers (through the choice of symmetric weights by Hopfield (1982), for example), but has been reported with increasing frequency over the past few years (e.g. Kurten and Clark, 1986). Connectionists, who use neural models for cognitive modeling, disregard these reports of extreme non-linear behavior in spite of common knowledge that non-linearity is what enables network models to perform non-trivial computations in the first place. All work to date has noticed various forms of chaos in network dynamics, but not in learning dynamics. Even if back-propagation is shown to be non-chaotic in the limit, this still does not preclude the existence of fractal boundaries between attractor basins, since other non-chaotic non-linear systems produce such boundaries (e.g. forced pendulums with two attractors (D'Humieres et al., 1982)). \nWhat does this mean to the back-propagation community? From an engineering applications standpoint, where only the solution matters, nothing at all. When an optimal set of weights for a particular problem is discovered, it can be reproduced through digital means. From a scientific standpoint, however, this sensitivity to initial conditions demands that neural network learning results must be specially treated to guarantee replicability. 
\nWhen theoretical claims are made (from experience) regarding the power of an adaptive network to model some phenomena, or when claims are made regarding the similarity between psychological data and network performance, the initial conditions for the network need to be precisely specified or filed in a public scientific database. \nWhat about the future of back-propagation? We remain neutral on the issue of its ultimate convergence, but our result points to a few directions for improved methods. Since the slowdown occurs as a result of global influences of multiple solutions, an algorithm for first factoring the symmetry out of both network and training environment (e.g. domain knowledge) may be helpful. Furthermore, it may also turn out that search methods which harness \"strange attractors\" ergodically guaranteed to come arbitrarily close to some subset of solutions might work better than methods based on strict gradient descent. Finally, we view this result as strong impetus to discover how to exploit the information-creative aspects of non-linear dynamical systems for future models of cognition (Pollack 1989). \n\nAcknowledgements \nThis work was supported by Office of Naval Research grant number N00014-89-J1200. Substantial free use of over 200 Sun workstations was generously provided by our department. \n\nReferences \nM. Barnsley, Fractals Everywhere, Academic Press, San Diego, CA, (1988). \nJ. J. Hopfield, \"Neural Networks and Physical Systems with Emergent Collective Computational Abilities\", Proceedings of the National Academy of Sciences, 79:2554-2558, (1982). \nD. D'Humieres, M. R. Beasley, B. A. Huberman, and A. Libchaber, \"Chaotic States and Routes to Chaos in the Forced Pendulum\", Physical Review A, 26:3483-96, (1982). \nS. 
Judd, \"Learning in Networks is Hard\", Journal o/Complexity, 4:177-192, (1988). \nJ.  Kolen  and  A.  Goel,  \"Learning  in  Parallel  Distributed  Processing  Networks: \nComputational  Complexity  and  Information  Content\",  IEEE  Transactions  on  Systems, \nMan, and Cybernetics, in press. \nK.  E.  KUrten  and J.  W.  Clark, \"Chaos in Neural Networks\", Physics Letters,  114A,413-\n418, (1986). \nR.  P.  Lippman and  B.  Gold,  \"Neural  Oassifiers  Useful  for  Speech Recognition\",  In 1 st \nInternational Conference on Neural Networks ,IEEE, IV:417-426, (1987). \nM. L. Minsky and S. A. Papert, Perceptrons. MIT Press, (1988). \nJ.  B.  Pollack,  \"Implications  of Recursive  Auto Associative  Memories\",  In Advances  ;12 \nNeural  Information  Processing  Systems.  (ed.  D.  Touretzky)  pp  527-536,  Morgan \nKaufman, San Mateo, (1989) . \nD.  E.  Rumelhart,  G.  E.  Hinton,  and R.  J.  Williams,  \"Learning Representation  by  Back(cid:173)\nPropagating Errors\", Nature, 323:533-536, (1986). \n\n\f", "award": [], "sourceid": 395, "authors": [{"given_name": "John", "family_name": "Kolen", "institution": null}, {"given_name": "Jordan", "family_name": "Pollack", "institution": null}]}