{"title": "Interposing an ontogenetic model between Genetic Algorithms and Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 99, "page_last": 106, "abstract": "", "full_text": "Interposing an ontogenic model between \nGenetic Algorithms and Neural Networks \n\nRichard K. Belew \nrik@cs.ucsd.edu \n\nCognitive Computer Science Research Group \n\nComputer Science & Engr. Dept. (0014) \n\nUniversity of California - San Diego \n\nLa Jolla, CA 92093 \n\nAbstract \n\nThe relationship between learning, development and evolution in Nature is taken seriously, to suggest a model of the developmental process whereby the genotypes manipulated by the Genetic Algorithm (GA) might be expressed to form phenotypic neural networks (NNets) that then go on to learn. ONTOL is a grammar for generating polynomial NNets for time-series prediction. Genomes correspond to an ordered sequence of ONTOL productions and define a grammar that is expressed to generate an NNet. The NNet's weights are then modified by learning, and the individual's prediction error is used to determine GA fitness. A new gene-doubling operator appears critical to the formation of new genetic alternatives in the preliminary but encouraging results presented. \n\n1 Introduction \n\nTwo natural phenomena, the learning done by individuals' nervous systems and the evolution done by populations of individuals, have served as the basis of distinct classes of adaptive algorithms: neural networks (NNets) and Genetic Algorithms (GAs), respectively. The interaction between learning and evolution in Nature suggests that combining NNet and GA algorithmic techniques might also yield interesting hybrid algorithms. 
\n\nFigure 1: Polynomial networks (the space of polynomial networks, parameterized by Dimension, i.e. how much history of the time series is used, and Degree) \n\nTaking the analogy to learning and evolution seriously, we propose that the missing feature is the developmental process whereby the genotypes manipulated by the GA are expressed to form phenotypic NNets that then go on to learn. Previous attempts to use the GA to search for good NNet topologies have foundered exactly because they have assumed an overly direct genotype-to-phenotype correspondence. This research is therefore consistent with other NNet research into the physiology of neural development [3], as well as with \"constructive\" methods for changing network topologies adaptively during the training process [4]. Additional motivation derives from the growing body of neuroscience demonstrating the importance of developmental processes as the shapers of effective learning networks. Cognitively, the resolution of false dichotomies like \"nature/nurture\" and \"nativist/empiricist\" also depends on a richer language for describing the way genetically determined characteristics and within-lifetime changes by individuals can interact. \n\nBecause GAs and NNets are each complicated technologies in their own right, and because the focus of the current research is a model of development that can span between them, three major simplifications have been imposed for the preliminary research reported here. First, in order to stay close to the mathematical theory of functional approximation, we restrict the form of our NNets to what can be called \"polynomial networks\" (cf. [7]). 
That is, we will consider networks with a first layer of linear units (i.e., terms in the polynomial that are simply weighted inputs Xi), a second layer with units that form products of first-layer units, a third layer with units that form products of second-layer units, etc.; see Figure 1, and below for an example. As depicted in Figure 1, the space of polynomial networks can be viewed as two-dimensional, parameterized by dimension (i.e., how much history of the time series is used) and degree. \n\nThere remains the problem of finding the best parameter values for this particular polynomial form. Much of classical optimization theory and more recent NNet research is concerned with various methods for performing this task. Previous research has demonstrated that the global sampling behavior of the GA works very effectively with any gradient-based local search technique [2]. The second major simplification, then, is that for the time being we use only the most simple-minded gradient method: first-order, fixed-step gradient descent. Analytically, this is the most tractable, and the general algorithm design can readily replace it with any other local search technique. \n\nThe final simplification is that we focus on one of the most parsimonious of problems, time-series prediction: the GA is used to evolve NNets that are good at predicting Xt+1 given access to an unbounded history Xt, Xt-1, Xt-2, .... Polynomial approximations of an arbitrary time series can vary in two dimensions: the extent to which they rely on this history (i.e., how far back in time they reach), and their degree. The Stone-Weierstrass Approximation Theorem guarantees that, within this two-dimensional space, there exists some polynomial that will match the desired temporal sequence to arbitrary precision. 
The problem, of course, is that over a history of length H, allowing terms of degree up to m, there exist O(H^m) terms, far too many to search effectively. From the perspective of function approximation, then, this work corresponds to a particular heuristic for searching for the correct polynomial form, the parameters of which will be tuned with a gradient technique. \n\n2 Expression of the ONTOL grammar \n\nEvery multi-cellular organism has the problem of using a single genetic description contained in the first germ cell as the specification for all of its various cell types. The genome therefore appears to contain a set of developmental instructions, subsets of which become \"relevant\" to the particular context in which each developing cell finds itself. If we imagine that each cell type is a unique symbol in some alphabet, and that the mature organism is a string of symbols, it becomes very natural to model the developmental process as a (context-sensitive) grammar generating this string [6, 5]. The initial germ cell becomes the start symbol. A series of production rules specify the expansion (mitosis) of this non-terminal (cell) into two other symbols that then develop according to the same set of genetically determined rules, until all cells are in a mature, terminal state. \n\nONTOL is a grammar for generating cells in the two-dimensional space of polynomial networks. The left-hand side (LHS) of a production in this grammar defines conditions on the cell's internal Clock state and on the state of its eight Moore neighbors. The RHS of the production defines one of five cell-state update actions that are performed if the LHS condition is satisfied: a cell can mitosize either left or down (M Left, M Down), meaning that this adjacent cell now becomes filled with an identical copy; Die (i.e., disappear entirely); Tick (simply decrement its internal Clock state); or Terminate (cease development). 
Only terminating cells form synaptic connections, and only to adjacent neighbors. \n\nThe developmental process is begun by placing a single \"gamete\" cell at the origin of the 2d polyspace, with its Clock state initialized to a maximal value MaxClock = 4; this state is decremented every time a gene is fired. If and when a gene causes this cell to undergo mitosis, a new cell, either to the left of or below the original cell, is created. Critically, the same set of genetic instructions contained in the original gametic cell is used to control the transitions of all its progeny cells (much like a cellular automaton's transition table), even though the differing contexts of each cell are likely to cause different genes to be applied in different cells. Figure 2 shows a trace of this developmental process: each snap-shot shows the Clock states of all active (non-terminated) cells, the coordinates of the cell being expressed, and the gene used to control its expression. \n\nFigure 2: Logistic genome, engineered \n\n3 Experimental design \n\nEach generation begins by developing and evaluating each genotype in the population. First, each genome in the population is expressed to form an executable Lisp lambda expression computing a polynomial, together with a corresponding set of initial weights for each of its terms. If this expression can be performed successfully and the individual is viable (i.e., its genome can be interpreted to build a well-formed network), the individual is exposed to NTrain sequential instances of the time series. Fitness is then defined to be its cumulative error on the next NTest time steps. 
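The expression mechanism described above can be sketched in a few lines of code. The following is a minimal illustrative sketch, not the paper's implementation: the condition format (a predicate on the Clock state and the count of occupied Moore neighbors), the coordinate convention, and the action names are my own simplifications of ONTOL.

```python
# Minimal sketch of ONTOL-style expression (an illustration, NOT the paper's code).
# A genome is an ordered list of (condition, action) productions; conditions here
# are simplified to predicates on a cell's Clock state and the number of occupied
# Moore neighbors.  Terminated cells form the phenotype.
MAX_CLOCK = 4

def moore_neighbors(pos):
    x, y = pos
    return [(x + dx, y + dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)
            if (dx, dy) != (0, 0)]

def develop(genome, max_steps=100):
    cells = {(0, 0): MAX_CLOCK}           # single gamete cell at the origin
    terminated = set()
    for _ in range(max_steps):
        active = [p for p in list(cells) if p not in terminated]
        if not active:
            break                         # all cells mature: development done
        for pos in active:
            if pos not in cells:          # cell died earlier in this sweep
                continue
            clock = cells[pos]
            n_occ = sum(1 for q in moore_neighbors(pos) if q in cells)
            for cond, action in genome:   # head-of-genome bias: first match fires
                if cond(clock, n_occ):
                    if action == 'M_LEFT':       # mitose: copy into adjacent cell
                        cells.setdefault((pos[0] + 1, pos[1]), clock)
                    elif action == 'M_DOWN':
                        cells.setdefault((pos[0], pos[1] + 1), clock)
                    elif action == 'DIE':
                        del cells[pos]
                    elif action == 'TERMINATE':
                        terminated.add(pos)
                    # 'TICK' needs no branch: firing decrements the Clock below
                    if pos in cells:
                        cells[pos] = clock - 1
                    break
    return terminated

# Toy engineered genome: divide down once, then terminate beside a neighbor.
genome = [
    (lambda clock, n_occ: clock == MAX_CLOCK and n_occ == 0, 'M_DOWN'),
    (lambda clock, n_occ: n_occ >= 1, 'TERMINATE'),
]
phenotype = develop(genome)               # two terminated cells: (0,0) and (0,1)
```

In the real system the terminated cells would then be read off as terms of a polynomial network; this sketch only returns their coordinates in the dimension-by-degree space.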
\n\nAfter the entire population has been evaluated, the next generation is formed according to a relatively conventional genetic algorithm: more successful individuals are differentially reproduced, and genetic operators are applied to these to experiment with novel, but similar, alternatives. Each genome is cloned zero, one or more times using a proportional selection algorithm that guarantees the expected number of offspring is proportional to an individual's relative fitness. \n\nFigure 3: Population minimum and average fitness (prediction error vs. generations 0-800; curves: Min, Avg) \n\nVariation is introduced into the population by mutation and recombination genetic operators that explore new genes and genomic combinations. Four types of mutation were applied, with the probability of a mutation proportional to genome length. First, some random portion of an extant gene might be randomly altered, e.g., changing an initial weight, adding or deleting a constraint on a condition, or changing the gene's action. Because a gene's order in the genome can affect its probability of being expressed, a second form of mutation permutes the order of the genes on the genome. A third class of mutation removes genes from the genome, always \"trimming\" them from the end. Combined with the expression mechanism's bias towards the head of the genomic list, this trimming operation creates a pressure towards putting genes critical to early ontogeny near the head. 
The final and critical form of mutation randomly selects a gene to be doubled: a duplicate copy of the gene is constructed and inserted at a randomly selected position in the genome. After all mutations have been performed, cross-over is performed between pairs of individuals. \n\n4 Experiments \n\nTo demonstrate, consider the problem of predicting a particularly difficult time series, the chaotic logistic map: Xt = 4.0 Xt-1 - 4.0 Xt-1^2. The example of Figure 2 showed an ONTOL genome engineered to produce the desired logistic polynomial. This \"genetically engineered\" solution is merely evidence that a genetic solution exists that can be interpreted to form the desired phenotypic form; the real test is, of course, whether the GA can find it or something similar. \n\nEarly generations are not encouraging. Figure 3 shows the minimum (i.e., best) prediction error and population average error for the first 800 generations of a typical simulation. Initial progress is rapid because in the initial, randomly constructed population, fully half of the individuals are not even viable. These are strongly selected against, of course, and within the first two or three generations at least 95% of all individuals are viable. \n\nFigure 4: Complex polynomials (number of nonlinear polynomials vs. generations 0-800) \n\nFor the next several hundred generations, however, all of ONTOL's developmental machinery appears for naught, as the dominant phenotypic individuals are the most \"simplistic\" linear, first-degree approximators of the form w1 Xt + w0. Even here, however, the GA, working in conjunction with the gradient learning process, is able to achieve Baldwin-like effects, optimizing w0 and w1 [1]. 
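The within-lifetime learning step referred to here, first-order fixed-step gradient descent, is easy to illustrate on this same logistic series. The sketch below is my own minimal illustration (the paper's phenotypes are Lisp lambda expressions, and the constants shown, such as the learning rate and step count, are assumptions); it fits the two-term polynomial form w1 Xt + w2 Xt^2 and recovers the logistic coefficients 4.0 and -4.0:

```python
import random

def logistic_series(x0=0.3, n=500):
    # the chaotic logistic map: Xt = 4.0 Xt-1 - 4.0 Xt-1^2
    xs = [x0]
    for _ in range(n - 1):
        x = xs[-1]
        xs.append(4.0 * x - 4.0 * x * x)
    return xs

def train(series, steps=50000, lr=0.1):
    # first-order, fixed-step gradient descent on the squared prediction error
    random.seed(0)                 # reproducible random initial weights
    w1, w2 = random.uniform(-1, 1), random.uniform(-1, 1)
    for step in range(steps):
        i = step % (len(series) - 1)
        x, target = series[i], series[i + 1]
        err = (w1 * x + w2 * x * x) - target
        w1 -= lr * err * x         # gradient of err^2 w.r.t. w1, constant folded into lr
        w2 -= lr * err * x * x
    return w1, w2

w1, w2 = train(logistic_series())
# w1 converges toward 4.0 and w2 toward -4.0
```

Because the target series is exactly representable in this form, the fixed-step rule drives the prediction error to zero; the GA's job in the paper is to discover a topology on which such a descent can succeed at all.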
The simulation reaches a \"simplistic plateau,\" then, as it converges on a population composed of the best predictors the simplistic linear, first-degree network topology permits for this time series. \n\nIn the background, however, genetic operators are continuing to explore a wide variety of genotypic forms that all have the property of generating roughly the same simplistic phenotypes. Figure 4 shows that there are significant numbers of \"complex\" polynomials (i.e., with either nonlinear terms or higher-dimensional dependence on the past) in early generations, and some of these have much higher than average fitness (note the good solutions in the first 50 generations, as well as subsequent dips during the simplistic plateau). On average, however, genes leading to complex phenotypes lead to poorer approximations than the simplistic ones, and are quickly culled. \n\nFigure 5: Genome length (total genes in the population vs. generations 0-800; curves: Selected, Neutral) \n\nA critical aspect of the redundancy introduced by gene doubling is that old genetic material is freed to mutate into new forms without threatening the phenotype's viability. When compared to a population of mediocre, simplistic networks, any complex networks able to provide more accurate predictions have much higher fitness, and eventually are able to take over the population. Around generation 400, then, Figure 3 shows the fitness dropping from the simplistic plateau, and Figure 4 shows the number of complex polynomials increasing. 
Many of these individuals' genomes indeed encode grammars that form polynomials of the desired functional form. \n\nA surprising feature of these simulations is that while the genes leading to complex phenotypes are present from the beginning and continue to be explored during the simplistic plateau, it takes many generations before these genes are successfully composed into robust, consistently viable genotypes. How do the complex genotypes discovered in later generations differ from those in the initial population? \n\nOne piece of the answer is revealed in Figure 5: later genomes are much longer. All 100 individuals in the initial population have exactly five genes, and so the initial \"gene pool\" size is 500. In the experiments just described, this number grows asymptotically to approximately 6000 total genes (i.e., 60 per individual, on average) during the simplistic plateau, and then explodes a second time to more than 10,000 as the population converts to complex polynomials. It appears that gene duplication creates a very constructive form of redundancy: multiple copies of critical genes help the genotype maintain the more elaborate development programs required to form complex phenotypes. Micro-analysis of the most successful individuals in later generations supports this view. While many parts of their genomes appear inconsequential (for example, relative to the engineered genome of Figure 2), both the M Down gene and the two-element Terminate genes, critical to forming polynomials that are \"morphologically isomorphic\" with the correct solution, are consistently present. \n\nThis hypothesis is also supported by results from a second experiment, also plotted in Figure 5. Recall that the increase in genome size caused by gene doubling is offset by a trimming mutation that periodically shortens a genome. 
The curve labelled \"Neutral\" shows the results of these opposing operations when the next generation is formed randomly, rather than being selected for better prediction. Under neutral selection, genome size grows slightly from its initial size, but gene doubling and genome trimming then quickly reach equilibrium. When we select for better predictors, however, longer genomes are clearly preferred, at least up to a point. The apparent asymptote accompanying the simplistic plateau suggests that if these simulations were extended, the length of complex genotypes would also stabilize. \n\nAcknowledgements \n\nI gratefully acknowledge the warm and stimulating research environments provided by Domenico Parisi and colleagues at the Psychological Institute, CNR, Rome, Italy, and Jean-Arcady Meyer and colleagues in the Groupe de BioInformatique, Ecole Normale Superieure, Paris, France. \n\nReferences \n\n[1] R. K. Belew. Evolution, learning and culture: computational metaphors for adaptive search. Complex Systems, 4(1):11-49, 1990. \n\n[2] R. K. Belew, J. McInerney, and N. N. Schraudolph. Evolving networks: Using the Genetic Algorithm with connectionist learning. In Proc. Second Artificial Life Conference, pages 511-547, New York, 1991. Addison-Wesley. \n\n[3] J. D. Cowan and A. E. Friedman. Development and regeneration of eye-brain maps: A computational model. In Advances in Neural Info. Proc. Systems 2, pages 92-99. Morgan Kaufmann, 1990. \n\n[4] S. E. Fahlman and C. Lebiere. The Cascade-Correlation learning architecture. In D. S. Touretzky, editor, Advances in Neural Info. Proc. Systems 2, pages 524-532. Morgan Kaufmann, 1990. \n\n[5] H. Kitano. Designing neural networks using genetic algorithms with graph generation system. Complex Systems, 4(4), 1990. \n\n[6] A. Lindenmayer and G. Rozenberg. Automata, languages, development. North-Holland, Amsterdam, 1976. \n\n[7] T. D. Sanger, R. S. Sutton, and C. J. Matheus. Iterative construction of sparse polynomials. In J. E. Moody, S. J. Hanson, and R. P. Lippmann, editors, Advances in Neural Info. Proc. Systems 4, pages 1064-1071. Morgan Kaufmann, 1992. \n", "award": [], "sourceid": 618, "authors": [{"given_name": "Richard", "family_name": "Belew", "institution": null}]}