{"title": "Meiosis Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 533, "page_last": 541, "abstract": null, "full_text": "Meiosis Networks \n\n533 \n\nMeiosis Networks \n\nStephen Jose Hanson \n\n1 \n\nLearning and Knowledge Acquisition Group \n\nSiemens Research Center \n\nPrinceton, NJ 08540 \n\nABSTRACT \n\nA central problem in connectionist modelling is the control of \nnetwork and architectural resources during learning. In the present \napproach, weights reflect a coarse prediction history as coded by a \ndistribution of values and parameterized in the mean and standard \ndeviation of these weight distributions. Weight updates are a \nfunction of both \nthe mean and standard deviation of each \nconnection in the network and vary as a function of the error signal \n(\"stochastic delta rule\"; Hanson, 1990). Consequently, the weights \ntheir \nmaintain \nin \n\"uncertainty\" \nestablishing a policy concerning the size of the nodal complexity of \nthe network and growth of new nodes. For example, during \nproblem solving \nthe present network can undergo \"meiosis\", \nproducing two nodes where there was one \"overtaxed\" node as \nmeasured by its coefficient of variation. It is shown in a number of \nbenchmark problems that meiosis networks can find minimal \narchitectures, reduce computational complexity, and overall increase \nthe efficiency of the representation learning interaction. \n\ninformation on \n\nin prediction. \n\ntheir central \n\ntendency and \n\nSuch \n\ninformation \n\nis useful \n\n1 Also a member or the Cognitive Science Laboratory, Princeton University, Princeton, NJ 08542 \n\n\f534 \n\nHanson \n\n1 INTRODUCTION \nSearch problems which \ninvolve high dimensionality, a-priori constraints and \nnonlinearities are hard. Unfortunately, learning problems in biological systems \ninvolve just these sorts of properties. 
Worse, one can characterize the sort of problem that organisms probably encounter in the real world as those that do not easily admit solutions involving simple averaging, optimality, linear approximation, or complete knowledge of the data or the nature of the problem being solved. We would contend there are three basic properties of real learning that result in an ill-defined set of problems and a heterogeneous set of solutions: \n\n\u2022 Data are continuously available but incomplete; the learner must constantly update parameter estimates with stingy bits of data which may represent a very small sample from the possible population. \n\n\u2022 Conditional distributions of response categories with respect to given features are unknown and must be estimated from possibly unrepresentative samples. \n\n\u2022 Local (in time) information may be misleading, wrong, or non-stationary; consequently there is a poor tradeoff between the present use of data and waiting for more and possibly flawed data. Consequently updates must be small and revocable. \n\nThese sorts of properties represent only one aspect of the learning problem faced by real organisms in real environments. Nonetheless, they underscore why \"weak\" methods (methods that assume little about the environment in which they are operating) are so critical. \n\n1.1 LEARNING AND SEARCH \n\nIt is possible to precisely characterize the search problem in terms of the resources or degrees of freedom in the learning model. If the task the learning system is to perform is classification, then the system can be analyzed in terms of its ability to dichotomize stimulus points in feature space. \n\nDichotomization Capability: Network Capacity \n\nUsing a linear fan-in or hyperplane type neuron we can characterize the degrees of freedom inherent in a network of units with thresholded output. 
For example, with linear boundaries, consider 4 points, well distributed in a 2-dimensional feature space. There are exactly 14 linearly separable dichotomies that can be formed with the 4 target points. However, there are actually 16 (2^4) possible dichotomies of 4 points in 2 dimensions; consequently, the number of possible dichotomies or arbitrary categories that are linearly implementable can be thought of as a capacity of the linear network in k dimensions with n examples. The general category capacity measure (Cover, 1965) can be written as: \n\nC(n,k) = 2 * SUM_{j=0..k} (n-1)! / ((n-1-j)! j!), n > k+1 (1) \n\nNote the dramatic growth in C as a function of k, the number of feature dimensions; for example, for 25 stimuli in a 5-dimensional feature space there are 100,670 linear dichotomies. Underdetermination in these sorts of linear networks is the rule, not the exception. This makes the search process, and the nature of constraints on the search process, critical in finding solutions that may be useful in the given problem domain. \n\n1.2 THE STOCHASTIC DELTA RULE \n\nActual mammalian neural systems involve noise. Responses from the same individual unit in isolated cortex due to cyclically repeated identical stimuli will never result in identical bursts. Transmission of excitation through neural networks in living systems is essentially stochastic in nature. The typical activation function used in connectionist models must be assumed to be an average over many intervals, since any particular neuronal pulse train appears quite random (in fact, Poisson; for example see Burns, 1968; Tomko & Crapper, 1974). \n\nThis suggests that a particular neural signal in time may be modeled by a distribution of synaptic values rather than a single value. Further, this sort of representation provides a natural way to affect the synaptic efficacy in time. 
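The category capacity in equation (1) above can be checked numerically. A minimal sketch (Python is assumed here; this is an illustration, not code from the paper):

```python
from math import comb

def cover_capacity(n, k):
    """Equation (1): number of linearly separable dichotomies of n
    points in general position in k dimensions (Cover, 1965)."""
    if n <= k + 1:
        return 2 ** n  # every one of the 2^n dichotomies is separable
    return 2 * sum(comb(n - 1, j) for j in range(k + 1))

# 4 points in 2 dimensions: 14 of the 16 possible dichotomies
print(cover_capacity(4, 2))  # -> 14
```

For n <= k+1 the sum runs over every binomial term and collapses to 2^n, which is why 3 points in 2 dimensions admit all 8 labelings.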
In order to introduce noise adaptively, we require that the synaptic modification be a function of a random increment or decrement proportional in size to the present error signal. Consequently, the weight delta or gradient itself becomes a random variable based on prediction performance. Thus, the noise that seems ubiquitous and apparently useless throughout the nervous system can be turned to at least three advantages, in that it provides the system with mechanisms for (1) entertaining multiple response hypotheses given a single input, (2) maintaining a coarse prediction history that is local, recent, and cheap, thus providing punctate credit assignment opportunities, and finally, (3) revoking parameterizations that are easy to reach, locally stable, but distant from a solution. \n\nAlthough it is possible to implement the present principle a number of different ways, we chose to consider a connection strength to be represented as a distribution of weights with a finite mean and variance (see Figure 1). \n\nFigure 1: Weights as Sampling Distributions \n\nA forward activation or recognition pass consists of randomly sampling a weight from the existing distribution, calculating the dot product, and producing an output for that pass: \n\nx_i = SUM_j w*_ij y_j \n\nwhere the sample w*_ij is found from \n\nS(w_ij = w*_ij) = mu_ij + sigma_ij phi(w_ij) \n\nwith phi(w_ij) a standard random variate. A hidden unit is considered \"overtaxed\" when the coefficient of variation (C.V. = sigma/mu) of its composite input and output weight distributions exceeds 1.0, i.e. when \n\nSUM_j sigma_ij / SUM_j mu_ij > 1.0 and SUM_k sigma_ki / SUM_k mu_ki > 1.0 \n\nMeiosis then proceeds as follows (see Figure 2): \n\n\u2022 A forward stochastic pass is made, producing an output. \n\u2022 The output is compared to the target, producing errors which are then used to update the mean and variance of each weight. 
\u2022 The composite input and output variances and means are computed for each hidden unit. \n\u2022 For those hidden units whose composite C.V.s are > 1.0, node splitting occurs; half the variance is assigned to each new node, with a jittered mean centered at the old mean. \n\nFigure 2: Meiosis \n\nThere is no separate stopping criterion. The network stops creating nodes based on the prediction error and the noise level (beta, zeta). \n\n1.4 EXAMPLES \n\n1.4.1 Parity Benchmark: Finding the Right Number of Units \n\nSmall parity problems (exclusive-or and 3-bit parity) were used to explore the sensitivity of the noise parameters on node splitting and to benchmark the method. All runs were with fixed learning rate (eta = .5) and momentum (alpha = .75). Low values of zeta (< .7) produce minimal or no node splitting, while higher values (> .99) seem to produce continuous node splitting without regard to the problem type. Zeta was fixed (.98) and beta, the noise-per-step parameter, was varied between values .1 and .5. The following runs were unaffected by varying beta between these two values. \n\nFigure 3: Number of Hidden Units at Convergence \n\nShown in Figure 3 are 50 runs of exclusive-or and 50 runs of 3-bit parity. Histograms show for exclusive-or that almost all runs (>95%) ended up with 2 hidden units, while for the 3-bit parity case most runs produce 3 hidden units, however with considerably more variance, some ending with 2 while a few runs ended with as many as 9 hidden units. The next figure (Figure 4) shows histograms for 
Figure 4: Convergence Times \n\nthe convergence time, showing a slight advantage in terms of convergence for the meiosis networks for both exclusive-or and 3-bit parity. \n\n1.4.2 Blood NMR Data: Nonlinear Separability \n\nIn Figure 5, data were taken from 10 different continuous kinds of blood measurements, including total lipid content, cholesterol (mg/dl), high-density lipids, low-density lipids, triglycerides, etc., as well as some NMR measures. Subjects were previously diagnosed for the presence (C) or absence (N) of a blood disease. \n\nFigure 5: Blood NMR Separability (data projected onto the first two discriminant variables) \n\nThe data consisted of 238 samples, 146 Ns and 92 Cs. Shown in the adjoining figure is a Perceptron (linear discriminant analysis) response to the data. Each original data point is projected into the first two discriminant variables, showing about 75% of the data to be linearly separable (k-k/2 jackknife tests indicate about 52% transfer rates). However, also shown is a rough nonlinear envelope around one class of subjects (N), showing the potentially complex decision region for this data. \n\n1.4.3 Meiosis Learning Curves \n\nData was split into two groups (118, 120) for learning and transfer tests. Learning curves for both the meiosis network and standard back-propagation are shown in Figure 6. Also shown in this display is the splitting rate for the meiosis network, showing it growing to 7 hidden units and freezing during the first 20 sweeps. \n\nFigure 6: Learning Curves and Splitting Rate \n\n1.4.4 Transfer Rate \n\nBackpropagation was run on the blood data with 0 (perceptron), 2, 3, 4, 5, 6, 7, and 20 hidden units. 
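The stochastic forward pass and the node-splitting step described above can be sketched as follows. This is a simplified illustration, not the paper's implementation: the network shapes, the jitter scale, and the restriction to incoming weights are assumptions, while the C.V. > 1.0 splitting threshold and the halving of the variance follow the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Each connection holds a distribution: mean mu_ij and std dev sigma_ij.
# Columns index hidden units; node 0 is made "overtaxed" by construction.
mu = np.array([[0.2, -0.1],
               [0.1,  0.3],
               [-0.3, 0.2]])
sigma = np.array([[0.5, 0.05],
                  [0.4, 0.05],
                  [0.6, 0.05]])

def forward_sample(x, mu, sigma):
    """Stochastic pass: sample w*_ij = mu_ij + sigma_ij * phi, phi ~ N(0,1)."""
    w = mu + sigma * rng.standard_normal(mu.shape)
    return x @ w

def overtaxed(mu, sigma, node):
    """Composite C.V. of the node's incoming weight distributions > 1.0."""
    cv = np.abs(sigma[:, node]).sum() / np.abs(mu[:, node]).sum()
    return cv > 1.0

def meiosis(mu, sigma, node, jitter=0.01):
    """Split one node into two: half the variance (sigma / sqrt(2)) goes
    to each new node, with means jittered around the old mean."""
    n_in = mu.shape[0]
    child_mu = mu[:, node] + jitter * rng.standard_normal(n_in)
    mu = np.concatenate([mu, child_mu[:, None]], axis=1)
    mu[:, node] += jitter * rng.standard_normal(n_in)
    half_sd = sigma[:, node] / np.sqrt(2.0)  # halves the variance
    sigma = np.concatenate([sigma, half_sd[:, None]], axis=1)
    sigma[:, node] = half_sd
    return mu, sigma

y = forward_sample(np.ones(3), mu, sigma)
if overtaxed(mu, sigma, node=0):
    mu, sigma = meiosis(mu, sigma, node=0)
print(mu.shape)  # -> (3, 3): node 0 split into two hidden units
```

Node 0's incoming sigmas sum to 1.5 against absolute means summing to 0.6 (C.V. = 2.5), so it splits; node 1 (C.V. = 0.25) would not.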
Shown in Figure 7 is the median transfer rate of 3 runs for each hidden-unit network size. The transfer rate seemed to hover near 65% as the number of hidden units approached 20. A meiosis network was also run 3 times on the data (using beta = .40 and zeta = .98). The transfer rate, shown in Figure 7, was always above 70% at the 7 hidden unit number. \n\nFigure 7: Transfer Rate as a Function of Hidden Unit Number \n\n1.5 CONCLUSIONS \n\nThe key property of the present scheme is the integration of representational aspects that are sensitive to network prediction and at the same time control the architectural resources of the network. Consequently, with meiosis networks it is possible to dynamically and opportunistically control network complexity and therefore indirectly its learning efficiency and generalization capacity. Meiosis networks were defined upon earlier work using local noise injections and noise-related learning rules. As learning proceeds, the meiosis network can measure the prediction history of particular nodes and, if it is found to be poor, can split the node opportunistically to increase the resources of the network. Further experiments are required in order to understand the different advantages of splitting policies and their effects on generalization and speed of learning. \n\nReferences \n\nBurns, B. D. The uncertain nervous system, London: Edward Arnold Ltd, 1968. \n\nCover, T. M. Geometrical and statistical properties of systems of linear inequalities with applications to pattern recognition, IEEE Transactions on Electronic Computers, Vol. EC-14, 3, pp. 326-334, 1965. \n\nHanson, S. J. A stochastic version of the delta rule, Physica D, 1990. \n\nHanson, S. J. & Burr, D. J. Minkowski back-propagation: learning in connectionist models with non-Euclidean error signals, Neural Information Processing Systems, American Institute of Physics, 1988. \n\nHanson, S. J. & Pratt, L. 
A comparison of different biases for minimal network construction with back-propagation, Advances in Neural Information Processing Systems, D. Touretzky (ed.), Morgan Kaufmann, 1989. \n\nKirkpatrick, S., Gelatt, C. D. & Vecchi, M. Optimization by simulated annealing, Science, 220, pp. 671-680, 1983. \n\nTomko, G. J. & Crapper, D. R. Neural variability: non-stationary responses to identical visual stimuli, Brain Research, 79, pp. 405-418, 1974. \n", "award": [], "sourceid": 227, "authors": [{"given_name": "Stephen", "family_name": "Hanson", "institution": null}]}