{"title": "Learning in Higher-Order \"Artificial Dendritic Trees\"", "book": "Advances in Neural Information Processing Systems", "page_first": 490, "page_last": 497, "abstract": null, "full_text": "490 \n\nBell \n\nLearning in higher-order 'artificial dendritic trees' \n\nTony Bell \n\nArtificial Intelligence Laboratory \n\nVrije Universiteit Brussel \n\nPleinlaan 2, B-1050 Brussels, BELGIUM \n\n(tony@arti.vub.ac.be) \n\nABSTRACT \n\nIf neurons sum up their inputs in a non-linear way, as some simulations suggest, how is this distributed fine-grained non-linearity exploited during learning? How are all the small sigmoids in synapse, spine and dendritic tree lined up in the right areas of their respective input spaces? In this report, I show how an abstract, atemporal, highly nested tree structure, with a quadratic transfer function associated with each branchpoint, can self-organise using only a single global reinforcement scalar to perform binary classification tasks. The procedure works well, solving the 6-multiplexer and a difficult phoneme classification task as well as back-propagation does, and faster. Furthermore, it does not calculate an error gradient, but uses a statistical scheme to build moving models of the reinforcement signal. \n\n1. INTRODUCTION \nThe computational territory between the linearly summing McCulloch-Pitts neuron and the non-linear differential equations of Hodgkin & Huxley is relatively sparsely populated. Connectionists use variants of the former, and computational neuroscientists struggle with the exploding parameter spaces provided by the latter. However, evidence from biophysical simulations suggests that the voltage transfer properties of synapses, spines and dendritic membranes involve many detailed non-linear interactions, not just a squashing function at the cell body. Real neurons may indeed be higher-order nets. 
\nFor the computationally-minded, higher-order interactions mean, first of all, quadratic terms. This contribution presents a simple learning principle for a binary tree with a logistic/quadratic transfer function at each node. These functions, though highly nested, are shown to be capable of changing their shape in concert. The resulting tree structure receives inputs at its leaves, and outputs, at the top, an estimate of the probability that the input pattern is a member of one of two classes. \n\nA number of other schemes exist for learning in higher-order neural nets. Sigma-Pi units, higher-order threshold logic units (Giles & Maxwell, 87) and product units (Durbin & Rumelhart, 89) are all examples of units which learn coefficients of non-linear functions. Product unit networks, like Radial Basis Function nets, consist of a layer of non-linear transformations, followed by a normal Perceptron-style layer. The scheme presented here has more in common with the work reviewed in Barron (88) (see also Tenorio 90) on polynomial networks, in that it uses low-order polynomials in a tree of low degree. The differences lie in a global rather than layer-by-layer learning scheme, and in a transfer function derived from a gaussian discriminant function. \n\n2. THE ARTIFICIAL DENDRITIC TREE (ADT) \nThe network architecture in Figure 1(a) is that of a binary tree which propagates real number values from its leaf nodes (or inputs) to its root node, which is the output. In this simple formulation, the tree is construed as a binary classifier. The output node signals a number between 0 and 1 which represents the probability that the pattern presented to the tree was a member of the positive class of patterns rather than the negative class. 
Because the input patterns may have extremely high dimension and the tree is, at least initially, constrained to be binary, the depth of the tree may be significant, at least more than one might like to back-propagate through. A transfer function is associated with each 'hidden' node of the tree and with the output node. This will hereafter be referred to as a Z-function, for the simple reason that it takes in two variables X and Y, and outputs Z. A cascade of Z-functions performs the computation of the tree, and the learning procedure consists of changing these functions. The tree is referred to as an Artificial Dendritic Tree or ADT, with the same degree of licence with which one may talk of Artificial Neural Networks, or ANNs. \n\nFigure 1: (a) an Artificial Dendritic Tree, (b) a 1D Z-node, (c) a 2D Z-node, (d) a 1D Z-function constructed from 2 gaussians, (e) approximating a step function. \n\n2.1. THE TRANSFER FUNCTION \nThe idea behind the Z-function is to allow the two variables arriving at a node to interact locally in a non-linear way which contributes to the global computation of the tree. The transfer function is derived from statistical considerations. To simplify, consider the one-dimensional case of a variable X travelling on a wire as in Figure 1(b). A statistical estimation procedure could observe the distribution of values of X when the global pattern was positive or negative, and derive a decision rule from these. In Figure 1(d), the two density functions f+(x) and f-(x) are plotted. Where they meet, the local computation must answer that, based on its information, the global pattern is positively classified with probability 0.5. 
Assuming that there are equal numbers of positive and negative patterns (i.e. that the a priori probability of positive is 0.5), it is easy to see that the conditional probability of being in the positive class, given our value for X, is given by equation (1): \n\nz(x) = P[class=+ve | x] = f+(x) / (f+(x) + f-(x))     (1) \n\nThis can also be derived from Bayesian reasoning (Therrien, 89). The form of z(x) is shown with the thick line in Figure 1(d) for the given f+(x) and f-(x). If f+(x) and f-(x) can be usefully approximated by normal (gaussian) curves as plotted above, then (1) translates into (2): \n\nz(x) = 1 / (1 + e^(-input)),   input = β-(x) - β+(x) + ln(a-/a+)     (2) \n\nThis can be obtained by substituting equation (4) overleaf into (1), using the definitions of a and β given there. The exact form a and β take depends on the number of variables input. The first striking thing is that the form of (2) is exactly that of the back-propagation logistic function. The second is that input is a quadratic polynomial expression. For Z-functions with 2 inputs (x, y), using formulas (4.2), it takes the form: \n\ninput = w1 x^2 + w2 y^2 + w3 xy + w4 x + w5 y + w6     (3) \n\nThe w's can be thought of as weights just as in backprop, defining a 6D space of transfer functions. However, optimising the w's directly through gradient descent may not be the best idea (though this is what Tenorio does), since for any error function E, dE/dw1 = x dE/dw4 and dE/dw3 = y dE/dw4. That is, the axes of the optimisation are not independent of each other. There are, however, two sets of 5 independent parameters from which the w's in (3) are actually composed, if we calculate input from (4.2). These are μx+, σx+, μy+, σy+ and r+, denoting the means, standard deviations and correlation coefficient defining the two-dimensional distribution of (x, y) values which should be positively classified. 
The other 5 variables define the negative distribution. Thus 2 gaussians (hereafter referred to as the positive and negative models) define a quadratic transfer function (called the Z-function) which can be interpreted as expressing the conditional probability of positive class membership. The shape of these functions can be altered by changing the statistical parameters defining the distributions which underlie them. In Figure 1(d), a 1-dimensional Z-function is seen to be sigmoidal, though it need not be monotonic at all. Figure 2(b)-(h) shows a selection of 2D Z-functions. In general, the Z-function divides its N-dimensional input space with an (N-1)-dimensional hypersurface. In 2D, this will be an ellipse, a parabola, a hyperbola or some combination of the three. Although the dividing surface is quadratic, the Z-function is still a logistic or squashing function. The exponent input is actually equivalent to the log likelihood ratio, ln(f+(x)/f-(x)), commonly used in statistics. \n\nIn this work, 2-dimensional gaussians are used to generate Z-functions. There are compelling reasons for this. One-dimensional Z-functions are of little use since they do not reduce information. Z-functions of dimension higher than 1 perform optimal class-based information reduction by propagating conditional probabilities of class membership. But 2D Z-functions using 2D gaussians are of particular interest because they include in their function space all boolean functions of two variables (or at least analogue versions of these functions). For example, the gaussians which would come to represent the positive and negative exemplar patterns for XOR are drawn as ellipses in Figure 2(a). They have equal means and variances, but the negative exemplar patterns are correlated while the positive ones are anti-correlated. These models automatically give rise to the XOR surface in Figure 2(b) if put through equation (2). 
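This XOR construction can be checked numerically. The following sketch is not from the paper; the means, variances and correlation values are illustrative assumptions. It builds the 2D density f = (1/a) e^(-β) from the paper's two-dimensional formulas (4.2), forms z via equation (1), and evaluates the resulting surface at the four corners of the input square:

```python
import math

def gauss2d(x, y, mx, my, sx, sy, r):
    # 2-D gaussian density f = (1/a) * exp(-beta), with a and beta
    # as in the paper's two-dimensional formulas (4.2)
    a = 2 * math.pi * sx * sy * math.sqrt(1 - r * r)
    dx, dy = (x - mx) / sx, (y - my) / sy
    beta = (dx * dx + dy * dy - 2 * r * dx * dy) / (2 * (1 - r * r))
    return math.exp(-beta) / a

def z(x, y, pos, neg):
    # Equation (1): conditional probability of positive class membership
    fp = gauss2d(x, y, *pos)
    fn = gauss2d(x, y, *neg)
    return fp / (fp + fn)

# Illustrative XOR models as in Figure 2(a): equal means and variances,
# positive model anti-correlated, negative model correlated.
pos = (0.0, 0.0, 1.0, 1.0, -0.8)   # (mu_x, mu_y, sigma_x, sigma_y, r)
neg = (0.0, 0.0, 1.0, 1.0, +0.8)

# z is high where x and y disagree and low where they agree: an XOR surface.
corners = {(px, py): z(px, py, pos, neg) for px in (-1, 1) for py in (-1, 1)}
```

With these (hypothetical) parameter settings, z exceeds 0.9 at the anti-correlated corners (1, -1) and (-1, 1) and falls below 0.1 at (1, 1) and (-1, -1), reproducing the qualitative shape of Figure 2(b).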
An interesting observation is that a problem of Nth order (XOR is 2nd order, 3-parity is 3rd order, etc.) can be solved by a polynomial of degree N (Figure 2d). Since 2nd-degree polynomials like (3) are used in our system, we are one step up in power from 1st-degree systems like the Perceptron. Thus 3-parity is to the Z-function unit what XOR is to the Perceptron (in this case, not quadratically separable). \n\nA GAUSSIAN IS: \n\nf(x) = (1/a) e^(-β(x))     (4) \n\nin one dimension: \n\na = (2π)^(1/2) σx     (4.1.1) \n\nβ(x) = (x - μx)^2 / (2 σx^2)     (4.1.2) \n\nin two dimensions: \n\na = 2π σx σy (1 - r^2)^(1/2)     (4.2.1) \n\nβ(x,y) = [ (x - μx)^2/σx^2 + (y - μy)^2/σy^2 - 2r(x - μx)(y - μy)/(σx σy) ] / (2(1 - r^2))     (4.2.2) \n\nin n dimensions, with K the covariance matrix: \n\na = (2π)^(n/2) |K|^(1/2)     (4.n.1) \n\nβ(x) = (1/2)(x - μ)^T K^(-1) (x - μ)     (4.n.2) \n\nwhere μx = E[x], σx^2 = E[x^2] - μx^2, and r = (E[xy] - μx μy) / (σx σy). 
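Using the one-dimensional forms (4.1.1) and (4.1.2), it is straightforward to verify that equation (2) really is just equation (1) rewritten as a logistic of input = β-(x) - β+(x) + ln(a-/a+). The sketch below is not from the paper, and the model parameters are arbitrary illustrative choices; it computes z(x) both ways and checks that they agree:

```python
import math

def model_1d(mu, sigma):
    # a and beta for a 1-D gaussian, formulas (4.1.1) and (4.1.2)
    a = math.sqrt(2 * math.pi) * sigma
    beta = lambda x: (x - mu) ** 2 / (2 * sigma ** 2)
    return a, beta

def z_ratio(x, pos, neg):
    # Equation (1): z(x) = f+(x) / (f+(x) + f-(x))
    (ap, bp), (an, bn) = model_1d(*pos), model_1d(*neg)
    fp = math.exp(-bp(x)) / ap
    fn = math.exp(-bn(x)) / an
    return fp / (fp + fn)

def z_logistic(x, pos, neg):
    # Equation (2): the same quantity as a logistic squashing of the
    # exponent input = beta-(x) - beta+(x) + ln(a-/a+)
    (ap, bp), (an, bn) = model_1d(*pos), model_1d(*neg)
    inp = bn(x) - bp(x) + math.log(an / ap)
    return 1.0 / (1.0 + math.exp(-inp))

pos = (1.0, 0.5)    # illustrative (mu, sigma) for the positive model
neg = (-1.0, 0.8)   # illustrative (mu, sigma) for the negative model
```

The two routes agree to machine precision for any x, since f-(x)/f+(x) = (a+/a-) e^(β+ - β-) = e^(-input); dividing equation (1) through by f+(x) then gives the logistic form directly.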