{"title": "Strategies for Teaching Layered Networks Classification Tasks", "book": "Neural Information Processing Systems", "page_first": 850, "page_last": 859, "abstract": null, "full_text": "850 \n\nStrategies for Teaching Layered Networks \n\nClassification Tasks \n\nBen S. Wittner 1 and John S. Denker \n\nAT&T Bell Laboratories \n\nHolmdel, New Jersey 07733 \n\nAbstract \n\nThere is a widespread misconception that the delta-rule is in some sense guaranteed to \nwork on networks without hidden units. As previous authors have mentioned, there is \nno such guarantee for classification tasks. We will begin by presenting explicit counter(cid:173)\nexamples illustrating two different interesting ways in which the delta rule can fail. We \ngo on to provide conditions which do guarantee that gradient descent will successfully \ntrain networks without hidden units to perform two-category classification tasks. We \ndiscuss the generalization of our ideas to networks with hidden units and to multi(cid:173)\ncategory classification tasks. \n\nThe Classification Task \n\nConsider networks of the form indicated in figure 1. We discuss various methods for \ntraining such a network, that is for adjusting its weight vector, w. If we call the input \nv, the output is g(w\u00b7 v), where 9 is some function. \n\nThe classification task we wish to train the network to perform is the following. Given \ntwo finite sets of vectors, Fl and F2, output a number greater than zero when a vector in \nFl is input, and output a number less than zero when a vector in F2 is input. Without \nsignificant loss of generality, we assume that 9 is odd (Le. g( -s) == -g( s\u00bb. In that case, \nthe task can be reformulated as follows. Define 2 \n\nF :== Fl U {-v such that v E F2} \n\n(1) \n\nand output a number greater than zero when a vector in F is input. 
The former formulation is more natural in some sense, but the latter is somewhat more convenient for analysis and is the one we use. We call vectors in F training vectors. \n\n1 Currently at NYNEX Science and Technology, 500 Westchester Ave., White Plains, NY 10604 \n2 We use both A := B and B =: A to denote \"A is by definition B\". \n\n© American Institute of Physics 1988 \n\nFigure 1: a simple network \n\nA Class of Gradient Descent Algorithms \n\nWe denote the solution set by \n\nW := {w such that g(w·v) > 0 for all v ∈ F}, \n\n(2) \n\nand we are interested in rules for finding some weight vector in W. We restrict our attention to rules based upon gradient descent down error functions E(w) of the form \n\nE(w) = Σ_{v ∈ F} h(w·v). \n\n(3) \n\nThe delta-rule is of this form with \n\nh(w·v) = h_δ(w·v) := (1/2)(b - g(w·v))^2 \n\n(4) \n\nfor some positive number b called the target (Rumelhart, McClelland, et al.). We call the delta-rule error function E_δ. \n\nFailure of Delta-rule Using Obtainable Targets \n\nLet g be any function that is odd and differentiable with g'(s) > 0 for all s. In this section we assume that the target b is in the range of g. We construct a set F of training vectors such that even though W is not empty, there is a local minimum of E_δ not located in W. In order to facilitate visualization, we begin by assuming that g is linear. We will then indicate why the construction works for the nonlinear case as well. We guess that this is the type of counter-example alluded to by Duda and Hart (p. 151) and by Minsky and Papert (p. 15). \n\nThe input vectors are two dimensional. The arrows in figure 2 represent the training vectors in F and the shaded region is W. There is one training vector, v¹, in the second quadrant, and all the rest are in the first quadrant. 
The training vectors in the first quadrant are arranged in pairs symmetric about the ray R and ending on the line L. The line L is perpendicular to R, and intersects R at unit distance from the origin. Figure 2 only shows three of those symmetric pairs, but to make this construction work we might need many. The point p lies on R at a distance of g⁻¹(b) from the origin. \n\nWe first consider the contribution to E_δ due to any single training vector, v. The contribution is \n\n(1/2)(b - g(w·v))^2, \n\n(5) \n\nand is represented in figure 3 in the z-direction. Since g is linear and since b is in the \n\n[Figure 2: the training vectors, the ray R, the line L, and the point p, plotted against the x-axis] \n\nFigure 3: Error surface \n\nWe now remove the assumption that g is linear. The key observation is that \n\ndh_δ/ds = h_δ'(s) = (b - g(s))(-g'(s)) \n\n(6) \n\nstill only has a single zero at s = g⁻¹(b), and so h_δ(s) still has a single minimum at g⁻¹(b). The contribution to E_δ due to the training vectors in the first quadrant (call it E_0, and call the contribution due to v¹ E_1) therefore still has a global minimum on the xy-plane at the point p. So, as in the linear case, if there are enough symmetric pairs of training vectors in the first quadrant, the value of E_0 at p can be made arbitrarily lower than the value along some circle in the xy-plane centered around p, and E_δ = E_0 + E_1 will have a local minimum arbitrarily near p. Q.E.D. \n\nFailure of Delta-rule Using Unobtainable Targets \n\nWe now consider the case where the target b is greater than any number in the range of g. The kind of counter-example presented in the previous section no longer exists, but we will show that for some choices of g, including the traditional choices, the delta-rule can still fail. 
Specifically, we construct a set F of training vectors such that even though W is not empty, for some choices of initial weights, the path traced out by going down the gradient of E_δ never enters W. \n\nFigure 4: Counter-example for unobtainable targets \n\nWe suppose that g has the following property. There exists a number r > 0 such that \n\nlim_{s→∞} h_δ'(-rs) / h_δ'(s) = 0. \n\n(7) \n\nAn example of such a g is \n\ng(s) = tanh(s) = 2/(1 + e^{-2s}) - 1, \n\n(8) \n\nfor which any r greater than 1 will do. \n\nThe solid arrows in figure 4 represent the training vectors in F and the more darkly shaded region is W. The set F has two elements, v¹ \n\n(9) \n\nand \n\nv². \n\n(10) \n\nThe dotted ray R lies on the diagonal {y = x}. \n\nSince E_δ(w) = h_δ(w·v¹) + h_δ(w·v²), the gradient descent algorithm follows the vector field \n\n-∇E(w) = -h_δ'(w·v¹) v¹ - h_δ'(w·v²) v². \n\n(11) \n\nThe reader can easily verify that for all w on R, \n\nw·v¹ = -r (w·v²). \n\n(12) \n\nSo by equation (7), if we constrain w to move along R, \n\nlim_{w→∞} -h_δ'(w·v¹) / -h_δ'(w·v²) = 0. \n\n(13) \n\nCombining equations (11) and (13) we see that there is a point q somewhere on R such that beyond q, -∇E(w) points into the region to the right of R, as indicated by the dotted arrows in figure 4. \n\nLet L be the horizontal ray extending to the right from q. Since for all s, \n\ng'(s) > 0 and b > g(s), \n\n(14) \n\nwe get that \n\n-h_δ'(s) = (b - g(s)) g'(s) > 0. \n\n(15) \n\nSo since both v¹ and v² have a positive y-component, -∇E(w) also has a positive y-component for all w. So once the algorithm following -∇E enters the region above L and to the right of R (indicated by light shading in figure 4), it never leaves. Q.E.D. 
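The gradient-descent machinery used in both counter-examples can be sketched in a few lines of code. The sketch below is an illustration only, assuming g = tanh and an obtainable target b = 0.5; the two-vector training set, learning rate, and initial weights are invented for the example (a benign, separable case, in contrast to the counter-examples above):

```python
import numpy as np

def g(s):
    # odd, differentiable squashing function of equation (8)
    return np.tanh(s)

def g_prime(s):
    return 1.0 - np.tanh(s) ** 2

def h_delta_prime(s, b):
    # h_delta'(s) = (b - g(s)) * (-g'(s)), as in equation (6)
    return (b - g(s)) * (-g_prime(s))

def in_W(F, w):
    # membership in the solution set W of equation (2)
    return all(g(w @ v) > 0 for v in F)

def gradient_descent(F, b, w0, lr=0.1, steps=1000):
    # discretization of w'(t) = -grad E(w) = sum_v -h'(w . v) v
    w = np.array(w0, dtype=float)
    for _ in range(steps):
        grad = sum(h_delta_prime(w @ v, b) * v for v in F)
        w = w - lr * grad
    return w

# Invented, separable training set: both vectors lie in the right half-plane.
F = [np.array([1.0, 0.2]), np.array([1.0, -0.2])]
w_final = gradient_descent(F, b=0.5, w0=[-1.0, 0.0])
```

In this easy case descent drives both dot products toward g⁻¹(b), so the weight vector does enter W; the counter-examples above show that this cannot be taken for granted.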
\n\nProperties to Guarantee Gradient Descent Learning \n\nIn this section we present three properties of an error function which guarantee that gradient descent will not fail to enter a non-empty W. \n\nWe call an error function of the form presented in equation (3) well formed if h is differentiable and has the following three properties. \n\n1. For all s, -h'(s) ≥ 0 (i.e. h does not push in the wrong direction). \n\n2. There exists some ε > 0 such that -h'(s) ≥ ε for all s ≤ 0 (i.e. h keeps pushing if there is a misclassification). \n\n3. h is bounded below. \n\nProposition 1 If the error function is well formed, then gradient descent is guaranteed to enter W, provided W is not empty. \n\nThe proof proceeds by contradiction. Suppose for some starting weight vector the path traced out by gradient descent never enters W. Since W is not empty, there is some non-zero w* in W. Since F is finite, \n\nA := min{w*·v such that v ∈ F} > 0. \n\n(16) \n\nLet w(t) be the path traced out by the gradient descent algorithm. So \n\nw'(t) = -∇E(w(t)) = Σ_{v ∈ F} -h'(w(t)·v) v for all t. \n\n(17) \n\nSince we are assuming that at least one training vector is misclassified at all times, by properties 1 and 2 and equation (17), \n\nw*·w'(t) ≥ εA for all t. \n\n(18) \n\nSo \n\n|w'(t)| ≥ εA/|w*| =: e > 0 for all t. \n\n(19) \n\nBy equations (17) and (19), \n\ndE(w(t))/dt = ∇E·w'(t) = -w'(t)·w'(t) ≤ -e² < 0 for all t. \n\n(20) \n\nThis means that \n\nE(w(t)) → -∞ as t → ∞. \n\n(21) \n\nBut property 3 and the fact that F is finite guarantee that E is bounded below. This contradicts equation (21) and finishes the proof. \n\nConsensus and Compromise \n\nSo far we have been concerned with the case in which F is separable (i.e. W is not empty). What kind of behavior do we desire in the non-separable case? 
One might hope that the algorithm will choose weights which produce correct results for as many of the training vectors as possible. We suggest that this is what gradient descent using a well formed error function does. \n\nFrom investigations of many well formed error functions, we suspect the following well formed error function is representative. Let g(s) = s, and for some b > 0, let \n\nh(s) = (b - s)^2 if s ≤ b; 0 otherwise. \n\n(22) \n\nIn all four frames of figure 5 there are three training vectors. Training vectors 1 and 2 are held fixed while 3 is rotated to become increasingly inconsistent with the others. In frames (i) and (ii) F is separable. The training set in frame (iii) lies just on the border between separability and non-separability, and the one in frame (iv) is in the interior of the non-separable regime. Regardless of the position of vector 3, the global minimum of the error function is the only minimum. \n\nFigure 5: The transition between separability and non-separability \n\nIn frames (i) and (ii), the error function is zero on the shaded region and the shaded region is contained in W. As we move training vector number 3 towards its position in frame (iii), the situation remains the same except the shaded region moves arbitrarily far from the origin. At frame (iii) there is a discontinuity; the region on which the error function is at its global minimum is now the one-dimensional ray indicated by the shading. Once training vector 3 has moved into the interior of the non-separable regime, the region on which the error function has its global minimum is a point closer to training vectors 1 and 2 than to 3 (as indicated by the \"x\" in frame (iv)). 
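This behavior can be checked numerically with the well formed error function of equation (22). The sketch below uses invented stand-ins for the configuration of frame (iv) (two mutually consistent training vectors plus one that opposes them), not the exact vectors of figure 5:

```python
import numpy as np

B = 1.0  # the target b > 0 of equation (22); value chosen for illustration

def h(s):
    # well formed error function of equation (22), with g(s) = s
    return (B - s) ** 2 if s <= B else 0.0

def h_prime(s):
    return -2.0 * (B - s) if s <= B else 0.0

def minimize(F, w0, lr=0.05, steps=2000):
    # plain gradient descent on E(w) = sum_v h(w . v), equation (3)
    w = np.array(w0, dtype=float)
    for _ in range(steps):
        w = w - lr * sum(h_prime(w @ v) * v for v in F)
    return w

# Invented non-separable set: vectors 1 and 2 agree, vector 3 opposes them.
F = [np.array([1.0, 0.2]), np.array([1.0, -0.2]), np.array([-1.0, 0.0])]
w = minimize(F, w0=[0.0, 0.0])
satisfied = sum(1 for v in F if w @ v > 0)  # how many training vectors w classifies correctly
```

Descent settles at the unique global minimum (here near w = (1/3, 0)), which satisfies the two consistent vectors and sacrifices the third: a compromise rather than a consensus.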
\n\nIf all the training vectors can be satisfied, the algorithm does so; otherwise, it tries to satisfy as many as possible, and there is a discontinuity between the two regimes. We summarize this by saying that it finds a consensus if possible; otherwise it devises a compromise. \n\nHidden Layers \n\nFor networks with hidden units, it is probably impossible to prove anything like proposition 1. The reason is that even though property 2 assures that the top layer of weights gets a non-vanishing error signal for misclassified inputs, the lower layers might still get a vanishingly weak signal if the units above them are operating in the saturated regime. \n\nWe believe it is nevertheless a good idea to use a well formed error function when training such networks. Based upon a probabilistic interpretation of the output of the network, Baum and Wilczek have suggested using an entropy error function (we thank J.J. Hopfield and D.W. Tank for bringing this to our attention). Their error function is well formed. Levin, Solla, and Fleisher report simulations in which switching to the entropy error function from the delta-rule introduced an order of magnitude speed-up of learning for a network with hidden units. \n\nMultiple Categories \n\nOften one wants to classify a given input vector into one of many categories. One popular way of implementing multiple categories in a feed-forward network is the following. Let the network have one output unit for each category. Denote by o_j^v(w) the output of the j-th output unit when input v is presented to the network having weights w. The network is considered to have classified v as being in the k-th category if \n\no_k^v(w) > o_j^v(w) for all j ≠ k. \n\n(23) \n\nThe way such a network is usually trained is the generalized delta-rule (Rumelhart, McClelland, et al.). Specifically, denote by c(v) the desired classification of v and let \n\nb_j^v 
:= b if j = c(v); -b otherwise, \n\n(24) \n\nfor some target b > 0. One then uses the error function \n\nE(w) := Σ_v Σ_j (b_j^v - o_j^v(w))^2. \n\n(25) \n\nThis formulation has several bothersome aspects. For one, the error function is not well formed. Secondly, the error function is trying to adjust the outputs, but what we really care about is the differences between the outputs. A symptom of this is the fact that the change made to the weights of the connections to any output unit does not depend on any of the weights of the connections to any of the other output units. \n\nTo remedy this and also the other defects of the delta-rule we have been discussing, we suggest the following. For each v and j, define the relative coordinate \n\nβ_j^v(w) := o_{c(v)}^v(w) - o_j^v(w). \n\n(26) \n\nWhat we really want is all the β to be positive, so use the error function \n\nE(w) := Σ_v Σ_{j ≠ c(v)} h(β_j^v(w)) \n\n(27) \n\nfor some well formed h. In the simulations we have run, this does not always help, but sometimes it helps quite a bit. \n\nWe have one further suggestion. Property 2 of a well formed error function (and the fact that derivatives are continuous) means that the algorithm will not be completely satisfied with positive β; it will try to make them greater than zero by some non-zero margin. That is a good thing, because the training vectors are only representatives of the vectors one wants the network to correctly classify. Margins are critically important for obtaining robust performance on input vectors not in the training set. The problem is that the margin is expressed in meaningless units; it makes no sense to use the same numerical margin for an output unit which varies a lot as is used for an output unit which varies only a little. We suggest, therefore, that for each j and v, one keep a running estimate of σ_j^v(w), the variance of β_j^v(w), and replace β_j^v(w) in equation (27) by \n\nβ_j^v(w)/σ_j^v(w). 
\n\n(28) \n\nOf course, when beginning the gradient descent, it is difficult to have a meaningful estimate of σ_j^v(w) because w is changing so much, but as the algorithm begins to converge, the estimate can become increasingly meaningful. \n\nReferences \n\n1. David Rumelhart, James McClelland, and the PDP Research Group, Parallel Distributed Processing, MIT Press, 1986. \n\n2. Richard Duda and Peter Hart, Pattern Classification and Scene Analysis, John Wiley & Sons, 1973. \n\n3. Marvin Minsky and Seymour Papert, \"On Perceptrons\", Draft, 1987. \n\n4. Eric Baum and Frank Wilczek, these proceedings. \n\n5. Esther Levin, Sara A. Solla, and Michael Fleisher, private communications. \n", "award": [], "sourceid": 85, "authors": [{"given_name": "Ben", "family_name": "Wittner", "institution": null}, {"given_name": "John", "family_name": "Denker", "institution": null}]}