{"title": "Scaling and Generalization in Neural Networks: A Case Study", "book": "Advances in Neural Information Processing Systems", "page_first": 160, "page_last": 168, "abstract": null, "full_text": "160 \n\nSCALING AND GENERALIZATION IN \nNEURAL NETWORKS: A CASE STUDY \n\nSubutai Ahmad \n\nCenter for Complex Systems Research \n\nUniversity of Illinois at Urbana-Champaign \n\n508 S. 6th St., Champaign, IL 61820 \n\nGerald Tesauro \nIBM Watson Research Center \nPO Box 704 \nYorktown Heights, NY 10598 \n\nABSTRACT \n\nThe issues of scaling and generalization have emerged as key issues in \ncurrent studies of supervised learning from examples in neural networks. \nQuestions such as how many training patterns and training cycles are \nneeded for a problem of a given size and difficulty, how to represent the \ninllUh and how to choose useful training exemplars, are of considerable \ntheoretical and practical importance. Several intuitive rules of thumb \nhave been obtained from empirical studies, but as yet there are few rig(cid:173)\norous results. In this paper we summarize a study Qf generalization in \nthe simplest possible case-perceptron networks learning linearly separa(cid:173)\nble functions. The task chosen was the majority function (i.e. return \na 1 if a majority of the input units are on), a predicate with a num(cid:173)\nber of useful properties. We find that many aspects of.generalization \nin multilayer networks learning large, difficult tasks are reproduced in \nthis simple domain, in which concrete numerical results and even some \nanalytic understanding can be achieved. \n\n1 \n\nINTRODUCTION \n\nIn recent years there has been a tremendous growth in the study of machines which \nlearn. One class of learning systems which has been fairly popular is neural net(cid:173)\nworks. 
Originally motivated by the study of the nervous system in biological organisms and as an abstract model of computation, they have since been applied to a wide variety of real-world problems (for examples see [Sejnowski and Rosenberg, 87] and [Tesauro and Sejnowski, 88]). Although the results have been encouraging, there is actually little understanding of the extensibility of the formalism. In particular, little is known of the resources required when dealing with large problems (i.e. scaling), and the abilities of networks to respond to novel situations (i.e. generalization). \n\nThe objective of this paper is to gain some insight into the relationships between three fundamental quantities under a variety of situations. In particular we are interested in the relationships between the size of the network, the number of training instances, and the generalization that the network performs, with an emphasis on the effects of the input representation and the particular patterns present in the training set. \n\nAs a first step to a detailed understanding, we summarize a study of scaling and generalization in the simplest possible case. Using feed forward networks, the type of networks most common in the literature, we examine the majority function (return a 1 if a majority of the inputs are on), a boolean predicate with a number of useful features. 
By using a combination of computer simulations and analysis in the limited domain of the majority function, we obtain some concrete numerical results which provide insight into the process of generalization and which will hopefully lead to a better understanding of learning in neural networks in general. \n\n2 THE MAJORITY FUNCTION \n\nThe function we have chosen to study is the majority function, a simple predicate whose output is a 1 if and only if more than half of the input units are on. This function has a number of useful properties which facilitate a study of this type. The function has a natural appeal and can occur in several different contexts in the real world. The problem is linearly separable (i.e. of predicate order 1 [Minsky and Papert, 69]). A version of the perceptron convergence theorem applies, so we are guaranteed that a network with one layer of weights can learn the function. Finally, when there is an odd number of input units, exactly half of the possible inputs result in an output of 1. This property tends to minimize any negative effects that may result from having too many positive or negative training examples. \n\n3 METHODOLOGY \n\nThe class of networks used is feed forward networks [Rumelhart and McClelland, 86], a general category of networks that includes perceptrons and the multi-layered networks most often used in current research. Since majority is a boolean function of predicate order 1, we use a network with no hidden units. The output function used was a sigmoid with a bias. The basic procedure consisted of three steps. First the network was initialized to some random starting weights. Next it was trained using back propagation on a set of training patterns. Finally, the performance of the network was tested on a set of random test patterns. This performance figure was used as the estimate of the network's generalization. 
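As a concrete illustration of this three-step procedure (random initialization, back-propagation training, testing on random patterns), the following is a minimal modern sketch. It is not the authors' original simulator: the input size, training-set size, learning rate, and cycle count below are arbitrary illustrative choices.

```python
# Illustrative sketch of the experimental procedure: a single-layer
# network (no hidden units) with a sigmoid output and bias, trained by
# back propagation (the delta rule in this case) on the majority
# function, then tested on random patterns.  All numbers are arbitrary.
import numpy as np

def majority(x):
    # 1 iff more than half of the input units are on
    return 1.0 if x.sum() > len(x) / 2 else 0.0

def train_and_test(d=11, S=200, cycles=200, lr=0.5, n_test=500, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.integers(0, 2, size=(S, d)).astype(float)   # 0,1 representation
    y = np.array([majority(x) for x in X])
    w = rng.normal(0.0, 0.1, size=d)                    # random starting weights
    b = 0.0
    for _ in range(cycles):
        for x, t in zip(X, y):
            o = 1.0 / (1.0 + np.exp(-(w @ x + b)))      # sigmoid with bias
            delta = (t - o) * o * (1.0 - o)             # back-propagated error
            w += lr * delta * x
            b += lr * delta
    # failure rate f: fraction of misclassified random test patterns
    Xt = rng.integers(0, 2, size=(n_test, d)).astype(float)
    yt = np.array([majority(x) for x in Xt])
    preds = (1.0 / (1.0 + np.exp(-(Xt @ w + b))) > 0.5).astype(float)
    return float(np.mean(preds != yt))

f = train_and_test()
```

With settings like these, a single layer of weights learns majority to a low failure rate, consistent with the task being linearly separable.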
Since there is a large amount of randomness in the procedure, most of our data are averages over several simulations. \n\n* The material contained in this paper is a condensation of portions of the first author's M.S. thesis [Ahmad, 88]. \n\nFigure 1: The average failure rate as a function of S. d = 25. \n\nNotation. In the following discussion, we denote S to be the number of training patterns, d the number of input units, and c the number of cycles through the training set. Let f be the failure rate (the fraction of misclassified test instances), and π be the set of training patterns. \n\n4 RANDOM TRAINING PATTERNS \n\nWe first examine the failure rate as a function of S and d. Figure 1 shows the graph of the average failure rate as a function of S, for a fixed input size d = 25. Not surprisingly, we find that the failure rate decreases fairly monotonically with S. Our simulations show that, in fact, for majority there is a well defined relationship between the failure rate and S. Figure 2 shows this for a network with 25 input units. The figure indicates that ln f is proportional to S, implying that the failure rate decreases exponentially with S, i.e., f = αe^(-βS). 1/β can be thought of as a characteristic training set size, corresponding to a failure rate of α/e. \n\nObtaining the exact scaling relationship of 1/β was somewhat tricky. Plotting β on a log-log plot against d showed it to be close to a straight line, indicating that 1/β increases as d^p for some constant p. Extracting the exponent by measuring the slope of the log-log graph turned out to be very error prone, since the data only ranged over one order of magnitude. An alternate method for obtaining the exponent is to test for a particular exponent p by setting S = a d^p. 
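The slope-extraction step just described amounts to fitting a least-squares line through (S, ln f) and reading off β. A minimal sketch follows; the data are synthetic and the constants illustrative, chosen only to mirror the d = 25 slope of roughly -0.01, not taken from the paper's measurements.

```python
# Recovering beta from failure-rate data, assuming f = alpha * exp(-beta * S).
# Synthetic, noise-free data is used purely to illustrate the fit.
import numpy as np

alpha, beta = 0.5, 0.01                  # illustrative values (cf. slope ~ -0.01 at d = 25)
S = np.arange(70, 421, 70, dtype=float)  # training-set sizes
f = alpha * np.exp(-beta * S)

slope, intercept = np.polyfit(S, np.log(f), 1)  # straight line through (S, ln f)
beta_hat = -slope                 # estimated decay constant
char_size = 1.0 / beta_hat        # characteristic training set size (failure rate alpha/e)
```

On real, noisy averages the same fit applies; only the residuals change.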
Since a linear scaling relationship is theoretically plausible, we measured the failure rate of the network at S = ad for various values of a. \n\nFigure 2: ln f as a function of S. d = 25. The slope was ≈ -0.01. \n\nAs Figure 3 shows, the failure rate remains more or less constant for fixed values of a, indicating a linear scaling relationship with d. Thus O(d) training patterns should be required to learn majority to a fixed level of performance. Note that if we require perfect learning, then the failure rate has to be < 1/(2^d - S) ≈ 1/2^d. By substituting this for f in the above formula and solving for S, we get that (1/β)(d ln 2 + ln α) patterns are required. The extra factor of d suggests that O(d^2) patterns would be required to learn majority perfectly. We will show in Section 6.1 that this is actually an overestimate. \n\n5 THE INPUT REPRESENTATION \n\nSo far in our simulations we have used the representation commonly used for boolean predicates. Whenever an input feature has been true, we clamped the corresponding input unit to a 1, and when it has been off we have clamped it to a 0. There is no reason, however, why some other representation couldn't have been used. Notice that in back propagation the weight change is proportional to the incoming input signal; hence the weight from a particular input unit to the output unit is changed only when the pattern is misclassified and the input unit is non-zero. The weight remains unchanged when the input unit is 0. 
If the 0,1 representation were changed to a -1,+1 representation, each weight would be changed more often; hence the network should learn the training set more quickly (simulations in [Stornetta and Huberman, 87] reported such a decrease in training time using a -1/2,+1/2 representation). \n\nFigure 3: Failure rate vs d with S = 3d, 5d, 7d. \n\nWe found that not only did the training time decrease with the new representation, the generalization of the network improved significantly. The scaling of the failure rate with respect to S is unchanged, but for any fixed value of S, the generalization is about 5-10% better. Also, the scaling with respect to d is still linear, but the constant for a fixed performance level is smaller. Although the exact reason for the improved generalization is unclear, the following might be a plausible explanation. A weight is changed only if the corresponding input is non-zero. By the definition of the majority function, the average number of units that are on for the positive instances is higher than for the negative instances. Hence, using the 0,1 representation, the weight changes are more pronounced for the positive instances than for the negative instances. Since the weights are changed whenever a pattern is misclassified, the net result is that the weight change is greater when a positive event is misclassified than when a negative event is misclassified. Thus, there seems to be a bias in the 0,1 representation for correcting the hyperplane more when a positive event is misclassified. In the new representation, both positive and negative events are treated equally; hence it is unbiased. 
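The mechanism behind this asymmetry is easy to see directly: the delta-rule update to weight i carries a factor of the input x_i, so "off" units learn nothing in the 0,1 encoding. A toy sketch with hypothetical numbers (not the paper's simulations):

```python
# With 0,1 inputs, a delta-rule update leaves the weights of "off" units
# untouched; the -1,+1 encoding (x -> 2x - 1) moves every weight on
# every misclassified pattern.  Values here are purely illustrative.
import numpy as np

x01 = np.array([1.0, 0.0, 1.0, 0.0, 0.0])  # pattern in the 0,1 representation
xpm = 2.0 * x01 - 1.0                      # same pattern, -1,+1 representation

def delta_update(x, w, target, lr=0.1):
    o = 1.0 / (1.0 + np.exp(-(w @ x)))            # sigmoid output
    return w + lr * (target - o) * o * (1.0 - o) * x  # weight change ~ input

w0 = np.zeros(5)
changed01 = delta_update(x01, w0, 1.0) != w0   # only the "on" units move
changedpm = delta_update(xpm, w0, 1.0) != w0   # every unit moves
```

In the 0,1 case only weights 0 and 2 change; in the -1,+1 case all five do, which is the extra learning pressure the text describes.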
\n\nThe basic lesson here seems to be that one should carefully examine every choice that has been made during the design process. The representation of the input, even down to such low-level details as deciding whether "off" should be represented as 0 or -1, can make a significant difference in the generalization. \n\n6 BORDER PATTERNS \n\nWe now consider a method for improving the generalization by intelligently selecting the patterns in the training set. Normally, for a given training set, when the inputs are spread evenly around the input space, there can be several generalizations which are consistent with the patterns. The performance of the network on the test set becomes a random event, depending on the initial state of the network. If practical, it makes sense to choose training patterns which can limit the possible generalizations. In particular, if we can find those examples which are closest to the separating surface, we can maximally constrain the number of generalizations. The solution that the network converges to using these "border" patterns should have a higher probability of being a good separator. In general, finding a perfect set of border patterns can be computationally expensive; however, there might exist simple heuristics which can help select good training examples. \n\nWe explored one heuristic for choosing such points: selecting only those patterns in which the number of 1's is either one less or one more than half the number of input units. Intuitively, these inputs should be close to the desired separating surface, thereby constraining the network more than random patterns would. Our results show that using only border patterns in the training set, there is a large increase in the expected performance of the network for a given S. 
In addition, the scaling behavior as a function of S seems to be very different, and is faster than an exponential decrease. (Figure 4 shows typical failure rate vs. S curves comparing border patterns, the -1,+1 representation, and the 0,1 representation.) \n\n6.1 BORDER PATTERNS AND PERFECT LEARNING \n\nWe say the network has perfectly learned a function when the test patterns are never misclassified. For the majority function, one can argue that at least some border patterns must be present in order to guarantee perfect performance. If no border patterns were in the training set, then the network could have learned the ⌈d/2⌉-1-of-d or the ⌈d/2⌉+1-of-d function. Furthermore, if we know that a certain number of border patterns is guaranteed to give perfect performance, say b(d), then given the probability that a random pattern is a border pattern, we can calculate the expected number of random patterns sufficient to learn majority. \n\nFor odd d, there are 2 * C(d, (d-1)/2) border patterns, so the probability of choosing a border pattern randomly is: \n\nC(d, (d-1)/2) / 2^(d-1) \n\nAs d gets larger, this probability decreases as 1/√d.* The expected number of randomly chosen patterns required before b(d) border patterns are chosen is therefore on the order of b(d)√d. \n\n* This can be shown using Stirling's approximation to d!. \n\nFigure 4: Graph showing the average failure rate vs. S using the 0,1 representation (right), the -1,+1 representation (middle), and using border patterns (left). The network had 23 input units and was tested on a test set consisting of 1024 patterns. \n\nFrom our data we find that 3d border patterns are always sufficient to learn the test set perfectly. From this, and from the theoretical results in [Cover, 65], we can be confident that b(d) is linear in d. 
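For small odd d, the counting argument above can be checked by direct enumeration. The sketch below (illustrative, not part of the paper) confirms the 2·C(d, (d-1)/2) count and compares the resulting probability with its 1/√d Stirling estimate:

```python
# Border patterns for odd d: bit patterns whose number of 1's is
# (d-1)/2 or (d+1)/2, i.e. one below or one above half of d.
from math import comb, sqrt, pi

d = 7
k = (d - 1) // 2
# enumerate all 2^d patterns and keep the border ones
border = [p for p in range(2 ** d) if bin(p).count("1") in (k, k + 1)]
n_border = len(border)                   # should equal 2 * C(d, (d-1)/2)
prob = comb(d, k) / 2 ** (d - 1)         # chance a random pattern is a border pattern
stirling = sqrt(8.0 / (pi * d))          # asymptotic 1/sqrt(d) form from Stirling
```

Even at d = 7 the Stirling estimate is within about 10% of the exact probability, and the agreement improves as d grows.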
Thus, O(d^(3/2)) random patterns should be sufficient to learn majority perfectly. \n\nIt should be mentioned that border patterns are not the only patterns which contribute to the generalization of the network. Figure 5 shows that the failure rate of a network trained with random training patterns which happen to contain b border patterns is substantially lower than that of a network trained on a set consisting of only b border patterns. Note that perfect performance is achieved at the same point in both cases. \n\nFigure 5: This figure compares the failure rate on a random training set which happens to contain b border patterns (bottom plot) with a training set composed of only b border patterns (top plot). \n\n7 CONCLUSION \n\nIn this paper we have described a systematic study of some of the various factors affecting scaling and generalization in neural networks. Using empirical studies in a simple test domain, we were able to obtain precise scaling relationships between the performance of the network, the number of training patterns, and the size of the network. It was shown that for a fixed network size, the failure rate decreases exponentially with the size of the training set. The number of patterns required to achieve a fixed performance level was shown to increase linearly with the network size. \n\nA general finding was that the performance of the network was very sensitive to a number of factors. A slight change in the input representation caused a jump in the performance of the network. The specific patterns in the training set had a large influence on the final weights and on the generalization. 
By selecting the training patterns intelligently, the performance of the network was increased significantly. \n\nThe notion of border patterns was introduced; these are the most interesting patterns in the training set. In terms of the number of patterns required to teach a function to the network, these patterns are near-optimal. It was shown that a network trained only on border patterns generalizes substantially better than one trained on the same number of random patterns. Border patterns were also used to derive an expected bound on the number of random patterns sufficient to learn majority perfectly. It was shown that, on average, O(d^(3/2)) random patterns are sufficient to learn majority perfectly. \n\nIn conclusion, this paper advocates a careful study of the process of generalization in neural networks. There are a large number of different factors which can affect the performance. Any assumptions made when applying neural networks to a real-world problem should be made with care. Although much more work needs to be done, it was shown that many of the issues can be effectively studied in a simple test domain. \n\nAcknowledgements \n\nWe thank T. Sejnowski, R. Rivest and A. Barron for helpful discussions. We also thank T. Sejnowski and B. Bogstad for assistance in development of the simulator code. This work was partially supported by the National Center for Supercomputing Applications and by National Science Foundation grant Phy 86-58062. \n\nReferences \n\n[Ahmad, 88] S. Ahmad. A Study of Scaling and Generalization in Neural Networks. Technical Report UIUCDCS-R-88-1454, Department of Computer Science, University of Illinois, Urbana-Champaign, IL, 1988. \n\n[Cover, 65] T. Cover. Geometrical and statistical properties of systems of linear inequalities. IEEE Trans. Elect. Comp., 14:326-334, 1965. \n\n[Minsky and Papert, 69] Marvin Minsky and Seymour Papert. Perceptrons. 
MIT Press, Cambridge, Mass., 1969. \n\n[Muroga, 71] S. Muroga. Threshold Logic and its Applications. Wiley, New York, 1971. \n\n[Rumelhart and McClelland, 86] D. E. Rumelhart and J. L. McClelland, editors. Parallel Distributed Processing: Explorations in the Microstructure of Cognition: Foundations. Volume 1, MIT Press, Cambridge, Mass., 1986. \n\n[Stornetta and Huberman, 87] W.S. Stornetta and B.A. Huberman. An improved three-layer, back propagation algorithm. In Proceedings of the IEEE First International Conference on Neural Networks, San Diego, CA, 1987. \n\n[Sejnowski and Rosenberg, 87] T.J. Sejnowski and C.R. Rosenberg. Parallel networks that learn to pronounce English text. Complex Systems, 1:145-168, 1987. \n\n[Tesauro and Sejnowski, 88] G. Tesauro and T.J. Sejnowski. A Parallel Network that Learns to Play Backgammon. Technical Report CCSR-88-2, Center for Complex Systems Research, University of Illinois, Urbana-Champaign, IL, 1988.", "award": [], "sourceid": 129, "authors": [{"given_name": "Subutai", "family_name": "Ahmad", "institution": null}, {"given_name": "Gerald", "family_name": "Tesauro", "institution": null}]}