{"title": "What Size Net Gives Valid Generalization?", "book": "Advances in Neural Information Processing Systems", "page_first": 81, "page_last": 90, "abstract": null, "full_text": "WHAT SIZE NET GIVES VALID GENERALIZATION?* \n\nEric B. Baum \nDepartment of Physics \nPrinceton University \nPrinceton NJ 08540 \n\nDavid Haussler \nComputer and Information Science \nUniversity of California \nSanta Cruz, CA 95064 \n\nABSTRACT \n\nWe address the question of when a network can be expected to generalize from m random training examples chosen from some arbitrary probability distribution, assuming that future test examples are drawn from the same distribution. Among our results are the following bounds on appropriate sample vs. network size. Assume 0 < ε ≤ 1/8. We show that if m ≥ O((W/ε) log(N/ε)) random examples can be loaded on a feedforward network of linear threshold functions with N nodes and W weights, so that at least a fraction 1 - ε/2 of the examples are correctly classified, then one has confidence approaching certainty that the network will correctly classify a fraction 1 - ε of future test examples drawn from the same distribution. Conversely, for fully-connected feedforward nets with one hidden layer, any learning algorithm using fewer than Ω(W/ε) random training examples will, for some distributions of examples consistent with an appropriate weight choice, fail at least some fixed fraction of the time to find a weight choice that will correctly classify more than a 1 - ε fraction of the future test examples. \n\nINTRODUCTION \nIn the last few years, many diverse real-world problems have been attacked by back propagation. For example \"expert systems\" have been produced for mapping text to phonemes [sr87], for determining the secondary structure of proteins [qs88], and for playing backgammon [ts88]. 
\n\nIn such problems, one starts with a training database, chooses (by making an educated guess) a network, and then uses back propagation to load as many of the training examples as possible onto the network. The hope is that the network so designed will generalize to predict correctly on future examples of the same problem. This hope is not always realized. \n\n* This paper will appear in the January 1989 issue of Neural Computation. For completeness, we reprint this full version here, with the kind permission of MIT Press. © 1989, MIT Press \n\nWe address the question of when valid generalization can be expected. Given a training database of m examples, what size net should we attempt to load these on? We will assume that the examples are drawn from some fixed but arbitrary probability distribution, that the learner is given some accuracy parameter ε, and that his goal is to produce with high probability a feedforward neural network that predicts correctly at least a fraction 1 - ε of future examples drawn from the same distribution. These reasonable assumptions are suggested by the protocol proposed by Valiant for learning from examples [val84]. However, here we do not assume the existence of any \"target function\"; indeed the underlying process generating the examples may classify them in a stochastic manner, as in e.g. [dh73]. \n\nOur treatment of the problem of valid generalization will be quite general in that the results we give will hold for arbitrary learning algorithms and not just for back propagation. The results are based on the notion of capacity introduced by Cover [cov65] and developed by Vapnik and Chervonenkis [vc71], [vap82]. Recent overviews of this theory are given in [dev88], [behw87b] and [pol84], from the various perspectives of pattern recognition, Valiant's computational learning theory, and pure probability theory, respectively. 
This theory generalizes the simpler counting arguments based on cardinality and entropy used in [behw87a] and [dswshhj87], in the latter case specifically to study the question of generalization in feedforward nets (see [vap82] or [behw87b]). \n\nThe particular measures of capacity we use here are the maximum number of dichotomies that can be induced on m inputs, and the Vapnik-Chervonenkis (VC) Dimension, defined below. We give upper and lower bounds on these measures for classes of networks obtained by varying the weights in a fixed feedforward architecture. These results show that the VC dimension is closely related to the number of weights in the architecture, in analogy with the number of coefficients or \"degrees of freedom\" in regression models. One particular result, of some interest independent of its implications for learning, is a construction of a near minimal size net architecture capable of implementing all dichotomies on a randomly chosen set of points on the n-hypercube with high probability. \n\nApplying these results, we address the question of when a network can be expected to generalize from m random training examples chosen from some arbitrary probability distribution, assuming that future test examples are drawn from the same distribution. Assume 0 < ε ≤ 1/8. We show that if m ≥ O((W/ε) log(N/ε)) random examples can be loaded on a feedforward network of linear threshold functions with N nodes and W weights, so that at least a fraction 1 - ε/2 of the examples are correctly classified, then one has confidence approaching certainty that the network will correctly classify a fraction 1 - ε of future test examples drawn from the same distribution. 
Conversely, for fully-connected feedforward nets with one hidden layer, any learning algorithm using fewer than Ω(W/ε) random training examples will, for some distributions of examples consistent with an appropriate weight choice, fail at least some fixed fraction of the time to find a weight choice that will correctly classify more than a 1 - ε fraction of the future test examples. \n\nIgnoring the constant and logarithmic factors, these results suggest that the appropriate number of training examples is approximately the number of weights times the inverse¹ of the accuracy parameter ε. Thus, for example, if we desire an accuracy level of 90%, corresponding to ε = 0.1, we might guess that we would need about 10 times as many training examples as we have weights in the network. This is in fact the rule of thumb suggested by Widrow [wid87], and appears to work fairly well in practice. At the end of Section 3, we briefly discuss why learning algorithms that try to minimize the number of non-zero weights in the network [rum87] [hin87] may need fewer training examples. \n\nDEFINITIONS \nWe use ln to denote the natural logarithm and log to denote the logarithm base 2. We define an example as a pair (x, a), x ∈ ℝ^n, a ∈ {-1, +1}. We define a random sample as a sequence of examples drawn independently at random from some distribution D on ℝ^n × {-1, +1}. Let f be a function from ℝ^n into {-1, +1}. We define the error of f, with respect to D, as the probability that a ≠ f(x) for (x, a) a random example. \n\nLet F be a class of {-1, +1}-valued functions on ℝ^n and let S be a set of m points in ℝ^n. A dichotomy of S induced by f ∈ F is a partition of S into two disjoint subsets S^+ and S^- such that f(x) = +1 for x ∈ S^+ and f(x) = -1 for x ∈ S^-. 
By Δ_F(S) we denote the number of distinct dichotomies of S induced by functions f ∈ F, and by Δ_F(m) we denote the maximum of Δ_F(S) over all S ⊂ ℝ^n of cardinality m. We say S is shattered by F if Δ_F(S) = 2^{|S|}, i.e. all dichotomies of S can be induced by functions in F. The Vapnik-Chervonenkis (VC) dimension of F, denoted VCdim(F), is the cardinality of the largest S ⊂ ℝ^n that is shattered by F, i.e. the largest m such that Δ_F(m) = 2^m. \n\nA feedforward net with input from ℝ^n is a directed acyclic graph G with an ordered sequence of n source nodes (called inputs) and one sink (called the output). Nodes of G that are not source nodes are called computation nodes; nodes that are neither source nor sink nodes are called hidden nodes. With each computation node n_i there is associated a function f_i : ℝ^{indegree(n_i)} → {-1, +1}, where indegree(n_i) is the number of incoming edges for node n_i. With the net itself there is associated a function f : ℝ^n → {-1, +1} defined by composing the f_i's in the obvious way, assuming that component i of the input x is placed at the i-th input node. \n\n¹ It should be noted that our bounds differ significantly from those given in [dev88] in that the latter exhibit a dependence on the inverse of ε^2. This is because we derive our results from Vapnik's theorem on the uniform relative deviation of frequencies from their probabilities ([vap82], see Appendix A3 of [behw87b]), giving sharper bounds as ε approaches 0. \n\nA feedforward architecture is a class of feedforward nets all of which share the same underlying graph. Given a graph G we define a feedforward architecture by associating to each computation node n_i a class of functions F_i from ℝ^{indegree(n_i)} to {-1, +1}. The resulting architecture consists of all feedforward nets obtained by choosing a particular function f_i from F_i for each computation node n_i. 
We will identify an architecture with the class of functions computed by the individual nets within the architecture when no confusion will arise. \n\nCONDITIONS SUFFICIENT FOR VALID GENERALIZATION \nTheorem 1: Let F be a feedforward architecture generated by an underlying graph G with N ≥ 2 computation nodes and F_i be the class of functions associated with computation node n_i of G, 1 ≤ i ≤ N. Let d = Σ_{i=1}^N VCdim(F_i). Then Δ_F(m) ≤ Π_{i=1}^N Δ_{F_i}(m) ≤ (Nem/d)^d for m ≥ d, where e is the base of the natural logarithm. \n\nProof: Assume G has n input nodes and that the computation nodes of G are ordered so that node n_i receives inputs only from input nodes and from computation nodes n_j, 1 ≤ j ≤ i-1. Let S be a set of m points in ℝ^n. The dichotomy induced on S by the function in node n_1 can be chosen in at most Δ_{F_1}(m) ways. This choice determines the input to node n_2 for each of the m points in S. The dichotomy induced on these m inputs by the function in node n_2 can be chosen in at most Δ_{F_2}(m) ways, etc. Any dichotomy of S induced by the whole network can be obtained by choosing dichotomies for each of the n_i's in this manner, hence Δ_F(m) ≤ Π_{i=1}^N Δ_{F_i}(m). \n\nBy a theorem of Sauer [sau72], whenever VCdim(F) = k < ∞, Δ_F(m) ≤ (em/k)^k for all m ≥ k (see also [behw87b]). Let d_i = VCdim(F_i), 1 ≤ i ≤ N. Thus d = Σ_{i=1}^N d_i. Then Π_{i=1}^N Δ_{F_i}(m) ≤ Π_{i=1}^N (em/d_i)^{d_i} for m ≥ d. Using the fact that Σ_{i=1}^N -α_i log α_i ≤ log N whenever α_i > 0, 1 ≤ i ≤ N, and Σ_{i=1}^N α_i = 1, and setting α_i = d_i/d, it is easily verified that Π_{i=1}^N d_i^{d_i} ≥ (d/N)^d. Hence Π_{i=1}^N (em/d_i)^{d_i} ≤ (Nem/d)^d. \n\nCorollary 2: Let F be the class of all functions computed by feedforward nets defined on a fixed underlying graph G with E edges and N ≥ 2 computation nodes, each of which computes a linear threshold function. Let W = E + N (the total number of weights in the network, including one weight per edge and one threshold per computation node). 
Then Δ_F(m) ≤ (Nem/W)^W for all m ≥ W and VCdim(F) ≤ 2W log(eN). \n\nProof: The first inequality follows directly from Theorem 1 using the fact that VCdim(F) = k + 1 when F is the class of all linear threshold functions on ℝ^k (see e.g. [wd81]). For the second inequality, it is easily verified that for N ≥ 2 and m = 2W log(eN), (Nem/W)^W < 2^m. Hence this is an upper bound on VCdim(F). \n\nUsing VC dimension bounds given in [wd81], related corollaries can be obtained for nets that use spherical and other types of polynomial threshold functions. These bounds can be used in the following. \n\nTheorem 3 [vap82] (see [behw87b], Theorem A3.3): Let F be a class of functions² on ℝ^n, 0 < γ ≤ 1, 0 < ε, δ < 1. Let S be a random sequence of m examples drawn independently according to the distribution D. The probability that there exists a function in F that disagrees with at most a fraction (1 - γ)ε of the examples in S and yet has error greater than ε (w.r.t. D) is less than \n\n8 Δ_F(2m) e^{-γ^2 ε m/4}. \n\nFrom Corollary 2 and Theorem 3, we get: \n\nCorollary 4: Given a fixed graph G with E edges and N linear threshold units (i.e. W = E + N weights), fixed 0 < ε ≤ 1/2, and m random training examples, where \n\nm ≥ (32W/ε) ln(32N/ε), \n\nif one can find a choice of weights so that at least a fraction 1 - ε/2 of the m training examples are correctly loaded, then one has confidence at least 1 - 8e^{-1.5W} that the net will correctly classify all but a fraction ε of future examples drawn from the same distribution. For \n\nm ≥ (64W/ε) ln(64N/ε), \n\nthe confidence is at least 1 - 8e^{-εm/32}. \n\nProof: Let γ = 1/2 and apply Theorem 3, using the bound on Δ_F(m) given in Corollary 2. 
This shows that the probability that there exists a choice of the weights that defines a function with error greater than ε that is consistent with at least a fraction 1 - ε/2 of the training examples is at most \n\n8 Δ_F(2m) e^{-εm/16} ≤ 8 (2Nem/W)^W e^{-εm/16}. \n\nWhen m = (32W/ε) ln(32N/ε) this is 8((2eε/(32N)) ln(32N/ε))^W, which is less than 8e^{-1.5W} for N ≥ 2 and ε ≤ 1/2. When m ≥ (64W/ε) ln(64N/ε), (2Nem/W)^W ≤ e^{εm/32}, so 8(2Nem/W)^W e^{-εm/16} ≤ 8e^{-εm/32}. \n\n² We assume some measurability conditions on the class F. See [pol84], [behw87b] for details. \n\nThe constant 32 is undoubtedly an overestimate. No serious attempt has been made to minimize it. Further, we do not know if the log term is unavoidable. Nevertheless, even without these terms, for nets with many weights this may represent a considerable number of examples. Such nets are common in cases where the complexity of the rule being learned is not known in advance, so a large architecture is chosen in order to increase the chances that the rule can be represented. To counteract the concomitant increase in the size of the training sample needed, one method that has been explored is the use of learning algorithms that try to use as little of the architecture as possible to load the examples, e.g. by setting as many weights to zero as possible, and by removing as many nodes as possible (a node can be removed if all its incoming weights are zero) [rum87] [hin87]. The following shows that the VC dimension of such a \"reduced\" architecture is not much larger than what one would get if one knew a priori what nodes and edges could be deleted. 
\n\nCorollary 5: Let F be the class of all functions computed by linear threshold feedforward nets defined on a fixed underlying graph G with N' ≥ 2 computation nodes and E' ≥ N' edges, such that at most E ≥ 2 edges have non-zero weights and at most N ≥ 2 nodes have at least one incoming edge with a non-zero weight. Let W = E + N. Then the conclusion of Corollary 4 holds for sample size \n\nm ≥ (32W/ε) ln(32NE'/ε). \n\nProof sketch: We can bound Δ_F(m) by considering the number of ways the N nodes and E edges that remain can be chosen from among those in the initial network. A crude upper bound is (N')^N (E')^E. Applying Corollary 2 to the remaining network gives Δ_F(m) ≤ (N')^N (E')^E (Nem/W)^W. This is at most (NE'em/W)^W. The rest of the analysis is similar to that in Corollary 4. \n\nThis indicates that minimizing non-zero weights may be a fruitful approach. Similar approaches in other learning contexts are discussed in [hau88] and [lit88]. \n\nCONDITIONS NECESSARY FOR VALID GENERALIZATION \nThe following general theorem gives a lower bound on the number of examples needed for distribution-free learning, regardless of the algorithm used. \n\nTheorem 6 [ehkv87] (see also [behw87b]): Let F be a class of {-1, +1}-valued functions on ℝ^n with VCdim(F) ≥ 2. Let A be any learning algorithm that takes as input a sequence of {-1, +1}-labeled examples over ℝ^n and produces as output a function from ℝ^n into {-1, +1}. Then for any 0 < ε ≤ 1/8, 0 < δ ≤ 1/100 and \n\nm < max[((1 - ε)/ε) ln(1/δ), (VCdim(F) - 1)/(32ε)], \n\nthere exists (1) a function f ∈ F and (2) a distribution D on ℝ^n × {-1, +1} for which Prob((x, a) : a ≠ f(x)) = 0, such that given a random sample of size m chosen according to D, with probability at least δ, A produces a function with error greater than ε. 
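As a rough illustration of how the bound in Theorem 6 behaves, the following Python sketch evaluates max[((1 - ε)/ε) ln(1/δ), (VCdim(F) - 1)/(32ε)] for given parameters. The function name theorem6_lower_bound and its argument names are hypothetical, introduced only for this illustration, and the formula is assumed to read as written in this sentence; the sketch is meant for building intuition and is not part of the original argument.

```python
import math


def theorem6_lower_bound(vc_dim, eps, delta):
    # Hypothetical helper: sample-size threshold from Theorem 6.
    # Below this many examples, some distribution defeats any learner
    # with probability at least delta.
    if vc_dim < 2 or not (0 < eps <= 1 / 8) or not (0 < delta <= 1 / 100):
        raise ValueError('parameters outside the range covered by Theorem 6')
    confidence_term = (1 - eps) / eps * math.log(1 / delta)
    capacity_term = (vc_dim - 1) / (32 * eps)
    return max(confidence_term, capacity_term)


# Example call with VCdim(F) = 3201, eps = 0.1, delta = 0.01
print(theorem6_lower_bound(3201, 0.1, 0.01))
```

With VCdim(F) = 3201, ε = 0.1 and δ = 0.01, the capacity term (3200/3.2 = 1000) dominates the confidence term (9 ln 100, roughly 41), which matches the intuition that the VC dimension drives the bound for large architectures.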
\n\nThis theorem can be used to obtain a lower bound on the number of examples needed to train a net, assuming that the examples are drawn from the worst-case distribution that is consistent with some function realizable on that net. We need only obtain lower bounds on the VC dimension of the associated architecture. In this section we will specialize by considering only fully-connected networks of linear threshold units that have only one hidden layer. Thus each hidden node will have an incoming edge from each input node and an outgoing edge to the output node, and no other edges will be present. In [b88] a slicing construction is given that shows that a one hidden layer net of threshold units with n inputs and 2j hidden units can shatter an arbitrary set of 2jn vectors in general position in ℝ^n. A corollary of this result is: \n\nTheorem 7: The class of one hidden layer linear threshold nets taking input from ℝ^n with k hidden units has VC dimension at least 2⌊k/2⌋n. \n\nNote that for large k and n, 2⌊k/2⌋n is approximately equal to the total number W of weights in the network. \n\nA special case of considerable interest occurs when the domain is restricted to the hypercube: {+1, -1}^n. Lemma 6 of [lit88] shows that the class of Boolean functions on {+1, -1}^n represented by disjunctive normal form expressions with k terms, k ≤ O(2^{n/2}/√n), where each term is the conjunction of n/2 literals, has VC dimension at least kn/4. Since these functions can be represented on a linear threshold net with one hidden layer of k units, this provides a lower bound on the VC dimension of this architecture. We also can use the slicing construction of [b88] to give a lower bound approaching kn/2. The actual result is somewhat stronger in that it shows that for large n a randomly chosen set of approximately kn/2 vectors is shattered with high probability. 
\n\nTheorem 8: With probability approaching 1 exponentially in n, a set S of m ≤ 2^{n/3} vectors chosen randomly and uniformly from {+1, -1}^n can be shattered by the one hidden layer architecture with 2⌈m/⌊n(1 - 10/ln n)⌋⌉ linear threshold units in its hidden layer. \n\nProof sketch: With probability approaching 1 exponentially in n no pair of vectors in S are negations of each other. Assume n ≥ e^{10}. Let r = ⌊n(1 - 10/ln n)⌋. Divide S at random into ⌈m/r⌉ disjoint subsets S_1, ..., S_{⌈m/r⌉} each containing no more than r vectors. We will describe a set T of ±1 vectors as sliceable if the vectors in T are linearly independent and the subspace they span over the reals does not contain any ±1 vector other than the vectors in T and their negations. In [odl88] it is shown, for large n, that any random set of r vectors has probability P = 4 C(r,3) (3/4)^n + O((7/10)^n) of not being sliceable, where C(r,3) is the binomial coefficient. Thus the probability that some S_i is not sliceable is O(mn^2 (3/4)^n), which is exponentially small for m ≤ 2^{n/3}. Hence with probability approaching 1 exponentially in n, each S_i is sliceable, 1 ≤ i ≤ ⌈m/r⌉. Consider any Boolean function f on S and let S_i^+ = {x ∈ S_i : f(x) = +1}, 1 ≤ i ≤ ⌈m/r⌉. If S_i is sliceable and no pair of vectors in S are negations of each other then we may pass a plane through the points in S_i^+ that doesn't contain any other points in S. Shifting this plane parallel to itself slightly we can construct two half spaces whose intersection forms a slice of ℝ^n containing S_i^+ and no other points in S. Using threshold units at the hidden layer recognizing these two half spaces, with weights to the output unit +1 and -1 appropriately, the output unit receives input +2 for any point in the slice and 0 for any point not in the slice. Doing this for each S_i^+ and thresholding at 1 implements the function f. 
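The count of hidden units in Theorem 8 can be made concrete with a short Python sketch. It assumes the expression reads 2⌈m/r⌉ with r = ⌊n(1 - 10/ln n)⌋, as in the proof sketch; the function names subset_size and theorem8_hidden_units are hypothetical and chosen only for this illustration. The sketch computes the size of the architecture, not the slicing construction itself.

```python
import math


def subset_size(n):
    # r = floor(n * (1 - 10 / ln n)), the subset size used in the proof sketch
    if n < math.exp(10):
        raise ValueError('the proof sketch assumes n >= e**10')
    r = math.floor(n * (1 - 10 / math.log(n)))
    if r < 1:
        raise ValueError('n is too close to e**10 to give a non-empty subset')
    return r


def theorem8_hidden_units(n, m):
    # Theorem 8 is stated for m <= 2**(n/3); logarithms are compared
    # so that the huge power is never formed explicitly.
    if m < 1 or math.log2(m) > n / 3:
        raise ValueError('Theorem 8 requires 1 <= m <= 2**(n/3)')
    r = subset_size(n)
    groups = -(-m // r)   # integer ceiling of m / r
    return 2 * groups     # two threshold units (one slice) per subset


# Example call: number of hidden units for n = 30000 inputs and m = 10**6 vectors
print(theorem8_hidden_units(30000, 10**6))
```

Each subset of at most r vectors costs two hidden units, so the unit count grows in steps of two as m crosses multiples of r.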
\n\nWe can now apply Theorem 6 to show that any neural net learning algorithm using too few examples will be fooled by some reasonable distributions. \n\nCorollary 9: For any learning algorithm training a net with k linear threshold functions in its hidden layer, and 0 < ε ≤ 1/8, if the algorithm uses (a) fewer than (2⌊k/2⌋n - 1)/(32ε) examples to learn a function from ℝ^n to {-1, +1}, or (b) fewer than (⌊n⌊k/2⌋ max(1/2, 1 - 10/(ln n))⌋ - 1)/(32ε) examples to learn a function from {-1, +1}^n to {-1, +1}, for k ≤ O(2^{n/3}), then there exist distributions D for which (i) there exists a choice of weights such that the network exactly classifies its inputs according to D, but (ii) the learning algorithm will have probability at least .01 of finding a choice of weights which in fact has error greater than ε. \n\nCONCLUSION \nWe have given theoretical lower and upper bounds on the sample size vs. net size needed such that valid generalization can be expected. The exact constants we have given in these formulae are still quite crude; it may be expected that the actual values are closer to 1. The logarithmic factor in Corollary 4 may also not be needed, at least for the types of distributions and architectures seen in practice. Widrow's experience supports this conjecture [wid87]. However, closing the theoretical gap between the O((W/ε) log(N/ε)) upper bound and the Ω(W/ε) lower bound on the worst case sample size for architectures with one hidden layer of threshold units remains an interesting open problem. Also, apart from our upper bound, the case of multiple hidden layers is largely open. Finally, our bounds are obtained under the assumption that the node functions are linear threshold functions (or at least Boolean valued). We conjecture that similar bounds also hold for classes of real valued functions such as sigmoid functions, and hope shortly to establish this. 
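To give a feel for the size of the gap between the upper and lower bounds, the following Python sketch compares three quantities for a fully-connected net with n inputs and one hidden layer of k threshold units: the sufficient sample size of Corollary 4, the rule of thumb W/ε, and the necessary sample size of Corollary 9(a). All function names are hypothetical. The formulas are assumed to be (32W/ε) ln(32N/ε) for Corollary 4 and (2⌊k/2⌋n - 1)/(32ε) for Corollary 9(a), and W is counted as E + N as in Corollary 2.

```python
import math


def one_hidden_layer_counts(n, k):
    # Fully-connected net: n inputs, k hidden units, one output unit.
    edges = n * k + k      # input-to-hidden plus hidden-to-output edges
    nodes = k + 1          # computation nodes: k hidden units and the output
    return edges, nodes, edges + nodes   # (E, N, W) with W = E + N


def corollary4_sample_size(W, N, eps):
    # Sufficient sample size (upper bound), natural logarithm
    return (32 * W / eps) * math.log(32 * N / eps)


def corollary9a_lower_bound(n, k, eps):
    # Necessary sample size (lower bound) for inputs in R^n
    return (2 * (k // 2) * n - 1) / (32 * eps)


def rule_of_thumb(W, eps):
    # Number of weights times the inverse of the accuracy parameter
    return W / eps


n, k, eps = 10, 4, 0.1
E, N, W = one_hidden_layer_counts(n, k)
print('weights:', W)
print('lower bound:', corollary9a_lower_bound(n, k, eps))
print('rule of thumb:', rule_of_thumb(W, eps))
print('upper bound:', corollary4_sample_size(W, N, eps))
```

For this small example the rule of thumb falls between the two bounds, and most of the distance to the upper bound comes from the constant 32 and the logarithmic factor.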
\n\nAcknowledgements \n\nWe would like to thank Ron Rivest for suggestions on improving the bounds given in Corollaries 4 and 5 in an earlier draft of this paper, and Nick Littlestone for many helpful comments. The research of E. Baum was performed by the Jet Propulsion Laboratory, California Institute of Technology, as part of its Innovative Space Technology Center, which is sponsored by the Strategic Defense Initiative Organization/Innovative Science and Technology through an agreement with the National Aeronautics and Space Administration (NASA). D. Haussler gratefully acknowledges the support of ONR grant N00014-86-K-0454. Part of this work was done while E. Baum was visiting UC Santa Cruz. \n\nReferences \n\n[b88] BAUM, E. B., (1988), On the capabilities of multilayer perceptrons, J. of Complexity, 4, 1988, pp193-215. \n\n[behw87a] BLUMER, A., EHRENFEUCHT, A., HAUSSLER, D., WARMUTH, M., (1987), Occam's Razor, Inf. Proc. Let., 24, 1987, pp377-380. \n\n[behw87b] BLUMER, A., EHRENFEUCHT, A., HAUSSLER, D., WARMUTH, M., (1987), Learnability and the Vapnik-Chervonenkis dimension, UC Santa Cruz Tech. Rep. UCSC-CRL-87-20 (revised Oct., 1988) and J. ACM, to appear. \n\n[cov65] COVER, T., (1965), Geometrical and statistical properties of systems of linear inequalities with applications to pattern recognition, IEEE Trans. Elect. Comp., V14, pp326-334. \n\n[dev88] DEVROYE, L., (1988), Automatic pattern recognition, a study of the probability of error, IEEE Trans. PAMI, V10, N4, pp530-543. \n\n[dswshhj87] DENKER J., SCHWARTZ D., WITTNER B., SOLLA S., HOPFIELD J., HOWARD R., JACKEL L., (1987), Automatic learning, rule extraction, and generalization, Complex Systems 1 pp877-922. \n\n[dh73] DUDA, R., HART, P., (1973), Pattern classification and scene analysis, Wiley, New York. 
\n\n[ehkv87] EHRENFEUCHT, A., HAUSSLER, D., KEARNS, M., VALIANT, L., (1987), A general lower bound on the number of examples needed for learning, UC Santa Cruz Tech. Rep. UCSC-CRL-87-26 and Information and Computation, to appear. \n\n[hau88] HAUSSLER, D., (1988), Quantifying inductive bias: AI learning algorithms and Valiant's learning framework, Artificial Intelligence, 36, 1988, pp177-221. \n\n[hin87] HINTON, G., (1987), Connectionist learning procedures, Artificial Intelligence, to appear. \n\n[lit88] LITTLESTONE, N., (1988), Learning quickly when irrelevant attributes abound: a new linear threshold algorithm, Machine Learning, V2, pp285-318. \n\n[odl88] ODLYZKO, A., (1988), On subspaces spanned by random selections of ±1 vectors, J. Comb. Th. A, V47, N1, pp124-133. \n\n[pol84] POLLARD, D., (1984), Convergence of stochastic processes, Springer-Verlag, New York. \n\n[qs88] QUIAN, N., SEJNOWSKI, T. J., (1988), Predicting the secondary structure of globular protein using neural nets, Bull. Math. Biophys. 5, 115-137. \n\n[rum87] RUMELHART, D., (1987), personal communication. \n\n[sau72] SAUER, N., (1972), On the density of families of sets, J. Comb. Th. A, V13, 145-147. \n\n[sr87] SEJNOWSKI, T. J., ROSENBERG, C. R., (1987), NET Talk: a parallel network that learns to read aloud, Complex Systems, V1, pp145-168. \n\n[ts88] TESAURO, G., SEJNOWSKI, T. J., (1988), A 'neural' network that learns to play backgammon, in Neural Information Processing Systems, ed. D. Z. Anderson, AIP, NY, pp794-803. \n\n[val84] VALIANT, L. G., (1984), A theory of the learnable, Comm. ACM, V27, N11, pp1134-1142. \n\n[vc71] VAPNIK, V. N., CHERVONENKIS, A. Ya., (1971), On the uniform convergence of relative frequencies of events to their probabilities, Th. Prob. and its Applications, V17, N2, pp264-280. \n\n[vap82] VAPNIK, V. N., (1982), Estimation of Dependences Based on Empirical Data, Springer Verlag, NY. \n\n[wd81] WENOCUR, R. 
S., DUDLEY, R. M., (1981), Some special Vapnik-Chervonenkis classes, Discrete Math., V33, pp313-318. \n\n[wid87] WIDROW, B., (1987), ADALINE and MADALINE - 1963, Plenary Speech, Vol I, Proc. IEEE 1st Int. Conf. on Neural Networks, San Diego, CA, pp143-158. \n\n", "award": [], "sourceid": 154, "authors": [{"given_name": "Eric", "family_name": "Baum", "institution": null}, {"given_name": "David", "family_name": "Haussler", "institution": null}]}