{"title": "Threshold Network Learning in the Presence of Equivalences", "book": "Advances in Neural Information Processing Systems", "page_first": 879, "page_last": 886, "abstract": null, "full_text": "Threshold Network Learning in the Presence of \n\nEquivalences \n\nJohn Shawe-Taylor \n\nDepartment of Computer Science \n\nRoyal Holloway and Bedford New College \n\nUniversity of London \n\nEgham, Surrey TW20 OEX, UK \n\nAbstract \n\nThis paper applies the theory of Probably Approximately Correct (PAC) \nlearning to multiple output feedforward threshold networks in which the \nweights conform to certain equivalences. It is shown that the sample size \nfor reliable learning can be bounded above by a formula similar to that \nrequired for single output networks with no equivalences. The best previ(cid:173)\nously obtained bounds are improved for all cases. \n\n1 \n\nINTRODUCTION \n\nThis paper develops the results of Baum and Haussler [3] bounding the sample sizes \nrequired for reliable generalisation of a single output feedforward threshold network. \nThey prove their result using the theory of Probably Approximately Correct (PAC) \nlearning introduced by Valiant [11]. They show that for 0 < \u00ab: :S 1/2, if a sample of \nsIze \n\n64N \nm 2:: rna = - - log - -\n\n64W \n\n\u00ab: \n\n\u00ab: \n\nis loaded into a feedforward network of linear threshold units with N nodes and W \nweights, so that a fraction 1- \u00ab:/2 of the examples are correctly classified, then with \nconfidence approaching certainty the network will correctly classify a fraction 1 - \u00ab: \nof future examples drawn according to the same distribution. A similar bound was \nobtained for the case when the network correctly classified the whole sample. The \nresults below will imply a significant improvement to both of these bounds. 
\n\nIn many cases training can be simplified if known properties of a problem can be incorporated into the structure of a network before training begins. One such technique is described by Shawe-Taylor [9], though many similar techniques have been applied, as for example in TDNNs [6]. The effect of these restrictions is to constrain groups of weights to take the same value, and learning algorithms are adapted to respect this constraint. \n\nIn this paper we consider the effect of this restriction on the generalisation performance of the networks, and in particular on the sample sizes required to obtain a given level of generalisation. This extends the work described above by Baum and Haussler [3] by improving their bounds, and also improves the results of Shawe-Taylor and Anthony [10], who consider generalisation of multiple-output threshold networks. The remarkable fact is that in all cases the formula obtained is the same, where we now understand the number of weights W to be the number of weight classes, while N is still the number of computational nodes. \n\n2 DEFINITIONS AND MAIN RESULTS \n\n2.1 SYMMETRY AND EQUIVALENCE NETWORKS \n\nWe begin with a definition of threshold networks. To simplify the exposition it is convenient to incorporate the threshold value into the set of weights. This is done by creating a distinguished input that always has value 1 and is called the threshold input. The following is a formal notation for these systems. \n\nA network N = (C, I, O, n0, E) is specified by a set C of computational nodes, a set I of input nodes, a subset O ⊆ C of output nodes and a node n0 ∈ I, called the threshold node. The connectivity is given by a set E ⊆ (C ∪ I) × C of connections, with {n0} × C ⊆ E. \n\nWith network N we associate a weight function w from the set of connections to the real numbers. We say that the network N is in state w.
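Before continuing, it may help to see these objects concretely. The sketch below (all class and variable names are our own invention, not the paper's notation) represents a small layered threshold network whose state assigns one real value per weight equivalence class, so that weights in a class always remain equal; the constant threshold input is folded in as input 0 of every node:

```python
class ThresholdNetwork:
    # A layered linear threshold network with shared ('equivalence class')
    # weights: weight_class maps a connection (layer, node, input) to a class
    # label, and the state w gives one real value per class.
    def __init__(self, layer_sizes, weight_class, w):
        self.layer_sizes = layer_sizes
        self.weight_class = weight_class
        self.w = w

    def forward(self, x):
        values = list(x)
        for layer in range(1, len(self.layer_sizes)):
            inputs = [1.0] + values          # input 0 is the threshold input
            values = [
                1 if sum(self.w[self.weight_class[(layer, j, i)]] * v
                         for i, v in enumerate(inputs)) > 0 else 0
                for j in range(self.layer_sizes[layer])
            ]
        return values

# Two hidden nodes placed in one node class; their weight multisets agree
# class by class, so the computed function is symmetric in the two inputs.
wc = {(1, 0, 0): 't', (1, 0, 1): 'a', (1, 0, 2): 'b',
      (1, 1, 0): 't', (1, 1, 1): 'b', (1, 1, 2): 'a',
      (2, 0, 0): 'u', (2, 0, 1): 'c', (2, 0, 2): 'c'}
net = ThresholdNetwork([2, 2, 1], wc, {'t': -1.5, 'a': 1.0, 'b': 1.0,
                                       'u': -0.5, 'c': 1.0})
print(net.forward([1, 1]))  # computes AND of the two inputs here: [1]
```

In this toy state there are nine connections but only five weight classes; the results below say it is the class count, together with the node count, that governs the sample size.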
For an input vector i with values in some subset of the set ℝ of real numbers, the network computes a function F_N(w, i). \n\nAn automorphism γ of a network N = (C, I, O, n0, E) is a bijection of the nodes of N which fixes I setwise and {n0} ∪ O pointwise, such that the induced action fixes E setwise. We say that an automorphism γ preserves the weight assignment w if w_ji = w_(γj)(γi) for all i ∈ I ∪ C, j ∈ C. Let γ be an automorphism of a network N = (C, I, O, n0, E) and let i be an input to N. We denote by i^γ the input whose value on input k is that of i on input γ⁻¹k. \n\nThe following theorem is a natural generalisation of part of the Group Invariance Theorem of Minsky and Papert [8] to multi-layer perceptrons. \n\nTheorem 2.1 [9] Let γ be a weight preserving automorphism of the network N = (C, I, O, n0, E) in state w. Then for every input vector i, \n\nF_N(w, i) = F_N(w, i^γ). \n\nFollowing this theorem it is natural to consider the concept of a symmetry network [9]. This is a pair (N, Γ), where N is a network and Γ a group of weight preserving automorphisms of N. We will also refer to the automorphisms as symmetries. For a symmetry network (N, Γ), we term the orbits of the connections E under the action of Γ the weight classes. \n\nFinally we introduce the concept of an equivalence network. This definition abstracts from the symmetry networks precisely those properties we require to obtain our results. The class of equivalence networks is, however, far larger than that of symmetry networks and includes many classes of networks studied by other researchers [6, 7]. \n\nDefinition 2.2 An equivalence network is a threshold network in which an equivalence relation is defined on both weights and nodes.
The two relations are required to be compatible, in that weights in the same class are connected to nodes in the same class, while nodes in the same class have the same set of input weight connection types. The weights in an equivalence class are at all times required to remain equal. \n\nNote that every threshold network can be viewed as an equivalence network by taking the trivial equivalence relations. We now show that symmetry networks are indeed equivalence networks with the same weight classes, and give a further technical lemma. The proofs of both lemmas are omitted. \n\nLemma 2.3 A symmetry network (N, Γ) is an equivalence network, where the equivalence classes are the orbits of connections and nodes respectively. \n\nLemma 2.4 Let N be an equivalence network and C be the set of classes of nodes. Then there is an indexing of the classes, C_i, i = 1, ..., n, such that nodes in C_i do not have connections from nodes in C_j for j ≥ i. \n\n2.2 MAIN RESULTS \n\nWe are now in a position to state our main results. Note that throughout this paper log means natural logarithm, while an explicit subscript is used for other bases. \n\nTheorem 2.5 Let N be an equivalence network with W weight classes and N computational nodes. If the network correctly computes a function on a set of m inputs drawn independently according to a fixed probability distribution, where \n\nm ≥ m0(ε, δ) = (1/(ε(1 − √ε))) [log(1.3/δ) + 2W log(6√N/ε)], \n\nthen with probability at least 1 − δ the error rate of the network will be less than ε on inputs drawn according to the same distribution. \n\nTheorem 2.6 Let N be an equivalence network with W weight classes and N computational nodes.
If the network correctly computes a function on a fraction 1 − (1 − γ)ε of m inputs drawn independently according to a fixed probability distribution, where \n\nm ≥ m0(ε, δ, γ) = (1/(γ²ε(1 − √(ε/N)))) [4 log(4/δ) + 6W log(4N/(γ^(2/3) ε))], \n\nthen with probability at least 1 − δ the error rate of the network will be less than ε on inputs drawn according to the same distribution. \n\n3 THEORETICAL BACKGROUND \n\n3.1 DEFINITIONS AND PREVIOUS RESULTS \n\nIn order to present results for binary outputs ({0, 1} functions) and larger ranges in a unified way, we will consider throughout the task of learning the graph of a function. All the definitions reduce to the standard ones when the outputs are binary. \n\nWe consider learning from examples as selecting a suitable function from a set H of hypotheses, being functions from a space X to a set Y of at most countable size. At all times we consider an (unknown) target function \n\nc: X → Y \n\nwhich we are attempting to learn. To this end the space X is required to be a probability space (X, Σ, μ), with appropriate regularity conditions so that the sets considered are measurable [4]. In particular the hypotheses should be measurable when Y is given the discrete topology, as should the error sets defined below. The space S = X × Y is equipped with a σ-algebra Σ × 2^Y and measure ν = ν(μ, c), defined by its value on sets of the form U × {y}: \n\nν(U × {y}) = μ(U ∩ c⁻¹(y)). \n\nUsing this measure the error of a hypothesis is defined to be \n\ner_ν(h) = ν{(x, y) ∈ S | h(x) ≠ y}. \n\nThe introduction of ν allows us to consider samples being drawn from S, as they will automatically reflect the output value of the target. This approach freely generalises to stochastic concepts, though we will restrict ourselves to target functions for the purposes of this paper.
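The measure ν simply couples each input drawn from μ with its target output, so error rates can be estimated by sampling from S = X × Y. The sketch below (the particular target, hypothesis and distribution are invented for illustration) draws such a sample and computes the empirical disagreement rate, a Monte Carlo estimate of er_ν(h):

```python
import random

def draw_sample(m, mu_sampler, c):
    # Draw m points of S = X x Y under nu = nu(mu, c): each input x ~ mu
    # is paired with its target output c(x).
    return [(x, c(x)) for x in (mu_sampler() for _ in range(m))]

def disagreement_rate(h, sample):
    # Fraction of sample points on which h disagrees with the recorded
    # output (the discrete metric: any mismatch counts as an error).
    return sum(1 for x, y in sample if h(x) != y) / len(sample)

random.seed(0)
c = lambda x: int(x > 0.5)       # illustrative target on X = [0, 1]
h = lambda x: int(x > 0.6)       # hypothesis disagreeing with c on (0.5, 0.6]

sample = draw_sample(10000, random.random, c)
print(disagreement_rate(h, sample))  # near the true error er_nu(h) = 0.1
```

The point of the sample size bounds is precisely to control how far such an empirical rate can sit from er_ν(h) uniformly over the hypothesis space.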
The error of a hypothesis h on a sample x = ((x1, y1), ..., (xm, ym)) ∈ S^m is defined to be \n\ner_x(h) = (1/m) |{i | h(xi) ≠ yi}|. \n\nWe also define the VC dimension of a set of hypotheses by reference to the product space S. Consider a sample x = ((x1, y1), ..., (xm, ym)) ∈ S^m and the function \n\nx*: H → {0, 1}^m, \n\ngiven by x*(h)_i = 1 if and only if h(xi) = yi, for i = 1, ..., m. We can now define the growth function B_H(m) as \n\nB_H(m) = max over x ∈ S^m of |{x*(h) | h ∈ H}| ≤ 2^m. \n\nThe Vapnik-Chervonenkis dimension of a hypothesis space H is defined to be infinite if B_H(m) = 2^m for all m, and otherwise to be the largest m for which B_H(m) = 2^m. \n\nIn the case of a threshold network N, the set of functions obtainable using all possible weight assignments is termed the hypothesis space of N, and we will also refer to it as N. For a threshold network N we also introduce the state growth function S_N(m). This is defined by first considering all computational nodes to be output nodes, and then counting distinct output sequences: \n\nS_N(m) = max over x = (i1, ..., im) ∈ X^m of |{(F_N'(w, i1), F_N'(w, i2), ..., F_N'(w, im)) | w: E → ℝ}|, \n\nwhere X = [0, 1]^I and N' is obtained from N by setting O = C. We clearly have that for all N and m, B_N(m) ≤ S_N(m). \n\nTheorem 3.1 [2] bounds, for any ε > 0 and any k > m, the probability that some hypothesis consistent with a randomly chosen sample of size m has error greater than ε, in terms of the growth function B_H(k). \n\nTheorem 3.2 If a hypothesis space H has finite VC dimension d > 1, then there is m0 = m0(ε, δ) such that if m > m0 then the probability that a hypothesis consistent with a randomly chosen sample of size m has error greater than ε is less than δ. A suitable value of m0 is \n\nm0 = (1/(ε(1 − √ε))) [log(d/(δ(d − 1))) + 2d log(6/ε)]. □ \n\nFor the case when we allow our hypothesis to incorrectly compute the function on a small fraction of the training sample, we have the following result.
Note that we are still considering the discrete metric, and so in the case where we are considering multiple output feedforward networks, a single output in error counts as an overall error. \n\nTheorem 3.3 [10] Let 0 < ε < 1 and 0 < γ ≤ 1. Suppose H is a hypothesis space of functions from an input space X to a possibly countable set Y, and let ν be any probability measure on S = X × Y. Then the probability (with respect to ν^m) that, for x ∈ S^m, there is some h ∈ H such that \n\ner_ν(h) > ε and er_x(h) ≤ (1 − γ) er_ν(h) \n\nis at most \n\n4 B_H(2m) exp(−γ²εm/4). \n\nFurthermore, if H has finite VC dimension d, this quantity is less than δ for \n\nm > m0(ε, δ, γ) = (1/(γ²ε(1 − √ε))) [4 log(4/δ) + 6d log(4/(γ^(2/3) ε))]. □ \n\n4 THE GROWTH FUNCTION FOR EQUIVALENCE NETWORKS \n\nWe will bound the number of output sequences B_N(m) for a number m of inputs by the number of distinct state sequences S_N(m) that can be generated from the m inputs by different weight assignments. This follows the approach taken in [10]. \n\nTheorem 4.1 Let N be an equivalence network with W weight equivalence classes and a total of N computational nodes. Then we can bound S_N(m) by \n\nS_N(m) ≤ (emN/W)^W. \n\nIdea of Proof: Let C_i, i = 1, ..., n, be the equivalence classes of nodes, indexed as guaranteed by Lemma 2.4, with |C_i| = c_i and the number of inputs for nodes in C_i being n_i (including the threshold input). Denote by N_j the network obtained by taking only the first j node equivalence classes. We omit a proof by induction that \n\nS_N_j(m) ≤ Π_{i=1..j} B_i(m c_i), \n\nwhere B_i is the growth function for nodes in the class C_i. \n\nUsing the well known bound on the growth function of a threshold node with n_i inputs we obtain \n\nS_N(m) ≤ Π_{i=1..n} (e m c_i / n_i)^{n_i}. \n\nConsider the function f(α) = α log α.
This is a convex function, and so for a set of values α1, ..., αM, the average of the f(αi) is greater than or equal to f applied to the average of the αi. Consider taking the α's to be c_i copies of n_i/c_i for each i = 1, ..., n. We obtain \n\n(1/N) Σ_{i=1..n} n_i log(n_i/c_i) ≥ (W/N) log(W/N), \n\nor \n\nΣ_{i=1..n} n_i log(c_i/n_i) ≤ W log(N/W), \n\nand so \n\nS_N(m) ≤ (emN/W)^W, \n\nas required. □ \n\nThe bounds we have obtained make it possible to bound the Vapnik-Chervonenkis dimension of equivalence networks. Though we will not need these results, we give them here for completeness. \n\nProposition 4.2 The Vapnik-Chervonenkis dimension of an equivalence network with W weight classes and N computational nodes is bounded by 2W log₂(eN). \n\n5 PROOF OF MAIN RESULTS \n\nUsing the results of the last section we are now in a position to prove Theorems 2.5 and 2.6. \n\nProof of Theorem 2.5: (Outline) We use Theorem 3.1, which bounds the probability that a hypothesis with error greater than ε can match an m-sample. Substituting our bound on the growth function of an equivalence network, and choosing k and r as in [1], we obtain the following bound on the probability: \n\n(d/(d − 1)) (4eεm²/W²)^W N^W exp(−εm). \n\nBy choosing m > m0, where m0 is given by \n\nm0 = m0(ε, δ) = (1/(ε(1 − √ε))) [log(1.3/δ) + 2W log(6√N/ε)], \n\nwe guarantee that the above probability is less than δ, as required. □ \n\nOur second main result can be obtained more directly. \n\nProof of Theorem 2.6: (Outline) We use Theorem 3.3, which bounds the probability that a hypothesis with error greater than ε can match all but a fraction (1 − γ) of an m-sample. The bound on the sample size is obtained from the probability bound by using the inequality for B_H(2m).
By adjusting the parameters we convert the probability expression to that obtained by substituting our growth function; we can then read off a sample size by the corresponding substitution in the sample size formula. Consider setting d = W, ε = ε'/N and m = N m'. With these substitutions the sample size formula is \n\nm' = (1/(γ²ε'(1 − √(ε'/N)))) [4 log(4/δ) + 6W log(4N/(γ^(2/3) ε'))], \n\nas required. □ \n\n6 CONCLUSION \n\nThe problem of training feedforward neural networks remains a major hurdle to the application of this approach to large scale systems. A very promising technique for simplifying the training problem is to include equivalences in the network structure which can be justified by a priori knowledge of the application domain. This paper has extended previous results concerning sample sizes for feedforward networks to cover so-called equivalence networks, in which weights are constrained in this way. At the same time we have improved the sample size bounds previously obtained for standard threshold networks [3] and multiple output networks [10]. \n\nThe results are of the same order as previous results and imply similar bounds on the Vapnik-Chervonenkis dimension, namely 2W log₂(eN). They perhaps give circumstantial evidence for the conjecture that the log₂(eN) factor in this expression is real, in that the same expression obtains even if the number of computational nodes is increased by expanding the equivalence classes of weights. Equivalence networks may be a useful area in which to search for high growth functions, and perhaps to show that for certain classes the VC dimension is Ω(W log N). \n\nReferences \n\n[1] Martin Anthony, Norman Biggs and John Shawe-Taylor, Learnability and Formal Concept Analysis, RHBNC Department of Computer Science, Technical Report CSD-TR-624, 1990.
\n\n[2] Martin Anthony, Norman Biggs and John Shawe-Taylor, The learnability of formal concepts, Proc. COLT '90, Rochester, NY (eds Mark Fulk and John Case) (1990) 246-257. \n\n[3] Eric Baum and David Haussler, What size net gives valid generalization?, Neural Computation, 1 (1) (1989) 151-160. \n\n[4] Anselm Blumer, Andrzej Ehrenfeucht, David Haussler and Manfred K. Warmuth, Learnability and the Vapnik-Chervonenkis dimension, JACM, 36 (4) (1989) 929-965. \n\n[5] David Haussler, preliminary extended abstract, COLT '89. \n\n[6] K. Lang and G. E. Hinton, The development of TDNN architecture for speech recognition, Technical Report CMU-CS-88-152, Carnegie-Mellon University, 1988. \n\n[7] Y. le Cun, A theoretical framework for back propagation, in D. Touretzky, editor, Connectionist Models: A Summer School, Morgan Kaufmann, 1988. \n\n[8] M. Minsky and S. Papert, Perceptrons, expanded edition, MIT Press, Cambridge, USA, 1988. \n\n[9] John Shawe-Taylor, Building Symmetries into Feedforward Network Architectures, Proceedings of the First IEE Conference on Artificial Neural Networks, London, 1989, 158-162. \n\n[10] John Shawe-Taylor and Martin Anthony, Sample Sizes for Multiple Output Feedforward Networks, Network, 2 (1991) 107-117. \n\n[11] Leslie G. Valiant, A theory of the learnable, Communications of the ACM, 27 (1984) 1134-1142.", "award": [], "sourceid": 510, "authors": [{"given_name": "John", "family_name": "Shawe-Taylor", "institution": null}]}