{"title": "On-Line Learning with Restricted Training Sets: Exact Solution as Benchmark for General Theories", "book": "Advances in Neural Information Processing Systems", "page_first": 316, "page_last": 322, "abstract": null, "full_text": "Tight Bounds for the VC-Dimension of \n\nPiecewise Polynomial Networks \n\nAkito Sakurai \n\nSchool of Knowledge Science \n\nJapan Advanced Institute of Science and Technology \n\nNomi-gun, Ishikawa 923-1211, Japan. \n\nCREST, Japan Science and Technology Corporation. \n\nASakurai@jaist.ac.jp \n\nAbstract \n\nO(ws(s log d+log(dqh/ s))) and O(ws((h/ s) log q) +log(dqh/ s)) are \nupper bounds for the VC-dimension of a set of neural networks of \nunits with piecewise polynomial activation functions, where s is \nthe depth of the network, h is the number of hidden units, w is \nthe number of adjustable parameters, q is the maximum of the \nnumber of polynomial segments of the activation function, and d is \nthe maximum degree of the polynomials; also n(wslog(dqh/s)) is \na lower bound for the VC-dimension of such a network set, which \nare tight for the cases s = 8(h) and s is constant. For the special \ncase q = 1, the VC-dimension is 8(ws log d). \n\n1 \n\nIntroduction \n\nIn spite of its importance, we had been unable to obtain VC-dimension values for \npractical types of networks, until fairly tight upper and lower bounds were obtained \n([6], [8], [9], and [10]) for linear threshold element networks in which all elements \nperform a threshold function on weighted sum of inputs. Roughly, the lower bound \nfor the networks is (1/2)w log h and the upper bound is w log h where h is the number \nof hidden elements and w is the number of connecting weights (for one-hidden-Iayer \ncase w ~ nh where n is the input dimension of the network). 
\n\nIn many applications, though, sigmoidal functions, specifically a typical sigmoid \nfunction 1/ (1 + exp( -x)), or piecewise linear functions for economy of calculation, \nare used instead of the threshold function. This is mainly because the differen(cid:173)\ntiability of the functions is needed to perform backpropagation or other learning \nalgorithms. Unfortunately explicit bounds obtained so far for the VC-dimension of \nsigmoidal networks exhibit large gaps (O(w2h2) ([3]), n(w log h) for bounded depth \n\n\f324 \n\nA. Sakurai \n\nand f!(wh) for unbounded depth) and are hard to improve. For the piecewise linear \ncase, Maass obtained a result that the VO-dimension is O(w210g q), where q is the \nnumber of linear pieces of the function ([5]). \nRecently Koiran and Sontag ([4]) proved a lower bound f!(w 2 ) for the piecewise \npolynomial case and they claimed that an open problem that Maass posed if there \nis a matching w 2 lower bound for the type of networks is solved. But we still have \nsomething to do, since they showed it only for the case w = 8(h) and the number \nof hidden layers being unboundedj also O(w 2 ) bound has room to improve. \n\nWe in this paper improve the bounds obtained by Maass, Koiran and Sontag and \nconsequently show the role of polynomials, which can not be played by linear func(cid:173)\ntions, and the role of the constant functions that could appear for piecewise poly(cid:173)\nnomial case, which cannot be played by polynomial functions. \n\nAfter submission of the draft, we found that Bartlett, Maiorov, and Meir had ob(cid:173)\ntained similar results prior to ours (also in this proceedings). Our advantage is that \nwe clarified the role played by the degree and number of segments concerning the \nboth bounds. \n\n2 Terminology and Notation \n\nlog stands for the logarithm base 2 throughout the paper. 
\n\nThe depth of a network is the length of the longest path from its external inputs to \nits external output, where the length is the number of units on the path. Likewise \nwe can assign a depth to each unit in a network as the length of the longest path \nfrom the external input to the output of the unit. A hidden layer is a set of units at \nthe same depth other than the depth of the network. Therefore a depth L network \nhas L - 1 hidden layers. \nIn many cases W will stand for a vector composed of all the connection weights in \nthe network (including threshold values for the threshold units) and w is the length \nof w. The number of units in the network, excluding \"input units,\" will be denoted \nby hj in other words, the number of hidden units plus one, or sometimes just the \nnumber of hidden units. A function whose range is {O, 1} \n(a set of 0 and 1) is \ncalled a Boolean-valued function. \n\n3 Upper Bounds \n\nTo obtain upper bounds for the VO-dimension we use a region counting argu.ment, \ndeveloped by Goldberg and Jerrum [2]. The VO-dimension of the network, that is, \nthe VO-dimension of the function set {fG(wj . ) I W E'RW} is upper bounded by \n\nmax {N 12N ~ Xl~.~N Nee ('Rw - UJ:1.N'(fG(:Wj x\u00a3))) } \n\n(3.1) \n\nwhere NeeO is the number of connected components and .N'(f) IS the set \n{w I f(w) = O}. \nThe following two theorems are convenient. Refer [11] and [7] for the first theorem. \nThe lemma followed is easily proven. \nTheorem 3.1. Let fG(wj Xi) (1 ~ i ~ N) be real polynomials in w, each of degree \nd or less. The number of connected components of the set n~l {w I fG(wj xd = O} \nis bounded from above by 2(2d)W where w is the length of w. \n\n\fTight Bounds for the VC-Dimension of Piecewise Polynomial Networks \n\n325 \n\nLemma 3.2. Ifm ~ w(1ogC + loglogC + 1), then 2m > (mC/w)W for C ~ 4. \nFirst let us consider the polynomial activation function case. \n\nTheorem 3.3. 
Suppose that the activation function are polynomials of degree at \nmost d. O( ws log d) is an upper bound of the VC-dimension for the networks with \ndepth s. When s = 8(h) the bound is O(whlogd). More precisely ws(1ogd + \nlog log d + 2) is an upper bound. Note that if we allow a polynomial as the input \nfunction, d1d2 will replace d above where d1 is the maximum degree of the input \nfunctions and d2 is that of the activation functions. \n\nThe theorem is clear from the facts that the network function (fa in (3.1)) is a \npolynomial of degree at most d S + ds- 1 + ... + d, Theorem 3.1 and Lemma 3.2. \nFor the piecewise linear case, we have two types of bounds. The first one is suitable \nthe depth s = o( h)) and the second one for the \nfor bounded depth cases (i. e. \nunbounded depth case (i.e . s = 8(h)). \n\nTheorem 3.4. Suppose that the activation functions are piecewise polynomials with \nat most q segments of polynomials degree at most d. O(ws(slogd + log(dqh/s))) \nand O(ws((h/s)logq) +log(dqh/s)) are upper bounds for the VC-dimension, where \ns is the depth of the network. More precisely, ws((s/2)logd + log(qh)) and \nws( (h/ s) log q + log d) are asymptotic upper bounds. Note that if we allow a polyno(cid:173)\nmial as the input function then d1 d2 will replace d above where d1 is the maximum \ndegree of the input functions and d2 is that of the activation functions. \n\nProof. We have two different ways to calculate the bounds. First \n\nS \n\ni=1 \n< s \n-p \n\nJ=1 \n\n(8eNQhs(di-1 + .. . + d + l)d) 'l\u00bbl+'''+W; \n\nWl+\"'+W' \nJ \n\n::; (8eNqd(s:)/2(h/S)) ws \n\nwhere hi is the number of hidden units in the i-th layer and 0 is an operator to \nform a new vector by concatenating the two. From this we get an asymptotic upper \nbound ws((s/2) log d + log(qh)) for the VC-dimension. \nSecondly \n\nFrom this we get an asymptotic upper bound ws((h/s)logq + log d) for the VC(cid:173)\ndimension. Combining these two bounds we get the result. 
Note that sin log( dqh/ s) \nin it is introduced to eliminate unduly large term emerging when s = 8(h) . \n0 \n\n4 Lower Bounds for Polynomial Networks \nTheorem 4.1 Let us consider the case that the activation function are polynomials \nof degree at most d . n( ws log d) is a lower bound of the VC-dimension for the \nnetworks with depth s. When s = 8(h) the bound is n(whlogd), More precisely, \n\n\f326 \n\nA. Sakurai \n\n(1/16)w( 5 - 6) log d is an asymptotic lower bound where d is the degree of activation \nfunctions and is a power of two and h is restricted to O(n2) for input dimension n. \n\nThe proof consists of several lemmas. The network we are constructing will have \ntwo parts: an encoder and a decoder. We deliberately fix the N input points. The \ndecoder part has fixed underlying architecture but also fixed connecting weights \nwhereas the encoder part has variable weights so that for any given binary outputs \nfor the input points the decoder could output the specified value from the codes in \nwhich the output value is encoded by the encoder. \n\nFirst we consider the decoder, which has two real inputs and one real output. One \nof the two inputs y holds a code of a binary sequence bl , b2, ... ,bm and the other x \nholds a code of a binary sequence Cl, C2, ... ,Cm . The elements of the latter sequence \nare all O's except for Cj = 1, where Cj = 1 orders the decoder to output bj from it \nand consequently from the network. \n\nWe show two types of networks; one of which has activation functions of degree at \nmost two and has the VC-dimension w(s-l) and the other has activation functions \nof degree d a power of two and has the VC-dimension w( s - 5) log d. \n\nWe use for convenience two functions 'H9(X) = 1 if x 2:: 0 and \u00b0 otherwise and \n'H9,t/J (x) = 1 if x 2:: cp, \u00b0 if x ::; 0, and undefined otherwise. Throughout this section \n\nwe will use a simple logistic function p(x) = (16/3)x(1- x) which has the following \nproperty. \nLemma 4.2. 
For any binary sequence bl , b2, . .. , bm , there exists an interval [Xl, X2] \n\nsuch that bi = 'Hl /4,3/4(pi(x)) and \u00b0 :S /(x) ::; 1 for any x E [Xl, X2]' \n\nThe next lemmas are easily proven. \n\nLemma 4.3. For any binary sequence Cl, C2,\"\" Cm which are all O's except for \nCj = 1, there exists Xo such that Ci = 'Hl /4,3/4(pi(xo)). Specifically we will take Xo = \np~(j-l)(1/4), where PLl(x) is the inverse of p(x) on [0,1/2]. Then pi-l(xo) = 1/4, \n\npi(xo) = 1, pi(xo) = \u00b0 for all i > j, and pj-i(xo) ::; (1/4)i for all positive i ::; j. \n\nProof. Clear from the fact that p(x) 2:: 4x on [0,1/4]. \nLemma 4.4. For any binary sequence bl , b2, ... , bm , \n\ntake y such that bi \n\no \n\n'H 1 / 4,3/4(pi(y)) and \u00b0 ::; pi(y) \n'H7/ 12 ,3/4 (l::l pi(xo)pi(y)} = bi' i.e. 'Ho (l::l pi(xo)pi(y) - 2/3} = bi' \nProof. If bj = 0, l::l pi(xo)pi(y) = l:1=1 pi(xo)pi(y) :S pi(y) + l:1:::(1/4)i < \npi(y) + (1/3)::; 7/12. If bj = 1, l::l pi(xo)pi(y) > pi(xo)pi(y) 2:: 3/4. \n0 \n\n::; 1 for all i and Xo = p~(j-l)(1/4), then \n\nBy the above lemmas, the network in Figure 1 (left) has the following function: \n\nSuppose that a binary sequence bl , ... ,bm and an integer j is given. Then we \ncan present y that depends only on bl , \u2022\u2022 \u2022 ,bm and Xo that depends only on j \nsuch that bi is output from the decoder. \n\nNote that we use (x + y)2 - (x - y)2 = 4xy to realize a multiplication unit. \nFor the case of degree of higher than two we have to construct a bit more complicated \none by using another simple logistic function fL(X) = (36/5)x(1- x). We need the \nnext lemma. \nLemma 4.5. Take Xo = fL~(j-l)(1/6), where fLLl(X) is the inverse of fL(X) on \n\n[0,1/2]. Then fLi-1(xo) = 1/6, fLj(XO) = 1, fLi(xo) = \u00b0 for all i > j, and fLi-i(xo) = \n\n\fTight Bounds for the VC-Dimension of Piecewise Polynomial Networks \n\n327 \n\nL--_L...-_---L_...L..-_ X. \n\n~A\u00b7l \n\n'........ \n\ni~-!~ \n\nf\u00b7i~~] \n,----_ .. __ ...... __ .. 
\n\nx, \n\ny \n\nFigure 1: Network architecture consisting of polynomials of order two (left) and \nthose of order of power of two (right). \n\n(1/6)i for all i > 0 and $ j. \n\nProof. Clear from the fact that J-L(x) ~ 6x on [0,1/6]. \n0 \nLemma 4.6. For any binary sequence bl. b2, . .. , bk, bk+b bk+2, . .. ,b2k , \n... , b(m-1)H1,'''' bmk take y such that bi = 1-l1/4,3/4(pi(y)) and 0 $ pi(y) $ 1 \nj $ m and any 1 $ 1 $ k take Xl = \nfor all i. Moreover for any 1 $ \nJ-LL(j-1)(1/6), and Xo = J-LL(I-1)(1/6k). \nThen for Z = E:1 pik(Y)J-Lik(xt), \n1-lo (E~==-Ol pi(z)J-Li(xo) - (1/2)) = bki+l holds. \nLemma 4.7. If 0 < pi(x) < 1 for any 0 < i $1, take an \u00a3 such that (16/3)1\u00a3 < 1/4. \nThen pl(x) - (16/3)1\u00a3 < pl(x + \u00a3) < pl(x) + (16/3)1\u00a3. \nProof.. There are four cases ~epending on ~hether pl- ~ (x + \u00a3) is on the uphill or \ndownhIll of p and whether x IS on the uphlll or downhIll of p -1 . The proofs are \ndone by induction. \nFirst suppose that the two are on the uphill. Then pl(x + \u00a3) = p(pl-1\\X + f)) < \np(pl-1(X) + (16/3)1-1\u00a3)) < pl(x) + (16/3)1\u00a3. Secondly suppose that p -l(x + \u00a3) \nis on the uphill but x is on the downhill. Then pl(x + \u00a3) = p(pl-1(x + f)) > \np(pl-1(x) - (16/3)1-1\u00a3)) > pl(x) - (16/3)1\u00a3. The other two cases are similar. \n0 \nProof of Lemma 4.6. We will show that the difference between piHl(y) \nand E~==-ol p'(z)J-Li(xo) is sufficiently small. Clearly Z = E:1 J-Lik(X1)pik(y) = \nE{=l J-Lik(X1)pik(y) $ pik(y)+ E{~i(1/6k)i < pik(y)+1/(6k -1) and pik(y) < z. If \nZ is on the uphill of pI then by using the above lemma, we get E~==-Ol pi(z)J-Li(xO) = \nE~=o p'(z)J-Li(xo) < pl(z) + 1/(6k - 1) < piHl(y) + (1 + (16/3)1)(1/(6k - 1)) < \npik+1(y) + 1/4 (note that 1 $ k - 1 and k ~ 2). If z is on the downhill of pI then \nby using the above lemma, we get E~==-Ol pi(Z)J-Li(xo) = E~=o pi(z)J-Li(xo) > pl(z) > \npl(pik(y)) _ (16/3)1(1/(6k - 1)) > pik+l(y) - 1/4. 
\n0 \nNext we show the encoding scheme we adopted. We show only the case w = 8(h2 ) \nsince the case w = 8(h) or more generally w = O(h2) is easily obtained from this. \nTheorem 4.8 There is a network of2n inputs, 2h hidden units with h2 weights w, \n\n\f328 \n\nA. Sakurai \n\nand h 2 sets of input values Xl, ... ,Xh2 such that for any set of values Y1, ... , Yh2 \nwe can chose W to satisfy Yi = fG(w; Xi). \n\nProof. We extensively utilize the fact that monomials obtained by choosing at most \nk variables from n variables with repetition allowed (say X~X2X6) are all linearly \nindependent ([1]). Note that the number of monomials thus formed is (n~m). \n\nSuppose for simplicity that we have 2n inputs and 2h main hidden units (we have \nother hidden units too), and h = (n~m). By using multiplication units (in fact each \nis a composite of two squaring units and the outputs are supposed to be summed up \nas in Figure 1), we can form h = (n~m) linearly independent monomials composed \nof variables Xl, . \u2022\u2022 ,Xn by using at most (m -l)h multiplication units (or h nominal \nunits when m = 1). In the same way, we can form h linearly independent monomials \ncomposed of variables Xn+ll . .\u2022 , X2n. Let us denote the monomials by U1, \u2022.\u2022 , Uh \nand V1, . .. , Vh. \nWe form a subnetwork to calculate 2:7=1 (2:7=1 Wi,jUi)Vj by using h multiplication \nunits. Clearly the calculated result Y is the weighted sum of monomials described \nabove where the weights are Wi,j for 1 $ i, j $ h. \n\nSince y = fG(w; x) is a linear combination of linearly independent terms, if we \nchoose appropriately h 2 sets of values Xll . . . , Xh2 for X = (Xl, .. \u2022 , X2n) , then for \nany assignment of h 2 values Y1, ... ,Yh2 to Y we have a set of weights W such that \nYi = f(xi, w). \n0 \n\nProof of Theorem -4.1. The whole network consists of the decoder and the encoder. \nThe input points are the Cartesian product of the above Xl, ... 
,Xh2 and {xo defined \nin Lemma 4.4 for bj = 111 $ j :$ 8'} for some h where 8' is the number of bits to \nbe encoded. This means that we have h 2 s points that can be shattered. \n\nLet the number of hidden layers of the decoder be 8. The number of units used \nfor the decoder is 4(8 - 1) + 1 (for the degree 2 case which can decode at most 8 \nbits) or 4(8 - 3) + 4(k - 1) + 1 (for the degree 2k case which can decode at most \n(8 - 2)k bits). The number of units used for the encoder is less than 4h; we though \nhave constraints on 8 (which dominates the depth of the network) and h (which \ndominates the number of units in the network) that h :$ (n~m) and m = O(s) or \nroughly log h = 0(8) be satisfied. \nLet us chose m = 2 (m = log 8 is a better choise). As a result, by using 4h + 4(s -\nI} + 1 (or 4h + 4(8 - 3) + 4(k -1) + 1) units in s + 2 layers, we can shatter h 2 8 (or \nh 2 (8 - 2) log d) points; or asymptotically by using h units 8 layers we can shatter \n(1/16)w( 8 - 3) (or (1/16)w( 8 - 5) log d) points. \n0 \n\n5 Piecewise Polynomial Case \nTheorem 5.1. Let us consider a set of networks of units with linear input func(cid:173)\ntions and piecewise polynomial (with q polynomial segments) activation functions . \nQ( W8 log( dqh/ 8)) is a lower bound of the VC-dimension, where 8 is the depth of the \nnetwork and d is the maximum degree of the activation functions. More precisely, \n(1/16)w(s - 6)(10gd+ log(h/s) + logq) is an asymptotic lower bound. \nFor the scarcity of space, we give just an outline of the proof. Our proof is based \non that of the polynomial networks. We will use h units with activation function \nof q ~ 2 polynomial segments of degree at most d in place of each of pk unit in the \ndecoder, which give the ability of decoding log dqh bits in one layer and slog dqh \nbits in total by 8( 8h) units in total. 
If h designates the total number of units, the \n\n\fTight Bounds for the VC-Dimension of Piecewise Polynomial Networks \n\n329 \n\nnumber of the decodable bits is represented as log(dqh/s). \n\nIn the following for simplicity we suppose that dqh is a power of 2. Let pk(x) be \nthe k composition of p(x) as usual i.e. pk(x) = p(pk-l(x)) and pl(X) = p(x). Let \nplogd,/(x) = /ogd(,X/(x)), where 'x(x) = 4x if x $ 1/2 and 4 - 4x otherwise, which \nby the way has 21 polynomial segments. \nNow the pk unit in the polynomial case is replaced by the array /ogd,logq,logh(x) of \nh units that is defined as follows: \n(i) plogd,logq,l(X) is an array of two units; one is plogd,logq(,X+(x)) where ,X+(x) = \n4x if x $ 1/2 and 0 otherwise and the other is plog d,log q ('x - (x)) where ,X - (x) = 0 \nif x $ 1/2 and 4 - 4x otherwise. \n\n(ii) plog d,log q,m~x) is the array of 2m units, each with one of the functions \n( . .\u2022 ('x\u00b1(x)) . . . )) where ,X\u00b1( ... ('x\u00b1(x)) .. \u00b7) is the m composition \n\nplogd,logq(,X \nof 'x+(x) or 'x - (x). Note that ,X\u00b1( ... ('x\u00b1(x)) ... ) has at most three linear seg(cid:173)\nments (one is linear and the others are constant 0) and the sum of 2m possible \ncombinations t(,X\u00b1( . . . ('x\u00b1(x)) \u00b7 . . )) is equal to t(,Xm(x)) for any function f \nsuch that f(O) = O. \n\nThen lemmas similar to the ones in the polynomial case follow. \n\nReferences \n\n[1] Anthony, M: Classification by polynomial surfaces, NeuroCOLT Technical Re(cid:173)\n\nport Series, NC-TR-95-011 (1995). \n\n[2] Goldberg, P. and M. Jerrum: Bounding the Vapnik-Chervonenkis dimension \nof concept classes parameterized by real numbers, Proc. Sixth Annual ACM \nConference on Computational Learning Theory, 361-369 (1993). \n\n[3] Karpinski, M. and A. Macintyre, Polynomial bounds for VC dimension of sig(cid:173)\n\nmoidal neural networks, Proc. 27th ACM Symposium on Theory of Computing, \n200-208 (1995) . \n\n[4] Koiran, P. and E. D. 
Sontag: Neural networks with quadratic VC dimension, \n\nJourn. Compo Syst. Sci., 54, 190-198(1997). \n\n[5] Maass, W . G.: Bounds for the computational power and learning complexity of \nanalog neural nets, Proc. 25th Annual Symposium of the Theory of Computing, \n335-344 (1993). \n\n[6] Maass, W. G.: Neural nets with superlinear VC-dimension, Neural Computa(cid:173)\n\ntion, 6, 877-884 (1994) \n\n[7] Milnor, J.: On the Betti numbers of real varieties, Proc. of the AMS, 15, \n\n275-280 (1964). \n\n[8] Sakurai, A.: Tighter Bounds of the VC-Dimension of Three-layer Networks, \n\nProc. WCNN'93, III, 540-543 (1993). \n\n[9] Sakurai, A.: On the VC-dimension of depth four threshold circuits and the \ncomplexity of Boolean-valued functions, Proc. ALT93 (LNAI 744), 251-264 \n(1993) ; refined version is in Theoretical Computer Science, 137, 109-127 (1995). \n\n[10] Sakurai, A. : On the VC-dimension of neural networks with a large number of \n\nhidden layers, Proc. NOLTA'93, IEICE, 239-242 (1993). \n\n[11] Warren, H. E.: Lower bounds for approximation by nonlinear manifolds, Trans . \n\nAMS, 133, 167-178, (1968). \n\n\fOn-Line Learning with Restricted \n\nTraining Sets: \n\nExact Solution as Benchmark \n\nfor General Theories \n\nH.C. Rae \n\nhamish.rae@kcl.ac.uk \n\nP. Sollich \n\npsollich@mth.kcl.ac.uk \n\nA.C.C. Coolen \n\ntcoolen@mth.kcl.ac.uk \n\nDepartment of Mathematics \n\nKing's College London \n\nThe Strand \n\nLondon WC2R 2LS, UK \n\nAbstract \n\nWe solve the dynamics of on-line Hebbian learning in perceptrons \nexactly, for the regime where the size of the training set scales \nlinearly with the number of inputs. We consider both noiseless \nand noisy teachers. Our calculation cannot be extended to non(cid:173)\nHebbian rules, but the solution provides a nice benchmark to test \nmore general and advanced theories for solving the dynamics of \nlearning with restricted training sets. 
\n\n1 \n\nIntroduction \n\nConsiderable progress has been made in understanding the dynamics of supervised \nlearning in layered neural networks through the application of the methods of sta(cid:173)\ntistical mechanics. A recent review of work in this field is contained in [1 J. For \nthe most part, such theories have concentrated on systems where the training set is \nmuch larger than the number of updates. In such circumstances the probability that \na question will be repeated during the training process is negligible and it is possible \nto assume for large networks, via the central limit theorem, that the local field dis(cid:173)\ntribution is Gaussian. In this paper we consider restricted training sets; we suppose \nthat the size of the training set scales linearly with N, the number of inputs. The \nprobability that a question will reappear during the training process is no longer \nnegligible, the assumption that the local fields have Gaussian distributions is not \ntenable, and it is clear that correlations will develop between the weights and the \n\n\fLearning with Restricted Training Sets: Exact Solution \n\n317 \n\nquestions in the training set as training progresses. In fact, the non-Gaussian char(cid:173)\nacter of the local fields should be a prediction of any satisfactory theory of learning \nwith restricted training sets, as this is clearly demanded by numerical simulations. \nSeveral authors [2, 3, 4, 5, 6, 7] have discussed learning with restricted training sets \nbut a general theory is difficult. A simple model of learning with restricted training \nsets which can be solved exactly is therefore particularly attractive and provides \na yardstick against which more difficult and sophisticated general theories can, in \ndue course, be tested and compared. 
We show how this can be accomplished for \non-line Hebbian learning in perceptrons with restricted training sets and we ob(cid:173)\ntain exact solutions for the generalisation error and the training error for a class of \nnoisy teachers and students with arbitrary weight decay. Our theory is in excellent \nagreement with numerical simulations and our prediction of the probability density \nof the student field is a striking confirmation of them, making it clear that we are \nindeed dealing with local fields which are non-Gaussian. \n\n2 Definitions \n\nWe study on-line learning in a student percept ron S, which tries to perform a task \ndefined by a teacher percept ron characterised by a fixed weight vector B* E ~N. \nWe assume, however, that the teacher is noisy and that the actual teacher output \nT and the corresponding student response S are given by \n\nT: {-I, I}N ~ {-I, I} \nS: {-I, I}N ~ {-I, I} \n\nT(e) = sgn[B\u00b7 eL \nS(e) = sgn[J\u00b7 e]' \n\nwhere the vector B is drawn independently of e with probability p(B} which may \ndepend explicitly on the correct teacher vector B*. Of particular interest are the \nfollowing two choices, described in literature as output noise and Gaussian input \nnoise, respectively: \n\np(B} = >. 6(B+B*} + (1->.) 6(B-B*} \n\n(1) \n\nwhere >. ~ 0 represents the probability that the teacher output is incorrect, and \n\n(B) = [~] T \nP \n\n211'~2 \n\nN \n\n-I:f(B-Bo)2/'E2 \ne \n. \n\n(2) \n\nThe variance ~2 / N has been chosen so as to achieve appropriate scaling for N ~ CXl. \n\nOur learning rule will be the on-line Hebbian rule, i.e. \n\nJ(f+l) = (1- ~)J(f) + ~ e(f) sgn[B(f)\u00b7 e(f)] \n\n(3) \n\nwhere the non-negative parameters, and fJ are the decay rate and the learning rate , \nrespectively. At each iteration step f an input vector e(f) is picked at random from \na training set consisting of p = aN randomly drawn vectors e\u00b7 E {-I, I} N, f..L = \n1, . . . p. This set remains unchanged during the learning dynamics. 
At the same \ntime the teacher selects at random, and independently of e(f}, the vector B(\u00a3), \naccording to the probability distribution p(B} . Iterating equation (3) gives \n\nJ(m) = (1 - ~) m J o + ~ ~ (1 _ ~) m-l-Ie(e) sgn[B(f) . e(f)] \n\n(4) \n\n(=0 \n\nWe assume that the (noisy) teacher output is consistent in the sense that if a \nquestion e reappears at some stage during the training process the teacher makes \nthe same choice of B in both cases, i.e. if e(e) = e(f') then also B(f) = B(e') . This \nconsistency allows us to define a generalised training set iJ by including with the p \n\n\f318 \n\nH. C. Rae, P. Sollich and A. C. C. Coo/en \n\nquestions the corresponding teacher vectors: \n\nD = {(e,B 1), ... ,(e,BP)} \n\nThere are two sources of randomness in this problem. First of all there is the random \nrealisation of the 'path' n = ((e(O), B(O)), (e(l), B(l)), ... , (e(f), B (f)), ... }. This \nis simply the randomness of the stochastic process that gives the evolution of the \nvector J. Averages over this process will be denoted as ( ... ). Secondly there is the \nrandomness in the composition of the training set. We will write averages over all \ntraining sets as ( ... )sets. We note that \n\n(J[e(f), B(e))) = ~ L f(e, Btl) \n\np \n\np tL=1 \n\n(for all e) \n\nand that averages over all possible realisations of the training set are given by \n(J[(e, B1), (e, B2), ... , (e, BP)])sets \n\n= L L ... L 2~P J [ IT p(BIl) dBIl] f[(e, B1), (e, B2), ... ,(e, BP)] \ne1 e \nwhere e E {-I, l}N. We normalise B* so that [B*]2 = 1 and choose the time unit \nt = miN. We finally assume that J o and B* are statistically independent of the \ntraining vectors ell, and that they obey Ji(O), B; = O(N-~) for all i. \n\ne \n\ntL=l \n\n3 Explicit Microscopic Expressions \nAt the m-th stage of the learning process the two simple scalar observables Q[J] = \nJ2 and R[J] = B* . J, and the joint distribution of fields x = J . e, y = B* . e, z = \nB . 
e (calculated over the questions in the training set D), are given by \n\nQ[J(m)] = J2(m) \n\nR[J(m)] = B* . J(m) \n\nPix, y, z; J(m)] = - L o[x - J(m) . e] o[y - B* . ell] o[z - Bil . ell] \n\n1 P \n\np 11=1 \n\n(5) \n\n(6) \n\nFor infinitely large systems one can prove that the fluctuations in mean-field ob(cid:173)\nservables such as {Q, R, P}, due to the randomness in the dynamics, will vanish [6]. \nFurthermore one assumes, with convincing support from numerical simulations, that \nfor N -r (Xl the evolution of such observables, observed for different random realisa(cid:173)\ntions of the training set, will be reproducible (i.e. the sample-to-sample fluctuations \nwill also vanish, which is called 'self-averaging'). Both properties are central ingre(cid:173)\ndients of all current theories. We are thus led to the introduction of the averages of \nthe observables in (5,6), with respect to the dynamical randomness and with respect \nto the randomness in the training set (to be carried out in precisely this order): \n\nQ(t) = lim ( (Q[J(tN))) )set.s \n\nN-+oo \n\nR(t) = \n\nlim ( (R[J(tN)]) )sets \nN-+oo \n\nPt(x,y,z) = \n\nlim \u00abP[x,y,z;J(tN)]) )sets \nN-+oo \n\n( 7) \n\n(8) \n\nA fundamental ingredient of our calculations will be the average (~i sgn(B \u00b7e))(e, B), \ncalculated over all realisations of (e, B). We find, for a wide class of p(B), that \n\n(9) \n\nwhere, for example, \n\n\fLearning with Restricted Training Sets: Exact Solution \n\nP = if (1-2>.) \nP_ f!. \n- V -; V1 + 'f,2 \n\n1 \n\n(output noise) \n\n(Gaussian input noise) \n\n3/9 \n\n(10) \n\n(11) \n\n4 Averages of Simple Scalar Observables \n\nCalculation of Q(t) and R(t) using (4, 5, 7, 9) to execute the path average and the \naverage over sets is relatively straightforward, albeit tedious. 
We find that \n\nQ(t) = e-2\"\"(tQo + 21}PRo e \n\n-\"Yt) \n\n-\"Yt(l \n-e \n\"( \n\n+ ~(1_e-2\"Yt) \n\n2 \n2, \n\n+1}2 \n\n(1_e - \"Yt)2 1 \n(_+p2) \na \n\n\"(2 \n\n(12) \n\nand that \n\n(13) \nwhere p is given by equations (10, 11) in the examples of output noise and Gaussian \ninput noise, respectively. We note that the generalisation error is given by \n\nEg = ~arccos [R(t)/v'Q(t)] \n\n(14) \n\nAll models of the teacher noise which have the same p will thus have the same \ngeneralisation error at any time. This is true, in particular, of output noise and \nGaussian input noise when their respective parameters>. and 'f, are related by \n\n1 - 2>' = \n\n1 \n\nV1 + 'f,2 \n\n(15) \n\nWith each type of teacher noise for which (9) holds, one can thus associate an \neffective output noise parameter >.. Note, however, that this effective teacher error \nprobability>. will in general not be identical to the true teacher error probability \nassociated with a given p(B), as can immediately be seen by calculating the latter \nfor the Gaussian input noise (2). \n\n5 Average of the Joint Field Distribution \n\nThe calculation of the average of the joint field distribution starting from equation \n(8) is more difficult. Writing a = (l-,IN) , and expressing the 6 functions in terms \nof complex exponentials, we find that \nP, (x y z) = jdidydZ ei(xHyy+zi) lim (e-i[xe-\"YtJo \u00b7el+i;B \u00b7 .e+zBl.el] \nt \n\nN-400 \n\n871\"3 \n\n, \n\n, \n\nX fi:[~ te-[i1)XN- 1 /TtN-t(e 1 f') sg~(B\"\"f')l]) \n\u00a3=0 p v=l \n\nsets \n\n(16) \n\nIn this expression we replace e 1 bye and Bl by B, and abbreviate S = I1~~0[' \u00b7l \nUpon writing the latter product in terms of the auxiliary variables Vv = (e 1 \u00b7e V ) I IN \nand Wv == B V \n\n\u2022 C, we find that for large N \n\nlogS\", X(x sgn[B\u00b7 e],t) - t1}XUl (l_e - \"Yt) _ 1} x u2(1_e-2\"Yt ) \n\n(17) \n\n. A \n\n\"( \n\n2 A2 \n\n4\"( \n\nwhere Ul, U2 are the random variables given by \n\n\f320 \n\nand with \n\nH. C. Rae. P. 
Sollich and A. C. C. Coolen \n\nU2 = - ~ Vv \n\n1 '\"\"' \nP v>l \n\n2 \n. \n\nUl = .jN ~ Vv sgn(wv ), \n\n1 \n\n'\"\"' \na N v>l \n\n1 it \n\na 0 \n\nds [e- 11]We \n\n[-Y(.-t)] \n\nX(w, t) = -\n\n- 1] \n\n(18) \n\nA study of the statistics of Ul and U2 shows that limN --700 U2 = 1, and that \n\n(N ~ 00), \n\nwhere U is a Gaussian random variable with mean equal to zero and variance unity. \nOn the basis of these results and equations (16, 17) we find that \n\nP, (x y z) = jdXdfjdi ei(x:Hyy+==)_~x2[Q - R2- e -2-yt(Qo-R6)]+ ~dx sgn [=] ,t) -ixy(R-Roe-> ' ) \nt \n\n87f3 \n\n, \n\n, \n\n(19) \n\nwhere Q and R are given by the expressions (12,13) (note: Q - R2 is independent \nof p, i.e. of the distribution p(B)). Let Xo = J o .~, y = B* .~, z = B . ~. \nWe assume that, given y, z is independent of Xo. This condition, which reflects in \nsome sense the property that the teacher noise preserves the perceptron structure. \nis certainly satisfied for the models which we are considering and is probably true \nof all reasonable noise models. The joint probability density then has the form \np( Xo, y, z) = p( Xo J Y )p(y , z). Equation (19) then leads to the following expression for \nthe conditional probability of x, given y and z: \n\nP,t(xJy, z) = j ~! eiX[x-Ry]-~x2[Q-R2J+x(x sgn[z),t) \n\n(20) \n\nWe observe that this probability distribution is the same for all models with the \nsame p and that the dependence on z is through r = sgn[ z], a directly observable \nquantity. The training error and the student field probability density are given by \n\nE tr = j dxdy L B( -xr)P,t (xJy , r)P(rJy)P,(y) \n\nT=\u00b1l \n\nP,t(x) = j dy L P,t(xJy, r)P,(rJy)P(y) \n\nT=\u00b1l \n\n(21 ) \n\n(22) \n\nin which P,(y) = (27f)-2e- 2Y . 
We note that the dependence of E_{\rm tr} and P_t(x) on the specific noise model arises solely through P(\tau|y), which we find is given by

  P(\tau|y) = \lambda\,\theta(-\tau y) + (1 - \lambda)\,\theta(\tau y), \qquad P(\tau|y) = \frac{1}{2}\big( 1 + \tau\, \mathrm{erf}[y/\Sigma\sqrt{2}] \big)

in the output noise and Gaussian input noise models, respectively. In order to simplify the numerical computation of the remaining integrals one can further reduce the number of integrations analytically. Details will be reported elsewhere.

6 Comparison with Numerical Simulations

It will be clear that there is a large number of parameters that one could vary in order to generate different simulation experiments with which to test our theory. Here we have to restrict ourselves to presenting a number of representative results. Figure 1 shows, for the output noise model, how the probability density P_t(x) of

Learning with Restricted Training Sets: Exact Solution                          321

Figure 1: Student field distribution P(x) for the case of output noise, at different times (left to right: t = 1, 2, 3, 4), for \alpha = \gamma = \frac{1}{2}, J_0 = \eta = 1, \lambda = 0.2. Histograms: distributions measured in simulations (N = 10,000). Lines: theoretical predictions.

the student field x = \mathbf{J}\cdot\boldsymbol{\xi} develops in time, starting as a Gaussian at t = 0 and evolving to a highly non-Gaussian distribution with a double peak by time t = 4. The theoretical results give an extremely satisfactory account of the numerical simulations. Figure 2 compares our predictions for the generalisation and training errors E_g and E_{\rm tr} with the results of numerical simulations, for different initial conditions, E_g(0) = 0 and E_g(0) = 0.5, and for different choices of the two most important parameters \lambda (which controls the amount of teacher noise) and \alpha (which measures the relative size of the training set).
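As a complement to these simulation comparisons, the relation (15) between the two noise models can itself be probed numerically. The sketch below is our own illustration, assuming the Gaussian input noise model corrupts the clean teacher field y by an additive Gaussian of standard deviation \Sigma (the reading consistent with the form of P(\tau|y) given above; the variable names are ours). It estimates by Monte Carlo the true probability that the noisy teacher disagrees with the clean one, and compares it with the effective output noise parameter \lambda:

```python
import numpy as np

rng = np.random.default_rng(0)
Sigma = 1.0                      # Gaussian input noise strength
n = 1_000_000

# Effective output noise parameter from equation (15): 1 - 2*lambda = 1/sqrt(1+Sigma^2)
lam_eff = 0.5 * (1.0 - 1.0 / np.sqrt(1.0 + Sigma**2))

# Monte Carlo estimate of the true teacher error probability: the chance that
# the noisy field y + Sigma*z and the clean field y have opposite signs,
# with y, z independent standard Gaussians.
y = rng.standard_normal(n)
z = rng.standard_normal(n)
lam_true_mc = np.mean(np.sign(y + Sigma * z) != np.sign(y))

# Closed form under the same assumption: two jointly Gaussian fields with
# correlation rho = 1/sqrt(1+Sigma^2) disagree in sign with probability
# arccos(rho)/pi.
lam_true = np.arccos(1.0 / np.sqrt(1.0 + Sigma**2)) / np.pi

print(lam_eff)        # ~ 0.146 for Sigma = 1
print(lam_true)       # ~ 0.25  for Sigma = 1: the two probabilities differ
```

For \Sigma = 1 the effective parameter is \lambda \approx 0.146 while the actual disagreement probability is 0.25, illustrating the remark below equation (15) that the two are in general not identical.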
The theoretical results are again in excellent agreement with the simulations. The system is found to have no memory of its past (which will be different for some other learning rules), the asymptotic values of E_g and E_{\rm tr} being independent of the initial student vector. In our examples E_g is consistently larger than E_{\rm tr}, the difference becoming less pronounced as \alpha increases. Note, however, that in some circumstances E_{\rm tr} can also be larger than E_g. Careful inspection shows that for Hebbian learning there are no true overfitting effects, not even in the case of large \lambda and small \gamma (for large amounts of teacher noise, without regularisation via weight decay). Minor finite time minima of the generalisation error are only found for very short times (t < 1), in combination with special choices for parameters and initial conditions.

7 Discussion

Starting from a microscopic description of Hebbian on-line learning in perceptrons with restricted training sets, of size p = \alpha N where N is the number of inputs, we have developed an exact theory in terms of macroscopic observables which has enabled us to predict the generalisation error and the training error, as well as the probability density of the student local fields, in the limit N \to \infty. Our results are in excellent agreement with numerical simulations (as carried out for systems of size N = 5,000) in the case of output noise; our predictions for the Gaussian input noise model are currently being compared with the results of simulations. Generalisations of our calculations to scenarios involving, for instance, time-dependent learning rates or time-dependent decay rates are straightforward. Although it will be clear that our present calculations cannot be extended to non-Hebbian rules, since they
\n\nI~ \n\n0.5 \n\n0.4 \n\n0.3 \n\n0.2 \n\n0.1 \n\n0.0 \n\n0.5 \n\n0.4 ~ \n0.3 ~ \n0.2 ~~ \n~ \n\n0.1 \n\na=O.5 \n\na=4.0 \n\nr \n\nf \n~ \n\n~ \n\n\"\"V' \n\n-~ \n\na=0.5 \n\n~~ \nl \n\n'v-\n\n~ \n\nA=D.25 \n\nA=D.O \n\nA:=0.25 \n\n\"\\.T \n\n)..=0.0 \n\n~ \n\n-~ \n\n0.0 o \n\n10 \n\n20 \nt \n\n30 \n\n40 \n\no \n\n10 \n\n20 \nt \n\na=0.5 \n\na=4.0 1 \nJhj \n\na=O.5 \n\nA:=0.25 \n\n)..=0.0 \n\n)..=0.25 \n\n\"'..t> \n\n~ \n\nA.---0.0 \n\n30 \n\nJ \nj , \n40 \n\nFigure 2: Generalisation errors (diamonds/lines) and training errors (circles/li.nes) \nas observed during on-line Hebbian learning, as functions of time. Upper two graphs: \nA = 0.2 and a E {0.5,4.0} (upper left: Eg(O) = 0.5, upper right: Eg(O) = 0). Lower \ntwo graphs: a = 1 and A E {O.O, 0.25} (lower left: Eg(O) = 0.5. lower right: \nEg(O) = 0.0). Markers: simulation results for an N = 5,000 system. Solid lines: \npredictions of the theory. In all cases Jo = 'f} = 1 and 'Y = 0.5 . \n\nultimately rely on our ability to write down the microscopic weight vector J at \nany time in explicit form (4), they do indeed provide a significant yardstick against \nwhich more sophisticated and more general theories can be tested. In particular. \nthey have already played a valuable role in assessing the conditions under which a \nrecent general theory of learning with restricted training sets, based on a dynamical \nversion of the replica formalism, is exact [6, 7]. \n\nReferences \n\n[1] Mace C.W.H.and Coolen A.C.C. (1998) Statistics and Computing 8 , 55 \n[2] Horner H. (1992a) , Z.Phys . B 86.291; (1992b) , Z.Phys . B 87,371 \n[3] Krogh A. and Hertz J.A. (1992) IPhys . A: Math. Gen. 25, 1135 \n[4] Sollich P. and Barber D. (1997) Europhys. Lett. 38 , 477 \n[5] SoUich P. and Barber D. (1998) Advances in Neural Information Processing \n\nSystems 10, Eds. Jordan M., Kearns M. and Solla S. (Cambridge: MIT) \n\n[6] Cool en A.C.C. and Saad D. , King's College London preprint KCL-MTH-98-08 \n[7] Coolen A .C.C. and Saad D. 
(1998), in preparation