{"title": "Examples of learning curves from a modified VC-formalism", "book": "Advances in Neural Information Processing Systems", "page_first": 344, "page_last": 350, "abstract": null, "full_text": "Examples of learning curves from a modified \n\nVC-formalism. \n\nA. Kowalczyk & J. Szymanski \nTelstra Research Laboratories \n\n770 Blackbtun Road, \n\nClayton, Vic. 3168, Australia \n\n{akowalczyk,j.szymanski }@trl.oz.au) \n\nP.L. Bartlett & R.C. Williamson \nDepartment of Systems Engineering \n\nAustralian National University \nCanberra, ACT 0200, Australia \n\n{bartlett, williams }@syseng.anu.edu.au \n\nAbstract \n\nWe examine the issue of evaluation of model specific parameters in a \nmodified VC-formalism. Two examples are analyzed: the 2-dimensional \nhomogeneous perceptron and the I-dimensional higher order neuron. \nBoth models are solved theoretically, and their learning curves are com(cid:173)\npared against true learning curves. It is shown that the formalism has \nthe potential to generate a variety of learning curves, including ones \ndisplaying ''phase transitions.\" \n\n1 \n\nIntroduction \n\nOne of the main criticisms of the Vapnik-Chervonenkis theory of learning [15] is that the \nresults of the theory appear very loose when compared with empirical data. In contrast, \ntheory based on statistical physics ideas [1] provides tighter numerical results as well as \nqualitatively distinct predictions (such as \"phase transitions\" to perfect generalization). \n(See [5, 14] for a fuller discussion.) A question arises as to whether the VC-theory can \nbe modified to give these improvements. The general direction of such a modification is \nobvious: one needs to sacrifice the universality of the VC-bounds and introduce model (e.g. \ndistribution) dependent parameters. This obviously can be done in a variety of ways. 
Some specific examples are VC-entropy [15], empirical VC-dimensions [16], efficient complexity [17] or (p, C)-uniformity [8, 9] in a VC-formalism with error shells. An extension of the last formalism is of central interest to this paper. It is based on a refinement of the "fundamental theorem of computational learning" [2], and its main innovation is to split the set of partitions of a training sample into separate "error shells", each composed of error vectors corresponding to the different error values.

Such a split introduces a whole range of new parameters (the average number of patterns in each of a series of error shells) in addition to the VC dimension. The difficulty of determining these parameters then arises. There are some crude, "obvious" upper bounds on them which lead to both the VC-based estimates [2, 3, 15] and the statistical physics based formalism (with phase transitions) [5] as specific cases of this novel theory. Thus there is an obvious potential for improvement of the theory with tighter bounds. In particular we find that the introduction of a single parameter (the order of uniformity), which in a sense determines shifts in the relative sizes of error shells, leads to a full family of shapes of learning curves, ranging continuously in behavior from decay proportional to the inverse of the training sample size to "phase transitions" (sudden drops) to perfect generalization at small training sample sizes. We present an initial comparison of the learning curves from this new formalism with "true" learning curves for two simple neural networks.

2 Overview of the formalism

The presentation is set in the typical PAC-style; the notation follows [2].
We consider a space $X$ of samples with a probability measure $\mu$, a subspace $H$ of binary functions $X \to \{0,1\}$ (dichotomies), called the hypothesis space, and a target hypothesis $t \in H$. For each $h \in H$ and each $m$-sample $\vec{x} = (x_1, \ldots, x_m) \in X^m$ ($m \in \{1, 2, \ldots\}$), we denote by $\epsilon_{h,\vec{x}} \stackrel{def}{=} \frac{1}{m} \sum_{i=1}^m |t - h|(x_i)$ the empirical error of $h$ on $\vec{x}$, and by $\epsilon_h \stackrel{def}{=} \int_X |t - h|(x)\, \mu(dx)$ the expected error of $h \in H$.

For each $m \in \{1, 2, \ldots\}$ let us consider the random variable

$$\epsilon_H^{max}(\vec{x}) \stackrel{def}{=} \max_{h \in H} \{ \epsilon_h \; ; \; \epsilon_{h,\vec{x}} = 0 \}, \qquad (1)$$

defined as the maximal expected error of a hypothesis $h \in H$ consistent with $t$ on $\vec{x}$. The learning curve of $H$, defined as the expected value of $\epsilon_H^{max}$,

$$\epsilon_H(m) \stackrel{def}{=} E_{X^m}[\epsilon_H^{max}] = \int_{X^m} \epsilon_H^{max}(\vec{x})\, \mu^m(d\vec{x}) \qquad (\vec{x} \in X^m), \qquad (2)$$

is of central interest to us. Upper bounds on it can be derived from basic PAC-estimates as follows. For $\epsilon \ge 0$ we denote by $H_\epsilon \stackrel{def}{=} \{ h \in H \; ; \; \epsilon_h \ge \epsilon \}$ the subset of $\epsilon$-bad hypotheses, and by

$$Q_\epsilon^m \stackrel{def}{=} \{ \vec{x} \in X^m \; ; \; \exists_{h \in H_\epsilon}\ \epsilon_{h,\vec{x}} = 0 \} = \{ \vec{x} \in X^m \; ; \; \exists_{h \in H}\ \epsilon_{h,\vec{x}} = 0 \ \&\ \epsilon_h \ge \epsilon \} \qquad (3)$$

the subset of $m$-samples for which there exists an $\epsilon$-bad hypothesis consistent with the target $t$.

Lemma 1 If $\mu^m(Q_\epsilon^m) \le \Psi(\epsilon, m)$, then $\epsilon_H(m) \le \int_0^1 \min(1, \Psi(\epsilon, m))\, d\epsilon$, and equality in the assumption implies equality in the conclusion. $\Box$

Proof outline. If the assumption holds, then $\Pi(\epsilon, m) \stackrel{def}{=} 1 - \min(1, \Psi(\epsilon, m))$ is a lower bound on the cumulative distribution of the random variable (1). Thus $E_{X^m}[\epsilon_H^{max}] \le \int_0^1 \epsilon\, \frac{\partial}{\partial \epsilon}\Pi(\epsilon, m)\, d\epsilon$, and integration by parts yields the conclusion. $\Box$

Given $\vec{x} = (x_1, \ldots, x_m) \in X^m$, let us introduce the transformation (projection) $\pi_{t,\vec{x}} : H \to \{0,1\}^m$ allocating to each $h \in H$ the vector

$$\pi_{t,\vec{x}}(h) \stackrel{def}{=} (|h(x_1) - t(x_1)|, \ldots, |h(x_m) - t(x_m)|),$$

called the error pattern of $h$ on $\vec{x}$.
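The error patterns just defined, and the error shells introduced below, can be made concrete on a small finite class. The following Python sketch (entirely ours, not from the paper) uses a toy family of threshold dichotomies on the unit interval; all names and values are illustrative assumptions.

```python
# Toy illustration (ours, not from the paper): threshold dichotomies on [0, 1].
# h_a(x) = 1 iff x >= a; the target is t = h_{0.5}.
def make_h(a):
    return lambda x: 1 if x >= a else 0

H = {a: make_h(a) for a in (0.1, 0.3, 0.5, 0.7, 0.9)}
t = make_h(0.5)

# A fixed m-sample (m = 4).
x = (0.2, 0.4, 0.6, 0.8)

# Error pattern pi_{t,x}(h) = (|h(x_1) - t(x_1)|, ..., |h(x_m) - t(x_m)|).
def error_pattern(h, t, x):
    return tuple(abs(h(xi) - t(xi)) for xi in x)

patterns = {a: error_pattern(h, t, x) for a, h in H.items()}

# Group the distinct patterns by their number of errors; the group with i ones
# is the intersection of pi_{t,x}(H) with the error shell of error value i/m.
shells = {}
for e in set(patterns.values()):
    shells.setdefault(sum(e), set()).add(e)
shell_sizes = {i: len(s) for i, s in sorted(shells.items())}
# Here shell_sizes == {0: 1, 1: 2, 2: 2}: one consistent pattern (h = t),
# two distinct patterns with one error, and two with two errors.
```

Averaging these shell-wise pattern counts over random $m$-samples is what produces the model-specific parameters of the formalism.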
For a subset $G \subset H$, let $\pi_{t,\vec{x}}(G) = \{ \pi_{t,\vec{x}}(h) \; : \; h \in G \}$. The space $\{0,1\}^m$ is the disjoint union of error shells

$$\mathcal{E}_i \stackrel{def}{=} \{ (e_1, \ldots, e_m) \in \{0,1\}^m \; ; \; e_1 + \cdots + e_m = i \} \qquad (i = 0, 1, \ldots, m),$$

and $|\pi_{t,\vec{x}}(H_\epsilon) \cap \mathcal{E}_i|$ is the number of different error patterns with $i$ errors which can be obtained for $h \in H_\epsilon$. We shall employ the following notation for its average:

$$|H_\epsilon|_i^m \stackrel{def}{=} E_{X^m}[|\pi_{t,\vec{x}}(H_\epsilon) \cap \mathcal{E}_i|] = \int_{X^m} |\pi_{t,\vec{x}}(H_\epsilon) \cap \mathcal{E}_i|\, \mu^m(d\vec{x}). \qquad (4)$$

The central result of this paper, which gives a bound on the probability of the set $Q_\epsilon^m$ as in Lemma 1 in terms of $|H_\epsilon|_i^m$, will be given now. It is obtained by modification of the proof of [8, Theorem 1], which is a refinement of the proof of the "fundamental theorem of computational learning" in [2]. It is a simplified version (to the consistent learning case) of the basic estimate discussed in [9, 7].

Theorem 2 For any integer $k \ge 0$ and $0 \le \epsilon, \gamma \le 1$,

$$\mu^m(Q_\epsilon^m) \le A_{\epsilon,k,\gamma} \sum_{j \ge \gamma k} \binom{k}{j} \binom{m+k}{j}^{-1} |H_\epsilon|_j^{m+k}, \qquad (5)$$

where $A_{\epsilon,k,\gamma} \stackrel{def}{=} \big( 1 - \sum_{0 \le j < \gamma k} \binom{k}{j} \epsilon^j (1-\epsilon)^{k-j} \big)^{-1}$ for $k > 0$, and $A_{\epsilon,0,\gamma} \stackrel{def}{=} 1$. $\Box$

Since error shells are disjoint we have the following relation:

$$P_H(m) \stackrel{def}{=} 2^{-m} \int_{X^m} |\pi_{\vec{x}}(H)|\, \mu^m(d\vec{x}) = 2^{-m} \sum_{i=0}^m |H|_i^m \le \Pi_H(m)/2^m, \qquad (6)$$

where $\pi_{\vec{x}}(h) \stackrel{def}{=} \pi_{0,\vec{x}}(h)$, $|H|_i^m \stackrel{def}{=} |H_0|_i^m$ and $\Pi_H(m) \stackrel{def}{=} \max_{\vec{x} \in X^m} |\pi_{\vec{x}}(H)|$ is the growth function [2] of $H$. (Note that assuming that the target $t \equiv 0$ does not affect the cardinality of $\pi_{t,\vec{x}}(H)$.) If the VC-dimension of $H$, $d = d_{VC}(H)$, is finite, we have the well-known estimate [2]

$$\Pi_H(m) \le \Phi(d, m) \stackrel{def}{=} \sum_{j=0}^d \binom{m}{j} \le (em/d)^d. \qquad (7)$$

Corollary 3 (i) If the VC-dimension $d$ of $H$ is finite and $m > 8/\epsilon$, then $\mu^m(Q_\epsilon^m) \le 2 \cdot 2^{-m\epsilon/2} (2em/d)^d$. (ii) If $H$ has finite cardinality, then $\mu^m(Q_\epsilon^m) \le \sum_{h \in H_\epsilon} (1 - \epsilon_h)^m$.

Proof.
(i) Use the estimate $A_{\epsilon,k,\epsilon/2} \le 2$ for $k \ge 8/\epsilon$, resulting from the Chernoff bound, and set $\gamma = \epsilon/2$ and $k = m$ in (5). (ii) Substitute the following crude estimate into the previous one:

$$|H_\epsilon|_i \le \sum_{i=0}^m |H_\epsilon|_i \le \sum_{i=0}^m |H|_i \le 2^m P_H(m) \le \Pi_H(m) \le (em/d)^d.$$

(iii) Set $k = 0$ in (5) and use the estimate

$$|H_\epsilon|_i \le \sum_{h \in H_\epsilon} \Pr_{X^m}(\epsilon_{h,\vec{x}} = i/m) = \sum_{h \in H_\epsilon} \binom{m}{i} (1 - \epsilon_h)^{m-i} \epsilon_h^i. \quad \Box$$

The inequality in Corollary 3.i (ignoring the factor of 2) is the basic estimate of the VC-formalism (c.f. [2]); the inequality in Corollary 3.ii is the union bound, which is the starting point for the statistical physics based formalism developed in [5]. In this sense both of these theories are unified in estimate (5), and all their conclusions (including the prediction

Figure 1: (a) Examples of upper bounds on the learning curves for the case of finite VC-dimension $d = d_{VC}(H)$ implied by Corollary 4.ii for $C_{\omega,m} \equiv const$. They split into five distinct "bands" of four curves each, according to the values of the order of uniformity $\omega = 2, 3, 4, 5, 10$ (in the top-down order). Each band contains a solid line ($C_{\omega,m} \equiv 1$, $d = 100$), a dotted line ($C_{\omega,m} \equiv 100$, $d = 100$), a chain line ($C_{\omega,m} \equiv 1$, $d = 1000$) and a broken line ($C_{\omega,m} \equiv 100$, $d = 1000$). (b) Various learning curves for the 2-dimensional homogeneous perceptron.
Solid lines (top to bottom): (i) for the VC-theory bound (Corollary 3.i) with VC-dimension $d = 2$; (ii) for the bound (from Eqn. 5 and Lemma 1) with $\gamma = \epsilon$, $k = m$ and the upper bounds $|H_\epsilon|_i \le |H|_i = 2$ for $i = 1, \ldots, m-1$ and $|H_\epsilon|_i \le |H|_i = 1$ for $i = 0, m$; (iii) as in (ii) but with the exact values for $|H_\epsilon|_i$ as in (11); (iv) the true learning curve (Eqn. 13). The $\omega$-uniformity bound for $\omega = 2$ (with the minimal $C_{\omega,m}$ satisfying (9), which turns out to be $\equiv 1$) is shown by the dotted line; for $\omega = 3$ the chain line gives the result for the minimal $C_{\omega,m}$ and the broken line for $C_{\omega,m}$ set to 1.

of phase transitions to perfect generalization for the Ising perceptron for $\alpha = m/d < 1.448$ in the thermodynamic limit [5]) can be derived from this estimate, and possibly improved with the use of tighter estimates on $|H_\epsilon|_i$.

We now formally introduce a family of estimates on $|H_\epsilon|_i$ in order to discuss the potential of our formalism. For any $m$, $\epsilon$ and $\omega \ge 1.0$ there exists $C_{\omega,m} > 0$ such that

$$|H_\epsilon|_i \le |H|_i \le C_{\omega,m} \binom{m}{i} P_H(m)^{1 - |1 - 2i/m|^\omega} \qquad (0 \le i \le m). \qquad (8)$$

We shall call such an estimate an $\omega$-uniformity bound.

Corollary 4 (i) If an $\omega$-uniformity bound (8) holds, then

$$\mu^m(Q_\epsilon^m) \le A_{\epsilon,m,\gamma}\, C_{\omega,m} \sum_{j \ge \gamma m} \binom{m}{j} P_H(2m)^{1 - |1 - j/m|^\omega}; \qquad (9)$$

(ii) if additionally $d = d_{VC}(H) < \infty$, then

$$\mu^m(Q_\epsilon^m) \le A_{\epsilon,m,\gamma}\, C_{\omega,m} \sum_{j \ge \gamma m} \binom{m}{j} \left( 2^{-2m} (2em/d)^d \right)^{1 - |1 - j/m|^\omega}. \quad \Box \qquad (10)$$

3 Examples of learning curves

In this section we evaluate the above formalism on two examples of simple neural networks.

[Figure 2 (partial): learning curves for $\omega = 2$ (dotted line) and $\omega = 3$ (chain line); horizontal axis $m/(d+1)$.]
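The way Lemma 1 turns tail bounds such as Corollary 3.i and bound (10) into learning-curve bounds can be sketched numerically. The following Python sketch is ours, not from the paper: it takes Corollary 3.i in the form $2 \cdot 2^{-m\epsilon/2}(2em/d)^d$ and bound (10) with the choices $\gamma = \epsilon/2$ and $A_{\epsilon,m,\epsilon/2} \le 2$ (the Chernoff-bound estimate from the proof of Corollary 3), writes `w` for the order of uniformity $\omega$, and uses illustrative values $m = 1500$, $d = 100$, $C_{\omega,m} = 1$ that do not appear in the paper.

```python
import math

def log_binom(m, j):
    # Natural log of the binomial coefficient C(m, j), via log-gamma for stability.
    return math.lgamma(m + 1) - math.lgamma(j + 1) - math.lgamma(m - j + 1)

def psi_vc(eps, m, d):
    # Tail bound of Corollary 3.i: 2 * 2^(-m*eps/2) * (2em/d)^d, valid for m > 8/eps.
    # (For smaller eps the expression exceeds 1, so the min(1, .) below absorbs it.)
    return 2.0 * 2.0 ** (-m * eps / 2.0) * (2.0 * math.e * m / d) ** d

def psi_w(eps, m, d, w, C=1.0):
    # Sketch of bound (10) with gamma = eps/2 and A_{eps,m,eps/2} <= 2.
    log_q = -2.0 * m * math.log(2.0) + d * math.log(2.0 * math.e * m / d)  # ln(2^{-2m}(2em/d)^d)
    total = 0.0
    for j in range(math.ceil(eps * m / 2.0), m + 1):
        expo = 1.0 - abs(1.0 - j / m) ** w
        # Cap the exponent: once a term exceeds e^50, min(1, psi) is 1 anyway.
        total += math.exp(min(log_binom(m, j) + expo * log_q, 50.0))
    return 2.0 * C * total

def learning_curve_bound(psi, m, n_grid=400):
    # Lemma 1: eps_H(m) <= integral_0^1 min(1, Psi(eps, m)) d(eps); midpoint rule.
    h = 1.0 / n_grid
    return sum(min(1.0, psi((i + 0.5) * h, m)) * h for i in range(n_grid))

# Illustrative values (ours): d = 100, m/d = 15.
m, d = 1500, 100
b_vc = learning_curve_bound(lambda e, m: psi_vc(e, m, d), m)
b_w2 = learning_curve_bound(lambda e, m: psi_w(e, m, d, w=2), m)
b_w5 = learning_curve_bound(lambda e, m: psi_w(e, m, d, w=5), m)
# A larger order of uniformity gives a smaller bound (b_w5 < b_w2 < b_vc),
# qualitatively reproducing the top-down ordering of the bands in Figure 1(a).
```

The sum in `psi_w` is evaluated in log-space because the binomial coefficients overflow ordinary floats for $m$ of this size; capping each term at $e^{50}$ is harmless since Lemma 1 only ever uses $\min(1, \Psi)$.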