{"title": "Training Data Selection for Optimal Generalization in Trigonometric Polynomial Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 624, "page_last": 630, "abstract": null, "full_text": "Training Data Selection \n\nfor Optimal Generalization \n\nin Trigonometric Polynomial Networks \n\nMasashi Sugiyama*and Hidemitsu Ogawa \n\nDepartment of Computer Science, Tokyo Institute of Technology, \n\n2-12-1, O-okayama, Meguro-ku, Tokyo, 152-8552, Japan. \n\nsugi@cs. titeck. ac.jp \n\nAbstract \n\nIn this paper, we consider the problem of active learning in trigonomet(cid:173)\nric polynomial networks and give a necessary and sufficient condition of \nsample points to provide the optimal generalization capability. By ana(cid:173)\nlyzing the condition from the functional analytic point of view, we clarify \nthe mechanism of achieving the optimal generalization capability. We \nalso show that a set of training examples satisfying the condition does \nnot only provide the optimal generalization but also reduces the compu(cid:173)\ntational complexity and memory required for the calculation of learning \nresults. Finally, examples of sample points satisfying the condition are \ngiven and computer simulations are performed to demonstrate the effec(cid:173)\ntiveness of the proposed active learning method. \n\n1 \n\nIntroduction \n\nSupervised learning is obtaining an underlying rule from training examples, and can \nbe formulated as a function approximation problem. If sample points are actively \ndesigned, then learning can be performed more efficiently. In this paper, we discuss \nthe problem of designing sample points, referred to as active learning, for optimal \ngeneralization. \n\nActive learning is classified into two categories depending on the optimality. One \nis global optimal, where a set of all training examples is optimal (e.g. Fedorov \n[3]) . 
The other is greedy optimal, where the next training example to sample is optimal at each step (e.g. MacKay [5], Cohn [2], Fukumizu [4], and Sugiyama and Ogawa [10]). In this paper, we focus on the global optimal case and give a new active learning method for trigonometric polynomial networks. The proposed method employs no approximations in its derivation, so it provides exactly the optimal generalization capability. Moreover, the proposed method reduces the computational complexity and memory required for the calculation of learning results. Finally, the effectiveness of the proposed method is demonstrated through computer simulations.

* http://ogawa-www.cs.titech.ac.jp/~sugi.

2 Formulation of supervised learning

In this section, the supervised learning problem is formulated from the functional analytic point of view (see Ogawa [7]). Then, our learning criterion and model are described.

2.1 Supervised learning as an inverse problem

Let us consider the problem of obtaining the optimal approximation to a target function f(x) of L variables from a set of M training examples. The training examples are made up of sample points x_m in D, where D is a subset of the L-dimensional Euclidean space R^L, and corresponding sample values y_m in C:

{(x_m, y_m) : y_m = f(x_m) + n_m, m = 1, ..., M},   (1)

where y_m is degraded by zero-mean additive noise n_m. Let n and y be M-dimensional vectors whose m-th elements are n_m and y_m, respectively. y is called a sample value vector. In this paper, the target function f(x) is assumed to belong to a reproducing kernel Hilbert space H (Aronszajn [1]). If H is unknown, then it can be estimated by model selection methods (e.g. Sugiyama and Ogawa [9]). Let K(·,·) be the reproducing kernel of H.
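The reproducing property (f, K(·, x)) = f(x) drives the whole formulation, so a tiny numerical check may help intuition. The sketch below is an illustration only: it assumes a one-dimensional space spanned by the complex exponentials e^{i n ξ} for |n| ≤ N, represented by Fourier coefficient vectors with inner product (f, g) = Σ_n f_n conj(g_n); the variable names are ours, not the paper's.

```python
import numpy as np

# Illustration: H spanned by e^{i n xi}, |n| <= N, in coefficient form.
N = 3
ns = np.arange(-N, N + 1)

rng = np.random.default_rng(0)
f = rng.standard_normal(ns.size) + 1j * rng.standard_normal(ns.size)

def point_eval(coef, x):
    # f(x) = sum_n coef_n * e^{i n x}
    return coef @ np.exp(1j * ns * x)

x0 = 0.7
k_x0 = np.exp(-1j * ns * x0)   # coefficients of K(., x0)
inner = f @ np.conj(k_x0)      # inner product (f, K(., x0))
assert np.isclose(inner, point_eval(f, x0))   # reproduces the point value
```

Note that in this toy space K(x, x) = 2N + 1 for every x, a constancy that becomes relevant to Theorem 1 later in the paper.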
If a function ψ_m(x) is defined as ψ_m(x) = K(x, x_m), then the value of f at a sample point x_m is expressed as f(x_m) = (f, ψ_m), where (·,·) stands for the inner product. For this reason, ψ_m is called a sampling function. Let A be an operator defined as

A = Σ_{m=1}^{M} (e_m ⊗ ψ_m),   (2)

where e_m is the m-th vector of the so-called standard basis in C^M and (· ⊗ ·) stands for the Neumann-Schatten product[*1]. A is called a sampling operator. Then, the relationship between f and y can be expressed as

y = Af + n.   (3)

Let us denote a mapping from y to a learning result f_0 by X:

f_0 = Xy,   (4)

where X is called a learning operator. Then, the supervised learning problem is reformulated as an inverse problem of obtaining the X providing the best approximation f_0 to f under a certain learning criterion.

2.2 Learning criterion and model

As mentioned above, function approximation is performed on the basis of a learning criterion. Our purpose of learning is to minimize the generalization error of the learning result f_0 measured by

J_G = E_n ||f_0 - f||^2,   (5)

where E_n denotes the ensemble average over the noise. In this paper, we adopt projection learning as our learning criterion. Let A*, R(A*), and P_{R(A*)} be the adjoint operator of A, the range of A*, and the orthogonal projection operator onto R(A*), respectively. Then, projection learning is defined as follows.

[*1] For any fixed g in a Hilbert space H1 and any fixed f in a Hilbert space H2, the Neumann-Schatten product (f ⊗ g) is an operator from H1 to H2 defined by using any h in H1 as (f ⊗ g)h = (h, g)f.

Definition 1 (Projection learning) (Ogawa [6]) An operator X is called the projection learning operator if X minimizes the functional J_P[X] = E_n ||Xn||^2 under the constraint XA = P_{R(A*)}.
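To make the operator notation concrete, here is a minimal sketch assuming the one-dimensional trigonometric model of Section 2.2 in its exponential-basis coefficient representation, so that the sampling operator A becomes an M x μ matrix with entries e^{i n x_m} and (Af)_m = f(x_m). Anticipating Proposition 1, the candidate learning operator is the Moore-Penrose inverse, and the constraint XA = P_{R(A*)} is checked numerically. All names here are illustrative, not from the paper.

```python
import numpy as np

N, M = 3, 10                    # space order and number of samples
ns = np.arange(-N, N + 1)
rng = np.random.default_rng(0)
xs = rng.uniform(-np.pi, np.pi, size=M)   # generic sample points

# Sampling operator in coefficient form: (A f)_m = (f, psi_m) = f(x_m).
A = np.exp(1j * np.outer(xs, ns))         # shape (M, 2N+1)

f = rng.standard_normal(ns.size)
y = A @ f + 0.1 * rng.standard_normal(M)  # y = Af + n   (Eq. 3)

X = np.linalg.pinv(A)           # candidate learning operator
P = X @ A                       # XA should be the orthogonal projector
assert np.allclose(P.conj().T, P)         # onto R(A*): Hermitian ...
assert np.allclose(P @ P, P)              # ... and idempotent
f0 = X @ y                      # learning result f0 = Xy   (Eq. 4)
```

With M larger than the dimension 2N + 1 and generic points, A has full column rank and the projector is simply the identity on H; rank-deficient designs make the projection nontrivial.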
It is well-known that Eq.(5) can be decomposed into the bias and the variance:

J_G = ||P_{R(A*)} f - f||^2 + E_n ||Xn||^2.   (6)

Eq.(6) implies that the projection learning criterion reduces the bias to a certain level and minimizes the variance.

Let us consider the following function space.

Definition 2 (Trigonometric polynomial space) Let x = (ξ^(1), ξ^(2), ..., ξ^(L))^T. For 1 ≤ l ≤ L, let N_l be a positive integer and D_l = [-π, π]. Then, a function space H is called a trigonometric polynomial space of order (N_1, N_2, ..., N_L) if H is spanned by the functions

∏_{l=1}^{L} exp(i n_l ξ^(l)),   (7)

where each n_l is an integer with -N_l ≤ n_l ≤ N_l.   (8)

The dimension μ of a trigonometric polynomial space of order (N_1, N_2, ..., N_L) is μ = ∏_{l=1}^{L} (2 N_l + 1), and the reproducing kernel of this space is expressed as

K(x, x') = ∏_{l=1}^{L} K_l(ξ^(l), ξ^(l)'),   (9)

where

K_l(ξ^(l), ξ^(l)') = sin((2 N_l + 1)(ξ^(l) - ξ^(l)')/2) / sin((ξ^(l) - ξ^(l)')/2)   if ξ^(l) ≠ ξ^(l)',
K_l(ξ^(l), ξ^(l)') = 2 N_l + 1   if ξ^(l) = ξ^(l)'.   (10)

3 Active learning in the trigonometric polynomial space

The problem of active learning is to find a set {x_m} of sample points providing the optimal generalization capability. In this section, we give the optimal solution to the active learning problem in the trigonometric polynomial space.

Let A† be the Moore-Penrose generalized inverse[*2] of A. Then, the following proposition holds.

Proposition 1 If the noise covariance matrix Q is given as Q = σ^2 I with σ^2 > 0, then the projection learning operator X is expressed as X = A†.

Note that the sampling operator A is uniquely determined by {x_m} (see Eq.(2)). From Eq.(6), the bias of a learning result f_0 becomes zero for all f in H if and only if N(A) = {0}, where N(·) stands for the null space of an operator. For this reason,

[*2] An operator X is called the Moore-Penrose generalized inverse of an operator A if X satisfies AXA = A, XAX = X, (AX)* = AX, and (XA)* = XA.
Figure 1: Mechanism of noise suppression by Theorem 1. If a set {x_m} of sample points satisfies A*A = MI, then XAf = f, ||Xn_1|| = (1/√M)||n_1||, and Xn_2 = 0.

we consider the case where a set {x_m} of sample points satisfies N(A) = {0}. In this case, Eq.(6) is reduced to

J_G = E_n ||Xn||^2,   (11)

which is equivalent to the noise variance in H. Consequently, the problem of active learning becomes the problem of finding a set {x_m} of sample points minimizing Eq.(11) under the constraint N(A) = {0}.

First, we derive a condition for optimal generalization in terms of the sampling operator A.

Theorem 1 Assume that the noise covariance matrix Q is given as Q = σ^2 I with σ^2 > 0. Then, J_G in Eq.(11) is minimized under the constraint N(A) = {0} if and only if

A*A = MI,   (12)

where I denotes the identity operator on H. In this case, the minimum value of J_G is σ^2 μ / M, where μ is the dimension of H.

Eq.(12) implies that {(1/√M) ψ_m} forms a pseudo orthonormal basis (Ogawa [8]) in H, which is an extension of an orthonormal basis. The following lemma gives an interpretation of Theorem 1.

Lemma 1 When a set {x_m} of sample points satisfies Eq.(12), it holds that

XAf = f   for all f in H,   (13)
||Af|| = √M ||f||   for all f in H,   (14)
||Xu|| = (1/√M) ||u||   for u in R(A),  and  Xu = 0   for u in R(A)⊥.   (15)

Eqs.(14) and (15) imply that (1/√M) A is an isometry and √M X is a partial isometry with initial space R(A), respectively. Let us decompose the noise n as n = n_1 + n_2, where n_1 is in R(A) and n_2 is in R(A)⊥. Then, the sample value vector y is rewritten as y = Af + n_1 + n_2. It follows from Eq.(13) that the signal component Af is transformed into the original function f by X. From Eq.(15), X suppresses the magnitude of the noise n_1 in R(A) by the factor 1/√M and completely removes the noise n_2 in R(A)⊥.

Figure 2: Two examples of sample points such that Condition (12) holds (μ = 3 and M = 6). (a) Theorem 2. (b) Theorem 3.

This analysis is summarized in Fig.1. Note that Theorem 1 and its interpretation are valid for all Hilbert spaces such that K(x, x) is a constant for any x.

In Theorem 1, we have given a necessary and sufficient condition to minimize J_G in terms of the sampling operator A. Now we give two examples of sample points {x_m} such that Condition (12) holds. From here on, we focus on the case when the dimension L of the input x is 1 for simplicity. However, the following results can be easily extended to the case when L > 1.

Theorem 2 Let M ≥ μ, where μ is the dimension of H. Let c be an arbitrary constant such that -π < c ≤ -π + 2π/M. If a set {x_m} of sample points is determined as

x_m = c + (2π/M)(m - 1),   (16)

then Eq.(12) holds.

Theorem 3 Let M = kμ, where k is a positive integer. Let c be an arbitrary constant such that -π ≤ c ≤ -π + 2π/μ. If a set {x_m} of sample points is determined as

x_m = c + (2π/μ) r,  where r = (m - 1) mod μ,   (17)

then Eq.(12) holds.

Theorem 2 means that M sample points are fixed at intervals of 2π/M in the domain [-π, π] and a sample value is gathered once at each point (see Fig.2 (a)). In contrast, Theorem 3 means that μ sample points are fixed at intervals of 2π/μ in the domain and sample values are gathered k times at each point (see Fig.2 (b)).

Now, we discuss calculation methods of the projection learning result f_0(x). Let h_m be the m-th column vector of the M-dimensional matrix (AA*)†.
Then, for general sample points, the projection learning result f_0(x) can be calculated as

f_0(x) = Σ_{m=1}^{M} (y, h_m) ψ_m(x).   (18)

When we use optimal sample points satisfying Condition (12), the following theorems hold.

Theorem 4 When Eq.(12) holds, the projection learning result f_0(x) can be calculated as

f_0(x) = (1/M) Σ_{m=1}^{M} y_m ψ_m(x).   (19)

Theorem 5 When sample points are determined following Theorem 3, the projection learning result f_0(x) can be calculated as

f_0(x) = (1/μ) Σ_{p=1}^{μ} ȳ_p ψ_p(x),  where ȳ_p = (1/k) Σ_{q=1}^{k} y_{p + μ(q-1)}.   (20)

In Eq.(18), the coefficient of ψ_m(x) is obtained by the inner product (y, h_m). In contrast, it is replaced with y_m / M in Eq.(19), which implies that the Moore-Penrose generalized inverse of AA* is not required for calculating f_0(x). This property is quite useful when the number M of training examples is very large, since the calculation of the Moore-Penrose generalized inverse of a high dimensional matrix is sometimes unstable. In Eq.(20), the number of basis functions is reduced to μ and the coefficient of ψ_p(x) is obtained by ȳ_p / μ, where ȳ_p is the mean of the sample values at x_p.

For general sample points, the computational complexity and memory required for calculating f_0(x) by Eq.(18) are both O(M^2). In contrast, Theorem 4 states that if a set of sample points satisfies Eq.(12), then both the computational complexity and memory are reduced to O(M). Hence, Theorem 1 and Theorem 4 not only provide the optimal generalization but also reduce the computational complexity and memory. Moreover, if we determine sample points following Theorem 3 and calculate the learning result f_0(x) by Theorem 5, then the computational complexity and memory are reduced to O(μ).
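The O(μ) shortcut can be checked numerically. The sketch below, under the same illustrative assumptions as before (L = 1, exponential-basis coefficient representation, our own variable names), builds Theorem-2 sample points, verifies Condition (12), and confirms that the Moore-Penrose route of Eq.(18) and the averaging shortcut of Theorem 4 give the same learning result.

```python
import numpy as np

N = 3
mu = 2 * N + 1                     # dimension of H
M = 12                             # number of samples, M >= mu
ns = np.arange(-N, N + 1)
c = -np.pi + 1e-3                  # any c in (-pi, -pi + 2*pi/M]
xs = c + 2 * np.pi * np.arange(M) / M      # Theorem 2 sample points

A = np.exp(1j * np.outer(xs, ns))          # sampling operator, coefficient form
assert np.allclose(A.conj().T @ A, M * np.eye(mu))   # Condition (12): A*A = M I

rng = np.random.default_rng(0)
y = rng.standard_normal(M)

coef_general = np.linalg.pinv(A) @ y       # Eq.(18) route via (AA*)^dagger, O(M^2)
coef_fast = A.conj().T @ y / M             # Theorem 4: f0 = (1/M) sum_m y_m psi_m
assert np.allclose(coef_general, coef_fast)
```

Repeating each of μ Theorem-3 points k times and replacing y by the per-point means reproduces the O(μ) cost of Theorem 5 in the same way.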
This is extremely efficient since μ does not depend on the number M of training examples. The above results are shown in Tab.1.

4 Simulations

In this section, the effectiveness of the proposed active learning method is demonstrated through computer simulations.

Let H be a trigonometric polynomial space of order 100, and let the noise covariance matrix Q be Q = I. Let us consider the following three sampling schemes.

(A) Optimal sampling: Training examples are gathered following Theorem 3.
(B) Experimental design: Eq.(2) in Cohn [2] is adopted as the active learning criterion. The value of this criterion is evaluated at 30 reference points. The sampling location is determined by multi-point search with 3 candidates.
(C) Passive learning: Training examples are given unilaterally.

Fig.3 shows the relation between the number of training examples and the generalization error. The horizontal and vertical axes display the number of training examples and the generalization error J_G measured by Eq.(5), respectively. The solid line shows the sampling scheme (A). The dashed and dotted lines denote the averages over 10 trials of the sampling schemes (B) and (C), respectively. When the number of training examples is 201, the generalization error of the sampling scheme (A) is 1, while the generalization errors of the sampling schemes (B) and (C) are 3.18 x 10^4 and 8.75 x 10^4, respectively. This graph illustrates that the proposed sampling scheme gives much better generalization capability than the sampling schemes (B) and (C), especially when the number of training examples is not so large.

5 Conclusion

We proposed a new active learning method in the trigonometric polynomial space.
The proposed method provides exactly the optimal generalization capability and, at the same time, it reduces the computational complexity and memory required for the calculation of learning results. The mechanism of achieving the optimal generalization was clarified from the functional analytic point of view.

Table 1: Computational complexity and memory required for projection learning.

Calculation method | Computational complexity and memory
Eq.(18)            | O(M^2)
Theorem 4          | O(M)
Theorem 5 [*3]     | O(μ)

[*3] M = kμ, where μ is the dimension of H and k is a positive integer.

Figure 3: Relation between the number of training examples and the generalization error (solid: optimal sampling; dashed: experimental design; dotted: passive learning).

References

[1] N. Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68:337-404, 1950.
[2] D. Cohn. Neural network exploration using optimal experiment design. In J. Cowan et al. (Eds.), Advances in Neural Information Processing Systems 6, pp. 679-686. Morgan Kaufmann Publishers Inc., San Mateo, CA, 1994.
[3] V. V. Fedorov. Theory of Optimal Experiments. Academic Press, New York, 1972.
[4] K. Fukumizu. Active learning in multilayer perceptrons. In D. Touretzky et al. (Eds.), Advances in Neural Information Processing Systems 8, pp. 295-301. The MIT Press, Cambridge, 1996.
[5] D. MacKay. Information-based objective functions for active data selection.
Neural Computation, 4(4):590-604, 1992.
[6] H. Ogawa. Projection filter regularization of ill-conditioned problem. In Proceedings of SPIE, 808, Inverse Problems in Optics, pp. 189-196, 1987.
[7] H. Ogawa. Neural network learning, generalization and over-learning. In Proceedings of the ICIIPS'92, International Conference on Intelligent Information Processing & System, vol. 2, pp. 1-6, Beijing, China, 1992.
[8] H. Ogawa. Theory of pseudo biorthogonal bases and its application. In Research Institute for Mathematical Sciences, RIMS Kokyuroku, 1067, Reproducing Kernels and their Applications, pp. 24-38, 1998.
[9] M. Sugiyama and H. Ogawa. Functional analytic approach to model selection - subspace information criterion. In Proceedings of 1999 Workshop on Information-Based Induction Sciences (IBIS'99), pp. 93-98, Syuzenji, Shizuoka, Japan, 1999 (complete version available at ftp://ftp.cs.titech.ac.jp/pub/TR/99/TR99-0009.ps.gz).
[10] M. Sugiyama and H. Ogawa. Incremental active learning in consideration of bias. Technical Report of IEICE, NC99-56, pp. 15-22, 1999 (complete version available at ftp://ftp.cs.titech.ac.jp/pub/TR/99/TR99-0010.ps.gz).", "award": [], "sourceid": 1671, "authors": [{"given_name": "Masashi", "family_name": "Sugiyama", "institution": null}, {"given_name": "Hidemitsu", "family_name": "Ogawa", "institution": null}]}