{"title": "Active Learning in Multilayer Perceptrons", "book": "Advances in Neural Information Processing Systems", "page_first": 295, "page_last": 301, "abstract": null, "full_text": "Active Learning in Multilayer Perceptrons\n\nInformation and Communication R&D Center, Ricoh Co., Ltd.\n\nKenji Fukumizu\n\n3-2-3, Shin-yokohama, Yokohama, 222 Japan\n\nE-mail: fuku@ic.rdc.ricoh.co.jp\n\nAbstract\n\nWe propose an active learning method with hidden-unit reduction, which is devised specially for multilayer perceptrons (MLP). First, we review our active learning method, and point out that many Fisher-information-based methods applied to MLP have a critical problem: the information matrix may be singular. To solve this problem, we derive the singularity condition of an information matrix, and propose an active learning technique that is applicable to MLP. Its effectiveness is verified through experiments.\n\n1 INTRODUCTION\n\nWhen one trains a learning machine using a set of data given by the true system, its ability can be improved if one selects the training data actively. In this paper, we consider the problem of active learning in multilayer perceptrons (MLP). First, we review our method of active learning (Fukumizu et al., 1994), in which we prepare a probability distribution and obtain training data as samples from the distribution. This methodology leads us to an information-matrix-based criterion similar to other existing ones (Fedorov, 1972; Pukelsheim, 1993).\n\nActive learning techniques have recently been used with neural networks (MacKay, 1992; Cohn, 1994). Our method, however, like many others, has a crucial problem: the required inverse of an information matrix may not exist (White, 1989).\n\nWe propose an active learning technique which is applicable to three-layer perceptrons. 
Developing a theory on the singularity of a Fisher information matrix, we present an active learning algorithm which keeps the information matrix nonsingular. We demonstrate the effectiveness of the algorithm through experiments.\n\n2 STATISTICALLY OPTIMAL TRAINING DATA\n\n2.1 A CRITERION OF OPTIMALITY\n\nWe review the criterion of statistically optimal training data (Fukumizu et al., 1994). We consider the regression problem in which the target system maps a given input $x$ to $y$ according to\n\n$$y = f(x) + Z,$$\n\nwhere $f(x)$ is a deterministic function from $\\mathbb{R}^L$ to $\\mathbb{R}^M$, and $Z$ is a random variable whose law is a normal distribution $N(0, \\sigma^2 I_M)$ ($I_M$ is the unit $M \\times M$ matrix). Our objective is to estimate the true function $f$ as accurately as possible.\n\nLet $\\{f(x; \\theta)\\}$ be a parametric model for estimation. We use the maximum likelihood estimator (MLE) $\\hat{\\theta}$ for training data $\\{(x^{(\\nu)}, y^{(\\nu)})\\}_{\\nu=1}^{N}$, which minimizes the sum of squared errors in this case. In theoretical derivations, we assume that the target function $f$ is included in the model and equal to $f(\\cdot\\,; \\theta_0)$.\n\nWe make a training example by choosing $x^{(\\nu)}$ to try, observing the resulting output $y^{(\\nu)}$, and pairing them. The problem of active learning is how to determine input data $\\{x^{(\\nu)}\\}_{\\nu=1}^{N}$ to minimize the estimation error after training. Our approach is a statistical one using a probability for training, $r(x)$, and choosing $\\{x^{(\\nu)}\\}_{\\nu=1}^{N}$ as independent samples from $r(x)$ to minimize the expectation of the MSE in the actual environment:\n\n$$\\mathrm{EMSE} := E_{\\{(x^{(\\nu)}, y^{(\\nu)})\\}} \\left[ \\iint \\|y - f(x; \\hat{\\theta})\\|^2 \\, p(y \\mid x) \\, dy \\, dQ(x) \\right]. \\qquad (1)$$\n\nIn the above equation, $Q$ is the environmental probability which gives input vectors to the true system in the actual environment, and $E_{\\{(x^{(\\nu)}, y^{(\\nu)})\\}}$ means the expectation on training data. Eq. (1), therefore, shows the average error of the trained machine that is used as a substitute of the true function in the actual environment. 
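\n\nAs a brief worked check of the statement that the MLE minimizes the sum of squared errors, the step can be written out as follows. This is only a sketch under the noise model stated above; the symbols $p(y \\mid x; \\theta)$ and $\\ell(\\theta)$ are notation introduced here for illustration and do not appear in the original text. With $Z \\sim N(0, \\sigma^2 I_M)$, the model density of $y$ given $x$ is\n\n$$p(y \\mid x; \\theta) = (2\\pi\\sigma^2)^{-M/2} \\exp\\left( -\\frac{\\|y - f(x; \\theta)\\|^2}{2\\sigma^2} \\right),$$\n\nso the log-likelihood of $N$ independent training pairs is\n\n$$\\ell(\\theta) = -\\frac{1}{2\\sigma^2} \\sum_{\\nu=1}^{N} \\|y^{(\\nu)} - f(x^{(\\nu)}; \\theta)\\|^2 - \\frac{NM}{2} \\log(2\\pi\\sigma^2).$$\n\nThe second term does not depend on $\\theta$, and neither does the density $r(x)$ from which the inputs are drawn, so maximizing $\\ell(\\theta)$ is the same as minimizing $\\sum_{\\nu} \\|y^{(\\nu)} - f(x^{(\\nu)}; \\theta)\\|^2$.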
\n\n2.2 REVIEW OF AN ACTIVE LEARNING METHOD\n\nUsing statistical asymptotic theory, Eq. (1) is approximated as follows:\n\n$$\\mathrm{EMSE} = \\sigma^2 + \\frac{\\sigma^2}{N} \\mathrm{Tr}\\left[ I(\\theta_0) J^{-1}(\\theta_0) \\right] + O(N^{-3/2}), \\qquad (2)$$\n\nwhere the matrices $I$ and $J$ are (Fisher) information matrices defined by\n\n$$I(\\theta) = \\int I(x; \\theta) \\, dQ(x), \\qquad J(\\theta) = \\int I(x; \\theta) \\, r(x) \\, dx.$$\n\nThe essential part of Eq. (2) is $\\mathrm{Tr}[I(\\theta_0) J^{-1}(\\theta_0)]$, computed by the unavailable parameter $\\theta_0$. We have proposed a practical algorithm in which we replace $\\theta_0$ with $\\hat{\\theta}$, prepare a family of probability $\\{r(x; v) \\mid v : \\text{parameter}\\}$ to choose training samples, and optimize $v$ and $\\hat{\\theta}$ iteratively (Fukumizu et al., 1994).\n\nActive Learning Algorithm\n\n1. Select an initial training data set $D_{[0]}$ from $r(x; v_{[0]})$, and compute $\\hat{\\theta}_{[0]}$.\n2. $k := 1$.\n3. Compute the optimal $v = v_{[k]}$ to minimize $\\mathrm{Tr}[I(\\hat{\\theta}_{[k-1]}) J^{-1}(\\hat{\\theta}_{[k-1]})]$.\n4. Choose $N_k$ new training data from $r(x; v_{[k]})$ and let $D_{[k]}$ be a union of $D_{[k-1]}$ and the new data.\n5. Compute the MLE $\\hat{\\theta}_{[k]}$ based on the training data set $D_{[k]}$.\n6. $k := k + 1$ and go to 3.\n\nThe above method utilizes a probability to generate training data. It has the advantage of generating many data points in one step, compared to existing methods in which only one data point is chosen in each step, though their criteria are similar to each other.\n\n3 SINGULARITY OF AN INFORMATION MATRIX\n\n3.1 A PROBLEM ON ACTIVE LEARNING IN MLP\n\nHereafter, we focus on active learning in three-layer perceptrons with $H$ hidden units, $\\mathcal{N}_H = \\{f(x; \\theta)\\}$. The map $f(x; \\theta)$ is defined by\n\n$$f_i(x; \\theta) = \\sum_{j=1}^{H} w_{ij} \\, s\\left( \\sum_{k=1}^{L} u_{jk} x_k + \\zeta_j \\right) + \\eta_i, \\qquad (1 \\le i \\le M), \\qquad (3)$$\n\nwhere $s(t)$ is the sigmoidal function: $s(t) = 1/(1 + e^{-t})$.\n\nOur active learning method, like many others, requires the inverse of an information matrix $J$. The information matrix of MLP, however, is not always invertible (White, 1989). 
Any statistical algorithm utilizing the inverse, then, cannot be applied directly to MLP (Hagiwara et al., 1993). Such problems do not arise in linear models, which almost always have a nonsingular information matrix.\n\n3.2 SINGULARITY OF AN INFORMATION MATRIX OF MLP\n\nThe following theorem shows that the information matrix of a three-layer perceptron is singular if and only if the network has redundant hidden units. We can deduce that if the information matrix is singular, we can make it nonsingular by eliminating redundant hidden units without changing the input-output map.\n\nTheorem 1 Assume $r(x)$ is continuous and positive at any $x$. Then, the Fisher information matrix $J$ is singular if and only if at least one of the following three conditions is satisfied:\n(1) $u_j := (u_{j1}, \\ldots, u_{jL})^T = 0$, for some $j$.\n(2) $w_j := (w_{1j}, \\ldots, w_{Mj}) = 0^T$, for some $j$.\n(3) For different $j_1$ and $j_2$, $(u_{j_1}, \\zeta_{j_1}) = (u_{j_2}, \\zeta_{j_2})$ or $(u_{j_1}, \\zeta_{j_1}) = -(u_{j_2}, \\zeta_{j_2})$.\n\nThe rough sketch of the proof is shown below. The complete proof will appear in a forthcoming paper (Fukumizu, 1996).\n\nRough sketch of the proof. It is easy to see that an information matrix is singular if and only if $\\{\\partial f(x; \\theta) / \\partial \\theta_a\\}_a$ are linearly dependent. The sufficiency can be proved easily. To show the necessity, we show that the derivatives are linearly independent if none of the three conditions is satisfied. Assume a linear relation:\n\n$$\\sum_{j=1}^{H} \\alpha_{ij} s(u_j \\cdot x + \\zeta_j) + \\alpha_{i0} + \\sum_{j=1}^{H} \\sum_{k=1}^{L} \\beta_{jk} w_{ij} s'(u_j \\cdot x + \\zeta_j) x_k + \\sum_{j=1}^{H} \\beta_{j0} w_{ij} s'(u_j \\cdot x + \\zeta_j) = 0, \\qquad (1 \\le i \\le M). \\qquad (4)$$\n\nWe can show there exists a basis of $\\mathbb{R}^L$, $(x^{(1)}, \\ldots, x^{(L)})$, such that $u_j \\cdot x^{(l)} \\neq 0$ for $\\forall j, \\forall l$, and $u_{j_1} \\cdot x^{(l)} + \\zeta_{j_1} \\neq \\pm (u_{j_2} \\cdot x^{(l)} + \\zeta_{j_2})$ for $j_1 \\neq j_2, \\forall l$. We replace $x$ in Eq. (4) by $x^{(l)} t$ ($t \\in \\mathbb{R}$). Let $m_j^{(l)} := u_j \\cdot x^{(l)}$, $S_j^{(l)} := \\{z \\in \\mathbb{C} \\mid z = ((2n+1)\\pi\\sqrt{-1} - \\zeta_j) / m_j^{(l)}, \\; n \\in \\mathbb{Z}\\}$, and $D^{(l)} := \\mathbb{C} - \\bigcup_j S_j^{(l)}$. The points in $S_j^{(l)}$ are the singularities of $s(m_j^{(l)} z + \\zeta_j)$. 
We define holomorphic functions on $D^{(l)}$ as\n\n$$\\phi_i^{(l)}(z) := \\sum_{j=1}^{H} \\alpha_{ij} s(m_j^{(l)} z + \\zeta_j) + \\alpha_{i0} + \\sum_{j=1}^{H} \\sum_{k=1}^{L} \\beta_{jk} w_{ij} s'(m_j^{(l)} z + \\zeta_j) x_k^{(l)} z + \\sum_{j=1}^{H} \\beta_{j0} w_{ij} s'(m_j^{(l)} z + \\zeta_j), \\qquad (1 \\le i \\le M).$$\n\nFrom Eq. (4), we have $\\phi_i^{(l)}(t) = 0$ for all $t \\in \\mathbb{R}$. Using standard arguments on isolated singularities of holomorphic functions, we know $S_j^{(l)}$ are removable singularities of $\\phi_i^{(l)}(z)$, and finally obtain\n\n$$w_{ij} \\sum_{k=1}^{L} \\beta_{jk} x_k^{(l)} = 0, \\qquad w_{ij} \\beta_{j0} = 0, \\qquad \\alpha_{ij} = 0, \\qquad \\alpha_{i0} = 0.$$\n\nIt is easy to see $\\beta_{jk} = 0$. This completes the proof.\n\n3.3 REDUCTION PROCEDURE\n\nWe introduce the following reduction procedure based on Theorem 1. Used during BP training, it eliminates redundant hidden units and keeps the information matrix nonsingular. The criterion of elimination is very important, because excessive elimination of hidden units degrades the approximation capacity. We propose an algorithm which does not increase the mean squared error on average. In the following, let $s_j := s(\\hat{u}_j \\cdot x + \\hat{\\zeta}_j)$ and $\\epsilon(N) := A/N$ for a positive number $A$.\n\nReduction Procedure\n\n1. If $\\|\\hat{w}_j\\|^2 \\int (s_j - s(\\hat{\\zeta}_j))^2 dQ < \\epsilon(N)$, then eliminate the $j$th hidden unit, and $\\hat{\\eta}_i \\to \\hat{\\eta}_i + \\hat{w}_{ij} s(\\hat{\\zeta}_j)$ for all $i$.\n2. If $\\|\\hat{w}_j\\|^2 \\int (s_j)^2 dQ < \\epsilon(N)$, then eliminate the $j$th hidden unit.\n3. If $\\|\\hat{w}_{j_2}\\|^2 \\int (s_{j_2} - s_{j_1})^2 dQ < \\epsilon(N)$ for different $j_1$ and $j_2$, then eliminate the $j_2$th hidden unit and $\\hat{w}_{ij_1} \\to \\hat{w}_{ij_1} + \\hat{w}_{ij_2}$ for all $i$.\n4. If $\\|\\hat{w}_{j_2}\\|^2 \\int (1 - s_{j_2} - s_{j_1})^2 dQ < \\epsilon(N)$ for different $j_1$ and $j_2$, then eliminate the $j_2$th hidden unit and $\\hat{w}_{ij_1} \\to \\hat{w}_{ij_1} - \\hat{w}_{ij_2}$, $\\hat{\\eta}_i \\to \\hat{\\eta}_i + \\hat{w}_{ij_2}$ for all $i$.\n\nFrom Theorem 1, we know that $\\hat{w}_j$, $\\hat{u}_j$, $(\\hat{u}_{j_2}, \\hat{\\zeta}_{j_2}) - (\\hat{u}_{j_1}, \\hat{\\zeta}_{j_1})$, or $(\\hat{u}_{j_2}, \\hat{\\zeta}_{j_2}) + (\\hat{u}_{j_1}, \\hat{\\zeta}_{j_1})$ can be reduced to 0 if the information matrix is singular. Let $\\tilde{\\theta} \\in \\mathcal{N}_K$ denote the reduced parameter from $\\hat{\\theta}$ according to the above procedure. The above four conditions are, then, given by calculating $\\int \\|f(x; \\tilde{\\theta}) - f(x; \\hat{\\theta})\\|^2 dQ$. 
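\n\nAs a worked check of the last statement, two of the four criteria can be expanded directly. This is a sketch that assumes $\\tilde{\\theta}$ is obtained from $\\hat{\\theta}$ exactly by the update rules listed in the procedure. For criterion 1, removing the $j$th unit and adding $\\hat{w}_{ij} s(\\hat{\\zeta}_j)$ to the bias gives $f_i(x; \\tilde{\\theta}) - f_i(x; \\hat{\\theta}) = -\\hat{w}_{ij} (s_j - s(\\hat{\\zeta}_j))$, so\n\n$$\\int \\|f(x; \\tilde{\\theta}) - f(x; \\hat{\\theta})\\|^2 dQ = \\|\\hat{w}_j\\|^2 \\int (s_j - s(\\hat{\\zeta}_j))^2 dQ.$$\n\nFor criterion 4, the two affected units contribute $(\\hat{w}_{ij_1} - \\hat{w}_{ij_2}) s_{j_1} + \\hat{w}_{ij_2}$ after reduction and $\\hat{w}_{ij_1} s_{j_1} + \\hat{w}_{ij_2} s_{j_2}$ before it, so\n\n$$f_i(x; \\tilde{\\theta}) - f_i(x; \\hat{\\theta}) = \\hat{w}_{ij_2} (1 - s_{j_1} - s_{j_2}), \\qquad \\int \\|f(x; \\tilde{\\theta}) - f(x; \\hat{\\theta})\\|^2 dQ = \\|\\hat{w}_{j_2}\\|^2 \\int (1 - s_{j_2} - s_{j_1})^2 dQ.$$\n\nThe constant 1 comes from the identity $s(-t) = 1 - s(t)$, which is why this criterion corresponds to the sign-flipped case of condition (3) in Theorem 1.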
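\n\nWe briefly explain how the procedure keeps the information matrix nonsingular and does not increase EMSE with high probability. First, suppose $\\det J(\\theta_0) = 0$; then there exists $\\theta_0^{\\sharp} \\in \\mathcal{N}_K$ ($K < H$) such that $f(x; \\theta_0) = f(x; \\theta_0^{\\sharp})$ and $\\det J(\\theta_0^{\\sharp}) \\neq 0$ in $\\mathcal{N}_K$. The elimination of hidden units up to $K$, of course, does not increase the EMSE. Therefore, we have only to consider the case in which $\\det J(\\theta_0) \\neq 0$ and hidden units are eliminated.\n\nSuppose $\\int \\|f(x; \\theta_0^{\\sharp}) - f(x; \\theta_0)\\|^2 dQ > O(N^{-1})$ for any reduced parameter $\\theta_0^{\\sharp}$ from $\\theta_0$. The probability of satisfying $\\int \\|f(x; \\hat{\\theta}) - f(x; \\tilde{\\theta})\\|^2 dQ < A/N$ is very small for a sufficiently small $A$. Thus, the elimination of hidden units occurs with very small probability. Next, suppose $\\int \\|f(x; \\theta_0^{\\sharp}) - f(x; \\theta_0)\\|^2 dQ = O(N^{-1})$. Let $\\tilde{\\theta} \\in \\mathcal{N}_K$ be a reduced parameter made from $\\hat{\\theta}$ with the same procedure as we obtain $\\theta_0^{\\sharp}$ from $\\theta_0$. We will show for a sufficiently small $A$,\n\n$$E\\left[\\int \\|f(x; \\hat{\\theta}_K) - f(x; \\theta_0)\\|^2 dQ\\right] < E\\left[\\int \\|f(x; \\hat{\\theta}) - f(x; \\theta_0)\\|^2 dQ\\right],$$\n\nwhere $\\hat{\\theta}_K$ is the MLE computed in $\\mathcal{N}_K$. We write $\\theta = (\\theta^{(1)}, \\theta^{(2)})$ in which $\\theta^{(2)}$ is changed to 0 in reduction, changing the coordinate system if necessary. The Taylor expansion and asymptotic theory give\n\n$$E\\left[\\int \\|f(x; \\hat{\\theta}_K) - f(x; \\theta_0)\\|^2 dQ\\right] \\approx \\int \\|f(x; \\theta_0^{\\sharp}) - f(x; \\theta_0)\\|^2 dQ + \\frac{\\sigma^2}{N} \\mathrm{Tr}[I_{11}(\\theta_0^{\\sharp}) J_{11}^{-1}(\\theta_0^{\\sharp})],$$\n\n$$E\\left[\\int \\|f(x; \\hat{\\theta}) - f(x; \\tilde{\\theta})\\|^2 dQ\\right] \\approx \\int \\|f(x; \\theta_0^{\\sharp}) - f(x; \\theta_0)\\|^2 dQ + \\frac{\\sigma^2}{N} \\mathrm{Tr}[I_{22}(\\theta_0^{\\sharp}) J_{22}^{-1}(\\theta_0)],$$\n\nwhere $I_{ii}$ and $J_{ii}$ denote the local information matrices w.r.t. $\\theta^{(i)}$ ($i = 1, 2$). Thus,\n\n$$E\\left[\\int \\|f(x; \\hat{\\theta}) - f(x; \\theta_0)\\|^2 dQ\\right] - E\\left[\\int \\|f(x; \\hat{\\theta}_K) - f(x; \\theta_0)\\|^2 dQ\\right] \\approx -E\\left[\\int \\|f(x; \\hat{\\theta}) - f(x; \\tilde{\\theta})\\|^2 dQ\\right] + \\frac{\\sigma^2}{N} \\mathrm{Tr}[I_{22}(\\theta_0^{\\sharp}) J_{22}^{-1}(\\theta_0)] - \\frac{\\sigma^2}{N} \\mathrm{Tr}[I_{11}(\\theta_0^{\\sharp}) J_{11}^{-1}(\\theta_0^{\\sharp})] + E\\left[\\int \\|f(x; \\hat{\\theta}) - f(x; \\theta_0)\\|^2 dQ\\right].$$\n\nSince the sum of the last two terms is positive, the l.h.s. is positive if $E[\\int \\|f(x; \\hat{\\theta}) - f(x; \\tilde{\\theta})\\|^2 dQ] < B/N$ for a sufficiently small $B$. Although we cannot know the value of this expectation, we can make the probability that this inequality holds very high by taking a small $A$. 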
\n\n4 ACTIVE LEARNING WITH REDUCTION PROCEDURE\n\nThe reduction procedure keeps the information matrix nonsingular and makes the active learning algorithm applicable to MLP even with surplus hidden units.\n\nActive Learning with Hidden Unit Reduction\n\n1. Select an initial training data set $D_{[0]}$ from $r(x; v_{[0]})$, and compute $\\hat{\\theta}_{[0]}$.\n2. $k := 1$, and do REDUCTION PROCEDURE.\n3. Compute the optimal $v = v_{[k]}$ to minimize $\\mathrm{Tr}[I(\\hat{\\theta}_{[k-1]}) J^{-1}(\\hat{\\theta}_{[k-1]})]$, using the steepest descent method.\n4. Choose $N_k$ new training data from $r(x; v_{[k]})$ and let $D_{[k]}$ be a union of $D_{[k-1]}$ and the new data.\n5. Compute the MLE $\\hat{\\theta}_{[k]}$ based on the training data $D_{[k]}$ using BP with REDUCTION PROCEDURE.\n6. $k := k + 1$ and go to 3.\n\nThe BP with reduction procedure is applicable not only to active learning, but to a variety of statistical techniques that require the inverse of an information matrix. We do not discuss it in this paper, however.\n\nFigure 1: Active/Passive Learning: $f(x) = s(x)$. [One panel: averaged learning curves of active and passive learning with the ranges [Av-Sd, Av+Sd], plotted against the number of training data. Other panel: a typical learning curve and the number of hidden units, plotted against the number of training data.]\n\n5 EXPERIMENTS\n\nWe demonstrate the effect of the proposed active learning algorithm through experiments. First we use a three-layer model with 1 input unit, 3 hidden units, and 1 output unit. The true function $f$ is an MLP network with 1 hidden unit. The information matrix is therefore singular at $\\theta_0$. The environmental probability, $Q$, is a normal distribution $N(0, 4)$. 
We evaluate the generalization error in the actual environment using the following mean squared error of the function values:\n\n$$\\int \\|f(x; \\hat{\\theta}) - f(x)\\|^2 dQ.$$\n\nWe set the deviation in the true system to $\\sigma = 0.01$. As a family of distributions for training $\\{r(x; v)\\}$, a mixture model of 4 normal distributions is used. In each step of active learning, 100 new samples are added. A network is trained using online BP, presented with all training data 10000 times in each step, and the reduction procedure is applied once every 100 cycles between the 5000th and 10000th cycles. We run 30 training trials, changing the seed of random numbers. For comparison, we train a network passively based on training samples given by the probability $Q$.\n\nFig. 1 shows the averaged learning curves of active/passive learning and the number of hidden units in a typical learning curve. The advantage of the proposed active learning algorithm is clear. We find that the algorithm has the expected effects on a simple, ideal approximation problem.\n\nSecond, we apply the algorithm to a problem in which the true function is not included in the MLP model. We use an MLP with 4 input units, 7 hidden units, and 1 output unit. The true function is given by $f(x) = \\mathrm{erf}(x_1)$, where $\\mathrm{erf}(t)$ is the error function. The graph of the error function resembles that of the sigmoidal function, while they never coincide under any affine transform. We set $Q = N(0, 25 \\times I_4)$. We train a network actively/passively based on 10 data sets, and evaluate MSEs of function values. Other conditions are the same as those of the first experiment.\n\nFig. 2 shows the averaged learning curves and the number of hidden units in a typical learning curve. We find that the active learning algorithm reduces the errors though the theoretical condition is not perfectly satisfied in this case. It suggests the robustness of our active learning algorithm. 
\n\nFigure 2: Active/Passive Learning: $f(x) = \\mathrm{erf}(x_1)$. [One panel: averaged learning curves of active and passive learning, plotted against the number of training data. Other panel: a typical learning curve and the number of hidden units, plotted against the number of training data.]\n\n6 CONCLUSION\n\nWe review statistical active learning methods and point out a problem in their application to MLP: the required inverse of an information matrix does not exist if the network has redundant hidden units. We characterize the singularity condition of an information matrix and propose an active learning algorithm which is applicable to MLP with any number of hidden units. The effectiveness of the algorithm is verified through computer simulations, even when the theoretical assumptions are not perfectly satisfied.\n\nReferences\n\nD. A. Cohn. (1994) Neural network exploration using optimal experiment design. In J. Cowan et al. (ed.), Advances in Neural Information Processing Systems 6, 679-686. San Mateo, CA: Morgan Kaufmann.\n\nV. V. Fedorov. (1972) Theory of Optimal Experiments. NY: Academic Press.\n\nK. Fukumizu. (1996) A Regularity Condition of the Information Matrix of a Multilayer Perceptron Network. Neural Networks, to appear.\n\nK. Fukumizu, & S. Watanabe. (1994) Error Estimation and Learning Data Arrangement for Neural Networks. Proc. IEEE Int. Conf. Neural Networks: 777-780.\n\nK. Hagiwara, N. Toda, & S. Usui. (1993) On the problem of applying AIC to determine the structure of a layered feed-forward neural network. Proc. 1993 Int. Joint Conf. Neural Networks: 2263-2266.\n\nD. MacKay. 
(1992) Information-based objective functions for active data selection. Neural Computation 4(4):305-318.\n\nF. Pukelsheim. (1993) Optimal Design of Experiments. NY: John Wiley & Sons.\n\nH. White. (1989) Learning in artificial neural networks: A statistical perspective. Neural Computation 1(4):425-464.", "award": [], "sourceid": 1140, "authors": [{"given_name": "Kenji", "family_name": "Fukumizu", "institution": null}]}