{"title": "Algebraic Analysis for Non-regular Learning Machines", "book": "Advances in Neural Information Processing Systems", "page_first": 356, "page_last": 362, "abstract": null, "full_text": "Algebraic Analysis for Non-Regular Learning Machines \n\nSumio Watanabe \nPrecision and Intelligence Laboratory \nTokyo Institute of Technology \n4259 Nagatsuta, Midori-ku, Yokohama 223 Japan \nswatanab@pi.titech.ac.jp \n\nAbstract \n\nHierarchical learning machines are non-regular and non-identifiable statistical models, whose true parameter sets are analytic sets with singularities. Using algebraic analysis, we rigorously prove that the stochastic complexity of a non-identifiable learning machine is asymptotically equal to λ1 log n − (m1 − 1) log log n + const., where n is the number of training samples. Moreover, we show that the rational number λ1 and the integer m1 can be algorithmically calculated using resolution of singularities in algebraic geometry. We also obtain the inequalities 0 < λ1 ≤ d/2 and 1 ≤ m1 ≤ d, where d is the number of parameters. \n\n1 Introduction \n\nHierarchical learning machines such as multi-layer perceptrons, radial basis functions, and normal mixtures are non-regular and non-identifiable learning machines. If the true distribution is almost contained in a learning model, then the set of true parameters is not one point but an analytic variety. This paper establishes the mathematical foundation for analyzing such learning machines, based on algebraic analysis and algebraic geometry. \n\nLet us consider a learning machine represented by a conditional probability density p(x|w), where x is an M-dimensional vector and w is a d-dimensional parameter. We assume that n training samples x^n = {x_i ; i = 1, 2, ..., n} are independently taken from the true probability distribution q(x), and that the set of true parameters \n\nW0 = { w ∈ W ; p(x|w) = q(x) (a.s. q(x)) } \n\nis not empty. 
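As a toy illustration of such a set of true parameters (this two-parameter model is hypothetical and only for illustration; it is not the example analyzed later in the paper), consider ψ(x, a, b) = a tanh(bx) with the true regression function identically zero. Checking the Taylor coefficients of ψ in x shows that W0 = {(a, b) : ab = 0}, the union of the two coordinate lines, which is an analytic set with a singularity at the origin:

```python
# Toy non-identifiable machine (illustrative only, not the paper's example):
# psi(x, a, b) = a * tanh(b * x), true regression function = 0.
import sympy as sp

a, b, x = sp.symbols('a b x')
psi = a * sp.tanh(b * x)

# Taylor coefficients of psi in x: psi vanishes identically iff all of them do.
coeffs = sp.Poly(sp.series(psi, x, 0, 8).removeO(), x).coeffs()
print([sp.factor(c) for c in coeffs])

# Every coefficient carries the factor a*b, so the set of true parameters is
# W0 = {(a, b) : a*b = 0}: not a single point but a singular analytic set.
assert all(c.subs(a, 0) == 0 and c.subs(b, 0) == 0 for c in coeffs)
```

Near the origin the model behaves like ψ ≈ abx, so the Fisher information degenerates on W0; this degeneracy is exactly what makes λ1 and m1 deviate from the regular values d/2 and 1.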
In Bayes statistics, the estimated distribution p(x|x^n) is defined by \n\np(x|x^n) = ∫ p(x|w) ρ_n(w) dw,   ρ_n(w) = (1/Z_n) φ(w) Π_{i=1}^{n} p(x_i|w), \n\nwhere Z_n is the normalizing constant and φ(w) is the prior density on W. The stochastic complexity F(n) is defined through the Kullback distances \n\nH(w) = ∫ q(x) log( q(x)/p(x|w) ) dx,   H(w, x^n) = (1/n) Σ_{i=1}^{n} log( q(x_i)/p(x_i|w) ), \n\nby F(n) = E_{x^n}{ −log ∫ exp(−n H(w, x^n)) φ(w) dw }. \n\n(A.3) Let {r_j(x, w*) ; j = 1, 2, ..., d} be the associated convergence radii of h(x, w) at w*; in other words, the Taylor expansion of h(x, w) at w* = (w*_1, ..., w*_d), \n\nh(x, w) = Σ_{k1,...,kd=0}^{∞} a_{k1 k2 ··· kd}(x) (w_1 − w*_1)^{k1} (w_2 − w*_2)^{k2} ··· (w_d − w*_d)^{kd}, \n\nconverges absolutely in |w_j − w*_j| < r_j(x, w*). Assume inf_{x∈Q} inf_{w*∈W} r_j(x, w*) > 0 for j = 1, 2, ..., d. \n\nTheorem 1. Assume (A.1), (A.2), and (A.3). Then there exist a rational number λ1 > 0, a natural number m1, and a constant C, such that \n\n| F(n) − λ1 log n + (m1 − 1) log log n | < C \n\nholds for an arbitrary natural number n. \n\nRemarks. (1) If q(x) has compact support, then assumption (A.3) is automatically satisfied. (2) Without assumptions (A.1) and (A.3), we can still prove the upper bound F(n) ≤ λ1 log n − (m1 − 1) log log n + const. \n\n\f358 S. Watanabe \n\nFrom Theorem 1, if the generalization error K(n) has an asymptotic expansion, then it should be \n\nK(n) = λ1/n − (m1 − 1)/(n log n) + o( 1/(n log n) ). \n\nAs is well known, if the model is identifiable and has a positive definite Fisher information matrix, then λ1 = d/2 (d is the dimension of the parameter space) and m1 = 1. However, hierarchical learning models such as multi-layer perceptrons, radial basis functions, and normal mixtures have smaller λ1 and larger m1; in other words, hierarchical models are better learning machines than regular ones when Bayes estimation is applied. The constants λ1 and m1 are characterized by the following theorem. \n\nTheorem 2. Assume the same conditions as Theorem 1, and let ε > 0 be a sufficiently small constant. The function, holomorphic in Re(z) > 0, \n\nJ(z) = ∫_{H(w)<ε} H(w)^z φ(w) dw, \n\ncan be analytically continued to a meromorphic function on the whole complex plane whose poles are negative rational numbers. Then −λ1 is the largest pole of J(z) and m1 is its order. \n\n3 Outline of the Proof \n\n3.1 Upper Bound \n\nFor n > 0, we define F*(n) by \n\nF*(n) = −log ∫_{H(w)<ε} exp(−n H(w)) φ(w) dw. \n\nBy the theory of b-functions, there exist a differential operator D(w, ∂_w, z) and a polynomial b(z) such that \n\nD(w, ∂_w, z) H(w)^{z+1} = b(z) H(w)^z, \n\nwhich gives the analytic continuation of J(z): for any K, \n\nJ(z) = Σ_{k=1}^{K} c_k / (z + λ_k)^{m_k} + J_K(z), \n\nwhere 0 < λ1 < λ2 < ··· are the absolute values of the poles, J_K(z) is holomorphic in Re(z) > −λ_{K+1}, and |J_K(z)| → 0 (|z| → ∞, Re(z) > −λ_{K+1}). Let us define a function \n\nI(t) = ∫ δ(t − H(w)) φ(w) dw \n\nfor 0 < t < ε and I(t) = 0 for ε ≤ t ≤ 1. Then I(t) connects the function F*(n) with J(z) by the relations \n\nJ(z) = ∫_0^1 t^z I(t) dt, \nF*(n) = −log ∫_0^1 exp(−n t) I(t) dt. \n\nThe inverse Laplace transform gives the asymptotic expansion of I(t) as t → 0, resulting in the asymptotic expansion of F*(n), \n\nF*(n) = −log ∫_0^n exp(−t) I(t/n) dt/n = λ1 log n − (m1 − 1) log log n + O(1), \n\nwhich is the upper bound of F(n). \n\n3.2 Lower Bound \n\nWe define a random variable \n\nA(x^n) = sup_{w∈W} | n^{1/2} ( H(w, x^n) − H(w) ) / H(w)^{1/2} |.   (4) \n\nThen we prove in the Appendix that there exists a constant C0, independent of n, such that \n\nE_{x^n}{ A(x^n)^2 } < C0.   (5) \n\nBy using the inequality ab ≤ (a^2 + b^2)/2, \n\nn H(w, x^n) ≥ n H(w) − A(x^n) (n H(w))^{1/2} ≥ (1/2){ n H(w) − A(x^n)^2 }, \n\nwhich derives a lower bound, \n\nF(n) ≥ −(1/2) E_{x^n}{ A(x^n)^2 } − log( Z1 + Z2 ),   (6) \nZ1 = ∫_{H(w)<ε} exp(−n H(w)/2) φ(w) dw, \nZ2 = ∫_{H(w)≥ε} exp(−n H(w)/2) φ(w) dw. \n\nThe first term in eq.(6) is bounded by eq.(5). Let the second term be F*(n); then \n\nF*(n) = −log( Z1 + Z2 ) ≥ λ1 log n − (m1 − 1) log log n + const., \n\nsince Z2 ≤ exp(−nε/2) decays exponentially, while Z1 is the integral analyzed in the upper bound with n replaced by n/2. \n\n4 Resolution of Singularities \n\nIn this section, we construct a method to calculate λ1 and m1. First of all, we cover the compact set W0 with a finite union of open sets W_α: W0 ⊂ ∪_α W_α. Hironaka's resolution of singularities [5] [2] ensures that, for an arbitrary analytic function H(w), we can algorithmically find an open set U_α ⊂ R^d (U_α contains the origin) and an analytic function g_α : U_α → W_α such that \n\nH(g_α(u)) = a(u) u_1^{k1} u_2^{k2} ··· u_d^{kd}   (u ∈ U_α)   (7) \n\nwhere a(u) > 0 is a positive function and the k_i ≥ 0 (1 ≤ i ≤ d) are even integers (a(u) and the k_i depend on U_α). 
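The recipe of Theorem 2 can be sanity-checked on a one-chart toy loss (a hypothetical example, not from the paper): d = 2 parameters, H(a, b) = a²b², and a uniform prior on [−1, 1]². This H is already in the normal-crossing form (7) with a(u) ≡ 1 and (k1, k2) = (2, 2), so no blow-up is needed:

```latex
J(z) = \int_{-1}^{1}\int_{-1}^{1} (a^2 b^2)^z \, da \, db
     = \left( \int_{-1}^{1} a^{2z} \, da \right)^{2}
     = \left( \frac{2}{2z+1} \right)^{2}.
```

The analytic continuation has its largest pole at z = −1/2 with order 2 (constant prior factors and restricting the domain to H(w) < ε do not move the largest pole), hence λ1 = 1/2 < d/2 = 1 and m1 = 2, illustrating for a singular model both inequalities satisfied by the constants of Theorem 1.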
Note that the Jacobian |g_α'(u)| = 0 if and only if u ∈ g_α^{-1}(W0). In the local coordinate, write φ(g_α(u)) |g_α'(u)| = c(u) u_1^{p1} u_2^{p2} ··· u_d^{pd} with c(u) > 0. Then the contribution of U_α to J(z) is \n\nJ_α(z) = ∫_{U_α} a(u)^z u_1^{k1 z} u_2^{k2 z} ··· u_d^{kd z} c(u) u_1^{p1} u_2^{p2} ··· u_d^{pd} du_1 du_2 ··· du_d. \n\nFor real z, max_α J_α(z) ≤ J(z) ≤ Σ_α J_α(z), hence \n\nλ1 = min_α min_{1≤q≤d} (p_q + 1)/k_q > 0 \n\n(the inner minimum taken over q with k_q > 0), and m1 is equal to the number of q which attain the minimum min_{1≤q≤d}. \n\nRemark. In a neighborhood of w0 ∈ W0, the analytic function H(w) is equivalent to a polynomial H_{w0}(w); in other words, there exist constants c1, c2 > 0 such that c1 H_{w0}(w) ≤ H(w) ≤ c2 H_{w0}(w). Hironaka's theorem constructs the resolution map g_α for any polynomial H_{w0}(w) algorithmically in finitely many steps (blowing-ups along nonsingular manifolds contained in the singularities are applied recursively). From the above discussion we obtain the inequality 1 ≤ m1 ≤ d. Moreover, since there exists γ > 0 such that H(w) ≤ γ|w − w0|² in a neighborhood of w0 ∈ W0, we obtain λ1 ≤ d/2. \n\nExample. Let us consider a model with (x, y) ∈ R² and w = (a, b, c, d) ∈ R⁴, \n\np(x, y|w) = p0(x) (2π)^{−1/2} exp( −(1/2)(y − ψ(x, w))² ), \nψ(x, a, b, c, d) = a tanh(bx) + c tanh(dx), \n\nwhere p0(x) is a probability density with compact support (not estimated). We also assume that the true regression function is y = ψ(x, 0, 0, 0, 0). The set of true parameters is \n\nW0 = { w ; E_x ψ(x, a, b, c, d)² = 0 } = { ab + cd = 0 and ab³ + cd³ = 0 }. \n\nAssumptions (A.1), (A.2), and (A.3) are satisfied. The singularity in W0 which gives the smallest λ1 is the origin, and the average loss function in the neighborhood W0 of the origin is equivalent to the polynomial H0(a, b, c, d) = (ab + cd)² + (ab³ + cd³)² (see [9]). Using blowing-ups, we find a map g : (x, y, z, w) ↦ (a, b, c, d), \n\na = x,  b = y³w − yzw,  c = zwx,  d = y, \n\nby which the singularity at the origin is resolved. 
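The resolution in the example can be machine-checked, and the exponent formula above then gives λ1 and m1 directly. The following sketch is a verification aid, not code from the paper; it assumes the prior is smooth and strictly positive near the origin, so the exponents of the Jacobian play the role of (p1, ..., pd):

```python
# Verify H0(g(u)) = a(u) * (normal-crossing monomial) for the blow-up map
# given in the text, then evaluate lambda_1 = min over q of (p_q + 1) / k_q.
import sympy as sp

a, b, c, d = sp.symbols('a b c d')
x, y, z, w = sp.symbols('x y z w')

H0 = (a*b + c*d)**2 + (a*b**3 + c*d**3)**2
g = {a: x, b: y**3*w - y*z*w, c: z*w*x, d: y}

Hg = sp.factor(H0.subs(g))
print(Hg)   # x**2 * y**6 * w**2 times a factor that is >= 1, hence positive

detJ = sp.factor(
    sp.Matrix([x, y**3*w - y*z*w, z*w*x, y]).jacobian([x, y, z, w]).det())
print(detJ)  # up to sign, x * y**3 * w

k = [2, 6, 0, 2]  # exponents of H(g(u)) in (x, y, z, w)
p = [1, 3, 0, 1]  # exponents of |g'(u)| (prior assumed positive at the origin)
ratios = [sp.Rational(pq + 1, kq) for pq, kq in zip(p, k) if kq > 0]
lam1, m1 = min(ratios), ratios.count(min(ratios))
print(lam1, m1)  # 2/3 1
```

So for this machine λ1 = 2/3 < d/2 = 2 and m1 = 1: the Bayes stochastic complexity grows like (2/3) log n instead of the regular 2 log n.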
\n\nIn the resolved coordinates, \n\nJ(z) = ∫ H0(a, b, c, d)^z da db dc dd = ∫ a(u)^z (x² y⁶ w²)^z |x y³ w| dx dy dz dw, \n\nand applying the formula of Section 4 with (k1, k2, k3, k4) = (2, 6, 0, 2) and (p1, p2, p3, p4) = (1, 3, 0, 1) shows that λ1 = 2/3 and m1 = 1. \n\nAppendix \n\nFor w with H(w) < ε, a Taylor expansion shows that \n\nH(w) = ∫ q(x) log( q(x)/p(x|w) ) dx ≥ (1/2) ∫ q(x) ( log( q(x)/p(x|w) ) )² dx ≥ a0 Σ_{j=1}^{J} g_j(w)², \n\nwhere a0 > 0 is the smallest eigenvalue of the positive definite symmetric matrix E_x{ f_j(x, w0) f_k(x, w0) }. Lastly, combining \n\nA(x^n)² = sup_{H(w)<ε} n ( H(w, x^n) − H(w) )² / H(w) ≤ (2/a0) sup_{H(w)<ε} Σ_{j=1}^{J} { (1/n^{1/2}) Σ_{i=1}^{n} ( f_j(x_i, w) − E_x f_j(x, w) ) }², \n\nwith eq.(10), we obtain eq.(5). \n\nAcknowledgments \n\nThis research was partially supported by the Ministry of Education, Science, Sports and Culture in Japan, Grant-in-Aid for Scientific Research 09680362. \n\nReferences \n\n[1] Amari, S., Murata, N. (1993) Statistical theory of learning curves under entropic loss. Neural Computation, 5 (4), pp.140-153. \n\n[2] Atiyah, M.F. (1970) Resolution of singularities and division of distributions. Comm. Pure and Appl. Math., 23, pp.145-150. \n\n[3] Fukumizu, K. (1999) Generalization error of linear neural networks in unidentifiable cases. Lecture Notes in Computer Science, 1720, Springer, pp.51-62. \n\n[4] Hagiwara, K., Toda, N., Usui, S. (1993) On the problem of applying AIC to determine the structure of a layered feed-forward neural network. Proc. of IJCNN, 3, pp.2263-2266. \n\n[5] Hironaka, H. (1964) Resolution of singularities of an algebraic variety over a field of characteristic zero, I, II. Annals of Math., 79, pp.109-326. \n\n[6] Kashiwara, M. (1976) B-functions and holonomic systems. Invent. Math., 38, pp.33-53. \n\n[7] Oaku, T. (1997) An algorithm of computing b-functions. Duke Math. J., 87, pp.115-132. \n\n[8] Sato, M., Shintani, T. (1974) On zeta functions associated with prehomogeneous vector spaces. Annals of Math., 100, pp.131-170. \n\n[9] Watanabe, S. (1998) On the generalization error by a layered statistical model with Bayesian estimation. IEICE Trans., J81-A, pp.1442-1452. (English version: Elect. Comm. in Japan, to appear.) \n\n[10] Watanabe, S. (1999) Algebraic analysis for singular statistical estimation. Lecture Notes in Computer Science, 1720, Springer, pp.39-50. ", "award": [], "sourceid": 1739, "authors": [{"given_name": "Sumio", "family_name": "Watanabe", "institution": null}]}