{"title": "Algebraic Information Geometry for Learning Machines with Singularities", "book": "Advances in Neural Information Processing Systems", "page_first": 329, "page_last": 335, "abstract": null, "full_text": "Algebraic Information Geometry for \nLearning Machines with Singularities \n\nSumio Watanabe \n\nPrecision and Intelligence Laboratory \n\nTokyo Institute of Technology \n\n4259 Nagatsuta, Midori-ku, Yokohama, 226-8503 J apan \n\nswatanab@pi.titech.ac.jp \n\nAbstract \n\nAlgebraic geometry is essential to learning theory. In hierarchical \nlearning machines such as layered neural networks and gaussian \nmixtures, the asymptotic normality does not hold, since Fisher in(cid:173)\nformation matrices are singular. In this paper , the rigorous asymp(cid:173)\ntotic form of the stochastic complexity is clarified based on resolu(cid:173)\ntion of singularities and two different problems are studied. (1) If \nthe prior is positive, then the stochastic complexity is far smaller \nthan BIO, resulting in the smaller generalization error than regular \nstatistical models, even when the true distribution is not contained \nin the parametric model. \nnate free and equal to zero at singularities, is employed then the \nstochastic complexity has the same form as BIO. It is useful for \nmodel selection, but not for generalization. \n\n(2) If Jeffreys' prior, which is coordi(cid:173)\n\n1 \n\nIntroduction \n\nThe Fisher information matrix determines a metric of the set of all parameters of \na learning machine [2]. If it is positive definite, then a learning machine can be un(cid:173)\nderstood as a Riemannian manifold. However, almost all learning machines such as \nlayered neural networks, gaussian mixtures, and Boltzmann machines have singular \nFisher metrics. 
For example, in a three-layer perceptron, the Fisher information matrix I(w) for a parameter w is singular (det I(w) = 0) if and only if w represents a smaller model which can be realized with fewer hidden units than the learning model. Therefore, when the learning machine is in an almost redundant state, no method in statistics or physics that relies on a quadratic approximation of the loss function can be applied. In fact, the maximum likelihood estimator is not subject to the asymptotic normal distribution [4], and the Bayesian posterior probability converges to a distribution which is quite different from the normal one [8]. To construct a mathematical foundation for such learning machines, we clarified the essential relation between algebraic geometry and Bayesian statistics [9,10]. In this paper, we show that the asymptotic form of the Bayesian stochastic complexity is rigorously obtained by resolution of singularities. The Bayesian method gives powerful tools for both generalization and model selection; however, the appropriate prior for each purpose is quite different. \n\n2 Stochastic Complexity \n\nLet p(x|w) be a learning machine, where x is a pair of an input and an output, and w \in R^d is a parameter. We prepare a prior distribution \varphi(w) on R^d. Training samples X^n = (X_1, X_2, ..., X_n) are independently taken from the true distribution q(x), which in general is not contained in p(x|w). The stochastic complexity F(X^n) and its average F(n) are defined by \n\nF(X^n) = - log \int \prod_{i=1}^n p(X_i|w) \varphi(w) dw \n\nand F(n) = E_{X^n}{ F(X^n) }, respectively, where E_{X^n}{ . } denotes the expectation value over all training sets. The stochastic complexity plays a central role in Bayesian statistics. Firstly, F(n+1) - F(n) - S, where S = - \int q(x) log q(x) dx, is equal to the average Kullback distance from q(x) to the Bayes predictive distribution p(x|X^n), which is called the generalization error and is denoted by G(n). 
Secondly, exp(-F(X^n)) is proportional to the posterior probability of the model; hence, the best model is selected by minimization of F(X^n) [7]. And lastly, if the prior distribution has a hyperparameter \theta, that is to say, \varphi(w) = \varphi(w|\theta), then \theta is optimized by minimization of F(X^n) [1]. \n\nWe define a function F_0(n) using the Kullback distance H(w), \n\nF_0(n) = - log \int exp(-n H(w)) \varphi(w) dw, H(w) = \int q(x) log ( q(x) / p(x|w) ) dx. \n\nThen, by Jensen's inequality, F(n) - Sn \le F_0(n). Moreover, we assume that L(x,w) = log q(x) - log p(x|w) is an analytic function from w to the Hilbert space of all square integrable functions with the measure q(x)dx, and that the support of the prior, W = supp \varphi, is compact. Then H(w) is an analytic function on W, and there exists a constant c_1 > 0 such that, for an arbitrary n, \n\nF_0(n/2) - c_1 \le F(n) - Sn \le F_0(n). (1) \n\n3 General Learning Machines \n\nIn this section, we study the case when the true distribution is contained in the parametric model, that is to say, there exists a parameter w_0 \in W such that q(x) = p(x|w_0). Let us introduce the zeta function J(z) (z \in C) of H(w) and a state density function v(t) by \n\nJ(z) = \int H(w)^z \varphi(w) dw, v(t) = \int \delta(t - H(w)) \varphi(w) dw. \n\nThen J(z) and F_0(n) are represented by the Mellin and the Laplace transforms of v(t), respectively: \n\nJ(z) = \int_0^h t^z v(t) dt, F_0(n) = - log \int_0^h exp(-nt) v(t) dt, \n\nwhere h = max_{w \in W} H(w). Therefore F_0(n), v(t), and J(z) are mathematically connected. It is obvious that J(z) is a holomorphic function in Re(z) > 0. Moreover, by using the existence of the Sato-Bernstein b-function [6], it can be analytically continued to a meromorphic function on the entire complex plane, whose poles are real, negative, and rational numbers. Let -\lambda_1 > -\lambda_2 > -\lambda_3 > ... be the poles of J(z) and m_k be the order of -\lambda_k. 
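To make these objects concrete, consider a minimal two-dimensional illustration (a sketch added here, not an example from the paper): with the uniform prior on [0,1]^2, the singular Kullback distance H(w) = w_1^2 w_2^2 has zeta function J(z) = (\int_0^1 w^{2z} dw)^2 = (2z+1)^{-2}, whose largest pole z = -1/2 has order 2, i.e. \lambda = 1/2 and m = 2, while the regular H(w) = w_1^2 + w_2^2 gives \lambda = d/2 = 1 and m = 1. A numeric sketch of F_0(n) separates the two cases by their growth in log n:

```python
import math

def F0(n, H, N=600):
    """Midpoint-rule evaluation of F0(n) = -log integral of exp(-n H(w)) dw
    over [0,1]^2 with a uniform prior (the constant prior density only
    shifts F0 and does not affect the log n slope)."""
    h = 1.0 / N
    s = 0.0
    for i in range(N):
        a = (i + 0.5) * h
        for j in range(N):
            s += math.exp(-n * H(a, (j + 0.5) * h))
    return -math.log(s * h * h)

def H_sing(a, b):
    return (a * b) ** 2      # singular case: J(z) = (2z+1)^(-2), lambda = 1/2, m = 2

def H_reg(a, b):
    return a * a + b * b     # regular case: lambda = d/2 = 1, m = 1

def slope(H, n1=100.0, n2=10000.0):
    # empirical coefficient of log n in F0(n) between n = n1 and n = n2
    return (F0(n2, H) - F0(n1, H)) / (math.log(n2) - math.log(n1))

s_sing, s_reg = slope(H_sing), slope(H_reg)
print(s_sing, s_reg)  # the singular slope stays well below the regular one
```

The measured singular slope lies noticeably below the regular value 1, reflecting the smaller \lambda; it also sits slightly below 1/2 at finite n because of the -(m-1) log log n term.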
Then, by using the inverse Mellin transform, it follows that v(t) has an asymptotic expansion with coefficients {c_{km}}, \n\nv(t) ~ \sum_{k=1}^{\infty} \sum_{m=1}^{m_k} c_{km} t^{\lambda_k - 1} (- log t)^{m-1} (t \to +0). \n\nTherefore F_0(n) also has an asymptotic expansion; putting \lambda = \lambda_1 and m = m_1, \n\nF_0(n) = \lambda log n - (m - 1) log log n + O(1), \n\nwhich ensures the asymptotic expansion of F(n) by eq.(1), \n\nF(n) = Sn + \lambda log n - (m - 1) log log n + O(1). \n\nThe Kullback distance H(w) depends on the analytic set W_0 = { w \in W ; H(w) = 0 }, so that both \lambda and m depend on W_0. Note that, if the Bayes generalization error G(n) = F(n+1) - F(n) - S has an asymptotic expansion, it should be \lambda/n - (m - 1)/(n log n). The following lemma is proven using the definition of F_0(n) and its asymptotic expansion. \n\nLemma 1 (1) Let (\lambda_i, m_i) (i = 1,2) be the constants corresponding to (H_i(w), \varphi_i(w)) (i = 1,2). If H_1(w) \le H_2(w) and \varphi_1(w) \ge \varphi_2(w), then '\lambda_1 < \lambda_2' or '\lambda_1 = \lambda_2 and m_1 \ge m_2'. \n(2) Let (\lambda_i, m_i) (i = 1,2) be the constants corresponding to (H_i(w_i), \varphi_i(w_i)) (i = 1,2). Let w = (w_1, w_2), H(w) = H_1(w_1) + H_2(w_2), and \varphi(w) = \varphi_1(w_1) \varphi_2(w_2). Then the constants of (H(w), \varphi(w)) are \lambda = \lambda_1 + \lambda_2 and m = m_1 + m_2 - 1. \n\nThe concrete values of \lambda and m can be obtained algorithmically by the following theorem. Let W' be the open kernel of W (the maximal open set contained in W). \n\nTheorem 1 (Resolution of Singularities, Hironaka [5]) Let H(w) \ge 0 be a real analytic function on W'. Then there exist both a real d-dimensional manifold U and a real analytic function g : U \to W' such that, in a neighborhood of an arbitrary u \in U, \n\nH(g(u)) = a(u) u_1^{2 s_1} u_2^{2 s_2} ... u_d^{2 s_d}, (2) \n\nwhere a(u) > 0 is an analytic function and {s_i} are non-negative integers. Moreover, for an arbitrary compact set K \subset W, g^{-1}(K) \subset U is compact. Such a function g(u) can be found by a finite sequence of blowing-ups. \n\nRemark. 
By applying eq.(2) to the definition of J(z), one can see that the integral in J(z) decomposes into a direct product of one-dimensional integrals [3]. Applications to learning theory are shown in [9,10]. In general it is not easy to find a g(u) that gives the complete resolution of singularities; however, in this paper we show that even a partial resolution mapping gives an upper bound on \lambda. \n\nDefinition. We introduce two different priors. \n(1) The prior distribution \varphi(w) is called positive if \varphi(w) > 0 for an arbitrary w \in W' (W = supp
\varphi). \n(2) Jeffreys' prior is defined by \varphi(w) \propto (det I(w))^{1/2}, where I(w) is the Fisher information matrix; it is coordinate free and equal to zero at singularities. \n\nTheorem 2 Assume that the true distribution is contained in the parametric model. (1) If the positive prior is employed, then '\lambda < d/2' or '\lambda = d/2 and m \ge 1'. (2) If Jeffreys' prior is employed, then '\lambda > d/2' or '\lambda = d/2 and m = 1'. \n\n(Outline of the Proof) (1) In order to examine the poles of J(z), we can divide the parameter space into a sum of neighborhoods. Since H(w) is an analytic function, in an arbitrary neighborhood of a point w_0 that satisfies H(w_0) = 0, we can find a positive definite quadratic form that is larger than H(w). The positive definite quadratic form satisfies \lambda = d/2 and m = 1. By using Lemma 1 (1), we obtain the first half. \n(2) Because Jeffreys' prior is coordinate free, we can study the problem on the parameter space U in eq.(2) instead of W'. Hence, there exists an analytic function t(x,u) such that, in each local coordinate, \n\nL(x,u) = L(x,g(u)) = t(x,u) u_1^{s_1} ... u_d^{s_d}. \n\nFor simplicity, we assume that s_i > 0 (i = 1,2,...,d). Then \n\n\partial L / \partial u_i = ( (\partial t / \partial u_i) u_i + s_i t ) u_1^{s_1} ... u_i^{s_i - 1} ... u_d^{s_d}. \n\nBy using the blowing-ups u_i = v_1 v_2 ... v_i (i = 1,2,...,d) and the notation \sigma_p = s_p + s_{p+1} + ... + s_d, it is easy to show \n\ndet I(v) \le C \prod_{p=1}^d v_p^{2 d \sigma_p + 2p - 2d - 2}, du = ( \prod_{p=1}^d |v_p|^{d-p} ) dv. (3) \n\nBy using H(g(u))^z = \prod_{p=1}^d v_p^{2 \sigma_p z} (up to a positive analytic factor) and Lemma 1 (1), in order to prove the latter half of the theorem, it is sufficient to prove that the resulting one-dimensional integrals have a pole z = -d/2 with the order m = 1. Direct calculation of the integrals in J(z) completes the theorem. (Q.E.D.) \n\n4 Three-Layer Perceptron \n\nIn this section, we study cases in which the learner is a three-layer perceptron and the true distribution is, or is not, contained in the model. We define the three-layer perceptron p(x,y|w) with M input units, K hidden units, and N output units, where x is an input, y is an output, and w is a parameter. 
\n\np(x,y|w) = r(x) (2\pi\sigma^2)^{-N/2} exp( - ||y - f_K(x,w)||^2 / (2\sigma^2) ), \n\nf_K(x,w) = \sum_{k=1}^K a_k \sigma(b_k \cdot x + c_k), \n\nwhere w = { (a_k, b_k, c_k) ; a_k \in R^N, b_k \in R^M, c_k \in R^1 }, r(x) is the probability density of the input, and \sigma^2 is the variance of the output (neither r(x) nor \sigma is estimated). \n\nTheorem 3 If the true distribution is represented by the three-layer perceptron with K_0 \le K hidden units, and if the positive prior is employed, then \n\n\lambda \le (1/2) { K_0 (M + N + 1) + (K - K_0) min(M + 1, N) }. (4) \n\n(Outline of Proof) Firstly, we consider the case when the true regression function is g(x) = 0. Then \n\nH(w) = (1/(2\sigma^2)) \int || f_K(x,w) ||^2 r(x) dx. (5) \n\nLet a_k = (a_{k1}, ..., a_{kN}) and b_k = (b_{k1}, ..., b_{kM}). Let us consider the blowing-up \n\na_{11} = \alpha, a_{kj} = \alpha a'_{kj} ((k,j) \ne (1,1)), b_{kl} = b'_{kl}, c_k = c'_k. \n\nThen da db dc = \alpha^{KN-1} d\alpha da' db' dc' and there exists an analytic function H_1(a',b',c') such that H(a,b,c) = \alpha^2 H_1(a',b',c'). Therefore J(z) has a pole at z = -KN/2. Also, by using another blowing-up in which the (M+1)K parameters {b_{kj}} and {c_k} are rescaled by \beta in the same manner, da db dc = \beta^{(M+1)K-1} d\beta da'' db'' dc'' and there exists an analytic function H_2(a'',b'',c'') such that H(a,b,c) = \beta^2 H_2(a'',b'',c''), which shows that J(z) has a pole at z = -K(M+1)/2. By combining both results, we obtain \lambda \le (K/2) min(M + 1, N). Secondly, we prove the general case, 0 < K_0 \le K. Then H(w) is bounded by the sum of a term H_1(w_1) for a perceptron with K_0 hidden units that realizes the true distribution and a term H_2(w_2) for the remaining K - K_0 hidden units with zero output. By combining Lemma 1 (2) and the above results, we obtain the Theorem. (Q.E.D.) \n\nIf the true regression function g(x) is not contained in the learning model, we assume that, for each 0 \le k \le K, there exists a parameter w^{(k)} \in W that minimizes the squared error \n\nE(k) = min_w \int || g(x) - f_k(x,w) ||^2 r(x) dx. (6) \n\nWe use the notation E(k) above and \lambda(k) = (1/2) { k(M + N + 1) + (K - k) min(M + 1, N) }. \n\nTheorem 4 If the true regression function is not contained in the learning model and the positive prior is applied, then \n\nF(n) \le Sn + min_{0 \le k \le K} [ n E(k)/(2\sigma^2) + \lambda(k) log n ] + O(1). \n\n(Outline of Proof) This theorem can be shown by the same procedure as in the preceding theorem, using eq.(6). (Q.E.D.) 
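The trade-off inside the minimum of Theorem 4 can be illustrated numerically (a sketch; only \lambda(k) comes from the paper, while the approximation errors E(k) below are hypothetical values chosen to decrease in k): minimizing n E(k)/(2\sigma^2) + \lambda(k) log n over k selects a small model when n is small and a larger one as n grows.

```python
import math

def lam(k, K, M, N):
    """lambda(k) = (1/2){ k(M+N+1) + (K-k) min(M+1, N) } as in Theorem 4."""
    return 0.5 * (k * (M + N + 1) + (K - k) * min(M + 1, N))

def best_k(n, E, K, M, N, sigma2=1.0):
    """k minimizing the Theorem 4 bound  n E(k)/(2 sigma^2) + lambda(k) log n."""
    return min(range(K + 1),
               key=lambda k: n * E[k] / (2.0 * sigma2) + lam(k, K, M, N) * math.log(n))

K, M, N = 5, 4, 2
E = [(k + 1) ** -2.0 for k in range(K + 1)]   # hypothetical errors E(k), decreasing in k
ks = [best_k(n, E, K, M, N) for n in (10, 100, 1000, 100000)]
print(ks)  # the selected k is nondecreasing in n
```

For k = K the coefficient \lambda(K) = K(M+N+1)/2 equals d/2, recovering the regular BIC coefficient, while smaller k trades a larger approximation error E(k) for a smaller \lambda(k).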
\n\nIf G(n) has an asymptotic expansion G(n) = \sum_{q=1}^Q a_q f_q(n), where f_q(n) is a decreasing function of n that satisfies f_{q+1}(n) = o(f_q(n)) and f_Q(n) = 1/n, then \n\nG(n) \le min_{0 \le k \le K} [ E(k)/(2\sigma^2) + \lambda(k)/n ], \n\nwhich shows that the generalization error of the layered network is smaller than that of regular statistical models even when the true distribution is not contained in the learning model. It should be emphasized that the optimal k that minimizes G(n) is smaller than the learning model when n is not so large, and that it becomes larger as n increases. This fact shows that the positive prior is useful for generalization but not appropriate for model selection. Under the condition that the true distribution is contained in the parametric model, Jeffreys' prior may enable us to find the true model with higher probability. \n\nTheorem 5 If the true regression function is contained in the three-layer perceptron and Jeffreys' prior is applied, then \lambda = d/2 and m = 1, even if the Fisher metric is degenerate at the true parameter. \n\n(Outline of Proof) For simplicity, we prove the theorem for the case g(x) = 0. The general cases can be proven by the same method. By direct calculation of the Fisher information matrix, there exists an analytic function D(b,c) \ge 0 such that \n\ndet I(w) = \prod_{k=1}^K ( \sum_{p=1}^N a_{kp}^2 )^{M+1} D(b,c). \n\nBy using the blowing-up \n\na_{11} = \alpha, a_{kj} = \alpha a'_{kj} ((k,j) \ne (1,1)), b_{kl} = b'_{kl}, c_k = c'_k, \n\nwe obtain H(w) = \alpha^2 H_1(a',b',c') as in the proof of Theorem 3, det I(w) \propto \alpha^{2(M+1)K}, and da db dc = \alpha^{KN-1} d\alpha da' db' dc'. The integral \n\nJ~(z) = \int_{|\alpha|<1} \alpha^{2z} \alpha^{(M+1)K + KN - 1} d\alpha \n\nhas a pole at z = -(M + N + 1)K/2 = -d/2. By combining this result with Theorem 2, we obtain Theorem 5. (Q.E.D.) \n\n5 Discussion \n\nIn many applications of neural networks, rather complex machines are employed compared with the number of training samples. 
In such cases, the set of optimal parameters is not one point but an analytic set with singularities, and the set of almost optimal parameters { w ; H(w) < \epsilon } is not an 'ellipsoid'. Hence the Kullback distance cannot be approximated by any quadratic form, nor can the saddle point approximation be used in integration over the parameter space. The zeta function of the Kullback distance clarifies the behavior of the stochastic complexity, and resolution of singularities enables us to calculate the learning efficiency. \n\n6 Conclusion \n\nThe relation between algebraic geometry and learning theory is clarified, and two different facts are proven. \n(1) If the true distribution is not contained in a hierarchical learning model, then by using a positive prior, the generalization error is made smaller than that of regular statistical models. \n(2) If the true distribution is contained in the learning model and if Jeffreys' prior is used, then the average Bayes factor has the same form as BIC. \n\nAcknowledgments \n\nThis research was partially supported by the Ministry of Education, Science, Sports and Culture in Japan, Grant-in-Aid for Scientific Research 12680370. \n\nReferences \n\n[1] Akaike, H. (1980) Likelihood and Bayes procedure. Bayesian Statistics (Bernardo, J.M., et al., eds.), University Press, Valencia, Spain, 143-166. \n\n[2] Amari, S. (1985) Differential-Geometrical Methods in Statistics. Lecture Notes in Statistics, Springer. \n\n[3] Atiyah, M. F. (1970) Resolution of singularities and division of distributions. Comm. Pure and Appl. Math., 23, 145-150. \n\n[4] Dacunha-Castelle, D., & Gassiat, E. (1997) Testing in locally conic models, and application to mixture models. Probability and Statistics, 1, 285-317. \n\n[5] Hironaka, H. (1964) Resolution of singularities of an algebraic variety over a field of characteristic zero. Annals of Math., 79, 109-326. \n\n[6] Kashiwara, M. (1976) B-functions and holonomic systems. 
Inventiones Math., 38, 33-53. \n\n[7] Schwarz, G. (1978) Estimating the dimension of a model. Ann. of Stat., 6 (2), 461-464. \n\n[8] Watanabe, S. (1998) On the generalization error by a layered statistical model with Bayesian estimation. IEICE Transactions, J81-A (10), 1442-1452. English version: (2000) Electronics and Communications in Japan, Part 3, 83 (6), 95-104. \n\n[9] Watanabe, S. (2000) Algebraic analysis for non-regular learning machines. Advances in Neural Information Processing Systems, 12, 356-362. \n\n[10] Watanabe, S. (2001) Algebraic analysis for non-identifiable learning machines. Neural Computation, to appear. \n\n", "award": [], "sourceid": 1826, "authors": [{"given_name": "Sumio", "family_name": "Watanabe", "institution": null}]}