{"title": "Algebraic Analysis for Non-regular Learning Machines", "book": "Advances in Neural Information Processing Systems", "page_first": 356, "page_last": 362, "abstract": null, "full_text": "Algebraic  Analysis for  Non-Regular \n\nLearning Machines \n\nSumio  Watanabe \n\nPrecision and Intelligence Laboratory \n\nTokyo Institute of Technology \n\n4259  Nagatsuta,  Midori-ku,  Yokohama 223  Japan \n\nswatanab@pi. titech. ac.jp \n\nAbstract \n\nHierarchical learning machines are non-regular and non-identifiable \nstatistical models, whose true parameter sets are analytic sets with \nsingularities.  Using  algebraic  analysis,  we  rigorously  prove  that \nthe  stochastic  complexity  of  a  non-identifiable  learning  machine \n(ml  - 1) log log n  + const., \nis  asymptotically  equal  to  >'1  log n  -\nwhere n is the number of training samples.  Moreover we show that \nthe rational  number >'1  and the integer ml  can be algorithmically \ncalculated  using  resolution  of singularities  in  algebraic  geometry. \nAlso we obtain inequalities 0 < >'1  ~ d/2 and 1  ~ ml  ~ d,  where d \nis  the number of parameters. \n\n1 \n\nIntroduction \n\nHierarchical  learning  machines  such  as  multi-layer  perceptrons,  radial  basis  func(cid:173)\ntions, and normal mixtures are non-regular and non-identifiable learning machines. \nIf the  true  distribution  is  almost  contained  in  a  learning  model,  then  the  set  of \ntrue  parameters  is  not  one  point  but  an  analytic  variety  [4][9][3][10].  This  paper \nestablishes  the  mathematical  foundation  to analyze such  learning  machines  based \non algebraic analysis and algebraic geometry. \n\nLet us consider a  learning machine represented by a  conditional probability density \np(xlw) where x is an M  dimensional vector and w is a d dimensional parameter.  We \nassume that n  training samples xn =  {Xi; i  =  1,2, ... 
, n} are independently taken from the true probability distribution q(x), and that the set of true parameters \n\nW_0 = { w ∈ W ; p(x|w) = q(x) (a.s. q(x)) } \n\nis not empty. In Bayes statistics, the estimated distribution p(x|x^n) is defined by \n\np(x|x^n) = ∫ p(x|w) ρ_n(w) dw,   ρ_n(w) = (1/Z_n) Π_{i=1}^n p(x_i|w) φ(w), \n\nwhere φ(w) is an a priori probability density on R^d and Z_n is a normalizing constant. The generalization error is defined by \n\nK(n) = E_{x^n} { ∫ q(x) log( q(x) / p(x|x^n) ) dx }, \n\nwhere E_{x^n}{·} denotes the expectation over all sets of training samples x^n. One of the main purposes of learning theory is to clarify how fast K(n) converges to zero as n tends to infinity. Using the log-loss function h(x, w) = log q(x) - log p(x|w), we define the Kullback distance and its empirical version, \n\nH(w) = ∫ h(x, w) q(x) dx,   H(w, x^n) = (1/n) Σ_{i=1}^n h(x_i, w). \n\nNote that the set of true parameters is equal to the set of zeros of H(w), W_0 = { w ∈ W ; H(w) = 0 }. If the true parameter set W_0 consists of only one point, the learning machine p(x|w) is called identifiable, and otherwise non-identifiable. It should be emphasized that, in non-identifiable learning machines, W_0 is in general not a manifold but an analytic set with singular points. Let us define the stochastic complexity by \n\nF(n) = -E_{x^n} { log ∫ exp(-n H(w, x^n)) φ(w) dw }.   (1) \n\nThen we have an important relation between the stochastic complexity F(n) and the generalization error K(n), \n\nK(n) = F(n + 1) - F(n), \n\nwhich shows that K(n) is equal to the increase of F(n) [1]. In this paper, we derive the rigorous asymptotic form of the stochastic complexity F(n) for general non-identifiable learning machines. 
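The asymptotic behavior studied in this paper can be previewed on a noise-free toy version of eq.(1): replacing H(w, x^n) by its mean H(w) leaves the Laplace-type integral -log ∫ exp(-n H(w)) φ(w) dw, whose coefficient of log n is the constant λ_1 analyzed below. The sketch that follows is an added illustration, not part of the original analysis; the function name log_partition, the uniform prior on [-1, 1], and all numerical settings are illustrative choices. It compares H(w) = w^2 with the degenerate H(w) = w^4 and estimates the slope against log n numerically.

```python
import math

def log_partition(H, n, a=-1.0, b=1.0, steps=20000):
    # -log of the average of exp(-n * H(w)) over a uniform prior on [a, b],
    # computed with the trapezoidal rule; a noise-free stand-in for eq.(1).
    h = (b - a) / steps
    total = 0.0
    for i in range(steps + 1):
        w = a + i * h
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * math.exp(-n * H(w))
    return -math.log(total * h / (b - a))

for k in (2, 4):
    slope = (log_partition(lambda w, k=k: w ** k, 10 ** 6)
             - log_partition(lambda w, k=k: w ** k, 10 ** 4)) / math.log(100.0)
    print(k, round(slope, 2))  # slope near 1/2 for w^2 and near 1/4 for w^4
```

The flatter zero of w^4 halves the coefficient of log n; the theory below recovers exactly these values, λ_1 = 1/2 and λ_1 = 1/4, from the largest pole of ∫ H(w)^z φ(w) dw.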
\n\n2 Main Results \n\nWe need three assumptions upon which the main results are proven. \n\n(A.1) The probability density φ(w) is infinitely many times continuously differentiable and its support, W ≡ supp φ, is compact; in other words, φ ∈ C_0^∞. \n\n(A.2) The log-loss function h(x, w) = log q(x) - log p(x|w) is continuous for x in the support Q ≡ supp q, and analytic for w in an open set W' ⊃ W. \n\n(A.3) Let { r_j(x, w*) ; j = 1, 2, ..., d } be the convergence radii of h(x, w) at w*; in other words, the Taylor expansion of h(x, w) at w* = (w_1*, ..., w_d*), \n\nh(x, w) = Σ_{k_1,...,k_d=0}^∞ a_{k_1 k_2 ... k_d}(x) (w_1 - w_1*)^{k_1} (w_2 - w_2*)^{k_2} ... (w_d - w_d*)^{k_d}, \n\nconverges absolutely for |w_j - w_j*| < r_j(x, w*). Assume inf_{x∈Q} inf_{w*∈W} r_j(x, w*) > 0 for j = 1, 2, ..., d. \n\nTheorem 1. Assume (A.1), (A.2), and (A.3). Then there exist a rational number λ_1 > 0, a natural number m_1, and a constant C such that \n\n| F(n) - λ_1 log n + (m_1 - 1) log log n | < C \n\nholds for an arbitrary natural number n. \n\nRemarks. (1) If q(x) has compact support, then assumption (A.3) is automatically satisfied. (2) Without assumptions (A.1) and (A.3), we can still prove the upper bound F(n) ≤ λ_1 log n - (m_1 - 1) log log n + const. \n\nFrom Theorem 1, if the generalization error K(n) has an asymptotic expansion, then it should be \n\nK(n) = λ_1/n - (m_1 - 1)/(n log n) + o( 1/(n log n) ). \n\nAs is well known, if the model is identifiable and has a positive definite Fisher information matrix, then λ_1 = d/2 (d is the dimension of the parameter space) and m_1 = 1. 
However, hierarchical learning models such as multi-layer perceptrons, radial basis functions, and normal mixtures have smaller λ_1 and larger m_1; in other words, hierarchical models are better learning machines than regular ones when Bayes estimation is applied. The constants λ_1 and m_1 are characterized by the following theorem. \n\nTheorem 2. Assume the same conditions as in Theorem 1, and let ε > 0 be a sufficiently small constant. The holomorphic function in Re(z) > 0, \n\nJ(z) = ∫_{H(w)<ε} H(w)^z φ(w) dw, \n\ncan be analytically continued to the entire complex plane as a meromorphic function whose poles lie on the negative part of the real axis, and the constants -λ_1 and m_1 in Theorem 1 are equal to the largest pole of J(z) and its multiplicity, respectively. \n\nThe proofs of the above theorems are outlined in the following section. Let w = g(u) be an arbitrary analytic function from a set U ⊂ R^d to W. Then J(z) is invariant under the mapping \n\n{ H(w), φ(w) } → { H(g(u)), φ(g(u)) |g'(u)| }, \n\nwhere |g'(u)| = |det(∂w_i/∂u_j)| is the Jacobian. This fact shows that λ_1 and m_1 are invariant under a bi-rational mapping. In Section 4, we give an algorithm to calculate λ_1 and m_1 by using this invariance and resolution of singularities. \n\n3 Mathematical Structure \n\nIn this section, we present an outline of the proof and its mathematical structure. \n\n3.1 Upper bound and b-function \n\nFor a sufficiently small constant ε > 0, we define F*(n) by \n\nF*(n) = -log ∫_{H(w)<ε} exp(-n H(w)) φ(w) dw. \n\nThen, by using Jensen's inequality, we obtain F(n) ≤ F*(n). To evaluate F*(n), we need the b-function of algebraic analysis [6][7]. 
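Theorem 2 can be checked by hand when the Kullback function is a monomial, since the analytic continuation is then explicit. The sketch below is an added illustration under assumed simplifications (a uniform prior on (0,1)^d and H(w) = Π_i w_i^{k_i}; the function name J is an illustrative choice): the integral factorizes into Π_i 1/(k_i z + 1), so the largest pole sits at z = -1/max_i k_i with multiplicity equal to the number of coordinates attaining max_i k_i. In particular H(w) = w^2 gives λ_1 = 1/2 and m_1 = 1 (the regular value d/2 for d = 1), while H(w_1, w_2) = (w_1 w_2)^2 keeps λ_1 = 1/2 but has m_1 = 2, the case in which the -(m_1 - 1) log log n term of Theorem 1 appears.

```python
def J(z, exponents, steps=4000):
    # J(z) = integral over (0,1)^d of H(w)^z dw for H(w) = prod_i w_i^{k_i}.
    # The integral factorizes coordinate-wise; each factor is evaluated with
    # the midpoint rule and equals 1/(k*z + 1) in closed form.
    h = 1.0 / steps
    result = 1.0
    for k in exponents:
        result *= sum(((i + 0.5) * h) ** (k * z) for i in range(steps)) * h
    return result

# H(w) = w^2: J(z) = 1/(2z+1), one simple pole at z = -1/2.
# H(w1, w2) = (w1*w2)^2: J(z) = 1/(2z+1)^2, a double pole at z = -1/2.
for z in (0.5, 1.0, 2.0):
    assert abs(J(z, [2]) - 1.0 / (2 * z + 1)) < 1e-4
    assert abs(J(z, [2, 2]) - 1.0 / (2 * z + 1) ** 2) < 1e-4
```

The quadrature only confirms the closed form on Re(z) > 0; the poles and multiplicities are read off from the continued expression Π_i 1/(k_i z + 1), in agreement with Theorem 2.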
Sato, Bernstein, Björk, and Kashiwara proved that, for an arbitrary analytic function H(w), there exist a differential operator D(w, ∂_w, z), which is a polynomial in z, and a polynomial b(z), whose zeros are rational numbers on the negative part of the real axis, such that \n\nD(w, ∂_w, z) H(w)^{z+1} = b(z) H(w)^z   (2) \n\nfor any z ∈ C and any w ∈ W_ε = { w ∈ W ; H(w) < ε }. By using relation (2), the holomorphic function J(z) in Re(z) > 0, \n\nJ(z) ≡ ∫_{H(w)<ε} H(w)^z φ(w) dw = (1/b(z)) ∫_{H(w)<ε} H(w)^{z+1} D*φ(w) dw \n\n(D* denotes the formal adjoint of D), can be analytically continued to the entire complex plane as a meromorphic function whose poles are on the negative part of the real axis. The poles, which are rational numbers ordered from the origin toward minus infinity, are denoted by -λ_1, -λ_2, -λ_3, ..., and their multiplicities by m_1, m_2, m_3, .... Let c_{km} be the coefficient of the m-th order term of the Laurent expansion of J(z) at -λ_k. Then \n\nJ_K(z) ≡ J(z) - Σ_{k=1}^K Σ_{m=1}^{m_k} c_{km} / (z + λ_k)^m   (3) \n\nis holomorphic in Re(z) > -λ_{K+1}, and |J_K(z)| → 0 as |z| → ∞ with Re(z) > -λ_{K+1}. Let us define a function \n\nI(t) = ∫ δ(t - H(w)) φ(w) dw \n\nfor 0 < t < ε, and I(t) = 0 for ε ≤ t ≤ 1, where δ is Dirac's delta function. Then I(t) connects the function F*(n) with J(z) through the relations \n\nJ(z) = ∫_0^1 t^z I(t) dt, \nF*(n) = -log ∫_0^1 exp(-n t) I(t) dt. \n\nThe inverse Laplace transform gives the asymptotic expansion of I(t) as t → 0, resulting in the asymptotic expansion of F*(n), \n\nF*(n) = -log ∫_0^n exp(-t) I(t/n) dt/n = λ_1 log n - (m_1 - 1) log log n + O(1), \n\nwhich is the upper bound of F(n). \n\n3.2 Lower Bound \n\nWe define a random variable \n\nA(x^n) = sup_{w∈W} | n^{1/2} ( H(w, x^n) - H(w) ) / H(w)^{1/2} |. 
\n\nwEW \n\n(4) \n\nThen,  we  prove  in  Appendix that there exists a  constant Co  which  is  independent \nof n  such that \n\nEx n  {A(xn)2}  < Co. \n\n(5) \n\nBy using an inequality ab  ~ (a 2 + b2 )/2, \n\nnH(w,xn) ~ nH(w) - A(xn)(nH(w))1/2 ~ \"2{nH(w)  - A(xn)2}, \n\n1 \n\nwhich derives a  lower  bound, \n\n(6) \n\n\f360 \n\nS.  Watanabe \n\nThe first  term in eq.(6)  is  bounded.  Let the second term be F*(n) , then \n\n-IOg(Zl + Z2) \nr \nr \n\nJH(W)<\u20ac \n\nJH(W)?\u20ac  exp(-\n\n2 \n\n2 \n\nnH(w) \n\nexp( - nH(w)) <p(w)dw  ~ const. n-)'I  (log n)m 1 -1 \n\n) <p(w)dw  ~ exp(-2) ' \n\nnE \n\nwhich proves the lower  bound of F( n), \n\nF(n)  ~ >'llogn - (m1  - 1) loglogn + canst. \n\n4  Resolution of Singularities \n\nIn this section, we construct a method to calculate >'1  and mI.  First of all,  we cover \nthe compact  set  Wo  with  a  finite  union  of open  sets  Wo,.  In  other words,  Wo  C \nUo, Wo,.  Hironaka's  resolution  of singularities  [5J [2J  ensures  that,  for  an  arbitrary \nanalytic  function  H(w),  we  can  algorithmically  find  an  open  set  Uo,  C  Rd  (Uo, \ncontains the origin)  and an analytic function go,  : Uo,  ~ W o,  such that \n\nH(go,(u))  = a(u)  U~I U~2  ...  U~d  (u  E  Uo, ) \n\n(7) \nwhere a( u)  > 0 is a positive function  and ki  ~ 0  (1  ~ i  ~ d)  are even integers (a( u) \nand  k i  depend on Uo,).  Note that Jacobian  Ig~(u) 1  = 0 if and only if u  E  g~l(WO). \n\n(  ()) I I  ( \n<p  go,  u \n\n) I \n\ngo,  U  =  ~  CPI ,P2''' ',Pd u 1 u2  .. , ud \n\nPI  P2 \n\nP,l  + R(  ) \nu  , \n\nfinite \n\"\"\"' \n\n(8) \n\nBy combining  eq.(7)  with eq.(8), we  obtain \n\nlo,(z) \n\nr  H(w)z<p(w) \nJwc> 1 a(u)  {U~I U~2  .. . U~d V Ufl  U~2 .. 'U~d  dUl  dU2  .. \u00b7 dUd . \n\nU '\" \n\nFor real  z , maxo,  lo,(z)  ~ l(z) ~ Lo, lo,(z), \n\n>'1  = min  min \n\nmin \n(PI , ... ,Pd)  l::;q::;d \n\n0, \n\nand m1  is equal to the number of q which attains the minimum,  min. \nl::;q::;d \n\nRemark.  
In a neighborhood of w_0 ∈ W_0, the analytic function H(w) is equivalent to a polynomial H_{w_0}(w); in other words, there exist constants c_1, c_2 > 0 such that c_1 H_{w_0}(w) ≤ H(w) ≤ c_2 H_{w_0}(w). Hironaka's theorem constructs the resolution map g_α for any polynomial H_{w_0}(w) algorithmically in finitely many steps (blowing-ups along nonsingular manifolds contained in the singularities are applied recursively [5]). From the above discussion, we obtain the inequality 1 ≤ m_1 ≤ d. Moreover, since there exists γ > 0 such that H(w) ≤ γ |w - w_0|^2 in a neighborhood of w_0 ∈ W_0, we obtain λ_1 ≤ d/2. \n\nExample. Let us consider a model with (x, y) ∈ R^2 and w = (a, b, c, d) ∈ R^4, \n\np(x, y|w) = p_0(x) (2π)^{-1/2} exp( -(1/2) ( y - ψ(x, w) )^2 ), \nψ(x, a, b, c, d) = a tanh(bx) + c tanh(dx), \n\nwhere p_0(x) is a compactly supported probability density (not estimated). We also assume that the true regression function is y = ψ(x, 0, 0, 0, 0). The set of true parameters is \n\nW_0 = { E_x ψ(X, a, b, c, d)^2 = 0 } = { ab + cd = 0 and ab^3 + cd^3 = 0 }. \n\nAssumptions (A.1), (A.2), and (A.3) are satisfied. The singularity in W_0 which gives the smallest λ_1 is the origin, and in a neighborhood W_o of the origin the average loss function is equivalent to the polynomial H_0(a, b, c, d) = (ab + cd)^2 + (ab^3 + cd^3)^2 (see [9]). Using blowing-ups, we find a map g : (x, y, s, t) ↦ (a, b, c, d), \n\na = x,   b = y^3 t - y s t,   c = s t x,   d = y, \n\nby which the singularity at the origin is resolved: indeed, ab + cd = x y^3 t, ab^3 + cd^3 = x y^3 t ( s + t^2 (y^2 - s)^3 ), and the Jacobian is | x y^3 t |. Therefore \n\nJ(z) = ∫_{W_o} H_0(a, b, c, d)^z φ(a, b, c, d) da db dc dd = ∫ { x^2 y^6 t^2 [ 1 + ( s + t^2 (y^2 - s)^3 )^2 ] }^z | x y^3 t | φ(g(x, y, s, t)) dx dy ds dt, \n\nwhich shows that λ_1 = min{ 2/2, 4/6, 2/2 } = 2/3 and m_1 = 1, so that F(n) = (2/3) log n + const. If the generalization error can be asymptotically expanded, then K(n) ≅ 2/(3n). 
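Once a resolution chart is known, the recipe at the end of Section 4 becomes mechanical. The sketch below is an added illustration; the helper lambda_m is a hypothetical name and treats a single chart with a single leading monomial, a simplification of the full minimum over charts and over the terms of eq.(8). It computes λ_1 = min_q (p_q + 1)/k_q, skipping coordinates with k_q = 0, and m_1 as the number of minimizing coordinates. Fed the exponents of the resolved tanh example, k = (2, 6, 0, 2) from the monomial part of H_0∘g and p = (1, 3, 0, 1) from the Jacobian, it reproduces λ_1 = 2/3 and m_1 = 1.

```python
from fractions import Fraction

def lambda_m(k, p):
    # lambda_1 = min over coordinates q of (p_q + 1)/k_q, where k_q = 0 is
    # skipped (such coordinates contribute no zero of H on the chart);
    # m_1 = number of coordinates attaining the minimum.
    ratios = [Fraction(pq + 1, kq) for kq, pq in zip(k, p) if kq > 0]
    lam = min(ratios)
    return lam, ratios.count(lam)

# Exponents read off from the resolved chart of the tanh example.
lam, m = lambda_m(k=[2, 6, 0, 2], p=[1, 3, 0, 1])
print(lam, m)  # 2/3 1
```

For a regular one-parameter model, H behaves like w^2 with trivial Jacobian, and lambda_m([2], [0]) returns 1/2 with multiplicity 1, the familiar d/2.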
\n\n5 Conclusion \n\nA mathematical foundation for non-identifiable learning machines has been constructed based on algebraic analysis and algebraic geometry. We obtained both the rigorous asymptotic form of the stochastic complexity and an algorithm to calculate it. \n\nAppendix \n\nIn the appendix, we show the inequality eq.(5). \n\nLemma 1. Assume conditions (A.1), (A.2), and (A.3). Then \n\nE_{x^n} { sup_{w∈W} | (1/n^{1/2}) Σ_{i=1}^n [ h(x_i, w) - E_X h(X, w) ] |^2 } < ∞. \n\nThis lemma is proven by the same method as in [10]. In order to prove eq.(5), we divide the supremum sup_{w∈W} in eq.(4) into sup_{H(w)≥ε} and sup_{H(w)<ε}. Finiteness of the first half follows directly from Lemma 1. Let us prove that the second half is also finite. We can assume without loss of generality that w is in a neighborhood of w_0 ∈ W_0, because W can be covered by a finite union of such neighborhoods. In each neighborhood, by using the Taylor expansion of an analytic function, we can find functions { f_j(x, w) } and { g_j(w) = Π_i (w_i - w_{0i})^{a_{ji}} } such that \n\nh(x, w) = Σ_{j=1}^J g_j(w) f_j(x, w),   (9) \n\nwhere { f_j(x, w_0) } are linearly independent functions of x and g_j(w_0) = 0. Since g_j(w) f_j(x, w) is a part of the Taylor expansion around w_0, f_j(x, w) satisfies \n\nE_{x^n} { sup_{H(w)<ε} | (1/n^{1/2}) Σ_{i=1}^n ( f_j(x_i, w) - E_X f_j(X, w) ) |^2 } < ∞.   (10) \n\nWriting ΔH(w) ≡ | H(w, x^n) - H(w) |, we have \n\nΔH(w)^2 = | (1/n) Σ_{i=1}^n Σ_{j=1}^J g_j(w) ( f_j(x_i, w) - E_X f_j(X, w) ) |^2 ≤ Σ_{j=1}^J g_j(w)^2 · Σ_{j=1}^J { (1/n) Σ_{i=1}^n ( f_j(x_i, w) - E_X f_j(X, w) ) }^2, \n\nwhere we used the Cauchy-Schwarz inequality. 
On the other hand, the inequality log x ≥ (1/2)(log x)^2 - x + 1 (x > 0) shows that \n\nH(w) = ∫ q(x) log( q(x)/p(x|w) ) dx ≥ (1/2) ∫ q(x) ( log( q(x)/p(x|w) ) )^2 dx ≥ (a_0/2) Σ_{j=1}^J g_j(w)^2, \n\nwhere a_0 > 0 is the smallest eigenvalue of the positive definite symmetric matrix E_X { f_j(X, w_0) f_k(X, w_0) }. Lastly, combining \n\nsup_{H(w)<ε} n ΔH(w)^2 / H(w) ≤ (2/a_0) sup_{H(w)<ε} Σ_{j=1}^J { (1/n^{1/2}) Σ_{i=1}^n ( f_j(x_i, w) - E_X f_j(X, w) ) }^2 \n\nwith eq.(10), we obtain eq.(5). \n\nAcknowledgments \n\nThis research was partially supported by the Ministry of Education, Science, Sports and Culture in Japan, Grant-in-Aid for Scientific Research 09680362. \n\nReferences \n\n[1] Amari, S., Murata, N. (1993) Statistical theory of learning curves under entropic loss. Neural Computation, 5 (4), pp.140-153. \n\n[2] Atiyah, M.F. (1970) Resolution of singularities and division of distributions. Comm. Pure and Appl. Math., 13, pp.145-150. \n\n[3] Fukumizu, K. (1999) Generalization error of linear neural networks in unidentifiable cases. Lecture Notes in Computer Science, 1720, Springer, pp.51-62. \n\n[4] Hagiwara, K., Toda, N., Usui, S. (1993) On the problem of applying AIC to determine the structure of a layered feed-forward neural network. Proc. of IJCNN, 3, pp.2263-2266. \n\n[5] Hironaka, H. (1964) Resolution of singularities of an algebraic variety over a field of characteristic zero, I, II. Annals of Math., 79, pp.109-326. \n\n[6] Kashiwara, M. (1976) B-functions and holonomic systems. Invent. Math., 38, pp.33-53. \n\n[7] Oaku, T. (1997) An algorithm of computing b-functions. Duke Math. J., 87, pp.115-132. \n\n[8] Sato, M., Shintani, T. (1974) On zeta functions associated with prehomogeneous vector spaces. Annals of Math., 100, pp.131-170. 
\n\n[9] Watanabe, S. (1998) On the generalization error by a layered statistical model with Bayesian estimation. IEICE Trans., J81-A, pp.1442-1452. (English version: Elect. Comm. in Japan, to appear.) \n\n[10] Watanabe, S. (1999) Algebraic analysis for singular statistical estimation. Lecture Notes in Computer Science, 1720, Springer, pp.39-50. \n", "award": [], "sourceid": 1739, "authors": [{"given_name": "Sumio", "family_name": "Watanabe", "institution": null}]}