{"title": "Splines, Rational Functions and Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 1040, "page_last": 1047, "abstract": null, "full_text": "Splines, Rational Functions and Neural Networks\n\nRobert C. Williamson\nDepartment of Systems Engineering\nAustralian National University\nCanberra, 2601\nAustralia\n\nPeter L. Bartlett\nDepartment of Electrical Engineering\nUniversity of Queensland\nQueensland, 4072\nAustralia\n\nAbstract\n\nConnections between spline approximation, approximation with rational functions, and feedforward neural networks are studied. The potential improvement in the degree of approximation in going from single to two hidden layer networks is examined. Some results of Birman and Solomjak, regarding the degree of approximation achievable when knot positions are chosen on the basis of the probability distribution of examples rather than the function values, are extended.\n\n1 INTRODUCTION\n\nFeedforward neural networks have been proposed as parametrized representations suitable for nonlinear regression. Their approximation-theoretic properties are still not well understood. This paper shows some connections with the more widely known methods of spline and rational approximation. A result due to Vitushkin is applied to determine the relative improvement in degree of approximation made possible by having more than one hidden layer. Furthermore, an approximation result relevant to statistical regression, originally due to Birman and Solomjak for Sobolev space approximation, is extended to more general Besov spaces. The two main results are theorems 3.1 and 4.2.\n\n2 SPLINES AND RATIONAL FUNCTIONS\n\nThe two most widely studied nonlinear approximation methods are splines with free knots and rational functions. It is natural to ask what connection, if any, these have with neural networks. It is already known that splines with free knots and rational functions are closely related, as Petrushev and Popov's remarkable result shows:\n\nTheorem 2.1 ([10, chapter 8]) Let\n\n$$R_n(f)_p := \\inf\\{\\|f - r\\|_p : r \\text{ a rational function of degree } n\\},$$\n$$S_n^k(f)_p := \\inf\\{\\|f - s\\|_p : s \\text{ a spline of degree } k-1 \\text{ with } n-1 \\text{ free knots}\\}.$$\n\nIf $f \\in L_p[a,b]$, $-\\infty < a < b < \\infty$, $1 < p < \\infty$, $k \\ge 1$, $0 < \\alpha < k$, then $R_n(f)_p = O(n^{-\\alpha})$ if and only if $S_n^k(f)_p = O(n^{-\\alpha})$.\n\nIn both cases the efficacy of the methods can be understood in terms of their flexibility in partitioning the domain of definition: the partitioning amounts to a \"balancing\" of the error of local linear approximation [4].\n\nThere is an obvious connection between single hidden layer neural networks and splines. For example, replacing the sigmoid $(1 + e^{-x})^{-1}$ by the piecewise linear function $(|x + 1| - |x - 1|)/2$ results in networks that are in one dimension splines, and in $d$ dimensions can be written in \"canonical piecewise linear\" form [3]:\n\n$$f(x) := a + b^T x + \\sum_{i=1}^{n} c_i |\\alpha_i^T x - \\beta_i|$$\n\ndefines $f: \\mathbb{R}^d \\to \\mathbb{R}$, where $a, c_i, \\beta_i \\in \\mathbb{R}$ and $b, \\alpha_i \\in \\mathbb{R}^d$. Note that canonical piecewise linear representations are unique on a compact domain if we use the form $f(x) := \\sum_i c_i |\\alpha_i^T x - 1|$.
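To make the network correspondence concrete, here is a minimal numpy sketch (ours, not from [3]; all names are illustrative) evaluating the canonical piecewise linear form above, i.e. a single hidden layer network with the piecewise linear activation in place of the sigmoid:\n\nimport numpy as np\n\ndef cpwl(x, a, b, c, alpha, beta):\n    # f(x) = a + b.x + sum_i c_i |alpha_i . x - beta_i|, for x in R^d.\n    # Shapes: a scalar; b: (d,); c, beta: (n,); alpha: (n, d).\n    return a + b @ x + c @ np.abs(alpha @ x - beta)\n\n# Sanity check in one dimension: with c = (1/2, -1/2), alpha_i = 1 and\n# beta = (-1, 1), cpwl reduces to the activation (|x + 1| - |x - 1|)/2,\n# which equals x on [-1, 1].\nx = np.array([0.3])\ny = cpwl(x, 0.0, np.zeros(1), np.array([0.5, -0.5]), np.array([[1.0], [1.0]]), np.array([-1.0, 1.0]))\nassert np.isclose(y, 0.3)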
Multilayer piecewise linear nets are not generally canonical piecewise linear: let $g(x,y) := |x+y-1| - |x+y+1| - |x-y+1| - |x-y-1| + x + y$. Then $g(\\cdot,\\cdot)$ is canonical piecewise linear, but $|g(x,y)|$ (a simple two hidden layer network) is not.\n\nThe connection between certain single hidden layer networks and rational functions has been exploited in [13].\n\n3 COMPOSITIONS OF RATIONAL FUNCTIONS\n\nThere has been little effort in the nonlinear approximation literature in understanding nonlinearly parametrized approximation classes \"more complex\" than splines or rational functions. Multiple hidden layer neural networks are in this more complex class. As a first step to understanding the utility of these representations we now consider the degree of approximation of certain smooth function classes via rational functions or compositions of rational functions in the sup-metric. A function $\\phi: \\mathbb{R} \\to \\mathbb{R}$ is rational of degree $\\pi$ if $\\phi$ can be expressed as a ratio of polynomials in $x \\in \\mathbb{R}$ of degree at most $\\pi$. Thus\n\n$$\\phi_\\theta := \\phi_\\theta(x) := \\frac{\\sum_{i=1}^{\\pi} \\alpha_i x^i}{\\sum_{i=1}^{\\pi} \\beta_i x^i}, \\quad x \\in \\mathbb{R}, \\quad \\theta := [\\alpha, \\beta]. \\qquad (3.1)$$\n\nLet $\\sigma_\\pi(f, \\phi) := \\inf\\{\\|f - \\phi_\\theta\\| : \\deg \\phi \\le \\pi\\}$ denote the degree of approximation of $f$ by a rational function of degree $\\pi$ or less. Let $\\psi := \\phi \\circ \\rho$, where $\\phi$ and $\\rho$ are rational functions: $\\rho: \\mathbb{R} \\times \\Theta_\\rho \\to \\mathbb{R}$, $\\phi: \\mathbb{R} \\times \\Theta_\\phi \\to \\mathbb{R}$, both of degree $\\pi$. Let $\\mathbb{F}$ be some function space (metrized by $\\|\\cdot\\|_\\infty$) and let $\\sigma_\\pi(\\mathbb{F}, \\cdot) := \\sup\\{\\sigma_\\pi(f, \\cdot) : f \\in \\mathbb{F}\\}$ denote the degree of approximation of the function class $\\mathbb{F}$.\n\nTheorem 3.1 Let $\\mathbb{F}_\\alpha := W_\\infty^\\alpha(\\Omega)$ denote the Sobolev space of functions from a compact subset $\\Omega \\subset \\mathbb{R}$ to $\\mathbb{R}$ with $s := \\lfloor \\alpha \\rfloor$ continuous derivatives and the $s$th derivative satisfying a Lipschitz condition of order $\\alpha - s$. Then there exist positive constants $c_1$ and $c_2$ not depending on $\\pi$ such that for sufficiently large $\\pi$\n\n$$\\sigma_\\pi(\\mathbb{F}_\\alpha, \\rho) \\ge c_1 \\left(\\frac{1}{2\\pi}\\right)^{\\alpha} \\qquad (3.2)$$\n\nand\n\n$$\\sigma_\\pi(\\mathbb{F}_\\alpha, \\psi) \\ge c_2 \\left(\\frac{1}{4\\pi \\log(\\pi+1)}\\right)^{\\alpha}. \\qquad (3.3)$$\n\nNote that (3.2) is tight: it is achievable. Whether (3.3) is achievable is unknown. The proof is a consequence of theorem 3.4. The above result, although only for rational functions of a single variable, suggests that no great benefit in terms of degree of approximation is to be obtained by using multiple hidden layer networks.\n\n3.1 PROOF OF THEOREM\n\nDefinition 3.2 Let $\\Gamma^d \\subset \\mathbb{R}^d$. A map $r: \\Gamma^d \\to \\mathbb{R}$ is called a piecewise rational function of degree $k$ with barrier $b_q^d$ of order $q$ if there is a polynomial $b_q^d$ of degree $q$ in $x \\in \\Gamma^d$ such that on any connected component $\\Gamma_i \\subset \\Gamma^d \\setminus \\{x : b_q^d(x) = 0\\}$, $r$ is a rational function on $\\Gamma_i$ of degree $k$:\n\n$$r := r(x) := \\frac{P_{d,i}^k(x)}{Q_{d,i}^k(x)}, \\qquad P_{d,i}^k, Q_{d,i}^k \\in \\mathbb{R}[x].$$\n\nNote that at any point $x \\in \\bar{\\Gamma}_i \\cap \\bar{\\Gamma}_j$ ($i \\ne j$), $r$ is not necessarily single valued.\n\nDefinition 3.3 Let $\\mathbb{F}$ be some function class defined on a set $G$ metrized with $\\|\\cdot\\|_\\infty$ and let $\\Theta = \\mathbb{R}^v$. Then $F_{\\varepsilon,k,q}: G \\times \\Theta \\to \\mathbb{R}$, $F_{\\varepsilon,k,q}: (x, \\theta) \\mapsto F(x, \\theta)$, where\n\n1. $F(x, \\theta)$ is a piecewise rational function of $\\theta$ of degree $k$ or less with barrier $b_q^{v,x}$ (possibly depending on $x$) of order $q$;\n\n2. for all $f \\in \\mathbb{F}$ there is a $\\theta \\in \\Theta$ such that $\\|f - F(\\cdot, \\theta)\\| \\le \\varepsilon$;\n\nis called an $\\varepsilon$-representation of $\\mathbb{F}$ of degree $k$ and order $q$.
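As an illustration of the definition, the following sympy sketch (our own; the symbol names are arbitrary) mechanically checks the fact used in the proof below: for fixed $x$, the map $\\theta \\mapsto \\phi_\\theta(x)$ of (3.1) is a ratio of polynomials of degree 1 in $\\theta$, so it is piecewise rational of degree 1 and needs no barrier:\n\nimport sympy as sp\n\npi_deg = 3                          # illustrative degree; any value works\nx0 = sp.Rational(1, 2)              # an arbitrary fixed evaluation point x\nalpha = sp.symbols('alpha1:%d' % (pi_deg + 1))\nbeta = sp.symbols('beta1:%d' % (pi_deg + 1))\nnum = sum(a * x0**i for i, a in enumerate(alpha, start=1))\nden = sum(b * x0**i for i, b in enumerate(beta, start=1))\n# Numerator and denominator are linear in the parameters, whatever the degree:\nassert sp.Poly(num, *alpha).total_degree() == 1\nassert sp.Poly(den, *beta).total_degree() == 1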
Theorem 3.4 ([12, page 191, theorem 1]) If $F_{\\varepsilon,k,q}$ is an $\\varepsilon$-representation of $\\mathbb{F}_\\alpha$ of degree $k$ and order $q$ with barrier $b$ not depending on $x$, then for sufficiently small $\\varepsilon$\n\n$$v \\log[(q+1)(k+1)] \\ge C \\left(\\frac{1}{\\varepsilon}\\right)^{1/\\alpha} \\qquad (3.4)$$\n\nwhere $C$ is a constant not dependent on $\\varepsilon$, $v$, $k$ or $q$.\n\nTheorem 3.4 holds for any $\\varepsilon$-representation $F$ and therefore (by rearrangement of (3.4) and setting $v = 2\\pi$)\n\n$$\\sigma_\\pi(\\mathbb{F}, F) \\ge c \\left(\\frac{1}{2\\pi \\log[(q+1)(k+1)]}\\right)^{\\alpha}. \\qquad (3.5)$$\n\nNow $\\phi_\\theta$ given by (3.1) is, for any given and fixed $x \\in \\mathbb{R}$, a piecewise rational function of $\\theta$ of degree 1 with barrier of degree 0 (no barrier is actually required). Thus (3.5) immediately gives (3.2).\n\nNow consider $\\psi_\\theta = \\phi \\circ \\rho$, where\n\n$$\\phi = \\frac{\\sum_{i=1}^{\\pi_\\phi} \\alpha_i y^i}{\\sum_{i=1}^{\\pi_\\phi} \\beta_i y^i} \\quad (y \\in \\mathbb{R}) \\qquad \\text{and} \\qquad \\rho = \\frac{\\sum_{j=1}^{\\pi_\\rho} \\gamma_j x^j}{\\sum_{j=1}^{\\pi_\\rho} \\delta_j x^j} \\quad (x \\in \\mathbb{R}).$$\n\nDirect substitution and rearrangement gives\n\n$$\\psi_\\theta = \\frac{\\sum_{i=1}^{\\pi_\\phi} \\alpha_i \\left[\\sum_{j=1}^{\\pi_\\rho} \\gamma_j x^j\\right]^i \\left[\\sum_{j=1}^{\\pi_\\rho} \\delta_j x^j\\right]^{\\pi_\\phi - i}}{\\sum_{i=1}^{\\pi_\\phi} \\beta_i \\left[\\sum_{j=1}^{\\pi_\\rho} \\gamma_j x^j\\right]^i \\left[\\sum_{j=1}^{\\pi_\\rho} \\delta_j x^j\\right]^{\\pi_\\phi - i}}$$\n\nwhere we write $\\theta = [\\alpha, \\beta, \\gamma, \\delta]$ and for simplicity set $\\pi_\\phi = \\pi_\\rho = \\pi$. Thus $\\dim \\theta = 4\\pi =: v$. For arbitrary but fixed $x$, $\\psi$ is a rational function of $\\theta$ of degree $k = \\pi$. No barrier is needed, so $q = 0$ and hence by (3.4),\n\n$$\\sigma_\\pi(\\mathbb{F}_\\alpha, \\psi) \\ge c_2 \\left(\\frac{1}{4\\pi \\log(\\pi+1)}\\right)^{\\alpha}.$$\n\n3.2 OPEN PROBLEMS\n\nAn obvious further question is whether results as in the previous section hold for multivariable approximation, perhaps for multivariable rational approximation.\n\nA popular method of $d$-dimensional nonlinear spline approximation uses dyadic splines [2,5,8]. They are piecewise polynomial representations where the partition used is a dyadic decomposition. Given that such a partition $\\Xi$ is a subset of a partition generated by the zero level set of a barrier polynomial of degree $\\le |\\Xi|$, can Vitushkin's results be applied to this situation? Note that in Vitushkin's theory it is the parametrization that is piecewise rational (PR), not the representation. What connections are there in general (if any) between PR representations and PR parametrizations?\n\n4 DEGREE OF APPROXIMATION AND LEARNING\n\nDetermining the degree of approximation for given parametrized function classes is not only of curiosity value. It is now well understood that the statistical sample complexity of learning depends on the size of the approximating class. Ideally the approximating class is small whilst approximating as large an approximated class as possible. Furthermore, in order to make statements such as in [1] regarding the overall degree of approximation achieved by statistical learning, the classical degree of approximation is required.\n\nFor regression purposes the metric used is $L_{p,\\mu}$, where\n\n$$\\|f\\|_{L_{p,\\mu}} := \\left(\\int |f(x)|^p \\, d\\mu(x)\\right)^{1/p}$$\n\nand $\\mu$ is a probability measure.
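In the learning setting $\\mu$ is available only through examples, so in practice this metric is estimated from samples. A minimal sketch of such an estimate (ours; draw_mu is a hypothetical sampler for $\\mu$):\n\nimport numpy as np\n\ndef lp_mu_error(f, v, draw_mu, p=2.0, m=100000):\n    # Monte Carlo estimate of ||f - v||_{L_{p,mu}} = (E_mu |f(X) - v(X)|^p)^(1/p).\n    x = draw_mu(m)\n    return float(np.mean(np.abs(f(x) - v(x)) ** p) ** (1.0 / p))\n\n# Example: mu uniform on [0, 1], comparing sin to the identity.\nrng = np.random.default_rng(0)\nerr = lp_mu_error(np.sin, lambda x: x, lambda m: rng.random(m))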
Ideally one would like to avoid calculating the degree of approximation for an endless series of different function spaces. Fortunately, for the case of spline approximation (with free knots) this is not necessary because (thanks to Petrushev and others) there now exist both direct and converse theorems characterizing such approximation classes. Let $S_n(f)_p$ denote the error of $n$ knot spline approximation in $L_p[0,1]$. Let $I$ denote the identity operator and $T(h)$ the translation operator ($T(h)(f, x) := f(x+h)$) and let $\\Delta_h^k := (T(h) - I)^k$, $k = 1, 2, \\ldots$, be the difference operators. The modulus of smoothness of order $k$ for $f \\in L_p(\\Omega)$ is\n\n$$\\omega_k(f, t)_p := \\sup_{|h| \\le t} \\|\\Delta_h^k f(\\cdot)\\|_{L_p(\\Omega)}.$$\n\nPetrushev [9] has obtained\n\nTheorem 4.1 Let $\\tau = (\\alpha + 1/p)^{-1}$. Then\n\n$$\\sum_{n=1}^{\\infty} \\left[n^{\\alpha} S_n(f)_p\\right]^{\\tau} \\frac{1}{n} < \\infty \\qquad (4.1)$$\n\nif and only if\n\n$$\\left(\\int_0^1 \\left(t^{-\\alpha} \\omega_k(f, t)_{\\tau}\\right)^{\\tau} \\frac{dt}{t}\\right)^{1/\\tau} < \\infty. \\qquad (4.2)$$\n\nThe somewhat strange quantity in (4.2) is the norm of $f$ in a Besov space $B_{\\tau,\\tau;k}^{\\alpha}$. Note that for $\\alpha$ large enough, $\\tau < 1$. That is, the smoothness is measured in an $L_p$ ($p < 1$) space. More generally [11], we have (on domain $[0,1]$)\n\n$$\\|f\\|_{B_{p,q;k}^{\\alpha}} := \\left(\\int_0^1 \\left(t^{-\\alpha} \\omega_k(f, t)_p\\right)^q \\frac{dt}{t}\\right)^{1/q}.$$\n\nBesov spaces are generalizations of classical smoothness spaces such as Sobolev spaces (see [11]).\n\nWe are interested in approximation in $L_{p,\\mu}$ and, following Birman and Solomjak [2], ask what degree of approximation in $L_{p,\\mu}$ can be obtained when the knot positions are chosen according to $\\mu$ rather than $f$. This is of interest because it makes the problem of determining the parameter values on the basis of observations linear.\n\nTheorem 4.2 Let $f \\in L_{p,\\mu}$ where $\\mu \\in L_\\lambda$ for some $\\lambda > 1$ and is absolutely continuous. Choose the $n$ knot positions of a spline approximant $\\nu$ to $f$ on the basis of $\\mu$ only. Then for all such $f$ there is a constant $c$ not dependent on $n$ such that\n\n$$\\|f - \\nu\\|_{L_{p,\\mu}} \\le c\\, n^{-\\alpha} \\|f\\|_{B_{\\sigma,\\sigma;k}^{\\alpha}} \\qquad (4.3)$$\n\nwhere $\\sigma = (\\alpha + (1 - \\lambda^{-1})p^{-1})^{-1}$ and $p < \\sigma$. The constant $c$ depends on $\\mu$ and $\\lambda$.\n\nIf $p \\ge 1$ and $\\sigma \\le p$ then, for any $\\alpha < \\sigma^{-1}$ and all $f$ under the conditions above, there is a $\\nu$ such that\n\n$$\\|f - \\nu\\|_{L_{p,\\mu}} \\le c\\, n^{-(\\alpha + 1/p - 1/\\sigma)} \\|f\\|_{B_{\\sigma,\\sigma;k}^{\\alpha}} \\qquad (4.4)$$\n\nand again $c$ depends on $\\mu$ and $\\lambda$ (for a suitable $\\lambda \\ge 1$ chosen in the proof) but does not depend on $n$.
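The construction in the proof partitions $[0,1]$ into cells of equal $(d\\mu/dx)^{\\lambda}$ mass, after which fitting the spline is the linear problem referred to above. A minimal sketch of both steps (ours, assuming the density $d\\mu/dx$ on $[0,1]$ is available as dmu; all names are illustrative):\n\nimport numpy as np\n\ndef mu_knots(dmu, n, lam=1.0, grid=100001):\n    # Return n - 1 interior knots so that each of the n cells carries an\n    # equal share of the integral of (dmu/dx)^lam, as in the proof below.\n    x = np.linspace(0.0, 1.0, grid)\n    w = np.cumsum(dmu(x) ** lam)\n    w /= w[-1]\n    return np.interp(np.arange(1, n) / n, w, x)\n\ndef fit_spline(knots, xs, ys):\n    # With the knots fixed from mu alone, a continuous piecewise linear\n    # approximant is linear in its coefficients (cf. the canonical piecewise\n    # linear form of section 2), so least squares on the observations suffices.\n    basis = lambda x: np.column_stack([np.ones_like(x), x] + [np.abs(x - t) for t in knots])\n    coef, *_ = np.linalg.lstsq(basis(xs), ys, rcond=None)\n    return lambda x: basis(x) @ coef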
Proof. First we prove (4.3). Let $[0,1]$ be partitioned by $\\Xi$. Thus if $\\nu$ is the approximant to $f$ on $[0,1]$ we have\n\n$$\\|f - \\nu\\|_{L_{p,\\mu}}^p = \\sum_{\\Delta \\in \\Xi} \\|f - \\nu\\|_{L_{p,\\mu}(\\Delta)}^p = \\sum_{\\Delta \\in \\Xi} \\int_{\\Delta} |f(x) - \\nu(x)|^p \\, d\\mu(x).$$\n\nBy Hölder's inequality,\n\n$$\\int_{\\Delta} |f(x) - \\nu(x)|^p \\, d\\mu(x) = \\int_{\\Delta} |f - \\nu|^p \\left(\\frac{d\\mu}{dx}\\right) dx \\le \\left[\\int_{\\Delta} |f - \\nu|^{p(1-\\lambda^{-1})^{-1}} dx\\right]^{1-\\lambda^{-1}} \\left[\\int_{\\Delta} \\left(\\frac{d\\mu}{dx}\\right)^{\\lambda} dx\\right]^{1/\\lambda} = \\|f - \\nu\\|_{L_{\\psi}(\\Delta)}^p \\|d\\mu/dx\\|_{L_{\\lambda}(\\Delta)}$$\n\nwhere $\\psi = p(1 - \\lambda^{-1})^{-1}$. Now Petrushev and Popov [10, p. 216] have shown that there exists a polynomial of degree $k$ on $\\Delta = [r, s]$ such that\n\n$$\\|f - \\nu\\|_{L_{\\psi}(\\Delta)}^p \\le c \\|f\\|_{B(\\Delta)}^p$$\n\nwhere\n\n$$\\|f\\|_{B(\\Delta)} := \\left(\\int_0^{(s-r)/k} \\left(t^{-\\alpha} \\|\\Delta_t^k f(\\cdot)\\|_{L_{\\sigma}(r, s-kt)}\\right)^{\\sigma} \\frac{dt}{t}\\right)^{1/\\sigma}$$\n\nand $\\sigma := (\\alpha + \\psi^{-1})^{-1}$, $0 < \\psi < \\infty$ and $k > 1$. Let $|\\Xi| =: n$ and choose $\\Xi = \\cup_i \\Delta_i$ ($\\Delta_i = [r_i, s_i]$) such that\n\n$$\\int_{\\Delta_i} \\left(\\frac{d\\mu}{dx}\\right)^{\\lambda} dx = \\frac{1}{n} \\|d\\mu/dx\\|_{L_{\\lambda}(0,1)}^{\\lambda}.$$\n\nThus $\\|d\\mu/dx\\|_{L_{\\lambda}(\\Delta)} = n^{-1/\\lambda} \\|d\\mu/dx\\|_{L_{\\lambda}(0,1)}$. Hence\n\n$$\\|f - \\nu\\|_{L_{p,\\mu}}^p \\le c \\|d\\mu/dx\\|_{L_{\\lambda}} \\sum_{\\Delta \\in \\Xi} n^{-1/\\lambda} \\|f\\|_{B(\\Delta)}^p. \\qquad (4.5)$$\n\nSince (by hypothesis) $p < \\sigma$, Hölder's inequality gives\n\n$$\\|f - \\nu\\|_{L_{p,\\mu}}^p \\le c \\|d\\mu/dx\\|_{L_{\\lambda}} \\left[\\sum_{\\Delta \\in \\Xi} \\left(\\frac{1}{n}\\right)^{\\frac{\\sigma}{\\lambda(\\sigma - p)}}\\right]^{1 - \\frac{p}{\\sigma}} \\left[\\sum_{\\Delta \\in \\Xi} \\|f\\|_{B(\\Delta)}^{\\sigma}\\right]^{\\frac{p}{\\sigma}}.$$\n\nNow for arbitrary partitions $\\Xi$ of $[0,1]$ Petrushev and Popov [10, page 216] have shown\n\n$$\\sum_{\\Delta \\in \\Xi} \\|f\\|_{B(\\Delta)}^{\\sigma} \\le \\|f\\|_{B_{\\sigma;k}^{\\alpha}}^{\\sigma}$$\n\nwhere $B_{\\sigma;k}^{\\alpha} = B_{\\sigma,\\sigma;k}^{\\alpha} = B([0,1])$. Hence\n\n$$\\|f - \\nu\\|_{L_{p,\\mu}}^p \\le c \\|d\\mu/dx\\|_{L_{\\lambda}}\\, n^{-\\frac{1}{\\lambda} + 1 - \\frac{p}{\\sigma}} \\|f\\|_{B_{\\sigma;k}^{\\alpha}}^p$$\n\nand so\n\n$$\\|f - \\nu\\|_{L_{p,\\mu}} \\le c\\, n^{-\\alpha} \\|f\\|_{B_{\\sigma;k}^{\\alpha}} \\qquad (4.6)$$\n\nwith $\\sigma = (\\alpha + \\psi^{-1})^{-1}$, $\\psi = p(1 - \\lambda^{-1})^{-1}$. Hence $\\sigma = (\\alpha + \\frac{1 - \\lambda^{-1}}{p})^{-1}$. Thus given $\\alpha$ and $p$, choosing different $\\lambda$ adjusts the $\\sigma$ used to measure $f$ on the right-hand side of (4.6). This proves (4.3).
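For completeness, the exponent arithmetic behind the last step, which is routine but easy to misread: since $\\sigma^{-1} = \\alpha + (1 - \\lambda^{-1})p^{-1}$, we have $\\frac{p}{\\sigma} = p\\alpha + 1 - \\lambda^{-1}$, so\n\n$$-\\frac{1}{\\lambda} + 1 - \\frac{p}{\\sigma} = -\\frac{1}{\\lambda} + 1 - p\\alpha - 1 + \\frac{1}{\\lambda} = -p\\alpha.$$\n\nThus the factor $n^{-1/\\lambda + 1 - p/\\sigma}$ is exactly $n^{-p\\alpha}$, and taking $p$th roots yields the rate $n^{-\\alpha}$ claimed in (4.3) and (4.6).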
Note that because of the restriction that $p < \\sigma$, $\\alpha > 1$ is only achievable for $p < 1$ (which is rarely used in statistical regression [6]). Note also the effect of the term $\\|d\\mu/dx\\|_{L_{\\lambda}}$. When $\\lambda = 1$ this is identically 1 (since $\\mu$ is a probability measure). When $\\lambda > 1$ it measures the departure from the uniform distribution, suggesting that the degree of approximation achievable under non-uniform distributions is worse than under uniform distributions.\n\nEquation (4.4) is proved similarly. When $\\sigma \\le p$ with $p \\ge 1$, for any $\\alpha \\le 1/\\sigma$, we can set $\\lambda := (1 - \\frac{p}{\\sigma} + p\\alpha)^{-1} \\ge 1$. From (4.5) we have\n\n$$\\|f - \\nu\\|_{L_{p,\\mu}}^p \\le c \\|d\\mu/dx\\|_{L_{\\lambda}} \\sum_{\\Delta \\in \\Xi} \\left(\\frac{1}{n}\\right)^{1/\\lambda} \\|f\\|_{B(\\Delta)}^p \\le c \\|d\\mu/dx\\|_{L_{\\lambda}} \\left(\\frac{1}{n}\\right)^{1/\\lambda} \\left[\\sum_{\\Delta \\in \\Xi} \\|f\\|_{B(\\Delta)}^{\\sigma}\\right]^{p/\\sigma} \\le c \\|d\\mu/dx\\|_{L_{\\lambda}}\\, n^{-1 + \\frac{p}{\\sigma} - p\\alpha} \\|f\\|_{B_{\\sigma;k}^{\\alpha}}^p$$\n\nand therefore (4.4) follows. $\\blacksquare$\n\n5 CONCLUSIONS AND FURTHER WORK\n\nIn this paper a result of Vitushkin has been applied to \"multi-layer\" rational approximation. Furthermore, the degree of approximation achievable by spline approximation with free knots when the knots are chosen according to a probability distribution has been examined.\n\nThe degree of approximation of neural networks, particularly multiple layer networks, is an interesting open problem. Ideally one would like both direct and converse theorems, completely characterizing the degree of approximation. If it turns out that from an approximation point of view neural networks are no better than dyadic splines (say), then there is a strong incentive to study the PAC-like learning theory (in the style of [7]) for such spline representations. We are currently working on this topic.\n\nAcknowledgements\n\nThis work was supported in part by the Australian Telecommunications and Electronics Research Board and OTC. The first author thanks Federico Girosi for providing him with a copy of [4]. The second author was supported by an Australian Postgraduate Research Award.\n\nReferences\n\n[1] A. R. Barron, Approximation and Estimation Bounds for Artificial Neural Networks, to appear in Machine Learning, 1992.\n\n[2] M. S. Birman and M. Z. Solomjak, Piecewise-Polynomial Approximations of Functions of the Classes $W_p^{\\alpha}$, Mathematics of the USSR - Sbornik, 2 (1967), pp. 295-317.\n\n[3] L. Chua and A.-C. Deng, Canonical Piecewise-Linear Representation, IEEE Transactions on Circuits and Systems, 35 (1988), pp. 101-111.\n\n[4] R. A. DeVore, Degree of Nonlinear Approximation, in Approximation Theory VI, Volume 1, C. K. Chui, L. L. Schumaker and J. D. Ward, eds., Academic Press, Boston, 1991, pp. 175-201.\n\n[5] R. A. DeVore, B. Jawerth and V. Popov, Compression of Wavelet Decompositions, to appear in American Journal of Mathematics, 1992.\n\n[6] H. Ekblom, $L_p$-methods for Robust Regression, BIT, 14 (1974), pp. 22-32.\n\n[7] D. Haussler, Decision Theoretic Generalizations of the PAC Model for Neural Net and Other Learning Applications, Report UCSC-CRL-90-52, Baskin Center for Computer Engineering and Information Sciences, University of California, Santa Cruz, 1990.\n\n[8] P. Oswald, On the Degree of Nonlinear Spline Approximation in Besov-Sobolev Spaces, Journal of Approximation Theory, 61 (1990), pp. 131-157.\n\n[9] P. P. Petrushev, Direct and Converse Theorems for Spline and Rational Approximation and Besov Spaces, in Function Spaces and Applications (Lecture Notes in Mathematics 1302), M. Cwikel, J. Peetre, Y. Sagher and H. Wallin, eds., Springer-Verlag, Berlin, 1988, pp. 363-377.\n\n[10] P. P. Petrushev and V. A. Popov, Rational Approximation of Real Functions, Cambridge University Press, Cambridge, 1987.\n\n[11] H. Triebel, Theory of Function Spaces, Birkhauser Verlag, Basel, 1983.\n\n[12] A. G. Vitushkin, Theory of the Transmission and Processing of Information, Pergamon Press, Oxford, 1961. Originally published as Otsenka slozhnosti zadachi tabulirovaniya (Estimation of the Complexity of the Tabulation Problem), Fizmatgiz, Moscow, 1959.\n\n[13] R. C. Williamson and U. Helmke, Existence and Uniqueness Results for Neural Network Approximations, submitted, 1992.", "award": [], "sourceid": 442, "authors": [{"given_name": "Robert", "family_name": "Williamson", "institution": null}, {"given_name": "Peter", "family_name": "Bartlett", "institution": null}]}