{"title": "Neural Networks with Quadratic VC Dimension", "book": "Advances in Neural Information Processing Systems", "page_first": 197, "page_last": 203, "abstract": null, "full_text": "Neural Networks with Quadratic VC Dimension\n\nPascal Koiran*\n\nLab. de l'Informatique du Parallelisme\n\nEcole Normale Superieure de Lyon - CNRS\n\n69364 Lyon Cedex 07, France\n\nEduardo D. Sontag†\n\nDepartment of Mathematics\n\nRutgers University\n\nNew Brunswick, NJ 08903, USA\n\nAbstract\n\nThis paper shows that neural networks which use continuous activation\nfunctions have VC dimension at least as large as the square\nof the number of weights w. This result settles a long-standing\nopen question, namely whether the well-known O(w log w) bound,\nknown for hard-threshold nets, also held for more general sigmoidal\nnets. Implications for the number of samples needed for valid generalization\nare discussed.\n\n1 Introduction\n\nOne of the main applications of artificial neural networks is to pattern classification\ntasks. A set of labeled training samples is provided, and a network must be obtained\nwhich is then expected to correctly classify previously unseen inputs. In this context,\na central problem is to estimate the amount of training data needed to guarantee\nsatisfactory learning performance. To study this question, it is necessary to first\nformalize the notion of learning from examples.\nOne such formalization is based on the paradigm of probably approximately correct\n(PAC) learning, due to Valiant (1984). In this framework, one starts by fitting some\nfunction f, chosen from a predetermined class F, to the given training data. The\nclass F is often called the \"hypothesis class\", and for purposes of this discussion it\nwill be assumed that the functions in F take binary values {0, 1} and are defined on a\ncommon domain X. 
(In neural network applications, typically F corresponds to the\nset of all neural networks with a given architecture and choice of activation functions.\nThe elements of X are the inputs, possibly multidimensional.) The training data\nconsists of labeled samples (x_i, c_i), with each x_i ∈ X and each c_i ∈ {0, 1}, and\n\"fitting\" by an f means that f(x_i) = c_i for each i. Given a new example x, one\nuses f(x) as a guess of the \"correct\" classification of x. Assuming that both training\ninputs and future inputs are picked according to the same probability distribution\non X, one needs that the space of possible inputs be well-sampled by the training\ndata, so that f is an accurate fit. We omit the details of the formalization of\nPAC learning, since there are excellent references available, both in textbook (e.g.\nAnthony and Biggs (1992), Natarajan (1991)) and survey paper (e.g. Maass (1994))\nform, and the concept is by now very well-known.\n\n*koiran@lip.ens-lyon.fr\n†sontag@hilbert.rutgers.edu\n\nAfter the work of Vapnik (1982) in statistics and of Blumer et al. (1989) in computational\nlearning theory, one knows that a certain combinatorial quantity, called\nthe Vapnik-Chervonenkis (VC) dimension VC(F) of the class F of interest, completely\ncharacterizes the sample sizes needed for learnability in the PAC sense. (The\nappropriate definitions are reviewed below. In Valiant's formulation one is also interested\nin quantifying the computational effort required to actually fit a function\nto the given training data, but we are ignoring that aspect in the current paper.)\nVery roughly speaking, the number of samples needed in order to learn reliably is\nproportional to VC(F). Estimating VC(F) then becomes a central concern. 
Thus\nfrom now on, we speak exclusively of VC dimension, instead of the original PAC\nlearning problem.\nThe work of Cover (1988) and Baum and Haussler (1989) dealt with the computation\nof VC(F) when the class F consists of networks built up from hard-threshold\nactivations and having w weights; they showed that VC(F) = O(w log w). (Conversely,\nMaass (1993) showed that there is also a lower bound of this form.) It\nwould appear that this definitely settled the VC dimension (and hence also the\nsample size) question.\nHowever, the above estimate assumes an architecture based on hard-threshold\n(\"Heaviside\") neurons. In contrast, the usually employed gradient descent learning\nalgorithms (\"backpropagation\" method) rely upon continuous activations, that is,\nneurons with graded responses. As pointed out in Sontag (1989), the use of analog\nactivations, which allow the passing of rich (not just binary) information among\nlevels, may result in higher memory capacity as compared with threshold nets. This\nhas serious potential implications in learning, essentially because more memory capacity\nmeans that a given function f may be able to \"memorize\" in a \"rote\" fashion\ntoo much data, and less generalization is therefore possible. Indeed, Sontag (1992)\nshowed that there are conceivable (though not very practical) neural architectures\nwith extremely high VC dimensions. Thus the problem of studying VC(F) for analog\nnetworks is an interesting and relevant issue. Two important contributions in\nthis direction were the papers by Maass (1993) and by Goldberg and Jerrum (1995),\nwhich showed upper bounds on the VC dimension of networks that use piecewise\npolynomial activations. 
The last reference, in particular, established for that case\nan upper bound of O(w^2), where, as before, w is the number of weights. However,\nit was an open problem (specifically, \"open problem number 7\" in the recent survey\nby Maass (1993)) whether there is a matching w^2 lower bound for such networks, and more\ngenerally for arbitrary continuous-activation nets. It could have been the case that\nthe upper bound O(w^2) is merely an artifact of the method of proof in Goldberg\nand Jerrum (1995), and that reliable learning with continuous-activation networks\nis still possible with far smaller sample sizes, proportional to O(w log w). But this is\nnot the case, and in this paper we answer Maass' open question in the affirmative.\n\nAssume given an activation σ which has different limits at ±∞, and is such that\nthere is at least one point where it has a derivative and the derivative is nonzero\n(this last condition rules out the Heaviside activation). Then there are architectures\nwith arbitrarily large numbers of weights w and VC dimension proportional\nto w^2. The proof relies on first showing that networks consisting of two types of\nactivations, Heavisides and linear, already have this power. This is a somewhat\nsurprising result, since purely linear networks result in VC dimension proportional\nto w, and purely threshold nets have, as per the results quoted above, VC dimension\nbounded by w log w. Our construction was originally motivated by a related one,\ngiven in Goldberg and Jerrum (1995), which showed that real-number programs (in\nthe Blum-Shub-Smale (1989) model of computation) with running time T have VC\ndimension O(T^2). 
The desired result on continuous activations is then obtained,\napproximating Heaviside gates by σ-nets with large weights and approximating linear\ngates by σ-nets with small weights. This result applies in particular to the\nstandard sigmoid 1/(1 + e^{-x}). (However, in contrast with the piecewise-polynomial\ncase, there is still in that case a large gap between our Ω(w^2) lower bound and\nthe O(w^4) upper bound which was recently established in Karpinski and Macintyre\n(1995).) A number of variations, dealing with Boolean inputs, or weakening\nthe assumptions on σ, are discussed. The full version of this paper also includes\nsome remarks on threshold networks with a constant number of linear gates, and\nthreshold-only nets with \"shared\" weights.\n\nBasic Terminology and Definitions\n\nFormally, a (first-order, feedforward) architecture or network 𝒜 is a connected directed\nacyclic graph together with an assignment of a function to a subset of its\nnodes. The nodes are of two types: those of fan-in zero are called input nodes and\nthe remaining ones are called computation nodes or gates. An output node is a node\nof fan-out zero. To each gate g there is associated a function σ_g : ℝ → ℝ, called the\nactivation or gate function associated to g.\n\nThe number of weights or parameters associated to a gate g is the integer n_g equal\nto the fan-in of g plus one. (This definition is motivated by the fact that each input\nto the gate will be multiplied by a weight, and the results are added together with\na \"bias\" constant term, seen as one more weight; see below.) The (total) number\nof weights (or parameters) of 𝒜 is by definition the sum of the numbers n_g, over all\nthe gates g of 𝒜. 
The number of inputs m of 𝒜 is the total number of input nodes\n(one also says that \"𝒜 has inputs in ℝ^m\"); it is assumed that m > 0. The number\nof outputs p of 𝒜 is the number of output nodes (unless otherwise mentioned, we\nassume by default that all nets considered have one-dimensional outputs, that is,\np = 1).\nTwo examples of gate functions that are of particular interest are the identity or\nlinear gate: Id(x) = x for all x, and the threshold or Heaviside function: H(x) = 1\nif x ≥ 0, H(x) = 0 if x < 0.\nLet 𝒜 be an architecture. Assume that nodes of 𝒜 have been linearly ordered as\nπ_1, ..., π_m, g_1, ..., g_l, where the π_j's are the input nodes and the g_i's the gates. For\nsimplicity, write n_i := n_{g_i} for each i = 1, ..., l. Note that the total number of\nparameters is n = Σ_{i=1}^{l} n_i and the fan-in of each g_i is n_i - 1. To each architecture\n𝒜 (strictly speaking, an architecture together with such an ordering of nodes) we\nassociate a function\n\nF : ℝ^m × ℝ^n → ℝ^p,\n\nwhere p is the number of outputs of 𝒜, defined by first assigning an \"output\" to\neach node, recursively on the distance from the input nodes. Assume given\nan input x ∈ ℝ^m and a vector of weights w ∈ ℝ^n. We partition w into blocks\n(w_1, ..., w_l) of sizes n_1, ..., n_l respectively. First the coordinates of x are assigned\nas the outputs of the input nodes π_1, ..., π_m respectively. For each of the\ngates g_i, we proceed as follows. Assume that outputs y_1, ..., y_{n_i-1} have already\nbeen assigned to the predecessor nodes of g_i (these are input and/or computation\nnodes, listed consistently with the order fixed in advance). Then the output of g_i\nis by definition\n\nσ_{g_i}
(w_{i,0} + w_{i,1} y_1 + w_{i,2} y_2 + ... + w_{i,n_i-1} y_{n_i-1}),\n\nwhere we are writing w_i = (w_{i,0}, w_{i,1}, w_{i,2}, ..., w_{i,n_i-1}). The value of F(x, w) is\nthen by definition the vector (scalar if p = 1) obtained by listing the outputs of the\noutput nodes (in the agreed-upon fixed ordering of nodes). We call F the function\ncomputed by the architecture 𝒜. For each choice of weights w ∈ ℝ^n, there is a\nfunction F_w : ℝ^m → ℝ^p defined by F_w(x) := F(x, w); by abuse of terminology we\nsometimes call this also the function computed by 𝒜 (if the weight vector has been\nfixed).\nAssume that 𝒜 is an architecture with inputs in ℝ^m and scalar outputs, and that\nthe (unique) output gate has range {0, 1}. A subset A ⊆ ℝ^m is said to be shattered\nby 𝒜 if for each Boolean function β : A → {0, 1} there is some weight w ∈ ℝ^n so\nthat F_w(x) = β(x) for all x ∈ A. The Vapnik-Chervonenkis (VC) dimension of 𝒜\nis the maximal size of a subset A ⊆ ℝ^m that is shattered by 𝒜. If the output gate\ncan take non-binary values, we implicitly assume that the result of the computation\nis the sign of the output. That is, when we say that a subset A ⊆ ℝ^m is shattered\nby 𝒜, we really mean that A is shattered by the architecture H(𝒜) in which the\noutput of 𝒜 is fed to a sign gate.\n\n2 Networks Made up of Linear and Threshold Gates\n\nProposition 1 For every n ≥ 1, there is a network architecture 𝒜 with inputs in\nℝ^2 and O(√N) weights that can shatter a set of size N = n^2. This architecture is\nmade only of linear and threshold gates.\n\nProof. Our architecture has n parameters W_1, ..., W_n; each of them is an element\nof T = {0.w_1...w_n ; w_i ∈ {0, 1}}. The shattered set will be S = [n]^2 = {1, ..., n}^2.\nFor a given choice of W = (W_1, ...
, W_n), 𝒜 will compute the boolean function\nf_W : S → {0, 1} defined as follows: f_W(x, y) is equal to the x-th bit of W_y. Clearly,\nfor any boolean function f on S, there exists a (unique) W such that f = f_W.\nWe first consider the obvious architecture which computes the function:\n\nf^1_W(y) = W_1 + Σ_{z=2}^{n} (W_z - W_{z-1}) H(y - z + 1/2)    (1)\n\nsending each point y ∈ [n] to W_y. This architecture has n - 1 threshold gates,\n3(n - 1) + 1 weights, and just one linear gate.\nNext we define a second multi-output net which maps w ∈ T to its binary representation\nf^2(w) = (w_1, ..., w_n). Assume by induction that we have a net 𝒩^2_i\nthat maps w to (w_1, ..., w_i, 0.w_{i+1}...w_n). Since w_{i+1} = H(0.w_{i+1}...w_n - 1/2)\nand 0.w_{i+2}...w_n = 2 × 0.w_{i+1}...w_n - w_{i+1}, 𝒩^2_{i+1} can be obtained by adding one\nthreshold gate and one linear gate to 𝒩^2_i (as well as 4 weights). It follows that 𝒩^2_n\nhas n threshold gates, n linear gates and 4n weights.\nFinally, we define a net 𝒩^3 which takes as input x ∈ [n] and w = (w_1, ..., w_n) ∈\n{0, 1}^n, and outputs w_x. We would like this network to be as follows:\n\nf^3(x, w) = w_1 + Σ_{z=2}^{n} w_z H(x - z + 1/2) - Σ_{z=2}^{n} w_{z-1} H(x - z + 1/2).\n\nThis is not quite possible, because the products between the w_i's (which are inputs\nin this context) and the Heavisides are not allowed. However, since we are dealing\nwith binary variables one can write uv = H(u + v - 1.5). Thus 𝒩^3 has one linear\ngate, 4(n - 1) threshold gates and 12(n - 1) + n weights. Note that f_W(x, y) =\nf^3(x, f^2(f^1_W(y))). This can be realized by means of a net that has n + 2 linear gates,\n(n - 1) + n + 4(n - 1) = 6n - 5 threshold gates, and (3n - 2) + 4n + (12n - 11) = 19n - 13\nweights. 
□\n\nThe following is the main result of this section:\n\nTheorem 1 For every n ≥ 1, there is a network architecture 𝒜 with inputs in ℝ\nand O(√N) weights that can shatter a set of size N = n^2. This architecture is\nmade only of linear and threshold gates.\n\nProof. The shattered set will be S = {0, 1, ..., n^2 - 1}. For every u ∈ S, there\nare unique integers x, y ∈ {0, 1, ..., n - 1} such that u = nx + y. The idea of the\nconstruction is to compute x and y, and then feed (x + 1, y + 1) to the network\nconstructed in Proposition 1. Note that x is the unique integer such that u - nx ∈\n{0, 1, ..., n - 1}. It can therefore be computed by brute force search as follows:\n\nx = Σ_{k=0}^{n-1} k H[H(u - nk) + H(n - 1 - (u - nk)) - 1.5].\n\nThis network has 3n threshold gates, one linear gate and 8n weights. Then of course\ny = u - nx. □\n\nA Boolean version is as follows.\nTheorem 2 For every d ≥ 1, there is a network architecture 𝒜 with O(√N)\nweights that can shatter the N = 2^{2d} points of {0, 1}^{2d}. This architecture is made\nonly of linear and threshold gates.\nProof. Given u ∈ {0, 1}^{2d}, one can compute x = 1 + Σ_{i=1}^{d} 2^{i-1} u_i and y = 1 +\nΣ_{i=1}^{d} 2^{i-1} u_{i+d} with two linear gates. Then (x, y) can be fed to the network of\nProposition 1 (with n = 2^d). □\n\nIn other words, there is a network architecture with O(2^d) weights that can compute\nall boolean functions on 2d variables.\n\n3 Arbitrary Sigmoids\n\nWe now extend the preceding VC dimension bounds to networks that use just\none activation function σ (instead of both linear and threshold gates). All that is\nrequired is that the gate function have a sigmoidal shape and satisfy a very weak\nsmoothness property:\n\n1. 
σ is differentiable at some point x_0 (i.e., σ(x_0 + h) = σ(x_0) + σ'(x_0)h + o(h))\nwhere σ'(x_0) ≠ 0.\n\n2. lim_{x→-∞} σ(x) = 0 and lim_{x→+∞} σ(x) = 1 (the limits 0 and 1 can be\nreplaced by any distinct numbers).\n\nA function satisfying these two conditions will be called sigmoidal. Given any such\nσ, we will show that networks using only σ gates provide quadratic VC dimension.\n\nTheorem 3 Let σ be an arbitrary sigmoidal function. There exist architectures 𝒜_1\nand 𝒜_2 with O(√N) weights made only of σ gates such that:\n• 𝒜_1 can shatter a subset of ℝ of cardinality N = n^2;\n• 𝒜_2 can shatter the N = 2^{2d} points of {0, 1}^{2d}.\n\nThis follows directly from Theorems 1 and 2, together with the following simulation\nresult:\nTheorem 4 Let σ be an arbitrary sigmoidal function. Let 𝒩 be a network of\nT threshold and L linear gates, with a threshold gate at the output. Then 𝒩 can\nbe simulated on any given finite set of inputs by a network 𝒩' of T + L gates that\nall use the activation function σ (except the output gate which is still a threshold).\nMoreover, if 𝒩 has n weights then 𝒩' has O(n) weights.\n\nProof. Let S be a finite set of inputs. We can assume, by changing the thresholds of\nthreshold gates if necessary, that the net input I_g(x) to any threshold gate g of 𝒩\nis different from 0 for all inputs x ∈ S.\n\nGiven ε > 0, let 𝒩_ε be the net obtained by replacing the output functions of all gates\nby the new output function x ↦ σ(x/ε) if this output function is the sign function,\nand by x ↦ σ̃(x) = [σ(x_0 + εx) - σ(x_0)]/[εσ'(x_0)] if it is the identity function. 
Note\nthat for any a > 0, lim_{ε→0+} σ(x/ε) = H(x) uniformly for x ∈ ]-∞, -a] ∪ [a, +∞[\nand lim_{ε→0} σ̃(x) = x uniformly for x ∈ [-1/a, 1/a].\nThis implies by induction on the depth of g that for any gate g of 𝒩 and any input\nx ∈ S, the net input I_{g,ε}(x) to g in the transformed net 𝒩_ε satisfies lim_{ε→0} I_{g,ε}(x) =\nI_g(x) (here, we use the fact that the output function of every g is continuous at\nI_g(x)). In particular, by taking g to be the output gate of 𝒩, we see that 𝒩 and\n𝒩_ε compute the same function on S if ε is small enough. Such a net 𝒩_ε can be\ntransformed into an equivalent net 𝒩' that uses only σ as gate function by a simple\ntransformation of its weights and thresholds. The number of weights remains the\nsame, except at most for a constant term that must be added to each net input to\na gate; thus if 𝒩 has n weights, 𝒩' has at most 2n weights. □\n\n4 More General Gate Functions\n\nThe objective of this section is to establish results similar to Theorem 3, but for\neven more arbitrary gate functions, in particular weakening the assumption that\nlimits exist at infinity. The main result is, roughly, that any σ which is piecewise\ntwice (continuously) differentiable gives at least quadratic VC dimension, save for\ncertain exceptional cases involving functions that are almost everywhere linear.\n\nA function σ : ℝ → ℝ is said to be piecewise C^2 if there is a finite sequence\na_1 < a_2 < ... < a_p such that on each interval I of the form ]-∞, a_1[, ]a_i, a_{i+1}[ or\n]a_p, +∞[, σ|_I is C^2.\n\n(Note: our results hold even if it is only assumed that the second derivative exists in\neach of the above intervals; we do not use the continuity of these second derivatives.)\n\nTheorem 5 Let σ be a piecewise C^2 function. 
For every n ≥ 1, there exists an\narchitecture made of σ-gates, and with O(n) weights, that can shatter a subset of\nℝ^2 of cardinality n^2, except perhaps in the following cases:\n\n1. σ is piecewise-constant, and in this case the VC dimension of any architecture\nof n weights is O(n log n);\n\n2. σ is affine, and in this case the VC dimension of any architecture of n\nweights is at most n;\n\n3. there are constants a ≠ 0 and b such that σ(x) = ax + b except at a finite\nnonempty set of points. In this case, the VC dimension of any architecture\nof n weights is O(n^2), and there are architectures of VC dimension\nΩ(n log n).\n\nDue to the lack of space, the proof cannot be included in this paper. Note that\nthe upper bound of the first special case is tight for threshold nets, and that of the\nsecond special case is tight for linear functions in ℝ^n.\n\nAcknowledgements\n\nPascal Koiran was supported by an INRIA fellowship, DIMACS, and the International\nComputer Science Institute. Eduardo Sontag was supported in part by US\nAir Force Grant AFOSR-94-0293.\n\nReferences\n\nM. ANTHONY AND N.L. BIGGS (1992) Computational Learning Theory: An Introduction,\nCambridge U. Press.\n\nE.B. BAUM AND D. HAUSSLER (1989) What size net gives valid generalization?, Neural\nComputation 1, pp. 151-160.\n\nL. BLUM, M. SHUB AND S. SMALE (1989) On the theory of computation and complexity\nover the real numbers: NP-completeness, recursive functions and universal machines,\nBulletin of the AMS 21, pp. 1-46.\n\nA. BLUMER, A. EHRENFEUCHT, D. HAUSSLER, AND M. WARMUTH (1989) Learnability\nand the Vapnik-Chervonenkis dimension, J. of the ACM 36, pp. 929-965.
\n\nT.M. COVER (1988) Capacity problems for linear machines, in: Pattern Recognition, L.\nKanal ed., Thompson Book Co., pp. 283-289.\n\nP. GOLDBERG AND M. JERRUM (1995) Bounding the Vapnik-Chervonenkis dimension of\nconcept classes parametrized by real numbers, Machine Learning 18, pp. 131-148.\n\nM. KARPINSKI AND A. MACINTYRE (1995) Polynomial bounds for VC dimension of sigmoidal\nneural networks, in Proc. 27th ACM Symposium on Theory of Computing, pp. 200-208.\n\nW. MAASS (1993) Bounds for the computational power and learning complexity of analog\nneural nets, in Proc. of the 25th ACM Symp. Theory of Computing, pp. 335-344.\n\nW. MAASS (1994) Perspectives of current research about the complexity of learning in neural\nnets, in Theoretical Advances in Neural Computation and Learning, V.P. Roychowdhury,\nK.Y. Siu, and A. Orlitsky, editors, Kluwer, Boston, pp. 295-336.\n\nB.K. NATARAJAN (1991) Machine Learning: A Theoretical Approach, M. Kaufmann Publishers,\nSan Mateo, CA.\n\nE.D. SONTAG (1989) Sigmoids distinguish better than Heavisides, Neural Computation 1,\npp. 470-472.\n\nE.D. SONTAG (1992) Feedforward nets for interpolation and classification, J. Comp.\nSyst. Sci. 45, pp. 20-48.\n\nL.G. VALIANT (1984) A theory of the learnable, Comm. of the ACM 27, pp. 1134-1142.\n\nV.N. VAPNIK (1982) Estimation of Dependencies Based on Empirical Data, Springer,\nBerlin.\n", "award": [], "sourceid": 1051, "authors": [{"given_name": "Pascal", "family_name": "Koiran", "institution": null}, {"given_name": "Eduardo", "family_name": "Sontag", "institution": null}]}