{"title": "The Capacity of a Bump", "book": "Advances in Neural Information Processing Systems", "page_first": 556, "page_last": 562, "abstract": null, "full_text": "The Capacity of a Bump \n\nGary William Flake\u00b7 \n\nInstitute for Advance Computer Studies \n\nUniversity of Maryland \nCollege Park, MD 20742 \n\nAbstract \n\nRecently, several researchers have reported encouraging experimental re(cid:173)\nsults when using Gaussian or bump-like activation functions in multilayer \nperceptrons.  Networks of this type usually require fewer  hidden layers \nand  units and often learn much  faster  than typical  sigmoidal networks. \nTo  explain these  results we consider a hyper-ridge network,  which is a \nsimple perceptron with no hidden units and a rid\u00a5e activation function. If \nwe are interested in partitioningp points in d dimensions into two classes \nthen in the limit as d approaches infinity the capacity of a hyper-ridge and \na perceptron is identical.  However, we show that for p  ~ d, which is the \nusual case in practice, the ratio of hyper-ridge to perceptron dichotomies \napproaches pl2(d + 1). \n\n1  Introduction \n\nA  hyper-ridge network is a simple perceptron with no  hidden units and a ridge activation \nfunction.  With  one  output  this  is  conveniently  described  as  y  = g(h)  = g(w  . x  - b) \nwhere  g(h)  =  sgn(1  - h2).  Instead  of dividing  an  input-space  into  two  classes  with  a \nsingle hyperplane, a hyper-ridge network uses two parallel hyperplanes.  All points in the \ninterior of the hyperplanes form one class, while all exterior points form another.  For more \ninformation on hyper-ridges, learning algorithms, and convergence issues the curious reader \nshould consult [3]. \n\nWe wouldn't go so far as to suggest that anyone actually use a hyper-ridge for a real-world \nproblem,  but  it  is  interesting to  note  that  a  hyper-ridge can  represent  linear inseparable \nmappings  such  as  XOR,  NEGATE,  SYMMETRY,  and  COUNT(m)  [2,  3].  Moreover, \nhyper-ridges are very similar to multilayer perceptrons with bump-like activation functions, \nsuch as a Gaussian, in the way the input space is partitioned. Several researchers [6, 2,3, 5] \nhave independently found that Gaussian units offer many advantages over sigmoidal units. \n\n\u00b7Current address:  Adaptive Information and Signal Processing Department, Siemens Corporate \n\nResearch, 755 College Road East, Princeton, NJ 08540. Email:  ftake@scr.siemens .com \n\n\fThe Capacity of a Bump \n\n557 \n\nIn  this  paper  we  derive  the  capacity  of a  hyper-ridge  network.  Our first  result  is  that \nhyper-ridges  and  simple  perceptrons  are  equivalent  in  the  limit  as  the  input  dimension \nsize  approaches  infinity.  However,  when  the  number  of patterns  is far  greater  than  the \ninput dimension (as  is the usual  case) the ratio of hyper-ridge to perceptron  dichotomies \napproaches p/2(d + 1),  giving some evidence that bump-like activation functions offer an \nadvantage over the more traditional sigmoid. \n\nThe rest of this paper is divided into three more sections. In Section 2 we derive the number \nof dichotomies  for  a  hyper-ridge network.  The  capacities  for  hyper-ridges  and  simple \nperceptrons are compared in Section 3. Finally, in Section 4 we give our conclusions. \n\n2  The Representation Power of a Hyper-Ridge \n\nSuppose we have p patterns in the pattern-space, ~d, where d is the number of inputs of our \nneural  network.  
In this paper we derive the capacity of a hyper-ridge network. Our first result is that hyper-ridges and simple perceptrons are equivalent in the limit as the input dimension approaches infinity. However, when the number of patterns is far greater than the input dimension (as is the usual case), the ratio of hyper-ridge to perceptron dichotomies approaches p/(2(d + 1)), giving some evidence that bump-like activation functions offer an advantage over the more traditional sigmoid.

The rest of this paper is divided into three more sections. In Section 2 we derive the number of dichotomies for a hyper-ridge network. The capacities of hyper-ridges and simple perceptrons are compared in Section 3. Finally, in Section 4 we give our conclusions.

2 The Representation Power of a Hyper-Ridge

Suppose we have p patterns in the pattern space, R^d, where d is the number of inputs of our neural network. A dichotomy is a classification of all of the points into two distinct sets. Clearly, at most 2^p dichotomies exist. We are concerned with the number of dichotomies that a single hyper-ridge node can represent. Let the number of dichotomies of p patterns in d dimensions be denoted by D(p, d).

For the case of D(1, d), when p = 1 there are always two and only two dichotomies, since one can trivially include the single point or no points. Thus, D(1, d) = 2.

For the case of D(p, 1), all of the points are constrained to fall on a line. From this set pick two points, say x_a and x_b. It is always possible to place a ridge function such that all points between x_a and x_b (inclusive of the end points) are included in one set, and all other points are excluded. Thus, there are p dichotomies consisting of a single point, p - 1 dichotomies consisting of two points, p - 2 dichotomies consisting of three points, and so on. No other dichotomies besides the empty set are possible. The number of possible hyper-ridge dichotomies in one dimension can now be expressed as

    D(p, 1) = \sum_{i=1}^{p} i + 1 = \frac{1}{2} p (p + 1) + 1,    (1)

with the extra dichotomy coming from the empty set.

To derive the general form of the recurrence relationship, we would have to resort to techniques similar to those used by Cover [1], Nilsson [7], and Gardner [4]. Because of space considerations, we do not give the full derivation of the general form of the recurrence relationship in this paper, but instead cite the complete derivation given in [3]. The short version of the story is that the general form of the recurrence relationship for hyper-ridge dichotomies is identical to the equivalent expression for simple perceptrons:

    D(p, d) = D(p - 1, d) + D(p - 1, d - 1).    (2)

All differences between the capacity of hyper-ridges and simple perceptrons are, therefore, a consequence of the different base cases for the recurrence expression.
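As a sanity check, the recurrence can be evaluated directly. The sketch below uses the base case D(1, d) = 2 for d >= 0, together with the conventions adopted later in the derivation (D(1, -1) = 1 and D(1, d) = 0 for d < -1), and verifies the result against Equation 1.

    from functools import lru_cache

    @lru_cache(maxsize=None)
    def D(p, d):
        # Hyper-ridge dichotomies via Equation 2 and its base cases.
        if p == 1:
            return 2 if d >= 0 else (1 if d == -1 else 0)
        return D(p - 1, d) + D(p - 1, d - 1)

    # Equation 1: D(p, 1) = p(p + 1)/2 + 1.
    for p in range(1, 12):
        assert D(p, 1) == p * (p + 1) // 2 + 1
    print([D(p, 1) for p in range(1, 6)])  # [2, 4, 7, 11, 16]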
To get Equation 2 into closed form, we first expand D(p, d) a total of p times, yielding

    D(p, d) = \sum_{i=0}^{p-1} \binom{p-1}{i} D(1, d - i).    (3)

In Equation 3 it is possible for the second argument of D(1, d - i) to become zero or negative. Taking the two identities D(p, 0) = p + 1 and D(p, -1) = 1 is the only choice that is consistent with the recurrence relationship expressed in Equation 2. With this in mind, there are three separate cases that we need to be concerned with: p < d + 2, p = d + 2, and p > d + 2. When p < d + 2,

    D(p, d) = \sum_{i=0}^{p-1} \binom{p-1}{i} D(1, d - i) = 2 \sum_{i=0}^{p-1} \binom{p-1}{i} = 2^p,    (4)

since all of the second arguments of D(1, d - i) are greater than or equal to zero. When p = d + 2, the last value of d - i in the summation will be equal to -1. Thus we can expand Equation 3 in this case to

    D(p, d) = \sum_{i=0}^{p-1} \binom{p-1}{i} D(1, d - i) = \sum_{i=0}^{p-1} \binom{p-1}{i} D(1, p - 2 - i)
            = \sum_{i=0}^{p-2} \binom{p-1}{i} D(1, p - 2 - i) + 1 = 2 \sum_{i=0}^{p-2} \binom{p-1}{i} + 1
            = 2 (2^{p-1} - 1) + 1 = 2^p - 1.    (5)

Finally, when p > d + 2, some of the last terms D(1, d - i) have negative second arguments. We can disregard all d - i < -1, taking D(1, d - i) equal to zero in these cases (which is consistent with the recurrence relationship):

    D(p, d) = \sum_{i=0}^{p-1} \binom{p-1}{i} D(1, d - i) = \sum_{i=0}^{d+1} \binom{p-1}{i} D(1, d - i)
            = 2 \sum_{i=0}^{d} \binom{p-1}{i} + \binom{p-1}{d+1}.    (6)

Combining Equations 4, 5, and 6 gives

    D(p, d) = 2 \sum_{i=0}^{d} \binom{p-1}{i} + \binom{p-1}{d+1}   for p > d + 2
    D(p, d) = 2^p - 1                                              for p = d + 2    (7)
    D(p, d) = 2^p                                                  for p < d + 2.

3 Comparing Representation Power

Cover [1], Nilsson [7], and Gardner [4] have all shown that D(p, d) for simple perceptrons obeys the rule

    D(p, d) = 2 \sum_{i=0}^{d} \binom{p-1}{i}   for p > d + 2
    D(p, d) = 2^p - 2                           for p = d + 2    (8)
    D(p, d) = 2^p                               for p < d + 2.

The interesting case is when p > d + 2, since that is where Equations 7 and 8 differ the most. Moreover, problems are more difficult when the number of training patterns greatly exceeds the number of trainable weights in a neural network.

Let D_h(p, d) and D_p(p, d) denote the number of dichotomies possible for hyper-ridge networks and simple perceptrons, respectively. Additionally, let C_h and C_p denote the respective capacities. We should expect both D_h(p, d)/2^p and D_p(p, d)/2^p to be at or around 1 for small values of p/(d + 1). At some point, for large p/(d + 1), the 2^p term should dominate, making the ratio go to zero. The capacity of a network can loosely be defined as the value of p/(d + 1) such that D(p, d)/2^p = 1/2. This is more rigorously defined as

    C = { c : \lim_{d \to \infty} \frac{D(c(d + 1), d)}{2^{c(d + 1)}} = \frac{1}{2} },

which is the point at which the transition occurs in the limit as the input dimension goes to infinity.

Figures 1, 2, and 3 illustrate and compare C_p and C_h at different stages. In Figure 1 the capacities are illustrated for perceptrons and hyper-ridges, respectively, by plotting D(p, d)/2^p versus p/(d + 1) for various values of d. In line with our intuition, the ratio D(p, d)/2^p equals 1 for small values of p/(d + 1) but decreases to zero as p/(d + 1) increases. Figure 2 and the left diagram of Figure 3 plot D(p, d)/2^p versus p/(d + 1) for perceptrons and hyper-ridges, side by side, with values of d = 5, 20, and 100. As d increases, the two curves become more similar. This fact is further illustrated in the right diagram of Figure 3, where the plot is of D_h(p, d)/D_p(p, d) versus p for various values of d. The ratio clearly approaches 1 as d increases, but there is a significant difference for smaller values of d.

The differences between D_p and D_h can be more explicitly quantified by noting that

    D_h(p, d) = D_p(p, d) + \binom{p-1}{d+1}

for p > d + 2. This difference clearly shows up in the plots comparing the two capacities.
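This identity is easy to confirm numerically from the closed forms; the sketch below implements the p > d + 2 cases of Equations 7 and 8 (math.comb requires Python 3.8 or later).

    from math import comb

    def D_p(p, d):
        # Perceptron dichotomies, Equation 8, case p > d + 2 (Cover [1]).
        return 2 * sum(comb(p - 1, i) for i in range(d + 1))

    def D_h(p, d):
        # Hyper-ridge dichotomies, Equation 7, case p > d + 2.
        return D_p(p, d) + comb(p - 1, d + 1)

    p, d = 20, 3
    print(D_p(p, d), D_h(p, d))                         # 2320 6196
    assert D_h(p, d) - D_p(p, d) == comb(p - 1, d + 1)  # the extra binomial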
We will now show that the capacities are identical in the limit as d approaches infinity. To do this, we will prove that the capacity curves for both hyper-ridges and perceptrons cross 1/2 at p/(d + 1) = 2. This fact is already widely known for perceptrons. Because of space limitations we will hand-wave our way through the lemma and corollary proofs. The curious reader should consult [3] for the complete proofs.

Lemma 3.1

    \lim_{n \to \infty} \frac{1}{2^{2n}} \binom{2n}{n} = 0.

Short Proof: Since n approaches infinity, we can use Stirling's formula as an approximation of the factorials. □

Corollary 3.2 For all positive integer constants a, b, and c,

    \lim_{n \to \infty} \frac{1}{2^{2n+a}} \binom{2n + b}{n + c} = 0.

Short Proof: When adding the constants b and c to the combination, the whole combination can always be represented as \binom{2n}{n} \cdot y, where y is some multiplicative constant. Such a constant can always be factored out of the limit. Additionally, large values of a only increase the growth rate of the denominator. □

Lemma 3.3 For p/(d + 1) = 2, \lim_{d \to \infty} D_p(p, d)/2^p = 1/2.

Short Proof: Consult any of Cover [1], Nilsson [7], or Gardner [4] for the full proof. □

[Plots omitted.] Figure 1: On the left, D_p(p, d)/2^p versus p/(d + 1), and on the right, D_h(p, d)/2^p versus p/(d + 1), for d = 5, 20, and 100. Notice that for perceptrons the curve always passes through 1/2 at p/(d + 1) = 2. For hyper-ridges, the point where the curve passes through 1/2 decreases as d increases.

[Plots omitted.] Figure 2: On the left, capacity comparison for d = 5. There is considerable difference for small values of d, especially when one considers that the capacities are normalized by 2^p. On the right, comparison for d = 20. The difference between the two capacities is much more subtle now that d is fairly large.

[Plots omitted.] Figure 3: On the left, capacity comparison for d = 100. For this value of d, the capacities are visibly indistinguishable. On the right, D_h(p, d)/D_p(p, d) versus p for d = 1, 2, 5, 10, and 100. For small values of d the capacity of a hyper-ridge is much greater than that of a perceptron. As d grows, the ratio asymptotically approaches 1.

Theorem 3.4 For p/(d + 1) = 2,

    \lim_{d \to \infty} \frac{D_h(p, d)}{2^p} = \frac{1}{2}.

Proof: Taking advantage of the relationship between perceptron dichotomies and hyper-ridge dichotomies allows us to expand D_h(p, d):

    \lim_{d \to \infty} \frac{D_h(p, d)}{2^p} = \lim_{d \to \infty} \frac{D_p(p, d)}{2^p} + \lim_{d \to \infty} \frac{1}{2^p} \binom{p - 1}{d + 1}.

By Lemma 3.3, and substituting 2(d + 1) for p, we get

    \frac{1}{2} + \lim_{d \to \infty} \frac{1}{2^{2d+2}} \binom{2d + 1}{d + 1}.

Finally, by Corollary 3.2 the right limit vanishes, leaving us with 1/2. □
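Theorem 3.4 can also be checked numerically; the sketch below evaluates D_h(p, d)/2^p at p = 2(d + 1) with exact integer arithmetic, and the values drift slowly toward 1/2 as d grows.

    from math import comb

    def D_h(p, d):
        # Equation 7, case p > d + 2 (always true here for d >= 1).
        return 2 * sum(comb(p - 1, i) for i in range(d + 1)) + comb(p - 1, d + 1)

    for d in (5, 20, 100, 500):
        p = 2 * (d + 1)
        print(d, D_h(p, d) / 2 ** p)  # about 0.613, 0.561, 0.528, 0.513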
Superficially, Theorem 3.4 would seem to indicate that there is no difference between the representation power of a perceptron and a hyper-ridge network. However, since this result is only valid in the limit as the number of inputs goes to infinity, it would be interesting to know the exact relationship between D_p(p, d) and D_h(p, d) for finite values of d.

In the right diagram of Figure 3, values of D_h(p, d)/D_p(p, d) are plotted against various values of p. The figure is slightly misleading, since the ratio appears to be linear in p when, in fact, the ratio is only approximately linear in p. If we normalize the ratio by 1/p and take the limit as p approaches infinity, the normalized ratio converges to 1/(2(d + 1)), whose reciprocal is linear in d. Theorem 3.5 establishes this rigorously.

Theorem 3.5

    \lim_{p \to \infty} \frac{1}{p} \cdot \frac{D_h(p, d)}{D_p(p, d)} = \frac{1}{2(d + 1)}.

Proof: First, note that we can simplify the left-hand side of the expression to

    \lim_{p \to \infty} \frac{1}{p} \cdot \frac{D_h(p, d)}{D_p(p, d)} = \lim_{p \to \infty} \frac{1}{p} \cdot \frac{D_p(p, d) + \binom{p-1}{d+1}}{D_p(p, d)} = \lim_{p \to \infty} \frac{1}{p} \cdot \frac{\binom{p-1}{d+1}}{D_p(p, d)}.    (9)

In the next step, we will invert Equation 9, making it easier to work with. We need to show that the new expression is equal to 2(d + 1):

    \lim_{p \to \infty} p \, \frac{D_p(p, d)}{\binom{p-1}{d+1}} = \lim_{p \to \infty} 2p \, \frac{\sum_{i=0}^{d} \binom{p-1}{i}}{\binom{p-1}{d+1}}
        = \lim_{p \to \infty} 2p \sum_{i=0}^{d} \frac{(p - 1)!}{i! (p - i - 1)!} \cdot \frac{(d + 1)! (p - d - 2)!}{(p - 1)!}
        = \lim_{p \to \infty} 2p \sum_{i=0}^{d} \frac{(d + 1)! (p - d - 2)!}{i! (p - i - 1)!}
        = \lim_{p \to \infty} \frac{p}{p - 1 - d} \, 2(d + 1) \sum_{i=0}^{d} \frac{d! (p - d - 1)!}{i! (p - i - 1)!}
        = \lim_{p \to \infty} 2(d + 1) \sum_{i=0}^{d} \frac{d! (p - d - 1)!}{i! (p - i - 1)!}.    (10)

In Equation 10, the summation can be reduced to 1, since

    \lim_{p \to \infty} \frac{d! (p - d - 1)!}{i! (p - i - 1)!} = 0 when 0 <= i < d, and 1 when i = d.

Thus, Equation 10 is equal to 2(d + 1), which proves the theorem. □

The estimate given by Theorem 3.5 is valid only in the case when p >> d, which is typically true in interesting classification problems. The result of the theorem gives us a good estimate of how many more dichotomies are computable with a hyper-ridge network when compared to a simple perceptron. When p >> d, the equation

    \frac{D_h(p, d)}{D_p(p, d)} \approx \frac{p}{2(d + 1)}    (11)

is an accurate estimate of the difference between the capacities of the two architectures. For example, taking d = 4 and p = 60 and applying the values to Equation 11 yields a ratio of 6, which should be interpreted as meaning that one could store six times the number of mappings in a hyper-ridge network that one could in a simple perceptron. Moreover, Equation 11 is in agreement with the right diagram of Figure 3 for all values of p >> d.
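This example can be reproduced in a few lines; the sketch below reuses the closed forms from above and compares the exact ratio with the estimate of Equation 11 (the exact value comes out slightly above 6).

    from math import comb

    def D_p(p, d):  # Equation 8, case p > d + 2
        return 2 * sum(comb(p - 1, i) for i in range(d + 1))

    def D_h(p, d):  # Equation 7, case p > d + 2
        return D_p(p, d) + comb(p - 1, d + 1)

    p, d = 60, 4
    print(D_h(p, d) / D_p(p, d))  # about 6.11, the exact ratio
    print(p / (2 * (d + 1)))      # 6.0, the estimate of Equation 11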
4 Conclusion

An interesting footnote to this work is that the VC dimension [8] of a hyper-ridge network is identical to that of a simple perceptron, namely d. However, the real difference between perceptrons and hyper-ridges is more noticeable in practice, especially when one considers that linearly inseparable problems are representable by hyper-ridges.

We also know that there is no such thing as a free lunch and that generalization is sure to suffer in just the cases when representation power is increased. Yet given all of the comparisons between MLPs and radial basis functions (RBFs), we find it encouraging that there may be a class of approximators that is a compromise between the local nature of RBFs and the global structure of MLPs.

References

[1] T. M. Cover. Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Transactions on Electronic Computers, 14:326-334, 1965.

[2] M. R. W. Dawson and D. P. Schopflocher. Modifying the generalized delta rule to train networks of non-monotonic processors for pattern classification. Connection Science, 4(1), 1992.

[3] G. W. Flake. Nonmonotonic Activation Functions in Multilayer Perceptrons. PhD thesis, University of Maryland, College Park, MD, December 1993.

[4] E. Gardner. Maximum storage capacity in neural networks. Europhysics Letters, 4:481-485, 1987.

[5] F. Girosi, M. Jones, and T. Poggio. Priors, stabilizers and basis functions: from regularization to radial, tensor and additive splines. Technical Report A.I. Memo No. 1430, C.B.C.L. Paper No. 75, MIT AI Laboratory, 1993.

[6] E. Hartman and J. D. Keeler. Predicting the future: Advantages of semilocal units. Neural Computation, 3:566-578, 1991.

[7] N. J. Nilsson. Learning Machines: Foundations of Trainable Pattern Classifying Systems. McGraw-Hill, New York, 1965.

[8] V. N. Vapnik and A. Y. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and Its Applications, 16:264-280, 1971.
", "award": [], "sourceid": 1136, "authors": [{"given_name": "Gary", "family_name": "Flake", "institution": null}]}