{"title": "Complexity of Finite Precision Neural Network Classifier", "book": "Advances in Neural Information Processing Systems", "page_first": 668, "page_last": 675, "abstract": null, "full_text": "668 \n\nDembo, Siu and Kailath \n\nComplexity  of Finite  Precision \n\nNeural Network  Classifier \n\nAmir Dembo1 \nInform.  Systems Lab. \nStanford  University \nStanford,  Calif.  94305 \n\nKai-Yeung Siu \n\nInform.  Systems Lab. \nStanford  University \nStanford,  Calif.  94305 \n\nThomas  Kailath \nInform.  Systems Lab. \nStanford  University \nStanford, Calif.  94305 \n\nABSTRACT \n\nA rigorous analysis on the finite  precision computational <)Spects  of \nneural  network as  a  pattern  classifier  via a  probabilistic  approach \nis  presented.  Even though there exist negative results on  the capa(cid:173)\nbility of perceptron,  we  show  the following  positive results:  Given \nn  pattern vectors each represented by en bits where  e > 1,  that are \nuniformly  distributed,  with  high  probability  the  perceptron  can \nperform  all  possible  binary  classifications  of the  patterns.  More(cid:173)\nover, the resulting neural network requires a  vanishingly small pro(cid:173)\nportion O(log n/n) of the memory that would be required for  com(cid:173)\nplete  storage  of the  patterns.  Further,  the  perceptron  algorithm \ntakes  O(n2)  arithmetic operations  with  high  probability,  whereas \nother  methods  such  as  linear  programming  takes  O(n3 .5 )  in  the \nworst  case.  We also  indicate some mathematical connections  with \nVLSI  circuit  testing and the theory  of random matrices. \n\n1 \n\nIntroduction \n\nIt is  well  known that the  percept ron  algorithm can  be  used  to find  the appropriate \nparameters in a  linear threshold  device  for  pattern  classification, provided the pat(cid:173)\ntern vectors are linearly separable.  Since the number of parameters in a  perceptron \nis significantly fewer  than that needed  to store  the whole  data set,  it is  tempting to \n\n1 The coauthor is now with the Mathematics and Statistics Department of Stanford University. \n\n\fComplexity of Finite Precision Neural Network Classifier \n\n669 \n\nconclude that when  the patterns are linearly separable,  the perceptron  can  achieve \na  reduction  in  storage  complexity.  However,  Minsky  and  Papert  [1]  have  shown \nan example in  which  both the  learning time and the parameters increase  exponen(cid:173)\ntially, when  the perceptron  would need  much more storage than does the whole list \nof patterns. \nWays around  such examples can  be  explored  by noting that analysis that assumes \nreal arithmetic and disregards finite precision aspects might yield misleading results. \nFor example,  we  present  below a  simple network with  one  real  valued  weight that \ncan  simulate  all  possible  classifications  of  n  real  valued  patterns  into  k  classes, \nwhen  unlimited accuracy  and continuous distribution of the  patterns are  assumed. \nFor simplicity,  let  us  assume  the  patterns are  real  numbers in  [0,1].  Consider  the \nfollowing sequence  {xi,i} generated  by each  pattern  Xi  for  i =  1, ... , n: \n\nXi,l  = k\u00b7 Xi  modk \n\nXi,i  = k  . xi,i-l  mod k  lor j  > 1 \n\nU(Xi,j) =  [xi,i) \n\nwhere  []  denotes  the integer part. \nLet I: {Xl, ... , Xn}  --+  {O, ... , k-l} denote the desired classification of the patterns. 
It is easy to see that, for any continuous distribution on [0,1], there exists a j such that u(x_i, j) = f(x_i) for i = 1, ..., n, with probability one. So the network y = u(x, w) may simulate any classification, with w = j determined from the desired classification as shown above.

In this paper, therefore, we emphasize the finite precision computational aspects of pattern classification problems and provide partial answers to the following questions:

• Can the perceptron be used as an efficient form of memory?

• Does the 'learning' time of the perceptron become too long to be practical most of the time, even when the patterns are assumed to be linearly separable?

• How do the convergence results compare to those obtained by solving a system of linear inequalities?

We attempt to answer the above questions by using a probabilistic approach. The theorems will be presented without proofs; details of the proofs will appear in a complete paper. In the following analysis, the phrase 'with high probability' means that the probability of the underlying event goes to 1 as the number of patterns goes to infinity. First, we shall introduce the classical model of a perceptron in more detail and give some known results on its limitations as a pattern classifier.

2 The Perceptron

A perceptron is a linear threshold device which computes a linear combination of the coordinates of the pattern vector, compares the value with a threshold, and outputs +1 or -1 according as the value is larger or smaller than the threshold. More formally, we have

Input: pattern vector x = (x_1, ..., x_d) ∈ R^d
Parameters: weights w = (w_1, ..., w_d) ∈ R^d, threshold θ ∈ R
Output: sign{<w, x> - θ} = sign{Σ_{i=1}^{d} x_i · w_i - θ}

where sign{y} = +1 if y ≥ 0 and -1 otherwise.

Given m patterns x_1, ..., x_m in R^d, there are 2^m possible ways of classifying each of the patterns to ±1. When a desired classification of the patterns is achievable by a perceptron, the patterns are said to be linearly separable. Rosenblatt (1962) [2] showed that if the patterns are linearly separable, then there is a 'learning' algorithm, which he called the perceptron learning algorithm, to find the appropriate parameters w and θ. Let σ_i = ±1 be the desired classification of the pattern x_i. Also, let y_i = σ_i · x_i. The perceptron learning algorithm runs as follows:

1. Set k = 1, choose an initial value w(k) ≠ 0.
2. Select an i ∈ {1, ..., m} and set y(k) = y_i.
3. If w(k) · y(k) > 0, go to 2. Else
4. Set w(k+1) = w(k) + y(k), k = k+1, and go to 2.

The algorithm terminates when the test in step 3 succeeds for all y_i. If the patterns are linearly separable, then the above perceptron algorithm is guaranteed to converge in finitely many iterations, i.e., step 4 will be reached only finitely often.
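As a concrete illustration of steps 1 to 4 (not taken from the original paper), the following short Python sketch runs the perceptron updates on patterns that have already been multiplied by their desired labels, y_i = σ_i · x_i; the function name perceptron_learn and the cyclic choice of the next pattern in step 2 are our own assumptions.

import numpy as np

def perceptron_learn(Y, max_sweeps=1000):
    # Y is an n-by-d array whose rows are y_i = sigma_i * x_i.
    # Returns a weight vector w with w . y_i > 0 for every row, or None.
    n, d = Y.shape
    w = Y[0].astype(float)                 # step 1: any nonzero initial w(1)
    for _ in range(max_sweeps):
        updated = False
        for i in range(n):                 # step 2: select a pattern
            if np.dot(w, Y[i]) <= 0:       # step 3 fails: y_i not yet on the correct side
                w = w + Y[i]               # step 4: additive correction
                updated = True
        if not updated:                    # step 3 holds for all y_i: terminate
            return w
    return None                            # guard against non-separable inputs

For linearly separable patterns the outer loop is guaranteed to stop, by the convergence result just quoted; the max_sweeps cap only guards against non-separable inputs.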
The existence of such a simple and elegant 'learning' algorithm attracted a great deal of interest during the 60's. However, the capability of the perceptron is very limited, since only a small portion of the 2^m possible binary classifications can be achieved. In fact, Cover (1965) [3] has shown that a perceptron can classify the patterns in at most

2 · Σ_{i=0}^{d-1} C(m-1, i) = O(m^(d-1))

different ways out of the 2^m possibilities. The above upper bound O(m^(d-1)) is achieved when the pattern vectors are in general position, i.e., when every subset of d vectors in {x_1, ..., x_m} is linearly independent. An immediate generalization of this result is the following:

Theorem 1. For any function f(w, x) which lies in a function space of dimension r, i.e., such that we can write

f(w, x) = a_1(w) f_1(x) + ... + a_r(w) f_r(x),

the number of possible classifications of m patterns by sign{f(w, x)} is bounded by O(m^(r-1)).

3 A New Look at the Perceptron

The reason why the perceptron is so limited in its capability as a pattern classifier is that the dimension of the pattern vector space is kept fixed while the number of patterns is increased. We instead consider the binary expansion of each coordinate and view the real pattern vector as a binary vector, but in a much higher dimensional space. The intuition behind this is that we are now making use of every bit of information in the pattern. Let us assume that each pattern vector has dimension d and that each coordinate is given with m bits of accuracy, which grows with the number of patterns n in such a way that d · m = c · n for some c > 1. By considering the binary expansion, we can treat the patterns as binary vectors, i.e., each vector belongs to {+1, -1}^(cn). If we want to classify the patterns into k classes, we can use log k binary classifiers, each classifying the patterns into the corresponding bit of the binary encoding of the k classes. So, without loss of generality, we assume that the number of classes equals 2.
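To make the re-encoding concrete, here is a small Python sketch (our own illustration; the function name and the truncation to m bits are assumptions, not the paper's notation) that expands a pattern of d coordinates, each taken to m bits, into a vector in {+1, -1}^(dm).

import numpy as np

def to_pm1_bits(x, m):
    # x: array of d coordinates in [0, 1), each kept to m bits of accuracy.
    # Returns a vector of length d*m with entries in {+1, -1}.
    bits = []
    for coord in x:
        v = int(coord * (1 << m))          # first m binary digits of the coordinate
        for j in range(m - 1, -1, -1):
            bits.append(1 if (v >> j) & 1 else -1)
    return np.array(bits)

# Example: d = 2 coordinates with m = 4 bits each give a vector in {+1, -1}^8.
print(to_pm1_bits(np.array([0.8125, 0.25]), 4))   # the 8 resulting +/-1 bits

For k > 2 classes one would train log k such binary classifiers, one per bit of the class label, exactly as described above.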
Now the classification problem can be viewed as the implementation of a partial Boolean function whose value is specified on only n inputs out of the 2^(cn) possible ones. For arbitrary input patterns, there does not seem to exist an efficient method other than complete storage of the patterns and the use of a look-up table for classification, which will require O(n^2) bits. It is natural to ask whether this is the best we can do. Surprisingly, using the probabilistic method in combinatorics [4] (counting arguments), we can show the following:

Theorem 2. For n sufficiently large, there exists a system that can simulate all possible binary classifications with parameter storage of n + 2 log n bits.

Moreover, a recent result from the theory of VLSI testing [5] implies that at least n + log n bits are needed. As the proof of Theorem 2 is non-constructive, both the learning of the parameters and the retrieval of the desired classification in this 'optimal' system may be too complex for any practical purpose. Besides, since there is almost no redundancy in the storage of parameters in such an 'optimal' system, it will have no 'generalization' properties, i.e., it is difficult to predict what the output of the system would be on patterns that were not trained. However, a perceptron classifier, while sub-optimal in terms of Theorem 3 below, requires only O(n log n) bits for parameter storage, compared with O(n^2) bits for a table look-up classifier. In addition, it will exhibit 'generalization' properties in the sense that new patterns that are close in Hamming distance to the trained patterns are likely to be classified into the same class. So, if we allow some vanishingly small probability of error, we can give an affirmative answer to the first question raised at the beginning:

Theorem 3. Assume the n pattern vectors are uniformly distributed over {+1, -1}^(cn). Then, with high probability, the patterns can be classified in all 2^n possible ways using the perceptron algorithm. Further, the storage of the parameters requires only O(n log n) bits.

In other words, when the input patterns are given with high precision, the perceptron can be used as an efficient form of memory.

The known upper bound on the learning time of the perceptron depends on the maximum length of the input pattern vectors and on the minimum distance δ of the pattern vectors to a separating hyperplane. In the following analysis, our probabilistic assumption guarantees that the pattern vectors are linearly independent with high probability and thus linearly separable. In order to give a probabilistic upper bound on the learning time of the perceptron, we first give a lower bound on the minimum distance δ that holds with high probability:

Lemma 1. Let n be the number of pattern vectors, each in R^m, where m = (1 + ε)n and ε > 0 is any constant. Assume the entries of each vector v are i.i.d. random variables with zero mean and bounded second moment. Then, with probability → 1 as n → ∞, there exists a separating hyperplane and a δ* > 0 such that each vector is at a distance of at least δ* from it.

In our case, each coordinate of the patterns is assumed to be equally likely ±1, and clearly the conditions of the above lemma are satisfied. In general, when the dimension of the pattern vectors is larger than, and increases linearly with, the number of patterns, the above result applies, provided the patterns are given with high enough precision that a continuous distribution is a sufficiently good model for analysis.

The above lemma makes use of a famous conjecture from the theory of random matrices [6] which gives a lower bound on the minimum singular value of a random matrix. We actually proved this conjecture during our course of study; it states that the minimum singular value of a cn by n random matrix with c > 1 grows as √n almost surely.

Theorem 4. Let A_n be a cn × n random matrix with c > 1, whose entries are i.i.d. with zero mean and bounded second moment, and let σ(·) denote the minimum singular value of a matrix. Then there exists β > 0 such that

liminf_{n→∞} σ(A_n) / √n > β

with probability 1.
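Theorem 4 is easy to probe numerically. The following short Python sketch (purely illustrative and not part of the paper; the aspect ratio c = 2 and the sample sizes are arbitrary choices) estimates σ(A_n)/√n for random ±1 matrices of growing size, and the ratio is observed to stay bounded away from zero.

import numpy as np

rng = np.random.default_rng(0)
c = 2                                            # aspect ratio of the cn x n matrix
for n in (50, 100, 200, 400):
    A = rng.choice([-1.0, 1.0], size=(c * n, n))         # i.i.d. +/-1 entries
    sigma_min = np.linalg.svd(A, compute_uv=False)[-1]   # minimum singular value
    print(n, sigma_min / np.sqrt(n))             # ratio stays roughly constant as n grows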
Note that our probabilistic assumption on the patterns includes a wide class of distributions, in particular the zero mean normal distribution and the symmetric uniform distribution on a bounded interval. In addition, these distributions satisfy the following condition:

(*) There exists an a > 0 such that P{|v| > a√n} → 0 as n → ∞.

Before we answer the last two questions raised at the beginning, we state the following known result on the perceptron algorithm as a second lemma:

Lemma 2. Suppose there exists a unit vector w* such that w* · v > δ for some δ > 0 and for all pattern vectors v. Then the perceptron algorithm will converge to a solution vector in at most N^2 / δ^2 iterations, where N is the maximum length of the pattern vectors.

Now we are ready to state the following:

Theorem 5. Suppose the patterns satisfy the probabilistic assumptions stated in Lemma 1 and the condition (*). Then, with high probability, the perceptron takes O(n^2) arithmetic operations to terminate.

As mentioned earlier, another way of finding a separating hyperplane is to solve a system of linear inequalities using linear programming, which requires O(n^3.5) arithmetic operations [7]. Under our probabilistic assumptions the patterns are linearly independent with high probability, so that we can actually solve a system of linear equations instead; however, this still requires O(n^3) arithmetic operations. Further, these methods require batch processing, in the sense that all patterns have to be stored in advance in order to find the desired parameters, in contrast to the sequential 'learning' nature of the perceptron algorithm. So, for training this neural network classifier, the perceptron algorithm seems preferable.

When the number of patterns is polynomial in the total number of bits representing each pattern, we may first extend each vector to a dimension at least as large as the number of patterns, and then apply the perceptron to compress the storage of parameters. One way of adding these extra bits is to form products of the coordinates within each pattern. Note that by doing so, the coordinates of each pattern are pairwise independent. We conjecture that Theorem 3 still applies, implying an even greater reduction in storage requirements. Simulation results strongly support our conjecture.

4 Conclusion

In this paper, the finite precision computational aspects of pattern classification problems are emphasized. We show that the perceptron, in contrast to common belief, can be quite efficient as a pattern classifier, provided the patterns are given with high enough precision. Using a probabilistic approach, we show that the perceptron algorithm can even outperform linear programming under certain conditions. During the course of this work, we also discovered some mathematical connections with VLSI circuit testing and the theory of random matrices. In particular, we have proved an open conjecture regarding the minimum singular value of a random matrix.
Acknowledgements

This work was supported in part by the Joint Services Program at Stanford University (US Army, US Navy, US Air Force) under Contract DAAL03-88-C-0011, and by NASA Headquarters, Center for Aeronautics and Space Information Sciences (CASIS), under Grant NAGW-419-S5.

References

[1] M. Minsky and S. Papert, Perceptrons, The MIT Press, expanded edition, 1988.

[2] F. Rosenblatt, Principles of Neurodynamics, Spartan Books, New York, 1962.

[3] T. M. Cover, \"Geometrical and Statistical Properties of Systems of Linear Inequalities with Applications in Pattern Recognition\", IEEE Trans. on Electronic Computers, EC-14:326-334, 1965.

[4] P. Erdos and J. Spencer, Probabilistic Methods in Combinatorics, Academic Press/Akademiai Kiado, New York-Budapest, 1974.

[5] G. Seroussi and N. Bshouty, \"Vector Sets for Exhaustive Testing of Logic Circuits\", IEEE Trans. Inform. Theory, IT-34:513-522, 1988.

[6] J. Cohen, H. Kesten and C. Newman, editors, Random Matrices and Their Applications, volume 50 of Contemporary Mathematics, American Mathematical Society, 1986.

[7] N. Karmarkar, \"A New Polynomial-Time Algorithm for Linear Programming\", Combinatorica, 4:373-395, 1984.
", "award": [], "sourceid": 277, "authors": [{"given_name": "Amir", "family_name": "Dembo", "institution": null}, {"given_name": "Kai-Yeung", "family_name": "Siu", "institution": null}, {"given_name": "Thomas", "family_name": "Kailath", "institution": null}]}