{"title": "Learning Stochastic Perceptrons Under k-Blocking Distributions", "book": "Advances in Neural Information Processing Systems", "page_first": 279, "page_last": 286, "abstract": null, "full_text": "Learning Stochastic Perceptrons Under k-Blocking Distributions\n\nMario Marchand\nOttawa-Carleton Institute for Physics\nUniversity of Ottawa\nOttawa, Ont., Canada K1N 6N5\nmario@physics.uottawa.ca\n\nSaeed Hadjifaradji\nOttawa-Carleton Institute for Physics\nUniversity of Ottawa\nOttawa, Ont., Canada K1N 6N5\nsaeed@physics.uottawa.ca\n\nAbstract\n\nWe present a statistical method that PAC learns the class of stochastic perceptrons with arbitrary monotonic activation function and weights $w_i \\in \\{-1, 0, +1\\}$ when the probability distribution that generates the input examples is a member of a family that we call k-blocking distributions. Such distributions represent an important step beyond the case where each input variable is statistically independent since the 2k-blocking family contains all the Markov distributions of order k. By stochastic perceptron we mean a perceptron which, upon presentation of input vector x, outputs 1 with probability $f(\\sum_i w_i x_i - \\theta)$. Because the same algorithm works for any monotonic (nondecreasing or nonincreasing) activation function f on Boolean domain, it handles the well studied cases of sigmoids and the \"usual\" radial basis functions.\n\n1 INTRODUCTION\n\nWithin recent years, the field of computational learning theory has emerged to provide a rigorous framework for the design and analysis of learning algorithms. 
A central notion in this framework, known as the \"Probably Approximately Correct\" (PAC) learning criterion (Valiant, 1984), has recently been extended (Haussler, 1992) to analyze the learnability of probabilistic concepts (Kearns and Schapire, 1994; Schapire, 1992). Such concepts, which are stochastic rules that give the probability that input example x is classified as being positive, are natural probabilistic extensions of the deterministic concepts originally studied by Valiant (1984).\n\nMotivated by the stochastic nature of many \"real-world\" learning problems and by the indisputable fact that biological neurons are probabilistic devices, some preliminary studies about the PAC learnability of simple probabilistic neural concepts have been reported recently (Golea and Marchand, 1993; Golea and Marchand, 1994). However, the probabilistic behaviors considered in these studies are quite specific and clearly need to be extended. Indeed, only classification noise superimposed on a deterministic signum function was considered in Golea and Marchand (1993). The probabilistic network analyzed in Golea and Marchand (1994) consists of a linear superposition of signum functions and is thus solvable as a (simple) case of linear regression. What is clearly needed is the extension to the non-linear cases of sigmoids and radial basis functions. Another criticism about Golea and Marchand (1993, 1994) is the fact that their learnability results were established only for distributions where each input variable is statistically independent from all the others (sometimes called product distributions). 
In fact, very few positive learning results for non-trivial p-concept classes are known to hold for larger classes of distributions. Therefore, in an effort to find algorithms that will work in practice, we introduce in this paper a new family of distributions that we call k-blocking. As we will argue, this family has the dual advantage of avoiding malicious and unnatural distributions that are prone to render simple concept classes unlearnable (Lin and Vitter, 1991) and of being likely to contain several distributions found in practice.\n\nOur main contribution is to present a simple statistical method that PAC learns (in polynomial time) the class of stochastic perceptrons with monotonic (but otherwise arbitrary) activation functions and weights $w_i \\in \\{-1, 0, +1\\}$ when the input examples are generated according to any distribution member of the k-blocking family. Due to space constraints, only a sketch of the proofs is presented here.\n\n2 DEFINITIONS\n\nThe instance (input) space, $I^n$, is the Boolean domain $\\{-1, +1\\}^n$. The set of all input variables is denoted by X. Each input example x is generated according to some unknown distribution D on $I^n$. We will often use $p_D(x)$, or simply p(x), to denote the probability of observing the vector value x under distribution D. If U and V are two disjoint subsets of X, $x_U$ and $x_V$ will denote the restriction (or projection) of x over the variables of U and V respectively and $p_D(x_U | x_V)$ will denote the probability, under distribution D, of observing the vector value $x_U$ (for the variables in U) given that the variables in V are set to the vector value $x_V$.\n\nFollowing Kearns and Schapire (1994), a probabilistic concept (p-concept) is a map $c : I^n \\to [0, 1]$ for which c(x) represents the probability that example x is classified as positive. 
More precisely, upon presentation of input x, an output of $\\sigma = 1$ is generated (by an unknown target p-concept) with probability c(x) and an output of $\\sigma = 0$ is generated with probability 1 - c(x).\n\nA stochastic perceptron is a p-concept parameterized by a vector of n weights $w_i$ and an activation function $f(\\cdot)$ such that the probability that input example x is classified as positive is given by\n\n$$c(x) = f\\left(\\sum_{i=1}^{n} w_i x_i\\right) \\qquad (1)$$\n\nWe consider the case of a non-linear function $f(\\cdot)$ since the linear case can be solved by a standard least square approximation like the one performed by Kearns and Schapire (1994) for linear sums of basis functions. We restrict ourselves to the case where $f(\\cdot)$ is monotonic, i.e. either nondecreasing or nonincreasing. But since any nonincreasing $f(\\cdot)$ combined with a weight vector w can always be represented by a nondecreasing $f(\\cdot)$ combined with a weight vector -w, we can assume without loss of generality that the target stochastic perceptron has a nondecreasing $f(\\cdot)$. Hence, we allow any sigmoid-type of activation function (with arbitrary threshold). Also, since our instance space $I^n$ is on an n-sphere, eq. 1 also includes any nonincreasing radial basis function of the type $\\phi(z^2)$ where $z = |x - w|$ and w is interpreted as the \"center\" of $\\phi$. The only significant restriction is on the weights where we allow only for $w_i \\in \\{-1, 0, +1\\}$.\n\nAs usual, the goal of the learner is to return a hypothesis h which is a good approximation of the target p-concept c. 
But, in contrast with decision rule learning which attempts to \"filter out\" the noisy behavior by returning a deterministic hypothesis, the learner will attempt the harder (and more useful) task of modeling the target p-concept by returning a p-concept hypothesis. As a measure of error between the target and the hypothesis p-concepts we adopt the variation distance $d_v(\\cdot, \\cdot)$ defined as:\n\n$$\\mathrm{err}(h, c) = d_v(h, c) \\stackrel{\\mathrm{def}}{=} \\sum_{x} p_D(x)\\, |h(x) - c(x)| \\qquad (2)$$\n\nwhere the summation is over all the $2^n$ possible values of x. Hence, the same D is used for both training and testing. The following formulation of the PAC criterion (Valiant, 1984; Haussler, 1992) will be sufficient for our purpose.\n\nDefinition 1 Algorithm A is said to PAC learn the class C of p-concepts by using the hypothesis class H (of p-concepts) under a family $\\mathcal{D}$ of distributions on instance space $I^n$, iff for any $c \\in C$, any $D \\in \\mathcal{D}$, any $0 < \\epsilon, \\delta < 1$, algorithm A returns in a time polynomial in $(1/\\epsilon, 1/\\delta, n)$, a hypothesis $h \\in H$ such that with probability at least $1 - \\delta$, $\\mathrm{err}(h, c) < \\epsilon$.\n\n3 K-BLOCKING DISTRIBUTIONS\n\nTo learn the class of stochastic perceptrons, the algorithm will try to discover each weight $w_i$ that connects to input variable $x_i$ by estimating how the probability of observing a positive output ($\\sigma = 1$) is affected by \"hard-wiring\" variable $x_i$ to some fixed value. This should clearly give some information about $w_i$ when $x_i$ is statistically independent from all the other variables as was the case for Golea and Marchand (1993) and Schapire (1992). 
However, if the input variables are correlated, then the process of fixing variable $x_i$ will carry over to neighboring variables which in turn will affect other variables until all the variables are perturbed (even in the simplest case of a first order Markov chain). The information about $w_i$ will then be smeared by all the other weights. Therefore, to obtain information only on $w_i$, we need to break this \"chain reaction\" by fixing some other variables. The notion of blocking sets serves this purpose.\n\nLoosely speaking, a set of variables is said to be a blocking set\u00b9 for variable $x_i$ if the distribution on all the remaining variables is unaffected by the setting of $x_i$ whenever all the variables of the blocking set are set to a fixed value. More precisely, we have:\n\nDefinition 2 Let B be a subset of X and let $U = X - (B \\cup \\{x_i\\})$. Let $x_B$ and $x_U$ be the restriction of x on B and U respectively and let b be an assignment for $x_B$. Then B is said to be a blocking set for variable $x_i$ (with respect to D), iff:\n\n$$p_D(x_U | x_B = b, x_i = +1) = p_D(x_U | x_B = b, x_i = -1) \\quad \\text{for all } b \\text{ and } x_U$$\n\nIn addition, if B is no longer a blocking set when we remove any one of its variables, we then say that B is a minimal blocking set for variable $x_i$.\n\nWe thus adopt the following definition for the k-blocking family.\n\nDefinition 3 Distribution D on $I^n$ is said to be k-blocking iff $|B_i| \\le k$ for $i = 1, 2, \\ldots, n$ when each $B_i$ is a minimal blocking set for variable $x_i$.\n\nThe k-blocking family is quite a large class of distributions. In fact we have the following property:\n\nProperty 1 All Markov distributions of kth order are members of the 2k-blocking family. 
\n\nProof: By kth order Markov distributions, we mean distributions which can be exactly written as a Chow(k) expansion (see Hoeffgen, 1993) for some permutation of the variables. We prove it here (by using standard techniques such as in Abend et al., 1965) for first order Markov distributions; the generalization for k > 1 is straightforward. Recall that for Markov chain distributions we have: $p(x_j | x_{j-1}, \\ldots, x_1) = p(x_j | x_{j-1})$ for $1 < j \\le n$. Hence:\n\n$$\\begin{aligned}\n& p(x_1 \\ldots x_{j-2}, x_{j+2} \\ldots x_n \\mid x_{j-1}, x_j, x_{j+1}) \\\\\n&= \\frac{p(x_1)\\, p(x_2 | x_1) \\cdots p(x_j | x_{j-1})\\, p(x_{j+1} | x_j) \\cdots p(x_n | x_{n-1})}{p(x_{j-1}, x_j, x_{j+1})} \\\\\n&= \\frac{p(x_1)\\, p(x_2 | x_1) \\cdots p(x_{j-1} | x_{j-2})\\, p(x_{j+2} | x_{j+1}) \\cdots p(x_n | x_{n-1})}{p(x_{j-1})} \\\\\n&= p(x_1 \\ldots x_{j-2}, x_{j+2} \\ldots x_n \\mid x_{j-1}, \\bar{x}_j, x_{j+1})\n\\end{aligned}$$\n\nwhere $\\bar{x}_j$ denotes the negation of $x_j$. Thus, we see that Markov chain distributions are a special case of 2-blocking distributions: the blocking set of each variable consisting only of the two first-neighbor variables. \u25a1\n\n\u00b9 The wording \"blocking set\" was also used by Hancock & Mansour (Proc. of COLT'91, 179-183, Morgan Kaufmann Publ.) to denote a property of the target concept. In contrast, our definition of blocking set denotes a property of the input distribution only.\n\nThe proposed algorithm for learning stochastic perceptrons needs to be provided with a blocking set (of at most k variables) for each input variable. Hoeffgen (1993) has recently proven that Chow(1) and Chow(k > 1) expansions are efficiently learnable; the latter under some restricted conditions. We can thus use these algorithms to discover the blocking sets for such distributions. 
However, the efficient learnability of unrestricted Chow(k > 1) expansions and larger classes of distributions, such as the k-blocking family, is still unknown. In fact, from the hardness results of Hoeffgen (1993), we can see that it is definitely very hard (perhaps NP-complete) to find the blocking sets if the learner has no information available other than the fact that the distribution is k-blocking. On the other hand, we can argue that the \"natural\" ordering of the variables present in many \"real-world\" situations is such that the blocking set of any given variable is among the neighboring variables. In vision for example, we expect that the setting of a pixel will directly affect only those located in its neighborhood; the other pixels being affected only through this neighborhood. In such cases, the neighborhood of a variable \"naturally\" provides its blocking set.\n\n4 LEARNING STOCHASTIC PERCEPTRONS\n\nWe first establish (the intuitive fact) that, without making much error, we can always consider that the target p-concept is defined only over the variables which are not almost always set to the same value.\n\nLemma 1 Let V be a set of v variables $x_i$ for which $\\Pr(x_i = a_i) > 1 - \\alpha$. Let c be a p-concept and let c' be the same p-concept as c except that the reading of each variable $x_i \\in V$ is replaced by the reading of the constant value $a_i$. Then $\\mathrm{err}(c', c) < v \\cdot \\alpha$.\n\nProof: Let a be the vector obtained from the concatenation of all $a_i$'s and let $x_V$ be the vector obtained from x by keeping only the components $x_i$ which are in V. Then $\\mathrm{err}(c', c) \\le \\Pr(x_V \\ne a) \\le \\sum_{i \\in V} \\Pr(x_i \\ne a_i)$. \u25a1 
\n\nFor a given set of blocking sets $\\{B_i\\}_{i=1}^{n}$, the algorithm will try to discover each weight $w_i$ by estimating the blocked influence of $x_i$ defined as:\n\n$$\\mathrm{Binf}(x_i | b_i) \\stackrel{\\mathrm{def}}{=} \\Pr(\\sigma = 1 | x_{B_i} = b_i, x_i = +1) - \\Pr(\\sigma = 1 | x_{B_i} = b_i, x_i = -1)$$\n\nwhere $x_{B_i}$ denotes the restriction of x on the blocking set $B_i$ for variable $x_i$ and $b_i$ is an assignment for $x_{B_i}$. The following lemma ensures the learner that $\\mathrm{Binf}(x_i | b_i)$ contains enough information about $w_i$.\n\nLemma 2 Let the target p-concept be a stochastic perceptron on $I^n$ having a nondecreasing activation function and weights taken from $\\{-1, 0, +1\\}$. Then, for any assignment $b_i$ for the variables in the blocking set $B_i$ of variable $x_i$, we have:\n\n$$\\mathrm{Binf}(x_i | b_i) \\begin{cases} \\ge 0 & \\text{if } w_i = +1 \\\\ = 0 & \\text{if } w_i = 0 \\\\ \\le 0 & \\text{if } w_i = -1 \\end{cases} \\qquad (3)$$\n\nProof sketch: Let $U = X - (B_i \\cup \\{x_i\\})$, $s = \\sum_{j \\in U} w_j x_j$ and $\\zeta = \\sum_{k \\in B_i} w_k b_k$. Let p(s) denote the probability of observing s (under D). Then $\\mathrm{Binf}(x_i | b_i) = \\sum_s p(s) [f(s + \\zeta + w_i) - f(s + \\zeta - w_i)]$, from which we find the desired result for a nondecreasing $f(\\cdot)$. \u25a1\n\nIn principle, lemma 2 enables the learner to discover $w_i$ from $\\mathrm{Binf}(x_i | b_i)$. The learner, however, has only access to its empirical estimate $\\widehat{\\mathrm{Binf}}(x_i | b_i)$ from a finite sample. Hence, we will use Hoeffding's inequality (Hoeffding, 1963) to find the number of examples needed for a probability p to be close to its empirical estimate $\\hat{p}$ with high probability.\n\nLemma 3 (Hoeffding, 1963) Let $Y_1, \\ldots, Y_m$ be a sequence of m independent Bernoulli trials, each succeeding with probability p. Let $\\hat{p} = \\sum_{i=1}^{m} Y_i / m$. 
Then:\n\n$$\\Pr(|\\hat{p} - p| > \\epsilon) \\le 2 \\exp(-2 m \\epsilon^2)$$\n\nHence, by writing $\\mathrm{Binf}(x_i | b_i)$ in terms of (unconditional) probabilities that can be estimated from all the training examples, we find from lemma 3 that the number $m_0(\\epsilon, \\delta, n)$ of examples needed to have $|\\widehat{\\mathrm{Binf}}(x_i | b_i) - \\mathrm{Binf}(x_i | b_i)| < \\epsilon$ with probability at least $1 - \\delta$ is given by:\n\n$m_0(\\epsilon, \\delta, n)$ ~ ~ (:E) 2 ln (~)\n\nwhere $\\kappa = \\alpha^{k+1}$ is the lowest permissible value for $p_D(b_i, x_i)$ (see lemma 1). So, if the minimal nonzero value for $|\\mathrm{Binf}(x_i | b_i)|$ is $\\beta$, then the number of examples needed to find, with confidence at least $1 - \\delta$, the exact value for $w_i$ among $\\{-1, 0, +1\\}$ is such that we need to have: $\\Pr(|\\widehat{\\mathrm{Binf}}(x_i | b_i) - \\mathrm{Binf}(x_i | b_i)| < \\beta/2) > 1 - \\delta$. Thus, whenever $\\beta$ is of order $e^{-n}$, we will need $O(e^{2n})$ examples to find (with probability greater than $1 - \\delta$) the value for $w_i$. So, in order to be able to PAC learn from a polynomial sample, we must arrange matters so that we do not need to worry about such low values for $|\\mathrm{Binf}(x_i | b_i)|$. We therefore consider the maximum blocked influence defined as:\n\n$$\\mathrm{Binf}(x_i) \\stackrel{\\mathrm{def}}{=} \\mathrm{Binf}(x_i | b_i^*)$$\n\nwhere $b_i^*$ is the vector value for which $|\\mathrm{Binf}(x_i | b_i)|$ is the largest. We now show that the learner can ignore all variables $x_i$ for which $|\\mathrm{Binf}(x_i)|$ is too small (without making much error).\n\nLemma 4 Let c be a stochastic perceptron with nondecreasing activation function $f(\\cdot)$ and weights taken from $\\{-1, 0, +1\\}$. Let $V \\subset X$ and let $c_V$ be the same stochastic perceptron as c except that $w_i = 0$ for all $x_i \\in V$ and its activation function is changed to $f(\\cdot + \\theta)$. Then, there always exists a value for $\\theta$ such that:\n\n$$\\mathrm{err}(c_V, c) \\le \\sum_{i \\in V} |\\mathrm{Binf}(x_i)|$$\n\nProof sketch: By induction on |V|. 
To first verify the lemma for $V = \\{x_1\\}$, let b be a vector of values for the setting of all $x_i \\in B_1$ and let $x_U$ be a vector of values for the setting of all $x_j \\in U = X - (B_1 \\cup \\{x_1\\})$. Let $s = \\sum_{j \\in U} w_j x_j$ and $\\zeta = \\sum_{j \\in B_1} w_j x_j$; then for $\\theta = w_1$, we have:\n\n$$\\begin{aligned}\n\\mathrm{err}(c_V, c) &= \\sum_{x_U} \\sum_{b} p_D(x_U | b)\\, p_D(b | x_1 = -1)\\, p_D(x_1 = -1) \\times |f(s + \\zeta + w_1) - f(s + \\zeta - w_1)| \\\\\n&\\le |\\mathrm{Binf}(x_1)|\n\\end{aligned}$$\n\nWe now assume that the lemma holds for $V = \\{x_1, x_2, \\ldots, x_k\\}$ and prove it for $W = V \\cup \\{x_{k+1}\\}$. Let $S = \\{x_{k+1}\\}$ and let $f(\\cdot + \\theta_W)$, $f(\\cdot + \\theta_V)$ and $f(\\cdot + \\theta_S)$ denote respectively the activation function for $c_W$, $c_V$ and $c_S$. By inspecting the expressions for $\\mathrm{err}(c_V, c)$ and $\\mathrm{err}(c_W, c_S)$, we can see that there always exist values for $\\theta_W \\in \\{\\theta_V + w_{k+1}, \\theta_V - w_{k+1}\\}$ and $\\theta_S \\in \\{w_{k+1}, -w_{k+1}\\}$ such that $\\mathrm{err}(c_W, c_S) \\le \\mathrm{err}(c_V, c)$. And since $d_v(\\cdot, \\cdot)$ satisfies the triangle inequality, $\\mathrm{err}(c_W, c) \\le \\mathrm{err}(c_V, c) + |\\mathrm{Binf}(x_{k+1})|$. \u25a1\n\nAfter discovering the weights, the hypothesis p-concept h returned by the learner will simply be the table look-up of the estimated probabilities of observing a positive classification given that $\\sum_{i=1}^{n} w_i x_i = s$ for all s values that are observed with sufficient probability (the hypothesis can output any value for the values of s that are observed very rarely). We thus have the following learning algorithm for stochastic perceptrons.\n\nAlgorithm LearnSP($n, \\epsilon, \\delta, \\{B_i\\}_{i=1}^{n}$)\n\n1. Call m = 128 e: ) 2kHIn e~n) training examples (where $k = \\max_i |B_i|$).\n\n2. Compute $\\widehat{\\Pr}(x_i = +1)$ for each variable $x_i$. Neglect $x_i$ whenever we have $\\widehat{\\Pr}(x_i = +1) < \\epsilon/(4n)$ or $\\widehat{\\Pr}(x_i = +1) > 1 - \\epsilon/(4n)$.\n\n3. For each variable $x_i$ and for each of its blocking vector values $b_i$, compute $\\widehat{\\mathrm{Binf}}(x_i | b_i)$. 
Let $b_i^*$ be the value of $b_i$ for which $|\\widehat{\\mathrm{Binf}}(x_i | b_i)|$ is the largest. Let $\\widehat{\\mathrm{Binf}}(x_i) = \\widehat{\\mathrm{Binf}}(x_i | b_i^*)$.\n\n4. For each variable $x_i$:\n\n(a) Let $w_i = +1$ whenever $\\widehat{\\mathrm{Binf}}(x_i) > \\epsilon/(4n)$.\n(b) Let $w_i = -1$ whenever $\\widehat{\\mathrm{Binf}}(x_i) < -\\epsilon/(4n)$.\n(c) Otherwise let $w_i = 0$.\n\n5. Compute $\\widehat{\\Pr}(\\sum_{i=1}^{n} w_i x_i = s)$ for $s = -n, \\ldots, +n$.\n\n6. Return the hypothesis p-concept h formed by the table look-up:\n\n$$h(x) = h'(s) = \\widehat{\\Pr}\\left(\\sigma = 1 \\,\\Big|\\, \\sum_{i=1}^{n} w_i x_i = s\\right)$$\n\nfor all s for which $\\widehat{\\Pr}(\\sum_{i=1}^{n} w_i x_i = s) > \\epsilon/(8n + 8)$. For the other s values, let h'(s) = 0 (or any other value).\n\nTheorem 1 Algorithm LearnSP PAC learns the class of stochastic perceptrons on $I^n$ with monotonic activation functions and weights $w_i \\in \\{-1, 0, +1\\}$ under any k-blocking distribution (when a blocking set for each variable is known). The number of examples required is m = 128 (2:) 2kHIn (l~n) (and the time needed is $O(n \\times m)$) for the returned hypothesis to make error at most $\\epsilon$ with confidence at least $1 - \\delta$.\n\nProof sketch: From Hoeffding's inequality (lemma 3) we can show that this sample size is sufficient to ensure that:\n\n\u2022 $|\\widehat{\\Pr}(x_i = +1) - \\Pr(x_i = +1)| < \\epsilon/(4n)$ with confidence at least $1 - \\delta/(4n)$\n\u2022 $|\\widehat{\\mathrm{Binf}}(x_i) - \\mathrm{Binf}(x_i)| < \\epsilon/(4n)$ with confidence at least $1 - \\delta/(4n)$\n\u2022 $|\\widehat{\\Pr}(\\sum_{i=1}^{n} w_i x_i = s) - \\Pr(\\sum_{i=1}^{n} w_i x_i = s)| < \\epsilon^2/[64(n + 1)]$ with confidence at least $1 - \\delta/(4n + 4)$\n\u2022 $|\\widehat{\\Pr}(\\sigma = 1 | \\sum_{i=1}^{n} w_i x_i = s) - \\Pr(\\sigma = 1 | \\sum_{i=1}^{n} w_i x_i = s)| < \\epsilon/4$ with confidence at least $1 - \\delta/4$\n\nFrom this and from lemmas 1, 2 and 4, it follows that the returned hypothesis will make error at most $\\epsilon$ with confidence at least $1 - \\delta$. \u25a1\n\nAcknowledgments\n\nWe thank Mostefa Golea, Klaus-U. Hoeffgen and Stefan Poelt for useful comments and discussions about technical points. M. 
Marchand is supported by NSERC grant OGP0122405. Saeed Hadjifaradji is supported by the MCHE of Iran.\n\nReferences\n\nAbend K., Hartley T.J. & Kanal L.N. (1965) \"Classification of Binary Random Patterns\", IEEE Trans. Inform. Theory, vol. IT-11, 538-544.\nGolea M. & Marchand M. (1993) \"On Learning Perceptrons with Binary Weights\", Neural Computation, vol. 5, 765-782.\nGolea M. & Marchand M. (1994) \"On Learning Simple Deterministic and Probabilistic Neural Concepts\", in Shawe-Taylor J., Anthony M. (eds.), Computational Learning Theory: EuroCOLT'93, Oxford University Press, pp. 47-60.\nHaussler D. (1992) \"Decision Theoretic Generalizations of the PAC Model for Neural Net and Other Learning Applications\", Information and Computation, vol. 100, 78-150.\nHoeffgen K.U. (1993) \"On Learning and Robust Learning of Product Distributions\", Proceedings of the 6th ACM Conference on Computational Learning Theory, ACM Press, 77-83.\nHoeffding W. (1963) \"Probability inequalities for sums of bounded random variables\", Journal of the American Statistical Association, vol. 58(301), 13-30.\nKearns M.J. & Schapire R.E. (1994) \"Efficient Distribution-free Learning of Probabilistic Concepts\", Journal of Computer and System Sciences, vol. 48, 464-497.\nLin J.H. & Vitter J.S. (1991) \"Complexity Results on Learning by Neural Nets\", Machine Learning, vol. 6, 211-230.\nSchapire R.E. (1992) The Design and Analysis of Efficient Learning Algorithms, Cambridge MA: MIT Press.\nValiant L.G. (1984) \"A Theory of the Learnable\", Comm. ACM, vol. 27, 1134-1142.", "award": [], "sourceid": 878, "authors": [{"given_name": "Mario", "family_name": "Marchand", "institution": null}, {"given_name": "Saeed", "family_name": "Hadjifaradji", "institution": null}]}