{"title": "Learning with Product Units", "book": "Advances in Neural Information Processing Systems", "page_first": 537, "page_last": 544, "abstract": null, "full_text": "Comparing the prediction accuracy of \n\nartificial neural networks and other \nstatistical models for  breast  cancer \n\nsurvival \n\nHarry B.  Burke \n\nDepartment of Medicine \nNew  York  Medical  College \n\nValhalla, NY  10595 \n\nDavid B.  Rosen \n\nDepartment of Medicine \nNew  York  Medical College \n\nValhalla, NY  10595 \n\nPhilip H.  Goodman \nDepartment of Medicine \n\nUniversity of Nevada School of Medicine \n\nReno,  Nevada 89520 \n\nAbstract \n\nThe  TNM  staging  system  has  been  used  since  the  early  1960's \nto  predict  breast  cancer  patient  outcome.  In  an  attempt  to  in(cid:173)\ncrease  prognostic  accuracy,  many putative prognostic factors  have \nbeen  identified.  Because  the  TNM  stage  model  can  not  accom(cid:173)\nmodate  these  new  factors,  the  proliferation  of factors  in  breast \ncancer  has  lead  to  clinical  confusion.  What  is  required  is  a  new \ncomputerized  prognostic system  that  can test  putative prognostic \nfactors  and  integrate  the  predictive  factors  with  the  TNM  vari(cid:173)\nables  in order to increase  prognostic  accuracy.  Using  the area un(cid:173)\nder  the curve of the receiver  operating characteristic,  we  compare \nthe  accuracy  of the  following  predictive  models  in  terms  of five \nyear  breast cancer-specific  survival:  pTNM staging system,  princi(cid:173)\npal component analysis,  classification and regression  trees,  logistic \nregression,  cascade  correlation neural network,  conjugate gradient \ndescent  neural,  probabilistic neural network,  and backpropagation \nneural network.  Several statistical models are significantly more ac-\n\n\f1064 \n\nHarry B.  Burke,  David B. Rosen,  Philip H.  Goodman \n\ncurate than the TNM  staging system.  
Logistic regression and the backpropagation neural network are the most accurate prediction models for predicting five year breast cancer-specific survival.

1 INTRODUCTION

For over thirty years measuring cancer outcome has been based on the TNM staging system (tumor size, number of lymph nodes with metastatic disease, and distant metastases) (Beahrs et al., 1992). There are several problems with this model (Burke and Henson, 1993). First, it is not very accurate; for breast cancer it is only 44% accurate. Second, its accuracy cannot be improved because predictive variables cannot be added to the model. Third, it does not apply to all cancers. In this paper we compare computerized prediction models to determine if they can improve prognostic accuracy. Artificial neural networks (ANNs) are a class of nonlinear regression and discrimination models. ANNs are being used in many areas of medicine, with several hundred articles published in the last year. Representative areas of research include anesthesiology (Westenskow et al., 1992), radiology (Tourassi et al., 1993), cardiology (Leong and Jabri, 1982), psychiatry (Palombo, 1992), and neurology (Gabor and Seyal, 1992). ANNs are being used in cancer research including image processing (Goldberg et al., 1992), analysis of laboratory data for breast cancer diagnosis (O'Leary et al., 1992), and the discovery of chemotherapeutic agents (Weinstein et al., 1992). It should be pointed out that the analyses in this paper rely upon previously collected prognostic factors. These factors were selected for collection because they were significant in a generalized linear model such as the linear or logistic models.
There is no predictive model that can improve upon linear or logistic prediction models when the predictor variables meet the assumptions of these models and there are no interactions. Therefore the objective of this paper is not to outperform linear or logistic models on these data. Rather, our objective is to show that, with variables selected by generalized linear models, artificial neural networks can perform as well as the best traditional models. There is no a priori reason to believe that future prognostic factors will be binary or linear, or that there will not be complex interactions between prognostic factors. A further objective of this paper is to demonstrate that artificial neural networks are likely to outperform the conventional models when there are unanticipated nonmonotonic factors or complex interactions.

2 METHODS

2.1 DATA

The Patient Care Evaluation (PCE) data set is collected by the Commission on Cancer of the American College of Surgeons (ACS). The ACS, in October 1992, requested cancer information from hospital tumor registries in the United States. The ACS asked for the first 25 cases of breast cancer seen at each institution in 1983, and it asked for follow-up information on each of these 25 patients through the date of the request. These are only cases of first breast cancer. Follow-up information included known deaths. The PCE data set contains, at best, eight year follow-up. We chose to use a five year survival end-point. This analysis is for death due to breast cancer, not all-cause mortality.
\n\nFor this analysis  cases  with  missing data,  and cases  censored  before five  years,  are \nnot  included  so  that the prediction models  can  be  compared  without  putting  any \nprediction model at a disadvantage. We randomly divided the data set into training, \nhold-out, and testing subsets of 3,100, 2,069,  and 3,102 cases, respectively. \n\n2.2  MODELS \n\nThe TMN stage model used  in this analysis is the pathologic model (pTNM) based \non  the  1992  American  Joint  Committee  on  Cancer's  Manual  for  the  Staging  of \nCancer  (Beahr  et.  al.,  1992).  The  pathologic model relies  upon  pathologically de(cid:173)\ntermined  tumor  size  and  lymph  nodes,  this  contrasts  with  clinical  staging  which \nrelies  upon  the  clinical  examination to provide  tumor  size  and lymph  node  infor(cid:173)\nmation.  To  determine the overall  accuracy  of the TNM stage model we  compared \nthe  model's  prediction  for  each  patient,  where  the  individual  patient's  prediction \nis  the fraction  of all  the patients in that stage  who  survive,  to each  patient's true \noutcome. \n\nPrincipal  components  analysis,  is  a  data reduction  technique  based  on  the  linear \ncombinations of predictor variables that minimizes the variance across patients (Jol(cid:173)\nlie,  1982).  The logistic regression analysis is performed in a stepwise manner, with(cid:173)\nout interaction terms,  using the statistical language S-PLUS  (S-PLUS,  1992), with \nthe continuous variable age modeled with a restricted cubic spline to avoid assuming \nlinearity  (Harrell  et.  al.,  1988).  Two  types  of Classification  and  Regression  Tree \n(CART)  (Breiman  et.  al.,  1984)  analyses  are  performed  using  S-PLUS.  The first \nwas  a  9-node  pruned  tree  (with  10-fold cross  validation on the  deviance),  and the \nsecond  was  a  shrunk  tree with  13.7 effective  nodes. 
\n\nThe  multilayer  perceptron  neural  network  training  in  this  paper  is  based  on  the \nmaximum likelihood function  unless  otherwise  stated,  and  backpropagation refers \nto gradient descent.  Two neural  networks  that are  not multilayer perceptrons  are \ntested.  They are the Fuzzy  ARTMAP neural network  (Carpenter et.  al.,  1991) and \nthe probabilistic neural network  (Specht,  1990). \n\n2.3  ACCURACY \n\nThe measure  of comparative accuracy  is  the area under  the  curve  of the  receiver \noperating  characteristic  (Az) .  Generally,  the  Az  is  a  nonparametric  measure  of \ndiscrimination.  Square error summarizes how  close each patient's predicted value is \nto its true outcome.  The Az  measures the relative goodness of the set of predictions \nas  a  whole  by  comparing the predicted  probability of each  patient  with that of all \nother patients.  The computational approach to the Az  that employs the trapezoidal \napproximation  to  the  area  under  the  receiver  operating  characteristic  curve  for \nbinary  outcomes  was  first  reported  by  Bamber  (Bamber,  1975),  and  later  in  the \nmedical  literature  by  Hanley  (Hanley  and  McNeil,  1982).  This  was  extended  by \nHarrell  (Harrell  et.  al.,  1988) to continuous outcomes. \n\n\f1066 \n\nHarry B.  Burke,  David B.  Rosen,  Philip H.  Goodman \n\nTable 1:  PCE  1983  Breast  Cancer  Data:  5 Year  Survival Prediction,  54 Variables. \n\nPREDICTION  MODEL \n\nACCURACY\u00b7  SPECIFICATIONS \n\npTNM Stages \nPrincipal Components Analysis \nCART, pruned \nCART, shrunk \nStepwise  Logistic regression \nFuzzy  ARTMAP ANN \nCascade correlation ANN \nConjugate gradient descent  ANN \nProbabilistic ANN \nBackpropagation ANN \n* The area under  the curve of the  receiver  operating characteristic. 
\n\nO,I,I1A,I1B,IIIA,I1IB,IV \none scaling iteration \n9 nodes \n13.7  nodes \nwith cubic  splines \n54-F2a,  128-1 \n54-21-1 \n54-30-1 \nbandwidth = 16s \n54-5-1 \n\n.720 \n.714 \n.753 \n.762 \n.776 \n.738 \n.761 \n.774 \n.777 \n.784 \n\n3  RESULTS \n\nAll results  are based on the independent  variable sample not used for training (i.e., \nthe  testing  data  set),  and  all  analyses  employ  the  same  testing  data set.  Using \nthe  PCE  breast  cancer  data set,  we  can  assess  the  accuracy  of several  prediction \nmodels  using  the most  powerful of the predictor  variables available in the data set \n(See Table 1). \nPrincipal  components  analysis  is  not  expected  to  be  a  very  accurate  model;  with \none  scaling  iteration,  its  accuracy  is  .714.  Two  types  of classification  and  regres(cid:173)\nsion  trees  (CART),  pruned  and shrunk,  demonstrate  accuracies  of .753  and  .762, \nrespectively.  Logistic regression  with cubic  splines for  age  has an accuracy of .776. \nIn addition to the backpropagation neural network and the probabilistic neural net(cid:173)\nwork,  three  types  of neural  networks  are  tested.  Fuzzy  ARTMAP's  accuracy  is \nthe poorest  at  .738.  It was  too computationally intensive  to  be  a  practical  model. \nCascade-correlation and conjugate gradient descent  have the potential to do as  well \nas  backpropagation.  The  PNN  accuracy  is  .777.  The  PNN  has  many  interesting \nfeatures,  but it also has  several  drawbacks  including its storage requirements.  The \nbackpropagation neural  network's  accuracy  is  .784.4. \n\n4  DISCUSSION \n\nFor  predicting  five  year  breast  cancer-specific  survival,  several  computerized  pre(cid:173)\ndiction models  are more accurate than the TNM stage system,  and artificial neural \nnetworks  are  as good as  the best  traditional statistical models. \n\nReferences \n\nBamber D (1975).  
The area above the ordinal dominance graph and the area below the receiver operating characteristic. J Math Psych 12:387-415.

Beahrs OH, Henson DE, Hutter RVP, Kennedy BJ (1992). Manual for staging of cancer, 4th ed. Philadelphia: JB Lippincott.

Burke HB, Henson DE (1993). Criteria for prognostic factors and for an enhanced prognostic system. Cancer 72:3131-5.

Breiman L, Friedman JH, Olshen RA (1984). Classification and Regression Trees. Pacific Grove, CA: Wadsworth and Brooks/Cole.

Carpenter GA, Grossberg S, Rosen DB (1991). Fuzzy ART: Fast stable learning and categorization of analog patterns by an adaptive resonance system. Neural Networks 4:759-771.

Gabor AJ, Seyal M (1992). Automated interictal EEG spike detection using artificial neural networks. Electroencephalogr Clin Neurophysiol 83:271-80.

Goldberg V, Manduca A, Ewert DL (1992). Improvement in specificity of ultrasonography for diagnosis of breast tumors by means of artificial intelligence. Med Phys 19:1275-81.

Hanley JA, McNeil BJ (1982). The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143:29-36.

Harrell FE, Lee KL, Pollock BG (1988). Regression models in clinical studies: determining relationships between predictors and response. J Natl Cancer Inst 80:1198-1202.

Jollife IT (1986). Principal Component Analysis. New York: Springer-Verlag.

Leong PH, Jabri MA (1982). MATIC - an intracardiac tachycardia classification system. PACE 15:1317-31.

O'Leary TJ, Mikel UV, Becker RL (1992). Computer-assisted image interpretation: use of a neural network to differentiate tubular carcinoma from sclerosing adenosis. Modern Pathol 5:402-5.
\n\nPalombo SR (1992).  Connectivity and condensation in dreaming.  JAm Psychoanal \nAssoc 40:1139-59. \n\nS-PLUS (1991),  v 3.0.  Seattle,  WA;  Statistical Sciences,  Inc. \n\nSpecht  DF (1990).  Probabilistic neural  networks.  Neural  Networks  3:109-18. \n\nTourassi  GD,  Floyd  CE,  Sostman  HD,  Coleman  RE  (1993).  Acute  pulmonary \nembolism:  artificial neural network  approach for  diagnosis.  Radiology  189:555-58. \n\nWeinstein  IN,  Kohn  KW,  Grever  MR et.  al.  (1992)  Neural  computing  in  cancer \ndrug  development:  predicting mechanism of action.  Science 258:447-51. \nWestenskow  DR,  Orr JA,  Simon  FH  (1992) .  Intelligent  alarms reduce  anesthesiol(cid:173)\nogist's response  time to critical faults.  Anesthesiology  77:1074-9,  1992. \n\n\f\fLearning with Product  Units \n\nLaurens R.  Leerink \n\nAustralian Gilt Securities  LTD \n\n37-49  Pitt Street \n\nNSW  2000, Australia \nlaurens@sedal.su.oz.au \n\nC.  Lee  Giles \n\nNEC  Research  Institute \n\n4 Independence  Way \n\nPrinceton,  NJ  08540,  USA \ngiles@research.nj.nec.com \n\nBill G.  Horne \n\nNEC  Research  Institute \n\n4 Independence  Way \n\nPrinceton,  NJ  08540,  USA \nhorne@research.nj.nec.com \n\nMarwan A.  Jabri \n\nDepartment of Electrical  Engineering \n\nThe University of Sydney \n\nNSW  2006,  Australia \nmarwan@sedal.su.oz.au \n\nAbstract \n\nProduct  units  provide  a  method  of  automatically  learning  the \nhigher-order  input  combinations  required  for  efficient  learning  in \nneural  networks.  However,  we  show  that  problems  are  encoun(cid:173)\ntered  when  using  backpropagation  to  train  networks  containing \nthese  units.  This  paper  examines  these  problems,  and  proposes \nsome atypical heuristics to improve learning.  Using these heuristics \na  constructive  method  is  introduced  which  solves  well-researched \nproblems with  significantly less  neurons than  previously reported. 
\nSecondly,  product  units are implemented as  candidate units in the \nCascade  Correlation  (Fahlman &  Lebiere,  1990)  system.  This re(cid:173)\nsulted  in  smaller  networks  which  trained  faster  than  when  using \nsigmoidal or Gaussian units. \n\n1 \n\nIntroduction \n\nIt is well-known that supplementing the inputs to a neural network with higher-order \ncombinations ofthe inputs both increases the capacity of the network  (Cover, 1965) \nand  the  the  ability to learn  geometrically  invariant  properties  (Giles  &  Maxwell, \n\n\f538 \n\nLaurens Leerink,  C.  Lee Giles,  Bill G.  Home,  Marwan A.  Jabri \n\n1987).  However,  there  is  a  combinatorial explosion  of higher  order  terms  as  the \nnumber  of inputs  to  the  network  increases.  Yet  in  order  to  implement  a  certain \nlogical function,  in  most  cases  only  a  few  of these  higher  order  terms are  required \n(Redding et al.,  1993). \nThe  product  units  (PUs)  introduced  by  (Durbin  &  Rumelhart,  1989)  attempt to \nmake use of this fact.  These networks have the advantage that, given an appropriate \ntraining  algorithm,  the  units  can  automatically learn  the  higher  order  terms  that \nare required  to implement a specific  logical function. \n\nIn these networks the hidden layer units compute the weighted product ofthe inputs, \nthat is \n\nN \n\nN \n\nII X~i \n\ni=l \n\ninstead of \n\n2:XiWi \ni=l \n\n(1) \n\nas  in  standard  networks.  An  additional  advantage of PUs  is  the increased  infor(cid:173)\nmation capacity  of these  units  compared  to standard  summation networks.  It is \napproximately  3N  (Durbin  &  Rumelhart,  1989),  compared  to  2N  for  a  single \nthreshold  logic  function  (Cover,  1965),  where  N  is  the  number  of inputs  to  the \nunit. \nThe larger capacity means that the same functions can be implemented by networks \ncontaining  less  units.  
This is important for certain applications such as speech recognition where the data bandwidth is high or if real-time implementations are desired.

When PUs are used to process Boolean inputs, best performance is obtained (Durbin & Rumelhart, 1989) by using inputs of {+1, -1}. If the imaginary component is ignored, with these inputs, the activation function is equivalent to a cosine summation function with the {-1, +1} inputs mapped to {1, 0} (Durbin & Rumelhart, 1989). In the remainder of this paper the terms product unit (PU) and cos(ine) unit will be used interchangeably, as all the problems examined have Boolean inputs.

2 Learning with Product Units

As the basic mechanism of a PU is multiplicative instead of additive, one would expect that standard neural network training methods and procedures cannot be directly applied when training these networks. This is indeed the case. If a neural network simulation environment is available, the basic functionality of a PU can be obtained by simply adding the cosine function cos(π * input) to the existing list of transfer functions. This assumes that Boolean mappings are being implemented and the appropriate {-1, +1} -> {1, 0} mapping has been performed on the input vectors. However, if we then attempt to train a network on the parity-6 problem shown in (Durbin & Rumelhart, 1989), it is found that the standard backpropagation (BP) algorithm simply does not work. We have found two main reasons for this.

The first is weight initialization. A typical first step in the backpropagation procedure is to initialize all weights to small random values. The main reason for this is to use the dynamic range of the sigmoid function and its derivative. However, the dynamic range of a PU is unlimited.
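The equivalence between the complex-valued product unit and the cosine summation described above can be checked numerically; this is a sketch with arbitrary weights and inputs, not code from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.uniform(-2.0, 2.0, size=6)     # product-unit weights
x = rng.choice([-1.0, 1.0], size=6)    # Boolean inputs encoded as {-1, +1}

# Complex-valued product unit: prod_i x_i^{w_i} via the principal log branch,
# where log(-1) = i*pi.
pu = np.prod(np.exp(w * np.log(x.astype(complex))))

# Cosine form: map {-1, +1} -> {1, 0} and sum the weights on the -1 inputs.
u = (1.0 - x) / 2.0
cos_unit = np.cos(np.pi * np.sum(w * u))

# Discarding the imaginary component, the two activations agree.
assert np.isclose(pu.real, cos_unit)
```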
Initializing the weights to small random values results in an input to the unit where the derivative is small. So apart from choosing small weights centered around nπ, with n = ±1, ±2, ..., this is the worst possible choice. In our simulations weights were initialized randomly in the range [-2, 2]. In fact, learning seems insensitive to the size of the weights, as long as they are large enough.

The second problem is local minima. Previous reports have mentioned this problem; (Lapedes & Farber, 1987) commented that "using sin's often leads to numerical problems, and nonglobal minima, whereas sigmoids seemed to avoid such problems". This comment summarizes our experience of training with PUs. For small problems (less than 3 inputs) backpropagation provides satisfactory training. However, when the number of inputs is increased beyond this, even with the weight initialization in the correct range, training usually ends up in a local minimum.

3 Training Algorithms

With these aspects in mind, the following training algorithms were evaluated: online and batch versions of Backpropagation (BP), Simulated Annealing (SA), a Random Search Algorithm (RSA), and combinations of these algorithms.

BP was used as a benchmark and for use in combination with the other algorithms. The Delta-Bar-Delta learning rate adaptation rule (Jacobs, 1988) was used along with the batch version of BP to accelerate convergence, with the parameters set to θ = 0.35, κ = 0.05 and φ = 0.90. RSA is a global search method (i.e. the whole weight space is explored during training). Weights are randomly chosen from a predefined distribution, and replaced if this results in an error decrease.
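The RSA loop just described can be sketched in a few lines; the quadratic "training error" in the usage example is a stand-in for a network's error, not the paper's networks:

```python
import numpy as np

def rsa(loss, dim, bounds=(-2.0, 2.0), epochs=1000, seed=0):
    """Random Search Algorithm as described above: draw weight vectors from
    a bounded uniform distribution and keep a draw only if it lowers the
    training error. The [-2, 2] default matches the paper's search space."""
    rng = np.random.default_rng(seed)
    best_w = rng.uniform(*bounds, size=dim)
    best_err = loss(best_w)
    for _ in range(epochs):
        w = rng.uniform(*bounds, size=dim)
        err = loss(w)
        if err < best_err:          # replace only on improvement
            best_w, best_err = w, err
    return best_w, best_err

# Toy usage on a smooth stand-in error surface with minimum at (1, 1, 1).
w, err = rsa(lambda w: float(np.sum((w - 1.0) ** 2)), dim=3)
```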
SA (Kirkpatrick et al., 1983) is a standard optimization method. The operation of SA is similar to RSA, with the difference that, with a decreasing probability, solutions are accepted which increase the training error. The combinations of algorithms were chosen (BP & SA, BP & RSA) to combine the benefits of global and local search. Used in this manner, BP is used to find the local minima. If the training error at the minimum is sufficiently low, training is terminated. Otherwise, the global method initializes the weights to another position in weight space from which local training can continue.

The BP-RSA combination requires further explanation. Several BP-(R)SA combinations were evaluated, but best performance was obtained using a fixed number of iterations of BP (in this case 120) along with one initial iteration of RSA. In this manner BP is used to move to the local minimum, and if the training error is still above the desired level the RSA algorithm generates a new set of random weights from which BP can start again.

The algorithms were evaluated on two problems, the parity problem and learning all logical functions of 2 and 3 inputs. The infamous parity problem is (for the product unit at least) an appropriate task. As illustrated by (Durbin & Rumelhart, 1989), this problem can be solved by one product unit. The question is whether the training algorithms can find a solution. The target values are {-1, +1}, and the output is taken to be correct if it has the correct sign. The simulation results are shown in Table 1. It should be noted that one epoch of both SA and RSA involves relaxing the network across the training set for every weight, so in computational terms their n_epoch values should be multiplied by a factor of (N + 1).

        Online BP          Batch BP           SA                 RSA
  N   n_conv  n_epoch    n_conv  n_epoch    n_conv  n_epoch    n_conv  n_epoch
  6     10      30.4        7       34        10     12.6        10     15.2
  8      8     101.3        2      700        10     52.8        10     45.4
 10      6     203.3        0        -        10     99.9        10     74.1

Table 1: The parity-N problem: The table shows n_conv, the number of runs out of 10 that converged, and n_epoch, the average number of training epochs required when training converged.

For the parity problem it is clear that local learning alone does not provide good convergence. For this problem, global search algorithms have the following advantages: (1) the search space is bounded (all weights are restricted to [-2, +2]); (2) the dimension of the search space is low (a maximum of 11 weights for the problems examined); (3) the fraction of the weight space which satisfies the parity problem relative to the total bounded weight space is high.

In a second set of simulations, one product unit was trained to calculate all 2^(2^N) logical functions of the N input variables. Unfortunately, this is only practical for N in {2, 3}. For N = 2 there are only 16 functions, and a product unit has no problem learning all these functions rapidly with all four training algorithms. In comparison a single summation unit can learn 14 (not the XOR & XNOR functions). For N = 3, a product unit is able to implement 208 of the 256 functions, while a single summation unit could only implement 104. The simulation results are displayed in Table 2.
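The claim that one product unit suffices for parity is easy to verify directly: with every weight set to 1 and {-1, +1} inputs, the unit's output is the product of the inputs, which is exactly their parity. A small exhaustive check (not the paper's code):

```python
import itertools
import numpy as np

# A product unit with all w_i = 1 computes prod_i x_i^{w_i} = prod_i x_i,
# which is +1 for an even number of -1 inputs and -1 for an odd number.
N = 6
for bits in itertools.product([-1.0, 1.0], repeat=N):
    x = np.array(bits)
    pu_out = np.prod(x ** 1.0)                    # product unit, w_i = 1
    parity = (-1.0) ** np.count_nonzero(x < 0)    # ground-truth parity target
    assert pu_out == parity
```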
\n\nOnline BP \n\nBatch  BP \n\nBP-RSA \n\nTable  2:  Learning  all  logical  functions  of 3  inputs:  The  rows  display  nlogie ,  the \naverage number of logical functions implemented by  a product  unit and nepoeh,  the \nnumber  of epochs  required  for  convergence.  Ten  simulations  were  performed  for \neach of the 256  logical functions , each  for  a  maximum of 1,000 iterations. \n\n4  Constructive Learning with  Product Units \n\nSelecting the optimal network  architecture for  a specific  application is  a  nontrivial \nand  time-consuming task,  and several  algorithms have  been  proposed  to automate \nthis process.  These include pruning methods and growing algorithms.  In this section \na  simple  method  is  proposed  for  adding  PUs  to  the  hidden  layer  of a  three  layer \nnetwork.  The output layer contains a single sigmoidal unit. \n\nSeveral  constructive  algorithms  proceed  by  freezing  a  subset  of the  weights  and \nlimiting training to the newly  added  units.  As  mentioned earlier,  for  PUs  a  global \n\n\fLearning with Product Units \n\n541 \n\nTiling AI  orithm  ~ \n\nUpstart AI  orithm  I-t--< \nUnits  >S-t \n\n81M using Pr \n\n300 \n\n250 \n\n200 \n\n150 \n\n100 \n\n50 \n\ni!: \n0 \n\n~ c \n\n.!; \n<II \n\nC e \n::l \nCI) c \n'15 \n~ \nD \nE \n::l \nZ \n\n200 \n\n400 \n\n600 \n\nNumber of patterns (2\"N) \n\n800 \n\n1000 \n\n1200 \n\nFigure 1:  The number of units required for learning the random mapping problems \nby  the 'Tiling', 'Upstart' and SIM  algorithms. \n\nsearch  is  required  to  solve  the  local-minima  problems.  Freezing  a  subset  of the \nweights  restricts  the new  solution to an  affine  subset  of the existing weight  space, \noften  resulting  in  non-minimal  networks  (Ash,  1989).  For  this  reason  a  simple \nincremental method (SIM)  was implemented which retains the global search  for  all \nweights  during the  whole  training process.  
The method used in our simulations is as follows:

• Train a network using the BP-RSA combination on a network with a specified minimum number of hidden PUs.

• If there is no convergence within a specified number of epochs, add a PU to the network. Reinitialize weights and continue training with the BP-RSA combination.

• Repeat the process until a solution is found or the network has grown to a predetermined maximum size.

The method of (Ash, 1989) was also evaluated, where neurons with small weights were added to a network according to certain criteria. The SIM performed better, possibly because of the global search performed by the RSA step.

The 'Upstart' (Frean, 1990) and 'Tiling' (Mezard & Nadal, 1989) constructive algorithms were chosen as benchmarks. A constructive PU network was trained on two problems described in these papers, namely the parity problem and the random mapping problem. In (Frean, 1990) it was reported that the Upstart algorithm required N units for all parity-N problems, and 1,000 training epochs were sufficient for all values of N except N = 10, which required 10,000. As seen earlier, one PU is able to perform any parity function, and SIM required an average of at most 74.1 iterations for N = 6, 8, 10.

The random mapping problem is defined by assigning each of the 2^N patterns its target {-1, +1} with 50% probability. This is a difficult problem, due to the absence of correlations and structure in the input. As in (Frean, 1990; Mezard & Nadal, 1989), averages over 25 runs were computed, each on a different training set. The number of units required by SIM is plotted in Figure 1.
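The SIM growing loop described above can be sketched as follows; train_fn is a placeholder for the BP-RSA training of a network with n hidden product units, not the authors' implementation:

```python
import numpy as np

def sim_train(train_fn, min_units=1, max_units=20):
    """Simple incremental method (SIM) sketch: grow the hidden layer one
    product unit at a time, reinitializing and retraining all weights at
    each size, until training converges or a size limit is reached."""
    for n in range(min_units, max_units + 1):
        converged, weights = train_fn(n)   # fresh random weights each round
        if converged:
            return n, weights              # smallest size that converged
    return None                            # give up at the size limit

# Toy usage: pretend convergence first occurs with 3 hidden units.
n, _ = sim_train(lambda n: (n >= 3, np.zeros(n)))
```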
The values for the Tiling and Upstart algorithms are approximate and were obtained through inspection of a similar graph in (Frean, 1990).

5 Using Cosine Candidate Units in Cascade Correlation

Initially we wanted to compare the performance of SIM with the well-known 'cascade-correlation' (CC) algorithm of (Fahlman & Lebiere, 1990). However, the network architectures differ, and a direct comparison between the number of units in the respective architectures does not reflect the efficiency of the algorithms. Instead, it was decided to integrate PUs into the CC system as candidate units.

For these simulations a public domain version of CC was used (White, 1993) which supports four different candidate types: the asymmetric sigmoid, symmetric sigmoid, variable sigmoid and Gaussian units. Facilities exist for either constructing homogeneous networks by selecting one unit type, or training with a pool of different units allowing the construction of hybrid networks. It was thus relatively simple to add PU candidate units to the system. Table 3 displays the results when CC was trained on the random logic problem using three types of homogeneous candidate units.

        CC Sigmoid            CC Gauss              CC PU
  N   n_units  n_epochs    n_units  n_epochs    n_units  n_epochs
  7     6.6      924.5       6.7      642.6       5.7      493.8
  8    12.1     1630.9      11.5     1128.2       9.9      833.8
  9    20.5     2738.3      18.4     1831.1      16.4     1481.8
 10    32.9     4410.9      30.2     2967.6      26.6     2590.8

Table 3: Learning random logic functions of N inputs: The table shows n_units, the average number of units required, and n_epochs, the average number of training epochs required for convergence of CC using sigmoidal, Gaussian and PU candidate units. Figures are based on 25 simulations.
\n\nIn a separate experiment the performance of hybrid networks  were  re-evaluated on \nthe same  random logic  problem.  To enable  a  fair  competition  between  candidate \nunits of different  types,  the simulations were  run with 40  candidate units,  8 of each \ntype.  The simulations were evaluated on  25  trails for  each  of the random mapping \nproblems (7,8,9  and  10  inputs,  a  total of 1920 input vectors).  In  total 1460 hidden \nunits  were  allocated,  and  in  all  cases  PU  candidate  units were  chosen  above  units \nof  the  4  other  types  during  the  competitive  stage.  During  this  comparison  all \n\n\fLearning with Product  Units \n\n543 \n\nparameters  were  set  to  default  values,  i.e.  the  weights  of the  PU  candidate  units \nwere  random numbers initialized in the range of [-1, +1].  As  discussed earlier,  this \nputs the  PUs at a  slight disadvantage as their optimum range is  [-2, +2]. \n\n6  Discussion \n\nThe  BP-RSA  combination  is  in  effect  equivalent  to  the  'local  optimization  with \nrandom restarts'  process  discussed  by  (Karmarkar &  Karp,  1982), where  the local \noptimization  is  this  case  is  performed  by  the  BP  algorithm.  They  reported  that \nfor  certain  problems where  the  error  surface  was  'exceedingly  mountainous',  mul(cid:173)\ntiple  random-start  local  optimization  outperformed  more  sophisticated  methods. \nWe  hypothesize  that  adding PUs to a  network  makes the error  surface  sufficiently \nmountainous so that  a global search  is  required. \n\nAs  expected,  the higher  separating capacity of the  PU  enables  the  construction  of \nnetworks  with  less  neurons  than  those  produced  by  the  Tiling and  Upstart  algo(cid:173)\nrithms.  
The fact that SIM works this well is mainly a result of the error surface; the surface is so irregular that even training a network of fixed architecture is best done by reinitializing the weights if convergence does not occur within certain bounds. This again is in accordance with the results of (Karmarkar & Karp, 1982) discussed above.

When used in CC, we hypothesize that there are three main reasons for the choice of PUs above any of the other types during the competitive learning phase. Firstly, the higher capacity (in an information-capacity sense) of the PUs allows a better correlation with the error signal. Secondly, having N competing candidate units is equivalent to selecting the best of N random restarts, and performs the required global search. Thirdly, although the error surface of networks with PUs contains more local minima than when using standard transfer functions, the surface is locally smooth. This allows effective use of higher-order error derivatives, resulting in fast convergence by the quickprop algorithm.

In (Dawson & Schopflocher, 1992) it was shown that networks with Gaussian units train faster and require fewer units than networks with standard sigmoidal units. This is supported by our results shown in Table 3. However, for the problem examined, PUs outperform Gaussian units by approximately the same margin as Gaussian units outperform sigmoidal units. It should also be noted that these problems were not chosen for their suitability for PUs. In fact, if the problems are symmetric/regular, the difference in performance is expected to increase.

7  Conclusion

Of the learning algorithms examined, BP provides the fastest training but is prone to nonglobal minima. On the other hand, global search methods are impractical for larger networks.
For the problems examined, a combination of local and global search methods was found to perform best. Given a network containing PUs, there are some atypical heuristics that can be used: (a) correct weight initialization, and (b) reinitialization of the weights if convergence is not rapidly reached. In addition, the representational power of PUs has enabled us to solve standard problems using significantly smaller networks than previously reported, using a very simple constructive method. When implemented in the CC architecture, for the problems examined, PUs resulted in smaller networks which trained faster than other units. When included in a pool of competing candidate units, simulations showed that in all cases PU candidate units were preferred over candidate units of the other four types.

References

Ash, T. (1989). Dynamic node creation in backpropagation networks. Connection Science, 1(4), 365-375.

Cover, T. (1965). Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Transactions on Electronic Computers, 14, 326-334.

Dawson, M. & Schopflocher, D. (1992). Modifying the generalized delta rule to train networks of nonmonotonic processors for pattern classification. Connection Science, 4, 19-31.

Durbin, R. & Rumelhart, D. (1989). Product units: A computationally powerful and biologically plausible extension to backpropagation networks. Neural Computation, 1, 133-142.

Fahlman, S. & Lebiere, C. (1990). The cascade-correlation learning architecture. In Touretzky, D. (Ed.), Advances in Neural Information Processing Systems, volume 2, (pp. 524-532). San Mateo: Morgan Kaufmann.
Frean, M. (1990). The upstart algorithm: A method for constructing and training feedforward neural networks. Neural Computation, 2, 198-209.

Giles, C. & Maxwell, T. (1987). Learning, invariance, and generalization in high-order neural networks. Applied Optics, 26(23), 4972-4978.

Jacobs, R. (1988). Increased rates of convergence through learning rate adaptation. Neural Networks, 1, 295-307.

Karmarkar, N. & Karp, R. (1982). The differencing method of set partitioning. Technical Report UCB/CSD 82/113, Computer Science Division, University of California, Berkeley, California.

Kirkpatrick, S., Gelatt, C. D., Jr., & Vecchi, M. (1983). Optimization by simulated annealing. Science, 220, 671-680.

Lapedes, A. & Farber, R. (1987). Nonlinear signal processing using neural networks: Prediction and system modelling. Technical Report LA-UR-87-2662, Los Alamos National Laboratory, Los Alamos, NM.

Mezard, M. & Nadal, J.-P. (1989). Learning in feedforward layered networks: The tiling algorithm. Journal of Physics A, 22, 2191-2204.

Redding, N., Kowalczyk, A., & Downs, T. (1993). A constructive higher order network algorithm that is polynomial-time. Neural Networks, 6, 997.

White, M. (1993). A public domain C implementation of the Cascade Correlation algorithm. Department of Computer Science, Carnegie Mellon University, Pittsburgh, PA.