{"title": "ALCOVE: A Connectionist Model of Human Category Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 649, "page_last": 655, "abstract": null, "full_text": "ALCOVE:  A  Connectionist  Model of \n\nHuman Category Learning \n\nJohn K.  Kruschke \nDepartment of Psychology  and Cognitive Science Program \nIndiana University,  Bloomington IN  47405-4201  USA \ne-mail:  kruschke@ucs.indiana.edu \n\nAbstract \n\nALCOVE  is  a  connectionist  model  of human  category  learning  that  fits  a \nbroad spectrum of human learning data.  Its architecture is  based on well(cid:173)\nestablished  psychological  theory,  and  is  related  to  networks  using  radial \nbasis functions.  From the perspective of cognitive psychology,  ALCOVE can \nbe construed as a combination of exemplar-based representation and error(cid:173)\ndriven  learning.  From the perspective of connectionism,  it can  be seen  as \nincorporating constraints into back-propagation networks  appropriate for \nmodelling human learning. \n\n1 \n\nINTRODUCTION \n\nALCOVE is intended to accurately model human, perhaps non-optimal, performance \nin  category  learning.  While  it  is  a  feed-forward  network  that  learns  by  gradient \ndescent  on  error,  it  is  unlike  standard  back  propagation  (Rumelhart,  Hinton  & \n'''illiams, 1986)  in  its architecture,  its behavior, and its goals.  Unlike  the standard \nback-propagation  network,  which  was  motivated  by  generalizing  neuron-like  per(cid:173)\nceptrons,  the architecture of ALCOVE  was  motivated by  a  molar-level psychological \ntheory,  Nosofsky's  (1986)  generalized  context  model  (GCM).  The  psychologically \nconstrained architecture results in behavior that captures the detailed  course of hu(cid:173)\nman  category  learning  in  many  situations  where  standard  back  propagation fares \nless  well.  And,  unlike  most  applications of standard back  propagation,  the goal  of \nALCOVE is  not to discover new (hidden-layer) representations after lengthy training, \nbut rather to model the course of learning itself (Kruschke,  1990c),  by  determining \nwhich dimensions of the given representation are most relevant to the task, and how \nstrongly to associate exemplars with categories. \n\n649 \n\n\f650 \n\nKruschke \n\nCategory nodes. \n\nLearned association weights. \n\nExemplar nodes. \n\no  0 \n\nLearned attention strengths. \n\nStimulus dimension nodes. \n\nFigure  1:  The architecture of ALCOVE  (Attention  Learning  covEring map).  Exem(cid:173)\nplar nodes show  their  activation  profile when  r = q = 1 in  Eqn.  1. \n\n2  THE  MODEL \n\nLike the GCM,  ALCOVE  assumes that input patterns can be represented as points in a. \nmulti-dimensional psychological space,  as  determined  by  multi-dimensiona.l  scaling \nalgorithms  (e.g.,  Shepard,  1962).  Each  input  node  encodes  a  single  psychological \ndimension,  with  the  activation  of the  node  indicating  the  value  of the  stimulus on \nthat dimension.  Figure 1 shows the architecture of ALCOVE,  illustrating the case  of \njust two input dimensions. \n\nEach  input  node  is  gated  by  a  dimensional  attention  strength  ai.  The  attention \nstrength on  a  dimension  reflects  the relevance of that dimension  for  the  particular \ncategorization  task  at  hand,  and  the  model  learns  to  allocate  more  attention  to \nrelevant  dimensions  and less to irrelevant  dimensions. 
Each hidden node corresponds to a position in the multi-dimensional stimulus space, with one hidden node placed at the position of every training exemplar. Each hidden node is activated according to the psychological similarity of the stimulus to the exemplar represented by the hidden node. The similarity function comes from the GCM and the work of Shepard (1962; 1987): Let the position of the jth hidden node be denoted as (h_j1, h_j2, ...), and let the activation of the jth hidden node be denoted as a_j^hid. Then

    a_j^{hid} = \exp\left[ -c \left( \sum_i \alpha_i \left| h_{ji} - a_i^{in} \right|^r \right)^{q/r} \right]    (1)

where c is a positive constant called the specificity of the node, where the sum is taken over all input dimensions, and where r and q are constants determining the similarity metric and similarity gradient, respectively. For separable psychological dimensions, the city-block metric (r = 1) is used, while integral dimensions might call for a Euclidean metric (r = 2). An exponential similarity gradient (q = 1) is used here (Shepard, 1987; this volume), but a Gaussian similarity gradient (q = 2) can sometimes be appropriate.

[Figure 2: (a) Increasing attention on the horizontal axis and decreasing attention on the vertical axis causes exemplars of the two categories (denoted by dots and +'s) to have greater between-category dissimilarity and greater within-category similarity. (b) ALCOVE cannot differentially attend to diagonal axes. (After Nosofsky, 1986, Fig. 2.)]

The dimensional attention strengths adjust themselves so that exemplars from different categories become less similar, and exemplars within categories become more similar. Consider a simple case of four stimuli that form the corners of a square in input space, as in Figure 2(a). The two left stimuli are mapped to one category (indicated by dots) and the two right stimuli are mapped to another category (indicated by +'s). ALCOVE learns to increase the attention strength on the horizontal axis, and to decrease the attention strength on the vertical axis. On the other hand, ALCOVE cannot stretch or shrink diagonally, as suggested in Figure 2(b). This constraint is an accurate reflection of human performance, in that categories separated by a diagonal boundary tend to take longer to learn than categories separated by a boundary orthogonal to one dimension.

Each hidden node is connected to output nodes that correspond to response categories. The connection from the jth hidden node to the kth category node has a connection weight denoted w_kj, called the association weight between the exemplar and the category. The output (category) nodes are activated by the linear rule used in the GCM and the network models of Gluck and Bower (1988a,b):

    a_k^{out} = \sum_{j \in hid} w_{kj} \, a_j^{hid}    (2)

In ALCOVE, unlike the GCM, the association weights are learned and can take on any real value, including negative values. Category activations are mapped to response probabilities using the same choice rule as was used in the GCM and network models. Thus,

    \Pr(K) = \exp\left( \phi \, a_K^{out} \right) \Big/ \sum_{k \in out} \exp\left( \phi \, a_k^{out} \right)    (3)

where φ is a real-valued scaling constant. In other words, the probability of classifying the given stimulus into category K is determined by the magnitude of category K's activation relative to the sum of all category activations.
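Equations 1 through 3 translate directly into a short forward pass. The sketch below reuses the arrays defined above; the parameter defaults are placeholders rather than fitted values.

```python
def hidden_activations(stim, exemplars, alpha, c=3.0, r=1.0, q=1.0):
    # Eqn 1: similarity of the stimulus to each stored exemplar, using an
    # attention-weighted Minkowski-r distance and similarity gradient q.
    dist = (np.abs(exemplars - stim) ** r) @ alpha
    return np.exp(-c * dist ** (q / r))

def category_activations(a_hid, w):
    # Eqn 2: linear output rule, a_out_k = sum_j w_kj * a_hid_j.
    return w @ a_hid

def response_probabilities(a_out, phi=2.0):
    # Eqn 3: exponentiated choice rule mapping activations to probabilities.
    e = np.exp(phi * a_out)
    return e / e.sum()

# Example: probe the (still untrained) network with a stimulus; with all
# association weights at zero, the response probabilities are uniform.
probs = response_probabilities(category_activations(
    hidden_activations(np.array([0.0, 0.2]), exemplars, alpha), w))
```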
The dimensional attention strengths, α_i, and the association weights, w_kj, are learned by gradient descent on sum-squared error, as used in standard back propagation (Rumelhart et al., 1986) and in the network models of Gluck and Bower (1988a,b). Details can be found in Kruschke (1990a,b). In fitting ALCOVE to human learning data, there are four free parameters: the fixed specificity c in Equation 1; the probability mapping constant φ in Equation 3; the association weight learning rate; and the attention strength learning rate.

In summary, ALCOVE extends Nosofsky's (1986) GCM by having a learning mechanism and by allowing any positive or negative values for association weights, and it extends Gluck and Bower's (1988a,b) network models by including explicit attention strengths and by using continuous input dimensions. It is a combination of exemplar-based category representations with error-driven learning, as alluded to by Estes et al. (1989; see also Hurwitz, 1990). ALCOVE can also be construed as a form of (non-)radial basis function network, if r = q = 2 in Equation 1. In the form described here, the hidden nodes are placed at positions where training exemplars occur, but another option, described by Kruschke (1990a,b), is to scatter hidden nodes over the input space to form a covering map. Both these methods work well in fitting human data in some situations, but the exemplar-based approach has advantages (Kruschke, 1990a,b). ALCOVE can also be compared to a standard back-propagation network that has adaptive attentional multipliers on its input nodes (cf. Mozer and Smolensky, 1989), but with fixed input-to-hidden weights (Kruschke, 1990b, p. 33). Such a network behaves similarly to a covering-map version of ALCOVE. Moreover, such back-prop networks are susceptible to catastrophic retroactive interference (Ratcliff, 1990; McCloskey & Cohen, 1989), unlike ALCOVE.

3 APPLICATIONS

Several applications of ALCOVE to modelling human performance are detailed elsewhere (Kruschke, 1990a,b); a few will be summarized here.

3.1 RELATIVE DIFFICULTY OF CATEGORY STRUCTURES

The classic work of Shepard, Hovland and Jenkins (1961) explored the relative difficulty of learning different category structures. As a simplified example, the linearly separable categories in Figure 2(a) are easier to learn than the exclusive-or problem (which would have the top-left and bottom-right exemplars mapped to one category, and the top-right and bottom-left mapped to the other). Shepard et al. carefully considered several candidate explanations for the varying difficulties, and concluded that some form of attentional learning was necessary to account for their results. That is, people seemed to be able to determine which dimensions were relevant or irrelevant, and they allocated attention to dimensions accordingly. Category structures with fewer relevant dimensions were easier to learn. ALCOVE has just the sort of attentional learning mechanism called for, and can match the relative difficulties observed by Shepard et al.
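The gradients themselves are simple to write down. The sketch below specializes Equation 1 to r = q = 1, derives the weight and attention updates by the chain rule, and runs them on the four-corner task of Figure 2(a). The derivation is an assumed reading of the text (the paper defers details to Kruschke, 1990a,b), and the teacher coding, learning rates, and nonnegativity clip on attention are illustrative choices.

```python
def train_step(stim, teacher, exemplars, alpha, w, c=3.0,
               lr_w=0.2, lr_a=0.1):
    # One gradient-descent step on E = 0.5 * sum_k (t_k - a_out_k)^2,
    # with Eqn 1 at r = q = 1, so a_hid_j = exp(-c * sum_i alpha_i * d_ji).
    diff = np.abs(exemplars - stim)            # d_ji = |h_ji - a_in_i|
    a_hid = np.exp(-c * diff @ alpha)
    err = teacher - w @ a_hid                  # (t_k - a_out_k)
    back = err @ w                             # error reaching hidden node j
    w = w + lr_w * np.outer(err, a_hid)        # -dE/dw_kj = err_k * a_hid_j
    alpha = alpha - lr_a * c * (back * a_hid) @ diff   # chain rule through Eqn 1
    return np.clip(alpha, 0.0, None), w        # keep attention nonnegative (assumption)

# Figure 2(a): left column -> "dot" category, right column -> "+" category.
labels = np.array([0, 0, 1, 1])                # one label per row of `exemplars`
for _ in range(50):
    for x, k in zip(exemplars, labels):
        alpha, w = train_step(x, np.eye(2)[k], exemplars, alpha, w)
print(alpha)   # attention on dimension 0 (horizontal) should now dominate
```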
3.2 BASE-RATE NEGLECT

A recent series of experiments (Gluck & Bower, 1988b; Estes et al., 1989; Shanks, 1990; Nosofsky et al., 1991) investigated category learning when the assignment of exemplars to categories was probabilistic and the base rates of the categories were unequal. In these experiments, there were two categories (one "rare" and the other "common") and four binary-valued stimulus dimensions. The stimulus values were denoted s1 and s1* for the first dimension, s2 and s2* for the second dimension, and so on. The probabilities were arranged such that over the course of training, the normative probability of each category, given s1 alone, was 50%. However, when presented with feature s1 alone, human subjects classified it as the rare category significantly more than 50% of the time. It was as if people were neglecting the base rates of the categories.

Gluck and Bower (1988b) and Estes et al. (1989) compared two candidate models to account for the apparent base-rate neglect. One was a simple exemplar-based model that kept track of each training exemplar, and made predictions of categorizations by summing up frequencies of occurrence of each stimulus value for each category. The exemplar-based model was unable to predict base-rate neglect. The second model they considered, the "double-node network," was a one-layer error-driven network that encoded each binary-valued dimension with a pair of input nodes. The double-node model was able to show base-rate neglect.

ALCOVE is an exemplar-based model, and so it is challenged by those results. In fact, Kruschke (1990a,b) and Nosofsky et al. (1991) show that ALCOVE fits the trial-by-trial learning and base-rate neglect data as well as or better than the double-node model.
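For illustration, a schematic version of such a probabilistic design can be run through the sketches above. The base rates and feature probabilities below are a reconstruction chosen only to satisfy the constraint stated above (the normative probability of the rare category given s1 alone is 50%), and coding an absent dimension at the midpoint 0.5 is an assumption, not a detail of the experiments.

```python
rng = np.random.default_rng(0)

# Reconstructed design (assumption): P(rare) = .25, and P(feature i present |
# category) as below; note .25 * .6 == .75 * .2, so the normative probability
# of "rare" given s1 alone is exactly 50%, as the text requires.
p_rare = 0.25
p_feature = {0: [0.6, 0.4, 0.3, 0.2],   # category 0: "rare"
             1: [0.2, 0.3, 0.4, 0.6]}   # category 1: "common"

# One exemplar node per possible four-bit pattern (all occur in training);
# attention and weights reinitialized for the four binary dimensions.
patterns = np.array([[(b >> i) & 1 for i in range(4)] for b in range(16)], float)
alpha, w = np.full(4, 0.25), np.zeros((2, 16))

for _ in range(4000):                    # simulated training trials
    k = 0 if rng.random() < p_rare else 1
    stim = (rng.random(4) < p_feature[k]).astype(float)   # 1 = si, 0 = si*
    alpha, w = train_step(stim, np.eye(2)[k], patterns, alpha, w,
                          lr_w=0.05, lr_a=0.01)

probe = np.array([1.0, 0.5, 0.5, 0.5])   # s1 alone; other dims at midpoint
probs = response_probabilities(category_activations(
    hidden_activations(probe, patterns, alpha), w))
print(probs[0])   # base-rate neglect would appear as P(rare) > 0.5
```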
3.3 THREE-STAGE LEARNING OF RULES AND EXCEPTIONS

One of the best-known connectionist models of human learning is Rumelhart and McClelland's (1986) model of verb past tense acquisition. One of the main phenomena they wished to model was three-stage learning of irregular verbs: First a few high-frequency irregulars are learned; second, many regular verbs are learned with some interference to the previously learned irregulars; and third, the high-frequency irregulars are re-learned.¹ In order to reproduce three-stage learning in their model, Rumelhart and McClelland had to change the training corpus during learning, so that early on the network was trained with ten verbs, 80% of which were irregular, and later the network was trained with 420 verbs, only 20% of which were irregular. It remains a challenge to connectionist models to show three-stage learning of rules and exceptions while keeping the training set constant.

While ALCOVE has not been applied to the verb-learning situation (and perhaps should not be, as a multi-dimensional similarity-space might not be a tractable representation for verbs), it can show three-stage learning of rules and exceptions in simpler but analogous situations. Figure 3 shows an arrangement of training exemplars, most of which can be classified by the simple rule, "if it's to the right of the dashed line, then it's in the 'rectangle' category, otherwise it's in the 'oval' category." The rule-following cases are marked with an "R." There are two exceptional cases near the dashed line, marked with an "E." Exceptional exemplars occurred 4 times as often as rule-following exemplars. The right panel of Figure 3 shows that ALCOVE initially learns the E cases better than the R cases, but that later in learning the R cases surpass the E's. The reason is that early in learning, ALCOVE is primarily building up association weights and has not yet shifted much attention away from the irrelevant dimension. Associations from the E cases grow more quickly because they are more frequent. Once the associations are established, then there is a basis for attention to be shifted away from the irrelevant dimension, rapidly improving performance on the R cases. At the time of this writing, these results have the status of a provocative demonstration, but experiments with human subjects in similar learning situations are presently being undertaken.

[Figure 3: Left panel shows the arrangement of rule-following (R) and exceptional (E) cases. Right panel shows the performance of ALCOVE, plotted as probability correct over learning trials. The ratio of E to R cases and all parameters of the model were fixed throughout training.]

¹ There is evidence that three-stage learning is only very subtle in verb past tense acquisition (e.g., Marcus, 1990), but whether it exists more robustly in the simpler category learning domains addressed by ALCOVE is still an open question.

Acknowledgment

This research was supported in part by Biomedical Research Support Grant RR 7031-25 from the National Institutes of Health.

References

Estes, W. K., Campbell, J. A., Hatsopoulos, N., & Hurwitz, J. B. (1989). Base-rate effects in category learning: A comparison of parallel network and memory storage-retrieval models. J. Exp. Psych.: Learning, Memory and Cognition, 15, 556-576.

Gluck, M. A. & Bower, G. H. (1988a). Evaluating an adaptive network model of human learning. J. of Memory and Language, 27, 166-195.

Gluck, M. A. & Bower, G. H. (1988b). From conditioning to category learning: An adaptive network model. J. Exp. Psych.: General, 117, 227-247.

Hurwitz, J. B. (1990). A hidden-pattern unit network model of category learning. Doctoral dissertation, Harvard University.

Kruschke, J. K. (1990a). A connectionist model of category learning.
Doctoral dissertation, University of California at Berkeley. Available from University Microfilms International.

Kruschke, J. K. (1990b). ALCOVE: A connectionist model of category learning. Research Report 19, Cognitive Science Program, Indiana University.

Kruschke, J. K. (1990c). How connectionist models learn: The course of learning in connectionist networks. Behavioral and Brain Sciences, 13, 498-499.

Marcus, G. F., Ullman, M., Pinker, S., Hollander, M., Rosen, T. J., & Xu, F. (1990). Overregularization. Occasional Paper #41, MIT Center for Cognitive Science.

McCloskey, M. & Cohen, N. J. (1989). Catastrophic interference in connectionist networks: The sequential learning problem. In: G. Bower (ed.), The Psychology of Learning and Motivation, Vol. 24. New York: Academic Press.

Mozer, M. C., & Smolensky, P. (1989). Skeletonization: A technique for trimming the fat from a network via relevance assessment. In: D. S. Touretzky (ed.), Advances in Neural Information Processing Systems, 1, pp. 107-115. San Mateo, CA: Morgan Kaufmann.

Nosofsky, R. M. (1986). Attention, similarity and the identification-categorization relationship. J. Exp. Psych.: General, 115, 39-57.

Nosofsky, R. M., Kruschke, J. K., & McKinley, S. (1991). Comparisons between adaptive network and exemplar models of classification learning. Research Report 35, Cognitive Science Program, Indiana University.

Ratcliff, R. (1990). Connectionist models of recognition memory: Constraints imposed by learning and forgetting functions. Psychological Review, 97, 285-308.

Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning internal representations by back-propagating errors. In: D. E. Rumelhart & J. L. McClelland (eds.), Parallel Distributed Processing, Vol. 1, pp. 318-362. Cambridge, MA: MIT Press.

Rumelhart, D. E., & McClelland, J. L. (1986). On learning the past tenses of English verbs. In: J. L. McClelland & D. E. Rumelhart (eds.), Parallel Distributed Processing, Vol. 2, pp. 216-271. Cambridge, MA: MIT Press.

Shanks, D. R. (1990). Connectionism and the learning of probabilistic concepts. Quarterly J. Exp. Psych., 42A, 209-237.

Shepard, R. N. (1962). The analysis of proximities: Multidimensional scaling with an unknown distance function, I & II. Psychometrika, 27, 125-140, 219-246.

Shepard, R. N. (1987). Toward a universal law of generalization for psychological science. Science, 237, 1317-1323.

Shepard, R. N., Hovland, C. I., & Jenkins, H. M. (1961). Learning and memorization of classifications. Psychological Monographs, 75(13), Whole No. 517.