{"title": "Constraints on Adaptive Networks for Modeling Human Generalization", "book": "Advances in Neural Information Processing Systems", "page_first": 2, "page_last": 10, "abstract": null, "full_text": "CONSTRAINTS ON ADAPTIVE NETWORKS\nFOR MODELING HUMAN GENERALIZATION\n\nM. Pavel\n\nMark A. Gluck\n\nVan Henkle\n\nDepartment of Psychology\n\nStanford University\nStanford, CA 94305\n\nABSTRACT\n\nThe potential of adaptive networks to learn categorization rules and to model human performance is studied by comparing how natural and artificial systems respond to new inputs, i.e., how they generalize. Like humans, networks can learn a deterministic categorization task by a variety of alternative individual solutions. An analysis of the constraints imposed by using networks with the minimal number of hidden units shows that this \"minimal configuration\" constraint is not sufficient to explain and predict human performance; only a few solutions were found to be shared by both humans and minimal adaptive networks. A further analysis of human and network generalizations indicates that initial conditions may provide important constraints on generalization. A new technique, which we call \"reversed learning\", is described for finding appropriate initial conditions.\n\nINTRODUCTION\n\nWe are investigating the potential of adaptive networks to learn categorization tasks and to model human performance. In particular, we have studied how both natural and artificial systems respond to new inputs, that is, how they generalize. In this paper we first describe a computational technique to analyze generalizations by adaptive networks. For a given network structure and a given classification problem, the technique enumerates all possible network solutions to the problem. 
We then report the results of an empirical study of human categorization learning. The generalizations of human subjects are compared to those of adaptive networks. A cluster analysis of both human and network generalizations indicates significant differences between human performance and possible network behaviors. Finally, we examine the role of the initial state of a network in biasing the solutions found by the network. Using data on the relations between human subjects' initial and final performance during training, we develop a new technique, called \"reversed learning\", which shows some potential for modeling human learning processes using adaptive networks. The scope of our analyses is limited to generalizations in deterministic pattern classification (categorization) tasks.\n\nThe basic difficulty in generalization is that there exist many different classification rules (\"solutions\") that correctly classify the training set but categorize novel objects differently. The number and diversity of possible solutions depend on the language defining the pattern recognizer. However, additional constraints can be used in conjunction with many types of pattern categorizers to eliminate some, hopefully undesirable, solutions.\n\nOne typical way of introducing additional constraints is to minimize the representation. For example, minimizing the number of equations and parameters in a mathematical expression, or the number of rules in a rule-based system, would ensure that some identification maps would not be computable. In the case of adaptive networks, minimizing the size of the network, which reduces the number of possible encoded functions, may result in improved generalization performance (Rumelhart, 1988). 
\n\nThe critical theoretical and applied questions in pattern recognition involve characterization and implementation of desirable constraints. In the first part of this paper we describe an analysis of adaptive networks that characterizes the solution space for any particular problem.\n\nANALYSES OF ADAPTIVE NETWORKS\n\nFeed-forward adaptive networks considered in this paper will be defined as directed graphs with linear threshold units (LTU) as nodes and with edges labeled by real-valued weights. The output, or activation, of a unit is determined by a monotonic nonlinear function of a weighted sum of the activations of all units whose edges terminate on that unit. There are three types of units within a feed-forward layered architecture: (1) input units, whose activity is determined by external input; (2) output units, whose activity is taken as the response; and (3) the remaining units, called hidden units. For the sake of simplicity, our discussion will be limited to objects represented by binary-valued vectors.\n\nA fully connected feed-forward network with an unlimited number of hidden units can compute any boolean function. Such a general network, therefore, provides no constraints on the solutions, and additional constraints must be imposed for the network to prefer one generalization over another. One such constraint is minimizing the size of the network. In order to explore the effect of minimizing the number of hidden units, we first identify the minimal network architecture and then examine its generalizations.\n\nMost of the results in this area have been limited to finding bounds on the expected number of possible patterns that could be classified by a given network (e.g., Cover, 1965; Volper & Hampson, 1987; Valiant, 1984; Baum & Haussler, 1989). 
The bounds found by these researchers hold for all possible categorizations and are, therefore, too broad to be useful for the analysis of particular categorization problems.\n\nTo determine the generalization behavior for a particular network architecture, a specific categorization problem, and a training set, it is necessary to find all possible solutions and the corresponding generalizations. To do this we used a computational (not a simulation) procedure developed by Pavel and Moore (1988) for finding minimal networks solving specific categorization problems. Pavel and Moore (1988) defined two network solutions to be different if at least one hidden unit categorized at least one object in the training set differently. Using this definition, their algorithm finds all possible different solutions. Because finding network solutions is NP-complete (Judd, 1987), for larger problems Pavel and Moore used a probabilistic version of the algorithm to estimate the distribution of generalization responses.\n\nOne way to characterize the constraints on generalization is in terms of the number of possible solutions. A larger number of possible solutions indicates that generalizations will be less predictable. The critical result of the analysis is that, even for minimal networks, the number of different network solutions is often quite large. Moreover, the number of solutions increases rapidly with increases in the number of hidden units. The apparent lack of constraints can also be demonstrated by finding the probability that a network with a randomly selected hidden layer can solve a given categorization problem. That is, suppose that we select n different hidden units, each unit representing a linear discriminant function. 
The activations of these random hidden units can be viewed as a transformation of the input patterns. We can ask what is the probability that an output unit can be found to perform the desired dichotomization. A typical example of a result of this analysis is shown in Figure 1 for the three-dimensional (3D) parity problem. In the minimal configuration involving three hidden units there were 62 different solutions to the 3D parity problem. The rapid increase in probability (high slope of the curve in Figure 1) indicates that adding a few more hidden units rapidly increases the probability that a random hidden layer will solve the 3D parity problem.\n\n[Figure 1: plot of 3D PARITY (solid line) and EXPERIMENT (dashed line) against HIDDEN UNITS (horizontal axis, 0 to 12); vertical axis scaled 0 to 100.]\n\nFigure 1. The proportion of solutions to the 3D parity problem (solid line) and the experimental task (dashed line) as a function of the number of hidden units.\n\nThe results of a more detailed analysis of the generalization performance of the minimal networks will be discussed following a description of a categorization experiment with human subjects.\n\nHUMAN CATEGORIZATION EXPERIMENT\n\nIn this experiment human subjects learned to categorize objects which were defined by four-dimensional binary vectors. Of the 2^4 = 16 possible objects, subjects were trained to classify a subset of 8 objects into two categories of 4 objects each. The specific assignment of objects to categories was patterned after Medin et al. (1982) and is shown in Figure 2. 
Eight of the patterns are designated as a training set and the remaining eight comprise the test set. The assignment of the patterns in the training set into two categories was such that there were many combinations of rules that could be used to correctly perform the categorization. For example, the first two dimensions could be used with one other dimension. The training patterns could also be categorized on the basis of an exclusive-or (XOR) of the last two dimensions. The type of solution obtained by a human subject could only be determined by examining responses to the test set as well as the training set.\n\nDIMENSION   TRAINING SET          TEST SET\nX1          1 1 0 1   0 0 1 0     0 0 0 1   1 1 0 1\nX2          1 1 1 0   0 0 0 1     0 0 1 0   1 1 1 0\nX3          1 0 1 0   1 0 1 0     0 1 0 1   1 0 1 0\nX4          1 0 1 0   0 1 0 1     0 1 0 1   0 1 0 1\nCATEGORY    A A A A   B B B B     ? ? ? ?   ? ? ? ?\n\nFigure 2. Patterns to be classified. (Adapted from Medin et al., 1982.)\n\nIn the actual experiments, subjects were asked to perform a medical diagnosis for each pattern of four symptoms (dimensions). The experimental procedure will be described here only briefly because the details of this experiment have been described elsewhere (Pavel, Gluck, & Henkle, 1988). Each of the patterns was presented serially in a randomized order. Subjects responded with one of the categories and then received feedback. The training of each individual continued until he reached a criterion (responding correctly to 32 consecutive stimuli) or until each pattern had been presented 32 times. The data reported here are based on 78 subjects, half (39) of whom learned the task to criterion and half of whom did not.\n\nFollowing the training phase, subjects were tested using all 16 possible patterns. The results of the test phase enabled us to determine the generalizations performed by the subjects. 
Subjects' generalizations were used to estimate the \"functions\" that they may have been using. For example, of the 39 criterion subjects, 15 used a solution that was consistent with the exclusive-or (XOR) of the dimensions X3 and X4.\n\nWe use \"response profiles\" to graph responses for an ensemble of functions, in this case for a group of subjects. A response profile represents the probability of assigning each pattern to category \"A\". For example, the response profile for the XOR solution is shown in Figure 3A. For convenience we define the responses to the test set as the \"generalization profile\". The response profile of all subjects who reached the criterion is shown in Figure 3B. The responses of our criterion subjects to the training set were basically identical and correct. The distribution of subjects' generalization profiles reflected in the overall generalization profile is indicative of considerable individual differences.\n\n[Figure 3: two horizontal bar charts, panels (A) and (B). Vertical axis: patterns, listed from top to bottom as 1001, 0110, 1101, 1110, 1011, 0100, 0011, 0000, 0101, 1010, 0001, 0010, 1000, 0111, 1100, 1111. Horizontal axis: proportion of \"A\" responses.]\n\nFigure 3. (A) Response profile of the XOR solution, and (B) the proportion of the response \"A\" to all patterns for human subjects (dark bars) and minimal networks (light bars). The lower 8 patterns are from the training set and the upper 8 patterns from the test set.\n\nMODELING THE RESPONSE PROFILE\n\nOne of our goals is to model subjects' distribution of categorizations as represented by the response profile in Figure 3B. 
We considered three natural approaches to such modeling: (1) statistical/proximity models, (2) minimal disjunctive normal forms (DNF), and (3) minimal two-layer networks.\n\nThe statistical approach is based on the assumption that the response profile over subjects represents the probability of categorizations performed by each subject. Our data are not consistent with that assumption because each subject appeared to behave deterministically. The second approach, using the minimal DNF, is also not a good candidate because there are only four such solutions and the response profile over those solutions differs considerably from that of the subjects. Turning to the adaptive network solutions, we found all the solutions using the linear programming technique described above (Pavel & Moore, 1988). The minimal two-layer adaptive network that was capable of solving the training set problem consisted of two hidden units. The proportion of solutions as a function of the number of hidden units is shown in Figure 1 by the dashed line.\n\nFor the minimal network there were 18 different solutions. These 18 solutions had 8 different individual generalization profiles. Assuming that each of the 18 network solutions is equally likely, we computed the generalization profile for the minimal network shown in Figure 3B. The response profile for the minimal network represents the probability that a randomly selected minimal network will assign a given pattern to category \"A\". Even without statistical testing we can conclude that the generalization profiles for humans and networks are quite different. It is possible, however, 
that humans and minimal networks obtain similar solutions and that the differences in the average responses are due to the particular statistical sampling assumption used for the minimal networks (i.e., each solution is equally likely). In order to determine the overlap of solutions we examined the generalization profiles in more detail.\n\nCLUSTERING ANALYSIS OF GENERALIZATION PROFILES\n\nTo analyze the similarity in solutions we defined a metric on generalization profiles. The Hamming distance between two profiles is equal to the number of patterns that are categorized differently. For example, the distance between generalization profiles \"A A B A B B B B\" and \"A A B B B B A B\" is equal to two, because the two profiles differ on only the fourth and seventh patterns. Figure 4 shows the results of a cluster analysis using a hierarchical clustering procedure that maximizes the average distance between clusters.\n\n[Figure 4: two dendrograms, human (left) and network (right); individual leaf labels are not recoverable.]\n\nFigure 4. Results of hierarchical clustering for human (left) and network (right) generalization profiles.\n\nIn this graph the average distance between any two clusters is shown by the value of the lowest common node in the tree. 
The clustering analysis indicates that humans and networks obtained widely different generalization profiles. Only three generalization profiles were found to be common to humans and networks. This number of common generalizations is to be expected by chance if the human and network solutions are independent. Thus, even if there exists a learning algorithm that approximates the human probability distribution of responses, the minimal network would not be a good model of human performance in this task.\n\nIt is clear from the previously described network analysis that somewhat larger networks with different constraints could account for human solutions. In order to characterize the additional constraints, we examined subjects' individual strategies to find out why individual subjects obtained different solutions.\n\nANALYSIS OF HUMAN LEARNING STRATEGIES\n\nHuman learning strategies that lead to preferences for particular solutions may best be modeled in networks by imposing constraints and providing hints (Abu-Mostafa, 1989). These include choosing the network architecture and a learning rule, constraining connectivity, and specifying initial conditions. We will focus on the specification of initial conditions.\n\n[Figure 5: bar chart. Horizontal axis: subject types (XOR, Non-XOR, No criterion). Vertical axis: number of responses (0 to 30).]\n\nFigure 5. The number of consistent or non-stable responses (black) and the number of stable incorrect responses (light) for XOR and non-XOR criterion subjects, and for those who never reached criterion. 
\n\nOur effort to examine initial conditions was motivated by large differences in learning curves (Pavel et al., 1988) between subjects who obtained the XOR solutions and those who did not. The subjects who did not obtain the XOR solutions would perform much better on some patterns (e.g., 0001) than the XOR subjects, but worse on other patterns (e.g., 1000). We concluded that these subjects discovered, during the first few trials, rules that categorized most of the training patterns correctly but failed on one or two training patterns.\n\nWe examined the sequences of subjects' responses to see how well they adhered to \"incorrect\" rules. We designated a response to a pattern as stable if the individual responded the same way to that pattern at least four times in a row. We designated a response as consistent if the response was stable and correct. The results of the analysis are shown in Figure 5. These results indicate that the subjects who eventually achieved the XOR solution were less likely to generate stable incorrect solutions. Another important result is that those subjects who never learned the correct responses to the training set were not responding randomly. Rather, they were systematically using incorrect rules. On the basis of these results, we conclude that subjects' initial strategies may be important determinants of their final solutions.\n\nREVERSED LEARNING\n\nFor simplicity we identify subjects' initial conditions by their responses on the first few trials. An important theoretical question is whether or not it is possible to find a network structure, initial conditions, and a learning rule such that the network can represent both the initial and final behavior of the subject. In order to study this problem we developed a technique we call \"reversed learning\". 
It is based on a perturbation analysis of feed-forward networks. We use the fact that the error surface in a small neighborhood of a minimum is well approximated by a quadratic surface. Hence, a well-behaved gradient descent procedure with a starting point in the neighborhood of the minimum will find that minimum.\n\nThe reversed learning procedure consists of three phases. (1) A network is trained to a final desired state of a particular individual, using both the training and the test patterns. (2) Using only the training patterns, the network is then trained to achieve the initial state of that individual subject closest to the desired final state. (3) The network is trained with only the training patterns and the solution is compared to the subject's response profiles. Our preliminary results indicate that this procedure leads in many cases to initial conditions that favor the desired solutions. We are currently investigating conditions for finding the optimal initial states.\n\nCONCLUSION\n\nThe main goal of this study was to examine constraints imposed by humans (experimentally) and networks (linear programming) on learning of simple binary categorization tasks. We characterize the constraints by analyzing responses to novel stimuli. We showed that, like humans, networks learn the deterministic categorization task and find many very different individual solutions. Thus adaptive networks are better models than statistical models and DNF rules. The constraints imposed by minimal networks, however, appear to differ from those imposed by human learners in that there are only a few solutions shared between humans and adaptive networks. After a detailed analysis of the human learning process we concluded that initial conditions may provide important constraints. 
In fact, we consider the set of initial conditions as powerful \"hints\" (Abu-Mostafa, 1989) which reduce the number of potential solutions without reducing the complexity of the problem. We demonstrated the potential effectiveness of these constraints using a perturbation technique, which we call reversed learning, for finding appropriate initial conditions.\n\nAcknowledgements\n\nThis work was supported by research grants from the National Science Foundation (BNS-86-18049) to Gordon Bower and Mark Gluck, and (IST-8511589) to M. Pavel, and by a grant from NASA Ames (NCC 2-269) to Stanford University. We thank Steve Sloman and Bob Rehder for useful discussions and their comments on this draft.\n\nReferences\n\nAbu-Mostafa, Y. S. (1989). Learning by example with hints. NIPS.\n\nBaum, E. B., & Haussler, D. (1989). What size net gives valid generalization? NIPS.\n\nCover, T. (June 1965). Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Transactions on Electronic Computers, EC-14(3), 326-334.\n\nJudd, J. S. (1987). Complexity of connectionist learning with various node functions. Presented at the First IEEE International Conference on Neural Networks, San Diego, June 1987.\n\nMedin, D. L., Altom, M. W., Edelson, S. M., & Freko, D. (1982). Correlated symptoms and simulated medical classification. Journal of Experimental Psychology: Learning, Memory, & Cognition, 8(1), 37-50.\n\nPavel, M., Gluck, M. A., & Henkle, V. (1988). Generalization by humans and multi-layer adaptive networks. Submitted to Tenth Annual Conference of the Cognitive Science Society, August 17-19, 1988.\n\nPavel, M., & Moore, R. T. (1988). Computational analysis of solutions of two-layer adaptive networks. APL Technical Report, Dept. 
of Psychology, Stanford University.\n\nValiant, L. G. (1984). A theory of the learnable. Comm. ACM, 27(11), 1134-1142.\n\nVolper, D. J., & Hampson, S. E. (1987). Learning and using specific instances. Biological Cybernetics, 56.", "award": [], "sourceid": 106, "authors": [{"given_name": "Mark", "family_name": "Gluck", "institution": null}, {"given_name": "M.", "family_name": "Pavel", "institution": null}, {"given_name": "Van", "family_name": "Henkle", "institution": null}]}