{"title": "Connectionist Implementation of a Theory of Generalization", "book": "Advances in Neural Information Processing Systems", "page_first": 665, "page_last": 671, "abstract": null, "full_text": "Connectionist Implementation of a Theory of Generalization \n\nRoger N. Shepard \n\nDepartment of Psychology \n\nStanford University \n\nStanford, CA 94305-2130 \n\nSheila Kannappan \n\nDepartment of Physics \n\nHarvard University \n\nCambridge, MA 02138 \n\nAbstract \n\nEmpirically, generalization between a training and a test stimulus falls off in close approximation to an exponential decay function of distance between the two stimuli in the \"stimulus space\" obtained by multidimensional scaling. Mathematically, this result is derivable from the assumption that an individual takes the training stimulus to belong to a \"consequential\" region that includes that stimulus but is otherwise of unknown location, size, and shape in the stimulus space (Shepard, 1987). As the individual gains additional information about the consequential region, by finding other stimuli to be consequential or not, the theory predicts the shape of the generalization function to change toward the function relating actual probability of the consequence to location in the stimulus space. This paper describes a natural connectionist implementation of the theory, and illustrates how implications of the theory for generalization, discrimination, and classification learning can be explored by connectionist simulation. \n\n1  THE THEORY OF GENERALIZATION \n\nBecause we never confront exactly the same situation twice, anything we have learned in any previous situation can guide us in deciding which action to take in the present situation only to the extent that the similarity between the two situations is sufficient to justify generalization of our previous learning to the present situation. 
Accordingly, principles of generalization must be foundational for any theory of behavior. \n\nIn Shepard (1987) nonarbitrary principles of generalization were sought that would be optimum in any world in which an object, however distinct from other objects, is generally a member of some class or natural kind sharing some dispositional property of potential consequence for the individual. A newly encountered plant or animal might be edible or poisonous, for example, depending on the hidden genetic makeup of its natural kind. \n\nThis simple idea was shown to yield a quantitative explanation of two very general empirical regularities that emerge when generalization data are submitted to methods of multidimensional scaling. The first concerns the shape of the generalization gradient, which describes how response probability falls off with distance of a test stimulus from the training stimulus in the obtained representational space. The second, which is not treated in the present (unidimensional) connectionist implementation, concerns the metric of multidimensional representational spaces. (See Shepard, 1987.) \n\nThese results were mathematically derived for the simplest case of an individual who, in the absence of any advance knowledge about particular objects, now encounters one such object and discovers it to have an important consequence. From such a learning event, the individual can conclude that all objects are consequential that are of the same kind as that object and that therefore fall in some consequential region that overlaps the point corresponding to that object in representational space. The individual can only estimate the probability that a given new object is consequential as the conditional probability, given that a region of unknown size and shape overlaps that point, 
that it also overlaps the point corresponding to the new object. The gradient of generalization then arises because a new object that is closer to the old object in the representational space is more likely to fall within a random region that overlaps the old object. \n\nIn order to obtain a quantitative estimate of the probability that the new stimulus is consequential, the individual must integrate over all candidate regions in representational space, with, perhaps, different probabilities assigned, a priori, to different sizes and shapes of region. The results turn out to depend remarkably little on the prior probabilities assigned (Shepard, 1987). For any reasonable choice of these probabilities, integration yields an approximately exponential gradient. And, for the single most reasonable choice in the absence of any advance information about size or shape, namely, the choice of maximum entropy prior probabilities, integration yields exactly the exponential decay function. \n\nThese results were obtained by separating the psychological problem of the form of generalization in a psychological space from the psychophysical problem of the mapping from any physical parameter space to that psychological space. The psychophysical mapping, having been shaped by natural selection, would favor a representational space in which regions that correspond to natural kinds, though variously sized and shaped, are not on average systematically elongated or compressed in any particular direction or location of the space. Such a regularized space would provide the best basis for generalization from objects of newly encountered kinds. \n\nThe psychophysical mapping thus corresponds to an optimum mapping from input to hidden units in a connectionist system. Indeed, 
Rumelhart (1990) has recently suggested that the power of the connectionist approach comes from the ability of a set of hidden units to represent the relations among possible inputs according to their significances for the system as a whole rather than according to their superficial relations at the input level. Although in biologically evolved individuals the psychophysical mapping is likely to have been shaped more through evolution than through learning (Shepard, 1989; see also Miller & Todd, 1990), the connectionist implementation to be described here does provide for some fine tuning of this mapping through learning. \n\nBeyond the exponential form of the gradient of generalization following training on a single stimulus, three basic phenomena of discrimination and classification learning that the theory of generalization should be able to explain are the following: First, when all and only the stimuli within a compact subset are followed by an important consequence (reinforcement), an individual should eventually learn to respond to all and only the stimuli in that subset (Shepard, 1990), at least to the degree possible, given any noise-induced uncertainty about locations in the representational space (Shepard, 1986, 1990). Second, when the positive stimuli do not form a compact subset but are interspersed among negative (nonreinforced) stimuli, generalization should entail a slowing of classification learning (Nosofsky, 1986; Shepard & Chang, 1963). Third, repeated discrimination or classification learning, in which a boundary between positive and negative stimuli remains fixed, should induce a \"fine tuning\" stretching of the representational space at that boundary such that any subsequent learning will generalize less fully across that boundary. 
\n\nOur initial connectionist explorations have been for relatively simple cases using a unidimensional stimulus set and a linear learning rule. These simulations serve to illustrate how information about the probable disposition of a consequential region accrues, in a Bayesian manner, from successive encounters with different stimuli, each of which is or is not followed by the consequence. In complex cases, the cumulative effects on probability of generalized response, on latency of discriminative response, and on fine tuning of the psychophysical mapping may sometimes be easier to establish by simulation than by mathematical derivation. Fortunately for this purpose, the theory of generalization has a connectionist embodiment that is quite direct and neurophysiologically plausible. \n\n2  A CONNECTIONIST IMPLEMENTATION \n\nIn the implementation reported here, a linear array of 20 input units represents a set of 20 stimuli differing along a unidimensional continuum, such as the continuum of pitches of tones. The activation level of a given input unit is 1 when its corresponding stimulus is presented and 0 when it is not. (This localist representation of the \"input\" may be considered the output of a lower-level, massively parallel network for perceptual analysis.) \n\nWhen such an \"input unit\" is activated, its activation propagates upward and outward through successively higher layers of hidden units, giving rise to a cone of activation of that input unit (Figure 1a). Higher units are activated by wider ranges of input units (i.e., have larger \"receptive fields\"). The hidden units thus represent potential consequential regions, with higher units corresponding to regions of greater sizes in representational space. \n\nThe activation from any input unit is also subject to progressive attenuation as it propagates to successively higher layers of hidden units. 
In the present form of the model, this attenuation comes about because the weights of the connections from input to hidden units fall off exponentially with the heights of the hidden units. (Connection weights are pictorially indicated in Figure 1 by the heaviness of the connecting lines.) An exponential falloff of connection weight with height is natural, in that it corresponds to a decrement of fixed proportion as the activation propagates through each layer to the next. According to the generalization theory (Shepard, 1987), an exponential falloff is also optimum for the case of minimum prior knowledge, because it corresponds to the maximum entropy probability density distribution of possible sizes of a consequential region. \n\nFigure 1: Schematic portrayal of the connectionist embodiment. (a) Initial connections from an input unit to hidden units in its \"cone of activation.\" (b) Connections from these hidden units to a response unit following reinforcement of the response. (Horizontal axis: input units corresponding to values on a unidimensional continuum.) \n\nWhen a response, $R_k$, is followed by a positive consequence in the presence of a stimulus, $S_1$, a simple linear rule (either a Hebbian or a delta rule) will increase the weight of the connection from each representational unit, $j$ (whether input or hidden unit), to that response unit, $R_k$, in proportion to the current level of activation, $a_j$, of that representational unit. 
In the initial implementation considered here, the change in weight from representational unit $j$ to the response unit $R_k$ is simply \n\n$$\\Delta w_{jk} = \\begin{cases} \\lambda a_j (1 - a_k) & \\text{upon a positive outcome (reinforcement)} \\\\ -\\lambda a_j a_k & \\text{upon a negative outcome (nonreinforcement)} \\end{cases}$$ \n\nwhere $\\lambda$ is a learning rate parameter and $a_k$ is the current activation level of the response unit $R_k$ (which, tending to be confined between 0 and 1, represents an estimate of the probability of the positive consequence). Following a positive outcome, then, positive weights will connect all the units in the cone of activation for $S_1$ to $R_k$, but with values that decay exponentially with the height of a unit in that cone (Figure 1b). \n\nIf, now, a different stimulus, $S_2$, is encountered, some but not all of the representational units that are in the cone of activation of $S_1$ and, hence, that are already connected to $R_k$ will also fall in the cone of activation of $S_2$ (Figure 1b). It is these units in the overlap of the two cones that mediate generalization of the response from $S_1$ to $S_2$. Not only is this simple connectionist scheme neurophysiologically plausible, it is also isomorphic to the theory of generalization (Shepard, 1987) based solely on considerations of optimal behavior in a world consisting of natural kinds. \n\n3  PRELIMINARY CONNECTIONIST EXPLORATIONS \n\nThe simulation results for generalization and discrimination learning are summarized in Figure 2. Panel a shows, for different stages of training on stimulus $S_{10}$, the level of response activation produced by activation of each of the 20 input units. In accordance with theory, this activation decayed exponentially with distance from the training stimulus. The obtained functions differ only by a multiplicative scale factor that increased (toward asymptote) with the amount of training. 
Following this training, the response connection weights decreased exponentially with the heights of the hidden units (panel b). Later training on a second positive stimulus, $S_{12}$, created a secondary peak in the activation function (panel c), and still later nonreinforced presentation of a third stimulus, $S_9$, produced a sharp drop in the activation function at the discrimination boundary (panel d). \n\nFigure 2: Connectionist simulations of generalization and discrimination learning. (Horizontal axis of each panel: input unit.) \n\nFigure 3 presents the results for classification learning in which all stimuli were presented but with response reinforcement for stimuli in the positive set only. When the positive set was compact (panel a), sharp discrimination boundaries formed and response activation approached 1 for all positive stimuli and 0 for all negative stimuli. In accordance with theory and empirical data, generalization entailed slower classification learning when the positive stimuli were dispersed among negative stimuli (panel b), as shown by a (mean square) error measure (panel c). \n\nFigure 3: Connectionist simulations of classification learning. (Panel annotations: in each case, the positive set contains 5 out of 20 stimuli; panel c plots error against successive learning epochs; panel d marks the previous discrimination boundaries.) \n\nFinally, panel d illustrates fine tuning of the psychophysical mapping when discrimination boundaries have the same locations for many successively learned classifications. In contrast to the preceding simulations, in which only the response connection weights were allowed to change, here the connection weights from the input units to the hidden units were also allowed to change through \"back propagation\" (Rumelhart, Hinton, & Williams, 1986). For 400 learning epochs each, each of ten different responses was successively associated with the same five positive stimuli, $S_{10}$ through $S_{14}$, while reinforcement was withheld for all the remaining stimuli. Then, yet another response was associated with the single stimulus $S_{10}$. Although the resulting activation curves for this new response (panel d) are similar to the original generalization curves (Figure 2a), they drop more sharply where classification boundaries were previously located. This fine tuning of the psychophysical mapping proceeded, however, much more slowly than the learning of the classificatory responses themselves. \n\n4  CONCLUDING REMARKS \n\nThis is just the beginning of the connectionist exploration of the implications of the generalization theory in more complex cases. 
In addition to accounting for generalization and classification along a unidimensional continuum, the approach can account for generalization and classification of stimuli differing with respect to multidimensional continua (Shepard, 1987) and also with respect to discrete features (Gluck, 1991; Russell, 1986). Finally, the connectionist implementation should facilitate a proposed extension to the treatment of response latencies as well as probabilities (Shepard, 1987). \n\nConnectionists have sometimes assumed an exponential decay generalization function, and their notion of radial basis functions is not unlike the present concept of consequential regions (see Hanson & Gluck, this volume). What has been advocated here (and in Shepard, 1987) is the derivation of such functions and concepts from first principles. \n\nAcknowledgements \n\nThis work was supported by National Science Foundation grant BNS85-11685 to the first author. For help and guidance, we thank Jonathan Bachrach, Geoffrey Miller, Mark Monheit, David Rumelhart, and Steven Sloman. \n\nReferences \n\nGluck, M. A. (1991). Stimulus generalization and representation in adaptive network models of category learning. Psychological Science, 2. (in press). \n\nHanson, S. J. & Gluck, M. A. (1991). Spherical units as dynamic consequential regions: Implications for attention, competition, and categorization. (This volume). \n\nMiller, G. F. & Todd, P. M. (1990). Exploring adaptive agency I: Theory and methods for simulating the evolution of learning. In D. S. Touretzky, J. L. Elman, T. J. Sejnowski, & G. E. Hinton (Eds.), Proceedings of the 1990 Connectionist Models Summer School. San Mateo, CA: Morgan Kaufmann. \n\nNosofsky, R. M. (1986). Attention, similarity, and the identification-categorization relationship. 
Journal of Experimental Psychology: General, 114, 39-57. \n\nRumelhart, D. E. (1990). Representation in connectionist models (The Association Lecture). Attention & Performance Meeting. Ann Arbor, Michigan, July 9. \n\nRumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323, 533-536. \n\nRussell, S. J. (1986). A quantitative analysis of analogy by similarity. In Proceedings of the National Conference on Artificial Intelligence. Philadelphia, PA: American Association for Artificial Intelligence. \n\nShepard, R. N. (1986). Discrimination and generalization in identification and classification: Comment on Nosofsky. Journal of Experimental Psychology: General, 115, 50-61. \n\nShepard, R. N. (1987). Toward a universal law of generalization for psychological science. Science, 237, 1317-1323. \n\nShepard, R. N. (1989). Internal representation of universal regularities: A challenge for connectionism. In L. Nadel, L. A. Cooper, P. Culicover, & R. M. Harnish (Eds.), Neural Connections, Mental Computation (pp. 103-104). Cambridge, MA: MIT Press. \n\nShepard, R. N. (1990). Neural nets for generalization and classification: Comment on Staddon and Reid. Psychological Review, 97, 579-580. \n\nShepard, R. N. & Chang, J. J. (1963). Stimulus generalization in the learning of classifications. Journal of Experimental Psychology, 65, 94-102. \n\n", "award": [], "sourceid": 351, "authors": [{"given_name": "Roger", "family_name": "Shepard", "institution": null}, {"given_name": "Sheila", "family_name": "Kannappan", "institution": null}]}