{"title": "Rules and Similarity in Concept Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 59, "page_last": 65, "abstract": null, "full_text": "Rules and Similarity in Concept Learning \n\nJoshua B. Tenenbaum \nDepartment of Psychology \nStanford University, Stanford, CA 94305 \njbt@psych.stanford.edu \n\nAbstract \n\nThis paper argues that two apparently distinct modes of generalizing concepts - abstracting rules and computing similarity to exemplars - should both be seen as special cases of a more general Bayesian learning framework. Bayes explains the specific workings of these two modes - which rules are abstracted, how similarity is measured - as well as why generalization should appear rule- or similarity-based in different situations. This analysis also suggests why the rules/similarity distinction, even if not computationally fundamental, may still be useful at the algorithmic level as part of a principled approximation to fully Bayesian learning. \n\n1 Introduction \n\nIn domains ranging from reasoning to language acquisition, a broad view is emerging of cognition as a hybrid of two distinct modes of computation, one based on applying abstract rules and the other based on assessing similarity to stored exemplars [7]. Much support for this view comes from the study of concepts and categorization. In generalizing concepts, people's judgments often seem to reflect both rule-based and similarity-based computations [9], and different brain systems are thought to be involved in each case [8]. Recent psychological models of classification typically incorporate some combination of rule-based and similarity-based modules [1,4]. In contrast to this currently popular modularity position, I will argue here that rules and similarity are best seen as two ends of a continuum of possible concept representations. 
In [11,12], I introduced a general theoretical framework, based on the principles of Bayesian inference, to account for how people can learn concepts from just a few positive examples. Here I explore how this framework provides a unifying explanation for these two apparently distinct modes of generalization. The Bayesian framework not only includes both rules and similarity as special cases but also addresses several questions that conventional modular accounts do not. People employ particular algorithms for selecting rules and measuring similarity. Why these algorithms, as opposed to any others? People's generalizations appear to shift from similarity-like patterns to rule-like patterns in systematic ways, e.g., as the number of examples observed increases. Why these shifts? \n\nThis short paper focuses on a simple learning game involving number concepts, in which both rule-like and similarity-like generalizations clearly emerge in the judgments of human subjects. Imagine that I have written some short computer programs which take as input a natural number and return as output either \"yes\" or \"no\" according to whether that number satisfies some simple concept. Some possible concepts might be \"x is odd\", \"x is between 30 and 45\", \"x is a power of 3\", or \"x is less than 10\". For simplicity, we assume that only numbers under 100 are under consideration. The learner is shown a few randomly chosen positive examples - numbers that the program says \"yes\" to - and must then identify the other numbers that the program would accept. This task, admittedly artificial, nonetheless draws on people's rich knowledge of number while remaining amenable to theoretical analysis. Its structure is meant to parallel more natural tasks, such as word learning, that often require meaningful generalizations from only a few positive examples of a concept. 
\n\nSection 2 presents representative experimental data for this task. Section 3 describes a Bayesian model and contrasts its predictions with those of models based purely on rules or similarity. Section 4 summarizes and discusses the model's applicability to other domains. \n\n2 The number concept game \n\nEight subjects participated in an experimental study of number concept learning, under essentially the same instructions as those given above [11]. On each trial, subjects were shown one or more random positive examples of a concept and asked to rate the probability that each of 30 test numbers would belong to the same concept as the examples observed. X denotes the set of examples observed on a particular trial, and n the number of examples. \n\nTrials were designed to fall into one of three classes. Figure 1a presents data for two representative trials of each class. Bar heights represent the average judged probabilities that particular test numbers fall under the concept given one or more positive examples X, marked by \"*\"s. Bars are shown only for those test numbers rated by subjects; missing bars do not denote zero probability of generalization, merely missing data. \n\nOn class I trials, subjects saw only one example of each concept: e.g., X = {16} and X = {60}. To minimize bias, these trials preceded all others, on which multiple examples were given. Given only one example, people gave most test numbers fairly similar probabilities of acceptance. Numbers that were intuitively more similar to the example received slightly higher ratings: e.g., for X = {16}, 8 was more acceptable than 9 or 6, and 17 more than 87; for X = {60}, 50 was more acceptable than 51, and 63 more than 43. \n\nThe remaining trials each presented four examples and occurred in pseudorandom order. On class II trials, the examples were consistent with a simple mathematical rule: X = {16, 8, 2, 64} or X = {60, 80, 10, 30}. 
Note that the obvious rules, \"powers of two\" and \"multiples of ten\", are in no way logically implied by the data. \"Multiples of five\" is a possibility in the second case, and \"even numbers\" or \"all numbers under 80\" are possibilities in both, not to mention other logically possible but psychologically implausible candidates, such as \"all powers of two, except 32 or 4\". Nonetheless, subjects overwhelmingly followed an all-or-none pattern of generalization, with all test numbers rated near 0 or 1 according to whether they satisfied the single intuitively \"correct\" rule. These preferred rules can be loosely characterized as the most specific rules (i.e., those with the smallest extension) that include all the examples and that also meet some criterion of psychological simplicity. \n\nOn class III trials, the examples satisfied no simple mathematical rule but did have similar magnitudes: X = {16, 23, 19, 20} and X = {60, 52, 57, 55}. Generalization now followed a similarity gradient along the dimension of magnitude. Probability ratings fell below 0.5 for numbers more than a characteristic distance ε beyond the largest or smallest observed examples - roughly the typical distance between neighboring examples (about 2 or 3). Logically, there is no reason why participants could not have generalized according to various complex rules that happened to pick out the given examples, or according to very different values of ε, yet all subjects displayed more or less the same similarity gradients. \n\nTo summarize these data, generalization from a single example followed a weak similarity gradient based on both mathematical and magnitude properties of numbers. 
When several more examples were observed, generalization evolved into either an all-or-none pattern determined by the most specific simple rule or, when no simple rule applied, a more articulated magnitude-based similarity gradient falling off with characteristic distance ε roughly equal to the typical separation between neighboring examples. Similar patterns were observed on several trials not shown (including one with a different value of ε) and in two other experiments in quite different domains (described briefly in Section 4). \n\n3 The Bayesian model \n\nIn [12], I introduced a Bayesian framework for concept learning in the context of learning axis-parallel rectangles in a multidimensional feature space. Here I show that the same framework can be adapted to the more complex situation of learning number concepts and can explain all of the phenomena of rules and similarity documented above. Formally, we observe n positive examples X = {x(1), ..., x(n)} of concept C and want to compute p(y ∈ C | X), the probability that some new object y belongs to C given the observations X. Inductive leverage is provided by a hypothesis space H of possible concepts and a probabilistic model relating hypotheses h to data X. \n\nThe hypothesis space. Elements of H correspond to subsets of the universe of objects that are psychologically plausible candidates for the extensions of concepts. Here the universe consists of the numbers between 1 and 100, and the hypotheses correspond to subsets such as the even numbers, the numbers between 1 and 10, etc. The hypotheses can be thought of in terms of either rules or similarity, i.e., as potential rules to be abstracted or as features entering into a similarity computation, but Bayes does not distinguish these interpretations. 
\n\nBecause we can capture only a fraction of the hypotheses people might bring to this task, we would like an objective way to focus on the most relevant parts of people's hypothesis space. One such method is additive clustering (ADCLUS) [6,10], which extracts a set of features that best accounts for subjects' similarity judgments on a given set of objects. These features simply correspond to subsets of objects and are thus naturally identified with hypotheses for concept learning. Applications of ADCLUS to similarity judgments for the numbers 0-9 reveal two kinds of subsets [6,10]: numbers sharing a common mathematical property, such as {2, 4, 8} and {3, 6, 9}, and consecutive numbers of similar magnitude, such as {1, 2, 3, 4} and {2, 3, 4, 5, 6}. Applying ADCLUS to the full set of numbers from 1 to 100 is impractical, but we can construct an analogous hypothesis space for this domain based on the two kinds of hypotheses found in the ADCLUS solution for 0-9. One group of hypotheses captures salient mathematical properties: odd, even, square, cube, and prime numbers, multiples and powers of small numbers (≤ 12), and sets of numbers ending in the same digit. A second group of hypotheses, representing the dimension of numerical magnitude, includes all intervals of consecutive numbers with endpoints between 1 and 100. \n\nPriors and likelihoods. The probabilistic model consists of a prior p(h) over H and a likelihood p(X | h) for each hypothesis h ∈ H. Rather than assigning prior probabilities to each of the 5083 hypotheses individually, I adopted a hierarchical approach based on the intuitive division of H into mathematical properties and magnitude intervals. A fraction λ of the total probability was allocated to the mathematical hypotheses as a group, leaving (1 - λ) for the magnitude hypotheses. 
The ,\\ probability was distributed uniformly across the mathe(cid:173)\nmatical hypotheses. The (1 - ,\\) probability was distributed across the magnitude intervals \nas a function of interval size according to an Erlang distribution, p( h) ex (Ihl/ li2 )e- 1hl /0', \nto capture the intuition that intervals of some intermediate size are more likely than those \nof very large or small size. ,\\ and Ii are treated as free parameters of the model. \n\nThe likelihood is determined by the assumption of randomly sampled positive examples. \nIn the simplest case, each example in X is assumed to be independently sampled from a \nuniform density over the concept G. For n examples we then have: \n\np(Xlh) \n\nl/lhl n if Vj, xU) E h \no otherwise, \n\n(1) \n\nwhere I h I denotes the size of the subset h. For example, if h denotes the even numbers, then \nIhl = 50, because there are 50 even numbers between I and 100. Equation I embodies the \nsize principle for scoring hypotheses: smaller hypotheses assign greater likelihood than do \nlarger hypotheses to the same data, and they assign exponentially greater likelihood as the \nnumber of consistent examples increases. The size principle plays a key role in learning \nconcepts from only positive examples [12], and, as we will see below, in determining the \nappearance of rule-like or similarity-like modes of generalization. \n\nGiven these priors and likelihoods, the posterior p( hlX) follows directly from Bayes' rule. \nFinally, we compute the probability of generalization to a new object y by averaging the \npredictions of all hypotheses weighted by their posterior probabilities p( h IX): \n\np(y E GIX) = L p(y E Glh)p(hIX). \n\nhE1i \n\n(2) \n\nEquation 2 follows from the conditional independence of X and the membership of y E G, \ngiven h. To evaluate Equation 2, note that p(y E Glh) is simply 1 ify E h, and 0 otherwise. \nModel results. Figure Ib shows the predictions of this Bayesian model (with'\\ = 1/2, \nIi = 10). 
The model captures the main features of the data, including convergence to the most specific rule on Class II trials and to appropriately shaped similarity gradients on Class III trials. We can understand the transitions between graded, similarity-like and all-or-none, rule-like regimes of generalization as arising from the interaction of the size principle (Equation 1) with hypothesis averaging (Equation 2). Because each hypothesis h contributes to the average in Equation 2 in proportion to its posterior probability p(h | X), the degree of uncertainty in p(h | X) determines whether generalization will be sharp or graded. When p(h | X) is very spread out, many distinct hypotheses contribute significantly, resulting in a broad gradient of generalization. When p(h | X) is concentrated on a single hypothesis h*, only h* contributes significantly and generalization appears all-or-none. The degree of uncertainty in p(h | X) is in turn a consequence of the size principle. Given a few examples consistent with one hypothesis that is significantly smaller than the next-best competitor - such as X = {16, 8, 2, 64}, where \"powers of two\" is significantly smaller than \"even numbers\" - the smallest hypothesis becomes exponentially more likely than any other, and generalization appears to follow this most specific rule. However, given only one example (such as X = {16}), or given several examples consistent with many similarly sized hypotheses - such as X = {16, 23, 19, 20}, where the top candidates are all very similar intervals: \"numbers between 16 and 23\", \"numbers between 15 and 24\", etc. - the size-based likelihood favors the smaller hypotheses only slightly, p(h | X) is spread out over many overlapping hypotheses, and generalization appears to follow a gradient of similarity. 
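The exponential sharpening just described is easy to see numerically. As a small sketch (the hypothesis sizes are taken from the number domain: 50 even numbers between 1 and 100, and, on the assumption that \"powers of two\" means 2 through 64, a hypothesis of size 6; the function name is mine):

```python
# Likelihood ratio under the size principle (Equation 1) between a small and a
# large hypothesis that are both consistent with n examples:
# p(X | h_small) / p(X | h_large) = (|h_large| / |h_small|) ** n.
def likelihood_ratio(n, small=6, large=50):
    return (large / small) ** n

for n in (1, 2, 4):
    print(n, round(likelihood_ratio(n), 1))  # 1 -> 8.3, 2 -> 69.4, 4 -> 4822.5
```

With one example, \"powers of two\" is favored over \"even numbers\" by less than an order of magnitude, so many hypotheses still contribute to the average in Equation 2 and generalization stays graded; after four examples the preference is nearly 5000 to 1, and averaging is dominated by the single most specific rule.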
That the Bayesian model predicts the right shape for the magnitude-based similarity gradients on Class III trials is no accident. The characteristic distance ε of the Bayesian generalization gradient varies with the uncertainty in p(h | X), which (for interval hypotheses) can be shown to covary with the intuitively relevant factor of average separation between neighboring examples. \n\nBayes vs. rules or similarity alone. It is instructive to consider two special cases of the Bayesian model that are equivalent to conventional similarity-based and rule-based algorithms from the concept learning literature. What I call the SIM algorithm was pioneered by [5] and also described in [2,3] as a Bayesian approach to learning concepts from both positive and negative evidence. SIM replaces the size-based likelihood with a binary likelihood that measures only whether a hypothesis is consistent with the examples: p(X | h) = 1 if ∀j, x(j) ∈ h, and 0 otherwise. Generalization under SIM is just a count of the features shared by y and all the examples in X, independent of the frequency of those features or the number of examples seen. As Figure 1c shows, SIM successfully models generalization from a single example (Class I) but fails to capture how generalization sharpens up after multiple examples, to either the most specific rule (Class II) or a magnitude-based similarity gradient with the appropriate characteristic distance ε (Class III). What I call the MIN algorithm preserves the size principle but replaces the step of hypothesis averaging with maximization: p(y ∈ C | X) = 1 if y ∈ argmax_h p(X | h), and 0 otherwise. MIN is perhaps the oldest algorithm for concept learning [3] and, as a maximum likelihood algorithm, is asymptotically equivalent to Bayes. Its success for finite amounts of data depends on how peaked p(h | X) is (Figure 1d). 
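Relative to the full model, each special case is essentially a one-line change. A sketch over an arbitrary collection of hypothesis sets (the uniform weighting of hypotheses and the normalization of SIM's feature count are my simplifications, not the paper's exact formulation):

```python
def sim_generalize(y, X, hyps):
    # SIM: binary consistency likelihood -- every hypothesis containing all of X
    # counts equally, so the score is just a (normalized) count of the features
    # shared by y and the examples, independent of hypothesis size or of n.
    consistent = [h for h in hyps if all(x in h for x in X)]
    return sum(y in h for h in consistent) / len(consistent)

def min_generalize(y, X, hyps):
    # MIN: keep the size principle but replace hypothesis averaging with
    # maximization -- generalize only within the smallest consistent
    # (i.e., maximum-likelihood) hypothesis.
    smallest = min((h for h in hyps if all(x in h for x in X)), key=len)
    return 1.0 if y in smallest else 0.0
```

For example, with hyps holding the powers of two, the even numbers, and all of 1-100, and X = [16, 8, 2, 64], MIN answers 1.0 for y = 4 and 0.0 for y = 10, while SIM answers 1.0 for y = 4 but 2/3 for y = 10 and never sharpens, however many examples arrive.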
MIN always selects the most specific consistent rule, which is reasonable when that hypothesis is much more probable than any other (Class II), but too conservative in other cases (Classes I and III). In quantitative terms, the predictions of Bayes correlate much more highly with the observed data (R² = 0.91) than do the predictions of either SIM (R² = 0.74) or MIN (R² = 0.47). In sum, only the full Bayesian framework can explain the full range of rule-like and similarity-like generalization patterns observed on this task. \n\n4 Discussion \n\nExperiments in two other domains provide further support for Bayes as a unifying framework for concept learning. In the context of multidimensional continuous feature spaces, similarity gradients are the default mode of generalization [5]. Bayes successfully models how the shape of those gradients depends on the distribution and number of examples; SIM and MIN do not [12]. Bayes also successfully predicts how fast these similarity gradients converge to the most specific consistent rule. Convergence is quite slow in this domain (n ≈ 50) because the hypothesis space consists of densely overlapping subsets - axis-parallel rectangles - much like the interval hypotheses in the Class III number tasks. \n\nAnother experiment engaged a word-learning task, using photographs of real objects as stimuli and a cover story of learning a new language [11]. On each trial, subjects saw either one example of a novel word (e.g., a toy animal labeled with \"Here is a blicket.\"), or three examples at one of three different levels of specificity: subordinate (e.g., 3 dalmatians labeled with \"Here are three blickets.\"), basic (e.g., 3 dogs), or superordinate (e.g., 3 animals). 
They were then asked to pick out the other instances of that concept from a set of 24 test objects, containing matches to the example(s) at all levels (e.g., other dalmatians, dogs, animals) as well as many non-matching objects. Figure 2 shows data and predictions for all three models. Similarity-like generalization given one example rapidly converged to the most specific rule after only three examples were observed, just as in the number task (Classes I and II) but in contrast to the axis-parallel rectangle task or the Class III number tasks, where similarity-like responding was still the norm after three or four examples. For modeling purposes, a hypothesis space was constructed from a hierarchical clustering of subjects' similarity judgments (augmented by an a priori preference for basic-level concepts) [11]. The Bayesian model successfully predicts rapid convergence from a similarity gradient to the minimal rule, because the smallest hypothesis consistent with each example set is significantly smaller than the next-best competitor (e.g., \"dogs\" is significantly smaller than \"dogs and cats\", just as with \"multiples of ten\" vs. \"multiples of five\"). Bayes fits the full data extremely well (R² = 0.98); by comparison, SIM (R² = 0.83) successfully accounts for only the n = 1 trials and MIN (R² = 0.76), the n = 3 trials. \n\nIn conclusion, a Bayesian framework is able to account for both rule- and similarity-like modes of generalization, as well as the dynamics of transitions between these modes, across several quite different domains of concept learning. The key features of the Bayesian model are hypothesis averaging and the size principle. The former allows either rule-like or similarity-like behavior depending on the uncertainty in the posterior probability. 
The latter determines this uncertainty as a function of the number and distribution of examples and the structure of the learner's hypothesis space. With sparsely overlapping hypotheses - i.e., when the most specific hypothesis consistent with the examples is much smaller than its nearest competitors - convergence to a single rule occurs rapidly, after just a few examples. With densely overlapping hypotheses - i.e., many consistent hypotheses of comparable size - convergence to a single rule occurs much more slowly, and a gradient of similarity is the norm after just a few examples. Importantly, the Bayesian framework does not so much obviate the distinction between rules and similarity as explain why it might be useful in understanding the brain. As Figures 1 and 2 show, special cases of Bayes corresponding to the SIM and MIN algorithms consistently account for distinct and complementary regimes of generalization. SIM, without the size principle, works best given only one example or densely overlapping hypotheses, when Equation 1 does not generate large differences in likelihood. MIN, without hypothesis averaging, works best given many examples or sparsely overlapping hypotheses, when the most specific hypothesis dominates the sum over H in Equation 2. In light of recent brain-imaging studies dissociating rule- and exemplar-based processing [8], the Bayesian theory may best be thought of as a computational-level account of concept learning, with multiple subprocesses - perhaps subserving SIM and MIN - implemented in distinct neural circuits. I hope to explore this possibility in future work. \n\nReferences \n\n[1] M. Erickson & J. Kruschke (1998). Rules and exemplars in category learning. JEP: General 127, 107-140. \n[2] D. Haussler, M. Kearns, & R. Schapire (1994). Bounds on the sample complexity of Bayesian learning using information theory and the VC-dimension. 
Machine Learning 14, 83-113. \n[3] T. Mitchell (1997). Machine Learning. McGraw-Hill. \n[4] R. Nosofsky & T. Palmeri (1998). A rule-plus-exception model for classifying objects in continuous-dimension spaces. Psychonomic Bull. & Rev. 5, 345-369. \n[5] R. Shepard (1987). Towards a universal law of generalization for psychological science. Science 237, 1317-1323. \n[6] R. Shepard & P. Arabie (1979). Additive clustering: Representation of similarities as combinations of discrete overlapping properties. Psych. Rev. 86, 87-123. \n[7] S. Sloman & L. Rips (1998). Similarity and Symbols in Human Thinking. MIT Press. \n[8] E. Smith, A. Patalano & J. Jonides (1998). Alternative strategies of categorization. In [7]. \n[9] E. Smith & S. Sloman (1994). Similarity- vs. rule-based categorization. Mem. & Cog. 22, 377. \n[10] J. Tenenbaum (1996). Learning the structure of similarity. NIPS 8. \n[11] J. Tenenbaum (1999). A Bayesian Framework for Concept Learning. Ph.D. Thesis, MIT. \n[12] J. Tenenbaum (1999). Bayesian modeling of human concept learning. NIPS 11. \n\n[Figure 1: Data and model predictions for the number concept task. Panels: (a) average generalization judgments, (b) Bayesian model, (c) pure similarity model (SIM), (d) pure rule model (MIN). Bar charts not reproduced in this extraction.] \n\n[Figure 2: Data and model predictions for the word learning task. Panels: (a) average generalization judgments, (b) Bayesian model, (c) pure similarity model (SIM), (d) pure rule model (MIN). Bar charts not reproduced in this extraction.] \n", "award": [], "sourceid": 1666, "authors": [{"given_name": "Joshua", "family_name": "Tenenbaum", "institution": null}]}