{"title": "A Nonparametric Bayesian Method for Inferring Features From Similarity Judgments", "book": "Advances in Neural Information Processing Systems", "page_first": 1033, "page_last": 1040, "abstract": null, "full_text": "A Nonparametric Bayesian Method for Inferring Features From Similarity Judgments\nDaniel J. Navarro School of Psychology University of Adelaide Adelaide, SA 5005, Australia daniel.navarro@adelaide.edu.au Thomas L. Griffiths Department of Psychology UC Berkeley Berkeley, CA 94720, USA tom_griffiths@berkeley.edu\n\nAbstract\nThe additive clustering model is widely used to infer the features of a set of stimuli from their similarities, on the assumption that similarity is a weighted linear function of common features. This paper develops a fully Bayesian formulation of the additive clustering model, using methods from nonparametric Bayesian statistics to allow the number of features to vary. We use this to explore several approaches to parameter estimation, showing that the nonparametric Bayesian approach provides a straightforward way to obtain estimates of both the number of features used in producing similarity judgments and their importance.\n\n1\n\nIntroduction\n\nOne of the central problems in cognitive science is determining the mental representations that underlie human inferences. A variety of solutions to this problem are based on the analysis of similarity judgments. By defining a probabilistic model that accounts for the similarity between stimuli based on their representation, statistical methods can be used to infer underlying representations from human similarity judgments. The particular methods used to infer representations from similarity judgments depend on the nature of the underlying representations. For stimuli that are assumed to be represented as points in some psychological space, multidimensional scaling algorithms [1] can be used to translate similarity judgments into stimulus locations.
For stimuli that are assumed to be represented in terms of a set of latent features, additive clustering is the method of choice. The original formulation of the additive clustering (ADCLUS) problem [2] is as follows. Assume that we have data in the form of an n × n similarity matrix S = [s_ij], where s_ij is the judged similarity between the ith and jth of n objects. Similarities are assumed to be symmetric (with s_ij = s_ji) and non-negative, often constrained to lie on the interval [0, 1]. These empirical similarities are assumed to be well-approximated by a weighted linear function of common features. Under these assumptions, a representation that uses m features to describe n objects is given by an n × m matrix F = [f_ik], where f_ik = 1 if the ith object possesses the kth feature, and f_ik = 0 if it does not. Each feature has an associated non-negative saliency, collected in the weight vector w = (w_1, . . . , w_m). When written in matrix form, the ADCLUS model seeks to uncover a feature matrix F and a weight vector w such that S ≈ FWF^T, where W = diag(w) is a diagonal matrix with nonzero elements corresponding to the saliency weights. In most applications it is assumed that there is a fixed \"additive constant\", a required feature possessed by all objects.\n\n2\n\nA Nonparametric Bayesian ADCLUS Model\n\nTo formalize additive clustering as a statistical model, it is standard practice to assume that error terms are i.i.d. Gaussian [3], yielding the model: S = FWF^T + E, (1)\n\n\f\nFigure 1: Graphical model representation of the IBP-ADCLUS model. Panel (a) shows the hierarchical structure of the ADCLUS model, and panel (b) illustrates the method by which a feature matrix is generated using the Indian Buffet Process. where E = [ε_ij] is an n × n matrix with entries drawn from a Gaussian(0, σ²) distribution.
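The similarity model in Equation 1 is easy to sketch numerically. The following is a minimal illustration of that equation, not the authors' code; the 4-object, 2-feature matrix and weights are made-up values for demonstration:

```python
import numpy as np

def adclus_similarity(F, w, sigma=0.0, rng=None):
    """Model similarities S = F W F^T + E, with W = diag(w) (Equation 1).

    F is an n x m binary feature matrix, w holds the non-negative saliencies.
    When sigma > 0, symmetric Gaussian noise is added to play the role of E.
    """
    S = F @ np.diag(w) @ F.T
    if sigma > 0:
        rng = rng or np.random.default_rng(0)
        E = rng.normal(0.0, sigma, size=S.shape)
        S = S + (E + E.T) / 2  # keep the noisy matrix symmetric
    return S

# Hypothetical example: 4 objects, 2 features.
F = np.array([[1, 0],
              [1, 1],
              [0, 1],
              [0, 1]])
w = np.array([0.6, 0.3])
S = adclus_similarity(F, w)
```

Objects 1 and 2 share only the first feature, so the model predicts s_12 = 0.6; objects 2 and 3 share only the second, so s_23 = 0.3.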
Equation 1 reveals that the additive clustering model is structurally similar to the better-known factor analysis model [4], although there are several differences: most notably the constraints that F is binary valued, W is necessarily diagonal and S is non-negative. In any case, if we define η_ij = Σ_k w_k f_ik f_jk to be the similarity predicted by a particular choice of F and w, then: s_ij | F, w, σ ~ Normal(η_ij, σ²), (2) where σ² is the variance of the Gaussian error distribution. However, self-similarities s_ii are not modeled in additive clustering, and are generally fixed to (the same) arbitrary values for both the model and data. It is typical to treat σ² as a fixed parameter [5], and while this could perhaps be improved upon, we leave this open for future research. In our approach, additive clustering is framed as a form of nonparametric Bayesian inference, in which Equation 2 provides the likelihood function, and the model is completed by placing priors over the weights w and the feature matrix F. We assume a fixed Gamma prior over feature saliencies, though it is straightforward to extend this to other, more flexible, priors. Setting a prior over binary feature matrices F is more difficult, since there is generally no good reason to assume an upper bound on the number of features that might be relevant to a particular similarity matrix. For this reason we use the \"nonparametric\" Indian Buffet Process (IBP) [6], which provides a proper prior distribution over binary matrices with a fixed number of rows and an unbounded number of columns. The IBP can be understood by imagining an Indian buffet containing an infinite number of dishes. Each customer entering the restaurant samples a number of dishes from the buffet, with a preference for those dishes that other diners have tried.
For the kth dish sampled by at least one of the first n - 1 customers, the probability that the nth customer will also try that dish is p(f_nk = 1 | F_{n-1}) = n_k / n, (3) where F_{n-1} records the choices of the previous customers, and n_k denotes the number of previous customers that have sampled that dish. Being adventurous, the new customer may try some hitherto untasted meals from the infinite buffet on offer. The number of new dishes taken by customer n follows a Poisson(α/n) distribution. The complete IBP-ADCLUS model becomes, s_ij | F, w, σ ~ Normal(η_ij, σ²), w_k | λ_1, λ_2 ~ Gamma(λ_1, λ_2), F | α ~ IBP(α). (4)\n\nThe structure of this model is illustrated graphically in Figure 1(a), and an illustration of the IBP prior is shown in Figure 1(b).\n\n3\n\nA Gibbs-Metropolis Sampling Scheme\n\nAs a Bayesian formulation of additive clustering, statistical inference in Equation 4 is based on the posterior distribution over feature matrices and saliency vectors, p(F, w | S). Naturally, the ideal approach is to calculate posterior quantities using exact methods. Unfortunately, this is generally quite difficult, so a natural alternative is to use Markov chain Monte Carlo (MCMC) methods to repeatedly sample from the posterior distribution: estimates of posterior quantities can be made using these samples as proxies for the full distribution. We construct a simple MCMC scheme for the Bayesian ADCLUS model using a combination of Gibbs sampling [7] and more general Metropolis proposals [8]. Saliency Weights. We use a Metropolis scheme to resample the saliency weights. If the current saliency is w_k, a candidate w_k' is first generated from a Gaussian(w_k, 0.05) distribution. The value of w_k is then reassigned using the Metropolis update rule. If w_{-k} denotes the set of all saliencies except w_k, this rule sets w_k to w_k' with probability a, and keeps w_k with probability 1 - a, where a = min(1, [p(S | F, w_{-k}, w_k') p(w_k' | λ)] / [p(S | F, w_{-k}, w_k) p(w_k | λ)]). (5) With a Gamma prior, the Metropolis sampler automatically rejects all negative valued w_k'.\n\n\"Pre-Existing\" Features. For features currently possessed by at least one object, assignments are updated using a standard Gibbs sampler: the value of f_ik is drawn from the conditional posterior distribution over f_ik | S, F_{-ik}, w. Since feature assignments are discrete, it is easy to find this conditional probability by noting that p(f_ik | S, F_{-ik}, w) ∝ p(S | F, w) p(f_ik | F_{-ik}), (6) where F_{-ik} denotes the set of all feature assignments except f_ik. The first term in this expression is just the likelihood function for the ADCLUS model, and is simple to calculate. Moreover, since feature assignments in the IBP are exchangeable, we can treat the kth assignment as if it were the last. Given this, Equation 3 indicates that p(f_ik = 1 | F_{-ik}) = n_{-ik}/n, where n_{-ik} counts the number of stimuli (besides the ith) that currently possess the kth feature. The Gibbs sampler deletes all single-stimulus features with probability 1, since n_{-ik} will be zero for one of the stimuli. \"New\" Features. Since the IBP describes a prior over infinite feature matrices, the resampling procedure needs to accommodate the remaining (infinite) set of features that are not currently represented among the manifest features F. When resampling feature assignments, some finite number of those currently-latent features will become manifest. When sampling from the conditional prior over feature assignments for the ith stimulus, we hold the feature assignments fixed for all other stimuli, so this is equivalent to sampling some number of \"singleton\" features (i.e., features possessed only by stimulus i) from the conditional prior, which is Poisson(α/n) as noted previously. When working with this algorithm, we typically run several chains. For each chain, we initialize the Gibbs-Metropolis sampler more or less arbitrarily.
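The Metropolis step for a single saliency weight can be sketched as follows. This is an illustrative sketch only, not the authors' implementation: the Gaussian likelihood over the off-diagonal entries follows Equation 2, while the Gamma(2, 1) prior and sigma = 0.1 are placeholder choices.

```python
import math
import numpy as np

def log_lik(S, F, w, sigma=0.1):
    """Gaussian log-likelihood over off-diagonal similarities (Equation 2)."""
    eta = F @ np.diag(w) @ F.T
    iu = np.triu_indices_from(S, k=1)  # self-similarities are not modeled
    resid = S[iu] - eta[iu]
    return -0.5 * np.sum(resid ** 2) / sigma ** 2

def log_gamma_prior(x, shape=2.0, rate=1.0):
    """Log density of a Gamma(shape, rate) prior on a saliency weight."""
    return (shape * math.log(rate) - math.lgamma(shape)
            + (shape - 1.0) * math.log(x) - rate * x)

def metropolis_weight_step(k, w, F, S, rng, step=0.05):
    """One Metropolis update of w[k]; negative proposals are auto-rejected."""
    w_new = w.copy()
    w_new[k] = rng.normal(w[k], step)
    if w_new[k] <= 0:
        return w  # the Gamma prior places zero density on w_k <= 0
    log_a = (log_lik(S, F, w_new) + log_gamma_prior(w_new[k])
             - log_lik(S, F, w) - log_gamma_prior(w[k]))
    return w_new if np.log(rng.random()) < min(0.0, log_a) else w
```

Cycling this step over k, interleaved with the Gibbs updates of F, gives one sweep of the sampler.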
After a \"burn-in\" period is allowed for the sampler to converge to a sensible location (i.e., for the state to represent a sample from the posterior), we make a \"draw\" by recording the state of the sampler, leaving a \"lag\" of several iterations between successive draws to reduce the autocorrelation between samples. When doing so, it is important to ensure that the Markov chains converge on the target distribution p(F, w | S). We did so by inspecting the time series plot formed by graphing the log posterior probability of successive samples. To illustrate this, one of the chains used in our simulations (see Section 5) is displayed in Figure 2, with nine parallel chains used for comparison: the time series plot shows no long-term trends, and the different chains are visually indistinguishable from one another. Although elaborations and refinements are possible for both the sampler [9] and the convergence check [10], we have found this approach to be reasonably effective for the moderate-sized problems considered in our applications.\n\n4\n\nFour Estimators for the ADCLUS Model\n\nSince the introduction of the additive clustering model, a range of algorithms have been used to infer features, including \"subset selection\" [2], expectation maximization [3], continuous approximations [11] and stochastic hillclimbing [5] among others. A review, as well as an effective combinatorial search algorithm, is given in [12]. Curiously, while the plethora of algorithms available for extracting estimates of F and w has been discussed in the literature, the variety in the choice of estimator has been largely overlooked, to our knowledge. One advantage of the IBP-ADCLUS approach is that it allows us to discuss a range of different estimators within a single framework. We will explore estimators based on computing the posterior distribution over F and w given S.
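Given a set of lagged posterior draws collected as above, several such estimators reduce to simple summaries of the sample set. A minimal sketch (illustrative only; the helper name and data layout are ours, not from the paper):

```python
import numpy as np

def summarize_posterior(samples, log_posts):
    """Summarize MCMC draws.

    samples   : list of (F, w) pairs, F an n x m binary array per draw
    log_posts : log posterior probability of each draw

    Returns (i) the highest-posterior sampled state, a sample-based proxy
    for a joint MAP estimate, and (ii) the empirical marginal probability
    that each distinct feature (a column of F) is manifest.
    """
    i_best = int(np.argmax(log_posts))
    F_map, w_map = samples[i_best]
    counts = {}
    for F, _ in samples:
        # each distinct column counts at most once per draw
        for col in {tuple(int(x) for x in c) for c in F.T}:
            counts[col] = counts.get(col, 0) + 1
    r_hat = {feat: c / len(samples) for feat, c in counts.items()}
    return (F_map, w_map), r_hat
```

The dictionary r_hat records, for every feature that appears anywhere in the sample, the fraction of draws in which it is manifest.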
This includes estimators based on maximum a posteriori (MAP) estimation, corresponding to the value of a variable with highest posterior probability, and estimators based on taking expectations over the posterior distribution.\n\n\f\nFigure 2: Smoothed time series showing log-posterior probabilities for successive draws from the Gibbs-Metropolis sampler, for simulated similarity data with n = 16. The bold line shows a single chain, while the dotted lines show the remaining nine chains. Conditional MAP Estimation. Much of the literature defines an estimator conditional on the assumption that the number of features in the model m is fixed [3][11][12]. These approaches seek to estimate the values of F and w that jointly maximize some utility function conditional on this known m. If we take the posterior probability to be our measure of utility, the estimators become, F̂_1, ŵ_1 = arg max_{F,w} p(F, w | S, m). (7) Estimating the dimension is harder. The natural (MAP) estimate for m is easy to state: m̂_1 = arg max_m p(m | S) = arg max_m Σ_{F ∈ F_m} ∫ p(F, w | S) dw, (8) where F_m denotes the set of feature matrices containing m unique features. In practice, given the difficulty of working with Equation 8, it is typical to fix m on the basis of intuition, or via some heuristic method. MAP Feature Estimation. In the previous approach, m is given primacy, since F and w cannot be estimated until it is known. No distinction is made between F and w. In many practical situations [13], this does not reflect the priorities of the researcher. Often the feature matrix F is the psychologically relevant variable, with w and m being nuisance parameters. In such cases, it is natural to marginalize over w when estimating F, and let the estimated feature matrix itself determine m. That is, we first select F̂_2 = arg max_F p(F | S) = arg max_F ∫ p(F, w | S) dw. (9) Notice that F̂_2 provides an implicit estimate m̂_2, which may differ from m̂_1. The saliencies are estimated after F̂_2 is chosen, via conditional MAP estimation: ŵ_2 = arg max_w p(w | F̂_2, S). (10) This approach is typical of existing (parametric) Bayesian approaches to additive clustering [5][14], where analytic approximations to p(F | S) are used for expediency. Joint MAP Estimation. Both approaches discussed so far require some aspects of the model to be estimated before others. While the rationales for this constraint differ, both approaches seem sensible. Another approach, not as common in the literature, is to jointly estimate F and w without conditioning on m, yielding the MAP estimators, F̂_3, ŵ_3 = arg max_{F,w} p(F, w | S). (11) Early papers [2] recognized that this approach can be prone to overfitting, and thus requires that the prior place some emphasis on parsimony. However, many theoretically-motivated priors (including the IBP) allow the researcher to emphasize parsimony, and some frequentist methods used in ADCLUS-like models apply penalty functions for this reason [15].\n\n\f\nFigure 3: Posterior distributions (a) over the number of features p(m | S_o) in simulations containing m_t = 6, 8 and 10 features respectively. Variance accounted for (b) by the four similarity estimators Ŝ, where the target is either the observed training data S_o, a new test data set S_n, or the true similarity matrix S_t. The variance accounted for (in percent) in panel (b) is:\nn = 8 (S_o, S_n, S_t): Ŝ_1 = 79, 78, 87; Ŝ_2 = 81, 81, 88; Ŝ_3 = 79, 78, 87; Ŝ_4 = 84, 84, 92.\nn = 16 (S_o, S_n, S_t): Ŝ_1 = 89, 90, 96; Ŝ_2 = 88, 88, 95; Ŝ_3 = 89, 90, 96; Ŝ_4 = 90, 90, 97.\nn = 32 (S_o, S_n, S_t): Ŝ_1 = 91, 91, 100; Ŝ_2 = 91, 91, 100; Ŝ_3 = 91, 91, 100; Ŝ_4 = 91, 91, 100.\nApproximate Expectations.
A fourth approach aims to summarize the posterior distribution by looking at the marginal posterior probabilities associated with particular features. The probability that a particular feature f_k belongs in the representation is given by: p(f_k | S) = Σ_{F : f_k ∈ F} p(F | S). (12)\n\nAlthough this approach has never been applied in the ADCLUS literature, the concept is implicit in more general discussions of mental representation [16] that ask whether or not a specific predicate is likely to be represented. Letting r̂_k = p(f_k | S) denote the posterior probability that feature f_k is manifest, we can construct a vector r̂ = [r̂_k] that contains these probabilities for all 2^n possible features. Although this vector discards the covariation between features across the posterior distribution, it is useful both theoretically (for testing hypotheses about specific features) and pragmatically, since the expected posterior similarities can be written as follows: E[s_ij | S] = Σ_k f_ik f_jk r̂_k ŵ_k, (13)\n\nwhere ŵ_k = E[w_k | f_k, S] denotes the expected saliency for feature f_k on those occasions when it is represented (Equation 13 relies on the fact that features combine linearly in the ADCLUS model, and is straightforward to derive). In practice, it is impossible to look at all 2^n features, so one would typically report only those features for which r̂_k is large. Since these tend to be the features that make the largest contributions to E[s_ij | S], there is a sense in which this approach approximates the expected posterior similarities.\n\n5\n\nRecovering Noisy Feature Matrices\n\nBy using the IBP-ADCLUS framework, we can compare the performance of the four estimators in a reasonable fashion. Loosely following [12], we generated noisy similarity matrices with n = 8, 16 and 32 stimuli, based on \"true\" feature matrices F_t in which m_t = 2 log_2(n), where each object possessed each feature with probability 0.5.
Saliency weights w_t were generated uniformly from the interval [1, 3], but were subsequently rescaled to ensure that the \"true\" similarities S_t had variance 1. Two sets of Gaussian noise were injected into the similarities with fixed σ = 0.3, ensuring that the noise accounted for approximately 10% of the variance in the \"observed\" data matrix S_o and the \"new\" matrix S_n. We fixed α = 2 for all simulations: since the number of manifest features in an IBP model follows a Poisson(αH_n) distribution (where H_n is the nth harmonic number) [6], the prior has a strong bias toward parsimony. The prior expected number of features is approximately 5.4, 6.8 and 8.1 (as compared to the true values of 6, 8 and 10). We approximated the posterior distribution p(F, w | S_o) by drawing samples in the following manner. For a given similarity matrix, 10 Gibbs-Metropolis chains were run from different start points, and 1000 samples were drawn from each. The chains were burnt in for 1000 iterations, and a lag of 10 iterations was used between successive samples. Visual inspection suggested that five chains in the n = 32 condition did not converge: log-posteriors were low, differed substantially from one\n\n\f\nFigure 4: Posterior distributions over the number of features when the Bayesian ADCLUS model is applied to (a) the numbers data, (b) the countries data and (c) the letters data. Table 1: Two representations of the numbers data. (a) The representation reported in [3], extracted using an EM algorithm with the number of features fixed at eight. (b) The 10 most probable features extracted using the Bayesian ADCLUS model. The first column gives the posterior probability that a particular feature belongs in the representation.
The second column displays the average saliency of a feature in the event that it is included.\n\n(a) Features and weights from [3]: 2 4 8 (0.444); 0 1 2 (0.345); 3 6 9 (0.331); 6 7 8 9 (0.291); 2 3 4 5 6 (0.255); 1 3 5 7 9 (0.216); 1 2 3 4 (0.214); 4 5 6 7 8 (0.172); additive constant (0.148).\n\n(b) Features from the Bayesian ADCLUS model (probability, weight): 3 6 9 (0.79, 0.326); 2 4 8 (0.70, 0.385); 0 1 2 (0.69, 0.266); 2 3 4 5 6 (0.59, 0.240); 6 7 8 9 (0.57, 0.262); 0 1 2 3 4 (0.42, 0.173); 2 4 6 8 (0.41, 0.387); 1 3 5 7 9 (0.40, 0.223); 4 5 6 7 8 (0.34, 0.181); 7 8 9 (0.26, 0.293); additive constant (1.00, 0.075).\n\nanother, and had noticeable positive slope. In this case, the estimators were constructed from the five remaining chains. Figure 3(a) shows the posterior distributions over the number of features m for each of the three simulation conditions. There is a tendency to underestimate the number of features when provided with small similarity matrices, with the modal numbers being 3, 7 and 10. However, since the posterior estimate of m is below the prior estimate when n = 8, it seems this effect is data-driven, as 79% of the variance in the data matrix S_o can be accounted for using only three features. Since each approach allows the construction of an estimated similarity matrix Ŝ, a natural comparison is to look at the proportion of variance this estimate accounts for in the observed data S_o, the novel data set S_n, and the true matrix S_t. In view of the noise model used to construct these matrices, the \"ideal\" answer for these three should be around 90%, 90% and 100% respectively. When n = 32, this profile is observed for all four estimators, suggesting that in this case all four estimators have converged appropriately. For the smaller matrices, the conditional MAP and joint MAP estimators (Ŝ_1 and Ŝ_3) agree closely. The MAP feature approach Ŝ_2 appears to perform slightly better, though the difference is very small. The expectation method Ŝ_4 provides the best estimate.\n\n6\n\nModeling Empirical Similarities\n\nWe now turn to the analysis of empirical data.
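The variance-accounted-for comparisons used here can be computed as follows. This is our own sketch of the standard calculation, restricted to off-diagonal entries since self-similarities are not modeled:

```python
import numpy as np

def variance_accounted_for(S_hat, S_target):
    """Proportion of variance in the off-diagonal similarities of S_target
    that is explained by the model estimate S_hat."""
    iu = np.triu_indices_from(S_target, k=1)  # upper triangle, no diagonal
    resid = S_target[iu] - S_hat[iu]
    total = S_target[iu] - S_target[iu].mean()
    return 1.0 - (resid @ resid) / (total @ total)
```

Evaluating an estimate Ŝ against S_o, S_n and S_t in turn gives the three figures reported for each condition.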
Since space constraints preclude detailed reporting of all four estimators with respect to all data sets, we limit the discussion to the most novel IBP-ADCLUS estimators, namely the direct estimates of dimensionality provided through Equation 8, and the features extracted via \"approximate expectation\". Featural representations of numbers. A standard data set used in evaluating additive clustering models measures the conceptual similarity of the numbers 0 through 9 [17]. This data set is often used as a benchmark due to the complex interrelationships between the numbers. Table 1(a) shows an eight-feature representation of these data, taken from [3] who applied a maximum likelihood approach. This representation explains 90.9% of the variance, with features corresponding to arithmetic concepts and to numerical magnitude.\n\n\f\nTable 2: Featural representation of the similarity between 16 countries. The table shows the eight highest-probability features extracted by the Bayesian ADCLUS model. Each column corresponds to a single feature, with the associated probabilities and saliencies shown below. The average weight associated with the additive constant is 0.035. The recovered features (probability, weight) are: Italy, Germany, Spain (1.00, 0.593); Vietnam, China, Japan (1.00, 0.421); Germany, Russia, USA, China, Japan (0.99, 0.267); Zimbabwe, Nigeria, Cuba, Jamaica (0.62, 0.467); Zimbabwe, Nigeria, Philippines, Indonesia, Iraq, Libya (0.52, 0.209); Iraq, Libya (0.36, 0.373); Zimbabwe, Nigeria, Iraq (0.33, 0.299); Philippines, Indonesia, Libya (0.25, 0.311).\n\nTable 3: Featural representation of the perceptual similarity between 26 capital letters. The table shows the ten highest-probability features extracted by the Bayesian ADCLUS model. Each column corresponds to a single feature, with the associated probabilities and saliencies shown below. The average weight associated with the additive constant is 0.003. The recovered features (probability, weight) are: M, N, W (1.00, 0.686); I, L, T (0.99, 0.341); C, G (0.99, 0.623); D, O, Q (0.99, 0.321); P, R (0.99, 0.465); E, F (0.99, 0.653); E, H (0.99, 0.322); K, X (0.99, 0.427); B, G, R (0.98, 0.226); C, J, U (0.92, 0.225).\n\nFixing σ = 0.05 and α = 0.5, we drew 10,000 lagged samples to construct estimates. Although the posterior probability is spread over a large number of feature matrices, 92.6% of sampled matrices had between 9 and 13 features. The modal number of represented features was m̂_1 = 11, with 27.2% of the posterior mass. The posterior distribution over the number of features is shown in Figure 4(a). Since none of the existing literature has used the \"approximate expectation\" approach to find highly probable features, it is useful to note the strong similarities between Table 1(a) and Table 1(b), which reports the ten highest-probability features across the entire posterior distribution. Applying this approach to obtain an estimate of the posterior predictive similarities Ŝ_4 revealed that this matrix accounts for 97.4% of the variance in the data. Featural representations of countries. A second application is to human forced-choice judgments of the similarities between 16 countries [18]. In this task, participants were shown lists of four countries and asked to pick out the two countries most similar to each other. Applying the Bayesian model to these data with σ = 0.1 reveals that only eight features appear in the representation more than 25% of the time. Given this, it is not surprising that the posterior distribution over the number of features, shown in Figure 4(b), indicates that the modal number of features is eight. The eight most probable features are listed in Table 2. The \"approximate expectation\" method explains 85.4% of the variance, as compared to the 78.1% found by a MAP feature approach [18]. The features are interpretable, corresponding to a range of geographical, historical, and economic regularities. Featural representations of letters.
As a third example, we analyzed a somewhat larger data set, consisting of kindergarten children's assessments of the perceptual similarity of the 26 capital letters [19]. In this case, we used σ = 0.05, and the Bayesian model accounted for 89.2% of the variance in the children's similarity judgments. The posterior distribution over the number of represented features is shown in Figure 4(c). Table 3 shows the ten features that appeared in more than 90% of samples from the posterior. The model recovers an extremely intuitive set of overlapping features. For example, it picks out the long strokes in I, L, and T, and the elliptical forms of D, O, and Q.\n\n7\n\nDiscussion\n\nLearning how similarity relations are represented is a difficult modeling problem. Additive clustering provides a framework for learning featural representations of stimulus similarity, but remains underused due to the difficulties associated with inference. By adopting a Bayesian approach to additive clustering, we are able to obtain a richer characterization of the structure behind human similarity judgments. Moreover, by using nonparametric Bayesian techniques to place a prior distribution over infinite binary feature matrices via the Indian Buffet Process, we can allow the data to determine the number of features that the algorithm recovers. This is theoretically important as well as pragmatically useful. As noted by [16], people are capable of recognizing that individual stimuli possess an arbitrarily large number of characteristics, but in any particular context will make judgments using only a finite, usually small number of properties that form part of our current mental representation. In other words, by moving to a Bayesian nonparametric form, we are able to bring the ADCLUS model closer to the kinds of assumptions that are made by psychological theories.\nAcknowledgements. TLG was supported by NSF grant number 0631518, and DJN by ARC grants DP-0451793 and DP-0773794.
We thank Nancy Briggs, Simon Dennis and Michael Lee for helpful comments on this work.\n\nReferences\n[1] W. S. Torgerson. Theory and Methods of Scaling. Wiley, New York, 1958. [2] R. N. Shepard and P. Arabie. Additive clustering: Representation of similarities as combinations of discrete overlapping properties. Psychological Review, 86:87-123, 1979. [3] J. B. Tenenbaum. Learning the structure of similarity. In D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo, editors, Advances in Neural Information Processing Systems, volume 8, pages 3-9. MIT Press, Cambridge, MA, 1996. [4] L. L. Thurstone. Multiple-Factor Analysis. University of Chicago Press, Chicago, 1947. [5] M. D. Lee. Generating additive clustering models with limited stochastic complexity. Journal of Classification, 19:69-85, 2002. [6] T. L. Griffiths and Z. Ghahramani. Infinite latent feature models and the Indian buffet process. Technical Report 2005-001, Gatsby Computational Neuroscience Unit, 2005. [7] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6:721-741, 1984. [8] N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller. Equation of state calculations by fast computing machines. Journal of Chemical Physics, 21:1087-1092, 1953. [9] M.-H. Chen, Q.-M. Shao, and J. G. Ibrahim. Monte Carlo Methods in Bayesian Computation. Springer, New York, 2000. [10] M. K. Cowles and B. P. Carlin. Markov chain Monte Carlo convergence diagnostics: A comparative review. Journal of the American Statistical Association, 91:883-904, 1996. [11] P. Arabie and J. Douglas Carroll. MAPCLUS: A mathematical programming approach to fitting the ADCLUS model. Psychometrika, 45:211-235, 1980. [12] W. Ruml. Constructing distributed representations using additive clustering. In Advances in Neural Information Processing Systems 14, Cambridge, MA, 2001. MIT Press. [13] M. D. Lee and D. J. Navarro. Extending the ALCOVE model of category learning to featural stimulus domains. Psychonomic Bulletin and Review, 9:43-58, 2002. [14] D. J. Navarro. Representing Stimulus Similarity. Ph.D. thesis, University of Adelaide, 2003. [15] L. E. Frank and W. J. Heiser. Feature selection in Feature Network Models: Finding predictive subsets of features with the Positive Lasso. British Journal of Mathematical and Statistical Psychology, in press. [16] D. L. Medin and A. Ortony. Psychological essentialism. In Similarity and Analogical Reasoning. Cambridge University Press, New York, 1989. [17] R. N. Shepard, D. W. Kilpatric, and J. P. Cunningham. The internal representation of numbers. Cognitive Psychology, 7:82-138, 1975. [18] D. J. Navarro and M. D. Lee. Commonalities and distinctions in featural stimulus representations. In Proceedings of the 24th Annual Conference of the Cognitive Science Society, pages 685-690, Mahwah, NJ, 2002. Lawrence Erlbaum. [19] E. Z. Rothkopf. A measure of stimulus similarity and errors in some paired-associate learning tasks. Journal of Experimental Psychology, 53:94-101, 1957.\n\n\f\n", "award": [], "sourceid": 3136, "authors": [{"given_name": "Daniel", "family_name": "Navarro", "institution": null}, {"given_name": "Thomas", "family_name": "Griffiths", "institution": null}]}