{"title": "Combining Dimensions and Features in Similarity-Based Representations", "book": "Advances in Neural Information Processing Systems", "page_first": 67, "page_last": 74, "abstract": "", "full_text": "Combining Dimensions and Features in Similarity-Based Representations\n\nDaniel J. Navarro\nDepartment of Psychology\nOhio State University\nnavarro.20@osu.edu\n\nMichael D. Lee\nDepartment of Psychology\nUniversity of Adelaide\nmichael.lee@psychology.adelaide.edu.au\n\nAbstract\n\nThis paper develops a new representational model of similarity data\nthat combines continuous dimensions with discrete features. An\nalgorithm capable of learning these representations is described, and\na Bayesian model selection approach for choosing the appropriate\nnumber of dimensions and features is developed. The approach is\ndemonstrated on a classic data set that considers the similarities\nbetween the numbers 0 through 9.\n\n1 Introduction\n\nA central problem for cognitive science is to understand the way people mentally\nrepresent stimuli. One widely used approach for deriving representations from data\nis to base them on measures of stimulus similarity (see Shepard 1974). Similarity\nis naturally understood as a measure of the degree to which the consequences of\none stimulus generalize to another, and may be measured using a number of\nexperimental methodologies, including rating scales, confusion probabilities, or grouping\nor sorting tasks. For a domain with n stimuli, similarity data take the form of an\nn \u00d7 n matrix, S = [sij], where sij is the similarity of the ith and jth stimuli. 
The\ngoal of similarity-based representation is then to find structured and interpretable\ndescriptions of the stimuli that capture the pattern of similarities.\n\nModeling the similarities between stimuli requires making assumptions about both\nthe representational structures used to describe stimuli, and the processes used to\nassess the similarities across these structures. The two best developed\nrepresentational approaches in cognitive modeling are the 'dimensional' and 'featural'\napproaches (Goldstone, 1999). In the dimensional approach, stimuli are represented by\ncontinuous values along a number of dimensions, so that each stimulus corresponds\nto a point in a multi-dimensional space, and the similarity between two stimuli is\nmeasured according to the distance between their representative points. In the\nfeatural approach, stimuli are represented in terms of the presence or absence of a set\nof discrete (usually binary) features or properties, and the similarity between two\nstimuli is measured according to their common and distinctive features.\n\nThe dimensional and featural approaches have different strengths and weaknesses.\nDimensional representations are constrained by the metric axioms, such as the\ntriangle inequality, that are violated by some empirical data. Featural representations\nare inefficient when representing inherently continuous aspects of the variation\nbetween stimuli. It has been argued that spatial representations are most appropriate\nfor low-level perceptual stimuli, whereas featural representations are better suited to\nhigh-level conceptual domains (e.g., Carroll 1976, Tenenbaum 1996, Tversky 1977).\nIn general, though, stimuli convey both perceptual and conceptual information. 
As\nCarroll (1976) concludes: \"Since what is going on inside the head is likely to be\ncomplex, and is equally likely to have both discrete and continuous aspects, I believe\nthe models we pursue must also be complex, and have both discrete and continuous\ncomponents\" (p. 462).\n\nThis paper develops a new model of similarity that combines dimensions with\nfeatures in the obvious way, allowing a stimulus to take continuous values on a number\nof dimensions, as well as potentially having a number of discrete features. We\ndescribe an algorithm capable of learning these representations from similarity data,\nand develop a Bayesian model selection approach for choosing the appropriate\nnumber of dimensions and features. Finally, we demonstrate the approach on a classic\ndata set that considers the similarities between the numbers 0 through 9.\n\n2 Dimensional, Featural and Combined Representations\n\n2.1 Dimensional Representation\n\nIn a dimensional representation, the ith stimulus is represented by a point pi =\n(pi1, . . . , piv) in a v-dimensional coordinate space. The dissimilarity between the\nith and jth stimuli is then usually modeled as the distance between their points\naccording to one of the family of Minkowskian metrics\n\n$$\\hat{d}_{ij} = \\left( \\sum_{k=1}^{v} |p_{ik} - p_{jk}|^{r} \\right)^{1/r} + c, \\tag{1}$$\n\nwhere c is a non-negative constant. Dimensional representations can be learned\nusing a variety of multidimensional scaling algorithms (e.g., Cox & Cox, 1994), which\nhave placed particular emphasis on the r = 1 (City-Block) and r = 2 (Euclidean)\ncases because of their relationship, respectively, to so-called 'separable' and\n'integral' stimulus dimensions (Garner 1974). 
Pairs of separable dimensions are those,\nlike shape and size, that can be attended to separately. Integral dimensions, in\ncontrast, are those rarer cases like hue and saturation that are not easily separated.\n\n2.2 Featural Representation\n\nIn a featural representation, the ith stimulus is represented by a vector of m\nbinary variables fi = (fi1, . . . , fim), where fik = 1 if the ith stimulus possesses the\nkth feature, and fik = 0 if it does not. Each feature is also usually associated\nwith a positive weight, wk, denoting its importance or salience. No constraints are\nplaced on the way features may be assigned to stimuli. Rather than requiring that\nfeatures partition stimuli, as in many clustering methods, or that features nest within\none another, as in many tree-fitting methods, the flexible nature of human mental\nrepresentation demands that features are allowed to overlap in arbitrary ways.\n\nAlthough a number of models have been proposed for measuring the similarity\nbetween featurally represented stimuli (Navarro & Lee, 2002), the most widely used\nis the Contrast Model (Tversky, 1977). The Contrast Model assumes that the similarity\nbetween two stimuli increases according to the weights of the (common) features\nthey share and decreases according to the weights of the (distinctive) features that one\nhas but the other does not, and that these common and distinctive sources of information\nare themselves weighted in arriving at a final similarity value. 
Particular emphasis\n(e.g., Shepard & Arabie, 1979; Tenenbaum, 1996) has been given to the special case\nof the Contrast Model where only common features are used, and feature weights\nare additive, so that the similarity of the ith and jth stimuli is given by\n\n$$\\hat{s}_{ij} = \\sum_{k=1}^{m} w_k f_{ik} f_{jk} + c. \\tag{2}$$\n\nAlthough learning common feature representations is a difficult combinatorial\noptimization problem, several successful additive clustering algorithms have been\ndeveloped (e.g., Lee, 2002; Ruml, 2001; Tenenbaum, 1996).\n\n2.3 Combined Representation\n\nThe obvious generalization of dimensional and featural approaches is to represent\nstimuli in terms of continuous values along a set of dimensions and the presence or\nabsence of a number of discrete features. If there are v dimensions and m features,\nthe ith stimulus is defined by a point pi, a feature vector fi, and the feature weights\nw = (w1, . . . , wm).\n\nWith this representational structure in place, we assume the similarity between\nthe ith and jth stimuli is then simply the sum of the similarity arising from their\ncommon features (Eq. 2), minus the dissimilarity arising from their dimensional\ndifferences (Eq. 1), as follows\n\n$$\\hat{s}_{ij} = \\left( \\sum_{k=1}^{m} w_k f_{ik} f_{jk} \\right) - \\left( \\sum_{k=1}^{v} |p_{ik} - p_{jk}|^{r} \\right)^{1/r} + c.$$\n\n3 Model Fitting and Selection\n\nProposing the combined representational approach immediately presents two\nchallenges. The first, the model fitting problem, is to develop a method for learning\nrepresentations that fit the similarity data well using a given number of dimensions\nand features. 
The second, the model selection problem, is to choose between alternative\ncombined representations of the same data that use different numbers of features\nand dimensions.\n\nFormally, we conceive of the representational model as specifying the number of\ndimensions and features and the nature of the distance metric, and being\nparameterized by the feature variables and weights, coordinate locations and the\nadditive constant. This means a particular representation is given by R_\u03b3(\u03b8) where\n\u03b3 = (v, m, r) and \u03b8 = (p1, . . . , pn, f1, . . . , fn, w, c).\n\nFollowing Tenenbaum (1996), we assume that the observed similarities come from\nindependent Gaussian distributions with means \u015dij and common standard deviation \u03c3. The\nvariance \u03c3^2 corresponds to the precision of the data which, for empirical similarity\ndata averaged across information sources (such as individual participants), is easily\nestimated (Lee 2001), and otherwise must be specified by assumption.\n\nUnder these assumptions, the likelihood of a similarity matrix given a particular\nrepresentation is\n\n$$p(S \\mid R_\\gamma, \\theta) = \\prod_{i < j} \\frac{1}{\\sigma\\sqrt{2\\pi}} \\exp\\left( -\\frac{1}{2\\sigma^2} (s_{ij} - \\hat{s}_{ij})^2 \\right) = \\frac{1}{\\left( \\sigma\\sqrt{2\\pi} \\right)^{n(n-1)/2}} \\exp\\left( -\\frac{1}{2\\sigma^2} \\sum_{i < j} (s_{ij} - \\hat{s}_{ij})^2 \\right),$$\n\ngiving the log-likelihood function\n\n$$\\ln p(S \\mid R_\\gamma, \\theta) = -\\frac{1}{2\\sigma^2} \\sum_{i < j} (s_{ij} - \\hat{s}_{ij})^2 - \\frac{n(n-1)}{2} \\ln\\left( \\sigma\\sqrt{2\\pi} \\right).$$\n\nWithin this framework, we solve the model fitting problem by finding the maximum\nlikelihood parameter values \u03b8*. Measures of data fit like maximum likelihood,\nhowever, are clearly not appropriate for choosing between representations with different\nnumbers of dimensions and features, because of differences in model complexity. 
For\nthis reason, we tackle the model selection problem using a Bayesian approach.\n\n3.1 Fitting Algorithm\n\nOur learning algorithm for the combined model relies on the observation\n(Tenenbaum, 1996) that it is relatively easy to find the maximum likelihood values of\nthe continuous parameters (the coordinate locations, feature weights, and additive\nconstant) given values for the discrete feature assignments.\n\nIf \u03b8 is partitioned into \u03b8_C = (p1, . . . , pn, w, c) and a fixed \u03b8_D = (f1, . . . , fn), then\nwe solve the optimization problem\n\n$$\\arg\\max_{\\theta_C} \\ln p(S \\mid R_\\gamma, \\theta_D, \\theta_C) \\quad \\text{where } w, c \\geq 0, \\tag{3}$$\n\nusing the Levenberg-Marquardt approach (More, 1977). Since distances are\npreserved under translation for the Minkowskian family of metrics, we assume without\nloss of generality that p1 is the origin.\n\nWith this optimization capability in place, our learning algorithm may be described\nby the following five stage process:\n\nStep 1: Choose a maximum number of dimensions vmax and features mmax. Start\nwith v = 1 and m = 1, making the lone feature the current feature to be optimized.\n\nStep 2: Find a starting (seed) value for the current feature by considering all\npossibilities that have exactly one pair of stimuli with the feature, choosing the possibility\nwith the best data-fit using Eq. 3.\n\nStep 3: Consider all possible representations arising from changing the assignment\nof one stimulus in relation to the current feature. If any of these changes improve\nthe fit of the representation as a whole, update the representation to be the one\nwith the best fit. Repeat this process until no change is found that improves the\nrepresentation. 
The current representation at this point is recorded as the\nbest-fitting representation with v dimensions and m features.\n\nStep 4: If there are fewer than mmax features, then add a new feature, make it the\ncurrent feature, and return to Step 2.\n\nStep 5: If there are fewer than vmax dimensions, then add a new dimension, reset\nthe number of features to m = 1, and again make the lone feature the current\nfeature to be optimized. Return to Step 2.\n\nThe output of this algorithm is a grid of vmax \u00d7 mmax representations, one for each\npossible combination of number of dimensions and number of features.\n\n3.2 Model Selection\n\nGiven representational models with different numbers of dimensions and features,\nthe Bayesian approach is to select the one with the maximum posterior probability\n\n$$p(R_\\gamma \\mid S) = \\frac{p(R_\\gamma)}{p(S)} \\int p(S \\mid R_\\gamma, \\theta) \\, p(\\theta \\mid R_\\gamma) \\, d\\theta.$$\n\nSince all models relate to the same similarity data, p(S) is a constant. If we assume\nthat all representations are a priori equally likely, the posterior becomes\n\n$$p(R_\\gamma \\mid S) \\propto \\sum_{\\theta_D} \\int p(S \\mid R_\\gamma, \\theta) \\, p(\\theta \\mid R_\\gamma) \\, d\\theta_C. \\tag{4}$$\n\nThis Bayesian approach embodies an automatic form of Ockham's Razor,\nbalancing data-fit against model complexity, because it considers the model at all of its\nparameterizations. Complicated models that use many parameters (i.e., have high\nparametric complexity), or parameters that interact in complicated ways (i.e., have\nhigh functional form complexity) to achieve good levels of data-fit at their optimal\nvalues will typically fit data poorly at other parameter values, and so will have\nsmaller posteriors.\n\nFor the combined model, the posterior in Eq. 4 is not well approximated by simple\nmeasures such as the Bayesian Information Criterion (BIC: Schwarz, 1978) that have\npreviously been applied to dimensional and featural representations (Lee & Navarro,\n2002). 
This is because the BIC measures only parametric complexity, and treats\neach additional parameter as having an equal effect on model complexity. Binary\nfeature membership parameters and continuous coordinate location parameters,\nhowever, will clearly have different effects on model complexity. In addition,\nbecause the BIC does not measure functional form complexity, it is not sensitive to the\nchange in representational model complexity arising from different distance metrics.\nThere are also difficulties approximating the posterior by a multivariate Gaussian\nwith \u03b8* as the mode, as in the Laplacian approximation (see Kass & Raftery, 1995,\np. 778), because the featural component of the combined model makes the posterior\nmultimodal.\n\nFor these reasons, we employed Monte Carlo methods with importance sampling\n(e.g., Oh & Berger, 1993), in which the posterior is numerically approximated by\n\n$$p(R_\\gamma \\mid S) \\approx \\frac{1}{N} \\sum_{i=1}^{N} \\frac{p(S \\mid R_\\gamma, \\theta_i) \\, p(\\theta_i \\mid R_\\gamma)}{g(\\theta_i \\mid R_\\gamma)},$$\n\nwhere each of the N \u03b8i values is independently sampled from g(\u00b7). In the following\nevaluation, we assumed that p(\u03b8 | R_\u03b3) is uniform over \u03b8, and specified an importance\ndistribution g(\u00b7) that was Gaussian over \u03b8_C and multinomial over \u03b8_D. As the\nposterior may be multimodal and non-standard, g(\u00b7) was heavy tailed, and we\nsampled extensively (N = 5 \u00d7 10^6) to ensure convergence.\n\n[Figure 1, panel (a): two-dimensional spatial layout of the numbers 0 through 9; the coordinates are not recoverable from the extracted text.]\n\n[Figure 1, panel (b):]\n\nFeature              Weight\n2 4 8                0.444\n0 1 2                0.345\n3 6 9                0.331\n6 7 8 9              0.291\n2 3 4 5 6            0.255\n1 3 5 7 9            0.216\n1 2 3 4              0.214\n4 5 6 7 8            0.172\nadditive constant    0.148\n\nFigure 1: Representations of the numbers similarity data using the (a) dimensional\nand (b) featural approaches.\n\n4 An Illustrative Example\n\nShepard, Kilpatric and Cunningham (1975) collected data measuring the \"abstract\nconceptual similarity\" of the numbers 0 through 9. 
Figure 1(a) displays a\ntwo-dimensional representation of the numbers, using the City-Block metric. This\nrepresentation explains only 78.6% of the variance, and fails to capture important\nregularities evident in the raw data, such as the fact that the number 7 is more similar\nto 8 than it is to 9, or that 3 is much more similar to 0 than it is to 8, and so on.\nFigure 1(b) shows an eight-feature representation of the numbers using the same\ndata, as reported by Tenenbaum (1996). This representation explains 90.9% of\nthe variance, with features corresponding to arithmetic concepts (e.g., {2, 4, 8} and\n{3, 6, 9}) and to numerical magnitude (e.g., {1, 2, 3, 4} and {6, 7, 8, 9}). We note in\npassing that the representations displayed in Figure 1 are also recovered when our\nalgorithm is restricted to purely dimensional or purely featural representations.\n\nFigure 1 suggests that the numbers data is a candidate for combined representation.\nFeatures are appropriate for representing the arithmetic concepts, but a 'magnitude'\ndimension seems to offer a more efficient and meaningful representation of this\nregularity than the five features used in Figure 1(b).\n\nWe fitted combined models with between one and three dimensions and one and\neight features to the same similarity data, and calculated the log posterior for each.\nBecause the raw data needed to estimate the precision of these averaged data are\nunavailable, we followed the arguments presented in Lee (2002) to make a\nconservative choice of \u03c3 = 0.15. The results are shown in Figure 2. All of the representations\nusing one dimension are more likely than those using two or three dimensions. 
Of\nthe one dimensional representations, the four feature version is preferred, although\nthe likelihoods of representations with other numbers of features are close enough\nto warrant consideration in choosing a 'best' representation, particularly given the\nassumptions made about data precision.\n\nFor the sake of concreteness, however, Figure 3 describes the representation with one\ndimension and four features, which explains 90.0% of the variance. The one\ndimension almost orders the numbers according to their magnitude, with the violations\nbeing very small. The four features all capture meaningful arithmetic concepts,\ncorresponding to \"powers of two\", \"multiples of three\", \"multiples of two\" (or \"even\nnumbers\") and \"powers of three\". Encouragingly, these features are close to those\nin Figure 1(b) that do not deal with numerical magnitude.\n\n[Figure 2 plot: log posterior (vertical axis, ticks from -20 to 10) against number of features (horizontal axis, 1 to 8), with one curve each for the 1D, 2D and 3D representations.]\n\nFigure 2: Log posteriors for combined representations with between one and three\ndimensions, and one and eight features.\n\n[Figure 3 panels: on the left, a single dimension along which the numbers 0 through 9 are placed; on the right, a Feature/Weight table whose feature memberships are not recoverable from the extracted text. Weights in the order listed: 0.286, 0.282, 0.224, 0.157; additive constant: 0.568.]\n\nFigure 3: Representation of the numbers similarity data using one dimension (shown\non the left) and four features (shown on the right).\n\n5 Conclusion\n\nFuture work will examine the use of other featural similarity models besides the\npurely common features approach, and will also look to develop learning\nalgorithms that do not rely on maximum likelihood estimation, but instead consider\nthe posterior probability of a representation. 
Reliable analytic approximations to\nthe posterior will be required for this purpose.\n\nMost importantly, however, the combined representation of a wide range of\nsimilarity data needs to be examined. Although the numbers data is a promising start,\nit is just a first test of the combined approach to similarity-based representation.\nDemonstrating the generality and usefulness of the ability to represent stimuli in\nterms of both dimensions and features remains a challenge for future research.\n\nAcknowledgments\n\nThis research was supported by Australian Research Council Grant DP0211406.\nWe thank Tom Griffiths and two anonymous reviewers for helpful comments and\ndiscussions.\n\nReferences\n\n[1] Carroll, J. D. (1976). Spatial, non-spatial and hybrid models for scaling.\nPsychometrika, 41, 439-463.\n\n[2] Cox, T. F. & Cox, M. A. A. (1994). Multidimensional Scaling. London: Chapman and\nHall.\n\n[3] Garner, W. R. (1974). The Processing of Information and Structure. Potomac, MD:\nErlbaum.\n\n[4] Goldstone, R. L. (1999). Similarity. In R. A. Wilson and F. C. Keil (Eds.), MIT\nEncyclopedia of the Cognitive Sciences, pp. 763-765. Cambridge, MA: MIT Press.\n\n[5] Lee, M. D. (2001). Determining the dimensionality of multidimensional scaling\nrepresentations for cognitive modeling. Journal of Mathematical Psychology, 45(1), 149-166.\n\n[6] Lee, M. D. (2002). Generating additive clustering models with limited stochastic\ncomplexity. Journal of Classification, 19(1), 69-85.\n\n[7] Lee, M. D. & Navarro, D. J. (2002). Extending the ALCOVE model of category learning\nto featural stimulus domains. Psychonomic Bulletin & Review, 9(1), 43-58.\n\n[8] Kass, R. E. & Raftery, A. E. (1995). Bayes Factors. Journal of the American Statistical\nAssociation, 90(430), 773-795.\n\n[9] More, J. J. (1977). The Levenberg-Marquardt algorithm: Implementation and theory.\nIn G. A. Watson (Ed.), Lecture Notes in Mathematics, 630, pp. 105-116. 
New York:\nSpringer-Verlag.\n\n[10] Navarro, D. J. & Lee, M. D. (2002). Commonalities and distinctions in featural\nstimulus representations. In W. G. Gray and C. D. Schunn (Eds.), Proceedings of the\n24th Annual Conference of the Cognitive Science Society, pp. 685-690. Mahwah, NJ:\nLawrence Erlbaum.\n\n[11] Oh, M. & Berger, J. O. (1993). Integration of multimodal functions by Monte Carlo\nimportance sampling. Journal of the American Statistical Association, 88, 450-456.\n\n[12] Ruml, W. (2001). Constructing distributed representations using additive clustering.\nIn T. G. Dietterich, S. Becker, and Z. Ghahramani (Eds.), Advances in Neural Information\nProcessing Systems 14. Cambridge, MA: MIT Press.\n\n[13] Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6(2),\n461-464.\n\n[14] Shepard, R. N. (1974). Representation of structure in similarity data: Problems and\nprospects. Psychometrika, 39(4), 373-422.\n\n[15] Shepard, R. N. & Arabie, P. (1979). Additive clustering representations of similarities\nas combinations of discrete overlapping properties. Psychological Review, 86(2), 87-123.\n\n[16] Shepard, R. N., Kilpatric, D. W. & Cunningham, J. P. (1975). The internal\nrepresentation of numbers. Cognitive Psychology, 7, 82-138.\n\n[17] Tenenbaum, J. B. (1996). Learning the structure of similarity. In D. S. Touretzky, M.\nC. Mozer and M. E. Hasselmo (Eds.), Advances in Neural Information Processing Systems,\npp. 3-9. Cambridge, MA: MIT Press.\n\n[18] Tversky, A. (1977). Features of similarity. Psychological Review, 84(4), 327-352.\n", "award": [], "sourceid": 2249, "authors": [{"given_name": "Daniel", "family_name": "Navarro", "institution": null}, {"given_name": "Michael", "family_name": "Lee", "institution": null}]}