{"title": "Constructing Distributed Representations Using Additive Clustering", "book": "Advances in Neural Information Processing Systems", "page_first": 107, "page_last": 114, "abstract": null, "full_text": "Constructing Distributed Representations\nUsing Additive Clustering\nWheeler Ruml\n\nDivision of Engineering and Applied Sciences\nHarvard University\n33 Oxford Street, Cambridge, MA 02138\n\nruml@eecs.harvard.edu\nAbstract\n\nIf the promise of computational modeling is to be fully realized in higher-\nlevel cognitive domains such as language processing, principled methods\nmust be developed to construct the semantic representations used in such\nmodels. In this paper, we propose the use of an established formalism\nfrom mathematical psychology, additive clustering, as a means of auto-\nmatically constructing binary representations for objects using only pair-\nwise similarity data. However, existing methods for the unsupervised\nlearning of additive clustering models do not scale well to large prob-\nlems. We present a new algorithm for additive clustering, based on a\nnovel heuristic technique for combinatorial optimization. The algorithm\nis simpler than previous formulations and makes fewer independence as-\nsumptions. Extensive empirical tests on both human and synthetic data\nsuggest that it is more effective than previous methods and that it also\nscales better to larger problems. By making additive clustering practical,\nwe take a significant step toward scaling connectionist models beyond\nhand-coded examples.\n1 Introduction\n\nMany cognitive models posit mental representations based on discrete substructures. Even\nconnectionist models whose processing involves manipulation of real-valued activations\ntypically represent objects as patterns of 0s and 1s across a set of units (Noelle, Cottrell,\nand Wilms, 1997). 
Often, individual units are taken to represent specific features of the objects, and two representations will share features to the degree to which the two objects are similar. While this arrangement is intuitively appealing, it can be difficult to construct the features to be used in such a model. Using random feature assignments clouds the relationship between the model and the objects it is intended to represent, diminishing the model's value. As Clouse and Cottrell (1996) point out, hand-crafted representations are tedious to construct and it can be difficult to precisely justify (or even articulate) the principles that guided their design. These difficulties effectively limit the number of objects that can be encoded, constraining modeling efforts to small examples. In this paper, we investigate methods for automatically synthesizing feature-based representations directly from the pairwise object similarities that the model is intended to respect.

Table 1: An 8-feature model derived from consonant confusability data. With c = 0.024, the model accounts for 91.8% of the variance in the data.

  Wt.   Objects with feature   Interpretation
  .350  f #                    front unvoiced fricatives
  .243  d g                    back voiced stops
  .197  p k                    unvoiced stops (without t)
  .182  b v #                  front voiced
  .162  p t k                  unvoiced stops
  .127  m n                    nasals
  .075  d g v # z z            voiced (without b)
  .049  p t k f # s s          unvoiced

This automatic approach eliminates the manual burden of selecting and assigning features while providing an explicit design criterion that objectively connects the representations to empirical data. After formalizing the problem, we will review existing algorithms that have been proposed for solving it. We will then investigate a new approach, based on combinatorial optimization.
Using a novel heuristic search technique, we find that the new approach, despite its simplicity, performs better than previous algorithms and that, perhaps more important, it maintains its effectiveness on large problems.

1.1 Additive Clustering

We will formalize the problem of constructing discrete features from similarity information using the additive clustering model of Shepard and Arabie (1979). In this framework, abbreviated ADCLUS, clusters represent arbitrarily overlapping discrete features. Each of the k features has a non-negative real-valued weight w_k, and the similarity between two objects i and j is just the sum of the weights of the features they share. If f_ik is 1 if object i has feature k and 0 otherwise, and c is a real-valued constant, then the similarity of i and j is modeled as

    ŝ_ij = Σ_k w_k f_ik f_jk + c .

This class of models is very expressive, encompassing non-hierarchical as well as hierarchical arrangements of clusters. An example model, derived using the ewindclus-klb algorithm described below, is shown in Table 1. The representation of each object is simply the binary column specifying its membership or absence in each cluster. Additive clustering is asymmetric in the sense that only the shared features of two objects contribute to their similarity, not the ones they both lack. (This is the more general formulation, as an additional feature containing the set complement of the original feature could always be used to produce such an effect.)

With a model formalism in hand, we can then phrase the problem of constructing feature assignments as simply finding the ADCLUS model that best matches the given similarity data using the desired number of features.
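As an illustrative sketch (not code from the paper), the predicted similarities can be computed directly from a 0/1 membership matrix F, weight vector w, and constant c; the function name and toy data below are our own.

```python
import numpy as np

def adclus_predict(F, w, c):
    """ADCLUS prediction: s_hat[i, j] = sum_k w[k] * F[i, k] * F[j, k] + c.

    F is an n-by-k 0/1 feature-membership matrix, w holds the
    non-negative cluster weights, and c is the additive constant.
    """
    # (F * w) scales each membership column by its cluster weight;
    # the product with F.T sums the weights of shared features.
    return (F * w) @ F.T + c

# Toy example: 3 objects, 2 overlapping clusters.
F = np.array([[1, 0],
              [1, 1],
              [0, 1]])
w = np.array([0.5, 0.25])
S_hat = adclus_predict(F, w, 0.1)
# Objects 0 and 1 share only cluster 0, so S_hat[0, 1] = 0.5 + 0.1
```

Fitting then amounts to choosing the binary F (and the non-negative w and the constant c) so that these predictions match the observed similarities as closely as possible.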
The fit of a model (comprising F, W, and c) to a matrix, S, can be quantified by the variance accounted for (VAF), which compares the model's accuracy to merely predicting using the mean similarity:

    VAF = 1 - Σ_ij (s_ij - ŝ_ij)² / Σ_ij (s_ij - s̄)²

A VAF of 0 can always be achieved by setting all w_k to 0 and c to s̄.

2 Previous Algorithms

Additive clustering is a difficult 0-1 quadratic programming problem and only heuristic methods, which do not guarantee an optimal model, have been proposed. Many different approaches have been taken:

Subsets: Shepard and Arabie (1979) proposed an early algorithm based on subset analysis that was clearly superseded by Arabie's later work below. Hojo (1983) also proposed an algorithm along these lines. We will not consider these algorithms further.

Non-discrete Approximation: Arabie and Carroll (1980) and Carroll and Arabie (1983) proposed the two-stage indclus algorithm. In the first stage, cluster memberships are treated as real values and optimized for each cluster in turn by gradient descent. At the same time, a penalty term for non-0-1 values is gradually increased. Afterwards, a combinatorial clean-up stage tries all possible changes to 1 or 2 cluster memberships. Experiments reported below use the original code, modified slightly to handle large instances. Random initial configurations were used.

Asymmetric Approximation: In the sindclus algorithm, Chaturvedi and Carroll (1994) optimize an asymmetric model with two sets of cluster memberships, having the form ŝ_ij = Σ_k w_k f_ik g_jk + c. By considering each cluster in turn, this formulation allows a fast method for determining each of F, G, and w given the other two.
In practice, F and G often become identical, yielding an ADCLUS model. Experiments reported below use both a version of the original implementation that has been modified to handle large instances and a reimplemented version (re-sindclus) that differs in its behavior at boundary cases (handling 0 weights, empty clusters, ties). Models from runs in which F and G did not converge were each converted into several ADCLUS models by taking only F, only G, their intersection, or their union. The weights and constants of each model were optimized using constrained least-squares linear regression (Stark and Parker, 1995), ensuring non-negative cluster weights, and the one with the highest VAF was used.

Alternating Clusters: Kiers (1997) proposed an element-wise simplified sindclus algorithm, which we abbreviate as ewindclus. Like sindclus, it considers each cluster in turn, alternating between the weights and the cluster memberships, although only one set of clusters is maintained. Weights are set by a simple regression and memberships are determined by a gradient function that assumes object independence and fixed weights. The experiments reported below use a new implementation, similar to the reimplementation of sindclus.

Expectation Maximization: Tenenbaum (1996) reformulated ADCLUS fitting in probabilistic terms as a problem with multiple hidden factorial causes, and proposed a combination of the EM algorithm, Gibbs sampling, and simulated annealing to solve it. The experiments below use a modified version of the original implementation which we will notate as em-indclus. It terminates early if 10 iterations of EM pass without a change in the solution quality. (A comparison with the original code showed this modification to give equivalent results using less running time.)

Unfortunately, it is not clear which of these approaches is the best.
Most published comparisons of additive clustering algorithms use only a small number of test problems (or only artificial data) and report only the best solution found within an unspecified amount of time. Because the algorithms use random starting configurations and often return solutions of widely varying quality even when run repeatedly on the same problem, this leaves it unclear which algorithm gives the best results on a typical run. Furthermore, different algorithms require very different running times, and multiple runs of a fast algorithm with high variance in solution quality may produce a better result in the same time as a single run of a more predictable algorithm. The next section reports on a new empirical comparison that addresses these concerns.

Table 2: The performance of several previously proposed algorithms on data sets from psychological experiments.

  Name         indclus        sindclus          re-sindclus       ewindclus
               VAF  IQR       VAF  IQR     r    VAF  IQR     r    VAF  IQR     r
  animals-s    77   75--80    66   65--65  8    78   79--80  12   64   60--69  4
  numbers      83   81--86    84   82--86  5    78   75--81  7    82   79--85  5
  workers      83   82--85    81   79--83  9    84   82--85  7    67   63--72  2
  consonants   89   89--90    88   87--89  6    81   80--82  5    51   44--57  1
  animals      71   69--74    66   66--66  9    66   66--66  13   72   71--73  26
  letters      80   80--80    78   78--79  7    68   65--72  5    74   73--75  17

Table 3: The performance of indclus and em-indclus on the human data sets.

  Name                      indclus           em-indclus
              n    k        VAF  IQR     r    VAF  IQR
  animals-s   10   3        80   80--80  23   80   80--80
  numbers     10   8        91   90--91  157  90   89--90
  workers     14   7        89   88--89  89   87   87--89
  consonants  16   8        91   91--91  291  91   91--91
  animals     26   12       71   69--74  1    N/A
  letters     30   5        82   82--83  486  82   82--83

2.1 Evaluation of Previous Algorithms

We compared indclus, both implementations of sindclus, ewindclus, and em-indclus on 3 sets of problems.
The first set is a collection of 6 typical data sets from psychological experiments that have been used in previous additive clustering work (originally by Shepard and Arabie (1979), except for animals-s, Mechelen and Storms (1995), and animals, Chaturvedi and Carroll (1994)). The number of objects (n) and the number of features used (k) are listed for each instance as part of Table 3. The second set of problems contains noiseless synthetic data derived from ADCLUS models with 8, 16, 32, 64, and 128 objects. In a rough approximation of the human data, the number of clusters was set to 2 log2(n), and as in previous ADCLUS work, each object was inserted in each cluster with probability 0.5. A single similarity matrix was generated from each model using weights and constants uniformly distributed between 1 and 6. The third set of problems was derived from the second by adding Gaussian noise with a variance of 10% of the variance of the similarity data and enforcing symmetry. Each algorithm was run at least 50 times on each data set. Runs that crashed or resulted in a VAF ≤ 0 were ignored. To avoid biasing our conclusions in favor of methods requiring more computation time, those results were then used to derive the distribution of results that would be expected if all algorithms were run simultaneously and those that finished early were re-run repeatedly until the slowest algorithm finished its first run, with any re-runs in progress at that point discarded.[1]

[1] Depending as it does on running time, this comparison remains imprecise due to variations in the degree of code tuning and the quality of the compilers used, and the need to normalize timings between the multiple machines used in the tests.

Summaries of the time-equated results produced by each algorithm on each of the human data sets are shown in Table 2. (em-indclus took much longer than the other algorithms and its performance is shown separately in Table 3.)
The mean VAF for each algorithm is listed, along with the inter-quartile range (IQR) and the mean number of runs that were necessary to achieve time parity with the slowest algorithm on that data set (r). On most instances, there is remarkable variance in the VAF achieved by each algorithm.[2] Overall, despite the variety of approaches that have been brought to bear over the years, the original indclus algorithm appears to be the best. (Results in which another algorithm was superior to indclus are marked with a box.) Animals-s is the only data set on which its median performance was not the best, and its overall distribution of results is consistently competitive. It is revealing to note the differences in performance between the original and reimplemented versions of sindclus. Small changes in the handling of boundary cases make a large difference in the performance of the algorithm.

Surprisingly, on the synthetic data sets (not shown), the relative performance of the algorithms was quite different, and almost the same on the noisy data as on the noise-free data. (This suggests that the randomly generated data sets that are commonly used to evaluate ADCLUS algorithms do not accurately reflect the problems of interest to practitioners.) ewindclus performed best here, although it was only occasionally able to recover the original models from the noise-free data.

Overall, it appears that current methods of additive clustering are quite sensitive to the type of problem they are run on and that there is little assurance that they can recover the underlying structure in the data, even for small problems. To address these problems, we turn now to a new approach.

3 A Purely Combinatorial Approach

One common theme in indclus, sindclus, and ewindclus is their computation of each cluster and its weight in turn, at each step fitting only the residual similarity not accounted for by the other clusters.
This forces memberships to be considered in a predetermined order and allows weights to become obsolete. Inspired in part by recent work of Lee (in press), we propose an orthogonal decomposition of the problem. Instead of computing the elements and weight of each cluster in succession, we first consider all the memberships and then derive all the weights using constrained regression. And where previous algorithms recompute all the memberships of one cluster simultaneously (and therefore independently), we will change memberships one by one in a dynamically determined order using simple heuristic search techniques, recomputing the weights after each step. (An incremental bounded least squares regression algorithm that took advantage of the previous solution would be ideal, but the algorithms tested below did not incorporate such an improvement.) From this perspective, one need only focus on changing the binary membership variables, and ADCLUS becomes a purely combinatorial optimization problem.

We will evaluate three different algorithms based on this approach, all of which attempt to improve a random initial model. The first, indclus-hc, is a simple hill-climbing strategy which attempts to toggle individual memberships in an arbitrary order; the first change resulting in an improved model is accepted. The algorithm terminates when no membership can be changed to give an improvement. This strategy is reminiscent of a proposal by Clouse and Cottrell (1996), although here we are using the ADCLUS model of similarity. In the second algorithm, indclus-pbil, the PBIL algorithm of Baluja (1997) is used

[2] Table 3 shows one anomaly: no em-indclus run on animals resulted in a VAF > 0. This also occurred on all synthetic problems with 32 or more objects (although very good solutions were found on the smaller problems).
Tenenbaum (personal communication) suggests that the default annealing schedule in the em-indclus code may need to be modified for these problems.

to search for appropriate memberships. This is a simplification of the strategy suggested by Lee (in press), whose proposal also includes elements concerned with automatically controlling model complexity. We use the parameter settings he suggests but only allow the algorithm to generate 10,000 solutions.

Table 4: The performance of the combinatorial algorithms on human data sets.

  Name         indclus-hc        ind-pbil      ewind-klb         indclus
               VAF  IQR     r    VAF  IQR     VAF  IQR     r    VAF  IQR     r
  animals-s    80   80--80  44   74   71--74  80   80--80  74   80   80--80  47
  numbers      90   90--91  24   87   85--88  91   91--91  18   90   89--91  59
  workers      88   88--89  16   86   84--87  89   89--89  13   88   88--89  53
  consonants   86   85--87  11   80   76--82  92   92--92  9    91   91--91  61
  animals      71   70--72  8    66   65--69  74   74--74  6    74   74--74  36
  letters      70   69--71  3    66   64--68  76   74--78  2    82   81--82  57

3.1 KL Break-Out: A New Optimization Heuristic

While the two approaches described above do not use any problem-specific information beyond solution quality, the third algorithm uses the gradient function from the ewindclus algorithm to guide the search. The move strategy is a novel combination of gradient ascent and the classic method of Kernighan and Lin (1970) which we call `KL break-out'. It proceeds by gradient ascent, changing the entry in F whose ewindclus gradient points most strongly to the opposite of its current value. When the ascent no longer results in an improvement, a local maximum has been reached. Motivated by results suggesting that good maxima tend to cluster (Boese, Kahng, and Muddu, 1994; Ruml et al., 1996), the algorithm tries to break out of the current basin of attraction and find a nearby maximum rather than start from scratch at another random model.
It selects the least damaging variable to change, using the gradient as in the ascent, but now it locks each variable after changing it. The pool of unlocked variables shrinks, thus forcing the algorithm out of the local maximum and into another part of the space. To determine if it has escaped, a new gradient ascent is attempted after each locking step. If the ascent surpasses the previous maximum, the current break-out attempt is abandoned and the ascent is pursued. If the break-out procedure changes all variables without any ascent finding a better maximum, the algorithm terminates. The procedure is guaranteed to return a solution at least as good as that found by the original KL method (although it will take longer), and it has more flexibility to follow the gradient function. This algorithm, which we will call ewindclus-klb, surpassed the original KL method in time-equated tests. It is also conceptually simple and has no parameters that need to be tuned.

3.2 Evaluation of the Combinatorial Algorithms

The time-equated performance of the combinatorial algorithms on the human data sets is shown in Table 4, with indclus, the best of the previous algorithms, shown for comparison. As one might expect, adding heuristic guidance to the search helps it enormously: ewindclus-klb surpasses the other combinatorial algorithms on every problem. It performs better than indclus on three of the human data sets (top panel), equals its performance on two, and performs worse on one data set, letters. (Results in which ewindclus-klb was not the best are marked with a box.) The variance of indclus on letters is very small, and the full distributions suggest that ewindclus-klb is the better choice on this data set if one can afford the time to take the best of 20 runs.
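The KL break-out procedure of Section 3.1 can be sketched generically. The code below is our own illustrative reconstruction, not the paper's implementation: `score` is a black-box objective over binary membership vectors (in the paper, the VAF of the model after refitting weights), and exact score differences stand in for the ewindclus gradient heuristic.

```python
import numpy as np

def kl_break_out(score, x0, max_rounds=50):
    """Sketch of `KL break-out' search over a binary vector.

    `score` maps a 0/1 vector to a number to be maximized.
    """
    n = len(x0)

    def ascend(x, best):
        # Greedy ascent: repeatedly apply the single best improving flip.
        improved = True
        while improved:
            improved = False
            gains = []
            for i in range(n):
                x[i] ^= 1
                gains.append(score(x))
                x[i] ^= 1
            i = int(np.argmax(gains))
            if gains[i] > best:
                x[i] ^= 1
                best = gains[i]
                improved = True
        return x, best

    x, best = ascend(x0.copy(), score(x0))
    for _ in range(max_rounds):
        # Break-out: repeatedly flip the least damaging unlocked variable,
        # lock it, and attempt a fresh ascent after each locking step.
        y, locked, escaped = x.copy(), set(), False
        while len(locked) < n:
            free = [i for i in range(n) if i not in locked]
            trial = []
            for i in free:
                y[i] ^= 1
                trial.append(score(y))
                y[i] ^= 1
            i = free[int(np.argmax(trial))]   # least damaging flip
            y[i] ^= 1
            locked.add(i)
            z, val = ascend(y.copy(), score(y))
            if val > best:                    # escaped into a better basin
                x, best, escaped = z, val, True
                break
        if not escaped:                       # all variables locked: stop
            return x, best
    return x, best
```

The sketch preserves the two properties noted above: it never returns a solution worse than the ascent it starts from, and it terminates once every variable has been locked without any ascent surpassing the incumbent maximum.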
(Experiments using 7 additional human data sets found that letters represented the weakest performance of ewindclus-klb.)

Table 5: ewindclus-klb and indclus on noisy synthetic data sets of increasing size.

         ewindclus-klb     indclus
  n      VAF  IQR          VAF  IQR     r
  8      97   96--97       95   93--97  1
  16     91   90--92       86   85--87  4
  32     90   88--92       83   82--84  22
  64     91   90--91       84   84--85  100
  128    91   91--91       88   87--90  381

Performance of a plain KL strategy (not shown) surpassed or equaled indclus on all but two problems (consonants and letters), indicating that the combinatorial approach, in tandem with heuristic guidance, is powerful even without the new `KL break-out' strategy.

While we have already seen that synthetic data does not predict the relative performance of algorithms on human data very well, it does provide a test of how well they scale to larger problems. On noise-free synthetic data, ewindclus-klb reliably recovered the original model on all data sets. It was also the best performer on the noisy synthetic data (a comparison with indclus is presented in Table 5). These results show that, in addition to performing best on the human data, the combinatorial approach retains its effectiveness on larger problems.

In addition to being able to handle larger problems than previous methods, it is important to note that the higher VAF of the models induced by ewindclus-klb often translates into increased interpretability. In the model shown in Table 1, for instance, the best previously published model (Tenenbaum, 1996), whose VAF is only 1.6% worse, does not contain s in the unvoiced cluster.

4 Conclusions

We formalized the problem of constructing feature-based representations for cognitive modeling as the unsupervised learning of ADCLUS models from similarity data.
In an empirical comparison sensitive to variance in solution quality and computation time, we found that several recently proposed methods for recovering such models perform worse than the original indclus algorithm of Arabie and Carroll (1980). We suggested a purely combinatorial approach to this problem that is simpler than previous proposals, yet more effective. By changing memberships one at a time, it makes fewer independence assumptions. We also proposed a novel variant of the Kernighan-Lin optimization strategy that is able to follow the gradient function more closely, surpassing the performance of the original.

While this work has extended the reach of the additive clustering paradigm to large problems, it is directly applicable to feature construction for only those cognitive models whose representations encode similarity as shared features. (The cluster weights can be represented by duplicating strong features or by varying connection weights.) However, the simplicity of the combinatorial approach should make it straightforward to extend to models in which the absence of features can enhance similarity. Other future directions include using the output of one algorithm as the starting point for another, and incorporating measures of model complexity (Lee, in press).

5 Acknowledgments

Thanks to Josh Tenenbaum, Michael Lee, and the Harvard AI Group for stimulating discussions; to Josh, Anil Chaturvedi, Henk Kiers, J. Douglas Carroll, and Phipps Arabie for providing source code for their algorithms; to Josh, Michael, and Phipps for providing data sets; and to Michael for sharing unpublished work. This work was supported in part by the NSF under grants CDA-94-01024 and IRI-9618848.

References

Arabie, Phipps and J. Douglas Carroll. 1980. MAPCLUS: A mathematical programming approach to fitting the ADCLUS model. Psychometrika, 45(2):211--235, June.

Baluja, Shumeet. 1997. Genetic algorithms and explicit search statistics.
In Michael C. Mozer, Michael I. Jordan, and Thomas Petsche, editors, NIPS 9.

Boese, Kenneth D., Andrew B. Kahng, and Sudhakar Muddu. 1994. A new adaptive multi-start technique for combinatorial global optimizations. Operations Research Letters, 16:101--113.

Carroll, J. Douglas and Phipps Arabie. 1983. INDCLUS: An individual differences generalization of the ADCLUS model and the MAPCLUS algorithm. Psychometrika, 48(2):157--169, June.

Chaturvedi, Anil and J. Douglas Carroll. 1994. An alternating combinatorial optimization approach to fitting the INDCLUS and generalized INDCLUS models. Journal of Classification, 11:155--170.

Clouse, Daniel S. and Garrison W. Cottrell. 1996. Discrete multi-dimensional scaling. In Proceedings of the 18th Annual Conference of the Cognitive Science Society, pp. 290--294.

Hojo, Hiroshi. 1983. A maximum likelihood method for additive clustering and its applications. Japanese Psychological Research, 25(4):191--201.

Kernighan, B. and S. Lin. 1970. An efficient heuristic procedure for partitioning graphs. The Bell System Technical Journal, 49(2):291--307, February.

Kiers, Henk A. L. 1997. A modification of the SINDCLUS algorithm for fitting the ADCLUS and INDCLUS models. Journal of Classification, 14(2):297--310.

Lee, Michael D. In press. A simple method for generating additive clustering models with limited complexity. Machine Learning.

Mechelen, I. Van and G. Storms. 1995. Analysis of similarity data and Tversky's contrast model. Psychologica Belgica, 35(2--3):85--102.

Noelle, David C., Garrison W. Cottrell, and Fred R. Wilms. 1997. Extreme attraction: On the discrete representation preference of attractor networks. In M. G. Shafto and P. Langley, editors, Proceedings of the 19th Annual Conference of the Cognitive Science Society, p. 1000.

Ruml, Wheeler, J. Thomas Ngo, Joe Marks, and Stuart Shieber. 1996. Easily searched encodings for number partitioning.
Journal of Optimization Theory and Applications, 89(2).

Shepard, Roger N. and Phipps Arabie. 1979. Additive clustering: Representation of similarities as combinations of discrete overlapping properties. Psychological Review, 86(2):87--123, March.

Stark, Philip B. and Robert L. Parker. 1995. Bounded-variable least-squares: An algorithm and applications. Computational Statistics, 10:129--141.

Tenenbaum, Joshua B. 1996. Learning the structure of similarity. In D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo, editors, NIPS 8.
", "award": [], "sourceid": 2077, "authors": [{"given_name": "Wheeler", "family_name": "Ruml", "institution": null}]}