{"title": "Factorial Learning by Clustering Features", "book": "Advances in Neural Information Processing Systems", "page_first": 561, "page_last": 568, "abstract": null, "full_text": "Factorial Learning by Clustering Features \n\nJoshua B. Tenenbaum and Emanuel V. Todorov \n\nDepartment of Brain and Cognitive Sciences \n\nMassachusetts Institute of Technology \n\nCambridge, MA 02139 \n\n{jbt.emo}~psyche . mit.edu \n\nAbstract \n\nWe introduce a novel algorithm for factorial learning, motivated \nby segmentation problems in computational vision, in which the \nunderlying factors correspond to clusters of highly correlated input \nfeatures. The algorithm derives from a new kind of competitive \nclustering model, in which the cluster generators compete to ex(cid:173)\nplain each feature of the data set and cooperate to explain each \ninput example, rather than competing for examples and cooper(cid:173)\nating on features, as in traditional clustering algorithms. A natu(cid:173)\nral extension of the algorithm recovers hierarchical models of data \ngenerated from multiple unknown categories, each with a differ(cid:173)\nent, multiple causal structure. Several simulations demonstrate \nthe power of this approach. \n\n1 \n\nINTRODUCTION \n\nUnsupervised learning is the search for structure in data. Most unsupervised learn(cid:173)\ning systems can be viewed as trying to invert a particular generative model of the \ndata in order to recover the underlying causal structure of their world. Differ(cid:173)\nent learning algorithms are then primarily distinguished by the different generative \nmodels they embody, that is, the different kinds of structure they look for. \nFactorial learning, the subject of this paper, tries to find a set of independent causes \nthat cooperate to produce the input examples. 
We focus on strong factorial learning, where the goal is to recover the actual degrees of freedom responsible for generating the observed data, as opposed to the more general weak approach, where the goal is merely to recover some factorial model that explains the data efficiently. Strong factorial learning makes a claim about the nature of the world, while weak factorial learning only makes a claim about the nature of the learner's representations (although the two are clearly related). Standard subspace algorithms, such as principal component analysis, fit a linear, factorial model to the input data, but can only recover the true causal structure in very limited situations, such as when the data are generated by a linear combination of independent factors with significantly different variances (as in signal-from-noise separation). \n\nFigure 1: A simple factorial learning problem. The learner observes an articulated hand in various configurations, with each example specified by the positions of 16 tracked features (shown as black dots). The learner might recover four underlying factors, corresponding to the positions of the fingers, each of which claims responsibility for four features of the data set. \n\nRecent work in factorial learning suggests that the general problem of recovering the true, multiple causal structure of an arbitrary, real-world data set is very difficult, and that specific approaches must be tailored to specific, but hopefully common, classes of problems (Foldiak, 1990; Saund, 1995; Dayan and Zemel, 1995). Our own interest in multiple cause learning was motivated by segmentation problems in computational vision, in which the underlying factors correspond ideally to disjoint clusters of highly correlated input features. 
Examples include the segmentation of articulated objects into functionally independent parts, or the segmentation of multiple-object motion sequences into tracks of individual objects. These problems, as well as many other problems of pattern recognition and analysis, share a common set of constraints which makes factorial learning both appropriate and tractable. Specifically, while each observed example depends on some combination of several factors, any one input feature always depends on only one such factor (see Figure 1). Then the generative model decomposes into independent sets of functionally grouped input features, or functional parts (Tenenbaum, 1994). \n\nIn this paper, we propose a learning algorithm that extracts these functional parts. The key simplifying assumption, which we call the membership constraint, states that each feature belongs to at most one functional part, and that this membership is constant over the set of training examples. The membership constraint allows us to treat the factorial learning problem as a novel kind of clustering problem. The cluster generators now compete to explain each feature of the data set and cooperate to explain each input example, rather than competing for examples and cooperating on features, as in traditional clustering systems such as K-means or mixture models. The following sections discuss the details of the feature clustering algorithm for extracting functional parts, a simple but illustrative example, and extensions. In particular, we demonstrate a natural way to relax the strict membership constraint and thus learn hierarchical models of data generated from multiple unknown categories, each with a different multiple causal structure. 
\n\n2 THE FEATURE CLUSTERING ALGORITHM \n\nOur algorithm for extracting functional parts derives from a statistical mechanics formulation of the soft clustering problem (inspired by Rose, Gurewitz, and Fox, 1990; Hinton and Zemel, 1994). We take as input a data set $\{x^{(i)}\}$, with $I$ examples of $J$ real-valued features. The best $K$-cluster representation of these $J$ features is given by an optimal set of cluster parameters, $\{\theta_k\}$, and an optimal set of assignments, $\{p_{jk}\}$. The assignment $p_{jk}$ specifies the probability of assigning feature $j$ to cluster $k$, and depends directly on $E_{jk} = \sum_i (x_j^{(i)} - f_{jk}^{(i)})^2$, the total squared difference (over the $I$ training examples) between the observed feature values $x_j^{(i)}$ and cluster $k$'s predictions $f_{jk}^{(i)}$. The parameters $\theta_k$ define cluster $k$'s generative model, and thus determine the predictions $f_{jk}^{(i)}(\theta_k)$. \n\nIf we limit functional parts to clusters of linearly correlated features, then the appropriate generative model has $f_{jk}^{(i)} = w_{jk} y_k^{(i)} + u_j$, with cluster parameters $\theta_k = \{y_k^{(i)}, w_{jk}, u_j\}$ to be estimated. That is, for each example $i$, part $k$ predicts the value of input feature $j$ as a linear function of some part-specific factor $y_k^{(i)}$ (such as finger position in Figure 1). For the purposes of this paper, we assume zero-mean features and ignore the $u_j$ terms. Then $E_{jk} = \sum_i (x_j^{(i)} - w_{jk} y_k^{(i)})^2$. \n\nThe optimal cluster parameters and assignments can now be found by maximizing the complete log likelihood of the data given the $K$-cluster representation, or equivalently, in the framework of statistical mechanics, by minimizing the free energy \n\n$$F = E - \frac{1}{\beta} H = \sum_j \sum_k p_{jk} \left( E_{jk} + \frac{1}{\beta} \log p_{jk} \right) \quad (1)$$ \n\nsubject to the membership constraints $\sum_k p_{jk} = 1$ $(\forall j)$. Minimizing the energy, \n\n$$E = \sum_j \sum_k p_{jk} E_{jk}, \quad (2)$$ \n\nreduces the expected reconstruction error, leading to more accurate representations. 
\nMaximizing the entropy, \n\n$$H = -\sum_j \sum_k p_{jk} \log p_{jk}, \quad (3)$$ \n\ndistributes responsibility for each feature across many parts, thus decreasing the independence of the parts and leading to simpler representations (with fewer degrees of freedom). In line with Occam's Razor, minimizing the energy-entropy tradeoff finds the representation that, at a particular temperature $1/\beta$, best satisfies the conflicting requirements of low error and low complexity. \n\nWe minimize the free energy with a generalized EM procedure (Neal and Hinton, 1994), setting derivatives to zero and iterating the resulting update equations: \n\n$$p_{jk} = \frac{e^{-\beta E_{jk}}}{\sum_{k'} e^{-\beta E_{jk'}}} \quad (4)$$ \n\n$$y_k^{(i)} = \sum_j p_{jk} w_{jk} x_j^{(i)} \quad (5)$$ \n\n$$w_{jk} = \sum_i x_j^{(i)} y_k^{(i)}. \quad (6)$$ \n\nThis update procedure assumes a normalization step $y_k^{(i)} \leftarrow y_k^{(i)} / (\sum_i (y_k^{(i)})^2)^{1/2}$ in each iteration, because without some additional constraint on the magnitudes of $y_k^{(i)}$ (or $w_{jk}$), inverting the generative model $f_{jk}^{(i)} = w_{jk} y_k^{(i)}$ is an ill-posed problem. \n\nThis algorithm maps naturally onto a simple network architecture. The hidden unit activities, representing the part-specific factors $y_k^{(i)}$, are computed from the observations $x_j^{(i)}$ via bottom-up weights $p_{jk} w_{jk}$, normalized, and multiplied by top-down weights $w_{jk}$ to generate the network's predictions $f_{jk}^{(i)}$. The weights adapt according to a hybrid learning rule, with $w_{jk}$ determined by a Hebb rule (as in subspace learning algorithms), and $p_{jk}$ determined by a competitive, softmax function of the reconstruction error $E_{jk}$ (as in soft mixture models). \n\n3 LEARNING A HIERARCHY OF PARTS \n\nThe following simulation illustrates the algorithm's behavior on a simple part segmentation task. The training data consist of 60 examples with 16 features each, representing the horizontal positions of 16 points on an articulated hand in various configurations (as in Figure 1). 
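For concreteness, updates (4)-(6) of Section 2, together with the per-iteration normalization step, can be sketched in a few lines of numpy. The code below runs on random stand-in data of the same shape as this simulation (60 examples, 16 features); the variable names, the choice of K = 4 parts, and the fixed inverse temperature are our own illustrative assumptions, not the paper's code.

```python
import numpy as np

# Minimal sketch of the feature-clustering updates (4)-(6). The data are
# random stand-ins with no real part structure; all names are ours.
rng = np.random.default_rng(0)
I, J, K, beta = 60, 16, 4, 100.0          # examples, features, parts, inverse temperature
X = rng.standard_normal((I, J))           # zero-mean observations x_j^(i)

p = np.full((J, K), 1.0 / K)              # soft assignments p_jk, rows sum to 1
W = rng.standard_normal((J, K)) * 0.1     # generative weights w_jk

for _ in range(50):
    # (5) part-specific factors y_k^(i), then normalize sum_i (y_k^(i))^2 to 1
    Y = X @ (p * W)                       # shape (I, K)
    Y /= np.sqrt((Y ** 2).sum(axis=0, keepdims=True)) + 1e-12
    # (6) Hebb-like update of the top-down weights
    W = X.T @ Y                           # shape (J, K)
    # reconstruction error E_jk = sum_i (x_j^(i) - w_jk y_k^(i))^2
    E = ((X[:, :, None] - Y[:, None, :] * W[None, :, :]) ** 2).sum(axis=0)
    # (4) softmax competition over parts for each feature
    # (subtracting the row minimum only improves numerical stability)
    p = np.exp(-beta * (E - E.min(axis=1, keepdims=True)))
    p /= p.sum(axis=1, keepdims=True)
```

In a full implementation one would anneal beta upward rather than fix it, as described next.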
The data for this example were generated by a hierarchical, random process that produced a low correlation between all 16 features, a moderate correlation between the four features on each finger, and a high correlation between the two features on each joint (two joints per finger). To fully explain this data set, the algorithm should be able to find a corresponding hierarchy of increasingly complex functional part representations. \n\nTo evaluate the network's representation of this data set, we inspect the learned weights $p_{jk} w_{jk}$, which give the total contribution of feature $j$ to part $k$ in (5). In Figure 2, these weights are plotted for several different values of $\beta$, with gray boxes indicating zero weights, white indicating strong positive weights, and black indicating strong negative weights. The network was configured with $K = 16$ part units, to ensure that all potential parts could be found. When fewer than $K$ distinct parts are found, some of the cluster units have identical parameters (appearing as identical columns in Figure 2). These results were generated by deterministic annealing, starting with $\beta \ll 1$, and perturbing the weights slightly each time $\beta$ was increased, in order to break symmetries. \n\nFigure 2 shows that the number of distinct parts found increases with $\beta$, as more accurate (and more complex) representations become favored. In (4), we see that $\beta$ controls the number of distinct parts via the strength of the competition for features. At $\beta = 0$, every part takes equal responsibility for every feature. Without competition, there can be no diversity, and thus only one distinct part is discovered at low $\beta$, corresponding to the whole hand (Figure 2a). As $\beta$ increases, the competition for features gets stiffer, and parts split into their component subparts. 
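Since duplicate part units show up as (near-)identical columns of the weight matrix, the number of distinct parts found at a given inverse temperature can be read off by merging near-duplicate columns. A hypothetical helper illustrating this bookkeeping (our own code, not the paper's):

```python
import numpy as np

# Count the distinct part units in a (features x units) weight matrix,
# merging columns whose difference is below a tolerance. This is an
# illustrative diagnostic, not code from the paper.
def count_distinct_parts(PW, tol=1e-3):
    """Number of distinct columns of PW, up to the tolerance tol."""
    distinct = []
    for c in PW.T:
        if all(np.linalg.norm(c - d) > tol for d in distinct):
            distinct.append(c)
    return len(distinct)

# Toy example: 6 units but only two distinct weight patterns, as when
# the competition is too weak for further splitting.
a = np.array([1.0, 1.0, 0.0, 0.0])
b = np.array([0.0, 0.0, 1.0, 1.0])
PW = np.stack([a, a, a, b, b, b], axis=1)    # shape (features, units)
print(count_distinct_parts(PW))              # -> 2
```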
The network finds first four distinct parts (with four features each), corresponding to individual fingers (Figure 2c), and then eight distinct parts (with two features each), corresponding to individual joints (Figure 2d). Figure 2b shows an intermediate representation, with something between one and four parts. Four distinct columns are visible, but they do not cleanly segregate the features. \n\n[Figure 2 shows four weight matrices (feature $j$ by part $k$) at $\beta$ = 1, 100, 1000, and 20000.] Figure 2: A hierarchy of functional part representations, parameterized by $\beta$. \n\n[Figure 3 plots the energy against $\log \beta$, with labeled points corresponding to the panels of Figure 2.] Figure 3: A phase diagram distinguishes true parts (a, c, d) from spurious ones (b). \n\nFigure 3 plots the decrease in mean reconstruction error (expressed by the energy $E$) as $\beta$ increases and more distinct parts emerge. Notice that within the three stable phases corresponding to good part decompositions (Figures 2a, 2c, 2d), $E$ remains practically constant over wide variations in $\beta$. In contrast, $E$ varies rapidly at the boundaries between phases, where spurious part structure appears (Figure 2b). In general, good representations should lie at stable points of this phase diagram, where the error-complexity tradeoff is robust. Thus the actual number of parts in a particular data set, as well as their hierarchical structure, need not be known in advance, but can be inferred from the dynamics of learning. \n\n4 LEARNING MULTIPLE CATEGORIES \n\nUntil this point, we have assumed that each feature belongs to at most one part over the entire set of training examples, and tried to find the single $K$-part model that best explains the data as a whole. 
But the notion that a single model must explain the whole data set is quite restrictive. The data may contain several categories of examples, each characterized by a different pattern of feature correlations, and then we would like to learn a set of models, each capturing the distinctive part structure of one such category. Again we are motivated by human vision, which easily recognizes many categories of motion defined by high-level patterns of coordinated part movement, such as hand gestures and facial expressions. \n\nIf we know which examples belong to which categories, learning multiple models is no harder than learning one, as in the previous section. A separate model can be fit to each category $m$ of training examples, and the weights $p_{jk}^m w_{jk}^m$ are frozen to produce a set of category templates. However, if the category identities are unknown, we face a novel kind of hierarchical learning task. We must simultaneously discover the optimal clustering of examples into categories, as well as the optimal clustering of features into parts within each category. We can formalize this hierarchical clustering problem as minimizing a familiar free energy, \n\n$$F = \sum_i \sum_m g_{im} \left( T_{im} + \frac{1}{\alpha} \log g_{im} \right), \quad (7)$$ \n\nin which $g_{im}$ specifies the probability of assigning example $i$ to category $m$, and $T_{im}$ is the associated cost. This cost is itself the free energy of the $m$th $K$-part model on the $i$th example, \n\n$$T_{im} = \sum_j \sum_k p_{jk}^m \left( E_{jk}^{im} + \frac{1}{\beta} \log p_{jk}^m \right), \quad (8)$$ \n\nin which $p_{jk}^m$ specifies the probability of assigning feature $j$ to part $k$ within category $m$, and $E_{jk}^{im} = (x_j^{(i)} - w_{jk}^m y_k^{(im)})^2$ is the usual reconstruction error from Section 2. \n\nThis algorithm was tested on a data set of 256 hand configurations with 20 features each (similar to those in Figure 1), in which each example expresses one of four possible \"gestures\", i.e. patterns of feature correlation. 
As Table 1 indicates, the five features on each finger are highly correlated across the entire data set, while variable correlations between the four fingers distinguish the gesture categories. Note that a single model with four parts explains the full variance of the data just as well as the actual four-category generating process. However, most of the data can also be explained by one of several simpler models, making the learner's task a challenging balancing act between accuracy and simplicity. \n\nTable 1: The 20 features are grouped into either 2, 3, or 4 functional parts. \n\nExamples | No. of parts | Part composition \n1-64 | 2 | 1-10, 11-20 \n65-127 | 3 | 1-10, 11-15, 16-20 \n128-192 | 3 | 1-5, 6-10, 11-20 \n193-256 | 4 | 1-5, 6-10, 11-15, 16-20 \n\nFigure 4 shows a typical representation learned for this data set. The algorithm was configured with $M = 8$ category models (each with $K = 8$ parts), but only four distinct categories of examples are found after annealing on $\alpha$ (holding $\beta$ constant), and their weights $p_{jk}^m w_{jk}^m$ are depicted in Figure 4a. Each category faithfully captures one of the actual generating categories in Table 1, with the correct number and composition of functional parts. Figure 4b depicts the responsibility $g_{im}$ that each learned category $m$ takes for each example $i$. Notice the inevitable effect of a bias towards simpler representations. Many examples are misassigned relative to Table 1, when categories with fewer degrees of freedom than their true generating categories can explain them almost as accurately. \n\n5 CONCLUSIONS AND FUTURE DIRECTIONS \n\nThe notion that many data sets are best explained in terms of functionally independent clusters of correlated features resonates with similar proposals of Foldiak (1990), Saund (1995), Hinton and Zemel (1994), and Dayan and Zemel (1995). 
Our approach is unique in actually formulating the learning task as a clustering problem and explicitly extracting the functional parts of the data. Factorial learning by clustering features has three principal advantages. First, the free energy cost function for clustering yields a natural complexity scale-space of functional part representations, parameterized by $\beta$. Second, the generalized EM learning algorithm is simple and quick, and maps easily onto a network architecture. Third, by nesting free energies, we can seamlessly compose objective functions for quite complex, hierarchical unsupervised learning problems, such as the multiple category, multiple part mixture problem of Section 4. \n\nThe primary limitation of our approach is that when the generative model we assume does not in fact apply to the data, the algorithm may fail to recover any meaningful structure. In ongoing work, we are pursuing a more flexible generative model that allows the underlying causes to compete directly for arbitrary feature-example pairs $ij$, rather than limiting competition only to features $j$, as in Section 2, or only to examples $i$, as in conventional mixture models, or segregating competition for examples and features into hierarchical stages, as in Section 4. Because this introduces many more degrees of freedom, robust learning will require additional constraints, such as temporal continuity of examples or spatial continuity of features. \n\n[Figure 4a shows the learned weights of categories 1-4 (feature $j$ by part $k$); Figure 4b plots the category responsibilities over examples $i$ = 1 to 256.] Figure 4: Learning multiple categories, each with a different part structure. 
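The nested assignment of Section 4 can be sketched directly: given the free energy $T_{im}$ of each category's part model on each example, the responsibilities $g_{im}$ implied by (7) take the same softmax form as (4), with a separate inverse temperature $\alpha$ annealed while $\beta$ is held fixed. A minimal numpy illustration on random stand-in costs (the names, sizes, and the value of alpha are our assumptions, not the paper's code):

```python
import numpy as np

# Sketch of the category-level E step from Section 4: g_im is a softmax,
# over categories m, of the per-example free energies T_im, at a separate
# inverse temperature alpha. T here is random stand-in data, not the
# output of fitted K-part models.
rng = np.random.default_rng(1)
I, M, alpha = 8, 4, 2.0
T = rng.random((I, M))                        # cost T_im of category m on example i

# softmax with the row minimum subtracted for numerical stability
g = np.exp(-alpha * (T - T.min(axis=1, keepdims=True)))
g /= g.sum(axis=1, keepdims=True)             # enforces sum_m g_im = 1 for every i
```

At small alpha the responsibilities are nearly uniform; annealing alpha upward hardens the assignment of examples to categories, just as annealing beta hardens the assignment of features to parts.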
\n\nAcknowledgements \n\nBoth authors are Howard Hughes Medical Institute Predoctoral Fellows. We thank Whitman Richards, Yair Weiss, and Stephen Gilbert for helpful discussions. \n\nReferences \n\nDayan, P. and Zemel, R. S. (1995). Competition and multiple cause models. Neural Computation, in press. \n\nFoldiak, P. (1990). Forming sparse representations by local anti-Hebbian learning. Biological Cybernetics 64, 165-170. \n\nHinton, G. E. and Zemel, R. S. (1994). Autoencoders, minimum description length and Helmholtz free energy. In J. D. Cowan, G. Tesauro, & J. Alspector (eds.), Advances in Neural Information Processing Systems 6. San Mateo, CA: Morgan Kaufmann, 3-10. \n\nNeal, R. M. and Hinton, G. E. (1994). A new view of the EM algorithm that justifies incremental and other variants. \n\nRose, K., Gurewitz, E., and Fox, G. (1990). Statistical mechanics and phase transitions in clustering. Physical Review Letters 65, 945-948. \n\nSaund, E. (1995). A multiple cause mixture model for unsupervised learning. Neural Computation 7, 51-71. \n\nTenenbaum, J. (1994). Functional parts. In A. Ram & K. Eiselt (eds.), Proceedings of the Sixteenth Annual Conference of the Cognitive Science Society. Hillsdale, NJ: Lawrence Erlbaum, 864-869. \n", "award": [], "sourceid": 955, "authors": [{"given_name": "Joshua", "family_name": "Tenenbaum", "institution": null}, {"given_name": "Emanuel", "family_name": "Todorov", "institution": null}]}