{"title": "Clustering with a Domain-Specific Distance Measure", "book": "Advances in Neural Information Processing Systems", "page_first": 96, "page_last": 103, "abstract": null, "full_text": "Clustering with a Domain-Specific Distance Measure\n\nSteven Gold, Eric Mjolsness and Anand Rangarajan\nDepartment of Computer Science\nYale University\nNew Haven, CT 06520-8285\n\nAbstract\n\nWith a point matching distance measure which is invariant under translation, rotation and permutation, we learn 2-D point-set objects by clustering noisy point-set images. Unlike traditional clustering methods, which use distance measures that operate on feature vectors - a representation common to most problem domains - this object-based clustering technique employs a distance measure specific to a type of object within a problem domain. Formulating the clustering problem as two nested objective functions, we derive optimization dynamics similar to the Expectation-Maximization algorithm used in mixture models.\n\n1 Introduction\n\nClustering and related unsupervised learning techniques such as competitive learning and self-organizing maps have traditionally relied on measures of distance, like Euclidean or Mahalanobis distance, which are generic across most problem domains. Consequently, when working in complex domains like vision, extensive preprocessing is required to produce feature sets which reflect properties critical to the domain, such as invariance to translation and rotation. Not only does such preprocessing increase the architectural complexity of these systems, but it may fail to preserve some properties inherent in the domain. 
For example, in vision, while Fourier decomposition may be adequate to handle reconstructions invariant under translation and rotation, it is unlikely that distortion invariance will be as amenable to this technique (von der Malsburg, 1988).\n\nThese problems may be avoided with the help of more powerful, domain-specific distance measures, including some which have been applied successfully to visual recognition tasks (Simard, Le Cun, and Denker, 1993; Huttenlocher et al., 1993). Such measures can contain domain-critical properties; for example, the distance measure used here to cluster 2-D point images is invariant under translation, rotation and labeling permutation. Moreover, new distance measures may be constructed, as this one was, using Bayesian inference on a model of the visual domain given by a probabilistic grammar (Mjolsness, 1992). Distortion-invariant or graph matching measures, so formulated, can then be applied to other domains which may not be amenable to description in terms of features.\n\nObjective functions can describe the distance measures constructed from a probabilistic grammar, as well as learning problems that use them. The clustering problem in the present paper is formulated as two nested objective functions: the inner objective computes the distance measures and the outer objective computes the cluster centers and cluster memberships. A clocked objective function is used, with separate optimizations occurring in distinct clock phases (Mjolsness and Miranker, 1993). The optimization is carried out with coordinate ascent/descent and deterministic annealing, and the resulting dynamics is a generalization of the Expectation-Maximization (EM) algorithm commonly used in mixture models. 
\n\n2 Theory\n\n2.1 The Distance Measure\n\nOur distance measure quantifies the degree of similarity between two unlabeled 2-D point images, irrespective of their position and orientation. It is calculated with an objective that can be used in an image registration problem. Given two sets of points {X_j} and {Y_k}, one can minimize the following objective to find the translation, rotation and permutation which best maps Y onto X:\n\nE_{reg}(m, t, \Theta) = \sum_{jk} m_{jk} \|X_j - t - R(\Theta) \cdot Y_k\|^2\n\nwith constraints: \forall j \sum_k m_{jk} = 1, \forall k \sum_j m_{jk} = 1.\n\nSuch a registration permits the matching of two sparse feature images in the presence of noise (Lu and Mjolsness, 1994). In the above objective, m is a permutation matrix which matches one point in one image with a corresponding point in the other image. The constraints on m ensure that each point in each image corresponds to one and only one point in the other image (though note later remarks regarding fuzziness). Then given two sets of points {X_j} and {Y_k} the distance between them is defined as:\n\nD({X_j}, {Y_k}) = \min_{m,t,\Theta} (E_{reg}(m, t, \Theta) | constraints on m).   (1)\n\nThis measure is an example of a more general image distance measure derived in (Mjolsness, 1992):\n\nd(x, y) = \min_T d(x, T(y)) \in [0, \infty)\n\nwhere T is a set of transformation parameters introduced by a visual grammar. In (1) translation, rotation and permutation are the transformations; however, scaling or distortion could also have been included, with consequent changes in the objective function. 
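For concreteness, the registration distance of equation (1) can be evaluated exactly for very small point sets by enumerating permutations and solving the rigid alignment in closed form. The sketch below is ours, for illustration only - the paper instead optimizes a relaxed objective with deterministic annealing, which scales far better; all function names here are assumptions:

```python
# Brute-force evaluation of D({X_j},{Y_k}) from Eq. (1), feasible only for
# tiny point sets; for each candidate permutation the optimal translation
# and rotation are found in closed form (2-D Procrustes alignment).
import itertools
import numpy as np

def rot(theta):
    """2-D rotation matrix R(theta)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def registration_distance(X, Y):
    """min over permutation, rotation and translation of
    sum_j ||X_j - t - R(theta) . Y_pi(j)||^2."""
    best = np.inf
    for perm in itertools.permutations(range(len(Y))):
        Yp = Y[list(perm)]
        xc, yc = X - X.mean(axis=0), Yp - Yp.mean(axis=0)
        # optimal rotation aligning the centered sets (closed form in 2-D)
        A = np.sum(xc * yc)
        B = np.sum(xc[:, 1] * yc[:, 0] - xc[:, 0] * yc[:, 1])
        theta = np.arctan2(B, A)
        t = X.mean(axis=0) - rot(theta) @ Yp.mean(axis=0)
        cost = np.sum((X - t - Yp @ rot(theta).T) ** 2)
        best = min(best, cost)
    return best

# a rigidly transformed, relabeled copy of Y is at distance ~0 from Y
rng = np.random.default_rng(0)
Y = rng.random((4, 2))
X = Y[[2, 0, 3, 1]] @ rot(0.7).T + np.array([0.3, -0.2])
d = registration_distance(X, Y)
```

Enumeration costs O(n!) and is shown only to make the invariances of (1) concrete; the annealing dynamics of Section 3 avoid it entirely.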
\n\nThe constraints are enforced by applying the Potts glass mean field theory approximations (Peterson and Soderberg, 1989) and then using an equivalent form of the resulting objective, which employs Lagrange multipliers and an x log x barrier function (as in Yuille and Kosowsky, 1991):\n\nE_{reg}(m, t, \Theta) = \sum_{jk} m_{jk} \|X_j - t - R(\Theta) \cdot Y_k\|^2 + \frac{1}{\beta_m} \sum_{jk} m_{jk} (\log m_{jk} - 1) + \sum_j \mu_j (\sum_k m_{jk} - 1) + \sum_k \nu_k (\sum_j m_{jk} - 1).   (2)\n\nIn this objective we are looking for a saddle point. (2) is minimized with respect to m, t, and \Theta, which are the correspondence matrix, translation, and rotation, and is maximized with respect to \mu and \nu, the Lagrange multipliers that enforce the row and column constraints for m.\n\n2.2 The Clustering Objective\n\nThe learning problem is formulated as follows: given a set of I images, {X_i}, with each image consisting of J points, find a set of A cluster centers {Y_a} and match variables {M_{ia}}, defined as\n\nM_{ia} = 1 if X_i is in Y_a's cluster, 0 otherwise,\n\nsuch that each image is in only one cluster, and the total distance of all the images from their respective cluster centers is minimized. To find {Y_a} and {M_{ia}}, minimize the cost function\n\nE_{cluster}(Y, M) = \sum_{ia} M_{ia} D(X_i, Y_a)\n\nwith the constraint that \forall i \sum_a M_{ia} = 1. D(X_i, Y_a), the distance function, is defined by (1).\n\nThe constraints on M are enforced in a manner similar to that described for the distance measure, except that now only the rows of the matrix M need to add to one, instead of both the rows and the columns. The Potts glass mean field theory method is applied and an equivalent form of the resulting objective is used:\n\nE_{cluster}(Y, M) = \sum_{ia} M_{ia} D(X_i, Y_a) + \frac{1}{\beta_M} \sum_{ia} M_{ia} (\log M_{ia} - 1) + \sum_i \lambda_i (\sum_a M_{ia} - 1).   (3)\n\nReplacing the distance measure by (2), we derive:\n\nE_{cluster}(Y, M, t, \Theta, m) = \sum_{ia} M_{ia} \sum_{jk} m_{iajk} \|X_{ij} - t_{ia} - R(\Theta_{ia}) \cdot Y_{ak}\|^2 + \sum_{ia} [\frac{1}{\beta_m} \sum_{jk} m_{iajk} (\log m_{iajk} - 1) + \sum_j \mu_{iaj} (\sum_k m_{iajk} - 1) + \sum_k \nu_{iak} (\sum_j m_{iajk} - 1)] + \frac{1}{\beta_M} \sum_{ia} M_{ia} (\log M_{ia} - 1) + \sum_i \lambda_i (\sum_a M_{ia} - 1).\n\nA saddle point is required. The objective is minimized with respect to Y, M, m, t, \Theta, which are respectively the cluster centers, the cluster membership matrix, the correspondence matrices, the translations, and the rotations. It is maximized with respect to \lambda, which enforces the row constraint for M, and \mu and \nu, which enforce the row and column constraints for m. M is a cluster membership matrix indicating, for each image i, which cluster a it falls within, and m_{ia} is a permutation matrix which assigns to each point in cluster center Y_a a corresponding point in image X_i. \Theta_{ia} gives the rotation between image i and cluster center a. Both M and m are fuzzy, so a given image may partially fall within several clusters, with the degree of fuzziness depending upon \beta_m and \beta_M.\n\nTherefore, given a set of images, X, we construct E_{cluster} and, upon finding the appropriate saddle point of that objective, we will have Y, their cluster centers, and M, their cluster memberships.\n\n3 The Algorithm\n\n3.1 Overview - A Clocked Objective Function\n\nThe algorithm to minimize the above objective consists of two loops - an inner loop to minimize the distance measure objective (2) and an outer loop to minimize the clustering objective (3). Using coordinate descent in the outer loop results in dynamics similar to the EM algorithm for clustering (Hathaway, 1986). (The EM algorithm has been similarly used in supervised learning [Jordan and Jacobs, 1993].) All variables occurring in the distance measure objective are held fixed during this phase. The inner loop uses coordinate ascent/descent, which results in repeated row and column projections for m. 
The minimization of m, t and \Theta occurs in an incremental fashion; that is, their values are saved after each inner loop call from within the outer loop and are then used as initial values for the next call to the inner loop. This tracking of the values of m, t, and \Theta in the inner loop is essential to the efficiency of the algorithm, since it greatly speeds up each inner loop optimization. Each coordinate ascent/descent phase can be computed analytically, further speeding up the algorithm. Local minima are avoided by deterministic annealing in both the outer and inner loops.\n\nThe resulting dynamics can be concisely expressed by formulating the objective as a clocked objective function, which is optimized over distinct sets of variables in phases:\n\nE_{clocked} = E_{cluster}(((\mu, m)^A, (\nu, m)^A)_\oplus, \Theta^A, t^A)_\oplus, (\lambda, M)^A, Y^A)_\oplus\n\nwith this special notation employed recursively:\n\nE(x, y)_\oplus : coordinate descent on x, then y, iterated (if necessary)\nx^A : use analytic solution for x phase\n\nThe algorithm can be expressed less concisely in English, as follows:\n\nInitialize t, \Theta to zero, Y to random values\nBegin Outer Loop\n  Begin Inner Loop\n    Initialize t, \Theta with previous values\n    Find m, t, \Theta for each ia pair:\n      Find m by softmax, projecting across j, then k, iteratively\n      Find \Theta by coordinate descent\n      Find t by coordinate descent\n  End Inner Loop\n  If first time through outer loop, increase \beta_m and repeat inner loop\n  Find M, Y using fixed values of m, t, \Theta determined in inner loop:\n    Find M by softmax, across i\n    Find Y by coordinate descent\n  Increase \beta_M, \beta_m\nEnd Outer Loop\n\nWhen the distances are calculated for all the X-Y pairs the first time through the outer loop, annealing is needed to minimize the objectives accurately. 
However, on each succeeding iteration, since good initial estimates are available for t and \Theta (the values from the previous iteration of the outer loop), annealing is unnecessary and the minimization is much faster.\n\nThe speed of the above algorithm is increased by not recalculating the X-Y distance for a given ia pair when its M_{ia} membership variable drops below a threshold.\n\n3.2 Inner Loop\n\nThe inner loop proceeds in three phases. In phase one, while t and \Theta are held fixed, m is initialized with the softmax function and then iteratively projected across its rows and columns until the procedure converges. In phases two and three, t and \Theta are updated using coordinate descent. Then \beta_m is increased and the loop repeats.\n\nIn phase one m is updated with softmax:\n\nm_{iajk} = \frac{\exp(-\beta_m \|X_{ij} - t_{ia} - R(\Theta_{ia}) \cdot Y_{ak}\|^2)}{\sum_{k'} \exp(-\beta_m \|X_{ij} - t_{ia} - R(\Theta_{ia}) \cdot Y_{ak'}\|^2)}\n\nThen m is iteratively normalized across j and k until \sum_{jk} |\Delta m_{iajk}| < \epsilon:\n\nm_{iajk} = \frac{m_{iajk}}{\sum_{j'} m_{iaj'k}}\n\nUsing coordinate descent, \Theta is calculated in phase two, and t in phase three. Finally \beta_m is increased and the loop repeats.\n\nBy setting the partial derivatives of (2) to zero and initializing \mu and \nu to zero, the algorithm for phase one may be derived. Phases two and three may be derived by taking the partial derivative of (2) with respect to \Theta, setting it to zero, solving for \Theta, and then solving for the fixed point of the vector (t_1, t_2).\n\nBeginning with a small \beta_m allows minimization over a fuzzy correspondence matrix m, for which a global minimum is easier to find. Raising \beta_m drives the m's closer to 0 or 1, as the algorithm approaches a saddle point. 
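The three inner-loop phases can be sketched, for a single (i, a) pair, as follows. This is our illustrative reconstruction, not the authors' code: names and the fixed iteration counts are assumptions, phase one is the softmax plus alternating row/column normalization, and phases two and three zero the partial derivatives of objective (2) with respect to the rotation and translation:

```python
# One inner-loop pass for a single (i, a) pair. X: (J, 2) image points,
# Y: (K, 2) cluster-center points, with J == K as in the paper.
import numpy as np

def rot(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def inner_loop_pass(X, Y, t, theta, beta_m, iters=50):
    # phase one: softmax over k, then alternating normalizations (fuzzy m)
    diff = X[:, None, :] - t - Y[None, :, :] @ rot(theta).T
    m = np.exp(-beta_m * np.sum(diff ** 2, axis=2))
    m /= m.sum(axis=1, keepdims=True)
    for _ in range(iters):
        m /= m.sum(axis=0, keepdims=True)   # project columns (across j)
        m /= m.sum(axis=1, keepdims=True)   # project rows (across k)
    # phase two: optimal rotation for fixed m and t (closed form in 2-D)
    xc = X - t
    A = sum(m[j, k] * (xc[j] @ Y[k]) for j in range(len(X)) for k in range(len(Y)))
    B = sum(m[j, k] * (xc[j, 1] * Y[k, 0] - xc[j, 0] * Y[k, 1])
            for j in range(len(X)) for k in range(len(Y)))
    theta = np.arctan2(B, A)
    # phase three: optimal translation for fixed m and theta
    t = (m.sum(axis=1) @ X - (m.sum(axis=0) @ Y) @ rot(theta).T) / m.sum()
    return m, t, theta

# translated copy of a square: m should lock onto the identity matching
Y = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.]])
X = Y + np.array([0.1, 0.05])
m, t, theta = inner_loop_pass(X, Y, t=np.zeros(2), theta=0.0, beta_m=10.0)
```

In the full algorithm this pass is repeated while \beta_m is raised, with m, t and \theta carried over between calls as the text describes.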
\n\n3.3 Outer Loop\n\nThe outer loop also proceeds in three phases: (1) distances are calculated by calling the inner loop, (2) M is projected across a using the softmax function, (3) coordinate descent is used to update Y.\n\nUsing softmax, M is updated in phase two:\n\nM_{ia} = \frac{\exp(-\beta_M \sum_{jk} m_{iajk} \|X_{ij} - t_{ia} - R(\Theta_{ia}) \cdot Y_{ak}\|^2)}{\sum_{a'} \exp(-\beta_M \sum_{jk} m_{ia'jk} \|X_{ij} - t_{ia'} - R(\Theta_{ia'}) \cdot Y_{a'k}\|^2)}\n\nY, in phase three, is calculated using coordinate descent:\n\nY_{ak1} = \frac{\sum_i M_{ia} \sum_j m_{iajk} (\cos\Theta_{ia} (X_{ij1} - t_{ia1}) + \sin\Theta_{ia} (X_{ij2} - t_{ia2}))}{\sum_i M_{ia} \sum_j m_{iajk}}\n\nY_{ak2} = \frac{\sum_i M_{ia} \sum_j m_{iajk} (-\sin\Theta_{ia} (X_{ij1} - t_{ia1}) + \cos\Theta_{ia} (X_{ij2} - t_{ia2}))}{\sum_i M_{ia} \sum_j m_{iajk}}\n\nThen \beta_M is increased and the loop repeats.\n\n4 Methods and Experimental Results\n\nIn two experiments (Figures 1a and 1b), 16 and 100 randomly generated images of 15 and 20 points each are clustered into 4 and 10 clusters, respectively.\n\nA stochastic model, formulated with essentially the same visual grammar used to derive the clustering algorithm (Mjolsness, 1992), generated the experimental data. That model begins with the cluster centers and then applies probabilistic transformations according to the rules laid out in the grammar to produce the images. These transformations are then inverted to recover cluster centers from a starting set of images. Therefore, to test the algorithm, the same transformations are applied to produce a set of images, and then the algorithm is run in order to see if it can recover the set of cluster centers from which the images were produced.\n\nFirst, n = 10 points are selected using a uniform distribution across a normalized square. For each of the n = 10 points a model prototype (cluster center) is created by generating a set of k = 20 points uniformly distributed across a normalized square centered at each original point. 
Then, m = 10 new images consisting of k = 20 points each are generated from each model prototype by displacing all k model points by a random global translation, rotating all k points by a random global rotation within a 54\u00b0 arc, and then adding independent noise to each of the translated and rotated points with a Gaussian distribution of variance \sigma^2.\n\nFigure 1: (a) 16 images, 15 points each; (b) 100 images, 20 points each\n\nThe p = n x m = 100 images so generated are the input to the algorithm. The algorithm, which is initially ignorant of cluster membership information, computes n = 10 cluster centers as well as n x p = 1000 match variables determining the cluster membership of each point image. \sigma is varied, and for each \sigma the average distance of the computed cluster centers to the theoretical cluster centers (i.e. the original n = 10 model prototypes) is plotted.\n\nData (Figure 1a) is generated with 20 random seeds with constants of n = 4, k = 15, m = 4, p = 16, varying \sigma from .02 to .14 by increments of .02 for each seed. This produces 80 model prototype-computed cluster center distances for each value of \sigma, which are then averaged and plotted, along with an error bar representing the standard deviation of each set. 15 random seeds (Figure 1b) with constants of n = 10, k = 20, m = 10, p = 100, \sigma varied from .02 to .16 by increments of .02 for each seed, produce 150 model prototype-computed cluster center distances for each value of \sigma. The straight line plotted on each graph shows the expected model prototype-cluster center distance, b = k\sigma/\sqrt{n}, which would be obtained if there were no translation or rotation for each generated image, and if the cluster memberships were known. It can be considered a lower bound for the reconstruction performance of our algorithm. 
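The generative process described above can be sketched as follows. This is a hedged reconstruction: the text does not state the range of the random global translation, so a uniform range over [-0.5, 0.5] is assumed here, the rotation arc is assumed to be centered at zero, and all helper names are ours:

```python
# Data generator for the larger experiment: n = 10 prototypes of k = 20
# points each, m = 10 noisy images per prototype (p = 100 images total).
import numpy as np

def make_images(prototypes, images_per_proto=10, sigma=0.05, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    images = []
    for proto in prototypes:                 # proto: (k, 2) array of points
        for _ in range(images_per_proto):
            # random global rotation within a 54-degree arc (assumed centered)
            theta = rng.uniform(-np.deg2rad(27), np.deg2rad(27))
            c, s = np.cos(theta), np.sin(theta)
            R = np.array([[c, -s], [s, c]])
            # random global translation (range is our assumption)
            t = rng.uniform(-0.5, 0.5, size=2)
            # independent Gaussian noise of std sigma on every point
            images.append(proto @ R.T + t + rng.normal(0.0, sigma, proto.shape))
    return images

rng = np.random.default_rng(0)
protos = [rng.random((20, 2)) for _ in range(10)]   # n = 10, k = 20
imgs = make_images(protos, rng=rng)                 # p = 100 images
```

Running the clustering algorithm on `imgs` and comparing the recovered centers to `protos` reproduces the kind of test reported in Figure 1b.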
Figures 1a and 1b together summarize the results of 280 separate clustering experiments.\n\nFor each set of images the algorithm was run four times, varying the initial randomly selected starting cluster centers each time and then selecting the run with the lowest energy for the results. The annealing rate for \beta_M and \beta_m was a constant factor of 1.031. Each run of the algorithm averaged ten minutes on an Indigo SGI workstation for the 16 image test, and four hours for the 100 image test. The running time of the algorithm is O(pnk^2). Parallelization, as well as hierarchical and attentional mechanisms, all currently under investigation, can reduce these times.\n\n5 Summary\n\nBy incorporating a domain-specific distance measure instead of the typical generic distance measures, the new method of unsupervised learning substantially reduces the amount of ad hoc preprocessing required in conventional techniques. Critical features of a domain (such as invariance under translation, rotation, and permutation) are captured within the clustering procedure, rather than reflected in the properties of feature sets created prior to clustering. The distance measure and learning problem are formally described as nested objective functions. We derive an efficient algorithm by using optimization techniques that allow us to divide up the objective function into parts which may be minimized in distinct phases. The algorithm has accurately recreated 10 prototypes from a randomly generated sample database of 100 images consisting of 20 points each in 120 experiments. Finally, by incorporating permutation invariance in our distance measure, we have a technique that we may be able to apply to the clustering of graphs. Our goal is to develop measures which will enable the learning of objects with shape or structure. 
\n\nAcknowledgements\n\nThis work has been supported by AFOSR grant F49620-92-J-0465 and ONR/DARPA grant N00014-92-J-4048.\n\nReferences\n\nR. Hathaway. (1986). Another interpretation of the EM algorithm for mixture distributions. Statistics and Probability Letters 4:53-56.\n\nD. Huttenlocher, G. Klanderman and W. Rucklidge. (1993). Comparing images using the Hausdorff distance. Pattern Analysis and Machine Intelligence 15(9):850-863.\n\nM. I. Jordan and R. A. Jacobs. (1993). Hierarchical mixtures of experts and the EM algorithm. Technical Report 9301, MIT Computational Cognitive Science.\n\nC. P. Lu and E. Mjolsness. (1994). Two-dimensional object localization by coarse-to-fine correlation matching. In this volume, NIPS 6.\n\nC. von der Malsburg. (1988). Pattern recognition by labeled graph matching. Neural Networks 1:141-148.\n\nE. Mjolsness. (1992). Visual grammars and their neural networks. SPIE Conference on the Science of Artificial Neural Networks, 1710:63-85.\n\nE. Mjolsness and W. Miranker. (1993). Greedy Lagrangians for neural networks: three levels of optimization in relaxation dynamics. Technical Report 945, Yale University, Department of Computer Science.\n\nC. Peterson and B. Soderberg. (1989). A new method for mapping optimization problems onto neural networks. International Journal of Neural Systems 1(1):3-22.\n\nP. Simard, Y. Le Cun, and J. Denker. (1993). Efficient pattern recognition using a new transformation distance. In S. Hanson, J. Cowan, and C. Giles (eds.), NIPS 5. Morgan Kaufmann, San Mateo, CA.\n\nA. L. Yuille and J. J. Kosowsky. (1992). Statistical physics algorithms that converge. Technical Report 92-7, Harvard Robotics Laboratory.
\n\n\f", "award": [], "sourceid": 838, "authors": [{"given_name": "Steven", "family_name": "Gold", "institution": null}, {"given_name": "Eric", "family_name": "Mjolsness", "institution": null}, {"given_name": "Anand", "family_name": "Rangarajan", "institution": null}]}