{"title": "The Method of Quantum Clustering", "book": "Advances in Neural Information Processing Systems", "page_first": 769, "page_last": 776, "abstract": null, "full_text": "The Method of Quantum Clustering\n\nDavid Horn and Assaf Gottlieb\nSchool of Physics and Astronomy\n\nRaymond and Beverly Sackler Faculty of Exact Sciences\n\nTel Aviv University, Tel Aviv 69978, Israel\n\nAbstract\n\nWe propose a novel clustering method that is an extension of ideas inher-\nent to scale-space clustering and support-vector clustering. Like the lat-\nter, it associates every data point with a vector in Hilbert space, and like\nthe former it puts emphasis on their total sum, that is equal to the scale-\nspace probability function. The novelty of our approach is the study of\nan operator in Hilbert space, represented by the Schr\u00a8odinger equation of\nwhich the probability function is a solution. This Schr\u00a8odinger equation\ncontains a potential function that can be derived analytically from the\nprobability function. We associate minima of the potential with cluster\ncenters. The method has one variable parameter, the scale of its Gaussian\nkernel. We demonstrate its applicability on known data sets. By limiting\nthe evaluation of the Schr\u00a8odinger potential to the locations of data points,\nwe can apply this method to problems in high dimensions.\n\n1 Introduction\n\nMethods of data clustering are usually based on geometric or probabilistic considerations\n[1, 2, 3]. The problem of unsupervised learning of clusters based on locations of points in\ndata-space, is in general ill de\ufb01ned. Hence intuition based on other \ufb01elds of study may be\nuseful in formulating new heuristic procedures. The example of [4] shows how intuition\nderived from statistical mechanics leads to successful results. 
Here we propose a model based on tools borrowed from quantum mechanics.

We start out with the scale-space algorithm of [5], which uses a Parzen-window estimator of the probability distribution based on the data. Using a Gaussian kernel, one generates from the data points x_i in a Euclidean space of dimension d a probability distribution given, up to an overall normalization, by the expression

    ψ(x) = Σ_i exp(−(x − x_i)² / 2σ²)        (1)

where the x_i are the data points. It seems quite natural [5] to associate maxima of this function with cluster centers.

The same kind of Gaussian kernel was the basis of another method, Support Vector Clustering (SVC) [6], which associates the data points x_i with vectors in an abstract Hilbert space.

Here we will also consider a Hilbert space, but, in contradistinction to kernel methods where the Hilbert space is implicit, we work with a Schrödinger equation that serves as the basic framework of the Hilbert space. Our method was introduced in [7] and is further expanded in this presentation. Its main emphasis is on the Schrödinger potential, whose minima will determine the cluster centers. This potential is part of the Schrödinger equation that ψ is a solution of.

2 The Schrödinger Potential

We define [7] the Schrödinger equation

    Hψ(x) ≡ (−(σ²/2)∇² + V(x)) ψ(x) = E ψ(x)        (2)

for which ψ(x) is a solution, or eigenstate.¹ The simplest case is that of a single Gaussian, when ψ represents a single point at x₁.
Then it turns out that

    V(x) = (1/2σ²) (x − x₁)².

This quadratic function, whose center lies at x₁, is known as the harmonic potential in quantum mechanics (see, e.g., [8]). Its eigenvalue E = d/2 is the lowest possible eigenvalue of H; hence the Gaussian function is said to describe the ground state of H.

Conventionally, in quantum mechanics, one is given V(x) and one searches for solutions, or eigenfunctions, ψ(x). Here we already have ψ(x), as determined by the data points; we ask therefore for the V(x) whose solution is the given ψ(x). This is easily obtained through

    V(x) = E + (σ²/2) ∇²ψ / ψ = E − d/2 + (1/2σ²ψ) Σ_i (x − x_i)² exp(−(x − x_i)² / 2σ²).        (3)

E is still left undefined. For this purpose we require V to be positive definite, i.e. min V = 0. This sets the value of

    E = −min (σ²/2) ∇²ψ / ψ        (4)

and determines V(x) uniquely. Using Eq. 3 it is easy to prove that

    0 < E ≤ d/2.        (5)

¹ H (the Hamiltonian) and V (the potential energy) are conventional quantum mechanical operators, rescaled so that H depends on one parameter, σ. E is a (rescaled) energy eigenvalue in quantum mechanics.

3 2D Examples

3.1 Crab Data

To show the power of our new method we discuss the crab data set taken from Ripley's book [9]. This data set is defined over a five-dimensional parameter space. When analyzed in terms of the 2nd and 3rd principal components of the correlation matrix, one observes a nice separation of the 200 instances into their four classes. We start therefore with this problem as our first test case. In Fig. 1 we show the data as well as the Parzen probability distribution ψ(x) using the width parameter σ = 1/√2. It is quite obvious that this width is not small enough to deduce the correct clustering according to the approach of [5]. Nonetheless, the potential displayed in Fig. 2 shows the required four minima for the same width parameter. Thus we conclude that the necessary information is already available; one needs, however, the quantum clustering approach to bring it out.

Figure 1: A plot of Roberts' probability distribution for Ripley's crab data [9] as defined over the 2nd and 3rd principal components of the correlation matrix. Using a Gaussian width of σ = 1/√2 we observe only one maximum. Different symbols label the four classes of data. (Axes: PC2, PC3.)

Figure 2: A plot of the Schrödinger potential for the same problem as Fig. 1. Here we clearly see the required four minima. The potential is plotted in units of E. (Axes: PC2, PC3; vertical axis V/E.)

Note in Fig. 2 that the potential grows quadratically outside the domain over which the data are located. This is a general property of Eq. 3. E sets the relevant scale over which one may look for structure of the potential. If the width is decreased, more structure is to be expected.
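The construction of Eqs. 1–4 can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation; the function names and the choice of evaluation points are ours:

```python
import numpy as np

def parzen_kernel(points, X, sigma):
    """Squared distances and Gaussian kernel values of Eq. 1."""
    d2 = ((points[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return d2, np.exp(-d2 / (2.0 * sigma ** 2))

def schrodinger_potential(points, X, sigma):
    """V(x) of Eq. 3 at `points`, with E fixed by Eq. 4 (min taken over `points`)."""
    d2, K = parzen_kernel(points, X, sigma)
    psi = K.sum(axis=1)                       # Parzen estimator of Eq. 1
    d = X.shape[1]
    # (sigma^2/2) * laplacian(psi)/psi = -d/2 + sum_i d2_i K_i / (2 sigma^2 psi)
    lap = -d / 2.0 + (d2 * K).sum(axis=1) / (2.0 * sigma ** 2 * psi)
    E = -lap.min()                            # Eq. 4: sets min V = 0
    return E + lap, E                         # V = E + (sigma^2/2) lap(psi)/psi
```

For a single data point this reproduces the harmonic potential V(x) = (x − x₁)²/2σ² with E = d/2, in agreement with the discussion above.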
Thus, for σ = 1/2, two more minima appear, as seen in Fig. 3. Nonetheless, they lie high and contain only a few data points. The major minima are the same as in Fig. 2.

3.2 Iris Data

Our second example consists of the iris data set [10], which is a standard benchmark obtainable from the UCI repository [11]. Here we use the first two principal components to define the two dimensions in which we apply our method. Fig. 4, which shows the case for σ = 1/4, provides an almost perfect separation of the 150 instances into the three classes to which they belong.

4 Application of Quantum Clustering

The examples displayed in the previous section show that, if the spatial representation of the data allows for meaningful clustering using geometric information, quantum clustering (QC) will do the job. There remain, however, several technical questions to be answered: What is the preferred choice of σ? How can QC be applied in high dimensions? How does one choose the appropriate space, or metric, in which to perform the analysis? We confront these issues in this section.

4.1 Varying σ

In the crab data we find that as σ is decreased to 1/2, the previous minima of V(x) get deeper and two new minima are formed. However, the latter are insignificant, in the sense that they lie at high values (of order E), as shown in Fig. 3. Thus, if we classify data points to clusters according to their topographic location on the surface of V(x), roughly the same clustering assignment is expected for σ = 1/2 as for σ = 1/√2. As σ is decreased further, more and more maxima are expected in ψ, and an ever increasing number of minima (limited by the number of data points) in V. By the way, the wave function ψ acquires only one additional maximum at σ = 1/2.

The one parameter of our problem, σ, signifies the distance scale that we probe. Accordingly we expect to find clusters relevant to proximity information of the same order of magnitude. One may therefore vary σ continuously and look for stability of cluster solutions, or limit oneself to relatively high values of σ and stop the search once a few clusters are uncovered.

4.2 Higher Dimensions

In the iris problem we obtained excellent clustering results using the first two principal components, whereas in the crab problem, clustering that correctly depicts the classification necessitates components 2 and 3. However, once this is realized, it does no harm to add the 1st component. This requires working in a 3-dimensional space, spanned by the three leading PCs. Calculating V(x) on a fine computational grid becomes a heavy task in high dimensions. To cut down complexity, we propose using the analytic expression of Eq. 3 and evaluating the potential on the data points only. This should be good enough to give a close estimate of where the minima lie, and it reduces the complexity to O(N²), irrespective of dimension.
In the gradient-descent algorithm described below, we will require further computations, also restricted to well defined locations in space.

Figure 3: The potential for the crab data with σ = 1/2 displays two additional, but insignificant, minima. The four deep minima are roughly at the same locations as in Fig. 2. (Axes: PC2, PC3; vertical axis V/E.)

Figure 4: Quantum clustering of the iris data for σ = 1/4 in a space spanned by the first two principal components. Different symbols represent the three classes. Equipotential lines of V are also drawn. (Axes: PC1, PC2.)

When restricted to the locations of the data points, ψ and V are evaluated on the discrete set {x_i} in terms of the distance matrix D_ij = |x_i − x_j|. We can then express V as

    V(x_i) = E − d/2 + (1/2σ²ψ(x_i)) Σ_j D²_ij exp(−D²_ij / 2σ²),   with   ψ(x_i) = Σ_j exp(−D²_ij / 2σ²),        (6)

with E chosen appropriately so that min V = 0.

All problems that we have used as examples were such that the data were given in some space, and we have exercised our freedom to define a metric, using the PCA approach, as the basis for distance calculations.
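A minimal sketch of Eq. 6, assuming the squared-distance matrix is all that is given (the function name and interface are ours, not part of the published method):

```python
import numpy as np

def potential_on_data(D2, sigma, d):
    """V of Eq. 6 at the data points, from squared distances D2[i, j] = |x_i - x_j|^2
    and the dimension d of the underlying space."""
    K = np.exp(-D2 / (2.0 * sigma ** 2))       # Gaussian kernel matrix
    psi = K.sum(axis=1)                        # psi evaluated at each data point
    lap = -d / 2.0 + (D2 * K).sum(axis=1) / (2.0 * sigma ** 2 * psi)
    return lap - lap.min()                     # subtracting min enforces min V = 0 (Eq. 4)
```

The cost is O(N²) regardless of dimension, as noted in the text, and the data points with the lowest V values serve as candidate cluster centers.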
The previous analysis tells us that QC can also be applied to data for which only distance information is known.

4.3 Principal Component Metrics

The QC algorithm starts from distance information. The question of how the distances are calculated is another - very important - piece of the clustering procedure. The PCA approach defines a metric that is intrinsic to the data, determined by their second order statistics. But even then, several possibilities exist, leading to non-equivalent results.

Principal component decomposition can be applied both to the correlation matrix and to the covariance matrix. Moreover, whitening normalization may be applied. The PCA approach that we have used is based on a whitened correlation matrix. This turns out to lead to the good separation of the crab data in PC2-PC3 and of the iris data in PC1-PC2. Since our aim was to convince the reader that once a good metric is found, QC conveys the correct information, we have used the best preprocessing before testing QC.

5 The Gradient Descent Algorithm

After discovering the cluster centers we are faced with the problem of allocating the data points to the different clusters. We propose using a gradient descent algorithm for this purpose. Defining y_i(0) = x_i, we define the process

    y_i(t + Δt) = y_i(t) − η(t) ∇V(y_i(t)),        (7)

letting the points y_i reach an asymptotic fixed value coinciding with a cluster center. More sophisticated minimum search algorithms, as given in chapter 10 of [12], may be used for faster convergence.

To demonstrate the results of this algorithm, as well as the application of QC to higher dimensions, we analyze the iris data in 4 dimensions. We use the original data space with only one modification: all axes are normalized to lie within a unified range of variation. The results are displayed in Fig. 5.
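The descent of Eq. 7 can be sketched as follows. The constant step size η, the iteration count, and the finite-difference gradient are simplifying choices of ours, not taken from the text (which allows a time-dependent η(t) and analytic gradients):

```python
import numpy as np

def descend(X, sigma, eta=0.1, steps=200, h=1e-4):
    """Eq. 7: y_i(t + dt) = y_i(t) - eta * grad V(y_i(t)), starting each
    replica y_i at its data point x_i (y_i(0) = x_i)."""
    d = X.shape[1]

    def V(points):
        # V of Eq. 3 up to an additive constant (E - d/2), irrelevant for gradients
        d2 = ((points[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
        K = np.exp(-d2 / (2.0 * sigma ** 2))
        psi = K.sum(axis=1)
        return (d2 * K).sum(axis=1) / (2.0 * sigma ** 2 * psi)

    Y = X.copy()
    for _ in range(steps):
        grad = np.empty_like(Y)
        for k in range(d):                     # central differences per coordinate
            e = np.zeros(d)
            e[k] = h
            grad[:, k] = (V(Y + e) - V(Y - e)) / (2.0 * h)
        Y -= eta * grad
    return Y                                   # replicas collapse onto potential minima
```

After convergence, grouping the fixed points Y by proximity yields the cluster assignment of the original data points.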
Fig. 5 shows different windows for the four different axes, within which we display the values of the points after descending the potential surface and reaching its minima; the corresponding V values are shown in the fifth window. These results are very satisfactory, having only 5 misclassifications. Applying QC to data space without normalization of the different axes leads to misclassifications of the order of 15 instances, similar to the clustering quality of [4].

Figure 5: The fixed points of the four-dimensional iris problem following the gradient-descent algorithm. The results show almost perfect clustering into the three families of 50 instances each. (Windows: dim 1 through dim 4, and V/E versus serial number.)

6 Discussion

In the literature of image analysis one often looks for the curve on which the Laplacian of the Gaussian filter of an image vanishes [13]. This is known as zero-crossing and serves as a measure of segmentation of the image. Its analogue in the scale-space approach is where ∇²ψ = 0. Clearly each such contour can also be viewed as surrounding maxima of the probability function, and therefore as representing some kind of cluster boundary, although different from the conventional one [5]. It is known that the number of such boundaries [13] is a non-decreasing function of σ. Comparison with Eq. 3 tells us that these are the V = E contours. Note that such contours can be read off Fig. 4: they are the contours on the periphery of this figure. Clearly they surround the data but do not give a satisfactory indication of where the clusters are.
Cluster cores are better defined by lower-lying equipotential curves of V in this figure. One may therefore speculate that equipotential levels of V may serve as alternatives to zero-crossing (∇²ψ = 0) curves in future applications to image analysis.

Image analysis is a 2-dimensional problem, in which differential operations have to be formulated and followed on a fine grid. Clustering is a problem that may occur in any number of dimensions. It is therefore important to develop a tool that can deal with it accordingly. Since the Schrödinger potential, the function that plays the major role in our analysis, has minima that lie in the neighborhood of data points, we find that it suffices to evaluate it at these points. This enables us to deal with clustering in high dimensional spaces. The results, such as the iris problem of Fig. 5, are very promising. They show that the basic idea, as well as the gradient-descent algorithm of data allocation to clusters, works well.

Quantum clustering does not presume any particular shape or any specific number of clusters. It can be used in conjunction with other clustering methods. Thus one may start with SVC to define outliers, which will be excluded from the construction of the QC potential. This would be one example where not all points are given the same weight in the construction of the Parzen probability distribution.

It may seem strange to see the Schrödinger equation in the context of machine learning. Its usefulness here is due to the fact that the two different terms of Eq. 2 have opposite effects on the wave function.
The potential represents the attractive force that tries to concentrate the distribution around its minima. The Laplacian has the opposite effect of spreading the wave function. In a clustering analysis we implicitly assume that two such effects exist. QC models them with the Schrödinger equation. Its success proves that this equation can serve as the basic tool of a clustering method.

References

[1] A.K. Jain and R.C. Dubes. Algorithms for Clustering Data. Prentice Hall, Englewood Cliffs, NJ, 1988.

[2] K. Fukunaga. Introduction to Statistical Pattern Recognition. Academic Press, San Diego, CA, 1990.

[3] R.O. Duda, P.E. Hart and D.G. Stork. Pattern Classification. Wiley-Interscience, 2nd ed., 2001.

[4] M. Blatt, S. Wiseman and E. Domany. Super-paramagnetic clustering of data. Phys. Rev. Letters 76:3251-3255, 1996.

[5] S.J. Roberts. Non-parametric unsupervised cluster analysis. Pattern Recognition, 30(2):261-272, 1997.

[6] A. Ben-Hur, D. Horn, H.T. Siegelmann, and V. Vapnik. A Support Vector Method for Clustering. In Advances in Neural Information Processing Systems 13: Proceedings of the 2000 Conference, Todd K. Leen, Thomas G. Dietterich and Volker Tresp, eds., MIT Press, 2001, pp. 367-373.

[7] David Horn and Assaf Gottlieb. Algorithm for Data Clustering in Pattern Recognition Problems Based on Quantum Mechanics. Phys. Rev. Lett. 88 (2002) 018702.

[8] S. Gasiorowicz. Quantum Physics. Wiley, 1996.

[9] B.D. Ripley. Pattern Recognition and Neural Networks. Cambridge University Press, Cambridge, UK, 1996.

[10] R.A. Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7:179-188, 1936.

[11] C.L. Blake and C.J. Merz. UCI repository of machine learning databases, 1998.

[12] W.H. Press, S.A. Teukolsky, W.T. Vetterling and B.P.
Flannery. Numerical Recipes: The Art of Scientific Computing, 2nd ed. Cambridge Univ. Press, 1992.

[13] A.L. Yuille and T.A. Poggio. Scaling theorems for zero crossings. IEEE Trans. Pattern Analysis and Machine Intelligence PAMI-8, 15-25, 1986.
", "award": [], "sourceid": 2083, "authors": [{"given_name": "David", "family_name": "Horn", "institution": null}, {"given_name": "Assaf", "family_name": "Gottlieb", "institution": null}]}