{"title": "K-Medoids For K-Means Seeding", "book": "Advances in Neural Information Processing Systems", "page_first": 5195, "page_last": 5203, "abstract": "We show experimentally that the algorithm CLARANS of Ng and Han (1994) finds better K-medoids solutions than the Voronoi iteration algorithm of Hastie et al. (2001). This finding, along with the similarity between the Voronoi iteration algorithm and Lloyd's K-means algorithm, motivates us to use CLARANS as a K-means initializer. We show that CLARANS outperforms other algorithms on 23/23 datasets with a mean decrease over k-means++ of 30% for initialization mean squared error (MSE) and 3% for final MSE. We introduce algorithmic improvements to CLARANS which improve its complexity and runtime, making it a viable initialization scheme for large datasets.", "full_text": "K-Medoids for K-Means Seeding

James Newling
Idiap Research Institute and
École polytechnique fédérale de Lausanne
james.newling@idiap.ch

François Fleuret
Idiap Research Institute and
École polytechnique fédérale de Lausanne
francois.fleuret@idiap.ch

Abstract

We show experimentally that the algorithm clarans of Ng and Han (1994) finds better K-medoids solutions than the Voronoi iteration algorithm of Hastie et al. (2001). This finding, along with the similarity between the Voronoi iteration algorithm and Lloyd's K-means algorithm, motivates us to use clarans as a K-means initializer. We show that clarans outperforms other algorithms on 23/23 datasets with a mean decrease over k-means++ (Arthur and Vassilvitskii, 2007) of 30% for initialization mean squared error (MSE) and 3% for final MSE. 
We introduce\nalgorithmic improvements to clarans which improve its complexity and runtime,\nmaking it a viable initialization scheme for large datasets.\n\n1\n\nIntroduction\n\n1.1 K-means and K-medoids\n\nThe K-means problem is to \ufb01nd a partitioning of points, so as to minimize the sum of the squares\nof the distances from points to their assigned partition\u2019s mean. In general this problem is NP-hard,\nand in practice approximation algorithms are used. The most popular of these is Lloyd\u2019s algorithm,\nhenceforth lloyd, which alternates between freezing centers and assignments, while updating the\nother. Speci\ufb01cally, in the assignment step, for each point the nearest (frozen) center is determined.\nThen during the update step, each center is set to the mean of points assigned to it. lloyd has\napplications in data compression, data classi\ufb01cation, density estimation and many other areas, and\nwas recognised in Wu et al. (2008) as one of the top-10 algorithms in data mining.\nThe closely related K-medoids problem differs in that the center of a cluster is its medoid, not its\nmean, where the medoid is the cluster member which minimizes the sum of dissimilarities between\nitself and other cluster members. In this paper, as our application is K-means initialization, we focus\non the case where dissimilarity is squared distance, although K-medoids generalizes to non-metric\nspaces and arbitrary dissimilarity measures, as discussed in \u00a7SM-A.\nBy modifying the update step in lloyd to compute medoids instead of means, a viable K-medoids\nalgorithm is obtained. This algorithm has been proposed at least twice (Hastie et al., 2001; Park and\nJun, 2009) and is often referred to as the Voronoi iteration algorithm. We refer to it as medlloyd.\nAnother K-medoids algorithm is clarans of Ng and Han (1994, 2002), for which there is no direct\nK-means equivalent. 
It works by randomly proposing swaps between medoids and non-medoids, accepting only those which decrease MSE. We will discuss how clarans works, what advantages it has over medlloyd, and our motivation for using it for K-means initialization in §2 and §SM-A.

1.2 K-means initialization

lloyd is a local algorithm, in that far removed centers and points do not directly influence each other. This property contributes to lloyd's tendency to terminate in poor minima if not well initialized. Good initialization is key to guaranteeing that the refinement performed by lloyd is done in the vicinity of a good solution; an example showing this is given in Figure 1.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

Figure 1: N = 3 points, to be partitioned into K = 2 clusters with lloyd, with two possible initializations (top) and their solutions (bottom). Colors denote clusters, stars denote samples, rings denote means. Initialization with clarans enables jumping between the initializations on the left and right, ensuring that when lloyd eventually runs it avoids the local minimum on the left.

In the comparative study of K-means initialization methods of Celebi et al. (2013), 8 schemes are tested across a wide range of datasets. Comparison is done in terms of speed (time to run initialization+lloyd) and energy (final MSE). They find that 3/8 schemes should be avoided, due to poor performance. One of these schemes is uniform initialization, henceforth uni, where K samples are randomly selected to initialize centers. 
Of the remaining 5/8 schemes, there is no clear best, with\nresults varying across datasets, but the authors suggest that the algorithm of Bradley and Fayyad\n(1998), henceforth bf, is a good choice.\nThe bf scheme of Bradley and Fayyad (1998) works as follows. Samples are separated into J\n(= 10) partitions. lloyd with uni initialization is performed on each of the partitions, providing J\ncentroid sets of size K. A superset of JK elements is created by concatenating the J center sets.\nlloyd is then run J times on the superset, initialized at each run with a distinct center set. The\ncenter set which obtains the lowest MSE on the superset is taken as the \ufb01nal initializer for the \ufb01nal\nrun of lloyd on all N samples.\nProbably the most widely implemented initialization scheme other than uni is k-means++ (Arthur\nand Vassilvitskii, 2007), henceforth km++. Its popularity stems from its simplicity, low computa-\ntional complexity, theoretical guarantees, and strong experimental support. The algorithm works by\nsequentially selecting K seeding samples. At each iteration, a sample is selected with probability\nproportional to the square of its distance to the nearest previously selected sample.\nThe work of Bachem et al. (2016) focused on developing sampling schemes to accelerate km++,\nwhile maintaining its theoretical guarantees. Their algorithm afk-mc2 results in as good initializa-\ntions as km++, while using only a small fraction of the KN distance calculations required by km++.\nThis reduction is important for massive datasets.\nIn none of the 4 schemes discussed is a center ever replaced once selected. Such re\ufb01nement is only\nperformed during the running of lloyd. 
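The km++ seeding rule described above, where each new center is sampled with probability proportional to the squared distance to the nearest already-selected center, can be sketched in a few lines. This is an illustrative NumPy sketch with naming of our own, not the optimised implementation used in the experiments:

```python
import numpy as np

def kmeanspp_seed(X, K, seed=0):
    """Select K seeding samples from X (N x d) by D^2 sampling, as in km++."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    centers = [int(rng.integers(N))]  # first center chosen uniformly
    # squared distance from every sample to its nearest selected center
    d2 = ((X - X[centers[0]]) ** 2).sum(axis=1)
    for _ in range(K - 1):
        # sample the next center with probability proportional to d2
        c = int(rng.choice(N, p=d2 / d2.sum()))
        centers.append(c)
        d2 = np.minimum(d2, ((X - X[c]) ** 2).sum(axis=1))
    return np.array(centers)
```

Each of the K rounds costs N distance calculations, which is the KN total mentioned above.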
In this paper we show that performing re\ufb01nement during\ninitialization with clarans, before the \ufb01nal lloyd re\ufb01nement, signi\ufb01cantly lowers K-means MSEs.\n\n1.3 Our contribution and paper summary\n\nWe compare the K-medoids algorithms clarans and medlloyd, \ufb01nding that clarans \ufb01nds better\nlocal minima, in \u00a73 and \u00a7SM-A. We offer an explanation for this, which motivates the use of\nclarans for initializing lloyd (Figure 2). We discuss the complexity of clarans, and brie\ufb02y\nshow how it can be optimised in \u00a74, with a full presentation of acceleration techniques in \u00a7SM-D.\nMost signi\ufb01cantly, we compare clarans with methods uni, bf, km++ and afk-mc2 for K-means\ninitialization, and show that it provides signi\ufb01cant reductions in initialization and \ufb01nal MSEs in\n\u00a75. We thus provide a conceptually simple initialization scheme which is demonstrably better than\nkm++, which has been the de facto initialization method for one decade now.\nOur source code at https://github.com/idiap/zentas is available under an open source li-\ncense. It consists of a C++ library with Python interface, with several examples for diverse data types\n(sequence data, sparse and dense vectors), metrics (Levenshtein, l1, etc.) and potentials (quadratic\nas in K-means, logarithmic, etc.).\n\n1.4 Other Related Works\n\nAlternatives to lloyd have been considered which resemble the swapping approach of clarans.\nOne is by Hartigan (1975), where points are randomly selected and reassigned. Telgarsky and\n\n2\n\n\fVattani (2010) show how this heuristic can result in better clustering when there are few points per\ncluster.\nThe work most similar to clarans in the K-means setting is that of Kanungo et al. (2002), where\nit is indirectly shown that clarans \ufb01nds a solution within a factor 25 of the optimal K-medoids\nclustering. 
The local search approximation algorithm they propose is a hybrid of clarans and lloyd, alternating between the two, with sampling from a kd-tree during the clarans-like step. Their source code includes an implementation of an algorithm they call 'Swap', which is exactly the clarans algorithm of Ng and Han (1994).

2 Two K-medoids algorithms

Like km++ and afk-mc2, K-medoids generalizes beyond the standard K-means setting of Euclidean metric with quadratic potential, but we consider only the standard setting in the main body of this paper, referring the reader to SM-A for a more general presentation. In Algorithm 1, medlloyd is presented. It is essentially lloyd with the update step modified for K-medoids.

Algorithm 1 two-step iterative medlloyd algorithm (in vector space with quadratic potential).
1: Initialize center indices c(k), as distinct elements of {1, . . . , N}, where index k ∈ {1, . . . , K}.
2: do
3:   for i = 1 : N do
4:     a(i) ← arg min_{k ∈ {1,...,K}} ‖x(i) − x(c(k))‖²
5:   end for
6:   for k = 1 : K do
7:     c(k) ← arg min_{i : a(i)=k} Σ_{i′ : a(i′)=k} ‖x(i) − x(i′)‖²
8:   end for
9: while c(k) changed for at least one k

Algorithm 2 swap-based clarans algorithm (in a vector space and with quadratic potential).
1: nr ← 0
2: Initialize center indices C ⊂ {1, . . . , N}
3: ψ− ← Σ_{i=1..N} min_{i′ ∈ C} ‖x(i) − x(i′)‖²
4: while nr ≤ Nr do
5:   sample i− ∈ C and i+ ∈ {1, . . . , N} \ C
6:   ψ+ ← Σ_{i=1..N}
7:        min_{i′ ∈ C \ {i−} ∪ {i+}} ‖x(i) − x(i′)‖²
8:   if ψ+ < ψ− then
9:     C ← C \ {i−} ∪ {i+}
10:    nr ← 0, ψ− ← ψ+
11:  else
12:    nr ← nr + 1
13:  end if
14: end while

In Algorithm 2, clarans is presented. Following a random initialization of the K centers (line 2), it proceeds by repeatedly proposing a random swap (line 5) between a center (i−) and a non-center (i+). If a swap results in a reduction in energy (line 8), it is implemented (line 9). clarans terminates when Nr consecutive proposals have been rejected. Alternative stopping criteria could be number of accepted swaps, rate of energy decrease or time. We use Nr = K² throughout, as this makes proposals between all pairs of clusters probable, assuming balanced cluster sizes.

clarans was not the first swap-based K-medoids algorithm, being preceded by pam and clara of Kaufman and Rousseeuw (1990). It can however provide better complexity than other swap-based algorithms if certain optimisations are used, as discussed in §4.

When updating centers in lloyd and medlloyd, assignments are frozen. In contrast, with swap-based algorithms such as clarans, assignments change along with the medoid index being changed (i− to i+). As a consequence, swap-based algorithms look one step further ahead when computing MSEs, which helps them escape from the minima of medlloyd. This is described in Figure 2.

3 A Simple Simulation Study for Illustration

We generate simple 2-D data, and compare medlloyd, clarans, and baseline K-means initializers km++ and uni, in terms of MSEs. The data is described in Figure 3, where sample initializations are also presented. 
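The swap loop of Algorithm 2 can also be sketched directly. This is an illustrative, unaccelerated NumPy sketch with naming of our own; every proposal evaluates the energy over all N samples, and the optimisations of §4 are not included:

```python
import numpy as np

def medoid_energy(X, C):
    """Sum over all samples of squared distance to the nearest medoid in C."""
    M = X[sorted(C)]
    d2 = ((X[:, None, :] - M[None, :, :]) ** 2).sum(axis=2)
    return float(d2.min(axis=1).sum())

def clarans_seed(X, K, seed=0):
    """Random swaps: accept a (medoid, non-medoid) swap iff it lowers the
    energy; stop after K^2 consecutive rejections, as in Algorithm 2."""
    rng = np.random.default_rng(seed)
    N, Nr = X.shape[0], K * K
    C = set(int(i) for i in rng.choice(N, size=K, replace=False))
    psi = medoid_energy(X, C)  # current energy, psi^-
    n_r = 0
    while n_r <= Nr:
        i_out = int(rng.choice(sorted(C)))                 # i^-
        i_in = int(rng.choice(sorted(set(range(N)) - C)))  # i^+
        C_new = (C - {i_out}) | {i_in}
        psi_new = medoid_energy(X, C_new)                  # psi^+
        if psi_new < psi:  # swap accepted: implement it
            C, psi, n_r = C_new, psi_new, 0
        else:              # swap rejected
            n_r += 1
    return np.array(sorted(C))
```

Each evaluation here costs O(NK) distance calculations, i.e. level -2 in the terminology of §4; the point of the accelerations discussed there is to reduce exactly this cost.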
Results in Figure 4 show that clarans provides significantly lower MSEs than medlloyd, an observation which generalizes across data types (sequence, sparse, etc), metrics (Levenshtein, l∞, etc), and potentials (exponential, logarithmic, etc), as shown in Appendix SM-A.

Figure 2: Example with N = 7 samples, of which K = 2 are medoids. Current medoid indices are 1 and 4. Using medlloyd, this is a local minimum, with final clusters {x(1)}, and the rest. clarans may consider swap (i−, i+) = (4, 7) and so escape to a lower MSE. The key to swap-based algorithms is that cluster assignments are never frozen. Specifically, when considering the swap of x(4) and x(7), clarans assigns x(2), x(3) and x(4) to the cluster of x(1) before computing the new MSE.

Figure 3: (Column 1) Simulated data in R². For each cluster center g ∈ {0, . . . , 19}², 100 points are drawn from N(g, σ²I), illustrated here for σ ∈ {2^-6, 2^-4, 2^-2}. (Columns 2, 3, 4, 5) Sample initializations. We observe 'holes' for methods uni, medlloyd and km++. clarans successfully fills holes by removing distant, under-utilised centers. The spatial correlation of medlloyd's holes is due to its locality of updating.

4 Complexity and Accelerations

lloyd requires KN distance calculations to update K centers, assuming no acceleration technique such as that of Elkan (2003) is used. The cost of several iterations of lloyd outweighs initialization with any of uni, km++ and afk-mc2. We ask if the same is true with clarans initialization, and find that the answer depends on how clarans is implemented. clarans as presented in Ng and Han (1994) is O(N²) in computation and memory, making it unusable for large datasets. 
To make clarans scalable, we have investigated ways of implementing it in O(N) memory, and devised optimisations which make its complexity equivalent to that of lloyd.

clarans consists of two main steps. The first is swap evaluation (line 6) and the second is swap implementation (scope of if-statement at line 8). Proposing a good swap becomes less probable as MSE decreases, thus as the number of swap implementations increases the number of consecutive rejected proposals (nr) is likely to grow large, as illustrated in Figure 5. This results in a larger fraction of time being spent in the evaluation step.

Figure 4: Results on simulated data. For 400 values of σ ∈ [2^-10, 2^-1], initialization (left) and final (right) MSEs relative to true cluster variances. For σ ∈ [2^-5, 2^-2] km++ never results in minimal MSE (MSE/σ² = 1), while clarans does for all σ. Initialization MSE with medlloyd is on average 4 times lower than with uni, but most of this improvement is regained when lloyd is subsequently run (final MSE/σ²).

Figure 5: The number of consecutive swap proposal rejections (evaluations) before one is accepted (implementations), for simulated data (§3) with σ = 2^-4.

We will now discuss optimisations in order of increasing algorithmic complexity, presenting their computational complexities in terms of evaluation and implementation steps. 
The explanations here\nare high level, with algorithmic details and pseudocode deferred to \u00a7SM-D.\nLevel -2 To evaluate swaps (line 6), simply compute all KN distances.\nLevel -1 Keep track of nearest centers. Now to evaluate a swap, samples whose nearest center is\nx(i\u2212) need distances to all K samples indexed by C \\ {i\u2212} \u222a {i+} computed in order to determine\nthe new nearest. Samples whose nearest is not x(i\u2212) only need the distance to x(i+) computed to\ndetermine their nearest, as either, (1) their nearest is unchanged, or (2) it is x(i+).\nLevel 0 Also keep track of second nearest centers, as in the implementation of Ng and Han (1994),\nwhich recall is O(N 2) in memory and computes all distances upfront. Doing so, nearest centers\ncan be determined for all samples by computing distances to x(i+). If swap (i\u2212, i+) is accepted,\nsamples whose new nearest is x(i+) require K distance calculations to recompute second nearests.\nThus from level -1 to 0, computation is transferred from evaluation to implementation, which is\ngood, as implementation is less frequently performed, as illustrated in Figure 5.\nLevel 1 Also keep track, for each cluster center, of the distance to the furthest cluster member\nas well as the maximum, over all cluster members, of the minimum distance to another center.\nUsing the triangle inequality, one can then frequently eliminate computation for clusters which are\nunchanged by proposed swaps with just a single center-to-center distance calculation. Note that\nusing the triangle inequality requires that the K-medoids dissimilarity is metric based, as is the case\nin the K-means initialization setting.\nLevel 2 Also keep track of center-to-center distances. 
This allows whole clusters to be tagged as unchanged by a swap, without computing any distances in the evaluation step.

We have also considered optimisations which, unlike levels -2 to 2, do not result in the exact same clustering as clarans, but provide additional acceleration. One such optimisation uses random sub-sampling to evaluate proposals, which helps significantly when N/K is large. Another optimisation, which is effective during initial rounds, is to not implement the first MSE reducing swap found, but to rather continue searching for approximately as long as swap implementation takes, thus balancing time between searching for (evaluation) and implementing swaps. Details can be found in §SM-D.3.

The computational complexities of these optimisations are given in Table 1. Proofs of these complexities rely on there being O(N/K) samples changing their nearest or second nearest center during a swap. In other words, for any two clusters of sizes n1 and n2, we assume n1 = Ω(n2). Using level 2 complexities, we see that if a fraction p(C) of proposals reduce MSE, then the expected complexity is O(N(1 + 1/(p(C)K))). One cannot marginalise C out of the expectation, as C may have no MSE reducing swaps, that is p(C) = 0. If p(C) is Ω(1/K), we obtain complexity O(N) per swap, which is equivalent to the O(KN) for K center updates of lloyd. In Table 2, we consider run times and distance calculation counts on simulated data at the various levels of optimisation.

5 Results

We first compare clarans with uni, km++, afk-mc2 and bf on the first 23 publicly available datasets in Table 3 (datasets 1-23). As noted in Celebi et al. (2013), it is common practice to run initialization+lloyd several times and retain the solution with the lowest MSE. In Bachem et al. (2016) methods are run a fixed number of times, and mean MSEs are compared. 
However, when comparing minimum MSEs over several runs, one should take into account that methods vary in their time requirements.

level                               -2     -1     0      1          2
1 evaluation                        NK     N      N      N/K + K    N/K
1 implementation                    1      1      N      N          N
K² evaluations, K implementations   K³N    K²N    K²N    NK + K³    KN
memory                              N      N      N      N          N + K²

Table 1: The complexities at different levels of optimisation of evaluation and implementation, in terms of required distance calculations, and overall memory. We see at level 2 that to perform K² evaluations and K implementations is O(KN), equivalent to lloyd.

level          -2     -1     0      1      2
log2(# dcs)    44.1   36.5   35.5   29.4   26.7
time [s]       -      -      407    19.2   15.6

Table 2: Total number of distance calculations (# dcs) and time required by clarans on simulation data of §3 with σ = 2^-4 at different optimisation levels.

#    dataset    N        dim    K     TL [s]
1    a1         3000     2      40    1.94
2    a2         5250     2      70    1.37
3    a3         7500     2      100   1.69
4    birch1     100000   2      200   21.13
5    birch2     100000   2      200   15.29
6    birch3     100000   2      200   16.38
7    ConfLong   164860   3      22    30.74
8    dim032     1024     32     32    1.13
9    dim064     1024     64     32    1.19
10   dim1024    1024     1024   32    7.68
11   europe     169308   2      1000  166.08
12   housec8    34112    3      400   18.71
13   KDD*       145751   74     200   998.83
14   mnist      10000    784    300   233.48
15   Mopsi      13467    2      100   2.14
16   rna*       20000    8      200   6.84
17   s1         5000     2      30    1.20
18   s2         5000     2      30    1.50
19   s3         5000     2      30    1.39
20   s4         5000     2      30    1.44
21   song*      20000    90     200   71.10
22   susy*      20000    18     200   24.50
23   yeast      1484     8      40    1.23

Table 3: The 23 datasets. Column 'TL' is time allocated to run with each initialization scheme, so that no new runs start after TL elapsed seconds. 
The starred datasets are those used in Bachem et al. (2016); the remainder are available at https://cs.joensuu.fi/sipu/datasets.

Rather than run each method a fixed number of times, we therefore run each method as many times as possible in a given time limit, 'TL'. This dataset dependent time limit, given by column TL in Table 3, is taken as 80× the time of a single run of km+++lloyd. The numbers of runs completed in time TL by each method are in columns 1-5 of Table 4. Recall that our stopping criterion for clarans is K² consecutively rejected swap proposals. We have also experimented with stopping criteria based on run time and on the number of swaps implemented, but find that stopping based on the number of rejected swaps best guarantees convergence. We use K² rejections for simplicity, although we have found that fewer than K² are in general needed to obtain minimal MSEs.

We use the fast lloyd implementation accompanying Newling and Fleuret (2016) with the 'auto' flag set to select the best exact accelerated algorithm, and run until complete convergence. For initializations, we use our own C++/Cython implementation of level 2 optimised clarans, the implementation of afk-mc2 of Bachem et al. (2016), and km++ and bf of Newling and Fleuret (2016). The objective of Bachem et al. (2016) was to prove and experimentally validate that afk-mc2 produces initialization MSEs equivalent to those of km++, and as such lloyd was not run during their experiments. We consider both initialization MSE, as in Bachem et al. (2016), and final MSE after lloyd has run. The latter is particularly important, as it is the objective we wish to minimize in the K-means problem.

In addition to considering initialization and final MSEs, we also distinguish between mean and minimum MSEs. 
We believe the latter is important as it captures the varying time requirements, and as mentioned it is common to run lloyd several times and retain the lowest MSE clustering. In Table 4 we consider two MSEs, namely mean initialization MSE and minimum final MSE.

[Table 4: per-dataset values for the number of completed runs (km++, afk-mc2, uni, bf, clarans), mean initial MSE (km++, afk-mc2, uni, clarans) and minimum final MSE (km++, afk-mc2, uni, bf, clarans); the numeric entries are not recoverable from this extraction.]

Table 4: Summary of results on the 23 datasets (rows). Columns 1 to 5 contain the number of initialization+lloyd runs completed in time limit TL. Columns 6 to 14 contain MSEs relative to the mean initialization MSE of km++. Columns 6 to 9 are mean MSEs after initialization but before lloyd, and columns 10 to 14 are minimum MSEs after lloyd. The final row (gm) contains geometric means of all columns. clarans consistently obtains the lowest across all MSE measurements, and has a 30% lower initialization MSE than km++ and afk-mc2, and a 3% lower final minimum MSE.

Figure 6: Initialization (above) and final (below) MSEs for km++ (left bars) and clarans (right bars), with minimum (1), mean (2) and mean + standard deviation (3) of MSE across all runs. For all initialization MSEs and most final MSEs, the lowest km++ MSE is several standard deviations higher than the mean clarans MSE.

5.1 Baseline performance

We briefly discuss findings related to algorithms uni, bf, afk-mc2 and km++. Results in Table 4 corroborate the previously established finding that uni is vastly outperformed by km++, both in initialization and final MSEs. Table 4 results also agree with the finding of Bachem et al. (2016) that initialization MSEs with afk-mc2 are indistinguishable from those of km++, and moreover that final MSEs are indistinguishable. We observe in our experiments that runs with km++ are faster than those with afk-mc2 (columns 1 and 2 of Table 4). 
We attribute this to the fast blas-based km++ implementation of Newling and Fleuret (2016).

Our final baseline finding is that MSEs obtained with bf are in general no better than those with uni. This is not in strict agreement with the findings of Celebi et al. (2013). We attribute this discrepancy to the fact that experiments in Celebi et al. (2013) are in the low K regime (K < 50, N/K > 100). Note that Table 4 does not contain initialization MSEs for bf, as bf does not initialize with data points but with means of sub-samples, and it would thus not make sense to compare bf initialization with the 4 seeding methods.

5.2 clarans performance

Having established that the best baselines are km++ and afk-mc2, and that they provide clusterings of indistinguishable quality, we now focus on the central comparison of this paper: that between km++ and clarans. In Figure 6 we present bar plots summarising all runs on all 23 datasets. We observe a very low variance in the initialization MSEs of clarans. We speculatively hypothesize that clarans often finds a globally minimal initialization. Figure 6 shows that clarans provides significantly lower initialization MSEs than km++.

The final MSEs are also significantly better when initialization is done with clarans, although the gap in MSE between clarans and km++ is reduced once lloyd has run. Note, as seen in Table 4, that all 5 initializations for dataset 7 result in equally good clusterings.

As a supplementary experiment, we considered initialising with km++ and clarans in series, thus using the three stage clustering km+++clarans+lloyd. We find that this can be slightly faster than just clarans+lloyd, with identical MSEs. Results of this experiment are presented in §SM-I. 
We perform a final experiment in §SM-I, measuring the dependence of the improvement on K, where we see that the improvement is most significant for large K.

6 Conclusion and Future Works

In this paper, we have demonstrated the effectiveness of the algorithm clarans at solving the K-medoids problem. We have described techniques for accelerating clarans, and most importantly shown that clarans works very effectively as an initializer for lloyd, outperforming other initialization schemes, such as km++, on 23 datasets.

An interesting direction for future work might be to develop further optimisations for clarans. One idea could be to use importance sampling to rapidly obtain good estimates of post-swap energies. Another might be to propose two swaps simultaneously, as considered in Kanungo et al. (2002), which could potentially lead to even better solutions, although we have hypothesized that clarans is already finding globally optimal initializations.

All source code is made available under a public license. It consists of generic C++ code which can be extended to various data types and metrics, compiling to a shared library with extensions in Cython for a Python interface. It can currently be found in the git repository https://github.com/idiap/zentas.

Acknowledgments

James Newling was funded by the Hasler Foundation under the grant 13018 MASH2.

References

Arthur, D. and Vassilvitskii, S. (2007). K-means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '07, pages 1027–1035, Philadelphia, PA, USA. Society for Industrial and Applied Mathematics.

Bachem, O., Lucic, M., Hassani, S. H., and Krause, A. (2016). Fast and provably good seedings for k-means. In Neural Information Processing Systems (NIPS).

Bradley, P. S. and Fayyad, U. M. (1998). Refining initial points for k-means clustering. 
In Proceed-\nings of the Fifteenth International Conference on Machine Learning, ICML \u201998, pages 91\u201399,\nSan Francisco, CA, USA. Morgan Kaufmann Publishers Inc.\n\nCelebi, M. E., Kingravi, H. A., and Vela, P. A. (2013). A comparative study of ef\ufb01cient initialization\n\nmethods for the k-means clustering algorithm. Expert Syst. Appl., 40(1):200\u2013210.\n\nElkan, C. (2003). Using the triangle inequality to accelerate k-means. In Machine Learning, Pro-\nceedings of the Twentieth International Conference (ICML 2003), August 21-24, 2003, Washing-\nton, DC, USA, pages 147\u2013153.\n\nHartigan, J. A. (1975). Clustering Algorithms. John Wiley & Sons, Inc., New York, NY, USA, 99th\n\nedition.\n\nHastie, T. J., Tibshirani, R. J., and Friedman, J. H. (2001). The elements of statistical learning : data\n\nmining, inference, and prediction. Springer series in statistics. Springer, New York.\n\nKanungo, T., Mount, D. M., Netanyahu, N. S., Piatko, C. D., Silverman, R., and Wu, A. Y. (2002).\nA local search approximation algorithm for k-means clustering. In Proceedings of the Eighteenth\nAnnual Symposium on Computational Geometry, SCG \u201902, pages 10\u201318, New York, NY, USA.\nACM.\n\nKaufman, L. and Rousseeuw, P. J. (1990). Finding groups in data : an introduction to cluster\nanalysis. Wiley series in probability and mathematical statistics. Wiley, New York. A Wiley-\nInterscience publication.\n\nLewis, D. D., Yang, Y., Rose, T. G., and Li, F. (2004). Rcv1: A new benchmark collection for text\n\ncategorization research. Journal of Machine Learning Research, 5:361\u2013397.\n\nNewling, J. and Fleuret, F. (2016). Fast k-means with accurate bounds.\nInternational Conference on Machine Learning (ICML), pages 936\u2013944.\n\nIn Proceedings of the\n\nNg, R. T. and Han, J. (1994). Ef\ufb01cient and effective clustering methods for spatial data mining. 
In\nProceedings of the 20th International Conference on Very Large Data Bases, VLDB \u201994, pages\n144\u2013155, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.\n\nNg, R. T. and Han, J. (2002). Clarans: A method for clustering objects for spatial data mining. IEEE\n\nTransactions on Knowledge and Data Engineering, pages 1003\u20131017.\n\nPark, H.-S. and Jun, C.-H. (2009). A simple and fast algorithm for k-medoids clustering. Expert\n\nSyst. Appl., 36(2):3336\u20133341.\n\nTelgarsky, M. and Vattani, A. (2010). Hartigan\u2019s method: k-means clustering without voronoi. In\n\nAISTATS, volume 9 of JMLR Proceedings, pages 820\u2013827. JMLR.org.\n\nWu, X., Kumar, V., Quinlan, J. R., Ghosh, J., Yang, Q., Motoda, H., McLachlan, G., Ng, A., Liu,\nB., Yu, P., Zhou, Z.-H., Steinbach, M., Hand, D., and Steinberg, D. (2008). Top 10 algorithms in\ndata mining. Knowledge and Information Systems, 14(1):1\u201337.\n\nYujian, L. and Bo, L. (2007). A normalized levenshtein distance metric. IEEE Trans. Pattern Anal.\n\nMach. Intell., 29(6):1091\u20131095.\n\n9\n\n\f", "award": [], "sourceid": 2674, "authors": [{"given_name": "James", "family_name": "Newling", "institution": "Idiap Research Institute & EPFL"}, {"given_name": "Fran\u00e7ois", "family_name": "Fleuret", "institution": "Idiap Research Institute"}]}