{"title": "Clustering with Same-Cluster Queries", "book": "Advances in Neural Information Processing Systems", "page_first": 3216, "page_last": 3224, "abstract": "We propose a Semi-Supervised Active Clustering (SSAC) framework, where the learner is allowed to interact with a domain expert, asking whether two given instances belong to the same cluster or not. We study the query and computational complexity of clustering in this framework. We consider a setting where the expert conforms to a center-based clustering with a notion of margin. We show that there is a trade-off between computational complexity and query complexity; we prove that for the case of $k$-means clustering (i.e., when the expert conforms to a solution of $k$-means), having access to relatively few such queries allows efficient solutions to otherwise NP-hard problems. In particular, we provide a probabilistic polynomial-time (BPP) algorithm for clustering in this setting that asks $O\\big(k^2\\log k + k\\log n\\big)$ same-cluster queries and runs with time complexity $O\\big(kn\\log n\\big)$ (where $k$ is the number of clusters and $n$ is the number of instances). The success of the algorithm is guaranteed for data satisfying the margin condition under which, without queries, we show that the problem is NP-hard. We also prove a lower bound on the number of queries needed to have a computationally efficient clustering algorithm in this setting.", "full_text": "Clustering with Same-Cluster Queries\n\nHassan Ashtiani, Shrinu Kushagra and Shai Ben-David\n\nDavid R. Cheriton School of Computer Science\n\nUniversity of Waterloo,\n\nWaterloo, Ontario, Canada\n\n{mhzokaei,skushagr,shai}@uwaterloo.ca\n\nAbstract\n\nWe propose a Semi-Supervised Active Clustering (SSAC) framework,\nwhere the learner is allowed to interact with a domain expert, asking\nwhether two given instances belong to the same cluster or not. 
We study the query\nand computational complexity of clustering in this framework. We consider a\nsetting where the expert conforms to a center-based clustering with a notion of\nmargin. We show that there is a trade-off between computational complexity and\nquery complexity; we prove that for the case of k-means clustering (i.e., when the\nexpert conforms to a solution of k-means), having access to relatively few such\nqueries allows efficient solutions to otherwise NP-hard problems.\nIn particular, we provide a probabilistic polynomial-time (BPP) algorithm for\nclustering in this setting that asks $O(k^2 \\log k + k \\log n)$ same-cluster queries and\nruns with time complexity $O(kn \\log n)$ (where k is the number of clusters and\nn is the number of instances). The algorithm succeeds with high probability for\ndata satisfying margin conditions under which, without queries, we show that the\nproblem is NP-hard. We also prove a lower bound on the number of queries needed\nto have a computationally efficient clustering algorithm in this setting.\n\n1 Introduction\n\nClustering is a challenging task particularly due to two impediments. The first problem is that\nclustering, in the absence of domain knowledge, is usually an under-specified task; the solution\nof choice may vary significantly between different intended applications. The second one is that\nperforming clustering under many natural models is computationally hard.\nConsider the task of dividing the users of an online shopping service into different groups. The result\nof this clustering can then be used for example in suggesting similar products to the users in the same\ngroup, or for organizing data so that it would be easier to read/analyze the monthly purchase reports.\nThose different applications may result in conflicting solution requirements. 
In such cases, one needs\nto exploit domain knowledge to better define the clustering problem.\nAside from trial and error, a principled way of extracting domain knowledge is to perform clustering\nusing a form of \u2018weak\u2019 supervision. For example, Balcan and Blum [BB08] propose to use an\ninteractive framework with \u2018split/merge\u2019 queries for clustering.\nIn another work, Ashtiani and\nBen-David [ABD15] require the domain expert to provide the clustering of a \u2018small\u2019 subset of data.\nAt the same time, mitigating the computational problem of clustering is critical. Solving most of\nthe common optimization formulations of clustering is NP-hard (in particular, solving the popular\nk-means and k-median clustering problems). One approach to address this issue is to exploit the\nfact that natural data sets usually exhibit some nice properties and are likely to avoid the worst-case\nscenarios. In such cases, an optimal solution to clustering may be found efficiently. The quest for notions\nof niceness that are likely to occur in real data and allow clustering efficiency is still ongoing (see\n[Ben15] for a critical survey of work in that direction).\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nIn this work, we take a new approach to alleviate the computational problem of clustering. In\nparticular, we ask the following question: can weak supervision (in the form of answers to natural\nqueries) help relax the computational burden of clustering? This adds to the other benefit\nof supervision: making the clustering problem better defined by enabling the acquisition of domain\nknowledge through the supervised feedback.\nThe general setting considered in this work is the following. Let X be a set of elements that should\nbe clustered and d a dissimilarity function over it. The oracle (e.g., a domain expert) has some\ninformation about a target clustering C\u2217_X in mind. 
The clustering algorithm has access to X, d, and\ncan also make queries about C\u2217_X. The queries are in the form of same-cluster queries. Namely, the\nalgorithm can ask whether two elements belong to the same cluster or not. The goal of the algorithm\nis to find a clustering that meets some predefined clusterability conditions and is consistent with the\nanswers given to its queries.\nWe will also consider the case that the oracle conforms with some optimal k-means solution. We\nthen show that access to a \u2018reasonable\u2019 number of same-cluster queries can enable us to provide an\nefficient algorithm for otherwise NP-hard problems.\n\n1.1 Contributions\n\nThe two main contributions of this paper are the introduction of the semi-supervised active clustering\n(SSAC) framework and the rather unusual demonstration that access to simple query answers can\nturn an otherwise NP-hard clustering problem into a feasible one.\nBefore we explain those results, let us also mention a notion of clusterability (or \u2018input niceness\u2019)\nthat we introduce. We define a novel notion of niceness of data, called the \u03b3-margin property, that is\nrelated to the previously introduced notion of center proximity [ABS12]. The larger the value of\n\u03b3, the stronger the assumption becomes, which means that clustering becomes easier. With respect\nto that \u03b3 parameter, we get a sharp \u2018phase transition\u2019 between k-means being NP-hard and being\noptimally solvable in polynomial time1.\nWe focus on the effect of using queries on the computational complexity of clustering. We provide\na probabilistic polynomial time (BPP) algorithm for clustering with queries that succeeds under\nthe assumption that the input satisfies the \u03b3-margin condition for \u03b3 > 1. 
This algorithm makes\n$O(k^2 \\log k + k \\log n)$ same-cluster queries to the oracle and runs in $O(kn \\log n)$ time, where k is\nthe number of clusters and n is the size of the instance set.\nOn the other hand, we show that without access to query answers, k-means clustering is NP-hard\neven when the solution satisfies the \u03b3-margin property for \u03b3 = $\\sqrt{3.4}$ \u2248 1.84 and k = \u0398(n^\u03b5) (for any\n\u03b5 \u2208 (0, 1)). We further show that access to \u2126(log k + log n) queries is needed to overcome the NP-hardness\nin that case. These results, put together, show an interesting phenomenon. Assume that\nthe oracle conforms to an optimal solution of k-means clustering and that it satisfies the \u03b3-margin\nproperty for some 1 < \u03b3 \u2264 $\\sqrt{3.4}$. In this case, our hardness result means that without making queries\nk-means clustering is NP-hard, while the positive result shows that with a reasonable number of\nqueries the problem becomes efficiently solvable.\nThis indicates an interesting (and as far as we are aware, novel) trade-off between query complexity\nand computational complexity in the clustering domain.\n\n1 The exact value of such a threshold \u03b3 depends on some finer details of the clustering task: whether d is\nrequired to be Euclidean and whether the cluster centers must be members of X.\n\n1.2 Related Work\n\nThis work combines two themes in clustering research: clustering with partial supervision (in\nparticular, supervision in the form of answers to queries) and the computational complexity of\nclustering tasks.\nSupervision in clustering (sometimes also referred to as \u2018semi-supervised clustering\u2019) has been\naddressed before, mostly in application-oriented works [BBM02, BBM04, KBDM09]. The most\ncommon method to convey such supervision is through a set of pairwise link/do-not-link constraints\non the instances. 
Note that in contrast to the supervision we address here, in the setting of the papers\ncited above, the supervision is non-interactive. On the theory side, Balcan and Blum [BB08] propose a\nframework for interactive clustering with the help of a user (i.e., an oracle). The queries considered in\nthat framework are different from ours. In particular, the oracle is provided with the current clustering,\nand tells the algorithm to either split a cluster or merge two clusters. Note that in that setting, the\noracle should be able to evaluate the whole given clustering for each query.\nAnother example of the use of supervision in clustering was provided by Ashtiani and Ben-David\n[ABD15]. They assumed that the target clustering can be approximated by first mapping the data\npoints into a new space and then performing k-means clustering. The supervision is in the form of a\nclustering of a small subset of data (the subset provided by the learning algorithm) and is used to\nsearch for such a mapping.\nOur proposed setup combines the user-friendliness of link/don\u2019t-link queries (as opposed to asking\nthe domain expert to answer queries about a clustering of the whole data set, or to cluster sets of data) with\nthe advantages of interactiveness.\nThe computational complexity of clustering has been extensively studied. Many of these results\nare negative, showing that clustering is computationally hard. For example, k-means clustering is\nNP-hard even for k = 2 [Das08], or in a 2-dimensional plane [Vat09, MNV09]. In order to tackle the\nproblem of computational complexity, some notions of niceness of data under which the clustering\nbecomes easy have been considered (see [Ben15] for a survey).\nThe closest proposal to this work is the notion of \u03b1-center proximity introduced by Awasthi et al.\n[ABS12]. We discuss the relationship of that notion to our notion of margin in Appendix B. 
In the\nrestricted scenario (i.e., when the centers of clusters are selected from the data set), their algorithm\nefficiently recovers the target clustering (outputs a tree such that the target is a pruning of the tree) for\n\u03b1 > 3. Balcan and Liang [BL12] improve the assumption to \u03b1 > $\\sqrt{2}$ + 1. Ben-David and Reyzin\n[BDR14] show that this problem is NP-Hard for \u03b1 < 2.\nVariants of these proofs for our \u03b3-margin condition yield the feasibility of k-means clustering when\nthe input satisfies the condition with \u03b3 > 2 and NP-hardness when \u03b3 < 2, both in the case of arbitrary\n(not necessarily Euclidean) metrics2.\n\n2 Problem Formulation\n\n2.1 Center-based clustering\n\nThe framework of clustering with queries can be applied to any type of clustering. However, in this\nwork, we focus on a certain family of common clusterings \u2013 center-based clustering in Euclidean\nspaces3.\nLet X be a subset of some Euclidean space, R^d. Let CX = {C1, . . . , Ck} be a clustering (i.e., a\npartitioning) of X . We say $x_1 \\overset{C_X}{\\sim} x_2$ if x1 and x2 belong to the same cluster according to CX . We\nfurther denote by n the number of instances (|X|) and by k the number of clusters.\nWe say that a clustering CX is center-based if there exists a set of centers \u00b5 = {\u00b51, . . . , \u00b5k} \u2282 R^d\nsuch that the clustering corresponds to the Voronoi diagram over those center points. Namely, for\nevery x in X and i \u2264 k, x \u2208 Ci \u21d4 i = arg min_j d(x, \u00b5j).\nFinally, we assume that the centers \u00b5\u2217 corresponding to C\u2217 are the centers of mass of the corresponding\nclusters. In other words, $\\mu^*_i = \\frac{1}{|C^*_i|} \\sum_{x \\in C^*_i} x$. 
Note that this is the case for example when the\noracle\u2019s clustering is the optimal solution to the Euclidean k-means clustering problem.\n\n2.2 The \u03b3-margin property\n\nNext, we introduce a notion of clusterability of a data set, also referred to as \u2018data niceness property\u2019.\n\n2 In particular, the hardness result of [BDR14] relies on the ability to construct non-Euclidean distance\nfunctions. Later in this paper, we prove hardness for \u03b3 \u2264 $\\sqrt{3.4}$ for Euclidean instances.\n\n3 In fact, our results are all independent of the Euclidean dimension and apply to any Hilbert space.\n\nDefinition 1 (\u03b3-margin). Let X be a set of points in a metric space M. Let CX = {C1, . . . , Ck} be\na center-based clustering of X induced by centers \u00b51, . . . , \u00b5k \u2208 M. We say that CX satisfies the\n\u03b3-margin property if the following holds. For all i \u2208 [k] and every x \u2208 Ci and y \u2208 X \\ Ci,\n\n\u03b3 d(x, \u00b5i) < d(y, \u00b5i)\n\nSimilar notions have been considered before in the clustering literature. The closest one to our\n\u03b3-margin is the notion of \u03b1-center proximity [BL12, ABS12]. We discuss the relationship between\nthese two notions in appendix B.\n\n2.3 The algorithmic setup\nFor a clustering C\u2217 = {C\u2217_1, . . . , C\u2217_k}, a C\u2217-oracle is a function OC\u2217 that answers queries according\nto that clustering. One can think of such an oracle as a user that has some idea about its desired\nclustering, enough to answer the algorithm\u2019s queries. The clustering algorithm then tries to recover\nC\u2217 by querying a C\u2217-oracle. The following notion of query is arguably most intuitive.\nDefinition 2 (Same-cluster Query). 
A same-cluster query asks whether two instances x1 and x2\nbelong to the same cluster, i.e.,\n\n$O_{C^*}(x_1, x_2) = \\begin{cases} \\text{true} & \\text{if } x_1 \\overset{C^*}{\\sim} x_2 \\\\ \\text{false} & \\text{o.w.} \\end{cases}$\n\n(we omit the subscript C\u2217 when it is clear from the context).\nDefinition 3 (Query Complexity). An SSAC instance is determined by the tuple (X , d, C\u2217). We will\nconsider families of such instances determined by niceness conditions on their oracle clusterings C\u2217.\n1. An SSAC algorithm A is called a q-solver for a family G of such instances, if for every\ninstance (X , d, C\u2217) \u2208 G, it can recover C\u2217 by having access to (X , d) and making at most\nq queries to a C\u2217-oracle.\n\n2. Such an algorithm is a polynomial q-solver if its time-complexity is polynomial in |X| and\n|C\u2217| (the number of clusters).\n\n3. We say G admits an O(q) query complexity if there exists an algorithm A that is a polynomial\nq-solver for every clustering instance in G.\n\n3 An Efficient SSAC Algorithm\n\nIn this section we provide an efficient algorithm for clustering with queries. The setting is the one\ndescribed in the previous section. In particular, it is assumed that the oracle has a center-based\nclustering in mind which satisfies the \u03b3-margin property. The space is Euclidean and the center\nof each cluster is the center of mass of the instances in that cluster. The algorithm not only makes\nsame-cluster queries, but also another type of query defined as below.\nDefinition 4 (Cluster-assignment Query). A cluster-assignment query asks the cluster index that an\ninstance x belongs to. In other words OC\u2217 (x) = i if and only if x \u2208 C\u2217_i.\nNote however that each cluster-assignment query can be replaced with k same-cluster queries (see\nappendix A in supplementary material). 
Therefore, we can express everything in terms of the more\nnatural notion of same-cluster queries, and the use of cluster-assignment queries is just to make the\npresentation of the algorithm simpler.\nIntuitively, our proposed algorithm does the following. In the first phase, it tries to approximate the\ncenter of one of the clusters. It does this by asking cluster-assignment queries about a set of randomly\n(uniformly) selected points, until it has a sufficient number of points from at least one cluster (say Cp).\nIt uses the mean of these points, \u00b5\u2032p, to approximate the cluster center.\nIn the second phase, the algorithm recovers all of the instances belonging to Cp. In order to do that, it\nfirst sorts all of the instances based on their distance to \u00b5\u2032p. By showing that all of the points in Cp lie\ninside a sphere centered at \u00b5\u2032p (which does not include points from any other cluster), it tries to find\nthe radius of this sphere by doing binary search using same-cluster queries. After that, the elements\nin Cp will be located and can be removed from the data set. The algorithm repeats this process k\ntimes to recover all of the clusters.\nThe details of our approach are stated precisely in Algorithm 1. Note that \u03b2 is a small constant4.\nTheorem 7 shows that if \u03b3 > 1 then our algorithm recovers the target clustering with high probability.\nNext, we give bounds on the time and query complexity of our algorithm. 
Theorem 8 shows that our\napproach needs $O(k \\log n + k^2 \\log k)$ queries and runs with time complexity $O(kn \\log n)$.\n\nAlgorithm 1: Algorithm for \u03b3(> 1)-margin instances with queries\nInput: Clustering instance X , oracle O, the number of clusters k and parameter \u03b4 \u2208 (0, 1)\nOutput: A clustering C of the set X\nC = {}, S1 = X , $\\eta = \\beta \\frac{\\log k + \\log(1/\\delta)}{(\\gamma-1)^4}$\nfor i = 1 to k do\n  Phase 1\n  l = k\u03b7 + 1;\n  Z \u223c U^l[Si] // Draws l independent elements from Si uniformly at random\n  For 1 \u2264 t \u2264 i,\n    Zt = {x \u2208 Z : O(x) = t}. // Asks cluster-assignment queries about the members of Z\n  p = arg max_t |Zt|\n  $\\mu'_p := \\frac{1}{|Z_p|} \\sum_{x \\in Z_p} x$.\n  Phase 2\n  // We know that there exists ri such that \u2200x \u2208 Si, x \u2208 Ci \u21d4 d(x, \u00b5\u2032i) < ri.\n  // Therefore, ri can be found by simple binary search\n  \u015ci = Sorted({Si}) // Sorts elements of {x : x \u2208 Si} in increasing order of d(x, \u00b5\u2032p).\n  ri = BinarySearch(\u015ci) // This step takes up to O(log |Si|) same-cluster queries\n  C\u2032p = {x \u2208 Si : d(x, \u00b5\u2032p) \u2264 ri}.\n  Si+1 = Si \\ C\u2032p.\n  C = C \u222a {C\u2032p}\nend\n\nLemma 5. Let (X , d, C) be a clustering instance, where C is center-based and satisfies the \u03b3-margin\nproperty. Let \u00b5 be the set of centers corresponding to the centers of mass of C. Let \u00b5\u2032i be such that\nd(\u00b5i, \u00b5\u2032i) \u2264 r(Ci)\u03b5, where r(Ci) = max_{x \u2208 Ci} d(x, \u00b5i). Then \u03b3 \u2265 1 + 2\u03b5 implies that\n\n\u2200x \u2208 Ci, \u2200y \u2208 X \\ Ci \u21d2 d(x, \u00b5\u2032i) < d(y, \u00b5\u2032i)\n\nProof. 
Fix any x \u2208 Ci and y \u2208 Cj. d(x, \u00b5\u2032i) \u2264 d(x, \u00b5i) + d(\u00b5i, \u00b5\u2032i) \u2264 r(Ci)(1 + \u03b5). Similarly,\nd(y, \u00b5\u2032i) \u2265 d(y, \u00b5i) \u2212 d(\u00b5i, \u00b5\u2032i) > (\u03b3 \u2212 \u03b5)r(Ci). Combining the two, we get that d(x, \u00b5\u2032i) < $\\frac{1+\\epsilon}{\\gamma-\\epsilon}$ d(y, \u00b5\u2032i).\nSince \u03b3 \u2265 1 + 2\u03b5, the factor $\\frac{1+\\epsilon}{\\gamma-\\epsilon}$ is at most 1, which gives the claim.\nLemma 6. Let the framework be as in Lemma 5. Let Zp, Cp, \u00b5p, \u00b5\u2032p and \u03b7 be defined as in Algorithm\n1, and \u03b5 = (\u03b3 \u2212 1)/2. If |Zp| > \u03b7, then the probability that d(\u00b5p, \u00b5\u2032p) > r(Cp)\u03b5 is at most \u03b4/k.\nProof. Define a uniform distribution U over Cp. Then \u00b5p and \u00b5\u2032p are the true and empirical mean of\nthis distribution. Using a standard concentration inequality (Thm. 12 from Appendix D) shows that\nthe empirical mean is close to the true mean, completing the proof.\n\nTheorem 7. Let (X , d, C) be a clustering instance, where C is center-based and satisfies the \u03b3-margin\nproperty. Let \u00b5i be the center of mass of Ci. Assume \u03b4 \u2208 (0, 1) and \u03b3 > 1. Then with\nprobability at least 1 \u2212 \u03b4, Algorithm 1 outputs C.\n\n4 It corresponds to the constant appearing in the generalized Hoeffding inequality bound, discussed in Theorem\n12 in appendix D in supplementary materials.\n\nProof. In the first phase of the algorithm we are making l > k\u03b7 cluster-assignment queries. Therefore,\nusing the pigeonhole principle, we know that there exists cluster index p such that |Zp| > \u03b7. Then\nLemma 6 implies that the algorithm chooses a center \u00b5\u2032p such that with probability at least 1 \u2212 \u03b4/k we\nhave d(\u00b5p, \u00b5\u2032p) \u2264 r(Cp)\u03b5. By Lemma 5, this would mean that d(x, \u00b5\u2032p) < d(y, \u00b5\u2032p) for all x \u2208 Cp\nand y \u2209 Cp. 
Hence, the radius ri found in phase two of Alg. 1 is such that ri = max_{x \u2208 Cp} d(x, \u00b5\u2032p).\nThis implies that C\u2032p (found in phase two) equals Cp. 
Hence, with probability at least 1 \u2212 \u03b4/k one\niteration of the algorithm successfully finds all the points in a cluster Cp. Using the union bound, we get\nthat with probability at least 1 \u2212 k(\u03b4/k) = 1 \u2212 \u03b4, the algorithm recovers the target clustering.\n\nTheorem 8. Let the framework be as in Theorem 7. Then Algorithm 1\n\n\u2022 Makes $O\\big(k \\log n + k^2 \\frac{\\log k + \\log(1/\\delta)}{(\\gamma-1)^4}\\big)$ same-cluster queries to the oracle O.\n\u2022 Runs in $O\\big(kn \\log n + k^2 \\frac{\\log k + \\log(1/\\delta)}{(\\gamma-1)^4}\\big)$ time.\n\nProof. In each iteration (i) the first phase of the algorithm takes O(\u03b7) time and makes \u03b7 + 1 cluster-assignment\nqueries (ii) the second phase takes O(n log n) time and makes O(log n) same-cluster\nqueries. Each cluster-assignment query can be replaced with k same-cluster queries; therefore,\neach iteration runs in O(k\u03b7 + n log n) and uses O(k\u03b7 + log n) same-cluster queries. By replacing\n$\\eta = \\beta \\frac{\\log k + \\log(1/\\delta)}{(\\gamma-1)^4}$ and noting that there are k iterations, the proof will be complete.\n\nCorollary 9. The set of Euclidean clustering instances that satisfy the \u03b3-margin property for some\n\u03b3 > 1 admits query complexity $O\\big(k \\log n + k^2 \\frac{\\log k + \\log(1/\\delta)}{(\\gamma-1)^4}\\big)$.\n\n4 Hardness Results\n\n4.1 Hardness of Euclidean k-means with Margin\n\nFinding a k-means solution without the help of an oracle is generally computationally hard. In this\nsection, we will show that solving Euclidean k-means remains hard even if we know that the optimal\nsolution satisfies the \u03b3-margin property for \u03b3 = $\\sqrt{3.4}$. 
In particular, we show the hardness for the\ncase of k = \u0398(n^\u03b5) for any \u03b5 \u2208 (0, 1).\nIn Section 3, we proposed a polynomial-time algorithm that could recover the target clustering using\n$O(k^2 \\log k + k \\log n)$ queries, assuming that the clustering satisfies the \u03b3-margin property for \u03b3 > 1.\nNow assume that the oracle conforms to the optimal k-means clustering solution. 
In this case, for\n1 < \u03b3 \u2264 $\\sqrt{3.4}$ \u2248 1.84, solving k-means clustering would be NP-hard without queries, while it\nbecomes efficiently solvable with the help of an oracle5.\nGiven a set of instances X \u2282 R^d, the k-means clustering problem is to find a clustering C =\n{C1, . . . , Ck} which minimizes $f(C) = \\sum_{C_i} \\min_{\\mu_i \\in \\mathbb{R}^d} \\sum_{x \\in C_i} \\|x - \\mu_i\\|_2^2$. The decision version of k-means\nis, given some value L, is there a clustering C with cost \u2264 L? The following theorem is the main\nresult of this section.\nTheorem 10. Finding the optimal solution to the Euclidean k-means objective function is NP-hard when\nk = \u0398(n^\u03b5) for any \u03b5 \u2208 (0, 1), even when the optimal solution satisfies the \u03b3-margin property for\n\u03b3 = $\\sqrt{3.4}$.\n\nThis result extends the hardness result of [BDR14] to the case of the Euclidean metric, rather than\nan arbitrary one, and to the \u03b3-margin condition (instead of the \u03b1-center proximity there). The full proof\nis rather technical and is deferred to the supplementary material (appendix C).\n\n5 To be precise, note that the algorithm used for clustering with queries is probabilistic, while the lower bound\nthat we provide is for deterministic algorithms. However, this implies a lower bound for randomized algorithms\nas well unless BPP \u2260 P.\n\n4.1.1 Overview of the proof\n\nOur method to prove Thm. 10 is based on the approach employed by [Vat09]. However, the original\nconstruction proposed in [Vat09] does not satisfy the \u03b3-margin property. Therefore, we have to\nmodify the proof by setting up the parameters of the construction more carefully.\nTo prove the theorem, we will provide a reduction from the problem of Exact Cover by 3-Sets (X3C),\nwhich is NP-Complete [GJ02], to the decision version of k-means.\nDefinition 11 (X3C). 
Given a set U containing exactly 3m elements and a collection S =\n{S1, . . . , Sl} of subsets of U such that each Si contains exactly three elements, do there exist\nm elements in S such that their union is U?\nWe will show how to translate each instance of X3C, (U,S), to an instance of k-means clustering in\nthe Euclidean plane, X. In particular, X has a grid-like structure consisting of l rows (one for each\nSi) and roughly 6m columns (corresponding to U) which are embedded in the Euclidean plane. The\nspecial geometry of the embedding makes sure that any low-cost k-means clustering of the points\n(where k is roughly 6ml) exhibits a certain structure. In particular, any low-cost k-means clustering\ncould cluster each row in only two ways; one of these corresponds to Si being included in the cover,\nwhile the other means it should be excluded. We will then show that U has a cover of size m if and\nonly if X has a clustering of cost less than a specific value L. Furthermore, our choice of embedding\nmakes sure that the optimal clustering satisfies the \u03b3-margin property for \u03b3 = $\\sqrt{3.4}$ \u2248 1.84.\n\n4.1.2 Reduction design\nGiven an instance of X3C, that is the elements U = {1, . . . , 3m} and the collection S, we construct\na set of points X in the Euclidean plane which we want to cluster. Particularly, X consists of\na set of points Hl,m in a grid-like manner, and the sets Zi corresponding to Si. In other words,\n$X = H_{l,m} \\cup (\\cup_{i=1}^{l-1} Z_i)$.\nThe set Hl,m is as described in Fig. 1. The row Ri is composed of 6m + 3 points\n{si, ri,1, . . . , ri,6m+1, fi}. Row Gi is composed of 3m points {gi,1, . . . , gi,3m}. The distances\nbetween the points are also shown in Fig. 1. Also, all these points have weight w, simply meaning\nthat each point is actually a set of w points on the same location.\nEach set Zi is constructed based on Si. 
In particular, $Z_i = \\cup_{j \\in [3m]} B_{i,j}$, where Bi,j is a subset of\n{xi,j, x\u2032i,j, yi,j, y\u2032i,j} and is constructed as follows: xi,j \u2208 Bi,j iff j \u2209 Si, and x\u2032i,j \u2208 Bi,j iff j \u2208 Si.\nSimilarly, yi,j \u2208 Bi,j iff j \u2209 Si+1, and y\u2032i,j \u2208 Bi,j iff j \u2208 Si+1. Furthermore, xi,j, x\u2032i,j, yi,j and y\u2032i,j\nare specific locations as depicted in Fig. 2. In other words, exactly one of the locations xi,j and x\u2032i,j,\nand one of yi,j and y\u2032i,j will be occupied. We set the following parameters.\n\n$h = \\sqrt{5}, \\quad d = \\sqrt{6}, \\quad \\epsilon = \\frac{1}{w^2}, \\quad \\lambda = \\frac{2}{\\sqrt{3}} h, \\quad k = (l-1)3m + l(3m+2)$\n\n$L_1 = (6m+3)wl, \\quad L_2 = 3m(l-1)w, \\quad L = L_1 + L_2 - m\\alpha, \\quad \\alpha = \\frac{d}{w} - \\frac{1}{2w^3}$\n\nLemma 12. The set X = Hl,n \u222a Z has a k-clustering of cost less than or equal to L if and only if there\nis an exact cover for the X3C instance.\nLemma 13. Any k-clustering of X = Hl,n \u222a Z with cost \u2264 L has the \u03b3-margin property where\n\u03b3 = $\\sqrt{3.4}$. Furthermore, k = \u0398(n^\u03b5).\n\nThe proofs are provided in Appendix C. Lemmas 12 and 13 together show that X has a k-clustering\nof cost \u2264 L satisfying the \u03b3-margin property (for \u03b3 = $\\sqrt{3.4}$) if and only if there is an exact cover by\n3-sets for the X3C instance. This completes the proof of our main result (Thm. 10).\n\n4.2 Lower Bound on the Number of Queries\n\nIn the previous section we showed that k-means clustering is NP-hard even under the \u03b3-margin assumption\n(for \u03b3 < $\\sqrt{3.4}$ \u2248 1.84). On the other hand, in Section 3 we showed that this is not the case if the\nalgorithm has access to an oracle. 
In this section, we show a lower bound on the number of queries\nneeded to provide a polynomial-time algorithm for k-means clustering under margin assumption.\n\nFigure 1: Geometry of Hl,m. This figure is similar to Fig. 1 in [Vat09]. Reading from left to\nright, each row Ri consists of a diamond (si), 6m + 1 bullets (ri,1, . . . , ri,6m+1), and another\ndiamond (fi). Each row Gi consists of 3m circles (gi,1, . . . , gi,3m).\n\nFigure 2: The locations of xi,j, x\u2032i,j, yi,j and y\u2032i,j in the set Zi. Note that the point gi,j is not\nvertically aligned with xi,j or ri,2j. This figure is adapted from [Vat09].\n\nTheorem 14. For any \u03b3 \u2264 $\\sqrt{3.4}$, finding the optimal solution to the k-means objective function is\nNP-Hard even when the optimal clustering satisfies the \u03b3-margin property and the algorithm can ask\nO(log k + log |X|) same-cluster queries.\nProof. Proof by contradiction: assume that there is a polynomial-time algorithm A that makes\nO(log k + log |X|) same-cluster queries to the oracle. Then, we show there exists another\nalgorithm A\u2032 for the same problem that is still polynomial but uses no queries. 
However, this will be a\ncontradiction to Theorem 10, which will prove the result.\nIn order to prove that such A\u2032 exists, we use a \u2018simulation\u2019 technique. Note that A makes only\nq < \u03b2(log k + log |X|) binary queries, where \u03b2 is a constant. The oracle therefore can respond to\nthese queries in at most 2^q < k^\u03b2 |X|^\u03b2 different ways. Now the algorithm A\u2032 can try to simulate all\nof the k^\u03b2 |X|^\u03b2 possible responses by the oracle and output the solution with minimum k-means clustering\ncost. Therefore, A\u2032 runs in polynomial time and is equivalent to A.\n\n5 Conclusions and Future Directions\n\nIn this work we introduced a framework for semi-supervised active clustering (SSAC) with same-cluster\nqueries. Those queries can be viewed as a natural way for a clustering mechanism to gain\ndomain knowledge, without which clustering is an under-defined task. The focus of our analysis was\nthe computational and query complexity of such SSAC problems, when the input data set satisfies a\nclusterability condition \u2013 the \u03b3-margin property.\nOur main result shows that access to a limited number of such query answers (logarithmic in the\nsize of the data set and quadratic in the number of clusters) allows efficient successful clustering\nunder conditions (margin parameter between 1 and $\\sqrt{3.4}$ \u2248 1.84) that render the problem NP-hard\nwithout the help of such a query mechanism. We also provided a lower bound indicating that at least\n\u2126(log kn) queries are needed to make those NP-hard problems feasibly solvable.\nWith practical applications of clustering in mind, a natural extension of our model is to allow the\noracle (i.e., the domain expert) to refrain from answering a certain fraction of the queries, or to make\na certain number of errors in its answers. It would be interesting to analyze how the performance\nguarantees of SSAC algorithms behave as a function of such abstentions and error rates. 
Interestingly,\nwe can modify our algorithm to handle a sub-logarithmic number of abstentions by checking all\npossible oracle answers to them (i.e., similar to the \u201csimulation\u201d trick in the proof of Thm. 14).\n\nAcknowledgments\n\nWe would like to thank Samira Samadi and Vinayak Pathak for helpful discussions on the topics of\nthis paper.\n\nReferences\n\n[ABD15] Hassan Ashtiani and Shai Ben-David. Representation learning for clustering: A statistical framework. In Uncertainty in AI (UAI), 2015.\n\n[ABS12] Pranjal Awasthi, Avrim Blum, and Or Sheffet. Center-based clustering under perturbation stability. Information Processing Letters, 112(1):49\u201354, 2012.\n\n[BB08] Maria-Florina Balcan and Avrim Blum. Clustering with interactive feedback. In Algorithmic Learning Theory, pages 316\u2013328. Springer, 2008.\n\n[BBM02] Sugato Basu, Arindam Banerjee, and Raymond Mooney. Semi-supervised clustering by seeding. In Proceedings of 19th International Conference on Machine Learning (ICML-2002), 2002.\n\n[BBM04] Sugato Basu, Mikhail Bilenko, and Raymond J Mooney. A probabilistic framework for semi-supervised clustering. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 59\u201368. ACM, 2004.\n\n[BDR14] Shalev Ben-David and Lev Reyzin. Data stability in clustering: A closer look. Theoretical Computer Science, 558:51\u201361, 2014.\n\n[Ben15] Shai Ben-David. Computational feasibility of clustering under clusterability assumptions. CoRR, abs/1501.00437, 2015.\n\n[BL12] Maria Florina Balcan and Yingyu Liang. Clustering under perturbation resilience. In Automata, Languages, and Programming, pages 63\u201374. Springer, 2012.\n\n[Das08] Sanjoy Dasgupta. The hardness of k-means clustering. Department of Computer Science and Engineering, University of California, San Diego, 2008.\n\n[GJ02] Michael R Garey and David S Johnson. 
Computers and intractability, volume 29. W. H. Freeman, New York, 2002.\n\n[KBDM09] Brian Kulis, Sugato Basu, Inderjit Dhillon, and Raymond Mooney. Semi-supervised graph clustering: a kernel approach. Machine learning, 74(1):1\u201322, 2009.\n\n[MNV09] Meena Mahajan, Prajakta Nimbhorkar, and Kasturi Varadarajan. The planar k-means problem is NP-hard. In WALCOM: Algorithms and Computation, pages 274\u2013285. Springer, 2009.\n\n[Vat09] Andrea Vattani. The hardness of k-means clustering in the plane. Manuscript, accessible at http://cseweb.ucsd.edu/avattani/papers/kmeans_hardness.pdf, 617, 2009.\n", "award": [], "sourceid": 1607, "authors": [{"given_name": "Hassan", "family_name": "Ashtiani", "institution": "University of Waterloo"}, {"given_name": "Shrinu", "family_name": "Kushagra", "institution": "University of Waterloo"}, {"given_name": "Shai", "family_name": "Ben-David", "institution": "U. Waterloo"}]}