{"title": "Supervised Clustering", "book": "Advances in Neural Information Processing Systems", "page_first": 91, "page_last": 99, "abstract": "Despite the ubiquity of clustering as a tool in unsupervised learning, there is not yet a consensus on a formal theory, and the vast majority  of work in this direction has focused on unsupervised clustering. We study a recently proposed framework for supervised clustering where there is access to a teacher. We give an improved generic algorithm to cluster any concept class  in that model. Our algorithm is query-efficient in the sense that it involves only a small amount  of interaction with the teacher. We also present and study two natural generalizations of the model.  The model assumes that the teacher response to the algorithm is perfect. We eliminate  this limitation by proposing a noisy model and give an algorithm for  clustering the class of intervals in this noisy model. We also propose a dynamic model where the teacher sees  a random subset of the points. Finally, for datasets  satisfying a spectrum of weak to strong properties, we give query bounds, and show that a class  of clustering functions containing Single-Linkage will find the target clustering under the strongest  property.", "full_text": "Supervised Clustering\n\nPranjal Awasthi\n\nCarnegie Mellon University\npawasthi@cs.cmu.edu\n\nReza Bosagh Zadeh\nStanford University\n\nrezab@stanford.edu\n\nAbstract\n\nDespite the ubiquity of clustering as a tool in unsupervised learning, there is not\nyet a consensus on a formal theory, and the vast majority of work in this direction\nhas focused on unsupervised clustering. We study a recently proposed framework\nfor supervised clustering where there is access to a teacher. We give an improved\ngeneric algorithm to cluster any concept class in that model. Our algorithm is\nquery-ef\ufb01cient in the sense that it involves only a small amount of interaction\nwith the teacher. 
We also present and study two natural generalizations of the model. The model assumes that the teacher response to the algorithm is perfect. We eliminate this limitation by proposing a noisy model and give an algorithm for clustering the class of intervals in this noisy model. We also propose a dynamic model where the teacher sees a random subset of the points. Finally, for datasets satisfying a spectrum of weak to strong properties, we give query bounds, and show that a class of clustering functions containing Single-Linkage will find the target clustering under the strongest property.\n\n1 Introduction\n\nClustering has traditionally been a tool of unsupervised learning. Despite widespread usage across several fields, there is not yet a well-established theory to describe clustering [ABD09, AL10, Blu09, GvLW09]. Recently, Balcan and Blum [BB08] proposed a supervised model of clustering, where there is access to a teacher. We further explore the implications of their model and extend it in several important directions. As a motivating example, consider Google News, where news documents are gathered from the web and need to be clustered into groups, each corresponding to a particular news story. In this case, it is clear to the human eye (the teacher) which group each document should belong to, but the sheer number of articles makes clustering by hand prohibitive. Instead, an algorithm can interact with the teacher to aid in clustering the documents without asking too much of the teacher.\n\nTraditional approaches to clustering optimize some objective function, like the k-means or the k-median objective, over the given set of points [KVV00, CGTS99]. These approaches work under the implicit assumption that by minimizing a certain objective function one can get close to the underlying ground-truth clustering. Alternatively, another line of work makes strong assumptions on the nature of the data. 
One assumption that is popular in the literature is that the data come from a mixture of Gaussians [Das99]. However, when dealing with web pages, documents, etc., it is not clear whether these assumptions are reasonable. In fact, there might be no principled way to reach the target clustering which a teacher has in mind without actually interacting with him/her. For example, consider documents representing news articles. These documents could be clustered as {politics, sports, entertainment, other}. However, this is just one of many possible clusterings. The clustering {entertainment + sports, politics, other} is equally likely a priori. Or perhaps the teacher would like these articles to be clustered into {news articles} vs. {opinion pieces}. These scenarios motivate the need to consider the problem of clustering under feedback. Recently, there has been interest in investigating such models and in coming up with a more formal theoretical framework for analyzing clustering problems and algorithms. One such framework was proposed by Balcan and Blum [BB08] who, motivated by different models for learning under queries, proposed a model for clustering under queries.\n\nThe model is similar to the Equivalence Query (EQ) model of learning [Ang98], but with a different kind of feedback. We assume that the given set S of m points belongs to k target clusters {c1, c2, . . . , ck}, where each cluster is defined by some concept c belonging to a concept class C. For example, the points belonging to the cluster c1 will be the set {x \u2208 S | c1(x) = 1}. We also assume that each point belongs to exactly one of the k clusters. As in the EQ model of learning, the algorithm presents a hypothesis clustering {h1, h2, . . . , hk'} to the teacher. If the clustering is incorrect, the algorithm gets some feedback from the teacher. However, the feedback in this case is different from the one in the EQ model. 
In the learning model, the algorithm gets a specific point x as a counter-example to its proposed hypothesis. For clustering problems this may not be a very natural form of feedback. In a realistic scenario, the teacher can look at the clustering proposed and give some limited feedback. Hence, the model in [BB08] considers the following feedback: If there is a cluster hi which contains points from two or more target clusters, then the teacher can ask the algorithm to split that cluster by issuing the request split(hi). Note that the teacher does not specify how the cluster hi should be split. If there are clusters hi and hj such that hi \u222a hj is a subset of one of the target clusters, then the teacher can ask the algorithm to merge these two clusters by issuing the request merge(hi, hj). The goal of the algorithm is to be query efficient \u2013 O(poly(k, log m, log |C|)) queries \u2013 and computationally efficient \u2013 running time of O(poly(k, m, log |C|)). Notice that if we allow the algorithm a number of queries linear in m, then there is a trivial algorithm which starts with all the points in separate clusters and then merges clusters as requested by the teacher. One could also imagine applying this split-merge framework to cases where the optimal clustering does not necessarily belong to a natural concept class, but instead satisfies some natural separation conditions (e.g., large-margin conditions). We also study and present results for such problem instances.\n\n1.1 Contributions\n\nIn their paper, Balcan and Blum [BB08] gave efficient clustering algorithms for the class of intervals and the class of disjunctions over {0, 1}^n. We extend those results by constructing an algorithm for clustering the class of axis-parallel rectangles in d dimensions. Our algorithm is computationally efficient (for constant d) and uses a small number of queries. 
We generalize our algorithm to cluster the class of hyperplanes in d dimensions with known slopes. Balcan and Blum [BB08] also gave a generic algorithm for any finite concept class C, which uses O(k^3 log |C|) queries. We reduce the query complexity of the generic algorithm from O(k^3 log |C|) to O(k log |C|). Furthermore, the new algorithm is much simpler than the one from [BB08]. We study two natural generalizations of the original model. In the original model, the teacher is only allowed to merge two clusters hi and hj if hi \u222a hj is a subset of one of the target clusters. We consider a noise-tolerant version of this in which the teacher can ask the algorithm to merge hi and hj if both clusters have at least some fixed fraction of points belonging to the same target cluster. This is a more natural model, since we allow the teacher's requests to be imperfect.\n\nIn the original model we assume that the teacher has access to all the points. In practice, we are interested in clustering a large domain of points, and the teacher might only have access to a random subset of these points at every step. For example, in the case of clustering news documents, our goal is to figure out the target clustering which reflects the teacher's preferences. But the teacher sees a small fresh set of news articles every day. We propose a model which takes into account the fact that at each step the split and merge requests might be on a different set of points. In both of the above models, the straightforward algorithm for clustering the class of intervals fails. We develop new algorithms for clustering intervals in both models.\n\nWe also apply the split-merge framework of [BB08] to datasets satisfying a spectrum of weak to strong properties and design algorithms for clustering such data sets. 
Along the way, we also show that a class of clustering functions containing Single-Linkage will find the target clustering under the strict threshold property (Theorem 6.1).\n\n2 The model\n\nWe consider the model proposed by Balcan and Blum [BB08]. The clustering algorithm is given a set S of m points. Each point belongs to one of the k clusters. Each cluster is defined by a function f \u2208 C, where C is a concept class. The goal of the algorithm is to figure out the correct clustering by interacting with the teacher as follows:\n\n1. The algorithm proposes a hypothesis clustering {h1, h2, . . . , hJ} to the teacher.\n2. The teacher can request split(hi) if hi contains points from two or more target clusters. The teacher can request merge(hi, hj) if hi \u222a hj is a subset of one of the target clusters.\n\nThe assumption is that there is no noise in the teacher response. The goal is to use as few queries to the teacher as possible. Ideally, we would like the number of queries to be poly(k, log m, log |C|).\n\n2.1 A generic algorithm for learning any finite concept class\n\nWe reduce the query complexity of the generic algorithm for learning any concept class [BB08] from O(k^3 log |C|) to O(k log |C|). In addition, our algorithm is simpler than the original one. The new algorithm is described below.\nGiven m points, let V S = {the set of all possible k-clusterings of the given points using concepts in C}. Notice that |V S| \u2264 |C|^k. Given a set h \u2286 S of points, we say that a given clustering R is consistent with h if h appears as a subset of one of the clusters in R. Define V S(h) = {R \u2208 V S | R is consistent with h}. At each step the algorithm outputs clusters as follows:\n\n1. Initialize i = 1.\n2. Find the largest set of points hi s.t. |V S(hi)| \u2265 |V S|/2.\n3. Output hi as a cluster.\n4. Set i = i + 1 and repeat steps 2-3 on the remaining points until every point has been assigned to some cluster.\n5. Present the clustering {h1, h2, . . . , hJ} to the teacher.\n\nIf the teacher says split(hi), remove all the clusterings in V S which are consistent with hi. If the teacher says merge(hi, hj), remove all the clusterings in V S which are inconsistent with hi \u222a hj.\nTheorem 2.1. The generic algorithm can cluster any finite concept class using at most k log |C| queries.\n\nProof. At each request, if the teacher says split(hi), then all the clusterings consistent with hi are removed, which by the construction followed by the algorithm will be at least half of |V S|. If the teacher says merge(hi, hj), i < j, then all the clusterings inconsistent with hi \u222a hj are removed. This set will be at least half of |V S|, since otherwise the number of clusterings consistent with hi \u222a hj would be more than half of |V S|, which contradicts the maximality of hi. Hence, after each query at least half of the version space is removed. From the above claim we notice that the total number of queries will be at most log |V S| \u2264 log |C|^k = k log |C|.\n\nThe analysis can be improved if the VC-dimension d of the concept class C is much smaller than log |C|. In this case the size of V S can be bounded from above by C[m]^k, where C[m] is the number of ways to split m points using concepts in C. Also, from Sauer\u2019s lemma [Vap98] we know that C[m] \u2264 m^d. Hence, we get |V S| \u2264 m^{kd}. This gives a query complexity of O(kd log m).\n\n3 Clustering geometric concepts\n\nWe now present an algorithm for clustering the class of rectangles in 2 dimensions. We first present a simple but less efficient algorithm for the problem. The algorithm uses O((k log m)^3) queries and runs in time poly(k, m). In the appendix, we show that the query complexity of the algorithm can be improved to O((k log m)^2). 
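Before moving on to geometric concepts, the generic algorithm of Section 2.1 can be made concrete when the version space is small enough to enumerate explicitly. The following is a minimal sketch of ours, not code from the paper: clusterings are tuples of frozensets, the brute-force search over candidate sets h is exponential in m (fine only for toy instances), and all function names are illustrative.

```python
from itertools import combinations

def consistent(clustering, h):
    """A clustering is consistent with h if h lies inside a single cluster."""
    return any(h <= c for c in clustering)

def propose(points, vs):
    """One round: repeatedly carve off the largest set h with |VS(h)| >= |VS|/2."""
    hypothesis, rest = [], set(points)
    while rest:
        for size in range(len(rest), 0, -1):          # try the largest h first
            found = next((frozenset(c) for c in combinations(sorted(rest), size)
                          if sum(consistent(r, frozenset(c)) for r in vs) >= len(vs) / 2),
                         None)
            if found is not None:
                hypothesis.append(found)
                rest -= found
                break
    return hypothesis

def update(vs, request, hi, hj=None):
    """split(hi): drop clusterings consistent with hi.
       merge(hi, hj): drop clusterings inconsistent with hi | hj."""
    if request == "split":
        return [r for r in vs if not consistent(r, hi)]
    return [r for r in vs if consistent(r, hi | hj)]
```

For instance, with points {0, 1, 2, 3} and a version space of three threshold clusterings, the first proposal is {0, 1}, {2, 3}; a split({0, 1}) request then discards the two clusterings consistent with {0, 1}, halving the version space as in the proof of Theorem 2.1.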
Our algorithm generalizes in a natural way to rectangles in d-dimensional space, and to hyperplanes in d dimensions with known slopes.\n\n3.1 An algorithm for clustering rectangles\n\nEach rectangle c in the target clustering can be described by four points (ai, aj), (bi, bj) such that (x, y) \u2208 c iff ai < x < aj and bi < y < bj. Hence, corresponding to any k-clustering there are at most 2k points a1, a2, . . . , a2k on the x-axis and at most 2k points b1, b2, . . . , b2k on the y-axis. We call these points the target points. The algorithm works by finding these points. During its course the algorithm maintains a set of points on the x-axis and a set of points on the y-axis. These points divide the entire space into rectangular regions. The algorithm uses these regions as its hypothesis clusters. The algorithm is sketched below:\n\n1. Start with points (a'start, a'end) on the x-axis and points (b'start, b'end) on the y-axis, such that all the points are contained in the rectangle defined by these points.\n2. At each step, cluster the m points according to the region in which they belong. Present this clustering to the teacher.\n3. On a merge request, simply merge the two clusters.\n4. On a split of (a'i, a'j), (b'i, b'j), create a new point a'r such that a'i < a'r < a'j, and the projection of all the points onto (a'i, a'j) is divided in half by a'r. Similarly, create a new point b'r such that b'i < b'r < b'j, and the projection of all the points onto (b'i, b'j) is divided in half by b'r. Abandon all the merges done so far.\n\nTheorem 3.1. The algorithm can cluster the class of rectangles in 2 dimensions using at most O((k log m)^3) queries.\n\nProof. Let's first bound the total number of split requests. If the teacher says split on (xi, xj), (yi, yj), then we know that either (xi, xj) contains a target point a or (yi, yj) contains a target point b, or both. 
By creating two splits we are ensuring that the size of at least one of the regions containing a target point is reduced by half. There are at most 2k intervals on the x-axis and at most 2k intervals on the y-axis. Hence, the total number of split requests is \u2264 4k log m. Now let's bound the merge requests. Between any two split requests, the total number of merge requests will be at most the total number of regions, which is \u2264 O((k log m)^2). Since t points on the x-axis and the y-axis can create at most t^2 regions, we get that the total number of merge requests is \u2264 O((k log m)^3). Hence, the total number of queries made by the algorithm is O((k log m)^3).\n\nIf we are a bit more careful, we can avoid redoing the merges after every split and reduce the query complexity to O((k log m)^2). So, for rectangles we have the following result^1.\nTheorem 3.2. There is an algorithm which can cluster the class of rectangles in 2 dimensions using at most O((k log m)^2) queries.\n\nWe can also generalize this algorithm to work for rectangles in a d-dimensional space. Hence, we get the following results:\nCorollary 3.3. There is an algorithm which can cluster the class of rectangles in d dimensions using at most O((kd log m)^d) queries.\nCorollary 3.4. There is an algorithm which can cluster the class of hyperplanes in d dimensions having a known set of slopes of size at most s, using at most O((kds log m)^d) queries.\n\n4 Dynamic model\n\nWe now study a natural generalization of the original model. In the original model we assume that the teacher has access to the entire set of points. In practice, this will rarely be the case. For example, in the case of clustering news articles, each day the teacher sees a small fresh set of articles and provides feedback. Based on this, the algorithm must be able to figure out the target clustering for the entire space of articles. More formally, let X be the space of all the points. 
There is a target k-clustering for these points, where each cluster corresponds to a concept in a concept class C. At each step, the world picks m points, and the algorithm clusters these m points and presents the clustering to the teacher. If the teacher is unhappy with the clustering, he may provide feedback. Note that the teacher need not provide feedback every time the algorithm proposes an incorrect clustering. The goal of the algorithm is to minimize the amount of feedback necessary to figure out the target clustering. Notice that at each step the algorithm may get a fresh set of m points. We assume that the requests have no noise and the algorithm has access to all the points in X. We now give an algorithm for learning intervals in this model.\n\n^1 Proof is omitted due to space constraints.\n\n4.1 An algorithm for clustering intervals\n\nLet us assume that the space X is discretized into n points. We assume that there exist points {a1, a2, . . . , ak+1} on the x-axis such that the target clustering is the intervals {[a1, a2], [a2, a3], . . . , [ak, ak+1]}. The algorithm maintains a set of points on the x-axis and uses the intervals induced by them as its hypothesis. Also, each interval is associated with a state of marked/unmarked. When a new interval is created, it is always unmarked. An interval is marked if we know that none of the points (the ai\u2019s) in the target clustering can be present in that interval. The algorithm is sketched below:\n\n1. Start with one unmarked interval containing all the points in the space.\n2. Given a set of m points, first form preliminary clusters h1, . . . , hJ such that each cluster corresponds to an interval. Next output the final clusters as follows:\n\u2022 Set i = 1.\n\u2022 If hi and hi+1 correspond to adjacent intervals and at least one of them is unmarked, then output hi \u222a hi+1 and set i = i + 2. Else output hi and set i = i + 1.\n3. 
On a split request, split every unmarked interval in the cluster in half.\n4. On a merge request, mark every unmarked interval contained in the cluster.\n\nTheorem 4.1. The algorithm can cluster the class of intervals using at most O(k log n) mistakes.\n\nProof. Notice that by our construction, every cluster will contain at most 2 unmarked intervals. Let's first bound the total number of split requests. For every point ai in the target clustering we define two variables left_size(ai) and right_size(ai). If ai is inside a hypothesis interval [x, y], then left_size(ai) = number of points in [x, ai] and right_size(ai) = number of points in [ai, y]. If ai is also a boundary point in the hypothesis clustering ([x, ai], [ai, y]), then again left_size(ai) = number of points in [x, ai] and right_size(ai) = number of points in [ai, y]. Notice that every split request reduces either the left_size or the right_size of some boundary point by half. Since there are at most k boundary points in the target clustering, the total number of split requests is \u2264 O(k log n). Also note that the number of unmarked intervals is at most O(k log n), since unmarked intervals increase only via split requests. On every merge request, either an unmarked interval is marked or two marked intervals are merged. Hence, the total number of merge requests is at most twice the number of unmarked intervals, i.e., \u2264 O(k log n). Hence, the total number of mistakes is \u2264 O(k log n).\n\nIt's easy to see that the generic algorithm for learning any finite concept class in the original model also works in this model. Hence, we can learn any finite concept class in this model using at most k log |C| queries.\n\n5 The \u03b7 noise model\n\nThe previous two models assume that there is no noise in the teacher requests. This is again an unrealistic assumption, since we cannot expect the teacher responses to be perfect. 
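Before turning to noise, the bookkeeping in the interval algorithm of Section 4.1 can be made concrete. The sketch below is our own illustration, not the paper's code; it assumes the domain is discretized as the integers 0, ..., n-1 and represents hypothesis intervals as half-open ranges [lo, hi).

```python
from dataclasses import dataclass

@dataclass
class Interval:
    lo: int                # inclusive endpoint, over the discretized domain
    hi: int                # exclusive endpoint
    marked: bool = False   # True once no target boundary can lie strictly inside

def clusters(intervals):
    """Step 2: output adjacent intervals together while one of them is unmarked."""
    out, i = [], 0
    while i < len(intervals):
        pair_ok = (i + 1 < len(intervals)
                   and not (intervals[i].marked and intervals[i + 1].marked))
        out.append(intervals[i:i + 2] if pair_ok else intervals[i:i + 1])
        i += 2 if pair_ok else 1
    return out

def on_split(intervals, cluster):
    """Split every unmarked interval of the offending cluster in half."""
    out = []
    for iv in intervals:
        if iv in cluster and not iv.marked and iv.hi - iv.lo > 1:
            mid = (iv.lo + iv.hi) // 2
            out += [Interval(iv.lo, mid), Interval(mid, iv.hi)]
        else:
            out.append(iv)
    return out

def on_merge(intervals, cluster):
    """Mark every unmarked interval contained in the merged cluster."""
    for iv in cluster:
        iv.marked = True
    return intervals
```

Each split request halves every unmarked interval of the offending cluster, and each merge request only marks intervals; this is exactly the accounting behind the O(k log n) mistake bound of Theorem 4.1.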
For example, if the algorithm proposes a clustering in which there are two clusters which are almost pure, i.e., a large fraction of the points in both clusters belongs to the same target cluster, then there is a good chance that the teacher will ask the algorithm to merge these two clusters, especially if the teacher has access to the clusters through a random subset of the points. In this section we study a model which removes this assumption. For simplicity, we consider the noisy version of the original model [BB08]. As in the original model, the algorithm has m points. At each step, the algorithm proposes a clustering {h1, h2, . . . , hJ} to the teacher and the teacher provides feedback. But now, the feedback is noisy in the following sense:\n\n1. Split: As before, the teacher can say split(hi) if hi contains points from more than one target cluster.\n2. Merge: The teacher can say merge(hi, hj) if hi and hj each have at least one point from a common target cluster.\n\nIt turns out that handling arbitrary noise is difficult. The following theorem (proof omitted) gives a counter-example.\nTheorem 5.1. Consider m points on a line and k = 2. Any clustering algorithm must use \u2126(m) queries in the worst case to figure out the target clustering in the noisy model.\n\nHence, we now consider a relaxed notion of noise. If the teacher says merge(hi, hj), then we assume that at least a constant \u03b7 fraction of the points in both clusters belongs to a single target cluster. Under this model of noise we now give an algorithm for learning k-intervals.\n\n5.1 An algorithm for clustering intervals\n\nThe algorithm is a generalization of the interval learning algorithm in the original model. The main idea is that when the teacher asks to merge two intervals (ai, aj) and (aj, ak), then we know that at least an \u03b7 fraction of the portion to the left and the right of aj is pure. Hence, the algorithm can still make progress. 
As the algorithm proceeds, it will mark certain intervals as \u201cpure\u201d, which means that all the points in that interval belong to the same cluster. More formally, the algorithm is as follows:\n\n1. Start with one interval [a'start, a'end] containing all the points.\n2. At each step, cluster the points using the current set of intervals and present that clustering to the teacher.\n3. On a split request: divide the interval in half.\n4. On a merge request:\n\u2022 If both intervals are marked \u201cpure\u201d, merge them.\n\u2022 If both intervals are unmarked, then create 3 intervals, where the middle interval contains an \u03b7 fraction of each of the two intervals. Also mark the middle interval as \u201cpure\u201d.\n\u2022 If one interval is marked and one is unmarked, then shift the boundary between the two intervals towards the unmarked interval by a fraction of \u03b7.\n\nTheorem 5.2. The algorithm clusters the class of intervals using at most O(k(log_{1/(1\u2212\u03b7)} m)^2) queries.\n\nProof. We will call a merge request \u201cimpure\u201d if it involves at least one impure interval, i.e., an interval which contains points from two or more clusters. Otherwise we will call it \u201cpure\u201d. Notice that every split and every impure merge request makes progress, i.e., the size of some target interval is reduced by at least an \u03b7 fraction. Hence, the total number of split and impure merge requests is \u2264 k log_{1/(1\u2212\u03b7)} m. We also know that the total number of unmarked intervals is \u2264 k log_{1/(1\u2212\u03b7)} m, since only split requests increase the number of unmarked intervals. Also, the total number of marked intervals is at most the total number of unmarked intervals, since every marked interval can be charged to a split request. Hence, the total number of intervals is \u2264 2k log_{1/(1\u2212\u03b7)} m.\nTo bound the total number of pure merges, notice that every time a pure merge is made, the size of some interval decreases by at least an \u03b7 fraction. The size of an interval can decrease at most log_{1/(1\u2212\u03b7)} m times. Hence, the total number of pure merges is \u2264 k(log_{1/(1\u2212\u03b7)} m)^2.\nHence, the algorithm makes at most O(k(log_{1/(1\u2212\u03b7)} m)^2) queries.\n\n6 Properties of the Data\n\nWe now adapt the query framework of [BB08] to cluster datasets which satisfy certain natural separation conditions with respect to the target partitioning. For this section, we sometimes write d = \u27e8e1, e2, . . . , e(n choose 2)\u27e9 to mean the set of distances that exist between all pairs of the n points. This list is always ordered by increasing distance. For a definition of the Single-Linkage and Min-Sum clustering functions, please see the appendix.\n\n6.1 Threshold Separation\n\nWe introduce a (strong) property that may be satisfied by d = \u27e8e1, e2, . . . , e(n choose 2)\u27e9 with respect to \u0393, the target clustering. It is important to note that this property is imposing restrictions on d, defined by the data. An inner edge of \u0393 is a distance between two points inside a cluster, while an outer edge is a distance between two points in differing clusters.\n\nSTRICT THRESHOLD SEPARATION. There exists a threshold t > 0 such that all inner edges of \u0393 have distance less than or equal to t, and all outer edges have distance greater than t.\n\nIn other words, the pairwise distances between the data are such that all inner edges of d (w.r.t. \u0393) have distance smaller than all outer edges (again, w.r.t. \u0393). This property gives away a lot of information about \u0393, in that it allows Single-Linkage to fully recover \u0393, as we will see in Theorem 6.1. Before we present the algorithm to interact with the teacher, Theorem 6.1 will be useful (proof omitted).\n\n[Kle03, JS71] introduce the following 3 properties which a clustering function can satisfy. 
An F (d, k)-transformation of d is a change to d such that inner-cluster distances in d are decreased, and outer-cluster distances are increased.\n\n1. CONSISTENCY. Fix k. Let d be a distance function, and let d' be an F (d, k)-transformation of d. Then F (d, k) = F (d', k).\n2. ORDER-CONSISTENCY. For any two distance functions d and d' and number of clusters k, if the order of edges in d is the same as the order of edges in d', then F (d, k) = F (d', k).\n3. k-RICHNESS. For any number of clusters k, Range(F (\u2022, k)) is equal to the set of all k-partitions of S.\n\nTheorem 6.1. Fix k and a target k-partitioning \u0393, and let d be a distance function satisfying Strict Threshold Separation w.r.t. \u0393. Then for any Consistent, k-Rich, Order-Consistent partitioning function F, we have F (d, k) = \u0393.\n\nNote that since Single-Linkage is Consistent, k-Rich, and Order-Consistent [ZBD09], it immediately follows that SL(d, k) = \u0393; in other words, SL is guaranteed to find the target k-partitioning, but we still have to interact with the teacher to find out k. It is a recently resolved problem that Single-Linkage is not the only function satisfying the above properties [ZBD], so the class of Consistent, k-Rich, and Order-Consistent functions has many members. We now present the algorithm to interact with the teacher.\nTheorem 6.2. Given a dataset satisfying Strict Threshold Separation, there exists an algorithm which can find the target partitioning for any hypothesis class in O(log(n)) queries.\n\nProof. Note that the threshold t and the number of clusters k are not known to the algorithm, else the target could be found immediately. By Theorem 6.1, we know that the target must be exactly what Single-Linkage returns for some k, and it remains to find the number of clusters. This can be done using a binary search on the number of clusters, which can vary from 1 to n. 
We start with some candidate k; if the teacher tells us to split anything, we know the number of clusters must be larger, and if we are told to merge, we know the number of clusters must be smaller. Thus we can find the correct number of clusters in O(log(n)) queries.\n\nNote that since strict threshold separation implies strict separation, the O(k) algorithm presented in the next section can also be used, giving O(min(log(n), k)) queries.\n\nStrict Separation: Now we relax strict threshold separation.\n\nSTRICT SEPARATION. All points in the same cluster are more similar to one another than to points outside the cluster.\n\nWith this property, it is no longer true that all inner distances are smaller than outer distances, and therefore Theorem 6.1 does not apply. However, [BBV08] prove the following lemma.\nLemma 6.3. [BBV08] For a dataset satisfying strict separation, let SL(d) be the tree returned by Single-Linkage. Then any partitioning respecting the strict separation of d will be a pruning of SL(d).\nTheorem 6.4. Given a dataset satisfying Strict Separation, there exists an algorithm which can find the target partitioning for any hypothesis class in O(k) queries.\n\nProof. Let the distances between points be represented by the distance function d. By Lemma 6.3 we know that the target partitioning must be a pruning of SL(d). Our algorithm will start by presenting the teacher with all points in a single cluster. Upon a split request, we split according to the relevant node in SL(d). There can be no merge requests, since we always split perfectly. Each split will create a new cluster, so there will be at most k \u2212 1 of these splits, after which the correct partitioning is found.\n\n\u03b3-margin Separation: Margins show up in many learning models, and this is no exception. 
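The split/merge binary search of Theorem 6.2 above can be simulated end to end. The sketch below is our own illustration, not code from the paper: single_linkage is a naive union-find implementation that performs the n - k cheapest merges, and the teacher is assumed to be a simulated oracle answering "split", "merge", or "ok".

```python
def single_linkage(points, dist, k):
    """Return the k-partitioning obtained by merging the closest pairs first."""
    parent = list(range(len(points)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]      # path halving
            x = parent[x]
        return x

    edges = sorted((dist(points[i], points[j]), i, j)
                   for i in range(len(points)) for j in range(i + 1, len(points)))
    merges = len(points) - k                   # n - k merges leave k clusters
    for _, i, j in edges:
        if merges == 0:
            break
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            merges -= 1
    groups = {}
    for i, p in enumerate(points):
        groups.setdefault(find(i), set()).add(p)
    return sorted((frozenset(g) for g in groups.values()), key=min)

def find_k(points, dist, teacher):
    """Binary search on k: a split request means k is too small, a merge too large."""
    lo, hi = 1, len(points)
    while lo < hi:
        k = (lo + hi) // 2
        verdict = teacher(single_linkage(points, dist, k))
        if verdict == "ok":
            return k
        lo, hi = (k + 1, hi) if verdict == "split" else (lo, k - 1)
    return lo
```

On a dataset satisfying strict threshold separation, every Single-Linkage pruning nests with the target partitioning, so each proposal draws exactly one of the two requests and O(log n) queries suffice, as in the theorem.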
A natural assumption is that there may be a separation of at least \u03b3 between points in differing clusters, where the points all lie inside the unit ball.\n\n\u03b3-MARGIN SEPARATION. Points in different clusters of the target partitioning are at least \u03b3 away from one another.\n\nWith this property, we can prove the following for all hypothesis classes:\nTheorem 6.5. Given a dataset satisfying \u03b3-margin Separation, there exists an algorithm which can find the target partitioning for any hypothesis class in O((\u221ad/\u03b3)^d \u2212 k) queries.\n\nProof. We split the unit ball (inside which all points live) into hypercubes with edge length \u03b3/\u221ad. We are interested in the diameter of such a hypercube. The diameter of a d-dimensional hypercube with side \u03b3/\u221ad is \u221ad \u00d7 \u03b3/\u221ad = \u03b3, so no two points inside a hypercube of side \u03b3/\u221ad can be more than \u03b3 apart. It follows that if we split the unit ball up using a grid of hypercubes, all points inside a hypercube must be from the same cluster. We say such a hypercube is \u201cpure\u201d.\nThere are at most O((\u221ad/\u03b3)^d) hypercubes in the unit ball. We show each hypercube as a single cluster to the teacher. Since all hypercubes are pure, we can only get merge requests, of which there can be at most O((\u221ad/\u03b3)^d \u2212 k) until the target partitioning is found.\n\n7 Conclusions and open problems\n\nIn this paper we investigated a recently proposed model of clustering under feedback. We gave algorithms for clustering geometric concepts in the model. For datasets satisfying a spectrum of weak to strong properties, we gave query bounds, and showed that a class of clustering functions containing Single-Linkage will find the target clustering under the strongest property. 
We also studied natural generalizations of the model and gave efficient algorithms for learning intervals in the new models.\nSeveral interesting problems remain:\n\n1. Give algorithms for clustering other classes of functions, for example linear separators, in the original model.\n2. Give efficient algorithms for clustering geometric concept classes in the new models.\n3. Establish connections between the proposed models and the Equivalence Query model of learning.\n4. In [BB08], the authors give an algorithm for learning the class of disjunctions. It would be interesting to come up with an attribute-efficient version of the algorithm, similar in spirit to the Winnow algorithm [Lit87].\n\nReferences\n\n[ABD09] M. Ackerman and S. Ben-David. Clusterability: A theoretical study. In Proceedings of AISTATS-09, JMLR: W&CP, 5:1\u20138, 2009.\n[AL10] M. Ackerman, S. Ben-David, and D. Loker. Characterization of linkage-based clustering. In COLT, 2010.\n[Ang98] D. Angluin. Queries and concept learning. Machine Learning, 2:319\u2013342, 1998.\n[BB08] M.-F. Balcan and A. Blum. Clustering with interactive feedback. In ALT, 2008.\n[BBV08] M.-F. Balcan, A. Blum, and S. Vempala. A discriminative framework for clustering via similarity functions. In Proceedings of the 40th ACM Symposium on Theory of Computing, 2008.\n[Blu09] A. Blum. Thoughts on clustering. In NIPS Workshop on Clustering Theory, 2009.\n[CGTS99] M. Charikar, S. Guha, E. Tardos, and D. B. Shmoys. A constant-factor approximation algorithm for the k-median problem. In ACM Symposium on Theory of Computing, 1999.\n[Das99] S. Dasgupta. Learning mixtures of Gaussians. In Proceedings of the 40th Annual Symposium on Foundations of Computer Science, 1999.\n[GvLW09] I. Guyon, U. von Luxburg, and R. C. Williamson. Clustering: Science or art? In NIPS Workshop on Clustering Theory, 2009.\n[JS71] N. Jardine and R. Sibson. Mathematical Taxonomy. Wiley, New York, 1971.\n[Kle03] J. Kleinberg. An impossibility theorem for clustering. In Advances in Neural Information Processing Systems 15, page 463. The MIT Press, 2003.\n[KVV00] R. Kannan, S. Vempala, and A. Vetta. On clusterings: good, bad and spectral. In FOCS \u201900: Proceedings of the 41st Annual Symposium on Foundations of Computer Science, 2000.\n[Lit87] N. Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2(4), 1987.\n[Vap98] V. N. Vapnik. Statistical Learning Theory. John Wiley and Sons, 1998.\n[ZBD] R. Bosagh Zadeh and S. Ben-David. Axiomatic characterizations of Single-Linkage. In submission.\n[ZBD09] R. Bosagh Zadeh and S. Ben-David. A uniqueness theorem for clustering. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence, 2009.\n", "award": [], "sourceid": 427, "authors": [{"given_name": "Pranjal", "family_name": "Awasthi", "institution": null}, {"given_name": "Reza", "family_name": "Zadeh", "institution": null}]}