{"title": "Same-Cluster Querying for Overlapping Clusters", "book": "Advances in Neural Information Processing Systems", "page_first": 10485, "page_last": 10495, "abstract": "Overlapping clusters are common in models of many practical data-segmentation applications. Suppose we are given $n$ elements to be clustered into $k$ possibly overlapping clusters, and an oracle that can interactively answer queries of the form ``do elements $u$ and $v$ belong to the same cluster?'' The goal is to recover the clusters with minimum number of such queries. This problem has been of recent interest for the case of disjoint clusters. In this paper, we look at the more practical scenario of overlapping clusters, and provide upper bounds (with algorithms) on the sufficient number of queries. We provide algorithmic results under both arbitrary (worst-case) and statistical modeling assumptions. Our algorithms are parameter free, efficient, and work in the presence of random noise. We also derive information-theoretic lower bounds on the number of queries needed, proving that our algorithms are order optimal. Finally, we test our algorithms over both synthetic and real-world data, showing their practicality and effectiveness.", "full_text": "Same-Cluster Querying for Overlapping Clusters\n\nWasim Huleihel\nDepartment of Electrical Engineering\nTel-Aviv University\nTel-Aviv, Israel 6997801\nwasimh@mail.tau.ac.il\n\nArya Mazumdar\nCollege of Information & Computer Sciences\nUniversity of Massachusetts Amherst\nAmherst, MA 01003\narya@cs.umass.edu\n\nMuriel M\u00e9dard\nElectrical Engineering & Computer Science\nMassachusetts Institute of Technology\nCambridge, MA 02139\nmedard@mit.edu\n\nSoumyabrata Pal\nCollege of Information & Computer Sciences\nUniversity of Massachusetts Amherst\nAmherst, MA 01003\nsoumyabratap@umass.edu\n\nAbstract\n\nOverlapping clusters are common in models of many practical data-segmentation applications. 
Suppose we are given $n$ elements to be clustered into $k$ possibly overlapping clusters, and an oracle that can interactively answer queries of the form \u201cdo elements $u$ and $v$ belong to the same cluster?\u201d The goal is to recover the clusters with a minimum number of such queries. This problem has been of recent interest for the case of disjoint clusters. In this paper, we look at the more practical scenario of overlapping clusters, and provide upper bounds (with algorithms) on the sufficient number of queries. We provide algorithmic results under both arbitrary (worst-case) and statistical modeling assumptions. Our algorithms are parameter free, efficient, and work in the presence of random noise. We also derive information-theoretic lower bounds on the number of queries needed, proving that our algorithms are order optimal. Finally, we test our algorithms over both synthetic and real-world data, showing their practicality and effectiveness.\n\n1 Introduction\n\nRecently, semi-supervised models of clustering that allow active querying during data segmentation have become quite popular. This includes active learning, as well as data labeling by amateurs via crowdsourcing. A clever implementation of an interactive querying framework can improve the accuracy of clustering and help in inferring the labels of large amounts of data by issuing only a small number of queries. Interactions can easily be implemented (e.g., via captcha), especially if queries involve few data points, like pairwise queries of whether two points belong to the same cluster or not [1, 2, 4, 9, 13, 15, 18, 19, 21, 22, 23, 25].\n\nUntil now, querying models and algorithms/lower bounds have been highly tailored towards flat clustering that produces a partition of the data. Consider the problem of clustering from pairwise queries such as above when an element can be part of multiple clusters. 
Such overlapping clustering instances are ubiquitous across areas and many times are a more practical model of data segmentation; see [3, 5, 20, 26]. Indeed, overlapping models are quite natural for communities in social networks or topic models [16]. In the supervised version of the problem every element (or data-point) can have multiple labels, and we would like to know all the labels. To see how the querying might work here, consider the following input: {Tiger Shark, Grizzly Bear, Blue Whale, Bush Dog, Giant Octopus, Ostrich, Komodo Dragon}.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nThis set can be clustered into the mammals {Grizzly Bear, Blue Whale, Bush Dog}, marine-life {Tiger Shark, Blue Whale, Giant Octopus}, non-mammals {Tiger Shark, Giant Octopus, Ostrich, Komodo Dragon}, and land-dwellers {Grizzly Bear, Bush Dog, Ostrich, Komodo Dragon}. Quite clearly, this ideal clustering (without labels) is overlapping. If a query of whether two elements belong to the same cluster or not is made, then the answer should be \u2018yes\u2019 if there exists a cluster where they appear together. If we form a response matrix of size $7 \times 7$, with rows and columns indexed by the order the elements appeared above in the list and entries being the binary answers to queries, then the matrix would be the following:\n\n     TS GB BW BD GO Os KD\nTS [ *  0  1  0  1  1  1 ]\nGB [ 0  *  1  1  0  1  1 ]\nBW [ 1  1  *  1  1  0  0 ]\nBD [ 0  1  1  *  0  1  1 ]\nGO [ 1  0  1  0  *  1  1 ]\nOs [ 1  1  0  1  1  *  1 ]\nKD [ 1  1  0  1  1  1  * ]\n\nWhat is the minimum number of adaptive queries that we should make to the above matrix so that it is possible to recover the clusters? In the case when the clusters are not overlapping, the answer is $nk$, where $n$ is the number of elements and $k$ is the number of possible clusters [19]. For the case of overlapping clusters it is not clear whether there is a unique clustering that explains the responses. For this, certain extra constraints must be placed: for example, a reasonable assumption is that an element can only be part of $\Delta$ clusters, among the total $k$ possible clusters, $\Delta \ll k$.\n\nJust like the response matrix above, it is possible to form a similarity matrix. The $(i, j)$th entry of this matrix is simply the number of clusters where the $i$th and the $j$th elements coexist. It is clear that the response matrix is just a quantized version of the similarity matrix. Even when the entire similarity matrix is given, there is no guarantee on the uniqueness of an overlapping clustering, unless further assumptions are made. In this paper, we primarily aim to recover the clustering from a limited number of adaptive queries to the response matrix. However, in terms of uniqueness guarantees, we often have to stop at the uniqueness of the similarity matrix.\n\nMain results and Techniques. Recovery of overlapping clusters from budgeted same-cluster queries is widely open, and significantly more challenging than the flat clustering counterpart. In fact, none of the techniques proposed in prior literature as mentioned above extends easily to the case when the clusters may overlap. In this paper we tackle this problem for various types of responses. Specifically, in our setting there is an oracle having access to the similarity matrix $AA^T$, where $A$ is the $n \times k$ clustering matrix whose $i$th row is the indicator vector of the cluster membership of the $i$th element. In its most powerful mode, when queried the oracle provides the number of clusters where the $i$th and the $j$th elements coexist, namely the values of the entries of the matrix $AA^T$. 
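The relationship between the clustering matrix $A$, the similarity matrix $AA^T$, and the binary response matrix can be reproduced in a few lines of code. The following is a minimal sketch for the animal example above (the variable names and the cluster dictionary are our own illustration, not the paper's notation):

```python
import numpy as np

# Membership matrix A for the running example: rows are the 7 elements
# (TS, GB, BW, BD, GO, Os, KD), columns are the 4 clusters.
elements = ["TS", "GB", "BW", "BD", "GO", "Os", "KD"]
clusters = {
    "mammals":       {"GB", "BW", "BD"},
    "marine-life":   {"TS", "BW", "GO"},
    "non-mammals":   {"TS", "GO", "Os", "KD"},
    "land-dwellers": {"GB", "BD", "Os", "KD"},
}
A = np.array([[int(e in members) for members in clusters.values()]
              for e in elements])

similarity = A @ A.T                     # (i, j): number of shared clusters
response = (similarity > 0).astype(int)  # quantized same-cluster answers

# Tiger Shark and Blue Whale share exactly one cluster (marine-life),
# so the same-cluster answer for that pair is 1 ...
assert similarity[0, 2] == 1 and response[0, 2] == 1
# ... while Tiger Shark and Grizzly Bear share no cluster.
assert response[0, 1] == 0
```

The off-diagonal entries of `response` match the response matrix shown above, illustrating that the response matrix is a quantized version of $AA^T$.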
It turns out, however, that even if one knows the matrix $AA^T$ perfectly, it is not always possible to recover $A$ up to a permutation of columns (footnote 1). In fact, if no assumption is imposed on the clustering matrix $A$, then recovering $A$ from $AA^T$ is intractable in general. Indeed, even just finding conditions on $A$ such that the factorization is unique (up to permutations) is related to the famous unsolved question of \u201cfor which orders does a finite projective plane exist?\u201d [11]. It is then clear that we need to impose some assumptions. We tackle this inherent problem via two different approaches. First, in Sections 3.1 and 3.2, we propose two generative models for $A$: 1) a uniform ensemble, where a given element can only be part of $\Delta \ge 1$ clusters (footnote 2), among the total of $k$ clusters, and its membership is drawn uniformly at random among all $\binom{k}{\Delta}$ possible placements; 2) an i.i.d. ensemble, where the matrix $A$ is generated with i.i.d. Bernoulli entries. Then, for these two ensembles, we investigate the above fundamental question, and derive sufficient conditions under which this factorization is unique, along with quasi-polynomial worst-case complexity algorithms for recovering $A$ from $AA^T$. The main immediate implication of this result is that, under certain conditions, the clustering recovery problem reduces to recovering the similarity matrix, making this objective our main task.\n\nWhile the above generative models allow us to obtain elegant and neat theoretical results, one might argue that they may not capture many challenges existing in real-world overlapping clustering problems. 
To this end, in Section 3.3, we go beyond the above generative models and analyze a general worst-case model with no statistical assumptions. Then, under certain realistic assumptions on the clustering matrix, we provide and analyze algorithms solving the recovery problem.\n\nFootnote 1: Given $AA^T$, it is only possible to recover the clustering matrix $A$ up to a permutation of columns, see Section 2.\nFootnote 2: The case where different items may belong to different numbers of clusters can be handled using the same techniques developed in this paper.\n\nIn practice, however, the aforementioned \u2018value\u2019 oracle responses might be quite expensive. Accordingly, we also study quantized and noisy variants of these responses. For example, instead of getting direct values from $AA^T$, the oracle only supplies the learner with (possibly noisy) binary answers on whether an arbitrarily picked pair of elements $(i, j)$ appears together in some cluster or not (a \u2018same-cluster query\u2019). We also consider the case of a dithered oracle, where noise is injected before quantization. For these scenarios (and others), we provide both lower and upper bounds on the number of queries needed for exact recovery. Our lower bounds are obtained using standard information-theoretic results, such as Fano\u2019s inequality. For the upper bounds, we design novel randomized algorithms for recovering the similarity matrix, and further show that these algorithms work even when the noise parameter is not given in advance. For example, when $k = O(\log n)$ and $\Delta \ll k$, we show that the sufficient and necessary number of quantized queries is $\Theta(n \log n)$, for the uniform ensemble. Finally, we test our algorithms over both synthetic and real-world data, showing the practicality and effectiveness of our algorithms.\n\nRelated Work. 
As mentioned above, there is a series of applied and theoretical works studying the query complexity of \u2018same-cluster\u2019 queries for objective-based clustering (such as $k$-means) and clustering with statistical generative models. In all these cases, though, the clusters are assumed to be non-overlapping. From a practical standpoint, entity resolution via crowdsourced pairwise same-entity queries was studied in [13, 17, 24, 25, 22, 27]. The effect of (possibly noisy) \u2018same-cluster\u2019 queries in similarity-matrix-based clustering has been studied in [18, 19, 1]. On the other hand, the study of the effect of \u2018same-cluster\u2019 queries on the efficiency of $k$-means clustering was initiated in [4], and subsequently continued in [27, 2, 9].\n\nIn our approach, we crucially use the \u2018low-rank\u2019 structure of the similarity matrix to recover the clustering from a bounded number of responses. Low-rank matrix completion is a well-studied topic in statistics and data science [7, 14]. It is possible to obtain weaker versions of some of our results by relying on the results of low-rank matrix completion as black boxes. However, the specific structure of the similarity matrix under consideration allows us to obtain stronger results. The response matrix is a quantized version of a low-rank matrix. Querying entries of the response matrix can be seen as a so-called 1-bit matrix completion problem [12]. However, most of the recovery guarantees for 1-bit matrix completion depend crucially on certain dither noise (see [12]), which may not be allowed in our setting. Finally, we mention [6], where the problem of overlapping clustering was considered from a different point of view.\n\nOrganization. The remaining part of this paper is organized as follows. The model and the learning problem are provided in Section 2. Our main results on the query complexity are presented in Section 3. 
In particular, we provide upper bounds (with algorithms) on the sufficient number of queries for each of the scenarios investigated in this paper. These results are also accompanied by information-theoretic lower bounds on the necessary number of queries, which are presented in the appendix due to the page limitation. Finally, Section 4 is devoted to a numerical study, where our main results are illustrated empirically. Detailed proofs of all the theoretical results can be found in the supplementary material.\n\n2 Model and Learning Problem\n\nConsider a set of elements $\mathcal{N} \equiv [n]$ with $k$ latent clusters $\mathcal{N}_i$, for $i = 1, 2, \ldots, k$, such that each element in $\mathcal{N}$ belongs to at least one cluster. This data is represented by an $n \times k$ matrix $A$, where $A_{i,j} = 1$ if the $i$th element is in the $j$th cluster. We will denote the $k$-dimensional binary vector representing the cluster membership of the $i$th element by $A_i$ (i.e., the $i$th row of $A$), and will henceforth refer to it as the $i$th membership vector. In our setting there is an oracle $\mathcal{O} : \mathcal{N} \times \mathcal{N} \to \mathcal{D}$ that, when queried with a pair of elements $(i, j) \in \mathcal{N} \times \mathcal{N}$, returns a natural number $L \in \mathcal{D} \subset \mathbb{N}$ according to some pre-defined rule. We shall refer to $\mathcal{O}$ as the oracle map. The queries $\Omega \subseteq \mathcal{N} \times \mathcal{N}$ can be made adaptively. Our goal is to find the set $\Omega \subseteq \mathcal{N} \times \mathcal{N}$ such that $|\Omega|$ is minimum, and it is possible to recover $\{\mathcal{N}_i\}_{i=1}^k$ from the oracle answers. More specifically, the oracle has access to the similarity matrix $AA^T$, and when queried with $\Omega$, answers according to $\mathcal{O}$. Given $(i, j) \in \mathcal{N} \times \mathcal{N}$, we consider the following oracle maps $\mathcal{O}$, capturing several aspects of the problem:\n\n\u2022 Direct responses: The oracle response is $\mathcal{O}_{\mathrm{direct}}(i, j) = A_i^T A_j$, namely, the number of clusters that elements $i$ and $j$ belong to simultaneously. Note that when the clusters are disjoint, the output is simply an answer to the question \u201cdo elements $i$ and $j$ belong to the same cluster?\u201d\n\u2022 Quantized (noisy) responses: The oracle response is $\mathcal{O}_{\mathrm{quantized}}(i, j) = Q(A_i^T A_j) \oplus W_{i,j}$, where $Q(x) \triangleq 1$ for $x > 0$, and $0$ otherwise, and $W_{i,j} \sim \mathrm{Bernoulli}(q)$, with $0 \le q \le 1$, independent over pairs $(i, j)$. In the noiseless case, i.e., $q = 0$, the oracle response is whether elements $i$ and $j$ appear together in at least one cluster or not. In the noisy case, the oracle response is the quantized response with probability $1 - q$, and flipped with probability $q$. This can be interpreted as if the quantized responses are further sent through a binary symmetric channel $\mathrm{BSC}(q)$.\n\u2022 Dithered responses: The oracle response is $\mathcal{O}_{\mathrm{dithered}}(i, j) = Q(A_i^T A_j + Z_{i,j})$, where $Z_{i,j} \sim \mathrm{Normal}(0, \sigma^2)$, independent over pairs $(i, j)$. In other words, the oracle outputs a dithered and quantized version of the direct responses.\n\nFor simplicity of notation, throughout the rest of this paper we will denote the oracle response to query $(i, j) \in \mathcal{N} \times \mathcal{N}$ by $Y_{ij}$, irrespective of the oracle model, which will be clear from the context.\n\nIf we permute the columns of $A$, the gram matrix $AA^T$ remains the same. Therefore, finding $A$ is only possible up to a permutation of columns. Unfortunately, it turns out that even if we know $AA^T$ perfectly it is not always possible to find $A$ up to a permutation of columns, namely, the factorization may not be unique. As an example, consider the following matrices\n\n[ 0 0 1 1 ]      [ 1 1 0 0 ]\n[ 0 1 0 1 ] and  [ 0 1 1 0 ]\n[ 1 1 0 0 ]      [ 0 0 1 1 ]\n[ 1 0 0 1 ]      [ 1 0 1 0 ]\n\nwhich have the same gram matrix but evidently are not column permutations of each other. Hence, even if we observe all the entries of the gram matrix, it is not possible to distinguish between these two matrices. 
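This non-uniqueness is easy to check numerically. Below is a small sketch (our own rendition of a pair of matrices of the kind discussed above) verifying that two binary matrices can share a gram matrix without being column permutations of each other:

```python
import numpy as np
from itertools import permutations

# Two binary clustering matrices with identical gram matrices that are
# nevertheless not column permutations of each other.
A1 = np.array([[0, 0, 1, 1],
               [0, 1, 0, 1],
               [1, 1, 0, 0],
               [1, 0, 0, 1]])
A2 = np.array([[1, 1, 0, 0],
               [0, 1, 1, 0],
               [0, 0, 1, 1],
               [1, 0, 1, 0]])

# Same similarity matrix ...
assert np.array_equal(A1 @ A1.T, A2 @ A2.T)

# ... yet no reordering of A2's columns reproduces A1 (A1 has the
# weight-1 column (1,0,0,0)^T, which A2 does not contain).
assert not any(np.array_equal(A1, A2[:, list(p)])
               for p in permutations(range(A2.shape[1])))
```

An exhaustive check over all $4! = 24$ column permutations suffices here because $k = 4$ is tiny; the point is only that observing $AA^T$ exactly still leaves an ambiguity in $A$.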
We tackle this inherent problem via two different approaches.\n\nGenerative Models. We consider two generative models (random ensembles) for $A$: the uniform and i.i.d. ensembles, defined as follows. Given $k, \Delta \in \mathbb{N}$, define the set\n\n$T_k(\Delta) \triangleq \{c \in \{0, 1\}^k : w_H(c) = \Delta\}$,   (1)\n\nas the set of all $k$-length binary sequences with Hamming weight ($w_H$) $\Delta$. Then, we say that $A$ belongs to the uniform ensemble if $A$ is formed by drawing its $n$ rows independently from $T_k(\Delta)$. In the latter ensemble, the matrix $A$ is an i.i.d. matrix, with each entry being a $\mathrm{Bernoulli}(p)$ random variable, where $0 \le p \le 1$. As mentioned above, we are interested in exact recovery of the clusters $\{\mathcal{N}_i\}_{i=1}^k$, or equivalently, the clustering matrix $A$ up to a permutation of the columns of $A$. More precisely, we define the average probability of error associated with an algorithm which outputs an estimate $\hat{A}$ of $A$ by $P_{\mathrm{error}} \triangleq \mathbb{P}\big(\bigcap_{\pi \in S_k} \{\hat{A} \neq A P_\pi\}\big)$, where $P_\pi$ is the permutation matrix corresponding to the permutation $\pi : [k] \to [k]$, and $S_k$ is the symmetric group acting on $[k]$. Accordingly, we say that an algorithm properly recovered $A$ if $P_{\mathrm{error}}$ is small. This recovery criterion follows from the fact that clustering is invariant to a permutation of the labels.\n\nIn contrast to the above negative example, where two different matrices have the same gram matrix, under certain weak conditions presented in Appendix A (see Lemmas 3 and 4), we show that if the matrix $A$ is generated according to either one of the above random ensembles, the factorization is unique up to column permutations. These results have a straightforward implication: under the conditions of Lemmas 3 and 4, the clustering recovery problem (i.e., recovering $A$) reduces to the recovery of the similarity matrix $AA^T$, given partial observations $\Omega$ of its entries through the oracle $\mathcal{O}$. 
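Both ensembles can be sampled directly from their definitions. The sketch below (the helper names are ours, chosen for illustration) draws a clustering matrix from each:

```python
import numpy as np

def sample_uniform_ensemble(n, k, delta, rng):
    """Rows drawn independently and uniformly from T_k(delta): the
    binary k-vectors of Hamming weight exactly delta (cf. Eq. (1))."""
    A = np.zeros((n, k), dtype=int)
    for i in range(n):
        # A uniformly random delta-subset of [k] gives a uniform draw
        # from T_k(delta).
        A[i, rng.choice(k, size=delta, replace=False)] = 1
    return A

def sample_iid_ensemble(n, k, p, rng):
    """Every entry an independent Bernoulli(p) random variable."""
    return (rng.random((n, k)) < p).astype(int)

rng = np.random.default_rng(0)
A_unif = sample_uniform_ensemble(n=200, k=8, delta=3, rng=rng)
A_iid = sample_iid_ensemble(n=200, k=8, p=0.25, rng=rng)

# Every row of the uniform ensemble has Hamming weight exactly delta.
assert (A_unif.sum(axis=1) == 3).all()
# Both ensembles yield binary clustering matrices.
assert set(np.unique(A_unif)) <= {0, 1} and set(np.unique(A_iid)) <= {0, 1}
```

Note that a row of the i.i.d. ensemble can have Hamming weight zero, whereas the uniform ensemble fixes every element's number of clusters to exactly $\Delta$.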
To actually recover $A$ from $AA^T$ we propose Algorithms 4 and 5, which appear in Appendix A, for the uniform and i.i.d. ensembles, respectively.\n\nWorst-case Model. While the above generative models allow us to obtain elegant theoretical results, they may be too idealistic for real-world clustering applications. To this end, we also consider a general clustering model where each element in $\mathcal{N}$ can belong to at most $\Delta \le k$ clusters. Note that here each element may belong to a different number of clusters. We show that under certain geometric conditions on the clusters, recovery is possible using a simple, efficient, and parameter-free algorithm.\n\nAlgorithm 1 Findmembership: The algorithm for extracting the clustering matrix via queries to the oracle.\nRequire: Number of elements $N$, number of clusters $k$, oracle responses $\mathcal{O}_{\mathrm{direct}}(i, j)$ for query $(i, j) \in \Omega$, where $i, j \in [N]$.\n1: Choose a set $S$ of elements drawn uniformly at random from $\mathcal{N}$, and perform all pairwise queries corresponding to these $|S|$ elements.\n2: Extract the membership of all the $|S|$ elements and find representatives $T$ for the $k$ clusters.\n3: Query each of the remaining $n - |S|$ elements with all elements present in $T$.\n4: Return the clusters.\n\nAlgorithm 2 FindSimilarity: The algorithm for extracting the similarity matrix $AA^T$ via queries to the oracle.\nRequire: Number of elements $N$, number of clusters $k$, oracle responses $\mathcal{O}_{\mathrm{direct}}(i, j)$ for query $(i, j) \in \Omega$, where $i, j \in [N]$.\n1: Choose a set $S$ of elements drawn uniformly at random from $\mathcal{N}$, and perform all pairwise queries corresponding to these $|S|$ elements.\n2: Extract a valid membership of all the $|S|$ elements by rank factorization of $A_S A_S^T$. Then, find a set $T \subseteq S$ that forms a basis of $\mathbb{R}^k$.\n3: Query each of the remaining $n - |S|$ elements with all elements present in $T$. Subsequently, solve for the membership vector of the unknown element.\n4: Return the similarity matrix $AA^T$.\n\n3 Algorithms and Their Performance Guarantees\n\nIn this section, we present our main theoretical results. Specifically, in Subsections 3.1 and 3.2 our main results on generative models are given. Subsection 3.3 is devoted to the worst-case model. Due to space limitations, a summary table showing our lower and upper bounds on the sample complexities, theoretical results concerning dithered oracle responses, information-theoretic lower bounds, as well as most of our pseudo-code algorithms, appear in the appendices.\n\n3.1 Direct Responses\n\nAs a warm-up we start with the case of disjoint clusters. In this case, no statistical generative assumption on $A$ is needed. The simple algorithm (Algorithm 1) for this case serves as a building block for the other, more complicated scenarios considered in this paper.\n\nProposition 1. There exists a poly-time algorithm, i.e., Algorithm 1, which with probability at least $1 - n^{-\varepsilon}$ recovers exactly the set of clusters $\mathcal{N}_1, \ldots, \mathcal{N}_k$, $\mathcal{N}_i \cap \mathcal{N}_j = \emptyset$ for $i \neq j$, using $|\Omega| \ge k \cdot (n - m) + \binom{m}{2}$ queries, $m = (n / n_{\min}) \log(k n^{\varepsilon})$, where $\varepsilon > 0$ and $n_{\min}$ is the size of the smallest cluster.\n\nProof Outline: Pick $m$ elements uniformly at random from $\mathcal{N}$, and perform all $\binom{m}{2}$ pairwise queries among these $m$ elements. It can be shown that if $m \ge (n / n_{\min}) \log(k n^{\varepsilon})$, then with probability $1 - n^{-\varepsilon}$, among these $m$ elements there will exist at least one element (representative) from each cluster. Finally, for the remaining $(n - m)$ items, we perform at most $k$ queries to decide which cluster they belong to.\n\nFrom Prop. 
1, when $n_{\min} = \Omega(n/k)$, the number of queries needed is $\Omega(kn)$. This result should be contrasted with standard matrix completion results with uniform sampling, which state that $O(kn \log n)$ queries are needed [8]. Next, we consider the overlapping case, where $\mathcal{N}_i \cap \mathcal{N}_j \neq \emptyset$. In this case the similarity matrix $AA^T$ is no longer binary. For a set $S \subseteq [n]$, with $m = |S|$, we let $A_S$ be the $m \times k$ projection matrix formed by the rows of $A$ that correspond to the indices in $S$. We have the following result.\n\nTheorem 1. There exists a polynomial-time algorithm, given in Algorithm 2, which with probability at least $1 - \mathrm{poly}(n^{-1})$ recovers exactly the set of (overlapping) clusters $\mathcal{N}_1, \ldots, \mathcal{N}_k$, using $|\Omega| \ge \binom{|S|}{2} + k \cdot (n - |S|)$ queries, where $|S| > S_{\mathrm{uniform}} \triangleq \frac{\binom{k}{\Delta}}{\binom{k-\Delta}{\Delta-1}}\,[1 + c_1 \log k + c_2 \log n]$, for the uniform ensemble, and $|S| > S_{\mathrm{i.i.d.}} \triangleq k - 1 - \frac{\log k + c_3 \log n}{\log \max(p, 1-p)}$, for the i.i.d. ensemble, with $c_1, c_2, c_3 > 0$ arbitrary positive numbers.\n\nLet us explain the main idea behind Theorem 1. It is evident from Algorithm 2 that as long as we get a valid subset of elements $T \subseteq S$ whose membership vectors form a basis of $\mathbb{R}^k$, then querying a particular element $i \in \mathcal{N}$ with all elements in $T$ gives $k$ linearly independent equations in the $k$ variables that denote the membership of the $i$th element to the different clusters. Subsequently, we can solve this system of equations uniquely to obtain the membership vector of the $i$th element. Hence, if we choose $|S|$ such that there exists a valid subset of $S$ forming a basis of $\mathbb{R}^k$ with high probability, then we will be done, and the sample complexity will be $\binom{|S|}{2} + k(n - |S|)$. Lemmas 5 and 6 (see Appendix B), respectively, show that if $|S| > S_{\mathrm{uniform}}$ for the uniform ensemble, and $|S| > S_{\mathrm{i.i.d.}}$ for the i.i.d. ensemble, then the above property holds.\n\nRemark 2. Note that in the second step of Algorithm 2 we perform a rank factorization of the matrix $A_S A_S^T$. However, this factorization is not guaranteed to be unique, and accordingly, the resulting rank-factorized matrix might be wrong. However, we show in the supplementary material that even if this is the case, Algorithm 2 will nevertheless recover the true similarity matrix.\n\n3.2 Quantized Noisy Responses\n\nWe next move to the case where the oracle responses are quantized and noisy, namely, when queried with $(i, j)$, the oracle output is $\mathcal{O}_{\mathrm{quantized}}(i, j) = Q(A_i^T A_j) \oplus W_{i,j}$, where $W_{i,j} \sim \mathrm{Bernoulli}(q)$. We start with the uniform ensemble, for which we have the following result.\n\nTheorem 2. Assume that $A$ was generated according to the uniform ensemble, with $k \ge 3\Delta$. Then, there exists a polynomial-time algorithm, given in Algorithm 6, which with probability $1 - n^{-\varepsilon}$, recovers the similarity matrix $AA^T$, using $|\Omega| \ge \binom{|S|}{2} + |S| \cdot (n - |S|)$ queries, where for any $\varepsilon > 0$,\n\n$|S| > 2(1 - 2q)^{-4} \binom{k}{\Delta}^2 \left[\binom{k - 2\Delta + 1}{\Delta} - \binom{k - 2\Delta}{\Delta}\right]^{-2} \log(2n^{2+\varepsilon})$.   (2)\n\nThe main idea behind Algorithm 6 is the following: we first choose a random subset $S \subseteq \mathcal{N}$ of elements, such that (2) holds, and perform all pairwise queries among these elements. Using the resultant queries we infer the unquantized inner products $A_i^T A_j$, for any $(i, j) \in S$. To this end, we count the number of elements which are similar to both the profile of element $i$ and that of element $j$ (see the definition of $T_{ij}$ in Algorithm 6). Intuitively, it makes sense that the more similar the two elements $i$ and $j$ themselves are, the larger the number of elements similar to both of them should be. 
We show that the condition in (2) suffices to make the count highly concentrated around its mean, and accordingly, the algorithm outputs the true value of $A_i^T A_j$. Finally, the remaining $(n - |S|)$ elements are queried with the elements in $S$, and then we apply the above inference procedure once again. We emphasize here that the exponential dependency of the upper bounds on $\Delta$ is inherent, as the information-theoretic lower bounds in Appendix I suggest.\n\nIt turns out that the above idea is capable of handling the other scenarios considered in this paper, albeit with certain technical modifications. Indeed, for the i.i.d. ensemble, we need an additional step before we can use the idea mentioned above. This is mainly because analyzing the aforementioned count statistic requires knowledge of the support sizes of $A_i$ and $A_j$ (which are fixed in the uniform ensemble). An easy way around this problem is to first infer the $\ell_0$-norm of every element by counting the number of other elements that are similar to it. As before, under certain conditions, this count behaves differently for different values of the actual $\ell_0$-norm, and therefore we can infer the correct value. Once this step is done, everything else falls into place. Due to space limitations we relegate the pseudo-algorithm for the i.i.d. setting to the appendices. We have the following result.\n\nTheorem 3. Assume that $A$ was generated according to the i.i.d. ensemble. Then, there exists a polynomial-time algorithm, given in Algorithm 9, which with probability $1 - n^{-\varepsilon}$, recovers the similarity matrix $AA^T$, using $|\Omega| \ge \binom{|S|}{2} + |S| \cdot (n - |S|)$ queries, where for any $\varepsilon > 0$,\n\n$|S| > 2p^{-2}(1 - 2q)^{-4}(1 - p)^{2-2k} \log(2n^{2+\varepsilon})$.   (3)\n\nIn practice, the value of the noise parameter $q$ might be unknown to the learner. 
In this case, we will not know the expected values of the triangle counts under the different hypotheses a priori, and thus our previous algorithms cannot be used directly. Fortunately, however, it turns out that with a simple modification, our algorithms can be used also when $q$ is unknown. We have the following result, stated for the uniform ensemble. A similar result can be obtained also for the i.i.d. ensemble.\n\nTheorem 4. Assume that $A$ was generated according to the uniform ensemble with $k \ge 3\Delta$, and $n > 10 \binom{k}{\Delta} \log n$. Then, there exists a polynomial-time algorithm, given in Algorithm 14, independent of the noise parameter $q$, which with probability $1 - n^{-\varepsilon}$, recovers the similarity matrix $AA^T$, using $|\Omega| \ge \binom{|S|}{2} + |S| \cdot (n - |S|)$ queries, where for any $\varepsilon > 0$,\n\n$|S| > 18(1 - 2q)^{-4} \binom{k}{\Delta}^2 \left[\binom{k - 2\Delta + 1}{\Delta} - \binom{k - 2\Delta}{\Delta}\right]^{-2} \log(2n^{2+\varepsilon})$.   (4)\n\nComparing Theorems 2 and 4, we notice that the query complexity grows by a multiplicative constant factor only. Note that the additional technical condition $n > 10 \binom{k}{\Delta} \log n$ is rather weak, and is naturally satisfied, for example, in the regime $k = O(\log n)$. Finally, note that since we deal with quantized responses without any continuous dithering, matrix completion results cannot be used. In fact, without dithering, matrix completion algorithms will fail on quantized data [12], as they do not exploit the discrete structure of the data, which is the main source of the success of our algorithms.\n\n3.3 Beyond Generative Models: Arbitrary Worst-Case Instances\n\nIn this subsection, we consider the worst-case model, where we do not impose any statistical assumptions, and assume that each element belongs to at most $\Delta$ clusters. 
We focus on noiseless quantized oracle responses, but also discuss the direct-responses scenario in Section 4. For this case, we propose Algorithm 3. We have the following result.\n\nTheorem 5. Let $\mathcal{N}_i$ be the set of elements which belong to the $i$th cluster. If, for every cluster $i \in [k]$, we have $|\mathcal{N}_i \setminus \{\bigcup_{j : j \neq i} \mathcal{N}_j\}| > \alpha \cdot n$, for some $\alpha > 0$, then by using Algorithm 3, $\binom{|S|}{2} + |S| \cdot (n - |S|)$ queries are sufficient to recover the clusters, where $\alpha \cdot |S| = \log k + \log n$.\n\nAs mentioned above, Algorithm 3 is parameter free, does not require knowledge of $\Delta$, and is efficient. For the special case of $\Delta = 2$, we show in Appendix J (see Theorem 9) that the same result holds under less restrictive conditions than those in Theorem 5. In fact, in Appendix K we conjecture that Theorem 5 holds true under a similar assumption as in Theorem 9. Depending on the dataset, the scaling of $\alpha$ in Theorem 5 w.r.t. $(\Delta, k, n)$ may vary widely. For example, in the non-overlapping case, $\alpha = k_{\min}/n \le 1/k$, where $k_{\min}$ is the size of the smallest cluster, which implies that the query complexity in the best scenario is $O(nk \log n)$, consistent with our results in the previous section. In the worst case, a positive $\alpha$ could be as small as $1/n$ (unreasonable in real-world datasets), which implies a query complexity of $O(n^2)$. This is much higher than our average-case results, as expected. More generally, note that $\alpha$ decreases as a function of $\Delta$, which implies that the query complexity increases with $\Delta$. For example, consider the example of 3 equally-sized clusters $A$, $B$ and $C$. Suppose $\Delta = 1$; in that case $|A \setminus (B \cup C)| = |A| = n/3$, implying that $\alpha = 1/3$. Now suppose that $\Delta = 2$. 
In this case A ∩ B and A ∩ C are non-empty and therefore |A \ (B ∪ C)| = |A| − |A ∩ B| − |A ∩ C| < n/3, namely, α is less than 1/3.

Algorithm 3 Worst-case quantized responses
Require: N, k, and oracle responses O_quantized(i, j) for every query (i, j) ∈ Ω.
1: Choose a set S of elements drawn uniformly at random from [N], and perform all pairwise queries corresponding to these |S| elements.
2: Construct a graph G = (V, E) whose vertices are the |S| sampled elements. There exists an edge between elements (i, j) only if they are determined to be similar by the oracle.
3: Construct the maximal cliques of the graph G such that all edges in E are covered. Each maximal clique forms a cluster.
4: Query each of the remaining n − |S| elements against all elements present in S. For each cluster, if an element is similar to all the elements in that particular cluster, then assign the element to that cluster. Return the obtained clusters.

Table 1: Sample complexities for k = O(log n) and Δ ≪ k

Oracle Type                                  | Lower-Bound       | Upper-Bound
Direct responses (disjoint)                  | Ω(nk)             | O(nk)
Direct responses (overlapping)               | Ω(nk)             | O(nk)
Quantized responses                          | Ω(n · polylog n)  | O(n · polylog n)
Quantized responses (worst-case, α = n^{−c}) | NA                | O(n^{1+c} · polylog n)

Figure 1: Results of our techniques on the MovieLens dataset. (a) Mean, median and maximum errors for Δ = 2. (b) Number of failures for Δ = 2. (c) Mean, median and maximum errors for Δ = 3. (d) Number of failures for Δ = 3.

3.4 Summary Table

Table 1 summarizes the scaling of our lower and upper bounds on the sample complexities for each of the different oracle types considered in this paper. In the table, we opted to focus on the regime where k = O(log n) and Δ ≪ k, as we found it to be the most interesting one.
We also assume that the noise parameter q is fixed. Note, however, that our theoretical results are general and apply for any scaling of k and Δ. Also, since the scalings of the sample complexities associated with the uniform and the i.i.d. ensembles, as well as when the noise parameter q is unknown, are similar, we chose to combine them. We also present the worst-case scenario in Theorem 5, assuming that α = n^{−c}, for some c ∈ (0, 1). For simplicity of presentation, we do not explicitly present the polylog factors in the lower and upper bounds. For the regime above, we can see that the scaling of the upper and lower bounds w.r.t. n is the same up to constants for direct responses. For quantized responses there is a polylog factor difference between the obtained upper and lower bounds.

4 Experimental Results

We provide only a summary of our experimental results here, deferring synthetic data results to Appendix L. We focus on real-world data from the popular MovieLens dataset (https://grouplens.org/datasets/movielens/) for our experiments. The dataset we used describes 5-star rating and free-text tagging activity from MovieLens (http://movielens.org), a movie recommendation service. It contains 100836 ratings and 3683 tag applications across 9742 movies.

Quantized Query Responses. In order to establish our results we chose the following categories: Mystery, Drama, Sci-Fi, Horror, and Crime, and first selected only those movies that belong to at most two categories and at least one category (i.e., Δ = 2). The total number of such movies was 3470 and, in accordance with the statement of Theorem 9, we have α = 0.0152 (there are 53 movies that belong to Mystery but do not belong to Sci-Fi or Horror). The total number of possible queries is about 1.2 × 10^7 and the number of queries that is theoretically sufficient is 2948935 (the theoretical value of |S| is 245).
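As an illustration, the clique-cover procedure of Algorithm 3 can be sketched in a few lines of Python. This is a minimal sketch with a noiseless quantized oracle on a toy instance; the function names and the Bron–Kerbosch clique enumeration are our own choices, not the implementation used in our experiments.

```python
import itertools
import random

def same_cluster(u, v, clusters):
    # Quantized oracle: "do u and v share at least one cluster?"
    return any(u in c and v in c for c in clusters)

def maximal_cliques(adj, vertices):
    # Bron-Kerbosch enumeration of maximal cliques (no pivoting; fine for small |S|).
    cliques = []
    def bk(r, p, x):
        if not p and not x:
            cliques.append(r)
            return
        for v in list(p):
            bk(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}
    bk(set(), set(vertices), set())
    return cliques

def worst_case_quantized(n, s_size, oracle):
    # Step 1: sample S and query all pairs inside S.
    S = random.sample(range(n), s_size)
    adj = {v: set() for v in S}
    for u, v in itertools.combinations(S, 2):
        if oracle(u, v):
            adj[u].add(v)
            adj[v].add(u)
    # Steps 2-3: each maximal clique of the similarity graph is a cluster.
    base = [c for c in maximal_cliques(adj, S) if c]
    clusters = [set(c) for c in base]
    # Step 4: assign each remaining element to every clique whose members
    # are all similar to it.
    for e in set(range(n)) - set(S):
        for b, c in zip(base, clusters):
            if all(oracle(e, v) for v in b):
                c.add(e)
    return clusters
```

On a small overlapping instance, e.g. ground-truth clusters {0, 1, 2, 3} and {3, 4, 5, 6} with S = [n], the maximal cliques of the similarity graph recover both clusters exactly.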
We ran Algorithm 3 (running Algorithm 17 requires parsing all possible clique covers, which is computationally hard) with different values of |S| (the number of movies randomly chosen in the first step), and since the movies are sampled randomly, we ran 50 trials for each value of |S|. Finally, after the final clustering was produced by the algorithm, we calculated the Gram matrix of the resulting clustering and compared it with the Gram matrix of the ground-truth clustering. Figure 1a shows the mean, median, and maximum error as a function of the total number of queries accrued by Algorithm 3. Here, the error refers to the total number of differing entries in the estimated and true Gram matrices. We can observe that the mean error almost reaches zero around 3 × 10^5 queries (about 2.5% of the total). Figure 1b presents the total number of failures in perfect clustering (trials where the error is larger than 1) among the 50 trials for each value of |S| we chose. We first obtain perfect clustering in all 50 trials using 1.2 × 10^6 queries (≈ 10% of the total). Note that since Theorem 5 gives a sufficient condition on T only, in practice we can take smaller values for T and still guarantee recovery. Of course, the smaller the size of S, the smaller the sample and computational complexities. We repeated the experiment with the same set of categories as before, but this time we included movies that belonged to at most three clusters at the same time (i.e., Δ = 3). The total number of such movies is 5082 and therefore the total number of possible queries is about 2.59 × 10^7. Again, we conducted 50 trials for each chosen value of |S| and, as before, we plotted the mean, median and maximum error in Figure 1c and the number of failures in perfect clustering in Figure 1d for this setting.
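The Gram-matrix error above can be computed directly from two cluster assignments. The sketch below (our own helper names) forms each Gram matrix as $AA^T$, with $A$ the binary membership matrix, and counts differing entries over all n² positions; whether the count runs over all entries or only off-diagonal ones is not specified in the text, so this is one natural reading.

```python
def gram_matrix(clusters, n):
    # G[i][j] = number of clusters shared by elements i and j, i.e. G = A A^T
    # where A is the n x k binary membership matrix.
    member = [[1 if i in c else 0 for c in clusters] for i in range(n)]
    return [[sum(a * b for a, b in zip(member[i], member[j])) for j in range(n)]
            for i in range(n)]

def gram_error(estimated, truth, n):
    # Total number of entries on which the two Gram matrices disagree.
    ge, gt = gram_matrix(estimated, n), gram_matrix(truth, n)
    return sum(ge[i][j] != gt[i][j] for i in range(n) for j in range(n))
```

For instance, estimating [{0, 1}, {2}] when the ground truth is [{0, 1, 2}] yields an error of 4, since the pairs (0, 2) and (1, 2) each contribute two symmetric disagreeing entries.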
Notice that the mean error drops almost to zero at about 6 × 10^5 queries (2.32% of the total) and perfect clustering over all 50 trials is achieved at 1.5 × 10^6 queries (5.8% of the total). We would like to point out that Algorithm 3 is parameter free, and provides a non-trivial solution even when the number of queries is far below the theoretical threshold. It turns out that the experimental threshold is better than the theoretical threshold on the number of queries for perfect clustering. Moreover, Algorithm 3 is efficient for partial clustering as well, since the error drops very fast as the number of queries is increased.

Unquantized Query Responses. In this experiment, we used the MovieLens dataset again and chose the 5 classes Mystery, Drama, IMAX, Sci-Fi, and Horror, and selected those movies that belonged to at most two categories and at least one category (i.e., Δ = 2). The total number of such movies is 3270, and for each query involving two movies, we obtain the unquantized similarity (the total number of categories to which they both belong). We follow Algorithm FindSimilarity very closely but with a small modification. Indeed, note that Algorithm 2 is designed so that the guarantees hold under a specific stochastic assumption. More concretely, the necessary size of S is not defined for arbitrary real-world datasets. Note, however, that the main objective of the first part of the algorithm is to select a number of elements such that the Gram matrix is of full rank. Therefore, for a real-world dataset, instead of sampling a fixed number of movies a-priori, we randomly select k = 5 movies and make all pairwise queries restricted to those 5 movies. Then, we check whether the 5 × 5 Gram matrix (with the (i, j) entry being the unquantized similarity between the ith and jth movies) is of rank 5, and if so, we use that matrix for further calculations. If not, we sample again until we succeed.
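The rejection-sampling step just described can be sketched as follows. The names sample_full_rank_gram and rank are hypothetical helpers of ours; a real run would fill the Gram entries from oracle queries against the dataset, whereas here the oracle is simulated from a membership table, and the rank is computed by exact Gaussian elimination to keep the sketch dependency-free.

```python
import random
from fractions import Fraction

def rank(mat):
    # Rank via Gaussian elimination over exact rationals (avoids float round-off).
    m = [[Fraction(x) for x in row] for row in mat]
    rows, cols, r = len(m), len(m[0]), 0
    for c in range(cols):
        piv = next((i for i in range(r, rows) if m[i][c] != 0), None)
        if piv is None:
            continue
        m[r], m[piv] = m[piv], m[r]
        for i in range(rows):
            if i != r and m[i][c] != 0:
                f = m[i][c] / m[r][c]
                m[i] = [a - f * b for a, b in zip(m[i], m[r])]
        r += 1
        if r == rows:
            break
    return r

def sample_full_rank_gram(n, k, unquantized_oracle, max_tries=1000):
    # Repeatedly sample k elements and make all pairwise (and self) queries
    # until the k x k Gram matrix of unquantized similarities has full rank k.
    for _ in range(max_tries):
        chosen = random.sample(range(n), k)
        G = [[unquantized_oracle(u, v) for v in chosen] for u in chosen]
        if rank(G) == k:
            return chosen, G
    raise RuntimeError("no full-rank sample found")
```

Once a full-rank Gram matrix is found, it can be factorized as $BB^T$ with binary $B$, and each remaining element's memberships are obtained by solving the resulting linear system, as described next.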
We then proceed to factorize the obtained 5 × 5 Gram matrix into the form $BB^T$, where B is binary. Finally, we query every movie against the 5 movies already sampled (and clustered). This provides us with five linearly independent equations in five variables (each corresponding to whether the movie belongs to a particular cluster). Solving the equations for each movie, we finally obtain the categories each movie belongs to. Hence, the number of queries is at most

(number of trials to obtain a rank-5 matrix) × 25 + 3270 × 5.

Since the algorithm is randomized, we simulated this 50 times and found that the mean query complexity is 232126 (with a standard deviation of 269315.36), which is only 4.34% of the total number of possible queries.

Acknowledgements: This research is supported in part by NSF Grants CCF 1642658, 1642550, 1618512, and 1909046.

References

[1] K. Ahn, K. Lee, and C. Suh. Community recovery in hypergraphs. In Communication, Control, and Computing (Allerton), 2016 54th Annual Allerton Conference on, pages 657–663. IEEE, 2016.

[2] N. Ailon, A. Bhattacharya, R. Jaiswal, and A. Kumar. Approximate clustering with same-cluster queries. arXiv preprint arXiv:1704.01862, 2017.

[3] P. Arabie, J. D. Carroll, W. DeSarbo, and J. Wind. Overlapping clustering: A new method for product positioning. Journal of Marketing Research, pages 310–317, 1981.

[4] H. Ashtiani, S. Kushagra, and S. Ben-David. Clustering with same-cluster queries. NIPS, 2016.

[5] A. Banerjee, C. Krumpelman, J. Ghosh, S. Basu, and R. J. Mooney. Model-based overlapping clustering. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, pages 532–537. ACM, 2005.

[6] F. Bonchi, A. Gionis, and A. Ukkonen. Overlapping correlation clustering. Knowledge and Information Systems, 35:1–32, 04 2013.

[7] E. J. Candès and B. Recht.
Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717, 2009.

[8] E. J. Candès and T. Tao. The power of convex relaxation: near-optimal matrix completion. IEEE Trans. Information Theory, 56(5):2053–2080, 2010.

[9] I. Chien, C. Pan, and O. Milenkovic. Query k-means clustering and the double dixie cup problem. arXiv preprint arXiv:1806.05938, 2018.

[10] T. M. Cover and J. A. Thomas. Elements of Information Theory, 2nd Ed. John Wiley & Sons, 2012.

[11] H. S. M. Coxeter. Projective Geometry. Springer Science & Business Media, 2003.

[12] M. A. Davenport, Y. Plan, E. Van Den Berg, and M. Wootters. 1-bit matrix completion. Information and Inference: A Journal of the IMA, 3(3):189–223, 2014.

[13] D. Firmani, B. Saha, and D. Srivastava. Online entity resolution using an oracle. PVLDB, 9(5):384–395, 2016.

[14] R. H. Keshavan, A. Montanari, and S. Oh. Matrix completion from a few entries. IEEE Transactions on Information Theory, 56(6):2980–2998, 2010.

[15] T. Kim and J. Ghosh. Semi-supervised active clustering with weak oracles. arXiv preprint arXiv:1709.03202, 2017.

[16] X. Mao, P. Sarkar, and D. Chakrabarti. Overlapping clustering models, and one (class) svm to bind them all. In Advances in Neural Information Processing Systems, pages 2126–2136, 2018.

[17] A. Mazumdar and B. Saha. A theoretical analysis of first heuristics of crowdsourced entity resolution. The Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17), 2017.

[18] A. Mazumdar and B. Saha. Clustering with noisy queries. In Advances in Neural Information Processing Systems (NIPS) 31, 2017.

[19] A. Mazumdar and B. Saha. Query complexity of clustering with side information. In Advances in Neural Information Processing Systems, pages 4682–4693, 2017.

[20] R. Nugent and M. Meila. An overview of clustering applied to molecular biology.
In Statistical Methods in Molecular Biology, pages 369–404. Springer, 2010.

[21] C. E. Tsourakakis, M. Mitzenmacher, K. G. Larsen, J. Błasiok, B. Lawson, P. Nakkiran, and V. Nakos. Predicting positive and negative links with noisy queries: Theory & practice. arXiv preprint arXiv:1709.07308, 2017.

[22] N. Vesdapunt, K. Bellare, and N. Dalvi. Crowdsourcing algorithms for entity resolution. PVLDB, 7(12):1071–1082, 2014.

[23] R. K. Vinayak and B. Hassibi. Crowdsourced clustering: Querying edges vs triangles. In Advances in Neural Information Processing Systems, pages 1316–1324, 2016.

[24] J. Wang, T. Kraska, M. J. Franklin, and J. Feng. CrowdER: Crowdsourcing entity resolution. PVLDB, 5(11):1483–1494, 2012.

[25] J. Wang, G. Li, T. Kraska, M. J. Franklin, and J. Feng. Leveraging transitive relations for crowdsourced joins. In SIGMOD Conference, pages 229–240, 2013.

[26] O. Zamir and O. Etzioni. Web document clustering: A feasibility demonstration. In Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval, pages 46–54. ACM, 1998.

[27] J. Zou, K. Chaudhuri, and A. T. Kalai. Crowdsourcing feature discovery via adaptively chosen comparisons. In Proceedings, The Third AAAI Conference on Human Computation and Crowdsourcing (HCOMP-15), November 2015.