{"title": "Crowdclustering", "book": "Advances in Neural Information Processing Systems", "page_first": 558, "page_last": 566, "abstract": "Is it possible to crowdsource categorization? Amongst the challenges: (a) each annotator has only a partial view of the data, (b) different annotators may have different clustering criteria and may produce different numbers of categories, (c) the underlying category structure may be hierarchical. We propose a Bayesian model of how annotators may approach clustering and show how one may infer clusters/categories, as well as annotator parameters, using this model. Our experiments, carried out on large collections of images, suggest that Bayesian crowdclustering works well and may be superior to single-expert annotations.", "full_text": "Crowdclustering\n\nRyan Gomes\u2217\n\nCaltech\n\nPeter Welinder\n\nCaltech\n\nAndreas Krause\n\nETH Zurich & Caltech\n\nPietro Perona\n\nCaltech\n\nAbstract\n\nIs it possible to crowdsource categorization? Amongst the challenges: (a) each\nworker has only a partial view of the data, (b) different workers may have differ-\nent clustering criteria and may produce different numbers of categories, (c) the\nunderlying category structure may be hierarchical. We propose a Bayesian model\nof how workers may approach clustering and show how one may infer clusters\n/ categories, as well as worker parameters, using this model. Our experiments,\ncarried out on large collections of images, suggest that Bayesian crowdclustering\nworks well and may be superior to single-expert annotations.\n\nIntroduction\n\n1\nOutsourcing information processing to large groups of anonymous workers has been made easier\nby the internet. Crowdsourcing services, such as Amazon\u2019s Mechanical Turk, provide a convenient\nway to purchase Human Intelligence Tasks (HITs). 
Machine vision and machine learning researchers have begun using crowdsourcing to label large sets of data (e.g., images and video [1, 2, 3]) which may then be used as training data for AI and computer vision systems. In all the work so far, categories are defined by a scientist, while categorical labels are provided by the workers.\nCan we use crowdsourcing to discover categories? I.e., is it possible to use crowdsourcing not only to classify data instances into established categories, but also to define the categories in the first place? This question is motivated by practical considerations. If we have a large number of images, perhaps several tens of thousands or more, it may not be realistic to expect a single person to look at all images and form an opinion as to how to categorize them. Additionally, individuals, whether untrained or expert, might not agree on the criteria used to define categories and may not even agree on the number of categories that are present. In some domains unsupervised clustering by machine may be of great help; however, unsupervised categorization of images and video is unfortunately a problem that is far from solved. Thus, it is an interesting question whether it is possible to collect and combine the opinions of multiple human operators, each of whom is able to view a (perhaps small) subset of a large image collection.\nWe explore the question of crowdsourcing clustering in two steps: (a) Reduce the problem to a number of independent HITs of reasonable size and assign them to a large pool of human workers (Section 2). (b) Develop a model of the annotation process, and use the model to aggregate the human data automatically (Section 3), yielding a partition of the dataset into categories. 
We explore the properties of our approach and algorithms on a number of real world data sets, and compare against existing methods in Section 4.\n2 Eliciting Information from Workers\nHow shall we enable human operators to express their opinion on how to categorize a large collection of images? Whatever method we choose, it should be easy to learn and it should be implementable by means of a simple graphical user interface (GUI). Our approach (Figure 1) is based on displaying small subsets of M images and asking workers to group them by means of mouse clicks. We provide instructions that may cue workers to certain attributes, but we do not provide the worker with category definitions or examples. The worker groups the M items into clusters of his choosing, as many as he sees fit. An item may be placed in its own cluster if it is unlike the others in the HIT. The choice of M trades off between the difficulty of the task (worker time required for a HIT\n∗Corresponding author, e-mail: gomes@vision.caltech.edu\nFigure 1: Schematic of Bayesian crowdclustering. A large image collection is explored by workers. In each HIT (Section 2), the worker views a small subset of images on a GUI. By associating (arbitrarily chosen) colors with sets of images the worker proposes a (partial) local clustering. Each HIT thus produces multiple binary pairwise labels: each pair of images shown in the same HIT is placed by the worker either in the same category or in different categories. Each image is viewed by multiple workers in different contexts. A model of the annotation process (Sec. 
3.1) is used to compute the most likely set of categories from the binary labels. Worker parameters are estimated as well.\nincreases super-linearly with the number of items), the resolution of the images (more images on the screen means that they will be smaller), and contextual information that may guide the worker to make more global category decisions (more images give a better context, see Section 4.1.) Partial clusterings on many M-sized subsets of the data from many different workers are thus the raw data on which we compute clustering.\nAn alternative would have been to use pairwise distance judgments or three-way comparisons. A large body of work exists in the social sciences that makes use of human-provided similarity values defined between pairs of data items (e.g., Multidimensional Scaling [4].) After obtaining pairwise similarity ratings from workers, and producing a Euclidean embedding, one could conceivably proceed with unsupervised clustering of the data in the Euclidean space. However, accurate distance judgments may be more laborious to specify than partial clusterings. We chose to explore what we can achieve with partial clusterings alone.\nWe do not expect workers to agree on their definitions of categories, or to be consistent in categorization when performing multiple HITs. Thus, we avoid explicitly associating categories across HITs. Instead, we represent the results of each HIT as a series of (M choose 2) binary labels (see Figure 1).\nWe assume that there are N total items (indexed by i), J workers (indexed by j), and H HITs (indexed by h). The information obtained from workers is a set of binary variables L, with elements lt ∈ {−1, +1} indexed by a positive integer t ∈ {1, . . . , T}. Associated with the t-th label is a quadruple (at, bt, jt, ht), where jt ∈ {1, . . . 
, J} indicates the worker that produced the label, and at ∈ {1, . . . , N} and bt ∈ {1, . . . , N} indicate the two data items compared by the label. ht ∈ {1, . . . , H} indicates the HIT from which the t-th pairwise label was derived. The number of labels is T = H·(M choose 2).\nSampling Procedure We have chosen to structure HITs as clustering tasks of M data items, so we must specify them. If we simply separate the items into disjoint sets, then it will be impossible to infer a clustering over the entire data set. We will not know whether two items in different HITs are in the same cluster or not. There must be some overlap or redundancy: data items must be members of multiple HITs.\nAt the other extreme, we could construct HITs such that each pair of items may be found in at least one HIT, so that every possible pairwise category relation is sampled. This would be quite expensive for a large number of items N, since the number of labels scales asymptotically as T ∈ Ω(N^2). However, we expect a noisy transitive property to hold: if items a and b are likely to be in the same cluster, and items b and c are (not) likely in the same cluster, then items a and c are (not) likely to be in the same cluster as well. The transitive nature of binary cluster relations should allow sparse sampling, especially when the number of clusters is relatively small.\nAs a baseline sampling method, we use the random sampling scheme outlined by Strehl and Ghosh [5] developed for the problem of object distributed clustering, in which a partition of a complete data set is learned from a number of clusterings restricted to subsets of the data. (We compare our aggregation algorithm to this work in Section 4.) 
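For concreteness, the reduction from one worker's grouping in a single HIT to the (M choose 2) pairwise labels described above can be sketched as follows (an illustrative helper, not the authors' code; `hit_to_pairwise_labels` and its argument names are our own):

```python
from itertools import combinations

def hit_to_pairwise_labels(grouping, worker, hit):
    """Convert one worker's partition of the M items shown in a HIT into the
    C(M, 2) binary labels l_t, each carried by a quadruple (a_t, b_t, j_t, h_t).

    grouping: dict mapping item id -> cluster id chosen by the worker.
    Returns a list of (l, a, b, j, h) tuples with l in {-1, +1}.
    """
    labels = []
    for a, b in combinations(sorted(grouping), 2):
        l = +1 if grouping[a] == grouping[b] else -1
        labels.append((l, a, b, worker, hit))
    return labels

# A worker groups 4 items into two clusters -> C(4, 2) = 6 pairwise labels.
example = hit_to_pairwise_labels({0: "red", 1: "red", 2: "blue", 3: "blue"},
                                 worker=7, hit=0)
```

Note that the cluster ids ("red", "blue") are arbitrary and local to the HIT; only the same/different relation survives, which is exactly why categories need not be associated across HITs.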
Their scheme controls the level of sampling redundancy with a single parameter V, which in our problem is interpreted as the expected number of HITs to which a data item belongs.\nThe N items are first distributed deterministically among the HITs, so that there are ⌈M/V⌉ items in each HIT. Then the remaining M − ⌈M/V⌉ items in each HIT are filled by sampling without replacement from the N − ⌈M/V⌉ items that are not yet allocated to the HIT. There are a total of ⌈NV/M⌉ unique HITs. We introduce an additional parameter R, which is the number of different workers that perform each constructed HIT. The total number of HITs distributed to the crowdsourcing service is therefore H = R⌈NV/M⌉, and we impose the constraint that a worker can not perform the same HIT more than once. This sampling scheme generates T = R⌈NV/M⌉·(M choose 2) ∈ O(RNVM) binary labels.\nWith this exception, we find a dearth of ideas in the literature pertaining to sampling methods for distributed clustering problems. Iterative schemes that adaptively choose maximally informative HITs may be preferable to random sampling. We are currently exploring ideas in this direction.\n3 Aggregation via Bayesian Crowdclustering\nThere is an extensive literature in machine learning on the problem of combining multiple alternative clusterings of data. This problem is known as consensus clustering [6], clustering aggregation [7], or cluster ensembles [5]. 
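Our reading of the sampling scheme above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: in particular, we assume the deterministic allocation simply strides through the item list, and the exact tie-breaking may differ.

```python
import math
import random

def sample_hits(n_items, M, V, R, seed=0):
    """Construct HITs following the random scheme of Section 2 (sketch):
    each of the ceil(N*V/M) unique HITs receives ceil(M/V) deterministically
    allocated items, the remaining slots are filled by sampling without
    replacement from the not-yet-allocated items, and each unique HIT is
    replicated for R distinct workers."""
    rng = random.Random(seed)
    items = list(range(n_items))
    n_unique = math.ceil(n_items * V / M)
    det = math.ceil(M / V)
    hits = []
    for h in range(n_unique):
        # Assumed deterministic part: stride through the item list.
        fixed = [items[(h * det + i) % n_items] for i in range(det)]
        pool = [i for i in items if i not in fixed]
        hit = fixed + rng.sample(pool, M - det)
        hits.extend([hit] * R)  # R workers per unique HIT
    return hits

# N=100 items, M=10 per HIT, V=4, R=2 -> H = R * ceil(N*V/M) = 80 HITs.
hits = sample_hits(n_items=100, M=10, V=4, R=2)
```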
While some of these methods can work with partial input clusterings, most have not been demonstrated in situations where the input clusterings involve only a small subset of the total data items (M << N), which is the case in our problem.\nIn addition, existing approaches focus on producing a single “average” clustering from a set of input clusterings. In contrast, we are not merely interested in the average clustering produced by a crowd of workers. Instead, we are interested in understanding the ways in which different individuals may categorize the data. We seek a master clustering of the data whose groups may be combined in order to describe the tendencies of individual workers. We refer to these groups of data as atomic clusters. For example, suppose one worker groups objects into a cluster of tall objects and another of short objects, while a different worker groups the same objects into a cluster of red objects and another of blue objects. Then, our method should recover four atomic clusters: tall red objects, short red objects, tall blue objects, and short blue objects. The behavior of the two workers may then be summarized using a confusion table of the atomic clusters (see Section 3.3). The first worker groups the first and third atomic clusters into one category and the second and fourth atomic clusters into another category. The second worker groups the first and second atomic clusters into a category and the third and fourth atomic clusters into another category.\n3.1 Generative Model\nWe propose an approach in which data items are represented as points in a Euclidean space and workers are modeled as pairwise binary classifiers in this space. Atomic clusters are then obtained by clustering these inferred points using a Dirichlet process mixture model, which estimates the number of clusters [8]. 
The advantage of an intermediate Euclidean representation is that it provides a compact way to capture the characteristics of each data item. Certain items may be inherently more difficult to categorize, in which case they may lie between clusters. Items may be similar along one axis but different along another (e.g., object height versus object color.) A similar approach was proposed by Welinder et al. [3] for the analysis of classification labels obtained from crowdsourcing services. That method does not apply to our problem, since it involves binary labels applied to single data items rather than to pairs, and therefore requires that categories be defined a priori and agreed upon by all workers, which is incompatible with the crowdclustering problem.\nWe propose a probabilistic latent variable model that relates pairwise binary labels to hidden variables associated with both workers and images. The graphical model is shown in Figure 1. xi is a D dimensional vector, with components [xi]d, that encodes item i’s location in the embedding space R^D. The symmetric matrix Wj ∈ R^{D×D} with entries [Wj]d1d2 and the bias τj ∈ R are used to define a pairwise binary classifier, explained in the next paragraph, that represents worker j’s labeling behavior. 
Because Wj is symmetric, we need only specify its upper triangular portion: vecp{Wj}, which is a vector formed by “stacking” the partial columns of Wj according to the ordering [vecp{Wj}]1 = [Wj]11, [vecp{Wj}]2 = [Wj]12, [vecp{Wj}]3 = [Wj]22, etc. Φk = {μk, Σk} are the mean and covariance parameters associated with the k-th Gaussian atomic cluster, and Uk are stick breaking weights associated with a Dirichlet process.\nThe key term is the pairwise quadratic logistic regression likelihood that captures worker j’s tendency to label the pair of images at and bt with lt:\np(lt|xat, xbt, Wjt, τjt) = 1 / (1 + exp(−lt At)),   (1)\nwhere we define the pairwise quadratic activity At = xat^T Wjt xbt + τjt. Symmetry of Wj ensures that p(lt|xat, xbt, Wjt, τjt) = p(lt|xbt, xat, Wjt, τjt). This form of likelihood yields a compact and tractable method of representing classifiers defined over pairs of points in Euclidean space. Pairs of vectors with large pairwise activity tend to be classified as being in the same category, and in different categories otherwise. 
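In code, the likelihood of Eq. 1 is simply a logistic function of the pairwise activity; a minimal sketch (function and variable names are ours, not the paper's):

```python
import math

def pairwise_label_prob(l, x_a, x_b, W, tau):
    """Eq. 1: p(l | x_a, x_b, W, tau) = sigmoid(l * A), with the pairwise
    quadratic activity A = x_a^T W x_b + tau. W must be symmetric so the
    likelihood is invariant to swapping the two items."""
    D = len(x_a)
    A = sum(x_a[i] * W[i][j] * x_b[j] for i in range(D) for j in range(D)) + tau
    return 1.0 / (1.0 + math.exp(-l * A))

# Toy worker in D = 2: a symmetric matrix and a bias.
W = [[1.0, 0.5], [0.5, 2.0]]
x_a, x_b, tau = [1.0, 0.0], [0.0, 1.0], -0.2
p_same = pairwise_label_prob(+1, x_a, x_b, W, tau)  # sigmoid(0.3)
```

The two sanity properties stated in the text hold by construction: swapping x_a and x_b leaves the probability unchanged, and the probabilities of l = +1 and l = −1 sum to one.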
We find that this form of likelihood leads to tightly grouped clusters of points xi that are then easily discovered by mixture model clustering.\nThe joint distribution is\np(Φ, U, Z, X, W, τ, L) = [∏_{k=1}^{∞} p(Uk|α) p(Φk|m0, β0, J0, η0)] [∏_{i=1}^{N} p(zi|U) p(xi|Φzi)] [∏_{j=1}^{J} p(vecp{Wj}|σw_0) p(τj|στ_0)] [∏_{t=1}^{T} p(lt|xat, xbt, Wjt, τjt)].   (2)\nThe conditional distributions are defined as follows:\np(Uk|α) = Beta(Uk; 1, α)\np(zi = k|U) = Uk ∏_{l=1}^{k−1} (1 − Ul)\np(xi|σx_0) = ∏_d Normal([xi]d; 0, σx_0)\np(xi|Φzi) = Normal(xi; μzi, Σzi)\np(vecp{Wj}|σw_0) = ∏_{d1≤d2} Normal([Wj]d1d2; 0, σw_0)\np(τj|στ_0) = Normal(τj; 0, στ_0)\np(Φk|m0, β0, J0, η0) = Normal-Wishart(Φk; m0, β0, J0, η0)   (3)\nwhere (σx_0, στ_0, σw_0, α, m0, β0, J0, η0) are fixed hyper-parameters. Our model is similar to that of [9], which is used to model binary relational data. Salient differences include our use of a logistic rather than a Gaussian likelihood, and our enforcement of the symmetry of Wj. In the next section, we develop an efficient deterministic inference algorithm to accommodate much larger data sets than the sampling algorithm used in [9].\n3.2 Approximate Inference\nExact posterior inference in this model is intractable, since computing it involves integrating over variables with complex dependencies. We therefore develop an inference algorithm based on the Variational Bayes method [10]. 
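For intuition, the stick-breaking portion of the generative process in Eqs. 2 and 3 can be simulated with a finite truncation. This is a toy sketch with made-up hyperparameters, not part of the inference algorithm:

```python
import random

def sample_assignments(n_items, alpha, K=50, seed=1):
    """Draw cluster assignments z_i from a truncated stick-breaking prior:
    U_k ~ Beta(1, alpha) and p(z_i = k | U) = U_k * prod_{l<k} (1 - U_l).
    The truncation level K and hyperparameters are illustrative only."""
    rng = random.Random(seed)
    U = [rng.betavariate(1.0, alpha) for _ in range(K)]
    weights, remaining = [], 1.0
    for u in U:
        weights.append(u * remaining)  # mass assigned to component k
        remaining *= 1.0 - u           # stick left for later components
    weights.append(remaining)          # lump the truncated tail into one slot
    z = [rng.choices(range(K + 1), weights=weights)[0] for _ in range(n_items)]
    return weights, z

weights, z = sample_assignments(n_items=200, alpha=2.0)
```

Small alpha concentrates mass on the first few sticks, which is what makes the number of occupied clusters effectively data-driven.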
The high level idea is to work with a factorized proxy posterior distribution that does not model the full complexity of interactions between variables; it instead represents a single mode of the true posterior. Because this distribution is factorized, integrations involving it become tractable. We define the proxy distribution\nq(Φ, U, Z, X, W, τ) = [∏_{k=K+1}^{∞} p(Uk|α) p(Φk|m0, β0, J0, η0)] [∏_{k=1}^{K} q(Uk) q(Φk)] [∏_{i=1}^{N} q(zi) q(xi)] [∏_{j=1}^{J} q(vecp{Wj}) q(τj)]   (4)\nusing parametric distributions of the following form:\nq(Uk) = Beta(Uk; ξk,1, ξk,2)\nq(Φk) = Normal-Wishart(mk, βk, Jk, ηk)\nq(zi = k) = qik\nq(xi) = ∏_d Normal([xi]d; [μx_i]d, [σx_i]d)\nq(vecp{Wj}) = ∏_{d1≤d2} Normal([Wj]d1d2; [μw_j]d1d2, [σw_j]d1d2)\nq(τj) = Normal(τj; μτ_j, στ_j)   (5)\nTo handle the infinite number of mixture components, we follow the approach of [11] where we define variational distributions for the first K components, and fix the remainder to their corresponding priors. {ξk,1, ξk,2} and {mk, βk, Jk, ηk} are the variational parameters associated with the k-th mixture component. q(zi = k) = qik form the factorized assignment distribution for item i. μx_i and σx_i are variational mean and variance parameters associated with data item i’s embedding location. μw_j and σw_j are symmetric matrix variational mean and variance parameters associated with worker j, and μτ_j and στ_j are variational mean and variance parameters for the bias τj of worker j. We use diagonal covariance Normal distributions over Wj and xi to reduce the number of parameters that must be estimated.\nNext, we define a utility function which allows us to determine the variational parameters. 
We use Jensen’s inequality to develop a lower bound to the log evidence:\nlog p(L|σx_0, στ_0, σw_0, α, m0, β0, J0, η0) ≥ Eq log p(Φ, U, Z, X, W, τ, L) + H{q(Φ, U, Z, X, W, τ)},   (6)\nwhere H{·} is the entropy of the proxy distribution, and the lower bound is known as the Free Energy. However, the Free Energy still involves intractable integration, because the normal distributions over variables Wj, xi, and τj are not conjugate [12] to the logistic likelihood term. We therefore locally approximate the logistic likelihood with an unnormalized Gaussian function lower bound, which is the left hand side of the following inequality:\ng(Δt) exp{(lt At − Δt)/2 + λ(Δt)(At^2 − Δt^2)} ≤ p(lt|xat, xbt, Wjt, τjt).   (7)\nThis was adapted from [13] to our case of quadratic pairwise logistic regression. Here g(x) = (1 + e^{−x})^{−1} and λ(Δ) = [1/2 − g(Δ)]/(2Δ). This expression introduces an additional variational parameter Δt for each label, which are optimized in order to tighten the lower bound. Our utility function is therefore:\nF = Eq log p(Φ, U, Z, X, W, τ) + H{q(Φ, U, Z, X, W, τ)} + Σ_t {log g(Δt) + (lt/2) Eq{At} − Δt/2 + λ(Δt)(Eq{At^2} − Δt^2)},   (8)\nwhich is a tractable lower bound to the log evidence. Optimization of variational parameters is carried out in a coordinate ascent procedure, which exactly maximizes each variational parameter in turn while holding all others fixed. This is guaranteed to converge to a local maximum of the utility function. The update equations are given in an extended technical report [14]. 
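The local bound of Eq. 7 is easy to check numerically, using the paper's sign convention λ(Δ) = [1/2 − g(Δ)]/(2Δ); a small sketch:

```python
import math

def g(x):
    """Logistic function g(x) = (1 + e^-x)^-1."""
    return 1.0 / (1.0 + math.exp(-x))

def lam(delta):
    """lambda(Delta) = [1/2 - g(Delta)] / (2 Delta), as defined after Eq. 7."""
    return (0.5 - g(delta)) / (2.0 * delta)

def logistic_bound(l, A, delta):
    """Left hand side of Eq. 7: an unnormalized-Gaussian lower bound on the
    logistic likelihood g(l * A); it is tight when delta = |l * A|."""
    return g(delta) * math.exp((l * A - delta) / 2.0
                               + lam(delta) * (A * A - delta * delta))

# Spot-check the inequality over a grid of activities and bound parameters.
checks = [(l, A, d) for l in (-1, 1) for A in (0.3, 1.7, -2.5)
          for d in (0.5, 1.0, 3.0)]
```

Because the bound is quadratic in A inside the exponential, its expectation under a Gaussian q is available in closed form, which is exactly what makes Eq. 8 tractable.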
We initialize the variational parameters by carrying out a layerwise procedure: first, we substitute a zero mean isotropic normal prior for the mixture model and perform variational updates over {μx_i, σx_i, μw_j, σw_j, μτ_j, στ_j}. Then we use μx_i as point estimates for xi and update {mk, βk, Jk, ηk, ξk,1, ξk,2} and determine the initial number of clusters K as in [11]. Finally, full joint inference updates are performed. Their computational complexity is O(D^4 T + D^2 KN) = O(D^4 NVRM + D^2 KN).\n3.3 Worker Confusion Analysis\nAs discussed in Section 3, we propose to understand a worker’s behavior in terms of how he groups atomic clusters into his own notion of categories. We are interested in the predicted confusion matrix Cj for worker j, where\n[Cj]k1k2 = Eq{∫ p(l = 1|xa, xb, Wj, τj) p(xa|Φk1) p(xb|Φk2) dxa dxb},   (9)\nwhich expresses the probability that worker j assigns data items sampled from atomic clusters k1 and k2 to the same cluster, as predicted by the variational posterior. This integration is intractable. We use the expected values E{Φk1} = {mk1, Jk1/ηk1} and E{Φk2} = {mk2, Jk2/ηk2} as point estimates in place of the variational distributions over Φk1 and Φk2. We then use Jensen’s inequality and Eq. 7 again to yield a lower bound. Maximizing this bound over Δ yields\n[Ĉj]k1k2 = g(Δ̂k1k2j) exp{(mk1^T μw_j mk2 + μτ_j − Δ̂k1k2j)/2},   (10)\nwhich we use as our approximate confusion matrix, where Δ̂k1k2j is given in [14].\n4 Experiments\nWe tested our method on four image data sets that have established “ground truth” categories, which were provided by a single human expert. 
These categories do not necessarily reflect the uniquely valid way to categorize the data set; however, they form a convenient baseline for the purpose of quantitative comparison. We used 1000 images from the Scenes data set from [15] to illustrate our approach (Figures 2, 3, and 4.) We used 1354 images of birds from 10 species in the CUB-200 data set [16] (Table 1) and the 3845 images in the Stonefly9 data set [17] (Table 1) in order to compare our method quantitatively to other cluster aggregation methods. We used the 37794 images from the Attribute Discovery data set [18] in order to demonstrate our method on a large scale problem.\nWe set the dimensionality of xi to D = 4 (since higher dimensionality yielded no additional clusters) and we iterated the update equations 100 times, which was enough for convergence. Hyperparameters were tuned once on synthetic pairwise labels that simulated 100 data points drawn from 4 clusters, and fixed during all experiments.\nFigure 2: Scene Dataset. Left: Mean locations μx_i projected onto the first two Fisher discriminant vectors, along with cluster labels superimposed at cluster means mk. Data items are colored according to their MAP label argmax_k qik. Center: High confidence example images from the largest five clusters (rows correspond to clusters.) Right: Confusion table between ground truth scene categories and inferred clusters. The first cluster includes three indoor ground truth categories, the second includes forest and open country categories, and the third includes two urban categories. See Section 4.1 for a discussion and potential solution of this issue.\nFigure 3: (Left of line) Worker confusion matrices for the 40 most active workers. (Right of line) Selected worker confusion matrices for Scenes experiment. Worker 9 (left) makes distinctions that correspond closely to the atomic clustering. 
Worker 45 (center) makes coarser distinctions, often combining atomic clusters. Right: Worker 29’s single HIT was largely random and does not align with the atomic clusters.\nFigure 2 (left) shows the mean locations of the data items μx_i learned from the Scene data set, visualized as points in Euclidean space. We find well separated clusters whose labels k are displayed at their mean locations mk. The points are colored according to argmax_k qik, which is item i’s MAP cluster assignment. The cluster labels are sorted according to the number of assigned items, with cluster 1 being the largest. The axes are the first two Fisher discriminant directions (derived from the MAP cluster assignments). The clusters are well separated in the four dimensional space (we give the average assignment entropy −(1/N) Σ_ik qik log qik in the figure title, which shows little cluster overlap.) Figure 2 (center) shows six high confidence examples from clusters 1 through 5. Figure 2 (right) shows the confusion table between the ground truth categories and the MAP clustering. We find that the MAP clusters often correspond to single ground truth categories, but they sometimes combine ground truth categories in reasonable ways. See Section 4.1 for a discussion and potential solution of this issue.\nFigure 3 (left of line) shows the predicted confusion matrices (Section 3.3) associated with the 40 workers that performed the most HITs. This matrix captures the worker’s tendency to label items from different atomic clusters as being in the same or different category. Figure 3 (right of line) shows in detail the predicted confusion matrices for three workers. We have sorted the MAP cluster indices to yield approximately block diagonal matrices, for ease of interpretation. 
Worker 9 makes relatively fine grained distinctions, including separating clusters 1 and 9 that correspond to the indoor categories and the bedroom scenes, respectively. Worker 45 combines clusters 5 and 8, which correspond to city street and highway scenes, in addition to grouping together all indoor scene categories. The finer grained distinctions made by worker 9 may be a result of performing more HITs (74) and seeing a larger number of images than worker 45, who performed 15 HITs. Finally (far right), we find a worker whose labels do not align with the atomic clustering. Inspection of his labels shows that they were entered largely at random.\nFigure 4 (top left) shows the number of HITs performed by each worker according to descending rank. Figure 4 (bottom left) is a Pareto curve that indicates the percentage of the HITs performed by the most active workers. The Pareto principle (i.e., the law of the vital few) [19] roughly holds: the top 20% most active workers perform nearly 80% of the work. We wish to understand the extent to which the most active workers contribute to the results. For the purpose of quantitative comparisons, we use Variation of Information (VI) [20] to measure the discrepancy between the\n[Figure 2 and Figure 3 graphics omitted in extraction; the Figure 2 title reports an average assignment entropy of 0.0029653 bits.]\nFigure 4: Scene Data set. Left Top: Number of completed HITs by worker rank. 
Left Bottom: Pareto curve. Center: Variation of Information on the Scene data set as we incrementally remove top (blue) and bottom (red) ranked workers. The top workers are removed one at a time, bottom ranked workers are removed in groups so that both curves cover roughly the same domain. The most active workers do not dominate the results. Right: Variation of Information between the inferred clustering and the ground truth categories on the Scene data set, as a function of sampling parameter V. R is fixed at 5.\n\n                      Bayes Crowd     Bayes Consensus   Strehl & Ghosh [5]   NMF [21]\nBirds [16] (VI)       1.103 ± 0.082   1.721 ± 0.07      1.256 ± 0.001        1.500 ± 0.26\nBirds (time)          18.5 min        18.1 min          0.93 min             27.9 min\nStonefly9 [17] (VI)   2.448 ± 0.063   2.735 ± 0.037     3.836 ± 0.002        4.571 ± 0.158\nStonefly9 (time)      100.1 min       98.5 min          46.5 min             212.6 min\nTable 1: Quantitative comparison on Bird and Stonefly species categorization data sets. Quality is measured using Variation of Information between the inferred clustering and ground truth. Bayesian Crowdclustering outperforms the alternatives.\n\ninferred MAP clustering and the ground truth categorization. VI is a metric with strong information theoretic justification that is defined between two partitions (clusterings) of a data set; smaller values indicate a closer match and a VI of 0 means that two clusterings are identical. In Figure 4 (center) we incrementally remove the most active (blue) and least active (red) workers. Removal of workers corresponds to moving from right to left on the x-axis, which indicates the number of HITs used to learn the model. 
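The Variation of Information used for evaluation can be computed directly from the contingency counts of two partitions; a self-contained sketch (our implementation of the standard definition VI = H(A|B) + H(B|A), not the authors' evaluation code):

```python
import math
from collections import Counter

def variation_of_information(labels_a, labels_b):
    """VI(A, B) = H(A) + H(B) - 2 I(A; B) = H(A|B) + H(B|A), in bits.
    Zero iff the two partitions of the same items agree up to relabeling."""
    n = len(labels_a)
    pa, pb = Counter(labels_a), Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    vi = 0.0
    for (a, b), nab in joint.items():
        p_ab = nab / n
        # Contribution: -p(a,b) * [log2 p(a|b) + log2 p(b|a)]
        vi -= p_ab * (math.log2(p_ab / pb[b] * n) + math.log2(p_ab / pa[a] * n))
    return vi

same = variation_of_information([0, 0, 1, 1], [5, 5, 9, 9])  # same partition
diff = variation_of_information([0, 0, 1, 1], [0, 1, 0, 1])  # independent
```

VI is invariant to cluster relabeling, which matters here because the inferred cluster indices carry no meaning relative to the ground truth category names.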
The results show that removing the large number of workers that do fewer HITs is more detrimental to performance than removing the relatively few workers that do a large number of HITs (given the same number of total HITs), indicating that the atomic clustering is learned from the crowd at large.\nIn Figure 4 (right), we judge the impact of the sampling redundancy parameter V described in Section 2. We compare our approach (Bayesian crowdclustering) to two existing clustering aggregation methods from the literature: consensus clustering by nonnegative matrix factorization (NMF) [21] and the cluster ensembles method of Strehl and Ghosh (S&G) [5]. NMF and S&G require the number of inferred clusters to be provided as a parameter, and we set this to the number of ground truth categories. Even without the benefit of this additional information, our method (which automatically infers the number of clusters) outperforms the alternatives. To judge the benefit of modeling the characteristics of individual workers, we also compare against a variant of our model in which all HITs are treated as if they are performed by a single worker (Bayesian consensus.) We find a significant improvement. We fix R = 5 in this experiment, but we find a similar ranking of methods at other values of R. However, the performance benefit of the Bayesian methods over the existing methods increases with R.\nWe compare the four methods quantitatively on two additional data sets, with the results summarized in Table 1. In both cases, we instruct workers to categorize based on species. This is known to be a difficult task for non-experts. We set V = 6 and R = 5 for these experiments. Again, we find that Bayesian Crowdclustering outperforms the alternatives. A run time comparison is also given in Table 1. 
Bayesian Crowdclustering results on the Bird and Stonefly data sets are summarized in [14].\nFinally, we demonstrate Bayesian crowdclustering on the large scale Attribute Discovery data set. This data set has four image categories: bags, earrings, ties, and women’s shoes. In addition, each image is a member of one of 27 sub-categories (e.g., the bags category includes backpacks and totes as sub-categories.) See [14] for summary figures. We find that our method easily discovers the four\nFigure 5: Divisive Clustering on the Scenes data set (panels: Original Cluster 1, Original Cluster 4, Original Cluster 8). Left: Confusion matrix and high confidence examples when running our method on images assigned to cluster one in the original experiment (Figure 2). The three indoor scene categories are correctly recovered. Center: Workers are unable to subdivide mountain scenes consistently and our method returns a single cluster. Right: Workers may find perceptually relevant distinctions not present in the ground truth categories. Here, the highway category is subdivided according to the number of cars present.\ncategories. The subcategories are not discovered, likely due to limited context associated with HITs with size M = 36 as discussed in the next section. Runtime was approximately 9.5 hours on a six core Intel Xeon machine.\n4.1 Divisive Clustering\nAs indicated by the confusion matrix in Figure 2 (right), our method results in clusters that correspond to reasonable categories. 
However, it is clear that the data often has \ufb01ner categorical distinc-\ntions that go undiscovered. We conjecture that this is a result of the limited context presented to the\nworker in each HIT. When shown a set of M = 36 images consisting mostly of different types of\noutdoor scenes and a few indoor scenes, it is reasonable for a worker to consider the indoor scenes\nas a uni\ufb01ed category. However, if a HIT is composed purely of indoor scenes, a worker might draw\n\ufb01ner distinctions between images of of\ufb01ces, kitchens, and living rooms. To test this conjecture,\nwe developed a hierarchical procedure in which we run Bayesian crowdclustering independently on\nimages that are MAP assigned to the same cluster in the original Scenes experiment.\nFigure 5 (left) shows the results on the indoor scenes assigned to original cluster 1. We \ufb01nd that when\nrestricted to indoor scenes, the workers do \ufb01nd the relevant distinctions and our algorithm accurately\nrecovers the kitchen, living room, and of\ufb01ce ground truth categories. In Figure 5 (center) we ran the\nprocedure on images from original cluster 4, which is composed predominantly of mountain scenes.\nThe algorithm discovers one subcluster. In Figure 5 (right) the workers divide a cluster into three\nsubclusters that are perceptually relevant: they have organized them according to the number of cars\npresent.\n5 Conclusions\nWe have proposed a method for clustering a large set of data by distributing small tasks to a large\ngroup of workers. It is based on using a novel model of human clustering, as well as a novel ma-\nchine learning method to aggregate worker annotations. Modeling both data item properties and the\nworkers\u2019 annotation process and parameters appears to produce performance that is superior to ex-\nisting clustering aggregation methods. 
Our study poses a number of interesting questions for further research: Can adaptive sampling methods (as opposed to our random sampling) reduce the number of HITs that are necessary to achieve high quality clustering? Is it possible to model the workers' tendency to learn over time as they perform HITs, rather than treating HITs independently as we do here? Can we model contextual effects, perhaps by modeling the way that humans "regularize" their categorical decisions depending on the number and variety of items present in the task?

Acknowledgements This work was supported by ONR MURI grant 1015-G-NA-127, ARL grant W911NF-10-2-0016, and NSF grants IIS-0953413 and CNS-0932392.

References
[1] A. Sorokin and D. A. Forsyth. Utility data annotation with Amazon Mechanical Turk. In Internet Vision, pages 1-8, 2008.
[2] Sudheendra Vijayanarasimhan and Kristen Grauman. Large-scale live active learning: Training object detectors with crawled data and crowds. In CVPR, 2011.
[3] Peter Welinder, Steve Branson, Serge Belongie, and Pietro Perona. The multidimensional wisdom of crowds. In Neural Information Processing Systems Conference (NIPS), 2010.
[4] J. B. Kruskal. Multidimensional scaling by optimizing goodness-of-fit to a nonmetric hypothesis. Psychometrika, 29:1-29, 1964.
[5] Alexander Strehl and Joydeep Ghosh. Cluster ensembles: A knowledge reuse framework for combining multiple partitions.
Journal of Machine Learning Research, 3:583-617, 2002.
[6] Stefano Monti, Pablo Tamayo, Jill Mesirov, and Todd Golub. Consensus clustering: A resampling-based method for class discovery and visualization of gene expression microarray data. Machine Learning, 52(1-2):91-118, 2003.
[7] A. Gionis, H. Mannila, and P. Tsaparas. Clustering aggregation. ACM Transactions on Knowledge Discovery from Data, 1(1), 2007.
[8] A. Y. Lo. On a class of Bayesian nonparametric estimates: I. Density estimates. The Annals of Statistics, pages 351-357, 1984.
[9] I. Sutskever, R. Salakhutdinov, and J. B. Tenenbaum. Modelling relational data using Bayesian clustered tensor factorization. In Advances in Neural Information Processing Systems (NIPS), 2009.
[10] Hagai Attias. A variational Bayesian framework for graphical models. In NIPS, pages 209-215, 1999.
[11] Kenichi Kurihara, Max Welling, and Nikos Vlassis. Accelerated variational Dirichlet process mixtures. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19. MIT Press, Cambridge, MA, 2007.
[12] J. M. Bernardo and A. F. M. Smith. Bayesian Theory. Wiley, 1994.
[13] Tommi S. Jaakkola and Michael I. Jordan. A variational approach to Bayesian logistic regression models and their extensions, 1996.
[14] Ryan Gomes, Peter Welinder, Andreas Krause, and Pietro Perona. Crowdclustering. Technical Report CaltechAUTHORS:20110628-202526159, June 2011.
[15] Li Fei-Fei and Pietro Perona. A Bayesian hierarchical model for learning natural scene categories. In CVPR, pages 524-531. IEEE Computer Society, 2005.
[16] P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona. Caltech-UCSD Birds 200. Technical Report CNS-TR-2010-001, California Institute of Technology, 2010.
[17] G. Martinez-Munoz, N. Larios, E. Mortensen, W. Zhang, A. Yamamuro, R. Paasch, N. Payet, D. Lytle, L. Shapiro, S. Todorovic, et al. Dictionary-free categorization of very similar objects via stacked evidence trees. In CVPR, 2009.
[18] T. Berg, A. Berg, and J. Shih. Automatic attribute discovery and characterization from noisy web data. In Computer Vision - ECCV 2010, pages 663-676, 2010.
[19] V. Pareto. Cours d'économie politique. 1896.
[20] M. Meila. Comparing clusterings by the variation of information. In Learning Theory and Kernel Machines: COLT/Kernel 2003, volume 2777 of Lecture Notes in Computer Science, page 173. Springer, 2003.
[21] Tao Li, Chris H. Q. Ding, and Michael I. Jordan. Solving consensus and semi-supervised clustering problems using nonnegative matrix factorization. In ICDM, pages 577-582. IEEE Computer Society, 2007.