{"title": "Semi-crowdsourced Clustering with Deep Generative Models", "book": "Advances in Neural Information Processing Systems", "page_first": 3212, "page_last": 3222, "abstract": "We consider the semi-supervised clustering problem where crowdsourcing provides noisy information about the pairwise comparisons on a small subset of data, i.e., whether a sample pair is in the same cluster. We propose a new approach that includes a deep generative model (DGM) to characterize low-level features of the data, and a statistical relational model for noisy pairwise annotations on its subset. The two parts share the latent variables. To make the model automatically trade-off between its complexity and fitting data, we also develop its fully Bayesian variant. The challenge of inference is addressed by fast (natural-gradient) stochastic variational inference algorithms, where we effectively combine variational message passing for the relational part and amortized learning of the DGM under a unified framework. Empirical results on synthetic and real-world datasets show that our model outperforms previous crowdsourced clustering methods.", "full_text": "Semi-crowdsourced Clustering with\n\nDeep Generative Models\n\nYucen Luo, Tian Tian, Jiaxin Shi, Jun Zhu\u2217, Bo Zhang\n\nDept. of Comp. Sci. & Tech., Institute for AI, THBI Lab, BNRist Center,\nState Key Lab for Intell. Tech. & Sys., Tsinghua University, Beijing, China\n\n{luoyc15,shijx15}@mails.tsinghua.edu.cn, rossowhite@163.com\n\n{dcszj,dcszb}@mail.tsinghua.edu.cn\n\nAbstract\n\nWe consider the semi-supervised clustering problem where crowdsourcing provides\nnoisy information about the pairwise comparisons on a small subset of data, i.e.,\nwhether a sample pair is in the same cluster. We propose a new approach that\nincludes a deep generative model (DGM) to characterize low-level features of\nthe data, and a statistical relational model for noisy pairwise annotations on its\nsubset. 
The two parts share the latent variables. To make the model automatically trade off between its complexity and fitting data, we also develop its fully Bayesian variant. The challenge of inference is addressed by fast (natural-gradient) stochastic variational inference algorithms, where we effectively combine variational message passing for the relational part and amortized learning of the DGM under a unified framework. Empirical results on synthetic and real-world datasets show that our model outperforms previous crowdsourced clustering methods.\n\n1 Introduction\n\nClustering is a classic data analysis problem when the taxonomy of data is unknown in advance. Its main goal is to divide samples into disjoint clusters based on the similarity between them. Clustering is useful in various application areas including computer vision [21], bioinformatics [28], anomaly detection [2], etc. When the feature vectors of samples are observed, most clustering algorithms require a similarity or distance metric defined in the feature space, so that the optimization objective can be built. Since different metrics may result in entirely different clustering results, and general geometry metrics may not meet the intention of the tasks\u2019 designer, many clustering approaches learn the metric from the side-information provided by domain experts [30]. The manual labeling procedure of experts can therefore become a bottleneck for the learning pipeline.\nCrowdsourcing is an efficient way to collect human feedback [12]. It distributes micro-tasks to a group of ordinary web workers in parallel, so the whole task can be done fast with relatively low cost. It has been used for annotating large-scale machine learning datasets such as ImageNet [6], and can also be used to collect side-information for clustering. However, directly collecting labels from crowds may lead to low-quality results due to the lack of expertise of workers. 
Consider an example of labeling a set of images of flowers from different species. One could show images to the web workers and ask them to identify the corresponding species, but such tasks require the workers to be experts in identifying the flowers and to have all the species in mind, which is not always possible. A more reasonable and easier task is to ask the workers to compare pairs of flower images and to answer whether they are of the same species or not. Then specific clustering methods are required to discover the clusters from the noisy feedback.\nTo solve the above clustering problem with pairwise similarity labels between samples from the crowds, Crowdclustering [8] discovers the clusters within the dataset using a Bayesian hierarchical model.\n\n\u2217corresponding author\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nBy explicitly modeling the mistakes and preferences of web workers, the outputs will match the human understanding of the clustering tasks. This method reduces the labeling cost to a great degree compared with expert labeling. However, the cost still grows quadratically as the dataset size grows, so it is still only suitable for small datasets. In this work, we move one step further and consider the semi-supervised crowdclustering problem that jointly models the feature vectors and the crowdsourced pairwise labels for only a subset of samples. When we control the size of the subset to be labeled by crowds, the total labeling budget and time can be controlled. 
A similar problem has been discussed by [31], while the authors use a linear similarity function defined on the low-level object features, and ignore the noise and inter-worker variations in the manual annotations.\nDifferent from existing approaches, we propose a semi-supervised deep Bayesian model to jointly model the generation of the labels and the raw features for both crowd-labeled and unlabeled samples. Instead of the direct usage of low-level features, we build a flexible deep generative model (DGM) to capture the latent representation of data, which is more suitable to express the semantic similarity than the low-level features. The crowdsourced pairwise labels are modeled by a statistical relational model, and the two parts (i.e., DGM and the relational model) share the same latent variables. We also investigate the fully Bayesian variant of this model so that it can automatically control its complexity. Due to the intractability of exact inference, we develop fast (natural-gradient) stochastic variational inference algorithms. To address the challenges in fully Bayesian inference over model parameters, we effectively combine variational message passing and natural gradient updates for the conjugate part (i.e., the relational model and the mixture model) and amortized learning of the nonconjugate part (i.e., DGM) under a unified framework. Empirical results on synthetic and real-world datasets show that our model outperforms previous crowdsourced clustering methods.\n\n2 Semi-crowdsourced deep clustering\nIn this section, we propose the semi-crowdsourced clustering with deep generative models for directly modeling the raw data, which enables end-to-end training. We call the model Semi-crowdsourced Deep Clustering (SCDC), whose graphical model is shown in Figure 1. 
This model is composed of two parts: the raw data model handles the generative process of the observations O; the crowdsourcing behavior model on labels L describes the labeling procedure of the workers. The details for each part will be introduced below.\n\nFigure 1: Semi-crowdsourced Deep Clustering (SCDC).\n\n2.1 Model the raw data \u2013 deep generative models\nWe denote the raw data observations by O = {o1, ..., oN}. For images, on \u2208 RD denotes the pixel values. For each data point on we have a corresponding latent variable xn \u2208 Rd and p(on|xn, \u03b3) is a flexible neural network density model parametrized by \u03b3. p(xn|zn; \u00b5, \u03a3) is a Gaussian mixture where zn is a 1-of-K binary vector with elements znk for k = 1, ..., K. Here K denotes the number of clusters. We denote the local latent variables by X = {x1, ..., xN}, Z = {z1, ..., zN}. When real-valued observations are given, the generative process is as follows:\n\n$$p(Z; \\pi) = \\prod_{n=1}^{N} p(z_n; \\pi) = \\prod_{n=1}^{N} \\prod_{k=1}^{K} \\pi_k^{z_{nk}}, \\quad p(X|Z; \\mu, \\Sigma) = \\prod_{n=1}^{N} \\prod_{k=1}^{K} \\mathcal{N}(x_n; \\mu_k, \\Sigma_k)^{z_{nk}},$$\n$$p(O|X; \\gamma) = \\prod_{n=1}^{N} \\mathcal{N}(o_n | \\mu_\\gamma(x_n), \\mathrm{diag}(\\sigma^2_\\gamma(x_n))),$$\n\nwhere $\\mu_\\gamma(\\cdot)$ and $\\sigma^2_\\gamma(\\cdot)$ are two neural networks parameterized by \u03b3. For other types of observations O, p(o|x; \u03b3) can be other distributions, e.g. Bernoulli distribution for binary observations. In general, our model is a deep generative model with structured latent variables.\n\n2.2 Model the behavior of each worker \u2013 two-coin Dawid-Skene model\nWe collect pairwise annotations provided by M workers. 
A partially observed $L^{(m)} \\in \\{0, 1, \\mathrm{NULL}\\}^{N_l \\times N_l}$ is the annotation matrix of the m-th worker, where Nl is the number of annotated data points. For observation pairs (oi, oj), i \u2260 j, $L^{(m)}_{ij} = 1$ represents that the m-th worker provides a must-link (ML) constraint, which means observations i and j belong to the same cluster, $L^{(m)}_{ij} = 0$ represents a cannot-link (CL) constraint, which means observations i and j belong to different clusters, and NULL represents that $L^{(m)}_{ij}$ is not observed. It is obvious that L(m) is symmetric, i.e., $L^{(m)}_{ij} = L^{(m)}_{ji}$, \u2200i, j, m. Self-edges are not allowed, i.e., $L^{(m)}_{ii} = \\mathrm{NULL}$, \u2200i.\nAmong all the N data observations O, we only crowdsource pairwise annotations for a small portion of O, denoted by OL. Each worker only provides annotations to a small number of items in OL and the annotation accuracies of non-expert workers may vary with observations and levels of expertise. We adopt the two-coin Dawid-Skene model for annotators from [18] and develop a probabilistic model by explicitly modeling the uncertainty of each worker. Specifically, the uncertainty of the m-th worker can be characterized by accuracy parameters (\u03b1m, \u03b2m), where \u03b1m represents sensitivity, which means the probability of providing ML constraints for sample pairs belonging to the same cluster, and \u03b2m is the m-th worker\u2019s specificity, which means the probability of providing CL constraints for sample pairs from different clusters. Let \u03b1 = {\u03b11, ..., \u03b1M} and \u03b2 = {\u03b21, ..., \u03b2M}. The likelihood is defined as\n\n$$p(L^{(m)}_{ij} | z_i, z_j; \\alpha_m, \\beta_m) = \\mathrm{Bern}(L^{(m)}_{ij} | \\alpha_m)^{z_i^\\top z_j} \\, \\mathrm{Bern}(L^{(m)}_{ij} | 1 - \\beta_m)^{1 - z_i^\\top z_j}, \\qquad (1)$$\n\nor equivalently, $p(L^{(m)}_{ij} = 1 | z_i = z_j, \\alpha_m) = \\alpha_m$, $p(L^{(m)}_{ij} = 0 | z_i \\neq z_j, \\beta_m) = \\beta_m$. To simplify the notation, we define $I^{(m)}_{ij} = \\mathbb{I}[L^{(m)}_{ij} \\neq \\mathrm{NULL}]$. Using the symmetry of L(m), the total likelihood of annotations can be written\n\n$$p(L | Z; \\alpha, \\beta) = \\prod_{m=1}^{M} \\prod_{1 \\le i < j \\le N} p(L^{(m)}_{ij} | z_i, z_j; \\alpha_m, \\beta_m)^{I^{(m)}_{ij}}. \\qquad (2)$$\n\n[...] > 0 is the concentration, S \u2208 Rd\u00d7d is the scale matrix (positive definite), and \u03bd > d \u2212 1 is the degrees of freedom. The densities of \u03c0, z, (\u00b5, \u03a3), x can be written in the standard form of exponential families as:\n\n$$p(\\pi) = \\exp\\{\\langle \\eta^0_\\pi, t(\\pi) \\rangle - \\log Z(\\eta^0_\\pi)\\}, \\quad p(\\mu, \\Sigma) = \\exp\\{\\langle \\eta^0_{\\mu,\\Sigma}, t(\\mu, \\Sigma) \\rangle - \\log Z(\\eta^0_{\\mu,\\Sigma})\\},$$\n$$p(z|\\pi) = \\exp\\{\\langle \\eta^0_z(\\pi), t(z) \\rangle - \\log Z(\\eta^0_z(\\pi))\\} = \\exp\\{\\langle t(\\pi), (t(z), 1) \\rangle\\},$$\n$$p(x|z, \\mu, \\Sigma) = \\exp\\{\\langle t(z), t(\\mu, \\Sigma)^\\top (t(x), 1) \\rangle\\},$$\n\nwhere \u03b7 denotes the natural parameters, t(\u00b7) denotes the sufficient statistics2, and log Z(\u00b7) denotes the log partition function.\nFor the relational model, we assume the accuracy parameters of all workers (\u03b1, \u03b2) are drawn independently from common priors. 
We choose conjugate Beta priors for them as\n\n$$p(\\alpha) = \\prod_{m=1}^{M} p(\\alpha_m) = \\prod_{m=1}^{M} \\mathrm{Beta}(\\tau^0_{\\alpha_1}, \\tau^0_{\\alpha_2}), \\quad p(\\beta) = \\prod_{m=1}^{M} p(\\beta_m) = \\prod_{m=1}^{M} \\mathrm{Beta}(\\tau^0_{\\beta_1}, \\tau^0_{\\beta_2}). \\qquad (6)$$\n\nWe write the exponential family form of p(\u03b1m) as $p(\\alpha_m) = \\exp\\{\\langle \\eta^0_{\\alpha_m}, t(\\alpha_m) \\rangle - \\log Z(\\eta^0_{\\alpha_m})\\}$ (p(\u03b2m) is similar), where $\\eta^0_{\\alpha_m} = [\\tau^0_{\\alpha_1} - 1, \\tau^0_{\\alpha_2} - 1]^\\top$ and $t(\\alpha_m) = [\\log \\alpha_m, \\log(1 - \\alpha_m)]^\\top$.\n\n3.2 Natural-gradient stochastic variational inference\nThe overall joint distribution of all of the hidden and observed variables takes the form:\n\n$$p(L^{(1:M)}, O, X, Z, \\Theta; \\gamma) = p(\\pi) p(Z|\\pi) p(\\mu, \\Sigma) p(X|Z, \\mu, \\Sigma) p(O|X; \\gamma) \\cdot p(\\alpha) p(\\beta) p(L^{(1:M)}|Z, \\alpha, \\beta). \\qquad (7)$$\n\nOur learning objective is to maximize the marginal likelihood of observed data and pairwise annotations log p(O, L(1:M)). Exact posterior inference for this model is intractable. Thus we consider a mean-field variational family q(\u0398, Z, X) = q(\u03b1)q(\u03b2)q(Z)q(X)q(\u03c0)q(\u00b5, \u03a3). To simplify the notation, we write each variational distribution in its exponential family form: $q(\\theta) = \\exp\\{\\langle \\eta_\\theta, t(\\theta) \\rangle - \\log Z(\\eta_\\theta)\\}$, \u03b8 \u2208 \u0398 \u222a Z \u222a X. 
The evidence lower bound (ELBO) $\\mathcal{L}(\\eta_\\Theta, \\eta_Z, \\eta_X; \\gamma)$ of log p(O, L(1:M)) is\n\n$$\\log p(O, L^{(1:M)}) \\ge \\mathcal{L}(\\eta_\\Theta, \\eta_Z, \\eta_X; \\gamma) \\triangleq \\mathbb{E}_{q(\\Theta, Z, X)} \\left[ \\log \\frac{p(L^{(1:M)}, O, X, Z, \\Theta; \\gamma)}{q(\\Theta) q(Z) q(X)} \\right]. \\qquad (8)$$\n\n2 Detailed expressions of each distribution can be found in Appendix A.\n\nIn traditional mean-field variational inference for conjugate models, the optimal solution of maximizing eq. (8) over each variational parameter can be derived analytically given other parameters fixed, thus a coordinate ascent can be applied as an efficient message passing algorithm [27, 11]. However, it is not directly applicable to our model due to the non-conjugate observation likelihood p(O|X; \u03b3). Inspired by [13], we handle the non-conjugate likelihood by introducing recognition networks r(oi; \u03c6). Different from SCDC in Section 2.3, the recognition networks here are used to form conjugate graphical model potentials:\n\n$$\\psi(x_i; o_i, \\phi) \\triangleq \\langle r(o_i; \\phi), t(x_i) \\rangle. \\qquad (9)$$\n\nBy replacing the non-conjugate likelihood p(O|X; \u03b3) in the original ELBO with a conjugate term defined by \u03c8(xi; oi, \u03c6), we have the following surrogate objective $\\hat{\\mathcal{L}}$:\n\n$$\\hat{\\mathcal{L}}(\\eta_\\Theta, \\eta_Z, \\eta_X; \\phi) \\triangleq \\mathbb{E}_{q(\\Theta, Z, X)} \\left[ \\log \\frac{p(L^{(1:M)}, X, Z, \\Theta) \\exp\\{\\psi(X; O, \\phi)\\}}{q(\\Theta) q(Z) q(X)} \\right]. \\qquad (10)$$\n\nAs we shall see, the surrogate objective $\\hat{\\mathcal{L}}$ helps us exploit the conjugate structure in the model, thus enables a fast message-passing algorithm for these parts. Specifically, we can view eq. (10) as the ELBO of a conjugate graphical model with the same structure as in Fig. 1 (up to a constant). Similar to coordinate-ascent mean-field variational inference [11], we can derive the local partial optimizers of individual variational parameters as below.\nThe optimal solution for q\u2217(X) factorizes over n, i.e., $q^*(X) = \\prod_{i=1}^{N} q^*(x_i)$, and q\u2217(xi) depends on the expected sufficient statistics of (\u00b5, \u03a3) and zn:\n\n$$\\log q^*(x_i) = \\mathbb{E}_{q(\\mu, \\Sigma) q(z_i)} \\log p(x_i | z_i, \\mu, \\Sigma) + \\langle r(o_i; \\phi), t(x_i) \\rangle + \\mathrm{const}, \\qquad (11)$$\n$$\\eta^*_{x_i} = \\mathbb{E}_{q(\\mu, \\Sigma)}[\\eta^0_{x_i}(\\mu, \\Sigma)]^\\top \\mathbb{E}_{q(z_i)}[t(z_i)] + r(o_i; \\phi). \\qquad (12)$$\n\nBy further assuming a mean-field structure over Z: $q^*(Z) = \\prod_{i=1}^{N} q^*(z_i)$, we have the local partial optimizer for each single q(zi) as\n\n$$\\log q^*(z_i) = \\mathbb{E}_{q(\\pi)} \\log p(z_i | \\pi) + \\mathbb{E}_{q(\\mu, \\Sigma) q(x_i)} \\log p(x_i | z_i, \\mu, \\Sigma) + \\mathbb{E}_{q(\\alpha) q(\\beta) q(Z_{-i})} \\left[ \\log p(L^{(1:M)} | Z, \\alpha, \\beta) \\right] + \\mathrm{const}, \\qquad (13)$$\n$$\\eta^*_{z_i} = \\mathbb{E}_{q(\\pi)} t(\\pi) + \\mathbb{E}_{q(\\mu, \\Sigma)}[t(\\mu, \\Sigma)]^\\top \\mathbb{E}_{q(x_i)}[(t(x_i), 1)] + \\sum_{m=1}^{M} \\sum_{j=1}^{N} w^{(m)}_{ij} \\mathbb{E}_{q(z_j)}[t(z_j)], \\qquad (14)$$\n\nwhere $w^{(m)}_{ij} = I^{(m)}_{ij} \\, \\mathbb{E}_{q(\\alpha, \\beta)} \\left[ \\ln \\frac{1 - \\alpha_m}{\\beta_m} + L^{(m)}_{ij} \\left( \\ln \\frac{\\alpha_m}{1 - \\alpha_m} + \\ln \\frac{\\beta_m}{1 - \\beta_m} \\right) \\right]$ is the weight of the message from zj to zi. Using a block coordinate ascent algorithm that applies eqs. (12) and (14) alternately, we can find the joint local partial optimizers $(\\eta^*_Z(\\eta_\\Theta, \\phi), \\eta^*_X(\\eta_\\Theta, \\phi))$ of $\\hat{\\mathcal{L}}$ w.r.t. (\u03b7X, \u03b7Z) given other parameters fixed, i.e.,\n\n$$\\nabla_{\\eta_Z} \\hat{\\mathcal{L}}(\\eta_\\Theta, \\eta^*_Z(\\eta_\\Theta, \\phi), \\eta^*_X(\\eta_\\Theta, \\phi), \\phi) = 0, \\quad \\nabla_{\\eta_X} \\hat{\\mathcal{L}}(\\eta_\\Theta, \\eta^*_Z(\\eta_\\Theta, \\phi), \\eta^*_X(\\eta_\\Theta, \\phi), \\phi) = 0. \\qquad (15)$$\n\nPlugging $(\\eta^*_Z(\\eta_\\Theta, \\phi), \\eta^*_X(\\eta_\\Theta, \\phi))$ back into $\\mathcal{L}$, we define the final objective\n\n$$\\mathcal{J}(\\eta_\\Theta; \\phi, \\gamma) \\triangleq \\mathcal{L}(\\eta_\\Theta, \\eta^*_Z(\\eta_\\Theta, \\phi), \\eta^*_X(\\eta_\\Theta, \\phi), \\gamma). \\qquad (16)$$\n\nAs shown in [13], $\\mathcal{J}(\\eta_\\Theta; \\phi, \\gamma)$ lower-bounds the partially-optimized mean field objective, i.e., $\\max_{\\eta_X, \\eta_Z} \\mathcal{L}(\\eta_\\Theta, \\eta_Z, \\eta_X, \\gamma) \\ge \\mathcal{J}(\\eta_\\Theta, \\gamma, \\phi)$, thus can serve as a variational objective itself. We compute the natural gradients of $\\mathcal{J}$ w.r.t. the global variational parameters \u03b7\u0398:\n\n$$\\tilde{\\nabla}_{\\eta_\\Theta} \\mathcal{J} = \\left( \\eta^0_\\Theta + \\mathbb{E}_{q^*(Z) q^*(X)} \\left[ \\left( t(Z, X, L^{(1:M)}), 1 \\right) \\right] - \\eta_\\Theta \\right) + \\left( \\nabla_{\\eta_Z, \\eta_X} \\mathcal{L}(\\eta_\\Theta, \\eta^*_Z(\\eta_\\Theta, \\phi), \\eta^*_X(\\eta_\\Theta, \\phi); \\gamma), 0 \\right). \\qquad (17)$$\n\nNote that the first term in eq. (17) is the same as the formula of natural gradient in SVI [11], which is easy to compute, and the second term originates from the dependence of $\\eta^*_Z, \\eta^*_X$ on \u03b7\u0398 and can be computed using the reparameterization trick. 
For other parameters \u03c6, \u03b3, we can also get the gradients \u2207\u03c6J (\u03b7\u0398; \u03c6, \u03b3) and \u2207\u03b3J (\u03b7\u0398; \u03c6, \u03b3) using the reparameterization trick.\n\nAlgorithm 1 Semi-crowdsourced clustering with DGMs (BayesSCDC)\n\nInput: observations O = {o1, ..., oN}, annotations L(1:M), variational parameters (\u03b7\u0398, \u03b3, \u03c6)\nrepeat\n  \u03c8i \u2190 \u27e8r(oi; \u03c6), t(xi)\u27e9, i = 1, ..., N\n  for each local variational parameter \u03b7\u2217xi and \u03b7\u2217zi do\n    Update alternately using eq. (12) and eq. (14)\n  end for\n  Sample \u02c6xi \u223c q\u2217(xi), i = 1, ..., N\n  Use \u02c6xi to approximate Eq\u2217(x) log p(o|x; \u03b3) in the lower bound J, eq. (16)\n  Update the global variational parameters \u03b7\u0398 using the natural gradient in eq. (17)\n  Update \u03c6, \u03b3 using \u2207\u03c6,\u03b3J (\u03b7\u0398; \u03c6, \u03b3)\nuntil convergence\n\nStochastic approximation: Computing the full natural gradient in eq. (17) requires scanning over all data and annotations, which is time-consuming. Similar to Section 2.3, we can approximate the variational lower bound with unbiased estimates using mini-batches of data and annotations, thus getting a stochastic natural gradient. Several sampling strategies have been developed for relational models [9] to keep the stochastic gradient unbiased. Here we choose the simplest way: we sample annotated data pairs uniformly from the annotations and form a subsample of the relational model, do local message passing (eqs. (12) and (14)), then perform the global update using the stochastic natural gradient calculated on the subsample. Besides, for all the unannotated data, we also subsample mini-batches from them and perform local and global steps without relational terms. 
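The first term of eq. (17) can be illustrated on the smallest conjugate piece of the model, the Beta posterior over one worker's sensitivity. The following is a minimal sketch under stated assumptions, not the authors' implementation: it assumes the co-assignment probabilities E[z_i^T z_j] are already available from the local step, it ignores the second (reparameterization) term of eq. (17), and the function names, the mini-batch `scale` factor and the step size `rho` are hypothetical choices made for illustration.

```python
def sensitivity_stats(pairs):
    # pairs: list of (same_prob, label) for one worker, where same_prob is
    # E[z_i^T z_j] under q(Z) and label is 1 (must-link) or 0 (cannot-link).
    # Returns the expected coefficients of log(alpha) and log(1 - alpha).
    must_link = sum(p for p, label in pairs if label == 1)
    cannot_link = sum(p for p, label in pairs if label == 0)
    return [must_link, cannot_link]


def natural_gradient_step(eta, eta0, batch_stats, scale, rho):
    # Natural gradient (first term of eq. (17)): eta0 + scale * stats - eta.
    # scale rescales mini-batch statistics to the full annotation set
    # (an assumption of this sketch); rho is the step size.
    target = [e0 + scale * s for e0, s in zip(eta0, batch_stats)]
    return [(1.0 - rho) * e + rho * t for e, t in zip(eta, target)]


def posterior_mean(eta):
    # Natural parameters are [a - 1, b - 1] for Beta(a, b); mean is a / (a + b).
    a, b = eta[0] + 1.0, eta[1] + 1.0
    return a / (a + b)


# Example: Beta(1, 1) prior, four annotations from one worker.
eta0 = [0.0, 0.0]
pairs = [(1.0, 1), (1.0, 1), (1.0, 0), (0.0, 1)]
eta = natural_gradient_step([5.0, 5.0], eta0, sensitivity_stats(pairs), 1.0, 1.0)
print(eta, posterior_mean(eta))
```

With `rho = 1` and `scale = 1` the step jumps straight to the exact conjugate update, which is the coordinate-ascent special case; smaller step sizes on rescaled mini-batch statistics give the stochastic variant described in the paragraph above.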
The algorithm of BayesSCDC is shown in Algorithm 1.\n\nComparison with SCDC. BayesSCDC is different in two aspects: (a) fully Bayesian treatment of global parameters; (b) variational algorithms. As we shall see in experiments, the result of (a) is that BayesSCDC can automatically determine the number of mixture components during training. As for (b), note that the variational family used in SCDC is not more flexible, but more restricted compared to BayesSCDC. In BayesSCDC, the mean-field q(z)q(x) doesn\u2019t imply that q\u2217(z) and q\u2217(x) are independent; instead, they implicitly influence each other through message passing in eqs. (12) and (14). More importantly, in BayesSCDC the variational posterior gathers information from L through message passing in the relational model. In contrast, the amortized form q(z|o)q(x, z|o) used in SCDC ignores the effect of observed annotations L. Another advantage of the inference algorithm in BayesSCDC is in the computational cost. As we have seen in Algorithm 1, the number of passes through the x to o network is no longer linear with K because we get rid of summing over z in the observation term as in Section 2.3.\n\n4 Related work\nMost previous works on learning-from-crowds are about aggregating noisy crowdsourced labels from several predefined classes [5, 18, 26, 32, 24]. A common approach is to simultaneously estimate the workers\u2019 behavior models and the ground truths. Different from this line of work, crowdclustering [8] collects pairwise labels, including the must-links and the cannot-links, from the crowds, then discovers the items\u2019 affiliations as well as the category structure from these noisy labels, so it can be used on a broader range of applications compared with the classification methods. 
Recent work [25] also developed a crowdclustering algorithm on triplet annotations.\n\nFigure 2: Clustering results on the Pinwheel dataset, with each color representing one cluster. (a) The Pinwheel dataset. (b) Without annotations, good initialization. (c) Without annotations, bad initialization. (d) With noisy annotations on a subset of data.\n\nOne shortcoming of crowdclustering is that it can only cluster objects with available manual annotations. For large-scale problems, it is not feasible to have each object manually annotated by multiple workers. Similar problems were extensively discussed in the semi-supervised clustering area, where we are given the features for all the items and constraints on only a small portion of the items. Metric learning methods, including Information-Theoretic Metric Learning (ITML) [4] and Metric Pairwise Constrained KMeans (MPCKMeans) [1], are used on this problem: they first learn the similarity metric between items mainly based on the supervised portion of data, then cluster the remaining items using this metric. Semi-crowdsourced clustering (SemiCrowd) [31] combines the idea of crowdclustering and semi-supervised clustering; it aims to learn a pairwise similarity measure from the crowdsourced labels of n objects (n \u226a N) and the features of N objects. Unlike crowdclustering, the number of clusters in SemiCrowd is assumed to be given a priori, and it doesn\u2019t estimate the behavior of different workers. Multiple Clustering Views from the Crowd (MCVC) [3] extends the idea to discover several different clustering results from the noisy labels provided by uncertain experts. 
A common shortcoming of these semi-crowdsourced clustering methods is that they cannot make good use of unlabeled items when measuring the similarities, while our model is a step towards this direction.\nAs shown in Section 2.1, our model is a deep generative model (DGM) with relational latent structures. DGMs are a kind of probabilistic graphical model that uses neural networks to parameterize the conditional distribution between random variables. Unlike traditional probabilistic models, DGMs can directly model high-dimensional outputs with complex structures, which enables end-to-end training on real data. They have shown success in (conditional) image generation [15, 19], semi-supervised learning [14], and one-shot classification [20]. Typical inference algorithms for DGMs are in the amortized form like that in Section 2.3. However, this approach cannot leverage the conjugate structures in latent variables. Therefore, little work has been done on fully Bayesian treatment of global parameters in DGMs. [13, 16] are two exceptions. In [13] the authors propose using recognition networks to produce conjugate graphical model potentials, so that traditional variational message passing algorithms and natural gradient updates can be easily combined with amortized learning of network parameters. Our work extends their algorithm to relational observations, which has not been investigated before.\n\n5 Experiments\n\nIn this section, we demonstrate the effectiveness of the proposed methods on synthetic and real-world datasets with simulated or crowdsourced noisy annotations. Code is available at https://github.com/xinmei9322/semicrowd. Part of the implementation is based on ZhuSuan [22].\n\n5.1 Toy Pinwheel dataset\nSimulating noisy annotations from workers. Suppose we have M workers with accuracy parameters (\u03b1m, \u03b2m). We randomly sample pairs of items oi and oj and generate the annotations provided by worker m based on the true clustering labels of oi and oj as well as the worker\u2019s accuracy parameters (\u03b1m, \u03b2m). If oi and oj belong to the same cluster, the worker has probability \u03b1m to provide the ML constraint $L^{(m)}_{ij} = 1$. If not, the worker has probability \u03b2m to provide the CL constraint $L^{(m)}_{ij} = 0$.\nEvaluation metrics. The clustering performance is evaluated by the commonly used normalized mutual information (NMI) score [23], measuring the similarity between two partitions. Following recent work [29], we also report the unsupervised clustering accuracy, which requires computing the best mapping between clusters and labels; this can be done efficiently with the Hungarian algorithm.\nFirst we apply our method to a toy example, the pinwheel dataset in Fig. 2a, following [13, 16]. It has 5 clusters and each cluster has 100 data points, thus there are 500 data points in total. We compare with unsupervised clustering to understand the help of noisy annotations. The clustering results are shown in Fig. 2. We randomly sample 100 data points for annotation and simulate 20 workers; each worker gives 49 pairs of annotations, 980 in total. We set equal accuracy for each worker, \u03b1m = \u03b2m = 0.9.\nWe use the fully Bayesian model (BayesSCDC) described in Section 3. The initial number of clusters is set to a larger number K = 15, since the hyperpriors intrinsically induce sparsity and the model can learn the number of clusters automatically. Unsupervised clustering is sensitive to the initialization: with a good initialization it achieves 95.6% accuracy and an NMI score of 0.91, as shown in Fig. 2b, and after training it learns K = 8 clusters. However, with bad initializations, the accuracy and NMI score of unsupervised clustering are 75.6% and 0.806, respectively, as shown in Fig. 2c. With noisy annotations on 100 randomly sampled data points, our model improves accuracy to 96.6% and NMI score to 0.94, and it converges to K = 6 clusters. Our model prevents the bad results in Fig. 2c by making use of annotations.\n\nFigure 3: Comparison to baselines: (a) Face: All the data points are annotated; (b) Face: Only 100 data points are annotated; (c) True accuracies are set to \u03b1 = \u03b2 = [0.95, 0.9, 0.85, 0.8, 0.75]. The green line is the true weights of each worker and the red line is the estimated weights by our model.\n\n5.2 UCI benchmark experiments\n\nIn this subsection, we compare the proposed SCDC with the competing methods on the UCI benchmarks. The baselines include MCVC [3], SemiCrowd [31], semi-supervised clustering methods such as ITML [4], MPCKMeans [1] and Cluster-based Similarity Partitioning Algorithm (CSPA) [23]. Crowdsourced annotations are not available for UCI datasets. Following the experimental protocol in MCVC [3], we generate noisy annotations given by M = 5 simulated workers with different sensitivity and specificity, i.e., \u03b1 = \u03b2 = [0.95, 0.9, 0.85, 0.8, 0.75], which is more challenging than equal accuracy parameters. The number of annotations provided by each worker varies from 200 to 2000 and the number of ML constraints equals the number of CL constraints.\nWe test on the Face dataset [7], containing 640 face images from 20 people with different poses (straight, left, right, up). The ground-truth clustering is based on the poses. The original image has 960 pixels. To speed up training, baseline methods apply Principal Component Analysis (PCA) and keep 20 components. For fair comparison, we test the proposed SCDC on the features after PCA. Fig. 3 plots the mean and standard deviation of NMI scores in 10 different runs for each fixed number of constraints. In Fig. 3a, the annotations are randomly generated on the whole dataset. 
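The simulated-worker protocol used in Sections 5.1 and 5.2 can be sketched as follows. This is a minimal illustration of the two-coin model in eq. (1), not the code released with the paper: the function names and the pair-sampling helper are hypothetical, and the sketch does not enforce the equal number of ML and CL constraints used in the UCI protocol.

```python
import random


def sample_pairs(n_items, n_pairs, rng):
    # Draw n_pairs index pairs (i, j) with i != j, uniformly at random.
    return [tuple(rng.sample(range(n_items), 2)) for _ in range(n_pairs)]


def simulate_annotations(labels, pairs, alpha, beta, rng):
    # labels: true cluster id per item; pairs: list of (i, j) with i != j.
    # alpha: sensitivity, P(answer = 1 | same cluster).
    # beta: specificity, P(answer = 0 | different clusters).
    annotations = {}
    for i, j in pairs:
        if labels[i] == labels[j]:
            answer = 1 if rng.random() < alpha else 0
        else:
            answer = 0 if rng.random() < beta else 1
        annotations[(i, j)] = answer
    return annotations


# Example: five workers with the accuracies used in Section 5.2.
rng = random.Random(0)
labels = [n % 4 for n in range(40)]
accuracies = [0.95, 0.9, 0.85, 0.8, 0.75]
crowd = [simulate_annotations(labels, sample_pairs(40, 20, rng), a, a, rng)
         for a in accuracies]
print(len(crowd), len(crowd[0]))
```

Passing an explicit `random.Random` instance keeps the simulation reproducible across runs, which matters when results are averaged over several random draws of constraints.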
We observe that our method consistently outperforms all competing methods, demonstrating that the clustering benefits from the joint generative modeling of inputs and annotations.\nAnnotations on a subset. To illustrate the benefits of our method in the situation where only a small part of data points are annotated, we simulate noisy annotations on only 100 images. Fig. 3b shows the results of 100 annotated images. Our method exploits more structure information in the unlabeled data and shows notable improvements over all competing methods.\nRecover worker behaviors. For each worker m, our model estimates the different accuracies \u03b1m and \u03b2m. We can derive from eq. (2) that the annotations of each worker m are weighted by $\\log \\frac{\\alpha_m}{1 - \\alpha_m} + \\log \\frac{\\beta_m}{1 - \\beta_m}$, which means workers with higher accuracies are more reliable and will be weighted higher. We plot the weights of 5 workers in the Face experiments in Fig. 3c.\n\n5.3 End-to-end training with raw images\n\nMNIST. As mentioned earlier, an important feature of DGMs is that they can directly model raw data, such as images. To verify this, we experiment with the MNIST dataset of digit images, which includes 60k training images from handwritten digits 0-9. We collect crowdsourced annotations from M = 48 workers and get 3276 annotations in total. The two variants of our model (SCDC, BayesSCDC) are tested with or without annotations. For BayesSCDC, a non-informative prior Beta(1, 1) is placed over \u03b1, \u03b2. For fair comparison, we also randomly sample the initial accuracy parameters \u03b1, \u03b2 from Beta(1, 1) for SCDC. We average the results of 5 runs. In each run we randomly initialize the model 10 times and pick the best result. 
All models are trained for\n200 epochs with a minibatch size of 128 for each random initialization. The results are shown in\nTable 1.\n\n[Figure 3 plots. (a), (b): Normalized Mutual Information vs. Number of Constraints for MCVC, SemiCrowd, ITML, MPCKMeans, CSPA, and Proposed. (c): True Weights and Recovered Weights vs. Worker ID.]\n\nTable 1: Clustering performance on MNIST. The average time per epoch is reported.\n\nMethod | Accuracy (without annotations) | NMI (without annotations) | Time (without annotations) | Accuracy (with annotations) | NMI (with annotations) | Time (with annotations)\nSCDC | 65.92 \u00b1 3.47% | 0.6953 \u00b1 0.0167 | 177.3s | 81.87 \u00b1 3.86% | 0.7657 \u00b1 0.0233 | 201.7s\nBayesSCDC | 77.64 \u00b1 3.97% | 0.7944 \u00b1 0.0178 | 11.2s | 84.24 \u00b1 5.52% | 0.8120 \u00b1 0.0210 | 16.4s\n\n[Figure 4 panel labels: Epoch 1, Epoch 7, Epoch 25, Epoch 200.]\n\nFigure 4: (a) MNIST: visualization of generated random samples of 50 clusters during training\nBayesSCDC. Each column represents a cluster, whose inferred proportion (\u03c0k) is reflected by\nbrightness; (b) Clustering results on CIFAR-10: (top) unsupervised; (bottom) with noisy annotations.\n\nWe can see that both models can effectively combine the information from the raw data and\nannotations, i.e., they work reasonably well with only unlabeled data, and better when given noisy\nannotations on a subset of data. In terms of clustering accuracy and NMI, BayesSCDC outperforms\nSCDC. We believe that this is because the variational message passing algorithm used in BayesSCDC\ncan effectively gather information from the crowdsourced annotations to form better variational\napproximations, as explained in Section 3.2. Besides being more accurate, BayesSCDC is much\nfaster because the computation cost caused by neural networks does not scale linearly with the\nnumber of clusters K (50 in this case). In Fig. 
4a we show that BayesSCDC is more flexible and\nautomatically determines the number of mixture components during training.\nCIFAR-10 We also conduct experiments with real crowdsourced labels on more complex natural\nimages, i.e., CIFAR-10. Using the same crowdsourcing scheme, we collect 8640 noisy annotations\nfrom 32 web workers on a subset of 4000 randomly sampled images. We apply SCDC with/without\nannotations for 5 runs of random initializations. SCDC without annotations failed with NMI score\n0.0424 \u00b1 0.0119 and accuracy 14.23 \u00b1 0.69% among 5 runs. In contrast, the NMI score achieved by SCDC\nwith noisy annotations is 0.5549 \u00b1 0.0028 and the accuracy is 50.09 \u00b1 0.08%. The clustering results\non the test dataset are shown in Fig. 4b. We plot 10 test samples with the largest probability for each\ncluster. More experiment details and discussions can be found in the supplementary material.\n\n6 Conclusion\n\nIn this paper, we proposed a semi-crowdsourced clustering model based on deep generative models\nand its fully Bayesian version. We developed fast (natural-gradient) stochastic variational inference\nalgorithms for them. The resulting method can jointly model the crowdsourced labels, worker\nbehaviors, and the (un)annotated items. Experiments have demonstrated that the proposed method\noutperforms previous competing methods on standard benchmark datasets. Our work also provides\ngeneral guidelines on how to incorporate DGMs into statistical relational models, where the proposed\ninference algorithm can be applied in a broader context.\n\nAcknowledgement\n\nYucen Luo would like to thank Matthew Johnson for helpful discussions on the SVAE algorithm [13],\nand Yale Chang for sharing the code of the UCI benchmark experiments. We thank the anonymous\nreviewers for feedback that greatly improved the paper. This work was supported by the National\nKey Research and Development Program of China (No. 
2017YFA0700904), NSFC Projects (Nos.\n61620106010, 61621136008, 61332007), Beijing NSF Project (No. L172037), Tiangong Institute for\nIntelligent Computing, NVIDIA NVAIL Program, and the projects from Siemens, NEC and Intel.\n\nReferences\n\n[1] Mikhail Bilenko, Sugato Basu, and Raymond J Mooney. Integrating constraints and metric learning\nin semi-supervised clustering. In Proceedings of the twenty-first international conference on Machine\nlearning, page 11. ACM, 2004.\n\n[2] Varun Chandola, Arindam Banerjee, and Vipin Kumar. Anomaly detection: A survey. ACM computing\nsurveys (CSUR), 41(3):15, 2009.\n\n[3] Yale Chang, Junxiang Chen, Michael H Cho, Peter J Castaldi, Edwin K Silverman, and Jennifer G Dy.\nMultiple clustering views from multiple uncertain experts. In International Conference on Machine\nLearning, pages 674\u2013683, 2017.\n\n[4] Jason V Davis, Brian Kulis, Prateek Jain, Suvrit Sra, and Inderjit S Dhillon. Information-theoretic metric\nlearning. In Proceedings of the 24th international conference on Machine learning, pages 209\u2013216. ACM,\n2007.\n\n[5] Alexander Philip Dawid and Allan M Skene. Maximum likelihood estimation of observer error-rates using\nthe EM algorithm. Applied Statistics, pages 20\u201328, 1979.\n\n[6] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical\nimage database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on,\npages 248\u2013255. IEEE, 2009.\n\n[7] Dua Dheeru and Efi Karra Taniskidou. UCI machine learning repository, 2017.\n\n[8] Ryan G Gomes, Peter Welinder, Andreas Krause, and Pietro Perona. Crowdclustering. In Advances in\nneural information processing systems, pages 558\u2013566, 2011.\n\n[9] Prem K Gopalan, Sean Gerrish, Michael Freedman, David M Blei, and David M Mimno. Scalable inference\nof overlapping communities. 
In Advances in Neural Information Processing Systems, pages 2249\u20132257,\n2012.\n\n[10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition.\nIn Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770\u2013778, 2016.\n\n[11] Matthew D Hoffman, David M Blei, Chong Wang, and John Paisley. Stochastic variational inference. The\nJournal of Machine Learning Research, 14(1):1303\u20131347, 2013.\n\n[12] Jeff Howe. The rise of crowdsourcing. Wired magazine, 14(6):1\u20134, 2006.\n\n[13] Matthew Johnson, David K Duvenaud, Alex Wiltschko, Ryan P Adams, and Sandeep R Datta. Composing\ngraphical models with neural networks for structured representations and fast inference. In Advances in\nneural information processing systems, pages 2946\u20132954, 2016.\n\n[14] Diederik P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised\nlearning with deep generative models. In Advances in Neural Information Processing Systems, pages\n3581\u20133589, 2014.\n\n[15] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114,\n2013.\n\n[16] Wu Lin, Mohammad Emtiyaz Khan, and Nicolas Hubacher. Variational message passing with structured\ninference networks. In International Conference on Learning Representations, 2018.\n\n[17] Yucen Luo, Jun Zhu, Mengxi Li, Yong Ren, and Bo Zhang. Smooth neighbors on teacher graphs for\nsemi-supervised learning. In The IEEE Conference on Computer Vision and Pattern Recognition, 2018.\n\n[18] V. C. Raykar, S. Yu, L. H. Zhao, G. H. Valadez, C. Florin, L. Bogoni, and L. Moy. Learning from crowds.\nJMLR, 11:1297\u20131322, 2010.\n\n[19] Yong Ren, Jun Zhu, Jialian Li, and Yucen Luo. Conditional generative moment-matching networks. 
In\nAdvances in Neural Information Processing Systems, pages 2928\u20132936, 2016.\n\n[20] Danilo J Rezende, Shakir Mohamed, Ivo Danihelka, Karol Gregor, and Daan Wierstra. One-shot generalization\nin deep generative models. In Proceedings of the 33rd International Conference on International\nConference on Machine Learning-Volume 48, pages 1521\u20131529. JMLR.org, 2016.\n\n[21] Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. IEEE Transactions on pattern\nanalysis and machine intelligence, 22(8):888\u2013905, 2000.\n\n[22] Jiaxin Shi, Jianfei Chen, Jun Zhu, Shengyang Sun, Yucen Luo, Yihong Gu, and Yuhao Zhou. ZhuSuan: A\nlibrary for Bayesian deep learning. arXiv preprint arXiv:1709.05870, 2017.\n\n[23] Alexander Strehl and Joydeep Ghosh. Cluster ensembles\u2014a knowledge reuse framework for combining\nmultiple partitions. Journal of machine learning research, 3(Dec):583\u2013617, 2002.\n\n[24] Tian Tian and Jun Zhu. Max-margin majority voting for learning from crowds. In Advances in Neural\nInformation Processing Systems, pages 1621\u20131629, 2015.\n\n[25] Ramya Korlakai Vinayak and Babak Hassibi. Crowdsourced clustering: Querying edges vs triangles. In\nNeural Information Processing Systems, 2016.\n\n[26] Peter Welinder, Steve Branson, Pietro Perona, and Serge J Belongie. The multidimensional wisdom of\ncrowds. In Advances in neural information processing systems, pages 2424\u20132432, 2010.\n\n[27] John Winn and Christopher M Bishop. Variational message passing. Journal of Machine Learning\nResearch, 6(Apr):661\u2013694, 2005.\n\n[28] Christian Wiwie, Jan Baumbach, and Richard R\u00f6ttger. Comparing the performance of biomedical clustering\nmethods. Nature methods, 12(11):1033\u20131038, 2015.\n\n[29] Junyuan Xie, Ross Girshick, and Ali Farhadi. Unsupervised deep embedding for clustering analysis. 
In\nInternational conference on machine learning, pages 478\u2013487, 2016.\n\n[30] Eric P Xing, Michael I Jordan, Stuart J Russell, and Andrew Y Ng. Distance metric learning with\napplication to clustering with side-information. In Advances in neural information processing systems,\npages 521\u2013528, 2003.\n\n[31] Jinfeng Yi, Rong Jin, Shaili Jain, Tianbao Yang, and Anil K Jain. Semi-crowdsourced clustering: Generalizing\ncrowd labeling by robust distance metric learning. In Advances in neural information processing\nsystems, pages 1772\u20131780, 2012.\n\n[32] Dengyong Zhou, Qiang Liu, John Platt, and Christopher Meek. Aggregating ordinal labels from crowds by\nminimax conditional entropy. In Proceedings of the 31th International Conference on Machine Learning,\nICML 2014, Beijing, China, 21-26 June 2014, pages 262\u2013270, 2014.\n", "award": [], "sourceid": 1639, "authors": [{"given_name": "Yucen", "family_name": "Luo", "institution": "Tsinghua University"}, {"given_name": "TIAN", "family_name": "TIAN", "institution": "Tsinghua University"}, {"given_name": "Jiaxin", "family_name": "Shi", "institution": "Tsinghua University"}, {"given_name": "Jun", "family_name": "Zhu", "institution": "Tsinghua University"}, {"given_name": "Bo", "family_name": "Zhang", "institution": "Tsinghua University"}]}