{"title": "FastEx: Hash Clustering with Exponential Families", "book": "Advances in Neural Information Processing Systems", "page_first": 2798, "page_last": 2806, "abstract": "Clustering is a key component in any data analysis toolbox. Despite its importance, scalable algorithms often eschew rich statistical models in favor of simpler descriptions such as $k$-means clustering. In this paper we present a sampler, capable of estimating mixtures of exponential families. At its heart lies a novel proposal distribution using random projections to achieve high throughput in generating proposals, which is crucial for clustering models with large numbers of clusters.", "full_text": "FastEx: Hash Clustering with Exponential Families\n\nAmr Ahmed\u2217\nResearch at Google, Mountain View, CA\namra@google.com\n\nSujith Ravi\nResearch at Google, Mountain View, CA\nsravi@google.com\n\nShravan M. Narayanamurthy\nMicrosoft Research, Bangalore, India\nshravanmn@gmail.com\n\nAlexander J. Smola\nResearch at Google, Mountain View, CA\nalex@smola.org\n\nAbstract\n\nClustering is a key component in any data analysis toolbox. Despite its importance,\nscalable algorithms often eschew rich statistical models in favor of simpler\ndescriptions such as k-means clustering. In this paper we present a sampler, capable\nof estimating mixtures of exponential families. At its heart lies a novel\nproposal distribution using random projections to achieve high throughput in generating\nproposals, which is crucial for clustering models with large numbers of\nclusters.\n\n1 Introduction\n\nFast clustering algorithms are a staple of exploratory data analysis. See e.g. [1] and references.\nClustering is useful for partitioning data into sets of similar items. Such tools are vital e.g. in large\nscale document analysis, or to provide a modicum of adaptivity to content personalization for a large\nbase of users [2, 3]. 
Likewise, it allows advertisers to target specific slices of the user base of an\ninternet portal. While similarity and prototype based techniques [4, 5] satisfy a large range of these\nrequirements, they tend to be less useful for the purpose of obtaining a proper probabilistic representation\nof the data. The latter is useful for determining typical and unusual events, forecasting traffic,\ninformation retrieval, and when the results require integration into a larger probabilistic model.\nLarge scale problems, however, come with a rather surprising dilemma: as we increase the amount\nof data we can estimate the model parameters for fixed model complexity (typically the number\nof clusters) more accurately. As a consequence we have the opportunity (and need) to increase the\nnumber of parameters, e.g. clusters. The latter is often ignored but vital to the rationale for using\nmore data: after all, for fixed model complexity there are rapidly diminishing returns afforded by\nextra data once a given threshold is exceeded. See also [6, 7] for a frequentist perspective. Simply\nput, it is a waste of computational resources to design algorithms capable of processing big data to\nbuild a simple model (e.g. millions of documents for tens of clusters).\n\nContributions We address the following problems: we need to deal with a large number of instances,\ne.g. by means of multicore sampling, and we need to draw from a large number of clusters.\nWhen sampling from many clusters, the time to compute the object likelihood with respect to all\nclusters dominates the inference procedure. For instance, for 1000 clusters and documents of 1000\nwords a naive sampler needs to perform 10^6 floating point operations. We can expect that a single\nsample will cost in excess of 1 millisecond. 
Given 10M documents this amounts to approximately 3\nhours for a single Gibbs sampling iteration, which is clearly infeasible: sampling requires hundreds\nof passes. This problem is exacerbated for hierarchical models. To alleviate this issue we use binary\nhashing to compute a fast proposal distribution.\n\n\u2217This work was carried out while AA, SR, SMN and AJS were with Yahoo Research.\n\n2 Mixtures of Exponential Families\n\nOur models are mixtures of exponential families due to their flexibility. This is essentially an\nextended model of [8, 9]. For convenience we focus on mixtures of multinomials with correspondingly\nconjugate Dirichlet distributions. The derivations are entirely general and can be used e.g.\nfor mixtures of Gaussians or Poisson distributions. In the following we denote by X the domain of\nobservations X = {x1, . . . , xm} drawn from some distribution p. We want to estimate p.\n\n2.1 Exponential Families\n\nWe begin with a primer. In exponential families distributions over random variables are given by\n\n$$p(x; \\theta) = \\exp\\left(\\langle \\phi(x), \\theta \\rangle - g(\\theta)\\right). \\tag{1}$$\n\nHere \u03c6 : X \u2192 F is a map from x to the vector space of sufficient statistics (for simplicity assume\nthat F is a Hilbert space) and \u03b8 \u2208 F. Finally, g(\u03b8) ensures that p(x; \u03b8) is properly normalized via\n\n$$g(\\theta) := \\log \\int_{\\mathcal{X}} \\exp\\left(\\langle \\phi(x), \\theta \\rangle\\right) d\\rho(x). \\tag{2}$$\n\nHere \u03c1 is the measure associated with X (e.g. the Lebesgue measure L2 or a weighted counting\nmeasure for the Poisson distribution). 
It is well known [10] that the mean parameter associated with\n(1) and the maximum likelihood estimate given X are connected via \u00b5[\u03b8] = \u00b5[X] where\n\n$$\\mu[\\theta] := \\mathbf{E}_{x \\sim p(x;\\theta)}[\\phi(x)] = \\partial_\\theta g(\\theta) \\text{ and } \\mu[X] := \\frac{1}{m} \\sum_{i=1}^{m} \\phi(x_i). \\tag{3}$$\n\nThe mean must match the empirical average for it to be a maximum likelihood estimate.\n\nExample 1 (Multinomial) Assume that \u03c6(x) = e_x \u2208 R^d and X = {1, . . . , d}, i.e. we have a set\nof d different outcomes and e_x denotes the canonical vector associated with x. Empirical averages\nand probability estimates are directly connected via p(x; \u03b8) = nx/m = exp(\u03b8x). Here nx denotes the\nnumber of times we observe x. This yields \u03b8x = log(nx/m) and g(\u03b8) = 0.\n\n2.2 Conjugate Priors\n\nIn general, high-dimensional maximum likelihood estimation is statistically infeasible and we require\na prior on \u03b8 to obtain reliable estimates. We could impose a norm prior on \u03b8, leading to Laplace\nor Gaussian priors. Alternatively one may resort to conjugate priors. They have the property that\nthe posterior distribution p(\u03b8|X) over \u03b8 remains in the same family as p(\u03b8) via\n\n$$p(\\theta | m_0, m_0\\mu_0) = \\exp\\left(\\langle m_0\\mu_0, \\theta \\rangle - m_0 g(\\theta) - h(m_0, m_0\\mu_0)\\right). \\tag{4}$$\n\nHere the conjugate prior itself is a member of the exponential family with sufficient statistic \u03c6(\u03b8) =\n(\u03b8, \u2212g(\u03b8)) and with natural parameters (m0, m0\u00b50). Commonly m0 is referred to as the concentration\nparameter, which acts as an effective sample size, and \u00b50 is the mean parameter describing where on\nthe marginal polytope we expect the distribution to be. Note that \u00b50 \u2208 F. It corresponds to the mean\nof a putative distribution over observations (in a Dirichlet process this is the base measure and m0 is\nthe concentration parameter). 
Finally, h(m0, m0\u00b50) is a log-partition function in the parameters of\nthe conjugate prior. For instance, for the discrete distribution we have the Dirichlet, for the Gaussian\nthe Gauss-Wishart, and for the Poisson distribution the Gamma. Normalization in (4) implies\n\n$$p(\\theta|X) \\propto p(X|\\theta)\\, p(\\theta|m_0, m_0\\mu_0) \\implies p(\\theta|X) = p(\\theta | m_0 + m, m_0\\mu_0 + m\\mu[X]) \\tag{5}$$\n\nWe simply add the virtual observations m0\u00b50 described by the conjugate prior to the actual observations\nX and compute the maximum likelihood estimate with respect to the augmented dataset.\n\nExample 2 (Multinomial) We simply update the empirical observation counts. This yields the\nsmoothed estimates for event probabilities in x:\n\n$$p(x; \\theta) = \\frac{n_x + m_0 [\\mu_0]_x}{m + m_0} \\text{ and equivalently } \\theta_x = \\log \\frac{n_x + m_0 [\\mu_0]_x}{m + m_0}. \\tag{6}$$\n\n2.3 Mixture Models\n\nThe final piece is to describe the prior over mixture components. Our tools are entirely general\nand could take advantage of Bayesian nonparametrics, such as the Dirichlet process or the Pitman-Yor\nprocess. For the sake of brevity and to ensure computational tractability (we need to limit the\ntime it takes to sample from the cluster distribution for a given instance) we limit ourselves to a\nDirichlet-Multinomial model with k components:\n\n\u2022 Draw discrete mixture p(y|\u03b8) with y \u2208 {1, . . . , k} from Dirichlet with (m_0^{cluster}, \u00b5_0^{cluster}).\n\u2022 For each component y \u2208 {1, . . . , k} draw exponential families distribution p(x|\u03b8y) from conjugate with\nparameters (m_0^{component}, \u00b5_0^{component}).\n\u2022 For each i first draw component yi \u223c p(y|\u03b8), then draw observation xi \u223c p(x|\u03b8yi).\n\nNote that we have two exponential families components here: a smoothed multinomial to capture\ncluster membership, i.e. y \u223c p(y|\u03b8), and one pertaining to the cluster distribution. 
Both parts could\nbe joined into a single exponential family model with y being the latent variable, a property that we\nwill exploit only for the purpose of fast sampling.\nThe venerable EM algorithm [8] is effective for a small number of clusters. For large numbers,\nhowever, Gibbs sampling of the collapsed likelihood is computationally more advantageous since it\nonly requires updates of the sufficient statistics of two clusters per sample, whereas EM necessitates\nan update of all clusters. Collapsed Gibbs sampling works as follows:\n\n1. For a given xi draw yi \u223c p(yi|X, Y^{-i}) \u221d p(yi|Y ) \u00b7 p(xi|yi, X^{-i}, Y^{-i}).\n2. Update the sufficient statistics for the changed clusters.\n\nFor large k, step 1, particularly computing p(xi|yi, X^{-i}, Y^{-i}), dominates the inference procedure.\nWe now show how this step can be accelerated significantly using a good proposal distribution,\nparallel sampling, and a Taylor expansion for general exponential families.\n\n3 Acceleration\n\n3.1 Taylor Approximation for Collapsed Inference\n\nLet us briefly review the key equations involved in collapsed inference. Conjugate priors allow us to\nintegrate out the natural parameters \u03b8 and accelerate mixing in Gibbs samplers [11]. We can obtain\na closed form expression for the data likelihood:\n\n$$p(X|m_0, m_0\\mu_0) = \\int p(X|\\theta)\\, p(\\theta|m_0, m_0\\mu_0)\\, d\\theta = \\exp\\left(h(m_0 + m, m_0\\mu_0 + m\\mu[X]) - h(m_0, m_0\\mu_0)\\right). \\tag{7}$$\n\nBy Bayes rule this implies that\n\n$$p(x|X, m_0, \\mu_0) \\propto p(X \\cup \\{x\\}|m_0, m_0\\mu_0) \\propto \\exp\\left(h(m_0 + m + 1, m_0\\mu_0 + m\\mu[X] + \\phi(x))\\right) \\tag{8}$$\n\nUnfortunately the normalization h is often nontrivial to compute or even intractable. The exception\nis the multinomial, where the Laplace smoother amounts to the correct posterior x|X, i.e.\n\n$$p(x|X, \\mu_0, m_0) = \\frac{n_x + m_0 [\\mu_0]_x}{m + m_0}. \\tag{9}$$\n\nIn general, unfortunately, (8) will not have quite so simple a form. 
Strictly speaking we would need\nto compute h and perform the update directly. This can be prohibitively costly or even impossible\ndepending on the choice of sufficient statistics. While not necessary for our running example we\nstate the reasoning below to indicate that the problem can be overcome quite easily.\nWe exploit the properties of the log-partition function h of the conjugate prior for an approximation:\n\n$$\\partial_{(m_0, m_0\\mu_0)} h(m_0, m_0\\mu_0) = \\mathbf{E}_{\\theta \\sim p(\\theta|m_0, m_0\\mu_0)}\\left[-g(\\theta), \\theta\\right] =: (-\\gamma^*, \\theta^*)$$\n\n$$\\text{hence } h(m_0 + 1, m_0\\mu_0 + \\phi(x)) \\approx h(m_0, m_0\\mu_0) + \\langle \\theta^*, \\phi(x) \\rangle - \\gamma^*. \\tag{10}$$\n\nHere \u03b3\u2217 is the expected value of the log partition function. This quantity is often hard to compute\nand fortunately unnecessary for inference since \u03b8\u2217 immediately implies a suitable normalization.\nApplying the Taylor expansion in h to (7) yields an approximation of x|X as\n\n$$p(x|X, m_0, m_0\\mu_0) \\approx \\exp\\left(\\langle \\phi(x), \\theta^* \\rangle - g(\\theta^*)\\right) \\tag{11}$$\n\nHere the normalization g(\u03b8\u2217) is an immediate consequence of the fact that this is a member of the\nexponential family. The key advantage of (11) is that nowhere do we need to compute h directly\n(the latter may not be available in closed form). We only need to estimate the parameter \u03b8\u2217.\n\nLemma 1 The expected parameter \u03b8\u2217 = E_{\u03b8\u223cp(\u03b8|X)}[\u03b8] induces at most O(m^{-1}) sampler error.\n\nPROOF. The contribution of a single instance to the sufficient statistics is O(m^{-1}). Since h is\nC^\u221e, the residual of the Taylor expansion is bounded by O(m^{-1}).\n\nHence, (11) explains why updates obtained in collapsed inference often resemble (or are identical\nto) a maximum-a-posteriori estimate obtained by conjugate priors, such as in Dirichlet-multinomial\nsmoothing. 
The computational convenience afforded by (11) is well justified statistically.\n\n3.2 Locality Sensitive Importance Sampling\n\nThe next step is to accelerate the inner product \u27e8\u03c6(x), \u03b8\u2217_y\u27e9 in (11) since this expression is evaluated\nk times at each Gibbs sampler step. For large k this is the dominant term. We overcome this\nproblem by using binary hashing [12]. This provides a good approximation and therefore a proposal\ndistribution that can be used in a Metropolis-Hastings scheme without an excessive rejection rate.\nTo provide some motivation consider metric-based clustering algorithms such as k-means. They do\nnot suffer greatly from dealing with large numbers of clusters: after all, we only need to find the\nclosest prototype. Finding the closest point within a set in sublinear time is a well studied problem\n[13, 14, 15, 16]. In a nutshell it involves transforming the set of cluster centers into a data structure\nthat is only dependent on the inherent dimensionality of the data rather than the number of objects\nor the dimensionality of the actual data vector.\nThe problem with sampling from the collapsed distribution is that for a proper sampler we need to\nconsider all cluster probabilities including those related to clusters which are highly implausible and\nunlikely to be chosen for a given instance. That is, most of the time we discard the very computations\nthat made sampling so expensive. This is extremely wasteful. Instead, we design a sampler which\ntypically will only explore the clusters which are sufficiently close to the \u201cbest\u201d matching cluster by\nmeans of a proposal distribution. 
[17, 12] effectively introduce binary hash functions:\n\nTheorem 2 For u, v \u2208 R^n and vectors w drawn from a spherically symmetric distribution on R^n\nthe following relation between signs of inner products and the angle \u2220(u, v) between vectors holds:\n\n$$\\angle(u, v) = \\pi \\Pr\\left\\{\\mathrm{sgn}[\\langle u, w \\rangle] \\neq \\mathrm{sgn}[\\langle v, w \\rangle]\\right\\} \\tag{12}$$\n\nThis follows from a simple geometric observation, namely that we will have opposite signs only when w falls into the angle\nbetween the unit vectors in the directions of u and v. Any distribution\nof w orthogonal to the plane containing u, v is immaterial.\nSince exponential families rely on inner products to determine the log-likelihood of how well the\ndata fits, we can use hashing to accelerate the expensive part considerably, namely comparing data\nwith clusters. More specifically, \u27e8u, v\u27e9 = \u2016u\u2016 \u00b7 \u2016v\u2016 \u00b7 cos \u2220(u, v) allows us to store the signature of\na vector in terms of its signs and its norm to estimate the inner product efficiently.\n\nDefinition 3 We denote by h^l(v) \u2208 {0, 1}^l a binary hash of v and by z^l(u, v) an estimate of the\nprobability of differing signs, obtained as follows\n\n$$\\left[h^l(v)\\right]_i := \\mathrm{sgn}[\\langle v, w_i \\rangle] \\text{ where } w_i \\sim U_m \\text{ fixed and } z^l(u, v) := \\frac{1}{l} \\|h(u) - h(v)\\|_1. \\tag{13}$$\n\nThat is, z^l(u, v) measures the fraction of bits that differ between the hash vectors h(u) and h(v) associated\nwith u, v. 
In this case we may estimate the unnormalized log-likelihood of an instance being\nassigned to a cluster via\n\n$$s^l(x, y) = \\|\\theta_y\\| \\|\\phi(x)\\| \\cos\\left(\\pi z^l(\\phi(x), \\theta_y)\\right) - g(\\theta_y) - \\log n_y \\tag{14}$$\n\nWe omitted the normalization log n of the cluster probability since it is identical for all components.\nThe above can be computed efficiently for any combination of x and y since we can precompute\n(and store) the values for \u2016\u03b8y\u2016, \u2016\u03c6(x)\u2016, g(\u03b8y), log ny, and h(\u03c6(xi)) for all observations xi.\nThe binary representation is significant since on modern CPUs computing the Hamming distance\nbetween h(u) and h(v) via z^l(u, v) can be achieved in a fraction of a single clock cycle by means of\na vectorized instruction set. This is supported by current generation ARM and Intel CPU cores and\nby AMD and Nvidia GPUs (for instance Intel\u2019s SandyBridge series of processors can process up to\n256 bits in one clock cycle per core) and easily accessible via compiler optimization.\n\n3.3 Error Guarantees\n\nNote, though, that s^l(x, y) is not accurate, since we only use an estimate of the inner product. Hence\nwe need to account for sampling error. The following probabilistic guarantee ensures that we\ncan turn s^l(x, y) into an upper bound of the likelihood.\n\nTheorem 4 Let k \u2208 N be the number of mixture components and let l be the number of bits used for hashing. Then\nthe unnormalized cluster log-likelihood is bounded with probability at least 1 \u2212 \u03b4 by\n\n$$\\bar{s}^l(x, y) = \\|\\theta_y\\| \\|\\phi(x)\\| \\cos\\left[\\pi \\max\\left(0, z^l(\\phi(x), \\theta_y) - \\sqrt{(\\log k/\\delta)/2l}\\right)\\right] - g(\\theta_y) - \\log n_y \\tag{15}$$\n\nPROOF. By Theorem 2 we know that in expectation the inner product can be computed via the\nprobability of a sign mismatch. 
Since we only take a finite sample average we effectively partition\nthis into l equivalence classes. For convenience denote by z^\u221e(\u03c6(x), \u03b8y) the expected value of\nz^l(\u03c6(x), \u03b8y) over all hash functions. By Hoeffding\u2019s theorem we know that\n\n$$\\Pr\\left\\{z^\\infty(\\phi(x), \\theta_y) \u003c z^l(\\phi(x), \\theta_y) - \\epsilon\\right\\} \\leq e^{-2l\\epsilon^2}. \\tag{16}$$\n\nSolving for \u03b5 yields \u03b5 \u2264 \u221a((\u2212 log \u03b4)/2l). Since we know that z^\u221e(\u03c6(x), \u03b8y) \u2265 0 we can bound it for\nall k clusters with failure probability at most \u03b4 by taking the union bound over all k events, each with failure probability \u03b4/k.\n\nRemark 5 Using 128 hash bits and with a failure probability of at most 10^{-4} for k = 10^4 clusters\nthe correction applied to z^l(\u03c6(x), \u03b8y) is less than 0.38.\n\nNote that in practice we can reduce this correction factor significantly for two reasons: firstly, for\nsmall probabilities the basic Chernoff bound is considerably loose and we would be better advised\nto take the KL-divergence terms in the Chernoff bound directly into account, since the probability\nof deviation is bounded in terms of e^{\u2212mD(p\u2016p\u2212\u03b5)}. Secondly, we use hashing to generate a proposal\ndistribution: once we select a particular cluster we verify the estimate using the true likelihood.\n\n3.4 Metropolis Hastings\n\nAs an alternative to using the approximate upper bound directly, we employ it as a proposal distribution\nin a Metropolis Hastings (MH) framework. Denote by q the proposal distribution constructed from\nthe bound on the log-likelihood after normalization. For a given xi we first sample a new cluster\nassignment y_i^{new} \u223c q(.) and then accept the proposal using (15) with probability r where\n\n$$q(y) \\propto e^{\\bar{s}^l(x, y)} \\text{ and } r = \\frac{q(y_i^{\\text{old}})}{q(y_i^{\\text{new}})} \\cdot \\frac{p(y_i^{\\text{new}})\\, p(x_i | X^{-i}_{y_i^{\\text{new}}}, m_0, \\mu_0)}{p(y_i^{\\text{old}})\\, p(x_i | X^{-i}_{y_i^{\\text{old}}}, m_0, \\mu_0)} \\tag{17}$$\n\nHere p(xi|X, m0, \u00b50) is the true collapsed conditional likelihood of (8). The specific form depends\non h(.) as discussed in Section 3.1.\nNote that for a standard collapsed Gibbs sampler, p(x|X, \u00b50, m0) would be computed for all k candidate\nclusters. In our framework, however, we only need to compute it for 2 clusters, the proposed\nand old clusters: an O(k) time saving per sample, albeit with a nontrivial rejection probability.\n\nExample 3 For discrete distributions the conjugate is the Dirichlet distribution Dir(\u03b11:d) with\ncomponents given by \u03b1j = m0[\u00b50]j and the sum of the components is given by m0, where\nj \u2208 {1, \u00b7\u00b7\u00b7 , d}. In this case p(x|X, \u00b50, m0) reduces to the predictive distribution given in (9) if x is\na singleton, i.e. a single observation, and to the ratio of two log partition functions if x is non-singleton.1\nWe have the following predictive posterior\n\n$$p(x_i | X, y_i, \\mu_0, m_0) = \\frac{\\Gamma\\left(\\sum_{d=1}^{D} \\left[n_d^{y_i} + \\alpha_d\\right]\\right)}{\\Gamma\\left(\\sum_{d=1}^{D} \\left[x_d + n_d^{y_i} + \\alpha_d\\right]\\right)} \\prod_{d=1}^{D} \\frac{\\Gamma\\left(x_d + n_d^{y_i} + \\alpha_d\\right)}{\\Gamma\\left(n_d^{y_i} + \\alpha_d\\right)}. \\tag{18}$$\n\n3.5 Updating the Sufficient Statistics\n\nWe conclude our discussion of past proposals by discussing the updates involved in the sufficient\nstatistics. For the sake of brevity we focus on multinomial models. For Gaussians changes in\nsufficient statistics can be achieved using a low rank update of the second order matrix and its\ninverse. 
Similar operations apply to other exponential family distributions.\nWhenever we assign an instance x to a new cluster we need to update the sufficient statistics of the\nold cluster y and the new cluster y\u2032 via\n\n$$(m_y - 1)\\,\\mu[X|y] \\leftarrow m_y\\,\\mu[X|y] - \\phi(x), \\qquad m_y \\leftarrow m_y - 1$$\n\n$$(m_{y'} + 1)\\,\\mu[X|y'] \\leftarrow m_{y'}\\,\\mu[X|y'] + \\phi(x), \\qquad m_{y'} \\leftarrow m_{y'} + 1$$\n\nHere \u00b5[X|y] denotes the sufficient statistics for cluster y, i.e. it is the sufficient statistic obtained\nfrom X by considering only instances for which yi = y. Likewise my is the number of instances\nassociated with y. This is then used to update the natural parameter \u03b8y and the hash representation\nh(\u03b8y). For multinomials the mean natural parameters are just log counts. Thus these updates scale\nas O(W ) where W is the number of unique items (e.g. words in a document) in x (for Gaussians\nthe cost is O(d^2) where d is the dimensionality of the data).\nThe second step is to update the hash-representation. For l bits a naive update would perform the\ndot-product between the mean natural parameters and each random vector, which scales as O(Dl),\nwhere D is the vocabulary size. However we can cache the l dot product values (as floating point\nnumbers) for each cluster and update only those dot product values. Thus if x has W unique words,\nwe only incur an O(W l) penalty. Note that we never need to store the random vectors w since we\ncan compute them on the fly by means of hash functions rather than invoking a random number\ngenerator. We use murmurhash as a fast and high quality hash function.\n\n4 Experiments\n\n4.1 Data and Methods\n\nTo provide a realistic comparison on publicly available datasets we used documents from the\nWikipedia collection. More specifically, we extracted the articles and category attributes from a\ndump of its database. 
We generated multiple datasets for our experiments by first sampling a set of\nk categories and then by pooling all the articles from the chosen categories to form a document collection.\nThis way the data was comparable and the apparent and desired diversity in terms of cluster\nsizes was matched. We extracted both 100 and 1000 categories, yielding the following datasets:\n\nW100     100 clusters     292k articles    2.5M unique words vocabulary\nW1000    1000 clusters    710k articles    5.6M unique words vocabulary\n\nWe compare our fast sampler to a more conventional uncollapsed inference procedure. That is, we\ncompare the following two algorithms:\n\nBaseline Clustering using a Dirichlet (DP) Multinomial Mixture model. It uses an uncollapsed likelihood\nand alternates between sampling cluster assignments and drawing from the Dirichlet\ndistribution of the posterior.\n\nFastEx We provide runtime results for a single core (our approach supports multi-core architectures,\nas discussed in the summary). Unless stated otherwise we use l = 32 bit to represent\na document and cluster. This choice was made since it provides an efficient trade-off between\nmemory usage and cost to compute the hash signatures.\n\n1 x might represent an entire document, with [x]d denoting the count of word d in x. The predictive distribution\nfollows. This can be understood if we let \u03c6(x) = e_x in the singleton case, and let \u03c6(x) = ([x]1, \u00b7\u00b7\u00b7 , [x]D) in\nthe bag-of-words case. The natural parameters of the multinomial remain the same in both cases.\n\nFigure 1: (Left) Convergence of both a baseline implementation and of FastEx. (Right) The effect\nof the hash size on performance. Note that the baseline implementation only finishes a few iterations\nwhile our method has almost converged.\n\n4.2 Evaluation\n\nFor each clustering method, we report results in terms of two different measures: efficiency and\nclustering quality. 
The former is measured in terms of average run time. For the latter we use the\nfact that we have access to the Wikipedia category tag of each article which we treat as the gold\nstandard for evaluation purposes.\nWe report results in terms of Variation of Information (VI) [18]. The latter is a standard measure of\nthe distance between two clusterings. Suppose we have two clusterings (partitions of a document set\ninto several subsets) C1 and C2 then:\n\n$$\\mathrm{VI}(C_1, C_2) = H(C_1) + H(C_2) - 2I(C_1, C_2) \\tag{19}$$\n\nwhere H(.) is entropy and I(C1, C2) is mutual information between C1 and C2. A lower value for\nVI implies a closer match to the gold standard and better quality.\nWe first report our results on the W100 dataset. As shown in Figure 1 our method is an order of\nmagnitude faster than the baseline. Hence we use a log-scale for the time axis. As evident from this\nFigure, our method both converges much faster than the baseline and achieves the same clustering\nquality. Figure 1 also displays the effect of the number of hash bits l on solution quality. We vary\nl \u2208 {8, 16, \u00b7\u00b7\u00b7 , 128} and plot the VI curve over time. As evident from the figure, increasing\nthe number of bits caused our method to converge faster due to a tighter bound on the log-likelihood\nand thus a higher acceptance ratio. 
We also find that beyond 64 to 128 bits there is no\nnoticeable improvement, as predicted by our theoretical guarantees.\nTo see how the performance of our method changes as we increase the number of clusters, we show\nin Table 1 both the time required to compute the proposal distribution for a given document and the\ntime it takes to perform the full sampling per document which includes: proposal time + time to\ncompute acceptance ratio + time to update the clusters\u2019 sufficient statistics and hash representation.\nAs shown in this Table, thanks to the fast instruction set support for XOR and bitcount operations on\nmodern processors, the time does not increase significantly as we increase the number of clusters and\nthe overall time increases modestly as the number of clusters increases. Compare that to standard\nCollapsed Gibbs sampling in which the time scales linearly with the number of clusters.\n\n[Figure 1 panels: clustering quality (VI) versus time (in seconds), 100 clusters, 292k articles. Left: Baseline and FastEx (32 bit), logarithmic time axis. Right: Baseline and FastEx with 8, 16, 32, 64 and 128 bits.]\n\nClusters k   Time       l = 8     l = 16    l = 32    l = 64    l = 128\n100          Proposal   2.34      2.34      2.34      2.56      2.90\n100          Total      69.52     69.52     78.77     81.16     82.19\n1000         Proposal   18.80     18.80     18.80     21.42     29.12\n1000         Total      103.91    103.91    103.91    108.98    114.61\n\nTable 1: Average time in microseconds spent per document for hash sampling in terms of computing\nthe proposal distribution and total computation time. 
As can be seen, the total computation time for\nsampling 10x more clusters only increases slightly, mostly due to the increase in proposal time.\n\nDataset   FastEx Quality (VI)   Baseline Quality (VI)   Speedup\nW100      5.60                  5.04                    9.25\nW1000     14.00                 14.10                   37.37\n\nTable 2: Clustering quality (VI) and absolute speedup achieved by hash sampling over the baseline\n(DP) clustering for different Wikipedia datasets.\n\nTable 2 has details on the final quality and speed up achieved by our method over the baseline. Due\nto high quality proposals the time to draw from 1000 rather than 100 clusters increases only slightly.\n\n5 Discussion and Future Work\n\nWe presented a new efficient parallel algorithm to perform scalable clustering for exponential families.\nIt is general and uses techniques from hashing and information retrieval to circumvent the\nproblem of large numbers of clusters. Future work includes the application to a larger range of\nexponential family models and the extension of the fast retrieval scheme to hierarchical clustering.\n\nParallelization So far we only described a single processor sampling procedure. Unfortunately\nthis is not scalable given large amounts of data. To address the problem within single machines we\nuse a multicore sampler to parallelize inference. This requires a small amount of approximation:\nrather than sampling p(yi|xi, X^{-i}, Y^{-i}) in sequence we sample up to c latent variables yi in\nparallel in c processor cores. The latter approximation is negligible since c is tiny compared to the\ntotal number of documents we have. 
Our approach is an adaptation of the strategy described in [19].\nIn particular, we dissociate sampling and updating of the sufficient statistics to ensure efficient lock\nmanagement and to avoid resource contention.\n\n[Diagram of the sampling pipeline with the stages: disk, reader, sampler 1, sampler 2, ..., sampler n, updater, writer, disk.]\n\nA key advantage is that all samplers share the same sufficient statistics regardless of the number of\ncores used. By delegating write permissions to a separate updater thread the code is considerably\nsimplified. This allows us to be parsimonious in terms of memory use. A multi-machine setting is\nalso achievable by keeping the sets of sufficient statistics synchronized between computers. This is\npossible using the synchronization architecture of [20].\n\nSequential Estimation Our approach is compatible with sequential estimation methods and it is\npossible to use hash signatures for Sequential Monte Carlo estimation for clustering as in [21, 22].\nHowever it is highly nontrivial to parallelize particle filters over a network of workstations.\n\nStochastic Gradient Descent An alternative is to use stochastic gradient descent on a variational\napproximation, following the approach proposed by [23]. Again, sampling is the dominant cost for\ninference and it can be accelerated by binary hashing.\n\nReferences\n\n[1] C. D. Manning, P. Raghavan, and H. Sch\u00fctze. Introduction to Information Retrieval. Cambridge\nUniversity Press, 2008.\n\n[2] D. Agarwal and S. Merugu. Predictive discrete latent factor models for large scale dyadic data.\nConference on Knowledge Discovery and Data Mining, pages 26\u201335. ACM, 2007.\n\n[3] A. Das, M. Datar, A. Garg, and S. Rajaram. Google news personalization: scalable online\ncollaborative filtering. In Conference on World Wide Web, pages 271\u2013280. ACM, 2007.\n\n[4] D. Emanuel and A. Fiat. Correlation clustering \u2013 minimizing disagreements on arbitrary\nweighted graphs. 
In Algorithms \u2014 ESA 2003, 11th Annual European Symposium, volume 2832\nof Lecture Notes in Computer Science, pages 208\u2013220. Springer, 2003.\n\n[5] J. MacQueen. Some methods of classification and analysis of multivariate observations. In\nL. M. LeCam and J. Neyman, editors, Proc. 5th Berkeley Symposium on Math., Stat., and\nProb., page 281. U. California Press, Berkeley, CA, 1967.\n\n[6] S. Negahban, P. Ravikumar, M. J. Wainwright, and B. Yu. A unified framework for high-dimensional\nanalysis of M-estimators with decomposable regularizers. CoRR, abs/1010.2731,\n2010. Informal publication.\n\n[7] V. Vapnik and A. Chervonenkis. The necessary and sufficient conditions for consistency in the\nempirical risk minimization method. Pattern Recognition and Image Analysis, 1(3):283\u2013305,\n1991.\n\n[8] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum Likelihood from Incomplete Data\nvia the EM Algorithm. Journal of the Royal Statistical Society B, 39(1):1\u201322, 1977.\n\n[9] C. E. Rasmussen. The infinite Gaussian mixture model. In Advances in Neural Information\nProcessing Systems 12, pages 554\u2013560, 2000.\n\n[10] M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational\ninference. Foundations and Trends in Machine Learning, 1(1\u20132):1\u2013305, 2008.\n\n[11] T. L. Griffiths and M. Steyvers. Finding scientific topics. Proceedings of the National Academy\nof Sciences, 101:5228\u20135235, 2004.\n\n[12] M. Charikar. Similarity estimation techniques from rounding algorithms. In Proceedings of\nthe thirty-fourth annual ACM symposium on Theory of computing, pages 380\u2013388, 2002.\n\n[13] A. Beygelzimer, S. Kakade, and J. Langford. Cover trees for nearest neighbor. In International\nConference on Machine Learning, 2006.\n\n[14] A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In M. P.\nAtkinson, M. E. Orlowska, P. Valduriez, S. B. Zdonik, and M. 
L. Brodie, editors, Proceedings\nof the 25th VLDB Conference, pages 518\u2013529, Edinburgh, Scotland, 1999. Morgan Kaufmann.\n\n[15] Y. Shen, A. Ng, and M. Seeger. Fast Gaussian process regression using kd-trees. In Y. Weiss,\nB. Sch\u00f6lkopf, and J. Platt, editors, Advances in Neural Information Processing Systems 18,\npages 1227\u20131234, Cambridge, MA, 2005. MIT Press.\n\n[16] R. J. Bayardo, Y. Ma, and R. Srikant. Scaling up all pairs similarity search. In Proceedings of\nthe 16th international conference on World Wide Web, pages 131\u2013140. ACM, 2007.\n\n[17] M. X. Goemans and D. P. Williamson. Improved approximation algorithms for maximum cut\nand satisfiability problems using semidefinite programming. Journal of the ACM, 42(6), 1995.\n\n[18] M. Meila. Comparing clusterings by the variation of information. In COLT, 2003.\n\n[19] A. J. Smola and S. Narayanamurthy. An architecture for parallel topic models. In Very Large\nDatabases (VLDB), 2010.\n\n[20] A. Ahmed, M. Aly, J. Gonzalez, S. Narayanamurthy, and A. J. Smola. Scalable inference in\nlatent variable models. In Web Search and Data Mining (WSDM), 2012.\n\n[21] A. Ahmed, Q. Ho, C. H. Teo, J. Eisenstein, A. J. Smola, and E. P. Xing. Online inference for\nthe infinite cluster-topic model: Storylines from streaming text. In AISTATS, 2011.\n\n[22] A. Ahmed, Q. Ho, J. Eisenstein, E. P. Xing, A. J. Smola, and C. H. Teo. Unified analysis of\nstreaming news. In WWW, 2011.\n\n[23] D. Mimno, M. Hoffman, and D. Blei. Sparse stochastic inference for latent Dirichlet allocation.\nIn International Conference on Machine Learning, 2012.\n", "award": [], "sourceid": 1283, "authors": [{"given_name": "Amr", "family_name": "Ahmed", "institution": null}, {"given_name": "Sujith", "family_name": "Ravi", "institution": null}, {"given_name": "Alex", "family_name": "Smola", "institution": null}, {"given_name": "Shravan", "family_name": "Narayanamurthy", "institution": null}]}