Dynamic Clustering via Asymptotics of the Dependent Dirichlet Process Mixture

Advances in Neural Information Processing Systems, pp. 449–457.

Trevor Campbell (MIT, Cambridge, MA 02139, tdjc@mit.edu), Miao Liu (Duke University, Durham, NC 27708, miao.liu@duke.edu), Brian Kulis (Ohio State University, Columbus, OH 43210, kulis@cse.ohio-state.edu), Jonathan P. How (MIT, Cambridge, MA 02139, jhow@mit.edu), Lawrence Carin (Duke University, Durham, NC 27708, lcarin@duke.edu)

Abstract

This paper presents a novel algorithm, based upon the dependent Dirichlet process mixture model (DDPMM), for clustering batch-sequential data containing an unknown number of evolving clusters. The algorithm is derived via a low-variance asymptotic analysis of the Gibbs sampling algorithm for the DDPMM, and provides a hard clustering with convergence guarantees similar to those of the k-means algorithm.
Empirical results from a synthetic test with moving Gaussian clusters and a test with real ADS-B aircraft trajectory data demonstrate that the algorithm requires orders of magnitude less computational time than contemporary probabilistic and hard clustering algorithms, while providing higher accuracy on the examined datasets.

1 Introduction

The Dirichlet process mixture model (DPMM) is a powerful tool for clustering data that enables the inference of an unbounded number of mixture components, and has been widely studied in the machine learning and statistics communities [1–4]. Despite its flexibility, it assumes the observations are exchangeable, and therefore that the data points have no inherent ordering that influences their labeling. This assumption is invalid for modeling temporally/spatially evolving phenomena, in which the order of the data points plays a principal role in creating meaningful clusters. The dependent Dirichlet process (DDP), originally formulated by MacEachern [5], provides a prior over such evolving mixture models, and is a promising tool for incrementally monitoring the dynamic evolution of the cluster structure within a dataset. More recently, a construction of the DDP built upon completely random measures [6] led to the development of the dependent Dirichlet process mixture model (DDPMM) and a corresponding approximate posterior inference Gibbs sampling algorithm. This model generalizes the DPMM by including birth, death and transition processes for the clusters in the model.

The DDPMM is a Bayesian nonparametric (BNP) model, part of an ever-growing class of probabilistic models for which inference captures uncertainty in both the number of parameters and their values. While these models are powerful in their capability to capture complex structures in data without requiring explicit model selection, they suffer some practical shortcomings.
Inference techniques for BNPs typically fall into two classes: sampling methods (e.g., Gibbs sampling [2] or particle learning [4]) and optimization methods (e.g., variational inference [3] or stochastic variational inference [7]). Current methods based on sampling do not scale well with the size of the dataset [8]. Most optimization methods require analytic derivatives and the selection of an upper bound on the number of clusters a priori, where the computational complexity increases with that upper bound [3, 7]. State-of-the-art techniques in both classes are not ideal for use in contexts where performing inference quickly and reliably on large volumes of streaming data is crucial for timely decision-making, such as autonomous robotic systems [9–11]. On the other hand, many classical clustering methods [12–14] scale well with the size of the dataset and are easy to implement, and advances have recently been made to capture the flexibility of Bayesian nonparametrics in such approaches [15]. However, as of yet there is no classical algorithm that captures dynamic cluster structure with the same representational power as the DDP mixture model.

This paper discusses the Dynamic Means algorithm, a novel hard clustering algorithm for spatio-temporal data derived from the low-variance asymptotic limit of the Gibbs sampling algorithm for the dependent Dirichlet process Gaussian mixture model. This algorithm captures the scalability and ease of implementation of classical clustering methods, along with the representational power of the DDP prior, and is guaranteed to converge to a local minimum of a k-means-like cost function. The algorithm is significantly more computationally tractable than Gibbs sampling, particle learning, and variational inference for the DDP mixture model in practice, while providing equivalent or better clustering accuracy on the examples presented.
The performance and characteristics of the algorithm are demonstrated in a test on synthetic data, with a comparison to those of Gibbs sampling, particle learning and variational inference. Finally, the applicability of the algorithm to real data is presented through an example of clustering a spatio-temporal dataset of aircraft trajectories recorded across the United States.

2 Background

The Dirichlet process (DP) is a prior over mixture models, where the number of mixture components is not known a priori [16]. In general, we denote D ~ DP(μ), where α_μ ∈ R+ and μ : Ω → R+, ∫_Ω dμ = α_μ, are the concentration parameter and base measure of the DP, respectively. If D ~ DP, then D = {(θ_k, π_k)}_{k=0}^∞ ⊂ Ω × R+, where θ_k ∈ Ω and π_k ∈ R+ [17]. The reader is directed to [1] for a more thorough coverage of Dirichlet processes.

The dependent Dirichlet process (DDP) [5], an extension to the DP, is a prior over evolving mixture models. Given a Poisson process construction [6], the DDP essentially forms a Markov chain of DPs (D_1, D_2, ...), where the transitions are governed by a set of three stochastic operations: points θ_k may be added, removed, and may move during each step of the Markov chain. Thus, they become parameterized by time, denoted by θ_kt. In slightly more detail, if D_t is the DP at time step t, then the following procedure defines the generative model of D_t conditioned on D_{t−1} ~ DP(μ_{t−1}):

1. Subsampling: Define a function q : Ω → [0, 1]. Then for each point (θ, π) ∈ D_{t−1}, sample a Bernoulli distribution b_θ ~ Be(q(θ)). Set D'_t to be the collection of points (θ, π) such that b_θ = 1, and renormalize the weights. Then D'_t ~ DP(qμ_{t−1}), where (qμ)(A) = ∫_A q(θ)μ(dθ).

2. Transition: Define a distribution T : Ω × Ω → R+. For each point (θ, π) ∈ D'_t, sample θ' ~ T(θ'|θ), and set D''_t to be the collection of points (θ', π). Then D''_t ~ DP(Tqμ_{t−1}), where (Tμ)(A) = ∫_Ω T(A|θ)μ(dθ).

3. Superposition: Sample F ~ DP(ν), and sample (c_D, c_F) ~ Dir(Tqμ_{t−1}(Ω), ν(Ω)). Then set D_t to be the union of (θ, c_D π) for all (θ, π) ∈ D''_t and (θ, c_F π) for all (θ, π) ∈ F. Thus, D_t is a random convex combination of D''_t and F, where D_t ~ DP(Tqμ_{t−1} + ν).

If the DDP is used as a prior over a mixture model, these three operations allow new mixture components to arise over time, and old mixture components to exhibit dynamics and perhaps disappear over time. As this is covered thoroughly in [6], the mathematics of the underlying Poisson point process construction are not discussed in more depth in this work. However, an important result of using such a construction is the development of an explicit posterior for D_t given observations of the points θ_kt at timestep t. For each point k that was observed in D_τ for some τ : 1 ≤ τ ≤ t, define: n_kt ∈ N as the number of observations of point k in timestep t; c_kt ∈ N as the number of past observations of point k prior to timestep t, i.e. c_kt = Σ_{τ=1}^{t−1} n_kτ; q_kt ∈ (0, 1) as the subsampling weight on point k at timestep t; and Δt_k as the number of time steps that have elapsed since point k was last observed.
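The three stochastic operations above can be illustrated numerically on a finite-atom truncation of the random measure. The following is a minimal sketch under our own simplifying assumptions: a scalar parameter space, a constant subsampling probability, a pre-drawn finite innovation measure, and the total mass of Tqμ_{t−1} approximated by the number of surviving atoms; all function and variable names are ours, not from the paper.

```python
import random

def ddp_step(atoms, q, transition_std, innovation, nu_mass):
    """One DDP Markov-chain step on a finite-atom approximation of D_{t-1}.

    atoms: list of (theta, pi) pairs with weights pi summing to 1.
    q: constant subsampling probability q(theta) = q.
    transition_std: std of the Gaussian transition kernel T.
    innovation: list of (theta, pi) atoms standing in for F ~ DP(nu).
    nu_mass: total mass nu(Omega), used for the Dirichlet mixing weights.
    """
    # 1. Subsampling: keep each atom w.p. q(theta), then renormalize.
    kept = [(th, pi) for th, pi in atoms if random.random() < q]
    total = sum(pi for _, pi in kept)
    kept = [(th, pi / total) for th, pi in kept] if total > 0 else []
    # 2. Transition: move each surviving atom via T(.|theta).
    moved = [(random.gauss(th, transition_std), pi) for th, pi in kept]
    # 3. Superposition: random convex combination with the innovation atoms.
    # A two-parameter Dirichlet draw is a Beta draw; we use the number of
    # surviving atoms as a crude proxy for the mass of T q mu_{t-1}.
    c_d = random.betavariate(max(len(moved), 1e-3), nu_mass)
    # (If nothing survives subsampling, only innovation atoms remain.)
    return ([(th, c_d * pi) for th, pi in moved]
            + [(th, (1.0 - c_d) * pi) for th, pi in innovation])
```

Running this repeatedly traces out the Markov chain (D_1, D_2, ...) of the DDP, with atoms appearing (superposition), moving (transition), and dying (subsampling).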
Further, let ν_t be the measure for unobserved points at time step t. Then,

D_t | D_{t−1} ~ DP( ν_t + Σ_{k : n_kt = 0} q_kt c_kt T(·|θ_{k(t−Δt_k)}) + Σ_{k : n_kt > 0} (c_kt + n_kt) δ_{θ_kt} )        (1)

where c_kt = 0 for any point k that was first observed during timestep t. This posterior leads directly to the development of a Gibbs sampling algorithm for the DDP, whose low-variance asymptotics are discussed further below.

3 Asymptotic Analysis of the DDP Mixture

The dependent Dirichlet process Gaussian mixture model (DDP-GMM) serves as the foundation upon which the present work is built. The generative model of a DDP-GMM at time step t is

{θ_kt, π_kt}_{k=1}^∞ ~ DP(μ_t)
{z_it}_{i=1}^{N_t} ~ Categorical({π_kt}_{k=1}^∞)        (2)
{y_it}_{i=1}^{N_t} ~ N(θ_{z_it t}, σI)

where θ_kt is the mean of cluster k, π_kt is the categorical weight for class k, y_it is a d-dimensional observation vector, z_it is a cluster label for observation i, and μ_t is the base measure from equation (1). Throughout the rest of this paper, the subscript kt refers to quantities related to cluster k at time step t, and subscript it refers to quantities related to observation i at time step t.

The Gibbs sampling algorithm for the DDP-GMM iterates between sampling labels z_it for datapoints y_it given the set of parameters {θ_kt}, and sampling parameters θ_kt given each group of data {y_it : z_it = k}.
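Sampling from the generative model (2) at a single time step is straightforward once a finite truncation of the mixture is fixed. The sketch below is ours, not the paper's: it works in one dimension for simplicity, and a fixed weight vector `pi` stands in for a draw from DP(μ_t).

```python
import random

def sample_ddp_gmm(theta, pi, n, sigma):
    """Draw n observations from the DDP-GMM generative model (2) at one
    time step. `theta` holds cluster means and `pi` the categorical
    weights of a finite truncation of the mixture; observations are 1-D
    (the model uses an isotropic sigma*I covariance in d dimensions,
    where sigma is a variance)."""
    labels, obs = [], []
    for _ in range(n):
        z = random.choices(range(len(pi)), weights=pi)[0]  # z_it ~ Categorical(pi)
        y = random.gauss(theta[z], sigma ** 0.5)           # y_it ~ N(theta_z, sigma)
        labels.append(z)
        obs.append(y)
    return labels, obs
```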
Assuming the transition model T is Gaussian, and the subsampling function q is constant, the functions and distributions used in the Gibbs sampling algorithm are: the prior over cluster parameters, θ ~ N(φ, ρI); the likelihood of an observation given its cluster parameter, y_it ~ N(θ_kt, σI); the distribution over the transitioned cluster parameter given its last known location after Δt_k time steps, θ_kt ~ N(θ_{k(t−Δt_k)}, ξΔt_k I); and the subsampling function q(θ) = q ∈ (0, 1). Given these functions and distributions, the low-variance asymptotic limits (i.e. σ → 0) of these two steps are discussed in the following sections.

3.1 Setting Labels Given Parameters

In the label sampling step, a datapoint y_it can either create a new cluster, join a current cluster, or revive an old, transitioned cluster. Using the distributions defined previously, the label assignment probabilities are

p(z_it = k | ...) ∝
  α_t (2π(σ + ρ))^{−d/2} exp(−||y_it − φ||² / (2(σ + ρ)))                                  k = K + 1
  (c_kt + n_kt)(2πσ)^{−d/2} exp(−||y_it − θ_kt||² / (2σ))                                  n_kt > 0        (3)
  q_kt c_kt (2π(σ + ξΔt_k))^{−d/2} exp(−||y_it − θ_{k(t−Δt_k)}||² / (2(σ + ξΔt_k)))        n_kt = 0

where q_kt = q^{Δt_k} due to the fact that q(θ) is constant over Ω, and α_t = α_ν (1 − q^t)/(1 − q) where α_ν is the concentration parameter for the innovation process, F_t. The low-variance asymptotic limit of this label assignment step yields meaningful assignments as long as α_ν, ξ, and q vary appropriately with σ; thus, setting α_ν, ξ, and q as follows (where λ, τ and Q are positive constants):

α_ν = (1 + ρ/σ)^{d/2} exp(−λ/(2σ)),   ξ = τσ,   q = exp(−Q/(2σ))        (4)

yields the following assignments in the limit as σ → 0:

z_it = argmin_k J_k,   J_k =
  ||y_it − θ_kt||²                                     if θ_k instantiated
  QΔt_k + ||y_it − θ_{k(t−Δt_k)}||² / (τΔt_k + 1)      if θ_k old, uninstantiated        (5)
  λ                                                    if θ_k new

In this assignment step, QΔt_k acts as a cost penalty for reviving old clusters that increases with the time since the cluster was last seen, τΔt_k acts as a cost reduction to account for the possible motion of clusters since they were last instantiated, and λ acts as a cost penalty for introducing a new cluster.

3.2 Setting Parameters given Labels

In the parameter sampling step, the parameters are sampled using the distribution

p(θ_kt | {y_it : z_it = k}) ∝ p({y_it : z_it = k} | θ_kt) p(θ_kt)        (6)

There are two cases to consider when setting a parameter θ_kt. Either Δt_k = 0 and the cluster is new in the current time step, or Δt_k > 0 and the cluster was previously created, disappeared for some amount of time, and then was revived in the current time step.

New Cluster: Suppose cluster k is being newly created. In this case, θ_kt ~ N(φ, ρ). Using the fact that a normal prior is conjugate to a normal likelihood, the closed-form posterior for θ_kt is

θ_kt | {y_it : z_it = k} ~ N(θ_post, σ_post)
θ_post = σ_post ( φ/ρ + Σ_{i=1}^{n_kt} y_it / σ ),   σ_post = (1/ρ + n_kt/σ)^{−1}        (7)

Then letting σ → 0,

θ_kt = (Σ_{i=1}^{n_kt} y_it) / n_kt  =: m_kt        (8)

where m_kt is the mean of the observations in the current timestep.

Revived Cluster: Suppose there are Δt_k time steps where cluster k was not observed, but there are now n_kt data points with mean m_kt assigned to it in this time step. In this case,

p(θ_kt) = ∫_Ω T(θ_kt | θ) p(θ) dθ,   θ ~ N(θ', σ').        (9)

Again using conjugacy of normal likelihoods and priors,

θ_kt | {y_it : z_it = k} ~ N(θ_post, σ_post)
θ_post = σ_post ( θ'/(ξΔt_k + σ') + Σ_{i=1}^{n_kt} y_it / σ ),   σ_post = (1/(ξΔt_k + σ') + n_kt/σ)^{−1}        (10)

Similarly to the label assignment step, let ξ = τσ. Then as long as σ' = σ/w, w > 0 (which holds if equation (10) is used to recursively keep track of the parameter posterior), taking the asymptotic limit of this as σ → 0 yields:

θ_kt = ( θ'(w^{−1} + Δt_k τ)^{−1} + n_kt m_kt ) / ( (w^{−1} + Δt_k τ)^{−1} + n_kt )        (11)

that is to say, the revived θ_kt is a weighted average of estimates using current timestep data and previous timestep data. τ controls how much the current data is favored: as τ increases, the weight on current data increases, which is explained by the fact that our uncertainty in where the old θ' transitioned to increases with τ.
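The asymptotic label update (5) can be written out directly as a cost minimization. The following is a minimal 1-D sketch of that rule; the function name, the dictionary layout, and the `"new"` sentinel are our own conventions, not the paper's.

```python
def assign_label(y, active, old, lam, Q, tau):
    """Asymptotic label update (5) for one observation y (1-D here).

    active: dict k -> theta_kt for instantiated clusters.
    old:    dict k -> (theta_old, dt) for old, uninstantiated clusters,
            where dt is the number of steps since k was last observed.
    Returns (label, cost); label "new" means create a fresh cluster.
    """
    best_k, best_J = "new", lam                        # new-cluster penalty
    for k, th in active.items():
        J = (y - th) ** 2                              # instantiated cost
        if J < best_J:
            best_k, best_J = k, J
    for k, (th, dt) in old.items():
        J = Q * dt + (y - th) ** 2 / (tau * dt + 1.0)  # revival cost
        if J < best_J:
            best_k, best_J = k, J
    return best_k, best_J
```

Note how the three branches mirror (5): joining an active cluster costs the squared distance, reviving an old one adds QΔt_k but discounts the squared distance by (τΔt_k + 1), and a new cluster costs a flat λ.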
It is also noted that if τ = 0, this reduces to a simple weighted average using the amount of data collected as weights.

Combined Update: Combining the updates for new cluster parameters and old transitioned cluster parameters yields a recursive update scheme:

θ_k0 = m_k0,   w_k0 = n_k0
γ_kt = ( (w_{k(t−Δt_k)})^{−1} + Δt_k τ )^{−1}        (12)
θ_kt = ( θ_{k(t−Δt_k)} γ_kt + n_kt m_kt ) / ( γ_kt + n_kt ),   w_kt = γ_kt + n_kt

where time step 0 here corresponds to when the cluster is first created. An interesting interpretation of this update is that it behaves like a standard Kalman filter, in which w_kt^{−1} serves as the current estimate variance, τ serves as the process noise variance, and n_kt serves as the inverse of the measurement variance.

4 The Dynamic Means Algorithm

Algorithm 1 Dynamic Means
  Input: {Y_t}_{t=1}^{t_f}, Q, λ, τ
  C_1 ← ∅
  for t = 1 → t_f do
    (K_t, Z_t, L_t) ← CLUSTER(Y_t, C_t, Q, λ, τ)
    C_{t+1} ← UPDATEC(Z_t, K_t, C_t)
  end for
  return {K_t, Z_t, L_t}_{t=1}^{t_f}

Algorithm 2 CLUSTER
  Input: Y_t, C_t, Q, λ, τ
  K_t ← ∅, Z_t ← ∅, L_0 ← ∞
  for n = 1 → ∞ do
    (Z_t, K_t) ← ASSIGNLABELS(Y_t, Z_t, K_t, C_t)
    (K_t, L_n) ← ASSIGNPARAMS(Y_t, Z_t, C_t)
    if L_n = L_{n−1} then
      return K_t, Z_t, L_n
    end if
  end for

In this section, some further notation is required for brevity:

Y_t = {y_it}_{i=1}^{N_t},   Z_t = {z_it}_{i=1}^{N_t}
K_t = {(θ_kt, w_kt) : n_kt > 0},   C_t = {(Δt_k, θ_{k(t−Δt_k)}, w_{k(t−Δt_k)})}        (13)

where Y_t and Z_t are the sets of observations and labels at time step t, K_t is the set of currently active clusters (some are new with Δt_k = 0, and some are revived with Δt_k > 0), and C_t is the set of old cluster information.

4.1 Algorithm Description

As shown in the previous section, the low-variance asymptotic limit of the DDP Gibbs
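The recursive parameter update (12) is small enough to state as a single function. The sketch below is ours (1-D, with a `None` sentinel marking a brand-new cluster); it returns the updated (θ_kt, w_kt) pair that would be carried forward to later time steps.

```python
def update_cluster(n, m, theta_old=None, w_old=None, dt=0, tau=0.0):
    """Recursive parameter update (12), 1-D sketch.

    n, m: count and mean of observations assigned in the current step.
    theta_old, w_old: previous estimate and weight (None for a new cluster).
    dt: steps since the cluster was last observed; tau: motion parameter.
    Returns (theta_kt, w_kt)."""
    if theta_old is None:                       # time step 0: theta_k0 = m_k0, w_k0 = n_k0
        return m, float(n)
    gamma = 1.0 / (1.0 / w_old + dt * tau)      # gamma_kt in (12)
    theta = (theta_old * gamma + n * m) / (gamma + n)
    return theta, gamma + n
```

With τ = 0, γ_kt reduces to w_old and the update is the plain count-weighted average noted in the text; with τ > 0, old estimates are progressively discounted the longer the cluster goes unobserved, matching the Kalman-filter reading of (12).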
sampling algorithm is a deterministic observation label update (5) followed by a deterministic, weighted least-squares parameter update (12). Inspired by the original k-means algorithm, applying these two updates iteratively yields an algorithm which clusters a set of observations at a single time step given cluster means and weights from past time steps (Algorithm 2). Applying Algorithm 2 to a sequence of batches of data yields a clustering procedure that is able to track a set of dynamically evolving clusters (Algorithm 1), and allows new clusters to emerge and old clusters to be forgotten. While this is the primary application of Algorithm 1, the sequence of batches need not be a temporal sequence. For example, Algorithm 1 may be used as an any-time clustering algorithm for large datasets, where the sequence of batches is generated by selecting random subsets of the full dataset.

The ASSIGNPARAMS function is exactly the update from equation (12) applied to each k ∈ K_t. Similarly, the ASSIGNLABELS function applies the update from equation (5) to each observation; however, in the case that a new cluster is created or an old one is revived by an observation, ASSIGNLABELS also creates a parameter for that new cluster based on the parameter update equation (12) with that single observation. Note that the performance of the algorithm depends on the order in which ASSIGNLABELS assigns labels. Multiple random restarts of the algorithm with different assignment orders may be used to mitigate this dependence.
The UPDATEC function is run after clustering observations from each time step, and constructs C_{t+1} by setting Δt_k = 1 for any new or revived cluster, and by incrementing Δt_k for any old cluster that was not revived:

C_{t+1} = {(Δt_k + 1, θ_{k(t−Δt_k)}, w_{k(t−Δt_k)}) : k ∈ C_t, k ∉ K_t} ∪ {(1, θ_kt, w_kt) : k ∈ K_t}        (14)

An important question is whether this algorithm is guaranteed to converge while clustering data in each time step. Indeed, it is; Theorem 1 shows that a particular cost function L_t monotonically decreases under the label and parameter updates (5) and (12) at each time step. Since L_t ≥ 0, and it is monotonically decreased by Algorithm 2, the algorithm converges. Note that Dynamic Means is only guaranteed to converge to a local optimum, similarly to the k-means [12] and DP-Means [15] algorithms.

Theorem 1. Each iteration in Algorithm 2 monotonically decreases the cost function L_t, where

L_t = Σ_{k ∈ K_t} ( λ[Δt_k = 0]  +  QΔt_k  +  γ_kt ||θ_kt − θ_{k(t−Δt_k)}||₂²  +  Σ_{y_it ∈ Y_t : z_it = k} ||y_it − θ_kt||₂² )        (15)

with the four groups of terms corresponding to a new-cluster cost, a revival cost, a weighted-prior cost, and a sum-of-squares cost, respectively.

The cost function is comprised of a number of components for each currently active cluster k ∈ K_t: a penalty for new clusters based on λ, a penalty for old clusters based on Q and Δt_k, and finally a prior-weighted sum of squared distance cost for all the observations in cluster k. It is noted that for new clusters, θ_kt = θ_{k(t−Δt_k)} since Δt_k = 0, so the least squares cost is unweighted. The
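The cost function (15) is what Algorithm 2 monitors for convergence, so it is worth making concrete. The following is a small 1-D sketch of L_t under our own data layout (a list of per-cluster dictionaries); the key names are ours, not the paper's.

```python
def dynamic_means_cost(clusters, lam, Q):
    """Cost L_t from (15), 1-D sketch. Each cluster dict holds:
    dt (steps unobserved before revival; 0 if new), gamma (gamma_kt,
    unused when dt == 0), theta (theta_kt), theta_old (theta_{k(t-dt)}),
    and ys (observations assigned to the cluster this step)."""
    L = 0.0
    for c in clusters:
        if c["dt"] == 0:
            L += lam                       # new-cluster cost: lambda [dt = 0]
        else:
            L += Q * c["dt"]               # revival cost: Q * dt
            # weighted prior cost (vanishes for new clusters, theta = theta_old)
            L += c["gamma"] * (c["theta"] - c["theta_old"]) ** 2
        L += sum((y - c["theta"]) ** 2 for y in c["ys"])  # sum-of-squares cost
    return L
```

Re-evaluating this after each ASSIGNLABELS/ASSIGNPARAMS pass and stopping when it no longer decreases reproduces the termination test of Algorithm 2.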
The\nASSIGNPARAMS function calculates this cost function in each iteration of Algorithm 2, and the\nalgorithm terminates once the cost function does not decrease during an iteration.\n\n4.2 Reparameterizing the Algorithm\n\nIn order to use the Dynamic Means algorithm, there are three free parameters to select: \u03bb, Q, and \u03c4.\nWhile \u03bb represents how far an observation can be from a cluster before it is placed in a new cluster,\nand thus can be tuned intuitively, Q and \u03c4 are not so straightforward. The parameter Q represents\na conceptual added distance from any data point to a cluster for every time step that the cluster is\nnot observed. The parameter \u03c4 represents a conceptual reduction of distance from any data point\nto a cluster for every time step that the cluster is not observed. How these two quantities affect the\nalgorithm, and how they interact with the setting of \u03bb, is hard to judge.\nInstead of picking Q and \u03c4 directly, the algorithm may be reparameterized by picking NQ, k\u03c4 \u2208 R+,\nNQ > 1, k\u03c4 \u2265 1, and given a choice of \u03bb, setting\nQ =\u03bb/NQ \u03c4 =\n\nNQ(k\u03c4 \u2212 1) + 1\n\n(16)\n\n.\n\nNQ \u2212 1\n\nIf Q and \u03c4 are set in this manner, NQ represents the number (possibly fractional) of time steps a\ncluster can be unobserved before the label update (5) will never revive that cluster, and k\u03c4 \u03bb repre-\nsents the maximum squared distance away from a cluster center such that after a single time step, the\nlabel update (5) will revive that cluster. 
As N_Q and k_τ are specified in terms of concrete algorithmic behavior, they are intuitively easier to set than Q and τ.

5 Related Work

Prior k-means clustering algorithms that determine the number of clusters present in the data have primarily involved a method for iteratively modifying k using various statistical criteria [13, 14, 18]. In contrast, this work derives this capability from a Bayesian nonparametric model, similarly to the DP-Means algorithm [15]. In this sense, the relationship between the Dynamic Means algorithm and the dependent Dirichlet process [6] is exactly that between the DP-Means algorithm and the Dirichlet process [16], where the Dynamic Means algorithm may be seen as an extension to DP-Means that handles sequential data with time-varying cluster parameters. MONIC [19] and MC3 [20] have the capability to monitor time-varying clusters; however, these methods require datapoints to be identifiable across timesteps, and determine cluster similarity across timesteps via the commonalities between label assignments. The Dynamic Means algorithm does not require such information, and tracks clusters essentially based on similarity of the parameters across timesteps. Evolutionary clustering [21, 22], similar to Dynamic Means, minimizes an objective consisting of a cost for clustering the present data set and a cost related to the comparison between the current clustering and past clusterings. The present work can be seen as a theoretically-founded extension of this class of algorithm that provides methods for automatic and adaptive prior weight selection, for forming correspondences between old and current clusters, and for deciding when to introduce new clusters. Finally, some sequential Monte Carlo methods (e.g.
particle learning [23] or multi-target tracking [24, 25]) can be adapted for use in the present context, but suffer the drawbacks typical of particle filtering methods.

6 Applications

6.1 Synthetic Gaussian Motion Data

In this experiment, moving Gaussian clusters on [0, 1] × [0, 1] were generated synthetically over a period of 100 time steps. In each step, there was some number of clusters, each having 15 data points. The data points were sampled from a symmetric Gaussian distribution with a standard deviation of 0.05. Between time steps, the cluster centers moved randomly, with displacements sampled from the same distribution. At each time step, each cluster had a 0.05 probability of being destroyed.

Figure 1: (1a–1c): Accuracy contours and CPU time histogram for the Dynamic Means algorithm. (1d–1e): Comparison with Gibbs sampling, variational inference, and particle learning. Shaded region indicates 1σ interval; in (1e), only upper half is shown. (1f): Comparison of accuracy when enforcing (Gibbs, DynMeans) and not enforcing (Gibbs NC, DynMeans NC) correct cluster tracking.

This data was clustered with Dynamic Means (with 3 random assignment ordering restarts), DDP-GMM Gibbs sampling [6], variational inference [3], and particle learning [4] on a computer with an Intel i7 processor and 16GB of memory. First, the number of clusters was fixed to 5, and the parameter space of each algorithm was searched for the best possible cluster label accuracy (taking into account correct cluster tracking across time steps). The results of this parameter sweep for the Dynamic Means algorithm with 50 trials at each parameter setting are shown in Figures 1a–1c. Figures 1a and 1b show how the average clustering accuracy varies with the parameters after fixing either k_τ or T_Q to their values at the maximum accuracy parameter setting over the full space.
The Dynamic Means algorithm had a similar robustness with respect to variations in its parameters as the comparison algorithms. The histogram in Figure 1c demonstrates that the clustering speed is robust to the setting of parameters. The speed of Dynamic Means, coupled with the smoothness of its performance with respect to its parameters, makes it well suited for automatic tuning [26].

Using the best parameter setting for each algorithm, the data as described above were clustered in 50 trials with a varying number of clusters present in the data. For the Dynamic Means algorithm, parameter values λ = 0.04, T_Q = 6.8, and k_τ = 1.01 were used, and the algorithm was again given 3 attempts with random labeling assignment orders, where the lowest cost solution of the 3 was picked to proceed to the next time step. For the other algorithms, the parameter values α = 1 and q = 0.05 were used, with a Gaussian transition distribution variance of 0.05. The number of samples for the Gibbs sampling algorithm was 5000 with one recorded for every 5 samples, the number of particles for the particle learning algorithm was 100, and the variational inference algorithm was run to a tolerance of 10^{−20} with the maximum number of iterations set to 5000.

In Figures 1d and 1e, the labeling accuracy and clustering time (respectively) for the algorithms are shown. The sampling algorithms were handicapped to generate Figure 1d; the best posterior sample in terms of labeling accuracy was selected at each time step, which required knowledge of the true labeling. Further, the accuracy computation included enforcing consistency across timesteps, to allow tracking individual cluster trajectories. If this is not enforced (i.e. accuracy considers each time step independently), the other algorithms provide accuracies more comparable to those of the Dynamic Means algorithm.
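The synthetic setup described above (15 points per cluster, standard deviation 0.05 for both the point noise and the center random walk, and a 0.05 per-step death probability) can be sketched as a small generator. One caveat: the paper does not state how new clusters are born, so the per-step birth probability `p_birth` below is our own assumption, as are all names.

```python
import random

def generate_moving_clusters(T=100, n_per=15, std=0.05, p_death=0.05,
                             p_birth=0.05, n_init=5):
    """Generate moving-Gaussian synthetic data as in Section 6.1:
    clusters on [0,1]x[0,1], n_per points each with std `std`,
    random-walk centers with the same std, and per-step death
    probability p_death. Returns a list of per-step batches of
    (x, y, true_label) tuples."""
    centers = {k: (random.random(), random.random()) for k in range(n_init)}
    next_id, batches = n_init, []
    for _ in range(T):
        # kill clusters w.p. p_death, otherwise random-walk their centers
        for k in list(centers):
            if random.random() < p_death:
                del centers[k]
            else:
                cx, cy = centers[k]
                centers[k] = (cx + random.gauss(0, std), cy + random.gauss(0, std))
        # birth (assumed mechanism); also ensure at least one live cluster
        if random.random() < p_birth or not centers:
            centers[next_id] = (random.random(), random.random())
            next_id += 1
        batch = [(random.gauss(cx, std), random.gauss(cy, std), k)
                 for k, (cx, cy) in centers.items() for _ in range(n_per)]
        batches.append(batch)
    return batches
```

Retaining the true label for each point, as done here, is what allows the tracking-aware accuracy computation described above.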
This effect is demonstrated in Figure 1f, which shows the time/accuracy tradeoff for Gibbs sampling (varying the number of samples) and Dynamic Means (varying the number of restarts). These examples illustrate that Dynamic Means outperforms standard inference algorithms in both label accuracy and computation time for cluster tracking problems.

Figure 2: Results of the GP aircraft trajectory clustering. Left: A map (labeled with major US city airports) showing the overall aircraft flows for 12 trajectories, with colors and 1σ confidence ellipses corresponding to takeoff region (multiple clusters per takeoff region), colored dots indicating mean takeoff position for each cluster, and lines indicating the mean trajectory for each cluster. Right: A track of plane counts for the 12 clusters during the week, with color intensity proportional to the number of takeoffs at each time.

6.2 Aircraft Trajectory Clustering

In this experiment, the Dynamic Means algorithm was used to find the typical spatial and temporal patterns in the motions of commercial aircraft. Automatic dependent surveillance-broadcast (ADS-B) data, including plane identification, timestamp, latitude, longitude, heading and speed, was collected from all transmitting planes across the United States during the week from 2013-3-22 1:30:0 to 2013-3-28 12:0:0 UTC.
Then, individual ADS-B messages were connected together based on their plane identification and timestamp to form trajectories, and erroneous trajectories were filtered based on reasonable spatial/temporal bounds, yielding 17,895 unique trajectories. Then, for each trajectory, a Gaussian process was trained using the latitude and longitude of each ADS-B point along the trajectory as the inputs and the North and East components of plane velocity at those points as the outputs. Next, the mean latitudinal and longitudinal velocities from the Gaussian process were queried for each point on a regular lattice across the USA (10 latitudes and 20 longitudes), and used to create a 400-dimensional feature vector for each trajectory. Of the resulting 17,895 feature vectors, 600 were hand-labeled (each label including a confidence weight in [0, 1]). The feature vectors were clustered using the DP-Means algorithm on the entire dataset in a single batch, and using Dynamic Means / DDPGMM Gibbs sampling (with 50 samples) with half-hour takeoff window batches.

The results of this exercise are provided in Figure 2 and Table 1. Figure 2 shows the spatial and temporal properties of the 12 most popular clusters discovered by Dynamic Means, demonstrating that the algorithm successfully identified major flows of commercial aircraft across the US. Table 1 corroborates these qualitative results with a quantitative comparison of the computation time and accuracy for the three algorithms tested over 20 trials. The confidence-weighted accuracy was computed by taking the ratio between the sum of the weights for correctly labeled points and the sum of all weights. The DDPGMM Gibbs sampling algorithm was handicapped as described in the synthetic experiment section.
Of the three algorithms, Dynamic Means provided the highest labeling accuracy, while requiring orders of magnitude less computation time than both DP-Means and DDPGMM Gibbs sampling.

Table 1: Mean computational time & accuracy on hand-labeled aircraft trajectory data

Alg.  | % Acc. | Time (s)
DynM  | 55.9   | 2.7 × 10²
DPM   | 55.6   | 3.1 × 10³
Gibbs | 36.9   | 1.4 × 10⁴

7 Conclusion

This work developed a clustering algorithm for batch-sequential data containing temporally evolving clusters, derived from a low-variance asymptotic analysis of the Gibbs sampling algorithm for the dependent Dirichlet process mixture model. Synthetic and real data experiments demonstrated that the algorithm requires orders of magnitude less computational time than contemporary probabilistic and hard clustering algorithms, while providing higher accuracy on the examined datasets. The speed of inference coupled with the convergence guarantees provided yields an algorithm suitable for use in time-critical applications, such as online model-based autonomous planning systems.

Acknowledgments

This work was supported by NSF award IIS-1217433 and ONR MURI grant N000141110688.

[Figure 2 graphic: map labeled with airports JFK, MIA, HOU, LAX, SEA, ORD, MSP; weekly cluster activity (Fri–Fri, UTC) for the 12 clusters.]

References

[1] Yee Whye Teh. Dirichlet processes. In Encyclopedia of Machine Learning. Springer, New York, 2010.

[2] Radford M. Neal. Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9(2):249–265, 2000.

[3] David M. Blei and Michael I. Jordan. Variational inference for Dirichlet process mixtures. Bayesian Analysis, 1(1):121–144, 2006.

[4] Carlos M. Carvalho, Hedibert F. Lopes, Nicholas G. Polson, and Matt A. Taddy. Particle learning for general mixtures.
Bayesian Analysis, 5(4):709–740, 2010.

[5] Steven N. MacEachern. Dependent nonparametric processes. In Proceedings of the Bayesian Statistical Science Section. American Statistical Association, 1999.

[6] Dahua Lin, Eric Grimson, and John Fisher. Construction of dependent Dirichlet processes based on Poisson processes. In Neural Information Processing Systems, 2010.

[7] Matt Hoffman, David Blei, Chong Wang, and John Paisley. Stochastic variational inference. arXiv:1206.7051, 2012.

[8] Finale Doshi-Velez and Zoubin Ghahramani. Accelerated sampling for the Indian buffet process. In Proceedings of the International Conference on Machine Learning, 2009.

[9] Felix Endres, Christian Plagemann, Cyrill Stachniss, and Wolfram Burgard. Unsupervised discovery of object classes from range data using latent Dirichlet allocation. In Robotics: Science and Systems, 2005.

[10] Matthias Luber, Kai Arras, Christian Plagemann, and Wolfram Burgard. Classifying dynamic objects: An unsupervised learning approach. In Robotics: Science and Systems, 2004.

[11] Zhikun Wang, Marc Deisenroth, Heni Ben Amor, David Vogt, Bernhard Schölkopf, and Jan Peters. Probabilistic modeling of human movements for intention inference. In Robotics: Science and Systems, 2008.

[12] Stuart P. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137, 1982.

[13] Dan Pelleg and Andrew Moore. X-means: Extending k-means with efficient estimation of the number of clusters. In Proceedings of the 17th International Conference on Machine Learning, 2000.

[14] Robert Tibshirani, Guenther Walther, and Trevor Hastie. Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society B, 63(2):411–423, 2001.

[15] Brian Kulis and Michael I. Jordan.
Revisiting k-means: New algorithms via Bayesian nonparametrics. In Proceedings of the 29th International Conference on Machine Learning (ICML), Edinburgh, Scotland, 2012.

[16] Thomas S. Ferguson. A Bayesian analysis of some nonparametric problems. The Annals of Statistics, 1(2):209–230, 1973.

[17] Jayaram Sethuraman. A constructive definition of Dirichlet priors. Statistica Sinica, 4:639–650, 1994.

[18] Tsunenori Ishioka. Extended k-means with an efficient estimation of the number of clusters. In Proceedings of the 2nd International Conference on Intelligent Data Engineering and Automated Learning, pages 17–22, 2000.

[19] Myra Spiliopoulou, Irene Ntoutsi, Yannis Theodoridis, and Rene Schult. MONIC: modeling and monitoring cluster transitions. In Proceedings of the 12th International Conference on Knowledge Discovery and Data Mining, pages 706–711, 2006.

[20] Panos Kalnis, Nikos Mamoulis, and Spiridon Bakiras. On discovering moving clusters in spatio-temporal data. In Proceedings of the 9th International Symposium on Spatial and Temporal Databases, pages 364–381. Springer, 2005.

[21] Deepayan Chakrabarti, Ravi Kumar, and Andrew Tomkins. Evolutionary clustering. In Proceedings of the SIGKDD International Conference on Knowledge Discovery and Data Mining, 2006.

[22] Kevin Xu, Mark Kliger, and Alfred Hero III. Adaptive evolutionary clustering. Data Mining and Knowledge Discovery, pages 1–33, 2012.

[23] Carlos M. Carvalho, Michael S. Johannes, Hedibert F. Lopes, and Nicholas G. Polson. Particle learning and smoothing. Statistical Science, 25(1):88–106, 2010.

[24] Carine Hue, Jean-Pierre Le Cadre, and Patrick Pérez. Tracking multiple objects with particle filtering. IEEE Transactions on Aerospace and Electronic Systems, 38(3):791–812, 2002.

[25] Jaco Vermaak, Arnaud Doucet, and Patrick Pérez.
Maintaining multi-modality through mixture tracking. In Proceedings of the 9th IEEE International Conference on Computer Vision, 2003.

[26] Jasper Snoek, Hugo Larochelle, and Ryan Adams. Practical Bayesian optimization of machine learning algorithms. In Neural Information Processing Systems, 2012.