{"title": "Simplifying Mixture Models through Function Approximation", "book": "Advances in Neural Information Processing Systems", "page_first": 1577, "page_last": 1584, "abstract": null, "full_text": "Simplifying Mixture Models\n\nthrough Function Approximation\n\nKai Zhang\n\nJames T. Kwok\n\nDepartment of Computer Science and Engineering\n\nThe Hong Kong University of Science and Technology\n\nClear Water Bay, Kowloon, Hong Kong\n{twinsen, jamesk}@cse.ust.hk\n\nAbstract\n\nThe finite mixture model is a powerful tool in many statistical learning problems. In this paper, we propose a general, structure-preserving approach to reduce its model complexity, which can bring significant computational benefits in many applications. The basic idea is to group the original mixture components into compact clusters, and then minimize an upper bound on the approximation error between the original and simplified models. By adopting the L2 norm as the distance measure between mixture models, we can derive closed-form solutions that are more robust and reliable than those based on KL distances. Moreover, the complexity of our algorithm is only linear in the sample size and dimensionality. Experiments on density estimation and clustering-based image segmentation demonstrate its outstanding performance in terms of both speed and accuracy.\n\n1 Introduction\n\nIn many statistical learning problems, it is useful to obtain an estimate of the underlying probability density given a set of observations. Such a density model can facilitate discovery of the underlying data structure in unsupervised learning, and can also yield, asymptotically, optimal discriminant procedures [7]. In this paper, we focus on the finite mixture model, which describes the distribution by a mixture of simple parametric functions \phi(\cdot) as f(x) = \sum_{j=1}^{n} \alpha_j \phi(x, \theta_j). Here, \theta_j is the parameter for the jth component, and the mixing parameters \alpha_j satisfy \sum_{j=1}^{n} \alpha_j = 1. The most common parametric form of \phi is the Gaussian, leading to the well-known Gaussian mixture.\nThe mixture model has been widely used in clustering and density estimation, where the model parameters can be estimated by the standard Expectation-Maximization (EM) algorithm. However, EM can be prohibitively expensive on large problems [12]. Moreover, in many learning procedures that use mixture models (such as particle filtering [6] and non-parametric belief propagation [13]), the computational requirement is also very demanding due to the large number of components involved in the model. In this situation, our interest is in reducing the number of components for prospective computational savings. Previous works typically employ spatial data structures, such as the kd-tree [8, 9], for acceleration. Recently, [5] proposed to reduce a large Gaussian mixture into a smaller one by minimizing a KL-based distance between the two mixtures. This has been applied with success to hierarchical clustering of scenery images and handwritten digits.\n\nIn this paper, we propose a new algorithm for simplifying a given finite mixture model while preserving its component structure, with applications to nonparametric density estimation and clustering. The idea is to minimize an upper bound on the approximation error between the original and simplified mixture models. By adopting the L2 norm as the error criterion, we can derive closed-form solutions that are more robust and reliable than those based on KL distances. 
At the same time, our algorithm can be applied to general Gaussian kernels, and the complexity is only linear in the sample size and dimensionality.\n\nThe rest of the paper is organized as follows. In Section 2, we describe the proposed approach in detail and illustrate its advantages over existing methods. In Section 3, we report experimental results on simplifying the Parzen window estimator, and on color image segmentation through the mean shift clustering procedure. Section 4 gives some concluding remarks.\n\n2 Approximation Algorithm\n\nGiven a mixture model\n\nf(x) = \sum_{j=1}^{n} \alpha_j \phi_j(x), (1)\n\nwe assume that the jth component \phi_j(x) is of the form\n\n\phi_j(x) = |H_j|^{-1/2} K_{H_j}(x - x_j), (2)\n\nwith weight \alpha_j, center x_j and covariance matrix H_j. Here, K_H(x) = K(H^{-1/2} x), where K(x) is the kernel, which is bounded and has compact support. Note that for radially symmetric kernels, it suffices to define K by the profile k such that K(x) = k(\|x\|^2). With this notation, the gradient of the kernel function K_H(x) can be conveniently written as \partial_x K_H(x) = k'(r) \partial_x r = 2 k'(r) H^{-1} x, where r = x^\top H^{-1} x. Our task is to approximate f with a simpler mixture model\n\ng(x) = \sum_{i=1}^{m} w_i g_i(x), (3)\n\nwith m \ll n, where each component g_i also takes the form\n\ng_i(x) = |\tilde{H}_i|^{-1/2} K_{\tilde{H}_i}(x - t_i), (4)\n\nwith weight w_i, center t_i, and covariance matrix \tilde{H}_i.\nNote that direct approximation of f by g is not feasible, because f involves a large number of components. Given a distance measure D(\cdot, \cdot) between functions, the approximation error\n\nE = D(f, g) = D\Big( \sum_{j=1}^{n} \alpha_j \phi_j, \sum_{i=1}^{m} w_i g_i \Big) (5)\n\nis usually difficult to optimize directly. However, the problem can be very much simplified by minimizing an upper bound on E. 
Consider the L2 distance D(\phi, \phi') = \int (\phi(x) - \phi'(x))^2 dx, and suppose that the mixture components \{\phi_j\}_{j=1}^{n} are divided into disjoint clusters S_1, \ldots, S_m. Then it is easy to see that the approximation error E is bounded by\n\nE = \int \Big( \sum_{j=1}^{n} \alpha_j \phi_j(x) - \sum_{i=1}^{m} w_i g_i(x) \Big)^2 dx \le m \sum_{i=1}^{m} \int \Big( w_i g_i(x) - \sum_{j \in S_i} \alpha_j \phi_j(x) \Big)^2 dx. (6)\n\nDenote this upper bound by \bar{E} = m \sum_{i=1}^{m} E^i, where\n\nE^i = \int \Big( w_i g_i(x) - \sum_{j \in S_i} \alpha_j \phi_j(x) \Big)^2 dx.\n\nNote that \bar{E} is, up to the constant factor m, the sum of the \u201clocal\u201d approximation errors E^i. Hence, if we can find a good representative w_i g_i for each cluster by minimizing the local approximation error E^i, the overall approximation performance can also be guaranteed. This suggests partitioning the original mixture components into compact clusters, wherein approximation can then be done much more easily. Our basic algorithm proceeds as follows:\n\n1. (Section 2.1.1) Partition the set of mixture components \phi_j into m clusters, where m \ll n. Let S_i be the set that indexes all components belonging to the ith cluster.\n2. (Section 2.1.2) For each cluster, approximate the local mixture model \sum_{j \in S_i} \alpha_j \phi_j by a single component w_i g_i, where g_i is defined in (4).\n3. The simplified model g is obtained as g(x) = \sum_{i=1}^{m} w_i g_i(x).\n\nThese steps will be discussed in more detail in the following sections.\n\n2.1 Procedure\n\n2.1.1 Partitioning of Components\n\nIn this section, we consider how to group similar components into the same cluster, so that the subsequent local approximation can be more accurate. 
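The L2 quantities above are available in closed form for Gaussian components, since \int N(x; a, A) N(x; b, B) dx = N(a; b, A + B). The following 1-D NumPy sketch (our own illustration, not the paper's code; names such as l2_distance are ours) evaluates the L2 distance between two Gaussian mixtures without numerical integration:

```python
import numpy as np

def norm_pdf(x, mean, var):
    # N(x; mean, var) for scalars or arrays
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

def l2_distance(w1, mu1, var1, w2, mu2, var2):
    """Closed-form L2 distance between two 1-D Gaussian mixtures.

    Uses  int N(x;a,A) N(x;b,B) dx = N(a; b, A+B),  so that
    D(f,g) = <f,f> - 2<f,g> + <g,g>  needs no quadrature.
    """
    def cross(wa, ma, va, wb, mb, vb):
        # sum_{a,b} wa_a wb_b * N(ma_a; mb_b, va_a + vb_b)
        s = 0.0
        for a in range(len(wa)):
            for b in range(len(wb)):
                s += wa[a] * wb[b] * norm_pdf(ma[a], mb[b], va[a] + vb[b])
        return s
    return (cross(w1, mu1, var1, w1, mu1, var1)
            - 2 * cross(w1, mu1, var1, w2, mu2, var2)
            + cross(w2, mu2, var2, w2, mu2, var2))
```

For identical mixtures the distance vanishes, and the value agrees with direct numerical quadrature of \int (f - g)^2 dx.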
A useful algorithm for this task is classic vector quantization (VQ) [4], where one iterates between partitioning a set of vectors and finding the best prototype for each partition until the distortion error converges. By defining a distance D(\cdot, \cdot) between the mixture components \phi_j, we can partition the mixture components in a similar way. However, vector quantization is sensitive to the initial partitioning, so we first introduce a simple but highly efficient partitioning method called sequential sampling (SS):\n\n1. Randomly select a \phi_j and add it to the set of representatives R.\n2. For each component \phi_j (j = 1, 2, \ldots, n), do the following:\n- Compute the distances D(\phi_j, R_i) for R_i \in R.\n- If D(\phi_j, R_i) \le r for some R_i, where r is a predefined threshold, assign \phi_j to the representative R_i, and then process the next component.\n- If D(\phi_j, R_i) > r for all R_i \in R, add \phi_j to R as a new representative.\n3. Terminate when all the components have been processed.\n\nThis procedure partitions the components by choosing those \phi_j's that are far enough apart as representatives, with a user-defined resolution r, and is therefore less sensitive to initialization. In practice, we first initialize by sequential sampling, and then perform the iterative VQ procedure to further refine the partition, i.e., find the best representative R_i for each cluster, reassign each component \phi_j to the closest representative R_{\pi(j)}, and iterate until the error \sum_j \alpha_j D(\phi_j, R_{\pi(j)}) converges.\n\n2.1.2 Local Approximation\n\nIn this part, we consider how to obtain a good representative, w_i g_i in (4), for each local cluster S_i. The task is to determine the unknown variables w_i, t_i and \tilde{H}_i associated with g_i. 
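Before turning to that, the sequential sampling pass of Section 2.1.1 can be sketched in a few lines (our own illustration; for simplicity, D(\cdot, \cdot) is taken here to be the Euclidean distance between component centers rather than a full inter-component distance):

```python
import numpy as np

def sequential_sampling(centers, r):
    """One sequential-sampling pass over mixture components.

    centers: (n, d) array of component centers x_j.
    r: resolution threshold; a component within distance r of an
       existing representative joins that representative's cluster.
    Returns (list of representative indices, cluster label per component).
    """
    reps = []                                  # indices of representatives
    labels = np.empty(len(centers), dtype=int)
    for j, x in enumerate(centers):
        for c, i in enumerate(reps):
            if np.linalg.norm(x - centers[i]) <= r:   # close enough: assign
                labels[j] = c
                break
        else:                                  # no representative nearby: new cluster
            labels[j] = len(reps)
            reps.append(j)
    return reps, labels
```

In practice this rough partition would then be refined by the iterative VQ procedure described above.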
Using the L2 norm, the local term of the upper bound (6) can be written as\n\nE^i = \int \Big( w_i g_i(x) - \sum_{j \in S_i} \alpha_j \phi_j(x) \Big)^2 dx = w_i^2 \frac{C_K}{|2\tilde{H}_i|^{1/2}} - w_i \sum_{j \in S_i} \frac{2 C_K \alpha_j k(r_{ij})}{|H_j + \tilde{H}_i|^{1/2}} + c_i.\n\nHere, C_K = \int k(x^\top x) dx is a kernel-dependent constant, c_i = \int ( \sum_{j \in S_i} \alpha_j \phi_j(x) )^2 dx is a data-dependent constant (independent of the unknown variables), and r_{ij} = (t_i - x_j)^\top (H_j + \tilde{H}_i)^{-1} (t_i - x_j). Here we have assumed that k(a) k(b) = k(a + b), which is valid for the Gaussian and negative exponential kernels; without this assumption, solutions can still be obtained but are less compact.\nTo minimize E^i w.r.t. w_i, t_i and \tilde{H}_i, one can set the corresponding partial derivatives of E^i to zero. However, this leads to a nonlinear system that is quite difficult to solve. Instead, we decouple the relations among the three parameters. First, observe that E^i is a quadratic function of w_i. Therefore, given \tilde{H}_i and t_i, the minimum value of E^i over w_i is easily obtained as\n\nE^i_{\min} = c_i - C_K |2\tilde{H}_i|^{1/2} \Big( \sum_{j \in S_i} \alpha_j k(r_{ij}) \, |H_j + \tilde{H}_i|^{-1/2} \Big)^2. (7)\n\nThe remaining task is to minimize E^i_{\min} w.r.t. t_i and \tilde{H}_i. By setting \partial_{t_i} E^i_{\min} = 0, we have\n\nt_i = M_i^{-1} \sum_{j \in S_i} \frac{\alpha_j k'(r_{ij}) (H_j + \tilde{H}_i)^{-1} x_j}{|H_j + \tilde{H}_i|^{1/2}}, (8)\n\nwhere\n\nM_i = \sum_{j \in S_i} \frac{\alpha_j k'(r_{ij}) (H_j + \tilde{H}_i)^{-1}}{|H_j + \tilde{H}_i|^{1/2}}.\n\nThis defines a contraction mapping: if \tilde{H}_i is fixed, we can obtain t_i by starting from an initial t_i^{(0)} and iterating (8) until convergence. 
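As a concrete illustration of iterating (8) with \tilde{H}_i held fixed, the sketch below (our own code, not the paper's) runs the fixed point in 1-D with the Gaussian profile k(r) = e^{-r/2}, for which k'(r) = -k(r)/2; the constant factor -1/2 cancels between numerator and denominator:

```python
import numpy as np

def fit_center(alpha, x, H, H_tilde, t0, iters=100):
    """Fixed-point iteration (8) for the representative center t_i
    in 1-D, Gaussian profile k(r) = exp(-r/2), bandwidth H_tilde fixed.

    alpha, x, H: arrays of weights, centers, and variances of the
                 components in one cluster S_i.
    """
    t = t0
    for _ in range(iters):
        s = H + H_tilde                 # H_j + H~_i (scalar variances in 1-D)
        r = (t - x) ** 2 / s            # r_ij
        # weight alpha_j * |k'(r_ij)| / |H_j + H~_i|^{3/2}; the common
        # factor 1/2 from k'(r) = -k(r)/2 cancels in the ratio below
        w = alpha * np.exp(-0.5 * r) / (s * np.sqrt(s))
        t_new = np.sum(w * x) / np.sum(w)
        if abs(t_new - t) < 1e-10:      # converged
            break
        t = t_new
    return t
```

Each update is a convex combination of the x_j, so the iterate stays inside the cluster's support; for a symmetric cluster it converges to the center of symmetry.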
Now, to solve for \tilde{H}_i, we set \partial_{\tilde{H}_i} E^i_{\min} = 0 and obtain\n\n\tilde{H}_i = P_i^{-1} \sum_{j \in S_i} \frac{\alpha_j (\tilde{H}_i + H_j)^{-1}}{|H_j + \tilde{H}_i|^{1/2}} \Big( k(r_{ij}) H_j + 4 (-k'(r_{ij})) (x_j - t_i)(x_j - t_i)^\top (\tilde{H}_i + H_j)^{-1} \tilde{H}_i \Big), (9)\n\nwhere\n\nP_i = \sum_{j \in S_i} \frac{(\tilde{H}_i + H_j)^{-1}}{|H_j + \tilde{H}_i|^{1/2}} \alpha_j k(r_{ij}).\n\nIn summary, we first initialize\n\nt_i^{(0)} = \sum_{j \in S_i} \alpha_j x_j \Big/ \sum_{j \in S_i} \alpha_j, \qquad \tilde{H}_i^{(0)} = \sum_{j \in S_i} \alpha_j \Big( H_j + (t_i^{(0)} - x_j)(t_i^{(0)} - x_j)^\top \Big) \Big/ \sum_{j \in S_i} \alpha_j,\n\nand then iterate (8) and (9) until convergence. The converged values of t_i and \tilde{H}_i are substituted into \partial_{w_i} E^i = 0 to obtain w_i as\n\nw_i = |2\tilde{H}_i|^{1/2} \sum_{j \in S_i} \frac{\alpha_j k(r_{ij})}{|H_j + \tilde{H}_i|^{1/2}}. (10)\n\n2.2 Complexity\n\nIn the partitioning step, sequential sampling has a complexity of O(dmn), where n is the original model size, m is the number of clusters, and d is the dimension. By using a hierarchical scheme [2], this can be reduced to O(dn \log m). The VQ refinement takes O(dnm) time. In the local approximation step, the complexity is l \sum_{i=1}^{m} n_i d^3 = l n d^3, where l is the maximum number of iterations needed. In practice, we can enforce a diagonal structure on the covariance matrices \tilde{H}_i while still obtaining closed-form solutions; the complexity then becomes linear in the dimension d instead of cubic. Summing these three terms, the overall complexity is O(dn \log m + dnm + lnd) = O(dn(m + l)), which is linear in both the data size and the dimension (in practice m and l are quite small).\n\n2.3 Remarks\n\nIn this section, we discuss some interesting properties of the approximation scheme proposed in Section 2.1.2. For better intuition, we examine the special case of a Parzen window density estimator [11], where all \phi_j have the same weights and bandwidths (H_j = H for j = 1, 2, . 
. . , n). Equation (9) then reduces to\n\n\tilde{H}_i = H + 4 \tilde{H}_i (\tilde{H}_i + H)^{-1} V_i, (11)\n\nwhere\n\nV_i = \frac{\sum_{j \in S_i} \alpha_j (-k'(r_{ij})) (x_j - t_i)(x_j - t_i)^\top}{\sum_{j \in S_i} \alpha_j k(r_{ij})}.\n\nThis shows that the bandwidth \tilde{H}_i of g_i can be decomposed into two parts: the bandwidth H of the original kernel density estimator, and the covariance V_i of the local cluster S_i scaled by an adjusting matrix \Gamma_i = 4 \tilde{H}_i (\tilde{H}_i + H)^{-1}. As an illustration, consider the 1-D case where H = h^2 and \tilde{H}_i = h_i^2. Then \Gamma_i reduces to the scalar \gamma_i = 4 h_i^2 / (h^2 + h_i^2), and h_i^2 = h^2 + \gamma_i V_i. Since V_i \ge 0 and \gamma_i \ge 2, we see that h_i^2 \ge h^2 + V_i. Moreover, h_i is closely related to the spread of the local cluster. If all the points in S_i are located at the same position (i.e., V_i = 0), then h_i^2 = h^2. Otherwise, the larger the spread of the local cluster, the larger h_i is. In other words, the bandwidths \tilde{H}_i are adaptive to the local data distribution.\nRelated works on simplifying mixture models (such as [5]) simply choose \tilde{H}_i = H + Cov[S_i]. In comparison, our covariance term V_i is more reliable in that it incorporates distance-based weighting. Interestingly, this is somewhat similar to the bandwidth matrix used in manifold Parzen windows [14], which is designed to handle sparse, high-dimensional data more robustly. Note that our choice of \tilde{H}_i is derived rigorously by minimizing the L2 approximation error; this coincidence thus indicates the robustness of L2-norm-based distance measures. Moreover, note that the adjusting matrix \Gamma_i changes not only the scale of the bandwidth, but also its eigen-structure in an iterative manner. 
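In this 1-D Parzen case, (11) becomes the scalar fixed-point equation h_i^2 = h^2 + \gamma_i V_i, which can be solved by simple iteration. A small sketch (our own illustration, with V_i held fixed):

```python
def bandwidth_1d(h2, V, iters=200):
    """Solve the 1-D fixed point  h_i^2 = h^2 + 4*h_i^2/(h^2 + h_i^2) * V
    (equation (11) with H = h^2 and V_i = V held fixed) by iteration.
    """
    hi2 = h2 + V                            # reasonable starting point
    for _ in range(iters):
        gamma = 4.0 * hi2 / (h2 + hi2)      # adjusting factor gamma_i
        hi2_new = h2 + gamma * V
        if abs(hi2_new - hi2) < 1e-12:      # converged
            break
        hi2 = hi2_new
    return hi2
```

For h^2 = 1 and V_i = 1/4 the fixed point satisfies x^2 - x - 1 = 0, i.e. h_i^2 = (1 + \sqrt{5})/2, which is consistent with the bound h_i^2 \ge h^2 + V_i above.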
This will be very beneficial in multivariate cases.\nSecond, in determining the center of g_i, (8) can be reduced to\n\nt_i = \frac{\sum_{j \in S_i} \alpha_j k'(r_{ij}) x_j}{\sum_{j \in S_i} \alpha_j k'(r_{ij})}, \quad r_{ij} = (x_j - t_i)^\top (H + \tilde{H}_i)^{-1} (x_j - t_i). (12)\n\nThis can be regarded as a mean-shift procedure [1] in the d-dimensional space with kernel K. It is easy to verify that this iterative procedure is indeed locating a peak of the density function p_i(x) = |H + \tilde{H}_i|^{-1/2} \sum_{j \in S_i} \alpha_j K_{H + \tilde{H}_i}(x - x_j). Note, on the other hand, that what we originally want to approximate is the local density f_i(x) = |H|^{-1/2} \sum_{j \in S_i} \alpha_j K_H(x - x_j). In the 1-D case (with H = h^2 and \tilde{H}_i = h_i^2), the bandwidth of p_i (i.e., h^2 + h_i^2) is larger than that of f_i (i.e., h^2).\nIt appears intriguing that, in fitting a kernel density f_i(x) estimated on the sample set \{x_j\}_{j \in S_i}, one needs to locate the maximum of another density function p_i(x), instead of the maximum of f_i(x) itself or, simply, the mean of the sample set \{x_j\}_{j \in S_i} as chosen in [5]. Indeed, these three choices coincide when the distribution of S_i is symmetric and uni-modal, but they differ otherwise. Intuitively, when the data is asymmetric, the center t_i should be biased towards the heavier side of the data distribution. The maximum of f_i(x) thus fails to meet this requirement. The mean of S_i, though biased towards the heavier side, lacks accurate control of the degree of bias. In comparison, our method provides a principled way of selecting the center: since p_i(x) has a larger bandwidth than the original f_i(x), its maximum moves towards the heavier side of the distribution compared with that of f_i(x), with the degree of bias automatically controlled by the mean shift iterations in (12).\n\nHere, we give an illustration of the performance of the three center selection schemes. 
Figure 1(a) shows the histogram of a local cluster S_i, whose Parzen window estimator f_i is asymmetric. Figure 1(b) plots the corresponding approximation error E^i in (6) at different bandwidths h_i (the remaining parameter, w_i, is set to the optimal value by (10)). As can be seen, the approximation error of our method is consistently lower than those of the other two schemes. Moreover, the resultant optimum is also much lower.\n\n[Figure 1: Approximation of an asymmetric density using different center selection schemes. (a) The histogram of a local cluster S_i and its density f_i. (b) Approximation error against h_i^2 for the local maximum, the local mean, and our method.]\n\n3 Experiments\n\nIn this section, we perform experiments to evaluate the performance of our mixture simplification scheme. We focus on the Parzen window estimator which, given a set of samples S = \{x_i\}_{i=1}^{n} in R^d, can be written as \hat{f}(x) = \frac{1}{n} |H|^{-1/2} \sum_{j=1}^{n} K_H(x - x_j). Note that the Parzen window estimator is a limiting form of the mixture model, where the number of components equals the data size and can be quite large. In Section 3.1, we use the proposed approach to reduce the number of components in the kernel density estimator, and compare its performance with the algorithm in [5]. Then, in Section 3.2, we perform color image segmentation by running the mean shift clustering algorithm on the simplified density model.\n\n3.1 Simplifying Nonparametric Density Models\n\nIn this section, we reduce the number of kernels in the Parzen window estimator using the proposed approach and the method in [5]. 
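A Parzen window estimator of the form just given can be written directly (our own 1-D sketch, with a Gaussian kernel; the function name is ours):

```python
import numpy as np

def parzen(x, samples, h):
    """Gaussian Parzen window estimate  f^(x) = (1/n) sum_j N(x; x_j, h^2)."""
    x = np.atleast_1d(x)[:, None]           # query points, shape (q, 1)
    d2 = (x - samples[None, :]) ** 2        # squared distances to samples
    k = np.exp(-0.5 * d2 / h ** 2) / np.sqrt(2 * np.pi * h ** 2)
    return k.mean(axis=1)                   # average of n Gaussian bumps
```

Each column of the kernel matrix is one Gaussian bump centered at a sample; averaging over the samples gives a density estimate that integrates to one, with one component per data point, which is exactly the model size the proposed scheme reduces.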
Experiments are performed on a 1-D set with 1800 samples drawn from the Gaussian mixture \frac{8}{18} N(-2.6, 0.09) + \frac{6}{18} N(-0.8, 0.36) + \frac{4}{18} N(1.7, 0.64), where N(\mu, \sigma^2) denotes the normal distribution with mean \mu and variance \sigma^2. The Gaussian kernel with fixed bandwidth h = 0.3 is used for density estimation. To make the problem more challenging, we choose m = 5, i.e., only 5 kernels are used to approximate the density. The k-means algorithm is used for initialization. As can be seen from Figure 2(b), the third Gaussian component has been broken into two by the method in [5]. In comparison, our result in Figure 2(c) is more reliable.\n\n[Figure 2: Approximating the Parzen window estimator by simplifying mixture models. (a) Histogram. (b) Result by [5]. (c) Our result. Green: Parzen window estimator; black: simplified mixture model; blue-dashed: components of the mixture model.]\n\nFor a quantitative evaluation, we randomly generate the 3-Gaussian data 100 times, and compare the two algorithms (ours and [5]) using the following error criteria: 1) the L2 error (5); 2) the standard KL-distance; 3) the local KL-distance used in [5]. 
The local KL-distance between two mixtures, f = \sum_{j=1}^{n} \alpha_j \phi_j and g = \sum_{i=1}^{m} w_i g_i, is defined as\n\nd(f, g) = \sum_{j=1}^{n} \alpha_j KL(\phi_j \| g_{\pi(j)}),\n\nwhere \pi(j) maps each component \phi_j to its closest representative component g_{\pi(j)}, i.e., \pi(j) = \arg\min_{i=1,2,\ldots,m} KL(\phi_j \| g_i).\nResults are plotted in Figure 3, where for clarity the trials are ordered by increasing error of [5]. We can see that under the L2 norm, the error of our algorithm is significantly lower than that of [5]. Quantitatively, our error is only about 36.61% of that by [5]. Under the standard KL-distance, our error is about 87.34% of that by [5]; the improvement is less significant here because the KL-distance is sensitive to the tail of the distribution, i.e., a small difference in the low-density regions may induce a huge KL-distance. As for the local KL-distance, our error is about 99.35% of that by [5].\n\n3.2 Image Segmentation\n\nThe Parzen window estimator can be used to reveal important clustering information, namely that its modes (or local maxima) correspond to dominant clusters in the data. 
This property is utilized in the mean shift clustering algorithm [1, 3], where every data point is moved along the density gradient until it reaches the nearest local density maximum. The mean shift algorithm is robust, and can identify arbitrarily-shaped clusters in the feature space.\n\n[Figure 3: Quantitative comparison of the approximation errors. (a) The L2 distance error. (b) Standard KL-distance. (c) Local KL-distance defined by [5].]\n\nRecently, mean shift has been applied to color image segmentation and has proven quite successful [1]. The idea is to identify homogeneous image regions through clustering in a properly selected feature space (such as color, texture, or shape). However, mean shift can be quite expensive due to the large number of kernels involved in the density estimator. To reduce the computational requirement, we first reduce the density estimator \hat{f}(x) to a simpler model g(x) using our simplification scheme, and then apply the iterative mean shift procedure on the simplified model g(x).\nExperiments are performed on a number of benchmark images1 used in [1]. We use the Gaussian kernel with bandwidth h = 20. The partition parameter is r = 25. For comparison, we also implement the standard mean shift and its fast version using kd-trees (using the ANN library [10]). 
The code is written in C++ and run on a 2.26GHz Pentium-III machine. As the \u201ctrue\u201d segmentation of an image is subjective, only a visual comparison is intended here.\n\nTable 1: Total wall time (in seconds) on various segmentation tasks, and the number of components in g(x).\n\nimage | data size | standard mean shift | kd-tree | our method: # components | our method: time\nsquirrel | 60,192 (209x288) | 1215.8 | 11.94 | 81 | 0.18\nhand | 73,386 (243x302) | 1679.7 | 12.92 | 120 | 0.35\nhouse | 48,960 (192x255) | 1284.5 | 5.16 | 159 | 0.22\nlake | 262,144 (512x512) | 3343.0 | 85.65 | 440 | 3.67\n\nSegmentation results are shown in Figure 4. The rows, from top to bottom, are: the original image, the segmentation by standard mean shift, and the segmentation by our approach. We can see that our results are close to those of the standard mean shift (applied on the original density estimator), while the number of components (Table 1) is dramatically smaller than the data size n. This demonstrates the success of our approximation scheme in maintaining the structure of the data distribution using highly compact models. Our algorithm is also much faster than the standard mean shift and its fast version using kd-trees. The reason is that kd-trees only facilitate range searching, but do not reduce the expensive computations associated with the large number of kernels.\n\n4 Conclusion\n\nThe finite mixture is a powerful model in many statistical learning problems. However, the large model size can be a major hindrance in many applications. In this paper, we reduce the model complexity by first grouping the components into compact clusters, and then performing local function approximation that minimizes an upper bound on the approximation error. 
Our algorithm has low complexity, and demonstrates more reliable performance compared with methods using KL-based distances.\n\n1 http://www.caip.rutgers.edu/~comanici/MSPAMI/msPamiResults.html\n\nFigure 4: Image segmentation by standard mean shift (2nd row), and ours (bottom).\n\nReferences\n\n[1] D. Comaniciu and P. Meer. Mean shift: A robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(5):603-619, 2002.\n[2] T. Feder and D. Greene. Optimal algorithms for approximate clustering. In Proceedings of the ACM Symposium on Theory of Computing, pages 434-444, 1988.\n[3] K. Fukunaga and L. Hostetler. The estimation of the gradient of a density function, with applications in pattern recognition. IEEE Transactions on Information Theory, 21:32-40, 1975.\n[4] A. Gersho and R.M. Gray. Vector Quantization and Signal Compression. Kluwer Academic Press, Boston, 1992.\n[5] J. Goldberger and S. Roweis. Hierarchical clustering of a mixture model. In Advances in Neural Information Processing Systems 17, pages 505-512, 2005.\n[6] B. Han, D. Comaniciu, Y. Zhu, and L. Davis. Incremental density approximation and kernel-based Bayesian filtering for object tracking. In Proceedings of the International Conference on Computer Vision and Pattern Recognition, pages 638-644, 2004.\n[7] A.J. Izenman. Recent developments in nonparametric density estimation. Journal of the American Statistical Association, 86(413):205-224, 1991.\n[8] T. Kanungo, D.M. Mount, N.S. Netanyahu, C.D. Piatko, R. Silverman, and A.Y. Wu. An efficient k-means clustering algorithm: Analysis and implementation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(7):881-892, 2002.\n[9] A.W. Moore. Very fast EM-based mixture model clustering using multiresolution kd-trees. In Advances in Neural Information Processing Systems 11, pages 543-549, 1998.\n[10] D.M. Mount and S. Arya. ANN: A library for approximate nearest neighbor searching. In Proceedings of the Center for Geometric Computing Second Annual Fall Workshop on Computational Geometry (available from http://www.cs.umd.edu/~mount/ANN), 1997.\n[11] E. Parzen. On estimation of a probability density function and mode. Annals of Mathematical Statistics, 33:1065-1075, 1962.\n[12] K. Popat and R.W. Picard. Cluster-based probability model and its application to image and texture processing. IEEE Transactions on Image Processing, 6(2):268-284, 1997.\n[13] E.B. Sudderth, A. Torralba, W.T. Freeman, and A.S. Willsky. Describing visual scenes using transformed Dirichlet processes. In Advances in Neural Information Processing Systems 19, 2006.\n[14] P. Vincent and Y. Bengio. Manifold Parzen windows. In Advances in Neural Information Processing Systems 15, 2003.\n", "award": [], "sourceid": 3073, "authors": [{"given_name": "Kai", "family_name": "Zhang", "institution": null}, {"given_name": "James", "family_name": "Kwok", "institution": null}]}