{"title": "Spatial distance dependent Chinese restaurant processes for image segmentation", "book": "Advances in Neural Information Processing Systems", "page_first": 1476, "page_last": 1484, "abstract": "The distance dependent Chinese restaurant process (ddCRP) was recently introduced to accommodate random partitions of non-exchangeable data. The ddCRP clusters data in a biased way: each data point is more likely to be clustered with other data that are near it in an external sense. This paper examines the ddCRP in a spatial setting with the goal of natural image segmentation. We explore the biases of the spatial ddCRP model and propose a novel hierarchical extension better suited for producing \"human-like\" segmentations. We then study the sensitivity of the models to various distance and appearance hyperparameters, and provide the first rigorous comparison of nonparametric Bayesian models in the image segmentation domain. On unsupervised image segmentation, we demonstrate that similar performance to existing nonparametric Bayesian models is possible with substantially simpler models and algorithms.", "full_text": "Spatial distance dependent Chinese restaurant\n\nprocesses for image segmentation\n\nSoumya Ghosh1, Andrei B. Ungureanu2, Erik B. Sudderth1, and David M. Blei3\n\n1Department of Computer Science, Brown University, {sghosh,sudderth}@cs.brown.edu\n\n2Morgan Stanley, andrei.b.ungureanu@gmail.com\n\n3Department of Computer Science, Princeton University, blei@cs.princeton.edu\n\nAbstract\n\nThe distance dependent Chinese restaurant process (ddCRP) was recently intro-\nduced to accommodate random partitions of non-exchangeable data [1]. The dd-\nCRP clusters data in a biased way: each data point is more likely to be clustered\nwith other data that are near it in an external sense. This paper examines the dd-\nCRP in a spatial setting with the goal of natural image segmentation. 
We explore the biases of the spatial ddCRP model and propose a novel hierarchical extension better suited for producing \u201chuman-like\u201d segmentations. We then study the sensitivity of the models to various distance and appearance hyperparameters, and provide the first rigorous comparison of nonparametric Bayesian models in the image segmentation domain. On unsupervised image segmentation, we demonstrate that similar performance to existing nonparametric Bayesian models is possible with substantially simpler models and algorithms.\n\n1 Introduction\n\nThe Chinese restaurant process (CRP) is a distribution on partitions of integers [2]. When used in a mixture model, it provides an alternative representation of a Bayesian nonparametric Dirichlet process mixture\u2014the data are clustered and the number of clusters is determined via the posterior distribution. CRP mixtures assume that the data are exchangeable, i.e., their order does not affect the distribution of cluster structure. This can provide computational advantages and simplify approximate inference, but is often an unrealistic assumption.\n\nThe distance dependent Chinese restaurant process (ddCRP) was recently introduced to model random partitions of non-exchangeable data [1]. The ddCRP clusters data in a biased way: each data point is more likely to be clustered with other data that are near it in an external sense. For example, when clustering time series data, points that are closer in time are more likely to be grouped together. Previous work [1] developed the ddCRP mixture in general, and derived posterior inference algorithms based on Gibbs sampling [3]. While they studied the ddCRP in time-series and sequential settings, ddCRP models can be used with any type of distance and external covariates. Recently, other researchers [4] have also used the ddCRP in non-temporal settings.\n\nIn this paper, we study the ddCRP in a spatial setting. 
We use a spatial distance function between\npixels in natural images and cluster them to provide an unsupervised segmentation. The spatial dis-\ntance encourages the discovery of connected segments. We also develop a region-based hierarchical\ngeneralization, the rddCRP. Analogous to the hierarchical Dirichlet process (HDP) [5], the rddCRP\nclusters groups of data, where cluster components are shared across groups. Unlike the HDP, how-\never, the rddCRP allows within-group clusterings to depend on external distance measurements.\n\nTo demonstrate the power of this approach, we develop posterior inference algorithms for segment-\ning images with ddCRP and rddCRP mixtures. Image segmentation is an extensively studied area,\n\n1\n\n\fC\n\nC\n\nRemoving C leaves clustering unchanged\nAdding C leaves the clustering unchanged\n\nRemoving C splits the cluster\nAdding C merges the cluster\n\nFigure 1: Left: An illustration of the relationship between the customer assignment representation and the\ntable assignment representation. Each square is a data point (a pixel or superpixel) and each arrow is a customer\nassignment. Here, the distance window is of length 1. The corresponding table assignments, i.e., the clustering\nof these data, is shown by the color of the data points. Right: Intuitions behind the two cases considered by the\nGibbs sampler. Consider the link from node C. When removed, it may leave the clustering unchanged or split\na cluster. When added, it may leave the clustering unchanged or merge two clusters.\n\nwhich we will not attempt to survey here. In\ufb02uential existing methods include approaches based\non kernel density estimation [6], Markov random \ufb01elds [3, 7], and the normalized cut spectral clus-\ntering algorithm [8, 9]. 
A recurring difficulty encountered by traditional methods is the need to determine an appropriate segment resolution for each image; even among images of similar scene types, the number of observed objects can vary widely. This has usually been dealt with via heuristics with poorly understood biases, or by simplifying the problem (e.g., partially specifying each image\u2019s segmentation via manual user input [7]).\n\nRecently, several promising segmentation algorithms have been proposed based on nonparametric Bayesian methods [10, 11, 12]. In particular, an approach which couples Pitman-Yor mixture models [13] via thresholded Gaussian processes [14] has led to very promising initial results [10], and provides a baseline for our later experiments. Expanding on the experiments in [10], we analyze 800 images of different natural scene types, and show that the comparatively simpler ddCRP-based algorithms perform similarly to this work. Moreover, unlike previous nonparametric Bayesian approaches, the structure of the ddCRP allows spatial connectivity of the inferred segments to (optionally) be enforced. In some applications, this is a known property of all reasonable segmentations.\n\nOur results demonstrate the practical utility of spatial ddCRP and hierarchical rddCRP models. We also provide the first rigorous comparison of nonparametric Bayesian image segmentation models.\n\n2 Image Segmentation with Distance Dependent CRPs\n\nOur goal is to develop a probabilistic method to segment images of complex scenes. Image segmentation is the problem of partitioning an image into self-similar groups of adjacent pixels. Segmentation is an important step towards other tasks in image understanding, such as object recognition, detection, or tracking. We model images as observed collections of \u201csuperpixels\u201d [15], which are small blocks of spatially adjacent pixels. 
Our goal is to associate the features xi in the ith superpixel with some cluster zi; these clusters form the segments of that image.\n\nImage segmentation is thus a special kind of clustering problem where the desired solution has two properties. First, we hope to find contiguous regions of the image assigned to the same cluster. Due to physical processes such as occlusion, it may be appropriate to find segments that contain two or three contiguous image regions, but we do not want a cluster that is scattered across individual image pixels. Traditional clustering algorithms, such as k-means or probabilistic mixture models, do not account for external information such as pixel location and are not biased towards contiguous regions. Image locations have been heuristically incorporated into Gaussian mixture models by concatenating positions with appearance features in a vector [16], but the resulting bias towards elliptical regions often produces segmentation artifacts. Second, we would like a solution that determines the number of clusters from the image. Image segmentation algorithms are typically applied to collections of images of widely varying scenes, which are likely to require different numbers of segments. Except in certain restricted domains such as medical image analysis, it is not practical to use an algorithm that requires knowing the number of segments in advance.\n\nIn the following sections, we develop a Bayesian algorithm for image segmentation based on the distance dependent Chinese restaurant process (ddCRP) mixture model [1]. 
Our algorithm \ufb01nds\nspatially contiguous segments in the image and determines an image-speci\ufb01c number of segments\nfrom the observed data.\n\n2.1 Chinese restaurant process mixtures\n\nThe ddCRP mixture is an extension of the traditional Chinese restaurant process (CRP) mixture.\nCRP mixtures provide a clustering method that determines the number of clusters from the data\u2014\nthey are an alternative formulation of the Dirichlet process mixture model. The assumed generative\nprocess is described by imagining a restaurant with an in\ufb01nite number of tables, each of which is\nendowed with a parameter for some family of data generating distributions (in our experiments,\nDirichlet). Customers enter the restaurant in sequence and sit at a randomly chosen table. They sit\nat the previously occupied tables with probability proportional to how many customers are already\nsitting at each; they sit at an unoccupied table with probability proportional to a scaling parameter.\nAfter the customers have entered the restaurant, the \u201cseating plan\u201d provides a clustering. Finally,\neach customer draws an observation from a distribution determined by the parameter at the assigned\ntable.\n\nConditioned on observed data, the CRP mixture provides a posterior distribution over table assign-\nments and the parameters attached to those tables. It is a distribution over clusterings, where the\nnumber of clusters is determined by the data. Though described sequentially, the CRP mixture is an\nexchangeable model: the posterior distribution over partitions does not depend on the ordering of\nthe observed data.\n\nTheoretically, exchangeability is necessary to make the connection between CRP mixtures and\nDirichlet process mixtures. Practically, exchangeability provides ef\ufb01cient Gibbs sampling algo-\nrithms for posterior inference. 
However, exchangeability is not an appropriate assumption in image segmentation problems\u2014the locations of the image pixels are critical to providing contiguous segmentations.\n\n2.2 Distance dependent CRPs\n\nThe distance dependent Chinese restaurant process (ddCRP) is a generalization of the Chinese restaurant process that allows for a non-exchangeable distribution on partitions [1]. Rather than represent a partition by customers assigned to tables, the ddCRP models customers linking to other customers. The seating plan is a byproduct of these links\u2014two customers are sitting at the same table if one can reach the other by traversing the customer assignments. As in the CRP, tables are endowed with data generating parameters. Once the partition is determined, the observed data for each customer are generated by the per-table parameters.\n\nAs illustrated in Figure 1, the generative process is described in terms of customer assignments ci (as opposed to partition assignments or tables, zi). The distribution of customer assignments is\n\np(c_i = j \mid D, f, \alpha) \propto \begin{cases} f(d_{ij}) & j \neq i, \\ \alpha & j = i. \end{cases} \quad (1)\n\nHere d_{ij} is a distance between data points i and j, and f(d) is called the decay function. The decay function mediates how the distance between two data points affects their probability of connecting to each other, i.e., their probability of belonging to the same cluster.\n\nDetails of the ddCRP are found in [1]. We note that the traditional CRP is an instance of a ddCRP. However, in general, the ddCRP does not correspond to a model based on a random measure, like the Dirichlet process. The ddCRP is appropriate for image segmentation because it can naturally account for the spatial structure of the superpixels through its distance function. We use a spatial distance between pixels to enforce a bias towards contiguous clusters. 
Though the ddCRP has been previously described in general, only time-based distances are studied in [1].\n\nFigure 2: Comparison of distance-dependent segmentation priors. From left to right, we show segmentations produced by the ddCRP with a = 1, the ddCRP with a = 2, the ddCRP with a = 5, and the rddCRP with a = 1.\n\nRestaurants represent images, tables represent segment assignments, and customers represent superpixels. The distance between superpixels is modeled as the number of hops required to reach one superpixel from another, with hops being allowed only amongst spatially neighboring superpixels. A \u201cwindow\u201d decay function of width a, f(d) = 1[d \u2264 a], determines link probabilities. If a = 1, superpixels can only directly connect to adjacent superpixels. Note that this does not explicitly restrict the size of segments, because any pair of pixels for which one is reachable from the other (i.e., in the same connected component of the customer assignment graph) are in the same image segment. For this special case, segments are guaranteed with probability one to form spatially connected subsets of the image, a property not enforced by other Bayesian nonparametric models [10, 11, 12].\n\nThe full generative process for the observed features x1:N within an N-superpixel image is as follows:\n\n1. For each table, sample parameters \u03c6k \u223c G0.\n2. For each customer, sample a customer assignment ci \u223c ddCRP(\u03b1, f, D). This indirectly determines the cluster assignments z1:N, and thus the segmentation.\n3. For each superpixel, independently sample observed data xi \u223c P(\u00b7 | \u03c6zi).\n\nThe customer assignments are sampled using the spatial distance between pixels. 
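To make the prior concrete, the generative process above can be sketched in code. The following is a minimal illustration, not the authors' implementation: it assumes a rectangular grid of superpixels, uses Manhattan distance in place of hop distance (the two coincide on a full grid), and all function and variable names are our own.

```python
import numpy as np

def sample_spatial_ddcrp(H, W, a=1, alpha=1e-8, seed=0):
    """Sample customer links on an H x W grid with window decay
    f(d) = 1[d <= a], then derive the segmentation as the connected
    components of the (undirected) customer-link graph."""
    rng = np.random.default_rng(seed)
    N = H * W
    coords = [(i // W, i % W) for i in range(N)]
    links = np.empty(N, dtype=int)
    for i, (r, c) in enumerate(coords):
        # Candidates: the self-link (weight alpha) and every j != i
        # within hop distance a (weight f(d) = 1 inside the window).
        cand, wts = [i], [alpha]
        for j, (r2, c2) in enumerate(coords):
            d = abs(r - r2) + abs(c - c2)
            if j != i and d <= a:
                cand.append(j)
                wts.append(1.0)
        wts = np.array(wts) / np.sum(wts)
        links[i] = rng.choice(cand, p=wts)
    # Tables = connected components, via a small union-find.
    parent = np.arange(N)
    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u
    for i in range(N):
        parent[find(i)] = find(links[i])
    return np.array([find(i) for i in range(N)]).reshape(H, W)
```

With a = 1, each customer links either to itself or to a spatially adjacent superpixel, so every resulting component is spatially connected, matching the connectivity guarantee noted above.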
The partition\nstructure, derived from the customer assignments, is used to sample the observed image features.\nGiven an image, the posterior distribution of the customer assignments induces a posterior over the\ncluster structure; this provides the segmentation. See Figure 1 for an illustration of the customer\nassignments and their derived table assignments in a segmentation setting.\n\nAs in [10], the data generating distribution for the observed features studied in Section 4 is multino-\nmial, with separate distributions for color and texture. We place conjugate Dirichlet priors on these\ncluster parameters.\n\n2.3 Region-based hierarchical distance dependent CRPs\n\nThe ddCRP model, when applied to an image with window size a = 1, produces a collection\nof contiguous patches (tables) homogeneous in color and texture features (Figure 2). While such\nsegmentations are useful for various applications [16], they do not re\ufb02ect the statistics of manual\nhuman segmentations, which contain larger regions [17]. We could bias our model to produce such\nregions by either increasing the window size a, or by introducing a hierarchy wherein the produced\npatches are grouped into a small number of regions. This region level model has each patch (table)\nassociated with a region k from a set of potentially in\ufb01nite regions. Each region in turn is associated\nwith an appearance model \u03c6k. The corresponding generative process is described as follows:\n\n1. For each customer, sample customer assignments ci \u223c ddCRP (\u03b1, f, D). This determines\n\nthe table assignments t1:N .\n\n2. For each table t, sample region assignments kt \u223c CRP (\u03b3).\n3. For each region, sample parameters \u03c6k \u223c G0.\n4. 
For each superpixel, independently sample observed data xi \u223c P(\u00b7 | \u03c6zi), where zi = kti.\n\nNote that this region level rddCRP model is a direct extension of the Chinese restaurant franchise (CRF) representation of the HDP [5], with the image partition being drawn from a ddCRP instead of a CRP. In contrast to prior applications of the HDP, our region parameters are not shared amongst images, although it would be simple to generalize to this case. Figure 3 plots samples from the rddCRP and ddCRP priors with increasing a. The rddCRP produces larger partitions than the ddCRP with a = 1, while avoiding the noisy boundaries produced by a ddCRP with large a (see Figure 2).\n\n3 Inference with Gibbs Sampling\n\nA segmentation of an observed image is found by posterior inference. The problem is to compute the conditional distribution of the latent variables\u2014the customer assignments c1:N\u2014conditioned on the observed image features x1:N, the scaling parameter \u03b1, the distances between pixels D, the window size a, and the base distribution hyperparameter \u03bb:\n\np(c_{1:N} \mid x_{1:N}, \alpha, D, a, \lambda) = \frac{\left( \prod_{i=1}^{N} p(c_i \mid D, a, \alpha) \right) p(x_{1:N} \mid z(c_{1:N}), \lambda)}{\sum_{c_{1:N}} \left( \prod_{i=1}^{N} p(c_i \mid D, a, \alpha) \right) p(x_{1:N} \mid z(c_{1:N}), \lambda)} \quad (2)\n\nwhere z(c1:N) is the cluster representation that is derived from the customer representation c1:N. Notice again that the prior term uses the customer representation to take into account distances between data points; the likelihood term uses the cluster representation.\n\nThe posterior in Equation (2) is not tractable to directly evaluate, due to the combinatorial sum in the denominator. We instead use Gibbs sampling [3], a simple form of Markov chain Monte Carlo (MCMC) inference [18]. 
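In outline, one sweep of the collapsed sampler described in this section detaches a customer link, scores each candidate reassignment by the prior times the likelihood change it induces (a merge is scored by a marginal-likelihood ratio), and samples a new link. The sketch below is our own simplified version, not the authors' code; `log_marginal` is a stand-in for the collapsed log evidence of a set of data points (e.g., Dirichlet-multinomial), and all helper names are hypothetical.

```python
import numpy as np

def components(links):
    """Cluster labels as connected components of the link graph."""
    n = len(links)
    parent = list(range(n))
    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u
    for i in range(n):
        parent[find(i)] = find(links[i])
    roots = [find(i) for i in range(n)]
    relabel = {r: k for k, r in enumerate(dict.fromkeys(roots))}
    return np.array([relabel[r] for r in roots])

def gibbs_sweep(links, log_prior, log_marginal, rng):
    """One sweep of the collapsed sampler.

    log_prior[i, j]  -- log f(d_ij) (log alpha when j == i, -inf if disallowed)
    log_marginal(S)  -- collapsed log evidence of the data points indexed by S
    """
    n = len(links)
    for i in range(n):
        links[i] = i                      # detach c_i (may split a cluster)
        z = components(links)
        base = {k: log_marginal(np.flatnonzero(z == k)) for k in np.unique(z)}
        scores = np.empty(n)
        for j in range(n):
            lp = log_prior[i, j]
            if np.isneginf(lp):
                scores[j] = -np.inf
                continue
            if z[j] != z[i]:              # linking to j merges two tables
                merged = np.flatnonzero((z == z[i]) | (z == z[j]))
                lp += log_marginal(merged) - base[z[i]] - base[z[j]]
            scores[j] = lp
        p = np.exp(scores - scores.max())
        links[i] = rng.choice(n, p=p / p.sum())
    return links
```

The two cases inside the loop mirror the split/merge intuition of Figure 1: reassigning within the current component leaves the likelihood unchanged, while linking across components pays (or gains) the merge ratio.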
We define the Markov chain by iteratively sampling each latent variable ci conditioned on the others and the observations,\n\np(c_i \mid c_{-i}, x_{1:N}, D, \alpha, \lambda) \propto p(c_i \mid D, \alpha)\, p(x_{1:N} \mid z(c_{1:N}), \lambda). \quad (3)\n\nThe prior term is given in Equation (1). We can decompose the likelihood term as follows:\n\np(x_{1:N} \mid z(c_{1:N}), \lambda) = \prod_{k=1}^{K(c_{1:N})} p(x_{z(c_{1:N})=k} \mid z(c_{1:N}), \lambda). \quad (4)\n\nWe have introduced notation to more easily move between the customer representation\u2014the primary latent variables of our model\u2014and the cluster representation. Let K(c1:N) denote the number of unique clusters in the customer assignments, z(c1:N) the cluster assignments derived from the customer assignments, and x_{z(c1:N)=k} the collection of observations assigned to cluster k. We assume that the cluster parameters \u03c6k have been analytically marginalized. This is possible when the base distribution G0 is conjugate to the data generating distribution, e.g., Dirichlet to multinomial.\n\nSampling from Equation (3) happens in two stages. First, we remove the customer link ci from the current configuration. Then, we consider the prior probability of each possible value of ci and how it changes the likelihood term, by moving from p(x1:N | z(c\u2212i), \u03bb) to p(x1:N | z(c1:N), \u03bb). In the first stage, removing ci either leaves the cluster structure intact, i.e., z(c^{old}_{1:N}) = z(c\u2212i), or splits the cluster assigned to data point i into two clusters. In the second stage, randomly reassigning ci either leaves the cluster structure intact, i.e., z(c\u2212i) = z(c1:N), or joins the cluster assigned to data point i to another. See Figure 1 for an illustration of these cases. Via these moves, the sampler explores the space of possible segmentations.\n\nLet \u2113 and m be the indices of the tables that are joined to form table k. We first remove ci, possibly splitting a cluster. 
Then we sample from\n\np(c_i \mid c_{-i}, x_{1:N}, D, \alpha, \lambda) \propto \begin{cases} p(c_i \mid D, \alpha)\, \Gamma(x, z, \lambda) & \text{if } c_i \text{ joins } \ell \text{ and } m; \\ p(c_i \mid D, \alpha) & \text{otherwise,} \end{cases} \quad (5)\n\nwhere\n\n\Gamma(x, z, \lambda) = \frac{p(x_{z(c_{1:N})=k} \mid \lambda)}{p(x_{z(c_{1:N})=\ell} \mid \lambda)\, p(x_{z(c_{1:N})=m} \mid \lambda)}. \quad (6)\n\nThis defines a Markov chain whose stationary distribution is the posterior of the spatial ddCRP defined in Section 2. Though our presentation is slightly different, this algorithm is equivalent to the one developed for ddCRP mixtures in [1].\n\nIn the rddCRP, the algorithm for sampling the customer indicators is nearly the same, but with two differences. First, when ci is removed, it may spawn a new cluster. In that case, the region identity of the new table must be sampled from the region level CRP. Second, the likelihood term in Equation (4) depends only on the superpixels in the image assigned to the segment in question. In the rddCRP, it also depends on other superpixels assigned to segments that are assigned to the same region. Finally, the rddCRP also requires resampling of region assignments as follows:\n\np(k_t = \ell \mid k_{-t}, x_{1:N}, t(c_{1:N}), \gamma, \lambda) \propto \begin{cases} m_{\ell}^{-t}\, p(x_t \mid x_{-t}, \lambda) & \text{if } \ell \text{ is used;} \\ \gamma\, p(x_t \mid \lambda) & \text{if } \ell \text{ is new.} \end{cases} \quad (7)\n\nHere, x_t is the set of customers sitting at table t, x_{-t} is the set of all customers associated with region \u2113 excluding x_t, and m_{\ell}^{-t} is the number of tables associated with region \u2113 excluding x_t.\n\n4 Empirical Results\n\nWe compare the performance of the ddCRP to manual segmentations of images drawn from eight natural scene categories [19]. Non-expert users segmented each image into polygonal shapes, and labeled them as distinct objects. The collection, which is available from LabelMe [17], contains a total of 2,688 images.1 We randomly select 100 images from each category. 
This image collection has been previously used to analyze an image segmentation method based on spatially dependent Pitman-Yor (PY) processes [10], and we compare both methods using an identical feature set. Each image is first divided into approximately 1000 superpixels [15, 20]2 using the normalized cut algorithm [9].3 We describe the texture of each superpixel via a local texton histogram [21], using band-pass filter responses quantized to 128 bins. A 120-bin HSV color histogram is also computed. Each superpixel i is summarized via these histograms xi.\n\nOur goal is to make a controlled comparison to alternative nonparametric Bayesian methods on a challenging task. Performance is assessed via agreement with held-out human segmentations, via the Rand index [22]. We also present segmentation results for qualitative evaluation in Figures 3 and 4.\n\n4.1 Sensitivity to Hyperparameters\n\nOur models are governed by the CRP concentration parameters \u03b3 and \u03b1, the appearance base measure hyperparameter \u03bb = (\u03bb0, . . . , \u03bb0), and the window size a. Empirically, \u03b3 has little impact on the segmentation results, due to the high-dimensional and informative image features; all our experiments set \u03b3 = 1. \u03b1 and \u03bb0 induce opposing biases: a small \u03b1 encourages larger segments, while a large \u03bb0 encourages smaller segments. We found \u03b1 = 10\u22128 and \u03bb0 = 20 to work well.\n\nThe most influential prior parameter is the window size a, the effect of which is visualized in Figure 3. For the ddCRP model, setting a = 1 (ddCRP1) produces a set of small but contiguous segments. Increasing to a = 2 (ddCRP2) results in fewer segments, but the produced segments are typically spatially fragmented. This phenomenon is further exacerbated with larger values of a. The rddCRP model groups segments produced by a ddCRP. 
Because it is hard to recover meaningful partitions if these initial segments are poor, the rddCRP performs best when a = 1.\n\n4.2 Image Segmentation Performance\n\nWe now quantitatively measure the performance of our models. The ddCRP and the rddCRP samplers were run for 100 and 500 iterations, respectively. Both samplers displayed rapid mixing and often stabilized within the first 50 iterations. Note that similar rapid mixing has been observed in other applications of the ddCRP [1].\n\nWe also compare to two previous models [10]: a PY mixture model with no spatial dependence (pybof20), and a PY mixture with spatial coupling induced via thresholded Gaussian processes (pydist20). To control the comparison as much as possible, the PY models are tested with identical features and base measure \u03b2, and other hyperparameters as in [10]. We also compare to the non-spatial PY with \u03bb0 = 1, the best bag-of-feature model in our experiments (pybof). We employ\n\n1http://labelme.csail.mit.edu/browseLabelMe/\n2http://www.cs.sfu.ca/~mori/\n3http://www.eecs.berkeley.edu/Research/Projects/CS/vision/\n\nFigure 3: Segmentations produced by various Bayesian nonparametric methods. From left to right, the columns display natural images, segmentations for the ddCRP with a = 1, the ddCRP with a = 2, the rddCRP with a = 1, and thresholded Gaussian processes (pydist20) [10]. The top row displays partitions sampled from the corresponding priors, which have 130, 54, 5, and 6 clusters, respectively.\n\nFigure 4: Top left: Average segmentation performance on the database of natural scenes, as measured by the Rand index (larger is better), and those pairs of methods for which a Wilcoxon\u2019s signed rank test indicates comparable performance with 95% confidence. In the binary image, dark pixels indicate pairs that are statistically indistinguishable. 
Note that the rddCRP, spatial PY, and mean shift methods are statistically indistinguishable, and significantly better than all others. Bottom left: Scatter plots comparing the pydist20 and rddCRP methods in the Mountain and Street scene categories. Right: Example segmentations produced by the rddCRP.\n\nnon-hierarchical versions of the PY models, so that each image is analyzed independently, and perform inference via the previously developed mean field variational method. Finally, from the vision literature we also compare to the normalized cuts (Ncuts) [8] and mean shift (MS) [6] segmentation algorithms.4\n\n4We used the EDISON implementation of mean shift. The parameters of mean shift and normalized cuts were tuned by performing a grid search over a training set containing 25 images from each of the 8 categories. For normalized cuts the optimal number of segments was determined to be 5. For mean shift we held the spatial\n\nQuantitative performance is summarized in Figure 4. The rddCRP outscores both versions of the ddCRP model, in terms of Rand index. Nevertheless, the patchy ddCRP1 segmentations are interesting for applications where segmentation is an intermediate step rather than the final goal. The bag of features model with \u03bb0 = 20 performs poorly; with optimized \u03bb0 = 1 it is better, but still inferior to the best spatial models.\n\nIn general, the spatial PY and rddCRP perform similarly. The scatter plots in Figure 4, which show Rand indices for each image from the mountain and street categories, provide insights into when one model outperforms the other. For the street images the rddCRP is better, while for images containing mountains the spatial PY is superior. In general, street scenes contain more objects, many of which are small, and thus disfavored by the smooth Gaussian processes underlying the PY model. 
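For reference, the Rand index [22] used in these comparisons is the fraction of data-point pairs on which two labelings agree (placed together in both, or apart in both); it can be computed efficiently from the contingency table of the two labelings. A small illustrative implementation (our own, not code from [22]):

```python
import numpy as np

def rand_index(a, b):
    """Rand index: fraction of pairs on which labelings a and b agree,
    i.e. pairs placed together in both or apart in both."""
    a, b = np.ravel(a), np.ravel(b)
    n = a.size
    # Contingency table between the two labelings.
    _, ai = np.unique(a, return_inverse=True)
    _, bi = np.unique(b, return_inverse=True)
    C = np.zeros((ai.max() + 1, bi.max() + 1))
    np.add.at(C, (ai, bi), 1)

    def comb2(x):          # number of unordered pairs, x choose 2
        return x * (x - 1) / 2.0

    together_both = comb2(C).sum()
    together_a = comb2(C.sum(axis=1)).sum()
    together_b = comb2(C.sum(axis=0)).sum()
    total = comb2(n)
    # Agreeing pairs = together in both + apart in both.
    return (total + 2 * together_both - together_a - together_b) / total
```

The index is 1.0 for identical partitions (up to relabeling) and decreases as the two partitions disagree on more pairs.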
To most fairly compare priors, we have tested a version of the spatial PY model employing a covariance function that depends only on spatial distance. Further performance improvements were demonstrated in [10] via a conditionally specified covariance, which depends on detected image boundaries. Similar conditional specification of the ddCRP distance function is a promising direction for future research.\n\nFinally, we note that the ddCRP (and rddCRP) models proposed here are far simpler than the spatial PY model, both in terms of model specification and inference. The ddCRP models only require pairwise superpixel distances to be specified, as opposed to the positive definite covariance function required by the spatial PY model. Furthermore, the PY model\u2019s usage of thresholded Gaussian processes leads to a complex likelihood function, for which inference is a significant challenge. In contrast, ddCRP inference is carried out through a straightforward sampling algorithm,5 and thus may provide a simpler foundation for building rich models of visual scenes.\n\nbandwidth constant at 7, and found optimal values of feature bandwidth and minimum region size to be 25 and 4000 pixels, respectively.\n\n5In our Matlab implementations, the core ddCRP code was less than half as long as the corresponding PY code. For the ddCRP, the computation time was 1 minute per iteration, and convergence typically happened after only a few iterations. The PY code, which is based on variational approximations, took 12 minutes per image.\n\n5 Discussion\n\nWe have studied the properties of spatial distance dependent Chinese restaurant processes, and applied them to the problem of image segmentation. We showed that the spatial ddCRP model is particularly well suited for segmenting an image into a collection of contiguous patches. Unlike previous Bayesian nonparametric models, it can produce segmentations with guaranteed spatial connectivity. To go from patches to coarser, human-like segmentations, we developed a hierarchical region-based ddCRP. This hierarchical model achieves performance similar to state-of-the-art nonparametric Bayesian segmentation algorithms, using a simpler model and a substantially simpler inference algorithm.\n\nReferences\n\n[1] D. M. Blei and P. I. Frazier. Distance dependent Chinese restaurant processes. Journal of Machine Learning Research, 12:2461\u20132488, August 2011.\n\n[2] J. Pitman. Combinatorial Stochastic Processes. Lecture Notes for St. Flour Summer School. Springer-Verlag, New York, NY, 2002.\n\n[3] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6(6):721\u2013741, November 1984.\n\n[4] Richard Socher, Andrew Maas, and Christopher D. Manning. Spectral Chinese restaurant processes: Nonparametric clustering based on similarities. In Fourteenth International Conference on Artificial Intelligence and Statistics (AISTATS), 2011.\n\n[5] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566\u20131581, 2006.\n\n[6] D. Comaniciu and P. Meer. Mean shift: A robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(5):603\u2013619, 2002.\n\n[7] C. Rother, V. Kolmogorov, and A. Blake. GrabCut: Interactive foreground extraction using iterated graph cuts. In ACM Transactions on Graphics (TOG), volume 23, pages 309\u2013314, 2004.\n\n[8] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Trans. PAMI, 22(8):888\u2013905, 2000.\n\n[9] C. Fowlkes, D. Martin, and J. Malik. Learning affinity functions for image segmentation: Combining patch-based and gradient-based approaches. 
CVPR, 2:54\u201361, 2003.\n\n[10] E. B. Sudderth and M. I. Jordan. Shared segmentation of natural scenes using dependent Pitman-Yor processes. NIPS 22, 2008.\n\n[11] P. Orbanz and J. M. Buhmann. Smooth image segmentation by nonparametric Bayesian inference. In ECCV, volume 1, pages 444\u2013457, 2006.\n\n[12] Lan Du, Lu Ren, David Dunson, and Lawrence Carin. A Bayesian model for simultaneous image clustering, annotation and object segmentation. In NIPS 22, pages 486\u2013494, 2009.\n\n[13] J. Pitman and M. Yor. The two-parameter Poisson\u2013Dirichlet distribution derived from a stable subordinator. Annals of Probability, 25(2):855\u2013900, 1997.\n\n[14] J. A. Duan, M. Guindani, and A. E. Gelfand. Generalized spatial Dirichlet process models. Biometrika, 94(4):809\u2013825, 2007.\n\n[15] X. Ren and J. Malik. Learning a classification model for segmentation. ICCV, 2003.\n\n[16] C. Carson, S. Belongie, H. Greenspan, and J. Malik. Blobworld: Image segmentation using expectation-maximization and its application to image querying. PAMI, 24(8):1026\u20131038, August 2002.\n\n[17] B. C. Russell, A. Torralba, K. P. Murphy, and W. T. Freeman. LabelMe: A database and web-based tool for image annotation. IJCV, 77:157\u2013173, 2008.\n\n[18] C. Robert and G. Casella. Monte Carlo Statistical Methods. Springer Texts in Statistics. Springer-Verlag, New York, NY, 2004.\n\n[19] A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. IJCV, 42(3):145\u2013175, 2001.\n\n[20] G. Mori. Guiding model search using segmentation. ICCV, 2005.\n\n[21] D. R. Martin, C. C. Fowlkes, and J. Malik. Learning to detect natural image boundaries using local brightness, color, and texture cues. IEEE Trans. PAMI, 26(5):530\u2013549, 2004.\n\n[22] W. M. Rand. Objective criteria for the evaluation of clustering methods. 
Journal of the American Statistical Association, 66(336):846\u2013850, 1971.\n", "award": [], "sourceid": 843, "authors": [{"given_name": "Soumya", "family_name": "Ghosh", "institution": null}, {"given_name": "Andrei", "family_name": "Ungureanu", "institution": null}, {"given_name": "Erik", "family_name": "Sudderth", "institution": null}, {"given_name": "David", "family_name": "Blei", "institution": null}]}