{"title": "Pairwise Clustering and Graphical Models", "book": "Advances in Neural Information Processing Systems", "page_first": 185, "page_last": 192, "abstract": "", "full_text": "Pairwise Clustering and Graphical Models\n\nNoam Shental\nComputer Science & Eng.\nCenter for Neural Computation\nHebrew University of Jerusalem\nJerusalem, Israel 91904\nfenoam@cs.huji.ac.il\n\nTomer Hertz\nComputer Science & Eng.\nCenter for Neural Computation\nHebrew University of Jerusalem\nJerusalem, Israel 91904\ntomboy@cs.huji.ac.il\n\nAssaf Zomet\nComputer Science & Eng.\nHebrew University of Jerusalem\nJerusalem, Israel 91904\nzomet@cs.huji.ac.il\n\nYair Weiss\nComputer Science & Eng.\nCenter for Neural Computation\nHebrew University of Jerusalem\nJerusalem, Israel 91904\nyweiss@cs.huji.ac.il\n\nAbstract\n\nSignificant progress in clustering has been achieved by algorithms that are based on pairwise affinities between the datapoints. In particular, spectral clustering methods have the advantage of being able to divide arbitrarily shaped clusters and are based on efficient eigenvector calculations. However, spectral methods lack a straightforward probabilistic interpretation, which makes it difficult to automatically set parameters using training data.\nIn this paper we use the previously proposed typical cut framework for pairwise clustering. We show an equivalence between calculating the typical cut and inference in an undirected graphical model. We show that for clustering problems with hundreds of datapoints exact inference may still be possible. For more complicated datasets, we show that loopy belief propagation (BP) and generalized belief propagation (GBP) can give excellent results on challenging clustering problems. 
We also use graphical models to derive a learning algorithm for affinity matrices based on labeled data.\n\n1 Introduction\n\nConsider the set of points shown in figure 1a. Datasets of this type, where the two clusters are not easily described by a parametric model, can be successfully clustered using pairwise clustering algorithms [4, 6, 3]. These algorithms start by building a graph whose vertices correspond to datapoints; edges exist between nearby points, with a weight that decreases with distance. Clustering the points is then equivalent to graph partitioning.\n\nFigure 1: Clustering as graph partitioning (following [8]). Vertices correspond to datapoints and edges between adjacent points are weighted by the distance. A single isolated datapoint is marked by an arrow.\n\nHow would we define a good partitioning? One option is the minimal cut criterion. Define:\n\ncut(A, B) = Σ_{i∈A, j∈B} W(i,j)   (1)\n\nwhere W(i,j) is the strength of the weight between nodes i and j in the graph. The minimal cut criterion finds clusterings that minimize cut(A, B).\nThe advantage of using the minimal cut criterion is that the optimal segmentation can be computed in polynomial time. A disadvantage, pointed out by Shi and Malik [8], is that it will often produce trivial segmentations. Since the cut value grows linearly with the number of edges cut, a single datapoint cut from its neighbors will often have a lower cut value than the desired clustering (e.g., the minimal cut solution separates the single isolated datapoint in figure 1, instead of the desired 'N' and 'I' clusters).\nIn order to avoid these trivial clusterings, several graph partitioning criteria have been proposed. Shi and Malik suggested the normalized cut criterion, which directly penalizes partitions where one of the groups is small; hence a separation of a single isolated datapoint is not favored. 
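A direct reading of equation 1 as code; the set-plus-dictionary graph representation is our own illustrative choice, not part of the paper:

```python
def cut_value(A, B, W):
    """cut(A, B) of equation 1: total weight of the edges with one
    endpoint in set A and the other in set B.  W maps edge (i, j) to
    its weight W(i, j)."""
    return sum(w for (i, j), w in W.items()
               if (i in A and j in B) or (i in B and j in A))
```

On a toy chain this reproduces the triviality of minimal cut: splitting off a weakly attached point gives a smaller cut than the "natural" split.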
Minimization of the normalized cut criterion is NP-complete, but it can be approximated using spectral methods.\nDespite the success of spectral methods in a wide range of clustering problems, several problems remain. Perhaps the most important one is the lack of a straightforward probabilistic interpretation. However, interesting progress in this direction has been made by Meila and Shi [4], who showed a relation between the top eigenvectors and the equilibrium distribution of a random walk on the graph.\nThe typical cut criterion, suggested by Blatt et al [1] and later by Gdalyahu et al [2], is based on a simple probabilistic model. Blatt et al first define a probability distribution over possible partitions:\n\nPr(A, B) = (1/Z) e^{-cut(A,B)/T}   (2)\n\nwhere Z is a normalizing constant and the “temperature” T serves as a free parameter.\nUsing this probability distribution, the most probable partition is simply the minimal cut. Thus performing MAP inference under this probability distribution will still lead to trivial segmentations. However, as Blatt et al pointed out, there is far more information in the full probability distribution over partitions than solely in the MAP partition. For example, consider the pairwise correlation p(i,j), defined for any two neighboring nodes in the graph as the probability that they belong to the same segment:\n\np(i,j) = Σ_{A,B} Pr(A, B) SAME(i, j, A, B)   (3)\n\nwith SAME(i, j, A, B) defined as 1 iff (i ∈ A and j ∈ A) or (i ∈ B and j ∈ B).\nReferring again to the single isolated datapoint in figure 1: while that datapoint and its neighbors do not appear in the same cluster in the most probable partition, they do appear in the same cluster for the vast majority of partitions. 
Thus we would expect p(i,j) > 1/2 for that datapoint and its neighbors.\n\nHence the typical cut algorithm of Blatt et al consists of three stages:\n\n- Preprocessing: Construct the affinity matrix W so that each node is connected to at most K neighbors. Define the affinities as W(i,j) = e^{-d(i,j)²/σ²}, where d(i,j) is the distance between points i and j, and σ is the mean distance to the K'th neighbor.\n\n- Estimating pairwise correlations: Use a Markov chain Monte Carlo (MCMC) sampling method to estimate p(i,j) at each temperature T.\n\n- Postprocessing: Define the typical cut partition as the connected components of the graph after removing any links for which p(i,j) < 1/2.\n\nFor a given W(i,j) the algorithm has a single free temperature parameter T (see eq. 2). This parameter implicitly defines the number of clusters. At zero temperature all the datapoints reside in one cluster (this trivially minimizes the cut value), and at high temperatures every datapoint forms a separate cluster.\nIn this paper we show that calculating the typical cut is equivalent to performing inference in an undirected graphical model. We use this equivalence to show that in problems with hundreds of datapoints, the typical cut may be calculated exactly. We show that when exact inference is impossible, loopy belief propagation (BP) and generalized belief propagation (GBP) may give an excellent approximation with very little computational cost. 
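The preprocessing and postprocessing stages of the three-stage algorithm above can be sketched as follows; the middle stage (estimating p(i,j)) is the subject of the rest of the paper. The K-nearest-neighbor construction and the union-find component labeling are our own minimal implementation choices:

```python
import math

def build_affinities(points, K=3):
    """Preprocessing: connect each point to its K nearest neighbors,
    with W(i,j) = exp(-d(i,j)^2 / sigma^2), sigma being the mean
    distance to the K'th neighbor (as in the typical cut recipe)."""
    n = len(points)
    d = [[math.dist(p, q) for q in points] for p in points]
    kth = [sorted(d[i])[K] for i in range(n)]   # d[i][i] = 0 sits at index 0
    sigma = sum(kth) / n
    W = {}
    for i in range(n):
        for j in sorted(range(n), key=lambda j: d[i][j])[1:K + 1]:
            W[(min(i, j), max(i, j))] = math.exp(-d[i][j] ** 2 / sigma ** 2)
    return W

def typical_cut_partition(n, p):
    """Postprocessing: connected components of the graph after
    removing every edge whose pairwise correlation p(i,j) < 1/2."""
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for (i, j), pij in p.items():
        if pij >= 0.5:                # keep only strongly correlated links
            parent[find(i)] = find(j)
    return [find(i) for i in range(n)]
```

For example, with correlations {(0,1): 0.9, (1,2): 0.9, (2,3): 0.1} the weakly correlated link (2,3) is removed and point 3 becomes its own cluster.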
Finally, we use the standard algorithm for ML estimation in graphical models to derive a learning algorithm for affinity matrices based on labeled data¹.\n\n2 The connection between typical cuts and graphical models\n\nAn undirected graphical model with pairwise potentials (see [10] for a review) consists of a graph G and potential functions Ψ_ij(x_i, x_j) such that the probability of an assignment x is given by:\n\nPr(x) = (1/Z) ∏_{ij} Ψ_ij(x_i, x_j)   (4)\n\nwhere the product is taken over nodes that are connected in the graph G.\nTo connect this to typical cuts we first define for every partition (A, B) a binary vector x such that x(i) = 0 if i ∈ A and x(i) = 1 if i ∈ B. We then define:\n\nΨ_ij(x_i, x_j) = [ 1, e^{-W(i,j)/T} ; e^{-W(i,j)/T}, 1 ]   (5)\n\ni.e. Ψ_ij equals 1 when x_i = x_j and e^{-W(i,j)/T} otherwise.\nObservation 1: The typical cut probability distribution (equation 2) is equivalent to that induced by a pairwise undirected graphical model (equation 4) whose graph G is the same as the graph used for graph partitioning and whose potentials are given by equation 5.\nSo far we have focused on partitioning the graph into two segments, but the equivalence holds for any number of segments q. Let (A_1, A_2, ..., A_q) be a partitioning of the graph into q segments (note that these segments need not be connected in G). Define cut(A_1, A_2, ..., A_q) in direct analogy to equation 1, and:\n\nPr(A_1, A_2, ..., A_q) = (1/Z) e^{-cut(A_1, A_2, ..., A_q)/T}   (6)\n\nThe implication of observation 1 is that we can use the powerful tools of graphical models in the context of pairwise clustering. 
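The potentials of equation 5 depend only on W(i,j) and T, and multiplying them over the edges reproduces the typical cut distribution of equation 2 up to the constant Z. A small sketch (helper names are our own):

```python
import math

def edge_potential(w, T):
    """Psi_ij of equation 5: 1 on the diagonal (x_i == x_j),
    exp(-W(i,j)/T) off the diagonal (the edge is cut)."""
    off = math.exp(-w / T)
    return [[1.0, off], [off, 1.0]]

def partition_probability_unnorm(x, W, T):
    """Unnormalized Pr(x) of equation 4: the product of edge
    potentials, which equals exp(-cut(x)/T) as in equation 2."""
    p = 1.0
    for (i, j), w in W.items():
        p *= edge_potential(w, T)[x[i]][x[j]]
    return p
```

As a sanity check, an all-same labeling has cut 0 and unnormalized probability 1, and each cut edge contributes a factor e^{-W(i,j)/T}.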
In subsequent sections we provide examples of the benefits of using graphical models to compute typical cuts.\n\n¹ Parts of this work appeared previously in [7].\n\n3 Computing typical cuts using inference in a graphical model\n\nTypical cuts have been successfully used for clustering of datapoints in R^n [1], using an expensive MCMC to calculate pairwise correlations p(i,j). Using inference algorithms we provide a deterministic and more efficient estimate of p(i,j). More specifically, we use inference algorithms to compute the pairwise beliefs over neighboring nodes b_ij(x_i, x_j), and calculate the pairwise correlation as p(i,j) = Σ_{t=1}^{q} b_ij(t, t).\nIn cases where the maximal clique size is small enough, we can calculate p(i,j) exactly using the junction tree algorithm. In all other cases we must resort to approximate inference using the BP and GBP algorithms. The following subsections discuss exact and approximate inference for computing typical cuts.\n\n3.1 Exact inference for typical cut clustering\n\nThe nature of real-life clustering problems seems to suggest that exact inference would be intractable due to the clique size of the junction tree. Surprisingly, in our empirical studies we discovered that on many datasets, including benchmark problems from the UCI repository, we obtain “thin” junction trees (with maximal clique size less than 20). Figure 2a shows a representative two-dimensional result. The temperature parameter T was automatically chosen to provide two large clusters. As shown previously by Gdalyahu et al, the typical cut criterion does sensible things: it does not favor segmentation of individual datapoints (as in minimal cut), nor is it fooled by narrow bridges between clusters (as in simple connected components). 
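On a toy graph the correlations p(i,j) = Pr(x_i = x_j) can be computed exactly by brute-force enumeration of all q^n labelings; this is only a stand-in for the junction tree algorithm on thin graphs, feasible for very small n:

```python
import math
from itertools import product

def exact_correlations(n, W, T, q=2):
    """p(i,j) = Pr(x_i == x_j) under Pr(x) proportional to
    exp(-cut(x)/T) (equations 2 and 6), by summing over all q**n
    labelings.  Exponential cost: illustration only."""
    Z = 0.0
    same = {e: 0.0 for e in W}
    for x in product(range(q), repeat=n):
        cut = sum(w for (i, j), w in W.items() if x[i] != x[j])
        pr = math.exp(-cut / T)
        Z += pr
        for (i, j) in W:
            if x[i] == x[j]:
                same[(i, j)] += pr
    return {e: same[e] / Z for e in W}
```

On a strongly weighted chain at low temperature the correlations approach 1, and at high temperature they fall toward 1/2, matching the behavior of the free parameter T described above.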
However, while previous typical cut algorithms approximate p(i,j) using MCMC, in some cases, using the framework of graphical models, we can calculate p(i,j) exactly and efficiently.\n\nFigure 2: Clustering examples with clusters indicated by different markers. In example (a) the pairwise correlations were calculated exactly, while in example (b) we used BP.\n\n3.2 Approximate inference for typical cut clustering\n\nAlthough exact inference is shown to be possible, in the more common case it is infeasible, and p(i,j) can only be estimated using approximate inference algorithms. In this section we discuss approximate inference using the BP and GBP algorithms.\n\nApproximate inference using Belief Propagation. In BP the pairwise beliefs over neighboring nodes, b_ij, are defined using the messages as:\n\nb_ij(x_i, x_j) = α Ψ_ij(x_i, x_j) ∏_{x_k ∈ N(x_i)\x_j} m_ki(x_i) ∏_{x_k ∈ N(x_j)\x_i} m_kj(x_j)   (7)\n\nCan this be used as an approximation for pairwise clustering?\nObservation 2: In case the messages are initialized uniformly, the pairwise beliefs calculated by BP are only a function of the local potentials, i.e. b_ij(x_i, x_j) ∝ Ψ_ij(x_i, x_j).\nProof: Due to the symmetry of the potentials and since the messages are initialized uniformly, all the messages in BP remain uniform. Thus equation 7 simply gives the normalized local potentials.\nA consequence of observation 2 is that we need to break the symmetry of the problem in order to use BP. We use here the method of conditioning. 
Due to the symmetry of the potentials, if exact inference is used then conditioning on a single node x_c = 1 and calculating conditional correlations P(x_i = x_j | x_c = 1) should give exactly the same answer as the unconditional correlations p(i,j) = P(x_i = x_j). However, when BP inference is used, clamping the value of x_c causes its outgoing messages to be nonuniform, and as these messages propagate through the graph they break the symmetry used in the proof of observation 2. Empirically, this yields much better approximations of the correlations. In some cases (e.g. when the graph is disconnected) conditioning on a single point does not break the symmetry throughout the graph, and additional points need to be clamped.\nIn order to evaluate the quality of the approximation provided by BP, we compared BP with conditioning against exact inference on the dataset shown in figure 2a. Figure 3 displays the results at two different temperatures, “low” and “high”. Each row presents the clustering solution of exact inference and of BP, and a scatter plot of the correlations over all of the edges using the two methods. 
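A minimal sketch of sum-product BP with conditioning, written by us for illustration (parallel message updates, binary labels, one clamped node); on a tree it reproduces the exact conditional correlations, which by the symmetry argument equal the unconditional ones:

```python
import math

def bp_correlations(n, W, T, clamp=0, iters=50):
    """Sum-product BP on the model of equations 4-5, with node
    `clamp` fixed to state 1 to break the symmetry of observation 2.
    Returns P(x_i = x_j | x_clamp = 1) for every edge in W."""
    psi = lambda a, b, w: 1.0 if a == b else math.exp(-w / T)
    nbrs = {i: [] for i in range(n)}
    for (i, j) in W:
        nbrs[i].append(j); nbrs[j].append(i)
    phi = {i: [1.0, 1.0] for i in range(n)}
    phi[clamp] = [0.0, 1.0]                       # clamp x_c = 1
    msg = {d: [1.0, 1.0] for (a, b) in W for d in [(a, b), (b, a)]}
    for _ in range(iters):
        new = {}
        for (i, j) in msg:
            w = W.get((i, j), W.get((j, i)))
            m = [sum(phi[i][a] * psi(a, b, w) *
                     math.prod(msg[(k, i)][a] for k in nbrs[i] if k != j)
                     for a in (0, 1)) for b in (0, 1)]
            s = sum(m)
            new[(i, j)] = [v / s for v in m]
        msg = new
    p = {}
    for (i, j), w in W.items():
        # pairwise belief of equation 7, then p(i,j) = sum_t b_ij(t,t)
        b = [[phi[i][a] * phi[j][c] * psi(a, c, w) *
              math.prod(msg[(k, i)][a] for k in nbrs[i] if k != j) *
              math.prod(msg[(k, j)][c] for k in nbrs[j] if k != i)
              for c in (0, 1)] for a in (0, 1)]
        Z = sum(sum(row) for row in b)
        p[(i, j)] = (b[0][0] + b[1][1]) / Z
    return p
```

On a 3-node chain with unit weights at T = 0.2, the correlation on each edge works out analytically to 1/(1 + e^{-5}), and BP recovers it since the chain is a tree.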
At the “low” temperature the approximation almost coincides with the exact values, but at the “high” temperature BP overestimates the correlation values.\n\nFigure 3: Clustering results at a “low” temperature (upper row) and a “high” temperature (lower row). The left and middle columns present clustering results of exact inference and of BP, respectively. The right column compares the values of the correlations provided by the two methods (exact correlations vs. loopy correlations). Each dot corresponds to an edge in the graph. At “low” temperature most of the correlations are close to 1, hence many edges appear as a single dot.\n\nApproximate inference using Generalized Belief Propagation. Generalized Belief Propagation (GBP) algorithms [10] extend the BP algorithm by sending messages that are functions of clusters of variables, and have been shown to provide a better approximation than BP in many applications. 
Can GBP improve the approximation of pairwise correlations in typical cuts?\n\nOur empirical studies show that the performance and convergence of GBP over a general graph obtained from arbitrary points in R^n strongly depend on the initial choice of clusters (regions). As also observed by Minka and Qi [5], a specific choice of clusters may yield worse results than BP, or may even cause GBP not to converge. However, it is far from obvious how to choose these clusters. In previous uses of GBP [10] the basic clusters were chosen by hand. In order to use GBP to approximate p(i,j) in a general graph, one must obtain a useful automatic procedure for selecting these initial clusters. We have experimented with various heuristics, but none of them gave good performance. However, in the case of ordered graphs such as 2D grids and images, we have found that GBP gives an excellent approximation when using four neighboring grid points as a region.\nFigure 4a shows results of GBP approximations for a 30x30 2D uniform grid. The clique size in a junction tree is of order 2^30, hence exact inference is infeasible. We compare the correlations p(i,j) calculated using an extensive MCMC sampling procedure [9] to those calculated using GBP with the clusters being four neighboring pixels in the graph. GBP converges in only 10 iterations and can be seen to provide an excellent approximation. Figure 4c presents a comparison of the MCMC correlations with those calculated by GBP on the real 120x80 image shown in figure 4b, with affinity based on color similarity. 
Figure 4d presents the clustering results, which provide a segmentation of the image.\n\nFigure 4: (a) Scatter plot of pairwise correlations in a 30x30 grid, using MCMC (SW-MC) [9] and GBP. Each dot corresponds to the pairwise correlation of one edge at a specific temperature. Notice the excellent correspondence between GBP and MCMC. (c) The same comparison performed over the image in (b). (d) A gray-level map of the 15 largest clusters.\n\n4 Learning Affinity Matrices from Labeled Datasets\n\nAs noted in the introduction, using graphical models to compute typical cuts can also be advantageous for other aspects of the clustering problem, apart from computing p(i,j). One such important advantage is learning the affinity matrix W(i,j) from labeled data.\nIn many problems there are multiple ways to define affinities between any two datapoints. For example, in image segmentation where the nodes are pixels, one can define affinity based on color similarity, texture similarity or some combination of the two. Our goal is to use a labeled training set of manually segmented images to learn the “right” affinities.\nMore specifically, let us assume the “correct” affinity is a linear combination of a set of known affinity functions {f_k}, k = 1, ..., K, each corresponding to different features of the data. Hence the affinity between neighboring points i and j is defined by: W(i,j) = Σ_{k=1}^{K} α_k f_k(i,j). In addition, assume we are given a labeled training sample, which consists of the following: (i) A graph in which neighboring nodes are connected by edges. 
(ii) Affinity values f_k(i,j). (iii) A partition of the graph x. Our goal is to estimate the affinity mixing coefficients α_k.\n\nThis problem can be solved using the graphical model defined by the typical cut probability distribution (equation 6). Recall that the probability of a partition x is defined as\n\nP(x) = (1/Z) e^{-cut(x)} = (1/Z) e^{-Σ_{(i,j)} (1-δ(x_i-x_j)) W(i,j)} = (1/Z(α)) e^{-Σ_{k=1}^{K} α_k fcut_k(x)}   (8)\n\nwhere we have defined fcut_k(x) = Σ_{(i,j)} (1-δ(x_i-x_j)) f_k(i,j). fcut_k(x) is the cut value defined by x when only taking into account the affinity function f_k; hence it can be computed using the training sample. Differentiating the log likelihood with respect to α_k gives the exponential family equation:\n\n∂ ln P(x) / ∂α_k = -fcut_k(x) + <fcut_k>_α   (9)\n\nEquation 9 gives an intuitive definition of the optimal α: the optimal α is the one for which <fcut_k>_α = fcut_k(x), i.e., for the optimal α the expected values of the cuts for each feature separately match exactly the values of these cuts in the training set.\nSince we are dealing with the exponential family, the likelihood is convex and the ML solution can be found using gradient ascent. 
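A sketch of the gradient ascent suggested by equation 9, on a toy problem. The model expectation <fcut_k>_α is computed here by brute-force enumeration, which is only feasible for tiny graphs and merely stands in for the approximate inference the paper actually uses; all helper names, the learning rate, and the step count are our own:

```python
import math
from itertools import product

def fcut(x, f):
    """fcut_k(x): cut value of partition x under one affinity function f."""
    return sum(v for (i, j), v in f.items() if x[i] != x[j])

def learn_alphas(n, feats, x_train, steps=200, lr=0.5, q=2):
    """Gradient ascent on the log-likelihood of equation 8:
    d ln P / d alpha_k = -fcut_k(x_train) + <fcut_k>_alpha  (eq. 9)."""
    alpha = [1.0] * len(feats)
    target = [fcut(x_train, f) for f in feats]
    states = list(product(range(q), repeat=n))
    for _ in range(steps):
        weights = [math.exp(-sum(a * fcut(x, f)
                                 for a, f in zip(alpha, feats)))
                   for x in states]
        Z = sum(weights)
        expect = [sum(w * fcut(x, f) for w, x in zip(weights, states)) / Z
                  for f in feats]
        for k in range(len(feats)):
            alpha[k] += lr * (-target[k] + expect[k])
    return alpha
```

On a 3-node chain whose training partition cuts the edge favored by the second feature but not the first, the learned weights rank the first feature above the second, matching the moment-matching reading of equation 9.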
To calculate the gradient explicitly, we use the linearity of expectation:\n\n<fcut_k>_α = Σ_{(i,j)} <1 - δ(x_i - x_j)>_α f_k(i,j) = Σ_{(i,j)} (1 - p(i,j)_α) f_k(i,j)\n\nwhere p(i,j)_α are the pairwise correlations for the given values of α.\nEquation 9 is visually similar to the learning rule derived by Meila and Shi [4], but the cost function they are minimizing is actually different; hence the expectations are taken with respect to completely different distributions.\n\n4.1 Combining learning and GBP approximate inference\n\nWe experimented with the learning algorithm on images, with the pixel grid as the graph and using GBP for approximating p(i,j)_α. The three pixel affinity functions, f_1, f_2, f_3, correspond to the intensity differences in the R, G, B color channels. We used a standard transformation of intensity differences to affinities by a Gaussian kernel.\nThe left pane in figure 5 shows a synthetic example. There is one training image (fig 5a) but two different manual segmentations (fig 5b,c). The first and second training segmentations are based on illumination-covariant and illumination-invariant affinities, respectively. We used gradient ascent as given by equation 9. Figure 5d shows a novel image, and figures 5e,f show two different pairwise correlations of this image using the learned α. Indeed, the algorithm learns to either ignore or not ignore illumination, based on the training set.\nThe right pane in figure 5 shows results on real images. For real images, we found that a preprocessing of the image colors is required in order to learn a shadow-invariant linear transformation. This was done by saturating the image colors. The training segmentation (figures 5a,b,c) ignores shadows. On the novel image (figure 5d) the most salient edge is a shadow on the face. 
Nevertheless, the segmentation based on the learned affinity (figure 5e) ignores the shadows and segments the facial features from each other. In contrast, a typical cut segmentation which uses a naive affinity function (combining the three color channels with uniform weights) segments mostly based on shadows (figure 5f).\n\nFigure 5: Left pane: A synthetic example of learning the affinity function. The top row presents the training set: the input image (a), and the clusters of the first (b) and second (c) experiments. The bottom row presents the result of the learning algorithm: the input image (d), and the marginal probabilities p(i,j) (eqn. 3) in the first (e) and second (f) experiments. Right pane: Learning a color affinity function which is invariant to shadows. The top row shows the training data set: the input image (a), the pre-processed image (b) and the manual segmentation (invariant to shadows) (c). The bottom row presents, from left to right, the pre-processed test image (d), an edge map produced by learning the shadow-invariant affinity (e), and an edge map produced by a naive affinity function combining the 3 color channels with uniform weights (f). The edge maps were computed by thresholding the pairwise correlations p(i,j) (eqn. 3). See text for details. Both illustrations are better viewed in color.\n\n5 Discussion\n\nPairwise clustering algorithms have a wide range of applicability due to their ability to find clusters with arbitrary shapes. In this paper we have shown how pairwise clustering can be mapped to an inference problem in a graphical model. 
This equivalence allowed us to use the standard tools of graphical models: exact and approximate inference and ML learning. We showed how to combine approximate inference and ML learning in the challenging problem of learning affinities for images from labeled data. We have only begun to use the many tools of graphical models. We are currently working on learning from unlabeled sets and on other approximate inference algorithms.\n\nReferences\n\n[1] M. Blatt, S. Wiseman, and E. Domany. Data clustering using a model granular magnet. Neural Computation, 9:1805-1842, 1997.\n[2] Y. Gdalyahu, D. Weinshall, and M. Werman. Self organization in vision: Stochastic clustering for image segmentation, perceptual grouping, and image database organization. IEEE Trans. on Pattern Analysis and Machine Intelligence, 23(10):1053-1074, 2001.\n[3] T. Hofmann and J. M. Buhmann. Pairwise data clustering by deterministic annealing. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(1):1-14, 1997.\n[4] M. Meila and J. Shi. Learning segmentation by random walks. In Advances in Neural Information Processing Systems 14, 2001.\n[5] T. Minka and Y. Qi. Tree-structured approximations by expectation propagation. In Advances in Neural Information Processing Systems 16, 2003.\n[6] A. Ng, M. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems 14, 2001.\n[7] N. Shental, A. Zomet, T. Hertz, and Y. Weiss. Learning and inferring image segmentations using the GBP typical cut. In 9th International Conference on Computer Vision, 2003.\n[8] J. Shi and J. Malik. Normalized cuts and image segmentation. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pages 731-737, 1997.\n[9] J.S. Wang and R.H. Swendsen. Cluster Monte Carlo algorithms. Physica A, 167:565-579, 1990.\n[10] J. Yedidia, W. Freeman, and Y. Weiss. 
Understanding belief propagation and its generalizations. In G. Lakemeyer and B. Nebel, editors, Exploring Artificial Intelligence in the New Millennium. Morgan Kaufmann, 2003.", "award": [], "sourceid": 2538, "authors": [{"given_name": "Noam", "family_name": "Shental", "institution": null}, {"given_name": "Assaf", "family_name": "Zomet", "institution": null}, {"given_name": "Tomer", "family_name": "Hertz", "institution": null}, {"given_name": "Yair", "family_name": "Weiss", "institution": null}]}