{"title": "Summary Statistics for Partitionings and Feature Allocations", "book": "Advances in Neural Information Processing Systems", "page_first": 261, "page_last": 269, "abstract": "Infinite mixture models are commonly used for clustering. One can sample from the posterior of mixture assignments by Monte Carlo methods or find its maximum a posteriori solution by optimization. However, in some problems the posterior is diffuse and it is hard to interpret the sampled partitionings. In this paper, we introduce novel statistics based on block sizes for representing sample sets of partitionings and feature allocations. We develop an element-based definition of entropy to quantify segmentation among their elements. Then we propose a simple algorithm called entropy agglomeration (EA) to summarize and visualize this information. Experiments on various infinite mixture posteriors as well as a feature allocation dataset demonstrate that the proposed statistics are useful in practice.", "full_text": "Summary Statistics for\n\nPartitionings and Feature Allocations\n\nIs\u00b8\u0131k Bar\u0131s\u00b8 Fidaner\n\nComputer Engineering Department\n\nBo\u02d8gazic\u00b8i University, Istanbul\n\nAli Taylan Cemgil\n\nComputer Engineering Department\n\nBo\u02d8gazic\u00b8i University, Istanbul\n\nfidaner@alternatifbilisim.org\n\ntaylan.cemgil@boun.edu.tr\n\nAbstract\n\nIn\ufb01nite mixture models are commonly used for clustering. One can sample from\nthe posterior of mixture assignments by Monte Carlo methods or \ufb01nd its maximum\na posteriori solution by optimization. However, in some problems the posterior\nis diffuse and it is hard to interpret the sampled partitionings. In this paper, we\nintroduce novel statistics based on block sizes for representing sample sets of par-\ntitionings and feature allocations. We develop an element-based de\ufb01nition of en-\ntropy to quantify segmentation among their elements. 
Then we propose a simple algorithm called entropy agglomeration (EA) to summarize and visualize this information. Experiments on various infinite mixture posteriors as well as a feature allocation dataset demonstrate that the proposed statistics are useful in practice.\n\n1 Introduction\n\nClustering aims to summarize observed data by grouping its elements according to their similarities. Depending on the application, clusters may represent words belonging to topics, genes belonging to metabolic processes or any other relation assumed by the deployed approach. Infinite mixture models provide a general solution by allowing a potentially unlimited number of mixture components. These models are based on nonparametric priors such as the Dirichlet process (DP) [1, 2], its superclass the Poisson-Dirichlet process (PDP) [3, 4], and constructions such as the Chinese restaurant process (CRP) [5] and the stick-breaking process [6] that enable formulations of efficient inference methods [7]. Studies on infinite mixture models inspired the development of several other models [8, 9], including the Indian buffet process (IBP) for infinite feature models [10, 11] and the fragmentation-coagulation process for sequence data [12], all of which belong to Bayesian nonparametrics [13].\n\nIn making inference on infinite mixture models, a sample set of partitionings can be obtained from the posterior.1 If the posterior is peaked around a single partitioning, then the maximum a posteriori solution will be quite informative. However, in some cases the posterior is more diffuse and one needs to extract statistical information about the random partitioning induced by the model. The problem of 'summarizing' the samples from the infinite mixture posterior was raised in the bioinformatics literature in 2002 by Medvedovic and Sivaganesan for clustering gene expression profiles [14]. 
But the question proved difficult, and they 'circumvented' it by using a heuristic linkage algorithm based on pairwise occurrence probabilities [15, 16]. In this paper, we approach this problem and propose basic methodology for summarizing sample sets of partitionings as well as feature allocations.\n\nNemenman et al. showed in 2002 that the entropy [17] of a DP posterior was strongly determined by its prior hyperparameters [18]. Archer et al. recently elaborated these results with respect to the PDP [19]. In other work, entropy was generalized to partitionings by interpreting partitionings as probability distributions [20, 21]. Therefore, entropy emerges as an important statistic for our problem, but new definitions will be needed for quantifying information in feature allocations.\n\n1In methods such as collapsed Gibbs sampling, slice sampling, retrospective sampling, truncation methods\n\nIn the following sections, we define the problem and introduce cumulative statistics for representing partitionings and feature allocations. Then, we develop an interpretation of the entropy function in terms of per-element information in order to quantify segmentation among their elements. Finally, we describe the entropy agglomeration (EA) algorithm that generates dendrograms to summarize sample sets of partitionings and feature allocations. We demonstrate EA on infinite mixture posteriors for synthetic and real datasets as well as on a real dataset directly interpreted as a feature allocation.\n\n2 Basic definitions and the motivating problem\n\nWe begin with basic definitions. A partitioning of a set of elements [n] = {1, 2, . . . , n} is a set of blocks Z = {B1, . . . , B|Z|} such that Bi ⊆ [n] and Bi ≠ ∅ for all i ∈ {1, . . . , |Z|}, Bi ∩ Bj = ∅ for all i ≠ j, and ∪iBi = [n].2 We write Z ⊢ [n] to designate that Z is a partitioning of [n].3 A sample set E = {Z (1), . . . 
, Z (T )} from a distribution π(Z) over partitionings is a multiset such that Z (t) ∼ π(Z) for all t ∈ {1, . . . , T }. We are required to extract information from this sample set.\n\nOur motivation is the following problem: a set of observed elements (x1, . . . , xn) are clustered by an infinite mixture model with parameters θ(k) for each component k and mixture assignments (z1, . . . , zn) drawn from a two-parameter CRP prior with concentration α and discount d [5]:\n\n$$z \sim \mathrm{CRP}(z; \alpha, d) \qquad \theta^{(k)} \sim p(\theta) \qquad x_i \mid z_i, \theta \sim F(x_i \mid \theta^{(z_i)}) \qquad (1)$$\n\nIn the conjugate case, all θ(k) can be integrated out to get p(zi | z−i, x) for sampling zi [22]:\n\n$$p(z_i \mid z_{-i}, x) \propto \int p(z, x, \theta)\, d\theta \propto \begin{cases} \dfrac{n_k - d}{n - 1 + \alpha} \int F(x_i \mid \theta)\, p(\theta \mid x_{-i}, z_{-i})\, d\theta & \text{if } k \leq K^+ \\ \dfrac{\alpha + d K^+}{n - 1 + \alpha} \int F(x_i \mid \theta)\, p(\theta)\, d\theta & \text{otherwise} \end{cases} \qquad (2)$$\n\nThere are K+ non-empty components and nk elements in each component k. In each iteration, xi will either be put into an existing component k ≤ K+ or it will be assigned to a new component. By sampling all zi repeatedly, a sample set of assignments z(t) is obtained from the posterior p(z | x) = π(Z). These z(t) are then represented by partitionings Z (t) ⊢ [n]. The induced sample set contains information regarding (1) the CRP prior over partitioning structure given by the hyperparameters (α, d) and (2) the integrals over θ that capture the relation among the observed elements (x1, . . . , xn).\n\nIn addition, we aim to extract information from feature allocations, which constitute a superclass of partitionings [11]. A feature allocation of [n] is a multiset of blocks F = {B1, . . . , B|F |} such that Bi ⊆ [n] and Bi ≠ ∅ for all i ∈ {1, . . . , |F |}. A sample set E = {F (1), . . . 
, F (T )} from a distribution π(F ) over feature allocations is a multiset such that F (t) ∼ π(F ) for all t. The current exposition will focus on partitionings, but we are also going to show how our statistics apply to feature allocations.\n\nAssume that we have obtained a sample set E of partitionings. If it was obtained by sampling from an infinite mixture posterior, then its blocks B ∈ Z (t) correspond to the mixture components. Given a sample set E, we can approximate any statistic f (Z) over π(Z) by averaging it over the set E:\n\n$$Z^{(1)}, \dots, Z^{(T)} \sim \pi(Z) \quad\Rightarrow\quad \frac{1}{T} \sum_{t=1}^{T} f(Z^{(t)}) \approx \langle f(Z) \rangle_{\pi(Z)} \qquad (3)$$\n\nWhich f (Z) would be a useful statistic for Z? Three statistics commonly appear in the literature. The first is the number of blocks |Z|, which has been analyzed theoretically for various nonparametric priors [2, 5]. It is simple, general and exchangeable with respect to the elements [n], but it is not very informative about the distribution π(Z) and therefore is not very useful in practice.\n\nA common statistic is pairwise occurrence, which is used to extract information from infinite mixture posteriors in applications like bioinformatics [14]. For a given pair of elements {a, b}, it counts the number of blocks that contain both, written $\sum_i [\{a, b\} \subseteq B_i]$. It is a very useful similarity measure, but it cannot express information regarding relations among three or more elements.\n\nAnother statistic is the exact block size distribution (referred to as 'multiplicities' in [11, 19]). It counts the partitioning's blocks that contain exactly k elements, written $\sum_i [|B_i| = k]$. 
It is exchangeable with respect to the elements [n], but its weighted average over a sample set is difficult to interpret.\n\n2We use the term 'partitioning' to indicate a 'set partition' as distinguished from an integer 'partition'.\n3The symbol '⊢' is usually used for integer partitions, but here we use it for partitionings (= set partitions).\n\nLet us illustrate the problem by a practical example, to which we will return in the formulations:\n\nE3 = {Z (1), Z (2), Z (3)}\n\nZ (1) = {{1, 3, 6, 7}, {2}, {4, 5}}\nZ (2) = {{1, 3, 6}, {2, 7}, {4, 5}}\nZ (3) = {{1, 2, 3, 6, 7}, {4, 5}}\n\nS1 = {1, 2, 3, 4}\nS2 = {1, 3, 6, 7}\nS3 = {1, 2, 3}\n\nSuppose that E3 represents interactions among seven genes. We want to compare the subsets S1, S2, S3 of these genes. The projection of a partitioning Z ⊢ [n] onto S ⊆ [n] is defined as the set of non-empty intersections between S and the blocks B ∈ Z. Projection onto S induces a partitioning of S:\n\n$$\mathrm{PROJ}(Z, S) = \{B \cap S\}_{B \in Z} \setminus \{\emptyset\} \quad\Rightarrow\quad \mathrm{PROJ}(Z, S) \vdash S \qquad (4)$$\n\nLet us represent the gene interactions in Z (1) and Z (2) by projecting them onto each of the given subsets:\n\nPROJ(Z (1), S1) = {{1, 3}, {2}, {4}}   PROJ(Z (2), S1) = {{1, 3}, {2}, {4}}\nPROJ(Z (1), S2) = {{1, 3, 6, 7}}   PROJ(Z (2), S2) = {{1, 3, 6}, {7}}\nPROJ(Z (1), S3) = {{1, 3}, {2}}   PROJ(Z (2), S3) = {{1, 3}, {2}}\n\nComparing S1 to S2, we can say that S1 is 'more segmented' than S2, and therefore the genes in S2 should be more closely related than those in S1. However, it is more subtle and difficult to compare S2 to S3. A clear understanding would allow us to explore the subsets S ⊆ [n] in an informed manner. 
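The projection in Equation 4 is a one-line operation on sets. A minimal Python sketch (the function and variable names are ours, not the paper's), reproducing the example projections of Z (1) above:

```python
def proj(Z, S):
    """Project a partitioning Z (a list of sets) onto a subset S:
    keep the non-empty intersections of S with each block (Equation 4)."""
    return [B & S for B in Z if B & S]

Z1 = [{1, 3, 6, 7}, {2}, {4, 5}]
S1, S2, S3 = {1, 2, 3, 4}, {1, 3, 6, 7}, {1, 2, 3}

print(proj(Z1, S1))  # [{1, 3}, {2}, {4}]
print(proj(Z1, S2))  # [{1, 3, 6, 7}]
print(proj(Z1, S3))  # [{1, 3}, {2}]
```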
In the following section, we develop a novel and general approach based on block sizes that opens up a systematic method for analyzing sample sets over partitionings and feature allocations.\n\n3 Cumulative statistics to represent structure\n\nWe define the cumulative block size distribution, or 'cumulative statistic' in short, as the function $\phi_k(Z) = \sum_i [|B_i| \geq k]$, which counts the partitioning's blocks of size at least k. We can rewrite the previous statistics: the number of blocks as φ1(Z), the exact block size distribution as φk(Z) − φk+1(Z), and pairwise occurrence as φ2(PROJ(Z, {a, b})). Moreover, cumulative statistics satisfy the following property: for partitionings of [n], φ(Z) always sums up to n, just like a probability mass function sums up to 1. When the blocks of Z are sorted according to their sizes and the indicators [|Bi| ≥ k] are arranged on a matrix as in Figure 1a, they form a Young diagram, showing that φ(Z) is always the conjugate partition of the integer partition of Z. As a result, φ(Z) as well as weighted averages over several φ(Z) always sum up to n, just like taking averages over probability mass functions (Figure 2). Therefore, cumulative statistics of a random partitioning 'conserve mass'. 
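The cumulative statistic itself is equally direct to compute. A minimal Python sketch (names are ours); it reproduces φ(Z (1)) = (3, 2, 1, 1) from Figure 1a and checks that the vector sums to n:

```python
def phi(Z, n):
    """Cumulative block size distribution of a partitioning of [n]:
    phi(Z, n)[k-1] counts the blocks of size at least k."""
    sizes = [len(B) for B in Z]
    return [sum(s >= k for s in sizes) for k in range(1, n + 1)]

Z1 = [{1, 3, 6, 7}, {2}, {4, 5}]
print(phi(Z1, 7))       # [3, 2, 1, 1, 0, 0, 0]
print(sum(phi(Z1, 7)))  # 7 -- 'conserves mass'
```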
In the case of feature allocations, since elements can be omitted or repeated, this property does not hold.\n\nFigure 1: Young diagrams show the conjugacy between a partitioning and its cumulative statistic: (a) the cumulative block size distribution for a partitioning, φ(Z (1)) = (3, 2, 1, 1); (b) for its projection onto a subset, φ(PROJ(Z (1), S1)) = (3, 1).\n\nFigure 2: Cumulative statistics of the three examples and their average: all sum up to 7.\n\n$$Z \vdash [n] \quad\Rightarrow\quad \sum_{k=1}^{n} \phi_k(Z) = n \quad\Rightarrow\quad \sum_{k=1}^{n} \langle \phi_k(Z) \rangle_{\pi(Z)} = n \qquad (5)$$\n\nWhen we project the partitioning Z onto a subset S ⊆ [n], the resulting vector φ(PROJ(Z, S)) will then sum up to |S| (Figure 1b). A 'taller' Young diagram implies a 'more segmented' subset.\n\nWe can form a partitioning Z by inserting the elements 1, 2, 3, 4, . . . into its blocks (Figure 3a). In such a scheme, each step brings a new element and requires a new decision that will depend on all previous decisions. 
It would be better if we could determine the whole path by a few initial decisions.\n\nNow suppose that we know Z from the start and we generate an incremental sequence of subsets S1 = {1}, S2 = {1, 2}, S3 = {1, 2, 3}, S4 = {1, 2, 3, 4}, . . . according to a permutation of [n]: σ = (1, 2, 3, 4, . . . ). We can then represent any path in Figure 3a by a sequence of PROJ(Z, Si) and determine the whole path by two initial parameters: Z and σ. The resulting tree can be simplified by representing the partitionings by their cumulative statistics instead of their blocks (Figure 3b).\n\nBased on this concept, we define the cumulative occurrence distribution (COD) as the triangular matrix of incremental cumulative statistic vectors, written Δi,k(Z, σ) = φk(PROJ(Z, Si)), where Z ⊢ [n], σ is a permutation of [n] and Si = {σ1, . . . , σi} for i ∈ {1, . . . , n}. COD matrices for two extreme paths (Figures 3c, 3e) and for the example partitioning Z (1) (Figure 3d) are shown. For partitionings, the ith row of a COD matrix always sums up to i, even when averaged over a sample set as in Figure 4:\n\n$$Z \vdash [n] \quad\Rightarrow\quad \sum_{k=1}^{i} \Delta_{i,k}(Z, \sigma) = i \quad\Rightarrow\quad \sum_{k=1}^{i} \langle \Delta_{i,k}(Z, \sigma) \rangle_{\pi(Z)} = i \qquad (6)$$\n\nThe expected COD matrix of a random partitioning expresses (1) the cumulation of elements by the differences between its rows, and (2) the cumulation of block sizes by the differences between its columns.\n\nAs an illustrative example, consider π(Z) = CRP(Z | α, d). Since the CRP is exchangeable and projective, its expected cumulative statistic ⟨φ(Z)⟩π(Z) for n elements depends only on its hyperparameters (α, d). 
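Before continuing with the CRP example, the COD matrix just defined can be sketched in code. A minimal Python version (helper names are ours); it reproduces Δ(Z (1), (1, . . . , 7)) from Figure 3d and checks that row i sums to i:

```python
def proj(Z, S):
    """Non-empty intersections of S with the blocks of Z (Equation 4)."""
    return [B & S for B in Z if B & S]

def phi(Z, n):
    """Cumulative block size distribution: counts blocks of size >= k."""
    sizes = [len(B) for B in Z]
    return [sum(s >= k for s in sizes) for k in range(1, n + 1)]

def cod(Z, sigma):
    """Triangular COD matrix: row i is phi(PROJ(Z, S_i)) where S_i holds
    the first i elements of the permutation sigma."""
    return [phi(proj(Z, set(sigma[:i])), i) for i in range(1, len(sigma) + 1)]

Z1 = [{1, 3, 6, 7}, {2}, {4, 5}]
for row in cod(Z1, (1, 2, 3, 4, 5, 6, 7)):
    print(row)  # last row: [3, 2, 1, 1, 0, 0, 0]
```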
As a result, its expected COD matrix Δ = ⟨Δ(Z, σ)⟩π(Z) is independent of σ, and it satisfies an incremental formulation with the parameters (α, d) over the indices i ∈ N, k ∈ Z+:\n\n$$\Delta_{0,k} = 0 \qquad \Delta_{i+1,k} = \Delta_{i,k} + \begin{cases} \dfrac{\alpha + d\,\Delta_{i,k}}{i + \alpha} & \text{if } k = 1 \\ \dfrac{(k - 1 - d)(\Delta_{i,k-1} - \Delta_{i,k})}{i + \alpha} & \text{otherwise} \end{cases} \qquad (7)$$\n\nBy allowing k = 0 and setting Δi,0 = −α/d and Δ0,k = 0 for k > 0 as the two boundary conditions, the same matrix can be formulated by a difference equation over the indices i ∈ N, k ∈ N:\n\n$$(\Delta_{i+1,k} - \Delta_{i,k})(i + \alpha) = (\Delta_{i,k-1} - \Delta_{i,k})(k - 1 - d) \qquad (8)$$\n\nBy setting Δ = Δ(0), we get an infinite sequence of matrices Δ(m) that satisfy the same equation:\n\n$$(\Delta^{(m)}_{i+1,k} - \Delta^{(m)}_{i,k})(i + \alpha) = (\Delta^{(m)}_{i,k-1} - \Delta^{(m)}_{i,k})(k - 1 - d) = \Delta^{(m+1)}_{i,k} \qquad (9)$$\n\nTherefore, the expected COD matrix of a CRP-distributed random partitioning is at a constant 'equilibrium' determined by α and d.\n\nFigure 3: Three COD matrices correspond to the three red dotted paths on the trees above: (a) form a partitioning by inserting elements; (b) form the statistic vector by inserting elements; (c) all elements into one block, with rows (1), (1, 1), . . . , (1, 1, 1, 1, 1, 1, 1); (d) the COD matrix Δ(Z (1), (1, . . . , 7)), with rows (1), (2, 0), (2, 1, 0), (3, 1, 0, 0), (3, 2, 0, 0, 0), (3, 2, 1, 0, 0, 0), (3, 2, 1, 1, 0, 0, 0); (e) each element into a new block, with rows (1), (2, 0), . . . , (7, 0, 0, 0, 0, 0, 0).\n\nFigure 4: CODs and entropies over E3 for the permutations (1, 2, 3, 4, 5, 6, 7) and (1, 3, 6, 7, 2, 4, 5); the averaged COD rows are (1.0), (1.7, 0.3), (1.7, 1.0, 0.3), (2.7, 1.0, 0.3, 0.0), (2.7, 2.0, 0.3, 0.0, 0.0), (2.7, 2.0, 1.0, 0.3, 0.0, 0.0), (2.7, 2.3, 1.0, 0.7, 0.3, 0.0, 0.0) for the former and (1.0), (1.0, 1.0), (1.0, 1.0, 1.0), (1.3, 1.0, 1.0, 0.7), (1.7, 1.3, 1.0, 0.7, 0.3), (2.7, 2.3, 1.0, 0.7, 0.3, 0.0), (2.7, 2.3, 1.0, 0.7, 0.3, 0.0, 0.0) for the latter. 
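The incremental formulation of Equation 7 is straightforward to iterate numerically. A hedged Python sketch (the function name is ours); it also verifies the row-sum property of Equation 6 for the expected matrix:

```python
def crp_expected_cod(n, alpha, d):
    """Expected COD matrix of CRP(alpha, d) for n elements, iterated via
    the incremental formulation (Equation 7); rows[i-1][k-1] = Delta_{i,k}."""
    prev = [0.0] * n                      # Delta_{0,k} = 0
    rows = []
    for i in range(n):
        cur = list(prev)
        cur[0] = prev[0] + (alpha + d * prev[0]) / (i + alpha)
        for k in range(2, n + 1):
            cur[k - 1] = prev[k - 1] + (k - 1 - d) * (prev[k - 2] - prev[k - 1]) / (i + alpha)
        rows.append(cur)
        prev = cur
    return rows

D = crp_expected_cod(7, alpha=0.05, d=0.0)
# each row i sums to i, as in Equation 6
print([round(sum(row), 10) for row in D])  # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
```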
This example shows that the COD matrix can reveal specific information about a distribution over partitionings; of course, in practice we encounter non-exchangeable and almost arbitrary distributions over partitionings (e.g., the posterior distribution of an infinite mixture), therefore in the following section we will develop a measure to quantify this information.\n\n4 Entropy to quantify segmentation\n\nShannon's entropy [17] can be an appropriate quantity to measure 'segmentation' with respect to partitionings, which can be interpreted as probability distributions [20, 21]. Since this interpretation does not cover feature allocations, we will make an alternative, element-based definition of entropy.\n\nHow does a block B inform us about its elements? Each element has a proportion 1/|B|; let us call this quantity the per-element segment size. Information is zero for |B| = n, since 1/n is the minimum possible segment size. If |B| < n, the block supplies positive information, since the segment size is larger than the minimum, and we know that its segment size could be smaller if the block were larger. To quantify this information, we define the per-element information for a block B as the integral of the segment size 1/s over the range [|B|, n] of block sizes that make this segment smaller (Figure 5):\n\n$$\mathrm{pein}(B) = \int_{|B|}^{n} \frac{1}{s}\, ds = \log \frac{n}{|B|} \qquad (10)$$\n\nIn pein(B), n is a 'base' that determines the minimum possible per-element segment size. Since segment size expresses the significance of elements, the function integrates segment sizes over the block sizes that make the elements less significant. 
This definition is comparable to the well-known p-value, which integrates probabilities over the values that make the observations more significant.\n\nFigure 5: Per-element information for B.\n\nFigure 6: Weighted information plotted for each n.\n\nFigure 7: H(Z) in incremental construction of Z.\n\nFigure 8: Comparing two subset statistics: subset occurrence $\sum_i [S \subseteq B_i]$ versus projection entropy H(PROJ(Z, S)), for S = {a, b} and S = {a, b, c}.\n\nWe can then compute the per-element information supplied by a partitioning Z by taking a weighted average over its blocks, since each block B ∈ Z supplies information for a different proportion |B|/n of the elements being partitioned. For large n, the weighted per-element information reaches its maximum near |B| ≈ n/2 (Figure 6). The total weighted information for Z gives Shannon's entropy function [17], which can be written in terms of the cumulative statistics (assuming φn+1 = 0):\n\n$$H(Z) = \sum_{i=1}^{|Z|} \frac{|B_i|}{n}\, \mathrm{pein}(B_i) = \sum_{i=1}^{|Z|} \frac{|B_i|}{n} \log \frac{n}{|B_i|} = \sum_{k=1}^{n} \left(\phi_k(Z) - \phi_{k+1}(Z)\right) \frac{k}{n} \log \frac{n}{k} \qquad (11)$$\n\nThe entropy of a partitioning increases as its elements become more segmented among themselves. 
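The two forms of H(Z) in Equation 11 can be checked against each other. A minimal Python sketch (helper names are ours), evaluated on Z (1):

```python
import math

def entropy_blocks(Z, n):
    """H(Z) as the weighted per-element information of the blocks (Eq. 11)."""
    return sum(len(B) / n * math.log(n / len(B)) for B in Z)

def entropy_cumulative(Z, n):
    """H(Z) rewritten in terms of the cumulative statistic phi (Eq. 11)."""
    sizes = [len(B) for B in Z]
    phi = [sum(s >= k for s in sizes) for k in range(1, n + 2)]  # phi_{n+1} = 0
    return sum((phi[k - 1] - phi[k]) * k / n * math.log(n / k) for k in range(1, n + 1))

Z1 = [{1, 3, 6, 7}, {2}, {4, 5}]
print(round(entropy_blocks(Z1, 7), 6))      # the two forms agree
print(round(entropy_cumulative(Z1, 7), 6))
```

A single block gives zero entropy and n singleton blocks give the maximum, log n, matching the discussion that follows.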
A partitioning with a single block has zero entropy, and a partitioning with n blocks has the maximum entropy log n. The nodes of the tree we examined in the previous section (Figure 3b) were vertically arranged according to their entropies. On the extended tree (Figure 7), the nth column of nodes represents the possible partitionings of n. This tree serves as a 'grid' for both H(Z) and φ(Z), as they are linearly related with the general coefficient ((k/n) log(n/k) − ((k−1)/n) log(n/(k−1))). A similar grid for feature allocations can be generated by inserting nodes for cumulative statistics that do not conserve mass.\n\nTo quantify the segmentation of a subset S, we compute the projection entropy H(PROJ(Z, S)). To understand this function, we compare it to subset occurrence in Figure 8. Subset occurrence acts as a 'score' that counts the 'successful' blocks that contain all of S, whereas projection entropy acts as a 'penalty' that quantifies how much S is divided and segmented by the given blocks B ∈ Z.\n\nA partitioning Z and a permutation σ of its elements induce an entropy sequence (h1, . . . , hn) such that hi(Z, σ) = H(PROJ(Z, Si)), where Si = {σ1, . . . , σi} for i ∈ {1, . . . , n}. To find subsets of elements that are more closely related, one can seek permutations σ that keep the entropies low. The generated subsets Si will be those that are less segmented by the blocks B ∈ Z. For the example problem, the permutation 1, 3, 6, 7, . . . keeps the expected entropies lower, compared to 1, 2, 3, 4, . . . (Figure 4).\n\n5 Entropy agglomeration and experimental results\n\nWe want to summarize a sample set using the proposed statistics. Permutations that yield lower entropy sequences can be meaningful, but a feasible algorithm can only involve a small subset of the n! permutations. 
We define the entropy agglomeration (EA) algorithm, which begins from 1-element subsets and in each iteration merges the pair of subsets that yields the minimum expected entropy:\n\nEntropy Agglomeration Algorithm:\n\n1. Initialize Ψ ← {{1}, {2}, . . . , {n}}.\n2. Find the subset pair {Sa, Sb} ⊂ Ψ that minimizes the expected entropy ⟨H(PROJ(Z, Sa ∪ Sb))⟩π(Z).\n3. Update Ψ ← (Ψ \ {Sa, Sb}) ∪ {Sa ∪ Sb}.\n4. If |Ψ| > 1 then go to 2.\n5. Generate the dendrogram of chosen pairs by plotting the minimum entropies for every split.\n\nThe resulting dendrogram for the example partitionings is shown in Figure 9a. The subsets {4, 5} and {1, 3, 6} are shown in individual nodes, because their entropies are zero. Besides using this dendrogram as a general summary, one can also generate more specific dendrograms by choosing specific elements or specific parts of the data. For a detailed element-wise analysis, the entropy sequences of particular permutations σ can be assessed. Entropy agglomeration is inspired by 'agglomerative clustering', a standard approach in bioinformatics [23]. To summarize partitionings of gene expressions, [14] applied agglomerative clustering by pairwise occurrences. Although very useful and informative, such methods remain 'heuristic' because they require a 'linkage criterion' in merging subsets. EA avoids this drawback, since projection entropy is already defined over subsets.\n\nTo test the proposed algorithm, we apply it to partitionings sampled from infinite mixture posteriors. In the first three experiments, data is modeled by an infinite mixture of Gaussians, where α = 0.05, d = 0, p(θ) = N (θ | 0, 5) and F (x | θ) = N (x | θ, 0.15) (see Equation 1). Samples from the posterior are used to plot the histogram over the number of blocks, the pairwise occurrences, and the EA dendrogram. 
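The EA loop above can be sketched directly. A minimal Python implementation (function names are ours; ties between equal-entropy pairs are broken by scan order, which the algorithm statement leaves unspecified), run on the example sample set E3:

```python
import itertools
import math

def proj_entropy(Z, S):
    """Entropy of the projection of partitioning Z onto subset S."""
    blocks = [B & S for B in Z if B & S]
    n = len(S)
    return sum(len(B) / n * math.log(n / len(B)) for B in blocks)

def entropy_agglomeration(samples, elements):
    """Greedy EA: repeatedly merge the pair of subsets whose union has the
    lowest expected projection entropy over the sampled partitionings."""
    psi = [frozenset([e]) for e in elements]
    merges = []
    while len(psi) > 1:
        best = None
        for Sa, Sb in itertools.combinations(psi, 2):
            h = sum(proj_entropy(Z, Sa | Sb) for Z in samples) / len(samples)
            if best is None or h < best[0]:
                best = (h, Sa, Sb)
        h, Sa, Sb = best
        psi = [S for S in psi if S not in (Sa, Sb)] + [Sa | Sb]
        merges.append((h, Sa | Sb))
    return merges

E3 = [
    [{1, 3, 6, 7}, {2}, {4, 5}],
    [{1, 3, 6}, {2, 7}, {4, 5}],
    [{1, 2, 3, 6, 7}, {4, 5}],
]
for h, S in entropy_agglomeration(E3, range(1, 8)):
    print(round(h, 3), sorted(S))  # {4, 5} and {1, 3, 6} appear at entropy 0.0
```

Consistent with Figure 9a, the subsets {4, 5} and {1, 3, 6} are formed at zero expected entropy.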
The pairwise occurrences are ordered according to the EA dendrogram. In the fourth experiment, EA is directly applied to the data. We describe each experiment and make observations:\n\n1) Synthetic data (Figure 9b): 30 points on R2 are arranged in three clusters. Plots are based on 450 partitionings from the posterior. Clearly separating the three clusters, EA also reflects their qualitative differences. The dispersedness of the first cluster is represented by distinguishing the 'inner' elements 1, 10 from the 'outer' elements 6, 7. This is also seen as shades of gray in the pairwise occurrences.\n\n2) Iris flower data (Figure 9c): This well-known dataset contains 150 points on R4 from three flower species [24]. Plots are based on 150 partitionings obtained from the posterior. For convenience, small subtrees are shown as single leaves and elements are labeled by their species. All 50 A points appear in a single leaf, as they are clearly separated from B and C. The dendrogram automatically scales to cover the points that are more uncertain with respect to the distribution.\n\n3) Galactose data (Figure 9d): This is a dataset of gene expressions by 820 genes in 20 experimental conditions [25]. The first 204 genes are chosen, and the first two letters of the gene names are used for labels. Plots are based on 250 partitionings from the posterior. 70 RP (ribosomal protein) genes and 12 HX (hexose transport) genes appear in individual leaves. In the large subtree on the top, an 'outer' grouping of 19 genes (circles in the data plot) is distinguished from the 'inner' long tail of 68 genes.\n\n4) IGO data (Figure 9e): This is a dataset of intergovernmental organizations (IGO) [26, v2.1] that contains the IGO memberships of 214 countries through the years 1815-2000. In this experiment, we take a different approach and apply EA directly to the dataset, interpreted as a sample set of single-block feature allocations, where the blocks are IGO-year tuples and the elements are the countries. We take the subset of 138 countries that appear in at least 1000 of the 12856 blocks. With some exceptions, the countries display a general ordering of continents. From the 'outermost' continent to the 'innermost', they are: Europe, America-Australia-NZ, Asia, Africa and the Middle East.\n\n6 Conclusion\n\nIn this paper, we developed a novel approach for summarizing sample sets of partitionings and feature allocations. After presenting the problem, we introduced cumulative statistics and cumulative occurrence distribution matrices for each of its permutations, to represent a sample set in a systematic manner. We defined per-element information to compute entropy sequences for these permutations. We developed the entropy agglomeration (EA) algorithm that chooses and visualizes a small subset of these entropy sequences. Finally, we experimented with various datasets to demonstrate the method. Entropy agglomeration is a simple algorithm that does not require much knowledge to implement, but it is conceptually based on the cumulative statistics we have presented. Since we primarily aimed to formulate a useful algorithm, we only made the essential definitions, and several points remain to be elucidated. For instance, cumulative statistics can be investigated with respect to various nonparametric priors. Our definition of per-element information can be developed with respect to information theory and hypothesis testing. 
Last but not least, algorithms like entropy agglomeration can be designed for summarization tasks concerning various types of combinatorial sample sets.\n\nAcknowledgements\n\nWe thank Ayça Cankorur, Erkan Karabekmez, Duygu Dikicioğlu and Betül Kırdar from Boğaziçi University Chemical Engineering for introducing us to this problem by very helpful discussions. This work was funded by TÜBİTAK (110E292) and BAP (6882-12A01D5).\n\nFigure 9: Entropy agglomeration and other results from the experiments (see the text): (a) the example partitionings E3, with {4, 5} and {1, 3, 6} appearing in individual nodes at zero entropy; (b) synthetic data; (c) Iris flower data (PCA projection R4 → R2); (d) Galactose data (PCA projection R20 → R2), with the 12 HX and 70 RP genes in individual leaves; (e) IGO data, with the 138 countries displaying a rough ordering of continents.\n\nReferences\n\n[1] Ferguson, T. S. (1973) A Bayesian analysis of some nonparametric problems. Annals of Statistics, 1(2):209–230.\n[2] Teh, Y. W. (2010) Dirichlet processes. In Encyclopedia of Machine Learning. Springer.\n[3] Kingman, J. F. C. (1992) Poisson Processes. Oxford University Press.\n[4] Pitman, J. & Yor, M. (1997) The two-parameter Poisson-Dirichlet distribution derived from a stable subordinator. Annals of Probability, 25:855–900.\n[5] Pitman, J. (2006) Combinatorial Stochastic Processes. Lecture Notes in Mathematics. Springer-Verlag.\n[6] Sethuraman, J. (1994) A constructive definition of Dirichlet priors. Statistica Sinica, 4:639–650.\n[7] Neal, R. M. 
(2000) Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9:249–265.

[8] Meeds, E., Ghahramani, Z., Neal, R., & Roweis, S. (2007) Modelling dyadic data with binary latent factors. In Advances in Neural Information Processing Systems 19.

[9] Teh, Y. W., Jordan, M. I., Beal, M. J., & Blei, D. M. (2006) Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566–1581.

[10] Griffiths, T. L. & Ghahramani, Z. (2011) The Indian buffet process: An introduction and review. Journal of Machine Learning Research, 12:1185–1224.

[11] Broderick, T., Pitman, J., & Jordan, M. I. (2013) Feature allocations, probability functions, and paintboxes. arXiv preprint arXiv:1301.6647.

[12] Teh, Y. W., Blundell, C., & Elliott, L. T. (2011) Modelling genetic variations with fragmentation-coagulation processes. In Advances in Neural Information Processing Systems 23.

[13] Orbanz, P. & Teh, Y. W. (2010) Bayesian Nonparametric Models. In Encyclopedia of Machine Learning. Springer.

[14] Medvedovic, M. & Sivaganesan, S. (2002) Bayesian infinite mixture model based clustering of gene expression profiles. Bioinformatics, 18:1194–1206.

[15] Medvedovic, M., Yeung, K. & Bumgarner, R. (2004) Bayesian mixture model based clustering of replicated microarray data. Bioinformatics, 20:1222–1232.

[16] Liu, X., Sivaganesan, S., Yeung, K. Y., Guo, J., Bumgarner, R. E. & Medvedovic, M. (2006) Context-specific infinite mixtures for clustering gene expression profiles across diverse microarray datasets. Bioinformatics, 22:1737–1744.

[17] Shannon, C. E. (1948) A Mathematical Theory of Communication. Bell System Technical Journal, 27(3):379–423.

[18] Nemenman, I., Shafee, F., & Bialek, W. (2002) Entropy and inference, revisited.
In Advances in Neural Information Processing Systems 14.

[19] Archer, E., Park, I. M., & Pillow, J. (2013) Bayesian entropy estimation for countable discrete distributions. arXiv preprint arXiv:1302.0328.

[20] Simovici, D. (2007) On generalized entropy and entropic metrics. Journal of Multiple-Valued Logic and Soft Computing, 13(4/6):295.

[21] Ellerman, D. (2009) Counting distinctions: On the conceptual foundations of Shannon's information theory. Synthese, 168(1):119–149.

[22] Neal, R. M. (1992) Bayesian mixture modeling. In Maximum Entropy and Bayesian Methods: Proceedings of the 11th International Workshop on Maximum Entropy and Bayesian Methods of Statistical Analysis, Seattle, 1991, eds. Smith, Erickson, & Neudorfer. Dordrecht: Kluwer Academic Publishers, 197–211.

[23] Eisen, M. B., Spellman, P. T., Brown, P. O., & Botstein, D. (1998) Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences, 95(25):14863–14868.

[24] Fisher, R. A. (1936) The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(2):179–188.

[25] Ideker, T., Thorsson, V., Ranish, J. A., Christmas, R., Buhler, J., Eng, J. K., Bumgarner, R., Goodlett, D. R., Aebersold, R. & Hood, L. (2001) Integrated genomic and proteomic analyses of a systematically perturbed metabolic network. Science, 292(5518):929–934.

[26] Pevehouse, J. C., Nordstrom, T. & Warnke, K. (2004) The COW-2 International Organizations Dataset Version 2.0. Conflict Management and Peace Science, 21(2):101–119. http://www.correlatesofwar.org/COW2%20Data/IGOs/IGOv2-1.htm