{"title": "Compact Representation of Uncertainty in Clustering", "book": "Advances in Neural Information Processing Systems", "page_first": 8630, "page_last": 8640, "abstract": "For many classic structured prediction problems, probability distributions over the dependent variables can be efficiently computed using widely-known algorithms and data structures (such as forward-backward, and its corresponding trellis for exact probability distributions in Markov models). However, we know of no previous work studying efficient representations of exact distributions over clusterings. This paper presents definitions and proofs for a dynamic-programming inference procedure that computes the partition function, the marginal probability of a cluster, and the MAP clustering---all exactly. Rather than the Nth Bell number, these exact solutions take time and space proportional to the substantially smaller powerset of N. Indeed, we improve upon the time complexity of the algorithm introduced by Kohonen and Corander (2016) for this problem by a factor of N. While still large, this previously unknown result is intellectually interesting in its own right, makes feasible exact inference for important real-world small data applications (such as medicine), and provides a natural stepping stone towards sparse-trellis approximations that enable further scalability (which we also explore). In experiments, we demonstrate the superiority of our approach over approximate methods in analyzing real-world gene expression data used in cancer treatment.", "full_text": "Compact Representation of Uncertainty in Clustering\n\nCraig S. 
Greenberg 1,2\n\nNicholas Monath1\n\nAri Kobren1\n\nPatrick Flaherty3\n\nAndrew McGregor1\n\nAndrew McCallum1\n\n1College of Information and Computer Sciences, University of Massachusetts Amherst\n\n2National Institute of Standards and Technology\n\n3Department of Mathematics and Statistics, University of Massachusetts Amherst\n\n{csgreenberg,nmonath,akobren,mcgregor,mccallum}@cs.umass.edu\n\nflaherty@math.umass.edu\n\nAbstract\n\nFor many classic structured prediction problems, probability distributions over the\ndependent variables can be ef\ufb01ciently computed using widely-known algorithms\nand data structures (such as forward-backward, and its corresponding trellis for\nexact probability distributions in Markov models). However, we know of no previ-\nous work studying ef\ufb01cient representations of exact distributions over clusterings.\nThis paper presents de\ufb01nitions and proofs for a dynamic-programming inference\nprocedure that computes the partition function, the marginal probability of a cluster,\nand the MAP clustering\u2014all exactly. Rather than the N th Bell number, these exact\nsolutions take time and space proportional to the substantially smaller powerset\nof N. Indeed, we improve upon the time complexity of the algorithm introduced\nby Kohonen and Corander [11] for this problem by a factor of N. While still\nlarge, this previously unknown result is intellectually interesting in its own right,\nmakes feasible exact inference for important real-world small data applications\n(such as medicine), and provides a natural stepping stone towards sparse-trellis\napproximations that enable further scalability (which we also explore). 
In experiments, we demonstrate the superiority of our approach over approximate methods in analyzing real-world gene expression data used in cancer treatment.

1 Introduction

Probabilistic models provide a rich framework for expressing and analyzing uncertain data because they provide a full joint probability distribution rather than an uncalibrated score or point estimate. There are many well-established, simple probabilistic models, for example Hidden Markov Models (HMMs) for modeling sequences. Inference in HMMs is performed using the forward-backward algorithm, which relies on an auxiliary data structure called a trellis (a graph-based dynamic-programming table). This trellis structure serves as a compact representation of the distribution over state sequences. Many model structures compactly represent distributions and allow for efficient exact or approximate inference of joint and marginal distributions.
Clustering is a classic unsupervised learning task. Classic clustering algorithms, and even modern ones, however, only provide a point estimate of the “best” partitioning by some metric. In many applications, there are other partitions of the data that are nearly as good as the best one. Therefore, representing uncertainty in clustering can allow one to choose the most interpretable clustering from among a nearly equivalent set of options. We explore the benefits of representing uncertainty in clustering in a real-world gene expression analysis application in the experiments section.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Representing discrete distributions can be rather challenging, since the size of the support of the distribution can grow extremely rapidly. In the case of HMMs, the number of sequences that need to be represented is exponential in the sequence length.
Despite this, the forward-backward algorithm\n(i.e., belief-propagation in a non-loopy graph) performs exact inference in time linear in the size of the\nsequence multiplied by the square of the size of the state space. In the case of clustering, the problem\nis far more dif\ufb01cult. The number of clusterings of N elements, known as the N th Bell number [2],\ngrows super exponentially in the number of elements to be clustered. For example, there are more\nthan a billion ways to cluster 15 elements. An exhaustive approach would require enumerating and\nscoring each clustering. We seek a more compact representation of distributions over clusterings.\nIn this paper, we present a dynamic programming inference procedure that exactly computes the\npartition function, the marginal probability of a cluster, and the MAP clustering. Crucially, our\napproach computes exact solutions in time and space proportional to the size of the powerset of\nN, which is substantially less than the N th Bell number complexity of the exhaustive approach.\nWhile the size of the powerset of N is still large, this is a previously unknown result that on its own\nbears intellectual interest. It further acts as a stepping stone towards approximations enabling larger\nscalability and provides insight on small data sets as shown in the experiments section.\nThe approach works by creating a directed acyclic graph (DAG), where each vertex represents\nan element of the powerset and there are edges between pairs of vertices that represent maximal\nsubsets/minimal supersets of one another. We refer to this DAG as a cluster trellis. The dynamic\nprograms can operate in either a top-down or bottom up fashion on the cluster trellis, labeling vertices\nwith local partition functions and maximum values. It is also possible to read likely splits and joins\nof clusters (see Appendix M), as well as marginals from this structure. 
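As a concrete illustration, the bottom-up labeling just described can be sketched with trellis vertices encoded as bitmasks over the elements. This is a sketch under assumptions: `make_energy` below is a toy positive cluster-energy function built from pairwise affinities, standing in for whatever energy model is used; it is not the paper's specific implementation.

```python
import math
from functools import lru_cache
from itertools import combinations

def make_energy(affinity):
    """Toy cluster energy: exp(sum of pairwise affinities inside the
    cluster). Any computable positive cluster energy could be used."""
    n = len(affinity)
    def energy(mask):
        items = [i for i in range(n) if (mask >> i) & 1]
        return math.exp(sum(affinity[i][j] for i, j in combinations(items, 2)))
    return energy

def partition_function(n, energy):
    """Sum of clustering energies over all set partitions of {0..n-1},
    memoized on element subsets (trellis vertices as bitmasks), using
    the recursion Z(S) = sum over clusters C containing the lowest
    element of S of energy(C) * Z(S \\ C), with Z(empty) = 1."""
    @lru_cache(maxsize=None)
    def Z(mask):
        if mask == 0:
            return 1.0
        low = mask & -mask            # bit of the fixed element x_i
        rest = mask ^ low
        total, sub = 0.0, rest
        while True:                   # enumerate all subsets of `rest`
            cluster = sub | low       # each candidate cluster contains x_i
            total += energy(cluster) * Z(mask ^ cluster)
            if sub == 0:
                break
            sub = (sub - 1) & rest
        return total
    return Z((1 << n) - 1)
```

With all affinities zero, every clustering has energy 1, so this returns the Nth Bell number (15 for n = 4); replacing the sum with a max, and tracking argmaxes, yields a MAP clustering in the same time bound.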
These programs work in any\ncircumstance where the energy of a cluster can be computed. We prove that our programs return\nexact values and provide an analysis of their time and space complexity.\nThis paper also describes an approach to approximating the partition function, marginal probabilities,\nand MAP inference for clustering in reduced time and space. It works by performing exact compu-\ntations on a sparsi\ufb01ed version of the cluster trellis, where only a subset of the possible vertices are\nrepresented. This is analogous to using beam search [17] in HMMs. We prove that our programs\nreturn exact values for DAG-consistent partitions and that the time and space complexity are now\nmeasured in the size of the sparse cluster trellis. When not in the text, proofs of all facts and theorems\ncan be found in the Appendix.\nWe develop our method in further detail in the context of correlation clustering [1]. In correlation\nclustering, the goal is to construct a clustering that maximizes the sum of cluster energies (minus the\nsum of the across cluster energies), where a cluster energy can be computed from pairwise af\ufb01nities\namong data points. We give a dynamic program that computes the energies of all possible clusters.\nOur approach proceeds in a bottom up fashion with respect to the cluster trellis, annotating cluster\nenergies at each step. This all can be found in the Appendix.\nPrevious work has examined the related problem of computing MAP k-clusterings exactly, including\ndynamic programming approaches [8, 9, 22], as well as using fast convolutions [11]. Our method\nhas a smaller runtime complexity than using these approaches for computing the MAP clustering\nand partition function for all possible partitions (irrespective of k). Further, none of this related work\ndiscusses how to reduce complexity using approximation (as we do in Section 4), and it is unclear\nhow their work might be extended for approximation. 
The most closely related work [10] models distributions over clusterings using Perturb and MAP [16]. Unlike the Perturb and MAP approach, our work focuses on exact inference in closed form.
Being able to compactly represent probability distributions over clusterings is a fundamental problem in managing uncertainty. This paper presents a dynamic programming approach to exact inference in clustering, reducing the time complexity of the problem from super-exponential to sub-quadratic in the size of the cluster trellis.

2 Uncertainty in Clustering

Clustering is the task of dividing a dataset into disjoint sets of elements. Formally,
Definition 1. (Clustering) Given a dataset of elements, D = {x_i}_{i=1}^N, a clustering is a set of subsets, C = {C_1, C_2, . . . , C_K}, such that C_i ⊆ D, ∪_{i=1}^K C_i = D, and C_i ∩ C_j = ∅ for all C_i, C_j ∈ C, i ≠ j. Each element of C is known as a cluster.

Figure 1: A cluster trellis, T, over a dataset D = {a, b, c, d}. Each node in the trellis represents a specific cluster, i.e., subset, of D corresponding to its label. Solid lines indicate parent-child relationships. Note that a parent may have multiple children and a child may have multiple parents.

Our goal is to design data structures and algorithms for efficiently computing the probability distribution over all clusterings of D. We adopt an energy-based probability model for clustering, where the probability of a clustering is proportional to the product of the energies of the individual clusters making up the clustering. The primary assumption in energy-based clustering is that clustering energy is decomposable as the product of cluster energies.
While it is intuitive that the probability of elements being clustered together should be independent of the clustering of elements disjoint from the cluster, one could conceive of distributions that violate this assumption. An additional assumption is that exponentiating pairwise scores preserves item similarity. This is the Gibbs distribution, which has been found useful in practice [6].
Definition 2. (Energy Based Clustering) Let D be a dataset, C be a clustering of D, and E_D(C) be the energy of C. Then the probability of C with respect to D, P_D(C), is equal to the energy of C normalized by the partition function, Z(D). This gives us P_D(C) = E_D(C) / Z(D), where Z(D) = Σ_{C ∈ C_D} E_D(C). The energy of C is defined as the product of the energies of its clusters: E_D(C) = Π_{C ∈ C} E_D(C).
We use C_D to refer to all clusterings of D. In general, we assume that D is fixed and so we omit subscripts to simplify notation. Departing from convention [12], clusterings with higher energy are preferred to those with lower energy. Note that computing the membership probability of any element x_i in any cluster C_j, as is done in mixture models, is ill-suited for our goal. In particular, this computation assumes a fixed clustering, whereas our work focuses on computations performed with respect to the distribution over all possible clusterings.

3 The Cluster Trellis

Recall that our goal is to compute a distribution over the valid clusterings of an instance of energy-based clustering as efficiently as possible. Given a dataset D, a naïve first step in computing such a distribution is to iterate through its unique clusterings and, for each, compute its energy and add it to a running sum. If the number of elements is |D| = N, the number of unique clusterings is the Nth Bell number, which is super-exponential in N [14].
Note that a cluster C may appear in many clusterings of D. For example, consider the dataset D′ = {a, b, c, d}.
The cluster {a, b} appears in 2 of the clusterings of D′. More precisely, in a dataset comprised of N elements, a cluster of M elements appears in the (N − M)th Bell number of its clusterings. This allows us to make use of memoization to compute the distribution over clusterings more efficiently, in a procedure akin to variable elimination in graphical models [4, 25]. Unlike variable elimination, our procedure is agnostic to the ordering of the elimination.
To support the exploitation of this memoization approach, we introduce an auxiliary data structure we call a cluster trellis.
Definition 3. (Cluster Trellis) A cluster trellis, T, over a dataset D is a graph, (V(T), E(T)), whose vertices represent all valid clusters of elements of D. The edges of the graph connect a pair of vertices if one (the “child” node) is a maximal strict subset of the other (the “parent” node).

In this paper, we refer to a cluster trellis simply as a trellis. In more detail, each trellis vertex, v ∈ V(T), represents a unique cluster of elements; the vertices in T map one-to-one with the non-empty members of the powerset of the elements of D. We define D(v) to be the elements in the cluster represented by v. There exists an edge from v′ to v if D(v) ⊂ D(v′) and D(v′) = D(v) ∪ {x_i} for some element x_i ∈ D (or vice versa). See Figure 1 for a visual representation of a trellis over 4 elements. Each vertex stores the energy of its associated cluster, E(D(v)), which can be queried in constant time. We borrow terminology from trees and say vertex v′ is a parent of vertex v if there is an edge from v′ to v, and that vertex v″ is an ancestor of v if there is a directed path from v″ to v.

3.1 Computing the Partition Function

Computing a distribution over an event space requires computing a partition function, or normalizing constant.
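For contrast, the exhaustive computation that the trellis is designed to avoid can be sketched directly: enumerate all Bell(N) partitions and sum the products of cluster energies. This is a sketch; `cluster_energy` is a placeholder for the model's E(·), not a function defined in the paper.

```python
import math

def all_partitions(items):
    """Yield every set partition of `items` (Bell-number many)."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for smaller in all_partitions(rest):
        # place `first` into each existing cluster in turn...
        for i, cluster in enumerate(smaller):
            yield smaller[:i] + [[first] + cluster] + smaller[i + 1:]
        # ...or into a cluster of its own
        yield [[first]] + smaller

def naive_partition_function(items, cluster_energy):
    """Z(D) by brute force: sum over all clusterings of the product of
    cluster energies (Definition 2). Feasible only for tiny N."""
    return sum(
        math.prod(cluster_energy(frozenset(c)) for c in clustering)
        for clustering in all_partitions(items)
    )
```

With unit energies this simply counts clusterings: 5 for three elements, 15 for four, and more than a billion for fifteen, which is what motivates the memoized alternative below.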
We present an algorithm for computing the partition function, Z(D), with respect to all possible clusterings of the elements of D. Our algorithm uses the trellis and a particular memoization scheme to significantly reduce the computation required: from super-exponential to exponential.
The full partition function, Z(D), can be expressed in terms of cluster energies and the partition functions of a specific set of subtrellises. A subtrellis rooted at v, denoted T[v], contains all nodes in T that are descendants of v.
Formally, a subtrellis T[v] = (V(T[v]), E(T[v])) has vertices and edges satisfying the following properties: (1) V(T[v]) = {u | u ∈ V(T) ∧ D(u) ⊆ D(v)}, and (2) E(T[v]) = {(u, u′) | (u, u′) ∈ E(T) ∧ u, u′ ∈ V(T[v])}. Note that T[v] is always a valid trellis.
The following procedure not only computes Z(D), but also generalizes in such a way that the partition function with respect to clusterings of any subset D(v) ⊂ D can also be computed. We refer to the partition function for a dataset D(v) memoized at the trellis/subtrellis T[D(v)] as the partition function for the trellis/subtrellis, Z(T[D(v)]).

Algorithm 1 PartitionFunction(T, D)
  Pick x_i ∈ D
  Z(D) ← 0
  for v in V(T)^(i) do
    Let v′ be such that D(v′) = D \ D(v)
    if Z(D(v′)) has not been assigned then
      Z(D(v′)) ← PartitionFunction(T[v′], D(v′))
    Z(D) ← Z(D) + E(D(v)) · Z(D(v′))
  return Z(D)

Define V(T)^(i) = {v | v ∈ V(T) ∧ x_i ∈ D(v)}; its complement, V(T) \ V(T)^(i), is the set of all vertices whose clusters do not contain x_i. In other words, V(T)^(i) is the set of all vertices in the trellis whose clusters contain the element x_i.
Fact 1. Let v ∈ V(T) and x_i ∈ D(v). The partition function with respect to D(v) can be written recursively, with Z(D(v)) = Σ_{v_i ∈ V(T[v])^(i)} E(v_i) · Z(D(v) \ D(v_i)) and Z(∅) = 1.
Proof.
The partition function Z(D(v)) is defined as:

Z(D(v)) = Σ_{C ∈ C_{D(v)}} Π_{C′ ∈ C} E(C′)

For a given element x_i in D(v), the set of all clusterings of D(v) can be re-written to factor out the cluster containing x_i in each clustering:

C_{D(v)} = {{v_i} ∪ C | v_i ∈ V^(i), C ∈ C_{D(v) \ D(v_i)}}

Note that C_{D(v) \ D(v_i)} refers to all clusterings of the elements D(v) \ D(v_i). Using this expansion, and since E({v_i} ∪ C) = E({v_i}) E(C), we can rewrite the partition function by performing algebraic re-arrangements and applying our definitions:

Z(D(v)) = Σ_{v_i ∈ V^(i)} Σ_{C ∈ C_{D(v) \ D(v_i)}} Π_{C′ ∈ {v_i} ∪ C} E(C′)
        = Σ_{v_i ∈ V^(i)} Σ_{C ∈ C_{D(v) \ D(v_i)}} E(v_i) Π_{C′ ∈ C} E(C′)
        = Σ_{v_i ∈ V^(i)} E(v_i) Σ_{C ∈ C_{D(v) \ D(v_i)}} Π_{C′ ∈ C} E(C′)
        = Σ_{v_i ∈ V^(i)} E(v_i) Z(D(v) \ D(v_i))

As a result of Fact 1, we are able to construct a dynamic program for computing the partition function of a trellis as follows: (1) select an arbitrary element x_i from the dataset; (2) construct V(T)^(i) as defined above; (3) for each vertex v_i ∈ V(T)^(i), compute and memoize the partition function of D(v) \ D(v_i) if it is not already cached; (4) sum the partition function values obtained in step (3). The pseudocode for this dynamic program appears in Algorithm 1.
We use Algorithm 1 and Fact 1 to analyze the time and space complexity of computing the partition function. Consider a trellis T over a dataset D = {x_i}_{i=1}^N. Our goal is to compute the partition function, Z(T). When the partition functions of all subtrellises of T have already been computed, Algorithm 1 is able to run without recursion.
Fact 2. Let T be a trellis such that the partition function corresponding to each of its subtrellises is memoized and accessible in constant time. Then Z(T) can be computed by summing exactly 2^(N−1) terms.
Given that the partition function of every strict subtrellis of T (i.e., any subtrellis of T that is not equivalent to T) has been memoized and is accessible in constant time, Z(T) is computed by taking the sum of exactly 2^(N−1) terms.
We now consider the more general case, where the partition functions of all subtrellises of T have not yet been computed:
Theorem 1. Let T be a trellis over D = {x_i}_{i=1}^N. Then Z(T) can be computed in O(3^(N−1)) = O(|V(T)|^(log₂ 3)) time.
A proof of Theorem 1 can be found in the Appendix in Section E.

3.2 Finding the Maximal Energy Clustering

By making a minor alteration to Algorithm 1, we are also able to compute the value of, and find, the clustering with maximal energy. Specifically, at each vertex v in the trellis, store the clustering of D(v) with maximal energy (and its associated energy). We begin by showing that there exists a recursive form of the max-partition calculation analogous to the computation of the partition function in Fact 1.
Definition 4. (Maximal Clustering) Let v ∈ V(T) and x_i ∈ D(v). The maximal clustering over the elements of D(v), C*(D(v)), is defined as: C*(D(v)) = argmax_{C ∈ C_{D(v)}} E(C).
Fact 3. C*(D(v)) can be written recursively as C*(D(v)) = argmax_{v′ ∈ V(T[v])^(i)} E(v′) · E(C*(D(v) \ D(v′))).
In other words, the maximal energy over the set of elements D(v) can be written as the energy of the cluster C containing x_i in that clustering multiplied by the maximal energy over the remaining elements D(v) \ C.
Using this recursive definition, we modify Algorithm 1 to compute the maximum clustering instead of the partition function, resulting in Algorithm 2 (in Appendix). The correctness of this algorithm is demonstrated by Fact 3. We can now analyze the time complexity of the algorithm. We use similar memoization notation for the algorithm, where C*(T[D(v)]) is the memoized value of C*(D(v)) stored at v.
Fact 4. Let T_D be a trellis over D = {x_i}_{i=1}^N.
Then C*(T_D) can be computed in O(3^(N−1)) time.

3.3 Computing Marginals

The trellis facilitates the computation of two types of cluster marginals. First, the trellis can be used to compute the probability of a specific cluster, D(v), with respect to the distribution over all possible clusterings; second, it can be used to compute the probability that any group of elements, X, is clustered together. We begin by analyzing the first style of marginal computation, as it is used in computing the second.
Let C^(v) ⊆ C_D be the set of clusterings that contain the cluster D(v). Then the marginal probability of D(v) is given by P(D(v)) = Σ_{C ∈ C^(v)} E(C) / Z(D), where Z(D) is the partition function with respect to the full trellis described in Section 2. This probability can be re-written in terms of the complement of D(v):

P(D(v)) = Σ_{C ∈ C^(v)} E(C) / Z(D)
        = Σ_{C ∈ C^(v)} E(D(v)) E(C \ D(v)) / Z(D)
        = E(D(v)) Σ_{C′ ∈ C_{D \ D(v)}} E(C′) / Z(D)
        = E(D(v)) Z(D \ D(v)) / Z(D)

Note that if Z(D \ D(v)) were memoized during Algorithm 1, then computing the marginal probability requires constant time and space equal to the size of the trellis. This is only true for clusters whose complements do not contain element x_i (the element selected to compute Z(D) in Algorithm 1), which holds for |V(T)|/(2|V(T)| − 1) of the vertices in the trellis. Otherwise, we may need to repeat the calculation from Algorithm 1 to compute Z(D \ D(v)). We note that, due to memoization, the complexity of computing the partition functions of the remaining vertices is no greater than the complexity of Algorithm 1.
This machinery makes it possible to compute the second style of marginal. Given a set of elements, X, the marginal probability of the elements of X being clustered together is: P(X) = Σ_{D(v) ∈ T : X ⊆ D(v)} P(D(v)).
The marginal probability of the elements of X is distinct from the marginal probability of a cluster in that P(X) sums the marginal probabilities of all clusters that include all elements of X. Once the marginal probability of each cluster is computed, the marginal probability of any set of elements being clustered together can be computed in time and space linear in the size of the trellis.

4 The Sparse Trellis

The time to compute the partition function scales sub-quadratically with the size of the trellis (Theorem 1). Unfortunately, the size of the trellis scales exponentially with the size of the dataset, which limits the use of the trellis in practice. In this section, we show how to approximate the partition function and maximal partition using a sparse trellis, which is a trellis with some nodes omitted. Increasing the sparsity of a trellis enables the computation of approximate clustering distributions for larger datasets.
Definition 5. (Sparse Trellis) Given a trellis T = (V(T), E(T)), define a sparse trellis with respect to T to be any T̂ = (V̂, Ê) satisfying the following properties: V̂ ≠ ∅, V̂ ⊂ V(T), and Ê = {(v, v′) | D(v′) ⊂ D(v) ∧ ∄u : D(v′) ⊂ D(u) ⊂ D(v)}.
Note that there exist a number of sparse trellises that contain no valid clusterings. As an example, consider T̂ = (V̂ = {v_1, v_2, v_3}, Ê = ∅) where D(v_1) = {a, b}, D(v_2) = {b, c}, and D(v_3) = {a, c}.
For ease of analysis, we focus on a specific family of sparse trellises, which are closed under recursive complement.¹ This property ensures that the trellises contain only valid partitions. For trellises in this family, we show that the partition function and the clustering with maximal energy can be computed using algorithms similar to those described in Section 3. Since these algorithms have complexity measured in the number of nodes in the trellis, their efficiency improves with trellis-sparsity.
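One way to sketch exact computation on a sparse trellis is to run the same subset recursion but skip clusters whose vertices have been pruned, which amounts to giving them zero energy. This is an illustrative sketch, not the paper's implementation: `allowed` holds the bitmasks of the retained cluster vertices, and `energy` is any computable positive cluster energy.

```python
from functools import lru_cache

def sparse_partition_function(n, energy, allowed):
    """Partition function restricted to a sparse trellis: `allowed`
    contains the bitmasks of clusters kept in the trellis; any other
    cluster is treated as having zero energy, so every clustering that
    uses it contributes nothing to the sum."""
    @lru_cache(maxsize=None)
    def Z(mask):
        if mask == 0:
            return 1.0
        low = mask & -mask            # fix the lowest remaining element
        rest = mask ^ low
        total, sub = 0.0, rest
        while True:                   # enumerate candidate clusters
            cluster = sub | low
            if cluster in allowed:    # skip pruned trellis vertices
                total += energy(cluster) * Z(mask ^ cluster)
            if sub == 0:
                break
            sub = (sub - 1) & rest
        return total
    return Z((1 << n) - 1)
```

For three elements with unit energies, keeping every non-empty subset gives Z = 5 (all clusterings), while keeping only the three singletons and the full set gives Z = 2, since only the all-singletons clustering and the single-cluster clustering survive the sparsification.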
We also present the family of tree-structured sparse trellises, with tree-specific partition function and max-partition algorithms. The more general family of all sparse trellises is also discussed briefly.
The key challenge of analyzing a sparse trellis, T̂, is how to treat any cluster C that is not represented by a vertex in T̂, i.e., for which there is no v ∈ T̂ with D(v) = C. Although there are several feasible approaches to reasoning about such clusters, in this paper we assume that any cluster that is not represented by a vertex in T̂ has zero energy. Since the energy of a clustering, C, is the product of its clusters' energies (Definition 2), E(C) = 0 if it contains one or more clusters with zero energy.

¹A set of sets, S, is closed under recursive complement iff ∀x, y ∈ S, x ⊂ y =⇒ ∃z ∈ S : x ∪ z = y ∧ x ∩ z = ∅.

4.1 Approximating the Partition Function and Max Partition

Given a sparse trellis, T̂, we are able to compute the partition function by using Algorithm 1.
Fact 5. Let T̂ = (V̂, Ê) be a sparse trellis whose vertices are closed under recursive complement. Then Algorithm 1 computes Z(T̂) in O(|T̂|^(log₂ 3)) time.
If T̂ is not closed under recursive complement, we cannot simply run Algorithm 1, because not all vertices for which the algorithm must compute energy (or the partition function) are guaranteed to exist. How to compute the partition function using such a trellis is an area of future study.
Given a sparse trellis, T̂, closed under recursive complement, we are able to compute the max partition by using Algorithm 2. Doing so takes O(|T̂|^(log₂ 3)) time and O(|T̂|) space.
The correctness\nand complexity analysis is the same as in Section 4.1.\nThe often-used hierarchical (tree structured) clustering encompasses one family of sparse trellises.\nAlgorithms for tree structured trellises can be found in the Appendix in Section J.\n5 Experiments\n\nIn this section, we demonstrate the utility of the cluster trellis via experiments on real-world gene\nexpression data. To begin, we provide a high-level background on cancer subtypes to motivate the\nuse of our method in the experiment in Section 5.3.\n5.1 Background\nFor an oncologist, determining a prognosis and constructing a treatment plan for a patient is dependent\non the subtype of that patient\u2019s cancer [13]. This is because different subtypes react well to some\ntreatments, for example, to radiation and not chemotherapy, and for other subtypes the reverse is true\n[20]. For example, basal and erbB2+ subtypes of breast cancer are more sensitive to paclitaxel- and\ndoxorubicin-containing preoperative chemotherapy (approx. 45% pathologic complete response) than\nthe luminal and normal-like cancers (approx. 6% pathologic complete response)[18]. Unfortunately,\nidentifying cancer subtypes is often non-trivial. One common method of learning about a patient\u2019s\ncancer subtype is to cluster their gene expression data along with other available expression data for\nwhich previous treatments and treatment outcomes are known [21].\n5.2 Data & Methods\nWe use breast cancer transcriptome pro\ufb01ling (FPKM-UQ) data from The Cancer Genome Atlas\n(TCGA) because much is known about the gene expression patterns of this cancer type, yet there is\nheterogeneity in the clinical response of patients who are classi\ufb01ed into the same subtype by standard\napproaches [23].\nThe data are subselected for African American women with Stage I breast cancer. 
We select African American women because there is a higher prevalence of the basal-like subtype among premenopausal African American women [15] and there is some evidence that there is heterogeneity (multiple clusters) even within this subtype [23]. Stage I breast cancer patients were selected because of the prognostic value in distinguishing aggressive subtypes from non-aggressive subtypes at an early stage.
Despite the considerable size of TCGA, there are only 11 samples meeting these basic, necessary inclusion/exclusion criteria. Each of the 11 samples is a 20,000-dimensional feature vector, where each dimension is a measure of how strongly a given gene is expressed. We begin by sub-selecting the 3000 features with greatest variance across the samples. We then add an infinitesimal value prior to taking the log of the remaining features, since genome expression data is believed to be normally distributed in log-space [19]. A similar data processing was shown to be effective in prior work [19].
We use correlation clustering as the energy model. Pairwise similarities are exponentiated negative Euclidean distances. We subtract from each the mean pairwise similarity so that similarities are both positive and negative. We then compute the marginal probabilities for each pair (i.e., the probability

Figure 2: For each pair of patients with Stage I cancer, we plot the energy and marginal probability of the pair being in the same cluster as described in Section 5.3.

Figure 3: The approximate vs. exact pairwise marginals for each pair of gene expressions. Approximate marginals are computed using a Perturb-and-MAP based method [10].

Figure 4: Heatmap of the pairwise energies between the patients. The pair 74ca and d6fa has an energy of -4.7, 74ca and 62da have 91.09, and d6fa and 62da have 44.5.

Figure 5: Heatmap of the marginal probability that a pair will be clustered together.
Patients 74ca and d6fa have a pairwise marginal that is nearly one, despite having a low pairwise energy.

that the two samples appear in the same cluster). See Section 3.3 for how to compute these values using the trellis.

5.3 Model Evaluation using Marginals

One method for evaluating a set of cancer subtype clustering models is to identify pairs of samples that the evaluator believes should be clustered together and inspect their pairwise energies. However, high pairwise energies do not necessarily mean the points will be clustered together by the model (which considers how the pairs' cluster assignment impacts the rest of the data). Similarly, a low pairwise energy does not necessarily mean the two samples will not be clustered together. The pairwise marginal, on the other hand, exactly captures the probability that the model will place the two samples in the same cluster. We test whether the corresponding unnormalized pairwise energies or a simple approximation of the marginals could reasonably be used as a proxy for exact pairwise marginals.

5.3.1 Pairwise Energies vs. Marginals & Exact vs. Approximate Marginals

Figure 2 plots the pairwise log energies vs. pairwise log marginals of the sub-sampled TCGA data². The pairwise scores and marginals are not strongly correlated, which suggests that unnormalized pairwise energies cannot reasonably be used as a proxy for pairwise marginals. For example, the sample pair of patients (partial id numbers given) 74ca and d6fa have an energy of -4.7 (low), but a pairwise marginal that is nearly one. This is because both 74ca and d6fa have high energy with sample 62da, with pairwise energies 91.09 (the fourth largest) and 44.5, respectively.
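The Perturb-and-MAP-style approximation referenced in Figure 3 can be sketched as follows. This is an illustration only, under stated assumptions: Gumbel(0, 1) noise is added to pairwise affinities, a brute-force enumeration stands in for the MAP routine (so it only runs for tiny N), and the energy is a simple sum of within-cluster affinities; none of these names come from the paper.

```python
import math
import random
from itertools import combinations

def all_partitions(items):
    """Yield every set partition of `items`."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for smaller in all_partitions(rest):
        for i, cluster in enumerate(smaller):
            yield smaller[:i] + [[first] + cluster] + smaller[i + 1:]
        yield [[first]] + smaller

def pam_pairwise_marginals(n, affinity, samples=200, seed=0):
    """Estimate P(i and j co-clustered) by sampling: perturb each
    pairwise affinity with Gumbel(0, 1) noise, take the MAP clustering
    under the noisy energies (brute force here), and count how often
    each pair lands in the same cluster."""
    rng = random.Random(seed)
    pairs = list(combinations(range(n), 2))
    counts = {p: 0 for p in pairs}
    def gumbel():
        u = rng.random()
        while u <= 0.0:               # guard against log(0)
            u = rng.random()
        return -math.log(-math.log(u))
    for _ in range(samples):
        noisy = {p: affinity[p] + gumbel() for p in pairs}
        best, best_score = None, -math.inf
        for clustering in all_partitions(list(range(n))):
            score = sum(noisy[(i, j)]
                        for cluster in clustering
                        for i, j in combinations(sorted(cluster), 2))
            if score > best_score:
                best, best_score = clustering, score
        for cluster in best:
            for p in combinations(sorted(cluster), 2):
                counts[p] += 1
    return {p: c / samples for p, c in counts.items()}
```

Because each sample re-solves a noisy MAP problem, strongly attractive pairs get estimates near one and strongly repulsive pairs near zero, but, as the comparison above shows, the estimates for intermediate pairs can deviate substantially from the exact marginals.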
Figures 4 and 5 visualize the pairwise energies and pairwise marginals, respectively.

²The MAP clustering of the dataset is in the Appendix in Section P.

We also explore the extent to which an approximate method can accurately capture pairwise marginals. We use an approach similar to Perturb-and-MAP [10]. We sample clusterings by adding Gumbel-distributed noise to the pairwise energies and using Algorithm 2 to find the maximal clustering with the modified energies. We approximate the marginal probability of a given pair being clustered together by measuring how many of these sampled clusterings contain the pair in the same cluster. Figure 3 plots the approximate vs. exact pairwise marginal for each pair of points in the dataset. The figure shows that the approximate method overestimates many of the pairwise marginals. Like the pairwise scores (rather than exact marginals), using the approximate marginals in practice may lead to errors in data analysis.

6 Related Work

While there is, to the best of our knowledge, no prior work on compact representations for exactly computing distributions over clusterings, there is a small amount of related work on computing the MAP k-clustering exactly, as well as a wide array of related work in approximate methods, graphical models, probabilistic models for clustering, and clustering methods.
The first dynamic programming approach to computing the MAP k-clustering was given in [9], which focuses on minimizing the sum of square distances within clusters.
It works by considering the distributional form of the clusterings, i.e., all possible sizes of the clusters that comprise the clustering, and defines "arcs" between them. However, no full specification of the dynamic program is given and, as the author notes, many redundant computations are required, since many clusterings share the same distributional form. In [8], the first implementation is given, with some of the redundancies removed, and the implementation and amount of redundancy are further improved upon in [22]. In each of these cases, the focus is on finding the best k-clustering, which can be done using these methods in O(3^n) time. These methods can also be used to find the MAP clustering over all k; however, doing so would take O(n · 3^n) time, which is worse than our O(3^n) result.

In [11], the authors use fast convolutions to compute the MAP k-clustering and the k-partition function. Fast convolutions use a Möbius transform and Möbius inversion on the subset lattice to compute the convolution in Õ(n² 2^n) time. It would seem promising to use this directly in our work; however, our algorithm divides the subset lattice in half, which prevents us from applying the fast transform directly. The authors note that, similar to the above dynamic programming approaches, their method can be used to compute the clustering partition function and MAP clustering in O(n · 3^n) time, which is larger than our O(3^n) result.
Their use of convolutions to compute posteriors of k-clusterings also implies the existence of an Õ(n³ 2^n) algorithm to compute the pairwise posterior matrix, i.e., the probability that items i and j are clustered together, though the authors mention that, due to numerical instability issues, using fast convolutions to compute the pairwise posterior matrix is only faster in theory.

Recently proposed perturbation-based methods [10] approximate distributions over clusterings as well as marginal distributions over clusters. They use the Perturb-and-MAP approach [16], originally proposed by Papandreou and Yuille, which is based on adding Gumbel-distributed noise to the clustering energy function. Unfortunately, for Perturb-and-MAP to approach the exact distribution, independent samples from the Gumbel distribution must be added to each clustering energy, which would require a super-exponential number of draws. To overcome this, Kappes et al. [10] propose adding Gumbel noise to the pairwise real-valued affinity scores, thus requiring fewer draws but introducing some dependence among samples. They must also perform an outer relaxation in order to obtain a computable bound for the log partition function. As a result, the method approaches a distribution with unknown approximation bounds.

7 Conclusion

In this paper, we present a data structure and dynamic-programming algorithm to compactly represent and compute probability distributions over clusterings. We believe this to be the first work on efficient representations of exact distributions over clusterings. We reduce the computation cost of the naïve exhaustive method from the Nth Bell number to sub-quadratic in the substantially smaller powerset of N.
We demonstrate how this result is a first step towards practical approximations enabling greater scalability, and show a case study of the method applied to correlation clustering.

Acknowledgments

We thank the anonymous reviewers for their constructive feedback. This work was supported in part by the Center for Intelligent Information Retrieval, in part by DARPA under agreement number FA8750-13-2-0020, in part by the National Science Foundation Graduate Research Fellowship under Grant No. NSF-1451512, and in part by National Science Foundation Grant 1637536. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the sponsor.

References

[1] Nikhil Bansal, Avrim Blum, and Shuchi Chawla. Correlation clustering. Machine Learning, 2004.

[2] E. T. Bell. Exponential polynomials. Annals of Mathematics, 1934.

[3] Charles Blundell, Yee Whye Teh, and Katherine A Heller. Bayesian rose trees. Conference on Uncertainty in Artificial Intelligence, 2010.

[4] Rina Dechter. Bucket elimination: A unifying framework for probabilistic inference. 1999.

[5] Dua Dheeru and Efi Karra Taniskidou. UCI machine learning repository, 2017.

[6] Stuart Geman and Donald Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1984.

[7] Katherine A Heller and Zoubin Ghahramani. Bayesian hierarchical clustering. International Conference on Machine Learning, 2005.

[8] Lawrence Hubert, Phipps Arabie, and Jacqueline Meulman. Combinatorial data analysis: Optimization by dynamic programming. Society for Industrial and Applied Mathematics, 2001.

[9] Robert E. Jensen. A dynamic programming algorithm for cluster analysis. Operations Research, 1969.

[10] Jörg Hendrik Kappes, Paul Swoboda, et al.
Probabilistic correlation clustering and image partitioning using perturbed multicuts. International Conference on Scale Space and Variational Methods in Computer Vision, 2015.

[11] Jukka Kohonen and Jukka Corander. Computing exact clustering posteriors with subset convolution. Communications in Statistics - Theory and Methods, 2016.

[12] Yann LeCun, Sumit Chopra, and Raia Hadsell. A tutorial on energy-based learning. 2006.

[13] Brian D Lehmann and Jennifer A Pietenpol. Identification and use of biomarkers in treatment strategies for triple-negative breast cancer subtypes. The Journal of Pathology, 2014.

[14] László Lovász. Combinatorial problems and exercises. 1993.

[15] Cancer Genome Atlas Network et al. Comprehensive molecular portraits of human breast tumours. Nature, 2012.

[16] George Papandreou and Alan L Yuille. Perturb-and-MAP random fields: Using discrete optimization to learn and sample from energy models. International Conference on Computer Vision, 2011.

[17] D Raj Reddy et al. Speech understanding systems: A summary of results of the five-year research effort. Department of Computer Science, 1977.

[18] Roman Rouzier, Charles M Perou, W Fraser Symmans, et al. Breast cancer molecular subtypes respond differently to preoperative chemotherapy. Clinical Cancer Research, 2005.

[19] Hachem Saddiki, Jon McAuliffe, and Patrick Flaherty. GLAD: a mixed-membership model for heterogeneous tumor subtype classification. Bioinformatics, 2014.

[20] Therese Sørlie, Charles M Perou, Robert Tibshirani, et al. Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications. Proceedings of the National Academy of Sciences, 2001.

[21] Therese Sørlie, Robert Tibshirani, Joel Parker, et al. Repeated observation of breast tumor subtypes in independent gene expression data sets.
Proceedings of the National Academy of Sciences, 2003.

[22] BJ Van Os and Jacqueline J Meulman. Improving dynamic programming strategies for partitioning. Journal of Classification, 2004.

[23] Ozlem Yersal and Sabri Barutca. Biological subtypes of breast cancer: Prognostic and therapeutic implications. World Journal of Clinical Oncology, 2014.

[24] Giacomo Zanella, Brenda Betancourt, Hanna Wallach, Jeffrey Miller, Abbas Zaidi, and Rebecca C Steorts. Flexible models for microclustering with application to entity resolution. Advances in Neural Information Processing Systems, 2016.

[25] Nevin Lianwen Zhang and David Poole. Exploiting causal independence in Bayesian network inference. Journal of Artificial Intelligence Research, 1996.