{"title": "Identifying Structure across Pre-partitioned Data", "book": "Advances in Neural Information Processing Systems", "page_first": 489, "page_last": 496, "abstract": "", "full_text": " \n\n \n\nIdentifying Structure across Pre-\n\npartitioned Data \n\nZvika Marx \n\nNeural Computation Center \n\nThe Hebrew University \nJerusalem, Israel, 91904 \n\nIdo Dagan \n\nDepartment of CS \nBar-Ilan University \n\nRamat-Gan, Israel, 52900 \n\nEli Shamir \nSchool for CS \n\nThe Hebrew University \nJerusalem, Israel, 91904 \n\nAbstract \n\nWe propose an information-theoretic clustering approach that \nincorporates a pre-known partition of the data, aiming to identify \ncommon clusters that cut across the given partition. In the standard \nclustering setting the formation of clusters is guided by a single \nsource of feature information. The newly utilized pre-partition \nfactor introduces an additional bias that counterbalances the impact \nof the features whenever they become correlated with this known \npartition. The resulting algorithmic framework was applied \nsuccessfully to synthetic data, as well as to identifying text-based \ncross-religion correspondences. \n\nIntroduction \n\n1 \nThe standard task of feature-based data clustering deals with a single set of elements \nthat are characterized by a unified set of features. The goal of the clustering task is \nto identify implicit constructs, or themes, within the clustered set, grouping together \nelements that are characterized similarly by the features. In recent years there has \nbeen growing interest in more complex clustering settings, in which additional \ninformation is incorporated [1], [2]. Several such extensions ([3]-[5]) are based on \ninformation bottleneck (IB) framework [6], which facilitates coherent \nthe \ninformation-theoretic representation of different information types. \nIn a recent line of research we have investigated the cross-dataset clustering task \n[7], [8]. 
In this setting, some inherent a-priori partition of the clustered data into distinct subsets is given. The clustering goal is to identify corresponding (analogous) structures that cut across the different subsets, while ignoring internal structures that characterize individual subsets. To accomplish this task, those features that commonly characterize elements across the different subsets guide the clustering process, while within-subset regularities are neutralized. \n\nIn [7], we presented a distance-based hard clustering algorithm for the coupled-clustering problem, in which the clustered data is pre-partitioned into two subsets. In [8], our setting, generalized to pre-partitions of any number of subsets, was addressed by a heuristic extension of the probabilistic IB algorithm, yielding improved empirical results. Specifically, the algorithm in [8] was based on a modification of the IB stable-point equation, which amplified the impact of features characterizing a formed cluster across all, or most, subsets. \n\nThis paper describes an information-theoretic framework that motivates and extends the algorithm proposed in [8]. The given pre-partitioning is represented via a probability distribution variable, which may represent \u201csoft\u201d pre-partitioning of the data, versus the strictly disjoint subsets assumed in the earlier cross-dataset framework. Further, we present a new functional that captures the cross-partition motivation. From the new functional, we derive a stable-point equation underlying our algorithmic framework in conjunction with the corresponding IB equation. \n\nOur algorithm was tested empirically on synthetic data and on a real-world text-based task that aimed to identify corresponding themes across distinct religions. We have cross-clustered five sets of keywords that were extracted from topical corpora of texts about Buddhism, Christianity, Hinduism, Islam and Judaism. 
Unlike standard clustering results, our algorithm reveals themes that are common to all religions, such as sacred writings, festivals, narratives and myths, and theological principles, and avoids topical clusters that correspond to individual religions (for example, \u2018Christmas\u2019 and \u2018Easter\u2019 are clustered together with \u2018Ramadan\u2019 rather than with \u2018Church\u2019). \n\nFinally, we have paid specific attention to the framework of clustering with side information [4]. While this approach was presented with a somewhat different setting in mind, it might be used directly to address clustering across pre-partitioned data. We compare the technical details of the two approaches and demonstrate empirically that clustering with side information does not seem appropriate for the kind of cross-partition tasks that we explored. \n\n2 The Information Bottleneck Method \n\nProbabilistic (\u201csoft\u201d) data clustering outputs, for each element x of the set being clustered and each cluster c, an assignment probability p(c|x). The IB method [6] interprets probabilistic clustering as lossy data compression. The given data is represented by a random variable X ranging over the clustered elements. X is compressed through another random variable C, ranging over the clusters. Every element x is characterized by a conditional probability distribution p(Y|x), where Y is a third random variable taking the members y of a given set of features as values. The IB method formalizes the clustering task as minimizing the IB functional: \n\nL(IB) = I(C; X) \u2212 \u03b2 I(C; Y) .   (1) \n\nAs known from information theory (Ch. 13 of [9]), minimizing the mutual information I(C; X) optimizes the distorted compression rate. A complementary bias to maximize I(C; Y) is interpreted in [6] as articulating the level of relevance of Y to the obtained clustering, inferred from the level by which C can predict Y. 
\u03b2 is a free parameter counterbalancing the two biases. It is shown in [6] that the p(c|x) values that minimize L(IB) satisfy the following equation: \n\np(c|x) = (1 / z(\u03b2, x)) p(c) exp(\u2212\u03b2 DKL[p(Y|x) || p(Y|c)]) ,   (2) \n\nwhere DKL stands for the Kullback-Leibler (KL) divergence, or relative entropy, between two distributions, and z(\u03b2, x) is a normalization function over C. Eq. (2) implies that, optimally, x is assigned to c with probability that decays exponentially with their KL distance in a feature distribution space, where the distribution p(Y|c) takes the role of a representative, or centroid, of c. The feature variable Y is hence utilized as the (exclusive) means to guide clustering, beyond the random nature of compression. \n\nStart at time t = 0 and iterate the following update-steps, till convergence: \nIB1: pt(c|x): initialize randomly or arbitrarily (t = 0); pt(c|x) \u221d pt\u22121(c) exp(\u2212\u03b2 DKL[p(Y|x) || pt\u22121(Y|c)]) (t > 0) \nIB2: pt(c) = \u2211x pt(c|x) p(x) \nIB3: pt(y|c) = (1 / pt(c)) \u2211x pt(c|x) p(y|x) p(x) \n\nFigure 1: The Information Bottleneck iterative algorithm (with fixed \u03b2 and |C|). \n\nFigure 1 presents the IB iterative algorithm for a fixed value of \u03b2. The IB1 update step follows Eq. (2). The other two steps, which are derived from the IB functional as well, estimate the p(c) and p(y|c) values required for the next iteration. The algorithm converges to a local minimum of the IB functional. The IB setting, particularly the derivation of steps IB1 and IB3 of the algorithm, assumes that Y and C are independent given X, that is: I(C; Y|X) = \u2211x p(x) I(C|x; Y|x) = 0. 
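As an illustration, the update steps of Figure 1 can be sketched in NumPy. This is a minimal sketch under our own naming conventions (discrete distributions passed as arrays), not the authors' implementation:

```python
import numpy as np

def ib_iterate(p_x, p_y_given_x, n_clusters, beta, n_iter=200, seed=0):
    """One run of the iterative IB algorithm (Fig. 1) for fixed beta and |C|.

    p_x         : (n_x,) marginal over the clustered elements
    p_y_given_x : (n_x, n_y) rows are the feature distributions p(Y|x)
    Returns p(c|x) of shape (n_x, n_clusters).
    """
    rng = np.random.default_rng(seed)
    n_x = len(p_x)
    eps = 1e-12
    # IB1, t = 0: random initialization of p(c|x)
    p_c_given_x = rng.random((n_x, n_clusters))
    p_c_given_x /= p_c_given_x.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # IB2: p(c) = sum_x p(c|x) p(x)
        p_c = p_x @ p_c_given_x
        # IB3: p(y|c) = (1/p(c)) sum_x p(c|x) p(y|x) p(x)
        p_y_given_c = (p_c_given_x * p_x[:, None]).T @ p_y_given_x
        p_y_given_c /= p_c[:, None]
        # IB1, t > 0: p(c|x) proportional to p(c) exp(-beta KL[p(Y|x)||p(Y|c)])
        kl = np.einsum('xy,xcy->xc', p_y_given_x,
                       np.log((p_y_given_x[:, None, :] + eps)
                              / (p_y_given_c[None, :, :] + eps)))
        p_c_given_x = p_c[None, :] * np.exp(-beta * kl)
        p_c_given_x /= p_c_given_x.sum(axis=1, keepdims=True)
    return p_c_given_x
```

In a full run this inner loop would sit inside the deterministic-annealing scheme described below, which gradually increases beta over repeated runs.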
\n\nThe balancing parameter \u03b2 affects the number of distinct clusters being formed in a manner that resembles (inverse) temperature in physical systems. The higher \u03b2 is (i.e., the stronger the bias to construct C that predicts Y well), the more distinct clusters are required for encoding the data. For each |C| = 2, 3, \u2026, there is a minimal \u03b2 value enabling the formation of |C| distinct clusters. Setting \u03b2 to be smaller than this critical value corresponding to the current |C| would result in two or more clusters that are identical to one another. Based on this, the iterative algorithm is applied repeatedly within a gradual cooling-like (deterministic annealing) scheme: starting with random initialization of the p0(c|x)'s, generate two clusters with the critical \u03b2 value, found empirically, for |C| = 2. Then, use a perturbation on the obtained two-cluster configuration to initialize the p0(c|x)'s for a larger set of clusters and execute additional runs of the algorithm to identify the critical \u03b2 value for the larger |C|. And so on: each output configuration is used as a basis for a more granular one. The final outcome is a \u201csoft hierarchy\u201d of probabilistic clusters. \n\n3 Cross-partition Clustering \n\nCross-partition (CP) clustering introduces a factor \u2013 a pre-given partition of the clustered data \u2013 in addition to what is considered in the standard clustering setting. For representing this factor we introduce the pre-partitioning variable W, ranging over all parts w of the pre-given partition. Every data element x is associated with W through a given probability distribution p(W|x). Our goal is to cluster the data so that the clusters C would not be correlated with W. We notice that Y, which is intended to direct the formation of clusters, might be a-priori correlated with W, so the formed clusters might end up being correlated with W as well. 
Our method aims at eliminating this aspect of Y. \n\n3.1 Information Defocusing \n\nAs noted, some of the information conveyed by Y characterizes structures correlated with W, while the other part of the information characterizes the target cross-W structures. We are interested in detecting the latter while filtering out the former. However, there is no direct a-priori separation between the two parts of the Y-mediated information. Our strategy for tackling this difficulty is to follow Y's directions in general, as the IB method does, while avoiding Y's impact whenever it entails undesired inter-dependencies of C and W. \n\nOur strategy implies conflicting biases with regard to the mutual information I(C; Y): it should be maximized in order to form meaningful clusters, but be minimized as well in the specific context where Y entails C\u2013W dependencies. Accordingly, we propose a computational procedure directed by two distinct cost-terms in tandem. The first one is the IB functional (Eq. 1), introducing the bias to maximize I(C; Y). With this bias alone, Y might dictate (or \u201cexplain\u201d, in retrospect) substantial C\u2013W dependencies, implying a low I(C;W|Y) value.1 Hence, the guideline of preventing Y from accounting for C\u2013W dependencies is realized through an opposing bias of maximizing I(C;W|Y) = \u2211y p(y) I(C|y; W|y). The second cost term \u2013 the Information Defocusing (ID) functional \u2013 consequently counterbalances minimization of I(C; Y) against the new bias: \n\nL(ID) = I(C; Y) \u2212 \u03b7 I(C;W|Y) ,   (3) \n\nwhere \u03b7 is a free parameter articulating the tradeoff between the biases. The ID functional captures our goal of reducing the impact of Y selectively, \u201cdefocusing\u201d a specific aspect of the information Y conveys: the information correlated with W. \n\nIn a like manner to the stable-point equation of the IB functional (Eq. 
2), we derive the following stable-point equation for the ID functional: \n\np(c|y) = (1 / z(\u03b7, y)) p(c) \u220fw p(y|c,w)^(p(w) \u03b7/(1+\u03b7)) ,   (4) \n\nwhere z(\u03b7, y) is a normalization function over C. The derivation relies on an additional assumption, I(C;W) = 0, imposing the intended independence between C and W (the detailed derivation will be described elsewhere). \n\nThe intuitive interpretation of Eq. (4) is as follows: a feature y is to be associated with a cluster c in proportion to a weighted, though flattened, geometric mean of the \u201cW-projected centroids\u201d p(y|c,w), priored by p(c).2 This scheme overweighs y's that contribute to c evenly across W. Thus, clusters satisfying Eq. (4) are situated around centroids biased towards evenly contributing features. The higher \u03b7 is, the heavier the emphasis put on suppressing disagreements between the w's. For \u03b7 \u2192 \u221e a plain weighted geometric-mean scheme is obtained. The inclusion of a step derived from Eq. (4) in our algorithm (see below) facilitates convergence on a configuration with centroids dominated by features that are evenly distributed across W. \n\n3.2 The Cross-partition Clustering Algorithm \n\nOur proposed cross-partition (CP) clustering algorithm (Fig. 2) seeks a clustering configuration that optimizes simultaneously both the IB and ID functionals, \n\n1 Notice that \u201cZ explaining well the dependencies between A and B\u201d is equivalent to \u201cA and B sharing little information in common given Z\u201d, i.e., low I(A;B|Z). Complete conditional independence is exemplified in the IB framework, assuming I(C;Y|X) = 0. \n2 Eq. (4) resembles our suggestion in [8] to compute a geometric average over the subsets; in the current paper this scheme is analytically derived from the ID functional. 
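To make the flattened weighted geometric-mean scheme of Eq. (4) concrete, here is a small NumPy sketch. Function and array names are our own illustrative choices, not the authors' code:

```python
import numpy as np

def id_update(p_c, p_y_given_cw, p_w, eta):
    """ID stable-point update (Eq. 4): p(c|y) as a flattened, p(w)-weighted
    geometric mean of the W-projected centroids p(y|c,w), priored by p(c).

    p_c          : (n_c,) cluster prior
    p_y_given_cw : (n_c, n_w, n_y) centroids p(y|c,w), one per part w
    p_w          : (n_w,) weights of the pre-partition parts
    Returns p(c|y) of shape (n_y, n_c), each row normalized over C.
    """
    flatten = eta / (1.0 + eta)   # exponent p(w)*eta/(1+eta) -> p(w) as eta -> inf
    eps = 1e-12
    # log score = log p(c) + (eta/(1+eta)) * sum_w p(w) log p(y|c,w)
    log_score = (np.log(p_c + eps)[None, :]
                 + flatten * np.einsum('w,cwy->yc', p_w,
                                       np.log(p_y_given_cw + eps)))
    # z(eta, y): normalize over C (in log space for numerical stability)
    p_c_given_y = np.exp(log_score - log_score.max(axis=1, keepdims=True))
    p_c_given_y /= p_c_given_y.sum(axis=1, keepdims=True)
    return p_c_given_y
```

In the limit of a very large eta the exponent approaches p(w), recovering the plain weighted geometric-mean scheme mentioned in the text.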
\n\nStart at time t = 0 and iterate the following update-steps, till convergence: \nCP1: pt(c|x): initialize randomly or arbitrarily (t = 0); pt(c|x) \u221d pt\u22121(c) exp(\u2212\u03b2 DKL[p(Y|x) || pt\u22121(Y|c)]) (t > 0) \nCP2: pt(c) = \u2211x pt(c|x) p(x) \nCP3: p*t(y|c,w) = (1 / (pt(c) p(w))) \u2211x pt(c|x) p(y|x) p(w|x) p(x) \nCP4: p*t(c): initialize randomly or arbitrarily (t = 0); p*t(c) = \u2211y p*t\u22121(c|y) p(y) (t > 0) \nCP5: p*t(c|y) \u221d p*t(c) \u220fw p*t(y|c,w)^(p(w) \u03b7/(1+\u03b7)) \nCP6: pt(y|c) = p*t(c|y) p(y) / p*t(c) \n\nFigure 2: The cross-partition clustering iterative algorithm (with fixed \u03b2, \u03b7, and |C|). \n\nthus obtaining clusters that cut across the pre-given partition W. To this end, the algorithm interleaves an iterative computation of the stable-point equations, and the additional estimated parameters, for both functionals. Steps CP1, CP2 and CP6 correspond to the computations related to the IB functional, while steps CP3, CP4 and CP5, which compute a separate set of parameters (denoted by an asterisk), correspond to the ID functional. Figure 3 summarizes the roles of the two functionals in the dynamics of the CP algorithm. The two components of the iterative cycle are tied together in steps CP3 and CP6, in which parameters from one set are used as input to compute a parameter of the other set. The derivation of step CP3 relies on an additional assumption, namely that C, Y and W are jointly independent given X. This assumption, which extends to W the underlying assumption of the IB setting that C and Y are independent given X, still entails the IB stable-point equation. 
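For concreteness, the full iterative cycle of Figure 2 can be sketched as follows. This is a minimal NumPy sketch under our own naming and initialization choices (e.g., a uniform rather than random initialization of p*(c), and renormalizing the CP3 centroids over y, which is exact under the assumed C\u2013W independence); it is not the authors' implementation:

```python
import numpy as np

def cp_iterate(p_x, p_y_given_x, p_w_given_x, n_clusters, beta, eta,
               n_iter=200, seed=0):
    """One run of the CP algorithm (Fig. 2) for fixed beta, eta and |C|.

    p_x         : (n_x,) marginal over elements
    p_y_given_x : (n_x, n_y) feature distributions p(Y|x)
    p_w_given_x : (n_x, n_w) pre-partition distributions p(W|x)
    Returns p(c|x) of shape (n_x, n_clusters).
    """
    rng = np.random.default_rng(seed)
    n_x = len(p_x)
    eps = 1e-12
    p_y = p_x @ p_y_given_x          # marginal p(y)
    p_w = p_x @ p_w_given_x          # marginal p(w)
    # CP1 (t = 0): random p(c|x); CP4 (t = 0): uniform p*(c) (arbitrary init)
    p_c_given_x = rng.random((n_x, n_clusters))
    p_c_given_x /= p_c_given_x.sum(axis=1, keepdims=True)
    p_c_star = np.full(n_clusters, 1.0 / n_clusters)
    for _ in range(n_iter):
        # CP2: p(c) = sum_x p(c|x) p(x)
        p_c = p_x @ p_c_given_x
        # CP3: p*(y|c,w) from sum_x p(c|x) p(y|x) p(w|x) p(x)
        p_ycw = np.einsum('xc,xy,xw,x->cwy',
                          p_c_given_x, p_y_given_x, p_w_given_x, p_x)
        p_ycw /= p_ycw.sum(axis=2, keepdims=True) + eps  # normalize over y
        # CP5: p*(c|y) prop. to p*(c) prod_w p*(y|c,w)^(p(w) eta/(1+eta))
        log_s = (np.log(p_c_star + eps)[None, :]
                 + (eta / (1.0 + eta)) * np.einsum('w,cwy->yc', p_w,
                                                   np.log(p_ycw + eps)))
        p_c_given_y = np.exp(log_s - log_s.max(axis=1, keepdims=True))
        p_c_given_y /= p_c_given_y.sum(axis=1, keepdims=True)
        # CP4 (t > 0): p*(c) = sum_y p*(c|y) p(y)
        p_c_star = p_y @ p_c_given_y
        # CP6: p(y|c) = p*(c|y) p(y) / p*(c) -- feeds the IB-side step CP1
        p_y_given_c = (p_c_given_y * p_y[:, None]).T / (p_c_star[:, None] + eps)
        # CP1 (t > 0): p(c|x) prop. to p(c) exp(-beta KL[p(Y|x)||p(Y|c)])
        kl = np.einsum('xy,xcy->xc', p_y_given_x,
                       np.log((p_y_given_x[:, None, :] + eps)
                              / (p_y_given_c[None, :, :] + eps)))
        p_c_given_x = p_c[None, :] * np.exp(-beta * kl)
        p_c_given_x /= p_c_given_x.sum(axis=1, keepdims=True)
    return p_c_given_x
```

Note how the asterisked ID parameters (CP3\u2013CP5) and the IB parameters (CP1, CP2, CP6) form two interleaved loops, tied together exactly where the text says they are: CP3 consumes the IB assignments, and CP6 hands the ID centroids back to the IB update.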
At convergence, the stable-point equations for both the IB and ID functionals are satisfied, each by its own set of parameters (in steps CP1 and CP5). \n\nThe deterministic annealing scheme, which gradually increases \u03b2 over repeated runs (see Sec. 2), is applied for the CP algorithm as well, with \u03b7 held fixed. For a given target number of clusters |C|, the algorithm empirically converges for a wide range of \u03b7 values.3 \n\nI(C;X)\u2193 \u2190 IB \u2192 \u03b2\u2191 I(C;Y)\u2193 \u2190 ID \u2192 \u03b7\u2191 I(C;W|Y) \nI(C;Y;W|X) = 0 \u2190 assumptions \u2192 I(C;W) = 0 \n\nFigure 3: The interplay of the IB and the ID functionals in the CP algorithm. \n\n3 High \u03b7 values tend to dictate centroids with features that are unevenly distributed across W, resulting in shrinkage of some of the clusters. Further analysis will be provided in future work. \n\n4 Experimental Results \n\nOur synthetic setting consisted of 75 virtual elements, evenly pre-partitioned into three 25-element parts denoted X1, X2 and X3 (in our formalism, for each clustered element x, p(w|x) = 1 holds for either w = 1, 2, or 3). On top of this pre-partition, we partitioned the data twice, getting two (exhaustive) clustering configurations: \n1. Target cross-W clustering: five clusters, each with representatives from all Xw's; \n2. Masking within-w clustering: six clusters, each consisting of roughly half the elements of either X1, X2 or X3, with no representatives from the other Xw's. \nEach cluster, of both configurations, was characterized by a designated subset of features. Masking clusters were designed to be more salient than target clusters: they had more designated features (60 vs. 48 per cluster, i.e., 360 vs. 240 in total) and their elements shared higher feature-element (virtual) co-occurrence counts with those designated features (900 vs. 450 per element-feature pair). 
Noise (a random positive integer < 200) was added to all counts associating elements with their designated features (for both within-w and cross-W clusters), as well as to roughly a quarter of the zero counts associating elements with the rest of the features. \n\nThe plain IB method consistently produced configurations strongly correlated with the masking clustering, while the CP algorithm revealed the target configuration. We got (see Table 1A) almost perfect results in configurations of nearly equal-sized cross-W clusters, and somewhat less perfect reconstruction in configurations of diverging sizes (6, 9, 15, 21 and 24). Performance was measured, relative to the optimal target-output cluster match, by the proportion of elements correctly assigned, where the assignment of an element x follows its highest p(c|x). The results indicated were averaged over 200 runs. They were obtained for the optimal \u03b7, which was found to be higher in the diverging-sizes task. \n\nIn the text-based task, the clustered elements \u2013 keywords \u2013 were automatically extracted from five distinct corpora addressing five religions: introductory web pages, online magazines, encyclopedic entries etc., all downloaded from the Internet. The clustered keyword set X was consequently pre-partitioned into disjoint subsets {Xw}w\u2208W, one for each religion4 (|Xw| \u2248 200 for each w). We conducted experiments simultaneously involving religion pairs as well as all five religions. We took the features Y to be a set of words that commonly occur within all five corpora (|Y| \u2248 7000). x\u2013y co-occurrences were recorded within a \u00b15-word sliding window truncated by sentence boundaries. \u03b7 was fixed to a value (1.0) enabling the formation of 20 clusters in all settings. The obtained clusters revealed interesting cross-religion themes (see Sec. 1). 
For instance, the cluster (one of nine) capturing the theme of sacred festivals: the three highest p(c|x) members within each religion were Full-moon, Ceremony, Celebration (Buddhism); Easter, Sunday, Christmas (Christianity); Puja, Ceremony, Festival (Hinduism); Id-al-Fitr, Friday, Ramadan (Islam); and Sukkoth, Shavuot, Rosh-Hodesh (Judaism). The closest cluster produced by the plain IB method was poorer by far, including Islamic Ramadan and Id, and Jewish Passover, Rosh-Hashanah and Sabbath (which our method ranked high too), but no single related term from the other religions. \n\nOur external evaluation standards were cross-religion keyword classes constructed manually by experts of comparative religion studies. One such expert classification involved all five religions, and eight classifications addressed religions in pairs. Each of the eight religion-pair classifications was contributed by two independent experts using the same keywords, so we could also assess the agreement between experts. \n\nTable 1: Average correct-assignment proportion scores for the synthetic task (A) and Jaccard-coefficient scores for the religion keyword classification task (B). \n\nA. Synthetic Data        IB     CP \nequal-size clusters      .305   .985 \nnon-equal clusters       .292   .827 \n\nB. Religion Data (cross-expert agreement on religion pairs: .462\u00b1.232) \n                      IB          Coupled Clustering [7]   CP \nreligion pairs        .200\u00b1.100   .220\u00b1.138                .407\u00b1.144 \nall five (one case)   .104        \u2013                        .167 \n\n4 A keyword x that appeared in the corpora of different religions was considered as a distinct element for each religion, so the Xw were kept disjoint. 
As an overlap measure we employed the Jaccard coefficient: the number of element pairs co-assigned together by both one of the evaluated clusters and one of the expert classes, divided by the number of pairs co-assigned by either our clusters or the expert (or both). We did not assume the number of expert classes is known in advance (as done in the synthetic experiments), so the results were averaged, for each experiment, over all configurations of the 2\u201316 cluster hierarchy. The results shown in Table 1B \u2013 a clear improvement relative to plain IB and the distance-based coupled clustering [7] \u2013 persist, however, when the number of clusters is taken to be equal to the number of classes, or if only the best score in the hierarchy is considered. The level of cross-expert agreement indicates that our results are reasonably close to the scores expected in such a subjective task. \n\n5 Comparison to Related Work \n\nThe information bottleneck framework served as the basis for several approaches that represent additional information in their clustering setting. The multivariate information bottleneck (MIB) adapts the IB framework for networks of multiple variables [3]. However, all variables in such networks are either compressed (like X) or predicted (like Y). The incorporation of an empirical variable to be masked or defocused in the sense of our W is not possible. Including such variables in the MIB framework might be explored in future work. \n\nParticularly relevant to our work is the IB-based method for extracting relevant constructs with side information [4]. This approach addresses settings in which two different types of features are distinguished explicitly: relevant versus irrelevant ones, denoted by Y+ and Y\u2212. 
Both types of features are incorporated within a single functional to be minimized: L(IB-side-info) = I(C; X) \u2212 \u03b2 (I(C; Y+) \u2212 \u03b3 I(C; Y\u2212)), which directly drives clustering to de-correlate C and Y\u2212. \n\nFormally, our setting can be mapped to the side-information setting by regarding the pre-partition W simply as the additional set of irrelevant features, giving symmetric (and opposite) roles to W and Y. However, it seems that this view does not properly address the desired cross-partition setting. In our setting, it is assumed that clustering should be guided in general by Y, while W should only neutralize particular information within Y that would otherwise yield the undesired correlation between C and W (as described in Section 3.1). For that reason, the defocusing functional ties the three variables together by conditioning the de-correlation of C and W on Y, while its underlying assumption ensures the global de-correlation. \n\nIndeed, our method was found empirically superior on the cross-dataset task. The side-information IB method (the iterative algorithm with the best scoring \u03b3) achieves a correct assignment proportion of 0.52 in both synthetic tasks, where our method scored 0.99 and 0.83 (see Table 1A) and, in the religion-pair keyword classification task, its Jaccard coefficient improved by 20% relative to plain IB (compared to our 100% improvement, see Table 1B). \n\n6 Conclusions \n\nThis paper addressed the problem of clustering a pre-partitioned dataset, aiming to detect new internal structures that are not correlated with the pre-given partition but rather cut across its components. The proposed framework extends the cross-dataset clustering algorithm [8], providing better formal grounding and representing any pre-given (soft) partition of the dataset. 
Supported by empirical evidence, we suggest that our framework is better suited for the cross-partition task than applying the side-information framework [4], which was originally developed to address a somewhat different setting. We also demonstrate substantial empirical advantage over the distance-based coupled-clustering algorithm [7]. \n\nAs an applied real-world goal, the algorithm successfully detects cross-religion commonalities. This goal exemplifies the more general notion of detecting analogies across different systems, which is a somewhat vague and non-consensual task and therefore especially challenging for a computational framework. Our approach can be viewed as an initial step towards principled identification of \u201chidden\u201d commonalities between substantially different real-world systems, while suppressing the vast majority of attributes that are irrelevant for the analogy. \n\nFurther research may study the role of defocusing in supervised learning, where some pre-given partitions might mask the role of underlying discriminative features. Additionally, it would be interesting to explore relationships to other disciplines, e.g., network information theory ([9], Ch. 14), which provided motivation for the side-information approach. Finally, both frameworks (ours and side-information) suggest the importance of dealing wisely with information that should not dictate the clustering output directly. \n\nAcknowledgments \n\nWe thank Yuval Krymolowski for helpful discussions and Tiina Mahlam\u00e4ki, Eitan Reich and William Shepard for contributing the religion keyword classifications. \n\nReferences \n\n[1] Hofmann, T. (2001) Unsupervised learning by probabilistic latent semantic analysis. Machine Learning, 42(1):177-196. \n[2] Wagstaff, K., Cardie, C., Rogers, S. & Schroedl, S. (2001) Constrained K-means clustering with background knowledge. 
The 18th International Conference on Machine Learning (ICML-2001), pp. 577-584. \n[3] Friedman, N., Mosenzon, O., Slonim, N. & Tishby, N. (2001) Multivariate information bottleneck. The 17th Conference on Uncertainty in Artificial Intelligence (UAI-17), pp. 152-161. \n[4] Chechik, G. & Tishby, N. (2002) Extracting relevant structures with side information. Advances in Neural Information Processing Systems 15 (NIPS'02). \n[5] Globerson, A., Chechik, G. & Tishby, N. (2003) Sufficient dimensionality reduction. Journal of Machine Learning Research, 3:1307-1331. \n[6] Tishby, N., Pereira, F. C. & Bialek, W. (1999) The information bottleneck method. The 37th Annual Allerton Conference on Communication, Control, and Computing, pp. 368-379. \n[7] Marx, Z., Dagan, I., Buhmann, J. M. & Shamir, E. (2002) Coupled clustering: A method for detecting structural correspondence. Journal of Machine Learning Research, 3:747-780. \n[8] Dagan, I., Marx, Z. & Shamir, E. (2002) Cross-dataset clustering: Revealing corresponding themes across multiple corpora. Proceedings of the 6th Conference on Natural Language Learning (CoNLL-2002), pp. 15-21. \n[9] Cover, T. M. & Thomas, J. A. (1991) Elements of Information Theory. John Wiley & Sons, Inc., New York, New York.", "award": [], "sourceid": 2406, "authors": [{"given_name": "Zvika", "family_name": "Marx", "institution": null}, {"given_name": "Ido", "family_name": "Dagan", "institution": null}, {"given_name": "Eli", "family_name": "Shamir", "institution": null}]}